Publisher: Routledge (Informa Ltd).
Journal of Quantitative Linguistics


Publication details, including instructions for authors and subscription information:
http://www.informaworld.com/smpp/title~content=t716100702

Mandelbrot's Model for Zipf's Law: Can Mandelbrot's Model Explain Zipf's
Law for Language?
D. Yu. Manin

To cite this article: Manin, D. Yu. (2009). 'Mandelbrot's Model for Zipf's Law: Can Mandelbrot's Model Explain Zipf's Law for Language?', Journal of Quantitative Linguistics, 16(3), 274–285.
To link to this Article: DOI: 10.1080/09296170902850358
URL: http://dx.doi.org/10.1080/09296170902850358

Journal of Quantitative Linguistics
2009, Volume 16, Number 3, pp. 274–285
DOI: 10.1080/09296170902850358

Mandelbrot’s Model for Zipf’s Law


Can Mandelbrot’s Model Explain Zipf’s Law for
Language?*
D. Yu. Manin
Palo Alto, USA

ABSTRACT

Zipf’s law states that if words of a language are sorted in the order of decreasing
frequency of usage, a word’s frequency is inversely proportional to its rank, or sequence
number in the list. The Zipf-Mandelbrot law is a more general formula that provides a
better fit in the low-rank region. Among several models aimed at explaining this effect,
Mandelbrot’s model is one of the best known. It derives Zipf’s law as a result of the
optimization of information/cost ratio, but leads to an unrealistic view of texts as random
character sequences. In this article, a new modification of the model is proposed that is
free from this drawback and allows the optimal information/cost ratio to be achieved via
language evolution. It is demonstrated that the Zipf-Mandelbrot formula follows from
this model, but its two parameters are not independent. As a result, the formula cannot
convincingly be fitted to the actual word frequency distributions.

INTRODUCTION

Zipf's law (Zipf, 1949) may be one of the most enigmatic and controversial regularities known in linguistics. In its most straightforward form, it states that if the words of a language are ranked in order of decreasing frequency in texts, the frequency is inversely proportional to the rank (sequence number in the list),

$$f_k \propto k^{-B} \qquad (1)$$

*Address correspondence to: D. Yu. Manin, 3127 Bryant Street, Palo Alto, CA 94306.
Tel: 650-575-1506. E-mail: manin@pobox.com

0929-6174/09/16030274 © 2009 Taylor & Francis



where $f_k$ is the frequency of the word with rank k. As a typical example, consider the log-log plot of frequency vs. rank in Figure 1, calculated from a frequency dictionary of the Russian language compiled by Sharoff (2002, n.d.).
The exponent in Formula (1) is close to 1 for large balanced corpora
and single-author collections, but may be different for various special
cases such as the speech of young children or schizophrenics, military
communications, language subsets consisting of nouns only, and so on
(Ferrer i Cancho, 2005). There are several models aimed at explaining
Zipf’s law, the two best-known being those of Simon (1955) and
Mandelbrot (1953) (a review of these and other models can be found in
Manin, 2008, where a new model is also proposed).
Low-rank and high-rank regions are characterized by deviations from the power law, with a flattening in the former and a steepening in the latter.
The steeper decline at high ranks may simply be due to under-sampling of rare words, causing underestimation of the "true" ranks of the words encountered in the corpus once or twice. To account for the low-rank flattening, Mandelbrot (1966) proposed a modified formula, known as the Zipf-Mandelbrot law:
$$f_k \propto (k_0 + k)^{-B} \qquad (2)$$

Fig. 1. Zipf’s law for the Russian language.



where $k_0$ is a parameter (not necessarily an integer). At $k \gg k_0$, the Zipf-Mandelbrot Formula (2) is asymptotically equivalent to the power law (1), while at small k, it exhibits the required flattening.
The two parameters, B and k0, in the Zipf-Mandelbrot law are usually
considered to be independent, which allows the curve to be fitted to the
actual word frequency distributions. The purpose of this article is to
demonstrate that if the formula is to be derived from Mandelbrot’s
model, the parameters turn out to be functionally related. As a result, the formula no longer provides a good fit to the data. In what follows, we first discuss Mandelbrot's original model for Zipf's law; then formulate a new variant of the model; and demonstrate that the law (2) follows from it with only one free parameter. The derivation is supported by numerical simulation.

THE TWO FACES OF MANDELBROT’S MODEL

The simplest possible model exhibiting a Zipfian distribution is due to Mandelbrot (1966) and is widely known as the "random typing" or "intermittent silence" model. It is simply a generator of random character sequences where each symbol of an arbitrary alphabet has the same constant probability and one of the symbols is arbitrarily designated as a word-delimiting "space". The reason why "words" in
such a sequence have a power-law frequency distribution is very simple,
as noted by Li (1992). Indeed, the number of possible words of a given
length is exponential in length (since all characters are equiprobable), and
the probability of any given word is also exponential in its length. Hence,
the dependency of each word’s frequency on its frequency rank is
asymptotically given by a power law. In fact, the characters need not
even be equiprobable for this result to hold (Li, 1992). Moreover, a
theorem of Shannon (1948) (Section 7, Theorem 3) suggests that even the
condition of independence between characters can be relaxed and
replaced with ergodicity of the source.
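The random-typing model is easy to simulate. The sketch below (in Python; the four-letter alphabet, text length, and seed are arbitrary choices, not taken from the article) generates a random character sequence and estimates the slope of the resulting rank-frequency curve on a log-log scale:

```python
import collections
import math
import random

# Random typing ("intermittent silence") model: equiprobable letters plus a
# word-delimiting space; word frequencies should fall off roughly as a power law.
random.seed(1)
alphabet = "abcd"                      # arbitrary 4-letter alphabet
chars = random.choices(alphabet + " ", k=200_000)
words = [w for w in "".join(chars).split(" ") if w]

counts = collections.Counter(words)
freqs = sorted(counts.values(), reverse=True)

# Crude estimate of the log-log slope between rank 1 and rank 100;
# it should come out negative and of order -1.
slope = (math.log(freqs[99]) - math.log(freqs[0])) / math.log(100)
print(f"distinct words: {len(counts)}, slope estimate: {slope:.2f}")
```

Note that the resulting curve is step-wise and only asymptotically a power law: all single-letter "words" have nearly the same frequency, then all two-letter ones, and so on.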
Based on this observation, it is commonly held that Zipf's law is "linguistically shallow" (Mandelbrot, 1982) and does not reveal anything interesting about natural language. However, it is easy to show that this conclusion is, at least, premature. The random typing model itself is undoubtedly "shallow", but it cannot be related to natural language for the very simple reason that the number of distinct words of the same length in a real language is far from exponential in length. In fact, it is not even monotonic, as can be seen in Figure 2, where this distribution is calculated from a frequency dictionary of the Russian language (Sharoff, n.d.) and from Leo Tolstoy's novel War and Peace. (It also does not matter that the frequency dictionary counts multiple word forms as one word, while for War and Peace we counted them as distinct words.)
This shows that the random typing model is not directly applicable to
natural text.
However, Mandelbrot's model admits an entirely different formulation, as proposed by Mandelbrot (1953). It is based on the idea that the language is optimal in the sense that it minimizes the average ratio of production cost to information content. Mandelbrot proposed that the cost of "producing" a word is proportional to its length in characters; defined the information content of a word to be its Shannon entropy (i.e. the negative logarithm of frequency); and demonstrated that Zipf's law follows from these assumptions.
It is well known that the maximum entropy per letter is achieved by
random sequences of letters, simply because entropy is a measure of
unpredictability, and random sequences are the most unpredictable. Thus,
under the assumptions of Mandelbrot’s model, the optimal language is one
where each sequence of n letters is as frequent as any other. Hence, the optimality model leads to the same result as the random typing model.

Fig. 2. Distribution of words by length.

As Mandelbrot wrote in 1966, "these variants are fully equivalent mathematically, but they appeal to such different intuitions that the strongest critics of one may be the strongest partisans of another".
Indeed, there is a significant conceptual difference between the two
approaches in that the optimization principle allows one, in principle, to
demonstrate how the optimal state can be achieved as a result of
language evolution. This advantage is also a liability, because it is
necessary to demonstrate that the global optimum can actually be
achieved via some local dynamics which is causal and not teleological.
Thus, the famous principle of least action in mechanics is equivalent to
the local force-driven Newtonian dynamics. In the same way, a soap film on a wire frame achieves the global minimum of its surface area via local dynamics of infinitesimal surface elements shifting and stretching under each other's tug. Just as surface elements do not "know" anything about the total area of the film, individual words do not "know" anything about the average information/cost ratio.
Interestingly, in the case of Mandelbrot's optimizing model, such a local dynamics can be proposed. Namely, suppose that, if speakers notice that a word's individual information/cost ratio is below average (the word has faded), they start using it less, and conversely, if the ratio is favorable, the word's frequency increases.¹ We will demonstrate below that this local dynamics indeed results in a stable power-law distribution of word frequencies.

¹ As an example, consider the process known in linguistics by which so-called expressive synonyms change into regular words. A well-known example is Russian glaz "eye", which initially meant "pebble", then became expressive for "eye", and gradually displaced the original word for "eye", oko, of Indo-European descent. Another example is provided by French tête "head" < testa "crock, pot", which started as an expressive synonym for "head" and eventually supplanted the original word chef in this sense.

In the next section we review the mathematics of Mandelbrot's optimality model and show that it can be reformulated so that it is no longer equivalent to the unrealistic random typing model.

MANDELBROT'S MODEL REVISITED

First of all, let us briefly reproduce the mathematical derivation of Zipf's law from the optimality principle. Let k be the frequency rank of the word $w_k$, let its frequency (normalized so that the sum of all frequencies is unity) be $p_k$, and let the cost of producing word $w_k$ be $C_k$. It makes sense to leave the function $C_k$ unspecified for as long as possible. The word's information content, or entropy, is related to its frequency $p_k$ as $H_k = -\log_2 p_k$. The average cost per word is given by
$$C = \sum_k p_k C_k \qquad (3)$$

and the average entropy per word by

$$H = -\sum_k p_k \log_2 p_k. \qquad (4)$$
One can now ask what frequency distribution $\{p_k\}$ satisfying $\sum_k p_k = 1$ will minimize the cost ratio $C^* = C/H$.
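For concreteness, the averages (3) and (4) and the cost ratio $C^*$ can be computed directly for any candidate distribution; a minimal sketch with made-up toy frequencies and costs (not taken from the article):

```python
import math

def averages(p, cost):
    """Average cost C per Equation (3) and entropy H per Equation (4)."""
    C = sum(pk * ck for pk, ck in zip(p, cost))
    H = -sum(pk * math.log2(pk) for pk in p if pk > 0)
    return C, H

# Toy example: four equally frequent words, each costing 2 units to produce.
p = [0.25, 0.25, 0.25, 0.25]
cost = [2.0, 2.0, 2.0, 2.0]
C, H = averages(p, cost)
print(C, H, C / H)   # cost ratio C* = C/H; here C = 2.0, H = 2.0, C* = 1.0
```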
We can use the standard method of Lagrange multipliers to find the minimum of $C^*$, given the normalization constraint on $p_k$:
$$\frac{\partial}{\partial p_k}\left(C^* + \lambda \sum_j p_j\right) = 0. \qquad (5)$$

Here the value of the Lagrange multiplier λ is to be determined later so as to normalize the frequencies. Performing the differentiation in (5) we obtain

$$\frac{C_k}{H} + \frac{C}{H^2}\left(\log_2 p_k + \frac{1}{\ln 2}\right) + \lambda = 0, \quad \forall k. \qquad (6)$$

This expresses the frequencies $p_k$ given the costs $C_k$:

$$p_k = \lambda' \, 2^{-H C_k / C}, \qquad (7)$$

where we denoted

$$\lambda' = 2^{-\lambda H^2 / C - 1/\ln 2}. \qquad (8)$$

Thus, $\lambda'$ is an arbitrary constant that we can use directly to normalize the frequencies. Now, once the cost $C_k$ of each word is known or assumed, Equation (7) yields the frequency distribution for the words. Note, however, that to obtain a closed-form solution, it is also necessary to consistently determine the constants C and H on the RHS of (7) from their respective definitions (3) and (4).
Now, it is easy to see from Equation (7) that a power law for frequencies could only result from the Ansatz

$$C_k = C_0 \log_2 k, \qquad (9)$$

which leads to

$$p_k = \lambda' k^{-B}, \quad B = H \frac{C_0}{C} \qquad (10)$$

(note that $C \propto C_0$, so $C_0/C$ does not depend on $C_0$). How could one justify Equation (9)? In Mandelbrot's original formulation, as we already mentioned, the cost of a word was assumed to be proportional to its length; thus the only way to get the logarithmic dependence on the rank is to assume that the number of distinct words grows exponentially with length. It is not necessary in this formulation to postulate that every combination of letters of a given length is equally probable, but even this weaker requirement is not realistic for natural languages, as demonstrated by Figure 2.
There is, however, a much more plausible argument in favour of the desired Ansatz (9), which does not depend on any assumptions about word length at all. Suppose words are stored in some kind of addressable memory. For simplicity, one can imagine a linear array of memory cells, each containing one word. Then, the cost of retrieving the word in the kth cell can be assumed to be proportional to the length of its address, that is, to the minimum number of bits (or neuron firings, say) needed to specify the address. And this is precisely $\log_2 k$. Of course, this result does not depend on the memory being "linear" in any real sense.
It is important to note that this is not just a different justification,
because with it the optimality model is no longer equivalent to the
random typing model. Let us now proceed to solving (10). From the normalization condition for frequencies, we get

$$p_k = \frac{1}{\zeta(B)}\, k^{-B}, \qquad (11)$$

where $\zeta$ is the Riemann zeta function, $\zeta(s) = \sum_{n=1}^{\infty} n^{-s}$. But this is not the end of the story, since B is related to H and C via Equation (10), and they in turn depend on B via $p_k$. This amounts to an equation for the power-law exponent B, which thus is not arbitrary. By substituting (11) back into (3) and (4), we get

$$C = \frac{C_0}{\zeta(B)} \sum_{k=1}^{\infty} k^{-B} \log_2 k \qquad (12)$$

$$H = \frac{B}{\zeta(B)} \sum_{k=1}^{\infty} k^{-B} \log_2\!\left(k\, \zeta(B)^{1/B}\right). \qquad (13)$$

It is now easy to see that $B = HC_0/C$ can only be satisfied when $\zeta(B) = 1$, which implies $B \to \infty$. This is not a very encouraging result, since it means that the minimum cost per unit information is achieved when there is only one word in use, and both cost and information vanish.
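The degeneracy is easy to verify numerically: ζ(B) stays strictly above 1 for every finite B > 1 and approaches 1 only as B → ∞. A sketch using a truncated series (the truncation depth and tail estimate are implementation choices):

```python
def riemann_zeta(s, terms=200_000):
    """Truncated Riemann zeta sum with an integral estimate of the tail;
    adequate for s comfortably above 1."""
    total = sum(n ** -s for n in range(1, terms))
    return total + terms ** (1 - s) / (s - 1)

# zeta(B) > 1 for every finite B > 1, so the self-consistency condition
# zeta(B) = 1 is only approached in the limit B -> infinity.
for B in (1.5, 2.0, 5.0, 20.0):
    print(B, riemann_zeta(B))
```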
This conclusion is borne out by a simple numerical simulation. Recall our earlier observation that cost-ratio optimization can be achieved via local dynamics: if speakers notice that a word's individual information/cost ratio is below average, they start using it less, and conversely, if the ratio is favourable, the word's frequency increases. It is hard to tell a priori whether this process would converge to a stationary distribution, so a numerical simulation was performed. The following algorithm implements this dynamics:

Cost Ratio Optimization Algorithm

(1) Initialize an array of N frequencies $p_k$ with random numbers and normalize them.
(2) Calculate the average cost and information per word according to (3), (4).
(3) For each k = 1, …, N, calculate the cost ratio for the kth word as $C^*_k = C_k/H_k = -\log_2 k / \log_2 p_k$. If it is within the interval $[(1-\gamma)C^*, (1+\gamma)C^*]$, where γ is a parameter, leave $p_k$ unchanged. Otherwise increase $p_k$ by a constant factor if the cost ratio is above the average, or decrease it by the same factor if it is below the average.
(4) If no frequencies were changed, stop.
(5) Reorder the words (i.e. reassign ranks in decreasing order of frequency), renormalize the frequencies and repeat from step (2).
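The steps above can be transcribed into a short Python sketch; the array size, update factor, tolerance γ, and iteration cap are arbitrary choices, and the cost is taken as $C_k = \log_2(k + k_0)$ so that $k_0 = 0$ corresponds to the Ansatz (9):

```python
import math
import random

def optimize(N=100, k0=0.0, gamma=0.05, factor=1.05, max_iter=400):
    """Local cost-ratio dynamics: words whose cost ratio is above the average
    gain frequency, words below it lose frequency; ranks and normalization
    are refreshed on every sweep."""
    random.seed(0)
    p = [random.random() for _ in range(N)]
    total = sum(p)
    p = [x / total for x in p]
    for _ in range(max_iter):
        p.sort(reverse=True)                        # rank 1 = most frequent
        cost = [math.log2(k + 1 + k0) for k in range(N)]
        C = sum(pk * ck for pk, ck in zip(p, cost))
        H = -sum(pk * math.log2(pk) for pk in p)
        target = C / H
        changed = False
        for k in range(N):
            ratio = cost[k] / (-math.log2(p[k]))    # C*_k = C_k / H_k
            if ratio > (1 + gamma) * target:
                p[k] *= factor
                changed = True
            elif ratio < (1 - gamma) * target:
                p[k] /= factor
                changed = True
        total = sum(p)
        p = [x / total for x in p]
        if not changed:
            break
    return sorted(p, reverse=True)

p = optimize()
print(f"largest frequency after relaxation: {p[0]:.3f}")
```

With $k_0 = 0$ this reproduces the degenerate behaviour described in the text; the $k_0 > 0$ case is taken up in the next section.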
This procedure quickly leads to the state where all frequencies but one are zero.

DERIVATION OF ZIPF-MANDELBROT FORMULA

We have seen that the Ansatz (9) does not eventually lead to the desired
result. It is probably this problem that prompted Mandelbrot to propose
a modification to Zipf’s law. In his own words,

. . . it seems worth pointing out that it has not been obtained by "mere curve fitting": in attempting to explain the first approximation law, $i(r,k) = (1/10)k\,r^{-1}$, I invariably obtained the more general second approximation, and only later did I realize that this more general formula was necessary and basically sufficient to fit the empirical data. (Mandelbrot, 1966, p. 356)

It turns out that the degeneracy problem can be avoided by the following modification of the cost-function Ansatz:

$$C_k = C_0 \log_2(k + k_0). \qquad (14)$$

It looks rather natural if we again imagine the linear memory, but this time with the first $k_0$ cells not occupied by useful words. Substitution of (14) into (7) yields the Zipf-Mandelbrot law

$$p_k = \frac{1}{\zeta(B, 1 + k_0)}\, (k + k_0)^{-B}, \qquad (15)$$

where $\zeta$ is now the Hurwitz zeta function, $\zeta(s, q) = \sum_{n=0}^{\infty} (n + q)^{-s}$.
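Formula (15) can be evaluated with a truncated Hurwitz zeta sum; a sketch in Python (the truncation depth and the sample values B = 1.5, k₀ = 10 are arbitrary choices):

```python
def hurwitz_zeta(s, q, terms=100_000):
    """Truncated Hurwitz zeta sum with an integral estimate of the tail."""
    total = sum((n + q) ** -s for n in range(terms))
    return total + (terms + q) ** (1 - s) / (s - 1)

# Zipf-Mandelbrot frequencies (15) for sample values B = 1.5, k0 = 10:
# flattened at low ranks, power law with exponent -B at high ranks.
B, k0 = 1.5, 10
norm = hurwitz_zeta(B, 1 + k0)        # normalization, computed once
freqs = [(k + k0) ** -B / norm for k in range(1, 1001)]
print(freqs[0], freqs[999])
```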
The Zipf-Mandelbrot formula has the potential of correctly approximating not only the power law, but also the initial, low-rank range of the real frequency distributions, which flatten out at k < 10 or so. But remember again that the second part of (10), $B = HC_0/C$, needs to be satisfied, which means that the parameters $k_0$ and B are not independent. This is rarely, if ever, mentioned in the literature, although it is a rather important constraint. Substituting (15) into (3) and (4) and noting that

$$\frac{\partial}{\partial s}\zeta(s, q) = -\sum_{n=0}^{\infty} (n + q)^{-s} \ln(n + q), \qquad (16)$$
we obtain

$$C = -\frac{C_0}{\ln 2}\,\frac{\zeta'(B, 1+k_0)}{\zeta(B, 1+k_0)} \qquad (17)$$

$$H = \log_2 \zeta(B, 1+k_0) - \frac{B}{\ln 2}\,\frac{\zeta'(B, 1+k_0)}{\zeta(B, 1+k_0)} \qquad (18)$$

$$B = HC_0/C \qquad (19)$$

where $\zeta'$ is the derivative with respect to the first argument. After simple transformations this reduces to

$$B = B - \frac{\ln \zeta(B, 1+k_0)}{\left(\ln \zeta(B, 1+k_0)\right)'}, \qquad (20)$$

that is,

$$\zeta(B, 1 + k_0) = 1. \qquad (21)$$

When $k_0 \to 0$, $B \to \infty$, as previously. In the opposite limit, $k_0 \to \infty$, the Zipfian exponent B tends to 1, but extremely slowly. To see this, let $k_0$ be a large integer. Then,

$$\zeta(B, 1 + k_0) = \zeta(B) - \sum_{n=1}^{k_0} n^{-B}. \qquad (22)$$

In order to compensate for the infinite growth of the second term as $k_0 \to \infty$, B must tend to 1, where Riemann's zeta function has a pole. Let $B = 1 + \epsilon$, $\epsilon \ll 1$; then

$$\zeta(B) = O(1/\epsilon) \qquad (23)$$

$$\sum_{n=1}^{k_0} n^{-B} = O\!\left(\frac{1}{\epsilon}\, k_0^{\epsilon}\right) \qquad (24)$$

whence

$$k_0^{\epsilon} = O(1), \quad \text{or} \quad B = 1 + O(1/\ln k_0).$$



The relationship between B and $k_0$ can be calculated numerically, but this would not tell us whether the resulting solution is stable with respect to the local dynamics described above. Running the local dynamics model shows that, in contrast to the case $k_0 = 0$, the model does converge to a stable solution described by (15), as shown in Figure 3.
However, as is readily seen from the figure, no value of $k_0$ yields a satisfactory approximation to the actual distribution. For small $k_0$, the slope is still significantly steeper than $-1$, but for larger $k_0$, the flattened portion spreads too far. Thus, with $k_0 = 10$, the slope is still about $-1.4$, but the power law starts at about k = 100, while in the actual distribution it begins after k = 10.
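The constraint (21) is a single equation for B at each $k_0$, and since $\zeta(B, 1 + k_0)$ decreases in B, it can be solved by straightforward bisection. A sketch (the truncated zeta sum, bracketing interval, and sample $k_0$ values are arbitrary choices); consistent with the text, it should give B near 1.4 for $k_0 = 10$, decreasing only slowly as $k_0$ grows:

```python
def hurwitz_zeta(s, q, terms=50_000):
    """Truncated Hurwitz zeta sum with an integral estimate of the tail."""
    total = sum((n + q) ** -s for n in range(terms))
    return total + (terms + q) ** (1 - s) / (s - 1)

def solve_B(k0, lo=1.0001, hi=20.0, steps=50):
    """Bisect the constraint zeta(B, 1 + k0) = 1 for the exponent B;
    zeta decreases in its first argument, so the root in (lo, hi) is unique."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if hurwitz_zeta(mid, 1 + k0) > 1.0:
            lo = mid                  # zeta still above 1: true B is larger
        else:
            hi = mid
    return (lo + hi) / 2

for k0 in (1, 10, 100):
    print(k0, round(solve_B(k0), 3))
```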
To sum up, the Zipf-Mandelbrot law can be obtained from a model optimizing the information/cost ratio with no assumptions about word
lengths. This model is not equivalent to the random typing model, and
allows the optimum to be achieved via local dynamics, i.e. in a causal,
rather than teleological manner. However, the two parameters of the
resulting distribution, B and k0, are not independent, and as a result, it
does not provide a reasonable fit to the empirical data.

Fig. 3. Zipf-Mandelbrot law with different values of $k_0$. Real frequency distribution (not to scale) and Zipf's law are shown for comparison.

REFERENCES

Ferrer i Cancho, R. (2005). The variation of Zipf’s law in human language. European
Physical Journal B, 44, 249–257.
Li, W. (1992). Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE
Transactions on Information Theory, 38(6), 1842–1845.
Mandelbrot, B. (1953). An informational theory of the statistical structure of languages.
In W. Jackson (Ed.), Communication Theory (pp. 486–502). Woburn, MA:
Butterworth.
Mandelbrot, B. (1966). Information theory and psycholinguistics: A theory of word frequencies. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in Mathematical Social Sciences (pp. 350–368). Cambridge, MA: MIT Press.
Mandelbrot, B. (1982). The Fractal Geometry of Nature. New York: Freeman.
Manin, D. Yu. (2008). Zipf's law and avoidance of excessive synonymy. Cognitive Science, 32(7), 1075–1098.


Shannon, C. E. (1948). A mathematical theory of communication. The Bell System
Technical Journal, 27(3), 379–423.
Sharoff, S. (2002). Meaning as use: Exploitation of aligned corpora for the contrastive
study of lexical semantics. In Proceedings of Language Resources and Evaluation
Conference (LREC02). Las Palmas, Spain, May. Retrieved April 1, 2008, from
http://www.artint.ru/projects/frqlist/lrec-02.pdf
Sharoff, S. (n.d.). The Frequency Dictionary of Russian. Retrieved April 1, 2008, from
http://artint.ru/projects/frqlist/frqlist-en.asp
Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42, 425–440.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, MA:
Addison-Wesley.
