Publisher: Routledge (Informa Ltd).
Journal of Quantitative Linguistics


Publication details, including instructions for authors and subscription information:
http://www.informaworld.com/smpp/title~content=t716100702

Mandelbrot's Model for Zipf's Law: Can Mandelbrot's Model Explain Zipf's
Law for Language?
D. Yu. Manin

To cite this article: Manin, D. Yu. (2009). 'Mandelbrot's Model for Zipf's Law: Can Mandelbrot's Model Explain Zipf's Law for Language?', Journal of Quantitative Linguistics, 16(3), 274–285.
To link to this Article: DOI: 10.1080/09296170902850358
URL: http://dx.doi.org/10.1080/09296170902850358

Journal of Quantitative Linguistics
2009, Volume 16, Number 3, pp. 274–285
DOI: 10.1080/09296170902850358

Mandelbrot’s Model for Zipf’s Law


Can Mandelbrot’s Model Explain Zipf’s Law for
Language?*
D. Yu. Manin
Palo Alto, USA

ABSTRACT

Zipf’s law states that if words of a language are sorted in the order of decreasing
frequency of usage, a word’s frequency is inversely proportional to its rank, or sequence
number in the list. The Zipf-Mandelbrot law is a more general formula that provides a
better fit in the low-rank region. Among several models aimed at explaining this effect,
Mandelbrot’s model is one of the best known. It derives Zipf’s law as a result of the
optimization of information/cost ratio, but leads to an unrealistic view of texts as random
character sequences. In this article, a new modification of the model is proposed that is
free from this drawback and allows the optimal information/cost ratio to be achieved via
language evolution. It is demonstrated that the Zipf-Mandelbrot formula follows from
this model, but its two parameters are not independent. As a result, the formula cannot
convincingly be fitted to the actual word frequency distributions.

INTRODUCTION

Zipf's law (Zipf, 1949) may be one of the most enigmatic and controversial regularities known in linguistics. In its most straightforward form, it states that if the words of a language are ranked in order of decreasing frequency in texts, the frequency is inversely proportional to the rank (sequence number in the list),

$$f_k \propto k^{-B} \qquad (1)$$

*Address correspondence to: D. Yu. Manin, 3127 Bryant Street, Palo Alto, CA 94306.
Tel: 650-575-1506. E-mail: manin@pobox.com

0929-6174/09/16030274 © 2009 Taylor & Francis



where $f_k$ is the frequency of the word with rank k. As a typical example, consider the log-log plot of frequency vs. rank in Figure 1, calculated from a frequency dictionary of the Russian language compiled by Sharoff (2002, n.d.).
The exponent in Formula (1) is close to 1 for large balanced corpora
and single-author collections, but may be different for various special
cases such as the speech of young children or schizophrenics, military
communications, language subsets consisting of nouns only, and so on
(Ferrer i Cancho, 2005). There are several models aimed at explaining
Zipf’s law, the two best-known being those of Simon (1955) and
Mandelbrot (1953) (a review of these and other models can be found in
Manin, 2008, where a new model is also proposed).
Low-rank and high-rank regions are characterized by deviations from the power law, with a flattening in the former and a steepening in the latter.
The steeper decline at high ranks may simply be due to under-sampling of rare words, causing underestimation of the "true" ranks of the words encountered in the corpus once or twice. To account for the low-rank flattening, Mandelbrot (1966) proposed a modified formula, known as the Zipf-Mandelbrot law:
$$f_k \propto (k_0 + k)^{-B} \qquad (2)$$

Fig. 1. Zipf’s law for the Russian language.



where $k_0$ is a parameter (not necessarily an integer). At $k \gg k_0$, the Zipf-Mandelbrot Formula (2) is asymptotically equivalent to the power law (1), while at small k, it exhibits the required flattening.
The two parameters, B and k0, in the Zipf-Mandelbrot law are usually
considered to be independent, which allows the curve to be fitted to the
actual word frequency distributions. The purpose of this article is to
demonstrate that if the formula is to be derived from Mandelbrot’s
model, the parameters turn out to be functionally related. As a result, the formula no longer provides a good fit to the data. In what follows, we first discuss Mandelbrot's original model for Zipf's law; then formulate a new variant of the model; and demonstrate that the law (2) follows from it with only one free parameter. The derivation is supported by numerical simulation.

THE TWO FACES OF MANDELBROT’S MODEL

The simplest possible model exhibiting a Zipfian distribution is due to Mandelbrot (1966) and is widely known as the "random typing" or "intermittent silence" model. It is simply a generator of random character sequences where each symbol of an arbitrary alphabet has the same constant probability and one of the symbols is arbitrarily designated as a word-delimiting "space". The reason why "words" in
such a sequence have a power-law frequency distribution is very simple,
as noted by Li (1992). Indeed, the number of possible words of a given
length is exponential in length (since all characters are equiprobable), and
the probability of any given word is also exponential in its length. Hence,
the dependency of each word’s frequency on its frequency rank is
asymptotically given by a power law. In fact, the characters need not
even be equiprobable for this result to hold (Li, 1992). Moreover, a
theorem of Shannon (1948) (Section 7, Theorem 3) suggests that even the
condition of independence between characters can be relaxed and
replaced with ergodicity of the source.
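The random-typing model is easy to simulate. The sketch below (in Python; the four-letter alphabet, text length, and seed are arbitrary choices, not taken from the article) generates a random character sequence and estimates the slope of the resulting rank-frequency curve on a log-log scale:

```python
import collections
import math
import random

# Random typing ("intermittent silence") model: equiprobable letters plus a
# word-delimiting space; word frequencies should fall off roughly as a power law.
random.seed(1)
alphabet = "abcd"                      # arbitrary 4-letter alphabet
chars = random.choices(alphabet + " ", k=200_000)
words = [w for w in "".join(chars).split(" ") if w]

counts = collections.Counter(words)
freqs = sorted(counts.values(), reverse=True)

# Crude estimate of the log-log slope between rank 1 and rank 100;
# it should come out negative and of order -1.
slope = (math.log(freqs[99]) - math.log(freqs[0])) / math.log(100)
print(f"distinct words: {len(counts)}, slope estimate: {slope:.2f}")
```

Note that the resulting curve is step-wise and only asymptotically a power law: all single-letter "words" have nearly the same frequency, then all two-letter ones, and so on.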
Based on this observation, it is commonly held that Zipf's law is "linguistically shallow" (Mandelbrot, 1982) and does not reveal anything interesting about natural language. However, it is easy to show that this conclusion is, at least, premature. The random typing model itself is undoubtedly "shallow", but it cannot be related to natural language for the very simple reason that the number of distinct words of the same length in a real language is far from exponential in length. In fact, it is not even monotonic, as can be seen in Figure 2, where this distribution is calculated from a frequency dictionary of the Russian language (Sharoff, n.d.) and from Leo Tolstoy's novel War and Peace. (It also does not matter that the frequency dictionary counts multiple word forms as one word, while for War and Peace we counted them as distinct words.)
This shows that the random typing model is not directly applicable to
natural text.
However, Mandelbrot's model admits an entirely different formulation, as proposed by Mandelbrot (1953). It is based on the idea that the language is optimal in the sense that it minimizes the average ratio of production cost to information content. Mandelbrot proposed that the cost of "producing" a word is proportional to its length in characters; defined the information content of a word to be its Shannon entropy (i.e. the negative logarithm of frequency); and demonstrated that Zipf's law follows from these assumptions.
It is well known that the maximum entropy per letter is achieved by
random sequences of letters, simply because entropy is a measure of
unpredictability, and random sequences are the most unpredictable. Thus,
under the assumptions of Mandelbrot’s model, the optimal language is one
where each sequence of n letters is as frequent as any other. Hence, the optimality model leads to the same result as the random typing model.

Fig. 2. Distribution of words by length.

As Mandelbrot wrote in 1966, "these variants are fully equivalent mathematically, but they appeal to such different intuitions that the strongest critics of one may be the strongest partisans of another".
Indeed, there is a significant conceptual difference between the two
approaches in that the optimization principle allows one, in principle, to
demonstrate how the optimal state can be achieved as a result of
language evolution. This advantage is also a liability, because it is
necessary to demonstrate that the global optimum can actually be
achieved via some local dynamics which is causal and not teleological.
Thus, the famous principle of least action in mechanics is equivalent to
the local force-driven Newtonian dynamics. In the same way, a soap film on a wire frame achieves the global minimum of its surface area via local dynamics of infinitesimal surface elements shifting and stretching under each other's tug. Just as surface elements do not "know" anything about the total area of the film, individual words do not "know" anything about the average information/cost ratio.
Interestingly, in the case of Mandelbrot's optimizing model, such a local dynamics can be proposed. Namely, suppose that, if speakers notice that a word's individual information/cost ratio is below average (the word has faded), they start using it less, and conversely, if the ratio is favorable, the word's frequency increases.¹ We will demonstrate below that this local dynamics indeed results in a stable power-law distribution of word frequencies.

¹ As an example, consider the process known in linguistics by which so-called expressive synonyms change into regular words. A well-known example is Russian glaz "eye", which initially meant "pebble", then became expressive for "eye", and gradually displaced the original word for "eye", oko, of Indo-European descent. Another example is provided by French tête "head" < testa "crock, pot", which started as an expressive synonym for "head" and eventually supplanted the original word chef in this sense.

In the next section we review the mathematics of Mandelbrot's optimality model and show that it can be reformulated so that it is no longer equivalent to the unrealistic random typing model.

MANDELBROT'S MODEL REVISITED

First of all, let us briefly reproduce the mathematical derivation of Zipf's law from the optimality principle. Let k be the frequency rank of the word $w_k$, let its frequency (normalized so that the sum of all frequencies is unity) be $p_k$, and let the cost of producing word $w_k$ be $C_k$. It makes sense to leave the function $C_k$ unspecified for as long as possible. The word's information content, or entropy, is related to its frequency $p_k$ as $H_k = -\log_2 p_k$. The average cost per word is given by
$$C = \sum_k p_k C_k \qquad (3)$$

and the average entropy per word by

$$H = -\sum_k p_k \log_2 p_k. \qquad (4)$$
One can now ask what frequency distribution $\{p_k\}$ satisfying $\sum_k p_k = 1$ will minimize the cost ratio $C^* = C/H$.
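For concreteness, the averages (3) and (4) and the cost ratio $C^*$ can be computed directly for any candidate distribution; a minimal sketch with made-up toy frequencies and costs (not taken from the article):

```python
import math

def averages(p, cost):
    """Average cost C per Equation (3) and entropy H per Equation (4)."""
    C = sum(pk * ck for pk, ck in zip(p, cost))
    H = -sum(pk * math.log2(pk) for pk in p if pk > 0)
    return C, H

# Toy example: four equally frequent words, each costing 2 units to produce.
p = [0.25, 0.25, 0.25, 0.25]
cost = [2.0, 2.0, 2.0, 2.0]
C, H = averages(p, cost)
print(C, H, C / H)   # cost ratio C* = C/H; here C = 2.0, H = 2.0, C* = 1.0
```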
We can use the standard method of Lagrange multipliers to find the minimum of $C^*$, given the normalization constraint on $p_k$:
$$\frac{\partial}{\partial p_k}\left(C^* + \lambda \sum_j p_j\right) = 0. \qquad (5)$$

Here the value of the Lagrange multiplier λ is to be determined later so as to normalize the frequencies. Performing the differentiation in (5) we obtain

$$\frac{C_k}{H} + \frac{C}{H^2}\left(\log_2 p_k + \frac{1}{\ln 2}\right) + \lambda = 0, \quad \forall k. \qquad (6)$$

This expresses the frequencies $p_k$ given the costs $C_k$:

$$p_k = \lambda' \, 2^{-H C_k / C}, \qquad (7)$$

where we denoted

$$\lambda' = 2^{-\lambda H^2 / C - 1/\ln 2}. \qquad (8)$$

Thus, $\lambda'$ is an arbitrary constant that we can use directly to normalize the frequencies. Now, once the cost $C_k$ of each word is known or assumed, Equation (7) yields the frequency distribution for the words. Note, however, that to obtain a closed-form solution, it is also necessary to consistently determine the constants C and H on the RHS of (7) from their respective definitions (3) and (4).
Now, it is easy to see from Equation (7) that a power law for frequencies could only result from the Ansatz

$$C_k = C_0 \log_2 k, \qquad (9)$$

which leads to

$$p_k = \lambda' k^{-B}, \quad B = H \frac{C_0}{C} \qquad (10)$$

(note that $C \propto C_0$, so $C_0/C$ does not depend on $C_0$). How could one justify Equation (9)? In Mandelbrot's original formulation, as we already mentioned, the cost of a word was assumed to be proportional to its length; thus the only way to get the logarithmic dependence on the rank is to assume that the number of distinct words grows exponentially with length. It is not necessary in this formulation to postulate that every combination of letters of a given length is equally probable, but even this weaker requirement is not realistic for natural languages, as demonstrated by Figure 2.
There is, however, a much more plausible argument in favour of the desired Ansatz (9), which does not depend on any assumptions about word length at all. Suppose words are stored in some kind of addressable memory. For simplicity, one can imagine a linear array of memory cells, each containing one word. Then, the cost of retrieving the word in the kth cell can be assumed to be proportional to the length of its address, that is, to the minimum number of bits (or neuron firings, say) needed to specify the address. And this is precisely $\log_2 k$. Of course, this result does not depend on the memory being "linear" in any real sense.
It is important to note that this is not just a different justification,
because with it the optimality model is no longer equivalent to the
random typing model. Let us now proceed to solving (10). From the normalization condition for frequencies, we get

$$p_k = \frac{1}{\zeta(B)}\, k^{-B}, \qquad (11)$$

where $\zeta$ is the Riemann zeta function, $\zeta(s) = \sum_{n=1}^{\infty} n^{-s}$. But this is not the end of the story, since B is related to H and C via Equation (10), and they in turn depend on B via $p_k$. This amounts to an equation for the power-law exponent B, which thus is not arbitrary. By substituting (11) back into (3) and (4), we get

$$C = \frac{C_0}{\zeta(B)} \sum_{k=1}^{\infty} k^{-B} \log_2 k \qquad (12)$$

$$H = \frac{B}{\zeta(B)} \sum_{k=1}^{\infty} k^{-B} \log_2\!\left(k\, \zeta(B)^{1/B}\right). \qquad (13)$$

It is now easy to see that $B = HC_0/C$ can only be satisfied when $\zeta(B) = 1$, which implies $B \to \infty$. This is not a very encouraging result, since it means that the minimum cost per unit information is achieved when there is only one word in use, and both cost and information vanish.
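The degeneracy is easy to verify numerically: ζ(B) stays strictly above 1 for every finite B > 1 and approaches 1 only as B → ∞. A sketch using a truncated series (the truncation depth and tail estimate are implementation choices):

```python
def riemann_zeta(s, terms=200_000):
    """Truncated Riemann zeta sum with an integral estimate of the tail;
    adequate for s comfortably above 1."""
    total = sum(n ** -s for n in range(1, terms))
    return total + terms ** (1 - s) / (s - 1)

# zeta(B) > 1 for every finite B > 1, so the self-consistency condition
# zeta(B) = 1 is only approached in the limit B -> infinity.
for B in (1.5, 2.0, 5.0, 20.0):
    print(B, riemann_zeta(B))
```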
This conclusion is borne out by a simple numerical simulation. Recall our earlier observation that cost-ratio optimization can be achieved via local dynamics: if speakers notice that a word's individual information/cost ratio is below average, they start using it less, and conversely, if the ratio is favourable, the word's frequency increases. It is hard to tell a priori whether this process would converge to a stationary distribution, so a numerical simulation was performed. The following algorithm implements this dynamics:

Cost Ratio Optimization Algorithm

(1) Initialize an array of N frequencies $p_k$ with random numbers and normalize them.
(2) Calculate the average cost and information per word according to (3), (4).
(3) For each k = 1, …, N, calculate the cost ratio for the kth word as $C^*_k = C_k/H_k = -\log_2 k / \log_2 p_k$. If it is within the interval $[(1-\gamma)C^*, (1+\gamma)C^*]$, where γ is a parameter, leave $p_k$ unchanged. Otherwise increase $p_k$ by a constant factor if the cost ratio is above the average, or decrease it by the same factor if it is below the average.
(4) If no frequencies were changed, stop.
(5) Reorder the words (i.e. reassign ranks in decreasing order of frequency), renormalize the frequencies and repeat from step (2).
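The steps above can be transcribed into a short Python sketch; the array size, update factor, tolerance γ, and iteration cap are arbitrary choices, and the cost is taken as $C_k = \log_2(k + k_0)$ so that $k_0 = 0$ corresponds to the Ansatz (9):

```python
import math
import random

def optimize(N=100, k0=0.0, gamma=0.05, factor=1.05, max_iter=400):
    """Local cost-ratio dynamics: words whose cost ratio is above the average
    gain frequency, words below it lose frequency; ranks and normalization
    are refreshed on every sweep."""
    random.seed(0)
    p = [random.random() for _ in range(N)]
    total = sum(p)
    p = [x / total for x in p]
    for _ in range(max_iter):
        p.sort(reverse=True)                        # rank 1 = most frequent
        cost = [math.log2(k + 1 + k0) for k in range(N)]
        C = sum(pk * ck for pk, ck in zip(p, cost))
        H = -sum(pk * math.log2(pk) for pk in p)
        target = C / H
        changed = False
        for k in range(N):
            ratio = cost[k] / (-math.log2(p[k]))    # C*_k = C_k / H_k
            if ratio > (1 + gamma) * target:
                p[k] *= factor
                changed = True
            elif ratio < (1 - gamma) * target:
                p[k] /= factor
                changed = True
        total = sum(p)
        p = [x / total for x in p]
        if not changed:
            break
    return sorted(p, reverse=True)

p = optimize()
print(f"largest frequency after relaxation: {p[0]:.3f}")
```

With $k_0 = 0$ this reproduces the degenerate behaviour described in the text; the $k_0 > 0$ case is taken up in the next section.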
This procedure quickly leads to the state where all frequencies but one are zero.

DERIVATION OF ZIPF-MANDELBROT FORMULA

We have seen that the Ansatz (9) does not eventually lead to the desired
result. It is probably this problem that prompted Mandelbrot to propose
a modification to Zipf’s law. In his own words,

. . . it seems worth pointing out that it has not been obtained by "mere curve fitting": in attempting to explain the first approximation law, $i(r,k) = (1/10)k\,r^{-1}$, I invariably obtained the more general second approximation, and only later did I realize that this more general formula was necessary and basically sufficient to fit the empirical data. (Mandelbrot, 1966, p. 356)

It turns out that the degeneracy problem can be avoided by the following modification of the cost-function Ansatz:

$$C_k = C_0 \log_2(k + k_0). \qquad (14)$$

It looks rather natural if we again imagine the linear memory, but this time with the first $k_0$ cells not occupied by useful words. Substitution of (14) into (7) yields the Zipf-Mandelbrot law

$$p_k = \frac{1}{\zeta(B, 1 + k_0)}\, (k + k_0)^{-B}, \qquad (15)$$

where $\zeta$ is now the Hurwitz zeta function, $\zeta(s, q) = \sum_{n=0}^{\infty} (n + q)^{-s}$.
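Formula (15) can be evaluated with a truncated Hurwitz zeta sum; a sketch in Python (the truncation depth and the sample values B = 1.5, k₀ = 10 are arbitrary choices):

```python
def hurwitz_zeta(s, q, terms=100_000):
    """Truncated Hurwitz zeta sum with an integral estimate of the tail."""
    total = sum((n + q) ** -s for n in range(terms))
    return total + (terms + q) ** (1 - s) / (s - 1)

# Zipf-Mandelbrot frequencies (15) for sample values B = 1.5, k0 = 10:
# flattened at low ranks, power law with exponent -B at high ranks.
B, k0 = 1.5, 10
norm = hurwitz_zeta(B, 1 + k0)        # normalization, computed once
freqs = [(k + k0) ** -B / norm for k in range(1, 1001)]
print(freqs[0], freqs[999])
```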
The Zipf-Mandelbrot formula has the potential of correctly approximating not only the power law, but also the initial, low-rank range of the real frequency distributions, which flatten out at k < 10 or so. But remember again that the second part of (10), $B = HC_0/C$, needs to be satisfied, which means that the parameters $k_0$ and B are not independent. This is rarely, if ever, mentioned in the literature, although it is a rather important constraint. Substituting (15) into (3) and (4) and noting that

$$\frac{\partial}{\partial s}\zeta(s, q) = -\sum_{n=0}^{\infty} (n + q)^{-s} \ln(n + q), \qquad (16)$$
we obtain

$$C = -\frac{C_0}{\ln 2}\,\frac{\zeta'(B, 1+k_0)}{\zeta(B, 1+k_0)} \qquad (17)$$

$$H = \log_2 \zeta(B, 1+k_0) - \frac{B}{\ln 2}\,\frac{\zeta'(B, 1+k_0)}{\zeta(B, 1+k_0)} \qquad (18)$$

$$B = HC_0/C \qquad (19)$$

where $\zeta'$ is the derivative with respect to the first argument. After simple transformations this reduces to

$$B = B - \frac{\ln \zeta(B, 1+k_0)}{\left(\ln \zeta(B, 1+k_0)\right)'}, \qquad (20)$$

that is,

$$\zeta(B, 1 + k_0) = 1. \qquad (21)$$

When $k_0 \to 0$, $B \to \infty$, as previously. In the opposite limit, $k_0 \to \infty$, the Zipfian exponent B tends to 1, but extremely slowly. To see this, let $k_0$ be a large integer. Then,

$$\zeta(B, 1 + k_0) = \zeta(B) - \sum_{n=1}^{k_0} n^{-B}. \qquad (22)$$

In order to compensate for the infinite growth of the second term as $k_0 \to \infty$, B must tend to 1, where Riemann's zeta function has a pole. Let $B = 1 + \epsilon$, $\epsilon \ll 1$; then

$$\zeta(B) = O(1/\epsilon) \qquad (23)$$

$$\sum_{n=1}^{k_0} n^{-B} = O\!\left(\frac{1}{\epsilon}\, k_0^{\epsilon}\right) \qquad (24)$$

whence

$$k_0^{\epsilon} = O(1), \quad \text{or} \quad B = 1 + O(1/\ln k_0).$$



The relationship between B and $k_0$ can be calculated numerically, but this would not tell us whether the resulting solution is stable with respect to the local dynamics described above. Running the local dynamics model shows that, in contrast to the case $k_0 = 0$, the model does converge to a stable solution described by (15), as shown in Figure 3.
However, as is readily seen from the figure, no value of $k_0$ yields a satisfactory approximation to the actual distribution. For small $k_0$, the slope is still significantly steeper than $-1$, but for larger $k_0$, the flattened portion spreads too far. Thus, with $k_0 = 10$, the slope is still about $-1.4$, but the power law starts at about k = 100, while in the actual distribution it begins after k = 10.
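The constraint (21) is a single equation for B at each $k_0$, and since $\zeta(B, 1 + k_0)$ decreases in B, it can be solved by straightforward bisection. A sketch (the truncated zeta sum, bracketing interval, and sample $k_0$ values are arbitrary choices); consistent with the text, it should give B near 1.4 for $k_0 = 10$, decreasing only slowly as $k_0$ grows:

```python
def hurwitz_zeta(s, q, terms=50_000):
    """Truncated Hurwitz zeta sum with an integral estimate of the tail."""
    total = sum((n + q) ** -s for n in range(terms))
    return total + (terms + q) ** (1 - s) / (s - 1)

def solve_B(k0, lo=1.0001, hi=20.0, steps=50):
    """Bisect the constraint zeta(B, 1 + k0) = 1 for the exponent B;
    zeta decreases in its first argument, so the root in (lo, hi) is unique."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if hurwitz_zeta(mid, 1 + k0) > 1.0:
            lo = mid                  # zeta still above 1: true B is larger
        else:
            hi = mid
    return (lo + hi) / 2

for k0 in (1, 10, 100):
    print(k0, round(solve_B(k0), 3))
```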
To sum up, the Zipf-Mandelbrot law can be obtained from a model optimizing the information/cost ratio with no assumptions about word
lengths. This model is not equivalent to the random typing model, and
allows the optimum to be achieved via local dynamics, i.e. in a causal,
rather than teleological manner. However, the two parameters of the
resulting distribution, B and k0, are not independent, and as a result, it
does not provide a reasonable fit to the empirical data.

Fig. 3. Zipf-Mandelbrot law with different values of $k_0$. Real frequency distribution (not to scale) and Zipf's law are shown for comparison.

REFERENCES

Ferrer i Cancho, R. (2005). The variation of Zipf’s law in human language. European
Physical Journal B, 44, 249–257.
Li, W. (1992). Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE
Transactions on Information Theory, 38(6), 1842–1845.
Mandelbrot, B. (1953). An informational theory of the statistical structure of languages.
In W. Jackson (Ed.), Communication Theory (pp. 486–502). Woburn, MA:
Butterworth.
Mandelbrot, B. (1966). Information theory and psycholinguistics: A theory of word frequencies. In P. F. Lazarsfeld & N. W. Henry (Eds.), Readings in Mathematical Social Sciences (pp. 350–368). Cambridge, MA: MIT Press.
Mandelbrot, B. (1982). The Fractal Geometry of Nature. New York: Freeman.
Manin, D. Yu. (2008). Zipf's law and avoidance of excessive synonymy. Cognitive Science, 32(7), 1075–1098.


Shannon, C. E. (1948). A mathematical theory of communication. The Bell System
Technical Journal, 27(3), 379–423.
Sharoff, S. (2002). Meaning as use: Exploitation of aligned corpora for the contrastive
study of lexical semantics. In Proceedings of Language Resources and Evaluation
Conference (LREC02). Las Palmas, Spain, May. Retrieved April 1, 2008, from
http://www.artint.ru/projects/frqlist/lrec-02.pdf
Sharoff, S. (n.d.). The Frequency Dictionary of Russian. Retrieved April 1, 2008, from
http://artint.ru/projects/frqlist/frqlist-en.asp
Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42, 425–440.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Cambridge, MA:
Addison-Wesley.
