
Physica A 390 (2011) 3189–3203

Contents lists available at ScienceDirect

Physica A
journal homepage: www.elsevier.com/locate/physa

The growth statistics of Zipfian ensembles: Beyond Heaps’ law✩


Iddo Eliazar ∗
Department of Technology Management, Holon Institute of Technology, P.O. Box 305, Holon 58102, Israel

article info

Article history:
Received 5 March 2011
Received in revised form 20 April 2011
Available online 12 May 2011

Keywords:
Zipf's law
Heaps' law
Power laws
Rank distributions
Growth processes
Poisson processes
Heaps process
Heaps curve
Functional Central Limit Theorems (FCLTs)

abstract

We consider an evolving ensemble assembled from a set of n different elements via a stochastic growth process in which independent and identically distributed copies of the elements arrive randomly in time, and their statistics are governed by Zipf's law. The associated ''Heaps process'' is the stochastic process tracking the fraction of different element copies present in the evolving ensemble at any given time point. For example, the evolving ensemble is a text assembled from a stream of words, and the Heaps process keeps count of the number of different words in the evolving text. A detailed asymptotic statistical analysis of the Heaps process, in the limit n → ∞, is conducted. This paper establishes a comprehensive ''Heapsian analysis'' of the growth statistics of Zipfian ensembles. The analysis presented far extends and generalizes Heaps' law, which asserts that the number of different words in a text of length l follows a power law in the variable l.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Zipf’s law and Heaps’ law are empirical power laws observed in linguistics. Zipf’s law [1] asserts that the occurrence
frequency F (r ) of the rth most common word in a given text follows a power-law decay in the rank variable r:
F(r) ≃ a / r^α,    (1)
where a is a positive coefficient, and α is a positive exponent. Heaps’ law [2] asserts that the number N (l) of different words
appearing in a given text of length1 l follows a power-law growth in the length variable l:

N(l) ≃ b l^β,    (2)


where b is a positive coefficient, and β is a positive exponent taking values in the range 0 < β < 1. Heaps’ law represents
‘‘the rate of innovation in a stream of words’’ [3]. The fact that Zipf’s law and Heaps’ law are both empirical linguistic laws
implies that there should be an underlying connection between them [4–7].
Zipf’s law is one single example of power-law rank distributions. Power-law rank distributions were first discovered in
demography by Auerbach [8], in word frequencies by Estoup [9], and in scientific productivity by Lotka [10]. The interest
in power-law rank distributions rose in the Physics community [11], and in the Computer Science community [12], in
the context of Network Theory [13,14] — which, in turn, was motivated by the emergence of the World Wide Web and
the Internet. Following the rise of Web 2.0, Heaps' law was observed in the collective dynamics of social annotation

✩ This paper is dedicated to Professor H. Eugene Stanley, on the occasion of his 70th birthday.
∗ Tel.: +972 507 290 650.
E-mail address: eliazar@post.tau.ac.il.
1 The length of the text is defined as the total number of words appearing in it.

0378-4371/$ – see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.physa.2011.05.003

and tagging processes [3,15]. Both Zipf’s law and Heaps’ law appear to emerge in the context of network-based growth
processes [3,12]. Even so, while the Physics community is well aware of Zipf's law, it remains largely unaware of Heaps' law.
The goal of this paper is to expose the Physics community to Heaps’ law. We believe that such an exposition is almost
an imperative due to (i) the prevalence of power-law rank distributions in a host of complex systems [16], and (ii) the
unequivocal empirical connection between Zipf’s law and Heaps’ law observed in linguistics.
In this paper we explore the connection between Zipf’s law and Heaps’ law from a Statistical Physics perspective —
considering a general ‘‘Zipfian ensemble’’ consisting of n different elements. In the Natural Sciences ensembles are usually
created by dynamic, and often stochastic, growth processes. We thus consider the Zipfian ensemble to be assembled via a
stochastic growth process in which: the elements arrive randomly in time, and the statistical frequencies of the arriving
elements are governed by Zipf’s law. In the context of linguistics the Zipfian ensemble is a text evolving from a stream of
words. As a conceptual physical example one can consider Per Bak’s ‘‘sandpile model’’ [17] — in which case the Zipfian
ensemble is the time-series of sand avalanches (the ensemble elements being the different avalanche sizes). As a conceptual
computer science example one can consider information retrieval from a data warehouse — in which case the Zipfian
ensemble is the time-series of retrieval requests (the ensemble elements being the different items stored in the data
warehouse).
Focus is set on the ‘‘Heaps process’’ of the Zipfian ensemble: the stochastic process tracking the fraction of the different
elements present in the ensemble at any given time point. We explore the asymptotic statistics of the Heaps process in
the limit n → ∞, i.e., as the number of ensemble elements tends to infinity. Our asymptotic statistical analysis asserts
that: (i) the Heaps process converges, in the limit n → ∞, to the ''Heaps curve'' — a deterministic curve which is a
generalization of Heaps' law; (ii) the random fluctuations of the Heaps process around the Heaps curve are of order O(1/√n),
and their correlations are governed by the Heaps curve; (iii) the Heaps process is well approximated by the Heaps curve
plus an additional Gaussian noise process with amplitude 1/√n. In the nomenclature of Probability Theory the Gaussian
approximation result is termed a ‘‘Functional Central Limit Theorem’’ (FCLT).
This paper establishes a comprehensive ‘‘Heapsian analysis’’ – which far extends Heaps’ empirical law – of the growth
statistics of Zipfian ensembles. The remainder of the paper is organized as follows. Section 2 describes the details of the Zipfian
ensemble and its associated Heaps process. Section 3 presents the Heaps curve and explores its connection to Heaps’ law.
Section 4 studies the asymptotic mean behavior and correlation structure of the Heaps process. Section 5 establishes the
FCLT result. The proofs of all key results are given in the Appendix. A short exposition (with no proofs) of the FCLT result
appears in the Fast Track Communication [18].
Throughout the paper we shall make use of the following notation. Indicator functions: I[E] denotes the indicator
function of an event E. Expectation and variance: E[X] and Var[X] denote, respectively, the mathematical expectation and
the variance of a real-valued random variable X. Covariance: Cov[X1, X2] denotes the covariance between two real-valued
random variables X1 and X2. Minima and maxima: (x1 ∧ x2) and (x1 ∨ x2) denote, respectively, the minimum min(x1, x2)
and the maximum max(x1, x2) of two real numbers x1 and x2.

2. The Zipfian ensemble

Consider a set of n different elements from which an evolving ensemble is assembled via a stochastic growth process.
Elements arrive to the evolving ensemble randomly in time, following a Poisson process Πn with intensity λn . The arriving
elements are independent and identically distributed random variables, governed by a Zipfian law with exponent α . Namely,
the probability that an arriving element has rank r is given by
Fn(r) = (1/r)^α / [(1/1)^α + (1/2)^α + ··· + (1/n)^α]   (r = 1, . . . , n).    (3)

The ‘‘coloring theorem’’ of the theory of Poisson processes ([19, Section 5.1]) implies that: (i) the arrivals of elements of
rank r form a Poisson process Πn (r ) with intensity λn (r ) = λn Fn (r ); (ii) the Poisson processes {Πn (r )}nr=1 are mutually
independent.
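For readers who wish to experiment numerically, the token distribution of Eq. (3) is straightforward to sample. The following sketch (Python with NumPy; the function names are our own illustration, not code from the paper) draws i.i.d. element ranks governed by the Zipfian law:

```python
import numpy as np

def zipf_pmf(n, alpha):
    """Probability law of Eq. (3): F_n(r) proportional to (1/r)^alpha, r = 1..n."""
    w = (1.0 / np.arange(1, n + 1)) ** alpha
    return w / w.sum()

def sample_tokens(n, alpha, size, rng=None):
    """Draw i.i.d. ranks ('tokens') from the Zipfian law of Eq. (3)."""
    rng = np.random.default_rng(rng)
    return rng.choice(np.arange(1, n + 1), size=size, p=zipf_pmf(n, alpha))

pmf = zipf_pmf(100, 1.0)
print(pmf[0] / pmf[1])  # F_n(1)/F_n(2) = 2^alpha = 2 for alpha = 1
```

The normalization in `zipf_pmf` is exactly the denominator of Eq. (3), so rank ratios reproduce the pure power law a/r^α of Eq. (1).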
Let Tn (r ) denote the first arrival of the Poisson process Πn (r ). Namely, Tn (r ) is the time epoch at which the first element
of rank r arrives to the evolving ensemble. Since Πn (r ) is a Poisson process with intensity λn (r ), the random time Tn (r ) is
exponentially distributed with mean E [Tn (r )] = 1/λn (r ). Moreover, since the Poisson processes {Πn (r )}nr=1 are mutually
independent – so are the random times {Tn (r )}nr=1 . In terms of the random times {Tn (r )}nr=1 the fraction of different elements
present in the evolving ensemble at time t is given by
Hn(t) = (1/n) Σ_{r=1}^{n} I[Tn(r) ≤ t]   (t ≥ 0).    (4)

Clearly, the stochastic process Hn = (Hn (t ))t ≥0 – henceforth termed ‘‘the Heaps process’’ – grows monotonically from
the level Hn (0) = 0 to the saturation level limt →∞ Hn (t ) = 1. In what follows we shall explore the asymptotic statistics
of the stochastic Heaps process Hn in the limit n → ∞, i.e., as the number of elements tends to infinity. To that end the

Fig. 1. Convergence of the Heaps process Hn to the Heaps curve H. The Heaps process Hn , for n = 100, is simulated numerically and depicted by the rough
(blue) line. The Heaps curve H is computed numerically and depicted by the smooth (green) line. The Zipfian exponent is α = 1/2 in Fig. 1A, α = 1 in
Fig. 1B, and α = 2 in Fig. 1C. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Poissonian intensity λn needs to be adequately scaled. As the analysis to be carried out will establish, the proper scaling is
given by λn = 1/Fn (n). This scaling, in turn, implies that the intensity of the Poisson process Πn (r ) is given by
 r −α
λn (r ) = (r = 1, . . . , n). (5)
n
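Since Hn(t) depends only on the first-arrival times Tn(r), which are independent exponentials with rates λn(r) = (r/n)^{−α}, a single realization of the Heaps process can be simulated directly. A minimal Python/NumPy sketch (our own illustration, not the code used to produce the figures):

```python
import numpy as np

def heaps_process(n, alpha, t_grid, rng=None):
    """One realization of H_n(t) of Eq. (4): the fraction of ranks whose
    first arrival T_n(r) ~ Exp(rate (r/n)^(-alpha)) occurred by time t."""
    rng = np.random.default_rng(rng)
    rates = (np.arange(1, n + 1) / n) ** (-alpha)
    T = rng.exponential(1.0 / rates)           # first-arrival times T_n(r)
    t = np.asarray(t_grid, dtype=float)
    return (T[None, :] <= t[:, None]).mean(axis=1)

t = np.linspace(0.0, 3.0, 61)
H_n = heaps_process(1000, 0.5, t, rng=1)       # a rough curve of the Fig. 1/2 type
```

Each call produces one monotone sample path from 0 toward 1, as described below Eq. (4).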
As noted in the Introduction, in the context of linguistics the n elements represent the words in a given language, and
the ensemble is an evolving text. Alternatively, the ensemble can be considered as the stream of words broadcasted on a
radio or television channel. In the taxonomy of linguistics the ‘‘arriving elements’’ are termed ‘‘tokens’’, and the ‘‘different
elements’’ are termed ‘‘types’’ [20]. In this terminology Eq. (3) represents the probability law of the ‘‘tokens’’, and Eq. (4)
represents the fraction of ‘‘types’’ accumulated up to time t.
As a conceptual physical example of the evolving ensemble consider Per Bak’s ‘‘sandpile model’’ — the prototypical
example of systems in ‘‘self-organized criticality’’ [17]. In this example the elements represent the different sand avalanches
that can take place in the sandpile model – the avalanches ranked in an increasing order of their magnitudes. In the sandpile
model setting Eq. (3) represents the probability law of the avalanche magnitudes, and Eq. (4) represents the fraction of
avalanches of different magnitudes taking place up to time t.
As a conceptual computer science example of the evolving ensemble consider information retrieval from a data
warehouse — a prevalent and ubiquitous process in our modern-day information age. In this example the elements represent
the different items stored in the data warehouse — the items ranked in a decreasing order of their ‘‘retrieval-popularity’’.
Eq. (3) represents the probability law of the items’ popularity, and Eq. (4) represents the fraction of different items retrieved
up to time t.

3. The Heaps curve

In this paper we will show that the stochastic Heaps process Hn = (Hn(t))t≥0 converges, in the limit n → ∞, to a
deterministic curve H = (H(t))t≥0 – henceforth termed ''the Heaps curve'' – given by
H(t) = ∫_0^1 [1 − exp(−u^{−α} t)] du   (t ≥ 0).    (6)
The precise type of convergence of the Heaps process Hn to the Heaps curve H shall be presented hereinafter in Sections 4
and 5. The convergence of the Heaps process Hn to the Heaps curve H is demonstrated in Fig. 1 (for n = 100) and in Fig. 2
(for n = 1000).
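The Heaps curve of Eq. (6) has no closed form for general α, but the integral is easy to evaluate numerically: the integrand is bounded by 1 and tends to 1 as u → 0. A simple midpoint-rule sketch in Python (the grid size is an arbitrary choice of ours):

```python
import numpy as np

def heaps_curve(t, alpha, m=200_000):
    """Heaps curve H(t) of Eq. (6), via the midpoint rule on (0, 1)."""
    u = (np.arange(m) + 0.5) / m               # interior midpoints avoid u = 0
    return float(np.mean(1.0 - np.exp(-t * u ** (-alpha))))

# At alpha = 0 the integrand is constant in u and H(t) = 1 - exp(-t) exactly,
# which gives a convenient sanity check of the quadrature:
print(heaps_curve(1.0, 0.0))  # 1 - e^{-1} ≈ 0.6321
```

For α > 0 the same routine reproduces the smooth curves shown in Figs. 1 and 2.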
Eq. (6) gives an implicit integral representation of the Heaps curve H in terms of its variable t. The only case where
the implicit representation turns explicit is at the exponent value α = 0. At this exponent value: (i) Zipf’s law coincides


Fig. 2. Convergence of the Heaps process Hn to the Heaps curve H. The Heaps process Hn , for n = 1000, is simulated numerically and depicted by the
rough (blue) line. The Heaps curve H is computed numerically and depicted by the smooth (green) line. The Zipfian exponent is α = 1/2 in Fig. 2A, α = 1
in Fig. 2B, and α = 2 in Fig. 2C. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

with the Uniform law: Fn (r ) = 1/n (r = 1, . . . , n); (ii) The elements arrive at the evolving ensemble with equal unit
intensities: λn (r ) = 1 (r = 1, . . . , n); (iii) The Heaps curve is explicitly given by H (t ) = 1 − exp (−t ) (t ≥ 0). We note that
the exponent value α = 0 corresponds to the classic Probability Theory ‘‘coupon collector problem’’ [21,22]. In linguistics
however, empirical studies never yield exponents α close to zero.
As in the case of the Heaps process Hn , the Heaps curve H grows monotonically from the level H (0) = 0 to the saturation
level limt →∞ H (t ) = 1. The asymptotic behavior of the Heaps curve H near the origin is as follows. The Heaps curve H is
asymptotically equivalent to the curve G = (G(t))t≥0 (i.e., lim_{t→0} H(t)/G(t) = 1), which is given by

G(t) =
    t/(1 − α)              α < 1,
    t ln(1/t)              α = 1,
    Γ(1 − 1/α) t^{1/α}     α > 1.    (7)

In the exponent range α > 1 the full power-series expansion of the Heaps curve H is given by

H(t) = Γ(1 − 1/α) t^{1/α} + Σ_{m=1}^{∞} [(−1)^m / (m! (mα − 1))] t^m.    (8)

The proof of Eqs. (7) and (8) is given in the Appendix.
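In the range α > 1 the series of Eq. (8) can be checked numerically against the integral of Eq. (6). A small Python sketch (the truncation length and quadrature grid are our own choices):

```python
import math
import numpy as np

def heaps_series(t, alpha, terms=40):
    """Truncated power series of Eq. (8); valid for alpha > 1."""
    s = math.gamma(1.0 - 1.0 / alpha) * t ** (1.0 / alpha)
    for m in range(1, terms + 1):
        s += (-1.0) ** m * t ** m / (math.factorial(m) * (m * alpha - 1.0))
    return s

def heaps_integral(t, alpha, m=200_000):
    """Direct midpoint-rule evaluation of the integral of Eq. (6)."""
    u = (np.arange(m) + 0.5) / m
    return float(np.mean(1.0 - np.exp(-t * u ** (-alpha))))

print(heaps_series(0.5, 2.0), heaps_integral(0.5, 2.0))  # both ≈ 0.791
```

The two evaluations agree to quadrature accuracy, and the leading term Γ(1 − 1/α) t^{1/α} dominates for small t, as the discussion below explains.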


Eqs. (7) and (8) explain the connection between the empirically observed Heaps’ law (recall Eq. (2)) and the Heaps curve
H. In the exponent range α > 1 the Heaps curve admits – for small t values – the power-law form H(t) ≈ Γ(1 − β) t^β,
with exponent β = 1/α (taking values in the range 0 < β < 1). This power-law form perfectly coincides with Heaps'
law — albeit with the time variable t replacing the length variable l. The reason why the coincidence takes place for small t
values is due to the underlying scaling (λn = 1/Fn (n)) which, in effect, ‘‘speeds up’’ time. Note that in the exponent range
α > 1 the Zipfian exponent α is reciprocally transformed to the Heapsian exponent β = 1/α . This reciprocal connection was
observed and discussed in [3,4] and in [6,7].
The Heaps curve H is, in effect, a correction to Heaps’ law. Observing the Heaps curve H near the origin and near infinity,
one can informally conclude that its asymptotic behavior is given by H(t) ≈ c·t^ϵ where: (i) for small time scales (t ≪ 1) c
is a positive constant and the exponent ϵ takes values in the range 0 < ϵ ≤ 1; (ii) for large time scales (t ≫ 1) c = 1 and
ϵ = 0. In particular, one can informally assert that the aforementioned exponent ϵ changes as time t increases from zero to
infinity. This ‘‘exponent change’’ was observed and investigated in [23].

4. Mean behavior and correlation structure

In this section we explore the asymptotic mean behavior and the asymptotic correlation structure of the Heaps process Hn
in the limit n → ∞. Our first assertion is that, for large n, the mean of the Heaps process is well approximated by the Heaps
curve:
E[Hn(t)] ≃ H(t).    (9)
The precise meaning of Eq. (9) is given by the following proposition:

Proposition 1. Set l to be an arbitrary positive level. The mean function E[Hn(t)] (t ≥ 0) converges uniformly on the ray t ≥ l,
in the limit n → ∞, to the Heaps curve H:

lim_{n→∞} max_{t≥l} |E[Hn(t)] − H(t)| = 0.    (10)

The proof of Proposition 1 is given in the Appendix. We now turn to investigate the fluctuations of the Heaps process Hn
around its asymptotic mean — the Heaps curve H. To that end we calculate the variance of the Heaps process Hn and assert
that, for large n, the fluctuations are of order O(1/√n) and are given by:

Var[Hn(t)] ≃ (1/n) (H(2t) − H(t)).    (11)
The precise meaning of Eq. (11) is given by the following proposition:

Proposition 2. Set l to be an arbitrary positive level. The scaled variance function n·Var[Hn(t)] (t ≥ 0) converges uniformly on
the ray t ≥ l, in the limit n → ∞, to the function H(2t) − H(t) (t ≥ 0):

lim_{n→∞} max_{t≥l} |n·Var[Hn(t)] − (H(2t) − H(t))| = 0.    (12)

In effect, Eq. (11) and Proposition 2 are corollaries of results regarding the asymptotic correlation structure of the Heaps
process Hn. Indeed, calculating the auto-covariance of the Heaps process Hn we assert that, for large n, the correlations are
of order O(1/√n) and are given by:

Cov[Hn(s), Hn(t)] ≃ (1/n) (H(s + t) − H(s ∨ t)).    (13)
The precise meaning of Eq. (13) is given by the following proposition:

Proposition 3. Set l to be an arbitrary positive level. The scaled covariance function n·Cov[Hn(s), Hn(t)] (s, t ≥ 0) converges
uniformly on the range s, t ≥ l, in the limit n → ∞, to the function H(s + t) − H(s ∨ t) (s, t ≥ 0):

lim_{n→∞} max_{s,t≥l} |n·Cov[Hn(s), Hn(t)] − (H(s + t) − H(s ∨ t))| = 0.    (14)

The proof of Proposition 3 is given in the Appendix.
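Propositions 1–3 are easy to probe by Monte Carlo. The sketch below (Python/NumPy; the sample sizes are our own choices) compares the scaled empirical covariance n·Cov[Hn(s), Hn(t)] against H(s + t) − H(s ∨ t) for α = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, s, t, runs = 200, 1.0, 0.5, 1.0, 20_000
rates = (np.arange(1, n + 1) / n) ** (-alpha)
T = rng.exponential(1.0 / rates, size=(runs, n))   # runs x n first-arrival times
Hn_s = (T <= s).mean(axis=1)                       # H_n(s), one value per realization
Hn_t = (T <= t).mean(axis=1)                       # H_n(t), one value per realization
emp = n * np.cov(Hn_s, Hn_t)[0, 1]                 # n * empirical covariance

u = (np.arange(100_000) + 0.5) / 100_000           # midpoint grid for Eq. (6)
H = lambda x: float(np.mean(1.0 - np.exp(-x * u ** (-alpha))))
theory = H(s + t) - H(max(s, t))                   # Eq. (13) / Proposition 3
print(emp, theory)
```

The two numbers agree up to the O(1/n) bias and Monte Carlo noise, in line with the uniform bound of Eq. (14).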

5. Functional central limit theorem

The asymptotic statistical analysis presented in the previous section gives rise to the following stochastic approximation,
for large n, of the Heaps process:

Hn(t) ≃ H(t) + (1/√n) Z(t),    (15)

where Z = (Z(t))t≥0 is a zero-mean noise process with correlation structure governed by the covariance

Cov[Z(s), Z(t)] = H(s + t) − H(s ∨ t)   (s, t ≥ 0).    (16)


Indeed, since the fluctuation process Z has zero mean, calculating the mean of both sides of Eq. (15) yields Eq. (9). And, since
the Heaps curve H is deterministic, calculating the covariance of both sides of Eq. (15) yields Eq. (13).
We now turn to explain the precise meaning of the stochastic approximation of Eq. (15). To that end consider the
fluctuation process Zn = (Zn(t))t≥0 given by

Zn(t) = √n (Hn(t) − H(t)).    (17)

Namely, the stochastic process Zn measures the random fluctuations of the Heaps process Hn around the Heaps curve H, and
scales the random fluctuations by the factor √n. The asymptotic statistical behavior of the fluctuation process Zn is given by
the following stochastic limit-law result:

Proposition 4. The fluctuation process Zn = (Zn(t))t≥0 converges, in law, in the limit n → ∞, to a zero-mean Gaussian process
Z = (Z(t))t≥0 with covariance function Cov[Z(s), Z(t)] = H(s + t) − H(s ∨ t) (s, t ≥ 0).
The proof of Proposition 4 is given in the Appendix. Proposition 4 asserts that the zero-mean noise process Z is a Gaussian
process — and hence its statistics are fully characterized by its covariance function. In the nomenclature of probability theory
the stochastic limit-law result of Proposition 4 is termed a ‘‘Functional Central Limit Theorem’’ (FCLT). The best known FCLT
result is Donsker’s theorem [24] which asserts that properly scaled random walks converge, in law, to Brownian motion —
a zero-mean Gaussian process B = (B (t ))t ≥0 with covariance function Cov [B (s) , B (t )] = (s ∧ t ) (s, t ≥ 0).
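The FCLT can likewise be probed numerically: at a fixed time t, Zn(t) = √n(Hn(t) − H(t)) should be approximately Gaussian with mean 0 and variance H(2t) − H(t). A Monte Carlo sketch in Python (our own illustration; n and the run count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha, t, runs = 500, 1.0, 1.0, 10_000
rates = (np.arange(1, n + 1) / n) ** (-alpha)
T = rng.exponential(1.0 / rates, size=(runs, n))   # runs x n first-arrival times
Hn_t = (T <= t).mean(axis=1)                       # runs realizations of H_n(t)

u = (np.arange(100_000) + 0.5) / 100_000           # midpoint grid for Eq. (6)
H = lambda x: float(np.mean(1.0 - np.exp(-x * u ** (-alpha))))
Z = np.sqrt(n) * (Hn_t - H(t))                     # fluctuation process at time t

# Proposition 4 predicts mean ~ 0 and std ~ sqrt(H(2t) - H(t)):
print(Z.mean(), Z.std(), np.sqrt(H(2 * t) - H(t)))
```

A histogram of `Z` (not shown) is close to the predicted Gaussian already for moderate n, consistent with Eqs. (15)–(17).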

6. Conclusions

This paper presented a comprehensive ''Heapsian analysis'' of the growth statistics of Zipfian ensembles. We considered
an evolving ensemble assembled via a stochastic growth process in which: elements arrive randomly in time following a
Poisson process, and the arriving elements are independent random variables whose distribution is governed by Zipf’s law
with dimension n. Focus was set on the ensemble’s Heaps process – the stochastic process tracking the ensemble’s fraction
of different elements at any given time point – and on its asymptotic statistics in the limit n → ∞. An asymptotic statistical
analysis explored the mean behavior and the correlation structure of the Heaps process, and further established a Functional
Central Limit Theorem governing the stochastic convergence of the Heaps process to the limiting Heaps curve. The results
obtained far extend Heaps' law, and provide scientists with the precise statistics of growth in ensembles governed by Zipfian
statistics.

Acknowledgment

The author gratefully acknowledges Shlomi Reuveni for the numerical simulations and the figures.

Appendix

A.1. Analysis of the Heaps curve

A.1.1. Proof of Eq. (7)

We begin with an integral calculation of the function H(t), and then split into three asymptotically different cases: α < 1,
α = 1, and α > 1.

H(t) = ∫_0^1 [1 − exp(−u^{−α} t)] du    (18)

(using the change of variables x = u^{−α})

= ∫_0^∞ [1 − exp(−xt)] [(1/α) x^{−1−1/α} I(x > 1)] dx
= ∫_0^∞ [∫_0^∞ t exp(−ty) I(y < x) dy] [(1/α) x^{−1−1/α} I(x > 1)] dx    (19)

(changing the order of integration)

= t ∫_0^∞ exp(−ty) [∫_{y∨1}^∞ (1/α) x^{−1−1/α} dx] dy
= t ∫_0^∞ exp(−ty) (y ∨ 1)^{−1/α} dy    (20)

(using the change of variables x = ty)

= ∫_0^∞ exp(−x) ((x/t) ∨ 1)^{−1/α} dx
= ∫_0^t exp(−x) dx + t^{1/α} ∫_t^∞ exp(−x) x^{−1/α} dx
= [1 − exp(−t)] + t^{1/α} ∫_t^∞ exp(−x) x^{−1/α} dx.    (21)

The case α < 1.

lim_{t→0} H(t) / [t/(1 − α)] = (1 − α) lim_{t→0} H(t)/t    (22)

(using Eq. (21))

= (1 − α) [ lim_{t→0} (1 − exp(−t))/t + lim_{t→0} (∫_t^∞ exp(−x) x^{−1/α} dx) / t^{1−1/α} ]    (23)

(using L'Hôpital's rule in the second limit)

= (1 − α) [1 + lim_{t→0} (−exp(−t) t^{−1/α}) / ((1 − 1/α) t^{−1/α})]
= (1 − α) [1 − 1/(1 − 1/α)] = 1.    (24)

The case α = 1.

lim_{t→0} H(t) / [t ln(1/t)] = lim_{t→0} H(t) / [−t ln(t)]    (25)

(using Eq. (21))

= lim_{t→0} [(1 − exp(−t))/t] · [1/(−ln(t))] + lim_{t→0} (∫_t^∞ exp(−x) x^{−1} dx) / (−ln(t))    (26)

(using L'Hôpital's rule in the second limit)

= 0 + lim_{t→0} (−exp(−t) t^{−1}) / (−t^{−1}) = 1.    (27)

The case α > 1.

lim_{t→0} H(t) / [Γ(1 − 1/α) t^{1/α}] = [1/Γ(1 − 1/α)] lim_{t→0} H(t)/t^{1/α}    (28)

(using Eq. (21))

= [1/Γ(1 − 1/α)] [ lim_{t→0} t^{1−1/α} (1 − exp(−t))/t + lim_{t→0} ∫_t^∞ exp(−x) x^{−1/α} dx ]
= [1/Γ(1 − 1/α)] [0 + ∫_0^∞ exp(−x) x^{−1/α} dx] = 1    (29)

(using the definition of the Gamma function).

A.1.2. Proof of Eq. (8)

Consider the exponent range α > 1. Eq. (21) implies that

H(t) = [1 − exp(−t)] + t^{1/α} ∫_t^∞ exp(−x) x^{−1/α} dx
= [1 − exp(−t)] + t^{1/α} [∫_0^∞ exp(−x) x^{−1/α} dx − ∫_0^t exp(−x) x^{−1/α} dx]
= [1 − exp(−t)] + t^{1/α} [Γ(1 − 1/α) − ∫_0^t exp(−x) x^{−1/α} dx]
= Γ(1 − 1/α) t^{1/α} + [1 − exp(−t) − t^{1/α} ∫_0^t exp(−x) x^{−1/α} dx].    (30)

Now, using Taylor expansions we have

1 − exp(−t) = Σ_{m=1}^∞ [(−1)^{m+1}/m!] t^m,    (31)

and

t^{1/α} ∫_0^t exp(−x) x^{−1/α} dx = t^{1/α} ∫_0^t [Σ_{k=0}^∞ ((−1)^k/k!) x^k] x^{−1/α} dx
= t^{1/α} Σ_{k=0}^∞ [(−1)^k/k!] · t^{k+1−1/α}/(k + 1 − 1/α)
= Σ_{k=0}^∞ [(−1)^{k+2}/(k + 1)!] · [(k + 1)/(k + 1 − 1/α)] t^{k+1}
= Σ_{m=1}^∞ [(−1)^{m+1}/m!] · [m/(m − 1/α)] t^m.    (32)

Substituting Eqs. (31) and (32) into Eq. (30) yields

H(t) = Γ(1 − 1/α) t^{1/α} + Σ_{m=1}^∞ [(−1)^{m+1}/m!] t^m − Σ_{m=1}^∞ [(−1)^{m+1}/m!] · [m/(m − 1/α)] t^m
= Γ(1 − 1/α) t^{1/α} + Σ_{m=1}^∞ [(−1)^{m+1}/m!] [1 − m/(m − 1/α)] t^m
= Γ(1 − 1/α) t^{1/α} + Σ_{m=1}^∞ [(−1)^m/(m! (mα − 1))] t^m.    (33)

A.2. Mean and covariance analysis

A.2.1. Preliminaries
Set

h(u; t) = 1 − exp(−u^{−α} t)   (0 < u < 1; t ≥ 0).    (34)

Note that since λn(r) = (r/n)^{−α} we have

h(r/n; t) = 1 − exp(−λn(r) t)    (35)

and

(∂h/∂t)(r/n; t) = λn(r) exp(−λn(r) t)   (r = 1, . . . , n; t ≥ 0).    (36)

Also note (using straightforward algebra) that

h(u; s ∧ t) − h(u; s) h(u; t) = h(u; s + t) − h(u; s ∨ t)   (s, t ≥ 0).    (37)

Let g(u) be a continuous function on the unit interval (0 ≤ u ≤ 1). Basic calculus asserts that:

lim_{n→∞} (1/n) Σ_{r=1}^n g(r/n) = ∫_0^1 g(u) du.    (38)

Moreover, if g(u) has a bounded derivative on the unit interval then:

|(1/n) Σ_{r=1}^n g(r/n) − ∫_0^1 g(u) du| ≤ max_{0≤u≤1} |g′(u)| / n.    (39)

A.2.2. Proof of Proposition 1

Fix t > 0. Then:

E[Hn(t)] = E[(1/n) Σ_{r=1}^n I(Tn(r) ≤ t)]    (40)

(using the properties of the expectation functional E[·])

= (1/n) Σ_{r=1}^n E[I(Tn(r) ≤ t)] = (1/n) Σ_{r=1}^n Pr(Tn(r) ≤ t)    (41)

(using the fact that the random variables Tn(r) are exponentially distributed with mean 1/λn(r) (r = 1, . . . , n), and using
Eq. (35))

= (1/n) Σ_{r=1}^n (1 − exp(−λn(r) t)) = (1/n) Σ_{r=1}^n h(r/n; t).    (42)

Hence

|E[Hn(t)] − H(t)| = |(1/n) Σ_{r=1}^n h(r/n; t) − ∫_0^1 h(u; t) du|    (43)

(using Eq. (39))

≤ (1/n) max_{0≤u≤1} |(∂h/∂u)(u; t)|.    (44)

Eq. (44), in turn, implies that

max_{t≥l} |E[Hn(t)] − H(t)| ≤ (1/n) max_{t≥l} max_{0≤u≤1} |(∂h/∂u)(u; t)|.    (45)

Now,

|(∂h/∂u)(u; t)| = α t exp(−u^{−α} t) u^{−α−1} = (α/t^{1/α}) exp(−y) y^{1+1/α},    (46)

where we used the change of variables y = u^{−α} t. Consequently

max_{0≤u≤1} |(∂h/∂u)(u; t)| = (α/t^{1/α}) max_{y≥t} exp(−y) y^{1+1/α}
≤ (α/t^{1/α}) max_{y≥0} exp(−y) y^{1+1/α} = cα/t^{1/α},    (47)

where cα is a constant depending on the exponent α. Eq. (47), in turn, implies that

max_{t≥l} max_{0≤u≤1} |(∂h/∂u)(u; t)| ≤ cα/l^{1/α}   (l > 0).    (48)

Substituting Eq. (48) into Eq. (45) we conclude that

max_{t≥l} |E[Hn(t)] − H(t)| ≤ (1/n) (cα/l^{1/α})   (l > 0).    (49)

Taking the limit n → ∞ in Eq. (49) proves Proposition 1.

A.2.3. Proof of Proposition 3

Fix s, t > 0. Then:

n · Cov[Hn(s), Hn(t)] = n · Cov[(1/n) Σ_{r=1}^n I(Tn(r) ≤ s), (1/n) Σ_{k=1}^n I(Tn(k) ≤ t)]    (50)

(using the properties of the covariance quadratic form Cov[·, ·], and using the independence of the random variables Tn(r)
(r = 1, . . . , n))

= (1/n) Σ_{r=1}^n Σ_{k=1}^n Cov[I(Tn(r) ≤ s), I(Tn(k) ≤ t)]
= (1/n) Σ_{r=1}^n Cov[I(Tn(r) ≤ s), I(Tn(r) ≤ t)]    (51)
= (1/n) Σ_{r=1}^n (Pr(Tn(r) ≤ s ∧ t) − Pr(Tn(r) ≤ s) Pr(Tn(r) ≤ t))

(using the fact that the random variables Tn(r) are exponentially distributed with mean 1/λn(r) (r = 1, . . . , n))

= (1/n) Σ_{r=1}^n ((1 − exp(−λn(r)(s ∧ t))) − (1 − exp(−λn(r) s))(1 − exp(−λn(r) t)))    (52)

(using Eq. (35), and using Eq. (37))

= (1/n) Σ_{r=1}^n [h(r/n; s ∧ t) − h(r/n; s) h(r/n; t)]
= (1/n) Σ_{r=1}^n [h(r/n; s + t) − h(r/n; s ∨ t)]    (53)
= (1/n) Σ_{r=1}^n h(r/n; s + t) − (1/n) Σ_{r=1}^n h(r/n; s ∨ t)

(using Eq. (42))

= E[Hn(s + t)] − E[Hn(s ∨ t)].    (54)

Hence

|n · Cov[Hn(s), Hn(t)] − (H(s + t) − H(s ∨ t))|
= |(E[Hn(s + t)] − E[Hn(s ∨ t)]) − (H(s + t) − H(s ∨ t))|    (55)
≤ |E[Hn(s + t)] − H(s + t)| + |E[Hn(s ∨ t)] − H(s ∨ t)|.

Eq. (55), in turn, implies that

max_{s,t≥l} |n · Cov[Hn(s), Hn(t)] − (H(s + t) − H(s ∨ t))|
≤ max_{s,t≥l} (|E[Hn(s + t)] − H(s + t)| + |E[Hn(s ∨ t)] − H(s ∨ t)|)
≤ max_{s,t≥l} |E[Hn(s + t)] − H(s + t)| + max_{s,t≥l} |E[Hn(s ∨ t)] − H(s ∨ t)|
≤ 2 max_{t≥l} |E[Hn(t)] − H(t)|   (l > 0).    (56)

Substituting Eq. (49) into Eq. (56) we conclude that

max_{s,t≥l} |n · Cov[Hn(s), Hn(t)] − (H(s + t) − H(s ∨ t))| ≤ (1/n) (2cα/l^{1/α})   (l > 0).    (57)

Taking the limit n → ∞ in Eq. (57) proves Proposition 3.

A.3. The FCLT

In this section of the Appendix we prove Proposition 4. The proof is split into six steps.
Step 1. Let ϕ (t ) (t ≥ 0) be an arbitrary bounded test function. Then:
∫ ∞ ∫ ∞ ′
ϕ (t ) Zn′ (t ) dt = ϕ (t ) n (Hn (t ) − H (t )) dt
√
0 0


∫ ∞ √
∫ ∞
= n ϕ (t ) Hn′ (t ) dt − n ϕ (t ) H ′ (t ) dt (58)
0 0

(using Eq. (4), and denoting by δ (·) Dirac’s ‘‘delta function’’)


 
∞ n ∞
√ √
∫ ∫
1−
= n ϕ (t ) δ (t − Tn (r )) dt − n ϕ (t ) H ′ (t ) dt
0 n r =1 0
(59)
n ∞


1 −
= √ ϕ (Tn (r )) − n ϕ (t ) H ′ (t ) dt .
n r =1 0

Eq. (59) implies that


  
∞ n ∞

[  ∫ ]  ∫ 
i −
E exp i ϕ (t ) Zn (t ) dt

= E exp √ ϕ (Tn (r )) exp −i n ϕ (t ) H ′ (t ) dt . (60)
0 n r =1 0
I. Eliazar / Physica A 390 (2011) 3189–3203 3199

The independence of the random variables Tn (r ) (r = 1, . . . , n) implies that


  
n n [  ]
i − ∏ i
E exp √ ϕ (Tn (r )) = E exp √ ϕ (Tn (r ))
n r =1 r =1
n
 ]
n [ 
− i
= exp ln E exp √ ϕ (Tn (r )) , (61)
r =1
n

and Eqs. (6) and (34) imply that

∞ ∞ 1
∂h
∫ ∫ [∫ ]
ϕ (t ) H ′ (t ) dt = ϕ (t ) (u; t ) du dt . (62)
0 0 0 ∂t
Substituting Eqs. (61) and (62) into Eq. (60) we obtain that
 ] 
∞ n
√ ∞ 1
∂h
[  ∫ ] [  ] ∫ [∫
− i
E exp i ϕ (t ) Zn (t ) dt

= exp ln E exp √ ϕ (Tn (r )) −i n ϕ (t ) (u; t ) du dt .
0 r =1
n 0 0 ∂t
(63)

Step 2.

n [  ]
− i
ln E exp √ ϕ (Tn (r )) (64)
r =1
n

(using a second-order Taylor expansion for the exponential function, and using the boundness of the test function ϕ (t ))

n   
− i 1 1
ln 1 + √ E [ϕ (Tn (r ))] − E ϕ (Tn (r ))2 + O
 
= √ (65)
r =1
n 2n n n

(using a second-order Taylor expansion for the logarithmic function)

n   
− i 1   1
√ E [ϕ (Tn (r ))] − E ϕ (Tn (r ))2 − E [ϕ (Tn (r ))]2 + O
 
= √ (66)
r =1
n 2n n n

(using some basic algebra)


 
n n

 
1− 1 −  1
E [ϕ (Tn (r ))] E ϕ (Tn (r ))2 − E [ϕ (Tn (r ))]2 + O
 
=i n − √ (67)
n r =1 2n r =1 n

Step 3.

n
1−
E [ϕ (Tn (r ))] (68)
n r =1

(using the fact that the random variables Tn (r ) are exponentially distributed with mean 1/λn (r ) (r = 1, . . . , n), and using
Eq. (36))

n ∫
1− ∞
= ϕ (t ) [λn (r ) exp (−λn (r ) t )] dt
n r =1 0
 
∫ ∞ n
1−
= ϕ (t ) λn (r ) exp (−λn (r ) t ) dt (69)
0 n r =1
 
1 − ∂h  r
∫ ∞ n 
= ϕ (t ) ;t dt .
0 n r =1 ∂ t n
3200 I. Eliazar / Physica A 390 (2011) 3189–3203

Hence
 ] 
n
∂h
1 − ∫ ∞ [∫ 1
E [ϕ (Tn (r ))] − ϕ (t ) (u; t ) du dt 
 
0 ∂t

 n r =1 0 
∫   ] 
1 − ∂h  r  ∂h
 ∞ n ∫ ∞ [∫ 1
= ϕ (t ) ; t dt − ϕ (t ) (u; t ) du dt 
 
 0 n r =1 ∂ t n 0 0 ∂ t 
 
n
∂h  r  ∂h
∫ ∞ 1 − ∫ 1 
≤ ϕ (t )  ;t − (u; t ) du dt (70)
 
0  n r =1 ∂ t n 0 ∂t 

(using Eq. (39))


∞  ∂ ∂h
∫    
1
ϕ (t ) (u; t )  dt .

≤ max  (71)
n 0 0≤u≤1 ∂ u ∂t
Now:
 ∂ ∂h
  
 −α  −2α−1
 ∂ u ∂ t (u; t )  = α t exp −u t u − α exp −u−α t u−α−1 
    
  

≤ α t exp −u−α t u−2α−1 + α exp −u−α t u−α−1


   

α
= 1+1/α exp (−y) y1+1/α (1 + y) (72)
t
where we used the change of variables y = −u−α t. Consequently
 ∂ ∂h α
  
(u; t )  = 1+1/α max exp (−y) y1+1/α (1 + y)
  
max 
0≤u≤1 ∂ u ∂t t y≥t

α cα
≤ 1+1/α max exp (−y) y1+1/α (1 + y) =
 
(73)
t y≥0 t 1+1/α
where cα is a constant depending on the exponent α . Substituting Eq. (73) into Eq. (71) we obtain that
 ] 
n
∂h ϕ (t )
1 − ∫ ∞ [∫ 1 ∫ ∞
 cα
E [ϕ (Tn (r ))] − ϕ (t ) (u; t ) du dt  ≤ dt . (74)

∂ 1+1/α

 n r =1 0 0 t  n 0 t

Thus, for bounded test functions satisfying the integrability condition 0 |ϕ (t )| t −1−1/α dt < ∞ we have
∞

n ∞ 1
∂h
∫ [∫ ]  
1− 1
E [ϕ (Tn (r ))] = ϕ (t ) (u; t ) du dt + O . (75)
n r =1 0 0 ∂t n

Eq. (75), in turn, implies that


 
√ n
√ ∞ 1
∂h
∫ [∫ ]  
1− 1
i n E [ϕ (Tn (r ))] =i n ϕ (t ) (u; t ) du dt + O √ . (76)
n r =1 0 0 ∂t n

Step 4.
n
1 − 
E ϕ (Tn (r ))2 − E [ϕ (Tn (r ))]2
 
(77)
n r =1

(using the fact that the random variables Tn (r ) are exponentially distributed with mean 1/λn (r ) (r = 1, . . . , n), and using
Eq. (36))
∫ 2 
n ∞ ∫ ∞
1−
= ϕ (t ) [λn (r ) exp (−λn (r ) t )] dt −
2
ϕ (t ) [λn (r ) exp (−λn (r ) t )] dt
n r =1 0 0
∫ ] 2 
n ∞
∂h  r  ∂h r 
[ ] ∫ ∞ [ 
1−
= ϕ (t ) 2
; t dt − ϕ (t ) ; t dt (78)
n r =1 0 ∂t n 0 ∂t n
I. Eliazar / Physica A 390 (2011) 3189–3203 3201

(using Eq. (38) with $g(u) = \int_0^\infty \varphi(t)^2\, \frac{\partial h}{\partial t}(u;t)\,dt - \left[\int_0^\infty \varphi(t)\, \frac{\partial h}{\partial t}(u;t)\,dt\right]^2$)

$$= \int_0^1 \left( \int_0^\infty \varphi(t)^2\, \frac{\partial h}{\partial t}(u;t)\,dt - \left[\int_0^\infty \varphi(t)\, \frac{\partial h}{\partial t}(u;t)\,dt\right]^2 \right) du + \delta_n, \tag{79}$$

where $\delta_n$ is an error term satisfying $\lim_{n\to\infty} \delta_n = 0$. Setting


∫ ] 2 

∂h ∂h
∫ 1
[ ] ∫ ∞ [
K (ϕ) = ϕ (t ) 2
(u; t ) dt − ϕ (t ) (u; t ) dt du, (80)
0 0 ∂t 0 ∂t

and substituting Eq. (80) into Eq. (79) we obtain that

n
1 − 
E ϕ (Tn (r ))2 − E [ϕ (Tn (r ))]2 = K (ϕ) + δn .
 
(81)
n r =1
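Eq. (81) can be sanity-checked numerically for a concrete test function. A minimal sketch, not from the paper: it takes $\varphi(t) = e^{-t}$ and assumes, consistent with the substitution of Eq. (36) and the assumed form $\partial h/\partial t(u;t) = u^{-\alpha}e^{-u^{-\alpha}t}$, that $\lambda_n(r) = (r/n)^{-\alpha}$; for $T \sim \mathrm{Exp}(\lambda)$ one then has $E[\varphi(T)^2] = \lambda/(\lambda+2)$ and $E[\varphi(T)] = \lambda/(\lambda+1)$ in closed form.

```python
ALPHA = 0.8  # illustrative exponent (assumption)

def var_term(lam):
    """Var[phi(T)] for phi(t) = exp(-t) and T ~ Exponential(lam):
    E[phi(T)^2] = lam/(lam+2), E[phi(T)] = lam/(lam+1)."""
    return lam / (lam + 2.0) - (lam / (lam + 1.0)) ** 2

def empirical_K(n):
    """Left-hand side of Eq. (81): (1/n) * sum_r Var[phi(T_n(r))],
    with the assumed rates lam_n(r) = (r/n)^(-alpha)."""
    return sum(var_term((r / n) ** -ALPHA) for r in range(1, n + 1)) / n

def K_phi(m=200_000):
    """K(phi) of Eq. (80), evaluated by midpoint quadrature over u in (0, 1)."""
    return sum(var_term(((i + 0.5) / m) ** -ALPHA) for i in range(m)) / m

# empirical_K(n) converges to K_phi() as n grows, as Eq. (81) asserts.
```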

Step 5. Substituting Eqs. (76) and (81) into Eq. (67) yields

$$\ln E\left[\exp\left(\frac{i}{\sqrt{n}}\sum_{r=1}^{n} \varphi(T_n(r))\right)\right] = i\sqrt{n}\int_0^\infty \varphi(t)\left[\int_0^1 \frac{\partial h}{\partial t}(u;t)\,du\right]dt - \frac{1}{2}\left(K(\varphi) + \delta_n\right) + O\left(\frac{1}{\sqrt{n}}\right). \tag{82}$$

Substituting Eq. (82) into Eq. (63) further yields

$$E\left[\exp\left(i\int_0^\infty \varphi(t)\, Z_n'(t)\,dt\right)\right] = \exp\left(-\frac{1}{2}\left(K(\varphi) + \delta_n\right) + O\left(\frac{1}{\sqrt{n}}\right)\right), \tag{83}$$

and hence

$$\lim_{n\to\infty} E\left[\exp\left(i\int_0^\infty \varphi(t)\, Z_n'(t)\,dt\right)\right] = \exp\left(-\frac{1}{2}K(\varphi)\right). \tag{84}$$

On the other hand, changing the order of integration in Eq. (80) gives

$$K(\varphi) = \int_0^\infty \varphi(t)^2 \left[\int_0^1 \frac{\partial h}{\partial t}(u;t)\,du\right]dt - \int_0^\infty \int_0^\infty \varphi(s)\,\varphi(t) \left[\int_0^1 \frac{\partial h}{\partial t}(u;s)\, \frac{\partial h}{\partial t}(u;t)\,du\right]dt\,ds. \tag{85}$$

In the analysis we used bounded test functions $\varphi(t)$ satisfying the integrability condition $\int_0^\infty |\varphi(t)|\, t^{-1-1/\alpha}\,dt < \infty$. Yet the right-hand side of Eq. (84) holds valid for all test functions $\varphi(t)$ for which $K(\varphi)$ is well defined. Thus, in effect, Eq. (84) can be extended to all test functions $\varphi(t)$ for which $K(\varphi)$ is well defined.
Step 6. Let $0 < t_1 < \cdots < t_m < \infty$ be an arbitrary sequence of increasing time epochs, and let $\theta_1, \ldots, \theta_m$ be arbitrary real numbers. Note that

$$\sum_{j=1}^{m} \theta_j Z_n(t_j) = \sum_{j=1}^{m} \theta_j \int_0^\infty I\left(t \leq t_j\right) Z_n'(t)\,dt = \int_0^\infty \left[\sum_{j=1}^{m} \theta_j I\left(t \leq t_j\right)\right] Z_n'(t)\,dt, \tag{86}$$

and set

$$\varphi(t) = \sum_{j=1}^{m} \theta_j I\left(t \leq t_j\right) \qquad (t \geq 0). \tag{87}$$

Eq. (84) implies that for the test function $\varphi(t)$ given by Eq. (87) we have

$$\lim_{n\to\infty} E\left[\exp\left(i \sum_{j=1}^{m} \theta_j Z_n(t_j)\right)\right] = \exp\left(-\frac{1}{2}K(\varphi)\right), \tag{88}$$

provided that $K(\varphi)$ is well defined.



Now, for the test function $\varphi(t)$ given by Eq. (87) note that

$$\begin{aligned}
\int_0^\infty \varphi(t)^2 \left[\int_0^1 \frac{\partial h}{\partial t}(u;t)\,du\right]dt
&= \int_0^\infty \left[\sum_{j=1}^{m} \theta_j I\left(t \leq t_j\right)\right]^2 \left[\int_0^1 \frac{\partial h}{\partial t}(u;t)\,du\right]dt \\
&= \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^\infty I\left(t \leq t_j \wedge t_k\right) \left[\int_0^1 \frac{\partial h}{\partial t}(u;t)\,du\right]dt \\
&= \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^1 \left[\int_0^\infty I\left(t \leq t_j \wedge t_k\right) \frac{\partial h}{\partial t}(u;t)\,dt\right]du \\
&= \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^1 h\left(u; t_j \wedge t_k\right) du,
\end{aligned} \tag{89}$$

and further note that

$$\begin{aligned}
&\int_0^\infty \int_0^\infty \varphi(s)\,\varphi(t) \left[\int_0^1 \frac{\partial h}{\partial t}(u;s)\, \frac{\partial h}{\partial t}(u;t)\,du\right]dt\,ds \\
&\quad = \int_0^\infty \int_0^\infty \left[\sum_{j=1}^{m} \theta_j I\left(s \leq t_j\right)\right] \left[\sum_{k=1}^{m} \theta_k I\left(t \leq t_k\right)\right] \left[\int_0^1 \frac{\partial h}{\partial t}(u;s)\, \frac{\partial h}{\partial t}(u;t)\,du\right]dt\,ds \\
&\quad = \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^1 \left[\int_0^\infty I\left(s \leq t_j\right) \frac{\partial h}{\partial t}(u;s)\,ds\right] \left[\int_0^\infty I\left(t \leq t_k\right) \frac{\partial h}{\partial t}(u;t)\,dt\right]du \\
&\quad = \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^1 h\left(u; t_j\right) h\left(u; t_k\right) du.
\end{aligned} \tag{90}$$

Substituting Eqs. (89) and (90) into Eq. (85) yields

$$K(\varphi) = \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^1 \left[ h\left(u; t_j \wedge t_k\right) - h\left(u; t_j\right) h\left(u; t_k\right) \right] du \tag{91}$$

(using Eq. (37))

$$= \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \int_0^1 \left[ h\left(u; t_j + t_k\right) - h\left(u; t_j \vee t_k\right) \right] du \tag{92}$$

(using Eqs. (6) and (34))

$$= \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \left[ H\left(t_j + t_k\right) - H\left(t_j \vee t_k\right) \right]. \tag{93}$$

Thus, $K(\varphi)$ is indeed well defined for the test function $\varphi(t)$ given by Eq. (87), and substituting Eq. (93) into Eq. (88) we conclude that

$$\lim_{n\to\infty} E\left[\exp\left(i \sum_{j=1}^{m} \theta_j Z_n(t_j)\right)\right] = \exp\left(-\frac{1}{2} \sum_{j=1}^{m} \sum_{k=1}^{m} \theta_j \theta_k \left[ H\left(t_j + t_k\right) - H\left(t_j \vee t_k\right) \right]\right). \tag{94}$$

The Fourier transform appearing on the right-hand side of Eq. (94) characterizes a zero-mean Gaussian process $Z = (Z(t))_{t \geq 0}$ with covariance $\mathrm{Cov}[Z(s), Z(t)] = H(s+t) - H(s \vee t)$ ($s, t \geq 0$). Hence, Eq. (94) proves Proposition 4.
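The algebraic step from Eq. (91) to Eq. (92), which underlies the limiting covariance $H(s+t) - H(s \vee t)$, can be verified pointwise in $u$. A minimal sketch, not from the paper: it assumes $h(u;t) = 1 - \exp(-u^{-\alpha}t)$, consistent with the derivative algebra of Eq. (72), and uses min/max for the $\wedge$/$\vee$ operations; $\alpha = 0.8$ is illustrative.

```python
import math, random

ALPHA = 0.8  # illustrative exponent (assumption)

def h(u, t):
    """h(u; t) = 1 - exp(-u^(-alpha) * t), the assumed form of Eq. (34)."""
    return 1.0 - math.exp(-(u ** -ALPHA) * t)

# Identity behind Eqs. (91)-(92):
#   h(u; s ^ t) - h(u; s) h(u; t) == h(u; s + t) - h(u; s v t),
# which, integrated over u, gives Cov[Z(s), Z(t)] = H(s + t) - H(s v t).
random.seed(1)
max_err = max(
    abs((h(u, min(s, t)) - h(u, s) * h(u, t)) - (h(u, s + t) - h(u, max(s, t))))
    for _ in range(10_000)
    for u, s, t in [(random.uniform(0.05, 1.0),
                     random.uniform(0.0, 5.0),
                     random.uniform(0.0, 5.0))]
)
```

The residual is at floating-point level, reflecting that the identity is exact for exponentials: $e^{-as} + e^{-at} = e^{-a(s \wedge t)} + e^{-a(s \vee t)}$.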

References

[1] G.K. Zipf, Human Behavior and the Principle of Least Effort, Addison-Wesley, Cambridge, 1949.
[2] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, Boston, 1978.
[3] C. Cattuto, et al., Proc. Natl. Acad. Sci. 106 (2009) 10511.
[4] R. Baeza-Yates, G. Navarro, J. Amer. Soc. Inform. Sci. 51 (2000) 69.
[5] D.C. van Leijenhorst, Th.P. van der Weide, Inform. Sci. 170 (2005) 263.
[6] M.A. Serrano, A. Flammini, F. Menczer, PLoS ONE 4 (2009) e5372.
[7] L. Lu, Z.K. Zhang, T. Zhou, PLoS ONE 5 (2010) e14139.
[8] F. Auerbach, Petermanns Geographische Mitteilungen 59 (1913) 74.
[9] J.B. Estoup, Gammes Stenographiques, Institut Stenographique de France, Paris, 1916.
[10] A.J. Lotka, J. Washington Acad. Sci. 16 (1926) 317.

[11] M.E.J. Newman, Contemp. Phys. 46 (2005) 323.
[12] M. Mitzenmacher, Internet Math. 1 (2004) 226.
[13] R. Albert, A.L. Barabási, Rev. Mod. Phys. 74 (2002) 47.
[14] R. Cohen, S. Havlin, Complex Networks: Structure, Robustness and Function, Cambridge University Press, Cambridge, 2010.
[15] H. Hu, D. Han, X. Wang, Physica A 389 (2010) 1065.
[16] http://www.nslij-genetics.org/wli/zipf/.
[17] P. Bak, How Nature Works: The Science of Self Organized Criticality, Copernicus, New York, 1996.
[18] I. Eliazar, Limit laws for Zipf’s law, J. Phys. A 44 (2011) 022001.
[19] J.F.C. Kingman, Poisson Processes, Oxford University Press, Oxford, 1993.
[20] http://plato.stanford.edu/entries/types-tokens/.
[21] W. Feller, An Introduction to Probability Theory and its Applications, Vol. I, Wiley, New York, 1957.
[22] L.E. Baum, P. Billingsley, Ann. Math. Stat. 36 (1965) 1835.
[23] S. Bernhardsson, L. Rocha, P. Minnhagen, New J. Phys. 11 (2009) 123015.
[24] M.D. Donsker, Mem. Amer. Math. Soc. 6 (1951) 1.