Physica A 578 (2021) 126126

Contents lists available at ScienceDirect

Physica A
journal homepage: www.elsevier.com/locate/physa

The q-exponentials do not maximize the Rényi entropy



Thomas Oikonomou a,1,∗, Konstantinos Kaloudis b,1, G. Baris Bagci c,1

a College of Engineering and Computer Science, VinUniversity, Hanoi 10000, Vietnam
b Department of Mathematics, Nazarbayev University, Nur-Sultan 010000, Kazakhstan
c Department of Physics, Mersin University, Mersin 33343, Turkey

article info

Article history:
Received 20 August 2020
Received in revised form 17 April 2021
Available online 24 May 2021

Keywords:
Rényi entropy
q-exponentials
Shannon entropy
MaxEnt
Optimum distribution
Estimators
Estimation error

abstract

It is generally assumed that the Rényi entropy is maximized by the q-exponentials and is hence useful to construct a generalized statistical mechanics. However, to the best of our knowledge, this assumption has never been explicitly checked. In this work, we consider the Rényi entropy with the linear and escort mean value constraints and check whether it is indeed maximized by q-exponentials. We show, both theoretically and numerically, that the Rényi entropy yields erroneous inferences concerning the optimum distributions of the q-exponential form and moreover exhibits high estimation errors in the regime of long range correlations. Finally, we note that the Shannon entropy successfully detects the power law distributions when the logarithmic mean value constraint is used.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

There have recently been various proposals of generalized entropies in the literature in order to extend the description of complex, out-of-equilibrium systems within statistical mechanics and thermodynamics [1]. Despite this multitude, their common claim is twofold, i.e., the ability to detect power law distributions and to construct a generalized statistical mechanics in which the power law distributions maximize the particular generalized entropy under scrutiny. Both of these claims generally rely on the Maximum Entropy (MaxEnt) Principle.
MaxEnt is an inference procedure aiming to provide an estimate of the distribution ρ†(x ∈ D), assuming that the observed data set D = {x₁, . . . , xₙ} consists of the realizations of the continuous (or discrete) i.i.d. random variables Xᵢ ∼ ρ†(·). The index i runs over the observed data set. According to this principle, the least biased (also called posterior) distribution ρ̂(x ∈ D) towards ρ†(x ∈ D) is the one that maximizes the relative-entropy measure H[ρ, m] given the information extracted from the sample D in terms of J + 1 linearly independent constraints, i.e.,

\hat{\rho} := \arg\max_{\rho \in I} H[\rho, m] ,    (1)

where I = {ρ ∈ S : I_j = 0, j = 0, 1, . . . , J} (S is a nonempty subset of a real linear space). m(x ∈ D) is a function that guarantees the invariance of H[ρ, m] under coordinate transformations and provides an origin of measurement for H[ρ, m]. It is proportional to the "limiting density of discrete points" and has been named the "invariant measure" by Jaynes [2].

∗ Corresponding author.
E-mail address: thomas.o@vinuni.edu.vn (T. Oikonomou).
1 All authors have equally contributed.

https://doi.org/10.1016/j.physa.2021.126126
0378-4371/© 2021 Elsevier B.V. All rights reserved.

Without loss of generality, it will be set to unity throughout the present manuscript. The constraints are commonly of
the form

I_0[\rho] = \int_D \rho(x)\,dx - 1 ,    (2a)

I_j[\rho] = \int_D f_j(x)\rho(x)\,dx - \langle f_j(x)\rangle = E\{f_j(X)\} - \frac{1}{n}\sum_{i=1}^{n} f_j(x_i) ,    (2b)
where ⟨f_j(x)⟩ is defined as the empirical average of the real-valued function f_j (also called a "feature" of the data under scrutiny [3]) determined from the observed data D, and it acts as the estimator of E{f_j(X)}. In fact, such constraints guarantee the accordance of the empirical averages with the expectations taken with respect to the posterior distribution.
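As a concrete illustration of Eq. (2b) (ours, not part of the original text; the data set and the two features below are placeholders), the empirical averages ⟨f_j(x)⟩ that estimate E{f_j(X)} are computed directly from the sample. A minimal Python sketch:

    import numpy as np

    # Placeholder data set D = {x_1, ..., x_n}; any observed sample would do.
    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1000)

    # Two illustrative features f_j ("features" of the data, cf. Ref. [3]).
    features = {"f1(x) = x": lambda v: v,
                "f2(x) = ln(1 + x)": lambda v: np.log1p(v)}

    # Empirical averages <f_j(x)> of Eq. (2b), estimating E{f_j(X)}.
    for name, f in features.items():
        print(name, f(x).mean())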
Technically, Eqs. (1)–(2) are considered as an optimization problem and expressed by virtue of the Lagrange Multipliers
(LM) method with the Lagrangian
L(\rho, \alpha, \beta) = H[\rho] - \alpha I_0[\rho] - \sum_{j=1}^{J} \beta_j I_j[\rho] .    (3)

In this context H[ρ] is called the objective function of the problem. Assuming it is of the form H[\rho] = G\left(\int_D F(\rho(x))\,dx\right), then δL = 0 leads to the following optimization conditions
G' \, \frac{\partial F(\rho(x))}{\partial \rho(x)} - \alpha - \sum_{j=1}^{J} \beta_j f_j(x) = 0 , \qquad \text{and} \qquad I_j[\rho]\Big|_{j=0,1,\ldots,J} = 0 ,    (4)
j=1

where α and β = {β₁, . . . , β_J} are the LMs. The first condition determines the structure of the posterior distribution, denoted by ρ*(x, α, β), while the remaining J + 1 conditions serve to calculate the optimum LM values, i.e., the LM estimators denoted by {α̂, β̂}. Substituting the LM estimators into ρ*(x, α, β), one finally determines the posterior distribution as ρ̂(x) = ρ*(x, α̂, β̂). This solution is optimum, that is, it gives the largest possible value of the entropy given that the constraints are fulfilled, if and only if the saddle point property of the Lagrangian is satisfied [4] (p. 238),

\min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^J} L_d(\alpha, \beta) = \max_{\rho \in S} \; \min_{\alpha \in \mathbb{R},\, \beta \in \mathbb{R}^J} L(\rho, \alpha, \beta) ,    (5)

where L_d(α, β) := max_{ρ∈S} L(ρ, α, β) is the so-called dual Lagrangian function, which is always convex with respect to the LMs (see p. 238 in Ref. [4]). For a concave optimization with equality constraints that are satisfied by ρ ∈ S, Eq. (5) always holds [4] (p. 226). However, a crucial issue needs to be clarified at this point. Namely, the former solution may not maximize the objective functional itself [5] (p. 238). When this is the case, ρ̂ is the optimum but not the maximum solution, which is what is originally required in Eq. (1). It is then possible that the inferences drawn from Eq. (4) are biased and thus fail to provide the best estimator of ρ†. This distinction is also of pivotal importance for physical theories in which H[ρ] is identified with the thermodynamic entropy, since, according to the 2nd Law of Thermodynamics, the latter has to exhibit a unique maximum at the steady state.
For a self-consistent optimization procedure, H[ρ] has to satisfy the Shore–Johnson (SJ) axioms [6], which, in turn, uniquely lead to the Shannon entropy
H_S[\rho] = \int_D \rho(x) \ln\!\left(\frac{1}{\rho(x)}\right) dx .    (6)

The Shannon entropy is also in agreement with the maximality required in Eq. (1). Indeed, after substituting ρ*(x, α̂, β) into H_S, so that H_S[ρ] → H_S[ρ*] =: H_S*(β, E{f_j}), and taking the constraints into account, H_S*(β, E{f_j}) → H_S*(β, ⟨f_j⟩) =: Ĥ_S(β), the former entropy exhibits exactly the same expression as the respective dual Lagrangian L_d(β) [7], so that (∂L_d(β)/∂β_j) = 0 = (∂Ĥ_S(β)/∂β_j) [8,9]. This means that the solution ρ̂ obtained from the LM method is not only the entropic optimum but also the entropic maximum.
Within physics, however, a general drawback of the LM method is the arbitrariness of the physical units of the features f_j(X), e.g., f(X) = ln(X), leading to a disconnection between H_S[ρ] and the thermodynamic entropy. For example, choosing the random variable X as the microstate energy does not fulfill the expected unit of the internal energy when the aforementioned logarithmic feature is used. To confront this issue, different types of entropic functionals have been proposed within MaxEnt that lead to non-exponential distributions with linear random functions f_j(X) = X. One of these functionals – which preserves the additivity of the Shannon functional when considering independent events – is the Rényi entropy [10]
H_q[\rho] = \frac{1}{1-q} \ln\!\left(\int_D [\rho(x)]^q \, dx\right) .    (7)

The Rényi entropy is known to be concave for q ∈ (0, 1] and quasi-concave for q > 1 [11]. However, it has only recently been realized that the MaxEnt inference procedure based on the Rényi entropy might violate the Shore–Johnson subset and system independence axioms [12–14]. Similar results were obtained for the Tsallis entropy, revealing the violation of the system independence axiom [15]. Recent studies on the axiomatic structure have shown that even a more restricted set of axioms leads to the functional in Eq. (6) [16–18]. As a result, the MaxEnt inference recipe used together with the Rényi entropy may generally lead to erroneous results. Simply put, it might fail to correctly estimate the parameter values of the real distribution ρ†.
In this work, we consider in advance a true distribution of the Q-exponential form

\rho^{\dagger}(x) = Q\Lambda \left[1 - (Q-1)\Lambda x\right]^{\frac{1}{Q-1}} ,    (8)

where x ∈ D = [0, ∞) with Λ > 0. The normalization of ρ† requires Q ∈ (0, 1]. Within statistical mechanics, the former distribution is known as the Q-exponential, while it is referred to as the 2-parameter generalized Pareto distribution in the statistics literature [19,20]. Then, considering a Lagrangian with the normalization and expectation value constraints (i.e. two LMs, α and β_j = β), we show the following two statements. First, we theoretically show that the Rényi entropy with linear constraints is optimized but not maximized by the distribution derived in Eq. (4). Second, we numerically explore the efficiency of the Rényi MaxEnt inference and show that it generally fails to yield correct results in the entire Q-domain. A similar situation holds for the Rényi MaxEnt with escort constraints. Our theoretical claims are verified by the numerical simulations. In all cases, we also consider the Shannon entropy for comparison.
The paper is organized as follows: In the next section, we calculate the posterior distribution of the q-exponential form
with MaxEnt based on the Shannon entropy subject to the logarithmic mean value constraint. In Sections 3 and 4, we
repeat this procedure with the Rényi entropy adopting linear and escort mean value constraints, respectively. We show
in both versions of the Rényi MaxEnt formalism, in contrast to the Shannon case, that the optimum solution differs from
the maximum one. In Section 5, we present our numerical results and demonstrate that the optimum estimators obtained
within the Rényi MaxEnt in Eq. (5) indeed fail to correctly infer ρ † . A discussion explaining the observed results follows.
Finally, concluding remarks are presented in Section 6.

2. Shannon MaxEnt optimizing q-exponentials

In this section, we aim to determine the optimum distribution estimate ρ̂S of the true distribution ρ † in Eq. (8) using the
Shannon MaxEnt, and prove that this is also the maximum distribution. To this aim, we consider the following auxiliary
functional
L = H_S[\rho] - \alpha \left(\int_0^{\infty} \rho(x)\,dx - 1\right) - \beta \left(\int_0^{\infty} \rho(x) f(x, q, \lambda)\,dx - \langle f(x, q, \lambda)\rangle\right)    (9)

with the additional parameters {q, λ} and the logarithmic feature [20]

f(X, q, \lambda) = \ln\!\left(1 - (q-1)\lambda X\right) .    (10)

The LM method yields the following optimization conditions

\frac{\delta L}{\delta \rho} = 0 \;\Rightarrow\; 0 = -\ln \rho(x) - 1 - \alpha - \beta f(x, q, \lambda) ,    (11a)

\frac{\partial L}{\partial \alpha} = 0 \;\Rightarrow\; 0 = \int_0^{\infty} \rho(x)\,dx - 1 ,    (11b)

\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; 0 = \int_0^{\infty} \rho(x) f(x, q, \lambda)\,dx - \langle f(x, q, \lambda)\rangle .    (11c)

We determine the Shannon optimum distribution structure from Eq. (11a) as

\rho_S^{*}(x, \alpha, \beta, q, \lambda) = e^{-1-\alpha} \left[1 - (q-1)\lambda x\right]^{-\beta}    (12)

with β > 1, q ∈ (0, 1) and λ > 0. Using Eq. (11b), the normalization multiplier can be expressed in terms of the remaining three parameters, so that

\hat{\alpha} = \hat{\alpha}(\beta, q, \lambda) = -1 - \ln\!\left[(\beta - 1)(1-q)\lambda\right] .    (13)

The mean value constraint in Eq. (11c) then yields the estimator

\frac{1}{\hat{\beta} - 1} = \langle f(x, q, \lambda)\rangle .    (14)
To be able to determine the estimators of the parameters λ and q, we now apply the Parameter-Space Expansion (PSE) method within MaxEnt [8,9], according to which both the LMs and the additional parameters are calculated as follows:

One directly substitutes the normalized distribution ρ*(x, α̂, β, q, λ) into the Shannon entropy, taking the constraints into account, to obtain [8,9]

\hat{H}_S(\beta, \lambda, q) = \hat{\alpha}(\beta, \lambda, q) + 1 + \beta \langle f(x, q, \lambda)\rangle .    (15)

Then, the estimators of {β, λ, q} are determined from the conditions

\frac{\partial \hat{H}_S}{\partial \beta} = 0 , \qquad \frac{\partial \hat{H}_S}{\partial \lambda} = 0 , \qquad \frac{\partial \hat{H}_S}{\partial q} = 0 .    (16)
We note that ⟨f(x, q, λ)⟩ is completely determined by the data and thus is independent of the Lagrange multiplier β. Interestingly enough, the entropy expression in Eq. (15) is identical to the expression of the respective dual Lagrangian function L_d [7] of Eq. (9), so that Eq. (16) can be written in terms of L_d:
\frac{\partial L_d}{\partial \beta} = 0 \;\Rightarrow\; \frac{\partial \hat{\alpha}}{\partial \beta} = -\langle f(x, q, \lambda)\rangle ,    (17a)

\frac{\partial L_d}{\partial \lambda} = 0 \;\Rightarrow\; \frac{\partial \hat{\alpha}}{\partial \lambda} = -\beta \frac{\partial}{\partial \lambda}\langle f(x, q, \lambda)\rangle = -\beta \left\langle \frac{\partial}{\partial \lambda} f(x, q, \lambda)\right\rangle ,    (17b)

\frac{\partial L_d}{\partial q} = 0 \;\Rightarrow\; \frac{\partial \hat{\alpha}}{\partial q} = -\beta \frac{\partial}{\partial q}\langle f(x, q, \lambda)\rangle = -\beta \left\langle \frac{\partial}{\partial q} f(x, q, \lambda)\right\rangle .    (17c)
The derivatives of α̂ on the l.h.s. can be explicitly calculated from Eq. (13). After substituting them into Eq. (17), we observe that Eq. (17a) is the same as Eq. (14), as expected. Accordingly, the PSE method in the Shannon case is nothing but the dual Lagrange formalism applied also to the additional parameters {q, λ}. The important point here is that, due to L_d = Ĥ_S, the estimators are calculated directly from the entropy extremum. In other words, in the Shannon case, the optimum solution ρ̂_S(x) is also the maximum solution, in complete agreement with Eq. (1).
Proceeding further, we observe that Eqs. (17b) and (17c) are identical. Thus, we have two equations with three unknowns to solve. Then, at the maximum state we can fix β at the value 1/(1−q), so that the distribution ρ_S* takes the form of interest in Eq. (8), i.e.

\rho_S^{*}(x, q, \lambda) = q\lambda \left[1 - (q-1)\lambda x\right]^{\frac{1}{q-1}} .    (18)

Next, we see that the first two conditions in Eq. (17) yield the implicitly-defined estimators for q and λ as

\frac{1-\hat{q}}{\hat{q}} = \frac{1}{n}\sum_{i=1}^{n} \ln\!\left(1 - (\hat{q}-1)\lambda x_i\right) ,    (19a)

\frac{1}{\hat{\lambda}} = \frac{1}{n}\sum_{i=1}^{n} \frac{x_i}{1 - (q-1)\hat{\lambda} x_i} .    (19b)

Assuming that neither q nor λ is known (which is generally the case), we set q = q̂ and λ = λ̂ and solve the 2 × 2 system in Eq. (19). Interestingly enough, the MaxEnt estimators coincide with the associated Maximum Likelihood Estimators (MLE) for the case under study, namely the q-exponential distribution [19,20], which is commonly referred to as the Type-II (or 2-parameter) generalized Pareto distribution in the statistics literature [21,22]. This equivalence implies that the MaxEnt estimators are consistent and asymptotically normal, as is the MLE. Moreover, the performance of the MLE has been extensively studied with respect to other estimators, e.g. for the cases of both the 2-parameter (q-exponential) [23] and the 3-parameter [24] generalized Pareto distributions. Of course, this is not the only case in which MaxEnt and the MLE yield the same results. In fact, this is typical for Gibbs distributions, namely distributions that belong to the exponential family, with MaxEnt constraints (features) based on the associated sufficient statistics, or e.g. for the Kappa distribution, of which the generalized Pareto is a special case [25–28]. However, one then has to carefully consider the proper reparametrization in order to ensure the equivalence of the estimators. For the sake of completeness, we note that there have been various proposals in the literature on how to obtain a power-law maximum distribution from Shannon MaxEnt by using different types of constraints [29–31]. On the other hand, a thorough derivation and analysis of the MLE is lacking in these works.
Finally, comparing the result of the Shannon MaxEnt in Eqs. (18)–(19) with the Q-exponential distribution ρ† in Eq. (8), we read that ρ̂_S = ρ_S*(x, q̂, λ̂) is the estimated distribution of ρ†(x, Q, Λ) with the parameter estimates

\hat{Q}_S = \hat{q} , \qquad \hat{\Lambda}_S = \hat{\lambda} ,    (20)

so that the full rendering of the q-exponential through Shannon MaxEnt is consistently possible.
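In practice, the 2 × 2 system in Eq. (19) has to be solved numerically. A minimal Python sketch of such a solver is given below (our own illustration with hypothetical function names, not the authors' published Julia code [43]); it sets q = q̂ and λ = λ̂ in Eq. (19) and passes the resulting pair of equations to a standard root finder:

    import numpy as np
    from scipy.optimize import fsolve

    def shannon_maxent_fit(x, q0=0.8, lam0=1.0):
        """Solve the 2x2 system of Eq. (19) for (q_hat, lambda_hat)."""
        x = np.asarray(x, dtype=float)

        def equations(p):
            q, lam = p
            u = 1.0 + (1.0 - q) * lam * x               # u_i = 1 - (q-1)*lam*x_i > 0
            eq_a = (1.0 - q) / q - np.mean(np.log(u))   # Eq. (19a)
            eq_b = 1.0 / lam - np.mean(x / u)           # Eq. (19b)
            return [eq_a, eq_b]

        q_hat, lam_hat = fsolve(equations, [q0, lam0])
        return q_hat, lam_hat   # these are (Q_S, Lambda_S) by Eq. (20)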

3. Rényi MaxEnt with linear constraints

We now consider the Rényi entropy in Eq. (7) with support D = [0, ∞), so that the functional to be optimized reads

L = H_q[\rho] - \alpha \left[\int_0^{\infty} \rho(x)\,dx - 1\right] - \beta \left[\int_0^{\infty} x\rho(x)\,dx - \langle x\rangle\right]    (21)

with the feature f(X) = X, so that ⟨f(x)⟩ = ⟨x⟩. Before proceeding further, we remark here a fundamental difference between the Shannon and Rényi MaxEnt updating procedures. According to the Shore–Johnson formalism, the update is encoded in the choice of the constraints: the more information we have about the system, the more constraints we consider in Eq. (21). In the Shannon picture, the parameter q is included in the constraints, thus the update gives information about q too, leading to its MaxEnt estimator q̂. In the Rényi picture, on the other hand, the constraints do not include information about q, since the feature is independent of q, i.e., f(X, q) = X. Therefore, the estimator of q is not determined within MaxEnt but in advance, by fitting the slope of the data cumulative distribution, so that one overcomes the problem of choosing the optimal number of bins [32]. To distinguish it from the q-estimator q̂ obtained within MaxEnt, we simply denote it by q. The optimization conditions of Eq. (21) now read
\frac{\delta L}{\delta \rho} = 0 \;\Rightarrow\; [\rho(x)]^{q-1} = \left(\frac{1-q}{q}\right) \alpha \left(1 + \frac{\beta}{\alpha} x\right) \int_0^{\infty} [\rho(x')]^q \, dx' ,    (22a)

\frac{\partial L}{\partial \alpha} = 0 \;\Rightarrow\; 0 = \int_0^{\infty} \rho(x)\,dx - 1 ,    (22b)

\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; 0 = \int_0^{\infty} x\rho(x)\,dx - \langle x\rangle .    (22c)

From Eq. (22a), we determine the Rényi maximum distribution structure as

\rho_{R,\mathrm{lin}}^{*}(x, \alpha, \beta) = \frac{q\beta}{2q-1} \left(\frac{2q-1}{(1-q)\alpha}\right)^{2} \left(1 + \frac{\beta}{\alpha} x\right)^{\frac{1}{q-1}} ,    (23)
from which we calculate the integrals of the constraints as

\int_0^{\infty} \rho_{R,\mathrm{lin}}^{*}(x, \alpha, \beta)\,dx = \frac{2q-1}{(1-q)\alpha} ,    (24a)

\int_0^{\infty} x \rho_{R,\mathrm{lin}}^{*}(x, \alpha, \beta)\,dx = \frac{1}{\beta} ,    (24b)
with q ∈ (1/2, 1). Comparing then Eqs. (22b)–(22c) with Eqs. (24a)–(24b), we obtain the optimal values recovered by the LM method as

\hat{\alpha} = \frac{2q-1}{1-q} , \qquad \hat{\beta} = \frac{1}{\langle x\rangle} .    (25)
Note that α̂ is not an estimator in the statistical sense, as it does not explicitly depend on the observed sample. The equation above leads to the concomitant Rényi posterior distribution

\hat{\rho}_{R,\mathrm{lin}}(x) = \rho_{R,\mathrm{lin}}^{*}(x, \hat{\alpha}, \hat{\beta}) .    (26)

Comparing it with ρ† in Eq. (8), we see that the Rényi estimators of the parameters Q and Λ are given by

\hat{Q}_{R,\mathrm{lin}} = q , \qquad \hat{\Lambda}_{R,\mathrm{lin}} = \frac{\hat{\beta}}{2q-1} = \frac{1}{(2\hat{Q}_{R,\mathrm{lin}}-1)\langle x\rangle} .    (27)
Similarly to the Shannon case, we will now apply the dual formalism to check whether the optimal LM values in Eq. (25) are actually determined from the extremum of the Rényi entropy. Thus, substituting ρ*(x, α̂, β) into L in Eq. (21) and into H_q[ρ], while taking into account the mean value constraint, we obtain

L_d(\beta) = H_q^{*}(\beta) - 1 + \beta\langle x\rangle , \qquad H_q^{*}(\beta) = \frac{q}{1-q} \ln\!\left(\frac{q}{2q-1}\right) - \ln(\beta) ,    (28a)

\hat{H}_q(\beta) = -\ln\!\left(\frac{q}{2q-1}\right) + \frac{1}{1-q} \ln\!\left(1 + \frac{1-q}{2q-1}\,\beta\langle x\rangle\right) - \ln(\beta) .    (28b)
It becomes obvious that these two functions are not equal to each other, leading also to different minima. The minimization of L_d yields the result in Eq. (25), as expected, while ∂Ĥ_q(β)/∂β = 0 yields

\hat{\beta}' = \frac{2q-1}{q\langle x\rangle} .    (29)

Fig. 1. (a) Plot of Ld (β ) (green solid line) and Ĥq (β ) (black solid line) in Eq. (28) with respect to the Lagrange multiplier β , for q = 0.6 and ⟨x⟩ = 5.
(b) Plot of Hq∗ (β ) in Eq. (28a) with respect to β .

Accordingly, we see that the optimal value β̂ in Eq. (25), in contrast to the Shannon case, differs from the estimator obtained from the extremum of the Rényi entropy itself, i.e., β̂′ in Eq. (29).
In order to explore which estimator, Eq. (29) or Eq. (25), corresponds to a higher entropy value H_q*(β), we first plot in Fig. 1(a) the functions L_d(β) (green solid line) and Ĥ_q(β) (black solid line) for the randomly chosen values q = 0.6 and ⟨x⟩ = 5, which would otherwise be extracted from the data set itself. We clearly see that β̂′ (red dotted–dashed) is distinct from β̂ (blue dashed), corresponding also to a lower minimum point. Regarding the entropy H_q*(β) itself, we can see in Fig. 1(b) that the former estimator corresponds to a higher entropy value, as expected. In other words, the estimators in Eq. (25) optimize but do not maximize the Rényi entropy.
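The disagreement between Eqs. (25) and (29) can be reproduced in a few lines. The Python sketch below (our illustration, using the same values q = 0.6 and ⟨x⟩ = 5 as Fig. 1) minimizes the two functions of Eq. (28) numerically and recovers the two distinct minimizers:

    import numpy as np
    from scipy.optimize import minimize_scalar

    q, x_mean = 0.6, 5.0                      # values used in Fig. 1

    def L_dual(b):                            # Eq. (28a)
        Hq_star = q / (1.0 - q) * np.log(q / (2.0*q - 1.0)) - np.log(b)
        return Hq_star - 1.0 + b * x_mean

    def H_hat(b):                             # Eq. (28b)
        return (-np.log(q / (2.0*q - 1.0))
                + np.log1p((1.0 - q) / (2.0*q - 1.0) * b * x_mean) / (1.0 - q)
                - np.log(b))

    b_dual = minimize_scalar(L_dual, bounds=(1e-6, 10.0), method="bounded").x
    b_ent = minimize_scalar(H_hat, bounds=(1e-6, 10.0), method="bounded").x

    print(b_dual, 1.0 / x_mean)                  # Eq. (25): beta_hat = 0.2
    print(b_ent, (2.0*q - 1.0) / (q * x_mean))   # Eq. (29): beta_hat' = 1/15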

4. Rényi MaxEnt with escort constraints

An alternative optimization of the Rényi entropy is based on the expectation value involving the escort distribution P_q(x) = [\rho(x)]^q \big/ \int_0^{\infty} [\rho(x)]^q \, dx, which yields the functional

L = H_q[\rho] - \alpha \left[\int_0^{\infty} \rho(x)\,dx - 1\right] - \beta \left[\int_0^{\infty} x P_q(x)\,dx - \langle x\rangle\right] .    (30)

The optimization conditions given by Eq. (30) now read

\frac{\delta L}{\delta \rho} = 0 \;\Rightarrow\; [\rho(x)]^{q-1} \left[1 - (1-q)\beta\left(x - \langle x\rangle\right)\right] = \frac{(1-q)\alpha}{q} \int_0^{\infty} [\rho(x')]^q \, dx' ,    (31a)

\frac{\partial L}{\partial \alpha} = 0 \;\Rightarrow\; 0 = \int_0^{\infty} \rho(x)\,dx - 1 ,    (31b)

\frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; 0 = \int_0^{\infty} x P_q(x)\,dx - \langle x\rangle .    (31c)

A quick inspection of Eq. (30) seems to suggest that the parameter q is now estimated within the MaxEnt formalism, since the mean value constraint in Eq. (30), in contrast to Eq. (21), involves the parameter q. This is however not so, due to the fact that q is not part of the data feature f(X, q) = X, and accordingly there is no q-dependence in the estimator ⟨f(x, q)⟩ = ⟨x⟩ = (1/n) Σⁿᵢ₌₁ xᵢ. Put differently, estimating the probability pᵢ with the relative frequency p̂ᵢ = nᵢ/n, one can write the estimator ⟨x⟩ of the escort expectation value E_q{X} as

\langle x\rangle = \sum_{i=1}^{n} \frac{n_i^{q}}{\sum_{j=1}^{n} n_j^{q}} \, x_i .

Since there are no repetitions in the data xᵢ ∈ D, we have nᵢ = 1, so that ⟨x⟩ = (1/n) Σⁿᵢ₌₁ xᵢ. In other words, the estimation of q lies outside the scope of MaxEnt, and it is obtained from the slope of the data cumulative distribution. Hence, we adopt the same notation as in the linear case.

Assuming ρ*_{R,esc}(x, α, β) to be the solution of Eq. (31) and substituting β̃ = β[1 + (1−q)β⟨x⟩]⁻¹, after some simple algebra we obtain

\rho_{R,\mathrm{esc}}^{*}(x, \alpha, \beta) = \left[\frac{(1-q)\alpha}{q}\right]^{\frac{1}{q-1}} (2-q)\tilde{\beta} \left[1 - (1-q)\tilde{\beta} x\right]^{\frac{1}{1-q}} ,    (32a)

\hat{\alpha} = \frac{q}{1-q} , \qquad \hat{\beta} = \frac{1}{\langle x\rangle} ,    (32b)

with q ∈ (1, 2) and

\tilde{\beta} > 0 \;\Rightarrow\; \frac{\beta}{1 + (1-q)\beta\langle x\rangle} > 0 ,    (33)

which is a necessary condition for the convergence of the integrals.
Already at this stage, one can see the emergence of a delicate inconsistency: Eq. (31c), which is one of the usual steps in the optimization procedure, requires β and ⟨x⟩ to be uncoupled from one another (see also Sections 2 and 3). On the other hand, this kind of coupling, namely Eq. (33), is enforced as a result of the very same set of equations comprising the optimization procedure. This fundamental inconsistency signals the failure of the maximization procedure for the Rényi entropy used with the escort constraints. Intuitively, this inconsistent behavior can be understood in the following manner: moving from the linear averaging to the escort one, the theory evades the well-known divergence of the second moment haunting the linear averaging scheme (see p. 3591 in Ref. [33]) [34]. However, the price paid for this is the incompatibility between the ⟨x⟩ obtained through the escort averaging as a result of the optimization procedure and the ⟨x⟩ stemming from the data itself, which is still calculated through the usual linear averaging procedure. Torn between these two different approaches in order to match the value of ⟨x⟩, the optimization procedure signals that the sole possible resolution in this framework is to inconsistently couple β and ⟨x⟩, so that the whole optimization procedure, in a manner of speaking, breaks down, unable to detect any maximum. Note beforehand that this is exactly what we observe in Figs. 3 and 5, for which we provide detailed explanations in the following section.
From Eqs. (8) and (32), we read that the Rényi escort estimators for the parameters Q and Λ are determined as

\hat{Q}_{R,\mathrm{esc}} = 2 - q , \qquad \hat{\Lambda}_{R,\mathrm{esc}} = \frac{\hat{\beta}}{2-q} = \frac{1}{\hat{Q}_{R,\mathrm{esc}}\langle x\rangle} .    (34)
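As a quick consistency check (ours; the text leaves it implicit), substituting the optimal value β̂ = 1/⟨x⟩ of Eq. (32b) into the definition of β̃ gives

\tilde{\beta}\Big|_{\beta = \hat{\beta}} = \frac{\hat{\beta}}{1 + (1-q)\hat{\beta}\langle x\rangle} = \frac{1/\langle x\rangle}{2-q} = \frac{1}{(2-q)\langle x\rangle} = \hat{\Lambda}_{R,\mathrm{esc}} ,

so the effective rate appearing in Eq. (32a) coincides with the escort estimator of Λ in Eq. (34).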

5. Shannon and Rényi MaxEnt: Numerical results

In this section, we resort to simulation in order to compare the performance of the MaxEnt inference procedures based on the Shannon (with logarithmic constraints) and Rényi (with both linear and escort constraints) entropies. With three different inference procedures identifying the same underlying distribution, we can easily compare their efficiency using a point estimation criterion [35]. So, although the MaxEnt principle infers a density, we are solely interested in the accuracy of the parameter estimation of this density. If one were interested in similar comparisons with MaxEnt procedures identifying different densities, then other criteria would have to be used (e.g. any measure of statistical similarity between densities, such as the Kullback–Leibler divergence) [36].
Therefore, we consider a sample of i.i.d. values D = {x₁, . . . , xₙ} generated by the (true, and in reality unknown) distribution ρ† in Eq. (8) for given parameters Λ > 0 and Q ∈ (1/2, 2/3] ∪ (2/3, 1], and study the performance of the estimators Λ̂_S, Λ̂_{R,lin} and Λ̂_{R,esc} recorded in Eqs. (20), (27) and (34), respectively. The Q-range above warrants the existence of the first moment E{X}. The values of Q in the left subinterval (1/2, 2/3] correspond to an infinite second moment E{X²}, or equivalently an infinite variance of the distribution, therefore describing Long Range Correlations (LRC), while in the right subinterval (2/3, 1] the second moment is finite, corresponding to Short Range Correlations (SRC) [37,38]. In Fig. 2 we plot a representative random sample generated by ρ† for the randomly chosen parameters {n, Λ, Q} = {10⁵, 10, 0.7} to visually verify that the obtained data correctly represent the true distribution ρ†. Note that the aforementioned Q value falls into the interval relevant for many applications, i.e. 1 < q < 1.5 through Q = 2 − q, considering q-exponentials such as in optical lattices [39], anomalous diffusion [40] and the cumulative distribution of calm-times of earthquakes [41].
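For readers who wish to reproduce such samples, the CDF implied by Eq. (8) is F(x) = 1 − [1 + (1 − Q)Λx]^{Q/(Q−1)}, which can be inverted in closed form. A minimal Python sketch of an inverse-CDF sampler follows (our illustration with a hypothetical function name; the published simulations used Julia [42,43]):

    import numpy as np

    def sample_q_exponential(n, Q, Lam, rng):
        """Draw n i.i.d. values from rho_dagger of Eq. (8) by inverse-CDF sampling.

        F(x) = 1 - [1 + (1-Q)*Lam*x]**(Q/(Q-1)), hence
        F^{-1}(u) = ((1-u)**((Q-1)/Q) - 1) / ((1-Q)*Lam), for Q in (0,1), Lam > 0.
        """
        u = rng.uniform(size=n)
        return ((1.0 - u) ** ((Q - 1.0) / Q) - 1.0) / ((1.0 - Q) * Lam)

    rng = np.random.default_rng(2021)
    x = sample_q_exponential(100_000, Q=0.7, Lam=10.0, rng=rng)   # as in Fig. 2
    print(x.mean())   # close to E{X} = 1/((2Q-1)*Lam) = 0.25, cf. Eq. (38b)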
At this point we recall, as discussed in Sections 3 and 4, that the MaxEnt inference procedure based on the Rényi entropy provides an estimator only for Λ but not for Q. The determination of the latter parameter is outside the scope of MaxEnt. To create the ideal scenario for the Rényi MaxEnt, we will substitute into Eqs. (27) and (34) the exact value of Q for which we generated the data set D in each case. Accordingly, Λ̂_{R,lin} and Λ̂_{R,esc} will be solely determined by the empirical mean value ⟨x⟩ = (1/n) Σⁿᵢ₌₁ xᵢ, as dictated by Eqs. (27) and (34). For the Shannon case, on the other hand, the estimator Λ̂_S will be determined by numerically solving the 2 × 2 system in Eq. (19). In other words, Q̂_S will be determined within MaxEnt as well.
We will begin the comparison of the three different inference procedures with the estimation of the sampling distributions of the Λ-estimators via Monte Carlo simulation. More specifically, for Λ = 10 and Q ∈ {0.55, 0.65, 0.75}, we generate M = 10⁴ independent random samples Xᵢ from the pdf ρ†(·) of size n ∈ {10³, 10⁴} and calculate the associated estimators. Formally, for i = 1, . . . , M, we have Xᵢ = (x_{i1}, . . . , x_{in}) ∼ ρ†(·) i.i.d., leading to the samples of estimators

Fig. 2. Histogram of a sample obtained from the pdf ρ † (x) for Λ = 10, Q = 0.7 and n = 105 .

Fig. 3. Kernel density estimators of the sampling distributions of the Λ-estimators, for Λ = 10 (true value indicated with dashed line).

\hat{\Lambda}_S = \left(\hat{\Lambda}_S^{(1)}, \ldots, \hat{\Lambda}_S^{(M)}\right) , \quad \hat{\Lambda}_{R,\mathrm{lin}} = \left(\hat{\Lambda}_{R,\mathrm{lin}}^{(1)}, \ldots, \hat{\Lambda}_{R,\mathrm{lin}}^{(M)}\right) , \quad \hat{\Lambda}_{R,\mathrm{esc}} = \left(\hat{\Lambda}_{R,\mathrm{esc}}^{(1)}, \ldots, \hat{\Lambda}_{R,\mathrm{esc}}^{(M)}\right) ,

for the Shannon and the Rényi (with linear and escort constraints) inference procedures, respectively.
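The following Python sketch (ours; it reuses the hypothetical sampler given earlier in this section and shrinks M for speed) computes the three estimators for one Monte Carlo batch. Q is passed explicitly to the Rényi formulas, mirroring the ideal scenario described above:

    import numpy as np
    from scipy.optimize import fsolve

    def all_estimators(x, Q):
        """Return (Lam_S, Lam_R_lin, Lam_R_esc), cf. Eqs. (19)-(20), (27), (34)."""
        x_mean = x.mean()
        lam_R_lin = 1.0 / ((2.0 * Q - 1.0) * x_mean)     # Eq. (27), q = Q
        lam_R_esc = 1.0 / (Q * x_mean)                   # Eq. (34), q = 2 - Q

        def eqs(p):                                      # Shannon system, Eq. (19)
            q, lam = p
            u = 1.0 + (1.0 - q) * lam * x
            return [(1.0 - q) / q - np.mean(np.log(u)),
                    1.0 / lam - np.mean(x / u)]

        _, lam_S = fsolve(eqs, [Q, lam_R_lin])
        return lam_S, lam_R_lin, lam_R_esc

    Q, Lam, n, M = 0.65, 10.0, 1000, 200                 # reduced M for speed
    rng = np.random.default_rng(1)
    est = np.array([all_estimators(sample_q_exponential(n, Q, Lam, rng), Q)
                    for _ in range(M)])
    print(est.mean(axis=0), est.std(axis=0))             # compare with Lam = 10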
In Fig. 3 we present the kernel density estimators of the sampling distributions of the Λ-estimators, calculated through the obtained samples for the different values of Q and n. It is evident that the Rényi MaxEnt estimators exhibit different forms of deviations from the true parameters. As we move away from the point Q = 1, the Λ̂_{R,lin} estimators are characterized by increasingly high variance and relatively low bias, while the Λ̂_{R,esc} estimators are characterized by increasingly high bias and relatively low variance. On the contrary, the Shannon MaxEnt estimators remain unaffected by the (true) values of Q, exhibiting low mean squared error (with relatively low bias and variance). Moreover, although the increase of the sample size results in lower variances, the Rényi-based estimators apparently exhibit systematic failure, especially for Q = 0.55 and generally in the LRC regime.
The efficiency of the Shannon MaxEnt estimators is anticipated, due to their equivalence with the associated MLE estimators, as explained in Section 2. Regarding the Rényi-based estimators, the main reason for the exhibited poor performance is the MaxEnt features. More specifically, although the choice of the specific constraints was made due to their ability to give a "proper" result (in terms of density estimation), they apparently render the MaxEnt procedure unable to recover the true value of the rate parameter. Notably, the specific features do not provide enough information from the data regarding the specific parameter Λ. Nevertheless, the specific constraints are also the reason for the increased performance of these estimators when Q ≈ 1: the Q-exponential is then nearly indistinguishable from an exponential distribution (with the same rate parameter Λ), so the sum of the observations Σⁿᵢ₌₁ xᵢ is a sufficient statistic for Λ and therefore extracts all the available information from the sample.

Fig. 4. The K–L divergence between the Q-exponential and ordinary exponential distributions with varying Q and the same Λ. The value Q = 2/3 is indicated with a red line.
In order to demonstrate the aforementioned resemblance of the Q-exponential to the associated exponential when Q ≈ 1, we present in Fig. 4 their Kullback–Leibler (K–L) divergence, acting as a quantifier of the statistical distance. The K–L divergence reads

\mathrm{KL}[p_1 \| p_2] := \int_{\mathbb{R}} p_1(x) \log\!\left(\frac{p_1(x)}{p_2(x)}\right) dx    (35)

for densities p₁, p₂. Specifically, for p_1(x) = Q\Lambda\left[1-(Q-1)\Lambda x\right]^{\frac{1}{Q-1}} and p_2(x) = \Lambda e^{-\Lambda x}, we have

\mathrm{KL}[p_1 \| p_2] = \ln(Q) - \frac{Q-1}{Q(2Q-1)}    (36)

for Q ∈ (1/2, 1]. In Fig. 4, one can clearly see that the distance between the two densities decreases when Q enters the SRC regime.
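Eq. (36) can be checked against a direct quadrature of Eq. (35). The short Python sketch below is our illustration (function names are ours); note that Λ cancels in the divergence, so any Λ > 0 gives the same value:

    import numpy as np
    from scipy.integrate import quad

    def kl_closed_form(Q):                     # Eq. (36)
        return np.log(Q) - (Q - 1.0) / (Q * (2.0 * Q - 1.0))

    def kl_numerical(Q, Lam=1.0):              # Eq. (35) by quadrature
        def integrand(x):
            log_p1 = np.log(Q * Lam) + np.log1p((1.0 - Q) * Lam * x) / (Q - 1.0)
            log_p2 = np.log(Lam) - Lam * x
            return np.exp(log_p1) * (log_p1 - log_p2)
        return quad(integrand, 0.0, np.inf, limit=200)[0]

    for Q in (0.55, 0.7, 0.9):
        print(Q, kl_closed_form(Q), kl_numerical(Q))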
In order to perform a more extensive numerical comparison, we construct a uniform grid of the parameter plane (Q, Λ) of size 200 × 200 and generate for each pair (Q, Λ) of the grid M = 5 × 10³ Monte Carlo samples for each one of the 3 estimators (Rényi with linear/escort constraints and Shannon), with sample size n = 10³. As we are interested in the detection of both bias and variance, the criterion that we will use for the comparison in the (Q, Λ)-parameter plane is the Monte Carlo estimate of the relative root-mean-square error, defined as

\mathrm{rRMSE}(\hat{\Lambda}, \Lambda) := \frac{1}{\Lambda}\left(E\left\{\left(\hat{\Lambda} - \Lambda\right)^{2}\right\}\right)^{1/2} = \frac{1}{\Lambda}\left[\mathrm{Var}(\hat{\Lambda}) + \left(E\{\hat{\Lambda}\} - \Lambda\right)^{2}\right]^{1/2}    (37)

for any estimator Λ̂ of Λ ∈ R. The reason for this particular choice is that the rRMSE takes into account both the variance and the bias of the estimator, while it remains unaffected by the magnitude of the particular choice of the rate parameter Λ. While the calculation of the rRMSE using the empirical Monte Carlo averages does not provide an accurate estimate of its theoretical value, it is useful in terms of comparing the performance of the estimators under scrutiny.
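A Monte Carlo estimate of Eq. (37) amounts to replacing the expectation by an average over the M estimator samples; a minimal Python helper (ours) reads:

    import numpy as np

    def rrmse(estimates, true_value):
        """Monte Carlo estimate of the relative root-mean-square error, Eq. (37)."""
        estimates = np.asarray(estimates, dtype=float)
        return np.sqrt(np.mean((estimates - true_value) ** 2)) / true_value

Applied to the estimator batches of the earlier sketch (e.g. rrmse(est[:, 0], Lam)), it reproduces, up to Monte Carlo error, the quantity mapped in Fig. 5.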
In Fig. 5, we present the obtained results using a colormap for the base-10 logarithm of the estimated rRMSE, computed using the aforementioned Monte Carlo estimator samples. The blue color indicates the low estimation error required for a consistent inference procedure, while the red indicates a high estimation error. We observe that the Shannon optimization of {Q, Λ} in Eq. (8) consistently exhibits a low estimation error of the same order of magnitude for all considered combinations of the estimators {Q̂_S, Λ̂_S}. On the other hand, the performance of both Rényi inference procedures is enhanced when the underlying distribution is in fact close to the exponential, i.e. as Q ≈ 1. In Fig. 5, one can also observe that the linear constraints are better suited for the MaxEnt procedure compared to the escort ones, since the former exhibit lower estimation errors. We note that this observation too conforms to our theoretical observations in Sections 3 and 4.

Fig. 5. Colormaps of the base-10 logarithm of the rRMSE for the three different inference procedures. For each point on the (Q, Λ)-parameter plane, we calculated the rRMSE using M = 5000 samples of size n = 1000.
To understand the numerical results in Fig. 5, we need to closely study the properties of the true distribution ρ†(x) in Eq. (8). Considering the theoretical moments

E\{f^m(X)\} = \int_D f^m(x)\,\rho^{\dagger}(x)\,dx , \qquad E_q\{f^m(X)\} = \frac{\int_D f^m(x)\,[\rho^{\dagger}(x)]^q\,dx}{\int_D [\rho^{\dagger}(x)]^q\,dx} ,

and the functions f_1(X) = \ln\!\left(1 + (1-Q)\Lambda X\right) and f_2(X) = X, we have
E\{f_1(x)\} = \frac{1-Q}{Q} , \;\; Q \in [0, 1] , \qquad E\{f_1^2(x)\} = 2\left(\frac{1-Q}{Q}\right)^{2} , \;\; Q \in [0, 1] ,    (38a)

E\{f_2(x)\} = \frac{1}{(2Q-1)\Lambda} , \;\; Q \in \left(\tfrac{1}{2}, 1\right] , \qquad E\{f_2^2(x)\} = \frac{2}{(2Q-1)(3Q-2)\Lambda^{2}} , \;\; Q \in \left(\tfrac{2}{3}, 1\right] ,    (38b)

E_q\{f_2(x)\} = \frac{1}{(2-q)\Lambda} , \;\; q \in [1, 2] , \qquad E_q\{f_2^2(x)\} = \frac{1}{(2-q)\left(\tfrac{3}{2}-q\right)\Lambda^{2}} , \;\; q \in \left[1, \tfrac{3}{2}\right) ,    (38c)

with q = 2 − Q . In Eq. (38a) we see that for the logarithmic function f1 , the first two moments and therefore the respective
variance, are finite for the entire range of Q -values. Accordingly, for MaxEnt based on f1 , the derived estimators should
be able to yield correct inferences about ρ † for Q ∈ [0, 1]. This is precisely the case of the Shannon entropy as presented
in Section 2. Indeed, we observe in Fig. 5 that the estimation error is very low and consistent for the entire Q -range.
If one uses within MaxEnt the function f₂ and the linear expectation value in Eq. (38b), then, first, one cannot study the entire Q-range, and second, in the regime (1/2, 2/3], the approximation of E{X} through the empirical formula fails to reproduce correct results for a finite data set because of the infinite variance. However, this is the range of utmost importance, since conventionally the infinity of the second moment is related with LRC [37,38]. This is the case of the Rényi entropy as presented in Section 3 and verified exactly in Fig. 5. Finally, invoking f₂ with the escort expectation value within MaxEnt leads to potentially correct inferences in the Q-range (1/2, 1], including the interval of LRC, as can be seen in Eq. (38c). Note that this is the case of the Rényi entropy with the escort expectation constraint. However, looking at Fig. 5 we observe at first glance an inconsistency between the numerical results and the preceding discussion. Here the reason for the high estimation error across the entire Q-interval is different, namely the invalid estimator β̂ in Eq. (32b) and accordingly in Eq. (34). This can be seen simply by the fact that in both Rényi cases the two distinct expectation constraints are approximated by the same empirical mean value (see Section 4 for the explanation), leading to two different estimators for Λ in Eqs. (27) and (34). Then, since Eq. (27) yields a low estimation error (for finite variance), this inductively means that Eq. (34) is not a proper estimator. This is explained by our theoretical considerations in Section 4.
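The instability of the empirical mean in the LRC regime is easy to demonstrate. The sketch below (ours, reusing the hypothetical sampler from earlier in this section) contrasts the relative spread of ⟨x⟩ across repeated samples for an LRC value Q = 0.55 and an SRC value Q = 0.75; the spread is markedly larger in the former case, which is what degrades both Rényi estimators there:

    import numpy as np

    rng = np.random.default_rng(7)
    for Q in (0.55, 0.75):                     # LRC vs SRC regime
        means = [sample_q_exponential(1000, Q, 10.0, rng).mean()
                 for _ in range(500)]
        print(Q, np.std(means) / np.mean(means))   # relative spread of <x>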

We note that the numerical simulations were performed using the Julia programming language [42] and that the code is available online [43].

6. Conclusions

Generalized entropies such as the Rényi entropy are frequently used to obtain optimum distributions of the q-exponential form by invoking the MaxEnt principle and applying it exactly as one usually does with the Shannon entropy. Whether this is indeed so, i.e. whether the q-exponentials really maximize the Rényi entropy, has not been explicitly checked so far to the best of our knowledge. The general practice is to use MaxEnt to obtain the concomitant optimum distribution and use it without any doubt. On the other hand, it is well known that the justification of the MaxEnt principle for the Shannon entropy requires conformity to the SJ axioms [6]. The Shannon entropy naturally satisfies these criteria. However, it is still an open problem whether the Rényi entropy meets the SJ axioms or not [12–14].
Therefore, in the present work, we considered the Rényi entropy as one of the generalized entropies and explicitly checked whether it delivers the aforementioned claims. Adopting both linear and escort mean value constraints, we theoretically showed that the Rényi entropy yields erroneous inferences concerning the maximum distributions of the q-exponential form. The use of the Rényi entropy with linear constraints yields optimum, albeit not maximum, entropy solutions, as shown in Fig. 1(b). On the other hand, the use of the escort averaging scheme suffers from the incompatibility between the averages calculated through the escort distribution in theory and the linear averaging used on the data in practice. For the numerical analysis, we generated simulated data whose true distribution is a q-exponential (see Fig. 2) and demonstrated that the MaxEnt estimators of the Rényi entropy with both linear and escort constraints fail to infer the true parameters of the associated data, as can be seen in Fig. 3. Moreover, we found that this failure cannot be remedied by increasing the sample size. The linear constraints yield high variance and low bias, whereas the escort ones result in low variance and high bias, as demonstrated in Fig. 3. Furthermore, the relative root-mean-square error analysis presented in Fig. 5 shows that the Rényi entropy, both with linear and escort constraints, exhibits high estimation errors, although the linear constraints result in relatively better inferences compared to the escort ones. In all cases, however, the estimations improve in the vicinity of the value q = 1. The reason for this behavior is that the Rényi entropy becomes almost identical to the Shannon entropy and the q-exponential distribution converges to the exponential distribution in this limit. In fact, Fig. 4 explicitly verifies, by making use of the K–L divergence, that the two distributions are indiscernible in the aforementioned limit.
We have also theoretically shown and numerically verified that the Shannon entropy with the logarithmic mean value constraint detects the power law distributions with proper estimators that coincide with the MLE. Moreover, adopting the Shannon entropy also has the merit of obtaining the distribution directly from the updating procedure, i.e. from the constraints. In the case of the Rényi entropy, finding the exponent of the power law distribution lies outside the scope of MaxEnt and requires an additional fitting. Since the Rényi entropy is unreliable in detecting the power law distributions due to all of the aforementioned reasons, the use of the Shannon entropy emerges as the most appropriate method to properly infer this type of distribution. On the other hand, we note that even the Shannon entropy cannot be used to construct a generalized statistical mechanics, since the logarithmic constraints are not compatible with the appropriate units of constraints required by the generalized statistical mechanics.
Finally, note that the exponent of the q-exponential distribution used in this work for numerical illustration lies in the interval 1 < q < 1.5. This interval includes many applications, ranging from optical lattices [39] to anomalous diffusion [40], among others. Therefore, one might ask why we have not pursued any application of the sort further in this work. The reason is that all applications relying on MaxEnt first require the associated optimized distribution to be found. Only then can one fit the data or construct a generalized statistical mechanics. However, the present work shows that this cannot be done in a scientifically sound manner in the case of the Rényi entropy. This corollary is also applicable to the Tsallis entropy due to its monotonic relation to the Rényi entropy [44].

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgment

G.B.B. acknowledges the support of Mersin University, Turkey under the project 2018-3-AP5-3093.

References

[1] J.M. Amigó, S.G. Balogh, S. Hernández, A brief review of generalized entropies, Entropy 20 (11) (2018) 813.
[2] R.D. Rosenkrantz, E.T. Jaynes, Papers on Probability, Statistics and Statistical Physics, Springer Science & Business Media, 2012.
[3] M. Dudík, S.J. Phillips, R.E. Schapire, Maximum entropy density estimation with generalized regularization and an application to species
distribution modeling, J. Mach. Learn. Res. 8 (2007) 1217.
[4] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2009.


[5] A. Beck, Introduction To Nonlinear Optimization, SIAM, Philadelphia, PA, USA, 2014.
[6] J.E. Shore, R.W. Johnson, IEEE Trans. Inform. Theory IT-26 (1980) 26; IT-27 (1981) 472; IT-29 (1983) 942.
[7] A.B. Templeman, L. Xingsi, Entropy duals, Eng. Optim. 9 (1985) 107.
[8] V.P. Singh, A.K. Rajagopal, A new method of parameter estimation for hydrologic frequency analysis, Hydrol. Sci. Technol. 2 (3) (1986) 33.
[9] V.P. Singh, Entropy Theory and Its Application in Environmental and Water Engineering, Wiley-Blackwell, 2013, p. 325.
[10] A. Rényi, Probability Theory, North-Holland, 1970.
[11] S.-W. Ho, S. Verdú, Convexity/concavity of Renyi entropy and α -mutual information, IEEE Int. Symp. Inf. Theory (2015) 745.
[12] Th. Oikonomou, G.B. Bagci, Rényi entropy yields artificial biases not in the data and incorrect updating due to the finite-size data, Phys. Rev.
E 99 (2019) 032134.
[13] P. Jizba, J. Korbel, Comment on Rényi entropy yields artificial biases not in the data and incorrect updating due to the finite-size data, Phys.
Rev. E 100 (2019) 026101.
[14] Th. Oikonomou, G.B. Bagci, Reply to "Comment on 'Rényi entropy yields artificial biases not in the data and incorrect updating due to the finite-size data'", Phys. Rev. E 100 (2019) 026102.
[15] S. Pressé, K. Ghosh, J. Lee, K.A. Dill, Nonadditive entropies yield probability distributions with biases not warranted by the data, Phys. Rev. Lett.
111 (2013) 180604.
[16] J. Skilling, Maximum Entropy and Bayesian Methods, Springer Science & Business Media, 1988.
[17] A. Caticha, Information and entropy, AIP Conf. Proc. 954 (2007) 11.
[18] K. Vanslette, Entropic updating of probabilities and density matrices, Entropy 19 (2017) 664.
[19] C.R. Shalizi, Maximum likelihood estimation for q-exponential (Tsallis) distributions, 2007, arXiv:math/0701854v2.
[20] V.P. Singh, H. Guo, Parameter estimation for 2-parameter generalized pareto distribution by POME, Stoch. Hydrol. Hydraul. 11 (1997) 211.
[21] J. Pickands, Statistical inference using extreme order statistics, Ann. Statist. 3 (1) (1975) 119.
[22] B.C. Arnold, Pareto and generalized Pareto distributions, in: Modeling Income Distributions and Lorenz Curves, Springer, New York, NY, 2008,
pp. 119–145.
[23] A. Chaouche, J.N. Bacro, Statistical inference for the generalized Pareto distribution: Maximum likelihood revisited, Comm. Statist. Theory
Methods 35 (5) (2006) 785.
[24] P. de Zea Bermudez, S. Kotz, Parameter estimation of the generalized Pareto distribution-Part I, J. Statist. Plann. Inference 140 (6) (2010) 1353.
[25] P.W. Mielke Jr., E.S. Johnson, Three-parameter kappa distribution maximum likelihood estimates and likelihood ratio tests, Mon. Weather Rev. 101 (9) (1973) 701.
[26] V.P. Singh, Z.Q. Deng, Entropy-based parameter estimation for kappa distribution, J. Hydrol. Eng. 8 (2) (2003) 81.
[27] D.C. Brody, A note on exponential families of distributions, J. Phys. A 40 (2007) F691.
[28] F. Nielsen, V. Garcia, Statistical exponential families: A digest with flash cards, 2009, arXiv:0911.4863v2.
[29] J.-F. Bercher, Tsallis distribution as a standard maximum entropy solution with ‘tail’ constraint, Phys. Lett. A 372 (2008) 5657.
[30] A. Hernando, A. Plastino, A.R. Plastino, Maxent and dynamical information, Eur. Phys. J. B 85 (2012) 147.
[31] M. Visser, Zipf’s law power laws and maximum entropy, New J. Phys. 15 (2013) 043021.
[32] K. He, G. Meeden, Selecting the number of bins in a histogram: A decision theoretic approach, J. Statist. Plann. Inference 61 (1) (1997) 49.
[33] C. Tsallis, S.V.F. Levy, A.M.C. Souza, R. Maynard, Statistical-mechanical foundation of the ubiquity of Lévy distributions in nature, Phys. Rev.
Lett. 75 (1996) 3589.
[34] Note that Ref [33] is concerned with the Tsallis entropy. However, the divergence of the second moment is a common illicit behavior when
one uses linearly averaged constraints, hence arguments therein are also applicable to the Rényi entropy.
[35] E.L. Lehmann, G. Casella, Theory of Point Estimation, Springer Science & Business Media, 2006.
[36] S.I. Amari, Information Geometry and Its Applications, Springer, 2016.
[37] H.E. Stanley, S.V. Buldyrev, A.L. Goldberger, Z.D. Goldberger, S. Havlin, R.N. Mantegna, S.M. Ossadnik, C.-K. Peng, M. Simons, Statistical mechanics
in biology: how ubiquitous are long-range correlations?, Physica A 205 (1994) 214.
[38] Th. Oikonomou, A. Provata, Non-extensive trends in the size distribution of coding and non-coding DNA sequences in the human genome, Eur.
Phys. J. B 50 (2006) 259–264.
[39] P. Douglas, S. Bergamini, F. Renzoni, Tunable Tsallis distributions in dissipative optical lattices, Phys. Rev. Lett. 96 (2006) 110601.
[40] B. Liu, J. Goree, Superdiffusion and non-Gaussian statistics in a driven-dissipative 2D dusty plasma, Phys. Rev. Lett. 100 (2008) 055003.
[41] S. Abe, N. Suzuki, Scale-free statistics of time interval between successive earthquakes, Physica A 350 (2005) 588.
[42] J. Bezanson, A. Edelman, S. Karpinski, V.B. Shah, Julia: A fresh approach to numerical computing, SIAM Rev. 59 (1) (2017) 65.
[43] Th. Oikonomou, K. Kaloudis, G.B. Bagci, kkaloudis/qExponential-MaxEnt (GitHub Repository).
[44] C. Tsallis, Introduction To Nonextensive Statistical Mechanics: Approaching a Complex World, Springer, 2009.

