Physica D 125 (1999) 285–294


Estimating the errors on measured entropy and mutual information
Mark S. Roulston ∗
Division of Geological and Planetary Sciences, California Institute of Technology 150-21, Pasadena, CA 91125, USA
Received 3 November 1997; received in revised form 17 June 1998; accepted 30 September 1998
Communicated by J.D. Meiss

Abstract
Information entropy and the related quantity mutual information are used extensively as measures of complexity and to
identify nonlinearity in dynamical systems. Expressions for the probability distribution of entropies and mutual informations
calculated from finite amounts of data exist in the literature but the expressions have seldom been used in the field of nonlinear
dynamics. In this paper formulae for estimating the errors on observed information entropies and mutual informations are
derived using the standard error analysis familiar to physicists. Their validity is demonstrated by numerical experiment. For
illustration the formulae are then used to evaluate the errors on the time-lagged mutual information of the logistic map. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Entropy; Mutual information

1. Introduction

Information theoretic functionals such as entropy and the related quantity of mutual information can be used
to identify general relationships between variables. Information entropy has been used to analyze the behavior of
nonlinear dynamical systems and time series [1–10]. Information entropy is also used to quantify the complexity
of symbol sequences such as DNA sequences [11]. Invariably the analysis of real data involves a finite amount
of data. Furthermore, in the case of continuous variables, a quantization must be chosen. The calculated entropy
of the data will have a functional dependence on the amount of data and the quantization chosen. To assess the significance of a calculated entropy, the effect of finite data and quantization on its probability distribution must be known. Expressions for the systematic and random errors in observed entropies have
been calculated before by Basharin [12], Harris [13] and Herzel et al. [14] but such expressions have rarely been
used in the nonlinear dynamics literature.
In this paper error estimates will be derived using the standard error formulae familiar to physicists. The formulae
are presented in concise form in Section 6 of this paper.

☆ Contribution number 5759, California Institute of Technology Division of Geological and Planetary Sciences.
∗ Corresponding author. Tel: +626-395-3992; fax: +626-585-1917; e-mail: mark@gps.caltech.edu


These formulae will be verified by numerical experiments before being applied to the well-known logistic equation. This will demonstrate that when small datasets are being analyzed the bias and random error in mutual
information can be significant and should be estimated.

2. Entropy and mutual information

The most common information theoretic functional is entropy. For a discrete variable, X, the entropy is defined
as
H(X) = -\sum_{i=1}^{B_X} p_i \ln p_i,    (1)

where the sum is over the B_X “states” that X can assume and p_i is the probability that X will be in state i. The joint
entropy of two discrete variables, X and Y , is defined as
H(X, Y) = -\sum_{i=1}^{B_X} \sum_{j=1}^{B_Y} p_{ij} \ln p_{ij},    (2)

where the sum is over the B_X states that X can assume and the B_Y states that Y can assume, and p_ij is the probability that X is in state i and Y is in state j. Having defined entropy and joint entropy, the mutual information of two
discrete variables, X and Y , can be defined as

I(X; Y) = H(X) + H(Y) − H(X, Y). (3)

Mutual information can be thought of as a generalized correlation analogous to the linear correlation coefficient, r,
but sensitive to any relationship, not just linear dependence. If we wish to test hypotheses concerning the entropy or joint entropy of variables, we must have a means of determining the uncertainties on the values of these quantities calculated from data. In the following section the bias and variance of the observed entropy of a series will be estimated. The analysis will then be extended to observed mutual information.
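For concreteness, the quantities in Eqs. (1)–(3) can be computed directly from a table of joint counts. The following Python sketch is purely illustrative (NumPy is assumed and the function name is arbitrary, not part of the paper); it returns the plug-in entropies and mutual information in nats.

import numpy as np

def plug_in_information(n_xy):
    """Plug-in entropies and mutual information (in nats) from a
    B_X x B_Y table of joint counts n_ij, following Eqs. (1)-(3)."""
    n_xy = np.asarray(n_xy, dtype=float)
    q_xy = n_xy / n_xy.sum()          # observed joint distribution q_ij
    q_x = q_xy.sum(axis=1)            # observed marginal distribution of X
    q_y = q_xy.sum(axis=0)            # observed marginal distribution of Y

    def H(q):                         # Eqs. (1)/(2); 0 ln 0 is taken to be 0
        q = q[q > 0]
        return -np.sum(q * np.log(q))

    I = H(q_x) + H(q_y) - H(q_xy)     # Eq. (3)
    return H(q_x), H(q_y), H(q_xy), I

# Example: 1000 pairs of independent uniform variates binned into 10 x 10 states
rng = np.random.default_rng(0)
counts, _, _ = np.histogram2d(rng.random(1000), rng.random(1000), bins=10)
H_x, H_y, H_xy, I = plug_in_information(counts)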

3. Estimating the error on an observed entropy

In this section the systematic and random error of an observed entropy of a series of values will be estimated.
Consider an ensemble of series. Let there be N values in each series. Let each value be assigned to one of B states, which will be labeled i (i = 1, 2, ..., B). Let the probability that a value will be in the ith state be p_i, and let the number of values in the ith state be n_i. The number n_i is a binomial random variable. This can be seen by considering each member of the series as a “trial”: the probability of success (i.e. that the value is in the ith state) is p_i, while the probability of failure (i.e. that the value is not in the ith state) is 1 - p_i. Thus n_i ∼ B(N, p_i), and from the properties of the binomial distribution the expectation value and variance of n_i are given by

E[n_i] = N p_i, \qquad V[n_i] = N p_i (1 - p_i).    (4)

The entropy of the series that will be measured, H_obs, is

H_{obs} = -\sum_{i=1}^{B} \frac{n_i}{N} \ln\frac{n_i}{N} = -\sum_{i=1}^{B} q_i \ln q_i,    (5)

where the notation q_i = n_i/N has been used. The expectation value of H_obs will now be calculated by introducing the variable ε_i, which is defined as

\varepsilon_i = \frac{q_i - p_i}{p_i}.    (6)
The observed entropy in Eq. (5) can thus be written as
H_{obs} = -\sum_{i=1}^{B} p_i (1 + \varepsilon_i) \ln\left( p_i (1 + \varepsilon_i) \right),    (7)

H_{obs} = -\sum_{i=1}^{B} p_i (1 + \varepsilon_i) \left[ \ln p_i + \ln(1 + \varepsilon_i) \right].    (8)

If N p_i is large, ε_i will be small, so the logarithm in Eq. (8) can be expanded in a Taylor series and Eq. (8) can be simplified to give
H_{obs} = -\sum_{i=1}^{B} \left[ p_i \ln p_i + \varepsilon_i p_i (1 + \ln p_i) + \frac{\varepsilon_i^2 p_i}{2} + O(\varepsilon_i^3) \right],    (9)

H_{obs} = H_\infty - \sum_{i=1}^{B} \left[ \varepsilon_i p_i (1 + \ln p_i) + \frac{\varepsilon_i^2 p_i}{2} + O(\varepsilon_i^3) \right],    (10)
where the “true” entropy of the system has been written H_∞ = -\sum_{i=1}^{B} p_i \ln p_i. Since the expectation value of ε_i is zero, the expectation value of the observed entropy, to second order in ε_i, is

\langle H_{obs} \rangle \approx H_\infty - \sum_{i=1}^{B} \frac{\langle \varepsilon_i^2 \rangle p_i}{2}.    (11)

Using Eqs. (4) and (6) it can be shown that

\langle \varepsilon_i^2 \rangle = \frac{1 - p_i}{N p_i}, \qquad p_i \neq 0.    (12)
Substitution of Eq. (12) into Eq. (11) and evaluation of the sum gives

\langle H_{obs} \rangle \approx H_\infty - \frac{B^* - 1}{2N},    (13)
where B^* is the number of states for which p_i ≠ 0. From Eq. (13) it can be seen that the expected value of the observed entropy is systematically biased downwards from the true entropy. This result was obtained by Basharin [12] and Herzel [16,17], who pointed out that, to second order, the bias is independent of the actual distribution. The correction term has been calculated to higher order by Grassberger [18], but the second order result will be used here. The source of this bias can be understood intuitively by considering the case of a uniform distribution. In this case the true entropy is also the maximum possible entropy, so any difference between the observed distribution and the true distribution will act to make the observed entropy lower than the true entropy. The current author previously derived the same correction in the context of a uniform distribution and then went on to derive the variance of H_obs for the uniform case [15]. In this paper the variance of H_obs for an arbitrary distribution will be derived.
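As a numerical illustration of Eq. (13): with B = B^* = 10 equally likely states and N = 100 values, the expected downward bias is (B^* - 1)/(2N) = 9/200 = 0.045 nats, roughly 2% of the true entropy H_∞ = ln 10 ≈ 2.30 nats.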
The variance of H_obs will be denoted V[H_obs]. It can be calculated using the standard error formula

V[H_{obs}] = \sum_{k=1}^{B} \left( \frac{\partial H_{obs}}{\partial n_k} \right)^2 V[n_k],    (14)

where V[·] denotes the variance. The partial derivatives of H_obs can be evaluated as follows:

\frac{\partial H_{obs}}{\partial n_k} = \frac{\partial}{\partial n_k}\left( -\sum_{i=1}^{B} q_i \ln q_i \right),    (15)

\frac{\partial H_{obs}}{\partial n_k} = -\sum_{i=1}^{B} \frac{\partial}{\partial n_k}(q_i \ln q_i),    (16)

\frac{\partial H_{obs}}{\partial n_k} = -\sum_{i=1}^{B} \frac{\partial q_i}{\partial n_k}(1 + \ln q_i).    (17)

To evaluate the derivative of q_i with respect to n_k, q_i must be rewritten to include explicitly its dependence on the n's in both the numerator and the denominator. That is

\frac{\partial q_i}{\partial n_k} = \frac{\partial}{\partial n_k}\left( \frac{n_i}{\sum_{j=1}^{B} n_j} \right) = \frac{N\delta_{ik} - n_i}{N^2},    (18)

where the fact that \sum_{j=1}^{B} n_j = N has been used in the last step. δ_{ik} is the Kronecker delta, defined as δ_{ik} = 0 when i ≠ k and δ_{ik} = 1 when i = k. Substitution of Eq. (18) into Eq. (17) gives
\frac{\partial H_{obs}}{\partial n_k} = -\sum_{i=1}^{B} (1 + \ln q_i)\left( \frac{\delta_{ik}}{N} - \frac{n_i}{N^2} \right),    (19)

\frac{\partial H_{obs}}{\partial n_k} = -\sum_{i=1}^{B} \left( \frac{\delta_{ik}}{N} - \frac{q_i}{N} + \frac{\delta_{ik} \ln q_i}{N} - \frac{q_i \ln q_i}{N} \right),    (20)

\frac{\partial H_{obs}}{\partial n_k} = -\frac{1}{N}\left( \ln q_k - \sum_{i=1}^{B} q_i \ln q_i \right),    (21)

\frac{\partial H_{obs}}{\partial n_k} = -\frac{1}{N}( \ln q_k + H_{obs} ),    (22)

where the fact that \sum_{i=1}^{B} q_i = \sum_{i=1}^{B} \delta_{ik} = 1 has been used. Substitution of Eq. (22) into Eq. (14) gives

V[H_{obs}] = \frac{1}{N^2} \sum_{k=1}^{B} ( \ln q_k + H_{obs} )^2 V[n_k].    (23)

The value of V[n_k] can be estimated from the observed distribution using Eq. (4):

V[n_k] = N p_k (1 - p_k) = N q_k (1 - q_k) + O(\varepsilon_k).    (24)
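Eqs. (5), (13), (23) and (24) together give a practical recipe for quoting an entropy with its error. A minimal Python sketch (illustrative only; NumPy assumed, names arbitrary) is:

import numpy as np

def entropy_with_error(counts):
    """Observed entropy, bias-corrected entropy and standard error from a
    vector of bin counts n_i, using Eqs. (5), (13), (23) and (24)."""
    n = np.asarray(counts, dtype=float)
    N = n.sum()
    q = n[n > 0] / N                              # occupied bins only (q_i != 0)
    H_obs = -np.sum(q * np.log(q))                # Eq. (5)

    bias = (q.size - 1) / (2.0 * N)               # (B* - 1)/(2N), Eq. (13)
    V_n = N * q * (1.0 - q)                       # Eq. (24)
    V_H = np.sum((np.log(q) + H_obs) ** 2 * V_n) / N**2   # Eq. (23)

    return H_obs, H_obs + bias, np.sqrt(V_H)      # H_obs, estimate of H_inf, sigma_H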
To demonstrate the validity of the error estimates on observed entropies given by Eqs. (13) and (23), numerical experiments were conducted. In each experiment an ensemble of 1000 series of N random variables was generated using a given probability distribution, [p_1, p_2, ..., p_B]. The distribution of each series and the corresponding entropy, H_obs, were calculated. A histogram of the 1000 values of H_obs was constructed and compared to a normal distribution with an expectation value given by Eq. (13) and a variance given by Eq. (23). The results of these experiments are shown in Fig. 1.

Fig. 1. Results of the numerical experiment described in Section 3. The left panels show the probability distributions used to construct the ensembles of 1000 series of N values. The center and right panels compare histograms of H_obs for the ensemble with the theoretical distribution derived in the text (solid and dotted curves) for the cases of N = 100 and N = 1000. The vertical dashed lines are H_∞.

The left panels of Fig. 1 show the prescribed probability distributions that were used to generate the ensembles. The center and right panels show a comparison of the histograms of the 1000 values of H_obs with the theoretical normal distribution N(⟨H_obs⟩, V[H_obs]) for values of N of 100 and 1000. Note
that the solid curve has a variance equal to the mean value of V[H_obs] for the ensemble. Remember that V[H_obs] is estimated using the observed q_i's, and therefore each value of H_obs corresponds to a slightly different estimate of V[H_obs]. The dotted lines show the 1σ spread of estimates of V[H_obs] for the ensemble, the standard error on the standard error as it were. The small spread of the error estimates justifies using the observed q_i's to estimate V[n_i] in Eq. (24). The vertical dashed lines show the value of H_∞ for the given probability distribution. In all the
numerical experiments the error formulae give good estimates for the bias and spread of the observations, even for the case of 10 uniformly distributed states (top panels) and N = 100. However, in the uniform case it can be seen most clearly that the normal distribution is an approximation: the actual distribution is asymmetric, since the observed entropy cannot be higher than H_∞ for a uniform distribution.
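The experiment can be reproduced with a few lines of Python (an illustrative sketch, not the original code; the non-uniform test distribution and the random seed are arbitrary choices): draw an ensemble of multinomial samples, compute H_obs for each, and compare the ensemble bias and spread with Eqs. (13) and (23).

import numpy as np

rng = np.random.default_rng(1)
p = np.arange(1, 11) / 55.0           # an illustrative non-uniform 10-state distribution
N, trials = 100, 1000
H_inf = -np.sum(p * np.log(p))

H_samples, sigma_estimates = [], []
for _ in range(trials):
    n = rng.multinomial(N, p)                           # one series of N values, binned
    q = n[n > 0] / N
    H = -np.sum(q * np.log(q))                          # Eq. (5)
    V_H = np.sum((np.log(q) + H) ** 2 * N * q * (1 - q)) / N**2   # Eqs. (23)-(24)
    H_samples.append(H)
    sigma_estimates.append(np.sqrt(V_H))

H_samples = np.array(H_samples)
print("bias : observed %.4f, predicted %.4f"
      % (H_inf - H_samples.mean(), (np.count_nonzero(p) - 1) / (2 * N)))   # Eq. (13)
print("sigma: observed %.4f, mean estimate %.4f"
      % (H_samples.std(), np.mean(sigma_estimates)))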

4. Estimating the error on an observed mutual information

The error analysis of an observed mutual information can be performed in a similar manner to that of an observed
entropy.
The observed mutual information, I_obs, is given by

I_{obs} = H_{obs}(X) + H_{obs}(Y) - H_{obs}(X, Y).    (25)
If X can assume one of B_X states and Y can assume one of B_Y states, and if there are N pairs (X, Y), then the expectation value of I_obs is given by

\langle I_{obs} \rangle = H_\infty(X) - \frac{B_X^* - 1}{2N} + H_\infty(Y) - \frac{B_Y^* - 1}{2N} - H_\infty(X, Y) + \frac{B_{XY}^* - 1}{2N},    (26)

\langle I_{obs} \rangle = I_\infty + \frac{B_{XY}^* - B_X^* - B_Y^* + 1}{2N},    (27)

where I_∞ is the “true” mutual information which would be measured when N → ∞, B_X^* is the number of states of X which have a finite probability, B_Y^* is the number of states of Y which have a finite probability, and B_{XY}^* is the number of states for which p_ij ≠ 0.
Let q_ij = n_ij/N, where n_ij is the number of data points that have an X value in the ith state and a Y value in the jth state. As before, q_ij can be written to show its dependence on the n_ij's explicitly:

q_{ij} = \frac{n_{ij}}{\sum_{m=1}^{B_X} \sum_{n=1}^{B_Y} n_{mn}},    (28)

where B_X is the number of states that X can assume and B_Y is the number of states that Y can assume. Differentiation of Eq. (28) with respect to n_kl gives the following relationships:
\frac{\partial q_{ij}}{\partial n_{kl}} = \frac{N \delta_{ik} \delta_{jl} - n_{ij}}{N^2},    (29)

\frac{\partial}{\partial n_{kl}}\left( \sum_{i=1}^{B_X} q_{ij} \right) = \frac{\delta_{jl}}{N} - \frac{\sum_{i=1}^{B_X} n_{ij}}{N^2},    (30)

\frac{\partial}{\partial n_{kl}}\left( \sum_{j=1}^{B_Y} q_{ij} \right) = \frac{\delta_{ik}}{N} - \frac{\sum_{j=1}^{B_Y} n_{ij}}{N^2}.    (31)

The observed mutual information, I_obs, can be written in terms of the q_ij's as

I_{obs} = H_{obs}(X) + H_{obs}(Y) - H_{obs}(X, Y),    (32)

I_{obs} = -\sum_{i=1}^{B_X} \left( \sum_{j=1}^{B_Y} q_{ij} \right) \ln\left( \sum_{j=1}^{B_Y} q_{ij} \right) - \sum_{j=1}^{B_Y} \left( \sum_{i=1}^{B_X} q_{ij} \right) \ln\left( \sum_{i=1}^{B_X} q_{ij} \right) + \sum_{i=1}^{B_X} \sum_{j=1}^{B_Y} q_{ij} \ln q_{ij}.    (33)

Differentiation of Eq. (33) with respect to n_kl gives

\frac{\partial I_{obs}}{\partial n_{kl}} = -\sum_{i=1}^{B_X} \left( 1 + \ln\left( \sum_{j=1}^{B_Y} q_{ij} \right) \right) \frac{\partial}{\partial n_{kl}}\left( \sum_{j=1}^{B_Y} q_{ij} \right) - \sum_{j=1}^{B_Y} \left( 1 + \ln\left( \sum_{i=1}^{B_X} q_{ij} \right) \right) \frac{\partial}{\partial n_{kl}}\left( \sum_{i=1}^{B_X} q_{ij} \right) + \sum_{i=1}^{B_X} \sum_{j=1}^{B_Y} (1 + \ln q_{ij}) \frac{\partial q_{ij}}{\partial n_{kl}}.    (34)

Substitution of Eqs. (29)–(31) into Eq. (34) and evaluation of the sums leads to

\frac{\partial I_{obs}}{\partial n_{kl}} = -\frac{1}{N}\left[ \ln\left( \sum_{j=1}^{B_Y} q_{kj} \right) + \ln\left( \sum_{i=1}^{B_X} q_{il} \right) - \ln q_{kl} + I_{obs} \right].    (35)

Eq. (35) can be substituted into the standard error formula (Eq. (14)) to give

V[I_{obs}] = \frac{1}{N^2} \sum_{k=1}^{B_X} \sum_{l=1}^{B_Y} \left[ \ln\left( \sum_{j=1}^{B_Y} q_{kj} \right) + \ln\left( \sum_{i=1}^{B_X} q_{il} \right) - \ln q_{kl} + I_{obs} \right]^2 V[n_{kl}].    (36)

Again, V[n_kl] can be estimated from the observed distribution:

V[n_{kl}] = N q_{kl} (1 - q_{kl}) + O(\varepsilon_{kl}).    (37)
As with the case of entropy, the error estimates for the observed mutual information were validated by numerical experiment. Again 1000 series of N pairs of random variables, (X, Y), were generated using a given probability distribution (this time a two-dimensional one). The mutual information was calculated for each series and the results plotted as a histogram for comparison with the normal distribution N(⟨I_obs⟩, V[I_obs]). The results are shown in Fig. 2.

Fig. 2. The result of the numerical experiment described in Section 4. The left panels show the joint probability distributions used to construct the 1000 series of N (X, Y) pairs. The centre and right panels compare the histograms of I_obs with the theoretical distribution derived in the text (solid and dotted curves) for the cases of N = 100 and N = 1000. The vertical dashed lines are I_∞.

As with Fig. 1, the solid curves show the mean error estimates while the dotted curves show the 1σ spread of the error estimates. The vertical dashed lines show the value of I_∞. Again the formulae provide good approximations for the bias and spread in the measurements, with one noticeable exception. For the case of 100 equally likely states and N = 100, the bias calculated using Eq. (27) is smaller than the observed bias. In this case the expected number of data points in each bin of the joint distribution is unity. With such small values of n_ij the approximation that the ε's are small is completely invalid. Note that for N = 1000 the bias given by Eq. (27) is close to the observed bias. Furthermore, for the second distribution, in which p_ij is nonzero for only 10 of the 100 bins, the formulae provide good estimates of the errors on I_obs even when N = 100. As a rule of thumb, max n_ij ≥ 10 should hold for the formulae to give reasonable error estimates.
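The corresponding recipe for mutual information combines Eqs. (25), (27), (36) and (37). A Python sketch is given below (illustrative only; NumPy assumed, names arbitrary); the rule of thumb max n_ij ≥ 10 should be checked before trusting its output.

import numpy as np

def mutual_information_with_error(n_xy):
    """Observed mutual information, bias-corrected value and standard error
    from a B_X x B_Y table of joint counts, using Eqs. (25), (27), (36), (37)."""
    n = np.asarray(n_xy, dtype=float)
    N = n.sum()
    q = n / N                                   # q_ij
    q_x, q_y = q.sum(axis=1), q.sum(axis=0)     # marginals of X and Y

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    I_obs = H(q_x) + H(q_y) - H(q)              # Eq. (25)

    B_x, B_y, B_xy = map(np.count_nonzero, (q_x, q_y, q))
    bias = (B_x + B_y - B_xy - 1) / (2.0 * N)   # from Eq. (27): I_inf ≈ I_obs + bias

    k, l = np.nonzero(q)                        # only occupied bins contribute
    term = np.log(q_x[k]) + np.log(q_y[l]) - np.log(q[k, l]) + I_obs
    V_I = np.sum(term**2 * N * q[k, l] * (1 - q[k, l])) / N**2   # Eqs. (36)-(37)

    return I_obs, I_obs + bias, np.sqrt(V_I)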

5. Application to the logistic equation

To illustrate how the error estimates can be used, they were applied to datasets generated using the famous logistic equation

x_{t+1} = 4 x_t (1 - x_t).    (38)
Time series of N = 10 000, N = 5000, N = 500 and N = 200 data points were generated. The points lay on the
real interval [0, 1] but were binned into 10 bins, each of width 0.1. The mutual information of each time series and
a lagged version of itself was then calculated. The results are shown in Fig. 3. In each panel the solid line shows
the observed mutual information, I_obs, while the crosses denote the corrected mutual information, i.e. the estimate
of I_∞ given by Eq. (27), and the size of the error bars was determined using Eq. (36). It can be seen that when a large amount of data is used the bias and random error are negligible, but for smaller datasets (N < 500) this is no longer the case. In this example the biases are larger than the random errors, but the results in Fig. 2 demonstrate that this is not always the case.

Fig. 3. Lagged mutual information of the time series generated by the logistic equation. The solid lines are the observed mutual informations, I_obs, while the points are estimates of I_∞ with error bars calculated using Eq. (36).
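The calculation behind Fig. 3 can be sketched as follows (an illustrative reconstruction, not the author's code; the initial condition and lag range are arbitrary): iterate Eq. (38), bin the orbit into 10 bins of width 0.1, and for each lag apply the bias correction of Eq. (27) and the error bar of Eqs. (36)–(37) to the joint histogram of the series and its lagged copy.

import numpy as np

def lagged_mi(x, lag, bins=10):
    """I_obs, bias-corrected estimate of I_inf and sigma_I for a series and
    its lagged copy, using Eqs. (25), (27), (36) and (37)."""
    n_xy, _, _ = np.histogram2d(x[:-lag], x[lag:], bins=bins, range=[[0, 1], [0, 1]])
    N = n_xy.sum()
    q = n_xy / N
    q_x, q_y = q.sum(axis=1), q.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    I_obs = H(q_x) + H(q_y) - H(q)
    bias = (np.count_nonzero(q_x) + np.count_nonzero(q_y)
            - np.count_nonzero(q) - 1) / (2 * N)
    k, l = np.nonzero(q)
    term = np.log(q_x[k]) + np.log(q_y[l]) - np.log(q[k, l]) + I_obs
    sigma = np.sqrt(np.sum(term**2 * q[k, l] * (1 - q[k, l])) / N)
    return I_obs, I_obs + bias, sigma

# A short orbit of the logistic map, Eq. (38)
x = np.empty(200)
x[0] = 0.3                       # arbitrary initial condition
for t in range(len(x) - 1):
    x[t + 1] = 4.0 * x[t] * (1.0 - x[t])

for lag in range(1, 6):
    I_obs, I_corr, sigma = lagged_mi(x, lag)
    print("lag %d: I_obs = %.3f, corrected = %.3f +/- %.3f" % (lag, I_obs, I_corr, sigma))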

6. Summary

Estimates of the systematic and standard error on observed entropies and mutual informations have been derived.
The result for entropy is
H_\infty \approx H_{obs} + \frac{B^* - 1}{2N} \pm \sigma_H,    (39)

\sigma_H = \sqrt{ \frac{1}{N} \sum_{k=1}^{B} ( \ln q_k + H_{obs} )^2 q_k (1 - q_k) },    (40)

where q_i is the observed distribution of states and B^* is the number of bins for which q_i ≠ 0. The result for mutual information is
I_\infty \approx I_{obs} + \frac{B_X^* + B_Y^* - B_{XY}^* - 1}{2N} \pm \sigma_I,    (41)

\sigma_I = \sqrt{ \frac{1}{N} \sum_{k=1}^{B_X} \sum_{l=1}^{B_Y} ( \ln q_k^X + \ln q_l^Y - \ln q_{kl} + I_{obs} )^2 q_{kl} (1 - q_{kl}) },    (42)

where q^X and q^Y are the observed distributions of X and Y, respectively; that is

q_k^X = \sum_{j=1}^{B_Y} q_{kj}, \qquad q_l^Y = \sum_{i=1}^{B_X} q_{il}.    (43)

B_X^* is the number of bins for which q_i^X ≠ 0, B_Y^* is the number of bins for which q_i^Y ≠ 0, and B_{XY}^* is the number of bins for which q_ij ≠ 0. For these formulae to give reasonable error estimates the condition max n_ij ≥ 10

should be met. This will ensure that the deviation between the observed distribution and the true distribution is small
enough for the second order expansion in ε to be appropriate.
Since estimating the errors on information theoretic functionals is not much more computationally intensive than
computing the functionals themselves, quoting the errors on entropy and mutual information should be considered
by anyone using these quantities, especially when the amount of available data is small.

Acknowledgements

The author would like to thank Hans-Peter Herzel for drawing his attention to some of the previous work in this field, and the two anonymous reviewers, whose suggestions greatly improved this paper.

References

[1] A.M. Fraser, H.L. Swinney, Independent coordinates for strange attractors from mutual information, Phys. Rev. A 33 (1986) 1134–1140.
[2] A.M. Fraser, Information and entropy in strange attractors, IEEE Trans. Inf. Theory 35 (1989) 245–262.
[3] M. Palus, Singular-value decomposition in attractor reconstruction – pitfalls and precautions, Physica D 55 (1992) 221–234.
[4] M. Palus, Information theoretic test for nonlinearity in time-series, Phys. Lett. A 175 (1993) 203–209.
[5] B. Pompe, Measuring statistical dependences in a time series, J. Stat. Phys. 73 (1993) 587–611.
[6] M. Palus, Testing for nonlinearity in weather records, Phys. Lett. A 193 (1994) 67–74.
[7] M. Palus, Testing for nonlinearity using redundancies – quantitative and qualitative aspects, Physica D 80 (1995) 186–205.
[8] M. Palus, Detecting nonlinearity in multivariate time-series, Phys. Lett. A 213 (1996) 138–147.
[9] M. Palus, Coarse-grained entropy rates for characterization of complex time-series, Physica D 93 (1996) 64–77.
[10] Y.-C. Tian, F. Gao, Extraction of delay information from chaotic time series based on information entropy, Physica D 108 (1997) 113–118.
[11] H. Herzel, W. Ebeling, A. Schmitt, Entropies of biosequences: The role of repeats, Phys. Rev. E 50 (1994) 5061–5071.
[12] G.P. Basharin, On a statistical estimate for the entropy of a sequence of independent random variables, Theory Prob. App. 4 (1959) 333–338.
[13] B. Harris, The statistical estimation of entropy in the non-parametric case, Topics Inf. Theory 16 (1975) 323–355.
[14] H. Herzel, I. Grosse, Correlations in DNA sequences: The role of protein coding segments, Phys. Rev. E 55 (1997) 800–810.
[15] M.S. Roulston, Significance testing of information theoretic functionals, Physica D 110 (1997) 62–66.
[16] H. Herzel, Complexity of symbol sequences, Syst. Anal. Model. Simul. 5 (1988) 435–444.
[17] H. Herzel, A.O. Schmitt, W. Ebeling, Finite sample effects in sequence analysis, Chaos, Solitons and Fractals 4 (1994) 97–113.
[18] P. Grassberger, Finite sample corrections to entropy and dimension estimates, Phys. Lett. A 128 (1988) 369–373.
