On Measures of Entropy and Information
Gavin E. Crooks
2017-05-07
Joint entropy  Given a joint probability distribution P(A, B), then the joint entropy is

S(A, B) ≜ − ∑_{a,b} P(A_a, B_b) ln P(A_a, B_b)   (2)

This joint entropy can be readily generalized to any number of variables,

S(A^1, A^2, . . . , A^n) = − ∑_{a_1, a_2, . . . , a_n} P(A^1_{a_1}, A^2_{a_2}, . . . , A^n_{a_n}) ln P(A^1_{a_1}, A^2_{a_2}, . . . , A^n_{a_n})

Marginal entropy  The entropy of a marginal distribution. Thus S(A), S(B), S(C), S(A, B), S(B, C) and S(A, C) are all marginal entropies of the joint entropy S(A, B, C).

Conditional entropy (or equivocation) [5, 8]  Measures how uncertain we are of A on the average when we know B.

S(A | B) ≜ − ∑_b P(B_b) ∑_a P(A_a | B_b) ln P(A_a | B_b)   (3)

The chain rule for entropies [5, 8] expands conditional entropy as a Shannon information measure,

S(A | B) = S(A, B) − S(B)

Mutual information is often notated with a semicolon, rather than a colon [5, 11]. We'll reserve the semicolon for divergences. Mutual information is the reduction in uncertainty of A due to the knowledge of B, or vice versa.

I(A : B) = S(A) − S(A | B)
         = S(B) − S(B | A)
         = S(A) + S(B) − S(A, B)
         = S(A, B) − S(A | B) − S(B | A)

The mutual information is zero if (and only if) A and B are independent (written A ⊥⊥ B), such that P(A_a, B_b) = P(A_a) P(B_b). The mutual information of an ensemble with itself (sometimes called the self-information) is the entropy,

I(A : A) = S(A)

Multivariate mutual information (co-information) [11, 12, 13, 14, 15, 3]  A multivariate generalization of the mutual information. Given a collection of probability ensembles, A = {A_1, A_2, · · · , A_{|A|}}, the multivariate mutual information is equal to an alternating signed sum of all the marginal entropies.
Here P(A) is the power set of A (the set of all subsets), and |A| is the set cardinality. Note that there are conflicting sign conventions in the literature: the multivariate mutual information is sometimes defined with the opposite sign for odd cardinalities (see interaction information (8)).

The single variable case is equal to the entropy, I(A) = S(A), the binary case is equal to the standard mutual information (6), and the ternary case is

I(A : B : C) ≜ ∑_{a,b,c} P(A_a, B_b, C_c) ln [ P(A_a, B_b) P(A_a, C_c) P(B_b, C_c) / ( P(A_a, B_b, C_c) P(A_a) P(B_b) P(C_c) ) ]   (7)

For three or more variables the mutual information can be positive, negative, or zero, whereas for one or two variables the mutual information is non-negative. For zero variables the mutual information is zero, I(∅) = 0.

Conversely, the joint entropies can be expanded in terms of the multivariate mutual informations,

S(A) = I(A)
S(A, B) = I(A) + I(B) − I(A : B)
S(A, B, C) = I(A) + I(B) + I(C) − I(A : B) − I(B : C) − I(A : C) + I(A : B : C)

Or generally,

S(A_1, A_2, . . . , A_{|A|}) = − ∑_{B ∈ P(A)} (−1)^{|B|} I(B_1 : · · · : B_{|B|})

The multivariate self-information is equal to the entropy for any cardinality,

I(A : A : A : · · · : A) = S(A)

[Figure 1: Two-variable information diagrams [2, 3], illustrating S(A), S(B), S(A, B), I(A : B), S(A | B), and S(B | A).]
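As a concrete illustration of the measures above, here is a minimal Python sketch (numpy is assumed; the helper names are ours and not part of the original note) that computes entropies, the mutual information, and the co-information directly from a joint distribution.

```python
# Minimal sketch of entropy, mutual information, and co-information (nats).
import numpy as np
from itertools import combinations

def entropy(p):
    """Shannon entropy of a joint distribution given as an n-dimensional array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def marginal(p, axes):
    """Marginal distribution over the given tuple of axes."""
    drop = tuple(i for i in range(p.ndim) if i not in axes)
    return p.sum(axis=drop)

def co_information(p):
    """I(A1 : ... : An) as an alternating signed sum of marginal entropies."""
    n = p.ndim
    total = 0.0
    for k in range(1, n + 1):
        for axes in combinations(range(n), k):
            total -= (-1) ** k * entropy(marginal(p, axes))
    return total

# A small three-variable joint distribution P(A, B, C)
p = np.random.dirichlet(np.ones(8)).reshape(2, 2, 2)

S_A  = entropy(marginal(p, (0,)))
S_B  = entropy(marginal(p, (1,)))
S_AB = entropy(marginal(p, (0, 1)))
I_AB = S_A + S_B - S_AB                   # I(A:B) = S(A) + S(B) - S(A,B)
print("I(A:B)   =", I_AB)
print("I(A:B:C) =", co_information(p))    # may be positive, negative, or zero
print("I(A) = S(A):", np.isclose(co_information(marginal(p, (0,))), S_A))
```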
[Figure: Three-variable information diagrams, illustrating I(A : B, C), I(A : B | C), I(A : C | B), and I(B : C | A).]

The conditional mutual information can be expanded in terms of conditional entropies,

I(A : B | C) = S(A | C) − S(A | B, C)
             = S(B | C) − S(B | A, C)
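A quick numerical check of this identity, under the same assumptions as the earlier sketch (numpy, illustrative helper names):

```python
# Check that I(A:B|C) = S(A|C) - S(A|B,C) = S(B|C) - S(B|A,C).
import numpy as np

def S(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.random.dirichlet(np.ones(12)).reshape(2, 3, 2)   # joint P(A, B, C)

S_C   = S(p.sum(axis=(0, 1)))
S_AC  = S(p.sum(axis=1))
S_BC  = S(p.sum(axis=0))
S_ABC = S(p)

# Conditional entropies from the chain rule: S(X|Y) = S(X,Y) - S(Y)
I_AB_given_C = (S_AC - S_C) - (S_ABC - S_BC)
print(np.isclose(I_AB_given_C, (S_BC - S_C) - (S_ABC - S_AC)))   # True
```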
Measures the total amount of randomness localized to individual variables.

Total correlation (multi-information, multivariate constraint, redundancy) [11, 12, 22, 14]

TotalCorr(A_1, A_2, . . . , A_n) ≜ S(A_1) + S(A_2) + · · · + S(A_n) − S(A_1, A_2, . . . , A_n)   (12)

The total amount of information carried by correlations between the variables. Quantifies the total correlation or redundancy. Equal to the mutual information when n = 2. The independence bound on entropy states that the total correlation is non-negative.

Roughly speaking, the relative entropy measures the difference between two distributions, although it is not a metric since it is not symmetric [D(A ∥ B) ≠ D(B ∥ A) in general], nor does it obey the triangle inequality. Note that the two distributions must have the same sample space, and that we take as convention that 0 ln 0 = 0.

One interpretation of relative entropy is that it represents an encoding cost [8]: if we encode messages using an optimal code for a probability distribution P(B_x) of messages x, but the messages actually arrive with probabilities P(A_x), then each message requires, on average, an additional D(A ∥ B) nats to encode compared to the optimal encoding.

The mutual information (5) is the relative entropy between the joint and marginal product distributions.

³ Despite misinformation to the contrary, this uncertainty coefficient is not related to Theil's U-statistic.
⁴ Note that our notation for relative entropy is uncommon. Following [8], many authors instead directly supply the distributions as arguments, e.g. D(p(x) ∥ q(x)).
[Figures: Four-variable information diagrams over A, B, C, D, labelled from S(∅) to the four-way co-information I(A : B : C : D).]
Let the random variables (Â, B̂) be independent, but with the same marginals as (A, B), i.e. P(Â, B̂) = P(A)P(B). Then

I(A : B) = D(A, B ∥ Â, B̂)

Similarly, for three or more variables, the relative entropy between the joint and marginal product distributions is the total correlation (12).

Burg entropy [28]:

Burg(B) ≜ ∑_x ln P(B_x)   (17)

Relative joint entropy

D(A, B ∥ A′, B′) ≜ ∑_{x,y} P(A_x, B_y) ln [ P(A_x, B_y) / P(A′_x, B′_y) ]   (18)

We can generalize any Shannon information measure to a relative Shannon information measure by combining appropriate linear combinations of relative joint entropies.

Relative conditional entropy (conditional relative entropy, conditional Kullback-Leibler divergence) [8]

D(A | B ∥ A′ | B′) ≜ ∑_{x,y} P(A_x, B_y) ln [ P(A_x | B_y) / P(A′_x | B′_y) ]   (19)

Since the relative conditional entropy is non-negative, it follows that

D(A, B ∥ A′, B′) ⩾ D(B ∥ B′)

Relative mutual information [1]

D(A : B ∥ A′ : B′) = I(A : B)  if A′ ⊥⊥ B′

Note that the uncertainty coefficient (14) is also sometimes called the relative mutual information. Similarly, the conditional mutual relative entropy is

D(A : B | C ∥ A′ : B′ | C′) ≜ ∑_z P(C_z) ∑_{x,y} P(A_x, B_y | C_z) ln [ P(A_x, B_y | C_z) P(A′_x | C′_z) P(B′_y | C′_z) / ( P(A′_x, B′_y | C′_z) P(A_x | C_z) P(B_y | C_z) ) ]   (21)

We could continue this insanity by generalizing the conditional mutual relative entropy to many variables.

Relative relative entropy [1]

D_{(C ∥ D)}(A ∥ B) ≜ ∑_x P(A_x) ln [ P(A_x) P(D_x) / ( P(B_x) P(C_x) ) ]   (22)
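A small numerical sketch (ours) of the relative joint entropy (18) and relative conditional entropy (19); it checks the bound quoted above, and also the chain rule D(A, B ∥ A′, B′) = D(B ∥ B′) + D(A | B ∥ A′ | B′) that underlies it.

```python
# Relative joint and relative conditional entropies, and the bound (18) >= D(B||B').
import numpy as np

def rel_joint(pab, qab):
    """D(A,B || A',B') for joint distributions given as 2-d arrays."""
    m = pab > 0
    return np.sum(pab[m] * np.log(pab[m] / qab[m]))

def rel_conditional(pab, qab):
    """D(A|B || A'|B'): expectation over P(A,B) of ln[P(A|B)/P(A'|B')]."""
    pb, qb = pab.sum(axis=0), qab.sum(axis=0)
    p_cond, q_cond = pab / pb, qab / qb       # columns indexed by B
    m = pab > 0
    return np.sum(pab[m] * np.log(p_cond[m] / q_cond[m]))

pab = np.random.dirichlet(np.ones(6)).reshape(2, 3)
qab = np.random.dirichlet(np.ones(6)).reshape(2, 3)

d_joint = rel_joint(pab, qab)
d_B = rel_joint(pab.sum(axis=0, keepdims=True), qab.sum(axis=0, keepdims=True))
print(d_joint >= d_B)                                          # the quoted bound
print(np.isclose(d_joint, d_B + rel_conditional(pab, qab)))    # chain rule check
```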
Jeffreys entropy  The Jeffreys entropy (Jeffreys divergence, J-divergence or symmetrized Kullback-Leibler divergence) [29, 25] is a symmetrized relative entropy (15).

Jeffreys(A; B) ≜ ½ D(A ∥ B) + ½ D(B ∥ A)   (23)
             = ½ ∑_x P(A_x) ln [ P(A_x) / P(B_x) ] + ½ ∑_x P(B_x) ln [ P(B_x) / P(A_x) ]
             = ½ ∑_x ( P(A_x) − P(B_x) ) ln [ P(A_x) / P(B_x) ]

This measure is symmetric and non-negative, but not a metric since it does not obey the triangle inequality. The Jeffreys entropy is a symmetric f-divergence (34), C_f(A; B) with f(t) = ½ (t − 1) ln t.

Note that the Jeffreys divergence is sometimes defined as D(A ∥ B) + D(B ∥ A), which is twice the value of the definition used here.

Jensen-Shannon entropy (Jensen difference, information radius, capacitory discrimination⁶)  The mean relative entropy between two distributions and the distribution mean [31, 30]

JS(A; B) ≜ ½ ∑_x P(A_x) ln [ P(A_x) / ( ½ P(A_x) + ½ P(B_x) ) ] + ½ ∑_x P(B_x) ln [ P(B_x) / ( ½ P(A_x) + ½ P(B_x) ) ]   (24)
        = ½ D(A ∥ M) + ½ D(B ∥ M)
        = S(M) − ½ S(A) − ½ S(B),

where

P(M_x) = ½ P(A_x) + ½ P(B_x)

One interpretation of the Jensen-Shannon entropy is in terms of a Bayesian inference problem [32]: given a sample taken from one of two probability distributions, the Jensen-Shannon entropy is the average information the sample provides about the identity of the distribution. The divergence is equal to zero only if the two distributions are identical, and therefore indistinguishable, and reaches its maximum value of ln 2 nats (i.e. 1 bit) if the two distributions do not overlap and therefore are perfectly distinguishable from a single sample.

The Jeffreys and Jensen-Shannon entropies are related by the inequalities [31]

0 ⩽ JS(A; B) ⩽ ½ Jeffreys(A; B).

Jensen-Shannon divergence (skewed Jensen-Shannon divergence)

JS_α(A; B) ≜ (1 − α) D(A ∥ M) + α D(B ∥ M),   (25)
P(M) = (1 − α) P(A) + α P(B).

Jensen-Shannon information (generalized Jensen-Shannon divergence) [0]

JS_Θ(A_1; A_2; · · · ; A_n) ≜ ∑_{α=1,n} P(Θ_α) D(A_α ∥ M),   (26)
P(M_x) = ∑_α P(Θ_α) P(A^α_x).

The entropy of a mixed distribution is the average entropy of the components plus the Jensen-Shannon entropy [30]:

S(M) = JS_Θ(A_1; A_2; · · · ; A_n) + ∑_α P(Θ_α) S(A_α)

The multivariate Jensen-Shannon entropy is the mutual information between the mixing ensemble Θ and the mixed ensemble M.

I(Θ : M) ≜ ∑_{α,x} P(Θ_α, M_x) ln [ P(Θ_α, M_x) / ( P(Θ_α) P(M_x) ) ]
        = − ∑_x P(M_x) ln P(M_x) + ∑_{α,x} P(Θ_α) P(M_x | Θ_α) ln P(M_x | Θ_α)
        = S(M) − ∑_α P(Θ_α) S(A_α)
        = JS_Θ(A_1; A_2; · · · ; A_n)

Resistor-average entropy [33]:

ResAvg(A; B) ≜ 1 / [ 1/D(A ∥ B) + 1/D(B ∥ A) ]   (27)

The harmonic mean of forward and reversed relative entropies.

⁶ Capacitory discrimination is defined as twice the Jensen-Shannon divergence by Topsøe [30].

4 Rényi information

Rényi information (Rényi entropy, alpha-order entropy) [34]  A one parameter generalization of the Shannon entropy.

Rényi_α(A) ≜ [ 1 / (1 − α) ] ln ∑_x P(A_x)^α   (28)

Interesting special cases of the Rényi information include the Hartley entropy (α = 0), collision entropy (α = 2), Shannon entropy (α = 1), and min entropy (α = ∞). See also: Rényi divergence (46).
Collision entropy (Rényi information of order 2, second order entropy) [35]

CollisionEntropy(A) ≜ − ln ∑_x P(A_x)²   (29)
                   = Rényi_2(A)

A special case of the Rényi information. The negative log probability (58) that two independent samples from the distribution are the same.

Min-entropy [34]

MinEntropy(A) ≜ − ln max_x P(A_x)   (30)

Hartley entropy (Hartley function, max-entropy, Boltzmann entropy) [36]  The logarithm of the number of distinct possibilities.

f-divergence (Csiszár-Morimoto, Ali-Silvey, or ϕ-divergence) [42, 43, 44, 45, 46, 33]

C_f(A; B) = ∑_x P(A_x) f( P(B_x) / P(A_x) ),   (34)

where the function f is convex and f(1) = 0. This implies C_f(A; B) ⩾ 0 from an application of Jensen's inequality. Examples already encountered include the relative, Jeffreys, and Jensen-Shannon entropies (see table 2). Note that the first argument to the f-divergence is the distribution to be averaged over and appears in the denominator of the ratio. The opposite convention also occurs.

Convex functions are closed under conical combinations: the function f(x) = c_1 f_1(x) + c_2 f_2(x) + . . . + c_N f_N(x) is convex if each function of the mixture f_n is convex and each constant c_n ⩾ 0 is non-negative. It follows that a positive linear sum of f-divergences is also an f-divergence [47].

Dual f-divergence [47]  The dual of an f-divergence is defined by swapping the arguments.
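A generic f-divergence is easy to sketch numerically. The following code (ours, with illustrative names; numpy assumed) implements C_f(A; B) from (34) and checks two of the examples mentioned above against their direct formulas:

```python
# Generic Csiszar f-divergence C_f(A;B) = sum_x P(A_x) f(P(B_x)/P(A_x)).
import numpy as np

def f_divergence(a, b, f):
    """Assumes strictly positive P(A) for simplicity."""
    t = b / a
    return np.sum(a * f(t))

a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.5, 0.3])

# f(t) = -ln t gives the relative entropy D(A||B)
kl = f_divergence(a, b, lambda t: -np.log(t))
print(np.isclose(kl, np.sum(a * np.log(a / b))))

# f(t) = 1 - sqrt(t) gives the Hellinger discrimination (38)
hel = f_divergence(a, b, lambda t: 1 - np.sqrt(t))
print(np.isclose(hel, 1 - np.sum(np.sqrt(a * b))))

# Non-negativity follows from Jensen's inequality for any convex f with f(1) = 0
print(kl >= 0 and hel >= 0)
```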
Table 2: Selected f-divergences C_f(A; B) (34) and their defining functions f(t).

  Eq.   Divergence                  f(t)
  (36)  K-divergence                ln [ 2 / (1 + t) ]
  (39)  Pearson divergence          ½ (t − 1)²
  (48)  Cressie-Read divergence     ( t^{−α} − 1 ) / ( α (α + 1) )
  (49)  Tsallis divergence          ( t^{1−α} − 1 ) / ( α − 1 )
  (41)  LeCam discrimination        (t − 1)² / ( 2 (t + 1) )
  (23)  Jeffreys entropy            ½ (t − 1) ln t
  (24)  Jensen-Shannon entropy      ½ ln [ 2 / (1 + t) ] + ½ t ln [ 2t / (1 + t) ]
  (51)  Variational distance        ½ | t − 1 |
  (38)  Hellinger discrimination    1 − √t
We include the halves in these definitions so that symmetrized f-divergences are invariant under symmetrization. We call these the Jeffreys and Jensen symmetrizations respectively [1], because a Jeffreys symmetrization of the relative entropy (15) leads to the Jeffreys entropy (23), and a Jensen symmetrization gives the Jensen-Shannon entropy (24).

K-divergence [31]

KayDiv(A; B) ≜ ∑_x P(A_x) ln [ P(A_x) / ( ½ P(A_x) + ½ P(B_x) ) ]   (36)
            = D(A ∥ M),   P(M_x) = ½ P(A_x) + ½ P(B_x)
            = C_f(A; B),   f(t) = ln [ 2 / (1 + t) ].

Of interest since the K-divergence is a lower bound to the relative entropy [31],

KayDiv(A; B) ⩽ ½ D(A ∥ B)

and the Jeffreys symmetrization of the K-divergence is the Jensen-Shannon entropy,

JS(A; B) = ½ KayDiv(A; B) + ½ KayDiv(B; A)
         = ½ D(A ∥ M) + ½ D(B ∥ M)

Fidelity (Bhattacharyya coefficient, Hellinger affinity) [48]  The Bhattacharyya distance (57) and the Hellinger divergence and distance (38) are functions of fidelity. The name derives from usage in quantum information theory [49].

Fidelity(A; B) ≜ ∑_x √( P(A_x) P(B_x) )   (37)

The range is [0, 1], with unity only if the two distributions are identical. Fidelity is not itself an f-divergence (the required function f(t) = √t isn't convex), but is directly related to the Hellinger divergence (38) and Bhattacharyya distance (57).

Hellinger discrimination (squared Hellinger distance) [50, 0]

HellingerDiv(A; B) ≜ ½ ∑_x ( √P(A_x) − √P(B_x) )²   (38)
                  = C_f(A; B),   f(t) = 1 − √t
                  = 1 − ∑_x √( P(A_x) P(B_x) )
                  = 1 − Fidelity(A; B)

Symmetric, with range [0, 1]. A common alternative normalization omits the one-half prefactor. The name originates from that of the corresponding integral in the continuous case [51].

Pearson divergence (χ²-divergence, chi square divergence, Pearson chi square divergence, Kagen divergence, quadratic divergence, least squares) [52]

Pearson(A; B) ≜ ½ ∑_x ( P(B_x) − P(A_x) )² / P(A_x)   (39)
             = ½ ∑_x P(A_x) ( P(B_x)/P(A_x) − 1 )²
             = C_f(A; B),   f(t) = ½ (t − 1)²

Neyman divergence (inverse Pearson chi square divergence) [53, 54]  The dual of the Pearson divergence (arguments switched),

Neyman(A; B) ≜ Pearson(B; A)   (40)

LeCam discrimination (Vincze-LeCam divergence, triangular discrimination) [55, 56, 30]

LeCam(A; B) ≜ ½ ∑_x ( P(A_x) − P(B_x) )² / ( P(A_x) + P(B_x) )   (41)
           = C_f(A; B),   f(t) = (t − 1)² / ( 2 (t + 1) ).

The LeCam divergence is a Jensen symmetrized Pearson divergence,

LeCam(A; B) = ½ Pearson(A; M) + ½ Pearson(B; M),   P(M_x) = ½ P(A_x) + ½ P(B_x)

Skewed K-divergence [31]

KayDiv_α(A; B) ≜ ∑_x P(A_x) ln [ P(A_x) / ( (1 − α) P(A_x) + α P(B_x) ) ]   (42)
              = D(A ∥ M),   P(M) = (1 − α) P(A) + α P(B)

Alpha-Jensen-Shannon entropy [31, 57]

AlphaJS_α(A; B) ≜ ½ KayDiv_α(A; B) + ½ KayDiv_α(B; A)   (43)

AlphaJS_0(A; B) = 0
AlphaJS_{½}(A; B) = JS(A; B)
AlphaJS_1(A; B) = Jeffreys(A; B)

The Jeffreys symmetrization of the skewed K-divergence.
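A quick numerical check (ours; numpy assumed) of the special cases of the alpha-Jensen-Shannon entropy listed above:

```python
# Skewed K-divergence (42) and alpha-Jensen-Shannon entropy (43) special cases.
import numpy as np

def D(p, q):
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / q[m]))

def kay(a, b, alpha):
    """KayDiv_alpha(A;B) = D(A || M), with M = (1 - alpha) A + alpha B."""
    return D(a, (1 - alpha) * a + alpha * b)

def alpha_js(a, b, alpha):
    return 0.5 * kay(a, b, alpha) + 0.5 * kay(b, a, alpha)

a = np.array([0.7, 0.2, 0.1])
b = np.array([0.2, 0.3, 0.5])

js       = 0.5 * D(a, 0.5*(a + b)) + 0.5 * D(b, 0.5*(a + b))
jeffreys = 0.5 * D(a, b) + 0.5 * D(b, a)

print(np.isclose(alpha_js(a, b, 0.0), 0.0))        # alpha = 0
print(np.isclose(alpha_js(a, b, 0.5), js))         # alpha = 1/2: Jensen-Shannon
print(np.isclose(alpha_js(a, b, 1.0), jeffreys))   # alpha = 1: Jeffreys
```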
6 Chernoff divergence

Chernoff divergence  The Chernoff divergence [58, 59] of order α is defined as

Chernoff_α(A; B) ≜ − ln ∑_x P(A_x) ( P(A_x) / P(B_x) )^{α−1}   (44)
                = − ln [ C_f(A; B) + 1 ],   f(t) = t^{1−α} − 1.

The Chernoff divergence is zero for α = 1 and α = 0, and reaches a maximum, the Chernoff information [58, 8], for some intermediate value of alpha. The Chernoff divergence is well defined for α > 1 if P(B_x) > 0 whenever P(A_x) > 0, and for α < 0 if P(A_x) > 0 whenever P(B_x) > 0, and thus defined for all α if the distributions have the same support.

The Chernoff divergence of order α is related to the Chernoff divergence of order 1 − α with the distributions interchanged [59],

Chernoff_α(A; B) = Chernoff_{1−α}(B; A)

This relation always holds for α ∈ [0, 1], and for all α when the distributions have the same support.

Chernoff coefficient (alpha divergence) [58, 59, 60]

ChernoffCoefficient_α(A; B) ≜ ∑_x P(A_x) ( P(A_x) / P(B_x) )^{α−1}   (45)
                           = exp( − Chernoff_α(A; B) )
                           = C_f(A; B) + 1,   f(t) = t^{1−α} − 1

The exponential twist density [61] is a way of mixing two distributions [62] to form a third,

P(C_x) = (1 / Z_α) P(A_x)^α P(B_x)^{1−α}

Here α is a mixing parameter between 0 and 1. The normalization constant Z_α is the Chernoff coefficient, Z_α = ChernoffCoefficient_α(A; B) [63].

Rényi divergence  The Rényi divergence (or relative Rényi entropy) of order α is a one-parameter generalization of the relative entropy [34],

Rényi_α(A; B) ≜ [ 1 / (α − 1) ] ln ∑_x P(A_x) ( P(A_x) / P(B_x) )^{α−1}   (46)
             = [ 1 / (1 − α) ] Chernoff_α(A; B)
             = [ 1 / (α − 1) ] ln [ C_f(A; B) + 1 ],   f(t) = t^{1−α} − 1.

Higher values of α give a Rényi divergence dominated by the greatest ratio between the two distributions, whereas as α approaches zero the Rényi divergence weighs all possibilities more equally, regardless of their dissimilarities. We recover the relative entropy in the limit α → 1.

Interesting special cases of the Rényi divergence occur for α = 0, ½, 1 and ∞. As previously mentioned, α = 1 gives the relative entropy (15), and α = ½ gives the Bhattacharyya distance (57). In the limit α → 0, the Rényi divergence slides to the negative log probability under q that p is non-zero,

lim_{α→0} Rényi_α(A; B) = − ln lim_{α→0} ∑_x P(A_x)^α P(B_x)^{1−α}
                       = − ln ∑_x P(B_x) [ P(A_x) > 0 ].

Here we have used the Iverson bracket, [a], which evaluates to 1 if the condition inside the bracket is true, and 0 otherwise. If the two distributions have the same support then the α → 0 Rényi divergence is zero.

Alpha-divergence [0, 0]  In the most widespread parameterization,

D_α(A; B) ≜ [ 1 / ( α (1 − α) ) ] ( 1 − ∑_x P(A_x)^α P(B_x)^{1−α} )   (47)

The alpha-divergence is self-dual,

D_α(A; B) = D_{1−α}(B; A)

Special cases include the Neyman and Pearson divergences, the Hellinger discrimination, and the relative entropy.

D_α(A; B) =  Pearson(A; B)           α = −1
             D(B ∥ A)                lim α → 0
             4 HellingerDiv(A; B)    α = +½
             D(A ∥ B)                lim α → +1
             Neyman(A; B)            α = +2

Another common parameterization is [0, 0]

D_{α′}(A; B) ≜ [ 4 / (1 − α′²) ] ( 1 − ∑_x P(A_x)^{(1−α′)/2} P(B_x)^{(1+α′)/2} )

where α′ = 1 − 2α. This parameterization has the advantage that the duality corresponds to negating the parameter,

D_{+α′}(A; B) = D_{−α′}(B; A)
Cressie-Read divergence [54]

CressieRead_α(A; B) ≜ [ 1 / ( α (α + 1) ) ] ∑_x P(A_x) [ ( P(A_x) / P(B_x) )^α − 1 ]   (48)
                   = [ 1 / ( α (α + 1) ) ] [ e^{α Rényi_{α+1}(A;B)} − 1 ]
                   = [ 1 / (α + 1) ] Tsallis_{α+1}(A; B)
                   = C_f(A; B),   f(t) = ( t^{−α} − 1 ) / ( α (α + 1) ).

Tsallis divergence (relative Tsallis entropy) [64]  Other closely related divergences include the relative Tsallis entropy,

Tsallis_α(A; B) ≜ [ 1 / (α − 1) ] ∑_x P(A_x) [ ( P(A_x) / P(B_x) )^{α−1} − 1 ]   (49)
               = [ 1 / (α − 1) ] [ e^{(α−1) Rényi_α(A;B)} − 1 ]
               = C_f(A; B),   f(t) = ( t^{1−α} − 1 ) / ( α − 1 ).

The Sharma-Mittal divergence encompasses the Cressie-Read divergence (β = 1 − α(α + 1)), the Rényi divergence (β → 1), the Tsallis divergence (β → 0), and the relative entropy (β, α → 1).

7 Distances

Variational distance (L1 distance, variational divergence)

V(A; B) = L1(A; B) ≜ ½ ∑_x | P(B_x) − P(A_x) |   (51)

The only f-divergence, C_f(A; B) with f(t) = ½ | t − 1 | (34), which is also a metric [65]. Pinsker's inequality:

D(A ∥ B) ⩾ 2 V(A; B)²

Total variational distance [0]  The largest possible difference between the probabilities that the two distributions can assign to the same event. Equal to twice the variational distance.

Euclidean distance (L2 distance)

L2(A; B) ≜ √( ∑_x ( P(B_x) − P(A_x) )² )   (52)

It is sometimes useful to treat a probability distribution as a vector in a Euclidean vector space, and therefore consider Euclidean distances between probability distributions.

Minkowski distance

L_p(A; B) ≜ ( ∑_x | P(B_x) − P(A_x) |^p )^{1/p}   (53)

A metric distance provided that p ⩾ 1.

Chebyshev distance

L_∞(A; B) ≜ max_x | P(B_x) − P(A_x) |   (54)

Jensen-Shannon distance is the square root of the Jensen-Shannon divergence, and is a metric between probability distributions [32, 67].

Bhattacharyya distance [68]  The Chernoff divergence of order α = ½.

Bhattacharyya(A; B) ≜ − ln ∑_x √( P(A_x) P(B_x) )   (57)
                   = Chernoff_{½}(A; B)
                   = − ln Fidelity(A; B).

8 Specific information

A specific (or point-wise, or local) entropy is the entropy associated with a single event, as opposed to the average entropy over the entire ensemble [13]. A common convention is to use lower-cased function names for specific information measures; s for S, i for I.
However, since the expectation of a single proposition is equal to the value associated with the event, we can also write S(A_x) ≜ s(A_x), and I(A_x : B_y) ≜ i(A_x : B_y). With this notation we can express point-wise measures corresponding to all the ensemble measures defined previously, without having to create a host of new notation. We can also write multivariate specific informations averaged over only one ensemble, not both, e.g. I(A : B_x) [26].

Specific entropy (self-information, score, surprise, or surprisal) is the negative logarithm of a probability.

s(A_x) ≜ − ln P(A_x)   (58)

The expectation of the specific entropy is the Shannon entropy, the average point-wise entropy of the ensemble (1).

S(A) = E[ s(A) ] = ∑_x P(A_x) s(A_x)

where the expectation operator E is

E[ f(A) ] = ∑_a P(A_a) f(A_a)

Specific mutual information (point-wise mutual information) [26, 13, 69]
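As a final illustration (ours, not the note's; numpy assumed), the point-wise measures and their expectations can be sketched directly:

```python
# Specific entropy (58) and point-wise mutual information; their expectations
# recover the ensemble measures S(A) and I(A:B).
import numpy as np

pab = np.random.dirichlet(np.ones(6)).reshape(2, 3)   # joint P(A, B)
pa = pab.sum(axis=1)
pb = pab.sum(axis=0)

s_a  = -np.log(pa)                                    # specific entropy s(A_x)
i_ab = np.log(pab / np.outer(pa, pb))                 # point-wise mutual information i(A_x : B_y)

S_A  = np.sum(pa * s_a)                               # S(A)  = E[s(A)]
I_AB = np.sum(pab * i_ab)                             # I(A:B) = E[i(A:B)]

S_B  = -np.sum(pb * np.log(pb))
S_AB = -np.sum(pab * np.log(pab))
print(np.isclose(I_AB, S_A + S_B - S_AB))             # consistent with the ensemble identities
```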
divergence; and total variational distance. Reordered operator precedence. Expanded and refined collection of relative Shannon entropies. Extensive miscellaneous improvements.

0.3 (2016-04-21) Added residual entropy (11) (and combined with independent information, which is essentially another name for the same thing), and uncertainty coefficient (14). Added information diagrams. Miscellaneous minor improvements. Fixed sign error in lautum information (13) (Kudos: Ryan James)

0.2 (2015-06-26) Improved presentation and fixed miscellaneous minor errors.

0.1 (2015-01-22) Initial release: over 50 information measures described.

References

(Recursive citations mark neologisms and other innovations [1].)

[0] [citation needed]. (pages 9, 9, 10, 11, 13, 14, 14, 14, 14, and 15).

[1] Gavin E. Crooks. On measures of entropy and information. Technical report, (2017). Tech. Note 009 v0.6, http://threeplusone.com/info. (pages 2, 9, 9, 9, 13, and 16).

[2] Raymond W. Yeung. A new outlook on Shannon's information measures. IEEE Trans. Inf. Theory, 37(3):466–474 (1991). doi:10.1109/18.79902. (pages 2, 4, and 5).

[11] William J. McGill. Multivariate information transmission. Psychometrika, 19(2):97–116 (1954). doi:10.1007/BF02289159. (pages 3, 3, 4, 5, and 6).

[12] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM J. Res. Develop., 4(1):66–82 (1960). doi:10.1147/rd.41.0066. (pages 3, 4, and 6).

[13] Robert Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA (1961). (pages 3, 5, 15, and 16).

[14] Anthony J. Bell. The co-information lattice. In Proc. 4th Int. Symp. Independent Component Analysis and Blind Source Separation (2003). (pages 3 and 6).

[15] Yaneer Bar-Yam. Multiscale complexity / entropy. Adv. in Complex Sys., 7:47–63 (2004). doi:10.1142/S0219525904000068. (page 3).

[16] Elad Schneidman, Susanne Still, Michael J. Berry, II, and William Bialek. Network information and connected correlations. Phys. Rev. Lett., 91:238701 (2003). doi:10.1103/PhysRevLett.91.238701. (page 4).

[17] Te Sun Han. Nonnegative entropy measures of multivariate symmetric correlations. Inf. Control, 36:133–156 (1978). doi:10.1016/S0019-9958(78)90275-9. (page 6).

[18] Samer A. Abdallah and Mark D. Plumbley. A measure of statistical complexity based on predictive information with application to finite spin systems. Phys. Lett. A, 376(4):275–281 (2012). doi:10.1016/j.physleta.2011.10.066. (page 6).

[19] Phipps Arabie and Scott A. Boorman. Multidimensional scaling of measures of distance between partitions. J. Math. Psy., 10(148-203) (1973). doi:10.1016/0022-2496(73)90012-6. (page 6).

[20] James P. Crutchfield. Information and its metric. In L. Lam and H. C. Morris, editors, Nonlinear Structures in Physical Systems – Pattern Formation, Chaos and Waves, pages 119–130. Springer-Verlag, New York (1990). (page 6).

[21] Sergio Verdú and Tsachy Weissman. The information lost in erasures. IEEE Trans. Inf. Theory, 54:5030–5058 (2008). doi:10.1109/TIT.2008.929968. (page 6).

[22] Te Sun Han. Multiple mutual information and multiple interactions in frequency data. Inf. Control, 46:26–45 (1980). doi:10.1016/S0019-9958(80)90478-7. (page 6).

[23] Daniel P. Palomar and Sergio Verdú. Lautum information. IEEE Trans. Inf. Theory, 54(3):964–975 (2008). (page 6).

[24] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, New York, 3rd edition (2007). (page 6).

[25] Solomon Kullback and Richard A. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79–86 (1951). doi:10.1214/aoms/1177729694. (pages 6 and 10).

[26] Irving J. Good. Some terminology and notation in information theory. Proc. Inst. Elec. Eng. C, 103(3):200–204 (1956). doi:10.1049/pi-c.1956.0024. (pages 9, 16, 16, and 16).

[27] David F. Kerridge. Inaccuracy and inference. J. Roy. Statist. Soc. Sec. B, 23:184–194 (1961). http://www.jstor.org/stable/2983856. (page 9).

[28] John Parker Burg. The relationship between maximum entropy spectra and maximum likelihood spectra. Geophysics, 37:375–376 (1972). doi:10.1190/1.1440265. (page 9).

[29] Harold Jeffreys. Theory of probability. Clarendon Press, Oxford, 2nd edition (1948). (page 10).

[30] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory, 46(4):1602–1609 (2000). doi:10.1109/18.850703. (pages 10, 10, 10, and 13).

[31] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory, 37:145–151 (1991). doi:10.1109/18.61115. (pages 10, 10, 13, 13, 13, and 13).

[32] Dominik M. Endres and Johannes E. Schindelin. A new metric for probability distributions. IEEE Trans. Inf. Theory, 49(7):1858–1860 (2003). doi:10.1109/TIT.2003.813506. (pages 10 and 15).

[33] Don H. Johnson and Sinan Sinanović. Symmetrizing the Kullback-Leibler distance. Technical report, Rice Univ., Houston, TX (2001). http://www.ece.rice.edu/~dhj/resistor.pdf. (pages 10 and 11).

[34] Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematics, statistics and probability, volume 1, pages 547–561 (1960). http://projecteuclid.org/euclid.bsmsp/1200512181. (pages 10, 11, 14, and 16).

[35] Charles H. Bennett, François Bessette, Gilles Brassard, Louis Salvail, and John Smolin. Experimental quantum cryptography. J. Cryptology, 5:3–28 (1992). doi:10.1007/BF00191318. (page 11).

[36] Ralph V. L. Hartley. Transmission of information. Bell Syst. Tech. J., 7:535–563 (1928). doi:10.1002/j.1538-7305.1928.tb01236.x. (page 11).

[37] Jan Havrda and František Charvát. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika, 3:30–35 (1967). http://dml.cz/dmlcz/125526. (page 11).

[38] Zoltán Daróczy. Generalized information functions. Inf. Control, 16(1):36–51 (1970). doi:10.1016/S0019-9958(70)80040-7. (page 11).

[39] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52:479–487 (1988). doi:10.1007/BF01016429. (page 11).

[40] Bhu Dev Sharma and D. P. Mittal. New non-additive measures of entropy for discrete probability distributions. J. Math. Sci. (Soc. Math. Sci., Delhi, India), 10(28-40) (1975). (pages 11 and 15).

[41] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma-Mittal entropy of exponential families. J. Phys. A, 45(3):032003 (2012). doi:10.1088/1751-8113/45/3/032003. (pages 11 and 15).

[42] S. M. Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B, 28(1):131–142 (1966). (page 11).

[43] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:299–318 (1967). (page 11).
[44] Tetsuzo Morimoto. Markov processes and the H-theorem. J. Phys. Soc. Jpn, 18:328–331 (1963). doi:10.1143/JPSJ.18.328. (page 11).

[45] Inder J. Taneja. Refinement inequalities among symmetric divergence measures. Aust. J. Math. Anal. Appl., 2(1):8 (2005). (page 11).

[46] Alexander N. Gorban, Pavel A. Gorban, and George Judge. Entropy: The Markov ordering approach. Entropy, 12:1145–1193 (2010). doi:10.3390/e12051145. (page 11).

[47] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. J. Mach. Learn. Res., 12:731–817 (2011). http://jmlr.csail.mit.edu/papers/v12/reid11a.html. (pages 11, 11, and 11).

[48] Anil K. Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā, 7:401–406 (1946). (page 13).

[49] Richard Jozsa. Fidelity for mixed quantum states. J. Mod. Opt., 41:2315–2323 (1994). doi:10.1080/09500349414552171. (page 13).

[50] Shizuo Kakutani. On equivalence of infinite product measures. Ann. Math., 49(1):214–224 (1948). https://www.jstor.org/stable/1969123. (page 13).

[51] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math., 136:210–271 (1909). doi:10.1515/crll.1909.136.210. (page 13).

[52] Karl Pearson. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. Ser. 5, 50(302):157–175 (1900). doi:10.1080/14786440009463897. (page 13).

[53] Jerzy Neyman. Contribution to the theory of the χ2 test. In Proceedings of the [first] Berkeley symposium on mathematics, statistics and probability, pages 239–273. Univ. of Calif. Press (1949). http://projecteuclid.org/euclid.bsmsp/1166219208. (page 13).

[54] Noel Cressie and Timothy R. C. Read. Multinomial goodness of fit. J. Roy. Statist. Soc. B, 46:440–464 (1984). http://www.jstor.org/stable/2345686. (pages 13 and 15).

[60] Shun'ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Society (2000). (page 14).

[61] James A. Bucklew. Large deviation techniques in decision, simulation, and estimation. Wiley (1990). (page 14).

[62] Anand G. Dabak and Don H. Johnson. Relations between Kullback-Leibler distance and Fisher information (2001). http://www.ece.rice.edu/~dhj/distance.pdf. (page 14).

[63] John C. Baez. Rényi entropy and free energy. arXiv:1102.2098. (page 14).

[64] Lisa Borland, Angel R. Plastino, and Constantino Tsallis. Information gain within nonextensive thermostatistics. J. Math. Phys., 39(12):6490–6501 (1998). doi:10.1063/1.532660. Erratum: 40 2196 (1999). (page 15).

[65] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Trans. Fund. Elec. Comm. Comp. Sci., E90-A:1848–1853 (2007). doi:10.1093/ietfec/e90-a.9.1848. (page 15).

[66] P. Kafka, Ferdinand Österreicher, and I. Vincze. On powers of f-divergences defining a distance. Studia Sci. Math. Hungar., 26:415–422 (1991). (page 15).

[67] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its statistical applications. Ann. Inst. Statist. Math., 55(3):639–653 (2003). doi:10.1007/BF02517812. (page 15).

[68] Anil K. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35(1):99–109 (1943). (page 15).

[69] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information and lexicography. Comp. Ling., 16(1):22–29 (1990). (page 16).

[70] Sung-Hyuk Cha. Comprehensive survey on distance/similarity measures between probability density functions. Int. J. Math. Mod. Meth. Appl. Sci., 1(4):300–307 (2007). http://www.naun.org/main/NAUN/ijmmas/mmmas-49.pdf. (page 16).