On Measures of Entropy and Information

Tech. Note 009 v0.6


http://threeplusone.com/info

Gavin E. Crooks

2017-05-07

Contents

0  Notes on notation and nomenclature
1  Entropy
   Entropy · Joint entropy · Marginal entropy · Conditional entropy · Perplexity
2  Mutual information
   Mutual information · Multivariate mutual information · Interaction information · Conditional mutual information · Binding information · Residual entropy · Total correlation · Lautum information · Uncertainty coefficient
3  Relative entropy
   Relative entropy · Cross entropy · Burg entropy · Relative joint entropy · Relative conditional entropy · Relative mutual information · Relative conditional mutual information · Relative relative entropy · Jeffreys entropy · Jensen-Shannon entropy · Jensen-Shannon divergence · Jensen-Shannon information · Resistor-average entropy
4  Rényi information
   Rényi information · Collision entropy · Min-entropy · Hartley entropy · Tsallis information · Sharma-Mittal information
5  Csiszár f-divergences
   Csiszár f-divergence · Dual f-divergence · Symmetric f-divergences · K-divergence · Fidelity · Hellinger discrimination · Pearson divergence · Neyman divergence · LeCam discrimination · Skewed K-divergence · Alpha-Jensen-Shannon-entropy
6  Chernoff divergence
   Chernoff divergence · Chernoff coefficient · Rényi divergence · Alpha-divergence · Cressie-Read divergence · Tsallis divergence · Sharma-Mittal divergence
7  Distances
   Variational distance · Total variational distance · Euclidean distance · Minkowski distance · Chebyshev distance · LeCam distance · Hellinger distance · Jensen-Shannon distance · Bhattacharyya distance
8  Specific information
   Specific entropy · Specific mutual information
Back Matter
   Acknowledgments · Version history · Bibliography
O M  E  I, G. E. C ()

0 Notes on notation and nomenclature

Information measures  An information measure is, broadly speaking, any function of one or more probability distributions. We'll specifically reserve the term entropy for measures that have units of entropy — negative logarithms of probabilities, or a linear combination of the same. The units of entropy depend on the base of the logarithm: for instance bits ("binary digits"), nats ("natural units"), or bans, for bases 2, e, or 10 respectively. Herein we'll use natural logarithms. Note that, by convention, 0 ln 0 = 0. In many cases, tradition doth dictate that an entropic measure be referred to as an information, e.g. "mutual information" rather than "mutual entropy". But we'll eschew the not uncommon practice of using the term "entropy" to refer to generic (non-entropic) information measures.

Ensembles  We'll use upper case letters for ensembles, A, B, C, etc. An ensemble (or probability space) consists of a complete and mutually exclusive set of propositions, {A_a}, indexed by the sample space a ∈ Ω_A. The probability of a proposition is written as P(A_a), and the conditional probability as P(A_a | B_b). For instance, A_a could represent the propositions {A = a}, where A is a random variable. Any sum is implicitly over the entire relevant sample space. When different ensembles have different sample spaces, we'll use corresponding lower case letters for samples, a for A, b for B, and so on. But when ensembles share the same sample space, we'll use some other lower case letter, typically x, y, z. For instance, compare the definitions of mutual information (5) (different sample spaces) with relative entropy (15) (same sample spaces). We will not employ the common shorthand of labeling distributions by the sample alone, e.g. P(x) for P({A = x}), precisely because we need to deal with multiple distributions with the same sample space in the same expression.

Collections of ensembles  Since we have used subscripts to index propositions, we'll use superscripts to index collections of ensembles, A = {A^1, A^2, ..., A^|A|}. We never need to take powers of ensembles or propositions, so there is little risk of ambiguity. Other useful notation includes P(A), the power set of A (the set of all subsets); |A|, the set cardinality; ∅ for the empty set; and A \ A, the set complement (difference) of A.

Naming and notating  Information measures are given CamelCased function names, unless an unambiguous short function name is in common usage. Commas between ensembles (or propositions) denote conjugation (logical and); a colon ':' the mutual information (5) "between" ensembles; a double bar '∥' the relative entropy (15) of one ensemble "relative to" another; a semicolon ';' for any other comparison of ensembles; and a bar '|' denotes conditioning ("given"). We'll use the operator precedence (high to low) ',', ':', '|', '∥', ';' to obviate excessive bracketing [1]. Sample spaces are the same on either side of double bars '∥' or semicolons ';', but different across bars '|' and colons ':'. Measures are symmetric to interchange of the ensembles across commas and colons, i.e. S(A, B, C) is the same as S(C, B, A) and I(A : B : C) is the same as I(B : C : A).

Table 1: Information measure symbology. Operators are listed in order of precedence, from high (top) to low (bottom).

symbol   usage              commutative
,        conjugation        yes
:        mutual             yes
|        conditional        no
∥        relative entropy   no
;        divergence         no

Information diagrams  Information diagrams (see Figs. 1, 2, 3 and 4) are a graphical display of multivariate Shannon information measures [2, 3].

[Information diagram illustrating I(A, B : C | D) over the four sets A, B, C, D.]

These are not Venn-diagrams per se, since the individual regions can have positive or negative weight. The regions of an information diagram corresponding to a particular information measure can be deduced by mapping joint distributions 'A, B' to the union of sets A ∪ B, mutual measures 'A : B' to the intersection of sets A ∩ B, and conditional 'A|B' to set complement A \ B. For instance, the conditional mutual information I(A, B : C | D) corresponds to the region ((A ∪ B) ∩ C) \ D.

Dissimilarity  An information-theoretic divergence is a measure of dissimilarity between a pair of ensembles that is non-negative and zero if (and only if) the distributions are identical. Since divergences are not symmetric to their arguments in general, we can also define the dual divergence d*(A; B) = d(B; A). We'll refer to a symmetric divergence as a discrimination.¹ By distance we mean a metric distance: a measure that is non-negative; symmetric; zero if (and only if) the distributions are identical (reflective); and obeys the triangle inequality, d(A; B) + d(B; C) ⩾ d(A; C).

¹ The terms disparity, discrimination, and divergence are used essentially interchangeably as synonyms of dissimilarity. The useful distinction that a discrimination is a symmetric divergence appears to be novel, but consistent with practical usage in much of the literature.


1 Entropy

Entropy (Shannon entropy, Gibbs entropy)  A measure of the inherent uncertainty or randomness of a single random variable.

S(A) ≜ −∑_a P(A_a) ln P(A_a)    (1)

In information theory the entropy is typically denoted by the symbol H, a notation that dates back to Boltzmann and his H-theorem [4], and adopted by Shannon [5]. The notation S is due to Clausius and the original discovery of entropy in thermodynamics [6], and adopted by Gibbs [7] for use in statistical mechanics. I tend to use S since I care about the physics of information, and the symbol H is oft needed to denote the Hamiltonian.

Entropy is occasionally referred to as the self-entropy, since entropy is equal to the mutual information between a distribution and itself, S(A) = I(A : A). This is distinct from the specific entropy (58), which is also sometimes referred to as the self-information.

Entropies are non-negative and bounded.

0 ⩽ S(A) ⩽ ln |Ω_A|

Joint entropy  Given a joint probability distribution P(A, B) then the joint entropy is

S(A, B) ≜ −∑_{a,b} P(A_a, B_b) ln P(A_a, B_b)    (2)

This joint entropy can be readily generalized to any number of variables.

S(A^1, A^2, ..., A^n) = −∑_{a1,a2,...,an} P(A^1_{a1}, A^2_{a2}, ..., A^n_{an}) ln P(A^1_{a1}, A^2_{a2}, ..., A^n_{an})

Marginal entropy  The entropy of a marginal distribution. Thus S(A), S(B), S(C), S(A, B), S(B, C) and S(A, C) are all marginal entropies of the joint entropy S(A, B, C).

Conditional entropy (or equivocation) [5, 8]  Measures how uncertain we are of A on the average when we know B.

S(A | B) ≜ −∑_b P(B_b) ∑_a P(A_a | B_b) ln P(A_a | B_b)    (3)

The chain rule for entropies [5, 8] expands conditional entropy as a Shannon information measure.

S(A, B) = S(A | B) + S(B)

This follows from the probability chain rule P(A_a, B_b) = P(B_b | A_a) P(A_a). Since entropies are always non-negative, one immediate consequence is that conditioning always reduces entropy, S(A | B) ⩽ S(A).

Bayes' rule for probabilities is the relation P(A_a | B_b) = P(B_b | A_a) P(A_a) / P(B_b). In entropic terms the equivalent statement is

S(A | B) = S(B | A) + S(A) − S(B)

Perplexity [9]

Perplexity(A) ≜ e^{S(A)}    (4)

The exponentiation of the entropy (with the same base).
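As a concrete illustration of equations (1)–(4), here is a minimal Python sketch (not part of the original note; the toy joint distribution and helper names are invented) that evaluates these entropies in nats:

```python
from math import log, exp

# Joint distribution P(A_a, B_b) over a toy sample space (hypothetical example).
P_AB = {('a0', 'b0'): 0.4, ('a0', 'b1'): 0.1,
        ('a1', 'b0'): 0.2, ('a1', 'b1'): 0.3}

def entropy(p):
    """Shannon entropy S = -sum p ln p, in nats; 0 ln 0 = 0 by convention."""
    return -sum(q * log(q) for q in p.values() if q > 0)

# Marginal distributions P(A_a) and P(B_b)
P_A, P_B = {}, {}
for (a, b), p in P_AB.items():
    P_A[a] = P_A.get(a, 0.0) + p
    P_B[b] = P_B.get(b, 0.0) + p

S_AB = entropy(P_AB)                   # joint entropy S(A, B), eq. (2)
S_A, S_B = entropy(P_A), entropy(P_B)  # marginal entropies, eq. (1)
S_A_given_B = S_AB - S_B               # conditional entropy via the chain rule
perplexity_A = exp(S_A)                # eq. (4)

print(S_A, S_B, S_AB, S_A_given_B, perplexity_A)
```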
2 Mutual information

Mutual information (mutual entropy, transinformation)² [5, 8]

I(A : B) ≜ ∑_{a,b} P(A_a, B_b) ln [ P(A_a, B_b) / (P(A_a) P(B_b)) ]    (5)

Mutual information is oft notated with a semicolon, rather than a colon [5, 11]. We'll reserve the semicolon for divergences.

Mutual information is the reduction in uncertainty of A due to the knowledge of B, or vice versa.

I(A : B) = S(A) − S(A | B)
         = S(B) − S(B | A)
         = S(A) + S(B) − S(A, B)
         = S(A, B) − S(A | B) − S(B | A)

The mutual information is zero if (and only if) A and B are independent (written A ⊥⊥ B), such that P(A_a, B_b) = P(A_a) P(B_b).

The mutual information of an ensemble with itself (sometimes called the self-information) is the entropy.

I(A : A) = S(A)

² "I didn't like the term Information Theory. Claude didn't like it either. You see, the term 'information theory' suggests that it is a theory about information – but it's not. It's the transmission of information, not information. Lots of people just didn't understand this… I coined the term 'mutual information' to avoid such nonsense: making the point that information is always about something. It is information provided by something, about something." – Robert Fano [10]

Multivariate mutual information (co-information) [11, 12, 13, 14, 15, 3]: A multivariate generalization of the mutual information. Given a collection of probability ensembles, A = {A^1, A^2, ..., A^|A|}, the multivariate mutual information is equal to an alternating signed sum of all the marginal entropies.

I(A^1 : A^2 : ··· : A^|A|) ≜ −∑_{B∈P(A)} (−1)^|B| S(B^1, B^2, ..., B^|B|)    (6)

Here P(A) is the power set of A (the set of all subsets), and |A| is the set cardinality. Note that there are conflicting sign conventions in the literature: the multivariate mutual information is sometimes defined with the opposite sign for odd cardinalities (see interaction information (8)).

The single variable case is equal to the entropy, I(A) = S(A), the binary case is equal to the standard mutual information (5), and the ternary case is

I(A : B : C) ≜ ∑_{a,b,c} P(A_a, B_b, C_c) ln [ P(A_a, B_b) P(A_a, C_c) P(B_b, C_c) / (P(A_a, B_b, C_c) P(A_a) P(B_b) P(C_c)) ]    (7)

For three or more variables the mutual information can be positive, negative, or zero, whereas for one or two variables the mutual information is non-negative. For zero variables the mutual information is zero, I(∅) = 0.

The mutual information defines a partitioning of the total, multivariate joint entropy into single variable, binary, ternary, and higher order shared entropies.

S(A) = I(A)
S(A, B) = I(A) + I(B) − I(A : B)
S(A, B, C) = I(A) + I(B) + I(C)
             − I(A : B) − I(B : C) − I(A : C)
             + I(A : B : C)

Or generally,

S(A^1, A^2, ..., A^|A|) = −∑_{B∈P(A)} (−1)^|B| I(B^1 : ··· : B^|B|)

The triplet interaction information is the information that a pair of variables provides about the third, compared to the information that each provides separately [11, 12, 16].

I(A : B : C) = I(A : B) + I(A : C) − I(A : B, C)

The multivariate self-information is equal to the entropy for any cardinality.

I(A : A : A : ··· : A) = S(A)

Figure 1: Two-variable information diagrams [2, 3]. [Panels: S(∅), S(A), S(B), S(A, B), I(A : B), S(A | B), S(B | A), residual.]

Interaction information (synergy, mutual information) [11]: An alternative sign convention for multivariate mutual information. The interaction information is equal in magnitude to the multivariate information, but has the opposite sign for an odd number of ensembles.

Int(A^1 : A^2 : ··· : A^n) ≜ (−1)^n I(A^1 : A^2 : ··· : A^n)    (8)

The sign convention used above for multivariate information generally makes more sense.

Conditional mutual information [13]  The average mutual information between A and B given C.

I(A : B | C) ≜ ∑_c P(C_c) ∑_{a,b} P(A_a, B_b | C_c) ln [ P(A_a, B_b | C_c) / (P(A_a | C_c) P(B_b | C_c)) ]    (9)

I(A : B | C) = S(A | C) − S(A | B, C)
             = S(B | C) − S(B | A, C)

The conditional mutual information is non-negative. If the conditional mutual information I(A : B | C) is zero, then A and B are conditionally independent given C (written A ⊥⊥ B | C). Conversely, conditional independence implies that the conditional mutual information is zero.

A ⊥⊥ B | C ⇐⇒ I(A : B | C) = 0

Conditional independence implies that P(A_a, B_b | C_c) = P(A_a | C_c) P(B_b | C_c) for all a, b, c.

The chain rule for mutual information is

I(A : B, C) = I(A : C) + I(A : B | C)

The data processing inequality states that if A and C are conditionally independent, given B (as happens when you have a Markov chain A → B → C), then

I(A : B) ⩾ I(A : C) given I(A : C | B) = 0

Figure 2: Three-variable information diagrams [2, 3]. Note that the central region I(A : B : C) can be negative (these are not Venn diagrams) and that we shade regions in proportion to their multiplicity, e.g. total correlation. [Panels: I(A : B | C); I(A : B : C); I(A : B | C), I(A : C | B), I(B : C | A); I(A : B, C); binding; residual.]
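A short numerical check (my own sketch, not from the note; the joint distribution is invented) of the conditional mutual information (9) and the chain rule for mutual information:

```python
from math import log

# Hypothetical joint distribution P(A, B, C), invented for illustration.
P = {(0, 0, 0): 0.30, (0, 1, 1): 0.20, (1, 0, 1): 0.15, (1, 1, 0): 0.10,
     (0, 0, 1): 0.05, (0, 1, 0): 0.05, (1, 0, 0): 0.05, (1, 1, 1): 0.10}

def marginal(joint, axes):
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in axes)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(p):
    return -sum(q * log(q) for q in p.values() if q > 0)

def S(*axes):
    return entropy(marginal(P, axes))

# I(A : B | C) = S(A, C) + S(B, C) - S(A, B, C) - S(C), eq. (9) via entropies.
I_AB_given_C = S(0, 2) + S(1, 2) - S(0, 1, 2) - S(2)
I_AC = S(0) + S(2) - S(0, 2)
I_A_BC = S(0) + S(1, 2) - S(0, 1, 2)

# Chain rule: I(A : B, C) = I(A : C) + I(A : B | C); the difference should be ~0.
print(I_AB_given_C, I_A_BC - (I_AC + I_AB_given_C))
```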


Proof: I(A : B) − I(A : C) = I(A : B | C) − I(A : C | B), but I(A : C | B) is zero, and I(A : B | C) is positive.

Binding information (dual total correlation) [17, 3, 18]

Binding(A) ≜ S(A) − ∑_{A∈A} S(A | A \ A)    (10)

Here A \ A is the set complement of A.

Binding(A : B) = I(A : B)
Binding(A : B : C) = I(A, B : A, C : B, C)
Binding(A : B : C : D) = I(A, B, C : A, B, D : A, C, D : B, C, D)

Residual entropy (erasure entropy, independent information, variation of information, shared information distance) [19, 20, 21, 3]

Residual(A) ≜ ∑_{A∈A} S(A | A \ A)    (11)
           = S(A) − Binding(A)

Measures the total amount of randomness localized to individual variables.

Total correlation (multi-information, multivariate constraint, redundancy) [11, 12, 22, 14]

TotalCorr(A^1, A^2, ..., A^n) ≜ S(A^1) + S(A^2) + ··· + S(A^n) − S(A^1, A^2, ..., A^n)    (12)

The total amount of information carried by correlations between the variables. Quantifies the total correlation or redundancy. Equal to the mutual information when n = 2. The independence bound on entropy states that the total correlation is non-negative.

Lautum information [23]:

Lautum(A; B) ≜ ∑_{a,b} P(A_a) P(B_b) ln [ P(A_a) P(B_b) / P(A_a, B_b) ]    (13)

Much like the mutual information, but with the roles of joint and marginal product distributions swapped. (Lautum is mutual spelled backwards.)

Uncertainty coefficient (relative mutual information) [24]³

UncertaintyCoeff(A; B) ≜ I(A : B) / S(A) = 1 − S(A | B) / S(A)    (14)

Given B, the fraction of the information we can predict about A.

³ Despite misinformation to the contrary, this uncertainty coefficient is not related to Theil's U-statistic.

3 Relative entropy

Relative entropy (Kullback-Leibler divergence, KL-divergence, KL-distance, "dee-kay-ell", information gain, logarithmic divergence, information divergence) [25, 8]⁴

D(A ∥ B) ≜ ∑_x P(A_x) ln [ P(A_x) / P(B_x) ]    (15)

Roughly speaking, the relative entropy measures the difference between two distributions, although it is not a metric since it is not symmetric [D(A ∥ B) ≠ D(B ∥ A) in general], nor does it obey the triangle inequality. Note that the two distributions must have the same sample space, and that we take as convention that 0 ln 0 = 0.

⁴ Note that our notation for relative entropy is uncommon. Following [8], many authors instead directly supply the distributions as arguments, e.g. D(p(x)∥q(x)).

One interpretation of relative entropy is that it represents an encoding cost [8]: if we encode messages using an optimal code for a probability distribution P(B_x) of messages x, but the messages actually arrive with probabilities P(A_x), then each message requires, on average, an additional D(A ∥ B) nats to encode compared to the optimal encoding.

The mutual information (5) is the relative entropy between the joint and marginal product distributions. Let the random variables (Â, B̂) be independent, but with the same marginals as (A, B), i.e. P(Â, B̂) = P(A)P(B). Then

I(A : B) = D(A, B ∥ Â, B̂)
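A minimal sketch (not from the note; the joint distribution and helper names are mine) of the relative entropy (15), with a numerical check that the mutual information equals D(A, B ∥ Â, B̂) for the product-of-marginals reference distribution:

```python
from math import log

def relative_entropy(p, q):
    """D(p || q) = sum p ln(p/q), in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * log(pi / q[x]) for x, pi in p.items() if pi > 0)

# Hypothetical joint distribution P(A_a, B_b).
P_AB = {('a0', 'b0'): 0.4, ('a0', 'b1'): 0.1,
        ('a1', 'b0'): 0.2, ('a1', 'b1'): 0.3}

P_A, P_B = {}, {}
for (a, b), p in P_AB.items():
    P_A[a] = P_A.get(a, 0.0) + p
    P_B[b] = P_B.get(b, 0.0) + p

# Independent reference distribution with the same marginals.
P_hat = {(a, b): P_A[a] * P_B[b] for a in P_A for b in P_B}

# Mutual information computed directly from eq. (5) ...
I_AB = sum(p * log(p / (P_A[a] * P_B[b])) for (a, b), p in P_AB.items() if p > 0)

# ... equals the relative entropy between the joint and the product of marginals.
print(I_AB, relative_entropy(P_AB, P_hat))
```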


Figure 3: Components of four-variable information diagrams. [Panels, over the sets A, B, C, D: S(∅); S(A | B, C, D), S(B | A, C, D), S(C | A, B, D), S(D | A, B, C); I(A : B | C, D), I(A : C | B, D), I(A : D | B, C), I(B : C | A, D), I(B : D | A, C), I(C : D | A, B); I(A : B : C | D), I(A : B : D | C), I(B : C : D | A), I(A : C : D | B); I(A : B : C : D).]


Figure 4: Information diagrams for four-variable mutual information. [Panels, over the sets A, B, C, D: S(∅); S(A), S(B), S(C), S(D); I(A : B), I(A : C), I(A : D), I(B : C), I(B : D), I(C : D); I(A : B : C), I(A : B : D), I(A : C : D), I(B : C : D); I(A : B : C : D).]


Similarly, for three or more variables, the relative entropy between the joint and marginal product distributions is the total correlation (12).

TotalCorr(A^1, A^2, ..., A^n) = D(A^1, A^2, ..., A^n ∥ Â^1, Â^2, ..., Â^n)

The Lautum information (13) is

Lautum(A : B) = D(Â, B̂ ∥ A, B)

Cross entropy (inaccuracy) [26, 27]:

CrossEnt(A; B) ≜ −∑_x P(A_x) ln P(B_x)    (16)
             = S(A) + D(A ∥ B)

The cross entropy measures the average number of bits needed to identify events that occur with probability P(A_x), if a coding scheme is used that is optimal for the probability distribution P(B_x).

Burg entropy [28]:

Burg(B) ≜ ∑_x ln P(B_x)    (17)

Proportional to the cross entropy with a uniform source distribution.
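A small sketch (mine; the two distributions are invented) checking the decomposition CrossEnt(A; B) = S(A) + D(A ∥ B) of equation (16):

```python
from math import log

# Two hypothetical distributions on the same sample space.
P = {'x0': 0.5, 'x1': 0.3, 'x2': 0.2}
Q = {'x0': 0.2, 'x1': 0.5, 'x2': 0.3}

entropy   = lambda p: -sum(v * log(v) for v in p.values() if v > 0)
cross_ent = lambda p, q: -sum(v * log(q[x]) for x, v in p.items() if v > 0)
rel_ent   = lambda p, q: sum(v * log(v / q[x]) for x, v in p.items() if v > 0)

# CrossEnt(A; B) = S(A) + D(A || B), eq. (16); the difference should be ~0.
print(cross_ent(P, Q), entropy(P) + rel_ent(P, Q),
      cross_ent(P, Q) - (entropy(P) + rel_ent(P, Q)))
```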

Relative joint entropy

D(A, B ∥ A′, B′) ≜ ∑_{x,y} P(A_x, B_y) ln [ P(A_x, B_y) / P(A′_x, B′_y) ]    (18)

We can generalize any Shannon information measure to a relative Shannon information measure by combining appropriate linear combinations of relative joint entropies⁵.

⁵ The reason for extending the standard notations to this larger class of relative Shannon information measures is that many such measures turn up naturally, for instance when considering the non-equilibrium thermodynamics of strongly coupled systems [0, 0].

Relative conditional entropy (conditional relative entropy, conditional Kullback-Leibler divergence) [8]

D(A | B ∥ A′ | B′) ≜ ∑_{x,y} P(A_x, B_y) ln [ P(A_x | B_y) / P(A′_x | B′_y) ]    (19)

The chain rule for relative entropy is

D(A, B ∥ A′, B′) = D(A | B ∥ A′ | B′) + D(B ∥ B′)

Since the relative conditional entropy is non-negative, it follows that

D(A, B ∥ A′, B′) ⩾ D(B ∥ B′)

Relative mutual information [1]

D(A : B ∥ A′ : B′) ≜ ∑_{x,y} P(A_x, B_y) ln [ P(A_x, B_y) P(A′_x) P(B′_y) / (P(A′_x, B′_y) P(A_x) P(B_y)) ]    (20)

The relative joint entropy can be split up into various components analogously to the joint entropy.

D(A, B ∥ A′, B′) = D(A ∥ A′) + D(B ∥ B′) + D(A : B ∥ A′ : B′)
                 = D(A | B ∥ A′ | B′) + D(B ∥ B′)
                 = D(A | B ∥ A′ | B′) + D(B | A ∥ B′ | A′) − D(A : B ∥ A′ : B′)

If the reference distributions are independent, the relative mutual information is equal to the mutual information between the principal distributions.

D(A : B ∥ A′ : B′) = I(A : B) if A′ ⊥⊥ B′

Note that the uncertainty coefficient (14) is also sometimes called the relative mutual information.

Relative conditional mutual information [1]

D(A : B | C ∥ A′ : B′ | C′) ≜ ∑_z P(C_z) ∑_{x,y} P(A_x, B_y | C_z) ln [ P(A_x, B_y | C_z) P(A′_x | C′_z) P(B′_y | C′_z) / (P(A′_x, B′_y | C′_z) P(A_x | C_z) P(B_y | C_z)) ]    (21)

We could continue this insanity by generalizing the conditional mutual relative entropy to many variables.

Relative relative entropy [1]

D((A ∥ B) ∥ (C ∥ D)) ≜ ∑_x P(A_x) ln [ P(A_x) P(D_x) / (P(B_x) P(C_x)) ]    (22)

The relative operation can be applied to relative entropy itself, leading to the recursively defined relative relative entropy. As an example, the relative mutual information can also be expressed as a relative relative entropy (just as the mutual information can be expressed as a relative entropy).

D(A : B ∥ A′ : B′) = D((A, B ∥ Â, B̂) ∥ (A′, B′ ∥ Â′, B̂′))


Jeffreys entropy  The Jeffreys entropy (Jeffreys divergence, J-divergence or symmetrized Kullback-Leibler divergence) [29, 25] is a symmetrized relative entropy (15).

Jeffreys(A; B) ≜ ½ D(A ∥ B) + ½ D(B ∥ A)    (23)
             = ½ ∑_x P(A_x) ln [ P(A_x) / P(B_x) ] + ½ ∑_x P(B_x) ln [ P(B_x) / P(A_x) ]
             = ½ ∑_x (P(A_x) − P(B_x)) ln [ P(A_x) / P(B_x) ]

This measure is symmetric and non-negative, but not a metric since it does not obey the triangle inequality. The Jeffreys entropy is a symmetric f-divergence (34), C_f(A; B) with f(t) = ½(t − 1) ln t.

Note that the Jeffreys divergence is sometimes defined as D(A ∥ B) + D(B ∥ A), which is twice the value of the definition used here.

Jensen-Shannon entropy (Jensen difference, information radius, capacitory discrimination⁶) is the mean relative entropy between two distributions and the distribution mean [31, 30]

JS(A; B) ≜ ½ ∑_x P(A_x) ln [ P(A_x) / (½ P(A_x) + ½ P(B_x)) ] + ½ ∑_x P(B_x) ln [ P(B_x) / (½ P(A_x) + ½ P(B_x)) ]    (24)
        = ½ D(A ∥ M) + ½ D(B ∥ M),
        = S(M) − ½ S(A) − ½ S(B),

where

P(M_x) = ½ P(A_x) + ½ P(B_x)

One interpretation of the Jensen-Shannon entropy is in terms of a Bayesian inference problem [32]: given a sample taken from one of two probability distributions, the Jensen-Shannon entropy is the average information the sample provides about the identity of the distribution. The divergence is equal to zero only if the two distributions are identical, and therefore indistinguishable, and reaches its maximum value of ln 2 nats (i.e. 1 bit) if the two distributions do not overlap and therefore are perfectly distinguishable from a single sample.

The Jeffreys and Jensen-Shannon entropies are related by the inequalities [31]

0 ⩽ JS(A; B) ⩽ ½ Jeffreys(A; B).

⁶ Capacitory discrimination is defined as twice the Jensen-Shannon divergence by Topsøe [30].

Jensen-Shannon divergence (skewed Jensen-Shannon divergence)

JS_α(A; B) ≜ (1 − α) D(A ∥ M) + α D(B ∥ M),    (25)
P(M) = (1 − α) P(A) + α P(B).

Jensen-Shannon information (generalized Jensen-Shannon divergence) [0]

JS_Θ(A^1; A^2; ···; A^n) ≜ ∑_{α=1,n} P(Θ_α) D(A^α ∥ M),    (26)
P(M_x) = ∑_α P(Θ_α) P(A^α_x).

The entropy of a mixed distribution is the average entropy of the components plus the Jensen-Shannon entropy [30]:

S(M) = JS_Θ(A^1; A^2; ···; A^n) + ∑_α P(Θ_α) S(A^α)

The multivariate Jensen-Shannon entropy is the mutual information between the mixing ensemble Θ and the mixed ensemble M.

I(Θ : M) ≜ ∑_{α,x} P(Θ_α, M_x) ln [ P(Θ_α, M_x) / (P(Θ_α) P(M_x)) ]
         = −∑_x P(M_x) ln P(M_x) + ∑_{α,x} P(Θ_α) P(M_x | Θ_α) ln P(M_x | Θ_α)
         = S(M) − ∑_α P(Θ_α) S(A^α)
         = JS_Θ(A^1; A^2; ···; A^n)

Resistor-average entropy [33]:

ResAvg(A; B) ≜ 1 / ( 1/D(A ∥ B) + 1/D(B ∥ A) )    (27)

The harmonic mean of forward and reversed relative entropies.
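A sketch (not the author's code; the distributions are invented) of the Jensen-Shannon entropy (24), checking the mixture form JS = S(M) − ½S(A) − ½S(B) and the ln 2 upper bound:

```python
from math import log

P = {'x0': 0.7, 'x1': 0.2, 'x2': 0.1}
Q = {'x0': 0.1, 'x1': 0.3, 'x2': 0.6}

entropy = lambda p: -sum(v * log(v) for v in p.values() if v > 0)
rel_ent = lambda p, q: sum(v * log(v / q[x]) for x, v in p.items() if v > 0)

M = {x: 0.5 * P[x] + 0.5 * Q[x] for x in P}    # mixture distribution

js_from_divergences = 0.5 * rel_ent(P, M) + 0.5 * rel_ent(Q, M)   # eq. (24)
js_from_entropies = entropy(M) - 0.5 * entropy(P) - 0.5 * entropy(Q)

print(js_from_divergences, js_from_entropies, js_from_divergences <= log(2))
```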
4 Rényi information

Rényi information (Rényi entropy, alpha-order entropy) [34]: A one parameter generalization of the Shannon entropy.

Rényi_α(A) ≜ (1/(1−α)) ln ∑_x P(A_x)^α    (28)

Interesting special cases of the Rényi information include the Hartley entropy (α = 0), collision entropy (α = 2), Shannon entropy (α = 1), and min entropy (α = ∞). See also: Rényi divergence (46).

Collision entropy (Rényi information of order 2, second order entropy) [35]

CollisionEntropy(A) ≜ −ln ∑_x P(A_x)²    (29)
                   = Rényi₂(A)

A special case of the Rényi information. The negative log probability (58) that two independent samples from the distribution are the same.

Min-entropy [34]

MinEntropy(A) ≜ −ln max_x P(A_x)    (30)

Hartley entropy (Hartley function, max-entropy, Boltzmann entropy) [36]: The logarithm of the number of distinct possibilities.

Hartley(A) ≜ ln |Ω_A|    (31)

Tsallis information [37, 38, 39] (Havrda-Charvát information, α order information)

Tsallis_α(A) ≜ (1/(α−1)) (1 − ∑_x P(A_x)^α)    (32)
            = (1/(α−1)) [1 − e^{(1−α) Rényi_α(A)}]

Sharma-Mittal information [40, 41]

SharmaMittal_{α,β}(A) ≜ (1/(β−1)) [1 − (∑_x P(A_x)^α)^{(1−β)/(1−α)}]    (33)

Assuming suitable limits are taken, the Sharma-Mittal information contains the Shannon, Rényi and Tsallis informations as special cases.

SharmaMittal_{1,1}(A) = S(A)
SharmaMittal_{α,1}(A) = Rényi_α(A)
SharmaMittal_{α,α}(A) = Tsallis_α(A)
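A sketch (mine, on an invented distribution) of the Rényi information (28) and its named special cases; α = 1 is handled by falling back to the Shannon entropy, and a large finite α is used to approach the min-entropy:

```python
from math import log

P = [0.5, 0.25, 0.125, 0.125]   # hypothetical distribution

def renyi(p, alpha):
    """Rényi entropy of order alpha, in nats (eq. 28); alpha = 1 gives Shannon."""
    if abs(alpha - 1.0) < 1e-12:
        return -sum(q * log(q) for q in p if q > 0)
    return log(sum(q ** alpha for q in p if q > 0)) / (1.0 - alpha)

print(renyi(P, 0))      # Hartley entropy, ln |support|, eq. (31)
print(renyi(P, 1))      # Shannon entropy
print(renyi(P, 2))      # collision entropy, -ln sum p^2, eq. (29)
print(renyi(P, 200))    # large alpha approaches the min-entropy, eq. (30)
print(-log(max(P)))     # min-entropy for comparison
```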
5 Csiszár f-divergences

Csiszár f-divergence  Many interesting divergence measures between probability distributions can be written as (or related to) an f-divergence (also known as Csiszár⁷, Csiszár-Morimoto, Ali-Silvey, or ϕ-divergence) [42, 43, 44, 45, 46, 33].

C_f(A; B) = ∑_x P(A_x) f( P(B_x) / P(A_x) ),    (34)

where the function f is convex ⌣ and f(1) = 0. This implies C_f(A; B) ⩾ 0 from an application of Jensen's inequality. Examples already encountered include the relative, Jeffreys, and Jensen-Shannon entropies (see table 2). Note that the first argument to the f-divergence appears in the denominator of the ratio and is the distribution to be averaged over. The opposite convention also occurs⁸.

⁷ Pronounced che-sar.
⁸ I, myself, have used the opposite convention on other occasions, but, on reflection, this way around makes more sense.

Convex functions are closed under conical combinations: the function f(x) = c₁f₁(x) + c₂f₂(x) + ... + c_N f_N(x) is convex if each function of the mixture f_n is convex and each constant c_n ⩾ 0 is non-negative. It follows that a positive linear sum of f-divergences is also an f-divergence [47].

Dual f-divergence [47]  The dual of an f-divergence is defined by swapping the arguments.

C_f(A; B) = C*_f(B; A) = C_{f*}(B; A)    (35)

Here f* is the Csiszár dual [47] of a function, f*(x) = x f(1/x). For instance, if f(x) = −ln(x) then the f-divergence is the relative entropy C_f(A; B) = D(A ∥ B), with dual function f*(x) = x ln(x), and dual divergence the dual (or reverse) relative entropy, C_{f*}(A; B) = D(B ∥ A).

Symmetric f-divergences [0]  Many instances of the f-divergence are symmetric under interchange of the two distributions,

C_f(A; B) = C_f(B; A).

This implies that f(x) = x f(1/x).

There are two common methods for symmetrizing an asymmetric f-divergence: the Jeffreys symmetrization, where we average over interchanged distributions,

C_f(A; B) = ½ C_g(A; B) + ½ C_g(B; A)
f(t) = ½ g(t) + ½ t g(t⁻¹)

and the Jensen symmetrization, where we take the average divergence of both distributions to the average distribution.

C_f(A; B) = ½ C_h(A; M) + ½ C_h(B; M)
P(M_x) = ½ P(A_x) + ½ P(B_x)
f(t) = ½ h(½ + ½t) + ½ t h(½ + ½t⁻¹)


Table 2: Csiszár f-divergences (§5)

Asymmetric f-divergences          f(t)                               f*(t)
(15) relative entropy             −ln t                              t ln t
(36) K-divergence                 ln [2/(1+t)]                       t ln [2t/(1+t)]
(39) Pearson divergence           (t − 1)²                           (1 − t)²/t
(48) Cressie-Read divergence      (t^{−α} − 1)/(α(α+1))
(49) Tsallis divergence           (t^{1−α} − 1)/(α − 1)

Symmetric f-divergences           f(t)                               Jeffreys g(t)     Jensen h(t)
(41) LeCam discrimination         (t − 1)²/(2(t+1))                                    (t − 1)²
(23) Jeffreys entropy             ½(t − 1) ln t                      −ln t
(24) Jensen-Shannon entropy       ½ ln[2/(1+t)] + ½ t ln[2t/(1+t)]   ln [2/(1+t)]      −ln t
(51) variational distance         ½|t − 1|                           ···               ···
(38) Hellinger discrimination     1 − √t                             ···               ···
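To make Table 2 and equation (34) concrete, here is a generic C_f sketch (my own naming and example distributions, not code from the note) with a few of the listed generators; it assumes both distributions are strictly positive so the ratio P(B_x)/P(A_x) is well defined:

```python
from math import log, sqrt

def f_divergence(p, q, f):
    """Csiszar f-divergence C_f(A; B) = sum_x P(A_x) f(P(B_x)/P(A_x)), eq. (34)."""
    return sum(p[x] * f(q[x] / p[x]) for x in p)

# Generators from Table 2 (t is the ratio P(B_x)/P(A_x)).
f_relative_entropy = lambda t: -log(t)             # D(A || B)
f_dual             = lambda t: t * log(t)          # D(B || A), the Csiszar dual
f_hellinger        = lambda t: 1.0 - sqrt(t)       # Hellinger discrimination (38)
f_variational      = lambda t: 0.5 * abs(t - 1.0)  # variational distance (51)

P = {'x0': 0.5, 'x1': 0.3, 'x2': 0.2}
Q = {'x0': 0.2, 'x1': 0.5, 'x2': 0.3}

for name, f in [('relative entropy', f_relative_entropy), ('dual', f_dual),
                ('Hellinger', f_hellinger), ('variational', f_variational)]:
    print(name, f_divergence(P, Q, f))
```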


We include the halves in these definitions so that symmetrized f-divergences are invariant under symmetrization. We call these the Jeffreys and Jensen symmetrizations respectively [1], because a Jeffreys symmetrization of the relative entropy (15) leads to the Jeffreys entropy (23), and a Jensen symmetrization gives the Jensen-Shannon entropy (24).

K-divergence [31]

KayDiv(A; B) ≜ ∑_x P(A_x) ln [ P(A_x) / (½ P(A_x) + ½ P(B_x)) ]    (36)
            = D(A ∥ M),  P(M_x) = ½ P(A_x) + ½ P(B_x)
            = C_f(A; B),  f(t) = ln [2/(1+t)].

Of interest since the K-divergence is a lower bound to the relative entropy [31],

KayDiv(A; B) ⩽ ½ D(A ∥ B)

and the Jeffreys symmetrization of the K-divergence is the Jensen-Shannon entropy.

JS(A; B) = ½ KayDiv(A; B) + ½ KayDiv(B; A)
        = ½ D(A ∥ M) + ½ D(B ∥ M)

Fidelity (Bhattacharyya coefficient, Hellinger affinity) [48]  The Bhattacharyya distance (57) and the Hellinger divergence and distance (38) are functions of fidelity. The name derives from usage in quantum information theory [49].

Fidelity(A; B) ≜ ∑_x √(P(A_x) P(B_x)),    (37)

The range is [0, 1], with unity only if the two distributions are identical. Fidelity is not itself an f-divergence (the required function f(t) = √t isn't convex), but is directly related to the Hellinger divergence (38) and Bhattacharyya distance (57).

Hellinger discrimination (squared Hellinger distance) [50, 0]

HellingerDiv(A; B) ≜ ½ ∑_x (√P(A_x) − √P(B_x))²    (38)
                  = ∑_x P(A_x) (1 − √(P(B_x)/P(A_x)))
                  = C_f(A; B),  f(t) = 1 − √t
                  = 1 − ∑_x √(P(A_x) P(B_x))
                  = 1 − Fidelity(A; B)

Symmetric, with range [0, 1]. A common alternative normalization omits the one-half prefactor. The name originates from that of the corresponding integral in the continuous case [51].

Pearson divergence (χ²-divergence, chi square divergence, Pearson chi square divergence, Kagen divergence, quadratic divergence, least squares) [52]

Pearson(A; B) ≜ ½ ∑_x (P(B_x) − P(A_x))² / P(A_x)    (39)
             = ½ ∑_x P(A_x) ( P(B_x)/P(A_x) − 1 )²
             = C_f(A; B),  f(t) = (t − 1)²

Neyman divergence (inverse Pearson chi square divergence) [53, 54]  The dual of the Pearson divergence (arguments switched),

Neyman(A; B) ≜ Pearson(B; A)    (40)

LeCam discrimination (Vincze-LeCam divergence, triangular discrimination) [55, 56, 30]

LeCam(A; B) ≜ ½ ∑_x (P(A_x) − P(B_x))² / (P(A_x) + P(B_x))    (41)
           = C_f(A; B),  f(t) = (t − 1)² / (2(t + 1)).

The LeCam divergence is a Jensen symmetrized Pearson divergence.

LeCam(A; B) ≜ ½ Pearson(A; M) + ½ Pearson(B; M),  P(M_x) = ½ P(A_x) + ½ P(B_x)

Skewed K-divergence [31]

KayDiv_α(A; B) ≜ ∑_x P(A_x) ln [ P(A_x) / ((1 − α) P(A_x) + α P(B_x)) ]    (42)
              = D(A ∥ M),  P(M) = (1 − α) P(A) + α P(B)

Alpha-Jensen-Shannon-entropy [31, 57]

AlphaJS_α(A; B) ≜ ½ KayDiv_α(A; B) + ½ KayDiv_α(B; A)    (43)

AlphaJS_0(A; B) = 0
AlphaJS_½(A; B) = JS(A; B)
AlphaJS_1(A; B) = Jeffreys(A; B)

The Jeffreys symmetrization of the skewed K-divergence.


6 Chernoff divergence

Chernoff divergence  The Chernoff divergence [58, 59] of order α is defined as

Chernoff_α(A; B) ≜ −ln ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1}    (44)
                = −ln [ C_f(A; B) + 1 ],  f(t) = t^{1−α} − 1.

The Chernoff divergence is zero for α = 1 and α = 0, and reaches a maximum, the Chernoff information [58, 8], for some intermediate value of alpha. The Chernoff divergence is well defined for α > 1 if P(B_x) > 0 whenever P(A_x) > 0, and for α < 0 if P(A_x) > 0 whenever P(B_x) > 0, and thus defined for all α if the distributions have the same support.

The Chernoff divergence of order α is related to the Chernoff divergence of order 1 − α with the distributions interchanged [59],

Chernoff_α(A; B) = Chernoff_{1−α}(B; A).

This relation always holds for α ∈ [0, 1], and for all α when the distributions have the same support.

Chernoff coefficient (alpha divergence) [58, 59, 60]

ChernoffCoefficient_α(A; B) ≜ ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1}    (45)
                           = exp( −Chernoff_α(A; B) )
                           = C_f(A; B) + 1,  f(t) = t^{1−α} − 1

The exponential twist density [61] is a way of mixing two distributions [62] to form a third

P(C_x) = (1/Z_α) P(A_x)^α P(B_x)^{1−α}

Here α is a mixing parameter between 0 and 1. The normalization constant Z_α is the Chernoff coefficient, Z_α = ChernoffCoefficient_α(A; B) [63].

Rényi divergence  The Rényi divergence (or relative Rényi entropy) of order α is a one-parameter generalization of the relative entropy [34],

Rényi_α(A; B) ≜ (1/(α−1)) ln ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1}    (46)
             = (1/(1−α)) Chernoff_α(A; B)
             = (1/(α−1)) ln [ C_f(A; B) + 1 ],  f(t) = t^{1−α} − 1.

Higher values of α give a Rényi divergence dominated by the greatest ratio between the two distributions, whereas as α approaches zero the Rényi entropy weighs all possibilities more equally, regardless of their dissimilarities. We recover the relative entropy in the limit of α → 1.

Interesting special cases of the Rényi divergence occur for α = 0, ½, 1 and ∞. As previously mentioned, α = 1 gives the relative entropy (15), and α = ½ gives the Bhattacharyya distance (57). In the limit α → 0, the Rényi divergence slides to the negative log probability under q that p is non-zero,

lim_{α→0} Rényi_α(A; B) = −ln lim_{α→0} ∑_x P(A_x)^α P(B_x)^{1−α}
                       = −ln ∑_x P(B_x) [P(A_x) > 0].

Here we have used the Iverson bracket, [a], which evaluates to 1 if the condition inside the bracket is true, and 0 otherwise. If the two distributions have the same support, then the α → 0 Rényi divergence is zero.

Alpha-divergence [0, 0]  In the most widespread parameterization,

D_α(A; B) ≜ (1/(α(1−α))) ( 1 − ∑_x P(A_x)^α P(B_x)^{1−α} )    (47)

The alpha-divergence is self-dual,

D_α(A; B) = D_{1−α}(B; A)

Special cases include the Neyman and Pearson divergences, the Hellinger discrimination, and the relative entropy.

D_α(A; B) =
  Pearson(A; B)            α = −1
  D(B ∥ A)                 lim α → 0
  4 HellingerDiv(A; B)     α = +½
  D(A ∥ B)                 lim α → +1
  Neyman(A; B)             α = +2

Another common parameterization is [0, 0]

D_{α′}(A; B) ≜ (4/(1 − α′²)) ( 1 − ∑_x P(A_x)^{(1−α′)/2} P(B_x)^{(1+α′)/2} )

where α′ = 1 − 2α. This parameterization has the advantage that the duality corresponds to negating the parameter.

D_{+α′}(A; B) = D_{−α′}(B; A)
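A numerical sketch (mine; distributions invented) relating the Chernoff coefficient (45), the Rényi divergence (46), and the α → 1 relative-entropy limit:

```python
from math import log

P = {'x0': 0.5, 'x1': 0.3, 'x2': 0.2}
Q = {'x0': 0.2, 'x1': 0.5, 'x2': 0.3}

def chernoff_coefficient(p, q, alpha):
    """sum_x P(A_x)^alpha P(B_x)^(1-alpha), eq. (45)."""
    return sum(p[x] ** alpha * q[x] ** (1.0 - alpha) for x in p)

def renyi_divergence(p, q, alpha):
    """Rényi divergence of order alpha, eq. (46)."""
    return log(chernoff_coefficient(p, q, alpha)) / (alpha - 1.0)

rel_ent = sum(v * log(v / Q[x]) for x, v in P.items())

print(renyi_divergence(P, Q, 0.5))    # alpha = 1/2: proportional to the Bhattacharyya distance (57)
print(renyi_divergence(P, Q, 0.999))  # close to the relative entropy D(A || B)
print(rel_ent)
```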


Cressie-Read divergence [54]

CressieRead_α(A; B) ≜ (1/(α(α+1))) ∑_x P(A_x) [ ( P(A_x)/P(B_x) )^α − 1 ]    (48)
                   = (1/(α(α+1))) [ e^{α Rényi_{α+1}(A;B)} − 1 ]
                   = (1/(α+1)) Tsallis_{α+1}(A; B)
                   = C_f(A; B),  f(t) = (t^{−α} − 1) / (α(α+1)).

Tsallis divergence (relative Tsallis entropy) [64]  Other closely related divergences include the relative Tsallis entropy,

Tsallis_α(A; B) ≜ (1/(α−1)) [ ∑_x P(A_x) ( P(A_x)/P(B_x) )^{α−1} − 1 ]    (49)
               = (1/(α−1)) [ e^{(α−1) Rényi_α(A;B)} − 1 ]
               = C_f(A; B),  f(t) = (t^{1−α} − 1) / (α − 1),

Sharma-Mittal divergence [40, 41]

SharmaMittal_{α,β}(A; B)    (50)
  ≜ (1/(1−β)) [ 1 − ( ∑_x P(A_x)^α P(B_x)^{1−α} )^{(1−β)/(1−α)} ]
  = (1/(1−β)) [ 1 − ChernoffCoefficient_α(A; B)^{(1−β)/(1−α)} ]
  α > 0, α ≠ 1, β ≠ 1

The Sharma-Mittal divergence encompasses the Cressie-Read divergence (β = 1 − α(α + 1)), Rényi divergence (β → 1), Tsallis divergence (β → 0), and the relative entropy (β, α → 1).

7 Distances

Variational distance (L₁ distance, variational divergence)

V(A; B) = L₁(A; B) ≜ ½ ∑_x |P(B_x) − P(A_x)|    (51)

The only f-divergence, C_f(A; B), f(t) = ½|t − 1| (34), which is also a metric [65].

Pinsker's inequality:

D(A ∥ B) ⩾ ½ V(A; B)²

Total variational distance [0]  The largest possible difference between the probabilities that the two distributions can assign to the same event. Equal to twice the variational distance.

Euclidean distance (L₂ distance)

L₂(A; B) ≜ √( ∑_x |P(B_x) − P(A_x)|² )    (52)

It is sometimes useful to treat a probability distribution as a vector in a Euclidean vector space, and therefore consider Euclidean distances between probability distributions.

Minkowski distance

L_p(A; B) ≜ ( ∑_x |P(B_x) − P(A_x)|^p )^{1/p}    (53)

A metric distance provided that p ⩾ 1.

Chebyshev distance

L_∞(A; B) ≜ max_x |P(B_x) − P(A_x)|    (54)

LeCam distance  The square root of the LeCam divergence (41) [66].

LeCamDist(A; B) ≜ √LeCam(A; B)    (55)

Hellinger distance  The square root of the Hellinger divergence.

Hellinger(A; B) ≜ √HellingerDiv(A; B)    (56)

Jensen-Shannon distance is the square root of the Jensen-Shannon divergence, and is a metric between probability distributions [32, 67].

Bhattacharyya distance [68]  The Chernoff divergence of order α = ½.

Bhattacharyya(A; B) ≜ −ln ∑_x √(P(A_x) P(B_x))    (57)
                   = Chernoff_½(A; B)
                   = −ln Fidelity(A; B).
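A sketch (mine; distributions invented) of a few of the distances in this section, together with a check of Pinsker's inequality:

```python
from math import log, sqrt

P = {'x0': 0.5, 'x1': 0.3, 'x2': 0.2}
Q = {'x0': 0.2, 'x1': 0.5, 'x2': 0.3}

variational = 0.5 * sum(abs(Q[x] - P[x]) for x in P)                      # eq. (51)
hellinger = sqrt(0.5 * sum((sqrt(P[x]) - sqrt(Q[x])) ** 2 for x in P))    # eq. (56)
chebyshev = max(abs(Q[x] - P[x]) for x in P)                              # eq. (54)
rel_ent = sum(P[x] * log(P[x] / Q[x]) for x in P)

# Pinsker's inequality: D(A || B) >= (1/2) V(A; B)^2
print(variational, hellinger, chebyshev, rel_ent >= 0.5 * variational ** 2)
```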
8 Specific information

A specific (or point-wise, or local) entropy is the entropy associated with a single event, as opposed to the average entropy over the entire ensemble [13].

A common convention is to use lower cased function names for specific information measures; s for S, i for I.


However, since the expectation of a single proposition is equal to the value associated with the event, we can also write S(A_x) ≜ s(A_x), and I(A_x : B_y) ≜ i(A_x : B_y). With this notation we can express point-wise measures corresponding to all the ensemble measures defined previously, without having to create a host of new notation. We can also write multivariate specific informations averaged over only one ensemble, not both, e.g. I(A : B_x) [26].

Specific entropy (self-information, score, surprise, or surprisal) is the negative logarithm of a probability.

s(A_x) ≜ −ln P(A_x)    (58)

The expectation of the specific entropy is the Shannon entropy, the average point-wise entropy of the ensemble (1).

S(A) = E[s(A)] = ∑_x P(A_x) s(A_x)

where the expectation operator E is

E[f(A)] = ∑_a P(A_a) f(A_a)

Specific mutual information (point-wise mutual information) [26, 13, 69]

i(A_x : B_y) ≜ ln [ P(A_x, B_y) / (P(A_x) P(B_y)) ]    (59)

Similarly, the expectation of the specific mutual information is the mutual information of the ensemble (5).

I(A : B) = E[i(A : B)]
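A sketch (not from the note; the joint distribution is invented) of the specific entropy (58) and pointwise mutual information (59), checking that their expectations recover the ensemble measures:

```python
from math import log

# Hypothetical joint distribution P(A_x, B_y).
P_AB = {('a0', 'b0'): 0.4, ('a0', 'b1'): 0.1,
        ('a1', 'b0'): 0.2, ('a1', 'b1'): 0.3}

P_A, P_B = {}, {}
for (a, b), p in P_AB.items():
    P_A[a] = P_A.get(a, 0.0) + p
    P_B[b] = P_B.get(b, 0.0) + p

s = lambda p: -log(p)                                   # specific entropy, eq. (58)
i = lambda a, b: log(P_AB[(a, b)] / (P_A[a] * P_B[b]))  # pointwise mutual information, eq. (59)

# Expectations recover the ensemble entropy and mutual information.
S_A = sum(P_A[a] * s(P_A[a]) for a in P_A)
I_AB = sum(p * i(a, b) for (a, b), p in P_AB.items())

print(s(P_A['a0']), i('a0', 'b0'), S_A, I_AB)
```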
Acknowledgments

Useful sources and pedagogical expositions include Cover and Thomas's classic Elements of Information Theory [8], the inestimable (for some value of truthiness) Wikipedia, and various other reviews [70, 3]. Notation and nomenclature was adapted from Good (1956) [26]. The title of this essay was shamelessly lifted from Rényi's paper on a similar subject [34].

This review is inevitably incomplete, inaccurate and otherwise imperfect — caveat emptor.

Version history

0.6 (2017-05-07) Added alpha-divergence (47) and Burg entropy (17). Corrected typos. (Kudos: Glenn Davis)
0.5 (2016-08-16) Extensive miscellaneous improvements.
0.4 (2016-06-22) Added Jensen-Shannon divergence; alpha Jensen-Shannon entropy; skewed K-divergence; dual divergence; and total variational distance. Reordered operator precedence. Expanded and refined collection of relative Shannon entropies. Extensive miscellaneous improvements.
0.3 (2016-04-21) Added residual entropy (11) (and combined with independent information, which is essentially another name for the same thing), and uncertainty coefficient (14). Added information diagrams. Miscellaneous minor improvements. Fixed sign error in lautum information (13). (Kudos: Ryan James)
0.2 (2015-06-26) Improved presentation and fixed miscellaneous minor errors.
0.1 (2015-01-22) Initial release: over 50 information measures described.

References

(Recursive citations mark neologisms and other innovations [1].)

[0] [citation needed].
[1] Gavin E. Crooks. On measures of entropy and information. Technical report (2017). Tech. Note 009 v0.6, http://threeplusone.com/info.
[2] Raymond W. Yeung. A new outlook on Shannon's information measures. IEEE Trans. Inf. Theory, 37(3):466–474 (1991). doi:10.1109/18.79902.
[3] Ryan G. James, Christopher J. Ellison, and James P. Crutchfield. Anatomy of a bit: Information in a time series observation. Chaos, 21(3):037109 (2011). doi:10.1063/1.3637494.
[4] Ludwig Boltzmann. Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen (Further studies on the thermal equilibrium of gas molecules). Sitzungsberichte Akad. Wiss., Vienna, part II, 66:275–370 (1872).
[5] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423, 623–656 (1948). doi:10.1002/j.1538-7305.1948.tb01338.x.
[6] Rudolf Clausius. Ueber verschiedene für die Anwendung bequeme Formen der Hauptgleichungen der mechanischen Wärmetheorie. Annalen der Physik und Chemie, 201(7):353–400 (1865). doi:10.1002/andp.18652010702.
[7] J. Willard Gibbs. Elementary principles in statistical mechanics. Charles Scribner's Sons, New York (1902).
[8] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley, New York, 2nd edition (2006).
[9] Frederick Jelinek. Continuous speech recognition by statistical methods. Proc. IEEE, 64(4):532–556 (1976). doi:10.1109/PROC.1976.10159.
[10] O. Aftab, P. Cheung, A. Kim, S. Thakknar, and N. Yeddanapudi. Information theory and the digital revolution (2001). http://web.mit.edu/6.933/www/Fall2001/Shannon2.pdf.


[11] William J. McGill. Multivariate information transmission. Psychometrika, 19(2):97–116 (1954). doi:10.1007/BF02289159.
[12] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM J. Res. Develop., 4(1):66–82 (1960). doi:10.1147/rd.41.0066.
[13] Robert Fano. Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA (1961).
[14] Anthony J. Bell. The co-information lattice. In Proc. 4th Int. Symp. Independent Component Analysis and Blind Source Separation (2003).
[15] Yaneer Bar-Yam. Multiscale complexity / entropy. Adv. in Complex Sys., 7:47–63 (2004). doi:10.1142/S0219525904000068.
[16] Elad Schneidman, Susanne Still, Michael J. Berry, II, and William Bialek. Network information and connected correlations. Phys. Rev. Lett., 91:238701 (2003). doi:10.1103/PhysRevLett.91.238701.
[17] Te Sun Han. Nonnegative entropy measures of multivariate symmetric correlations. Inf. Control, 36:133–156 (1978). doi:10.1016/S0019-9958(78)90275-9.
[18] Samer A. Abdallah and Mark D. Plumbley. A measure of statistical complexity based on predictive information with application to finite spin systems. Phys. Lett. A, 376(4):275–281 (2012). doi:10.1016/j.physleta.2011.10.066.
[19] Phipps Arabie and Scott A. Boorman. Multidimensional scaling of measures of distance between partitions. J. Math. Psy., 10(148-203) (1973). doi:10.1016/0022-2496(73)90012-6.
[20] James P. Crutchfield. Information and its metric. In L. Lam and H. C. Morris, editors, Nonlinear Structures in Physical Systems – Pattern Formation, Chaos and Waves, pages 119–130. Springer-Verlag, New York (1990).
[21] Sergio Verdú and Tsachy Weissman. The information lost in erasures. IEEE Trans. Inf. Theory, 54:5030–5058 (2008). doi:10.1109/TIT.2008.929968.
[22] Te Sun Han. Multiple mutual information and multiple interactions in frequency data. Inf. Control, 46:26–45 (1980). doi:10.1016/S0019-9958(80)90478-7.
[23] Daniel P. Palomar and Sergio Verdú. Lautum information. IEEE Trans. Inf. Theory, 54(3):964–975 (2008).
[24] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, New York, 3rd edition (2007).
[25] Solomon Kullback and Richard A. Leibler. On information and sufficiency. Ann. Math. Statist., 22:79–86 (1951). doi:10.1214/aoms/1177729694.
[26] Irving J. Good. Some terminology and notation in information theory. Proc. Inst. Elec. Eng. C, 103(3):200–204 (1956). doi:10.1049/pi-c.1956.0024.
[27] David F. Kerridge. Inaccuracy and inference. J. Roy. Statist. Soc. Sec. B, 23:184–194 (1961). http://www.jstor.org/stable/2983856.
[28] John Parker Burg. The relationship between maximum entropy spectra and maximum likelihood spectra. Geophysics, 37:375–376 (1972). doi:10.1190/1.1440265.
[29] Harold Jeffreys. Theory of probability. Clarendon Press, Oxford, 2nd edition (1948).
[30] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory, 46(4):1602–1609 (2000). doi:10.1109/18.850703.
[31] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory, 37:145–151 (1991). doi:10.1109/18.61115.
[32] Dominik M. Endres and Johannes E. Schindelin. A new metric for probability distributions. IEEE Trans. Inf. Theory, 49(7):1858–1860 (2003). doi:10.1109/TIT.2003.813506.
[33] Don H. Johnson and Sinan Sinanović. Symmetrizing the Kullback-Leibler distance. Technical report, Rice Univ., Houston, TX (2001). http://www.ece.rice.edu/~dhj/resistor.pdf.
[34] Alfréd Rényi. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematics, statistics and probability, volume 1, pages 547–561 (1960). http://projecteuclid.org/euclid.bsmsp/1200512181.
[35] Charles H. Bennett, François Bessette, Gilles Brassard, Louis Salvail, and John Smolin. Experimental quantum cryptography. J. Cryptology, 5:3–28 (1992). doi:10.1007/BF00191318.
[36] Ralph V. L. Hartley. Transmission of information. Bell Syst. Tech. J., 7:535–563 (1928). doi:10.1002/j.1538-7305.1928.tb01236.x.
[37] Jan Havrda and František Charvát. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika, 3:30–35 (1967). http://dml.cz/dmlcz/125526.
[38] Zoltán Daróczy. Generalized information functions. Inf. Control, 16(1):36–51 (1970). doi:10.1016/S0019-9958(70)80040-7.
[39] Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52:479–487 (1988). doi:10.1007/BF01016429.
[40] Bhu Dev Sharma and D. P. Mittal. New non-additive measures of entropy for discrete probability distributions. J. Math. Sci. (Soc. Math. Sci., Delhi, India), 10(28-40) (1975).
[41] Frank Nielsen and Richard Nock. A closed-form expression for the Sharma-Mittal entropy of exponential families. J. Phys. A, 45(3):032003 (2012). doi:10.1088/1751-8113/45/3/032003.
[42] S. M. Ali and Samuel D. Silvey. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B, 28(1):131–142 (1966).
[43] Imre Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:299–318 (1967).


[44] Tetsuzo Morimoto. Markov processes and the H-theorem. J. Phys. Soc. Jpn, 18:328–331 (1963). doi:10.1143/JPSJ.18.328.
[45] Inder J. Taneja. Refinement inequalities among symmetric divergence measures. Aust. J. Math. Anal. Appl., 2(1):8 (2005).
[46] Alexander N. Gorban, Pavel A. Gorban, and George Judge. Entropy: The Markov ordering approach. Entropy, 12:1145–1193 (2010). doi:10.3390/e12051145.
[47] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. J. Mach. Learn. Res., 12:731–817 (2011). http://jmlr.csail.mit.edu/papers/v12/reid11a.html.
[48] Anil K. Bhattacharyya. On a measure of divergence between two multinomial populations. Sankhyā, 7:401–406 (1946).
[49] Richard Jozsa. Fidelity for mixed quantum states. J. Mod. Opt., 41:2315–2323 (1994). doi:10.1080/09500349414552171.
[50] Shizuo Kakutani. On equivalence of infinite product measures. Ann. Math., 49(1):214–224 (1948). https://www.jstor.org/stable/1969123.
[51] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math., 136:210–271 (1909). doi:10.1515/crll.1909.136.210.
[52] Karl Pearson. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos. Mag. Ser. 5, 50(302):157–175 (1900). doi:10.1080/14786440009463897.
[53] Jerzy Neyman. Contribution to the theory of the χ² test. In Proceedings of the [first] Berkeley symposium on mathematics, statistics and probability, pages 239–273. Univ. of Calif. Press (1949). http://projecteuclid.org/euclid.bsmsp/1166219208.
[54] Noel Cressie and Timothy R. C. Read. Multinomial goodness of fit. J. Roy. Statist. Soc. B, 46:440–464 (1984). http://www.jstor.org/stable/2345686.
[55] István Vincze. On the concept and measure of information contained in an observation. In J. Gani and V. F. Rohatgi, editors, Contributions to Probability: A Collection of Papers Dedicated to Eugene Lukacs, pages 207–214. Academic Press, New York (1981).
[56] Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, Berlin (1986).
[57] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. arXiv:1009.4004.
[58] Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist., 23(4):493–507 (1952). doi:10.1214/aoms/1177729330.
[59] Yoshihide Kakizawa, Robert H. Shumway, and Masanobu Taniguchi. Discrimination and clustering for multivariate time series. J. Amer. Statist. Assoc., 93(441):328–340 (1998).
[60] Shun'ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Society (2000).
[61] James A. Bucklew. Large deviation techniques in decision, simulation, and estimation. Wiley (1990).
[62] Anand G. Dabak and Don H. Johnson. Relations between Kullback-Leibler distance and Fisher information (2001). http://www.ece.rice.edu/~dhj/distance.pdf.
[63] John C. Baez. Rényi entropy and free energy. arXiv:1102.2098.
[64] Lisa Borland, Angel R. Plastino, and Constantino Tsallis. Information gain within nonextensive thermostatistics. J. Math. Phys., 39(12):6490–6501 (1998). doi:10.1063/1.532660. Erratum: 40, 2196 (1999).
[65] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in f-divergences. IEICE Trans. Fund. Elec. Comm. Comp. Sci., E90-A:1848–1853 (2007). doi:10.1093/ietfec/e90-a.9.1848.
[66] P. Kafka, Ferdinand Österreicher, and I. Vincze. On powers of f-divergences defining a distance. Studia Sci. Math. Hungar., 26:415–422 (1991).
[67] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its statistical applications. Ann. Inst. Statist. Math., 55(3):639–653 (2003). doi:10.1007/BF02517812.
[68] Anil K. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc., 35(1):99–109 (1943).
[69] Kenneth W. Church and Patrick Hanks. Word association norms, mutual information and lexicography. Comp. Ling., 16(1):22–29 (1990).
[70] Sung-Hyuk Cha. Comprehensive survey on distance/similarity measures between probability density functions. Int. J. Math. Mod. Meth. Appl. Sci., 1(4):300–307 (2007). http://www.naun.org/main/NAUN/ijmmas/mmmas-49.pdf.

Copyright © 2015-2017 Gavin E. Crooks

http://threeplusone.com/info

typeset on 2017-05-07 with XeTeX version 0.9997

fonts: Trump Mediaeval (text), Euler (math)

271828183

