Volume 46
Geometry and Statistics
Handbook of Statistics
Series Editors
C.R. Rao, AIMSCS, University of Hyderabad Campus, Hyderabad, India
Geometry and
Statistics
Edited by
Frank Nielsen
Sony Computer Science Laboratories Inc.,
Tokyo, Japan
C.R. Rao
AIMSCS, University of Hyderabad Campus,
Hyderabad, India
Academic Press is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
525 B Street, Suite 1650, San Diego, CA 92101, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
125 London Wall, London, EC2Y 5AS, United Kingdom
No part of this publication may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or any information storage and
retrieval system, without permission in writing from the publisher. Details on how to seek
permission, further information about the Publisher’s permissions policies and our arrangements
with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the
Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience
broaden our understanding, changes in research methods, professional
practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge
in evaluating and using any information, methods, compounds, or experiments
described herein. In using such information or methods they should be mindful of their
own safety and the safety of others, including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or
editors, assume any liability for any injury and/or damage to persons or property as a
matter of products liability, negligence or otherwise, or from any use or operation of any
methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-323-91345-4
ISSN: 0169-7161
Contributors xi
Preface xiii
Section I
Foundations in classical geometry and analysis 1
1. Geometry, information, and complex bundles 3
Steven G. Krantz and Arni S.R. Srinivasa Rao
1. Introduction 3
2. Complex planes 5
2.1 Important implications of Liouville’s theorem 9
3. Geometric analysis and Jordan curves 13
4. Summary 17
References 17
Section II
Information geometry 105
4. Symplectic theory of heat and information
geometry 107
Frédéric Barbaresco
1. Preamble 108
2. Life and seminal work of Souriau on lie groups
thermodynamics 111
3. From information geometry to lie groups
thermodynamics 118
4. Symplectic structure of fisher metric and entropy as Casimir
function in coadjoint representation 123
4.1 Symplectic Fisher Metric structures given by Souriau
model 123
4.2 Entropy characterization as generalized Casimir invariant
function in coadjoint representation and Poisson
Cohomology 128
4.3 Koszul Poisson Cohomology and entropy
characterization 131
Section III
Advanced geometrical intuition 283
8. Parallel transport, a central tool in geometric statistics
for computational anatomy: Application to cardiac
motion modeling 285
Nicolas Guigui and Xavier Pennec
1. Introduction 286
1.1 Diffeomorphometry 287
1.2 Longitudinal models 288
1.3 Parallel transport for intersubject normalization 291
1.4 Chapter organization 292
Index 465
Contributors
Frédéric Barbaresco (107), THALES Land & Air Systems, Meudon, France
Alessandro Barp (21), Department of Engineering, University of Cambridge,
Cambridge; The Alan Turing Institute, The British Library, London,
United Kingdom
Iris Bennett (79), Department of Statistics, North Carolina State University; Corteva
Agriscience, Raleigh, NC, United States
Michel Broniatowski (145), LPSM, Sorbonne Université, Paris, France
Lancelot Da Costa (21), Department of Mathematics, Imperial College London;
Wellcome Centre for Human Neuroimaging, University College London, London,
United Kingdom
Guilherme França (21), Computer Science Division, University of California,
Berkeley, CA, United States
Karl Friston (21), Wellcome Centre for Human Neuroimaging, University College
London, London, United Kingdom
Mark Girolami (21), Department of Engineering, University of Cambridge,
Cambridge; The Alan Turing Institute, The British Library, London,
United Kingdom
Nicolas Guigui (285), Université Côte d'Azur and Inria, Epione team, Sophia-Antipolis, Biot, France
Simon Heuveline (357), Centre for Mathematical Sciences, University of Cambridge,
Cambridge, United Kingdom
Michael I. Jordan (21), Computer Science Division; Department of Statistics,
University of California, Berkeley, CA, United States
Steven G. Krantz (1), Department of Mathematics, Washington University in St.
Louis, St. Louis, MO, United States
Soumendra Nath Lahiri (79), Department of Mathematics and Statistics, Washington
University in St. Louis, St. Louis, MO, United States
Tuhin Majumder (79), Department of Statistics, North Carolina State University,
Raleigh, NC, United States
Paul Marriott (327), Department of Statistics and Actuarial Science, University of
Waterloo, Waterloo, ON, Canada
Donald E.K. Martin (79), Department of Statistics, North Carolina State University,
Raleigh, NC, United States
these problems using these geometric structures. The third chapter by Donald
E.K. Martin, Iris Bennett, Tuhin Majumder, and Soumendra Nath Lahiri is
about equivalence relations in statistical inference and geometrical analysis.
The authors demonstrate how sparse Markov modeling helps improve the
understanding of statistical inferences. The chapter touches on higher-order
Markov models and derivations of certain conditional probability distributions
and their applications.
Section II contains four chapters. The first chapter by Frédéric Barbaresco, based on the foundations of “Lie Groups Thermodynamics,” describes a novel formulation of heat theory and information geometry. The chapter describes
the utility of the Gaussian distribution on the space of Symmetric Positive
Definite matrices, constructions of the Koszul–Fisher metric, and the Casimir
function, which is characterized by Koszul–Poisson cohomology. The second
chapter by Michel Broniatowski and Wolfgang Stummer introduces a general
framework for density-based and distribution function-based divergence
approaches to the relative entropy of Kullback–Leibler information distances
and distribution function-based divergences. The authors also describe in
detail and provide foundations for Cramér–von Mises test statistics and
Anderson–Darling test statistics. The third chapter by Frank Nielsen recalls
the construction of dually flat spaces from a pair of convex conjugate func-
tions related to the Legendre–Fenchel transform. This dually flat structure is
then illustrated for exponential families, mixture families, and regular homo-
geneous convex cones. For mixture families, the Shannon negentropy defines
the convex function inducing the dually flat space. It is, however, usually not
available in closed form for continuous mixtures. The chapter reports a
closed-form formula for the mixture family of two Cauchy distributions and
uses this formula to explicitly build a dually flat mixture family manifold.
The fourth chapter by Ke Sun considers the problem of embedding a
low-dimensional latent space into a high-dimensional observation space.
The author tackles the definitions of the simplicity of such embeddings based
on the framework of information geometry and discusses the relationships
between parametric and nonparametric embeddings.
Section III contains four chapters. The first chapter by Nicolas Guigui and
Xavier Pennec is richly illustrated and presents the tool of parallel transport
via affine or Riemannian connection for problems in computational anatomy.
Parallel transport defines how tangent vectors are related between tangent
planes that are infinitesimally close to each other. The authors discuss the
choice of parallel transport and its numerical accuracy in ladder method
implementations for statistical analysis of subject-specific longitudinal
changes or motions with respect to common template anatomy. The authors
then apply their novel parallel transport method to motion modeling of the
cardiac right ventricle under pressure or volume overload. To resolve this
problem, parallel transport is shown to be insufficient for normalizing
large-volume deformations, and the authors propose a novel normalization
Foundations in classical
geometry and analysis
Chapter 1
Geometry, information,
and complex bundles
Steven G. Krantz^a and Arni S.R. Srinivasa Rao^{b,c,*}
a Department of Mathematics, Washington University in St. Louis, St. Louis, MO, United States
b Laboratory for Theory and Mathematical Modeling, Medical College of Georgia, Augusta University, Augusta, GA, United States
c Department of Mathematics, Augusta University, Augusta, GA, United States
* Corresponding author: e-mail: arni.rao2020@gmail.com
Abstract
In this chapter, we will describe information geometric principles on complex planes.
Using the geometric constructions, we prove two theorems on special types of
constructions of Jordan curves around a ball within a complex bundle. Essentials on
complex planes and bundles required for understanding the contour constructions done
in the chapter are provided.
Keywords: Information geometry, Complex analysis, Jordan curves, Complex bundles
1 Introduction
The idea of combining Riemann surfaces with probability densities for under-
standing distances between two populations was introduced by C.R. Rao in
1945 (Rao, 1949). This led to the development of the subject of information
geometry (Amari, 2016; Amari and Nagaoka, 2000; Ay et al., 2017; van
Rijsbergen, 2004). The principles of information geometry were helpful in
statistical decisions and inferences (Amari et al., 1987; Plastino et al., 2021)
and in statistical physics (Bhattacharyya and Keerthi, 2000; Dehesa et al.,
2012; Frieden, 1992, 2021; Jaiswal et al., 2021). Most of these and related articles focused on the Cramér–Rao inequality and on Fisher information (Efron, 1975; Rao, 1973), and on obtaining deformation properties for the exponential family of distributions. The idea of transportation
of information from one region to another region through topological struc-
tures and through complex plane bundles was introduced in Rao (2021).
Our chapter is structured as follows: In the next section we present the fun-
damentals of complex analysis. Section 3 describes the ideas of geometry on
complex planes and Section 4 summarizes the newer advantages of informa-
tion geometry on complex planes.
2 Complex planes
Let ℂ be the complex plane, and let S ⊆ ℂ be a region. Let z be a complex number in S, and z = x + iy for x, y ∈ ℝ (the set of real numbers). We define a function f as follows:
$$f : S \to \mathbb{C} \tag{1}$$
such that f(z) = w for w = u + iv and u, v ∈ ℝ. Such a function f is said to be analytic (or holomorphic) in an open set S if there exists a complex derivative at z for every z ∈ S. If f is analytic at each point in the entire complex plane, then f is called an entire function. Suppose f = u + iv is analytic in a domain U (an open and connected set); then the first-order partial derivatives of u and v satisfy
$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \quad \text{and} \quad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}. \tag{2}$$
The equations in (2) are known as the Cauchy–Riemann (CR) equations.
Geometric structures can be constructed on a domain in the plane and, using
such a domain, information on geometric structures can be transported (Rao,
2021). A smooth function u(x_1, x_2, …, x_n) that satisfies the equation
$$\frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2} + \cdots + \frac{\partial^2 u}{\partial x_n^2} = 0 \tag{3}$$
is called a harmonic function, and Eq. (3) is called Laplace's equation. Eq. (3) can also be written as Δu = 0 for the operator
$$\Delta = \frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2} + \cdots + \frac{\partial^2}{\partial x_n^2}.$$
A two-variable function u(x, y) is called harmonic if
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \tag{4}$$
A standard result involving Laplace's equation and f = u + iv, where u = u(x, y) and v = v(x, y), is stated below. See Krantz (2004), Churchill and Brown (1984), Ahlfors (1978), Krantz (2017), and Rudin (1987).
Theorem 1. If f = u + iv is analytic in U, then u and v are harmonic in U.
Proof. Since f is analytic in U, the Cauchy–Riemann equations (2) hold, and hence
$$\frac{\partial}{\partial x}\!\left(\frac{\partial u}{\partial x}\right) = \frac{\partial}{\partial x}\!\left(\frac{\partial v}{\partial y}\right) = \frac{\partial}{\partial y}\!\left(\frac{\partial v}{\partial x}\right) = -\frac{\partial}{\partial y}\!\left(\frac{\partial u}{\partial y}\right) = -\frac{\partial^2 u}{\partial y^2}. \tag{5}$$
This implies that Δu = 0. Similarly,
$$\frac{\partial}{\partial x}\!\left(\frac{\partial v}{\partial x}\right) = -\frac{\partial}{\partial x}\!\left(\frac{\partial u}{\partial y}\right) = -\frac{\partial}{\partial y}\!\left(\frac{\partial u}{\partial x}\right) = -\frac{\partial}{\partial y}\!\left(\frac{\partial v}{\partial y}\right) = -\frac{\partial^2 v}{\partial y^2}. \tag{6}$$
This implies that Δv = 0. □
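A quick symbolic check of Theorem 1 and the Cauchy–Riemann equations (2) can be sketched in Python with sympy; the entire function f(z) = z³ used below is an arbitrary choice, and any analytic example works.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
z = x + sp.I * y

# An entire function; any analytic example works here.
f = sp.expand(z**3)
u = sp.re(f)   # real part u(x, y)
v = sp.im(f)   # imaginary part v(x, y)

# Cauchy-Riemann equations (2): u_x = v_y and u_y = -v_x
print(sp.simplify(sp.diff(u, x) - sp.diff(v, y)))   # 0
print(sp.simplify(sp.diff(u, y) + sp.diff(v, x)))   # 0

# Laplace's equation (4): u and v are harmonic
print(sp.simplify(sp.diff(u, x, 2) + sp.diff(u, y, 2)))  # 0
print(sp.simplify(sp.diff(v, x, 2) + sp.diff(v, y, 2)))  # 0
```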
The Eqs. (5) and (6) imply that u and v are harmonic. Harmonic functions combined with the CR equations can assist in understanding conjugate harmonic components and information transportation. Analytic functions are also important in the formation of contours, which were shown to transport information from one complex plane to another within a finite or infinite complex plane bundle (Rao, 2021). The set γ(t) = (x(t), y(t)), for real values t ∈ [a, b] and continuous x(t) and y(t), is said to be an arc. The arc γ(t) is called a simple (Jordan) arc if γ(t_1) ≠ γ(t_2) whenever t_1 ≠ t_2. A closed arc is an arc for which γ(a) = γ(b); a closed arc that is otherwise simple is also referred to as a Jordan curve. See Fig. 1.
Suppose we map the value of t onto another real-valued function ϕ(ζ) for a_1 ≤ ζ ≤ b_1; then the γ(t) values within [a, b] are transformed into, say, Γ(ζ) = γ[ϕ(ζ)]. The length of the arc γ(t), say L(γ(t)), is defined to be
$$L(\gamma(t)) = \int_a^b |\gamma'(t)|\,dt. \tag{8}$$
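As a concrete numerical illustration of Eq. (8), the sketch below approximates the length of the unit-circle Jordan curve γ(t) = (cos t, sin t), t ∈ [0, 2π], whose exact length is 2π; the curve is an arbitrary choice.

```python
import numpy as np

# Parameterize the Jordan curve gamma(t) = (cos t, sin t) on [0, 2*pi].
t = np.linspace(0.0, 2.0 * np.pi, 10_001)
x, y = np.cos(t), np.sin(t)

# |gamma'(t)| computed by finite differences of the components.
dx_dt = np.gradient(x, t)
dy_dt = np.gradient(y, t)
speed = np.hypot(dx_dt, dy_dt)

# L(gamma) = integral of |gamma'(t)| dt, Eq. (8); exact value is 2*pi.
length = np.sum(0.5 * (speed[:-1] + speed[1:]) * np.diff(t))
print(length, 2.0 * np.pi)   # both approximately 6.2832
```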
In Eq. (10) the two parametric representation functions are ϕ_1(t_1) and ϕ_2(t_2) for t_1 ∈ [a, b] and t_2 ∈ [b, c]. See Fig. 2. Suppose there are multiple parametric representation functions ϕ_j(t_j) for t_j in [a_j, a_{j+1}] corresponding to the piecewise smooth arcs γ_j for j = 1, 2, …, k. The contour γ can be represented using piecewise smooth arcs as
$$\int \gamma(t_j)\,dt_j. \tag{11}$$
FIG. 2 Contour formation from piecewise smooth arcs. Here γ_i(t_i) for i = 1, 2, …, 6 are piecewise smooth arcs defined on the real number intervals [a_1, a_2], [a_2, a_3], …, [a_5, a_6]. After the parametric representation described in the text, one can compute the total length of the contour, say C, using the piecewise lengths of γ_i(t_i).
Here the superscript (n) denotes a derivative. One of the important consequences of the Cauchy integral formula (18) is
$$\bigl|f^{(n)}(z_0)\bigr| \le \frac{n!\,\max_{z \in A}|f(z)|}{(r_A)^n}, \qquad (n = 1, 2, \ldots) \tag{19}$$
where f is analytic inside a circle A (with radius r_A). The inequality (19) is also called the Cauchy estimate and is used in proving Liouville's theorem on entire functions.
Theorem 4 (Liouville’s theorem). A bounded entire function is constant
throughout the complex plane.
and assume ρ(z) is not zero for all z ∈ ℂ. After some algebraic constructions and applying the triangle inequality, we arrive at
$$|f(z)| = \frac{1}{|\rho(z)|} \ \text{ is bounded.} \tag{20}$$
Eq. (20) implies f is bounded in the entire plane. But then, by Liouville's theorem, f(z) is constant, which is a contradiction because ρ(z) is not constant.
Suppose a function f is analytic throughout an annular domain with radii r_A and r_B and center z_0, such that r_A < |z − z_0| < r_B. Let γ be a Jordan curve around z_0 within the z values for r_A < |z − z_0| < r_B. Then, at each such z, the function f(z) can be represented as the following series:
$$f(z) = \sum_{n=-\infty}^{\infty} A_n (z - z_0)^n, \qquad \text{for } r_A < |z - z_0| < r_B, \tag{21}$$
where
$$A_n = \frac{1}{2\pi i}\int_{\gamma} \frac{f(z)\,dz}{(z - z_0)^{n+1}} \qquad \text{for } n = 0, \pm 1, \pm 2, \ldots. \tag{22}$$
The series in Eq. (21) is called the Laurent series. Suppose we write g(z) = f(z + z_0). Then g(z) is analytic on the annulus r_A < |z| < r_B and we can write g(z) with the following Laurent series expression:
$$g(z) = \sum_{n=0}^{\infty} B_n z^n + \sum_{n=1}^{\infty} C_n z^{-n}, \tag{23}$$
where
$$B_n = \frac{1}{2\pi i}\int_{\gamma} \frac{g(z)\,dz}{z^{n+1}} \qquad \text{for } n = 0, 1, 2, \ldots, \tag{24}$$
$$C_n = \frac{1}{2\pi i}\int_{\gamma} \frac{g(z)\,dz}{z^{-n+1}} \qquad \text{for } n = 1, 2, \ldots. \tag{25}$$
See Fig. 3. Let us consider the disks D(z_0, r_A) and D(0, r_A) as in Fig. 3. If we excise the centers z_0 and 0 from these disks, respectively, then the sets of remaining points of the disks are called deleted neighborhoods of z_0 and 0, respectively. We call z_0 an isolated singularity of f if f is not analytic at z_0 and f is analytic on a deleted neighborhood of z_0. Similarly, an isolated singularity of f at 0 can be defined.
When 0 in Fig. 3 is the isolated singular point of f, then f(z) can be expressed as
$$f(z) = \sum_{n=0}^{\infty} B_n z^n + C_1 z^{-1} + C_2 z^{-2} + \cdots + C_n z^{-n} + \cdots, \tag{26}$$
FIG. 3 (A) Jordan curve within the domain r_A < |z − z_0| < r_B and expression of f(z) for a z value within this region. (B) Jordan curve within the domain r_A < |z| < r_B; g(z) in Eq. (23) is analytic within this domain.
where B_n and C_n are defined as in Eqs. (24) and (25). The number C_1 in Eq. (26) is called the residue of f at 0, and is denoted by
$$\operatorname*{Res}_{z=0} f(z).$$
Similarly, when z_0 is an isolated singular point, one can write the Laurent series expression. The Cauchy residue theorem is a helpful tool to compute a contour integral when there are a finite number k of isolated singular points within a simple, closed contour γ. Suppose f is analytic within and on γ except for a finite number of isolated singular points z_1, …, z_k within γ; then
$$\int_{\gamma} f(z)\,dz = 2\pi i \sum_{n=1}^{k} \operatorname*{Res}_{z=z_n} f(z). \tag{27}$$
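Eq. (27) can be checked numerically on a simple example; the integrand below is an arbitrary choice. For f(z) = e^z / z², the only isolated singular point inside the unit circle is z = 0, with residue 1, so the contour integral should equal 2πi.

```python
import numpy as np

# f has an isolated singular point at z = 0 with residue 1
# (the Laurent series of exp(z)/z**2 has coefficient C_1 = 1).
def f(z):
    return np.exp(z) / z**2

# Jordan curve: the unit circle, positively oriented.
theta = np.linspace(0.0, 2.0 * np.pi, 20_001)
z = np.exp(1j * theta)
dz_dtheta = 1j * np.exp(1j * theta)

# Contour integral of f along gamma, as in Eq. (27); expected value 2*pi*i.
vals = f(z) * dz_dtheta
integral = np.sum(0.5 * (vals[:-1] + vals[1:]) * np.diff(theta))
print(integral, 2j * np.pi)   # both approximately 6.2832j
```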
The structure of disks and domains described earlier in the Laurent series
expression can be applied in the transportation of information within complex
planes. See Rao and Krantz (2021). In Rao and Krantz (2021) we have
described the basics and the importance of conformal mapping, and preserva-
tion of angles in 3D objects. Conformality is an important feature of analytic
functions. See Rao and Krantz (2021) and Ahlfors (1978), Churchill and
Brown (1984), Krantz (2004), and Rudin (1987) for general ideas and founda-
tions on conformal mapping of two piecewise smooth arcs, and especially
regarding angle preservations. In this chapter, we demonstrate the conformal
mapping principle on two intersecting Jordan curves.
Let us consider two Jordan curves J_1 and J_2 within an annulus with radii r_A and r_B. Assume that the two curves J_1 and J_2 intersect at z_1, z_2 ∈ ℂ with
$$r_A < |z_1| < r_B, \tag{28}$$
and
$$r_A < |z_2| < r_B. \tag{29}$$
The curve J_1 on the plane is created from the real values of the interval, say, [a_1, b_1], and the curve J_2 is created on the plane from the real-valued interval, say, [a_2, b_2] (Fig. 4).
Let
$$c_1 = \arg[J_2'(t_2)] - \arg[J_1'(t_1)], \qquad c_2 = \arg[J_2'(t_4)] - \arg[J_1'(t_3)],$$
where t_1, t_3 ∈ [a_1, b_1] and t_2, t_4 ∈ [a_2, b_2]. Here J_1(t_1) = J_2(t_2) = z_1 and J_1(t_3) = J_2(t_4) = z_2, so c_1 and c_2 are the angles between the two curves at the intersection points z_1 and z_2. Let f_1 and f_2 be two analytic functions mapped from J_1 and J_2, respectively. Assume f_1'(z_1) ≠ 0 and f_2'(z_2) ≠ 0. Due to conformality, the angle between the two mapped curves, say S_2 and S_1, at f_1(z_1) will be equal to the angle c_1 between J_2 and J_1 at z_1, and the angle between S_2 and S_1 at f_2(z_2) will be equal to the angle c_2 between J_2 and J_1 at z_2.
We have discussed conformal mapping of curves within an annulus. One can give a similar description in any region U ⊆ ℂ in which the two curves J_1 and J_2 are created.
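The angle-preservation property can also be checked numerically. The sketch below uses two arbitrarily chosen smooth curves through the point c = 1 + 0.5i and the analytic map f(z) = z² (with f'(c) ≠ 0), comparing the angle between the curves at c with the angle between their images at f(c).

```python
import numpy as np

def f(z):
    # Analytic map with nonvanishing derivative away from the origin.
    return z**2

c = 1.0 + 0.5j          # intersection point of the two curves
eps = 1e-6              # step for numerical tangents

# Two smooth curves through c, parameterized near t = 0.
J1 = lambda t: c + t * np.exp(1j * 0.0)          # direction along the real axis
J2 = lambda t: c + t * np.exp(1j * np.pi / 4)    # direction at 45 degrees

def tangent(curve, transform=lambda z: z):
    # Numerical tangent direction of the (possibly mapped) curve at t = 0.
    return (transform(curve(eps)) - transform(curve(-eps))) / (2 * eps)

def angle_between(w1, w2):
    return np.angle(w2 / w1)

# Angle between J1 and J2 at c, and between their images under f at f(c).
print(angle_between(tangent(J1), tangent(J2)))          # ~ pi/4
print(angle_between(tangent(J1, f), tangent(J2, f)))    # ~ pi/4 (preserved)
```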
FIG. 4 Conformal mappings of two intersecting Jordan curves J_1 and J_2 onto two intersecting Jordan curves S_1 and S_2. Note that although we have demonstrated here a Jordan curve mapping to another Jordan curve, a Jordan curve need not always map to a Jordan curve. The two curves J_1 and J_2 are generated independently from each other. We also mention here that a Jordan curve within an annulus need not map to a curve situated in another annulus.
FIG. 5 Formation of contours on the boundary of a ball B(z_0, r_A) due to the intersection of a bundle of complex planes. We show an overview of the formation of Jordan curves in this figure. More details are in Fig. 6.
Proof. We will have infinitely many points of G on the boundary of B(z_0, r_A). Let a and b be two arbitrary planes within G. Let z_1(a) be one such point of G that lies on the plane a and on ∂B(z_0, r_A). A point in G here we treat as a complex number.
FIG. 6 Microscopic view of the formation of contours and Jordan curves. Planes that do not
belong to G help to construct Jordan curves as these planes allow contour formations between
different planes of G and return to the points on ∂B(z0, rA).
The points z_1(d_1) and z_1(d) both lie on the plane d. So an arc can be constructed directly from z_1(d_1) to z_1(d), or an arc can be constructed from z_1(d_1) to z_1(d) indirectly, passing through O_2. Let z_1(t) for t ∈ [d, d_1] ⊂ (0, δ) be the piece of arc of the contour γ(t) which is drawn from z_1(d) to z_1(d_1). Here z_1(t) for t ∈ [d, d_1] may or may not pass through O_2. An arc from z_1(d_1) to another point on ∂B(z_0, r_A) on the other side of B(z_0, r_A) can be drawn in a fashion similar to that described above for z_1(t) for t ∈ [a, d] on one side of B(z_0, r_A) in Fig. 6B. In a similar way, piecewise arcs from d_1 to δ can be constructed for d_1 ≤ t ≤ δ.
Let us divide the interval [d_1, δ] into a finite number of intervals, say,
Remark 1. In the above proof, we have constructed a smooth arc from z_1(a) to z_1(d) passing through O_1, but one can also give arguments to construct a smooth arc from z_1(a) to z_1(d) passing through O_2.
$$\partial B(z_0, r_A) \cap G_1 \cap G_2 \cap G_3 \neq \emptyset. \tag{36}$$
Then,
(a) we can construct infinitely many Jordan curves J_k for k = 1, 2, … on the boundary of the ball B(z_0, r_A), and,
(b)
$$\bigcup_{k=1}^{\infty} J_k = \partial B(z_0, r_A). \tag{37}$$
4 Summary
Our novelty is that we constructed Jordan curves on the boundary of a ball (which is simply connected), but through points generated by complex plane bundles on the boundary of the ball considered. The development of a contour is facilitated by the way we consider the bundles and not by the simple connectedness of the ball. These kinds of contour formations are treated here as the passing of information between planes and as the geometry of contours.
The ideas of information geometry and its emerging newer applications are interesting; see, for example, Hayashi (2022), Mishra and Kumar (2021), Nielsen (2022), Gauchy et al. (2022), Hua et al. (2021), and Barbaresco (2021) for recent developments. We hope our newer ideas of construction of contours will add to the further development of information geometry on planes, surfaces, and the geometry of topological manifolds.
References
Ahlfors, L.V., 1978. Complex Analysis. An Introduction to the Theory of Analytic Functions of
One Complex Variable. In: third ed. International Series in Pure and Applied Mathematics,
McGraw-Hill Book Co., New York. xi+331.
Amari, S.-i., 2016. Information Geometry and Its Applications. Applied Mathematical Sciences,
vol. 194, Springer, Tokyo. xiii+374 pp., ISBN 978-4-431-55977-1; 978-4-431-55978-8.
Amari, S.-i., Nagaoka, H., 2000. Methods of Information Geometry (Translated from the 1993
Japanese original by Daishi Harada. Translations of Mathematical Monographs, 191).
American Mathematical Society/Oxford University Press, Providence, RI/Oxford, x+206 pp.
ISBN: 0-8218-0531-2.
Amari, S.I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R., 1987. Differential
Geometry in Statistical Inference. Institute of Mathematical Statistics Lecture Notes–
Monograph Series, vol. 10, Institute of Mathematical Statistics, Hayward, CA, iv+240 pp.
ISBN: 0-940600-12-9.
Ay, N., Jost, J., Lê, H.V., Schwachhöfer, L., 2017. Information Geometry. A Series of Modern Surveys in Mathematics, vol. 64, Springer, Cham, xi+407 pp. ISBN: 978-3-319-56477-7.
Barbaresco, F., 2021. Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie
group thermodynamics and information geometry. Inf. Geom. 4 (1), 245–262.
Bhattacharyya, C., Keerthi, S.S., 2000. Information geometry and Plefka's mean-field theory. J. Phys. A 33 (7), 1307–1312.
Chern, S.S., 1977. Circle bundles, geometry and topology. In: Lecture Notes in Math. Proc. III
Latin Amer. School of Math., Inst. Mat. Pura Aplicada CNPq, Rio de Janeiro, 1976,
vol. 597. Springer, Berlin, pp. 114–131.
Chern, S.S., 1989. Vector Bundles With a Connection. Global Differential Geometry. MAA Stud.
Math., vol. 27, Mathematical Association of America, Washington, DC, pp. 1–26.
Choi, Y.-J., 2012. On the differential geometric characterization of the Lee models. J. Geom.
Anal. 22 (1), 168–205.
Churchill, R.V., Brown, J.W., 1984. Complex Variables and Applications, fourth ed. McGraw-Hill Book Co., New York. x+339 pp.
Dehesa, J.S., Plastino, A.R., Sánchez-Moreno, P., Vignat, C., 2012. Generalized Cramer-Rao rela-
tions for non-relativistic quantum systems. Appl. Math. Lett. 25 (11), 1689–1694.
Efron, B., 1975. Defining the curvature of a statistical problem (with applications to second order
efficiency). Ann. Statist. 3 (6), 1189–1242.
Frieden, B.R., 1992. Fisher information and uncertainty complementarity. Phys. Lett. A 169 (3), 123–130.
Frieden, B.R., 2021. Principle of minimum loss of Fisher information, arising from the
Cramer-Rao inequality: its role in evolution of bio-physical laws, complex systems and uni-
verses. In: Plastino, A., Rao, A.S.R.S., Rao, C.R. (Eds.), Information Geometry. Handbook of
Statistics, vol. 45. Elsevier, pp. 117–148.
Gauchy, C., Stenger, J., Sueur, R., Iooss, B., 2022. An information geometry approach to robust-
ness analysis for the uncertainty quantification of computer codes. Technometrics 64 (1),
80–91.
Green, M., Lazarsfeld, R., 1991. Higher obstructions to deforming cohomology groups of line
bundles. J. Am. Math. Soc. 4 (1), 87–103.
Greene, R.E., Kim, K.-T., Krantz, S.G., 2011. The Geometry of Complex Domains. Progress in Mathematics, vol. 291, Birkhäuser Boston, Ltd., Boston, MA, xiv+303 pp. ISBN: 978-0-8176-4139-9.
Hayashi, M., 2022. Information geometry approach to parameter estimation in hidden Markov
model. Bernoulli 28 (1), 307–342.
Hua, X., Ono, Y., Peng, L., Cheng, Y., Wang, H., 2021. Target detection within nonhomogeneous
clutter via total bregman divergence-based matrix information geometry detectors. IEEE
Trans. Signal Process. 69, 4326–4340.
Jaiswal, N., Gautam, M., Sarkar, T., 2021. Complexity and information geometry in the transverse
XY model. Phys. Rev. E 104 (2) (Paper No. 024127, 10 pp.).
Kobayashi, S., 2014. Differential Geometry of Complex Vector Bundles. Princeton University
Press, Princeton, NJ. Reprint of the 1987 edition.
Krantz, S.G., 2004. Complex Analysis: The Geometric Viewpoint. In: second ed. Carus Mathe-
matical Monographs, vol. 23. Mathematical Association of America, Washington, DC, ISBN:
0-88385-035-4. xviii+219 pp.
Krantz, S.G., 2017. Harmonic and Complex Analysis in Several Variables. Springer Monographs
in Mathematics, Springer, Cham. ISBN 978-3-319-63229-2; 978-3-319-63231-5, xii+424 pp.
Krantz, S., 2020. How fundamental is the fundamental theorem of algebra? Math. Mag. 93 (2),
139–142.
Krantz, S.G., Parks, H.R., 1999. The Geometry of Domains in Space. Birkhäuser Advanced Texts: Basler Lehrbücher (Birkhäuser Advanced Texts: Basel Textbooks), Birkhäuser Boston, Inc., Boston, MA, x+308 pp. ISBN: 0-8176-4097-5.
Kruglikov, B.S., 2007. Tangent and normal bundles in almost complex geometry. Differential
Geom. Appl. 25 (4), 399–418.
Mishra, K.V., Kumar, M.A., 2021. Information geometry and classical Cramer-Rao-type inequal-
ities. In: Plastino, A., Rao, A.S.R.S., Rao, C.R. (Eds.), Handbook of Statistics: Information
Geometry. vol. 45. Elsevier, pp. 79–114.
Nielsen, F., 2022. The many faces of information geometry. Not. Am. Math. Soc. 69 (1), 36–45.
Plastino, A.R., Plastino, A., Pennini, F., 2021. Chapter 1–Revisiting the connection between
Fisher information and entropy’s rate of change. In: Plastino, A., Rao, A.S.R.S., Rao, C.R.
(Eds.), Handbook of Statistics. vol. 45. Elsevier, pp. 3–14.
Rao, C.R., 1949. On the distance between two populations. Sankhya 9, 246–248.
Rao, C.R., 1973. Linear Statistical Inference and Its Applications. In: second ed. Wiley Series in
Probability and Mathematical Statistics, John Wiley & Sons, New York, London, Sydney.
xx+625 pp.
Rao, A.S.R.S., 2021. Multilevel contours on bundles of complex planes. In: Nielsen, F., Rao,
A.S.R.S., Rao, C.R. (Eds.), Geometry and Statistics. Handbook of Statistics, vol. 46. Elsevier.
Rao, A.S.R.S., Krantz, S.G., 2020. Data science for virtual tourism using cutting-edge visualiza-
tions: information geometry and conformal mapping. Patterns 1 (5), 100067.
Rao, A.S.R.S., Krantz, S.G., 2021. Rao distances and conformal mapping. In: Plastino, A.,
Rao, A.S.R.S., Rao, C.R. (Eds.), Information Geometry. Handbook of Statistics, vol. 45.
Elsevier, pp. 43–56.
Rudin, W., 1987. Real and Complex Analysis, third ed. McGraw-Hill Book Co., New York,
xiv+416 pp. ISBN: 0-07-054234-1.
Tsallis, C., 1988. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52 (1–2),
479–487.
van Rijsbergen, C.J., 2004. The Geometry of Information Retrieval. Cambridge University Press,
Cambridge, xii+150 pp. ISBN: 0-521-83805-3.
Wu, D., Yau, S.T., 2016. Negative holomorphic curvature and positive canonical bundle. Invent.
Math. 204 (2), 595–604.
Yoo, S., 2017. A differential-geometric analysis of the Bergman representative map. Ann. Polon.
Math. 120 (2), 163–181.
Zhang, J., Wong, T.-K.L., 2021. Chapter 10–λ-Deformed probability families with subtractive and
divisive normalizations. In: Plastino, A., Rao, A.S.R.S., Rao, C.R. (Eds.), Information Geom-
etry. Handbook of Statistics, vol. 45. Elsevier, pp. 187–215.
Chapter 2
Geometric methods
for sampling, optimization,
inference, and adaptive agents
Alessandro Barp^{a,b,*,†}, Lancelot Da Costa^{c,d,†}, Guilherme França^{e,†}, Karl Friston^{d}, Mark Girolami^{a,b}, Michael I. Jordan^{e,f}, and Grigorios A. Pavliotis^{c}
a Department of Engineering, University of Cambridge, Cambridge, United Kingdom
b The Alan Turing Institute, The British Library, London, United Kingdom
c Department of Mathematics, Imperial College London, London, United Kingdom
d Wellcome Centre for Human Neuroimaging, University College London, London, United Kingdom
e Computer Science Division, University of California, Berkeley, CA, United States
f Department of Statistics, University of California, Berkeley, CA, United States
* Corresponding author: e-mail: ab2286@cam.ac.uk
Abstract
In this chapter, we identify fundamental geometric structures that underlie the problems
of sampling, optimization, inference, and adaptive decision-making. Based on this
identification, we derive algorithms that exploit these geometric structures to solve
these problems efficiently. We show that a wide range of geometric theories emerge
naturally in these fields, ranging from measure-preserving processes and information divergences to Poisson geometry and geometric integration. Specifically, we explain how
(i) leveraging the symplectic geometry of Hamiltonian systems enables us to construct
(accelerated) sampling and optimization methods, (ii) the theory of Hilbertian subspaces
and Stein operators provides a general methodology to obtain robust estimators, and
(iii) preserving the information geometry of decision-making yields adaptive agents that
perform active inference. Throughout, we emphasize the rich connections between
these fields; e.g., inference draws on sampling and optimization, and adaptive
decision-making assesses decisions by inferring their counterfactual consequences.
Our exposition provides a conceptual overview of underlying ideas, rather than a tech-
nical discussion, which can be found in the references herein.
† Equal contribution.
1 Introduction
Differential geometry plays a fundamental role in applied mathematics, statis-
tics, and computer science, including numerical integration (Celledoni et al.,
2014; Hairer et al., 2010; Leimkuhler and Reich, 2004; Marsden and West,
2001; McLachlan and Quispel, 2002), optimization (Alimisis et al., 2021;
Betancourt et al., 2018; Bravetti et al., 2019; França et al., 2020, 2021a,b),
sampling (Barp et al., 2019b; Betancourt et al., 2017; Duane et al., 1987;
Livingstone et al., 2019; Rousset et al., 2010), statistics on spaces with deep
learning (Bronstein et al., 2021; Celledoni et al., 2021), medical imaging
and shape methods (Durrleman et al., 2009; Vaillant and Glaunes, 2005),
interpolation (Barp et al., 2022), and the study of random maps (Harms
et al., 2020), to name a few. Of particular relevance to this chapter is informa-
tion geometry, i.e., the differential geometric treatment of smooth statistical
manifolds, whose origin stems from a seminal article by Rao (1992) who
introduced the Fisher metric tensor on parameterized statistical models, and
thus a natural Riemannian geometry that was later observed to correspond
to an infinitesimal distance with respect to the Kullback–Leibler (KL) diver-
gence (Jeffreys, 1946). The geometric study of statistical models has had
many successes (Amari, 2016; Ay et al., 2017; Nielsen, 2020), ranging from
statistical inference, where it was used to prove the optimality of the maxi-
mum likelihood estimator (Amari, 2012), to the construction of the category
of mathematical statistics, generated by Markov morphisms (Chentsov,
1965; Jost et al., 2021). Our goal in this chapter is to discuss the emergence
of natural geometries within a few important areas of statistics and applied
mathematics, namely optimization, sampling, inference, and adaptive agents.
We provide a conceptual introduction to the underlying ideas rather than a
technical discussion, highlighting connections with various fields of mathe-
matics and physics.
The vast majority of statistics and machine learning applications involve
solving optimization problems. Accelerated gradient-based methods
(Nesterov, 1983; Polyak, 1964), and several variations thereof, have become workhorses in these fields. Recently, there has been great interest in studying such methods from a continuous-time limiting perspective; see, e.g., Su et al. (2016), Wibisono et al. (2016), Wilson et al. (2021), França et al. (2018), França et al. (2018), França et al. (2021c), Muehlebach and Jordan (2021b), and Muehlebach and Jordan (2021a) and references therein. Such methods can be seen as first-order integrators of a classical Hamiltonian system with dissipation. This raises the question of how to discretize the system such that
important properties are preserved, assuming the system has fast convergence
to critical points and desirable stability properties. It has been known for a
long time that the class of symplectic integrators is the preferred choice for
simulating physical systems (Benettin and Giorgilli, 1994; Forest, 2006;
Hairer et al., 2010; Kennedy et al., 2013; McLachlan and Quispel, 2002;
McLachlan and Quispel, 2006; Sanz-Serna, 1992; Suzuki, 1990; Takahashi
and Imada, 1984; Yoshida, 1990). These discretization techniques, designed
to preserve the underlying (symplectic) geometry of Hamiltonian systems,
also form the basis of Hamiltonian Monte Carlo (HMC) (or hybrid Monte
Carlo) methods (Duane et al., 1987; Neal, 2011). Originally, such a theory
of geometric integration was developed with conservative systems in mind,
while, in optimization, the associated system is naturally a dissipative one.
Nevertheless, symplectic integrators were exploited in this context
(Betancourt et al., 2018; Bravetti et al., 2019; França et al., 2020). More
recently, it has been proved that a generalization of symplectic integrators to
dissipative Hamiltonian systems is indeed able to preserve rates of convergence
and stability (França et al., 2021b), which are the main properties of interest for
optimization. Follow-up work (França et al., 2021a) extended this approach,
enabling optimization on manifolds and problems with constraints. There is
also a tight connection between optimization on the space of measures and sam-
pling which dates back to Otto (2001) and Jordan et al. (1998); we will revisit
these ideas in relation to dissipative Hamiltonian systems.
Sampling methods are critical to the efficient implementation of many
methodologies. Most modern samplers are based on Markov Chain Monte
Carlo methods, which include slice samplers (Murray et al., 2010; Neal,
2003), piecewise-deterministic Markov chains, such as bouncy particle and
zig-zag samplers (Bierkens and Roberts, 2017; Bierkens et al., 2019;
Bouchard-Côté et al., 2018; Davis, 1984; Peters and de With, 2012; Vanetti
et al., 2017), Langevin algorithms (Durmus and Moulines, 2017; Durmus
et al., 2018; Roberts and Tweedie, 1996), interacting particle systems
(Garbuno-Inigo et al., 2019), and the class of HMC methods (Barp et al.,
2018; Betancourt, 2017; Betancourt et al., 2017; Duane et al., 1987; Neal,
2011; Rousset et al., 2010). The original HMC algorithm was introduced in
physics to sample distributions on gauge groups for lattice quantum chromo-
dynamics (Duane et al., 1987). It combined two approaches that emerged in
previous decades, namely the Metropolis-Hastings algorithm and the Hamilto-
nian formulation of molecular dynamics (Alder and Wainwright, 1959;
Hastings, 1970; Metropolis et al., 1953). Modern HMC relies heavily on
symplectic integrators to simulate a deterministic dynamic, responsible for generating distant moves between samples and thus reducing their correlation, while at the same time preserving important geometric properties. This deter-
ministic step is then usually combined with a corrective step (originally a
Metropolis-Hastings acceptance step) to ensure preservation of the correct
target, and with a stochastic process, employed to speed up convergence to
the target distribution. We will first focus on the geometry of measure-
preserving diffusions, which emerges from ideas formulated by Poincaré
and Volterra, and form the building block of many samplers. In particular, we
will discuss ways to “accelerate” sampling using irreversibility and hypoellip-
ticity. We will then introduce HMC focusing on its underlying Poisson geom-
etry, the important role played by symmetries, and its connection to geometric
integration.
We then discuss the problem of statistical inference, whose practical
implementation usually relies upon sampling and optimization. Given obser-
vations from a target distribution, many estimators belong to the family of
the so-called M and Z estimators (Van der Vaart, 2000), which are obtained
by finding the parameters that maximizes (or are zeros of ) a parameterized
set of functions. These include the maximum likelihood and minimum
Hyv€arinen score matching estimators (Hyv€arinen and Dayan, 2005; Vapnik,
1999), which are also particular instances of the minimum score estimators
induced by scoring rules that quantify the discrepancy between a sample
and a distribution (Parry et al., 2012). The Monge–Kantorovich transportation
problem (Villani, 2009b) motivates another important class of estimators,
namely the minimum Kantorovich and p-Wasserstein estimators, whose
implementation uses the Sinkhorn discrepancy (Bassetti et al., 2006; Cuturi, 2013; Peyré et al., 2019). Our discussion of inference builds upon the theory
of Hilbertian subspaces and, in particular, reproducing kernels. These infer-
ence schemes rely on the continuity of linear functionals, such as probability
and Schwartz distributions, over a class of functions to geometrize the analy-
sis of integral probability metrics, which measure the worst-case integration
error. We shall explain how maximum mean, kernelized, and score matching
discrepancies arise naturally from topological considerations.
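As a small illustration of one such discrepancy, the sketch below computes a (biased) estimate of the squared maximum mean discrepancy between two samples using a Gaussian reproducing kernel; the bandwidth and the toy Gaussian samples are arbitrary choices.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    # k(x, y) = exp(-|x - y|^2 / (2 * bandwidth^2)) on R^d.
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    # Biased estimator of MMD^2(P, Q) = E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')].
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx - 2.0 * kxy + kyy

rng = np.random.default_rng(0)
x1 = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # sample from P
x2 = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # second sample from P
y = rng.normal(loc=1.0, scale=1.0, size=(500, 1))    # sample from a shifted Q

print(mmd2(x1, x2))   # near 0: same distribution
print(mmd2(x1, y))    # clearly positive: distinct distributions
```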
Models of adaptive agents are the basis of algorithmic decision-making
under uncertainty. This is a difficult problem that spans multiple disciplines
such as statistical decision theory (Berger, 1985), game theory (Von
Neumann and Morgenstern, 1944), control theory (Bellman and Dreyfus,
2015), reinforcement learning (Barto and Sutton, 1992), and active inference
(Da Costa et al., 2020a). To illustrate a generic use case for the previous
methodologies, we consider active inference, a unifying formulation of
behavior—subsuming perception, planning, and learning—as a process of
inference (Da Costa et al., 2020a; Friston, 2010; Friston et al., 2010, 2015).
We describe decision-making under active inference using information geom-
etry, revealing several special cases that are established notions in statistics,
cognitive science, and engineering. We then show how preserving this infor-
mation geometry in algorithms enables adaptive algorithmic decision-making,
endowing robots and artificial agents with useful capabilities, including
robustness, generalization, and context-sensitivity (Da Costa et al., 2022;
Lanillos et al., 2021). Active inference is an interesting use case because it
has yet to be scaled—to tackle high dimensional problems—to the same
extent as established approaches, such as reinforcement learning (Silver
et al., 2016); however, numerical analyses generally show that active
2 Accelerated optimization
We shall be concerned with the problem of optimization of a function
V : M → ℝ, i.e., finding a point that maximizes V(q), or minimizes V(q), over a smooth manifold M. We will assume this function is differentiable in order to construct algorithms that rely on the flows of smooth vector fields guided by the derivatives of V(q).
Many algorithms in optimization are given as a sequence of finite
differences, represented by iterations of a mapping Ψ_δt : M → M, where
δt > 0 is a step size. The analysis of such finite difference iterations is usually
challenging, relying on painstaking algebra to obtain theoretical guarantees such as convergence to a critical point, stability, and rates of convergence
to a critical point. Even when these algorithms are seen as discretizations of
a continuum system, whose behavior is presumably understood, it is well-
known that most discretizations break important properties of the system.
$$e^{\delta t\, Y} \circ e^{\delta t\, Z} = e^{\delta t\, \tilde{X}}, \qquad \tilde{X} = (Y + Z) + \frac{1}{2}[Y, Z]\,\delta t + \frac{1}{12}\bigl([Y, [Y, Z]] - [Z, [Y, Z]]\bigr)\,\delta t^2 + \cdots, \tag{1}$$
where [Y, Z] = YZ − ZY is the commutator between Y and Z. Thus, the numerical method itself can be seen as a smooth dynamical system with flow map Ψ^X_δt = e^{δt X̃}. The goal of geometric integration is to construct numerical methods for which X̃ shares with X the critical properties of interest; this is usually done by requiring preservation of some geometric structure.
Recall that a numerical map Ψ^X_δt is said to be of order r ≥ 1 if |Ψ^X_δt − Φ^X_δt| = O(δt^{r+1}); we abuse notation slightly and let |·| denote a well-defined distance over manifolds (see Hansen (2011) for details). Thus, the expansion (1) also shows that the error in the approximation is |Ψ^X_δt − Φ^X_δt| = O(δt²), i.e., we have an integrator of order r = 1. One can also consider more elaborate compositions, such as
$$\Psi^X_{\delta t} \equiv \Phi^Y_{\delta t/2} \circ \Phi^Z_{\delta t} \circ \Phi^Y_{\delta t/2}, \tag{2}$$
which is more accurate since the first error term in (1) cancels out, yielding an integrator of order r = 2.^a
^a Higher-order methods are constructed by looking for appropriate compositions that cancel the first terms in the BCH formula (Yoshida, 1990). However, methods for r > 2 tend to be expensive numerically, with not so many benefits (if any) over methods of order r = 2.
^b We denote by x^i the ith component of x and ∂_i ≡ ∂/∂x^i. We also use Einstein's summation convention, i.e., repeated upper and lower indices are summed over.
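To see the effect of the composition order concretely, consider the harmonic oscillator H(q, p) = p²/2 + q²/2 (an arbitrary test case) split into an exactly solvable drift flow and kick flow. The Lie composition e^{δtY} ∘ e^{δtZ} gives an order r = 1 method, while the symmetric composition (2) gives order r = 2, so halving the step size should roughly halve the former's error and quarter the latter's. A minimal sketch:

```python
import numpy as np

# Harmonic oscillator H(q, p) = p^2/2 + q^2/2, split into a "drift" flow
# Phi^Y (q' = p) and a "kick" flow Phi^Z (p' = -q), both exactly solvable.
def drift(q, p, h):
    return q + h * p, p

def kick(q, p, h):
    return q, p - h * q

def lie_step(q, p, h):            # e^{h Y} o e^{h Z}: order r = 1
    q, p = kick(q, p, h)
    return drift(q, p, h)

def strang_step(q, p, h):         # Phi^Y_{h/2} o Phi^Z_h o Phi^Y_{h/2}: order r = 2
    q, p = drift(q, p, h / 2)
    q, p = kick(q, p, h)
    return drift(q, p, h / 2)

def global_error(step, h, T=1.0, q0=1.0, p0=0.0):
    q, p = q0, p0
    for _ in range(int(round(T / h))):
        q, p = step(q, p, h)
    # Exact solution for these initial conditions: q(T) = cos(T), p(T) = -sin(T).
    return abs(q - np.cos(T)) + abs(p + np.sin(T))

for h in (0.1, 0.05, 0.025):
    print(h, global_error(lie_step, h), global_error(strang_step, h))
# Lie errors shrink roughly linearly in h; Strang errors roughly quadratically.
```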
for some function f_3, and so on for all commutators in (1), then the right-hand side of the BCH formula would itself be an expansion in terms of a vector field X^B_{f̃} for some shadow function f̃ = f + f_3 δt + f_4 δt² + ⋯. In particular, f̃ would inherit all the properties of f, i.e., properties common to B-vector fields. This is precisely the case for Poisson brackets, written B ≡ Π, which are antisymmetric brackets for which the Jacobi identity holds:
$$\bigl[X^{\Pi}_f,\, X^{\Pi}_g\bigr] = X^{\Pi}_{\{f, g\}}, \qquad \{f, g\} \equiv \partial_i f\, \Pi^{ij}\, \partial_j g, \tag{4}$$
When the Poisson tensor Π^{ij} is invertible, its inverse defines a two-form, denoted by Ω_{ij} = (Π^{−1})_{ij} and called a symplectic form. In this case, the function f is called a Hamiltonian, denoted f = H. The invertibility of the Poisson tensor Π^{ij} implies that such a bracket exists only on even-dimensional spaces. Darboux's theorem then ensures the existence of local coordinates x ≡ (q^1, …, q^d, p_1, …, p_d) in which the symplectic form can be represented as
$$\Omega = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}.$$
Dynamically, this corresponds to the fact that these are second-order differential equations, requiring not only a position q ∈ M but also a momentum p ∈ T*_q M.^c Note that if H = p_i, then X_H = ∂/∂q^i, and conversely if H = q^i, then X_H = −∂/∂p_i. Thus, a change in coordinate q^i is generated by its conjugate momentum p_i, and vice versa. Thus, the only way to generate dynamics on M in this case is by introducing a Hamiltonian depending on both position and momentum. From a numerical viewpoint, the extended phase space introduces extra degrees of freedom that allow us to incorporate “symmetries” in the Hamiltonian, which facilitate integration.
Indeed, in practice, the Hamiltonian usually decomposes into a potential
energy, associated to position and independent of momentum, and a kinetic
energy, associated to momentum and invariant under position changes, both
generating tractable flows. Thanks to this decomposition, we are able to con-
struct numerical methods through splitting the vector field. Note also that, for
symplectic brackets, the existence of a shadow Hamiltonian can be guaranteed
beyond the case of splitting methods, e.g., for variational integrators—which
use a discrete version of Hamilton’s principle of least action—and more
generally for most symplectic integrators in which the symplectic bracket is
preserved up to topological considerations described by the first de Rham
cohomology of phase space.
^c More precisely, the dynamics evolve on the cotangent bundle X = T*M, with coordinates x = (q, p); momentum p ∈ T*_q M and velocity v = dq/dt ∈ T_q M are equivalent on the Riemannian manifolds that are used in practice. M is called the configuration manifold with coordinates q.
$$H(q, p) = \frac{1}{2}\, p_i\, g^{ij} p_j + V(q), \tag{6}$$
where g^{ij} is a constant symmetric positive definite matrix with inverse g_{ij}. The associated vector field is X^B_H = g^{ij} p_j ∂_{q^i} − [∂_{q^i} V + γ(t) p_i] ∂_{p_i}, with γ(t) > 0 being a “damping coefficient.” This is associated to the negative definite matrix
$$B \equiv \underbrace{\begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}}_{\text{conservative}} \;-\; \gamma(t)\, \underbrace{\begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix}}_{\text{dissipative}}. \tag{7}$$
Two common choices for the damping are the constant case, γ(t) = γ, and the asymptotically vanishing case, γ(t) = r/t for some constant r ≥ 3 (other choices are also possible). When V(q) is a convex function (resp. a strongly convex function with parameter μ > 0), it is possible to show convergence rates for V(q(t)) − V* of the form given in Eq. (11) (França et al., 2018), where λ_1(g) is the largest eigenvalue of the metric g. The convergence rates of
this system are therefore known under such convexity assumptions. Ideally,
we want to design optimization methods that preserve these rates, i.e., are
“rate-matching,” and are also numerically stable. As we will see, such geo-
metric integrators can be constructed by leveraging the shadow Hamiltonian
property of symplectic methods on higher-dimensional conservative Hamilto-
nian systems (França et al., 2021b) (see also Asorey et al., 1983; Marthinsen
and Owren, 2016). This holds not only on 2d but on general settings, namely
on arbitrary smooth manifolds (França et al., 2021a,b).
In the conformal Hamiltonian case, the dissipation appears explicitly in the
equations of motion. It is however theoretically convenient to consider an
equivalent explicit time-dependent Hamiltonian formulation. Consider the fol-
lowing coordinate transformation applied to system (8):
$$p \mapsto e^{\eta(t)}\, p, \qquad H(q, p) \mapsto e^{\eta(t)}\, H\bigl(q,\, e^{-\eta(t)} p\bigr), \qquad \eta(t) \equiv \int \gamma(t)\,dt. \tag{12}$$
embed the original dissipative system with phase space ℝ^{2d} into a higher-dimensional conservative system with phase space ℝ^{2d+2}. The dissipative dynamics thus lies on a hypersurface of constant energy, K = 0, in high-
dimensions; see França et al. (2021b) for details. The reason for doing this
procedure, called symplectification, is purely theoretical: since the theory of
symplectic integrators only accounts for conservative systems, we can now
extend this theory to dissipative settings by applying a symplectic integrator
to (13) and then fixing the relevant coordinates (17) in the resulting method.
Geometrically, this corresponds to integrating the time flow exactly (França
et al., 2021b; Marthinsen and Owren, 2016). In França et al. (2021b) such a
procedure was defined under the name of presymplectic integrators, and these
connections hold not only for the specific example above but also for general
nonconservative Hamiltonian systems.
We are now ready to explain why this approach is suitable to construct
practical optimization methods. Let Ψ_δs : ℝ^{2d+2} → ℝ^{2d+2} be a symplectic integrator of order r ≥ 1 applied to system (15). Denote by (t_k, q_k, u_k, p_k) the numerical state, obtained by k = 0, 1, … iterations of Ψ_δs. Time is simulated over the grid s_k = (δs)k, with step size δs > 0. Because a symplectic integrator has a shadow Hamiltonian, we have
$$\tilde{K}(t_k, q_k, u_k, p_k) = K\bigl(t(s_k), q(s_k), u(s_k), p(s_k)\bigr) + O(\delta s^r).$$
Enforcing (17), the coordinate t_k becomes simply the time discretization s_k, which is exact, and so is u_k = u(t_k) since it is a function of time alone; importantly, u does not couple to any of the other degrees of freedom, so it is irrelevant whether we have access to u(s) or not. Replacing (15) into the above equation, we conclude:
$$\tilde{H}(t_k, q_k, p_k) = H\bigl(t_k, q(t_k), p(t_k)\bigr) + O(\delta t^r), \tag{18}$$
where we now denote t_k = (δt)k, for k = 0, 1, …. Hence, the time-dependent Hamiltonian also has a shadow, thanks to the cancellation of the variable u. In particular, if we replace the explicit form of the Hamiltonian (13), we obtain^d
$$\underbrace{V(q_k) - V^\star}_{\text{numerical rate}} \;=\; \underbrace{V(q(t_k)) - V^\star}_{\text{continuum rate}} \;+\; \underbrace{O\bigl(e^{-\eta(t_k)}\, \delta t^r\bigr)}_{\text{small error}}. \tag{19}$$
Therefore, the known rates (11) for the continuum system are nearly
preserved—and so would be any rates of more general time-dependent (dissi-
pative) Hamiltonian systems. Moreover, as a consequence of (18), the original
time-independent Hamiltonian (6) of the conformal formulation is also closely
^d The kinetic part only contributes to the small error since g is positive definite and |p_k − p(t_k)| = O(δt^r). There are several technical details we are omitting, such as Lipschitz conditions on the Hamiltonian and on the numerical method, for which we refer to França et al. (2021b).
^e Naturally, all these results hold for suitable choices of step size, which can be determined by a linear stability analysis of the particular numerical method under consideration.
^f In a practical implementation, it is convenient to make the change of variables p_k ↦ e^{η(t_k)} p_k into (20); recall the transformations (12). In this case the method reads
$$p_{k+1/2} = e^{-\Delta\eta_k}\bigl(p_k - (\delta t/2)\,\partial_q V(q_k)\bigr),$$
$$q_{k+1} = q_k + \delta t \cosh(\Delta\eta_k)\, g^{-1} p_{k+1/2},$$
$$p_{k+1} = e^{-\Delta\eta_k}\, p_{k+1/2} - (\delta t/2)\,\partial_q V(q_{k+1}),$$
where Δη_k ≡ η(t_{k+1/2}) − η(t_k) = ∫_{t_k}^{t_{k+1/2}} γ(t) dt. Note that only a half-step difference of η(t) appears in these updates. The algorithm is thus written in the same variables as the conformal representation (8). The advantage is that we do not have large or small exponentials, which can be problematic numerically. Furthermore, when solving optimization problems, it is convenient to set the matrix g = (δt)I; this was noted in França et al. (2020) but can also be understood from the rates (11), since then the step size δt disappears from some of these formulas.
where we recall that δt > 0 is the step size and t_k = (δt)k, for iterations k = 0, 1, …. This method, which is a dissipative generalization of the leapfrog, was
proposed in França et al. (2021b) and has very good performance when solv-
ing unconstrained problems (10). In a similar fashion, one can extend any
(known) symplectic integrator to a dissipative setting; the above method is
just one such example.
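For illustration, the following is a minimal sketch of a dissipative leapfrog update of the kind written in footnote f, assuming a constant damping γ (so that Δη_k = γδt/2), g = I, and an arbitrary quadratic test potential; França et al. (2021b) remains the authoritative reference for the method and its guarantees.

```python
import numpy as np

# Quadratic test potential V(q) = 0.5 * q^T A q with its gradient.
A = np.diag([1.0, 10.0, 100.0])
grad_V = lambda q: A @ q

dt = 0.01                   # step size
gamma = 1.0                 # constant damping, so eta(t) = gamma * t
d_eta = gamma * dt / 2.0    # half-step difference of eta(t)

q = np.array([1.0, 1.0, 1.0])
p = np.zeros(3)

# Dissipative (conformal) leapfrog iteration with g = I.
for k in range(2000):
    p_half = np.exp(-d_eta) * (p - 0.5 * dt * grad_V(q))
    q = q + dt * np.cosh(d_eta) * p_half
    p = np.exp(-d_eta) * p_half - 0.5 * dt * grad_V(q)

print(0.5 * q @ A @ q)   # V(q_k) should be close to the minimum V* = 0
```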
^g Theoretically, there is no loss of generality since the Nash or Whitney embedding theorems tell us that any smooth manifold M can be embedded into ℝ^n for sufficiently large n.
^h Besides accelerated gradient-based methods, accelerated extensions of important proximal-based methods such as proximal point, proximal-gradient, alternating direction method of multipliers (ADMM), Douglas–Rachford, Tseng splitting, etc., are implicit discretizations of (8); see França et al. (2021c) for details.
The same happens in more general settings; when the damping is too strong,
the second derivative becomes negligible and the dynamics is approximately
first-order.
As an illustration, consider Fig. 1 (left), where a particle immersed in a fluid falls under the influence of a potential force −∂_q V(q), which plays the role of “gravity,” and is constrained to move on a surface. In the underdamped case, the particle is under water, which is not very viscous, so it can accelerate and moves fast (it may even oscillate). In the overdamped case, the particle is in a highly viscous fluid, such as honey, and the drag force −γp is comparable to or stronger than −∂_q V(q); thus the particle moves slowly since it cannot accelerate. During the same elapsed time δt, an accelerated particle would travel a longer distance. We can indeed verify this behavior numerically. In Fig. 1 (right) we run algorithm (23) in the underdamped and overdamped regimes when solving an optimization problem on the n-sphere, i.e., on the Lie group SO(n).^i We can see that, in the overdamped regime, this method has essentially the same dynamics as Riemannian gradient descent (Zhang and Sra, 2016), which is nonaccelerated and corresponds to first-order dynamics; all methods use the same step size, only the damping coefficient is changed.
FIG. 1 Why simulating second-order systems yields accelerated methods. Left: Constrained particle falling in fluids of different viscosity. When the drag force is strong, the particle cannot accelerate and has a first-order dynamics (see text). Right: Simulation of algorithm (23) where V(Q) is the energy of a spherical spin glass (Lie group SO(n), with n = 500) (França et al., 2021a). In the overdamped regime the method is close to Riemannian gradient descent (Zhang and Sra, 2016), which is a first-order dynamics; (23) is much faster in the underdamped regime.
^i The details are not important here, but this problem minimizes the Hamiltonian of a spherical spin glass (see França et al. (2021a) for details). The same behavior is seen with the constrained method (24) as well.
where now ∂ ≡ (∂_q, ∂_p) and μ is a measure in P_2(ℝ^{2d}). Let F be the free energy defined as
$$F[\mu] \equiv U[\mu] - \beta^{-1} S[\mu], \qquad U[\mu] \equiv \mathbb{E}_{\mu}[H], \qquad S[\mu] \equiv -\mathbb{E}_{\mu}[\log \mu], \tag{28}$$
where U is the (internal) energy, H is the Hamiltonian (6), S is the Shannon
entropy, and β is the inverse temperature. The functional derivative of the free
energy equals
$$\frac{\delta F}{\delta \mu} = H + \beta^{-1}\log\mu = \frac{1}{2}\|p\|^2 + V(q) + \beta^{-1}\log\mu. \tag{29}$$
In particular, the minimizer of F is the stationary density
$$\rho(q, p) = Z_\beta^{-1}\, e^{-\beta H(q, p)}, \tag{30}$$
where Z_β is the normalizing constant (the partition function).
Note also that the free energy (28) is nothing but the KL divergence (up to a
constant which is the partition function):
$$\mathrm{KL}[\mu\,|\,\rho] \equiv \mathbb{E}_{\mu}[\log(\mu/\rho)] = \beta F[\mu] + \log Z_\beta.$$
Therefore, the evolution of μ as given by the conformal Hamiltonian system
(27) minimizes the divergence from the stationary density (30). Substituting
(29) into (27), we obtain
$$ \partial_t \mu = -\partial_q \cdot \big[\mu\, p\big] + \partial_p \cdot \big[\mu\, \partial_q V(q) + \gamma \mu p\big] + \gamma \beta^{-1} \partial_p^2 \mu, $$
which is nothing but the Fokker–Planck equation associated to the under-
damped Langevin diffusion
$$ dq_t = p_t\, dt, \qquad dp_t = -\partial_q V(q_t)\, dt - \gamma p_t\, dt + \sqrt{2\gamma\beta^{-1}}\, dw_t, \tag{31} $$
In the overdamped limit, one obtains instead the overdamped Langevin diffusion
$$ dq_t = -\partial_q V(q_t)\, dt + \sqrt{2\beta^{-1}}\, dw_t, \tag{32} $$
which corresponds precisely to the gradient flow (25) on the free energy
functional F[μ] (Ambrosio et al., 2005; Jordan et al., 1998), where now
μ = μ(q, t) ∈ P_2(ℝ^d). Thus, in the same manner that a second-order damped
Hamiltonian system may achieve accelerated optimization compared to a
first-order gradient flow, the underdamped Langevin diffusion (31) may
achieve accelerated sampling compared to the overdamped Langevin diffu-
sion (32). Such an acceleration has indeed been demonstrated (Ma et al.,
2019a) in continuous-time and for a particular discretization.
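As a rough numerical illustration (this is a plain Euler–Maruyama sketch, not the particular discretization analyzed by Ma et al. (2019a)), one can compare the two diffusions on a toy Gaussian target; the potential, friction, and step size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def overdamped_langevin(grad_V, q0, beta=1.0, step=0.01, n_steps=5000):
    """Euler-Maruyama for the overdamped diffusion dq = -grad V(q) dt + sqrt(2/beta) dW."""
    q = float(q0)
    qs = np.empty(n_steps)
    for k in range(n_steps):
        q += -step * grad_V(q) + np.sqrt(2.0 * step / beta) * rng.standard_normal()
        qs[k] = q
    return qs

def underdamped_langevin(grad_V, q0, p0=0.0, beta=1.0, gamma=2.0, step=0.01, n_steps=5000):
    """Euler-Maruyama for the underdamped diffusion (31):
    dq = p dt, dp = -grad V(q) dt - gamma p dt + sqrt(2 gamma / beta) dW."""
    q, p = float(q0), float(p0)
    qs = np.empty(n_steps)
    for k in range(n_steps):
        p += -step * grad_V(q) - step * gamma * p + np.sqrt(2.0 * gamma * step / beta) * rng.standard_normal()
        q += step * p
        qs[k] = q
    return qs

grad_V = lambda q: q  # V(q) = q**2 / 2, i.e., a standard Gaussian target
print(overdamped_langevin(grad_V, 3.0).mean(), underdamped_langevin(grad_V, 3.0).mean())
```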
the right-hand side of (33) converges to our target integral on the left-hand side
almost surely (Hairer, 2018). An efficient sampling scheme is one that mini-
mizes the variance of the MCMC estimator. In other words, fewer samples will
be needed to obtain a good estimate. Intuitively, good samplers are Markov
chains that converge as fast as possible to the target distribution.
$$ \underbrace{m\ddot q = -\partial V(q)}_{\text{flat Newton}} \;\longrightarrow\; \underbrace{m \overbrace{\frac{\nabla \dot q}{dt}}^{\text{acceleration}} = \overbrace{-\nabla V(q)}^{\text{direction of greatest decrease}}}_{\text{Riemannian Newton}}, \tag{38} $$
with given initial conditions for the position q and velocity v = dq/dt. This is a
second-order system which evolves in the tangent bundle, (q, v) ∈ TM,
which is TM = ℝ^d × ℝ^d when M = ℝ^d. The resulting flow is conservative
since it corresponds to a Hamiltonian system as discussed in Section 2.2, with
Hamiltonian H(q, v) ≡ ½‖v‖²_q + V(q), where ‖v‖²_q is the Riemannian squared
norm, which is vᵀg(q)v when M = ℝ^d and g(q) is the Riemannian metric;
this is the manifold version of the Hamiltonian (6). This system preserves
the symplectic measure μ_Ω(q, v) = det g(q) dq dv, and thus also the canonical
distribution μ ∝ e^{−H(q,v)} μ_Ω, which is the product of the target distribution
over position with the Gaussian measure on velocity (with covariance g^{−1}).
For instance, on M = ℝ^d,
$$ \mu(q, v) \propto \rho(q)\, \mathcal{N}\big(0, g^{-1}(q)\big)(v) \propto \Big( e^{-V(q)} \sqrt{\det g(q)}\, dq \Big) \Big( \sqrt{\det g(q)}\, e^{-\frac{1}{2} v^{\mathsf{T}} g(q) v}\, dv \Big). $$
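For orientation, in the flat special case g(q) = I this is the familiar Hamiltonian Monte Carlo construction; a minimal leapfrog-based HMC transition targeting a density proportional to e^{−V(q)} can be sketched as below. The Riemannian and constrained integrators discussed next require more care; the step size and trajectory length here are illustrative assumptions.

```python
import numpy as np

def leapfrog(q, p, grad_V, step, n_leapfrog):
    """Leapfrog (Stormer-Verlet) integrator for H(q, p) = 0.5 * |p|^2 + V(q) (flat metric)."""
    p = p - 0.5 * step * grad_V(q)
    for _ in range(n_leapfrog - 1):
        q = q + step * p
        p = p - step * grad_V(q)
    q = q + step * p
    p = p - 0.5 * step * grad_V(q)
    return q, p

def hmc_step(q, V, grad_V, step=0.1, n_leapfrog=20, rng=np.random.default_rng()):
    """One HMC transition: resample Gaussian momentum, integrate, Metropolis-correct."""
    p = rng.standard_normal(q.shape)
    H0 = 0.5 * p @ p + V(q)
    q_new, p_new = leapfrog(q, p, grad_V, step, n_leapfrog)
    H1 = 0.5 * p_new @ p_new + V(q_new)
    return q_new if np.log(rng.uniform()) < H0 - H1 else q

# Illustrative usage on a 2-D standard Gaussian target.
V = lambda q: 0.5 * q @ q
grad_V = lambda q: q
q = np.zeros(2)
samples = []
for _ in range(500):
    q = hmc_step(q, V, grad_V)
    samples.append(q)
```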
metric, one greatly simplifies the equations of motion of the geodesic flow,
reducing the usual second-order Euler–Lagrange equations to the first-order
Euler-Arnold equations (Barp, 2019; Holm et al., 1998; Modin et al., 2010),
with tractable solutions in many cases of interest, e.g., for naturally reductive
homogeneous spaces, including ℝ^d, the space of positive definite matrices,
Stiefel manifolds, Grassmannian manifolds, and many Lie groups. In such
cases, it is possible to find a Riemannian metric whose geodesic flow is known
and given by the Lie group exponential (Barp et al., 2019b; Holbrook et al.,
2016, 2018). For the other main class of spaces, namely those given by
constraints, if one chooses the restriction of the Euclidean metric, then the
RATTLE scheme discussed in optimization (see Section 2.4) is a suitable sym-
plectic integrator (Au et al., 2020; Graham et al., 2019; Leimkuhler and
Matthews, 2016; Lelièvre et al., 2019, 2020) (perhaps up to a reversibility
check). Occasionally, it may be suitable to use a Riemannian metric associated
to the target distribution rather than the sample space; e.g., when it belongs to a
statistical manifold. In that case, any choice of (information) divergence gives
rise to an information tensor that may be used in the HMC algorithm. Notably,
this is the case in Bayesian statistics, wherein attempting to find a Riemannian
metric that locally matches the Hessian of the posterior motivates the use of the
Fisher information tensor summed with the Hessian of the prior, giving rise to
the Riemannian HMC (Girolami and Calderhead, 2011; Livingstone and
Girolami, 2014). When a Riemannian metric whose geodesic flow is unknown
is chosen, one can use the trick of increasing the dimension of the phase space
to add symmetries to derive explicit symplectic integrators (Cobb et al., 2019;
Tao, 2016).
Once we have an integrator for the geodesic flow, another important
consideration is the construction and tuning of the overall integrator, i.e.,
the specific composition of $\Phi^{H_1}_{\delta t}$ and $\Phi^{H_2}_{\delta t}$. Traditional numerical integrators
$$ \phi : \mathcal{T}^* \hookrightarrow \mathcal{H}^* \to \mathcal{H}. $$
where $\delta^u_x : h \mapsto u\, h(x)$; but this need not be the case in general, and we will
employ Hilbertian subspaces with no reproducing kernel to construct the
score-matching discrepancy.
This geometric description of RKHS and MMD allows us to swiftly apply
topological methods in their analysis. For example, in order for MMD² to be a
valid notion of statistical divergence, it should accurately discriminate distinct
distributions, in the sense that MMD[ρ|μ] = 0 iff ρ = μ. By construction,
MMD will be characteristic to a subset of 𝒯*, that is, be able to distinguish
its elements, iff ϕ is injective. The Hahn–Banach theorem further shows that
this is equivalent to the denseness of ℋ in 𝒯, reducing the matter to a topolog-
ical question (Simon-Gabriel and Schölkopf, 2018; Sriperumbudur et al.,
2010, 2011). In many applications, we typically would like 𝒯* to be the set
of probability measures, but the latter is not even a vector space. Instead, just
as is commonly done to define (statistical) manifolds, it is desirable to embed
𝒫 within a more structured space, such as the space of finite Radon measures
C₀*. Characteristicness to C₀* is also known as universality in learning theory,
since such RKHS are dense in L²(μ) for any μ ∈ 𝒫, which enables the method
to learn the target function independent of the data-generating distribution
(Carmeli et al., 2010). However, in many important cases, we are interested
in analyzing the denseness of H in a space other than C0. For instance, in the
case of unbounded reproducing kernels, we cannot aim to separate all
finite distributions, since the RKHS will contain unbounded functions and
the MMD will only be defined on a subset of 𝒫. In the particular case
of the KSDs discussed below, which are given by transforming a base RKHS
into a Stein RKHS via a differential operator, the characteristicness of the
Stein RKHS to a set of probability measures is equivalent to the characteris-
ticness of the base RKHS to more general spaces 𝒯* of Schwartz distributions
(Barp et al., 2022).
Moreover, the ability of MMD to discriminate distributions is also useful
to ensure it further metrizes, or at least controls, weak convergence, and thus
provide a suitable quantification of the discrepancy between unequal distribu-
tions. Indeed, on noncompact locally compact Hausdorff spaces such as ℝ^d,
when ℋ ↪ C₀, then MMD will metrize weak convergence (of probability mea-
sures) iff the kernel k is continuous and H is characteristic to the space of
finite Radon measures (Simon-Gabriel et al., 2020). The fact that the RKHS
must separate all finite measures in order to metrize weak convergence results
from the fact that otherwise MMD cannot in general prevent positive mea-
sures from degenerating into the null measure on noncompact spaces, beyond
the family of translation-invariant kernels, for which characteristicness to the
sets of probability measures or that of finite measures is in fact equivalent
(Simon-Gabriel and Schölkopf, 2018). It is also possible to prevent probabil-
ity mass from escaping to infinity—when the topology of the sequence of dis-
tributions is relatively compact with respect to the weak topology on the space
of distributions—since, in that case, standard topological arguments relate
MMD and weak convergence via characteristicness to P (Ethier and Kurtz,
2009). For example, by Prokhorov’s theorem we may use the tightness of a
sequence of distributions to ensure characteristic MMDs detect any loss of
mass, and thus control weak convergence (Gorham and Mackey, 2017).
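Concretely, the standard (biased, V-statistic) estimator of MMD² from samples, as in Gretton et al. (2012), can be sketched as follows; the Gaussian kernel and bandwidth are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-|x - y|^2 / (2 * bandwidth^2)) on all pairs of rows of X and Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X and Y."""
    return (gaussian_kernel(X, X, bandwidth).mean()
            + gaussian_kernel(Y, Y, bandwidth).mean()
            - 2.0 * gaussian_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.5, 1.0, size=(500, 2))
print(mmd2_biased(X, rng.normal(0.0, 1.0, size=(500, 2))),  # near zero: same distribution
      mmd2_biased(X, Y))                                     # larger: shifted distribution
```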
$$ \mathbb{E}_\mu : C^\infty_c(M)\,/\,\mathrm{Im}\big(\mathrm{div}_\mu|_{\mathfrak{X}_c}\big) \to \mathbb{R}. $$
Thus, the test functions that integrate to zero are precisely those that can be
written as the divergence of compactly supported vector fields. In particular,
on compact manifolds, there is a canonical Stein operator, divμ, which turns
vector fields into functions with vanishing expectations. For other types of
manifolds, one can obtain similar dualities by using other classes of differen-
tial forms, such as the square-integrable ones, or by allowing boundaries.
For our purposes, the above is sufficient to motivate calling S_μ ≡ div_μ
the canonical Stein operator, whose domain 𝔛_μ, called the Stein class, is any
set of vector fields satisfying the desired property that 𝔼_μ[S_μ(X)] ≡
∫ S_μ(X) dμ = 0 for all X ∈ 𝔛_μ.
by
$$ \mathcal{S}^B_\mu : C^\infty_\mu \to C^\infty(M), \qquad \mathcal{S}^B_\mu f \equiv \mathrm{div}_\mu\big(X^B_f\big). $$
$$ d_{\mathcal{S}_\mu(\mathcal{V})}(\mu, \rho) = \sup_{X \in \mathcal{V}} \int \mathcal{S}_\mu(X)\, d\rho. $$
The expression ∫ S_μ(X) dρ is precisely the rate of change of the KL divergence
along measures satisfying the continuity equation, an observation that leads
to Stein variational gradient descent (SVGD) methods to approximate distri-
butions (Liu and Wang, 2016; Liu and Zhu, 2018). Specifically, in SVGD the
target measure is approximated using a finite distribution Σ_ℓ δ_{x_ℓ}, where the
location of the particles {x_ℓ}_ℓ is updated by moving along the direction that
maximizes the rate of change of KL within a space of vector fields isomorphic
to a RKHS (e.g., the space of gradients of functions in a RKHS).
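A minimal sketch of the resulting SVGD update with a Gaussian kernel, following Liu and Wang (2016), may help fix ideas; the bandwidth, step size, and toy Gaussian target below are illustrative assumptions.

```python
import numpy as np

def svgd_update(X, grad_log_p, bandwidth=1.0, step=0.05):
    """One SVGD step: move the particles X along the kernelized direction that
    maximizes the rate of decrease of KL to the target with score grad_log_p."""
    diffs = X[:, None, :] - X[None, :, :]                        # x_i - x_j, shape (n, n, d)
    K = np.exp(-(diffs ** 2).sum(-1) / (2.0 * bandwidth ** 2))   # kernel matrix, shape (n, n)
    grad_K = -diffs * K[..., None] / bandwidth ** 2              # grad of k(x_i, x_j) in x_i
    scores = np.apply_along_axis(grad_log_p, 1, X)               # score at each particle, (n, d)
    phi = (K @ scores + grad_K.sum(axis=0)) / X.shape[0]         # driving + repulsive terms
    return X + step * phi

# Toy target: standard Gaussian in 2-D, so grad log p(x) = -x.
rng = np.random.default_rng(2)
X = rng.normal(5.0, 0.5, size=(100, 2))
for _ in range(500):
    X = svgd_update(X, lambda x: -x)
print(X.mean(axis=0), X.std(axis=0))  # particles spread around the target
```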
When Sμ is the canonical Stein operator, there is a canonical Stein class,
provided by Stokes’ theorem, which essentially only depends on the manifold:
for a connected manifold M, viewing integration as an operator on smooth
μ-integrable functions, we have ∫ f dμ = 0 ⇔ ∫ dα = 0, where f = div_μ(μ^♯(α)).
Unfortunately, Stokes' theorem usually does not provide a practical descrip-
tion of the differential forms that satisfy ∫ dα = 0, aside from the compactly
supported case. There are, however, several choices of Stein class constructed
from Hilbertian subspaces that lead to computationally tractable Stein discre-
pancies. One route consists in constructing a RKHS of mean-zero functions as
the image of another RKHS under a Stein operator. In this case, we can use S_μ
to map a given RKHS of ℝ^d-valued functions ℋ, with (matrix-valued) repro-
ducing kernel K, into a Stein RKHS of ℝ-valued functions S_μ(ℋ) associated to
a Stein reproducing kernel k_μ, given by (here q is the Lebesgue density of μ)
$$ k_\mu(x, y) = \frac{1}{q(x)q(y)}\, \partial_y \cdot \partial_x \cdot \big(q(x)\, K(x, y)\, q(y)\big). $$
The resulting Stein discrepancy can be thought of as an MMD that depends
only on ρ and is known as kernel Stein discrepancy (Oates et al., 2017):
$$ \mathrm{KSD}[\rho]^2 \equiv \mathrm{MMD}[\rho\,|\,\mu]^2 = \iint \frac{1}{q(x)q(y)}\, \partial_y \cdot \partial_x \cdot \big(q(x)\, K(x, y)\, q(y)\big)\, d\rho(y)\, d\rho(x). $$
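For a scalar base kernel k and score s = ∇ log q, expanding the derivatives above gives the familiar form k_μ(x, y) = ∇_x·∇_y k + ⟨s(x), ∇_y k⟩ + ⟨s(y), ∇_x k⟩ + k ⟨s(x), s(y)⟩, which the following sketch evaluates as a V-statistic over samples from ρ; the Gaussian base kernel and toy target are illustrative assumptions.

```python
import numpy as np

def ksd_squared(X, score, bandwidth=1.0):
    """V-statistic estimate of KSD[rho]^2 with a Gaussian base kernel.

    `score(x)` is the target score grad log q(x); the Stein kernel is expanded as
    k_mu = div_x div_y k + <s(x), grad_y k> + <s(y), grad_x k> + k <s(x), s(y)>.
    """
    n, d = X.shape
    S = np.apply_along_axis(score, 1, X)                        # scores, shape (n, d)
    diffs = X[:, None, :] - X[None, :, :]
    sq = (diffs ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))
    grad_x_K = -diffs * K[..., None] / bandwidth ** 2           # grad of k(x, y) in x
    grad_y_K = -grad_x_K                                        # grad of k(x, y) in y
    div_xy_K = (d / bandwidth ** 2 - sq / bandwidth ** 4) * K   # div_x div_y k(x, y)
    k_mu = (div_xy_K
            + (S[:, None, :] * grad_y_K).sum(-1)
            + (S[None, :, :] * grad_x_K).sum(-1)
            + K * (S[:, None, :] * S[None, :, :]).sum(-1))
    return k_mu.mean()

rng = np.random.default_rng(3)
score = lambda x: -x  # target: standard Gaussian
print(ksd_squared(rng.normal(0.0, 1.0, (300, 2)), score),   # small: samples match the target
      ksd_squared(rng.normal(1.0, 1.0, (300, 2)), score))    # larger: mismatch is detected
```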
It is worth noting that, while L²(T*M, μ) is not a RKHS, and does not have a
reproducing kernel, it remains a Hilbertian subspace of the space of de Rham
currents. When B is Riemannian, we recover the Riemannian score matching
(Barp et al., 2022)
$$ \mathrm{SM}_G[\rho\,|\,\mu] = \int \| \nabla H - \nabla K \|^2\, d\rho, $$
while in Euclidean space (41) yields the diffusion score matching (Barp
et al., 2019a).
The parameters B and k, and the choice of statistical model, can often be
adjusted to achieve characteristicness, consistency, and bias-robustness, and to
obtain central limit theorems; see Barp et al. (2019a) for details, and for
numerical experiments showing an acceleration induced by the information
Riemannian metric.
Under appropriate choices of kernels and models one can derive theoretical
guarantees, such as concentration/generalization bounds, consistency, asymp-
totic normality, and robustness; see, e.g., Briol et al. (2019), Gretton et al.
(2009), and Dziugaite et al. (2015). Moreover, many approaches to kernel
selection in a wide range of contexts have been studied, which include the
median heuristic or maximizing the power of hypothesis tests, and in practice
mixtures of Gaussian kernels are often employed (Briol et al., 2019; Dziugaite
et al., 2015; Garreau et al., 2017; Li et al., 2017; Ramdas et al., 2015;
Sutherland et al., 2016).
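For instance, the median heuristic mentioned above simply sets the bandwidth of a Gaussian kernel to the median pairwise distance of the data; a minimal sketch:

```python
import numpy as np

def median_heuristic_bandwidth(X):
    """Median of the pairwise Euclidean distances, a common default kernel bandwidth."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.median(dists[np.triu_indices_from(dists, k=1)])
```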
FIG. 2 Partitions and agents. This figure illustrates a human (agent π) interacting with its envi-
ronment (external process s), and the resulting partition into external s, observable o, and autono-
mous a processes. The external states are the environment, which the agent does not have direct
access to, but which is sampled through the observable states. These could include states of the
sensory epithelia (e.g., eyes and skin). The autonomous states constitute the muscles and nervous
system that factor available information into decisions. In the example of human behavior, the
environment causes observations (i.e., sensations), which inform a nervous and muscular response,
which in turn influences the environment. In general, autonomous responses may be informed by all
past agent states π_t = (o_t, a_t) (the information available to the agent at time t), which means that
the systems we are describing are typically non-Markovian.
At any moment in time t, the agent has access, at most, to its past trajec-
tory π_t, and has agency over its future autonomous trajectory a_{>t}. We define
a decision to be a choice of autonomous states in the future given available
data π_t. We interpret P(s, o | π_t) as expressing the agent's preferences over
external and observable trajectories given available data, and P(s, o | a, π_t)
as expressing the agent's predictions over external and observable paths given
a decision a. In the following, we describe how (precise) agents' decisions
P(a | π_t) relate to predictions and preferences.
Since both observable and autonomous processes evolve deterministically
in precise agents, the Shannon entropy of observable and autonomous paths
is equal^j:
$$ S[P(s, o\,|\,\pi_t)] = S[P(s, a\,|\,\pi_t)] \quad \text{for any } \pi_t. \tag{42} $$
Crucially, this allows us to express agents’ decisions as a functional of their
predictions and preferences (Friston et al., 2022)^k:
$$ \begin{aligned} 0 &= \mathbb{E}_{P(s,o,a|\pi_t)}\big[\log P(s,o\,|\,\pi_t) - \log P(s,o\,|\,\pi_t)\big] \\ &= \mathbb{E}_{P(s,o,a|\pi_t)}\big[\log P(s,a\,|\,\pi_t) - \log P(s,o\,|\,\pi_t)\big] \\ &= \mathbb{E}_{P(a|\pi_t)}\Big[\log P(a\,|\,\pi_t) + \mathbb{E}_{P(s,o|a,\pi_t)}\big[\log P(s\,|\,a,\pi_t) - \log P(s,o\,|\,\pi_t)\big]\Big] \\ &\Rightarrow\; \log P(a\,|\,\pi_t) = -\mathbb{E}_{P(s,o|a,\pi_t)}\big[\log P(s\,|\,a,\pi_t) - \log P(s,o\,|\,\pi_t)\big]. \end{aligned} \tag{EFE} $$
j To obtain (42) note that when the path space is finite, we have the equality S[P(s,o|π_t)] −
S[P(s,a|π_t)] = 𝔼_{P(x|π_t)}[log P(a|s,o,π_t) − log P(o|s,a,π_t)] = 0 due to Definition 1. This
equality can be extended to more general path spaces via a limiting argument, by expressing
entropies as a limiting density of discrete points (Jaynes, 1957).
k The second equality follows from (42), and the implication follows since the KL divergence
vanishes only when its arguments are equal.
$$ -\log P(a\,|\,\pi_t) = \underbrace{\mathrm{KL}\big[P(s\,|\,a,\pi_t)\,\big|\,P(s\,|\,\pi_t)\big]}_{\text{risk}} + \underbrace{\mathbb{E}_{P(s|a,\pi_t)}\Big[S\big[P(o\,|\,s,\pi_t)\big]\Big]}_{\text{ambiguity}} \tag{43} $$
Risk refers to the KL divergence between the predicted and preferred external
course of events. Minimizing risk entails making predicted (external) trajec-
tories fulfill preferred external trajectories. Ambiguity refers to the expected
entropy of future observations, given future external trajectories. An external
trajectory that can lead to various distinct observation trajectories is highly
ambiguous—and vice versa. Thus, minimizing ambiguity leads to sampling
observations that enable the agent to recognize the external course of events. This leads
to a type of observational bias commonly known as the streetlight effect
(Kaplan, 1973): when a person loses their keys at night, they initially search
for them under the streetlight because the resulting observations (“I see my
keys under the streetlight” or “I do not see my keys under the streetlight”)
accurately disambiguate external states of affairs.
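A minimal finite-state sketch of this risk-plus-ambiguity functional (a discrete analogue of (43); the function name, two-state setup, and toy numbers are illustrative assumptions) reads:

```python
import numpy as np

def expected_free_energy(pred_s, pref_s, lik_o_given_s, eps=1e-12):
    """Risk + ambiguity of a decision on finite state/observation spaces.

    pred_s:        predicted distribution over external states under the decision, shape (S,)
    pref_s:        preferred distribution over external states, shape (S,)
    lik_o_given_s: observation likelihood P(o|s), shape (O, S)
    """
    risk = np.sum(pred_s * (np.log(pred_s + eps) - np.log(pref_s + eps)))  # KL[predicted | preferred]
    obs_entropy = -np.sum(lik_o_given_s * np.log(lik_o_given_s + eps), axis=0)
    ambiguity = np.sum(pred_s * obs_entropy)                               # expected observation entropy
    return risk + ambiguity

# Hypothetical example: state 0 is preferred but yields uninformative observations,
# while state 1 is less preferred but yields informative observations.
lik = np.array([[0.5, 0.99],
                [0.5, 0.01]])          # P(o|s); the column s=0 is ambiguous, s=1 is not
pref = np.array([0.8, 0.2])
print(expected_free_energy(np.array([0.9, 0.1]), pref, lik),
      expected_free_energy(np.array([0.2, 0.8]), pref, lik))
```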
l As a negative log density over paths, the expected free energy is an action in the physical sense of
the word.
Extrinsic value refers to the (log) likelihood of observations under the model
of preferences. This corresponds to an expected utility or expected reward in
behavioral economics, control theory, and reinforcement learning (Barto and
Sutton, 1992; Von Neumann and Morgenstern, 1944). In short, maximizing
extrinsic value leads to sampling observations that are likely under the model
of preferences. Intrinsic value refers to the amount of information gained
about external courses of events. This measures the expected degree of belief
updating about external trajectories under a decision, with versus without
future observations. Making decisions to maximize information gain leads
to a goal-directed form of exploration (Schwartenbeck et al., 2019), driven
to answer “What would happen if I did that?” (Schmidhuber, 2010). Interest-
ingly, this decision-making procedure underwrites Bayesian experimental design (Lindley, 1956).
FIG. 3—Cont’d Decision-making under active inference. This figure illustrates various impera-
tives that underwrite decision-making under active inference in terms of several special cases that
predominate in statistics, cognitive science, and engineering. These special cases are disclosed
when one removes certain sources of uncertainty. For example, if we remove ambiguity,
decision-making minimizes risk, which corresponds to aligning predictions with preferences about
the external course of events. This underwrites prospect theory of human choice behavior in eco-
nomics (Kahneman and Tversky, 1979) and modern approaches to control as inference (Levine,
2018; Rawlik et al., 2013; Toussaint, 2009), variously known as Kalman duality (Kalman, 1960;
Todorov, 2008), KL control (Kappen et al., 2012), and maximum entropy reinforcement learning
(Ziebart, 2010). If we further remove preferences, decision-making maximizes the entropy of exter-
nal trajectories. This maximum entropy principle ( Jaynes, 1957; Lasota and MacKey, 1994) allows
one to least commit to a prespecified external trajectory and therefore keep options open. If we rein-
troduce ambiguity, but ignore preferences, decision-making maximizes intrinsic value or expected
information gain (MacKay, 2003b). This underwrites Bayesian experimental design (Lindley,
1956) and active learning in statistics (MacKay, 1992), intrinsic motivation and artificial curiosity
in machine learning and robotics (Barto et al., 2013; Deci and Ryan, 1985; Oudeyer and Kaplan,
2007; Schmidhuber, 2010; Sun et al., 2011). This is mathematically equivalent to optimizing
expected Bayesian surprise and mutual information, which underwrites visual search (Itti and
Baldi, 2009; Parr et al., 2021c) and the organization of our visual apparatus (Barlow, 1961;
Linsker, 1990; Optican and Richmond, 1987). Lastly, if we remove intrinsic value, we are left with
maximizing extrinsic value or expected utility. This underwrites expected utility theory (Von
Neumann and Morgenstern, 1944), game theory, optimal control (Åström, 1965; Bellman, 1957),
and reinforcement learning (Barto and Sutton, 1992). Bayesian formulations of maximizing
expected utility under uncertainty are also known as Bayesian decision theory (Berger, 1985).
of preferences over external trajectories P(s). Together, (1) and (2) form the
agent’s (POMDP) prediction model, and (2) and (3) form the agent’s (hidden
Markov) preference model, which defines an active inference agent. A simple
simulation of active inference on a POMDP is provided in Fig. 4; implemen-
tation details on generic POMDPs are available in Da Costa et al. (2020a),
Heins et al. (2022), Sajid et al. (2021a), and Smith et al. (2022). For more
complex simulations of sequential decision-making (e.g., involving hierarchi-
cal POMDPs), please see Sajid et al. (2021a), Friston et al. (2017b), Parr
(2019), Fountas et al. (2020), Millidge (2020), Çatal et al. (2021), and
Friston et al. (2018).
FIG. 4—Cont’d Sequential decision-making in a T-Maze environment. Left: The agent’s pre-
diction model is a partially observed Markov decision process (see text) represented here as a
Bayesian network (Bishop, 2006). The color scheme illustrates the problem at t ¼ 2: the agent
must make a decision (in red) based on previous actions and observations (in gray), which are
informative about external states and future observations (in white). Right: s_t: The T-Maze has
four possible spatial locations: middle, top-left, top-right, and bottom. One of the top locations
contains a reward (in red), while the other contains a punishment (in black). The reward’s location
determines the context. The bottom arm contains a cue whose color (blue or green) discloses the
context. Together, location and context determine the external state. o_t: The agent observes its
spatial location. In addition, when it is at the top of the Maze, it observes the reward or the pun-
ishment; when it is at the bottom, it observes the color of the cue. a_t: Each action corresponds to
visiting one of the four spatial locations. P(s_t): The agent prefers being at the reward's location
(log P(s_t) = +3) and avoids the punishment's location (log P(s_t) = −3). All other states have
a neutral preference (log P(s_t) = 0). o_0: The agent is in the middle of the Maze and is unaware
of the context. a_1: Visiting the bottom or top arms has a lower ambiguity than staying, as they
yield observations that disclose the context. However, staying or visiting the bottom arm are safer
options, as visiting a top arm risks receiving the punishment. By acting to minimize both risk and
ambiguity (43) the agent goes to the bottom. o1: The agent observes the color of the cue and hence
determines the context. a2: All actions have equal ambiguity as the context is known. Collecting
the reward has a lower risk than staying or visiting the middle, which themselves have a lower risk
than collecting the punishment. Thus, the agent visits the arm with the reward. See Friston et al.
(2017a) for more details.
(Champion et al., 2021a, b; Fountas et al., 2020; Maisto et al., 2021; Silver
et al., 2016). Similarly, Monte Carlo sampling finesses the expectations
inherent in assessing action sequences (46) (Fountas et al., 2020). A comple-
mentary approach is to assess actions, instead of action sequences, by condi-
tioning all future actions to be optimal in the sense that they minimize the
expected free energy (Da Costa et al., 2020b; Friston et al., 2021a). This idea
leads to a backward form of planning, where the agent plans for the best
action at the last time-step, followed by the best action at the penultimate
time-step, and so on, until the present. Crucially, it leads to smarter agents
(Da Costa et al., 2020b; Friston et al., 2021a) whose computational complex-
ity scales linearly (as opposed to exponentially) in the length of action
sequences (Paul et al., n.d.).
Scalable inference methods (Zhang et al., 2017a) can be used to make active
inference more efficient (van de Laar and de Vries, 2019). For example, we can
train neural networks to predict the various posterior distributions, including the
posterior over actions (Fountas et al., 2020; Millidge, 2020; Sajid et al., 2022).
While training, the output of the neural network can be used as an initial con-
dition for variational inference (Tschantz et al., 2020a), resulting in accurate
inferences whose computational cost decreases as the network learns. Addition-
ally, optimizing free energy reduces to efficient message-passing schemes
when one imposes certain simplifying restrictions on the family of candidate
distributions (Champion et al., 2021c; Parr et al., 2019; Schwöbel et al.,
2018; Wainwright and Jordan, 2007; Winn and Bishop, 2005).
A much cheaper implementation of active inference exists for continuous
states evolving in continuous time. The method frames perception and
decision-making as variational inference, by simulating a gradient flow on
free energy in an extended state space (Friston et al., 2010, 2022). Further-
more, it can be combined with discrete active inference to operate efficiently
in generative models combining discrete and continuous states (Friston et al.,
2017c). As an example, high-dimensional observations in the continuous
domain (e.g., speech) processed through continuous active inference are con-
verted into discrete, abstract representations (e.g., semantics) (Sajid et al.,
2022). Based on these representations, the agent makes high-level, categorical
decisions (e.g., “I want to move over there”), which contextualize low-level,
continuous actions (e.g., the continuous motion of a limb toward the goal
location) (Parr et al., 2021b).
Acknowledgments
The authors thank Noor Sajid for helpful discussions on adaptive agents. LD is supported by
the Fonds National de la Recherche, Luxembourg (Project code: 13568875). KF is sup-
ported by funding for the Wellcome Centre for Human Neuroimaging (Ref: 205103/Z/
16/Z) and a Canada-UK Artificial Intelligence Initiative (Ref: ES/T01279X/1). GAP was
partially supported by JPMorgan Chase & Co under J.P. Morgan A.I. Research Awards in
2019 and 2021 and by the EPSRC, grant number EP/P031587/1. This publication is based
on work partially supported by the EPSRC Centre for Doctoral Training in Mathematics
of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1). AB, GF, MG,
and MIJ thank the support of the Army Research Office (ARO) under contract W911NF-
17-1-0304 as part of the collaboration between US DOD, UK MOD, and UK Engineering
and Physical Research Council (EPSRC) under the Multidisciplinary University Research
Initiative (MURI).
References
Abdulle, A., Pavliotis, G.A., Vilmart, G., 2019. Accelerated convergence to equilibrium and
reduced asymptotic variance for Langevin dynamics using Stratonovich perturbations. C. R.
Math. 357 (4), 349–354. https://doi.org/10.1016/j.crma.2019.04.008.
Alder, B.J., Wainwright, T.E., 1959. Studies in molecular dynamics. I. general method. J. Chem.
Phys. 31 (2), 459–466.
Alimisis, F., Orvieto, A., Becigneul, G., Lucchi, A., 2021. Momentum improves optimization on
Riemannian manifolds. Int. Conf. Artif. Intell. Stat. 130, 1351–1359.
Amari, S., 2012. Differential-Geometrical Methods in Statistics. vol. 28. Springer Science & Busi-
ness Media.
Amari, S., 2016. Information Geometry and Its Applications. vol. 194. Springer.
Ambrosio, L., Gigli, N., Savare, G., 2005. Gradient Flows: In Metric Spaces and in the Space of
Probability Measures. Springer Science & Business Media, ISBN: 978-3-7643-2428-5
(January).
Anastasiou, A., Barp, A., Briol, F., Ebner, B., Gaunt, R.E., Ghaderinezhad, F., Gorham, J.,
Gretton, A., Ley, C., Liu, Q., et al., 2021. Stein’s method meets statistics: a review of some
recent developments. arXiv:2105.03481 (arXiv preprint).
Andersen, H.C., 1983. Rattle: a “velocity” version of the shake algorithm for molecular dynamics
calculations. J. Comput. Phys. 52 (1), 24–34. https://doi.org/10.1016/0021-9991(83)90014-1.
Aronszajn, N., 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68 (3), 337–404.
Asorey, M., Carinena, J.F., Ibort, L.A., 1983. Generalized canonical transformations for
time-dependent systems. J. Math. Phys. 24 (12), 2745–2750.
Åström, K.J., 1965. Optimal control of Markov processes with incomplete state information.
J. Math. Anal. Appl. 10 (1), 174–205. https://doi.org/10.1016/0022-247X(65)90154-X.
Au, K.X., Graham, M.M., Thiery, A.H., 2020. Manifold lifting: scaling MCMC to the vanishing
noise regime. arXiv:2003.03950 (arXiv preprint).
Ay, N., Jost, J., Vân Lê, H., Schwachhöfer, L., 2017. Information Geometry. vol. 64. Springer.
Barbour, A.D., 1988. Stein’s method and poisson process convergence. J. Appl. Probab. 25 (A),
175–184.
Barlow, H.B., 1961. Possible Principles Underlying the Transformations of Sensory Messages.
The MIT Press, ISBN: 978-0-262-31421-3.
Barp, A., 2019. Hamiltonian Monte Carlo on lie groups and constrained mechanics on homoge-
neous manifolds. In: International Conference on Geometric Science of Information.
Springer, pp. 665–675.
Barp, A., 2020. The Bracket Geometry of Statistics (Ph.D. thesis). Imperial College London.
Barp, A., Briol, F.X., Kennedy, A.D., Girolami, M., 2018. Geometry and dynamics for Markov
chain Monte Carlo. Annu. Rev. Stat. App. 5, 451–471.
Barp, A., Briol, F., Duncan, A.B., Girolami, M., Mackey, L., 2019a. Minimum Stein discrepancy
estimators. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E.,
Garnett, R. (Eds.), Advances in Neural Information Processing Systems. vol. 32. Curran
Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/ba7609ee5789cc4dff171045
a693a65f-Paper.pdf.
Barp, A., Kennedy, A., Girolami, M., 2019b. Hamiltonian Monte Carlo on symmetric and homo-
geneous spaces via symplectic reduction. arXiv:1903.02699.
Barp, A., Takao, S., Betancourt, M., Arnaudon, A., Girolami, M., 2021. A unifying and canonical
description of measure-preserving diffusions. arXiv:2105.02845 [math, stat].
Barp, A., Oates, C.J., Porcu, E., Girolami, M., 2022. A Riemann-Stein Kernel method. Bernoulli.
Barp, A., Simon-Gabriel, C.J., Mackey, L., 2022. Targeted convergence characteristics of maxi-
mum mean discrepancies and Kernel Stein discrepancies. (In preparation).
Barto, A., Sutton, R., 1992. Reinforcement Learning: An Introduction. A Bradford Book.
Barto, A., Mirolli, M., Baldassarre, G., 2013. Novelty or surprise? Front. Psychol. 4. https://doi.
org/10.3389/fpsyg.2013.00907.
Bassetti, F., Bodini, A., Regazzini, E., 2006. On minimum Kantorovich distance estimators. Stat.
Probab. Lett. 76 (12), 1298–1302.
Bellman, R.E., 1957. Dynamic Programming. Princeton University Press, Princeton, NJ, US,
ISBN: 978-0-691-14668-3.
Bellman, R.E., Dreyfus, S.E., 2015. Applied Dynamic Programming. Princeton University Press,
ISBN: 978-1-4008-7465-1.
Benettin, G., Giorgilli, A., 1994. On the Hamiltonian interpolation of near-to-the-identity sym-
plectic mappings with application to symplectic integration algorithms. J. Stat. Phys. 74,
1117–1143. https://doi.org/10.1007/BF02188219.
Berger, J.O., 1985. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statis-
tics, second ed. Springer-Verlag, New York, ISBN: 978-0-387-96098-2, https://doi.org/
10.1007/978-1-4757-4286-2.
Berger-Tal, O., Nathan, J., Meron, E., Saltz, D., 2014. The exploration-exploitation Dilemma:
a multidisciplinary framework. PLoS One 9 (4), e95693. https://doi.org/10.1371/journal.
pone.0095693.
Berlinet, A., Thomas-Agnan, C., 2011. Reproducing kernel Hilbert spaces in probability and
statistics. Springer Science & Business Media.
Betancourt, M., 2015. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and
naive data subsampling. In: International Conference on Machine Learning, PMLR,
pp. 533–540.
Betancourt, M., 2016. Identifying the optimal integration time in Hamiltonian Monte Carlo.
arXiv:1601.00225 (arXiv preprint).
Betancourt, M., 2017. A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
Betancourt, M., Byrne, S., Livingstone, S., Girolami, M., 2017. The geometric foundations of
Hamiltonian Monte Carlo. Bernoulli 23 (4A), 2257–2298.
Betancourt, M., Jordan, M.I., Wilson, A., 2018. On symplectic optimization. arXiv:1802.03653
[stat.CO].
Bierkens, J., Roberts, G., 2017. A piecewise deterministic scaling limit of lifted Metropolis-
Hastings in the curie-weiss model. Ann. App. Prob. 27 (2), 846–882.
Bierkens, J., Fearnhead, P., Roberts, G., 2019. The zig-zag process and super-efficient sampling
for Bayesian analysis of big data. Ann. Stats. 47 (3), 1288–1320.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Information Science and Statis-
tics, Springer, New York, ISBN: 978-0-387-31073-2.
Bismut, J.M., 1981. Martingales, the Malliavin calculus and hypoellipticity under general
Hörmander’s conditions. Z. Wahrsch. Verw. Gebiete 56 (4), 469–505.
Blanes, S., Casas, F., Sanz-Serna, J.M., 2014. Numerical integrators for the hybrid Monte Carlo
method. SIAM J. Sci. Comput. 36 (4), A1556–A1580.
Blei, D.M., Kucukelbir, A., McAuliffe, J.D., 2017. Variational inference: a review for
statisticians. J. Am. Stat. Assoc. 112 (518), 859–877. https://doi.org/10.1080/01621459.
2017.1285773.
Bonnabel, S., 2013. Stochastic gradient descent on riemannian manifolds. IEEE Trans. Autom.
Control 58 (9), 2217–2229.
Bou-Rabee, N., Sanz-Serna, J.M., 2018. Geometric integrators and the Hamiltonian Monte Carlo
method. Acta Numer. 27, 113–206.
Bouchard-Côté, A., Vollmer, S.J., Doucet, A., 2018. The bouncy particle sampler: a nonreversible
rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113 (522), 855–867.
Bravetti, A., Daza-Torres, M.L., Flores-Arguedas, H., Betancourt, M., 2019. Optimization algo-
rithms inspired by the geometry of dissipative systems. arXiv:1912.02928 [math.OC].
Briol, F., Barp, A., Duncan, A.B., Girolami, M., 2019. Statistical inference for generative models
with maximum mean discrepancy. arXiv:1906.05944.
Bronstein, M.M., Bruna, J., Cohen, T., Velickovic, P., 2021. Geometric deep learning: grids,
groups, graphs, geodesics, and gauges. arXiv:2104.13478 [cs.LG].
Campos, C.M., Sanz-Serna, J.M., 2015. Extra chance generalized hybrid Monte Carlo. J. Comput.
Phys. 281, 365–374.
Campos, C.M., Sanz-Serna, J.M., 2017. Palindromic 3-stage splitting integrators, a roadmap.
J. Comput. Phys. 346, 340–355.
Cances, E., Legoll, F., Stoltz, G., 2007. Theoretical and numerical comparison of some sampling
methods for molecular dynamics. ESAIM: Math. Model. Numer. Anal. 41 (2), 351–389.
Carmeli, C., De Vito, E., Toigo, A., Umanitá, V., 2010. Vector valued reproducing kernel Hilbert
spaces and universality. Anal. Appl. 8 (01), 19–61.
Çatal, O., Verbelen, T., Van de Maele, T., Dhoedt, B., Safron, A., 2021. Robot navigation as hier-
archical active inference. Neural Netw. 142, 192–204. https://doi.org/10.1016/j.
neunet.2021.05.010.
Celledoni, E., Marthinsen, H., Owren, B., 2014. An introduction to Lie group integrators: basics,
new developments and applications. J. Comput. Phys. 257, 1040–1061.
Celledoni, E., Ehrhardt, M.J., Etmann, C., McLachlan, R.I., Owren, B., Schonlieb, C.B.,
Sherry, F., 2021. Structure-preserving deep learning. Eur. J. Appl. Math. 32 (5), 888–936.
Chafaï, D., 2004. Entropies, convexity, and functional inequalities: on ϕ-entropies and ϕ-Sobolev
inequalities. J. Math. Kyoto Univ. 44 (2), 325–363. https://doi.org/10.1215/kjm/1250283556.
Chak, M., Kantas, N., Lelièvre, T., Pavliotis, G. A., 2021, Nov. Optimal friction matrix for under-
damped Langevin sampling.
Champion, T., Bowman, H., Grzes, M., 2021a. Branching time active inference: empirical study
and complexity class analysis. arXiv:2111.11276 [cs].
Champion, T., Da Costa, L., Bowman, H., Grzes, M., 2021b. Branching time active inference: the
theory and its generality. arXiv:2111.11107 [cs].
Champion, T., Grzes, M., Bowman, H., 2021c. Realizing active inference in variational message
passing: the outcome-blind certainty seeker. Neural Comput. 33 (10), 2762–2826. https://doi.
org/10.1162/neco_a_01422.
Chen, Y., Li, W., 2018. Natural gradient in Wasserstein statistical manifold. arXiv:1805.08380
(arXiv preprint).
Chen, T., Fox, E., Guestrin, C., 2014. Stochastic gradient Hamiltonian Monte Carlo. In: Interna-
tional Conference on Machine Learning. PMLR, pp. 1683–1691.
Chen, W.Y., Barp, A., Briol, F., Gorham, J., Girolami, M., Mackey, L., Oates, C., 2019. Stein
point Markov chain Monte Carlo. In: International Conference on Machine Learning, PMLR,
pp. 1011–1021.
Chentsov, N.N., 1965. Categories of mathematical statistics. Uspekhi Mat. Nauk 20 (4), 194–195.
Chwialkowski, K., Strathmann, H., Gretton, A., 2016. A kernel test of goodness of fit. In: Inter-
national Conference on Machine Learning, PMLR, pp. 2606–2615.
Clark, M.A., Joó, B., Kennedy, A.D., Silva, P.J., 2011. Improving dynamical lattice QCD simula-
tions through integrator tuning using Poisson brackets and a force-gradient integrator. Phys.
Rev. D 84 (7), 071502.
Cobb, A.D., Baydin, A.G., Markham, A., Roberts, S.J., 2019. Introducing an explicit symplectic
integration scheme for Riemannian manifold Hamiltonian Monte Carlo. arXiv:1910.06243
(arXiv preprint).
Cullen, M., Davey, B., Friston, K.J., Moran, R.J., 2018. Active inference in OpenAI Gym: a para-
digm for computational investigations into psychiatric illness. Biol. Psychiatry Cogn. Neu-
rosci. Neuroimaging 3 (9), 809–818. https://doi.org/10.1016/j.bpsc.2018.06.010.
Cuturi, M., 2013. Sinkhorn distances: lightspeed computation of optimal transport. NeurIPS 26,
2292–2300.
Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., Friston, K., 2020a. Active inference on
discrete state-spaces: a synthesis. J. Math. Psychol. 99, 102447. https://doi.org/10.1016/
j.jmp.2020.102447.
Da Costa, L., Sajid, N., Parr, T., Friston, K., Smith, R., 2020b. The relationship between dynamic
programming and active inference: the discrete, finite-horizon case. arXiv:2009.08111
[cs, math, q-bio].
Da Costa, L., Friston, K., Heins, C., Pavliotis, G.A., 2021. Bayesian mechanics for stationary pro-
cesses. Proc. R. Soc. A Math. Phys. Eng. Sci. 477 (2256), 20210518. https://doi.org/10.1098/
rspa.2021.0518.
Da Costa, L., Lanillos, P., Sajid, N., Friston, K., Khan, S., 2022. How active inference could help
revolutionise robotics. Entropy 24 (3), 361. https://doi.org/10.3390/e24030361.
Davis, M.H.A., 1984. Piecewise-deterministic markov processes: a general class of non-diffusion
stochastic models. J. R. Stat. Soc. B (Methodol.) 46 (3), 353–376.
Deci, E., Ryan, R.M., 1985. Intrinsic Motivation and Self-Determination in Human Behavior.
Perspectives in Social Psychology, Springer US, New York, ISBN: 978-0-306-42022-1,
https://doi.org/10.1007/978-1-4899-2271-7.
Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D., 1987. Hybrid Monte Carlo. Phys. Lett.
B 195 (2), 216–222.
Duncan, A.B., Lelièvre, T., Pavliotis, G.A., 2016. Variance reduction using nonreversible Lange-
vin samplers. J. Stat. Phys. 163 (3), 457–491. https://doi.org/10.1007/s10955-016-1491-2.
Duncan, A.B., Nüsken, N., Pavliotis, G.A., 2017. Using perturbed underdamped Langevin dynam-
ics to efficiently sample from probability distributions. J. Stat. Phys. 169 (6), 1098–1131.
https://doi.org/10.1007/s10955-017-1906-8.
Durmus, A., Moulines, E., 2017. Nonasymptotic convergence analysis for the unadjusted Lange-
vin algorithm. Ann. App. Prob. 27 (3), 1551–1587.
Durmus, A., Moulines, E., Saksman, E., 2017. On the convergence of Hamiltonian Monte Carlo.
arXiv:1705.00166 (arXiv preprint).
Durmus, A., Moulines, E., Pereyra, M., 2018. Efficient Bayesian computation by proximal
Markov chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging Sci. 11 (1),
473–506.
Durrleman, S., Pennec, X., Trouve, A., Ayache, N., 2009. Statistical models of sets of curves and
surfaces based on currents. Med. Image Anal. 13 (5), 793–808.
Dziugaite, G.K., Roy, D.M., Ghahramani, Z., 2015. Training generative neural networks via
maximum mean discrepancy optimization. arXiv:1505.03906 (arXiv preprint).
Ethier, S.N., Kurtz, T.G., 2009. Markov Processes: Characterization and Convergence. vol. 282
John Wiley & Sons.
Fang, Y., Sanz-Serna, J.-M., Skeel, R.D., 2014. Compressible generalized hybrid Monte Carlo.
J. Chem. Phys. 140 (17), 174108.
Fernández-Pendás, M., Akhmatskaya, E., Sanz-Serna, J.M., 2016. Adaptive multi-stage integrators
for optimal energy conservation in molecular simulations. J. Comput. Phys. 327, 434–449.
Forest, E., 2006. Geometric integration for particle accelerators. J. Phys. A Math. Gen. 39,
5321–5377.
Fountas, Z., Sajid, N., Mediano, P.A.M., Friston, K., 2020. Deep active inference agents using
Monte-Carlo methods. arXiv:2006.04176 [cs, q-bio, stat].
França, G., Robinson, D., Vidal, R., 2018. ADMM and accelerated ADMM as continuous dyna-
mical systems. Int. Conf. Mach. Learn. 80, 1559–1567.
França, G., Robinson, D.P., Vidal, R., 2018. A nonsmooth dynamical systems perspective on
accelerated extensions of ADMM. arXiv:1808.04048 [math.OC].
França, G., Sulam, J., Robinson, D.P., Vidal, R., 2020. Conformal symplectic and relativistic
optimization. J. Stat. Mech. 2020 (12), 124008. https://doi.org/10.1088/1742-5468/abcaee.
França, G., Barp, A., Girolami, M., Jordan, M.I., 2021a. Optimization on manifolds: a symplectic
approach. arXiv:2107.11231 [cond-mat.stat-mech].
França, G., Jordan, M.I., Vidal, R., 2021b. On dissipative symplectic integration with applications
to gradient-based optimization. J. Stat. Mech. 2021 (4), 043402. https://doi.org/10.1088/1742-
5468/abf5d4.
França, G., Robinson, D.P., Vidal, R., 2021c. Gradient flows and proximal splitting methods:
a unified view on accelerated and stochastic optimization. Phys. Rev. E 103, 053304.
https://doi.org/10.1103/PhysRevE.103.053304.
Friston, K., 2010. The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11 (2),
127–138. https://doi.org/10.1038/nrn2787.
Friston, K., Kilner, J., Harrison, L., 2006. A free energy principle for the brain. J. Physiol.-Paris
100 (1-3), 70–87. https://doi.org/10.1016/j.jphysparis.2006.10.001.
Friston, K.J., Daunizeau, J., Kilner, J., Kiebel, S.J., 2010. Action and behavior: a free-energy for-
mulation. Biol. Cybern. 102 (3), 227–260. https://doi.org/10.1007/s00422-010-0364-z.
Friston, K., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., Pezzulo, G., 2015. Active infer-
ence and epistemic value. Cogn. Neurosci. 6 (4), 187–214. https://doi.org/
10.1080/17588928.2015.1020053.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., Pezzulo, G., 2016. Active
inference and learning. Neurosci. Biobehav. Rev. 68, 862–879. https://doi.org/10.1016/
j.neubiorev.2016.06.022.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G., 2017a. Active inference:
a process theory. Neural Comput. 29 (1), 1–49. https://doi.org/10.1162/NECO_a_00912.
Friston, K.J., Lin, M., Frith, C.D., Pezzulo, G., Hobson, J.A., Ondobaka, S., 2017b. Active infer-
ence, curiosity and insight. Neural Comput. 29 (10), 2633–2683. https://doi.org/10.1162/
neco_a_00999.
Friston, K.J., Parr, T., de Vries, B., 2017c. The graphical brain: belief propagation and active
inference. Netw. Neurosci. 1 (4), 381–414. https://doi.org/10.1162/NETN_a_00018.
Friston, K.J., Rosch, R., Parr, T., Price, C., Bowman, H., 2018. Deep temporal models and
active inference. Neurosci. Biobehav. Rev. 90, 486–501. https://doi.org/10.1016/j.neubiorev.
2018.04.004.
Friston, K., Parr, T., Zeidman, P., 2019. Bayesian model reduction. arXiv:1805.07092 [stat].
Friston, K., Da Costa, L., Hafner, D., Hesp, C., Parr, T., 2021a. Sophisticated inference. Neural
Comput. 33 (3), 713–763. https://doi.org/10.1162/neco_a_01351.
Friston, K., Heins, C., Ueltzhöffer, K., Da Costa, L., Parr, T., 2021b. Stochastic Chaos and
Markov Blankets. Entropy 23 (9), 1220. https://doi.org/10.3390/e23091220.
Friston, K., Moran, R.J., Nagai, Y., Taniguchi, T., Gomi, H., Tenenbaum, J., 2021c. World model
learning and inference. Neural Netw. 144, 573–590. https://doi.org/10.1016/j.neunet.
2021.09.011.
Friston, K., Da Costa, L., Sajid, N., Heins, C., Ueltzhöffer, K., Pavliotis, G.A., Parr, T., 2022. The
free energy principle made simpler but not too simple. arXiv:2201.06387 [cond-mat, physics:
nlin, physics:physics, q-bio].
Garbuno-Inigo, A., Hoffmann, F., Li, W., Stuart, A.M., 2019. Interacting Langevin diffusions:
gradient structure and ensemble Kalman sampler. arXiv:1903.08866 [math].
Garreau, D., Jitkrittum, W., Kanagawa, M., 2017. Large sample analysis of the median heuristic.
arXiv:1707.07269 (arXiv preprint).
Girolami, M., Calderhead, B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo
methods. J. R. Stat. Soc. B (Stat. Methodol.) 73 (2), 123–214.
Gorham, J., Mackey, L., 2017. Measuring sample quality with kernels. In: International Confer-
ence on Machine Learning. PMLR, pp. 1292–1301.
Gorham, J., Duncan, A.B., Vollmer, S.J., Mackey, L., 2019. Measuring sample quality with diffu-
sions. Ann. Appl. Probab. 29 (5), 2884–2928.
Graham, M.M., Thiery, A.H., Beskos, A., 2019. Manifold Markov chain Monte Carlo methods for
Bayesian inference in a wide class of diffusion models. arXiv:1912.02982 (arXiv preprint).
Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.K., 2009. A fast, consistent kernel
two-sample test. In: NIPS, vol. 23, pp. 673–681.
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A., 2012. A kernel two-sample
test. J. Mach. Learn. Res. 13 (1), 723–773.
Guillin, A., Monmarche, P., 2021. Optimal linear drift for the speed of convergence of an Hypoel-
liptic diffusion. arXiv:1604.07295 [math].
Hairer, M., 2018. Ergodic Properties of Markov Processes.
Hairer, E., Lubich, C., Wanner, G., 2010. Geometric Numerical Integration: Structure-Preserving
Algorithms for Ordinary Differential Equations. Springer.
Hansen, A.C., 2011. A theoretical framework for backward error analysis on manifolds. J. Geom.
Mech. 3 (1), 81–111. https://doi.org/10.3934/jgm.2011.3.81.
Harms, P., Michor, P.W., Pennec, X., Sommer, S., 2020. Geometry of sample spaces.
arXiv:2010.08039.
Hastings, W.K., 1970. Monte Carlo Sampling Methods Using Markov Chains and their Applica-
tions. Oxford University Press.
Haussmann, U.G., Pardoux, E., 1986. Time reversal of diffusions. Ann. Probab. 14 (4),
1188–1205. https://doi.org/10.1214/aop/1176992362.
Heber, F., Trst’anová, Z., Leimkuhler, B., 2020. Posterior sampling strategies based on discretized
stochastic differential equations for machine learning applications. J. Mach. Learn. Res.
21 (228), 1–33.
Heins, C., Millidge, B., Demekas, D., Klein, B., Friston, K., Couzin, I., Tschantz, A., 2022.
Pymdp: a Python library for active inference in discrete state spaces. arXiv:2201.03904
[cs, q-bio].
Helffer, B., 1998. Remarks on decay of correlations and Witten Laplacians Brascamp–Lieb
inequalities and semiclassical limit. J. Funct. Anal. 155 (2), 571–586. https://doi.org/
10.1006/jfan.1997.3239.
Hodgkinson, L., Salomone, R., Roosta, F., 2020. The reproducing stein kernel approach for
post-hoc corrected sampling. arXiv:2001.09266 (arXiv preprint).
Hoffman, M.D., Gelman, A., et al., 2014. The No-U-Turn sampler: adaptively setting path lengths
in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15 (1), 1593–1623.
Holbrook, A., Vandenberg-Rodes, A., Shahbaba, B., 2016. Bayesian inference on matrix mani-
folds for linear dimensionality reduction. arXiv:1606.04478 (arXiv preprint).
Holbrook, A., Lan, S., Vandenberg-Rodes, A., Shahbaba, B., 2018. Geodesic Lagrangian Monte
Carlo over the space of positive definite matrices: with application to Bayesian spectral
density estimation. J. Stat. Comput. Simul. 88 (5), 982–1002.
Holm, D.D., Marsden, J.E., Ratiu, T.S., 1998. The Euler-Poincare equations and semidirect
products with applications to continuum theories. Adv. Math. 137 (1), 1–81.
Hörmander, L., 1967. Hypoelliptic second order differential equations. Acta Math. 119, 147–171.
https://doi.org/10.1007/BF02392081.
Horowitz, A.M., 1991. A generalized guided Monte Carlo algorithm. Phys. Lett. B 268 (2),
247–252.
Hwang, C.-R., Hwang-Ma, S.-Y., Sheu, S.-J., 2005. Accelerating diffusions. Ann. Appl. Probab.
15 (2), 1433–1444. https://doi.org/10.1214/105051605000000025.
Hyvärinen, A., Dayan, P., 2005. Estimation of non-normalized statistical models by score match-
ing. J. Mach. Learn. Res. 6 (4), 695–709.
Itti, L., Baldi, P., 2009. Bayesian surprise attracts human attention. Vis. Res. 49 (10), 1295–1306.
https://doi.org/10.1016/j.visres.2008.09.007.
Izaguirre, J.A., Hampton, S.S., 2004. Shadow hybrid Monte Carlo: an efficient propagator in
phase space of macromolecules. J. Comput. Phys. 200 (2), 581–604.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106 (4), 620–630.
https://doi.org/10.1103/PhysRev.106.620.
Jeffreys, H., 1946. An invariant form for the prior probability in estimation problems. Proc. R.
Soc. Lond. A. Math. Phys. Sci. 186 (1007), 453–461.
Jordan, R., Kinderlehrer, D., Otto, F., 1998. The variational formulation of the Fokker-Planck
equation. SIAM J. Math. Anal. 29 (1), 1–17.
Jost, J., Lê, H.V., Tran, T.D., 2021. Probabilistic morphisms and Bayesian nonparametrics. Eur.
Phys. J. Plus 136 (4), 1–29.
Joulin, A., Ollivier, Y., 2010. Curvature, concentration and error estimates for Markov chain
Monte Carlo. Ann. Probab. 38 (6), 2418–2442. https://doi.org/10.1214/10-AOP541.
Kahneman, D., Tversky, A., 1979. Prospect theory: an analysis of decision under risk. Econome-
trica 47 (2), 263–291. https://doi.org/10.2307/1914185.
Kakade, S.M., 2001. A natural policy gradient. Adv. Neural Inf. Process. Syst. 14, 1531–1538.
Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. J. Basic Eng.
82 (1), 35–45. https://doi.org/10.1115/1.3662552.
Kaplan, A., 1973. The Conduct of Inquiry. Transaction Publishers, ISBN: 978-1-4128-3629-6.
Kappen, H.J., Gómez, V., Opper, M., 2012. Optimal control as a graphical model inference prob-
lem. Mach. Learn. 87 (2), 159–182. https://doi.org/10.1007/s10994-012-5278-7.
Karakida, R., Okada, M., Amari, S., 2016. Adaptive natural gradient learning algorithms for
unnormalized statistical models. In: International Conference on Artificial Neural Networks,
Springer, pp. 427–434.
Katsoulakis, M., Pantazis, Y., Rey-Bellet, L., 2014. Measuring the irreversibility of numerical
schemes for reversible stochastic differential equations. ESAIM: Math. Model. Numer.
Anal./Modelisation Mathematique et Analyse Numerique 48 (5), 1351–1379. https://doi.org/
10.1051/m2an/2013142.
Kennedy, A.D., Silva, P.J., Clark, M.A., 2013. Shadow Hamiltonians, Poisson brackets, and gauge
theories. Phys. Rev. D 87, 034511.
Lanillos, P., Meo, C., Pezzato, C., Meera, A.A., Baioumy, M., Ohata, W., Tschantz, A.,
Millidge, B., Wisse, M., Buckley, C.L., Tani, J., 2021. Active inference in robotics and arti-
ficial agents: survey and challenges. arXiv:2112.01871 [cs].
Lasota, A., MacKey, M.C., 1994. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics.
Springer-Verlag, ISBN: 978-3-540-94049-4.
Lee, J.M., 2013. Smooth manifolds. In: Introduction to Smooth Manifolds, Springer, pp. 1–31.
Leimkuhler, B., Matthews, C., 2016. Efficient molecular dynamics using geodesic integration and
solvent-solute splitting. Proc. R. Soc. A Math. Phys. Eng. Sci. 472 (2189), 20160138. https://
doi.org/10.1098/rspa.2016.0138.
Leimkuhler, B., Reich, S., 2004. Simulating Hamiltonian Dynamics. Cambridge University Press.
Leimkuhler, B.J., Skeel, R.D., 1994. Symplectic numerical integrators in constrained Hamiltonian
systems. J. Comput. Phys. 112, 117–125. https://doi.org/10.1006/jcph.1994.1085.
Lelièvre, T., Nier, F., Pavliotis, G.A., 2013. Optimal non-reversible linear drift for the conver-
gence to equilibrium of a diffusion. J. Stat. Phys. 152 (2), 237–274. https://doi.org/10.1007/
s10955-013-0769-x.
Lelièvre, T., Rousset, M., Stoltz, G., 2019. Hybrid Monte Carlo methods for sampling probability
measures on submanifolds. Numer. Math. 143 (2), 379–421. https://doi.org/10.1007/s00211-
019-01056-4.
Lelièvre, T., Stoltz, G., Zhang, W., 2020. Multiple projection MCMC algorithms on submani-
folds. arXiv:2003.09402 (arXiv preprint).
Leok, M., Zhang, J., 2017. Connecting information geometry and geometric mechanics. Entropy
19 (10), 518.
Levine, S., 2018. Reinforcement learning and control as probabilistic inference: tutorial and
review. arXiv:1805.00909 [cs, stat].
Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., Póczos, B., 2017. Mmd gan: towards deeper under-
standing of moment matching network. arXiv:1705.08584 (arXiv preprint).
Lindley, D.V., 1956. On a measure of the information provided by an experiment. Ann. Math.
Stat. 27 (4), 986–1005.
Linsker, R., 1990. Perceptual neural organization: some approaches based on network models and
information theory. Annu. Rev. Neurosci. 13 (1), 257–281. https://doi.org/10.1146/annurev.
ne.13.030190.001353.
Liu, Q., Wang, D., 2016. Stein variational gradient descent: a general purpose Bayesian inference
algorithm. Adv. Neural Inf. Process. Syst. 29.
Liu, C., Zhu, J., 2018. Riemannian Stein variational gradient descent for Bayesian inference.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32. 1.
Liu, Q., Lee, J., Jordan, M., 2016. A kernelized stein discrepancy for goodness-of-fit tests.
In: International Conference on Machine Learning, PMLR, pp. 276–284.
Livingstone, S., Girolami, M., 2014. Information-geometric Markov chain Monte Carlo methods
using diffusions. Entropy 16 (6), 3074–3102.
Livingstone, S., Betancourt, M., Byrne, S., Girolami, M., 2019. On the geometric ergodicity of
Hamiltonian Monte Carlo. Bernoulli 25 (4A), 3109–3138.
Ma, Y.-A., Chen, T., Fox, E.B., 2015. A complete recipe for Stochastic gradient MCMC.
arXiv:1506.04696 [math, stat].
Ma, Y.-A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P., Jordan, M.I., 2019a. Is There an
Analog of Nesterov Acceleration for MCMC? arXiv:1902.00996.
Ma, Y.-A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P., Jordan, M.I., 2019b. Is there an
analog of Nesterov acceleration for MCMC? arXiv:1902.00996 [cs, math, stat].
MacKay, D.J.C., 1992. Information-based objective functions for active data selection. Neural
Comput. 4 (4), 590–604. https://doi.org/10.1162/neco.1992.4.4.590.
MacKay, D.J.C., 2003a. Information Theory, Inference and Learning Algorithms. Cambridge
University Press.
MacKay, D.J.C., 2003b. Information Theory, Inference and Learning Algorithms, sixth printing
2007 ed. Cambridge University Press, Cambridge, UK; New York, ISBN: 978-0-521-64298-9.
Mackenze, P.B., 1989. An improved hybrid Monte Carlo method. Phys. Lett. B 226 (3-4),
369–371.
Maisto, D., Gregoretti, F., Friston, K., Pezzulo, G., 2021. Active tree search in large POMDPs.
arXiv:2103.13860 [cs, math, q-bio].
Markovic, D., Stojic, H., Schwöbel, S., Kiebel, S.J., 2021. An empirical evaluation of active infer-
ence in multi-armed bandits. Neural Netw. 144, 229–246. https://doi.org/10.1016/
j.neunet.2021.08.018.
Marsden, J.E., West, M., 2001. Discrete mechanics and variational integrators. Acta Numer. 10,
357–514. https://doi.org/10.1017/S096249290100006X.
Marthinsen, H., Owren, B., 2016. Geometric integration of non-autonomous Hamiltonian
problems. Adv. Comput. Math. 42, 313–332. https://doi.org/10.1007/s10444-015-9425-0.
Mattingly, J.C., Stuart, A.M., Higham, D.J., 2002. Ergodicity for SDEs and approximations:
locally Lipschitz vector fields and degenerate noise. Stoch. Process. Their Appl. 101 (2),
185–232. https://doi.org/10.1016/S0304-4149(02)00150-3.
Mattingly, J.C., Stuart, A.M., Tretyakov, M.V., 2010. Convergence of numerical time-averaging
and stationary measures via Poisson equations. SIAM J. Numer. Anal. 48 (2), 552–577.
https://doi.org/10.1137/090770527.
Mazzaglia, P., Verbelen, T., Dhoedt, B., 2021. Contrastive active inference. In: Advances in
Neural Information Processing Systems.
McLachlan, R., Perlmutter, M., 2001. Conformal Hamiltonian systems. J. Geom. Phys. 39,
276–300. https://doi.org/10.1016/S0393-0440(01)00020-1.
McLachlan, R.I., Quispel, G.R.W., 2002. Splitting methods. Acta Numer. 11, 341. https://doi.org/
10.1017/S0962492902000053.
McLachlan, R.I., Quispel, G.R.W., 2006. Geometric integrators for ODEs. J. Phys. A Math. Gen.
39, 5251–5285. https://doi.org/10.1088/0305-4470/39/19/s01.
McLachlan, R.I., Quispel, G.R.W., Robidoux, N., 1999. Geometric integration using discrete gra-
dients. Philos. Trans. R. Soc. Lond. A 357 (1754), 1021–1045.
McLachlan, R.I., Modin, K., Verdier, O., Wilkins, M., 2014. Geometric generalizations of
SHAKE and RATTLE. Found. Comput. Math. (14), 339–370. https://doi.org/10.1007/
s10208-013-9163-y.
Metropolis, N.R., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equation of
state calculations by fast computing machines. J. Chem. Phys. 21 (6), 1087–1092.
Millidge, B., 2020. Deep active inference as variational policy gradients. J. Math. Psychol. 96,
102348. https://doi.org/10.1016/j.jmp.2020.102348.
74 SECTION I Foundations in classical geometry and analysis
Mira, A., 2001. Ordering and improving the performance of Monte Carlo Markov chains. Stat.
Sci. 16 (4), 340–350. https://doi.org/10.1214/ss/1015346319.
Modin, K., Perlmutter, M., Marsland, S., McLachlan, R., 2010. Geodesics on Lie groups: Euler
equations and totally geodesic subgroup. Res. Lett. Inform. Math. Sci. 14, 79–106.
Muandet, K., Fukumizu, K., Sriperumbudur, B., Sch€olkopf, B., 2016. Kernel mean embedding of
distributions: a review and beyond. arXiv:1605.09522 (arXiv preprint).
Muehlebach, M., Jordan, M.I., 2021a. On constraints in first-order optimization: a view from
non-smooth dynamical systems. arXiv:2107.08225, [math.OC].
Muehlebach, M., Jordan, M.I., 2021b. Optimization with momentum: dynamical, control-
theoretic, and symplectic perspectives. J. Mach. Learn. Res. 22 (73), 1–50.
M€uller, A., 1997. Integral probability metrics and their generating classes of functions. Adv. Appl.
Probab. 29 (2), 429–443.
Murray, I., Adams, R., MacKay, D., 2010. Elliptical slice sampling. In: Int. Conf. Artificial Intel-
ligence and Stats, JMLR Workshop and Conference Proceedings, pp. 541–548.
Neal, R.M., 1992. Bayesian training of Backpropagation Networks by the Hybrid Monte Carlo
Method. Citeseer.
Neal, R.M., 2003. Slice sampling. Ann. Stat. 31 (3), 705–767.
Neal, R.M., 2004. Improving asymptotic variance of MCMC estimators: non-reversible chains are
better. arXiv:math/0407281.
Neal, R.M., 2011. MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte
Carlo, Chapman and Hall/CRC.
Nesterov, Y., 1983. A method of solving a convex programming problem with convergence rate
O(1/k2). Soviet Math. Doklady 27 (2), 372–376.
Nielsen, F., 2020. An elementary introduction to information geometry. Entropy 22 (10), 1100.
Oates, C.J., Girolami, M., Chopin, N., 2017. Control functionals for Monte Carlo integration. J. R.
Stat. Soc. B (Stat. Methodol.) 79 (3), 695–718.
Optican, L.M., Richmond, B.J., 1987. Temporal encoding of two-dimensional patterns by single
units in primate inferior temporal cortex. III. Information theoretic analysis. J. Neurophysiol.
57 (1), 162–178. https://doi.org/10.1152/jn.1987.57.1.162.
Otto, F., 2001. The geometry of dissipative evolution equations: the porous medium equation.
Commun. Partial Differ. Equ. 26, 101–174. https://doi.org/10.1081/PDE-100002243.
Ottobre, M., 2016. Markov chain Monte Carlo and irreversibility. Rep. Math. Phys. 77, 267–292.
https://doi.org/10.1016/S0034-4877(16)30031-3.
Ottobre, M., Pillai, N.S., Pinski, F.J., Stuart, A.M., 2016. A function space HMC algorithm with
second order Langevin diffusion limit. Bernoulli 22 (1), 60–106.
Oudeyer, P.-Y., Kaplan, F., 2007. What is intrinsic motivation? A typology of computational
approaches. Front. Neurorobot. 1, 6. https://doi.org/10.3389/neuro.12.006.2007.
Park, H., Amari, S., Fukumizu, K., 2000. Adaptive natural gradient learning algorithms for vari-
ous stochastic models. Neural Netw. 13 (7), 755–764.
Parr, T., 2019. The Computational Neurology of Active Vision (Ph.D. thesis). University College
London, London.
Parr, T., Markovic, D., Kiebel, S.J., Friston, K.J., 2019. Neuronal message passing using mean-
field, Bethe, and marginal approximations. Sci. Rep. 9 (1), 1889. https://doi.org/10.1038/
s41598-018-38246-3.
Parr, T., Da Costa, L., Heins, C., Ramstead, M.J.D., Friston, K.J., 2021a. Memory and Markov
Blankets. Entropy 23 (9), 1105. https://doi.org/10.3390/e23091105.
Parr, T., Limanowski, J., Rawji, V., Friston, K., 2021b. The computational neurology of move-
ment under active inference. Brain 144 (6), 1799–1818. https://doi.org/10.1093/brain/
awab085.
Sampling, optimization, inference, and adaptive agents Chapter 2 75
Parr, T., Sajid, N., Da Costa, L., Mirza, M.B., Friston, K.J., 2021c. Generative models for active
vision. Front. Neurorobot. 15, 651432.
Parry, M., Dawid, A.P., Lauritzen, S., 2012. Proper local scoring rules. Ann. Stat. 40 (1),
561–592.
Paul, A., Sajid, N., Gopalkrishnan, M., Razi, A., 2021. Active inference for Stochastic control.
arXiv:2108.12245 [cs].
Paul, A., Da Costa, L., Gopalkrishnan, M., Razi, A., n.d. Active Inference for Stochastic and
Adaptive Control in a Partially Observable Environment.
Pavliotis, G.A., 2014. Stochastic Processes and Applications: Diffusion Processes, the Fokker-
Planck and Langevin Equations. Texts in Applied Mathematics, vol. 60. Springer, New York,
ISBN: 978-1-4939-1322-0.
Peters, E.A.J.F., de With, G., 2012. Rejection-free Monte Carlo sampling for general potentials.
Phys. Rev. E 85 (2), 026703.
Peyre, G., Cuturi, M., et al., 2019. Computational optimal transport: with applications to data
science. Found. Trends Mach. Learn. 11 (5-6), 355–607.
Polyak, B.T., 1964. Some methods of speeding up the convergence of iteration methods. USSR
Comput. Math. Math. Phys. 4 (5), 1–17.
Predescu, C., Lippert, R.A., Eastwood, M.P., Ierardi, D., Xu, H., Jensen, M., Bowers, K.J.,
Gullingsrud, J., Rendleman, C.A., Dror, R.O., et al., 2012. Computationally efficient molec-
ular dynamics integrators with improved sampling accuracy. Mol. Phys. 110 (9-10), 967–983.
Radivojevic, T., Akhmatskaya, E., 2020. Modified Hamiltonian Monte Carlo for Bayesian infer-
ence. Stat. Comput. 30 (2), 377–404.
Radivojevic, T., Fernández-Pendás, M., Sanz-Serna, J.M., Akhmatskaya, E., 2018. Multi-stage
splitting integrators for sampling with modified Hamiltonian Monte Carlo methods. J. Com-
put. Phys. 373, 900–916.
Ramdas, A., Reddi, S.J., Poczos, B., Singh, A., Wasserman, L., 2015. Adaptivity and
computation-statistics tradeoffs for kernel and distance based high dimensional two sample
testing. arXiv:1508.00655 (arXiv preprint).
Rao, C.R., 1992. Information and the accuracy attainable in the estimation of statistical para-
meters. In: Breakthroughs in Statistics, Springer, pp. 235–247.
Rawlik, K., Toussaint, M., Vijayakumar, S., 2013. On Stochastic optimal control and reinforce-
ment learning by approximate inference. In: Twenty-Third International Joint Conference
on Artificial Intelligence.
Rey-Bellet, L., Spiliopoulos, K., 2015. Irreversible Langevin samplers and variance reduction: a
large deviation approach. Nonlinearity 28 (7), 2081–2103. https://doi.org/10.1088/0951-
7715/28/7/2081.
Roberts, G.O., Tweedie, R.L., 1996. Exponential convergence of Langevin distributions and their
discrete approximations. Bernoulli, 341–363.
Rousset, M., Stoltz, G., Lelievre, T., 2010. Free Energy Computations: A Mathematical Perspec-
tive. World Scientific.
Sajid, N., Ball, P.J., Parr, T., Friston, K.J., 2021a. Active inference: demystified and compared.
Neural Comput. 33 (3), 674–712. https://doi.org/10.1162/neco_a_01357.
Sajid, N., Da Costa, L., Parr, T., Friston, K., 2021b. Active inference, Bayesian optimal design,
and expected utility. arXiv:2110.04074 [cs, math, stat].
Sajid, N., Holmes, E., Costa, L.D., Price, C., Friston, K., 2022. A mixed generative model of audi-
tory word repetition. bioRxiv. https://doi.org/10.1101/2022.01.20.477138. 2022.01.20.477138.
Sanz-Serna, J.M., 1992. Symplectic integrators for Hamiltonian problems: an overview. Acta
Numer. 1, 243–286. https://doi.org/10.1017/S0962492900002282.
76 SECTION I Foundations in classical geometry and analysis
Saumard, A., Wellner, J.A., 2014. Log-concavity and strong log-concavity: a review. Stat. Surv. 8,
45–114. https://doi.org/10.1214/14-SS107.
Schmidhuber, J., 2010. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE
Trans. Auton. Ment. Dev. 2 (3), 230–247. https://doi.org/10.1109/TAMD.2010.2056368.
Schwartenbeck, P., Passecker, J., Hauser, T.U., FitzGerald, T.H., Kronbichler, M., Friston, K.J.,
2019. Computational mechanisms of curiosity and goal-directed exploration. eLife 8, 45.
Schwartz, L., 1964. Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux associes
(noyaux reproduisants). J. d’anal. Math. 13 (1), 115–256.
Schw€ obel, S., Kiebel, S., Markovic, D., 2018. Active inference, belief propagation, and the Bethe
approximation. Neural Comput. 30 (9), 2530–2567. https://doi.org/10.1162/neco_a_01108.
Sexton, J.C., Weingarten, D.H., 1992. Hamiltonian evolution for the hybrid Monte Carlo algo-
rithm. Nucl. Phys. B 380 (3), 665–677.
Shahbaba, B., Lan, S., Johnson, W.O., Neal, R.M., 2014. Split Hamiltonian Monte Carlo. Stat.
Comput. 24 (3), 339–349.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J.,
Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T.,
Hassabis, D., 2016. Mastering the game of Go with deep neural networks and tree search.
Nature 529 (7587), 484–489. https://doi.org/10.1038/nature16961.
Simon-Gabriel, C.-J., Sch€olkopf, B., 2018. Kernel distribution embeddings: universal kernels,
characteristic kernels and kernel metrics on distributions. J. Mach. Learn. Res. 19 (1),
1708–1736.
Simon-Gabriel, C.J., Barp, A., Sch€olkopf, B., Mackey, L., 2020. Metrizing weak convergence
with maximum mean discrepancies. arXiv:2006.09268 (arXiv preprint).
Smith, R., Schwartenbeck, P., Parr, T., Friston, K.J., 2020. An active inference approach to mod-
eling structure learning: concept learning as an example case. Front. Comput. Neurosci. 14,
41. https://doi.org/10.3389/fncom.2020.00041.
Smith, R., Friston, K.J., Whyte, C.J., 2022. A step-by-step tutorial on active inference and its
application to empirical data. J. Math. Psychol. 107, 102632. https://doi.org/10.1016/j.
jmp.2021.102632.
Sohl-Dickstein, J., Mudigonda, M., DeWeese, M., 2014. Hamiltonian Monte Carlo without
detailed balance. In: International Conference on Machine Learning, pp. 719–726.
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Sch€olkopf, B., Lanckriet, G.R.G., 2010.
Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11,
1517–1561.
Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.R.G., 2011. Universality, characteristic Kernels
and RKHS embedding of measures. J. Mach. Learn. Res. 12 (7), 2389–2410.
Stein, C., 1972. A bound for the error in the normal approximation to the distribution of a sum of
dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathemat-
ical Statistics and Probability, volume 2: Probability Theory, vol. 6. University of California
Press, pp. 583–603.
Steinwart, I., Christmann, A., 2008. Support Vector Machines. Springer Science & Business
Media.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., Gretton, A., 2015. Gradient-free
Hamiltonian Monte Carlo with efficient kernel exponential families. arXiv:1506.02564
(arXiv preprint).
Su, W., Boyd, S., Candès, E.J., 2016. A differential equation for modeling Nesterov’s accelerated
gradient method: theory and insights. J. Mach. Learn. Res. 17 (153), 1–43.
Sampling, optimization, inference, and adaptive agents Chapter 2 77
Sun, Y., Gomez, F., Schmidhuber, J., 2011. Planning to be surprised: optimal Bayesian explora-
tion in dynamic environments. arXiv:1103.5708 [cs, stat].
Sutherland, D.J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., Gretton, A., 2016.
Generative models and model criticism via optimized maximum mean discrepancy.
arXiv:1611.04488 (arXiv preprint).
Suzuki, M., 1990. Fractal decomposition of exponential operators with applications to many-body
theories and Monte Carlo simulations. Phys. Lett. A 146, 319–323. https://doi.org/
10.1016/0375-9601(90)90962-N.
Takahashi, M., Imada, M., 1984. Monte Carlo calculation of quantum systems. II. Higher order
correction. J. Phys. Soc. Jpn. 53, 3765–3769.
Tao, M., 2016. Explicit symplectic approximation of nonseparable Hamiltonians: algorithm and
long time performance. Phys. Rev. E 94 (4), 043303.
Todorov, E., 2008. General duality between optimal control and estimation. In: 2008 47th IEEE
Conference on Decision and Control, December, pp. 4286–4292.
Toussaint, M., 2009. Robot trajectory optimization using approximate inference. In: ICML ’09.
Proceedings of the 26th Annual International Conference on Machine Learning, June, Asso-
ciation for Computing Machinery, Montreal, Quebec, Canada, pp. 1049–1056.
Tschantz, A., Millidge, B., Seth, A.K., Buckley, C.L., 2020a. Control as hybrid inference.
arXiv:2007.05838 [cs, stat].
Tschantz, A., Seth, A.K., Buckley, C.L., 2020b. Learning action-oriented models through active
inference. PLoS Comput. Biol. 16 (4), e1007805. https://doi.org/10.1371/journal.pcbi.
1007805.
Tuckerman, M.B.B.J.M., Berne, B.J., Martyna, G.J., 1992. Reversible multiple time scale molec-
ular dynamics. J. Chem. Phys. 97 (3), 1990–2001.
Vaillant, M., Glaunes, J., 2005. Surface matching via currents. In: Biennial Int. Conf. Information
Processing in Medical Imaging, Springer, pp. 381–392.
van de Laar, T.W., de Vries, B., 2019. Simulating active inference processes by message passing.
Front. Robot. AI 6.
van der Himst, O., Lanillos, P., 2020. Deep active inference for partially observable MDPs.
arXiv:2009.03622 [cs, stat].
Van der Vaart, A.W., 2000. Asymptotic Statistics. vol. 3. Cambridge University Press.
Vanetti, P., Bouchard-C^ote, A., Deligiannidis, G., Doucet, A., 2017. Piecewise-deterministic
Markov chain Monte Carlo. arXiv:1707.05296.
Vapnik, V., 1999. The Nature of Statistical Learning Theory. Springer Science & Business Media.
Villani, C., 2009a. Hypocoercivity. Memoirs of the American Mathematical Society, vol. 202.
American Mathematical Society, ISBN: 978-1-4704-0564-9 978-0-8218-6691-7 978-0-
8218-4498-4. https://doi.org/10.1090/S0065-9266-09-00567-5.
Villani, C., 2009b. Optimal Transport: Old and New. Springer.
Von Neumann, J., Morgenstern, O., 1944. Theory of Games and Economic Behavior. Princeton
University Press.
Wainwright, M.J., Jordan, M.I., 2007. Graphical models, exponential families, and variational
inference. Found. Trends Mach. Learn. 1 (1-2), 1–305. https://doi.org/10.1561/2200000001.
Wang, Z., Mohamed, S., Freitas, N., 2013. Adaptive Hamiltonian and Riemann manifold Monte
Carlo. In: International Conference on Machine Learning, PMLR, pp. 1462–1470.
Wauthier, S.T., Çatal, O., Verbelen, T., Dhoedt, B., 2020. Sleep: Model Reduction in Deep Active
Inference. p. 13.
Weinstein, A., 1997. The modular automorphism group of a Poisson manifold. J. Geom. Phys.
23 (3-4), 379–394.
78 SECTION I Foundations in classical geometry and analysis
Wibisono, A., Wilson, A.C., Jordan, M.I., 2016. A variational perspective on accelerated methods
in optimization. Proc. Natl. Acad. Sci. 113 (47), E7351–E7358. https://doi.org/10.1073/
pnas.1614734113.
Wilson, A., Recht, B., Jordan, M.I., 2021. A Lyapunov analysis of accelerated methods in optimi-
zation. J. Mach. Learn. Res. 22, 1–34.
Winn, J., Bishop, C.M., 2005. Variational message passing. J. Mach. Learn. Res. 34.
Wu, S.-J., Hwang, C.-R., Chu, M.T., 2014. Attaining the optimal Gaussian diffusion acceleration.
J. Stat. Phys. 155, 571–590. https://doi.org/10.1007/s10955-014-0963-5.
Yoshida, H., 1990. Construction of higher order symplectic integrators. Phys. Lett. A 150 (5),
262–268. https://doi.org/10.1016/0375-9601(90)90092-3.
Zhang, H., Sra, S., 2016. First-order methods for geodesically convex optimization. In: Conf.
Learning Theory, pp. 1617–1638.
Zhang, C., Butepage, J., Kjellstrom, H., Mandt, S., 2017a. Advances in variational inference.
arXiv:1711.05597 [cs, stat].
Zhang, C., Shahbaba, B., Zhao, H., 2017b. Hamiltonian Monte Carlo acceleration using surrogate
functions with random bases. Stat. Comput. 27 (6), 1473–1490.
Zhang, B.J., Marzouk, Y.M., Spiliopoulos, K., 2021. Geometry-informed irreversible perturba-
tions for accelerated convergence of Langevin dynamics. arXiv:2108.08247 [math, stat].
Ziebart, B., 2010. Modeling Purposeful Adaptive Behavior With the Principle of Maximum
Causal Entropy (Ph.D. thesis). Carnegie Mellon University, Pittsburgh.
Chapter 3
Equivalence relations
and inference for sparse
Markov models
Donald E.K. Martin^a,*, Iris Bennett^b,d, Tuhin Majumder^a, and Soumendra Nath Lahiri^c
^a Department of Statistics, North Carolina State University, Raleigh, NC, United States
^b Department of Statistics, North Carolina State University, Raleigh, NC, United States
^c Department of Mathematics and Statistics, Washington University in St. Louis, St. Louis, MO, United States
^d Corteva Agriscience, Raleigh, NC, United States
* Corresponding author: e-mail: demarti4@ncsu.edu
Abstract
Equivalence relations can be useful for statistical inference. This is demonstrated for
modeling and statistical inference using sparse Markov models (SMMs). SMMs arise
as a mapping of higher-order Markov models, based on the equivalence relation of con-
ditioning histories having equal conditional probability distributions. After discussing
advantages of SMMs for statistical modeling, we give two algorithms for their fitting,
and highlight their use in modeling and classification applications. We also show
through an application to alignment of DNA sequences that equivalence relations can
greatly reduce the number of states of an auxiliary Markov chain used for computing
the distribution of a pattern statistic, an important inference task for categorical time
series.
Keywords: Auxiliary Markov chain, Collapsed Gibbs sampler, Dirichlet process, Quo-
tient mapping, Recursive computation, Regularization
1 Introduction
Let X ≡ X_1, X_2, …, X_n be a categorical time series taking values x ≡ x_1, x_2, …, x_n, with each x_j lying in a finite set Σ. Markovian structure, which assumes that conditioning on very recent observations is sufficient, is frequently used in modeling categorical time series. In a Markov model with mth-order dependence, the conditional probability of the current observation given the last m data points does not change by conditioning further into the past, i.e., for t = m + 1, m + 2, …, n,

Pr[X_t = x_t | X_{t-1} = x_{t-1}, …, X_1 = x_1] = Pr[X_t = x_t | X_{t-1} = x_{t-1}, …, X_{t-m} = x_{t-m}].

An mth-order Markovian sequence X may be associated with a first-order Markov chain, with states represented by the m-tuples x̃_t ≡ (x_{t-m+1}, …, x_t) of Σ^m (t ∈ {m, m + 1, …, n}), an initial distribution τ over m-tuples at time m, and a |Σ|^m × |Σ|^m transition probability matrix T that has no more than |Σ| nonzero elements per row, where |·| denotes set cardinality.
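To make this embedding concrete, the following minimal sketch (a hypothetical helper, not from the chapter) lifts an mth-order chain to its first-order representation on m-tuples; each row of the lifted matrix has at most |Σ| nonzero entries because only the last symbol of the state changes.

from itertools import product
import numpy as np

def lift_to_first_order(sigma, m, cond_prob):
    # Embed an mth-order chain over alphabet `sigma` as a first-order chain
    # whose states are m-tuples.  cond_prob[(history, x)] is
    # Pr[X_t = x | last m symbols = history].
    states = list(product(sigma, repeat=m))           # all m-tuples
    index = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for h in states:
        for x in sigma:
            T[index[h], index[h[1:] + (x,)]] = cond_prob[(h, x)]
    return states, T

# toy illustration: a 2nd-order chain on {0, 1}
sigma, m = (0, 1), 2
cond_prob = {(h, x): 0.7 if x == h[-1] else 0.3
             for h in product(sigma, repeat=m) for x in sigma}
states, T = lift_to_first_order(sigma, m, cond_prob)
assert np.allclose(T.sum(axis=1), 1.0)                # each row is a distribution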
listing contexts in the order x_{t-4}, x_{t-3}, x_{t-2}, x_{t-1}. A full context tree representation of the 16 contexts is given in the left-hand plot of Fig. 1, where the contexts are the leaf nodes of the graph. The histories in equivalence classes of
FIG. 1 Context tree of the full fourth-order histories (left) and VLMC of Example 1 (right).
The classes γ_l, l = 1, …, 4, are the partition of the set of 2-tuples into those with a fixed value of X_{t-2} and an arbitrary value of X_{t-1}. This is not a VLMC, as there is no common ending of the 2-tuples in each class, but it is an SMM. SMMs have been considered in recent years under the names minimal Markov models (García and González-López, 2010), sparse Markov chains (Jääskinen et al., 2014), SMMs (Xiong et al., 2016), and partition Markov models (Fernández et al., 2018; García and González-López, 2017). SMMs allow a wider variety of models than VLMCs and Markov chains with partial connections, as conditions on the grouped m-tuples are relaxed.
Given the usefulness of SMMs for modeling categorical time series, the
nontrivial task of model fitting is important. The next section gives summaries
of two algorithms for fitting the model, along with three applications. In
Section 3, using the task of computing the distribution of a pattern statistic
through an auxiliary Markov chain (AMC), we show that equivalence rela-
tions can be very helpful for statistical inference related to data following
an SMM. The final section is a summary.
There has been relatively little work in the literature on fitting SMMs.
Garcı́a and González-López (2010, 2017) used an agglomerative hierarchical
clustering approach in which two clusters of histories are combined when
their combination reduces the Bayesian information criterion (BIC) of the
fitted model. They showed that their approach renders a consistent estimator.
Jääskinen et al. (2014) took a Bayesian approach to the problem of fitting
SMMs. Setting a uniform prior over the possible partitions of the histories,
their method used an agglomerative hierarchical clustering approach that
maximizes the posterior probability of the partition at each step. To reduce
the number of posterior probabilities to be calculated as part of the latter clus-
tering algorithm and thus cut down the computation time, Xiong et al. (2016)
applied a Delaunay triangulation to the sample transition probabilities of the
histories before performing agglomerative clustering along the triangulation.
Two very recent methods are outlined below.
c_i | ζ ~ Multinomial(1, ζ)  (i.i.d. over i),
ζ ~ GEM(1, α),
p_j ~ Dirichlet(|Σ|^{-1} 1_{|Σ|})  (i.i.d. over j),
1. Initialization
for h in Σ^m do
    for i in Σ do
        Compute n_{i|h}
    end
end
Select an initial partition of the histories
2. Gibbs Sampling
for i in 1:I do
    for h in Σ^m do
        Remove h from its current cluster
        for each cluster γ_j do
            Calculate P(X | h ∈ cluster γ_j)
        end
        Calculate P(X | h ∈ new cluster)
        Reassign h to a cluster according to these probabilities
    end
    if i exceeds the burn-in period then
        Record the partition
    end
end
3. Select Partition
Find the mode over the recorded partitions
Return the mode
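As an illustration of the sampler just outlined, the following is a minimal, hypothetical Python sketch (not the Bennett et al., 2022 implementation): it Gibbs-samples cluster labels for the histories under a Dirichlet-process prior, using Dirichlet-multinomial marginal likelihoods of the symbol counts n_{i|h}; the burn-in threshold and the Dirichlet(1/|Σ|) base measure mirror the model specification above.

import numpy as np
from scipy.special import gammaln

def dm_log_pred(counts_h, counts_cluster, alpha):
    # log posterior-predictive (Dirichlet-multinomial, up to a constant in counts_h)
    # of one history's symbol counts given the counts already in a cluster
    a = alpha + counts_cluster
    return (gammaln(a.sum()) - gammaln((a + counts_h).sum())
            + np.sum(gammaln(a + counts_h) - gammaln(a)))

def collapsed_gibbs(counts, alpha0=1.0, n_iter=200, burn_in=100, seed=0):
    # counts: array of shape (number of histories, |Sigma|) holding n_{i|h}
    rng = np.random.default_rng(seed)
    H, S = counts.shape
    alpha = np.full(S, 1.0 / S)              # Dirichlet(1/|Sigma|) base measure
    z = np.zeros(H, dtype=int)               # start with all histories in one cluster
    recorded = []
    for it in range(n_iter):
        for h in range(H):
            z[h] = -1                        # remove h from its current cluster
            labels = [k for k in np.unique(z) if k >= 0]
            logp = []
            for k in labels:                 # existing clusters
                members = (z == k)
                logp.append(np.log(members.sum())
                            + dm_log_pred(counts[h], counts[members].sum(axis=0), alpha))
            logp.append(np.log(alpha0)       # opening a brand-new cluster
                        + dm_log_pred(counts[h], np.zeros(S), alpha))
            logp = np.array(logp) - max(logp)
            probs = np.exp(logp) / np.exp(logp).sum()
            pick = rng.choice(len(probs), p=probs)
            z[h] = labels[pick] if pick < len(labels) else max(labels, default=-1) + 1
        if it >= burn_in:
            recorded.append(z.copy())        # record post-burn-in partitions
    return recorded                          # e.g., report the most frequent partition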
Bennett et al. (2022) showed that the computation using the GSDPMM algorithm runs in O(η|Σ|^m) time (much less, for large m, than the most competitive method in terms of accuracy, Xiong et al. 2016). Also shown was that each of the iterations of the Gibbs sampler provides a consistent estimator of the correct partition as n → ∞.
Bennett et al. (2022) used a third-order Markov model, as did Kharin (2020), to describe wind patterns over the n = 6574 days. Here Σ = {0, 1, 2}, and there are thus 2η transition probability parameters to be estimated. For the full third-order Markov model, η = 3³ = 27 and thus there are 54 transition probability parameters. Using the MC(3, 2) model yielded η = 9 clusters of histories, reducing the number of transition probability parameters to 18. Fitting an SMM to the data further reduced the model, grouping the histories into η = 5 clusters for a total of 10 transition probability parameters. The SMM representation had a BIC of 8043, while the MC(3, 2) model had a BIC of 8125. The greater flexibility of the SMM when compared to the MC model then allowed for a reduction in the number of parameters to be estimated and a better model fit according to BIC, without sacrificing the explanatory power of the model.
Σ_{j=1}^{|Σ|^m} Σ_{a∈Σ} (π̂_{j,a} − b_{j,a})² + λ Σ_{1≤j₁<j₂≤|Σ|^m} w_{j₁j₂} ‖b_{j₁} − b_{j₂}‖₂,     (1)
where λ > 0 is a penalty parameter. Eq. (1) penalizes the distance between distinct pairs of probability vectors in order to identify identical ones. Once the probability vectors b_j have been identified, estimates of the number of groups η in S and of the groups γ_j, j = 1, …, η, are evident.
The advantage of clustering by solving Eq. (1) for a range of λ is that we get a solution path corresponding to the range from a single cluster consisting of all the elements to |Σ|^m singleton clusters. Hence, the algorithm does not need to fix the number of clusters a priori. It also avoids getting stuck in local minima.
The tuning parameter λ, and thus the optimum cluster assignment, is cho-
sen to minimize BIC over a grid of λ values. BIC is defined as
BIC(λ) = −2ℓ(λ) + η_λ (|Σ| − 1) log n,
where ηλ is the number of groups and ‘(λ) is the likelihood associated with a
specific value of λ. The solution of Eq. (1) corresponding to the optimal BIC λ
value is considered as the estimated cluster assignment. Majumder et al.
(2022) gave theoretical results that indicate that for a range of λ values, one
can perfectly recover the true clusters for large n.
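For concreteness, the BIC-based choice of λ can be sketched as follows (the per-λ fitted log-likelihoods and cluster counts are placeholders supplied by the fitting procedure, not quantities defined in the chapter):

import numpy as np

def bic(loglik, n_clusters, sigma_size, n):
    # BIC(lambda) = -2 * loglik(lambda) + eta_lambda * (|Sigma| - 1) * log n
    return -2.0 * loglik + n_clusters * (sigma_size - 1) * np.log(n)

def select_lambda(lambdas, loglik_at, clusters_at, sigma_size, n):
    # evaluate BIC over a grid of lambda values and return the minimizer
    scores = [bic(loglik_at[l], clusters_at[l], sigma_size, n) for l in lambdas]
    return lambdas[int(np.argmin(scores))]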
Several efficient algorithms have been developed in recent years to solve
Eq. (1) when the penalty function is convex. Chi and Lange (2015) used aug-
mented Lagrangian methods. First, view solving Eq. (1) as the constrained optimization problem

minimize    (1/2) Σ_{j=1}^{|Σ|^m} ‖π̂_j − b_j‖₂² + λ Σ_{l∈E} w_l ‖v_l‖₂
subject to  b_{l₁} − b_{l₂} − v_l = 0,
where E is the set of all distinct edges {l : l = (l₁, l₂), l₁ < l₂, w_l > 0}. Here, a new splitting variable v_l has been introduced to capture the difference between the group centroids. Chi and Lange (2015) used two algorithms for solving this constrained optimization problem, namely ADMM and AMA.
For both algorithms, one first incorporates an augmented Lagrangian as
follows:
L_ν(B, V, Ψ) = (1/2) Σ_{j=1}^{|Σ|^m} ‖π̂_j − b_j‖₂² + λ Σ_{l∈E} w_l ‖v_l‖₂ + Σ_{l∈E} ⟨ψ_l, v_l − b_{l₁} + b_{l₂}⟩ + (ν/2) Σ_{l∈E} ‖v_l − b_{l₁} + b_{l₂}‖₂²,

where B, V, and Ψ are the matrices with b_j, v_l, and ψ_l for j = 1, …, |Σ|^m and l ∈ E as their columns, respectively. Splitting the variables in this manner
allows one to update B, V, and Ψ sequentially, given the other variables. The
convergence of ADMM does not depend on the choice of ν, as it will converge
for any ν > 0. On the other hand, AMA converges for any 0 < ν < 2/|Σ|^m.
Chi and Lange (2015) showed that AMA is much faster than ADMM,
especially when the weights are sparse. They also proved that only B and Ψ
need to be updated at each step, but not V.
AMA was used in the model fitting of Majumder et al. (2022). Let B^(t) and Ψ^(t) be the parameter values at the tth step. The updates to the next step are

Data: Sequence X
Result: Partition for SMM
Initialize Ψ^(0)
for t = 1, 2, 3, … do
    for j = 1, 2, 3, …, |Σ|^m do
        Δ_j^(t) = Σ_{l: l₁=j} ψ_l^(t−1) − Σ_{l: l₂=j} ψ_l^(t−1)
    end
    forall l do
        g_l^(t) = π̂_{l₁} − π̂_{l₂} + Δ_{l₁}^(t) − Δ_{l₂}^(t)
        ψ_l^(t) = P_{C_l}(ψ_l^(t−1) − ν g_l^(t))
    end
end
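The updates above can be sketched in Python as follows. Two details are assumptions rather than statements from this chapter: the projection set C_l is taken (as in Chi and Lange, 2015) to be the ℓ2-ball of radius λw_l, and the centroids are recovered as b_j = π̂_j + Δ_j from the stationarity condition of the augmented Lagrangian in b_j.

import numpy as np

def ama_convex_clustering(pi_hat, edges, weights, lam, nu, n_iter=500):
    # pi_hat: (|Sigma|^m, |Sigma|) matrix of empirical transition probability vectors
    # edges:  list of index pairs (l1, l2), l1 < l2, with positive weights
    p, d = pi_hat.shape
    psi = np.zeros((len(edges), d))          # dual variables psi_l

    def deltas(psi):
        delta = np.zeros((p, d))
        for l, (l1, l2) in enumerate(edges):
            delta[l1] += psi[l]
            delta[l2] -= psi[l]
        return delta

    for _ in range(n_iter):
        delta = deltas(psi)
        for l, (l1, l2) in enumerate(edges):
            g = pi_hat[l1] - pi_hat[l2] + delta[l1] - delta[l2]
            psi[l] = psi[l] - nu * g
            radius = lam * weights[l]        # assumed ball C_l = {psi : ||psi||_2 <= lam*w_l}
            norm = np.linalg.norm(psi[l])
            if norm > radius:
                psi[l] *= radius / norm      # projection onto C_l
    # assumed centroid recovery b_j = pi_hat_j + Delta_j; equal rows signal merged clusters
    return pi_hat + deltas(psi)

Per the text above, convergence of AMA requires choosing 0 < ν < 2/|Σ|^m.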
3.1 Notation
A pattern is a finite string of symbols from Σ. For a pattern u = (u₁, …, u_{|u|}) of length |u|, (u₁, …, u_h), 1 ≤ h ≤ |u|, is a prefix of u (a proper prefix if h < |u|), and (u_j, …, u_{|u|}), 1 ≤ j ≤ |u|, is a suffix of u (a proper suffix if j > 1). Let W be a collection of patterns that occur in X and Z a pattern statistic related to W, taking values z in a finite set Φ. For convenience it will be assumed that all patterns of W have length greater than m. Also, let a ∘ b denote the concatenation of strings a and b. For example, if a = 1101 and b = 101, then a ∘ b = 1101101.
C̃_q contains completion strings for prefix strings of fs(q) and suffixes of q of length less than m that are pattern prefixes. The occurrence of strings of C̃_q may or may not require an update to the pattern statistic Z. For example, let Z be the indicator of whether or not W occurs in X. If W has already occurred, Z remains 1 with any additional occurrences, and thus the statistic would not be updated. The set of completion strings whose occurrence does require an update to Z is denoted by C_q. Also, define the direct occurrence of a q ∈ Q_{>m} to be the occurrence of q ∘ v, where v is the completion string of q.
The transition function for string q ∈ Q on symbol x ∈ Σ is defined by δ(q, x) ≡ the longest suffix of q ∘ x that is in Q. Thus either δ(q, x) = q ∘ x, or δ(q, x) = fl(q ∘ x).
Martin (2019) proved two main theorems and a corollary to the second one. In Theorem 1, the author showed that the symbiosis of the failure state of q ∘ x and failure transitions (when δ(q, x) ≠ q ∘ x) makes them both easy to compute, as both are obtained by first going to fl(q), and then computing the transition on symbol x from there. Thus failure states may be obtained sequentially (over string lengths) using the transition function δ.
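A minimal sketch of how δ and the failure states can be computed sequentially as just described (hypothetical helper functions; Σ = {0, 1} for illustration, the state set Q is assumed given, fl denotes the failure state, and (q∘x)_m is read as the final m symbols of q∘x):

def delta(q, x, Q):
    # transition function: the longest suffix of q + x that lies in Q
    s = q + x
    while s and s not in Q:
        s = s[1:]
    return s                                  # empty if no nonempty suffix of q+x is in Q

def failure_states(Q, m, sigma="01"):
    # fl(q + x) computed over increasing string lengths, following Theorem 1:
    # fl(q+x) = last m symbols of q+x when |q| = m, and delta(fl(q), x) otherwise.
    # Assumes every state of length > m in Q arises by extending a shorter state of Q.
    fl = {}
    for q in sorted(Q, key=len):
        if len(q) < m:
            continue
        for x in sigma:
            qx = q + x
            if qx in Q:
                fl[qx] = qx[-m:] if len(q) == m else delta(fl[q], x, Q)
    return fl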
Theorem 2, along with its corollary, gives necessary and sufficient conditions for strings to be equivalent so that equivalent strings may be determined and combined during the setup of the state space Υ. In the proof of the theorem, it was shown that updates to the statistic's value are the same on concatenating an arbitrary string to q and q′ if and only if the updates to the statistic's value are the same when any string of C_q ∪ C_{q′} occurs. (The required condition (q)_m = (q′)_m is clear, so that probabilities starting in both strings are the same.) The proof of the corollary simply steps through showing necessity and sufficiency. The reader is pointed to the original paper for the proofs.
Theorem 1. If q ∈ Σ^m, then fl(q ∘ x) = (q ∘ x)_m. For q ∈ Q∖Σ^m, fl(q ∘ x) = δ(fl(q), x).
Corollary 1. For statistics and pattern counting techniques such that the direct occurrence of q does not preclude updating the statistic on the occurrence of the strings of C_{fl(q)}, if |fs(q)| > 1 and |fs(q′)| > 1, then q ∼ q′ if and only if fl(q) ∼ fl(q′) and the updates to the statistic are exactly the same on the direct occurrences of q and q′.

Corollary 2. For statistics and pattern counting techniques such that the direct occurrence of q (q′) does not preclude updating the statistic on the occurrence of the strings of C_{fl(q)} (C_{fl(q′)}), if |q| > m and |q′| > m, then q ∼ q′ if and only if fl(q) ∼ fl(q′) and the updates to the statistic are exactly the same on the direct occurrences of q and q′.
Theorem 3 and its corollary lead to the following algorithm for setting up Υ and computing the distribution of the pattern statistic Z in the case of an SMM.
Algorithm 3 (Computing the sampling distribution of statistic Z). Given an SMM {S, p} with maximal context depth m, the distribution of Z may be obtained as follows:
T = [ 0.4   0     0.6   0     0     0
      0.2   0     0     0     0     0.8
      0     0.65  0     0     0     0.35
      0     0.7   0     0     0     0.3
      0     0     0     0     0.8   0.2
      0.4   0     0     0.6   0     0   ].

In the case of stationary trials, solving τ̌Ť = τ̌ for the probability vector τ̌ gives the initial probability vector

τ = (1/455) [50, 51, 30, 45, 204, 75].
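For completeness, solving such a stationarity equation numerically can be done as in the following generic sketch (the row-stochastic matrix is supplied as an argument; here one would pass in the Ť of Example 3):

import numpy as np

def stationary_distribution(T_check):
    # solve tau * T = tau together with sum(tau) = 1 for a row-stochastic matrix
    n = T_check.shape[0]
    A = np.vstack([T_check.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    tau, *_ = np.linalg.lstsq(A, b, rcond=None)
    return tau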
placed in the state space. Strings 1 10 and 11 0 are equivalent as they have
exactly the same transitions and updates on completion string 1.
To illustrate setting up a state space in the SMM case, we consider the relatively short spaced seed 1*11*1. Then, feasibility of the computation for a seed of a length used in practice is shown for the “Patternhunter seed” 111*1**1*1**11*111 (Ma et al., 2002). The model is fixed in both cases to have probability equivalence classes S = {γ₁, …, γ₄} = {{000, 100, 010, 110}, {011, 111}, {001}, {101}} (m = 3), with Pr[X_t = 1 | γ_j], j = 1, 2, 3, 4, respectively, equal to (0.6, 0.8, 0.65, 0.7), as in Example 3. This VLMC model is input to the algorithm. The length of X for possible alignment is fixed at n = 64, the length considered in Ma et al. (2002).
Example 4. (Seed 1*11*1). Spaced seed S = 1*11*1 has W = {101101, 101111, 111101, 111111}. For this collection of patterns, m-tuples 011 and 111 have different completion strings, and thus γ₂ = {011, 111} needs to be split. Also, whereas the elements of {000, 100} have no pattern progress, those of {010, 110} have pattern prefix 10 as their suffix, and thus γ₁ needs to be split into these two classes. Thus Ť and τ are as in Example 3.
FIG. 3 States of the AMC for seed 1*11*1. The counting states whose entrance signals that the
statistic is updated are colored in green, and other states with nonzero coverage in blue. Coverage
updates on transitions into counting states are indicated and covered positions are marked with a
1. For clarity, not all transitions are shown. States in white have no coverage, and are needed to
compute sensitivity of the seed (along with an absorbing state). White nodes with a blue “glow”
are prefix strings, while the others are not.
sensitivity of the seed, the probability of at least one seed occurrence in X (an
absorbing state to indicate the seed’s occurrence would also be needed if seed
sensitivity was used to trigger an alignment instead of seed coverage). The
strings 10110 and 11110 are equivalent (from Corollary 2) for computing sen-
sitivity or coverage, as they both have the same failure state, and the update to
coverage is + 4 when both strings are completed. Thus, on symbol 0, 1111
transitions to 10110, and 11110 is deleted. State 1 1 11 transitions to 1 011 0
on symbol 0 (the second position of the string is no longer marked as being
covered since the position cannot possibly be involved in a seed hit that
occurs after the direct one). Similarly, state 1 1 11 1 transitions to 1 011 0 on
symbol 0 (instead of 1 01 10 since 1 011 0 and 1 01 10 are equivalent. The equiv-
alence of these states may not be obvious, since for pattern prefix 10, the 1 is
covered in one case and not in the other. However, the completion string of
10 is of the form 11*1. The occurrence of the first 1 of this string implies
the direct hit of the longer string 10110 with a coverage update of +2 for both
states, rendering all 1’s as being covered so that the subsequent updates to the
statistic must be the same.)
A MATLAB program was written to implement the algorithm to compute
coverage of spaced seeds and run on a Dell PC with an Intel Core i7 CPU 873
with 2.93 GHz and 8 GB RAM. The total computation time for spaced seed
S ¼ 1 * 11 * 1 was about 0.5 s, 0.1 s to set up the state space and 0.4 s to carry
out the computation.
Example 5. (Seed 111*1**1*1**11*111). For the spaced seed 111*1**1*1**11*111 used in the original version of the Patternhunter software (Ma et al., 2002), |Υ| = 3815, compared with the more than 320,000 strings that are obtained using prefixes of W_ext. For the coverage distribution, the total computation time was about 40 s: 33.8 s to set up the state space and 6.5 s to carry out the computation. It took more than a day to set up the state space using prefixes of W_ext. This reveals the advantages of the representation based on prefix strings with marked locations, while especially pointing to the great reduction afforded by determining equivalence relations of states of the AMC, and sequentially combining those that are equivalent.
4 Summary
In this chapter, equivalence relations of conditional transition probability
distributions are shown to lead to improved modeling capabilities for a
higher-order Markov model. The associated mapping of probability distribu-
tions and their corresponding m-tuple histories based on the equivalence rela-
tions of equal conditional probability distributions leads to the formation of an SMM that has a reduced number of parameters and more flexibility in the model-fitting process. Admittedly, there is a price to pay, as allowing the
Acknowledgments
This material is based upon work supported by the National Science Foundation under
Grant No. 1811933.
References
Aho, A.V., Corasick, M.J., 1975. Efficient string matching: an aid to bibliographic search.
Commun. ACM 18, 333–340.
Aston, J.A.D., Martin, D.E.K., 2007. Waiting time distributions of general runs and patterns in
hidden Markov models. Ann. Appl. Stat. 1 (2), 585–611.
Avery, P., 1987. The analysis of intron data and their use in the detection of short signals. J. Mol.
Evol. 26. 335–334.
Begleiter, R., El-Yaniv, R., Yona, G., 2004. On prediction using variable length Markov models.
J. Artif. Intell. 22, 385–421.
Belloni, A., Oliveira, R., 2017. Approximate group context tree. Ann. Stat. 45 (1), 355–385.
Ben-gal, I., Morag, G., Shmilovici, A., 2003. Context-based statistical process control. Techno-
metrics 45 (4), 293–311.
Bennett, I., Martin, D.E.K., Lahiri, S.N., 2022. Fitting sparse Markov models through a collapsed
Gibbs sampler. (Submitted for publication).
Benson, G., Mak, D.Y.F., 2008. Exact distribution of a spaced seed statistic for DNA homology
detection. In: International Symposium on String Processing and Information Retrieval,
Springer, Berlin, Heidelberg.
Bercovici, S., Rodriguez, J.M., Elmore, M., Batzoglou, S., 2012. Ancestry inference in complex
admixtures via variable-length Markov chain linkage models. In: Lecture Notes in Computer
Science. Research in Computational Molecular Biology. RECOMB, vol. 7262. Springer, Ber-
lin, Heidelberg, pp. 12–28.
Borges, J., Levene, M., 2007. Evaluating variable length Markov chain models for analysis of user
web navigation. IEEE Trans. Knowl. 19 (4), 441–452.
Bratko, A., Cormack, G., Filipič, B., Lynam, T., Zupan, B., 2006. Spam filtering using statistical data compression models. J. Mach. Learn. Res. 7, 2673–2698.
Brookner, E., 1966. Recurrent events in a Markov chain. Inf. Control. 9, 215–229.
Browning, S.R., 2006. Multilocus association mapping using variable-length Markov chains. Am.
J. Hum. Genet. 78, 903–913.
Bühlmann, P., Wyner, A.J., 1999. Variable length Markov chains. Ann. Stat. 27 (2), 480–513.
Chi, E.C., Lange, K., 2015. Splitting methods for convex clustering. J. Comput. Graph. Stat.
24 (4), 994–1013.
Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1 (2),
209–230. https://doi.org/10.1214/aos/1176342360.
Fernández, M., Garcı́a, J.E., González-López, V.A., 2018. A copula-based partition Markov pro-
cedure. Commun. Stat. Theory Methods 47 (14), 3408–3417.
Fu, J.C., Koutras, M.V., 1994. Distribution theory of runs: a Markov chain approach. J. Am. Stat.
Assoc. 89, 1050–1058.
Gabadinho, A., Ritschard, G., 2016. Analyzing state sequences with probabilistic suffix trees.
J. Stat. Softw. 72 (3), 1–39.
Gallo, S., Leonardi, F., 2015. Nonparametric statistical inference for the context tree of a station-
ary ergodic process. Electron. J. Stat. 9, 2076–2098.
Galves, A., Galves, C., Garcı́a, J.E., Garcia, N.L., Leonardi, F., 2012. Context tree selection and
linguistic rhythm retrieval from written texts. Ann. Appl. Stat. 6, 186–209.
Garcı́a, J.E., González-López, V.A., 2010. Minimal Markov models. arXiv:1002.0729.
Garcı́a, J.E., González-López, V.A., 2017. Consistent estimation of partition Markov models.
Entropy 19, 1050–1058.
Haslett, J., Raftery, A.E., 1989. Space-time modelling with long-memory dependence: assessing
Ireland’s wind power resource. J. R. Stat. Soc. C (Appl. Stat.) 38 (1), 1–50.
Jääskinen, V., Xiong, J., Koski, T., Corander, J., 2014. Sparse Markov chains for sequence data.
Scand. J. Stat. 41, 641–655.
Kharin, Y., 2020. Statistical analysis of big data based on parsimonious models of high-order
Markov chains. Austrian J. Stat. 49, 76–88.
Kharin, Y.S., Petlitskii, A.I., 2007. A Markov chain of order s with r partial connections and sta-
tistical inference on its parameters. Discret. Math. Appl. 19 (2), 109–130.
Koutras, M.V., Alexandrou, V.A., 1995. Runs, scans and urn models: a unified Markov chain
approach. Ann. Inst. Stat. Math. 47, 743–766.
Lladser, M.E., 2007. Minimal Markov chain embeddings of pattern problems. In: Proceedings of
the 2007 Information Theory and Applications Workshop, University of California, San
Diego.
Lladser, M., Betterton, M.D., Knight, R., 2008. Multiple pattern matching: a Markov chain
approach. J. Math. Biol. 56 (1-2), 51–92.
Ma, B., Tromp, J., Li, M., 2002. Patternhunter: faster and more sensitive homology search. Bio-
informatics 18 (3), 440–445.
Majumder, T., Lahiri, S.N., Martin, D.E.K., 2022. Fitting sparse Markov models to categorical
time series using regularization. http://arxiv.org/abs/2202.05485. (Submitted for publication).
Marschall, T., Rahmann, S., 2008. Probabilistic arithmetic automata and their application to pattern
matching statistics. In: Ferragina, P., Landau, G.M. (Eds.), Proceedings of the 19th Annual
Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science,
vol. 5029. Springer, Heidelberg, pp. 95–106.
Martin, D.E.K., 2019. Minimal auxiliary Markov chains through sequential elimination of states.
Commun. Stat. Simul. Comput. 48 (4), 1040–1054.
Martin, D.E.K., 2020. Distributions of pattern statistics in sparse Markov models. Ann. Inst. Stat.
Math. 72, 895–913. https://doi.org/10.1007/s10463-019-00714-6.
Martin, D.E.K., Coleman, D.A., 2011. Distributions of clump statistics for a collection of words.
J. Appl. Probab. 48, 1049–1059.
Martin, D.E.K., Noe, L., 2017. Faster exact probabilities for statistics of overlapping pattern
occurrences. Ann. Inst. Stat. Math. 69 (1), 231–248.
Noe, L., 2017. Best hits of 11110110111: model-free selection and parameter-free sensitivity cal-
culation of spaced seeds. Algorithms Mol. Biol. 12 (1). https://doi.org/10.1186/s13015-017-
0092-1.
Noe, L., Martin, D.E.K., 2014. A coverage criterion for spaced seeds and its applications to SVM
string-kernels and k-mer distances. J. Comput. Biol. 21 (12), 947–963.
Nuel, G., 2008. Pattern Markov chains: optimal Markov chain embedding through deterministic
finite automata. J. Appl. Probab. 45, 226–243.
Raftery, A., Tavare, S., 1994. Estimation and modelling repeated patterns in high order Markov
chains with the mixture transition distribution model. J. R. Stat. Soc. C (Appl. Stat.) 43 (1),
179–199.
Ribeca, P., Raineri, E., 2008. Faster exact Markovian probability functions for motif occurrences:
a DFA-only approach. Bioinformatics 24 (24), 2839–2848.
Rissanen, J., 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 656–664.
Rissanen, J., 1986. Complexity of strings in the class of Markov sources. IEEE Trans. Inf. Theory
32 (4), 526–532.
Ron, D., Singer, Y., Tishby, N., 1996. The power of amnesia: learning probabilistic automata with
variable memory length. Mach. Learn. 25 (2-3), 117–149.
Roos, T., Yu, B., 2009. Sparse Markov source estimation via transformed Lasso. In: Proceedings
of the IEEE Information Theory Workshop (ITW-2009), Taormina, Sicily, Italy, pp. 241–245.
Shmilovici, A., Ben-gal, I., 2007. Using a VOM model for reconstructing potential coding regions
in EST sequences. Comput. Stat. 22, 49–69.
Weinberger, M., Lempel, A., Ziv, J., 1992. A sequential algorithm for the universal coding of
finite memory sources. IEEE Trans. Inf. Theory IT-38, 1002–1024.
Weinberger, M., Rissanen, J., Feder, M., 1995. A universal finite memory source. IEEE Trans.
Inf. Theory 41 (3), 643–652.
Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J., 1995. The context-tree weighting method: basic
properties. IEEE Trans. Inf. Theory 41 (3), 653–664.
Xiong, J., Jääskinen, V., Corander, J., 2016. Recursive learning for sparse Markov models. Bayes-
ian Anal. 11 (1), 247–263.
Yin, J., Wang, J., 2016. A model-based approach for text clustering with outlier detection. In:
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 625–636.
Zhang, J., Ghahramani, Z., Yang, Y., 2005. A probabilistic model for online document clustering
with application to novelty detection. In: Saul, L.K., Weiss, Y., Bottou, L. (Eds.), Advances
in Neural Information Processing Systems 17. MIT Press, pp. 1617–1624.
Section II
Information geometry
Chapter 4
Symplectic theory of heat and information geometry
Abstract
We present in this chapter a new formulation of heat theory and Information Geometry through symplectic and Poisson structures based on Jean-Marie Souriau's symplectic model of statistical mechanics, called “Lie Groups Thermodynamics.” Souriau's model was initially described in chapter IV, “Statistical Mechanics,” of his book “Structure of dynamical systems,” published in 1969. This model gives an archetypal, and purely geometric, characterization of Entropy, which appears as an invariant Casimir function in coadjoint representation, from which we deduce a geometric heat equation as an Euler-Poincaré equation. The approach also allows generalizing the Fisher metric of information geometry thanks to the KKS (Kirillov, Kostant, Souriau) 2-form in the affine case via Souriau's cocycle. In this model, Souriau's moment map and the coadjoint orbits play a central role. Ontologically, this model provides joint geometric structures for Statistical Mechanics, Information Geometry and Probability. Entropy acquires a geometric foundation as a function parameterized by the mean of the moment map in the dual Lie algebra, and in terms of foliations. Souriau established the generalized Gibbs laws when the manifold has a symplectic form and a connected Lie group G operates on this manifold by symplectomorphisms. Souriau Entropy is invariant under the action of the group acting on the homogeneous symplectic manifold. As noted by Souriau, these equations are universal and could also be of great interest in Mathematics. The dual space of the Lie algebra foliates into coadjoint orbits that are also the level sets of the entropy, which can be interpreted in the framework of Thermodynamics by the fact that motion remaining on these surfaces is non-dissipative, whereas motion transversal to these surfaces is dissipative. We will also explain the second principle of thermodynamics by the definite positiveness of the Souriau tensor extending the Koszul-Fisher metric from Information Geometry. Entropy as a Casimir function is characterized by Koszul Poisson Cohomology. We finally introduce the Gaussian distribution on the space of Symmetric Positive Definite (SPD) matrices through Souriau's covariant Gibbs density, by considering this space as the pure imaginary axis of the homogeneous Siegel upper half space where Sp(2n,R)/U(n) acts transitively. The Gauss density of SPD matrices is computed through Souriau's moment map and coadjoint orbits. We illustrate the model first for the Poincaré unit disk, then the Siegel unit disk, and finally the upper half space. For this example, we deduce the Gauss density for SPD matrices.
Keywords: Symplectic geometry, Casimir function, Entropy, Koszul-Fisher metric, Heat equation, Poisson cohomology, Symmetric Positive Definite matrices, Lie groups thermodynamics, Maximum entropy, Exponential density family
1 Preamble
“There is nothing more in physical theories than symmetry groups, except the
mathematical construction which allows precisely to show that there is nothing
more”
Jean-Marie Souriau
In 2019, the 50th anniversary of the founding work of Geometric Mechanics (the introduction of Symplectic Geometry into Mechanics), the book by Jean-Marie Souriau “Structure of dynamical systems,” was celebrated at the University of Paris and at the Henri Poincaré Institute (http://souriau2019.fr/), with in particular the guest talk by Jean-Pierre Bourguignon (former director of IHES and President of the European Research Council, ERC), “Jean-Marie Souriau and Symplectic Geometry” (https://ideasinscience.org/fr/video/154).
Jean-Marie Souriau developed the symplectic aspects of classical mechan-
ics and quantum mechanics. Among his most important works, let us quote
the moment(um) map (geometrization of Noether’s theorem), the coadjoint
action of a group on its moment space, geometric quantization, the classifica-
tion of homogeneous symplectic leaves, or diffeological spaces.
In Jean-Marie Souriau’s book “Structure of dynamical systems,” chapter
IV concerning Statistical Mechanics has been little read and cited. There he
developed a symplectic model of statistical physics, which he called “Lie
Groups Thermodynamics.” This geometric theory of statistical mechanics is
based on the study of the Hamiltonian action of a Lie group on the symplectic
manifold induced by the coadjoint orbit of the group. Souriau deals there with
the affine case of the coadjoint operator involving the “Souriau symplectic
cocycle” reflecting the lack of cohomology. In a natural way, generalized
Gibbs states are defined, indexed by parameters of the Lie algebra (a geometric generalization of the Planck temperature), which are covariant under the action
of the group. These Gibbs states are a natural generalization of those con-
structed with the Hamiltonian of a vector field, since this Hamiltonian can
be considered as a moment of the local Hamiltonian action of the group of
time translations. If a dynamical system is invariant under a Lie subgroup of the group (for example, of the Galileo group in physics), the natural equilibria of this system are the Gibbs states constructed with a moment of the Hamiltonian action of this subgroup. Souriau introduces, in the non-zero cohomology
case, a Riemannian metric, which generalizes the Fisher-Koszul metric of
by Jean-Louis Koszul (Barbaresco, 2021d; Koszul and Zou, 2019), there are
bridges between the mathematical structures of the different disciplines:
• Symplectic Geometry and Information Geometry
• Poisson Geometry and Information Geometry
• Integrable Systems and Information Geometry
The novelty lies in the cross exploration of symplectic and Poisson structures
in Information Geometry, but also the enrichment of symplectic and Poisson
geometries by the particular structures of information geometry: affinely flat
manifold, Hessian metric, Fisher metric, dual potential functions and dual
coordinate systems, affine representation theory, Lie algebra cohomology,
affine Poisson structure, etc.
The tools developed will jointly enrich:
• Statistical physics: symplectic model of statistical physics
• Artificial Intelligence: statistical tools on Lie groups with new symplectic integrators for deep machine learning on homogeneous manifolds
• Quantum Physics: generalization of the geometry of information to the theory of quantum information
• Mathematics: enrichment of symplectic geometry, Poisson geometry and the theory of integrable systems through the contribution of structures associated with the geometry of information and affinely flat manifolds.
temperature. The invariance with respect to the group, and the fact that the
entropy S is a convex function of β, imposes very strict, universal condi-
tions—i.e., independent of the system considered. For a large class of systems,
for example, there necessarily exists a critical temperature beyond which no
equilibrium can exist. In the cases where an equilibrium exists, it generally
consists of a rigid rotation about the barycenter, etc. These purely theoretical
results are evidently confirmed by numerous astronomical examples: the
Earth and the stars rotating about themselves; dissipative evolution imposes
a solid rotation on the central regions of the galaxies, which itself can lead to
a gravitational instability of the “quasar” type; the Clapeyron relations
extend to the geometrical-dynamical quantities, etc. One can, if one wishes,
interpret β as a space-time vector (the temperature vector of Planck), giving
to the metric tensor g a null Lie derivative. This suggests describing the dis-
sipative processes by a temperature vector β which is no longer compelled
by this condition; the corresponding Lie derivative of g, the “friction
tensor,” becomes the source of the dissipation. One obtains in this way a phe-
nomenological model of continuous media which presents some interesting
properties: the temperature vector and entropy flux are in duality; the positive
entropy production is a consequence of Einstein’s equations; the Onsager
reciprocity relations are generalized; in the case of a fluid in the
non-relativistic approximation, the model unifies heat conduction and viscos-
ity (equations of Fourier and Navier).”
The name Souriau means “little mouse” in “the Perche.” In the “Vendo-
mois,” the Souriau from 1490 to 1819 were all “Master Plowmen” or
“Master Millers.” Jean-Marie Souriau was born June 3, 1922 in Paris in the
6th arrondissement. Jean-Marie Souriau comes from a family of Philosopher
all graduated from ENS Paris. His father, Michel Souriau joined the ENS
Paris in 1910 and obtained the aggregation in philosophy in 1914 and wrote
in 1938 an article on “Introduction to mathematical symbolism” before being
mobilized as a battalion commander. His uncle, Etienne Souriau, is a French
philosopher, specialized in aesthetics, entered the ENS Paris in 1912, received
first in the aggregation in philosophy in 1920. In 1958, Etienne Souriau was
elected member of the Academy of moral sciences and policies by a commit-
tee in which Charles de Gaulle appears, and will be the director of the thesis
Rohmer. Etienne Souriau published a book on “Struc-
of the filmmaker Eric
ture of the Work of Art (Structure de l’Oeuvre d’Art).” His grandfather, Paul
Souriau, was a French philosopher known for his work on the theory of invention and aesthetics; he entered the ENS Paris in 1873 and passed the agrégation in philosophy in 1876. It can be noted that Paul Souriau composed a thesis titled “Theory of Invention” (Théorie de l'invention), published in 1881, and also a Latin thesis titled “De motus perceptione,” which aimed to determine the importance of vision for the perception of movements (the initial thesis title was “De visione motus,” and it was a precursor to his future work on the perception of movement). We can assume that Jean-Marie Souriau read his
grandfather’s thesis and was influenced by it for his own work. In 1889, Paul
Souriau published a book on “The Aesthetics of Movement” (L'Esthétique du mouvement), which describes two levels of aesthetics for movement: mechanical beauty (the adaptation of movement to fulfill its purpose) and meaning of movement (the meaning that the movement communicates to an outside observer). In doing so, Paul Souriau distinguished movement from the perception of movement, concepts that would later become the subject of motor cognition and psychophysics. It is interesting to note that Etienne Souriau studied the structures of aesthetics, Paul Souriau developed the aesthetics of movement, and Jean-Marie Souriau founded the structures of movement. This triptych remains an important element of French philosophy at the hinge of the 1900 spirit (Esprit 1900).
Jean-Marie Souriau did his secondary studies from 1932 to 1942 in Nancy, Nîmes, Grenoble and Versailles. He married Christianne Hoebrechts, who died prematurely in 1985 and with whom he had five children: Isabelle, Catherine, Yann, Jérôme and Magali. He entered the ENS Paris in 1942, passing the entrance examination twice, first in the unoccupied zone in Lyon and a second time in Paris. Also admitted to the Ecole Polytechnique, he resigned in order to join the ENS Paris (Fig. 1). During his studies at the ENS, he took courses at the Sorbonne from the physicist Yves Rocard and the mathematician Elie Cartan. He volunteered for “La France Libre” in 1944. On his return in 1946, he passed the mathematics agrégation, and the same year joined a laboratory working on
FIG. 1 Jean-Marie Souriau, a student at the Ecole Normale Superieure in Paris in 1942, with
Jacques Dixmier and Rene Deheuvels among others (© ENS Paris).
FIG. 2 Cover page of the thesis of Jean-Marie Souriau “On the stability of planes” defended on
June 20, 1952, as well as the bibliographic page which refers to the work of Yves Rocard (per-
sonal picture of Souriau PhD manuscript archived at ONERA).
(ESTA, Paris) under the general title “New Methods of Mathematical Physics.”
From 1951 to 1952, he created and ran the Mechanics course in the third year
of the Ecole Normale Superieure de l’Enseignement Technique (ENSET,
Paris). From 1952, he also taught at university level in the following disciplines: Mathematics, Mechanics, Relativity, Mathematical Methods of Physics and Computer Science.
After his thesis in 1954, he joined the “Institut des Hautes Etudes,” rue de
Rome in Tunis, and moved with his wife to Carthage. It is during this period that he reread and deepened the work of Lagrange in Analytical Mechanics and discovered the symplectic structures that he would formalize in his book “Structure of dynamical systems.” It is by thinking of his discussions with ONERA engineers that he invented his masterpiece, the “moment map” (application moment). We can read in the interview by Patrick Iglesias
(Iglesias, 1995) “It was with the memory of discussions with engineers who
asked themselves the following question: what is essential in mechanics.
I remember very well an engineer who asked me: is mechanics simply the
principle of conservation of energy? This is fine for a one-parameter system,
but once there are two, it is not enough. I had learned of course the
Lagrange equations and all the analytical principles of mechanics, but it
was all a cookbook; we did not see any real principles.” He remained in
Tunis from 1952 to 1958, as Lecturer, then as Full Professor at the Institut
des Hautes Etudes. In 1953, he participated in Strasbourg in the conference
on Differential Geometry (Fig. 3).
In 1958, he became Professor at the University of Aix-Marseille. He remained in Marseille throughout his career, and from 1978 to 1985 was Director of the Center for Theoretical Physics of Marseille (a CNRS laboratory), in charge of the teams in Theoretical Mechanics, Geometry and Quantization, Astronomy and Cosmology. He was also professor of Mathematics at the University of Provence (Aix-Marseille I) and ended his career as Professor of exceptional class, second echelon. He was a member of the "Société Mathématique de France" and of the French Society of Specialists in Astronomy. For 5 years, he also taught the third interuniversity cycle of Pure Mathematics in Marseille and the third interuniversity cycle of Theoretical Physics in Marseille-Nice. He was a member of the Editorial Board of the Journal of Geometry and Physics in Florence. He organized two International Colloquiums of the CNRS, in 1968 and 1981, and "Days" of the Société Mathématique de France. Honored with the Palmes Académiques and the National Order of Merit, he obtained the Prize on the subject "Vibrations" put up for competition by the Association for Aeronautical Research in 1952, the Prize on the subject "Cosmology" put up for competition by the Fondation Louis Jacot in 1978, the Grand Prix Jaffé of the French Academy of Sciences in 1981, and the Great Scientist prize of the City of Paris in 1986. Jean-Marie Souriau died in 2012, in his 90th year.
FIG. 3 Jean-Marie Souriau at the Conference on "Differential Geometry" in Strasbourg in 1953. In the same picture: Jean-Louis Koszul, André Weil, Shiing-Shen Chern, Georges de Rham, Charles Ehresmann, Lucien Godeaux, Heinz Hopf, André Lichnerowicz (the director of the thesis of Jean-Marie Souriau), Bernard Malgrange, John Milnor, Georges Reeb, Laurent Schwartz, René Thom, Paulette Libermann. (© Dernières Nouvelles d'Alsace and "Differential Geometry, Strasbourg, 1953" by Michèle Audin, Notices of the AMS, Volume 55, Number 3, March 2008.)
• Moment map: $J$
• Souriau 1-cocycle: $\theta(g)$
• Souriau 2-cocycle: $\tilde\Theta(X,Y) = J_{[X,Y]} - \{J_X, J_Y\}$

where

$$\tilde\Theta:\ \mathfrak{g}\times\mathfrak{g}\to\mathbb{R},\qquad (X,Y)\ \mapsto\ \tilde\Theta(X,Y) = \langle\Theta(X),Y\rangle \quad\text{with}\quad \Theta(X) = T_e\theta(X(e))$$
and the dual potential, the Shannon entropy, is given by the Legendre transform:

$$S(\eta) = \langle\theta,\eta\rangle - \Phi(\theta) \quad\text{with}\quad \eta_i = \frac{\partial\Phi(\theta)}{\partial\theta_i}\ \ \text{and}\ \ \theta_i = \frac{\partial S(\eta)}{\partial\eta_i} \tag{2}$$
We can observe that $\Phi(\theta) = -\log\int_{\mathbb{R}} e^{-\langle\theta,y\rangle}\,dy = -\log\psi(\theta)$ is related to the classical cumulant generating function. Koszul and Vinberg have introduced a generalization through an affinely invariant Hessian metric on a sharp convex cone by means of its characteristic function (Satake, 1980):

$$\Phi_\Omega(\theta) = -\log\int_{\Omega^*} e^{-\langle\theta,y\rangle}\,dy = -\log\psi_\Omega(\theta),\quad \theta\in\Omega\ \text{a sharp convex cone},\qquad \psi_\Omega(\theta) = \int_{\Omega^*} e^{-\langle\theta,y\rangle}\,dy\ \ \text{the Koszul--Vinberg characteristic function} \tag{3}$$

The characteristic function has the property that $\psi_\Omega(gx) = |\det(g)|^{-1}\,\psi_\Omega(x)$ for all $x\in\Omega$, $g\in G(\Omega)$, and then $\psi_\Omega(x)\,dx$ is an invariant measure on $\Omega$. Writing $\partial_a = \sum_{i=1}^m a_i\,\frac{\partial}{\partial x_i}$, one can write:

$$\partial_a\Phi_\Omega(x) = -\partial_a\left(\log\psi_\Omega(x)\right) = \psi_\Omega(x)^{-1}\int_{\Omega^*}\langle a,y\rangle\, e^{-\langle x,y\rangle}\,dy = \langle a, x^*\rangle,\qquad a\in U,\ x\in\Omega \tag{6}$$
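To make the Koszul–Vinberg construction concrete, the following minimal Python sketch (ours, not from the chapter) evaluates $\psi_\Omega$ and the Hessian metric for the simplest sharp convex cone $\Omega = ]0,\infty[$, which is self-dual; here $\psi_\Omega(\theta) = 1/\theta$, so both quantities can be checked against their closed forms.

```python
# Minimal numerical sketch (assumption: the cone Omega = ]0, oo[, self-dual),
# where psi(theta) = int_0^oo exp(-theta*y) dy = 1/theta and the Hessian of
# log psi yields the metric g(theta) = 1/theta^2.
import numpy as np
from scipy.integrate import quad

def psi(theta):
    # Koszul-Vinberg characteristic function of the cone ]0, oo[
    val, _ = quad(lambda y: np.exp(-theta * y), 0.0, np.inf)
    return val

theta, h = 1.7, 1e-4
log_psi = lambda t: np.log(psi(t))
hess = (log_psi(theta + h) - 2*log_psi(theta) + log_psi(theta - h)) / h**2
print(psi(theta), 1/theta)       # ~0.5882 in both cases
print(hess, 1/theta**2)          # ~0.3460 in both cases
```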
with
$$\tilde\Theta_\beta(Z_1,Z_2) = \tilde\Theta(Z_1,Z_2) + \left\langle Q,\ \mathrm{ad}_{Z_1}(Z_2)\right\rangle \quad\text{where}\quad \mathrm{ad}_{Z_1}(Z_2) = [Z_1,Z_2] \tag{10}$$
Souriau has proved that every coadjoint orbit of a Lie group, $\mathcal{O}_F = \{\mathrm{Ad}_g^*F,\ g\in G\}$, a subset of $\mathfrak{g}^*$ with $F\in\mathfrak{g}^*$, carries a natural homogeneous symplectic structure given by a closed $G$-invariant 2-form. If we define $K = \mathrm{Ad}_g^* = (\mathrm{Ad}_{g^{-1}})^*$ and

$$K_*(X) = -(\mathrm{ad}_X)^* \quad\text{with}\quad \left\langle \mathrm{Ad}_g^*F,\ Y\right\rangle = \left\langle F,\ \mathrm{Ad}_{g^{-1}}Y\right\rangle,\quad \forall g\in G,\ Y\in\mathfrak{g},\ F\in\mathfrak{g}^* \tag{11}$$

where, if $X\in\mathfrak{g}$, $\mathrm{Ad}_g(X) = gXg^{-1}\in\mathfrak{g}$, the $G$-invariant 2-form is given by the following expression:

$$\sigma_\Omega(\mathrm{ad}_XF,\ \mathrm{ad}_YF) = B_F(X,Y) = \langle F, [X,Y]\rangle,\qquad X,Y\in\mathfrak{g} \tag{12}$$
$$\tilde\Theta(X,Y):\ \mathfrak{g}\times\mathfrak{g}\to\mathbb{R},\qquad (X,Y)\mapsto\langle\Theta(X),Y\rangle \quad\text{with}\quad \Theta(X) = T_e\theta(X(e)) \tag{16}$$
Souriau has then defined a Gibbs density that is covariant under the action of the group:

$$p_{\mathrm{Gibbs}}(\xi) = e^{\Phi(\beta)-\langle U(\xi),\beta\rangle} = \frac{e^{-\langle U(\xi),\beta\rangle}}{\int_M e^{-\langle U(\xi),\beta\rangle}\,d\lambda_\omega}\quad\text{with}\quad \Phi(\beta) = -\log\int_M e^{-\langle U(\xi),\beta\rangle}\,d\lambda_\omega \tag{17}$$

$$Q = \frac{\partial\Phi(\beta)}{\partial\beta} = \frac{\int_M U(\xi)\,e^{-\langle U(\xi),\beta\rangle}\,d\lambda_\omega}{\int_M e^{-\langle U(\xi),\beta\rangle}\,d\lambda_\omega} = \int_M U(\xi)\,p(\xi)\,d\lambda_\omega \tag{18}$$
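As a sanity check of (17)–(18), the following toy Python sketch (ours) treats the commutative scalar reduction, where $M = \mathbb{R}$, $U(\xi) = \xi^2$ and $\beta$ is an ordinary inverse temperature; the "geometric heat" $Q = \partial\Phi/\partial\beta$ then coincides with the Gibbs mean of $U$.

```python
# Toy scalar reduction of (17)-(18) (our illustration, not Souriau's model):
# M = R, U(xi) = xi**2, d(lambda_omega) = dxi.
import numpy as np
from scipy.integrate import quad

U = lambda xi: xi**2

def Phi(beta):
    z, _ = quad(lambda xi: np.exp(-beta * U(xi)), -np.inf, np.inf)
    return -np.log(z)

beta, h = 2.0, 1e-5
Q_from_Phi = (Phi(beta + h) - Phi(beta - h)) / (2 * h)   # dPhi/dbeta
num, _ = quad(lambda xi: U(xi) * np.exp(-beta * U(xi)), -np.inf, np.inf)
den, _ = quad(lambda xi: np.exp(-beta * U(xi)), -np.inf, np.inf)
print(Q_from_Phi, num / den)   # both ~ 1/(2*beta) = 0.25
```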
Stratonovich differential equation for the stochastic process, given by the following relation by means of Souriau's symplectic cocycle:

$$dQ + \left(\mathrm{ad}^*_{\frac{\partial H}{\partial Q}}Q + \Theta\!\left(\frac{\partial H}{\partial Q}\right)\right)dt + \sum_{i=1}^N\left(\mathrm{ad}^*_{\frac{\partial H_i}{\partial Q}}Q + \Theta\!\left(\frac{\partial H_i}{\partial Q}\right)\right)\circ dW_i(t) = 0 \tag{21}$$
The geometric heat $Q$ is an element of the dual space $\mathfrak{g}^*$ of the Lie algebra of the group. Souriau has proposed a Riemannian metric that we have identified as a generalization of the Fisher metric:

$$I(\beta) = -g_\beta \quad\text{with}\quad g_\beta\left([\beta,Z_1],[\beta,Z_2]\right) = \tilde\Theta_\beta\left(Z_1,[\beta,Z_2]\right) \tag{22}$$

$$\text{with}\quad \tilde\Theta_\beta(Z_1,Z_2) = \tilde\Theta(Z_1,Z_2) + \left\langle Q,\ \mathrm{ad}_{Z_1}(Z_2)\right\rangle \quad\text{where}\quad \mathrm{ad}_{Z_1}(Z_2) = [Z_1,Z_2] \tag{23}$$

Souriau's fundamental theorem is that "Every symplectic manifold on which a Lie group acts transitively by a Hamiltonian action is a covering space of a coadjoint orbit." We can observe that, for the Souriau model, the Fisher metric is an extension of this 2-form in the non-equivariant case:

$$g_\beta\left([\beta,Z_1],[\beta,Z_2]\right) = \tilde\Theta\left(Z_1,[\beta,Z_2]\right) + \left\langle Q,\left[Z_1,[\beta,Z_2]\right]\right\rangle.$$

The Souriau additional term $\tilde\Theta(Z_1,[\beta,Z_2])$ is generated by non-equivariance through the symplectic cocycle. The tensor $\tilde\Theta$ used to define this extended Fisher metric is defined by the moment map $J(x)$, a map from $M$ (a homogeneous symplectic manifold) to the dual space of the Lie algebra $\mathfrak{g}^*$, given by:

$$\tilde\Theta(X,Y) = J_{[X,Y]} - \{J_X, J_Y\} \tag{24}$$
$-\frac{\partial^2\Phi}{\partial\beta^2}$, as in classical information geometry. We will establish the equality of two terms, between the Souriau definition based on the Lie group cocycle $\Theta$ and parameterized by the "geometric heat" $Q$ (element of the dual space of the Lie algebra) and the "geometric temperature" $\beta$ (element of the Lie algebra), and the Hessian of the characteristic function $\Phi(\beta) = -\log\psi_\Omega(\beta)$ with respect to the variable $\beta$:

$$g_\beta\left([\beta,Z_1],[\beta,Z_2]\right) = \left\langle\Theta(Z_1),[\beta,Z_2]\right\rangle + \left\langle Q,\left[Z_1,[\beta,Z_2]\right]\right\rangle = \frac{\partial^2\log\psi_\Omega}{\partial\beta^2} \tag{32}$$
If one assumes that $U(g\cdot\xi) = \mathrm{Ad}_g^*U(\xi) + \theta(g)$, $g\in G$, $\xi\in M$, which means that the energy $U: M\to\mathfrak{g}^*$ satisfies the same equivariance condition as the moment map $\mu: M\to\mathfrak{g}^*$, then one has for $g\in G$ and $\beta\in\Omega$:

$$\psi_\Omega(\mathrm{Ad}_g\beta) = \int_M e^{-\langle U(\xi),\,\mathrm{Ad}_g\beta\rangle}\,d\lambda(\xi) = \int_M e^{-\langle \mathrm{Ad}^*_{g^{-1}}U(\xi),\,\beta\rangle}\,d\lambda(\xi)$$
$$\psi_\Omega(\mathrm{Ad}_g\beta) = \int_M e^{-\langle U(g^{-1}\xi)-\theta(g^{-1}),\,\beta\rangle}\,d\lambda(\xi) = e^{\langle\theta(g^{-1}),\,\beta\rangle}\int_M e^{-\langle U(g^{-1}\xi),\,\beta\rangle}\,d\lambda(\xi)$$
$$\psi_\Omega(\mathrm{Ad}_g\beta) = e^{\langle\theta(g^{-1}),\,\beta\rangle}\,\psi_\Omega(\beta)$$
$$\Phi(\mathrm{Ad}_g\beta) = -\log\psi_\Omega(\mathrm{Ad}_g\beta) = \Phi(\beta) - \left\langle\theta(g^{-1}),\,\beta\right\rangle \tag{33}$$
$$\left.\frac{d}{dt}\right|_{t=0}\left\langle Q\!\left(\beta + t[Z_1,\beta] + o(t^2)\right),\ Y\right\rangle = \left\langle\frac{\partial Q}{\partial\beta}\left([Z_1,\beta]\right),\ Y\right\rangle \tag{40}$$

and the right-hand side of (22) is calculated as:

$$\left.\frac{d}{dt}\right|_{t=0}\left[\left\langle Q(\beta),\ Y - t[Z_1,Y] + o(t^2)\right\rangle - \left\langle\theta\!\left(I + tZ_1 + o(t^2)\right),\ Y\right\rangle\right] = -\left\langle Q(\beta),[Z_1,Y]\right\rangle + \left\langle d\theta(Z_1),\ Y\right\rangle \tag{41}$$

Therefore,

$$\left\langle\frac{\partial Q}{\partial\beta}\left([Z_1,\beta]\right),\ Y\right\rangle = \left\langle d\theta(Z_1),\ Y\right\rangle - \left\langle Q(\beta),[Z_1,Y]\right\rangle \tag{42}$$

$$I(\beta) = -\frac{\partial^2\Phi(\beta)}{\partial\beta^2} = -\frac{\partial Q}{\partial\beta} \tag{46}$$
We can remark that this is a new analogy between the "Fisher metric" and the "thermodynamic capacity." We can observe that Pierre Duhem, in the previous century, founded thermodynamics on capacities and their extension in a new theory called "Énergétique."
non-null cohomology verified by the entropy, $\mathrm{ad}^*_{\frac{\partial S}{\partial Q}}Q + \Theta\!\left(\frac{\partial S}{\partial Q}\right) = 0$, and then the Lie–Poisson equations

$$\frac{\partial Q}{\partial t} = \mathrm{ad}^*_{\frac{\partial H}{\partial Q}}Q + \Theta\!\left(\frac{\partial H}{\partial Q}\right) \quad\text{and}\quad \frac{\partial F(Q)}{\partial t} = \{F(Q), H\}_{\tilde\Theta} \tag{51}$$

that we can rewrite:

$$\frac{\partial Q}{\partial t} = \frac{\partial Q}{\partial\beta}\,\frac{\partial\beta}{\partial t} = \mathrm{ad}^*_{\frac{\partial H}{\partial Q}}Q + \Theta\!\left(\frac{\partial H}{\partial Q}\right) \tag{52}$$

where $\frac{\partial Q}{\partial\beta}$, the geometric heat capacity, is given by $g_\beta(X,Y) = \left\langle\frac{\partial Q}{\partial\beta}(X),\ Y\right\rangle$ for $X,Y\in\mathfrak{g}$, with $g_\beta(X,Y) = \tilde\Theta_\beta(X,Y) = \langle Q(\beta),[X,Y]\rangle + \tilde\Theta(X,Y)$ related to the Souriau-Fisher tensor.
Heat Eq. (52) is the PDE for the (calorific) energy density, where the nature of the material is characterized by the geometric heat capacity. Numerical integration of Lie–Poisson systems while preserving coadjoint orbits is described in Engo and Faltinsen (2002).
In the Euclidean case with homogeneous material, we recover the classical equation (Balian, 2003):

$$\frac{\partial\rho_E}{\partial t} = \mathrm{div}\left(\frac{\lambda}{C}\,\nabla\rho_E\right) \quad\text{with}\quad \frac{\partial\rho_E}{\partial t} = C\,\frac{\partial T}{\partial t} \tag{53}$$
The link with the second principle of thermodynamics will be deduced from the positivity of the Souriau-Fisher metric:

$$S(Q) = \langle Q,\beta\rangle - \Phi(\beta) \quad\text{with}\quad \frac{dQ}{dt} = \mathrm{ad}^*_{\frac{\partial H}{\partial Q}}Q + \Theta\!\left(\frac{\partial H}{\partial Q}\right)$$

$$\frac{dS}{dt} = \left\langle Q,\frac{d\beta}{dt}\right\rangle + \left\langle \mathrm{ad}^*_{\frac{\partial H}{\partial Q}}Q + \Theta\!\left(\frac{\partial H}{\partial Q}\right),\ \beta\right\rangle - \frac{d\Phi}{dt} = \left\langle Q,\frac{d\beta}{dt}\right\rangle + \left\langle \mathrm{ad}^*_{\frac{\partial H}{\partial Q}}Q,\ \beta\right\rangle + \left\langle\Theta\!\left(\frac{\partial H}{\partial Q}\right),\beta\right\rangle - \frac{d\Phi}{dt} = \left\langle Q,\frac{d\beta}{dt}\right\rangle + \left\langle Q,\left[\frac{\partial H}{\partial Q},\beta\right]\right\rangle + \tilde\Theta\!\left(\frac{\partial H}{\partial Q},\beta\right) - \frac{d\Phi}{dt} \tag{54}$$

$$\frac{dS}{dt} = \left\langle Q,\frac{d\beta}{dt}\right\rangle + \tilde\Theta_\beta\!\left(\frac{\partial H}{\partial Q},\beta\right) - \left\langle\frac{\partial\Phi}{\partial\beta},\frac{d\beta}{dt}\right\rangle \quad\text{with}\quad \frac{\partial\Phi}{\partial\beta} = Q$$

$$\frac{dS}{dt} = \tilde\Theta_\beta\!\left(\frac{\partial H}{\partial Q},\beta\right) \ge 0,\quad \forall H\quad\text{(link to positivity of the Fisher metric)}$$

$$\text{if}\ H = S \ \Rightarrow\ \frac{dS}{dt} = \tilde\Theta_\beta(\beta,\beta) = 0 \quad\text{because}\ \frac{\partial S}{\partial Q} = \beta\ \text{and}\ \beta\in\mathrm{Ker}\,\tilde\Theta_\beta$$

This equation was observed by Souriau in his paper of 1974, where he has written that the geometric temperature $\beta$ is a kernel of $\tilde\Theta_\beta$, that is:

$$\beta\in\mathrm{Ker}\,\tilde\Theta_\beta \ \Rightarrow\ \langle Q,[\beta,Z]\rangle + \tilde\Theta(\beta,Z) = 0 \tag{56}$$
$$SU(1,1) = \left\{\begin{pmatrix} a & b\\ b^* & a^*\end{pmatrix} = \begin{pmatrix} 1 & ba^{*-1}\\ 0 & 1\end{pmatrix}\begin{pmatrix} a^{*-1} & 0\\ 0 & a^*\end{pmatrix}\begin{pmatrix} 1 & 0\\ a^{*-1}b^* & 1\end{pmatrix}\ \middle|\ a,b\in\mathbb{C},\ |a|^2 - |b|^2 = 1\right\} \tag{59}$$

and its Lie algebra given by elements:

$$\mathfrak{su}(1,1) = \left\{\begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix}\ \middle|\ r\in\mathbb{R},\ \eta\in\mathbb{C}\right\} \tag{60}$$

A basis for this Lie algebra $\mathfrak{su}(1,1)$ is $(u_1,u_2,u_3)\in\mathfrak{g}$ with:

$$u_1 = \frac{i}{2}\begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix},\qquad u_2 = \frac12\begin{pmatrix}0 & 1\\ 1 & 0\end{pmatrix}\quad\text{and}\quad u_3 = \frac12\begin{pmatrix}0 & i\\ -i & 0\end{pmatrix} \tag{61}$$

with $[u_1,u_3] = -u_2$, $[u_1,u_2] = u_3$, $[u_2,u_3] = -u_1$. The Harish-Chandra embedding is given by $\varphi(g\cdot x_0) = \zeta = ba^{*-1}$. From $|a|^2 - |b|^2 = 1$, one has $|\zeta| < 1$. Conversely, for any $|\zeta| < 1$, taking any $a\in\mathbb{C}$ such that $|a| = (1-|\zeta|^2)^{-1/2}$ and putting $b = \zeta a^*$, one obtains $g\in G$ for which $\varphi(g\cdot x_0) = \zeta$. The domain $D = \varphi(M)$ is the unit disc $D = \{\zeta\in\mathbb{C} : |\zeta| < 1\}$.
The compact subgroup is generated by $u_1$, while $u_2$ and $u_3$ generate a hyperbolic subgroup. The dual space of the Lie algebra is given by:

$$\mathfrak{su}(1,1)^* = \left\{\begin{pmatrix} z & x+iy\\ -x+iy & -z\end{pmatrix}\ \middle|\ x,y,z\in\mathbb{R}\right\} \tag{62}$$

with the basis $(u_1^*,u_2^*,u_3^*)\in\mathfrak{g}^*$:

$$u_1^* = \begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix},\qquad u_2^* = \begin{pmatrix}0 & i\\ i & 0\end{pmatrix}\quad\text{and}\quad u_3^* = \begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix} \tag{63}$$
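The basis (61) and the commutation relations as reconstructed above can be verified mechanically; the following short numpy check (ours) does so, with the sign conventions adopted here.

```python
# Quick numpy verification (ours) of the su(1,1) basis (61) and the
# commutation relations [u1,u3] = -u2, [u1,u2] = u3, [u2,u3] = -u1.
import numpy as np

u1 = 0.5j * np.array([[1, 0], [0, -1]])
u2 = 0.5  * np.array([[0, 1], [1, 0]], dtype=complex)
u3 = 0.5  * np.array([[0, 1j], [-1j, 0]])

comm = lambda X, Y: X @ Y - Y @ X
print(np.allclose(comm(u1, u3), -u2))   # True
print(np.allclose(comm(u1, u2),  u3))   # True
print(np.allclose(comm(u2, u3), -u1))   # True
```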
Let $D = \{z\in\mathbb{C} : |z| < 1\}$ be the open unit disk of Poincaré. For each $\rho > 0$, the pair $(D,\omega_\rho)$ is a symplectic homogeneous manifold with $\omega_\rho = 2i\rho\,\frac{dz\wedge dz^*}{(1-|z|^2)^2}$, where $\omega_\rho$ is invariant under the action:

$$SU(1,1)\times D\to D,\qquad (g,z)\mapsto g\cdot z = \frac{az+b}{b^*z+a^*} \tag{64}$$

This action is transitive and is globally and strongly Hamiltonian. Its generators are the Hamiltonian vector fields associated with the functions:

$$J_1(z,z^*) = \rho\,\frac{1+|z|^2}{1-|z|^2},\qquad J_2(z,z^*) = \frac{\rho}{i}\,\frac{z-z^*}{1-|z|^2},\qquad J_3(z,z^*) = \rho\,\frac{z+z^*}{1-|z|^2} \tag{65}$$
Let us consider the geometric temperature $\beta = \begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix}$, $r\in\mathbb{R}$, $\eta\in\mathbb{C}$. We can observe that $r$ and $\eta = \eta_R + i\eta_I$ contain 3 degrees of freedom, as required. Also, because $\det g = 1$, we get $\mathrm{Tr}(\beta) = 0$.
We exponentiate $\beta$ with the exponential map to get:

$$g = \exp(\varepsilon\beta) = \sum_{k=0}^{\infty}\frac{(\varepsilon\beta)^k}{k!} = \begin{pmatrix} a_\varepsilon(\beta) & b_\varepsilon(\beta)\\ b_\varepsilon^*(\beta) & a_\varepsilon^*(\beta)\end{pmatrix} \tag{68}$$

If we make the remark that we have the relation

$$\beta^2 = \begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix}\begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix} = \left(|\eta|^2 - r^2\right)I,$$

we can develop the exponential map:

$$g = \exp(\varepsilon\beta) = \begin{pmatrix}\cosh(\varepsilon R) + ir\,\dfrac{\sinh(\varepsilon R)}{R} & \eta\,\dfrac{\sinh(\varepsilon R)}{R}\\[6pt] \eta^*\,\dfrac{\sinh(\varepsilon R)}{R} & \cosh(\varepsilon R) - ir\,\dfrac{\sinh(\varepsilon R)}{R}\end{pmatrix}\quad\text{with}\ R^2 = |\eta|^2 - r^2 \tag{69}$$
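The closed form (69) can be checked against a generic matrix exponential; the following Python sketch (ours) does this for a hyperbolic element ($|\eta|^2 - r^2 > 0$, so $R$ is real).

```python
# Numerical check (ours) of the closed form (69) against scipy's matrix
# exponential, for a hyperbolic element (|eta|^2 - r^2 > 0, R real).
import numpy as np
from scipy.linalg import expm

r, eta, eps = 0.3, 0.8 + 0.2j, 1.4
beta = np.array([[1j * r, eta], [np.conj(eta), -1j * r]])
R = np.sqrt(abs(eta)**2 - r**2)

closed_form = np.array([
    [np.cosh(eps*R) + 1j*r*np.sinh(eps*R)/R, eta*np.sinh(eps*R)/R],
    [np.conj(eta)*np.sinh(eps*R)/R,          np.cosh(eps*R) - 1j*r*np.sinh(eps*R)/R]])
print(np.allclose(expm(eps * beta), closed_form))   # True
```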
We can observe that one condition is that $r^2 - |\eta|^2 > 0$; the subset to consider is then given by $\Lambda_\beta = \left\{\beta = \begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix},\ r\in\mathbb{R},\ \eta\in\mathbb{C}\ \middle|\ r^2 - |\eta|^2 > 0\right\}$ such that $\int_D e^{-\langle J(z),\beta\rangle}\,d\lambda(z) < +\infty$. The generalized Gibbs states of the full $SU(1,1)$ group do not exist. However, generalized Gibbs states for the one-parameter subgroups $\exp(\varepsilon\beta)$, $\beta\in\Lambda_\beta$, of the $SU(1,1)$ group do exist. The generalized Gibbs state associated to $\beta$ remains invariant under the restriction of the action to the one-parameter subgroup of $SU(1,1)$ generated by $\exp(\varepsilon\beta)$.
To go further, we will develop the Souriau Gibbs density from the Souriau moment map $J(z)$ and the Souriau temperature $\beta\in\Lambda_\beta$. If we note $b = \frac{1}{\sqrt{1-|z|^2}}\begin{pmatrix}1\\ z\end{pmatrix}$, we can write the moment map as:

$$J(z) = \rho\left(2Mbb^+ - \mathrm{Tr}\!\left(Mbb^+\right)I\right) \quad\text{with}\quad M = \begin{pmatrix}1 & 0\\ 0 & -1\end{pmatrix} \tag{70}$$

We can then write the covariant Gibbs density in the unit disk, given by the moment map of the Lie group $SU(1,1)$ and the geometric temperature in its Lie algebra, $\beta\in\Lambda_\beta$:

$$p_{\mathrm{Gibbs}}(z) = \frac{e^{-\langle J(z),\beta\rangle}}{\int_D e^{-\langle J(z),\beta\rangle}\,d\lambda(z)} \quad\text{with}\quad d\lambda(z) = 2i\rho\,\frac{dz\wedge dz^*}{\left(1-|z|^2\right)^2} \tag{71}$$
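The following Python sketch (ours) estimates the partition function of (71) by Monte-Carlo integration over the disk, writing $\langle J(z),\beta\rangle = b_1J_1(z) + b_2J_2(z) + b_3J_3(z)$ in the coordinates of (65); the coordinates $(b_1,b_2,b_3)$ and the elliptic convergence condition $b_1 > \sqrt{b_2^2+b_3^2}$ are our assumptions for this illustration.

```python
# Monte-Carlo sketch (ours) of the normalization of the covariant Gibbs
# density (71) on the Poincare disk, using the moment-map components (65).
import numpy as np

rho = 1.0
b1, b2, b3 = 2.0, 0.3, -0.4      # assumed geometric-temperature coordinates

def inner_J_beta(z):
    d = 1.0 - np.abs(z)**2
    J1 = rho * (1 + np.abs(z)**2) / d
    J2 = 2 * rho * z.imag / d     # (rho/i)(z - z*)/(1-|z|^2)
    J3 = 2 * rho * z.real / d     # rho (z + z*)/(1-|z|^2)
    return b1*J1 + b2*J2 + b3*J3

rng = np.random.default_rng(0)
n = 200_000
u, th = np.sqrt(rng.random(n)), 2*np.pi*rng.random(n)
z = u * np.exp(1j*th)                     # uniform points on the disk
w = 1.0 / (1.0 - np.abs(z)**2)**2         # dlambda density (up to a constant)
Z = np.pi / n * np.sum(np.exp(-inner_J_beta(z)) * w)
print(Z)   # finite partition function since b1 > sqrt(b2**2 + b3**2)
```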
$$p_{\mathrm{Gibbs}}(z) = \frac{\exp\left(-\left\langle\rho\begin{pmatrix}\dfrac{1+|z|^2}{1-|z|^2} & \dfrac{2z^*}{1-|z|^2}\\[6pt] \dfrac{-2z}{1-|z|^2} & \dfrac{-(1+|z|^2)}{1-|z|^2}\end{pmatrix},\ \begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix}\right\rangle\right)}{\int_D e^{-\langle J(z),\beta\rangle}\,d\lambda(z)} = \frac{e^{-\langle\rho(2Mbb^+ - \mathrm{Tr}(Mbb^+)I),\ \beta\rangle}}{\int_D e^{-\langle J(z),\beta\rangle}\,d\lambda(z)} \tag{72}$$
To write the Gibbs density with respect to its statistical moments, we have to express the density with respect to $Q = E[J(z)]$. Then, we have to invert the relation between $Q$ and $\beta$, to replace this last variable $\beta = \begin{pmatrix} ir & \eta\\ \eta^* & -ir\end{pmatrix}\in\Lambda_\beta$ by $\beta = \Theta^{-1}(Q)\in\mathfrak{g}$, where $Q = \frac{\partial\Phi(\beta)}{\partial\beta} = \Theta(\beta)\in\mathfrak{g}^*$ with $\Phi(\beta) = -\log\int_D e^{-\langle J(z),\beta\rangle}\,d\lambda(z)$, deduced from the Legendre transform. The mean moment map is given by:

$$Q = E[J(z)] = E\left[\rho\begin{pmatrix}\dfrac{1+|w|^2}{1-|w|^2} & \dfrac{2w^*}{1-|w|^2}\\[6pt] \dfrac{-2w}{1-|w|^2} & \dfrac{-(1+|w|^2)}{1-|w|^2}\end{pmatrix}\right] \quad\text{where}\ w\in D \tag{73}$$
We can remark that this Souriau Gibbs density, covariant under the action of the $SU(1,1)$ Lie group, is reparametrized via the Legendre transform and the moment map to define the maximum-entropy Gauss density in the Poincaré unit disk. Souriau's Lie groups thermodynamics is very general and could be applied to other groups or other homogeneous manifolds on which a Lie group acts transitively.
$$g = \begin{pmatrix} A & B\\ C & D\end{pmatrix}\in G,\qquad g = \begin{pmatrix} I_n & BD^{-1}\\ 0 & I_n\end{pmatrix}\begin{pmatrix} A - BD^{-1}C & 0\\ 0 & D\end{pmatrix}\begin{pmatrix} I_n & 0\\ D^{-1}C & I_n\end{pmatrix} \tag{74}$$
Benjamin Cahen has studied this case and introduced the moment map by identifying $\mathfrak{g}^*$ $G$-equivariantly with $\mathfrak{g}$ by means of the Killing form $\beta$ on $\mathfrak{g}^{\mathbb{C}}$:

$$\mathfrak{g}^*\ \text{identified $G$-equivariantly with $\mathfrak{g}$ by the Killing form}\quad \beta(X,Y) = 2(p+q)\,\mathrm{Tr}(XY)$$

The set of all elements of $\mathfrak{g}$ fixed by $K$ is $\mathfrak{h}$:

$$\mathfrak{h} = \{\text{elements of}\ \mathfrak{g}\ \text{fixed by}\ K\},\quad \xi_0\in\mathfrak{h},\ \ \xi_0 = i\lambda\begin{pmatrix} nI_p & 0\\ 0 & -nI_q\end{pmatrix}\ \Rightarrow\ \left\langle\xi_0,[Z,Z^+]\right\rangle = 2i\lambda(2n)^2\,\mathrm{Tr}(ZZ^+),\ \forall Z\in\mathcal{D} \tag{75}$$

Then the equivariant moment map is given by:

$$\forall X\in\mathfrak{g}^{\mathbb{C}},\ Z\in\mathcal{D}:\quad \psi(Z) = \mathrm{Ad}^*\!\left(\exp(Z^+)\,\zeta(\exp Z^+\exp Z)\right)\xi_0,\qquad \forall g\in G,\ Z\in\mathcal{D}:\ \psi(g\cdot Z) = \mathrm{Ad}^*_g\,\psi(Z) \tag{76}$$

$\psi$ is a diffeomorphism from $SD$ onto the orbit $\mathcal{O}(\xi_0)$:

$$\psi(Z) = i\lambda\begin{pmatrix}(I_n - ZZ^+)^{-1}\left(nZZ^+ + nI_n\right) & (2n)\,Z\,(I_n - Z^+Z)^{-1}\\ -(2n)\,(I_n - Z^+Z)^{-1}Z^+ & -nI_q - nZ^+Z\,(I_n - Z^+Z)^{-1}\end{pmatrix} \tag{77}$$
and

$$\zeta(\exp Z^+\exp Z) = \begin{pmatrix} I_p & Z\,(I_q - Z^+Z)^{-1}\\ 0 & I_q\end{pmatrix} \tag{78}$$

The moment map is then given by:

$$J(Z) = \rho n\begin{pmatrix}(I_n - ZZ^+)^{-1}(I_n + ZZ^+) & 2Z^+(I_n - ZZ^+)^{-1}\\ -2(I_n - ZZ^+)^{-1}Z & -(I_n + ZZ^+)(I_n - ZZ^+)^{-1}\end{pmatrix}\in\mathfrak{g}^* \tag{79}$$
The Souriau Gibbs density is then given, with $\beta\in\mathfrak{g}$ and $Z\in SD_n$, by:

$$p_{\mathrm{Gibbs}}(Z) = \frac{\exp\left(-\left\langle\rho n\begin{pmatrix}(I_n - ZZ^+)^{-1}(I_n + ZZ^+) & 2Z^+(I_n - ZZ^+)^{-1}\\ -2(I_n - ZZ^+)^{-1}Z & -(I_n + ZZ^+)(I_n - ZZ^+)^{-1}\end{pmatrix},\ \beta\right\rangle\right)}{\int_{SD_n} e^{-\langle J(Z),\beta\rangle}\,d\lambda(Z)}$$

$$\text{with}\quad \begin{cases}\beta = \Theta^{-1}(Q)\in\mathfrak{g}\\ Q = E[J(Z)]\\ Q = \dfrac{\partial\Phi(\beta)}{\partial\beta} = \Theta(\beta)\in\mathfrak{g}^*\end{cases} \tag{80}$$
with $[\varepsilon,\delta]_{\mathrm{sym}} = \varepsilon J_n^T\delta - \delta J_n^T\varepsilon$ and $[\tilde\varepsilon,\tilde\delta]_{\mathrm{sp}} = \tilde\varepsilon\tilde\delta - \tilde\delta\tilde\varepsilon$, and the associated inner products $\langle\varepsilon,\delta\rangle_{\mathrm{sym}} = \mathrm{tr}(\varepsilon\delta)$ and $\langle\tilde\varepsilon,\tilde\delta\rangle_{\mathrm{sp}} = \mathrm{tr}(\tilde\varepsilon^T\tilde\delta)$. With these inner products, we identify the dual spaces with the spaces themselves: $\mathrm{sym}(2n,\mathbb{R}) = \mathrm{sym}(2n,\mathbb{R})^*$, $\mathfrak{sp}(2n,\mathbb{R}) = \mathfrak{sp}(2n,\mathbb{R})^*$.
Define first the adjoint operators $\mathrm{Ad}_G\,\varepsilon = (G^{-1})^T\varepsilon G^{-1}$ and $\mathrm{Ad}_G\,\tilde\varepsilon = G\tilde\varepsilon G^{-1}$ with $G\in Sp(2n,\mathbb{R})$, and $\mathrm{ad}_\varepsilon\,\delta = \varepsilon J_n^T\delta - \delta J_n^T\varepsilon = [\varepsilon,\delta]_{\mathrm{sym}}$, and the co-adjoint operators $\mathrm{Ad}^*_{G^{-1}}\,\eta = G\eta G^T$ and $\mathrm{ad}^*_\varepsilon\,\eta = J_n\varepsilon\eta - \eta\varepsilon J_n$. To the coadjoint orbit $\mathcal{O} = \{\mathrm{Ad}^*_G\,\eta\in\mathrm{sym}(2n,\mathbb{R})^* : G\in Sp(2n,\mathbb{R})\}$, for each $\eta\in\mathrm{sym}(2n,\mathbb{R})^*$, we can associate a symplectic manifold with the KKS (Kirillov–Kostant–Souriau) 2-form:

$$\Omega_{\mathcal{O}}(\eta)\left(\mathrm{ad}^*_\varepsilon\,\eta,\ \mathrm{ad}^*_\delta\,\eta\right) = \left\langle\eta,[\varepsilon,\delta]_{\mathrm{sym}}\right\rangle = \mathrm{tr}\left(\eta\,[\varepsilon,\delta]_{\mathrm{sym}}\right) \tag{86}$$
We then compute the moment map $J: H_n\to\mathfrak{sp}(2n,\mathbb{R})^*$ such that $i_\varepsilon\Omega_{H_n} = d\langle J(\cdot),\varepsilon\rangle$, given by

$$J(W) = \begin{pmatrix} V^{-1} & V^{-1}U\\ UV^{-1} & UV^{-1}U + V\end{pmatrix} \quad\text{for}\quad W = U + iV\in H_n;$$

we deduce:
$$p_{\mathrm{Gibbs}}(W) = \frac{e^{-\langle J(W),\varepsilon\rangle}}{\int_{H_n} e^{-\langle J(W),\varepsilon\rangle}\,d\lambda(W)} \quad\text{with}\quad d\lambda(W) = 2i\rho\,V^{-1}dW\wedge V^{-1}dW^+ \tag{87}$$

with

$$\langle J(W),\varepsilon\rangle = \mathrm{Tr}\left(J(W)\,\varepsilon\right) = \mathrm{Tr}\left(\begin{pmatrix} V^{-1} & V^{-1}U\\ UV^{-1} & UV^{-1}U + V\end{pmatrix}\begin{pmatrix}\varepsilon_{11} & \varepsilon_{12}\\ \varepsilon_{12}^T & \varepsilon_{22}\end{pmatrix}\right) \tag{88}$$

And then

$$\langle J(W),\varepsilon\rangle = \mathrm{Tr}\left(V^{-1}\varepsilon_{11} + 2UV^{-1}\varepsilon_{12} + \left(UV^{-1}U + V\right)\varepsilon_{22}\right) \tag{89}$$

To consider the Gibbs density and the Gauss density for symmetric positive definite matrices, we have to consider the case $W = iV$ with $U = 0$ and $\langle J(W),\varepsilon\rangle = \mathrm{Tr}\left[V^{-1}\varepsilon_{11} + V\varepsilon_{22}\right]$.
6 Conclusion
The classical notion of Gibbs' canonical ensemble has been extended by Jean-Marie Souriau to the case of a symplectic manifold on which a Lie group has a symplectic action (a "dynamic group"). Souriau's definition extends a certain number of classical thermodynamic notions—the temperature, which becomes an element of the Lie algebra of the group, and the heat, which becomes an element of its dual—as well as the convexity inequalities. In the case of non-commutative groups, particular properties appear: the symmetry is spontaneously broken, and certain relations of cohomological type are verified in the Lie algebra of the
References
Balian, R., 1991. From Microphysics to Macrophysics, Vols. 1–2. Springer Science and Business Media LLC, Berlin/Heidelberg, Germany.
Balian, R., 2003. Introduction à la thermodynamique hors-equilibre. (CEA report).
Balian, R., 2014. The entropy-based quantum metric. Entropy 16, 3878–3888.
Balian, R., 2015. François Massieu et les Potentiels Thermodynamiques. Evolution des Disci-
plines et Histoire des Decouvertes; Academie des Sciences, Paris, France.
Balian, R., Valentin, P., 2001. Hamiltonian structure of thermodynamics with gauge. Eur. Phys.
J. B. 21, 269–282.
Balian, R., Alhassid, Y., Reinhardt, H., 1986. Dissipation in many-body systems: a geometric
approach based on information theory. Phys. Rep. 131, 1–146.
Barbaresco, F., 2019a. Lie Groups Thermodynamics & Souriau-Fisher Metric, SOURIAU 2019
conference. Institut Henri Poincare.
Barbaresco, F., 2019b. Lie Groups Thermodynamics & Souriau-Fisher Metric, SOURIAU 2019
Conference. Institut Henri Poincare.
Barbaresco, F., 2020a. Lie group statistics and lie group machine learning based on Souriau lie
groups Thermodynamics & Koszul-Souriau-Fisher Metric: new entropy definition as
generalized Casimir invariant function in coadjoint representation. Entropy 22, 642.
Barbaresco, F., 2020b. Souriau-Casimir Lie Groups Thermodynamics & Machine Learning, Joint
Structures and Common Foundations of Statistical Physics, Information Geometry and Infer-
ence for Learning, les Houches Summer Week SPIGL’20, 27.
Barbaresco, F., 2021a. Souriau-Casimir Lie Groups Thermodynamics & Machine Learning,
SPIGL’20 Proceedings, les Houches Summer Week on Joint Structures and Common Foun-
dations of Statistical Physics. Information Geometry and Inference for Learning, Springer
Proceedings in Mathematics & Statistics.
Barbaresco, F., 2021c. Jean-Marie Souriau’s Symplectic model of statistical physics: seminal
papers on lie groups thermodynamics - Quod Erat demonstrandum, les Houches SPIGL’20
proceedings. Springer Proceedings in Mathematics & Statistics.
Barbaresco, F., 2021d. Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie
group thermodynamics and information geometry. Information Geometry Journal.
Barbaresco, F., Gay-Balmaz, F., 2020. Lie group Cohomology and (multi)Symplectic integrators:
new geometric tools for lie group machine learning based on Souriau geometric statistical
mechanics. Entropy 22, 498.
Berezin, F.A., 1967. Some remarks about the associated envelope of a Lie algebra. Funct Anal Its
Appl 1, 91–102.
Berezin, F.A., 1975. Quantization in complex symmetric space. Math, USSR Izv 9, 341–379.
Blanc-Lapierre, A., Casal, P., Tortrat, A., 1959. Methodes mathematiques de la mecanique statis-
tique. Masson, Paris.
Cahen, B., 2004. Contraction de SU(1,1) vers le groupe de Heisenberg. Travaux de Mathema-
tiques, Fascicule XV, pp. 19–43.
Cahen, B., 2013. Global parametrization of scalar holomorphic coadjoint orbits of a
Quasi-Hermitian Lie Group, acta. Univ. Palacki. Olomuc. Fac. rer. Nat., Mat. 52, 35–48.
Symplectic theory of heat and information geometry Chapter 4 141
Cartan, E., 1929. Sur les invariants integraux de certains espaces homogènes clos et les proprietes
topologiques de ces espaces. Ann. Soc. Pol. Math, 181–225. t.8.
Cartier, P., 1994. Some Fundamental Techniques in the Theory of Integrable Systems, IHES/M/
94/23, SW9421. Available online: https://cds.cern.ch/record/263222/files/P00023319.pdf.
(accessed on 31 May 2020).
Casimir, H.G.B., 1931. Über die Konstruktion einer zu den irreduziblen Darstellungen halbeinfacher kontinuierlicher Gruppen gehörigen Differentialgleichung. Proc. R. Soc. Amsterdam 4.
Chevallier, E., Forget, T., Barbaresco, F., Angulo, J., 2016. Kernel density estimation on the Sie-
gel space with an application to radar processing. Entropy 18, 396.
Cishahayo, C., de Bièvre, S., 1993. On the contraction of the discrete series of SU(1;1). Annales
de l’institut Fourier, tome 43 (2), 551–567.
FGSI’19 Conference, 2019. “Foundations of Geometric Structures of Information” in February
2019 at IMAG (Institut Montpellierain Alexander Grothendieck). https://fgsi2019.
sciencesconf.org/.
CIRM, 2017. TGSI’17 Conference on “Topological and Geometrical Structures of Information”.
https://www.mdpi.com/journal/entropy/special_issues/topological_geometrical_info.
Dacunha-Castelle, D., Gamboa, F., 1990. Maximum d’entropie et problème des moments Annales
de l’I.H.P. Section B, tome 26, no 4, pp. 567–596.
De Saxce, G., 2016. Link between lie group statistical mechanics and thermodynamics of conti-
nua. Entropy 18, 254.
De Saxce, G., 2019. Euler-Poincare equation for lie groups with non null symplectic cohomology.
Application to the mechanics. In: Nielsen, F., Barbaresco, F. (Eds.), GSI 2019. LNCS.
vol. Volume 11712. Springer, Berlin, Germany.
De Saxce, G., Marle, C.-M., 2020. Presentation du livre de Jean-Marie Souriau “Structure des sys-
tèmes dynamiques”, preprint.
De Saxce, G., Vallee, C., 2016. Galilean Mechanics and Thermodynamics of Continua. Wiley.
Ecole de Physique des Houches, 2020. SPIGL’20 in July 2020 on “Joint Structures and Common
Foundations of Statistical Physics, Information Geometry and Inference for Learning”.
https://franknielsen.github.io/SPIG-LesHouches2020/.
Engo, K., Faltinsen, S., 2002. Numerical integration of lie–Poisson systems while preserving
coadjoint orbits and energy. SIAM J. Numer. Anal. 39 (1), 128–145.
Gallisssot, F., 1952. Les formes exterieures en mecanique. vol 4 Annales de l’Institut Fourier,
pp. 145–297.
GSI Conference Cycle, 2013–2021. “Geometric Science of Information” in 2013, 2015, 2017,
2019 et 2021 at Ecole des Mines de Paris, Ecole Polytechnique, ENAC and Sorbonne Uni-
versite. https://franknielsen.github.io/GSI/.
Hua, L.K., 1963. Harmonic Analysis of Functions of Several Complex Variables in the Classical
Domains, Translations of Mathematical Monographs 6. Amer. Math. Soc, RI.
Iglesias, P., 1979. Thermodynamique geometrique appliquee aux configuration tournantes en
astrophysique. Thèse de 3ème cycle, Universite de Provence.
Iglesias, P., 1995. Itineraire d’un mathematicien: Un entretien avec Jean-Marie Souriau. Le jour-
nal de Maths des elèves. ENS Lyon.
Kosmann-Schwarzbach, Y., 2013. Simeon-Denis Poisson. Palaiseau, France, Les Mathematiques
au Service de la Science; Ecole Polytechnique.
Koszul, J.L., 1985. Crochet de Schouten-Nijenhuis et cohomologie. In: Astérisque, numéro hors-série "Élie Cartan et les mathématiques d'aujourd'hui" (Lyon, 25–29 juin 1984), pp. 257–271.
Koszul, J.-L., Zou, Y.M., 2019. Introduction to Symplectic Geometry. Berlin/Heidelberg,
Germany, Springer Science and Business Media LLC.
142 SECTION II Information geometry
Souriau, J.-M., 1978. Thermodynamique et geometrie. In: Bleuler, K., Reetz, A. (Eds.), Differen-
tial Geometry Methods in Mathematical Physics II, Proceedings University of Bonn, July
13–16 1977. Springer.
Souriau, J.-M., 1984. Mecanique Classique et Geometrie Symplectique. CNRS-CPT-84/PE.1695.
Souriau, J.-M., 1986. La structure symplectique de la mecanique decrite par Lagrange en 1811.
Math. Sci. Hum. 94, 45–54.
Souriau, J.-M., 1996. Grammaire de la nature.
Souriau, J.-M., 1997. Structure of Dynamical Systems: A Symplectic View of Physics. The Prog-
ress in Mathematics Book Series (PM, Volume 149), Springer.
Souriau, J.-M., 2007. On geometric dynamics. Discrete and Continuous Dynamical Systems Vol-
ume 19 (3), 595–607.
Souriau, B.F., 2019. Exponential map algorithm for machine learning on matrix lie groups. In:
Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information. GSI 2019. Lecture
Notes in Computer Science. vol 11712. Springer.
Stratonovich, R.L., 1957. On distributions in representation space. Soviet Physics JETP 4 (6).
Vialatte, A., 2011. Les gloires silencieuses: Elie Cartan, Journalismes, Le Petit Dauphinois
1932–1944. Cahiers Alexandre Vialatte n°36, pp. 150–160.
Vorob'ev, Y.M., Karasev, M.V., 1988. Poisson manifolds and the Schouten bracket. Funktsional. Anal. i Prilozhen. 22 (1), 1–11.
Further reading
Barbaresco, F., 2021b. Entropy Geometric Structure as Casimir Invariant Function in Coadjoint
Representation: Geometric Theory of Heat & Information Based on Souriau Lie Groups Ther-
modynamics and Lie Algebra Cohomology, Prepared for Encyclopedia of Entropy across the
Disciplines. World Scientific.
Marle, C.-M., 2020a. On Gibbs states of mechanical systems with symmetries. JGSP 57, 45–85.
Chapter 5
A unifying framework for some directed distances in statistics
Abstract
Density-based directed distances—particularly known as divergences—between proba-
bility distributions are widely used in statistics as well as in the adjacent research fields
of information theory, artificial intelligence and machine learning. Prominent examples
are the Kullback–Leibler information distance (relative entropy) which, e.g., is closely
connected to the omnipresent maximum likelihood estimation method, and Pearson’s
χ²-distance which, e.g., is used for the celebrated chi-square goodness-of-fit test. An-
other line of statistical inference is built upon distribution-function-based divergences
such as, e.g., the prominent (weighted versions of) Cramér–von Mises test statistics
respectively Anderson–Darling test statistics which are frequently applied for
goodness-of-fit investigations; some more recent methods deal with (other kinds of )
cumulative paired divergences and closely related concepts. In this paper, we provide
a general framework which covers in particular both the above-mentioned density-based
and distribution-function-based divergence approaches; the dissimilarity of quantiles
respectively of other statistical functionals will be included as well. From this frame-
work, we structurally extract numerous classical and also state-of-the-art (including
new) procedures. Furthermore, we deduce new concepts of dependence between ran-
dom variables, as alternatives to the celebrated mutual information. Some variational
representations are discussed, too.
Keywords: Phi-divergences, Scaled Bregman distances, Estimation, Testing,
Statistical functionals, Bayesian decision making
[Overview diagram of the motivations: General Motivations (Sec. 1), Handling of Zeros (Sec. 1.3), Motivations from Probability Theory (Sec. 1.4).]
a See, e.g., Weller-Fahy et al. (2015).
$$0 \le D_\phi(P,Q) := \int_{\mathcal{X}} f_Q(x)\cdot\phi\!\left(\frac{f_P(x)}{f_Q(x)}\right) d\lambda(x) \tag{1}$$
$$\phantom{0 \le D_\phi(P,Q)} = \int_{\mathcal{X}}\phi\!\left(\frac{f_P(x)}{f_Q(x)}\right) dQ(x), \tag{2}$$

where $\phi:\,]0,\infty[\,\to[0,\infty[$ is a convex function which is strictly convex at 1 and which satisfies $\phi(1) = 0$. It can be easily seen that this $D_\phi(\cdot,\cdot)$ satisfies the above-mentioned requirements/properties/axioms (D1) and (D2). In the above-mentioned discrete setup with $\mathcal{X} = \mathcal{X}_\#$, (1) turns into

$$0 \le D_\phi(P,Q) = \sum_{x\in\mathcal{X}_\#} f_Q(x)\cdot\phi\!\left(\frac{f_P(x)}{f_Q(x)}\right),$$

$$D_{\phi_{PC}}(P,Q) = 2\int_{\mathcal{X}}\left(\sqrt{f_P(x)} - \sqrt{f_Q(x)}\right)^2 d\lambda(x),$$
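For the discrete setup, the φ-divergence sum above is a one-liner; the following Python sketch (ours) implements it for the power generators $\phi_\alpha$ used later in this chapter (with the Kullback–Leibler case as the limit $\alpha\to1$).

```python
# Small sketch (ours) of the discrete phi-divergence, with power generators
# phi_alpha covering Kullback-Leibler (alpha -> 1), Pearson chi-square
# (alpha = 2) and the squared-Hellinger-type case (alpha = 1/2).
import numpy as np

def phi_alpha(t, alpha):
    if alpha == 1:                      # limit generator t*log t + 1 - t
        return t * np.log(t) + 1.0 - t
    return (t**alpha - alpha*t + alpha - 1) / (alpha * (alpha - 1))

def D_phi(p, q, alpha):
    # assumes strictly positive probability mass functions p, q
    t = p / q
    return float(np.sum(q * phi_alpha(t, alpha)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.25, 0.25, 0.5])
for a in (0.5, 1, 2):
    print(a, D_phi(p, q, a))
```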
(A3) $D_\phi(P_1,P_2)$ should take its minimum value when $P_1 = P_2$ and its maximum value when $P_1\perp P_2$ (i.e., $P_1$ and $P_2$ are singular, in the sense that the supports of the distributions $P_1$ and $P_2$ do not overlap (are disjoint)).
(A4) A further axiom of statistical nature should be satisfied in relation with a statistical notion of separability of two distributions in a given model. Assume that for a given family of parametric distributions $(P_\theta)_{\theta\in\Theta}$ and for any small risk $\alpha$ the following property holds: if $P_{\theta_0}$ is rejected vs $P_{\theta_1}$ with risk $\alpha$ optimally (Neyman–Pearson approach), then $P_{\theta_0}$ is rejected vs $P_{\theta_2}$ with risk $\alpha$ (meaning that $P_{\theta_2}$ is further away from $P_{\theta_0}$ than $P_{\theta_1}$ is). Then one should have
$$D_\phi(P_{\theta_0}, P_{\theta_2}) \ge D_\phi(P_{\theta_0}, P_{\theta_1}).$$
where $\tilde\phi$ is not necessarily equal to $\phi_0$; for instance, through some comfortably verifiable criteria on $\tilde\phi$ one can end up with an outcoming minimum $\tilde\phi$-divergence/distance estimator $\tilde\theta$ which is more robust against outliers than the MLE $\hat\theta$ (see, e.g., the residual-adjustment-function approach of Lindsay (1994), its comprehensive treatment in Basu et al. (2011), and the corresponding flexibilizations in Kißlinger and Stummer (2016) and Roensch and Stummer (2017)). Usually, $\tilde\theta$ of (6) is called the minimum φ-divergence estimator (MDE), and $Q_{\tilde\theta}$ is the φ-divergence projection of $P^{\mathrm{emp}}_N$ on Ω.
A further useful generalization is the "distribution-outcome type" minimum divergence/distance estimation problem

$$\widehat{Q} := \arg\inf_{Q\in\Omega} D_\phi\!\left(Q, P^{\mathrm{emp}}_N\right) \tag{7}$$

where $P^{\mathrm{emp}}_N$ stems from a general (not necessarily parametric, unknown) data-generating distribution P, and Ω may be a "fairly general" model, being a class of finite discrete distributions Q having strictly positive probability mass function $f_Q$ on $\mathcal{X}_\#$ (and, as usual, (7) can be rewritten as a minimization problem on the $(\#\mathcal{X}_\# - 1)$-dimensional probability simplex). The outcoming $\widehat{Q}$ of (7) is still called the (distribution-type) minimum φ-divergence estimator (MDE), and can be interpreted as the φ-divergence projection of $P^{\mathrm{emp}}_N$ on Ω. Problem (7) is in particular beneficial in non- and semiparametric contexts, where Ω reflects (partially) nonparametrizable model constraints. For instance, Ω may consist (only) of constraints on moments or on L-moments (see, e.g., Broniatowski and Decurninge (2016)); alternatively, Ω may be, e.g., a tubular neighborhood of a parametric model (see, e.g., Ghosh and Basu, 2018; Liu and Lindsay, 2009).
The closeness—especially in terms of the sample size N—of the data-derived empirical distribution from the model Ω is quantified by the corresponding minimum

$$D_\phi\!\left(\Omega, P^{\mathrm{emp}}_N\right) := \inf_{Q\in\Omega} D_\phi\!\left(Q, P^{\mathrm{emp}}_N\right) \tag{8}$$
c If $f_{Q_\theta}(\tilde{x}) = 0$ for all $\theta\in\Theta$, one should certainly reduce the space $\mathcal{X}$ by removing $\tilde{x}$.
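A minimal illustration (ours) of the distribution-type minimum-divergence problem (7): project the empirical pmf onto a model Ω given by one moment constraint, using a generic numerical optimizer; the data, the constraint value m0 and the choice of $\phi_2$ are arbitrary assumptions for the example.

```python
# Sketch (ours) of problem (7): minimize D_phi(Q, P_emp) over a model Omega
# defined by the single moment constraint E_Q[X] = m0.
import numpy as np
from scipy.optimize import minimize

x = np.array([0.0, 1.0, 2.0, 3.0])
p_emp = np.array([0.4, 0.3, 0.2, 0.1])          # empirical pmf of the data
m0 = 1.5                                        # assumed model constraint

def D_phi2(q):                                  # Pearson-type generator phi_2
    t = q / p_emp
    return np.sum(p_emp * (t - 1.0)**2 / 2.0)

cons = ({'type': 'eq', 'fun': lambda q: np.sum(q) - 1.0},
        {'type': 'eq', 'fun': lambda q: np.sum(q * x) - m0})
res = minimize(D_phi2, p_emp, constraints=cons,
               bounds=[(1e-9, 1.0)] * 4, method='SLSQP')
print(res.x, res.x @ x)                         # projected pmf, mean ~ 1.5
```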
where (i) the left equality holds only for P = Q, and (ii) the right equality holds always for $P\perp Q$ (singularity, i.e., the set where $f_P$ is strictly positive is disjoint from the set where $f_Q$ is strictly positive), and only for $P\perp Q$ in case of $\phi(0) + \phi^*(0) < \infty$.

$$D_{\phi_\alpha}(P,Q) = \phi_\alpha(0) + \phi^*_\alpha(0) = \begin{cases}\infty, & \text{if}\ \alpha\notin\,]0,1[,\\[4pt] \dfrac{1}{\alpha(1-\alpha)}, & \text{if}\ \alpha\in\,]0,1[.\end{cases} \tag{13}$$
Especially, for $P\perp Q$ one gets for the Kullback–Leibler divergence $D_{\phi_{KL}}(P,Q) = D_{\phi_1}(P,Q) = \infty$, whereas $D_{\phi_{0.99}}(P,Q) = \frac{10000}{99}$ achieves a finite value; thus, in order to avoid infinities, it is more convenient to work with the well-approximating divergence generator $\phi_{0.99}$ instead of $\phi_1$. Similarly, for $P\perp Q$ in the reverse Kullback–Leibler divergence we obtain $D_{\phi_{RKL}}(P,Q) = D_{\phi_0}(P,Q) = \infty$, whereas $D_{\phi_{0.01}}(P,Q) = \frac{10000}{99}$. Furthermore, for $P\perp Q$ one gets for Pearson's $\chi^2$-divergence $D_{\phi_2}(P,Q) = \infty$, for Neyman's $\chi^2$-divergence $D_{\phi_{-1}}(P,Q) = \infty$, and for the (squared) Hellinger distance $D_{\phi_{1/2}}(P,Q) = 4$.
Returning to the general context, notice that the upper bound $\phi(0) + \phi^*(0)$ in Theorem 1 is independent of P and Q, and thus $D_\phi(P,Q)$ is of no discriminative use in statistical situations where P and Q are singular (i.e., $P\perp Q$). This is the case, for instance, in the following commonly encountered "crossover" context:
(CO1) Y is a univariate (absolutely continuous) random variable with unknown hypothetical probability distribution P having strictly positive density function $f_P$ with respect to the Lebesgue measure $\lambda_L$ on $\mathcal{X} = \mathbb{R}$ (recall that this means that $f_P$ is a "classical" (e.g., Gaussian) probability density function),
(CO2) the corresponding model $\Omega := \{Q_\theta : \theta\in\Theta\}$ ($\Theta\subseteq\mathbb{R}$) is a class of parametric distributions having strictly positive probability density functions $f_{Q_\theta}$ with respect to $\lambda_L$, and
(CO3) $P^{\mathrm{emp}}_N := \frac1N\sum_{i=1}^N\delta_{Y_i}[\cdot]$ is the data-derived empirical distribution of an N-size independent and identically distributed (i.i.d.) sample/observations $Y_1,\dots,Y_N$ of Y; recall that the corresponding probability mass function is $f_{P^{\mathrm{emp}}_N}(x) = \frac1N\,\#\{i\in\{1,\dots,N\} : Y_i = x\}$, which is the density function with respect to the counting measure $\lambda_\#$ on the distinct values of the sample.
This contrary density-function behavior can be put into an encompassing framework by employing the joint density-building (i.e., dominating) measure $\lambda := \lambda_L + \lambda_\#$. Clearly, one always has the singularity $P^{\mathrm{emp}}_N\perp Q_\theta$ and thus, due to Theorem 1, one gets

$$D_\phi\!\left(Q_\theta, P^{\mathrm{emp}}_N\right) = \phi(0) + \phi^*(0)\quad\text{for all}\ \theta\in\Theta,\qquad \inf_{\theta\in\Theta} D_\phi\!\left(Q_\theta, P^{\mathrm{emp}}_N\right) = \phi(0) + \phi^*(0). \tag{14}$$
Also notice that for power divergences $D_{\phi_\alpha}(P,Q)$ with $\alpha\notin\,]0,1[$ it can happen that $D_{\phi_\alpha}(P,Q) = \infty$ even though P and Q are not singular (which, due to (13), is consistent with Theorem 1). For instance, consider a situation with two different i.i.d. samples $Y_1,\dots,Y_N$ of Y having distribution P and $\tilde{Y}_1,\dots,\tilde{Y}_M$ of $\tilde{Y}$ having distribution Q with (say) $Q\sim P$ (equivalence); in terms of the corresponding empirical distributions $P^{\mathrm{emp}}_N := \frac1N\sum_{i=1}^N\delta_{Y_i}[\cdot]$ and $\tilde{P}^{\mathrm{emp}}_M := \frac1M\sum_{i=1}^M\delta_{\tilde{Y}_i}[\cdot]$, one obtains $D_{\phi_\alpha}(P^{\mathrm{emp}}_N,\tilde{P}^{\mathrm{emp}}_M) = \infty$ if the set of zeros of the corresponding probability mass function $f_{P^{\mathrm{emp}}_N}$ is strictly larger (for $\alpha\le 0$) respectively smaller (for $\alpha\ge 1$) than the set of zeros of $f_{\tilde{P}^{\mathrm{emp}}_M}$ (i.e., $\tilde{P}^{\mathrm{emp}}_M[f_{P^{\mathrm{emp}}_N} = 0] > 0$ respectively $P^{\mathrm{emp}}_N[f_{\tilde{P}^{\mathrm{emp}}_M} = 0] > 0$), to be seen by applying (10), (11), (12). As above, in such a nonsingular situation it is better to use the (in fact, even sample-dependent!) power divergence $D_{\phi_{0.99}}(P^{\mathrm{emp}}_N,\tilde{P}^{\mathrm{emp}}_M)$ instead of the Kullback–Leibler divergence $D_{\phi_1}(P^{\mathrm{emp}}_N,\tilde{P}^{\mathrm{emp}}_M) = \infty$. Similar infinity effects can be constructed for the above-mentioned other important special cases $\alpha = 0$ (reverse Kullback–Leibler divergence), $\alpha = 2$ (Pearson's $\chi^2$-divergence), $\alpha = -1$ (Neyman's $\chi^2$-divergence), whereas for the case $\alpha = 1/2$ (squared Hellinger distance) everything works out well. Such an approach serves as an alternative to the approach of "lifting/unzeroing/adjusting" (from sampling randomly appearing) zero probability masses^d by pseudo-counts or "smoothing (in a discrete sense)"; see, e.g., Fienberg and Holland (1970), as well as, e.g., Section 4.5 in Jurafsky and Martin (2009) and the references therein.
Next, we briefly indicate two ways to circumvent the problem described in the above-mentioned crossover context (CO1), (CO2), (CO3):
(GR) grouping (partitioning, quantization) of data: convert^e the model Ω into a purely discrete context, by subdividing the data point set $\mathcal{X} = \bigcup_{j=1}^s A_j$ into countably many—say $s\in\mathbb{N}\cup\{\infty\}\setminus\{1\}$—(measurable) disjoint classes $A_1,\dots,A_s$ with the property $\lambda_L[A_j] > 0$ ("essential partition"); proceed as in the above general discrete subsetup with $\mathcal{X}^{new} := \{A_1,\dots,A_s\}$, and thus the i-th data observation $Y_i(\omega)$ and the corresponding running variable x manifest (only) the corresponding class membership (see, e.g., Vajda and van der Meulen (2011) for a survey on different choices). Some corresponding thorough statistical investigations (such as efficiency, robustness, types of grouping, grouping-error sensitivity) of the corresponding minimum-φ-divergence estimation can be found, e.g., in Victoria-Feser and Ronchetti (1997),
d E.g., which correspond to empty cells in sampled histograms, e.g., for rare events and small/medium sample sizes.
e In several situations, such a conversion can appear in a natural way; e.g., an institution may generate/collect data of "continuous value" but mask them for external data analysts into grouped frequencies, for reasons of confidentiality (information asymmetry).
Menendez et al. (1998, 2001a, b), Morales et al. (2004, 2006), and Lin
and He (2006).
(SM) smoothing of the empirical density function: convert everything to a purely continuous context, by keeping the original data-point set $\mathcal{X}$ and by "continuously modifying" (e.g., with the help of kernels) the empirical density function $f_{P^{\mathrm{emp}}_N}(\cdot)$ to a function $f_{P^{\mathrm{emp,smo}}_N}(\cdot) > 0$ (a.s.) such that $\int_{\mathcal{X}} f_{P^{\mathrm{emp,smo}}_N}(x)\,d\lambda_L(x) = 1$. Some corresponding thorough statistical investigations (such as efficiency, robustness, information loss) of the corresponding minimum-φ-divergence estimation can be found, e.g., in Beran (1977), Basu and Lindsay (1994), Park and Basu (2004), Chapter 3 of Basu et al. (2011), Kuchibhotla and Basu (2015), and Al Mohamad (2018), and the references therein.
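The following Python sketch (ours) illustrates the smoothing approach (SM) with a Gaussian kernel estimator: the smoothed empirical density is strictly positive, so a φ-divergence to a continuous model candidate becomes finite and computable by numerical integration; the sample, the model candidate and the reverse-Kullback–Leibler generator are arbitrary choices for the illustration.

```python
# Sketch (ours) of (SM): kernel-smooth the empirical distribution, then
# evaluate a phi-divergence (reverse-KL-type generator) to a model density.
import numpy as np
from scipy.stats import gaussian_kde, norm
from scipy.integrate import quad

rng = np.random.default_rng(1)
y = rng.normal(loc=0.3, scale=1.2, size=400)     # i.i.d. sample
f_smo = gaussian_kde(y)                          # smoothed empirical density
f_model = lambda s: norm.pdf(s, loc=0.0, scale=1.0)

def integrand(s):                                # phi_0(t) = -log t + t - 1
    t = f_model(s) / f_smo(s)[0]
    return f_smo(s)[0] * (-np.log(t) + t - 1.0)

val, _ = quad(integrand, -8, 8, limit=200)
print(val)   # finite, in contrast to the singular unsmoothed case (16)
```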
In contrast to the above, let us now encounter a crossover situation where (CO1) and (CO3) still hold, but the parametric model assumption (CO2) is replaced by
(CO2') the corresponding model Ω := {Q : Q satisfies some nonparametric constraints} is a class of distributions Q which contains both (i) distributions Q having strictly positive probability density functions $f_Q$ with respect to $\lambda_L$, and (ii) all "context-specific appropriate" finite discrete distributions Q (having ideally the same (or at least, smaller or equal) support as $P^{\mathrm{emp}}_N$).
The subclasses of Ω which satisfy (i) respectively (ii) are denoted by $\Omega^{ac}$ respectively $\Omega^{dis}$. Widely applied special cases of (CO2') are nonparametric contexts where Ω is the class of all distributions on $\mathcal{X} = \mathbb{R}$ satisfying pregiven moment conditions. Suppose that we are interested in the corresponding model-adequacy problem (cf. (9))

$$D_\phi\!\left(\Omega^{ac}, P\right) := \inf_{Q\in\Omega^{ac}} D_\phi(Q,P) \tag{15}$$

where P is the (unknown) data-generating distribution (cf. (CO1)). Recall that in case of $P\in\Omega^{ac}$ one obtains $D_\phi(\Omega^{ac},P) = 0$, whereas for $P\notin\Omega^{ac}$ the φ-divergence minimum $D_\phi(\Omega^{ac},P)$ quantifies the adequacy of the model $\Omega^{ac}$ for modeling P; a lower $D_\phi(\Omega^{ac},P)$-value means a better adequacy. Since in the current setup the empirical distribution $P^{\mathrm{emp}}_N$ of (CO3) satisfies $P^{\mathrm{emp}}_N\perp Q$ for all $Q\in\Omega^{ac}$, we obtain (analogously to (14))

$$D_\phi\!\left(Q, P^{\mathrm{emp}}_N\right) = \phi(0) + \phi^*(0)\quad\text{for all}\ Q\in\Omega^{ac},\qquad D_\phi\!\left(\Omega^{ac}, P^{\mathrm{emp}}_N\right) := \inf_{Q\in\Omega^{ac}} D_\phi\!\left(Q, P^{\mathrm{emp}}_N\right) = \phi(0) + \phi^*(0). \tag{16}$$
Hence, statistically it makes no sense to approximate (15) by (16). Let us discuss an appropriate alternative, e.g., for the case of the reverse Kullback–Leibler divergence $D_{\phi_0}(Q,P)$ with generator $\phi_0(t) = \phi_{RKL}(t) = -\log(t) + t - 1$ (cf. (3)). By (12), we have $\phi_0(0) = \infty$ as well as $\phi_0^*(0) = 1$, and thus $\phi_0(0) + \phi_0^*(0) = \infty$, as well as (by (10), (11))

$$D_{\phi_0}\!\left(Q, P^{\mathrm{emp}}_N\right) = \int_{\{f_Q\cdot f_{P^{\mathrm{emp}}_N} > 0\}}\phi_0\!\left(\frac{f_Q(x)}{f_{P^{\mathrm{emp}}_N}(x)}\right)dP^{\mathrm{emp}}_N(x) + \infty\cdot P^{\mathrm{emp}}_N\!\left[f_Q = 0\right]$$
$$= \frac1N\sum_{i\in\{1,\dots,N\}:\ f_Q(Y_i)\cdot f_{P^{\mathrm{emp}}_N}(Y_i) > 0}\phi_0\!\left(\frac{f_Q(Y_i)}{f_{P^{\mathrm{emp}}_N}(Y_i)}\right) + \infty\cdot P^{\mathrm{emp}}_N\!\left[f_Q = 0\right] < \infty$$
where $P^{\mathrm{emp}}_N$ is the above-mentioned empirical distribution of a sample of N independent copies under P, Ω is a class of probability distributions on $(\mathcal{X},\mathcal{B})$, and $D_{\phi_1}(\Omega,P) := \inf_{Q\in\Omega} D_{\phi_1}(Q,P)$. Therefore, the Kullback–Leibler divergence measures the rate of decay of the chances for $P^{\mathrm{emp}}_N$ to belong to Ω as N increases, in case that P does not belong to Ω. Other divergences inherit the same character: assume that the function φ is the Fenchel–Legendre transform of a cumulant generating function Λ, namely $\phi(t) = \Lambda^*(t) := \sup_u\left(ut - \Lambda(u)\right)$, where $\Lambda(t) := \log E\!\left[e^{tW}\right]$ for some random variable W defined on some arbitrary space. With $(X_1,\dots,X_N)$ being an i.i.d. sample under P and $(W_1,\dots,W_N)$ being an i.i.d. sample of copies of W, we define the associated weighted empirical distribution as

$$P^W_N := \frac1N\sum_{i=1}^N W_i\,\delta_{X_i}.$$
intractable; this is, e.g., the case when the model Ω is defined by constraints on the expectation of an L-statistic (e.g., describing a tubular neighborhood of a distribution with a prescribed number of given quantiles; such constraints are not linear with respect to the underlying distribution of the data, but merely with respect to their quantile measure). In such a situation, one can transpose everything to a minimization problem for the φ-divergence between the corresponding empirical quantile measures, where the constraint can also be stated in terms of quantile measures (see Broniatowski and Decurninge, 2016).
Further examples of φ-divergences between other statistical objects can be found in Section 2.5.1.2.
(see, e.g., Csiszar, 1991; Pardo and Vajda, 1997, 2003; Stummer and Vajda, 2012), where $\phi'$ is the derivative of the supposedly differentiable φ. The class (18) includes as important special cases, e.g., the density power divergences (also known as Basu–Harris–Hjort–Jones distances, cf. Basu et al. (1998)), with the squared $L_2$-norm as a subcase. The principal types of statistical applications of OBD are basically the same as for the φ-divergences (minimum divergence estimation, robustness, etc.); however, the corresponding technical details may differ substantially.
Concerning some recent progress of divergences, Stummer (2007) as well as Stummer and Vajda (2012) introduced the concept of scaled Bregman divergences/distances SBD

$$0 \le D^{SBD}_\phi(P,Q) := D^{SBD}_{\phi,\lambda,m}(P,Q) := \int_{\mathcal{X}}\left[\phi\!\left(\frac{f_P(x)}{m(x)}\right) - \phi\!\left(\frac{f_Q(x)}{m(x)}\right) - \phi'\!\left(\frac{f_Q(x)}{m(x)}\right)\cdot\left(\frac{f_P(x)}{m(x)} - \frac{f_Q(x)}{m(x)}\right)\right] m(x)\,d\lambda(x)$$

which (by using a scaling function $m(\cdot)$) generalizes all the above-mentioned (nearly disjoint) density-based φ-divergences (17) and OBD divergences (18) at once. Hence, the SBD divergence class constitutes a quite general framework for dealing with a wide range of data analyses, in a well-structured way.
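The scaled Bregman divergence has a direct discrete implementation; the following Python sketch (ours) codes the defining integrand as a sum, with the adaptive scaling m = (p+q)/2 as one of many possible choices of the scaling function.

```python
# Sketch (ours) of the scaled Bregman divergence for discrete pmfs, with
# generator phi, its derivative dphi, and a scaling function m.
import numpy as np

def scaled_bregman(p, q, m, phi, dphi):
    a, b = p / m, q / m
    return float(np.sum((phi(a) - phi(b) - dphi(b) * (a - b)) * m))

phi  = lambda t: t * np.log(t) + 1.0 - t      # KL-type generator phi_1
dphi = lambda t: np.log(t)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.25, 0.25, 0.5])
m = 0.5 * (p + q)                             # one possible adaptive scaling
print(scaled_bregman(p, q, m, phi, dphi))     # >= 0, zero iff p == q
```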
2 The framework
2.1 Statistical functionals S and their dissimilarity
Let us assume that the modeled—respectively, observed—random data take values in a state space $\mathcal{Y}$ (with at least two distinct values), which is equipped with a system $\mathcal{A}$ of admissible events (σ-algebra). On this, we consider two probability distributions (probability measures) P and Q of interest. By appropriate choices of $(\mathcal{Y},\mathcal{A})$, such a general context also covers modeling of series of observations, functional data as well as stochastic process data (the latter by choosing $\mathcal{Y}$ as an appropriate space of paths, i.e., of whole scenarios along a set of times).
In the following, we deal with situations where—e.g., in face of the dichotomous uncertainty P versus Q—the statistical decision (inference) goal can be, e.g., expressed by means of "dissimilarity-expressing relations" $\mathcal{R}(S(P),S(Q))$ between univariate real-valued "statistical functionals" $S(\cdot)$ of the form $S(P) := \{S_x(P)\}_{x\in\mathcal{X}}$ and $S(Q) := \{S_x(Q)\}_{x\in\mathcal{X}}$ for the two distributions P and Q,^f where $\mathcal{X}$ is a set of (at least two different) "functional indices." As corresponding preliminaries, in this section we broadly discuss examples of statistical functionals which we shall employ later on to recover known—respectively create new—divergences between them.
In principle, one can distinguish between unit-free (e.g., "percentage-type") functionals $S(\cdot)$ and unit-dependent (e.g., monetary) functionals $S(\cdot)$. For the real line $\mathcal{Y} = \mathcal{X} = \mathbb{R}$, the most prominent examples for the former are the cumulative distribution functions (cdf) $\{S_x(P)\}_{x\in\mathbb{R}} := \{F_P(x)\}_{x\in\mathbb{R}} := \{P[\,]{-\infty},x]\,]\}_{x\in\mathbb{R}} =: S^{cd}(P)$, and the survival functions (suf) $\{S_x(P)\}_{x\in\mathbb{R}} := \{1 - F_P(x)\}_{x\in\mathbb{R}} := \{P[\,[x,\infty[\,]\}_{x\in\mathbb{R}} =: S^{su}(P)$ (which are also called reliability
f The statistical functional $S(\cdot)$ can also be thought of as a function-valued "plug-in statistic," respectively as a real-valued function on $\mathcal{X}$ which carries a probability-distribution-valued parameter; accordingly, $S(P)$ and $S(Q)$ are two different functions corresponding to the two different parameter constellations P, Q; accordingly, $S_x(P)$ and $S_x(Q)$ are the corresponding function values at $x\in\mathcal{X}$.
g
There are also versions allowing for negative values, not discussed here.
h Notice that this kind of outlyingness concept is intrinsic (with respect to P), as opposed to the "relative outlyingness" defined as a degree of mismatch between the frequency of certain data-observation points compared to the corresponding (very much lower) modeling frequency; see, e.g., Lindsay (1994), Basu et al. (2011), and the corresponding flexibilization in Kißlinger and Stummer (2016).
In the univariate case, $R_P^{(1)}(x) = R_P(x) = 2F_P(x) - 1$ and $Q_P^{(1)}(x) = F_P^{-1}\!\left(\frac{1+x}{2}\right)$, and thus there are the consistencies $S^{cr,1}(P) := \left\{R_P^{(1)}(x)\right\}_{x\in\mathbb{R}} = S^{cr}(P)$ and $S^{cqu,1}(P) := \left\{Q_P^{(1)}(x)\right\}_{x\in[-1,1]} = S^{cqu}(P)$.
There are also several other different approaches to define multidimensional analogues of quantile functions, see, e.g., Serfling (2002, 2010), Galichon and Henry (2012), and Faugeras and Rüschendorf (2017). All those multivariate quantile functions are also covered by our divergence toolkit, component-wise (with subsequent aggregation).
Let us finally mention that for a general state space $\mathcal{Y}$, as unit-free statistical functionals one can also take for instance families $\{S_x(P)\}_{x\in\mathcal{X}} := \{P[E_x]\}_{x\in\mathcal{X}}$ of probabilities of some particularly selected concrete events $E_x\in\mathcal{A}$ of purpose-driven interest, where $\mathcal{X}$ is some set of indices.
As needed later on, notice that these statistical functionals $S(P) = \{S_x(P)\}_{x\in\mathcal{X}}$ have the following different ranges $\mathcal{R}(S(P))$: $\mathcal{R}(S^{cd}(P)) = \mathcal{R}(S^{su}(P)) = \mathcal{R}(S^{pm}(P))\subseteq[0,1]$, $\mathcal{R}(S^{pd}(P))\subseteq[0,\infty[$, $\mathcal{R}(S^{mg}(P))\subseteq[0,\infty]$, $\mathcal{R}(S^{qu}(P))\subseteq\,]-\infty,\infty[$ (respectively $\mathcal{R}(S^{qu}(P))\subseteq[0,\infty[$ for nonnegative random variables $Y\ge0$), $\mathcal{R}(S^{de}(P))\subseteq[0,\infty[$, $\mathcal{R}(S^{ou}(P))\subseteq[0,\infty]$, $\mathcal{R}(S^{cr,i}(P))\subseteq[-1,1]$, $\mathcal{R}(S^{cqu,i}(P))\subseteq\,]-\infty,\infty[$ ($i\in\{1,\dots,d\}$), and $\mathcal{R}(S^{\breve\lambda,\breve{S}}(P))$ depends on the choice of $\breve\lambda$ and $\breve{S}$.
The above-mentioned "dissimilarity-expressing functional relations" $\mathcal{R}(S(P),S(Q))$ can typically be of (i) numerical nature or (ii) graphical/plotting nature, or hybrids thereof. As far as (i) is concerned, for fixed $x\in\mathcal{X}$ the dissimilarity between the real-valued $S_x(P)$ and $S_x(Q)$ can be expressed by (weighted) ratios close to 1, (weighted) differences close to 0, and combinations thereof; this information on "pointwise" dissimilarities can then be compressed to a single real number, e.g., by means of aggregation (weighted summation, weighted integration, etc.) over x, or by taking the maximum respectively minimum values with respect to x. In contrast, for $\mathcal{X} = \mathbb{R}$ one widespread tool for (ii) is to draw a two-dimensional scatterplot $(S_x(P), S_x(Q))_{x\in\mathcal{X}}$ and evaluate—visually by eyeballing or quantitatively—the dissimilarity in terms of sizes of deviations from the equality-expressing diagonal $(t,t)$. In the above-mentioned special case of $S_x(P) = F_P(x) = P[\,]{-\infty},x]\,]$, $S_x(Q) = F_Q(x) = Q[\,]{-\infty},x]\,]$ this leads to the well-known "Probability–Probability Plot" (PP-Plot), whereas the choice $S_x(P) = F_P^{-1}(x) = \inf\{z\in\mathbb{R} : F_P(z)\ge x\}$, $S_x(Q) = F_Q^{-1}(x) = \inf\{z\in\mathbb{R} : F_Q(z)\ge x\}$ amounts to the very frequently used "Quantile–Quantile Plot" (QQ-Plot). Moreover, the choice $S_x(P) = D_P(x)$ and $S_x(Q) = D_Q(x)$ for some P-based and Q-based depth functions generates the DD-Plot in the sense of Liu et al. (1999).
As already mentioned above, we follow the line (i), in terms of divergences.
division by $m_1(x) = 0$ for some x). Moreover, for any fixed $c\in[0,1]$ the (finite) function $\phi'_{+,c}:\,]a,b[\,\to\,]-\infty,\infty[$ is well-defined by $\phi'_{+,c}(t) := c\cdot\phi'_+(t) + (1-c)\cdot\phi'_-(t)$, where $\phi'_+(t)$ denotes the (always finite) right-hand derivative of φ at the point $t\in\,]a,b[$ and $\phi'_-(t)$ the (always finite) left-hand derivative of φ at $t\in\,]a,b[$. If $\phi\in\Phi(]a,b[)$ is also continuously differentiable—which we denote by $\phi\in\Phi_{C_1}(]a,b[)$—then for all $c\in[0,1]$ one gets $\phi'_{+,c}(t) = \phi'(t)$ ($t\in\,]a,b[$), and in such a situation we also suppress $+$ as well as $c$ in all the following expressions. We also employ the continuous continuation $\overline{\phi'_{+,c}}: [a,b]\to[-\infty,\infty]$ given by $\overline{\phi'_{+,c}}(t) := \phi'_{+,c}(t)$ ($t\in\,]a,b[$), $\overline{\phi'_{+,c}}(a) := \lim_{t\downarrow a}\phi'_{+,c}(t)$, $\overline{\phi'_{+,c}}(b) := \lim_{t\uparrow b}\phi'_{+,c}(t)$. To explain the precise meaning of (19), we also make use of the (finite, nonnegative) function $\psi_{\phi,c}:\,]a,b[\,\times\,]a,b[\,\to[0,\infty[$ given by $\psi_{\phi,c}(s,t) := \phi(s) - \phi(t) - \phi'_{+,c}(t)\cdot(s-t)\ge0$ ($s,t\in\,]a,b[$). To extend this to a lower semicontinuous function $\overline{\psi}_{\phi,c}: [a,b]\times[a,b]\to[0,\infty]$, we proceed as follows: first, we set $\overline{\psi}_{\phi,c}(s,t) := \psi_{\phi,c}(s,t)$ for all $s,t\in\,]a,b[$. Moreover, since for fixed $t\in\,]a,b[$ the function $s\mapsto\psi_{\phi,c}(s,t)$ is convex and continuous, the limit $\overline{\psi}_{\phi,c}(a,t) := \lim_{s\to a}\psi_{\phi,c}(s,t)$ always exists and (in order to avoid overlines in (19)) will be interpreted/abbreviated as $\phi(a) - \phi(t) - \phi'_{+,c}(t)\cdot(a-t)$. Analogously, for fixed $t\in\,]a,b[$ we set $\overline{\psi}_{\phi,c}(b,t) := \lim_{s\to b}\psi_{\phi,c}(s,t)$, with the corresponding short-hand notation $\phi(b) - \phi(t) - \phi'_{+,c}(t)\cdot(b-t)$. Furthermore, for fixed $s\in\,]a,b[$ we interpret $\phi(s) - \phi(a) - \phi'_{+,c}(a)\cdot(s-a)$ as

$$\overline{\psi}_{\phi,c}(s,a) := \phi(s) - \phi'_{+,c}(a)\cdot s + \lim_{t\to a}\left(t\cdot\phi'_{+,c}(a) - \phi(t)\right)\cdot 1_{]-\infty,\infty[}\!\left(\phi'_{+,c}(a)\right) + \infty\cdot 1_{\{-\infty\}}\!\left(\phi'_{+,c}(a)\right),$$

where the involved limit always exists but may be infinite. Analogously, for fixed $s\in\,]a,b[$ we interpret $\phi(s) - \phi(b) - \phi'_{+,c}(b)\cdot(s-b)$ as

$$\overline{\psi}_{\phi,c}(s,b) := \phi(s) - \phi'_{+,c}(b)\cdot s + \lim_{t\to b}\left(t\cdot\phi'_{+,c}(b) - \phi(t)\right)\cdot 1_{]-\infty,\infty[}\!\left(\phi'_{+,c}(b)\right) + \infty\cdot 1_{\{+\infty\}}\!\left(\phi'_{+,c}(b)\right),$$

where again the involved limit always exists but may be infinite. Finally, we always set $\overline{\psi}_{\phi,c}(a,a) := 0$, $\overline{\psi}_{\phi,c}(b,b) := 0$, and $\overline{\psi}_{\phi,c}(a,b) := \lim_{s\to a}\overline{\psi}_{\phi,c}(s,b)$, $\overline{\psi}_{\phi,c}(b,a) := \lim_{s\to b}\overline{\psi}_{\phi,c}(s,a)$. Notice that $\overline{\psi}_{\phi,c}$ is lower-semicontinuous but not necessarily continuous. Since ratios are ultimately involved, we also consistently take $\overline{\psi}_{\phi,c}\!\left(\frac{0}{0},\frac{0}{0}\right) := 0$.
With (I1) and (I2), we define the BS divergence (BS distance) of (19) precisely as
$$0 \le D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P),S(Q)) = \int_{\mathcal{X}}\psi_{\phi,c}\!\left(\frac{S_x(P)}{m_1(x)},\frac{S_x(Q)}{m_2(x)}\right)m_3(x)\,d\lambda(x) \tag{20}$$
$$:= \int_{\mathcal{X}}\overline{\psi}_{\phi,c}\!\left(\frac{S_x(P)}{m_1(x)},\frac{S_x(Q)}{m_2(x)}\right)m_3(x)\,d\lambda(x), \tag{21}$$

but we mostly use the less clumsy notation with $\int$ given in (19), (20) henceforth, as a shortcut for the implicitly involved boundary behavior.
As a side remark, let us mention that we could further generalize (19) by adapting the wider divergence concept of Stummer and Kißlinger (2017), who also deal even with nonconvex nonconcave divergence generators φ; for the sake of brevity, this is omitted here.
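For continuously differentiable generators, the interior part of $\psi_{\phi,c}$ reduces to the plain Bregman integrand; the following small Python check (ours) evaluates it for $\phi_2$ on a grid and confirms the nonnegativity used in (20); the boundary continuations described before (20) would be added via the stated limits.

```python
# Check (ours) of psi_{phi,c}(s,t) = phi(s) - phi(t) - phi'(t)(s - t) >= 0
# for a smooth generator (so c is suppressed), here phi_2 on ]0, oo[.
import numpy as np

phi  = lambda t: (t - 1.0)**2 / 2.0
dphi = lambda t: t - 1.0

def psi(s, t):
    return phi(s) - phi(t) - dphi(t) * (s - t)   # equals (s - t)**2 / 2

grid = np.linspace(0.1, 3.0, 50)
S, T = np.meshgrid(grid, grid)
print(np.all(psi(S, T) >= 0))                    # True: convexity of phi
```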
Notice that by construction one has the following important assertion (cf. Broniatowski and Stummer, 2019):
Theorem 2. Let $\phi\in\Phi(]a,b[)$ and $c\in[0,1]$. Then there holds $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P),S(Q))\ge0$ (i.e., the above-mentioned desired property (D1) is satisfied). Moreover, $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P),S(Q)) = 0$ if $\frac{S_x(P)}{m_1(x)} = \frac{S_x(Q)}{m_2(x)}$ for λ-almost all $x\in\mathcal{X}$. Depending on the concrete situation, $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P),S(Q))$ may take infinite value.
– φ is strictly convex at t;
– if φ is differentiable at t and $s\ne t$, then φ is not affine-linear on the interval $[\min(s,t),\max(s,t)]$ (i.e., between t and s);
Formally, by defining the integral functional $g_{\phi,m_3,\lambda}(\xi) := \int_{\mathcal{X}}\phi(\xi(x))\,m_3(x)\,d\lambda(x)$ and plugging in, e.g., $g_{\phi,m_3,\lambda}\!\left(\frac{S(P)}{m_1}\right) = \int_{\mathcal{X}}\phi\!\left(\frac{S_x(P)}{m_1(x)}\right)m_3(x)\,d\lambda(x)$, the divergence in (24) can be interpreted as

$$0 \le D_{\phi,m_1,m_2,m_3,\lambda}(S(P),S(Q)) = g_{\phi,m_3,\lambda}\!\left(\frac{S(P)}{m_1}\right) - g_{\phi,m_3,\lambda}\!\left(\frac{S(Q)}{m_2}\right) - g'_{\phi,m_3,\lambda}\!\left(\frac{S(Q)}{m_2};\ \frac{S(P)}{m_1} - \frac{S(Q)}{m_2}\right), \tag{25}$$
$$\phi_1(t) := \lim_{\alpha\to1}\phi_\alpha(t) = \tilde\phi_1(t) - \tilde\phi'_1(1)\cdot(t-1) = t\log t + 1 - t\ \in[0,\infty[,\qquad t\in\,]0,\infty[, \tag{30}$$
$$\tilde\phi_0(t) := \lim_{\alpha\to0}\tilde\phi_\alpha(t) = -\log t\ \in\ ]-\infty,\infty[,\qquad t\in\,]0,\infty[,$$
2.5 The scaling and the aggregation functions m1, m2, and m3
In the above two Sections 2.3 and 2.4, we have presented special cases of the first and the last component of the "divergence parameter" β = (φ, m1, m2, m3, λ), whereas now we focus on m1, m2, and m3. To start with, in accordance with (19), the aggregation function m3 tunes the fine aggregation details (recall that λ governs the principal aggregation structure). Moreover, the function $m_1(\cdot)$ scales the statistical functional S(P) evaluated at P, and $m_2(\cdot)$ the same statistical functional S(Q) evaluated at Q. From a modeling perspective, these two scaling functions can, e.g., be
• "purely direct" in the sense that $m_1(x)$, $m_2(x)$ are chosen to directly reflect some dependence on the index-state $x\in\mathcal{X}$ (independent of the choice of S), or
• "purely adaptive" in the sense that $m_1(x) = w_1(S_x(P), S_x(Q))$, $m_2(x) = w_2(S_x(P), S_x(Q))$ for some appropriate (measurable) "connector functions" $w_1$, $w_2$ on the product $\mathcal{R}(S(P))\times\mathcal{R}(S(Q))$ of the ranges of $\{S_x(P)\}_{x\in\mathcal{X}}$ and $\{S_x(Q)\}_{x\in\mathcal{X}}$, or
• "hybrids": $m_1(x) = w_1(x, S_x(P), S_x(Q))$, $m_2(x) = w_2(x, S_x(P), S_x(Q))$.
$$\ldots = \sum_{x\in\mathcal{X}}\left[\phi\!\left(\frac{S_x(P)}{m(x)}\right) - \phi\!\left(\frac{S_x(Q)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{S_x(Q)}{m(x)}\right)\cdot\left(\frac{S_x(P)}{m(x)} - \frac{S_x(Q)}{m(x)}\right)\right]\cdot m(x)\cdot r(x). \tag{34}$$
Remark 1. (a) In a context of "λ-probability-density functions" with general $\mathcal{X}$ and $P[\cdot] := \int_{\cdot} f_P(x)\,d\lambda(x)$, $Q[\cdot] := \int_{\cdot} f_Q(x)\,d\lambda(x)$ satisfying $P[\mathcal{X}] = Q[\mathcal{X}] = 1$, one can take the statistical functionals $S^{\lambda pd}_x(P) := f_P(x)\ge0$, $S^{\lambda pd}_x(Q) := f_Q(x)\ge0$; accordingly, for $r(x)\equiv1$ (abbreviated as the function 1 with constant value 1) and $M[\cdot] := \int_{\cdot} m(x)\,d\lambda(x)$, the divergence (33) can be interpreted as^i
i In a context where P and Q are risk distributions (e.g., Q is a pregiven reference one), the SBD $B_\phi(P,Q\,|\,M)$ can be interpreted as a risk excess of P over Q (or vice versa), in contrast to Faugeras and Rüschendorf (2018), who use hemimetrics rather than divergences.
$$0 \le D^c_{\phi,m,m,1\cdot m,\lambda}\!\left(S^{\lambda pd}(P), S^{\lambda pd}(Q)\right) = \int_{\mathcal{X}}\left[\phi\!\left(\frac{f_P(x)}{m(x)}\right) - \phi\!\left(\frac{f_Q(x)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{f_Q(x)}{m(x)}\right)\cdot\left(\frac{f_P(x)}{m(x)} - \frac{f_Q(x)}{m(x)}\right)\right] m(x)\,d\lambda(x) =: B_\phi(P,Q\,|\,M), \tag{35}$$

where the scaled Bregman divergence $B_\phi(P,Q\,|\,M)$ has been first defined in Stummer (2007) and Stummer and Vajda (2012); see also Kißlinger and Stummer (2013, 2015, 2016) for the "purely adaptive" case $m(x) = w(f_P(x), f_Q(x))$ and indications on nonprobability measures. Notice that this directly subsumes, for $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, the "classical density" functional $S^{\lambda pd}(\cdot) = S^{pd}(\cdot)$ with the choice $\lambda = \lambda_L$ (and the Riemann integration $d\lambda_L(x) = dx$), as well as, for the discrete setup $\mathcal{Y} = \mathcal{X} = \mathcal{X}_\#$, the "classical probability mass" functional $S^{\lambda pd}(\cdot) = S^{pm}(\cdot)$ with the choice $\lambda = \lambda_\#$ (recall $\lambda_\#[\{x\}] = 1$ for all $x\in\mathcal{X}_\#$); for the latter, the divergence (35) reads as

$$0 \le D^c_{\phi,m,m,1\cdot m,\lambda_\#}\!\left(S^{\lambda_\# pd}(P), S^{\lambda_\# pd}(Q)\right) = D^c_{\phi,m,m,1\cdot m,\lambda_\#}\!\left(S^{pm}(P), S^{pm}(Q)\right) = \sum_{x\in\mathcal{X}_\#}\left[\phi\!\left(\frac{p_P(x)}{m(x)}\right) - \phi\!\left(\frac{p_Q(x)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{p_Q(x)}{m(x)}\right)\cdot\left(\frac{p_P(x)}{m(x)} - \frac{p_Q(x)}{m(x)}\right)\right] m(x) =: B^\#_\phi(P,Q\,|\,M). \tag{36}$$
For the important special case of the above-mentioned power-function-type generator $\phi(t) := \phi_\alpha(t) = \frac{t^\alpha - \alpha t + \alpha - 1}{\alpha(\alpha-1)}$ ($\alpha\in\,]0,\infty[\setminus\{1\}$), Roensch and Stummer (2019a) (see also Ghosh and Basu (2016b) for the unscaled special case $m(x)\equiv1$) employed the corresponding scaled Bregman divergences (35) in order to obtain robust minimum-divergence-type parameter estimates for the setup of sequences of independent random variables whose distributions are nonidentical but linked by a common (scalar or multidimensional) parameter; this is, e.g., important in the context of generalized linear models (GLMs), which are omnipresent in statistics, artificial intelligence and machine learning.
Returning to the general framework, for the important special case α = 2, leading to the above-mentioned generator $\phi_2(t) := \frac{(t-1)^2}{2}$, the scaled Bregman divergences (35) and (36) turn into

$$0 \le B_{\phi_2}(P,Q\,|\,M) = \int_{\mathcal{X}}\frac{\left(f_P(x) - f_Q(x)\right)^2}{2\,m(x)}\,d\lambda(x)$$
and

$$0 \le B^\#_{\phi_2}(P,Q\,|\,M) = \sum_{x\in\mathcal{X}_\#}\frac{\left(p_P(x) - p_Q(x)\right)^2}{2\,m(x)}. \tag{37}$$
(c) In a context of mortality data analytics (which is essential for the calculation of insurance premiums, financial reserves, annuities, pension benefits, various benefits of social insurance programs, etc.), the divergence (34) (with r(x) ≡ 1) has been employed by Krömer and Stummer (2019) in order to achieve a realistic representation of mortality rates by smoothing and error-correcting of crude rates; there, $\mathcal{X}$ is a set of ages (in years), $S_x(P)$ is the so-called data-based crude annual mortality rate at age x, $S_x(Q)$ is an—optimally determinable—candidate model member (out of a parametric or nonparametric model) for the unknown true annual mortality rate at age x, and m(x) is an appropriately chosen scaling at x.
This concludes the current Remark 1.
$$0 \le D_{\phi_2,1,1,r\cdot1,\lambda}\!\left(S^{Q,S^{cd}}(P), S^{Q,S^{cd}}(Q)\right) = \int_{\mathbb{R}}\frac{r(x)}{2}\left[S^{Q,S^{cd}}_x(P) - S^{Q,S^{cd}}_x(Q)\right]^2 d\lambda(x), \tag{51}$$

$$0 \le D_{\phi_1,1,1,1,\lambda}\!\left(S^{\lambda pd}(P), S^{\lambda pd}(Q)\right) = \int_{\mathcal{X}}\left[f_P(x)\cdot\log\frac{f_P(x)}{f_Q(x)} + f_Q(x) - f_P(x)\right]d\lambda(x), \tag{52}$$
Analogously to the paragraph after (46), one can recommend here to exclude $\alpha\le0$ whenever $f_P(x) = 0$ for all x in some A with $\lambda[A] > 0$, respectively $\alpha\ge1$ whenever $f_Q(x) = 0$ for all x in some $\tilde{A}$ with $\lambda[\tilde{A}] > 0$. As far as the splitting of the integral, e.g., in (43) resp. (45) is concerned, notice that the integral $\int_{\mathcal{X}}\left[f_Q(x) - f_P(x)\right]d\lambda(x) = 1 - 1 = 0$, but $\int_{\mathcal{X}}\left[\frac{f_P(x)}{f_Q(x)} - 1\right]d\lambda(x)$ may be infinite (take, e.g., $\mathcal{X} = [0,\infty[$, $\lambda = \lambda_L$, and the exponential distribution density functions $f_P(x) := c_1\exp(-c_1x)$, $f_Q(x) := c_2\exp(-c_2x)$ with $0 < c_1 < c_2$). The choice α > 0 in (51) coincides with the "order-α" density power divergences DPD of Basu et al. (1998); for their statistical applications see, e.g., Basu et al. (2015), Ghosh and Basu (2016a,b), and the references therein, and for general $\alpha\in\mathbb{R}$ see, e.g., Stummer and Vajda (2012).
The divergence (52) is the celebrated "Kullback–Leibler information divergence KL" between $f_P$ and $f_Q$ (respectively between P and Q); alternatively, instead of KL one often uses the terminology "relative entropy." The divergence (54) (cf. α = 2) is nothing but half of the squared $L_2$-distance between the two λ-density functions $f_P(\cdot)$ and $f_Q(\cdot)$.
Notice that for the classical case $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, r(x) ≡ 1, λ = λ_L—where one has $S^{\lambda pd}(P) = S^{pd}(P)$ and $F_P(x) = \int_{-\infty}^x f_P(z)\,d\lambda_L(z)$—(51) is essentially different from (40) with $S(P) = S^{cd}(P)$, $S(Q) = S^{cd}(Q)$, which is explicitly of the "doubly aggregated form"

$$0 \le D_{\phi_\alpha,1,1,1,\lambda}\!\left(S^{cd}(P), S^{cd}(Q)\right) = \int_{\mathbb{R}}\frac{1}{\alpha(\alpha-1)}\left[\left(\int_{-\infty}^x f_P(z)\,d\lambda_L(z)\right)^{\!\alpha} + (\alpha-1)\left(\int_{-\infty}^x f_Q(z)\,d\lambda_L(z)\right)^{\!\alpha} - \alpha\left(\int_{-\infty}^x f_P(z)\,d\lambda_L(z)\right)\left(\int_{-\infty}^x f_Q(z)\,d\lambda_L(z)\right)^{\!\alpha-1}\right]d\lambda_L(x),\quad\text{for}\ \alpha\in\mathbb{R}\setminus\{0,1\},$$
where again one should exclude α ≤ 0 whenever $p_P(x) = 0$ for all x in some A with $\lambda^{\#}[A] > 0$, respectively α ≥ 1 whenever $p_Q(x) = 0$ for all x in some $\widetilde{A}$ with $\lambda^{\#}[\widetilde{A}] > 0$. For example, take the context from the paragraph right before (36), with discrete random variable Y, $p_Q(x) = Q[Y = x]$, $p_P(x) = p_{P^{emp}_N}(x)$. Then, the divergences $2N \cdot D_{\phi_\alpha,1,1,1,\lambda^{\#}}\big(S^{pm}(P^{emp}_N),\, S^{pm}(Q)\big)$ (for $\alpha \in \mathbb{R}$) can be used as goodness-of-fit test statistics; see, e.g., Kißlinger and Stummer (2016) for their limit behavior as the sample size N tends to infinity.
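To make this goodness-of-fit use concrete, the following minimal sketch (not taken from the chapter; the model pmf and sample are illustrative assumptions) evaluates $2N \cdot D_{\phi_\alpha}$ between an empirical probability mass function and a candidate model pmf on a finite support; for α = 2 this statistic coincides with the classical Pearson chi-square statistic.

```python
# Minimal sketch (not from the chapter): the unscaled power divergence
# D_{phi_alpha}(p, q) = sum_x [ p(x)^alpha q(x)^(1-alpha) - alpha p(x)
#                               + (alpha-1) q(x) ] / (alpha (alpha-1))
# between an empirical pmf and a candidate model pmf; 2N times this value can be
# used as a goodness-of-fit test statistic.
import numpy as np

def power_divergence(p, q, alpha):
    """D_{phi_alpha} between two pmfs on a common finite support (alpha != 0, 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = (p > 0) & (q > 0)      # cells with zero mass would need extra boundary terms
    core = p[mask]**alpha * q[mask]**(1 - alpha) - alpha * p[mask] + (alpha - 1) * q[mask]
    return core.sum() / (alpha * (alpha - 1))

rng = np.random.default_rng(0)
q_model = np.array([0.2, 0.3, 0.5])                     # hypothetical model pmf
sample = rng.choice(3, size=500, p=[0.25, 0.3, 0.45])   # data from a nearby pmf
p_emp = np.bincount(sample, minlength=3) / sample.size

N = sample.size
for alpha in (0.5, 2.0):        # alpha = 2: 2N*D equals the Pearson chi-square statistic
    print(alpha, 2 * N * power_divergence(p_emp, q_model, alpha))
```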
Classical quantile functions. The divergence (38) with S(P) ¼ Squ(P),
S(Q) ¼ Squ(Q) can be interpreted as a quantitative measure of tail risk of P,
relative to some pregiven reference distribution Q.j
Especially, for $\mathcal{Y} = \mathbb{R}$ and $\mathcal{X} =\, ]0,1[$, $S_x(P) = S^{qu}_x(P) = F_P^{\leftarrow}(x)$, $S_x(Q) = S^{qu}_x(Q) = F_Q^{\leftarrow}(x)$, and the Lebesgue measure λ = λ_L (with the usual dλ_L(x) = dx), we get from (46) the special case

$0 \le 2\, D_{\phi_2,1,1,r\cdot\mathbb{1},\lambda}\big(S^{qu}(P),\, S^{qu}(Q)\big) = \int_{]0,1[} \big(F_P^{\leftarrow}(x) - F_Q^{\leftarrow}(x)\big)^{2}\, d\lambda_L(x) \qquad (55)$
which is nothing but the 2-Wasserstein distance between the two probability measures P and Q. Corresponding connections with optimal transport will be discussed in Section 2.7. Notice that (55) does not generally coincide with its analogue

$2\, D_{\phi_2,1,1,r\cdot\mathbb{1},\lambda}\big(S^{cd}(P),\, S^{cd}(Q)\big) = \int_{\mathbb{R}} \big(F_P(x) - F_Q(x)\big)^{2}\, d\lambda_L(x)\,; \qquad (56)$

to see this, take, e.g., $0 < c_2 < c_1$ (e.g., $c_1 = 2$, $c_2 = 1$) and the exponential quantile functions $F_P^{\leftarrow}(x) = -\frac{1}{c_1}\log(1-x)$, $F_Q^{\leftarrow}(x) = -\frac{1}{c_2}\log(1-x)$ ($x \in\, ]0,1[$), for which (55) becomes $2\,\big(\frac{1}{c_2} - \frac{1}{c_1}\big)^{2}$, whereas for the corresponding exponential distribution functions $F_P(x) = (1 - \exp(-c_1 x))\cdot\mathbb{1}_{]0,\infty[}(x)$ and $F_Q(x) = (1 - \exp(-c_2 x))\cdot\mathbb{1}_{]0,\infty[}(x)$ the divergence (56) becomes $\frac{1}{2c_1} - \frac{2}{c_1 + c_2} + \frac{1}{2c_2}$.
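As a quick plausibility check (not part of the chapter, and relying on the reconstructed formulas above), both closed-form values can be reproduced numerically:

```python
# Minimal numerical check (not from the chapter) of the two displayed quantities for
# exponential laws with rates c1 > c2: the quantile-based quantity (55) and the
# cdf-based L2-type quantity (56).
import numpy as np
from scipy.integrate import quad

c1, c2 = 2.0, 1.0     # 0 < c2 < c1, as in the text's example

q_P = lambda u: -np.log(1 - u) / c1          # quantile function of Exp(c1)
q_Q = lambda u: -np.log(1 - u) / c2          # quantile function of Exp(c2)
val_55, _ = quad(lambda u: (q_P(u) - q_Q(u))**2, 0, 1)
print(val_55, 2 * (1/c2 - 1/c1)**2)          # should agree with the closed form for (55)

F_P = lambda x: 1 - np.exp(-c1 * x)
F_Q = lambda x: 1 - np.exp(-c2 * x)
val_56, _ = quad(lambda x: (F_P(x) - F_Q(x))**2, 0, np.inf)
print(val_56, 1/(2*c1) - 2/(c1 + c2) + 1/(2*c2))   # should agree with the closed form for (56)
```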
Depth, outlyingness, centered rank and centered quantile functions. As a special case one gets

$D^{c}_{\phi,1,1,r\cdot\mathbb{1},\lambda_L}\big(S^{de}(P),\, S^{de}(Q)\big)$,
$D^{c}_{\phi,1,1,r\cdot\mathbb{1},\lambda_L}\big(S^{ou}(P),\, S^{ou}(Q)\big)$,
$\sum_{i=1}^{d} D^{c}_{\phi,1,1,r\cdot\mathbb{1},\lambda_L}\big(S^{cr,i}(P),\, S^{cr,i}(Q)\big)$,  (57)
$\sum_{i=1}^{d} D^{c}_{\phi,1,1,r\cdot\mathbb{1},\lambda_L}\big(S^{cqu,i}(P),\, S^{cqu,i}(Q)\big)$,  (58)

j Hence, such a divergence represents an alternative to Faugeras and Rüschendorf (2018), where they use hemimetrics (which, e.g., have only a weak-identity property, but satisfy a triangle inequality) rather than divergences.
all of which have not appeared elsewhere before (up to our knowledge); recall that the respective domains of ϕ have to take care of the ranges $\mathcal{R}(S^{de}(P)) \subseteq [0,\infty[$, $\mathcal{R}(S^{ou}(P)) \subseteq [0,\infty[$, $\mathcal{R}(S^{cr,i}(P)) \subseteq [-1,1]$, $\mathcal{R}(S^{cqu,i}(P)) \subseteq\, ]-\infty,\infty[$ ($i \in \{1,\ldots,d\}$). Notice that these divergences differ structurally from the Bregman distances of Hallin (2018), who uses the centered rank function $R_P(\cdot)$ (also called center-outward distribution function) as a multidimensional (in general not additively separable) generator ϕ, and not as points between which the distance is to be measured.
with $\phi^{*}(0) := \lim_{u \downarrow 0} u\,\phi(1/u) = \lim_{v \to \infty} \phi(v)/v$. In case of $\int_{\mathcal{X}} S_x(Q)\, r(x)\, d\lambda(x) < \infty$, the divergence (61) becomes

$0 \le D^{c}_{\phi,\,S(Q),\,S(Q),\,r\cdot S(Q),\,\lambda}\big(S(P),\, S(Q)\big)
= \int_{\mathcal{X}} r(x)\,\Big[\, S_x(Q)\,\phi\Big(\tfrac{S_x(P)}{S_x(Q)}\Big) - \phi'_{+,c}(1)\,\big(S_x(P) - S_x(Q)\big) \Big]\,\mathbb{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)
+ \big[\phi^{*}(0) - \phi'_{+,c}(1)\big] \int_{\mathcal{X}} r(x)\, S_x(P)\,\mathbb{1}_{\{0\}}\big(S_x(Q)\big)\, d\lambda(x)
+ \big[\phi(0) + \phi'_{+,c}(1)\big] \int_{\mathcal{X}} r(x)\, S_x(Q)\,\mathbb{1}_{\{0\}}\big(S_x(P)\big)\, d\lambda(x)
- \phi(1) \int_{\mathcal{X}} r(x)\, S_x(Q)\, d\lambda(x). \qquad (62)$
Moreover, in case of $\phi(1) = 0$ and $\int_{\mathcal{X}} \big(S_x(P) - S_x(Q)\big)\, r(x)\, d\lambda(x) \in\, ]-\infty,\infty[$ (but not necessarily $\int_{\mathcal{X}} S_x(P)\, r(x)\, d\lambda(x) < \infty$, $\int_{\mathcal{X}} S_x(Q)\, r(x)\, d\lambda(x) < \infty$), the divergence (61) turns into

$0 \le D^{c}_{\phi,\,S(Q),\,S(Q),\,r\cdot S(Q),\,\lambda}\big(S(P),\, S(Q)\big)
= \int_{\mathcal{X}} r(x)\, S_x(Q)\,\phi\Big(\tfrac{S_x(P)}{S_x(Q)}\Big)\,\mathbb{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)
+ \phi^{*}(0) \int_{\mathcal{X}} r(x)\, S_x(P)\,\mathbb{1}_{\{0\}}\big(S_x(Q)\big)\, d\lambda(x) \qquad (63)
+ \phi(0) \int_{\mathcal{X}} r(x)\, S_x(Q)\,\mathbb{1}_{\{0\}}\big(S_x(P)\big)\, d\lambda(x)
- \phi'_{+,c}(1) \int_{\mathcal{X}} r(x)\,\big(S_x(P) - S_x(Q)\big)\, d\lambda(x).$
k
And thus, c becomes obsolete.
context "naturally," then one should be aware that Ξ may become negative depending on the involved set-up; for a counter-example see, e.g., Stummer and Vajda (2010).
An important generator-concerning example is the power-function (limit) case ϕ = ϕ_α with $\alpha \in \mathbb{R}$ (cf. (26), (30), (31), (28)) under the constraint $\int_{\mathcal{X}} \big(S_x(P) - S_x(Q)\big)\, r(x)\, d\lambda(x) \in\, ]-\infty,\infty[$. Accordingly, the "implicit-boundary-describing" divergence (60) and the corresponding "explicit-boundary" version (63) turn into the generalized power divergences of order α (cf. Stummer and Vajda (2010) for r(x) ≡ 1)
$0 \le D_{\phi_\alpha,\,S(Q),\,S(Q),\,r\cdot S(Q),\,\lambda}\big(S(P),\, S(Q)\big)
= \int_{\mathcal{X}} \frac{1}{\alpha(\alpha-1)} \bigg[ \Big(\frac{S_x(P)}{S_x(Q)}\Big)^{\alpha} - \alpha\,\frac{S_x(P)}{S_x(Q)} + \alpha - 1 \bigg]\, S_x(Q)\, r(x)\, d\lambda(x) \qquad (65)$

$= \frac{1}{\alpha(\alpha-1)} \int_{\mathcal{X}} r(x)\, S_x(Q) \bigg[ \Big(\frac{S_x(P)}{S_x(Q)}\Big)^{\alpha} - \alpha\,\frac{S_x(P)}{S_x(Q)} + \alpha - 1 \bigg]\,\mathbb{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)
+ \phi_\alpha^{*}(0) \int_{\mathcal{X}} r(x)\, S_x(P)\,\mathbb{1}_{\{0\}}\big(S_x(Q)\big)\, d\lambda(x)
+ \phi_\alpha(0) \int_{\mathcal{X}} r(x)\, S_x(Q)\,\mathbb{1}_{\{0\}}\big(S_x(P)\big)\, d\lambda(x)$

$= \frac{1}{\alpha(\alpha-1)} \int_{\mathcal{X}} r(x)\,\Big[ S_x(P)^{\alpha}\, S_x(Q)^{1-\alpha} - S_x(Q) \Big]\,\mathbb{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)
+ \frac{1}{1-\alpha} \int_{\mathcal{X}} r(x)\,\big(S_x(P) - S_x(Q)\big)\, d\lambda(x)
+ \infty \cdot \mathbb{1}_{]1,\infty[}(\alpha) \int_{\mathcal{X}} r(x)\, S_x(P)\,\mathbb{1}_{\{0\}}\big(S_x(Q)\big)\, d\lambda(x)
+ \Big[ \frac{1}{\alpha(1-\alpha)}\,\mathbb{1}_{]0,1[\,\cup\,]1,\infty[}(\alpha) + \infty \cdot \mathbb{1}_{]-\infty,0[}(\alpha) \Big] \int_{\mathcal{X}} r(x)\, S_x(Q)\,\mathbb{1}_{\{0\}}\big(S_x(P)\big)\, d\lambda(x), \qquad (66)$
$0 \le D_{\phi_1,\,S(Q),\,S(Q),\,r\cdot S(Q),\,\lambda}\big(S(P),\, S(Q)\big)
= \int_{\mathcal{X}} \bigg[ \frac{S_x(P)}{S_x(Q)}\,\log\Big(\frac{S_x(P)}{S_x(Q)}\Big) + 1 - \frac{S_x(P)}{S_x(Q)} \bigg]\, S_x(Q)\, r(x)\, d\lambda(x) \qquad (67)$

$= \int_{\mathcal{X}} r(x)\, S_x(P)\,\log\Big(\frac{S_x(P)}{S_x(Q)}\Big)\,\mathbb{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)
+ \int_{\mathcal{X}} r(x)\,\big(S_x(Q) - S_x(P)\big)\, d\lambda(x)
+ \infty \cdot \int_{\mathcal{X}} r(x)\, S_x(P)\,\mathbb{1}_{\{0\}}\big(S_x(Q)\big)\, d\lambda(x), \qquad (68)$
$\phi(1) = 0$, the corresponding special case $D_{\phi,\,S^{\lambda pd}(Q),\,S^{\lambda pd}(Q),\,r\cdot S^{\lambda pd}(Q),\,\lambda}\big(S^{\lambda pd}(P),\, S^{\lambda pd}(Q)\big)$ of (61) turns out to be the ($r\cdot$) "local ϕ-divergence" of Avlogiaris et al. (2016a) and Avlogiaris et al. (2016b); in case of r(x) ≡ 1 (where (64) is satisfied), this reduces to the classical ϕ-divergence of Csiszar (1963), Ali and Silvey (1966), and Morimoto (1963)l,m

$0 \le D_{\phi,\,S^{\lambda pd}(Q),\,S^{\lambda pd}(Q),\,\mathbb{1}\cdot S^{\lambda pd}(Q),\,\lambda}\big(S^{\lambda pd}(P),\, S^{\lambda pd}(Q)\big)
= \int_{\mathcal{X}} f_Q(x)\,\phi\bigg(\frac{f_P(x)}{f_Q(x)}\bigg)\,\mathbb{1}_{]0,\infty[}\big(f_P(x)\cdot f_Q(x)\big)\, d\lambda(x)
+ \phi^{*}(0) \int_{\mathcal{X}} f_P(x)\,\mathbb{1}_{\{0\}}\big(f_Q(x)\big)\, d\lambda(x)
+ \phi(0) \int_{\mathcal{X}} f_Q(x)\,\mathbb{1}_{\{0\}}\big(f_P(x)\big)\, d\lambda(x)$
l See, e.g., Liese and Vajda (1987) and Vajda (1989) on comprehensive studies thereupon.
m Notice that c becomes obsolete.
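For intuition on the role of the boundary terms $\phi^{*}(0)$ and $\phi(0)$ in the classical formula above, here is a minimal discrete sketch (not from the chapter; the distributions are illustrative assumptions) with the Kullback-Leibler generator $\phi_1$:

```python
# Minimal discrete sketch (not from the chapter) of the classical phi-divergence formula
# displayed above, including the boundary terms phi*(0) (for cells with q = 0 < p)
# and phi(0) (for cells with p = 0 < q).  Here phi is the KL generator
# phi_1(t) = t log t - t + 1, for which phi*(0) = +inf and phi(0) = 1.
import numpy as np

def phi1(t):
    return t * np.log(t) - t + 1.0

def phi_divergence(p, q, phi, phi_at_0, phi_star_at_0):
    """Classical phi-divergence with explicit boundary terms (discrete case)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    both = (p > 0) & (q > 0)
    val = np.sum(q[both] * phi(p[both] / q[both]))
    orphan_p = np.sum(p[(p > 0) & (q == 0)])   # P-mass sitting where Q vanishes
    orphan_q = np.sum(q[(q > 0) & (p == 0)])   # Q-mass sitting where P vanishes
    if orphan_p > 0:
        val += phi_star_at_0 * orphan_p        # gives +inf for the KL generator
    if orphan_q > 0:
        val += phi_at_0 * orphan_q
    return val

print(phi_divergence([0.5, 0.5, 0.0], [0.25, 0.25, 0.5], phi1,
                     phi_at_0=1.0, phi_star_at_0=np.inf))   # ~0.6931 = log 2
```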
Lebesgue measure λ = λ_L (with the usual dλ_L(x) = dx), and r(x) ≡ 1. Therein, the special case

$0 \le D_{\phi_{TV},\,S^{cd}(Q),\,S^{cd}(Q),\,\mathbb{1}\cdot S^{cd}(Q),\,\lambda_L}\big(S^{cd}(P),\, S^{cd}(Q)\big) = \int_{\mathbb{R}} \big|\, F_P(x) - F_Q(x) \,\big|\; d\lambda_L(x) \qquad (75)$

of (73) is the well-known Kantorovich metric (between the distribution functions $F_P(\cdot)$, $F_Q(\cdot)$). It is known that the integral in (75) is finite provided that $\int_{\mathcal{X}} x\, dF_P(x) \in\, ]-\infty,\infty[$ and $\int_{\mathcal{X}} x\, dF_Q(x) \in\, ]-\infty,\infty[$ (if the distribution P resp. Q is generated by some real-valued random variable, say X resp. Y, this means that E[X] and E[Y] exist and are finite). To proceed, let us discuss
the special case

$0 \le D_{\phi_1,\,S^{cd}(Q),\,S^{cd}(Q),\,\mathbb{1}\cdot S^{cd}(Q),\,\lambda_L}\big(S^{cd}(P),\, S^{cd}(Q)\big)
= \int_{\mathbb{R}} \bigg[ \frac{F_P(x)}{F_Q(x)}\,\log\Big(\frac{F_P(x)}{F_Q(x)}\Big) + 1 - \frac{F_P(x)}{F_Q(x)} \bigg]\, F_Q(x)\, d\lambda_L(x) \qquad (76)$

of (67), (68). For the special sub-setup of nonnegative random variables (and thus $\mathcal{Y} = \mathcal{X} =\, ]0,\infty[$) with finite expectations and strictly positive cdf, (76) simplifies to the so-called "cumulative Kullback-Leibler information" of Park et al. (2012) (see also Park et al. (2018) for an extension to the whole real line, and Di Crescenzo and Longobardi (2015) for an adaptation to possibly smaller support as well as for an adaptation to a dynamic form, analogously to the explanations in the following lines). In contrast, we illuminate the special case
$0 \le D_{\phi_1,\,S^{su}(Q),\,S^{su}(Q),\,\mathbb{1}\cdot S^{su}(Q),\,\lambda_L}\big(S^{su}(P),\, S^{su}(Q)\big)
= \int_{\mathbb{R}} \bigg[ \frac{1-F_P(x)}{1-F_Q(x)}\,\log\Big(\frac{1-F_P(x)}{1-F_Q(x)}\Big) + 1 - \frac{1-F_P(x)}{1-F_Q(x)} \bigg]\,\big(1-F_Q(x)\big)\, d\lambda_L(x) \qquad (77)$
of (67), (68). This has been employed by Liu (2007) for the special case of $P = P^{emp}_N$ and $Q = Q_\theta$ in order to obtain a corresponding minimum-divergence parameter estimator of θ (see, e.g., also Yari and Saghafi (2012), Yari et al. (2013), and Mehrali and Asadi (2021) for follow-up papers). For the general context of nonnegative, absolutely continuous random variables (and thus $\mathcal{Y} = \mathcal{X} =\, ]0,\infty[$) with finite expectations and strictly positive cdf, (77) simplifies to the so-called "cumulative (residual) Kullback-Leibler information" of Baratpour and Habibi Rad (2012) (see also Park et al. (2012) for further
properties^n and Park et al. (2018) for an extension to the whole real line); the latter has been adapted to a dynamic form by Chamany and Baratpour (2014) as follows (adapted to our terminology): take an arbitrarily fixed "instance" t ≥ 0, $\mathcal{Y} = \mathcal{X} =\, ]t,\infty[$, and replace in (77) the survival function $S^{su}_x(P) := \{1 - F_P(x)\}_{x \in \mathbb{R}}$ by $S^{su,t}_x(P) := \Big\{\frac{1-F_P(x)}{1-F_P(t)}\Big\}_{x \in\, ]t,\infty[}$, being essentially the survival function of a random variable (e.g., residual lifetime) $[X - t \,|\, X > t]$ under P, and analogously for Q; accordingly, the integral range is ]t,∞[.
We can generalize this by simply plugging $S^{su,t}(P)$, $S^{su,t}(Q)$ into our general divergences (59) and (38), and even (19) (with λ = λ_L). An analogous dynamization can be done for density functionals, by plugging $S^{\lambda pd,t} := \Big\{\frac{f_P(x)}{1-F_P(t)}\Big\}_{x \in\, ]t,\infty[}$ instead of $S^{\lambda pd} = \{f_P(x)\}_{x \in \mathbb{R}}$ into (59) and (38), and even (19), thus covering the corresponding dynamic Kullback-Leibler divergence of Ebrahimi and Kirmani (1996) as well as the more general ϕ-divergences between residual lifetimes of Vonta and Karagrigoriou (2010) as special cases; notice that $S^{\lambda pd,t}$ is essentially the density function of the random variable $X_t := [X - t \,|\, X > t]$ under P, where, e.g., X is typically a (nonnegative) absolutely continuous random variable which describes the lifetime of a person, an item or a "process", and hence $X_t$ is called the residual lifetime (at t), which is fundamentally used in survival analysis and systems reliability engineering. In risk management and extreme value theory, $X_t$ describes the important notion of the random excess (e.g., of a loss X) over the threshold t, which is, e.g., employed in the well-known peaks-over-threshold method.
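As an illustration of this dynamization (a sketch not taken from the chapter; the candidate models, the instance t, and the integration cut-off are assumptions), one can build the residual-lifetime survival functions $S^{su,t}$ for two models and evaluate the resulting (77)-type divergence over ]t,∞[:

```python
# Minimal sketch (not from the chapter): dynamic ("residual lifetime") version of the
# cumulative-residual-KL-type divergence (77), obtained by replacing the survival
# functions by S^{su,t}_x = (1 - F(x)) / (1 - F(t)) and integrating over ]t, inf[.
import numpy as np
from scipy.integrate import quad

def residual_survival(F, t):
    denom = 1.0 - F(t)
    return lambda x: (1.0 - F(x)) / denom

def dynamic_cumulative_kl(F_P, F_Q, t, upper=50.0):
    SP, SQ = residual_survival(F_P, t), residual_survival(F_Q, t)
    # phi_1-type integrand: S_Q * phi_1(S_P / S_Q) = S_P log(S_P/S_Q) + S_Q - S_P
    integrand = lambda x: SP(x) * np.log(SP(x) / SQ(x)) + SQ(x) - SP(x)
    value, _ = quad(integrand, t, upper)
    return value

F_P = lambda x: 1 - np.exp(-1.0 * x)      # Exp(1) model
F_Q = lambda x: 1 - np.exp(-2.0 * x)      # Exp(2) model
print(dynamic_cumulative_kl(F_P, F_Q, t=0.5))   # = 0.5 by memorylessness of exponentials
```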
Analogously, we can plug in $\widetilde{S}^{\lambda pd,t} := \Big\{\frac{f_P(x)}{F_P(t)}\Big\}_{x \le t}$ instead of $S^{\lambda pd} = \{f_P(x)\}_{x \in \mathbb{R}}$ into (59) and (38), and even (19), and thus cover the corresponding dynamic Kullback-Leibler divergence of Di Crescenzo and Longobardi (2004) as well as the more general ϕ-divergences between past lifetimes of Vonta and Karagrigoriou (2010) as special cases; notice that $\widetilde{S}^{\lambda pd,t}$ is essentially the density function of the random variable $[X \,|\, X \le t]$ under P.
Classical quantile functions. The divergence (59) with S(P) ¼ Squ(P),
S(Q) ¼ Squ(Q) can be interpreted as a quantitative measure of tail risk of P,
relative to some pregiven reference distribution Q.o
n In this sub-setup, they also introduce an alternative with $\widetilde{\phi}_1(t)$ of (29) together with $S^{su,var}_x(P) := \frac{1-F_P(x)}{\int_0^{\infty}(1-F_P(\xi))\, d\xi}$, rather than with $\phi_1(t)$ of (30) together with $S^{su}_x(P) := 1 - F_P(x)$.

$D_{\phi_\alpha,\,S^{\lambda_L pd}(Q),\,S^{\lambda_L pd}(Q),\,\mathbb{1}\cdot S^{\lambda_L pd}(Q),\,\lambda_L}\big(S^{\lambda_L pd}(P),\, S^{\lambda_L pd}(Q)\big)$ (cf. (65)) equivalently in terms of quantile functions, where they also emphasize the advantage for distributions P and Q having closed-form quantile functions but non-closed-form distribution functions.
The above-mentioned contexts differ considerably from that of Broniatowski and Decurninge (2016), who basically employ ϕ-divergences $D_\phi(\mathbb{Q}_Q, \mathbb{Q}_P)$ between special quantile measures (rather than quantile functions) $\mathbb{Q}_Q$ and $\mathbb{Q}_P$; recall that for any probability measure P on $\mathbb{R}$, one can associate a (signed) quantile measure $\mathbb{Q}_P$ on ]0,1[ having as its generalized distribution function nothing else but the quantile function $F_P^{\leftarrow}$ of P. In more detail, similarly to the above-mentioned empirical likelihood principle, Broniatowski and Decurninge (2016) consider, in an i.i.d. context, the minimization

$D_\phi\big(\Omega^{dis}_N,\, \mathbb{Q}_{P^{emp}_N}\big) := \inf_{\mathbb{Q}_Q \in \Omega^{dis}_N} D_\phi\big(\mathbb{Q}_Q,\, \mathbb{Q}_{P^{emp}_N}\big)$

quantile measures $\mathbb{Q}_{\widetilde{Q}}$ having support on [0,1]; for example, the $\widetilde{Q}$'s may be taken from a tubular neighborhood Λ, constructed through a finite collection of conditions on L-moments (cf., e.g., Hosking, 1990), of some class of distributions on $\mathbb{R}_{+}$, such as the Pareto- or Weibull-distribution class. Such tasks have numerous applications in climate sciences or hydrology. As a side
remark, let us mention that for the general context of quantile measures $\mathbb{Q}_Q$ and $\mathbb{Q}_P$ being absolutely continuous (with respect to the Lebesgue measure $\lambda_L$ on [0,1]), the ϕ-divergence $D_\phi(\mathbb{Q}_Q, \mathbb{Q}_P)$ turns into the divergence $D^{c}_{\phi,\,S^{qd}(P),\,S^{qd}(P),\,S^{qd}(P),\,\lambda_L}\big(S^{qd}(Q),\, S^{qd}(P)\big)$ (cf. (59)) between the quantile density functions $S^{qd}(P) := \{S^{qd}_x(P)\}_{x \in\, ]0,1[} := \{(F_P^{\leftarrow})'(x)\}_{x \in\, ]0,1[}$ and $S^{qd}(Q)$. Thus, by applying our general divergences (19) to $S^{qd}(Q)$ and $S^{qd}(P)$ we end up with a completely new framework $D^{c}_{\phi,m_1,m_2,m_3,\lambda}\big(S^{qd}(Q),\, S^{qd}(P)\big)$ (and many interesting special cases) for quantifying dissimilarities between quantile density functions.
Depth, outlyingness, centered rank, and centered quantile functions. As a special case one gets $D^{c}_{\phi,\,S^{de}(Q),\,S^{de}(Q),\,r\cdot S^{de}(Q),\,\lambda_L}\big(S^{de}(P),\, S^{de}(Q)\big)$, $D^{c}_{\phi,\,S^{ou}(Q),\,S^{ou}(Q),\,r\cdot S^{ou}(Q),\,\lambda_L}\big(S^{ou}(P),\, S^{ou}(Q)\big)$, $\sum_{i=1}^{d} D^{c}_{\phi,\,S^{cr,i}(Q),\,S^{cr,i}(Q),\,r\cdot S^{cr,i}(Q),\,\lambda_L}\big(S^{cr,i}(P),\, S^{cr,i}(Q)\big)$, and $\sum_{i=1}^{d} D^{c}_{\phi,\,S^{cqu,i}(Q),\,S^{cqu,i}(Q),\,r\cdot S^{cqu,i}(Q),\,\lambda_L}\big(S^{cqu,i}(P),\, S^{cqu,i}(Q)\big)$,
(85)
The special case w(u,v) = 1 reduces to the Cramer-von Mises (test statistics) family (47), and the choice r(x) = 1, w(u,v) = v(1 - v) gives the Anderson-Darling (1952) test statistics. With (85), we can also imbed as special cases (together with r(x) = 1) some other known divergences which emphasize the upper tails: w(u,v) = 1 - v (cf. Ahmad et al., 1988), w(u,v) = 1 - v² (cf. Rodriguez and Viollaz, 1995; see also Shin et al. (2012) for applications in environmental extreme-value theory), w(u,v) = (1-v)^β with β > 0 (cf. Deheuvels and Martynov (2003); see also Chernobai et al. (2015) for the case β = 2 together with a left-truncated version of the empirical distribution function). Moreover, (85) covers as special cases (together with r(x) = 1) some other known divergences which emphasize the lower tails: w(u,v) = v (cf. Ahmad et al., 1988; Scott, 1999), w(u,v) = v^β with β > 0 (cf. Deheuvels and Martynov, 2003), w(u,v) = v(2 - v) (cf. Rodriguez and Viollaz (1995); see also Shin et al. (2012)). In contrast, in a two-sample-test situation where Q is replaced by the empirical distribution $\widetilde{P}^{emp}_L := \frac{1}{L}\sum_{i=1}^{L} \delta_{\widetilde{Y}_i}[\cdot]$ of an L-size i.i.d. sample $\widetilde{Y}_1, \ldots, \widetilde{Y}_L$ of Y (under Q), some authors (e.g., Hajek et al., 1999; Rosenblatt, 1952) choose divergences which can be imbedded (with the choice w(u,v) = 1, r(x) = 1) in our framework as a multiple of $D_{\phi_2,1,1,1,\lambda}\big(S^{cd}(P^{emp}_N),\, S^{cd}(\widetilde{P}^{emp}_L)\big)$, where $\lambda = c_1\, P^{emp}_N + (1-c_1)\, \widetilde{P}^{emp}_L$ is an appropriate mixture with $c_1 \in\, ]0,1[$. In further contrast, if one chooses the Lebesgue measure λ = λ_L (with the usual Riemann integration dλ_L(x) = dx) and r(x) ≡ 1 in (85), then one ends up with an adaptively weighted extension of (48).
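To make the role of the weight function w tangible, the following sketch (not from the chapter; the reference model, weights and quadrature are illustrative assumptions) evaluates weighted Cramer-von Mises-type statistics of this kind numerically, with w = 1 giving the Cramer-von Mises flavor and w(v) = v(1 - v) the Anderson-Darling weighting:

```python
# Minimal sketch (not from the chapter): weighted Cramer-von Mises-type statistics
# N * int (F_emp(x) - F_Q(x))^2 / w(F_Q(x)) dF_Q(x); the quadrature over the empirical
# step function is crude but sufficient for illustration.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def weighted_cvm(sample, F_Q, f_Q, w, lo=-10.0, hi=10.0):
    sample = np.sort(np.asarray(sample, float))
    N = sample.size
    F_emp = lambda x: np.searchsorted(sample, x, side="right") / N
    integrand = lambda x: (F_emp(x) - F_Q(x))**2 / w(F_Q(x)) * f_Q(x)
    value, _ = quad(integrand, lo, hi, limit=200)
    return N * value

rng = np.random.default_rng(1)
data = rng.normal(0.2, 1.0, size=200)        # slightly shifted sample vs. N(0,1) reference
print(weighted_cvm(data, norm.cdf, norm.pdf, w=lambda v: 1.0))            # CvM-type
print(weighted_cvm(data, norm.cdf, norm.pdf, w=lambda v: v * (1 - v)))    # AD-type
```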
Classical quantile functions. The divergence (79) with $S(P) = S^{qu}(P)$, $S(Q) = S^{qu}(Q)$, λ = λ_L, i.e., $D^{c}_{\phi,\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,r\cdot w(S^{qu}(P),S^{qu}(Q)),\,\lambda_L}\big(S^{qu}(P),\, S^{qu}(Q)\big)$, which has first been given in Stummer (2021) in an even more flexible form, can be interpreted as a quantitative measure of tail risk of P, relative to some pregiven reference distribution Q; corresponding connections with optimal transport will be discussed in Section 2.7.
all of which have not appeared elsewhere before (up to our knowledge); recall that the respective domains of ϕ have to take care of the ranges $\mathcal{R}(S^{de}(P)) \subseteq [0,\infty[$, $\mathcal{R}(S^{ou}(P)) \subseteq [0,\infty[$, $\mathcal{R}(S^{cr,i}(P)) \subseteq [-1,1]$, $\mathcal{R}(S^{cqu,i}(P)) \subseteq\, ]-\infty,\infty[$ ($i \in \{1,\ldots,d\}$).
2.5.2 $m_1(x) = \widetilde{S}_x(P)$ and $m_2(x) = \widetilde{S}_x(Q)$ with statistical functional $\widetilde{S} \ne S$, $m_3(x) \ge 0$

Recall $S(P) := \{S_x(P)\}_{x \in \mathcal{X}}$, $S(Q) := \{S_x(Q)\}_{x \in \mathcal{X}}$, and let $\widetilde{S}(P) := \{\widetilde{S}_x(P)\}_{x \in \mathcal{X}}$, $\widetilde{S}(Q) := \{\widetilde{S}_x(Q)\}_{x \in \mathcal{X}}$ for (typically) $\widetilde{S}$ being "essentially different" to S (e.g., take $\widetilde{S}$ and S as different choices from $S^{cd}$, $S^{pd}$, $S^{pm}$, $S^{su}$, $S^{mg}$, $S^{qu}$, $S^{de}$, $S^{ou}$).
For this special case, from (19) one can deduce

$0 \le D_{\phi,\,\widetilde{S}(P),\,\widetilde{S}(Q),\,m_3,\,\lambda}\big(S(P),\, S(Q)\big)
= \int_{\mathcal{X}} \Bigg[ \phi\bigg(\frac{S_x(P)}{\widetilde{S}_x(P)}\bigg) - \phi\bigg(\frac{S_x(Q)}{\widetilde{S}_x(Q)}\bigg) - \phi'_{+,c}\bigg(\frac{S_x(Q)}{\widetilde{S}_x(Q)}\bigg)\cdot\bigg(\frac{S_x(P)}{\widetilde{S}_x(P)} - \frac{S_x(Q)}{\widetilde{S}_x(Q)}\bigg) \Bigg]\, m_3(x)\, d\lambda(x), \qquad (86)$

which for the discrete setup $(\mathcal{X}, \lambda) = (\mathcal{X}^{\#}, \lambda^{\#})$ simplifies to

$0 \le D_{\phi,\,\widetilde{S}(P),\,\widetilde{S}(Q),\,m_3,\,\lambda^{\#}}\big(S(P),\, S(Q)\big)
= \sum_{x \in \mathcal{X}^{\#}} \Bigg[ \phi\bigg(\frac{S_x(P)}{\widetilde{S}_x(P)}\bigg) - \phi\bigg(\frac{S_x(Q)}{\widetilde{S}_x(Q)}\bigg) - \phi'_{+,c}\bigg(\frac{S_x(Q)}{\widetilde{S}_x(Q)}\bigg)\cdot\bigg(\frac{S_x(P)}{\widetilde{S}_x(P)} - \frac{S_x(Q)}{\widetilde{S}_x(Q)}\bigg) \Bigg]\, m_3(x).$
$0 \le D_{\phi,\,S^{su}(P),\,S^{su}(Q),\,m_3,\,\lambda_L}\big(S^{pd}(P),\, S^{pd}(Q)\big)
= \int_{\mathcal{X}} \Bigg[ \phi\bigg(\frac{f_P(x)}{1-F_P(x)}\bigg) - \phi\bigg(\frac{f_Q(x)}{1-F_Q(x)}\bigg) - \phi'_{+,c}\bigg(\frac{f_Q(x)}{1-F_Q(x)}\bigg)\cdot\bigg(\frac{f_P(x)}{1-F_P(x)} - \frac{f_Q(x)}{1-F_Q(x)}\bigg) \Bigg]\, m_3(x)\, d\lambda_L(x),$

which can be interpreted as a divergence between the two modeling hazard rate functions at stake.
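For a concrete feel of this hazard-rate divergence, here is a minimal sketch (not from the chapter; the generator choice, the two models and the aggregation weight are illustrative assumptions) with the quadratic generator $\phi_2$, for which the bracketed integrand above reduces to half the squared difference of the hazard rates:

```python
# Minimal sketch (not from the chapter): hazard-rate divergence of the displayed type
# with phi_2(t) = (t-1)^2/2, so the integrand is (h_P(x) - h_Q(x))^2 / 2.  The
# aggregation function m3 is chosen here (illustratively) as the Exp(1) density so
# that the integral converges.
import numpy as np
from scipy.integrate import quad

hazard_P = lambda x: 1.5 * np.sqrt(x)   # Weibull(shape=1.5, scale=1) hazard rate
hazard_Q = lambda x: 1.0                # Exp(1) hazard rate (constant)
m3 = lambda x: np.exp(-x)               # illustrative aggregation weight

integrand = lambda x: 0.5 * (hazard_P(x) - hazard_Q(x))**2 * m3(x)
value, _ = quad(integrand, 0, np.inf)
print(value)
```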
2.6 Auto-divergences
The main stream of this paper deals with divergences/distances between
(families of ) real-valued “statistical functionals” S() of the form SðPÞ :¼
fSx ðPÞgxX and SðQÞ :¼ fSx ðQÞgxX stemming from two different distribu-
tions P and Q. In quite some meaningful situations, P and Q can stem from
the same fundamental underlying random mechanism P̆. Take for instance
the situation where $\mathcal{Y} = \mathcal{X} = \mathbb{R}$, λ = λ_L and $Y_1, \ldots, Y_N$ are i.i.d. observations from a random variable Y with distribution P̆ having (with a slight abuse of notation P̆ = P̆ ∘ Y^{-1}) distribution function $F_{\breve{P}}(x) = \breve{P}[Y \le x]$ which is differentiable with a density $f_{\breve{P}}(x) = \frac{dF_{\breve{P}}(x)}{dx}$ being positive in an interval and zero elsewhere. The corresponding order statistics are denoted by $Y_{1:N} < Y_{2:N} < \cdots < Y_{N:N}$, where $Y_{k:N}$ is the k-th smallest observation and in particular $Y_{1:N} := \min\{Y_1, \ldots, Y_N\}$, $Y_{N:N} := \max\{Y_1, \ldots, Y_N\}$; the distribution $\breve{P}_k$ of $Y_{k:N}$ ($k \in \{1,\ldots,N\}$) has distribution function $F_{\breve{P}_k}(x) := \breve{P}[Y_{k:N} \le x]$ with the well-known density function

$f_{\breve{P}_k}(x) := \frac{N!}{(N-k)!\,(k-1)!}\; \big[F_{\breve{P}}(x)\big]^{k-1}\, \big[1 - F_{\breve{P}}(x)\big]^{N-k}\, f_{\breve{P}}(x) \qquad (87)$
(see, e.g., Reiss (1989), Arnold et al. (1992), and David and Nagaraja (2003) for comprehensive treatments of order statistics). In such a context, it makes sense to take $P := \breve{P}_j$, $Q := \breve{P}_k$ ($j,k \in \{1,\ldots,N\}$) respectively $P := \breve{P}$, $Q := \breve{P}_k$ (or vice versa) and study the divergences

$0 \le D_{\phi,\,m_1,\,m_2,\,m_3,\,\lambda_L}\big(S^{pd}(\breve{P}_j),\, S^{pd}(\breve{P}_k)\big)
:= \int_{\mathcal{X}} \Bigg[ \phi\bigg(\frac{f_{\breve{P}_j}(x)}{m_1(x)}\bigg) - \phi\bigg(\frac{f_{\breve{P}_k}(x)}{m_2(x)}\bigg) - \phi'_{+,c}\bigg(\frac{f_{\breve{P}_k}(x)}{m_2(x)}\bigg)\cdot\bigg(\frac{f_{\breve{P}_j}(x)}{m_1(x)} - \frac{f_{\breve{P}_k}(x)}{m_2(x)}\bigg) \Bigg]\, m_3(x)\, d\lambda_L(x)$

respectively
$0 \le D_{\phi,\,m_1,\,m_2,\,m_3,\,\lambda_L}\big(S^{pd}(\breve{P}),\, S^{pd}(\breve{P}_k)\big)
:= \int_{\mathcal{X}} \Bigg[ \phi\bigg(\frac{f_{\breve{P}}(x)}{m_1(x)}\bigg) - \phi\bigg(\frac{f_{\breve{P}_k}(x)}{m_2(x)}\bigg) - \phi'_{+,c}\bigg(\frac{f_{\breve{P}_k}(x)}{m_2(x)}\bigg)\cdot\bigg(\frac{f_{\breve{P}}(x)}{m_1(x)} - \frac{f_{\breve{P}_k}(x)}{m_2(x)}\bigg) \Bigg]\, m_3(x)\, d\lambda_L(x), \qquad (88)$
or deterministic transformations thereof. For instance, (some of) the divergences in Ebrahimi et al. (2004) and Asadi et al. (2006) can be imbedded here as the special cases

$D_{\phi_1,1,1,1,\lambda_L}\big(S^{pd}(\breve{P}_j),\, S^{pd}(\breve{P}_k)\big)$, $D_{\phi_1,1,1,1,\lambda_L}\big(S^{pd}(\breve{P}),\, S^{pd}(\breve{P}_k)\big)$,
$\alpha^{-1}\,\log\Big[1 + \alpha(\alpha-1)\, D_{\phi_\alpha,\,S^{pd}(\breve{P}_k),\,S^{pd}(\breve{P}_k),\,S^{pd}(\breve{P}_k),\,\lambda_L}\big(S^{pd}(\breve{P}_j),\, S^{pd}(\breve{P}_k)\big)\Big]$,
$\alpha^{-1}\,\log\Big[1 + \alpha(\alpha-1)\, D_{\phi_\alpha,\,S^{pd}(\breve{P}_k),\,S^{pd}(\breve{P}_k),\,S^{pd}(\breve{P}_k),\,\lambda_L}\big(S^{pd}(\breve{P}),\, S^{pd}(\breve{P}_k)\big)\Big]$, for $\alpha \in \mathbb{R}\setminus\{0,1\}$.
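As a small numerical illustration of such auto-divergences (a sketch not from the chapter; the parent distribution and sample size are assumptions), one can plug the order-statistic density (87) into the unscaled KL-type divergence between the laws of two order statistics:

```python
# Minimal sketch (not from the chapter): KL-type divergence D_{phi_1,1,1,1,lambda_L}
# between the densities (87) of the j-th and k-th order statistics of an i.i.d.
# Uniform(0,1) sample of size N.
import numpy as np
from math import factorial
from scipy.integrate import quad

def order_stat_density(F, f, N, k):
    c = factorial(N) / (factorial(N - k) * factorial(k - 1))
    return lambda x: c * F(x)**(k - 1) * (1 - F(x))**(N - k) * f(x)

F = lambda x: x          # Uniform(0,1) cdf on [0,1]
f = lambda x: 1.0

N, j, k = 10, 3, 7
f_j = order_stat_density(F, f, N, j)
f_k = order_stat_density(F, f, N, k)

# phi_1-type integrand: f_j log(f_j/f_k) + f_k - f_j
kl_integrand = lambda x: f_j(x) * np.log(f_j(x) / f_k(x)) + f_k(x) - f_j(x)
value, _ = quad(kl_integrand, 0, 1)
print(value)
```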
with $\widetilde{\psi} : \mathcal{R}(F_P) \times \mathcal{R}(F_Q) \to [0,\infty]$ defined by (cf. (I2) and (21))

$\widetilde{\psi}(u,v) := w(u,v)\cdot \psi_{\phi,c}\Big(\frac{u}{w(u,v)},\, \frac{v}{w(u,v)}\Big) \ge 0 \quad\text{with}$

$\psi_{\phi,c}\Big(\frac{u}{w(u,v)},\, \frac{v}{w(u,v)}\Big) := \phi\Big(\frac{u}{w(u,v)}\Big) - \phi\Big(\frac{v}{w(u,v)}\Big) - \phi'_{+,c}\Big(\frac{v}{w(u,v)}\Big)\cdot\Big(\frac{u}{w(u,v)} - \frac{v}{w(u,v)}\Big).$
Under Assumption 1 (and hence under the more restrictive Assumption 2) of
Stummer (2021)—who deals even with a more general context where the
scaling and the aggregation function need not coincide—one can adapt
Theorem 4 and Corollary 1 of Broniatowski and Stummer (2019) to obtain
the desired basic divergence properties (D1) and (D2) in the form of
(NN) $D^{c}_{\phi,\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,\lambda_L}\big(S^{qu}(P),\, S^{qu}(Q)\big) \ge 0$,
(RE) $D^{c}_{\phi,\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,\lambda_L}\big(S^{qu}(P),\, S^{qu}(Q)\big) = 0$ if and only if $F_P(x) = F_Q(x)$ for $\lambda_L$-a.a. $x \in \mathcal{X}$,
where the minimum in (90) is taken over all $\mathbb{R}$-valued random variables X and Y (on an arbitrary probability space $(\Omega, \mathcal{A}, \mathbb{S})$) such that $\mathbb{S}[X \in \cdot\,] = P[\cdot\,]$ and $\mathbb{S}[Y \in \cdot\,] = Q[\cdot\,]$. As usual, $\mathbb{E}$ denotes the expectation with respect to $\mathbb{S}$.
p
Other names are submodular, Lattice-subadditive, 2-antitone, 2-negative, Δ-antitone, supernega-
tive, “satisfying the (continuous) Monge property/condition.”
q
Other names are supermodular, Lattice-superadditive, 2-increasing, 2-positive, Δ-monotone,
2-monotone, “fulfilling the moderate growth property,” “satisfying the measure property,”
“satisfying the twist condition.”
r
A comprehensive discussion on general quasi-monotone functions can be found, e.g., in Chapter
6.C of Marshall et al. (2011).
Remark 2. (i) Notice that $P^{com}$ is $\widetilde{\psi}$-independent, and may not be the unique minimizer in (91). As a (not necessarily unique) minimizer in (90), one can take $X := F_P^{\leftarrow}(U)$, $Y := F_Q^{\leftarrow}(U)$ for some uniform random variable U on [0,1].
(ii) In Theorem 3 we have shown that $P^{com}$ (cf. (92)) is an optimal transport plan of the KTP (91) with the pointwise-BS-distance-type (pBS-type) cost function $\widetilde{\psi}(u,v)$. The outcoming minimal value is equal to $D^{c}_{\phi,\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,w(S^{qu}(P),S^{qu}(Q)),\,\lambda_L}\big(S^{qu}(P),\, S^{qu}(Q)\big)$, which is typically straightforward to compute (resp. approximate).
(iii) Depending on the chosen divergence, one may have to restrict the support of P and Q, for instance to (subsets of) [0,∞[.
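The following minimal sketch (not from the chapter; the cost function, the marginals and the sample size are illustrative assumptions) evaluates the minimal transport cost of Remark 2(ii) through the quantile coupling of Remark 2(i), and compares it with the cost of an independent coupling:

```python
# Minimal sketch (not from the chapter): for a quasi-antitone pointwise cost, the
# comonotone ("quantile") coupling X = F_P^{-}(U), Y = F_Q^{-}(U) is an optimal
# transport plan (Remark 2), so the minimal cost is a one-dimensional integral over u.
# Illustration with the KL-type pointwise cost psi(a, b) = a log(a/b) - a + b and
# exponential marginals (support in [0, inf[, as required).
import numpy as np
from scipy.integrate import quad

psi = lambda a, b: a * np.log(a / b) - a + b     # pointwise phi_1-type cost
qP = lambda u: -np.log(1 - u) / 2.0              # quantile function of Exp(2)
qQ = lambda u: -np.log(1 - u) / 1.0              # quantile function of Exp(1)

optimal_cost, _ = quad(lambda u: psi(qP(u), qQ(u)), 0, 1)

# any other coupling (e.g. an independent one) should cost at least as much:
rng = np.random.default_rng(2)
X = qP(rng.uniform(1e-12, 1.0, size=10**5))
Y = qQ(rng.uniform(1e-12, 1.0, size=10**5))
print(optimal_cost, psi(X, Y).mean())
```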
We give some further special cases of pBS-type cost functions, which are
continuous and quasi-antitone, but which are generally not symmetric and
thus not of POM type:
Example 2. "Smooth" pointwise ϕ-divergences (i.e., pointwise Csiszar-Ali-Silvey-Morimoto divergences): take $\phi : [0,\infty[\, \to \mathbb{R}$ to be a strictly convex, twice continuously differentiable function on ]0,∞[ with continuous extension on t = 0, together with $w(u,v) := v \in\, ]0,\infty[$, and c is obsolete. Accordingly, $\widetilde{\psi}(u,v) = v\,\phi\big(\frac{u}{v}\big) - v\,\phi(1) - \phi'(1)\,(u - v)$, and hence the second mixed derivative satisfies $\frac{\partial^2 \widetilde{\psi}}{\partial u\, \partial v} = -\frac{u}{v^2}\,\phi''\big(\frac{u}{v}\big) < 0$ ($u,v \in\, ]0,\infty[$); thus, $\widetilde{\psi}$ is quasi-antitone on ]0,∞[ × ]0,∞[. Accordingly, (90) to (93) applies to such kind of (cf. Section 2.5.1.2) ϕ-divergences concerning P, Q having support in [0,∞[. As an example, take, e.g., the power function $\phi(t) := \frac{t^{\gamma} - \gamma\, t + \gamma - 1}{\gamma\,(\gamma - 1)}$
formula of the latter is given in the appendix of Kißlinger and Stummer
(2018), who also give applications to robust change detection in data streams).
Illustrative examples of suitable ϕ and w can be found, e.g., in Kißlinger and
Stummer (2016).
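A quick symbolic sanity check of the quasi-antitonicity claim in Example 2 (a sketch, not from the chapter) for the power generator just mentioned:

```python
# Minimal symbolic check (not from the chapter) that the pointwise cost of Example 2
# is quasi-antitone for the power generator phi(t) = (t^g - g t + g - 1)/(g (g-1)):
# the mixed second derivative of psi(u, v) = v*phi(u/v) is -u^(g-1) * v^(-g) < 0.
import sympy as sp

u, v, g = sp.symbols('u v g', positive=True)
t = u / v
phi = (t**g - g*t + g - 1) / (g*(g - 1))
psi = v * phi                       # phi(1) = phi'(1) = 0, so the linear terms vanish
mixed = sp.simplify(sp.diff(psi, u, v))
print(mixed)                        # equals -u**(g-1) * v**(-g), negative for u, v > 0
```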
where (94) is called the Monge transportation problem (MTP). Here, $\widehat{\Gamma}(P,Q)$ denotes the family of all measurable maps $T : \mathbb{R} \to \mathbb{R}$ such that $P[T \in \cdot\,] = Q[\cdot\,]$.
3 Aggregated/integrated divergences
Suppose that ϕ = ϕ_z, P = P_z, Q = Q_z, $m_1 = m_{1,z}$, $m_2 = m_{2,z}$, $m_3 = m_{3,z}$, λ = λ_z depend on the same (!!) "parameter/quantity" $z \in \mathcal{Z}$. Then it makes sense to study the aggregated/integrated divergence $\int_{\mathcal{Z}} D_{\phi_z, m_{1,z}, m_{2,z}, m_{3,z}, \lambda_z}\big(S(P_z),\, S(Q_z)\big)\, d\breve{\lambda}(z)$, where $\breve{\lambda}$ is a σ-finite measure on $\mathcal{Z}$ (e.g., the Lebesgue measure λ_L, the counting measure $\lambda^{\#}$ or a probability measure, where in case of the latter one also uses the terminology "expected divergence").
An interesting special case is the following family: recall first that for the two-element space $\mathcal{Y} = \mathcal{X} = \{0,1\}$ we denote the corresponding probability mass function as $S^{pm}(P) = \{P[\{x\}]\}_{x \in \mathcal{X}} = \{1 - P[\{1\}],\, P[\{1\}]\}$; in other words, P is a Bernoulli distribution Ber(θ_P) which is completely determined by its parameter $\theta_P \in [0,1]$ with the interpretation $\theta_P = P[\{1\}]$. Now suppose that $\theta_P = \theta_P(z)$ depends on a real-valued parameter $z \in \mathbb{R}$. In such a situation it makes sense to study the aggregated (integrated) divergence for $\phi \in \Phi_{C_1}(]a,b[)$

$0 \le \int_{\mathbb{R}} D_{\phi,\, m_{1,z},\, m_{2,z},\, m_{3,z},\, \lambda^{\#}}\Big(S^{pm}\big(Ber(\theta_P(z))\big),\, S^{pm}\big(Ber(\theta_Q(z))\big)\Big)\, d\breve{\lambda}(z)$

$= \int_{\mathbb{R}} \Bigg\{ \bigg[ \phi\bigg(\frac{1-\theta_P(z)}{m_{1,z}(0)}\bigg) - \phi\bigg(\frac{1-\theta_Q(z)}{m_{2,z}(0)}\bigg) - \phi'\bigg(\frac{1-\theta_Q(z)}{m_{2,z}(0)}\bigg)\cdot\bigg(\frac{1-\theta_P(z)}{m_{1,z}(0)} - \frac{1-\theta_Q(z)}{m_{2,z}(0)}\bigg) \bigg]\, m_{3,z}(0)$
$\qquad + \bigg[ \phi\bigg(\frac{\theta_P(z)}{m_{1,z}(1)}\bigg) - \phi\bigg(\frac{\theta_Q(z)}{m_{2,z}(1)}\bigg) - \phi'\bigg(\frac{\theta_Q(z)}{m_{2,z}(1)}\bigg)\cdot\bigg(\frac{\theta_P(z)}{m_{1,z}(1)} - \frac{\theta_Q(z)}{m_{2,z}(1)}\bigg) \bigg]\, m_{3,z}(1) \Bigg\}\, d\breve{\lambda}(z) \qquad (95)$
where λ̆ is a σ-finite measure on R (e.g., the Lebesgue measure λL, the count-
ing measure λ# or a probability measure) and the scaling functions m1, m2 as
well as the aggregating function m3 are allowed to depend (in a measurable
way) on z (which is denoted by extending their indices with z). For the non-
differentiable case ϕ Φ(]a, b[), the derivative ϕ0 has to be replaced by ϕ0+,c .
In adaptation of the discussion after formula (25), by defining the integral functional $\widetilde{g}_{\phi,m_3,\breve{\lambda}}(\widetilde{\xi}) := \int_{\mathbb{R}} \Big[\int_{\{0,1\}} \phi\big(\widetilde{\xi}(x,z)\big)\, m_{3,z}(x)\, d\lambda^{\#}(x)\Big]\, d\breve{\lambda}(z)$ and plugging in, e.g., $\frac{S^{pm}(Ber(\theta_P(\cdot)))}{m_{1,\cdot}}$,

$\widetilde{g}_{\phi,m_3,\breve{\lambda}}\bigg(\frac{S^{pm}(Ber(\theta_P(\cdot)))}{m_{1,\cdot}}\bigg) = \int_{\mathbb{R}} \bigg\{ \phi\bigg(\frac{1-\theta_P(z)}{m_{1,z}(0)}\bigg)\, m_{3,z}(0) + \phi\bigg(\frac{\theta_P(z)}{m_{1,z}(1)}\bigg)\, m_{3,z}(1) \bigg\}\, d\breve{\lambda}(z), \qquad (96)$
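To illustrate an aggregated divergence of type (95) numerically (a sketch not from the chapter; the unscaled choice $m_1 = m_2 = m_3 \equiv 1$, the KL generator, the logistic link functions and the normal aggregation measure are all illustrative assumptions), note that in this unscaled KL case the inner bracket sums to the Kullback-Leibler divergence between the two Bernoulli laws at z:

```python
# Minimal sketch (not from the chapter): expected divergence between two z-indexed
# Bernoulli families, aggregated with a standard normal weight ("expected divergence").
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta_P = lambda z: 1 / (1 + np.exp(-z))          # e.g. two competing logistic links
theta_Q = lambda z: 1 / (1 + np.exp(-0.8 * z))

integrand = lambda z: kl_bernoulli(theta_P(z), theta_Q(z)) * norm.pdf(z)
value, _ = quad(integrand, -np.inf, np.inf)
print(value)
```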
(99)

$+\; \phi^{*}(0)\cdot P\Big[\prod_{i=1}^{d} f_{P_i}(x_i) = 0\Big] + \phi(0)\cdot Q\big[\, f_P(x) = 0 \,\big] - \phi(1). \qquad (100)$

The divergence in the formula lines (99) and (100) has first appeared in Micheas and Zografos (2006). In applications, one often takes $\mathcal{X} = \mathcal{Y} = \mathbb{R}^d$, $\mathcal{Y}_i = \mathbb{R}$, $\lambda_i := \lambda_L$ to be the Lebesgue measure on $\mathbb{R}$ and thus $\lambda := \lambda_L^{d}$ is the Lebesgue measure on $\mathbb{R}^d$ (with a slight abuse of notation), $f_P$ to be the
$\frac{\partial^2 C(u_1,u_2)}{\partial u_1\, \partial u_2}$ for almost all $(u_1,u_2) \in [0,1] \times [0,1]$ (see, e.g., p. 83 in Durante and Sempi (2016) and the there-mentioned references). Accordingly, $f_P(x_1,x_2) = f_{P_1}(x_1)\cdot f_{P_2}(x_2)\cdot c\big(F_{P_1}(x_1), F_{P_2}(x_2)\big)$ and thus, in case of strictly positive $f_{P_1}(\cdot) > 0$, $f_{P_2}(\cdot) > 0$, the divergence (99), (100) rewrites as

$0 \le D_{\phi,\,S^{\lambda pd}(Q),\,S^{\lambda pd}(Q),\,\mathbb{1}\cdot S^{\lambda pd}(Q),\,\lambda}\big(S^{\lambda pd}(P),\, S^{\lambda pd}(Q)\big)
= \int_{\mathbb{R}} \int_{\mathbb{R}} f_{P_1}(x_1)\, f_{P_2}(x_2)\; \phi\bigg(\frac{f_P(x_1,x_2)}{f_{P_1}(x_1)\, f_{P_2}(x_2)}\bigg)\, d\lambda_L(x_1)\, d\lambda_L(x_2) \;-\; \phi(1)
= \int_{0}^{1}\!\int_{0}^{1} \phi\big(c(u_1,u_2)\big)\, d\lambda_L(u_1)\, d\lambda_L(u_2) \;-\; \phi(1), \qquad (101)$
which solely depends on the copula (density) and not on the marginals. For the subcase ϕ(1) = 0, formula (101) was established basically in Durrani and Zeng (2009) without assumptions and without a proof; they also give some examples including ϕ = ϕ_α ($\alpha \in \mathbb{R}\setminus\{0,1\}$) of (26), as well as the KL generator $\phi = \widetilde{\phi}_1(t)$ of (29) leading to the "copula representation of mutual information." The latter also appears in the earlier work of Davy and Doucet (2003), as well as, e.g., in Zeng and Durrani (2011), Zeng et al. (2014), and Tran (2018); in contrast, Tran also gives a copula representation of the Kullback-Leibler information divergence between two general d-dimensional Lebesgue density functions $S^{\lambda_L pd}(P) := f_P(\cdot)$ and $S^{\lambda_L pd}(Q) := f_Q(\cdot)$, where P and Q are allowed to have different marginals, and Q need not be of independence-expressing product type.
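The copula representation of mutual information can be checked numerically; the following sketch (not from the chapter; the Gaussian copula and the quadrature truncation are illustrative assumptions) evaluates $\int\!\int c \log c$ for a Gaussian copula and compares it with the known closed form $-\tfrac{1}{2}\log(1-\rho^2)$:

```python
# Minimal sketch (not from the chapter): mutual information via the copula density,
# MI = int int c(u1,u2) log c(u1,u2) du1 du2, checked against the Gaussian closed form.
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import norm

rho = 0.6

def gaussian_copula_density(u1, u2, rho):
    x, y = norm.ppf(u1), norm.ppf(u2)
    det = 1 - rho**2
    return np.exp(-(rho**2 * (x**2 + y**2) - 2 * rho * x * y) / (2 * det)) / np.sqrt(det)

eps = 1e-6   # truncate away the corners for numerical stability
mi, _ = dblquad(lambda u2, u1: gaussian_copula_density(u1, u2, rho)
                * np.log(gaussian_copula_density(u1, u2, rho)),
                eps, 1 - eps, eps, 1 - eps)
print(mi, -0.5 * np.log(1 - rho**2))
```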
(2) For the special case $\mathcal{X} := \mathcal{Y} = \mathbb{R}^2$, continuous marginal distribution functions $F_{P_1}$ and $F_{P_2}$, product measure $\lambda := P_1 \otimes P_2$, $S^{cd}_x(Q) := F_Q(x) = F_{P_1}(x_1)\cdot F_{P_2}(x_2) \in [0,1]$, as well as joint distribution function $S^{cd}_x(P) := F_P(x) \in [0,1]$, one gets the following special cases of (46) and (73), respectively:
(for (102) see Blum et al. (1961), Schweizer and Wolff (1981), up to constants and squares) and

$0 \le D_{\phi_{TV},\,S^{cd}(Q),\,S^{cd}(Q),\,\mathbb{1}\cdot S^{cd}(Q),\,\lambda}\big(S^{cd}(P),\, S^{cd}(Q)\big)
= \int_{\mathbb{R}} \int_{\mathbb{R}} \big|\, F_P(x_1,x_2) - F_{P_1}(x_1)\cdot F_{P_2}(x_2) \,\big|\; dP_1(x_1)\, dP_2(x_2)
= \int_{0}^{1}\!\int_{0}^{1} \big|\, C(u_1,u_2) - u_1\, u_2 \,\big|\; d\lambda_L(u_1)\, d\lambda_L(u_2) \qquad (103)$
5 Bayesian contexts
There are various ways in which divergences can be used in Bayesian frameworks:
(1) as "direct" quantifiers of dissimilarities between statistical functionals of various parameter distributions: for instance, consider an n-dimensional vector of observable random quantities $Z = (Z_1, \ldots, Z_n)$ whose distribution depends on an unobservable
with posterior probability (for H) $\pi^{post}_H(Z) := \frac{\pi_H\, f_{P_H}(Z)}{\pi_H\, f_{P_H}(Z) + (1-\pi_H)\, f_{P_A}(Z)}$ in terms of the λ-density functions $f_{P_H}(\cdot)$ and $f_{P_A}(\cdot)$, where λ is, e.g., $\frac{P_H + P_A}{2}$ (or any measure such that $P_H$ and $P_A$ are absolutely continuous w.r.t. λ).
$= D_{\phi,\,S^{\lambda pd}(P_A),\,S^{\lambda pd}(P_A),\,\mathbb{1}\cdot S^{\lambda pd}(P_A),\,\lambda}\big(S^{\lambda pd}(P_H),\, S^{\lambda pd}(P_A)\big) \qquad (105)$

where $g_\phi(\pi) := -\,\phi'_{+}\big(\frac{1-\pi}{\pi}\big)$ is nondecreasing in $\pi \in\, ]0,1[$. If ϕ is twice differentiable, then one can simplify $\frac{1}{\pi_H}\,\frac{d\, g_\phi(\pi_H)}{d\pi_H} = \frac{1}{(\pi_H)^{3}}\;\phi''\Big(\frac{1-\pi_H}{\pi_H}\Big)$
$I(\pi_H, P_H, P_A) = D_{\phi,\,S^{\lambda pd}(P_A),\,S^{\lambda pd}(P_A),\,\mathbb{1}\cdot S^{\lambda pd}(P_A),\,\lambda}\big(S^{\lambda pd}(P_H),\, S^{\lambda pd}(P_A)\big)$

s They also have shown some kind of "reciprocal."
$B(\pi_H, P_H, P_A) = \int \min\big\{ \Lambda_H\, f_{P_H}(Z),\; \Lambda_A \big\}\; dP_A$
(in an even slightly more general form), which can be very useful in case
that the posterior minimal mean decision loss cannot be computed explic-
itly. For instance, Stummer and Vajda (2007) give applications to
decision-making of time-continuous, nonstationary financial stochastic
processes.
(4) as auxiliary tools: for instance, in an i.i.d.-type Bayesian parametric model-misspecification context, Kleijn and van der Vaart (2012) employ the reverse-Kullback-Leibler-distance minimizer

$\widehat{\theta} := \arg\inf_{\theta \in \Theta} D_{\phi_0}\big(Q_\theta,\, P_{tr}\big) = \arg\inf_{\theta \in \Theta} D_{\phi_0,\,S^{\lambda pd}(Q),\,S^{\lambda pd}(Q),\,\mathbb{1}\cdot S^{\lambda pd}(Q),\,\lambda}\big(S^{\lambda pd}(Q_\theta),\, S^{\lambda pd}(P_{tr})\big)$

(cf. (10) respectively (74) with ϕ = ϕ_0) in order to formulate and prove an asymptotic normality, under the unknown true out-of-model-lying data-generating distribution $P_{tr}$, of the involved posterior parameter distribution.
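For intuition, the minimizer studied by Kleijn and van der Vaart (2012) is the parameter of the Kullback-Leibler projection of the true data-generating law onto the (misspecified) model, i.e., the minimizer of $\theta \mapsto \int f_{P_{tr}} \log(f_{P_{tr}}/f_{Q_\theta})\, d\lambda$. A minimal numerical sketch (not from the chapter; the mixture truth, the normal model family and the optimizer settings are illustrative assumptions):

```python
# Minimal sketch (not from the chapter): KL-projection ("pseudo-true") parameter of a
# normal location-scale model when the truth is a two-component normal mixture.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize
from scipy.stats import norm

p_tr = lambda x: 0.7 * norm.pdf(x, -1.0, 1.0) + 0.3 * norm.pdf(x, 3.0, 0.5)

def kl_to_model(theta):
    mu, log_sigma = theta
    q = lambda x: norm.pdf(x, mu, np.exp(log_sigma))
    integrand = lambda x: p_tr(x) * np.log(p_tr(x) / q(x))
    return quad(integrand, -15, 15, limit=200)[0]

res = minimize(kl_to_model, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))    # pseudo-true mean and standard deviation
```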
6 Variational representations
Variational representations of (say) ϕ-divergences, often referred to as dual representations, transform ϕ-divergence estimation into an optimization problem on an (in general) infinite-dimensional function space, but they may also lead to a simpler optimization problem when some knowledge on the class of measures Q over which $D_\phi(Q, P)$ has to be optimized is available; moreover, as already mentioned at the end of Section 1.3 above, such variational representations can also be employed to circumvent the crossover problems (CO1), (CO2), (CO3).
To begin with, in the following we loosely sketch the corresponding general setting. We equip $\mathcal{M}$, the linear space of all finite signed measures (including all probability measures) on $(\mathcal{X}, \mathcal{B})$, with the so-called τ-topology, the coarsest one which makes the mappings $Q \mapsto \int f\, dQ$ continuous for all
two-sample problems, when it is assumed that dQ/dP belongs to some class of functions; for example, we may assume that Ω consists of all distributions such that $x \mapsto (dQ/dP)(x)$ belongs to some parametric class.
analysis around (106), which is handled now.
The supremum in Eq. (106) may not be reached, even in elementary cases.
Consider the case when ϕ ¼ ϕ1, hence the case when D eϕ ðQ, PÞ is the
Kullback–Leibler divergence between Q and P, and assume that both Q and
P are two Gaussian probability measures on R with same variance and differ-
ent mean values. Then it is readily checked that the supremum in (106) is
reached on a polynomial with degree 2, hence outside of Mb . For statistical
purposes it is relevant that formula (106) holds with attainment; indeed the
supremum, in case when D eϕ ðQ, PÞ is finite, is reached at g :¼ ϕ0 ðdQ=dPÞ,
therefore, in case when ϕ is differentiable, on a function which may not be
bounded.
It is also of interest to consider (106) in the case when P is atomic and Q is a continuous distribution; for example, let $(X_1, \ldots, X_N)$ be an i.i.d. sample under some probability measure R on $\mathbb{R}$, and take Q to be a probability measure which is absolutely continuous with respect to the Lebesgue measure; consider the case when $\widetilde{D}_\phi = \widetilde{D}_{\phi_1}$ is the (slightly modified) Kullback-Leibler divergence. Denote by $P^{emp}_N$ the empirical measure of the sample. Taking $g(x) := M\,\mathbb{1}_{\{X_1,\ldots,X_N\}^{c}}(x)$ for some arbitrary M, it holds by (106) that $\widetilde{D}_\phi(Q, P^{emp}_N) \ge M$, proving that no inference can be performed about R making use of the variational form as it stands. Some more structure and information has to be incorporated in the variational form of the divergence in order to circumvent this obstacle. Assuming that ϕ is a differentiable function in its domain, the supremum in (106) is reached at $g^{*} := \phi'(dQ/dP)$ as checked by substitution.^t Let $\mathcal{F}$ be a class of functions containing all functions $\phi'(dQ/dP)(\cdot)$ as Q runs in a given model Ω. Consider the subspace $\mathcal{M}_{\mathcal{F}}$ of all finite signed measures Q such that $\int |f|\, d|Q|$ is finite for all functions f in $\mathcal{F}$; then, similarly as in (106), we may obtain the following variational form of $\widetilde{D}_\phi(Q, P)$, which is valid when Q belongs to $\mathcal{M}_{\mathcal{F}}$ and P belongs to $\mathcal{P}$:

$\widetilde{D}_\phi(Q, P) = \sup_{g \in \langle M_b \cup \mathcal{F} \rangle} \int_{\mathcal{X}} g(x)\, dQ(x) - \int_{\mathcal{X}} \phi^{*}\big(g(x)\big)\, dP(x), \qquad (107)$
t In case ϕ is not differentiable at some point, the supremum in (107) should satisfy $g^{*}(x) \in \partial\phi\big((dQ/dP)(x)\big)$ for all x in $\mathcal{X}$, where ∂ϕ(t) is the subdifferential set of the convex function ϕ at the point t, $\partial\phi(t) := \{z \in \mathbb{R} : \phi(s) \ge \phi(t) + z\,(s - t),\; \forall s \in \mathbb{R}\}$.
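To see the variational representation in action, here is a minimal sketch (not from the chapter; the Gaussian measures and the suboptimal candidate are illustrative assumptions) for the KL generator $\phi_1$, whose convex conjugate is $\phi_1^{*}(s) = e^{s} - 1$; plugging in $g^{*} = \phi_1'(dQ/dP) = \log(dQ/dP)$ attains the divergence, while any other g only yields a lower bound, and the optimizer is indeed unbounded (outside $M_b$), as discussed above:

```python
# Minimal sketch (not from the chapter) of the dual representation
# D(Q,P) = sup_g { int g dQ - int (e^g - 1) dP } for the KL generator phi_1.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

p = lambda x: norm.pdf(x, 0.0, 1.0)          # P = N(0,1)
q = lambda x: norm.pdf(x, 1.0, 1.0)          # Q = N(1,1), so KL(Q||P) = 0.5

def variational_value(g):
    term_q = quad(lambda x: g(x) * q(x), -15, 15)[0]
    term_p = quad(lambda x: (np.exp(g(x)) - 1.0) * p(x), -15, 15)[0]
    return term_q - term_p

g_star = lambda x: np.log(q(x) / p(x))       # = x - 1/2, the (unbounded) optimizer
g_sub = lambda x: 0.5 * x                    # an arbitrary suboptimal candidate
print(variational_value(g_star), variational_value(g_sub), 0.5)
```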
Acknowledgments
W.S. is grateful to the Sorbonne Université Paris for its multiple partial financial support and especially the LPSM for its multiple great hospitality. M.B. thanks very much the University of Erlangen-Nürnberg for its partial financial support and hospitality. Moreover, W.S. would like to thank Ingo Klein and Konstantinos Zografos for some helpful remarks on a much earlier draft of this paper.
References
Ahmad, M.I., Sinclair, C.D., Spurr, B.D., 1988. Assessment of flood frequency models using
empirical distribution function statistics. Water Resour. Res. 24 (8), 1323–1328.
Al Mohamad, D., 2018. Towards a better understanding of the dual representation of phi diver-
gences. Stat. Papers 59 (3), 1205–1253.
Ali, M.S., Silvey, D., 1966. A general class of coefficients of divergence of one distribution from
another. J. Roy. Stat. Soc. B-28, 131–140.
Alonso-Revenga, J.M., Martin, N., Pardo, L., 2017. New improved estimators for overdispersion
in models with clustered multinomial data and unequal cluster sizes. Stat. Comput. 27,
193–217.
Amari, S.-I., 2016. Information Geometry and Its Applications. Springer, Japan.
Amari, S.-I., Nagaoka, H., 2000. Methods of Information Geometry. Oxford University Press.
Amari, S.-I., Karakida, R., Oizumi, M., 2018. Information geometry connecting Wasserstein
distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem.
Info. Geo. 1, 13–37.
Anderson, T.W., Darling, D.A., 1952. Asymptotic theory of certain goodness of fit criteria based
on stochastic processes. Ann. Math. Stat. 23, 193–212.
Arikan, E., Merhav, N., 1998. Guessing subject to distortion. IEEE Trans. Inf. Theory 44 (3),
1041–1056.
Arnold, B.C., Balakrishnan, N., Nagaraja, H.N., 1992. A First Course in Order Statistics. Wiley,
New York.
Asadi, M., Ebrahimi, N., Hamedani, G.G., Soofi, E.S., 2006. Information measures for Pareto
distributions and order statistics. In: Balakrishnan, N., Castillo, E., Sarabia, J.M. (Eds.),
Advances in Distribution Theory, Order Statistics, and Inference. Birkhäuser, Boston,
pp. 207–223.
Avlogiaris, G., Micheas, A., Zografos, K., 2016a. On local divergences between two probability
measures. Metrika 79, 303–333.
Avlogiaris, G., Micheas, A., Zografos, K., 2016b. On testing local hypotheses via local diver-
gence. Stat. Methodol. 31, 20–42.
Ay, N., Jost, J., Le, H.V., Schwachhöfer, L., 2017. Information Geometry. Springer Intern.
Baggerly, K.A., 1998. Empirical likelihood as a goodness-of-fit measure. Biometrika 85 (3),
535–547.
Bahadur, R.R., 1967. Rates of convergence of estimates and test statistics. Ann. Math. Stat. 38,
303–324.
Bahadur, R.R., 1971. Some Limit Theorems in Statistics. SIAM, Philadelphia.
Bapat, R.B., Beg, M.I., 1989. Order statistics for nonidentically distributed variables and perma-
nents. Sankhya A 51 (1), 79–93.
Baratpour, S., Habibi Rad, A., 2012. Testing goodness-of-fit for exponential distribution based on
cumulative residual entropy. Commun. Stat. Theory Methods 41 (8), 1387–1396.
Barbaresco, F., Nielsen, F., 2021. Geometric Structures of Statistical Physics, Information Geom-
etry, and Learning. Springer Nature, Switzerland.
Baringhaus, L., Henze, N., 2017. Cramer-von Mises distance: probabilistic interpretation, confi-
dence intervals, and neighborhood-of-model validation. J. Nonparam. Stat. 29 (2), 167–188.
Basu, A., Lindsay, B.G., 1994. Minimum disparity estimation for continuous models: efficiency,
distributions and robustness. Ann. Inst. Stat. Math. 46 (4), 683–705.
Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C., 1998. Robust and efficient estimation by minimiz-
ing a density power divergence. Biometrika 85 (3), 549–559.
Basu, A., Shioya, H., Park, C., 2011. Statistical Inference: The Minimum Distance Approach.
CRC Press, Boca Raton.
Basu, A., Mandal, A., Martin, N., Pardo, L., 2015. Robust tests for the equality of two normal
means based on the density power divergence. Metrika 78, 611–634.
Beran, R., 1977. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 5 (3),
445–463.
Bertail, P., Gautherat, E., Harari-Kermadec, H., 2014. Empirical φ*-divergence minimizers for
Hadamard differentiable functionals. In: Akritas, M.G., et al. (Eds.), Topics in Nonparametric
Statistics. Springer, New York, pp. 21–32.
Bertrand, P., Broniatowski, M., Marcotorchino, J.-F., 2021. Divergences minimisation and appli-
cations. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information GSI 2021.
Lecture Notes in Computer Science, vol. 12829. Springer Nature, Switzerland, pp. 818–828.
Birkhoff, G.D., 1932. A set of postulates for plane geometry, based on scale and protractor. Ann.
Math. 33 (2), 329–345.
Birrell, J., Katsoulakis, M.A., Pantazis, Y., 2022a. Optimizing variational representations of diver-
gences and accelerating their statistical estimation. IEEE Trans. Inf. Theory (early access),
https://doi.org/10.1109/TIT.2022.3160659.
Birrell, J., Dupuis, P., Katsoulakis, M.A., Pantazis, Y., Rey-Bellet, L., 2022b. (f, Γ)-Divergences:
interpolating between f-divergences and integral probability metrics. J. Mach. Learn. Res. 23,
1–70.
Blum, J.R., Kiefer, J., Rosenblatt, M., 1961. Distribution-free tests of independence based on the
sample distribution function. Ann. Math. Stat. 32, 485–498.
Boekee, D.E., 1977. An extension of the Fisher information measure. In: Csiszar, I., Elias, P.
(Eds.), Topics in Information theory (Second Colloq., Keszthely, 1975). Colloq. Math. Soc.
János Bolyai, vol. 16. North-Holland, Amsterdam, pp. 113–123.
Boissonnat, J.-D., Nielsen, F., Nock, R., 2010. Bregman Voronoi diagrams. Discret. Comput.
Geom. 44 (2), 281–307.
Bouzebda, S., Keziou, A., 2010. New estimates and tests of independence in semiparametric cop-
ula models. Kybernetika 46 (1), 178–201.
Broniatowski, M., 2003. Estimation of the Kullback-Leibler divergence. Math. Methods Stat.
12 (4), 391–409.
Broniatowski, M., 2021. Minimum divergence estimators, maximum likelihood and the
generalized bootstrap. Entropy 23 (185), 15 pages. https://doi.org/10.3390/e23020185.
Broniatowski, M., Decurninge, A., 2016. Estimation for models defined by conditions on their
L-moments. IEEE Trans. Inf. Theory 62 (9), 5181–5198.
Broniatowski, M., Keziou, A., 2006. Minimization of ϕ-divergences on sets of signed measures.
Stud. Sci. Math. Hung. 43, 403–442.
Broniatowski, M., Keziou, A., 2009. Parametric estimation and tests through divergences and the
duality technique. J. Multivar. Anal. 100 (1), 16–36.
Broniatowski, M., Keziou, A., 2012. Divergences and duality for estimation and test under
moment condition models. J. Stat. Plan. Inference 142, 2554–2573.
Broniatowski, M., Stummer, W., 2019. Some universal insights on divergences for statistics,
machine learning and artificial intelligence. In: Nielsen, F. (Ed.), Geometric Structures of
Information. Springer Nature, Switzerland, pp. 149–211.
Broniatowski, M., Stummer, W., 2021. A precise bare simulation approach to the minimization of
some distances–foundations. arXiv:2107.01693v1 (July).
Broniatowski, M., Miranda, E., Stummer, W., 2019. Testing the number and the nature of the
components in a mixture distribution. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric
Science of Information GSI 2019. Lecture Notes in Computer Science, vol. 11712. Springer
Nature, Switzerland, pp. 309–318.
Chamany, A., Baratpour, S., 2014. A dynamic discrimination information based on cumulative
residual entropy and its properties. Commun. Stat. Theory Methods 43, 1041–1049.
Chernobai, A., Rachev, S.T., Fabozzi, F.J., 2015. Composite goodness-of-fit tests for left-
truncated loss samples. In: Lee, C.-F., Lee, J. (Eds.), Handbook of Financial Econometrics
and Statistics. Springer Science/Business Media, New York, pp. 575–596.
Chernozhukov, V., Galichon, A., Hallin, M., Henry, M., 2017. Monge-Kantorovich depth, quan-
tiles, ranks, and signs. Ann. Stat. 45 (1), 223–256.
Cramer, H., 1928. On the composition of elementary errors. Scand. Actuar. J. 1928 (1), 13–74 and
141–180.
Csiszar, I., 1963. Eine informations theoretische Ungleichung und ihre Anwendung auf den
Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. Sci.
A-8, 85–108.
Csiszar, I., 1967. Information-type measures of difference of probability distributions and indirect
observations. Stud. Sci. Math. Hung. 2, 299–318.
Csiszar, I., 1991. Why least squares and maximum entropy? An axiomatic approach to inference
for linear inverse problems. Ann. Stat. 19 (4), 2032–2066.
Csiszar, I., Breuer, T., 2016. Measuring distribution model risk. Math. Financ. 26 (2), 395–411.
Darling, D.A., 1957. The Kolmogorow-Smirnov, Cramer-von Mises tests. Ann. Math. Stat. 28,
823–838.
David, H.A., Nagaraja, H.N., 2003. Order Statistics, third ed. Wiley, Hoboken.
Davy, M., Doucet, A., 2003. Copulas: a new insight into positive time-frequency distributions.
IEEE Signal Proc. Letters 10 (7), 215–218.
De Groot, M.H., 1962. Uncertainty, information and sequential experiments. Ann. Math. Stat. 33,
404–419.
Deheuvels, P., Martynov, G., 2003. Karhunen-Loeve expansions for weighted Wiener processes
and Brownian bridges via Bessel functions. In: Hoffmann-Jorgensen, J., Marcus, M.B.,
Wellner, J.A. (Eds.), High Dimensional Probability III. Springer, Basel, pp. 57–93.
Dembo, A., Zeitouni, O., 2009. Large Deviations Techniques and Applications, second ed. (corr.
print). Springer, New York.
Di Crescenzo, A., Longobardi, M., 2004. A measure of discrimination between past lifetime dis-
tributions. Stat. Prob. Lett. 67, 173–182.
Di Crescenzo, A., Longobardi, M., 2015. Some properties and applications of cumulative
Kullback-Leibler information. Appl. Stoch. Models Bus. Ind. 31, 875–891.
Duchi, J., Khosravi, K., Ruan, F., 2018. Multiclass classification, information, divergence and sur-
rogate risk. Ann. Stat. 46 (6B), 3246–3275.
Durante, F., Sempi, C., 2016. Principles of Copula Theory. CRC Press, Boca Raton.
Durrani, T.S., Zeng, X., 2009. Copula based divergence measures and their use in image registra-
tion. In: Proc. 17th Eur. Sig. Proc. Conf. (EUSIPCO 2009), pp. 1309–1313.
Ebrahimi, N., Kirmani, S.N.U.A., 1996. A measure of discrimination between two residual life-
time distributions and its applications. Ann. Inst. Stat. Math. 48 (2), 257–265.
Ebrahimi, N., Soofi, E.S., Zahedi, H., 2004. Information properties of order statistics and
spacings. IEEE Trans. Inf. Theory 50 (1), 177–183.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall, New York.
Embrechts, P., Hofert, M., 2013. A note on generalized inverses. Math. Meth. Oper. Res. 77,
423–432.
Faugeras, O.P., Rüschendorf, L., 2017. Markov morphisms: a combined copula and mass trans-
portation approach to multivariate quantiles. Math. Appl. 45 (1), 3–45.
Faugeras, O.P., Rüschendorf, L., 2018. Risk excess measures induced by hemi-metrics. Toulouse
School of Economics, Working Paper 18-922.
Fienberg, S.E., Holland, P.W., 1970. Methods for eliminating zero counts in contingency tables.
In: Patil, G.P. (Ed.), Random Counts in Scientific Work, vol. 1 (Random Counts in Models
and Structures). Pennsylvania State University Press, University Park, pp. 233–260.
Figalli, A., 2018. On the continuity of center-outward distribution and quantile functions. Nonlin-
ear Anal. 177 (B), 413–421.
Galichon, H., Henry, M., 2012. Dual theory of choice with multivariate risks. J. Econ. Theory
147, 1501–1516.
Garcia-Garcia, D., Williamson, R.C., 2012. Divergences and risks for multiclass experiments. In:
25th Annual Conference on Learning Theory; JMLR Workshop and Conference Proceedings,
vol. 23. 28.1–28.20.
Gayen, A., Kumar, M.A., 2021. Projection theorems and estimating equations for power-law mod-
els. J. Multivar. Anal. 184, 104734. https://doi.org/10.1016/j.jmva.2021.104734.
Ghosh, A., Basu, A., 2016a. Robust Bayes estimation using the density power divergence. Ann.
Inst. Stat. Math. 68, 413–437.
Ghosh, A., Basu, A., 2016b. Robust estimation in generalized linear models: the density power
divergence approach. TEST 25, 269–290.
Ghosh, A., Basu, A., 2018. A new family of divergences originating from model adequacy tests
and applications to robust statistical inference. IEEE Trans. Inf. Theory 64 (8), 5581–5591.
Gilchrist, W.G., 2000. Statistical Modelling With Quantile Functions. Chapman & Hall/CRC,
Boca Raton.
Groeneboom, P., Oosterhoff, J., 1977. Bahadur efficiency and probability of large deviations. Stat.
Neerlandica 31 (1), 1–24.
Groeneboom, P., Oosterhoff, J., Ruymgaart, F.H., 1979. Large deviation theorems for empirical
probability measures. Ann. Prob. 7 (4), 553–586.
Guo, X., Hong, J., Lin, T., Yang, N., 2021. Relaxed Wasserstein with application to GANs. In:
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP
2021), pp. 3325–3329.
Györfi, L., Nemetz, T., 1977. f-Dissimilarity: a general class of separation measures of several
probability measures. In: Csiszar, I., Elias, P. (Eds.), Topics in Information Theory (Second
Colloq., Keszthely, 1975). Colloq. Math. Soc. János Bolyai, vol. 16. North-Holland, Amster-
dam, pp. 309–321.
Györfi, L., Nemetz, T., 1978. f-Dissimilarity: a generalization of the affinity of several distribu-
tions. Ann. Inst. Stat. Math 30 (Part A), 105–113.
Hajek, J., Sidak, Z., Sen, P.K., 1999. Theory of Rank Tests. Academic Press, San Diego.
Hallin, M., 2017. On distribution and quantile functions, ranks and signs. ECARES Working
Paper 2017-34.
Hallin, M., 2018. From Mahalanobis to Bregman via Monge and Kantorovich. Sankhya 80-B
(Suppl. 1), S135–S146.
Hallin, M., Del Barrio, E., Cuesta-Albertos, J., Matran, C., 2021. Distribution and quantile func-
tions, ranks and signs in dimension d; a measure transportation approach. Ann. Stat. 49 (2),
1139–1165.
Hande, S., 1994. A note on order statistics for nonidentically distributed variables. Sankhya
A 56 (2), 365–368.
Henze, N., Nikitin, Y.Y., 2000. A new approach to goodness-of-fit testing based on the integrated
empirical process. J. Nonparam. Stat. 12 (3), 391–416.
Hoadley, A.B., 1967. On the probability of large deviations of functions of several empirical
cdf’s. Ann. Math. Stat. 38, 360–381.
Hosking, J.R.M., 1990. L-moments: analysis and estimation of distributions using linear combina-
tions of order statistics. J. R. Stat. Soc. B 52 (1), 105–124.
Jager, L., Wellner, J.A., 2007. Goodness-of-fit tests via phi-divergences. Ann. Stat. 35 (5),
2018–2053.
Judge, G.G., Mittelhammer, R.C., 2012. An Information Theoretic Approach to Econometrics.
Cambridge University Press, Cambridge.
Jurafsky, D., Martin, J.H., 2009. Speech and Language Processing, second ed. Pearson/Prentice
Hall, Upper Saddle River.
Kammerer, N.B., Stummer, W., 2020. Some dissimilarity measures of branching processes and
optimal decision making in the presence of potential pandemics. Entropy 22 (8), 874.
https://doi.org/10.3390/e22080874 (123 pages).
Kanamori, T., Suzuki, T., Sugiyama, M., 2012. f-divergence estimation and two-sample homoge-
neity test under semiparametric density-ratio models. IEEE Trans. Inf. Theory 58 (2),
708–720.
Karakida, R., Amari, S.-I., 2017. Information geometry of Wasserstein divergence. In: Nielsen, F.,
Barbaresco, F. (Eds.), Geometric Science of Information GSI 2017. Lecture Notes in Com-
puter Science, vol. 10589. Springer, International, pp. 119–126.
Kayal, S., Tripathy, M.R., 2018. A quantile-based Tsallis-α divergence. Physica A 492, 496–505.
Keziou, A., 2015. Multivariate divergences with application in multisample density ratio models.
In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information GSI 2015. Lecture
Notes in Computer Science, vol. 9389. Springer, Berlin, pp. 444–453.
Keziou, A., Leoni-Aubin, S., 2008. On empirical likelihood for semiparametric two-sample den-
sity ratio models. J. Stat. Plann. Infer. 138 (4), 915–928.
Kißlinger, A.-L., Stummer, W., 2013. Some decision procedures based on scaled Bregman dis-
tance surfaces. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information
GSI 2013. Lecture Notes in Computer Science, vol. 8085. Springer, Berlin, pp. 479–486.
Kißlinger, A.-L., Stummer, W., 2015. New model search for nonlinear recursive models, regres-
sions and autoregressions. In: Nielsen, F., Barbaresco, F. (Eds.), Lecture Notes in Computer
Science, vol. 9389. Springer, Berlin, pp. 693–701.
Kißlinger, A.-L., Stummer, W., 2016. Robust statistical engineering by means of scaled Bregman
distances. In: Agostinelli, C., et al. (Eds.), Recent Advances in Robust Statistics–Theory and
Applications. Springer, New Delhi, pp. 81–113.
Kißlinger, A.-L., Stummer, W., 2018. A new toolkit for robust distributional change detection.
Appl. Stoch. Models Bus. Ind. 34, 682–699.
Kleijn, B.J.K., van der Vaart, A.W., 2012. The Bernstein-von-Mises theorem under misspecifica-
tion. Electron. J. Stat. 6, 354–381.
Klein, I., Mangold, B., Doll, M., 2016. Cumulative paired ϕ-entropy. Entropy 18 (7), 248.
Krömer, S., Stummer, W., 2019. A new toolkit for mortality data analytics. In: Steland, A.,
Rafajlowicz, E., Okhrin, O. (Eds.), Stochastic Models, Statistics and Their Applications.
Springer Nature, Switzerland, pp. 393–407.
Kuchibhotla, A.K., Basu, A., 2015. A general setup for minimum disparity estimation. Stat. Prob.
Lett. 96, 68–74.
Liese, L., Miescke, K.J., 2008. Statistical Decision Theory; Estimation, Testing, and Selection.
Springer, New York.
Liese, F., Vajda, I., 1987. Convex Statistical Distances. Teubner, Leipzig.
Liese, F., Vajda, I., 2006. On divergences and informations in statistics and information theory.
IEEE Trans. Inf. Theory 52 (10), 4394–4412.
Lin, N., He, X., 2006. Robust and efficient estimation under data grouping. Biometrika 93 (1),
99–112.
Lin, T., Hu, Z., Guo, X., 2019. Sparsemax and relaxed Wasserstein for topic sparsity. In: The
Twelfth ACM International Conference on Web Search and Data Mining (WSDM 19),
ACM, New York, pp. 141–149.
Lindsay, B.G., 1994. Efficiency versus robustness: the case for minimum Hellinger distance and
related methods. Ann. Stat. 22 (2), 1081–1114.
Lindsay, B.G., 2004. Statistical distances as loss functions in assessing model adequacy. In:
Taper, M.P., Lele, S.R. (Eds.), The Nature of Scientific Evidence. The University of Chicago
Press, Chicago, pp. 439–487.
Lindsay, B.G., Markatou, M., Ray, S., Yang, K., Chen, S.-C., 2008. Quadratic distances on prob-
abilities; a unified foundation. Ann. Stat. 36 (2), 983–1006.
Liu, J., 2007. Information Theoretic Content and Probability (Ph.D. thesis). University of Florida.
Liu, L., Lindsay, B.G., 2009. Building and using semiparametric tolerance regions for parametric
multinomial models. Ann. Stat. 37 (6A), 3644–3659.
Liu, R.Y., Parelius, J.M., Singh, K., 1999. Multivariate analysis by data depth: descriptive statis-
tics, graphics and inference. Ann. Stat. 27 (3), 783–858.
Markatou, M., Chen, Y., 2019. Statistical distances and the construction of evidence functions for
model adequacy. Front. Ecol. Evol. 7, 447. https://doi.org/10.3389/fevo.2019.00447.
Markatou, M., Sofikitou, E., 2018. Non-quadratic distances in model assessment. Entropy 20, 464.
https://doi.org/10.3390/e20060464.
Marshall, A.W., Olkin, I., Arnold, B.C., 2011. Inequalities: Theory of Majorization and Its Appli-
cations, second ed. Springer, New York.
Matusita, K., 1967. On the notion of affinity of several distributions and some of its applications.
Ann. Inst. Stat. Math. 19, 181–192.
Mehrali, Y., Asadi, M., 2021. Parameter-estimation based on cumulative Kullback-Leibler infor-
mation. REVSTAT 19 (1), 111–130.
Menendez, M., Pardo, L., Taneja, I.J., 1992. On M-dimensional unified (r, s)-Jensen difference
divergence measures and their applications. Kybernetika 28 (4), 309–324.
Menendez, M., Salicru, M., Morales, D., Pardo, L., 1997. Divergence measures between popula-
tions: applications in the exponential family. Commun. Stat. Theory Methods 26 (5),
1099–1117.
Menendez, M., Morales, D., Pardo, L., Vajda, I., 1998. Two approaches to grouping of data and
related disparity statistics. Commun. Stat. Theory Methods 27 (3), 609–633.
Menendez, M., Morales, D., Pardo, L., Vajda, I., 2001a. Minimum disparity estimators for
discrete and continuous models. Appl. Math. 46 (6), 439–466.
Menendez, M., Morales, D., Pardo, L., Vajda, I., 2001b. Minimum divergence estimators based on
grouped data. Ann. Inst. Stat. Math. 53 (2), 277–288.
Micheas, A.C., Zografos, K., 2006. Measuring stochastic dependence using ϕ-divergence. J. Mul-
tivar. Anal. 97, 765–784.
Millmann, R.S., Parker, G.D., 1991. Geometry–A Metric Approach With Models, second ed.
Springer, New York.
Morales, D., Pardo, L., Vajda, I., 2004. Digitalization of observations permits efficient estimation
in continuous models. In: Lopez-Diaz, M., et al. (Eds.), Soft Methodology and Random Infor-
mation Systems. Springer, Berlin, pp. 315–322.
Morales, D., Pardo, L., Vajda, I., 2006. On efficient estimation in continuous models based on
finitely quantized observations. Commun. Stat. Theory Methods 35 (9), 1629–1653.
Morimoto, T., 1963. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 18 (3), 328–331.
Najim, J., 2002. A Cramer type theorem for weighted random variables. Electron. J. Prob. 7 (4),
1–32. https://doi.org/10.1214/EJP.v7-103.
Nguyen, X., Wainwright, M.J., Jordan, M.I., 2010. Estimating divergence functionals and the like-
lihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 56 (11), 5847–5861.
Nielsen, F. (Ed.), 2021. Progress in Information Geometry. Springer Nature, Switzerland.
Nielsen, F., Barbaresco, F. (Eds.), 2013. Geometric Science of Information GSI 2013. Lecture
Notes in Computer Science, vol. 8085 Springer, Berlin.
Nielsen, F., Barbaresco, F. (Eds.), 2015. Geometric Science of Information GSI 2015. Lecture
Notes in Computer Science, vol. 9389 Springer, International.
Nielsen, F., Barbaresco, F. (Eds.), 2017. Geometric Science of Information GSI 2017. Lecture
Notes in Computer Science, vol. 10589 Springer, International.
Nielsen, F., Barbaresco, F. (Eds.), 2019. Geometric Science of Information GSI 2019. Lecture
Notes in Computer Science, vol. 11712 Springer Nature, Switzerland.
Nielsen, F., Barbaresco, F. (Eds.), 2021. Geometric Science of Information GSI 2021. Lecture
Notes in Computer Science, vol. 12829 Springer Nature, Switzerland.
Nielsen, F., Bhatia, R. (Eds.), 2013. Matrix Information Geometry. Springer, Berlin.
Nikitin, Y., 1995. Asymptotic Efficiency of Nonparametric Tests. Cambridge University Press,
Cambridge.
Nock, R., Nielsen, F., Amari, S.-I., 2016. On conformal divergences and their population minimi-
zers. IEEE Trans. Inform. Theory 62 (1), 527–538.
Österreicher, F., Vajda, I., 1993. Statistical information and discrimination. IEEE Trans. Inf.
Theory 39 (3), 1036–1039.
Owen, A.B., 1988. Empirical likelihood ratio confidence intervals for a single functional.
Biometrika 75 (2), 237–249.
Owen, A.B., 1990. Empirical likelihood ratio confidence regions. Ann. Stat. 18 (1), 90–120.
Owen, A.B., 2001. Empirical Likelihood. Chapman and Hall, Boca Raton.
Pal, S., Wong, T.-K.L., 2016. The geometry of relative arbitrage. Math. Finan. Econon. 10,
263–293.
Pal, S., Wong, T.-K.L., 2018. Exponentially concave functions and a new information geometry.
Ann. Prob. 46 (2), 1070–1113.
Pardo, L., 2006. Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC,
Boca Raton.
Pardo, M.C., Vajda, I., 1997. About distances of discrete distributions satisfying the data proces-
sing theorem of information theory. IEEE Trans. Inf. Theory 43 (4), 1288–1293.
Pardo, M.C., Vajda, I., 2003. On asymptotic properties of information-theoretic divergences.
IEEE Trans. Inf. Theory 49 (7), 1860–1868.
Park, C., Basu, A., 2004. Minimum disparity estimation: asymptotic normality and breakdown
point results. Bull. Inform. Cybernet. 36, 19–33.
Park, S., Rao, M., Shin, D.W., 2012. On cumulative residual Kullback-Leibler information. Stat.
Prob. Lett. 82, 2025–2032.
Park, S., Noughabi, H.A., Kim, I., 2018. General cumulative Kullback-Leibler information.
Commun. Stat. Theory Methods 47 (7), 1551–1560.
Pelletier, B., 2011. Inference in φ-families of distributions. Statistics 45 (3), 223–236.
Peyre, G., Cuturi, M., 2019. Computational optimal transport: with applications to data science.
Found. Trends Mach. Learn. 11 (5-6), 355–607 (Also appeared in book form by now Publish-
ers, Hanover MA, USA (2019)).
Rachev, S.T., Rüschendorf, L., 1998. Mass Transportation Problems, vol. I. Springer, New York.
Read, T.R.C., Cressie, N.A.C., 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data.
Springer, New York.
Reiss, R.-D., 1989. Approximate Distributions of Order Statistics. Springer, New York.
Rodriguez, J.C., Viollaz, A.J., 1995. A Cramer-von Mises type goodness of fit test with asymmet-
ric weight function. Commun. Stat. Theory Methods 24 (4), 1095–1120.
Roensch, B., Stummer, W., 2017. 3D insights to some divergences for robust statistics and
machine learning. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information
GSI 2017. Lecture Notes in Computer Science, vol. 10589. Springer International, pp.
460–469.
Roensch, B., Stummer, W., 2019a. Robust estimation by means of scaled Bregman power
distances; part I; non-homogeneous data. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric
Science of Information GSI 2019. Lecture Notes in Computer Science, vol. 11712. Springer
Nature, Switzerland, pp. 319–330.
Roensch, B., Stummer, W., 2019b. Robust estimation by means of scaled Bregman power dis-
tances; part II; extreme values. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of
Information GSI 2019. Lecture Notes in Computer Science, vol. 11712. Springer Nature,
Switzerland, pp. 331–340.
Rosenblatt, M., 1952. Limit theorems associated with variants of the von Mises statistic. Ann.
Math. Stat. 23, 617–623.
Sankaran, P.G., Sunoj, S.M., Unnikrishnan Nair, N., 2016. Kullback-Leibler divergence: a quan-
tile approach. Stat. Prob. Lett. 111, 72–79.
Schweizer, B., Wolff, E.F., 1981. On nonparametric measures of independence for random vari-
ables. Ann. Stat. 9 (4), 879–885.
Scott, W.F., 1999. A weighted Cramer-von Mises statistic, with some applications to clinical
trials. Commun. Stat. Theory Methods 28 (12), 3001–3008.
Serfling, R., 2002. Quantile functions for multivariate analysis: approaches and applications. Stat.
Neerlandica 56 (2), 214–232.
222 SECTION II Information geometry
Serfling, R., 2006. Depth functions in nonparametric multivariate inference. In: Liu, R.Y.,
Serfling, R., Souvaine, D.L. (Eds.), Robust Multivariate Analysis, Computational Geometry
and Applications. DIMACS Series in Discrete Mathematics and Theoretical Computer Sci-
ence, vol. 72. American Mathematical Society, pp. 1–16.
Serfling, R., 2010. Equivariance and invariance properties of multivariate quantile and related
functions, and the role of standardization. J. Nonparam. Stat. 22 (7), 915–936.
Serfling, R., Zuo, Y., 2010. Discussion. Ann. Stat. 38 (2), 676–684.
Shin, H., Jung, Y., Jeong, C., Heo, J.-H., 2012. Assessment of modified Anderson-Darling test
statistics for the generalized extreme value and generalized logistic distributions. Stoch.
Env. Res. Risk A. 26, 105–114.
Sklar, A., 1959. Fonctions de repartition á n dimensions et leurs marges. Publ. Inst. Stat. Univ.
Paris 8, 229–231.
Smirnov/Smirnoff, N., 1936. Sur la distribution de ω2. C. R. Acad. Sci. Paris 202, 449–452.
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Lanckriet, G., Scholkop, B.A., 2012. On the
empirical estimation of integral probability metrics. Electron. J. Stat. 6, 1550–1599.
Stephens, M.A., 1986. Test based on EDF statistics. In: D’Agostino, R.B., Stephens, M.A. (Eds.),
Goodness-of-Fit Techniques. Marcel Dekker Inc., New York, pp. 97–193.
Stummer, W., 1999. On a statistical information measure of diffusion processes. Stat. Decisions
17, 359–376.
Stummer, W., 2001. On a statistical information measure for a generalized Samuelson-Black-
Scholes model. Stat. Decisions 19, 289–314.
Stummer, W., 2004. Exponentials, Diffusions, Finance, Entropy and Information. Shaker, Aachen.
Stummer, W., 2007. Some Bregman distances between financial diffusion processes. Proc. Appl.
Math. Mech. 7 (1), 1050503–1050504.
Stummer, W., 2021. Optimal transport with some directed distances. In: Nielsen, F.,
Barbaresco, F. (Eds.), Geometric Science of Information GSI 2021. Lecture Notes in Com-
puter Science, vol. 12829. Springer Nature, Switzerland, pp. 829–840.
Stummer, W., Kißlinger, A.L., 2017. Some new flexibilizations of Bregman divergences and their
asymptotics. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information GSI
2017. Lecture Notes in Computer Science, vol. 10589. Springer International, pp. 514–522.
Stummer, W., Lao, W., 2012. Limits of Bayesian decision related quantities of binomial asset
price models. Kybernetika 48 (4), 750–767.
Stummer, W., Vajda, I., 2007. Optimal statistical decisions about some alternative financial mod-
els. J. Econometrics 137, 441–447.
Stummer, W., Vajda, I., 2010. On divergences of finite measures and their applicability in statis-
tics and information theory. Statistics 44, 169–187.
Stummer, W., Vajda, I., 2012. On Bregman distances and divergences of probability measures.
IEEE Trans. Inf. Theory 58 (3), 1277–1288.
Sunoj, S.M., Sankaran, P.G., Unnikrishnan Nair, N., 2018. Quantile-based cumulative Kullback-
Leibler divergence. Statistics 52 (1), 1–17.
Tan, Z., Zhang, X., 2022. On loss functions and regret bounds for multi-category classfication.
IEEE Trans. Inf. Theory (early access), https://doi.org/10.1109/TIT.2022.3167635.
Toussaint, G.T., 1974. Some properties of Matusita’s measure of affinity of several distributions.
Ann. Inst. Stat. Math. 26 (3), 389–394.
Toussaint, G.T., 1978. Probability of error, expected divergence, and the affinity of several distri-
butions. IEEE Trans. Syst. Man Cybern. SMC-8 (6), 482–485.
Tran, V.H., 2018. Copula variational Bayes inference via information geometry. Preprint,
arXiv:1803.10998v1 (March).
A unifying framework for some directed distances in statistics Chapter 5 223
Trashorras, J., Wintenberger, O., 2014. Large deviations for bootstrapped empirical measures.
Bernoulli 20 (4), 1845–1878.
Vajda, I., 1972. On the f-divergence and singularity of probability measures. Periodica Math.
Hungar. 2 (1-4), 223–234.
Vajda, I., 1989. Theory of Statistical Inference and Information. Kluwer, Dordrecht.
Vajda, I., van der Meulen, E.C., 2011. Goodness-of-fit criteria based on observations quantized by
hypothetical and empirical percentiles. In: Karian, Z.A., Dudewicz, E.J. (Eds.), Handbook of
Fitting Statistical Distributions With R. Chapman & Hall/CRC, Boca Raton, pp. 917–994.
Vaughan, R.J., Venables, W.N., 1972. Permanent expressions for order statistic densities. J. R.
Stat. Soc. B 34 (2), 308–310.
Victoria-Feser, M.-P., Ronchetti, E., 1997. Robust estimation for grouped data. J. Am. Stat.
Assoc. 92 (437), 333–340.
Von Mises, R., 1931. Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und the-
oretischen Physik. Deuticke, Leipzig.
Vonta, F., Karagrigoriou, A., 2010. Generalized measures of divergence in survival analysis and
reliability. J. Appl. Prob. 47, 216–234.
Weller-Fahy, D.J., Borghetti, B.J., Sodemann, A.A., 2015. A survey of distance and similarity
measures used within network intrusion anomaly detection. IEEE Commun. Surv. Tutorials
17 (1), 70–91.
Werner, E., Ye, D., 2017. Mixed f-divergence for multiple pairs of measures. Canad. Math. Bull.
60 (3), 641–654.
Yari, G., Saghafi, A., 2012. Unbiased Weibull modulus estimation using differential cumulative
entropy. Commun. Stat. Simul. Comput. 41 (8), 1372–1378.
Yari, G., Mirhabibi, A., Saghafi, A., 2013. Estimation of the Weibull parameters by Kullback-
Leibler divergence of survival functions. Appl. Math. Inf. Sci. 7 (1), 187–192.
Zeng, X., Durrani, T.S., 2011. Estimation of mutual information using copula density function.
Electr. Lett. 47 (8), 493–494.
Zeng, X., Ren, J., Sun, M., Marshall, S., Durrani, T., 2014. Copulas for statistical signal proces-
sing (Part II): simulation, optimal selection and practical applications. Signal Process. 94,
681–690.
Zografos, K., 1994. Asymptotic distributions of estimated f-dissimilarity between populations in
stratified random sampling. Stat. Prob. Lett. 21, 147–151.
Zografos, K., 1998. f-Dissimilarity of several distributions in testing statistical hypotheses. Ann.
Inst. Stat. Math. 50 (2), 295–310.
Zuo, Y., Serfling, R., 2000a. General notions of statistical depth function. Ann. Stat. 28 (2),
461–482.
Zuo, Y., Serfling, R., 2000b. Structural properties and convergence results for contours of sample
statistical depth functions. Ann. Stat. 28 (2), 483–499.
This page intentionally left blank
Chapter 6
The analytic dually flat space of the mixture family
Frank Nielsen†
Abstract
A smooth and strictly convex function on an open convex domain induces both (1) a
Hessian structure with respect to the standard flat Euclidean connection, and (2) a dually
flat structure in information geometry. We first review these fundamental constructions
and illustrate how to instantiate them for (a) full regular exponential families from their
cumulant functions, (b) regular homogeneous cones from their characteristic functions,
and (c) mixture families from their Shannon negentropy functions. Although these struc-
tures can be explicitly built for many common examples of the first two classes of expo-
nential families and homogeneous cones, the differential entropy of continuous statistical
mixtures with distinct prescribed density components sharing the same support is hith-
erto not known in closed form, hence forcing implementations of mixture family mani-
folds in practice using Monte Carlo sampling. In this chapter, we report a notable
exception: The uniorder family of mixtures defined as the convex combination of two
prescribed and distinct Cauchy distributions.
Keywords: Riemannian manifold, Hessian manifold, Affine connection, Exponential
family, Homogeneous regular cone, Mixture family, Fisher metric, Rao distance,
Cauchy mixture
†
https://franknielsen.github.io/.
For any $\alpha \in \mathbb{R}$, the connections $\nabla^{\alpha}$ and $\nabla^{-\alpha}$ are provably dual with respect to the Fisher information metric $g^F$ since their mid-connection $\frac{\nabla^{\alpha}+\nabla^{-\alpha}}{2}$ corresponds to the Levi-Civita metric connection ${}^{g}\nabla$ (Godinho and Natário, 2014). The fundamental theorem of Riemannian geometry (Godinho and Natário, 2014) states that the Levi-Civita connection is the unique torsion-free metric-compatible affine connection.
Two common types of families of probability distributions are considered in information geometry: the exponential families (Barndorff-Nielsen, 2014) and the mixture families (Amari, 2016; Nielsen and Hadjeres, 2019). The $\pm 1$-structures (i.e., the $\alpha$-structures for $\alpha = \pm 1$) of the exponential families and mixture families are said to be dually flat (to be detailed in Section 2.2). Dually flat spaces have also been called Bregman manifolds (Nielsen, 2021) as they can be realized either from a smooth and strictly convex function (a Bregman generator) or, equivalently, from its corresponding Bregman divergence via the information geometry structure derived from divergences (Amari and Cichocki, 2010).
The analytic dually flat space of the mixture family Chapter 6 227
corresponding Riemannian distance $\rho_{h_F}(p_1, p_2)$ between any two points $p_1, p_2$ on the Riemannian manifold $(M, g_{h_F})$. When $h_F$ is the Fisher information matrix, this distance is called Rao's distance (Atkinson and Mitchell, 1981) or the Fisher–Rao distance (Pinele et al., 2020).
a More generally, higher-order covariant derivatives are defined recursively by $\nabla^k T = \nabla(\nabla^{k-1} T)$.
When $F_i(\theta) = \frac{1}{2}\theta^2$ for $i \in \{1, \ldots, D\}$, we have $h_i(\theta) = \theta$, and we recover the usual Euclidean distance formula expressed in the Cartesian coordinate system. In general, the formula of Eq. (1) is the Euclidean distance expressed in the $h(\theta) = (h_1(\theta_1), \ldots, h_D(\theta_D))$-coordinate system. Indeed, recall that a Riemannian metric tensor $g$ is the Euclidean metric (Godinho and Natário, 2014) if there exists a coordinate system $h$ such that $[g]_h = I$, the identity matrix of dimension $D \times D$. The Euclidean metric expressed in the Cartesian coordinate system $\lambda$ is $[g]_\lambda = I$, the identity matrix.
Notice that the Euclidean distance between two points $p_1$ and $p_2$ of the Euclidean plane $\mathbb{R}^2$ is expressed in the polar coordinates $\left(r = \sqrt{x^2 + y^2},\ \theta = \arctan\frac{y}{x}\right)$ (with inverse transformation $(x = r\cos\theta,\ y = r\sin\theta)$) as
$$\rho_{\mathrm{Euc}}(p_1, p_2) = \sqrt{r_1^2 + r_2^2 - 2 r_1 r_2 \cos(\theta_2 - \theta_1)}.$$
for the Bregman generator $F(\theta)$, called the Bregman divergence (Bregman, 1967), and we have $B_F(\theta_1 : \theta_2) = B_{F^*}(\eta_2 : \eta_1)$. We can also express equivalently the dual Bregman divergences using mixed parameterizations with the Fenchel–Young divergences (Blondel et al., 2020; Nielsen, 2021):
$$Y_F(\theta_1 : \eta_2) := F(\theta_1) + F^*(\eta_2) - \theta_1^\top \eta_2.$$
Thus we can express the divergence using either primal, dual, or mixed coordinate systems as follows:
$$B_F(\theta_1 : \theta_2) = Y_F(\theta_1 : \eta_2) = Y_{F^*}(\eta_2 : \theta_1) = B_{F^*}(\eta_2 : \eta_1).$$
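As a concrete illustration (a sketch added here, not part of the original text), the following Python snippet numerically checks the identity $B_F(\theta_1 : \theta_2) = Y_F(\theta_1 : \eta_2) = B_{F^*}(\eta_2 : \eta_1)$ for the one-dimensional generator $F(\theta) = \theta\log\theta + (1-\theta)\log(1-\theta)$, whose convex conjugate $F^*(\eta) = \log(1 + e^\eta)$ is known in closed form; the choice of generator and the variable names are illustrative assumptions.

import numpy as np

# Bregman generator F (negentropy of a Bernoulli distribution) and its convex conjugate F*.
F      = lambda t: t * np.log(t) + (1.0 - t) * np.log(1.0 - t)
gradF  = lambda t: np.log(t / (1.0 - t))          # eta = F'(theta)
Fstar  = lambda e: np.log1p(np.exp(e))            # F*(eta) = log(1 + exp(eta))
gradFs = lambda e: 1.0 / (1.0 + np.exp(-e))       # theta = (F*)'(eta)

def bregman(G, gradG, x1, x2):
    # B_G(x1 : x2) = G(x1) - G(x2) - (x1 - x2) G'(x2)
    return G(x1) - G(x2) - (x1 - x2) * gradG(x2)

theta1, theta2 = 0.2, 0.7
eta1, eta2 = gradF(theta1), gradF(theta2)
BF  = bregman(F, gradF, theta1, theta2)           # B_F(theta1 : theta2)
YF  = F(theta1) + Fstar(eta2) - theta1 * eta2     # Fenchel-Young Y_F(theta1 : eta2)
BFs = bregman(Fstar, gradFs, eta2, eta1)          # B_{F*}(eta2 : eta1)
print(BF, YF, BFs)                                # the three values coincide up to rounding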
A Riemannian Hessian metric tensor (Shima, 2007) ${}^F g$ can be defined in the $\theta$-coordinate system by
$$[{}^F g]_\theta := \nabla^2_\theta F(\theta),$$
with dual metric tensor ${}^{F^*} g$ expressed in the $\eta$-coordinate system by
$$[{}^{F^*} g]_\eta := \nabla^2_\eta F^*(\eta),$$
$${}^F g(e_i, e_j) = \partial_i \partial_j F(\theta), \qquad {}^{F^*} g(e^{*i}, e^{*j}) = \partial^i \partial^j F^*(\eta).$$
The Crouzeix identity (Crouzeix, 1977) holds (i.e., $\nabla^2_\theta F(\theta)\, \nabla^2_\eta F^*(\eta) = I$, the identity matrix), meaning that the bases $E$ and $E^*$ are reciprocal (Amari, 2016; Nielsen, 2020): $g(e_i, e^{*j}) = \delta_i^j$, where $\delta_i^j$ is the Kronecker symbol: $\delta_i^j = 0$ if $j \neq i$, and $\delta_i^j = 1$ iff $i = j$.
The Riemannian metric tensor can thus be expressed equivalently as
$${}^F g = \frac{\partial}{\partial\theta^i}\frac{\partial}{\partial\theta^j} F(\theta)\; d\theta^i \otimes d\theta^j = \frac{\partial}{\partial\eta_i}\frac{\partial}{\partial\eta_j} F^*(\eta)\; d\eta_i \otimes d\eta_j = d\theta^i \otimes d\eta_i,$$
where $\otimes$ denotes the tensor product (Godinho and Natário, 2014).
A Bregman manifold has been called a dually flat space in information geometry (Amari, 2016; Nielsen, 2020) because the dual potential functions $F(\theta)$ and $F^*(\eta)$ induce two affine connections, denoted by ${}^F\nabla$ and ${}^{F^*}\nabla$, which are flat: the Riemann–Christoffel symbols ${}^F\Gamma_{ij}^k$ characterizing ${}^F\nabla$ vanish in the $\theta$-coordinate system (i.e., ${}^F\Gamma_{ij}^k(\theta) = 0$, and $\theta(\cdot)$ is called a ${}^F\nabla$-coordinate system), and the Riemann–Christoffel symbols ${}^{F^*}\Gamma_{ij}^k$ characterizing ${}^{F^*}\nabla$ vanish in the $\eta$-coordinate system (i.e., ${}^{F^*}\Gamma_{ij}^k(\eta) = 0$, and $\eta(\cdot)$ is called a ${}^{F^*}\nabla$-coordinate system). Furthermore, the two (torsion-free) affine connections ${}^F\nabla$ and ${}^{F^*}\nabla$ are dual with respect to the metric tensor ${}^F g$ (Amari, 2016; Nielsen, 2020), so that their mid-connection coincides with the Levi-Civita metric connection:
$$\frac{{}^F\nabla + {}^{F^*}\nabla}{2} = {}^{\mathrm{LC}}\nabla,$$
where ${}^{\mathrm{LC}}\nabla = {}^{g}\nabla$ denotes the Levi-Civita connection induced by the Hessian metric ${}^F g$.
Two common examples of dually flat spaces of statistical models are the exponential family manifolds (Amari, 2016; Nielsen, 2020), built from regular exponential families (Barndorff-Nielsen, 2014) by setting the Bregman generators to the cumulant functions of the families, and the mixture family manifolds, induced by the negentropy of a statistical mixture with prescribed linearly independent component distributions (Amari, 2016; Nielsen, 2020). The family of categorical distributions (also called multinoulli distributions) is both an exponential family and a mixture family. It is interesting to notice that the cumulant functions of regular exponential families are always analytic ($C^\omega$, see Barndorff-Nielsen, 2014; i.e., $F(\theta)$ admits locally a converging Taylor series at any $\theta \in \Theta$), but the negentropy of a mixture may not be analytic (e.g., the negentropy of a mixture of two normal distributions (Watanabe, 2004)).
To use the toolbox of geometric algorithms on Bregman manifolds (e.g., Banerjee et al., 2005; Boissonnat et al., 2010), one needs the generators $F$ and $F^*$ and their gradients $\nabla F$ and $\nabla F^*$ in closed form. This may not always be possible (Nielsen and Hadjeres, 2019), either:
• because the generator is not computable using elementary functions (e.g., the cumulant function of a polynomial exponential family, or the negentropy of a Gaussian mixture (Watanabe, 2004), a definite integral of a log-sum-exp term), or
• because it is computationally intractable (e.g., the cumulant function of a discrete exponential family in Boltzmann machines (Amari, 2016)).
Eguchi (1992) described the following method to build a dual information-geometric structure from a smooth parameter divergence $D(\cdot : \cdot)$ which meets the following requirements:
1. $D(\theta : \theta') \geq 0$ for all $\theta, \theta'$, with equality iff $\theta = \theta'$.
2. $\partial_i D(\theta : \theta')|_{\theta'=\theta} = \partial'_j D(\theta : \theta')|_{\theta'=\theta} = 0$ for all $i, j$, where $\partial_l := \frac{\partial}{\partial\theta_l}$ and $\partial'_l := \frac{\partial}{\partial\theta'_l}$.
3. $\left[-\partial_i \partial'_j D(\theta : \theta')|_{\theta'=\theta}\right]_{ij}$ is a positive-definite matrix.
The construction, called divergence information geometry (Amari and Cichocki, 2010), proceeds as follows:
$$g_{ij}(\theta) = -\partial_i \partial'_j D(\theta : \theta')|_{\theta'=\theta},$$
$$\Gamma_{ij,k}(\theta) = -\partial_i \partial_j \partial'_k D(\theta : \theta')|_{\theta'=\theta},$$
$$\Gamma^*_{ij,k}(\theta) = -\partial_k \partial'_i \partial'_j D(\theta : \theta')|_{\theta'=\theta}.$$
It can be shown that the connections $\nabla$ and $\nabla^*$ induced, respectively, by $\Gamma_{ij,k}$ and $\Gamma^*_{ij,k}$ are torsion-free and dual.
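The following small numerical sketch (an added illustration, assuming the KL divergence between Bernoulli distributions as the smooth parameter divergence) checks the Eguchi construction by computing $g(\theta) = -\partial_\theta \partial_{\theta'} D(\theta : \theta')|_{\theta'=\theta}$ by finite differences and comparing it with the Fisher information $1/(\theta(1-\theta))$.

import numpy as np

def kl_bernoulli(t, tp):
    # D(theta : theta') = KL(Bernoulli(theta) : Bernoulli(theta'))
    return t * np.log(t / tp) + (1.0 - t) * np.log((1.0 - t) / (1.0 - tp))

def eguchi_metric(D, theta, h=1e-4):
    # g(theta) = -d^2/(dtheta dtheta') D(theta : theta') at theta' = theta, by central differences
    mixed = (D(theta + h, theta + h) - D(theta + h, theta - h)
             - D(theta - h, theta + h) + D(theta - h, theta - h)) / (4.0 * h * h)
    return -mixed

theta = 0.3
print(eguchi_metric(kl_bernoulli, theta))   # ~ 4.7619
print(1.0 / (theta * (1.0 - theta)))        # Fisher information of Bernoulli(theta): 4.7619...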
In practice, many explicit dually flat space constructions have been
reported for exponential families (e.g., Zhang et al., 2007; Zhong et al.,
2008; Malagò and Pistone, 2015) but to the best of the author’s knowledge
none so far for continuous mixture families with component distributions
sharing the real-line support.
We report a first exception: the explicit construction of a dually flat space of the family of statistical mixtures with two prescribed and distinct Cauchy distributions. That is, we report in closed form the Bregman generator $F(\theta)$, the dual parameter $\eta = F'(\theta)$, the dual Bregman generator $F^*(\eta)$ and its derivative $(F^*)'(\eta) = \theta$, and the dual Bregman divergences $B_F(\theta_1 : \theta_2) = B_{F^*}(\eta_2 : \eta_1)$, which amount to the Kullback–Leibler divergence between the corresponding Cauchy mixtures. We check these (large) formulas using symbolic calculations, and to fix ideas instantiate these formulas for the special case of a mixture
where $F(\theta) = \log \int \exp(\theta^\top x)\, d\mu(x)$ is called the cumulant function. It can be shown that $Z(\theta) = \exp(F(\theta))$ is strictly logarithmically convex (Barndorff-Nielsen, 2014), and thus $F(\theta)$ is strictly convex. Moreover, $F(\theta)$ is real analytic on the natural parameter space $\Theta = \{\theta : F(\theta) < \infty\}$. Thus on the domain $\Theta \subseteq \mathbb{R}^D$, $F(\theta)$ induces a Bregman manifold and a dually flat space structure. The family of categorical distributions forms a discrete exponential family (Amari, 2016).
More generally, the concept of an exponential family can be generalized to nonstatistical contexts (Naudts and Anthonis, 2012).
$$I_\theta(\theta) = E_{p_\theta}\!\left[\nabla \log p_\theta(x)\, (\nabla \log p_\theta(x))^\top\right] = \left[E_{p_\theta}\!\left(\frac{\partial}{\partial\theta_i} \log p_\theta(x)\, \frac{\partial}{\partial\theta_j} \log p_\theta(x)\right)\right]_{ij} = \left[E_{p_\theta}\!\left(\frac{x_i x_j}{\theta_i \theta_j}\right)\right]_{ij}.$$
• When $i = j$, we have
$$[I_\theta(\theta)]_{ii} = \frac{1}{\theta_i^2} E_{p_\theta}[x_i^2] = \frac{1}{\theta_i}.$$
• When $i \neq j$, we have $[I_\theta(\theta)]_{ij} = \frac{1}{\theta_i\theta_j} E_{p_\theta}[x_i x_j] = 0$, since at most one coordinate of $x$ is nonzero.
Thus it follows that the Fisher information matrix of the categorical distributions is the diagonal matrix
$$I_\theta(\theta) = \mathrm{diag}\!\left(\frac{1}{\theta_1}, \ldots, \frac{1}{\theta_d}\right).$$
For any smooth invertible mapping $\eta(\theta)$ with invertible Jacobian matrix $\left[\frac{\partial\theta_i}{\partial\eta_j}\right]_{ij}$, we have the following covariant rule of the FIM:
$$I_\eta(\eta) = \left[\frac{\partial\theta_i}{\partial\eta_j}\right]_{ij}^\top I_\theta(\theta(\eta)) \left[\frac{\partial\theta_i}{\partial\eta_j}\right]_{ij}.$$
Therefore, the FIM transforms into the identity matrix under the $\eta$-parameterization:
$$I_\eta(\eta) = \mathrm{diag}(\sqrt{\theta_1}, \ldots, \sqrt{\theta_d})\; \mathrm{diag}\!\left(\frac{1}{\theta_1}, \ldots, \frac{1}{\theta_d}\right) \mathrm{diag}(\sqrt{\theta_1}, \ldots, \sqrt{\theta_d}) = I.$$
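A quick numerical check of the covariant rule (a sketch added here; the parameter values are arbitrary):

import numpy as np

theta = np.array([0.2, 0.3, 0.5])        # arbitrary example values
I_theta = np.diag(1.0 / theta)           # I_theta = diag(1/theta_1, ..., 1/theta_d)
J = np.diag(np.sqrt(theta))              # Jacobian d(theta)/d(eta) for eta_i = 2*sqrt(theta_i)
I_eta = J.T @ I_theta @ J                # covariant rule: I_eta = J^T I_theta(theta(eta)) J
print(np.allclose(I_eta, np.eye(3)))     # True: the FIM is the identity in the eta-coordinates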
FIG. 2 Two-square-root embedding of the Bernoulli family onto the positive orthant of the sphere of radius r = 2.
The Euclidean (chordal) distance between the embedded points $2\sqrt{\theta_1}$ and $2\sqrt{\theta_2}$ is
$$\sqrt{\sum_{i=1}^d \left(2\sqrt{\theta_1^i} - 2\sqrt{\theta_2^i}\right)^2} = 2\sqrt{\sum_{i=1}^d \left(\sqrt{\theta_1^i} - \sqrt{\theta_2^i}\right)^2} = 2\, \rho_{\mathrm{Hellinger}}(\theta_1, \theta_2),$$
where
$$\rho_{\mathrm{Hellinger}}(\theta_1, \theta_2) = \sqrt{\sum_{i=1}^d \left(\sqrt{\theta_1^i} - \sqrt{\theta_2^i}\right)^2} = \left\|\sqrt{\theta_1} - \sqrt{\theta_2}\right\|_2$$
denotes the Hellinger distance, a metric distance. The squared Hellinger distance is called the Hellinger divergence and belongs to the class of $f$-divergences for the generator $f_{\mathrm{Hellinger}}(u) = (\sqrt{u} - 1)^2$.
Furthermore, the following inequality follows from the embedding of the normalized probabilities on the positive orthant of the sphere:
$$\rho(p_{\theta_1}, p_{\theta_2}) \geq 2\, \rho_{\mathrm{Hellinger}}(\theta_1, \theta_2),$$
with equality if and only if $\theta_1 = \theta_2$.
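A small numerical sketch (added here as an illustration; it assumes that $\rho$ denotes the geodesic distance on the radius-2 sphere, i.e., twice the Bhattacharyya angle $\arccos\sum_i \sqrt{\theta_1^i \theta_2^i}$) verifies the inequality on random categorical pairs:

import numpy as np

rng = np.random.default_rng(0)

def hellinger(t1, t2):
    return np.linalg.norm(np.sqrt(t1) - np.sqrt(t2))

def sphere_distance(t1, t2):
    # geodesic distance on the radius-2 sphere between the embedded points 2*sqrt(theta)
    c = np.clip(np.sum(np.sqrt(t1 * t2)), -1.0, 1.0)
    return 2.0 * np.arccos(c)

for _ in range(3):
    t1, t2 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    print(2.0 * hellinger(t1, t2), "<=", sphere_distance(t1, t2))   # chord <= arc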
FIG. 3 Two examples of cones of $\mathbb{R}^2$: a nonregular cone (left, nonconvex and not pointed) and a regular cone (right).
Observe the similarity with the partition function of a natural exponential family when the cone is self-dual. It can be shown that the characteristic function $\chi_K$ is strictly logarithmically convex. Let $\mathrm{Aut}(K)$ denote the automorphism group of $K$, i.e., the subgroup of the general linear group $\mathrm{GL}(d, \mathbb{R})$ such that $A \in \mathrm{Aut}(K) \Leftrightarrow A(K) = K$, with $A(K) = \{Ax \,:\, x \in K\}$. The automorphism group can be shown to be a Lie group (Güler, 1996). A regular cone is said to be homogeneous if its automorphism group is transitive: that is, for all $x, y \in K$, there exists $A \in \mathrm{Aut}(K)$ such that $Ax = y$.
It can be shown that
$$\chi_K(Ax) = \frac{\chi_K(x)}{|\det(A)|}, \qquad (2)$$
for any $A \in \mathrm{Aut}(K)$.
Since $\chi_K$ is strictly logarithmically convex, let us consider the function
$$F_K(x) = \log \chi_K(x),$$
which is strictly convex. Moreover, the function is analytic for homogeneous cones. We can therefore associate a dually flat space structure to cones (Shima, 2007, Chapter 4) using $(K, F_K)$. The induced Riemannian metric $\nabla^2 F_K(x)$ is invariant under the group of automorphisms.
Consider a prescribed point $e \in K^\circ$, the interior of $K$. For any $x \in K^\circ$, let $A_x \in \mathrm{Aut}(K)$ be such that $A_x e = x$. Then using Eq. (2), we have
$$F_K(x) = \log \chi_K(e) - \log(|\det(A_x)|).$$
Since the Bregman generators are defined up to an affine term, we have
$$F_K(x) \equiv -\log(|\det(A_x)|).$$
Furthermore, for a homogeneous regular cone, we have (Güler, 1996, Theorem 4.4):
$$F_K(x) \equiv \frac{1}{2} \log \det(\nabla^2 F_K(x)).$$
For example, consider the nonnegative orthant cone $K = \mathbb{R}^d_{++}$. Then we have $F_K(x) = -\sum_{i=1}^d \log x_i$ and $\nabla^2 F_K(x) = \mathrm{diag}\!\left(\frac{1}{x_1^2}, \ldots, \frac{1}{x_d^2}\right)$, so that $\frac{1}{2}\log\det(\nabla^2 F_K(x)) = -\sum_{i=1}^d \log x_i = F_K(x)$.
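This identity can be checked numerically; a small sketch (added illustration, with arbitrary sample values):

import numpy as np

def F_K(x):
    # log-barrier of the orthant cone: F_K(x) = -sum_i log(x_i)
    return -np.sum(np.log(x))

def hess_F_K(x):
    # Hessian of the log-barrier: diag(1/x_1^2, ..., 1/x_d^2)
    return np.diag(1.0 / x**2)

x = np.array([0.5, 2.0, 3.0])            # arbitrary point in the interior of the cone
print(F_K(x), 0.5 * np.log(np.linalg.det(hess_F_K(x))))   # both equal -sum log x_i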
where $\delta_{x_i}(x) = \delta(x - x_i) = 1$ iff $x = x_i$ and $0$ when $x \neq x_i$. The functions $\delta_{x_i}$ are called Dirac distributions and are linearly independent provided that $x_i \neq x_j$ for any $i \neq j$. The Shannon negentropy is
$$F(\theta) = \sum_{x \in \{x_0, x_1, \ldots, x_D\}} m_\theta(x) \log m_\theta(x) = \sum_{i=0}^{D} \theta_i \log \theta_i.$$
where $I_j = \int_{\mathcal{X}_j} p_j(x) \log p_j(x)\, d\mu(x)$ is the Shannon negentropy of component $p_j$. Thus we have
$$F(\theta) = \sum_{i=0}^{D} \theta_i \log \theta_i + \sum_{i=1}^{D} \theta_i (I_i - I_0).$$
When $p_j(x) = \delta_{x_j}(x)$ with $\mathcal{X}_j = \{x_j\}$, we recover the discrete mixture family of categorical distributions.
In general, when the mixture components share the same support $\mathcal{X}$, it is difficult to obtain a closed-form formula for the negentropy $F(\theta) = \int_{\mathcal{X}} m_\theta(x) \log m_\theta(x)\, d\mu(x)$ because of the integral of the nonseparable log-sum term. In symbolic computing, the celebrated Risch algorithm (Risch, 1969) allows one either to calculate definite integrals in closed form or to report that no such closed-form formula exists using elementary functions. However, the Risch method is only a semi-algorithm since it requires implementing an oracle that decides whether a given mathematical expression is equivalent to zero.
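In practice, such negentropies are therefore estimated stochastically. The following sketch (an added illustration, with a Gaussian mixture chosen arbitrarily as an example of components sharing the support $\mathbb{R}$) estimates $F(\theta) = \int m_\theta \log m_\theta\, d\mu$ by Monte Carlo sampling from the mixture, in the spirit of the Monte Carlo information-geometric structures of Nielsen and Hadjeres (2019):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def negentropy_mc(theta, n=200_000):
    # Monte Carlo estimate of F(theta) = E_{m_theta}[log m_theta(x)] for the mixture
    # m_theta = (1 - theta) N(0, 1) + theta N(5, 1)   (an arbitrary illustrative choice)
    pick = rng.random(n) < theta
    x = np.where(pick, rng.normal(5.0, 1.0, n), rng.normal(0.0, 1.0, n))
    m = (1.0 - theta) * norm.pdf(x, 0.0, 1.0) + theta * norm.pdf(x, 5.0, 1.0)
    return np.mean(np.log(m))

print(negentropy_mc(0.3))                # stochastic estimate of the non-closed-form negentropy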
Eqs. (9) and (10) report the closed-form expression of the Cauchy mixture entropy $h[(1-\theta)p_{l_0,s_0} + \theta\, p_{l_1,s_1}]$, first for standardized parameters and then for general $(l_0, s_0)$ and $(l_1, s_1)$; the general expression (10) combines $(1-\theta)$-weighted logarithms of ratios built from $(s_0+s_1)^2 + (l_0-l_1)^2$ and square roots of quadratic polynomials in $\theta$, and ends with the terms $\theta \log\frac{s_1}{s_0} + \log(4\pi s_0)$.
Clearly, we have $J_{F,\alpha}(\theta_1 : \theta_2) = J_F(\theta_1, \theta_2;\, 1-\alpha, \alpha)$. See Fig. 6. The fact that the Jensen diversities are proper divergences (i.e., $J_F(\theta_1, \ldots, \theta_n;\, w_1, \ldots, w_n) \geq 0$ with equality iff $\theta_1 = \cdots = \theta_n$) stems from Jensen's inequality. A geometric proof of the discrete Jensen's inequality is given in Fig. 7.
FIG. 5 Plots of the Cauchy mixture entropies $h[(1-\theta)p_{0,1} + \theta p_{5,1}]$ (left) and $h[(1-\theta)p_{0,1} + \theta p_{5,3}]$ (right). The mixture entropies are concave with respect to the mixing parameter $\theta$.
FIG. 7 Visual proof of Jensen's inequality: the center of mass $\left(\bar\theta, \frac{1}{D}\sum_i F(\theta_i)\right)$ of the points $\theta_1, \ldots, \theta_D$ is necessarily contained in the convex hull $\mathrm{CH}$ of the points $(\theta_1, F(\theta_1)), \ldots, (\theta_D, F(\theta_D))$. Since the convex hull $\mathrm{CH}$ is included in the epigraph of the function $F(\theta)$, we have $\frac{1}{D}\sum_i F(\theta_i) \geq F\!\left(\frac{1}{D}\sum_i \theta_i\right)$.
We get
$$F^*(\eta(\theta)) = h^\times[p_{l_0,s_0} : m_\theta],$$
the cross-entropy between the first prescribed component and the mixture. Thus the dual convex conjugate $F^*$ can be expressed in closed form using the $\theta$-coordinate system.
It was shown in Nielsen (2020) how to reconstruct the statistical divergence corresponding to the Bregman divergence $B_F$ induced by the Bregman generator $F$. When the Bregman generator is the negentropy of a mixture family, the Kullback–Leibler divergence is reconstructed (Nielsen, 2020). Thus we have
$$B_F(\theta_1 : \theta_2) = \mathrm{KL}(m_{\theta_1} : m_{\theta_2}).$$
This last formula is thus available in closed form since $F(\theta)$ is available in closed form. Jensen–Shannon-type differences of entropies
$$h\!\left[m_{\frac{\theta_1+\theta_2}{2}}\right] - \frac{1}{2}\left(h[m_{\theta_1}] + h[m_{\theta_2}]\right) \qquad (22)$$
are also available in closed form (Nielsen and Nock, 2018), using Eq. (10).
Proposition 3. The Kullback–Leibler divergence, Jeffreys divergence, and Jensen–Shannon divergence between any two mixtures of Cauchy distributions with two prescribed distinct components can be calculated in closed form.
$$H = \left(F'_{0,1,1,1}(0),\, F'_{0,1,1,1}(1)\right) = \left(\log\frac{4}{5},\, \log\frac{5}{4}\right).$$
That is, the dual domain is centered at $0$ (with $\eta\!\left(\tfrac{1}{2}\right) = 0$).
Therefore, the Bregman divergence $B_{F_{0,1,1,1}}(\theta_1 : \theta_2)$, equivalent to the Kullback–Leibler divergence between the corresponding mixtures, is obtained as $F_{0,1,1,1}(\theta_1) - F_{0,1,1,1}(\theta_2) - (\theta_1 - \theta_2)\, F'_{0,1,1,1}(\theta_2)$, with $F_{0,1,1,1}$ and $F'_{0,1,1,1}$ given above in closed form.
FIG. 8 Plots of the strictly convex and smooth Bregman generator $F_{0,1,1,1}(\theta)$ (left) and its convex conjugate $F^*_{0,1,1,1}(\eta)$ (right).
It follows that the dual potential function $F^*(\eta)$ is available in closed form (see the Appendix for the closed-form formula):
$$F^*_{0,1,1,1}(\eta(\theta)) = \log\frac{20\pi}{2\sqrt{1+\theta-\theta^2} - \theta + 3}. \qquad (27)$$
b We used WolframAlpha® available online at https://www.wolframalpha.com/ for this symbolic computation with the following query: (2*sqrt(1+t-t*t)+t+2)/(2*sqrt(1+t-t*t)-t+3)=exp(eta) solve for t.
The metric tensor ${}^{F_{0,1,1,1}}g$ can be calculated in closed form in the $\theta$-coordinate system using the second derivative, $[{}^{F_{0,1,1,1}}g]_\theta = F''_{0,1,1,1}(\theta)$: one gets (see the Appendix) a ratio of polynomial expressions in $\theta$ and $\sqrt{1+\theta-\theta^2}$, with a degree-8 numerator and a degree-10 denominator.
Thus, we reported in closed form all the equations necessary to implement the mixture family of two distinct Cauchy distributions (Eqs. 23–26). The Cauchy distributions are Student's $t$-distributions with $\nu = 1$ degree of freedom; as the number of degrees of freedom tends to infinity, Student's $t$-distributions tend to normal distributions.
5 Conclusion
Amari's $\alpha$-structure (Amari, 2016; Nielsen, 2022) on a manifold geometrically modeling a family of parametric statistical models consists of the Fisher metric used as the Riemannian metric tensor, together with a pair of dual affine connections $\nabla^{\alpha}$ and $\nabla^{-\alpha}$ which are compatible with the Fisher metric. In particular, the $\pm 1$-structure for both exponential families and mixture families yields dually flat spaces, i.e., dual Hessian structures on a manifold equipped with a global chart (Shima, 2007). In a dually flat space, the Legendre–Fenchel transformation gives rise to the dual affine coordinate systems $\theta(\cdot)$ and $\eta(\cdot)$ for the dual connections $\nabla^{1}$ (e-connection or exponential connection) and $\nabla^{-1}$ (m-connection or mixture connection), and to the dual convex potential functions $F(\theta)$ and $F^*(\eta)$. Although those dual potential functions are (1) always analytic for exponential families and (2) available in closed form for many distribution families of exponential type (Nielsen and Hadjeres, 2019) (e.g., multivariate normal distributions), the negentropy potential function of a mixture family may not be analytic (e.g., a mixture of two prescribed Gaussians (Michalowicz et al., 2008)) and is rarely available in closed form, except when the mixture components have pairwise disjoint supports (e.g., the family of categorical distributions on a finite sample space). In this chapter, we reported a first notable example of a continuous mixture family of two Cauchy components (model order $D = 1$ with full support $\mathbb{R}$) with dual convex potential functions $F(\theta)$ and $F^*(\eta)$ available in closed form. It would be interesting to build continuous mixture families with arbitrarily many component distributions $k > 1$ sharing the same support for which the dual potential functions are available in closed form. Another question left for future research is to study the intersection of the classes of potential functions obtained from exponential families and from mixture families.
Acknowledgments
The author is deeply indebted to Professor Kazuki Okamura (Shizuoka University, Japan), who collaborated with the author on the study of f-divergences between Cauchy distributions (Nielsen and Okamura, 2021). In that work, all the $\alpha$-geometries (Amari, 2016) of the family of Cauchy distributions are proven to coincide with the Fisher–Rao hyperbolic geometry of the Cauchy family, due to the symmetry property of the f-divergences for the Cauchy family.
c https://maxima.sourceforge.io/.
To export the formula in FORTRAN code, which can be translated easily into Java, we run the following command:

fortran(ratsimp(hmixCauchy(l0,s0,l1,s1,theta)));
expand(-hmixCauchy(l0,s0,l1,s1,theta));
string(%);

Then, for example, we can calculate and export in TeX the metric tensor in the primal coordinate system θ as follows:

F: -hmixCauchy(0,1,1,1,theta);
g: derivative(F,theta,2);
ratsimp(%);
tex(%);
plot2d(F,[theta,0,1]);

To implement the case of (l0, s0) = (0, 1) and (l1, s1) = (1, 1) discussed in the main body, we execute the following MAXIMA code:

F(theta) := theta*log((2*sqrt(1+theta-(theta*theta))+theta+2)/(2*sqrt(1+theta-(theta*theta))-theta+3)) + log((2*sqrt(1+theta-(theta*theta))-theta+3)/(20*%pi));
gradF(theta) := log((2*sqrt(1+theta-(theta*theta))+theta+2)/(2*sqrt(1+theta-(theta*theta))-theta+3));
BD(theta1,theta2) := F(theta1)-F(theta2)-(theta1-theta2)*gradF(theta2);
expand(F(theta1)-F(theta2)-(theta1-theta2)*gradF(theta2));
theta(eta) := (5*exp(2*eta) + 2*sqrt(5)*sqrt(exp(3*eta) - 2*exp(2*eta)+exp(eta)) - 3*exp(eta))/(5*exp(2*eta) - 6*exp(eta) + 5);
Fdual(eta) := theta(eta)*eta - F(theta(eta));
expand(theta(eta)*eta - F(theta(eta)));
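For readers not using MAXIMA, the following Python transcription of the quantities above is a possible sketch (the function names are mine; it also checks the conjugate value of Eq. (27) through the Legendre identity $F^*(\eta(\theta)) = \theta\,\eta(\theta) - F(\theta)$):

import numpy as np

def F(t):
    # Bregman generator F_{0,1,1,1}(theta) (negentropy of the Cauchy mixture)
    r = 2.0 * np.sqrt(1.0 + t - t * t)
    return t * np.log((r + t + 2.0) / (r - t + 3.0)) + np.log((r - t + 3.0) / (20.0 * np.pi))

def gradF(t):
    # dual parameter eta(theta) = F'(theta)
    r = 2.0 * np.sqrt(1.0 + t - t * t)
    return np.log((r + t + 2.0) / (r - t + 3.0))

def BD(t1, t2):
    # Bregman divergence B_F(theta1 : theta2) = KL(m_theta1 : m_theta2)
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

def Fstar_from_theta(t):
    # Legendre identity: F*(eta(theta)) = theta * eta(theta) - F(theta)
    return t * gradF(t) - F(t)

t = 0.25
print(BD(0.2, 0.7))                                          # KL divergence between the two Cauchy mixtures
r = 2.0 * np.sqrt(1.0 + t - t * t)
print(np.isclose(Fstar_from_theta(t), np.log(20.0 * np.pi / (r - t + 3.0))))   # matches Eq. (27)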
References
Amari, S.-i., 2016. Information Geometry and Its Applications. Applied Mathematical Sciences,
Springer Japan, ISBN: 9784431559771.
Amari, S.-i., Armstrong, J., 2014. Curvature of Hessian manifolds. Differ. Geom. Appl. 33, 1–12.
Amari, S.-i., Cichocki, A., 2010. Information geometry of divergence functions. Bull. Polish
Acad. Sci. Tech. sci. 58 (1), 183–195.
Atkinson, C., Mitchell, A.F.S., 1981. Rao’s distance measure. Sankhya Indian J. Stat. A 45,
345–365.
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J., Lafferty, J., 2005. Clustering with Bregman
divergences. J. Mach. Learn. Res. 6 (10).
Barndorff-Nielsen, O., 2014. Information and Exponential Families: In Statistical Theory. John
Wiley & Sons.
Blondel, M., Martins, A.F., Niculae, V., 2020. Learning with Fenchel-Young losses. J. Mach.
Learn. Res. 21 (35), 1–69.
Boissonnat, J.-D., Nielsen, F., Nock, R., 2010. Bregman Voronoi diagrams. Discrete Comput.
Geom. 44 (2), 281–307.
Bregman, L.M., 1967. The relaxation method of finding the common point of convex sets and its
application to the solution of problems in convex programming. USSR Comput. Math. Math.
Phys. 7 (3), 200–217.
Calvo, J.A., 2018. Scientific Programming: Numeric, Symbolic, and Graphical Computing With
Maxima. Cambridge Scholars Publishing.
Chyzak, F., Nielsen, F., 2019. A closed-form formula for the Kullback-Leibler divergence
between Cauchy distributions. arXiv preprint arXiv:1905.10965.
Crouzeix, J.-P., 1977. A relationship between the second derivatives of a convex function and of
its conjugate. Math. Program. 13 (1), 364–365.
Došlá, Š., 2009. Conditions for bimodality and multimodality of a mixture of two unimodal den-
sities. Kybernetika 45 (2), 279–292.
Eguchi, S., 1992. Geometry of minimum contrast. Hiroshima Math. J. 22 (3), 631–647.
Faraut, J., Korányi, A., 1994. Analysis on Symmetric Cones. Oxford Mathematical Monographs.
Godinho, L., Natário, J., 2014. An Introduction to Riemannian Geometry. Springer.
Gomes-Gonçalves, E., Gzyl, H., Nielsen, F., 2019. Geometry and fixed-rate quantization in
Riemannian metric spaces induced by separable Bregman divergences. In: International
Conference on Geometric Science of Information, pp. 351–358.
Güler, O., 1996. Barrier functions in interior point methods. Math. Oper. Res. 21 (4), 860–885.
Keener, R.W., 2010. Theoretical Statistics: Topics for a Core Course. Springer.
Lee, Y.T., Yue, M.-C., 2021. Universal barrier is n-self-concordant. Math. Oper. Res. 46 (3),
1129–1148.
Lin, J., 1991. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory
37 (1), 145–151.
Malagò, L., Pistone, G., 2015. Information geometry of the Gaussian distribution in view of sto-
chastic optimization. In: Proceedings of the ACM Conference on Foundations of Genetic
Algorithms XIII, pp. 150–162.
Michalowicz, J.V., Nichols, J.M., Bucholtz, F., 2008. Calculation of differential entropy for a
mixed Gaussian distribution. Entropy 10 (3), 200–206.
Naudts, J., Anthonis, B., 2012. Data set models and exponential families in statistical physics and
beyond. Mod. Phys. Lett. B 26 (10), 1250062.
Nielsen, F., 2020. An elementary introduction to information geometry. Entropy 22 (10), 1100.
Nielsen, F., 2021. On geodesic triangles with right angles in a dually flat space. In: Progress in
Information Geometry, Springer, Cham, pp. 153–190.
Nielsen, F., 2022. The many faces of information geometry. Not. Am. Math. Soc. 69 (1), 36–45.
Nielsen, F., Boltz, S., 2011. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. The-
ory 57 (8), 5455–5466.
Nielsen, F., Hadjeres, G., 2019. Monte Carlo information-geometric structures. In: Geometric
Structures of Information, Springer, pp. 69–103.
Nielsen, F., Nock, R., 2018. On the geometry of mixtures of prescribed distributions. In: 2018
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 2861–2865.
Nielsen, F., Okamura, K., 2021, July. On f-divergences between Cauchy distributions. In: Interna-
tional Conference on Geometric Science of Information. Springer, Cham, pp. 799–807.
Pinele, J., Strapasson, J.E., Costa, S.I.R., 2020. The Fisher-Rao distance between multivariate nor-
mal distributions: special cases, bounds and applications. Entropy 22 (4), 404.
Risch, R.H., 1969. The problem of integration in finite terms. Trans. Am. Math. Soc. 139,
167–189.
Rockafellar, R.T., 1967. Conjugates and Legendre transforms of convex functions. Can. J. Math.
19, 200–205.
Shima, H., 2007. The Geometry of Hessian Structures. World Scientific.
Watanabe, S., 2004. Kullback information of normal mixture is not an analytic function. IEICE
Tech. Rep. 2004, 41–46.
Zhang, Z., Sun, H., Zhong, F., 2007. Information geometry of the power inverse Gaussian
distribution. Appl. Sci. 9, 194–203.
Zhong, F., Sun, H., Zhang, Z., 2008. The geometry of the Dirichlet manifold. J. Kor. Math. Soc.
45 (3), 859–870.
Chapter 7
Local measurements
of nonlinear embeddings
with information geometry
Ke Sun*
CSIRO Data61, Sydney, NSW, Australia
The Australian National University, Canberra, ACT, Australia
*
Corresponding author: e-mail: sunk@ieee.org
Abstract
A basic problem in machine learning is to find a mapping f from a low-dimensional latent space Y to a high-dimensional observation space X. Modern tools such as deep neural networks are capable of representing general nonlinear mappings. A learner can easily find a mapping which perfectly fits all the observations. However, such a mapping is often not considered good, because it is not simple enough and can overfit. How can simplicity be defined? We propose a formal definition of the amount of information imposed by a nonlinear mapping f. Intuitively, we measure the local discrepancy between the pullback geometry and the intrinsic geometry of the latent space. Our definition is based on information geometry and is independent of both the empirical observations and specific parameterizations. We prove its basic properties and discuss relationships with related machine learning methods.
Keywords: Dimensionality reduction, Manifold learning, Latent space, Autoencoders,
Embedding, Information geometry, α-Divergence
1 Introduction
In statistical machine learning, one is often interested in deriving a nonlinear mapping f from a latent space Y to an observable space X (as shown in Fig. 1) so that meaningful latent representations can be learned for the observed data. Similar problems widely appear in dimensionality reduction (a.k.a. manifold learning), nonlinear regression, and deep representation learning.
By convention in machine learning, we refer to the inverse of f, which we denote by g, rather than f itself, as an "embedding," that is, a mapping from X to Y. The latent space Y is referred to as the embedding space.
FIG. 1 The basic subjects in this paper: a latent Euclidean space Y; an observation space X (a Riemannian manifold); and a differentiable mapping f : Y → X from Y to X.
[diagram: the parametric embedding paradigm, mapping an observation x to its latent representation g_θ(x)] and the symbol "D" for our proposed discrepancy measure. The mappings are denoted by f_θ and g_θ, where the subscript θ (the parameters of the associated mapping) can be omitted.
Another type of embedding is based on information-theoretic measurements, which fits in the following paradigm: the latent point y is mapped to a distribution p (embedding geometry), the data point x is mapped to a distribution q (data geometry), and the two are compared by a divergence D(p : q) (information geometry).
$$D_\alpha(p : q) = \frac{1}{\alpha(1-\alpha)}\left[1 - \int p^\alpha(y)\, q^{1-\alpha}(y)\, dy\right]. \qquad (2)$$
It is easy to show from L'Hôpital's rule that
$$\lim_{\alpha\to 1} D_\alpha(\tilde p : \tilde q) = \mathrm{KL}(\tilde p : \tilde q) = \int \left[\tilde p(y) \log\frac{\tilde p(y)}{\tilde q(y)} - \tilde p(y) + \tilde q(y)\right] dy$$
is the (generalized) KL divergence, and similarly $\lim_{\alpha\to 0} D_\alpha(\tilde p : \tilde q) = \mathrm{KL}(\tilde q : \tilde p)$ is the reverse KL divergence. Therefore, the definition of the $\alpha$-divergence is naturally extended to $\alpha \in \mathbb{R}$, encompassing the KL and reverse KL divergences, along with several other commonly used divergences (Cichocki et al., 2015; Amari, 2016). We can easily verify that $D_\alpha(\tilde p : \tilde q) \geq 0$, with $D_\alpha(\tilde p : \tilde q) = 0$ if and only if $\tilde p = \tilde q$. As a generalization of the KL divergence, the $\alpha$-divergence has been applied in machine learning. Usually, $\alpha$ is a hyperparameter of the learning machine that can be tuned. See Narayan et al. (2015) and Li and Turner (2016) for recent examples.
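To fix ideas, here is a small numerical sketch (an added illustration, not from the chapter) that evaluates $D_\alpha(p : q)$ of Eq. (2) by quadrature for two univariate Gaussians and shows the $\alpha \to 1$ limit approaching the closed-form KL divergence; the densities and parameter values are arbitrary choices.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def alpha_divergence(p, q, alpha):
    # D_alpha(p : q) = (1 - int p^alpha q^(1-alpha) dy) / (alpha (1 - alpha)), by quadrature
    integrand = lambda y: p(y)**alpha * q(y)**(1.0 - alpha)
    hellinger_integral, _ = quad(integrand, -np.inf, np.inf)
    return (1.0 - hellinger_integral) / (alpha * (1.0 - alpha))

p = lambda y: norm.pdf(y, 0.0, 1.0)
q = lambda y: norm.pdf(y, 1.0, 2.0)
kl_pq = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5        # closed-form KL(N(0,1) : N(1,4))
for alpha in (0.5, 0.9, 0.99, 0.999):
    print(alpha, alpha_divergence(p, q, alpha))
print("KL(p:q) =", kl_pq)                                    # the alpha -> 1 limit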
We need the following property of the $\alpha$-divergence in our developments.
Lemma 1 (autonormalizing). Given a probability distribution $p(y)$ and a positive measure $s(y)$ such that $\int p(y)\, dy = 1$ and $0 < \int s(y)\, dy < \infty$, the optimal $\gamma^\star \in (0, \infty)$ minimizing $D_\alpha(p : \gamma s)$ has the form
$$\gamma^\star := \operatorname*{argmin}_{\gamma > 0} D_\alpha(p : \gamma s) = \left(\frac{\int p^\alpha(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy}\right)^{1/\alpha}.$$
The proofs of our formal statements are in the Appendices. Note that the expression of $D_\alpha(p : \gamma^\star s)$ is not the same as $D_\alpha(p : q)$ in Eq. (2). For a given $\alpha$, it is clear that they both reduce to computing the Hellinger integral $\int p^\alpha(y)\, q^{1-\alpha}(y)\, dy$, because $t^{1/\alpha}$ is a monotonic transformation of $t$.
3 α-Discrepancy of an embedding
How can we measure the amount of imposed information, or complexity, of a mapping f : Y → X between two manifolds Y and X, in such a way that it is independent of the observations? Taking dimensionality reduction as an example, the central problem is to find a "good" mapping f which not only fits the observations well but also is somehow simple, in the sense that it is less curved. In this case, f is from a latent space Y ⊆ ℝ^d that is usually a low-dimensional Euclidean space,a to an observable space X, which often has a high dimensionality. For example, the index mapping i → x_i is usually considered "not good" for dimensionality reduction, because only the
a In this paper, a d-dimensional real manifold is denoted by $M^d$, or simply $M$ when the superscript (dimensionality) is omitted.
information regarding the order of the samples (if such information exists) is preserved in the embedding, and all other observed information is lost through the highly curved f.
We define such an intrinsic loss of information and discuss its properties. For the convenience of analysis, we make the following assumptions:
Assumption 1. X is equipped with a Riemannian metric M(x), i.e., a covariant symmetric tensor that is positive definite and varies smoothly w.r.t. x ∈ X.
b
Such an embedding is known as an “immersion” ( Jost, 2011).
• If α → 1, small singular values of J(y) that are close to 0 lead to high discrepancy. Minimizing $D_{\alpha,f}(y)$ helps avoid singularity.
• If α → 0, large singular values of J(y) that are larger than 1 lead to high discrepancy. Minimizing $D_{\alpha,f}(y)$ favors contractive mappings (Rifai et al., 2011), where the scale of the Jacobian is constrained.
4 Empirical α-discrepancy
In this section, we show that the proposed α-discrepancy is computationally friendly. By definition, the global α-discrepancy $D_{\alpha,f}$ is an expectation of the local α-discrepancy $D_{\alpha,f}(y)$ w.r.t. a given prior distribution $U(y)$. Using $m$ i.i.d. samples $\{y_i\}_{i=1}^m \sim U(y)$, we can estimate
$$D_{\alpha,f} \approx \frac{1}{m}\sum_{i=1}^m D_{\alpha,f}(y_i).$$
where $A_l(h_l)$ is a diagonal matrix with diagonal entries $\nu'(W_l h_l + b_l)$. Therefore, $J(y)$ is expressed as a series of matrix multiplications. For a general neural network mapping $f$ not limited to the above simple case, one can use the automatic differentiation facilities of modern deep learning frameworks (Paszke et al., 2019) to compute the Jacobian.
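As a minimal sketch of this computation (my own illustration, assuming a one-hidden-layer tanh network with randomly chosen weights and a Euclidean metric M(x) = I on X, neither of which comes from the chapter), the chain-rule Jacobian can be validated against finite differences and then used to form the pullback metric $J^\top M J$:

import numpy as np

rng = np.random.default_rng(0)
d, h, D = 2, 16, 5                                # latent dim, hidden width, observation dim
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(D, h)), rng.normal(size=D)

def f(y):                                         # one-hidden-layer mapping f : R^d -> R^D
    return W2 @ np.tanh(W1 @ y + b1) + b2

def jacobian(y):                                  # chain rule: J(y) = W2 diag(nu'(W1 y + b1)) W1
    a = 1.0 - np.tanh(W1 @ y + b1) ** 2           # tanh'(t) = 1 - tanh(t)^2
    return W2 @ (a[:, None] * W1)

y = rng.normal(size=d)
J = jacobian(y)
eps = 1e-6                                        # finite-difference check of the analytic Jacobian
J_fd = np.column_stack([(f(y + eps * np.eye(d)[:, i]) - f(y - eps * np.eye(d)[:, i])) / (2 * eps)
                        for i in range(d)])
print(np.allclose(J, J_fd, atol=1e-4))            # True
M = np.eye(D)                                     # Euclidean metric on X (assumption of this sketch)
print((J.T @ M @ J).shape)                        # the pullback metric J^T M J is d x d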
gives an unbiased estimate of $D_\alpha(p_y : \gamma s_y)$, where the notation $\hat D_\alpha(p_y : \gamma s_y)$ is abused as it depends on the samples $\{y_j\}_{j=1}^n$. Indeed,
$$E_{R_y(y)}\!\left[\hat D_\alpha(p_y : \gamma s_y)\right] = \frac{1}{1-\alpha} + \int \left[\frac{\gamma s_y(y)}{\alpha R_y(y)} - \frac{p_y^\alpha(y)\, \gamma^{1-\alpha} s_y^{1-\alpha}(y)}{\alpha(1-\alpha) R_y(y)}\right] R_y(y)\, dy = \frac{1}{1-\alpha} + \frac{1}{\alpha}\int \gamma s_y(y)\, dy - \frac{1}{\alpha(1-\alpha)}\int p_y^\alpha(y)\, \gamma^{1-\alpha} s_y^{1-\alpha}(y)\, dy,$$
which gives exactly $D_\alpha(p_y : \gamma s_y)$. By the law of large numbers, $\hat D_\alpha(p_y : \gamma s_y)$ converges to the $\alpha$-divergence $D_\alpha(p_y : \gamma s_y)$ as $n \to \infty$. This holds regardless of the choice of the reference distribution $R_y(y)$. However, different $R_y(y)$ may result in different estimation variances. The RHS (right-hand side) of Eq. (5) is a convex function of $\gamma$, and the optimal $\gamma^\star$ which minimizes $\hat D_\alpha(p_y : \gamma s_y)$ is in closed form (similar to Lemma 1). Then, one can estimate
$$D_{\alpha,f}(y) \approx \min_{\gamma > 0} \hat D_\alpha(p_y : \gamma s_y).$$
Venna and Kaski, 2007; Yang et al., 2014) are based on our third paradigm introduced in Section 1. Based on similar principles, this meta-family of dimensionality reduction methods aims to find the embedding (the inverse of f : Y → X) based on a given set {x_i} ⊂ X. In this section, we provide a unified view of these embedding techniques based on the proposed α-discrepancy.
Often, neighborhood embeddings are nonparametric methods, meaning that the parametric form of the mapping f (or its inverse g) is not explicitly learned. Instead, the embedding points {y_i = g(x_i)}, which implicitly give g(·), are obtained by minimizing a loss function. They directly generalize to parametric approaches (Carreira-Perpiñán and Vladymyrov, 2015). Our analysis can be applied to both cases. However, the Jacobian may be difficult to obtain for the nonparametric mapping and requires approximation techniques.
In the empirical α-discrepancy, we set the reference distribution R_y to be uniform across the latent points {y_i} associated with the observed samples {x_i}. This is only a technical treatment which allows us to arrive at a simple loss function and to connect with related embedding methods. Other choices of R_y lead to more general embedding approaches. Thus, we simply set, for all j, R_y(y_j) = ρ > 0.
From Eq. (5), we get
$$\hat D_\alpha(p_y : \gamma s_y) = \frac{1}{1-\alpha} + \frac{1}{n}\sum_{j=1}^n \left[\frac{\gamma s_y(y_j)}{\alpha\rho} - \frac{p_y^\alpha(y_j)\, \gamma^{1-\alpha} s_y^{1-\alpha}(y_j)}{\alpha(1-\alpha)\rho}\right]. \qquad (6)$$
In other words, we use each $y_i$ as the reference point, and use all the other samples as the neighbors.
A further algorithmic detail in SNE (Hinton and Roweis, 2003) is the computation of the probabilities $p_y(y_j)$. Essentially, SNE sets $M(x) = \lambda(x) I$, where $\lambda := \lambda(x) > 0$ is a scalar depending on $x$ that can be computed based on entropy constraints. It helps model observed data with different densities in different regions of $X$. Then $p_y(y_j)$ is computed based on
$$p_y(y_j) \propto \exp\!\left(-\frac{1}{2}(y_j - y)^\top J^\top M(x) J\, (y_j - y)\right) \approx \exp\!\left(-\frac{\lambda}{2}\, \| f(y_j) - f(y)\|_2^2\right).$$
5.2 Autoencoders
Autoencoder networks try to learn both $f_\theta : Y \to X$ (the decoder) and its corresponding projection $g_\theta : X \to Y$ (the encoder) to represent a given set of observations $\{x_i\} \subset X$. The following (known) statement presents an intrinsic geometry of autoencoders.
Proposition 3. Assumptions 1, 2, and 3 are true. In an autoencoder network $X \xrightarrow{g} Y \xrightarrow{f} X$, the decoder $f$ induces a metric tensor in the latent space $Y$, given by $M(y) = J_f^\top(y)\, M(f(y))\, J_f(y)$; the encoder $g : X \to Y$ induces a metric tensor on the manifold $Z = X/\!\sim_g$ (the quotient space w.r.t. the equivalence relation $\sim_g$: $x_1 \sim_g x_2$ if and only if $g(x_1) = g(x_2)$), given by $M(z) = J_g^\top(z)\, M(g(z))\, J_g(z)$.
$$p(y \mid x) \propto U(y)\, p(x \mid y) \propto \exp\!\left(\log U(y) - \frac{\lambda}{2}\|x - f(y)\|^2\right).$$
Denote by $g : X \to Y$ an approximation of $f^{-1}$. A Taylor expansion of $f(y)$ around $g(x)$ yields
$$p(y \mid x) \propto \exp\!\left(\log U(y) - \frac{\lambda}{2}\|x - f(g(x)) - J(g(x))(y - g(x))\|^2 + o(\|y - g(x)\|)\right),$$
where $o(\cdot)$ is the little-o notation. If $\lambda$ is large enough, meaning the conditional distribution $p(x \mid y)$ tends to be deterministic, then the second term will dominate. We arrive at the rough approximation
$$p(y \mid x) \approx G\!\left(y \mid g(x),\, \lambda J^\top(g(x))\, J(g(x))\right).$$
Hence, the covariance of the posterior is related to the pullback metric at g(x).
The notion of pullback metric has been investigated in metric learning
(Lebanon, 2005) and manifold learning (Sun and Marchand-Maillet, 2014).
Latent space geometry is studied in Gaussian process latent variable models
(Tosi et al., 2014). In the realm of deep learning, latent manifolds learned
by deep generative models (including VAEs) are explored (Arvanitidis
et al., 2018; Hauberg, 2018; Shao et al., 2018). In this context, stochastic met-
ric tensor and the expected metric are investigated (Arvanitidis et al., 2018;
Hauberg, 2018). Notice that we only consider deterministic metrics, as our
mapping f is deterministic. A separate neural network can be used to learn
the metric tensor (Kuhnel et al., 2018). Information geometry and the pullback
metric have been applied to graph neural networks (Sun et al., 2020). These
previous efforts on deep generative models focus on the study of the geomet-
ric measurements (e.g., geodesics, curvature, and means) in the latent space
obtained in deep learning. In contrast, our objective is to derive information
theoretical complexity measures of a mapping f.
for a classifier deep neural network, the output space $X$ is the simplex $\Delta^d = \{x : \sum_{i=1}^{d+1} x_i = 1;\ x_i > 0, \forall i\}$. The unique invariant metric of a statistical manifold is given by the Fisher information metric. Thus, the definition of the α-discrepancy can be extended by using such a geometry.
Acknowledgment
The author thanks the anonymous reviewers for the insightful comments and timely reviews.
Appendices
In the following we provide outline proofs of the theoretical statements.
Therefore,
$$(\gamma^\star)^\alpha = \frac{\int p^\alpha(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy},$$
and
$$\gamma^\star = \left(\frac{\int p^\alpha(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy}\right)^{1/\alpha}.$$
To compare $D_\alpha(p : \gamma^\star s)$ with $D_\alpha(p : q)$, we recall from Eq. (2) that
$$D_\alpha(p : q) = \frac{1}{\alpha(1-\alpha)}\left[1 - \int p^\alpha(y)\, q^{1-\alpha}(y)\, dy\right]. \qquad (A.2)$$
By Lemma 1,
$$\operatorname*{argmin}_f D_{\alpha,f} = \operatorname*{argmin}_f \int U(y)\, D_{\alpha,f}(y)\, dy = \operatorname*{argmin}_f \int U(y) \inf_{\gamma\in C(Y)} D_\alpha(p_y : \gamma(y) s_y)\, dy = \operatorname*{argmin}_f \int U(y)\, D_\alpha(p_y : G_y)\, dy. \qquad (B.1)$$
In the above minimization problem, notice that $f$ only appears in the term $p_y$ on the RHS.
If $f : Y \to X$ is an isometry, then the pullback metric is equivalent to the Riemannian metric of $Y$, which is given by the identity matrix $I$. Therefore, for all $y \in Y$,
$$J^\top M(x)\, J = I.$$
Therefore, for all $y \in Y$,
$$p_y = G(\cdot \mid y, J^\top M(x) J) = G(\cdot \mid y, I) = G_y.$$
Based on the basic properties of the $\alpha$-divergence, we know that $p_y = G_y$ for all $y \in Y$ implies $D_\alpha(p_y : G_y) = 0$ for all $y \in Y$. At the same time,
$$\int U(y)\, D_\alpha(p_y : G_y)\, dy \geq 0.$$
$$D_\sigma\!\left(p'(x') : q'(x')\right) = D_\sigma(p(x) : q(x)).$$
If we set $\sigma(t) = \frac{t - t^{1-\alpha}}{\alpha(1-\alpha)}$ and plug it into Eq. (C.1), we get exactly $D_\alpha(p : q)$ in Eq. (2). Therefore, the $\alpha$-divergence between two probability densities belongs to the family of f-divergences.
Let $\beta = 1 - \alpha$; then
$$\int p_y^\alpha(y)\, q_y^\beta(y)\, dy = \frac{|J^\top M J|^{\alpha/2}}{(2\pi)^{d/2}} \int \exp\!\left(-\frac{1}{2}(y - \bar y)^\top (\alpha J^\top M J + \beta I)(y - \bar y)\right) dy = \frac{|J^\top M J|^{\alpha/2}}{|\alpha J^\top M J + \beta I|^{1/2}} \cdot \frac{|\alpha J^\top M J + \beta I|^{1/2}}{(2\pi)^{d/2}} \int \exp\!\left(-\frac{1}{2}(y - \bar y)^\top (\alpha J^\top M J + \beta I)(y - \bar y)\right) dy = \frac{|J^\top M J|^{\alpha/2}}{|\alpha J^\top M J + \beta I|^{1/2}}. \qquad (\mathrm{D.1})$$
By Lemma 1,
$$D_{\alpha,f}(y) := \inf_{\gamma\in C(Y)} D_\alpha(p_y : \gamma(y) s_y) = \frac{1}{\beta}\left[1 - \left(\int p_y^\alpha(y)\, q_y^\beta(y)\, dy\right)^{1/\alpha}\right] = \frac{1}{\beta}\left[1 - \frac{|J^\top M J|^{1/2}}{|\alpha J^\top M J + \beta I|^{1/(2\alpha)}}\right]. \quad\square$$
Let $\alpha \to 0$. The factor $\frac{1}{1-\alpha} \to 1$, and $\alpha$ only appears in the term
$$\frac{1}{|\alpha J^\top M J + (1-\alpha) I|^{1/(2\alpha)}},$$
which is a "$1^\infty$"-type limit. We have
$$\lim_{\alpha\to 0} \frac{1}{|\alpha J^\top M J + (1-\alpha) I|^{1/(2\alpha)}} = \lim_{\alpha\to 0} \exp\!\left(-\frac{\log|\alpha J^\top M J + (1-\alpha) I|}{2\alpha}\right) = \exp\!\left(-\lim_{\alpha\to 0} \frac{\log|\alpha J^\top M J + (1-\alpha) I|}{2\alpha}\right) = \exp\!\left(-\lim_{\alpha\to 0} \frac{\operatorname{tr}\!\left((\alpha J^\top M J + (1-\alpha) I)^{-1}(J^\top M J - I)\right)}{2}\right) = \exp\!\left(-\frac{\operatorname{tr}(J^\top M J - I)}{2}\right) = \exp\!\left(-\frac{1}{2}\operatorname{tr}(J^\top M J) + \frac{d}{2}\right).$$
In summary, we get
$$D_{0,f}(y) := \lim_{\alpha\to 0} D_{\alpha,f}(y) = 1 - |J^\top M J|^{1/2} \exp\!\left(-\frac{1}{2}\operatorname{tr}(J^\top M J) + \frac{d}{2}\right) = 1 - \exp\!\left(\frac{1}{2}\log|J^\top M J| - \frac{1}{2}\operatorname{tr}(J^\top M J) + \frac{d}{2}\right).$$
Similarly,
$$D_{1,f}(y) := \lim_{\alpha\to 1} D_{\alpha,f}(y) = \lim_{\alpha\to 1} \frac{1}{1-\alpha}\left[1 - \frac{|J^\top M J|^{1/2}}{|\alpha J^\top M J + (1-\alpha) I|^{1/(2\alpha)}}\right].$$
Applying L'Hôpital's rule and differentiating with respect to $\alpha$,
$$D_{1,f}(y) = \frac{|J^\top M J|^{1/2}}{|J^\top M J|^{1/2}}\left[\frac{1}{2}\log|J^\top M J| - \frac{1}{2}\operatorname{tr}\!\left((J^\top M J)^{-1}(J^\top M J - I)\right)\right] = \frac{1}{2}\log|J^\top M J| - \frac{d}{2} + \frac{1}{2}\operatorname{tr}\!\left((J^\top M J)^{-1}\right).$$
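The closed-form local α-discrepancy and its two limits can be checked numerically; the sketch below (an added illustration with an arbitrary random Jacobian and M = I, not taken from the chapter) evaluates $D_{\alpha,f}(y)$ for α near 0 and 1 and compares with the limit formulas $D_{0,f}(y)$ and $D_{1,f}(y)$ derived above:

import numpy as np

def local_discrepancy(J, M, alpha):
    # closed-form local alpha-discrepancy derived above from (D.1):
    # (1/(1-alpha)) (1 - |J^T M J|^(1/2) / |alpha J^T M J + (1-alpha) I|^(1/(2 alpha)))
    A = J.T @ M @ J
    d = A.shape[0]
    num = np.sqrt(np.linalg.det(A))
    den = np.linalg.det(alpha * A + (1.0 - alpha) * np.eye(d)) ** (1.0 / (2.0 * alpha))
    return (1.0 - num / den) / (1.0 - alpha)

rng = np.random.default_rng(0)
J, M = rng.normal(size=(5, 2)), np.eye(5)
A, d = J.T @ M @ J, 2

D0 = 1.0 - np.exp(0.5 * np.log(np.linalg.det(A)) - 0.5 * np.trace(A) + d / 2.0)
D1 = 0.5 * np.log(np.linalg.det(A)) - d / 2.0 + 0.5 * np.trace(np.linalg.inv(A))
print(local_discrepancy(J, M, 1e-4), D0)          # approximately equal (alpha -> 0 limit)
print(local_discrepancy(J, M, 1.0 - 1e-4), D1)    # approximately equal (alpha -> 1 limit)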
References
Amari, S., 1995. Information geometry of the EM and em algorithms for neural networks. Neural
Netw. 8 (9), 1379–1408.
Amari, S.-i, 2016. Information Geometry and Its Applications. Applied Mathematical Sciences,
vol. 194 Springer, Japan.
Arvanitidis, G., Hansen, L.K., Hauberg, S., 2018. Latent space oddity: on the curvature of deep
generative models. In: ICLR’18. Proceedings of the 6th International Conference on Learning
Representations.
Carreira-Perpiñán, M.Á., 2010. The elastic embedding algorithm for dimensionality reduction. In: Fürnkranz, J., Joachims, T. (Eds.), ICML'10. Proceedings of the 27th International Conference on Machine Learning, Omnipress, pp. 167–174.
Carreira-Perpiñán, M.Á., Vladymyrov, M., 2015. A fast, universal algorithm to learn parametric nonlinear embeddings. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information Processing Systems. vol. 28. Curran Associates, Inc.
Cichocki, A., Cruces, S., Amari, S.-i., 2015. Log-determinant divergences revisited: alpha-beta
and gamma log-det divergences. Entropy 17 (5), 2988–3034.
Cook, J., Sutskever, I., Mnih, A., Hinton, G., 2007. Visualizing similarity data with a mixture of
maps. In: Meila, M., Shen, X. (Eds.), Proceedings of Machine Learning Research. Proceed-
ings of the 11th International Conference on Artificial Intelligence and Statistics, vol. 2.
PMLR, pp. 67–74.
Csiszár, I., 1967. On topological properties of f-divergences. Stud. Sci. Math. Hung. 2, 329–339.
Hauberg, S., 2018. Only Bayes should learn a manifold (on the estimation of differential geomet-
ric structure from data). CoRR abs/1806.04994. https://arxiv.org/abs/1806.04994.
Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernandez-Lobato, D., Turner, R., 2016.
Black-box alpha divergence minimization. In: Balcan, M.F., Weinberger, K.Q. (Eds.), Pro-
ceedings of Machine Learning Research. Proceedings of the 33rd International Conference
on Machine Learning, vol. 48. PMLR, pp. 1511–1520.
Hinton, G.E., Roweis, S., 2003. Stochastic neighbor embedding. In: Becker, S., Thrun, S.,
Obermayer, K. (Eds.), Advances in Neural Information Processing Systems. vol. 15. MIT Press.
Jost, J., 2011. Riemannian Geometry and Geometric Analysis. Universitext, sixth ed. Springer.
Kingma, D.P., Welling, M., 2014. Auto-encoding variational Bayes. In: ICLR’14. Proceedings of
the 2nd International Conference on Learning Representations.
Kuhnel, L., Fletcher, T., Joshi, S., Sommer, S., 2018. Latent space non-linear statistics. CoRR abs/
1805.07632. https://arxiv.org/abs/1805.07632.
Lebanon, G., 2005. Riemannian Geometry and Statistical Machine Learning (Ph.D. thesis). Car-
negie Mellon University.
Lee, J.A., Renard, E., Bernard, G., Dupont, P., Verleysen, M., 2013. Type 1 and 2 mixtures of
Kullback-Leibler divergences as cost functions in dimensionality reduction based on similar-
ity preservation. Neurocomputing 112, 92–108.
Li, Y., Turner, R.E., 2016. Renyi divergence variational inference. In: Lee, D., Sugiyama, M.,
Luxburg, U., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Sys-
tems. vol. 29. Curran Associates, Inc.
Miolane, N., Holmes, S., 2020. Learning weighted submanifolds with variational autoencoders
and Riemannian variational autoencoders. In: CVPR 2020. 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 14491–14499.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Fürnkranz, J., Joachims, T. (Eds.), ICML'10. Proceedings of the 27th International Conference on Machine Learning, Omnipress, pp. 807–814.
Narayan, K., Punjani, A., Abbeel, P., 2015. Alpha-beta divergences discover micro and macro
structures in data. In: Bach, F., Blei, D. (Eds.), Proceedings of Machine Learning Research.
Proceedings of the 32nd International Conference on Machine Learning, vol. 37. PMLR,
pp. 796–804.
Nielsen, F., 2020. An elementary introduction to information geometry. Entropy 22 (10), 1100.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M.,
Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. Pytorch: an
imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H.,
Beygelzimer, A., d’Alche-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Informa-
tion Processing Systems. vol. 32. Curran Associates, Inc, pp. 8024–8035.
Pennec, X., 2006. Intrinsic statistics on Riemannian manifolds: basic tools for geometric measure-
ments. J. Math. Imaging Vis. 25 (1), 127–154.
Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y., 2011. Contractive auto-encoders: explicit
invariance during feature extraction. In: Getoor, L., Scheffer, T. (Eds.), ICML’11. Proceed-
ings of the 28th International Conference on International Conference on Machine Learning,
Omnipress, pp. 833–840.
Said, S., Bombrun, L., Berthoumieu, Y., Manton, J.H., 2017. Riemannian Gaussian distributions
on the space of symmetric positive definite matrices. IEEE Trans. Inf. Theory 63 (4),
2153–2170.
Shao, H., Kumar, A., Fletcher, P.T., 2018. The Riemannian geometry of deep generative models.
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), pp. 428–4288.
Sun, K., 2018. Intrinsic universal measurements of non-linear embeddings. CoRR abs/
1811.01464. https://arxiv.org/abs/1811.01464.
Sun, K., 2020. Information geometry for data geometry through pullbacks. In: Deep Learning
Through Information Geometry (Workshop at Thirty-fourth Conference on Neural Informa-
tion Processing Systems).
Sun, K., Marchand-Maillet, S., 2014. An information geometry of statistical manifold learning. In:
Xing, E.P., Jebara, T. (Eds.), Proceedings of Machine Learning Research. Proceedings of the
31st International Conference on Machine Learning, vol. 32. PMLR, pp. 1–9.
Sun, K., Koniusz, P., Wang, Z., 2020. Fisher-Bures adversary graph convolutional networks. In:
Adams, R.P., Gogate, V. (Eds.), Proceedings of Machine Learning Research. Proceedings
of The 35th Uncertainty in Artificial Intelligence Conference, vol. 115. PMLR, pp. 465–475.
Tosi, A., Hauberg, S., Vellido, A., Lawrence, N.D., 2014. Metrics for probabilistic geometries. In:
Zhang, N.L., Tian, J. (Eds.), UAI’14. Proceedings of the 30th Conference on Uncertainty in
Artificial Intelligence, AUAI Press, pp. 800–808.
van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9 (Nov),
2579–2605.
Venna, J., Kaski, S., 2007. Nonlinear dimensionality reduction as information retrieval. In:
Meila, M., Shen, X. (Eds.), Proceedings of Machine Learning Research. Proceedings of the
11th International Conference on Artificial Intelligence and Statistics, vol. 2. PMLR, pp.
572–579.
Yang, Z., Peltonen, J., Kaski, S., 2014. Optimization equivalence of divergences improves neigh-
bor embedding. In: Xing, E.P., Jebara, T. (Eds.), Proceedings of Machine Learning Research.
Proceedings of the 31st International Conference on Machine Learning, vol. 32. PMLR, pp.
460–468.
This page intentionally left blank
Section III
Advanced geometrical
intuition
Chapter 8
Parallel transport, a central tool in geometric statistics
Abstract
Transporting the statistical knowledge regressed in the neighborhood of a point to a dif-
ferent but related place (transfer learning) is important for many applications. In medi-
cal imaging, cardiac motion modeling and structural brain changes are two such
examples: for a group-wise statistical analysis, subject-specific longitudinal deforma-
tions need to be transported in a common template anatomy.
In geometric statistics, the natural (parallel) transport method is defined by the inte-
gration of a Riemannian connection which specifies how tangent vectors are compared
at neighboring points. In this process, the numerical accuracy of the transport method is
critical. Discrete methods based on iterated geodesic parallelograms inspired by
Schild’s ladder were shown to be very efficient and apparently stable in practice. In this
chapter, we show that ladder methods are actually second-order schemes, even with
numerically approximated geodesics. We also propose a new algorithm to
implement these methods in the context of the large deformation diffeomorphic metric
mapping (LDDMM) framework that endows the space of diffeomorphisms with a right-
invariant RKHS metric.
When applied to the motion modeling of the cardiac right ventricle under pressure
or volume overload, the method however exhibits unexpected effects in the presence of
very large volume differences between subjects. We first investigate an intuitive rescal-
ing of the modulus after parallel transport to preserve the ejection fraction. The surpris-
ingly simple scaling/volume relationship that we obtain suggests decoupling the
volume change from the deformation directly within the LDDMM metric. The parallel
transport of cardiac trajectories with this new metric now reveals statistical insights into
the dynamics of each disease. This example shows that parallel transport could become
a tool of choice for data-driven metric optimization.
Keywords: Parallel transport, Longitudinal studies, Mean trajectory, Cardiac motion
analysis, Schild’s ladder, Pole ladder, Riemannian manifolds
1 Introduction
At the interface of geometry, statistics, image analysis, and medicine, Compu-
tational Anatomy aims at analyzing and modeling the variability of the bio-
logical shape of tissues and organs and their dynamics at the population level.
The goal is to estimate representative anatomies across diseases, populations,
species, or ages, to model the organ development across time (growth or aging),
to discover morphological differences between normal and pathological groups,
and to estimate and correlate the variability with other functional, genetic, or
structural information. In the context of cardiology, computational anatomy
models the cardiac cycle as a sequence of deformations of an anatomy and
makes it possible to quantitatively characterize these deformations and to compare the impact
of diseases on cardiac function.
The analysis of the organs’ shape and deformations often relies on the
identification of features to describe locally the anatomy such as landmarks,
curves, surfaces, intensity patches, full images, etc. Modeling their statistical
distribution in the population requires first identifying point-to-point anato-
mical correspondences between these geometric features across subjects. This
may be feasible for landmark points, but not for curves or surfaces. Thus, one
generally considers relabeled point-sets or reparameterized curves/surfaces/
images as equivalent objects. With this geometric formulation, shape spaces
are the quotient of the original space of features by their reparameterization
group. The difficulty is that taking the quotient generally endows the shape
space with a nonlinear manifold structure, even when we start from features
living in a Euclidean space.
For instance, equivalence classes of k-tuples of points under rigid or simi-
larity transformations result in nonlinear Kendall’s shape spaces (see, e.g.,
Dryden and Mardia (2016) for a recent account on that subject). The quotient
of curves, surfaces, and higher dimensional objects by their reparameteriza-
tions (diffeomorphisms of their domains) produces in general even more com-
plex infinite dimensional shape spaces (Bauer et al., 2014). Thus, shapes
belong in general to nonlinear manifolds, while statistics were essentially
developed for linear and Euclidean spaces. This has motivated the use and
development of a consistent statistical framework on Riemannian manifolds
and Lie groups during the past 25 years, a field called Geometric Statistics
in computational anatomy (Pennec et al., 2020).
Deformation-based morphometry (DBM) is also a popular method to study
statistical differences of brain anatomies. It consists in analyzing local properties of the deformations that map a template anatomy to each subject (Ashburner et al., 1998).
1.1 Diffeomorphometry
Following Thompson (1917), it is often assumed that there exists a template
shape or image (called an atlas in medical image analysis) that represents
the standard (or mean) anatomy, and that the intersubject variability is
encoded by deformations of that template toward the shapes of each subject
or their evolution in time. A very desirable aspect of these transformations
is to smoothly preserve the spatial organization of the anatomical tissues by
avoiding intersections, folds, or tears. Simply encoding deformations with
the vector space of displacement fields is not sufficient to preserve the topol-
ogy: one needs to require diffeomorphic transformations (differentiable one-
to-one transformations with differentiable inverse).
Lie groups of diffeomorphisms are examples of spaces that are both infinite-
dimensional manifolds and Lie groups, and the statistical analysis of shapes
through their diffeomorphic deformations has been coined as diffeomorpho-
metry. This approach was pushed forward by Grenander and Miller (1998)
and turned into a mathematically grounded framework by endowing the space
of diffeomorphisms with a sufficiently regular right invariant metric (Miller
et al., 2015; Younes, 2019), leading to the so-called large deformation diffeo-
morphic metric mapping (LDDMM) framework, detailed in Section 2.2.
Since the optimization of time-varying velocity fields was computationally
intensive, an alternative parameterization of diffeomorphisms with the flow of
stationary velocity fields (SVF) was introduced by Arsigny et al. (2006). The
flow of SVFs generates one-parameter subgroups, which are simply matrix
exponentials in matrix Lie groups. Although a number of theoretical difficul-
ties remain when moving to infinite dimensions, very efficient algorithms can
be adapted from the matrix case, like the scaling and squaring procedure
(Higham, 2005) to compute the group exponential and its Jacobian, or the
Baker–Campbell–Hausdorff (BCH) formula to approximate the composition
of two deformations directly in the log-domain. This allows the straightfor-
ward extension of the classical “demons” image registration algorithm to diffeomorphisms parameterized by SVFs.
FIG. 1 Example of a systolic deformation estimated on a patient and applied to the atlas without
normalization (middle) and with normalization by the method of Section 3.3 (right). The ED
frame is a red point cloud, while the ES frame is the blue mesh. The ES frames obtained without
normalization show irregularities and cannot be used in downstream analyses.
From the properties of ODEs, one can prove that given γ and a tangent vector v at x = γ(0), there exists a unique parallel vector field X along γ such that X(0) = v. Intuitively, this ODE constrains the parallel vector field to keep its orientation with respect to the velocity γ̇ of the curve while moving along it. For any t in the domain of definition of γ, X(t) is called the parallel transport of v at y = γ(t) along γ, and written $\Pi_x^y v$.
For normalizing the momentum of a longitudinal deformation along an
intersubject geodesic diffeomorphism, parallel transport was first proposed
in the LDDMM framework using Jacobi fields (Qiu et al., 2008, 2009;
Younes, 2007). The method was used in Cury et al. (2016) for the analysis
of the thalamus in frontotemporal dementia and it was improved by the
Fanning Scheme (Louis et al., 2018), which is applied to shape analysis of
brain structures (Louis et al., 2017). In the latter works, the authors claim that
the Jacobi field approach is more precise and less expensive computationally
than ladder methods. We have proved in Guigui and Pennec (2021) that this is
not the case. Moreover, the proposed method implicitly assumes that the same
metric is used for both longitudinal and intersubject deformation. This is an
important concern as the cross-sectional intersubject and longitudinal intra-
subject deformations have a fundamentally different nature.
In order to implement a parallel transport algorithm that remains consistent
with the numerical scheme used to compute the geodesics, Lorenzi et al.
(2011b) and Lorenzi and Pennec (2013) proposed to adapt Schild’s ladder to
image registration with deformations parameterized by SVF. This method relies
on iterating the construction of geodesic parallelograms to approximate the par-
allel transport of a vector v along a geodesic (see Fig. 2A). Interestingly, the
Schild’s ladder implementation appeared to be more stable in practice than
the closed-form expression of the symmetric Cartan–Schouten parallel transport
on geodesics. This was attributed to the inconsistency of the numerical schemes
used for the computation of the transformation Jacobian in the implementation
of this exact parallel transport formula.
Shortly after, it was realized that parallel transport along geodesics could
exploit an additional symmetry by using the geodesic along which we want
to transport as one of the diagonals of the geodesic parallelogram: this gave
FIG. 2 Representation of the two ladder schemes, Schild’s (A) and pole ladder (B). The methods
consist in iterating the construction of geodesic parallelograms to approximate the parallel trans-
port of a vector v along a piecewise-geodesic curve. (A) Schild's ladder; (B) Pole ladder
rise to the pole ladder scheme (Lorenzi and Pennec, 2014). In this case the
geodesic along which we are transporting is used as the diagonal of the parallelo-
grams (see Fig. 2B). This greatly reduces the number of geodesics to compute
(v1 does not even need to be computed in Fig. 2B). Pole ladder combined
with an efficient numerical scheme based on the Baker–Campbell–Hausdorff
formula was found to be more stable on simulated and real experiments than
the other parallel transport schemes tested on Lie groups. This result and the
higher symmetry led to conjecture that pole ladder could be a higher order
scheme than Schild’s ladder.
The SVF framework was successfully applied in Lorenzi et al. (2015) to
distinguish pathological evolution from normal aging. An efficient computa-
tional pipeline is provided by Hadj-Hamou et al. (2016) and used in Sivera
et al. (2020) to model Alzheimer’s disease and to analyze the effect of a
potential treatment in a clinical trial (see illustration in Fig. 3).
FIG. 3 Statistics on diffeomorphisms with the Cartan–Schouten connection to model the normal
component of the aging trajectory and the additional component specific to Alzheimer’s disease.
Longitudinal geodesic deformations regressed in the sequence of images of each subject are par-
allel transported along the subject-to-reference deformation at baseline. In this common space, a
linear model in the tangent space of diffeomorphisms estimates the mean trajectory for the two
clinical conditions. Derived from original images and illustrations of Marco Lorenzi and Raphaël
Sivera.
that we summarize in Section 2.1. To the best of our knowledge, ladder meth-
ods have not been used in the LDDMM framework, for which a BCH-type
approximation was missing. We first give an overview of the LDDMM frame-
work in Section 2.2 before proposing a new second-order implementation of
pole ladder on LDDMM deformations in Section 2.3.
We turn in Section 3 to the application of this methodology to the group-
wise analysis of cardiac motion across diseases. For the cardiac motions that
we analyze in Section 3.2, relative volume differences between subjects
appear to have an unexpected effect on the parallel transported trajectories.
Since this undesirable effect cannot be attributed to the numerical accuracy
of the transport scheme, it indicates that we have to revise the choice of the
Riemannian metric used for the intersubject comparison. We first investigate
in Section 3.3 an intuitive rescaling of the modulus of the momentum after the
parallel transport to preserve on average the ejection fraction along the motion
trajectory. Regression results over a population of subjects and patients show a
surprisingly simple relationship between the scaling and the intersubject volume
ratio. However, modifying the transport equations in an ad-hoc fashion is not
satisfactory from a theoretical point of view. Thus, we investigate in Section 3.4 a more
satisfactory strategy that decouples the volume change from the deformation
directly within the LDDMM metric.
FIG. 4 Schild’s ladder procedure to parallel transport the vector v along the sampled curve with
initial velocity w. Top: First rung of the ladder using an approximate geodesic parallelogram.
Bottom: The method is iterated with rungs at each point sampled along the curve. Reproduced
from Guigui, N., Pennec, X., 2021. Numerical accuracy of ladder schemes for parallel transport
on manifolds. Found. Comput. Math. https://doi.org/10.1007/s10208-021-09515-x.
curvature of the underlying space. This makes it possible to prove that these methods can
be iterated to converge with quadratic speed, even when geodesics are approxi-
mated by numerical schemes.
Assuming that there exists a convex neighborhood that contains the entire
parallelogram, all the above operations are well defined. In the literature, this
construction is then iterated along γ without further specification.
The same result can be obtained for pole ladder, and in the case where
geodesics are obtained by numerical integration of the geodesic equation.
FIG. 5 Top: Elementary construction of the pole ladder with the previous notations (left), and
new notations (right), in a normal coordinate system at m. Bottom: Iterated scheme. Reproduced
from Guigui, N., Pennec, X., 2021. Numerical accuracy of ladder schemes for parallel transport
on manifolds. Found. Comput. Math. https://doi.org/10.1007/s10208-021-09515-x.
We write Diff_K the subgroup of diffeomorphisms obtained this way and use matrix notation with a $dN_c \times dN_c$ block matrix $K(c, c') = \big(K(c_i, c'_j)\, I_d\big)_{ij}$ and flat vectors of size $dN_c$ for landmark sets, velocities, and momenta, so that $v(x) = K(c, x)\mu$ and $\|v\|_K^2 = \mu^{\top} K(c, c)\mu$. The total cost, or energy, of the deformation can be defined as $\int_0^1 \|v_t\|_K^2\, dt$.
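As an illustration of this finite-dimensional parameterization, the following sketch builds the block kernel matrix, evaluates the induced velocity field, and computes the kernel norm. It assumes a Gaussian kernel for concreteness (the chapter only requires a kernel defining an RKHS); shapes and names are those of this example, not of the software used in the chapter.

import numpy as np

def gaussian_kernel(a, b, sigma):
    # Scalar Gaussian kernel matrix K(a, b) of shape (len(a), len(b)).
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def velocity_field(x, c, mu, sigma):
    # v(x) = K(x, c) mu: velocity induced at points x by momenta mu carried at c.
    return gaussian_kernel(x, c, sigma) @ mu

def kernel_norm_sq(c, mu, sigma):
    # ||v||_K^2 = mu^T K(c, c) mu, summed over the d coordinates.
    return np.sum(mu * (gaussian_kernel(c, c, sigma) @ mu))

# Toy usage with 3 control points in 2D.
c = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mu = np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1]])
x = np.random.default_rng(0).normal(size=(5, 2))
print(velocity_field(x, c, mu, sigma=1.0).shape)   # (5, 2)
print(kernel_norm_sq(c, mu, sigma=1.0))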
It can be shown that the momentum vectors that minimize this energy, considering $c_k^{(0)}, c_k^{(1)}$, $k = 1\ldots N_c$ fixed, together with the equation driving the motion of the control points, follow a Hamiltonian system of ODEs:
$$
\begin{cases}
\dot{c}(t) = K\big(c(t), c(t)\big)\,\mu(t)\\
\dot{\mu}(t) = -\nabla_1 K\big(c(t), c(t)\big)\,\mu(t)^{\top}\mu(t)
\end{cases}
\tag{8}
$$
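The following sketch integrates the system (8) with a fourth-order Runge–Kutta scheme (geodesic shooting), again assuming a Gaussian kernel, for which the gradient of the kernel is available in closed form. It is an illustrative re-implementation under these assumptions, not the software used for the experiments.

import numpy as np

def hamiltonian_rhs(c, mu, sigma):
    diff = c[:, None, :] - c[None, :, :]                  # c_i - c_j, shape (Nc, Nc, d)
    K = np.exp(-((diff ** 2).sum(-1)) / (2 * sigma ** 2))
    c_dot = K @ mu                                        # dc_i/dt = sum_j K_ij mu_j
    # dmu_i/dt = -sum_j (mu_i . mu_j) grad_1 K_ij, with grad_1 K_ij = -K_ij (c_i - c_j) / sigma^2
    inner = mu @ mu.T                                     # inner products mu_i . mu_j
    mu_dot = ((inner * K)[:, :, None] * diff).sum(1) / sigma ** 2
    return c_dot, mu_dot

def rk4_step(c, mu, h, sigma):
    # One fourth-order Runge-Kutta step of the Hamiltonian system (the map "rk").
    k1c, k1m = hamiltonian_rhs(c, mu, sigma)
    k2c, k2m = hamiltonian_rhs(c + 0.5 * h * k1c, mu + 0.5 * h * k1m, sigma)
    k3c, k3m = hamiltonian_rhs(c + 0.5 * h * k2c, mu + 0.5 * h * k2m, sigma)
    k4c, k4m = hamiltonian_rhs(c + h * k3c, mu + h * k3m, sigma)
    c_new = c + h / 6 * (k1c + 2 * k2c + 2 * k3c + k4c)
    mu_new = mu + h / 6 * (k1m + 2 * k2m + 2 * k3m + k4m)
    return c_new, mu_new

def shoot(c0, mu0, sigma, n_steps=10):
    # Integrate the geodesic flow of the control points and momenta from t = 0 to t = 1.
    c, mu = c0.copy(), mu0.copy()
    for _ in range(n_steps):
        c, mu = rk4_step(c, mu, 1.0 / n_steps, sigma)
    return c, mu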
FIG. 6 Example of registration and optimal path between a circle (source) and an ellipse (target)
with 60 landmarks, 15 control points (green), and corresponding momentum vectors (cyan) and
kernel width 1. The progression from blue to red shows the time trajectories of the landmarks
under the minimizing geodesic deformation.
$$
d_K(q, \bar q)^2 = \inf\left\{ \frac{1}{2}\int_0^1 \dot q_t^{\top} K(c, c)^{-1} \dot q_t\, dt \;\middle|\; q_0 = q,\ q_1 = \bar q \right\}
$$
This in fact turns the space of shapes into a homogeneous space, with invari-
ant metric given by the group.
FIG. 7 Example of the registration between ED (blue wireframe) and ES (red points) for a
control right ventricle. There are 938 landmarks, and 34 control points with kernel width
10 mm (the overall ventricle at ED is about 55 mm high). Note how momentum vectors are noisy
due to strong interactions between control points (A). The velocity field at the control points
offers a better summary of the deformation (B). (A) Systolic deformation with momenta;
(B) Systolic deformation with velocity field.
We describe the pole ladder in the case where the momentum to transport μ and the momentum along which we transport ω are carried by the same initial control points c = c_A. If it is not the case, μ is projected to c_A by solving a linear system with a Cholesky decomposition of K(c_A, c_A): $\mu' = K(c_A, c_A)^{-1} K(c, c_A)\,\mu$.
Define $\mathrm{rk}: T\mathcal{S}^* \times \mathbb{R}_+ \to T\mathcal{S}^*$ the map that performs one step of the fourth-order Runge–Kutta numerical scheme on the Hamiltonian system (8), $\mathrm{rk}_1$ the projection on the control points only, and $\mathrm{rk}_{1,x}^{-1}$ its approximate inverse by gradient descent, such that $\|\mathrm{rk}_1(\mathrm{rk}_{1,x}^{-1}(y), h) - y\| \leq h^5$. Then, pole ladder is performed by the following algorithm.
Choose n and 1 ≤ α ≤ 2, and divide [0, 1] into n intervals of length h = 1/n. Compute the first midpoint $c_1, \omega_1 = \mathrm{rk}(c_A, \omega, \tfrac{h}{2})$ along the main geodesic, and the first rung $q_0 = \mathrm{rk}_1(\mu, \tfrac{1}{n^{\alpha}})$. Then iterate for i = 1…n (see Fig. 8):
1. Compute the momentum $\alpha_i = n^{\alpha}\, \mathrm{rk}_{1,c_i}^{-1}(q_{i-1})$;
2. Compute the flow of its inverse $q_i = \mathrm{rk}_1(c_i, -\alpha_i, h)$;
3. Compute the next midpoint $c_{i+1}, \omega_{i+1} = \mathrm{rk}(c_i, \omega_i, h)$, except at the last step where $h \leftarrow \tfrac{h}{2}$ to compute $\tilde{c}$.
Return $\tilde{\mu} = n^{\alpha} (-1)^n\, \mathrm{rk}_{1,\tilde{c}}^{-1}(q_n)$.
Two examples of parallel transport solutions are shown. In the first, on simple parametric shapes, the evolution from a circle to an ellipse is transported to a smaller circle (Fig. 9); we retrieve a smaller ellipse, as expected. The second example, on a real right ventricle (RV) mesh of the heart (Fig. 10), shows the ED-to-ES deformation transported to a third reference shape, the atlas. The
obtained reconstruction is a personalized estimate of an ES frame for the atlas,
specific to the considered patient. The transported frame is much more accept-
able than those of Fig. 1.
2.3.1 Validation
Recall that parallel transport is an isometry, so the norm of the transported
deformation must equal that of the initial deformation. We use this property
as a first step to validate our implementation on a population of simulated
2d-shapes. These are generated by shooting from a circle with random Gauss-
ian momentum vectors. This defines the source shape. Then an affinity is
FIG. 8 Representation of the pole ladder for diffeomorphisms with an odd number of steps n.
The exponential maps are computed with a RK4 scheme, and the log by gradient descent.
FIG. 9 Parallel transport of the evolution from a circle to an ellipse along the deformation to a smaller circle. Panels show the geodesic between the source and target circles, the geodesic between the source circle and its evolution into an ellipse, and the evolution transported to the smaller circle.
FIG. 10 Example of an ED-to-ES deformation transported along the ED-to-atlas path. The
source frames are represented by a blue wireframe, the final frames by red landmarks, the control
points by green landmarks, and the initial velocities at the control points by orange arrows. The
transported frame is much more acceptable than those of Fig. 1.
applied to the source shape to define its evolution (i.e., scaling with different
coefficients along each axis). This deformation is then estimated by registra-
tion and transported to the template. The reconstruction obtained by deform-
ing the circle with the transported deformation (yellow shape on the figure)
resembles an ellipse, as expected by the application of the affinity.
For n = 10 rungs, we obtain a root mean squared error (RMSE) of
8.3 × 10⁻³ and less than 1.3 ± 0.9% relative error (in absolute value) for
100 samples. We also apply the affinity to the template and register the tem-
plate to the result to compute the expected momentum after transport. The
deviation of the transported momentum to this expected value is measured
with the kernel norm of the difference and averaged over the samples. We
obtain an error of 0.34 while the fanning scheme achieves 0.33. The two
implementations are therefore very similar on this set of shapes. We however
find cases where the fanning scheme does not behave as well as the pole
ladder for more complex deformations on the heart data (Fig. 13).
The ejection fraction (EF) measures the amount of blood pumped by the heart to the body circulation, and is used in clinical
routine to evaluate heart failure. Area strain (AS) is the relative area change
of each cell of the mesh between ED time and ES time. As it depends on the
quality of the mesh, it is usually filtered by computing the mean at each vertex
over the neighboring cells, and visualized as a colormap on the mesh, or aver-
aged by regions of the ventricle. It represents the amount of stretching local
tissues undergo during the deformation, and is a common descriptor of cardiac
motion (Di Folco et al., 2019; Kleijn et al., 2011; Moceri et al., 2020).
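For concreteness, here is a minimal sketch of how these two descriptors could be computed from triangulated surfaces. The mesh conventions (closed, consistently oriented triangle meshes with shared connectivity between frames) and the function names are assumptions of this illustration, not the chapter's processing pipeline.

import numpy as np

def mesh_volume(vertices, faces):
    # Volume enclosed by a closed triangle mesh (divergence theorem).
    v0, v1, v2 = (vertices[faces[:, k]] for k in range(3))
    return np.abs(np.einsum('ij,ij->i', v0, np.cross(v1, v2)).sum()) / 6.0

def ejection_fraction(ed_vertices, es_vertices, faces):
    # EF = (EDV - ESV) / EDV, the fraction of blood ejected between ED and ES.
    edv = mesh_volume(ed_vertices, faces)
    esv = mesh_volume(es_vertices, faces)
    return (edv - esv) / edv

def triangle_areas(vertices, faces):
    v0, v1, v2 = (vertices[faces[:, k]] for k in range(3))
    return 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)

def area_strain(ed_vertices, es_vertices, faces):
    # Relative area change of each cell between ED and ES (before any smoothing).
    a_ed = triangle_areas(ed_vertices, faces)
    a_es = triangle_areas(es_vertices, faces)
    return (a_es - a_ed) / a_ed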
FIG. 11 Framework using registration and parallel transport to normalize cardiac deformations.
FIG. 12 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of
the ED-to-ES deformation (point cloud to mesh) along the deformation from ED to atlas. Parallel
transport is computed by the fanning scheme (middle) and our implementation of pole ladder
(right). In most cases, little difference can be observed visually.
FIG. 13 Examples of reconstruction where the fanning scheme failed to transport the
deformation, while pole ladder succeeded, although the result is not realistic.
FIG. 14 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of
the ED-to-ES deformation (point cloud to mesh) along the deformation from ED to atlas. The sub-
jects belong to the ASD group, hence show large volume changes that result in unrealistic ES
frames.
FIG. 15 End-diastolic volume by disease group (A), original ejection fraction (B) and its modi-
fication by parallel transport (C). (A) ED volume per disease; (B) Initial EF. (C) Alteration of EF.
These observations lead us to state that the transport method and the Riemannian metric are no longer
consistent with each other. To remove this inconsistency with the general geometric
statistics framework, we then investigate in Section 3.4 a modification of
the metric on diffeomorphisms that decouples the volume change from the
deformation.
3.3.1 Hypothesis
We now propose to introduce a rescaling step after the parallel transport in our framework of Fig. 11. More precisely, instead of using $R_t = \mathrm{Exp}_A\big(\Pi_{ED}^{A}\, \mathrm{Log}_{ED}(S_t)\big)$ as the subject-specific reconstruction of time frame S_t, we introduce a parameter λ > 0 and use
$$
R_t(\lambda) = \mathrm{Exp}_A\!\big(\lambda\, \Pi_{ED}^{A}\, \mathrm{Log}_{ED}(S_t)\big),
$$
where $\Pi_{ED}^{A}$ is the parallel transport map along the geodesic that joins the ED
frame to the atlas. Based on our observation of the results of parallel transport
(Figs. 12 and 14) and of the relationship between the ejection fraction error
and the disease group (Fig. 15C), we hypothesize that this rescaling should
be patient-specific and depend on the volume of the ED frame. Furthermore,
rescaling should not be necessary if the volume of the ED frame matches that
of the atlas. The relevant quantity is thus the ratio VolED/VolA.
For each patient, we then minimize this loss function with respect to λ to find
the patient-specific scaling parameter. We solve this problem by gradient
descent, where the gradient is computed by automatic differentiation. The
parameter λ is regularized to be close to 1 to avoid poor solutions.
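A minimal sketch of this patient-specific optimization is given below, assuming a loss that combines the squared ejection-fraction mismatch of the rescaled reconstructions with a regularization of λ toward 1. The functions reconstruct (computing the rescaled reconstructions R_t(λ)) and ejection_fraction are hypothetical placeholders for the chapter's pipeline; only the optimization pattern by automatic differentiation is illustrated.

import torch

def fit_scaling(reconstruct, ejection_fraction, target_ef,
                reg_weight=0.1, n_iter=100, lr=0.05):
    lam = torch.tensor(1.0, requires_grad=True)
    optimizer = torch.optim.Adam([lam], lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        frames = reconstruct(lam)                        # rescaled reconstructions R_t(lambda)
        loss = (ejection_fraction(frames) - target_ef) ** 2 \
            + reg_weight * (lam - 1.0) ** 2              # keep lambda close to 1
        loss.backward()                                   # gradient by automatic differentiation
        optimizer.step()
    return lam.detach()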
3.3.3 Results
To validate the method, we first check that the ejection fraction is effectively
conserved. The errors on ejection fraction and area strain are reported in
Table 2. We indeed observe a threefold decrease of the RMSE on the ejection
fraction. Additionally, the distribution of this error is no longer related to the
disease group (Fig. 16). Visually, the reconstructions obtained by the scaled
parallel transport are more realistic (Fig. 17). Interestingly, it is possible to
FIG. 16 Alteration of the ejection fraction per disease group. The scaled parallel transport
achieves an error that is not related to the disease group.
FIG. 17 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of
the ED-to-ES deformation (point cloud to mesh) along ED-to-atlas deformation. The large volume
changes between subjects and the atlas result in unrealistic ES frames (middle column). This prob-
lem is well addressed by the scaling strategy (right column).
obtain a low error on the EF after the scaled transport, but this does not pre-
serve the area strain (AS). This shows that although these quantities are
related, they carry different information and the AS depends on the initial
shape, itself related to the pathology.
3.4.1 Model
We apply the result of Niethammer and Vialard (2013) as follows. Restricting to the space of shapes that are embeddings of the circle in ℝ² or of the sphere in ℝ³, let $M = \mathrm{Emb}(S^d, \mathbb{R}^{d+1})$ be the set of shapes, endowed with a metric g. As we are interested in relative volume changes, we consider the function $f = \log\frac{\mathrm{Vol}}{\mathrm{Vol}_A} : M \to \mathbb{R}$ (so that $df = \frac{d\mathrm{Vol}}{\mathrm{Vol}}$), and let $M_0 = f^{-1}(0)$, the space of shapes whose volume equals the volume of the atlas Vol_A. Then, by Niethammer and Vialard (2013, Theorem 4.1), (M, g) can be decomposed into a direct product of Riemannian manifolds $(M, g) = (\mathbb{R}, dt^2) \times (M_0, g_0)$, where g₀ is a Riemannian metric on the submanifold M₀. This metric can be constructed by choosing a projection π : M → M₀ and the restriction of g to M₀. However, there is no canonical projection and two schemes are proposed in Niethammer and Vialard (2013):
• by gradient flow: we follow the flow of $(d\mathrm{Vol})^{\sharp}$, where ♯ depends on the choice of metric on M, in this case the LDDMM metric;
• by scaling: we center the shape around 0 and divide all the landmarks by $\mathrm{Vol}_q^{1/3}$. This choice depends on the center (barycenter) of the shape, which may be unnatural.
In both cases, the framework is slightly modified by applying this projection
first to all the time frames so that their volume matches that of the atlas. Then
the previous framework of Fig. 11 is applied with volume-preserving geode-
sics and parallel transport. This corresponds to the parallel transport on the factor M₀ of M = ℝ × M₀. Finally, the (Euclidean) transport on ℝ corresponds to applying the vector $\log\frac{\mathrm{Vol}_{S_t}}{\mathrm{Vol}_{ED}}$ to 0, so that the new volume is $\mathrm{Vol}_{\mathrm{new}}(t) = \mathrm{Vol}_A\, \frac{\mathrm{Vol}_{S_t}}{\mathrm{Vol}_{ED}}$ as in the previous section, and the ejection fraction is preserved by construction. This volume is obtained either by scaling or by gradient flow, consistently with the projection.
3.4.2 Implementation
We now give more details on the computation of geodesics and our imple-
mentation in the LDDMM framework of Section 2.2. Recall that tangent
spaces are described by the Hilbert space HK of vector fields obtained by
convolution. As we restrict to $M_0 = f^{-1}(0)$, a vertical tangent subspace at any shape q is defined as the set of vector fields of $H_K$ that are volume preserving, i.e.,
$$
\ker(df_q) = \{ v \in H_K \mid d\mathrm{Vol}_q(v(q)) = 0 \},
$$
To avoid any confusion, we distinguish the linear form $d\mathrm{Vol}_q$ from its representation in the canonical basis $\partial\mathrm{Vol}_q$, such that $d\mathrm{Vol}_q(v(q)) = v(q)^{\top}\partial\mathrm{Vol}_q$. The orthogonal projection on $V_q$ (which depends on the LDDMM metric) can be computed by
$$
\pi_q : T_q M \to V_q, \qquad v \mapsto v - d\mathrm{Vol}_q(v)\, \vec{n}, \tag{11}
$$
where $\vec{n} = \frac{(d\mathrm{Vol}_q)^{\sharp}}{\|(d\mathrm{Vol}_q)^{\sharp}\|^2}$. Note that the map ♯ associates a tangent vector to the linear form dVol such that $d\mathrm{Vol}(v) = \langle (d\mathrm{Vol}_q)^{\sharp}, v\rangle_K$, so that $(d\mathrm{Vol}_q)^{\sharp} = K(\cdot, q)\,\partial\mathrm{Vol}_q$. Define also the dual $\pi_q^{*} : T_q^{*} M \to V_q^{*}$ to $\pi_q$ by the relation $\mu(\pi_q(v)) = \pi_q^{*}(\mu)(v)$, i.e., in matrix notation $\mu^{\top}\pi_q(v) = v^{\top}\pi_q^{*}(\mu)$. From (11), we obtain
$$
\pi_q^{*} : T_q^{*} M \to V_q^{*}, \qquad (c, \mu) \mapsto \left(c,\ \mu - \frac{\mu^{\top} K(c, q)\,\partial\mathrm{Vol}_q}{\partial\mathrm{Vol}_q^{\top} K(q, q)\,\partial\mathrm{Vol}_q}\, d\mathrm{Vol}_q\right). \tag{12}
$$
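As an illustration, the projections (11) and (12) can be written compactly in the flattened landmark representation. The sketch below assumes the simplified case where the momenta are carried at the landmark points themselves (c = q), and that the kernel matrix K(q, q) and the canonical representation ∂Vol_q are given (the latter could, for instance, be obtained by automatic differentiation of the mesh volume).

import numpy as np

def project_velocity(v, K_qq, dvol):
    # pi_q(v) = v - dVol_q(v) n, with n = (dVol_q)^sharp / ||(dVol_q)^sharp||^2.
    sharp = K_qq @ dvol                     # (dVol_q)^sharp = K(., q) dVol_q
    sq_norm = dvol @ sharp                  # ||(dVol_q)^sharp||_K^2 = dVol^T K dVol
    return v - (dvol @ v) / sq_norm * sharp

def project_momentum(mu, K_qq, dvol):
    # Dual projection (12) with c = q: remove from mu the component that changes the volume.
    sq_norm = dvol @ (K_qq @ dvol)
    return mu - (mu @ (K_qq @ dvol)) / sq_norm * dvol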
3.4.3 Geodesics
A geodesic between $q_0, q_1 \in \mathrm{Emb}^1(S^2, \mathbb{R}^3)$ is a solution of
$$
\inf \int_0^1 \|v_t\|_K^2\, dt \tag{13}
$$
3.4.4 Results
We first validate our implementation by measuring the area strain and ejection
fraction errors, and complete Table 2 into Table 3. Similarly, we reproduce
the boxplot of Fig. 15C with the two projections and the previous volume-constrained scaling strategy. As in the previous section, there is no relation
between the error and the disease group. We show some reconstructions in
Fig. 19, along with the reconstructions obtained by the scaled parallel transport
of the previous section. The two metrics and the two projection methods result
in different reconstructions. The projection by gradient flow seems less stable
than the two other methods, as the thinning near the valve is exaggerated on
the first row, and the apex of the second row is unrealistically spherical.
FIG. 19 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of
the ED-to-ES deformation (point cloud to mesh) along the deformation from ED to atlas. Compar-
ison of the three methods. The gradient flow seems less stable (first two rows) than the projection
by scaling. The scaling method results in more vertical movement of the base than the two other
methods. This may be due to the centering step required before scaling.
$$
\begin{cases}
\dot{c}_k^{(t)} = \sum_j K\big(c_k^{(t)}, c_j^{(t)}\big)\,\mu_j^{(t)}\\
\dot{\mu}_k^{(t)} = -\sum_j \nabla_1 K\big(c_k^{(t)}, c_j^{(t)}\big)\,\mu_k^{(t)\top}\mu_j^{(t)} + u_k^{(t)}
\end{cases}
\tag{14}
$$
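With respect to the geodesic system (8), the only change is the external control u entering the momentum equation. As a minimal sketch, reusing the hamiltonian_rhs function of the geodesic sketch given after Eq. (8) and assuming u is an (Nc, d) array of forces at the control points:

def controlled_rhs(c, mu, u, sigma):
    # Right-hand side of the forced (spline) system (14): geodesic part plus forcing on mu.
    c_dot, mu_dot = hamiltonian_rhs(c, mu, sigma)   # geodesic part, as in Eq. (8)
    return c_dot, mu_dot + u                        # the external control acts on the momenta only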
3.5.1.1 Results
If we consider a discrete sequence of observation times $t_1 = 0, t_2, \ldots, t_d = 1$ and configurations $x_{t_1}, \ldots, x_{t_d}$, one seeks to find the path $\phi_t$ that minimizes the new cost
$$
C_S(c, \mu, u_t) = \frac{1}{\alpha d}\sum_{i=1}^{d}\big\|x_{t_i} - \phi_{t_i}(x_{t_0})\big\|_2^2 + \int_0^1 \big\|u^{(t)}\big\|^2\, dt + \big\|v_0^{c,\mu}\big\|_K^2 \tag{15}
$$
In practice, the ODEs (8) and (14) are discretized in n time steps and an inte-
gration method such as Euler or Runge–Kutta is used. We define all the
patients’ trajectories between t ¼ 0 and t ¼ 1, and use the same discretization
for all the patients to ensure that u0 , …, uNc are estimated at corresponding
times. Along with μ(0), these are estimated by gradient descent as in the case
of registration. Setting u0 ¼ … ¼ uNc ¼ 0 at all times recovers a geodesic tra-
jectory. We use a kernel bandwidth σ ¼ 15 in all the experiments, and 60 con-
trol points for all the deformations of the atlas. The initial control points are
fixed for the entire dataset so that the initial momenta can be compared con-
sistently. They have been optimized to register the atlas on all the transported
ES frames.
Visually we only notice very slight differences between the geodesic and
spline regressions (Fig. 20). We do observe that the fit is slightly less precise
in the intermediate frames for the geodesic regression, and this is confirmed
by measuring the overall data attachment term (left term of (15)) that is
reported in Table 4, regardless of the normalization method. The spline
regression therefore yields a more faithful representation of the normalized
cardiac deformations than the geodesic regression, at the cost of a larger set
of parameters, encompassing the external time-dependent forces ut whose
interpretation and analysis are difficult. Indeed, the size of the acceleration
term u_t is (d − 1) × N_c × 3 = 1620.
FIG. 20 Fit of the normalized sequence of an ASD patient by geodesic (top) and spline (bottom)
regression. Fits are the purple transparent meshes, and normalized data are the blue meshes with
edges. Upon visual inspection, there is barely any difference between the two fits.
TABLE 4 Comparison of the mean data attachment term for each method.

            SPT           Grad VPPT      Scaling VPPT
Geodesic    1384 ± 952    1659 ± 906     1483 ± 864
Spline       549 ± 413     681 ± 390      554 ± 324
FIG. 21 Results of the Hotelling tests between velocities. Where the differences are significant,
control points and difference with the control group are in red, while colored arrows represent the
mean velocity field for the disease group. All the arrows have been scaled by a factor 2.3 for visu-
alization purposes. The color map of the meshes reflects the norm of the velocity field at that
point, if it is significantly different from the control group. As for the area strain, there is little
difference for the ASD group, showing that the volume differences have well been filtered out
by the normalization. Moreover, the differences for the PHT and ToF group reflect deformations
with less amplitude than the control group, as observed on the ejection fraction.
significant differences between each disease and the control group. The differ-
ences are superimposed with the group mean velocity fields, and show that
these differences are mainly of magnitude, with different orientations at a
few points along the free wall. The differences observed near the tricuspid
valve mainly reflect the difference of magnitude of the deformations, and
one should be cautious before drawing further conclusions as the quality of
the mesh may vary near this region. However, it is interesting to notice that
very few differences are observed for the ASD group, which corroborates
previous results (Moceri et al., 2020). Similarly, only small differences are
observed on the septum for the PHT group, showing that the shape differences
usually observed on the PHT group have been filtered out. This makes the dif-
ferences observed on the free wall interesting and other markers such as the
circumferential strain will be studied to confirm these effects. These experi-
ments also highlight the difference between the PHT and ToF group as signif-
icant differences are localized on the inferior part of the free wall and on the
inlet (area of the tricuspid valve) for the ToF group, while they are distributed
across the whole shape for the PHT group.
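The point-wise group comparisons mentioned above rely on two-sample Hotelling T² tests. A minimal sketch of such a test on the per-subject velocity (or acceleration) vectors at one control point could read as follows; this is a generic implementation of the test, not the authors' code, and x and y are assumed to hold one 3D vector per subject for each group.

import numpy as np
from scipy import stats

def hotelling_t2(x, y):
    # Two-sample Hotelling T^2 test between samples x (n_x, p) and y (n_y, p).
    n_x, p = x.shape
    n_y = y.shape[0]
    diff = x.mean(0) - y.mean(0)
    # Pooled covariance of the two samples.
    s_pooled = ((n_x - 1) * np.cov(x, rowvar=False)
                + (n_y - 1) * np.cov(y, rowvar=False)) / (n_x + n_y - 2)
    t2 = n_x * n_y / (n_x + n_y) * diff @ np.linalg.solve(s_pooled, diff)
    # Transform T^2 into an F statistic to obtain a p-value.
    f_stat = (n_x + n_y - p - 1) / (p * (n_x + n_y - 2)) * t2
    p_value = stats.f.sf(f_stat, p, n_x + n_y - p - 1)
    return t2, p_value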
The same visualization is provided for the mean acceleration term by dis-
ease group, across the group-wise mean trajectory (Fig. 22). Recall that these
can be interpreted as external forces. They seem to increase the movement
toward contraction at the beginning of motion while slowing it down toward
the end. These terms thus provide more insights into the dynamics of each
disease.
Similar to the analysis of the velocities, the mean external force of the
ASD group is not significantly different from that of the control group. This
shows that the volume differences reported in Fig. 15A were successfully
FIG. 22 Results of the Hotelling tests between accelerations. Where the differences are signifi-
cant, control points and differences with the control group are in red, while colored arrows repre-
sent the mean acceleration for the disease group. As for the velocity fields, there are no
differences for the ASD group. For the ToF group, the differences are localized, but these sites
vary with time, while for the PHT group, differences are distributed on the entire shape.
normalized by the scaled parallel transport. For the ToF and PHT group, many
points have significantly different values of acceleration, and those differ-
ences are shown by red arrows. They are distributed across the whole RV
for the PHT group, while their locations vary in time for the ToF group,
and concentrate around the apex toward the end of the deformation. For the
ToF group, we do find differences on the infundibulum (area of the pulmonary
valve) at times ED + 1 and ED + 2 as expected from the consequences of the
surgery. This is in favor of our method for the analysis of shape deformations,
as these differences were not observed with the traditional analysis of strain
maps. As in the case of velocities, the differences mainly reflect amplitude
differences, especially near ED and ES times. They thus reflect the late contraction of the RV, and the distribution of the differences across the ventricle and in time reflects a longer and asynchronous contraction introduced by the
disease. Indeed, the deformation of the control group is more uniformly
distributed on the shape, and the introduction of heterogeneity across the
shape has been previously reported and associated with the disease, and is
known to be a factor of arrhythmia with direct consequences on survival.
4 Conclusion
In this chapter, we first summarized recent results using parallel transport for
geometric statistics in computational anatomy. We proposed in particular a
new implementation of the pole ladder scheme. These results guarantee that
parallel transport algorithms are well behaved, enabling their use as a normal-
ization step for longitudinal shape data. This further makes it possible to evaluate the
underlying metric and the normalization model beyond numerical
stability.
On the application side, the results on data of the motion of the cardiac
right ventricle exhibited a strong bias due to the volume differences at a refer-
ence time point. Two strategies were investigated to correct for this bias: scal-
ing the parallel transport, and using a metric that decomposes volume changes
from volume-preserving deformations. Both methods were successful at pre-
serving the ejection fraction, and reduced the bias due to volume differences.
When scaling the parallel transport, we discovered a significant relationship
between the scaling parameter and the initial volume ratio; however, its interpretation requires further study. This will be investigated in future work,
together with the possible mathematical relation to the volume-preserving
metric. The statistical analysis of the normalized deformations was meaningful
and coherent with previous knowledge. This framework is readily usable to
evaluate the impact of a treatment or a surgery on the deformation. It can also
be used to simulate new cardiac sequences for a given shape from a population
of cardiac sequences, by performing a principal component analysis of the
momentum vectors of the spline or geodesic regression, and shooting along
the principal modes, or by sampling coefficients from a multivariate normal
law to mix the principal components.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the
European Union’s Horizon 2020 research and innovation program (grant G-Statistics agree-
ment No 786854). This work has been supported by the French government, through the
3IA Côte d'Azur Investments in the Future project managed by the National Research
Agency (ANR) with the reference number ANR-19-P3IA-0002.
Abbreviations
LDDMM large deformation diffeomorphic metric mapping
SVF stationary velocity field
RKHS reproducing kernel Hilbert space
References
Arsigny, V., Commowick, O., Pennec, X., Ayache, N., 2006. A log-Euclidean framework for
statistics on diffeomorphisms. In: Larsen, R., Nielsen, M., Sporring, J. (Eds.), Medical Image
Computing and Computer-Assisted Intervention—MICCAI 2006. LNCS 4190. Springer
Berlin Heidelberg, pp. 924–931. https://doi.org/10.1007/11866565_113.
Ashburner, J., Hutton, C., Frackowiak, R., Johnsrude, I., Price, C., Friston, K., 1998. Identifying
global anatomical differences: deformation-based morphometry. Human Brain Mapping 6
(5–6), 348–357.
Bauer, M., Bruveris, M., Michor, P.W., 2014. Overview of the geometries of shape spaces and
diffeomorphism groups. J. Math. Imaging Vis. 50 (1–2), 60–97. https://doi.org/10.1007/
s10851-013-0490-z.
Bône, A., Louis, M., Martin, B., Durrleman, S., 2018. Deformetrica 4: An Open-Source Software
for Statistical Shape Analysis. In: Reuter, M., Wachinger, C., Lombaert, H., Paniagua, B.,
Lüthi, M., Egger, B. (Eds.), Shape in Medical Imaging. ShapeMI 2018. LNCS 11167.
Springer, Cham, pp. 3–13. https://doi.org/10.1007/978-3-030-04747-4_1.
Charon, N., Charlier, B., Glaunès, J.A., Gori, P., Roussillon, P., 2020. Fidelity metrics between
curves and surfaces: currents, varifolds, and normal cycles. In: Pennec, X., Sommer, S.,
Fletcher, T. (Eds.), Riemannian Geometric Statistics in Medical Image Analysis. Academic
Press, pp. 441–477. https://doi.org/10.1016/B978-0-12-814725-2.00021-2.
Cury, C., Lorenzi, M., Cash, D., Nicholas, J., Routier, A., Rohrer, J., Ourselin, S., Durrleman, S.,
Modat, M., 2016. Spatio-temporal shape analysis of cross-sectional data for detection of early
changes in neurodegenerative disease. In: SeSAMI 2016—First International Workshop Spec-
tral and Shape Analysis in Medical Imaging. LNCS 10126. Springer, pp. 63–75. https://doi.
org/10.1007/978-3-319-51237-2_6.
Debavelaere, V., Bône, A., Durrleman, S., Allassonnière, S., for the Alzheimer's Disease Neuroimaging
Initiative, 2019. Clustering of longitudinal shape data sets using mixture of separate or
branching trajectories. In: Medical Image Computing and Computer Assisted Intervention—
MICCAI 2019. LNCS 11767. Springer, Cham, pp. 66–74. https://doi.org/10.1007/978-3-030-
32251-9_8.
Di Folco, M., Clarysse, P., Moceri, P., Duchateau, N., 2019. Learning interactions between car-
diac shape and deformation: application to pulmonary hypertension. In: Tenth International
Statistical Atlases and Computational Modeling of the Heart (STACOM) Workshop, Held
in Conjunction With MICCAI 2019, Shenzen, China. LNCS 12009. Shenzen, China,
pp. 119–127. https://doi.org/10.1007/978-3-030-39074-7_13.
Di Folco, M., Guigui, N., Clarysse, P., Moceri, P., Duchateau, N., 2021. Investigation of the
impact of normalization on the study of interactions between Myocardial shape and deforma-
tion. In: Functional Imaging and Modeling of the Heart. LNCS 12738. Springer, Cham,
pp. 223–231. https://doi.org/10.1007/978-3-030-78710-3_22.
Dryden, I.L., Mardia, K.V., 2016. Statistical Shape Analysis With Applications in R, second ed.
Wiley Series in Probability and Statistics, Wiley, Chichester, UK; Hoboken, NJ, ISBN:
978-0-470-69962-1.
Duchateau, N., De Craene, M., Piella, G., Silva, E., Doltra, A., Sitges, M., Bijnens, B.H.,
Frangi, A.F., 2011. A spatiotemporal statistical atlas of motion for the quantification of abnor-
mal myocardial tissue velocities. Med. Image Anal. 15 (3), 316–328. https://doi.org/10.1016/
j.media.2010.12.006.
Durrleman, S., Allassonnière, S., Joshi, S., 2013a. Sparse adaptive parameterization of variability
in image ensembles. Int. J. Comput. Vis. 101 (1), 161–183. https://doi.org/10.1007/s11263-
012-0556-1.
Durrleman, S., Pennec, X., Trouve, A., Braga, J., Gerig, G., Ayache, N., 2013b. Toward a com-
prehensive framework for the spatiotemporal statistical analysis of longitudinal shape data.
Int. J. Comput. Vis. 103 (1), 22–59. https://doi.org/10.1007/s11263-012-0592-x.
Durrleman, S., Prastawa, M., Charon, N., Korenberg, J.R., Joshi, S., Gerig, G., Trouve, A., 2014.
Morphometry of anatomical shape complexes with dense deformations and sparse parameters.
NeuroImage 101, 35–49. https://doi.org/10.1016/j.neuroimage.2014.06.043.
Ehlers, J., Pirani, F.A.E., Schild, A., 1972. The geometry of free fall and light propagation. In:
O’Raifeartaigh, L. (Ed.), General Relativity: Papers in Honour of J. L. Synge. Clarendon
Press, Oxford, pp. 63–84.
Fishbaugh, J., Durrleman, S., Prastawa, M., Gerig, G., 2017. Geodesic shape regression with mul-
tiple geometries and sparse parameters. Med. Image Anal. 39, 1–17. https://doi.org/10.1016/
j.media.2017.03.008.
Gerber, S., Tasdizen, T., Fletcher, T., Joshi, S., Whitaker, R., 2010. Manifold modeling for brain
population analysis. Med. Image Anal. 14 (5), 643–653. https://doi.org/10.1016/
j.media.2010.05.008.
Grenander, U., Miller, M., 1998. Computational anatomy: an emerging discipline. Q. Appl. Math.
LVI (4), 617–694.
Guigui, N., Pennec, X., 2021. Numerical accuracy of ladder schemes for parallel transport on
manifolds. Found. Comput. Math. https://doi.org/10.1007/s10208-021-09515-x.
Guigui, N., Moceri, P., Sermesant, M., Pennec, X., 2021. Cardiac motion modeling with parallel
transport and shape splines. In: ISBI 2021—IEEE 18th International Symposium on Biomed-
ical Imaging. IEEE, pp. 1394–1397. https://doi.org/10.1109/ISBI48211.2021.9433887.
Hadj-Hamou, M., Lorenzi, M., Ayache, N., Pennec, X., 2016. Longitudinal analysis of image time
series with diffeomorphic deformations: a computational framework based on stationary
velocity fields. Front. Neurosci. 10 (236). https://doi.org/10.3389/fnins.2016.00236.
Hauberg, S., Lauze, F., Pedersen, K.S., 2013. Unscented Kalman filtering on Riemannian mani-
folds. J. Math. Imaging Vis. 46 (1), 103–120. https://doi.org/10.1007/s10851-012-0372-9.
Higham, N.J., 2005. The scaling and squaring method for the matrix exponential revisited. SIAM
J. Matrix Anal. Appl. 26 (4), 1179–1193. https://doi.org/10.1137/04061101X.
Hinkle, J., Fletcher, T., Joshi, S., 2014. Intrinsic polynomials for regression on Riemannian mani-
folds. J. Math. Imaging Vis. 50 (1–2), 32–52. https://doi.org/10.1007/s10851-013-0489-5.
Kheyfets, A., Miller, W.A., Newton, G.A., 2000. Schild’s ladder parallel transport procedure for
an arbitrary connection. Int. J. Theor. Phys. 39 (12), 2891–2898. https://doi.org/10.1023/
A:1026473418439.
Kleijn, S.A., Aly, M.F.A., Terwee, C.B., van Rossum, A.C., Kamp, O., 2011. Three-dimensional
speckle tracking echocardiography for automatic assessment of global and regional left ven-
tricular function based on area strain. J. Am. Soc. Echocardiogr. 24 (3), 314–321. https://doi.
org/10.1016/j.echo.2011.01.014.
Lorenzi, M., Pennec, X., 2013. Geodesics, parallel transport & one-parameter subgroups for
diffeomorphic image registration. Int. J. Comput. Vis. 105 (2), 111–127. https://doi.org/
10.1007/s11263-012-0598-4.
Lorenzi, M., Pennec, X., 2014. Efficient parallel transport of deformations in time series of
images: from Schild to pole ladder. J. Math. Imaging Vis. 50 (1), 5–17. https://doi.org/
10.1007/s10851-013-0470-3.
Lorenzi, M., Ayache, N., Frisoni, G.B., Pennec, X., ADNI, 2011a. Mapping the effects of Aβ1–42
levels on the longitudinal changes in healthy aging: hierarchical modeling based on stationary
velocity fields. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI
2011, September, Springer, Berlin, Heidelberg, pp. 663–670.
Lorenzi, M., Ayache, N., Pennec, X., 2011b. Schild's ladder for the parallel transport of deforma-
tions in time series of images. In: IPMI—22nd International Conference on Information
Processing in Medical Images—2011, July, vol. 6801. Springer, p. 463.
Lorenzi, M., Pennec, X., Frisoni, G.B., Ayache, N., 2015. Disentangling normal aging from
Alzheimer’s disease in structural magnetic resonance images. Neurobiol. Aging 36, S42.
https://doi.org/10.1016/j.neurobiolaging.2014.07.046.
Louis, M., Bône, A., Charlier, B., Durrleman, S., 2017. Parallel transport in shape analysis: a scal-
able numerical scheme. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Informa-
tion. LNCS 10589. Springer International Publishing, pp. 29–37. https://link.springer.com/
chapter/10.1007/978-3-319-68445-1_4.
Louis, M., Charlier, B., Jusselin, P., Pal, S., Durrleman, S., 2018. A fanning scheme for the par-
allel transport along geodesics on Riemannian manifolds. SIAM J. Numer. Anal. 56 (4),
2563–2584. https://doi.org/10.1137/17M1130617.
Louis, M., Couronne, R., Koval, I., Charlier, B., Durrleman, S., 2019. Riemannian geometry
learning for disease progression modelling. In: Information Processing in Medical Imaging.
Springer, Cham, pp. 542–553.
Mansi, T., Durrleman, S., Bernhardt, B., Sermesant, M., Delingette, H., Voigt, I., Lurz, P.,
Taylor, A.M., Blanc, J., Boudjemline, Y., Pennec, X., Ayache, N., 2009. A statistical model
of right ventricle in tetralogy of fallot for prediction of remodelling and therapy planning. In:
Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (Eds.), Medical Image Comput-
ing and Computer-Assisted Intervention—MICCAI 2009. LNCS 5761. Springer Berlin,
Heidelberg, pp. 214–221. https://link.springer.com/chapter/
10.1007/978-3-642-04268-3_27.
McLeod, K., Mansi, T., Sermesant, M., Pongiglione, G., Pennec, X., 2013. Statistical shape anal-
ysis of surfaces in medical images applied to the tetralogy of fallot heart. In: Cazals, F.,
Kornprobst, P. (Eds.), Modeling in Computational Biology and Biomedicine:
A Multidisciplinary Endeavor. Springer, Berlin, Heidelberg, pp. 165–191. https://doi.org/
10.1007/978-3-642-31208-3_5.
Micheli, M., Glaunès, J.A., 2014. Matrix-valued Kernels for shape deformation analysis. Geome-
try Imaging Comput. 1 (1), 57–139. https://doi.org/10.4310/GIC.2014.v1.n1.a2.
Miller, M., Trouve, A., Younes, L., 2015. Hamiltonian systems and optimal control in computa-
tional anatomy: 100 years since D’Arcy Thompson. Ann. Rev. Biomed. Eng. 17 (1),
447–509. https://doi.org/10.1146/annurev-bioeng-071114-040601.
Misner, C.W., Thorne, K.S., Wheeler, J.A., 1973. Gravitation. Princeton University Press, ISBN:
978-0-691-17779-3.
Moceri, P., Duchateau, N., Baudouy, D., Schouver, E.-D., Leroy, S., Squara, F., Ferrari, E.,
Sermesant, M., 2018. Three-dimensional right-ventricular regional deformation and survival
in pulmonary hypertension. Eur. Heart J. Cardiovasc. Imaging 19 (4), 450–458. https://doi.
org/10.1093/ehjci/jex163.
Moceri, P., Duchateau, N., Gillon, S., Jaunay, L., Baudouy, D., Squara, F., Ferrari, E.,
Sermesant, M., 2020. 3D right ventricular shape and strain in congenital heart disease patients
with right ventricular chronic volume loading. Eur. Heart J. Cardiovasc. Imaging 22 (10),
1174–1181. https://doi.org/10.1093/ehjci/jeaa189.
Niethammer, M., Vialard, F.-X., 2013. Riemannian metrics for statistics on shapes : parallel trans-
port and scale invariance. In: Proceedings of Miccai Workshop, MFCA. HAL. http://www-
sop.inria.fr/asclepios/events//MFCA13/Proceedings/MFCA2013_1_1.pdf.
Niethammer, M., Kwitt, R., Vialard, F.-X., 2019. Metric learning for image registration. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 8463–8472.
Pennec, X., 2019. Curvature effects on the empirical mean in Riemannian and affine manifolds: a
non-asymptotic high concentration expansion in the small-sample regime. arXiv:1906.07418
[math, stat].
Pennec, X., Arsigny, V., 2012. Exponential barycenters of the canonical cartan connection
and invariant means on lie groups. In: Matrix Information Geometry, May. Springer,
pp. 123–168. https://doi.org/10.1007/978-3-642-30232-9_7.
Pennec, X., Lorenzi, M., 2020. Beyond Riemannian geometry: the affine connection setting for
transformation groups. In: Pennec, X., Sommer, S., Fletcher, T. (Eds.), Riemannian Geomet-
ric Statistics in Medical Image Analysis. Academic Press, pp. 169–229. https://doi.org/
10.1016/B978-0-12-814725-2.00012-1.
Pennec, X., Sommer, S., Fletcher, T., 2020. Riemannian Geometric Statistics in Medical Image
Analysis. Elsevier. https://doi.org/10.1016/C2017-0-01561-6.
Peyrat, J.-M., Delingette, H., Sermesant, M., Xu, C., Ayache, N., 2010. Registration of 4D cardiac
CT sequences under trajectory constraints with multichannel diffeomorphic demons. IEEE
Trans. Med. Imaging 29 (7), 1351–1368. https://doi.org/10.1109/TMI.2009.2038908.
Qiu, A., Younes, L., Miller, M., Csernansky, J.G., 2008. Parallel transport in diffeomorphisms
distinguishes the time-dependent pattern of hippocampal surface deformation due to healthy
aging and the dementia of the Alzheimer’s type. NeuroImage 40 (1), 68–76. https://doi.org/
10.1016/j.neuroimage.2007.11.041.
Qiu, A., Albert, M., Younes, L., Miller, M., 2009. Time sequence diffeomorphic metric mapping
and parallel transport track time-dependent shape changes. NeuroImage 45 (Suppl. 1),
S51–S60. https://doi.org/10.1016/j.neuroimage.2008.10.039.
Rao, A., Chandrashekara, R., Sanchez-Ortiz, G.I., Mohiaddin, R., Aljabar, P., Hajnal, J.V.,
Puri, B.K., Rueckert, D., 2004. Spatial transformation of motion and deformation fields using
nonrigid registration. IEEE Trans. Med. Imaging 23 (9), 1065–1076. https://doi.org/10.1109/
TMI.2004.828681.
Sanz, J., Sánchez-Quintana, D., Bossone, E., Bogaard, H.J., Naeije, R., 2019. Anatomy, function,
and dysfunction of the right ventricle: JACC state-of-the-art review. J. Am. College Cardiol.
73 (12), 1463–1482. https://doi.org/10.1016/j.jacc.2018.12.076.
Schiratti, J.-B., Allassonnière, S., Colliot, O., Durrleman, S., 2015. Learning spatiotemporal
trajectories from manifold-valued longitudinal data. In: Proc Adv. Neural Inf. Process.
Syst., 28. https://proceedings.neurips.cc/paper/2015/hash/186a157b2992e7daed3677ce8e9fe40f-
Abstract.html.
Singh, N., Vialard, F.-X., Niethammer, M., 2015. Splines for diffeomorphisms. Med. Image Anal.
25 (1), 56–71. https://doi.org/10.1016/j.media.2015.04.012.
Sivera, R., Capet, N., Manera, V., Fabre, R., Lorenzi, M., Delingette, H., Pennec, X., Ayache, N.,
Robert, P., 2020. Voxel-based assessments of treatment effects on longitudinal brain changes
in the multidomain Alzheimer preventive trial cohort. Neurobiol. Aging 94, 50. https://doi.
org/10.1016/j.neurobiolaging.2019.11.020.
Thompson, D.W., 1917. On Growth and Form. Cambridge University Press, Cambridge, https://
doi.org/10.1017/CBO9781107325852.
Trouve, A., Vialard, F.-X., 2012. Shape splines and stochastic shape evolutions: a second order
point of view. Q. Appl. Math. 70 (2), 219–251, Publisher: Brown University. https://www.
jstor.org/stable/43639026.
Vialard, F.-X., Risser, L., 2014. Spatially-varying metric learning for diffeomorphic image regis-
tration: a variational framework. In: Medical Image Computing and Computer-Assisted
Intervention—MICCAI 2014. LNCS 8673. Springer, Cham, pp. 227–234. https://doi.org/
10.1007/978-3-319-10404-1_29.
Younes, L., 2007. Jacobi fields in groups of diffeomorphisms and applications. Q. Appl. Math.
65 (1), 113–134. https://doi.org/10.1090/S0033-569X-07-01027-5.
Younes, L., 2019. Shapes and Diffeomorphisms. Applied Mathematical Sciences, vol. 171
Springer Berlin Heidelberg, Berlin, Heidelberg, ISBN: 978-3-662-58495-8, https://doi.org/
10.1007/978-3-662-58496-5.
Young, A.A., Frangi, A.F., 2009. Computational cardiac atlases: from patient to population and
back. Exp. Physiol. 94 (5), 578–596. https://doi.org/10.1113/expphysiol.2008.044081.
Chapter 9
Abstract
In this article we look at the geometry of mixture models. This geometry can be affine,
convex, differential, information, or algebraic. While mixture models are simple to
define, highly interpretable, and widely used in statistical practice, their underlying sta-
tistical properties are complex and still not completely theoretically resolved. This com-
plexity often has a geometric cause and resolution. The family presents interesting
challenges to the development of a complete geometric theory of inference. Throughout, we illustrate the discussion with simple, often visual, models to aid intuition.
Keywords: Affine geometry, Convex geometry, Geometry of distributions, Informa-
tion geometry, Mixture models, Nonstandard asymptotics, Singular models
1 Introduction
The objective of this article is to explore the relationship between information
geometry and the theory, and statistical properties, of mixture models. Geom-
etry illuminates the nonstandard inferential properties of these models. Con-
versely, these models provide challenging test cases which have shaped the
development of a general geometric theory of statistical inference. We start
by reviewing foundational issues in both mixture modeling and the geometry
of the space of distributions.
Mixture models arise throughout applied statistics: in measurement error modeling (Fuller, 1987); repeated measures (Crowder and Hand, 2017); machine learning (ML); Bayesian inference (Watanabe, 2021a); and many other areas, whenever modeling involves latent, or hidden, structure.
Very often analysts select a particular parametric family for the compo-
nent distributions of a mixture model, i.e., f(x;θ) in Definition 1. For example,
multivariate normal or multivariate t-distributions are often used in cluster
analysis (McNicholas, 2016), exponential or Weibull distributions in lifetime
data analysis (Jewell, 1982), or mixtures of discrete distributions such as
Poisson or Binomial in discrete data analysis (Everitt and Hand, 1981,
Chapter 4). These component distributions are often selected to lie in the class
of exponential (dispersion) families which are the workhorses of generalized
linear models (GLM) and hence very frequently used in applied statistics.
Exponential (dispersion) families have excellent theoretical and computa-
tional properties across a range of inference methods, including the frequen-
tist, likelihood, and Bayesian schools.
However, real-world data are complicated and can be highly heterogeneous, and the analyst frequently finds features that cannot be modeled while staying purely within the exponential family framework. It is then very common to use some form of mixture model. These models are very flexible, meaning they can deal with data complexity, and can also be highly interpretable, which makes them very attractive to the analyst. Of course, there are no free lunches, and this very flexibility comes at a cost. Inference properties of mixture models are more complex than those of exponential families, as are many of the associated computational issues. Balancing these costs and benefits is part of the art of working with mixture models.
We start with the following definition.
Definition 1. Consider a convex mixture of probability mass, or density, functions $f(x;\theta)$, for $\theta \in \Theta \subseteq \mathbb{R}^d$, written as
$$f(x; K, \rho, \theta) := \sum_{i=1}^{K} \rho_i\, f(x; \theta_i), \qquad (1)$$
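As a concrete illustration of (1) (not part of the original text), the following minimal Python sketch evaluates a $K$-component mixture density with normal components; the particular parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, rho, theta):
    """Evaluate f(x; K, rho, theta) = sum_i rho_i * f(x; theta_i)
    for normal components theta_i = (mu_i, sigma_i)."""
    x = np.atleast_1d(x)
    return sum(r * norm.pdf(x, loc=mu, scale=sd) for r, (mu, sd) in zip(rho, theta))

# A three-component example; the weights sum to one
rho = [0.5, 0.3, 0.2]
theta = [(0.0, 1.0), (3.0, 1.0), (6.0, 1.5)]
print(mixture_density([0.0, 3.0, 6.0], rho, theta))
```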
The corresponding exponential, or geodesic, mixture of two fixed densities $f_1(x)$ and $f_2(x)$ is
$$f^{(e)}(x;\theta) := \frac{f_1(x)^{1-\theta}\, f_2(x)^{\theta}}{C(\theta)}, \qquad (3)$$
where $C(\theta) = \int f_1(x)^{1-\theta} f_2(x)^{\theta}\, dx$. As simple as these models are, it is informative to consider their parameter spaces and issues associated with their support. For $f^{(m)}(x;\rho)$, the parameter $\rho$ can take any value as long as $f^{(m)}(x;\rho) \ge 0$ for all $x$. This space includes $\rho \in [0,1]$ but is not restricted to this interval. Furthermore, the support of $f^{(m)}(x;\rho)$ can depend on $\rho$ and, as is well known, this can give rise to nonstandard inferential behavior. For $f^{(e)}(x;\theta)$ the parameter space is defined by the condition $C(\theta) < \infty$ and, if $f_1(x)$ and $f_2(x)$ have common support, we have an exponential family. For more detail see the discussion around Definitions 5 and 6 in Section 4.
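The normalizing constant $C(\theta)$ is typically not available in closed form, but it is a one-dimensional integral. A minimal sketch, assuming two normal densities for $f_1$ and $f_2$ purely for illustration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Two fixed densities f1, f2 (standard normal and N(2,1)), chosen only for illustration
f1 = lambda x: norm.pdf(x, 0.0, 1.0)
f2 = lambda x: norm.pdf(x, 2.0, 1.0)

def C(theta):
    """Normalizing constant C(theta) = int f1(x)^(1-theta) f2(x)^theta dx."""
    val, _ = quad(lambda x: f1(x) ** (1.0 - theta) * f2(x) ** theta, -np.inf, np.inf)
    return val

def f_e(x, theta):
    """Exponential (geodesic) mixture of f1 and f2, Eq. (3)."""
    return f1(x) ** (1.0 - theta) * f2(x) ** theta / C(theta)

print(C(0.5))          # < 1 here, reflecting the limited overlap of f1 and f2
print(f_e(1.0, 0.5))
```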
The standard identifiability result for this family is that of Yakowitz and Spragins (1968), which requires that (a) $K$ is fixed and known, (b) each $\rho_i$ is strictly positive, and (c) the component parameters $(\mu_i, \sigma_i^2)$ are distinct. Furthermore, any identification is only up to permutations of the component labels $\{1, \ldots, K\}$.
where $\sigma_1 = 1$ and $\sigma_2 = 1.5$ which, in this case, are considered fixed and known. It is important to note that $K = 3$ is also treated as fixed and known here, as required in Definition 2. In this example the data seem to form two clusters, each with a different location and spread, with the empirical variances of the clusters appearing larger than 1. It is easy to show numerically that the likelihood for Model 1 has multiple local modes. Panel (A) shows the fit of two of these. They correspond to two different attempts by the model to explain both the heterogeneity in two means and two variances with only the three components. Neither attempt is ideal, each underestimating the spread in one of the empirical clusters. The multimodality comes from the trade-off between different sorts of error.
Model 2 can explain the cluster variance, since its fixed component variances are larger; however, in this case we find that in the fit $\hat{\rho}_1 \approx 0$. That is because the likelihood is maximized on the boundary and at a point of singularity. Only two components are needed and the model tries to remove one of the three provided.
It is perhaps natural to ask: why not try a third model, defined by $\sum_{i=1}^{3} \rho_i\, \phi(x; \mu_i, \sigma_i^2)$, where now each component variance is unknown? In this case the model is so flexible that there are even more local modes, and some of these correspond to the case where $\hat{\sigma}_1^2 = 0$ and the corresponding likelihood is unbounded.
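The multimodality described above is easy to reproduce numerically. The following sketch is an illustration only: the synthetic data, starting values, and number of iterations are assumptions, not the data of Example 1. It runs EM for a three-component normal mixture with the component standard deviation fixed at 1, from two starting points that typically converge to different local modes.

```python
import numpy as np
rng = np.random.default_rng(0)

# Synthetic data: two clusters whose spreads exceed the fixed component sd of 1
x = np.concatenate([rng.normal(0.0, 1.4, 150), rng.normal(6.0, 1.4, 150)])

def em_fixed_sd(x, mu, rho, sd=1.0, iters=500):
    """EM for a K-component normal mixture with fixed, equal component sd."""
    for _ in range(iters):
        dens = rho * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)            # E-step: responsibilities
        rho = resp.mean(axis=0)                                   # M-step: weights
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)   # M-step: means
    return mu, rho, np.log(dens.sum(axis=1)).sum()   # log-likelihood near convergence

# Two different starting points typically end at different local maxima
for start in ([-1.0, 0.5, 6.0], [0.0, 5.0, 7.0]):
    mu, rho, ll = em_fixed_sd(x, np.array(start, float), np.full(3, 1 / 3))
    print(np.round(mu, 2), np.round(rho, 2), round(ll, 2))
```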
FIG. 1 Fitting a three-component mixture model. (A) Model 1: there are at least two local modes in the likelihood. (B) Model 2: the likelihood is maximized on the boundary, resulting in an attempt to fit a two-component model.
$$(\pi_{00}, \pi_{10}, \pi_{01}, \pi_{11}) = \bigl((1-\rho_{1+})(1-\rho_{+1}),\; \rho_{1+}(1-\rho_{+1}),\; (1-\rho_{1+})\rho_{+1},\; \rho_{1+}\rho_{+1}\bigr),$$
where $\rho_{1+}, \rho_{+1} \in [0, 1]$ are the marginal probabilities. It is easily shown that
this space is an extended exponential family; the term extended here means
taking the closure of a regular exponential family, Barndorff-Nielsen
(1978). We mention in passing that the independence space is also a
so-called ruled surface meaning it can be decomposed into the union of line
segments in the simplex. The independence space is plotted in Fig. 2A where
it can be seen that the surface connects to all four vertices and four of the six
edges of the simplex.
It is also easy to show that the convex hull of the independence space is the whole three-simplex, and so it can be decomposed into a union of sets (in fact manifolds) of different dimensions. There are four zero-dimensional components, six one-dimensional, four two-dimensional, and one full three-dimensional manifold, which is the whole relative interior of the simplex.
FIG. 2 The case D = 3 with P being the three-simplex: panel (A) shows the two-dimensional independence space, (B) shows a one-dimensional exponential family, and (C) shows this one-dimensional family's convex hull.
These examples illustrate general points. First Example 5 shows that there
exists a trade-off between geometric structures. Manifold structures allow
gradient methods to be used in (local) optimization, while convexity allows
global optimization of a concave log-likelihood. This is related to the discussion in Example 1. In that example, by fixing $K = 3$ we kept a manifold structure, but at the cost of losing convexity, which resulted in multiple local modes.
The second point is to note that, since the convex hull in Example 5 does not contain the whole simplex, there are two distinct cases to consider: first, the case where the empirical distribution of the data lies inside the convex hull, and second, the case where it lies outside. When it lies inside, the model is clearly flexible enough to fit the data exactly and may actually overfit. The second case is that
the mixture model is not flexible enough to exactly fit the data, so that estimates will lie on the boundary of the convex hull. We note the strong parallels between this and the behavior of the fitting of Models 1 and 2 in Example 2.

Set                                                                                                  Dimension
$C_{01} = \{\pi_{00} = 1\}$                                                                          0
$C_{02} = \{\pi_{10} = 1\}$                                                                          0
$C_{11} = \{\pi(\theta) \mid \theta \in \Theta\}$                                                    1
$C_{3} = \{\rho_1 \pi(\theta_1) + (1-\rho_1)\pi(\theta_2) \mid \rho_1 \in (0,1),\ \theta_1 \ne \theta_2\}$   3
The third point is that Examples 4 and 5 give a clear visual representation
of the singularity and boundary properties of mixture models which are a
theme of this article. It is shown explicitly that the convex hulls are not
manifolds but are unions of manifolds of differing dimensions. We also note
the fact that the support of the distributions is not fixed which is one reason
that the Fisher information cannot be assumed to always be nonsingular.
3 Likelihood geometry
In the examples of Section 2.1, working with finite, discrete distributions made the application of convex geometry straightforward. We now turn to a different, but related, approach to using the tools of finite-dimensional convex geometry, one that is applicable to a much wider set of mixture models. The method is due to Lindsay (1995). The embedding affine space is always finite dimensional, but its dimension is determined by the number of distinct observations in the data. This is sufficient because the focus of this approach is the likelihood function, which depends only on the observed data. We therefore call this a likelihood geometry approach.
Throughout this section we work with the general definition of a mixture model on exponential families, i.e., $f(x; Q) := \int_{\theta \in \Theta} f(x; \theta)\, dQ(\theta)$. One of the most important properties of Lindsay's likelihood geometry is that it allows a characterization of, and a way to compute, the nonparametric maximum likelihood estimate (NPMLE) of a mixture model, $\widehat{Q}$. Suppose we have an i.i.d.
has a unique maximum over the space of all distribution functions $Q$. Furthermore, the maximizer $\widehat{Q}$ is a discrete distribution with no more than $D$ distinct points of support, where $D$ is the number of distinct points in $(x_1, \ldots, x_n)$.
(b) If $Q_0$ is a candidate for the mixing distribution of the NPMLE, then this can be checked by a gradient characterization defined by
$$D_{Q_0}(\theta) := \sum_{d=1}^{D} n(d)\, \frac{L_d(\theta) - L_d(Q_0)}{L_d(Q_0)}$$
where $L_d(\theta)$ is the likelihood at $\theta$ for the unmixed model, and we have the characterization
$$Q_0 = \widehat{Q} \;\Longleftrightarrow\; D_{Q_0}(\theta) \le 0 \quad \text{for all } \theta \in \Theta. \qquad (8)$$
There are some important points to note about Lindsay's result. First, the fact that the NPMLE is achieved at a discrete mixture is analogous to the fact that the discrete CDF based on a set of i.i.d. data can be seen as the "best" estimate of an unknown distribution. Second, the gradient characterization of a maximum likelihood estimate in standard analysis is replaced by a functional version through $D_{Q_0}(\theta)$, which characterizes the NPMLE through Inequality (8). In particular it is a derivative along the line segment defined by (2) in Section 1.2. The theorem requires that the likelihood is decreasing (more precisely, nonincreasing) along all relevant line segments starting at $f(x; \widehat{Q})$.
To illustrate and motivate Theorem 1 we look at a simple example similar
to one in Lesperance and Kalbfleisch (1992).
Example 6. In this example, for direct visualization, the number of observed data points is two: $x_1 = 1$, $x_2 = 4$. We assume a mixture model over components $N(\mu, 1)$ with unknown mixing distribution $Q(\mu)$.
The likelihood components are defined by $(L_1(\mu), L_2(\mu)) := (\phi(1; \mu, 1), \phi(4; \mu, 1))$, where $\phi(x; \mu, 1)$ is the density function of the $N(\mu, 1)$ distribution. The image of the curve $\mu \mapsto (L_1(\mu), L_2(\mu))$ is plotted in Fig. 3A and C as the black curve, with the likelihood contours plotted with (red) dashed lines. To find the NPMLE we want to maximize the likelihood over the convex hull of this curve.
From Theorem 1 the NPMLE can be characterized by the gradient function
$$D_{Q_0}(\mu) = \sum_{i=1}^{2} \frac{L_i(\mu) - L_i(Q_0)}{L_i(Q_0)}$$
for all $\mu$ in the parameter space. This is illustrated in Fig. 3B and D. In panel (A) a two-component mixture is denoted by the cross on a line segment joining two points, defined by
$$Q_0 = 0.3\, N(2.2, 1) + 0.7\, N(3.2, 1).$$
The corresponding gradient function is shown in panel (B). Since the gradient
function is positive at some points, this does not satisfy the conditions of the
theorem; hence this candidate mixture is not the NPMLE. On the other hand
in panel (C) we plot the mixture defined by Q0 ¼ 0.5N(1, 1) + 0.5N(4, 1). Its
gradient function is shown in panel (D) and it does satisfy the conditions since
it is nowhere strictly positive and zero only at the support points which are a
subset of the observed data. Hence this candidate point is the NPMLE.
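Lindsay's gradient criterion in Example 6 is simple to check numerically. The sketch below (illustrative only; the grid range is an arbitrary choice) evaluates $D_{Q_0}(\mu)$ for the two candidate mixing distributions discussed above.

```python
import numpy as np
from scipy.stats import norm

x_obs = np.array([1.0, 4.0])                     # the two observed data points

def L(mu):
    """Component likelihoods L_i(mu) = phi(x_i; mu, 1)."""
    return norm.pdf(x_obs, loc=mu, scale=1.0)

def gradient_function(mu_grid, support, weights):
    """Lindsay's gradient function D_{Q0}(mu) for a discrete mixing distribution Q0."""
    L_Q0 = sum(w * L(m) for w, m in zip(weights, support))   # mixture likelihoods L_i(Q0)
    return np.array([np.sum((L(mu) - L_Q0) / L_Q0) for mu in mu_grid])

mu_grid = np.linspace(-10, 10, 2001)
# Candidate from panels (A)-(B): gradient positive somewhere, so not the NPMLE
print(gradient_function(mu_grid, [2.2, 3.2], [0.3, 0.7]).max())   # > 0
# Candidate from panels (C)-(D): D_{Q0}(mu) <= 0 everywhere (zero at the support points)
print(gradient_function(mu_grid, [1.0, 4.0], [0.5, 0.5]).max())   # ~0, never strictly positive
```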
FIG. 3 Applying Lindsay’s theorem in a simple model: panels (A) and (C) show the likelihood
space with exponential family embedded (black curves), candidate mixed model (cross), and the
level sets of the log-likelihoods (red contours). Panels (B) and (D) show the gradient function for
each candidate.
The fact that the NPMLE results in clusters of support points is a feature
of inference on mixture models. For example, consider a model such as
$f(x; K, \rho, \theta) := \sum_{i=1}^{K} \rho_i\, f(x; \theta_i)$ where $\theta_1 < \theta_2 < \cdots < \theta_K$ but where $|\theta_1 - \theta_K| = \epsilon$ and $\epsilon$ is small relative to the Fisher information. This model might be formally identified but is very hard to distinguish from the two-component model $\rho f(x; \theta_1) + (1-\rho) f(x; \theta_K)$, or indeed sometimes from an unmixed model.
FIG. 4 The data for Example 7 plotted as a histogram, together with the final gradient function. The vertical dashed lines are the support points.
One approach to this is to consider the local mixture model of Marriott (2002). In this approach, when the mixing distribution $Q_\epsilon(\theta)$ has small variance, a Laplace expansion gives a good approximation using
$$\int f(x; \theta)\, Q_\epsilon(\theta)\, d\theta \approx f(x; \theta) + \lambda_1 \frac{\partial}{\partial \theta} f(x; \theta) + \lambda_2 \frac{\partial^2}{\partial \theta^2} f(x; \theta),$$
and, through a geometric analysis of this structure, the small number of parameters $(\theta, \lambda_1, \lambda_2)$ are identified and are computationally easy to estimate. These models have found application in measurement error models (Marriott, 2003), survival analysis (Maroufy and Marriott, 2019), and robustness (Maroufy and Marriott, 2020), and the idea is generalized in Maroufy and Marriott (2017).
There are two approaches to this in the literature: (a) work intrinsically on
finite-dimensional parametric models which lie in an infinite-dimensional
space and (b) work extrinsically by defining the geometric structures by
embedding in general infinite-dimensional embedding affine spaces. The first
of these approaches was taken in Amari (1985), see Amari (2016) and shown
in Definition 5 and the second approach is described in Definition 6.
We start with the dual affine geometry defined intrinsically on a para-
metric family f(x; θ) satisfying the regularity conditions in Amari (1985).
Definition 5. Consider a parametric family of distributions, $f(x;\theta)$, which satisfies the regularity conditions discussed in Amari and Nagaoka (2000, p. 26), which give the family the structure of a smooth manifold. On each tangent space we have the Fisher metric which is, in terms of coordinates,
$$g_{ij}(\theta) := E_{f(x;\theta)}\left[ \frac{\partial \log f(x;\theta)}{\partial \theta_i}\, \frac{\partial \log f(x;\theta)}{\partial \theta_j} \right],$$
giving rise to an inner product on the tangent space. While other metrics are possible (which would generate alternative forms of the Cramér–Rao theorem; Kumar and Mishra, 2020), we focus here on the standard case. Notions of straight lines, as discussed in Section 1.2, are also constructed intrinsically using a family of affine connections. For this article we are interested in the special case of the $\alpha = -1$, or mixture, connection. The key result is Amari and Nagaoka (2000, Theorem 2.4), which states that a mixture family such as $f(x; \rho) := \rho f(x) + (1-\rho) g(x)$ is flat in the mixture connection.
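As an illustration of Definition 5 (not part of the original text), the following sketch estimates the Fisher metric of a two-component normal mixture by Monte Carlo, using the analytic score in the parameters $(\rho, \mu_1, \mu_2)$; the particular parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import norm
rng = np.random.default_rng(1)

rho, mu1, mu2 = 0.3, 0.0, 3.0            # a two-component normal mixture (unit variances)

def score(x):
    """Score vector d/d(rho, mu1, mu2) of log f(x; rho, mu1, mu2)."""
    p1, p2 = norm.pdf(x, mu1, 1.0), norm.pdf(x, mu2, 1.0)
    f = rho * p1 + (1 - rho) * p2
    return np.stack([(p1 - p2) / f,
                     rho * p1 * (x - mu1) / f,
                     (1 - rho) * p2 * (x - mu2) / f], axis=-1)

# Monte Carlo estimate of the Fisher metric g_ij = E[score_i * score_j]
n = 200_000
z = rng.random(n) < rho
x = np.where(z, rng.normal(mu1, 1.0, n), rng.normal(mu2, 1.0, n))
s = score(x)
print(np.round(s.T @ s / n, 3))
```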
These affine structures generate the same convex structures that we have
used in Sections 2 and 3 and agree with Definition 5 on manifolds. Given that
we have very general definitions of the mixture structure we look at an
example to illustrate some of the issues in general situations.
FIG. 5 Profile log-likelihood contours (black contours) showing multimodal behavior in a mixture of two normals. The shaded (red) region on the left denotes points where $\hat{\rho}(\mu_1, \mu_2) = 0$ or $1$, so the MLE lies strictly outside the manifold.
calculation of the posterior. Note that when, in such a Markov Chain Monte
Carlo analysis, whenever the value of K changes, the algorithm needs to
map between manifolds of different dimensions.
More fundamentally in this area the continuous mixture, Definition 3, is a
fundamental expression in Bayesian analysis. For example, it must be used if
there are hierarchical or latent structures or for constructing a predictive
distribution from a posterior.
The fundamental singular nature of mixture models is reflected in Bayes-
ian analysis. We illustrate this with an example of using Markov Chain Monte
Carlo (MCMC) in this context.
Example 9. Let us return to Model 2 in Example 1 to illustrate how singula-
rities have an effect on a Bayesian analysis. Here we are treating K as fixed
and known. By eye the data appears to have two normal clusters and we are
fitting a three-component model. Further each of the model components does
a good job of fitting an individual data cluster. There is therefore redundancy
in the model and there are many ways that the model can give very good fits.
For example, one of the components can be essentially removed by setting
one of the $\rho_i$ parameters to be very small. In this case there is great uncertainty about the value of the corresponding $\mu_i$, as it basically plays no role in the model. Alternatively, two components can converge, i.e., $\mu_i \approx \mu_j$, resulting in no information about the difference $\rho_i - \rho_j$. In Fig. 6 we show the result of estimating the marginal posterior distribution for $(\rho_1, \rho_2)$ using an MCMC analysis, where uniform and independent priors have been used for simplicity. We see that the uncertainty about the mixing parameters, which comes from the singular geometry, is reflected in the posterior not being concentrated around a point in parameter space. Further, note that the posterior here would be very poorly estimated by a normal-based approximation; this point will be important in the discussion below.
FIG. 6 Posterior distribution of $\rho_1$, $\rho_2$ for Model 2.
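A sketch of this kind of MCMC analysis is given below. It is an illustration only: the synthetic data, the flat priors, and the random-walk proposal scales are assumptions, not the chapter's actual setup for Example 9.

```python
import numpy as np
rng = np.random.default_rng(2)

# Synthetic two-cluster data standing in for the data of Example 1
x = np.concatenate([rng.normal(0.0, 1.4, 150), rng.normal(6.0, 1.4, 150)])
SD = 1.5                                   # fixed component sd, as in Model 2

def log_post(theta):
    """Unnormalized log posterior: flat priors, 3-component normal mixture likelihood."""
    r1, r2, m1, m2, m3 = theta
    r3 = 1.0 - r1 - r2
    if min(r1, r2, r3) <= 0:                # outside the simplex: prior mass zero
        return -np.inf
    dens = (r1 * np.exp(-0.5 * ((x - m1) / SD) ** 2) +
            r2 * np.exp(-0.5 * ((x - m2) / SD) ** 2) +
            r3 * np.exp(-0.5 * ((x - m3) / SD) ** 2)) / (SD * np.sqrt(2 * np.pi))
    return np.log(dens).sum()

# Random-walk Metropolis over (rho1, rho2, mu1, mu2, mu3)
theta, lp = np.array([1 / 3, 1 / 3, 0.0, 3.0, 6.0]), None
lp = log_post(theta)
samples = []
for _ in range(20_000):
    prop = theta + rng.normal(0, [0.03, 0.03, 0.2, 0.2, 0.2])
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta[:2].copy())
samples = np.array(samples)                 # posterior draws of (rho1, rho2)
print(samples[5000:].mean(axis=0), samples[5000:].std(axis=0))
```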
This singular behavior is not restricted to mixtures: it also appears in models from ML including neural networks, Boltzmann machines, and radial basis methods. These models all share the property that there are latent, hierarchical structures which generate singular behavior.
Watanabe’s approach—a recent general review can be found in Watanabe
(2021a)—focuses on understanding the predictive power of a Bayesian model
and can be seen as a development of the approach of Akaike (1974) to the sin-
gular model case. Assume that we observe data ðx1 , …, xn Þ from a distribution
defined by q(x). We do not know q(x) but have a putative parametric model
f(x; θ) and a prior distribution on θ, say ϑ(θ). Following the approach of
Akaike (1974) we do not assume that q(x) lies in the model family. We com-
pute the Bayesian predictive distribution pðxjx1 , …, xn Þ ¼ Eθ ðpðx; θÞÞ where
the expectation is over the posterior distribution for θ given the data. Follow-
ing Akaike we measure the quality of the predictive distribution by its
“distance” to q(x) and we do this with the Kullback–Leibler (KL) divergence
$$K\bigl(q(x)\,\|\,p(x)\bigr) = \int q(x) \log \frac{q(x)}{p(x)}\, dx.$$
This is the foundation on which the Akaike information criteria (AIC) is based
and then used for model selection when models are nonsingular. This is a
minimization problem and in the nonsingular case we expect that the objec-
tive function is locally quadratic around a single point. Let us investigate
the KL divergence in the singular case where the minimum value is attained
on a set of points.
Example 10. Consider again the simple line segment model, Eq. (2), in a normal mixture context, $f(x; \rho, \mu) = (1-\rho)\,\phi(x; 0, 1) + \rho\,\phi(x; \mu, 1)$, and consider the behavior of
$$\mathrm{KL}\bigl(f(x; \rho_{\mathrm{true}}, \mu_{\mathrm{true}})\,\|\, f(x; \rho, \mu)\bigr)$$
for $\rho \in [0, 1]$ and $\mu \in [-3, 3]$. In Fig. 7 we plot the level sets of this function in two cases. The first is the nonsingular case with $\rho_{\mathrm{true}} = 0.5$, $\mu_{\mathrm{true}} = 1$, and we see a local minimum of zero at the true value and that locally the function is approximately quadratic. Its Hessian would be the (nonsingular) Fisher information. In contrast, if $\rho_{\mathrm{true}} = 0$ or $\mu_{\mathrm{true}} = 0$ we are in the singular case. An example is shown in panel (B). Here there is a set of values at the minimum of zero, and there will be no locally quadratic approximation to this function around the set. This is exactly the singular case considered by Watanabe (2009). The function will not have a quadratic approximation but rather one of the form
$$\mathrm{KL}\bigl(f(x; \rho_{\mathrm{true}}, \mu_{\mathrm{true}})\,\|\, f(x; \rho, \mu)\bigr) \approx C \mu^2 \rho^2. \qquad (9)$$
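The behavior in (9) is easy to reproduce numerically. The sketch below (an illustration, with an arbitrary grid and a simple Riemann-sum quadrature) evaluates the KL divergence of Example 10 and shows that it vanishes along the whole singular set $\rho\mu = 0$ when $\mu_{\mathrm{true}} = 0$.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]

def mix(rho, mu):
    """f(x; rho, mu) = (1 - rho) phi(x; 0, 1) + rho phi(x; mu, 1), evaluated on the grid."""
    return (1 - rho) * norm.pdf(x) + rho * norm.pdf(x, mu, 1.0)

def kl(rho_t, mu_t, rho, mu):
    """KL(f(.; rho_t, mu_t) || f(.; rho, mu)) by a simple Riemann sum."""
    p, q = mix(rho_t, mu_t), mix(rho, mu)
    return float(np.sum(p * np.log(p / q)) * dx)

# Singular truth (mu_true = 0): every point of the curve rho * mu = 0 attains KL = 0
print(kl(0.5, 0.0, 0.0, 2.0))   # ~0, since rho = 0 reproduces the truth
print(kl(0.5, 0.0, 0.7, 0.0))   # ~0, since mu = 0 reproduces the truth
print(kl(0.5, 0.0, 0.3, 1.0))   # > 0 away from the singular set
```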
FIG. 7 Level sets of the Kullback–Leibler divergence: (A) nonsingular case with ρtrue ¼ 0.5,
μtrue ¼ 1, (B) singular case with ρtrue ¼ 0.5, μtrue ¼ 0.
point. In the singular case these fail. The tools of algebraic geometry however
allow us to find local approximations to functions which have singularities.
A key result is Hironaka’s theorem (Watanabe, 2009, Theorem 2.3), which
allows us to resolve the singularities. In the general theory we can think of
singularities in sets of solutions of the form $\{x \mid f(x) = 0\}$ where the set is not a manifold. An example is the point of self-intersection of the curve in Fig. 8A, or the set of minimal values in Fig. 7B, which is $\rho\mu = 0$ and which also has a self-intersection at $\mu = \rho = 0$.
In very general terms if we have a function f ðx1 , …, xd Þ which has a singu-
lar point at 0, then there exists a transformation to smooth parameters,
ðu1 , …, ud Þ, and a “nice” mapping, g(u) ¼ x, such that locally to the singular-
ity we can write
We now see the way that the singularity of the KL-divergence shown in
Fig. 7B has the approximation (9). This is the resolution of the singularity
now with k1 ¼ k2 ¼ 2.
$$p(\theta \mid D) = \frac{1}{Z_n} \prod_{i=1}^{n} p(X_i; \theta)\, \vartheta(\theta)$$
$$-\frac{1}{n} \sum_{i=1}^{n} \log E_\theta\bigl(p(x_i; \theta)\bigr) + \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}_\theta\bigl(\log p(x_i; \theta)\bigr) \qquad (12)$$
where the moments are taken over the posterior and hence can be computed using MCMC.
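The quantity in display (12) can be computed directly from MCMC output. A minimal sketch, assuming a matrix of pointwise log-likelihoods evaluated at posterior draws (the helper name and the toy model at the end are illustrative, not from the text):

```python
import numpy as np

def waic(log_lik):
    """Compute the quantity in (12) from log_lik[s, i] = log p(x_i; theta_s),
    where theta_s are posterior draws (e.g. from MCMC)."""
    S = log_lik.shape[0]
    # log E_theta[p(x_i; theta)], computed stably with log-sum-exp over draws
    lppd_i = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)
    p_waic_i = log_lik.var(axis=0, ddof=1)           # Var_theta[log p(x_i; theta)]
    return -lppd_i.mean() + p_waic_i.mean()

# Toy check: approximate posterior draws for a N(theta, 1) model and simulated data x
rng = np.random.default_rng(3)
x = rng.normal(0.3, 1.0, 50)
theta_draws = rng.normal(x.mean(), 1.0 / np.sqrt(len(x)), 4000)
log_lik = -0.5 * (x[None, :] - theta_draws[:, None]) ** 2 - 0.5 * np.log(2 * np.pi)
print(waic(log_lik))
```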
The theory of the AIC (Akaike, 1974) or DIC (Spiegelhalter et al., 2002) assumes a nonsingular model, and under this assumption the AIC, DIC, and WAIC are all equal to order $o(n^{-1})$; see Watanabe (2021a, Section 3.1). However, in the singular model case this equivalence will fail. For example, in Watanabe (2021b) it is explained that the WAIC and the generalization error are equal in expectation to order $O(n^{-2})$. In that paper, in a simulation study based on a three-component mixture of normals (a singular model), the WAIC is able to give very good approximations to the generalization error, whereas the AIC is not. Hence, from this study, the WAIC can be used in model selection for prediction problems in mixture models.
The key differences that the singular case has with the nonsingular case
have been summarized by Watanabe as the following generalizations: positive
One natural approach is to consider score-based tests. However, we immediately see that this can result in complexity even in very simple cases. The following example is discussed in Li et al. (2009).
Example 12. Consider the two-component mixture of exponential densities with mean $\mu$. For example, take the case
$$(1-\rho)\, f(x; 1) + \rho\, f(x; \mu),$$
where $f(x; \mu) = \frac{1}{\mu} \exp(-x/\mu)$. The Fisher information for the score for $\rho$ at $\rho = 0$ is
$$\begin{cases} \dfrac{n(1-\mu)^2}{\mu(2-\mu)} & 0 < \mu < 2, \\[4pt] \infty & \mu \ge 2. \end{cases}$$
Hence we immediately see that traditional score tests will fail in certain
directions since they will not have finite variances and the CLT will fail.
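The closed-form information above, and its divergence as $\mu \to 2$, can be checked numerically; the following sketch (illustrative only) computes the per-observation quantity $E_{f(\cdot;1)}\bigl[(f(x;\mu)/f(x;1) - 1)^2\bigr]$ by quadrature and compares it with $(1-\mu)^2/(\mu(2-\mu))$.

```python
import numpy as np
from scipy.integrate import quad

def per_obs_info(mu):
    """Per-observation Fisher information for the score of rho at rho = 0,
    in the mixture (1 - rho) Exp(mean 1) + rho Exp(mean mu)."""
    f1 = lambda x: np.exp(-x)                       # Exp(mean 1) density
    fmu = lambda x: np.exp(-x / mu) / mu            # Exp(mean mu) density
    # E_{f1}[(f_mu / f1 - 1)^2] = int f_mu^2 / f1 dx - 1
    val, _ = quad(lambda x: fmu(x) ** 2 / f1(x), 0, np.inf)
    return val - 1.0

for mu in [0.5, 1.5, 1.9, 1.99]:
    closed_form = (1 - mu) ** 2 / (mu * (2 - mu))
    print(mu, round(per_obs_info(mu), 4), round(closed_form, 4))
# For mu >= 2 the integral diverges and the information is infinite
```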
Other natural routes to follow are likelihood ratio test (LRT)-based meth-
ods. The following results from Lindsay (1995, Chapter 4) are helpful in
seeing that the standard χ 2-asymptotic behavior of the log-likelihood, familiar
in exponential families, does not hold for mixture models.
Example 13. From Lindsay (1995, p. 95), consider the $K = 2$ case of mixtures with $\mathrm{Bin}(2, \pi)$ components. The asymptotic distribution of the likelihood ratio statistic is a mixture of $\chi^2$ distributions. In particular, the test of one versus two components has a $0.5\chi^2_0 + 0.5\chi^2_1$ limiting distribution. However, for only slightly more complex models the limiting distributions are more difficult. For example, even for two-component mixtures of $\mathrm{Bin}(3, \pi)$ distributions, the same likelihood-based test of one against two components now has a
Chen et al. (2001) give a clear statement of the situation, which we summarize here in a particular case. Consider a two-component mixture model for $\theta \in \Theta \subseteq \mathbb{R}$, $(1-\rho) f(x; \theta_1) + \rho f(x; \theta_2)$, where we are testing the hypothesis that the true distribution has only one component. That is, we are in the case where either $\theta_1 = \theta_2$ or $\rho \in \{0, 1\}$. Under the null hypothesis, when $\theta_0$ labels the true distribution, the limiting distribution of the LRT is that of the random variable
where $W(\theta)$ is a Gaussian process with normalized mean and variance and a given autocovariance function. If we compare this result with the standard cases, Chen et al. (2001) point out the following problems: (i) the limiting distribution is not pivotal, in that it depends on $\theta_0$, as already discussed in Example 13; (ii) the limiting distribution does not depend only on the dimension of $\theta$; for example, it is different for the normal and Poisson cases; and (iii) computing the maximum of a Gaussian process is a complex problem, and so the LRT loses much of its practical utility and appeal.
Li et al. (2009) explain that the general regularity conditions needed to obtain satisfactory asymptotic behavior, even in the one-dimensional case $\theta \in \Theta \subseteq \mathbb{R}$, typically involve requiring finite Fisher information and a compact parameter space $\Theta$. Ideally we want a test statistic that keeps the standard pivotal and limiting properties of the LRT, similar to those in the exponential family, while relaxing these stringent regularity conditions. First, consider penalty approaches to regularize the problem.
Definition 7. The penalized likelihood statistic is defined for a two-component model as
$$2\left\{ \sum_{i=1}^{n} \log\bigl(\rho f(x_i; \theta_1) + (1-\rho) f(x_i; \theta_2)\bigr) + \mathrm{pen}(\rho) \right\}$$
where the penalty term $\mathrm{pen}(\cdot)$ is designed to bound $\rho$ away from the boundaries at $\rho \in \{0, 1\}$. For example, $\mathrm{pen}(\rho) = C \log(4\rho(1-\rho))$ is used in Chen and Kalbfleisch (1996), or $\mathrm{pen}(\rho) = C \log(1 - |1 - 2\rho|)$ in Li et al. (2009). In both these cases the limiting distributions are of the form $0.5\chi^2_0 + 0.5\chi^2_1$ under regularity conditions.
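A minimal sketch of the penalized log-likelihood with the Chen–Kalbfleisch penalty (the data and the constant C below are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import norm

def penalized_loglik(x, rho, mu1, mu2, C=1.0):
    """Two-component penalized log-likelihood with pen(rho) = C*log(4*rho*(1-rho)),
    which pushes rho away from the boundary {0, 1}."""
    loglik = np.log(rho * norm.pdf(x, mu1, 1.0) + (1 - rho) * norm.pdf(x, mu2, 1.0)).sum()
    return loglik + C * np.log(4 * rho * (1 - rho))

x = np.random.default_rng(4).normal(0.0, 1.0, 100)      # homogeneous data
print(penalized_loglik(x, 0.5, -0.1, 0.1))
print(penalized_loglik(x, 0.001, -0.1, 0.1))             # heavily penalized near the boundary
```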
The EM algorithm (Dempster et al., 1977) is simple to code, and the expectation step is also easy to compute. Its most important geometric advantages are that its iterations cannot "jump" outside the fundamental boundaries of the parameter space, and that it will still work in cases, such as Example 12, where the Fisher information is singular.
There is an EM version of the penalized likelihood approach to testing the number of mixing components (Li et al., 2009, p. 415). In this paper the idea is to start from a finite number of initial points and make a fixed number of EM-based iteration steps starting from each of them. After this number of iterations is completed, the maximum of the penalized LRT statistics is computed. The EM algorithm's property of staying in the parameter space, together with the penalization, keeps the solution sufficiently regular to give the result that, asymptotically, the test statistic has a $0.5\chi^2_0 + 0.5\chi^2_1$ distribution, without the need for assumptions on compactness of the parameter space or a finite Fisher information; see Li et al. (2009, Theorem 2). These approaches remain of active research interest; see the recent paper of Chen et al. (2020) for a modern review of the testing problem.
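The following minimal Python sketch is in the spirit of this restricted-iteration, penalized-EM idea; it is an illustration under assumptions (the Chen–Kalbfleisch penalty, synthetic data, and arbitrary starting values), not the algorithm of Li et al. (2009). The weight update $(n_2 + C)/(n + 2C)$ used below is the maximizer of the responsibilities' binomial term plus the penalty $C\log(4\rho(1-\rho))$.

```python
import numpy as np
from scipy.stats import norm

def penalized_em(x, rho, mu1, mu2, C=1.0, n_iter=5):
    """A few penalized-EM steps for the model (1-rho) N(mu1, 1) + rho N(mu2, 1)."""
    n = len(x)
    for _ in range(n_iter):
        d1, d2 = (1 - rho) * norm.pdf(x, mu1, 1.0), rho * norm.pdf(x, mu2, 1.0)
        w2 = d2 / (d1 + d2)                          # E-step: responsibilities of component 2
        rho = (w2.sum() + C) / (n + 2 * C)           # penalized M-step for the weight
        mu1 = ((1 - w2) * x).sum() / (1 - w2).sum()  # M-step: component means
        mu2 = (w2 * x).sum() / w2.sum()
    pll = np.log((1 - rho) * norm.pdf(x, mu1, 1) + rho * norm.pdf(x, mu2, 1)).sum() \
          + C * np.log(4 * rho * (1 - rho))
    return rho, mu1, mu2, pll

x = np.random.default_rng(5).normal(0.0, 1.0, 200)
# A fixed, small number of EM steps from several initial values, keeping the best fit
fits = [penalized_em(x, 0.5, m1, m2) for m1, m2 in [(-1, 1), (-2, 2), (0, 3)]]
print(max(fits, key=lambda f: f[-1]))
```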
7 Discussion
In this article we have described, at a high level, the relationship between the
statistical properties of mixture models and geometries of various kinds. We
have looked at basic notions of convexity within an affine space structure,
tools from differential geometry such as metric tensors and connections,
and tools from algebraic geometry. We have shown that mixture models,
despite their apparent simplicity, have complex inferential properties which
are related to their singular structure.
For statisticians the sheer complexity of the literature of differential or
algebraic geometry can be a major hurdle. These topics have been studied
intensely for hundreds of years and knowing where to start reading is
undoubtedly a challenge. In this article we have aimed to give a flavor of
the tools that have been used, while deliberately skipping the more technical
details, and pointing to the literature that we have found helpful.
References
Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Automat. Contr.
19 (6), 716–723.
Amari, S.-I., 1985. Differential-Geometrical Methods in Statistics. Springer, p. 293.
Amari, S.-I., 2016. Information Geometry and Its Applications. vol. 194 Springer.
Amari, S.-I., 2020. Any target function exists in a neighborhood of any sufficiently wide random
network: a geometrical perspective. Neural Comput. 32 (8), 1431–1447.
Amari, S.-I., Nagaoka, H., 2000. Methods of Information Geometry. vol. 191 American Mathe-
matical Society.
Barndorff-Nielsen, O.E., 1978. Information and Exponential Families in Statistical Theory. John
Wiley & Sons, p. 238.
Chen, J., Kalbfleisch, J.D., 1996. Penalized minimum-distance estimates in finite mixture models.
Can. J. Stat. 24 (2), 167–175.
Chen, H., Chen, J., Kalbfleisch, J.D., 2001. A modified likelihood ratio test for homogeneity in
finite mixture models. J. R. Stat. Soc. B 63 (1), 19–29.
Chen, J., Li, P., Liu, G., 2020. Homogeneity testing under finite location-scale mixtures. Can. J.
Stat. 48 (4), 670–684.
Critchley, F., Marriott, P., 2014. Computational information geometry in statistics: theory and
practice. Entropy 16 (5), 2454–2471.
Crowder, M.J., Hand, D.J., 2017. Analysis of Repeated Measures. Routledge.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via
the EM algorithm. J. R. Stat. Soc. B 39 (1), 1–22.
Everitt, B.S., Hand, D.J., 1981. Finite Mixture Distributions. Chapman and Hall.
Fuller, W.A., 1987. Measurement Error Models. John Wiley, New York.
Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82 (4), 711–732.
Hampel, F.R., 1974. The influence curve and its role in robust estimation. J. Am. Stat. Assoc.
69 (346), 383–393.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 2011. Robust Statistics: The
Approach Based on Influence Functions. vol. 196 John Wiley & Sons.
Jewell, N.P., 1982. Mixtures of exponential distributions. Ann. Stat. 10 (2), 479–484.
Karlin, S., 1968. Total Positivity. vol. 1 Stanford University Press.
Karlin, S., Shapley, L.S., 1953. Geometry of Moment Spaces. vol. 12 American Mathematical
Society.
Kass, R.E., Vos, P.W., 2011. Geometrical Foundations of Asymptotic Inference. vol. 908 John
Wiley & Sons.
Kumar, M.A., Mishra, K.V., 2020. Cramer-Rao lower bounds arising from generalized Csiszár
divergences. Inf. Geom. 3 (1), 33–59.
Lesperance, M.L., Kalbfleisch, J.D., 1992. An algorithm for computing the nonparametric MLE
of a mixing distribution. J. Am. Stat. Assoc. 87 (417), 120–126.
Li, P., Chen, J., Marriott, P., 2009. Non-finite Fisher information and homogeneity: an EM
approach. Biometrika 96 (2), 411–426.
Lindsay, B.G., 1994. Efficiency versus robustness: the case for minimum Hellinger distance and
related methods. Ann. Stat. 22 (2), 1081–1114.
Lindsay, B.G., 1995. Mixture models: theory, geometry and applications. In: NSF-CBMS
Regional Conference Series in Probability and Statistics.
Maroufy, V., Marriott, P., 2017. Mixture models: building a parameter space. Stat. Comput.
27 (3), 591–597.
Maroufy, V., Marriott, P., 2019. Generalising frailty assumptions in survival analysis: a geometric
approach. In: Geometric Structures of Information, Springer, pp. 137–148.
Maroufy, V., Marriott, P., 2020. Local and global robustness with conjugate and sparsity priors.
Stat. Sinica 30, 579–599.
Marriott, P., 2002. On the local geometry of mixture models. Biometrika 89 (1), 77–93.
Marriott, P., 2003. On the geometry of measurement error models. Biometrika 90 (3), 567–576.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering.
vol. 38 M. Dekker New York.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. vol. 198 John Wiley & Sons.
McNicholas, P.D., 2016. Mixture Model-Based Classification. Chapman and Hall/CRC.
Murray, M.K., Rice, J.W., 1993. Differential Geometry and Statistics. Chapman & Hall, p. 272.
Neyman, J., Scott, E., 1965. On the use of c (alpha) optimal tests of composite hypotheses. Bull.
Int. Stat. Inst. 41 (1), 477–497.
Spiegelhalter, D.J., Best, N.G., Carlin, B.P., Van Der Linde, A., 2002. Bayesian measures of
model complexity and fit. J. R. Stat. Soc. B 64 (4), 583–639.
Titterington, D.M., Smith, A.F.M., Makov, U.E., 1985. Statistical Analysis of Finite Mixture
Distributions. John Wiley & Sons Incorporated.
Wang, Y., 2007. On fast computation of the non-parametric maximum likelihood estimate of a
mixing distribution. J. R. Stat. Soc. B 69 (2), 185–198.
Watanabe, S., 2009. Algebraic Geometry and Statistical Learning Theory. Cambridge University
Press, p. 25.
Watanabe, S., 2021a. Information criteria and cross validation for Bayesian inference in regular
and singular cases. Jpn. J. Stat. Data Sci. 4, 1–19.
Watanabe, S., 2021b. WAIC and WBIC for mixture models. Behaviormetrika 48 (1), 5–21.
Yakowitz, S.J., Spragins, J.D., 1968. On the identifiability of finite mixtures. Ann. Math. Stat.
39 (1), 209–214.
Chapter 10
Gaussian distributions on
Riemannian symmetric spaces
of nonpositive curvature
Salem Saidᵃ,*, Cyrus Mostajeranᵇ, and Simon Heuvelineᶜ
ᵃCNRS, Laboratoire LJK, Université Grenoble-Alpes, Grenoble, France
ᵇDepartment of Engineering, University of Cambridge, Cambridge, United Kingdom
ᶜCentre for Mathematical Sciences, University of Cambridge, Cambridge, United Kingdom
*Corresponding author: e-mail: salem.said@univ-grenoble-alpes.fr
Abstract
This article aims to give a coherent presentation of the theory of Gaussian distributions
on Riemannian symmetric spaces and also to report on recent original developments of
this theory. The initial goal is to define a family of probability distributions, on any suit-
able Riemannian manifold, for which maximum-likelihood estimation, based on a finite
sequence of observations, is equivalent to computation of the Riemannian barycenter of
these observations. As it turns out, this goal is achievable whenever the underlying Rie-
mannian manifold is a Riemannian symmetric space of nonpositive curvature. In this
case, the required Gaussian distributions are exactly the maximum-entropy distribu-
tions, for fixed barycenter and dispersion. The second step is the search for efficient
means of computing the normalizing factors associated with these distributions. This
leads to a fascinating connection with random matrix theory, and even with theoretical
physics (Chern–Simons theory), which yields a series of original results that provide
exact expressions, as well as high-dimensional asymptotic expansions of the normaliz-
ing factors. Another outcome of this connection with random matrix theory is the new
idea of duality, between Gaussian distributions on Riemannian symmetric spaces of
opposite curvatures. The present article also investigates Bayesian inference for Gauss-
ian distributions on symmetric spaces. This investigation motivates original results
regarding Markov-Chain Monte Carlo and convex optimization on Riemannian mani-
folds. It also reveals a new open problem (roughly, this concerns the equality of a pos-
teriori mode with a posteriori barycenter), which should be the focus of future
developments.
Keywords: Gaussian distribution, Symmetric space, Random matrix theory,
Markov-Chain Monte Carlo, Convex optimization, Bayesian inference
1 Introduction
The realization that an essentially new approach, beyond that of classical statistics, is needed in order to learn from data that live in non-Euclidean spaces can be credited to Fréchet, who invented what we today call the Fréchet mean, back in 1948 (Fréchet, 1948).
The Fréchet mean generalizes the concept of the mean (average or expectation) of a sequence of observations $(x_1, \ldots, x_N)$, from the classical case where these observations lie in a Euclidean space, to the general setting where they belong to a non-Euclidean space.
Let us call our sample space M (the observations belong to M). If M is a
Euclidean space, it has a vector space structure, and the mean of ðx1 , …, xN Þ
is just the arithmetic mean ðx1 + ⋯ + xN Þ=N. If M is a non-Euclidean space,
it will have no vector space structure, and this definition will lose all meaning.
To salvage the concept of mean, Fréchet suggested looking at the set of global minima of the sum of squared distances (the factor 1/2 is included for later convenience)
$$E(x) = \frac{1}{2} \sum_{n=1}^{N} d^2(x_n, x) \qquad \text{for } x \in M.$$
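On a concrete manifold the minimization can be carried out numerically by Riemannian gradient descent. A minimal sketch, assuming the manifold of symmetric positive-definite matrices with its affine-invariant metric (one of the symmetric spaces considered later); the fixed-point update below is gradient descent with unit step, and the random SPD samples are purely illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

def karcher_mean_spd(mats, iters=50):
    """Frechet (Riemannian) barycenter of SPD matrices under the affine-invariant
    metric, via the fixed-point iteration x <- Exp_x( mean_n Log_x(x_n) )."""
    x = np.mean(mats, axis=0)                         # initialize at the arithmetic mean
    for _ in range(iters):
        s = np.real(sqrtm(x))
        s_inv = np.linalg.inv(s)
        # Riemannian logarithms Log_x(x_n), averaged in the tangent space at x
        tangent = np.mean([np.real(logm(s_inv @ m @ s_inv)) for m in mats], axis=0)
        x = s @ np.real(expm(tangent)) @ s            # exponential map back to the manifold
    return x

rng = np.random.default_rng(6)
mats = []
for _ in range(20):
    a = rng.normal(size=(3, 3))
    mats.append(a @ a.T + 0.5 * np.eye(3))            # random SPD samples
print(np.round(karcher_mean_spd(mats), 3))
```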
In Said et al. (2017, 2018), the approach of Cheng and Vemuri (2013) was
generalized to Gaussian distributions on Riemannian symmetric spaces of
nonpositive curvature, which include hyperbolic spaces, as well as spaces
of real, complex, and quaternion positive-definite matrices, and spaces of
structured (Toeplitz or block-Toeplitz) positive-definite matrices. This opened
the way to rigorous learning algorithms for data that live in these spaces (this
is partially discussed in Said et al., 2018).
The introduction of Riemannian symmetric spaces reduced normalizing
factors of Gaussian distributions to multiple integrals, which could be com-
puted using Monte Carlo techniques (Zanini et al., 2016). Only very recently,
it was realized that the techniques of random matrix theory made it possible to
write down both analytic expressions and high-dimensional asymptotic expan-
sions of these multiple integrals. This was studied by the theoretical physics
community (Santilli and Tierz, 2021) (see also our paper, currently under
review Heuveline et al., 2021).
The aim of the present article is to give a coherent presentation of the the-
ory of Gaussian distributions on Riemannian symmetric spaces of nonpositive
curvature, and report on recent original developments of this theory, including
(but not limited to) the ones just mentioned. Its main body (Sections 2 and 3)
relies on a variety of new results, mostly contained in the habilitation thesis
(Said, 2021)—one advantage of this situation is that the flow of results is
not interrupted by their sometimes lengthy proofs, given in Said (2021).
In the following, Section 2 introduces Gaussian distributions and their con-
nection with random matrix theory. Section 3 investigates Bayesian inference
of these distributions. Each one of these sections opens with a description of
the original results which it contains.
Another, more modest, contribution of this article is Appendix B (not
based on Said, 2021). This appendix provides new results on the convergence
rates for Riemannian gradient descent, applied to strictly convex and strongly
convex functions, defined on a convex subset of a Riemannian manifold. The
main results are Propositions B.8 and B.9.
$$\text{minimize over } x \in M\colon \quad E_N(x) = \frac{1}{2} \sum_{n=1}^{N} d^2(x_n, x) \qquad (1)$$
This means that $\hat{x}_N$ is an empirical barycenter of the samples $(x_n)$. In order to construct probability distributions $P(\bar{x}, \sigma)$ which satisfy this definition, consider the density profile
$$f(x \mid \bar{x}, \sigma) = \exp\left( -\frac{d^2(x, \bar{x})}{2\sigma^2} \right) \qquad (2)$$
and the normalizing factor
$$Z(\bar{x}, \sigma) = \int_M f(x \mid \bar{x}, \sigma)\, \mathrm{vol}(dx) \qquad (3)$$
where $\omega_{d-1}$ denotes the area of the unit sphere in $\mathbb{R}^d$, and $\Phi$ denotes the standard normal distribution function.
$$P(dx \mid \bar{x}, \sigma) = (Z(\sigma))^{-1} \exp\left( -\frac{d^2(x, \bar{x})}{2\sigma^2} \right) \mathrm{vol}(dx) \qquad (7)$$
and yields a well-defined probability distribution $P(\bar{x}, \sigma)$ on $M$. This will be the main focus throughout the following.
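A minimal numerical sketch of (2), (3), and (7), assuming $M$ is the hyperbolic plane in the Poincaré disk model; the one-dimensional integral for $Z(\sigma)$ uses geodesic polar coordinates, in which the volume element is $\sinh(r)\, dr\, d\theta$. The specific points and value of $\sigma$ are illustrative.

```python
import numpy as np
from scipy.integrate import quad

def dist_disk(z, w):
    """Hyperbolic distance between points of the Poincare disk (as complex numbers)."""
    num = 2 * abs(z - w) ** 2
    den = (1 - abs(z) ** 2) * (1 - abs(w) ** 2)
    return np.arccosh(1 + num / den)

def Z_hyperbolic(sigma):
    """Normalizing factor (3) for M = hyperbolic plane, via geodesic polar coordinates:
    Z(sigma) = 2*pi * int_0^inf exp(-r^2 / (2 sigma^2)) sinh(r) dr."""
    val, _ = quad(lambda r: np.exp(-r ** 2 / (2 * sigma ** 2)) * np.sinh(r), 0, np.inf)
    return 2 * np.pi * val

def gaussian_density(x, xbar, sigma):
    """Density of P(xbar, sigma) with respect to the Riemannian volume, Eq. (7)."""
    return np.exp(-dist_disk(x, xbar) ** 2 / (2 * sigma ** 2)) / Z_hyperbolic(sigma)

print(gaussian_density(0.3 + 0.2j, 0.0 + 0.0j, sigma=0.5))
```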
Remark. Here, readers may wish to recall the concept of a Hadamard mani-
fold, or of a homogeneous space, from Petersen (2006) (or any other good
Riemannian geometry textbook). The point of appealing to these concepts is
the following. The assumption that M is a Hadamard manifold implies that
geodesic spherical coordinates, which cover all of M, can be introduced at
any point x M. Proposition 1 is obtained by writing the integral (3) in terms
of these spherical coordinates, and then applying Riemannian volume compar-
ison theorems that state, very roughly speaking, that manifolds with more pos-
itive curvature have less volume. On the other hand, to say that M is a
homogeneous space means that all points x M are equivalent, so changing
the point of origin x does not change the integral (3). This is the key to
Proposition 2.
k 2
ρðx, kÞx ¼ exp α ρ e2 α x, k
α k
4
h i Z Y
N
ωβ ðNÞ
ZðσÞ ¼ exp N Nβ
2
ðσ 2
=2Þ jVðuÞjβ ρðui ,2σ 2 Þdui (10)
2NNβ N! N+ i¼1
Remark. Curious readers will want to compute $\omega_\beta(N)$. For example, $\omega_2(N)$ can be found using the Weyl integral formula on $U(N)$ (Knapp, 2002). This yields $\omega_2(N) = \mathrm{vol}(U(N))/(2\pi)^N$. The volume of the unitary group can be found by looking at the normalizing factor of a Gaussian unitary ensemble (Mehta, 2004). Specifically, $\mathrm{vol}(U(N)) = (2\pi)^{(N^2+N)/2}/G(N)$, in terms of the Barnes $G$-function.
Example 3. For this last example, let $M = \mathcal{D}_N$ be the Siegel domain (Siegel, 1943). This is the set of $N \times N$ symmetric complex matrices $z$ such that $I_N - z^{\dagger}z$ is positive-definite. Here, $M = G/K$, where $G \simeq \mathrm{Sp}(N, \mathbb{R})$ (real symplectic group) and $K \simeq U(N)$ (unitary group). Precisely, $G$ is the group of $2N \times 2N$ complex matrices $g$ with $g^t\, \Omega\, g = \Omega$ and $g^{\dagger}\, \Gamma\, g = \Gamma$, where $t$ denotes the transpose, and where $\Omega$ and $\Gamma$ are the matrices
$$\Omega = \begin{pmatrix} 0 & I_N \\ -I_N & 0 \end{pmatrix}, \qquad \Gamma = \begin{pmatrix} I_N & 0 \\ 0 & -I_N \end{pmatrix}.$$
In addition, $K$ is the group of block-diagonal matrices $k = \mathrm{diag}(U, U^*)$ where $U \in U(N)$, and $*$ denotes the conjugate. The action of $G$ on $M$ is given by Möbius transformations,
$$g \cdot z = (Az + B)(Cz + D)^{-1}, \qquad g = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \qquad (11)$$
This action preserves the Siegel metric, which is defined by
$$\langle v, v \rangle_z = \bigl\| (I_N - zz^{\dagger})^{-1} v \bigr\|_B^2, \qquad \|v\|_B^2 = \frac{1}{2}\, \mathrm{tr}(vv^{\dagger}) \qquad (12)$$
where each tangent vector $v$ is identified with a symmetric complex matrix.
Now (Terras, 1988),
$$\mathfrak{a} = \left\{ \begin{pmatrix} a & 0 \\ 0 & -a \end{pmatrix} ;\; a = \mathrm{diag}(a_{11}, \ldots, a_{NN}) \right\} \qquad (13)$$
The positive roots are $\lambda(a) = a_{ii} - a_{jj}$ for $i < j$, and $\lambda(a) = a_{ii} + a_{jj}$ for $i \le j$, all with $m_\lambda = 1$. The order of the Weyl group is $|W| = 2^N N!$, and $\omega(S) = \mathrm{vol}(U(N))/2^N$. Replacing into (8), it follows that
$$Z(\sigma) = \frac{\mathrm{vol}(U(N))}{2^{2N}\, N!} \int \prod_{i=1}^{N} \exp\left( -\frac{a_{ii}^2}{2\sigma^2} \right) \prod_{i<j} \sinh|a_{ii} - a_{jj}| \prod_{i \le j} \sinh|a_{ii} + a_{jj}|\; da \qquad (14)$$
or, after introducing $u_i = \cosh(2a_{ii})$,
$$Z(\sigma) = 2^{-2N}\, \mathrm{vol}(U(N)) \int_{C_N} V(u) \prod_{i=1}^{N} w(u_i, 8\sigma^2)\, du_i \qquad (15)$$
This proposition is almost immediate. From (7), one has the log-likelihood function
$$\ell(\bar{x}, \sigma) = -N \log Z(\sigma) - \frac{1}{2\sigma^2} \sum_{n=1}^{N} d^2(x_n, \bar{x}) \qquad (16)$$
Since the first term does not depend on $\bar{x}$, one may maximize $\ell(\bar{x}, \sigma)$ first over $\bar{x}$ and then over $\sigma$. Clearly, maximizing over $\bar{x}$ is equivalent to minimizing the sum of squared distances $d^2(x_n, \bar{x})$. This is just the least-squares problem (1), whose solution is the empirical barycenter $\hat{x}_N$. Moreover, $\hat{x}_N$ is unique, since $M$ is a Hadamard manifold (Afsari, 2010; Sturm, 2003).
Consider now maximum-likelihood estimation of $\sigma$. This is better carried out in terms of the natural parameter $\eta = -(2\sigma^2)^{-1}$, or in terms of the moment parameter $\delta = \psi'(\eta)$, where $\psi(\eta) = \log Z(\sigma)$ and the prime denotes the derivative.
Proposition 4. The function $\psi(\eta)$, just defined, is a strictly convex function, which maps the half-line $(-\infty, 0)$ onto $\mathbb{R}$. The maximum-likelihood estimates of the parameters $\eta$ and $\delta$ are
$$\hat{\eta}_N = (\psi')^{-1}(\hat{\delta}_N) \qquad \text{and} \qquad \hat{\delta}_N = \frac{1}{N} \sum_{n=1}^{N} d^2(x_n, \hat{x}_N) \qquad (17)$$
Remark. $\hat{\eta}_N$ in (17) is well-defined, since the range of $\psi'$ is equal to $(0, \infty)$. Indeed, one has the following inequalities, analogous to (5),
$$\psi_0'(\eta) \le \psi'(\eta) \le \psi_c'(\eta) \qquad (18)$$
where $\psi_0(\eta) = \log Z_0(\sigma)$ and $\psi_c(\eta) = \log Z_c(\sigma)$. Now, $\psi_0'(\eta) = d\,\sigma^2$, which increases to $+\infty$ when $\sigma$ increases to $+\infty$. On the other hand, since $\eta = -(2\sigma^2)^{-1}$,
$$\psi_c'(\eta) = \sigma^3\, \frac{d}{d\sigma}\bigl( \log Z_c(\sigma) \bigr) \qquad (19)$$
which, from (6), is $= 0$ when $\sigma = 0$. Thus, it follows from (18) that $\psi'$ maps the half-line $(-\infty, 0)$ onto the half-line $(0, +\infty)$. □
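The estimate $\hat{\eta}_N$ in (17) can be computed numerically by solving $\psi'(\eta) = \hat{\delta}_N$. A minimal sketch, again assuming $M$ is the hyperbolic plane, so that $\psi'(\eta)$ is the expected squared distance under the radial density proportional to $e^{-r^2/2\sigma^2}\sinh r$; the bracketing interval for the root-finder is an arbitrary choice.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def expected_sq_dist(sigma):
    """psi'(eta) = E[d^2(x, xbar)] under P(xbar, sigma) on the hyperbolic plane,
    computed from the radial density proportional to exp(-r^2/(2 sigma^2)) sinh(r)."""
    w = lambda r: np.exp(-r ** 2 / (2 * sigma ** 2)) * np.sinh(r)
    num, _ = quad(lambda r: r ** 2 * w(r), 0, np.inf)
    den, _ = quad(w, 0, np.inf)
    return num / den

def mle_sigma(sq_dists):
    """Solve psi'(eta_hat) = delta_hat, Eq. (17), for sigma (assumed in [0.05, 20])."""
    delta_hat = np.mean(sq_dists)                  # (1/N) sum_n d^2(x_n, xbar_N)
    return brentq(lambda s: expected_sq_dist(s) - delta_hat, 0.05, 20.0)

# Toy check: squared distances whose average matches E[d^2] at sigma = 0.7
print(round(mle_sigma([expected_sq_dist(0.7)]), 3))      # recovers ~0.7
```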
Proposition 6. Let $P(\bar{x}, \sigma)$ be given by (7), for $\bar{x} \in M$ and $\sigma > 0$. The Riemannian barycenter of $P(\bar{x}, \sigma)$ is equal to $\bar{x}$.
The proof of this proposition relies on the fact that the so-called variance function
$$\mathcal{E}(x) = \frac{1}{2} \int_M d^2(x, y)\, P(dy \mid \bar{x}, \sigma) \qquad (20)$$
is strongly convex (see Said, 2021, Paragraph 2.2.3). Thus, if $\mathrm{grad}\, \mathcal{E}(x) = 0$, then $x$ is the global minimizer of $\mathcal{E}$, and therefore the barycenter of $P(\bar{x}, \sigma)$.
However, $\mathrm{grad}\, \mathcal{E}(\bar{x}) = 0$ follows from a direct application of the following "Fisher's identity,"
$$\int_M \bigl( \mathrm{grad}_{\bar{x}} \log p(x \mid \bar{x}, \sigma) \bigr)\, P(dx \mid \bar{x}, \sigma) = 0$$
where $\mathrm{grad}_{\bar{x}}$ denotes the gradient with respect to $\bar{x}$, defined according to the Riemannian metric of $M$, and $p(x \mid \bar{x}, \sigma)$ is the probability density function appearing in (7).
The covariance form of $P(\bar{x}, \sigma)$ is the symmetric bilinear form $C_{\bar{x}}$ on $T_{\bar{x}}M$,
$$C_{\bar{x}}(u, v) = \int_M \langle u, \mathrm{Exp}^{-1}_{\bar{x}}(x) \rangle \langle \mathrm{Exp}^{-1}_{\bar{x}}(x), v \rangle\, p(x \mid \bar{x}, \sigma)\, \mathrm{vol}(dx), \qquad u, v \in T_{\bar{x}}M \qquad (21)$$
where $\mathrm{Exp}$ denotes the Riemannian exponential map ($\mathrm{Exp}^{-1}$ is well-defined, since $M$ is a Hadamard manifold).
With $\sigma > 0$ fixed, the map which assigns to $\bar{x} \in M$ the covariance form $C_{\bar{x}}$ is a $(0,2)$-tensor field on $M$, here called the covariance tensor of $P(\bar{x}, \sigma)$. In order to compute this tensor field, consider the following situation.
Assume $M = G/K$ is a Riemannian symmetric space. Here, $K = K_o$, the stabilizer in $G$ of $o \in M$. For $k \in K$ and $u \in T_oM$, it is clear that $k \cdot u \in T_oM$. This defines a representation of $K$ in the tangent space $T_oM$, called the isotropy representation. One says that $M$ is an irreducible symmetric space if this isotropy representation is irreducible.
If $M$ is not irreducible, then it is a product of irreducible Riemannian symmetric spaces, $M = M_1 \times \cdots \times M_s$ (Helgason, 1962, proposition 5.5, chapter VIII; this is the de Rham decomposition of $M$). Accordingly, for $x \in M$ and $u \in T_xM$, one may write $x = (x_1, \ldots, x_s)$ and $u = (u_1, \ldots, u_s)$, where $x_r \in M_r$ and $u_r \in T_{x_r}M_r$. Now, looking back at (7), it may be seen that
$$p(x \mid \bar{x}, \sigma) = \prod_{r=1}^{s} p(x_r \mid \bar{x}_r, \sigma), \qquad p(x_r \mid \bar{x}_r, \sigma) = (Z_r(\sigma))^{-1} \exp\left( -\frac{d^2(x_r, \bar{x}_r)}{2\sigma^2} \right) \qquad (22)$$
For the following proposition, let $\eta = -(2\sigma^2)^{-1}$ and $\psi_r(\eta) = \log Z_r(\sigma)$.
where $p_{nn}$ is the leading coefficient in $p_n$. The required orthonormal polynomials $p_n$ are given by $p_n = (2\pi\sigma^2)^{1/4} s_n$, where $s_n$ are the Stieltjes–Wigert polynomials (Szegő, 1939, p. 33). Looking up the expression of these polynomials, it is easy to find
" #
2 Yn
1 ð2n + 1Þ
p2 1 emσ
2
nn ¼ ð2π σ Þ exp σ2
2 2
2 m¼1
Then, working out the product (27) and replacing into (10), one arrives at (26).
Moving on, it is possible to derive an asymptotic expression of Z(σ), valid in
the limit where N goes to infinity while the product t ¼ Nσ 2 remains constant.
Proposition 9. Let Z(σ) be given by (26). If N ! ∞, while t ¼ Nσ 2 remains
constant, then the following equivalence holds,
Zeta function.
The main idea behind (28) is that, taking the logarithm in (26), the product
on the right-hand side turns into a Riemann sum for the improper integral
Z 1
ð1 xÞ log ð1 etx Þdx ¼ ðLi3 ðet Þ ζð3ÞÞ=t2
0
where the equality follows by integrating term-by-term the power series of the
logarithm.
$$R_N^{(1)}(u) = \rho(u, 2\sigma^2) \sum_{n=0}^{N-1} p_n^2(u) \qquad (32)$$
in the notation of 2.6 ($p_n$ are orthonormal polynomials with respect to the weight $\rho(u, 2\sigma^2)$). According to Deift (1998, p. 133), $\tilde{\nu}_N$ given by (31) converges weakly to the so-called equilibrium distribution $\tilde{\nu}_t$, which minimizes the electrostatic energy functional
$$\mathcal{E}(\nu) = \frac{1}{2t} \int_0^{\infty} \log^2(u)\, \nu(du) - \int_0^{\infty}\!\!\int_0^{\infty} \log|u - v|\, \nu(du)\, \nu(dv) \qquad (33)$$
ᵃTo follow the original notation of Jacobi (Whittaker and Watson, 1950), this should be written $\vartheta(e^{i\phi} \mid q)$, where $q = e^{-\sigma^2}$. In other popular notations, this function is called $\vartheta_{00}$ or $\vartheta_3$.
$$f^*(x \mid \bar{x}, \sigma) = \det\left[ (2\pi\sigma^2)^{-\frac{1}{2}}\, \Theta\bigl( x\bar{x}^{\dagger} \mid \sigma^2 \bigr) \right] \qquad (35)$$
since the matrices $x\bar{x}^{\dagger}$ and $\bar{x}^{\dagger}x$ are similar. Then, let $Z_{M^*}(\sigma)$ denote the normalizing constant
$$Z_{M^*}(\sigma) = \int_{M^*} f^*(x \mid \bar{x}, \sigma)\, \mathrm{vol}(dx) \qquad (36)$$
which does not depend on $\bar{x}$, as can be seen by introducing the new variable of integration $z = x\bar{x}^{\dagger}$, and using the invariance of $\mathrm{vol}(dx)$.
Now, define a $\Theta$ distribution $\Theta(\bar{x}, \sigma)$ as the probability distribution on $M^*$ whose probability density function, with respect to $\mathrm{vol}(dx)$, is given by
$$p^*(x \mid \bar{x}, \sigma) = Z_{M^*}(\sigma)^{-1}\, f^*(x \mid \bar{x}, \sigma) \qquad (37)$$
Proposition 11. Let $Z_M(\sigma) = Z(\sigma)$ be given by (26), and $Z_{M^*}(\sigma)$ be given by (36). Then the following equality holds
$$\frac{Z_M(\sigma)}{Z_{M^*}(\sigma)} = \exp\left( \frac{N^3 - N}{6}\, \sigma^2 \right) \qquad (38)$$
Remark. The Gaussian density (7) on $M$ and the $\Theta$ distribution density (37) on $M^*$ are apparently unrelated. It is therefore interesting to note that their normalizing constants $Z_M(\sigma)$ and $Z_{M^*}(\sigma)$ scale together according to the simple relation (38). The connection between the two distributions is due to the duality between $M$ and $M^*$. □
ᵇProofs of the results stated in this section can be found in chapter 4 of Said (2021).
Here, $\pi(x)$ will remain partially unknown, as the missing normalizing factor cannot be determined. Then, two Bayesian estimators are studied. The maximum a posteriori $\hat{x}_{\mathrm{MAP}}$ is the mode of $\pi(x)$,
$$\hat{x}_{\mathrm{MAP}} = \operatorname*{argmax}_{x \in M}\, \pi(x)$$
while the minimum mean square error estimator $\hat{x}_{\mathrm{MMS}}$, classically understood as the mean (expectation) of the posterior density, is here the Riemannian barycenter of $\pi(x)$.
It is seen that $\hat{x}_{\mathrm{MAP}}$ can be computed directly, being a geodesic convex combination of the prior barycenter $z$ and a new observation $y$, with respective weights $1 - \rho$ and $\rho$, where $\rho = \tau^2/(\sigma^2 + \tau^2)$. On the other hand, $\hat{x}_{\mathrm{MMS}}$ seems much harder to compute.
However, Proposition 12 states that x^MMS ¼ x^MAP if ρ ¼ 1/2, and
Proposition 13 states that, in the special case where M is a hyperbolic space,
$\hat{x}_{\mathrm{MMS}}$ is a geodesic convex combination of $z$ and $y$, just like $\hat{x}_{\mathrm{MAP}}$, but with different weights, say $(1 - t^*)$ and $t^*$.
Section 3.2 reports on numerical experiments which show, again in the special
case where M is a hyperbolic space, that x^MMS and x^MAP lie very close to each
other, and that they even appear to be equal (this would mean t* ¼ ρ). At present,
the authors are unaware of any mathematical explanation of this phenomenon.
Section 3.3 describes the computational tools employed in calculating
x^MMS . First, Proposition 14 provides easy-to-verify sufficient conditions, for
the geometric ergodicity of an isotropic Metropolis-Hastings Markov chain,
in a Riemannian symmetric space M. These conditions are shown to apply
in the case of the posterior density $\pi(x)$, making it possible to generate geometrically ergodic samples $(x_n;\, n \ge 1)$ from this density.
Proposition 15 states that the empirical barycenter $\bar{x}_N$ of the samples $(x_1, \ldots, x_N)$ converges almost-surely to $\hat{x}_{\mathrm{MMS}}$, so $\bar{x}_N$ may be used to approximate $\hat{x}_{\mathrm{MMS}}$ to any required accuracy.
Concretely, computing the empirical barycenter $\bar{x}_N$ requires solving a strongly convex optimization problem on the Riemannian manifold $M$ (here, convexity is with respect to the Riemannian connection of $M$ (Udriste, 1994); this has come to be called "geodesic convexity"). Appendix B is devoted to a brief but systematic study of convex optimization on Riemannian manifolds.
Specifically, it establishes the rate of convergence of Riemannian gradient
descent schemes, applied to strictly convex or strongly convex cost functions.
The gradient descent schemes under consideration are retraction schemes
(not limited to the Riemannian exponential) with a constant step-size. The
problem is then to find the largest possible step-size which guarantees a cer-
tain rate of convergence. Proposition B.8 addresses this problem for strictly
convex functions, and Proposition B.9 for strongly convex functions. For
any strictly convex cost function, and suitable retraction, Proposition B.8
gives the largest possible step-size which guarantees a rate of convergence
at least as fast as O(1/t) (t is the number of iterations). Proposition B.9 does
the same for strongly convex functions, but with an exponential rate of
ᶜIf $p, q \in M$ and $c : [0,1] \to M$ is a geodesic curve with $c(0) = p$ and $c(1) = q$, then $p \#_t q = c(t)$, for $t \in [0,1]$. In other words, $p \#_t q$ is a geodesic convex combination of $p$ and $q$, with respective weights $(1-t)$ and $t$.
$$E_\pi(w) = \frac{1}{2} \int_M d^2(w, x)\, \pi(x)\, \mathrm{vol}(dx) \qquad (43)$$
While it is easy to compute x^MAP from (42), it is much harder to find x^MMS , as
this requires minimizing the integral (43), where the density π(x) is known
only up to normalization.
Still, there is one special case where these two estimators are equal.
Proposition 12. In the above notation, if σ 2 ¼ τ2 (that is ρ ¼ 1/2), then
x^MMS ¼ x^MAP .
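A minimal sketch of the geodesic convex combination $\hat{x}_{\mathrm{MAP}} = z \#_\rho y$, assuming $M$ is a hyperbolic space in the hyperboloid model; the points and the values of $\sigma^2$, $\tau^2$ below are arbitrary illustrations, and the reference to (42) is to the formula summarized above.

```python
import numpy as np

def minkowski(a, b):
    """Lorentzian inner product used in the hyperboloid model of hyperbolic space."""
    return -a[0] * b[0] + np.dot(a[1:], b[1:])

def geodesic_point(z, y, t):
    """Geodesic convex combination z #_t y on the hyperboloid (weights 1-t and t)."""
    d = np.arccosh(-minkowski(z, y))               # Riemannian distance d(z, y)
    u = (y - np.cosh(d) * z) / np.sinh(d)          # unit tangent at z pointing to y
    return np.cosh(t * d) * z + np.sinh(t * d) * u

def lift(v):
    """Embed a point of R^d into the hyperboloid {x0^2 - |v|^2 = 1, x0 > 0}."""
    return np.concatenate([[np.sqrt(1 + np.dot(v, v))], v])

z, y = lift(np.array([0.0, 0.0])), lift(np.array([1.5, -0.5]))   # prior barycenter, observation
sigma2, tau2 = 1.0, 0.5
rho = tau2 / (sigma2 + tau2)
x_map = geodesic_point(z, y, rho)                  # x_hat_MAP = z #_rho y, as in (42)
print(np.round(x_map, 4), round(minkowski(x_map, x_map), 6))     # stays on the hyperboloid (-1)
```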
Of course, the upper bound in (44) is not tight, since it is strictly positive,
even when ρ ¼ 1/2, as one may see from (45).
It will be shown below that a Metropolis-Hastings algorithm, with Gaussian proposals, can be used to generate geometrically ergodic samples $(x_n;\, n \ge 1)$ from the posterior density $\pi$. It is therefore possible to approximate (45) by an empirical average
$$m_1(\hat{x}_{\mathrm{MAP}}) = \frac{1}{N} \sum_{n=1}^{N} d(\hat{x}_{\mathrm{MAP}}, x_n) \qquad (46)$$
Dimension d                                       2      3      4      5      6      7      8      9      10
$m_1(\hat{x}_{\mathrm{MAP}})$                     0.28   0.35   0.41   0.47   0.50   0.57   0.60   0.66   0.70
$d(\bar{x}_{\mathrm{MMS}}, \hat{x}_{\mathrm{MAP}})$   0.00   0.00   0.00   0.01   0.01   0.02   0.02   0.02   0.03

and the following table for $\sigma^2 = 1$ and $\tau^2 = 0.5$, again using $N = 2 \times 10^5$.

Dimension d                                       2      3      4      5      6      7      8      9      10
$m_1(\hat{x}_{\mathrm{MAP}})$                     0.75   1.00   1.12   1.44   1.73   1.97   2.15   2.54   2.91
$d(\bar{x}_{\mathrm{MMS}}, \hat{x}_{\mathrm{MAP}})$   0.00   0.00   0.03   0.02   0.02   0.03   0.04   0.03   0.12
The first table confirms Proposition 12. The second table, more surprisingly, shows that $\hat{x}_{\mathrm{MMS}}$ and $\hat{x}_{\mathrm{MAP}}$ can be quite close to each other, even when $\rho \ne 1/2$.
In both of these tables, dðxMMS , x^MAP Þ is an approximation of
dð^x MMS , x^MAP Þ, based on using the empirical barycenter xMMS instead of
x^MMS . The main source of error affecting this approximation is the fact that
the samples ðx1 , …,xN Þ follow from a Metropolis-Hastings algorithm, and
not directly from the posterior density π.
Other values of σ 2 and τ2 lead to similar orders of magnitude for m1 ð^xMAP Þ
and dðx MMS , x^MAP Þ. While m1 ð^ xMAP Þ increases with the dimension d,
dðxMMS , x^MAP Þ does not appear sensitive to increasing dimension.
Based on these experimental results, one is tempted to conjecture that
x^MMS ¼ x^MAP , even when ρ 6¼ 1/2. Of course, numerical experiments do not
equate to a mathematical proof.
$$Pf(x) = \int_M \alpha(x, y)\, q(x, y)\, f(y)\, \mathrm{vol}(dy) + \rho(x) f(x) \qquad (47)$$
(a3) there exist $\delta_q > 0$ and $\epsilon_q > 0$ such that $d(x, y) < \delta_q$ implies $q(x, y) > \epsilon_q$.
Remark. The posterior density $\pi$ in (41) verifies Assumptions (a1) and (a2). To see this, let $x^* = z$, and write
$$\mathrm{grad}\, \ell(x) = -\frac{1}{\tau^2}\, r(x)\, \mathrm{grad}\, r(x) - \frac{1}{\sigma^2}\, \mathrm{grad}\, f_y(x)$$
where $f_y(x) = d^2(y, x)/2$. Then, taking the scalar product with $\mathrm{grad}\, r$,
$$\langle \mathrm{grad}\, r, \mathrm{grad}\, \ell \rangle_x = -\frac{1}{\tau^2}\, r(x) - \frac{1}{\sigma^2}\, \langle \mathrm{grad}\, r, \mathrm{grad}\, f_y \rangle_x \qquad (52)$$
since $\mathrm{grad}\, r(x)$ is a unit vector, for all $x \in M$. Now, $\mathrm{grad}\, f_y(x) = -\mathrm{Exp}^{-1}_x(y)$ (see Chavel, 2006). But, since $r(x)$ is a convex function of $x$, it follows, by (B.2) in Appendix B.1, that
$$\langle \mathrm{grad}\, r, \mathrm{Exp}^{-1}_x(y) \rangle \le r(y) - r(x)$$
for any $y \in M$. Thus, the right-hand side of (52) is strictly negative as soon as $r(x) > r(y)$, and Assumption (a1) is indeed verified. That Assumption (a2) is also verified can be proved by similar reasoning. □
Remark. On the other hand, Assumption (a3) holds if the proposed transition density $q(x, y)$ is a Gaussian density, $q(x, y) = p(y \mid x, \tau_q)$. With this choice of $q(x, y)$, all the assumptions of Proposition 14 are verified for the posterior density $\pi$ in (41). Therefore, Proposition 14 implies that the Metropolis-Hastings algorithm generates geometrically ergodic samples $(x_n;\, n \ge 1)$ from this posterior density. □
E_N(w) = (1/2N) Σ_{n=1}^{N} d²(w, x_n)    (53)
Let x̂ denote the Riemannian barycenter of the invariant density π. It turns out that x̄_N converges almost surely to x̂.
Proposition 15. Let (x_n) be any Markov chain in a Hadamard manifold M, with invariant distribution π. Denote by x̄_N the empirical barycenter of (x_1, …, x_N), and by x̂ the Riemannian barycenter of π. If (x_n) satisfies the strong law of large numbers (51), then x̄_N converges to x̂, almost surely.
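In practice, the empirical barycenter x̄_N of Proposition 15 is usually computed by Riemannian gradient descent on E_N, using grad E_N(w) = −(1/N) Σ_n Exp_w^{-1}(x_n). The sketch below assumes, purely for illustration, that the samples are SPD matrices with the affine-invariant metric; the step size and iteration count are arbitrary choices, not prescriptions from the text.

```python
import numpy as np

def _eig_fun(m, fun):
    """Apply a scalar function to a symmetric matrix through its eigendecomposition."""
    lam, v = np.linalg.eigh(m)
    return v @ np.diag(fun(lam)) @ v.T

def spd_log(w, x):
    """Riemannian logarithm Log_w(x) for the affine-invariant metric on SPD matrices."""
    s, si = _eig_fun(w, np.sqrt), _eig_fun(w, lambda t: 1.0 / np.sqrt(t))
    return s @ _eig_fun(si @ x @ si, np.log) @ s

def spd_exp(w, v):
    """Riemannian exponential Exp_w(v) for the affine-invariant metric on SPD matrices."""
    s, si = _eig_fun(w, np.sqrt), _eig_fun(w, lambda t: 1.0 / np.sqrt(t))
    return s @ _eig_fun(si @ v @ si, np.exp) @ s

def empirical_barycenter(samples, step=1.0, iters=50):
    """Minimize E_N(w) = (1/2N) sum_n d^2(w, x_n) by moving along the mean logarithm."""
    w = samples[0].copy()
    for _ in range(iters):
        mean_log = sum(spd_log(w, x) for x in samples) / len(samples)
        w = spd_exp(w, step * mean_log)   # gradient step: grad E_N(w) = -mean_log
    return w
```

With step = 1 this is the usual fixed-point iteration for the Karcher mean; smaller steps trade speed for robustness.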
x̄_MMS is seen to lie on the geodesic connecting z and y, here the dashed circle arc. Note that Fig. 1A corresponds to Proposition 12 and Fig. 1B to Proposition 13.
Proof. Recall from Alekseevskij et al. (1993) (section 3) that there exists an isometry σ : M → M such that σ ∘ σ is the identity (σ is an involution), and the set of fixed points of σ is exactly the geodesic curve γ. The key point in the following, which can be seen from (41), is that

π(σ(x)) = π(x)  for x ∈ M    (54)

In other words, σ leaves the posterior density π invariant. Let E_π be the function in (43). Then, note that

(E_π ∘ σ)(w) = (1/2) ∫_M d²(w, σ(x)) π(x) vol(dx)
            = (1/2) ∫_M d²(w, x) π(σ(x)) σ*(vol)(dx)

where the first equality follows from (41), because σ is an isometry and an involution, and the second equality by a change of variables (σ*(vol) denotes the pullback of the volume form vol by σ). Using (54) and the fact that σ preserves the volume, it now follows that

(E_π ∘ σ)(w) = E_π(w)  for w ∈ M    (55)

Finally, taking w = x̂_MMS and recalling that x̂_MMS is the unique global minimizer of E_π, it follows that σ(x̂_MMS) = x̂_MMS, so that x̂_MMS indeed lies on γ. □
Proof. Let w = γ(t) and take the gradient of (55). This yields

dσ grad E_π(γ(t)) = grad E_π(γ(t))    (58)

However, the derivative dσ : T_{γ(t)}M → T_{γ(t)}M is equal to +1 on vectors parallel to γ̇(t) and to −1 on vectors orthogonal to γ̇(t). Thus, (58) implies that grad E_π(γ(t)) should be parallel to γ̇(t). This is equivalent to (56). □
The next step in the proof will be to compute c(0) and c(1). This will show
that c(0) is negative and c(1) is positive. Computing c(0) and c(1) requires
taking a closer look at the posterior density π.
Lemma 3. Let Z(z, y, τ, σ) denote the missing normalizing factor in (41). Then, Z(z, y, τ, σ) = Z(δ, τ, σ), where δ = d²(z, y).

= ∫_M exp( −d²(g(y), g(x))/(2σ²) − d²(g(x), g(z))/(2τ²) ) vol(dx)

Thus, introducing the change of variables w = g(x),

Z(z, y, τ, σ) = ∫_M exp( −d²(y′, w)/(2σ²) − d²(w, z′)/(2τ²) ) vol(dw) = Z(z′, y′, τ, σ)

In other words, Z(z, y, τ, σ) only depends on the distance between z and y. □
φ*(vol) = ∏_{λ∈Δ⁺} ( sinh λ(a) / λ(a) )^{m_λ} β*(dv)

where m_λ is the multiplicity of λ (the rank of Π_λ). On the other hand, one may show

β*(dv) = ∏_{λ∈Δ⁺} |λ(a)|^{m_λ} da ω(ds)    (A.9)

where da is the volume form on a, and ω is the invariant volume induced onto S from K.
Finally, the Riemannian volume, in terms of the parameterization φ, can be expressed in the following way

φ*(vol) = ∏_{λ∈Δ⁺} | sinh λ(a) |^{m_λ} da ω(ds)    (A.10)

Using (A.10), it will be possible to write down integral formulae for Riemannian symmetric spaces, either noncompact or compact.
To obtain an integral formula from (A.10), one should first note that β : S × a → p is neither regular nor one-to-one. Recall the following:
• the hyperplanes λ(a) = 0, where λ ∈ Δ⁺, divide a into finitely many connected components, which are open and convex sets, known as Weyl chambers. From (A.9), β is regular on each Weyl chamber.
• let K′_a denote the normalizer of a in K. Then, W = K′_a/K_a is a finite group of automorphisms of a, called the Weyl group, which acts freely and transitively on the set of Weyl chambers (Helgason, 1962) (theorem 2.12, chapter VII).
Then, for each Weyl chamber C, β is regular and one-to-one from S × C onto its image in p. Moreover, if a_r is the union of the Weyl chambers (a ∈ a_r if and only if λ(a) ≠ 0 for any λ ∈ Δ⁺), then β is regular and |W|-to-one from S × a_r onto its image in p. To obtain the desired integral formula, it only remains to note that φ is a diffeomorphism from S × C onto its image in M. However, this image is the set M_r of regular values of φ. By Sard's lemma, its complement is negligible (Bogachev, 2007).
Proposition A.1. Let M = G/K be a Riemannian symmetric space which belongs to the "noncompact case" just described. Then, for any bounded continuous function f : M → ℝ,

∫_M f(x) vol(dx) = ∫_{C⁺} ∫_S f(φ(s, a)) ∏_{λ∈Δ⁺} ( sinh λ(a) )^{m_λ} da ω(ds)    (A.12)
               = (1/|W|) ∫_a ∫_S f(φ(s, a)) ∏_{λ∈Δ⁺} | sinh λ(a) |^{m_λ} da ω(ds)    (A.13)
(A.17)

where da = da_1^1 ⋯ da_N^N. Now, assume f is a class function: f(k · x) = f(x) for k ∈ K and x ∈ H(N). That is, f(x) depends only on the eigenvalues x_i = e^{r_i} of x. By (A.17),

∫_{H(N)} f(x) vol(dx) = ( ω(S_N) / (2^N N!) ) ∫_{ℝ^N} f(exp(r)) ∏_{i<j} sinh²( (r_i − r_j)/2 ) dr    (A.18)
where V(e^{iθ}) = ∏_{i<j} ( e^{iθ_j} − e^{iθ_i} ) is the Vandermonde determinant.
Integrals such as (A.19) and (A.22) are familiar in random matrix theory (Meckes, 2019; Mehta, 2004). The resemblance between these integrals (for example, in the role played by the Vandermonde determinant) is at the origin of the sort of "duality" described in Section 2.8.
d. Please do not confuse the imaginary number i with the subscript i.
Here, ⟨·, ·⟩ and d(·, ·) denote the Riemannian scalar product and distance associated with the Riemannian metric tensor g of M. Moreover, ⪯ stands for the Loewner order. A straightforward consequence of (ii) in Proposition B.2 is the so-called PL inequality (PL stands for Polyak–Łojasiewicz (Karimi et al., 2016)). This inequality will be used in Section B.4.2.
Proposition B.3. Let f : A → ℝ be a twice differentiable, α-strongly convex function. If f has its minimum at x* ∈ A, then

‖grad f(x)‖²_x ≥ 2α ( f(x) − f(x*) )  for all x ∈ A    (B.4)
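For completeness, here is a hedged sketch of the standard one-line argument behind (B.4), assuming the strong-convexity inequality along geodesics (the analogue of (B.2)):

```latex
% Sketch of the PL inequality (B.4) from \alpha-strong convexity (cf. Karimi et al., 2016),
% assuming f(y) \ge f(x) + \langle \mathrm{grad} f(x), \mathrm{Exp}_x^{-1}(y)\rangle_x
%                        + \tfrac{\alpha}{2}\|\mathrm{Exp}_x^{-1}(y)\|_x^2 .
% Minimising the right-hand side over u = \mathrm{Exp}_x^{-1}(y) gives
\[
  \min_{u \in T_xM}\Bigl(\langle \mathrm{grad} f(x), u\rangle_x + \tfrac{\alpha}{2}\|u\|_x^2\Bigr)
  \;=\; -\frac{1}{2\alpha}\,\|\mathrm{grad} f(x)\|_x^2 ,
\]
% so, taking y = x^* on the left-hand side,
\[
  f(x^*) \;\ge\; f(x) - \frac{1}{2\alpha}\,\|\mathrm{grad} f(x)\|_x^2
  \quad\Longleftrightarrow\quad
  \|\mathrm{grad} f(x)\|_x^2 \;\ge\; 2\alpha\bigl(f(x) - f(x^*)\bigr).
\]
```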
The retraction Ret is called geodesic when Φ″_x(0_x)(v, v) = 0 for x ∈ M and v ∈ T_xM. In this case, Φ_x agrees with Id_x up to third-order terms. For geodesic regular retractions, Proposition B.6 provides a new, general version of Proposition B.4.
Proposition B.6. Let f : M → ℝ be a C² function, with B_c = {x : f(x) ≤ c} compact (and not empty), and let Ret : TM → M be a geodesic regular retraction. There exist constants β_c, δ_c, H_c ≥ 0, which depend on f and Ret, such that, for all x ∈ B_c,

f(y) − f(x) ≤ −μ [ 1 − (β_c H_c/2) μ − (δ_c ‖grad f(x)‖²_x) μ² ] ‖grad f(x)‖²_x    (B.12)
The widely used projection retractions for spheres, unitary groups, and Grassmann manifolds are examples of contractive, uniformly geodesic regular retractions (Said, 2021) (see sections 1.5 and 1.6). Here is another example, for positive-definite matrices.
Example. Let M = P(N), the space of symmetric positive-definite N × N matrices, equipped with its usual affine-invariant metric (Pennec, 2006). For x ∈ P(N), the tangent space T_xP(N) is identified with the space S(N) of symmetric N × N matrices. Then, recall the Riemannian exponential map Exp_x(v) = x exp(x⁻¹v) (where exp denotes the matrix exponential), and consider the retraction Ret_x(v) = x + v + (1/2) v x⁻¹ v. The point of using this retraction is that the eigenvalues of Ret_x(v) are always bounded below by those of (1/2)x, so Ret_x(v) remains in P(N). In addition, it is a contractive, uniformly geodesic regular retraction. □
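A minimal numerical sketch of this example, assuming real SPD matrices stored as NumPy arrays; the helper names are hypothetical. On small tangent vectors v it can be used to check that Ret_x(v) and Exp_x(v) agree up to third-order terms.

```python
import numpy as np

def spd_retraction(x, v):
    """Second-order retraction Ret_x(v) = x + v + (1/2) v x^{-1} v on P(N).
    It satisfies Ret_x(v) >= x/2 in the Loewner order, so it stays positive-definite."""
    return x + v + 0.5 * v @ np.linalg.solve(x, v)

def spd_exp_map(x, v):
    """Riemannian exponential Exp_x(v) = x^{1/2} exp(x^{-1/2} v x^{-1/2}) x^{1/2},
    which equals x exp(x^{-1} v) for the affine-invariant metric."""
    lam, q = np.linalg.eigh(x)
    s = q @ np.diag(np.sqrt(lam)) @ q.T
    si = q @ np.diag(1.0 / np.sqrt(lam)) @ q.T
    mu, p = np.linalg.eigh(si @ v @ si)
    return s @ (p @ np.diag(np.exp(mu)) @ p.T) @ s
```

For v of norm ε, the difference spd_retraction(x, v) − spd_exp_map(x, v) is expected to be of order ε³, which is one way to observe numerically that the retraction is geodesic.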
Then, since f is bounded below on the compact set B_c, the series Σ_{t=0}^{∞} ‖grad f(x^t)‖²_{x^t} must converge. Finally, compactness of B_c ensures every
Let μ*_d denote the infimum of μ such that 1 − κ(R_c) L_c μ − (2δ R_c ‖grad f‖_{B_c}) L_c μ² < 0.
Proposition B.8. Under the same assumptions as Lemma B.2, if μ ≤ min{μ*_c, μ*_d}, then
f(x^{t+1}) − f(x*) ≤ 2 d²(x⁰, x*) / ( μ (t + 1) )    (B.19)

for all t ≥ 0. In particular, the sequence (x^t) converges to x*.
Remark. The quality of the convergence in (B.19) depends above all on the step-size μ: the smaller this is, the slower the convergence. From the definitions of μ*_c and μ*_d, it is clear that there are two reasons why μ would be smaller: a larger constant δ, and a larger curvature (in absolute value) κ_min. In theory, one can always make δ = 0 by using the retraction Ret = Exp, but this requires the ability to compute the Riemannian exponential Exp with sufficient accuracy. □
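For orientation, the iteration analyzed in Proposition B.8 is x^{t+1} = Ret_{x^t}(−μ grad f(x^t)). The generic loop below is only a sketch: the gradient, the retraction, and the stopping rule are user-supplied callables with illustrative names, and the crude Euclidean norm check stands in for a metric-aware one.

```python
def riemannian_gradient_descent(x0, grad_f, retraction, step, n_iters=100, tol=1e-10):
    """Iterate x_{t+1} = Ret_{x_t}(-step * grad f(x_t)), stopping when the gradient is tiny.
    `grad_f(x)` returns the Riemannian gradient at x, `retraction(x, v)` maps a tangent
    vector v at x back to the manifold, and `step` plays the role of mu in (B.19)."""
    x = x0
    for _ in range(n_iters):
        g = grad_f(x)
        if (g * g).sum() < tol:        # crude squared-norm test on the gradient array
            break
        x = retraction(x, -step * g)
    return x
```

With the SPD helpers sketched above, retraction could be spd_retraction and grad_f the gradient of the cost of interest.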
Note that V_π is strictly convex (but not strongly convex), and (1 + q κ_min)-smooth on M, because the same is true of each function V_y. In fact, V_π has compact sublevel sets whenever the distribution π has finite first-order moments (Said, 2021). In this case, V_π(x) is guaranteed to achieve its minimum at some x* ∈ M. This x* is called the robust Riemannian barycenter of π (the adjective "robust" comes from the field of robust statistics (Huber and Ronchetti, 2009)).
When applying Lemma B.2 and Proposition B.8 to the present example (with f = V_π), note that inj(x*) = ∞, since M is a Hadamard manifold, and L_c = (1 + q κ_min) does not depend on c.
Proof of Proposition B.5: This is given in Said (2021), section 1.5.
Proof of Proposition B.6: Assume Ret is a regular geodesic retraction, and let Φ be the corresponding map in (B.10). Since B_c is compact, there exist β_c, δ_c ≥ 0 such that

sup { ‖Φ′_x(u)‖_op ; x ∈ B_c and u ∈ T_xM, ‖u‖_x ≤ ‖grad f(x)‖_x } ≤ β_c^{1/2}    (C.1)
sup { ‖Φ‴_x(u)‖_op ; x ∈ B_c and u ∈ T_xM, ‖u‖_x ≤ ‖grad f(x)‖_x } ≤ δ_c    (C.2)

where ‖·‖_op denotes the operator norm of the linear map Φ′_x(u) : T_xM → T_xM, or of the trilinear map Φ‴_x(u) : T_xM × T_xM × T_xM → ℝ. In terms of these constants β_c and δ_c,

‖Φ_x(−μ grad f(x))‖²_x ≤ (β_c μ²) ‖grad f(x)‖²_x  for x ∈ B_c    (C.3)
However, applying (B.8) to the third term on the right-hand side, this implies

L(x^{t+1}) ≤ L(x^t) + ⟨grad L(x^t), Φ_{x^t}(−μ grad f(x^t))⟩_{x^t} + κ(R_c) L_c μ² ( f(x^t) − f(x*) )    (C.5)

Now, consider the second term on the right-hand side. Since grad L(x^t) = −Exp_{x^t}^{-1}(x*), this second term is equal to

μ ⟨Exp_{x^t}^{-1}(x*), grad f(x^t)⟩_{x^t} − ⟨Exp_{x^t}^{-1}(x*), Φ_{x^t}(−μ grad f(x^t)) + μ grad f(x^t)⟩_{x^t}    (C.6)

Applying (B.2) and (C.4) (with δ_c = δ, since Ret is uniformly geodesic),

(C.6) ≤ −μ ( f(x^t) − f(x*) ) + (δ μ³) ‖Exp_{x^t}^{-1}(x*)‖_{x^t} ‖grad f(x^t)‖³_{x^t}

Since f(x^t) ≥ f(x*), whenever the expression in square brackets is positive, one has L(x^{t+1}) ≤ L(x^t). However, this directly yields (B.18).
Proof of Proposition B.8: Note from Lemma B.1 that μ ≤ μ*_c implies

d(x^t, x*) ≤ d(x⁰, x*)

where the first inequality follows by applying Cauchy–Schwarz to (B.2), and the second one from Lemma B.2, since μ ≤ μ*_d. Letting ε(t) = f(x^t) − f(x*), it is now clear that

ε(t + 1) ≤ ε(t) − (μ/2) ( ε(t)/d(x⁰, x*) )²
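For completeness, here is a hedged sketch of how this recursion leads to the rate (B.19), written with c = μ/(2 d²(x⁰, x*)):

```latex
% Sketch, assuming \varepsilon(t+1) \le \varepsilon(t) - c\,\varepsilon(t)^2 with c = \mu/(2 d^2(x^0, x^*)).
% Dividing by \varepsilon(t)\varepsilon(t+1) and using \varepsilon(t+1) \le \varepsilon(t),
\[
  \frac{1}{\varepsilon(t+1)} - \frac{1}{\varepsilon(t)}
  \;\ge\; c\,\frac{\varepsilon(t)}{\varepsilon(t+1)} \;\ge\; c ,
\]
% so that, summing over the first t+1 steps,
\[
  \frac{1}{\varepsilon(t+1)} \;\ge\; \frac{1}{\varepsilon(0)} + c\,(t+1) \;\ge\; c\,(t+1)
  \quad\Longrightarrow\quad
  f(x^{t+1}) - f(x^*) \;\le\; \frac{2\,d^2(x^0, x^*)}{\mu\,(t+1)} ,
\]
% which is exactly (B.19).
```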
References
Absil, P.A., Mahony, R., Sepulchre, R., 2008. Optimization Algorithms on Matrix Manifolds.
Princeton University Press.
Afsari, B., 2010. Riemannian Lp center of mass: existence, uniqueness and convexity. Proc. Am.
Math. Soc. 139 (2), 655–673.
Alekseevskij, D.V., Vinberg, E.B., Solodovnikov, A.S., 1993. Geometry of Spaces of Constant
Curvature (EMS vol. 29). Springer-Verlag.
Bhattacharya, R., Patrangenaru, V., 2003. Large sample theory of intrinsic and extrinsic sample means on manifolds I. Ann. Stat. 31 (1), 1–29.
Bogachev, V.I., 2007. Measure Theory. vol. I. Springer-Verlag.
Cabanes, Y., 2021. Multidimensional Complex Stationary Centered Gaussian Autoregressive Time Series Classification: Application for Audio and Radar Clutter Machine Learning in Hyperbolic and Siegel Spaces (Ph.D. thesis), University of Bordeaux.
Chavel, I., 2006. Riemannian Geometry, A Modern Introduction. Cambridge University Press.
Cheng, G., Vemuri, B.C., 2013. A novel dynamic system in the space of SPD matrices with appli-
cations to appearance tracking. SIAM J. Imaging Sci. 6 (1), 592–615.
Congedo, M., Barachant, A., Bhatia, R., 2017. Riemannian geometry for EEG-based brain-
computer interfaces; a primer and a review. Brain-Comput. Interfaces 4 (3), 155–174.
Deift, P., 1998. Orthogonal Polynomials and Random Matrices: A Riemann-Hilbert Approach. American Mathematical Society.
Fréchet, M., 1948. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. Inst. Henri Poincaré 10 (4), 215–310.
Helgason, S., 1962. Differential Geometry and Symmetric Spaces. Academic Press, New York
and London.
Heuveline, S., Said, S., Mostajeran, C., 2021. Gaussian distributions on Riemannian symmetric
spaces, random matrices, and planar Feynman diagrams. arXiv:2106.08953.
Huber, P.J., Ronchetti, E.M., 2009. Robust Statistics, second ed. Wiley-Blackwell.
Jarner, S.F., Hansen, E., 1998. Geometric ergodicity of Metropolis algorithms. Stoch. Process.
Appl. 58, 341–361.
Karimi, H., Nutini, J., Schmidt, M., 2016. Linear convergence of gradient and proximal-gradient
methods under the Polyak-Lojasiewicz condition. In: Machine Learning and Knowledge Dis-
covery in Databases.
Kendall, W.S., 1990. Probability, convexity, and harmonic maps with small image I: uniqueness
and fine existence. Proc. Lond. Math. Soc. 61 (2), 371–406.
Knapp, A.W., 2002. Lie Groups, Beyond an Introduction, second ed. Birkhäuser.
Kuijlaars, A.B.J., Van Assche, W., 1999. The asymptotic zero distribution of orthogonal polyno-
mials with varying recurrence coefficients. J. Approx. Theory 99, 167–197.
Lee, J.M., 2012. Introduction to Smooth Manifolds, second ed. Springer Science.
Mariño, M., 2005. Chern-Simons Theory, Matrix Models, and Topological Strings. Oxford
University Press.
Meckes, E.S., 2019. The Random Matrix Theory of the Classical Compact Groups. Cambridge
University Press.
Mehta, M.L., 2004. Random Matrices, third ed. Elsevier Ltd.
Meyn, S., Tweedie, R.L., 2008. Markov Chains and Stochastic Stability. Cambridge University Press.
Nesterov, Y., 2018. Lectures on Convex Optimization. Springer Switzerland.
Pennec, X., 2006. Intrinsic statistics on Riemannian manifolds: basic tools for geometric measure-
ments. J. Math. Imaging Vis. 25 (1), 127–154.
Petersen, P., 2006. Riemannian Geometry, second ed. Springer Science.
Roberts, G.O., Rosenthal, J.S., 2004. General state-space Markov chains and MCMC algorithms. Probab. Surv. 1, 20–71.
Said, S., 2021. Statistical models and probabilistic methods on Riemannian manifolds.
arXiv:2101.10855.
Said, S., Manton, J.H., 2021. Riemannian barycentres of Gibbs distributions: new results on con-
centration and convexity. Inf. Geom. 4 (2).
Said, S., Bombrun, L., Berthoumieu, Y., Manton, J.H., 2017. Riemannian Gaussian distributions on
the space of symmetric positive definite matrices. IEEE Trans. Inf. Theory 63 (4), 2153–2170.
Said, S., Hajri, H., Bombrun, L., Vemuri, B.C., 2018. Gaussian distributions on Riemannian sym-
metric spaces: statistical learning with structured covariance matrices. IEEE Trans. Inf. The-
ory 64 (2), 752–772.
Santilli, L., Tierz, M., 2021. Riemannian Gaussian distributions, random matrix ensembles and
diffusion kernels. Nucl. Phys. B 973, 115582.
Siegel, C.L., 1943. Symplectic geometry. Am. J. Math. 65 (1), 1–86.
Sturm, K.T., 2003. Probability measures on metric spaces of nonpositive curvature. Contemp.
Math. 338, 1–34.
Szegő, G., 1939. Orthogonal Polynomials, first ed. American Mathematical Society.
Terras, A., 1988. Harmonic Analysis on Symmetric Spaces and Applications. vol. II. Springer-Verlag.
Udriste, C., 1994. Convex Functions and Optimization Methods on Riemannian Manifolds.
Springer Science.
Whittaker, E.T., Watson, G.N., 1950. A Course of Modern Analysis, fourth ed. Cambridge Uni-
versity Press.
Zanini, P., Congedo, M., Jutten, C., Said, S., Berthoumieu, Y., 2016. Parameters estimate of Rie-
mannian Gaussian distribution in the manifold of covariance matrices. In: Sensor Array and
Multichannel Signal Processing.
Chapter 11
Multilevel contours on bundles of complex planes
Abstract
A new concept called multilevel contours is introduced in this article by the author. Theorems on contours constructed on a bundle of complex planes are stated and proved. Multilevel contours can transport information from one complex plane to another. Within a random environment, the behavior of contours and multilevel contours passing through the bundles of complex planes is studied. Further properties of contours under a removal process of the data are studied. The concepts of "islands" and "holes" within a bundle are introduced in this article. All these constructions help to understand the dynamics of the set of points of the bundle. Further research on the topics introduced here will be followed up by the author. This includes closed approximations of the multilevel contour formations and their removal processes. The ideas and results presented in this article are novel.
Keywords: Multilevel complex planes, Spinning, Randomness, Holomorphism, PDEs
MSC: 32L05, 60K3, 32H02
1 Introduction
Let us consider a bundle of eight complex planes ℂ_1, ℂ_2, ℂ_3, ℂ_4, ℂ_5, ℂ_6, ℂ_7, and ℂ_8 as shown in Fig. 1. These planes are positioned such that one plane is either parallel to any other plane in the bundle or intersects it at some angle. Let γ_1 be an arc constructed from the points generated by z(t_1) ∈ ℂ_1 for a_11 ≤ t_1 ≤ b_11, such that z(a_11) = z_1 and z(b_11) = z_2.
☆ Dedication: This article is dedicated to my friend and collaborator Professor Steven G. Krantz, Washington University, St. Louis, United States, on the completion of his 70th birthday.
FIG. 1 A bundle of complex planes and multilevel contours M1, M2, M3.
Here a_11, b_11 ∈ ℝ. Point z_2 is located at the intersection of the planes ℂ_1 and ℂ_2. We allow constructing an arc in the plane ℂ_2 from z_2 to z_3, for z_3 ∈ ℂ_2, ℂ_3, and ℂ_6. Let γ_2 be an arc constructed from z(t_2) ∈ ℂ_2 for a_12 ≤ t_2 ≤ b_12 (a_12, b_12 ∈ ℝ), such that z(a_12) = z_2 and z(b_12) = z_3. The arc γ_i is constructed by joining z(a_1i) = z_i and z(b_1i) = z_{i+1}, generated by the set of points z(t_i) for a_1i ≤ t_i ≤ b_1i, for i = 3, 4, …, 7 and a_1i, b_1i ∈ ℝ. We allow the possibility of constructing an arc from an ending point of an arc in a plane to a point located in a different plane if that ending point is located at the intersection of two or more complex planes. We saw above a few points lying at the intersection of two or more planes. The other points are, for example, z_4 ∈ ℂ_3, ℂ_4, and ℂ_8; z_5 ∈ ℂ_4 and ℂ_5; z_6 ∈ ℂ_4, ℂ_6, and ℂ_7; z_7 ∈ ℂ_4 and ℂ_7; and z_8 ∈ ℂ_7 and ℂ_8.
Let us form a contour by piecewise joining of the arcs γ_i for i = 1, 2, …, 7 and call this M_1. Let us rename the arcs corresponding to the contour M_1 as γ_i^{M_1}. For the sake of visualization, we have separated a single point at the intersecting planes into two or more points in different colors, shown as a smaller oval-shaped object in Fig. 1. Suppose a contour M_2 is constructed using a set of values z(s_i) for a_2i ≤ s_i ≤ b_2i, with corresponding arcs γ_i^{M_2} for i = 1, 2, 3, 4, and another contour M_3 is constructed using z(u_i) for a_3i ≤ u_i ≤ b_3i, with corresponding arcs γ_i^{M_3}.
L(M_1) = Σ_{i=1}^{7} ∫_{α_1i}^{β_1i} |z′[ξ_i(τ)]| ξ′_i(τ) dτ.    (1)

L(M_3) = Σ_{i=1}^{4} ∫_{α_3i}^{β_3i} |z′[ψ_i(τ)]| ψ′_i(τ) dτ.    (3)
The contours M_1, M_2, M_3 are located on a bundle of complex planes; we term such contours multilevel contours. One can draw several such multilevel contours on a bundle of complex planes, as shown in Fig. 1. We have considered eight complex planes and three multilevel contours as an example, but one can extend these examples to demonstrate the intersection of many more complex planes and contours passing through them. Although multilevel contours are newly introduced in this article, the principles associated with contours on a single complex plane can be found in any standard textbook; see, for example, Ahlfors (1978), Churchill and Brown (1984), Krantz (2004), and Rudin (1987).
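Each term in (1) and (3) is an ordinary arc-length integral, so a multilevel contour length can be approximated arc by arc. The sketch below is an illustration under simplifying assumptions (the callables z and xi are hypothetical stand-ins for z(·) and ξ_i(·), and derivatives are taken by finite differences); it is not part of the author's construction.

```python
import numpy as np

def arc_length(z, xi, alpha, beta, n=2000):
    """Approximate one term of (1): the integral of |z'(xi(tau))| xi'(tau) over [alpha, beta].
    `xi` maps the parameter tau to a real value t, and `z` maps t to a complex number."""
    tau = np.linspace(alpha, beta, n)
    t = xi(tau)
    dz_dt = np.gradient(z(t), t)      # z'(xi(tau)) by finite differences
    dxi = np.gradient(t, tau)         # xi'(tau)
    seg = np.abs(dz_dt) * dxi
    return float(np.sum(0.5 * (seg[1:] + seg[:-1]) * np.diff(tau)))   # trapezoid rule

def multilevel_length(arcs):
    """Sum arc lengths of a piecewise contour such as M1: arcs = [(z, xi, alpha, beta), ...]."""
    return sum(arc_length(*a) for a in arcs)

# Example (quarter of the unit circle): arc_length(lambda t: np.exp(1j * t), lambda s: s, 0.0, np.pi / 2)
# returns approximately pi/2.
```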
FIG. 2 Bundle of infinitely many (uncountable) complex planes and shortest contour passing
through them.
Proof. Let l and q be two arbitrary lines among these uncountably many lines formed out of the above slicing method. Suppose we choose a point z_1 on l. There exists a point z_2 on q such that the x-coordinate and y-coordinate of z_1 and z_2 are the same. Note that z_1, z_2 ∈ ℂ_0. Depending upon how we visualize the xy-axes of ℂ_0, the following possibilities for the values of z_1 and z_2 arise:
(i) z_1 = (0, 0) and z_2 = (0, 0), or z_1 ≠ (0, 0) and z_2 ≠ (0, 0).
(ii) If z_1, z_2 ≠ (0, 0), then both z_1 and z_2 will have either a nonzero x-coordinate (and y-coordinate zero) or a nonzero y-coordinate (and x-coordinate zero). We first assume that both z_1 and z_2 are on the y-coordinate. Let C(z_1, z_2) be a contour described by the equation z(t_1) for a_1 ≤ t_1 ≤ a′_1, where z(a_1) = z_1 for z_1 ∈ ℂ_0, l and z(a′_1) = z_2 for z_2 ∈ ℂ_0, q. Here C(z_1, z_2) is a multilevel contour, because z_1 and z_2 are points on parallel lines l and q on different planes, but both these points also belong to ℂ_0. Suppose

t_1 = ε_1(τ)  (δ_1 ≤ τ ≤ δ′_1)

is the parametric representation for C(z_1, z_2), where ε_1 is a real-valued function mapping [δ_1, δ′_1] onto the interval [a_1, a′_1]. The length of the contour C(z_1, z_2) is obtained by

L[C(z_1, z_2)] = ∫_{δ_1}^{δ′_1} |z′[ε_1(τ)]| ε′_1(τ) dτ    (4)
Suppose we consider a point z_3 on the same plane in which the point z_2 lies, but not on the line q, such that z_3 ∉ ℂ_0. That means z_3 ≠ z_2. Let C(z_2, z_3) be a contour described by the equation z(t_2) for a_2 ≤ t_2 ≤ a′_2, where z(a_2) = z_2 for z_2 ∈ ℂ_0, q and z(a′_2) = z_3 for z_3 ∉ ℂ_0 and z_3 ∉ q. Here C(z_2, z_3) is not a multilevel contour. Suppose

t_2 = ε_2(τ)  (δ_2 ≤ τ ≤ δ′_2)

is the parametric representation for C(z_2, z_3), where ε_2 is a real-valued function mapping [δ_2, δ′_2] onto the interval [a_2, a′_2]. The length of the contour C(z_2, z_3) can be obtained by

L[C(z_2, z_3)] = ∫_{δ_2}^{δ′_2} |z′[ε_2(τ)]| ε′_2(τ) dτ
Since z_3 is not in ℂ_0, we cannot draw a contour directly from z_1 to z_3. To draw a contour to z_3 from z_1, we can have piecewise arcs passing through z_2, or through any other point of the line q, to z_3. If the contour from z_1 to z_3 passes through z_2, then

∫_{δ_1}^{δ′_1} |z′[ε_1(τ)]| ε′_1(τ) dτ < ∫_{δ_3}^{δ′_3} |z′[ε_3(τ)]| ε′_3(τ) dτ,    (7)

where the R.H.S. of the inequality (7) is the length of the contour C(z_1, z_4) described by the equation z(t_3) for a_3 ≤ t_3 ≤ a′_3, where z(a_3) = z_1 for z_1 ∈ ℂ_0, l and z(a′_3) = z_4 for z_4 ∈ ℂ_0. Here t_3 = ε_3(τ) (δ_3 ≤ τ ≤ δ′_3) is the parametric representation for C(z_1, z_4), and ε_3 is a real-valued function mapping [δ_3, δ′_3] onto the interval [a_3, a′_3]. From (6) and (7), we have

∫_{δ_1}^{δ′_1} |z′[ε_1(τ)]| ε′_1(τ) dτ + ∫_{δ_2}^{δ′_2} |z′[ε_2(τ)]| ε′_2(τ) dτ
    < ∫_{δ_3}^{δ′_3} |z′[ε_3(τ)]| ε′_3(τ) dτ + ∫_{δ_4}^{δ′_4} |z′[ε_4(τ)]| ε′_4(τ) dτ,    (8)

where the second term of the R.H.S. of the inequality (8) is the length of the contour C(z_4, z_3) described by the equation z(t_4) for a_4 ≤ t_4 ≤ a′_4, where z(a_4) = z_4 for z_4 ∈ ℂ_0, q and z(a′_4) = z_3 for z_3 ∉ ℂ_0. Here t_4 = ε_4(τ) (δ_4 ≤ τ ≤ δ′_4) is the parametric representation for C(z_4, z_3).
and

⋃_{z, z′} C(z, z′).

In (9), the integral on the L.H.S. is the length of the contour C(z, z′) described by the equation z(t_p) for a_p ≤ t_p ≤ a′_p, where z(a_p) = z for z ∈ ℂ_0, p and z(a′_p) = z′ for z′ ∈ p′, ℂ_0. Here t_p = ε_p(τ) (δ_p ≤ τ ≤ δ′_p) is the parametric representation for the contour C(z, z′), and ε_p is a real-valued function mapping [δ_p, δ′_p] onto the interval [a_p, a′_p]. □
lim_{n→∞} z_n = z
Remark 3. The distances between each pair of adjacent points on every contour created by slicing parallel to ℂ_0 are equal.
Suppose we remove the space created by ℂ_0 from the bundle B(ℂ). Let the bundle formed on the left of ℂ_0 be denoted by B_L(ℂ_0) and the bundle formed on the right of ℂ_0 be denoted by B_R(ℂ_0). Then

B_L(ℂ_0) ∪ ℂ_0 ∪ B_R(ℂ_0) = B(ℂ)    (10)

The set B(ℂ) − ℂ_0 is defined as

B(ℂ) − ℂ_0 = {z : z ∈ B(ℂ) and z ∉ ℂ_0}
By the same argument as (11), we write B_L(ℂ_0) − ℂ_r below as a union of two disconnected sets

B_L(ℂ_0) − ℂ_r = B_L(ℂ_r) ∪ B_R(ℂ_r)  (r > 0)    (13)

Although we cannot draw a multilevel contour passing through all the planes of the bundle B_L(ℂ_r), one can draw such a contour through the plane ℂ_r while it intersects the bundle B_L(ℂ_r). One can write another disjoint set

B_L(ℂ_r) − ℂ_{r′} = B_L(ℂ_{r′}) ∪ B_R(ℂ_{r′})  (r, r′ > 0),    (14)

where ℂ_{r′} is the complex plane used to slice B_L(ℂ_r), and ℂ_r is the complex plane used to slice B_L(ℂ_0).
Let us now consider the right side of the plane ℂ_0 within the bundle B(ℂ), i.e., B_R(ℂ_0). Suppose we slice the bundle B_R(ℂ_0) using a plane parallel to ℂ_0 and call it ℂ_s (s > 0). Let us remove the space created by ℂ_s from B_R(ℂ_0), so that

B_R(ℂ_0) − ℂ_s = B_L(ℂ_s) ∪ B_R(ℂ_s)  (s > 0).    (15)

From the constructions (10) through (15), the set of points of the bundle B(ℂ) with intersecting planes ℂ_r, ℂ_0, ℂ_s can be written as

B(ℂ) = B_L(ℂ_0) ∪ ℂ_0 ∪ B_R(ℂ_0)
     = B_L(ℂ_r) ∪ ℂ_r ∪ B_R(ℂ_r) ∪ ℂ_0 ∪ B_L(ℂ_s) ∪ ℂ_s ∪ B_R(ℂ_s)    (16)

⟹

B(ℂ) − (ℂ_0 ∪ ℂ_r ∪ ℂ_s) = B_L(ℂ_r) ∪ B_R(ℂ_r) ∪ B_L(ℂ_s) ∪ B_R(ℂ_s)    (17)
The four disconnected sets in (17) can be used to form infinitely many disconnected sets. Due to the removal of the spaces as shown in Fig. 3, it is impossible to draw multilevel contours within and between these four disconnected sets in (17). One of the advantages of multilevel contours is to develop trees of arcs, paths that can be used for the transportation of information between two or more interacting (intersecting) complex planes. The bundle B(ℂ) has parallel planes; however, the contours drawn within each of these planes need not be similar. One can construct a functional mapping such that a contour or an arc drawn in one plane is mapped to a contour or an arc in another plane. But these two sets of contours can be used for transporting information continuously only if the contours in the two or more planes are path-connected (Fig. 4).
FIG. 3 Creation of disconnected sets out of the bundle B(ℂ) due to the slicing and removal of the spaces created by the complex planes ℂ_0, ℂ_r, ℂ_s.
Suppose S_1, S_2, S_3, S_4 are four contours drawn for specific purposes in four different complex planes ℂ_1, ℂ_2, ℂ_3, ℂ_4, respectively. Let S_i be described by z(t_{s_i}) for a_{s_i} ≤ t_{s_i} ≤ a′_{s_i}, where a_{s_i}, a′_{s_i} ∈ ℝ, for i = 1, 2, 3, 4. See Fig. 4. Had these four planes had no intersecting set of points, then one could not construct multilevel contours passing through these planes. Let z(a_{s_i}) = z_1^{S_i} be the starting point and z(a′_{s_i}) = z_2^{S_i} be the ending point of the contour S_i, for i = 1, 2, 3, 4. Suppose each independent contour S_i has specific information stored in it. Information stored in S_1 is transferred to the contour S_2 using a contour T_1. Suppose ℂ_1 ∩ ℂ_2 = {z : z ∈ ℂ_1 and z ∈ ℂ_2}. Then, for the structure of the planes and contours S_1 through S_4 in Fig. 4, we have ℂ_1 ∩ ℂ_2 ≠ ϕ (the empty set) and S_1 ∩ (ℂ_1 ∩ ℂ_2) = ϕ. Let T_1 be a contour drawn from a point in S_1 to a point in S_2 through a set of points for which ℂ_1 ∩ ℂ_2 ≠ ϕ. T_1 is described by z(u_{t_1}) for a_{t_1} ≤ u_{t_1} ≤ a′_{t_1}, where a_{t_1}, a′_{t_1} ∈ ℝ. The starting point of T_1 lies on S_1 and the ending point of T_1 lies on S_2. We call T_1 a transporting contour. Using T_1, the information stored in S_1 and S_2 can be communicated. We will discuss more features of information transfer later. The length of the multilevel contour due to S_1, T_1, S_2, say L[S_1, S_2], is computed as
L[S_1, S_2] = ∫_{δ_{s_1}}^{δ′_{s_1}} |z′[ε_{s_1}(τ)]| ε′_{s_1}(τ) dτ + ∫_{ω_{t_1}}^{ω′_{t_1}} |z′[η_{t_1}(τ)]| η′_{t_1}(τ) dτ + ∫_{δ_{s_2}}^{δ′_{s_2}} |z′[ε_{s_2}(τ)]| ε′_{s_2}(τ) dτ    (18)
Here t_{s_i} = ε_{s_i}(τ) (δ_{s_i} ≤ τ ≤ δ′_{s_i}), for i = 1, 2, 3, 4, is the parametric representation for the contour S_i, with a real-valued function ε_{s_i} mapping [δ_{s_i}, δ′_{s_i}] onto the interval [a_{s_i}, a′_{s_i}], and u_{t_1} = η_{t_1}(τ) (ω_{t_1} ≤ τ ≤ ω′_{t_1}) is the parametric representation for the contour T_1, with a real-valued function η_{t_1} mapping [ω_{t_1}, ω′_{t_1}] onto the interval [a_{t_1}, a′_{t_1}]. Since the total information stored in S_1 and S_2 is exchanged, we have considered the total lengths of S_1 and S_2, even though T_1 could be connected with any point of S_1 and S_2. Since z(a_{t_1}) ∈ S_1 and z(a′_{t_1}) ∈ S_2, and T_1 describes a contour from z(a_{t_1}) to z(a′_{t_1}), the length of the middle integral on the R.H.S. of (18) is not constant. The transportation contour can also be used to transport information from S_2 to S_1. Let us denote this by the contour T′_1. In that case, the starting point of T′_1 lies on S_2 and the ending point of T′_1 lies on S_1. T′_1 is described by z′(u′_{t_1}) for a_{t_1} ≤ u′_{t_1} ≤ a′_{t_1}, where a_{t_1}, a′_{t_1} ∈ ℝ, and u′_{t_1} = η*_{t_1}(τ) (ω_{t_1} ≤ τ ≤ ω′_{t_1}) is the parametric representation for the contour T′_1, with a real-valued function η*_{t_1} mapping [ω_{t_1}, ω′_{t_1}] onto the interval [a_{t_1}, a′_{t_1}]. When we measure the length from S_2 to S_1, say L[S_2, S_1], the orientation of the transportation contour changes, and it is computed as
L[S_2, S_1] = ∫_{δ_{s_1}}^{δ′_{s_1}} |z′[ε_{s_1}(τ)]| ε′_{s_1}(τ) dτ + ∫_{ω′_{t_1}}^{ω_{t_1}} |z′[η_{t_1}(τ)]| η′_{t_1}(τ) dτ + ∫_{δ_{s_2}}^{δ′_{s_2}} |z′[ε_{s_2}(τ)]| ε′_{s_2}(τ) dτ    (19)

The values of the middle integrals in the R.H.S. of (18) and (19) need not be the same unless the following condition is satisfied:

z(a_{t_1}), z′(a′_{t_1}) ∈ S_1 and z′(a_{t_1}), z(a′_{t_1}) ∈ S_2.    (20)
The transportation contour T_2 joins S_2 to S_3, and the transportation contour T′_2 joins S_3 to S_2. The contour T_2 is described by z(u_{t_2}) for a_{t_2} ≤ u_{t_2} ≤ a′_{t_2}, where a_{t_2}, a′_{t_2} ∈ ℝ. The starting point of T_2 is in the set S_2 and the ending point of T_2 is in the set S_3. The function u_{t_2} = η_{t_2}(τ) (ω_{t_2} ≤ τ ≤ ω′_{t_2}) is the parametric representation for the contour T_2, with a real-valued function η_{t_2} mapping [ω_{t_2}, ω′_{t_2}] onto the interval [a_{t_2}, a′_{t_2}]. The contour T′_2 is described by z(u′_{t_2}) for a_{t_2} ≤ u′_{t_2} ≤ a′_{t_2}, where a_{t_2}, a′_{t_2} ∈ ℝ, and the function u′_{t_2} = η*_{t_2}(τ) (ω_{t_2} ≤ τ ≤ ω′_{t_2}) is the parametric representation for the contour T′_2, with a real-valued function η*_{t_2} mapping [ω_{t_2}, ω′_{t_2}] onto the interval [a_{t_2}, a′_{t_2}]. The lengths of these transportation contours can be computed, and

∫_{ω_{t_2}}^{ω′_{t_2}} |z′[η_{t_2}(τ)]| η′_{t_2}(τ) dτ = ∫_{ω′_{t_2}}^{ω_{t_2}} |z′[η*_{t_2}(τ)]| η*′_{t_2}(τ) dτ

if and only if

z(a_{t_2}), z′(a′_{t_2}) ∈ S_2 and z′(a_{t_2}), z(a′_{t_2}) ∈ S_3.
The multilevel contour lengths of S_2 to S_3 and of S_3 to S_2 are computed as

L[S_2, S_3] = ∫_{δ_{s_2}}^{δ′_{s_2}} |z′[ε_{s_2}(τ)]| ε′_{s_2}(τ) dτ + ∫_{ω_{t_2}}^{ω′_{t_2}} |z′[η_{t_2}(τ)]| η′_{t_2}(τ) dτ + ∫_{δ_{s_3}}^{δ′_{s_3}} |z′[ε_{s_3}(τ)]| ε′_{s_3}(τ) dτ    (21)

if and only if

z(a_{t_3}), z′(a′_{t_3}) ∈ S_3 and z′(a_{t_3}), z(a′_{t_3}) ∈ S_4.

The multilevel contour lengths of S_3 to S_4 and of S_4 to S_3 are computed as
L[S_3, S_4] = ∫_{δ_{s_3}}^{δ′_{s_3}} |z′[ε_{s_3}(τ)]| ε′_{s_3}(τ) dτ + ∫_{ω_{t_3}}^{ω′_{t_3}} |z′[η_{t_3}(τ)]| η′_{t_3}(τ) dτ + ∫_{δ_{s_4}}^{δ′_{s_4}} |z′[ε_{s_4}(τ)]| ε′_{s_4}(τ) dτ    (23)

L[S_4, S_3] = ∫_{δ_{s_3}}^{δ′_{s_3}} |z′[ε_{s_3}(τ)]| ε′_{s_3}(τ) dτ + ∫_{ω′_{t_3}}^{ω_{t_3}} |z′[η*_{t_3}(τ)]| η′_{t_3}(τ) dτ + ∫_{δ_{s_4}}^{δ′_{s_4}} |z′[ε_{s_4}(τ)]| ε′_{s_4}(τ) dτ.    (24)
L[S_4, S_1] = Σ_{i=1}^{4} ∫_{δ_{s_i}}^{δ′_{s_i}} |z′[ε_{s_i}(τ)]| ε′_{s_i}(τ) dτ + Σ_{i=1}^{3} ∫_{ω′_{t_i}}^{ω_{t_i}} |z′[η*_{t_i}(τ)]| η′_{t_i}(τ) dτ.    (26)
Such constructions can be extended to several other practical situations arising from the data. Note that the contours T_i and T′_i pass through the line created by ℂ_i ∩ ℂ_{i+1} for i = 1, 2, 3. So the corresponding integrals of the second term of the R.H.S. of (25) represent combined lengths created by the contour T_i traveling from a point in S_i to a point in ℂ_i ∩ ℂ_{i+1} and then traveling from a point in ℂ_i ∩ ℂ_{i+1} to S_{i+1}. Similarly, the integrals of the second term of the R.H.S. of (26) represent combined lengths created by the contour T′_i traveling from a point in S_{i+1} to a point in ℂ_i ∩ ℂ_{i+1} and then traveling from a point in ℂ_i ∩ ℂ_{i+1} to S_i. Next, we will see how this combined integral can be subdivided into smaller integrals while computing the shortest distance.
Since T_i was described by z(u_{t_i}) for a_{t_i} ≤ u_{t_i} ≤ a′_{t_i}, we further partition T_i into the three contours mentioned in the previous paragraph. Suppose z(a^{m_1}_{t_i}) is the point on S_i, for a^{m_1}_{t_i} ∈ [a_{t_i}, a′_{t_i}], that is closest to the point on the line created by ℂ_i ∩ ℂ_{i+1} that is closest to S_{i+1}. The point at which the contour from z(a^{m_3}_{t_i}) joins S_{i+1} is, say, z(a^{m_4}_{t_i}), for a^{m_4}_{t_i} ∈ [a_{t_i}, a′_{t_i}], that is closest to a point on the line created by ℂ_i ∩ ℂ_{i+1}. Here z(a^{m_1}_{t_i}) and z(a^{m_4}_{t_i}) are complex numbers.
L(II : T_i) = ∫_{ω^{I}_{t_i}}^{ω^{II}_{t_i}} |z′[η^{II}_{t_i}(τ)]| η^{II′}_{t_i}(τ) dτ    (28)

L(III : T_i) = ∫_{ω^{II}_{t_i}}^{ω′_{t_i}} |z′[η^{III}_{t_i}(τ)]| η^{III′}_{t_i}(τ) dτ    (29)
Suppose z(a^{M_1}_{t_i}) is the point on S_i, for a^{M_1}_{t_i} ∈ [a_{t_i}, a′_{t_i}], that is farthest from the point on the line created by ℂ_i ∩ ℂ_{i+1} that is farthest from S_{i+1}. The point at which the contour from z(a^{M_3}_{t_i}) joins S_{i+1} is, say, z(a^{M_4}_{t_i}), for a^{M_4}_{t_i} ∈ [a_{t_i}, a′_{t_i}], that is farthest from a point on the line created by ℂ_i ∩ ℂ_{i+1}. The complex numbers z(u_{t_i}) are partitioned as
z(u_{t_i}) = z(u^{IV}_{t_i})  (a_{t_i} ≤ u_{t_i} < a^{IV}_{t_i}),
           z(u^{V}_{t_i})   (a^{IV}_{t_i} ≤ u_{t_i} < a^{V}_{t_i}),
           z(u^{VI}_{t_i})  (a^{V}_{t_i} ≤ u_{t_i} ≤ a′_{t_i}),

such that

(a_{t_i} ≤ u_{t_i} < a^{IV}_{t_i}) ∪ (a^{IV}_{t_i} ≤ u_{t_i} < a^{V}_{t_i}) ∪ (a^{V}_{t_i} ≤ u_{t_i} ≤ a′_{t_i}) = [a_{t_i}, a′_{t_i}].
Let us redefine the function u_{t_i} below to represent the three partitions mentioned above:

u_{t_i} = η^{IV}_{t_i}  (ω_{t_i} ≤ τ < ω^{IV}_{t_i}),
          η^{V}_{t_i}   (ω^{IV}_{t_i} ≤ τ < ω^{V}_{t_i}),
          η^{VI}_{t_i}  (ω^{V}_{t_i} ≤ τ ≤ ω′_{t_i}),

such that

(ω_{t_i} ≤ τ < ω^{IV}_{t_i}) ∪ (ω^{IV}_{t_i} ≤ τ < ω^{V}_{t_i}) ∪ (ω^{V}_{t_i} ≤ τ ≤ ω′_{t_i}) = [ω_{t_i}, ω′_{t_i}].
The three longest distances arising out of the above partitions are, say, L(IV : T_i), L(V : T_i), and L(VI : T_i). These longest distances are given by

L(IV : T_i) = ∫_{ω_{t_i}}^{ω^{IV}_{t_i}} |z′[η^{IV}_{t_i}(τ)]| η^{IV′}_{t_i}(τ) dτ    (31)

L(V : T_i) = ∫_{ω^{IV}_{t_i}}^{ω^{V}_{t_i}} |z′[η^{V}_{t_i}(τ)]| η^{V′}_{t_i}(τ) dτ    (32)

L(VI : T_i) = ∫_{ω^{V}_{t_i}}^{ω′_{t_i}} |z′[η^{VI}_{t_i}(τ)]| η^{VI′}_{t_i}(τ) dτ    (33)
Proof. Suppose we make a copy of the bundle B(ℂ), combined with the positioning of ℂ_0, and place it on the bundle such that these two bundles occupy exactly the same space. Let us call the original bundle with the positioning of ℂ_0 as B_o(ℂ), and its copy as B_c(ℂ). Suppose we tilt B_c(ℂ) to the left such that B_c(ℂ) is inclined at an angle θ, for θ > 0, with the y-axis. See Fig. 5. The points (complex numbers) on the plane ℂ_0 do not change with this tilting, and neither do the points in the space created by B_c(ℂ). Let us consider a plane ℂ_p before tilting, for ℂ_p ∈ B_o(ℂ). The same ℂ_p in B_c(ℂ) is now inclined away at an angle θ. Let us call the copied bundle B_c(ℂ) that is inclined at an angle θ as B_c(ℂ, θ). Each value of ℂ_p that was there when ℂ_p ∈ B_o(ℂ, θ) is still there after
FIG. 5 Angle between two bundles B_o(ℂ) and B_c(ℂ) and rotation of the bundle B_c(ℂ) over the y-axis.
that is associated with X(z_l(t_1), ℂ_l), such that p(z_l(t_1), ℂ_l) describes the probability that X(z_l(t_1), ℂ_l) picks z_{l_1}, for z_{l_1} = z_l(t_1) (t_1 > t_0). The probability function is defined as

p(z_l(t_1), t_1) = Prob[X(z_l(t_1), ℂ_l) = z_{l_1}]  for t_1 > t_0.    (37)

Once X(z_l(t_1), ℂ_l) is generated, one can draw a contour from z_{l_0} to z_{l_1} and compute L(z_{l_0}, z_{l_1}), the length from z_{l_0} to z_{l_1}. The contour γ_l is described by

γ_l = z_l(t)  for t ∈ [t_0, ∞)

and

t = v_l(τ)  (a^l_{00} ≤ τ < a^l_{01})

is the parametric representation for γ_l, with a real-valued function v_l mapping [a^l_{00}, a^l_{01}) onto the interval [t_0, t_1). This gives us

L(z_{l_0}, z_{l_1}) = ∫_{a^l_{00}}^{a^l_{01}} |z′[v_l(τ)]| v′_l(τ) dτ.    (38)
When the random variable X(z_l(t), ℂ_l) chooses a number within the disc created, and that value (number) has already been chosen and was part of the contour, then X(z_l(t), ℂ_l) will choose another number in the disc. This procedure continues until a distinct number is chosen by X(z_l(t), ℂ_l). Such an assumption in Definition 1, or in (39), allows a quicker formation of a multilevel contour. We will later see the consequences of relaxing the assumption in Definition 1. We have

z_{l_1} ∈ D(z_{l_0} : r_0) and z_{l_0} ≠ z_{l_1}.    (40)
Let z_l(t_2) be the value of X(z_l(t_1), ℂ_l) at t_2, for t_2 > t_1 and z_l(t_2) ∈ D(z_{l_1} : r_1) for r_1 > 0, such that

p(z_l(t_2), t_2) = Prob[X(z_l(t_1), ℂ_l) = z_{l_2}]  for t_2 > t_1.

The contour γ_l with a new parametric representation

t = v_l(τ)  (a^l_{01} ≤ τ < a^l_{02})

with a real-valued function v_l mapping [a^l_{01}, a^l_{02}) onto the interval [t_1, t_2), helps us to compute the length L(z_{l_1}, z_{l_2}) from z_{l_1} to z_{l_2} = z_l(t_2), for t_2 > t_1. That is,

p(z_l(t_2), t_2) = Prob[X(z_l(t_2), ℂ_l) = z_l(t_2) | X(z_l(t_0), ℂ_l) = z_{l_0}, X(z_l(t_1), ℂ_l) = z_l(t_1)]

L(z_{l_1}, z_{l_2}) = ∫_{a^l_{01}}^{a^l_{02}} |z′[v_l(τ)]| v′_l(τ) dτ.    (41)

Note that we are not drawing contours from z_{l_0} to z_{l_2}, because X(z_l(t_1), ℂ_l) will change to z_{l_2} for t_2 > t_1. In fact, under the construction explained, the contour will start at z_{l_0} and reach z_{l_2} only through the point z_{l_1}. As these are new ideas, we have explained above the probabilities of picking various values by X(z_l(t_1), ℂ_l). We will slightly redefine the probabilities and their transitions below to accommodate an easier understanding of these concepts.
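The two-step randomness just described (a random radius, then a previously unused point inside the disc) is easy to simulate on a single plane. The toy sketch below uses illustrative distributions for the radius and for the point inside the disc; these choices, and the use of straight segments in place of arcs, are assumptions for illustration only, not the author's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_point(z, r, visited, max_tries=1000):
    """Draw a point uniformly from the disc D(z : r), rejecting values already used (Definition 1)."""
    for _ in range(max_tries):
        rho, theta = r * np.sqrt(rng.random()), 2.0 * np.pi * rng.random()
        w = z + rho * np.exp(1j * theta)
        if w not in visited:
            return w
    raise RuntimeError("could not find an unused point")

def random_contour(z0, n_steps=100):
    """Generate z_{l0}, z_{l1}, ... by two-step randomness; return the points and the total length."""
    points, visited, length = [z0], {z0}, 0.0
    z = z0
    for _ in range(n_steps):
        r = rng.exponential(1.0)          # step 1: a random disc radius (illustrative choice)
        w = next_point(z, r, visited)     # step 2: a random new point inside D(z : r)
        length += abs(w - z)              # straight-segment length between consecutive points
        points.append(w)
        visited.add(w)
        z = w
    return points, length
```

Because each new point depends only on the current point (through the current disc), the simulated sequence has the Markov structure established in Theorem 3 below.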
There are infinitely many options around z_{l_0} that X(z_l(t_1), ℂ_l) can pick during [t_0, t_1], each with a probability p(z_l(t), t) for t ∈ [t_0, t_1]. Let p(z_{l_0}, z_l(t), t), for t ∈ [t_0, t_1], be the probability that X(z_l(t), ℂ_l) = z_{l_0} at t_0 has transitioned to X(z_l(t), ℂ_l) = z_{l_1} during [t_0, t_1]. For all such probabilities of transitions during [t_0, t_1], we will have

∫_{t_0}^{t_1} p(z_{l_0}, z_l(t), t) dt = 1.    (42)

Note that

p(z_{l_0}, z_l(t), t) (t ∈ [t_0, t_2]) = p(z_{l_0}, z_l(t), t) (t ∈ [t_0, t_1]) p(z_{l_1}, z_l(t), t) (t ∈ (t_1, t_2]).    (43)
A direct transition from the complex number z_{l_0} to another complex number z_{l_2} is not possible during [t_0, t_2] under the above framework. By a direct transition we mean here a one-step transition. X(z_l(t), ℂ_l) reaches z_{l_1} during [t_0, t_1], and then, starting at z_{l_1}, it takes the value z_{l_2} during (t_1, t_2]. This implies that p(z_{l_0}, z_l(t), t) (t ∈ [t_0, t_2]) is not possible without hopping over the value X(z_l(t), ℂ_l) = z_{l_1} during [t_0, t_1]. If z_{l_2} is not picked by X(z_l(t), ℂ_l) during (t_1, t_2], then it can be picked at some future time by X(z_l(t), ℂ_l) for t > t_2. So,

p(z_{l_0}, z_l(t), t) (t ∈ [t_0, t_2]) = 0 (one-step transition),  > 0 (two or more step transitions).
Theorem 3. A contour formed by the set of points generated by X(z_l(t), ℂ_l) on ℂ_l, for ℂ_l in the bundle B(ℂ) combined with the ℂ_0 intersecting the bundle, and satisfying Definition 1, will obey the continuous-time Markov property.
Proof. Let γ_l be the contour generated by X(z_l(t), ℂ_l) over the time interval [t_0, ∞). Suppose X(z_l(t_0), ℂ_l) = z_{l_0}. Suppose X(z_l(t), ℂ_l) has taken the value z_{l_1} during t ∈ [t_0, t_1], and X(z_l(t), ℂ_l) has taken the value z_{l_2} during t ∈ (t_1, t_2]. Here z_{l_1} ∈ D(z_{l_0} : r_0) ⊂ ℂ_l for r_0 > 0, and z_{l_2} ∈ D(z_{l_1} : r_1) ⊂ ℂ_l for r_1 > 0. By this construction, we have

p(z_{l_0}, z_{l_2}, t) (t ∈ [t_0, t_2]) = p(z_{l_0}, z_{l_1}, t) (t ∈ [t_0, t_1]) and p(z_{l_1}, z_{l_2}, t) (t ∈ (t_1, t_2])
  = Prob[X(z_l(t), ℂ_l) = z_{l_1} (t ∈ [t_0, t_1]) | X(z_l(t_0), ℂ_l) = z_{l_0}]
    × Prob[X(z_l(t), ℂ_l) = z_{l_2} (t ∈ (t_1, t_2]) | X(z_l(t), ℂ_l) = z_{l_1} (t ∈ [t_0, t_1])].    (44)

Through (44), we can conclude that the random variable X(z_l(t), ℂ_l) (t ∈ [t_0, t_2]) obeys the Markov property during the interval [t_0, t_2]. In (44), the number z_{l_2} is generated within a disc around the number z_{l_1}, but not within the disc with center z_{l_0}. A contour is drawn from z_{l_0} to z_{l_2} only through z_{l_1}. In a similar way, the value of X(z_l(t), ℂ_l) (t ∈ (t_{n−1}, t_n]) is located in the disc D(z_{l_{n−1}} : r_{n−1}) for r_{n−1} > 0, and not in the discs D(z_{l_k} : r_k) for r_k > 0 and k = 0, 1, …, n − 2. That is,

Prob[X(z_l(t), ℂ_l) = z_{l_n} (t ∈ (t_{n−1}, t_n]) | X(z_l(t_0), ℂ_l) = z_{l_0}, X(z_l(t), ℂ_l) = z_{l_1} (t ∈ (t_0, t_1]), …, X(z_l(t), ℂ_l) = z_{l_{n−1}} (t ∈ (t_{n−2}, t_{n−1}])]
  = Prob[X(z_l(t), ℂ_l) = z_{l_n} (t ∈ (t_{n−1}, t_n]) | X(z_l(t), ℂ_l) = z_{l_{n−1}} (t ∈ (t_{n−2}, t_{n−1}])]    (45)

A contour is drawn from z_{l_0} to z_{l_n} only by connecting the numbers (points) through z_{l_1}, z_{l_2}, …, z_{l_{n−1}}. The result in (45) is also true when t_n → ∞. Hence the contour γ_l formed using the numbers generated by X(z_l(t), ℂ_l) (t ∈ [t_0, ∞)) obeys the continuous-time Markov property, i.e., it is a continuous-time Markov chain. □
of intersecting planes (ℂ_l ∩ ℂ_0), or ℂ_p ∩ ℂ_0, or some other similar intersecting planes, it will have the power to generate the next number in two distinct discs, as described above in (47). This feature of X(z_l(t), ℂ_l) helps to form multilevel contours. This feature is summarized below:

X(z_l(t), ℂ_l) = z_{l_b} and z_{l_b} ∈ D(z_{l_a}, r_a, ℂ_0) ⟹ γ_l is a multilevel contour
X(z_l(t), ℂ_l) = z_{l_b} and z_{l_b} ∈ D(z_{l_a}, r_a, ℂ_l) ⟹ γ_l is a contour on ℂ_l    (48)

This rule (48) applies each time the value of X(z_l(t), ℂ_l) falls in an intersection of planes. Once a contour attains the multilevel contour property, it will remain a multilevel contour of that particular X(z_l(t), ℂ_l), even if the value of X(z_l(t), ℂ_l) returns to ℂ_l and remains there forever. The value of the radius at each two-step randomness and the location of the next number to be picked by X(z_l(t), ℂ_l) decide the time taken for a contour to become a multilevel contour (if there is a possibility of becoming one). See Fig. 7. The time interval to reach z_{l_a} from z_{l_0} could be at least one, and it requires at least two time intervals to reach z_{l_b} from z_{l_0} under the framework described above.
Suppose z_{l_b} ∈ D(z_{l_a}, r_a, ℂ_0)°. If z_{l_a} is reached from z_{l_0} in one time interval and z_{l_b} is reached from z_{l_a} in one time interval, then

‖z_{l_b} − z_{l_0}‖ > ‖z_{l_a} − z_{l_0}‖

because

‖z_{l_a} − z_{l_0}‖ + ‖z_{l_a} − z_{l_b}‖ = ‖z_{l_b} − z_{l_0}‖

If it takes more than one time interval to reach z_{l_a} from z_{l_0}, then the length of the contour L(z_{l_0}, z_{l_a}) from z_{l_0} to z_{l_a} is still less than L(z_{l_0}, z_{l_b}) from z_{l_0} to z_{l_b}, because z_{l_b}
FIG. 7 Contours spreading across one or more planes from ℂ_l due to the random environment.
lies in the disc D(z_{l_a}, r_a, ℂ_0)°. Suppose X(z_l(t_0), ℂ_l) = z_{l_0} and X(z_l(t), ℂ_l) reaches z_{l_a} during the nth time interval. Let X(z_l(t), ℂ_l) = z_{l_1} for z_{l_1} ∈ D(z_{l_0} : r_0) during the first time interval [t_0, t_1], X(z_l(t), ℂ_l) = z_{l_2} for z_{l_2} ∈ D(z_{l_1} : r_1) during the second time interval, and so on, with X(z_l(t), ℂ_l) = z_{l_a} for z_{l_a} ∈ D(z_{l_{n−1}} : r_{n−1}) during the nth time interval. Suppose X(z_l(t), ℂ_l) = z_{l_b} for z_{l_b} ∈ D(z_{l_a} : r_a). Suppose γ_l is described by
and the real-valued functions v_{l_{i+1}} mapping (a^l_{0_i}, a^l_{0_{i+1}}], for i = 1, …, n − 2, onto the intervals [t_0, t_1], (t_1, t_2], …, (t_{n−2}, t_{n−1}]. The real-valued function v_{l_0} maps [a^l_{00}, a^l_{01}] onto the interval [t_0, t_1], v_{l_a} maps (a^l_{0_{n−1}}, a^l_{0_a}] onto the interval (t_{n−1}, t_a], and the real-valued function v_{l_b} maps (a^l_{0_a}, a^l_{0_b}] onto the interval (t_a, t_b]. Then
L(z_{l_0}, z_{l_b}) = Σ_{i=1}^{n−1} ∫_{a^l_{0_i}}^{a^l_{0_{i+1}}} |z′[v_{l_i}(τ)]| v′_{l_i}(τ) dτ + ∫_{a^l_{0_{n−1}}}^{a^l_{0_a}} |z′[v_{l_a}(τ)]| v′_{l_a}(τ) dτ + ∫_{a^l_{0_a}}^{a^l_{0_b}} |z′[v_{l_b}(τ)]| v′_{l_b}(τ) dτ > L(z_{l_0}, z_{l_a})    (49)
because the first two terms of the R.H.S. of (49) equal L(z_{l_0}, z_{l_a}). Here

∫_{a^l_{0_i}}^{a^l_{0_{i+1}}} |z′[v_{l_i}(τ)]| v′_{l_i}(τ) dτ ⊂ D(z_{l_i}, r_i)  for i = 1, 2, …, n − 1

∫_{a^l_{0_{n−1}}}^{a^l_{0_a}} |z′[v_{l_a}(τ)]| v′_{l_a}(τ) dτ ⊂ D(z_{l_{n−1}}, r_a, ℂ_l)

∫_{a^l_{0_a}}^{a^l_{0_b}} |z′[v_{l_b}(τ)]| v′_{l_b}(τ) dτ ⊂ D(z_{l_a}, r_a, ℂ_0)

and

D(z_{l_0}, r_0) ∩ D(z_{l_1}, r_1) ∩ … ∩ D(z_{l_{n−2}}, r_{n−2}) ≠ ϕ (empty);
each of these discs is nonempty and they have distinct sets of numbers on ℂ_l. The disc D(z_{l_a}, r_a, ℂ_0) has some elements outside the plane ℂ_l, and

D(z_{l_{n−1}}, r_{n−1}, ℂ_l) ∩ D(z_{l_a}, r_a, ℂ_0) ≠ ϕ (empty).

Suppose it takes infinitely many time intervals to reach z_{l_a} from z_{l_0} (due to the random environment created).
Extending the parametric representation described above, the length of the contour from z_{l_0} to z_{l_a} is

L(z_{l_0}, z_{l_a}) = Σ_{i=1}^{∞} ∫_{a^l_{0_i}}^{a^l_{0_{i+1}}} |z′[v_{l_i}(τ)]| v′_{l_i}(τ) dτ < L(z_{l_0}, z_{l_b})

because

L(z_{l_0}, z_{l_b}) = Σ_{i=1}^{∞} ∫_{a^l_{0_i}}^{a^l_{0_{i+1}}} |z′[v_{l_i}(τ)]| v′_{l_i}(τ) dτ + ∫_{a^l_{0_∞}}^{a^l_{0_b}} |z′[v_{l_b}(τ)]| v′_{l_b}(τ) dτ

and

∫_{a^l_{0_∞}}^{a^l_{0_b}} |z′[v_{l_b}(τ)]| v′_{l_b}(τ) dτ ⊂ D(z_{l_∞}, r_∞, ℂ_0)

⋂_{i=0}^{∞} D(z_{l_i}, r_i) ≠ ϕ (empty).    (50)
Theorem 4. Suppose it takes infinitely many time intervals for X(z_l(t), ℂ_l) to reach z_{l_a} from z_{l_0}. Then the infinitely many discs (uncountable) created while reaching z_{l_a} from z_{l_0} could be nested under the two-step random environment created by X(z_l(t), ℂ_l), and ⋂_{i=0}^{∞} D(z_{l_i} : r_i) ≠ ϕ.
Proof. Suppose X(z_l(t_0), ℂ_l) = z_{l_0}. Let D(z_{l_0}, r_0) be formed out of two-step randomness and z_{l_1} ∈ D(z_{l_0}, r_0). Suppose r_1 is generated randomly such that D(z_{l_1}, r_1) ⊂ D(z_{l_0}, r_0). Further, let r_i be generated randomly such that

Σ_i ∫_{a^l_{0_i}}^{a^l_{0_{i+1}}} |z′[v_{l_i}(τ)]| v′_{l_i}(τ) dτ ⊂ D(z_{l_0}, r_0).
Proof. Let γ_l be a contour described by z_l(t) for t ∈ [t_0, ∞) that has a starting point on the plane ℂ_l, with z_l(t_0) = z_{l_0} (say). The values of z_l(t) can reach other planes, because the bundle B(ℂ) in which ℂ_l lies intersects with ℂ_0. Due to the property (Definition 1), X(z_l(t), ℂ_l) keeps on generating new numbers during [t_0, ∞), and z_{l_a} ≠ z_{l_b} for a ≠ b and a, b ∈ ℝ. At some time γ_l could become a multilevel contour, if the value generated by X(z_l(t), ℂ_l) reaches another plane, say ℂ_p for p ≠ l. Reaching another plane is possible due to the positioning of ℂ_0. Or γ_l could remain a contour on ℂ_l for t ∈ [t_0, ∞).
In Theorem 3, we saw that a contour drawn from an initial number z_{l_0} reaches z_{l_n} only through the distinct numbers z_{l_1}, z_{l_2}, …, z_{l_{n−1}}. The numbers z_{l_1}, z_{l_2}, …, z_{l_{n−1}} were generated by X(z_l(t), ℂ_l). Alternatively, suppose the initial value chosen by X(z_l(t), ℂ_l) is, say, z′_{l_0} for z′_{l_0} ≠ z_{l_0}; then, even if the numbers z_{l_1}, z_{l_2}, …, z_{l_{n−1}} are the same through which a contour up to z_{l_n} is drawn (due
where p(z_{l_a}, z_{l_a}, t), the probability of transition to the same number, is zero. Suppose z_{l_0}, z_{l_1}, z_{l_2}, …, z_{l_n}, … is the set of numbers generated by X(z_l(t), ℂ_l) over time. Let p(z_{l_0}, z_{l_1}, t)^{(1)} represent the probability of reaching z_{l_1} from z_{l_0} in the 1-time interval [t_0, t_1], p(z_{l_0}, z_{l_2}, t)^{(2)} represent the probability of reaching z_{l_2} from z_{l_0} in the 2-time intervals [t_0, t_1], (t_1, t_2], and so on. We note that z_{l_2} cannot be reached directly from z_{l_0} in the 1-time interval [t_0, t_1] or [t_0, t_2], as mentioned previously in the article. So

p(z_{l_1}, z_{l_3}, t)^{(2)} > 0 and p(z_{l_1}, z_{l_3}, t)^{(m)} = 0, for m = 1, 3, 4, 5, …    (52)

The n-time interval transition probabilities between any other two distinct complex numbers on ℂ_l can be expressed as in (51) and (52).
Remark 6. The notation X(z_l(t), ℂ_l), t ∈ [t_0, ∞), is used even if X(z_l(t), ℂ_l) starts generating numbers from different planes after crossing through ℂ_0 after the initial value z_{l_0} was chosen in ℂ_l. Such notation will help identify the origin plane of X(z_l(t), ℂ_l).
Theorem 6. Given X(z_l(t_0), ℂ_l) = z_{l_0}, suppose p(z_{l_a}, z_{l_b}, t)^{(n)} represents the n-time interval transition probability from z_{l_a} to z_{l_b}, where

X(z_l(t), ℂ_l) = z_{l_a} for t ∈ (t_{a−1}, t_a] and
X(z_l(t), ℂ_l) = z_{l_b} for t ∈ (t_{b−1}, t_b],

for t_b > t_a. Then

p(z_{l_a}, z_{l_b}, t)^{(n)}  > 0 if n = t_b − t_a sequential time intervals,
                              = 0 if n ≠ t_b − t_a sequential time intervals.
Proof. The n-time interval transition probability p(z_{l_a}, z_{l_b}, t)^{(n)} is written as

p(z_l(t_a), z_l(t_b), t)^{(n)} (t ∈ (t_a, t_b]) = p(z_{l_a}, z_{l_{a_1}}, t)^{(1)} (t ∈ (t_a, t_{a_1}]) ⋯    (53)

for t_a < t_{a_1} < … < t_{a_{n−1}} < t_b. In (53), z_{l_{a_1}} is generated by X(z_l(t), ℂ_l) during (t_a, t_{a_1}] from the set of numbers of the disc D(z_{l_a}, r_a), z_{l_{a_2}} is generated by X(z_l(t), ℂ_l) during (t_{a_1}, t_{a_2}] from the set of numbers of the disc D(z_{l_{a_1}}, r_{a_1}), and so on; z_{l_b} is generated by X(z_l(t), ℂ_l) during (t_{a_{n−1}}, t_b] from the set of numbers of the disc D(z_{l_{a_{n−1}}}, r_{a_{n−1}}). The numbers z_{l_{a_1}}, z_{l_{a_2}}, …, z_{l_{a_{n−1}}}, z_{l_b} were sequentially generated from the sets of distinct discs

{ D(z_{l_a}, r_a), D(z_{l_{a_1}}, r_{a_1}), …, D(z_{l_{a_{n−1}}}, r_{a_{n−1}}) }    (54)

From (57) and (59) we conclude that p(z_{l_a}, z_{l_b}, t)^{(n)} > 0 when t_b − t_a = n sequential time intervals.
Suppose t_b − t_a ≠ m sequential time intervals. This implies z_{l_b} is generated either through fewer than m sequential time intervals after choosing z_{l_a} by X(z_l(t), ℂ_l), or through more than m sequential time intervals after choosing z_{l_a} by X(z_l(t), ℂ_l). If m = 1, then
Theorem 7. Two contours formed during [t0, ∞) need not be identical but
their lengths could be identical.
Proof. Let γ_l(X) and γ_l(Y) be two contours formed out of the points created by the two processes {X(z_l(t), ℂ_l)}_{t ≥ t_0} and {Y(z_l(t), ℂ_l)}_{t ≥ t_0}, with X(z_l(t_0), ℂ_l) = z_{l_0}(X) and Y(z_l(t_0), ℂ_l) = z_{l_0}(Y). Let γ_l(X) be described by z_X(t), t ∈ [t_0, ∞), and γ_l(Y) be described by z_Y(t), t ∈ [t_0, ∞). The two state spaces corresponding to the two processes are

A_X(ℂ_l) = {z : z = z_{l_0}(X) at X(z_l(t_0), ℂ_l) and X(z_l(t), ℂ_l) (t ∈ (t_0, ∞)) = z for z ∈ B(ℂ), with some order of choosing z, and z_{l_a}(X) ≠ z_{l_b}(X) if a ≠ b, a, b > t_0}

A_Y(ℂ_l) = {z : z = z_{l_0}(Y) at Y(z_l(t_0), ℂ_l) and Y(z_l(t), ℂ_l) (t ∈ (t_0, ∞)) = z for z ∈ B(ℂ), with some order of choosing z, and z_{l_a}(Y) ≠ z_{l_b}(Y) if a ≠ b, a, b > t_0}

Note that γ_l(X) is identical to γ_l(Y) if and only if z_{l_0}(X) = z_{l_0}(Y) and all other z values generated out of the infinite iterations of the two processes are identical, i.e., A_X(ℂ_l) = A_Y(ℂ_l). Since A_X(ℂ_l) and A_Y(ℂ_l) are not disjoint, there is a possibility that {X(z_l(t), ℂ_l)}_{t ≥ t_0} and {Y(z_l(t), ℂ_l)}_{t ≥ t_0} may choose the same numbers during [t_0, ∞). If A_X(ℂ_l) ≠ A_Y(ℂ_l), then γ_l(X) is in any case not identical to γ_l(Y). Given that A_X(ℂ_l) and A_Y(ℂ_l) are available, let L[z_{l_0}(X), z_{l_a}(X)] be the length of the contour from z_{l_0}(X) to z_{l_a}(X), and L[z_{l_a}(X), z_{l_b}(X)] be the length of the contour from z_{l_a}(X) to z_{l_b}(X), such that the length of the contour from z_{l_0}(X) to z_{l_b}(X) is computed as

L[z_{l_0}(X), z_{l_b}(X)] = L[z_{l_0}(X), z_{l_a}(X)] + L[z_{l_a}(X), z_{l_b}(X)].    (60)

Let L[z_{l_0}(Y), z_{l_a}(Y)] be the length of the contour from z_{l_0}(Y) to z_{l_a}(Y), and L[z_{l_a}(Y), z_{l_b}(Y)] be the length of the contour from z_{l_a}(Y) to z_{l_b}(Y), such that the length of the contour from z_{l_0}(Y) to z_{l_b}(Y) is computed as

L[z_{l_0}(Y), z_{l_b}(Y)] = L[z_{l_0}(Y), z_{l_a}(Y)] + L[z_{l_a}(Y), z_{l_b}(Y)].    (61)

In (60) and (61), it is assumed that z_{l_0}(X) ≠ z_{l_0}(Y), z_{l_a}(X) ≠ z_{l_a}(Y), and z_{l_b}(X) ≠ z_{l_b}(Y). Suppose

L[z_{l_0}(X), z_{l_a}(X)] ≠ L[z_{l_0}(Y), z_{l_a}(Y)] and L[z_{l_a}(X), z_{l_b}(X)] ≠ L[z_{l_a}(Y), z_{l_b}(Y)]

such that

L[z_{l_0}(Y), z_{l_a}(Y)] − L[z_{l_0}(X), z_{l_a}(X)] = L[z_{l_a}(X), z_{l_b}(X)] − L[z_{l_a}(Y), z_{l_b}(Y)].    (62)

By (62), we conclude that L[z_{l_0}(X), z_{l_b}(X)] = L[z_{l_0}(Y), z_{l_b}(Y)]. Since t_a and t_b are arbitrary, one can extend the result to other contour distances. □
Theorem 8. Two state spaces A_X(ℂ_l) and A_Y(ℂ_l) that are identical need not imply that the corresponding contours are identical.
Proof. Suppose the two state spaces A_X(ℂ_l) and A_Y(ℂ_l) are identical. This implies the states in A_X(ℂ_l) and A_Y(ℂ_l) are the same. If the order of the states generated by {X(z_l(t), ℂ_l)}_{t ≥ t_0} and {Y(z_l(t), ℂ_l)}_{t ≥ t_0} is also the same, then the two contours γ_l(X) and γ_l(Y) are identical.
Suppose z_{l_0}(X) = z_{l_0}(Y), but the randomness has resulted in a distinct order of the states in A_X(ℂ_l) and A_Y(ℂ_l), such that

z_{l_a}(X) = z_{l_b}(Y) and z_{l_b}(X) = z_{l_a}(Y)

for some arbitrary t_a and t_b. This implies γ_l(X) is no longer identical to γ_l(Y). □
Σ_{n=1}^{∞} p(z_{l_a}(Y), z_{l_a}(Y), t)^{(n)} (t ∈ [t_0, ∞)) = 0.
Suppose the real-valued function v_l(X, τ) maps [a^l_0, ∞) onto the interval [t_0, ∞); then

L(γ_l, X) = ∫_{a^l_0}^{∞} |z′[v_l(X, τ)]| v′_l(X, τ) dτ
         = ∫_{a^l_0}^{∞} |z′[v_l(Y, τ)]| v′_l(Y, τ) dτ
         = L(γ_l, Y)

where the real-valued function v_l(Y, τ) maps [a^l_0, ∞) onto the interval [t_0, ∞). Looking only at the integral expressions used for L(γ_l, X) or L(γ_l, Y), we are unable to tell whether a contour γ_l(X) has traveled to any other planes beyond ℂ_l. The symbol l in γ_l(X) stands for the plane from which this contour has originated, and X stands for X(z_l(t), ℂ_l), indicating the random variable responsible for generating the data required to form γ_l(X). Suppose we consider infinitely many random variables of the type X(z_l(t), ℂ_l) satisfying two conditions: (i) each of these works nondisjointly, such that they may choose an initial value that was chosen by a different random variable, and (ii) each of these random variables chooses an initial value that is distinct from the others, such that the number of initial values equals the number of random variables. Let α be the number of distinct initial values satisfying condition (i), such that α is less than the cardinality of ℂ_l, and let X′ be the index random variable. Then the total length of all the contours originated by all the random variables of condition (i) is

L(γ_l, X′, α) = Σ_α ∫_{a^l_0}^{∞} |z′[v_l(X′, α, τ)]| v′_l(X′, α, τ) dτ dα    (63)

Let β represent the distinct initial values in condition (ii) due to distinct random variables within condition (ii); then the total length of all the contours generated due to condition (ii) is

L(γ_l, X′, β) = Σ_β ∫_{a^l_0}^{∞} |z′[v_l(X′, β, τ)]| v′_l(X′, β, τ) dτ dβ.    (64)
Proof. Let us consider infinitely many random variables within condition (i). Let an arbitrary one be X′(z_l(t), ℂ_l, α) (t ∈ [t_0, ∞)), for α as in condition (i) described above. Let X′(z_l(t_0), ℂ_l, α) = z_{l_0}(t_0, X′, α). The set of discs formed due to each X′(z_l(t), ℂ_l, α) is infinite. Each point on the plane could be the origin of a contour on ℂ_l. This implies there is a possibility that

⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_0}, r_0, X′, α) = ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_0}, r_0, X′, β)    (65)

⋃_α ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, α) = ⋃_β ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, β),

else, if at least one such r_a that was chosen randomly is different in conditions (i) and (ii), then

⋃_α ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, α) ≠ ⋃_β ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, β). □
We have

⋃_β ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, β) = ℂ_l

because z_{l_a} ∈ ⋃_β ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, β) implies z_{l_a} ∈ ℂ_l for each z_{l_a}, and z_{l_a} ∈ ℂ_l implies z_{l_a} ∈ ⋃_β ⋃_{z_{l_a} ∈ ℂ_l} D(z_{l_a}, r_a, X′, β). For condition (ii), within every disc there are infinitely many points of other contours, whereas such an assertion is not possible for the discs generated under condition (i). There is no chance of forming an isolated disc under the two-step randomness procedure, and the Markov property derived earlier still holds for the discs formed under these two conditions. For a general description of the continuous-time Markov property, refer, for example, to Good (1961), Bhat and Deshpande (1986), Chen (1991), Gani and Stals (2005), and Goswami and Rao (2006).
Remark 7. Under a random environment, the possibility of having identical r_a values in each iteration of X′(z_l(t), ℂ_l, α) for infinitely many time intervals is very small. So the chances of the equalities below can be treated as a rare event:

D(z_{l_a}, r_a, X′, α) = D(z_{l_a}, r_a, X′, β)  for t ∈ (t_a, t_{a′}]
D(z_{l_{a′}}, r_{a′}, X′, α) = D(z_{l_{a′}}, r_{a′}, X′, β)  for t ∈ (t_{a′}, t_{a″}]
⋮
D(z_{l_b}, r_b, X′, α) = D(z_{l_b}, r_b, X′, β)  for t ∈ (t_b, t_{b′}]
⋮
Remark 8. When we relax the assumption in (39) and Definition 1 to allow the possibility of X(z_l(t), ℂ_l) choosing the same state after that state has been chosen earlier by X(z_l(t), ℂ_l), then each state in A_X(ℂ_l) becomes recurrent. For a recurrent state, the return to the state is certain, even if it takes a very large number of time intervals. We can draw many contours like γ_l(X), γ_l(Y), etc. Each contour will have its starting point, or origin, depending upon the initial value chosen by the random variable responsible for generating the data required. A thick forest of contours can be formed from infinitely many random variables. A family of infinitely many random variables of the type X(z_l(t), ℂ_l) could form a forest of contours. Let this family be F_l
for a real-valued function v_{l_{d_1}} mapping [a^l_{00}, a_0^{l_{d_1}}] onto the interval [t_0, t_{d_1}]. The set of elements on this contour is the set z_l(t) (t ∈ [t_0, t_{d_1}]), denoted by γ_l(X, t) (t ∈ [t_0, t_{d_1}]). The remaining elements in the bundle are
where B(ℂ)|_{t=t_{d_1}} indicates the space of B(ℂ) that was available at t = t_{d_1}, and
Suppose t_{d_1} is randomly chosen, and the rest of the time intervals are fixed so as to maintain interval lengths equal to t_{d_1} − t_0. The time intervals have constant length, but the piecewise contour lengths in these intervals need not be identical, because the contour formation depends on the two-step randomness and the corresponding disc formations. Suppose the process of removal continues after t_{d_1}, and let ϕ be the rate of removal of elements from B(ℂ); then this can be expressed with a differential equation

dB(ℂ)/dt = B(ℂ) − ϕ(X) B(ℂ).    (66)
A constant rate of removal of elements is difficult to imagine, because within each of the time intervals

{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], …}

the number of elements to be removed depends on the lengths of the contour formed during these intervals. These contour lengths are

∫_{a_0^{l_{d_1}}}^{a_0^{l_{d_2}}} |z′[v_{l_{d_2}}(τ)]| v′_{l_{d_2}}(τ) dτ,
∫_{a_0^{l_{d_2}}}^{a_0^{l_{d_3}}} |z′[v_{l_{d_3}}(τ)]| v′_{l_{d_3}}(τ) dτ,    (67)
⋮
We know that the two-step randomness creates discs at each iteration, and the space occupied by these discs on B(ℂ) need not be identical. That means the lengths of contours formed during {[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], …} need not be identical. The quantity ϕ can only be estimated retrospectively from the data on the sets of elements created by the piecewise contours within the intervals {[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], …}. So a better way to express the dynamics due to the removal of elements from B(ℂ), through the removal of piecewise contours, is

dB(ℂ)/dt = B(ℂ) − ϕ(X, t) B(ℂ),    (68)
Over time, (68) will produce the dynamics within the bundle B(ℂ). The total number of elements inside B(ℂ) keeps decreasing due to the removal of piecewise contours (this can be treated as a death rate of data on piecewise contours). The questions that remain to be understood here are the following: if the rate of removal of contours is faster than the formation of the contours (a possibility exists), then does the removal rate become an instantaneous rate? What if the contour γ_l(X, t) is forming continuously, such that it is spreading into infinitely many planes of B(ℂ), and we start removing the space created by γ_l(X, t); what do the dynamics of B(ℂ) then look like?
The rate of removal of $\gamma_l(X, t)$ in an interval will be zero if no contour data is available for that interval. The removal of contour data resumes as soon as contour data becomes available; this also implies the removal process could be temporarily discontinued. By the set-up of the time intervals used for removing contours, the removal rate might be higher than the formation rate, lower, or the two might be identical. First, an interval of time is decided, and whatever portion of the contour lies within this interval, that set of points (numbers) is removed. If no contour data is available within the chosen time interval, the removal process halts temporarily and resumes once data on contours becomes available. It is difficult to model a closed form for $\phi(t)$, because it depends on the time interval used for removal and on the length of contour formed by the process $\{X(z_l(t), \mathbb{C}_l)\}_{t_0}^{t}$ through the two-step randomness. The lengths that will be removed during these intervals are shown in (67). At a given $t_M > t_0$, the length of $\gamma_l(X, t_M)$ $(t_M \in (t_0, \infty))$ formed until $t_M$ could be larger than the sum of the above interval lengths, or could be equal; that is,
$$\gamma_l(X, t_M)\,(t_M \in (t_0, \infty)) \;
\begin{cases}
= \displaystyle\int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau & (\text{whenever } \phi(t_M) = 0),\\[2ex]
> \displaystyle\int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau & (\text{otherwise}).
\end{cases} \tag{69}$$
The strict reverse inequality
$$\gamma_l(X, t_M)\,(t_M \in (t_0, \infty)) \;<\; \int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau$$
is impossible. Whenever
$$\gamma_l(X, t)\,(t_M \in (t_0, \infty)) \;=\; \int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau \tag{70}$$
at $t = t_b$ (say), then during $(t_b, t_{b'}]$, for $(t_b, t_{b'}] = (t_0, t_{d_1}]$, the amount of data removed could be equal to the amount of $\gamma_l(X, t)$ available during $(t_{d_b}, t_{d_{b'}}]$. Also, when (70) is true at $t = t_b$, then
$$\frac{dB(\mathbb{C})}{dt} = 0. \tag{71}$$
Satisfying (71) does not indicate that the removal process has attained a stationary or steady-state solution. As noted earlier, after attaining (71) at some $t > t_0$, the removal resumes as soon as a new piece of contour in $\gamma_l(X, t)$ is formed.
Proof. The removal process of the data generated by $\gamma_l(X, t)$ continues even after $\frac{dB(\mathbb{C})}{dt} = 0$ at some $t \in (t_0, \infty)$. The amount of $\phi(X, t)$ after $\frac{dB(\mathbb{C})}{dt} = 0$ depends on the length of $\gamma_l(X, t)$ available just after attaining $\frac{dB(\mathbb{C})}{dt} = 0$, and it could be smaller than the set of data points generated by the piece of contour formed after
$$\gamma_l(X, t)\,(t_M \in (t_0, \infty)) = \int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau$$
for a chosen $\epsilon > 0$. Hence, the rate of removal of the space of data in $B(\mathbb{C})$ can never attain stability as long as the contour formation process $\{X(z_l(t), \mathbb{C}_l)\}_{t_0}^{t}$ continues. □
Theorem 11. Suppose $R_M$ is an upper bound such that the length of the contour removed satisfies
$$\int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} z[v_{l_{d_{b'}}}(\tau)]\, v'_{l_{d_{b'}}}(\tau)\, d\tau \le R_M \tag{72}$$
for an arbitrary interval $(t_{d_b}, t_{d_{b'}}]$. Then such an $R_M$ does not exist for all the intervals of the type $(t_{d_b}, t_{d_{b'}}]$.
Proof. Suppose the quantity $R_M$ exists for all intervals of the type $(t_{d_b}, t_{d_{b'}}]$ such that (72) is true. This implies that for any given arbitrary interval $(t_{d_c}, t_{d_{c'}}]$, whether it occurred prior to $(t_{d_b}, t_{d_{b'}}]$ or after it, the length of the contour whose data is to be removed does not exceed $R_M$. Such an assertion is true only if $R_M \to \infty$, and not for a finite $R_M$, because the piece of the contour $\gamma_l(X, t)$ whose data is to be removed depends on the length of the contour that is available, and there is no upper limit on the length of the contour that can be formed. This contradicts the assumption that an $R_M$ satisfying (72) can be attained. The set of data created by the length
$$\int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau$$
could reach $\gamma_l(X, t)$ from the left (or from below) once or more than once. The contour formation and the corresponding removal process, once initiated, will continue forever. □
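Theorem 11 says that no single bound $R_M$ works for every interval. The following toy simulation, under assumed distributions chosen only for illustration (an exponential radius in the first step and a uniform point inside the disc in the second step), shows how the piecewise lengths produced by a two-step random mechanism fluctuate without an obvious uniform cap.

```python
import numpy as np

rng = np.random.default_rng(7)

def two_step_piecewise_lengths(n_intervals=20, start=0 + 0j):
    """Toy two-step randomness: at each step draw a random radius r_k (step 1),
    then a uniform point inside the disc of radius r_k around the current point
    (step 2). The piecewise 'contour length' of the step is the chord length."""
    z = start
    lengths = []
    for _ in range(n_intervals):
        r = rng.exponential(scale=1.0)        # random disc radius
        theta = rng.uniform(0.0, 2 * np.pi)   # random direction inside the disc
        rho = r * np.sqrt(rng.uniform())      # uniform point in the disc
        z_next = z + rho * np.exp(1j * theta)
        lengths.append(abs(z_next - z))
        z = z_next
    return np.array(lengths)

lengths = two_step_piecewise_lengths()
# The maximum piecewise length keeps fluctuating as more intervals are simulated,
# consistent with the nonexistence of a uniform R_M in (72).
print(lengths.max(), lengths.mean())
```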
One can also use a different strategy to remove a space of data points formed by $\gamma_l(X, t)$ for $t \in [t_0, t_b]$. Suppose we assume $\phi(X, t)$ follows a certain parametric form to decide the number of elements of the set $z_l(t)$ on $\gamma_l(X, t)$ over various time intervals, say $\{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots\}$. We will know the length of $\gamma_l(X, t)$ at $t_b$, which is
$$\int_{a_0^{l_0}}^{a_0^{l_{d_b}}} z[v_{l_{d_b}}(\tau)]\, v'_{l_{d_b}}(\tau)\, d\tau. \tag{73}$$
So we choose the removal rate of the set of data created on this contour up to $t_b$ such that $\phi(t)$ in each interval of $\{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots\}$ is less than the corresponding piece of the contour formed during $\{[t_b, t_{b'}], (t_{b'}, t_{b''}], \ldots\}$. Note that these two sets of intervals need not have the same interval lengths. The intervals used to form $\gamma_l(X, t)$ emerge out of the two-step randomness. At $t_b$, we first form
$$D(z_{l_b}, r_b) \tag{74}$$
using $r_b$, and $\{X(z_l(t), \mathbb{C}_l)\}_{t_0}^{t}$ chooses $z_{l_{b'}}$ from (74). If we choose $\phi(X, t)$ such that
$$\phi(X, t) = \psi(X, t)\int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} z[v_{l_{d_{b'}}}(\tau)]\, v'_{l_{d_{b'}}}(\tau)\, d\tau, \quad t \in (t_b, t_{b'}], \tag{75}$$
then
$$\frac{dB(\mathbb{C})}{dt} = B(\mathbb{C}) - \left[\psi(X, t)\int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} z[v_{l_{d_{b'}}}(\tau)]\, v'_{l_{d_{b'}}}(\tau)\, d\tau\right] B(\mathbb{C}), \quad t \in (t_b, t_{b'}],$$
and for each interval we can choose a $\psi(t)$, or it could be a constant value in $(0, 1)$. We ensure through (75) that the set of numbers on $\gamma_l(X, t)$ removed during $(t_b, t_{b'}]$ is smaller than the set of numbers formed on the contour
during $(t_b, t_{b'}]$. In this way, the data of the contour $\gamma_l(X, t)$ remaining unremoved is at least the set of data points required to draw the distance (73).
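Here is a minimal sketch of the strategy around (75), assuming the per-interval formed lengths are already available as numbers and that $\psi$ is a constant in $(0, 1)$; the data below are hypothetical.

```python
import numpy as np

def capped_removal(formed_lengths, psi=0.6):
    """For each time interval, remove psi * (length formed in that interval),
    so removal stays strictly below formation when 0 < psi < 1, as in (75).
    Returns per-interval removed amounts and the cumulative unremoved excess."""
    formed = np.asarray(formed_lengths, dtype=float)
    removed = psi * formed
    excess = np.cumsum(formed - removed)  # contour data left unremoved
    return removed, excess

# Hypothetical contour lengths formed in successive intervals (t_b, t_b'], (t_b', t_b''], ...
formed = [2.3, 0.0, 1.1, 4.2, 0.7]  # an interval with no new contour data contributes 0
removed, excess = capped_removal(formed)
# 'excess' is nondecreasing, so the removal process never exhausts the contour data,
# mirroring the claim that phi(t) = 0 is never reached under this strategy.
```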
Similarly, the dynamics in the bundle due to the removal of the set of numbers (data points) removed during $(t_{b'}, t_{b''}]$ is
$$\frac{dB(\mathbb{C})}{dt} = B(\mathbb{C}) - \left[\psi(X, t)\int_{a_0^{l_{d_{b'}}}}^{a_0^{l_{d_{b''}}}} z[v_{l_{d_{b''}}}(\tau)]\, v'_{l_{d_{b''}}}(\tau)\, d\tau\right] B(\mathbb{C}), \quad t \in (t_{b'}, t_{b''}].$$
Through the strategy explained here, the removal of the space over a long period of time can be approximated by
$$\frac{dB(\mathbb{C})}{dt} = B(\mathbb{C}) - \left[\psi(X, t)\int_{a_0^{l_0}}^{a_0^{l_\infty}} |z[v_{l_\infty}(\tau)]|\, v'_{l_\infty}(\tau)\, d\tau \;\; (t \in (t_b, t_\infty))\right] B(\mathbb{C}), \quad t \in [t_0, \infty). \tag{76}$$
The differential equation (76) gives an approximation of the overall dynamics generated in the various intervals $\{[t_b, t_{b'}], (t_{b'}, t_{b''}], \ldots\}$. The amount of space removed would never reach a situation where $\phi(t) = 0$ in these differential equations, because the data points due to the length of the contour in (73) will still be in excess. Through the differential equation (76) we made sure that $\gamma_l(X, t_M)$ $(t_M \in (t_0, \infty))$ of (69) satisfies
$$\gamma_l(X, t_M)\,(t_M \in (t_0, \infty)) > \int_{a_0^{l_0}}^{a_0^{l_{d_1}}} z[v_{l_{d_1}}(\tau)]\, v'_{l_{d_1}}(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_i}}}^{a_0^{l_{d_{i+1}}}} z[v_{l_{d_{i+1}}}(\tau)]\, v'_{l_{d_{i+1}}}(\tau)\, d\tau \tag{77}$$
if the removal process follows $\phi(X, t)$ of (75). There are no specific advantages if (70) holds, unless we have difficulties with the discontinuity of the removal process.
Suppose we want to introduce infinitely many random processes to generate contours as in (63) and (64) and then initiate corresponding removal processes. Describing the space of the data lost in $B(\mathbb{C})$ over a period of time intervals, and constructing such sets, would require careful consideration of the contour formation and removal processes. For the sake of understanding the dynamics in $B(\mathbb{C})$ due to these multiple contour formation and removal processes, let us introduce a second process $\{Y(z_l(t), \mathbb{C}_l)\}_{t_c}^{t}$ for $t_c > t_b$. Recall that when $\{X(z_l(t), \mathbb{C}_l)\}_{t_0}^{t}$ reached $t = t_b$, we introduced the removal process of $\gamma_l(X, t)$. This implies that $\{Y(z_l(t), \mathbb{C}_l)\}_{t_c}^{t}$ is introduced $t_c - t_b$ time units after a removal process of $\gamma_l(X, t)$ was initiated, and $t_c$ time units after the process $\{X(z_l(t), \mathbb{C}_l)\}_{t_0}^{t}$ was introduced in the bundle $B(\mathbb{C})$. At the time of introduction
FIG. 9 Nonavailability of the space for $\{Y(z_l(t), \mathbb{C}_l)\}$ in a disc after the time $t_c$ due to the removal of a piece of the contour $\gamma_l(X, t)$. The shaded region cannot be reached from $z_{l_0}$ within a disc $D(z_{l_0}(Y), r_0(Y))$.
choose $z_{l_1}(Y)$, but a contour passing from $z_{l_0}(Y)$ to $z_{l_1}(Y)$, for $z_{l_1}(Y) \in D(z_{l_0}(Y), r_0(Y))$, cannot be drawn. The removal process has caused the disc $D(z_{l_0}(Y), r_0(Y))$ to be written as the union of the three sets below:
$$D(z_{l_0}(Y), r_0(Y)) = S_1[D(z_{l_0}(Y), r_0(Y))] \cup S_2[D(z_{l_0}(Y), r_0(Y))] \cup S_3[D(z_{l_0}(Y), r_0(Y))]. \tag{79}$$
Here $S_1$, $S_2$, and $S_3$ are disjoint. A point (number) to which a contour can be drawn from $z_{l_0}(Y)$ is located within the set $S_2[D(z_{l_0}(Y), r_0(Y))]$. Similarly, let $z_{l_a}(Y)$ be an arbitrary point, available to draw a contour from a previous iteration, that was chosen by $\{Y(z_l(t), \mathbb{C}_l)\}$ at some $t$. Using $z_{l_a}(Y)$ we can draw a disc $D(z_{l_a}(Y), r_a(Y))$ such that
$$D(z_{l_a}(Y), r_a(Y)) = S_1[D(z_{l_a}(Y), r_a(Y))] \cup S_2[D(z_{l_a}(Y), r_a(Y))] \cup S_3[D(z_{l_a}(Y), r_a(Y))].$$
If $\gamma_l(X, t) \cap D(z_{l_a}(Y), r_a(Y)) = \emptyset$, then
$$S_1[D(z_{l_a}(Y), r_a(Y))] = \emptyset \quad \text{and} \quad S_3[D(z_{l_a}(Y), r_a(Y))] = \emptyset,$$
and
$$D(z_{l_a}(Y), r_a(Y)) = S_2[D(z_{l_a}(Y), r_a(Y))].$$
Any point in $D(z_{l_a}(Y), r_a(Y))$ can then be randomly chosen by $\{Y(z_l(t), \mathbb{C}_l)\}$ such that a contour can be drawn within $D(z_{l_a}(Y), r_a(Y))$. Suppose in a given $D(z_{l_a}(Y), r_a(Y))$ the next iteration point, say $z_{l_{a'}}(Y)$, for $z_{l_{a'}}(Y) \in D(z_{l_a}(Y), r_a(Y))$, lies such that a direct contour from $z_{l_a}(Y)$ to $z_{l_{a'}}(Y)$ cannot be drawn, due to a deleted space of the contour $\gamma_l(X, t)$. Suppose there exists some space outside the deleted space of $\gamma_l(X, t)$ within $D(z_{l_a}(Y), r_a(Y))$ so that piecewise arcs can be drawn from $z_{l_a}(Y)$ to $z_{l_{a'}}(Y)$. In such situations $Y(z_l(t), \mathbb{C}_l)$ will generate a set of points around the deleted space to draw piecewise arcs joining $z_{l_a}(Y)$ to $z_{l_{a'}}(Y)$, whose distance is
$$\int_{a_0^{l_{d_a}}}^{a_0^{l_{d_{a'}}}} z[v_{l_{d_{a'}}}(Y, \tau)]\, v'_{l_{d_{a'}}}(\tau)\, d\tau \tag{80}$$
$$+ \sum_{i=1}^{k} \int_{a_0^{l_{a(i)}}}^{a_0^{l_{a(i+1)}}} z[v_{l_{a(i+1)}}(Y, \tau)]\, v'_{l_{a(i+1)}}(\tau)\, d\tau + \int_{a_0^{l_{a(k+1)}}}^{a_0^{l_{d_{a'}}}} z[v_{l_{d_{a'}}}(Y, \tau)]\, v'_{l_{d_{a'}}}(\tau)\, d\tau. \tag{81}$$
This distance is always maintained between the new contour formation and the removal location of this contour, so that the removal rate never becomes zero.
Suppose $z_{l_{g(1)}}$ is the point chosen in $D(z_{l_g}(Y), r_g(Y)) \subset \mathbb{C}_l$, where
$$z_{l_{g(1)}}(Y) \in S_2[D(z_{l_g}(Y), r_g(Y))]$$
and
$$D(z_{l_g}(Y), r_g(Y)) = S_1[D(z_{l_g}(Y), r_g(Y))] \cup S_2[D(z_{l_g}(Y), r_g(Y))] \cup S_3[D(z_{l_g}(Y), r_g(Y))]. \tag{82}$$
The length from $z_{l_g}(Y)$ to $z_{l_{g(1)}}(Y)$ is
$$\int_{a_0^{l_g}}^{a_0^{l_{g(1)}}} z[v_{l_{g(1)}}(Y, \tau)]\, v'_{l_{g(1)}}(\tau)\, d\tau. \tag{83}$$
We denote the removal rate of $Y(z_l(t), \mathbb{C}_l)$ by $\phi(Y, t)$. The value of $\phi(Y, t)$ during $(t_g, t_{g(1)}]$ is expressed using (83) as
$$\phi(Y, t) = \psi(Y, t)\int_{a_0^{l_g}}^{a_0^{l_{g(1)}}} z[v_{l_{g(1)}}(Y, \tau)]\, v'_{l_{g(1)}}(\tau)\, d\tau,$$
for $0 < \psi(Y, t) < 1$, and the value of $\phi(Y, t)$ during $(t_g, t_\infty]$ is expressed as
$$\phi(Y, t) = \psi(Y, t)\int_{a_0^{l_g}}^{a_0^{l_\infty}} |z[v_{l_\infty}(Y, \tau)]|\, v'_{l_\infty}(\tau)\, d\tau.$$
The dynamics in the bundle $B(\mathbb{C})$ due to the removal of the set of points in $\gamma_l(X, t)$ $(t \in [t_b, \infty))$ and in $\gamma_l(Y, t)$ $(t \in [t_g, \infty))$ described above can be divided into the four parts below:
(i) Removal of data points due to the removal process introduced on contour $\gamma_l(X, t)$,
(ii) Removal of data points due to the removal process introduced on contour $\gamma_l(Y, t)$,
(iii) Removal of data points in the set $\gamma_l(X, t) \cap \gamma_l(Y, t)$, for $\gamma_l(X, t) \cap \gamma_l(Y, t) \neq \emptyset$, due to the removal process introduced on contour $\gamma_l(X, t)$,
(iv) Removal of data points in the set $\gamma_l(X, t) \cap \gamma_l(Y, t)$, for $\gamma_l(X, t) \cap \gamma_l(Y, t) \neq \emptyset$, due to the removal process introduced on contour $\gamma_l(Y, t)$.
Let $\phi(X, t)$ and $\phi(Y, t)$ represent the removal rates for the points purely on $\gamma_l(X, t)$ and $\gamma_l(Y, t)$, respectively, and not on $\gamma_l(X, t) \cap \gamma_l(Y, t) \neq \emptyset$. Let $\phi_3(X, t)$ represent the removal rate for the points on $\gamma_l(X, t) \cap \gamma_l(Y, t) \neq \emptyset$ for the contour initiated by $X(z_l(t), \mathbb{C}_l)$, and $\phi_4(Y, t)$ the removal rate for the points on $\gamma_l(X, t) \cap \gamma_l(Y, t) \neq \emptyset$ for the contour initiated by $Y(z_l(t), \mathbb{C}_l)$. The dynamics in the bundle $B(\mathbb{C})$ due to the four parts above is expressed through the differential equation
$$\begin{aligned}
\frac{dB(\mathbb{C})}{dt} &= B(\mathbb{C}) - \phi(X, t)B(\mathbb{C}) - \phi(Y, t)B(\mathbb{C}) - \phi_3(X, t)B(\mathbb{C}) - \phi_4(Y, t)B(\mathbb{C})\\
&= B(\mathbb{C}) - \left[\psi(X, t)\int_{a_0^{l_0}}^{a_0^{l_\infty}} |z[v_{l_\infty}(X, \tau)]|\, v'_{l_\infty}(X, \tau)\, d\tau\right]B(\mathbb{C})\\
&\quad - \left[\psi(Y, t)\int_{a_0^{l_g}}^{a_0^{l_\infty}} |z[v_{l_\infty}(Y, \tau)]|\, v'_{l_\infty}(Y, \tau)\, d\tau\right]B(\mathbb{C}) - \psi_3(X, t)B(\mathbb{C}) - \cdots
\end{aligned}$$
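A small numerical sketch of the four-part dynamics above, assuming each of the four removal rates is available as a function of time and that $B(\mathbb{C})$ is again summarized by a single scalar; the constant rates in the example are placeholders, not values from the chapter.

```python
import numpy as np

def simulate_two_contour_removal(B0, phi_X, phi_Y, phi_3, phi_4,
                                 t0=0.0, t_end=5.0, dt=0.01):
    """Euler sketch of the four-part removal dynamics
        dB/dt = B - phi_X(t) B - phi_Y(t) B - phi_3(t) B - phi_4(t) B,
    where each phi_* is a time-dependent removal rate (callable t -> rate)."""
    times = np.arange(t0, t_end, dt)
    B = np.empty_like(times)
    B[0] = B0
    for i in range(1, len(times)):
        t = times[i - 1]
        total_removal = (phi_X(t) + phi_Y(t) + phi_3(t) + phi_4(t)) * B[i - 1]
        B[i] = B[i - 1] + dt * (B[i - 1] - total_removal)
    return times, B

# Hypothetical constant rates purely for illustration.
times, B = simulate_two_contour_removal(
    B0=50.0,
    phi_X=lambda t: 0.4, phi_Y=lambda t: 0.3,
    phi_3=lambda t: 0.1, phi_4=lambda t: 0.1,
)
```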
FIG. 10 Islands of deleted data due to the removal process on infinitely many contours.
initiated at the same time $t_0$. By construction, they have different origins in $\mathbb{C}_l$. Then the set of complex numbers $\{z_l(X_a, t)\}$ $(t \in [t_0, \infty))$ used in the formation of $\gamma_l(X_a, t)$ and the set of numbers used in the formation of $\gamma_l(X_b, t)$ could have a nonempty intersection. That is,
$$\{z_l : z_l \in B(\mathbb{C}),\ \{z_l(X_a, t)\}\,(t \in (t_0, \infty)),\ z_{l_0}(X_a) \in \mathbb{C}_l\} \cap \{z_l : z_l \in B(\mathbb{C}),\ \{z_l(X_b, t)\}\,(t \in (t_0, \infty)),\ z_{l_0}(X_b) \in \mathbb{C}_l\} \neq \emptyset \tag{87}$$
or
$$\{z_l : z_l \in B(\mathbb{C}),\ \{z_l(X_a, t)\}\,(t \in (t_0, \infty)),\ z_{l_0}(X_a) \in \mathbb{C}_l\} \cap \{z_l : z_l \in B(\mathbb{C}),\ \{z_l(X_b, t)\}\,(t \in (t_0, \infty)),\ z_{l_0}(X_b) \in \mathbb{C}_l\} \neq \emptyset. \tag{88}$$
Any two contours that have different initial values need not be disjoint. If every $z_l(X_a, t)$ for every $t \in (t_0, \infty)$ has no overlap with any element of $\{z_l(X_b, t)\}$ $(t \in (t_0, \infty))$, then that could be purely due to the random environment created in Section 3. Either of these contours, or both, could be multilevel contours with origins in $\mathbb{C}_l$. The formation of multilevel contours and the randomness at $\mathbb{C}_0 \cap \mathbb{C}_l$ described earlier remain the same. Two contours might have points of intersection within $B(\mathbb{C})$, but such points of intersection need not behave like the common points of $\mathbb{C}_0 \cap \mathbb{C}_l \subset B(\mathbb{C})$. This means the set of points on
$$\gamma_l(X_a, t) \cap \gamma_l(X_b, t),$$
for which
$$\gamma_l(X_a, t) \cap \gamma_l(X_b, t) \neq \emptyset,$$
cannot be used for changing the plane of the contours. However, the set of points on
$$\gamma_l(X_a, t) \cap \gamma_l(X_b, t) \cap \mathbb{C}_0 \tag{89}$$
for any two arbitrary random variables $X_a(z_l(t), \mathbb{C}_l)$ and $X_b(z_l(t), \mathbb{C}_l)$ could behave similarly to the points on $\mathbb{C}_0 \cap \mathbb{C}_l$. The points on $\mathbb{C}_0 \cap \mathbb{C}_p$ for any arbitrary $\mathbb{C}_p \in B(\mathbb{C})$ will have properties, similar to those described in Section 3, for forming a multilevel contour.
Suppose
$$\gamma_l(X_a, t) \cap \gamma_l(X_b, t) \subset \mathbb{C}_p \tag{90}$$
for some arbitrary plane $\mathbb{C}_p \in B(\mathbb{C})$, and $\gamma_l(X_a, t)$ and $\gamma_l(X_b, t)$ have origins in $\mathbb{C}_l$.
Suppose (90) is satisfied at $t = t_d$; then a disc $D(z_{l_d}(X_a), r_d(X_a))$ with center $z_{l_d}(X_a)$ and radius $r_d(X_a)$ is formed such that
$$D(z_{l_d}(X_a), r_d(X_a)) \subset \mathbb{C}_p, \tag{91}$$
and the next iteration point of $\gamma_l(X_a, t)$ after $z_{l_d}(X_a)$ lies in $\mathbb{C}_p$ and not in $\mathbb{C}_0 \cap \mathbb{C}_p$. Suppose $z_{l_{d'}}(X_a)$ is the point generated after $z_{l_d}(X_a)$, for $z_{l_{d'}}(X_a) \in D(z_{l_d}(X_a), r_d(X_a))$; then
$$z_{l_{d'}}(X_a) \in \mathbb{C}_p \quad \text{and} \quad z_{l_{d'}}(X_a) \notin \mathbb{C}_0 \cap \mathbb{C}_p.$$
A contour drawn during $[t_d, t_{d'}]$ to reach $z_{l_{d'}}(X_a)$ from $z_{l_d}(X_a)$, with the distance
$$\int_{a_0^{l_d}}^{a_0^{l_{d'}}} z[u_{l_{d'}}(X_a, \tau)]\, u'_{l_{d'}}(X_a, \tau)\, d\tau, \tag{92}$$
lies on $\mathbb{C}_p$. Here the real-valued function $u_{l_{d'}}(X_a, \tau)$ maps $[a_0^{l_d}, a_0^{l_{d'}})$ onto the interval $[t_d, t_{d'}]$. Because $z_{l_d}(X_a)$ lies on $\gamma_l(X_a, t)$ satisfying (90), it could contribute in the next step to forming $\gamma_l(X_a, t)$ or $\gamma_l(X_b, t)$. In either situation, the distance in (92) lies in $\mathbb{C}_p$. Hence, a point in $B(\mathbb{C})$ that lies in $\mathbb{C}_0 \cap \mathbb{C}_p$ for some arbitrary $\mathbb{C}_p$ has two options for producing a new point on the contour to continue the contour formation. The description of the formation of contours at the intersection of $\gamma_l(X_a, t)$ and $\gamma_l(X_b, t)$ also holds if there are more than two intersecting contours. We also note that the infinitely many contours up to time $t_d$, which were all introduced at $t_0$, could have different lengths, based on the areas of the discs formed and the points chosen by the corresponding random variables. So the set of lengths
$$\left\{\int_{a_0^{l_d}}^{a_0^{l_{d'}}} z[u_{l_{d'}}(X_a, \tau)]\, u'_{l_{d'}}(X_a, \tau)\, d\tau,\; \int_{a_0^{l_d}}^{a_0^{l_{d'}}} z[u_{l_{d'}}(X_b, \tau)]\, u'_{l_{d'}}(X_b, \tau)\, d\tau,\; \ldots \right\} \tag{93}$$
could occupy different spaces in $B(\mathbb{C})$. The location of each contour after some long time $t_\infty$, for $t_\infty \gg t_{d'}$, could be anywhere in the bundle, and the contours could be situated in any plane. The set of lengths
$$\left\{\int_{a_0^{l_d}}^{a_0^{l_\infty}} |z[u_{l_\infty}(X_a, \tau)]|\, u'_{l_\infty}(X_a, \tau)\, d\tau,\; \int_{a_0^{l_d}}^{a_0^{l_\infty}} |z[u_{l_\infty}(X_b, \tau)]|\, u'_{l_\infty}(X_b, \tau)\, d\tau,\; \ldots \right\} \tag{94}$$
is ever evolving within $B(\mathbb{C})$. For a point $z_{l_{d(2)}}(X_a) \in \mathbb{C}_p$ with $z_{l_{d(2)}}(X_a) \notin \mathbb{C}_l \cap \mathbb{C}_p$, the equality
$$\int_{a_0^{l_d}}^{a_0^{l_{d(1)}}} z[u_{l_{d(1)}}(X_a, \tau)]\, u'_{l_{d(1)}}(X_a, \tau)\, d\tau + \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} z[u_{l_{d(2)}}(X_a, \tau)]\, u'_{l_{d(2)}}(X_a, \tau)\, d\tau = \int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} z[u_{l_{d^*(2)}}(X_a, \tau)]\, u'_{l_{d(2)}}(X_a, \tau)\, d\tau \tag{95}$$
holds for $z_{l_d}(X_a) \in \mathbb{C}_0 \cap \mathbb{C}_l$ and $z_{l_{d(1)}}(X_a) \in \mathbb{C}_0 \cap \mathbb{C}_p$. For all such $z_{l_d}(X_b) \in \mathbb{C}_p$,
$$\int_{a_0^{l_d}}^{a_0^{l_{d(1)}}} z[u_{l_{d(1)}}(X_b, \tau)]\, u'_{l_{d(1)}}(X_b, \tau)\, d\tau + \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} z[u_{l_{d(2)}}(X_b, \tau)]\, u'_{l_{d(2)}}(X_b, \tau)\, d\tau = \int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} z[u_{l_{d^*(2)}}(X_b, \tau)]\, u'_{l_{d(2)}}(X_b, \tau)\, d\tau \tag{96}$$
holds for $z_{l_d}(X_b) \in \mathbb{C}_0 \cap \mathbb{C}_l$ and $z_{l_{d(1)}}(X_b) \in \mathbb{C}_0 \cap \mathbb{C}_p$. From (95) and (96), we have
$$\int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} z[u_{l_{d^*(2)}}(X_a, \tau)]\, u'_{l_{d(2)}}(X_a, \tau)\, d\tau > \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} z[u_{l_{d(2)}}(X_a, \tau)]\, u'_{l_{d(2)}}(X_a, \tau)\, d\tau \tag{97}$$
and
$$\int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} z[u_{l_{d^*(2)}}(X_b, \tau)]\, u'_{l_{d(2)}}(X_b, \tau)\, d\tau > \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} z[u_{l_{d(2)}}(X_b, \tau)]\, u'_{l_{d(2)}}(X_b, \tau)\, d\tau, \tag{98}$$
because the multilevel contours $\gamma_l(X_a, t)$ and $\gamma_l(X_b, t)$, whose distances are in (95) and (96), have to pass through the plane $\mathbb{C}_0 \cap \mathbb{C}_p$. For all sets of three numbers of the type $z_{l_d}(X_b)$, $z_{l_{d(1)}}(X_b)$, and $z_{l_{d(2)}}(X_b)$ lying in $\mathbb{C}_0 \cap \mathbb{C}_l$, $\mathbb{C}_0 \cap \mathbb{C}_p$, and $\mathbb{C}_p$, respectively, for arbitrary $\mathbb{C}_l$ and $\mathbb{C}_p$ in $B(\mathbb{C})$, the equality
$$\begin{aligned}
&\sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} z[u_{l_{d(i)}}(X_b, \tau)]\, u'_{l_{d(i)}}(X_b, \tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d(i)}}}^{a_0^{l_{d(i+1)}}} z[u_{l_{d(i+1)}}(X_b, \tau)]\, u'_{l_{d(i+1)}}(X_b, \tau)\, d\tau \\
&\quad = \sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} z[u_{l_{d^*(i)}}(X_b, \tau)]\, u'_{l_{d(i)}}(X_b, \tau)\, d\tau > \sum_{i=1}^{\infty} \int_{a_0^{l_{d(i)}}}^{a_0^{l_{d(i+1)}}} z[u_{l_{d(i+1)}}(X_b, \tau)]\, u'_{l_{d(i+1)}}(X_b, \tau)\, d\tau \tag{99}
\end{aligned}$$
holds, where the real-valued function $u_{l_{d^*}}$ maps the appropriate time intervals according to the parametric representation. For all such sets of five points in the bundle and for the three sets of planes, we will have
$$\begin{aligned}
&\sum_{i=1}^{\infty}\int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} z[u_{l_{d(i)}}(X_a,\tau)]\, u'_{l_{d(i)}}(X_a,\tau)\, d\tau + \sum_{i=1}^{\infty}\int_{a_0^{l_{d(1+i)}}}^{a_0^{l_{d(2+i)}}} z[u_{l_{d(2)}}(X_a,\tau)]\, u'_{l_{d(2)}}(X_a,\tau)\, d\tau \\
&\quad + \sum_{i=1}^{\infty}\int_{a_0^{l_{d(2)}}}^{a_0^{l_{d(3)}}} z[u_{l_{d(3)}}(X_a,\tau)]\, u'_{l_{d(3)}}(X_a,\tau)\, d\tau + \sum_{i=1}^{\infty}\int_{a_0^{l_{d(3+i)}}}^{a_0^{l_{d(4+i)}}} z[u_{l_{d(4)}}(X_a,\tau)]\, u'_{l_{d(4)}}(X_a,\tau)\, d\tau \\
&\quad + \sum_{i=1}^{\infty}\int_{a_0^{l_{d(4+i)}}}^{a_0^{l_{d(5+i)}}} z[u_{l_{d(5)}}(X_a,\tau)]\, u'_{l_{d(5)}}(X_a,\tau)\, d\tau = \sum_{i=1}^{\infty}\int_{a_0^{l_d}}^{a_0^{l_{d(5+i)}}} z[u_{l_{d^*(5+i)}}(X_a,\tau)]\, u'_{l_{d(5+i)}}(X_a,\tau)\, d\tau \tag{102}
\end{aligned}$$
Example 1. Let $\{z_l(X_a, t)\}$ be the set of numbers of $\gamma_l(X_a, t)$ for $t \in [t_0, t_b]$ that were deleted due to a removal process. Then $X_a(z_l(t), \mathbb{C}_l)$ would not be able to choose a number from $\{z_l(X_a, t)\}$ for $t \in [t_0, t_b]$. There could be many such holes in $B(\mathbb{C})$. When two or more contours use the same set of points for a period of time, the removal process of one contour could delete the common set of data so that a hole is formed. We will soon see that the space created by the set $H$ is dynamic.
Example 2. Consider a disc $D(z_{l_d}(X_a), r_d(X_a))$ with a center $z_{l_d}(X_a)$ chosen by $X_a(z_l(t), \mathbb{C}_l)$ from a previous iteration. Let two contours $\gamma_l(X_b, t)$ and $\gamma_l(X_c, t)$ pass through $D(z_{l_d}(X_a), r_d(X_a))$ and intersect at two locations, say $z_{l_i}(X_b)$ and $z_{l_j}(X_b)$, as shown in Fig. 11. Suppose the spaces of points of $\gamma_l(X_b, t)$ and $\gamma_l(X_c, t)$ passing through $D(z_{l_d}(X_a), r_d(X_a))$ were lost due to the respective variables' removal processes. Then the space formed between these two points of intersection, including the data on the contours between $z_{l_i}(X_b)$ and $z_{l_j}(X_b)$, is an island.
The two sets $H$ and $S$ are dynamic, as the spaces created by these sets could change because of the dynamic nature of the contour formation and removal processes described in Section 3. Time-dependent versions of the definitions for holes and islands can be given here. A set $S(t)$, for $t \in [t_0, \infty)$, satisfying Definition 3 can be called an island at $t$. The set of elements of $S(t_c)$, for $t_c \in [t_0, \infty)$, satisfying Definition 3 might lose all its elements in a removal process and might turn into a hole at a time $t_d$ for $t_d > t_c$. If $S(t)$ is an island, then one cannot draw a contour from the elements of $S(t)$ to an element of $S^c(t)$, where
$$S^c(t) = \{z : z \notin S(t) \subset B(\mathbb{C}) \text{ and } z \in B(\mathbb{C})\}. \tag{103}$$
Similarly, a contour cannot be drawn from an element (point) of $S^c(t)$ to an element of $S(t)$. The spaces of $S(t)$ and $S^c(t)$ are separated by $H$. The area of a hole could change over time as more removed data points are added to a specific hole.
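The reachability idea behind islands and holes can be illustrated with a toy test: given a deleted set (a "hole") sampled as points, check whether a direct contour between two points avoids it. The hole, the tolerance, and the points below are hypothetical choices made only for this sketch.

```python
import numpy as np

def direct_contour_possible(z_start, z_end, hole, tol=0.05, n=200):
    """Check whether a straight-line contour from z_start to z_end (complex numbers)
    stays outside a deleted set 'hole' (an array of complex points), i.e., whether
    every sampled point on the segment is farther than 'tol' from the hole.
    This mirrors the idea that a point reachable from z_start must lie in S2."""
    ts = np.linspace(0.0, 1.0, n)
    path = z_start + ts * (z_end - z_start)
    dists = np.abs(path[:, None] - hole[None, :])
    return bool(np.all(dists.min(axis=1) > tol))

# Hypothetical deleted strip across a disc: points with real part near 0.
hole = np.array([1j * y for y in np.linspace(-1.0, 1.0, 50)])
print(direct_contour_possible(-0.8 + 0.1j, 0.8 - 0.1j, hole))   # False: strip blocks it
print(direct_contour_possible(-0.8 + 0.1j, -0.2 + 0.5j, hole))  # True: same side
```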
Theorem 13. Suppose a disc $D(z_{l_d}(X_a), r_d(X_a))$ formed out of $X_a(z_l(t), \mathbb{C}_l)$ is given. Let $z_{l_1}$ and $z_{l_2}$ be two points in the boundary set of $D$. A contour $C_1$ formed by an arbitrary process $X_b(z_l(t), \mathbb{C}_l)$ enters the disc through $z_{l_1}$ and leaves the disc from $z_{l_2}$, and another contour $C_2$ formed by an arbitrary process $X_c(z_l(t), \mathbb{C}_l)$ enters the disc through $z_{l_1}$ and leaves the disc from $z_{l_2}$. The paths of $C_1$ and $C_2$ never meet except at the points $z_{l_1}$ and $z_{l_2}$, and the center $z_{l_d}(X_a)$ lies in between the contours. Suppose the removal processes of the two contours $C_1$ and $C_2$ are introduced such that $C_1 \cup C_2$ forms a hole at $t$. Then the set of points lying between $C_1$ and $C_2$ forms an island.
Proof. It is given that $z_{l_1}$ and $z_{l_2}$ are located on the boundary of the disc $D(z_{l_d}(X_a), r_d(X_a))$ and that $z_{l_d}(X_a)$ is located in between $C_1$ and $C_2$. See Fig. 12 for a description of the given information and the locations of $z_{l_1}$ and $z_{l_2}$.
Let $S(t)$ be the set in $D(z_{l_d}(X_a), r_d(X_a))$ that consists of all points in between $C_1$ and $C_2$, as shown in Fig. 12. The set $D(z_{l_d}(X_a), r_d(X_a)) \setminus S(t)$ consists of the points in (104),
$$D(z_{l_d}(X_a), r_d(X_a)) \setminus S(t) = \{z_l : z_l \in D(z_{l_d}(X_a), r_d(X_a)) \text{ and } z_l \notin S(t)\}. \tag{104}$$
The disc $D(z_{l_d}(X_a), r_d(X_a))$ can be partitioned into the disjoint union of three sets as below:
$$D(z_{l_d}(X_a), r_d(X_a)) = [D(z_{l_d}(X_a), r_d(X_a)) \setminus S(t)] \cup S(t) \cup [C_1 \cup C_2]. \tag{105}$$
By the construction, we cannot draw a contour from a point in $S(t)$ to a point in $D(z_{l_d}(X_a), r_d(X_a)) \setminus S(t)$. Hence, $S(t)$ is an island. □
Theorem 14. The union of the collection of all holes within $B(\mathbb{C})$,
$$\bigcup_{\alpha} H_{\alpha}, \tag{106}$$
is compact, and
$$\bigcup_{\alpha} H_{\alpha} < \int_{a_0^{l_d}}^{a_0^{l_{d(5+i)}}} z[u_{l_{d(5+i)}}(X_a, \tau)]\, u'_{l_{d(5+i)}}(X_a, \tau)\, d\tau. \tag{107}$$
Proof. Suppose there is an $S(t)$ at time $t$. Let there be finitely many contours passing through the region $S(t)$ such that all the points of $S(t)$ lie on one or more of the contours. Once a removal process is introduced at $t_c > t$, all the points of $S(t)$ become unavailable to a new random variable. Hence, $S(t)$ asymptotically becomes a hole. □
Let us introduce a removal process for the infinitely many contours introduced earlier in this section. These contours were all started at the same time $t_0$. The number of such contours is equivalent to the number of elements in $\mathbb{C}_l$. Some of the elements in $\mathbb{C}_l$ are also in $\mathbb{C}_0 \cap \mathbb{C}_l$. Note that
$$\mathbb{C}_l = (\mathbb{C}_0 \cap \mathbb{C}_l) \cup \mathbb{C}_l = \{z_l : z_l \in \mathbb{C}_l \text{ or } z_l \in \mathbb{C}_0 \cap \mathbb{C}_l\}. \tag{108}$$
So removing contour data that was formed from $t_0$ until a time $t_c$ at a rate $\phi(X_a, t)$, for an arbitrary random variable $X_a(z_l(t), \mathbb{C}_l)$, would also remove contour data located in the set $\mathbb{C}_0 \cap \mathbb{C}_l$. We described earlier how islands of sets of data could be formed, and the formation of islands could happen in distinct time intervals once a removal process is introduced.
Remember that all the contours have their origin in $\mathbb{C}_l$ only. Now, at $t_c$, due to a removal process, the points of $\mathbb{C}_l$ are not available to be chosen by any of the infinitely many random variables. The paths of these random variables are not disjoint. Some of these random variables may be in another plane outside $\mathbb{C}_l$ at $t = t_c$. Contour formation by these infinitely many random variables of the type $X_a(z_l(t), \mathbb{C}_l)$ may continue even after $t_c$ because of their presence outside $\mathbb{C}_l$ at $t = t_c$. So the entire plane $\mathbb{C}_l$ is not available in the bundle $B(\mathbb{C})$, and the remaining space formed is the set of points $B(\mathbb{C}) \setminus \mathbb{C}_l$, where
$$B(\mathbb{C}) \setminus \mathbb{C}_l = \{z_l : z_l \in B(\mathbb{C}) \text{ and } z_l \notin \mathbb{C}_l\}. \tag{109}$$
where $B(\mathbb{C}, \alpha_2) \subset B(\mathbb{C})$ is the set of planes which are above $\mathbb{C}_l$, and $B(\mathbb{C}, 1-\alpha_2) \subset B(\mathbb{C})$ is the set of planes which are below $\mathbb{C}_l$. The set of contours that are active after $t_c$ is located in the set, say, $B(\mathbb{C}, 1-\alpha_2)$, and from (113) we have
$$B(\mathbb{C}, 1-\alpha_2) = B(\mathbb{C})|_{t>t_c}. \tag{115}$$
As described above, the contour formations and removal processes of the contours in $B(\mathbb{C}, 1-\alpha_2)$ will continue. Due to the presence of the hole $\mathbb{C}_l$, the active contours of $B(\mathbb{C}, \alpha_2)$ and $B(\mathbb{C}, 1-\alpha_2)$ will not have any further intersecting points. The tails of the contours remaining in $B(\mathbb{C}, \alpha_2)$ and $B(\mathbb{C}, 1-\alpha_2)$ will eventually be lost some time after $t_c$. Let $\gamma_l(X_a, t)$ be an arbitrary contour that is active in $B(\mathbb{C}, \alpha_2)$ and was created by $X_a$. Suppose the set of points touched by the contour $\gamma_l(X_a, t)$ prior to $t_c$ were located in $B(\mathbb{C}, \alpha_2)$, $\mathbb{C}_l$, and $B(\mathbb{C}, 1-\alpha_2)$; this contour is described by $z_l(X_a, t)$ $(t \in [t_0, \infty))$, and $t = u_{l_c}(X_a, \tau)$ $(a_0 \le \tau \le a_c)$ is the parametric representation for $\gamma_l(X_a, t)$ with a real-valued function $u_l(X_a, \tau)$ mapping $[a_0, a_c]$ onto the interval $[t_0, t_c]$. Let $L(\gamma_l(X_a, t)(t \in [t_0, t_c]))$ represent the length of $\gamma_l(X_a, t)$ $(t \in [t_0, t_c])$ up to $t_c$; then
$$L(\gamma_l(X_a, t)(t \in [t_0, t_c])) = \int_{a_0}^{a_c} |z[u_{l_c}(X_a, \tau)]|\, u'_{l_c}(X_a, \tau)\, d\tau \tag{116}$$
had covered points from each disjoint set of $B(\mathbb{C})$ in (114). Let us assume that $\gamma_l(X_a, t)(t \in [t_0, t_c])$ has visited each of the sets of (114) multiple times before remaining active in $B(\mathbb{C}, \alpha_2)$ at $t = t_c$. Then $L(\gamma_l(X_a, t)(t \in [t_0, t_c]))$ in (116) can be expressed as three components, where each component is made up of several contour integrals. Since $\gamma_l(X_a, t)$ has visited each portion in (116) several times, the length $L(\gamma_l(X_a, t)(t \in [t_0, t_c]))$ in (116) is distributed into corresponding parts. The first part consists of the sum of all the lengths of the piecewise contours of (116) lying in $\mathbb{C}_l$, say $L(\gamma_l(X_a, t, \alpha_1)(t \in [t_0, t_c]))$, and can be computed using
$$L(\gamma_l(X_a, t, \alpha_1)(t \in [t_0, t_c])) = \int_{a(0)}^{a(1)} z[u_{l_{a(1)}}(X_a, \tau)]\, u'_{l_{a(1)}}(X_a, \tau)\, d\tau + \sum_{i \in A(\alpha_1)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau, \tag{117}$$
where $u_{l_{a(1)}}(X_a, \tau)$ and $u_{l_{a(i+1)}}(X_a, \tau)$ are the real-valued functions used in the parametric representations with corresponding onto mappings. The notation $i \in A(\alpha_1)$ indicates summing the lengths over all the piecewise contours in the set $A(\alpha_1)$. The set $A(\alpha_1)$ consists of all the piecewise contours of $X_a(z_l(t), \mathbb{C}_l)$ until $t_c$ that lie in $\mathbb{C}_l$. The first integral on the R.H.S. of (117) is the length of the piecewise contour from its origin to the entry
until $t_c$ that lie in $B(\mathbb{C}, \alpha_2)$. The second integral on the R.H.S. of (118) consists of the length of the last piece of the contour $\gamma_l(X_a, t)$ until $t = t_c$ in $B(\mathbb{C}, \alpha_2)$. Here $u_{l_c}(X_a, \tau)$ is the real-valued function used in the parametric representations with corresponding onto mappings. The third part of (116) consists of the piecewise contours in $B(\mathbb{C}, 1-\alpha_2)$, whose total length, say $L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c]))$, is computed as
$$L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c])) = \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau. \tag{119}$$
Hence the length in (116) can be expressed using (117), (118), and (119) as
$$\begin{aligned}
L(\gamma_l(X_a, t)(t \in [t_0, t_c])) &= \int_{a(0)}^{a(1)} z[u_{l_{a(1)}}(X_a, \tau)]\, u'_{l_{a(1)}}(X_a, \tau)\, d\tau + \sum_{i \in A(\alpha_1)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(\alpha_2)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau + \int_{a_{c-1}}^{a_c} |z[u_{l_c}(X_a, \tau)]|\, u'_{l_c}(X_a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau. \tag{120}
\end{aligned}$$
Due to the hole $\mathbb{C}_l$ created in the bundle $B(\mathbb{C})$, the remaining length of the contour that will be subjected to the removal process is obtained by removing the sum of the piecewise contour lengths in $\mathbb{C}_l$, and is given by
$$\begin{aligned}
L(\gamma_l(X_a, t)(t \in [t_0, t_c])) &= \sum_{i \in A(\alpha_2)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau + \int_{a_{c-1}}^{a_c} |z[u_{l_c}(X_a, \tau)]|\, u'_{l_c}(X_a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau. \tag{121}
\end{aligned}$$
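A sketch of the bookkeeping in (117)–(121), assuming each piecewise contour has already been assigned a region label and a length; the labels and numbers below are illustrative placeholders.

```python
def lengths_by_region(pieces):
    """Decompose a contour's total length into per-region sums, mirroring the
    bookkeeping in (117)-(121): each piecewise contour carries a region label
    and a length.

    pieces : list of (region_label, length) tuples; labels here are illustrative
             stand-ins for C_l, B(C, alpha2), and B(C, 1 - alpha2).
    """
    sums = {}
    for region, length in pieces:
        sums[region] = sums.get(region, 0.0) + length
    total = sum(sums.values())
    # Once the plane C_l becomes a hole, its pieces are dropped from the length
    # that remains subject to removal (compare (120) with (121)).
    remaining = total - sums.get("C_l", 0.0)
    return sums, total, remaining

# Hypothetical piecewise lengths produced by a contour visiting the three regions.
pieces = [("C_l", 0.7), ("alpha2", 1.2), ("1-alpha2", 0.5), ("alpha2", 0.3), ("C_l", 0.2)]
sums, total, remaining = lengths_by_region(pieces)
```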
Since $\gamma_l(X_a, t)$ is active in $B(\mathbb{C}, \alpha_2)$, the formation of the contour will continue forever, and the sum of the pieces of the lengths of $\gamma_l(X_a, t)$ that lie in $B(\mathbb{C}, 1-\alpha_2)$ will be deleted from $B(\mathbb{C}, 1-\alpha_2)$. This deletion could proceed according to a removal function $\phi(X_a, t, 1-\alpha_2)$, similar to the procedure explained in (75). The differential equation modeling the space of points lost due to a removal process $\phi(X_a, t, 1-\alpha_2)$ is
$$\left.\frac{dB(\mathbb{C}, 1-\alpha_2)}{dt}\right|_{\phi(X_a, t, 1-\alpha_2)} = B(\mathbb{C}, 1-\alpha_2)|_{t=t_c} - \phi(X_a, t)\, B(\mathbb{C}, 1-\alpha_2)|_{t=t_c} \tag{122}$$
for
$$\phi(X_a, t, 1-\alpha_2) = \psi(X_a, t, 1-\alpha_2) \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[u_{l_{a(i+1)}}(X_a, \tau)]\, u'_{l_{a(i+1)}}(X_a, \tau)\, d\tau, \quad t \in (t_c, t_{c'}], \tag{123}$$
with $0 < \psi(X_a, t, 1-\alpha_2) < 1$. Suppose instead that $\gamma_l(X_a, t)$ is active in $B(\mathbb{C}, 1-\alpha_2)$ rather than in $B(\mathbb{C}, \alpha_2)$. As above, the set of points touched by the contour $\gamma_l(X_a, t)$ prior to $t_c$ would have been located in $B(\mathbb{C}, \alpha_2)$, $\mathbb{C}_l$, and $B(\mathbb{C}, 1-\alpha_2)$. We describe this contour by $z_l(X_a, t)$ $(t \in [t_0, \infty))$, and $t = w_{l_c}(X_a, \tau)$ $(a'_0 \le \tau \le a'_c)$ is the parametric representation for $\gamma_l(X_a, t)$ with a real-valued function $w_{l_c}(X_a, \tau)$ mapping $[a'_0, a'_c]$ onto the interval $[t_0, t_c]$. Let $L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c]))$ represent the length of $\gamma_l(X_a, t)$ $(t \in [t_0, t_c])$ up to $t_c$; then
$$L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c])) = \int_{a'_0}^{a'_c} |z[w_{l_c}(X_a, \tau)]|\, w'_{l_c}(X_a, \tau)\, d\tau \tag{124}$$
had covered points from each disjoint set of $B(\mathbb{C})$ in (114). As above, let us assume that $\gamma_l(X_a, t)(t \in [t_0, t_c])$ has visited each of the sets of (114) multiple times before remaining active in $B(\mathbb{C}, 1-\alpha_2)$ at $t = t_c$. Then $L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c]))$ in (124) can be expressed as three components, where each component is made up of several contour integrals. Since $\gamma_l(X_a, t)$ has visited each portion in (124) several times, the length $L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c]))$ in (124) is distributed into corresponding parts. The first part consists of the sum of all the lengths of the piecewise contours of (124) lying in $\mathbb{C}_l$, say $L(\gamma_l(X_a, t, \alpha_1)(t \in [t_0, t_c]))$, and can be computed using
$$L(\gamma_l(X_a, t, \alpha_1)(t \in [t_0, t_c])) = \int_{a'(0)}^{a'(1)} z[w_{l_{a'(1)}}(X_a, \tau)]\, w'_{l_{a'(1)}}(X_a, \tau)\, d\tau + \sum_{i \in A'(\alpha_1)} \int_{a'(i)}^{a'(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau, \tag{125}$$
where $w_{l_{a'(1)}}(X_a, \tau)$ and $w_{l_{a'(i+1)}}(X_a, \tau)$ are the real-valued functions used in the parametric representations.
Note that the contour is active in $B(\mathbb{C}, 1-\alpha_2)$. The set $A'(\alpha_2)$ in (126) consists of all the piecewise contours of $X_a(z_l(t), \mathbb{C}_l)$ until $t_c$ that lie in $B(\mathbb{C}, 1-\alpha_2)$. The second integral on the R.H.S. of (126) consists of the length of the last piece of the contour $\gamma_l(X_a, t)$ until $t = t_c$ in $B(\mathbb{C}, 1-\alpha_2)$. Here $w_{l_c}(X_a, \tau)$ is the real-valued function used in the parametric representations with corresponding onto mappings. The third part of (124) consists of the piecewise contours in $B(\mathbb{C}, \alpha_2)$, whose total length, say $L(\gamma_l(X_a, t, \alpha_2)(t \in [t_0, t_c]))$, is computed as
$$L(\gamma_l(X_a, t, \alpha_2)(t \in [t_0, t_c])) = \sum_{i \in A'(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau. \tag{127}$$
The length in (124) can be expressed using (125), (126), and (127) as
$$\begin{aligned}
L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c])) &= \int_{a'(0)}^{a'(1)} z[w_{l_{a'(1)}}(X_a, \tau)]\, w'_{l_{a'(1)}}(X_a, \tau)\, d\tau + \sum_{i \in A'(\alpha_1)} \int_{a'(i)}^{a'(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(\alpha_2)} \int_{a'(i)}^{a'(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau + \int_{a'_{c-1}}^{a'_c} z[w_{l_c}(X_a, \tau)]\, w'_{l_c}(X_a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau. \tag{128}
\end{aligned}$$
Due to the hole $\mathbb{C}_l$ created in the bundle $B(\mathbb{C})$, the remaining length of the contour that will be subjected to the removal process is obtained by removing the sum of the piecewise contour lengths in $\mathbb{C}_l$, and is given by
$$\begin{aligned}
L(\gamma_l(X_a, t, 1-\alpha_2)(t \in [t_0, t_c])) &= \sum_{i \in A'(\alpha_2)} \int_{a'(i)}^{a'(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau + \int_{a'_{c-1}}^{a'_c} |z[w_{l_c}(X_a, \tau)]|\, w'_{l_c}(X_a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(1-\alpha_2)} \int_{a(i)}^{a(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau. \tag{129}
\end{aligned}$$
Since $\gamma_l(X_a, t)$ is active in $B(\mathbb{C}, 1-\alpha_2)$, the formation of the contour will continue forever. The tail part, that is, the sum of the pieces of the lengths of $\gamma_l(X_a, t)$ lying in $B(\mathbb{C}, \alpha_2)$, will be deleted from $B(\mathbb{C}, \alpha_2)$. This deletion could proceed according to a removal function $\phi(X_a, t, \alpha_2)$, similar to the procedure explained in (122). The differential equation modeling the space of points lost due to a removal process $\phi(X_a, t, \alpha_2)$ is
$$\left.\frac{dB(\mathbb{C}, \alpha_2)}{dt}\right|_{\phi(X_a, t, \alpha_2)} = B(\mathbb{C}, \alpha_2)|_{t=t_c} - \phi(X_a, t, \alpha_2)\, B(\mathbb{C}, \alpha_2)|_{t=t_c}, \tag{130}$$
where
$$\phi(X_a, t, \alpha_2) = \psi(X_a, t, \alpha_2) \sum_{i \in A'(1-\alpha_2)} \int_{a'(i)}^{a'(i+1)} z[w_{l_{a'(i+1)}}(X_a, \tau)]\, w'_{l_{a'(i+1)}}(X_a, \tau)\, d\tau, \quad t \in (t_c, t_{c'}], \tag{131}$$
for $0 < \psi(X_a, t, \alpha_2) < 1$.
FIG. 13 (A) Formation of contours in a bundle, and (B) formation of a hole in the bundle due to
a removal process of an arbitrary plane in which contours originated.
$$L(\gamma_l(X_b, t, \alpha_2)(t \in [t_0, t_c])) = \int_{b_0}^{b_c} z[O_{l_c}(X_b, \tau, \alpha_2)]\, O'_{l_c}(X_b, \tau, \alpha_2)\, d\tau, \tag{132}$$
where
$$\phi(X_b, t, 1-\alpha_2) = \psi(X_b, t, 1-\alpha_2) \sum_{i \in B(1-\alpha_2)} \int_{b(i)}^{b(i+1)} z[O_{l_{b(i+1)}}(X_b, \tau)]\, O'_{l_{b(i+1)}}(X_b, \tau)\, d\tau, \quad t \in (t_c, t_{c'}], \tag{134}$$
for $0 < \psi(X_b, t, 1-\alpha_2) < 1$. The meaning of the set $B(1-\alpha_2)$ and the procedure to obtain the integral in (134) are similar to the corresponding model in (123). Alternatively, when $\gamma_l(X_b, t)$ at $t = t_c$ is active in $B(\mathbb{C}, 1-\alpha_2)$, the set of points touched by the contour $\gamma_l(X_b, t)$ prior to $t_c$ would have been located in $B(\mathbb{C}, \alpha_2)$, $\mathbb{C}_l$, and $B(\mathbb{C}, 1-\alpha_2)$. The contour $\gamma_l(X_b, t)$ can be described by $z_l(X_b, t)$ $(t \in [t_0, \infty))$, and $t = Q_{l_c}(X_b, \tau, 1-\alpha_2)$ $(b'_0 \le \tau \le b'_c)$ is the parametric representation for $\gamma_l(X_b, t, 1-\alpha_2)$ with a real-valued function $Q_{l_c}(X_b, \tau, 1-\alpha_2)$ mapping $[b'_0, b'_c]$ onto the interval $[t_0, t_c]$. Let $L(\gamma_l(X_b, t, 1-\alpha_2)(t \in [t_0, t_c]))$ represent the length of $\gamma_l(X_b, t, 1-\alpha_2)$ $(t \in [t_0, t_c])$ up to $t_c$; then
$$L(\gamma_l(X_b, t, 1-\alpha_2)(t \in [t_0, t_c])) = \int_{b'_0}^{b'_c} z[Q_{l_c}(X_b, \tau, 1-\alpha_2)]\, Q'_{l_c}(X_b, \tau, 1-\alpha_2)\, d\tau, \tag{135}$$
where
$$\phi(X_b, t, \alpha_2) = \psi(X_b, t, \alpha_2) \sum_{i \in B'(1-\alpha_2)} \int_{b'(i)}^{b'(i+1)} z[Q_{l_{b(i+1)}}(X_b, \tau)]\, Q'_{l_{b(i+1)}}(X_b, \tau)\, d\tau, \quad t \in (t_c, t_{c'}], \tag{137}$$
for $0 < \psi(X_b, t, \alpha_2) < 1$. The set $B'(1-\alpha_2)$ consists of the corresponding piecewise contours.
$$\left.\frac{\partial^2 B(\mathbb{C}, \alpha_2)}{\partial X_b\, \partial t}\right|_{\phi(X_b, t, 1-\alpha_2)} = B(\mathbb{C}, \alpha_2)|_{t=t_c} - \frac{\partial \phi(X_b, t, \alpha_2)}{\partial X_a}\,\bigl[L(X_a, 1-\alpha_2) + L(X_b, \alpha_2)\bigr], \tag{141}$$
where $\phi(X_a, t, 1-\alpha_2)$ and $\phi(X_b, t, \alpha_2)$ are the removal rates of $\gamma_l(X_a, t)$ and $\gamma_l(X_b, t)$ in $B(\mathbb{C}, 1-\alpha_2)$ and $B(\mathbb{C}, \alpha_2)$, respectively. The sums of the piecewise lengths in $B(\mathbb{C}, 1-\alpha_2)$ and $B(\mathbb{C}, \alpha_2)$ are represented by $L(X_a, 1-\alpha_2)$ and $L(X_b, \alpha_2)$, respectively. Suppose $\gamma_l(X_a, t)$ and $\gamma_l(X_b, t)$ are active in $B(\mathbb{C}, 1-\alpha_2)$ at $t = t_c$; then the partial differential equations describing the dynamics of the removal of the spaces created by $\gamma_l(X_a, t)$ and $\gamma_l(X_b, t)$ in $B(\mathbb{C}, \alpha_2)$ until $t_c$ are
$$\left.\frac{\partial^2 B(\mathbb{C}, \alpha_2)}{\partial X_a\, \partial t}\right|_{\phi(X_a, t, \alpha_2)} = B(\mathbb{C}, \alpha_2)|_{t=t_c} - \frac{\partial \phi(X_a, t, \alpha_2)}{\partial X_a}\, L(X_a + X_b, \alpha_2), \tag{142}$$
$$\left.\frac{\partial^2 B(\mathbb{C}, \alpha_2)}{\partial X_b\, \partial t}\right|_{\phi(X_b, t, \alpha_2)} = B(\mathbb{C}, \alpha_2)|_{t=t_c} - \frac{\partial \phi(X_b, t, \alpha_2)}{\partial X_b}\, L(X_a + X_b, \alpha_2), \tag{143}$$
where $\phi(X_a, t, \alpha_2)$ and $\phi(X_b, t, \alpha_2)$ are the removal rates, and $L(X_a + X_b, \alpha_2)$ is the sum of the corresponding piecewise lengths in $B(\mathbb{C}, \alpha_2)$.
5 Concluding remarks
Multilevel contours passing through a bundle $B(\mathbb{C})$ of complex planes can demonstrate interesting properties. The random environment created brings out the dynamic nature of the bundle through the removal processes introduced. The continuous-time Markov properties, differential equations, and topological analysis on the bundle give scope for further investigation using functional approximations. The transportation of information through contours across different complex planes could be extended to practical situations arising out of transportation problems. There are several applications of complex analysis that are beyond the scope of this article; a wide range of literature is available for interested readers, see, for example, Rao and Krantz (2021), Chanillo et al. (2005), Ponnusamy and Silverman (2006), Campos (2011), Pathak (2019), Cohen (2007), and Krantz (2008). One can also introduce several forms of parametric contour formation by assuming functional growth rates of contours; that would be an independent approach toward modeling the behavior of the contours for a given functional form of contour formation. Similarly, the removal rates can be assumed to follow certain closed-form approximations (special forms of harmonic functions and Poisson integrals). Information carried between the various complex planes discussed in Section 2 would be obstructed if the set of points of a given connected contour gets deleted (lost) due to a removal process.
Acknowledgments
I wish to thank our children (daughter: Sheetal Rao; sons: GopalKrishna Rao and Raghav Rao), who sacrificed several weekends of playtime with me while I was occupied with this project during the Summer/Fall of 2021.
References
Ahlfors, L.V., 1978. Complex Analysis. An Introduction to the Theory of Analytic Functions of One Complex Variable. International Series in Pure and Applied Mathematics, third ed. McGraw-Hill Book Co., New York. xi+331 pp.
Bhat, B.R., Deshpande, S.K., 1986. Likelihood ratio test for testing order of continuous time finite Markov chains. Commun. Stat. A Theory Methods 15 (6), 1751–1771.
Campos, L.M.B.C., 2011. Complex Analysis With Applications to Flows and Fields. Mathematics and Physics for Science and Technology. CRC Press, Boca Raton, FL.
Chanillo, S., Cordaro, P.D., Hanges, N., Hounie, J., Meziani, A., 2005. Geometric Analysis of PDE and Several Complex Variables. Dedicated to François Treves. Including Papers From the Workshop Held in São Paulo, August 2003. Contemporary Mathematics, vol. 368. American Mathematical Society, Providence, RI.
Chen, M.F., 1991. On three classical problems for Markov chains with continuous time parameters. J. Appl. Prob. 28 (2), 305–320.
Churchill, R.V., Brown, J.W., 1984. Complex Variables and Applications, fourth ed. McGraw-Hill Book Co., New York. x+339 pp.
Cohen, H., 2007. Complex Analysis With Applications in Science and Engineering, second ed. Springer, New York.
Gani, J., Stals, L., 2005. A continuous time Markov chain model for a plantation-nursery system. Environmetrics 16 (8), 849–861.
Good, I.J., 1961. The frequency count of a Markov chain and the transition to continuous time. Ann. Math. Stat. 32, 41–48.
Goswami, A., Rao, B.V., 2006. A Course in Applied Stochastic Processes. Texts and Readings in Mathematics, vol. 40. Hindustan Book Agency, New Delhi.
Krantz, S.G., 2004. Complex Analysis: The Geometric Viewpoint. Carus Mathematical Monographs, second ed., vol. 23. Mathematical Association of America, Washington, DC. xviii+219 pp.
Krantz, S.G., 2008. Complex Variables. A Physical Approach With Applications and MATLAB®. Textbooks in Mathematics. Chapman & Hall/CRC, Boca Raton, FL.
Pathak, H.K., 2019. Complex Analysis and Applications. Springer, Singapore.
Ponnusamy, S., Silverman, H., 2006. Complex Variables With Applications. Birkhäuser Boston, Inc., Boston, MA.
Rao, A.S.R.S., Krantz, S.G., 2021. Rao distances and conformal mappings. In: Information Geometry. Handbook of Statistics, vol. 45. Elsevier/North-Holland, Amsterdam.
Rudin, W., 1987. Real and Complex Analysis, third ed. McGraw-Hill Book Co., New York. ISBN: 0-07-054234-1.