

Handbook of Statistics

Volume 46
Geometry and Statistics
Handbook of Statistics

Series Editors

C.R. Rao
AIMSCS, University of Hyderabad Campus,
Hyderabad, India

Arni S.R. Srinivasa Rao


Medical College of Georgia, Augusta University, United States
Handbook of Statistics
Volume 46

Geometry and
Statistics

Edited by
Frank Nielsen
Sony Computer Science Laboratories Inc.,
Tokyo, Japan

Arni S.R. Srinivasa Rao


Medical College of Georgia,
Augusta, Georgia, United States

C.R. Rao
AIMSCS, University of Hyderabad Campus,
Hyderabad, India
Academic Press is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
525 B Street, Suite 1650, San Diego, CA 92101, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
125 London Wall, London, EC2Y 5AS, United Kingdom

Copyright © 2022 Elsevier B.V. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means,
electronic or mechanical, including photocopying, recording, or any information storage and
retrieval system, without permission in writing from the publisher. Details on how to seek
permission, further information about the Publisher’s permissions policies and our arrangements
with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the
Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience
broaden our understanding, changes in research methods, professional
practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge
in evaluating and using any information, methods, compounds, or experiments
described herein. In using such information or methods they should be mindful of their
own safety and the safety of others, including parties for whom they have a professional
responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or
editors, assume any liability for any injury and/or damage to persons or property as a
matter of products liability, negligence or otherwise, or from any use or operation of any
methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-323-91345-4
ISSN: 0169-7161

For information on all Academic Press publications


visit our website at https://www.elsevier.com/books-and-journals

Publisher: Zoe Kruze


Acquisitions Editor: Sam Mahfoudh
Developmental Editor: Naiza Ermin Mendoza
Production Project Manager: Abdulla Sait
Cover Designer: Victoria Pearson
Typeset by STRAIVE, India
Contents

Contributors xi
Preface xiii

Section I
Foundations in classical geometry and analysis 1
1. Geometry, information, and complex bundles 3
Steven G. Krantz and Arni S.R. Srinivasa Rao
1. Introduction 3
2. Complex planes 5
2.1 Important implications of Liouville’s theorem 9
3. Geometric analysis and Jordan curves 13
4. Summary 17
References 17

2. Geometric methods for sampling, optimization,
inference, and adaptive agents 21
Alessandro Barp, Lancelot Da Costa, Guilherme França,
Karl Friston, Mark Girolami, Michael I. Jordan, and
Grigorios A. Pavliotis
1. Introduction 22
2. Accelerated optimization 25
2.1 Principle of geometric integration 25
2.2 Conservative flows and symplectic integrators 26
2.3 Rate-matching integrators for smooth optimization 28
2.4 Manifold and constrained optimization 33
2.5 Gradient flow as a high friction limit 34
2.6 Optimization on the space of probability measures 35
3. Hamiltonian-based accelerated sampling 37
3.1 Optimizing diffusion processes for sampling 38
3.2 Hamiltonian Monte Carlo 40
4. Statistical inference with kernel-based discrepancies 46
4.1 Topological methods for MMDs 47
4.2 Smooth measures and KSDs 48
4.3 Information geometry of MMDs and natural gradient
descent 52


5. Adaptive agents through active inference 54


5.1 Modeling adaptive decision-making 54
5.2 Realizing adaptive agents 60
Acknowledgments 64
References 65

3. Equivalence relations and inference for sparse Markov
models 79
Donald E.K. Martin, Iris Bennett, Tuhin Majumder, and
Soumendra Nath Lahiri
1. Introduction 79
1.1 Improved modeling capabilities of sparse Markov models
(SMMs) 80
2. Fitting SMMs and example applications 83
2.1 Model fitting based on a collapsed Gibbs sampler 84
2.2 Fitting SMM through regularization 87
3. Equivalence relations and the computation of distributions of
pattern statistics for SMMs 90
3.1 Notation 91
3.2 Computing distributions in higher-order Markovian
sequences 91
3.3 Specializing the computation to SMM 94
3.4 Application to spaced seed coverage 96
4. Summary 100
Acknowledgments 101
References 101

Section II
Information geometry 105
4. Symplectic theory of heat and information
geometry 107
Frédéric Barbaresco
1. Preamble 108
2. Life and seminal work of Souriau on Lie groups
thermodynamics 111
3. From information geometry to Lie groups
thermodynamics 118
4. Symplectic structure of Fisher metric and entropy as Casimir
function in coadjoint representation 123
4.1 Symplectic Fisher Metric structures given by Souriau
model 123
4.2 Entropy characterization as generalized Casimir invariant
function in coadjoint representation and Poisson
Cohomology 128
4.3 Koszul Poisson Cohomology and entropy
characterization 131

5. Covariant maximum entropy density by Souriau model 132


5.1 Gauss density on Poincaré unit disk covariant with respect
to SU(1,1) Lie group 132
5.2 Gauss density on Siegel unit disk covariant with respect to
SU(N,N) Lie group 136
5.3 Gauss density on Siegel upper half plane 138
6. Conclusion 139
References 140
Further reading 143

5. A unifying framework for some directed distances in
statistics 145
Michel Broniatowski and Wolfgang Stummer
1. Divergences, statistical motivations, and connections to
geometry 147
1.1 Basic requirements on divergences (directed
distances) 147
1.2 Some statistical motivations 147
1.3 Incorporating density function zeros 152
1.4 Some motivations from probability theory 157
1.5 Divergences and geometry 159
1.6 Some incentives for extensions 161
2. The framework 163
2.1 Statistical functionals S and their dissimilarity 163
2.2 The divergences (directed distances) D 168
2.3 The reference measure λ 171
2.4 The divergence generator ϕ 171
2.5 The scaling and the aggregation functions m1, m2,
and m3 173
2.6 Auto-divergences 195
2.7 Connections with optimal transport and coupling 197
3. Aggregated/integrated divergences 201
4. Dependence expressing divergences 203
5. Bayesian contexts 205
6. Variational representations 208
7. Some further variants 211
Acknowledgments 214
References 214

6. The analytic dually flat space of the mixture family of
two prescribed distinct Cauchy distributions 225
Frank Nielsen
1. Introduction and motivation 226
2. Differential-geometric structures induced by smooth convex
functions 227
2.1 Hessian manifolds and Bregman manifolds 227
2.2 Bregman manifolds: Dually flat spaces 230

3. Some illustrating examples 234


3.1 Exponential family manifolds 234
3.2 Regular cone manifolds 237
3.3 Mixture family manifolds 239
4. Information geometry of the mixture family of two distinct
Cauchy distributions 241
4.1 Cauchy mixture family of order 1 241
4.2 An analytic example with closed-form dual
potentials 249
5. Conclusion 253
Acknowledgments 253
Appendix. Symbolic computing notebook in MAXIMA 253
References 255

7. Local measurements of nonlinear embeddings with
information geometry 257
Ke Sun
1. Introduction 257
2. α-Divergence and autonormalizing 260
3. α-Discrepancy of an embedding 262
4. Empirical α-discrepancy 267
5. Connections to existing methods 268
5.1 Neighborhood embeddings 268
5.2 Autoencoders 270
6. Conclusion and extensions 272
Acknowledgment 274
Appendices 274
Appendix A. Proof of Lemma 1 274
Appendix B. Proof of Proposition 1 275
Appendix C. Proof of Proposition 2 276
Appendix D. Proof of Theorem 1 277
References 279

Section III
Advanced geometrical intuition 283
8. Parallel transport, a central tool in geometric statistics
for computational anatomy: Application to cardiac
motion modeling 285
Nicolas Guigui and Xavier Pennec
1. Introduction 286
1.1 Diffeomorphometry 287
1.2 Longitudinal models 288
1.3 Parallel transport for intersubject normalization 291
1.4 Chapter organization 292

2. Parallel transport with ladder methods 294


2.1 Numerical accuracy of Schild’s and pole ladders 294
2.2 A short overview of the LDDMM framework 298
2.3 Ladder methods with LDDMM 300
3. Application to cardiac motion modeling 304
3.1 The right ventricle and its diseases 305
3.2 Motion normalization with parallel transport 306
3.3 An intuitive rescaling of LDDMM parallel transport 309
3.4 Changing the metric to preserve relative volume
changes 313
3.5 Analysis of the normalized deformations 316
4. Conclusion 321
Acknowledgments 322
References 322

9. Geometry and mixture models 327
Paul Marriott
1. Introduction 327
1.1 Fundamentals of modeling with mixtures 327
1.2 Mixtures and the fundamentals of geometry 329
1.3 Structure of article 330
2. Identification, singularities, and boundaries 331
2.1 Mixtures of finite distributions 334
3. Likelihood geometry 337
4. General geometric structures 341
5. Singular learning theory 343
5.1 Bayesian methods 343
5.2 Singularities and algebraic geometry 345
5.3 Singular learning and model selection 348
6. Nonstandard testing problems 350
7. Discussion 353
References 353

10. Gaussian distributions on Riemannian symmetric
spaces of nonpositive curvature 357
Salem Said, Cyrus Mostajeran, and Simon Heuveline
1. Introduction 358
2. Gaussian distributions and RMT 359
2.1 From Gauss to Shannon 360
2.2 The “right” Gaussian 361
2.3 The normalizing factor Z(σ) 363
2.4 MLE and maximum entropy 366
2.5 Barycenter and covariance 367
2.6 Z(σ) from RMT 369
2.7 The asymptotic distribution 371
2.8 Duality: The Θ distributions 372

3. Gaussian distributions and Bayesian inference 373


3.1 MAP versus MMS 375
3.2 Bounding the distance 376
3.3 Computing the MMS 377
3.4 Proof of Proposition 13 381
Appendix A. Riemannian symmetric spaces 383
A.1 The noncompact case 385
A.2 The compact case 386
A.3 Example of Propositions A.1 and A.2 387
Appendix B. Convex optimization 388
B.1 Convex sets and functions 388
B.2 Second-order Taylor formula 390
B.3 Taylor with retractions 391
B.4 Riemannian gradient descent 393
Appendix C. Proofs for Section B 397
References 399

11. Multilevel contours on bundles of complex planes 401
Arni S.R. Srinivasa Rao
1. Introduction 401
2. Infinitely many bundles of complex planes 403
3. Multilevel contours in a random environment 415
3.1 Behavior of X (zl(t), ℂl) at (ℂl \ ℂ0) 420
3.2 Loss of spaces in bundle Bℝ(ℂ) 433
4. Islands and holes in Bℝ(ℂ) 444
4.1 Consequences of Bℝ(ℂ)\ℂl on multilevel contours 453
4.2 PDEs for the dynamics of lost space 459
5. Concluding remarks 463
Acknowledgments 463
References 463

Index 465
Contributors

Frédéric Barbaresco (107), THALES Land & Air Systems, Meudon, France
Alessandro Barp (21), Department of Engineering, University of Cambridge,
Cambridge; The Alan Turing Institute, The British Library, London,
United Kingdom
Iris Bennett (79), Department of Statistics, North Carolina State University; Corteva
Agriscience, Raleigh, NC, United States
Michel Broniatowski (145), LPSM, Sorbonne Université, Paris, France
Lancelot Da Costa (21), Department of Mathematics, Imperial College London;
Wellcome Centre for Human Neuroimaging, University College London, London,
United Kingdom
Guilherme França (21), Computer Science Division, University of California,
Berkeley, CA, United States
Karl Friston (21), Wellcome Centre for Human Neuroimaging, University College
London, London, United Kingdom
Mark Girolami (21), Department of Engineering, University of Cambridge,
Cambridge; The Alan Turing Institute, The British Library, London,
United Kingdom
Nicolas Guigui (285), Université Côte d’Azur and Inria, Epione team, Sophia Antipolis, Biot, France
Simon Heuveline (357), Centre for Mathematical Sciences, University of Cambridge,
Cambridge, United Kingdom
Michael I. Jordan (21), Computer Science Division; Department of Statistics,
University of California, Berkeley, CA, United States
Steven G. Krantz (1), Department of Mathematics, Washington University in St.
Louis, St. Louis, MO, United States
Soumendra Nath Lahiri (79), Department of Mathematics and Statistics, Washington
University in St. Louis, St. Louis, MO, United States
Tuhin Majumder (79), Department of Statistics, North Carolina State University,
Raleigh, NC, United States
Paul Marriott (327), Department of Statistics and Actuarial Science, University of
Waterloo, Waterloo, ON, Canada
Donald E.K. Martin (79), Department of Statistics, North Carolina State University,
Raleigh, NC, United States


Cyrus Mostajeran (357), Department of Engineering, University of Cambridge, Cambridge, United Kingdom
Frank Nielsen (225), Sony Computer Science Laboratories Inc., Tokyo, Japan
Grigorios A. Pavliotis (21), Department of Mathematics, Imperial College London,
London, United Kingdom
Xavier Pennec (285), Université Côte d’Azur and Inria, Epione team, Sophia Antipolis, Biot, France
Salem Said (357), CNRS, Laboratoire LJK, Université Grenoble-Alpes, Grenoble, France
Arni S.R. Srinivasa Rao (1, 401), Laboratory for Theory and Mathematical Modeling,
Medical College of Georgia; Department of Mathematics, Augusta University,
Augusta, GA, United States
Wolfgang Stummer (145), Department of Mathematics, University of Erlangen–Nürnberg, Erlangen; School of Business, Economics and Society, University of Erlangen–Nürnberg, Nürnberg, Germany
Ke Sun (257), CSIRO Data61, Sydney, NSW; The Australian National University,
Canberra, ACT, Australia
Preface

Volume 46 of the Handbook of Statistics with the theme “Geometry and Statistics” provides state-of-the-art research topics focusing on the interface
of statistics with geometry. Geometrical intuition and clarity have always
helped statistical and mathematical analyses, and this volume presents the
fundamental concepts of the recent advancements on this interface in an enter-
taining and engaging manner. We have also included some chapters purely
dealing with the statistical aspects and others presenting new venues in
complex analysis in the hope that these chapters will foster interactions of
statistics and geometry in the near future.
The contents of the volume range from the basics of complex plane
geometry to classical geometry, dually flat surfaces, deeper geometrical
foundations that can improve our understanding of statistical inferences,
Riemannian surfaces and applications, complex bundles, information
geometry, random matrix theory, and numerical simulations.
This volume will engage both new researchers and experienced authors, and the contents are introduced by keeping readers from statistics, mathematics, data science, computer science, and related fields in mind. All the
authors prepared their chapters carefully and skillfully by keeping the
traditions of the Handbook of Statistics series alive.
The 11 chapters in this volume are divided into 3 sections, namely,
Section I: Foundations in Classical Geometry and Analysis
Section II: Information Geometry
Section III: Advanced Geometrical Intuition
Section I contains three chapters. The first chapter by Arni S.R. Srinivasa Rao
and Steven G. Krantz describes the foundations of complex planes and
analytic functions and introduces the development of newer applications.
A special Jordan curve theorem that passes through a complex plane bundle
and points on the boundary of a ball is proved in this chapter. The advantages
of such special Jordan curves are discussed. The kinds of complex bundles
considered are similar to the bundles introduced in Chapter 11 of this volume.
The second chapter by Alessandro Barp, Lancelot Da Costa, Guilherme
França, Karl Friston, Mark Girolami, Michael I. Jordan, and Grigorios
A. Pavliotis first provides an overview of the fundamental geometric struc-
tures that underlie sampling, optimization, inference, and adaptive behavior
problems. Then, the authors show how to design efficient algorithms to solve


these problems using these geometric structures. The third chapter by Donald
E.K. Martin, Iris Bennett, Tuhin Majumder, and Soumendra Nath Lahiri is
about equivalence relations in statistical inference and geometrical analysis.
The authors demonstrate how sparse Markov modeling helps improve the
understanding of statistical inferences. The chapter touches on higher-order
Markov models and derivations of certain conditional probability distributions
and their applications.
Section II contains four chapters. The first chapter by Frédéric Barbaresco, based on the foundations of “Lie Groups Thermodynamics,” describes a novel formulation of heat theory and information geometry. The chapter describes
the utility of the Gaussian distribution on the space of Symmetric Positive
Definite matrices, constructions of the Koszul–Fisher metric, and the Casimir
function, which is characterized by Koszul–Poisson cohomology. The second chapter by Michel Broniatowski and Wolfgang Stummer introduces a general framework for density-based and distribution-function-based divergences, covering the Kullback–Leibler relative entropy (information distance) among other directed distances. The authors also describe in detail and provide foundations for the Cramér–von Mises and Anderson–Darling test statistics. The third chapter by Frank Nielsen recalls
the construction of dually flat spaces from a pair of convex conjugate func-
tions related to the Legendre–Fenchel transform. This dually flat structure is
then illustrated for exponential families, mixture families, and regular homo-
geneous convex cones. For mixture families, the Shannon negentropy defines
the convex function inducing the dually flat space. It is, however, usually not
available in closed form for continuous mixtures. The chapter reports a
closed-form formula for the mixture family of two Cauchy distributions and
uses this formula to explicitly build a dually flat mixture family manifold.
The fourth chapter by Ke Sun considers the problem of embedding a
low-dimensional latent space into a high-dimensional observation space.
The author tackles the definitions of the simplicity of such embeddings based
on the framework of information geometry and discusses the relationships
between parametric and nonparametric embeddings.
Section III contains four chapters. The first chapter by Nicolas Guigui and
Xavier Pennec is richly illustrated and presents the tool of parallel transport
via affine or Riemannian connection for problems in computational anatomy.
Parallel transport defines how tangent vectors are related between tangent
planes that are infinitesimally close to each other. The authors discuss the
choice of parallel transport and its numerical accuracy in ladder method
implementations for statistical analysis of subject-specific longitudinal
changes or motions with respect to common template anatomy. The authors
then apply their novel parallel transport method to motion modeling of the
cardiac right ventricle under pressure or volume overload. Parallel transport alone is shown to be insufficient for normalizing large-volume deformations, and the authors propose a novel normalization procedure in which parallel transport proves a useful tool for choosing the appropriate metric adapted to the data. The second chapter by
Paul Marriott presents a tutorial survey on the use of several geometries for
characterizing the inference properties of statistical mixture models. The
viewpoints of affine geometry, likelihood convex geometry, information
geometry, and algebraic geometry are explained and illustrated with numer-
ous examples that highlight the complex inferential nature of mixture models
in general. The third chapter by Salem Said, Cyrus Mostajeran, and Simon
Heuveline presents the state-of-the-art theory of Gaussian distributions on
Riemannian symmetric spaces. It is shown that when the Riemannian sym-
metric spaces have nonpositive curvature, the maximum entropy distribution
with the prescribed barycenter and dispersion defines a Riemannian Gaussian
distribution such that the maximum likelihood estimator for that family
amounts to computing a Riemannian barycenter of the observations. In the
second part of this chapter, the authors tackle the mathematical tractability
of the normalizing constants of these Riemannian Gaussian distributions and
show an interesting connection with the random matrix theory. In the fourth
chapter by Arni S.R. Srinivasa Rao, the author introduces newer principles
of constructing contours passing through complex plane bundles. The geometry created through this process leads to new concepts such as “multilevel contours” within the bundle, and “islands” and “holes” within complex planes.
Information is transported from one plane to another through the geometrical
shapes of the contours.
We express our sincere thanks to Mr. Sam Mahfoudh, acquisitions editor
(Elsevier and North-Holland), for his overall administrative support through-
out the preparation of this volume. His valuable involvement in the project
is highly appreciated. We thank Ms. Naiza Mendoza, developmental editor
(Elsevier), for providing excellent assistance to the editors and for engaging
authors in all kinds of technical queries throughout the preparation and until
the proofing stage and production. Our thanks also go to Md. Sait Abdulla, the project manager of book production, RELX India Private Limited, Chennai, India, for leading the production, responding to several rounds of queries
by the authors, being available at all times for printing activities, and
providing assistance to the editors. Our sincere thanks and gratitude go to
all the authors for writing brilliant chapters by keeping to our requirements
of the volume. We very much thank our referees for their timely assessment
of the chapters.
We firmly believe that this volume has come up at the right time and it
gives us great satisfaction to have been involved in its production. We are
convinced that this collection will be a useful resource for beginners and
advanced scientists working in statistics, mathematics, and geometry.
Frank Nielsen
Arni S.R. Srinivasa Rao
C.R. Rao
Section I

Foundations in classical
geometry and analysis
Chapter 1

Geometry, information,
and complex bundles
Steven G. Krantz (a) and Arni S.R. Srinivasa Rao (b,c,*)
(a) Department of Mathematics, Washington University in St. Louis, St. Louis, MO, United States
(b) Laboratory for Theory and Mathematical Modeling, Medical College of Georgia, Augusta University, Augusta, GA, United States
(c) Department of Mathematics, Augusta University, Augusta, GA, United States
(*) Corresponding author: e-mail: arni.rao2020@gmail.com

Abstract
In this chapter, we will describe information geometric principles on complex planes.
Using the geometric constructions, we prove two theorems on special types of
constructions of Jordan curves around a ball within a complex bundle. Essentials on
complex planes and bundles required for understanding the contour constructions done
in the chapter are provided.
Keywords: Information geometry, Complex analysis, Jordan curves, Complex bundles

1 Introduction
The idea of combining Riemann surfaces with probability densities for under-
standing distances between two populations was introduced by C.R. Rao in
1945 (Rao, 1949). This led to the development of the subject of information
geometry (Amari, 2016; Amari and Nagaoka, 2000; Ay et al., 2017; van
Rijsbergen, 2004). The principles of information geometry were helpful in
statistical decisions and inferences (Amari et al., 1987; Plastino et al., 2021)
and in statistical physics (Bhattacharyya and Keerthi, 2000; Dehesa et al.,
2012; Frieden, 1992, 2021; Jaiswal et al., 2021). Most of these articles and
associated articles focused on the Cramer–Rao inequality and on Fisher
Information (Efron, 1975; Rao, 1973), and in obtaining deformation proper-
ties for the exponential family of distributions. The idea of transportation
of information from one region to another region through topological struc-
tures and through complex plane bundles was introduced in Rao (2021).

Handbook of Statistics, Vol. 46. https://doi.org/10.1016/bs.host.2022.03.002
Copyright © 2022 Elsevier B.V. All rights reserved.
Through such an analysis, a new concept called multilevel contours was developed. Such translation of information from one plane to another plane
is new in the subject of geometric structures within complex bundles and
arguably has implications in climate analysis. The connectedness properties
of complex vector bundles through homomorphism principles help in under-
standing four-dimensional manifolds; see, for example, Chern (1989) and
Chern (1977). Understanding the curvature of objects coming from a bundle
and Einstein–Hermitian conditions on such curvatures helps to understand
the stability of the system of bundles. These are different from understanding
geodesic distances on complex bundles. The idea of constructing contours
passing through the bundle helps in creating multiple paths and these are
conceptually different from analysis of vector bundles constructed earlier
(Kobayashi, 2014). Deformation of complex vector bundles can be performed
in various ways; for example, see Green and Lazarsfeld (1991), Kruglikov
(2007), and Wu and Yau (2016).
The information geometry principles were also used to obtain deformed
λ-exponential families (Tsallis, 1988; Zhang and Wong, 2021). The princi-
ples associated with obtaining distances between two probability densities
through information geometry theory have been attracting several theoreti-
cians worldwide. The idea of Riemann surfaces introduced through informa-
tion geometry caused scientists to think of the geometry of shapes formed
while building measures to obtain distances between two independent
populations.
Information geometry principles on the complex plane and their applica-
tions were introduced by Rao and Krantz (2020, 2021). Such newer ideas have been shown to be helpful in virtual tourism through the implementation of virtual reality (VR) technologies. The distances between three-dimensional objects at tourist locations and the angles between them can be preserved to improve existing VR technology, and the ideas of contour integrals on complex planes and path-connectedness properties of planes were shown to be helpful in these newer applications. The distance measures of standard information geometry and geodesic distances do not preserve angular information. Hence the newer idea of geometry on complex planes (especially conformality), the information preserved in the resulting geometric objects, and their potential applications enhance the utility of information geometry.
complex planes and, more generally, automorphisms of Riemann surfaces
are well established—see, for example, Krantz (2004), Krantz and Parks
(1999), and Greene et al. (2011). The Bergman metric and Bergman geom-
etry are central tools in the analysis of such complex manifolds (Greene
et al., 2011; Yoo, 2017). See also the Kobayashi–Royden metric
(Choi, 2012).

Our chapter is structured as follows: In the next section we present the fun-
damentals of complex analysis. Section 3 describes the ideas of geometry on
complex planes and Section 4 summarizes the newer advantages of informa-
tion geometry on complex planes.

2 Complex planes
Let ℂ be the complex plane, and let S ⊂ ℂ be a region. Let z be a complex number in S, with z = x + iy for x, y ∈ ℝ (the set of real numbers). We define a function f as follows:

f : S → ℂ    (1)

such that f(z) = w for w = u + iv and u, v ∈ ℝ. Such a function f is said to be analytic (or holomorphic) in an open set S if there exists a complex derivative at z for every z ∈ S. If f is analytic at each point in the entire complex plane, then f is called an entire function. Suppose f = u + iv is analytic in a domain U (an open and connected set); then the first-order partial derivatives of u and v satisfy

∂u/∂x = ∂v/∂y and ∂u/∂y = −∂v/∂x.    (2)

The equations in (2) are known as the Cauchy–Riemann (CR) equations.
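As an illustration, the CR equations (2) can be checked symbolically. The following is a minimal sketch, assuming Python with SymPy (the function choice is ours, for this example only), for f(z) = z², whose real and imaginary parts are u = x² − y² and v = 2xy; it also verifies the harmonicity of u and v asserted in Theorem 1 below.

```python
# Minimal sketch (assumes SymPy): check the Cauchy-Riemann equations (2)
# for f(z) = z^2, with u = x^2 - y^2 and v = 2xy, and verify that u and v
# are harmonic (Theorem 1 below).
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = sp.expand((x + sp.I * y)**2)
u, v = sp.re(f), sp.im(f)

print(sp.simplify(sp.diff(u, x) - sp.diff(v, y)))        # 0: u_x = v_y
print(sp.simplify(sp.diff(u, y) + sp.diff(v, x)))        # 0: u_y = -v_x
print(sp.simplify(sp.diff(u, x, 2) + sp.diff(u, y, 2)))  # 0: Laplacian of u
print(sp.simplify(sp.diff(v, x, 2) + sp.diff(v, y, 2)))  # 0: Laplacian of v
```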
Geometric structures can be constructed on a domain in the plane and, using such a domain, information on geometric structures can be transported (Rao, 2021). A smooth function u(x1, x2, …, xn) that satisfies the equation

∂²u/∂x1² + ∂²u/∂x2² + ⋯ + ∂²u/∂xn² = 0    (3)

is called a harmonic function, and Eq. (3) is called Laplace’s equation. Eq. (3) can also be written as Δu = 0 for the operator

Δ = ∂²/∂x1² + ∂²/∂x2² + ⋯ + ∂²/∂xn².

A two-variable function u(x, y) is called harmonic if

∂²u/∂x² + ∂²u/∂y² = 0.    (4)

A standard result, which can be proved using Laplace’s equation and f = u + iv, where u = u(x, y) and v = v(x, y), is stated below. See Krantz (2004), Churchill and Brown (1984), Ahlfors (1978), Krantz (2017), and Rudin (1987).

Theorem 1. If f = u + iv is analytic in U, then u and v are harmonic in U.

Proof. Using (2) we can write

∂/∂x(∂u/∂x) = ∂/∂x(∂v/∂y)
            = ∂/∂y(∂v/∂x)
            = ∂/∂y(−∂u/∂y)
            = −∂²u/∂y².    (5)

This implies that Δu = 0. Similarly,

∂/∂x(∂v/∂x) = ∂/∂x(−∂u/∂y)
            = −∂/∂y(∂u/∂x)
            = −∂/∂y(∂v/∂y)
            = −∂²v/∂y².    (6)

This implies that Δv = 0. □

Eqs. (5) and (6) imply that u and v are harmonic. Harmonic functions combined with the CR equations can assist in understanding the conjugates of components and information transportation. Analytic functions are also important in the formation of contours, which were shown to transport information from one complex plane to another complex plane within finite and infinite complex plane bundles (Rao, 2021). The set γ(t) = (x(t), y(t)) ∈ ℂ for a set of real values t ∈ [a, b], and for continuous x(t) and y(t), is said to be an arc. The arc γ(t) is called a Jordan arc if it is simple, i.e., if γ(t1) ≠ γ(t2) for all t1 ≠ t2, except possibly for γ(a) = γ(b). A closed arc is an arc for which γ(a) = γ(b), and simple closed arcs are also referred to as Jordan curves. See Fig. 1.

FIG. 1 Arcs and Jordan arcs.



Let f be a complex-valued function defined on the complex values of γ(t) for t ∈ [a, b]. The contour integral of f on γ(t) is defined using the Riemann–Stieltjes integral as

∮_γ f dz(t) = ∫_a^b f[γ(t)] dγ(t).    (7)

Suppose we map the value of t onto another real-valued function ϕ(ζ) for a1 ≤ ζ ≤ b1; then the γ(t) values within [a, b] are transformed into, say, Γ(ζ) = γ[ϕ(ζ)]. The length of the arc γ(t), say L(γ(t)), is defined to be

L(γ(t)) = ∫_a^b |γ′(t)| dt.    (8)

After the change of variables described above, Eq. (8) becomes

L(γ(t)) = ∫_{a1}^{b1} |γ′[ϕ(ζ)]| ϕ′(ζ) dζ.    (9)
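For a concrete numerical check of (7) and (8), consider the unit circle γ(t) = e^{it} for t ∈ [0, 2π] and f(z) = 1/z; the exact values are ∮_γ dz/z = 2πi and L(γ) = 2π. The following minimal sketch assumes Python with NumPy and approximates both with the trapezoidal rule (the parametrization and step count are choices for this example only).

```python
# Minimal sketch (assumes NumPy): approximate the contour integral (7) and
# the arc length (8) for gamma(t) = exp(it) on [0, 2*pi] with f(z) = 1/z.
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 20001)
gamma = np.exp(1j * t)        # the unit circle
dgamma = 1j * np.exp(1j * t)  # gamma'(t)

integral = np.trapz((1.0 / gamma) * dgamma, t)  # Riemann-Stieltjes sum for (7)
length = np.trapz(np.abs(dgamma), t)            # arc length (8)

print(integral)  # approximately 2*pi*i = 6.2832j
print(length)    # approximately 2*pi = 6.2832
```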

Here γ′ and ϕ′ indicate the derivatives of γ and ϕ, respectively. The parametric description above also suggests that γ could be a homeomorphism when the domain [a, b] is mapped into [γ(a), γ(b)]. Similarly, Γ could form a homeomorphism when mapped from [γ(a), γ(b)] to [γ(a1), γ(b1)]. Hence, isometry properties can be observed depending upon whether

γ : [a, b] → [γ(a), γ(b)]

is one to one or not. A piecewise smooth arc is called a contour. For example: suppose we form two arcs γ1(t1) and γ2(t2) for t1 ∈ [a, b] and t2 ∈ [b, c] with γ1(b) = γ2(b). Let γ be the concatenation of these two curves: γ1 followed by γ2. Then γ(t) for t ∈ [a, c] represents a contour provided γ(t) is continuous and γ′(t) is piecewise continuous. Let L(γ1(t1)) and L(γ2(t2)) be the lengths of the two arcs γ1(t1) and γ2(t2), respectively. The length of the contour γ(t), say L(γ(t)), can be computed as

L(γ(t)) = L(γ1(t1)) + L(γ2(t2))
        = ∫_{a1}^{b1} |γ1′[ϕ1(t1)]| ϕ1′(t1) dt1 + ∫_{b1}^{c1} |γ2′[ϕ2(t2)]| ϕ2′(t2) dt2.    (10)

In Eq. (10) the two parametric representation functions are ϕ1(t1) and ϕ2(t2) for t1 ∈ [a, b] and t2 ∈ [b, c]. See Fig. 2. Suppose there are multiple parametric representation functions ϕj(tj) for tj in [aj, aj+1] corresponding to the piecewise smooth arcs γj for j = 1, 2, …, k. The contour γ can be represented using these piecewise smooth arcs as

γ(t) = γj(t) for t ∈ [aj, aj+1], j = 1, 2, …, k.    (11)
FIG. 2 Contour formation from piecewise smooth arcs. Here γi(ti) for i = 1, 2, …, 6 are piecewise smooth arcs defined on the real number intervals [a1, a2], [a2, a3], …, [a5, a6]. After the parametric representation described in the text, one can compute the total length of the contour, say C, using the piecewise lengths of γi(ti).

Then the length L(γ(t)) of the contour γ, which is formed by concatenating k piecewise smooth arcs, can be obtained from the lengths L(γj(tj)) of these piecewise arcs as

L(γ(t)) = Σ_{j=1}^{k} ∫_{aj}^{aj+1} |γj′[ϕj(tj)]| ϕj′(tj) dtj.    (12)
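A minimal numerical sketch of (12), again assuming Python with NumPy: the contour below concatenates two smooth arcs, an upper unit semicircle followed by the straight segment from −1 to 1, so the total length is the sum of the piecewise arc lengths, π + 2. The parametrizations are choices for this example only.

```python
# Minimal sketch (assumes NumPy): total length (12) of a contour built from
# two concatenated smooth arcs.
import numpy as np

def arc_length(gamma_prime, a, b, n=100001):
    """Numerically integrate |gamma'(t)| over [a, b], as in Eq. (8)."""
    t = np.linspace(a, b, n)
    return np.trapz(np.abs(gamma_prime(t)), t)

# gamma_1(t) = exp(it) on [0, pi]: upper unit semicircle, length pi.
L1 = arc_length(lambda t: 1j * np.exp(1j * t), 0.0, np.pi)
# gamma_2(t) = t - pi - 1 on [pi, pi + 2]: segment from -1 to 1, length 2;
# note gamma_2(pi) = -1 = gamma_1(pi), so the arcs concatenate continuously.
L2 = arc_length(lambda t: np.ones_like(t, dtype=complex), np.pi, np.pi + 2.0)

print(L1 + L2)  # approximately pi + 2 = 5.1416
```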

Analytic functions in a domain U on Jordan curves have interesting properties due to the Cauchy integral theorem and the Cauchy integral formula stated below.

Theorem 2. Suppose f is analytic on an open set U except for a finite number of points within U at which f is only continuous. Then, for every piecewise smooth closed arc γ that is homotopic to a point within U, we have

∮_γ f(z) dz = 0.    (13)

Definition 1 (Cauchy integral formula). Suppose that f is an analytic function on a simply connected open set U. Let γ be a piecewise smooth closed counterclockwise (positively oriented) arc in U. Then, for any point z0 interior to γ, we have

f(z0) = (1/2πi) ∮_γ f(z)/(z − z0) dz.    (14)
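The Cauchy integral formula (14) can also be verified numerically. The following sketch (an illustration assuming NumPy; the function and point are choices for this example only) takes f(z) = e^z, γ the positively oriented unit circle, and the interior point z0 = 0.3 + 0.2i.

```python
# Minimal sketch (assumes NumPy): verify the Cauchy integral formula (14)
# for f(z) = exp(z) on the unit circle with interior point z0.
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 20001)
z = np.exp(1j * t)        # positively oriented unit circle
dz = 1j * np.exp(1j * t)  # z'(t)
z0 = 0.3 + 0.2j

value = np.trapz(np.exp(z) / (z - z0) * dz, t) / (2j * np.pi)
print(value, np.exp(z0))  # the two agree to high accuracy
```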
Suppose we define another function g(z) on U as

g(z) = (f(z) − f(z0))/(z − z0) if z ≠ z0, and g(z) = f′(z0) if z = z0.    (15)

Then, g is also analytic. Closed smooth arcs can be utilized to transport information from one region to another region within ℂ.

Theorem 3. Suppose that f is analytic everywhere inside and on a closed smooth arc or closed contour γ. If z0 is interior to γ, then

f′(z0) = (1/2πi) ∮_γ f(z) dz/(z − z0)^2,    (16)

f″(z0) = (1/πi) ∮_γ f(z) dz/(z − z0)^3,    (17)

and by mathematical induction, we have

f^(n)(z0) = (n!/2πi) ∮_γ f(z) dz/(z − z0)^{n+1}.    (18)

Here the superscript (n) denotes the nth derivative. One of the important consequences of the Cauchy integral formula (18) is

|f^(n)(z0)| ≤ n! max_{z∈A} |f(z)| / (rA)^n,  (n = 1, 2, …),    (19)

where f is analytic inside and on a circle A (with radius rA) centered at z0. The inequality (19) is also called the Cauchy estimate and is used in proving Liouville’s theorem on entire functions.

Theorem 4 (Liouville’s theorem). A bounded entire function is constant throughout the complex plane.
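A minimal numerical sketch of the generalized formula (18) and the Cauchy estimate (19), assuming NumPy (the function, point, and radius are choices for this example only): take f(z) = e^z, n = 2, z0 = 0.5, and A the circle of radius rA = 1 centered at z0, so the exact value is f″(z0) = e^{0.5}.

```python
# Minimal sketch (assumes NumPy): compute f''(z0) via Eq. (18) and check the
# Cauchy estimate (19) for f(z) = exp(z), z0 = 0.5, on a circle of radius 1.
import numpy as np
from math import factorial

n, z0, rA = 2, 0.5, 1.0
t = np.linspace(0.0, 2.0 * np.pi, 20001)
z = z0 + rA * np.exp(1j * t)
dz = 1j * rA * np.exp(1j * t)

deriv = factorial(n) / (2j * np.pi) * np.trapz(np.exp(z) / (z - z0)**(n + 1) * dz, t)
bound = factorial(n) * np.max(np.abs(np.exp(z))) / rA**n  # right side of (19)

print(deriv, np.exp(z0))    # both approximately e^0.5 = 1.6487
print(abs(deriv) <= bound)  # True: the Cauchy estimate (19) holds
```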

2.1 Important implications of Liouville’s theorem

Liouville’s theorem helps prove the fundamental theorem of algebra, which states that, for any polynomial f(z) of degree n ≥ 1, there exists at least one point z0 such that f(z0) = 0. There are several proofs of the fundamental theorem of algebra available; see the recent discussion in Krantz (2020). Suppose that ρ(z) is a complex-valued polynomial given by

ρ(z) = c0 + c1 z + c2 z^2 + ⋯ + cn z^n.

Let

f(z) = 1/ρ(z)

and assume ρ(z) is not zero for any z ∈ ℂ. After some algebraic constructions and applying the triangle inequality, we arrive at

|f(z)| = 1/|ρ(z)| is bounded.    (20)

Eq. (20) implies f is bounded in the entire plane. But then, by Liouville’s theorem, f(z) is constant, which is a contradiction because ρ(z) is not constant.
Suppose a function f is analytic throughout an annular domain with radii rA and rB and center z0, that is, for rA < |z − z0| < rB. Let γ be a Jordan curve around z0 within the z values for rA < |z − z0| < rB. Then, at each such z, the function f(z) can be represented by the following series:

f(z) = Σ_{n=−∞}^{∞} An (z − z0)^n, for rA < |z − z0| < rB,    (21)

where

An = (1/2πi) ∮_γ f(z) dz/(z − z0)^{n+1} for n = 0, ±1, ±2, ….    (22)

The series in Eq. (21) is called the Laurent series. Suppose we write g(z) = f(z + z0). Then, g(z) is analytic on the annulus rA < |z| < rB and we can write g(z) with the following Laurent series expression:

g(z) = Σ_{n=0}^{∞} Bn z^n + Σ_{n=1}^{∞} Cn z^{−n},    (23)

where

Bn = (1/2πi) ∮_γ g(z) dz/z^{n+1} for n = 0, 1, 2, …,    (24)

Cn = (1/2πi) ∮_γ g(z) dz/z^{−n+1} for n = 1, 2, ….    (25)
See Fig. 3. Let us consider the disks D(z0, rA) and D(0, rA) as in Fig. 3. If we excise the centers z0 and 0 from these disks, respectively, then the sets of remaining points of the disks are called deleted neighborhoods of z0 and 0, respectively. We call z0 an isolated singularity of f if f is not analytic at z0 and f is analytic on a deleted neighborhood of z0. Similarly, an isolated singularity of f at 0 can be defined.

When 0 in Fig. 3 is the isolated singular point of f, then f(z) can be expressed as

f(z) = Σ_{n=0}^{∞} Bn z^n + C1 z^{−1} + C2 z^{−2} + ⋯ + Cn z^{−n} + ⋯,    (26)
FIG. 3 (A) Jordan curve within the domain rA < |z − z0| < rB and expression of f(z) for a z value within this region. (B) Jordan curve within the domain rA < |z| < rB; g(z) in Eq. (23) is analytic within this domain.

where Bn and Cn are defined as in Eqs. (24) and (25). The number C1 in Eq. (26) is called the residue of f at 0, and is denoted by

Res_{z=0} f(z).

Similarly, when z0 is an isolated singular point, one can write the corresponding Laurent series expression. The Cauchy residue theorem is a helpful tool to compute a contour integral when there is a finite number k of isolated singular points z1, …, zk within a simple, closed contour γ. Suppose f is analytic within and on γ except for these finitely many isolated singular points; then

∮_γ f(z) dz = 2πi Σ_{n=1}^{k} Res_{z=zn} f(z).    (27)
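As a numerical illustration of (27), assuming NumPy (the function and contour are choices for this example only): take f(z) = (3z + 1)/(z(z − 0.5)), which has isolated singular points at 0 and 0.5 inside the unit circle, with residues −2 and 5; the residue theorem then gives ∮_γ f(z) dz = 2πi · 3.

```python
# Minimal sketch (assumes NumPy): check the Cauchy residue theorem (27) for
# f(z) = (3z + 1) / (z (z - 0.5)) on the unit circle.
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 40001)
z = np.exp(1j * t)        # unit circle enclosing both singular points
dz = 1j * np.exp(1j * t)

f = (3 * z + 1) / (z * (z - 0.5))
integral = np.trapz(f * dz, t)

# Res_{z=0} f = 1 / (-0.5) = -2 and Res_{z=0.5} f = 2.5 / 0.5 = 5
print(integral)               # approximately 6*pi*i = 18.8496j
print(2j * np.pi * (-2 + 5))  # right-hand side of (27)
```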

The structure of disks and domains described earlier in the Laurent series expression can be applied in the transportation of information within complex planes; see Rao and Krantz (2021). In Rao and Krantz (2021) we described the basics and the importance of conformal mapping and the preservation of angles in 3D objects. Conformality is an important feature of analytic functions. See Rao and Krantz (2021) as well as Ahlfors (1978), Churchill and Brown (1984), Krantz (2004), and Rudin (1987) for general ideas and foundations on conformal mapping of two piecewise smooth arcs, and especially regarding angle preservation. In this chapter, we demonstrate the conformal mapping principle on two intersecting Jordan curves.

Let us consider two Jordan curves J1 and J2 within an annulus with radii rA and rB. Assume that the two curves J1 and J2 intersect at z1*, z2* with
rA < |z1*| < rB,    (28)

and

rA < |z2*| < rB.    (29)

The curve J1 on the plane is created from the real values of an interval, say, [a1, b1], and the curve J2 is created on the plane from a real-valued interval, say, [a2, b2] (Fig. 4).

Let

c1 = arg[J2′(t2)] − arg[J1′(t1)],
c2 = arg[J2′(t4)] − arg[J1′(t3)],

where t1, t3 ∈ [a1, b1] and t2, t4 ∈ [a2, b2]. Here J1(t1) = J2(t2) = z1* and J1(t3) = J2(t4) = z2*. Let f1 and f2 be two analytic functions mapped from J1 and J2, respectively. Assume f1′(z1*) ≠ 0 and f2′(z2*) ≠ 0. Due to conformality, the angle between the two mapped curves, say S2 and S1, at f1(z1*) will be equal to the angle c1 between J2 and J1 at z1*, and the angle between S2 and S1 at f2(z2*) will be equal to the angle c2 between J2 and J1 at z2*.

We have discussed conformal mapping of curves within an annulus. One can give a similar description in any region U ⊂ ℂ in which the two curves J1 and J2 are created.

FIG. 4 Conformal mappings of two intersecting Jordan curves J1 and J2 onto two intersecting Jordan curves S1 and S2. Note that although we have demonstrated here a Jordan curve mapping to another Jordan curve, a Jordan curve need not always map to a Jordan curve. The two curves J1 and J2 are generated independently of each other. We also mention that a Jordan curve within an annulus need not map to a curve situated in another annulus.
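The angle preservation just described can be checked numerically. The following sketch (assuming NumPy; the curves and the map are illustrative choices, not the curves of Fig. 4) passes two smooth curves through z* = 1 + i, maps them by the analytic function f(z) = z² (for which f′(z*) ≠ 0), and compares the angles between the tangents before and after the mapping.

```python
# Minimal sketch (assumes NumPy): conformality of f(z) = z^2 away from 0.
# Two curves cross at zstar; the angle between their tangents is preserved.
import numpy as np

f = lambda z: z**2
zstar = 1.0 + 1.0j
J1 = lambda t: zstar + t * np.exp(0.3j)  # curve through zstar, direction 0.3 rad
J2 = lambda t: zstar + t * np.exp(1.1j)  # curve through zstar, direction 1.1 rad

h = 1e-6
def tangent(curve, t0=0.0):
    """Central finite-difference tangent of a curve at t0."""
    return (curve(t0 + h) - curve(t0 - h)) / (2 * h)

angle_before = np.angle(tangent(J2) / tangent(J1))
angle_after = np.angle(tangent(lambda t: f(J2(t))) / tangent(lambda t: f(J1(t))))

print(angle_before, angle_after)  # both approximately 0.8 rad: angle preserved
```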
FIG. 5 Formation of contours on the boundary of a ball B(z0, rA) due to the intersection of a bundle of complex planes. An overview of the formation of Jordan curves is shown in this figure. More details are in Fig. 6.

3 Geometric analysis and Jordan curves


In this section, we describe contour formation on the boundary of a ball using a bundle of complex planes. See Fig. 5. We prove a couple of important results on Jordan curves that are constructed within a bundle G and between two complex planes which are orthogonal to G. We call these two parallel planes O1 and O2. The Jordan curve we construct on the boundary of the ball is not just an arbitrary, generic Jordan curve. It has some special properties that require work to construct. The special kind of Jordan curves that we construct in this section obey the basic properties of Jordan curves in complex planes. A brief overview of the section is provided in the next paragraph, and more details are provided in the later part of the section. A ball B(z0, rA) is placed within G and in between O1 and O2. For constructing arcs or contours, we do not have access to the points of G which are inside B(z0, rA). Jordan curves are formed by joining points of G which are on the boundary of B(z0, rA), say ∂B(z0, rA), and these are at the intersections of G with O1 and O2. From any arbitrary point of G which is on ∂B(z0, rA), a piecewise arc is constructed that passes through O1 and O2; we return to a different point on ∂B(z0, rA) by constructing another piecewise arc. This procedure is continued to obtain a closed contour γ(t) for t ∈ [0, δ] (δ > 0). Please see the subsequent material and Theorems 5 and 6.
Let B(z0, rA) be a ball. Let there be infinitely many complex planes parallel to each other passing through the ball B(z0, rA) as shown in Fig. 5. Let us denote these parallel complex planes as a bundle G. Each plane in G intersects B(z0, rA) only at the boundary of B(z0, rA). The contours on the boundary of B(z0, rA) are formed by concatenating several piecewise smooth arcs, as will be described in this section. (Regular concatenated arcs on a single complex plane were described in Section 2.) The concatenation of smooth arcs in Fig. 5 occurs on the boundary of B(z0, rA) due to G. There are several interesting properties that can be derived for the contours formed on B(z0, rA). Since the planes within G are parallel to each other and they intersect with B(z0, rA) only at the boundary of B(z0, rA), we cannot construct contours passing between two or more planes of G. Let us consider two complex planes, say O1 and O2, that intersect with G orthogonally and such that B(z0, rA) lies between O1 and O2 as in Fig. 6B. The two parallel planes O1 and O2 are not part of G. A Jordan curve can be created on the circle that is formed by the intersection of a plane of G and B(z0, rA); we are trying to create Jordan curves which touch points on ∂B(z0, rA) but which are not located in a single plane in which γ(t) for t > 0 lies. The planes O1 and O2 have locations which satisfy these conditions:

O1 ∩ B(z0, rA) = ϕ (empty) and O2 ∩ B(z0, rA) = ϕ (empty),    (30)

O1 ∩ O2 = ϕ,    (31)

O1 ∩ G ≠ ϕ and O2 ∩ G ≠ ϕ.    (32)
Theorem 5 (Jordan curve on a ball). Suppose B(z0, rA), O1, O2, and G are given and they satisfy (30)–(32). Let ∂B(z0, rA) be the set of boundary points of B(z0, rA). Then, one can construct a Jordan curve passing through ∂B(z0, rA).

Proof. We will have infinitely many points of G on the boundary of B(z0, rA). Let ℂa, ℂb be two arbitrary planes within G. Let z1(a) be one such point of G that lies on ℂa and ∂B(z0, rA). A point in G we treat here as a complex

FIG. 6 Microscopic view of the formation of contours and Jordan curves. Planes that do not belong to G help to construct Jordan curves, as these planes allow contour formation between different planes of G and a return to the points on ∂B(z0, rA).
number, and an element in G we treat here as a plane. An arbitrary point in G could be lying on any arbitrary plane within G. Since Eqs. (30) and (32) hold, we will have z1(a) ∉ O1. Let z1(b) be another point of G that lies on ℂb and ∂B(z0, rA). The plane ℂb could be above or below ℂa. Suppose that ℂb is parallel to and disjoint from ℂa. One can draw a smooth arc from z1(a) to z1(b) through O1 if z1(a) and z1(b) lie on the same side of the ball where O1 intersects with G, or one can draw a smooth arc from z1(a) to z1(b) through O2 on the other side of the ball where O2 intersects with G. Suppose z1(a) and z1(b) are joined through O1 as shown in Fig. 6A. The intersection of a plane in G with the ball is a ring-shaped object, the boundary of a disk. The points inside B(z0, rA) are not available to us for constructing arcs between planes, as explained at the beginning of the section.

Let us consider a new plane ℂd (≠ ℂa, ℂb) within G. The plane ℂd could be above ℂb and adjacent to ℂb, or ℂd could be below ℂa and adjacent to ℂa. Suppose ℂd is above ℂb and let z1(d) be a point on ℂd. One can draw a shorter smooth arc from z1(b) to z1(d) through O1 if z1(b) and z1(d) lie on the left side of the ball as shown in Fig. 6A. A contour can then be drawn from z1(a) to z1(d) passing through O1 twice. We keep constructing newer smooth arcs and contours passing from z1(d) to other points on ∂B(z0, rA) via the plane O1.

Let γ(t) be the contour drawn for t ∈ [0, δ] ⊂ ℝ. Let the starting point of this contour be z1(a); that is, γ(t) = z1(a) for t = 0. The piece of the contour from z1(a) to z1(d) is explained above. Once the contour reaches z1(d) at some value of t for t > 0, it can travel to other regions of ∂B(z0, rA) through O1, O2, and G. From the point z1(d) on γ(t) for t ∈ (0, δ), a smooth arc can be constructed that passes through the corresponding plane ℂd in which z1(d) lies, and this smooth arc could pass through a point on ∂B(z0, rA) on the other side of B(z0, rA), as shown in the example paths in Fig. 6B. Let z1(d1) be the point on the other side of B(z0, rA), lying on ∂B(z0, rA), such that

{z1(d1)} ∩ ∂B(z0, rA) ≠ ϕ.    (33)

The points z1(d1) and z1(d) both lie on ℂd. So an arc can be constructed directly from z1(d1) to z1(d), or an arc can be constructed from z1(d1) to z1(d) indirectly, passing through O2. Let z1(t) for t ∈ [d, d1] ⊂ (0, δ) be the piece of arc of the contour γ(t) which is drawn from z1(d) to z1(d1). Here z1(t) for t ∈ [d, d1] may or may not pass through O2. An arc from z1(d1) to another point on ∂B(z0, rA) on the other side of B(z0, rA) can be drawn in a fashion similar to that described above for z1(t) for t ∈ [a, d] on one side of B(z0, rA) in Fig. 6B. In a similar way, piecewise arcs from d1 to δ can be constructed for d1 ≤ t ≤ δ.

Let us divide the interval [d1, δ] into a finite number of intervals, say,

{[d1, d2], [d2, d3], …, [dk, δ]}.    (34)

For each of the intervals, we can construct piecewise nonoverlapping smooth arcs, say, z1(t) for t ∈ [d1, d2], z1(t) for t ∈ [d2, d3], and so on until z1(t) for t ∈ [dk, δ]. The final piece of the arc, z1(t) for t ∈ [dk, δ], will be joined to z1(a) at t = δ such that z1(δ) = z1(a), which completes the formation of a Jordan curve J1 as indicated in Fig. 6B. □
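As a numerical illustration of the construction in the proof (a sketch under assumed coordinates, not the authors’ formal construction): model ∂B(z0, rA) as the unit sphere in ℝ³ and the planes of G as horizontal slices, each meeting the sphere in a circle. We pick points z1(a), z1(b), … on circles of different planes and join consecutive points by arcs kept on the sphere for simplicity (in the proof, the connecting arcs pass through O1 and O2), closing the path back to its starting point.

```python
# Minimal sketch (assumes NumPy): a closed piecewise-smooth curve on the unit
# sphere through points lying on circles cut out by parallel planes.
import numpy as np

def on_circle(height, angle):
    """Point of the sphere on the slice plane z = height (a plane of G)."""
    r = np.sqrt(1.0 - height**2)
    return np.array([r * np.cos(angle), r * np.sin(angle), height])

def great_arc(p, q, n=50):
    """Great-circle arc from p to q on the unit sphere (spherical interpolation)."""
    omega = np.arccos(np.clip(p @ q, -1.0, 1.0))
    s = np.linspace(0.0, 1.0, n)[:, None]
    return (np.sin((1 - s) * omega) * p + np.sin(s * omega) * q) / np.sin(omega)

# Points z1(a), z1(b), ... chosen on circles of four different planes of G
pts = [on_circle(-0.5, 0.0), on_circle(0.0, 2.0),
       on_circle(0.5, 4.0), on_circle(0.0, 5.5)]
pts.append(pts[0])  # return to the start, so that gamma(delta) = gamma(0)

curve = np.vstack([great_arc(p, q) for p, q in zip(pts[:-1], pts[1:])])
print(np.allclose(np.linalg.norm(curve, axis=1), 1.0))  # True: stays on the boundary
print(np.allclose(curve[0], curve[-1]))                 # True: the curve is closed
```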

Remark 1. In the above proof, we have constructed a smooth arc from z1(a) to z1(d) passing through O1, but one can argue similarly to construct a smooth arc from z1(a) to z1(d) passing through O2.

Remark 2. In the proof of Theorem 5, we have provided a construction of a Jordan curve that travels from one side of B(z0, rA) to the other side of B(z0, rA) and returns to the point from where γ(t) started. However, one can have other paths of Jordan curves, for example, J2 and J3 as shown in Fig. 6B.

Theorem 6 (Infinitely many Jordan curves on a ball and covering theorem). Suppose B(z0, rA) is considered within a bundle of three intersecting bundles of infinitely many complex planes G1, G2, and G3 (Fig. 7) such that

G1 ∩ G2 ∩ G3 ≠ ϕ,    (35)

∂B(z0, rA) ∩ G1 ∩ G2 ∩ G3 ≠ ϕ.    (36)

Then,

(a) we can construct infinitely many Jordan curves Jk for k = 1, 2, … on the boundary of the ball B(z0, rA), and

(b) ∪_{k=1}^{∞} Jk = ∂B(z0, rA).    (37)

FIG. 7 Jordan curves on B(z0, rA) in a bundle of bundles.

Proof. An arbitrary Jordan curve Jk in (a) can be constructed in a way similar to that described in the proof of Theorem 5. For (b), the idea is as follows: all the untouched points on the boundary can be used to construct a new contour, and we can repeat the process so that (37) holds. □

4 Summary
Our novelty is that we constructed Jordan curves on the boundary of a ball (which is simply connected) through points generated by complex plane bundles on the boundary of the ball considered. The development of a contour is facilitated by the way we consider the bundles, and not by the simple connectedness of the ball. These kinds of contour formations are treated here as the passing of information between planes and the geometry of contours.
The ideas of information geometry and its emerging newer applications are interesting; see, for example, Hayashi (2022), Mishra and Kumar (2021), Nielsen (2022), Gauchy et al. (2022), Hua et al. (2021), and Barbaresco (2021) for recent developments. We hope our newer ideas of constructing contours will add to the further development of information geometry on planes, surfaces, and topological manifolds.

References
Ahlfors, L.V., 1978. Complex Analysis. An Introduction to the Theory of Analytic Functions of
One Complex Variable. In: third ed. International Series in Pure and Applied Mathematics,
McGraw-Hill Book Co., New York. xi+331.
Amari, S.-i., 2016. Information Geometry and Its Applications. Applied Mathematical Sciences,
vol. 194, Springer, Tokyo. xiii+374 pp., ISBN 978-4-431-55977-1; 978-4-431-55978-8.
Amari, S.-i., Nagaoka, H., 2000. Methods of Information Geometry (Translated from the 1993
Japanese original by Daishi Harada. Translations of Mathematical Monographs, 191).
American Mathematical Society/Oxford University Press, Providence, RI/Oxford, x+206 pp.
ISBN: 0-8218-0531-2.
Amari, S.I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R., 1987. Differential
Geometry in Statistical Inference. Institute of Mathematical Statistics Lecture Notes–
Monograph Series, vol. 10, Institute of Mathematical Statistics, Hayward, CA, iv+240 pp.
ISBN: 0-940600-12-9.
Ay, N., Jost, J., Lê, H.V., Schwachhöfer, L., 2017. Information Geometry. A Series of Modern Surveys in Mathematics, vol. 64, Springer, Cham, xi+407 pp. ISBN: 978-3-319-56477-7.
Barbaresco, F., 2021. Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie
group thermodynamics and information geometry. Inf. Geom. 4 (1), 245–262.
Bhattacharyya, C., Keerthi, S.S., 2000. Information geometry and Plefka’s mean-field theory. J. Phys. A 33 (7), 1307–1312.
Chern, S.S., 1977. Circle bundles, geometry and topology. In: Lecture Notes in Math. Proc. III
Latin Amer. School of Math., Inst. Mat. Pura Aplicada CNPq, Rio de Janeiro, 1976,
vol. 597. Springer, Berlin, pp. 114–131.

Chern, S.S., 1989. Vector Bundles With a Connection. Global Differential Geometry. MAA Stud.
Math., vol. 27, Mathematical Association of America, Washington, DC, pp. 1–26.
Choi, Y.-J., 2012. On the differential geometric characterization of the Lee models. J. Geom.
Anal. 22 (1), 168–205.
Churchill, R.V., Brown, J.W., 1984. Complex Variables and Applications, fourth ed. McGraw-Hill Book Co., New York. x+339 pp.
Dehesa, J.S., Plastino, A.R., Sánchez-Moreno, P., Vignat, C., 2012. Generalized Cramer-Rao rela-
tions for non-relativistic quantum systems. Appl. Math. Lett. 25 (11), 1689–1694.
Efron, B., 1975. Defining the curvature of a statistical problem (with applications to second order
efficiency). Ann. Statist. 3 (6), 1189–1242.
Frieden, B.R., 1992. Fisher information and uncertainty complementarity. Phys. Lett. A 169 (3), 123–130.
Frieden, B.R., 2021. Principle of minimum loss of Fisher information, arising from the
Cramer-Rao inequality: its role in evolution of bio-physical laws, complex systems and uni-
verses. In: Plastino, A., Rao, A.S.R.S., Rao, C.R. (Eds.), Information Geometry. Handbook of
Statistics, vol. 45. Elsevier, pp. 117–148.
Gauchy, C., Stenger, J., Sueur, R., Iooss, B., 2022. An information geometry approach to robust-
ness analysis for the uncertainty quantification of computer codes. Technometrics 64 (1),
80–91.
Green, M., Lazarsfeld, R., 1991. Higher obstructions to deforming cohomology groups of line
bundles. J. Am. Math. Soc. 4 (1), 87–103.
Greene, R.E., Kim, K.-T., Krantz, S.G., 2011. The Geometry of Complex Domains. Progress in Mathematics, vol. 291, Birkhäuser Boston, Ltd., Boston, MA, xiv+303 pp. ISBN: 978-0-8176-4139-9.
Hayashi, M., 2022. Information geometry approach to parameter estimation in hidden Markov
model. Bernoulli 28 (1), 307–342.
Hua, X., Ono, Y., Peng, L., Cheng, Y., Wang, H., 2021. Target detection within nonhomogeneous
clutter via total bregman divergence-based matrix information geometry detectors. IEEE
Trans. Signal Process. 69, 4326–4340.
Jaiswal, N., Gautam, M., Sarkar, T., 2021. Complexity and information geometry in the transverse
XY model. Phys. Rev. E 104 (2) (Paper No. 024127, 10 pp.).
Kobayashi, S., 2014. Differential Geometry of Complex Vector Bundles. Princeton University
Press, Princeton, NJ. Reprint of the 1987 edition.
Krantz, S.G., 2004. Complex Analysis: The Geometric Viewpoint. In: second ed. Carus Mathe-
matical Monographs, vol. 23. Mathematical Association of America, Washington, DC, ISBN:
0-88385-035-4. xviii+219 pp.
Krantz, S.G., 2017. Harmonic and Complex Analysis in Several Variables. Springer Monographs
in Mathematics, Springer, Cham. ISBN 978-3-319-63229-2; 978-3-319-63231-5, xii+424 pp.
Krantz, S., 2020. How fundamental is the fundamental theorem of algebra? Math. Mag. 93 (2),
139–142.
Krantz, S.G., Parks, H.R., 1999. The Geometry of Domains in Space. Birkhäuser Advanced Texts: Basler Lehrbücher (Birkhäuser Advanced Texts: Basel Textbooks), Birkhäuser Boston, Inc., Boston, MA, x+308 pp. ISBN: 0-8176-4097-5.
Kruglikov, B.S., 2007. Tangent and normal bundles in almost complex geometry. Differential
Geom. Appl. 25 (4), 399–418.
Mishra, K.V., Kumar, M.A., 2021. Information geometry and classical Cramer-Rao-type inequal-
ities. In: Plastino, A., Rao, A.S.R.S., Rao, C.R. (Eds.), Handbook of Statistics: Information
Geometry. vol. 45. Elsevier, pp. 79–114.

Nielsen, F., 2022. The many faces of information geometry. Not. Am. Math. Soc. 69 (1), 36–45.
Plastino, A.R., Plastino, A., Pennini, F., 2021. Chapter 1–Revisiting the connection between
Fisher information and entropy’s rate of change. In: Plastino, A., Rao, A.S.R.S., Rao, C.R.
(Eds.), Handbook of Statistics. vol. 45. Elsevier, pp. 3–14.
Rao, C.R., 1949. On the distance between two populations. Sankhya 9, 246–248.
Rao, C.R., 1973. Linear Statistical Inference and Its Applications. In: second ed. Wiley Series in
Probability and Mathematical Statistics, John Wiley & Sons, New York, London, Sydney.
xx+625 pp.
Rao, A.S.R.S., 2021. Multilevel contours on bundles of complex planes. In: Nielsen, F., Rao,
A.S.R.S., Rao, C.R. (Eds.), Geometry and Statistics. Handbook of Statistics, vol. 46. Elsevier.
Rao, A.S.R.S., Krantz, S.G., 2020. Data science for virtual tourism using cutting-edge visualiza-
tions: information geometry and conformal mapping. Patterns 1 (5), 100067.
Rao, A.S.R.S., Krantz, S.G., 2021. Rao distances and conformal mapping. In: Plastino, A.,
Rao, A.S.R.S., Rao, C.R. (Eds.), Information Geometry. Handbook of Statistics, vol. 45.
Elsevier, pp. 43–56.
Rudin, W., 1987. Real and Complex Analysis, third ed. McGraw-Hill Book Co., New York,
xiv+416 pp. ISBN: 0-07-054234-1.
Tsallis, C., 1988. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52 (1–2),
479–487.
van Rijsbergen, C.J., 2004. The Geometry of Information Retrieval. Cambridge University Press,
Cambridge, xii+150 pp. ISBN: 0-521-83805-3.
Wu, D., Yau, S.T., 2016. Negative holomorphic curvature and positive canonical bundle. Invent.
Math. 204 (2), 595–604.
Yoo, S., 2017. A differential-geometric analysis of the Bergman representative map. Ann. Polon.
Math. 120 (2), 163–181.
Zhang, J., Wong, T.-K.L., 2021. Chapter 10–λ-Deformed probability families with subtractive and
divisive normalizations. In: Plastino, A., Rao, A.S.R.S., Rao, C.R. (Eds.), Information Geom-
etry. Handbook of Statistics, vol. 45. Elsevier, pp. 187–215.
This page intentionally left blank
Chapter 2

Geometric methods for sampling, optimization, inference, and adaptive agents

Alessandro Barp^{a,b,*,†}, Lancelot Da Costa^{c,d,†}, Guilherme França^{e,†}, Karl Friston^{d}, Mark Girolami^{a,b}, Michael I. Jordan^{e,f}, and Grigorios A. Pavliotis^{c}

^{a} Department of Engineering, University of Cambridge, Cambridge, United Kingdom
^{b} The Alan Turing Institute, The British Library, London, United Kingdom
^{c} Department of Mathematics, Imperial College London, London, United Kingdom
^{d} Wellcome Centre for Human Neuroimaging, University College London, London, United Kingdom
^{e} Computer Science Division, University of California, Berkeley, CA, United States
^{f} Department of Statistics, University of California, Berkeley, CA, United States
* Corresponding author: e-mail: ab2286@cam.ac.uk
† Equal contribution.

Abstract
In this chapter, we identify fundamental geometric structures that underlie the problems
of sampling, optimization, inference, and adaptive decision-making. Based on this
identification, we derive algorithms that exploit these geometric structures to solve
these problems efficiently. We show that a wide range of geometric theories emerge
naturally in these fields, including measure-preserving processes, information divergences, Poisson geometry, and geometric integration. Specifically, we explain how
(i) leveraging the symplectic geometry of Hamiltonian systems enables us to construct
(accelerated) sampling and optimization methods, (ii) the theory of Hilbertian subspaces
and Stein operators provides a general methodology to obtain robust estimators, and
(iii) preserving the information geometry of decision-making yields adaptive agents that
perform active inference. Throughout, we emphasize the rich connections between
these fields; e.g., inference draws on sampling and optimization, and adaptive
decision-making assesses decisions by inferring their counterfactual consequences.
Our exposition provides a conceptual overview of underlying ideas, rather than a tech-
nical discussion, which can be found in the references herein.
Keywords: Information geometry, Hamiltonian Monte Carlo, Stein's method, Reproducing kernel, Variational inference, Accelerated optimization, Dissipative systems, Decision theory, Active inference

1 Introduction
Differential geometry plays a fundamental role in applied mathematics, statis-
tics, and computer science, including numerical integration (Celledoni et al.,
2014; Hairer et al., 2010; Leimkuhler and Reich, 2004; Marsden and West,
2001; McLachlan and Quispel, 2002), optimization (Alimisis et al., 2021;
Betancourt et al., 2018; Bravetti et al., 2019; França et al., 2020, 2021a,b),
sampling (Barp et al., 2019b; Betancourt et al., 2017; Duane et al., 1987;
Livingstone et al., 2019; Rousset et al., 2010), statistics on spaces with deep
learning (Bronstein et al., 2021; Celledoni et al., 2021), medical imaging
and shape methods (Durrleman et al., 2009; Vaillant and Glaunes, 2005),
interpolation (Barp et al., 2022), and the study of random maps (Harms
et al., 2020), to name a few. Of particular relevance to this chapter is informa-
tion geometry, i.e., the differential geometric treatment of smooth statistical
manifolds, whose origin stems from a seminal article by Rao (1992) who
introduced the Fisher metric tensor on parameterized statistical models, and
thus a natural Riemannian geometry that was later observed to correspond
to an infinitesimal distance with respect to the Kullback–Leibler (KL) diver-
gence (Jeffreys, 1946). The geometric study of statistical models has had
many successes (Amari, 2016; Ay et al., 2017; Nielsen, 2020), ranging from
statistical inference, where it was used to prove the optimality of the maxi-
mum likelihood estimator (Amari, 2012), to the construction of the category
of mathematical statistics, generated by Markov morphisms (Chentsov,
1965; Jost et al., 2021). Our goal in this chapter is to discuss the emergence
of natural geometries within a few important areas of statistics and applied
mathematics, namely optimization, sampling, inference, and adaptive agents.
We provide a conceptual introduction to the underlying ideas rather than a
technical discussion, highlighting connections with various fields of mathe-
matics and physics.
The vast majority of statistics and machine learning applications involve
solving optimization problems. Accelerated gradient-based methods
(Nesterov, 1983; Polyak, 1964), and several variations thereof, have become workhorses in these fields. Recently, there has been great interest in studying such methods from a continuous-time limiting perspective; see, e.g., Su et al. (2016), Wibisono et al. (2016), Wilson et al. (2021), França et al. (2018), França et al. (2021c), Muehlebach and Jordan (2021b), and Muehlebach and Jordan (2021a) and references therein. Such methods can be seen as first-order integrators of a classical Hamiltonian system with dissipation. This raises the question of how to discretize the system such that important properties are preserved, assuming the system has fast convergence
to critical points and desirable stability properties. It has been known for a
long time that the class of symplectic integrators is the preferred choice for
simulating physical systems (Benettin and Giorgilli, 1994; Forest, 2006;
Hairer et al., 2010; Kennedy et al., 2013; McLachlan and Quispel, 2002;
McLachlan and Quispel, 2006; Sanz-Serna, 1992; Suzuki, 1990; Takahashi
and Imada, 1984; Yoshida, 1990). These discretization techniques, designed
to preserve the underlying (symplectic) geometry of Hamiltonian systems,
also form the basis of Hamiltonian Monte Carlo (HMC) (or hybrid Monte
Carlo) methods (Duane et al., 1987; Neal, 2011). Originally, such a theory
of geometric integration was developed with conservative systems in mind,
while, in optimization, the associated system is naturally a dissipative one.
Nevertheless, symplectic integrators were exploited in this context
(Betancourt et al., 2018; Bravetti et al., 2019; França et al., 2020). More
recently, it has been proved that a generalization of symplectic integrators to
dissipative Hamiltonian systems is indeed able to preserve rates of convergence
and stability (França et al., 2021b), which are the main properties of interest for
optimization. Follow-up work (França et al., 2021a) extended this approach,
enabling optimization on manifolds and problems with constraints. There is
also a tight connection between optimization on the space of measures and sam-
pling which dates back to Otto (2001) and Jordan et al. (1998); we will revisit
these ideas in relation to dissipative Hamiltonian systems.
Sampling methods are critical to the efficient implementation of many
methodologies. Most modern samplers are based on Markov Chain Monte
Carlo methods, which include slice samplers (Murray et al., 2010; Neal,
2003), piecewise-deterministic Markov chains, such as bouncy particle and
zig-zag samplers (Bierkens and Roberts, 2017; Bierkens et al., 2019;
Bouchard-Côté et al., 2018; Davis, 1984; Peters and de With, 2012; Vanetti
et al., 2017), Langevin algorithms (Durmus and Moulines, 2017; Durmus
et al., 2018; Roberts and Tweedie, 1996), interacting particle systems
(Garbuno-Inigo et al., 2019), and the class of HMC methods (Barp et al.,
2018; Betancourt, 2017; Betancourt et al., 2017; Duane et al., 1987; Neal,
2011; Rousset et al., 2010). The original HMC algorithm was introduced in
physics to sample distributions on gauge groups for lattice quantum chromo-
dynamics (Duane et al., 1987). It combined two approaches that emerged in
previous decades, namely the Metropolis-Hastings algorithm and the Hamilto-
nian formulation of molecular dynamics (Alder and Wainwright, 1959;
Hastings, 1970; Metropolis et al., 1953). Modern HMC relies heavily on
symplectic integrators to simulate a deterministic dynamics, responsible for generating distant moves between samples and thus reducing their correlation,
while at the same time preserving important geometric properties. This deter-
ministic step is then usually combined with a corrective step (originally a
Metropolis-Hastings acceptance step) to ensure preservation of the correct
target, and with a stochastic process, employed to speed up convergence to
the target distribution. We will first focus on the geometry of measure-
preserving diffusions, which emerges from ideas formulated by Poincaré
and Volterra, and form the building block of many samplers. In particular, we
will discuss ways to “accelerate” sampling using irreversibility and hypoellip-
ticity. We will then introduce HMC focusing on its underlying Poisson geom-
etry, the important role played by symmetries, and its connection to geometric
integration.
We then discuss the problem of statistical inference, whose practical
implementation usually relies upon sampling and optimization. Given obser-
vations from a target distribution, many estimators belong to the family of
the so-called M and Z estimators (Van der Vaart, 2000), which are obtained
by finding the parameters that maximize (or are zeros of) a parameterized
set of functions. These include the maximum likelihood and minimum
Hyvärinen score matching estimators (Hyvärinen and Dayan, 2005; Vapnik,
1999), which are also particular instances of the minimum score estimators
induced by scoring rules that quantify the discrepancy between a sample
and a distribution (Parry et al., 2012). The Monge–Kantorovich transportation
problem (Villani, 2009b) motivates another important class of estimators,
namely the minimum Kantorovich and p-Wasserstein estimators, whose
implementation use the Sinkhorn discrepancy (Bassetti et al., 2006; Cuturi,
2013; Peyré et al., 2019). Our discussion of inference builds upon the theory
of Hilbertian subspaces and, in particular, reproducing kernels. These infer-
ence schemes rely on the continuity of linear functionals, such as probability
and Schwartz distributions, over a class of functions to geometrize the analy-
sis of integral probability metrics, which measure the worst-case integration
error. We shall explain how maximum mean, kernelized, and score matching
discrepancies arise naturally from topological considerations.
Models of adaptive agents are the basis of algorithmic decision-making
under uncertainty. This is a difficult problem that spans multiple disciplines
such as statistical decision theory (Berger, 1985), game theory (Von
Neumann and Morgenstern, 1944), control theory (Bellman and Dreyfus,
2015), reinforcement learning (Barto and Sutton, 1992), and active inference
(Da Costa et al., 2020a). To illustrate a generic use case for the previous
methodologies, we consider active inference, a unifying formulation of
behavior—subsuming perception, planning, and learning—as a process of
inference (Da Costa et al., 2020a; Friston, 2010; Friston et al., 2010, 2015).
We describe decision-making under active inference using information geom-
etry, revealing several special cases that are established notions in statistics,
cognitive science, and engineering. We then show how preserving this infor-
mation geometry in algorithms enables adaptive algorithmic decision-making,
endowing robots and artificial agents with useful capabilities, including
robustness, generalization, and context-sensitivity (Da Costa et al., 2022;
Lanillos et al., 2021). Active inference is an interesting use case because it
has yet to be scaled—to tackle high dimensional problems—to the same
extent as established approaches, such as reinforcement learning (Silver
et al., 2016); however, numerical analyses generally show that active
inference performs at least as well in simple environments (Cullen et al.,
2018; Markovic et al., 2021; Mazzaglia et al., 2021; Millidge, 2020; Paul
et al., 2021; Sajid et al., 2021a; van der Himst and Lanillos, 2020), and better
in environments featuring volatility, ambiguity, and context switches
(Markovic et al., 2021; Sajid et al., 2021a).

2 Accelerated optimization
We shall be concerned with the problem of optimization of a function $V : M \to \mathbb{R}$, i.e., finding a point that maximizes $V(q)$, or minimizes $-V(q)$, over a smooth manifold $M$. We will assume this function is differentiable so as to construct algorithms that rely on the flows of smooth vector fields guided by the derivatives of $V(q)$.
Many algorithms in optimization are given as a sequence of finite differences, represented by iterations of a mapping $\Psi_{\delta t} : M \to M$, where $\delta t > 0$ is a step size. The analysis of such finite difference iterations is usually challenging, relying on painstaking algebra to obtain theoretical guarantees such as convergence to a critical point, stability, and rates of convergence. Even when these algorithms are seen as discretizations of a continuum system, whose behavior is presumably understood, it is well known that most discretizations break important properties of the system.

2.1 Principle of geometric integration


Fortunately, here comes into play one of the most fundamental ideas of geo-
metric integration: many numerical integrators are very close—exponentially
in the step size—to a smooth dynamics generated by a shadow vector field
(a perturbation of the original vector field). This allows us to analyze the dis-
crete trajectory implemented by the algorithm using powerful tools from
dynamical systems and differential geometry, which are a priori reserved for
smooth systems. Crucially, while numerical integrators typically diverge sig-
nificantly from the dynamics they aim to simulate, geometric integrators
respect the main properties of the system. In the context of optimization this
means respecting stability and rates of convergence. This was first demon-
strated in França et al. (2021b) and further extended in França et al.
(2021a); our following discussion will be based on these works.
The simplest way to construct numerical methods to simulate the flow of a vector field $X$ arises when it is given by a sum, $X = Y + Z$, and the flows of the individual vector fields $Y$ and $Z$ are—analytically or numerically—tractable. In such a case, we can approximate the exact flow $\Phi^X_{\delta t} = e^{\delta t X}$, for step size $\delta t > 0$, by composing the individual flows $\Phi^Y_{\delta t} = e^{\delta t Y}$ and $\Phi^Z_{\delta t} = e^{\delta t Z}$. The simplest composition is given by $\Psi^X_{\delta t} \equiv \Phi^Y_{\delta t} \circ \Phi^Z_{\delta t}$. The Baker–Campbell–Hausdorff (BCH) formula then yields
$$e^{\delta t Y} \circ e^{\delta t Z} = e^{\delta t \tilde{X}}, \qquad \tilde{X} = (Y + Z) + \frac{1}{2}[Y, Z]\,\delta t + \frac{1}{12}\big([Y, [Y, Z]] - [Z, [Y, Z]]\big)\,\delta t^2 + \cdots, \tag{1}$$
where $[Y, Z] = YZ - ZY$ is the commutator between $Y$ and $Z$. Thus, the numerical method itself can be seen as a smooth dynamical system with flow map $\Psi^X_{\delta t} = e^{\delta t \tilde{X}}$. The goal of geometric integration is to construct numerical methods for which $\tilde{X}$ shares with $X$ the critical properties of interest; this is usually done by requiring preservation of some geometric structure.
Recall that a numerical map $\Psi^X_{\delta t}$ is said to be of order $r \geq 1$ if $|\Psi^X_{\delta t} - \Phi^X_{\delta t}| = O(\delta t^{r+1})$; we abuse notation slightly and let $|\cdot|$ denote a well-defined distance over manifolds (see Hansen (2011) for details). Thus, the expansion (1) also shows that the error in the approximation is $|\Psi^X_{\delta t} - \Phi^X_{\delta t}| = O(\delta t^2)$, i.e., we have an integrator of order $r = 1$. One can also consider more elaborate compositions, such as
$$\Psi^X_{\delta t} \equiv \Phi^Y_{\delta t/2} \circ \Phi^Z_{\delta t} \circ \Phi^Y_{\delta t/2}, \tag{2}$$
which is more accurate since the first term in (1) cancels out, yielding an integrator of order $r = 2$.^a
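To make the orders above concrete, here is a minimal numerical check (our own sketch, assuming Python with NumPy and SciPy): for linear vector fields the flows are matrix exponentials, so the one-step errors of the two compositions can be compared directly.

```python
# Minimal check of splitting orders for linear vector fields: X = Y + Z with
# Y, Z matrices, exact flow expm(dt*(Y+Z)), Lie-Trotter composition of order
# r = 1, and the Strang composition (2) of order r = 2.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
Y, Z = rng.standard_normal((2, 4, 4))  # two random 4x4 generators

for dt in (0.1, 0.05, 0.025):
    exact = expm(dt * (Y + Z))
    lie_trotter = expm(dt * Y) @ expm(dt * Z)                    # Phi^Y o Phi^Z
    strang = expm(dt * Y / 2) @ expm(dt * Z) @ expm(dt * Y / 2)  # Eq. (2)
    print(f"dt = {dt:5.3f}   |LT - exact| = {np.linalg.norm(lie_trotter - exact):.2e}"
          f"   |Strang - exact| = {np.linalg.norm(strang - exact):.2e}")
# Halving dt cuts the one-step Lie-Trotter error by ~4 (O(dt^2)) and the
# Strang error by ~8 (O(dt^3)), matching orders r = 1 and r = 2.
```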

2.2 Conservative flows and symplectic integrators


As a stepping stone, we first discuss the construction of suitable conservative flows, namely flows along which some function $f : \mathcal{X} \to \mathbb{R}$ is constant, where $\mathcal{X}$ is the phase space manifold of the system, i.e., the space in which the dynamics evolves. Such flows, which are among the most well studied due to their importance in physics, will enable us to obtain our desired "rate-matching" optimization methods and will also be central in our construction of geometric samplers.
To construct vector fields along the derivative of $f$ we shall need brackets. Geometrically, these are morphisms $\mathcal{X}^* \to \mathcal{X}$, also known as contravariant tensors of rank 2 in physics, where $\mathcal{X}^*$ is the dual space of $\mathcal{X}$. Note that on Riemannian manifolds (e.g., $\mathcal{X} = \mathbb{R}^n$) both spaces are isomorphic. In Euclidean space, $x \in \mathcal{X} = \mathbb{R}^n$, we define such $B$-vector fields in terms of a state-dependent matrix $B = B(x)$ as^b

^a Higher-order methods are constructed by looking for appropriate compositions that cancel the first terms in the BCH formula (Yoshida, 1990). However, methods for $r > 2$ tend to be expensive numerically, with not so many benefits (if any) over methods of order $r = 2$.
^b We denote by $x^i$ the $i$th component of $x$ and $\partial_i \equiv \partial/\partial x^i$. We also use Einstein's summation convention, i.e., repeated upper and lower indices are summed over.
$$X^B_f(x) \equiv B^{ij}(x)\,\partial_i f(x)\,\partial_j. \tag{3}$$
Any vector field that depends linearly and locally on $f$ may be written in this manner. Notice that a decomposition $f = \sum_a f_a$ induces a decomposition $X^B_f = \sum_a X^B_{f_a}$ that is amenable to the splitting integrators previously mentioned. Importantly, vector fields that preserve $f$ correspond to bracket vector fields in which $B$ is antisymmetric (McLachlan et al., 1999). Constructing conservative flows is thus straightforward. Unfortunately, it is a rather more challenging task to construct efficient discretizations that retain this property; most well-known procedures, namely discrete-gradient and projection methods, only give rise to integrators that require solving implicit equations at every step, and they may break other important properties of the system.
For a particular class of conservative flows, it is possible to construct splitting integrators that—exactly—preserve another function $\tilde{f}$ that remains close to $f$. Indeed, going back to the BCH formula (1), we see that if we were to approximate a conservative flow of $X^B_f = X^B_{f_1} + X^B_{f_2}$ by composing the flows of $X^B_{f_1}$ and $X^B_{f_2}$, and, crucially, if we had a bracket for which the commutators can be written as
$$\big[X^B_{f_1}, X^B_{f_2}\big] = X^B_{f_3},$$
for some function $f_3$, and so on for all commutators in (1), then the right-hand side of the BCH formula would itself be an expansion in terms of a vector field $X^B_{\tilde{f}}$ for some shadow function $\tilde{f} = f + f_3\,\delta t + f_4\,\delta t^2 + \cdots$. In particular, $\tilde{f}$ would inherit all the properties of $f$, i.e., properties common to $B$-vector fields.
This is precisely the case for Poisson brackets, written $B \equiv \Pi$, which are antisymmetric brackets for which the Jacobi identity holds:
$$\big[X^\Pi_f, X^\Pi_g\big] = X^\Pi_{\{f, g\}}, \qquad \{f, g\} \equiv \partial_i f\, \Pi^{ij}\, \partial_j g, \tag{4}$$
where $\{f, g\}$ is the Poisson bracket between functions $f$ and $g$. The BCH formula then implies
$$\tilde{f} = (f_1 + f_2) + \frac{1}{2}\{f_1, f_2\}\,\delta t + \frac{1}{12}\big(\{f_1, \{f_1, f_2\}\} + \{f_2, \{f_2, f_1\}\}\big)\,\delta t^2 + \cdots. \tag{5}$$
Such an integrator can thus be seen as a Poisson system itself, generated by the above asymptotic shadow $\tilde{f}$, which is exactly preserved.
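The practical signature of this shadow is that the energy error of a splitting integrator oscillates within a bounded band instead of drifting. A small sketch (ours, assuming Python with NumPy) on the harmonic oscillator $H(q, p) = (p^2 + q^2)/2$:

```python
# Energy behavior of a symplectic (leapfrog) vs a non-symplectic (explicit
# Euler) integrator for the harmonic oscillator H(q, p) = (p^2 + q^2)/2.
import numpy as np

def leapfrog(q, p, dt, steps):
    H = np.empty(steps)
    for k in range(steps):
        p -= 0.5 * dt * q        # half kick: dV/dq = q
        q += dt * p              # drift
        p -= 0.5 * dt * q        # half kick
        H[k] = 0.5 * (p**2 + q**2)
    return H

def euler(q, p, dt, steps):
    H = np.empty(steps)
    for k in range(steps):
        q, p = q + dt * p, p - dt * q
        H[k] = 0.5 * (p**2 + q**2)
    return H

dt, steps = 0.1, 10_000
print("leapfrog: max |H - H0| =", np.abs(leapfrog(1.0, 0.0, dt, steps) - 0.5).max())
print("euler:    final energy =", euler(1.0, 0.0, dt, steps)[-1])
# The leapfrog energy stays in a bounded band around H0 = 1/2 (it exactly
# conserves a shadow Hamiltonian), while the Euler energy grows without bound.
```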
Poisson brackets and their dynamics are the most important class of con-
servative dynamical systems, describing many physical systems, including
all of fluid and classical mechanics. The two main classes of Poisson brackets
are constant antisymmetric matrices on Euclidean space, and symplectic
brackets for which $\Pi^{ij}(x)$ is invertible at every point $x$. Its inverse is denoted
by $\Omega_{ij} = (\Pi^{-1})_{ij}$ and called a symplectic form. In this case, the function $f$ is called a Hamiltonian, denoted $f = H$. The invertibility of the Poisson tensor $\Pi^{ij}$ implies that such a bracket exists only on even-dimensional spaces. Darboux's theorem then ensures the existence of local coordinates $x \equiv (q^1, \ldots, q^d, p_1, \ldots, p_d)$ in which the symplectic form can be represented as $\Omega = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}$. Dynamically, this corresponds to the fact that these are second-order differential equations, requiring not only a position $q \in M$ but also a momentum $p \in T^*_q M$.^c Note that if $H = p_i$, then $X_H = \partial/\partial q^i$, and conversely if $H = q^i$, then $X_H = -\partial/\partial p_i$. Thus, a change in coordinate $q^i$ is generated by its conjugate momentum $p_i$, and vice versa; the only way to generate dynamics on $M$ in this case is by introducing a Hamiltonian depending on both position and momentum. From a numerical viewpoint, the extended phase space introduces extra degrees of freedom that allow us to incorporate "symmetries" in the Hamiltonian, which facilitate integration.
Indeed, in practice, the Hamiltonian usually decomposes into a potential
energy, associated to position and independent of momentum, and a kinetic
energy, associated to momentum and invariant under position changes, both
generating tractable flows. Thanks to this decomposition, we are able to con-
struct numerical methods through splitting the vector field. Note also that, for
symplectic brackets, the existence of a shadow Hamiltonian can be guaranteed
beyond the case of splitting methods, e.g., for variational integrators—which
use a discrete version of Hamilton’s principle of least action—and more
generally for most symplectic integrators in which the symplectic bracket is
preserved up to topological considerations described by the first de Rham
cohomology of phase space.

2.3 Rate-matching integrators for smooth optimization


Having obtained a vast family of smooth dynamics and integrators that closely preserve $f$, we can now apply these ideas to optimization. Vector fields for which a Hamiltonian function $f = H$ dissipates can be written as a bracket vector field $X^B_H$ for some negative semi-definite matrix $B$ (McLachlan et al., 1999). Let us consider a concrete example in $\mathcal{X} = \mathbb{R}^{2d}$ in the form of a (generalized) conformal Hamiltonian system (França et al., 2020, 2021b; McLachlan and Perlmutter, 2001). Consider thus the Hamiltonian

^c More precisely, the dynamics evolve on the cotangent bundle $\mathcal{X} = T^*M$, with coordinates $x = (q, p)$; momentum $p \in T^*_q M$ and velocity $v = dq/dt \in T_q M$ are equivalent on the Riemannian manifolds that are used in practice. $M$ is called the configuration manifold, with coordinates $q$.
$$H(q, p) = \frac{1}{2}\, p_i g^{ij} p_j + V(q), \tag{6}$$
where $g_{ij}$ is a constant symmetric positive definite matrix with inverse $g^{ij}$. The associated vector field is $X^B_H = g^{ij} p_j\, \partial_{q^i} - \big[\partial_{q^i} V + \gamma(t)\, p_i\big]\, \partial_{p_i}$, with $\gamma(t) > 0$ being a "damping coefficient." This is associated to the negative semi-definite matrix
$$B \equiv \underbrace{\begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}}_{\text{conservative}} - \gamma(t) \underbrace{\begin{pmatrix} 0 & 0 \\ 0 & I \end{pmatrix}}_{\text{dissipative}}. \tag{7}$$

The equations of motion are
$$\frac{dq^i}{dt} = g^{ij} p_j, \qquad \frac{dp_i}{dt} = -\frac{\partial V}{\partial q^i} - \gamma(t)\, p_i, \tag{8}$$
and obey
$$\frac{dH}{dt} = -\gamma(t)\, p_i g^{ij} p_j \leq 0, \tag{9}$$
so the system is dissipative. Suppose $V(q)$ has a minimizer $q^\star \equiv \arg\min_q V(q)$ in some region of interest and, without loss of generality, has value $V^\star \equiv V(q^\star) \equiv 0$. Then $H > 0$ and $dH/dt < 0$ outside such a critical point, implying that $H$ is also a (strict) Lyapunov function; the existence of such a Lyapunov function implies that trajectories starting in the neighborhood of $q^\star$ will converge to $q^\star$. In other words, the above system provably solves the optimization problem
$$\min_{q \in \mathbb{R}^d} V(q). \tag{10}$$

Two common choices for the damping are the constant case, $\gamma(t) = \gamma$, and the asymptotically vanishing case, $\gamma(t) = r/t$ for some constant $r \geq 3$ (other choices are also possible). When $V(q)$ is a convex function (resp. strongly convex function with parameter $\mu > 0$), it is possible to show the following convergence rates (França et al., 2018):

[display (11), stating the convergence rates for the convex and strongly convex cases, was not recovered in this extraction] (11)
where $\lambda_1(g)$ is the largest eigenvalue of the metric $g$. The convergence rates of
this system are therefore known under such convexity assumptions. Ideally,
we want to design optimization methods that preserve these rates, i.e., are
“rate-matching,” and are also numerically stable. As we will see, such geo-
metric integrators can be constructed by leveraging the shadow Hamiltonian
property of symplectic methods on higher-dimensional conservative Hamilto-
nian systems (França et al., 2021b) (see also Asorey et al., 1983; Marthinsen
and Owren, 2016). This holds not only on $\mathbb{R}^{2d}$ but in general settings, namely on arbitrary smooth manifolds (França et al., 2021a,b).
In the conformal Hamiltonian case, the dissipation appears explicitly in the
equations of motion. It is however theoretically convenient to consider an
equivalent explicit time-dependent Hamiltonian formulation. Consider the fol-
lowing coordinate transformation into system (8):
$$p \mapsto e^{\eta(t)} p, \qquad H(q, p) \mapsto e^{\eta(t)} H\big(q, e^{-\eta(t)} p\big), \qquad \eta(t) \equiv \int \gamma(t)\, dt. \tag{12}$$
It is easy to see that (8) is equivalent to standard Hamilton's equations,
$$\frac{dq^i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q^i},$$
with the explicit time-dependent Hamiltonian
$$H(t, q, p) = \frac{1}{2}\, e^{-\eta(t)}\, p_i g^{ij} p_j + e^{\eta(t)} V(q). \tag{13}$$
The rate of change of $H$ along the flow now satisfies
$$\frac{dH}{dt} = \frac{\partial H}{\partial t} \neq 0, \tag{14}$$
so the system is nonconservative; this equation is equivalent to (9).
Going one step further, let us now promote $t$ to a new coordinate and introduce its (conjugate) momentum $u$. Consider thus the higher-dimensional Hamiltonian
$$K(t, q, u, p) \equiv \frac{1}{2}\, e^{-\eta(t)}\, p_i g^{ij} p_j + e^{\eta(t)} V(q) + u. \tag{15}$$
Note that $t$ and $u$ are two arbitrary canonical coordinates. Denoting the time parameter of this system by $s$, Hamilton's equations read
$$\frac{dt}{ds} = 1, \qquad \frac{du}{ds} = -\frac{\partial K}{\partial t}, \qquad \frac{dq^i}{ds} = e^{-\eta(t)} g^{ij} p_j, \qquad \frac{dp_i}{ds} = -e^{\eta(t)} \frac{\partial V}{\partial q^i}. \tag{16}$$
This system is conservative since $dK/ds = 0$. Now, if we fix coordinates as
$$t = s, \qquad u(s) = -H(s, q(s), p(s)), \tag{17}$$
the conservative system (16) reduces precisely to the original dissipative system (13); the second equation in (16) reproduces (14), and the remaining equations are equivalent to the equations of motion associated to (13), which in turn are equivalent to (8) as previously noted. Formally, what we have done is to
embed the original dissipative system with phase space $\mathbb{R}^{2d}$ into a higher-dimensional conservative system with phase space $\mathbb{R}^{2d+2}$. The dissipative dynamics thus lies on a hypersurface of constant energy, $K = 0$, in high dimensions; see França et al. (2021b) for details. The reason for doing this
procedure, called symplectification, is purely theoretical: since the theory of
symplectic integrators only accounts for conservative systems, we can now
extend this theory to dissipative settings by applying a symplectic integrator
to (15) and then fixing the relevant coordinates (17) in the resulting method.
Geometrically, this corresponds to integrating the time flow exactly (França
et al., 2021b; Marthinsen and Owren, 2016). In França et al. (2021b) such a
procedure was defined under the name of presymplectic integrators, and these
connections hold not only for the specific example above but also for general
nonconservative Hamiltonian systems.
We are now ready to explain why this approach is suitable for constructing practical optimization methods. Let $\Psi_{\delta s} : \mathbb{R}^{2d+2} \to \mathbb{R}^{2d+2}$ be a symplectic integrator of order $r \geq 1$ applied to system (15). Denote by $(t_k, q_k, u_k, p_k)$ the numerical state, obtained by $k = 0, 1, \ldots$ iterations of $\Psi_{\delta s}$. Time is simulated over the grid $s_k = (\delta s) k$, with step size $\delta s > 0$. Because a symplectic integrator has a shadow Hamiltonian, we have
$$\tilde{K}(t_k, q_k, u_k, p_k) = K(t(s_k), q(s_k), u(s_k), p(s_k)) + O(\delta s^r).$$
Enforcing (17), the coordinate $t_k$ becomes simply the time discretization $s_k$, which is exact, and so is $u_k = u(t_k)$ since it is a function of time alone; importantly, $u$ does not couple to any of the other degrees of freedom, so it is irrelevant whether we have access to $u(s)$ or not. Replacing (15) into the above equation, we conclude:
$$\tilde{H}(t_k, q_k, p_k) = H(t_k, q(t_k), p(t_k)) + O(\delta t^r), \tag{18}$$
where we now denote $t_k = (\delta t) k$, for $k = 0, 1, \ldots$. Hence, the time-dependent Hamiltonian also has a shadow, thanks to the cancellation of the variable $u$. In particular, if we replace the explicit form of the Hamiltonian (13), we obtain^d
$$\underbrace{V(q_k) - V^\star}_{\text{numerical rate}} = \underbrace{V(q(t_k)) - V^\star}_{\text{continuum rate}} + \underbrace{O\big(e^{-\eta(t_k)}\, \delta t^r\big)}_{\text{small error}}. \tag{19}$$

Therefore, the known rates (11) for the continuum system are nearly
preserved—and so would be any rates of more general time-dependent (dissi-
pative) Hamiltonian systems. Moreover, as a consequence of (18), the original
time-independent Hamiltonian (6) of the conformal formulation is also closely

^d The kinetic part only contributes to the small error since $g$ is positive definite and $|p_k - p(t_k)| = O(\delta t^r)$. There are several technical details we are omitting, such as Lipschitz conditions on the Hamiltonian and on the numerical method, for which we refer to França et al. (2021b).
preserved, i.e., within the same bounded error in $\delta t$—recall transformation (12). However, this is also a Lyapunov function; hence, the numerical method respects the stability properties of the original system as well.^e
In short, as a consequence of having a shadow Hamiltonian, such geometric integrators are able to reproduce all the relevant properties of the continuum system. These arguments are completely general; namely, they ultimately rely on the BCH formula, the existence of bracket vector fields, and the symplectification procedure. Under these basic principles, no discrete-time analyses—which tend not to be particularly enlightening from a dynamical systems viewpoint and are only applicable on a (painful) case-by-case basis—were necessary to obtain guarantees for the numerical method.
Let us now present an explicit algorithm to solve the optimization problem (10). Consider a generic (conservative) Hamiltonian $H(q, p)$, evolving in time $s$. The well-known leapfrog or Störmer–Verlet method, the most used symplectic integrator in the literature, is based on the composition (2) and reads (Hairer et al., 2010)
$$\begin{aligned}
p_{k+1/2} &= p_k - (\delta s/2)\, \partial_q H(q_k, p_{k+1/2}), \\
q_{k+1} &= q_k + (\delta s/2)\big[\partial_p H(q_k, p_{k+1/2}) + \partial_p H(q_{k+1}, p_{k+1/2})\big], \\
p_{k+1} &= p_{k+1/2} - (\delta s/2)\, \partial_q H(q_{k+1}, p_{k+1/2}).
\end{aligned}$$
According to our prescription, replacing the higher-dimensional Hamiltonian (15), imposing the gauge-fixing conditions (17), and recalling that $u$ cancels out, we obtain the following method^f:
$$\begin{aligned}
p_{k+1/2} &= p_k - (\delta t/2)\, e^{\eta(t_k)}\, \partial_q V(q_k), \\
q_{k+1} &= q_k + (\delta t/2)\big[e^{-\eta(t_k)} + e^{-\eta(t_{k+1})}\big]\, g^{-1} p_{k+1/2}, \\
p_{k+1} &= p_{k+1/2} - (\delta t/2)\, e^{\eta(t_{k+1})}\, \partial_q V(q_{k+1}),
\end{aligned} \tag{20}$$

^e Naturally, all these results hold for suitable choices of step size, which can be determined by a linear stability analysis of the particular numerical method under consideration.
^f In a practical implementation, it is convenient to make the change of variables $p_k \mapsto e^{\eta(t_k)} p_k$ into (20); recall the transformations (12). In this case the method reads
$$\begin{aligned}
p_{k+1/2} &= e^{-\Delta\eta_k}\big[p_k - (\delta t/2)\, \partial_q V(q_k)\big], \\
q_{k+1} &= q_k + \delta t \cosh(\Delta\eta_k)\, g^{-1} p_{k+1/2}, \\
p_{k+1} &= e^{-\Delta\eta_k} p_{k+1/2} - (\delta t/2)\, \partial_q V(q_{k+1}),
\end{aligned}$$
where $\Delta\eta_k \equiv \eta(t_{k+1/2}) - \eta(t_k) = \int_{t_k}^{t_{k+1/2}} \gamma(t)\, dt$. Note that only a half-step difference of $\eta(t)$ appears in these updates. The algorithm is thus written in the same variables as the conformal representation (8). The advantage is that we do not have large or small exponentials, which can be problematic numerically. Furthermore, when solving optimization problems, it is convenient to set the matrix $g = (\delta t) I$; this was noted in França et al. (2020) but can also be understood from the rates (11), since then the step size $\delta t$ disappears from some of these formulas.
where we recall that $\delta t > 0$ is the step size and $t_k = (\delta t) k$, for iterations $k = 0, 1, \ldots$. This method, which is a dissipative generalization of the leapfrog, was proposed in França et al. (2021b) and has very good performance when solving unconstrained problems (10). In a similar fashion, one can extend any (known) symplectic integrator to a dissipative setting; the above method is just one such example.
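For concreteness, here is a compact implementation sketch (ours, in Python with NumPy) of the update in footnote f, for constant damping $\gamma$—so that $\Delta\eta_k = \gamma\,\delta t/2$—and $g = I$; the quadratic test problem is our own.

```python
# Sketch of the dissipative leapfrog (20), in the conformal variables of
# footnote f: constant damping gamma gives Delta_eta_k = gamma*dt/2; g = I.
import numpy as np

def dissipative_leapfrog(grad_V, q, p, dt, gamma, steps):
    d_eta = gamma * dt / 2.0
    damp, ch = np.exp(-d_eta), np.cosh(d_eta)
    for _ in range(steps):
        p = damp * (p - 0.5 * dt * grad_V(q))    # damped half kick
        q = q + dt * ch * p                      # drift (g = I)
        p = damp * p - 0.5 * dt * grad_V(q)      # damped half kick
    return q, p

# Test on V(q) = q^T A q / 2 with a random positive definite A.
rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M @ M.T / 20 + np.eye(20)
grad_V = lambda q: A @ q

q0 = rng.standard_normal(20)
q, _ = dissipative_leapfrog(grad_V, q0, np.zeros(20), dt=0.1, gamma=1.0, steps=500)
print("V(q0) =", 0.5 * q0 @ A @ q0, "  ->   V(q_final) =", 0.5 * q @ A @ q)
```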

2.4 Manifold and constrained optimization


Following França et al. (2021a), we briefly mention how the previous
approach can be extended in great generality, i.e., to an optimization problem
$$\min_{q \in M} V(q), \tag{21}$$
where $M$ is an arbitrary Riemannian manifold. There are essentially two ways to solve this problem through a (dissipative) Hamiltonian approach. One is to simulate a Hamiltonian dynamics on $T^*M$ by incorporating the metric of $M$ in the kinetic part of the Hamiltonian. Another is to consider a Hamiltonian dynamics on $\mathbb{R}^n$ and embed $M$ into $\mathbb{R}^n$ by imposing several constraints,^g
$$\psi_a(q) = 0, \qquad a = 1, \ldots, m. \tag{22}$$
This constrained case turns out to be particularly useful since we typically are unable to compute the geodesic flow on $M$, but are able to construct robust constrained symplectic integrators for it.
As an example of the first approach, consider $M = G$ being a Lie group, with Lie algebra $\mathfrak{g}$ and generators $\{T_i\}$. The analog of the Hamiltonian (13) is given by $H = -\frac{1}{4g}\, e^{-\eta(t)}\, \mathrm{Tr}(P^2) + e^{\eta(t)} V(Q)$, where $g > 0$ is a constant, $Q \in G$, and $P \in \mathfrak{g}$ (they can be seen as matrices). The method (20) can be adapted to this setting, resulting in the following algorithm (França et al., 2021a) (recall footnote f):
$$\begin{aligned}
P_{k+1/2} &= e^{-\Delta\eta_k}\big[P_k - (\delta t/2)\, \mathrm{Tr}\big(\partial_Q V(Q_k)\, Q_k\, T_i\big)\, T_i\big], \\
Q_{k+1} &= Q_k \exp\big(\delta t \cosh(\Delta\eta_k)\, g^{-1} P_{k+1/2}\big), \\
P_{k+1} &= e^{-\Delta\eta_k} P_{k+1/2} - (\delta t/2)\, \mathrm{Tr}\big(\partial_Q V(Q_{k+1})\, Q_{k+1}\, T_i\big)\, T_i,
\end{aligned} \tag{23}$$
where $(\partial_Q V(Q))_{ij} = \partial V / \partial Q_{ji}$ is a matrix.
As an example of the second approach, one can constrain the integrator on $\mathbb{R}^n$ to define a symplectic integrator on $M$ via the discrete constrained variational approach (Marsden and West, 2001) by introducing Lagrange multipliers, i.e., by considering the Hamiltonian $H + e^{\eta(t)} \sum_a \lambda_a \psi_a(q)$, where
^g Theoretically, there is no loss of generality since the Nash or Whitney embedding theorems tell us that any smooth manifold $M$ can be embedded into $\mathbb{R}^n$ for sufficiently large $n$.
$H$ is the Hamiltonian (13). In particular, the method (20) can be constrained to yield (França et al., 2021a)
$$\begin{aligned}
p_{k+1/2} &= e^{-\Delta\eta_k}\, \Lambda_g(q_k)\big[p_k - (\delta t/2)\, \partial_q V(q_k)\big], \\
\bar{p}_{k+1/2} &= p_{k+1/2} - (\delta t/2)\, e^{-\Delta\eta_k}\, \big(\partial_q \psi(q_k)\big)^\top \lambda, \\
q_{k+1} &= q_k + \delta t \cosh(\Delta\eta_k)\, g^{-1} \bar{p}_{k+1/2}, \\
0 &= \psi_a(q_{k+1}) \quad (a = 1, \ldots, m), \\
p_{k+1} &= \Lambda_g(q_{k+1})\big[e^{-\Delta\eta_k}\, \bar{p}_{k+1/2} - (\delta t/2)\, \partial_q V(q_{k+1})\big],
\end{aligned} \tag{24}$$
where we have the projector $\Lambda_g(q) \equiv I - \partial_q\psi(q)^\top R_g^{-1}(q)\, \partial_q\psi(q)\, g^{-1}$ with $R_g(q) \equiv \partial_q\psi(q)\, g^{-1}\, \partial_q\psi(q)^\top$, and $(\partial_q\psi)_{ij} \equiv \partial\psi_i/\partial q^j$ is the Jacobian matrix of the constraints; $\lambda = (\lambda_1, \ldots, \lambda_m)^\top$ is the vector of Lagrange multipliers; and $\Delta\eta_k \equiv \int_{t_k}^{t_{k+1/2}} \gamma(t)\, dt$
accounts for the damping. In practice, the Lagrange multipliers are determined by solving the (nonlinear) algebraic equations for the constraints, i.e., the second to fourth updates above are solved simultaneously. The above method is a dissipative generalization of the well-known RATTLE integrator from molecular dynamics (Andersen, 1983; Leimkuhler and Matthews, 2016; Leimkuhler and Skeel, 1994; McLachlan et al., 2014).
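As an illustration of how the multiplier is found in practice, the following sketch (ours, Python with NumPy) carries out one step of (24) for the single constraint $\psi(q) = (\|q\|^2 - 1)/2$ (the unit sphere) with $g = I$ and constant damping, where the constraint equation reduces to a scalar quadratic in $\lambda$; the toy potential at the end is our own.

```python
# One step of the constrained method (24) on the unit sphere,
# psi(q) = (|q|^2 - 1)/2 so d_q psi = q^T; g = I, constant damping.
import numpy as np

def project(q, p):
    """Tangential projection Lambda_g(q) for g = I: p - q (q.p)/|q|^2."""
    return p - q * (q @ p) / (q @ q)

def sphere_step(grad_V, q, p, dt, gamma):
    d_eta = gamma * dt / 2.0
    damp, ch = np.exp(-d_eta), np.cosh(d_eta)
    p_half = damp * project(q, p - 0.5 * dt * grad_V(q))
    # The constraint |q_{k+1}| = 1 gives q_{k+1} = a - lam*d, a scalar
    # quadratic in the multiplier lam; we take the root closest to zero.
    a = q + dt * ch * p_half
    d = dt * ch * (0.5 * dt * damp) * q
    lam = min(np.roots([d @ d, -2.0 * (a @ d), a @ a - 1.0]), key=abs).real
    p_bar = p_half - (0.5 * dt * damp) * lam * q
    q_new = q + dt * ch * p_bar
    p_new = project(q_new, damp * p_bar - 0.5 * dt * grad_V(q_new))
    return q_new, p_new

# Usage: minimize V(q) = q_1 on the sphere; the minimizer is q = -e_1.
rng = np.random.default_rng(2)
q = rng.standard_normal(5); q /= np.linalg.norm(q)
p = np.zeros(5)
e1 = np.eye(5)[0]
for _ in range(400):
    q, p = sphere_step(lambda x: e1, q, p, dt=0.05, gamma=2.0)
print(q.round(3))  # close to (-1, 0, 0, 0, 0)
```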
It is possible to generalize any other (conservative) symplectic method to
this (dissipative) optimization setting on manifolds. In this general setting,
there still exists a shadow Hamiltonian so that convergence rates and stability
are closely preserved numerically (França et al., 2021a) (similarly to (18) and
(19)). In particular, one can also consider different types of kinetic energy,
beyond the quadratic case discussed above, which may perform better in spe-
cific problems (França et al., 2020). This approach therefore allows one to
adapt existing symplectic integrators to solve optimization problems on Lie groups and other manifolds commonly appearing in machine learning, such as Stiefel and Grassmann manifolds, or to solve constrained optimization problems on $\mathbb{R}^n$.

2.5 Gradient flow as a high friction limit


Let us provide some intuition for why simulating second-order systems is expected to yield faster algorithms. It has been shown that several other accelerated optimization methods^h are also discretizations of system (8) (França et al., 2021c). Moreover, in the large friction limit, $\gamma \to \infty$, this system reduces to the first-order gradient flow, $dq/dt = -\partial_q V(q)$ (assuming $g \propto I$), which is the continuum limit of standard, i.e., nonaccelerated, methods (França et al., 2021c).

^h Besides accelerated gradient-based methods, accelerated extensions of important proximal-based methods such as proximal point, proximal-gradient, alternating direction method of multipliers (ADMM), Douglas–Rachford, Tseng splitting, etc., are implicit discretizations of (8); see França et al. (2021c) for details.
The same happens in more general settings; when the damping is too strong, the second derivative becomes negligible and the dynamics is approximately first-order.
As an illustration, consider Fig. 1 (left), where a particle immersed in a fluid falls under the influence of a potential force $-\partial_q V(q)$, which plays the role of "gravity," and is constrained to move on a surface. In the underdamped case, the particle is under water, which is not so viscous, so it accelerates and moves fast (it may even oscillate). In the overdamped case, the particle is in a highly viscous fluid, such as honey, and the drag force $-\gamma p$ is comparable to or stronger than $-\partial_q V(q)$; thus the particle moves slowly since it cannot accelerate. During the same elapsed time $\delta t$, an accelerated particle would travel a longer distance. We can indeed verify this behavior numerically. In Fig. 1
(right) we run algorithm (23) in the underdamped and overdamped regimes
when solving an optimization problem on the n-sphere, i.e., on the Lie group
SO(n).^i We can see that, in the overdamped regime, this method has essen-
tially the same dynamics as the Riemannian gradient descent (Zhang and
Sra, 2016), which is nonaccelerated and corresponds to a first-order dynamics;
all methods use the same step size, only the damping coefficient is changed.
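This collapse onto a first-order flow is easy to reproduce (a crude Euler discretization of (8) with $g = I$ on $V(q) = q^2/2$, our own sketch): for large $\gamma$, the trajectory tracks the rescaled gradient flow $dq/dt = -\partial_q V(q)/\gamma$, so over the same elapsed time the underdamped run travels much further.

```python
# Crude Euler discretization of (8) on V(q) = q^2/2 (g = I), illustrating the
# high-friction limit: for large gamma the trajectory tracks the rescaled
# gradient flow dq/dt = -q/gamma, while moderate damping converges much faster.
def second_order(gamma, dt=0.01, T=10.0):
    q, p = 1.0, 0.0
    for _ in range(int(T / dt)):
        q, p = q + dt * p, p + dt * (-q - gamma * p)
    return q

def gradient_flow(rate, dt=0.01, T=10.0):
    q = 1.0
    for _ in range(int(T / dt)):
        q -= dt * rate * q
    return q

print("underdamped, gamma = 1:      q(T) =", second_order(1.0))
print("overdamped,  gamma = 50:     q(T) =", second_order(50.0))
print("gradient flow, rate = 1/50:  q(T) =", gradient_flow(1 / 50))
```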

2.6 Optimization on the space of probability measures


There is a tight connection between sampling and optimization on the space of
probability measures which goes back to Otto (2001) and Jordan et al. (1998).

FIG. 1 Why simulating second-order systems yields accelerated methods. Left: Constrained particle falling in fluids of different viscosity. When the drag force is strong, the particle cannot accelerate and has a first-order dynamics (see text). Right: Simulation of algorithm (23) where $V(Q)$ is the energy of a spherical spin glass (Lie group SO($n$), with $n = 500$) (França et al., 2021a). In the overdamped regime the method is close to Riemannian gradient descent (Zhang and Sra, 2016), which is a first-order dynamics; (23) is much faster in the underdamped regime.

^i The details are not important here, but this problem minimizes the Hamiltonian of a spherical spin glass (see França et al. (2021a) for details). The same behavior is seen with the constrained method (24) as well.

Let P 2 ðn Þ be the space of probability measures on n with finite second


moments, endowed with a Wasserstein-2 metric W2. The gradient flow of a
functional F[μ] on the space of probability measures is the solution to the par-
tial differential equation ∂t μðq, tÞ ¼ rW 2 F½μðq, tÞ , which, under sufficient
regularity conditions, is equivalent to (Ambrosio et al., 2005; Jordan et al.,
1998; Otto, 2001)
 
δF½μ
∂t μ ¼ ∂  μ∂ , (25)
δμ
where ∂ ≡ ∂q and ∂  are the derivative and the divergence operators on n ,
respectively. The evolution of this system solves the optimization problem
ρ ≡ arg min F½μ, (26)
μ  P 2 ðn Þ

i.e., μ(q, t) ! ρ(q) as t !∞ in the sense of distributions. We can consider


the analogous situation with a dissipative flow induced by the conformal
Hamiltonian tensor (7) on the space of probability measures; we set g ¼ I and
γ(t) ¼ γ ¼ const. for simplicity. Thus, instead of (25), we have a conformal
Hamiltonian system in Wasserstein space given by the continuity equation
 
δF½μ
∂t μ ¼ ∂  μB∂ , (27)
δμ

where now $\partial \equiv (\partial_q, \partial_p)$ and $\mu \in \mathcal{P}_2(\mathbb{R}^{2d})$. Let $F$ be the free energy defined as
$$F[\mu] \equiv U[\mu] - \beta^{-1} S[\mu], \qquad U[\mu] \equiv \mathbb{E}_\mu[H], \qquad S[\mu] \equiv -\mathbb{E}_\mu[\log \mu], \tag{28}$$
where $U$ is the (internal) energy, $H$ is the Hamiltonian (6), $S$ is the Shannon entropy, and $\beta$ is the inverse temperature. The functional derivative of the free energy equals
$$\frac{\delta F}{\delta \mu} = H + \beta^{-1} \log \mu = \frac{1}{2}\|p\|^2 + V(q) + \beta^{-1} \log \mu. \tag{29}$$
In particular, the minimizer of $F$ is the stationary density
$$\rho(q, p) = Z_\beta^{-1}\, e^{-\beta H(q, p)}, \qquad Z_\beta \equiv \int e^{-\beta H(q, p)}\, dq\, dp. \tag{30}$$

Note also that the free energy (28) is nothing but the KL divergence (up to an additive constant involving the partition function):
$$\mathrm{KL}[\mu \,\|\, \rho] \equiv \mathbb{E}_\mu[\log(\mu/\rho)] = \beta F[\mu] + \log Z_\beta.$$
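Indeed, expanding the definitions (28) and (30) gives in one line (our own check):
$$\mathrm{KL}[\mu \,\|\, \rho] = \mathbb{E}_\mu[\log \mu] - \mathbb{E}_\mu[\log \rho] = -S[\mu] + \beta\, \mathbb{E}_\mu[H] + \log Z_\beta = \beta\big(U[\mu] - \beta^{-1} S[\mu]\big) + \log Z_\beta = \beta F[\mu] + \log Z_\beta.$$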
Therefore, the evolution of $\mu$ as given by the conformal Hamiltonian system (27) minimizes the divergence from the stationary density (30). Substituting (29) into (27), we obtain
$$\partial_t \mu = -\partial_q \cdot [\mu p] + \partial_p \cdot \big[\mu\, \partial_q V(q) + \gamma \mu p\big] + \gamma \beta^{-1} \partial_p^2 \mu,$$
which is nothing but the Fokker–Planck equation associated to the underdamped Langevin diffusion
$$dq_t = p_t\, dt, \qquad dp_t = -\partial_q V(q_t)\, dt - \gamma p_t\, dt + \sqrt{2\gamma \beta^{-1}}\, dw_t, \tag{31}$$
where $w_t$ is a standard Wiener process. Thus, the underdamped Langevin can be seen as performing accelerated optimization on the space of probability measures. A quantitative study of its speed of convergence is given by the theory of hypocoercivity (Ottobre, 2016; Pavliotis, 2014; Villani, 2009a).
The above results provide a tight connection between sampling and optimization. Interestingly, by the same argument as used in Section 2.5 (see Fig. 1), the high friction limit, $\gamma \to \infty$, of the underdamped Langevin diffusion (31) yields the overdamped Langevin diffusion (França et al., 2021c; Pavliotis, 2014)
$$dq_t = -\nabla V(q_t)\, dt + \sqrt{2\beta^{-1}}\, dw_t, \tag{32}$$
which corresponds precisely to the gradient flow (25) on the free energy functional $F[\mu]$ (Ambrosio et al., 2005; Jordan et al., 1998), where now $\mu = \mu(q, t) \in \mathcal{P}_2(\mathbb{R}^d)$. Thus, in the same manner that a second-order damped Hamiltonian system may achieve accelerated optimization compared to a first-order gradient flow, the underdamped Langevin diffusion (31) may achieve accelerated sampling compared to the overdamped Langevin diffusion (32). Such an acceleration has indeed been demonstrated (Ma et al., 2019a) in continuous time and for a particular discretization.
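As a minimal illustration (ours, Python with NumPy), one can discretize (31) with a simple Euler–Maruyama-type step—which, as discussed in Section 3.1 below, introduces a step-size-dependent bias—and check the moments of a one-dimensional standard Gaussian target, $V(q) = q^2/2$:

```python
# Underdamped Langevin (31) for a 1D standard Gaussian target, V(q) = q^2/2,
# via a crude Euler-Maruyama step (biased; see the discussion in Section 3.1).
import numpy as np

rng = np.random.default_rng(3)
gamma, beta, dt, n = 2.0, 1.0, 0.01, 200_000
q, p = 0.0, 0.0
samples = np.empty(n)
for k in range(n):
    q += dt * p
    p += -dt * (q + gamma * p) + np.sqrt(2 * gamma / beta * dt) * rng.standard_normal()
    samples[k] = q
print("mean (target 0):", samples.mean(), "   variance (target 1):", samples.var())
```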

3 Hamiltonian-based accelerated sampling


The purpose of sampling methods is to efficiently draw samples from a given
target distribution ρ or, more commonly, to calculate expectations with
respect to ρ:
$$\int_{\mathcal{X}} f\, d\rho \approx \frac{1}{n} \sum_{k=0}^{n-1} f(x_k). \tag{33}$$

However, generating i.i.d. samples $\{x_k\}$ is usually practically infeasible, even for finite sample spaces, as in high dimensions probability mass tends to concentrate in small regions of the sample space, while regions of high probability mass tend to be separated by large regions of negligible probability. Moreover, $\rho$ is usually only known up to a normalization constant (MacKay, 2003a). To circumvent this issue, MCMC methods rely on constructing ergodic Markov chains $\{x_n\}_{n \in \mathbb{N}}$ that preserve the target distribution $\rho$. If we run the chain long enough ($n \to \infty$), Birkhoff's ergodic theorem guarantees that the estimator on
the right-hand side of (33) converges to our target integral on the left-hand side
almost surely (Hairer, 2018). An efficient sampling scheme is one that mini-
mizes the variance of the MCMC estimator. In other words, fewer samples will
be needed to obtain a good estimate. Intuitively, good samplers are Markov
chains that converge as fast as possible to the target distribution.

3.1 Optimizing diffusion processes for sampling


As many MCMC methods are based on discretizing continuous-time stochas-
tic processes, the analysis of continuous-time processes is informative of the
properties of efficient samplers.
Diffusion processes possess a rich geometric theory, extending that of vector fields, and have been widely studied in the context of sampling. They are Markov processes featuring almost surely continuous sample paths (i.e., no jumps) and correspond to the solutions of stochastic differential equations (SDEs). While a deterministic flow is given by a first-order differential operator—namely, a vector field $X$ as used in Section 2—diffusions require specifying a set of vector fields $X, Y_1, \ldots, Y_N$, where $X$ represents the (deterministic) drift and $Y_i$ the directions of the (random) Wiener processes $w^i_t$, and are characterized by a second-order differential operator of the form $L \equiv X + Y_i \circ Y_i$, known as the generator of the process. Equivalently, diffusions can be written as Stratonovich SDEs: $dx_t = X(x_t)\, dt + Y_i(x_t) \circ dw^i_t$.
For a smooth positive target measure $\rho$, the complete family of $\rho$-preserving diffusions is given by (up to a topological obstruction contribution) (Barp et al., 2021)
$$dx_t = \mathrm{curl}_\rho(A)\, dt + \frac{1}{2}\, \mathrm{div}_\rho(Y_i)\, Y_i\, dt + Y_i \circ dw^i_t, \tag{34}$$
for a choice of antisymmetric bracket $A$. Here $\mathrm{curl}_\rho$ is a differential operator on multivector fields, generalizing the divergence $\mathrm{div}_\rho$ on vector fields, and is induced via an isomorphism $\rho^\sharp$ defined by $\rho$, which allows one to transfer the calculus of twisted differential forms to a measure-informed calculus on multivector fields (Barp et al., 2021). The ergodicity of (34) is essentially characterized by Hörmander's hypoellipticity condition, i.e., whether the Lie algebra of vector fields generated by $\{Y_i, [X, Y_i]\}_{i=1}^N$ spans the tangent spaces at every point (Bismut, 1981; Hörmander, 1967; Pavliotis, 2014). On Euclidean space the above complete class of measure-preserving diffusions can be given succinctly by Itô SDEs (Ma et al., 2015):
$$dx_t = -(A + S)(x_t)\, \partial V(x_t)\, dt + \partial \cdot (A + S)(x_t)\, dt + \sqrt{2 S(x_t)}\, dw_t, \tag{35}$$
where $S$ and $A$ reduce to symmetric and antisymmetric matrix fields, respectively, and $V$ is the negative Lebesgue log-density of $\rho$.
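A minimal sketch (ours, Python with NumPy) of (35) with constant coefficients—$S = I$ and a constant antisymmetric $A$, so that the correction term $\partial \cdot (A + S)$ vanishes—targeting a two-dimensional standard Gaussian via a crude Euler–Maruyama discretization:

```python
# Measure-preserving diffusion (35) with constant S = I (reversible part) and
# constant antisymmetric A (nonreversible part); div(A + S) = 0 here.
# Target: 2D standard Gaussian, V(x) = |x|^2 / 2, so grad V(x) = x.
import numpy as np

rng = np.random.default_rng(4)
S = np.eye(2)
A = np.array([[0.0, 2.0], [-2.0, 0.0]])   # A = -A^T
dt, n = 0.005, 200_000
x = np.zeros(2)
acc = np.empty((n, 2))
for k in range(n):
    x = x - dt * (A + S) @ x + np.sqrt(2 * dt) * rng.standard_normal(2)
    acc[k] = x
print("sample covariance (target: identity):\n", np.cov(acc.T))
```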
There are two well-studied criteria describing sampling efficiency in Markov processes: (1) the worst-case asymptotic variance of the MCMC estimator (33) over functions in $L^2(\rho)$, and (2) the spectral gap. The spectral gap is the lowest nonzero eigenvalue of the (negative) generator $L$ on $L^2(\rho)$. When it exists, it is an exponential convergence rate of the density of the process to the target density (Chafaï, 2004; Hwang et al., 2005; Pavliotis, 2014). Together, these criteria yield confidence intervals on the nonasymptotic variance of the MCMC estimator, which determines sampling performance (Joulin and Ollivier, 2010).
A fundamental criterion for efficient sampling is nonreversibility (Duncan
et al., 2016; Hwang et al., 2005; Rey-Bellet and Spiliopoulos, 2015). A pro-
cess is nonreversible if it is statistically distinguishable from its time reversal
when initialized at the target distribution (Pavliotis, 2014). Measure-
preserving diffusions are nonreversible precisely when A ≢ 0 (Haussmann
and Pardoux, 1986). Intuitively, nonreversible processes backtrack less often
and thus furnish more diverse samples (Neal, 2004). Furthermore, nonreversi-
bility leads to mixing, which accelerates convergence to the target measure.
It is well known that removing nonreversibility worsens the spectral gap
and the asymptotic variance of the MCMC estimator (Duncan et al., 2016;
Hwang et al., 2005; Rey-Bellet and Spiliopoulos, 2015). In diffusions with
linear coefficients, one can construct the optimal nonreversible matrix A to
optimize the spectral gap (Lelièvre et al., 2013; Wu et al., 2014) or the asymp-
totic variance (Duncan et al., 2016). However, there are no generic guidelines
on how to optimize nonreversibility in arbitrary diffusions. This suggests a
two-step strategy to construct efficient samplers: (1) optimize reversible diffu-
sions, and (2) add a nonreversible perturbation A ≢ 0 (Zhang et al., 2021).
Diffusions on manifolds are reversible when $A \equiv 0$, and thus have the form $dx_t = \frac{1}{2}\, \mathrm{div}_\rho(Y_i)\, Y_i\, dt + Y_i \circ dw^i_t$, which on Euclidean space reads
$$dx_t = -S(x_t)\, \partial V(x_t)\, dt + \partial \cdot S(x_t)\, dt + \sqrt{2 S(x_t)}\, dw_t. \tag{36}$$
The spectral gap and the asymptotic variance of the MCMC estimator remain the relevant optimality criteria in reversible Markov processes (Mira, 2001). When $S$ is positive definite everywhere, it defines a Riemannian metric $g$ on the state space. The generator is then the elliptic differential operator
$$L = -\nabla_g V + \Delta_g, \tag{37}$$
where $\nabla_g$ is the Riemannian gradient and $\Delta_g$ is the Laplace–Beltrami operator, i.e., the Riemannian counterpart of the Laplace operator.
reversible (elliptic) diffusions (37) are the natural generalization of the over-
damped Langevin dynamics (32) to Riemannian manifolds (Girolami and
Calderhead, 2011). Optimizing S to improve sampling amounts to endowing
the state space with a suitable Riemannian geometry that exploits the structure
of the target density. For example, sampling is improved by directing noise
along vector fields that preserve the target density (Abdulle et al., 2019).
When the potential $V$ is strongly convex, the optimal Riemannian geometry is given by $g \equiv \partial^2 V$ (Helffer, 1998; Saumard and Wellner, 2014). Sampling can
also be improved in hypoelliptic diffusions with degenerate noise (i.e., when
S is not positive definite). Intuitively, the absence of noise in some directions
of space leads the process to backtrack less often and thus yield more diverse
samples. For instance, in the linear case, the optimal spectral gap is attained
for an irreversible diffusion with degenerate noise (Guillin and Monmarché,
2021). However, degenerate diffusions can be very slow to start with, as the
absence of noise in some directions of space makes it more difficult for the
process to explore the state space (Guillin and Monmarché, 2021).
Underdamped Langevin dynamics (31) combines all the desirable proper-
ties of an efficient sampler: it is irreversible, has degenerate noise, and
achieves accelerated convergence to the target density (Ma et al., 2019b).
We can optimize the reversible part of the dynamics (i.e., the friction γ) to
improve the asymptotic variance of the MCMC estimator (Chak et al.,
2021). Lastly, we can significantly improve underdamped Langevin dynamics
by adding additional nonreversible perturbations to the drift (Duncan
et al., 2017).
One way to obtain MCMC algorithms is to numerically integrate diffusion
processes. As virtually all nonlinear diffusion processes cannot be simulated
exactly, we ultimately need to study the performance of discrete algorithms
instead of their continuous counterparts. Alarmingly, many properties of dif-
fusions can be lost in numerical integration. For example, numerical integra-
tion can affect ergodicity (Mattingly et al., 2002). An irreversible diffusion
may sample more poorly than its reversible counterpart after integration
(Katsoulakis et al., 2014). This may be because numerical discretization can
introduce, or otherwise change, the amount of nonreversibility (Katsoulakis
et al., 2014). The invariant measure of the diffusion and its numerical integra-
tion may differ, a feature known as bias. We may observe very large bias even
in the simplest schemes, such as the Euler–Maruyama integration of over-
damped Langevin (Roberts and Tweedie, 1996). Luckily, there are schemes
whose bias can be controlled by the integration step size (Mattingly et al.,
2010); yet, this precludes using large step sizes. Alternatively, one can remove
bias by supplementing the integration step with a Metropolis-Hastings correc-
tive step; however, this makes the resulting algorithm reversible. In conclusion,
designing efficient sampling algorithms with strong theoretical guarantees is a
nontrivial problem that needs to be addressed in its own right.

3.2 Hamiltonian Monte Carlo


Constructing measure-preserving processes, in particular diffusions, is rela-
tively straightforward. A much more challenging task consists of constructing
efficient sampling algorithms with strong theoretical guarantees. We now dis-
cuss an important family of well-studied methods, known as Hamiltonian
Monte Carlo (HMC), which can be implemented on any manifold, for any
smooth, fully supported target measure that is known up to a normalizing
constant. Some of these methods can be seen as an appropriate geometric inte-
gration of the underdamped Langevin diffusion, but it is in general simpler to
view them as combining a geometrically integrated deterministic dynamics
with a simple stochastic process that ensures ergodicity.
The conservative Hamiltonian systems previously discussed provide a natural candidate for the deterministic dynamics. Indeed, given a target measure $\rho \propto e^{-V} \mu_M$, with $\mu_M$ a Riemannian measure (such as the Lebesgue measure $dq$ on $M = \mathbb{R}^d$), if we interpret the negative log-density $V(q)$ as a potential energy, i.e., a function depending on position $q$, one can then plug the potential into Newton's equation to obtain a deterministic proposal that is well defined on any manifold, as soon as the acceleration and derivative operators have been replaced by their curved analogs:
$$\underbrace{m \ddot{q} = -\partial V(q)}_{\text{flat Newton}} \quad \longrightarrow \quad \underbrace{\overbrace{\frac{\nabla \dot{q}}{dt}}^{\text{acceleration}} = \overbrace{-\nabla V(q)}^{\substack{\text{direction of} \\ \text{greatest decrease}}}}_{\text{Riemannian Newton}}, \tag{38}$$
with given initial conditions for the position $q$ and velocity $v = dq/dt$. This is a second-order system which evolves in the tangent bundle, $(q, v) \in TM$, which is $TM = \mathbb{R}^d \times \mathbb{R}^d$ when $M = \mathbb{R}^d$. The resulting flow is conservative since it corresponds to a Hamiltonian system as discussed in Section 2.2, with Hamiltonian $H(q, v) \equiv \frac{1}{2}\|v\|_q^2 + V(q)$, where $\|v\|_q^2$ is the Riemannian squared norm, which is $v^T g(q) v$ when $M = \mathbb{R}^d$ and $g(q)$ is the Riemannian metric; this is the manifold version of the Hamiltonian (6). This system preserves the symplectic measure $\mu_\Omega(q, v) = \det g(q)\, dq\, dv$, and thus also the canonical distribution $\mu \propto e^{-H(q, v)} \mu_\Omega$, which is the product of the target distribution over position with the Gaussian measure on velocity (with covariance $g^{-1}$). For instance, on $M = \mathbb{R}^d$,
$$\mu(q, v) \propto \rho(q) \cdot \mathcal{N}\big(0, g^{-1}(q)\big)(v) \propto e^{-V(q)} \sqrt{\det g(q)}\, dq \cdot \sqrt{\det g(q)}\, e^{-\frac{1}{2} v^T g(q) v}\, dv.$$
Moreover, the pushforward under the projection $\mathrm{Proj} : (q, v) \mapsto q$ is precisely the target measure: $\mathrm{Proj}_* \mu = \rho$. Concretely, the samples generated by moving according to Newton's law, after ignoring their velocity component, have $\rho$ as
their law. The main critical features and advantages in using Hamiltonian
systems arise from their numerical implementation. Indeed, the flow of (38)
is only tractable for the simplest target measures, namely those possessing a
high degree of symmetry. In order to proceed, we must devise suitable numeri-
cal approximations which, unfortunately, not only break such symmetries
but may lose key properties of the dynamics such as stationarity (typically
not retained by discretizations). However, as we saw in Section 2.2, most
symplectic integrators have a shadow Hamiltonian and thus generate discrete trajectories that are close to the associated bona fide (shadow) Hamiltonian dynamics, and in particular preserve the shadow canonical distribution.
Most integrators used in sampling, such as the leapfrog, are geodesic inte-
grators. These are splitting methods (see Section 2.1) obtained by splitting
the Hamiltonian $H(q, v) = H_1(q, v) + H_2(q)$, where $H_1(q, v) = \frac{1}{2}\|v\|_q^2$ and $H_2(q) = V(q)$ are to be treated as independent Hamiltonians in their own right. Both of these Hamiltonians generate dynamics that might be tractable: the Riemannian component $H_1$, associated to the Riemannian reference measure, induces the geodesic flow, while the target density component $H_2$ gives rise to a vertical gradient flow, wherein the velocity is shifted in the direction of maximal density change, i.e., $(q, v) \mapsto (q, v - \delta t\,\nabla_q V(q))$. The Jacobi identity and the BCH formula imply these integrators do possess a shadow Hamiltonian $\tilde{H}$, and reproduce its dynamics. Such a shadow can in fact be explicitly obtained from (5) by computing iterated Poisson brackets; e.g., on $\mathcal{M} = \mathbb{R}^d$ and for $H(q, v) = \frac{1}{2} v^T g v + V(q)$, the three-stage geodesic integrator $\Phi^{H_2}_{b\delta t} \circ \Phi^{H_1}_{a\delta t} \circ \Phi^{H_2}_{(1/2-b)\delta t} \circ \Phi^{H_1}_{(1-2a)\delta t} \circ \Phi^{H_2}_{(1/2-b)\delta t} \circ \Phi^{H_1}_{a\delta t} \circ \Phi^{H_2}_{b\delta t}$, with parameters $a, b \in \mathbb{R}$, yields (Radivojevic et al., 2018)
\[
\tilde{H}(q, v) = H(q, v) + \delta t^2 \left[ c_1\, \partial V(q)^T g^{-1}\, \partial V(q) + c_2\, v^T \partial^2 V(q)\, v \right] + O(\delta t^4),
\]

for some constants c1 and c2. As an immediate consequence, these symplectic


integrators preserve the reference symplectic measure μΩ and can be used as a
(deterministic) Markov proposal, which, when combined with the Metropolis-
Hastings acceptance step that depends only on the target density, gives rise
to a measure-preserving process. Moreover, the existence of the shadow
Hamiltonian ensures that the acceptance rate will remain high for distant
proposals, allowing small correlations. However, since Hamiltonian flows
are conservative, they remain stuck within energy level sets, which prevents
ergodicity. It is thus necessary to introduce another measure-preserving pro-
cess, known as the heat bath or thermostat, that explores different energy
levels; the simplest such process corresponds to sampling a velocity from a
Gaussian distribution. Bringing these ingredients together, we thus have the
following HMC algorithm: given $z_n = (q, v)$, compute $z_{n+1}$ according to
1. Heat bath: sample a velocity according to a Gaussian, $v^\dagger \sim \mathcal{N}(0, g^{-1}(q))$.
2. Shadow Hamiltonian dynamics: move along the Hamiltonian flow generated by the geodesic integrator, $z^* = \Psi_{\delta t}(z^\dagger)$, where $z^\dagger = (q, v^\dagger)$.
3. Metropolis correction: accept $z^*$ with probability $\min\{1, e^{-\Delta H}\}$, where $\Delta H = H(z^*) - H(z^\dagger)$. If accepted, then set $z_{n+1} = z^*$; otherwise, set $z_{n+1} = (q, v^\dagger)$.
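To make the recipe concrete, here is a minimal Python sketch of the rudimentary algorithm on $\mathcal{M} = \mathbb{R}^d$ with the Euclidean metric $g = I$, in which case the geodesic integrator reduces to the standard leapfrog; the function names and the Gaussian example target are our own illustrative choices, not part of the original presentation.

```python
import numpy as np

def leapfrog(q, v, grad_V, dt, n_steps):
    """Geodesic (leapfrog) integrator for H = |v|^2/2 + V(q) on Euclidean space."""
    q, v = q.copy(), v.copy()
    v -= 0.5 * dt * grad_V(q)          # half step of the potential flow (H2)
    for _ in range(n_steps - 1):
        q += dt * v                    # full geodesic step (H1)
        v -= dt * grad_V(q)            # full vertical gradient step (H2)
    q += dt * v
    v -= 0.5 * dt * grad_V(q)          # final half step
    return q, v

def hmc_step(q, V, grad_V, dt=0.1, n_steps=20, rng=np.random.default_rng()):
    """One transition z_n -> z_{n+1} of the rudimentary HMC algorithm."""
    v = rng.standard_normal(q.shape)   # 1. heat bath: v ~ N(0, I)
    H0 = 0.5 * v @ v + V(q)
    q_star, v_star = leapfrog(q, v, grad_V, dt, n_steps)  # 2. shadow dynamics
    H1 = 0.5 * v_star @ v_star + V(q_star)
    if rng.random() < np.exp(min(0.0, H0 - H1)):          # 3. Metropolis correction
        return q_star
    return q

# Example: sample a standard Gaussian target, V(q) = |q|^2 / 2.
V = lambda q: 0.5 * q @ q
grad_V = lambda q: q
q, samples = np.zeros(2), []
for _ in range(1000):
    q = hmc_step(q, V, grad_V)
    samples.append(q)
```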
The above rudimentary HMC method (originally known as Hybrid Monte
Carlo) was proposed for simulations in lattice quantum chromodynamics with
$\mathcal{M}$ being the special unitary group, SU($n$), and used a Hamiltonian dynamics
ingeniously constructed from the Maurer–Cartan frame to compute the partition function of discretized gauge theories (Duane et al., 1987). This method has since been applied in molecular dynamics and statistics (Betancourt, 2017; Cancès et al., 2007; Neal, 1992, 2011; Rousset et al., 2010).
While the above discussion provides a justification for the use of Hamiltonian mechanics, a more constructive argument from first principles can also be given. From the family of measure-preserving dynamics, which as we have seen can be written as $\mathrm{curl}_\mu(A)$ (recall (34)), we want to identify those suited to practical implementations (here $\mu$ could be any distribution on some space $\mathcal{F}$ having the target $\rho$ as a marginal). Only for the simplest distributions $\mu$ can we hope to find brackets $A$ for which the flow of $\mathrm{curl}_\mu(A)$ is tractable. Instead, the standard approach to geometrically integrate this flow relies as before on splitting methods, which effectively decompose $\mu \propto e^{-\sum_\ell H_\ell}\mu_{\mathcal{F}}$ into simpler components by separating the reference measure from the density and taking advantage of any product structure of the density, so that $\mathrm{curl}_\mu(A) = \mathrm{curl}_{\mu_{\mathcal{F}}}(A) + \sum_\ell X^A_{H_\ell}$.
There are three critical properties underpinning the success of HMC in
practice. The first two are the preservation of the reference measure and the
existence of a conserved shadow Hamiltonian for the numerical method.
These imply that we remain close to preserving $\mu$ along the flow, and in particular lead to Jacobian-free Metropolis corrections with good acceptance
rates for distant proposals (see Fang et al. (2014) for examples of schemes
with Jacobian corrections). Achieving these properties yields strong con-
straints on the choice of A (Barp, 2020); the shadow property is essentially
exclusive to Poisson systems, for which the conservation of a common refer-
ence measure is equivalent to the triviality of the modular class in the first
Poisson cohomology group (Weinstein, 1997). In particular, Poisson brackets
that admit such an invariant measure have been carefully analyzed and are
known as unimodular; the only unimodular Poisson bracket that can be con-
structed on general manifolds seems to be precisely the symplectic one.
The third critical property is the existence of splitting methods for which all the composing flows are either tractable or have adequate approximations, namely the geodesic integrators. Indeed, as we have seen, the flow $\Phi^{H_2}_{\delta t}$—induced by the potential $H_2(q) = V(q)$—is always tractable, independent of the complexity of the target density; this is possible mainly due to the extra “symmetries” resulting from implementing the flow on the higher-dimensional space $T\mathcal{M}$ rather than $\mathcal{M}$. On the other hand, one key consideration for the tractability of the geodesic flow $\Phi^{H_1}_{\delta t}$—induced by the kinetic energy $H_1(q, v)$—is the choice of Riemannian metric; in most cases, it is numerically hard to implement $\Phi^{H_1}_{\delta t}$ since several implicit equations need to be solved. In general, it is desirable to use a Riemannian metric that
reflects the intrinsic symmetries of the sample space, mathematically
described by a Lie group action. Indeed, by using an invariant Riemannian
metric, one greatly simplifies the equations of motion of the geodesic flow,
reducing the usual second-order Euler–Lagrange equations to the first-order
Euler–Arnold equations (Barp, 2019; Holm et al., 1998; Modin et al., 2010), with tractable solutions in many cases of interest, e.g., for naturally reductive homogeneous spaces, including $\mathbb{R}^d$, the space of positive definite matrices, Stiefel manifolds, Grassmannian manifolds, and many Lie groups. In such
cases, it is possible to find a Riemannian metric whose geodesic flow is known
and given by the Lie group exponential (Barp et al., 2019b; Holbrook et al.,
2016, 2018). For the other main class of spaces, namely those given by
constraints, if one chooses the restriction of the Euclidean metric, then the
RATTLE scheme discussed in optimization (see Section 2.4) is a suitable sym-
plectic integrator (Au et al., 2020; Graham et al., 2019; Leimkuhler and
Matthews, 2016; Lelièvre et al., 2019, 2020) (perhaps up to a reversibility
check). Occasionally, it may be suitable to use a Riemannian metric associated
to the target distribution rather than the sample space; e.g., when it belongs to a
statistical manifold. In that case, any choice of (information) divergence gives
rise to an information tensor that may be used in the HMC algorithm. Notably,
this is the case in Bayesian statistics, wherein attempting to find a Riemannian
metric that locally matches the Hessian of the posterior motivates the use of the
Fisher information tensor summed with the Hessian of the prior, giving rise to
the Riemannian HMC (Girolami and Calderhead, 2011; Livingstone and
Girolami, 2014). When a Riemannian metric whose geodesic flow is unknown
is chosen, one can use the trick of increasing the dimension of the phase space
to add symmetries to derive explicit symplectic integrators (Cobb et al., 2019;
Tao, 2016).
Once we have an integrator for the geodesic flow, another important
consideration is the construction and tuning of the overall integrator, i.e.,
the specific composition of $\Phi^{H_1}_{\delta t}$ and $\Phi^{H_2}_{\delta t}$. Traditional numerical integrators are tuned to provide highly accurate approximations for the trajectories in the limit $\delta t \to 0$; for instance, a fourth-order Runge–Kutta method. However,
samplers aim to have the largest possible step size δt in order to reduce
correlations. One approach consists in tuning the integrator to obtain good
density preservation in the Gaussian case. Another approach consists in tuning the integrator to ensure the shadow Hamiltonian $\tilde{H}$ agrees with $H$ up to the desired order; see Predescu et al. (2012), Blanes et al. (2014), Fernández-Pendás et al. (2016), Campos and Sanz-Serna (2017), Bou-Rabee and Sanz-Serna (2018), and Clark et al. (2011). We note that when the target density contains two components, one computationally expensive and the other computationally cheap, it may be desirable to further split the potential $H_2(q) = V(q)$ to obtain higher acceptance rates (Sexton and Weingarten, 1992; Shahbaba
et al., 2014; Tuckerman et al., 1992). In order to achieve ergodicity in HMC
methods, it is usually sufficient to randomize the trajectory length of the
integrator (Betancourt, 2016; Mackenze, 1989; Wang et al., 2013). However,
deriving guarantees on the rate of convergence of HMC is difficult, though
recent works have established sufficient conditions for geometric ergodicity
(Durmus et al., 2017; Livingstone et al., 2019).
Let us also briefly mention some useful upgrades that have been proposed
in recent years. First, whenever the Metropolis step rejects the proposed sample,
the (expensive) computation of the numerical trajectory is wasted, and several
modifications have been proposed to address this issue, for example by granting
the method extra integration steps when the proposal is rejected (Campos and
Sanz-Serna, 2015; Sohl-Dickstein et al., 2014), or using a dynamic integration
with a termination criterion that aims to ensure the motion is long enough to
avoid random walks, but short enough that we do not waste computational
effort, such as the No-U-Turn sampler (Hoffman et al., 2014).
Second, the Metropolis algorithm gives rise to a reversible method which,
as discussed above, usually has slower convergence properties. Modern HMC
methods bypass this issue by replacing the heat bath by an Ornstein–
Uhlenbeck process, which ensures the overall algorithm is irreversible. In this
case, the overall HMC method can be viewed as a geometric integration of the
underdamped Langevin diffusion (Heber et al., 2020; Ottobre, 2016; Ottobre
et al., 2016). The connection between HMC and Langevin diffusion originates
from the desire to replace the Gaussian heat bath with a partial momentum
refreshment, yielding a more accurate simulation of dynamical properties
and higher acceptance rates (Horowitz, 1991).
Third, many modifications of the rudimentary HMC algorithm only provide improvements when the acceptance rate is sufficiently high. A further class of upgrades improves the acceptance rate by using the fact that the shadow
Hamiltonian is exactly preserved by the integrator. These shadow HMC meth-
ods sample from a biased target distribution, defined by the (truncated)
shadow Hamiltonian, and correct the bias in the samples via an importance
sampler (Izaguirre and Hampton, 2004; Radivojevic and Akhmatskaya,
2020; Radivojevic et al., 2018).
Finally, the Metropolis step can be replaced with a multinomial correction
that uses the entire numerical trajectory, accepting a given point along it
according to the degree by which it distorts the target measure (Betancourt,
2017). Some methods entirely skip the accept/reject step, in particular those
relying on approximate gradients and surrogates (Strathmann et al., 2015;
Zhang et al., 2017b), such as the stochastic HMC methods; such methods
approximate the potential $V(q)$ and its derivative, when they are given by a sum over data points, $V(q) = \sum_i V_i(q)$, by a cheaper sum over uniformly sampled minibatches (Chen et al., 2014) (these are commonly called stochastic gradients in machine learning). However, this may break the shadow property and
reduce the scalable and robust properties of HMC methods (Betancourt, 2015).
4 Statistical inference with kernel-based discrepancies


The problem of parameter inference consists of estimating an element $\theta^* \in \Theta$ using a sequence of random functions (or estimators) $\hat{\theta}_n : \Omega \to \Theta$, with $\hat{\theta}_n$ determined by a set of measurements $\{q_1, \ldots, q_n\}$ representing the available experimental data. In the statistical context, we search for the optimal approximation $\mu_{\theta^*}$ of the target measure $\rho$ within a statistical model $\{\mu_\theta : \theta \in \Theta\}$, with respect to a discrepancy $D : \mathcal{P} \times \mathcal{P} \to [0, \infty]$ over the set of probability measures $\mathcal{P}$. A common choice of discrepancy is the KL divergence, and the resulting inference problem can be implemented via the asymptotically optimal maximum likelihood estimators (Van der Vaart, 2000). As in many applications we are interested in computing expectations, a particularly suitable notion of discrepancy is the integral probability pseudometric (IPM) (Müller, 1997), which quantifies the worst-case integration error with respect to a family of functions $\mathcal{F}$:
\[
d_{\mathcal{F}}(\rho, \mu) \equiv \sup_{f \in \mathcal{F}} \left| \int f\, \mathrm{d}\rho - \int f\, \mathrm{d}\mu \right|.
\]

An apparent difficulty arises with IPMs in that we need to compute a supremum, which will be intractable for most choices of $\mathcal{F}$. Observe, however, that if $\mathcal{F}$ were the unit ball of a normed vector space $\mathcal{H}$, and integration with respect to $\rho$ and $\mu$ were a continuous linear functional on $\mathcal{H}$, then $d_{\mathcal{F}}(\rho, \mu)$ would correspond to the distance between $\rho$ and $\mu$ in the dual norm over the dual $\mathcal{H}^*$, i.e., $d_{\mathcal{F}}(\rho, \mu) = \|\rho - \mu\|_*$. Conveniently, reproducing kernel Hilbert spaces (RKHS) are precisely the Hilbert spaces over which the Dirac distributions $\delta_x : f \mapsto f(x)$ act continuously (Aronszajn, 1950; Berlinet and Thomas-Agnan, 2011; Steinwart and Christmann, 2008), and, more generally, the probability distributions that act continuously by integration on a RKHS $\mathcal{H}$ are exactly those for which all elements of $\mathcal{H}$ are integrable (Barp et al., 2022). Denoting by $\mathcal{P}_{\mathcal{H}}$ the set of such probability measures, so that by definition $\delta_x \in \mathcal{P}_{\mathcal{H}}$, we can define the maximum mean discrepancy (MMD) as
\[
\mathrm{MMD} : \mathcal{P}_{\mathcal{H}} \times \mathcal{P}_{\mathcal{H}} \to [0, \infty), \qquad \mathrm{MMD}[\rho|\mu] = \|\rho - \mu\|,
\]
where we further used the Riesz representation isomorphism to view $\rho, \mu \in \mathcal{P}_{\mathcal{H}} \subset \mathcal{H}^* \cong \mathcal{H}$ as elements of $\mathcal{H}$. The map $\mathcal{P}_{\mathcal{H}} \to \mathcal{H}$ is usually referred to as the mean embedding (Muandet et al., 2016; Sriperumbudur et al., 2010). The angles between the mean embeddings of Dirac distributions play a central role in the study of RKHS, and indeed characterize them. They define the reproducing kernel
\[
k : \mathcal{M} \times \mathcal{M} \to \mathbb{R}, \qquad k(x, y) \equiv \langle \delta_x, \delta_y \rangle,
\]
with $\langle \cdot, \cdot \rangle$ denoting the inner product on $\mathcal{H}$, from which we can obtain a practical expression for the squared MMD:
\[
\mathrm{MMD}^2[\rho|\mu] = \iint k(x, y)\, (\rho - \mu)(\mathrm{d}y)\, (\rho - \mu)(\mathrm{d}x).
\]
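Since the squared MMD involves only kernel evaluations, it can be estimated directly from samples. The following is a minimal Python sketch (our own illustration, assuming a Gaussian kernel; the function names are not the chapter's) of the standard unbiased U-statistic estimate:

```python
import numpy as np

def gaussian_kernel(x, y, lengthscale=1.0):
    """k(x, y) = exp(-|x - y|^2 / (2 * lengthscale^2)) for arrays of points."""
    d = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(d**2, axis=-1) / (2 * lengthscale**2))

def mmd_squared(xs, ys, lengthscale=1.0):
    """Unbiased U-statistic estimate of MMD^2[rho|mu] from xs ~ rho, ys ~ mu."""
    kxx = gaussian_kernel(xs, xs, lengthscale)
    kyy = gaussian_kernel(ys, ys, lengthscale)
    kxy = gaussian_kernel(xs, ys, lengthscale)
    n, m = len(xs), len(ys)
    # Drop diagonal terms so the within-sample sums are unbiased.
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(500, 1))
ys = rng.normal(0.5, 1.0, size=(500, 1))
print(mmd_squared(xs, ys))  # positive: the two distributions differ
```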
4.1 Topological methods for MMDs


A key feature of RKHS, as identified by Laurent Schwartz, is the fact that they are Hilbertian subspaces, i.e., Hilbert spaces continuously embedded within a topological vector space $\mathcal{T}$, denoted $\mathcal{H} \hookrightarrow \mathcal{T}$ (Schwartz, 1964). In this context, by composing the transpose of the inclusion $\mathcal{H} \hookrightarrow \mathcal{T}$ with the Riesz isomorphism, we can define a (generalized) mean embedding as the weakly continuous positive map
\[
\phi : \mathcal{T}^* \hookrightarrow \mathcal{H}^* \to \mathcal{H}.
\]
This mapping allows us to transfer structures between $\mathcal{H}$ and $\mathcal{T}^*$, an example of which is the MMD, which is nothing else than the pullback of the Hilbert space metric from $\mathcal{H}$ to $\mathcal{T}^*$. Some important examples of $\mathcal{T}$ are $C_0$, $C_c^\infty$, and $\mathbb{R}^{\mathcal{M}}$ (with their canonical topologies), whose duals are the spaces of finite Radon measures, Schwartz distributions, and measures with finite support, respectively (Simon-Gabriel and Schölkopf, 2018). In particular a RKHS, as defined above, is any Hilbert space satisfying $\mathcal{H} \hookrightarrow \mathbb{R}^{\mathcal{M}}$. More generally, when $\mathcal{T}$ is continuously embedded in the space of $\mathbb{R}^n$-valued functions on $\mathcal{M}$—as in the examples above, which have $n = 1$—then $\mathcal{H}$ inherits (and can be characterized in terms of) a reproducing kernel $K : \mathcal{M} \times \mathcal{M} \to \mathbb{R}^{n \times n}$, defined $\forall v, u \in \mathbb{R}^n$ by
\[
v^\top K(x, y)\, u = \delta_x^v\!\left[\phi(\delta_y^u)\right],
\]
where $\delta_x^u : h \mapsto u \cdot h(x)$; but this need not be the case in general, and we will employ Hilbertian subspaces with no reproducing kernel to construct the score-matching discrepancy.
This geometric description of RKHS and MMD allows us to swiftly apply
topological methods in their analysis. For example, in order for $\mathrm{MMD}^2$ to be a valid notion of statistical divergence, it should accurately discriminate distinct distributions, in the sense that $\mathrm{MMD}[\rho|\mu] = 0$ iff $\rho = \mu$. By construction, MMD will be characteristic to a subset of $\mathcal{T}^*$, that is, able to distinguish its elements, iff $\phi$ is injective. The Hahn–Banach theorem further shows that this is equivalent to the denseness of $\mathcal{H}$ in $\mathcal{T}$, reducing the matter to a topological question (Simon-Gabriel and Schölkopf, 2018; Sriperumbudur et al., 2010, 2011). In many applications, we would typically like $\mathcal{T}^*$ to be the set of probability measures, but the latter is not even a vector space. Instead, just as is commonly done to define (statistical) manifolds, it is desirable to embed $\mathcal{P}$ within a more structured space, such as the space of finite Radon measures $C_0^*$. Characteristicness to $C_0^*$ is also known as universality in learning theory, since such RKHS are dense in $L^2(\mu)$ for any $\mu \in \mathcal{P}$, which enables the method to learn the target function independently of the data-generating distribution (Carmeli et al., 2010). However, in many important cases, we are interested
in analyzing the denseness of $\mathcal{H}$ in a space other than $C_0$. For instance, in the case of unbounded reproducing kernels, we cannot aim to separate all finite distributions, since the RKHS will contain unbounded functions and the MMD will only be defined on a subset of $\mathcal{P}$. In the particular case of the KSDs discussed below, which are given by transforming a base RKHS into a Stein RKHS via a differential operator, the characteristicness of the Stein RKHS to a set of probability measures is equivalent to the characteristicness of the base RKHS to more general spaces $\mathcal{T}^*$ of Schwartz distributions (Barp et al., 2022).
Moreover, the ability of MMD to discriminate distributions is also useful to ensure it further metrizes, or at least controls, weak convergence, and thus provides a suitable quantification of the discrepancy between unequal distributions. Indeed, on noncompact locally compact Hausdorff spaces such as $\mathbb{R}^d$, when $\mathcal{H} \hookrightarrow C_0$, MMD will metrize weak convergence (of probability measures) iff the kernel $k$ is continuous and $\mathcal{H}$ is characteristic to the space of finite Radon measures (Simon-Gabriel et al., 2020). The requirement that the RKHS separate all finite measures in order to metrize weak convergence results from the fact that otherwise MMD cannot in general prevent positive measures from degenerating into the null measure on noncompact spaces, beyond the family of translation-invariant kernels, for which characteristicness to the set of probability measures or to that of finite measures is in fact equivalent (Simon-Gabriel and Schölkopf, 2018). It is also possible to prevent probability mass from escaping to infinity—when the topology of the sequence of distributions is relatively compact with respect to the weak topology on the space of distributions—since, in that case, standard topological arguments relate MMD and weak convergence via characteristicness to $\mathcal{P}$ (Ethier and Kurtz, 2009). For example, by Prokhorov's theorem we may use the tightness of a sequence of distributions to ensure characteristic MMDs detect any loss of mass, and thus control weak convergence (Gorham and Mackey, 2017).

4.2 Smooth measures and KSDs


MMDs have a computationally tractable expression whenever $\rho, \mu$ are discrete measures, or at least tractable U-statistics when samples from them are readily available. Many applications involve distributions that are smooth and fully supported, but hard to sample from. Recalling the definition of $d_{\mathcal{F}}(\rho, \mu)$, it would be useful to construct an MMD for which the set $\mathcal{F}$ consists of functions whose integral under $\mu$ is tractable, for example equal to zero; the MMD would then reduce to a double integration with respect to $\rho$. To achieve this, we will leverage ideas from Stein's method (Anastasiou et al., 2021; Stein, 1972), and apply Stein operators to a given RKHS so as to construct a Stein RKHS whose elements have vanishing expectation under a distribution of interest.
4.2.1 The canonical Stein operator and Poincaré duality


To gain intuition on Stein operators, we begin by considering the integral with respect to $\mu$ as a linear operator on test functions, $\mathbb{E}_\mu : C_c^\infty(\mathcal{M}) \to \mathbb{R}$, with $\mathbb{E}_\mu f \equiv \int f\, \mathrm{d}\mu$, and we are interested in generating test functions in the kernel of this operator (i.e., with vanishing expectations). There are two fundamental theorems that help us understand the integral-differential geometry of the manifold: de Rham's theorem and Poincaré duality. The former relates the topology of the manifold to information on the solutions of differential equations defined over the manifold (Lee, 2013). The latter (which contains the fundamental theorem of calculus) describes the properties of the integral pairing $(\alpha, \beta) \mapsto \int \alpha \wedge \beta$ of differential forms, which include the pairing of test functions with smooth measures $(f, \mu) \mapsto \int f\, \mathrm{d}\mu$. While these results are canonical statements about the manifold, we can turn them into measure-theoretic statements by means of the isomorphism $\mu^\sharp$. In particular, when $\mathcal{M}$ is connected, there is an isomorphism between the top compactly supported twisted de Rham cohomology group $H^n_c(\mathcal{M})$ (which depends on the topology of $\mathcal{M}$) and $\mathbb{R}$ given by integration, $\omega \mapsto \int_{\mathcal{M}} \omega$. Applying the transformation $\mu^\sharp$ to this isomorphism yields the isomorphism of vector spaces
\[
\mathbb{E}_\mu : C_c^\infty(\mathcal{M})/\mathrm{Im}(\mathrm{div}_\mu|_c) \to \mathbb{R},
\]
where $\mathrm{div}_\mu|_c : \mathfrak{X}_c(\mathcal{M}) \to C_c^\infty(\mathcal{M})$ is the divergence operator restricted to the set of compactly supported vector fields $\mathfrak{X}_c(\mathcal{M})$. Hence, if $h, f \in C_c^\infty(\mathcal{M})$, then
\[
\int f\, \mathrm{d}\mu = \int h\, \mathrm{d}\mu \iff f = h + \mathrm{div}_\mu(X) \text{ for some } X \in \mathfrak{X}_c(\mathcal{M}).
\]
Consequently,
\[
\mathbb{E}_\mu^{-1}(\{0\}) = \{\mathrm{div}_\mu(X) : X \in \mathfrak{X}_c(\mathcal{M})\}.
\]
Thus, the test functions that integrate to zero are precisely those that can be written as the divergence of compactly supported vector fields. In particular, on compact manifolds, there is a canonical Stein operator, $\mathrm{div}_\mu$, which turns vector fields into functions with vanishing expectations. For other types of manifolds, one can obtain similar dualities by using other classes of differential forms, such as the square-integrable ones, or by allowing boundaries. For our purposes, the above is sufficient to motivate calling
\[
S_\mu \equiv \mathrm{div}_\mu|_{\mathfrak{X}_\mu} : \mathfrak{X}_\mu \to C^\infty(\mathcal{M})
\]
the canonical Stein operator, whose domain $\mathfrak{X}_\mu$, called the Stein class, is any set of vector fields satisfying the desired property that $\mathbb{E}_\mu[S_\mu(X)] \equiv \int S_\mu(X)\, \mathrm{d}\mu = 0$ for all $X \in \mathfrak{X}_\mu$.
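As a concrete sanity check of this defining property (our own illustration, not part of the chapter): on $\mathcal{M} = \mathbb{R}$ with $\mu = e^{-V}\,\mathrm{d}x$, the canonical Stein operator acts on a vector field $X$ as $\mathrm{div}_\mu(X) = X' - X V'$, and its image integrates to zero against $\mu$ whenever $X$ is compactly supported:

```python
import numpy as np

# Target: mu proportional to e^{-V} dx with V(x) = x^2 / 2 (standard Gaussian).
V = lambda x: 0.5 * x**2
dV = lambda x: x

# A compactly supported "vector field" X on the real line and its derivative.
X = lambda x: np.where(np.abs(x) < 1, (1 - x**2)**2, 0.0)
dX = lambda x: np.where(np.abs(x) < 1, -4 * x * (1 - x**2), 0.0)

# Canonical Stein operator on the line: div_mu(X) = X' - X V'.
stein = lambda x: dX(x) - X(x) * dV(x)

# Quadrature check that E_mu[div_mu(X)] = 0.
xs = np.linspace(-5, 5, 200001)
w = np.exp(-V(xs))
print(np.trapz(stein(xs) * w, xs) / np.trapz(w, xs))  # ~ 0 up to quadrature error
```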
If we have a bracket $B$ on $\mathcal{M}$, we can turn the canonical Stein operator on vector fields into a second-order differential operator acting on functions, the $B$-Stein operator on the Stein class
\[
C^\infty_\mu \equiv \{ f \in C^\infty(\mathcal{M}) : X^B_f \in \mathfrak{X}_\mu \},
\]
by
\[
S^B_\mu : C^\infty_\mu \to C^\infty(\mathcal{M}), \qquad S^B_\mu f \equiv \mathrm{div}_\mu(X^B_f).
\]
If $\mu = e^{-H}\mu_{\mathcal{M}}$, then we have the following useful decomposition:
\[
\mathrm{div}_{e^{-H}\mu_{\mathcal{M}}}(X^B_f) = \mathrm{div}_{\mu_{\mathcal{M}}}(X^B_f) - X^B_f(H).
\]
Let us give some important examples of bracket Stein operators. When $B \equiv A$ is antisymmetric, the $A$-Stein operator is simply a first-order differential operator, namely the $\mu$-preserving curl vector field $S^A_\mu = \mathrm{curl}_\mu(A)$. When $B$ is Riemannian, and $\mu_{\mathcal{M}}$ is the Riemannian measure, then
\[
S^g_\mu(f) = \nabla \cdot X^g_f - \langle \nabla f, \nabla H \rangle = \Delta f - \langle \nabla f, \nabla H \rangle = \Delta f + \langle \nabla f, \nabla \log \mathrm{d}\mu/\mathrm{d}\mu_{\mathcal{M}} \rangle,
\]
where $\nabla$, $\Delta$, $\nabla\cdot$, $\langle \cdot, \cdot \rangle$ are the Riemannian gradient, Laplacian, divergence, and metric, respectively; the $g^{-1}$-Stein operator becomes the Riemannian Stein operator (Barp et al., 2022; Hodgkinson et al., 2020; Liu and Zhu, 2018). Hence, the Riemannian Stein operator is the restriction of the canonical Stein operator to gradient vector fields. In general, decomposing the bracket into its symmetric and antisymmetric parts, $B \equiv S + A$, we obtain the following useful decomposition of the $B$-Stein operator:
\[
S^{S+A}_\mu(f) = \mathrm{div}_\mu(X^S_f) + \mathrm{curl}_\mu(A)(f). \tag{39}
\]
In particular, if we restrict ourselves to a symmetric positive semidefinite $B$, associated to a set of vector fields $\{Y_i\}$ via $X^B_f \equiv \sum_i Y_i(f)\, Y_i$ for any function $f$, then (39) corresponds to the generator of a $\mu$-preserving diffusion. A suitable Stein class is then the domain of the generator, since for any function in that domain $\mathbb{E}_\mu\big[S^{S+A}_\mu f\big]$ vanishes by the Fokker–Planck equation. The construction of Stein operators via measure-preserving diffusions is known as the Barbour approach (Barbour, 1988). In fact, the brackets allow us to define a more general notion of Stein operator acting on 1-forms $\{\alpha : X^B_\alpha \in \mathfrak{X}_\mu\}$, and, on flat Euclidean space, $S^B_\mu(\alpha) \equiv \mathrm{div}_\mu(X^B_\alpha)$ recovers the “diffusion” Stein operator (Gorham et al., 2019).

4.2.2 Kernel Stein discrepancies and score matching


Once we have a Stein operator, we need to construct a Stein class for it, i.e., a set $\mathcal{V}$ of vector fields (or more general tensor fields) whose image $\mathcal{F}$ under the operator has mean zero under $\mu$. The resulting IPM is then known as a Stein discrepancy:
\[
d_{S_\mu(\mathcal{V})}(\mu, \rho) = \sup_{X \in \mathcal{V}} \left| \int S_\mu(X)\, \mathrm{d}\rho \right|.
\]
The expression $\int S_\mu(X)\, \mathrm{d}\rho$ is precisely the rate of change of the KL divergence along measures satisfying the continuity equation, an observation that leads to Stein variational gradient descent (SVGD) methods to approximate distributions (Liu and Wang, 2016; Liu and Zhu, 2018). Specifically, in SVGD the target measure is approximated using a finite distribution $\sum_\ell \delta_{x_\ell}$, where the location of the particles $\{x_\ell\}_\ell$ is updated by moving along the direction that maximizes the rate of change of KL within a space of vector fields isomorphic to a RKHS (e.g., the space of gradients of functions in a RKHS).
When $S_\mu$ is the canonical Stein operator, there is a canonical Stein class, provided by Stokes' theorem, which essentially only depends on the manifold: for a connected manifold $\mathcal{M}$, viewing integration as an operator on smooth $\mu$-integrable functions, $\int f\, \mathrm{d}\mu = 0 \iff \int \mathrm{d}\alpha = 0$, where $f = \mathrm{div}_\mu(\mu^\sharp(\alpha))$. Unfortunately, Stokes' theorem usually does not provide a practical description of the differential forms that satisfy $\int \mathrm{d}\alpha = 0$, aside from the compactly
supported case. There are, however, several choices of Stein class constructed from Hilbertian subspaces that lead to computationally tractable Stein discrepancies. One route consists in constructing a RKHS of mean-zero functions as the image of another RKHS under a Stein operator. In this case, we can use $S_\mu$ to map a given RKHS of $\mathbb{R}^d$-valued functions $\mathcal{H}$, with (matrix-valued) reproducing kernel $K$, into a Stein RKHS of $\mathbb{R}$-valued functions $S_\mu(\mathcal{H})$ associated to a Stein reproducing kernel $k_\mu$, given by (here $q$ is the Lebesgue density of $\mu$)
\[
k_\mu(x, y) = \frac{1}{q(x)q(y)}\, \partial_y \cdot \partial_x \cdot \big( q(x)\, K(x, y)\, q(y) \big).
\]
The resulting Stein discrepancy can be thought of as an MMD that depends only on $\rho$, and is known as the kernel Stein discrepancy (Oates et al., 2017):
\[
\mathrm{KSD}[\rho]^2 \equiv \mathrm{MMD}[\rho|\mu]^2 = \iint \frac{1}{q(x)q(y)}\, \partial_y \cdot \partial_x \cdot \big( q(x)\, K(x, y)\, q(y) \big)\, \mathrm{d}\rho(y)\, \mathrm{d}\rho(x).
\]
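For the common choice $K(x, y) = k(x, y) I$ with a scalar kernel $k$ on $\mathbb{R}^d$, expanding the derivatives gives the tractable form $k_\mu(x, y) = \sum_i \partial_{x_i}\partial_{y_i} k + \nabla_x k \cdot s(y) + \nabla_y k \cdot s(x) + k\, s(x)^T s(y)$, where $s = \nabla \log q$ is the score, so the KSD can be computed for unnormalized densities. A minimal Python sketch (ours, assuming a Gaussian base kernel) of the corresponding V-statistic:

```python
import numpy as np

def ksd_squared(xs, score, lengthscale=1.0):
    """V-statistic estimate of KSD[rho]^2 from samples xs ~ rho (shape (n, d)).

    Only the score s(x) = grad_x log q(x) of the (possibly unnormalized)
    model density is needed. Base kernel: K(x, y) = k(x, y) I, Gaussian k.
    """
    n, d = xs.shape
    s = score(xs)                                   # (n, d) scores
    diff = xs[:, None, :] - xs[None, :, :]          # (n, n, d), x - y
    r2 = np.sum(diff**2, axis=-1)
    k = np.exp(-r2 / (2 * lengthscale**2))
    gx = -diff / lengthscale**2 * k[..., None]      # grad_x k(x, y)
    gy = diff / lengthscale**2 * k[..., None]       # grad_y k(x, y)
    trace = (d / lengthscale**2 - r2 / lengthscale**4) * k  # sum_i d2k/dxi dyi
    k_mu = (trace
            + np.einsum('ijk,jk->ij', gx, s)        # grad_x k . s(y)
            + np.einsum('ijk,ik->ij', gy, s)        # grad_y k . s(x)
            + k * (s @ s.T))                        # k s(x).s(y)
    return k_mu.mean()

# Example: standard Gaussian model, s(x) = -x; model samples give a small KSD.
rng = np.random.default_rng(0)
print(ksd_squared(rng.standard_normal((300, 2)), score=lambda x: -x))
```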

Another class of discrepancies relies on a choice of bracket $B$ together with a corollary of Stokes' theorem: $\int S^B_\mu(\alpha)\, \mathrm{d}\rho = \int \alpha\big(X^{B^*}_{H-K}\big)\, \mathrm{d}\rho$, where $e^{-H}$ and $e^{-K}$ are the densities of $\rho$ and $\mu$ with respect to a common smooth measure (below, the Riemannian one), while $B^*$ is the dual bracket (the transpose of $B$). We can thus rewrite the Stein discrepancy as
\[
\sup_{\alpha \in \mathcal{A}} \left| \int \alpha\big(X^{B^*}_{H-K}\big)\, \mathrm{d}\rho \right| \tag{40}
\]
over some family of 1-forms $\mathcal{A}$. As we did previously, we can “remove” the supremum by rewriting the above as the supremum, over some unit ball, of a
continuous linear functional. This can be achieved once we have a Riemannian metric $\langle \cdot, \cdot \rangle$, which induces a natural inner product that is central to the theory of harmonic forms, namely $(\alpha, \beta)_\mu \equiv \int \langle \alpha, \beta \rangle\, \mathrm{d}\mu$. In particular, taking as $\mathcal{A}$ the smooth compactly supported 1-forms in the unit ball of $L^2(T^*\mathcal{M}, \mu)$—the Hilbert space of square $\mu$-integrable 1-forms—the Stein discrepancy recovers a generalization of score matching (Hyvärinen and Dayan, 2005):
\[
\mathrm{SM}_B[\rho|\mu] = \int \big\|X^{B^*}_{H-K}\big\|^2\, \mathrm{d}\rho = \mathbb{E}_\rho\Big[\big\|X^{B^*}_{H-K}\big\|^2\Big]. \tag{41}
\]
It is worth noting that, while $L^2(T^*\mathcal{M}, \mu)$ is not a RKHS, and does not have a reproducing kernel, it remains a Hilbertian subspace of the space of de Rham currents. When $B$ is Riemannian, we recover the Riemannian score matching (Barp et al., 2022)
\[
\mathrm{SM}_G[\rho|\mu] = \int \|\nabla H - \nabla K\|^2\, \mathrm{d}\rho,
\]
while in Euclidean space (41) yields the diffusion score matching (Barp et al., 2019a).

4.3 Information geometry of MMDs and natural gradient descent


MMDs and Stein discrepancies have proved to be important tools in a wide
range of contexts, from hypothesis testing and training generative neural net-
works to measuring sample quality (Abdulle et al., 2019; Chen et al., 2019;
Chwialkowski et al., 2016; Dziugaite et al., 2015; Gretton et al., 2012; Liu
et al., 2016). In the context of statistical inference, once we have chosen a suitable discrepancy $D$ and a statistical model $\{\mu_\theta\}$, our aim is to find the best approximation of the target distribution within the model; this corresponds to solving the optimization problem $\theta^* \in \arg\min_{\theta \in \Theta} D[\rho|\mu_\theta]$. As mentioned previously, computing the value of the discrepancy $D[\rho|\mu_\theta]$ is computationally challenging. Fortunately, we can often obtain robust Stein discrepancy estimators for smooth statistical models, whose distributions have a smooth positive Lebesgue density, as well as MMD estimators for generative models that are easy to sample from but have intractable model densities. In either case, once we have an estimator $\hat{D}_m$ based on $m$ samples from the target, we must solve the approximate optimization problem $\theta^*_m \in \arg\min_{\theta \in \Theta} \hat{D}_m[\rho|\mu_\theta]$. When the function $\hat{D}_m[\rho|\mu_\theta]$ is smooth, this may be done via the accelerated Hamiltonian-based optimization methods previously discussed (Section 2). If $D$ is a divergence function, one can also usually improve the speed of convergence by following the natural gradient descent associated with the information Riemannian metric $g_\theta$ induced by $D$ (Chen and Li, 2018; Kakade, 2001; Karakida et al., 2016; Park et al., 2000). In practice, this leads to implementing the update
\[
\hat{\theta}_{t+1} = \hat{\theta}_t - \gamma_t\, \hat{g}^{-1}_{\theta_t}\, \partial_{\theta_t} \hat{D}_m[\rho|\mu_\theta],
\]
where $\{\gamma_t\}$ is an appropriate sequence of step sizes, and $\hat{g}^{-1}_\theta$ is the inverse of a regularized estimate of the information tensor (Bonnabel, 2013). Finally, note
that there is a deep connection between divergences and the geometric
mechanics discussed in sampling and optimization, as any divergence may
be interpreted as a discrete Lagrangian, and hence generates a symplectic
structure and integrator (Leok and Zhang, 2017).
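As an illustration of the natural gradient update above, here is a minimal sketch (ours; the loss gradient, metric estimate, and damping term are placeholder assumptions, not the chapter's prescription):

```python
import numpy as np

def natural_gradient_step(theta, grad_D, info_tensor, step_size=0.1, damping=1e-3):
    """One natural gradient update: theta <- theta - step * g^{-1} grad.

    grad_D(theta):      estimated gradient of the discrepancy at theta
    info_tensor(theta): estimated information (Riemannian metric) matrix g_theta
    A small damping term regularizes the estimated metric before inversion.
    """
    g = info_tensor(theta) + damping * np.eye(len(theta))
    # Solve g @ delta = grad rather than forming the inverse explicitly.
    delta = np.linalg.solve(g, grad_D(theta))
    return theta - step_size * delta
```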

4.3.1 Minimum Stein discrepancy estimators


When the model $\{\mu_\theta\}$ consists of smooth measures with positive densities $\{q_\theta\}$, and we have access to samples $\{x_\ell\}$ from the target, the Stein discrepancies offer a flexible family of inference methods. For SM we can use the estimator
\[
\widehat{\mathrm{SM}}_m[\rho|\mu_\theta] = \frac{1}{m} \sum_{\ell=1}^m \Big[ \big\| B^T \partial_x \log q_\theta \big\|_2^2 + 2\, \partial_x \cdot \big( B B^T \partial_x \log q_\theta \big) \Big](x_\ell)
\]
combined with the following expression for the information tensor:
\[
(g_\theta)_{ij} = \int B^T \partial_x \partial_{\theta_i} \log q_\theta \cdot B^T \partial_x \partial_{\theta_j} \log q_\theta\, \mathrm{d}\mu_\theta.
\]
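With $B = I$, the estimator above reduces to the classical Hyvärinen score-matching objective, averaging $\|\partial_x \log q_\theta\|^2 + 2\,\Delta_x \log q_\theta$ over the samples. A minimal numerical sketch (ours; the Gaussian model is a placeholder example):

```python
import numpy as np

def sm_loss(xs, score, laplacian_log_q):
    """Score-matching estimate with B = I.

    score(x):            grad_x log q_theta(x), shape (d,)
    laplacian_log_q(x):  sum_i d^2/dx_i^2 log q_theta(x), a scalar
    Averages |grad log q|^2 + 2 * laplacian(log q) over the samples.
    """
    return np.mean([score(x) @ score(x) + 2.0 * laplacian_log_q(x) for x in xs])

# Example: Gaussian model q_theta = N(theta, 1) in 1D; the loss is minimized
# (in expectation) at theta equal to the mean of the data.
rng = np.random.default_rng(1)
xs = rng.normal(2.0, 1.0, size=(1000, 1))
for theta in (0.0, 1.0, 2.0, 3.0):
    score = lambda x, t=theta: -(x - t)
    lap = lambda x: -1.0
    print(theta, sm_loss(xs, score, lap))
```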

For KSD it is convenient to choose a family of matrix kernels $K_\theta(x, y) = B_\theta(x)\, k(x, y)\, B_\theta(y)^T$, for some scalar kernel $k$ and parameter-dependent matrix function $B_\theta$. Denoting the associated Stein reproducing kernel by $k_{\mu_\theta, \theta}$, we have the unbiased estimator
\[
\widehat{\mathrm{KSD}}_m[\rho] = \frac{1}{m(m-1)} \sum_{i \neq j} k_{\mu_\theta, \theta}(x_i, x_j),
\]
and information tensor
\[
(g_\theta)_{ij} = \iint \big( \partial_x \partial_{\theta_j} \log q_\theta \big)^T B_\theta(x)\, k(x, y)\, B_\theta^T(y)\, \partial_x \partial_{\theta_i} \log q_\theta\, \mathrm{d}\mu_\theta(x)\, \mathrm{d}\mu_\theta(y).
\]

The parameters $B$ and $k$, and the choice of statistical model, can often be adjusted to achieve characteristicness, consistency, and bias-robustness, and to obtain central limit theorems; see Barp et al. (2019a) for details, and for
numerical experiments showing an acceleration induced by the information
Riemannian metric.

4.3.2 Likelihood-free inference with generative models


For many applications of interest, the densities of the model $\{\mu_\theta\}$ cannot be evaluated or differentiated. We thus need density-free inference methods. This is the case, for instance, in the context of generative models, wherein $\mu_\theta$ is the pushforward of a distribution $\mu$, from which we can sample efficiently, under a generator function $T_\theta$. Then, the minimum Stein discrepancy estimators based on KSD and SM, or other discrepancies that rely on the scores, are intractable. The MMDs are suited to this case since they depend on the target and model only through integration, which can be straightforwardly estimated using the samples. The associated information tensor is
\[
(g_\theta)_{ij} = \iint \partial_\theta T_\theta(u)^T\, \partial_x \partial_y k(x, y)\big|_{(T_\theta(u),\, T_\theta(v))}\, \partial_\theta T_\theta(v)\, \mathrm{d}\mu(u)\, \mathrm{d}\mu(v).
\]

Under appropriate choices of kernels and models one can derive theoretical
guarantees, such as concentration/generalization bounds, consistency, asymp-
totic normality, and robustness; see, e.g., Briol et al. (2019), Gretton et al.
(2009), and Dziugaite et al. (2015). Moreover, many approaches to kernel
selection in a wide range of contexts have been studied, which include the
median heuristic or maximizing the power of hypothesis tests, and in practice
mixtures of Gaussian kernels are often employed (Briol et al., 2019; Dziugaite
et al., 2015; Garreau et al., 2017; Li et al., 2017; Ramdas et al., 2015;
Sutherland et al., 2016).
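A minimal sketch (ours) of the resulting likelihood-free loss: push base samples through a toy generator $T_\theta$ and compare them to data with the unbiased $\mathrm{MMD}^2$ estimate; in practice $\theta$ would then be updated by differentiating this loss, e.g., via automatic differentiation, possibly preconditioned by the information tensor above.

```python
import numpy as np

def mmd_loss(theta, data, base_samples, lengthscale=1.0):
    """Likelihood-free MMD^2 loss for a generative model x = T_theta(u).

    Here T_theta is a toy location-scale generator, T_theta(u) = theta[0] + theta[1]*u,
    standing in for any simulator we can sample from but whose density is intractable.
    """
    gen = theta[0] + theta[1] * base_samples       # pushforward samples ~ mu_theta
    def k(a, b):
        d = a[:, None, :] - b[None, :, :]
        return np.exp(-np.sum(d**2, axis=-1) / (2 * lengthscale**2))
    kxx, kyy, kxy = k(data, data), k(gen, gen), k(data, gen)
    n, m = len(data), len(gen)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2 * kxy.mean())
```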

5 Adaptive agents through active inference


The previous sections have established some of the mathematical fundamentals
of optimization, sampling, and inference. In this final section, we close with a
generic use case called active inference, a general framework for describing
and designing adaptive agents that unifies all aspects of behavior—including
perception, planning, and learning—as processes of inference. Active inference
emerged in the late 2000s as a unifying theory of brain function (Friston, 2010;
Friston et al., 2006, 2010, 2015), and has since been applied to simulate a wide
range of behaviors in neuroscience (Da Costa et al., 2020a; Parr, 2019),
machine learning (Fountas et al., 2020; Mazzaglia et al., 2021; Millidge, 2020),
and robotics (Da Costa et al., 2022; Lanillos et al., 2021). In what follows, we
derive the objective functional overarching decision-making and describe its
information geometric structure, revealing several special cases that are estab-
lished notions in statistics, cognitive science, and engineering. Then, we exploit
this geometric structure in a generic framework for designing adaptive agents.

5.1 Modeling adaptive decision-making


5.1.1 Behavior, agents, and environments
We define behavior as the interaction between an agent and its environment.
Together the agent and its environment form a system that evolves over time
according to a stochastic process $x$. This definition entails a notion of time $\mathcal{T}$, which may be discrete or continuous, and a state space $\mathcal{X}$, which should be a measure space (e.g., discrete space, manifold, etc.). A stochastic process $x$ is a time-indexed collection of random variables $x_t$ on the state space. More concisely, it is a random variable over trajectories on the state space $\mathcal{T} \to \mathcal{X}$:
\[
x : \Omega \to (\mathcal{T} \to \mathcal{X}),\quad \omega \mapsto x(\omega) \quad\Longleftrightarrow\quad x_t : \Omega \to \mathcal{X},\quad \omega \mapsto x(\omega)(t)\quad \forall t \in \mathcal{T}.
\]
We denote by $P$ the probability density of $x$ on the space of paths $\mathcal{T} \to \mathcal{X}$ with respect to a prespecified base measure.
Typically, systems comprising an agent and its environment have three
sets of states: external states are unknown to the agent and constitute the envi-
ronment; the observable states are those agent’s states that the agent sees but
cannot directly control; and finally, the autonomous states are those agent’s
states that the agent sees and can directly control. This produces a partition
of the state space X into states external to the agent S and states belonging
to the agent Π, which themselves comprise observable O and autonomous
states A . As a consequence, the system x can be decomposed into external
s, observable o, and autonomous a processes:
X ≡ SΠ ≡ SOA ¼) x ≡ ðs, πÞ ≡ ðs, o, aÞ,
here written as random interacting trajectories on their respective spaces (see
Fig. 2 for an illustration).

FIG. 2 Partitions and agents. This figure illustrates a human (agent π) interacting with its envi-
ronment (external process s), and the resulting partition into external s, observable o, and autono-
mous a processes. The external states are the environment, which the agent does not have direct
access to, but which is sampled through the observable states. These could include states of the
sensory epithelia (e.g., eyes and skin). The autonomous states constitute the muscles and nervous
system that factor available information into decisions. In the example of human behavior, the
environment causes observations (i.e., sensations), which inform a nervous and muscular response,
which in turn influences the environment. In general, autonomous responses may be informed by all past agent states $\pi_{\leq t} = (o_{\leq t}, a_{\leq t})$ (the information available to the agent at time $t$), which means that the systems we are describing are typically non-Markovian.
5.1.2 Decision-making in precise agents


The description of behavior adopted so far could, in principle, describe parti-
cles interacting with a heat bath (Pavliotis, 2014) as well as humans interact-
ing with their environment (see Fig. 2). We would like a description that
accounts for purposeful behavior (Da Costa et al., 2020a, 2021; Friston
et al., 2021a, b, 2022; Parr et al., 2021a). So what distinguishes people from
small particles? An obvious distinction is that people are subject to classical
as opposed to statistical mechanics. In other words, they are precise agents,
with conservative dynamics.
Definition 1 (Precise agent). An agent is precise when it evolves deterministically in a (possibly) stochastic environment, i.e., when $P(\pi | s, \pi_{\leq t})$ is a Dirac measure for any $s, \pi_{\leq t}$. For example,
\[
\mathrm{d}s_t = f(s_t, \pi_t)\,\mathrm{d}t + \mathrm{d}w_t, \qquad \mathrm{d}\pi_t = g(s_t, \pi_t)\,\mathrm{d}t.
\]
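For intuition, here is a minimal Euler–Maruyama sketch (ours; the drift functions $f$ and $g$ are arbitrary placeholders) of such a coupled system, with a noisy external process and a deterministic agent:

```python
import numpy as np

def simulate(f, g, s0, pi0, dt=0.01, n_steps=1000, rng=np.random.default_rng()):
    """Euler-Maruyama simulation of a precise agent in a stochastic environment:
    ds = f(s, pi) dt + dw (noisy external process); dpi = g(s, pi) dt (deterministic agent)."""
    s, pi = float(s0), float(pi0)
    path = [(s, pi)]
    for _ in range(n_steps):
        s += f(s, pi) * dt + np.sqrt(dt) * rng.standard_normal()
        pi += g(s, pi) * dt
        path.append((s, pi))
    return np.array(path)

# Placeholder dynamics: the environment relaxes toward the agent's state,
# and the agent tracks the environment.
traj = simulate(f=lambda s, pi: pi - s, g=lambda s, pi: s - pi, s0=1.0, pi0=0.0)
```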

At any moment in time $t$, the agent has access, at most, to its past trajectory $\pi_{\leq t}$, and has agency over its future autonomous trajectory $a_{>t}$. We define a decision to be a choice of autonomous states in the future given available data $\pi_{\leq t}$. We interpret $P(s, o|\pi_{\leq t})$ as expressing the agent's preferences over external and observable trajectories given available data, and $P(s, o|a, \pi_{\leq t})$ as expressing the agent's predictions over external and observable paths given a decision $a$. In the following, we describe how (precise) agents' decisions $P(a|\pi_{\leq t})$ relate to predictions and preferences.
Since both observable and autonomous processes evolve deterministically in precise agents, the Shannon entropy of observable and autonomous paths is equal$^j$:
\[
S[P(s, o|\pi_{\leq t})] = S[P(s, a|\pi_{\leq t})] \quad \text{for any } \pi_{\leq t}. \tag{42}
\]
Crucially, this allows us to express agents' decisions as a functional of their predictions and preferences (Friston et al., 2022)$^k$:
\[
\begin{aligned}
0 &= \mathbb{E}_{P(s,o,a|\pi_{\leq t})}\big[\log P(s,o|\pi_{\leq t}) - \log P(s,o|\pi_{\leq t})\big] \\
&= \mathbb{E}_{P(s,o,a|\pi_{\leq t})}\big[\log P(s,a|\pi_{\leq t}) - \log P(s,o|\pi_{\leq t})\big] \\
&= \mathbb{E}_{P(a|\pi_{\leq t})}\Big[\log P(a|\pi_{\leq t}) + \mathbb{E}_{P(s,o|a,\pi_{\leq t})}\big[\log P(s|a,\pi_{\leq t}) - \log P(s,o|\pi_{\leq t})\big]\Big] \\
\implies -\log P(a|\pi_{\leq t}) &= \mathbb{E}_{P(s,o|a,\pi_{\leq t})}\big[\log P(s|a, \pi_{\leq t}) - \log P(s,o|\pi_{\leq t})\big]. \tag{EFE}
\end{aligned}
\]

$^j$ To obtain (42) note that when the path space is finite, we have the equality $S[P(s, o|\pi_{\leq t})] - S[P(s, a|\pi_{\leq t})] = \mathbb{E}_{P(x|\pi_{\leq t})}\big[\log P(a|s, o, \pi_{\leq t}) - \log P(o|s, a, \pi_{\leq t})\big] = 0$ due to Definition 1. This equality can be extended to more general path spaces via a limiting argument, by expressing entropies as a limiting density of discrete points (Jaynes, 1957).
$^k$ The second equality follows from (42), and the implication follows since the KL divergence vanishes only when its arguments are equal.
The negative log density of decisions (EFE) is known as an expected free


energy (Da Costa et al., 2020a; Friston et al., 2015; Schwartenbeck et al.,
2019). This is because its functional form is similar to the free energy func-
tional (a.k.a. evidence lower bound (Blei et al., 2017)) used in approximate
Bayesian inference (Friston et al., 2015, 2022). We define active inference
as Hamilton's principle of least action on expected free energy$^l$ that expresses the most likely decision $a$, where
\[
a \equiv \arg\min_a \big(-\log P(a|\pi_{\leq t})\big). \tag{AIF}
\]

5.1.3 The information geometry of decision-making


Interestingly, active inference (AIF) looks like it describes agents that engage
in purposeful behavior. Indeed, we can rearrange the expected free energy
(EFE) in several ways, each of which reveals a fundamental trade-off that
underwrites decision-making. This allows us to relate active inference to
information theoretic formulations of decision-making that predominate in
statistics, cognitive science, and engineering (see Fig. 3).
For example, decision-making minimizes both risk and ambiguity:
\[
-\log P(a|\pi_{\leq t}) = \underbrace{\mathrm{KL}\Big[\overbrace{P(s|a, \pi_{\leq t})}^{\text{predicted paths}}\,\Big|\,\overbrace{P(s|\pi_{\leq t})}^{\text{preferred paths}}\Big]}_{\text{risk}} + \underbrace{\mathbb{E}_{P(s|a,\pi_{\leq t})}\big[S[P(o|s, \pi_{\leq t})]\big]}_{\text{ambiguity}}. \tag{43}
\]

Risk refers to the KL divergence between the predicted and preferred external
course of events. Minimizing risk entails making predicted (external) trajec-
tories fulfill preferred external trajectories. Ambiguity refers to the expected
entropy of future observations, given future external trajectories. An external
trajectory that can lead to various distinct observation trajectories is highly
ambiguous—and vice versa. Thus, minimizing ambiguity leads to sampling
observations that enable the agent to recognize the external course of events. This leads
to a type of observational bias commonly known as the streetlight effect
(Kaplan, 1973): when a person loses their keys at night, they initially search
for them under the streetlight because the resulting observations (“I see my
keys under the streetlight” or “I do not see my keys under the streetlight”)
accurately disambiguate external states of affairs.

$^l$ As a negative log density over paths, the expected free energy is an action in the physical sense of the word.
Similarly, decision-making maximizes extrinsic and intrinsic value (Sajid et al., 2021b):
\[
\begin{aligned}
-\log P(a|\pi_{\leq t}) ={} & \underbrace{\mathbb{E}_{P(o|a,\pi_{\leq t})}\big[\mathrm{KL}[P(s|o, a, \pi_{\leq t})\,|\,P(s|o, \pi_{\leq t})]\big]}_{\geq 0} \\
& - \underbrace{\mathbb{E}_{P(o|a,\pi_{\leq t})}\big[\log \overbrace{P(o|\pi_{\leq t})}^{\text{preferred paths}}\big]}_{\text{extrinsic value}} - \underbrace{\mathbb{E}_{P(o|a,\pi_{\leq t})}\big[\mathrm{KL}[P(s|o, a, \pi_{\leq t})\,|\,P(s|a, \pi_{\leq t})]\big]}_{\text{intrinsic value}}.
\end{aligned}
\]

Extrinsic value refers to the (log) likelihood of observations under the model
of preferences. This corresponds to an expected utility or expected reward in
behavioral economics, control theory, and reinforcement learning (Barto and
Sutton, 1992; Von Neumann and Morgenstern, 1944). In short, maximizing
extrinsic value leads to sampling observations that are likely under the model
of preferences. Intrinsic value refers to the amount of information gained
about external courses of events. This measures the expected degree of belief
updating about external trajectories under a decision, with versus without
future observations. Making decisions to maximize information gain leads
to a goal-directed form of exploration (Schwartenbeck et al., 2019), driven
to answer “What would happen if I did that?” (Schmidhuber, 2010). Interestingly, this decision-making procedure underwrites Bayesian experimental design in statistics (Lindley, 1956), which describes optimal experiments as those that maximize expected information gain.

FIG. 3 Decision-making under active inference. This figure illustrates various imperatives that underwrite decision-making under active inference in terms of several special cases that predominate in statistics, cognitive science, and engineering. These special cases are disclosed when one removes certain sources of uncertainty. For example, if we remove ambiguity, decision-making minimizes risk, which corresponds to aligning predictions with preferences about the external course of events. This underwrites prospect theory of human choice behavior in economics (Kahneman and Tversky, 1979) and modern approaches to control as inference (Levine, 2018; Rawlik et al., 2013; Toussaint, 2009), variously known as Kalman duality (Kalman, 1960; Todorov, 2008), KL control (Kappen et al., 2012), and maximum entropy reinforcement learning (Ziebart, 2010). If we further remove preferences, decision-making maximizes the entropy of external trajectories. This maximum entropy principle (Jaynes, 1957; Lasota and MacKey, 1994) allows one to least commit to a prespecified external trajectory and therefore keep options open. If we reintroduce ambiguity, but ignore preferences, decision-making maximizes intrinsic value or expected information gain (MacKay, 2003b). This underwrites Bayesian experimental design (Lindley, 1956) and active learning in statistics (MacKay, 1992), intrinsic motivation and artificial curiosity in machine learning and robotics (Barto et al., 2013; Deci and Ryan, 1985; Oudeyer and Kaplan, 2007; Schmidhuber, 2010; Sun et al., 2011). This is mathematically equivalent to optimizing expected Bayesian surprise and mutual information, which underwrites visual search (Itti and Baldi, 2009; Parr et al., 2021c) and the organization of our visual apparatus (Barlow, 1961; Linsker, 1990; Optican and Richmond, 1987). Lastly, if we remove intrinsic value, we are left with maximizing extrinsic value or expected utility. This underwrites expected utility theory (Von Neumann and Morgenstern, 1944), game theory, optimal control (Åström, 1965; Bellman, 1957), and reinforcement learning (Barto and Sutton, 1992). Bayesian formulations of maximizing expected utility under uncertainty are also known as Bayesian decision theory (Berger, 1985).
In conclusion, decision-making under active inference weighs the impera-
tives of maximizing utility and information gain, which suggests a principled
solution to the exploration-exploitation dilemma (Berger-Tal et al., 2014).

5.2 Realizing adaptive agents


We now show how active inference affords a generic recipe to generate adap-
tive agents.

5.2.1 The basic active inference algorithm


The idea is to define an agent by a prediction model P(s, o|a), expressing the
distribution of external and observable paths given autonomous paths, and a
preference model P(s, o), expressing the preferred external and observable tra-
jectories. To aid intuition, we will refer to autonomous states as actions. At any
time $t$, the agent knows past observations and actions $\pi_{\leq t} = (o_{\leq t}, a_{\leq t})$, and must make a decision. In discrete time, the decision-making process is as follows (a schematic code sketch follows the list):
1. Preferential inference: infer preferences about external and observable trajectories, i.e.,
\[
\text{Infer } P(s, o|\pi_{\leq t}) \text{ with } Q(s, o|\pi_{\leq t}). \tag{44}
\]
2. For each possible sequence of past, present, and future actions $a$:
(a) Perceptual inference: infer external and observable paths under the action sequence, i.e.,
\[
\text{Infer } P(s, o|a, \pi_{\leq t}) \text{ with } Q(s, o|a, \pi_{\leq t}). \tag{45}
\]
(b) Planning as inference: assess the action sequence by evaluating its expected free energy (EFE), i.e.,
\[
-\log Q(a|\pi_{\leq t}) \equiv \mathbb{E}_{Q(s,o|a,\pi_{\leq t})}\big[\log Q(s|a, \pi_{\leq t}) - \log Q(s, o|\pi_{\leq t})\big]. \tag{46}
\]
3. Decision-making: execute the most likely decision $a_{t+1}$ according to
\[
a_{t+1} = \arg\max Q(a_{t+1}|\pi_{\leq t}), \qquad Q(a_{t+1}|\pi_{\leq t}) = \sum_a Q(a_{t+1}|a)\, Q(a|\pi_{\leq t}). \tag{47}
\]
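The following is a schematic Python sketch of steps 2(b)–3 for a discrete (POMDP-style) model, scoring each action sequence by risk plus ambiguity as in (43); this is our own simplified illustration (preferences placed over external states, beliefs propagated by known transition matrices), not a full implementation:

```python
import numpy as np

def expected_free_energy(qs, B_seq, A, log_C):
    """Score one action sequence by accumulated risk + ambiguity, cf. (43) and (46).

    qs:     current belief over external states, shape (S,)
    B_seq:  list of transition matrices B[a] (S x S), one per action in the sequence
    A:      observation likelihood P(o|s), shape (O, S)
    log_C:  log preferences over external states (normalized log-probs), shape (S,)
    """
    G = 0.0
    H = -np.sum(A * np.log(A + 1e-16), axis=0)   # ambiguity H[P(o|s)] per state
    for B in B_seq:
        qs = B @ qs                              # predicted states under the action
        risk = np.sum(qs * (np.log(qs + 1e-16) - log_C))
        G += risk + qs @ H
    return G

def choose_action(qs, actions, A, log_C, sequences):
    """Return the first action of the sequence with minimal expected free energy."""
    scores = {seq: expected_free_energy(qs, [actions[a] for a in seq], A, log_C)
              for seq in sequences}
    return min(scores, key=scores.get)[0]
```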

5.2.2 Sequential decision-making under uncertainty


A common model of sequential decision-making under uncertainty is a par-
tially observable Markov decision process (POMDP). A POMDP is a discrete
time model of how actions influence external and observable states. In a
POMDP, (1) each external state depends only on the current action and previ-
ous external state P(st|st1, at), and (2) each observation depends only on the
current external state P(ot|st). One can additionally specify (3) a distribution
of preferences over external trajectories P(s). Together, (1) and (2) form the
agent’s (POMDP) prediction model, and (2) and (3) form the agent’s (hidden
Markov) preference model, which defines an active inference agent. A simple
simulation of active inference on a POMDP is provided in Fig. 4; implemen-
tation details on generic POMDPs are available in Da Costa et al. (2020a),
Heins et al. (2022), Sajid et al. (2021a), and Smith et al. (2022). For more
complex simulations of sequential decision-making (e.g., involving hierarchi-
cal POMDPs), please see Sajid et al. (2021a), Friston et al. (2017b), Parr
(2019), Fountas et al. (2020), Millidge (2020), Çatal et al. (2021), and
Friston et al. (2018).

5.2.3 World model learning as inference


Due to a lack of domain knowledge, it may be challenging to specify an
agent’s prediction and preference model. For example, how do external states
map to observations? Should external states be represented in a discrete or
continuous state space?
In active inference, generative models are learned by inferring their para-
meters (Da Costa et al., 2020a; Friston et al., 2016; Sajid et al., 2021a) and
structure (Da Costa et al., 2020a; Friston et al., 2017b, 2021c; Smith et al.,
2020; Wauthier et al., 2020). Suppose there is an unknown parameter (or
structure variable) m in the prediction model, the preference model, or both.
By definition, each alternative parameterization m entails different predictions
P(o, s|a, m) and preferences P(o, s|m). Since unknowns are simply external
states, we treat the parameter as an additional external state. We equip the
space of parameters with a prior distribution P(m), and define the agent with
an augmented prediction (resp. preference) model that combines the different
alternatives P(o, s, m|a) ≡ P(o, s|a, m)P(m) (resp. P(o, s, m) ≡ P(o, s|m)P(m)).
The parameter can then be inferred along with other external states during
preferential (44) or perceptual (45) inference (Da Costa et al., 2020a;
Friston et al., 2016; Sajid et al., 2021a). Better yet, having specified priors
over parameters that are independent of actions, we can infer them separately,
for example, after fixed-length sequences of decisions to reduce computa-
tional cost (Da Costa et al., 2020a; Friston et al., 2016).
All this says that a prior $P(m)$ and some data $\pi_{\leq t}$ lead to approximate posterior beliefs $Q(m|\pi_{\leq t}) \approx P(m|\pi_{\leq t})$ about model parameters. But what are the right priors? One way to answer this question lies in optimizing a free energy functional $F$ (a.k.a. an evidence lower bound (Blei et al., 2017)):
\[
\begin{aligned}
F &\equiv \underbrace{\mathbb{E}_{Q(m|\pi_{\leq t})}\big[-\log P(m, \pi_{\leq t})\big]}_{\text{energy}} - \underbrace{S\big[Q(m|\pi_{\leq t})\big]}_{\text{entropy}} \\
&= \underbrace{\mathrm{KL}\big[Q(m|\pi_{\leq t})\,\big|\,P(m)\big]}_{\text{complexity}} - \underbrace{\mathbb{E}_{Q(m|\pi_{\leq t})}\big[\log P(\pi_{\leq t}|m)\big]}_{\text{accuracy}}.
\end{aligned}
\]
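For a finite set of candidate parameters (or structures) $m$, the complexity–accuracy decomposition can be evaluated directly; a minimal sketch (ours, with placeholder arrays):

```python
import numpy as np

def free_energy(q_m, prior_m, loglik_m):
    """Variational free energy for a discrete parameter m.

    q_m:      approximate posterior Q(m), shape (M,)
    prior_m:  prior P(m), shape (M,)
    loglik_m: log P(data|m) for each m, shape (M,)
    Returns complexity - accuracy = KL[Q|P] - E_Q[log P(data|m)].
    """
    complexity = np.sum(q_m * np.log(q_m / prior_m))
    accuracy = np.sum(q_m * loglik_m)
    return complexity - accuracy
```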
Choosing priors that minimize free energy leads to parsimonious models


that explain the data at hand (Tschantz et al., 2020b). This follows since max-
imizing accuracy increases the likelihood of the data under the posterior
model, while minimizing complexity decreases the movement from prior to
posterior, which can be seen as a proxy for computational cost. Maximizing
accuracy usually results in generative models involving universal function
approximators (Çatal et al., 2021; Fountas et al., 2020; Millidge, 2020), while
minimizing complexity results in organizing representations in sparse, com-
partmentalized, and hierarchical generative models (Friston et al., 2017b,
2021c; Wauthier et al., 2020), where higher levels of the hierarchy encode
more abstract representations and vice versa (Friston et al., 2018). A compu-
tationally efficient method to compare priors by their free energy is Bayesian
model reduction (Da Costa et al., 2020a; Friston et al., 2019). In conclusion,
free energy unifies inference and model selection under a single objective
function.

5.2.4 Scaling active inference


We conclude by identifying promising scaling methods for active inference that
enable computationally tractable implementations in a variety of applications.
Planning for all possible courses of action is computationally expensive as
the number of action sequences is exponential in the number of time-steps.
One way to finesse this is by planning only for intelligently chosen subsets of action sequences, using sampling algorithms like Monte–Carlo tree search (Champion et al., 2021a, b; Fountas et al., 2020; Maisto et al., 2021; Silver et al., 2016).

FIG. 4 Sequential decision-making in a T-Maze environment. Left: The agent's prediction model is a partially observed Markov decision process (see text) represented here as a Bayesian network (Bishop, 2006). The color scheme illustrates the problem at $t = 2$: the agent must make a decision (in red) based on previous actions and observations (in gray), which are informative about external states and future observations (in white). Right: $s_t$: The T-Maze has four possible spatial locations: middle, top-left, top-right, and bottom. One of the top locations contains a reward (in red), while the other contains a punishment (in black). The reward's location determines the context. The bottom arm contains a cue whose color (blue or green) discloses the context. Together, location and context determine the external state. $o_t$: The agent observes its spatial location. In addition, when it is at the top of the Maze, it observes the reward or the punishment; when it is at the bottom, it observes the color of the cue. $a_t$: Each action corresponds to visiting one of the four spatial locations. $P(s_t)$: The agent prefers being at the reward's location ($\log P(s_t) = +3$) and avoiding the punishment's location ($\log P(s_t) = -3$). All other states have a neutral preference ($\log P(s_t) = 0$). $o_0$: The agent is in the middle of the Maze and is unaware of the context. $a_1$: Visiting the bottom or top arms has a lower ambiguity than staying, as they yield observations that disclose the context. However, staying or visiting the bottom arm are safer options, as visiting a top arm risks receiving the punishment. By acting to minimize both risk and ambiguity (43) the agent goes to the bottom. $o_1$: The agent observes the color of the cue and hence determines the context. $a_2$: All actions have equal ambiguity as the context is known. Collecting the reward has a lower risk than staying or visiting the middle, which themselves have a lower risk than collecting the punishment. Thus, the agent visits the arm with the reward. See Friston et al. (2017a) for more details.
Similarly, Monte–Carlo sampling finesses the expectations
inherent in assessing action sequences (46) (Fountas et al., 2020). A comple-
mentary approach is to assess actions, instead of action sequences, by condi-
tioning all future actions to be optimal in the sense that they minimize the
expected free energy (Da Costa et al., 2020b; Friston et al., 2021a). This idea
leads to a backward form of planning, where the agent plans for the best
action at the last time-step, followed by the best action at the penultimate
time-step, and so on, until the present. Crucially, it leads to smarter agents
(Da Costa et al., 2020b; Friston et al., 2021a) whose computational complex-
ity scales linearly (as opposed to exponentially) in the length of action
sequences (Paul et al., n.d.).
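
As a schematic illustration of this backward scheme (a Bellman-style caricature of ours, not the algorithm of the cited works; it assumes a known finite state space and a given one-step expected free energy per state-action pair), a backward recursion touches each time-step once, so the cost grows linearly in the horizon T:

import numpy as np

def backward_plan(G, P, T):
    # G[a, s]: one-step expected free energy of action a in state s
    # P[a, s2, s]: probability of next state s2 given action a in state s
    n_states = G.shape[1]
    V = np.zeros(n_states)                    # value-to-go beyond the horizon
    policy = []
    for _ in range(T):                        # best last action first, then backwards
        Q = G + np.einsum('aij,i->aj', P, V)  # Q[a, s]: action value with optimal future
        policy.append(Q.argmin(axis=0))       # minimize expected free energy
        V = Q.min(axis=0)
    policy.reverse()                          # policy[t][s]: best action at time t
    return policy

Each sweep costs O(|A||S|^2), so the whole plan costs O(T|A||S|^2), in contrast with enumerating all |A|^T action sequences.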
Scalable inference methods (Zhang et al., 2017a) can be used to make active
inference more efficient (van de Laar and de Vries, 2019). For example, we can
train neural networks to predict the various posterior distributions, including the
posterior over actions (Fountas et al., 2020; Millidge, 2020; Sajid et al., 2022).
While training, the output of the neural network can be used as an initial con-
dition for variational inference (Tschantz et al., 2020a), resulting in accurate
inferences whose computational cost decreases as the network learns. Addition-
ally, optimizing free energy reduces to efficient message-passing schemes,
when one imposes certain simplifying restrictions on the family of candidate distributions (Champion et al., 2021c; Parr et al., 2019; Schwöbel et al.,
2018; Wainwright and Jordan, 2007; Winn and Bishop, 2005).
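
The following sketch (ours; the encoder and all parameter names are hypothetical) shows the amortization pattern: a learned network proposes initial log-beliefs, and a few free-energy gradient steps refine them, with fewer steps needed as the proposal improves:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def refine(z, prior, lik_o, steps=20, lr=0.5):
    # z: initial log-beliefs proposed by an encoder network (not shown)
    # prior: p(s); lik_o: p(o | s) for the observed outcome o
    for _ in range(steps):
        q = softmax(z)
        g = np.log(q / prior) + 1.0 - np.log(lik_o)  # dF/dq
        z = z - lr * q * (g - q @ g)                 # chain rule through softmax
    return softmax(z)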
A much cheaper implementation of active inference exists for continuous
states evolving in continuous time. The method frames perception and
decision-making as variational inference, by simulating a gradient flow on
free energy in an extended state space (Friston et al., 2010, 2022). Further-
more, it can be combined with discrete active inference to operate efficiently
in generative models combining discrete and continuous states (Friston et al.,
2017c). As an example, high-dimensional observations in the continuous
domain (e.g., speech) processed through continuous active inference are con-
verted into discrete, abstract representations (e.g., semantics) (Sajid et al.,
2022). Based on these representations, the agent makes high-level, categorical
decisions (e.g., “I want to move over there”), which contextualize low-level,
continuous actions (e.g., the continuous motion of a limb toward the goal
location) (Parr et al., 2021b).
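
For intuition, the continuous-time scheme can be caricatured in one dimension (a sketch of ours, using a linear-Gaussian model rather than the extended state space of the cited works): perception is an ordinary differential equation that descends the free-energy gradient.

def gradient_flow(o, eta, s_o=1.0, s_p=1.0, dt=0.05, steps=200):
    # F(mu) = (o - mu)^2 / (2 s_o) + (mu - eta)^2 / (2 s_p), up to constants
    mu = eta                                     # start beliefs at the prior mean
    for _ in range(steps):
        dF = (mu - o) / s_o + (mu - eta) / s_p   # free-energy gradient
        mu -= dt * dF                            # Euler step of the flow
    return mu                                    # converges to the precision-weighted mean

print(gradient_flow(o=2.0, eta=0.0))  # approximately 1.0 when s_o = s_p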

Acknowledgments
The authors thank Noor Sajid for helpful discussions on adaptive agents. LD is supported by
the Fonds National de la Recherche, Luxembourg (Project code: 13568875). KF is sup-
ported by funding for the Wellcome Centre for Human Neuroimaging (Ref: 205103/Z/
16/Z) and a Canada-UK Artificial Intelligence Initiative (Ref: ES/T01279X/1). GAP was
partially supported by JPMorgan Chase & Co under J.P. Morgan A.I. Research Awards in
2019 and 2021 and by the EPSRC, grant number EP/P031587/1. This publication is based
on work partially supported by the EPSRC Centre for Doctoral Training in Mathematics
of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1). AB, GF, MG,
and MIJ gratefully acknowledge the support of the Army Research Office (ARO) under contract W911NF-
17-1-0304 as part of the collaboration between US DOD, UK MOD, and the UK Engineering
and Physical Sciences Research Council (EPSRC) under the Multidisciplinary University Research
Initiative (MURI).

References
Abdulle, A., Pavliotis, G.A., Vilmart, G., 2019. Accelerated convergence to equilibrium and
reduced asymptotic variance for Langevin dynamics using Stratonovich perturbations. C. R.
Math. 357 (4), 349–354. https://doi.org/10.1016/j.crma.2019.04.008.
Alder, B.J., Wainwright, T.E., 1959. Studies in molecular dynamics. I. general method. J. Chem.
Phys. 31 (2), 459–466.
Alimisis, F., Orvieto, A., Becigneul, G., Lucchi, A., 2021. Momentum improves optimization on
Riemannian manifolds. Int. Conf. Artif. Intell. Stat. 130, 1351–1359.
Amari, S., 2012. Differential-Geometrical Methods in Statistics. vol. 28. Springer Science & Busi-
ness Media.
Amari, S., 2016. Information Geometry and Its Applications. vol. 194. Springer.
Ambrosio, L., Gigli, N., Savare, G., 2005. Gradient Flows: In Metric Spaces and in the Space of
Probability Measures. Springer Science & Business Media, ISBN: 978-3-7643-2428-5
(January).
Anastasiou, A., Barp, A., Briol, F., Ebner, B., Gaunt, R.E., Ghaderinezhad, F., Gorham, J.,
Gretton, A., Ley, C., Liu, Q., et al., 2021. Stein’s method meets statistics: a review of some
recent developments. arXiv:2105.03481 (arXiv preprint).
Andersen, H.C., 1983. Rattle: a “velocity” version of the shake algorithm for molecular dynamics
calculations. J. Comput. Phys. 52 (1), 24–34. https://doi.org/10.1016/0021-9991(83)90014-1.
Aronszajn, N., 1950. Theory of reproducing kernels. Trans. Am. Math. Soc. 68 (3), 337–404.
Asorey, M., Carinena, J.F., Ibort, L.A., 1983. Generalized canonical transformations for
time-dependent systems. J. Math. Phys. 24 (12), 2745–2750.
Åström, K.J., 1965. Optimal control of Markov processes with incomplete state information.
J. Math. Anal. Appl. 10 (1), 174–205. https://doi.org/10.1016/0022-247X(65)90154-X.
Au, K.X., Graham, M.M., Thiery, A.H., 2020. Manifold lifting: scaling MCMC to the vanishing
noise regime. arXiv:2003.03950 (arXiv preprint).
Ay, N., Jost, J., Vân Lê, H., Schwachhöfer, L., 2017. Information Geometry. vol. 64. Springer.
Barbour, A.D., 1988. Stein's method and Poisson process convergence. J. Appl. Probab. 25 (A),
175–184.
Barlow, H.B., 1961. Possible Principles Underlying the Transformations of Sensory Messages.
The MIT Press, ISBN: 978-0-262-31421-3.
Barp, A., 2019. Hamiltonian Monte Carlo on lie groups and constrained mechanics on homoge-
neous manifolds. In: International Conference on Geometric Science of Information.
Springer, pp. 665–675.
Barp, A., 2020. The Bracket Geometry of Statistics (Ph.D. thesis). Imperial College London.
Barp, A., Briol, F.X., Kennedy, A.D., Girolami, M., 2018. Geometry and dynamics for Markov
chain Monte Carlo. Annu. Rev. Stat. App. 5, 451–471.
Barp, A., Briol, F., Duncan, A.B., Girolami, M., Mackey, L., 2019a. Minimum Stein discrepancy
estimators. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche-Buc, F., Fox, E.,
Garnett, R. (Eds.), Advances in Neural Information Processing Systems. vol. 32. Curran
Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/ba7609ee5789cc4dff171045
a693a65f-Paper.pdf.
Barp, A., Kennedy, A., Girolami, M., 2019b. Hamiltonian Monte Carlo on symmetric and homo-
geneous spaces via symplectic reduction. arXiv:1903.02699.
Barp, A., Takao, S., Betancourt, M., Arnaudon, A., Girolami, M., 2021. A unifying and canonical
description of measure-preserving diffusions. arXiv:2105.02845 [math, stat].
Barp, A., Oates, C.J., Porcu, E., Girolami, M., 2022. A Riemann-Stein Kernel method. Bernoulli.
Barp, A., Simon-Gabriel, C.J., Mackey, L., 2022. Targeted convergence characteristics of maxi-
mum mean discrepancies and Kernel Stein discrepancies. (In preparation).
Barto, A., Sutton, R., 1992. Reinforcement Learning: An Introduction. A Bradford Book.
Barto, A., Mirolli, M., Baldassarre, G., 2013. Novelty or surprise? Front. Psychol. 4. https://doi.
org/10.3389/fpsyg.2013.00907.
Bassetti, F., Bodini, A., Regazzini, E., 2006. On minimum Kantorovich distance estimators. Stat.
Probab. Lett. 76 (12), 1298–1302.
Bellman, R.E., 1957. Dynamic Programming. Princeton University Press, Princeton, NJ, US,
ISBN: 978-0-691-14668-3.
Bellman, R.E., Dreyfus, S.E., 2015. Applied Dynamic Programming. Princeton University Press,
ISBN: 978-1-4008-7465-1.
Benettin, G., Giorgilli, A., 1994. On the Hamiltonian interpolation of near-to-the-identity sym-
plectic mappings with application to symplectic integration algorithms. J. Stat. Phys. 74,
1117–1143. https://doi.org/10.1007/BF02188219.
Berger, J.O., 1985. Statistical Decision Theory and Bayesian Analysis. Springer Series in Statis-
tics, second ed. Springer-Verlag, New York, ISBN: 978-0-387-96098-2, https://doi.org/
10.1007/978-1-4757-4286-2.
Berger-Tal, O., Nathan, J., Meron, E., Saltz, D., 2014. The exploration-exploitation Dilemma:
a multidisciplinary framework. PLoS One 9 (4), e95693. https://doi.org/10.1371/journal.
pone.0095693.
Berlinet, A., Thomas-Agnan, C., 2011. Reproducing kernel Hilbert spaces in probability and
statistics. Springer Science & Business Media.
Betancourt, M., 2015. The fundamental incompatibility of scalable Hamiltonian Monte Carlo and
naive data subsampling. In: International Conference on Machine Learning, PMLR,
pp. 533–540.
Betancourt, M., 2016. Identifying the optimal integration time in Hamiltonian Monte Carlo.
arXiv:1601.00225 (arXiv preprint).
Betancourt, M., 2017. A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
Betancourt, M., Byrne, S., Livingstone, S., Girolami, M., 2017. The geometric foundations of
Hamiltonian Monte Carlo. Bernoulli 23 (4A), 2257–2298.
Betancourt, M., Jordan, M.I., Wilson, A., 2018. On symplectic optimization. arXiv:1802.03653
[stat.CO].
Bierkens, J., Roberts, G., 2017. A piecewise deterministic scaling limit of lifted Metropolis-
Hastings in the curie-weiss model. Ann. App. Prob. 27 (2), 846–882.
Bierkens, J., Fearnhead, P., Roberts, G., 2019. The zig-zag process and super-efficient sampling
for Bayesian analysis of big data. Ann. Stats. 47 (3), 1288–1320.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Information Science and Statis-
tics, Springer, New York, ISBN: 978-0-387-31073-2.
Bismut, J.M., 1981. Martingales, the Malliavin calculus and hypoellipticity under general Hörmander's conditions. Z. Wahrsch. Verw. Gebiete 56 (4), 469–505.
Blanes, S., Casas, F., Sanz-Serna, J.M., 2014. Numerical integrators for the hybrid Monte Carlo
method. SIAM J. Sci. Comput. 36 (4), A1556–A1580.
Blei, D.M., Kucukelbir, A., McAuliffe, J.D., 2017. Variational inference: a review for
statisticians. J. Am. Stat. Assoc. 112 (518), 859–877. https://doi.org/10.1080/01621459.
2017.1285773.
Bonnabel, S., 2013. Stochastic gradient descent on riemannian manifolds. IEEE Trans. Autom.
Control 58 (9), 2217–2229.
Bou-Rabee, N., Sanz-Serna, J.M., 2018. Geometric integrators and the Hamiltonian Monte Carlo
method. Acta Numer. 27, 113–206.
Bouchard-Côté, A., Vollmer, S.J., Doucet, A., 2018. The bouncy particle sampler: a nonreversible
rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. 113 (522), 855–867.
Bravetti, A., Daza-Torres, M.L., Flores-Arguedas, H., Betancourt, M., 2019. Optimization algo-
rithms inspired by the geometry of dissipative systems. arXiv:1912.02928 [math.OC].
Briol, F., Barp, A., Duncan, A.B., Girolami, M., 2019. Statistical inference for generative models
with maximum mean discrepancy. arXiv:1906.05944.
Bronstein, M.M., Bruna, J., Cohen, T., Velickovic, P., 2021. Geometric deep learning: grids,
groups, graphs, geodesics, and gauges. arXiv:2104.13478 [cs.LG].
Campos, C.M., Sanz-Serna, J.M., 2015. Extra chance generalized hybrid Monte Carlo. J. Comput.
Phys. 281, 365–374.
Campos, C.M., Sanz-Serna, J.M., 2017. Palindromic 3-stage splitting integrators, a roadmap.
J. Comput. Phys. 346, 340–355.
Cances, E., Legoll, F., Stoltz, G., 2007. Theoretical and numerical comparison of some sampling
methods for molecular dynamics. ESAIM: Math. Model. Numer. Anal. 41 (2), 351–389.
Carmeli, C., De Vito, E., Toigo, A., Umanitá, V., 2010. Vector valued reproducing kernel Hilbert
spaces and universality. Anal. Appl. 8 (01), 19–61.
Çatal, O., Verbelen, T., Van de Maele, T., Dhoedt, B., Safron, A., 2021. Robot navigation as hier-
archical active inference. Neural Netw. 142, 192–204. https://doi.org/10.1016/j.
neunet.2021.05.010.
Celledoni, E., Marthinsen, H., Owren, B., 2014. An introduction to Lie group integrators: basics,
new developments and applications. J. Comput. Phys. 257, 1040–1061.
Celledoni, E., Ehrhardt, M.J., Etmann, C., McLachlan, R.I., Owren, B., Schonlieb, C.B.,
Sherry, F., 2021. Structure-preserving deep learning. Eur. J. Appl. Math. 32 (5), 888–936.
Chafaï, D., 2004. Entropies, convexity, and functional inequalities: on ϕ-entropies and ϕ-Sobolev
inequalities. J. Math. Kyoto Univ. 44 (2), 325–363. https://doi.org/10.1215/kjm/1250283556.
Chak, M., Kantas, N., Lelièvre, T., Pavliotis, G.A., 2021. Optimal friction matrix for underdamped Langevin sampling.
Champion, T., Bowman, H., Grzes, M., 2021a. Branching time active inference: empirical study
and complexity class analysis. arXiv:2111.11276 [cs].
Champion, T., Da Costa, L., Bowman, H., Grzes, M., 2021b. Branching time active inference: the
theory and its generality. arXiv:2111.11107 [cs].
Champion, T., Grzes, M., Bowman, H., 2021c. Realizing active inference in variational message
passing: the outcome-blind certainty seeker. Neural Comput. 33 (10), 2762–2826. https://doi.
org/10.1162/neco_a_01422.
Chen, Y., Li, W., 2018. Natural gradient in Wasserstein statistical manifold. arXiv:1805.08380
(arXiv preprint).
Chen, T., Fox, E., Guestrin, C., 2014. Stochastic gradient Hamiltonian Monte Carlo. In: Interna-
tional Conference on Machine Learning. PMLR, pp. 1683–1691.
Chen, W.Y., Barp, A., Briol, F., Gorham, J., Girolami, M., Mackey, L., Oates, C., 2019. Stein
point Markov chain Monte Carlo. In: International Conference on Machine Learning, PMLR,
pp. 1011–1021.
Chentsov, N.N., 1965. Categories of mathematical statistics. Uspekhi Mat. Nauk 20 (4), 194–195.
Chwialkowski, K., Strathmann, H., Gretton, A., 2016. A kernel test of goodness of fit. In: Inter-
national Conference on Machine Learning, PMLR, pp. 2606–2615.
Clark, M.A., Joó, B., Kennedy, A.D., Silva, P.J., 2011. Improving dynamical lattice QCD simula-
tions through integrator tuning using Poisson brackets and a force-gradient integrator. Phys.
Rev. D 84 (7), 071502.
Cobb, A.D., Baydin, A.G., Markham, A., Roberts, S.J., 2019. Introducing an explicit symplectic
integration scheme for Riemannian manifold Hamiltonian Monte Carlo. arXiv:1910.06243
(arXiv preprint).
Cullen, M., Davey, B., Friston, K.J., Moran, R.J., 2018. Active inference in OpenAI Gym: a para-
digm for computational investigations into psychiatric illness. Biol. Psychiatry Cogn. Neu-
rosci. Neuroimaging 3 (9), 809–818. https://doi.org/10.1016/j.bpsc.2018.06.010.
Cuturi, M., 2013. Sinkhorn distances: lightspeed computation of optimal transport. NeurIPS 26,
2292–2300.
Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., Friston, K., 2020a. Active inference on
discrete state-spaces: a synthesis. J. Math. Psychol. 99, 102447. https://doi.org/10.1016/
j.jmp.2020.102447.
Da Costa, L., Sajid, N., Parr, T., Friston, K., Smith, R., 2020b. The relationship between dynamic
programming and active inference: the discrete, finite-horizon case. arXiv:2009.08111
[cs, math, q-bio].
Da Costa, L., Friston, K., Heins, C., Pavliotis, G.A., 2021. Bayesian mechanics for stationary pro-
cesses. Proc. R. Soc. A Math. Phys. Eng. Sci. 477 (2256), 20210518. https://doi.org/10.1098/
rspa.2021.0518.
Da Costa, L., Lanillos, P., Sajid, N., Friston, K., Khan, S., 2022. How active inference could help
revolutionise robotics. Entropy 24 (3), 361. https://doi.org/10.3390/e24030361.
Davis, M.H.A., 1984. Piecewise-deterministic markov processes: a general class of non-diffusion
stochastic models. J. R. Stat. Soc. B (Methodol.) 46 (3), 353–376.
Deci, E., Ryan, R.M., 1985. Intrinsic Motivation and Self-Determination in Human Behavior.
Perspectives in Social Psychology, Springer US, New York, ISBN: 978-0-306-42022-1,
https://doi.org/10.1007/978-1-4899-2271-7.
Duane, S., Kennedy, A.D., Pendleton, B.J., Roweth, D., 1987. Hybrid Monte Carlo. Phys. Lett.
B 195 (2), 216–222.
Duncan, A.B., Lelièvre, T., Pavliotis, G.A., 2016. Variance reduction using nonreversible Lange-
vin samplers. J. Stat. Phys. 163 (3), 457–491. https://doi.org/10.1007/s10955-016-1491-2.
Duncan, A.B., Nüsken, N., Pavliotis, G.A., 2017. Using perturbed underdamped Langevin dynam-
ics to efficiently sample from probability distributions. J. Stat. Phys. 169 (6), 1098–1131.
https://doi.org/10.1007/s10955-017-1906-8.
Durmus, A., Moulines, E., 2017. Nonasymptotic convergence analysis for the unadjusted Lange-
vin algorithm. Ann. App. Prob. 27 (3), 1551–1587.
Durmus, A., Moulines, E., Saksman, E., 2017. On the convergence of Hamiltonian Monte Carlo.
arXiv:1705.00166 (arXiv preprint).
Durmus, A., Moulines, E., Pereyra, M., 2018. Efficient Bayesian computation by proximal
Markov chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging Sci. 11 (1),
473–506.
Durrleman, S., Pennec, X., Trouve, A., Ayache, N., 2009. Statistical models of sets of curves and
surfaces based on currents. Med. Image Anal. 13 (5), 793–808.
Dziugaite, G.K., Roy, D.M., Ghahramani, Z., 2015. Training generative neural networks via
maximum mean discrepancy optimization. arXiv:1505.03906 (arXiv preprint).
Ethier, S.N., Kurtz, T.G., 2009. Markov Processes: Characterization and Convergence. vol. 282
John Wiley & Sons.
Fang, Y., Sanz-Serna, J.-M., Skeel, R.D., 2014. Compressible generalized hybrid Monte Carlo.
J. Chem. Phys. 140 (17), 174108.
Fernández-Pendás, M., Akhmatskaya, E., Sanz-Serna, J.M., 2016. Adaptive multi-stage integrators
for optimal energy conservation in molecular simulations. J. Comput. Phys. 327, 434–449.
Forest, E., 2006. Geometric integration for particle accelerators. J. Phys. A Math. Gen. 39,
5321–5377.
Fountas, Z., Sajid, N., Mediano, P.A.M., Friston, K., 2020. Deep active inference agents using
Monte-Carlo methods. arXiv:2006.04176 [cs, q-bio, stat].
França, G., Robinson, D., Vidal, R., 2018. ADMM and accelerated ADMM as continuous dyna-
mical systems. Int. Conf. Mach. Learn. 80, 1559–1567.
França, G., Robinson, D.P., Vidal, R., 2018. A nonsmooth dynamical systems perspective on
accelerated extensions of ADMM. arXiv:1808.04048 [math.OC].
França, G., Sulam, J., Robinson, D.P., Vidal, R., 2020. Conformal symplectic and relativistic
optimization. J. Stat. Mech. 2020 (12), 124008. https://doi.org/10.1088/1742-5468/abcaee.
França, G., Barp, A., Girolami, M., Jordan, M.I., 2021a. Optimization on manifolds: a symplectic
approach. arXiv:2107.11231 [cond-mat.stat-mech].
França, G., Jordan, M.I., Vidal, R., 2021b. On dissipative symplectic integration with applications
to gradient-based optimization. J. Stat. Mech. 2021 (4), 043402. https://doi.org/10.1088/1742-
5468/abf5d4.
França, G., Robinson, D.P., Vidal, R., 2021c. Gradient flows and proximal splitting methods:
a unified view on accelerated and stochastic optimization. Phys. Rev. E 103, 053304.
https://doi.org/10.1103/PhysRevE.103.053304.
Friston, K., 2010. The free-energy principle: a unified brain theory? Nat. Rev. Neurosci. 11 (2),
127–138. https://doi.org/10.1038/nrn2787.
Friston, K., Kilner, J., Harrison, L., 2006. A free energy principle for the brain. J. Physiol.-Paris
100 (1-3), 70–87. https://doi.org/10.1016/j.jphysparis.2006.10.001.
Friston, K.J., Daunizeau, J., Kilner, J., Kiebel, S.J., 2010. Action and behavior: a free-energy for-
mulation. Biol. Cybern. 102 (3), 227–260. https://doi.org/10.1007/s00422-010-0364-z.
Friston, K., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., Pezzulo, G., 2015. Active infer-
ence and epistemic value. Cogn. Neurosci. 6 (4), 187–214. https://doi.org/
10.1080/17588928.2015.1020053.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., Pezzulo, G., 2016. Active
inference and learning. Neurosci. Biobehav. Rev. 68, 862–879. https://doi.org/10.1016/
j.neubiorev.2016.06.022.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G., 2017a. Active inference:
a process theory. Neural Comput. 29 (1), 1–49. https://doi.org/10.1162/NECO_a_00912.
Friston, K.J., Lin, M., Frith, C.D., Pezzulo, G., Hobson, J.A., Ondobaka, S., 2017b. Active infer-
ence, curiosity and insight. Neural Comput. 29 (10), 2633–2683. https://doi.org/10.1162/
neco_a_00999.
Friston, K.J., Parr, T., de Vries, B., 2017c. The graphical brain: belief propagation and active
inference. Netw. Neurosci. 1 (4), 381–414. https://doi.org/10.1162/NETN_a_00018.
Friston, K.J., Rosch, R., Parr, T., Price, C., Bowman, H., 2018. Deep temporal models and
active inference. Neurosci. Biobehav. Rev. 90, 486–501. https://doi.org/10.1016/j.neubiorev.
2018.04.004.
Friston, K., Parr, T., Zeidman, P., 2019. Bayesian model reduction. arXiv:1805.07092 [stat].
Friston, K., Da Costa, L., Hafner, D., Hesp, C., Parr, T., 2021a. Sophisticated inference. Neural
Comput. 33 (3), 713–763. https://doi.org/10.1162/neco_a_01351.
Friston, K., Heins, C., Ueltzhöffer, K., Da Costa, L., Parr, T., 2021b. Stochastic Chaos and
Markov Blankets. Entropy 23 (9), 1220. https://doi.org/10.3390/e23091220.
Friston, K., Moran, R.J., Nagai, Y., Taniguchi, T., Gomi, H., Tenenbaum, J., 2021c. World model
learning and inference. Neural Netw. 144, 573–590. https://doi.org/10.1016/j.neunet.
2021.09.011.
Friston, K., Da Costa, L., Sajid, N., Heins, C., Ueltzhöffer, K., Pavliotis, G.A., Parr, T., 2022. The
free energy principle made simpler but not too simple. arXiv:2201.06387 [cond-mat, physics:
nlin, physics:physics, q-bio].
Garbuno-Inigo, A., Hoffmann, F., Li, W., Stuart, A.M., 2019. Interacting Langevin diffusions:
gradient structure and ensemble Kalman sampler. arXiv:1903.08866 [math].
Garreau, D., Jitkrittum, W., Kanagawa, M., 2017. Large sample analysis of the median heuristic.
arXiv:1707.07269 (arXiv preprint).
Girolami, M., Calderhead, B., 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo
methods. J. R. Stat. Soc. B (Stat. Methodol.) 73 (2), 123–214.
Gorham, J., Mackey, L., 2017. Measuring sample quality with kernels. In: International Confer-
ence on Machine Learning. PMLR, pp. 1292–1301.
Gorham, J., Duncan, A.B., Vollmer, S.J., Mackey, L., 2019. Measuring sample quality with diffu-
sions. Ann. Appl. Probab. 29 (5), 2884–2928.
Graham, M.M., Thiery, A.H., Beskos, A., 2019. Manifold Markov chain Monte Carlo methods for
Bayesian inference in a wide class of diffusion models. arXiv:1912.02982 (arXiv preprint).
Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.K., 2009. A fast, consistent kernel
two-sample test. In: NIPS, vol. 23, pp. 673–681.
Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A., 2012. A kernel two-sample
test. J. Mach. Learn. Res. 13 (1), 723–773.
Guillin, A., Monmarché, P., 2021. Optimal linear drift for the speed of convergence of an hypoel-
liptic diffusion. arXiv:1604.07295 [math].
Hairer, M., 2018. Ergodic Properties of Markov Processes.
Hairer, E., Lubich, C., Wanner, G., 2010. Geometric Numerical Integration: Structure-Preserving
Algorithms for Ordinary Differential Equations. Springer.
Hansen, A.C., 2011. A theoretical framework for backward error analysis on manifolds. J. Geom.
Mech. 3 (1), 81–111. https://doi.org/10.3934/jgm.2011.3.81.
Harms, P., Michor, P.W., Pennec, X., Sommer, S., 2020. Geometry of sample spaces.
arXiv:2010.08039.
Hastings, W.K., 1970. Monte Carlo Sampling Methods Using Markov Chains and their Applica-
tions. Oxford University Press.
Haussmann, U.G., Pardoux, E., 1986. Time reversal of diffusions. Ann. Probab. 14 (4),
1188–1205. https://doi.org/10.1214/aop/1176992362.
Heber, F., Trst’anová, Z., Leimkuhler, B., 2020. Posterior sampling strategies based on discretized
stochastic differential equations for machine learning applications. J. Mach. Learn. Res.
21 (228), 1–33.
Heins, C., Millidge, B., Demekas, D., Klein, B., Friston, K., Couzin, I., Tschantz, A., 2022.
Pymdp: a Python library for active inference in discrete state spaces. arXiv:2201.03904
[cs, q-bio].
Helffer, B., 1998. Remarks on decay of correlations and Witten Laplacians Brascamp–Lieb
inequalities and semiclassical limit. J. Funct. Anal. 155 (2), 571–586. https://doi.org/
10.1006/jfan.1997.3239.
Hodgkinson, L., Salomone, R., Roosta, F., 2020. The reproducing stein kernel approach for
post-hoc corrected sampling. arXiv:2001.09266 (arXiv preprint).
Hoffman, M.D., Gelman, A., et al., 2014. The No-U-Turn sampler: adaptively setting path lengths
in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15 (1), 1593–1623.
Holbrook, A., Vandenberg-Rodes, A., Shahbaba, B., 2016. Bayesian inference on matrix mani-
folds for linear dimensionality reduction. arXiv:1606.04478 (arXiv preprint).
Holbrook, A., Lan, S., Vandenberg-Rodes, A., Shahbaba, B., 2018. Geodesic Lagrangian Monte
Carlo over the space of positive definite matrices: with application to Bayesian spectral
density estimation. J. Stat. Comput. Simul. 88 (5), 982–1002.
Holm, D.D., Marsden, J.E., Ratiu, T.S., 1998. The Euler-Poincare equations and semidirect
products with applications to continuum theories. Adv. Math. 137 (1), 1–81.
Hörmander, L., 1967. Hypoelliptic second order differential equations. Acta Math. 119, 147–171.
https://doi.org/10.1007/BF02392081.
Horowitz, A.M., 1991. A generalized guided Monte Carlo algorithm. Phys. Lett. B 268 (2),
247–252.
Hwang, C.-R., Hwang-Ma, S.-Y., Sheu, S.-J., 2005. Accelerating diffusions. Ann. Appl. Probab.
15 (2), 1433–1444. https://doi.org/10.1214/105051605000000025.
Hyvärinen, A., Dayan, P., 2005. Estimation of non-normalized statistical models by score match-
ing. J. Mach. Learn. Res. 6 (4), 695–709.
Itti, L., Baldi, P., 2009. Bayesian surprise attracts human attention. Vis. Res. 49 (10), 1295–1306.
https://doi.org/10.1016/j.visres.2008.09.007.
Izaguirre, J.A., Hampton, S.S., 2004. Shadow hybrid Monte Carlo: an efficient propagator in
phase space of macromolecules. J. Comput. Phys. 200 (2), 581–604.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106 (4), 620–630.
https://doi.org/10.1103/PhysRev.106.620.
Jeffreys, H., 1946. An invariant form for the prior probability in estimation problems. Proc. R.
Soc. Lond. A. Math. Phys. Sci. 186 (1007), 453–461.
Jordan, R., Kinderlehrer, D., Otto, F., 1998. The variational formulation of the Fokker-Planck
equation. SIAM J. Math. Anal. 29 (1), 1–17.
Jost, J., L^e, H.V., Tran, T.D., 2021. Probabilistic morphisms and Bayesian nonparametrics. Eur.
Phys. J. Plus 136 (4), 1–29.
Joulin, A., Ollivier, Y., 2010. Curvature, concentration and error estimates for Markov chain
Monte Carlo. Ann. Probab. 38 (6), 2418–2442. https://doi.org/10.1214/10-AOP541.
Kahneman, D., Tversky, A., 1979. Prospect theory: an analysis of decision under risk. Econome-
trica 47 (2), 263–291. https://doi.org/10.2307/1914185.
Kakade, S.M., 2001. A natural policy gradient. Adv. Neural Inf. Process. Syst. 14, 1531–1538.
Kalman, R.E., 1960. A new approach to linear filtering and prediction problems. J. Basic Eng.
82 (1), 35–45. https://doi.org/10.1115/1.3662552.
Kaplan, A., 1973. The Conduct of Inquiry. Transaction Publishers, ISBN: 978-1-4128-3629-6.
Kappen, H.J., Gómez, V., Opper, M., 2012. Optimal control as a graphical model inference prob-
lem. Mach. Learn. 87 (2), 159–182. https://doi.org/10.1007/s10994-012-5278-7.
Karakida, R., Okada, M., Amari, S., 2016. Adaptive natural gradient learning algorithms for
unnormalized statistical models. In: International Conference on Artificial Neural Networks,
Springer, pp. 427–434.
Katsoulakis, M., Pantazis, Y., Rey-Bellet, L., 2014. Measuring the irreversibility of numerical
schemes for reversible stochastic differential equations. ESAIM: Math. Model. Numer.
Anal./Modelisation Mathematique et Analyse Numerique 48 (5), 1351–1379. https://doi.org/
10.1051/m2an/2013142.
Kennedy, A.D., Silva, P.J., Clark, M.A., 2013. Shadow Hamiltonians, Poisson brackets, and gauge
theories. Phys. Rev. D 87, 034511.
Lanillos, P., Meo, C., Pezzato, C., Meera, A.A., Baioumy, M., Ohata, W., Tschantz, A.,
Millidge, B., Wisse, M., Buckley, C.L., Tani, J., 2021. Active inference in robotics and arti-
ficial agents: survey and challenges. arXiv:2112.01871 [cs].
Lasota, A., Mackey, M.C., 1994. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics.
Springer-Verlag, ISBN: 978-3-540-94049-4.
Lee, J.M., 2013. Smooth manifolds. In: Introduction to Smooth Manifolds, Springer, pp. 1–31.
Leimkuhler, B., Matthews, C., 2016. Efficient molecular dynamics using geodesic integration and
solvent-solute splitting. Proc. R. Soc. A Math. Phys. Eng. Sci. 472 (2189), 20160138. https://
doi.org/10.1098/rspa.2016.0138.
Leimkuhler, B., Reich, S., 2004. Simulating Hamiltonian Dynamics. Cambridge University Press.
Leimkuhler, B.J., Skeel, R.D., 1994. Symplectic numerical integrators in constrained Hamiltonian
systems. J. Comput. Phys. 112, 117–125. https://doi.org/10.1006/jcph.1994.1085.
Lelièvre, T., Nier, F., Pavliotis, G.A., 2013. Optimal non-reversible linear drift for the conver-
gence to equilibrium of a diffusion. J. Stat. Phys. 152 (2), 237–274. https://doi.org/10.1007/
s10955-013-0769-x.
Lelièvre, T., Rousset, M., Stoltz, G., 2019. Hybrid Monte Carlo methods for sampling probability
measures on submanifolds. Numer. Math. 143 (2), 379–421. https://doi.org/10.1007/s00211-
019-01056-4.
Lelièvre, T., Stoltz, G., Zhang, W., 2020. Multiple projection MCMC algorithms on submani-
folds. arXiv:2003.09402 (arXiv preprint).
Leok, M., Zhang, J., 2017. Connecting information geometry and geometric mechanics. Entropy
19 (10), 518.
Levine, S., 2018. Reinforcement learning and control as probabilistic inference: tutorial and
review. arXiv:1805.00909 [cs, stat].
Li, C.-L., Chang, W.-C., Cheng, Y., Yang, Y., Póczos, B., 2017. MMD GAN: towards deeper under-
standing of moment matching network. arXiv:1705.08584 (arXiv preprint).
Lindley, D.V., 1956. On a measure of the information provided by an experiment. Ann. Math.
Stat. 27 (4), 986–1005.
Linsker, R., 1990. Perceptual neural organization: some approaches based on network models and
information theory. Annu. Rev. Neurosci. 13 (1), 257–281. https://doi.org/10.1146/annurev.
ne.13.030190.001353.
Liu, Q., Wang, D., 2016. Stein variational gradient descent: a general purpose Bayesian inference
algorithm. Adv. Neural Inf. Process. Syst. 29.
Liu, C., Zhu, J., 2018. Riemannian Stein variational gradient descent for Bayesian inference.
In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32. 1.
Liu, Q., Lee, J., Jordan, M., 2016. A kernelized stein discrepancy for goodness-of-fit tests.
In: International Conference on Machine Learning, PMLR, pp. 276–284.
Livingstone, S., Girolami, M., 2014. Information-geometric Markov chain Monte Carlo methods
using diffusions. Entropy 16 (6), 3074–3102.
Livingstone, S., Betancourt, M., Byrne, S., Girolami, M., 2019. On the geometric ergodicity of
Hamiltonian Monte Carlo. Bernoulli 25 (4A), 3109–3138.
Ma, Y.-A., Chen, T., Fox, E.B., 2015. A complete recipe for Stochastic gradient MCMC.
arXiv:1506.04696 [math, stat].
Ma, Y.-A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P., Jordan, M.I., 2019a. Is There an
Analog of Nesterov Acceleration for MCMC? arXiv:1902.00996.
Ma, Y.-A., Chatterji, N., Cheng, X., Flammarion, N., Bartlett, P., Jordan, M.I., 2019b. Is there an
analog of Nesterov acceleration for MCMC? arXiv:1902.00996 [cs, math, stat].
MacKay, D.J.C., 1992. Information-based objective functions for active data selection. Neural
Comput. 4 (4), 590–604. https://doi.org/10.1162/neco.1992.4.4.590.
MacKay, D.J.C., 2003a. Information Theory, Inference and Learning Algorithms. Cambridge
University Press.
MacKay, D.J.C., 2003b. Information Theory, Inference and Learning Algorithms, sixth printing
2007 ed. Cambridge University Press, Cambridge, UK; New York, ISBN: 978-0-521-64298-9.
Mackenzie, P.B., 1989. An improved hybrid Monte Carlo method. Phys. Lett. B 226 (3-4),
369–371.
Maisto, D., Gregoretti, F., Friston, K., Pezzulo, G., 2021. Active tree search in large POMDPs.
arXiv:2103.13860 [cs, math, q-bio].
Markovic, D., Stojic, H., Schwöbel, S., Kiebel, S.J., 2021. An empirical evaluation of active infer-
ence in multi-armed bandits. Neural Netw. 144, 229–246. https://doi.org/10.1016/
j.neunet.2021.08.018.
Marsden, J.E., West, M., 2001. Discrete mechanics and variational integrators. Acta Numer. 10,
357–514. https://doi.org/10.1017/S096249290100006X.
Marthinsen, H., Owren, B., 2016. Geometric integration of non-autonomous Hamiltonian
problems. Adv. Comput. Math. 42, 313–332. https://doi.org/10.1007/s10444-015-9425-0.
Mattingly, J.C., Stuart, A.M., Higham, D.J., 2002. Ergodicity for SDEs and approximations:
locally Lipschitz vector fields and degenerate noise. Stoch. Process. Their Appl. 101 (2),
185–232. https://doi.org/10.1016/S0304-4149(02)00150-3.
Mattingly, J.C., Stuart, A.M., Tretyakov, M.V., 2010. Convergence of numerical time-averaging
and stationary measures via Poisson equations. SIAM J. Numer. Anal. 48 (2), 552–577.
https://doi.org/10.1137/090770527.
Mazzaglia, P., Verbelen, T., Dhoedt, B., 2021. Contrastive active inference. In: Advances in
Neural Information Processing Systems.
McLachlan, R., Perlmutter, M., 2001. Conformal Hamiltonian systems. J. Geom. Phys. 39,
276–300. https://doi.org/10.1016/S0393-0440(01)00020-1.
McLachlan, R.I., Quispel, G.R.W., 2002. Splitting methods. Acta Numer. 11, 341. https://doi.org/
10.1017/S0962492902000053.
McLachlan, R.I., Quispel, G.R.W., 2006. Geometric integrators for ODEs. J. Phys. A Math. Gen.
39, 5251–5285. https://doi.org/10.1088/0305-4470/39/19/s01.
McLachlan, R.I., Quispel, G.R.W., Robidoux, N., 1999. Geometric integration using discrete gra-
dients. Philos. Trans. R. Soc. Lond. A 357 (1754), 1021–1045.
McLachlan, R.I., Modin, K., Verdier, O., Wilkins, M., 2014. Geometric generalizations of
SHAKE and RATTLE. Found. Comput. Math. (14), 339–370. https://doi.org/10.1007/
s10208-013-9163-y.
Metropolis, N.R., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Equation of
state calculations by fast computing machines. J. Chem. Phys. 21 (6), 1087–1092.
Millidge, B., 2020. Deep active inference as variational policy gradients. J. Math. Psychol. 96,
102348. https://doi.org/10.1016/j.jmp.2020.102348.
Mira, A., 2001. Ordering and improving the performance of Monte Carlo Markov chains. Stat.
Sci. 16 (4), 340–350. https://doi.org/10.1214/ss/1015346319.
Modin, K., Perlmutter, M., Marsland, S., McLachlan, R., 2010. Geodesics on Lie groups: Euler
equations and totally geodesic subgroup. Res. Lett. Inform. Math. Sci. 14, 79–106.
Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., 2016. Kernel mean embedding of
distributions: a review and beyond. arXiv:1605.09522 (arXiv preprint).
Muehlebach, M., Jordan, M.I., 2021a. On constraints in first-order optimization: a view from
non-smooth dynamical systems. arXiv:2107.08225, [math.OC].
Muehlebach, M., Jordan, M.I., 2021b. Optimization with momentum: dynamical, control-
theoretic, and symplectic perspectives. J. Mach. Learn. Res. 22 (73), 1–50.
Müller, A., 1997. Integral probability metrics and their generating classes of functions. Adv. Appl.
Probab. 29 (2), 429–443.
Murray, I., Adams, R., MacKay, D., 2010. Elliptical slice sampling. In: Int. Conf. Artificial Intel-
ligence and Stats, JMLR Workshop and Conference Proceedings, pp. 541–548.
Neal, R.M., 1992. Bayesian training of Backpropagation Networks by the Hybrid Monte Carlo
Method. Citeseer.
Neal, R.M., 2003. Slice sampling. Ann. Stat. 31 (3), 705–767.
Neal, R.M., 2004. Improving asymptotic variance of MCMC estimators: non-reversible chains are
better. arXiv:math/0407281.
Neal, R.M., 2011. MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte
Carlo, Chapman and Hall/CRC.
Nesterov, Y., 1983. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Math. Doklady 27 (2), 372–376.
Nielsen, F., 2020. An elementary introduction to information geometry. Entropy 22 (10), 1100.
Oates, C.J., Girolami, M., Chopin, N., 2017. Control functionals for Monte Carlo integration. J. R.
Stat. Soc. B (Stat. Methodol.) 79 (3), 695–718.
Optican, L.M., Richmond, B.J., 1987. Temporal encoding of two-dimensional patterns by single
units in primate inferior temporal cortex. III. Information theoretic analysis. J. Neurophysiol.
57 (1), 162–178. https://doi.org/10.1152/jn.1987.57.1.162.
Otto, F., 2001. The geometry of dissipative evolution equations: the porous medium equation.
Commun. Partial Differ. Equ. 26, 101–174. https://doi.org/10.1081/PDE-100002243.
Ottobre, M., 2016. Markov chain Monte Carlo and irreversibility. Rep. Math. Phys. 77, 267–292.
https://doi.org/10.1016/S0034-4877(16)30031-3.
Ottobre, M., Pillai, N.S., Pinski, F.J., Stuart, A.M., 2016. A function space HMC algorithm with
second order Langevin diffusion limit. Bernoulli 22 (1), 60–106.
Oudeyer, P.-Y., Kaplan, F., 2007. What is intrinsic motivation? A typology of computational
approaches. Front. Neurorobot. 1, 6. https://doi.org/10.3389/neuro.12.006.2007.
Park, H., Amari, S., Fukumizu, K., 2000. Adaptive natural gradient learning algorithms for vari-
ous stochastic models. Neural Netw. 13 (7), 755–764.
Parr, T., 2019. The Computational Neurology of Active Vision (Ph.D. thesis). University College
London, London.
Parr, T., Markovic, D., Kiebel, S.J., Friston, K.J., 2019. Neuronal message passing using mean-
field, Bethe, and marginal approximations. Sci. Rep. 9 (1), 1889. https://doi.org/10.1038/
s41598-018-38246-3.
Parr, T., Da Costa, L., Heins, C., Ramstead, M.J.D., Friston, K.J., 2021a. Memory and Markov
Blankets. Entropy 23 (9), 1105. https://doi.org/10.3390/e23091105.
Parr, T., Limanowski, J., Rawji, V., Friston, K., 2021b. The computational neurology of move-
ment under active inference. Brain 144 (6), 1799–1818. https://doi.org/10.1093/brain/
awab085.
Parr, T., Sajid, N., Da Costa, L., Mirza, M.B., Friston, K.J., 2021c. Generative models for active
vision. Front. Neurorobot. 15, 651432.
Parry, M., Dawid, A.P., Lauritzen, S., 2012. Proper local scoring rules. Ann. Stat. 40 (1),
561–592.
Paul, A., Sajid, N., Gopalkrishnan, M., Razi, A., 2021. Active inference for Stochastic control.
arXiv:2108.12245 [cs].
Paul, A., Da Costa, L., Gopalkrishnan, M., Razi, A., n.d. Active Inference for Stochastic and
Adaptive Control in a Partially Observable Environment.
Pavliotis, G.A., 2014. Stochastic Processes and Applications: Diffusion Processes, the Fokker-
Planck and Langevin Equations. Texts in Applied Mathematics, vol. 60. Springer, New York,
ISBN: 978-1-4939-1322-0.
Peters, E.A.J.F., de With, G., 2012. Rejection-free Monte Carlo sampling for general potentials.
Phys. Rev. E 85 (2), 026703.
Peyré, G., Cuturi, M., et al., 2019. Computational optimal transport: with applications to data
science. Found. Trends Mach. Learn. 11 (5-6), 355–607.
Polyak, B.T., 1964. Some methods of speeding up the convergence of iteration methods. USSR
Comput. Math. Math. Phys. 4 (5), 1–17.
Predescu, C., Lippert, R.A., Eastwood, M.P., Ierardi, D., Xu, H., Jensen, M., Bowers, K.J.,
Gullingsrud, J., Rendleman, C.A., Dror, R.O., et al., 2012. Computationally efficient molec-
ular dynamics integrators with improved sampling accuracy. Mol. Phys. 110 (9-10), 967–983.
Radivojevic, T., Akhmatskaya, E., 2020. Modified Hamiltonian Monte Carlo for Bayesian infer-
ence. Stat. Comput. 30 (2), 377–404.
Radivojevic, T., Fernández-Pendás, M., Sanz-Serna, J.M., Akhmatskaya, E., 2018. Multi-stage
splitting integrators for sampling with modified Hamiltonian Monte Carlo methods. J. Com-
put. Phys. 373, 900–916.
Ramdas, A., Reddi, S.J., Poczos, B., Singh, A., Wasserman, L., 2015. Adaptivity and
computation-statistics tradeoffs for kernel and distance based high dimensional two sample
testing. arXiv:1508.00655 (arXiv preprint).
Rao, C.R., 1992. Information and the accuracy attainable in the estimation of statistical para-
meters. In: Breakthroughs in Statistics, Springer, pp. 235–247.
Rawlik, K., Toussaint, M., Vijayakumar, S., 2013. On Stochastic optimal control and reinforce-
ment learning by approximate inference. In: Twenty-Third International Joint Conference
on Artificial Intelligence.
Rey-Bellet, L., Spiliopoulos, K., 2015. Irreversible Langevin samplers and variance reduction: a
large deviation approach. Nonlinearity 28 (7), 2081–2103. https://doi.org/10.1088/0951-
7715/28/7/2081.
Roberts, G.O., Tweedie, R.L., 1996. Exponential convergence of Langevin distributions and their
discrete approximations. Bernoulli, 341–363.
Rousset, M., Stoltz, G., Lelievre, T., 2010. Free Energy Computations: A Mathematical Perspec-
tive. World Scientific.
Sajid, N., Ball, P.J., Parr, T., Friston, K.J., 2021a. Active inference: demystified and compared.
Neural Comput. 33 (3), 674–712. https://doi.org/10.1162/neco_a_01357.
Sajid, N., Da Costa, L., Parr, T., Friston, K., 2021b. Active inference, Bayesian optimal design,
and expected utility. arXiv:2110.04074 [cs, math, stat].
Sajid, N., Holmes, E., Costa, L.D., Price, C., Friston, K., 2022. A mixed generative model of audi-
tory word repetition. bioRxiv. https://doi.org/10.1101/2022.01.20.477138. 2022.01.20.477138.
Sanz-Serna, J.M., 1992. Symplectic integrators for Hamiltonian problems: an overview. Acta
Numer. 1, 243–286. https://doi.org/10.1017/S0962492900002282.
Saumard, A., Wellner, J.A., 2014. Log-concavity and strong log-concavity: a review. Stat. Surv. 8,
45–114. https://doi.org/10.1214/14-SS107.
Schmidhuber, J., 2010. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE
Trans. Auton. Ment. Dev. 2 (3), 230–247. https://doi.org/10.1109/TAMD.2010.2056368.
Schwartenbeck, P., Passecker, J., Hauser, T.U., FitzGerald, T.H., Kronbichler, M., Friston, K.J.,
2019. Computational mechanisms of curiosity and goal-directed exploration. eLife 8, 45.
Schwartz, L., 1964. Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés
(noyaux reproduisants). J. d’anal. Math. 13 (1), 115–256.
Schwöbel, S., Kiebel, S., Markovic, D., 2018. Active inference, belief propagation, and the Bethe
approximation. Neural Comput. 30 (9), 2530–2567. https://doi.org/10.1162/neco_a_01108.
Sexton, J.C., Weingarten, D.H., 1992. Hamiltonian evolution for the hybrid Monte Carlo algo-
rithm. Nucl. Phys. B 380 (3), 665–677.
Shahbaba, B., Lan, S., Johnson, W.O., Neal, R.M., 2014. Split Hamiltonian Monte Carlo. Stat.
Comput. 24 (3), 339–349.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J.,
Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T.,
Hassabis, D., 2016. Mastering the game of Go with deep neural networks and tree search.
Nature 529 (7587), 484–489. https://doi.org/10.1038/nature16961.
Simon-Gabriel, C.-J., Schölkopf, B., 2018. Kernel distribution embeddings: universal kernels,
characteristic kernels and kernel metrics on distributions. J. Mach. Learn. Res. 19 (1),
1708–1736.
Simon-Gabriel, C.J., Barp, A., Schölkopf, B., Mackey, L., 2020. Metrizing weak convergence
with maximum mean discrepancies. arXiv:2006.09268 (arXiv preprint).
Smith, R., Schwartenbeck, P., Parr, T., Friston, K.J., 2020. An active inference approach to mod-
eling structure learning: concept learning as an example case. Front. Comput. Neurosci. 14,
41. https://doi.org/10.3389/fncom.2020.00041.
Smith, R., Friston, K.J., Whyte, C.J., 2022. A step-by-step tutorial on active inference and its
application to empirical data. J. Math. Psychol. 107, 102632. https://doi.org/10.1016/j.
jmp.2021.102632.
Sohl-Dickstein, J., Mudigonda, M., DeWeese, M., 2014. Hamiltonian Monte Carlo without
detailed balance. In: International Conference on Machine Learning, pp. 719–726.
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Schölkopf, B., Lanckriet, G.R.G., 2010.
Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res. 11,
1517–1561.
Sriperumbudur, B.K., Fukumizu, K., Lanckriet, G.R.G., 2011. Universality, characteristic Kernels
and RKHS embedding of measures. J. Mach. Learn. Res. 12 (7), 2389–2410.
Stein, C., 1972. A bound for the error in the normal approximation to the distribution of a sum of
dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathemat-
ical Statistics and Probability, volume 2: Probability Theory, vol. 6. University of California
Press, pp. 583–603.
Steinwart, I., Christmann, A., 2008. Support Vector Machines. Springer Science & Business
Media.
Strathmann, H., Sejdinovic, D., Livingstone, S., Szabo, Z., Gretton, A., 2015. Gradient-free
Hamiltonian Monte Carlo with efficient kernel exponential families. arXiv:1506.02564
(arXiv preprint).
Su, W., Boyd, S., Candès, E.J., 2016. A differential equation for modeling Nesterov’s accelerated
gradient method: theory and insights. J. Mach. Learn. Res. 17 (153), 1–43.
Sun, Y., Gomez, F., Schmidhuber, J., 2011. Planning to be surprised: optimal Bayesian explora-
tion in dynamic environments. arXiv:1103.5708 [cs, stat].
Sutherland, D.J., Tung, H.-Y., Strathmann, H., De, S., Ramdas, A., Smola, A., Gretton, A., 2016.
Generative models and model criticism via optimized maximum mean discrepancy.
arXiv:1611.04488 (arXiv preprint).
Suzuki, M., 1990. Fractal decomposition of exponential operators with applications to many-body
theories and Monte Carlo simulations. Phys. Lett. A 146, 319–323. https://doi.org/
10.1016/0375-9601(90)90962-N.
Takahashi, M., Imada, M., 1984. Monte Carlo calculation of quantum systems. II. Higher order
correction. J. Phys. Soc. Jpn. 53, 3765–3769.
Tao, M., 2016. Explicit symplectic approximation of nonseparable Hamiltonians: algorithm and
long time performance. Phys. Rev. E 94 (4), 043303.
Todorov, E., 2008. General duality between optimal control and estimation. In: 2008 47th IEEE
Conference on Decision and Control, December, pp. 4286–4292.
Toussaint, M., 2009. Robot trajectory optimization using approximate inference. In: ICML ’09.
Proceedings of the 26th Annual International Conference on Machine Learning, June, Asso-
ciation for Computing Machinery, Montreal, Quebec, Canada, pp. 1049–1056.
Tschantz, A., Millidge, B., Seth, A.K., Buckley, C.L., 2020a. Control as hybrid inference.
arXiv:2007.05838 [cs, stat].
Tschantz, A., Seth, A.K., Buckley, C.L., 2020b. Learning action-oriented models through active
inference. PLoS Comput. Biol. 16 (4), e1007805. https://doi.org/10.1371/journal.pcbi.
1007805.
Tuckerman, M., Berne, B.J., Martyna, G.J., 1992. Reversible multiple time scale molec-
ular dynamics. J. Chem. Phys. 97 (3), 1990–2001.
Vaillant, M., Glaunes, J., 2005. Surface matching via currents. In: Biennial Int. Conf. Information
Processing in Medical Imaging, Springer, pp. 381–392.
van de Laar, T.W., de Vries, B., 2019. Simulating active inference processes by message passing.
Front. Robot. AI 6.
van der Himst, O., Lanillos, P., 2020. Deep active inference for partially observable MDPs.
arXiv:2009.03622 [cs, stat].
Van der Vaart, A.W., 2000. Asymptotic Statistics. vol. 3. Cambridge University Press.
Vanetti, P., Bouchard-C^ote, A., Deligiannidis, G., Doucet, A., 2017. Piecewise-deterministic
Markov chain Monte Carlo. arXiv:1707.05296.
Vapnik, V., 1999. The Nature of Statistical Learning Theory. Springer Science & Business Media.
Villani, C., 2009a. Hypocoercivity. Memoirs of the American Mathematical Society, vol. 202.
American Mathematical Society, ISBN: 978-1-4704-0564-9 978-0-8218-6691-7 978-0-
8218-4498-4. https://doi.org/10.1090/S0065-9266-09-00567-5.
Villani, C., 2009b. Optimal Transport: Old and New. Springer.
Von Neumann, J., Morgenstern, O., 1944. Theory of Games and Economic Behavior. Princeton
University Press.
Wainwright, M.J., Jordan, M.I., 2007. Graphical models, exponential families, and variational
inference. Found. Trends Mach. Learn. 1 (1-2), 1–305. https://doi.org/10.1561/2200000001.
Wang, Z., Mohamed, S., Freitas, N., 2013. Adaptive Hamiltonian and Riemann manifold Monte
Carlo. In: International Conference on Machine Learning, PMLR, pp. 1462–1470.
Wauthier, S.T., Çatal, O., Verbelen, T., Dhoedt, B., 2020. Sleep: Model Reduction in Deep Active
Inference. p. 13.
Weinstein, A., 1997. The modular automorphism group of a Poisson manifold. J. Geom. Phys.
23 (3-4), 379–394.
Wibisono, A., Wilson, A.C., Jordan, M.I., 2016. A variational perspective on accelerated methods
in optimization. Proc. Natl. Acad. Sci. 113 (47), E7351–E7358. https://doi.org/10.1073/
pnas.1614734113.
Wilson, A., Recht, B., Jordan, M.I., 2021. A Lyapunov analysis of accelerated methods in optimi-
zation. J. Mach. Learn. Res. 22, 1–34.
Winn, J., Bishop, C.M., 2005. Variational message passing. J. Mach. Learn. Res. 6, 661–694.
Wu, S.-J., Hwang, C.-R., Chu, M.T., 2014. Attaining the optimal Gaussian diffusion acceleration.
J. Stat. Phys. 155, 571–590. https://doi.org/10.1007/s10955-014-0963-5.
Yoshida, H., 1990. Construction of higher order symplectic integrators. Phys. Lett. A 150 (5),
262–268. https://doi.org/10.1016/0375-9601(90)90092-3.
Zhang, H., Sra, S., 2016. First-order methods for geodesically convex optimization. In: Conf.
Learning Theory, pp. 1617–1638.
Zhang, C., Butepage, J., Kjellstrom, H., Mandt, S., 2017a. Advances in variational inference.
arXiv:1711.05597 [cs, stat].
Zhang, C., Shahbaba, B., Zhao, H., 2017b. Hamiltonian Monte Carlo acceleration using surrogate
functions with random bases. Stat. Comput. 27 (6), 1473–1490.
Zhang, B.J., Marzouk, Y.M., Spiliopoulos, K., 2021. Geometry-informed irreversible perturba-
tions for accelerated convergence of Langevin dynamics. arXiv:2108.08247 [math, stat].
Ziebart, B., 2010. Modeling Purposeful Adaptive Behavior With the Principle of Maximum
Causal Entropy (Ph.D. thesis). Carnegie Mellon University, Pittsburgh.
Chapter 3

Equivalence relations and inference for sparse Markov models

Donald E.K. Martin (a,*), Iris Bennett (b,d), Tuhin Majumder (a), and Soumendra Nath Lahiri (c)

(a) Department of Statistics, North Carolina State University, Raleigh, NC, United States
(b) Department of Statistics, North Carolina State University, Raleigh, NC, United States
(c) Department of Mathematics and Statistics, Washington University in St. Louis, St. Louis, MO, United States
(d) Corteva Agriscience, Raleigh, NC, United States
(*) Corresponding author: e-mail: demarti4@ncsu.edu

Abstract
Equivalence relations can be useful for statistical inference. This is demonstrated for
modeling and statistical inference using sparse Markov models (SMMs). SMMs arise
as a mapping of higher-order Markov models, based on the equivalence relation of con-
ditioning histories having equal conditional probability distributions. After discussing
advantages of SMMs for statistical modeling, we give two algorithms for their fitting,
and highlight their use in modeling and classification applications. We also show
through an application to alignment of DNA sequences that equivalence relations can
greatly reduce the number of states of an auxiliary Markov chain used for computing
the distribution of a pattern statistic, an important inference task for categorical time
series.
Keywords: Auxiliary Markov chain, Collapsed Gibbs sampler, Dirichlet process, Quotient mapping, Recursive computation, Regularization

1 Introduction
Let $X \equiv X_1, X_2, \ldots, X_n$ be a categorical time series taking values $x \equiv x_1, x_2, \ldots, x_n$, with each $x_j$ lying in a finite set $\Sigma$. Markovian structure, which assumes that conditioning on very recent observations is sufficient, is frequently used in modeling categorical time series. In a Markov model with $m$th-order dependence, the conditional probability of the current observation
given the last $m$ data points does not change by conditioning further into the past, i.e., for $t = m+1, m+2, \ldots, n$,
$$\Pr[X_t = x_t \mid X_{t-1} = x_{t-1}, \ldots, X_1 = x_1] = \Pr[X_t = x_t \mid X_{t-1} = x_{t-1}, \ldots, X_{t-m} = x_{t-m}].$$
An $m$th-order Markovian sequence $X$ may be associated with a first-order Markov chain, with states represented by the $m$-tuples $\tilde{x}_t \equiv (x_{t-m+1}, \ldots, x_t)$ of $\Sigma^m$ ($t \in \{m, m+1, \ldots, n\}$), an initial distribution $\tau$ over $m$-tuples at time $m$, and a $|\Sigma|^m \times |\Sigma|^m$ transition probability matrix $T$ that has no more than $|\Sigma|$ nonzero elements per row, where $|\cdot|$ here denotes set cardinality.
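
As a small illustration of this embedding (our sketch; cond_prob is a hypothetical callable supplying the conditional probabilities), the first-order chain can be built explicitly, and each row of $T$ indeed has at most $|\Sigma|$ nonzero entries:

from itertools import product
import numpy as np

def mtuple_chain(sigma, m, cond_prob):
    # cond_prob(h, x): P(X_t = x | last m observations = h)
    states = list(product(sigma, repeat=m))       # the |Sigma|^m m-tuples
    idx = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for s in states:
        for x in sigma:                           # at most |Sigma| successors per state
            T[idx[s], idx[s[1:] + (x,)]] = cond_prob(s, x)
    return states, T

states, T = mtuple_chain((0, 1), m=2, cond_prob=lambda h, x: 0.5)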

1.1 Improved modeling capabilities of sparse Markov models (SMMs)
Whereas higher-order Markov models provide a good approximation to prob-
abilities associated with many categorical time series, a major drawback asso-
ciated with them is that the number of transition probability parameters is
$|\Sigma|^m(|\Sigma| - 1)$, which is exponential in the model order $m$. Therefore, low-order models are typically used in applications, even when higher-order dependence is called for. Another drawback, as pointed out in Bühlmann and Wyner (1999), is a lack of flexibility: Markov models give relatively few choices for the number of model parameters, which increases by a factor of $|\Sigma|$ as the model order is increased.
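For instance, with $|\Sigma| = 4$, as for DNA data, and $m = 6$, a full Markov model already has $4^6 \times (4 - 1) = 12{,}288$ transition probability parameters.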
Various efforts have been made to ameliorate these problems, with the
dimension reduction strategy being to exploit sparsity in the model. Rissanen
(1983) studied context tree models in the context of data compression.
A context tree represents histories of variable length needed for conditioning,
with the length of contexts varying. Bühlmann and Wyner (1999) studied the
model from a more statistical point of view under the name variable length
Markov chains (VLMCs). In the literature, VLMCs have also been called
Markov sources (Rissanen, 1986; Roos and Yu, 2009), finite-memory sources
(Weinberger et al., 1992, 1995), variable-order Markov (VOM) models
(Begleiter et al., 2004), probabilistic suffix trees (Gabadinho and Ritschard,
2016), and probabilistic suffix automata (Ron et al., 1996). The model has been
studied by statisticians, computer scientists, information theorists, and computa-
tional biologists, among others, and applied to statistical applications such as
exploratory data analysis for DNA nucleotides and other biological sequences
(Bercovici et al., 2012; Browning, 2006), binary time series from computer sci-
ence or information theory (Willems et al., 1995), text (Ron et al., 1996), clas-
sification (Shmilovici and Ben-gal, 2007), prediction (Begleiter et al., 2004),
statistical process control (Ben-gal et al., 2003), linguistics (Belloni and
Oliveira, 2017; Gallo and Leonardi, 2015; Galves et al., 2012), spam filtering
(Bratko et al., 2006), and web navigation (Borges and Levene, 2007).
In a similar effort to exploit sparsity, Kharin and Petlitskii (2007) intro-
duced the Markov chain with partial connections. A Markov chain of order
$m$ with $m'$ partial connections is denoted MC($m$, $m'$). In an MC($m$, $m'$), the conditional distribution of $X_t$ given all previous observations equals the conditional distribution of $X_t$ given some subset of size $m' \leq m$ of the most recent observations.
SMMs generalize the two models above. Whereas, for example, with
VLMC the most recent elements of the variable length histories must be the
same, in SMM that condition is relaxed and any m-tuples may be grouped.
Thus SMMs are both flexible and parsimonious in terms of the number of
parameters, allowing a better trade-off between the bias that can arise from using a model of an order lower than the true value and the variance associated with estimating many transition probabilities.
To form an SMM, one begins with a higher-order Markov model of order $m$ and clusters the $m$-tuple histories $\tilde{x}_t$ into groups such that the conditional probability distributions given $m$-tuples in the same group are exactly the same. If no such clustering is possible, then no sparseness is obtained and we still have the original higher-order model. However, when there is such clustering, a reduction in the number of model parameters is attained. In the latter case, the clustered groups of $m$-tuples
$$S = \{\gamma_1, \gamma_2, \ldots, \gamma_\eta\},$$
along with the conditional probability distributions associated with each group, give an SMM of order, or maximal context depth, $m$.
Geometrically, a higher-order Markov model may be considered as |Σ|^m points on a probability simplex. Each point is associated with an m-tuple x̃t, and represents a conditional probability distribution, given its m-tuple, over the |Σ| categories. Thus each point is a (|Σ| − 1)-dimensional object, due to the restriction that its components, which are all positive, sum to one. As each probability needs to be estimated from data, there are then |Σ| − 1 parameters for each of the |Σ|^m points.
Now, define an equivalence relation p1 ∼ p2 if and only if the points p1 and p2 of the probability simplex coincide, meaning that their associated conditional probability distributions are exactly the same. Using the language of topology, the quotient mapping of the original |Σ|^m points to η distinct ones (the equivalence classes) gives the SMM. We typically represent the equivalence classes in terms of the corresponding clustered m-tuples for convenience. The SMM then has η(|Σ| − 1) parameters to be estimated.
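To make the parameter counting and the quotient mapping concrete, here is a minimal Python sketch (the alphabet, order, and conditional distributions are hypothetical): m-tuples whose conditional distributions coincide are collapsed to a single point on the simplex, and the |Σ|^m(|Σ| − 1) parameters of the full model are compared with the η(|Σ| − 1) of the SMM.

```python
from itertools import product

# Hypothetical setup: binary alphabet, order m = 3.
sigma = ["0", "1"]
m = 3

# Hypothetical full model: a conditional distribution over sigma for each
# m-tuple history; here histories sharing a final symbol share a distribution.
full_model = {"".join(h): (0.4, 0.6) if h[-1] == "0" else (0.2, 0.8)
              for h in product(sigma, repeat=m)}

# Quotient mapping: m-tuples with identical conditional distributions
# collapse to a single equivalence class (one point on the simplex).
classes = {}
for history, dist in full_model.items():
    classes.setdefault(dist, []).append(history)

eta = len(classes)
print("equivalence classes:", list(classes.values()))
print("full model parameters:", len(sigma) ** m * (len(sigma) - 1))  # |Sigma|^m (|Sigma|-1)
print("SMM parameters:", eta * (len(sigma) - 1))                     # eta (|Sigma|-1)
```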
Example 1. Let X be a VLMC of order m = 4, with Σ = {0, 1}, and the following partition S of the 2^4 = 16 four-tuples into equivalence classes with equal (but not listed) conditional probability distributions:

S = {{0000, 0010, 0100, 0110, 1000, 1010, 1100, 1110},
{0001, 0101, 1001, 1101}, {0011, 1011}, {0111}, {1111}},

listing contexts in the order xt−4, xt−3, xt−2, xt−1. A full context tree representation of the 16 contexts is given in the left-hand plot of Fig. 1, where the contexts are the leaf nodes of the graph.

FIG. 1 Context tree of the full fourth-order histories (left) and VLMC of Example 1 (right).

The histories in the equivalence classes of the VLMC correspond to the number of consecutive 1's at the end of the strings, and are represented by the leaf nodes 0, 01, 011, 0111, and 1111 in the context tree in the right-hand plot of Fig. 1. In more generality, for
an mth-order model of this form, the number of clusters is η = m + 1 as opposed to 2^m, a significant reduction in the number of parameters and associated contexts.
Example 2. Let X be a second-order Markovian sequence with Σ = {A, C, G, T}, and for which the conditional distribution is constant over the equivalence classes

S = {{AA, AC, AG, AT}, {CA, CC, CG, CT}, {GA, GC, GG, GT}, {TA, TC, TG, TT}}.

The classes γl, l = 1, …, 4, are the partition of the set of 2-tuples into those with a fixed value of Xt−2 and an arbitrary value of Xt−1. This is not a VLMC, as there is no common ending of the 2-tuples in each class, but it is an SMM.
SMMs have been considered in very recent years under the names minimal Markov models (García and González-López, 2010), sparse Markov chains (Jääskinen et al., 2014), SMMs (Xiong et al., 2016), and partition Markov models (Fernández et al., 2018; García and González-López, 2017). SMMs allow a wider variety of models than VLMCs and Markov chains with partial connections, as the conditions on the grouped m-tuples are relaxed.

Given the usefulness of SMM for modeling categorical time series, the
nontrivial task of model fitting is important. The next section gives summaries
of two algorithms for fitting the model, along with three applications. In
Section 3, using the task of computing the distribution of a pattern statistic
through an auxiliary Markov chain (AMC), we show that equivalence rela-
tions can be very helpful for statistical inference related to data following
an SMM. The final section is a summary.

2 Fitting SMMs and example applications


The problem of fitting an SMM is then a clustering problem, where the number of groups η and the groups of S need to be determined. After obtaining empirical
conditional probability distributions given the various m-tuples, clustering is
carried out based on statistical criteria. The more m-tuples that are grouped
together, the smaller the number of parameters that are needed, and the
greater the benefit of the clustering exercise. Of course, one should only group
distributions that are “close.” The problem is made more difficult by the fact
that there are no constraints on which histories may share sets of transition
probabilities. Note that after the histories are partitioned, probabilities
p may be estimated via either the maximum likelihood estimator (MLE) using
ratios of the appropriate counts, or the posterior mode conditional on S.
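As a small illustration of the MLE step, the following sketch (the sequence and partition are hypothetical, and a given partition S is assumed to be in hand) pools the counts within each group and estimates each group's distribution by ratios of the appropriate counts.

```python
from collections import Counter

def mle_from_partition(x, partition, sigma, m):
    """MLE of each group's transition distribution from pooled counts."""
    # n[(h, a)]: occurrences of symbol a immediately following m-tuple h
    n = Counter((x[t:t + m], x[t + m]) for t in range(len(x) - m))
    probs = {}
    for group in partition:                # group: tuple of m-tuple histories
        pooled = {a: sum(n[(h, a)] for h in group) for a in sigma}
        total = sum(pooled.values()) or 1  # guard against unseen groups
        probs[group] = {a: pooled[a] / total for a in sigma}
    return probs

# Hypothetical binary sequence with m = 2 and a two-group partition.
x = "0110100111010011010111001"
S = (("00", "10"), ("01", "11"))
print(mle_from_partition(x, S, "01", 2))
```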
There has been relatively little work in the literature on fitting SMMs.
García and González-López (2010, 2017) used an agglomerative hierarchical
clustering approach in which two clusters of histories are combined when
their combination reduces the Bayesian information criterion (BIC) of the
fitted model. They showed that their approach renders a consistent estimator.
Jääskinen et al. (2014) took a Bayesian approach to the problem of fitting
SMMs. Setting a uniform prior over the possible partitions of the histories,
their method used an agglomerative hierarchical clustering approach that
maximizes the posterior probability of the partition at each step. To reduce
the number of posterior probabilities to be calculated as part of the latter clus-
tering algorithm and thus cut down the computation time, Xiong et al. (2016)
applied a Delaunay triangulation to the sample transition probabilities of the
histories before performing agglomerative clustering along the triangulation.
Two very recent methods are outlined below.

2.1 Model fitting based on a collapsed Gibbs sampler


The first approach is a Bayesian one, as in Jääskinen et al. (2014) and Xiong
et al. (2016). The method is laid out in Bennett et al. (2022), where it is shown
that the given estimation procedure is consistent for identifying the correct
clustering. Simulations also showed that the method outperforms existing
methods for fitting SMMs in terms of both accuracy and computation time when |Σ|^m is large.
The method draws from the literature on topic models for short texts,
where discovering topics in a corpus of short texts also involves clustering
counts. Each text is assumed to be generated by one of finitely many topics
that correspond to a multinomial distribution over words. Yin and Wang
(2016) developed a Gibbs sampler, which they termed the GSDMM, to fit
the Dirichlet mixture model of Zhang et al. (2005), providing an efficient
way to fit that model.
A limitation of Dirichlet mixture models is the need to set a maximum
number of mixture components a priori, as setting a high maximum number
of components slows the model fitting. Yin and Wang (2016) extended the
GSDMM algorithm using the Dirichlet process that was introduced in
Ferguson (1973). The Dirichlet process allows one to avoid setting a maxi-
mum number of mixture components, and retains the conjugate relationship
between the Dirichlet and multinomial distributions. The resulting collapsed
Gibbs sampler is referred to as the GSDPMM in Yin and Wang (2016), and
is the method that was adapted in Bennett et al. (2022) to identify the partition
of an SMM. Rather than grouping documents that share multinomial probabil-
ities over words, in the SMM application, one seeks to cluster m-tuple his-
tories with common multinomial transition probabilities.
As in Jääskinen et al. (2014) and Xiong et al. (2016), the collapsed Gibbs sampler of Bennett et al. (2022) sets a Dirichlet prior on the transition probabilities within each cluster. Unlike these previous methods, which set a uniform prior on the assignment of histories to clusters, a Dirichlet process prior is used.
The algorithm seeks c, a vector of length |Σ|^m that contains the cluster assignments of each of the m-tuple histories. The elements c_i of c are each assumed to come from a multinomial distribution over the possible clusters, and these multinomial probabilities are given by ζ. Because the number of clusters is unknown, a stick-breaking prior is used for ζ. Identical Dirichlet priors are used for the p_j, the multinomial transition probabilities associated with each cluster. This model may be written, with X_{t:t+m−1} ≡ (X_t, X_{t+1}, …, X_{t+m−1}), as

X_{t+m} | X_{t:t+m−1}, c_{X_{t:t+m−1}}, p_{c_{X_{t:t+m−1}}} ~ Multinomial(1, p_{c_{X_{t:t+m−1}}}),
c_i | ζ ~ iid Multinomial(1, ζ),
ζ ~ GEM(1, α),
p_j ~ iid Dirichlet(|Σ|^{−1} 1_{|Σ|}),

with 0 < α < 1; t = 1, 2, …, n − m; i = 1, 2, …, |Σ|^m; and j = 1, 2, …, η. Here, 1_r = (1, …, 1)′ for r ≥ 1, and GEM(1, α) refers to the Griffiths–Engen–McCloskey distribution with parameters 1 and α, which is defined as follows. If U ~ GEM(1, α), then for l = 1, 2, …,

V_l ~ iid Beta(1, α),
U_l = V_l ∏_{z=1}^{l−1} (1 − V_z).
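The stick-breaking construction is straightforward to simulate. The sketch below (truncated at a fixed number of sticks, which is an approximation to the infinite sequence) draws the leading weights of a GEM(1, α) vector.

```python
import numpy as np

def gem_weights(alpha, n_sticks, rng):
    """Truncated stick-breaking draw from GEM(1, alpha):
    V_l ~ Beta(1, alpha), U_l = V_l * prod_{z<l} (1 - V_z)."""
    v = rng.beta(1.0, alpha, size=n_sticks)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

rng = np.random.default_rng(0)
u = gem_weights(alpha=0.5, n_sticks=20, rng=rng)
print(u[:5], u.sum())  # weights decay; the infinite sequence sums to 1
```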

As notation, let the number of occurrences of history h in the realization x be denoted n_h. The number of occurrences of symbol i immediately following history h is written n_{i|h}, and similarly, the number of occurrences of symbol i immediately following cluster γj is written n_{i|γj}.
The algorithm uses a collapsed Gibbs sampler to find the posterior mode,
or maximum a posteriori (MAP), over c to cluster the histories. We simply
give a summary of the algorithm here, pointing the reader to Bennett et al.
(2022) for specific details.
ALGORITHM 1 GSDPMM for fitting SMMs.

Data: Sequence X
Result: Partition for SMM

1. Initialization
for h in Σ^m do
  for i in Σ do
    Compute n_{i|h}
  end
end
Select initial partition of histories

2. Gibbs Sampling
for i in 1:I do
  for h in Σ^m do
    Remove h from its current cluster
    for each cluster γj do
      Calculate P(X | h ∈ cluster γj)
    end
    Calculate P(X | h ∈ new cluster)
    Reassign h to a cluster
  end
  if i > I′ then
    Record partition
  end
end

3. Select Partition
Find mode over recorded partitions
Return mode

Bennett et al. (2022) showed that the computation using the GSDPMM algorithm runs in O(η|Σ|^m) time (much less, for large m, than the most competitive method in terms of accuracy, Xiong et al. 2016). It was also shown that each iteration of the Gibbs sampler provides a consistent estimator of the correct partition as n → ∞.
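The cluster-scoring step of Algorithm 1 can be sketched as follows: under a Dirichlet prior on a cluster's transition probabilities, the marginal likelihood of the counts contributed by a history h to cluster γj is Dirichlet-multinomial, so comparing P(X | h ∈ cluster γj) across clusters reduces to differences of log Beta functions. This is only a schematic of the scoring idea, not the exact conditional probabilities of Bennett et al. (2022); the Dirichlet parameter and the counts below are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_mult(counts, a):
    """log of the Dirichlet-multinomial marginal: log B(a + n) - log B(a)."""
    counts, a = np.asarray(counts, float), np.asarray(a, float)
    return (gammaln(a + counts).sum() - gammaln((a + counts).sum())
            - gammaln(a).sum() + gammaln(a.sum()))

def score_assignment(n_h, n_cluster, a):
    """Change in log marginal likelihood from adding history h's counts
    n_h to a cluster whose pooled counts are n_cluster."""
    return (log_dirichlet_mult(np.add(n_h, n_cluster), a)
            - log_dirichlet_mult(n_cluster, a))

# Hypothetical: |Sigma| = 4 with the symmetric Dirichlet(1/|Sigma|) prior.
a = np.full(4, 0.25)
n_h = [3, 0, 1, 2]                                 # counts n_{i|h} for history h
print(score_assignment(n_h, [30, 5, 12, 20], a))   # join an existing cluster
print(score_assignment(n_h, [0, 0, 0, 0], a))      # open a new cluster
```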

2.1.1 Modeling wind speeds


Bennett et al. (2022) used the GSDPMM to model daily average wind speeds
at Malin Head in coastal Ireland from 1961–1978 (Haslett and Raftery, 1989),
using the discretization of Kharin (2020). Let the average wind speed of day t (in knots) be denoted by wt, and define xt by

xt = 0 if wt < 5,
xt = 1 if 5 ≤ wt ≤ 20,
xt = 2 if wt > 20.
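The discretization is immediate to apply in code; a minimal sketch with made-up wind speeds follows.

```python
import numpy as np

w = np.array([3.2, 7.5, 21.0, 14.9, 4.9, 26.3])  # hypothetical daily speeds (knots)
x = np.where(w < 5, 0, np.where(w <= 20, 1, 2))  # 0: w < 5, 1: 5 <= w <= 20, 2: w > 20
print(x)  # [0 1 2 1 0 2]
```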
Bennett et al. (2022) used a third-order Markov model, as did Kharin (2020), to describe wind patterns over the n = 6574 days. Here Σ = {0, 1, 2}, and there are thus 2η transition probability parameters to be estimated. For the full third-order Markov model, η = 3^3 = 27, and thus there are 54 transition probability parameters. Using the MC(3, 2) model yielded η = 9 clusters of histories, reducing the number of transition probability parameters to 18. Fitting an SMM to the data further reduced the model, grouping the histories into η = 5 clusters for a total of 10 transition probability parameters.
The SMM representation had a BIC of 8043, while the MC(3, 2) model
had a BIC of 8125. The greater flexibility of the SMM when compared to
the MC model then allowed for a reduction in the number of parameters to
be estimated and a better model fit according to BIC, without sacrificing the
explanatory power of the model.

2.1.2 Modeling a DNA sequence


Bennett et al. (2022) used the GSDPMM to model the transitions of nucleotide bases from two introns belonging to the mouse αA-crystallin gene. The sequence is of length n = 1307 bases belonging to the alphabet Σ = {A, C, G, T} (Avery, 1987). These data were previously modeled using Markov chains in Raftery and Tavaré (1994). An interesting feature of this data is that while the full second-order Markov chain model is outperformed by the first-order Markov chain using a Bayes factor, the chain does exhibit some second-order behavior (Raftery and Tavaré, 1994). To account for this behavior while
maintaining a parsimonious model, the latter reference collapsed the A and G
bases into a single group. This modification shrunk the size of the alphabet
to 3, resulting in 18 transition probability parameters as compared to 48 in
the full second-order model, and a better model fit than either the full first-
or second-order Markov chain. A major disadvantage to this approach is that
the model is then not suitable for prediction of bases.
The solution of Bennett et al. (2022) to facilitate prediction was to fit a
second-order SMM. The fitted model partitioned the 16 length-two histories
into two sets. While there were 16 transition probability parameters with a full first-order model and a BIC of 1785.7, and 48 transition probability parameters with a BIC of 1883.8 for the full second-order model, the SMM has only six parameters and a BIC of 1764.7.

2.2 Fitting SMM through regularization


Majumder et al. (2022) fit SMMs using penalized criterion functions. Let π̂_j, j = 1, …, |Σ|^m, be the empirical conditional transition probability vectors over Σ, and let b_j, j = 1, …, |Σ|^m, be probability vectors over Σ that solve the penalized criterion function
Σ_{j=1}^{|Σ|^m} Σ_{a∈Σ} (π̂_{j,a} − b_{j,a})² + λ Σ_{1≤j1<j2≤|Σ|^m} w_{j1j2} ||b_{j1} − b_{j2}||₂,   (1)

where λ > 0 is a penalty parameter. Eq. (1) penalizes the distance between
distinct pairs of probability vectors in order to identify identical ones. Once
the probability vectors b_j have been identified, estimates of the number of groups η in S and the groups γj, j = 1, …, η, are evident.
The advantage of clustering by solving Eq. (1) for a range of λ is that we
get a solution path corresponding to the range from a single cluster consisting
of all the elements to |Σ|^m singleton clusters. Hence, the algorithm does not
need to fix the number of clusters a priori. It also avoids getting stuck in local
minima.
The tuning parameter λ, and thus the optimum cluster assignment, is chosen to minimize BIC over a grid of λ values. BIC is defined as

BIC(λ) = −2ℓ(λ) + η_λ(|Σ| − 1) log n,

where η_λ is the number of groups and ℓ(λ) is the log-likelihood associated with a specific value of λ. The solution of Eq. (1) corresponding to the λ value with optimal BIC is taken as the estimated cluster assignment. Majumder et al. (2022) gave theoretical results indicating that, for a range of λ values, one can perfectly recover the true clusters for large n.
Several efficient algorithms have been developed in recent years to solve
Eq. (1) when the penalty function is convex. Chi and Lange (2015) used aug-
mented Lagrangian methods. First, view solving Eq. (1) as the constrained optimization problem

minimize (1/2) Σ_{j=1}^{|Σ|^m} ||π̂_j − b_j||₂² + λ Σ_{l∈E} w_l ||v_l||₂
subject to b_{l1} − b_{l2} − v_l = 0,

where E is the set of all distinct edges {l : l = (l1, l2), l1 < l2, w_l > 0}. Here, a new splitting variable v_l has been introduced to capture the difference
between the group centroids. Chi and Lange (2015) used two algorithms for
solving this constrained optimization problem, namely ADMM and AMA.
For both algorithms, one first incorporates an augmented Lagrangian as follows:

L_ν(B, V, Ψ) = (1/2) Σ_{j=1}^{|Σ|^m} ||π̂_j − b_j||₂² + λ Σ_{l∈E} w_l ||v_l||₂ + Σ_{l∈E} ⟨ψ_l, v_l − b_{l1} + b_{l2}⟩ + (ν/2) Σ_{l∈E} ||v_l − b_{l1} + b_{l2}||₂²,

where B, V, and Ψ are the matrices with b_j, v_l, and ψ_l for j = 1, …, |Σ|^m and l ∈ E as their columns, respectively. Splitting the variables in this manner
allows one to update B, V, and Ψ sequentially, given the other variables. The
convergence of ADMM does not depend on the choice of ν, as it will converge
for any ν > 0. On the other hand, AMA converges for any 0 < ν < 2/|Σ|^m.
Chi and Lange (2015) showed that AMA is much faster than ADMM,
especially when the weights are sparse. They also proved that only B and Ψ
need to be updated at each step, but not V.
AMA was used in the model fitting of Majumder et al. (2022). Let B^{(t)} and Ψ^{(t)} be the parameter values at the tth step. The updates to the next step are computed using the following relations:

b_j^{(t+1)} = π̂_j + Σ_{l: l1=j} ψ_l^{(t)} − Σ_{l: l2=j} ψ_l^{(t)},
ψ_l^{(t+1)} = P_{C_l}(ψ_l^{(t)} − ν g_l^{(t+1)}),

where g_l^{(t+1)} = b_{l1}^{(t+1)} − b_{l2}^{(t+1)}, C_l = {ψ_l : ||ψ_l||₂ ≤ λ w_l}, and P_A(·) is the projection of · onto the set A. Updates are repeated until convergence. The convergence criterion uses the dual problem and duality gap, and is discussed in detail in Chi and Lange (2015). A summary of the algorithm used in Majumder et al. (2022) is as follows:

ALGORITHM 2 AMA for fitting SMMs.

Data: Sequence X
Result: Partition for SMM

Initialize Ψ^{(0)}
for t = 1, 2, 3, … do
  for j = 1, 2, 3, …, |Σ|^m do
    Δ_j^{(t)} = Σ_{l: l1=j} ψ_l^{(t−1)} − Σ_{l: l2=j} ψ_l^{(t−1)}
  end
  for all l do
    g_l^{(t)} = π̂_{l1} − π̂_{l2} + Δ_{l1}^{(t)} − Δ_{l2}^{(t)}
    ψ_l^{(t)} = P_{C_l}(ψ_l^{(t−1)} − ν g_l^{(t)})
  end
end
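A compact sketch of these AMA iterations is given below (illustrative only; the weights, penalty, step size, and data are hypothetical, and the sparse-weight bookkeeping of a practical implementation is omitted). The projection P_{C_l} simply rescales ψ_l onto the L2 ball of radius λw_l.

```python
import numpy as np

def ama_fit(pi_hat, weights, lam, nu, n_iter=2000):
    """AMA iterations for criterion (1). pi_hat: (p, k) array of empirical
    transition probability vectors; weights: dict edge (l1, l2) -> w_l > 0."""
    psi = {l: np.zeros(pi_hat.shape[1]) for l in weights}  # dual variables
    for _ in range(n_iter):
        # b_j = pi_hat_j + sum_{l: l1 = j} psi_l - sum_{l: l2 = j} psi_l
        b = pi_hat.copy()
        for (l1, l2), ps in psi.items():
            b[l1] += ps
            b[l2] -= ps
        # psi_l <- P_{C_l}(psi_l - nu * g_l) with g_l = b_{l1} - b_{l2}
        for (l1, l2) in psi:
            cand = psi[(l1, l2)] - nu * (b[l1] - b[l2])
            radius = lam * weights[(l1, l2)]
            norm = np.linalg.norm(cand)
            psi[(l1, l2)] = cand if norm <= radius else cand * (radius / norm)
    return b  # fitted centroids; (near-)equal rows indicate merged histories

# Hypothetical tiny problem: three histories over a binary alphabet.
pi_hat = np.array([[0.61, 0.39], [0.59, 0.41], [0.20, 0.80]])
w = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
print(ama_fit(pi_hat, w, lam=0.05, nu=0.3).round(3))  # nu < 2/p for convergence
```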

2.2.1 Application to classifying viruses


Majumder et al. (2022) applied SMM fitting to classifying virus samples col-
lected from humans. Data from individuals with four different viruses were
obtained: SARS-CoV-2 (COVID-19), MERS (Middle East Respiratory Syn-
drome), Dengue, and Hepatitis B. The latter paper collected samples for 500
individuals from the NCBI database, where 200 were individuals infected by
SARS-CoV-2, 50 with MERS, 100 from Dengue, and 150 from Hepatitis B
over different time periods and locations. The authors were particularly careful
in collecting the samples from COVID-19 data to incorporate different strains
of the disease, taking 50 samples each from four different time frames: April
2020, September 2020, January 2021, and April 2021. The time frames were
selected on the basis of the spread of certain strains or the peaks of COVID-19 cases.
The NCBI database contains reference genome sequences for the viruses.
Those reference sequences represent the ideal genome structure of any partic-
ular virus species. Note that very minimal changes in the nucleotide sequence
can lead to very different strains of the same disease.
When full sequences were available, little classification was needed as the
sequences were nearly equal to the reference sequence. However, in practice,
there are many occasions where the full sequence is not available. The chal-
lenge lies in that scenario.
First, a reference SMM was built from the reference sequence for each
virus. Next, a randomly selected contiguous segment of the genome sequence
was selected from each sample, and used to compute the likelihood of that
segment under each of the four reference models. The model with the highest
likelihood was the chosen model.
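Once the reference SMMs are in hand, the classification rule itself is simple: score the segment's log-likelihood under each fitted model and take the argmax. The sketch below illustrates the idea with hypothetical toy models (in practice the transition probabilities would be smoothed to avoid log 0).

```python
import math

def smm_loglik(seq, m, class_of, probs):
    """Log-likelihood of seq under an SMM: class_of maps each m-tuple to its
    group; probs maps (group, symbol) to a transition probability."""
    ll = 0.0
    for t in range(len(seq) - m):
        g = class_of[seq[t:t + m]]
        ll += math.log(probs[(g, seq[t + m])])
    return ll

def classify(segment, models):
    """models: dict virus -> (m, class_of, probs); return the most likely virus."""
    return max(models, key=lambda v: smm_loglik(segment, *models[v]))

# Hypothetical toy models over Sigma = {A, C}, one group per model for brevity.
toy = {"virusX": (1, {"A": 0, "C": 0}, {(0, "A"): 0.7, (0, "C"): 0.3}),
       "virusY": (1, {"A": 0, "C": 0}, {(0, "A"): 0.3, (0, "C"): 0.7})}
print(classify("AACAAAC", toy))  # virusX
```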
The lengths of the reference genome sequences for SARS, MERS,
Dengue, and Hepatitis B were 29,903, 30,119, 10,735, and 3542, respectively.
For SARS and MERS, an SMM of order m = 4 was fit, while for the other two viruses, a model was fit with m = 3. Longer reference sequences were associated with a higher-order SMM.
There is a biological significance in using order m ≥ 3. Three consecutive DNA bases form a codon, which translates genetic code into a sequence of amino acids. So, it is reasonable to assume that SMMs of order 3 or more will be able to explain the structure of a virus.
From the samples, Majumder et al. (2022) randomly chose segments of length 100ε%, and computed likelihoods under the four models to classify them to the most likely virus class. Three different values of ε were used: 0.05, 0.1, and 0.25. The 4 × 4 confusion matrices for the three scenarios are presented in the latter paper. For the first two values of ε, it is fair to say that accurate classification was not obtained, due to the very low percentage of the data sequences that was used. The misclassification rate when only 25% of the original samples were used was only 0.032, a very encouraging result. Thus it appears that the procedure has merit, and could be very useful for biological analysis.

3 Equivalence relations and the computation of distributions of pattern statistics for SMMs

We now show how equivalence relations can improve the efficiency of an inference task for data following an SMM. The task is obtaining the distribution of a pattern statistic using an auxiliary Markov chain (AMC).
Pattern counts can point to important discoveries, such as sites on DNA sequences that are recognized by various agents, similar segments in DNA
sequences, and changes in a production process. However, one must differenti-
ate between patterns that occur by chance, and ones that are statistically
significant. To determine statistical significance requires computing the p-value
associated with the statistic, under a null model that describes what is expected.
Monte Carlo simulation can provide approximations to needed probabil-
ities; however, large numbers of replications are typically required for accu-
rate results, especially so when the pattern of interest is rare. Methods to
obtain exact results in an efficient manner are then important.
A simple approach that is useful in this regard is to set up a correspondence between relevant events in X and a related AMC Y ≡ (Y1, …, Yn). This allows using well-known properties of Markov chains to compute the desired probabilities (see Aston and Martin, 2007; Brookner, 1966; Fu and Koutras, 1994; Koutras and Alexandrou, 1995).
The number of states of Y greatly affects computational efficiency, which
is a function of the size of the transition probability matrix for Y. Equivalence
relations can be very helpful by reducing the number of states.
After giving notation, we give a brief review of the problem of computing
distributions of pattern statistics in higher-order Markovian sequences using
an AMC, and then consider the extension to the SMM case. We then use an
application to spaced seeds to highlight the usefulness of equivalence relations.

3.1 Notation
A pattern is a finite string of symbols from Σ. For a pattern u = (u1, …, u|u|) of length |u|, (u1, …, uh), 1 ≤ h ≤ |u|, is a prefix of u (a proper prefix if h < |u|), and (uj, …, u|u|), 1 ≤ j ≤ |u|, is a suffix of u (a proper suffix if j > 1). Let W be a collection of patterns that occur in X and Z a pattern statistic related to W, taking values z in a finite set Φ. For convenience it will be assumed that all patterns of W have length greater than m. Also, let a · b denote the concatenation of strings a and b. For example, if a = 1101 and b = 101, then a · b = 1101101.

3.2 Computing distributions in higher-order Markovian sequences


We have, for z ∈ Φ,

Pr[Z = z] = Σ_{x: Z(x)=z} Pr[X = x],

i.e., Pr[Z = z] is the sum of the probabilities of the realizations x that are mapped by the statistic Z into z. However, examining each of the |Σ|^n possible sequences x to carry out the computation is typically not feasible unless it is done in an efficient manner. This is where an AMC comes in.
If Pr[Z = z] = Pr[Y_n ∈ Υ_z], where {Υ_z, z ∈ Φ} is a partition of the state space Υ of Y, then one can compute the desired distribution by sequentially multiplying the initial probability vector α_m, holding the probabilities of Y at time m, by the transition probability matrix Ω of Y.
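In the simplest form, where the value of Z is encoded directly in the states, this is just a chain of vector-matrix products, as in the following sketch (the three-state AMC is hypothetical).

```python
import numpy as np

def pattern_distribution(alpha_m, omega, n, m, state_sets):
    """Pr[Z = z] = Pr[Y_n in Upsilon_z]: push the time-m distribution
    forward n - m times with the AMC transition matrix Omega."""
    v = np.asarray(alpha_m, dtype=float)
    for _ in range(n - m):
        v = v @ omega
    return {z: v[list(states)].sum() for z, states in state_sets.items()}

# Hypothetical three-state AMC: states 0 and 1 carry Z = 0, state 2 carries Z = 1.
omega = np.array([[0.6, 0.4, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 0.0, 1.0]])
print(pattern_distribution([0.7, 0.3, 0.0], omega, n=10, m=1,
                           state_sets={0: [0, 1], 1: [2]}))
```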
The states Υ must include sufficient information so that Y is a Markov
chain, and so that the transition probabilities of Ω may be computed. All states
must then include the last m observations of X, and information on progress
toward observing the patterns of interest. The current value of the statistic
Z must be kept in some form, though not necessarily in states of Υ. These
requirements then determine the relationship between X and Y.
For efficient computation, Aston and Martin (2007) put forward a scheme where values of the statistic Z are not part of the states of the AMC Y. In that manner, Υ contains only pattern progress and m-tuple strings, since those strings transition in the same manner regardless of the value of Z. They then used probability matrices Ψt, t = m, …, n, instead of probability vectors αt, with the rows of the Ψt corresponding to values of Z, and the columns to pattern progress/m-tuple strings.
Still, for complex patterns, the size of the state space of Y can be large. Sev-
eral researchers (see, e.g., Lladser, 2007; Lladser et al., 2008; Marshall and
Rahmann, 2008; Ribeca and Raineri, 2008) applied computer science theory
related to minimizing the states of a deterministic finite automaton (DFA) to
reduce the number of states needed for AMC Y. The states needed in Υ are then
the equivalence classes determined in the DFA minimization process. In the context of computing the sampling distribution of a pattern statistic, equivalent states q and q′ are such that concatenating an arbitrary string r to form q · r and q′ · r gives the same update both in terms of the probability and the value of the pattern statistic. The equivalence of states q and q′ will be denoted by q ∼ q′. As the order of the square transition probability matrix Ω of Υ is the number of states, |Υ|, using the equivalence classes for the computation instead of the original state space can represent a huge savings in terms of computation time.
A drawback is that for the very large Υ that can arise with complex pat-
terns, for example, structured motifs in DNA (see Martin, 2019) or the spaced
seed problem that will be considered below, setting up an initial state space
before applying a minimization algorithm may be infeasible. Nuel (2008)
got around this problem by setting up a nondeterministic finite automaton that
greatly reduces the size of the initial state space before setting up a minimal
DFA. Martin (2019) developed a characterization of equivalent states so that
extraneous ones may be determined and deleted during the process of forming
the AMC state space. In this manner, no extraneous states enter the state space
at any stage. The methodology of that paper for developing minimal state
spaces is outlined next.
Let Q = Q_{>m} ∪ Σ^m, where Q_{>m} consists of proper prefixes of patterns of W that are of length at least m, and Σ^m is the set of all m-tuples. The longest proper suffix of a string q ∈ Q that is itself in Q is called the failure state of q (Aho and Corasick, 1975), and is denoted by fl(q). By definition, strings q ∈ Σ^m do not have failure states. As in Martin (2019), define the failure sequence associated with q to consist of q and its sequence of failure states of decreasing lengths:

fs_q = (fs_{q,1}, …, fs_{q,|fs_q|}) ≡ (q, fl(q), fl(fl(q)), fl(fl(fl(q))), …, (q)_m).

Here (q)_m denotes the m-tuple at the end of q. Note that (q)_m may or may not be a pattern prefix. Also define

C̃_q = {v : ∃ u ∈ fs_q such that u · v = w ∈ W, |v| < |w|}
∪ {v : ∃ l such that 1 ≤ l < m and (q)_l · v = w ∈ W}.

C̃_q contains completion strings for prefix strings of fs_q and for suffixes of q of length less than m that are pattern prefixes. The occurrence of strings of C̃_q may or may not require an update to the pattern statistic Z. For example, let Z be the indicator of whether or not W occurs in X. If W has already occurred, Z remains 1 with any additional occurrences, and thus the statistic would not be updated. The set of completion strings whose occurrence does require an update to Z is denoted by C_q. Also, define the direct occurrence of a q ∈ Q_{>m} to be the occurrence of q · v, where v is the completion string of q.
The transition function for string q ∈ Q on symbol x ∈ Σ is defined by δ(q, x) ≡ the longest suffix of q · x that is in Q. Thus either δ(q, x) = q · x, or δ(q, x) = fl(q · x).
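These definitions translate directly into code. A minimal sketch follows, with a hypothetical pattern collection W = {11111} and m = 3, so that Q consists of the proper pattern prefixes of length at least m together with all 3-tuples.

```python
def delta(q, x, Q):
    """delta(q, x): the longest suffix of q + x that lies in Q."""
    s = q + x
    for i in range(len(s)):
        if s[i:] in Q:
            return s[i:]
    raise ValueError("Q contains all m-tuples, so some suffix must match")

def fl(q, Q, m):
    """fl(q): the longest proper suffix of q that is itself in Q
    (defined only for strings longer than m)."""
    for i in range(1, len(q) - m + 1):
        if q[i:] in Q:
            return q[i:]
    return q[-m:]   # (q)_m is always in Q

# Hypothetical: W = {11111} and m = 3, so Q holds the proper pattern
# prefixes of length >= m plus all 3-tuples.
Q = {"1111"} | {a + b + c for a in "01" for b in "01" for c in "01"}
print(delta("1111", "1", Q), fl("1111", Q, 3))   # 1111 111
```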
Martin (2019) proved two main theorems and a corollary to the second one. In Theorem 1, the author showed that the symbiosis of the failure state of q · x and failure transitions (when δ(q, x) ≠ q · x) makes them both easy to compute, as both are obtained by first going to fl(q), and then computing the transition on symbol x from there. Thus failure states may be obtained sequentially (over string lengths) using the transition function δ.
Theorem 2, along with its corollary, gives necessary and sufficient conditions for strings to be equivalent, so that equivalent strings may be determined and combined during the setup of the state space Υ. In the proof of the theorem, it was shown that updates to the statistic's value are the same on concatenating an arbitrary string to q and q′ if and only if the updates to the statistic's value are the same when any string of C_q ∪ C_{q′} occurs. (The required condition (q)_m = (q′)_m is clear, so that probabilities starting in both strings are the same.) The proof of the corollary simply steps through showing necessity and sufficiency. The reader is pointed to the original paper for the proofs.

Theorem 1. If q ∈ Σ^m, fl(q · x) = (q · x)_m. For q ∈ Q∖Σ^m, fl(q · x) = δ(fl(q), x).
Theorem 2. For an mth-order Markovian sequence X, q ∼ q′ if and only if (q)_m = (q′)_m and the updates to the statistic's value are exactly the same when any string of C_q ∪ C_{q′} occurs.

Corollary 1. For statistics and pattern counting techniques such that the direct occurrence of q does not preclude updating the statistic on the occurrence of the strings of C_{fl(q)}, if |fs_q| > 1 and |fs_{q′}| > 1, then q ∼ q′ if and only if fl(q) ∼ fl(q′) and the updates to the statistic are exactly the same on the direct occurrences of q and q′.

3.3 Specializing the computation to SMM


In Martin (2020) the setting is specialized to the case of SMM, allowing one
to use not only the advantages of SMM but also the minimization techniques
of Martin (2019) to keep the AMC state space as small as possible.
While one could define Υ to be the same AMC state space as for an
mth-order Markovian sequence, that approach would ignore the possibility
that equivalent m-tuples could be combined, resulting in a smaller state space.
Thus as an initial step to set up Υ, conditions for equivalent m-tuples are
determined, followed by conditions for longer strings. Those conditions are
as follows:
Theorem 3 (Martin, 2020). Strings q and q′ are equivalent if and only if they have the same updates to Z on the occurrence of all strings of C_q ∪ C_{q′}, and in addition

(i) If |q| = |q′| = m, they lie in the same class γj of S and, when concatenating an arbitrary string r satisfying |r| ∈ {1, …, m − 1 − l_{γj}}, (q · r)_m and (q′ · r)_m lie in the same probability equivalence class γl for some l.
(ii) For |q| and |q′| > m, (q)_m ∼ (q′)_m.

Here, l_{γj} ≥ 0 is the length of the longest common suffix of the m-tuples in probability equivalence class γj.

Corollary 2. For statistics and pattern counting techniques such that the direct occurrence of q (q′) does not preclude updating the statistic on the occurrence of the strings of C_{fl(q)} (C_{fl(q′)}), if |q| > m and |q′| > m, then q ∼ q′ if and only if fl(q) ∼ fl(q′) and the updates to the statistic are exactly the same on the direct occurrences of q and q′.

Theorem 3 and its corollary lead to the following algorithm for setting up
Υ and computing the distribution of pattern statistic Z in the case of an SMM.
Algorithm 3 (Computing the sampling distribution of statistic Z). Given an SMM {S, p} with maximal context depth m, the distribution of Z may be obtained as follows:

(i) Refine the probability equivalence classes γj, j = 1, …, η, based on updates to Z on the occurrence of completion strings. To do this, determine completion strings of suffixes of m-tuples that are pattern prefixes. Check that m-tuples are completed on the same strings with the same update to Z, and separate ones that are not. A separated m-tuple will either be placed into a new class that was previously formed from γj (if it has the same completion strings and update to Z as strings in that class), or into an additional new class for which it is the only member.
(ii) Further refine the probability equivalence classes by concatenating arbitrary symbols of Σ to determine if the class γl of the resulting m-tuple suffix is the same. To do this, for each class γj, determine the destination of its elements on the concatenation of a single symbol from Σ. Separate elements where the destination classes are different, and repeat until no new classes are obtained.
(iii) Revise the matrix T for X, or equivalently for m-tuples, based on the final grouping of m-tuples (call the resulting matrix Ť). Also, combine initial probabilities for grouped m-tuples into a single element of the new initial vector τ̌. In the stationary case, solve for τ̌ using τ̌Ť = τ̌, with entries that sum to one.
(iv) Set up the |Φ| × |Υ| initial probability matrix Ψm for time m. Its nonzero initial probabilities are given by Ψ_{m,(i_j, j)} = τ̌(j), where i_j is the row corresponding to the value of z for state j of Υ. Due to the assumption here that patterns of W have lengths greater than m, i_j = 1 (corresponding to Z = 0) for all j, but in general that does not have to be the case.
(v) Determine the destination state for strings of Υ on the concatenation of symbols x ∈ Σ, beginning with the m-tuples that are pattern prefixes. If q · x ∈ Q_{>m}, determine if there is an equivalent state already in the state space. If there is, map the transition to that state, but if not, form a new state. If q · x ∉ Q_{>m}, the transition is obtained using δ(q, x) = fl(q · x). After the pattern prefixes are formed, states are added (as needed) using the transition function δ and other transition-related rules for the given pattern reckoning problem. Repeat the process of determining new states and their transitions for all new states formed at the last stage, terminating when there are no new states.
(vi) At each stage, obtain failure states using fl(q · x) = δ(fl(q), x).
(vii) At each stage, the transition probabilities are given by Pr[q → q′] = Pr[(q)_m → (q′)_m], where the transition probabilities for m-tuples are taken from matrix Ť. These probabilities are recorded in Ω.
(viii) After obtaining the states of Υ and their transition probabilities, for t = m, …, n − 1, right-multiply Ψt by Ω and move transition probabilities for entering a "counting state" (a state for which the value of Z should be updated) to the appropriate row. This gives Ψt+1.
(ix) Sum the probabilities from Ψn corresponding to Υz to obtain Pr[Z = z].
We now give an example of setting up the computation for an SMM that will be considered later. The reader is pointed to Martin (2020) for more details and examples.
Example 3. Let S = {γ1, …, γ4} = {{000, 100, 010, 110}, {011, 111}, {001}, {101}} for binary X, with Pr[Xt = 1 | γj], j = 1, 2, 3, 4, respectively, equal to (0.6, 0.8, 0.65, 0.7), m = 3, and with Z being the number of overlapping occurrences of W = {11111}. The class {011, 111} would be split, since string 111 is completed by 11, but string 011 is not. This gives the new class γ5 = {111}.
With this split, only probability equivalence class γ1 can possibly be split further, and any such splitting would not be due to completion strings, since all of the members of γ1 have no progress toward a pattern. To determine the result after concatenating a symbol to the m-tuples of γ1, note that for the m-tuples in this class, two of the suffixes of length two are 00, and the other two are 10. Thus two transition to 000 and 001 (which are in classes 1 and 3), while the other two transition to 100 and 101 (which are in classes 1 and 4). Thus we split these strings, leaving γ1 = {000, 100} and forming the new class γ6 = {010, 110}. The final classes are {γ1, …, γ6} = {{000, 100}, {011}, {001}, {101}, {111}, {010, 110}}, and the transition probability matrix over γ1, …, γ6, using this order of classes for the rows and columns, is then

      ⎛ 0.4  0     0.6  0    0    0    ⎞
      ⎜ 0    0     0    0    0.8  0.2  ⎟
Ť =   ⎜ 0    0.65  0    0    0    0.35 ⎟
      ⎜ 0    0.7   0    0    0    0.3  ⎟
      ⎜ 0    0     0    0    0.8  0.2  ⎟
      ⎝ 0.4  0     0    0.6  0    0    ⎠
In the case of stationary trials, solving τ̌Ť = τ̌ for the probability vector τ̌ gives the initial probability vector

τ̌ = (1/455)(50, 51, 30, 45, 204, 75).
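The stationary vector is easy to check numerically: solving τ̌Ť = τ̌ is a left-eigenvector problem for eigenvalue 1. The sketch below recovers the fractions above.

```python
import numpy as np

T = np.array([[0.4, 0,    0.6, 0,   0,   0   ],
              [0,   0,    0,   0,   0.8, 0.2 ],
              [0,   0.65, 0,   0,   0,   0.35],
              [0,   0.7,  0,   0,   0,   0.3 ],
              [0,   0,    0,   0,   0.8, 0.2 ],
              [0.4, 0,    0,   0.6, 0,   0   ]])

# Left eigenvector of T for eigenvalue 1, normalized to sum to one.
vals, vecs = np.linalg.eig(T.T)
tau = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
tau /= tau.sum()
print(np.round(tau * 455))   # [ 50.  51.  30.  45. 204.  75.]
```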

3.4 Application to spaced seed coverage


Spaced seeds (Ma et al., 2002) are used as an initial filtering step to help with the trade-off between attempting too many costly DNA sequence alignments and missing similar DNA segments by not aligning them. A spaced seed is a pattern S = s1 … sk over the alphabet {1, *} with s1 = sk = 1. A "1" indicates a match position and a "*" a wildcard position that does not have to match. Let r denote the number of wildcard positions in the seed.
Let X be the binary sequence formed by aligning two DNA segments of length n and assigning the value Xj = 1 if the jth positions of the segments match, and 0 otherwise. The spaced seed S hits, or occurs, in X at position ν if Xν−k+j = 1 for every j = 1, …, k for which sj = 1. In an occurrence, a 1 at a position Xν−k+j corresponding to sj = 1 is said to be covered.
If seed coverage is used as the test statistic to trigger a full alignment, its
distribution is needed to differentiate between random and meaningful similar
segments. Overlapping and nonoverlapping seed hits must be distinguished in
this case when setting up Υ because the update to coverage on a seed hit
depends on which positions were previously covered (a covered position is
only counted once). The many combinations of prefixes of patterns represent-
ing overlapping spaced seed hits and positions that may or may not be
covered can render the number of states of an AMC to be extremely large
and the computation of the distribution of coverage difficult. Thus, there
have been relatively few studies where seed coverage is used as a criterion
(see, however, Benson and Mak, 2008; Martin, 2019; Martin and Noe,
2017; Noe, 2017; Noe and Martin, 2014).
Consider now a spaced seed S = s1 … sk with k > m, and let W be the set of 2^r patterns obtained by replacing the r stars of the seed by either 0 or 1. The binary sequence X that gives indicators of matching positions in the two DNA segments is assumed to be stationary (as sequence segments for an alignment are relatively short) and to follow a given SMM of maximal depth m.
One option for forming an AMC state space Υ is to use prefix strings of the set W_ext that contains the extension of patterns of W to all possible overlapping pattern occurrences (Martin and Coleman, 2011). W_ext is defined by

W_ext = {u : ∃ w1, w2 ∈ W such that u = ν_{w1} · α_{w1w2} · β_{w2}, where ν_{w1} · α_{w1w2} = w1, α_{w1w2} · β_{w2} = w2, and ν_{w1} and β_{w2} are nonempty}.

Here α_{w1w2} is a suffix of w1 and a prefix of w2, the overlapping portion of the two patterns. As all spaced seeds begin and end with 1, α_{w1w2} is guaranteed to be nonempty. While a reasonable option, a representation based on the extended strings of W_ext is less than optimal, as it uses excessive storage locations and increases the difficulty of locating equivalent strings.
A more storage-friendly representation is established in the following manner. Let strings Q represent progress toward patterns, while also having marks for all covered positions. An example of the representation based on W_ext and based on strings Q with marked coverage is given in the left- and right-hand plots of Fig. 2, respectively, for the small spaced seed 11*1, assuming that m = 1 for simplicity. In those plots, the seven states in red are not needed in the computation. The representation based on Q makes it immediately clear that six of these seven states are not needed in Υ, as they have exactly the same Q string and covered positions as strings that would already have been placed in the state space. Strings 1̣10 and 11̣0 are equivalent, as they have exactly the same transitions and updates on completion string 1.

FIG. 2 AMC state space for spaced seed 11*1 using extended strings (left) and with covered positions marked with a 1̣ (right). States in white have no coverage, states with coverage are in green, while those with coverage that are not needed for the computation are in red. Updates to coverage are marked.
To illustrate setting up a state space in the SMM case, we consider the relatively short spaced seed 1*11*1. Then, feasibility of the computation for a seed of a length used in practice is shown for the "Patternhunter seed" 111*1**1*1**11*111 (Ma et al., 2002). In both cases the model is fixed to have probability equivalence classes S = {γ1, …, γ4} = {{000, 100, 010, 110}, {011, 111}, {001}, {101}} (m = 3), with Pr[Xt = 1 | γj], j = 1, 2, 3, 4, respectively, equal to (0.6, 0.8, 0.65, 0.7), as in Example 3. This VLMC model is the input to the algorithm. The length of X for a possible alignment is fixed at n = 64, the length considered in Ma et al. (2002).
Example 4. (Seed 1*11*1). Spaced seed S = 1*11*1 has W = {101101, 101111, 111101, 111111}. For this collection of patterns, m-tuples 011 and 111 have different completion strings, and thus γ2 = {011, 111} needs to be split. Also, whereas the elements of {000, 100} have no pattern progress, those of {010, 110} have pattern prefix 10 as their suffix, and thus γ1 needs to be split into these two classes. Thus Ť and τ̌ are as in Example 3.

FIG. 3 States of the AMC for seed 1*11*1. The counting states whose entrance signals that the statistic is updated are colored in green, and other states with nonzero coverage in blue. Coverage updates on transitions into counting states are indicated, and covered positions are marked with a 1̣. For clarity, not all transitions are shown. States in white have no coverage, and are needed to compute the sensitivity of the seed (along with an absorbing state). White nodes with a blue "glow" are prefix strings, while the others are not.

The state space Υ is depicted in Fig. 3. That representation has 27 states, as opposed to the 84 states (not shown) that are obtained using prefixes of W_ext. In Fig. 3, only the 11 states in white are needed to compute the sensitivity of the seed, the probability of at least one seed occurrence in X (an
absorbing state to indicate the seed’s occurrence would also be needed if seed
sensitivity was used to trigger an alignment instead of seed coverage). The
strings 10110 and 11110 are equivalent (from Corollary 2) for computing sensitivity or coverage, as they both have the same failure state, and the update to coverage is +4 when both strings are completed. Thus, on symbol 0, 1111 transitions to 10110, and 11110 is deleted. State 1̣1̣11 transitions to 1̣011̣0 on symbol 0 (the second position of the string is no longer marked as being covered, since the position cannot possibly be involved in a seed hit that occurs after the direct one). Similarly, state 1̣1̣11̣1 transitions to 1̣011̣0 on symbol 0 (instead of 1̣01̣10, since 1̣011̣0 and 1̣01̣10 are equivalent. The equivalence of these states may not be obvious, since for pattern prefix 10, the 1 is covered in one case and not in the other. However, the completion string of 10 is of the form 11*1. The occurrence of the first 1 of this string implies the direct hit of the longer string 10110, with a coverage update of +2 for both states, rendering all 1's as covered, so that the subsequent updates to the statistic must be the same.)
A MATLAB program was written to implement the algorithm to compute
coverage of spaced seeds and run on a Dell PC with an Intel Core i7 CPU 873
with 2.93 GHz and 8 GB RAM. The total computation time for spaced seed
S ¼ 1 * 11 * 1 was about 0.5 s, 0.1 s to set up the state space and 0.4 s to carry
out the computation.
Example 5. (Seed 111*1**1*1**11*111). For the spaced seed 111*1**1*1**11*111 used in the original version of the Patternhunter software (Ma et al., 2002), |Υ| = 3815, compared with the more than 320,000 strings that are obtained using prefixes of W_ext. For the coverage distribution, the total computation time was about 40 s: 33.8 s to set up the state space and 6.5 s to carry out the computation. It took more than a day to set up the state space using prefixes of W_ext. This reveals the advantages of the representation based on prefix strings with marked locations, while especially pointing to the great reduction afforded by determining equivalence relations among states of the AMC and sequentially combining those that are equivalent.

4 Summary
In this chapter, equivalence relations of conditional transition probability
distributions are shown to lead to improved modeling capabilities for a
higher-order Markov model. The associated mapping of probability distribu-
tions and their corresponding m-tuple histories based on the equivalence rela-
tions of equal conditional probability distributions leads to the formation of an SMM that has a reduced number of parameters and more flexibility in the model fitting process. Admittedly, there is a price to pay, as allowing the flexibility of SMMs with no restrictions on the m-tuples that may be grouped renders the model fitting exercise nontrivial.
Two methods to fit the model are given, and the usefulness of the model is
illustrated through three modeling applications.
We then show how equivalence relations can aid in computing distribu-
tions of pattern statistics in SMM using an auxiliary Markov chain (AMC).
Methods for determining equivalent states allow one to find much smaller
AMC state spaces in some cases, leading to much more efficient computation
procedures. This is made clear in an application to computing the distribution
of spaced seed coverage for the Patternhunter seed, a seed that has been used
in sequence alignment software.

Acknowledgments
This material is based upon work supported by the National Science Foundation under
Grant No. 1811933.

References
Aho, A.V., Corasick, M.J., 1975. Efficient string matching: an aid to bibliographic search.
Commun. ACM 18, 333–340.
Aston, J.A.D., Martin, D.E.K., 2007. Waiting time distributions of general runs and patterns in
hidden Markov models. Ann. Appl. Stat. 1 (2), 585–611.
Avery, P., 1987. The analysis of intron data and their use in the detection of short signals. J. Mol. Evol. 26, 335–334.
Begleiter, R., El-Yaniv, R., Yona, G., 2004. On prediction using variable length Markov models.
J. Artif. Intell. 22, 385–421.
Belloni, A., Oliveira, R., 2017. Approximate group context tree. Ann. Stat. 45 (1), 355–385.
Ben-gal, I., Morag, G., Shmilovici, A., 2003. Context-based statistical process control. Techno-
metrics 45 (4), 293–311.
Bennett, I., Martin, D.E.K., Lahiri, S.N., 2022. Fitting sparse Markov models through a collapsed
Gibbs sampler. (Submitted for publication).
Benson, G., Mak, D.Y.F., 2008. Exact distribution of a spaced seed statistic for DNA homology
detection. In: International Symposium on String Processing and Information Retrieval,
Springer, Berlin, Heidelberg.
Bercovici, S., Rodriguez, J.M., Elmore, M., Batzoglou, S., 2012. Ancestry inference in complex
admixtures via variable-length Markov chain linkage models. In: Lecture Notes in Computer
Science. Research in Computational Molecular Biology. RECOMB, vol. 7262. Springer, Ber-
lin, Heidelberg, pp. 12–28.
Borges, J., Levene, M., 2007. Evaluating variable length Markov chain models for analysis of user
web navigation. IEEE Trans. Knowl. 19 (4), 441–452.
Bratko, A., Cormack, G., Filipič, B., Lynam, T., Zupan, B., 2006. Spam filtering using statistical
data compression models. J. Mach. Learn. Res. 7, 2673–2698.
Brookner, E., 1966. Recurrent events in a Markov chain. Inf. Control. 9, 215–229.
Browning, S.R., 2006. Multilocus association mapping using variable-length Markov chains. Am.
J. Hum. Genet. 78, 903–913.
Bühlmann, P., Wyner, A.J., 1999. Variable length Markov chains. Ann. Stat. 27 (2), 480–513.
Chi, E.C., Lange, K., 2015. Splitting methods for convex clustering. J. Comput. Graph. Stat.
24 (4), 994–1013.
Ferguson, T.S., 1973. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1 (2),
209–230. https://doi.org/10.1214/aos/1176342360.
Fernández, M., García, J.E., González-López, V.A., 2018. A copula-based partition Markov pro-
cedure. Commun. Stat. Theory Methods 47 (14), 3408–3417.
Fu, J.C., Koutras, M.V., 1994. Distribution theory of runs: a Markov chain approach. J. Am. Stat.
Assoc. 89, 1050–1058.
Gabadinho, A., Ritschard, G., 2016. Analyzing state sequences with probabilistic suffix trees.
J. Stat. Softw. 72 (3), 1–39.
Gallo, S., Leonardi, F., 2015. Nonparametric statistical inference for the context tree of a station-
ary ergodic process. Electron. J. Stat. 9, 2076–2098.
Galves, A., Galves, C., García, J.E., Garcia, N.L., Leonardi, F., 2012. Context tree selection and
linguistic rhythm retrieval from written texts. Ann. Appl. Stat. 6, 186–209.
García, J.E., González-López, V.A., 2010. Minimal Markov models. arXiv:1002.0729.
García, J.E., González-López, V.A., 2017. Consistent estimation of partition Markov models.
Entropy 19, 1050–1058.
Haslett, J., Raftery, A.E., 1989. Space-time modelling with long-memory dependence: assessing
Ireland’s wind power resource. J. R. Stat. Soc. C (Appl. Stat.) 38 (1), 1–50.
Jääskinen, V., Xiong, J., Koski, T., Corander, J., 2014. Sparse Markov chains for sequence data.
Scand. J. Stat. 41, 641–655.
Kharin, Y., 2020. Statistical analysis of big data based on parsimonious models of high-order
Markov chains. Austrian J. Stat. 49, 76–88.
Kharin, Y.S., Petlitskii, A.I., 2007. A Markov chain of order s with r partial connections and sta-
tistical inference on its parameters. Discret. Math. Appl. 19 (2), 109–130.
Koutras, M.V., Alexandrou, V.A., 1995. Runs, scans and urn models: a unified Markov chain
approach. Ann. Inst. Stat. Math. 47, 743–766.
Lladser, M.E., 2007. Minimal Markov chain embeddings of pattern problems. In: Proceedings of
the 2007 Information Theory and Applications Workshop, University of California, San
Diego.
Lladser, M., Betterton, M.D., Knight, R., 2008. Multiple pattern matching: a Markov chain
approach. J. Math. Biol. 56 (1-2), 51–92.
Ma, B., Tromp, J., Li, M., 2002. Patternhunter: faster and more sensitive homology search. Bio-
informatics 18 (3), 440–445.
Majumder, T., Lahiri, S.N., Martin, D.E.K., 2022. Fitting sparse Markov models to categorical
time series using regularization. http://arxiv.org/abs/2202.05485. (Submitted for publication).
Marshall, T., Rahmann, S., 2008. Probabilistic arithmetic automata and their application to pattern
matching statistics. In: Ferragina, P., Landau, G.M. (Eds.), Proceedings of the 19th Annual
Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science,
vol. 5029. Springer, Heidelberg, pp. 95–106.
Martin, D.E.K., 2019. Minimal auxiliary Markov chains through sequential elimination of states.
Commun. Stat. Simul. Comput. 48 (4), 1040–1054.
Martin, D.E.K., 2020. Distributions of pattern statistics in sparse Markov models. Ann. Inst. Stat.
Math. 72, 895–913. https://doi.org/10.1007/s10463-019-00714-6.
Martin, D.E.K., Coleman, D.A., 2011. Distributions of clump statistics for a collection of words.
J. Appl. Probab. 48, 1049–1059.
Martin, D.E.K., Noe, L., 2017. Faster exact probabilities for statistics of overlapping pattern
occurrences. Ann. Inst. Stat. Math. 69 (1), 231–248.
Noe, L., 2017. Best hits of 11110110111: model-free selection and parameter-free sensitivity cal-
culation of spaced seeds. Algorithms Mol. Biol. 12 (1). https://doi.org/10.1186/s13015-017-
0092-1.
Noe, L., Martin, D.E.K., 2014. A coverage criterion for spaced seeds and its applications to SVM
string-kernels and k-mer distances. J. Comput. Biol. 21 (12), 947–963.
Nuel, G., 2008. Pattern Markov chains: optimal Markov chain embedding through deterministic
finite automata. J. Appl. Probab. 45, 226–243.
Raftery, A., Tavaré, S., 1994. Estimation and modelling repeated patterns in high order Markov
chains with the mixture transition distribution model. J. R. Stat. Soc. C (Appl. Stat.) 43 (1),
179–199.
Ribeca, P., Raineri, E., 2008. Faster exact Markovian probability functions for motif occurrences:
a DFA-only approach. Bioinformatics 24 (24), 2839–2848.
Rissanen, J., 1983. A universal data compression system. IEEE Trans. Inf. Theory 29, 656–664.
Rissanen, J., 1986. Complexity of strings in the class of Markov sources. IEEE Trans. Inf. Theory
32 (4), 526–532.
Ron, D., Singer, Y., Tishby, N., 1996. The power of amnesia: learning probabilistic automata with
variable memory length. Mach. Learn. 25 (2-3), 117–149.
Roos, T., Yu, B., 2009. Sparse Markov source estimation via transformed Lasso. In: Proceedings
of the IEEE Information Theory Workshop (ITW-2009), Taormina, Sicily, Italy, pp. 241–245.
Shmilovici, A., Ben-gal, I., 2007. Using a VOM model for reconstructing potential coding regions
in EST sequences. Comput. Stat. 22, 49–69.
Weinberger, M., Lempel, A., Ziv, J., 1992. A sequential algorithm for the universal coding of
finite memory sources. IEEE Trans. Inf. Theory IT-38, 1002–1024.
Weinberger, M., Rissanen, J., Feder, M., 1995. A universal finite memory source. IEEE Trans.
Inf. Theory 41 (3), 643–652.
Willems, F.M.J., Shtarkov, Y.M., Tjalkens, T.J., 1995. The context-tree weighting method: basic
properties. IEEE Trans. Inf. Theory 41 (3), 653–664.
Xiong, J., Jääskinen, V., Corander, J., 2016. Recursive learning for sparse Markov models. Bayes-
ian Anal. 11 (1), 247–263.
Yin, J., Wang, J., 2016. A model-based approach for text clustering with outlier detection. In:
2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 625–636.
Zhang, J., Ghahramani, Z., Yang, Y., 2005. A probabilistic model for online document clustering
with application to novelty detection. In: Saul, L.K., Weiss, Y., Bottou, L. (Eds.), Advances
in Neural Information Processing Systems 17. MIT Press, pp. 1617–1624.
Section II

Information geometry
Chapter 4

Symplectic theory of heat and information geometry

Frédéric Barbaresco*
THALES Land & Air Systems, Meudon, France
*Corresponding author: e-mail: frederic.barbaresco@thalesgroup.com

Abstract
We present in this chapter a new formulation of heat theory and Information Geometry
through symplectic and Poisson structures based on Jean-Marie Souriau’s symplectic
model of statistical mechanics, called “Lie Groups Thermodynamics.” Souriau's model
was initially described in chapter IV “Statistical Mechanics” of his book “Structure of
dynamical systems” published in 1969. This model gives an archetypal, and purely geo-
metric, characterization of Entropy, which appears as an invariant Casimir function in
coadjoint representation, from which we will deduce a geometric heat equation as
the Euler-Poincaré equation. The approach also allows generalizing the Fisher metric of
information geometry thanks to the KKS (Kirillov, Kostant, Souriau) 2-form in the
affine case via Souriau's cocycle. In this model, Souriau's moment map and
the coadjoint orbits play a central role. Ontologically, this model provides joint geomet-
ric structures for Statistical Mechanics, Information Geometry and Probability. Entropy
acquires a geometric foundation as a function parameterized by mean of moment map
in the dual Lie algebra, and in terms of foliations. Souriau established the generalized Gibbs
laws when the manifold has a symplectic form and a connected Lie group G operates on
this manifold by symplectomorphisms. Souriau Entropy is invariant under the action of
the group acting on the homogeneous symplectic manifold. As quoted by Souriau, these
equations are universal and could be also of great interest in Mathematics. The dual
space of the Lie algebra foliates into coadjoint orbits that are also the level sets of the entropy, which can be interpreted in the framework of Thermodynamics by the fact
that motion remaining on these surfaces is non-dissipative, whereas motion transversal
to these surfaces is dissipative. We will also explain the second Principle in thermody-
namics by the positive definiteness of the Souriau tensor extending the Koszul-Fisher metric
from Information Geometry. Entropy as Casimir function is characterized by Koszul
Poisson Cohomology. We will finally introduce Gaussian distribution on the space of
Symmetric Positive Definite (SPD) matrices, through Souriau’s covariant Gibbs density
by considering this space as the pure imaginary axis of the homogeneous Siegel upper
half space where Sp(2n,R)/U(n) acts transitively. Gauss density of SPD matrices is

Handbook of Statistics, Vol. 46. https://doi.org/10.1016/bs.host.2022.02.003


Copyright © 2022 Elsevier B.V. All rights reserved. 107
108 SECTION II Information geometry

computed through Souriau’s moment map and coadjoint orbits. We will illustrate the
model first for Poincare unit disk, then Siegel unit disk and finally upper half space.
For this example, we deduce Gauss density for SPD matrices.
Keywords: Symplectic geometry, Casimir function, Entropy, Koszul-Fisher Metric,
Heat equation, Poisson Cohomology Symmetric Positive Definite Matrices, Lie groups
thermodynamics, Maximum entropy, Exponential density family

1 Preamble
“There is nothing more in physical theories than symmetry groups, except the
mathematical construction which allows precisely to show that there is nothing
more”
Jean-Marie Souriau
In 2019, at the University of Paris and at the Henri Poincare Institute were
celebrated the 50th anniversary of the founding work of Geometric Mechanics
(introduction of Symplectic Geometry in Mechanics) (http://souriau2019.fr/),
the book by Jean-Marie Souriau “Structure of dynamical systems” with in par-
ticular the guest talk by Jean-Pierre Bourguignon (former director of IHES
and director of the European ERC) “Jean-Marie Souriau and Symplectic
Geometry” (https://ideasinscience.org/fr/video/154).
Jean-Marie Souriau developed the symplectic aspects of classical mechan-
ics and quantum mechanics. Among his most important works, let us quote
the moment(um) map (geometrization of Noether’s theorem), the coadjoint
action of a group on its moment space, geometric quantization, the classifica-
tion of homogeneous symplectic leaves, or diffeological spaces.
In Jean-Marie Souriau’s book “Structure of dynamical systems,” chapter
IV concerning Statistical Mechanics has been little read and cited. There he
developed a symplectic model of statistical physics, which he called “Lie
Groups Thermodynamics.” This geometric theory of statistical mechanics is
based on the study of the Hamiltonian action of a Lie group on the symplectic
manifold induced by the coadjoint orbit of the group. Souriau deals there with
the affine case of the coadjoint operator involving the “Souriau symplectic
cocycle” reflecting the lack of cohomology. In a natural way, generalized
Gibbs states are defined, indexed by parameters of Lie algebra (geometric
generalization of Planck temperature), which are covariant under the action
of the group. These Gibbs states are a natural generalization of those con-
structed with the Hamiltonian of a vector field, since this Hamiltonian can
be considered as a moment of the local Hamiltonian action of the group of
time translations. If a dynamic system is invariant by a Lie subgroup of the
group (for example of the Galileo group in physics), the natural equilibria
of this system are the Gibbs states constructed with a moment of the Hamilto-
nian action of this sub-group. Souriau introduces, in the non-zero cohomology
case, a Riemannian metric, which generalizes the Fisher-Koszul metric of
Symplectic theory of heat and information geometry Chapter 4 109

Information geometry, which is thus characterized algebraically. We therefore


have access to new tools based on the theory of affine representations of Lie
algebras and coadjoint orbits to make probabilities and statistics on Lie
groups. These tools thus make it possible to define new families of deep
learning algorithms in Artificial Intelligence for “Lie group” type data or
for data on homogeneous spaces on which a group acts transitively. Souriau’s
model also makes it possible to give a purely geometric characterization of
Entropy of Information theory: Entropy is defined there as an invariant Casi-
mir function in coadjoint representation (the set of Casimir functions being
linked to de Rham’s Cohomology). Entropy is thus seen as a constant function
along affine coadjoint orbits. By generalizing Entropy, it becomes possible to
explicitly compute the probability densities of Maximum Entropy (generaliza-
tion of the Gaussian density notion). The model also makes it possible to give
a new geometric model of Heat with an Euler-Poincare and Lie-Poisson
equation of heat (geometrization of the heat equation via Poisson brackets).
This Souriau model was recently rediscovered by Charles-Michel Marle,
Frederic Barbaresco and Gery de Saxce.
Information Geometry is a theory that dates back to the mid-1950s with
the work of Rao, Fisher and Frechet which makes it possible to define a met-
ric and a so-called Fisher invariant distance in the space of parameters of
probability densities, especially for exponential families. The Hessian struc-
tures of this geometry were developed in the 1960s by the mathematician
Jean-Louis Koszul (member of Bourbaki, doctoral student of Henri Cartan)
who introduced fundamental tools, extending the work of Elie Cartan on
homogeneous symmetrical spaces, to characterize the geometry of sharp con-
vex cones using the tool called “Koszul-Vinberg characteristic function.” The
Fisher metric is a special case given by Koszul’s 2-form. Jean-Louis Koszul
was a member of the second generation of Bourbaki, to whom we owe the
Koszul connection, the Koszul complex, the Koszul cohomology, the Koszul
duality, etc. He made a great contribution to the fundamental theory of differ-
ential geometry, in particular the theory of linear connections and affinely flat
manifolds, providing the foundations of information geometry. The links
between Koszul’s work and Information geometry were developed by Hiro-
hiko Shima, Michel Nguiffo-Boyom and Frederic Barbaresco. These tools
were used in quantum physics by Roger Balian (creator of the theoretical
physics laboratory at CEA), by introducing a quantum Fisher metric.
As early as 1993, A. Fujiwara and Y. Nakamura developed the close links
between information geometry and the theory of integrable systems by study-
ing dynamical systems on varieties of Gaussian and multinomial distributions.
In 1993, within the framework of the application of Liouville’s theory of
completely integrable systems applied to statistical varieties, Y. Nakamura
developed systems of gradients on probability distributions. He proved the
following results: gradient systems on statistical manifolds are completely
integrable Hamiltonian systems, gradient systems can always be linearized
110 SECTION II Information geometry

using the dual coordinates given by information geometry, gradient flows on


statistical manifolds converge exponentially toward equilibrium points, and
the gradient system associated with the Gaussian distribution can be related
to an Orstein-Ulhenbeck process. Recently, Jean-Pierre Françoise has revis-
ited this model and extended it, in conjunction with the Lax equations. Con-
solidating the bridges between integrable systems and Information
geometry, he considered the integrable Hamiltonian system of Peakons-Anti
Peakons associated with the Camassa-Holm equation, and on the basis of pre-
vious contributions by Nakamura on Toda lattice, re-explored the close links
with Information Geometry.
Souriau’s “Lie Groups Thermodynamics” thus makes it possible to stimu-
late new developments at the interface of these disciplines, information geom-
etry and the theory of integrable systems, for the establishment of new tools
for probabilities and statistics on Lie groups, paving the way for machine
learning methods in Artificial Intelligence operating on these geometric struc-
tures with potential new inference schemes based on symplectic integrators.
In mirror image, the study of the extension of the geometry of information
to Lie groups, via the “Lie Groups Thermodynamics” allows to deepen the
structures of symplectic geometry, Poisson geometry and associated flows:
Euler-Poincare equation, Lie-Poisson flow, Lax equation, etc. with regard to
the peculiarities of this Hessian geometry of Information that Jean-Louis
Koszul characterized on sharp convex cones. Souriau and Koszul have devel-
oped an “affine” theory of the representation of algebras and Lie groups, with
in particular the study of the non-zero cohomology case (affine action of the
coadjoint operator on the dual of the Lie algebra via a symplectic cocycle).
A recent state of the art was elaborated through the organization of trans-
disciplinary conferences on the subject mixing the communities of Informa-
tion geometry, algebraic structures linked to the work of Jean-Louis Koszul
on homogeneous bounded symmetrical spaces, symplectic geometry and Pois-
son geometry in connection with geometrical mechanics… We can refer to
the following conferences:
l GSI conference cycle “Geometric Science of Information” (GSI
Conference Cycle, 2013–2021)
l Ecole de Physique des Houches SPIGL’20 (Ecole de Physique des
Houches, 2020)
l FGSI’19 Conference “Foundations of Geometric Structures of
Information” (FGSI’19 Conference, 2019)
l CIRM TGSI’17 Conference on “Topological and Geometrical Structures
of Information” (CIRM, 2017)
l MaxEnt’14 Conference and MaxEnt’22 Conference (MaxEnt’14, 2014)
Based on the founding work of Jean-Marie Souriau in “Lie Groups Thermo-
dynamics” (symplectic model of statistical physics) (Souriau, 1969, 1974,
1997) and the foundations of “Hessian structures of information geometry”
Symplectic theory of heat and information geometry Chapter 4 111

by Jean-Louis Koszul (Barbaresco, 2021d; Koszul and Zou, 2019), there are
bridges between the mathematical structures of the different disciplines:
l Symplectic Geometry and Information Geometry
l Poisson Geometry and Information Geometry
l Integrated System and Information Geometry
The novelty lies in the cross exploration of symplectic and Poisson structures
in Information Geometry, but also the enrichment of symplectic and Poisson
geometries by the particular structures of information geometry: affinely flat
manifold, Hessian metric, Fisher metric, dual potential functions and dual
coordinate systems, affine representation theory, Lie algebra cohomology,
affine Poisson structure, etc.
The tools developed will jointly enrich:
l Statistical physics: symplectic model of statistical physics
l Artificial Intelligence: statistical tools on Lie groups with new symplectic
integrators for deep machine learning on homogeneous manifolds
l Quantum Physics: generalization of the geometry of information to the
theory of quantum information
l Mathematics: enrichment of symplectic geometry, Poisson geometry and
the theory of integrable systems through the contribution of structures
associated with the geometry of information and affinely flat manifolds.

2 Life and seminal work of Souriau on lie groups


thermodynamics
Jean-Marie Souriau has introduced his model of “Lie Groups Thermodynam-
ics” in chapter IV of Souriau (1969, 1997). This chapter IV remained little
known for a long time, because the first readers of the work were more inter-
ested in the symplectic model introduced for mechanics, than its extension for
Statistical Physics.
Symplectic model of statistical mechanics was introduced in the three fol-
lowing Jean-Marie Souriau papers in Souriau (1974, 1975, 1984). These
papers by Jean-Marie Souriau are little known because they were only pub-
lished in French in CNRS publications. In these papers, Souriau consolidated
the bedrock of what he called “Lie Groups Thermodynamics.” This model is
of great importance in the context of the geometrization of information
sciences and geometric theory of heat. Souriau built a thermodynamics in
which the density of Gibbs is completely covariant. He introduced there a
new Riemannian metric invariant under the action of the group, from the
“moment map” (geometrization of Emmy Noether Theorem (Kosmann-
Schwarzbach, 2013; Noether, 1918)) and a cocycle which translates the lack
of equivariance of the coadjoint operator on the moment map and linked to
the cohomology of Lie algebra. We have discovered that this metric was in
112 SECTION II Information geometry

fact a generalization of the Koszul-Fisher metric in Information Geometry.


More recently, we have observed (Barbaresco, 2019a, 2020a,b, 2021a;
Barbaresco and Gay-Balmaz, 2020; Libermann and Marle, 1987) that the
invariance of the Entropy defined on the coadjoint orbits made it possible to
identify the Entropy with a generalized Casimir function in coadjoint repre-
sentation. This definition of Entropy only depends on symmetries, and is an
alternative to Shannon and von Neumann approaches who had only given
an axiomatic definition. To allow access to a larger number, we have made
an English translation of these three Souriau’s seminal papers, which retains
all of the author’s original notations. We have only translated parts concerning
the symplectic model of statistical Mechanics and his model of Lie Groups
Thermodynamics. This model should be able to be extended for other
entropies.
From Lagrange’s in Analytic Mechanics (Lagrange, 1808), Jean-Marie
Souriau, based on previous works of Gallisssot (1952) and Blanc-Lapierre
et al. (1959), conceived his Symplectic model of Statistical Physics
(Souriau, 1969, 1974, 1975, 1984, 1997) and other geometric models in Phys-
ics (Souriau, 1954, 1965, 1966, 1967, 1978, 1986, 1996). This model was
studied by his PhD student Iglesias (1979, 1995), and rediscovered by De
Saxce (2016, 2019), De Saxce and Marle (2020), De Saxce and Vallee
(2016), Libermann and Marle (1987), Marle (2016, 2018, 2019, 2021),
Barbaresco (2019a, 2020a,b, 2021a), and Barbaresco and Gay-Balmaz
(2020). Symplectic model used by Souriau, especially the non-null cohomol-
ogy case, was deepened by Koszul and Zou (2019). New fruitful results have
been established (Barbaresco, 2019a, 2020a,b, 2021a; Barbaresco and
Gay-Balmaz, 2020) making links with Maximum Entropy Theory studied
by Dacunha-Castelle and Gamboa (1990), Information Geometry applied in
Physics by Balian (1991, 2014, 2015), Balian et al. (1986) and Balian and
Valentin (2001) and the Theory of Integrable Systems synthetized by
Cartier (1994). Main objective of Souriau’s project in Physics was summar-
ized in his sentence “There is nothing more in physical theories than symme-
try groups except the mathematical construction which allows precisely to
show that there is nothing more” [Il n’y a rien de plus dans les th eories
physiques que les groupes de sym etrie si ce n’est la construction math ema-
tique qui permet pr ecis
ement de montrer qu’il n’y a rien de plus]. In
Souriau (2007), Jean-Marie Souriau summarized his contribution to thermo-
dynamics: “The second principle of thermodynamics is independent: it indi-
cates that the entropy S increases during a dissipation; here we mean
entropy in the sense of Clausius-Boltzmann, which is a function of the statis-
tical state ρ. If therefore a state possesses, for a given mean value of the
moment, greatest entropy, it will not be subject to dissipation. These states,
if they exist, thus represent the terminal state of dissipation. They are indexed
by a parameter β with values in the Lie algebra of the Lorentz-Poincar e
group; they generalize the Gibbs equilibrium states, β playing the role of
Symplectic theory of heat and information geometry Chapter 4 113

temperature. The invariance with respect to the group, and the fact that the
entropy S is a convex function of β, imposes very strict, universal condi-
tions—i.e., independent of the system considered. For a large class of systems,
for example, there exist necessarily a critical temperature beyond which no
equilibrium can exist. In the cases where an equilibrium exists, it generally
consists of a rigid rotation about the barycenter, etc. These purely theoretical
results are evidently confirmed by numerous astronomical examples: the
Earth and the starts rotating about themselves; dissipative evolution imposes
a solid rotation on the central regions of the galaxies, which itself can lead to
a gravitational instability of the “quasar” type; the Clapeyron relations
extend to the geometrical-dynamical quantities, etc. One can, if one wishes,
interpret β as a space-time vector (the temperature vector of Planck), giving
to the metric tensor g a null Lie derivative. This suggests describing the dis-
sipative processes by a temperature vector β which is no longer compelled
by this condition; the corresponding Lie derivative of g, the “friction
tensor,” becomes the source of the dissipation. One obtains in this way a phe-
nomenological model of continuous media which presents some interesting
properties: the temperature vector and entropy flux are in duality; the positive
entropy production is a consequence of Einstein’s equations; the Onsager
reciprocity relations are generalized; in the case of a fluid in in the
non-relativistic approximation, the model unifies heat conduction and viscos-
ity (equations of Fourier and Navier).”
The name Souriau means “little mouse” in “the Perche.” In the “Vendo-
mois,” the Souriau from 1490 to 1819 were all “Master Plowmen” or
“Master Millers.” Jean-Marie Souriau was born June 3, 1922 in Paris in the
6th arrondissement. Jean-Marie Souriau comes from a family of Philosopher
all graduated from ENS Paris. His father, Michel Souriau joined the ENS
Paris in 1910 and obtained the aggregation in philosophy in 1914 and wrote
in 1938 an article on “Introduction to mathematical symbolism” before being
mobilized as a battalion commander. His uncle, Etienne Souriau, is a French
philosopher, specialized in aesthetics, entered the ENS Paris in 1912, received
first in the aggregation in philosophy in 1920. In 1958, Etienne Souriau was
elected member of the Academy of moral sciences and policies by a commit-
tee in which Charles de Gaulle appears, and will be the director of the thesis
 Rohmer. Etienne Souriau published a book on “Struc-
of the filmmaker Eric
ture of the Work of Art (Structure de l’Oeuvre d’Art).” His grandfather, Paul
Souriau, is a French philosopher known for his work on the theory of inven-
tion and aesthetics, who entered the ENS Paris in 1873 and aggregated in phi-
losophy in 1876. It can be noted that his grand-father, Paul Souriau, composed
a thesis titled “Theory of Invention” (theorie de l’invention), published in
1881, and also a Latin thesis titled “De motus perceptione,” which aimed to
determine the importance of vision for the perception of movements (the initial
thesis title was “De visione motus” and was a precursor to his future work on
the perception of movement). We can assume that Jean-Marie Souriau read his
114 SECTION II Information geometry

grandfather’s thesis and was influenced by it for his own work. In 1889, Paul
Souriau published a book on “The Aesthetics of Movement (L’Esthetiquee du
movement)” which describes two levels of aesthetics for movement: mechani-
cal beauty (the adaptation of movement to fulfill its purpose) and meaning of
movement (the meaning that the movement communicates to an outside
observer). In doing so, Paul Souriau distinguished movement from perception
of movement, concepts that would later become the subject of motor cognition
and psychophysics. It is interesting to note that Etienne Souriau studied the
structures of aestheticism, Paul Souriau developed the aestheticism of the
movement and Jean-Marie Souriau founded the structures of the movement.
This triptych will remain an important element of French philosophy at the
hinge of this 1900 Spirit (Esprit 1900).
Jean-Marie Souriau from 1932 to 1942 did his secondary studies in Nancy,
Nı̂mes, Grenoble and Versailles. Jean-Marie Souriau married Christianne
Hoebrechts, who died prematurely in 1985 and with whom he had five chil-
dren Isabelle, Catherine, Yann, Jer^
ome and Magali. He entered the ENS Paris
in 1942, passing twice in the unoccupied zone in Lyon and a second time in
Paris. Also received at the Ecole Polytechnique, he resigned to join the ENS
Paris (Fig. 1). During his studies at the ENS, he took courses at the Sorbonne
from the physicist Yves Rocard and the mathematician Elie Cartan. He volun-
teered for “La France Libre” in 1944. On his return in 1946, he passed the
mathematics aggregation, and the same year joined a laboratory working on

FIG. 1 Jean-Marie Souriau, a student at the Ecole Normale Superieure in Paris in 1942, with
Jacques Dixmier and Rene Deheuvels among others (© ENS Paris).
Symplectic theory of heat and information geometry Chapter 4 115

the scanning electron microscope and then entered as a researcher in a


“theoretical physics” session at the CNRS.
He finally opted for a career as an aeronautical engineer at ONERA by
becoming head of research groups and defending his thesis in June 1952 on
the theme of “aircraft stability” which was supervised by Andre Lichnerowicz
(professor at the College of France) and Joseph Perès (collaborator of Vito
Volterra), which was useful for the design of the “Caravelle” and
“Concorde” aircrafts (ONERA obtains royalties from Souriau patents). In this
thesis, he refers to the book by Yves Rocard on “General dynamics of
vibrations.” On his thesis, he wrote (Souriau, 2007) “I studied the problems
of vibrations and stability which arise in aeronautics and in some other tech-
niques; this work allowed me to develop stability criteria which are presented
in the form of algorithms which can be easily calculated from theoretical data
or from tests; they have since been regularly used in various fields (subsonic
and supersonic airplanes, navigation instruments, etc.).” I obtained a copy of
this thesis through colleagues at ONERA, whose cover I am reproducing
below (Fig. 2).
During this period, he also invented in 1948 an algorithm named
Leverrier-Souriau algorithm (Souriau, 2019) which allows to compute the
characteristic polynomial of a matrix and which was used on the first IBM
computers in the United States. From 1948 to 1952, he also provided
continuing education at the Special School of Aeronautical Works

FIG. 2 Cover page of the thesis of Jean-Marie Souriau “On the stability of planes” defended on
June 20, 1952, as well as the bibliographic page which refers to the work of Yves Rocard (per-
sonal picture of Souriau PhD manuscript archived at ONERA).
116 SECTION II Information geometry

(ESTA, Paris) under the general title “New Methods of Mathematical Physics.”
From 1951 to 1952, he created and ran the Mechanics course in the third year

of the Ecole Normale Superieure de l’Enseignement Technique (ENSET, 
Paris). From 1952, he also had a university education in the following disci-
plines: Mathematics, Mechanics, Relativity, Mathematical Methods of Physics
and Computer Science.
After his thesis in 1954, he joined the “Institut des Hautes Etudes,” rue de
Rome in Tunis, and moved with his wife to Carthage. It is during this period
that he rereads and deepens the work of Lagrange in Analytical Mechanics
and discovers the symplectic structures that he will formalize in his book
“Structure of dynamic systems.” It is by thinking of his discussions with
ONERA engineers that he invents his masterpiece, the “moment map”
(application moment). We can read in the interview by Patrick Iglesias
(Iglesias, 1995) “It was with the memory of discussions with engineers who
asked themselves the following question: what is essential in mechanics.
I remember very well an engineer who asked me: is mechanics simply the
principle of conservation of energy? This is fine for a one-parameter system,
but once there are two, it is not enough. I had learned of course the
Lagrange equations and all the analytical principles of mechanics, but it
was all a cookbook; we did not see any real principles.” He remained in
Tunis from 1952 to 1958, as Lecturer, then as Full Professor at the Institut

des Hautes Etudes. In 1953, he participated in Strasbourg in the conference
on Differential Geometry (Fig. 3).
In 1958, he became Professor at the University of Aix-Marseille. He
remained in Marseille throughout his career, and from 1978 to 1985 became
Director of the Center for Theoretical Physics of Marseille (CNRS laboratory)
in charge of the teams in Theoretical Mechanics, Geometry and Quantifica-
tion, Astronomy and Cosmology. He was also professor of Mathematics at
the University of Provence (Aix-Marseille I) and ended up as an exceptional
Professor with second echelon. He was also a member of the “Societe
Mathematique de France” and of the French Society of Specialists in Astron-
omy. For 5 years, he also taught the course of the third interuniversity cycle of
Pure Mathematics in Marseille and the third interuniversity cycle of Theoreti-
cal Physics in Marseille-Nice. He was a member of the Editorial Board of the
Journal of Geometry and Physics in Florence. He organized two International
Colloquiums of the CNRS in 1968 and 1981 and Days of the “Societe
Mathematique de France.” Honored by the Academic Palms and of the
National Order of Merit, he obtained the Prize on the subject “Vibrations”
put up for competition by the Association for Aeronautical Research in
1952, the Prize on the subject “Cosmology” put out for competition by the
Foundation Louis Jacot in 1978, the Grand Prix Jaffe of the French Academy
of Sciences in 1981 and the Great Scientist prize of the City of Paris in 1986.
Jean-Marie Souriau died in 2012 in its 90th year.
FIG. 3 Jean-Marie Souriau at the Conference on “Differential Geometry” in Strasbourg in 1953. In the same picture, Jean-Louis Koszul, Andre Weil,
Shiing-Shen Chern, Georges de Rham, Charles Ehresmann, Lucien Godeaux, Heinz Hopf, Andre Lichnerowicz (the director of the thesis of Jean-Marie Souriau),
Bernard Malgrange, John Milnor, Georges Reeb, Laurent Schwartz, Rene Thom, Paulette Libermann. (© dernières nouvelles d’Alsace and “Differential Geome-
try, Strasbourg, 1953” by Michèle Audin in Notices of AMS Volume 55, Number 3, March 2008).
118 SECTION II Information geometry

3 From information geometry to lie groups thermodynamics


In the following, we will use the notation:
l Lie and dual Lie algebras:
Lie algebra: g ¼ T e G
Dual space of Lie algebra g∗
l Coadjoint operator:
 ∗
Ad∗g ¼ Ad g1
D E  
with Ad∗g F, Y ¼ F, Ad g1 Y , 8g  G, Y  g, F  g∗

l Moment map:

J ðxÞ : M ! g∗ such that J X ðxÞ ¼ hJ ðxÞ, Xi, Xg

l Souriau 1-cocycle:

θ ð gÞ

l Souriau 2-cocycle:

~ ðX, Y Þ ¼ J ½X,Y   fJ X , J Y g
Θ
where
gg!ℝ
~ ðX, Y Þ ¼ hΘðXÞ, Y i with ΘðXÞ ¼ T e θðXðeÞÞ
X, Y7!Θ

l Affine coadjoint operator:

Ad#g ð:Þ ¼ Ad∗g ð:Þ + θðgÞ

l Poisson Bracket given by KKS 2-form


  
∂F ∂G
fF, GgðXÞ ¼ X, ,
∂X ∂X

l Affine Poisson bracket:


   

∂F ∂G ∂F ∂G
fF, GgΘ~ ðXÞ ¼ X, , + Θ ,
∂X ∂X ∂X ∂X
In this chapter, we will explain how to define statistics on Lie Groups and
its Symplectic manifold associated to the coadjoint orbits, and more especially
Symplectic theory of heat and information geometry Chapter 4 119

how to define extension of Gauss density as Gibbs density in the Framework


of Geometric Statistical Mechanics. Amari has proved that the Riemannian
metric in an exponential family is the Fisher information matrix defined by:
 2  Z
∂ Φ
gij ¼  with ΦðθÞ ¼  log ehθ,yi dy (1)
∂θi ∂θ j ij ℝ

and the dual potential, the Shannon entropy, is given by the Legendre
transform:
∂ΦðθÞ ∂SðηÞ
SðηÞ ¼ hθ, ηi  ΦðθÞ with ηi ¼ and θi ¼ (2)
∂θi ∂ηi
Ð
We can observe that Φ(θ) ¼  log Rehθ,yidy ¼  log ψ(θ) is related to the
classical cumulant generating function. Koszul and Vinberg have introduced a
generalization through an affinely invariant Hessian metric on a sharp convex
cone by its characteristic function (Satake, 1980):
Z
ΦΩ ðθÞ ¼  log ehθ,yi dy ¼  log ψ Ω ðθÞ with θ  Ω sharp convex cone
Ω∗
Z (3)
hθ,yi
ψ Ω ðθ Þ ¼ e dy with Koszul-Vinberg Characteristic function

Ω

The name “characteristic function” come from the link underlined by


Ernest Vinberg:
Let Ω be a cone in U and Ω∗ its dual, for any λ > 0, H λ ðxÞ ¼ fy  U=hx, yi ¼ λg
and let d ðλÞ y denote the Lebesgue measure on H λ ðxÞ :
Z Z
ðm  1Þ!
ψ Ω ðxÞ ¼ ehx,yi dy ¼ m1 d ðλÞ y
Ω∗ λ Ω∗ \Hλ ðxÞ
(4)
∗ ∗ ∗ t 1 ∗
There exist a bijection x  Ω !
7 x  Ω , satisfying the relation (gx) ¼ g x
for all g  G(Ω) ¼ {g  GL(U)/gΩ ¼ Ω} the linear automorphism group of Ω
and x∗ is:
Z Z
x∗ ¼ yd ðλÞ y= d ð λÞ y (5)
Ω∗ \Hλ ðxÞ Ω∗ \Hλ ðxÞ

We can observe that x is the center of gravity of Ω∗ \ Hλ(x). We have the


property that ψ Ω(gx) ¼ jdet( g)j1ψ Ω(x) for all x  Ω, g  G(Ω) and then that
Pm
ψ Ω(x)dx is an invariant measure on Ω. Writing ∂a ¼ ai ∂x∂ i , one can write:
i¼1
Z
∂a ΦΩ ðxÞ ¼ ∂a ð log ψ Ω ðxÞÞ ¼ ψ Ω ðxÞ1 ha, yiehx,yi dy
  Ω∗
¼ a, x∗ , a  U, x  Ω (6)
120 SECTION II Information geometry

Then, the tangent space to the hypersurface {y  U/ψ Ω(y) ¼ ψ Ω(x)} at


x  Ω is given by {y  U/hx∗, yi ¼ m}. For x  Ω, a, b  U, the bilinear form
∂a ∂b log ψ Ω(x) is symmetric and positive definite, so that it defines an invari-
ant Riemannian metric on Ω.
These relations have been extended by Jean-Marie Souriau in geometric
statistical mechanics, where he developed a “Lie groups thermodynamics”
of dynamical systems where the (maximum entropy) Gibbs density is covari-
ant with respect to the action of the Lie group. In the Souriau model, previous
structures are preserved:
Z
∂2 Φ
I ðβÞ ¼  2 with ΦðβÞ ¼  log ehUðξÞ,βi dλω and U : M ! g∗ (7)
∂β
M

We preserve the Legendre transform:


∂ΦðβÞ ∂SðQÞ
SðQÞ ¼ hQ, βi  ΦðβÞ with Q ¼  g∗ and β ¼ g (8)
∂β ∂Q
We can remark that we can define Entropy as “Legrendre transform of
minus loga-Laplace transform” (also called Cramer transform) and that
log(Laplace transform is linked to Cumulants generating function.
In the Souriau Lie groups thermodynamics model (Barbaresco, 2019b,
2020a; Casimir, 1931; Souriau, 1969), β is a “geometric” (Planck) tempera-
ture, element of Lie algebra g of the group, and Q is a “geometric” heat, ele-
ment of the dual space of the Lie algebra g∗ of the group. Souriau has
proposed a Riemannian metric that we have identified as a generalization of
the Fisher metric:
h i
~ β ðZ1 , ½β, Z2 Þ
I ðβÞ ¼ gβ with gβ ð½β, Z 1 , ½β, Z2 Þ ¼ Θ (9)

~ β ðZ 1 , Z 2 Þ ¼ Θ
with Θ ~ ðZ1 , Z2 Þ + hQ, adZ1 ðZ2 Þ i where ad Z1 ðZ 2 Þ ¼ ½Z 1 , Z 2 
(10)
Souriau
n has proved
o that all co-adjoint orbit of a Lie Group given by

ΟF ¼ Adg F, g  G subset of g∗ , F  g∗ carries a natural homogeneous
symplectic structure by a closed G-invariant 2-form. If we define
K ¼ Ad∗g ¼ (Adg1)∗ and
D E  
K ∗ ðXÞ ¼ ðad X Þ∗ with : Ad∗g F, Y ¼ F, Ad g1 Y , 8g  G, Y  g, F  g∗
(11)
where if X  g, Ad g ðXÞ ¼ gXg1  g, the G-invariant 2-form is given by the
following expression:
σ Ω ðad X F, adY FÞ ¼ BF ðX, Y Þ ¼ hF, ½X, Y i, X, Y  g (12)
Symplectic theory of heat and information geometry Chapter 4 121

Souriau Fundamental Theorem is that “Every symplectic manifold on


which a Lie group acts transitively by a Hamiltonian action is a covering
space of a coadjoint orbit.” We can observe that for Souriau model, Fisher
metric is an extension of this 2-form in non-equivariant case:
~ ðZ1 , ½β, Z2 Þ + hQ, ½Z1 , ½β, Z2  i
gβ ð½β, Z 1 , ½β, Z2 Þ ¼ Θ (13)
e ðZ1 , ½β, Z2 Þ is generated by non-equivariance
The Souriau additional term Θ
through Symplectic cocycle. The tensor Θ ~ used to define this extended Fisher
metric is defined by the moment map J(x), application from M (homogeneous
symplectic manifold) to the dual space of the Lie algebra g∗ , given by:
~ ðX, Y Þ ¼ J ½X,Y   fJ X , J Y g
Θ (14)

with J ðxÞ : M ! g∗ such that J X ðxÞ ¼ hJ ðxÞ, Xi, X  g


This tensor Θ~ is also defined in tangent space of the cocycle θðgÞ  g∗
(this cocycle appears due to the non-equivariance of the coadjoint operator
Ad∗g, action of the group on the dual space of the lie algebra; the action of
the group on the dual space of the Lie algebra is modified with a cocycle so
that the momentum map becomes equivariant relative to this new affine
action):
 
Q Adg ðβÞ ¼ Ad ∗g ðQÞ + θðgÞ (15)

θðgÞ  g∗ is called nonequivariance one-cocycle, and it is a measure of the


lack of equivariance of the moment map.

~ ðX, Y Þ : g  g ! ℝ
Θ with ΘðXÞ ¼ T e θðXðeÞÞ
(16)
X, Y7!hΘðXÞ, Y i

Souriau has then defined a Gibbs density that is covariant under the action
of the group:
ehUðξÞ,βi
pGibbs ðξÞ ¼ eΦðβÞhUðξÞ,βi ¼ Z with ΦðβÞ
ehUðξÞ,βi dλω
Z M
hUðξÞ,βi
¼  log e dλω (17)
M

Z
U ðξÞehUðξÞ,βi dλω
Z
∂ΦðβÞ M
Q¼ ¼ Z ¼ UðξÞpðξÞdλω (18)
∂β
ehUðξÞ,βi dλω M
M
122 SECTION II Information geometry

We can express the Gibbs density with respect to Q by inverting the


relation:
∂ΦðβÞ
¼ ΘðβÞ: Then pGibbs,Q ðξÞ ¼ eΦðβÞhUðξÞ,Θ ðQÞi with β ¼ Θ1 ðQÞ
1

∂β
(19)
We will observe that Souriau Entropy S(Q) defined on affine coadjoint
orbit of the group (where Q is a “geometric” heat, element of the dual space
of the Lie algebra g∗ of the group) has a property of invariance S(Ad#g(Q)) ¼
S(Q) with respect to Souriau affine definition of coadjoint action Ad#g(Q) ¼
Ad∗g(Q) + θ( g) where θ( g) is called the Souriau cocycle, and is associated to
the default of equivariance of the moment map (Cartier, 1994; De Saxce
and Vallee, 2016; Mikami, 1987; Souriau, 1969). In the framework of Souriau
Lie groups Thermodynamics (Souriau, 1954, 1965, 1966, 1967, 1974, 1975,
1978, 1984, 1997), we will then characterize the Entropy as a generalized
Casimir invariant function (Casimir, 1931) in coadjoint representation. When
M is a Poisson manifold, a function on M is a Casimir function if and only if
this function is constant on each symplectic leaf (the non-empty open subsets
of the symplectic leaves are the smallest embedded manifolds of M which are
Poisson submanifolds). Classically, the Entropy is defined axiomatically as
Shannon or von Neumann Entropies without any geometric structures con-
straints. In this paper, the Entropy will be characterized as solution of the
Casimir equation given for affine equivariance by:


∗ ∂S
ad ∂S Q + Θ ¼ Ckij ad ∗ ∂S i Qk + Θ j ¼ 0 (20)
∂Q j ∂Q j ð∂QÞ

where ΘðXÞ ¼ T e θðXðeÞÞ with Θ ~ ðX, Y Þ ¼ hΘðXÞ, Y i ¼ J ½X,Y   fJ X , J Y g in


non-null cohomology case (non-equivariance of coadjoint operator on the
moment map), with θ( g) the Souriau Symplectic cocycle. The KKS
(Kostant-Kirillov Souriau) 2-form that associates a structure of homogeneous
symplectic manifold to coadjoint orbits, will be linked with extension of
Koszul-Fisher metric. The information manifold foliates into level sets of
the entropy that could be interpreted in the framework of Thermodynamics
by the fact that motion remaining on these surfaces is non-dissipative,
whereas motion transversal to these

surfaces is dissipative, where the dynam-
∗ ∂H
dt ¼ ad ∂H Q + Θ
ics is given by dQ with the existence of a stable equilibrium
∂Q
∂Q


∗ ∂S
when H ¼ S ) dQ
dt ¼ ad ∂S Q + Θ ∂Q ¼ 0 preserves coadjoint orbits and
∂Q

Casimirs of the Lie–Poisson equation


by construction).



We will also observe that dS ¼ Θ ~ β ∂H , β ¼ Θ
~ β ∂H , β dt where Θ ~ ∂H , β
∂Q ∂Q ∂Q
D h iE
∂H
+ Q, ∂Q , β , showing that 2nd principle is linked to positive definiteness of
Symplectic theory of heat and information geometry Chapter 4 123

Souriau tensor related to Fisher Information or Koszul 2-form



extension. We

can extend Affine Lie-Poisson equation dt ¼ ad ∂H Q + Θ ∂H
dQ
∂Q to a new Strato-
∂Q

novich differential equation for the stochastic process given by the following
relation by mean of Souriau’s symplectic cocycle:
  XN  
∗ ∂H ∗ ∂Hi
dQ + ad ∂H Q + Θ dt + ad ∂Hi Q + Θ ∘ dW i ðtÞ ¼ 0 (21)
∂Q ∂Q i¼1
∂Q ∂Q

More details on Souriau Lie Groups Thermodynamics are available in


author paper (Barbaresco, 2019a, 2020a, 2021a,c,d) or Charles-Michel Marle
papers (Marle, 2016, 2019, 2020b).
We can remark that previous Souriau affine 2-form structure applied for
statistics on symplectic manifolds has extended classical notion of Fisher met-
ric introduced in information geometry.

4 Symplectic structure of fisher metric and entropy as Casimir


function in coadjoint representation
“Ayant eu l’occasion de m’occuper du mouvement de rotation d’un corps solide
creux, dont la cavite est remplie de liquide, j’ai ete conduit à mettre les
equations generales de la Mecanique sous une forme que je crois nouvelle et
qu’il peut ^etre interessant de faire connaı̂tre [Having had the opportunity to con-
cern myself with the rotational motion of a hollow solid body, the cavity of
which is filled with liquid, I was led to put the general equations of Mechanics
in a form which I believe to be new and may be interesting to make
known.]”–Henri Poincare (Poincar e, 1901).

Based on this model, we will introduce a geometric characterization of


Entropy as a generalized Casimir invariant function in coadjoint representa-
tion, where Souriau cocycle is a measure of the lack of equivariance of the
moment mapping. The dual space of the Lie algebra foliates into coadjoint
orbits that are also the level sets on the entropy that could be interpreted in
the framework of Thermodynamics by the fact that motion remaining on these
surfaces is non-dissipative, whereas motion transversal to these surfaces is
dissipative. We will also explain the second Principle in thermodynamics by
definite positiveness of Souriau tensor extending the Koszul-Fisher metric
from Information Geometry, and introduce a new geometric Fourier heat
equation with Souriau-Koszul-Fisher tensor. In conclusion, Entropy as Casi-
mir function is characterized by Koszul Poisson Cohomology.

4.1 Symplectic Fisher Metric structures given by Souriau model


In the Souriau Lie groups thermodynamics model, β is a “geometric” (Planck)
temperature, element of Lie algebra g of the group, and Q is a “geometric”
124 SECTION II Information geometry

heat, element of the dual space of the Lie algebra g∗ of the group. Souriau has
proposed a Riemannian metric that we have identified as a generalization of
the Fisher metric:
h i
~ β ðZ1 , ½β, Z 2 Þ
I ðβÞ ¼ gβ with gβ ð½β, Z 1 , ½β, Z2 Þ ¼ Θ (22)

~ β ðZ 1 , Z 2 Þ ¼ Θ
with Θ ~ β ðZ 1 , Z 2 Þ + hQ, adZ1 ðZ 2 Þi where ad Z1 ðZ2 Þ ¼ ½Z1 , Z2 
(23)
Souriau Fundamental Theorem is that “Every symplectic manifold on
which a Lie group acts transitively by a Hamiltonian action is a covering space
of a coadjoint orbit.” We can observe that for Souriau model, Fisher metric is
an extension of this 2-form in non-equivariant case gβ ð½β, Z1 , ½β, Z2 Þ ¼
e ðZ1 , ½β, Z2 Þ + hQ, ½Z1 , ½β, Z2 i.
Θ
The Souriau additional term Θ ~ ðZ1 , ½β, Z 2 Þ is generated by non-
equivariance through Symplectic cocycle. The tensor Θ ~ used to define this
extended Fisher metric is defined by the moment map J(x), application from
M (homogeneous symplectic manifold) to the dual space of the Lie algebra
g∗ , given by:
~ ðX, Y Þ ¼ J ½X,Y   fJ X , J Y g
Θ (24)

with J ðxÞ : M ! g∗ such that J X ðxÞ ¼ hJ ðxÞ, Xi, X  g: (25)


~ is also defined in tangent space of the cocycle θðgÞ  g∗ (this
This tensor Θ
cocycle appears due to the non-equivariance of the coadjoint operator Ad∗g,
action of the group on the dual space of the lie algebra; the action of the group
on the dual space of the Lie algebra is modified with a cocycle so that the
momentum map becomes equivariant relative to this new affine action):
 
Q Adg ðβÞ ¼ Ad ∗g ðQÞ + θðgÞ (26)
D E  
I use notation Ad∗g ¼ (Adg1)∗ with Ad∗g F, Y ¼ F, Ad g1 Y , 8g  G,
Y  g, F  g∗ as used by Koszul and Souriau. θðgÞ  g∗ is called nonequivar-
iance one-cocycle, and it is a measure of the lack of equivariance of the
moment map.
~ ðX, Y Þ : g  g ! ℜ
Θ with ΘðXÞ ¼ T e θðXðeÞÞ
(27)
X, Y 7! hΘðXÞ, Y i
It can be then deduced that the tensor could be also written by (with
cocycle relation):
~ ðX, Y Þ ¼ J ½X,Y   fJ X , J Y g ¼ hdθðXÞ, Y i,
Θ X, Y  g (28)
~ ð½X, Y , ZÞ + Θ
Θ ~ ð½Y, Z , XÞ + Θ
~ ð½Z, X, Y Þ ¼ 0, X, Y, Z  g (29)
Symplectic theory of heat and information geometry Chapter 4 125

This study of the moment map J equivariance, and the existence of an


affine action of G on g∗ , whose linear part is the coadjoint action, for which
the moment J is equivariant, is at the cornerstone of Souriau theory of geo-
metric mechanics and Lie groups thermodynamics. When an element of the
group g acts on the element β  g of the Lie algebra, given by adjoint opera-
tor Adg. With respect to the action of the group Adg(β), the entropy S(Q) and
the Fisher metric I(β) are invariant:
(   
S Q Adg ðβÞ ¼ SðQÞ
β  g ! Ad g ðβÞ )   (30)
I Ad g ðβÞ ¼ I ðβÞ

Souriau completed his “geometric heat theory” by introducing a 2-form in


the Lie algebra, that is a Riemannian metric tensor in the values of adjoint orbit
of β, [β, Z] with Z an element of the Lie algebra. This metric is given for (β, Q):
gβ ð½β, Z 1 , ½β, Z2 Þ ¼ hΘðZ1 Þ, ½β, Z 2 i + hQ, ½Z1 , ½β, Z2 i (31)
where Θ is a cocycle of the Lie algebra, defined by Θ ¼ Teθ with θ a cocycle
of the Lie group defined by θ(M) ¼ Q(AdM(β))  Ad∗M Q. We observe that
Souriau Riemannian metric, introduced with symplectic cocycle, is a general-
ization of the Fisher metric, that we call the Souriau-Fisher metric, that pre-
serves the property to be defined as a Hessian of the partition function
logarithm gβ ¼  ∂∂βΦ2 ¼ ∂ log ψ Ω
2 2

∂β2
as in classical information geometry. We will
establish the equality of two terms, between Souriau definition based on Lie
group cocycle Θ and parameterized by “geometric heat” Q (element of the
dual space of the Lie algebra) and “geometric temperature” β (element of
Lie algebra) and hessian of characteristic function Φ(β) ¼  log ψ Ω(β) with
respect to the variable β:
∂2 log ψ Ω
gβ ð½β, Z1 , ½β, Z 2 Þ ¼ hΘðZ 1 Þ, ½β, Z2 i + hQ, ½Z 1 , ½β, Z 2 i ¼ (32)
∂β2
If one assumes that U(gξ) ¼ Ad∗g U(ξ) + θ( g), g  G, ξ  M which means
that the energy U : M ! g∗ satisfies the same equivariance condition as the
moment map μ : M ! g∗ , then one has for g  G and β  Ω:
D E
Z Z
  hUðξÞ,Ad g βi  Ad ∗1 UðξÞ,β
ψ Ω Ad g β ¼ e dλðξÞ ¼ e g
dλðξÞ
M M
Z Z
  hUðg ξÞθðg Þ,βi dλðξÞ ¼ ehθðg1 Þ,βi ehUðg ξÞ,βi dλðξÞ
1 1 1
ψ Ω Ad g β ¼ e
M M
 
ψ Ω Ad g β ¼ ehθðg Þ,βi ψ Ω ðβÞ
1

       
Φ Ad g β ¼  log ψ Ω Adg β ¼ ΦðβÞ  θ g1 , β
(33)
126 SECTION II Information geometry

To consider the invariance of Entropy, we have to use the property that


 
Q Ad g β ¼ Ad ∗g QðβÞ + θðgÞ ¼ g:QðβÞ, β  Ω, g  G (34)
We can prove the invariance under the action of the group:
        
s Q Adg β ¼ Q Ad g β , Adg β  Φ Ad g β
   D E    
s Q Adg β ¼ Ad∗g QðβÞ + θðgÞ, Adg β  ΦðβÞ + θ g1 , β
   D E D   E (35)
s Q Adg β ¼ Ad∗g QðβÞ, Adg β  ΦðβÞ + Ag∗g1 θðgÞ + θ g1 , β
      
s Q Adg β ¼ sðQðβÞÞ using Ad∗g1 θðgÞ + θ g1 ¼ θ g1 g ¼ 0
For β  Ω, let gβ be the Hessian form on T β Ω≡g with the potential
Φ(β) ¼  log ψ Ω(β). For X, Y  g, we define:
2
∂2 Φ ∂
gβ ðX, Y Þ ¼  2 ðX, Y Þ ¼ log ψ Ω ðβ + sX + tY Þ (36)
∂β ∂s∂t s¼t¼0
The positive definitiveness is given by Cauchy-Schwarz inequality:
8Z Z 9
>
> e hUðξÞ, βi
dλðξÞ: hUðξÞ, Xi2 ehUðξÞ, βi dλðξÞ >
>
>
> >
>
>
< >
=
1 M
0 M
1
gβ ðX, Y Þ ¼ Z 2
ψ Ω ðβ Þ2 >
>
>
>
>
>
>
>  @ hU ðξÞ, XiehUðξÞ, βi dλðξÞA >
>
: ;
Z
8
2M Z
2 9
>
> ehUðξÞ, βi=2 dλðξÞ: hU ðξÞ, XiehUðξÞ, βi=2 dλðξÞ >
>
>
> >
>
>
< >
=
1 M
0 M
1
¼ 2> Z 2 0
ψ Ω ðβ Þ > >
>
>
>  @ hU ðξÞ, βi=2
:hU ðξÞ, Xie hUðξÞ, βi=2
dλðξÞ A >
>
>
:
e >
;
M
(37)
We observe that gβ(X, Y) ¼ 0 if and only if hU(ξ), Xi is independent of
ξ  M, which means that the set {U(ξ); ξ  M} is contained in an affine hyper-
plane in g∗ perpendicular to the vector X  g. We have seen that gβ ¼  ∂∂βΦ2 ,
2

that is a generalization of classical Fisher metric from Information geom-


etry, and will give the relation the Riemannian metric introduced by Souriau.

∂Q
gβ ðX, Y Þ ¼  ðXÞ, Y for X, Y  g (38)
∂β
we have for any β  Ω, g  G and Y  g:
     
Q Ad g β , Y ¼ QðβÞ, Ad g1 Y + hθðgÞ, Y i (39)
Let us differentiate the above expression with respect to g. Namely, we
substitute g ¼ exp(tZ1), t  R and differentiate at t ¼ 0. Then the left-hand side
of (22) becomes
Symplectic theory of heat and information geometry Chapter 4 127


d    2   ∂Q
Q β + t½Z 1 , β + o t , Y ¼ ð½Z 1 , βÞ, Y (40)
dt t¼0 ∂β
and the right-hand side of (22) is calculated as:

d        
QðβÞ, Y  t½Z1 , Y  + o t2 ¼ θ I + tZ 1 + o t2 , Y
dt t¼0 (41)
¼ hQðβÞ, ½Z1 , Y i + hdθðZ1 Þ, Y i
Therefore,

∂Q
ð½Z 1 , βÞ, Y ¼ hdθðZ 1 Þ, Y i  hQðβÞ, ½Z1 , Y i (42)
∂β

Substituting Y ¼  [β, Z2] to the above expression:



∂Q
gβ ð½β, Z 1 , ½β, Z2 Þ ¼  ð½Z1 , βÞ, ½β, Z2 
∂β (43)
gβ ð½β, Z 1 , ½β, Z2 Þ ¼ hdθðZ1 Þ, ½β, Z 2 i + hQðβÞ, ½Z1 , ½β, Z 2 i

We define then symplectic 2-cocycle and the tensor:


ΘðZ 1 Þ ¼ dθðZ 1 Þ
(44)
~ ðZ 1 , Z 2 Þ ¼ hΘðZ1 Þ, Z2 i ¼ J ½Z ,Z   fJ Z1 , J Z2 g
Θ 1 2

Considering Θ~ β ðZ1 , Z 2 Þ ¼ hQðβÞ, ½Z1 , Z2 i + Θ


~ ðZ 1 , Z2 Þ, that is an extension
of KKS (Kirillov-Kostant-Souriau) 2 form in the case of non-null cohomol-
ogy. Introduced by Souriau, we can define this extension Fisher metric with
2-form of Souriau:
~ β ðZ1 , ½β, Z 2 Þ
gβ ð½β, Z 1 , ½β, Z2 Þ ¼ Θ (45)

As the entropy is defined by the Legendre transform of the characteristic


function, a dual metric of the Fisher metric is also given by the hessian of
∂2 SðQÞ
“geometric entropy” S(Q) with respect to the dual variable given by Q: ∂Q2
.
SKIPIF 1 < 0 The Fisher metric has been considered by Souriau as SKIPIF
1 < 0 a generalization of “heat capacity.” Souriau called it K SKIPIF 1 < 0
the “geometric capacity”:

∂ 2 Φðβ Þ ∂Q
I ðβ Þ ¼  ¼ (46)
∂β2 ∂β

We can remark that this is a new analogy between “Fisher Metric” and
“thermodynamics capacity.” We can observe that Pierre Duhem in the previ-
ous century founded Thermodynamics on capacity and its extension in a new
theory called “Energetique.”
128 SECTION II Information geometry

4.2 Entropy characterization as generalized Casimir invariant


function in coadjoint representation and Poisson Cohomology
In his 1974 paper, Jean-Marie Souriau has written hQ, ½β, Zi + Θ ~ ðβ, ZÞ ¼ 0.
To prove this equation, we have to consider the parametrized curve
t7!Ad exp ðtZÞ β with Z  g and t  R . The parameterized curve Adexp(tZ)β
passes, for t ¼ 0, through the point β, since Adexp(0) is the identical map of
the Lie Algebra g. This curve is in the adjoint orbit of β. So by taking its
derivative with respect to t, then for t ¼ 0, we obtain a tangent vector in β
at the adjoint orbit of this point. When Z takes all possible values in g, the vec-
tors thus obtained generate all the vector space tangent in β to the orbit of this
point:
    
dΦ Ad exp ðtZÞ β  dΦ d Ad exp ðtZÞ β 
dt  ¼ dβ , dt  ¼ hQ, ad Z βi ¼ hQ, ½Z, βi
t¼0 t¼0
(47)
As we have seen before Φ(Adgβ) ¼ Φ(β)  hθ(g1), βi. If we set g ¼ exp
(tZ), we obtain Φ(Adexp(tZ)β) ¼ Φ(β)  hθ(exp( tZ)), βi and by derivation
with respect to t at t ¼ 0, we finally recover the equation given by Souriau:
 
dΦ Ad exp ðtZÞ β 
~
dt  ¼ hQ, ½Z, βi ¼ hdθðZ Þ, βi with ΘðX, Y Þ
t¼0
¼ hdθðXÞ, Y i (48)
Souriau has stopped by this last equation, the characterization of Group

action on Q ¼ dΦ dβ . Souriau has observed that S[Q(Adg(β))] ¼ S[Adg(Q)
+ θ( g)] ¼ S(Q). We propose to characterize more explicitly this invariance,
by characterizing Entropy as an invariant Casimir function in coadjoint
∂S
representation. From last Souriau equation, if we use the identities β ¼ ∂Q ,
~
adβZ ¼ [β, Z] and Θðβ, Z Þ ¼ hΘðβÞ, Z i , then we can deduce that


∂S
ad∗∂S Q + Θ ∂Q , Z ¼ 0, 8Z. So, Entropy S(Q) should verify ad ∗∂S Q +
∂Q ∂Q


∂S
Θ ∂Q ¼ 0 , characterizes an invariant Casimir function in case of
non-null cohomology, that we propose to write with Poisson brackets, where:
  
∂S ∂H ~ ∂S , ∂H ¼ 0, 8H : g∗ ! R, Q  g∗
fS, HgΘ~ ðQÞ ¼ Q, , +Θ
∂Q ∂Q ∂Q ∂Q
(49)
We have observed that Souriau Entropy is a Casimir function in case with
non-null cohomology when an additional cocycle should be taken into
account. Indeed, infinitesimal variation is characterized

by the following
 dif-
   
 ∗
ferentiation: dt S Q Ad exp ðtxÞ β t¼0 ¼ dt S Ad exp ðtxÞ Q + θð exp ðtxÞÞ  ¼
d d
D
E t¼0
∗ ∂S
 ad ∂S Q + Θ ∂Q , x . We recover extended Casimir equation in case of
∂Q
Symplectic theory of heat and information geometry Chapter 4 129



non-null cohomology verified by Entropy, ad ∗∂S Q + Θ ∂S
∂Q ¼ 0, and then the
∂Q

generalized Casimir condition fS, HgΘ~ ðQÞ ¼ 0. Hamiltonian motion on these


affine coadjoint orbits is given by the solutions of the Lie-Poisson equations
with cocycle.
The identification of Entropy as an Invariant Casimir Function in Coad-
joint representation is also important in Information Theory, because classi-
cally Entropy is introduced axiomatically. With this new approach, we can
build Entropy by constructing the Casimir Function associated to the Lie
group and also in case of non-null cohomology.
This new equation is important because it introduces new structure of
differential equation in case of non-null cohomology. This previous
Lie-Poisson equation is equivalent to the modified Lie-Poisson variational
principle:
Z τ 
∂H
δ QðtÞ, ðtÞ  H ðQðtÞÞ dt ¼ 0
∂Q
0
8
> ∂H
>
> ¼ g1 g_  g, g  G
>
> ∂Q
>
>
>
< ∂2 H (50)
∂H 1
where δQ ¼ δ , η ¼ g δg
> ∂Q2 ∂Q
>
>
>
>    
>
> ∂H ∂H ∂H
>
: Q, δ ¼ Q, η_ + ,η ~
+Θ ,η
∂Q ∂Q ∂Q
Z τ 
∂H
δ QðtÞ, ðtÞ  H ðQðtÞÞ dt
0 ∂Q
Z τ   
∂H ∂H ∂H
¼ δQ, + Q, δ  δQ, dt ¼ 0
0 ∂Q ∂Q ∂Q
Z τ   
dη ∂H ~ ∂H , η dt
¼ Q, + ,η +Θ
0 dt ∂Q ∂Q
Z τ  D E 
dη ∂H
¼ Q, + Q, ad dH η + Θ , η dt
0 dt dQ ∂Q
Z τ
dQ ∂H
¼  ∗
+ ad dH Q + Θ , η dt + hQ, ηijτ0 ¼ 0
Int: 0 dt dQ ∂Q
by
parts

Geometric Heat Fourier Equation


From this Lie-Poisson equation, we can introduce a Geometric Heat
Fourier Equation:
130 SECTION II Information geometry


∂Q ∗ ∂H ∂FðQÞ
¼ ad ∂H Q + Θ and ¼ fFðQÞ, H gΘ~ (51)
∂t ∂Q ∂Q ∂t
That we can rewrite:

∂Q ∂Q ∂β ∂H
¼ ¼ ad∗∂H Q + Θ (52)
∂t ∂β ∂t ∂Q ∂Q
D E
∂Q
where ∂β geometric heat capacity is given by gβ ðX, Y Þ ¼  ∂Q
∂β ð X Þ, Y
for X, Y  g with gβ ðX, Y Þ ¼ Θe β ðX, Y Þ ¼ hQðβÞ, ½X, Y i + Θ
e ðX, Y Þ related to
Souriau-Fisher tensor.
Heat Eq. (33) is the PDE for (calorific) Energy density where the nature of
material is characterized by the geometric heat capacity. Numerical Integra-
tion of Lie–Poisson systems while preserving coadjoint orbits is described
in Engo and Faltinsen (2002).
In the Euclidean case with homogeneous material, we recover classical
equation (Balian, 2003):

∂ρE λ ∂ρE ∂T
¼ div rρE with ¼C (53)
∂t C ∂t ∂t
The link with second principle of Thermodynamics will be deduced
from positivity of Souriau-Fisher metric:

dQ ∗ ∂H
SðQÞ ¼ hQ, βi  ΦðβÞ with ¼ ad ∂H Q + Θ
dt ∂Q ∂Q
 
dS dβ ∂H dΦ
¼ Q, + ad∗∂H Q + Θ ,β 
dt dt ∂Q ∂Q dt
 D E 
dS dβ ∂H dΦ
¼ Q, + ad ∗∂H Q, β + Θ ,β 
dt dt ∂Q ∂Q dt
   
dS dβ ∂H ~ ∂H , β  dΦ
¼ Q, + Q, ,β +Θ
dt dt ∂Q ∂Q dt
  (54)
dS
¼ Q,

+Θ~ β ∂H , β  ∂Φ , dβ
dt dt ∂Q ∂β dt
 
dS
¼ Q,

+Θ~ β ∂H , β  ∂Φ , dβ with ∂Φ ¼ Q
dt dt ∂Q ∂β dt ∂β

dS ~ ∂H
¼ Θβ , β  0, 8H ðlink to positivity of Fisher metricÞ
dt ∂Q
dS ~ ~β
if H ¼ S ) ¼ Θβ ðβ, βÞ ¼ 0 because β  Ker Θ
∂S
∂Q¼β
dt

Entropy production is then linked with Souriau-Fisher structure, dS ¼





D h iE
e β ∂H , β dt with Θ
Θ ~ β ∂H , β ¼ Θ
~ ∂H ,β + Q, ∂H ,β Souriau-Fisher tensor.
∂Q ∂Q ∂Q ∂Q
Symplectic theory of heat and information geometry Chapter 4 131

The 2 equations characterizing Entropy as invariant Casimir function are


related by:
   
∂S ∂H ∂S ∂H
fS, H gΘ~ ðQÞ ¼ Q, , + Θ , ¼0
∂Q ∂Q ∂Q ∂Q
 
∂H ∂S ∂H
fS, H gΘ~ ðQÞ ¼ Q, ad ∂Q∂S + Θ , ¼0
∂Q ∂Q ∂Q
 
∂H ∂S ∂H
fS, H gΘ~ ðQÞ ¼ ad ∗∂S Q, + Θ , ¼0
∂Q ∂Q ∂Q ∂Q

∗ ∂S ∂H ∗ ∂S
8H, fS, H gΘ~ ðQÞ ¼ ad ∂S Q + Θ , ¼ 0 ) ad ∂S Q + Θ ¼0
∂Q ∂Q ∂Q ∂Q ∂Q
(55)

This equation was observed by Souriau in his paper of 1974, where he has
~ β , that is written:
written that geometric temperature β is a kernel of Θ

~ β ) hQ, ½β, Z i + Θ
β  Ker Θ ~ ðβ, Z Þ ¼ 0 (56)

That we can develop to recover the Casimir equation:


  D E
) Q, adβ Z + Θ ~ ðβ, ZÞ ¼ 0 ) ad∗ Q, Z + Θ ~ ðβ, Z Þ ¼ 0
β
D E 
∂S ∗ ~ ∂S ∗ ∂S
β¼ ) ad ∂S Q, Z + Θ , Z ¼ ad ∂S Q + Θ , Z ¼ 0, 8Z
∂Q ∂Q ∂Q ∂Q ∂Q

∂S
) ad ∗∂S Q + Θ ¼0
∂Q ∂Q
(57)

4.3 Koszul Poisson Cohomology and entropy characterization


Poisson Cohomology was introduced by Lichnerowicz (1977) and Koszul
(1985). Koszul made reference to seminal E. Cartan paper (Cartan, 1929;
Vialatte, 2011) “Elie Cartan does not explicitly mention Λ(g’) [the complex
of alternate forms on a Lie algebra], because he treats groups as symmetrical
spaces and is therefore interested in differential forms which are invariant to
both by the translations to the left and the translations to the right, which cor-
responds to the elements of Λ(g’) invariant by the prolongation of the coad-
joint representation. Nevertheless, it can be said that by 1929 an essential
piece of the cohomological theory of Lie algebras was in place.”
A. Lichnerowics defined Poisson structure by considering generalization of
Symplectic structures which involve contravariant tensor fields rather than dif-
ferential forms and by observing that the Schouten-Nijenhuis bracket allows to
write an intrinsic, coordinate free form, Poisson structure. Let’s consider
132 SECTION II Information geometry

adPQ ¼ [P, Q], where adP is a graded linear endomorphism of degree p  1 of


A(M). From graded Jacobi identity we can write (Koszul, 1985):
 
adP ½Q, R ¼ ½ad P Q, R + ð1Þðp1Þðq1Þ Q, ad p R
(58)
ad½P,Q ¼ ad P ∘ad Q  ð1Þðp1Þðq1Þ adQ ∘adP
First equation of (38) means that the graded endormorphims adPQ ¼
[P, Q], of degree p-1, is a derivation of the graded Lie algebra A(M) with
the Schouten-Nijenhuis bracket as composition law. The second equation
of (38) means that the endomorphism ad[P,Q] is the graded commutator of
endomorphisms adP and adQ. Vorob’ev and Karasev (1988) have suggested
cohomology classification in terms of closed forms and de Rham Cohomol-
ogy of coadjoint orbits Ω (called Euler orbits by authors), symplectic leaves
of a Poisson manifold Ν. Let Zk(Ω) and Hk(Ω) be the space of closed
k-forms on Ω and their de Rham cohomology classes. Considering the base
of the fibration of Ν by these orbits as Ν/Ω, they have introduced the smooth
mapping Zk[Ω] ¼ C∞(N/Ω ! Zk(Ω)) and Hk[Ω] ¼ C∞(N/Ω ! Hk(Ω)). The ele-
ments of Zk[Ω] are closed forms on Ω, depending on coordinates on Ν/Ω. Then
H0[Ω] ¼ Casim(N) is the set of Casimir functions on Ν, of functions which
are constant on all Euler orbits. In the framework of Lie Groups Thermo-
dynamics, Entropy is then characterized by zero-dimensional de Rham Coho-
mology. For arbitrary v  V(N), with the set V(N) of all vector field on N, the
tensor Dv defines a closed 2-form α(Dv) ¼ Z2[Ω], and if v  V(N) annihilates
H0[Ω] ¼ Casim(N), then this form is exact. The center of Poisson algebra
(Berezin, 1967) induced from the symplectic structure is the zero-dimensional
de Rham cohomology group, the Casimir functions.

5 Covariant maximum entropy density by Souriau model


Seminal work on distributions in representation space was seminally studied
by Stratonovich (1957) before J.M. Souriau. We will introduce Gaussian dis-
tribution on the space of Symmetric Positive Definite (SPD) matrices, through
Souriau’s covariant Gibbs density by considering this space as the pure imag-
inary axis of the homogeneous Siegel upper half space where Sp(2n,R)/U
(n) acts transitively. Gauss density of SPD matrices is computed through
Souriau’s moment map and coadjoint orbits. We will illustrate the model first
for Poincare unit disk, then Siegel unit disk and finally upper half space. For
this example, we deduce Gauss density for SPD matrices.

5.1 Gauss density on Poincare unit disk covariant with respect


to SU(1,1) Lie group
We will introduce Souriau moment map for SU(1,1)/U(1) group that acts tran-
sitively on Poincare Unit Disk, based on moment map. Considering the Lie
group:
Symplectic theory of heat and information geometry Chapter 4 133

80 1 0 10 10 1
< a b 1 ba∗1 a∗1 0 1 0
SU ð1, 1Þ ¼ @ A¼@ A@ A@ A=a, b  ℂ,
:
b∗ a∗ 0 1 0 a∗ a∗1 b∗ 1
9
=
jaj2  jbj2 ¼ 1
;

(59)
and its Lie algebra given by elements:
 
ir η
suð1, 1Þ ¼ =r  ℝ, η  ℂ (60)
η∗ ir
A basis for this Lie algebra su(1, 1) is ðu1 , u2 , u3 Þ  g with:

i 1 0 1 0 1 1 0 i
u1 ¼ , u2 ¼  and u3 ¼ (61)
2 0 1 2 1 0 2 i 0
with [u1, u3] ¼  u2, [u1, u2] ¼ u3, [u2, u3] ¼  u1. The Harish-Chandra embed-
ding is given by φ(gx0) ¼ ζ ¼ ba∗1. From ja j2  jb j2 ¼ 1, one has jζ j < 1.
Conversely, for any jζ j < 1, taking any a  ℂ such that jaj ¼ (1  j aj2)1/2
and putting b ¼ ζa∗, one obtains g  G for which φ(gx0) ¼ ζ. The domain
D ¼ φ(M) is the unit disc D ¼ {ζ  ℂ/jζ j < 1}.
The compact subgroup is generated by u1, while u2 and u3 generate a
hyperbolic subgroup. The dual space of the Lie algebra is given by:
 
z x + iy
suð1, 1Þ∗ ¼ =x, y, z  ℝ (62)
x + iy z
with the basis

  1 0 0 i 0 1
u∗1 , u∗2 , u∗3 g :∗
u∗1 ¼ ∗
, u2 ¼ and u∗3 ¼
0 1 i 0 1 0
(63)
Let us consider D ¼ {z  ℂ/j zj < 1} be the open unit disk of Poincare. For
each ρ > 0, the pair (D, ωρ) is a symplectic homogeneous manifold with
ωρ ¼ 2iρ dz^dz2∗ 2 , where ωρ is invariant under the action:
ð1jzj Þ
SU ð1, 1Þ  D ! D
az + b (64)
ðg, zÞ7!g:z ¼ ∗ ∗
b z+a
This action is transitive and is globally and strongly Hamiltonian. Its gen-
erators are the hamiltonian vector fields associated to the functions:
  1 + jzj2   ρ z  z∗   z + z∗
J 1 z, z∗ ¼ ρ , J 2 z, z∗ ¼ , J 3 z, z∗ ¼ ρ (65)
1  j zj 2 i 1  jzj 2
1  jzj2
The associated moment map $J : D \to \mathfrak{su}(1,1)^*$, defined by $\langle J(z), u_i \rangle = J_i(z, z^*)$, maps $D$ into a coadjoint orbit in $\mathfrak{su}(1,1)^*$. We can then write the moment map as a matrix element of $\mathfrak{su}(1,1)^*$ (Cahen, 2004, 2013; Cishahayo and de Bièvre, 1993; Renaud, 1996):
$$J(z) = J_1(z,z^*)\, u_1^* + J_2(z,z^*)\, u_2^* + J_3(z,z^*)\, u_3^* = \rho \begin{pmatrix} \dfrac{1+|z|^2}{1-|z|^2} & -\dfrac{2z^*}{1-|z|^2} \\[2mm] \dfrac{2z}{1-|z|^2} & -\dfrac{1+|z|^2}{1-|z|^2} \end{pmatrix} \in \mathfrak{g}^* \quad (66)$$
The moment map $J$ is a diffeomorphism of $D$ onto one sheet of the two-sheeted hyperboloid in $\mathfrak{su}(1,1)^*$ determined by the equation $J_1^2 - J_2^2 - J_3^2 = \rho^2$, $J_1 \geq \rho$, with $J_1 u_1^* + J_2 u_2^* + J_3 u_3^* \in \mathfrak{su}(1,1)^*$. We denote by $\mathcal{O}^+_\rho$ the coadjoint orbit $\mathrm{Ad}^*_{\mathrm{SU}(1,1)}$ of $\mathrm{SU}(1,1)$, given by the upper sheet of this two-sheeted hyperboloid. The Kostant-Kirillov-Souriau orbit method associates to each of these coadjoint orbits a representation of the discrete series of $\mathrm{SU}(1,1)$, provided that $\rho$ is a half integer greater than or equal to 1 ($\rho = k/2$, $k \in \mathbb{N}$, $\rho \geq 1$). When the Kostant-Kirillov construction is explicitly carried out, the representation Hilbert spaces $H_\rho$ are realized as closed reproducing-kernel subspaces of $L^2(D, \omega_\rho)$. The Kostant-Kirillov-Souriau orbit method shows that to each such coadjoint orbit of a connected Lie group is associated a unitary irreducible representation of $G$ acting in a Hilbert space $H$.
Souriau has observed that the action of the full Galilean group on the space of motions of an isolated mechanical system is not related to any equilibrium Gibbs state (the open subset of the Lie algebra associated to such Gibbs states is empty). Souriau's main idea was to define Gibbs states for one-parameter subgroups of the Galilean group. We will use the same approach: we consider the action of the Lie group $\mathrm{SU}(1,1)$ on the symplectic manifold $(M, \omega)$ (the Poincaré unit disk), whose momentum map $J$ is such that the open subset
$$\Lambda_\beta = \left\{ \beta \in \mathfrak{g} \,:\, \int_D e^{-\langle J(z), \beta \rangle}\, d\lambda(z) < +\infty \right\}$$
is not empty. This condition is not always satisfied when $(M, \omega)$ is a cotangent bundle, but it is of course satisfied when $M$ is a compact manifold. The idea of Souriau is to consider a one-parameter subgroup of $\mathrm{SU}(1,1)$. A way to parametrize elements of $\mathrm{SU}(1,1)$ is through its Lie algebra: in the neighborhood of the identity element, an element $g \in \mathrm{SU}(1,1)$ can be written as the exponential of an element $\beta$ of its Lie algebra:
$$g = \exp(\varepsilon \beta) \quad \text{with } \beta \in \mathfrak{g} \quad (67)$$

The condition $g^+ M g = M$ for $M = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$ can be expanded for $\varepsilon \ll 1$ and is equivalent to $\beta^+ M + M\beta = 0$, which then implies $\beta = \begin{pmatrix} ir & \eta \\ \eta^* & -ir \end{pmatrix}$, $r \in \mathbb{R}$, $\eta \in \mathbb{C}$. We can observe that $r$ and $\eta = \eta_R + i\eta_I$ contain 3 degrees of freedom, as required. Also, because $\det g = 1$, we get $\mathrm{Tr}(\beta) = 0$.
We exponentiate $\beta$ with the exponential map to get:
$$g = \exp(\varepsilon \beta) = \sum_{k=0}^{\infty} \frac{(\varepsilon \beta)^k}{k!} = \begin{pmatrix} a_\varepsilon(\beta) & b_\varepsilon(\beta) \\ b^*_\varepsilon(\beta) & a^*_\varepsilon(\beta) \end{pmatrix} \quad (68)$$
If we remark that
$$\beta^2 = \begin{pmatrix} ir & \eta \\ \eta^* & -ir \end{pmatrix}\begin{pmatrix} ir & \eta \\ \eta^* & -ir \end{pmatrix} = \left( |\eta|^2 - r^2 \right) I,$$
we can develop the exponential map:
$$g = \exp(\varepsilon \beta) = \begin{pmatrix} \cosh(\varepsilon R) + i r\, \dfrac{\sinh(\varepsilon R)}{R} & \eta\, \dfrac{\sinh(\varepsilon R)}{R} \\[2mm] \eta^*\, \dfrac{\sinh(\varepsilon R)}{R} & \cosh(\varepsilon R) - i r\, \dfrac{\sinh(\varepsilon R)}{R} \end{pmatrix} \quad \text{with } R^2 = |\eta|^2 - r^2 \quad (69)$$
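As a quick numerical sanity check of the closed form (69), one may compare it against a generic matrix exponential; the following sketch (in Python, assuming NumPy/SciPy; not part of the original derivation, with arbitrary sample parameters satisfying $|\eta|^2 - r^2 > 0$) does exactly this.

```python
# Sanity check of Eq. (69): the closed-form exponential of beta in su(1,1)
# versus SciPy's generic matrix exponential. Illustrative sketch only.
import numpy as np
from scipy.linalg import expm

r, eta, eps = 0.3, 0.8 + 0.2j, 1.7                     # |eta|^2 - r^2 > 0 here
beta = np.array([[1j * r, eta], [np.conj(eta), -1j * r]])

R = np.sqrt(abs(eta) ** 2 - r ** 2)                    # R^2 = |eta|^2 - r^2
g = np.cosh(eps * R) * np.eye(2) + (np.sinh(eps * R) / R) * beta

assert np.allclose(expm(eps * beta), g)                # agreement to machine precision
```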
We can observe that one condition is $|\eta|^2 - r^2 > 0$; the subset to consider is then given by
$$\Lambda_\beta = \left\{ \beta = \begin{pmatrix} ir & \eta \\ \eta^* & -ir \end{pmatrix},\ r \in \mathbb{R},\ \eta \in \mathbb{C} \,:\, |\eta|^2 - r^2 > 0 \right\}$$
such that $\int_D e^{-\langle J(z), \beta \rangle}\, d\lambda(z) < +\infty$. The generalized Gibbs states of the full $\mathrm{SU}(1,1)$ group do not exist. However, generalized Gibbs states for the one-parameter subgroups $\exp(\varepsilon \beta)$, $\beta \in \Lambda_\beta$, of the $\mathrm{SU}(1,1)$ group do exist. The generalized Gibbs state associated to $\beta$ remains invariant under the restriction of the action to the one-parameter subgroup of $\mathrm{SU}(1,1)$ generated by $\exp(\varepsilon \beta)$.
To go further, we will develop the Souriau Gibbs density from the Souriau moment map $J(z)$ and the Souriau temperature $\beta \in \Lambda_\beta$. If we set $b = \frac{1}{\sqrt{1-|z|^2}}\begin{pmatrix} 1 \\ -z \end{pmatrix}$, we can write the moment map as:
$$J(z) = \rho \left( 2 M b b^+ - \mathrm{Tr}(M b b^+)\, I \right) \quad \text{with } M = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \quad (70)$$
We can then write the covariant Gibbs density in the unit disk, given by the moment map of the Lie group $\mathrm{SU}(1,1)$ and the geometric temperature in its Lie algebra $\beta \in \Lambda_\beta$:
$$p_{\mathrm{Gibbs}}(z) = \frac{e^{-\langle J(z), \beta \rangle}}{\displaystyle\int_D e^{-\langle J(z), \beta \rangle}\, d\lambda(z)} \quad \text{with } d\lambda(z) = 2i\rho\, \frac{dz \wedge dz^*}{(1-|z|^2)^2} \quad (71)$$
$$p_{\mathrm{Gibbs}}(z) = \frac{e^{-\langle \rho(2Mbb^+ - \mathrm{Tr}(Mbb^+)I),\, \beta \rangle}}{\displaystyle\int_D e^{-\langle J(z), \beta\rangle}\, d\lambda(z)} = \frac{\exp\left(-\left\langle \rho\begin{pmatrix} \frac{1+|z|^2}{1-|z|^2} & -\frac{2z^*}{1-|z|^2} \\[1mm] \frac{2z}{1-|z|^2} & -\frac{1+|z|^2}{1-|z|^2} \end{pmatrix}, \begin{pmatrix} ir & \eta \\ \eta^* & -ir \end{pmatrix}\right\rangle\right)}{\displaystyle\int_D e^{-\langle J(z), \beta \rangle}\, d\lambda(z)} \quad (72)$$
To write the Gibbs density with respect to its statistical moments, we have to express the density with respect to $Q = E[J(z)]$. We then have to invert the relation between $Q$ and $\beta$, replacing the latter variable $\beta = \begin{pmatrix} ir & \eta \\ \eta^* & -ir \end{pmatrix} \in \Lambda_\beta$ by $\beta = \Theta^{-1}(Q) \in \mathfrak{g}$, where $Q = \frac{\partial \Phi(\beta)}{\partial \beta} = \Theta(\beta) \in \mathfrak{g}^*$ with $\Phi(\beta) = -\log \int_D e^{-\langle J(z),\beta\rangle}\, d\lambda(z)$, as deduced from the Legendre transform. The mean moment map is given by:
$$Q = E[J(z)] = E\!\left[\rho \begin{pmatrix} \dfrac{1+|w|^2}{1-|w|^2} & -\dfrac{2w^*}{1-|w|^2} \\[2mm] \dfrac{2w}{1-|w|^2} & -\dfrac{1+|w|^2}{1-|w|^2} \end{pmatrix}\right] \quad \text{where } w \in D \quad (73)$$
We can remark that this Souriau Gibbs density, covariant under the action of the $\mathrm{SU}(1,1)$ Lie group, is reparametrized via the Legendre transform and the moment map so as to define the Gauss density of maximum entropy in the Poincaré unit disk. Souriau's Lie Groups Thermodynamics is very general and could be applied to other groups or other homogeneous manifolds on which a Lie group acts transitively.
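To make the preceding formulas concrete, the following numerical sketch (Python with NumPy; an illustration under the sign conventions adopted in (65), (66) and (70), not part of the original text) checks that the moment map sends the disk onto the upper sheet of the hyperboloid $J_1^2 - J_2^2 - J_3^2 = \rho^2$ and that the rank-one expression (70) reproduces the matrix (66).

```python
# Illustrative check of Eqs. (65), (66), (70) on random points of the disk.
import numpy as np

rho = 1.5
M = np.diag([1.0, -1.0])
rng = np.random.default_rng(0)

for _ in range(100):
    z = 0.95 * np.sqrt(rng.uniform()) * np.exp(2j * np.pi * rng.uniform())
    d = 1.0 - abs(z) ** 2
    J1 = rho * (1 + abs(z) ** 2) / d
    J2 = 2 * rho * z.imag / d
    J3 = -2 * rho * z.real / d
    # upper sheet of the hyperboloid J1^2 - J2^2 - J3^2 = rho^2, J1 >= rho
    assert abs(J1 ** 2 - J2 ** 2 - J3 ** 2 - rho ** 2) < 1e-8 and J1 >= rho

    J = (rho / d) * np.array([[1 + abs(z) ** 2, -2 * np.conj(z)],
                              [2 * z, -(1 + abs(z) ** 2)]])
    b = np.array([[1.0], [-z]]) / np.sqrt(d)              # the vector b of Eq. (70)
    J_rank1 = rho * (2 * M @ b @ b.conj().T
                     - np.trace(M @ b @ b.conj().T) * np.eye(2))
    assert np.allclose(J, J_rank1)                        # Eq. (70) reproduces Eq. (66)
```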

5.2 Gauss density on Siegel unit disk covariant with respect to the SU(n,n) Lie group

In the previous section, we have addressed the Gauss density in the Poincaré unit disk. To address the computation of the covariant Gibbs density for the Siegel unit disk $SD_n = \{ Z \in \mathrm{Mat}(n, \mathbb{C}) : I_n - ZZ^+ > 0 \}$, we will consider in this section the $\mathrm{SU}(p,q)$ unitary group, with $p = q = n$ (Berezin, 1975; Chevallier et al., 2016; Hua, 1963; Leverrier, 2018; Nielsen, 2020; Siegel, 1943):
$$G = \mathrm{SU}(n,n) \quad \text{and} \quad K = S(U(n) \times U(n)) = \left\{ \begin{pmatrix} A & 0 \\ 0 & D \end{pmatrix} : A \in U(n),\ D \in U(n),\ \det(A)\det(D) = 1 \right\}$$
We can use the following decomposition for $g \in G^{\mathbb{C}}$ (the complexification of $G$), and consider its action on the Siegel unit disk:
$$g = \begin{pmatrix} A & B \\ C & D \end{pmatrix} \in G^{\mathbb{C}}, \qquad g = \begin{pmatrix} I_n & BD^{-1} \\ 0 & I_n \end{pmatrix}\begin{pmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{pmatrix}\begin{pmatrix} I_n & 0 \\ D^{-1}C & I_n \end{pmatrix} \quad (74)$$
Benjamin Cahen has studied this case and introduced the moment map by identifying $\mathfrak{g}^*$ $G$-equivariantly with $\mathfrak{g}$ by means of the Killing form $\beta$ on $\mathfrak{g}^{\mathbb{C}}$, $\beta(X,Y) = 2(p+q)\,\mathrm{Tr}(XY)$. The set of all elements of $\mathfrak{g}$ fixed by $K$ is $\mathfrak{h}$:
$$\xi_0 \in \mathfrak{h}, \quad \xi_0 = i\lambda \begin{pmatrix} n I_n & 0 \\ 0 & -n I_n \end{pmatrix} \;\Rightarrow\; \langle \xi_0, [Z, Z^+] \rangle = 2i\lambda (2n)^2\, \mathrm{Tr}(ZZ^+), \quad \forall Z \in D \quad (75)$$
The equivariant moment map is then given by:
$$\forall X \in \mathfrak{g}^{\mathbb{C}},\ Z \in D: \quad \psi(Z) = \mathrm{Ad}^*\!\big(\exp(Z^+)\,\zeta(\exp Z^+ \exp Z)\big)\,\xi_0; \qquad \forall g \in G,\ Z \in D: \quad \psi(g \cdot Z) = \mathrm{Ad}^*_g\, \psi(Z), \quad (76)$$
and $\psi$ is a diffeomorphism from $SD_n$ onto the orbit $\mathcal{O}(\xi_0)$, with
$$\psi(Z) = i\lambda \begin{pmatrix} (I_n - ZZ^+)^{-1}(nZZ^+ + nI_n) & (2n)\, Z (I_n - Z^+Z)^{-1} \\ -(2n)(I_n - Z^+Z)^{-1} Z^+ & -(nI_n + nZ^+Z)(I_n - Z^+Z)^{-1} \end{pmatrix} \quad (77)$$
and
$$\zeta(\exp Z^+ \exp Z) = \begin{pmatrix} I_n & Z(I_n - Z^+Z)^{-1} \\ 0 & I_n \end{pmatrix} \quad (78)$$
The moment map is then given by:
$$J(Z) = \rho_n \begin{pmatrix} (I_n - ZZ^+)^{-1}(I_n + ZZ^+) & -2\, Z^+ (I_n - ZZ^+)^{-1} \\ 2\, (I_n - ZZ^+)^{-1} Z & -(I_n + ZZ^+)(I_n - ZZ^+)^{-1} \end{pmatrix} \in \mathfrak{g}^* \quad (79)$$
The Souriau Gibbs density is then given, with $\beta \in \mathfrak{g}$ and $Z \in SD_n$, by:
$$p_{\mathrm{Gibbs}}(Z) = \frac{\exp\left(-\left\langle \rho_n \begin{pmatrix} (I_n - ZZ^+)^{-1}(I_n + ZZ^+) & -2\, Z^+ (I_n - ZZ^+)^{-1} \\ 2\, (I_n - ZZ^+)^{-1} Z & -(I_n + ZZ^+)(I_n - ZZ^+)^{-1} \end{pmatrix}, \beta \right\rangle\right)}{\displaystyle\int_{SD_n} e^{-\langle J(Z), \beta \rangle}\, d\lambda(Z)}
\quad \text{with} \quad
\begin{cases} \beta = \Theta^{-1}(Q) \in \mathfrak{g} \\ Q = E[J(Z)] \\ Q = \dfrac{\partial \Phi(\beta)}{\partial \beta} = \Theta(\beta) \in \mathfrak{g}^* \end{cases} \quad (80)$$
The Gauss density of an SPD matrix is then obtained via $Z = (Y - I)(Y + I)^{-1}$, $Y \in \mathrm{Sym}(n)^+$. We can remark that Souriau's Lie Groups Thermodynamics is very general and can be used to compute Gauss densities of maximum entropy for Lie groups or homogeneous manifolds.
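As an illustration of this construction (a sketch under the stated parametrization; the scale $\rho_n$ is set to 1, and the normalization integral of (80) is not attempted), the following Python fragment maps an SPD matrix into the Siegel unit disk and evaluates the moment map (79):

```python
# Illustrative sketch: SPD matrix -> Siegel unit disk via Z = (Y-I)(Y+I)^{-1},
# then evaluation of the moment map J(Z) of Eq. (79) with rho_n = 1.
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
Y = A @ A.T + n * np.eye(n)                  # a generic SPD matrix

Z = (Y - np.eye(n)) @ np.linalg.inv(Y + np.eye(n))
gram = np.eye(n) - Z @ Z.conj().T
assert np.all(np.linalg.eigvalsh(gram) > 0)  # Z lies in the Siegel unit disk SD_n

P = np.linalg.inv(np.eye(n) - Z @ Z.conj().T)
J = np.block([[P @ (np.eye(n) + Z @ Z.conj().T), -2 * Z.conj().T @ P],
              [2 * P @ Z, -(np.eye(n) + Z @ Z.conj().T) @ P]])
assert abs(np.trace(J)) < 1e-9               # J is trace-free, as an element of g*
```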

5.3 Gauss density on Siegel upper half plane

We conclude by considering the Siegel upper half space $H_n = \{ W = U + iV : U, V \in \mathrm{Sym}(n),\ V > 0 \}$, related to the Siegel unit disk by the Cayley transform $Z = (W + iI_n)^{-1}(W - iI_n)$. We will analyze it as a homogeneous space with respect to the $\mathrm{Sp}(2n,\mathbb{R})/U(n)$ group action, where the symplectic group is given by
$$\mathrm{Sp}(2n, \mathbb{R}) = \left\{ \begin{pmatrix} A & B \\ C & D \end{pmatrix} \in \mathrm{Mat}(2n, \mathbb{R}) : A^T C = C^T A,\ B^T D = D^T B,\ A^T D - C^T B = I_n \right\}$$
and the left action by:
$$\Phi : \mathrm{Sp}(2n,\mathbb{R}) \times H_n \to H_n \quad \text{with} \quad \Phi\left( \begin{pmatrix} A & B \\ C & D \end{pmatrix}, W \right) = (C + DW)(A + BW)^{-1} \quad (81)$$
This action is transitive:
$$\Phi\left( \begin{pmatrix} V^{-1/2} & 0 \\ U V^{-1/2} & V^{1/2} \end{pmatrix}, \, i I_n \right) = U + iV \quad (82)$$
The isotropy subgroup of the element $iI_n \in H_n$ is
$$\mathrm{Sp}(2n, \mathbb{R}) \cap O(2n) = \left\{ \begin{pmatrix} X & Y \\ -Y & X \end{pmatrix} \in \mathrm{Mat}(2n, \mathbb{R}) : X^T X + Y^T Y = I_n,\ X^T Y = Y^T X \right\} \quad (83)$$
which is identified with $U(n)$ by $\mathrm{Sp}(2n,\mathbb{R}) \cap O(2n) \to U(n)$, $\begin{pmatrix} X & Y \\ -Y & X \end{pmatrix} \mapsto X + iY$; then $H_n \cong \mathrm{Sp}(2n, \mathbb{R})/U(n)$, which is also a symplectic manifold with symplectic form:
$$\Omega_{H_n} = d\Theta_{H_n} \quad \text{with one-form} \quad \Theta_{H_n} = \mathrm{tr}\left( U\, dV^{-1} \right) \quad (84)$$
We identify the symplectic algebra $\mathfrak{sp}(2n,\mathbb{R})$ with $\mathrm{sym}(2n,\mathbb{R})$ (Ohsawa and Tronci, 2017) via
$$\varepsilon = \begin{pmatrix} \varepsilon_{11} & \varepsilon_{12} \\ \varepsilon_{12}^T & \varepsilon_{22} \end{pmatrix} \in \mathrm{sym}(2n,\mathbb{R}) \;\mapsto\; \tilde{\varepsilon} = J_{2n}^T\, \varepsilon = \begin{pmatrix} -\varepsilon_{12}^T & -\varepsilon_{22} \\ \varepsilon_{11} & \varepsilon_{12} \end{pmatrix} \in \mathfrak{sp}(2n,\mathbb{R}), \quad (85)$$
where $J_{2n} = \begin{pmatrix} 0 & I_n \\ -I_n & 0 \end{pmatrix}$ is the standard symplectic matrix, with the brackets $[\varepsilon,\delta]_{\mathrm{sym}} = \varepsilon J_{2n}^T \delta - \delta J_{2n}^T \varepsilon$ and $[\tilde{\varepsilon},\tilde{\delta}]_{\mathfrak{sp}} = \tilde{\varepsilon}\tilde{\delta} - \tilde{\delta}\tilde{\varepsilon}$, and the associated inner products $\langle \varepsilon, \delta\rangle_{\mathrm{sym}} = \mathrm{tr}(\varepsilon\delta)$ and $\langle \tilde{\varepsilon}, \tilde{\delta}\rangle_{\mathfrak{sp}} = \mathrm{tr}(\tilde{\varepsilon}^T\tilde{\delta})$. With these inner products, we identify the dual spaces with the spaces themselves: $\mathrm{sym}(2n,\mathbb{R}) = \mathrm{sym}(2n,\mathbb{R})^*$ and $\mathfrak{sp}(2n,\mathbb{R}) = \mathfrak{sp}(2n,\mathbb{R})^*$.
We first define the adjoint operators $\mathrm{Ad}_G\, \varepsilon = (G^{-1})^T \varepsilon\, G^{-1}$ and $\mathrm{Ad}_G\, \tilde{\varepsilon} = G \tilde{\varepsilon} G^{-1}$ with $G \in \mathrm{Sp}(2n,\mathbb{R})$, and $\mathrm{ad}_\varepsilon \delta = \varepsilon J_{2n}^T \delta - \delta J_{2n}^T \varepsilon = [\varepsilon, \delta]_{\mathrm{sym}}$, as well as the coadjoint operators $\mathrm{Ad}^*_{G^{-1}}\, \eta = G \eta G^T$ and $\mathrm{ad}^*_\varepsilon \eta = J_{2n} \varepsilon \eta - \eta \varepsilon J_{2n}$. To the coadjoint orbit $\mathcal{O} = \{\mathrm{Ad}^*_G\, \eta \in \mathrm{sym}(2n,\mathbb{R})^* : G \in \mathrm{Sp}(2n,\mathbb{R})\}$ of each $\eta \in \mathrm{sym}(2n,\mathbb{R})^*$, we can associate a symplectic manifold with the KKS (Kirillov-Kostant-Souriau) 2-form:
$$\Omega_{\mathcal{O}}(\eta)\big(\mathrm{ad}^*_\varepsilon \eta,\, \mathrm{ad}^*_\delta \eta\big) = \big\langle \eta, [\varepsilon, \delta]_{\mathrm{sym}}\big\rangle = \mathrm{tr}\big(\eta\, [\varepsilon, \delta]_{\mathrm{sym}}\big) \quad (86)$$
We then compute the moment map $J : H_n \to \mathrm{sym}(2n,\mathbb{R})^*$ such that $\iota_\varepsilon \Omega_{H_n} = d\langle J(\cdot), \varepsilon\rangle$, given by
$$J(W) = \begin{pmatrix} V^{-1} & V^{-1}U \\ UV^{-1} & UV^{-1}U + V \end{pmatrix} \quad \text{for } W = U + iV \in H_n,$$
from which we deduce:
$$p_{\mathrm{Gibbs}}(W) = \frac{e^{-\langle J(W), \varepsilon\rangle}}{\displaystyle\int_{H_n} e^{-\langle J(W), \varepsilon\rangle}\, d\lambda(W)} \quad \text{with } d\lambda(W) = 2i\rho\, \big(V^{-1} dW\big) \wedge \big(V^{-1} dW^+\big) \quad (87)$$

with
$$\langle J(W), \varepsilon\rangle = \mathrm{Tr}\big(J(W)\,\varepsilon\big) = \mathrm{Tr}\left( \begin{pmatrix} V^{-1} & V^{-1}U \\ UV^{-1} & UV^{-1}U + V \end{pmatrix} \begin{pmatrix} \varepsilon_{11} & \varepsilon_{12} \\ \varepsilon_{12}^T & \varepsilon_{22} \end{pmatrix} \right) \quad (88)$$
and then
$$\langle J(W), \varepsilon\rangle = \mathrm{Tr}\big( V^{-1}\varepsilon_{11} + 2\,UV^{-1}\varepsilon_{12} + (UV^{-1}U + V)\,\varepsilon_{22} \big). \quad (89)$$
To obtain the Gibbs density and the Gauss density for Symmetric Positive Definite matrices, we have to consider the case $W = iV$ with $U = 0$, for which $\langle J(W), \varepsilon\rangle = \mathrm{Tr}\big[V^{-1}\varepsilon_{11} + V\varepsilon_{22}\big]$.
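The reduction from (88) to (89) can be verified numerically; the following sketch (Python/NumPy; illustrative only, with randomly drawn $W = U + iV \in H_2$ and symmetric $\varepsilon$) checks the identity.

```python
# Illustrative numerical check of Eqs. (88)-(89): for W = U + iV in the Siegel
# upper half space and a symmetric epsilon, Tr(J(W) eps) collapses to
# Tr[V^{-1} eps11 + 2 U V^{-1} eps12 + (U V^{-1} U + V) eps22].
import numpy as np

rng = np.random.default_rng(2)
n = 2
U = rng.normal(size=(n, n)); U = (U + U.T) / 2            # symmetric "real part"
B = rng.normal(size=(n, n)); V = B @ B.T + np.eye(n)      # SPD "imaginary part"
E = rng.normal(size=(2 * n, 2 * n)); E = (E + E.T) / 2    # symmetric epsilon
e11, e12, e22 = E[:n, :n], E[:n, n:], E[n:, n:]

Vinv = np.linalg.inv(V)
J = np.block([[Vinv, Vinv @ U], [U @ Vinv, U @ Vinv @ U + V]])

lhs = np.trace(J @ E)
rhs = np.trace(Vinv @ e11 + 2 * U @ Vinv @ e12 + (U @ Vinv @ U + V) @ e22)
assert abs(lhs - rhs) < 1e-9
```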

6 Conclusion

The classical notion of the Gibbs canonical ensemble has been extended by Jean-Marie Souriau to the case of a symplectic manifold on which a Lie group has a symplectic action (a "dynamical group"). Souriau's definition extends a certain number of classical thermodynamic notions: the temperature, which becomes an element of the Lie algebra of the group; the heat, which becomes an element of its dual; and also the inequalities of convexity. In the case of non-commutative groups, particular properties appear: the symmetry is spontaneously broken, and certain relations of cohomological type are verified in the Lie algebra of the group. Various applications could be considered, from covariant or relativistic statistical mechanics to statistics on Lie groups. The Souriau model is a new symplectic model of heat and of Information Geometry.

References
Balian, R., 1991. From Microphysics to Macrophysics, vols. 1-2. Springer Science and Business Media LLC, Berlin/Heidelberg, Germany.
Balian, R., 2003. Introduction à la thermodynamique hors-équilibre. CEA report.
Balian, R., 2014. The entropy-based quantum metric. Entropy 16, 3878–3888.

Balian, R., 2015. François Massieu et les potentiels thermodynamiques. Évolution des disciplines et histoire des découvertes. Académie des Sciences, Paris, France.
Balian, R., Valentin, P., 2001. Hamiltonian structure of thermodynamics with gauge. Eur. Phys.
J. B. 21, 269–282.
Balian, R., Alhassid, Y., Reinhardt, H., 1986. Dissipation in many-body systems: a geometric
approach based on information theory. Phys. Rep. 131, 1–146.
Barbaresco, F., 2019a. Lie Groups Thermodynamics & Souriau-Fisher Metric, SOURIAU 2019
conference. Institut Henri Poincare.
Barbaresco, F., 2019b. Lie Groups Thermodynamics & Souriau-Fisher Metric, SOURIAU 2019
Conference. Institut Henri Poincare.
Barbaresco, F., 2020a. Lie group statistics and lie group machine learning based on Souriau lie
groups Thermodynamics & Koszul-Souriau-Fisher Metric: new entropy definition as
generalized Casimir invariant function in coadjoint representation. Entropy 22, 642.
Barbaresco, F., 2020b. Souriau-Casimir Lie Groups Thermodynamics & Machine Learning, Joint
Structures and Common Foundations of Statistical Physics, Information Geometry and Infer-
ence for Learning, les Houches Summer Week SPIGL’20, 27.
Barbaresco, F., 2021a. Souriau-Casimir Lie groups thermodynamics & machine learning. In: SPIGL'20 Proceedings, Les Houches Summer Week on Joint Structures and Common Foundations of Statistical Physics, Information Geometry and Inference for Learning. Springer Proceedings in Mathematics & Statistics.
Barbaresco, F., 2021c. Jean-Marie Souriau’s Symplectic model of statistical physics: seminal
papers on lie groups thermodynamics - Quod Erat demonstrandum, les Houches SPIGL’20
proceedings. Springer Proceedings in Mathematics & Statistics.
Barbaresco, F., 2021d. Koszul lecture related to geometric and analytic mechanics, Souriau’s Lie
group thermodynamics and information geometry. Information Geometry Journal.
Barbaresco, F., Gay-Balmaz, F., 2020. Lie group Cohomology and (multi)Symplectic integrators:
new geometric tools for lie group machine learning based on Souriau geometric statistical
mechanics. Entropy 22, 498.
Berezin, F.A., 1967. Some remarks about the associated envelope of a Lie algebra. Funct Anal Its
Appl 1, 91–102.
Berezin, F.A., 1975. Quantization in complex symmetric space. Math, USSR Izv 9, 341–379.
Blanc-Lapierre, A., Casal, P., Tortrat, A., 1959. Methodes mathematiques de la mecanique statis-
tique. Masson, Paris.
Cahen, B., 2004. Contraction de SU(1,1) vers le groupe de Heisenberg. Travaux de Mathématiques XV, 19-43.
Cahen, B., 2013. Global parametrization of scalar holomorphic coadjoint orbits of a quasi-Hermitian Lie group. Acta Univ. Palacki. Olomuc., Fac. Rer. Nat., Math. 52, 35-48.

Cartan, É., 1929. Sur les invariants intégraux de certains espaces homogènes clos et les propriétés topologiques de ces espaces. Ann. Soc. Pol. Math. 8, 181-225.
Cartier, P., 1994. Some Fundamental Techniques in the Theory of Integrable Systems, IHES/M/
94/23, SW9421. Available online: https://cds.cern.ch/record/263222/files/P00023319.pdf.
(accessed on 31 May 2020).
Casimir, H.G.B., 1931. Über die Konstruktion einer zu den irreduziblen Darstellungen halbeinfacher kontinuierlicher Gruppen gehörigen Differentialgleichung. Proc. R. Soc. Amsterdam 4.
Chevallier, E., Forget, T., Barbaresco, F., Angulo, J., 2016. Kernel density estimation on the Sie-
gel space with an application to radar processing. Entropy 18, 396.
Cishahayo, C., de Bièvre, S., 1993. On the contraction of the discrete series of SU(1,1). Annales de l'Institut Fourier 43 (2), 551-567.
FGSI'19 Conference, 2019. "Foundations of Geometric Structures of Information", February 2019, IMAG (Institut Montpelliérain Alexander Grothendieck). https://fgsi2019.sciencesconf.org/.
CIRM, 2017. TGSI'17 Conference on "Topological and Geometrical Structures of Information". https://www.mdpi.com/journal/entropy/special_issues/topological_geometrical_info.
Dacunha-Castelle, D., Gamboa, F., 1990. Maximum d'entropie et problème des moments. Annales de l'I.H.P., Section B 26 (4), 567-596.
De Saxce, G., 2016. Link between lie group statistical mechanics and thermodynamics of conti-
nua. Entropy 18, 254.
De Saxcé, G., 2019. Euler-Poincaré equation for Lie groups with non null symplectic cohomology. Application to the mechanics. In: Nielsen, F., Barbaresco, F. (Eds.), GSI 2019, LNCS, vol. 11712. Springer, Berlin, Germany.
De Saxce, G., Marle, C.-M., 2020. Presentation du livre de Jean-Marie Souriau “Structure des sys-
tèmes dynamiques”, preprint.
De Saxce, G., Vallee, C., 2016. Galilean Mechanics and Thermodynamics of Continua. Wiley.
Ecole de Physique des Houches, 2020. SPIGL’20 in July 2020 on “Joint Structures and Common
Foundations of Statistical Physics, Information Geometry and Inference for Learning”.
https://franknielsen.github.io/SPIG-LesHouches2020/.
Engo, K., Faltinsen, S., 2002. Numerical integration of lie–Poisson systems while preserving
coadjoint orbits and energy. SIAM J. Numer. Anal. 39 (1), 128–145.
Gallissot, F., 1952. Les formes extérieures en mécanique. Annales de l'Institut Fourier 4, 145-297.
GSI Conference Cycle, 2013–2021. “Geometric Science of Information” in 2013, 2015, 2017,
2019 et 2021 at Ecole des Mines de Paris, Ecole Polytechnique, ENAC and Sorbonne Uni-
versite. https://franknielsen.github.io/GSI/.
Hua, L.K., 1963. Harmonic Analysis of Functions of Several Complex Variables in the Classical
Domains, Translations of Mathematical Monographs 6. Amer. Math. Soc, RI.
Iglesias, P., 1979. Thermodynamique géométrique appliquée aux configurations tournantes en astrophysique. Thèse de 3ème cycle, Université de Provence.
Iglesias, P., 1995. Itinéraire d'un mathématicien: un entretien avec Jean-Marie Souriau. Le journal de maths des élèves, ENS Lyon.
Kosmann-Schwarzbach, Y., 2013. Simeon-Denis Poisson. Palaiseau, France, Les Mathematiques
au Service de la Science; Ecole Polytechnique.
Koszul, J.L., 1985. Crochet de Schouten-Nijenhuis et cohomologie. Astérisque, numéro hors-série: Élie Cartan et les mathématiques d'aujourd'hui (Lyon, 25-29 juin 1984), pp. 257-271.
Koszul, J.-L., Zou, Y.M., 2019. Introduction to Symplectic Geometry. Springer Science and Business Media LLC, Berlin/Heidelberg, Germany.

Lagrange, J.-L., 1808. Mecanique analytique. La veuve Desaint, Paris.


Leverrier, A., 2018. SU(p,q) coherent states and a gaussian de Finetti theorem. Journal of Mathe-
matical Physics 59, 042202.
Libermann, P., Marle, C.-M., 1987. Symplectic Geometry and Analytical Mechanics. Reidel, Kuf-
stein, Austria.
Lichnerowicz, A., 1977. Les variétés de Poisson et leurs algèbres de Lie associées. J. Differential Geom. 12, 253-300.
Marle, C.-M., 2016. From tools in Symplectic and Poisson geometry to J.-M. Souriau’s theories of
statistical mechanics and thermodynamics. Entropy 18, 370. https://doi.org/10.3390/
e18100370.
Marle, C.M., 2018. Geometrie Symplectique et Geometrie de Poisson. Paris, France, Calvage &
Mounet.
Marle, C.-M., 2019. Projection Stereographique et Moments, Hal-02157930, Version 1. Available
online: https://hal.archives-ouvertes.fr/hal-02157930/. (Access on: 31 May 2020).
Marle, C.-M., 2020b. On Gibbs states of mechanical systems with symmetries. JGSP 57, 45–85.
Marle, C.-M., 2021. On Generalized Gibbs States of Mechanical Systems with Symmetries.
arXiv:2012.00582v2 [math.DG].
MaxEnt’14, 2014. Conference at Amboise in Clos Luce and MaxEnt’22 Conference at Institut
Henri Poincare in Paris. https://web2.see.asso.fr/en/maxent14; https://see.asso.fr/events/
maxent22/.
Mikami, K., 1987. Local lie algebra structure and momentum mapping. J. Math. Soc. Japan
39 (2).
Nielsen, F., 2020. The Siegel–Klein Disk. Entropy 22, 1019.
Noether, E., 1918. Invariante Variationsprobleme. Nachrichten von der Königlichen Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-physikalische Klasse, pp. 235-257.
Ohsawa, T., Tronci, C., 2017. Geometry and dynamics of gaussian wave packets and their Wigner
transforms. Journal of Mathematical Physics 58, 092105.
Poincaré, H., 1901. Sur une forme nouvelle des équations de la mécanique. Comptes rendus des séances de l'Académie des Sciences, pp. 48-51.
Renaud, J., 1996. The contraction of the SU(1,1) discrete series of representations by means of
coherent states. Journ. Math. Phys. 37 (7), 3168–3179.
Satake, I., 1980. Algebraic Structures of Symmetric Domains. Princeton University Press.
Siegel, C.L., 1943. Symplectic geometry. American Journal of Mathematics 65 (1), 1–86.
Souriau, J.M., 1954. Équations canoniques et géométrie symplectique. Publications Scientifiques de l'Université d'Alger, Série A, vol. 1, fasc. 2, pp. 239-265.
Souriau, J.M., 1965. Géométrie de l'espace des phases, calcul des variations et mécanique quantique. Tirage ronéotypé, Faculté des Sciences, Marseille, France.
Souriau, J.-M., 1966. Définition covariante des équilibres thermodynamiques. Supplemento al Nuovo Cimento IV (1), 203-216.
Souriau, J.-M., 1967. Réalisations d'algèbres de Lie au moyen de variables dynamiques. Il Nuovo Cim. A 49, 197-198. https://doi.org/10.1007/bf02739084.
Souriau, J.-M., 1969. Structure des systèmes dynamiques. Dunod.
Souriau, J.-M., 1974. Mécanique statistique, groupes de Lie et cosmologie. In: Colloque International du CNRS "Géométrie symplectique et physique mathématique", Aix-en-Provence (Ed. CNRS, 1976).
Souriau, J.-M., 1975. Géométrie symplectique et physique mathématique, deux conférences de Jean-Marie Souriau, Colloquium de la Société Mathématique de France, Paris (19 février 1975 - 12 novembre 1975).

Souriau, J.-M., 1978. Thermodynamique et geometrie. In: Bleuler, K., Reetz, A. (Eds.), Differen-
tial Geometry Methods in Mathematical Physics II, Proceedings University of Bonn, July
13–16 1977. Springer.
Souriau, J.-M., 1984. Mecanique Classique et Geometrie Symplectique. CNRS-CPT-84/PE.1695.
Souriau, J.-M., 1986. La structure symplectique de la mecanique decrite par Lagrange en 1811.
Math. Sci. Hum. 94, 45–54.
Souriau, J.-M., 1996. Grammaire de la nature.
Souriau, J.-M., 1997. Structure of Dynamical Systems: A Symplectic View of Physics. The Prog-
ress in Mathematics Book Series (PM, Volume 149), Springer.
Souriau, J.-M., 2007. On geometric dynamics. Discrete and Continuous Dynamical Systems Vol-
ume 19 (3), 595–607.
Souriau, B.F., 2019. Exponential map algorithm for machine learning on matrix lie groups. In:
Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information. GSI 2019. Lecture
Notes in Computer Science. vol 11712. Springer.
Stratonovich, R.L., 1957. On distributions in representation space. Soviet Physics JETP 4 (6).
Vialatte, A., 2011. Les gloires silencieuses: Élie Cartan. In: Journalismes, Le Petit Dauphinois 1932-1944. Cahiers Alexandre Vialatte no. 36, pp. 150-160.
Vorob’ev, Y.M., Karasev, M.V., 1988. Poisson manifolds and the Schouten bracket. Funktsional.
Anal. i Prilozhen 22 (1), 1–11. 96.

Further reading
Barbaresco, F., 2021b. Entropy Geometric Structure as Casimir Invariant Function in Coadjoint
Representation: Geometric Theory of Heat & Information Based on Souriau Lie Groups Ther-
modynamics and Lie Algebra Cohomology, Prepared for Encyclopedia of Entropy across the
Disciplines. World Scientific.
Marle, C.-M., 2020a. On Gibbs states of mechanical systems with symmetries. JGSP 57, 45–85.
Chapter 5

A unifying framework for some directed distances in statistics

Michel Broniatowski^a and Wolfgang Stummer^{b,c,*}
^a LPSM, Sorbonne Université, Paris, France
^b Department of Mathematics, University of Erlangen-Nürnberg, Erlangen, Germany
^c School of Business, Economics and Society, University of Erlangen-Nürnberg, Nürnberg, Germany
^* Corresponding author: e-mail: stummer@math.fau.de

Abstract
Density-based directed distances, particularly known as divergences, between probability distributions are widely used in statistics as well as in the adjacent research fields of information theory, artificial intelligence and machine learning. Prominent examples are the Kullback-Leibler information distance (relative entropy), which, e.g., is closely connected to the omnipresent maximum likelihood estimation method, and Pearson's χ²-distance, which, e.g., is used for the celebrated chi-square goodness-of-fit test. Another line of statistical inference is built upon distribution-function-based divergences such as, e.g., the prominent (weighted versions of) Cramér-von Mises test statistics respectively Anderson-Darling test statistics, which are frequently applied for goodness-of-fit investigations; some more recent methods deal with (other kinds of) cumulative paired divergences and closely related concepts. In this paper, we provide a general framework which covers in particular both the above-mentioned density-based and distribution-function-based divergence approaches; the dissimilarity of quantiles respectively of other statistical functionals will be included as well. From this framework, we structurally extract numerous classical and also state-of-the-art (including new) procedures. Furthermore, we deduce new concepts of dependence between random variables, as alternatives to the celebrated mutual information. Some variational representations are discussed, too.
Keywords: Phi-divergences, Scaled Bregman distances, Estimation, Testing, Statistical functionals, Bayesian decision making

[Figure: chapter roadmap. General motivations (Sec. 1): directed distances = divergences (Sec. 1.1), statistical motivations (Sec. 1.2), handling of zeros (Sec. 1.3), motivations from probability theory (Sec. 1.4), divergences and geometry (Sec. 1.5), incentives for extensions (Sec. 1.6): phi-divergences (Csiszár-Ali-Silvey-Morimoto) between density functions resp. probability masses. Universal divergence framework (Sec. 2): statistical functionals (Sec. 2.1), divergence components (Secs. 2.2-2.5), scaled Bregman divergences (Sec. 2.5.1) with (ordinary/classical) Bregman divergences/distances (Sec. 2.5.1.1), (weighted) phi-divergences (Sec. 2.5.1.2) and adaptive scaling (Sec. 2.5.1.3), scaling with two different statistical functionals (Sec. 2.5.2), auto-divergences, e.g., for order statistics (Sec. 2.6), optimal transport/coupling (Sec. 2.7). Aggregated/integrated divergences (Sec. 3), dependence-expressing divergences (Sec. 4), Bayesian contexts (Sec. 5), variational representations (Sec. 6), some further variants (Sec. 7); covering density functions and probability masses, cumulative distribution and tail/survival functions, quantile functions, multidimensional depth and centered rank functions, etc.]
1 Divergences, statistical motivations, and connections to geometry

1.1 Basic requirements on divergences (directed distances)
For a first view, let $P$ and $Q$ be two probability distributions (probability measures). For those, we would like to employ real-valued indices $D(P, Q)$ which quantify the "distance" (dissimilarity, proximity, closeness, discrepancy, discrimination) between $P$ and $Q$. Accordingly, we require $D(\cdot, \cdot)$ to have the following reasonable "minimal/coarse/wide" properties:

(D1) $D(P, Q) \geq 0$ for all $P$, $Q$ under investigation (nonnegativity),
(D2) $D(P, Q) = 0$ if and only if $P = Q$ (reflexivity; identity of indiscernibles^a),

^a See, e.g., Weller-Fahy et al. (2015).

and such a $D(\cdot, \cdot)$ is then called a divergence (in the narrow sense) or disparity or contrast function. Basically, the divergence $D(P, Q)$ of $P$ and $Q$ can be interpreted as a kind of "directed distance from $P$ to $Q$"; the corresponding directedness stems from the fact that in general one has the asymmetry $D(P, Q) \neq D(Q, P)$. This can turn out to be especially useful in contexts where the first distribution $P$ is always/principally of "more importance" or of "higher attention" than the second distribution $Q$; moreover, it can technically happen that $D(P, Q) < \infty$ but $D(Q, P) = \infty$, for instance in practically important applications within a (say) discrete context where $P$ and $Q$ have different zero-valued probability masses (e.g., zero observations); see, e.g., the discussion in Section 1.3.

Notice that we do not assume that the triangle inequality holds for $D(\cdot, \cdot)$.

1.2 Some statistical motivations

To start with, let us consider probability distributions $P$ and $Q$ having strictly positive density functions (densities) $f_P$ and $f_Q$ with respect to some measure $\lambda$ on some (measurable) space $\mathcal{X}$. For instance, if $\lambda := \lambda_L$ is the Lebesgue measure on (some subset of) $\mathcal{X} = \mathbb{R}$, then $f_P$ and $f_Q$ are "classical" (e.g., Gaussian) density functions; in contrast, in the discrete setup where $\mathcal{X} := \mathcal{X}_\#$ has countably many elements and is equipped with the counting measure $\lambda := \lambda_\# := \sum_{z \in \mathcal{X}_\#} \delta_z$ (where $\delta_z$ is Dirac's one-point distribution $\delta_z[A] := \mathbf{1}_A(z)$, where here and in the sequel $\mathbf{1}_A(\cdot)$ stands for the indicator function of a set $A$, and thus $\lambda_\#[\{z\}] = 1$ for all $z \in \mathcal{X}_\#$), $f_P$ and $f_Q$ are probability mass functions (counting-density functions, relative-frequency functions, frequencies).

For such probability measures $P$ and $Q$, let us start with the widely used class $D_\phi(\cdot, \cdot)$ of Csiszár-Ali-Silvey-Morimoto (CASM) divergences (see Ali and Silvey, 1966; Csiszár, 1963; Morimoto, 1963), which are usually abbreviated as $\phi$-divergences and which are defined by
$$0 \leq D_\phi(P, Q) := \int_{\mathcal{X}} f_Q(x) \cdot \phi\!\left( \frac{f_P(x)}{f_Q(x)} \right) d\lambda(x) \quad (1)$$
$$\qquad\qquad\quad\;\, = \int_{\mathcal{X}} \phi\!\left( \frac{f_P(x)}{f_Q(x)} \right) dQ(x), \quad (2)$$
where $\phi : \,]0, \infty[\, \to [0, \infty[$ is a convex function which is strictly convex at 1 and which satisfies $\phi(1) = 0$. It can easily be seen that this $D_\phi(\cdot, \cdot)$ satisfies the above-mentioned requirements/properties/axioms (D1) and (D2). In the above-mentioned discrete setup with $\mathcal{X} = \mathcal{X}_\#$, (1) turns into
$$0 \leq D_\phi(P, Q) = \sum_{x \in \mathcal{X}_\#} f_Q(x) \cdot \phi\!\left( \frac{f_P(x)}{f_Q(x)} \right),$$
whereas in the above-mentioned real-valued absolutely continuous case, the integral in (1) reduces (except for rare cases) to a classical Riemann integral with integrator $d\lambda_L(x) = dx$. Notice that, depending on $\mathcal{X}$, $\phi$, etc., the divergence $D_\phi(P, Q)$ in (1) may become $\infty$. For comprehensive treatments of $\phi$-divergences (CASM divergences), the reader is referred to, e.g., Liese and Vajda (1987), Read and Cressie (1988), Vajda (1989), Liese and Vajda (2006), Pardo (2006), Liese and Miescke (2008), and Basu et al. (2011).
Important prominent special cases of (1) are the omnipresent Kullback-Leibler divergence/distance (relative entropy) with $\phi_{KL}(t) := t \log(t) + 1 - t$ and thus
$$D_{\phi_{KL}}(P, Q) = \int_{\mathcal{X}} f_P(x) \cdot \log\!\left( \frac{f_P(x)}{f_Q(x)} \right) d\lambda(x),$$
the reverse Kullback-Leibler divergence/distance with $\phi_{RKL}(t) := -\log(t) + t - 1$ and hence
$$D_{\phi_{RKL}}(P, Q) = \int_{\mathcal{X}} f_Q(x) \cdot \log\!\left( \frac{f_Q(x)}{f_P(x)} \right) d\lambda(x) = D_{\phi_{KL}}(Q, P), \quad (3)$$
(half of) Pearson's $\chi^2$-distance with $\phi_{PC}(t) := \frac{(t-1)^2}{2}$ and consequently
$$D_{\phi_{PC}}(P, Q) = \frac{1}{2} \int_{\mathcal{X}} \frac{(f_P(x) - f_Q(x))^2}{f_Q(x)} \, d\lambda(x), \quad (4)$$
(half of) Neyman's $\chi^2$-distance with $\phi_{NC}(t) := \frac{(t-1)^2}{2t}$ and thus
$$D_{\phi_{NC}}(P, Q) = \frac{1}{2} \int_{\mathcal{X}} \frac{(f_P(x) - f_Q(x))^2}{f_P(x)} \, d\lambda(x),$$
the (double of the squared) Hellinger distance, also called (half of) the Freeman-Tukey divergence, with $\phi_{HD}(t) := 2(\sqrt{t} - 1)^2$ and hence
$$D_{\phi_{HD}}(P, Q) = 2 \int_{\mathcal{X}} \left( \sqrt{f_P(x)} - \sqrt{f_Q(x)} \right)^2 d\lambda(x),$$
the total variation distance with $\phi_{TV}(t) := |t - 1|$ and consequently
$$D_{\phi_{TV}}(P, Q) = \int_{\mathcal{X}} \left| f_P(x) - f_Q(x) \right| d\lambda(x),$$
and the power divergences $D_{\phi_\alpha}(P, Q)$ (also known as alpha-divergences, Cressie-Read measures/distances, and Tsallis cross-entropies) with $\phi_\alpha(t) := \frac{t^\alpha - \alpha \cdot t + \alpha - 1}{\alpha \cdot (\alpha - 1)}$ ($\alpha \in \mathbb{R} \setminus \{0, 1\}$). Notice that (in the current setup of probability distributions with zero-free density functions) $D_{\phi_{PC}}(P, Q)$ resp. $D_{\phi_{NC}}(P, Q)$ resp. $D_{\phi_{HD}}(P, Q)$ are equal to $D_{\phi_\alpha}(P, Q)$ with $\alpha = 2$ resp. $\alpha = -1$ resp. $\alpha = 1/2$, and that one can prove $D_{\phi_{KL}}(P, Q) = \lim_{\alpha \uparrow 1} D_{\phi_\alpha}(P, Q) =: D_{\phi_1}(P, Q)$ as well as $D_{\phi_{RKL}}(P, Q) = \lim_{\alpha \downarrow 0} D_{\phi_\alpha}(P, Q) =: D_{\phi_0}(P, Q)$; henceforth, we will use this comfortable continuous embedding into a divergence family $(D_{\phi_\alpha}(P, Q))_{\alpha \in \mathbb{R}}$ which covers important special cases.
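For the reader who wishes to experiment, the following minimal sketch (in Python; an illustration, not part of the original text) implements the $\phi$-divergence (1) for zero-free probability mass functions together with the generators just listed; the last line illustrates the continuous embedding $D_{\phi_\alpha} \to D_{\phi_{KL}}$ as $\alpha \uparrow 1$.

```python
# Minimal sketch of the phi-divergence (1) for finite discrete distributions
# with strictly positive probability masses, and the generators listed above.
import numpy as np

def phi_divergence(p, q, phi):
    """D_phi(P,Q) = sum_x q(x) * phi(p(x)/q(x)) for zero-free pmfs p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(q * phi(p / q)))

phi_KL  = lambda t: t * np.log(t) + 1 - t            # Kullback-Leibler
phi_RKL = lambda t: -np.log(t) + t - 1               # reverse Kullback-Leibler
phi_PC  = lambda t: (t - 1) ** 2 / 2                 # (half) Pearson chi^2
phi_HD  = lambda t: 2 * (np.sqrt(t) - 1) ** 2        # squared Hellinger
phi_TV  = lambda t: np.abs(t - 1)                    # total variation
def phi_alpha(a):                                    # power divergences, a != 0, 1
    return lambda t: (t ** a - a * t + a - 1) / (a * (a - 1))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(phi_divergence(p, q, phi_KL))
print(phi_divergence(p, q, phi_alpha(1 + 1e-6)))     # approximately the KL value
```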
From a statistical standpoint, the definition (1) finds motivation in the far-reaching approach of Ali and Silvey (1966): by noting that in a simple model, where a random variable $X$ takes values in a finite discrete set $\mathcal{X} = \mathcal{X}_\#$ and its distribution is either $P$ or $Q$ having probability mass function $f_P$ or $f_Q$, the statistic $\frac{f_P(X)}{f_Q(X)}$ is sufficient (meaning that $P[X = x \,|\, \frac{f_P(X)}{f_Q(X)} = t] = Q[X = x \,|\, \frac{f_P(X)}{f_Q(X)} = t]$ for all $x$ and $t$), they argue that any measurement aiming at inference on the distribution of $X$ should be a function of the likelihood ratio $LR := \frac{f_P(X)}{f_Q(X)}$. Thus, a real-valued coefficient $D(P, Q)$ of closeness/dissimilarity between $P$ and $Q$ should be considered as an aggregation/expectation, over some measure (typically $P$ or $Q$), of a function $\phi$ of $LR$, hence formally leading to (1) with a not necessarily convex function $\phi$. This construction is compatible with the following set of four axioms/requirements, which bear some fundamentals for the construction of a discrimination index between distributions, and which (among other things) imply the convexity of $\phi$:

(A1) $D_\phi(P_1, P_2)$ should be defined for all pairs of probability distributions $P_1$, $P_2$ on the same sample space $\mathcal{X}$.
(A2) Let $x \mapsto t(x)$ be an arbitrary measurable transformation from $(\mathcal{X}, \mathcal{F})$ onto a measure space $(\mathcal{Y}, \mathcal{G})$; then there should hold
$$D_\phi(P_1, P_2) \geq D_\phi(P_1 t^{-1}, P_2 t^{-1}), \quad (5)$$
where $P_i t^{-1}$ denotes the induced measure on $\mathcal{Y}$ corresponding to $P_i$. Notice that (5) is called the data processing inequality or information processing inequality, and, as shown in Ali and Silvey (1966), it implies that $\phi$ should be a convex function.
(A3) $D_\phi(P_1, P_2)$ should take its minimum value when $P_1 = P_2$ and its maximum value when $P_1 \perp P_2$ (i.e., $P_1$ and $P_2$ are singular, in the sense that the supports of the distributions $P_1$ and $P_2$ do not overlap (are disjoint)).
(A4) A further axiom of statistical nature should be satisfied, in relation with a statistical notion of separability of two distributions in a given model. Assume that for a given family of parametric distributions $(P_\theta)_{\theta \in \Theta}$ and for any small risk $\alpha$ the following property holds: if $P_{\theta_0}$ is rejected vs $P_{\theta_1}$ with risk $\leq \alpha$ optimally (Neyman-Pearson approach), then $P_{\theta_0}$ is rejected vs $P_{\theta_2}$ with risk $\leq \alpha$ (meaning that $P_{\theta_2}$ is further away from $P_{\theta_0}$ than $P_{\theta_1}$ is). Then one should have
$$D_\phi(P_{\theta_0}, P_{\theta_2}) \geq D_\phi(P_{\theta_0}, P_{\theta_1}).$$

Notice that in (A4) we use a slight extension of the original requirements of Ali and Silvey (1966) (who employ a monotone likelihood ratio concept).
As a second use-of-divergence incentive stemming from considerations in statistics (as well as in the adjacent research fields of information theory, artificial intelligence and machine learning), we mention parameter estimation in terms of $\phi$-divergence minimization. For this, let $Y$ be a random variable taking values in a finite discrete space $\mathcal{X} := \mathcal{X}_\#$, and let $f_P(x) = P[Y = x]$ be its strictly positive probability mass function under an unknown hypothetical law $P$. Moreover, we assume that $P$ lies in, respectively can be approximated by, a model $\Omega := \{Q_\theta : \theta \in \Theta\}$ ($\Theta \subseteq \mathbb{R}$) being a class of finite discrete parametric distributions having strictly positive probability mass functions $f_{Q_\theta}$ on $\mathcal{X}_\#$. Moreover, let $P_N^{emp} := \frac{1}{N} \cdot \sum_{i=1}^N \delta_{Y_i}[\cdot]$ be the well-known data-derived empirical distribution/measure of an $N$-size independent and identically distributed (i.i.d.) sample/observations $Y_1, \ldots, Y_N$ of $Y$; the corresponding probability mass function is $f_{P_N^{emp}}(x) = \frac{1}{N} \cdot \#\{i \in \{1, \ldots, N\} : Y_i = x\}$, which reflects the underlying (normalized) histogram; here, as usual, $\#A$ denotes the number of elements in a set $A$. In the following, we assume that the sample size $N$ is large enough such that $f_{P_N^{emp}}$ is strictly positive (see the next section for a relaxation).

If the data-generating distribution $P$ lies in $\Omega$, i.e., $P = Q_{\theta_{tr}}$ for some "true" unknown parameter $\theta_{tr} \in \Theta$, then (under some mild technical assumptions) it is easy to show that the corresponding maximum likelihood estimator (MLE) $\hat{\theta}$ is EQUAL to
$$\hat{\hat{\theta}} := \arg\min_{\theta \in \Theta} D_{\phi_0}(Q_\theta, P_N^{emp}),$$
where $\phi_0(t) := -\log(t) + t - 1$ and $D_{\phi_0}(\cdot, \cdot)$ is the reverse Kullback-Leibler divergence already mentioned above. Due to its construction, $\hat{\hat{\theta}}$ is called the minimum reverse-Kullback-Leibler divergence (RKLD) estimator, and $Q_{\hat{\hat{\theta}}}$ is the RKLD-projection of $P_N^{emp}$ on $\Omega$. In the other, also practically important, case where $P$ does not lie in the model $\Omega$ (but is reasonably "close" to it), i.e., the model is misspecified, $Q_{\hat{\hat{\theta}}}$ is still a reasonable proxy of $P$ if the sample size $N$ is large enough.
In the light of the preceding paragraph, it makes sense to consider the more general minimum $\phi$-divergence/distance estimation problem
$$\hat{\hat{\theta}} := \arg\inf_{\theta \in \Theta} D_\phi(Q_\theta, P_N^{emp}), \quad (6)$$
where $\phi$ is not necessarily equal to $\phi_0$; for instance, through some comfortably verifiable criteria on $\phi$ one can end up with an outcoming minimum $\phi$-divergence/distance estimator $\hat{\hat{\theta}}$ which is more robust against outliers than the MLE $\hat{\theta}$ (see, e.g., the residual-adjustment-function approach of Lindsay (1994), its comprehensive treatment in Basu et al. (2011), and the corresponding flexibilizations in Kißlinger and Stummer (2016) and Roensch and Stummer (2017)). Usually, $\hat{\hat{\theta}}$ of (6) is called the minimum $\phi$-divergence estimator (MDE), and $Q_{\hat{\hat{\theta}}}$ is the $\phi$-divergence projection of $P_N^{emp}$ on $\Omega$.
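As a toy illustration of (6) (a sketch under illustrative assumptions: a binomial(6, θ) candidate model and a sample size large enough that the empirical pmf is almost surely zero-free on the support; neither model nor sample sizes are taken from the chapter), the minimum reverse-Kullback-Leibler divergence estimator can be computed as follows.

```python
# Toy sketch of the minimum phi-divergence estimator (6) with the reverse
# Kullback-Leibler generator phi_0(t) = -log(t) + t - 1.
import numpy as np
from scipy.stats import binom
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
sample = rng.binomial(6, 0.4, size=5000)
support = np.arange(7)
f_emp = np.bincount(sample, minlength=7) / sample.size  # empirical pmf (zero-free here)

def d_rkl(p, q):                 # D_{phi_0}(P,Q) for zero-free pmfs p, q
    t = p / q
    return float(np.sum(q * (-np.log(t) + t - 1)))

def objective(theta):            # theta -> D_{phi_0}(Q_theta, P_N^emp)
    return d_rkl(binom.pmf(support, 6, theta), f_emp)

res = minimize_scalar(objective, bounds=(0.01, 0.99), method="bounded")
print(res.x)                     # close to the data-generating value 0.4 (and to the MLE)
```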
A further useful generalization is the "distribution-outcome type" minimum divergence/distance estimation problem
$$\hat{Q} := \arg\inf_{Q \in \Omega} D_\phi(Q, P_N^{emp}), \quad (7)$$
where $P_N^{emp}$ stems from a general (not necessarily parametric, unknown) data-generating distribution $P$, and $\Omega$ may be a "fairly general" model being a class of finite discrete distributions $Q$ having strictly positive probability mass functions $f_Q$ on $\mathcal{X}_\#$ (and, as usual, (7) can be rewritten as a minimization problem on the $(\#\Omega - 1)$-dimensional probability simplex). The outcoming $\hat{Q}$ of (7) is still called the (distribution-type) minimum $\phi$-divergence estimator (MDE), and can be interpreted as the $\phi$-divergence projection of $P_N^{emp}$ on $\Omega$. Problem (7) is in particular beneficial in non- and semiparametric contexts, where $\Omega$ reflects (partially) nonparametrizable model constraints. For instance, $\Omega$ may consist (only) of constraints on moments or on L-moments (see, e.g., Broniatowski and Decurninge (2016)); alternatively, $\Omega$ may be, e.g., a tubular neighborhood of a parametric model (see, e.g., Ghosh and Basu, 2018; Liu and Lindsay, 2009).
The closeness, especially in terms of the sample size $N$, of the data-derived empirical distribution to the model $\Omega$ is quantified by the corresponding minimum
$$D_\phi(\Omega, P_N^{emp}) := \inf_{Q \in \Omega} D_\phi(Q, P_N^{emp}) \quad (8)$$
of (7); thus, it carries useful statistical information. Moreover, under some mild assumptions, $D_\phi(\Omega, P_N^{emp})$ converges to
$$D_\phi(\Omega, P) := \inf_{Q \in \Omega} D_\phi(Q, P), \quad (9)$$
where $P$ is the (unknown) data-generating distribution. In case of $P \in \Omega$ one obtains $D_\phi(\Omega, P) = 0$, whereas for $P \notin \Omega$ the $\phi$-divergence minimum $D_\phi(\Omega, P)$, and thus its approximation $D_\phi(\Omega, P_N^{emp})$, quantifies the adequacy of the model $\Omega$ for modeling $P$; a lower $D_\phi(\Omega, P)$-value means a better adequacy (in the sense of a lower departure between the model and the truth, cf. Lindsay (2004), Lindsay et al. (2008), Markatou and Sofikitou (2018), and Markatou and Chen (2019)).
Hence, especially in the context of model selection/choice (and the related issue of goodness-of-fit testing) within complex big-data contexts, for the search for appropriate models $\Omega$ and model elements/members therein, the (fast and efficient) computation of $D_\phi(\Omega, P)$ and $D_\phi(\Omega, P_N^{emp})$ constitutes a decisive first step: if the latter two are "too large" (respectively "much larger than" $D_\phi(\bar{\Omega}, P)$ and $D_\phi(\bar{\Omega}, P_N^{emp})$ for some competing model $\bar{\Omega}$), then the model $\Omega$ is "not adequate enough" (respectively "much less adequate than" $\bar{\Omega}$). For tackling the computation of $D_\phi(\Omega, P)$ and $D_\phi(\Omega, P_N^{emp})$ on fairly general (e.g., high-dimensional, nonconvex and even highly disconnected) constraint sets $\Omega$, a "precise bare simulation" approach has recently been developed by Broniatowski and Stummer (2021).

For the sake of a compact first glance, in this section we have mainly dealt with finite discrete distributions $P$ and $Q$ having zeros-free probability mass functions. However, with appropriate technical care, one can extend the above concepts also to general discrete distributions with zeros-carrying probability mass functions, and even to nondiscrete (e.g., absolutely continuous) distributions with zeros-carrying density functions. (Only) The correspondingly necessary generalization of the basic $\phi$-divergence definition (1) is addressed in the next section.

1.3 Incorporating density function zeros

Recall that in our first basic $\phi$-divergence definition (1),(2) we have employed probability distributions $P$ and $Q$ having strictly positive density functions $f_P$ and $f_Q$ with respect to some measure $\lambda$ on some (measurable) space $\mathcal{X}$, and consequently $P$ and $Q$ are equivalent. However, in many applications one has to allow $f_P$ and/or $f_Q$ to have zero values. For instance, in the above-mentioned empirical distribution $P_N^{emp}$ for small/medium sample size $N$ (or even large sample size for rare events) one may have $f_{P_N^{emp}}(\tilde{x}) = 0$ for some $\tilde{x}$,^b even though the candidate-model probability mass satisfies $f_{Q_\theta}(\tilde{x}) \neq 0$ for some $\theta \in \Theta$.^c

^b Which corresponds to an empty histogram cell at $\tilde{x}$.
^c If $f_{Q_\theta}(\tilde{x}) = 0$ for all $\theta \in \Theta$, one should certainly reduce the space $\mathcal{X}$ by removing $\tilde{x}$.
Accordingly, we employ the following extension: for probability distributions $P$ and $Q$ having density functions $f_P$ and $f_Q$ with respect to some measure $\lambda$ on some (measurable) space $\mathcal{X}$, one defines the Csiszár-Ali-Silvey-Morimoto (CASM) divergences, in short $\phi$-divergences, by
$$0 \leq D_\phi(P, Q) := \int_{\{f_P \cdot f_Q > 0\}} \phi\!\left( \frac{f_P(x)}{f_Q(x)} \right) dQ(x) + \phi(0) \cdot Q[f_P = 0] + \phi^*(0) \cdot P[f_Q = 0] \quad (10)$$
$$\text{with } \phi(0) \cdot 0 := 0 \text{ and } \phi^*(0) \cdot 0 := 0 \quad (11)$$
(see, e.g., Liese and Vajda, 2006). Here, we have employed (as above) $\phi : \,]0, \infty[\, \to [0, \infty[$ to be a convex function which is strictly convex at 1 and which satisfies $\phi(1) = 0$; moreover, we have used the (always existing) limits $\phi(0) := \lim_{t \downarrow 0} \phi(t) \in \,]0, \infty]$ and $\phi^*(0) := \lim_{t \downarrow 0} \phi^*(t) = \lim_{t \to \infty} \frac{\phi(t)}{t}$ of the so-called $*$-adjoint function $\phi^*(t) := t \cdot \phi(\frac{1}{t})$ ($t \in \,]0, \infty[$). It can be proved that $D_\phi(\cdot, \cdot)$ satisfies the above-mentioned requirements/properties/axioms (D1) and (D2); even more, one gets the following range-of-values assertion (cf. Csiszár (1963), Csiszár (1967), and Vajda (1972); see also, e.g., Liese and Vajda (2006)):

Theorem 1. There holds
$$0 \leq D_\phi(P, Q) \leq \phi(0) + \phi^*(0) \quad \text{for all } P, Q,$$
where (i) the left equality holds only for $P = Q$, and (ii) the right equality holds always for $P \perp Q$ (singularity, i.e., the set $\{f_P > 0\}$ is $\lambda$-a.e. disjoint from the set $\{f_Q > 0\}$) and only for $P \perp Q$ in case of $\phi(0) + \phi^*(0) < \infty$.

A generalization of Theorem 1 to the context of finite (not necessarily probability) measures $P$ and $Q$ is given in Stummer and Vajda (2010); for instance, in a two-sample test situation $P$ and $Q$ may be two generalized empirical distributions which reflect nonnormalized (rather than normalized) histograms.

As an example, let us illuminate the upper bounds $\phi(0) + \phi^*(0)$ of the zeros-incorporating versions of the above-mentioned important power divergence family $(D_{\phi_\alpha}(P, Q))_{\alpha \in \mathbb{R}}$ with $\phi_\alpha(t) := \frac{t^\alpha - \alpha \cdot t + \alpha - 1}{\alpha \cdot (\alpha - 1)}$ ($\alpha \in \mathbb{R} \setminus \{0, 1\}$), $\phi_1(t) := \phi_{KL}(t) = t \log(t) + 1 - t$ and $\phi_0(t) := \phi_{RKL}(t) := -\log(t) + t - 1$. It is easy to see that
$$\phi_\alpha(0) = \phi_{1-\alpha}^*(0) = \begin{cases} \infty, & \text{if } \alpha \leq 0, \\ \frac{1}{\alpha}, & \text{if } \alpha > 0, \end{cases} \quad (12)$$
and hence, for $P \perp Q$,
$$D_{\phi_\alpha}(P, Q) = \phi_\alpha(0) + \phi_\alpha^*(0) = \begin{cases} \infty, & \text{if } \alpha \notin \,]0, 1[, \\ \dfrac{1}{\alpha \cdot (1 - \alpha)}, & \text{if } \alpha \in \,]0, 1[. \end{cases} \quad (13)$$
Especially, for $P \perp Q$ one gets for the Kullback-Leibler divergence $D_{\phi_{KL}}(P, Q) = D_{\phi_1}(P, Q) = \infty$, whereas $D_{\phi_{0.99}}(P, Q) = \frac{10000}{99}$ achieves a finite value; thus, in order to avoid infinities it is more convenient to work with the well-approximating divergence generator $\phi_{0.99}$ instead of $\phi_1$. Similarly, for $P \perp Q$ in the reverse Kullback-Leibler divergence we obtain $D_{\phi_{RKL}}(P, Q) = D_{\phi_0}(P, Q) = \infty$, whereas $D_{\phi_{0.01}}(P, Q) = \frac{10000}{99}$. Furthermore, for $P \perp Q$ one gets for Pearson's $\chi^2$-divergence $D_{\phi_2}(P, Q) = \infty$, for Neyman's $\chi^2$-divergence $D_{\phi_{-1}}(P, Q) = \infty$, and for the (squared) Hellinger distance $D_{\phi_{1/2}}(P, Q) = 4$.
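The following short sketch (Python; illustrative, not from the original text) implements the zeros-incorporating definition (10),(11) for the power-divergence generators with $\alpha \in \,]0,1[$ and confirms the singular-case value $\frac{1}{\alpha(1-\alpha)}$ of (13), e.g., the value 4 at $\alpha = 1/2$.

```python
# Zeros-incorporating power divergence, Eqs. (10)-(13), for alpha in ]0,1[:
# the "core" integral over {f_P f_Q > 0} plus phi_alpha(0) Q[f_P = 0]
# plus phi*_alpha(0) P[f_Q = 0], with phi_alpha(0) = 1/alpha and
# phi*_alpha(0) = phi_{1-alpha}(0) = 1/(1-alpha).
import numpy as np

def power_divergence(p, q, a):                 # 0 < a < 1; p, q discrete pmfs
    p, q = np.asarray(p, float), np.asarray(q, float)
    both = (p > 0) & (q > 0)
    t = p[both] / q[both]
    core = np.sum(q[both] * (t ** a - a * t + a - 1)) / (a * (a - 1))
    return core + (1 / a) * q[p == 0].sum() + (1 / (1 - a)) * p[q == 0].sum()

p = np.array([0.5, 0.5, 0.0, 0.0])             # P and Q with disjoint supports
q = np.array([0.0, 0.0, 0.3, 0.7])
print(power_divergence(p, q, 0.5))             # 4.0: the Hellinger bound in (13)
```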
Returning to the general context, notice that the upper bound $\phi(0) + \phi^*(0)$ in Theorem 1 is independent of $P$ and $Q$, and thus $D_\phi(P, Q)$ is of no discriminative use in statistical situations where $P$ and $Q$ are singular (i.e., $P \perp Q$). This is the case, for instance, in the following commonly encountered "crossover" context:

(CO1) $Y$ is a univariate (absolutely continuous) random variable with unknown hypothetical probability distribution $P$ having strictly positive density function $f_P$ with respect to the Lebesgue measure $\lambda_L$ on $\mathcal{X} = \mathbb{R}$ (recall that this means that $f_P$ is a "classical" (e.g., Gaussian) probability density function),
(CO2) the corresponding model $\Omega := \{Q_\theta : \theta \in \Theta\}$ ($\Theta \subseteq \mathbb{R}$) is a class of parametric distributions having strictly positive probability density functions $f_{Q_\theta}$ with respect to $\lambda_L$, and
(CO3) $P_N^{emp} := \frac{1}{N} \cdot \sum_{i=1}^N \delta_{Y_i}[\cdot]$ is the data-derived empirical distribution of an $N$-size independent and identically distributed (i.i.d.) sample/observations $Y_1, \ldots, Y_N$ of $Y$; recall that the corresponding probability mass function is $f_{P_N^{emp}}(x) = \frac{1}{N} \cdot \#\{i \in \{1, \ldots, N\} : Y_i = x\}$, which is the density function with respect to the counting measure $\lambda_\#$ on the distinct values of the sample.

This contrary density-function behavior can be put into an encompassing framework by employing the joint density-building (i.e., dominating) measure $\lambda := \lambda_L + \lambda_\#$. Clearly, one always has the singularity $P_N^{emp} \perp Q_\theta$, and thus, due to Theorem 1, one gets
$$D_\phi(Q_\theta, P_N^{emp}) = \phi(0) + \phi^*(0) \quad \text{for all } \theta \in \Theta, \qquad \inf_{\theta \in \Theta} D_\phi(Q_\theta, P_N^{emp}) = \phi(0) + \phi^*(0). \quad (14)$$
Accordingly, in such a situation one cannot obtain a corresponding minimum $\phi$-divergence estimator.
Also notice that for power divergences $D_{\phi_\alpha}(P, Q)$ with $\alpha \notin \,]0, 1[$ it can happen that $D_{\phi_\alpha}(P, Q) = \infty$ even though $P$ and $Q$ are not singular (which, due to (13), is consistent with Theorem 1). For instance, consider a situation with two different i.i.d. samples $Y_1, \ldots, Y_N$ of $Y$ having distribution $P$, and $\tilde{Y}_1, \ldots, \tilde{Y}_M$ of $\tilde{Y}$ having distribution $Q$ with (say) $Q \equiv P$ (equivalence); in terms of the corresponding empirical distributions $P_N^{emp} := \frac{1}{N} \cdot \sum_{i=1}^N \delta_{Y_i}[\cdot]$ and $\tilde{P}_M^{emp} := \frac{1}{M} \cdot \sum_{i=1}^M \delta_{\tilde{Y}_i}[\cdot]$, one obtains $D_{\phi_\alpha}(P_N^{emp}, \tilde{P}_M^{emp}) = \infty$ if the set of zeros of the corresponding probability mass function $f_{P_N^{emp}}$ is strictly larger (for $\alpha \leq 0$) respectively smaller (for $\alpha \geq 1$) than the set of zeros of $f_{\tilde{P}_M^{emp}}$ (i.e., $\tilde{P}_M^{emp}[f_{P_N^{emp}} = 0] > 0$ respectively $P_N^{emp}[f_{\tilde{P}_M^{emp}} = 0] > 0$), as can be seen by applying (10), (11), (12). As above, in such a nonsingular situation it is better to use the (in fact, even sample-dependent!) power divergence $D_{\phi_{0.99}}(P_N^{emp}, \tilde{P}_M^{emp})$ instead of the Kullback-Leibler divergence $D_{\phi_1}(P_N^{emp}, \tilde{P}_M^{emp}) = \infty$. Similar infinity effects can be constructed for the above-mentioned other important special cases $\alpha = 0$ (reverse Kullback-Leibler divergence), $\alpha = 2$ (Pearson's $\chi^2$-divergence), $\alpha = -1$ (Neyman's $\chi^2$-divergence), whereas for the case $\alpha = 1/2$ (squared Hellinger distance) everything works out well. Such an approach serves as an alternative to the approach of "lifting/unzeroing/adjusting" (from sampling randomly appearing) zero probability masses^d by pseudo-counts or "smoothing (in a discrete sense)"; see, e.g., Fienberg and Holland (1970), as well as, e.g., Section 4.5 in Jurafsky and Martin (2009) and the references therein.

^d E.g., which correspond to empty cells in sampled histograms, e.g., for rare events and small/medium sample sizes.
Next, we briefly indicate two ways to circumvent the problem described in the above-mentioned crossover context (CO1),(CO2),(CO3), with an illustrative sketch of the second route given after this list:

(GR) grouping (partitioning, quantization) of data: convert^e the model $\Omega$ into a purely discrete context, by subdividing the data point set $\mathcal{X} = \bigcup_{j=1}^{s} A_j$ into countably many, say $s \in \mathbb{N} \cup \{\infty\} \setminus \{1\}$, (measurable) disjoint classes $A_1, \ldots, A_s$ with the property $\lambda_L[A_j] > 0$ ("essential partition"); proceed as in the above general discrete subsetup with $\mathcal{X}^{new} := \{A_1, \ldots, A_s\}$, so that the $i$-th data observation $Y_i(\omega)$ and the corresponding running variable $x$ manifest (only) the corresponding class membership (see, e.g., Vajda and van der Meulen (2011) for a survey on different choices). Some corresponding thorough statistical investigations (such as efficiency, robustness, types of grouping, grouping-error sensitivity) of the corresponding minimum-$\phi$-divergence estimation can be found, e.g., in Victoria-Feser and Ronchetti (1997), Menendez et al. (1998, 2001a,b), Morales et al. (2004, 2006), and Lin and He (2006).
(SM) smoothing of the empirical density function: convert everything to a purely continuous context, by keeping the original data-point set $\mathcal{X}$ and by "continuously modifying" (e.g., with the help of kernels) the empirical density function $f_{P_N^{emp}}(\cdot)$ to a function $f_{P_N^{emp,smo}}(\cdot) > 0$ (a.s.) such that $\int_{\mathcal{X}} f_{P_N^{emp,smo}}(x) \, d\lambda_L(x) = 1$. Some corresponding thorough statistical investigations (such as efficiency, robustness, information loss) of the corresponding minimum-$\phi$-divergence estimation can be found, e.g., in Beran (1977), Basu and Lindsay (1994), Park and Basu (2004), Chapter 3 of Basu et al. (2011), Kuchibhotla and Basu (2015), and Al Mohamad (2018), and the references therein.

^e In several situations, such a conversion can appear in a natural way; e.g., an institution may generate/collect data of "continuous value" but mask them for external data analysts to grouped frequencies, for reasons of confidentiality (information asymmetry).
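The following sketch (Python with SciPy; the bandwidth and quadrature grid are illustrative assumptions, not prescriptions of the chapter) indicates the smoothing route (SM): the empirical distribution is replaced by a Gaussian-kernel density estimate, after which the Kullback-Leibler divergence to a candidate model density is approximated by numerical quadrature.

```python
# Illustrative sketch of route (SM): kernel-smooth the empirical density and
# approximate D_{phi_KL}(P_N^{emp,smo}, Q) on a grid by Riemann quadrature.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(4)
sample = rng.normal(0.3, 1.2, size=500)

f_smooth = gaussian_kde(sample)                 # strictly positive on the grid
grid = np.linspace(-6, 6, 2001)
dx = grid[1] - grid[0]

f_Q = norm.pdf(grid, loc=0.0, scale=1.0)        # a candidate model density
ratio = f_smooth(grid) / f_Q
d_KL = np.sum(f_Q * (ratio * np.log(ratio) + 1 - ratio)) * dx
print(d_KL)                                     # grid approximation of the divergence
```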
In contrast to the above, let us now encounter a crossover situation where (CO1) and (CO3) still hold, but the parametric model assumption (CO2) is replaced by

(CO2') the corresponding model $\Omega := \{Q : Q \text{ satisfies some nonparametric constraints}\}$ is a class of distributions $Q$ which contains both (i) distributions $Q$ having strictly positive probability density functions $f_Q$ with respect to $\lambda_L$, and (ii) all "context-specific appropriate" finite discrete distributions $Q$ (having ideally the same (or at least, smaller or equal) support as $P_N^{emp}$).

The subclasses of $\Omega$ which satisfy (i) respectively (ii) are denoted by $\Omega^{ac}$ respectively $\Omega^{dis}$. Widely applied special cases of (CO2') are nonparametric contexts where $\Omega$ is the class of all distributions on $\mathcal{X} = \mathbb{R}$ satisfying pregiven moment conditions. Suppose that we are interested in the corresponding model-adequacy problem (cf. (9))
$$D_\phi(\Omega^{ac}, P) := \inf_{Q \in \Omega^{ac}} D_\phi(Q, P), \quad (15)$$
where $P$ is the (unknown) data-generating distribution (cf. (CO1)). Recall that in case of $P \in \Omega^{ac}$ one obtains $D_\phi(\Omega^{ac}, P) = 0$, whereas for $P \notin \Omega^{ac}$ the $\phi$-divergence minimum $D_\phi(\Omega^{ac}, P)$ quantifies the adequacy of the model $\Omega^{ac}$ for modeling $P$; a lower $D_\phi(\Omega^{ac}, P)$-value means a better adequacy. Since in the current setup the empirical distribution $P_N^{emp}$ of (CO3) satisfies $P_N^{emp} \perp Q$ for all $Q \in \Omega^{ac}$, we obtain (analogously to (14))
$$D_\phi(Q, P_N^{emp}) = \phi(0) + \phi^*(0) \quad \text{for all } Q \in \Omega^{ac}, \qquad D_\phi(\Omega^{ac}, P_N^{emp}) := \inf_{Q \in \Omega^{ac}} D_\phi(Q, P_N^{emp}) = \phi(0) + \phi^*(0). \quad (16)$$
Hence, statistically it makes no sense to approximate (15) by (16). Let us discuss an appropriate alternative, e.g., for the case of the reverse Kullback-Leibler
an appropriate alternative, e.g., for the case of the reverse Kullback–Leibler
A unifying framework for some directed distances in statistics Chapter 5 157

divergence Dϕ0 ðQ, PÞ with generator ϕ0 ðtÞ ¼ ϕRKL ðtÞ ¼  log ðtÞ + t  1 (cf.(3)).
By (12), we have ϕ0(0) ¼ ∞ as well as ϕ*0 ð0Þ ¼ 1 and thus ϕ0 ð0Þ + ϕ 0 ð0Þ ¼ ∞ as
well as (by (10), (11))
Z !
  f Q ðx Þ h i
emp emp emp
D ϕ0 Q, PN ¼ ϕ0 dPN ðxÞ + ∞  PN fQ ¼ 0
f fQ  fPemp >0g f P ð xÞ
emp
N N
!
1 X fQ ðYi Þ h i
emp
¼  ϕ0 + ∞  PN f Q ¼ 0 < ∞
N fPemp ðYi Þ
fif1, …N g:fQ ðYi Þ  fPemp ðYi Þ>0g N
N

for all Q in Ωdis


N which is defined as the class of distributions in Ω
dis
such that
emp
Q ≪ PN (and thus Q½ fPempN
¼ 0 ¼ 0); also recall that the last term becomes
emp
∞ 0 ¼ 0 in case that Q and PN have the same support. Hence, under the
N is nonvoid, one can approximate the ϕ ¼ ϕ0-version of
assumption that Ωdis
(15) by
emp emp
Dϕ0 ðΩdis
N , PN Þ :¼ inf Dϕ0 ðQ, PN Þ:
QΩdis
N

This is the basic idea of the divergence-minimization formulation of the


so-called empirical likelihood principle of Owen (1988, 1990, 2001), which
leads to many variations according to the choice of the divergence generator
ϕ; see, e.g., Baggerly (1998), Judge and Mittelhammer (2012), Bertail et al.
(2014), Broniatowski and Keziou (2012), and the references therein.
Other ways to circumvent the crossover problem (CO1),(CO2),(CO3) respectively (CO1),(CO2'),(CO3) can be found, e.g., in Section VIII of Liese and Vajda (2006) and Section 4 of Broniatowski and Stummer (2019); moreover, some variational-representation-method approaches will be discussed in Section 6.

As a third statistical incentive, let us mention that with the help of $\phi$-divergence minimization one can build generalizations of exponential families with pregiven sufficient statistics (see, e.g., Gayen and Kumar, 2021; Pelletier, 2011). In the special case of the Kullback-Leibler divergence (i.e., the divergence generator $\phi$ is taken to be $\phi_1(t) = \phi_{KL}(t) = t \log(t) + 1 - t$) one ends up with classical exponential families.
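To recall the underlying mechanism in the Kullback-Leibler case (a standard sketch for orientation, not a derivation from the original text): minimizing $D_{\phi_1}(Q, P)$ over all $Q$ satisfying a moment constraint $E_Q[T] = t_0$ for a given statistic $T$ leads, via a Lagrangian argument, to the minimizer
$$\frac{dQ^*}{dP}(x) = \frac{e^{\theta\, T(x)}}{\int_{\mathcal{X}} e^{\theta\, T(y)}\, dP(y)},$$
with the parameter $\theta$ chosen so that the constraint holds; that is, to an exponential family with sufficient statistic $T$ and reference measure $P$.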

1.4 Some motivations from probability theory

Another environment where $D_\phi(Q, P)$ appears in a natural way is probability theory, in the area of the large deviation paradigm; the celebrated Sanov theorem states that, up to technicalities,
$$\lim_{N \to \infty} \frac{1}{N} \log P\left[ P_N^{emp} \in \Omega \right] = -D_{\phi_1}(\Omega, P),$$
where $P_N^{emp}$ is the above-mentioned empirical distribution of a sample of $N$ independent copies under $P$, $\Omega$ is a class of probability distributions on $(\mathcal{X}, \mathcal{B})$, and $D_{\phi_1}(\Omega, P) := \inf_{Q \in \Omega} D_{\phi_1}(Q, P)$. Therefore, the Kullback-Leibler divergence measures the rate of decay of the chances for $P_N^{emp}$ to belong to $\Omega$ as $N$ increases, in case that $P$ does not belong to $\Omega$. Other divergences inherit the same character: assume that the function $\phi$ is the Fenchel-Legendre transform of a cumulant generating function $\Lambda(t)$, namely
$$\phi(x) = \sup_t \left( t \cdot x - \Lambda(t) \right),$$
where $\Lambda(t) := \log E[e^{tW}]$ for some random variable $W$ defined on some arbitrary space. With $(X_1, \ldots, X_N)$ being an i.i.d. sample under $P$ and $(W_1, \ldots, W_N)$ being an i.i.d. sample of copies of $W$, we define the associated weighted empirical distribution as
$$P_N^W := \frac{1}{N} \sum_{i=1}^N W_i \, \delta_{X_i}.$$
The following type of conditional Sanov theorem holds:
$$\lim_{N \to \infty} \frac{1}{N} \log P\left[ P_N^W \in \Omega \,\middle|\, X_1, \ldots, X_N \right] = -D_\phi(\Omega, P),$$
where $\Omega$ is a class of signed measures on $(\mathcal{X}, \mathcal{B})$ satisfying some regularity assumptions. This result characterizes $D_\phi(\Omega, P)$ as a rate of escape of $P_N^W$ from $\Omega$ when $P$ does not belong to $\Omega$. We refer to Najim (2002), Trashorras and Wintenberger (2014), and Broniatowski and Stummer (2021), where the latter consider several applications of this result to (deterministic as well as statistical) optimization procedures by bootstrap.
Of course, there are connections between statistical inference and $\phi$-divergence-based large deviation results. For instance, the large deviation properties of (types of) the empirical distribution of a sample from its parent distribution are the cornerstone of the asymptotic study of tests. In this realm, the $\phi$-divergences play a significant role when testing between some parametric null hypothesis $\theta \in \Theta_0$ and an alternative $\eta \in \Theta_1$; the corresponding Bahadur slope of a given test statistic indicates the decay of its p-value under the alternative. In "standard" setups, this is connected to the Kullback-Leibler divergence $\inf_{\theta \in \Theta_0} D_{\phi_1}(P_\eta, P_\theta)$ (between the alternative $\eta$ and the set of all null hypotheses), which qualifies the asymptotic efficiency of the statistic in use; see Bahadur (1967, 1971), Hoadley (1967), and also, e.g., Groeneboom and Oosterhoff (1977) and Nikitin (1995). As far as other setups are concerned, Efron and Tibshirani (1993) generally suggest the weighted bootstrap as a valuable approach for testing. In some concrete frameworks, it can be proved that testing in parametric models based on appropriate weighted-bootstrapped $\phi$-divergence test statistics enjoys maximal Bahadur efficiency with respect to any other weighted-bootstrapped test statistics (see Broniatowski, 2021); the corresponding Bahadur slope is related to the specific weighting procedure, and substitutes the Kullback-Leibler divergence by some other $\phi$-divergence, specific to the large deviation properties of the weighted empirical distribution.

1.5 Divergences and geometry


For this section, we return to the general framework of Section 1.1 where we
have defined divergences to satisfy the two properties
(D1) DðP, QÞ  0 for all P, Q under investigation (nonnegativity),
(D2) DðP, QÞ ¼ 0 if and only if P ¼ Q (reflexivity; identity of indiscernibles).
Being interpreted as "directed" distances, the divergences $D(\cdot,\cdot)$ can be connected to geometric issues in various different ways. For the sake of brevity, we mention here only a few of those.
To start with an "all-encompassing view," following the lines of, e.g., Birkhoff (1932) and Millmann and Parker (1991), one can build from
- any set $S$, whose elements can be interpreted as "points," together with
- a collection $L$ of nonempty subsets of $S$, interpreted as "lines" (as a manifestation of a principal sort of structural connectivity between points),
- and an arbitrary symmetric distance $d(\cdot,\cdot)$ on $S \times S$,
an axiomatic constructive framework of geometry which can be of far-reaching nature; therein, $d(\cdot,\cdot)$ basically plays the role of a marked ruler. Accordingly, each triplet $(S, L, d(\cdot,\cdot))$ forms a distinct "quantitative geometric system"; the most prominent classical case is certainly $S = \mathbb{R}^2$ with $L$ as the collection of all vertical and nonvertical lines, equipped with the Euclidean distance $d(\cdot,\cdot)$, hence generating the usual Euclidean geometry in the two-dimensional space. In the case that $d(\cdot,\cdot)$ is only an asymmetric distance (divergence) but not a distance anymore, we propose that some of the outcoming geometric building blocks have to be interpreted in a direction-based way (e.g., the use of $d(\cdot,\cdot)$ as a marked directed ruler, the construction of points of equal divergence from a center viewed as distorted directed spheres). For our special statistical divergence framework $d(\cdot,\cdot) := D(\cdot,\cdot)$, one has to work with $S$ being a family of real-valued functions on $\mathcal{X}$.
Second, from any symmetric distance $d(\cdot,\cdot)$ on a "sufficiently rich" set $S$ and a finite number of (fixed or adaptively flexible) distinct "reference points" $s_i$ ($i = 1,\dots,n$) one can construct the corresponding Voronoi cells $V(s_i)$ by

$$V(s_i) := \{\, z \in S : d(z, s_i) \leq d(z, s_j) \ \text{for all } j = 1,\dots,n \,\}.$$

This produces a tessellation (tiling) of $S$ which is very useful for classification purposes. Of course, the geometric shape of these tessellations is of fundamental importance. In the case that $d(\cdot,\cdot)$ is only an asymmetric distance (divergence), then $V(s_i)$ has to be interpreted as a directed Voronoi cell, and there is also the "reversely directed" alternative

$$\widetilde{V}(s_i) := \{\, z \in S : d(s_i, z) \leq d(s_j, z) \ \text{for all } j = 1,\dots,n \,\}.$$
Recent applications where $S \subseteq \mathbb{R}^d$ and $d(\cdot,\cdot)$ is a Bregman divergence or a more general conformal divergence can be found, e.g., in Boissonnat et al. (2010) and Nock et al. (2016) (and the references therein), where they also deal with the corresponding adaptation of $k$-nearest neighbor classification methods.
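As a small illustration of such directed cell assignments, the following minimal Python sketch (our own illustration, not taken from the cited implementations) labels a point by the "reversely directed" Voronoi cell $\widetilde{V}(s_i)$ it belongs to, under a generalized Kullback–Leibler-type Bregman divergence on the positive orthant:

```python
import numpy as np

def bregman_kl(s, z):
    # generalized KL divergence d(s, z) = sum_k [ s_k*log(s_k/z_k) - s_k + z_k ],
    # i.e., the Bregman divergence of phi(t) = t*log(t) - t + 1, componentwise
    s, z = np.asarray(s, float), np.asarray(z, float)
    return float(np.sum(s * np.log(s / z) - s + z))

def directed_voronoi_label(z, sites, d):
    # index i minimizing d(s_i, z): membership in the reversely directed cell
    return int(np.argmin([d(s, z) for s in sites]))

sites = [np.array([1.0, 1.0]), np.array([3.0, 0.5])]
z = np.array([2.0, 0.8])
print(directed_voronoi_label(z, sites, bregman_kl))
```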
Moreover, with each (say) asymmetric distance (divergence) $d(\cdot,\cdot)$ one can associate a divergence ball $B_d(s,\rho)$ with "center" $s \in S$ and "radius" $\rho \in\, ]0,\infty[$, defined by $B_d(s,\rho) := \{\, z \in S : d(s,z) \leq \rho \,\}$, whereas the corresponding divergence sphere is given by $S_d(s,\rho) := \{\, z \in S : d(s,z) = \rho \,\}$; see, e.g., Csiszar and Breuer (2016) for a use of some divergence balls as a constraint in financial-risk-related decisions. Of course, the "geometry/topology" induced by divergence balls and spheres is generally quite nonobvious; see for instance Roensch and Stummer (2017), who describe and visualize different effects in a 3D setup of scaled Bregman divergences (which will be covered
below). Moreover, the generalization

$$D(\Omega, P^{emp}_N) := \inf_{Q\in\Omega} D(Q, P^{emp}_N), \qquad \widehat{Q} := \arg\inf_{Q\in\Omega} D(Q, P^{emp}_N)$$

of the above-mentioned statistical minimum divergence/distance estimation problems (7), (8) can, e.g., be (loosely) achieved by blowing up the divergence sphere $S_D(P^{emp}_N, \rho)$ through increasing the radius $\rho$ until it first touches the model $\Omega$. Accordingly, there may be an interesting interplay between the geometric/topological properties of both $S_D(P^{emp}_N, \rho)$ and the (e.g., nonconvex, nonsmooth, nonintersection-of-hyperplanes type, complicated-manifold-type) boundary $\partial\Omega$ of $\Omega$ (see, e.g., Roensch and Stummer, 2017).
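To make this "blow up the sphere until it first touches the model" picture concrete, here is a minimal numerical sketch (our own illustration, not from the cited references; the binomial model class $\Omega$, the Kullback–Leibler divergence, and the hypothetical counts are all assumptions for the example). The projection of $P^{emp}_N$ onto $\Omega$ is computed as a one-dimensional constrained minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import binom

# empirical pmf of a sample on {0,...,10} (hypothetical counts)
counts = np.array([1, 3, 8, 15, 20, 21, 14, 10, 5, 2, 1], dtype=float)
p_emp = counts / counts.sum()
xs = np.arange(11)

def kl(q, p):
    # D_{phi_1}(Q, P) on the common support (all entries positive here)
    mask = (p > 0) & (q > 0)
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# model Omega = { Binomial(10, theta) : theta in ]0,1[ }
res = minimize_scalar(lambda th: kl(binom.pmf(xs, 10, th), p_emp),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, res.fun)  # projection parameter and D(Omega, P^emp_N)
```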
Thirdly, consider a framework where $P := \widetilde{P}_{\theta_1}$ and $Q := \widetilde{P}_{\theta_2}$ depend on some parameters $\theta_1 \in \Theta$, $\theta_2 \in \Theta$. The way of dependence of the function (say) $S(\widetilde{P}_\theta)$ on the underlying parameter $\theta$ from an appropriate space $\Theta$ of, e.g., manifold type, may show up directly, e.g., via its operation/functioning as a relevant system indicator, or it may be manifested implicitly, e.g., such that $S(\widetilde{P}_\theta)$ is the solution of an optimization problem with $\theta$-involving constraints. In such a framework, one can induce divergences $D(S(\widetilde{P}_{\theta_1}), S(\widetilde{P}_{\theta_2})) =: f(\theta_1, \theta_2)$ and—under sufficiently smooth dependence—study the corresponding differential-geometric behavior of $f(\cdot,\cdot)$ on $\Theta$. An example is provided by the Kullback–Leibler divergence between two distributions of the same exponential family of distributions, which defines a Bregman divergence on the parameter space. This and related issues are subsumed in the research field of "information geometry"; for comprehensive overviews see, e.g., Amari and Nagaoka (2000), Amari (2016), and Ay et al. (2017). Moreover, for recent connections between divergence-based information geometry and optimal transport the reader is, e.g., referred to Pal and Wong (2016, 2018), Karakida and Amari (2017), Amari et al. (2018), Peyre and Cuturi (2019), and the literature therein.
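For the exponential-family example just mentioned, this identity can be checked numerically. The following minimal Python sketch (our own illustration, for the Poisson subfamily with natural parameter $\theta = \log\lambda$ and log-partition function $\psi(\theta) = e^\theta$) verifies that the Kullback–Leibler divergence coincides with the Bregman divergence of $\psi$ on the natural parameter space:

```python
import numpy as np

# Poisson family in natural parametrization: theta = log(lambda),
# log-partition psi(theta) = exp(theta); psi and its derivative coincide
psi = np.exp
lam1, lam2 = 3.0, 5.0
th1, th2 = np.log(lam1), np.log(lam2)

# Bregman divergence of psi on the parameter space: B_psi(th2, th1)
bregman = psi(th2) - psi(th1) - psi(th1) * (th2 - th1)

# KL(Pois(lam1) || Pois(lam2)) in closed form
kl = lam1 * np.log(lam1 / lam2) - lam1 + lam2

print(np.isclose(bregman, kl))  # True
```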
Further relations of divergences with other approaches to geometry can be
overviewed, e.g., from the wide-range-covering research-article collections in
Nielsen and Bhatia (2013), Nielsen and Barbaresco (2013, 2015, 2017, 2019,
2021), Barbaresco and Nielsen (2021), and Nielsen (2021).
Moreover, geometry also enters as a tool for visualizing quantitative effects on divergences. A more detailed discussion (including also other approaches) on the interplay between statistics and geometry is beyond the scope of this chapter; such discussions will appear in other parts of this book.

1.6 Some incentives for extensions


1.6.1 ϕ-Divergences between other statistical objects
Recall that for probability distributions $P$ and $Q$ having strictly positive density functions $f_P$ and $f_Q$ with respect to some measure $\lambda$ on a data space $\mathcal{X}$ (which covers as special cases both the classical density functions respectively the probability mass functions), we have defined the $\phi$-divergences (CASM divergences) by

$$0 \leq D_\phi(P,Q) := \int_{\mathcal{X}} f_Q(x)\cdot \phi\!\left(\frac{f_P(x)}{f_Q(x)}\right) d\lambda(x) =: D_{\phi,\lambda}(f_P, f_Q), \qquad (17)$$
where the last notation-type term in (17) indicates the interpretation as a $\phi$-divergence between density functions, measuring their similarity. However, e.g., for $\mathcal{X} \subseteq \mathbb{R}$ and the Lebesgue measure $\lambda = \lambda_L$ (and hence almost always $d\lambda_L(x) = dx$), it also makes sense to quantify the dissimilarity—in terms of $\phi$-divergences—between other related "statistical objects," most notably between the information-aggregating cumulative distribution functions $F_P$ and $F_Q$ of $P$ and $Q$. For instance, formally,

$$D_{\phi_{PC},Q}(F_P, F_Q) = \frac{1}{2}\int_{\mathcal{X}} \frac{(F_P(x) - F_Q(x))^2}{F_Q(x)}\, dQ(x)$$

(cf. (4) with $f_P$, $f_Q$ replaced by $F_P$, $F_Q$ and $\lambda = Q$) is—in case of employing the empirical measure $P = P^{emp}_N$—a special member of the family of weighted Cramér–von Mises test statistics (in fact it is a modified Anderson–Darling test statistic of, e.g., Ahmad et al. (1988) and Scott (1999); see also Shin et al. (2012) for applications in environmental extreme value theory).
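A minimal Python sketch of such a cdf-based divergence (our own illustration; the candidate law $Q = N(0,1)$ and the slightly shifted data-generating law are assumptions for the example, and the $dQ$-integral is approximated by Monte Carlo):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N = 200
sample = rng.normal(0.3, 1.0, size=N)   # hypothetical data, a bit off from Q
sorted_sample = np.sort(sample)
F_emp = lambda x: np.searchsorted(sorted_sample, x, side="right") / N

# Q = N(0,1); approximate the dQ-integral by a Monte Carlo average over Q
z = rng.standard_normal(100_000)
FQ = norm.cdf(z)
div = 0.5 * np.mean((F_emp(z) - FQ) ** 2 / FQ)
print(2 * N * div)   # scaled, Anderson-Darling-flavoured statistic
```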
As another incentive, let us mention the use of $\phi$-divergences between quantile functions respectively between "transformations" thereof. For instance, they can be employed in situations where the above-mentioned classical minimum $\phi$-divergence/distance estimation problems (7) and (8)—which involve $\phi$-divergences between density functions—are theoretically and practically intractable; this is, e.g., the case when the model $\Omega$ is defined by constraints on the expectation of an L-statistic (e.g., describing a tubular neighborhood of a distribution with a prescribed number of given quantiles; such constraints are not linear with respect to the underlying distribution of the data, but merely with respect to their quantile measure). In such a situation, one can transpose everything to a minimization problem for the $\phi$-divergence between the corresponding empirical quantile measures, where the constraint can also be stated in terms of quantile measures (see Broniatowski and Decurninge, 2016).
Further examples of ϕ-divergences between other statistical objects can be
found in Section 2.5.1.2.

1.6.2 Some non-ϕ-divergences between probability distributions


In contrast to the preceding section, instead of replacing the probability distributions $P$ and $Q$, let us now keep the latter two but consider some other divergences $D(P,Q)$ (of non-$\phi$-divergence type) of statistical interest. For instance, there is a substantially growing amount of applications of the so-called (ordinary/classical) Bregman distances/divergences OBD

$$0 \leq D^{OBD}_\phi(P,Q) := \int_{\mathcal{X}} \Big[ \phi(f_P(x)) - \phi(f_Q(x)) - \phi'(f_Q(x)) \cdot (f_P(x) - f_Q(x)) \Big]\, d\lambda(x) \qquad (18)$$

(see, e.g., Csiszar, 1991; Pardo and Vajda, 1997, 2003; Stummer and Vajda, 2012), where $\phi'$ is the derivative of the supposedly differentiable $\phi$. The class (18) includes as important special cases, e.g., the density power divergences (also known as Basu–Harris–Hjort–Jones distances, cf. Basu et al. (1998)) with the squared $L_2$-norm as a subcase. The principal types of statistical applications of OBD are basically the same as for the $\phi$-divergences (minimum divergence estimation, robustness, etc.); however, the corresponding technical details may differ substantially.
Concerning some recent progress on divergences, Stummer (2007) as well as Stummer and Vajda (2012) introduced the concept of scaled Bregman divergences/distances SBD

$$0 \leq D^{SBD}_\phi(P,Q) := D^{SBD}_{\phi,\lambda,m}(P,Q) := \int_{\mathcal{X}} \left[ \phi\!\left(\frac{f_P(x)}{m(x)}\right) - \phi\!\left(\frac{f_Q(x)}{m(x)}\right) - \phi'\!\left(\frac{f_Q(x)}{m(x)}\right) \cdot \left(\frac{f_P(x)}{m(x)} - \frac{f_Q(x)}{m(x)}\right) \right] m(x)\, d\lambda(x),$$

which (by using a scaling function $m(\cdot)$) generalizes all the above-mentioned (nearly disjoint) density-based $\phi$-divergences (17) and OBD divergences (18) at once. Hence, the SBD divergence class constitutes a quite general framework for dealing with a wide range of data analyses in a well-structured way.
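A minimal Python sketch of this unifying role (our own illustration, on a common discrete grid with the counting measure as $\lambda$): choosing $m(x) = f_Q(x)$ recovers the $\phi$-divergence (17), while $m(x) \equiv 1$ recovers the ordinary Bregman divergence (18):

```python
import numpy as np

def scaled_bregman(fP, fQ, m, phi, dphi):
    """Discretized scaled Bregman divergence D^{SBD}_{phi,lambda,m}(P,Q):
    sum of [phi(fP/m) - phi(fQ/m) - phi'(fQ/m)*(fP/m - fQ/m)] * m
    over a common grid (counting measure as the reference lambda)."""
    fP, fQ, m = map(lambda a: np.asarray(a, float), (fP, fQ, m))
    sP, sQ = fP / m, fQ / m
    return float(np.sum((phi(sP) - phi(sQ) - dphi(sQ) * (sP - sQ)) * m))

phi2 = lambda t: (t - 1.0) ** 2 / 2.0     # generator phi_2
dphi2 = lambda t: t - 1.0                 # its derivative

fP = np.array([0.2, 0.5, 0.3])
fQ = np.array([0.3, 0.4, 0.3])
# m = fQ gives the Pearson-type phi-divergence; m = 1 the ordinary OBD
print(scaled_bregman(fP, fQ, fQ, phi2, dphi2),
      scaled_bregman(fP, fQ, np.ones(3), phi2, dphi2))
```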
1.6.3 Some non-ϕ-divergences between other statistical objects


Of course, for statistical applications it also makes sense to combine the extension ideas of the two preceding sections. For instance,

$$0 \leq D^{SBD}_{\phi_{PC},Q,m}(F_P, F_Q) = \frac{1}{2}\int_{\mathbb{R}} \frac{(F_P(x) - F_Q(x))^2}{m(x)}\, dQ(x)$$

constitutes—in case of employing the empirical measure $P = P^{emp}_N$—the family of weighted Cramér–von Mises test statistics (see Cramér, 1928; Von Mises, 1931, as well as Smirnov/Smirnoff, 1936).
In the following, we work out an extensive toolkit of divergences between
statistical objects, which goes far beyond the above-mentioned concepts.

2 The framework
2.1 Statistical functionals S and their dissimilarity
Let us assume that the modeled—respectively, observed—random data take values in a state space $\mathcal{Y}$ (with at least two distinct values), which is equipped with a system $\mathcal{A}$ of admissible events ($\sigma$-algebra). On this, we consider two probability distributions (probability measures) $P$ and $Q$ of interest. By appropriate choices of $(\mathcal{Y}, \mathcal{A})$, such a general context also covers the modeling of series of observations, functional data as well as stochastic process data (the latter by choosing $\mathcal{Y}$ as an appropriate space of paths, i.e., of whole scenarios along a set of times).
In the following, we deal with situations where—e.g., in face of the dichotomous uncertainty $P$ versus $Q$—the statistical decision (inference) goal can be, e.g., expressed by means of "dissimilarity-expressing relations" $\mathcal{R}(S(P), S(Q))$ between univariate real-valued "statistical functionals" $S(\cdot)$ of the form $S(P) := \{S_x(P)\}_{x\in\mathcal{X}}$ and $S(Q) := \{S_x(Q)\}_{x\in\mathcal{X}}$ for the two distributions $P$ and $Q$,^f where $\mathcal{X}$ is a set of (at least two different) "functional indices." As corresponding preliminaries, in this section we broadly discuss examples of statistical functionals which we shall employ later on to recover known—respectively create new—divergences between them.
In principle, one can distinguish between unit-free (e.g., "percentage-type") functionals $S(\cdot)$ and unit-dependent (e.g., monetary) functionals $S(\cdot)$. For the real line $\mathcal{Y} = \mathcal{X} = \mathbb{R}$, the most prominent examples of the former are the cumulative distribution functions (cdf) $\{S_x(P)\}_{x\in\mathbb{R}} := \{F_P(x)\}_{x\in\mathbb{R}} := \{P[\,]{-\infty}, x]\,]\}_{x\in\mathbb{R}} =: S^{cd}(P)$, the survival functions (suf) $\{S_x(P)\}_{x\in\mathbb{R}} := \{1 - F_P(x)\}_{x\in\mathbb{R}} := \{P[\,]x, \infty[\,]\}_{x\in\mathbb{R}} =: S^{su}(P)$ (which are also called reliability functions or tail functions), the "classical" probability density functions (pdf) $\{S_x(P)\}_{x\in\mathbb{R}} := \{f_P(x)\}_{x\in\mathbb{R}} := \big\{\frac{dF_P(x)}{dx}\big\}_{x\in\mathbb{R}} =: S^{pd}(P)$, the moment generating functions (mgf) $\{S_x(P)\}_{x\in\mathbb{R}} := \{M_P(x)\}_{x\in\mathbb{R}} := \big\{\int_{\mathcal{Y}} e^{xy}\, dP(y)\big\}_{x\in\mathbb{R}} =: S^{mg}(P)$, and for finite/countable $\mathcal{Y} = \mathcal{X} \subseteq \mathbb{R}$ the probability mass functions (pmf) $\{S_x(P)\}_{x\in\mathcal{X}} := \{p_P(x)\}_{x\in\mathcal{X}} := \{P[\{x\}]\}_{x\in\mathcal{X}} =: S^{pm}(P)$; furthermore, we also cover the centered rank function (cf., e.g., Serfling (2010) and Serfling and Zuo (2010), also called "center-outward distribution function" in, e.g., Hallin (2017) and Hallin et al. (2021)) $\{S_x(P)\}_{x\in\mathbb{R}} := \{2 \cdot F_P(x) - 1\}_{x\in\mathbb{R}} =: S^{cr}(P)$.

^f The statistical functional $S(\cdot)$ can also be thought of as a function-valued "plug-in statistic," respectively as a real-valued function on $\mathcal{X}$ which carries a probability-distribution-valued parameter; accordingly, $S(P)$ and $S(Q)$ are two different functions corresponding to the two different parameter constellations $P$, $Q$, and $S_x(P)$ and $S_x(Q)$ are the corresponding function values at $x \in \mathcal{X}$.
Continuing on the real line, in contrast to the above discussion of unit-free statistical functionals, let us now turn our attention to unit-dependent statistical functionals. For the latter, in case of $\mathcal{Y} = \mathbb{R}$ and $\mathcal{X} = ]0,1[$, the most prominent examples are the univariate quantile functions

$$\{S_x(P)\}_{x\in]0,1[} := \{F_P^{\leftarrow}(x)\}_{x\in]0,1[} := \{\inf\{z\in\mathbb{R} : F_P(z) \geq x\}\}_{x\in]0,1[} =: S^{qu}(P);$$

for $\mathcal{Y} = [0,\infty[$ we take

$$\{S_x(P)\}_{x\in]0,1[} := \{F_P^{\leftarrow}(x)\}_{x\in]0,1[} := \{\inf\{z\in[0,\infty[\, : F_P(z) \geq x\}\}_{x\in]0,1[} =: S^{qu}(P).$$

Of course, if the underlying cdf $z \mapsto F_P(z)$ is strictly increasing, then $x \mapsto F_P^{\leftarrow}(x)$ is nothing but its "classical" inverse function. Let us also mention that in quantitative finance and insurance, the quantile $F_P^{\leftarrow}(x)$ (e.g., quoted in US dollar units) is called the value-at-risk for confidence level $x \cdot 100\%$. A detailed discussion of properties and pitfalls of univariate quantile functions can be found, e.g., in Embrechts and Hofert (2013); see also, e.g., Gilchrist (2000) for a comprehensive survey on quantile functions for practitioners of statistical modeling.

Similarly, the generalized inverse of the centered rank function amounts to the so-called median-oriented quantile function (cf. Serfling, 2006)

$$\{S_x(P)\}_{x\in]-1,1[} := \{(2 \cdot F_P(\cdot) - 1)^{\leftarrow}(x)\}_{x\in]-1,1[} = \left\{F_P^{\leftarrow}\!\left(\frac{1+x}{2}\right)\right\}_{x\in]-1,1[} =: S^{mqu}(P).$$

The sign of $x$ indicates the direction from the median $M_P := F_P^{\leftarrow}\!\big(\frac{1}{2}\big)$.
If the distribution $P$ is generated by some univariate real-valued random variable, say $Y$, then (with a slight abuse of notation) one has the obvious interpretations $F_P(x) = P[Y \leq x]$, $p_P(x) = P[Y = x]$ and $F_P^{\leftarrow}(x) = \inf\{z\in\mathbb{R} : P[Y \leq z] \geq x\}$.
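For the empirical distribution of a finite sample, this generalized inverse is easy to compute; the following minimal Python sketch (our own illustration) returns $F_P^{\leftarrow}(x)$ as the $\lceil Nx\rceil$-th smallest observation:

```python
import numpy as np

def F_inverse(sample, x):
    # F_P^{<-}(x) = inf{ z : F_P(z) >= x } for the empirical distribution,
    # i.e., the ceil(N*x)-th smallest observation
    z = np.sort(np.asarray(sample, float))
    N = len(z)
    k = int(np.ceil(N * x)) - 1      # 0-based index
    return z[max(k, 0)]

sample = [2.0, 0.5, 1.0, 3.5]
print(F_inverse(sample, 0.5))   # 1.0: the empirical median in this convention
```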
Let us mention that for $\mathcal{X} = \mathcal{Y} = \mathbb{R}$ we also cover "integrated statistical functionals" of the form

$$S(P) := \{S_x(P)\}_{x\in\mathbb{R}} := \left\{\int_{-\infty}^{x} \breve{S}_z(P)\, d\breve{\lambda}(z)\right\}_{x\in\mathbb{R}} =: S^{\breve{\lambda},\breve{S}}(P),$$

where $\breve{\lambda}$ is a $\sigma$-finite measure on $\mathbb{R}$ and $\breve{S}(P) := \{\breve{S}_z(P)\}_{z\in\mathbb{R}}$ is a nonnegative respectively $\breve{\lambda}$-integrable statistical functional. For the special cases $S^{Q,S^{cd}}(P)$ (i.e., $\breve{\lambda} = Q$ and $\breve{S} = S^{cd}$) as well as $S^{Q,S^{cd}}(Q)$ in a goodness-of-fit testing context, see, e.g., Henze and Nikitin (2000).
For the multidimensional Euclidean space $\mathcal{Y} = \mathcal{X} = \mathbb{R}^d$ ($d \in \mathbb{N}$), unit-free-type examples are the "classical" cumulative distribution functions (cdf) $\{S_x(P)\}_{x\in\mathbb{R}^d} := \{F_P(x)\}_{x\in\mathbb{R}^d} := \{P[\,]{-\infty}, x]\,]\}_{x\in\mathbb{R}^d} =: S^{cd}(P)$ (which are based on marginal orderings), the "classical" probability density functions (pdf) $\{S_x(P)\}_{x\in\mathbb{R}^d} := \{f_P(x)\}_{x\in\mathbb{R}^d} =: S^{pd}(P)$ (such that $P[\cdot] := \int_{\cdot} f_P(x)\, d\lambda_L(x)$ with $d$-dimensional Lebesgue measure $\lambda_L$), the moment generating functions (mgf) $\{S_x(P)\}_{x\in\mathbb{R}^d} := \{M_P(x)\}_{x\in\mathbb{R}^d} := \big\{\int_{\mathcal{Y}} e^{\langle x,y\rangle}\, dP(y)\big\}_{x\in\mathbb{R}^d} =: S^{mg}(P)$, and for finite/countable $\mathcal{Y} = \mathcal{X} \subseteq \mathbb{R}^d$ the probability mass functions (pmf) $\{S_x(P)\}_{x\in\mathcal{X}} := \{p_P(x)\}_{x\in\mathcal{X}} := \{P[\{x\}]\}_{x\in\mathcal{X}} =: S^{pm}(P)$. Furthermore, we cover statistical depth functions $\{S_x(P)\}_{x\in\mathbb{R}^d} := \{D_P(x)\}_{x\in\mathbb{R}^d} =: S^{de}(P)$ and statistical outlyingness functions $\{S_x(P)\}_{x\in\mathbb{R}^d} := \{O_P(x)\}_{x\in\mathbb{R}^d} =: S^{ou}(P)$, e.g., in the sense of Zuo and Serfling (2000a) (see also Chernozhukov et al., 2017): basically, $x \mapsto D_P(x) \geq 0$^g provides a $P$-based center-outward ordering of points $x \in \mathbb{R}^d$ (in other words, it measures how deep (central) a point $x \in \mathbb{R}^d$ is with respect to $P$), where the point $M_P$ of maximal depth (deepest point, if unique) is interpreted as a multidimensional median and the depth decreases monotonically as $x$ moves away from $M_P$ along any straight line running through the deepest point; moreover, $D_P(\cdot)$ should be affine invariant (in particular, independent of the underlying coordinate system) and vanishing at infinity; in practice, $D_P(\cdot)$ is typically bounded. In essence, higher depth values represent greater "centrality." A corresponding outlyingness function $O_P(\cdot)$ is basically $O_P(\cdot) := f_{OD}(D_P(\cdot))$ for some strictly decreasing (but not necessarily bounded) nonnegative function $f_{OD}$ of $D_P(\cdot)$, such as $O_P(\cdot) := \frac{1}{D_P(\cdot)} - 1$ or $O_P(\cdot) := c \cdot \Big(1 - \frac{D_P(\cdot)}{\sup_{z\in\mathbb{R}^d} D_P(z)}\Big)$ for some constant $c > 0$ (in case that $D_P(\cdot)$ is bounded). Accordingly, $O_P(\cdot)$ provides a $P$-based center-inward ordering of points $x \in \mathbb{R}^d$: higher values represent greater "outlyingness." Since $f_{OD}$ is invertible, one can always "switch equivalently" between $D_P(\cdot)$ and $O_P(\cdot)$. Several examples for $D_P(\cdot)$ and $O_P(\cdot)$ can be found, e.g., in Liu et al. (1999), Zuo and Serfling (2000a,b), and Serfling (2002).

^g There are also versions allowing for negative values, not discussed here.
According to the "D-O-Q-R paradigm" of Serfling (2010), one can link to the univariate/one-dimensional $P$-characteristics $D_P(\cdot)$, $O_P(\cdot)$ two multivariate/$d$-dimensional $P$-characteristics, namely a centered rank function $R_P(\cdot)$ (also called center-outward distribution function in Hallin (2017) and Hallin et al. (2021)) and a quantile function (also called center-outward quantile surface in Liu et al. (1999), and center-outward quantile function in Hallin (2017) and Hallin et al. (2021)), which are inverses of each other. Such a linkage works, e.g., basically as follows: first, one chooses some bounded set $\mathbb{B} \subseteq \mathbb{R}^d$ of "indices," often the $d$-dimensional unit ball $\mathbb{B} := \mathbb{B}^d(0)$, which we henceforth use for the following explanations. Second, a $P$-based quantile function $Q_P : \mathbb{B}^d(0) \to \mathbb{R}^d$ with "full" range $\mathcal{R}(Q_P) = \mathbb{R}^d$ is such that it generates contour sets (level sets) $\mathcal{C}_c := \{Q_P(u) : \|u\| = c\}$, $0 \leq c < 1$ (where $\|\cdot\|$ denotes the Euclidean norm on $\mathbb{R}^d$), which are nested (as $c$ varies increasingly). The most central point $M_P := Q_P(0)$ is interpreted as the $d$-dimensional median. The magnitude $c$ represents a degree of outlyingness for all data points in $\mathcal{C}_c$, with higher $c$-values corresponding to "more extreme data points."^h Third, $R_P$ is taken to be the (possibly multivalued) inverse of $Q_P$.
points in C c , and higher c-values corresponding to “more extreme data
points.”h Third, RP is taken to be the (possibly multivalued) inverse of QP .
For technical purposes, one attempts to use quantile functions QP such that
the contour sets C c are “strictly nested” (in the sense that they do not intersect
for different c’s) such that the inverse function RP : Rd 7! Bd ð0Þ is deter-
mined by uniquely solving the equation y ¼ QP ðuÞ for uBd ð0Þ , for all
yRd . Finally, as a naturally corresponding outlyingness function one can,
e.g., take the magnitude OP(y) :¼ jjRP(y)jj (i.e., the c for which y  C c )
and derive the associated depth function DP ðyÞ ¼ f OD ðOP ðyÞÞ . Since our
divergence framework deals with univariate statistical n functionals,
o we shall
ðiÞ
work with the i-th components fSx ðPÞgxRd :¼ QP ðxÞ ¼: Scqu,i ðPÞ
n o xB
ðiÞ
and fSx ðPÞgxRd :¼ RP ðxÞ d
¼: Scr,i ðPÞ (if1, …, dg) and finally aggre-
xR
gate the results by adding up the correspondingly outcoming d divergences
over i (see, e.g., (57) and (58)).
There are several ways to build up concrete "D-O-Q-R" setups. A recent one, which generates centered $d$-dimensional analogues of the univariate quantile-transform mapping and the reciprocal probability-integral transformation—and which uses Brenier–McCann techniques connected to the Monge–Kantorovich theory of optimal mass transportation—is constructed by Chernozhukov et al. (2017), Hallin (2017), and Hallin et al. (2021) (see also Faugeras and Rüschendorf, 2017; Figalli, 2018): indeed, for absolutely continuous distributions $P$ on $\mathbb{R}^d$ with nonvanishing (Lebesgue) density functions they define $R_P$ as the unique gradient $\nabla\psi$ of a convex function $\psi$—mapping $\mathbb{R}^d$ to $\mathbb{B}^d(0)$ and—"pushing $P$ forward to" the uniform measure $\mathcal{U}(\mathbb{B}^d(0))$ on $\mathbb{B}^d(0)$ (i.e., the distribution of $\nabla\psi$ under $P$ is $\mathcal{U}(\mathbb{B}^d(0))$); as the corresponding quantile function they take the inverse $Q_P := R_P^{\leftarrow}$ of $R_P$. As indicated above, this implies the transformations $Z \sim P$ if and only if $R_P(Z) \sim \mathcal{U}(\mathbb{B}^d(0))$, as well as $U \sim \mathcal{U}(\mathbb{B}^d(0))$ if and only if $Q_P(U) \sim P$. Depth functions for $P$ can be generated from depth functions $D_{\mathcal{U}(\mathbb{B}^d(0))}(\cdot)$ by $D_P(x) := D_{\mathcal{U}(\mathbb{B}^d(0))}(R_P(x))$ ($x \in \mathbb{R}^d$). For $d = 1$, one arrives at the univariate $R_P(x) = R_P^{(1)}(x) = 2 \cdot F_P(x) - 1$, $Q_P^{(1)}(x) = F_P^{\leftarrow}\!\big(\frac{1+x}{2}\big)$, and thus there are the consistencies $S^{cr,1}(P) := \{R_P^{(1)}(x)\}_{x\in\mathbb{R}} = S^{cr}(P)$, $S^{cqu,1}(P) := \{Q_P^{(1)}(x)\}_{x\in[-1,1]} = S^{mqu}(P)$.

^h Notice that this kind of outlyingness concept is intrinsic (with respect to $P$), as opposed to the "relative outlyingness" defined as a degree of mismatch between the frequency of certain data-observation points compared to the corresponding (very much lower) modeling frequency; see, e.g., Lindsay (1994), Basu et al. (2011), and the corresponding flexibilization in Kißlinger and Stummer (2016).
There are also several other different approaches to define multidimensional analogues of quantile functions; see, e.g., Serfling (2002, 2010), Galichon and Henry (2012), and Faugeras and Rüschendorf (2017). All those multivariate quantile functions are also covered by our divergence toolkit, component-wise (with subsequent aggregation).
Let us finally mention that for a general state space $\mathcal{Y}$, as unit-free statistical functionals one can also take for instance families $\{S_x(P)\}_{x\in\mathcal{X}} := \{P[E_x]\}_{x\in\mathcal{X}}$ of probabilities of some particularly selected concrete events $E_x \in \mathcal{A}$ of purpose-driven interest, where $\mathcal{X}$ is some set of indices.
As needed later on, notice that these statistical functionals $S(P) = \{S_x(P)\}_{x\in\mathcal{X}}$ have the following different ranges $\mathcal{R}(S(P))$: $\mathcal{R}(S^{cd}(P)) = \mathcal{R}(S^{su}(P)) = \mathcal{R}(S^{pm}(P)) \subseteq [0,1]$, $\mathcal{R}(S^{pd}(P)) \subseteq [0,\infty[$, $\mathcal{R}(S^{mg}(P)) \subseteq [0,\infty]$, $\mathcal{R}(S^{qu}(P)) \subseteq\, ]{-\infty},\infty[$ (respectively $\mathcal{R}(S^{qu}(P)) \subseteq [0,\infty[$ for nonnegative random variables $Y \geq 0$), $\mathcal{R}(S^{de}(P)) \subseteq [0,\infty]$, $\mathcal{R}(S^{ou}(P)) \subseteq [0,\infty]$, $\mathcal{R}(S^{cr,i}(P)) \subseteq [-1,1]$, $\mathcal{R}(S^{cqu,i}(P)) \subseteq\, ]{-\infty},\infty[$ ($i \in \{1,\dots,d\}$), and $\mathcal{R}(S^{\breve{\lambda},\breve{S}}(P))$ depends on the choice of $\breve{\lambda}$ and $\breve{S}$.
The above-mentioned "dissimilarity-expressing functional relations" $\mathcal{R}(S(P), S(Q))$ can typically be of (i) numerical nature or (ii) graphical/plotting nature, or hybrids thereof. As far as (i) is concerned, for fixed $x \in \mathcal{X}$ the dissimilarity between the real-valued $S_x(P)$ and $S_x(Q)$ can be expressed by (weighted) ratios close to 1, (weighted) differences close to 0, and combinations thereof; this information on "pointwise" dissimilarities can then be compressed to a single real number, e.g., by means of aggregation (weighted summation, weighted integration, etc.) over $x$, or by taking the maximum respectively minimum values with respect to $x$. In contrast, for $\mathcal{X} = \mathbb{R}$ one widespread tool for (ii) is to draw a two-dimensional scatterplot $(S_x(P), S_x(Q))_{x\in\mathcal{X}}$ and evaluate—visually by eyeballing or quantitatively—the dissimilarity in terms of the sizes of deviations from the equality-expressing diagonal $(t,t)$. In the above-mentioned special case of $S_x(P) = F_P(x) = P[\,]{-\infty},x]\,]$, $S_x(Q) = F_Q(x) = Q[\,]{-\infty},x]\,]$, this leads to the well-known "Probability–Probability plot" (PP-plot), whereas the choice $S_x(P) = F_P^{\leftarrow}(x) = \inf\{z\in\mathbb{R} : F_P(z) \geq x\}$, $S_x(Q) = F_Q^{\leftarrow}(x) = \inf\{z\in\mathbb{R} : F_Q(z) \geq x\}$ amounts to the very frequently used "Quantile–Quantile plot" (QQ-plot). Moreover, the choice $S_x(P) = D_P(x)$ and $S_x(Q) = D_Q(x)$ for some $P$-based and $Q$-based depth functions generates the DD-plot in the sense of Liu et al. (1999). As already mentioned above, we follow the line (i), in terms of divergences.
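A minimal Python sketch of the two plots (our own illustration; the slightly misscaled normal sample and the reference law $Q = N(0,1)$ are assumptions for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.3, size=500)   # hypothetical data; Q = N(0,1)

fig, (ax1, ax2) = plt.subplots(1, 2)

# PP-plot: points (F_P(x), F_Q(x)), with P the empirical distribution
xs = np.linspace(-4.0, 4.0, 200)
F_emp = np.searchsorted(np.sort(sample), xs, side="right") / len(sample)
ax1.plot(F_emp, norm.cdf(xs), ".")
ax1.plot([0, 1], [0, 1], "k--")            # equality diagonal (t, t)

# QQ-plot: points (F_P^{<-}(u), F_Q^{<-}(u)) for u in ]0,1[
u = np.linspace(0.01, 0.99, 99)
ax2.plot(np.quantile(sample, u), norm.ppf(u), ".")
ax2.plot([-3, 3], [-3, 3], "k--")          # equality diagonal (t, t)
plt.show()
```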
2.2 The divergences (directed distances) D


Let us now specify the details of the divergences (directed distances) $D(S(P), S(Q))$ which we are going to employ henceforth as dissimilarity measures between the statistical functionals $S(P) := \{S_x(P)\}_{x\in\mathcal{X}}$ and $S(Q) := \{S_x(Q)\}_{x\in\mathcal{X}}$. To begin with, we equip the index space $\mathcal{X}$ with a $\sigma$-algebra $\mathcal{F}$ and a $\sigma$-finite measure $\lambda$ (e.g., a probability measure, the Lebesgue measure, a counting measure); furthermore, we assume that $x \mapsto S_x(P) \in [-\infty,\infty]$ and $x \mapsto S_x(Q) \in [-\infty,\infty]$ are correspondingly measurable functions which satisfy $S_x(P) \in\, ]{-\infty},\infty[$, $S_x(Q) \in\, ]{-\infty},\infty[$ for $\lambda$-almost all (abbreviated as $\lambda$-a.a.) $x\in\mathcal{X}$. For such a context, we quantify the (aggregated) divergence $D(S(P), S(Q)) := D^c_\beta(S(P), S(Q))$ between the two statistical functionals $S(P)$ and $S(Q)$ in terms of the "parameters" $\beta = (\phi, m_1, m_2, m_3, \lambda)$ and $c$ by

$$0 \leq D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) := \int_{\mathcal{X}} \left[ \phi\!\left(\frac{S_x(P)}{m_1(x)}\right) - \phi\!\left(\frac{S_x(Q)}{m_2(x)}\right) - \phi'_{+,c}\!\left(\frac{S_x(Q)}{m_2(x)}\right)\cdot\left(\frac{S_x(P)}{m_1(x)} - \frac{S_x(Q)}{m_2(x)}\right)\right] m_3(x)\, d\lambda(x), \qquad (19)$$

where the meaning of the integral symbol $\int$—as a shortcut of the integral over an appropriate extension of the integrand—will become clear in (21). Here, in accordance with the BS distances of Broniatowski and Stummer (2019)—who flexibilized/widened the concept of scaled Bregman distances of Stummer (2007) and Stummer and Vajda (2012)—we use the following ingredients:
(I1) (measurable) scaling functions $m_1 : \mathcal{X} \to [-\infty,\infty]$ and $m_2 : \mathcal{X} \to [-\infty,\infty]$ as well as a nonnegative (measurable) aggregating function $m_3 : \mathcal{X} \to [0,\infty]$ such that $m_1(x) \in\, ]{-\infty},\infty[$, $m_2(x) \in\, ]{-\infty},\infty[$, $m_3(x) \in [0,\infty[$ for $\lambda$-a.a. $x\in\mathcal{X}$. In analogy with the above notation, we use the symbols $m_i := \{m_i(x)\}_{x\in\mathcal{X}}$ to refer to the whole functions. Let us emphasize that we also allow for adaptive situations, in the sense that all three functions $m_1(x)$, $m_2(x)$, and $m_3(x)$ (evaluated at $x$) may also depend on $S_x(P)$ and $S_x(Q)$; see below. In the following, $\mathcal{R}(G)$ denotes the range (image) of a function $G := \{G(x)\}_{x\in\mathcal{X}}$.
(I2) the so-called divergence generator $\phi$, which is a continuous, convex (finite) function $\phi : E \to\, ]{-\infty},\infty[$ on some appropriately chosen open interval $E = ]a,b[$ such that $[a,b]$ covers (at least) the union $\mathcal{R}\big(\frac{S(P)}{m_1}\big) \cup \mathcal{R}\big(\frac{S(Q)}{m_2}\big)$ of both ranges, $\mathcal{R}\big(\frac{S(P)}{m_1}\big)$ of $\big\{\frac{S_x(P)}{m_1(x)}\big\}_{x\in\mathcal{X}}$ and $\mathcal{R}\big(\frac{S(Q)}{m_2}\big)$ of $\big\{\frac{S_x(Q)}{m_2(x)}\big\}_{x\in\mathcal{X}}$; for instance, $E = ]0,1[$, $E = ]0,\infty[$ or $E = ]{-\infty},\infty[$; the class of all such functions will be denoted by $\Phi(]a,b[)$. Furthermore, we assume that $\phi$ is continuously extended to $\bar\phi : [a,b] \to [-\infty,\infty]$ by setting $\bar\phi(t) := \phi(t)$ for $t \in\, ]a,b[$ as well as $\bar\phi(a) := \lim_{t\downarrow a}\phi(t)$, $\bar\phi(b) := \lim_{t\uparrow b}\phi(t)$ on the two boundary points $t = a$ and $t = b$. The latter two are the only points at which infinite values may appear (e.g., because of division by $m_1(x) = 0$ for some $x$). Moreover, for any fixed $c \in [0,1]$ the (finite) function $\phi'_{+,c} : ]a,b[\, \to\, ]{-\infty},\infty[$ is well-defined by $\phi'_{+,c}(t) := c\cdot\phi'_{+}(t) + (1-c)\cdot\phi'_{-}(t)$, where $\phi'_{+}(t)$ denotes the (always finite) right-hand derivative of $\phi$ at the point $t \in\, ]a,b[$ and $\phi'_{-}(t)$ the (always finite) left-hand derivative of $\phi$ at $t \in\, ]a,b[$. If $\phi \in \Phi(]a,b[)$ is also continuously differentiable—which we denote by $\phi \in \Phi_{C_1}(]a,b[)$—then for all $c \in [0,1]$ one gets $\phi'_{+,c}(t) = \phi'(t)$ ($t \in\, ]a,b[$), and in such a situation we also suppress $+$ as well as $c$ in all the following expressions. We also employ the continuous continuation $\bar\phi'_{+,c} : [a,b] \to [-\infty,\infty]$ given by $\bar\phi'_{+,c}(t) := \phi'_{+,c}(t)$ ($t \in\, ]a,b[$), $\bar\phi'_{+,c}(a) := \lim_{t\downarrow a}\phi'_{+,c}(t)$, $\bar\phi'_{+,c}(b) := \lim_{t\uparrow b}\phi'_{+,c}(t)$. To explain the precise meaning of (19), we also make use of the (finite, nonnegative) function $\psi_{\phi,c} : ]a,b[ \times ]a,b[\, \to [0,\infty[$ given by $\psi_{\phi,c}(s,t) := \phi(s) - \phi(t) - \phi'_{+,c}(t)\cdot(s-t) \geq 0$ ($s,t \in\, ]a,b[$). To extend this to a lower semicontinuous function $\bar\psi_{\phi,c} : [a,b]\times[a,b] \to [0,\infty]$ we proceed as follows: first, we set $\bar\psi_{\phi,c}(s,t) := \psi_{\phi,c}(s,t)$ for all $s,t \in\, ]a,b[$. Moreover, since for fixed $t \in\, ]a,b[$ the function $s \mapsto \psi_{\phi,c}(s,t)$ is convex and continuous, the limit $\bar\psi_{\phi,c}(a,t) := \lim_{s\to a}\psi_{\phi,c}(s,t)$ always exists and (in order to avoid overlines in (19)) will be interpreted/abbreviated as $\phi(a) - \phi(t) - \phi'_{+,c}(t)\cdot(a-t)$. Analogously, for fixed $t \in\, ]a,b[$ we set $\bar\psi_{\phi,c}(b,t) := \lim_{s\to b}\psi_{\phi,c}(s,t)$, with corresponding short-hand notation $\phi(b) - \phi(t) - \phi'_{+,c}(t)\cdot(b-t)$. Furthermore, for fixed $s \in\, ]a,b[$ we interpret $\phi(s) - \phi(a) - \phi'_{+,c}(a)\cdot(s-a)$ as

$$\bar\psi_{\phi,c}(s,a) := \Big[\phi(s) - \phi'_{+,c}(a)\cdot s + \lim_{t\to a}\big(t\cdot\phi'_{+,c}(a) - \phi(t)\big)\Big]\cdot \mathbf{1}_{]-\infty,\infty[}\big(\phi'_{+,c}(a)\big) + \infty\cdot \mathbf{1}_{\{-\infty\}}\big(\phi'_{+,c}(a)\big),$$

where the involved limit always exists but may be infinite. Analogously, for fixed $s \in\, ]a,b[$ we interpret $\phi(s) - \phi(b) - \phi'_{+,c}(b)\cdot(s-b)$ as

$$\bar\psi_{\phi,c}(s,b) := \Big[\phi(s) - \phi'_{+,c}(b)\cdot s + \lim_{t\to b}\big(t\cdot\phi'_{+,c}(b) - \phi(t)\big)\Big]\cdot \mathbf{1}_{]-\infty,\infty[}\big(\phi'_{+,c}(b)\big) + \infty\cdot \mathbf{1}_{\{+\infty\}}\big(\phi'_{+,c}(b)\big),$$

where again the involved limit always exists but may be infinite. Finally, we always set $\bar\psi_{\phi,c}(a,a) := 0$, $\bar\psi_{\phi,c}(b,b) := 0$, and $\bar\psi_{\phi,c}(a,b) := \lim_{s\to a}\bar\psi_{\phi,c}(s,b)$, $\bar\psi_{\phi,c}(b,a) := \lim_{s\to b}\bar\psi_{\phi,c}(s,a)$. Notice that $\bar\psi_{\phi,c}$ is lower semicontinuous but not necessarily continuous. Since ratios are ultimately involved, we also consistently take $\bar\psi_{\phi,c}\big(\frac{0}{0},\frac{0}{0}\big) := 0$.
With (I1) and (I2), we define the BS divergence (BS distance) of (19) precisely as

$$0 \leq D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) = \int_{\mathcal{X}} \psi_{\phi,c}\!\left(\frac{S_x(P)}{m_1(x)}, \frac{S_x(Q)}{m_2(x)}\right) m_3(x)\, d\lambda(x) \qquad (20)$$

$$:= \int_{\mathcal{X}} \bar\psi_{\phi,c}\!\left(\frac{S_x(P)}{m_1(x)}, \frac{S_x(Q)}{m_2(x)}\right) m_3(x)\, d\lambda(x), \qquad (21)$$

but mostly use henceforth the less clumsy notation with $\int$ given in (19), (20), as a shortcut for the implicitly involved boundary behavior.
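For orientation, the following minimal Python sketch (our own illustration) computes the BS divergence for a discrete index set via the pointwise $\psi_{\phi,c}$-values; the boundary conventions of (I2) are deliberately omitted here, i.e., all scaled arguments are assumed to lie in the open interval $]a,b[$:

```python
import numpy as np

def bs_divergence(SP, SQ, m1, m2, m3, phi, dphi_c):
    """Pointwise psi_{phi,c}(s,t) = phi(s) - phi(t) - phi'_{+,c}(t)*(s-t),
    aggregated with m3 over a discrete index set (lambda = counting measure).
    dphi_c is the subderivative mixture phi'_{+,c}; for differentiable phi
    it is simply phi'. Boundary cases of (I2) are not handled in this sketch."""
    s = np.asarray(SP, float) / np.asarray(m1, float)
    t = np.asarray(SQ, float) / np.asarray(m2, float)
    psi = phi(s) - phi(t) - dphi_c(t) * (s - t)
    return float(np.sum(psi * np.asarray(m3, float)))

# example: phi_2(t) = (t-1)^2/2 with m1 = m2 = m3 = 1 yields half of the
# squared l2-distance between the two statistical functionals
phi2, dphi2 = (lambda t: (t - 1) ** 2 / 2), (lambda t: t - 1)
SP, SQ = np.array([0.1, 0.6, 0.3]), np.array([0.2, 0.5, 0.3])
ones = np.ones(3)
print(bs_divergence(SP, SQ, ones, ones, ones, phi2, dphi2))
```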
As a side remark let us mention that we could further generalize (19) by adapting the wider divergence concept of Stummer and Kißlinger (2017) (covering, e.g., nonconvex generators $\phi$), who even deal with nonconvex-nonconcave divergence generators $\phi$; for the sake of brevity, this is omitted here.
Notice that by construction one has the following important assertion (cf. Broniatowski and Stummer, 2019):

Theorem 2. Let $\phi \in \Phi(]a,b[)$ and $c \in [0,1]$. Then there holds $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) \geq 0$ (i.e., the above-mentioned desired property (D1) is satisfied). Moreover, $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) = 0$ if $\frac{S_x(P)}{m_1(x)} = \frac{S_x(Q)}{m_2(x)}$ for $\lambda$-almost all $x\in\mathcal{X}$. Depending on the concrete situation, $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q))$ may take an infinite value.

To get a "sharp identifiability," i.e., the correspondingly adapted version of the above-mentioned desired reflexivity property (D2) in the form of $D^c_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) = 0$ if and only if

$$\frac{S_x(P)}{m_1(x)} = \frac{S_x(Q)}{m_2(x)} \quad \text{for } \lambda\text{-a.a. } x\in\mathcal{X}, \qquad (22)$$

one needs further requirements on $\phi \in \Phi(]a,b[)$ and $c \in [0,1]$; for the rest of the paper, we assume that this sharp identifiability (22) holds.
For instance, the latter is satisfied in a setup where $m_3(x) = w\big(x, \frac{S_x(P)}{m_1(x)}, \frac{S_x(Q)}{m_2(x)}\big)$ for some (measurable) function $w : \mathcal{X}\times[a,b]\times[a,b] \to [0,\infty]$, and the (correspondingly adapted) Assumptions 2 respectively 3 of Broniatowski and Stummer (2019) hold (cf. Theorem 4 respectively Corollary 1 therein); in particular, this means that $\mathcal{R}\big(\frac{S(P)}{m_1}\big) \cup \mathcal{R}\big(\frac{S(Q)}{m_2}\big) \subseteq [a,b]$ and that for all $s \in \mathcal{R}\big(\frac{S(P)}{m_1}\big)$ and all $t \in \mathcal{R}\big(\frac{S(Q)}{m_2}\big)$ the following conditions hold:
- $\phi$ is strictly convex at $t$;
- if $\phi$ is differentiable at $t$ and $s \neq t$, then $\phi$ is not affine-linear on the interval $[\min(s,t), \max(s,t)]$ (i.e., between $t$ and $s$);
- if $\phi$ is not differentiable at $t$, $s > t$ and $\phi$ is affine-linear on $[t,s]$, then we exclude $c = 1$ for the ("globally/universally chosen") subderivative $\phi'_{+,c}(\cdot) = c\cdot\phi'_{+}(\cdot) + (1-c)\cdot\phi'_{-}(\cdot)$;
- if $\phi$ is not differentiable at $t$, $s < t$ and $\phi$ is affine-linear on $[s,t]$, then we exclude $c = 0$ for $\phi'_{+,c}(\cdot)$.
In the following, we discuss several important (classes of ) special cases of
β ¼ (ϕ, m1, m2, m3, λ) in a well-structured way. Let us start with the latter.

2.3 The reference measure λ


In (19), $\lambda$ governs the principal aggregation structure. For instance, if one chooses $\lambda = \lambda_L$ as the Lebesgue measure on $\mathcal{X} \subseteq \mathbb{R}$, then the integral in (19) turns out to be of Lebesgue type and (with some rare exceptions) consequently of Riemann type, with $d\lambda(x) = dx$. In contrast, in the discrete setup where the index set $\mathcal{X} = \mathcal{X}_\#$ has countably many elements and is equipped with the counting measure $\lambda := \lambda_\# := \sum_{z\in\mathcal{X}_\#} \delta_z$ (where $\delta_z$ is Dirac's one-point distribution $\delta_z[A] := \mathbf{1}_A(z)$, and thus $\lambda_\#[\{z\}] = 1$ for all $z\in\mathcal{X}_\#$), then (19) simplifies to

$$0 \leq D^c_{\phi,m_1,m_2,m_3,\lambda_\#}(S(P), S(Q)) := \sum_{z\in\mathcal{X}_\#} \left[ \phi\!\left(\frac{S_z(P)}{m_1(z)}\right) - \phi\!\left(\frac{S_z(Q)}{m_2(z)}\right) - \phi'_{+,c}\!\left(\frac{S_z(Q)}{m_2(z)}\right)\cdot\left(\frac{S_z(P)}{m_1(z)} - \frac{S_z(Q)}{m_2(z)}\right)\right] m_3(z), \qquad (23)$$

which we interpret as $\sum_{z\in\mathcal{X}_\#} \bar\psi_{\phi,c}\big(\frac{S_z(P)}{m_1(z)}, \frac{S_z(Q)}{m_2(z)}\big)\cdot m_3(z)$, with the same conventions and limits as in the paragraph right after (19).

2.4 The divergence generator ϕ


We continue with the inspection of interesting special cases of $\beta = (\phi, m_1, m_2, m_3, \lambda)$ by dealing with the first component. For a divergence generator $\phi \in \Phi_{C_1}(]a,b[)$ (recall that then we suppress the obsolete $c$ and subderivative index $+$), the formula (19) turns into

$$0 \leq D_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) := \int_{\mathcal{X}} \left[ \phi\!\left(\frac{S_x(P)}{m_1(x)}\right) - \phi\!\left(\frac{S_x(Q)}{m_2(x)}\right) - \phi'\!\left(\frac{S_x(Q)}{m_2(x)}\right)\cdot\left(\frac{S_x(P)}{m_1(x)} - \frac{S_x(Q)}{m_2(x)}\right)\right] m_3(x)\, d\lambda(x), \qquad (24)$$

whereas (23) becomes

$$0 \leq D_{\phi,m_1,m_2,m_3,\lambda_\#}(S(P), S(Q)) := \sum_{x\in\mathcal{X}} \left[ \phi\!\left(\frac{S_x(P)}{m_1(x)}\right) - \phi\!\left(\frac{S_x(Q)}{m_2(x)}\right) - \phi'\!\left(\frac{S_x(Q)}{m_2(x)}\right)\cdot\left(\frac{S_x(P)}{m_1(x)} - \frac{S_x(Q)}{m_2(x)}\right)\right] m_3(x).$$
Formally, by defining the integral functional $g_{\phi,m_3,\lambda}(\xi) := \int_{\mathcal{X}} \phi(\xi(x))\cdot m_3(x)\, d\lambda(x)$ and plugging in, e.g., $g_{\phi,m_3,\lambda}\big(\frac{S(P)}{m_1}\big) = \int_{\mathcal{X}} \phi\big(\frac{S_x(P)}{m_1(x)}\big)\cdot m_3(x)\, d\lambda(x)$, the divergence in (24) can be interpreted as

$$0 \leq D_{\phi,m_1,m_2,m_3,\lambda}(S(P), S(Q)) = g_{\phi,m_3,\lambda}\!\left(\frac{S(P)}{m_1}\right) - g_{\phi,m_3,\lambda}\!\left(\frac{S(Q)}{m_2}\right) - g'_{\phi,m_3,\lambda}\!\left(\frac{S(Q)}{m_2},\, \frac{S(P)}{m_1} - \frac{S(Q)}{m_2}\right), \qquad (25)$$

where $g'_{\phi,m_3,\lambda}(\eta,\cdot)$ denotes the corresponding directional derivative at $\eta = \frac{S(Q)}{m_2}$.
An important special case is the following: consider the "nonnegativity setup"

$$\text{(NN0)} \qquad \frac{S_x(P)}{m_1(x)} \geq 0 \ \text{ and } \ \frac{S_x(Q)}{m_2(x)} \geq 0 \quad \text{for all } x\in\mathcal{X};$$

for instance, this always holds for nonnegative scaling functions $m_1$ and $m_2$, in combination with $S^{cd}$, $S^{pd}$, $S^{pm}$, $S^{su}$, $S^{mg}$, $S^{de}$, $S^{ou}$, and for nonnegative real-valued random variables also with $S^{qu}$. Under (NN0), one can take $a = 0$, $b = \infty$, i.e., $E = ]0,\infty[$, and employ the strictly convex power functions

$$\tilde\phi(t) := \tilde\phi_\alpha(t) := \frac{t^\alpha - 1}{\alpha\,(\alpha-1)} \in\, ]{-\infty},\infty[, \quad t \in ]0,\infty[, \ \alpha \in \mathbb{R}\setminus\{0,1\},$$
$$\phi(t) := \phi_\alpha(t) := \tilde\phi_\alpha(t) - \tilde\phi'_\alpha(1)\cdot(t-1) = \frac{t^\alpha - 1}{\alpha\,(\alpha-1)} - \frac{t-1}{\alpha-1} \in [0,\infty[, \quad t \in ]0,\infty[, \ \alpha \in \mathbb{R}\setminus\{0,1\}. \qquad (26)$$

The perhaps most important special case is $\alpha = 2$, for which (26) turns into

$$\phi_2(t) := \frac{(t-1)^2}{2}, \quad t \in ]0,\infty[\, = E. \qquad (27)$$
Also notice that the divergence generator $\phi_2$ of (27) can be trivially extended to

$$\bar\phi_2(t) := \frac{(t-1)^2}{2}, \quad t \in\, ]{-\infty},\infty[\, = E, \qquad (28)$$

which is useful in the general setup

$$\text{(GS)} \qquad \frac{S_x(P)}{m_1(x)} \in [-\infty,\infty] \ \text{ and } \ \frac{S_x(Q)}{m_2(x)} \in [-\infty,\infty] \quad \text{for all } x\in\mathcal{X},$$

which appears for nonnegative scaling functions $m_1$, $m_2$ in combination with $S^{qu}$ for real-valued random variables.
Further examples of everywhere strictly convex divergence generators $\phi$ for the nonnegativity setup (NN0) (i.e., $a = 0$, $b = \infty$, $E = ]0,\infty[$) can be obtained by taking the $\alpha$-limits

$$\tilde\phi_1(t) := \lim_{\alpha\to 1}\tilde\phi_\alpha(t) = t\cdot\log t \in [-e^{-1},\infty[, \quad t\in]0,\infty[, \qquad (29)$$
$$\phi_1(t) := \lim_{\alpha\to 1}\phi_\alpha(t) = \tilde\phi_1(t) - \tilde\phi'_1(1)\cdot(t-1) = t\cdot\log t + 1 - t \in [0,\infty[, \quad t\in]0,\infty[, \qquad (30)$$
$$\tilde\phi_0(t) := \lim_{\alpha\to 0}\tilde\phi_\alpha(t) = -\log t \in\, ]{-\infty},\infty[, \quad t\in]0,\infty[,$$
$$\phi_0(t) := \lim_{\alpha\to 0}\phi_\alpha(t) = \tilde\phi_0(t) - \tilde\phi'_0(1)\cdot(t-1) = -\log t + t - 1 \in [0,\infty[, \quad t\in]0,\infty[. \qquad (31)$$
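These generators are straightforward to implement, including their limit cases; a minimal Python sketch (our own illustration) of the family $\phi_\alpha$ from (26), (30) and (31):

```python
import numpy as np

def phi_alpha(t, alpha):
    # power divergence generators (26), with the limit cases (30) and (31)
    t = np.asarray(t, float)
    if alpha == 1.0:
        return t * np.log(t) + 1.0 - t
    if alpha == 0.0:
        return -np.log(t) + t - 1.0
    return (t**alpha - 1.0) / (alpha * (alpha - 1.0)) - (t - 1.0) / (alpha - 1.0)

# the limits are attained continuously in alpha, e.g., at t = 2:
for a in (1.0 + 1e-5, 1.0, 1e-5, 0.0):
    print(a, phi_alpha(2.0, a))
```

The printed values illustrate that the limits (30) and (31) are indeed attained continuously in $\alpha$.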
A list of extension-relevant (cf. (I2)) properties of the functions $\phi_\alpha$ with $\alpha\in\mathbb{R}$ can be found in Broniatowski and Stummer (2019). The latter also discuss in detail the important but (in our context) technically delicate divergence generator

$$\phi_{TV}(t) := |t - 1|, \qquad (32)$$

which is nondifferentiable at $t = 1$; the latter is also its only point of strict convexity. As demonstrated in Broniatowski and Stummer (2019), $\phi_{TV}$ can—in our context—only be potentially applied if $\frac{S_x(Q)}{m_2(x)} = 1$ for $\lambda$-a.a. $x\in\mathcal{X}$, and one generally has to exclude $c = 1$ and $c = 0$ for $\phi'_{+,c}(\cdot)$ (i.e., we choose $c \in\, ]0,1[$); the latter two exclusions can be avoided under some nonobvious constraints on the statistical functionals $S(P)$ and $S(Q)$; see for instance Section 2.5.1.2.

2.5 The scaling and the aggregation functions m1, m2, and m3
In the above two Sections 2.3 and 2.4, we have presented special cases of the first and the last component of the "divergence parameter" $\beta = (\phi, m_1, m_2, m_3, \lambda)$, whereas now we focus on $m_1$, $m_2$, and $m_3$. To start with, in accordance with (19), the aggregation function $m_3$ tunes the fine aggregation details (recall that $\lambda$ governs the principal aggregation structure). Moreover, the function $m_1(\cdot)$ scales the statistical functional $S(P)$ evaluated at $P$, and $m_2(\cdot)$ the same statistical functional $S(Q)$ evaluated at $Q$. From a modeling perspective, these two scaling functions can, e.g., be
- "purely direct" in the sense that $m_1(x)$, $m_2(x)$ are chosen to directly reflect some dependence on the index-state $x\in\mathcal{X}$ (independent of the choice of $S$), or
- "purely adaptive" in the sense that $m_1(x) = w_1(S_x(P), S_x(Q))$, $m_2(x) = w_2(S_x(P), S_x(Q))$ for some appropriate (measurable) "connector functions" $w_1$, $w_2$ on the product $\mathcal{R}(S(P))\times\mathcal{R}(S(Q))$ of the ranges of $\{S_x(P)\}_{x\in\mathcal{X}}$ and $\{S_x(Q)\}_{x\in\mathcal{X}}$, or
- "hybrids" $m_1(x) = w_1(x, S_x(P), S_x(Q))$, $m_2(x) = w_2(x, S_x(P), S_x(Q))$.
In the remainder of Section 2, we illuminate several important sub-setups of $m_1$, $m_2$, and $m_3$, and special cases therein. As a side effect, this also shows that our framework (19) considerably generalizes all the concrete divergences in the below-mentioned references (even for the same statistical functional such as, e.g., $S = S^{cd}$); for the sake of brevity, we mention this only at this point, collectively.

2.5.1 $m_1(x) = m_2(x) =: m(x)$, $m_3(x) = r(x)\cdot m(x) \in [0,\infty]$ for some (measurable) function $r : \mathcal{X}\to\mathbb{R}$ satisfying $r(x) \in\, ]{-\infty},0[\,\cup\,]0,\infty[$ for $\lambda$-a.a. $x\in\mathcal{X}$
In such a sub-setup, the scaling functions are strongly coupled with the aggregation function. In order to avoid "case overlapping" and "uncontrolled boundary effects," unless otherwise stated we assume here that the function $r(\cdot)$ does not (explicitly) depend on the functions $m(\cdot)$, $S(P)$, and $S(Q)$, i.e., it is not of the adaptive form $r(\cdot) = h(\cdot, m(\cdot), S(P), S(Q))$. From (19) one can derive

$$0 \leq D^c_{\phi,m,m,r\cdot m,\lambda}(S(P), S(Q)) := \int_{\mathcal{X}} \left[ \phi\!\left(\frac{S_x(P)}{m(x)}\right) - \phi\!\left(\frac{S_x(Q)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{S_x(Q)}{m(x)}\right)\cdot\left(\frac{S_x(P)}{m(x)} - \frac{S_x(Q)}{m(x)}\right)\right] m(x)\cdot r(x)\, d\lambda(x), \qquad (33)$$

which for the discrete setup $(\mathcal{X},\lambda) = (\mathcal{X}_\#, \lambda_\#)$ (recall $\lambda_\#[\{x\}] = 1$ for all $x\in\mathcal{X}_\#$) simplifies to

$$0 \leq D^c_{\phi,m,m,r\cdot m,\lambda_\#}(S(P), S(Q)) = \sum_{x\in\mathcal{X}} \left[ \phi\!\left(\frac{S_x(P)}{m(x)}\right) - \phi\!\left(\frac{S_x(Q)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{S_x(Q)}{m(x)}\right)\cdot\left(\frac{S_x(P)}{m(x)} - \frac{S_x(Q)}{m(x)}\right)\right] m(x)\cdot r(x). \qquad (34)$$
Remark 1. (a) In a context of "$\lambda$-probability-density functions" with general $\mathcal{X}$ and $P[\cdot] := \int_{\cdot} f_P(x)\, d\lambda(x)$, $Q[\cdot] := \int_{\cdot} f_Q(x)\, d\lambda(x)$ satisfying $P[\mathcal{X}] = Q[\mathcal{X}] = 1$, one can take the statistical functionals $S^{\lambda pd}_x(P) := f_P(x) \geq 0$, $S^{\lambda pd}_x(Q) := f_Q(x) \geq 0$; accordingly, for $r(x) \equiv 1$ (abbreviated as function $\mathbf{1}$ with constant value 1) and $M[\cdot] := \int_{\cdot} m(x)\, d\lambda(x)$, the divergence (33) can be interpreted as^i
$$0 \leq D^c_{\phi,m,m,\mathbf{1}\cdot m,\lambda}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) = \int_{\mathcal{X}} \left[ \phi\!\left(\frac{f_P(x)}{m(x)}\right) - \phi\!\left(\frac{f_Q(x)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{f_Q(x)}{m(x)}\right)\cdot\left(\frac{f_P(x)}{m(x)} - \frac{f_Q(x)}{m(x)}\right)\right] m(x)\, d\lambda(x) =: B_\phi(P, Q \mid M), \qquad (35)$$

where the scaled Bregman divergence $B_\phi(P, Q \mid M)$ has been first defined in Stummer (2007) and Stummer and Vajda (2012); see also Kißlinger and Stummer (2013, 2015, 2016) for the "purely adaptive" case $m(x) = w(f_P(x), f_Q(x))$ and indications on nonprobability measures. Notice that this directly subsumes for $\mathcal{X} = \mathcal{Y} = \mathbb{R}$ the "classical density" functional $S^{\lambda pd}(\cdot) = S^{pd}(\cdot)$ with the choice $\lambda = \lambda_L$ (and the Riemann integration $d\lambda_L(x) = dx$), as well as for the discrete setup $\mathcal{Y} = \mathcal{X} = \mathcal{X}_\#$ the "classical probability mass" functional $S^{\lambda pd}(\cdot) = S^{pm}(\cdot)$ with the choice $\lambda = \lambda_\#$ (recall $\lambda_\#[\{x\}] = 1$ for all $x\in\mathcal{X}_\#$); for the latter, the divergence (35) reads as

$$0 \leq D^c_{\phi,m,m,\mathbf{1}\cdot m,\lambda_\#}\big(S^{\lambda_\# pd}(P), S^{\lambda_\# pd}(Q)\big) = D^c_{\phi,m,m,\mathbf{1}\cdot m,\lambda_\#}\big(S^{pm}(P), S^{pm}(Q)\big) = \sum_{x\in\mathcal{X}_\#} \left[ \phi\!\left(\frac{p_P(x)}{m(x)}\right) - \phi\!\left(\frac{p_Q(x)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{p_Q(x)}{m(x)}\right)\cdot\left(\frac{p_P(x)}{m(x)} - \frac{p_Q(x)}{m(x)}\right)\right] m(x) =: B^\#_\phi(P, Q \mid M). \qquad (36)$$

^i In a context where $P$ and $Q$ are risk distributions (e.g., $Q$ is a pregiven reference one), the SBD $B_\phi(P, Q \mid M)$ can be interpreted as the risk excess of $P$ over $Q$ (or vice versa), in contrast to Faugeras and Rüschendorf (2018), who use hemimetrics rather than divergences.
For the important special case of the above-mentioned power-function-type generator $\phi(t) := \phi_\alpha(t) = \frac{t^\alpha - \alpha\, t + \alpha - 1}{\alpha\,(\alpha - 1)}$ ($\alpha \in ]0,\infty[\,\setminus\{1\}$), Roensch and Stummer (2019a) (see also Ghosh and Basu (2016b) for the unscaled special case $m(x) \equiv 1$) employed the corresponding scaled Bregman divergences (35) in order to obtain robust minimum-divergence-type parameter estimates for the setup of sequences of independent random variables whose distributions are nonidentical but linked by a common (scalar or multidimensional) parameter; this is, e.g., important in the context of generalized linear models (GLMs), which are omnipresent in statistics, artificial intelligence, and machine learning.

Returning to the general framework, for the important special case $\alpha = 2$, leading to the above-mentioned generator $\phi_2(t) := \frac{(t-1)^2}{2}$, the scaled Bregman divergences (35) and (36) turn into
$$0 \leq B_{\phi_2}(P, Q \mid M) = \int_{\mathcal{X}} \frac{(f_P(x) - f_Q(x))^2}{2\, m(x)}\, d\lambda(x)$$

and

$$0 \leq B^\#_{\phi_2}(P, Q \mid M) = \sum_{x\in\mathcal{X}_\#} \frac{(p_P(x) - p_Q(x))^2}{2\, m(x)}. \qquad (37)$$

For instance, in (36) and (37), if $Y$ is a random variable taking values in the discrete space $\mathcal{X}_\#$, then $p_Q(x) = Q[Y = x]$ may be its probability mass function under a hypothetical/candidate law $Q$, and $p_P(x) = \frac{1}{N}\cdot \#\{i\in\{1,\dots,N\} : Y_i = x\} =: p_{P^{emp}_N}(x)$ is the probability mass function of the corresponding data-derived "empirical distribution" $P := P^{emp}_N := \frac{1}{N}\sum_{i=1}^{N} \delta_{Y_i}[\cdot]$ of an $N$-size independent and identically distributed (i.i.d.) sample $Y_1,\dots,Y_N$ of $Y$, which is nothing but the probability distribution reflecting the underlying (normalized) histogram; moreover, $m(\cdot)$ is a scaling/weighting.
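A minimal Python sketch of (37) in this empirical setting (our own illustration; the support, the candidate pmf $p_Q$, and the choice $m(x) = p_Q(x)$ are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
support = np.arange(6)
pQ = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])   # candidate law Q

# empirical pmf of an i.i.d. sample Y_1,...,Y_N (the normalized histogram)
N = 500
Y = rng.choice(support, size=N, p=pQ)
pP = np.bincount(Y, minlength=6) / N

# B^#_{phi_2}(P, Q | M) from (37), with scaling m(x) = pQ(x) as one choice
m = pQ
print(np.sum((pP - pQ) ** 2 / (2.0 * m)))
```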
In contrast, within a context of clustered multinomial data, we can basically rewrite the parametric extension of Brier's consistent estimator of Alonso-Revenga et al. (2017) as $c_e\cdot\sum_{\ell=1}^{L} B^\#_{\phi_2}\big(P^{emp,\ell}_N, P^{emp}_N \,\big|\, P_{\widehat{\theta}}\big)$, where $P^{emp,\ell}_N$ is the empirical distribution of the $\ell$-th cluster, $P^{emp}_N = \frac{1}{L}\sum_{\ell=1}^{L} P^{emp,\ell}_N$, $P_{\widehat{\theta}}$ is a (minimum-divergence-)estimated distribution from a (log-linear) model class, and $c_e$ is an appropriately chosen multiplier (under the assumption of equal cluster sizes, which can be relaxed in a straightforward manner).
(b) In contrast to (a), for the context $\mathcal{Y} = \mathcal{X} = \mathbb{R}$, $r(x) \equiv 1$, one obtains in terms of the cumulative distribution functions $S^{cd}_x(P) = F_P(x)$, $S^{cd}_x(Q) = F_Q(x)$ the two nonprobability measures $\mu^{\mathbf{1}\cdot\lambda,cd}[\cdot] := \int_{\cdot} F_P(x)\, d\lambda(x) \leq \lambda[\cdot]$ and $\nu^{\mathbf{1}\cdot\lambda,cd}[\cdot] := \int_{\cdot} F_Q(x)\, d\lambda(x) \leq \lambda[\cdot]$ with—possibly infinite—total masses $\mu^{\mathbf{1}\cdot\lambda,cd}[\mathbb{R}]$, $\nu^{\mathbf{1}\cdot\lambda,cd}[\mathbb{R}]$. The latter two are finite if $\lambda$ is a probability measure or a finite measure; for the nonfinite Lebesgue measure $\lambda = \lambda_L$ and for intervals $[x_1,x_2]$ one can interpret $\mu^{\mathbf{1}\cdot\lambda_L,cd}[\,[x_1,x_2]\,]$ as the corresponding area (between $x_1$ and $x_2$) under the distribution function $F_P(\cdot)$. Analogously to (35), one can interpret

$$0 \leq D^c_{\phi,m,m,\mathbf{1}\cdot m,\lambda}\big(S^{cd}(P), S^{cd}(Q)\big) = \int_{\mathcal{X}} \left[ \phi\!\left(\frac{F_P(x)}{m(x)}\right) - \phi\!\left(\frac{F_Q(x)}{m(x)}\right) - \phi'_{+,c}\!\left(\frac{F_Q(x)}{m(x)}\right)\cdot\left(\frac{F_P(x)}{m(x)} - \frac{F_Q(x)}{m(x)}\right)\right] m(x)\, d\lambda(x) =: B_\phi\big(\mu^{\mathbf{1}\cdot\lambda,cd}, \nu^{\mathbf{1}\cdot\lambda,cd} \,\big|\, M\big)$$

as a scaled Bregman divergence between the nonprobability measures $\mu^{\mathbf{1}\cdot\lambda,cd}$ and $\nu^{\mathbf{1}\cdot\lambda,cd}$.
(c) In a context of mortality data analytics (which is essential for the calculation of insurance premiums, financial reserves, annuities, pension benefits, various benefits of social insurance programs, etc.), the divergence (34) (with $r(x) \equiv 1$) has been employed by Krömer and Stummer (2019) in order to achieve a realistic representation of mortality rates by smoothing and error-correcting of crude rates; there, $\mathcal{X}$ is a set of ages (in years), $S_x(P)$ is the so-called data-based crude annual mortality rate by age $x$, $S_x(Q)$ is an—optimally determinable—candidate model member (out of a parametric or nonparametric model) for the unknown true annual mortality rate by age $x$, and $m(x)$ is an appropriately chosen scaling at $x$.

This concludes the current Remark 1.

In the following, we illuminate two important special cases of the scaling (and aggregation-part) function $m(\cdot)$, namely $m(x) := 1$ and $m(x) := S_x(Q)$:

2.5.1.1 $m_1(x) = m_2(x) := 1$, $m_3(x) = r(x)$ for some (measurable) function $r : \mathcal{X} \to [0,\infty]$ satisfying $r(x) \in ]0,\infty[$ for $\lambda$-a.a. $x\in\mathcal{X}$
In this sub-setup, (33) becomes

$$0 \leq D^c_{\phi,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda}(S(P), S(Q)) := \int_{\mathcal{X}} \Big[ \phi(S_x(P)) - \phi(S_x(Q)) - \phi'_{+,c}(S_x(Q))\cdot(S_x(P) - S_x(Q)) \Big]\, r(x)\, d\lambda(x), \qquad (38)$$

which for the discrete setup $(\mathcal{X},\lambda) = (\mathcal{X}_\#, \lambda_\#)$ turns into

$$0 \leq D^c_{\phi,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda_\#}(S(P), S(Q)) := \sum_{x\in\mathcal{X}} \Big[ \phi(S_x(P)) - \phi(S_x(Q)) - \phi'_{+,c}(S_x(Q))\cdot(S_x(P) - S_x(Q)) \Big]\, r(x). \qquad (39)$$
For reasons to be clarified below, in case of a differentiable generator $\phi$ (and thus $\phi'_{+,c} = \phi'$ is the classical derivative) one can interpret (38) and (39) as weighted Bregman distances between the two statistical functionals $S(P)$ and $S(Q)$.

Let us first discuss the important special case $\phi = \phi_\alpha$ ($\alpha\in\mathbb{R}$, cf. (26), (30), (31), (28)) together with $S_x(P) \geq 0$, $S_x(Q) \geq 0$—as is always the case for $S^{cd}$, $S^{pd}$, $S^{pm}$, $S^{su}$, $S^{mg}$, $S^{de}$, $S^{ou}$, and for nonnegative real-valued random variables also with $S^{qu}$. By incorporating the above-mentioned extension-relevant (cf. (I2)) properties of $\phi_\alpha$ (see Broniatowski and Stummer, 2019) into (38), we end up with
$$0 \leq D_{\phi_\alpha,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda}(S(P), S(Q)) = \int_{\mathcal{X}} \frac{r(x)}{\alpha\,(\alpha-1)}\cdot\Big[ (S_x(P))^\alpha + (\alpha-1)\cdot(S_x(Q))^\alpha - \alpha\cdot S_x(P)\cdot(S_x(Q))^{\alpha-1} \Big]\, d\lambda(x) \qquad (40)$$

$$= \int_{\mathcal{X}} \frac{r(x)}{\alpha\,(\alpha-1)}\cdot\Big[ (S_x(P))^\alpha + (\alpha-1)\cdot(S_x(Q))^\alpha - \alpha\cdot S_x(P)\cdot(S_x(Q))^{\alpha-1} \Big]\cdot\mathbf{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)$$
$$\quad + \int_{\mathcal{X}} r(x)\cdot\frac{(S_x(P))^\alpha}{\alpha\,(\alpha-1)}\cdot\Big[\mathbf{1}_{]1,\infty[}(\alpha) + \infty\cdot\mathbf{1}_{]-\infty,0[\,\cup\,]0,1[}(\alpha)\Big]\cdot\mathbf{1}_{]0,\infty[}(S_x(P))\cdot\mathbf{1}_{\{0\}}(S_x(Q))\, d\lambda(x)$$
$$\quad + \int_{\mathcal{X}} r(x)\cdot\frac{(S_x(Q))^\alpha}{\alpha}\cdot\Big[\mathbf{1}_{]0,1[\,\cup\,]1,\infty[}(\alpha) + \infty\cdot\mathbf{1}_{]-\infty,0[}(\alpha)\Big]\cdot\mathbf{1}_{]0,\infty[}(S_x(Q))\cdot\mathbf{1}_{\{0\}}(S_x(P))\, d\lambda(x), \quad \text{for } \alpha\in\mathbb{R}\setminus\{0,1\}, \qquad (41)$$

$$0 \leq D_{\phi_1,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda}(S(P), S(Q)) = \int_{\mathcal{X}} r(x)\cdot\Big[ S_x(P)\cdot\log\Big(\frac{S_x(P)}{S_x(Q)}\Big) + S_x(Q) - S_x(P) \Big]\, d\lambda(x) \qquad (42)$$
$$= \int_{\mathcal{X}} r(x)\cdot\Big[ S_x(P)\cdot\log\Big(\frac{S_x(P)}{S_x(Q)}\Big) + S_x(Q) - S_x(P) \Big]\cdot\mathbf{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)$$
$$\quad + \int_{\mathcal{X}} r(x)\cdot\infty\cdot\mathbf{1}_{]0,\infty[}(S_x(P))\cdot\mathbf{1}_{\{0\}}(S_x(Q))\, d\lambda(x) + \int_{\mathcal{X}} r(x)\cdot S_x(Q)\cdot\mathbf{1}_{]0,\infty[}(S_x(Q))\cdot\mathbf{1}_{\{0\}}(S_x(P))\, d\lambda(x), \qquad (43)$$

$$0 \leq D_{\phi_0,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda}(S(P), S(Q)) = \int_{\mathcal{X}} r(x)\cdot\Big[ -\log\Big(\frac{S_x(P)}{S_x(Q)}\Big) + \frac{S_x(P)}{S_x(Q)} - 1 \Big]\, d\lambda(x) \qquad (44)$$
$$= \int_{\mathcal{X}} r(x)\cdot\Big[ -\log\Big(\frac{S_x(P)}{S_x(Q)}\Big) + \frac{S_x(P)}{S_x(Q)} - 1 \Big]\cdot\mathbf{1}_{]0,\infty[}\big(S_x(P)\cdot S_x(Q)\big)\, d\lambda(x)$$
$$\quad + \int_{\mathcal{X}} r(x)\cdot\infty\cdot\mathbf{1}_{]0,\infty[}(S_x(P))\cdot\mathbf{1}_{\{0\}}(S_x(Q))\, d\lambda(x) + \int_{\mathcal{X}} r(x)\cdot\infty\cdot\mathbf{1}_{]0,\infty[}(S_x(Q))\cdot\mathbf{1}_{\{0\}}(S_x(P))\, d\lambda(x), \qquad (45)$$

$$0 \leq D_{\phi_2,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda}(S(P), S(Q)) = \int_{\mathcal{X}} \frac{r(x)}{2}\cdot\big[ S_x(P) - S_x(Q) \big]^2\, d\lambda(x); \qquad (46)$$
as a recommendation, one should avoid $\alpha \leq 0$ whenever $S_x(P) = 0$ for all $x$ in some $A$ with $\lambda[A] > 0$, respectively $\alpha \leq 1$ whenever $S_x(Q) = 0$ for all $x$ in some $\widetilde{A}$ with $\lambda[\widetilde{A}] > 0$. As far as the splitting of the integrals, e.g., in (43) resp. (45) is concerned, notice that $\int_{\mathcal{X}} [S_x(Q) - S_x(P)]\cdot r(x)\, d\lambda(x)$ resp. $\int_{\mathcal{X}} \big[\frac{S_x(P)}{S_x(Q)} - 1\big]\cdot r(x)\, d\lambda(x)$ may be finite even in cases where $\int_{\mathcal{X}} S_x(P)\cdot r(x)\, d\lambda(x) = \infty$ and $\int_{\mathcal{X}} S_x(Q)\cdot r(x)\, d\lambda(x) = \infty$ (take, e.g., $\mathcal{X} = [0,\infty[$, $\lambda = \lambda_L$, $r(x) \equiv 1$, and the exponential distribution functions $S_x(P) = F_P(x) = 1 - \exp(-c_1\, x)$, $S_x(Q) = F_Q(x) = 1 - \exp(-c_2\, x)$ with $0 < c_1 < c_2$). Notice that (46) can also be used in cases where $S_x(P) \in \mathbb{R}$, $S_x(Q) \in \mathbb{R}$, and thus, e.g., with $S^{qu}$ for arbitrary real-valued random variables. As before, for the discrete setup $(\mathcal{X},\lambda) = (\mathcal{X}_\#, \lambda_\#)$ all the terms $\int_{\mathcal{X}} \dots\, d\lambda(x)$ in (41) to (46) turn into $\sum_{x\in\mathcal{X}} \dots$.
Distribution functions. For $\mathcal{Y} = \mathcal{X} = \mathbb{R}$, $S_x(P) = S^{cd}_x(P) = F_P(x)$, $S_x(Q) = S^{cd}_x(Q) = F_Q(x)$, let us illuminate the case $\alpha = 2$ of (46). For instance, if $Y$ is a real-valued random variable and $F_Q(x) = Q[Y \leq x]$ is its cumulative distribution function under a hypothetical/candidate law $Q$, one can take $F_P(x) = \frac{1}{N}\cdot\#\{i \in \{1,\dots,N\} : Y_i \leq x\} =: F_{P^{emp}_N}(x)$ as the distribution function of the corresponding data-derived "empirical distribution" $P := P^{emp}_N := \frac{1}{N}\sum_{i=1}^{N}\delta_{Y_i}[\cdot]$ of an $N$-size i.i.d. sample $Y_1,\dots,Y_N$ of $Y$. In such a setup, the choice (say) $\lambda = Q$ in (46) and multiplication with $2N$ lead to the weighted Cramér–von Mises test statistics (see Cramér (1928), Von Mises (1931), Smirnov/Smirnoff (1936), and also Darling (1957) for a historic account)

$$0 \leq 2N\cdot D_{\phi_2,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},Q}\big(S^{cd}(P^{emp}_N), S^{cd}(Q)\big) = N\cdot\int_{\mathbb{R}} \Big[ F_{P^{emp}_N}(x) - F_Q(x) \Big]^2\cdot r(x)\, dQ(x), \qquad (47)$$

which are special "quadratic EDF statistics" in the sense of Stephens (1986) (who also uses the term "Cramér–von Mises family"). The special case $r(x) \equiv 1$ is nothing but the prominent (unweighted) Cramér–von Mises test statistic; for some recent statistical insights on the latter, see, e.g., Baringhaus and Henze (2017). In contrast, if one chooses the Lebesgue measure $\lambda = \lambda_L$ and $r(x) \equiv 1$ in (46), then one ends up with the $N$-fold of the "classical" squared $L_2$-distance between the two distribution functions $F_{P^{emp}_N}(\cdot)$ and $F_Q(\cdot)$, i.e., with

$$0 \leq 2N\cdot D_{\phi_2,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_L}\big(S^{cd}(P^{emp}_N), S^{cd}(Q)\big) = N\cdot\int_{\mathbb{R}} \Big[ F_{P^{emp}_N}(x) - F_Q(x) \Big]^2\, d\lambda_L(x), \qquad (48)$$

where one can typically identify $d\lambda_L(x) = dx$ (Riemann integral).
where one can typically identify dλL(x) ¼ dx (Riemann integral).
In a similar fashion, for the special case Y ¼ X ¼ R, and the above-
mentioned integrated statistical functionals (cf. Section 2.1) Sx ðPÞ ¼
cd Rx cd Rx
SQ,S
x ðPÞ ¼ ∞ FP ðzÞ dQðzÞ, Sx ðQÞ ¼ SQ,S
x ðQÞ ¼ ∞ FQ ðzÞ dQðzÞ , we get
from (38) (analogously to (46))
180 SECTION II Information geometry

 cd cd
 Z rðxÞ h cd cd
i2
0  Dϕ2 ,1,1,r  1,λ SQ,S ðPÞ, SQ,S ðQÞ ¼  SQ,S
x ðPÞ  SQ,S
x ðQÞ dλðxÞ,
R 2

for which the choice r(x) ≡ 2N, P :¼ Pemp


N , λ ¼ Q leads to the divergence used
in a goodness-of-fit testing context by Henze and Nikitin (2000).
$\lambda$-probability density functions. If for general $\mathcal{X}$ one takes the special case $r(x) \equiv 1$ together with the "$\lambda$-probability-density functions" context (cf. Remark 1(a)) $S_x(P) = S^{\lambda pd}_x(P) := f_P(x) \geq 0$, $S_x(Q) = S^{\lambda pd}_x(Q) = f_Q(x) \geq 0$, then the divergences (38) and (39) become

$$0 \leq D^c_{\phi,\mathbf{1},\mathbf{1},\mathbf{1},\lambda}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) := \int_{\mathcal{X}} \Big[ \phi(f_P(x)) - \phi(f_Q(x)) - \phi'_{+,c}(f_Q(x))\cdot(f_P(x) - f_Q(x)) \Big]\, d\lambda(x) \qquad (49)$$

and

$$0 \leq D^c_{\phi,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_\#}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) := \sum_{x\in\mathcal{X}} \Big[ \phi(f_P(x)) - \phi(f_Q(x)) - \phi'_{+,c}(f_Q(x))\cdot(f_P(x) - f_Q(x)) \Big]. \qquad (50)$$

In case of a differentiable generator $\phi$ (and thus $\phi'_{+,c} = \phi'$ is the classical derivative), the divergences in (49) and (50) are nothing but the classical Bregman distances between the two probability distributions $P$ and $Q$ (see, e.g., Csiszar, 1991; Pardo and Vajda, 1997, 2003; Stummer and Vajda, 2012). If one further specializes $\phi = \phi_\alpha$, the divergences (40), (42), (44) and (46) become

$$0 \leq D_{\phi_\alpha,\mathbf{1},\mathbf{1},\mathbf{1},\lambda}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) = \int_{\mathcal{X}} \frac{1}{\alpha\,(\alpha-1)}\cdot\Big[ (f_P(x))^\alpha + (\alpha-1)\cdot(f_Q(x))^\alpha - \alpha\cdot f_P(x)\cdot(f_Q(x))^{\alpha-1} \Big]\, d\lambda(x), \quad \text{for } \alpha\in\mathbb{R}\setminus\{0,1\}, \qquad (51)$$

$$0 \leq D_{\phi_1,\mathbf{1},\mathbf{1},\mathbf{1},\lambda}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) = \int_{\mathcal{X}} \Big[ f_P(x)\cdot\log\Big(\frac{f_P(x)}{f_Q(x)}\Big) + f_Q(x) - f_P(x) \Big]\, d\lambda(x), \qquad (52)$$

$$0 \leq D_{\phi_0,\mathbf{1},\mathbf{1},\mathbf{1},\lambda}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) = \int_{\mathcal{X}} \Big[ -\log\Big(\frac{f_P(x)}{f_Q(x)}\Big) + \frac{f_P(x)}{f_Q(x)} - 1 \Big]\, d\lambda(x), \qquad (53)$$

$$0 \leq D_{\phi_2,\mathbf{1},\mathbf{1},\mathbf{1},\lambda}\big(S^{\lambda pd}(P), S^{\lambda pd}(Q)\big) = \frac{1}{2}\int_{\mathcal{X}} \big( f_P(x) - f_Q(x) \big)^2\, d\lambda(x). \qquad (54)$$
Analogously to the paragraph after (46), one can recommend here to exclude $\alpha \leq 0$ whenever $f_P(x) = 0$ for all $x$ in some $A$ with $\lambda[A] > 0$, respectively $\alpha \leq 1$ whenever $f_Q(x) = 0$ for all $x$ in some $\widetilde{A}$ with $\lambda[\widetilde{A}] > 0$. As far as the splitting of the integrals, e.g., in (43) resp. (45) is concerned, notice that the integral $\mu^{\mathbf{1}\cdot\lambda,\lambda pd}[\mathcal{X}] - \nu^{\mathbf{1}\cdot\lambda,\lambda pd}[\mathcal{X}] = \int_{\mathcal{X}} [f_P(x) - f_Q(x)]\, d\lambda(x) = 1 - 1 = 0$, but $\int_{\mathcal{X}} \big[\frac{f_P(x)}{f_Q(x)} - 1\big]\, d\lambda(x)$ may be infinite (take, e.g., $\mathcal{X} = [0,\infty[$, $\lambda = \lambda_L$, and the exponential distribution density functions $f_P(x) := c_1\cdot\exp(-c_1\, x)$, $f_Q(x) := c_2\cdot\exp(-c_2\, x)$ with $0 < c_1 < c_2$). The choice $\alpha > 0$ in (51) coincides with the "order-$\alpha$" density power divergences DPD of Basu et al. (1998); for their statistical applications see, e.g., Basu et al. (2015), Ghosh and Basu (2016a,b), and the references therein, and for general $\alpha\in\mathbb{R}$ see, e.g., Stummer and Vajda (2012).
The divergence (52) is the celebrated "Kullback–Leibler information divergence" KL between $f_P$ and $f_Q$ (respectively between $P$ and $Q$); alternatively, instead of KL one often uses the terminology "relative entropy." The divergence (54) (cf. $\alpha = 2$) is nothing but half of the squared $L_2$-distance between the two $\lambda$-density functions $f_P(\cdot)$ and $f_Q(\cdot)$.
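The following minimal Python sketch (our own illustration; the two unit-variance normal densities are assumed for the example) evaluates the density power divergence (51) and the Kullback–Leibler divergence (52) by numerical integration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

fP = lambda x: norm.pdf(x, 0.0, 1.0)
fQ = lambda x: norm.pdf(x, 0.5, 1.0)

def dpd(alpha):
    # density power divergence (51) of order alpha, alpha not in {0, 1}
    integrand = lambda x: (fP(x)**alpha + (alpha - 1.0) * fQ(x)**alpha
                           - alpha * fP(x) * fQ(x)**(alpha - 1.0)) \
                          / (alpha * (alpha - 1.0))
    return quad(integrand, -10, 10)[0]

# KL divergence (52); for these two normals the closed form is mu^2/2 = 0.125
kl = quad(lambda x: fP(x) * np.log(fP(x) / fQ(x)) + fQ(x) - fP(x), -10, 10)[0]
print(dpd(2.0), kl)
```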
Notice that for the classical case $\mathcal{X} = \mathcal{Y} = \mathbb{R}$, $r(x) \equiv 1$, $\lambda = \lambda_L$—where one has $S^{\lambda pd}(P) = S^{pd}(P)$ and $F_P(x) = \int_{-\infty}^{x} f_P(z)\, d\lambda_L(z)$—(51) is essentially different from (40) with $S(P) = S^{cd}(P)$, $S(Q) = S^{cd}(Q)$, which is explicitly of the "doubly aggregated form"

$$0 \leq D_{\phi_\alpha,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_L}\big(S^{cd}(P), S^{cd}(Q)\big) = \int_{\mathbb{R}} \frac{1}{\alpha\,(\alpha-1)}\cdot\left[ \left(\int_{-\infty}^{x} f_P(z)\, d\lambda_L(z)\right)^{\!\alpha} + (\alpha-1)\cdot\left(\int_{-\infty}^{x} f_Q(z)\, d\lambda_L(z)\right)^{\!\alpha} - \alpha\cdot\int_{-\infty}^{x} f_P(z)\, d\lambda_L(z)\cdot\left(\int_{-\infty}^{x} f_Q(z)\, d\lambda_L(z)\right)^{\!\alpha-1} \right] d\lambda_L(x), \quad \text{for } \alpha\in\mathbb{R}\setminus\{0,1\},$$

with the usual $d\lambda_L(x) = dx$.


In contrast, for the discrete setup $(\mathcal{X},\lambda) = (\mathcal{X}_\#, \lambda_\#)$ with $\mathcal{X}_\# \subseteq \mathbb{R}$ (recall $\lambda_\#[\{x\}] = 1$) one has $f_P(x) = p_P(x)$ for all $x\in\mathcal{X}_\#$, and the divergences (51) to (54) simplify to

$$0 \leq D_{\phi_\alpha,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_\#}\big(S^{pm}(P), S^{pm}(Q)\big) = \sum_{x\in\mathcal{X}} \frac{1}{\alpha\,(\alpha-1)}\cdot\Big[ (p_P(x))^\alpha + (\alpha-1)\cdot(p_Q(x))^\alpha - \alpha\cdot p_P(x)\cdot(p_Q(x))^{\alpha-1} \Big], \quad \text{for } \alpha\in\mathbb{R}\setminus\{0,1\},$$

$$0 \leq D_{\phi_1,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_\#}\big(S^{pm}(P), S^{pm}(Q)\big) = \sum_{x\in\mathcal{X}} \Big[ p_P(x)\cdot\log\Big(\frac{p_P(x)}{p_Q(x)}\Big) + p_Q(x) - p_P(x) \Big],$$

$$0 \leq D_{\phi_0,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_\#}\big(S^{pm}(P), S^{pm}(Q)\big) = \sum_{x\in\mathcal{X}} \Big[ -\log\Big(\frac{p_P(x)}{p_Q(x)}\Big) + \frac{p_P(x)}{p_Q(x)} - 1 \Big],$$

$$0 \leq D_{\phi_2,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_\#}\big(S^{pm}(P), S^{pm}(Q)\big) = \sum_{x\in\mathcal{X}} \frac{1}{2}\cdot\Big[ p_P(x) - p_Q(x) \Big]^2,$$
where again one should exclude $\alpha \leq 0$ whenever $p_P(x) = 0$ for all $x$ in some $A$ with $\lambda_\#[A] > 0$, respectively $\alpha \leq 1$ whenever $p_Q(x) = 0$ for all $x$ in some $\widetilde{A}$ with $\lambda_\#[\widetilde{A}] > 0$. For example, take the context from the paragraph right before (36), with discrete random variable $Y$, $p_Q(x) = Q[Y = x]$, $p_P(x) = p_{P^{emp}_N}(x)$. Then, the divergences $2N\cdot D_{\phi_\alpha,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_\#}\big(S^{pm}(P^{emp}_N), S^{pm}(Q)\big)$ (for $\alpha\in\mathbb{R}$) can be used as goodness-of-fit test statistics; see, e.g., Kißlinger and Stummer (2016) for their limit behavior as the sample size $N$ tends to infinity.
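A minimal Python sketch of such a test statistic (our own illustration; the observed counts and the uniform null pmf are hypothetical, and strictly positive pmfs are assumed to sidestep the exclusions just discussed):

```python
import numpy as np

def power_gof_statistic(counts, pQ, alpha):
    # 2N * D_{phi_alpha,1,1,1,lambda_#}(S^pm(P^emp_N), S^pm(Q)),
    # for alpha not in {0, 1} and strictly positive pmfs
    counts = np.asarray(counts, float)
    N = counts.sum()
    pP = counts / N
    div = np.sum(pP**alpha + (alpha - 1.0) * pQ**alpha
                 - alpha * pP * pQ**(alpha - 1.0)) / (alpha * (alpha - 1.0))
    return 2.0 * N * div

counts = np.array([18, 25, 20, 17, 20])    # hypothetical observed counts
pQ = np.full(5, 0.2)                        # uniform null hypothesis
# alpha = 2 yields N * sum_x (pP(x) - pQ(x))^2, the unscaled quadratic statistic
print(power_gof_statistic(counts, pQ, 2.0))
```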
Classical quantile functions. The divergence (38) with $S(P) = S^{qu}(P)$, $S(Q) = S^{qu}(Q)$ can be interpreted as a quantitative measure of tail risk of $P$, relative to some pregiven reference distribution $Q$.^j Especially, for $\mathcal{Y} = \mathbb{R}$ and $\mathcal{X} = ]0,1[$, $S_x(P) = S^{qu}_x(P) = F_P^{\leftarrow}(x)$, $S_x(Q) = S^{qu}_x(Q) = F_Q^{\leftarrow}(x)$, and the Lebesgue measure $\lambda = \lambda_L$ (with the usual $d\lambda_L(x) = dx$), we get from (46) the special case

$$0 \leq 2\cdot D_{\phi_2,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_L}\big(S^{qu}(P), S^{qu}(Q)\big) = \int_{]0,1[} \Big[ F_P^{\leftarrow}(x) - F_Q^{\leftarrow}(x) \Big]^2\, d\lambda_L(x), \qquad (55)$$

which is nothing but the (squared) 2-Wasserstein distance between the two probability measures $P$ and $Q$. Corresponding connections with optimal transport will be discussed in Section 2.7. Notice that (55) does generally not coincide with its analogue

$$2\cdot D_{\phi_2,\mathbf{1},\mathbf{1},\mathbf{1},\lambda_L}\big(S^{cd}(P), S^{cd}(Q)\big) = \int_{\mathbb{R}} \Big[ F_P(x) - F_Q(x) \Big]^2\, d\lambda_L(x); \qquad (56)$$

to see this, take, e.g., $0 < c_2 < c_1$ (e.g., $c_1 = 2$, $c_2 = 1$) and the exponential quantile functions $F_P^{\leftarrow}(x) = -\frac{1}{c_1}\log(1-x)$, $F_Q^{\leftarrow}(x) = -\frac{1}{c_2}\log(1-x)$ ($x\in[0,1[$), for which (55) becomes $2\cdot\big(\frac{1}{c_2} - \frac{1}{c_1}\big)^2$, whereas for the corresponding exponential distribution functions $F_P(x) = (1 - \exp(-c_1\, x))\cdot\mathbf{1}_{]0,\infty[}(x)$ and $F_Q(x) = (1 - \exp(-c_2\, x))\cdot\mathbf{1}_{]0,\infty[}(x)$ the divergence (56) becomes $\frac{1}{2c_2} - \frac{2}{c_1+c_2} + \frac{1}{2c_1}$.

^j Hence, such a divergence represents an alternative to Faugeras and Rüschendorf (2018), where they use hemimetrics (which, e.g., have only a weak-identity property, but satisfy a triangle inequality) rather than divergences.
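The following minimal Python sketch (our own illustration) confirms both closed-form values of this exponential example by numerical integration:

```python
import numpy as np
from scipy.integrate import quad

c1, c2 = 2.0, 1.0

# (55): squared 2-Wasserstein distance via the quantile functions
Fp_inv = lambda x: -np.log(1.0 - x) / c1
Fq_inv = lambda x: -np.log(1.0 - x) / c2
w2_sq = quad(lambda x: (Fp_inv(x) - Fq_inv(x)) ** 2, 0.0, 1.0, limit=200)[0]
print(w2_sq, 2.0 * (1.0 / c2 - 1.0 / c1) ** 2)           # both = 0.5

# (56): squared L2-distance between the distribution functions
l2_sq = quad(lambda x: (np.exp(-c2 * x) - np.exp(-c1 * x)) ** 2, 0, 50)[0]
print(l2_sq, 1.0/(2*c2) - 2.0/(c1 + c2) + 1.0/(2*c1))    # both ~ 0.0833
```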
Depth, outlyingness, centered rank and centered quantile functions. As special cases one gets

$$D^c_{\phi,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda_L}\big(S^{de}(P), S^{de}(Q)\big), \qquad D^c_{\phi,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda_L}\big(S^{ou}(P), S^{ou}(Q)\big),$$

$$\sum_{i=1}^{d} D^c_{\phi,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda_L}\big(S^{cr,i}(P), S^{cr,i}(Q)\big), \qquad (57)$$

$$\sum_{i=1}^{d} D^c_{\phi,\mathbf{1},\mathbf{1},r\cdot\mathbf{1},\lambda_L}\big(S^{cqu,i}(P), S^{cqu,i}(Q)\big), \qquad (58)$$
all of which have not appeared elsewhere before (up to our knowledge); recall that the respective domains of $\phi$ have to take care of the ranges $\mathcal{R}(S^{de}(P)) \subseteq [0,\infty]$, $\mathcal{R}(S^{ou}(P)) \subseteq [0,\infty]$, $\mathcal{R}(S^{cr,i}(P)) \subseteq [-1,1]$, $\mathcal{R}(S^{cqu,i}(P)) \subseteq\, ]{-\infty},\infty[$ ($i \in \{1,\dots,d\}$). Notice that these divergences differ structurally from the Bregman distances of Hallin (2018), who uses the centered rank function $R_P(\cdot)$ (also called center-outward distribution function) as a multidimensional (in general not additively separable) generator $\phi$, and not as the points between which the distance is to be measured.

2.5.1.2 m1(x) 5 m2(x):5 Sx(Q), m3(x) 5 r(x)Sx(Q)  [0, ∞]


for some (measurable) function r : X ! R satisfying
rðxÞ    ∞, 0½[0, ∞½ for λa.a. x  X
In such a context, we require that the function r() does not (explicitly) depend
on the functions S(P) and S(Q), i.e., it is not of the adaptive form r() ¼
h(, S(P), S(Q)). The incorporation of the zeros of S(P), S(Q) can be adapted
from Broniatowski and Stummer (2019): for instance, in a nonnegativity
set-up where for λ-almost all x  X one has r(x)  ]0, ∞[ as well as Sx(P)
 [0, ∞[, Sx(Q)  [0, ∞[ (as it is always the case for Scd, Spd, Spm, Ssu, Smg,
Sde, Sou, and for nonnegative real-valued random variables also with Squ),
one can take E ¼]a, b[¼]0, ∞[ to end up with the following special case of (33)
0  Dcϕ,SðQÞ,SðQÞ,r  SðQÞ,λ ðSðPÞ, SðQÞÞ
Z    
Sx ðPÞ Sx ðPÞ
¼ ϕ  ϕ ð1Þ  ϕ0+,c ð1Þ   1 Sx ðQÞ  rðxÞ dλðxÞ
X Sx ðQÞ Sx ðQÞ
(59)
Z  
Sx ðPÞ
¼ Sx ðQÞ  ϕ  Sx ðQÞ  ϕ ð1Þ  ϕ0+, c ð1Þ  ðSx ðPÞ  Sx ðQÞÞ rðxÞ dλðxÞ
X Sx ðQÞ
(60)
Z  
Sx ðPÞ
¼ rðxÞ  Sx ðQÞ  ϕ  Sx ðQÞ  ϕð1Þ  ϕ0+, c ð1Þ  ðSx ðPÞ  Sx ðQÞÞ
X Sx ðQÞ
 10, ∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
h i Z
+ ϕ ð0Þ  ϕ0+, c ð1Þ  rðxÞ  Sx ðPÞ  10, ∞½ ðSx ðPÞÞ  1f0g ðSx ðQÞÞ dλðxÞ
X
h i Z
+ ϕð0Þ + ϕ0+, c ð1Þ  ϕð1Þ  rðxÞ  Sx ðQÞ  10, ∞½ ðSx ðQÞÞ  1f0g ðSx ðPÞÞ dλðxÞ
Z  X
Sx ðPÞ
¼ rðxÞ  Sx ðQÞ  ϕ  Sx ðQÞ  ϕð1Þ  ϕ0+, c ð1Þ  ðSx ðPÞ  Sx ðQÞÞ
X Sx ðQÞ
h i Z
 10, ∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ + ϕ ð0Þ  ϕ0+, c ð1Þ  rðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ
X
h i Z
+ ϕð0Þ + ϕ0+, c ð1Þ  ϕð1Þ  rðxÞ  Sx ðQÞ  1f0g ðSx ðPÞÞ dλðxÞ,
X
(61)
184 SECTION II Information geometry

R
with ϕ* ð0Þ :¼ lim u!0 u  ϕð1uÞ ¼ lim v!∞ ϕðvÞ
v . In case of X Sx ðQÞ  rðxÞ
dλðxÞ < ∞, the divergence (61) becomes
0  Dcϕ,SðQÞ,SðQÞ,r  SðQÞ,λ ðSðPÞ, SðQÞÞ
Z  
S x ð PÞ 0
¼ r ðxÞ  Sx ðQÞ  ϕ  ϕ+,c ð1Þ  ðSx ðPÞ  Sx ðQÞÞ
X Sx ðQÞ
h i
 10,∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ + ϕ* ð0Þ  ϕ0+,c ð1Þ
Z

 r ðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ + ϕð0Þ + ϕ0+,c ð1Þ
ZX Z
 r ðxÞ  Sx ðQÞ  1f0g ðSx ðPÞÞ dλðxÞ  ϕð1Þ  r ðxÞ  Sx ðQÞ dλðxÞ:
X X
(62)
R
Moreover, in case of ϕð1Þ ¼ 0 and X ðSx ðPÞ  Sx ðQÞÞ  rðxÞ dλðxÞ  ∞,
R R
∞½ (but not necessarily X Sx ðPÞ  rðxÞ dλðxÞ < ∞ , X Sx ðQÞ  rðxÞ dλðxÞ <
∞Þ, the divergence (61) turns into
0  Dcϕ,SðQÞ,SðQÞ,r  SðQÞ,λ ðSðPÞ, SðQÞÞ
Z  
S ð PÞ
¼ r ðxÞ  Sx ðQÞ  ϕ x  10,∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
X Sx ðQÞ
Z
+ ϕ * ð 0Þ  r ðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ (63)
ZX
+ ϕð 0Þ  r ðxÞ  Sx ðQÞ  1f0g ðSx ðPÞÞ dλðxÞ
ZX
ϕ0+,c ð1Þ  r ðxÞ  ðSx ðPÞ  Sx ðQÞÞ dλðxÞ:
X

To obtain the sharp identifiability (reflexivity) of the divergence


Dcϕ,SðQÞ,SðQÞ,r  SðQÞ,λ ðSðPÞ, SðQÞÞ of (61), one can either use the conditions for-
SðPÞ
mulated after (22) in terms of s  RðSðQÞ Þ and t  RðSðQÞ
SðQÞÞ ¼ f1g or the strict
convexity of ϕ at t ¼ 1 together withk
Z
ðSx ðPÞ  Sx ðQÞÞ  rðxÞ dλðxÞ ¼ 0 (64)
X

see Broniatowski and Stummer (2019) for corresponding details. Addition-


ally, in the light of (62) let us indicate that if one wants to use Ξ :¼
R Sx ðPÞ
X Sx ðQÞ  ϕðSx ðQÞÞ  rðxÞ dλðxÞ (with appropriate zero conventions) as a diver-
gence, then one should employ generators ϕ satisfying ϕð1Þ ¼ ϕ0+,c ð1Þ ¼ 0, or
employ models fulfilling the assumption (64) together with generators ϕ with
ϕ(1) ¼ 0. On the other hand, if this integral Ξ appears in your application

k
And thus, c becomes obsolete.
A unifying framework for some directed distances in statistics Chapter 5 185

context “naturally,” then one should be aware that Ξ may become negative
depending on the involved set-up; for a counter-example, see, e.g., Stummer
and Vajda (2010).
An important generator-concerning example is the power-function (limit)
case
R ϕ ¼ ϕα with α  R (cf. (26), (30), (31), (28)) under the constraint
X ðS x ðPÞ  Sx ðQÞÞ  rðxÞ dλðxÞ   ∞, ∞½ . Accordingly, the “implicit-
boundary-describing” divergence (60) and the corresponding “explicit-
boundary” version (63) turn into the generalized power divergences of order
α (cf. Stummer and Vajda (2010) for ðxÞ ≡ 1)
0  Dϕα ,SðQÞ,SðQÞ,r  SðQÞ,λ ðSðPÞ, SðQÞÞ
Z  α
1 Sx ðPÞ S ðPÞ
¼  α x + α  1  Sx ðQÞ  rðxÞ dλðxÞ
X α  ðα  1Þ Sx ðQÞ Sx ðQÞ
(65)
Z  α
1 Sx ðPÞ S ðPÞ
¼  rðxÞ  Sx ðQÞ  α x + α1
α  ðα  1Þ X Sx ðQÞ Sx ðQÞ
 10, ∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
Z
+ ϕ α ð0Þ  rðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ
ZX
+ ϕα ð0Þ  rðxÞ  Sx ðQÞ  1f0g ðSx ðPÞÞ dλðxÞ
ZX h i
1
¼ rðxÞ  Sx ðPÞα  Sx ðQÞ1α  Sx ðQÞ  10, ∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
α  ðα  1Þ X
Z
1
+  rðxÞ  ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
1α X
Z
+ ∞  11, ∞½ ðαÞ  rðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ
X
 
1
+ 1 ðαÞ + ∞  1∞,0½ ðαÞ
α  ð1  αÞ 0,1[1, ∞½
Z
 rðxÞ  Sx ðQÞ  1f0g ðSx ðPÞÞ dλðxÞ,
X
(66)
0  Dϕ1 ,SðQÞ,SðQÞ,r  SðQÞ,λ ðSðPÞ,SðQÞÞ
Z   (67)
Sx ðPÞ S ðPÞ S ðPÞ
¼  log x + 1 x  Sx ðQÞ  rðxÞ dλðxÞ
X Sx ðQÞ Sx ðQÞ Sx ðQÞ
Z  
S ðPÞ
¼ rðxÞ  Sx ðPÞ  log x  10, ∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
Sx ðQÞ
ZX Z
+ rðxÞ  ðSx ðQÞ  Sx ðPÞÞ dλðxÞ + ∞  rðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ,
X X
(68)
186 SECTION II Information geometry

0  Dϕ0 , SðQÞ, SðQÞ, r  SðQÞ, λ ðSðPÞ,SðQÞÞ


Z   (69)
S ðPÞ S ðPÞ
¼  log x + x  1  Sx ðQÞ  rðxÞ dλðxÞ
X Sx ðQÞ Sx ðQÞ
Z  
Sx ðQÞ
¼ rðxÞ  Sx ðQÞ  log  10, ∞½ ðSx ðPÞ  Sx ðQÞÞ dλðxÞ
Sx ðPÞ
ZX Z
+ rðxÞ  ðSx ðPÞ  Sx ðQÞÞ dλðxÞ + ∞  rðxÞ  Sx ðQÞ  1f0g ðSx ðPÞÞ dλðxÞ,
X X
(70)
Z 2
1 ðSx ðPÞ  Sx ðQÞÞ
0  Dϕ2 , SðQÞ, SðQÞ, r  SðQÞ, λ ðSðPÞ,SðQÞÞ ¼   rðxÞ dλðxÞ
X 2 Sx ðQÞ
(71)
Z
1 ðSx ðPÞ  Sx ðQÞÞ2
¼ rðxÞ   1½0, ∞½ ðSx ðPÞÞ  10, ∞½ ðSx ðQÞÞ dλðxÞ
2 X Sx ðQÞ
Z (72)
+∞ rðxÞ  Sx ðPÞ  1f0g ðSx ðQÞÞ dλðxÞ,
X

which is an adaption of a result of Broniatowski and Stummer (2019).


Another important generator-concerning example is the total variation
case ϕTV(t) :¼ jt  1j (cf. (32)) together with c ¼ 12 . Accordingly, the
“implicit-boundary-describing” divergence (60) resp. the corresponding
“explicit-boundary” version (63) turn into
Z  
 Sx ðPÞ 
Sx ðQÞ    1  rðxÞ dλðxÞ
1=2
0  Dϕ ,SðQÞ, SðQÞ,r  SðQÞ,λ ðSðPÞ,SðQÞÞ ¼
Z
TV
X Sx ðQÞ
¼ jSx ðPÞ  Sx ðQÞj  rðxÞ dλðxÞ,
X
(73)
which is also an adaption of a result of Broniatowski and Stummer (2019).
Notice that (73)—which is nothing but the r-weighted L1-distance between
the two statistical functionals S(P) and S(Q)—can be used also in cases where
Sx ðPÞ  R, Sx ðQÞ  R, and thus, e.g., for Squ for arbitrary real-valued random
variables.
As usual, for arbitrary discrete setup ðX , λÞ ¼ ðX# , λ# Þ all the terms
R R
X … dλðxÞ (respectively X … dλðxÞ) in the divergences (61) to (73) turn into
P P
x  X … (respectively x  X …).
As far as concrete statistical functionals is concerned, let us briefly discuss
several important sub-cases.
λ-probability-density functions. First, in the “λ-probability-density
functions” context of Remark 1 one has for general X the statistical functionals
Sλpd λpd
x ðPÞ :¼ fP ðxÞ  0 , Sx ðQÞ :¼ fQ ðxÞ  0 , and under the constraints
A unifying framework for some directed distances in statistics Chapter 5 187

ϕ(1) ¼ 0, the corresponding special case Dϕ,Sλpd ðQÞ,Sλpd ðQÞ,r  Sλpd ðQÞ,λ ðSλpd ðPÞ, Sλpd ðQÞÞ
of (61) turns out to be the (r ) “local ϕ-divergence” of Avlogiaris et al. (2016a)
and Avlogiaris et al. (2016b); in case of r(x) ≡ 1 (where (64) is satisfied), this
reduces to the classical ϕ-divergence of Csiszar (1963), Ali and Silvey (1966),
and Morimoto (1963)l,m
0  Dϕ,Sλpd ðQÞ,Sλpd ðQÞ,1  Sλpd ðQÞ,λ ðSλpd ðPÞ, Sλpd ðQÞÞ
Z !
fP ðxÞ  
¼ fQ ðxÞ  ϕ  10,∞½ fP ðxÞ  fQ ðxÞ dλðxÞ
X fQ ðxÞ
Z   Z
+ ϕ* ð0Þ  fP ðxÞ  1f0g fQ ðxÞ dλðxÞ + ϕð0Þ  fQ ðxÞ  1f0g ð fP ðxÞÞ dλðxÞ
ZX  
X

ϕ0+,c ð1Þ  fP ðxÞ  fQ ðxÞ dλðxÞ


X
Z !
fP ðxÞ  
¼ fQ ðxÞ  ϕ  10,∞½ fP ðxÞ  fQ ðxÞ dλðxÞ
X fQ ðxÞ
+ ϕ* ð0Þ  P½ fQ ðxÞ ¼ 0 + ϕð0Þ  Q½ fP ðxÞ ¼ 0
(74)
which coincides with (10); if ϕ(1) 6¼ 0 then one has to additionally subtract
ϕ(1) (cf. the corresponding special case of (61)). The corresponding special
cases Dϕα ,Sλpd ðQÞ,Sλpd ðQÞ,1  Sλpd ðQÞ,λ ðSλpd ðPÞ, Sλpd ðQÞÞ (α  R ) of (65) to (72) are
called “power divergences” (between the λ-density functions Sλpd  ðPÞ :¼
fP ð  Þ , Sλpd
 ðQÞ :¼ fQ ð  Þ ); if the latter two are strictly positive, the subcase
α ¼ 1, α ¼ 0, and α ¼ 2 is nothing but the (classical) Kullback–Leibler divergence
(relative entropy), the reverse Kullback–Leibler divergence (reverse relative
entropy), and the Pearson chisquare divergence, respectively. The special case
Z
 
0  Dϕ ,Sλpd ðQÞ,Sλpd ðQÞ,1  Sλpd ðQÞ,λ ðSλpd ðPÞ, Sλpd ðQÞÞ ¼
1=2  f ðxÞ  f ðxÞ dλðxÞ
P Q
TV
X

of (73) is the total variation distance or L1-distance (between the λ-density


functions Sλpd λpd
 ðPÞ :¼ fP ð  Þ, S  ðQÞ :¼ fQ ð  Þ).
Analogously to Section 2.5.1.1, for X ¼ Y ¼ R the current context sub-
sumes the “classical density” functionals Sλpd() ¼ Spd() with the choice λ ¼
λL (and the Riemann integration dλL(x) ¼ dx). In contrast, for the discrete
setup Y ¼ X ¼ X# it covers the “classical probability mass” functional
Sλpd() ¼ Spm() with the Rchoice λ ¼ λ# (recall λ#[{x}] ¼ 1 for all x  X# );
accordingly, all the terms X … dλðxÞ in the divergences (61) to (74) turn into
P
xX … .

l
See, e.g., Liese and Vajda (1987) and Vajda (1989) on comprehensive studies thereupon.
m
Notice that c becomes obsolete.
188 SECTION II Information geometry

Distribution and survival functions. Let us first consider the context


Y ¼ X ¼ R , Sx ðPÞ ¼ Scd x ðPÞ ¼ FP ðxÞ, Sx ðQÞ ¼ Sx ðQÞ ¼ FQ ðxÞ, and the
cd

Lebesgue measure λ ¼ λL (with the usual dλL(x) ¼ dx), and r(x) ≡ 1. Therein,
the special case
Z
 
0  Dϕ , Scd ðQÞ, Scd ðQÞ,1  Scd ðQÞ, λ ðScd ðPÞ, Scd ðQÞÞ ¼ FP ðxÞ  FQ ðxÞ dλL ðxÞ
1=2
TV L
R
(75)
of (73) is the well-known Kantorovich metric (between the distribution func-
tions
R FP(),FQ()). It is known that
R the integral in (75) is finite provided that
X x dFP ðxÞ    ∞, ∞½ and X x dFQ ðxÞ  ∞, ∞½ (if the distribution
P resp. Q is generated by some real-valued random variable, say X resp. Y,
this means that E[X] and E[Y ] exist and are finite). To proceed, let us discuss
the special case
0  Dϕ1 , Scd ðQÞ, Scd ðQÞ,1  Scd ðQÞ, λL ðScd ðPÞ, Scd ðQÞÞ
Z   (76)
FP ðxÞ F ðxÞ F ðxÞ
¼  log P + 1 P  FQ ðxÞ dλL ðxÞ
R FQ ðxÞ FQ ðxÞ FQ ðxÞ
of (67), (68). For the special sub-setup of nonnegative random variables
(and thus Y ¼ X ¼0, ∞½) with finite expectations and strictly positive cdf,
(76) simplifies to the so-called “cumulative Kullback–Leibler information”
of Park et al. (2012) (see also Park et al. (2018) for an extension to the whole
real line, Di Crescenzo and Longobardi (2015) for an adaption to possibly
smaller support as well as for an adaption to a dynamic form analogously
to the explanations in the following lines). In contrast, we illuminate the
special case
0  Dϕ1 , Ssu ðQÞ, Ssu ðQÞ,1  Ssu ðQÞ, λL ðSsu ðPÞ, Ssu ðQÞÞ
Z  
1  FP ðxÞ 1  FP ðxÞ 1  FP ðxÞ
¼  log + 1  ð1  FQ ðxÞÞ dλL ðxÞ
R 1  FQ ðxÞ 1  FQ ðxÞ 1  FQ ðxÞ
(77)
of (67), (68). This has been employed by Liu (2007) for the special case of
P ¼ Pemp
N and Q ¼ Qθ in order to obtain a corresponding minimum-divergence
parameter estimator of θ (see, e.g., also Yari and Saghafi (2012), Yari et al.
(2013), and Mehrali and Asadi (2021) for follow-up papers). For the general
context of nonnegative, absolutely continuous random variables (and thus
Y ¼ X ¼0, ∞½) with finite expectations and strictly positive cdf, (77) sim-
plifies to the so-called “cumulative (residual) Kullback–Leibler information”
of Baratpour and Habibi Rad (2012) (see also Park et al. (2012) for further
A unifying framework for some directed distances in statistics Chapter 5 189

propertiesn and Park et al. (2018) for an extension to the whole real line); the
latter has been adapted to a dynamic form by Chamany and Baratpour (2014)
as follows (adapted to our terminology): take arbitrarily fixed “instance”
t  0, Y ¼ X ¼t, ∞½ and replace in (77) the survival function Ssu x ðPÞ :¼
n o
1FP ðxÞ
f1  FP ðxÞgxR by Sx ðPÞ :¼ 1FP ðtÞ
su,t
being essentially the survival
xt,∞½
function of a random variable (e.g., residual lifetime) [X  tjX > t]
under P, and analogously for Q; accordingly, the integral range is ]t, ∞[.
We can generalize this by simply plugging Ssu,t(P), Ssu,t(Q) into our general
divergences (59) and (38)—and even (19)—(with λ ¼ λL). An analogous
dynamization can be done for density-functionals, by plugging Sλpd,t :¼
n o
fP ðxÞ
1FP ðtÞ instead of Sλpd ¼ f fP ðxÞgxR into (59) and (38)—and even
xt,∞½
(19)—and thus covering the corresponding dynamic Kullback–Leibler
divergence of Ebrahimi and Kirmani (1996) as well as the more general
ϕ-divergences between residual lifetimes of Vonta and Karagrigoriou (2010)
as special cases; notice that Sλpd,t is essentially the density function of the
random variable Xt :¼ [X  tjX > t] under P, where, e.g., X is typically a
(nonnegative) absolutely continuous random variable which describes the
residual lifetime of a person or an item or a “process” and hence, Xt is called
residual lifetime (at t) which is fundamentally used in survival analysis and
systems reliability engineering. In risk management and extreme value theory,
Xt describes the important notion of random excess (e.g., of a loss X) over
the threshold t, which is, e.g., employed in the well-known peaks-over-
threshold method.
λpd,t
n o
Analogously, we can plug in Se :¼ fP ðxÞ instead of Sλpd ¼
FP ðtÞ xt,∞½
f fP ðxÞgxR into (59) and (38)—and even (19)—and thus cover the
corresponding dynamic Kullback–Leibler divergence of Di Crescenzo and
Longobardi (2004) as well as the more general ϕ-divergences between past
λpd,t
lifetimes of Vonta and Karagrigoriou (2010) as special cases; notice that Se
is essentially the density function of the random variable [XjX  t] under P.
Classical quantile functions. The divergence (59) with S(P) ¼ Squ(P),
S(Q) ¼ Squ(Q) can be interpreted as a quantitative measure of tail risk of P,
relative to some pregiven reference distribution Q.o

n
In this sub-setup, they also introduce an alternative with ϕ e1 ðtÞ of (29) together with
R 1FP ðxÞ
su,var
Sx ðPÞ :¼ ∞ —rather than with ϕ1(t) of (30) together with Ssu
x ðPÞ :¼ 1  FP ðxÞ—
ð1FP ðξÞÞdξ
0

(and analogously for Q).


o
Hence, such a divergence represents an alternative to Faugeras and R€uschendorf (2018) where
they use hemimetrics rather than divergences.
190 SECTION II Information geometry

For Y ¼ R and X ¼ 0, 1½, we get for the quantiles context


Z  
1=2  
0  Dϕ , Squ ðQÞ,Squ ðQÞ,1  Squ ðQÞ,λL ðS ðPÞ, S ðQÞÞ ¼
qu qu
FP ðxÞ  FQ ðxÞ dλL ðxÞ
TV
X
(78)
which is nothing but the 1-Wasserstein distance between the two probability
measures P and Q. It is well-known that the right-hand sides of (75) and
(78) coincide, in contrast to the discussion on the “L2-case” right after (56).
Corresponding connections with optimal transport will be discussed in
Section 2.7.
Let us briefly discuss some other connections between ϕ-divergences and
quantile functions. In the above-mentioned setup of Baratpour and Habibi Rad
(2012) (under the existence of strictly positive probability density functions),
Sunoj et al. (2018) rewrite the cumulative Kullback–Leibler information (cf. the
special case of (77)) equivalently in terms of quantile functions. In contrast, in a
context of absolutely continuous probability distributions P and Q on X ¼ R with
strictly positive density functions fP and fQ, Sankaran

et 
al. (2016) rewrite the clas-
R fP ðxÞ
sical Kullback–Leibler divergence X ½ fP ðxÞ  log fQ ðxÞ
+ fQ ðxÞ  fP ðxÞdλL ðxÞ ¼

Dϕ ,SλL pd ðQÞ,SλL pd ðQÞ,1  SλL pd ðQÞ,λL ðSλL pd ðPÞ, SλL pd ðQÞÞ


(cf. (67)) equivalently in
1
terms of quantile functions; in the same setup, for α  0, 1½[1, ∞½
Kayal and Tripathy (2018) rewrite the classical α-order power diver-
gences (in fact, the classical α-order Tsallis cross-entropies which
R f ðxÞ α f ðxÞ
1
are multiples thereof ) X α  ðα1Þ  ½ð fP ðxÞÞ  α  fP ðxÞ + α  1  fQ ðxÞdλðxÞ ¼
Q Q

Dϕ ,SλL pd ðQÞ,SλL pd ðQÞ,1  SλL pd ðQÞ,λL ðSλL pd ðPÞ, SλL pd ðQÞÞ (cf.(65)) equivalently in terms
α
of quantile functions, where they also emphasize the advantage for distribu-
tions P and Q having closed-form quantile functions but nonclosed-form distri-
bution functions.
The above-mentioned contexts differ considerably from that of
Broniatowski and Decurninge (2016), who basically employ ϕ-divergences
Dϕ ðQQ , QP Þ between special quantile measures (rather than quantile func-
tions) QQ and QP ; recall that for any probability measure P on R, one can
associate a (signed) quantile measure QP on ]0, 1[ having as its generalized
distribution function nothing else but the quantile function FP of P. In more
detail, similarly to the above-mentioned empirical likelihood principle,
Broniatowski and Decurninge (2016) consider—in an i.i.d. context—the
minimization
Dϕ ðΩdis
N , QPN Þ :¼
emp inf Dϕ ðQQ , QPemp Þ
QQ  Ωdis
N
N

of the ϕ-divergences Dϕ ðQQ , QPemp Þ, where Ωdis


N is the subclass of quantile
N i 
measures QQ having support on N , i ¼ 0, 1, …, N of a desired model Ω of
A unifying framework for some directed distances in statistics Chapter 5 191

quantile measures QQe having support on [0, 1]; for example, the Q’s e may be
taken from a tubular neighborhood Λ—constructed through a finite collection
of conditions on L-moments (cf., e.g., Hosking, 1990)—of some class of dis-
tributions on R+, such as the Pareto- or Weibull-distribution class. Such tasks
have numerous applications in climate sciences or hydrology. As a side
remark, let us mention that for the general context of quantile measures QQ
and QP being absolutely continuous (with respect to the Lebesgue measure
λL on [0, 1]), the ϕ-divergence Dϕ ðQQ , QP Þ turns into the divergence
Dcϕ,Sqd ðPÞ,Sqd ðPÞ,Sqd ðPÞ,λ ðSqd ðQÞ, Sqd ðPÞÞ (cf. (59)) between the quantile density
L
   0 
functions Sqd ðPÞ :¼ Sqd x ðPÞ x  0,1½ :¼ ðFP Þ ðxÞ x  0,1½ and S (Q). Thus,
qd

by applying our general divergences (19) to Sqd(Q) and Sqd(P) we end up with
a completely new framework Dcϕ,m1 ,m2 ,m3 ,λ ðSqd ðQÞ, Sqd ðPÞÞ (and many interest-
ing special cases) for quantifying dissimilarities between quantile density
functions.
Depth, outlyingness, centered rank, and centered quantile functions.
As a special case one gets Dcϕ,Sde ðQÞ,Sde ðQÞ,r  Sde ðQÞ,λ ðSde ðPÞ, Sde ðQÞÞ ,
P L

Dcϕ,Sou ðQÞ,Sou ðQÞ,r  Sou ðQÞ,λL ðSou ðPÞ, Sou ðQÞÞ, di¼1 Dcϕ,Scr,i ðQÞ,Scr,i ðQÞ,r  Scr,i ðQÞ,λ ðScr,i ðPÞ,
Pd L

Scr,i ðQÞÞ , i¼1 Dϕ,Scqu,i ,Scqu,i ,r  Scqu,i ,λL ðS


c cqu,i
ðPÞ, Scqu,i ðQÞÞ , all of which have not
appeared elsewhere before (up to our knowledge); recall that the respective


domains of ϕ have to take care of the ranges R Sde ðPÞ  ½0, ∞, R ðSou ðPÞÞ 

cr,i
cqu,i
½0, ∞, R S ðPÞ  ½1, 1, R S ðPÞ   ∞, ∞½ (i  f1, …, dg).

2.5.1.3 m1(x) 5 m2(x):5 w(Sx(P), Sx(Q)), m3(x) 5 r(x)w(Sx(P), Sx(Q)) 


[0, ∞[ for some (measurable) functions w : RðSðPÞÞ
RðSðQÞÞ ! R
and r : X ! R
Such a choice extends the contexts of the previous Sections 2.5.1.1 and
2.5.1.2 (where the “connector function” w took the simple form w(u, v) ¼ 1
resp. w(u, v) ¼ v). This introduces a wide adaptive modeling flexibility, where
(33) specializes to
0  Dcϕ,wðSðPÞ,SðQÞÞ,wðSðPÞ,SðQÞÞ,r  wðSðPÞ, SðQÞÞ,λ ðSðPÞ, SðQÞÞ
Z    
Sx ðPÞ Sx ðQÞ
:¼ ϕ ϕ
X wðSx ðPÞ, Sx ðQÞÞ wðSx ðPÞ, Sx ðQÞÞ
    (79)
0 Sx ðQÞ Sx ðPÞ Sx ðQÞ
ϕ+,c  
wðSx ðPÞ,Sx ðQÞÞ wðSx ðPÞ,Sx ðQÞÞ wðSx ðPÞ, Sx ðQÞÞ
 wðSx ðPÞ,Sx ðQÞÞ  rðxÞ dλðxÞ ,
which for the discrete setup ðX , λÞ ¼ ðX# , λ# Þ (recall λ#[{x}] ¼ 1 for all
x  X# ) simplifies to
192 SECTION II Information geometry

0  Dcϕ,wðSðPÞ,SðQÞÞ,wðSðPÞ,SðQÞÞ,r  wðSðPÞ,SðQÞÞ,λ ðSðPÞ, SðQÞÞ


#
     
X Sx ðPÞ S x ðQ Þ S x ðQ Þ
0
¼ ϕ  ϕ  ϕ
xX wðSx ðPÞ, Sx ðQÞÞ wðSx ðPÞ, Sx ðQÞÞ +,c wðS ðPÞ, S ðQÞÞ
x x
 
S x ðPÞ Sx ðQÞ
   wðSx ðPÞ, Sx ðQÞÞ  rðxÞ:
wðSx ðPÞ, Sx ðQÞÞ wðSx ðPÞ, Sx ðQÞÞ
(80)
As a side remark, let us mention that by appropriate choices of w(, ) and ϕ in
(79) we can even derive divergences of the form (63) but with nonconvex
nonconcave ϕ: see, e.g., the “perturbed” power divergences of Roensch and
Stummer (2017).
In the following, let us illuminate the important special case of (80) with
ϕ ¼ ϕα (α  R, cf. (26), (30), (31), (28)) together with Sx(P)  0, Sx(Q)  0
(as it is always the case for Scd, Spd, Spm, Ssu, Smg, Sde, Sou, and for nonnegative
real-valued random variables also with Squ):
0  Dϕα ,wðSðPÞ, SðQÞÞ,wðSðPÞ, SðQÞÞ,r  wðSðPÞ, SðQÞÞ, λ ðSðPÞ, SðQÞÞ
Z
rðxÞ  ðwðSx ðPÞ, Sx ðQÞÞÞ1α
¼  ½ðSx ðPÞÞα + ðα  1Þ  ðSx ðQÞÞα (81)
X α  ðα  1Þ
i
α  Sx ðPÞ  ðSx ðQÞÞα1 dλðxÞ, for αRnf0, 1g,

0  Dϕ1 ,wðSðPÞ,SðQÞÞ,wðSðPÞ,SðQÞÞ,r  wðSðPÞ,SðQÞÞ,λ ðSðPÞ,SðQÞÞ


Z   (82)
Sx ðPÞ
¼ rðxÞ  Sx ðPÞ  log + Sx ðQÞ  Sx ðPÞ dλðxÞ,
X Sx ðQÞ
0  Dϕ0 ,wðSðPÞ,SðQÞÞ,wðSðPÞ,SðQÞÞ,r  wðSðPÞ,SðQÞÞ,λ ðSðPÞ,SðQÞÞ
Z   (83)
S ðPÞ S ðPÞ
¼ rðxÞ  wðSx ðPÞ,Sx ðQÞÞ   log x + x  1 dλðxÞ,
X Sx ðQÞ Sx ðQÞ
0  Dϕ2 ,wðSðPÞ,SðQÞÞ,wðSðPÞ,SðQÞÞ,r  wðSðPÞ,SðQÞÞ,λ ðSðPÞ, SðQÞÞ
Z (84)
rðxÞ ðSx ðPÞ  Sx ðQÞÞ2
¼  dλðxÞ :
X 2 wðSx ðPÞ, Sx ðQÞÞ
λ-probability-density functions. For general X , r(x) ¼ 1, and (cf.
Remark 1 (c)) Sx ðPÞ ¼ Sλpd λpd
x ðPÞ :¼ fP ðxÞ  0, Sx ðQÞ ¼ Sx ðQÞ ¼ fQ ðxÞ  0,
the divergences (79), (80), (81) to (84) are due to Kißlinger and Stummer
(2013, 2015, 2016) (where they also gave indications on nonprobability
measures). Recall that this directly subsumes for X ¼ Y ¼ R the
“classical density” functional Sλpd() ¼ Spd() with the choice λ ¼ λL (and
the Riemann integration dλL(x) ¼ dx), as well as for the discrete setup Y ¼
X ¼ X# the “classical probability mass” functional Sλpd() ¼ Spm() with
the choice λ ¼ λ#.
A unifying framework for some directed distances in statistics Chapter 5 193

Distribution functions. Recall that Y ¼ X ¼ R, Sx ðPÞ ¼ Scd x ðPÞ ¼FP ðxÞ,


Sx ðQÞ ¼ Sx ðQÞ ¼ FQ ðxÞ. Let us illuminate (84) for the setup of a real-
cd

valued random variable Y, with FQ(x) ¼ Q[Y  x] under a hypothetical/


candidate law Q, and with FP ðxÞ ¼ N1  #fi  f1, …, Ng : Y i  xg ¼:
FPemp ðxÞ as the distribution function of the corresponding data-derived
N
PN
“empirical distribution” P :¼ Pemp N :¼ N 
1
i¼1 δY i ½   of a N-size i.i.d.
sample Y 1 , …, Y N of Y. In such a set-up, the choice λ ¼ Q in (84) and
multiplication with 2N lead to
   
emp
0  2N  Dϕ ,wðScd ðPempÞ, Scd ðQÞÞ,wðScd ðPemp Þ, Scd ðQÞÞ, r  wðScd ðPempÞ, Scd ðQÞÞ,Q Scd PN , Scd ðQÞ
2 N N N
 2
Z FPemp ðxÞ  FQ ðxÞ
¼ N  r ðxÞ  N  dQðxÞ:
R w FPemp ðxÞ, FQ ðxÞ
N

(85)
The special case w(u, v) ¼ 1 reduces to the Cramer–von Mises (test statistics)
family (47), and the choice r(x) ¼ 1, w(u, v) ¼ v  (1  v) gives the Anderson–
Darling (1952) test statistics. With (85), we can also imbed as special cases
(together with r(x) ¼ 1) some other known divergences which emphasize
the upper tails: w(u, v) ¼ 1  v (cf. Ahmad et al., 1988), w(u, v) ¼ 1  v2
(cf. Rodriguez and Viollaz, 1995, see also Shin et al. (2012) for applications
in environmental extreme-value theory), w(u, v) ¼ (1v)β with β > 0 (cf.
Deheuvels and Martynov (2003), see also Chernobai et al. (2015) for the case
β ¼ 2 together with a left-truncated version of the empirical distribution func-
tion). Moreover, (85) covers as special cases (together with r(x) ¼ 1) some
other known divergences which emphasize the lower tails: w(u, v) ¼ v (cf.
Ahmad et al., 1988; Scott, 1999), w(u, v) ¼ vβ with β > 0 (cf. Deheuvels
and Martynov, 2003), w(u, v) ¼ v  (2  v) (cf. Rodriguez and Viollaz
(1995), see also Shin et al. (2012)). In contrast, in a two-sample-test situation
P
where Q is replaced by the empirical distribution P eemp :¼ 1  L δe ½   of a
L L i¼1 Y i
e e
L-size i.i.d. sample Y 1 , …, Y L of Y (under Q), some authors (e.g., Hajek et al.,
1999; Rosenblatt, 1952) choose divergences which can be imbedded (with the
choice w(u, v) ¼ 1, r(x) ¼ 1) in our framework as multiple of
cd eemp eemp is an
Dϕ2 ,1,1,1,λ ðScd ðPemp
N Þ, S ðPL ÞÞ where λ ¼ c1  PN
emp
+ ð1  c1 Þ  P L
appropriate mixture with c1 ]0, 1[. In further contrast, if one chooses the
Lebesgue measure λ ¼ λL (with the usual Riemann integration dλL(x) ¼ dx)
and r(x) ≡ 1 in (85), then one ends up with an adaptively weighted extension
of (48).
Classical quantile functions. The divergence (79) with S(P) ¼ Squ(P), S(Q) ¼
S (Q), λ ¼ λL, i.e., Dcϕ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,r  wðSqu ðPÞ,Squ ðQÞÞ,λL ðSqu ðPÞ, Squ ðQÞÞ
qu

—which has first been given in Stummer (2021) in an even more flexible form—
can be interpreted as a quantitative measure of tail risk of P, relative to some preg-
iven reference distribution Q; corresponding connections with optimal transport
will be discussed in Section 2.7.
194 SECTION II Information geometry

Depth, outlyingness, centered rank and centered quantile functions. As


a special case of (79) one gets
Dcϕ, wðSde ðQÞ, Sde ðQÞÞ, wðSde ðQÞ, Sde ðQÞÞ, r  wðSde ðQÞ, Sde ðQÞÞ, λL ðSdeðPÞ,Sde ðQÞÞ,
Dcϕ, wðSou ðQÞ, Sou ðQÞÞ, wðSou ðQÞ, Sou ðQÞÞ, r  wðSou ðQÞ, Sou ðQÞÞ, λL ðSou ðPÞ,Sou ðQÞÞ,
Pd
i¼1 Dϕ, wðScr, i ðQÞ, Scr, i ðQÞÞ, wðScr, i ðQÞ, Scr, i ðQÞÞ, r  wð Scr, i ðQÞ, Scr, i ðQÞÞ, λL ðS ðPÞ,Scr, i ðQÞÞ,
c cr , i
Pd
i¼1 Dϕ, wðScqu, i , Scqu, i Þ, wðScqu, i , Scqu, i Þ, r  wðScqu, i , Scqu, i Þ, λL ðS ðPÞ,Scqu, i ðQÞÞ,
c cqu, i

all of which have not appeared elsewhere before (up to our knowledge);
recall that the respective domains of ϕ have to take care of the ranges




R Sde ðPÞ  ½0, ∞ , R ðSou ðPÞÞ  ½0,∞ , R Scr,i ðPÞ  ½1, 1 , R Scqu,i ðPÞ
  ∞,∞½ (i f1, …,dg).
~ ~
2.5.2 m1 ðxÞ ¼ S x ðPÞ and m2 ðxÞ ¼ S x ðQÞ with statistical functional
e 6¼ S, m3(x)  0
S
n o
e
Recall SðPÞ :¼ fSx ðPÞgxX , SðQÞ :¼ fSx ðQÞgxX , and let SðPÞ :¼ Sex ðPÞ ,
n o xX
e
SðQÞ :¼ Sex ðQÞ for (typically) Se being “essentially different” to S
xX
(e.g., take Se and S as different choices from Scd, Spd, Spm, Ssu, Smg, Squ, Sde, Sou).
For this special case, from (19) one can deduce
0  Dϕ,S~ðPÞ,S~ðQÞ,m3 ,λ ðSðPÞ, SðQÞÞ
Z " ! ! ! !#
Sx ðPÞ Sx ðQÞ 0 Sx ðQÞ Sx ðPÞ Sx ðQÞ
¼ ϕ ϕ  ϕ+,c   m3 ðxÞ dλðxÞ,
Sex ðPÞ Sex ðQÞ Sex ðQÞ Sex ðPÞ Sex ðQÞ
X
(86)
which for the discrete setup ðX , λÞ ¼ ðX# , λ# Þ simplifies to
0  Dϕ,S~ðPÞ,S~ðQÞ,m3 ,λ# ðSðPÞ, SðQÞÞ
" ! ! ! !#
X Sx ðPÞ Sx ðQÞ Sx ðQÞ Sx ðPÞ Sx ðQÞ
0
¼ ϕ ϕ  ϕ+,c   m3 ðxÞ:
xX Sex ðPÞ Sex ðQÞ Sex ðQÞ Sex ðPÞ Sex ðQÞ

As an example, take Y ¼ X ¼ ½0, ∞½, λ ¼ λL, the probability (Lebesgue-)


density functions S ¼ Spd, i.e., SðPÞ ¼ fSx ðPÞgx  ½0,∞½ ¼ f fP ðxÞgx  ½0,∞½ ¼
n o
dFP ðxÞ
dx , as well as the survival (reliability, tail) functions Se ¼ Ssu , i.e.,
n
x½0,∞½
o
e ¼ Sex ðPÞ
SðPÞ ¼ f1  FP ðxÞgx½0, ∞½ ¼ fP½ x, ∞½ gx½0, ∞½ . Accord-
x ½0, ∞½
ingly, the function x ! Sx ðPÞ ¼ 1FfP ðxÞ —with the convention 0c ¼ ∞ for all
eSx ðPÞ P ðxÞ

c  R —can be interpreted as the hazard rate function (failure rate function,


force of mortality) under the model distribution P (and analogously under the
alternative model distribution Q) of a nonnegative random variable Y. Hence,
(86) turns into
A unifying framework for some directed distances in statistics Chapter 5 195



0  Dϕ,Ssu ðPÞ,Ssu ðQÞ,m3 ,λL Spd ðPÞ, Spd ðQÞ
Z    
fP ðxÞ fQ ðxÞ
¼ ϕ ϕ
1  FP ðxÞ 1  FQ ðxÞ
X
   
fQ ðxÞ fP ðxÞ fQ ðxÞ
ϕ0+,c   m3 ðxÞ dλL ðxÞ,
1  FQ ðxÞ 1  FP ðxÞ 1  FQ ðxÞ
which can be interpreted as divergence between the two modeling hazard rate
functions at stake.

2.6 Auto-divergences
The main stream of this paper deals with divergences/distances between
(families of ) real-valued “statistical functionals” S() of the form SðPÞ :¼
fSx ðPÞgxX and SðQÞ :¼ fSx ðQÞgxX stemming from two different distribu-
tions P and Q. In quite some meaningful situations, P and Q can stem from
the same fundamental underlying random mechanism P̆. Take for instance
the situation where Y ¼ X ¼ R, λ ¼ λL and Y 1 , …Y N are i.i.d. observations
from a random variable Y with distribution P̆ having (with a slight abuse
of notation P̆ ¼ P̆ ∘ Y1) distribution function FP̆ (x) ¼ P̆[Y  x] which is dif-
dF ðxÞ
ferentiable with a density fP̆ ðxÞ ¼ dxP̆
being positive in an interval and zero
elsewhere. The corresponding order statistics are denoted by Y 1:N < Y 2:N <
… < Y N:N where Yk:N is the k-th largest observation and in particular
Y 1:N :¼ min fY 1 , …Y N g, Y N:N :¼ max fY 1 , …Y N g; the distribution P̆k of Yk:N
(k  f1, …, Ng ) has distribution function FP̆k(x) :¼ P̆[Yk:N  x] with well-
known density function

N!
k1
nk
fP̆k ðxÞ :¼  FP̆ ðxÞ  1  FP̆ ðxÞ  fP̆ ðxÞ (87)
ðN  kÞ!  ðk  1Þ!

(see, e.g., Reiss (1989), Arnold et al. (1992), and David and Nagaraja (2003)
for comprehensive treatments of order statistics). In such a context, it makes
sense to take P :¼P̆j, Q :¼ P̆k (j, k  f1, …, Ng) respectively P :¼ P̆, Q :¼P̆k
(or vice versa) and study the divergences


0  Dϕ, m1 , m2 , m3 , λL Spd ðP̆j Þ,Spd ðP̆k Þ
Z " fP̆j ðxÞ
!     fP̆j ðxÞ fP̆ ðxÞ
!#
fP̆k ðxÞ 0
fP̆k ðxÞ
:¼ ϕ ϕ  ϕ+ , c   k
m3 ðxÞ dλL ðxÞ
m1 ðxÞ m2 ðxÞ m2 ðxÞ m1 ðxÞ m2 ðxÞ
X

respectively
196 SECTION II Information geometry



0  Dϕ, m1 , m2 , m3 , λL Spd ðP̆Þ, Spd ðP̆k Þ
Z " fP̆j ðxÞ
!      #
fP̆k ðxÞ 0
fP̆k ðxÞ fP̆ ðxÞ fP̆k ðxÞ
:¼ ϕ ϕ  ϕ+, c   m3 ðxÞ dλL ðxÞ ,
m1 ðxÞ m2 ðxÞ m2 ðxÞ m1 ðxÞ m2 ðxÞ
X
(88)
or deterministic transformations thereof.
For instance, (some of ) the divergences in Ebrahimi et al. (2004) and
Asadi et al. 
(2006) can beimbedded here 
as the special

cases
˘ ˘ ˘ ˘
Dϕ1 , 1, 1, 1, λL S ðPj Þ, S ðP k Þ , Dϕ1 , 1, 1, 1, λL S ðPÞ, S ðP k Þ ,
pd pd pd pd
 
α1 log ½1 + α 
1 ðα  1Þ  Dϕ Spd ðP˘j Þ,Spd ðP˘k Þ ,
α , S ð P̆ k Þ, S ð P̆ k Þ, S ð P̆ k Þ, λ
pd pd pd
 
˘ Spd ðP˘ Þ , for α  Rnf0, 1g.
α1 log ½1 + α  ðα  1Þ 
1 Dϕ Spd ðPÞ,
k
α , S ð P̆ k Þ, S ð P̆ k Þ, S ð P̆ k Þ, λ
pd pd pd

For other (nonauto type) scaled Bregman divergences involving distribu-


tions of certain transforms of spacings between observations (i.e., differences
of order statistics), the reader is, e.g., referred to Roensch and Stummer
(2019b).
Vaughan and Venables (1972), Bapat and Beg (1989), and Hande (1994)
give some extensions of (87) for random observations Y 1 , …Y N which are
independent but nonidentically distributed, e.g., their distributions may be
linked by a common (scalar or multidimensional) parameter; this is a common
situation in contemporary statistical applications, e.g., in data analytics,
artificial intelligence, and machine learning (which employ GLM models,
etc.). By employing (88) for these extensions of (87), we end up with an even
wider new toolkit for auto-divergences between (distributions of ) order
statistics.
With a completely different auto-divergence flavor, we apply (65)
to define the generalized power divergence (of order α  Rnf0, 1g)
Dϕα ,SðQθ Þ,SðQθ Þ,r  SðQθ Þ,λL ðjS0 ðQθ Þj, SðQθ ÞÞ between a classical θ-parametric den-
n o
sity function SðQθ Þ :¼ Spd ðQθ Þ ¼ SλL pd ðQθ Þ ¼ fQθ ðxÞ and the modulus
nx∂f R ðxÞo
 
of its (supposedly existing) θ-derivative jS0 ðQθ Þj :¼  Q∂θθ  . By (66)
xR
(with r(x) ¼ 1), we can rewrite
Z   1=ðα1Þ
∂ f Q θ ðxÞα  1α
Fα : ¼    f ðxÞ dλ ðx Þ
 ∂θ  Qθ L
 R
¼ α  ðα  1Þ  Dϕα ,SðQθ Þ,SðQθ Þ,SðQθ Þ,λL ðjS0 ðQθ Þj, SðQθ ÞÞ
Z   1=ðα1Þ
∂ f Qθ ðxÞ
+ α    dλL ðxÞ + 1  α
R ∂θ 
∂f ðxÞ
provided that (say) α > 1, Q∂θθ ðxÞ 6¼ 0, f Qθ ðxÞ > 0 (for a.a. x) and
R ∂ f Qθ ðxÞ
R j ∂θ j dλL ðxÞ < ∞ (otherwise there are extra terms according to (66)).
A unifying framework for some directed distances in statistics Chapter 5 197

Hence, within our universal divergence framework we have imbedded the so


called α-order Fisher information measure Fα of Boekee (1977) with the spe-
cial case α ¼ 2 being the omnipresent Fisher information.

2.7 Connections with optimal transport and coupling


In this section we consider the context of Section 2.5.1.3 with X ¼ 0, 1½,
Lebesgue measure λ ¼ λL as well as r(x) ¼ 1 for all x  X, and apply
this to the quantile functions Squ ðPÞ ¼ fSx ðPÞgx 0,1½ :¼ FP ðxÞ x 0,1½ :¼
f inf fz R : FP ðzÞ  xggx 0,1½ and Squ(Q) of two random variables X and Y
on Y ¼ R having distribution P and Q, respectively; recall from Section
 2.1
that for Y ¼ ½0, ∞½ we take Squ ðPÞ ¼ fSx ðPÞgx 0,1½ :¼ FP ðxÞ x 0,1½ :¼
f inf fz ½0, ∞½: FP ðzÞ  xggx 0,1½ instead. Accordingly, we quantify the
corresponding dissimilarity as the divergence (directed distance)
Dcϕ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,λL ðSqu ðPÞ, Squ ðQÞÞ
Z " ! !
FP ðxÞ FQ ðxÞ
:¼ ϕ ϕ
0,1½ wðFP ðxÞ, FQ ðxÞÞ wðFP ðxÞ, FQ ðxÞÞ
! !#
FQ ðxÞ FP ðxÞ FQ ðxÞ
ϕ0+,c  
wðFP ðxÞ, FQ ðxÞÞ wðFP ðxÞ, FQ ðxÞÞ wðFP ðxÞ, FQ ðxÞÞ

 wðFP ðxÞ, FQ ðxÞÞ  rðxÞ dλL ðxÞ (89)


Z
¼ e
ψðF P ðxÞ, FQ ðxÞÞ dλL ðxÞ
0,1½

with ψe : RðFP Þ
RðFQ Þ 7! ½0, ∞ defined by (cf. (I2) and (21))
 
u v
ψe ðu,vÞ :¼ wðu, vÞ  ψ ϕ,c ,  0 with
wðu, vÞ wðu,vÞ
     
u v u v
ψ ϕ,c , :¼ ϕ ϕ
wðu, vÞ wðu, vÞ wðu, vÞ wðu, vÞ
   
v u v
 ϕ0+,c   :
wðu,vÞ wðu, vÞ wðu, vÞ
Under Assumption 1 (and hence under the more restrictive Assumption 2) of
Stummer (2021)—who deals even with a more general context where the
scaling and the aggregation function need not coincide—one can adapt
Theorem 4 and Corollary 1 of Broniatowski and Stummer (2019) to obtain
the desired basic divergence properties (D1) and (D2) in the form of
198 SECTION II Information geometry

ðNNÞDcϕ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,λL ðSqu ðPÞ, Squ ðQÞÞ  0
ðREÞ Dcϕ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,λL ðSqu ðPÞ, Squ ðQÞÞ ¼ 0
if and only if FP ðxÞ ¼ FQ ðxÞ for λL a:a: x  X:

In order to establish a connection between the divergence (89) and optimal


transport problems, we impose for the rest of this section the additional
requirement that the function ψe is continuous (except for the point (u, v) ¼
(0, 0)) and quasi-antitonep in the sense
e 1 , v1 Þ + ψðu
ψðu e 2 , v2 Þ  ψðu
e 2 , v1 Þ + ψðu
e 1 , v2 Þ for all u1  u2 , v1  v2 ;
in other words, ψð e  ,  Þ is assumed to be continuous (except for the point
(u, v) ¼ (0, 0)) and quasi-monotone.q,r For such a setup, one can consider the
Kantorovich transportation problem (KTP) with the pointwise-BS-distance-type
(pBS-type) cost function ðu, vÞ ! e vÞ; indeed, Stummer (2021) recently
7 ψðu,
obtained (an even more general version of ) the following
Theorem 3. Let ΓðP, e QÞ be the family of all probability distributions P
on R
R which have marginal distributions P½ 
R ¼ P½   and
P½R
  ¼ Q½  . Moreover, we denote the corresponding upper Hoeffding–
Frechet bound (cf., e.g., Theorem 3.1.1 of Rachev and R€uschendorf (1998))
by Pcom having “comonotonic” distribution function FPcom ðu, vÞ :¼
minfFP ðuÞ,FQ ðvÞg (u, v  R). Then

min e YÞ
E ψðX, (90)
fXP, YQg
Z
¼ min ψe dPðu, vÞ (91)
fP  e
ΓðP, QÞg
R
R
Z
¼ e vÞ dPcom ðu, vÞ
ψðu, (92)
R
R
Z
¼ ψ
e ðFP ðxÞ, FQ ðxÞÞ dλL ðxÞ
0,1½
¼ Dϕ, wðSqu ðPÞ, Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,λL ðSqu ðPÞ,Squ ðQÞÞ  0,
c
(93)

where the minimum in (90) is taken over all R-valued random variables X and
Y (on an arbitrary probability space ðΩ, A, SÞ) such that P½X    ¼ P½  ,
P½Y    ¼ Q½  . As usual, E denotes the expectation with respect to P.

p
Other names are submodular, Lattice-subadditive, 2-antitone, 2-negative, Δ-antitone, supernega-
tive, “satisfying the (continuous) Monge property/condition.”
q
Other names are supermodular, Lattice-superadditive, 2-increasing, 2-positive, Δ-monotone,
2-monotone, “fulfilling the moderate growth property,” “satisfying the measure property,”
“satisfying the twist condition.”
r
A comprehensive discussion on general quasi-monotone functions can be found, e.g., in Chapter
6.C of Marshall et al. (2011).
A unifying framework for some directed distances in statistics Chapter 5 199

Remark 2. (i) Notice that Pcom is ψ-independent, e and may not be the unique
minimizer in (91). As a (not necessarily unique) minimizer in
(90), one can take X :¼ FP ðUÞ, Y :¼ FQ ðUÞ for some uniform
random variable U on [0, 1].
(ii) In Theorem 3 we have shown that Pcom (cf. (92)) is an optimal
transport plan of the KTP (91) with the pointwise-BS-distance-type
(pBS-type) cost function ψðu, e vÞ. The outcoming minimal value is equal
to Dcϕ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,λL ðSqu ðPÞ, Squ ðQÞÞ which is
typically straightforward to compute (resp. approximate).
(iii) Depending on the chosen divergence, one may have to restrict the sup-
port of P and Q, for instance to (subsets of ) [0, ∞[.

Remark 2 (ii) generally contrasts to those prominently used KTP whose


cost function is a power d(u, v)p of a metric d(u, v) (denoted as POM-type cost
function) which leads to the well-known Wasserstein distances. (Apart from
technicalities) There are some overlaps, though:
Example 1. (i) Take Y  ½0, ∞Þ (and thus the support of P and Q is
contained in [0, ∞[) together with the nonsmooth ϕ(t) :¼
ϕTV(t) :¼ jt  1j (t  [0, ∞[), c ¼ 12 , w(u, v) :¼ v  [0,
e vÞ ¼ ju  vj ¼: dðu, vÞ (u, v  [0, ∞[).
∞[ to obtain ψðu,
For an extension to Y 2 ¼ R see Stummer (2021).
(ii) Take Y ¼ R, ϕðtÞ :¼ ϕ2 ðtÞ :¼ ðt1Þ2 (t  R, with obsolete c), w(u, v) :¼ 1
2 2
e vÞ ¼ ðuvÞ
to end up with ψðu, 2 ¼ dðu,2vÞ .
2
(iii) The symmetric distances d(u, v) and dðu,2vÞ are convex functions of
u  v and thus continuous quasi-antitone functions. The corres-
pondingly outcoming Wasserstein distances are thus conside-
rably flexibilized by our new much more general distance
Dcϕ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ,wðSqu ðPÞ,Squ ðQÞÞ, λL ðSqu ðPÞ,Squ ðQÞÞ of (93).

We give some further special cases of pBS-type cost functions, which are
continuous and quasi-antitone, but which are generally not symmetric and
thus not of POM type:
Example 2. “Smooth” pointwise ϕ-divergences (i.e., pointwise Csiszar-Ali-
Silvey-Morimoto divergences): take ϕ : ½0, ∞½7!R to be a strictly convex,
twice continuously differentiable function on ]0, ∞[ with continuous exten-
sion on t ¼ 0, together with w(u, v) :¼ v  ]0, ∞[, and c is obsolete. Accord-


e ðu, vÞ ¼ v  ϕ uv  v  ϕ ð1Þ  ϕ ð1Þ  ðu  vÞ, and hence the second
0
ingly, ψ


mixed derivative satisfies ∂ e ψ
¼  u ϕ00 u < 0 (u, v  ]0, ∞[); thus, ψe is
2

∂u∂v v2 v
quasi-antitone on ]0, ∞[
]0, ∞[. Accordingly, (90) to (93) applies to such
kind of (cf. Section 2.5.1.2) ϕ-divergences concerning P,Q having support
γ
in [0, ∞[. As an example, take, e.g., the power function ϕðtÞ :¼ t γ  t + γ1
γ  ðγ1Þ
200 SECTION II Information geometry

(γ Rnf0,1g ). A different connection between optimal transport and other


kind of ϕ-divergences can be found in Bertrand et al. (2021).

Example 3. “Smooth” pointwise classical (i.e., unscaled) Bregman diver-


gences (CBD): take ϕ : R 7! R to be a strictly convex, twice continuously
differentiable function w(u, v) :¼ 1 and c is obsolete. Accordingly,
2
e vÞ :¼ ϕ ðuÞ  ϕ ðvÞ  ϕ0 ðvÞ  ðu  vÞ and hence ∂ eψ ðu, vÞ 00
ψðu, ¼ ϕ ðvÞ < 0
∂u∂v
(u,v  R); thus, ψe is quasi-antitone on R
R. Accordingly, the representation
(90) to (93) applies to such kind of (cf. Section 2.5.1.1) CBD. The
corresponding special case of (91) is called “a relaxed Wasserstein distance
(parameterized by ϕ) between P and Q” in the recent papers of Lin et al.
(2019) and Guo et al. (2021) for a restrictive setup where P and Q are supposed
to have compact support; the latter two references do not give connections to
divergences of quantile functions but substantially concentrate on applications
to topic sparsity for analyzing user-generated web content and social media,
respectively, to generative adversarial networks (GANs).

Example 4. “Smooth” pointwise Scaled Bregman Distances: for instance,


consider P and Q with support in [0, ∞[. One gets that ψe is quasi-antitone
on ]0, ∞[
]0, ∞[ if the generator function ϕ is strictly convex and thrice con-
tinuously differentiable on ]0, ∞[ (and hence, c is obsolete) and the so-called
scale connector w is twice continuously differentiable such that—on ]0,
∞[
]0, ∞[—ψe is twice continuously differentiable and ∂ ψe  0 (an explicit
2

∂u∂v
formula of the latter is given in the appendix of Kißlinger and Stummer
(2018), who also give applications to robust change detection in data streams).
Illustrative examples of suitable ϕ and w can be found, e.g., in Kißlinger and
Stummer (2016).

Returning to the general context, it is straightforward to see that if P does


not give mass to points (i.e., it has continuous distribution function FP) then
there exists even a deterministic optimal transportation plan: indeed, for the
map T com :¼ FQ ∘FP one has Pcom ½   ¼ P½ðid, T com Þ    and thus (92) is
equal to
Z
e T com ðuÞÞ dPðuÞ
ψðu,
R
Z
¼ min fT  ΓðP,QÞg e TðuÞÞ dPðuÞ
ψðu, (94)
^
R

e TðXÞÞ
¼ min fXP, TðXÞQg E ψðX,
A unifying framework for some directed distances in statistics Chapter 5 201

^ QÞ
where (94) is called Monge transportation problem (MTP). Here, ΓðP,
denotes the family of all measurable maps T : R 7! R such that P[T  ]
¼ Q[].

3 Aggregated/integrated divergences
Suppose that ϕ ¼ ϕz, P ¼ Pz, Q ¼ Qz, m1 ¼ m1,z, m2 ¼ m2,z, m3 ¼ m3,z, λ ¼ λz
depend on the same (!!) “parameter/quantity” Rz  Z . Then it makes sense to
study the aggregated/integrated divergence Z Dϕz ,m1,z ,m2,z ,m3,z ,λz ðSðPz Þ, SðQz ÞÞ
d̆λðzÞ where λ̆ is a σ-finite measure on Z (e.g., the Lebesgue measure λL,
the counting measure λ# or a probability measure, where in case of the latter
one also uses the terminology “expected divergence”).
An interesting special case is the following family: recall first that for the
two-element space Y ¼ X ¼ f0, 1g we denote the corresponding probability
mass function as Spm ðPÞ ¼ fP½fxggxX ¼ f1  P½f1g, P½f1gg; in other
words, P is a Bernoulli distribution Ber(θP) which is completely determined
by its parameter θP  [0, 1] with interpretation θP ¼ P[{1}]. Now suppose
that θP ¼ θP(z) depends on a real-valued parameter z  R. In such a situation
it makes sense to study the aggregated (integrated) divergence for
ϕ  ΦC1 ða, b½Þ
Z    
0 Dϕ,m1,z ,m2,z ,m3,z ,λ# Spm ðBer ðθP ðzÞÞÞ, Spm Ber θQ ðzÞ dλ˘ðzÞ
R
Z (" !
1 θQ ðzÞ
!
1 θQ ðzÞ
! !#
1 θP ðzÞ 1 θP ðzÞ 1 θQ ðzÞ
¼ ϕ ϕ  ϕ0    m3,z ð0Þ
R m1,z ð0Þ m2,z ð0Þ m2,z ð0Þ m1,z ð0Þ m2,z ð0Þ
" ! ! ! !# )
θP ðzÞ θQ ðzÞ θQ ðzÞ θP ðzÞ θQ ðzÞ
+ ϕ ϕ  ϕ0    m3,z ð1Þ d λ̆ðzÞ
m1,z ð1Þ m2,z ð1Þ m2,z ð1Þ m1,z ð1Þ m2,z ð1Þ
(95)

where λ̆ is a σ-finite measure on R (e.g., the Lebesgue measure λL, the count-
ing measure λ# or a probability measure) and the scaling functions m1, m2 as
well as the aggregating function m3 are allowed to depend (in a measurable
way) on z (which is denoted by extending their indices with z). For the non-
differentiable case ϕ  Φ(]a, b[), the derivative ϕ0 has to be replaced by ϕ0+,c .
In adaption of the discussion after formula (25), by defining the integral
R R
e :¼ ½
functional geϕ,m3 ,λ̆ ðξÞ e
R f0,1g ϕðξðx, zÞÞ  m3 ðxÞ dλ# ðxÞ d λ̆ðzÞ and plugging
in, e.g.,
 pm 
S ðBer ðθP ð  ÞÞÞ
geϕ,m3 , λ̆
8 m1,  9
Z <     =
1  θ P ðzÞ θ P ðzÞ
¼ ϕ  m3,z ð0Þ + ϕ  m3,z ð1Þ d λ̆ðzÞ, (96)
R: m1,z ð0Þ m1,z ð1Þ ;
202 SECTION II Information geometry

the divergence in (95) can be (formally) interpreted as


Z
0 Dϕ,m1,z ,m2,z ,m3,z ,λ# ðSpm ðBerðθP ðzÞÞÞ, Spm ðBerðθQ ðzÞÞÞÞ d λ̆ðzÞ
R
 pm   pm 
S ðBerðθP ð  ÞÞÞ S ðBerðθQ ð  ÞÞÞ
¼ geϕ,m , λ̆  geϕ,m3 , λ̆
3 m1,  m2, 
 pm 
0 S ðBerðθQ ð  ÞÞÞ S ðBerðθP ð  ÞÞÞ Spm ðBerðθQ ð  ÞÞÞ
pm
e
g ϕ,m3 , λ̆ ,  :
m2,  m1,  m2, 

As an important special case, take λ̆ :¼ λL (and we formally identify the


Lebesgue-integral with the Riemann-integral over dz), θP ðzÞ :¼ FP ðzÞ ¼
z ðPÞ, θ Q ðzÞ :¼ FQ ðzÞ ¼ Q½  ∞, z ¼ Sz ðQÞ, m1,z(0) ¼
P½  ∞, z ¼ Scd cd

m2,z(0) ¼ m3,z(0) ¼ 1  θQ(z), m1,z(1) ¼ m2,z(1) ¼ m3,z(1) ¼ θQ(z), and


accordingly (95) simplifies to
Z
0 ˘
Dϕ,m1,z ,m2,z ,m3,z ,λ ðSpm ðBerðθP ðzÞÞÞ, Spm ðBerðθQ ðzÞÞÞÞ dλðzÞ
R
Z     
1  FP ðzÞ 1  FP ðzÞ
¼ ϕ  ϕ ð1Þ  ϕ0 ð1Þ  1  ð1  FQ ðzÞÞ
R 1  FQ ðzÞ 1  FQ ðzÞ
    
FP ðzÞ 0 FP ðzÞ
+ ϕ  ϕ ð1Þ  ϕ ð1Þ  1  FQ ðzÞ dz
FQ ðzÞ FQ ðzÞ
¼: CPDϕ ðP, QÞ,

which in case of ϕ(1) ¼ 0 becomes


0  CPDϕ ðP, QÞ
Z      
1  FP ðzÞ FP ðzÞ (97)
¼ ϕ  ð1  FQ ðzÞÞ + ϕ  FQ ðzÞ dz:
R 1  FQ ðzÞ FQ ðzÞ
If basically ϕ(0) ¼ ϕ(1) ¼ 0 and P, Q are generated by random variables, say
P ¼ Pr[X  ], Q ¼ Pr[Y   ]—and thus FP(z) ¼ Pr[X  z], FQ(z) ¼ Pr[Y 
z]—then according to (97) the CPDϕ(P, Q) coincides with the cumulative
paired ϕ-divergence CPDϕ(X, Y ) of Klein et al. (2016); the special case
CPDϕα ðX, YÞ with ϕ ¼ ϕα from (26) was employed by Jager and Wellner
(2007). Notice that without the assumption ϕð1Þ ¼ 0 ¼ ϕ0 ð1Þ, the right-hand
side of (97) may become negative and thus is not a divergence anymore.
As a side remark, notice that in the “unscaled setup” λ̆ :¼ λL, θP(z) :¼
FP(z), m1,z(0) ¼ m3,z(0) ¼ m1,z(1) ¼ m3,z(1) ¼ 1, the formula (96) becomes
 pm  Z
S ðBerðθP ð  ÞÞÞ
geϕ,m , λ̆ ¼ fϕ ð1  FP ðzÞÞ + ϕ ðFP ðzÞÞg dz
3 m1,  R

which corresponds to the cumulative ϕ-entropy of P introduced by Klein et al.


(2016).
A unifying framework for some directed distances in statistics Chapter 5 203

4 Dependence expressing divergences


Let the data take values in some product space Y ¼
di¼1 Yi with product
σ-algebra A ¼ di¼1 A i . On this, we consider probability distributions
P having marginals Pi determined by Pi ½Ai  :¼ P½Y 1 ⋯
Ai ⋯
Yd 
(i  f1, …, dg, Ai  A i ). Furthermore, let Q :¼ di¼1 Pi be the product mea-
sure having the same marginals as P. Typically, P½   :¼ Pr½ðY1 ,…, Yd Þ  
is the joint distribution of some random variables Y 1 ,…, Y d; on the other hand,
the latter are independent under the (generally different) probability measure
Q. As usual, we also involve statistical functionals SðPÞ :¼ fSx ðPÞgxX and
SðQÞ :¼ fSx ðQÞgxX , where X is an index space equipped with a σ-algebra
F and a σ-finite measure λ (e.g., a probability measure, the Lebesgue measure,
a counting measure). Accordingly, any of the above divergences (cf. (19))
0  Dcϕ, m1 , m2 , m3 , λ ðSðPÞ, SðQÞÞ
Z        
Sx ðPÞ Sx ðQÞ 0 Sx ðQÞ Sx ðPÞ Sx ðQÞ
:¼ ϕ ϕ  ϕ+, c   m3 ðxÞ dλðxÞ
X m1 ðxÞ m2 ðxÞ m2 ðxÞ m1 ðxÞ m2 ðxÞ
(98)
can be interpreted as a directed degree of dependence of P (e.g., of the
above-mentioned random variables Y 1 , …, Y d ), since it measures the amount
of dissimilarity between the same statistical functional of P and of the
independence-expressing Q. We can imbed the following known divergences
as special cases of (98):
(1) Consider ϕ-divergences between λ-density functions, i.e., take X :¼
Y , a real continuous convex function ϕ on [0, ∞[, a product measure λ :¼
Qd
di¼1 λi, Sλpd
x ðQÞ :¼ fQ ðxÞ :¼ i¼1 fPi ðxi Þ  0 where x ¼ ðx1 , …, xd Þ and fPi
is the λi-density function of the marginal distribution Pi, as well as Sλpd x ðPÞ
:¼ fP ðxÞ  0 to be the λ-density function of P, to end up with the following
special case of (98) resp. (74):
0  Dϕ,Sλpd ðQÞ,Sλpd ðQÞ,1  Sλpd ðQÞ,λ ðSλpd ðPÞ, Sλpd ðQÞÞ
Z d !  
Q fP ðxÞ Qd
¼ fPi ðxi Þ  ϕ Qd  10,∞½ fP ðxÞ  fPi ðxi Þ dλðxÞ
X i¼1 i¼1 fPi ðxi Þ i¼1

(99)
" #
Y
d
+ ϕ ð0Þ  P fPi ðxi Þ ¼ 0 + ϕð0Þ  Q½ fP ðxÞ ¼ 0  ϕð1Þ : (100)
i¼1

The divergence in the formula lines (99) and (100) has first appeared in
Micheas and Zografos (2006). In applications, one often takes X ¼ Y ¼
Rd, Y i ¼ R, λi :¼ λL to be the Lebesgue measure on R and thus λ :¼ λLd is
the Lebesgue measure on Rd (with a slight abuse of notation), fP to be the
204 SECTION II Information geometry

classical joint (Lebesgue) density function of Y 1 , …, Y d, and f Pi to be the clas-


sical (Lebesgue) density function of Yi.
By plugging ϕðtÞ ¼ ϕ1 ðtÞ ¼ t  log t + 1  t  ½0, ∞½ with t  ]0, ∞[ (cf.
(30)) into (100), one obtains the prominent mutual information. References to fur-
ther subcases of (99), (100) can be found, e.g., in Micheas and Zografos (2006).
For d ¼ 2, X ¼ Y ¼ R2, λ :¼ λL2, continuous marginal density functions
fP1 and fP2 , by Sklar’s theorem (Sklar, 1959) one can uniquely rewrite the
joint distribution function FP ðx1 ,x2 Þ ¼ CðFP1 ðx1 Þ, FP2 ðx1 ÞÞ in terms of a copula
C(, ). Suppose further that C(, ) is absolutely continuous (with respect to the
Lebesgue measure on [0, 1]
[0, 1]), and hence for its (Lebesgue) density
function c(, )—called copula density—one gets cðu1 , u2 Þ ¼ ∂ ∂u Cðu1 ,u2 Þ
2

1 ∂u2
for almost
all u1, u2  [0, 1]
[0, 1]) (see, e.g., p. 83 in Durante and Sempi (2016) and
the there-mentioned references). Accordingly, fP ðx1 ,x2 Þ ¼ fP1 ðx1 Þ  fP2 ðx2 Þ 
cðFP1 ðx1 Þ,FP2 ðx2 ÞÞ and thus, in case of strictly positive fP1 ð  Þ > 0, fP2 ð  Þ > 0
the divergence (99), (100) rewrites as
0  Dϕ,Sλpd ðQÞ,Sλpd ðQÞ,1  Sλpd ðQÞ,λ ðSλpd ðPÞ, Sλpd ðQÞÞ
Z Z  
fP ðx1 , x2 Þ
¼ f P1 ðx1 Þ  fP2 ðx2 Þ  ϕ dλL ðx1 Þ dλL ðx2 Þ  ϕð1Þ
R R fP1 ðx1 Þ  fP2 ðx2 Þ
Z 1Z 1
¼ ϕ ðcðu1 , u2 ÞÞ dλL ðu1 Þ dλL ðu2 Þ  ϕð1Þ,
0 0
(101)
which solely depends on the copula (density) and not on the marginals. For
the subcase ϕ(1) ¼ 0, formula (101) was established basically in Durrani
and Zeng (2009) without assumptions and without a proof; they also give
some examples including ϕ ¼ ϕα (α  Rnf0, 1g) of (26), as well as the KL
generator ϕ ¼ ϕ e1 ðtÞ of (29) leading to the “copula representation of mutual
information.” The latter also appears in the earlier work of Davy and
Doucet (2003), as well as, e.g., in Zeng and Durrani (2011), Zeng et al.
(2014), and Tran (2018); in contrast, Tran also gives a copula representation
of the Kullback–Leibler information divergence between two general
d-dimensional Lebesgue density functions SλL pd ðPÞ :¼ fP ð  Þ and SλL pd ðQÞ :¼
fQ ð  Þ where P and Q are allowed to have different marginals, and Q need
not be of independence-expressing product type.
(2) For the special case X :¼ Y ¼ R2 , continuous marginal distribution
functions FP1 and FP2 , product measure λ :¼ P1 P2, Scd x ðQÞ :¼ FQ ðxÞ ¼
FP1 ðx1 Þ  FP2 ðx2 Þ  ½0, 1 , as well as joint distribution function Scd
x ðPÞ :¼
FP ðxÞ  ½0, 1, one gets the following special cases of (46) and (73),
respectively:
A unifying framework for some directed distances in statistics Chapter 5 205

0  Dϕ2 ,1,1,1  1,λ ðScd ðPÞ, Scd ðQÞÞ


Z Z
1
¼  ½FP ðx1 , x2 Þ  FP1 ðx1 Þ  FP2 ðx2 Þ2 dP1 ðx1 Þ dP2 ðx2 Þ
R R 2
Z 1Z 1
¼ ½Cðu1 , u2 Þ  u1  u2 2 dλL ðu1 Þ dλL ðu2 Þ (102)
0 0

(for (102) see Blum et al. (1961), Schweizer and Wolff (1981), up to constants
and squares) and
1=2
0  Dϕ ,Scd ðQÞ,Scd ðQÞ,1  Scd ðQÞ,λ ðScd ðPÞ, Scd ðQÞÞ ¼
Z Z TV
¼ jFP ðx1 , x2 Þ  FP1 ðx1 Þ  FP2 ðx2 Þj dP1 ðx1 Þ dP2 ðx2 Þ
R R
Z 1Z 1
¼ jCðu1 , u2 Þ  u1  u2 j dλL ðu1 Þ dλL ðu2 Þ (103)
0 0

(for (103) see Schweizer and Wolff (1981), up to constants).


As a side remark, let us mention that other interplays between divergences
and copula functions can be constructed. For instance, suppose that P and Q
are two probability distributions on the d-dimensional product (measurable)
space ðY , A Þ having copula density functions cP resp. cQ; the latter can be
interpreted as special statistical functionals Scop(P) of P resp. Scop(Q) of Q,
and thus, by employing the divergences (19) we obtain
0  Dcϕ,m1 ,m2 ,m3 ,λ d ðScop ðPÞ, Scop ðQÞÞ
L
Z        
cP ðxÞ cQ ðxÞ 0
cQ ðxÞ cP ðxÞ cQ ðxÞ
:¼ ϕ ϕ  ϕ+,c   m3 ðxÞ dλLd ðxÞ
Y m1 ðxÞ m2 ðxÞ m2 ðxÞ m1 ðxÞ m2 ðxÞ
(104)
where we recall that λLd denotes the d-dimensional Lebesgue measure and
thus the integral in (104) turns out to be (with some rare exceptions) of
d-dimensional Riemann type with dλLd ðxÞ ¼ dx. The (CASM ϕ-divergences
type) special case Dcϕ,Scop ðQÞ,Scop ðQÞ,Scop ðQÞ,λ d ðScop ðPÞ, Scop ðQÞÞ leads to a diver-
L
gence which has been used by Bouzebda and Keziou (2010) in order to obtain
new estimates and tests of independence in semiparametric copula models
with the help of variational methods.

5 Bayesian contexts
There are various different ways how divergences can be used in Bayesian
frameworks:
(1) as “direct” quantifiers of dissimilarities between statistical functionals of
various parameter distributions:
for instance, consider a n-dimensional vector of observable random
quantities Z ¼ ðZ 1 , …, Zn Þ whose distribution depends on an unobservable
206 SECTION II Information geometry

(and hence, also random) multivariate parameter Θ :¼ ðΘ1 , …, Θd Þ , as


well as a real-valued quantity Zn+1 (whose distribution also depends on
Θ) to be predicted. Corresponding candidates for distributions P, Q—to
be used in D(S(P), S(Q))—are for example the following: the prior distri-
bution PrΘ[] :¼ Pr[Θ  ] of Θ (under some underlying probability mea-
sure Pr), the posterior distribution PrΘjZ¼z[] :¼ Pr[Θ   j Z ¼ z] of Θ
given the data observation R Z ¼ z, the predictive prior distribution
Pr Zn+1 ½   ¼ Pr½Zn+1    ¼ Rd Pr Zn+1 jΘ¼θ ½   dPr Θ ðθÞ of Zn+1, and the
predictive
R posterior distribution Pr Zn+1 jZ¼z ½   ¼ Pr½Zn+1   j Z ¼ z ¼
d Pr Z n+1 jΘ¼θ ½   dPr ΘjZ¼z ðθÞ of Zn+1. For instance, the divergence
R
D(Sλpd(PrΘ), Sλpd(PrΘjZ¼z)) serves as “degree of informativity of the
new data-point observation on the learning of the true unknown
parameter.” Analogously, one can also consider more complex setups
like, e.g., a continuum Z ¼ {Zt : t  [0, T]} of observations, parameters
Θ of function type, and Zu (u > T) rather than Zn+1.
(2) as “decision risk reduction” (“model risk reduction,” “information gain”):
in a dichotomous Bayesian decision problem between the two alternative
probability distributions P :¼ PH and Q :¼ PA , one takes Θ ¼ fH , A g,
Pr Θ ½   :¼ π H  δH ½   + ð1  π H Þ  δA ½   for some π H  0, 1½ .
Within this context, suppose we want to make decisions/actions d taking
values in a space D. Furthermore, for the case that H were true we attri-
bute a real-valued loss LH ðdÞ  0 to each d; LH ðdÞ ¼ 0 corresponds to
a “right” decision d, LH ðÞ > 0 to the amount of loss taking the “wrong”
decision d . In the same way, for the case that A were true we use
LA ðdÞ  0. Prior to random observations Z, the corresponding prior mini-
mal mean decision loss (prior Bayes loss, prior Bayes risk) is given by
Bðπ H Þ :¼ inf fπ H  LH ðdÞ+ð1  πH Þ  LA ðdÞg:
dD

Based upon a concrete observation Z ¼ z, we decide for some “action”


d  D, operationalized by a decision rule d from the space of all possible
observations to D (i.e., dðzÞ  DÞ . The corresponding posterior minimal
mean decision loss (posterior Bayes loss, posterior Bayes risk) is defined by
 Z Z 
BðπH , PH , PA Þ :¼ inf π H  LH ðdðZÞÞ dPH + ð1  π H Þ  LA ðdðZÞÞ d PA
d

where the infimum is taken among all “admissible” decision functions d.


Up to technicalities, one can show that
Z
Bðπ H , PH , PA Þ ¼ Bðπ post
H ðZÞÞ ðπ H  dPH + ð1  π H Þ  dPA Þ,

π H  fP ðZÞ
with posterior probability (for H ) π post
H ðZÞ :¼ π H
H
 fP ðZÞ + ð1π H Þ  fP ðZÞ in
H A

terms of the λ-density functions fPH ðÞ and f PA ðÞ where λ is, e.g., PH 2+ PA
(or any measure such that PH and PA are absolutely continuous w.r.t. λ).
A unifying framework for some directed distances in statistics Chapter 5 207

The difference I ðπ H , PH , PA Þ :¼ BðπH Þ  Bðπ H , PH , PA Þ  0 can be


interpreted as a statistical information measure in the sense of De Groot
(1962), and as degree of reduction of the decision risk due to observation.
Let us first discuss the special case D ¼ ½0, 1 with d interpreted as evi-
dence degree, and LH ðdÞ ¼ 1  d, LA ðdÞ ¼ d (Bayes testing). Hence,

Bðπ H Þ ¼ π H ^ ð1  π H Þ . From this, Osterreicher and Vajda (1993),
Liese and Vajda (2006) have shown that ϕ-divergences (i.e., Csiszar-
Ali-Silvey-Morimoto divergences) can be represented as “average” statis-
tical information measures, i.e., (in our notation)
Z
1
IπH ðPH , PA Þ dgϕ ðπH Þ
0,1½ πH

¼ Dϕ,Sλpd ðPA Þ,Sλpd ðPA Þ,1  Sλpd ðPA Þ,λ ðSλpd ðPH Þ, Sλpd ðPA ÞÞ (105)

1π
where gϕ ðπÞ :¼ ϕ0+ is nondecreasing in π  ]0, 1[. If ϕ is twice
π
00
 
differentiable, then one can simplify π1H dgϕ ðπ H Þ ¼ ðπ 1 Þ3 ϕ 1π πH
H
dπ H
H

in (105). For the divergence generators ϕα with α  R (cf. (26), (30),


α2
(31), (28)) one gets 1
πH dgϕ ðπ H Þ ¼ ð1π HÞ
ðπ Þα+1
dπ H ; see also Stummer
H

(1999, 2001, 2004) for an adaption to a context of path observations of



financial diffusion processes. In contrast, Osterreicher and Vajda (1993)
have also given a “direct” representation (in our notation)

I ðπH , PH , PA Þ ¼ Dϕ,Sλpd ðPA Þ,Sλpd ðPA Þ,1  Sλpd ðPA Þ,λ ðSλpd ðPH Þ, Sλpd ðPA ÞÞ

for some appropriately chosen loss functions LH ð  Þ , LA ð  Þ which


depend on ϕ and π H s; see also Stummer (2001, 2004) for an adaption
of the case ϕ :¼ ϕα with α  R within a context of financial diffusion
processes.
(3) as bounds of minimal mean decision losses: in the context of (2), let us
now discuss the binary decision space D ¼ fdH , dA g where dH stands
for an action preferred in the case that PH were true. Furthermore, sup-
pose that PH is absolutely continuous with respect to λ :¼ PA having
density function fPH ðÞ; notice that fPA ðÞ ≡ 1. For the loss functions
LH ðdÞ ¼ cH  1fdA g ðdÞ and LA ðdÞ ¼ cA  1fdH g ðdÞ with some constants
cH > 0 , cA > 0 , the posterior minimal mean decision loss (posterior
Bayes loss) is

s
They also have shown some kind of “reciprocal.”
208 SECTION II Information geometry

Z
Bðπ H , PH , PA Þ ¼ min fΛH  fPH ðZÞ, ΛA g dPA

with constants ΛH :¼ π H  cH > 0 , ΛA :¼ ð1  π H Þ  cA > 0 .


For this, Stummer and Vajda (2007) have achieved the following
bounds in terms of power divergences
D :¼ Dϕχ ,Sλpd ðPA Þ,Sλpd ðPA Þ,1  Sλpd ðPA Þ,λ ðSλpd ðPH Þ, Sλpd ðPA ÞÞ for arbitrary
χ  ]0, 1[
8 χ
>
> max f1,1χ g max f1,1χ
χ g
>
> ΛH  ΛA max f1χ ,1χ g
>
1
< χ 1χ
 ð1  χ  ð1  χÞ  DÞ
max f1χ , χ g
Bðπ H , PH , PA Þ ðΛH + ΛA Þ
>
>
>
>
>
: χ 1χ
ΛH  ΛA  ð1  χ  ð1  χÞ  DÞ

(in an even slightly more general form), which can be very useful in case
that the posterior minimal mean decision loss cannot be computed explic-
itly. For instance, Stummer and Vajda (2007) give applications to
decision-making of time-continuous, nonstationary financial stochastic
processes.
(4) as auxiliary tools: for instance, in an i.i.d.-type Bayesian parametric model-misspecification context, Kleijn and van der Vaart (2012) employ the reverse Kullback–Leibler distance minimizer
$$\hat{\theta} := \arg\inf_{\theta\in\Theta} D_{\phi_0}(Q_{\theta}, P_{tr}) = \arg\inf_{\theta\in\Theta} D_{\phi_0,\, S^{\lambda}_{pd}(Q),\, S^{\lambda}_{pd}(Q),\, 1\cdot S^{\lambda}_{pd}(Q),\, \lambda}\big(S^{\lambda}_{pd}(Q_{\theta}),\, S^{\lambda}_{pd}(P_{tr})\big)$$
(cf. (10) respectively (74) with $\phi = \phi_0$) in order to formulate and prove an asymptotic normality (under the unknown true out-of-model-lying data-generating distribution $P_{tr}$) of the involved posterior parameter distribution.
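For concreteness, here is a minimal Monte Carlo sketch of the bounds from item (3) above; it is our own illustration, not taken from Stummer and Vajda (2007), and the choices $P_H = N(0,1)$, $P_A = N(1,1)$ and all constants are assumptions. Since $\lambda := P_A$, the power divergence reduces to $D = \big(1 - \int f_{P_H}^{\chi}\, dP_A\big)/(\chi(1-\chi))$, so both $B$ and $D$ can be approximated by sample averages under $P_A$:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_H, c_H, c_A, chi = 0.5, 1.0, 1.0, 0.5
Lam_H, Lam_A = pi_H * c_H, (1.0 - pi_H) * c_A

Z = rng.normal(1.0, 1.0, 200_000)             # samples from P_A = N(1,1)
f = np.exp(0.5 - Z)                           # f = dP_H/dP_A for P_H = N(0,1)

B = np.mean(np.minimum(Lam_H * f, Lam_A))     # posterior Bayes loss
D = (1.0 - np.mean(f**chi)) / (chi * (1.0 - chi))   # power divergence of index chi

M = max(1.0 / chi, 1.0 / (1.0 - chi))
upper = Lam_H**chi * Lam_A**(1.0 - chi) * (1.0 - chi * (1.0 - chi) * D)
lower = (Lam_H**(chi * M) * Lam_A**((1.0 - chi) * M)
         * (Lam_H + Lam_A)**(1.0 - M)
         * (1.0 - chi * (1.0 - chi) * D)**M)
print(f"{lower:.3f} <= B = {B:.3f} <= {upper:.3f}")   # e.g. 0.195 <= 0.309 <= 0.441
```

For $\chi = 1/2$ the quantity $1 - \chi(1-\chi)D$ is the Hellinger integral $\exp(-1/8) \approx 0.88$ here, so the bounds can be checked against the exact value $B = \tfrac{1}{2}(1 - \mathrm{TV}(P_H, P_A))$.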
6 Variational representations

Variational representations of (say) $\phi$-divergences, often referred to as dual representations, transform $\phi$-divergence estimation into an optimization problem on an infinite-dimensional function space in general, but they may also lead to a simpler optimization problem when some knowledge is available on the class of measures $Q$ over which $D_{\phi}(Q, P)$ has to be optimized; moreover, as already mentioned at the end of Section 1.3 above, such variational representations can also be employed to circumvent the crossover problems (CO1), (CO2), (CO3).
To begin with, in the following we loosely sketch the corresponding general setting. We equip $M$, the linear space of all finite signed measures (including all probability measures) on $(\mathcal{X}, \mathcal{B})$, with the so-called $\tau$-topology, the coarsest one which makes the mapping $f \mapsto \int f\, dQ$ continuous for all measures $Q$ in $M$ when $f$ runs in the class $M_b$ of all bounded measurable functions on $(\mathcal{X}, \mathcal{B})$. As an exemplary statistical incentive for the use of signed measures, let us mention the context where one wants to estimate, and test for, a mixture probability distribution $c\cdot Q_1 + (1-c)\cdot Q_2$ with probability measures $Q_1, Q_2$ and $c \in [0,1]$. In such a situation, it is sometimes technically useful to extend the range of $c$ beyond $[0,1]$, which leads to a finite signed measure. As a next step, since the mapping $Q \mapsto D_{\phi}(Q, P)$ is convex and lower semicontinuous in the $\tau$-topology, we deduce that the following result holds for all $Q$ in $M$ and $P$ in $\mathcal{P}$:
$$\widetilde{D}_{\phi}(Q, P) = \sup_{g \in M_b} \int_{\mathcal{X}} g(x)\, dQ(x) - \int_{\mathcal{X}} \phi^{*}(g(x))\, dP(x) \qquad (106)$$
where (cf. Broniatowski (2003) in the Kullback–Leibler divergence case as well as Broniatowski and Keziou (2006) for a general formulation)
$$\widetilde{D}_{\phi}(Q, P) := \begin{cases} \displaystyle\int_{\mathcal{X}} \phi\Big(\frac{dQ}{dP}(x)\Big)\, dP(x), & \text{for } Q \ll P,\\[1ex] \infty, & \text{else}, \end{cases}$$
is a slightly adapted version of the $\phi$-divergence defined in (10) (see also (74)), and $\phi^{*}(x) := \sup_{t}(t\cdot x - \phi(t))$ designates the Fenchel–Legendre transform of the generator $\phi$; see Broniatowski and Keziou (2006) and Nguyen et al. (2010). The choice of the $\tau$-topology is motivated by statistical considerations, since most statistical functionals are continuous in this topology; see Groeneboom et al. (1979). This choice is in contrast with similar representations for the Kullback–Leibler divergence (see, e.g., Dembo and Zeitouni, 2009) under the weak topology on $\mathcal{P}$, for which the supremum in (106) is taken over all continuous bounded functions on $(\mathcal{X}, \mathcal{B})$.
Representation (106) offers a useful mathematical tool to measure statisti-
cal similarity between data collections or to measure the directed distance
between a distribution P (either explicit or known through sampling), and a
class of distributions Ω, as well as to compare complex probabilistic models.
The main practical advantage of variational formulas is that an explicit form
of the probability distributions or their likelihood ratio, dQ/dP, is not neces-
sary. Only samples from both distributions are required since the difference
of expected values in (106) can be approximated by statistical averages, in
case both Q and P are known through sampling. In practice, the infinite-
dimensional function space has to be approximated or even restricted. One
attempt is the restriction of the function space to a reproducing kernel Hilbert
space (RKHS) and the corresponding kernel-based approximation in Nguyen
et al. (2010). In many cases of relevance, however, some information can
be inserted in the description of the minimization problem of the form $\inf\big\{\widetilde{D}_{\phi}(Q, P);\; Q \in \Omega\big\}$ when some relation between $P$ and all members in $\Omega$ can be assumed. Such is the case in logistic models, or more globally in
two-sample problems, when it is assumed that $dQ/dP$ belongs to some class of functions; for example, we may assume that $\Omega$ consists of all distributions such that $x \mapsto (dQ/dP)(x)$ belongs to some parametric class. This requires some analysis around (106), which we handle now.
The supremum in Eq. (106) may not be reached, even in elementary cases. Consider the case when $\phi = \phi_1$, hence the case when $\widetilde{D}_{\phi}(Q, P)$ is the Kullback–Leibler divergence between $Q$ and $P$, and assume that both $Q$ and $P$ are Gaussian probability measures on $\mathbb{R}$ with the same variance and different mean values. Then it is readily checked that the supremum in (106) is reached on a polynomial of degree 2, hence outside of $M_b$. For statistical purposes it is relevant that formula (106) holds with attainment; indeed the supremum, in case $\widetilde{D}_{\phi}(Q, P)$ is finite, is reached at $g := \phi'(dQ/dP)$, hence, in case $\phi$ is differentiable, on a function which may not be bounded.
It is also of interest to consider (106) in the case when $P$ is atomic and $Q$ is a continuous distribution; for example, let $(X_1, \ldots, X_N)$ be an i.i.d. sample under some probability measure $R$ on $\mathbb{R}$, and take $Q$ to be a probability measure which is absolutely continuous with respect to the Lebesgue measure; consider the case when $\widetilde{D}_{\phi} = \widetilde{D}_{\phi_1}$ is the (slightly modified) Kullback–Leibler divergence. Denote by $P_N^{emp}$ the empirical measure of the sample. Taking $g(x) := M \cdot \mathbb{1}_{\{X_1,\ldots,X_N\}^c}(x)$ for some arbitrary $M$, it holds by (106) that $\widetilde{D}_{\phi}(Q, P_N^{emp}) \geq M$, proving that no inference can be performed about $R$ making use of the variational form as it stands. Some more structure and information has to be incorporated into the variational form of the divergence in order to circumvent this obstacle. Assuming that $\phi$ is a differentiable function in its domain, the supremum in (106) is reached at $g^{*} := \phi'(dQ/dP)$, as checked by substitution.$^{t}$ Let $\mathcal{F}$ be a class of functions containing all functions
$\phi'(dQ/dP)(x)$ as $Q$ runs in a given model $\Omega$. Consider the subspace $M_{\mathcal{F}}$ of all finite signed measures $Q$ such that $\int |f|\, d|Q|$ is finite for all functions $f$ in $\mathcal{F}$; then, similarly as in (106), we may obtain the following variational form of $\widetilde{D}_{\phi}(Q, P)$, which is valid when $Q$ belongs to $M_{\mathcal{F}}$ and $P$ belongs to $\mathcal{P}$:
$$\widetilde{D}_{\phi}(Q, P) = \sup_{g \in \langle M_b \cup \mathcal{F}\rangle} \int_{\mathcal{X}} g(x)\, dQ(x) - \int_{\mathcal{X}} \phi^{*}(g(x))\, dP(x), \qquad (107)$$
in which we substituted $M_b$ by the broader class $\langle M_b \cup \mathcal{F}\rangle$, which may contain unbounded functions; note that (107) is valid for a smaller class of measures $Q$ than (106).
$^{t}$ In case $\phi$ is not differentiable at some point, the supremum in (107) should satisfy $g^{*}(x) \in \partial\phi(dQ/dP)(x)$ for all $x$ in $\mathcal{X}$, where $\partial\phi(t)$ is the subdifferential set of the convex function $\phi$ at point $t$: $\partial\phi(t) := \{z \in \mathbb{R} : \phi(s) \geq \phi(t) + z(s-t),\; \forall s \in \mathbb{R}\}$.
For instance, in the above example pertaining to the Kullback–Leibler divergence, where $P$ is Gaussian on $\mathbb{R}$ and $Q$ belongs to the class $\Omega$ of all Gaussian distributions on $\mathbb{R}$ with the same variance as $P$, the class $\mathcal{F}$ consists of all polynomial functions of degree 2, and the supremum in (107) is attained. Looking at the case when $P$ is substituted by $P_N^{emp}$ and $Q$ is absolutely continuous, and since $\widetilde{D}_{\phi}\big(Q, P_N^{emp}\big)$ does not convey any information from the data, we are led to define a restriction of the supremum operation on the space $\langle M_b \cup \mathcal{F}\rangle$; since we assumed that $\phi'(dQ/dP) \in \mathcal{F}$ for any $Q$ in $\Omega \subseteq M_{\mathcal{F}}$, we have
$$\widetilde{D}_{\phi}(Q, P) = \sup_{g \in \mathcal{F}} \int_{\mathcal{X}} g(x)\, dQ(x) - \int_{\mathcal{X}} \phi^{*}(g(x))\, dP(x) \qquad (108)$$
which is valid only when $Q \ll P$. We thus can define a new "pseudo-divergence," say $\widetilde{D}_{\phi}^{\mathcal{F}}(Q, P)$, which coincides with $\widetilde{D}_{\phi}(Q, P)$ in those cases, and which takes finite values depending on the data when $P$ is substituted by $P_N^{emp}$. In that case we define
$$\widetilde{D}_{\phi}^{\mathcal{F}}\big(Q, P_N^{emp}\big) := \sup_{g \in \mathcal{F}} \int_{\mathcal{X}} g(x)\, dQ(x) - \frac{1}{N}\sum_{i=1}^{N} \phi^{*}(g(X_i)), \qquad (109)$$
which is the starting point of variational divergence-based inference; see Broniatowski and Keziou (2009). Note that the above formula does not require any grouping or smoothing. Also, the resulting estimator of the likelihood ratio $dQ^{*}/dP$, where $Q^{*} := \arg\inf_{Q\in\Omega} \widetilde{D}_{\phi}(Q, P)$, results from a double optimization, the inner one pertaining to the estimation of $g^{*}(Q)$ solving (109) for any $Q$ in $\Omega$. Assuming that $\{\phi^{*}(g),\, g \in \mathcal{F}\}$ is a Glivenko–Cantelli class of functions in some appropriate metric provides the ingredients to handle convergence properties of the estimators. The choice of the divergence generator $\phi$ may be governed by a robustness vs. efficiency trade-off, as exemplified in parametric models; see also Al Mohamad (2018).
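As a concrete illustration of how (109) is used in practice, here is a minimal sketch; it is our own, not taken from the cited works, and the Gaussian choices and the optimizer are assumptions. For $\phi = \phi_1$ (the Kullback–Leibler generator $t\log t - t + 1$) the Fenchel–Legendre transform is $\phi^{*}(x) = e^{x} - 1$, and we take $\mathcal{F}$ to be the class of quadratic functions, which, as discussed above, contains the optimizer $g^{*} = \log(dQ/dP)$ for Gaussian $Q$ and $P$:

```python
import numpy as np
from scipy.optimize import minimize

# Estimate D(Q,P) = sup_{g in F} E_Q[g] - E_P[exp(g) - 1] from samples only,
# with F = {g(x) = a*x^2 + b*x + c} (cf. Eq. (109), both measures sampled).
rng = np.random.default_rng(1)
xQ = rng.normal(1.0, 1.0, 100_000)   # sample from Q = N(1,1)
xP = rng.normal(0.0, 1.0, 100_000)   # sample from P = N(0,1)

def neg_objective(theta):
    a, b, c = theta
    gQ = a * xQ**2 + b * xQ + c
    gP = a * xP**2 + b * xP + c
    # exploration of large parameters may trigger benign overflow warnings
    return -(gQ.mean() - (np.exp(gP) - 1.0).mean())

res = minimize(neg_objective, x0=np.zeros(3), method="Nelder-Mead")
print("variational KL estimate:", -res.fun)  # close to the true KL value 0.5
```

Note that the inner optimization yields, besides the divergence estimate, an estimate of the likelihood ratio, since here $dQ/dP = (\phi')^{-1}(g^{*}) = e^{g^{*}}$.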
Formula (108) can be obtained through simple convexity considerations (see p. 172 of Liese and Vajda (1987) or Theorem 17 of Liese and Vajda (2006)) and is used when $\mathcal{F}$ consists of all the functions $\phi'(dQ/dP)$ as $Q$ and $P$ run in some parametric model. In a more general (semiparametric or nonparametric) setting, formula (107) is adequate for inference in models consisting of probability distributions $Q$ which integrate functions in $\mathcal{F}$, and leads to numerical optimization making use of regularity assumptions on the likelihood ratio $dQ/dP$.
7 Some further variants

Extending the (say) $\phi$-divergence definition outside the natural context of probability measures appears necessary in various situations; for instance, models defined by conditions pertaining to expectations of order statistics (or more generally of L-statistics) are ubiquitous in meteorology, hydrology, or in finance through constraints on the value at risk, e.g., on the distortion risk measure (DRM) of index $\alpha$, which is defined in terms of the quantile function $F^{-1}$ associated with the distribution function $F$ on $\mathbb{R}_{+}$ through $\int_{0}^{1} F^{-1}(u)\cdot \mathbb{1}_{\{u>\alpha\}}\, du$. Note that this class of constraints is not linear with respect to $F$ but only with respect to $F^{-1}$; hence, the projection of some measure $P$ on such sets of measures is not characterized by exponential-family-type distributions. Inference on whether a distribution $P$ satisfies this kind of constraint leads to the extension of the definition of divergences between quantile measures, which may be signed measures. Variational representations for inference can be defined, and projections on linear constraints pertaining to quantile measures can be characterized; see Broniatowski and Decurninge (2016). Also in the statistical frame, testing for the number of components in a finite mixture requires the extension of the definition of divergences to not-necessarily positive arguments, as occurs for the Pearson $\chi^2$-divergence; this allows one to replace the nonregular statistical task of estimating (testing) a value of a parameter at the border of its domain by a regular problem, at the cost of introducing mixtures with negative weights; an attempt in this direction is made in Broniatowski et al. (2019).
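As a small numerical sketch of the DRM constraint just mentioned (our own illustration, with assumed lognormal data), one can plug the empirical quantile function into $\int_{0}^{1} F^{-1}(u)\cdot \mathbb{1}_{\{u>\alpha\}}\, du$:

```python
import numpy as np

# Estimate the DRM of index alpha from a sample by replacing F^{-1} with the
# empirical quantile function; note it is linear in F^{-1}, not in F.
def drm(sample, alpha, grid=10_000):
    u = (np.arange(grid) + 0.5) / grid     # midpoint rule on (0,1)
    q = np.quantile(sample, u)             # empirical F^{-1}(u)
    return np.mean(q * (u > alpha))

x = np.random.default_rng(3).lognormal(size=50_000)  # nonnegative "losses"
print(drm(x, alpha=0.95))                 # tail-weighted distortion risk measure
```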
For large-dimensional spaces $\mathcal{X}$, variational representations of $\phi$-divergences (i.e., CASM divergences) offer significant theoretical insights and practical advantages in numerous research areas. Recently, they have gained popularity in machine learning as a tractable and scalable approach for training probabilistic models and for statistically differentiating between data distributions; see, e.g., Birrell et al. (2022a).
Explicit methods to estimate the $\phi$-divergence and the likelihood ratio between two probability measures known through sampling (hence substituting $Q$ and $P$ in (108) by their empirical counterparts) have been considered, making some hypotheses on the regularity of the likelihood ratio, or adding some penalty term in terms of the assumed complexity of the class $\mathcal{F}$; examples include Sobolev classes of functions or reproducing kernel Hilbert space approximations; see Nguyen et al. (2010) for explicit methods and properties of the estimators.
Extensions of the basic divergence formula as given in (108) to include some extra inner optimization term have been proposed by Birrell et al. (2022b) under the name of $(f,\Gamma)$-divergences; this new class encompasses both the $\phi$-divergence class and many integral probability metrics (see also Sriperumbudur et al. (2012) on the overlap of the latter two); they provide uncertainty quantification bounds for misspecified models in terms of the $\phi$-divergence between the truth and the model, somehow in a similar way as considered in cryptology (see Arikan and Merhav (1998) and the subsequent extensive literature). Also, Birrell et al. (2022b) apply optimization of those divergences to training Generative Adversarial Networks (GANs).
Another area where the extension of the $\phi$-divergences (i.e., CASM divergences) to signed measures turns out to be useful is related to (e.g., deterministic) general optimization problems where one aims at projecting a vector (or a function) on a class of vectors (or a class of functions); we refer to Broniatowski and Stummer (2021) for an extensive treatment of such problems in the finite-dimensional case.
As already indicated above, there are also divergences between stochastic processes, where $\mathcal{X}$ is the set of all possible paths (i.e., all time-evolution scenarios). By nature, the analysis of the resulting (say) $\phi$-divergences between two distributions on the path space $\mathcal{X}$ may become very involved. For instance, power divergences between diffusion processes (with applications to finance, Bayesian decision-making, etc.) were treated in Stummer (1999, 2001, 2004) as well as in Stummer and Vajda (2007) (see also the corresponding binomial-process approximations in Stummer and Lao (2012)); in contrast, Kammerer and Stummer (2020) study power divergences between Galton–Watson branching processes with immigration and apply the outcomes to optimal decision-making in the presence of a pandemic (such as COVID-19).
For continuous, convex, homogeneous functions $\phi : \mathbb{R}_{+}^{K} \mapsto \mathbb{R}$, general multivariate $\phi$-dissimilarities of the form
$$D_{\phi}(\mathbf{Q}, P) = \int_{\mathcal{X}} \phi\Big(\frac{dQ_1}{dP}(x), \ldots, \frac{dQ_K}{dP}(x)\Big)\, dP(x), \qquad \mathbf{Q} := (Q_1, \ldots, Q_K),$$
(which need not necessarily be divergences in the sense of a multivariate analogue of the above axioms (D1), (D2)) have been first introduced by Györfi and Nemetz (1977, 1978) and later on investigated by, e.g., Zografos (1994) for stratified random sampling, by Zografos (1998) for hypothesis testing, and by Garcia-Garcia and Williamson (2012) for multiclass classification problems. As noticed by Györfi and Nemetz (1977), the multivariate $\phi$-dissimilarities cover as special cases Matusita's affinity (Matusita, 1967) and the more general Toussaint's affinity (Toussaint, 1974, 1978) (which by nature is a multivariate (form of a) Hellinger integral, also called Hellinger transform in Liese and Miescke (2008)), and, in the bivariate case $K = 2$, also the $\phi$-divergences (i.e., the CASM divergences). Special multivariate $\phi$-divergences $D_{\phi}(\mathbf{Q}, P)$ were, e.g., employed by Toussaint (1974, 1978) (see also Menendez et al., 1997) in the form of an average over all pairwise Jeffreys divergences (where the latter are sum-symmetrized Kullback–Leibler divergences), by Menendez et al. (1992) in the form of a convex combination of "Kullback–Leibler divergences between each individual probability distribution and the convex combination of all probability distributions" (i.e., multivariate extensions of the Jensen–Shannon divergence), and by Werner and Ye (2017) in the form of integrals over the geometric mean of all the integrands in pairwise $\phi$-divergences (they even flexibilize to components of an $\mathbb{R}_{+}^{K}$-valued function $\phi$, and call the outcome a mixed $\phi$-divergence). A general "natural multivariate" extension of a $\phi$-divergence in the sense of CASM, called multidistribution $\phi$-divergence, has been given by Duchi et al. (2018), who employed it for multiclass classification problems (see also Tan and Zhang, 2022 for further applications to loss functions and regret bounds). The general multivariate $\phi$-dissimilarity between signed measures (rather than the more restrictive probability distributions), under assumptions which imply the multivariate analogue of the above axioms (D1), (D2), has been introduced by Keziou (2015) and used for the analysis of semiparametric multisample density ratio models (for the latter, see, e.g., Keziou and Leoni-Aubin (2008) and Kanamori et al. (2012)).
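As an elementary worked instance of the multivariate $\phi$-dissimilarity defined above (our own illustration, with assumed distributions): taking $\phi(t_1, \ldots, t_K) = -(t_1 \cdots t_K)^{1/K}$, which is continuous, convex, and positively homogeneous on $\mathbb{R}_{+}^{K}$, yields minus the Matusita-type affinity; for $K = 2$ on a finite sample space, $D_{\phi}(\mathbf{Q}, P) = -\sum_x \sqrt{Q_1(x)\, Q_2(x)}$, independently of the dominating $P$:

```python
import numpy as np

# Negative-geometric-mean generator phi(t1, t2) = -sqrt(t1*t2) applied to the
# density ratios dQ_k/dP on a 3-point sample space; by homogeneity the result
# equals -sum_x sqrt(Q1(x)*Q2(x)) whatever dominating P is chosen.
Q1 = np.array([0.2, 0.5, 0.3])
Q2 = np.array([0.4, 0.4, 0.2])
P = np.array([1 / 3, 1 / 3, 1 / 3])

ratios = np.stack([Q1 / P, Q2 / P])            # (dQ_k/dP)(x), k = 1, 2
D = np.sum(P * -np.sqrt(np.prod(ratios, axis=0)))
print(D, -np.sum(np.sqrt(Q1 * Q2)))            # both values coincide
```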
Acknowledgments
W.S. is grateful to the Sorbonne Université Paris for its multiple partial financial support and especially the LPSM for its multiple great hospitality. M.B. thanks very much the University of Erlangen-Nürnberg for its partial financial support and hospitality. Moreover, W.S. would like to thank Ingo Klein and Konstantinos Zografos for some helpful remarks on a much earlier draft of this paper.
References
Ahmad, M.I., Sinclair, C.D., Spurr, B.D., 1988. Assessment of flood frequency models using
empirical distribution function statistics. Water Resour. Res. 24 (8), 1323–1328.
Al Mohamad, D., 2018. Towards a better understanding of the dual representation of phi diver-
gences. Stat. Papers 59 (3), 1205–1253.
Ali, M.S., Silvey, D., 1966. A general class of coefficients of divergence of one distribution from
another. J. Roy. Stat. Soc. B-28, 131–140.
Alonso-Revenga, J.M., Martin, N., Pardo, L., 2017. New improved estimators for overdispersion
in models with clustered multinomial data and unequal cluster sizes. Stat. Comput. 27,
193–217.
Amari, S.-I., 2016. Information Geometry and Its Applications. Springer, Japan.
Amari, S.-I., Nagaoka, H., 2000. Methods of Information Geometry. Oxford University Press.
Amari, S.-I., Karakida, R., Oizumi, M., 2018. Information geometry connecting Wasserstein
distance and Kullback-Leibler divergence via the entropy-relaxed transportation problem.
Info. Geo. 1, 13–37.
Anderson, T.W., Darling, D.A., 1952. Asymptotic theory of certain goodness of fit criteria based
on stochastic processes. Ann. Math. Stat. 23, 193–212.
Arikan, E., Merhav, N., 1998. Guessing subject to distortion. IEEE Trans. Inf. Theory 44 (3),
1041–1056.
Arnold, B.C., Balakrishnan, N., Nagaraja, H.N., 1992. A First Course in Order Statistics. Wiley,
New York.
Asadi, M., Ebrahimi, N., Hamedani, G.G., Soofi, E.S., 2006. Information measures for Pareto
distributions and order statistics. In: Balakrishnan, N., Castillo, E., Sarabia, J.M. (Eds.),
Advances in Distribution Theory, Order Statistics, and Inference. Birkhäuser, Boston,
pp. 207–223.
Avlogiaris, G., Micheas, A., Zografos, K., 2016a. On local divergences between two probability
measures. Metrika 79, 303–333.
Avlogiaris, G., Micheas, A., Zografos, K., 2016b. On testing local hypotheses via local diver-
gence. Stat. Methodol. 31, 20–42.
Ay, N., Jost, J., Le, H.V., Schwachh€ofer, L., 2017. Information Geometry. Springer Intern.
Baggerly, K.A., 1998. Empirical likelihood as a goodness-of-fit measure. Biometrika 85 (3),
535–547.
Bahadur, R.R., 1967. Rates of convergence of estimates and test statistics. Ann. Math. Stat. 38,
303–324.
Bahadur, R.R., 1971. Some Limit Theorems in Statistics. SIAM, Philadelphia.
Bapat, R.B., Beg, M.I., 1989. Order statistics for nonidentically distributed variables and perma-
nents. Sankhya A 51 (1), 79–93.
Baratpour, S., Habibi Rad, A., 2012. Testing goodness-of-fit for exponential distribution based on
cumulative residual entropy. Commun. Stat. Theory Methods 41 (8), 1387–1396.
Barbaresco, F., Nielsen, F., 2021. Geometric Structures of Statistical Physics, Information Geom-
etry, and Learning. Springer Nature, Switzerland.
Baringhaus, L., Henze, N., 2017. Cramer-von Mises distance: probabilistic interpretation, confi-
dence intervals, and neighborhood-of-model validation. J. Nonparam. Stat. 29 (2), 167–188.
Basu, A., Lindsay, B.G., 1994. Minimum disparity estimation for continuous models: efficiency,
distributions and robustness. Ann. Inst. Stat. Math. 46 (4), 683–705.
Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C., 1998. Robust and efficient estimation by minimiz-
ing a density power divergence. Biometrika 85 (3), 549–559.
Basu, A., Shioya, H., Park, C., 2011. Statistical Inference: The Minimum Distance Approach.
CRC Press, Boca Raton.
Basu, A., Mandal, A., Martin, N., Pardo, L., 2015. Robust tests for the equality of two normal
means based on the density power divergence. Metrika 78, 611–634.
Beran, R., 1977. Minimum Hellinger distance estimates for parametric models. Ann. Stat. 5 (3),
445–463.
Bertail, P., Gautherat, E., Harari-Kermadec, H., 2014. Empirical φ*-divergence minimizers for
Hadamard differentiable functionals. In: Akritas, M.G., et al. (Eds.), Topics in Nonparametric
Statistics. Springer, New York, pp. 21–32.
Bertrand, P., Broniatowski, M., Marcotorchino, J.-F., 2021. Divergences minimisation and appli-
cations. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information GSI 2021.
Lecture Notes in Computer Science, vol. 12829. Springer Nature, Switzerland, pp. 818–828.
Birkhoff, G.D., 1932. A set of postulates for plane geometry, based on scale and protractor. Ann.
Math. 33 (2), 329–345.
Birrell, J., Katsoulakis, M.A., Pantazis, Y., 2022a. Optimizing variational representations of diver-
gences and accelerating their statistical estimation. IEEE Trans. Inf. Theory (early access),
https://doi.org/10.1109/TIT.2022.3160659.
Birrell, J., Dupuis, P., Katsoulakis, M.A., Pantazis, Y., Rey-Bellet, L., 2022b. (f, Γ)-Divergences:
interpolating between f-divergences and integral probability metrics. J. Mach. Learn. Res. 23,
1–70.
Blum, J.R., Kiefer, J., Rosenblatt, M., 1961. Distribution-free tests of independence based on the
sample distribution function. Ann. Math. Stat. 32, 485–498.
Boekee, D.E., 1977. An extension of the Fisher information measure. In: Csiszar, I., Elias, P.
(Eds.), Topics in Information theory (Second Colloq., Keszthely, 1975). Colloq. Math. Soc.
János Bolyai, vol. 16. North-Holland, Amsterdam, pp. 113–123.
Boissonnat, J.-D., Nielsen, F., Nock, R., 2010. Bregman Voronoi diagrams. Discret. Comput.
Geom. 44 (2), 281–307.
Bouzebda, S., Keziou, A., 2010. New estimates and tests of independence in semiparametric cop-
ula models. Kybernetika 46 (1), 178–201.
Broniatowski, M., 2003. Estimation of the Kullback-Leibler divergence. Math. Methods Stat.
12 (4), 391–409.
Broniatowski, M., 2021. Minimum divergence estimators, maximum likelihood and the
generalized bootstrap. Entropy 23 (185), 15 pages. https://doi.org/10.3390/e23020185.
Broniatowski, M., Decurninge, A., 2016. Estimation for models defined by conditions on their
L-moments. IEEE Trans. Inf. Theory 62 (9), 5181–5198.
Broniatowski, M., Keziou, A., 2006. Minimization of ϕ-divergences on sets of signed measures.
Stud. Sci. Math. Hung. 43, 403–442.
Broniatowski, M., Keziou, A., 2009. Parametric estimation and tests through divergences and the
duality technique. J. Multivar. Anal. 100 (1), 16–36.
Broniatowski, M., Keziou, A., 2012. Divergences and duality for estimation and test under
moment condition models. J. Stat. Plan. Inference 142, 2554–2573.
Broniatowski, M., Stummer, W., 2019. Some universal insights on divergences for statistics,
machine learning and artificial intelligence. In: Nielsen, F. (Ed.), Geometric Structures of
Information. Springer Nature, Switzerland, pp. 149–211.
Broniatowski, M., Stummer, W., 2021. A precise bare simulation approach to the minimization of
some distances–foundations. arXiv:2107.01693v1 (July).
Broniatowski, M., Miranda, E., Stummer, W., 2019. Testing the number and the nature of the
components in a mixture distribution. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric
Science of Information GSI 2019. Lecture Notes in Computer Science, vol. 11712. Springer
Nature, Switzerland, pp. 309–318.
Chamany, A., Baratpour, S., 2014. A dynamic discrimination information based on cumulative
residual entropy and its properties. Commun. Stat. Theory Methods 43, 1041–1049.
Chernobai, A., Rachev, S.T., Fabozzi, F.J., 2015. Composite goodness-of-fit tests for left-
truncated loss samples. In: Lee, C.-F., Lee, J. (Eds.), Handbook of Financial Econometrics
and Statistics. Springer Science/Business Media, New York, pp. 575–596.
Chernozhukov, V., Galichon, A., Hallin, M., Henry, M., 2017. Monge-Kantorovich depth, quan-
tiles, ranks, and signs. Ann. Stat. 45 (1), 223–256.
Cramer, H., 1928. On the composition of elementary errors. Scand. Actuar. J. 1928 (1), 13–74 and
141–180.
Csiszar, I., 1963. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hung. Acad. Sci. A-8, 85–108.
Csiszar, I., 1967. Information-type measures of difference of probability distributions and indirect
observations. Stud. Sci. Math. Hung. 2, 299–318.
Csiszar, I., 1991. Why least squares and maximum entropy? An axiomatic approach to inference
for linear inverse problems. Ann. Stat. 19 (4), 2032–2066.
Csiszar, I., Breuer, T., 2016. Measuring distribution model risk. Math. Financ. 26 (2), 395–411.
Darling, D.A., 1957. The Kolmogorow-Smirnov, Cramer-von Mises tests. Ann. Math. Stat. 28,
823–838.
David, H.A., Nagaraja, H.N., 2003. Order Statistics, third ed. Wiley, Hoboken.
Davy, M., Doucet, A., 2003. Copulas: a new insight into positive time-frequency distributions.
IEEE Signal Proc. Letters 10 (7), 215–218.
De Groot, M.H., 1962. Uncertainty, information and sequential experiments. Ann. Math. Stat. 33,
404–419.
Deheuvels, P., Martynov, G., 2003. Karhunen-Loeve expansions for weighted Wiener processes
and Brownian bridges via Bessel functions. In: Hoffmann-Jorgensen, J., Marcus, M.B.,
Wellner, J.A. (Eds.), High Dimensional Probability III. Springer, Basel, pp. 57–93.
Dembo, A., Zeitouni, O., 2009. Large Deviations Techniques and Applications, second ed. (corr.
print). Springer, New York.
Di Crescenzo, A., Longobardi, M., 2004. A measure of discrimination between past lifetime dis-
tributions. Stat. Prob. Lett. 67, 173–182.
Di Crescenzo, A., Longobardi, M., 2015. Some properties and applications of cumulative
Kullback-Leibler information. Appl. Stoch. Models Bus. Ind. 31, 875–891.
Duchi, J., Khosravi, K., Ruan, F., 2018. Multiclass classification, information, divergence and sur-
rogate risk. Ann. Stat. 46 (6B), 3246–3275.
Durante, F., Sempi, C., 2016. Principles of Copula Theory. CRC Press, Boca Raton.
Durrani, T.S., Zeng, X., 2009. Copula based divergence measures and their use in image registra-
tion. In: Proc. 17th Eur. Sig. Proc. Conf. (EUSIPCO 2009), pp. 1309–1313.
Ebrahimi, N., Kirmani, S.N.U.A., 1996. A measure of discrimination between two residual life-
time distributions and its applications. Ann. Inst. Stat. Math. 48 (2), 257–265.
Ebrahimi, N., Soofi, E.S., Zahedi, H., 2004. Information properties of order statistics and
spacings. IEEE Trans. Inf. Theory 50 (1), 177–183.
Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall, New York.
Embrechts, P., Hofert, M., 2013. A note on generalized inverses. Math. Meth. Oper. Res. 77,
423–432.
Faugeras, O.P., Rüschendorf, L., 2017. Markov morphisms: a combined copula and mass trans-
portation approach to multivariate quantiles. Math. Appl. 45 (1), 3–45.
Faugeras, O.P., Rüschendorf, L., 2018. Risk excess measures induced by hemi-metrics. Toulouse
School of Economics, Working Paper 18-922.
Fienberg, S.E., Holland, P.W., 1970. Methods for eliminating zero counts in contingency tables.
In: Patil, G.P. (Ed.), Random Counts in Scientific Work, vol. 1 (Random Counts in Models
and Structures). Pennsylvania State University Press, University Park, pp. 233–260.
Figalli, A., 2018. On the continuity of center-outward distribution and quantile functions. Nonlin-
ear Anal. 177 (B), 413–421.
Galichon, H., Henry, M., 2012. Dual theory of choice with multivariate risks. J. Econ. Theory
147, 1501–1516.
Garcia-Garcia, D., Williamson, R.C., 2012. Divergences and risks for multiclass experiments. In:
25th Annual Conference on Learning Theory; JMLR Workshop and Conference Proceedings,
vol. 23. 28.1–28.20.
Gayen, A., Kumar, M.A., 2021. Projection theorems and estimating equations for power-law mod-
els. J. Multivar. Anal. 184, 104734. https://doi.org/10.1016/j.jmva.2021.104734.
Ghosh, A., Basu, A., 2016a. Robust Bayes estimation using the density power divergence. Ann.
Inst. Stat. Math. 68, 413–437.
Ghosh, A., Basu, A., 2016b. Robust estimation in generalized linear models: the density power
divergence approach. TEST 25, 269–290.
Ghosh, A., Basu, A., 2018. A new family of divergences originating from model adequacy tests
and applications to robust statistical inference. IEEE Trans. Inf. Theory 64 (8), 5581–5591.
Gilchrist, W.G., 2000. Statistical Modelling With Quantile Functions. Chapman & Hall/CRC,
Boca Raton.
Groeneboom, P., Oosterhoff, J., 1977. Bahadur efficiency and probability of large deviations. Stat.
Neerlandica 31 (1), 1–24.
Groeneboom, P., Oosterhoff, J., Ruymgaart, F.H., 1979. Large deviation theorems for empirical
probability measures. Ann. Prob. 7 (4), 553–586.
Guo, X., Hong, J., Lin, T., Yang, N., 2021. Relaxed Wasserstein with application to GANs. In:
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP
2021), pp. 3325–3329.
Györfi, L., Nemetz, T., 1977. f-Dissimilarity: a general class of separation measures of several
probability measures. In: Csiszar, I., Elias, P. (Eds.), Topics in Information Theory (Second
Colloq., Keszthely, 1975). Colloq. Math. Soc. János Bolyai, vol. 16. North-Holland, Amster-
dam, pp. 309–321.
Györfi, L., Nemetz, T., 1978. f-Dissimilarity: a generalization of the affinity of several distribu-
tions. Ann. Inst. Stat. Math 30 (Part A), 105–113.
Hajek, J., Sidak, Z., Sen, P.K., 1999. Theory of Rank Tests. Academic Press, San Diego.
Hallin, M., 2017. On distribution and quantile functions, ranks and signs. ECARES Working
Paper 2017-34.
Hallin, M., 2018. From Mahalanobis to Bregman via Monge and Kantorovich. Sankhya 80-B
(Suppl. 1), S135–S146.
Hallin, M., Del Barrio, E., Cuesta-Albertos, J., Matran, C., 2021. Distribution and quantile func-
tions, ranks and signs in dimension d; a measure transportation approach. Ann. Stat. 49 (2),
1139–1165.
Hande, S., 1994. A note on order statistics for nonidentically distributed variables. Sankhya
A 56 (2), 365–368.
Henze, N., Nikitin, Y.Y., 2000. A new approach to goodness-of-fit testing based on the integrated
empirical process. J. Nonparam. Stat. 12 (3), 391–416.
Hoadley, A.B., 1967. On the probability of large deviations of functions of several empirical
cdf’s. Ann. Math. Stat. 38, 360–381.
Hosking, J.R.M., 1990. L-moments: analysis and estimation of distributions using linear combina-
tions of order statistics. J. R. Stat. Soc. B 52 (1), 105–124.
Jager, L., Wellner, J.A., 2007. Goodness-of-fit tests via phi-divergences. Ann. Stat. 35 (5),
2018–2053.
Judge, G.G., Mittelhammer, R.C., 2012. An Information Theoretic Approach to Econometrics.
Cambridge University Press, Cambridge.
Jurafsky, D., Martin, J.H., 2009. Speech and Language Processing, second ed. Pearson/Prentice
Hall, Upper Saddle River.
Kammerer, N.B., Stummer, W., 2020. Some dissimilarity measures of branching processes and
optimal decision making in the presence of potential pandemics. Entropy 22 (8), 874.
https://doi.org/10.3390/e22080874 (123 pages).
Kanamori, T., Suzuki, T., Sugiyama, M., 2012. f-divergence estimation and two-sample homoge-
neity test under semiparametric density-ratio models. IEEE Trans. Inf. Theory 58 (2),
708–720.
Karakida, R., Amari, S.-I., 2017. Information geometry of Wasserstein divergence. In: Nielsen, F.,
Barbaresco, F. (Eds.), Geometric Science of Information GSI 2017. Lecture Notes in Com-
puter Science, vol. 10589. Springer, International, pp. 119–126.
Kayal, S., Tripathy, M.R., 2018. A quantile-based Tsallis-α divergence. Physica A 492, 496–505.
Keziou, A., 2015. Multivariate divergences with application in multisample density ratio models.
In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information GSI 2015. Lecture
Notes in Computer Science, vol. 9389. Springer, Berlin, pp. 444–453.
Keziou, A., Leoni-Aubin, S., 2008. On empirical likelihood for semiparametric two-sample den-
sity ratio models. J. Stat. Plann. Infer. 138 (4), 915–928.
Kißlinger, A.-L., Stummer, W., 2013. Some decision procedures based on scaled Bregman dis-
tance surfaces. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information
GSI 2013. Lecture Notes in Computer Science, vol. 8085. Springer, Berlin, pp. 479–486.
Kißlinger, A.-L., Stummer, W., 2015. New model search for nonlinear recursive models, regres-
sions and autoregressions. In: Nielsen, F., Barbaresco, F. (Eds.), Lecture Notes in Computer
Science, vol. 9389. Springer, Berlin, pp. 693–701.
Kißlinger, A.-L., Stummer, W., 2016. Robust statistical engineering by means of scaled Bregman
distances. In: Agostinelli, C., et al. (Eds.), Recent Advances in Robust Statistics–Theory and
Applications. Springer, New Delhi, pp. 81–113.
Kißlinger, A.-L., Stummer, W., 2018. A new toolkit for robust distributional change detection.
Appl. Stoch. Models Bus. Ind. 34, 682–699.
Kleijn, B.J.K., van der Vaart, A.W., 2012. The Bernstein-von-Mises theorem under misspecifica-
tion. Electron. J. Stat. 6, 354–381.
Klein, I., Mangold, B., Doll, M., 2016. Cumulative paired ϕ-entropy. Entropy 18 (7), 248.
Krömer, S., Stummer, W., 2019. A new toolkit for mortality data analytics. In: Steland, A.,
Rafajlowicz, E., Okhrin, O. (Eds.), Stochastic Models, Statistics and Their Applications.
Springer Nature, Switzerland, pp. 393–407.
Kuchibhotla, A.K., Basu, A., 2015. A general setup for minimum disparity estimation. Stat. Prob.
Lett. 96, 68–74.
Liese, L., Miescke, K.J., 2008. Statistical Decision Theory; Estimation, Testing, and Selection.
Springer, New York.
Liese, F., Vajda, I., 1987. Convex Statistical Distances. Teubner, Leipzig.
Liese, F., Vajda, I., 2006. On divergences and informations in statistics and information theory.
IEEE Trans. Inf. Theory 52 (10), 4394–4412.
Lin, N., He, X., 2006. Robust and efficient estimation under data grouping. Biometrika 93 (1),
99–112.
Lin, T., Hu, Z., Guo, X., 2019. Sparsemax and relaxed Wasserstein for topic sparsity. In: The
Twelfth ACM International Conference on Web Search and Data Mining (WSDM 19),
ACM, New York, pp. 141–149.
Lindsay, B.G., 1994. Efficiency versus robustness: the case for minimum Hellinger distance and
related methods. Ann. Stat. 22 (2), 1081–1114.
Lindsay, B.G., 2004. Statistical distances as loss functions in assessing model adequacy. In:
Taper, M.P., Lele, S.R. (Eds.), The Nature of Scientific Evidence. The University of Chicago
Press, Chicago, pp. 439–487.
Lindsay, B.G., Markatou, M., Ray, S., Yang, K., Chen, S.-C., 2008. Quadratic distances on prob-
abilities; a unified foundation. Ann. Stat. 36 (2), 983–1006.
Liu, J., 2007. Information Theoretic Content and Probability (Ph.D. thesis). University of Florida.
Liu, L., Lindsay, B.G., 2009. Building and using semiparametric tolerance regions for parametric
multinomial models. Ann. Stat. 37 (6A), 3644–3659.
Liu, R.Y., Parelius, J.M., Singh, K., 1999. Multivariate analysis by data depth: descriptive statis-
tics, graphics and inference. Ann. Stat. 27 (3), 783–858.
Markatou, M., Chen, Y., 2019. Statistical distances and the construction of evidence functions for
model adequacy. Front. Ecol. Evol. 7, 447. https://doi.org/10.3389/fevo.2019.00447.
Markatou, M., Sofikitou, E., 2018. Non-quadratic distances in model assessment. Entropy 20, 464.
https://doi.org/10.3390/e20060464.
Marshall, A.W., Olkin, I., Arnold, B.C., 2011. Inequalities: Theory of Majorization and Its Appli-
cations, second ed. Springer, New York.
Matusita, K., 1967. On the notion of affinity of several distributions and some of its applications.
Ann. Inst. Stat. Math. 19, 181–192.
Mehrali, Y., Asadi, M., 2021. Parameter-estimation based on cumulative Kullback-Leibler infor-
mation. REVSTAT 19 (1), 111–130.
Menendez, M., Pardo, L., Taneja, I.J., 1992. On M-dimensional unified (r, s)-Jensen difference
divergence measures and their applications. Kybernetika 28 (4), 309–324.
Menendez, M., Salicru, M., Morales, D., Pardo, L., 1997. Divergence measures between popula-
tions: applications in the exponential family. Commun. Stat. Theory Methods 26 (5),
1099–1117.
Menendez, M., Morales, D., Pardo, L., Vajda, I., 1998. Two approaches to grouping of data and
related disparity statistics. Commun. Stat. Theory Methods 27 (3), 609–633.
Menendez, M., Morales, D., Pardo, L., Vajda, I., 2001a. Minimum disparity estimators for
discrete and continuous models. Appl. Math. 46 (6), 439–466.
Menendez, M., Morales, D., Pardo, L., Vajda, I., 2001b. Minimum divergence estimators based on
grouped data. Ann. Inst. Stat. Math. 53 (2), 277–288.
Micheas, A.C., Zografos, K., 2006. Measuring stochastic dependence using ϕ-divergence. J. Mul-
tivar. Anal. 97, 765–784.
Millmann, R.S., Parker, G.D., 1991. Geometry–A Metric Approach With Models, second ed.
Springer, New York.
Morales, D., Pardo, L., Vajda, I., 2004. Digitalization of observations permits efficient estimation
in continuous models. In: Lopez-Diaz, M., et al. (Eds.), Soft Methodology and Random Infor-
mation Systems. Springer, Berlin, pp. 315–322.
Morales, D., Pardo, L., Vajda, I., 2006. On efficient estimation in continuous models based on
finitely quantized observations. Commun. Stat. Theory Methods 35 (9), 1629–1653.
Morimoto, T., 1963. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 18 (3), 328–331.
Najim, J., 2002. A Cramer type theorem for weighted random variables. Electron. J. Prob. 7 (4),
1–32. https://doi.org/10.1214/EJP.v7-103.
Nguyen, X., Wainwright, M.J., Jordan, M.I., 2010. Estimating divergence functionals and the like-
lihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 56 (11), 5847–5861.
Nielsen, F. (Ed.), 2021. Progress in Information Geometry. Springer Nature, Switzerland.
Nielsen, F., Barbaresco, F. (Eds.), 2013. Geometric Science of Information GSI 2013. Lecture
Notes in Computer Science, vol. 8085 Springer, Berlin.
Nielsen, F., Barbaresco, F. (Eds.), 2015. Geometric Science of Information GSI 2015. Lecture
Notes in Computer Science, vol. 9389 Springer, International.
Nielsen, F., Barbaresco, F. (Eds.), 2017. Geometric Science of Information GSI 2017. Lecture
Notes in Computer Science, vol. 10589 Springer, International.
Nielsen, F., Barbaresco, F. (Eds.), 2019. Geometric Science of Information GSI 2019. Lecture
Notes in Computer Science, vol. 11712 Springer Nature, Switzerland.
Nielsen, F., Barbaresco, F. (Eds.), 2021. Geometric Science of Information GSI 2021. Lecture
Notes in Computer Science, vol. 12829 Springer Nature, Switzerland.
Nielsen, F., Bhatia, R. (Eds.), 2013. Matrix Information Geometry. Springer, Berlin.
Nikitin, Y., 1995. Asymptotic Efficiency of Nonparametric Tests. Cambridge University Press,
Cambridge.
Nock, R., Nielsen, F., Amari, S.-I., 2016. On conformal divergences and their population minimi-
zers. IEEE Trans. Inform. Theory 62 (1), 527–538.
Österreicher, F., Vajda, I., 1993. Statistical information and discrimination. IEEE Trans. Inf.
Theory 39 (3), 1036–1039.
Owen, A.B., 1988. Empirical likelihood ratio confidence intervals for a single functional.
Biometrika 75 (2), 237–249.
Owen, A.B., 1990. Empirical likelihood ratio confidence regions. Ann. Stat. 18 (1), 90–120.
Owen, A.B., 2001. Empirical Likelihood. Chapman and Hall, Boca Raton.
Pal, S., Wong, T.-K.L., 2016. The geometry of relative arbitrage. Math. Finan. Econon. 10,
263–293.
Pal, S., Wong, T.-K.L., 2018. Exponentially concave functions and a new information geometry.
Ann. Prob. 46 (2), 1070–1113.
Pardo, L., 2006. Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC,
Boca Raton.
Pardo, M.C., Vajda, I., 1997. About distances of discrete distributions satisfying the data proces-
sing theorem of information theory. IEEE Trans. Inf. Theory 43 (4), 1288–1293.
Pardo, M.C., Vajda, I., 2003. On asymptotic properties of information-theoretic divergences.
IEEE Trans. Inf. Theory 49 (7), 1860–1868.
Park, C., Basu, A., 2004. Minimum disparity estimation: asymptotic normality and breakdown
point results. Bull. Inform. Cybernet. 36, 19–33.
Park, S., Rao, M., Shin, D.W., 2012. On cumulative residual Kullback-Leibler information. Stat.
Prob. Lett. 82, 2025–2032.
Park, S., Noughabi, H.A., Kim, I., 2018. General cumulative Kullback-Leibler information.
Commun. Stat. Theory Methods 47 (7), 1551–1560.
Pelletier, B., 2011. Inference in φ-families of distributions. Statistics 45 (3), 223–236.
Peyre, G., Cuturi, M., 2019. Computational optimal transport: with applications to data science.
Found. Trends Mach. Learn. 11 (5-6), 355–607 (Also appeared in book form by now Publish-
ers, Hanover MA, USA (2019)).
Rachev, S.T., Rüschendorf, L., 1998. Mass Transportation Problems, vol. I. Springer, New York.
Read, T.R.C., Cressie, N.A.C., 1988. Goodness-of-Fit Statistics for Discrete Multivariate Data.
Springer, New York.
Reiss, R.-D., 1989. Approximate Distributions of Order Statistics. Springer, New York.
Rodriguez, J.C., Viollaz, A.J., 1995. A Cramer-von Mises type goodness of fit test with asymmet-
ric weight function. Commun. Stat. Theory Methods 24 (4), 1095–1120.
Roensch, B., Stummer, W., 2017. 3D insights to some divergences for robust statistics and
machine learning. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information
GSI 2017. Lecture Notes in Computer Science, vol. 10589. Springer International, pp.
460–469.
Roensch, B., Stummer, W., 2019a. Robust estimation by means of scaled Bregman power
distances; part I; non-homogeneous data. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric
Science of Information GSI 2019. Lecture Notes in Computer Science, vol. 11712. Springer
Nature, Switzerland, pp. 319–330.
Roensch, B., Stummer, W., 2019b. Robust estimation by means of scaled Bregman power dis-
tances; part II; extreme values. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of
Information GSI 2019. Lecture Notes in Computer Science, vol. 11712. Springer Nature,
Switzerland, pp. 331–340.
Rosenblatt, M., 1952. Limit theorems associated with variants of the von Mises statistic. Ann.
Math. Stat. 23, 617–623.
Sankaran, P.G., Sunoj, S.M., Unnikrishnan Nair, N., 2016. Kullback-Leibler divergence: a quan-
tile approach. Stat. Prob. Lett. 111, 72–79.
Schweizer, B., Wolff, E.F., 1981. On nonparametric measures of independence for random vari-
ables. Ann. Stat. 9 (4), 879–885.
Scott, W.F., 1999. A weighted Cramer-von Mises statistic, with some applications to clinical
trials. Commun. Stat. Theory Methods 28 (12), 3001–3008.
Serfling, R., 2002. Quantile functions for multivariate analysis: approaches and applications. Stat.
Neerlandica 56 (2), 214–232.
Serfling, R., 2006. Depth functions in nonparametric multivariate inference. In: Liu, R.Y.,
Serfling, R., Souvaine, D.L. (Eds.), Robust Multivariate Analysis, Computational Geometry
and Applications. DIMACS Series in Discrete Mathematics and Theoretical Computer Sci-
ence, vol. 72. American Mathematical Society, pp. 1–16.
Serfling, R., 2010. Equivariance and invariance properties of multivariate quantile and related
functions, and the role of standardization. J. Nonparam. Stat. 22 (7), 915–936.
Serfling, R., Zuo, Y., 2010. Discussion. Ann. Stat. 38 (2), 676–684.
Shin, H., Jung, Y., Jeong, C., Heo, J.-H., 2012. Assessment of modified Anderson-Darling test
statistics for the generalized extreme value and generalized logistic distributions. Stoch.
Env. Res. Risk A. 26, 105–114.
Sklar, A., 1959. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ.
Paris 8, 229–231.
Smirnov/Smirnoff, N., 1936. Sur la distribution de ω2. C. R. Acad. Sci. Paris 202, 449–452.
Sriperumbudur, B.K., Gretton, A., Fukumizu, K., Lanckriet, G., Schölkopf, B., 2012. On the
empirical estimation of integral probability metrics. Electron. J. Stat. 6, 1550–1599.
Stephens, M.A., 1986. Test based on EDF statistics. In: D’Agostino, R.B., Stephens, M.A. (Eds.),
Goodness-of-Fit Techniques. Marcel Dekker Inc., New York, pp. 97–193.
Stummer, W., 1999. On a statistical information measure of diffusion processes. Stat. Decisions
17, 359–376.
Stummer, W., 2001. On a statistical information measure for a generalized Samuelson-Black-
Scholes model. Stat. Decisions 19, 289–314.
Stummer, W., 2004. Exponentials, Diffusions, Finance, Entropy and Information. Shaker, Aachen.
Stummer, W., 2007. Some Bregman distances between financial diffusion processes. Proc. Appl.
Math. Mech. 7 (1), 1050503–1050504.
Stummer, W., 2021. Optimal transport with some directed distances. In: Nielsen, F.,
Barbaresco, F. (Eds.), Geometric Science of Information GSI 2021. Lecture Notes in Com-
puter Science, vol. 12829. Springer Nature, Switzerland, pp. 829–840.
Stummer, W., Kißlinger, A.L., 2017. Some new flexibilizations of Bregman divergences and their
asymptotics. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Information GSI
2017. Lecture Notes in Computer Science, vol. 10589. Springer International, pp. 514–522.
Stummer, W., Lao, W., 2012. Limits of Bayesian decision related quantities of binomial asset
price models. Kybernetika 48 (4), 750–767.
Stummer, W., Vajda, I., 2007. Optimal statistical decisions about some alternative financial mod-
els. J. Econometrics 137, 441–447.
Stummer, W., Vajda, I., 2010. On divergences of finite measures and their applicability in statis-
tics and information theory. Statistics 44, 169–187.
Stummer, W., Vajda, I., 2012. On Bregman distances and divergences of probability measures.
IEEE Trans. Inf. Theory 58 (3), 1277–1288.
Sunoj, S.M., Sankaran, P.G., Unnikrishnan Nair, N., 2018. Quantile-based cumulative Kullback-
Leibler divergence. Statistics 52 (1), 1–17.
Tan, Z., Zhang, X., 2022. On loss functions and regret bounds for multi-category classfication.
IEEE Trans. Inf. Theory (early access), https://doi.org/10.1109/TIT.2022.3167635.
Toussaint, G.T., 1974. Some properties of Matusita’s measure of affinity of several distributions.
Ann. Inst. Stat. Math. 26 (3), 389–394.
Toussaint, G.T., 1978. Probability of error, expected divergence, and the affinity of several distri-
butions. IEEE Trans. Syst. Man Cybern. SMC-8 (6), 482–485.
Tran, V.H., 2018. Copula variational Bayes inference via information geometry. Preprint,
arXiv:1803.10998v1 (March).
Trashorras, J., Wintenberger, O., 2014. Large deviations for bootstrapped empirical measures.
Bernoulli 20 (4), 1845–1878.
Vajda, I., 1972. On the f-divergence and singularity of probability measures. Periodica Math.
Hungar. 2 (1-4), 223–234.
Vajda, I., 1989. Theory of Statistical Inference and Information. Kluwer, Dordrecht.
Vajda, I., van der Meulen, E.C., 2011. Goodness-of-fit criteria based on observations quantized by
hypothetical and empirical percentiles. In: Karian, Z.A., Dudewicz, E.J. (Eds.), Handbook of
Fitting Statistical Distributions With R. Chapman & Hall/CRC, Boca Raton, pp. 917–994.
Vaughan, R.J., Venables, W.N., 1972. Permanent expressions for order statistic densities. J. R.
Stat. Soc. B 34 (2), 308–310.
Victoria-Feser, M.-P., Ronchetti, E., 1997. Robust estimation for grouped data. J. Am. Stat.
Assoc. 92 (437), 333–340.
Von Mises, R., 1931. Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und the-
oretischen Physik. Deuticke, Leipzig.
Vonta, F., Karagrigoriou, A., 2010. Generalized measures of divergence in survival analysis and
reliability. J. Appl. Prob. 47, 216–234.
Weller-Fahy, D.J., Borghetti, B.J., Sodemann, A.A., 2015. A survey of distance and similarity
measures used within network intrusion anomaly detection. IEEE Commun. Surv. Tutorials
17 (1), 70–91.
Werner, E., Ye, D., 2017. Mixed f-divergence for multiple pairs of measures. Canad. Math. Bull.
60 (3), 641–654.
Yari, G., Saghafi, A., 2012. Unbiased Weibull modulus estimation using differential cumulative
entropy. Commun. Stat. Simul. Comput. 41 (8), 1372–1378.
Yari, G., Mirhabibi, A., Saghafi, A., 2013. Estimation of the Weibull parameters by Kullback-
Leibler divergence of survival functions. Appl. Math. Inf. Sci. 7 (1), 187–192.
Zeng, X., Durrani, T.S., 2011. Estimation of mutual information using copula density function.
Electr. Lett. 47 (8), 493–494.
Zeng, X., Ren, J., Sun, M., Marshall, S., Durrani, T., 2014. Copulas for statistical signal proces-
sing (Part II): simulation, optimal selection and practical applications. Signal Process. 94,
681–690.
Zografos, K., 1994. Asymptotic distributions of estimated f-dissimilarity between populations in
stratified random sampling. Stat. Prob. Lett. 21, 147–151.
Zografos, K., 1998. f-Dissimilarity of several distributions in testing statistical hypotheses. Ann.
Inst. Stat. Math. 50 (2), 295–310.
Zuo, Y., Serfling, R., 2000a. General notions of statistical depth function. Ann. Stat. 28 (2),
461–482.
Zuo, Y., Serfling, R., 2000b. Structural properties and convergence results for contours of sample
statistical depth functions. Ann. Stat. 28 (2), 483–499.
Chapter 6

The analytic dually flat space of the mixture family of two prescribed distinct Cauchy distributions

Frank Nielsen*,†
Sony Computer Science Laboratories Inc., Tokyo, Japan
*Corresponding author: e-mail: frank.nielsen.x@gmail.com
Abstract
A smooth and strictly convex function on an open convex domain induces both (1) a
Hessian structure with respect to the standard flat Euclidean connection, and (2) a dually
flat structure in information geometry. We first review these fundamental constructions
and illustrate how to instantiate them for (a) full regular exponential families from their
cumulant functions, (b) regular homogeneous cones from their characteristic functions,
and (c) mixture families from their Shannon negentropy functions. Although these struc-
tures can be explicitly built for many common examples of the first two classes of expo-
nential families and homogeneous cones, the differential entropy of continuous statistical
mixtures with distinct prescribed density components sharing the same support is hith-
erto not known in closed form, hence forcing implementations of mixture family mani-
folds in practice using Monte Carlo sampling. In this chapter, we report a notable
exception: The uniorder family of mixtures defined as the convex combination of two
prescribed and distinct Cauchy distributions.
Keywords: Riemannian manifold, Hessian manifold, Affine connection, Exponential
family, Homogeneous regular cone, Mixture family, Fisher metric, Rao distance,
Cauchy mixture
†https://franknielsen.github.io/.
Handbook of Statistics, Vol. 46. https://doi.org/10.1016/bs.host.2022.02.002
Copyright © 2022 Elsevier B.V. All rights reserved.
1 Introduction and motivation

Let $(\mathcal{X}, \Sigma, \mu)$ denote a measure space (Keener, 2010), where $\mathcal{X}$ denotes the sample space of outcomes, $\Sigma$ a $\sigma$-algebra event space, and $\mu$ a positive measure. Information geometry (Amari, 2016) studies the geometric structures of a family $\{P_{\theta}(x)\}_{\theta\in\Theta}$ of parametric probability measures all dominated by $\mu$, called the statistical model. Let $\mathcal{P} = \big\{ p_{\theta}(x) = \frac{dP_{\theta}}{d\mu} \big\}_{\theta\in\Theta}$ denote the set of Radon–Nikodym densities (Keener, 2010) of $P_{\theta}$ with respect to $\mu$. The dimension $D$ of the parameter space $\Theta \subseteq \mathbb{R}^{D}$ denotes the order of the model (e.g., $D = 1$ for the family of exponential distributions, $D = 2$ for the family of univariate normal distributions, etc.). Information geometry relies on the core concept of affine connections in differential geometry (Godinho and Natário, 2014). Amari pioneered this field and elicited in particular the so-called dualistic $\pm\alpha$-structure (Amari, 2016) $(\mathcal{P}, g^{F}, \nabla^{-\alpha}, \nabla^{\alpha})$ of $\mathcal{P}$ by using a family of affine connections, called the $\alpha$-connections $\nabla^{\alpha}$ for $\alpha \in \mathbb{R}$. An $\alpha$-connection (Amari, 2016) defines $\nabla^{\alpha}$-geodesics on $\mathcal{P}$, and is specified according to its corresponding $D^{3}$ Christoffel symbols $\Gamma^{\alpha}_{ki,j}$ (functions) as follows:
$$\Gamma^{\alpha}_{ki,j}(\theta) = E_{p_{\theta}}\Big[\Big(\partial_k \partial_i l_{\theta}(x) + \frac{1-\alpha}{2}\, \partial_k l_{\theta}(x)\, \partial_i l_{\theta}(x)\Big)\, \partial_j l_{\theta}(x)\Big],$$
where $l_{\theta}(x) = \log p_{\theta}(x)$ denotes the log-likelihood function, and $\partial_i$ is the notational shortcut for $\frac{\partial}{\partial\theta_i}$ for $i \in \{1, \ldots, D\}$. The Riemannian metric tensor $g^{F}$ is the Fisher information metric, which can be expressed in the $\theta$-coordinate system using the Fisher information matrix (Nielsen, 2022) (FIM) as follows:
$$[g^{F}]_{\theta} = E_{p_{\theta}}\big[\nabla_{\theta}\, l_{\theta}(x)\, (\nabla_{\theta}\, l_{\theta}(x))^{\top}\big].$$
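As a quick numerical sanity check of this formula (our own sketch, not part of the chapter's development), the FIM of the univariate normal family with coordinates $\theta = (\mu, \sigma)$, whose exact value is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, can be approximated by a Monte Carlo average of the outer product of the score:

```python
import numpy as np

# Score of the univariate normal: (dl/dmu, dl/dsigma)
#   = ((x-mu)/sigma^2, ((x-mu)^2 - sigma^2)/sigma^3).
rng = np.random.default_rng(2)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, 1_000_000)
score = np.stack([(x - mu) / sigma**2,
                  ((x - mu)**2 - sigma**2) / sigma**3])
print(score @ score.T / x.size)   # approx. [[0.25, 0], [0, 0.5]]
```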
For any α  , the connections rα and rα are proven dual with respect to
α α
the Fisher information metric gF since their mid-connection r 2+ r corre-
sponds to the Levi-Civita metric connection gr (Godinho and Natário, 2014).
The fundamental theorem of Riemannian geometry (Godinho and Natário,
2014) states that the Levi-Civita metric is the unique torsion-free metric-
compatible affine connection.
Two common types of families of probability distributions are considered in information geometry: the exponential families (Barndorff-Nielsen, 2014) and the mixture families (Amari, 2016; Nielsen and Hadjeres, 2019). The $\pm 1$-structures (i.e., the $\alpha$-structures for $\alpha = \pm 1$) of the exponential families and mixture families are said to be dually flat (to be detailed in Section 2.2). Dually flat spaces have also been called Bregman manifolds (Nielsen, 2021), as they can be realized from either a smooth and strictly convex function (Bregman generator) or equivalently from its corresponding Bregman divergence via the information geometry structure derived from divergences (Amari and Cichocki, 2010).
However, there is a significant difference between the Bregman generators induced by exponential families and the Bregman generators induced by mixture families: while the Bregman generators of exponential families (i.e., cumulant functions) are always real analytic and available in closed form for many common exponential families (e.g., the multivariate normal family or the Beta family), the Bregman generators of mixture families (Shannon negentropies of mixtures) can be nonanalytic (Watanabe, 2004) (e.g., the Shannon negentropy of a mixture of two prescribed and distinct Gaussian components). In particular, the differential entropy of a mixture of two univariate isotropic Gaussians is discussed in Michalowicz et al. (2008), where a formula is reported depending on a definite integral which needs to be tabulated in practice.
Let us notice that the family of categorical distributions can be interpreted both as a discrete mixture family and as a discrete exponential family (Amari, 2016), with closed-form Bregman generators being convex conjugates of each other (Amari, 2016).
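To make this practical consequence concrete, here is a minimal Monte Carlo sketch (our own illustration; the component parameters are assumptions) of how the nonanalytic Bregman generator of a Gaussian mixture family, namely the Shannon negentropy $F(\theta) = -h(m_{\theta})$, is evaluated in practice by sampling:

```python
import numpy as np

# F(theta) = E_{m_theta}[log m_theta(X)] for m_theta = (1-theta)N(-1,1) + theta N(2,1),
# estimated by sampling from the mixture (no closed form is known).
rng = np.random.default_rng(4)

def mix_pdf(x, theta):
    n0 = np.exp(-0.5 * (x + 1.0)**2) / np.sqrt(2 * np.pi)
    n1 = np.exp(-0.5 * (x - 2.0)**2) / np.sqrt(2 * np.pi)
    return (1.0 - theta) * n0 + theta * n1

def negentropy(theta, n=1_000_000):
    comp = rng.random(n) < theta                          # component indicator
    x = np.where(comp, rng.normal(2.0, 1.0, n), rng.normal(-1.0, 1.0, n))
    return np.mean(np.log(mix_pdf(x, theta)))             # = -h(m_theta)

print(negentropy(0.3))
```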
In this work, we present a mixture family $C = \{(1-\theta)\, p_{l_1,s_1} + \theta\, p_{l_2,s_2} : \theta \in (0,1)\}$ of two distinct Cauchy distributions $p_{l_1,s_1}$ and $p_{l_2,s_2} \neq p_{l_1,s_1}$ (order $D = 1$) which is analytic, and report the convex conjugate Bregman functions and dual parameterizations in closed form.
The paper is organized as follows: In Section 2, we recall the two usual differential-geometric constructions obtained from a convex function on an open convex domain of $\mathbb{R}^{d}$: namely, (1) the Hessian manifold in Section 2.1 and (2) the dually flat space in Section 2.2. Furthermore, we link those structures to Amari's $\pm 1$-structures of exponential families and mixture families. We then illustrate these constructions in Section 3 for (a) the full regular exponential families (Barndorff-Nielsen, 2014) (Section 3.1), (b) the homogeneous regular cones (Güler, 1996) (Section 3.2), and (c) the mixture families (Nielsen and Hadjeres, 2019) (Section 3.3).
Our main contribution is presented in Section 4, where we report in closed form all the necessary formulas required to explicitly implement the dually flat space of statistical mixtures of two distinct Cauchy components. The Appendix provides a computational notebook using the open source symbolic computing software MAXIMA (Calvo, 2018).
2 Differential-geometric structures induced by smooth convex functions

2.1 Hessian manifolds and Bregman manifolds

Consider the $D$-dimensional Euclidean space $\mathbb{R}^{D}$ as an affine space equipped with the Cartesian coordinate system $x(\cdot)$. We can view $\mathbb{R}^{D}$ as a flat manifold $(\mathbb{R}^{D}, {}^{Euc}\nabla)$, where ${}^{Euc}\nabla$ denotes the standard flat connection of $\mathbb{R}^{D}$ (Shima, 2007) (Chapter 1) such that ${}^{Euc}\nabla_{\frac{\partial}{\partial x_i}}\, \frac{\partial}{\partial x_j} = 0$.
In general, the Hessian operatora r2 (Godinho and Natário, 2014) applied


to a function F on a manifold M is defined according to a connection r as
follows:
r2 FðV, WÞ ¼ rW ðrV FÞ  rrW V F ¼ r2 FðW, VÞ,
for any two smooth vector fields V and W. Using its Christoffel symbols Γkij of
the connection r, we get in local coordinates:
∂2 F ∂F
r2 Fð∂xi , ∂x j Þ ¼  Γkij k :
∂xi ∂x j ∂x
A connection is said flat when there exists a coordinate system such that
all Christoffel symbols vanish: Γkij ðxÞ ¼ 0. On a flat manifold (M, flatr), we
thus have the Hessian operator rewritten as:
∂2 F
r2 Fð∂xi , ∂x j Þ ¼ :
∂xi ∂x j
In particular, this holds on the Euclidean manifold $(\mathbb{R}^D, {}^{Euc}\nabla)$.
A Riemannian metric $g$ on a flat manifold $(M, \nabla)$ is called a Hessian metric (Shima, 2007) (Chapter 2) if there exists a local coordinate system $x$ and a potential function $F(x)$ such that $g = [g_{ij}]_{ij}$ with

$$g_{ij}(x) = \frac{\partial^2}{\partial x^i \partial x^j} F(x).$$

When $\nabla = {}^{Euc}\nabla$, we further say that $g$ is a Bregman metric (thus a special type of Hessian metric). For example, consider a smooth and strictly convex function $F(\theta)$ defined on an open convex domain $\Theta \subseteq \mathbb{R}^D$. Then

$$g = \sum_{i=1}^D \sum_{j=1}^D \frac{\partial^2}{\partial \theta^i \partial \theta^j} F(\theta)\, \mathrm{d}\theta^i \otimes \mathrm{d}\theta^j$$
is a Bregman metric (Gomes-Gonçalves et al., 2019). In dimension two, any analytic Riemannian manifold is a Hessian manifold (Amari and Armstrong, 2014). However, in higher dimensions, there exist topological and curvature obstructions that prevent a general Riemannian metric from being a Hessian metric (Amari and Armstrong, 2014).
For a Hessian metric with Hessian matrix $h_F(\theta) := \left[\frac{\partial^2 F}{\partial \theta^i \partial \theta^j}\right]_{ij}$, we can associate the corresponding Riemannian distance $\rho_{h_F}(p_1, p_2)$ between any two points $p_1, p_2$ on the Riemannian manifold $(M, g_{h_F})$. When $h_F$ is the Fisher information matrix, this distance is called Rao's distance (Atkinson and Mitchell, 1981) or the Fisher–Rao distance (Pinele et al., 2020).
^a More generally, higher-order covariant derivatives are defined recursively by $\nabla^k T = \nabla(\nabla^{k-1} T)$, with terminal case $\nabla^1 T = \nabla T$.
In the particularly interesting case of separable Bregman generators (Gomes-Gonçalves et al., 2019) with smooth strictly convex generator $F(\theta) = \sum_{i=1}^D F_i(\theta^i)$ (where the univariate functions $F_i$ are scalar Bregman generators), we get the following diagonal Hessian matrix:

$$h_F(\theta) = \mathrm{diag}\left(F_1''(\theta^1), \ldots, F_D''(\theta^D)\right),$$

and the corresponding Riemannian distance $\rho_{h_F}$, called the Riemannian Bregman distance in Gomes-Gonçalves et al. (2019), can be computed using the following formula:

$$\rho_F(\theta_1, \theta_2) = \sqrt{\sum_{i=1}^D \left(h_i(\theta_1^i) - h_i(\theta_2^i)\right)^2}, \qquad (1)$$

where the functions $h_i$ are the antiderivatives of $\sqrt{F_i''(u)}$:

$$h_i(\theta) = \int^{\theta} \sqrt{F_i''(u)}\, \mathrm{d}u.$$
When $F_i(\theta) = \frac{1}{2}\theta^2$ for $i \in \{1, \ldots, D\}$, we have $h_i(\theta) = \theta$, and we recover the usual Euclidean distance formula expressed in the Cartesian coordinate system. In general, the formula of Eq. (1) is the Euclidean distance expressed in the $h(\theta) = (h_1(\theta^1), \ldots, h_D(\theta^D))$-coordinate system. Indeed, recall that a Riemannian metric tensor $g$ is the Euclidean metric (Godinho and Natário, 2014) if there exists a coordinate system $h$ such that $[g]_h = I$, the identity matrix of dimension $D \times D$. The Euclidean metric expressed in the Cartesian coordinate system $\lambda$ is $[g]_\lambda = I$, the identity matrix.
Notice that the Euclidean distance between two points $p_1$ and $p_2$ of the Euclidean plane $\mathbb{R}^2$ is expressed in the polar coordinates $\left(r = \sqrt{x^2 + y^2}, \theta = \arctan\frac{y}{x}\right)$ (with inverse transformation $(x = r\cos\theta, y = r\sin\theta)$) as

$$\rho_{Euc}(p_1, p_2) = \sqrt{r_1^2 + r_2^2 - 2 r_1 r_2 \cos(\theta_2 - \theta_1)}.$$
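To make Eq. (1) concrete, here is a small sketch in MAXIMA (the symbolic system used in this chapter's Appendix); the separable generator $F_i(u) = u\log u - u$ is our own illustrative choice, not one used elsewhere in the text:

/* Riemannian Bregman distance of Eq. (1) for the separable generator
   F_i(u) = u*log(u) - u on u > 0, for which F_i''(u) = 1/u and
   h_i(u) = integrate(sqrt(1/t), t, 0, u) = 2*sqrt(u). */
assume(u > 0);
h(u) := integrate(sqrt(1/t), t, 0, u);
ratsimp(h(u));                                   /* evaluates to 2*sqrt(u) */
rho(t1, t2) := sqrt(sum((h(t1[i]) - h(t2[i]))^2, i, 1, length(t1)));
float(rho([1/4, 1/4], [1/2, 1/2]));              /* approx 0.5858 */

Here the $h$-coordinates $2\sqrt{u}$ foreshadow the two-square-root embedding of Section 3.1.2.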
The length element of the Euclidean metric expressed in the Cartesian coordinate system $\lambda = (x, y)$ is $\mathrm{d}s^2 = \mathrm{d}x^2 + \mathrm{d}y^2$, i.e., $\mathrm{d}s^2 = [\mathrm{d}x\ \mathrm{d}y]^\top G_\lambda(\lambda)\, [\mathrm{d}x\ \mathrm{d}y]$ with $G_\lambda(\lambda) = \mathrm{diag}(1, 1)$. The Euclidean metric tensor can be expressed in any new $\eta$-coordinate system using the following covariant rule:

$$G_\eta(\eta) = \left[\frac{\partial \lambda_i}{\partial \eta_j}\right]_{ij}^\top \cdot G_\lambda(\lambda(\eta)) \cdot \left[\frac{\partial \lambda_i}{\partial \eta_j}\right]_{ij},$$

where $\left[\frac{\partial \lambda_i}{\partial \eta_j}\right]_{ij}$ is the invertible Jacobian matrix of the transformation. We have

$$\left[\frac{\partial \lambda_i}{\partial \eta_j}\right]_{ij} = \frac{\partial (x = r\cos\theta,\ y = r\sin\theta)}{\partial (r, \theta)} = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}.$$
Thus it follows that

$$G_\eta(\eta) = \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix}^\top I \begin{bmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & r^2 \end{bmatrix} = \mathrm{diag}(1, r^2),$$

using the identity $\sin^2\theta + \cos^2\theta = 1$.
Hence, the Euclidean metric expressed in the polar coordinate system is

$$\mathrm{d}s_\eta^2 = [\mathrm{d}r\ \mathrm{d}\theta]^\top \cdot \mathrm{diag}(1, r^2) \cdot [\mathrm{d}r\ \mathrm{d}\theta] = \mathrm{d}r^2 + r^2\, \mathrm{d}\theta^2.$$

The Poincaré metric on the plane is defined in the Cartesian coordinate system $\lambda$ by the metric tensor $[g_P]_\lambda = \frac{1}{y^2} I$. This Poincaré metric cannot be expressed as $\mathrm{Jac}_{h(x,y)}^\top \cdot I \cdot \mathrm{Jac}_{h(x,y)}$ for an invertible coordinate transformation $h(x, y)$. That is, it is not the Euclidean metric in disguise, but the hyperbolic metric.
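The polar-coordinate computation above can be replayed symbolically in MAXIMA (a short sketch; the variable names are ours):

/* Covariant transformation of the Euclidean metric to polar coordinates:
   Jac is the Jacobian of (x, y) = (r*cos(t), r*sin(t)) w.r.t. (r, t). */
Jac: matrix([cos(t), -r*sin(t)], [sin(t), r*cos(t)]);
G: trigsimp(transpose(Jac) . ident(2) . Jac);   /* yields matrix([1, 0], [0, r^2]) */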
2.2 Bregman manifolds: Dually flat spaces

A dually flat space (Shima, 2007; Amari, 2016) (also called a Bregman manifold in Nielsen (2021)) can be built from any strictly convex and smooth function $F(\theta)$ (with open convex domain $\mathrm{dom}(F) = \Theta \neq \emptyset$) of Legendre type (Rockafellar, 1967). The Legendre–Fenchel transformation of $(\Theta, F(\theta))$ yields a dual Legendre-type potential function $(H, F^*(\eta))$ (with open convex domain $\mathrm{dom}(F^*) = H$), where

$$F^*(\eta) := \sup_{\theta \in \Theta} \{\theta^\top \eta - F(\theta)\}.$$
Fig. 1 geometrically interprets the Legendre–Fenchel transform as the negative of the $y$-intercept of the unique tangent line to the graph of $\mathcal{F} = \{(\theta, F(\theta)) : \theta \in \Theta\}$ which has slope $\eta$.
The Legendre–Fenchel transformation on Legendre-type functions is involutive (i.e., $(F^*)^* = F$ by the Fenchel–Moreau theorem) and induces two dual coordinate systems:

$$\eta(\theta) = \nabla_\theta F(\theta),$$

and

$$\theta(\eta) = \nabla_\eta F^*(\eta).$$

Thus the gradients of convex conjugates are inverse functions of each other: $\nabla F^* = (\nabla F)^{-1}$ and $\nabla F = (\nabla F^*)^{-1}$.
The Bregman manifold is equipped with a divergence:

$$B_F(\theta_1 : \theta_2) := F(\theta_1) - F(\theta_2) - (\theta_1 - \theta_2)^\top \nabla F(\theta_2),$$
FIG. 1 Reading geometrically the Legendre–Fenchel transformation as the negative of the $y$-intercept of the unique tangent line to the graph of $F(\theta)$ which has slope $\eta$.
for the Bregman generator $F(\theta)$, called the Bregman divergence (Bregman, 1967), and we have $B_F(\theta_1 : \theta_2) = B_{F^*}(\eta_2 : \eta_1)$. We can also express equivalently the dual Bregman divergences using mixed parameterizations with the Fenchel–Young divergences (Blondel et al., 2020; Nielsen, 2021):

$$Y_F(\theta_1 : \eta_2) := F(\theta_1) + F^*(\eta_2) - \theta_1^\top \eta_2.$$

Thus we can express the divergence using either primal, dual, or mixed coordinate systems as follows:

$$B_F(\theta_1 : \theta_2) = Y_F(\theta_1 : \eta_2) = Y_{F^*}(\eta_2 : \theta_1) = B_{F^*}(\eta_2 : \eta_1).$$
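These identities are easy to check symbolically on a one-dimensional toy generator. The following MAXIMA sketch uses $F(\theta) = \exp(\theta)$ (a choice of ours, whose convex conjugate is $F^*(\eta) = \eta\log\eta - \eta$) and verifies $B_F(\theta_1 : \theta_2) = B_{F^*}(\eta_2 : \eta_1)$:

/* Duality check B_F(t1 : t2) = B_{F*}(e2 : e1) for F(t) = exp(t),
   eta(t) = exp(t), F*(e) = e*log(e) - e, (F*)'(e) = log(e). */
F(t) := exp(t);         gradF(t) := exp(t);
Fs(e) := e*log(e) - e;  gradFs(e) := log(e);
BF(t1, t2) := F(t1) - F(t2) - (t1 - t2)*gradF(t2);
BFs(e1, e2) := Fs(e1) - Fs(e2) - (e1 - e2)*gradFs(e2);
radcan(BF(t1, t2) - BFs(exp(t2), exp(t1)));     /* simplifies to 0 */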
A Riemannian Hessian metric tensor (Shima, 2007) ${}^F g$ can be defined in the $\theta$-coordinate system by

$$[{}^F g]_\theta := \nabla_\theta^2 F(\theta),$$

with dual metric tensor ${}^{F^*} g$ expressed in the $\eta$-coordinate system by

$$[{}^{F^*} g]_\eta := \nabla_\eta^2 F^*(\eta).$$

Let $\theta = (\theta^1, \ldots, \theta^D)$ denote the contravariant coordinates and $\eta = (\eta_1, \ldots, \eta_D)$ its equivalent covariant coordinates. Let $\partial_i := \frac{\partial}{\partial \theta^i}$ define the primal natural basis $E := \{e_i = \partial_i\}$, and let $\partial^i := \frac{\partial}{\partial \eta_i}$ define the dual natural basis $E^* := \{e^{*i} = \partial^i\}$. We have
$${}^F g(e_i, e_j) = \partial_i \partial_j F(\theta), \qquad {}^{F^*} g(e^{*i}, e^{*j}) = \partial^i \partial^j F^*(\eta).$$

The Crouzeix identity (Crouzeix, 1977) holds (i.e., $\nabla_\theta^2 F(\theta)\, \nabla_\eta^2 F^*(\eta) = I$, the identity matrix), meaning that the bases $E$ and $E^*$ are reciprocal (Amari, 2016; Nielsen, 2020): ${}^F g(e_i, e^{*j}) = \delta_i^j$, where $\delta_i^j$ is the Kronecker symbol: $\delta_i^j = 0$ if $j \neq i$, and $\delta_i^j = 1$ if $i = j$.
The Riemannian metric tensor can thus be expressed equivalently (with summation over repeated indices implied) as

$${}^F g = \frac{\partial}{\partial \theta^i} \frac{\partial}{\partial \theta^j} F(\theta)\, \mathrm{d}\theta^i \otimes \mathrm{d}\theta^j = \frac{\partial}{\partial \eta_i} \frac{\partial}{\partial \eta_j} F^*(\eta)\, \mathrm{d}\eta_i \otimes \mathrm{d}\eta_j = \mathrm{d}\theta^i \otimes \mathrm{d}\eta_i,$$

where $\otimes$ denotes the tensor product (Godinho and Natário, 2014).
A Bregman manifold has been called a dually flat space in information geometry (Amari, 2016; Nielsen, 2020) because the dual potential functions $F(\theta)$ and $F^*(\eta)$ induce two affine connections, denoted by ${}^F\nabla$ and ${}^{F^*}\nabla$, which are flat because their corresponding Christoffel symbols ${}^F\Gamma_{ij}^k$ characterizing ${}^F\nabla$ vanish in the $\theta$-coordinate system (i.e., ${}^F\Gamma_{ij}^k(\theta) = 0$, and $\theta(\cdot)$ is called a ${}^F\nabla$-coordinate system) and the Christoffel symbols ${}^{F^*}\Gamma_{ij}^k$ characterizing ${}^{F^*}\nabla$ vanish in the $\eta$-coordinate system (i.e., ${}^{F^*}\Gamma_{ij}^k(\eta) = 0$, and $\eta(\cdot)$ is called a ${}^{F^*}\nabla$-coordinate system). Furthermore, the two (torsion-free) affine connections ${}^F\nabla$ and ${}^{F^*}\nabla$ are dual with respect to the metric tensor ${}^F g$ (Amari, 2016; Nielsen, 2020), so that the mid-connection coincides with the Levi-Civita metric connection:

$$\frac{{}^F\nabla + {}^{F^*}\nabla}{2} = {}^{LC}\nabla,$$

where ${}^{LC}\nabla = {}^g\nabla$ denotes the Levi-Civita connection induced by the Hessian metric ${}^F g$.
Two common examples of dually flat spaces of statistical models are the exponential family manifolds (Amari, 2016; Nielsen, 2020), built from regular exponential families (Barndorff-Nielsen, 2014) by setting the Bregman generators to the cumulant functions of the families, and the mixture family manifolds, induced by the negentropy of a statistical mixture with prescribed linearly independent component distributions (Amari, 2016; Nielsen, 2020). The family of categorical distributions (also called multinoulli distributions) is both an exponential family and a mixture family. It is interesting to notice that the cumulant functions of regular exponential families are always analytic ($C^\omega$, see Barndorff-Nielsen, 2014; i.e., $F(\theta)$ admits a locally converging Taylor series at any $\theta \in \Theta$), but the negentropy of a mixture may not be analytic (e.g., the negentropy of a mixture of two normal distributions (Watanabe, 2004)).
To use the toolbox of geometric algorithms on Bregman manifolds (e.g., Banerjee et al., 2005; Boissonnat et al., 2010), one needs the generators $F$ and $F^*$ and their gradients $\nabla F$ and $\nabla F^*$ in closed form. This may not always be possible (Nielsen and Hadjeres, 2019), either:

• because the generator is not expressible using elementary functions (e.g., the cumulant function of a polynomial exponential family, or the negentropy of a Gaussian mixture (Watanabe, 2004), a definite integral of a log-sum-exp term), or
• because it is computationally intractable (e.g., the cumulant function of a discrete exponential family in Boltzmann machines (Amari, 2016)).
Eguchi (1992) described the following method to build a dual information-geometric structure from a smooth parameter divergence $D(\cdot : \cdot)$ which meets the following requirements:

1. $D(\theta : \theta') \geq 0$ for all $\theta, \theta'$, with equality iff $\theta = \theta'$.
2. $\partial_i D(\theta : \theta')|_{\theta' = \theta} = \partial'_j D(\theta : \theta')|_{\theta' = \theta} = 0$ for all $i, j$, where $\partial_l := \frac{\partial}{\partial \theta^l}$ and $\partial'_l := \frac{\partial}{\partial \theta'^l}$.
3. $-\left[\partial_i \partial'_j D(\theta : \theta')|_{\theta' = \theta}\right]_{ij}$ is a positive-definite matrix.

The construction, called divergence information geometry (Amari and Cichocki, 2010), proceeds as follows (see the sketch below):

$$g_{ij}(\theta) = -\partial_i \partial'_j D(\theta : \theta')|_{\theta' = \theta},$$
$$\Gamma_{ij,k}(\theta) = -\partial_i \partial_j \partial'_k D(\theta : \theta')|_{\theta' = \theta},$$
$$\Gamma^*_{ij,k}(\theta) = -\partial_k \partial'_i \partial'_j D(\theta : \theta')|_{\theta' = \theta}.$$

It can be shown that the connections $\nabla$ and $\nabla^*$ induced, respectively, by $\Gamma_{ij,k}$ and $\Gamma^*_{ij,k}$ are torsion-free and dual.
In practice, many explicit dually flat space constructions have been reported for exponential families (e.g., Zhang et al., 2007; Zhong et al., 2008; Malagò and Pistone, 2015), but to the best of the author's knowledge none so far for continuous mixture families with component distributions sharing the real-line support.
We report a first exception: the explicit construction of a dually flat space for the family of statistical mixtures of two prescribed and distinct Cauchy distributions. That is, we report in closed form the Bregman generator $F(\theta)$, the dual parameter $\eta = F'(\theta)$, the dual Bregman generator $F^*(\eta)$ and its derivative $(F^*)'(\eta) = \theta$, and the dual Bregman divergences $B_F(\theta_1 : \theta_2) = B_{F^*}(\eta_2 : \eta_1)$, which amount to the Kullback–Leibler divergence between the corresponding Cauchy mixtures. We check these (large) formulas using symbolic calculations and, to fix ideas, instantiate them for the special case of a mixture
family which is obtained as the convex combination of the standard Cauchy density (location parameter 0 and scale parameter 1) with the Cauchy density of location parameter 1 and scale parameter 1.
3 Some illustrating examples

3.1 Exponential family manifolds

3.1.1 Natural exponential family

A natural exponential family (Barndorff-Nielsen, 2014) $\mathcal{E} = \{P_\theta\}$ in a probability space $(\mathcal{X}, \Sigma, \mu)$ is a set of parametric probability measures $P_\theta$, all dominated by $\mu$ (on support $\mathcal{X}$), with Radon–Nikodym densities $p_\theta = \frac{\mathrm{d}P_\theta}{\mathrm{d}\mu}$ which can be expressed canonically as

$$p_\theta(x) = \exp\left(\sum_{i=1}^D \theta_i x_i - F(\theta)\right),$$

where $F(\theta) = \log \int \exp(\theta^\top x)\, \mathrm{d}\mu(x)$ is called the cumulant function. It can be shown that $Z(\theta) = \exp(F(\theta))$ is logarithmically strictly convex (Barndorff-Nielsen, 2014), and thus $F(\theta)$ is strictly convex. Moreover, $F(\theta)$ is real analytic on the natural parameter space $\Theta = \{\theta : F(\theta) < \infty\}$. Thus on the domain $\Theta \subseteq \mathbb{R}^D$, $F(\theta)$ induces a Bregman manifold and a dually flat space structure. The family of categorical distributions forms a discrete exponential family (Amari, 2016).
More generally, the concept of an exponential family can be generalized to nonstatistical contexts (Naudts and Anthonis, 2012).
3.1.2 Fisher–Rao manifold of the categorical distributions

Consider the family of categorical distributions $\mathcal{P} = \{p_\theta : \theta \in \Theta\}$ on the probability space $(\Omega_d, 2^{\Omega_d}, \mu_\#)$, where $\mu_\#$ denotes the counting measure on the sample space $\Omega_d = \{\omega_1, \ldots, \omega_d\}$. When $d = 2$, the categorical distributions are called the Bernoulli distributions, and when $d > 2$ they are sometimes termed the multinoulli distributions.
The density of a random variable $X \sim \mathrm{Cat}(q_1, \ldots, q_d)$ following a categorical distribution $p_\theta(x)$ is

$$p_\theta(x) = \prod_{i=1}^d q_i^{x_i}, \qquad \forall i \in \{1, \ldots, d\},\ x_i \in \{0, 1\},\ \sum_{i=1}^d x_i = 1,$$

where $q_i = \Pr(X = \omega_i)$. The parameter space $\Theta$ is the $(d-1)$-dimensional open standard simplex $\Delta_d^\circ$.
The Fisher information matrix (FIM) of the categorical distributions is
$$I_\theta(\theta) = E_{p_\theta}\left[\nabla \log p_\theta(x)\, (\nabla \log p_\theta(x))^\top\right] = \left[E_{p_\theta}\left(\frac{\partial}{\partial \theta_i}\log p_\theta(x)\, \frac{\partial}{\partial \theta_j}\log p_\theta(x)\right)\right]_{ij} = \left[E_{p_\theta}\left(\frac{x_i x_j}{\theta_i \theta_j}\right)\right]_{ij}.$$

• When $i = j$, we have

$$[I_\theta(\theta)]_{ii} = \frac{1}{\theta_i^2} E_{p_\theta}[x_i^2] = \frac{1}{\theta_i},$$

since $E_{p_\theta}[x_i^2] = 1^2 \cdot q_i = q_i = \theta_i$ (recall that $x_i \in \{0, 1\}$).

• When $i \neq j$, we have

$$[I_\theta(\theta)]_{ij} = \frac{1}{\theta_i \theta_j} E_{p_\theta}[x_i x_j] = 0,$$

since $x_i x_j = 0$ for $i \neq j$ (because we have $\forall i \in \{1, \ldots, d\}$, $x_i \in \{0, 1\}$, and $\sum_{i=1}^d x_i = 1$).

Thus it follows that the Fisher information matrix of the categorical distributions is the diagonal matrix:

$$I_\theta(\theta) = \mathrm{diag}\left(\frac{1}{\theta_1}, \ldots, \frac{1}{\theta_d}\right).$$
For any smooth invertible mapping $\eta(\theta)$ with invertible Jacobian matrix $\left[\frac{\partial \theta_i}{\partial \eta_j}\right]_{ij}$, we have the following covariant rule of the FIM:

$$I_\eta(\eta) = \left[\frac{\partial \theta_i}{\partial \eta_j}\right]_{ij}^\top \cdot I_\theta(\theta(\eta)) \cdot \left[\frac{\partial \theta_i}{\partial \eta_j}\right]_{ij}.$$

Thus by making the smooth change of variable $\theta \mapsto (\eta_1(\theta), \ldots, \eta_d(\theta))$ with $\eta_i(\theta) = 2\sqrt{\theta_i}$ (so that $\theta_i(\eta) = \frac{\eta_i^2}{4}$), we get the following Jacobian matrix:

$$\left[\frac{\partial \theta_i}{\partial \eta_j}\right]_{ij} = \mathrm{diag}\left(\frac{1}{2}\eta_1(\theta), \ldots, \frac{1}{2}\eta_d(\theta)\right) = \mathrm{diag}(\sqrt{\theta_1}, \ldots, \sqrt{\theta_d}).$$

Therefore, the FIM transforms into the identity matrix under the $\eta$-parameterization:

$$I_\eta(\eta) = \mathrm{diag}(\sqrt{\theta_1}, \ldots, \sqrt{\theta_d}) \cdot \mathrm{diag}\left(\frac{1}{\theta_1}, \ldots, \frac{1}{\theta_d}\right) \cdot \mathrm{diag}(\sqrt{\theta_1}, \ldots, \sqrt{\theta_d}) = I,$$

where $I$ denotes the identity matrix.
The transformed parameter space $H = \{\eta(\theta) : \theta \in \Theta\}$ is the positive orthant of the sphere of radius $r = 2$, since $\|\eta\|_2 = 2\sqrt{\sum_{i=1}^d \theta_i} = 2$ (see Fig. 2). That is, we have performed an isometric embedding of the Fisher–Rao manifold of the categorical distributions with $d$ atoms into the Euclidean space $\mathbb{R}^d$. Therefore, the Rao distance on the Fisher–Rao manifold of categorical distributions can be computed as the geodesic distance on $H$. The geodesic distance between $\eta = \eta(p_\theta)$ and $\eta' = \eta(p_{\theta'})$ on $H$ is

$$\rho_H(\eta, \eta') = 2 \angle_{0\eta\eta'},$$

with

$$\angle_{0\eta\eta'} = \arccos\left(\frac{\eta^\top \eta'}{\|\eta\|_2\, \|\eta'\|_2}\right) = \arccos\left(\frac{1}{4}\sum_{i=1}^d \eta_i \eta'_i\right).$$

It follows that the Rao distance between two categorical distributions is

$$\rho_{\mathcal{P}}(p_\theta, p_{\theta'}) = 2 \arccos\left(\sum_{i=1}^d \sqrt{\theta_i}\sqrt{\theta'_i}\right).$$

The term $\sum_{i=1}^d \sqrt{\theta_i}\sqrt{\theta'_i}$ is called the Bhattacharyya coefficient.
It follows from the curvature $\kappa = \frac{1}{r^2}$ of a sphere of radius $r$ in $\mathbb{R}^d$ that the categorical Fisher–Rao manifold (nonembedded manifold) has curvature $\kappa = \frac{1}{r^2} = \frac{1}{4}$.
Now, relax the constraint of normalized probabilities $p_\theta$ and consider positive measures $p_\theta^+$ (with $\Theta^+ = \mathbb{R}_{++}^d$) while keeping the two-square-root embedding. The extended Rao distance to $\mathbb{R}_{++}^d$ becomes:

FIG. 2 Two-square-root embedding of the Bernoulli family onto the positive orthant of the sphere of radius $r = 2$.
$$\rho(p_{\theta_1}^+, p_{\theta_2}^+) = \sqrt{\sum_{i=1}^d \left(2\sqrt{\theta_1^i} - 2\sqrt{\theta_2^i}\right)^2} = 2\sqrt{\sum_{i=1}^d \left(\sqrt{\theta_1^i} - \sqrt{\theta_2^i}\right)^2} = 2\,\rho_{\mathrm{Hellinger}}(\theta_1, \theta_2),$$

where

$$\rho_{\mathrm{Hellinger}}(\theta_1, \theta_2) = \sqrt{\sum_{i=1}^d \left(\sqrt{\theta_1^i} - \sqrt{\theta_2^i}\right)^2} = \left\|\sqrt{\theta_1} - \sqrt{\theta_2}\right\|_2$$

denotes the Hellinger distance, a metric distance. The squared Hellinger distance is called the Hellinger divergence and belongs to the class of $f$-divergences for the generator $f_{\mathrm{Hellinger}}(u) = (\sqrt{u} - 1)^2$.
Furthermore, it follows from the embedding of the normalized probabilities on the positive orthant of the sphere that

$$\rho(p_{\theta_1}, p_{\theta_2}) \geq 2\,\rho_{\mathrm{Hellinger}}(\theta_1, \theta_2),$$

with equality if and only if $\theta_1 = \theta_2$.
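This inequality (a geodesic arc on the sphere versus its chord) can be observed numerically; continuing the MAXIMA sketch above with the same illustrative vectors:

/* Rao distance vs. twice the Hellinger distance (arc vs. chord). */
hellinger(p, q) := sqrt(sum((sqrt(p[i]) - sqrt(q[i]))^2, i, 1, length(p)));
float([rao([1/2, 1/4, 1/4], [1/4, 1/4, 1/2]),
       2*hellinger([1/2, 1/4, 1/4], [1/4, 1/4, 1/2])]);   /* approx [0.587, 0.586] */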
3.2 Regular cone manifolds

A cone $K \subseteq \mathbb{R}^D$ is a subset such that

$$\forall \lambda > 0,\ \forall x \in K,\quad \lambda x \in K.$$

A cone is said to be pointed if $K \cap (-K) = \{0\}$. We consider regular cones, which are (i) convex and (ii) pointed (i.e., contain no line). Fig. 3 displays an example of a nonregular cone (left: nonconvex and not pointed) and an example of a regular cone (right).

FIG. 3 Two examples of cones of $\mathbb{R}^2$: a nonregular cone (left, nonconvex and not pointed) and a regular cone (right).
For a cone $K$, we can associate a dual regular cone $K^*$ defined by

$$K^* = \bigcap_{x \in K} \{y \in \mathbb{R}^D : x^\top y \geq 0\}.$$

When $K$ is regular, we have $(K^*)^* = K$. A cone is self-dual when $K^* = K$.
We can associate to a cone $K$ a characteristic function $\chi_K(x)$ such that for any $x \in K$, we have

$$\chi_K(x) = \int_{K^*} \exp(-x^\top y)\, \mathrm{d}y.$$

Observe the similarity with the partition function of a natural exponential family when the cone is self-dual. It can be shown that the characteristic function $\chi_K$ is strictly logarithmically convex. Let $\mathrm{Aut}(K)$ denote the automorphism group of $K$, i.e., the subgroup of the general linear group $\mathrm{GL}(d, \mathbb{R})$ such that $A \in \mathrm{Aut}(K) \Leftrightarrow A(K) = K$, with $A(K) = \{Ax : x \in K\}$. The automorphism group can be shown to be a Lie group (Güler, 1996). A regular cone is said to be homogeneous if its automorphism group is transitive: that is, for all $x, y \in K$, there exists $A \in \mathrm{Aut}(K)$ such that $Ax = y$.
It can be shown that

$$\chi_K(Ax) = \frac{\chi_K(x)}{|\det(A)|}, \qquad (2)$$

for any $A \in \mathrm{Aut}(K)$.
Since $\chi_K$ is strictly logarithmically convex, let us consider the function

$$F_K(x) = \log \chi_K(x),$$

which is strictly convex. Moreover, this function is analytic for homogeneous cones. We can therefore associate a dually flat space structure to cones (Shima, 2007) (Chapter 4) using $(K, F_K)$. The induced Riemannian metric $\nabla^2 F(x)$ is invariant under the group automorphisms.
Consider a prescribed point $e \in K^\circ$, the interior of $K$. For any $x \in K^\circ$, let $A_x \in \mathrm{Aut}(K)$ be such that $A_x e = x$. Then using Eq. (2), we have

$$F_K(x) = \log \chi_K(e) - \log(|\det(A_x)|).$$

Since the Bregman generators are defined up to an affine term, we have

$$F_K(x) \equiv -\log(|\det(A_x)|).$$

Furthermore, for a homogeneous regular cone, we have (Güler, 1996) (Theorem 4.4):

$$F_K(x) \equiv \frac{1}{2} \log \det(\nabla^2 F_K(x)).$$

For example, consider the positive orthant cone $K = \mathbb{R}_{++}^d$. Then we have $F_K(x) = -\sum_{i=1}^d \log x_i$ and $\nabla^2 F_K(x) = \mathrm{diag}\left(\frac{1}{x_1^2}, \ldots, \frac{1}{x_d^2}\right)$, with $\det(\nabla^2 F_K) = \prod_{i=1}^d \frac{1}{x_i^2}$, so that $\frac{1}{2}\log\det(\nabla^2 F_K(x)) = -\sum_{i=1}^d \log x_i$.
Consider the cone of symmetric positive-definite matrices of dimension $d \times d$ (SPD cone). It is a self-dual cone, and the logarithm of its characteristic function is $F_{\mathrm{SPD}}(P) = -\frac{d+1}{2}\log\det(P)$. Notice that the cumulant function of the zero-centered multivariate normal distributions $N(0, \Sigma)$ is

$$F_N(\Sigma) = -\frac{1}{2}\log\det(\Sigma).$$

Thus the functions $F_{\mathrm{SPD}}$ and $F_N$ differ by a multiplicative factor $d + 1$.
Güler (1996) investigated a generic way to build universal barrier functions (Lee and Yue, 2021) on cones for interior point methods: he proved that the $F_K(x)$ of homogeneous cones are self-concordant barrier functions (Theorem 4.3 of Güler, 1996). We recommend the monograph of Faraut and Korányi (1994) for analysis on symmetric cones, which are open convex self-dual homogeneous cones in Euclidean space.
3.3 Mixture family manifolds

3.3.1 Definition

A mixture family (Amari, 2016) $\mathcal{M}$ of order $D$ is defined by $D + 1$ linearly independent functions $p_0(x), p_1(x), \ldots, p_D(x)$ as

$$\mathcal{M} = \left\{m_\theta(x) = \left(1 - \sum_{i=1}^D \theta_i\right) p_0(x) + \sum_{i=1}^D \theta_i p_i(x)\ :\ \theta \in \Delta_D^\circ\right\},$$

where $\Delta_D^\circ$ denotes the $D$-dimensional open standard simplex.
Consider a probability space $(\mathcal{X}, \Sigma_{\mathcal{X}}, \mu)$, where $\mathcal{X}$ denotes the sample space, $\Sigma_{\mathcal{X}}$ a $\sigma$-algebra, and $\mu$ a positive measure. The set of statistical mixtures with $D + 1$ prescribed linearly independent components forms a mixture family. Furthermore, it can be shown that the Shannon negentropy

$$F(\theta) = \int_{\mathcal{X}} m_\theta(x) \log m_\theta(x)\, \mathrm{d}\mu(x)$$

is a strictly convex and smooth function (Nielsen and Hadjeres, 2019): a Bregman generator. Next, we describe in Section 3.3.2 the discrete mixture family of categorical distributions and point out the difficulty of getting the negentropy in closed form in general. Section 4 will report an interesting example of an analytic continuous mixture family of order 1.
3.3.2 The categorical distributions: A discrete mixture family

Consider the family of categorical distributions as a mixture family

$$\mathcal{M} = \left\{m_\theta(x) = \left(1 - \sum_{i=1}^D \theta_i\right)\delta_{x_0}(x) + \sum_{i=1}^D \theta_i\, \delta_{x_i}(x)\ :\ \theta \in \Delta_D^\circ\right\},$$

where $\delta_{x_i}(x) = \delta(x - x_i) = 1$ iff $x = x_i$ and $0$ when $x \neq x_i$. The functions $\delta_{x_i}$ are called Dirac distributions and are linearly independent provided that $x_i \neq x_j$ for any $i \neq j$. The Shannon negentropy is

$$F(\theta) = \sum_{x \in \{x_0, x_1, \ldots, x_D\}} m_\theta(x) \log m_\theta(x) = \sum_{i=0}^D \theta_i \log \theta_i,$$

with the convention $\theta_0 := 1 - \sum_{i=1}^D \theta_i$.
The discrete mixture family of categorical distributions can be extended to continuous mixtures with mixture components having pairwise disjoint supports $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for all $i \neq j$. We have

$$F(\theta) = \int_{\mathcal{X}} m_\theta(x) \log m_\theta(x)\, \mathrm{d}\mu(x)$$
$$= \sum_{i=1}^D \int_{\mathcal{X}_i} m_\theta(x) \log m_\theta(x)\, \mathrm{d}\mu(x) + \int_{\mathcal{X}_0} m_\theta(x) \log m_\theta(x)\, \mathrm{d}\mu(x)$$
$$= \sum_{i=1}^D \int_{\mathcal{X}_i} (\theta_i p_i(x)) \log(\theta_i p_i(x))\, \mathrm{d}\mu(x) + \theta_0 \int_{\mathcal{X}_0} p_0(x) \log(\theta_0 p_0(x))\, \mathrm{d}\mu(x)$$
$$= \sum_{i=1}^D \theta_i (\log \theta_i + I_i) + \theta_0 (\log \theta_0 + I_0),$$

where we used that $m_\theta(x) = \theta_i p_i(x)$ on $\mathcal{X}_i$ and $m_\theta(x) = \theta_0 p_0(x)$ on $\mathcal{X}_0$, and where $I_j = \int_{\mathcal{X}_j} p_j(x) \log p_j(x)\, \mathrm{d}\mu(x)$ is the Shannon negentropy of component $p_j$. Thus we have

$$F(\theta) = \sum_{i=0}^D \theta_i \log \theta_i + I_0 + \sum_{i=1}^D \theta_i (I_i - I_0).$$

Since Bregman generators are equivalent up to affine terms, it follows that

$$F(\theta) \equiv \sum_{i=0}^D \theta_i \log \theta_i.$$

When $p_j(x) = \delta_{x_j}(x)$ with $\mathcal{X}_j = \{x_j\}$, we recover the discrete mixture family of categorical distributions.
In general, when the mixture components share the same support $\mathcal{X}$, it is difficult to obtain a closed-form formula for the negentropy $F(\theta) = \int_{\mathcal{X}} m_\theta(x) \log m_\theta(x)\, \mathrm{d}\mu(x)$ because of the integral of the nonseparable log-sum term. In symbolic computing, the celebrated Risch algorithm (Risch, 1969) allows one either to calculate integrals in closed form or to report that no such closed-form formula exists using elementary functions. However, the Risch method is only a semi-algorithm, since it requires implementing an oracle to check whether some mathematical expressions are equivalent to zero or not.
4 Information geometry of the mixture family of two distinct Cauchy distributions

4.1 Cauchy mixture family of order 1

The probability density function of a Cauchy distribution with location parameter $l$ and scale parameter $s > 0$ is

$$p_{l,s}(x) := \frac{1}{\pi s \left(1 + \left(\frac{x-l}{s}\right)^2\right)} = \frac{s}{\pi\left(s^2 + (x-l)^2\right)}.$$
The family of Cauchy distributions forms a location-scale family

$$\mathcal{C} = \left\{p_{l,s}(x) := \frac{1}{s}\, p\!\left(\frac{x-l}{s}\right)\ :\ (l, s) \in \mathbb{R} \times \mathbb{R}_{++}\right\},$$

with standard Cauchy distribution

$$p_{0,1}(x) := \frac{1}{\pi(1 + x^2)}.$$

Although the standard Cauchy probability density function is symmetric (even pdf, i.e., $p_{0,1}(-x) = p_{0,1}(x)$), the mean and variance do not exist because both the integrals $\int x\, p_{0,1}(x)\, \mathrm{d}x$ and $\int x^2 p_{0,1}(x)\, \mathrm{d}x$ diverge. The Cauchy probability density functions are unimodal with modes at the location parameters $l$.
Furthermore, let $C_1, \ldots, C_n$ be $n$ independent Cauchy random variables with the same distribution parameters (i.e., $C_i \sim \mathrm{Cauchy}(l, s)$). Then $S_n = \frac{1}{n}\sum_i C_i \sim \mathrm{Cauchy}(l, s)$. Indeed, if $C_1 \sim \mathrm{Cauchy}(l_1, s_1)$ and $C_2 \sim \mathrm{Cauchy}(l_2, s_2)$, then $C_1 + C_2 \sim \mathrm{Cauchy}(l_1 + l_2, s_1 + s_2)$. Moreover, if $C \sim \mathrm{Cauchy}(l, s)$, then $\lambda C \sim \mathrm{Cauchy}(\lambda l, s|\lambda|)$ for any $\lambda \in \mathbb{R}$. Notice that the central limit theorem (CLT) does not apply to Cauchy distributions.
Consider the mixture family (Amari, 2016; Nielsen, 2020) induced by two distinct Cauchy distributions $p_{l_0,s_0}$ and $p_{l_1,s_1}$:

$$\mathcal{M} := \{m_\theta(x) := (1-\theta)\, p_{l_0,s_0}(x) + \theta\, p_{l_1,s_1}(x)\ :\ \theta \in (0, 1)\}.$$

Fig. 4 displays an example of a Cauchy mixture of two components with $(l_0, s_0) = (-1, 1)$ and $(l_1, s_1) = (1, 2)$. As plotted in the figure, we can visualize that a mixture of two Cauchy densities is not a Cauchy density, because the mixture can have two modes. See Došlá (2009) (Proposition 1) for the conditions of unimodality/bimodality of a two-component Cauchy mixture. Because the mixture $m_\theta$ is a convex combination of two prescribed Cauchy components, these statistical mixtures have also been called $w$-mixtures (for "weight mixtures") in Nielsen and Nock (2018).
The Kullback–Leibler divergence between two continuous probability densities $p(x)$ and $q(x)$ is

$$D_{\mathrm{KL}}[p : q] = \int p(x) \log\frac{p(x)}{q(x)}\, \mathrm{d}x = h^\times[p : q] - h[p],$$

where $h^\times[p : q] = -\int p(x) \log q(x)\, \mathrm{d}x$ is the cross-entropy, and $h[p] = h^\times[p : p] = -\int p(x) \log p(x)\, \mathrm{d}x$ is the differential entropy.

FIG. 4 An example of a Cauchy mixture $\{m_\theta(x) : \theta \in (0,1)\}$ of two components with $(l_0, s_0) = (-1, 1)$ and $(l_1, s_1) = (1, 2)$: from top left to bottom right, $m_\theta(x)$ for $\theta \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$. Notice that $m_0(x)$ and $m_1(x)$ do not belong to the mixture family since $\theta \in (0, 1)$.
In Nielsen and Okamura (2021), the following closed-form formula (proven using complex analysis in $\mathbb{C}$) was reported for the Kullback–Leibler divergence between a Cauchy density and a mixture of two Cauchy densities:

$$D_{\mathrm{KL}}[p_{l_0,s_0} : m_\theta] = \log\left(\frac{(l_0 - l_1)^2 + (s_0 + s_1)^2}{(1-\theta)\left(s_0^2 + s_1^2 + (l_0 - l_1)^2\right) + 2\theta s_0 s_1 + 2\sqrt{s_0^2 s_1^2 + s_0 s_1 \left((s_0 - s_1)^2 + (l_0 - l_1)^2\right)\theta(1-\theta)}}\right). \qquad (3)$$
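Formula (3) can be sanity-checked against direct numerical quadrature; the following MAXIMA sketch (with arbitrary illustrative parameter values of ours) compares the closed form with a numerical integration of the KL integrand:

/* Numerical check of Eq. (3) with (l0,s0) = (0,1), (l1,s1) = (1,2), theta = 0.3. */
cauchy(x, l, s) := s/(%pi*(s^2 + (x - l)^2));
KLCauchyMix(l0, s0, l1, s1, th) :=
  log(((l0 - l1)^2 + (s0 + s1)^2)
      /((1 - th)*(s0^2 + s1^2 + (l0 - l1)^2) + 2*th*s0*s1
        + 2*sqrt(s0^2*s1^2 + s0*s1*((s0 - s1)^2 + (l0 - l1)^2)*th*(1 - th))));
m(x, th) := (1 - th)*cauchy(x, 0, 1) + th*cauchy(x, 1, 2);
[first(quad_qagi(cauchy(x, 0, 1)*log(cauchy(x, 0, 1)/m(x, 0.3)), x, minf, inf)),
 float(KLCauchyMix(0, 1, 1, 2, 0.3))];    /* both approx 0.0202 */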
Furthermore, the following closed-form formula was reported for the Jensen–Shannon divergence:

$$D_{\mathrm{JS}}(p_{l_0,s_0} : p_{l_1,s_1}) = h\left[m_{\frac{1}{2}}\right] - \frac{1}{2}\left(h[p_{l_0,s_0}] + h[p_{l_1,s_1}]\right) = \log\left(\frac{2\sqrt{(l_0 - l_1)^2 + (s_0 + s_1)^2}}{\sqrt{(l_0 - l_1)^2 + (s_0 + s_1)^2} + 2\sqrt{s_0 s_1}}\right).$$

Define the skewed $\theta$-Jensen–Shannon divergence (Lin, 1991) for $\theta \in (0, 1)$:

$$D_{\mathrm{JS},\theta}(p_{l_0,s_0} : p_{l_1,s_1}) := (1-\theta)\, D_{\mathrm{KL}}(p_{l_0,s_0} : m_\theta) + \theta\, D_{\mathrm{KL}}(p_{l_1,s_1} : m_\theta) \qquad (4)$$
$$= h[(1-\theta)p_{l_0,s_0} + \theta p_{l_1,s_1}] - \left((1-\theta)h[p_{l_0,s_0}] + \theta h[p_{l_1,s_1}]\right). \qquad (5)$$
Since $D_{\mathrm{KL}}(p_{l_1,s_1} : m_\theta) = D_{\mathrm{KL}}(p_{l_1,s_1} : m'_{1-\theta})$ with $m'_\theta(x) := (1-\theta)p_{l_1,s_1}(x) + \theta p_{l_0,s_0}(x)$, we can use Eq. (3) to calculate $D_{\mathrm{KL}}(p_{l_1,s_1} : m_\theta)$. Using the fact that the Shannon differential entropy of the Cauchy density $p_{l,s}$ is $h[p_{l,s}] = \log(4\pi s)$ (Chyzak and Nielsen, 2019), we thus get the Shannon differential entropy of a two-component Cauchy mixture in closed form:

Proposition 1 (Entropy of a two-Cauchy mixture). The differential entropy of a mixture of two Cauchy distributions is available in closed form:

$$h[m_\theta] = h[(1-\theta)p_{l_0,s_0} + \theta p_{l_1,s_1}] = D_{\mathrm{JS},\theta}(p_{l_0,s_0} : p_{l_1,s_1}) + \left((1-\theta)h[p_{l_0,s_0}] + \theta h[p_{l_1,s_1}]\right). \qquad (6)$$
Since the differential entropy of a location-scale density can be expressed as $h[p_{l,s}] = \log s + h[p_{0,1}]$ (by a change of variable in the definite integral of the Shannon entropy), we have

$$h[m_\theta] = D_{\mathrm{JS},\theta}(p_{l_0,s_0} : p_{l_1,s_1}) + \theta \log\frac{s_1}{s_0} + \log(4\pi s_0), \qquad (7)$$

since $h[p_{0,1}] = \log 4\pi$ (Chyzak and Nielsen, 2019).
Without loss of generality, by considering the action of the location-scale group, we may assume $(l_0, s_0) = (0, 1)$ and $(l_1, s_1) = (l, s)$, since we have:

$$h[(1-\theta)p_{l_0,s_0} + \theta p_{l_1,s_1}] = h\left[(1-\theta)p_{0,1} + \theta p_{\frac{l_1-l_0}{s_0}, \frac{s_1}{s_0}}\right] + \log s_0. \qquad (8)$$
Using the symbolic computations detailed in the Appendix, we get a closed-form formula for Eq. (7):

$$h[m_\theta] = \theta \log\left(\frac{(s+1)^2 + l^2}{2\sqrt{\left((s-1)^2 + l^2\right) s (1-\theta)\theta + s^2} + (s^2 + l^2 + 1)\theta + 2s(1-\theta)}\right)$$
$$+ (1-\theta)\log\left(\frac{(s+1)^2 + l^2}{2\sqrt{\left((1-s)^2 + l^2\right) s (1-\theta)\theta + s^2} + 2s\theta + (s^2 + l^2 + 1)(1-\theta)}\right) \qquad (9)$$
$$+ \theta \log s + \log(4\pi).$$
The general formula, calculated using the symbolic computations reported in the Appendix, for the differential entropy of the mixture of two Cauchy distributions is

$$h[m_\theta] = \theta \log\left(\frac{(s_1+s_0)^2 + (l_1-l_0)^2}{2\sqrt{s_0 s_1\left((s_1-s_0)^2 + (l_1-l_0)^2\right)(1-\theta)\theta + s_0^2 s_1^2} + \left(s_1^2 + s_0^2 + (l_1-l_0)^2\right)\theta + 2 s_0 s_1 (1-\theta)}\right)$$
$$+ (1-\theta)\log\left(\frac{(s_1+s_0)^2 + (l_0-l_1)^2}{2\sqrt{s_0 s_1\left((s_0-s_1)^2 + (l_0-l_1)^2\right)(1-\theta)\theta + s_0^2 s_1^2} + 2 s_0 s_1 \theta + \left(s_1^2 + s_0^2 + (l_0-l_1)^2\right)(1-\theta)}\right) \qquad (10)$$
$$+ \theta \log\frac{s_1}{s_0} + \log(4\pi s_0).$$
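Eq. (10) can likewise be checked against a direct numerical integration of $-\int m_\theta \log m_\theta$; a MAXIMA sketch (reusing the Appendix-style functions, with arbitrary parameter values of ours):

/* Entropy of a two-Cauchy mixture: closed form (Eqs. 6-7, 10) vs. quadrature. */
cauchy(x, l, s) := s/(%pi*(s^2 + (x - l)^2));
m(x, th) := (1 - th)*cauchy(x, 0, 1) + th*cauchy(x, 1, 2);
KLCauchy(l0, s0, l1, s1, th) := log(((l0 - l1)^2 + (s0 + s1)^2)
  /((1 - th)*(s0^2 + s1^2 + (l0 - l1)^2) + 2*th*s0*s1
    + 2*sqrt(s0^2*s1^2 + s0*s1*((s0 - s1)^2 + (l0 - l1)^2)*th*(1 - th))));
JSCauchy(l0, s0, l1, s1, th) := (1 - th)*KLCauchy(l0, s0, l1, s1, th)
  + th*KLCauchy(l1, s1, l0, s0, 1 - th);
hmix(l0, s0, l1, s1, th) := JSCauchy(l0, s0, l1, s1, th)
  + (1 - th)*log(4*%pi*s0) + th*log(4*%pi*s1);
[first(quad_qagi(-m(x, 0.4)*log(m(x, 0.4)), x, minf, inf)),
 float(hmix(0, 1, 1, 2, 0.4))];    /* both approx 2.860 */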
Consider the negentropy as a univariate Bregman generator (Fig. 5):

$$F(\theta) := -h[m_\theta].$$

It can be proven that $F(\theta)$ is a strictly convex function (Nielsen and Hadjeres, 2019).
The skewed Jensen–Shannon divergence is interpreted as a skewed Jensen divergence for the smooth and strictly convex generator $F(\theta) := -h[m_\theta]$:

$$D_{\mathrm{JS},\theta}(p_{l_0,s_0} : p_{l_1,s_1}) = h[(1-\theta)p_{l_0,s_0} + \theta p_{l_1,s_1}] - \left((1-\theta)h[p_{l_0,s_0}] + \theta h[p_{l_1,s_1}]\right)$$
$$= (1-\theta)F(0) + \theta F(1) - F(\theta)$$
$$= J_{F,\theta}(0 : 1),$$

where the $\alpha$-skewed Jensen divergence for a strictly convex generator $F$ is defined by

$$J_{F,\alpha}(\theta_1 : \theta_2) := (1-\alpha)F(\theta_1) + \alpha F(\theta_2) - F((1-\alpha)\theta_1 + \alpha\theta_2).$$
The skewed Jensen divergence is also called the Burbea–Rao divergence (Nielsen and Boltz, 2011; Amari, 2016). The Jensen divergence between two parameters can be extended to the Jensen diversity between a set of $n \geq 2$ weighted parameters $\theta_1, \ldots, \theta_n$ with corresponding weights $w_1, \ldots, w_n$ (such that $w_i > 0$ and $\sum_i w_i = 1$) as follows:

$$J_F(\theta_1, \ldots, \theta_n; w_1, \ldots, w_n) = \sum_{i=1}^n w_i F(\theta_i) - F\left(\sum_{i=1}^n w_i \theta_i\right).$$

Clearly, we have $J_{F,\alpha}(\theta_1 : \theta_2) = J_F(\theta_1, \theta_2; 1-\alpha, \alpha)$. See Fig. 6. The fact that Jensen diversities are proper divergences (i.e., $J_F(\theta_1, \ldots, \theta_n; w_1, \ldots, w_n) \geq 0$ with equality iff $\theta_1 = \cdots = \theta_n$) stems from Jensen's inequality. A geometric proof of the discrete Jensen's inequality is given in Fig. 7.
FIG. 5 Plots of the Cauchy mixture entropies $h[(1-\theta)p_{0,1} + \theta p_{5,1}]$ (left) and $h[(1-\theta)p_{0,1} + \theta p_{5,3}]$ (right). The mixture entropies are concave with respect to the mixing parameter $\theta$.
FIG. 6 Illustration of the Jensen divergence between two parameters (left, $D = 2$) and its generalization to a set of $D$ parameters, called the Jensen diversity (right, $D = 4$), as the vertical gap $E[F(\theta)] - F(E[\theta]) \geq 0$ (shown in green).
FIG. 7 Visual proof of Jensen's inequality: the center of mass $\left(\bar{\theta}, \frac{1}{D}\sum_i F(\theta_i)\right)$ of the points $(\theta_1, F(\theta_1)), \ldots, (\theta_D, F(\theta_D))$ is necessarily contained in their convex hull CH. Since the convex hull CH is included in the epigraph of the function $F(\theta)$, we have $\frac{1}{D}\sum_i F(\theta_i) \geq F\left(\frac{1}{D}\sum_i \theta_i\right)$.
Since the Bregman generator is available in closed form, we can also calculate the dual coordinate (Nielsen, 2020) by differentiation:

$$\eta(\theta) := F'(\theta).$$
We get

$$\eta(\theta) = \log\left(\frac{A + \theta(\Delta_s^2 + \Delta_l^2) + 2 s_0 s_1}{A - \theta(\Delta_s^2 + \Delta_l^2) + s_0^2 + s_1^2 + \Delta_l^2}\right) + \log\frac{s_0}{s_1}, \qquad (11)$$

where

$$A = 2\sqrt{s_0 s_1}\sqrt{s_0 s_1 + \theta(1-\theta)(\Delta_s^2 + \Delta_l^2)},$$

with $\Delta_s^2 = (s_1 - s_0)^2$ and $\Delta_l^2 = (l_1 - l_0)^2$.


It was shown in Nielsen and Hadjeres (2019) that the dual convex
conjugate F*(η) for a mixture family generator F(θ) amounts to calculate
the cross-entropy between p0 and the mixture mθ (which can also be parame-
terized equivalently as mη):
Z
*
F ðηÞ ¼  p0 ðxÞ log mη ðxÞdμðxÞ:

Using the formula of Eq. (3) and the fact that

$$D_{\mathrm{KL}}(p_{l_0,s_0} : m_\theta) = h^\times[p_{l_0,s_0} : m_\theta] - h[p_{l_0,s_0}],$$

we deduce a closed-form formula for the cross-entropy between $p_0$ and the mixture $m_\theta$, since $h[p_{l_0,s_0}] = \log(4\pi s_0)$.

Proposition 2 (Cross-entropy closed-form formula). The cross-entropy between $p_0$ and the mixture $m_\theta$ is:

$$h^\times[p_{l_0,s_0} : m_\theta] = D_{\mathrm{KL}}(p_{l_0,s_0} : m_\theta) + \log(4\pi s_0). \qquad (12)$$

Thus the dual convex conjugate $F^*$ can be expressed in closed form using the $\theta$-coordinate system: $F^*(\eta(\theta)) = h^\times[p_{l_0,s_0} : m_\theta]$.
It was shown in Nielsen (2020) how to reconstruct the statistical divergence corresponding to the Bregman divergence $B_F$ induced by the Bregman generator $F$. When the Bregman generator is the negentropy of a mixture family, the Kullback–Leibler divergence is reconstructed (Nielsen, 2020). Thus we have

$$D_{\mathrm{KL}}[m_{\theta_1} : m_{\theta_2}] = B_F(\theta_1 : \theta_2) \qquad (13)$$
$$= F(\theta_1) - F(\theta_2) - (\theta_1 - \theta_2)F'(\theta_2) \qquad (14)$$
$$= F(\theta_1) + F^*(F'(\theta_2)) - \theta_1 F'(\theta_2). \qquad (15)$$
This last formula is thus available in closed form, since $F(\theta)$ is available in closed form. We may also derive this formula as follows:

$$D_{\mathrm{KL}}[m_{\theta_1} : m_{\theta_2}] = h^\times[m_{\theta_1} : m_{\theta_2}] - h[m_{\theta_1}] \qquad (16)$$
$$= (1-\theta_1)\, h^\times[p_0 : m_{\theta_2}] + \theta_1\, h^\times[p_1 : m_{\theta_2}] - h[m_{\theta_1}], \qquad (17)$$

and use the above closed-form formulas.
We show in the Appendix how to report an explicit closed-form formula with respect to $l_0, s_0, l_1, s_1, \theta_1$, and $\theta_2$.
It follows that the Jeffreys divergence (which symmetrizes the Kullback–Leibler divergence):

$$D_J[m_{\theta_1} : m_{\theta_2}] := D_{\mathrm{KL}}[m_{\theta_1} : m_{\theta_2}] + D_{\mathrm{KL}}[m_{\theta_2} : m_{\theta_1}] \qquad (18)$$
$$= (\theta_2 - \theta_1)(\eta_2 - \eta_1), \qquad (19)$$

and the Jensen–Shannon divergence (a symmetrization of the KLD upper bounded by $\log 2$):

$$D_{\mathrm{JS}}[m_{\theta_1} : m_{\theta_2}] := \frac{1}{2}\left(D_{\mathrm{KL}}\left[m_{\theta_1} : \frac{m_{\theta_1} + m_{\theta_2}}{2}\right] + D_{\mathrm{KL}}\left[m_{\theta_2} : \frac{m_{\theta_1} + m_{\theta_2}}{2}\right]\right) \qquad (20)$$
$$= \frac{1}{2}\left(D_{\mathrm{KL}}\left[m_{\theta_1} : m_{\frac{\theta_1+\theta_2}{2}}\right] + D_{\mathrm{KL}}\left[m_{\theta_2} : m_{\frac{\theta_1+\theta_2}{2}}\right]\right) \qquad (21)$$
$$= h\left[m_{\frac{\theta_1+\theta_2}{2}}\right] - \frac{1}{2}\left(h[m_{\theta_1}] + h[m_{\theta_2}]\right) \qquad (22)$$

are also available in closed form (Nielsen and Nock, 2018) (using Eq. (10)).

Proposition 3. The Kullback–Leibler divergence, Jeffreys divergence, and Jensen–Shannon divergence between any two mixtures of Cauchy distributions with two prescribed distinct components can be calculated in closed form.
Notice that closed-form formulas for the Kullback–Leibler divergence and the Jensen–Shannon divergence between Cauchy distributions were reported in Nielsen and Okamura (2021).
To express $F^*(\eta)$ using the $\eta$-coordinate, we first need to calculate $\theta(\eta) = (F^*)'(\eta) = (F')^{-1}(\eta)$, i.e., to invert $F'(\theta)$, and then apply the Legendre formula:

$$F^*(\eta) = \theta(\eta)\eta - F(\theta(\eta)).$$
We may also consider circular, wrapped, or log-Cauchy distributions, which can be obtained by suitable transformations of the Cauchy distributions; see Nielsen and Okamura (2021).
By fixing $(l_0, s_0)$ and $(l_1, s_1)$, we can obtain $\theta(\eta)$ in closed form (and therefore $F^*(\eta)$). We report an example in the following section.
4.2 An analytic example with closed-form dual potentials

Let us illustrate the construction of the dually flat space for the family of mixtures of two Cauchy distributions with prescribed parameters $(l_0, s_0) = (0, 1)$ and $(l_1, s_1) = (1, 1)$ (i.e., $\Delta_l^2 = 1$ and $\Delta_s^2 = 0$). We have the following induced generator:

$$F_{0,1,1,1}(\theta) = -h[(1-\theta)p_{0,1} + \theta p_{1,1}] = \theta \log\left(\frac{2\sqrt{1+\theta-\theta^2} + \theta + 2}{2\sqrt{1+\theta-\theta^2} - \theta + 3}\right) + \log\left(\frac{2\sqrt{1+\theta-\theta^2} - \theta + 3}{20\pi}\right), \qquad (23)$$

for $\theta \in \Theta := (0, 1)$.

Fig. 8 plots the strictly convex and differentiable potential function F(θ)
for θ  (0, 1).
We can therefore calculate in closed form the derivative of F0,1,1,1(θ)
(refer to generic Eq. (11):
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi !
2 1 + θ  θ2 + θ + 2
ηðθÞ ¼ F00,1,1,1 ðθÞ ¼ log pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : (24)
2 1 + θ  θ2  θ + 3

Function η(θ) is a strictly increasing function since it is the derivative


of the strictly convex function F00,1,1,1(θ). It follows that the parameter η
ranges in

0 4 0 5
H ¼ F0,1,1,1 ð0Þ ¼ log 0:069…, F0,1,1,1 ð1Þ ¼ log 0:069…
5 4

That is the dual domain centered at 0 (with η 12 ¼ 0).
Therefore, the Bregman divergence $B_{F_{0,1,1,1}}(\theta_1 : \theta_2)$, equal to the Kullback–Leibler divergence between the corresponding mixtures, is

$$B_{F_{0,1,1,1}}(\theta_1 : \theta_2) = D_{\mathrm{KL}}[m_{\theta_1} : m_{\theta_2}]$$
$$= \theta_1 \log\left(\frac{\left(2\sqrt{1+\theta_1-\theta_1^2} + \theta_1 + 2\right)\left(2\sqrt{1+\theta_2-\theta_2^2} - \theta_2 + 3\right)}{\left(2\sqrt{1+\theta_1-\theta_1^2} - \theta_1 + 3\right)\left(2\sqrt{1+\theta_2-\theta_2^2} + \theta_2 + 2\right)}\right) + \log\left(\frac{2\sqrt{1+\theta_1-\theta_1^2} - \theta_1 + 3}{2\sqrt{1+\theta_2-\theta_2^2} - \theta_2 + 3}\right). \qquad (25)$$

FIG. 8 Plots of the strictly convex and smooth Bregman generator $F_{0,1,1,1}(\theta)$ (left) and its convex conjugate $F^*_{0,1,1,1}(\eta)$ (right).
Now, let us calculate the inverse function $\theta(\eta)$ in closed form using a computer algebra software^b:

$$\theta(\eta) = (F^*_{0,1,1,1})'(\eta) = \frac{2\sqrt{5}\sqrt{\exp(3\eta) - 2\exp(2\eta) + \exp(\eta)} + 3\exp(\eta) - 5\exp(2\eta)}{6\exp(\eta) - 5\exp(2\eta) - 5}. \qquad (26)$$
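Since the radical satisfies $\sqrt{\exp(3\eta) - 2\exp(2\eta) + \exp(\eta)} = e^{\eta/2}\,|e^\eta - 1|$, the sign attached to this term selects the branch of the inversion (Eq. (26) as written applies for $\eta < 0$; the Appendix variant, with the opposite sign, covers $\eta > 0$). A quick numerical round-trip in MAXIMA (our check, at the illustrative value $\theta = 0.3$, for which $\eta < 0$):

/* Round-trip theta -> eta(theta) -> theta for the (0,1)/(1,1) mixture. */
eta(t) := log((2*sqrt(1 + t - t^2) + t + 2)/(2*sqrt(1 + t - t^2) - t + 3));
th(e) := (2*sqrt(5)*sqrt(exp(3*e) - 2*exp(2*e) + exp(e)) + 3*exp(e) - 5*exp(2*e))
         /(6*exp(e) - 5*exp(2*e) - 5);
float(th(eta(0.3)));    /* recovers 0.3 */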
It follows that the dual potential function $F^*(\eta)$ is available in closed form (see the Appendix for the closed-form formula):

$$F^*_{0,1,1,1}(\eta) = \theta(\eta)\eta - F_{0,1,1,1}(\theta(\eta)).$$

We have the dual potential expressed using the $\theta$-coordinate as:

$$F^*_{0,1,1,1}(\eta(\theta)) = \log\left(\frac{20\pi}{2\sqrt{1+\theta-\theta^2} - \theta + 3}\right). \qquad (27)$$

By plugging the expression of Eq. (26) into Eq. (27), we get a closed-form expression for $F^*_{0,1,1,1}(\eta)$.
In the Appendix, we show how to report $F^*(\eta)$ in closed form by expanding the formula: this yields a long formula which we do not paste here, and from which we can also obtain $(F^*_{0,1,1,1})'(\eta)$.
^b We used WolframAlpha®, available online at https://www.wolframalpha.com/, for this symbolic computation with the following query: (2*sqrt(1+t-t*t)+t+2)/(2*sqrt(1+t-t*t)-t+3)=exp(eta) solve for t.
The metric tensor ${}^{F_{0,1,1,1}} g$ can be calculated in closed form in the $\theta$-coordinate system using the second derivative: $[{}^{F_{0,1,1,1}} g]_\theta = F''_{0,1,1,1}(\theta)$. The raw output of the symbolic computation (see the Appendix) is a large rational expression in $\theta$ and $\sqrt{1+\theta-\theta^2}$, which simplifies to

$$[{}^{F_{0,1,1,1}} g]_\theta = F''_{0,1,1,1}(\theta) = \frac{1}{\sqrt{1+\theta-\theta^2}\left(1 + \sqrt{1+\theta-\theta^2}\right)}.$$

Thus, we reported in closed form all the necessary equations to implement the mixture family of two distinct Cauchy distributions (Eqs. 23–26).
The Cauchy distributions are the Student's $t$-distributions for $\nu = 1$ degree of freedom. When the number of degrees of freedom tends to infinity, the Student's $t$-distributions tend to normal distributions.
5 Conclusion

Amari's $\pm\alpha$-structure (Amari, 2016; Nielsen, 2022) on a manifold geometrically modeling a family of parametric statistical models consists of using the Fisher metric as the Riemannian metric tensor, and a pair of dual affine connections $\nabla^\alpha$ and $\nabla^{-\alpha}$ which are compatible with the Fisher metric. In particular, the $\pm 1$-structure for both exponential families and mixture families yields dually flat spaces, i.e., dual Hessian structures on a manifold equipped with a global chart (Shima, 2007). In a dually flat space, the Legendre–Fenchel transformation gives rise to the dual affine coordinate systems $\theta(\cdot)$ and $\eta(\cdot)$ for the dual connections $\nabla^{1}$ (e-connection or exponential connection) and $\nabla^{-1}$ (m-connection or mixture connection), and to the dual convex potential functions $F(\theta)$ and $F^*(\eta)$. Although those dual potential functions are (1) always analytic for exponential families and (2) available in closed form for many distribution families of exponential type (Nielsen and Hadjeres, 2019) (e.g., multivariate normal distributions), the negentropy potential function of a mixture family may not be analytic (e.g., a mixture of two prescribed Gaussians (Michalowicz et al., 2008)) and is rarely available in closed form, except when the mixture components have pairwise disjoint supports (e.g., the family of categorical distributions on a finite sample space). In this chapter, we reported a first notable example of a continuous mixture family of two Cauchy components (model order $D = 1$ with full support $\mathbb{R}$) with dual convex potential functions $F(\theta)$ and $F^*(\eta)$ available in closed form. It would be interesting to build continuous mixture families with arbitrarily many component distributions $k > 1$ sharing the same support for which the dual potential functions are available in closed form. Another question left for future research is to study the intersection of the classes of potential functions obtained from exponential families and from mixture families.
Acknowledgments

The author is deeply indebted to Professor Kazuki Okamura (Shizuoka University, Japan), who collaborated with the author on the study of f-divergences between Cauchy distributions (Nielsen and Okamura, 2021). In that work, all the α-geometries (Amari, 2016) of the family of Cauchy distributions are proven to coincide with the Fisher–Rao hyperbolic geometry of the Cauchy family, due to the symmetry property of the f-divergences for the Cauchy family.
Appendix. Symbolic computing notebook in MAXIMA

The following notebook can be executed using the computer algebra system MAXIMA^c for symbolic computations.

^c https://maxima.sourceforge.io/.
First, we initialize the closed-form formulas:

KLCauchy(l0, s0, l1, s1, theta) := log(((l0 - l1)^2 + (s0 + s1)^2)
  /((1 - theta)*(s0*s0 + s1*s1 + (l0 - l1)^2) + 2*theta*s0*s1
    + 2*sqrt(s0*s0*s1*s1 + s0*s1*((s0 - s1)^2 + (l0 - l1)^2)*theta*(1 - theta))));
JSCauchy(l0, s0, l1, s1, theta) := (1 - theta)*KLCauchy(l0, s0, l1, s1, theta)
  + theta*KLCauchy(l1, s1, l0, s0, 1 - theta);
hCauchy(l, s) := log(4*%pi*s);
hmixCauchy(l0, s0, l1, s1, theta) := JSCauchy(l0, s0, l1, s1, theta)
  + ((1 - theta)*hCauchy(l0, s0) + theta*hCauchy(l1, s1));
crossentropyCauchy(l0, s0, l1, s1, theta) := KLCauchy(l0, s0, l1, s1, theta)
  + hCauchy(l0, s0);
KLmix(l0, s0, l1, s1, theta1, theta2) :=
  (1 - theta1)*crossentropyCauchy(l0, s0, l1, s1, theta2)
  + theta1*crossentropyCauchy(l1, s1, l0, s0, 1 - theta2)
  - hmixCauchy(l0, s0, l1, s1, theta1);
To export a formula as FORTRAN code (which can be translated easily into Java), we run the following command:

fortran(ratsimp(hmixCauchy(l0, s0, l1, s1, theta)));

To get the Bregman generator, we do:

expand(-hmixCauchy(l0, s0, l1, s1, theta));
string(%);

To obtain the dual parameterization η = F′(θ) (recall that F(θ) = −h[m_θ], hence the minus sign), we use the following code snippet:

eta: derivative(-hmixCauchy(l0, s0, l1, s1, theta), theta, 1);
Then, for example, we can calculate and export in TeX the metric tensor in the primal coordinate system θ as follows:

F: -hmixCauchy(0, 1, 1, 1, theta);
g: derivative(F, theta, 2);
ratsimp(%);
tex(%);

and plot the potential function:

plot2d(F, [theta, 0, 1]);

To obtain a closed-form formula for the Kullback–Leibler divergence between $m_{\theta_1}$ and $m_{\theta_2}$, we do:

expand(KLmix(l0, s0, l1, s1, theta1, theta2));
string(%);

We can check by symbolic/numeric calculations that the Kullback–Leibler divergence between $m_{\theta_1}$ and $m_{\theta_2}$ is symmetric:

DiffKL(l0, s0, l1, s1, theta1, theta2) := KLmix(l0, s0, l1, s1, theta1, theta2)
  - KLmix(l0, s0, l1, s1, theta2, theta1);
float(DiffKL(0, 1, 1, 1, 0.2, 0.8));
To implement the case of $(l_0, s_0) = (0, 1)$ and $(l_1, s_1) = (1, 1)$ discussed in the main body, we execute the following MAXIMA code:

F(theta) := theta*log((2*sqrt(1 + theta - theta*theta) + theta + 2)
    /(2*sqrt(1 + theta - theta*theta) - theta + 3))
  + log((2*sqrt(1 + theta - theta*theta) - theta + 3)/(20*%pi));
gradF(theta) := log((2*sqrt(1 + theta - theta*theta) + theta + 2)
  /(2*sqrt(1 + theta - theta*theta) - theta + 3));
BD(theta1, theta2) := F(theta1) - F(theta2) - (theta1 - theta2)*gradF(theta2);
expand(F(theta1) - F(theta2) - (theta1 - theta2)*gradF(theta2));
theta(eta) := (5*exp(2*eta) + 2*sqrt(5)*sqrt(exp(3*eta) - 2*exp(2*eta) + exp(eta))
  - 3*exp(eta))/(5*exp(2*eta) - 6*exp(eta) + 5);
Fdual(eta) := theta(eta)*eta - F(theta(eta));
expand(theta(eta)*eta - F(theta(eta)));

This yields a closed-form formula for $F^*_{0,1,1,1}(\eta)$ (albeit a lengthy one).
References
Amari, S.-i., 2016. Information Geometry and Its Applications. Applied Mathematical Sciences,
Springer Japan, ISBN: 9784431559771.
Amari, S.-i., Armstrong, J., 2014. Curvature of Hessian manifolds. Differ. Geom. Appl. 33, 1–12.
Amari, S.-i., Cichocki, A., 2010. Information geometry of divergence functions. Bull. Polish
Acad. Sci. Tech. sci. 58 (1), 183–195.
Atkinson, C., Mitchell, A.F.S., 1981. Rao’s distance measure. Sankhya Indian J. Stat. A 45,
345–365.
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J., Lafferty, J., 2005. Clustering with Bregman
divergences. J. Mach. Learn. Res. 6 (10).
Barndorff-Nielsen, O., 2014. Information and Exponential Families: In Statistical Theory. John
Wiley & Sons.
Blondel, M., Martins, A.F., Niculae, V., 2020. Learning with Fenchel-Young losses. J. Mach.
Learn. Res. 21 (35), 1–69.
Boissonnat, J.-D., Nielsen, F., Nock, R., 2010. Bregman Voronoi diagrams. Discrete Comput.
Geom. 44 (2), 281–307.
Bregman, L.M., 1967. The relaxation method of finding the common point of convex sets and its
application to the solution of problems in convex programming. USSR Comput. Math. Math.
Phys. 7 (3), 200–217.
Calvo, J.A., 2018. Scientific Programming: Numeric, Symbolic, and Graphical Computing With
Maxima. Cambridge Scholars Publishing.
Chyzak, F., Nielsen, F., 2019. A closed-form formula for the Kullback-Leibler divergence
between Cauchy distributions. arXiv preprint arXiv:1905.10965.
Crouzeix, J.-P., 1977. A relationship between the second derivatives of a convex function and of
its conjugate. Math. Program. 13 (1), 364–365.
Došlá, Š., 2009. Conditions for bimodality and multimodality of a mixture of two unimodal densities. Kybernetika 45 (2), 279–292.
Eguchi, S., 1992. Geometry of minimum contrast. Hiroshima Math. J. 22 (3), 631–647.
Faraut, J., Korányi, A., 1994. Analysis on Symmetric Cones. Oxford Mathematical Monographs.
Godinho, L., Natário, J., 2014. An Introduction to Riemannian Geometry. Springer.
Gomes-Gonçalves, E., Gzyl, H., Nielsen, F., 2019. Geometry and fixed-rate quantization in
Riemannian metric spaces induced by separable Bregman divergences. In: International
Conference on Geometric Science of Information, pp. 351–358.
Güler, O., 1996. Barrier functions in interior point methods. Math. Oper. Res. 21 (4), 860–885.
Keener, R.W., 2010. Theoretical Statistics: Topics for a Core Course. Springer.
Lee, Y.T., Yue, M.-C., 2021. Universal barrier is n-self-concordant. Math. Oper. Res. 46 (3),
1129–1148.
Lin, J., 1991. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory
37 (1), 145–151.
Malagò, L., Pistone, G., 2015. Information geometry of the Gaussian distribution in view of stochastic optimization. In: Proceedings of the ACM Conference on Foundations of Genetic Algorithms XIII, pp. 150–162.
Michalowicz, J.V., Nichols, J.M., Bucholtz, F., 2008. Calculation of differential entropy for a
mixed Gaussian distribution. Entropy 10 (3), 200–206.
Naudts, J., Anthonis, B., 2012. Data set models and exponential families in statistical physics and
beyond. Mod. Phys. Lett. B 26 (10), 1250062.
Nielsen, F., 2020. An elementary introduction to information geometry. Entropy 22 (10), 1100.
Nielsen, F., 2021. On geodesic triangles with right angles in a dually flat space. In: Progress in
Information Geometry, Springer, Cham, pp. 153–190.
Nielsen, F., 2022. The many faces of information geometry. Not. Am. Math. Soc. 69 (1), 36–45.
Nielsen, F., Boltz, S., 2011. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 57 (8), 5455–5466.
Nielsen, F., Hadjeres, G., 2019. Monte Carlo information-geometric structures. In: Geometric
Structures of Information, Springer, pp. 69–103.
Nielsen, F., Nock, R., 2018. On the geometry of mixtures of prescribed distributions. In: 2018
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 2861–2865.
Nielsen, F., Okamura, K., 2021, July. On f-divergences between Cauchy distributions. In: Interna-
tional Conference on Geometric Science of Information. Springer, Cham, pp. 799–807.
Pinele, J., Strapasson, J.E., Costa, S.I.R., 2020. The Fisher-Rao distance between multivariate nor-
mal distributions: special cases, bounds and applications. Entropy 22 (4), 404.
Risch, R.H., 1969. The problem of integration in finite terms. Trans. Am. Math. Soc. 139,
167–189.
Rockafellar, R.T., 1967. Conjugates and Legendre transforms of convex functions. Can. J. Math.
19, 200–205.
Shima, H., 2007. The Geometry of Hessian Structures. World Scientific.
Watanabe, S., 2004. Kullback information of normal mixture is not an analytic function. IEICE
Tech. Rep. 2004, 41–46.
Zhang, Z., Sun, H., Zhong, F., 2007. Information geometry of the power inverse Gaussian
distribution. Appl. Sci. 9, 194–203.
Zhong, F., Sun, H., Zhang, Z., 2008. The geometry of the Dirichlet manifold. J. Kor. Math. Soc.
45 (3), 859–870.
Chapter 7

Local measurements of nonlinear embeddings with information geometry

Ke Sun*
CSIRO Data61, Sydney, NSW, Australia
The Australian National University, Canberra, ACT, Australia
*Corresponding author: e-mail: sunk@ieee.org
Abstract

A basic problem in machine learning is to find a mapping $f$ from a low-dimensional latent space $\mathcal{Y}$ to a high-dimensional observation space $\mathcal{X}$. Modern tools such as deep neural networks are capable of representing general nonlinear mappings. A learner can easily find a mapping which perfectly fits all the observations. However, such a mapping is often not considered good, because it is not simple enough and can overfit. How should simplicity be defined? We try to give a formal definition of the amount of information imposed by a nonlinear mapping $f$. Intuitively, we measure the local discrepancy between the pullback geometry and the intrinsic geometry of the latent space. Our definition is based on information geometry and is independent of both the empirical observations and specific parameterizations. We prove its basic properties and discuss relationships with related machine learning methods.

Keywords: Dimensionality reduction, Manifold learning, Latent space, Autoencoders, Embedding, Information geometry, α-Divergence
1 Introduction

In statistical machine learning, one is often interested in deriving a nonlinear mapping $f$ from a latent space $\mathcal{Y}$ to an observable space $\mathcal{X}$ (as shown in Fig. 1), so that meaningful latent representations can be learned for the observed data. Similar problems widely appear in dimensionality reduction (a.k.a. manifold learning), nonlinear regression, and deep representation learning. By convention in machine learning, we refer to the inverse of $f$, which we denote by $g$, rather than $f$ itself, as an "embedding", that is, a mapping from $\mathcal{X}$ to $\mathcal{Y}$. The latent space $\mathcal{Y}$ is referred to as the embedding space.

FIG. 1 The basic subjects in this paper: a latent Euclidean space $\mathcal{Y}$; an observation space $\mathcal{X}$ (a Riemannian manifold); and a differentiable mapping $f : \mathcal{Y} \to \mathcal{X}$ from $\mathcal{Y}$ to $\mathcal{X}$.
In Riemannian geometry, an embedding is from the low-dimensional $\mathcal{Y}$ to the high-dimensional $\mathcal{X}$, and not the other way round. This is only a matter of terminology.
How can one measure and learn an embedding? Consider a nonlinear mapping $f_\theta : \mathcal{Y} \to \mathcal{X}$ parameterized by a neural network with parameters $\theta$ (the weights and bias terms). The discrepancy of $f_\theta$ with respect to (w.r.t.) a given observation $x \in \mathcal{X}$ and a latent $y \in \mathcal{Y}$ can be measured by some distance between $f_\theta(y)$ and $x$, denoted by $D(f_\theta(y) : x)$. This is illustrated in the following diagram:

  y --(theta)--> f_theta(y) --\
                               D(f_theta(y) : x)   (data geometry)
  x --------------------------/

As both $f_\theta(y)$ and $x$ are points on $\mathcal{X}$, such a measurement of $f$ is based on the data geometry, which is the geometry of $\mathcal{X}$ equipped with fundamental quantities such as the distance. Then, one can minimize the empirical average of $D(f_\theta(y) : x)$ w.r.t. a set of samples to learn the neural network $f_\theta$. Similarly, an embedding $g_\theta : \mathcal{X} \to \mathcal{Y}$ can be measured by first mapping $x$ to the latent space, and then measuring the distance against $y$. The following diagram shows this paradigm based on the geometry of $\mathcal{Y}$, the embedding geometry:
y

Dðy : gθ ðxÞÞ ðembedding geometryÞ


θ
x!gθ ðxÞ

A typical example of these paradigms is autoencoder networks (Kingma and Welling, 2014; Rifai et al., 2011), where g_θ and f_θ are called the "encoder" and the "decoder," respectively.

As a convention of notation, we use Greek letters such as "α" and "β" to denote scalars, bold lowercase letters such as "x" and "y" for vectors, regular capital letters such as "A" and "W" for matrices, and calligraphic letters such as "𝒳" and "𝒴" for manifolds and their related measurements, with exceptions. We reserve the symbol "D" for a general distance-like measure,
and the symbol "𝒟" for our proposed discrepancy measure. The mappings are denoted by f_θ and g_θ, where the subscript θ (the parameters of the associated mapping) can be omitted.
Another type of embedding is based on information theoretical measurements, which fits the following paradigm:

$$x \xrightarrow{\ \text{data geometry}\ } p, \qquad y \xrightarrow{\ \text{embedding geometry}\ } q, \qquad D(p : q) \quad \text{(information geometry)}$$

Two probability distributions p and q are computed for 𝒳 and 𝒴, respectively. Then the quality of the embedding is measured by some "distance" D(p : q) between these two probability distributions, based on the geometry of the space of probability distributions, a.k.a. information geometry (Amari, 2016). A typical example is stochastic neighbor embedding (Hinton and Roweis, 2003), where p = (p_ij)_{n×n} and q = (q_ij)_{n×n} denote pairwise proximities w.r.t. n samples on the two manifolds 𝒳 and 𝒴, respectively, and D(p : q) is chosen as the Kullback–Leibler (KL) divergence.
In summary, the following three geometries are involved in measuring f : 𝒴 → 𝒳:

Data Geometry. Geometric measurements in a usually high-dimensional observation space denoted as 𝒳, or the geometry of the input feature space;

Embedding Geometry. The intrinsic geometry of the embedding space 𝒴, which is usually assumed to have a simple, well-defined geometric structure, such as that of a Euclidean space;

Information Geometry. The geometry of the space of probability distributions. It also gives an intrinsic geometry of parametric probabilistic models.

How to properly define these geometries, and how they interact and affect learning, is a fundamental and mostly unsolved problem.
Following the above information geometric approach, this work tackles the basic problem of defining an intrinsic complexity of the mapping f : 𝒴 → 𝒳. Our construction is based on the following intuitions. For any given y ∈ 𝒴, we can define a local probability distribution p(y) (a "soft neighborhood" of y) based on how the image of 𝒴 under the mapping f is embedded inside 𝒳. In geometric terms, p(y) is defined based on a submanifold f(𝒴) of the ambient space 𝒳, which is often referred to as the "data manifold." On the other hand, we can build another, not necessarily normalized, measure s(y) around y based on the intrinsic structure of 𝒴. For a low-complexity f, p and s should be similar, meaning that the mapping f does not impose additional information; p and s are dissimilar for a high-complexity f. Hence, we measure the complexity, or imposed information, of f locally around y ∈ 𝒴 through the discrepancy of p and s. This work is partially based on our unpublished report (Sun, 2018).

We make the following contributions:

• A formal definition of the α-discrepancy, a one-parameter family of discrepancy measurements of a differentiable mapping f : 𝒴 → 𝒳;
• Proofs of its invariance and other basic properties;
• A discussion of its computational feasibility and practical estimation methods;
• Connections with existing machine learning methods, including neighborhood embeddings and autoencoders.
In the rest of this chapter, we first review the α-divergence, based on which our core concepts are defined (Section 2). We formally define the local and global α-discrepancy of an embedding and show its basic properties and closed-form expressions (Section 3). Then, we demonstrate that the α-discrepancy is computationally feasible and give practical estimation methods (Section 4). We present how it connects with existing techniques such as neighborhood embeddings and related deep learning methods (Section 5). We discuss possible extensions and conclude (Section 6). Proofs of the statements are provided at the end of the chapter.

2 α-Divergence and autonormalizing


Recall that in our construction, we need a "distance" between two local positive measures on the manifold 𝒴: one is associated with a neighborhood on the data manifold f(𝒴) around a reference point ȳ ∈ 𝒴; the other means a neighborhood based on the underlying geometry of 𝒴 by prior assumption (that is independent of f). Therefore, we fall back to defining information theoretical distances in the space of positive measures.

Information geometry (Amari, 2016) is a discipline where information theoretic quantities are endowed with a geometric background, so that one can study the "essence of information" in an intuitive way. It has broad connections with statistical machine learning (Amari, 1995, 2016; Lebanon, 2005; Sun and Marchand-Maillet, 2014; Yang et al., 2014). We only need one of the many useful concepts from this broad area and refer the reader to Amari (2016) and Nielsen (2020) for proper introductions.
The α-divergence (Amari, 2016) is a uniparametric family of information divergences (nonnegative dissimilarities that are asymmetric and do not satisfy the triangle inequality). Given two finite positive measures p̃ and q̃ (not necessarily normalized) defined on the same support 𝒴, the α-divergence gauges their "distance" by

$$D_\alpha(\tilde{p} : \tilde{q}) := \frac{1}{\alpha(1-\alpha)} \int \left[ \alpha \tilde{p}(y) + (1-\alpha)\,\tilde{q}(y) - \tilde{p}^{\alpha}(y)\,\tilde{q}^{1-\alpha}(y) \right] dy \tag{1}$$

for α ∈ ℝ∖{0, 1}. If p and q are probability distributions, then Eq. (1) simplifies to
$$D_\alpha(p : q) = \frac{1}{\alpha(1-\alpha)} \left[ 1 - \int p^{\alpha}(y)\, q^{1-\alpha}(y)\, dy \right]. \tag{2}$$
It is easy to show from L'Hôpital's rule that

$$\lim_{\alpha \to 1} D_\alpha(\tilde{p} : \tilde{q}) = \mathrm{KL}(\tilde{p} : \tilde{q}) = \int \left[ \tilde{p}(y) \log \frac{\tilde{p}(y)}{\tilde{q}(y)} - \tilde{p}(y) + \tilde{q}(y) \right] dy$$

is the (generalized) KL divergence, and similarly, lim_{α→0} D_α(p̃ : q̃) = KL(q̃ : p̃) is the reverse KL divergence. Therefore, the definition of the α-divergence naturally extends to α ∈ ℝ, encompassing the KL and reverse KL divergences along with several other commonly used divergences (Cichocki et al., 2015; Amari, 2016). We can easily verify that D_α(p̃ : q̃) ≥ 0, with D_α(p̃ : q̃) = 0 if and only if p̃ = q̃. As a generalization of the KL divergence, the α-divergence has been applied in machine learning. Usually, α is a hyperparameter of the learning machine that can be tuned. See Narayan et al. (2015) and Li and Turner (2016) for recent examples.
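As a concrete illustration (our own sketch, not part of the chapter), the following snippet evaluates Eq. (1) on a grid for two arbitrary unnormalized 1-D Gaussian measures and checks numerically that α close to 1 approaches the generalized KL divergence:

```python
import numpy as np

# Minimal numerical sketch: alpha-divergence between positive measures,
# Eq. (1), on a 1-D grid; the measures p and q are arbitrary examples.
y = np.linspace(-10.0, 10.0, 20001)
dy = y[1] - y[0]

p = np.exp(-0.5 * y**2)                    # unnormalized Gaussian
q = 0.7 * np.exp(-0.5 * (y - 1.0)**2 / 4)  # unnormalized, different mass

def alpha_divergence(p, q, alpha, dy):
    """D_alpha between positive measures, Eq. (1); alpha not in {0, 1}."""
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return integrand.sum() * dy / (alpha * (1 - alpha))

def generalized_kl(p, q, dy):
    """Generalized KL(p : q) for positive measures (the alpha -> 1 limit)."""
    return (p * np.log(p / q) - p + q).sum() * dy

# The two printed values should nearly agree.
print(alpha_divergence(p, q, 0.999, dy))
print(generalized_kl(p, q, dy))
```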
We need the following property of the α-divergence in our developments.

Lemma 1 (autonormalizing). Given a probability distribution p(y) and a positive measure s(y) such that ∫p(y)dy = 1 and 0 < ∫s(y)dy < ∞, the optimal γ ∈ ℝ₊ := (0, ∞) minimizing D_α(p : γs) has the form

$$\gamma^\star := \operatorname*{arg\,min}_{\gamma \in \mathbb{R}_+} D_\alpha(p : \gamma s) = \left( \frac{\int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy} \right)^{1/\alpha},$$

and the corresponding divergence reduces to

$$D_\alpha(p : \gamma^\star s) = \frac{1}{1-\alpha} \left[ 1 - \left( \int p^{\alpha}(y)\, q^{1-\alpha}(y)\, dy \right)^{1/\alpha} \right],$$

where q(y) := s(y)/∫s(y)dy is the normalized density w.r.t. the positive measure s(y).

The proofs of our formal statements are in the Appendices. Note that the expression of D_α(p : γ*s) is not the same as D_α(p : q) in Eq. (2). For a given α, it is clear that they both reduce to computing the Hellinger integral ∫p^α(y)q^{1−α}(y)dy. Because t^{1/α} is a monotonic transformation of t for t ∈ ℝ₊, D_α(p : γ*s) is a deterministic monotonic function of D_α(p : q).

Intuitively, γ plays the role of an "autonormalizer." If y = i is a discrete random variable, then minimizing D_α(p : γs) boils down to minimizing the α-divergence between two discrete probability distributions: p_i and q_i = s_i/∑_i s_i.
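The discrete case is easy to check numerically. The following sketch (ours, with arbitrary random p and s) verifies that the closed-form γ* of Lemma 1 matches a grid-search minimizer, and that D_α(p : γ*s) ≤ D_α(p : q), as stated in Remark 1 below:

```python
import numpy as np

# Verify Lemma 1 in the discrete case: gamma* minimizes D_alpha(p : gamma*s),
# and D_alpha(p : gamma* s) <= D_alpha(p : q) with q = s / s.sum().
rng = np.random.default_rng(0)
alpha = 0.5
p = rng.random(6); p /= p.sum()        # probability vector
s = rng.random(6)                       # unnormalized positive measure

def d_alpha(p, q):
    """Alpha-divergence between positive vectors, Eq. (1) with sums."""
    return (alpha * p + (1 - alpha) * q
            - p**alpha * q**(1 - alpha)).sum() / (alpha * (1 - alpha))

gamma_star = ((p**alpha * s**(1 - alpha)).sum() / s.sum()) ** (1 / alpha)
grid = np.linspace(0.01, 5.0, 2000)
best = grid[np.argmin([d_alpha(p, g * s) for g in grid])]

print(gamma_star, best)                 # the two should nearly agree
print(d_alpha(p, gamma_star * s), d_alpha(p, s / s.sum()))  # first <= second
```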

In the statement of Lemma 1, we ignore the special cases α = 0 and α = 1 for simpler expressions. One can retrieve these special cases by applying L'Hôpital's rule, which is straightforward. Unless the cases α = 0 or α = 1 are explicitly discussed, a similar treatment of the parameter α is applied throughout the paper.

Remark 1. If α = 1, then γ* = 1/∫s(y)dy and γ*s(y) = q(y). In this special case,

$$D_1(p : \gamma^\star s) = D_1(p : q) = \mathrm{KL}(p : q).$$

In general, ∀α ∈ ℝ, as γ* is the minimizer of D_α(p : γs), we have

$$0 \le D_\alpha(p : \gamma^\star s) \le D_\alpha(p : q).$$

Remark 2. For a given α, minimizing D_α(p : q_θ), where q_θ is a parametric family of distributions with parameters θ, is equivalent to minimizing D_α(p : γs_θ) w.r.t. both γ and θ:

$$\theta^\star := \operatorname*{arg\,min}_{\theta} D_\alpha(p : q_\theta) = \operatorname*{arg\,min}_{\theta}\, \min_{\gamma \in \mathbb{R}_+} D_\alpha(p : \gamma s_\theta),$$

where s_θ is the unnormalized positive measure w.r.t. q_θ.

In machine learning, D_α(p : q_θ) is often used as the loss to be minimized w.r.t. the free parameters θ. D_α(p : γs_θ) is a surrogate function of D_α(p : q_θ) after some deterministic transformations depending on α. In parametric learning, it may be favorable to minimize D_α(p : γs_θ) for its simpler expression without the terms related to the normalizer ∫s_θ(y)dy. One may discover that the Hessian of D_α(p : q_θ) w.r.t. θ and the Hessian of D_α(p : γs_θ) w.r.t. θ are different. Therefore, the loss landscapes of D_α(p : q_θ) and D_α(p : γs_θ) are different. These application scenarios are beyond the scope of the current paper.

3 α-Discrepancy of an embedding
How can we measure the amount of imposed information, or complexity, of a mapping f : 𝒴 → 𝒳 between two manifolds 𝒴 and 𝒳, in a way that is independent of the observations? Taking dimensionality reduction as an example, the central problem is to find such a "good" mapping f, which not only fits the observations well but also is somehow simple, in the sense that it is less curved. In this case, f is from a latent space 𝒴 = ℝ^d that is usually a low-dimensional Euclidean space,ᵃ to an observable space 𝒳, which often has a high dimensionality. For example, the index mapping i → x_i is usually considered "not good" for dimensionality reduction, because only the information regarding the order of the samples (if such information exists) is preserved in the embedding, and all other observed information is lost through the highly curved f.

ᵃ In this paper, a d-dimensional real manifold is denoted by ℳ^d, or simply ℳ when the superscript (dimensionality) is omitted.
We define such an intrinsic loss of information and discuss its properties.
For the convenience of analysis, we make the following assumptions:
Assumption 1. 𝒳 is equipped with a Riemannian metric M(x), i.e., a covariant symmetric tensor that is positive definite and varies smoothly w.r.t. x ∈ 𝒳.

Assumption 2. The latent space 𝒴 := ℝ^d is a d-dimensional Euclidean space endowed with a positive similarity s_ȳ(y) : 𝒴² → ℝ₊, where ȳ ∈ 𝒴 is called a reference point and y ∈ 𝒴 is called a neighbor. This measure has a simple closed form and satisfies

$$\forall \bar{y} \in \mathcal{Y}, \quad 0 < \int s_{\bar{y}}(y)\, dy < \infty.$$

Assumption 3. There exists a differentiable mapping f : 𝒴 → 𝒳 whose Jacobian J(y), or simply J, can be computed. Depending on our analysis, we may further assume that the Jacobian J has full column rank everywhere on 𝒴.ᵇ Such an assumption excludes neural network architectures that reduce 𝒴's dimensionality in some hidden layer between 𝒴 and 𝒳.

ᵇ Such an embedding is known as an "immersion" (Jost, 2011).

Assumption 4. The following generative process: a latent point y is drawn from a prior distribution U(y) defined on 𝒴. Its corresponding observed point is x = f(y) ∈ 𝒳. This U(y) can be Gaussian, uniform, etc.

The mapping f : 𝒴 → 𝒳 induces a metric tensor on 𝒴, given by the pullback metric (Jost, 2011) M(y) := J^⊤(y) M(f(y)) J(y), where J(y) := ∂x/∂y is the Jacobian matrix at y. We abuse M to denote both the Riemannian metric on 𝒳 and the pullback metric on 𝒴. Informally, this means that the geometry of 𝒴 is based on how the image f(𝒴) is curved inside the ambient space 𝒳. By definition, M(y) is only positive semi-definite and may have zero eigenvalues. Moreover, M(y) may not be smooth as y varies in 𝒴. Therefore, further assumptions are needed to make the pullback metric a valid Riemannian metric. The notion of pullback metric has been applied in machine learning (Lebanon, 2005; Sun and Marchand-Maillet, 2014; Tosi et al., 2014), including deep learning (Arvanitidis et al., 2018; Hauberg, 2018; Sun, 2020; Sun et al., 2020). Related work will be discussed in Section 5.
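To fix ideas, here is a toy sketch (ours) of the pullback metric for a hypothetical paraboloid embedding f(y) = (y₁, y₂, y₁² + y₂²) of ℝ² into ℝ³ with M(x) = I:

```python
import numpy as np

# Pullback metric M(y) = J(y)^T J(y) for a toy map from R^2 into R^3,
# computed from an analytic Jacobian (ambient metric M(x) = identity).
def f(y):
    return np.array([y[0], y[1], y[0]**2 + y[1]**2])

def jacobian_f(y):
    # Rows index output coordinates, columns index latent coordinates.
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [2.0 * y[0], 2.0 * y[1]]])

y = np.array([0.5, -1.0])
J = jacobian_f(y)
M_pullback = J.T @ J        # 2 x 2 positive (semi-)definite metric at y
print(M_pullback)
```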
In order to apply information geometric measurements, we consider a probability density defined w.r.t. this induced geometry, given by

$$p_{\bar{y}}(y) := G\!\left(y \mid \bar{y},\, J^\top(\bar{y})\, M(f(\bar{y}))\, J(\bar{y})\right) \propto \exp\!\left( -\frac{1}{2} (y - \bar{y})^\top J^\top(\bar{y})\, M(f(\bar{y}))\, J(\bar{y})\, (y - \bar{y}) \right),$$

where G(· | μ, Σ) denotes the multivariate Gaussian distribution centered at μ with precision matrix Σ. Formally, p_ȳ(y) is a probability distribution locally defined on the tangent space T_ȳ𝒴. If the pullback geometry is Riemannian, a neighborhood of ȳ is given by {exp_ȳ(y) : y ∼ p_ȳ(y)}, where exp_ȳ : T_ȳ𝒴 → 𝒴 is the Riemannian exponential map. In this paper, we relax the formal requirements and perceive p_ȳ(y) as a local neighborhood distribution around ȳ ∈ 𝒴. Here, we define similarities based on a given Riemannian metric, which is the reverse of the treatment of inducing metrics based on kernels (see Section 11.2.4 of Amari, 2016). Gaussian distributions on Riemannian manifolds are formally defined in Pennec (2006) and Said et al. (2017).
On the other hand, by Assumption 2 the latent space 𝒴 is endowed with a positive similarity measure s_ȳ(y), where ȳ, y ∈ 𝒴. Typical choices of s_ȳ(y) are

$$s_{\bar{y}}(y) = \exp\!\left( -\frac{1}{2} \| y - \bar{y} \|^2 \right) \quad \text{or} \quad s_{\bar{y}}(y) = \left( 1 + \frac{\| y - \bar{y} \|^2}{\kappa} \right)^{-\frac{\kappa+1}{2}} \quad (\kappa \ge 1),$$

where ‖·‖ is the Euclidean norm. Both of these choices are isotropic and decrease as the distance ‖y − ȳ‖ increases. The meaning of s_ȳ(y) is a local neighborhood of ȳ ∈ 𝒴 based on the intrinsic geometry of 𝒴.
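Both choices are one-liners in code; the following sketch (ours) evaluates them at a few distances from the reference point:

```python
import numpy as np

# The two similarity choices above: a Gaussian kernel and a Student-t-like
# kernel with kappa degrees of freedom.
def s_gauss(y, y_ref):
    return np.exp(-0.5 * np.sum((y - y_ref) ** 2))

def s_student(y, y_ref, kappa=1.0):
    return (1.0 + np.sum((y - y_ref) ** 2) / kappa) ** (-(kappa + 1.0) / 2.0)

y_ref = np.zeros(2)
for r in (0.0, 1.0, 3.0):
    y = np.array([r, 0.0])
    print(r, s_gauss(y, y_ref), s_student(y, y_ref, kappa=1.0))
```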
We try to align these two geometries by comparing the local probability distribution p_ȳ(y) and the positive measure s_ȳ(y), which can be gauged by the α-divergence (a "distance" between positive measures) introduced in Section 2. The underlying assumption is that, under a simple mapping f, these two different neighborhoods should be similar, whereas after a complex mapping f, the neighborhood structure is likely to be destroyed. The complexity of f : 𝒴 → 𝒳 can then be defined as follows.
Definition 1 (Embedding α-discrepancy). With respect to Assumptions 1, 2, and 3, the local α-discrepancy of f : 𝒴 → 𝒳 is a function on 𝒴:

$$\mathcal{D}_{\alpha,f}(\bar{y}) := \inf_{\gamma \in C(\mathcal{Y})} D_\alpha\big(p_{\bar{y}} : \gamma(\bar{y})\, s_{\bar{y}}\big),$$

where C(𝒴) := {γ : 𝒴 → ℝ₊ : ∀y ∈ 𝒴, γ(y) > 0} is the set of positive functions on 𝒴. If, in addition, Assumption 4 holds, the (global) α-discrepancy of f is

$$\mathcal{D}_{\alpha,f} := \mathbb{E}_{U(\bar{y})}\, \mathcal{D}_{\alpha,f}(\bar{y}),$$

where E_{U(ȳ)}(·) denotes the expectation w.r.t. U(ȳ).

Remark 3. An alternative definition of the α-discrepancy could be based on the α-divergence between the two normalized densities p_ȳ and q_ȳ. In this paper, we choose the more general Definition 1, as γ(ȳ) can be automatically normalized based on Lemma 1.

As 𝒟_{α,f} is measured against the mapping f : 𝒴 → 𝒳 instead of two probability measures, we use the term "discrepancy" (denoted as 𝒟) instead of "divergence" (denoted as D). The α-discrepancy 𝒟_{α,f}(ȳ) is a scalar function on the manifold 𝒴, as its value varies w.r.t. the reference point ȳ ∈ 𝒴. The global discrepancy 𝒟_{α,f} is simply a weighted average of the local discrepancy w.r.t. some prior distribution U(ȳ). Obviously, from the nonnegativity of the α-divergence, ∀ȳ ∈ 𝒴, ∀α ∈ ℝ, and for any differentiable f, we have 𝒟_{α,f}(ȳ) ≥ 0.

As the α-divergence is in general unbounded, 𝒟_{α,f}(ȳ) is also unbounded. Note that when α = 1/2, the α-divergence becomes the Hellinger distance, which is a bounded metric. Depending on the application, it may be favored, as it allows p_ȳ(y) and q_ȳ(y) to be zero and therefore has better numerical stability.
To interpret the meaning of the proposed discrepancy value, we have the following basic property.

Proposition 1. Assume that s_ȳ(y) := exp(−½‖y − ȳ‖²). Then ∀α ∈ ℝ, any isometry f : 𝒴 → 𝒳 (if one exists) is an optimal solution of min_f 𝒟_{α,f}.

Notice that an "isometric embedding" is defined by those mappings for which the induced metric M(y) is everywhere equal to the intrinsic metric of the Euclidean space 𝒴, given by the identity matrix I. To gain some intuition about 𝒟_{α,f}, consider the special case where dim(𝒳) = dim(𝒴) and f is a change of coordinates. Minimizing 𝒟_{α,f} w.r.t. f helps find a "good" coordinate system where the metric tensor is best aligned with I. Consider an M(x) that is nonisometric and is small along the directions of the data point cloud (Lebanon, 2005), so that geodesics walk along the observed data points. Minimizing the α-discrepancy then means transforming the coordinate system so that unit balls centered around the data points are like "pancakes" along the data manifold.
The α-divergence belongs to the broader family of f-divergences (Csiszár, 1967) and therefore inherits their properties. By the invariance of the f-divergence (Amari, 2016), its value does not change under a coordinate transformation of the support. We have the following property of Definition 1.

Proposition 2. 𝒟_{α,f}(ȳ) (and therefore 𝒟_{α,f}) is invariant w.r.t. any reparameterization of the observation space 𝒳 or the latent space 𝒴. In other words, for any given diffeomorphisms Φ_𝒳 : 𝒳 → 𝒳 and Φ_𝒴 : 𝒴 → 𝒴, we have

$$\forall \bar{y} \in \mathcal{Y}, \quad \mathcal{D}_{\alpha,f}(\bar{y}) = \mathcal{D}_{\alpha,\, \Phi_\mathcal{X} \circ f \circ \Phi_\mathcal{Y}}\!\left( \Phi_\mathcal{Y}^{-1}(\bar{y}) \right).$$

Indeed, suppose the observation space is reparameterized to a new coordinate system x′, where the Jacobian of x → x′ is J_x := J_x(x). Then the pullback metric becomes

$$M(y) = J^\top J_x^\top\, M(x')\, J_x J = J^\top M(x)\, J.$$

Therefore, the α-discrepancy is an intrinsic measure solely determined by α, f, and the geometry of 𝒳 and 𝒴, regardless of the choice of the coordinate system.
To examine the analytic expression of the α-discrepancy, we have the following theorem.

Theorem 1. If s_ȳ(y) := exp(−½‖y − ȳ‖²), then

$$\mathcal{D}_{\alpha,f}(\bar{y}) = \frac{1}{1-\alpha} \left[ 1 - \frac{\left| J^\top(\bar{y})\, M(f(\bar{y}))\, J(\bar{y}) \right|^{1/2}}{\left| \alpha J^\top(\bar{y})\, M(f(\bar{y}))\, J(\bar{y}) + (1-\alpha) I \right|^{1/2\alpha}} \right],$$

where I is the identity matrix.

Observe that, for general values of α, the α-discrepancy can be well defined even for singular J(ȳ).

Remark 4. We give explicitly the expressions of 𝒟_{α,f}(ȳ) for α = 0 and α = 1:

$$\mathcal{D}_{0,f}(\bar{y}) = 1 - \exp\!\left( \frac{1}{2} \log\left| J^\top(\bar{y}) M(f(\bar{y})) J(\bar{y}) \right| - \frac{1}{2} \operatorname{tr}\!\left( J^\top(\bar{y}) M(f(\bar{y})) J(\bar{y}) \right) + \frac{d}{2} \right),$$

$$\mathcal{D}_{1,f}(\bar{y}) = \frac{1}{2} \log\left| J^\top(\bar{y}) M(f(\bar{y})) J(\bar{y}) \right| + \frac{1}{2} \operatorname{tr}\!\left( \left( J^\top(\bar{y}) M(f(\bar{y})) J(\bar{y}) \right)^{-1} \right) - \frac{d}{2},$$

where tr(·) denotes the trace and d is the dimensionality of 𝒴.



Remark 5. If s_ȳ(y) := exp(−½‖y − ȳ‖²), it is easy to see that 𝒟_{α,f}(ȳ) is a dissimilarity measure between M(ȳ) and I, while 𝒟_{0,f}(ȳ) and 𝒟_{1,f}(ȳ) are in the form of a LogDet divergence (Cichocki et al., 2015) between two positive definite matrices A and B (up to a monotonic transformation):

$$\mathrm{LogDet}(A, B) = \operatorname{tr}(A B^{-1}) - \log\left| A B^{-1} \right|.$$
The following corollary is a "spectrum version" of Theorem 1.

Corollary 1. If M(f(y)) = I for all y, and s_ȳ(y) := exp(−½‖y − ȳ‖²), then

$$\mathcal{D}_{\alpha,f}(\bar{y}) = \frac{1}{1-\alpha} \left[ 1 - \prod_{i=1}^{d} \frac{\tau_i(\bar{y})}{\left( 1 + \alpha \tau_i^2(\bar{y}) - \alpha \right)^{1/2\alpha}} \right], \tag{3}$$

where τ_i(ȳ) (i = 1, …, d) denote the singular values of J(ȳ).
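The closed form is straightforward to implement. The following sketch (ours) evaluates Theorem 1 with M = I on a random Jacobian and cross-checks it against the spectrum version of Eq. (3):

```python
import numpy as np

# Closed-form local alpha-discrepancy (Theorem 1 with M = I), checked
# against the singular-value form of Eq. (3); valid here for alpha in (0, 1).
def alpha_discrepancy(J, alpha):
    """Theorem 1 with M(f(y)) = I: uses the pullback matrix J^T J."""
    d = J.shape[1]
    A = J.T @ J
    num = np.sqrt(np.linalg.det(A))
    den = np.linalg.det(alpha * A + (1 - alpha) * np.eye(d)) ** (1 / (2 * alpha))
    return (1 - num / den) / (1 - alpha)

def alpha_discrepancy_spectrum(J, alpha):
    """Eq. (3): the same quantity from the singular values of J."""
    tau = np.linalg.svd(J, compute_uv=False)
    prod = np.prod(tau / (1 + alpha * tau**2 - alpha) ** (1 / (2 * alpha)))
    return (1 - prod) / (1 - alpha)

rng = np.random.default_rng(1)
J = rng.normal(size=(5, 3))      # Jacobian of a toy map from R^3 to R^5
for alpha in (0.25, 0.5, 0.75):
    print(alpha_discrepancy(J, alpha), alpha_discrepancy_spectrum(J, alpha))
```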

It is well known (see, e.g., Cichocki et al., 2015; Hernandez-Lobato et al., 2016) that the α-divergence presents different properties for different settings of α. For example, when α ≥ 1, minimizing D_α(p : q) will enforce q > 0 (zero-avoiding) wherever p > 0. Similarly, we have the following remark on the setting of α.

Remark 6. Based on Remark 5 and Corollary 1, the value of 𝒟_{α,f}(ȳ) can be sensitive to the setting of α.

• If α → 1, small singular values of J(ȳ) that are close to 0 lead to a high discrepancy. Minimizing 𝒟_{α,f}(ȳ) helps avoid singularity.
• If α → 0, large singular values of J(ȳ) that are larger than 1 lead to a high discrepancy. Minimizing 𝒟_{α,f}(ȳ) favors contractive mappings (Rifai et al., 2011), where the scale of the Jacobian is constrained.

4 Empirical α-discrepancy
In this section, we show that the proposed α-discrepancy is computationally friendly. By definition, the global α-discrepancy 𝒟_{α,f} is an expectation of the local α-discrepancy 𝒟_{α,f}(ȳ) w.r.t. a given prior distribution U(ȳ). Using m i.i.d. samples {ȳ_i}_{i=1}^m ∼ U(ȳ), we can estimate

$$\mathcal{D}_{\alpha,f} \approx \frac{1}{m} \sum_{i=1}^{m} \mathcal{D}_{\alpha,f}(\bar{y}_i).$$

In machine learning problems, we are usually already given a set of observed points {x_i}_{i=1}^m ⊂ 𝒳, whose latent representations {ȳ_i}_{i=1}^m can be "learned." We may choose U(ȳ) to be the empirical distribution, so that U(ȳ) := (1/m)∑_{i=1}^m δ(ȳ − ȳ_i), where δ(·) is the Dirac delta function. Therefore, the problem reduces to computing 𝒟_{α,f}(ȳ) for some given α, f, and ȳ.
By definition, 𝒟_{α,f}(ȳ) is the α-divergence between a Gaussian distribution p_ȳ and a positive measure γs_ȳ. As s_ȳ is in simple closed form by our Assumption 2, and γ can be fixed based on Lemma 1, the remaining problem is to compute the precision matrix of p_ȳ, given by J^⊤(ȳ)M(f(ȳ))J(ȳ), which depends on the Jacobian J(ȳ) of the mapping f. Afterward, 𝒟_{α,f}(ȳ) can be computed based on either Theorem 1 or a numerical integrator.

For deep neural networks, the mapping f is usually specified by a composition of linear mappings and nonlinear activation layers, so that

$$f := f_L \circ \cdots \circ f_2 \circ f_1,$$

where L ≥ 1 is the number of layers, f_l(h_l) := ν(W_l h_l + b_l) for l = 1, …, L, h_l is the vector input of the l-th layer, h_1 := y, W_l is the weight matrix, b_l is the bias term, and ν is an elementwise nonlinear activation function, e.g., ReLU (Nair and Hinton, 2010). By the chain rule,

$$J(y) = \prod_{l=1}^{L} A_l(h_l)\, W_l, \tag{4}$$

where A_l(h_l) is a diagonal matrix with diagonal entries ν′(W_l h_l + b_l). Therefore, J(y) is expressed as a series of matrix multiplications. For a general neural network mapping f, not limited to the above simple case, one can use the "Jacobian" programming interface that is highly optimized in modern autodifferentiation frameworks (Paszke et al., 2019).
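For instance, the following minimal sketch (ours, using PyTorch's torch.autograd.functional.jacobian; the small MLP is an arbitrary example) computes J(y) and the pullback matrix J^⊤J for a toy mapping:

```python
import torch

# Jacobian of a small MLP f : R^2 -> R^5 via the autodiff Jacobian
# interface, then the pullback matrix J^T J (ambient metric M(x) = I).
f = torch.nn.Sequential(
    torch.nn.Linear(2, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 5),
)

y = torch.randn(2)
J = torch.autograd.functional.jacobian(f, y)   # shape (5, 2)
M_pullback = J.T @ J                            # pullback of M(x) = I
print(M_pullback)
```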
In the following, we show that 𝒟_{α,f}(ȳ) can be approximated by Monte Carlo sampling. By the definition of 𝒟_{α,f}(ȳ), we only need to estimate D_α(p_ȳ : γs_ȳ). For a given ȳ, we draw a set of n neighbors {y_j}_{j=1}^n ∼ R_ȳ(y), where R_ȳ is a reference probability distribution defined on the same support as p_ȳ and s_ȳ. Therefore,

$$\hat{D}_\alpha(p_{\bar{y}} : \gamma s_{\bar{y}}) = \frac{1}{1-\alpha} + \frac{1}{n} \sum_{j=1}^{n} \left[ \frac{\gamma\, s_{\bar{y}}(y_j)}{\alpha\, R_{\bar{y}}(y_j)} - \frac{p_{\bar{y}}^{\alpha}(y_j)\,\gamma^{1-\alpha}\, s_{\bar{y}}^{1-\alpha}(y_j)}{\alpha(1-\alpha)\, R_{\bar{y}}(y_j)} \right] \tag{5}$$

gives an unbiased estimate of D_α(p_ȳ : γs_ȳ), where the notation D̂_α(p_ȳ : γs_ȳ) is abused, as it depends on the samples {y_j}_{j=1}^n. Indeed,

$$\mathbb{E}_{R_{\bar{y}}(y)}\, \hat{D}_\alpha(p_{\bar{y}} : \gamma s_{\bar{y}}) = \frac{1}{1-\alpha} + \int R_{\bar{y}}(y) \left[ \frac{\gamma\, s_{\bar{y}}(y)}{\alpha\, R_{\bar{y}}(y)} - \frac{p_{\bar{y}}^{\alpha}(y)\,\gamma^{1-\alpha}\, s_{\bar{y}}^{1-\alpha}(y)}{\alpha(1-\alpha)\, R_{\bar{y}}(y)} \right] dy = \frac{1}{1-\alpha} + \frac{1}{\alpha} \int \gamma\, s_{\bar{y}}(y)\, dy - \frac{1}{\alpha(1-\alpha)} \int p_{\bar{y}}^{\alpha}(y)\,\gamma^{1-\alpha}\, s_{\bar{y}}^{1-\alpha}(y)\, dy,$$

which is exactly D_α(p_ȳ : γs_ȳ). By the law of large numbers, D̂_α(p_ȳ : γs_ȳ) converges to the α-divergence D_α(p_ȳ : γs_ȳ) as n → ∞. This holds regardless of the choice of the reference distribution R_ȳ(y). However, different R_ȳ(y) may result in different estimation variances. The right-hand side (RHS) of Eq. (5) is a convex function of γ, and the optimal γ* which minimizes D̂_α(p_ȳ : γs_ȳ) is in closed form (similar to Lemma 1). Then, one can estimate

$$\mathcal{D}_{\alpha,f}(\bar{y}) \approx \min_{\gamma \in \mathbb{R}_+} \hat{D}_\alpha(p_{\bar{y}} : \gamma s_{\bar{y}}).$$

This empirical α-discrepancy is useful when D_α(p_ȳ : γs_ȳ) is not available in closed form.
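The following sketch (ours) illustrates the estimator of Eq. (5) in one dimension, with a standard Gaussian reference R_ȳ and a grid search over γ standing in for the closed-form minimizer; the precision value 4 for p_ȳ is an arbitrary choice:

```python
import numpy as np

# Monte Carlo estimator of Eq. (5) in 1-D with reference point ybar = 0.
rng = np.random.default_rng(0)
alpha, n = 0.5, 100_000

def p(y):   # local neighborhood density: Gaussian with precision 4
    return np.sqrt(4 / (2 * np.pi)) * np.exp(-2.0 * y**2)

def s(y):   # unnormalized similarity exp(-||y||^2 / 2)
    return np.exp(-0.5 * y**2)

def R(y):   # reference density: standard Gaussian
    return np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)

y = rng.normal(size=n)   # samples drawn from R

def d_hat(gamma):
    term = gamma * s(y) / (alpha * R(y)) \
         - p(y)**alpha * gamma**(1 - alpha) * s(y)**(1 - alpha) \
           / (alpha * (1 - alpha) * R(y))
    return 1 / (1 - alpha) + term.mean()

gammas = np.linspace(0.05, 2.0, 400)
vals = [d_hat(g) for g in gammas]
print(gammas[np.argmin(vals)], min(vals))   # approx. gamma* and the local discrepancy
```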

5 Connections to existing methods


In this section, we show that machine learning based on the proposed α-discrepancy encompasses some well-studied techniques, and we thus uncover connections between those existing methods.

5.1 Neighborhood embeddings


Neighborhood embeddings (Carreira-Perpiñán, 2010; Cook et al., 2007; Hinton and Roweis, 2003; Lee et al., 2013; van der Maaten and Hinton, 2008; Venna and Kaski, 2007; Yang et al., 2014) are based on our third paradigm introduced in Section 1. Based on similar principles, this meta-family of dimensionality reduction methods aims to find the embedding (the inverse of f : 𝒴 → 𝒳) based on a given set {x_i} ⊂ 𝒳. In this section, we provide a unified view of these embedding techniques based on the proposed α-discrepancy.

Often, neighborhood embeddings are nonparametric methods, meaning that the parametric form of the mapping f (or its inverse g) is not explicitly learned. Instead, the embedding points {y_i = g(x_i)}, which implicitly give g(·), are obtained by minimizing a loss function. They directly generalize to parametric approaches (Carreira-Perpiñán and Vladymyrov, 2015). Our analysis can be applied to both cases. However, the Jacobian may be difficult to obtain for a nonparametric mapping and requires approximation techniques.

In the empirical α-discrepancy, we set the reference distribution R_ȳ to be uniform across the latent points {y_i} associated with the observed samples {x_i}. This is only a technical treatment, which allows us to arrive at a simple loss function and to connect with related embedding methods. Other choices of R_ȳ lead to more general embedding approaches. Thus, we simply set R_ȳ(y_j) = ρ > 0 for all j.
From Eq. (5), we get

$$\hat{D}_\alpha(p_{\bar{y}} : \gamma s_{\bar{y}}) = \frac{1}{1-\alpha} + \frac{1}{n} \sum_{j=1}^{n} \left[ \frac{\gamma\, s_{\bar{y}}(y_j)}{\alpha \rho} - \frac{p_{\bar{y}}^{\alpha}(y_j)\,\gamma^{1-\alpha}\, s_{\bar{y}}^{1-\alpha}(y_j)}{\alpha(1-\alpha)\, \rho} \right]. \tag{6}$$

By differentiating the RHS w.r.t. γ, we obtain the optimal γ:

$$(\gamma^\star)^{\alpha} = \frac{\sum_{j=1}^{n} p_{\bar{y}}^{\alpha}(y_j)\, s_{\bar{y}}^{1-\alpha}(y_j)}{\sum_{j=1}^{n} s_{\bar{y}}(y_j)}.$$

Plugging γ* into Eq. (6), we get

$$\hat{D}_\alpha(p_{\bar{y}} : \gamma^\star s_{\bar{y}}) = \frac{1}{1-\alpha} - \frac{1}{(1-\alpha)\, n \rho} \left[ \sum_{j=1}^{n} p_{\bar{y}}^{\alpha}(y_j)\, q_{\bar{y}}^{1-\alpha}(y_j) \right]^{1/\alpha}, \tag{7}$$

where q_ȳ(y_j) = s_ȳ(y_j)/∑_{j=1}^n s_ȳ(y_j). From Lemma 1 and the related discussions, the RHS of Eq. (7) is the α-divergence between the two probability mass functions p_ȳ(y_j) and q_ȳ(y_j), up to a monotonic transformation. Letting α → 1, it reduces to the KL divergence, which, depending on the choice of s_ȳ, is exactly the loss function of stochastic neighbor embedding (SNE) (Hinton and Roweis, 2003) or t-SNE (van der Maaten and Hinton, 2008). These are implemented by minimizing the empirical average of D̂_α(p_ȳᵢ : γ*s_ȳᵢ) over {y_i}. Note that the y_j are sampled according to R_ȳᵢ, which is uniform on the population {y_i}.



In other words, we use each y_i as the reference point, and use all the other samples as its neighbors.

A further algorithmic detail in SNE (Hinton and Roweis, 2003) concerns computing the probabilities p_ȳ(y_j). Essentially, SNE sets M(x) = λ(x)I, where λ := λ(x) > 0 is a scalar depending on x that can be computed based on entropy constraints. It helps model observed data with different densities in different regions of 𝒳. Then p_ȳ(y_j) is computed based on

$$p_{\bar{y}}(y_j) \propto \exp\!\left( -\frac{1}{2} (y_j - \bar{y})^\top J^\top M(x)\, J\, (y_j - \bar{y}) \right) \approx \exp\!\left( -\frac{\lambda}{2} \left\| f(y_j) - f(\bar{y}) \right\|^2 \right).$$

As J is the local linear approximation of the nonlinear mapping f, we have ‖J(y_j − ȳ)‖ ≈ ‖f(y_j) − f(ȳ)‖ if y_j and ȳ are close enough.
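The connection can be made concrete with a few lines of code. The following sketch (ours, on random toy data) computes the pairwise p and q for one reference point and the per-point KL loss that Eq. (7) reduces to as α → 1:

```python
import numpy as np

# SNE-style per-point loss: p from squared distances in the data space,
# q from squared distances in the embedding space, then KL(p : q).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))          # observed points in R^10
Y = rng.normal(size=(20, 2))           # candidate embedding in R^2
i, lam = 0, 1.0                        # reference index, precision lambda

dx = np.sum((X - X[i]) ** 2, axis=1)   # ||x_j - x_i||^2
dy = np.sum((Y - Y[i]) ** 2, axis=1)   # ||y_j - y_i||^2
mask = np.arange(len(X)) != i          # exclude the reference itself

p = np.exp(-0.5 * lam * dx[mask]); p /= p.sum()
q = np.exp(-0.5 * dy[mask]);       q /= q.sum()
kl = np.sum(p * np.log(p / q))         # per-point SNE loss KL(p_i : q_i)
print(kl)
```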
Let us examine the general loss on the RHS of Eq. (6). For simplicity, let 0 < α < 1. The last term in the bracket imposes an attraction between ȳ and y_j, as it tries to maximize the similarity s_ȳ(y_j). The first term in the bracket forms an input-independent background repulsion (Carreira-Perpiñán, 2010), in the sense that it does not depend on p_ȳ(y_j), and the similarity s_ȳ(y_j) is to be minimized.
When learning neighborhood embeddings, one may either apply Eq. (6), where γ ∈ ℝ₊ is a free parameter, or Eq. (7), where γ is fixed to γ*. Most methods in this family take the latter approach. By Lemma 1, the two are actually equivalent and have the same global optimal solution. However, they may present different properties during numerical optimization.

If we consider γ as a hyperparameter that can be set to different values (instead of fixed to γ* or set free), Eq. (6) becomes a general formulation of elastic embedding (EE) (Carreira-Perpiñán, 2010), where the attraction and repulsion terms have customized weights, and the loss does not correspond to the KL divergence between normalized probability distributions. In our formulation, the reference distribution R_ȳ corresponds to these weights. For general values of α as a hyperparameter, Eq. (6) gives neighbor embeddings based on the α-divergence (Lee et al., 2013; Venna and Kaski, 2007; Yang et al., 2014). Other choices of the information divergence extend to broader families of embeddings (Narayan et al., 2015).

5.2 Autoencoders
Autoencoder networks try to learn both f_θ : 𝒴 → 𝒳 (the decoder) and its corresponding projection g_θ : 𝒳 → 𝒴 (the encoder) to represent a given set of observations {x_i} ⊂ 𝒳. The following (known) statement presents an intrinsic geometry of autoencoders.

Proposition 3. Let Assumptions 1, 2, and 3 hold. In an autoencoder network 𝒳 →(g) 𝒴 →(f) 𝒳, the decoder f induces a metric tensor in the latent space 𝒴, given by M(y) = J_f^⊤(y) M(f(y)) J_f(y); the encoder g : 𝒳 → 𝒴 induces a metric tensor on the manifold 𝒵 = 𝒳/∼_g (the quotient space w.r.t. the equivalence relation ∼_g: x₁ ∼_g x₂ if and only if g(x₁) = g(x₂)), given by M(z) = J_g^⊤(z) M(g(z)) J_g(z).

The proof (omitted) is straightforward from the definition of the pullback metric. Note that the induced metric tensor is not necessarily a Riemannian metric. For example, in ReLU networks (Nair and Hinton, 2010), f is not smooth, and the metric tensor M(y) may not vary smoothly along 𝒴. If J_f has zero singular values, the metric tensor may be singular and therefore does not satisfy the conditions of a Riemannian metric. The second part of Proposition 3 requires the quotient space 𝒵 to have a smooth manifold structure, which is not guaranteed.
Let us assume the input space 𝒳 is Euclidean and M(x) = I for all x ∈ 𝒳. The decoder-induced metric is simply J_f^⊤(y) J_f(y). Depending on the choice of s_ȳ, the α-discrepancy is a function of the matrix J_f^⊤(y) J_f(y) and can have simple expressions such as the formula of 𝒟_{α,f} in Theorem 1. Therefore, the α-discrepancy can potentially be used as an information theoretical regularizer for deep autoencoders to penalize a "complex" f_θ. Previous works on regularizing autoencoders (Rifai et al., 2011) based on the Jacobian matrix can therefore be connected with our definition. If s_ȳ(y) := exp(−½‖y − ȳ‖²), minimizing the α-discrepancy pushes J_f^⊤(ȳ) J_f(ȳ) toward I. A small value of the α-discrepancy means that the neural network mapping f_θ is locally an orthogonal transformation, which is considered to have a low complexity. By our discussion at the end of Section 3, different settings of α lead to different complexity measures based on the spectrum of J_f^⊤(ȳ) J_f(ȳ). This provides flexible tools to regularize the learning of deep autoencoders.
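As an illustration of this regularization idea (a sketch under our own assumptions; the chapter does not prescribe an implementation), the α = 1 closed form of Remark 4 with M = I can be evaluated on a toy decoder and added to a training loss:

```python
import torch

# Decoder regularizer from the alpha = 1 closed form of Remark 4 with M = I:
# (1/2) log|J^T J| + (1/2) tr((J^T J)^{-1}) - d/2, which vanishes iff J^T J = I.
decoder = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 8),
)

def d1_penalty(decoder, y_ref):
    # For actual training, pass create_graph=True so the penalty is
    # differentiable w.r.t. the decoder parameters.
    J = torch.autograd.functional.jacobian(decoder, y_ref)  # (8, 2)
    A = J.T @ J                                              # pullback, (2, 2)
    d = A.shape[0]
    return 0.5 * torch.logdet(A) + 0.5 * torch.trace(torch.linalg.inv(A)) - d / 2

y_ref = torch.randn(2)
print(d1_penalty(decoder, y_ref))   # add this (averaged over y) to the loss
```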
As a popular deep generative model, variational autoencoders (VAEs) (Kingma and Welling, 2014) learn a parametric model p_θ(x, y) by maximizing an evidence lower bound of the log-likelihood of some observed data. This is equivalent to minimizing the KL divergence (or in general the α-divergence, as in Li and Turner (2016)) between the true posterior distribution p_θ(y | x) and a parametric variational distribution q_φ(y | x) implemented by the encoder. The loss to be minimized consists of the reconstruction error plus a complexity term given by KL(q_φ(y | x) : p(y)), the KL divergence between the approximate posterior q_φ(y | x) and the prior p(y). Comparatively, our proposed α-discrepancy is based on two local neighborhood densities w.r.t. the pullback geometry and the intrinsic geometry of 𝒴, respectively. VAEs have been generalized based on Riemannian geometry (Miolane and Holmes, 2020).
We develop a unified perspective on the pullback metric and Bayes' rule. Consider the simple case where both 𝒳 and 𝒴 are Euclidean spaces, and p(x | y) = G(x | f(y), λI), where λ > 0. That means our deterministic mapping f is randomized w.r.t. a covariance matrix λ⁻¹I. By Bayes' rule,

$$p(y \mid x) \propto U(y)\, p(x \mid y) \propto \exp\!\left( \log U(y) - \frac{\lambda}{2} \| x - f(y) \|^2 \right).$$

Denote by g : 𝒳 → 𝒴 an approximation of f⁻¹. A Taylor expansion of f(y) around g(x) yields

$$p(y \mid x) \propto \exp\!\left( \log U(y) - \frac{\lambda}{2} \left\| x - f(g(x)) - J(g(x))\,(y - g(x)) \right\|^2 + o(\| y - g(x) \|) \right),$$

where o(·) is the little-o notation. If λ is large enough, meaning the conditional distribution p(x | y) tends to be deterministic, then the second term dominates. We arrive at the rough approximation

$$p(y \mid x) \approx G\!\left( y \mid g(x),\, \lambda\, J^\top(g(x))\, J(g(x)) \right).$$

Hence, the covariance of the posterior is related to the pullback metric at g(x).
The notion of pullback metric has been investigated in metric learning (Lebanon, 2005) and manifold learning (Sun and Marchand-Maillet, 2014). Latent space geometry is studied in Gaussian process latent variable models (Tosi et al., 2014). In the realm of deep learning, latent manifolds learned by deep generative models (including VAEs) have been explored (Arvanitidis et al., 2018; Hauberg, 2018; Shao et al., 2018). In this context, stochastic metric tensors and the expected metric have been investigated (Arvanitidis et al., 2018; Hauberg, 2018). Notice that we only consider deterministic metrics, as our mapping f is deterministic. A separate neural network can be used to learn the metric tensor (Kuhnel et al., 2018). Information geometry and the pullback metric have been applied to graph neural networks (Sun et al., 2020). These previous efforts on deep generative models focus on the study of geometric measurements (e.g., geodesics, curvature, and means) in the latent space obtained in deep learning. In contrast, our objective is to derive information theoretical complexity measures of a mapping f.

6 Conclusion and extensions


We study the fundamental problem of how to measure the information carried by a nonlinear mapping f from a latent space 𝒴 to the observation space 𝒳. We define the concept of α-discrepancy, a one-parameter family of information theoretical measurements given by a function 𝒟_{α,f}(ȳ) on the latent space 𝒴, where α ∈ ℝ. Intuitively, 𝒟_{α,f}(ȳ) measures the discrepancy between f and an isometry locally at each ȳ ∈ 𝒴. Our definition is invariant to reparameterization (and therefore intrinsic) and is independent of the empirical observations. Both neighborhood embeddings and deep representation learning are connected with this concept. It gives theoretical insights into, and generalizations of, these methods.

Essentially, the proposed embedding α-discrepancy measures how "far" f is from an isometry, with the α-discrepancy achieving 0 for an isometry. We can further extend its definition to measure how far f is from a conformal mapping. We have the following definition.
Definition 2 (Conformal α-discrepancy). With respect to Assumptions 1, 2, and 3, the local conformal α-discrepancy of f : 𝒴 → 𝒳 is a function on 𝒴:

$$\mathcal{C}_{\alpha,f}(\bar{y}) := \inf_{\gamma,\, \zeta \in C(\mathcal{Y})} D_\alpha\!\left( p_{\bar{y}} : \gamma(\bar{y})\, s_{\bar{y},\zeta} \right), \tag{8}$$

where s_{ȳ,ζ}(y) := exp(−(ζ(ȳ)/2)‖y − ȳ‖²). The conformal α-discrepancy w.r.t. Assumption 4 is

$$\mathcal{C}_{\alpha,f} := \mathbb{E}_{U(\bar{y})}\, \mathcal{C}_{\alpha,f}(\bar{y}).$$

Note that s is abused to denote both s_ȳ in Definition 1 and s_{ȳ,ζ} in Definition 2. For a given ȳ ∈ 𝒴, the conformal α-discrepancy is the minimal α-divergence between p_ȳ and γ(ȳ) exp(−(ζ(ȳ)/2)‖y − ȳ‖²), where γ(ȳ) and ζ(ȳ) are free parameters depending on ȳ. In practice, one may constrain ζ(ȳ) to vary inside a subrange of ℝ₊ to avoid trivial or undesired solutions.

If s_ȳ(y) := exp(−½‖y − ȳ‖²) is used to compute 𝒟_{α,f}(ȳ), we have the following relationship by definition:

$$0 \le \mathcal{C}_{\alpha,f}(\bar{y}) \le \mathcal{D}_{\alpha,f}(\bar{y}).$$

Intuitively, the positive measure γ(ȳ)s_{ȳ,ζ} is more flexible than γ(ȳ)s_ȳ, as it has one additional free parameter ζ(ȳ). Similar to Proposition 1, a small value of 𝒞_{α,f} indicates that f is close to a conformal mapping, and vice versa. 𝒞_{α,f} is useful for exploring learning objectives for conformal manifold learning. It is also useful as a regularizer for deep learning, enforcing a neural network mapping f_θ to be conformal-like. Related theoretical statements are similar to those for 𝒟_{α,f} and are omitted for brevity.
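Numerically, the extra infimum over ζ can be approximated by a simple grid search. The following 1-D sketch (ours; the pullback precision value is arbitrary) uses the Hellinger-integral expression of Lemma 1 to handle γ in closed form:

```python
import numpy as np

# Conformal alpha-discrepancy in 1-D: grid search over the scale zeta,
# with gamma eliminated via Lemma 1. In 1-D any linear map is conformal,
# so the minimum should be ~0 at zeta close to the pullback precision a.
alpha = 0.5
y = np.linspace(-20, 20, 40001); dy = y[1] - y[0]

a = 4.0                                    # pullback precision J^T M J (scalar)
p = np.sqrt(a / (2 * np.pi)) * np.exp(-0.5 * a * y**2)

def c_of_zeta(zeta):
    q = np.sqrt(zeta / (2 * np.pi)) * np.exp(-0.5 * zeta * y**2)  # normalized s
    hellinger = (p**alpha * q**(1 - alpha)).sum() * dy
    return (1 - hellinger ** (1 / alpha)) / (1 - alpha)           # Lemma 1

zetas = np.linspace(0.1, 10, 200)
vals = [c_of_zeta(z) for z in zetas]
print(zetas[np.argmin(vals)], min(vals))   # zeta* ~ a, and C ~ 0
```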
Consider the empirical α-discrepancy in Section 4. We can choose the reference distribution R(y) = q_ȳ(y) := s_ȳ(y)/∫s_ȳ(y)dy, leading to another approximation of Definition 1 (an alternative empirical α-discrepancy):

$$\hat{D}_\alpha(p_{\bar{y}} : \gamma s_{\bar{y}}) = \frac{1}{1-\alpha} + \frac{1}{n} \sum_{j=1}^{n} \left[ \frac{\gamma}{\alpha} - \frac{\gamma^{1-\alpha}}{\alpha(1-\alpha)} \left( \frac{p_{\bar{y}}(y_j)}{s_{\bar{y}}(y_j)} \right)^{\alpha} \right] \int s_{\bar{y}}(y)\, dy. \tag{9}$$

This approximation could be favored, as the RHS has a simple expression, and the y_j can be sampled according to a simple latent distribution, e.g., an isotropic Gaussian distribution centered around ȳ. D̂_α(p_ȳ : γs_ȳ) can be evaluated with the reparameterization trick (Kingma and Welling, 2014) as y_j = ȳ + ε, where ε is drawn from a zero-centered Gaussian distribution.
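The following sketch (ours) implements the estimator of Eq. (9) in one dimension with the Gaussian similarity, drawing the neighbors by the reparameterization trick; as before, the pullback precision and the grid over γ are arbitrary illustrative choices:

```python
import numpy as np

# Alternative estimator of Eq. (9) in 1-D with s(y) = exp(-(y - ybar)^2 / 2):
# neighbors y_j = ybar + eps, eps ~ N(0, 1), so the reference R equals the
# normalized similarity q.
rng = np.random.default_rng(0)
alpha, n, ybar = 0.5, 200_000, 0.0
a = 4.0                                            # pullback precision at ybar

eps = rng.normal(size=n)
yj = ybar + eps                                    # samples from q = N(ybar, 1)

p = np.sqrt(a / (2 * np.pi)) * np.exp(-0.5 * a * (yj - ybar) ** 2)
s = np.exp(-0.5 * (yj - ybar) ** 2)
Z = np.sqrt(2 * np.pi)                             # integral of s

def d_hat(gamma):
    bracket = gamma / alpha \
        - gamma ** (1 - alpha) / (alpha * (1 - alpha)) * (p / s) ** alpha
    return 1 / (1 - alpha) + bracket.mean() * Z

gammas = np.linspace(0.05, 2.0, 400)
print(min(d_hat(g) for g in gammas))               # approx. local discrepancy
```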
In certain applications, it may be reasonable to assume that 𝒳 and/or 𝒴 is a statistical manifold (a space of probability distributions). For example, for a classifier deep neural network, the output space 𝒳 is the simplex

$$\Delta^d = \left\{ x : \sum_{i=1}^{d+1} x_i = 1;\ x_i > 0,\ \forall i \right\}.$$

The unique metric of a statistical manifold is given by the Fisher information metric. Thus, the definition of the α-discrepancy can be extended by using such a geometry.
Acknowledgment
The author thanks the anonymous reviewers for their insightful comments and timely reviews.

Appendices
In the following we provide outline proofs of the theoretical statements.

Appendix A. Proof of Lemma 1


Proof. We have

$$D_\alpha(p : \gamma s) = \frac{1}{\alpha(1-\alpha)} \int \left[ \alpha p(y) + (1-\alpha)\,\gamma s(y) - p^{\alpha}(y)\,\gamma^{1-\alpha} s^{1-\alpha}(y) \right] dy. \tag{A.1}$$

It is easy to see that the RHS is a convex function w.r.t. γ. Therefore, the optimal γ* can be obtained by solving

$$\frac{\partial D_\alpha(p : \gamma s)}{\partial \gamma} = \frac{1}{\alpha(1-\alpha)} \int \left[ (1-\alpha)\, s(y) - (1-\alpha)\, p^{\alpha}(y)\,\gamma^{-\alpha} s^{1-\alpha}(y) \right] dy = 0,$$

which reduces to

$$\int s(y)\, dy = (\gamma^\star)^{-\alpha} \int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy.$$

Therefore,

$$(\gamma^\star)^{\alpha} = \frac{\int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy}, \qquad \gamma^\star = \left( \frac{\int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy} \right)^{1/\alpha}.$$

From the above derivations, we have

$$\gamma^\star \int s(y)\, dy = (\gamma^\star)^{1-\alpha} \int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy = \int p^{\alpha}(y)\, (\gamma^\star)^{1-\alpha} s^{1-\alpha}(y)\, dy.$$

Plugging the above equation into Eq. (A.1), we get

$$\begin{aligned}
D_\alpha(p : \gamma^\star s) &= \frac{1}{\alpha(1-\alpha)} \int \left[ \alpha p(y) + (1-\alpha)\,\gamma^\star s(y) - \gamma^\star s(y) \right] dy \\
&= \frac{1}{\alpha(1-\alpha)} \int \left[ \alpha p(y) - \alpha\,\gamma^\star s(y) \right] dy \\
&= \frac{1}{1-\alpha} \int \left[ p(y) - \gamma^\star s(y) \right] dy \\
&= \frac{1}{1-\alpha} \left[ 1 - \left( \frac{\int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy}{\int s(y)\, dy} \right)^{1/\alpha} \int s(y)\, dy \right] \\
&= \frac{1}{1-\alpha} \left[ 1 - \frac{\left( \int p^{\alpha}(y)\, s^{1-\alpha}(y)\, dy \right)^{1/\alpha}}{\left( \int s(y)\, dy \right)^{(1-\alpha)/\alpha}} \right] \\
&= \frac{1}{1-\alpha} \left[ 1 - \left( \int p^{\alpha}(y) \left( \frac{s(y)}{\int s(y)\, dy} \right)^{1-\alpha} dy \right)^{1/\alpha} \right] \\
&= \frac{1}{1-\alpha} \left[ 1 - \left( \int p^{\alpha}(y)\, q^{1-\alpha}(y)\, dy \right)^{1/\alpha} \right].
\end{aligned}$$

To compare D_α(p : γ*s) with D_α(p : q), we recall from Eq. (2) that

$$D_\alpha(p : q) = \frac{1}{\alpha(1-\alpha)} \left[ 1 - \int p^{\alpha}(y)\, q^{1-\alpha}(y)\, dy \right]. \tag{A.2}$$

Appendix B. Proof of Proposition 1



If s_ȳ(y) = exp(−½‖y − ȳ‖²), then its corresponding normalized density is G_ȳ(y) = G(y | ȳ, I), which is the multivariate Gaussian distribution with mean ȳ and covariance matrix I.

By Lemma 1,

$$\begin{aligned}
\operatorname*{arg\,min}_{f}\, \mathcal{D}_{\alpha,f} &= \operatorname*{arg\,min}_{f} \int U(\bar{y})\, \mathcal{D}_{\alpha,f}(\bar{y})\, d\bar{y} \\
&= \operatorname*{arg\,min}_{f} \int U(\bar{y}) \inf_{\gamma \in C(\mathcal{Y})} D_\alpha\big(p_{\bar{y}} : \gamma(\bar{y})\, s_{\bar{y}}\big)\, d\bar{y} \tag{B.1} \\
&= \operatorname*{arg\,min}_{f} \int U(\bar{y})\, D_\alpha(p_{\bar{y}} : G_{\bar{y}})\, d\bar{y}.
\end{aligned}$$

In the above minimization problem, notice that f only appears in the term p_ȳ on the RHS.

If f : 𝒴 → 𝒳 is an isometry, then the pullback metric is equivalent to the Riemannian metric of 𝒴, which is given by the identity matrix I. Therefore, ∀ȳ ∈ 𝒴,

$$J^\top M(x)\, J = I.$$

Therefore, ∀ȳ ∈ 𝒴,

$$p_{\bar{y}}(y) = G\big(y \mid \bar{y},\, J^\top M(x)\, J\big) = G(y \mid \bar{y},\, I) = G_{\bar{y}}(y).$$

Based on the basic properties of the α-divergence, we know that p_ȳ = G_ȳ for all ȳ ∈ 𝒴 implies D_α(p_ȳ : G_ȳ) = 0 for all ȳ ∈ 𝒴. At the same time,

$$\int U(\bar{y})\, D_\alpha(p_{\bar{y}} : G_{\bar{y}})\, d\bar{y} \ge 0.$$

Therefore, an isometric f must be an optimal solution of Eq. (B.1).

Appendix C. Proof of Proposition 2


The proof needs the invariance property of the f-divergence. Given a convex function σ : ℝ₊ → ℝ and a pair of normalized probability densities p(x) and q(x), their f-divergence has the form

$$D_\sigma(p : q) := \int p(x)\, \sigma\!\left( \frac{q(x)}{p(x)} \right) dx. \tag{C.1}$$

Note that we use σ(·), as f(·) is already used to denote the generative mapping from 𝒴 to 𝒳. The f-divergence satisfies the following invariance: under a coordinate transformation given by a diffeomorphism Φ : x → x′, the probability density becomes p′(x′), and the f-divergence is unchanged:

$$D_\sigma\big(p'(x') : q'(x')\big) = D_\sigma\big(p(x) : q(x)\big).$$

If we set σ(t) = (t − t^{1−α})/(α(1−α)) and plug it into Eq. (C.1), we get exactly D_α(p : q) in Eq. (2). Therefore, the α-divergence between two probability densities belongs to the family of f-divergences.

Proof. We first show that 𝒟_{α,Φ_𝒳∘f}(ȳ) = 𝒟_{α,f}(ȳ) everywhere. Consider the observation space 𝒳 reparameterized to a new coordinate system x′ based on the diffeomorphism Φ_𝒳 : x → x′. The Riemannian metric is a covariant tensor and follows the transformation rule

$$J_\mathcal{X}^\top(x)\, M(\Phi_\mathcal{X}(x))\, J_\mathcal{X}(x) = M(x), \tag{C.2}$$

where J_𝒳 denotes the Jacobian of Φ_𝒳.

By the chain rule, the Jacobian of the mapping Φ_𝒳 ∘ f : y → x′ is given by J_𝒳(f(y)) · J(y). Then, the pullback metric associated with Φ_𝒳 ∘ f becomes

$$\begin{aligned}
M(y) &= J^\top(y)\, J_\mathcal{X}^\top(f(y)) \cdot M\big(\Phi_\mathcal{X}(f(y))\big) \cdot J_\mathcal{X}(f(y))\, J(y) \\
&= J^\top(y) \cdot \left[ J_\mathcal{X}^\top(f(y))\, M\big(\Phi_\mathcal{X}(f(y))\big)\, J_\mathcal{X}(f(y)) \right] \cdot J(y) \\
&= J^\top(y)\, M(f(y))\, J(y).
\end{aligned}$$

In Definition 1, p_ȳ is invariant w.r.t. Φ_𝒳. Therefore, ∀ȳ ∈ 𝒴, 𝒟_{α,f}(ȳ) = 𝒟_{α,Φ_𝒳∘f}(ȳ).

In the following, we show that ∀ȳ ∈ 𝒴, 𝒟_{α,f}(ȳ) = 𝒟_{α,f∘Φ_𝒴}(ȳ), where Φ_𝒴 : 𝒴 → 𝒴 is a diffeomorphism. By Lemma 1,

$$\mathcal{D}_{\alpha,f}(\bar{y}) = \inf_{\gamma \in C(\mathcal{Y})} D_\alpha\big(p_{\bar{y}} : \gamma(\bar{y})\, s_{\bar{y}}\big) \tag{C.3}$$

is a monotonic function of D_α(p_ȳ : q_ȳ), where q_ȳ is the normalized density associated with s_ȳ. By the invariance of the α-divergence, under the diffeomorphism Φ_𝒴 : y′ → y,

$$D_\alpha(p_{\bar{y}} : q_{\bar{y}}) = D_\alpha\!\left( p_{\Phi_\mathcal{Y}^{-1}(\bar{y})} : q_{\Phi_\mathcal{Y}^{-1}(\bar{y})} \right).$$

This shows that 𝒟_{α,f}(ȳ) = 𝒟_{α,f∘Φ_𝒴}(ȳ) everywhere on 𝒴. In summary, the α-discrepancy is invariant to diffeomorphisms on 𝒳 and 𝒴. □

Appendix D. Proof of Theorem 1


Proof. We simply write J instead of J(ȳ) and M instead of M(f(ȳ)). One has to remember that they both depend on ȳ. By definition,

$$p_{\bar{y}}(y) = \frac{\left| J^\top M J \right|^{1/2}}{(2\pi)^{d/2}} \exp\!\left( -\frac{1}{2} (y - \bar{y})^\top J^\top M J\, (y - \bar{y}) \right).$$

If s_ȳ(y) := exp(−½‖y − ȳ‖²), its normalized density is another Gaussian distribution:

$$q_{\bar{y}}(y) = \frac{1}{(2\pi)^{d/2}} \exp\!\left( -\frac{1}{2} \| y - \bar{y} \|^2 \right).$$

Let β = 1 − α; then

$$\begin{aligned}
\int p_{\bar{y}}^{\alpha}(y)\, q_{\bar{y}}^{\beta}(y)\, dy
&= \frac{\left| J^\top M J \right|^{\alpha/2}}{(2\pi)^{d/2}} \int \exp\!\left( -\frac{1}{2} (y - \bar{y})^\top \left( \alpha J^\top M J + \beta I \right) (y - \bar{y}) \right) dy \\
&= \frac{\left| J^\top M J \right|^{\alpha/2}}{\left| \alpha J^\top M J + \beta I \right|^{1/2}} \int \frac{\left| \alpha J^\top M J + \beta I \right|^{1/2}}{(2\pi)^{d/2}} \exp\!\left( -\frac{1}{2} (y - \bar{y})^\top \left( \alpha J^\top M J + \beta I \right) (y - \bar{y}) \right) dy \\
&= \frac{\left| J^\top M J \right|^{\alpha/2}}{\left| \alpha J^\top M J + \beta I \right|^{1/2}}. \tag{D.1}
\end{aligned}$$

By Lemma 1,

$$\mathcal{D}_{\alpha,f}(\bar{y}) := \inf_{\gamma \in C(\mathcal{Y})} D_\alpha\big(p_{\bar{y}} : \gamma(\bar{y})\, s_{\bar{y}}\big) = \frac{1}{\beta} \left[ 1 - \left( \int p_{\bar{y}}^{\alpha}(y)\, q_{\bar{y}}^{\beta}(y)\, dy \right)^{1/\alpha} \right] = \frac{1}{\beta} \left[ 1 - \frac{\left| J^\top M J \right|^{1/2}}{\left| \alpha J^\top M J + \beta I \right|^{1/2\alpha}} \right].$$

Let α → 0. The factor 1/(1−α) → 1, and α only appears in the term

$$\frac{1}{\left| \alpha J^\top M J + (1-\alpha) I \right|^{1/2\alpha}},$$

which is a "1^∞" type of limit. We have

$$\begin{aligned}
\lim_{\alpha \to 0} \frac{1}{\left| \alpha J^\top M J + (1-\alpha) I \right|^{1/2\alpha}}
&= \lim_{\alpha \to 0} \exp\!\left( -\frac{\log\left| \alpha J^\top M J + (1-\alpha) I \right|}{2\alpha} \right) \\
&= \exp\!\left( -\lim_{\alpha \to 0} \frac{\log\left| \alpha J^\top M J + (1-\alpha) I \right|}{2\alpha} \right) \\
&= \exp\!\left( -\lim_{\alpha \to 0} \frac{\operatorname{tr}\!\left( \left( \alpha J^\top M J + (1-\alpha) I \right)^{-1} \left( J^\top M J - I \right) \right)}{2} \right) \\
&= \exp\!\left( -\frac{\operatorname{tr}\!\left( J^\top M J - I \right)}{2} \right) = \exp\!\left( -\frac{1}{2} \operatorname{tr}\!\left( J^\top M J \right) + \frac{d}{2} \right).
\end{aligned}$$

In summary, we get

$$\mathcal{D}_{0,f}(\bar{y}) := \lim_{\alpha \to 0} \mathcal{D}_{\alpha,f}(\bar{y}) = 1 - \left| J^\top M J \right|^{1/2} \exp\!\left( -\frac{1}{2} \operatorname{tr}\!\left( J^\top M J \right) + \frac{d}{2} \right) = 1 - \exp\!\left( \frac{1}{2} \log\left| J^\top M J \right| - \frac{1}{2} \operatorname{tr}\!\left( J^\top M J \right) + \frac{d}{2} \right).$$

Similarly, applying L'Hôpital's rule at α → 1,

$$\begin{aligned}
\mathcal{D}_{1,f}(\bar{y}) &:= \lim_{\alpha \to 1} \mathcal{D}_{\alpha,f}(\bar{y}) \\
&= \lim_{\alpha \to 1} \frac{\left| J^\top M J \right|^{1/2}}{\left| \alpha J^\top M J + (1-\alpha) I \right|^{1/2\alpha}} \left[ \frac{1}{2\alpha^2} \log\left| \alpha J^\top M J + (1-\alpha) I \right| - \frac{1}{2\alpha} \operatorname{tr}\!\left( \left( \alpha J^\top M J + (1-\alpha) I \right)^{-1} \left( J^\top M J - I \right) \right) \right] \\
&= \frac{\left| J^\top M J \right|^{1/2}}{\left| J^\top M J \right|^{1/2}} \left[ \frac{1}{2} \log\left| J^\top M J \right| - \frac{1}{2} \operatorname{tr}\!\left( \left( J^\top M J \right)^{-1} \left( J^\top M J - I \right) \right) \right] \\
&= \frac{1}{2} \log\left| J^\top M J \right| - \frac{d}{2} + \frac{1}{2} \operatorname{tr}\!\left( \left( J^\top M J \right)^{-1} \right).
\end{aligned}$$

Finally, we outline the posterior approximation used in Section 5.2. By assumption, log p(y) is bounded. Therefore,

$$\log p(x, y) = \log p(y) + \log p(x \mid y) = \log p(y) - \frac{\lambda}{2}\, (x - f(y))^\top M(x)\, (x - f(y)). \tag{D.2}$$

If λ → ∞, the second term dominates. A Taylor expansion of f(y) as a linear function of y yields the pullback metric.

References
Amari, S., 1995. Information geometry of the EM and em algorithms for neural networks. Neural
Netw. 8 (9), 1379–1408.
Amari, S.-i., 2016. Information Geometry and Its Applications. Applied Mathematical Sciences, vol. 194. Springer, Japan.
Arvanitidis, G., Hansen, L.K., Hauberg, S., 2018. Latent space oddity: on the curvature of deep
generative models. In: ICLR’18. Proceedings of the 6th International Conference on Learning
Representations.
Carreira-Perpiñán, M.Á., 2010. The elastic embedding algorithm for dimensionality reduction. In:
Fürnkranz, J., Joachims, T. (Eds.), ICML'10. Proceedings of the 27th International Confer-
ence on Machine Learning, Omnipress, pp. 167–174.
Carreira-Perpiñán, M.Á., Vladymyrov, M., 2015. A fast, universal algorithm to learn parametric
nonlinear embeddings. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R.
(Eds.), Advances in Neural Information Processing Systems. vol. 28. Curran Associates, Inc.
Cichocki, A., Cruces, S., Amari, S.-i., 2015. Log-determinant divergences revisited: alpha-beta
and gamma log-det divergences. Entropy 17 (5), 2988–3034.
Cook, J., Sutskever, I., Mnih, A., Hinton, G., 2007. Visualizing similarity data with a mixture of
maps. In: Meila, M., Shen, X. (Eds.), Proceedings of Machine Learning Research. Proceed-
ings of the 11th International Conference on Artificial Intelligence and Statistics, vol. 2.
PMLR, pp. 67–74.
Csiszár, I., 1967. On topological properties of f-divergences. Stud. Sci. Math. Hung. 2, 329–339.
Hauberg, S., 2018. Only Bayes should learn a manifold (on the estimation of differential geomet-
ric structure from data). CoRR abs/1806.04994. https://arxiv.org/abs/1806.04994.

Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernandez-Lobato, D., Turner, R., 2016.
Black-box alpha divergence minimization. In: Balcan, M.F., Weinberger, K.Q. (Eds.), Pro-
ceedings of Machine Learning Research. Proceedings of the 33rd International Conference
on Machine Learning, vol. 48. PMLR, pp. 1511–1520.
Hinton, G.E., Roweis, S., 2003. Stochastic neighbor embedding. In: Becker, S., Thrun, S.,
Obermayer, K. (Eds.), Advances in Neural Information Processing Systems. vol. 15. MIT Press.
Jost, J., 2011. Riemannian Geometry and Geometric Analysis. Universitext, sixth ed. Springer.
Kingma, D.P., Welling, M., 2014. Auto-encoding variational Bayes. In: ICLR’14. Proceedings of
the 2nd International Conference on Learning Representations.
Kuhnel, L., Fletcher, T., Joshi, S., Sommer, S., 2018. Latent space non-linear statistics. CoRR abs/
1805.07632. https://arxiv.org/abs/1805.07632.
Lebanon, G., 2005. Riemannian Geometry and Statistical Machine Learning (Ph.D. thesis). Car-
negie Mellon University.
Lee, J.A., Renard, E., Bernard, G., Dupont, P., Verleysen, M., 2013. Type 1 and 2 mixtures of
Kullback-Leibler divergences as cost functions in dimensionality reduction based on similar-
ity preservation. Neurocomputing 112, 92–108.
Li, Y., Turner, R.E., 2016. Rényi divergence variational inference. In: Lee, D., Sugiyama, M.,
Luxburg, U., Guyon, I., Garnett, R. (Eds.), Advances in Neural Information Processing Sys-
tems. vol. 29. Curran Associates, Inc.
Miolane, N., Holmes, S., 2020. Learning weighted submanifolds with variational autoencoders
and Riemannian variational autoencoders. In: CVPR 2020. 2020 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 14491–14499.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In:
Fürnkranz, J., Joachims, T. (Eds.), ICML'10. Proceedings of the 27th International Confer-
ence on Machine Learning, Omnipress, pp. 807–814.
Narayan, K., Punjani, A., Abbeel, P., 2015. Alpha-beta divergences discover micro and macro
structures in data. In: Bach, F., Blei, D. (Eds.), Proceedings of Machine Learning Research.
Proceedings of the 32nd International Conference on Machine Learning, vol. 37. PMLR,
pp. 796–804.
Nielsen, F., 2020. An elementary introduction to information geometry. Entropy 22 (10), 1100.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M.,
Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. Pytorch: an
imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H.,
Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (Eds.), Advances in Neural Informa-
tion Processing Systems. vol. 32. Curran Associates, Inc, pp. 8024–8035.
Pennec, X., 2006. Intrinsic statistics on Riemannian manifolds: basic tools for geometric measure-
ments. J. Math. Imaging Vis. 25 (1), 127–154.
Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y., 2011. Contractive auto-encoders: explicit
invariance during feature extraction. In: Getoor, L., Scheffer, T. (Eds.), ICML’11. Proceed-
ings of the 28th International Conference on International Conference on Machine Learning,
Omnipress, pp. 833–840.
Said, S., Bombrun, L., Berthoumieu, Y., Manton, J.H., 2017. Riemannian Gaussian distributions
on the space of symmetric positive definite matrices. IEEE Trans. Inf. Theory 63 (4),
2153–2170.
Shao, H., Kumar, A., Fletcher, P.T., 2018. The Riemannian geometry of deep generative models.
In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), pp. 428–4288.

Sun, K., 2018. Intrinsic universal measurements of non-linear embeddings. CoRR abs/
1811.01464. https://arxiv.org/abs/1811.01464.
Sun, K., 2020. Information geometry for data geometry through pullbacks. In: Deep Learning
Through Information Geometry (Workshop at Thirty-fourth Conference on Neural Informa-
tion Processing Systems).
Sun, K., Marchand-Maillet, S., 2014. An information geometry of statistical manifold learning. In:
Xing, E.P., Jebara, T. (Eds.), Proceedings of Machine Learning Research. Proceedings of the
31st International Conference on Machine Learning, vol. 32. PMLR, pp. 1–9.
Sun, K., Koniusz, P., Wang, Z., 2020. Fisher-Bures adversary graph convolutional networks. In:
Adams, R.P., Gogate, V. (Eds.), Proceedings of Machine Learning Research. Proceedings
of The 35th Uncertainty in Artificial Intelligence Conference, vol. 115. PMLR, pp. 465–475.
Tosi, A., Hauberg, S., Vellido, A., Lawrence, N.D., 2014. Metrics for probabilistic geometries. In:
Zhang, N.L., Tian, J. (Eds.), UAI’14. Proceedings of the 30th Conference on Uncertainty in
Artificial Intelligence, AUAI Press, pp. 800–808.
van der Maaten, L., Hinton, G., 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9 (Nov),
2579–2605.
Venna, J., Kaski, S., 2007. Nonlinear dimensionality reduction as information retrieval. In:
Meila, M., Shen, X. (Eds.), Proceedings of Machine Learning Research. Proceedings of the
11th International Conference on Artificial Intelligence and Statistics, vol. 2. PMLR, pp.
572–579.
Yang, Z., Peltonen, J., Kaski, S., 2014. Optimization equivalence of divergences improves neigh-
bor embedding. In: Xing, E.P., Jebara, T. (Eds.), Proceedings of Machine Learning Research.
Proceedings of the 31st International Conference on Machine Learning, vol. 32. PMLR, pp.
460–468.
Section III
Advanced geometrical intuition
Chapter 8

Parallel transport, a central tool in geometric statistics for computational anatomy: Application to cardiac motion modeling
Nicolas Guigui and Xavier Pennec*
Université Côte d'Azur and Inria, Epione team, Sophia-Antipolis, Biot, France
*Corresponding author: e-mail: xavier.pennec@inria.fr

Abstract
Transporting the statistical knowledge regressed in the neighborhood of a point to a different but related place (transfer learning) is important for many applications. In medical imaging, cardiac motion modeling and structural brain changes are two such examples: for a group-wise statistical analysis, subject-specific longitudinal deformations need to be transported to a common template anatomy.

In geometric statistics, the natural (parallel) transport method is defined by the integration of a Riemannian connection, which specifies how tangent vectors are compared at neighboring points. In this process, the numerical accuracy of the transport method is critical. Discrete methods based on iterated geodesic parallelograms inspired by Schild's ladder were shown to be very efficient and apparently stable in practice. In this chapter, we show that ladder methods are actually second-order schemes, even with numerically approximated geodesics. We also propose a new original algorithm to implement these methods in the context of the large deformation diffeomorphic metric mapping (LDDMM) framework, which endows the space of diffeomorphisms with a right-invariant RKHS metric.

When applied to the motion modeling of the cardiac right ventricle under pressure or volume overload, the method however exhibits unexpected effects in the presence of very large volume differences between subjects. We first investigate an intuitive rescaling of the modulus after parallel transport to preserve the ejection fraction. The surprisingly simple scaling/volume relationship that we obtain suggests decoupling the volume change from the deformation directly within the LDDMM metric. The parallel transport of cardiac trajectories with this new metric now reveals statistical insights into the dynamics of each disease. This example shows that parallel transport could become a tool of choice for data-driven metric optimization.

Keywords: Parallel transport, Longitudinal studies, Mean trajectory, Cardiac motion analysis, Schild's ladder, Pole ladder, Riemannian manifolds

1 Introduction
At the interface of geometry, statistics, image analysis, and medicine, Computational Anatomy aims at analyzing and modeling the variability of the biological shape of tissues and organs and their dynamics at the population level. The goal is to estimate representative anatomies across diseases, populations, species, or ages; to model organ development across time (growth or aging); to discover morphological differences between normal and pathological groups; and to estimate and correlate the variability with other functional, genetic, or structural information. In the context of cardiology, computational anatomy models the cardiac cycle as a sequence of deformations of an anatomy and makes it possible to characterize these deformations quantitatively in order to compare the impact of diseases on the cardiac function.
The analysis of the organs' shape and deformations often relies on the identification of features that describe the anatomy locally, such as landmarks, curves, surfaces, intensity patches, full images, etc. Modeling their statistical distribution in the population requires first identifying point-to-point anatomical correspondences between these geometric features across subjects. This may be feasible for landmark points, but not for curves or surfaces. Thus, one generally considers relabeled point-sets or reparameterized curves/surfaces/images as equivalent objects. With this geometric formulation, shape spaces are the quotient of the original space of features by their reparameterization group. The difficulty is that taking the quotient generally endows the shape space with a nonlinear manifold structure, even when we start from features living in a Euclidean space.

For instance, equivalence classes of k-tuples of points under rigid or similarity transformations result in nonlinear Kendall's shape spaces (see, e.g., Dryden and Mardia (2016) for a recent account of that subject). The quotient of curves, surfaces, and higher dimensional objects by their reparameterizations (diffeomorphisms of their domains) produces in general even more complex infinite dimensional shape spaces (Bauer et al., 2014). Thus, shapes in general belong to nonlinear manifolds, while statistics were essentially developed for linear and Euclidean spaces. This has motivated the use and development of a consistent statistical framework on Riemannian manifolds and Lie groups during the past 25 years, a field called Geometric Statistics in computational anatomy (Pennec et al., 2020).
Deformation-based morphometry (DBM) is also a popular method to study
statistical differences of brain anatomies. It consists in analyzing local
Parallel transport, a central tool in geometric statistics Chapter 8 287

deformation features of nonlinear image registration to a reference (Ashburner


et al., 1998). For instance, the Jacobian of the transformation encoding local
volume changes can be used to detect the areas that are statistically growing
or shrinking in a population. However, DBM analyzes the deformation indepen-
dently at each point of the image. In order to capture spatial correlations, it is
more interesting to model the transformations of all the points together, i.e.,
to lift statistics from the objects (image voxels, curves, surfaces) to the transfor-
mations of their embedding space. This powerful idea also allows us to capture
jointly the variability of several structures to model their interactions. In the
brain, for instance, one can consider together the shape of the cortex, the ven-
tricles, deep gray nuclei, and other internal brain structures. We can also include
structures that are not surfaces, such as the sulcal lines encoding the complex
folding patterns of the gray matter at the surface of the cortex.

1.1 Diffeomorphometry
Following Thompson (1917), it is often assumed that there exists a template
shape or image (called an atlas in medical image analysis) that represents
the standard (or mean) anatomy, and that the intersubject variability is
encoded by deformations of that template toward the shapes of each subject
or their evolution in time. A very desirable aspect of these transformations
is to smoothly preserve the spatial organization of the anatomical tissues by
avoiding intersections, folds, or tears. Simply encoding deformations with
the vector space of displacement fields is not sufficient to preserve the topol-
ogy: one needs to require diffeomorphic transformations (differentiable one-
to-one transformations with differentiable inverse).
Lie groups of diffeomorphisms are examples of spaces that are both infinite
dimensional manifolds and Lie groups, and the statistical analysis of shapes
through their diffeomorphic deformations has been coined diffeomorpho-
metry. This approach was pushed forward by Grenander and Miller (1998)
and turned into a mathematically grounded framework by endowing the space
of diffeomorphisms with a sufficiently regular right invariant metric (Miller
et al., 2015; Younes, 2019), leading to the so-called large deformation diffeo-
morphic metric mapping (LDDMM) framework, detailed in Section 2.2.
Since the optimization of time-varying velocity fields was computationally
intensive, an alternative parameterization of diffeomorphisms with the flow of
stationary velocity fields (SVF) was introduced by Arsigny et al. (2006). The
flow of SVFs generates one-parameter subgroups, which are simply matrix
exponentials in matrix Lie groups. Although a number of theoretical difficul-
ties remain when moving to infinite dimensions, very efficient algorithms can
be adapted from the matrix case, like the scaling and squaring procedure
(Higham, 2005) to compute the group exponential and its Jacobian, or the
Baker–Campbell–Hausdorff (BCH) formula to approximate the composition
of two deformations directly in the log-domain. This allows the straightfor-
ward extension of the classical “demons” image registration algorithm to
encode diffeomorphic transformations parameterized by SVFs. Because the
inverse of the flow of an SVF is simply parameterized by the opposite of this
SVF, a special feature of the log-demons registration framework is to enforce
almost seamlessly the very desirable inverse consistency of the registration.
The SVF framework was successfully used in a number of application to brain
studies (Hadj-Hamou et al., 2016; Lorenzi et al., 2011a) as we will see below.
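To fix ideas, the scaling and squaring procedure can be sketched in a few lines for a 2D displacement field. The following snippet is only an illustrative sketch (the grid size, the number of squaring steps, and the helper names svf_exp and compose are our own choices, not code from the cited works); it relies on the approximation exp(v/2^N) ≈ Id + v/2^N followed by N compositions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def compose(d1, d2):
    """Compose two displacement fields: (Id + d1) o (Id + d2).

    d1, d2: arrays of shape (2, H, W) holding displacements in grid units.
    Returns d with Id + d = (Id + d1) o (Id + d2), i.e.,
    d(x) = d2(x) + d1(x + d2(x)), d1 being interpolated off-grid.
    """
    H, W = d1.shape[1:]
    grid = np.mgrid[0:H, 0:W].astype(float)
    warped = np.stack([
        map_coordinates(d1[i], grid + d2, order=1, mode='nearest')
        for i in range(2)
    ])
    return d2 + warped

def svf_exp(v, n_steps=8):
    """Group exponential of a stationary velocity field by scaling and squaring:
    scale v so that v / 2**n_steps is small, take one explicit step
    (exp(u) ~ Id + u), then square by composition n_steps times."""
    d = v / 2.0 ** n_steps
    for _ in range(n_steps):
        d = compose(d, d)
    return d

# Smoke test: a smooth rotational velocity field on a 64 x 64 grid.
H = W = 64
y, x = np.mgrid[0:H, 0:W].astype(float)
cy, cx = (H - 1) / 2, (W - 1) / 2
v = 0.05 * np.stack([-(x - cx), y - cy])  # infinitesimal rotation
phi = svf_exp(v)                          # displacement field of exp(v)
print(phi.shape)                          # (2, 64, 64)
```

A convenient by-product, used by the log-demons registration mentioned above, is that the inverse deformation is simply obtained as svf_exp(-v).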
The differential geometric foundations of this very efficient SVF frame-
work were uncovered in Lorenzi and Pennec (2013): one-parameter subgroups
are actually geodesics (in the sense of autoparallel curves) of the Cartan–
Schouten connection, a canonical bi-invariant symmetric affine connection
that exists on every Lie group. Whenever a bi-invariant metric exists on the
group, this affine connection coincides with the Levi-Civita connection of
that bi-invariant metric, in which case the Fréchet mean is left, right, and
inverse equivariant. However, only direct products of compact and Abelian
groups admit a bi-invariant metric, which is very limiting in practice. Thus,
Lie groups generally do not admit any bi-invariant metric, in which case the
Fréchet mean based on a left-invariant distance is not consistent with right
translations nor with inversions. This is already the case for Euclidean
motions SE(3) since it is a semi-direct product. In contrast, exponential bary-
centers of the canonical Cartan–Schouten connection define a nonmetric
notion of mean on any Lie group, which is automatically equivariant by left,
right translation and inversion (Pennec and Arsigny, 2012). One can also
define covariance matrices and Mahalanobis distances that are consistent with
the group operations. Thus, from the point of view of abstract transforma-
tions, independently of their action on objects, the Cartan–Schouten connec-
tion offers a unique way to define a consistent statistical framework on Lie
groups. This drove the interest for statistics on affine connection spaces, a
superclass of Riemannian manifolds where nonmetric geodesics are defined
locally as autoparallel curves (Pennec and Lorenzi, 2020).

1.2 Longitudinal models


Once we have decided on the LDDMM or the SVF parameterization of dif-
feomorphisms, a first and simple approach to tackle longitudinal data is to
regress the template and its deformations over time by minimizing the inter-
subject registration square-distance to each observation. The template trajec-
tory may be a geodesic in a group of diffeomorphism for very few data
points, such as for studying the remodeling of the heart over time in pediatric
images (Mansi et al., 2009; McLeod et al., 2013). With more samples over
time, one may fit a deformation trajectory with kernel regression (Gerber
et al., 2010) or assume a spline structure (Hinkle et al., 2014; Singh et al.,
2015; Trouvé and Vialard, 2012). This type of method is good for a cross-
sectional design, where we have one data point per subject, but is likely to
hide the small longitudinal changes of each subject into the large intersubject
variability when we have more than one time point per subject.

Longitudinal models are preferable to study time-series of shapes in both
dynamical systems such as the heart and disease progression modeling such
as the aging brain. Due to the noncommutativity of the longitudinal and inter-
subject deformations, several paradigms were investigated to compare trans-
formations trajectories at different points (Durrleman et al., 2013b). One
choice is to model the reference trajectory in the template space and to make
it patient-specific by deforming it using a template to subject transformation
valid for all times (Durrleman et al., 2013b; Rao et al., 2004). This approach
can be geometrized by parallel transporting the initial velocity of the template
to subject deformation along the reference longitudinal deformation, as pro-
posed by Schiratti et al. (2015). This was called exp-parallelization for a
geodesic reference trajectory, but the method works for any smooth curve.
A recent refinement of this model has been proposed to cluster trajectories into
distinct classes with different trajectories (Debavelaere et al., 2019). However,
this does not allow one to easily model the variability of the longitudinal trajec-
tories which is observed, for instance, in the heart. Moreover, exp-parallel curves
may be very different from geodesics when the atlas-to-subject transformation
is large: on a sphere, for instance, exp-parallel curves are small circles parallel
to the reference geodesic great circle, and reduce to a point at the poles.
In this chapter, we prefer to use the other main approach, where we first
regress a patient-specific trajectory, and then normalize it with respect to
the patient-to-atlas deformation at a reference time to perform the statistical
analysis of all trajectories at the population level (Lorenzi and Pennec,
2013). It is sometimes argued that there is no canonical reference time for
comparing the evolution of brains. This is not the case for the heart, for which
the end-diastole (ED) is a natural time reference of the cardiac cycle. The
subject-specific trajectory is usually a geodesic for the analysis of structural
brain changes with aging in Alzheimer’s disease (Hadj-Hamou et al., 2016).
For the heart, one can use any motion trajectory parameterized by its loga-
rithm (the initial tangent vector or momentum that geodesically shoots to
the deformation at a given point of the trajectory). In this setting, one has a
curve of tangent vectors encoding the subject-specific deformation over time
that needs to be transported to a reference template.
Transporting scalar values (e.g., the intensity of an image) along a diffeo-
morphic transformation is simply done by resampling that image. However,
when one wants to transport differential (vector) quantities, it is not sufficient
to resample each of the vector components individually: there needs to be a nor-
malization that also reorients the vector. For instance, a constant initial vector
field encodes a translation over time. However, when the intersubject transfor-
mation is a rotation by 90 degrees, that vector needs to be rotated along. In this
chapter, we focus on geometric methods based on parallel transport.
For cardiac motion, an average cardiac deformation across patients is
called a statistical motion atlas (Duchateau et al., 2011; Peyrat et al., 2010;
Young and Frangi, 2009). In this application, the need to normalize deforma-
tions is salient in the example of Fig. 1, where the deformation between the
patients' end-diastolic (ED) and end-systolic (ES) frames is applied without
any normalization to another reference shape, the atlas. This results in meshes
of poor quality, with irregularities, especially near the valves and on the sep-
tum. These abnormalities are salient when considering ventricles with pres-
sure and volume overload, as the amplitude of both the intrapatient and
patient-to-atlas deformations may be large.

FIG. 1 Example of a systolic deformation estimated on a patient and applied to the atlas without normalization (middle) and with normalization by the method of Section 3.3 (right). The ED frame is a red point cloud, while the ES frame is the blue mesh. The ES frames obtained without normalization show irregularities and cannot be used in downstream analyses.

1.3 Parallel transport for intersubject normalization


Given a manifold equipped with an affine connection, for example, a Rieman-
nian manifold with its Levi-Civita connection, one can define the parallel
transport of a tangent vector v along a curve γ.
Definition 1 (Parallel vector field). Let M be a smooth manifold and ∇ a con-
nection on M. For any curve γ : [a, b] → M in M, a vector field X along γ is
parallel if

∇_{γ̇(t)} X(t) = 0.   (1)

From the properties of ODEs, one can prove that given γ and a tangent
vector v at x = γ(0), there exists a unique parallel vector field X along γ such
that X(0) = v. Intuitively, this ODE constrains the parallel vector field to keep
its orientation with respect to the velocity γ̇ of the curve while moving along it. For any
t in the domain of definition of γ, X(t) is called the parallel transport of v at
y = γ(t) along γ, and written Π_x^y v.
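On a manifold embedded in Euclidean space, the connection is the tangential projection of the ambient derivative, and Eq. (1) becomes an explicit ODE that can be integrated numerically. As a toy illustration (our own sketch, not code from the chapter's implementation), the following snippet transports a tangent vector along a great circle of the unit sphere, where a parallel field satisfies Ẋ(t) = −(X(t)·γ̇(t)) γ(t):

```python
import numpy as np

def transport_along_great_circle(x, w, v, n_steps=1000):
    """Integrate the parallel transport ODE on the unit sphere S^2.

    gamma(t) = cos(t|w|) x + sin(t|w|) w/|w| is a geodesic starting at x
    with velocity w; the Levi-Civita connection of the embedded sphere
    gives dX/dt = -(X . gamma_dot) gamma for a parallel field X.
    We integrate up to t = 1 with an RK4 scheme.
    """
    nw = np.linalg.norm(w)
    gamma = lambda t: np.cos(t * nw) * x + np.sin(t * nw) * w / nw
    gamma_dot = lambda t: nw * (-np.sin(t * nw) * x + np.cos(t * nw) * w / nw)
    f = lambda t, X: -np.dot(X, gamma_dot(t)) * gamma(t)
    h, X = 1.0 / n_steps, v.astype(float)
    for i in range(n_steps):
        t = i * h
        k1 = f(t, X)
        k2 = f(t + h / 2, X + h / 2 * k1)
        k3 = f(t + h / 2, X + h / 2 * k2)
        k4 = f(t + h, X + h * k3)
        X = X + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return X

x = np.array([1.0, 0.0, 0.0])
w = np.array([0.0, 1.0, 0.0])   # direction of the geodesic
v = np.array([0.0, 0.3, 0.4])   # tangent vector to transport (v . x = 0)
Xt = transport_along_great_circle(x, w, v)
print(np.linalg.norm(Xt), np.linalg.norm(v))  # norms agree: transport is isometric
```

The printed norms agree up to the integration error, illustrating that parallel transport along γ is an isometry between tangent spaces.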
For normalizing the momentum of a longitudinal deformation along an
intersubject geodesic diffeomorphism, parallel transport was first proposed
in the LDDMM framework using Jacobi fields (Qiu et al., 2008, 2009;
Younes, 2007). The method was used in Cury et al. (2016) for the analysis
of the thalamus in frontotemporal dementia and it was improved by the
Fanning Scheme (Louis et al., 2018), which is applied to shape analysis of
brain structures (Louis et al., 2017). In the latter works, the authors claim that
the Jacobi field approach is more precise and less expensive computationally
than ladder methods. We have proved in Guigui and Pennec (2021) that this is
not the case. Moreover, the proposed method implicitly assumes that the same
metric is used for both longitudinal and intersubject deformation. This is an
important concern as the cross-sectional intersubject and longitudinal intra-
subject deformations have a fundamentally different nature.
In order to implement a parallel transport algorithm that remains consistent
with the numerical scheme used to compute the geodesics, Lorenzi et al.
(2011b) and Lorenzi and Pennec (2013) proposed to adapt Schild’s ladder to
image registration with deformations parameterized by SVF. This method relies
on iterating the construction of geodesic parallelograms to approximate the par-
allel transport of a vector v along a geodesic (see Fig. 2A). Interestingly, the
Schild’s ladder implementation appeared to be more stable in practice than
the closed-form expression of the symmetric Cartan–Schouten parallel transport
on geodesics. This was attributed to the inconsistency of the numerical schemes
used for the computation of the transformation Jacobian in the implementation
of this exact parallel transport formula.
Shortly after, it was realized that parallel transport along geodesics could
exploit an additional symmetry by using the geodesic along which we want
to transport as one of the diagonal of the geodesic parallelogram: this gave
rise to the pole ladder scheme (Lorenzi and Pennec, 2014). In this case the
geodesic along which we are transporting is used as the diagonal of the parallelo-
grams (see Fig. 2B). This greatly reduces the number of geodesics to compute
(v1 does not even need to be computed in Fig. 2B). Pole ladder combined
with an efficient numerical scheme based on the Baker–Campbell–Hausdorff
formula was found to be more stable on simulated and real experiments than
the other parallel transport schemes tested on Lie groups. This result and the
higher symmetry led to the conjecture that pole ladder could be a higher-order
scheme than Schild's ladder.

FIG. 2 Representation of the two ladder schemes, Schild's (A) and pole ladder (B). The methods consist in iterating the construction of geodesic parallelograms to approximate the parallel transport of a vector v along a piecewise-geodesic curve. (A) Schild's ladder; (B) Pole ladder.
The SVF framework was successfully applied in Lorenzi et al. (2015) to
distinguish pathological evolution from normal aging. An efficient computa-
tional pipeline is provided by Hadj-Hamou et al. (2016) and used in Sivera
et al. (2020) to model Alzheimer’s disease and to analyze the effect of a
potential treatment in a clinical trial (see illustration in Fig. 3).

1.4 Chapter organization


In this chapter, we address the longitudinal analysis of cardiac motion across
patients and diseases. A key problem of this application is the large cyclic
nature of cardiac deformations, along with large to very large intersubject
transformations. In order to capture the large cardiac motion, we chose to
use the LDDMM framework because a right-invariant metric is compatible
with a Lagrangian and a Hamiltonian formulation of the mechanical equa-
tions, while the SVF framework is rather related to a Eulerian invariance.
The numerical complexity of the implementation of Jacobi fields and the
algorithmic simplicity of ladder methods led us to choose the latter for inter-
subject normalization. However, despite the practical success of ladder meth-
ods for longitudinal brain analyses, their numerical accuracy remained
essentially unknown beyond the first order. This led Guigui and Pennec
(2021) to conduct a careful numerical analysis of Schild’s and pole ladders
that we summarize in Section 2.1. To the best of our knowledge, ladder meth-
ods have not been used in the LDDMM framework, for which a BCH-type
approximation was missing. We first give an overview of the LDDMM frame-
work in Section 2.2 before proposing a new second-order implementation of
pole ladder on LDDMM deformations in Section 2.3.

FIG. 3 Statistics on diffeomorphisms with the Cartan–Schouten connection to model the normal component of the aging trajectory and the additional component specific to Alzheimer's disease. Longitudinal geodesic deformations regressed in the sequence of images of each subject are parallel transported along the subject-to-reference deformation at baseline. In this common space, a linear model in the tangent space of diffeomorphisms estimates the mean trajectory for the two clinical conditions. Derived from original images and illustrations of Marco Lorenzi and Raphaël Sivera.
We turn in Section 3 to the application of this methodology to the group-
wise analysis of cardiac motion across diseases. For the cardiac motions that
we analyze in Section 3.2, relative volume differences between subjects
appear to have an unexpected effect on the parallel transported trajectories.
Since this undesirable effect cannot be attributed to the numerical accuracy
of the transport scheme, it indicates that we have to revise the choice of the
Riemannian metric used for the intersubject comparison. We first investigate
in Section 3.3 an intuitive rescaling of the modulus of the momentum after the
parallel transport to preserve on average the ejection fraction along the motion
trajectory. Regression results over a population of subjects and patients show a
surprisingly simple relationship between the scaling and the intersubject volume
ratio. However, modifying the transport equations in an ad-hoc fashion is not
satisfactory from a theoretical point of view. Thus, we investigate in Section 3.4 a more
satisfactory strategy that decouples the volume change from the deformation
directly within the LDDMM metric.

2 Parallel transport with ladder methods


Numerical methods have been proposed in geometric data processing for the
parallel transport of vectors along curves in manifolds. The oldest algorithm
is probably Schild’s ladder, a general method for the parallel transport along
arbitrary curves, introduced in the theory of gravitation in Misner et al.
(1973) in the spirit of Schild’s geometric constructions. Ehlers et al. (1972) is
often cited as a reference for Schild's ladder, but no mention of the scheme is
made in this work. The method extends the infinitesimal transport through the
construction of geodesic parallelograms (Fig. 4). It is algorithmically interesting
since it only requires the computation of geodesics (initial and boundary value
problems) without requiring the knowledge of the second-order structure of the
space (connection or curvature tensors). Kheyfets et al. (2000) proved that the
scheme realizes a first-order approximation of the parallel transport for a sym-
metric connection. This makes sense since the skew-symmetric part of the con-
nection, the torsion, does not impact the geodesic equation. Schild’s ladder is
nowadays increasingly used in nonlinear data processing and analysis to imple-
ment parallel transport in Riemannian manifolds. One can cite for instance
(Lorenzi et al., 2011b) and the follow-up publications cited in introduction
for the parallel transport of deformations in computational anatomy, or
Hauberg et al. (2013) for parallel transporting the covariance matrix in Kalman
filtering.

2.1 Numerical accuracy of Schild’s and pole ladders


Despite the use of ladder methods in applications, no results on their conver-
gence were published before the work of Guigui and Pennec (2021) that we
summarize here. We give Taylor approximations of the elementary construc-
tions of Schild’s ladder and of the pole ladder with respect to the Riemann
curvature of the underlying space. This allows one to prove that these methods can
be iterated to converge with quadratic speed, even when geodesics are approxi-
mated by numerical schemes.

FIG. 4 Schild's ladder procedure to parallel transport the vector v along the sampled curve with initial velocity w. Top: First rung of the ladder using an approximate geodesic parallelogram. Bottom: The method is iterated with rungs at each point sampled along the curve. Reproduced from Guigui, N., Pennec, X., 2021. Numerical accuracy of ladder schemes for parallel transport on manifolds. Found. Comput. Math. https://doi.org/10.1007/s10208-021-09515-x.

2.1.1 Elementary construction of Schild’s ladder


Recall that Exp denotes the Riemannian exponential map, which maps a point and
a tangent vector to the point reached at time t = 1 by the unique geodesic with
those initial conditions. Log is its inverse, defined locally, such that
Exp_x(Log_x(y)) = y. See Pennec et al. (2020, chapter 1) for more details.
The construction to parallel transport v ∈ T_xM along the geodesic γ with
γ(0) = x and γ̇(0) = w ∈ T_xM (such that (v, w) ∈ U_x) is given by the follow-
ing steps (Fig. 4):
1. Compute the geodesics from x with initial velocities v and w until time
   s = t = 1 to obtain x_v and x_w. These are the sides of the parallelogram.
2. Compute the geodesic between x_v and x_w and the midpoint m of this
   geodesic:

   m = Exp_{x_v}((1/2) Log_{x_v}(x_w)).

   This is the first diagonal of the parallelogram.
3. Compute the geodesic between x and m, and let a ∈ T_xM be its initial velocity.
   Extend it beyond m for the same length as between x and m to obtain z:

   a = Log_x(m);   z = Exp_x(2a) = x_{2a}.

   This is the second diagonal of the parallelogram.
4. Compute the geodesic between x_w and z. Its initial velocity u_w is an
   approximation of the parallel transport of v along the geodesic from x to
   x_w, i.e.,

   u_w = Log_{x_w}(x_{2a}).

Assuming that there exists a convex neighborhood that contains the entire
parallelogram, all the above operations are well defined. In the literature, this
construction is then iterated along γ without further precision.
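These four steps translate directly into code once Exp and Log are available. The sketch below is our own illustration on the unit sphere, where both maps are known in closed form (the function names are ours); it implements one elementary geodesic parallelogram:

```python
import numpy as np

def sphere_exp(x, v):
    """Riemannian exponential map on the unit sphere."""
    n = np.linalg.norm(v)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * v / n

def sphere_log(x, y):
    """Riemannian logarithm on the unit sphere (x and y not antipodal)."""
    p = y - np.dot(x, y) * x          # project y onto the tangent space at x
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    n = np.linalg.norm(p)
    return p if n < 1e-12 else theta * p / n

def schild_step(x, w, v):
    """One geodesic parallelogram of Schild's ladder (steps 1-4 above),
    approximating the parallel transport of v from x to x_w = Exp_x(w)."""
    xv, xw = sphere_exp(x, v), sphere_exp(x, w)       # step 1: the two sides
    m = sphere_exp(xv, 0.5 * sphere_log(xv, xw))      # step 2: midpoint m
    a = sphere_log(x, m)                              # step 3: first diagonal,
    z = sphere_exp(x, 2.0 * a)                        #         extended to z
    return sphere_log(xw, z)                          # step 4: u_w at x_w

x = np.array([1.0, 0.0, 0.0])
w = 0.2 * np.array([0.0, 1.0, 0.0])   # direction of transport
v = 0.2 * np.array([0.0, 0.0, 1.0])   # vector to transport
print(schild_step(x, w, v))           # close to the exact transport of v
```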

2.1.2 Taylor expansion


We can now reformulate this elementary construction in terms of successive
applications of the double exponential and the neighboring logarithm maps,
given in Pennec (2019). The computations are detailed in the appendix of
Guigui and Pennec (2021), and we only report here the result at third order,
meaning that all the terms of the form ∇R(·, ·)· are summarized in the term
O(4). We first obtain a generalized midpoint rule:

2a = w + v + (1/6) R(v, w)(w − v) + O(4).   (2)

We notice that this expression is symmetric in v and w, as expected. Further-
more, the deviation from the Euclidean mean of v, w (the parallelogram law)
is explicitly due to the curvature. Accounting for this correction term is a key
ingredient to reach a quadratic speed of convergence.
Now, by propagating this midpoint rule to compute the error e_x made by
this construction to parallel transport v, e_x = Π_x^{x_w} v − u_w, we obtain a third-
order approximation of the Schild's ladder construction:
Theorem 1. Let (M, g) be a finite dimensional Riemannian manifold. Let x ∈ M
and v, w ∈ T_xM be sufficiently small. Then the output u of one step of Schild's
ladder parallel transported back to x is given by

u = v + (1/2) R(w, v)v + O(4).   (3)

This theorem shows that Schild's ladder is in fact a second-order approxi-
mation of parallel transport. Furthermore, this shows that splitting the main
geodesic into segments of length 1/n and simply iterating this construction
n times will in fact sum n error terms of the form R(w_i/n, v_i)v_i; hence, by linearity
of R, the error will not necessarily vanish as n → ∞. To ensure convergence, it
is necessary to also scale v in each parallelogram, as detailed in the next
paragraph.

2.1.3 Numerical scheme and convergence


With the previous notations, let us define schild(x, w, v) = u_w ∈ T_{x_w}M. We
now divide the geodesic γ into segments of length 1/n for some n ∈ ℕ* large
enough so that the previous Taylor expansions can be applied to w/n and v/n. As
mentioned before, v needs to be scaled as w. In fact, let α ≥ 1 and consider
the sequence defined by (see Fig. 4)

v_0 = v,   v_{i+1} = n^α · schild(x_i, w_i/n, v_i/n^α),   (4)

where x_i = γ(i/n) = Exp_x((i/n) w) and w_i = n Log_{x_i}(x_{i+1}) = Π_x^{x_i} w. We now establish
the following result, which ensures convergence of Schild's ladder to the par-
allel transport of v along γ at order at most two.

Theorem 2. Let (x_i, v_i, w_i)_{(i ≤ n)} be the sequence defined as above. Then
∃τ > 0, ∃β > 0, ∃N ∈ ℕ, ∀n > N,

‖v_n − Π_x^{x_n} v‖ ≤ τ/n^α + β/n².

Moreover, τ is bounded by a bound on the sectional curvature, and β by a
bound on the covariant derivative of the curvature tensor.

The same result can be obtained for pole ladder, and in the case where
geodesics are obtained by numerical integration of the geodesic equation.

2.1.4 Pole ladder


In this case, the main geodesic is used as the diagonal of the constructed paralle-
lograms (see Fig. 5). This allows one to reduce the number of geodesics that need
to be computed. Similarly, a Taylor approximation of the construction gives

u = v + (1/12)[(∇_w R)(w, v)(5v − w) + (∇_v R)(w, v)(2v − w)] + O(5).   (5)

And this can be propagated to show the following bound,

‖v_n − Π_m^{m_n} v‖ ≤ β/n²,   (6)

which ensures convergence to the parallel transport with quadratic speed.
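The elementary construction and its iteration are equally short to write down. Below is a minimal sketch on the unit sphere (our own toy code, with the scaling exponent fixed to α = 1; the closed-form Exp/Log helpers are repeated from the Schild's ladder sketch above for self-containment), compared against the closed-form transport along a great circle:

```python
import numpy as np

def sphere_exp(x, v):
    n = np.linalg.norm(v)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * v / n

def sphere_log(x, y):
    p = y - np.dot(x, y) * x
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    n = np.linalg.norm(p)
    return p if n < 1e-12 else theta * p / n

def pole_ladder(x, w, v, n=10):
    """Iterated pole ladder transporting v along t -> Exp_x(t w), t in [0, 1].
    Each rung reflects the current point through the midpoint of a geodesic
    segment, so the main geodesic serves as the diagonal of the
    parallelograms; v is scaled by 1/n (alpha = 1) to ensure convergence."""
    h = 1.0 / n
    q = sphere_exp(x, v / n)                      # first rung
    for i in range(n):
        m = sphere_exp(x, (i + 0.5) * h * w)      # midpoint of segment i
        q = sphere_exp(m, -sphere_log(m, q))      # reflection through m
    xn = sphere_exp(x, w)
    return (-1.0) ** n * n * sphere_log(xn, q)    # unscale, undo reflections

# Closed-form transport along a great circle for comparison: the component
# along the geodesic rotates with it, the binormal component is unchanged.
x = np.array([1.0, 0.0, 0.0])
w = (np.pi / 4) * np.array([0.0, 1.0, 0.0])
v = np.array([0.0, 0.2, 0.5])
nw = np.linalg.norm(w)
e, b = w / nw, np.cross(x, w / nw)
e1 = -np.sin(nw) * x + np.cos(nw) * e
expected = np.dot(v, e) * e1 + np.dot(v, b) * b
for n in (2, 4, 8, 16):
    print(n, np.linalg.norm(pole_ladder(x, w, v, n) - expected))
```

The printed errors decrease as n grows, consistently with the quadratic rate of Eq. (6).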

2.1.5 Infinitesimal schemes


When geodesics are not available in closed form, we replace the Exp map by
a fourth-order numerical scheme (e.g., Runge–Kutta (RK)), and the Log is
obtained by gradient descent over the initial condition of Exp. It turns out that
only one step of the numerical scheme is sufficient to ensure convergence,
which keeps the computational complexity reasonable. As only one step of the
integration schemes is performed, we are no longer computing geodesic par-
allelograms, but infinitesimal ones, and thus refer to this variant as infinitesi-
mal scheme. As in the previous case, we can show that

‖Π_x^{x_n} v − v_n‖ = O(1/n²).
This allows using this scheme in the SVF and LDDMM frameworks. For the
latter, we first give an overview of the framework in the next section before
presenting our parallel transport algorithm.

FIG. 5 Top: Elementary construction of the pole ladder with the previous notations (left), and new notations (right), in a normal coordinate system at m. Bottom: Iterated scheme. Reproduced from Guigui, N., Pennec, X., 2021. Numerical accuracy of ladder schemes for parallel transport on manifolds. Found. Comput. Math. https://doi.org/10.1007/s10208-021-09515-x.

2.2 A short overview of the LDDMM framework


The LDDMM framework encompasses both algorithms for shape matching
(also referred to as registration) and a Riemannian geometric structure on
the space of deformations that projects to the space of shapes. The former
allows computing shape descriptors, and to parameterize diffeomorphisms.
The latter provides a distance to compare deformations, and an associated
notion of parallelism to transport them. We summarize here the formulation
of Durrleman et al. (2013a) and Durrleman et al. (2014) and their implemen-
tation in Bône et al. (2018), focusing on the case of landmarks. For a thorough
treatment of the topic, we refer to Younes (2019).
In this chapter, we restrict to shapes defined as (d − 1)-dimensional subma-
nifolds of ℝ^d, d − 1 = 1, 2, with the same given topology, and approximated by
triangulated meshes, defined by a set of points in ℝ^d, called landmarks, and
the connectivity between those. The set of shapes is denoted S. The model of
computational anatomy stipulates that such a shape is fully described by a dif-
feomorphism φ ∈ Diff(ℝ^d), where the interaction between shapes and diffeo-
morphisms is encoded by the action of Diff(ℝ^d) on S. This turns the shape
space into a homogeneous space with group Diff(ℝ^d). The isotropy subgroup
of a given shape is the set of reparameterizations of this shape, i.e., diffeomorph-
isms that leave the shape unchanged, but move particles along the shape. These
diffeomorphisms are infinite dimensional nuisance parameters in a statistical
analysis, and one feature of LDDMM is to provide methods that are invariant
to such nuisances (Miller et al., 2015).
A natural and efficient computational construction of diffeomorphisms is
obtained by flows associated to ordinary differential equations (ODEs)
∂_t ϕ_t(·) = v_t[ϕ_t(·)], with the initial condition ϕ_0 = Id. The time-dependent vector field
v_t can be interpreted as the instantaneous speed of the points during the deforma-
tion, and must verify certain regularity conditions to ensure that solutions to
the ODE are indeed diffeomorphisms. An efficient way to enforce these con-
ditions is to restrict to vector fields obtained by the convolution of a number
N_c of momentum vectors carried by distinct control points:

v_t(x) = Σ_{k=1}^{N_c} K(x, c_k^{(t)}) μ_k^{(t)},   (7)

where K is a kernel function, e.g., the Gaussian kernel K(x, y) = exp(−‖x − y‖²/σ²).
The effect of the choice of kernel is studied in Micheli and Glaunès (2014), but
we restrict to the Gaussian kernel in this study. The (closure of the) set of such
vector fields forms a reproducing kernel Hilbert space H_K, with the associated
norm

‖v‖²_K = Σ_{i,j} K(c_i, c_j) μ_i^T μ_j.

We write Diff_K for the subgroup of diffeomorphisms obtained this way, and use
matrix notation with a dN_c × dN_c block matrix K(c, c′) = (K(c_i, c′_j) I_d)_{ij} and
flat vectors of size dN_c for landmark sets, velocities, and momenta, so that
v(x) = K(c, x)μ and ‖v‖²_K = μ^T K(c, c)μ. The total cost, or energy, of the defor-
mation can be defined as ∫₀¹ ‖v_t‖²_K dt.
It can be shown that the momentum vectors that minimize this energy,
considering c_k^{(0)}, c_k^{(1)}, k = 1…N_c fixed, together with the equation driving
the motion of the control points, follow a Hamiltonian system of ODEs:

ċ^{(t)} = K(c^{(t)}, c^{(t)}) μ^{(t)},   μ̇^{(t)} = −∇₁K(c^{(t)}, c^{(t)}) μ^{(t)T} μ^{(t)}.   (8)
A diffeomorphism ϕ₁ is thus uniquely parameterized by the initial conditions
c_k^{(0)}, μ_k^{(0)}, k = 1…N_c, and a shape registration criterion between a template q̄
and a target q can be defined as

C(c, μ) = ‖q − ϕ₁^{c,μ}(q̄)‖₂² + α² ‖v₀^{c,μ}‖²_K,   (9)

where α is a regularization parameter that penalizes large deformations.
Minimizing C therefore amounts to finding the transformation that best
deforms q̄ to match q. The gradient of C can be computed through automatic
differentiation, to perform gradient descent. Note that in (9), the L2 norm
between landmark positions is used to evaluate the data-attachment term.
When corresponding landmarks are not available, fidelity metrics relying on
currents, varifolds, or normal cycles have been derived; they allow computing
metrics between curves or surfaces with different parameterizations. These are
reviewed in Charon et al. (2020).
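To make the shooting operation concrete, the following minimal sketch (illustrative code; the array shapes and function names are our own choices, and a production implementation such as Deformetrica adds automatic differentiation for the registration loop) integrates the Hamiltonian system (8) for control points and momenta with the Gaussian kernel:

```python
import numpy as np

def gaussian_K(c1, c2, sigma):
    """Gaussian kernel matrix: K_ij = exp(-|c1_i - c2_j|^2 / sigma^2)."""
    d2 = ((c1[:, None, :] - c2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def hamiltonian_rhs(c, mu, sigma):
    """Right-hand side of system (8) for H = (1/2) mu^T K(c, c) mu."""
    K = gaussian_K(c, c, sigma)
    c_dot = K @ mu                                   # cdot_k = sum_j K_kj mu_j
    diff = c[:, None, :] - c[None, :, :]             # c_k - c_j
    mm = mu @ mu.T                                   # mu_k . mu_j
    # mudot_k = -sum_j grad_1 K(c_k, c_j) (mu_k . mu_j)
    #         = (2 / sigma^2) sum_j (c_k - c_j) K_kj (mu_k . mu_j)
    mu_dot = (2.0 / sigma ** 2) * ((K * mm)[:, :, None] * diff).sum(axis=1)
    return c_dot, mu_dot

def shoot(c0, mu0, sigma, n_steps=20):
    """Geodesic shooting: RK4 integration of (8) from t = 0 to t = 1."""
    h, c, mu = 1.0 / n_steps, c0.astype(float), mu0.astype(float)
    for _ in range(n_steps):
        k1 = hamiltonian_rhs(c, mu, sigma)
        k2 = hamiltonian_rhs(c + h / 2 * k1[0], mu + h / 2 * k1[1], sigma)
        k3 = hamiltonian_rhs(c + h / 2 * k2[0], mu + h / 2 * k2[1], sigma)
        k4 = hamiltonian_rhs(c + h * k3[0], mu + h * k3[1], sigma)
        c = c + h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        mu = mu + h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return c, mu

# Two control points in 2D with opposite momenta: the kernel couples them.
c0 = np.array([[0.0, 0.0], [1.0, 0.0]])
mu0 = np.array([[0.0, 0.5], [0.0, -0.5]])
c1, mu1 = shoot(c0, mu0, sigma=1.0)
# The kernel energy mu^T K(c, c) mu is conserved along the Hamiltonian flow:
print(np.sum(gaussian_K(c0, c0, 1.0) * (mu0 @ mu0.T)),
      np.sum(gaussian_K(c1, c1, 1.0) * (mu1 @ mu1.T)))
```

Landmarks that are not control points are simply advected by the velocity field (7) evaluated along the flow, and registering amounts to optimizing μ^{(0)} with respect to the criterion (9).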
Two examples of registration solutions are shown. The first is on simple
parametric shapes, a circle and an ellipse (Fig. 6). In this case, the full trajec-
tory of each landmark is represented (with a color scheme from blue to red).
These trajectories differ from a linear interpolation, especially where the curvature of the
shape increases. The second example is on real right ventricle (RV) data (Fig. 7). The distribu-
tion of momentum vectors on the shape is not intuitive, and it is easier to
inspect the velocity field evaluated at the control points. The matching is very
efficient on these meshes, thanks to the quality of the correspondences
between landmarks and to the smoothness of the shapes.
The optimal value of C defines a distance between ϕ₁ and the identity. In
fact, this distance derives from a right-invariant Riemannian metric on the group
Diff_K, and the path t ↦ ϕ_t is a minimizing geodesic for this metric. Integrating
the system (8) computes the exponential map of this metric, and is often called
shooting in this section, while minimizing (9) approximates the logarithm, and
is called registering. By considering the action of diffeomorphisms on shapes,
this distance projects to a distance between the shapes q̄ and q, and we have

d_K(q̄, q)² = inf { ∫₀¹ q̇_t^T K(c, c)^{−1} q̇_t dt | q₀ = q̄, q₁ = q }.

This in fact turns the space of shapes into a homogeneous space, with an
invariant metric induced by the group.

FIG. 6 Example of registration and optimal path between a circle (source) and an ellipse (target) with 60 landmarks, 15 control points (green), corresponding momentum vectors (cyan), and kernel width 1. The progression from blue to red shows the time trajectories of the landmarks under the minimizing geodesic deformation.

2.3 Ladder methods with LDDMM


Along with a distance, the Riemannian metric provides a notion of parallel
transport thanks to its Levi-Civita connection. A Hamiltonian formulation of
parallel transport may be given by an ODE that allows to transport a set of
momentum vectors along a path of diffeomorphisms ϕt (Younes, 2019, section
13.3.3). Previous implementations of parallel transport of deformations in the
context of LDDMM are however based on Jacobi fields (Cury et al., 2016;
Younes, 2007), and in particular on the fanning scheme (Louis et al., 2018).
Ladder methods, in contrast, were proposed in the SVF framework. Adapt-
ing them to LDDMM is difficult because of the absence of an equivalent of
the BCH formula to implement each rung of the ladder. We propose here a
new implementation of pole ladder in the context of LDDMM that leverages
the lower dimensional parameterization and the Hamiltonian formulation, and
is grounded in the results of the previous section. Indeed, the geodesics are
not known in closed form and we are in the case of infinitesimal ladder
schemes, requiring the use of fourth-order integration schemes. Our imple-
mentation is largely inspired by that of the fanning scheme and leverages
the open-source software Deformetrica (Bône et al., 2018).

FIG. 7 Example of the registration between ED (blue wireframe) and ES (red points) for a control right ventricle. There are 938 landmarks, and 34 control points with kernel width 10 mm (the overall ventricle at ED is about 55 mm high). Note how momentum vectors are noisy due to strong interactions between control points (A). The velocity field at the control points offers a better summary of the deformation (B). (A) Systolic deformation with momenta; (B) Systolic deformation with velocity field.
Throughout this section, we consider a reference shape qA, and wish to
transport the deformation (c, μ) obtained by registration of a shape q1 on
another shape q0 to qA, that is, along the geodesic between q0 and qA, parame-
terized by (cA, ω) and also obtained by registration. We suppose that registra-
tion outputs a set of control points such that K(c, c) is a symmetric positive
definite matrix.
Recall from the homogeneous space structure that the tangent space at
q0 is identified with the horizontal subspace of the Lie algebra of DiffK, which
is here the reproducing Kernel Hilbert space HK that contains all vector fields
defined by the convolution (7). It defines a one-to-one correspondence
between the tangent space at q0, and its co-tangent space, meaning that we
can transport the momentum μ and flow the points c instead of transporting
a full vector field v. Similarly, we do not need to flow the full shape q0 along
the ladder, but only the control points.

We describe the pole ladder in the case where the momentum to transport,
μ, and the momentum along which we transport, ω, are carried by the same ini-
tial control points c = c_A. If it is not the case, μ is first projected onto c_A by solving
a linear system with a Cholesky decomposition of K(c_A, c_A): μ′ = K(c_A, c_A)^{−1} K(c, c_A)μ.

Define rk : T*S × ℝ₊ → T*S as the map that performs one step of the fourth-
order Runge–Kutta (RK4) numerical scheme on the Hamiltonian system (8), rk_1
its projection on the control points only, and rk_{1,x}^{−1} the approximate inverse of rk_1 by
gradient descent, such that ‖rk_1(rk_{1,x}^{−1}(y), h) − y‖ ≤ h⁵. Then, pole ladder is
performed by the following algorithm.

Choose n ∈ ℕ and 1 ≤ α ≤ 2, and divide [0, 1] into n intervals of length
h = 1/n. Compute the first midpoint (c_1, ω_1) = rk(c_A, ω, h/2) along the main geode-
sic, and the first rung q_0 = rk_1(μ, 1/n^α). Then iterate for i = 1…n (see Fig. 8):
1. Compute the momentum α_i = n · rk_{1,c_i}^{−1}(q_{i−1});
2. Compute the flow of its inverse: q_i = rk_1(c_i, −α_i, h);
3. Compute the next midpoint (c_{i+1}, ω_{i+1}) = rk(c_i, ω_i, h), except at the last
   step, where h is halved to compute c̃.
Return μ̃ = n^α (−1)^n rk_{1,c̃}^{−1}(q_n).

FIG. 8 Representation of the pole ladder for diffeomorphisms with an odd number of steps n. The exponential maps are computed with an RK4 scheme, and the log by gradient descent.
Two examples of parallel transport solutions are shown. In the first, on simple
parametric shapes, the evolution from a circle to an ellipse is transported to a
smaller circle (Fig. 9), and we retrieve a smaller ellipse, as expected. The second
example, on a real RV mesh of the heart (Fig. 10), shows the
ED-to-ES deformation transported to a third reference shape, the atlas. The
obtained reconstruction is a personalized estimate of an ES frame for the atlas,
specific to the considered patient. The transported frame is much more accept-
able than those of Fig. 1.

FIG. 9 Parallel transport of the evolution from a circle to an ellipse along the deformation to a smaller circle.

FIG. 10 Example of an ED-to-ES deformation transported along the ED-to-atlas path. The source frames are represented by a blue wireframe, the final frames by red landmarks, the control points by green landmarks, and the initial velocities at the control points by orange arrows. The transported frame is much more acceptable than those of Fig. 1.

2.3.1 Validation
Recall that parallel transport is an isometry, so the norm of the transported
deformation must equal that of the initial deformation. We use this property
as a first step to validate our implementation on a population of simulated
2d-shapes. These are generated by shooting from a circle with random Gauss-
ian momentum vectors. This defines the source shape. Then an affinity is
applied to the source shape to define its evolution (i.e., scaling with different
coefficients along each axis). This deformation is then estimated by registra-
tion and transported to the template. The reconstruction obtained by deform-
ing the circle with the transported deformation (yellow shape on the figure)
resembles an ellipse, as expected by the application of the affinity.
For n = 10 rungs, we obtain a root mean squared error (RMSE) of
8.3 × 10⁻³ and less than 1.3 ± 0.9% relative error (in absolute value) for
100 samples. We also apply the affinity to the template and register the tem-
plate to the result to compute the expected momentum after transport. The
deviation of the transported momentum to this expected value is measured
with the kernel norm of the difference and averaged over the samples. We
obtain an error of 0.34 while the fanning scheme achieves 0.33. The two
implementations are therefore very similar on this set of shapes. We however
find cases where the fanning scheme does not behave as well as the pole
ladder for more complex deformations on the heart data (Fig. 13).

3 Application to cardiac motion modeling


Spatiotemporal shape analysis is of growing importance in the study of car-
diac diseases. In particular, the assessment of cardiac function requires the
measurement and analysis of cardiac motion beyond scalar indicators such
as volume, ejection fraction, pressure-volume loops, or area strain. We focus
on the case of the right ventricle (RV) that is of particular interest as it has
been shown to have a large capacity to adapt to overload by remodeling
(Sanz et al., 2019), raising the issue of disentangling the deformation from
the initial anatomy.

3.1 The right ventricle and its diseases


The right ventricle receives the blood from the right atrium through the tricus-
pid valve and ejects it to the lungs through the pulmonary artery. Although
neglected in the assessment of normal blood circulation of adult patients for
decades, its relevance to determine symptoms and outcomes is now well
established in several pathologies. Three such pathologies are studied in our
database.
Pulmonary hypertension (PH) is a severe condition characterized by
increased blood pressure in the pulmonary arteries, appearing between the
ages of 30 and 40. It is associated with symptoms and premature deaths
(Moceri et al., 2018), and it is the most common source of pressure overload in
the right ventricle, resulting in hypertrophy, flattening of the interventricular
septum, and progressive dilation and dysfunction (Sanz et al., 2019).
Atrial septal defect (ASD) is a congenital heart defect in which blood flows
between the left and right atria. This may lead to lower oxygen levels in the
blood that supplies the other organs, resulting in cyanosis. It affects the RV
with volume overload, marked by dilation and hypertrophy and leftward
septal shift, but it may be tolerated for years, as many studies report no impair-
ment of the function of the RV under volume overload (Moceri et al., 2020;
Sanz et al., 2019).
Tetralogy of Fallot (ToF) is another congenital heart defect characterized
by pulmonary stenosis (i.e., narrowing of the pulmonary valve) and an over-
riding aorta, allowing blood to flow between the left and right ventricles, and
resulting in hypertrophy of the RV, reduced oxygen level, and cyanosis. This
is the unique case where the RV is under both volume and pressure overload.
Untreated, ToF patients rarely survive to adulthood. It is treated by surgical
repair during the first year of life, greatly improving survival, but postsurgery
defects such as pulmonary regurgitation may occur, requiring other operations.
We use a database of 3D meshes extracted from 314 echocardiographic
sequences from patients examined at the CHU of Nice (see Moceri et al.
(2018) and Moceri et al. (2020) for details). The meshes were extracted with
a commercial software (4D RV Function 2.0, TomTec Imaging Systems,
GmbH, DE) with point-to-point correspondences across time and patients.
These are formed of 938 points and 1872 triangles. All the shapes were rea-
ligned with a subject-specific rigid-body deformation. An atlas was computed
from the end-diastolic meshes of the control group, after rigid-body align-
ment. We warmly thank Pamela Moceri and Nicolas Duchateau for collecting
and curating the data, and for the helpful interactions that we had during this
project.
Two main measures used by the clinicians to evaluate the cardiac function
will be used in the sequel to evaluate our modeling pipeline. Ejection fraction
(EF) is the relative volume change between end-diastolic (ED) time and
end-systolic (ES) time, i.e., during one contraction. It represents the volume

of blood pumped by the heart into the body circulation, and is used in clinical
routine to evaluate heart failure. Area strain (AS) is the relative area change
of each cell of the mesh between ED time and ES time. As it depends on the
quality of the mesh, it is usually filtered by computing the mean at each vertex
over the neighboring cells, and visualized as a colormap on the mesh, or aver-
aged by regions of the ventricle. It represents the amount of stretching local
tissues undergo during the deformation, and is a common descriptor of cardiac
motion (Di Folco et al., 2019; Kleijn et al., 2011; Moceri et al., 2020).
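Both indicators are straightforward to compute on a closed, consistently oriented triangulated mesh. The sketch below is our own utility code, not taken from the cited works: the volume is obtained from the divergence theorem as a sum of signed tetrahedron volumes, and verts_ed/verts_es are assumed (N, 3) vertex arrays sharing a (F, 3) connectivity faces, as provided by the point-to-point correspondences described above.

```python
import numpy as np

def mesh_volume(verts, faces):
    """Volume of a closed triangulated mesh (divergence theorem):
    sum of signed tetrahedron volumes det[v0, v1, v2] / 6 over the faces."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    return np.abs(np.einsum('ij,ij->i', v0, np.cross(v1, v2)).sum()) / 6.0

def cell_areas(verts, faces):
    """Area of each triangle of the mesh."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    return 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)

def ejection_fraction(verts_ed, verts_es, faces):
    """EF = relative volume change between the ED and ES frames."""
    ved, ves = mesh_volume(verts_ed, faces), mesh_volume(verts_es, faces)
    return (ved - ves) / ved

def area_strain(verts_ed, verts_es, faces):
    """Raw AS per cell = relative area change of each triangle, before the
    per-vertex smoothing over neighboring cells described above."""
    a_ed, a_es = cell_areas(verts_ed, faces), cell_areas(verts_es, faces)
    return (a_es - a_ed) / a_ed
```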

3.2 Motion normalization with parallel transport


For each subject, the motion of the right ventricle during the systolic phase of
the cardiac cycle is encoded by the LDDMM deformation at a discrete set of
sampling times between ED and ES. Each time point is treated as an indepen-
dent observation for its reorientation by parallel transport into the atlas refer-
ence frame. The framework is summarized in Fig. 11. For each observation
time ti, the subject’s frame at time ti is registered to this subject’s ED frame,
transported along the path between ED and the atlas, and reconstructed by
shooting with the transported momenta.
We reproduce in Fig. 12 the reconstructions of the ES frame for a patient of
each disease group and of the control group. These are obtained by the pole lad-
der algorithm (right) as well as the fanning scheme (middle). On these examples
the two methods produce very similar results, which validates our implementa-
tion. Moreover, the obtained reconstructions are smooth (compare with Fig. 1)
and seem to adapt the deformation pattern of each patient to the atlas.
In most cases, little difference can be observed qualitatively. We find, how-
ever, a few examples where the fanning scheme fails to transport
the deformation, and some points diverge or collapse in the reconstructions.
This occurs when the original shapes are significantly larger than the atlas,
and with strong spherification of the apex, causing large deformations, as for
the ASD patient of Fig. 13.

FIG. 11 Framework using registration and parallel transport to normalize cardiac deformations.

FIG. 12 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of the ED-to-ES deformation (point cloud to mesh) along the deformation from ED to atlas. Parallel transport is computed by the fanning scheme (middle) and our implementation of pole ladder (right). In most cases, little difference can be observed visually.

FIG. 13 Examples of reconstruction where the fanning scheme fails to transport the deformation while pole ladder succeeds, although the result is not realistic.

3.2.1 Interaction between shape and deformations: A scale problem
Niethammer and Vialard (2013) showed that parallel transport in LDDMM
does not conserve global properties such as scale or volume changes. This
is visible in our previous experiment (Fig. 9), where the anisotropy of the
original evolution (from circle to ellipse) is 1.4, whereas the one obtained
after parallel transport is 2.3. In the case of cardiac deformations, the magni-
tude of the temporal deformation is comparable to that of the subject’s refer-
ence to atlas deformation, and substantial volume changes are observed. Thus,
the lack of scale invariance is crucial.
Furthermore, there is a significant correlation between the magnitude of
the systolic deformation and the ED volume (ρ = 0.42 in our dataset).
Additionally, there are significant differences of ED volumes by disease
groups (Fig. 15A). As the parallel transport is isometric, this deformation
may be too large for the atlas. Examples of the obtained ES frames are shown
in Fig. 14 for patients whose RV volume is greater than that of the atlas,
resulting in nonrealistic ES frames.
To quantify this phenomenon, we measure the EF and the AS before and
after transport and report the RMSE in Table 1. We observe a large relative
RMSE on the EF, representing 31% of the mean value and equal to the standard
deviation of the data. Furthermore, plotting the absolute value of the EF error
by disease group, we retrieve a strong correlation between the error and the
EF (Fig. 15C). This effect is quite undesirable for a normalization procedure!
In conclusion, parallel transport was able to reorient the deformation to a
different frame, but was not sufficient to properly normalize it in the presence
of large volume differences. Since we control the numerical accuracy of the
parallel transport method thanks to Section 2.1, the error has to be attributed
to the choice of the transport method or to the choice of the Riemannian met-
ric used on deformations. In order to achieve proper normalization for a
broader range of volume changes and to preserve the relative volume changes,
we investigate in Section 3.3 a first intuitive ad-hoc strategy where we modu-
late the amplitude of the parallel transported vector. However, this amounts to
state that the transport method and the Riemannian metric are no longer
consistent with each other. To remove this inconsistency with the general geometric
statistics framework, we then investigate in Section 3.4 a modification of
the metric on diffeomorphisms that decouples the volume change from the
deformation.

FIG. 14 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of the ED-to-ES deformation (point cloud to mesh) along the deformation from ED to atlas. The subjects belong to the ASD group, hence show large volume changes that result in unrealistic ES frames.

TABLE 1 Validation of parallel transport with EF and AS.

      Original values    RMSE PT
AS    0.24 ± 0.08        0.18
EF    0.42 ± 0.13        0.13

FIG. 15 End-diastolic volume by disease group (A), original ejection fraction (B), and its modification by parallel transport (C). (A) ED volume per disease; (B) Initial EF; (C) Alteration of EF.

3.3 An intuitive rescaling of LDDMM parallel transport


In Guigui et al. (2021), we hypothesized that the amplitude of the ED-to-atlas
deformation, i.e., the one along which we transport, acts on the transported
deformation by a scaling, and we used this hypothesis to construct a normalization
strategy invariant to initial volume changes.

3.3.1 Hypothesis
We now propose to introduce a rescaling step after the parallel transport in
our framework of Fig. 11. More precisely, instead of using
R_t = Exp_A(Π_{ED}^{A} Log_{ED}(S_t)) as the subject-specific reconstruction of the time frame S_t, we
introduce a parameter λ > 0 and use

R_t(λ) = Exp_A(λ Π_{ED}^{A} Log_{ED}(S_t)),

where Π_{ED}^{A} is the parallel transport map along the geodesic that joins the ED
frame to the atlas. Based on our observation of the results of parallel transport
(Figs. 12 and 14) and of the relationship between the ejection fraction error
and the disease group (Fig. 15C), we hypothesize that this rescaling should
be patient-specific and depend on the volume of the ED frame. Furthermore,
rescaling should not be necessary if the volume of the ED frame matches that
of the atlas. The relevant quantity is thus the ratio VolED/VolA.

3.3.2 Criterion and estimation of λ


Given a patient’s sequence, we now need a criterion to find λ. As discussed in
Section 1, a clinically relevant quantity that should be conserved is the ejection
fraction, as it is a straightforward scalar indicator of cardiac function. A perfect
preservation of the ejection fraction between each time point and the ED frame
constrains the volume of each reconstruction:

Vol_new(t) = (Vol_A / Vol_ED) · Vol_{S_t}.

A straightforward loss function for λ is therefore

L(λ) = Σ_{i=1}^{k} (Vol(R_{t_i}(λ)) − Vol_new(t_i))².

For each patient, we then minimize this loss function with respect to λ to find
the patient-specific scaling parameter. We solve this problem by gradient
descent, where the gradient is computed by automatic differentiation. The
parameter λ is regularized to be close to 1 to avoid poor solutions.
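A minimal sketch of this estimation step is given below. It assumes a callable vol_of_reconstruction(lam, t) returning Vol(R_t(λ)); in the real pipeline this requires shooting the rescaled transported momentum and measuring the mesh volume, so the snippet substitutes a smooth toy volume model to be runnable, and it replaces automatic differentiation by central finite differences. The learning rate and regularization weight are illustrative choices.

```python
import numpy as np

def fit_lambda(vol_of_reconstruction, vol_targets, times,
               reg=1e-2, lr=5e-4, n_iter=200):
    """Minimize L(lam) + reg * (lam - 1)^2 by gradient descent, estimating
    the gradient by central finite differences (a stand-in for the automatic
    differentiation used in practice)."""
    lam, eps = 1.0, 1e-4

    def loss(l):
        data = sum((vol_of_reconstruction(l, t) - v) ** 2
                   for t, v in zip(times, vol_targets))
        return data + reg * (l - 1.0) ** 2

    for _ in range(n_iter):
        g = (loss(lam + eps) - loss(lam - eps)) / (2 * eps)
        lam -= lr * g
    return lam

# Toy stand-in for Vol(R_t(lambda)): a volume shrinking exponentially in t.
vol_A = 100.0
toy_vol = lambda lam, t: vol_A * np.exp(-0.4 * lam * t)
times = np.linspace(0.0, 1.0, 5)
targets = [vol_A * np.exp(-0.5 * t) for t in times]  # desired volumes Vol_new(t)
print(fit_lambda(toy_vol, targets, times))           # close to 0.5 / 0.4 = 1.25
```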

3.3.3 Results
To validate the method, we first check that the ejection fraction is effectively
conserved. The errors on ejection fraction and area strain are reported in
Table 2. We indeed observe a threefold decrease of the RMSE on the ejection
fraction. Additionally, the distribution of this error is no longer related to the
disease group (Fig. 16). Visually, the reconstructions obtained by the scaled
parallel transport are more realistic (Fig. 17). Interestingly, it is possible to
obtain a low error on the EF after the scaled transport, but this does not pre-
serve the area strain (AS). This shows that although these quantities are
related, they carry different information, and the AS depends on the initial
shape, itself related to the pathology.

TABLE 2 Validation of scaled parallel transport (SPT) with EF and AS.

      Original values    RMSE PT    RMSE SPT
AS    0.24 ± 0.08        0.18       0.13
EF    0.42 ± 0.13        0.13       0.04

FIG. 16 Alteration of the ejection fraction per disease group. The scaled parallel transport achieves an error that is not related to the disease group.

FIG. 17 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of the ED-to-ES deformation (point cloud to mesh) along the ED-to-atlas deformation. The large volume changes between subjects and the atlas result in unrealistic ES frames (middle column). This problem is well addressed by the scaling strategy (right column).

3.3.3.1 Relationship between λ and VolED


As expected, the scaling coefficient is closely related to the ED volume. We
use a linear regression to identify a linear relation between log(λ) and
log(Vol_A / Vol_ED). This is displayed in Fig. 18. The relationship seems highly
significant, as we measure a coefficient of determination R² = 0.92. The
measured slope is α = 0.64 and corresponds to the relationship

λ = β (Vol_A / Vol_ED)^α,   (10)

where β is the exponential of the intercept and is found close to 1, with β = 1.08.

FIG. 18 Scaling parameter λ with respect to the ED volume.
In order to better understand this relationship, we reproduce the experiment
on 2d simulated shapes, with the same experimental set-up as in the validation
experiment. It is striking that we also find a highly significant relationship with
R² = 0.99 and almost the same slope α = 0.67, with β = 1.02. The values
measured for α are close to 2/3 and surprisingly do not seem to depend on
the dimension of the ambient space.
Our experiments thus give strong evidence in favor of relation (10), and
raise many questions. First, is α a constant, or does it depend on the data,
the dimension, the kernel, or the deformation model? From our results it
seems that it does not depend on the data or the dimension. Performing the
same experiments with different kernels could bring further information; this
is ongoing work. Similarly, experiments in the SVF framework could be per-
formed to test whether this effect is specific to LDDMM or more general.
These experiments could also clarify whether β significantly differs from one.
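The fit of relation (10) itself is an ordinary least-squares problem in log-log coordinates. The sketch below uses synthetic stand-in data (the volumes and scaling factors are made up for illustration and are not the study's measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
vol_A = 100.0
vol_ed = rng.uniform(50.0, 300.0, size=50)            # synthetic ED volumes
true_alpha, true_beta = 2.0 / 3.0, 1.0
lam = (true_beta * (vol_A / vol_ed) ** true_alpha
       * np.exp(0.02 * rng.standard_normal(50)))      # noisy scaling factors

# log(lam) = log(beta) + alpha * log(vol_A / vol_ed): fit by least squares.
X = np.column_stack([np.ones(50), np.log(vol_A / vol_ed)])
coef, *_ = np.linalg.lstsq(X, np.log(lam), rcond=None)
beta_hat, alpha_hat = np.exp(coef[0]), coef[1]
print(alpha_hat, beta_hat)   # recovers alpha ~ 2/3 and beta ~ 1
```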
Furthermore, this effect suggests that all the experiments that have been
performed with parallel transport should be revised with a scaling step. Most
of these in the medical imaging field concern the evolution of brain images
(Cury et al., 2016; Lorenzi et al., 2015). However, longitudinal deformations
of the brain are very different from those of the right ventricle, and we do not
expect the scaling step to be necessary. Indeed, brain deformations are char-
acterized by very localized strong Jacobian determinant, but very little overall
volume changes. A more relevant type of data to reproduce this effect could
be to evaluate tumor evolution during treatment, or the evolution of more
plastic organs such as the liver, whose shape is affected by all the liver dis-
eases, whether viral, metabolic, or due to alcohol.
Finally, a last hypothesis is that by scaling the parallel transport of the
LDDMM metric, we are approximating the parallel transport of a direct prod-
uct metric with a volume-preserving factor and a metric on the volume
changes, as will be introduced in the next section. This relation would translate
into the choice of the metric on the volume part.

3.4 Changing the metric to preserve relative volume changes


Although modifying the parallel transport equations works astonishingly well from the practical point of view, it is not satisfactory from a theoretical point of view, since it decorrelates the transport from the metric used on the deformations. We investigate in this section a more satisfactory strategy that decouples the volume change from the deformation directly within the LDDMM metric. This strategy is based on the scale-invariant metric proposed by Niethammer and Vialard (2013), and it was implemented thanks to the precious advice of François-Xavier Vialard. More generally, they show that a Riemannian metric on the space of shapes preserves a global nondegenerate function f (such as volume) if and only if it can be decomposed into a product metric on Im(f) and f⁻¹(0) (see Niethammer and Vialard, 2013, Theorem 41).

3.4.1 Model
We apply their result as follows. Restricting to the space of shapes that are embeddings of the circle in ℝ² or of the sphere in ℝ³, let M = Emb(Sᵈ, ℝᵈ⁺¹) be the set of shapes, endowed with a metric g. As we are interested in relative volume changes, we consider the function f = log(Vol/Vol_A): M → ℝ (so that df = dVol/Vol), and let M₀ = f⁻¹(0), the space of shapes whose volume equals the volume of the atlas Vol_A. Then (by Niethammer and Vialard, 2013, Theorem 41), (M, g) can be decomposed into a direct product of Riemannian manifolds (M, g) = (ℝ, dt²) × (M₀, g₀), where g₀ is a Riemannian metric on the submanifold M₀. This metric can be constructed by choosing a projection π: M → M₀ and the restriction of g to M₀. However, there is no canonical projection, and two schemes are proposed in Niethammer and Vialard (2013):
• by gradient flow: we follow the flow of (dVol)♯, where ♯ depends on the choice of metric on M, in this case the LDDMM metric.

• by scaling: we center the shape around 0 and divide all the landmarks by Vol_q^{1/3}. This choice depends on the center (barycenter) of the shape, which may be unnatural. (A minimal sketch of this scaling step follows the list.)
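As an illustration of the scaling scheme, here is a minimal sketch, assuming a closed triangulated surface and computing the enclosed volume with the divergence theorem; the function names and the convention of rescaling to a target volume (e.g., Vol_A) are ours, not a reference implementation.

import numpy as np

def mesh_volume(vertices, faces):
    # Volume enclosed by a closed triangulated surface, via the divergence
    # theorem: sum of signed tetrahedra (origin, v0, v1, v2) over the faces.
    v0, v1, v2 = (vertices[faces[:, k]] for k in range(3))
    return abs(np.einsum("ij,ij->i", v0, np.cross(v1, v2)).sum()) / 6.0

def project_by_scaling(vertices, faces, target_volume):
    # Center at the barycenter, then rescale so that the enclosed volume
    # matches target_volume; dividing by Vol^(1/3) is the target_volume = 1 case.
    center = vertices.mean(axis=0)
    scale = (target_volume / mesh_volume(vertices, faces)) ** (1.0 / 3.0)
    return center + scale * (vertices - center)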
In both cases, the framework is slightly modified by applying this projection first to all the time frames, so that their volume matches that of the atlas. Then the previous framework of Fig. 11 is applied with volume-preserving geodesics and parallel transport. This corresponds to the parallel transport on the factor M₀ of M = ℝ × M₀. Finally, the (Euclidean) transport on ℝ corresponds to applying the vector log(Vol_St/Vol_ED) to 0, so that the new volume is Vol_new(t) = Vol_A · Vol_St/Vol_ED as in the previous section, and the ejection fraction is preserved by construction. This volume is obtained either by scaling or by gradient flow, consistently with the projection.

3.4.2 Implementation
We now give more details on the computation of geodesics and our implementation in the LDDMM framework of Section 2.2. Recall that tangent spaces are described by the Hilbert space H_K of vector fields obtained by convolution. As we restrict to M₀ = f⁻¹(0), the vertical tangent subspace V_q at any shape q is defined as the set of vector fields of H_K that are volume preserving, i.e.,

$$V_q := \ker(df_q) = \{ v \in H_K \mid d\mathrm{Vol}_q(v(q)) = 0 \}.$$

To avoid any confusion, we distinguish the linear form dVol_q from its representation in the canonical basis ∂Vol_q, such that dVol_q(v(q)) = v(q)ᵀ∂Vol_q. The orthogonal projection on V_q (which depends on the LDDMM metric) can be computed as

$$\pi_q : T_q M \to V_q, \qquad v \mapsto v - d\mathrm{Vol}_q(v)\,\vec{n}, \qquad (11)$$

where $\vec{n} = (d\mathrm{Vol}_q)^\sharp / \|(d\mathrm{Vol}_q)^\sharp\|^2$. Note that the map ♯ associates a tangent vector to the linear form dVol_q, such that dVol_q(v) = ⟨(dVol_q)♯, v⟩_K, so that (dVol_q)♯ = K(·, q)∂Vol_q. Define also the dual π*_q : T*_q M → V*_q of π_q by the relation μ(π_q(v)) = π*_q(μ)(v), i.e., in matrix notation, μᵀπ_q(v) = vᵀπ*_q(μ). From (11), we obtain

$$\pi_q^* : T_q^* M \to V_q^*, \qquad (c, \mu) \mapsto \left( c,\; \mu - \frac{\mu^\top K(c, q)\,\partial\mathrm{Vol}_q}{\partial\mathrm{Vol}_q^\top K(q, q)\,\partial\mathrm{Vol}_q}\, d\mathrm{Vol}_q \right). \qquad (12)$$
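To make (12) concrete, here is a minimal numerical sketch (our own illustration, not the chapter's implementation), in the simplified case where the control points coincide with the landmark points (c = q) and the kernel is a scalar Gaussian of width sigma.

import numpy as np

def gaussian_kernel(x, y, sigma):
    # Scalar Gaussian kernel matrix, acting diagonally on each coordinate.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def dual_projection(mu, q, dvol_q, sigma):
    # Dual projection of Eq. (12) with c = q: removes from the momenta mu the
    # component that generates volume change. mu and dvol_q are (n, d) arrays,
    # dvol_q being the canonical representation of dVol_q (written dVol_q above).
    K = gaussian_kernel(q, q, sigma)
    num = np.einsum("id,ij,jd->", mu, K, dvol_q)      # mu^T K(q, q) dVol_q
    den = np.einsum("id,ij,jd->", dvol_q, K, dvol_q)  # dVol_q^T K(q, q) dVol_q
    return mu - (num / den) * dvol_q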

3.4.3 Geodesics
A geodesic between q₀, q₁ ∈ Emb¹(S², ℝ³) is a solution of

$$\inf \int_0^1 \| v_t \|_K^2 \, dt \qquad (13)$$

under the constraints q̇ = v_t(q(t)), where v_t ∈ V_{q(t)}. The constraints can be relaxed to q̇ = π_{q(t)}(v_t(q(t))), where v_t ∈ V. This allows solving Problem (13) through the Hamiltonian equations associated with

$$H(\mu, q) = \frac{1}{2} \left\langle \pi_q^*(\mu),\, K(q)\, \pi_q^*(\mu) \right\rangle.$$

In order to implement these geodesics in the LDDMM framework presented in Section 2.2, we consider that both the control points and the landmark points are part of the system, and write the Hamiltonian

$$H(\mu, q, c) = \frac{1}{2}\, \mu^\top K(c, c)\, \mu - \frac{1}{2}\, \frac{\big(\mu^\top K(c, q)\, \partial\mathrm{Vol}_q\big)^2}{\|(d\mathrm{Vol}_q)^\sharp\|_K^2}.$$

The corresponding Hamiltonian system can readily replace (8) for registration and in the geodesics of the pole ladder constructions. We can thus conduct the same experiment as in the previous section with this metric.
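This Hamiltonian is straightforward to evaluate with kernel matrices. The following minimal sketch (our own, assuming a scalar Gaussian kernel and reusing gaussian_kernel from the sketch above) uses the identity ‖(dVol_q)♯‖²_K = ∂Vol_qᵀ K(q, q) ∂Vol_q.

import numpy as np

def hamiltonian(mu, q, c, dvol_q, sigma):
    # Kinetic term of the LDDMM metric minus its volume-changing component.
    # mu: (n_c, 3) momenta at the control points c; dvol_q: (n_q, 3) is dVol_q.
    kinetic = 0.5 * np.einsum("id,ij,jd->", mu, gaussian_kernel(c, c, sigma), mu)
    num = np.einsum("id,ij,jd->", mu, gaussian_kernel(c, q, sigma), dvol_q)
    den = np.einsum("id,ij,jd->", dvol_q, gaussian_kernel(q, q, sigma), dvol_q)
    return kinetic - 0.5 * num**2 / den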

3.4.4 Results
We first validate our implementation by measuring the area strain and ejection fraction errors, and complete Table 2 into Table 3. Similarly, we reproduce the boxplot of Fig. 15C with the two projections and the previous volume-constrained scaling strategy. As in the previous section, there is no relation between the error and the disease group. We show some reconstructions in Fig. 19, alongside the reconstructions obtained by the scaled parallel transport of the previous section. The two metrics and the two projection methods result in different reconstructions. The projection by gradient flow seems less stable than the two other methods, as the thinning near the valve is exaggerated on the first row, and the apex of the second row is unrealistically spherical.

TABLE 3 Validation of the volume-preserving parallel transport (VPPT), with the projection performed either by gradient flow or by scaling.

      Original values   PT     SPT    Grad VPPT   Scaling VPPT
AS    0.24 ± 0.08       0.18   0.13   0.13        0.09
EF    0.42 ± 0.13       0.13   0.04   0.02        0.02

FIG. 19 Examples of reconstructions of the ES frame (blue meshes) after parallel transport of the ED-to-ES deformation (point cloud to mesh) along the deformation from ED to atlas. Comparison of the three methods. The gradient flow seems less stable (first two rows) than the projection by scaling. The scaling method results in more vertical movement of the base than the two other methods. This may be due to the centering step required before scaling.

The values of area strain obtained by this normalization procedure were included in the analysis of Di Folco et al., which leveraged dimensionality reduction methods to compare the different normalization procedures and explore the interactions between shape and deformations. Shape descriptors were computed by point-wise differences between the ED mesh and the atlas, after a rigid-body alignment or an alignment by a similarity. By comparing the latent spaces obtained with normalized and unnormalized data, they show that both methods reduce the bias introduced by the alignment of the ED shapes (Di Folco et al., 2021).
Now we come to the original goal of this chapter: performing group-wise analysis of the systolic deformation. We have proposed several ways to normalize each individual time deformation, and we now need to summarize the complete systolic trajectory. We propose to use a geodesic, or higher-order, regression. This is detailed in the next section, dedicated to the downstream analyses.

3.5 Analysis of the normalized deformations


In order to compactly represent the full systolic trajectory in the same space and to proceed with linear statistics, we propose to fit a regression model in the shape space as considered in the registration framework of Section 2.2. It estimates a geodesic path between two shapes (the equivalent of a uniform motion). For a trajectory such as the contraction of the cardiac RV, one may expect to find second-order dynamics, making a (first-order) geodesic regression ill-suited. We thus propose to use the second-order model defined in Trouve and Vialard (2012) to account for the motion of the RV during systole. It is similar to a geodesic regression (Fishbaugh et al., 2017) in the shape space, with an additional acceleration term in the model of the trajectory. We therefore compare it to the simpler geodesic regression.

3.5.1 Geodesic and spline regression


The model of Trouve and Vialard (2012) introduces a second-order term u_t that can be interpreted as a random external force smoothly perturbing the trajectory around a mean geodesic. It modifies the continuous-time system of Eq. (8) as follows: for all t ∈ [0, 1],

$$\begin{cases}\;\dot{c}_k(t) = \sum_j K\big(c_k(t), c_j(t)\big)\, \mu_j(t) \\[4pt] \;\dot{\mu}_k(t) = -\sum_j \nabla_1 K\big(c_k(t), c_j(t)\big)\, \mu_k(t)^\top \mu_j(t) + u_k(t) \end{cases} \qquad (14)$$

If we consider a discrete sequence of observation times t₁ = 0, t₂, …, t_d = 1 and configurations x_{t₁}, …, x_{t_d}, one seeks the path φ_t that minimizes the new cost

$$C_S(c, \mu, u_t) = \frac{1}{\alpha} \sum_{i=1}^{d} \big\| x_{t_i} - \phi_{t_i}(x_{t_0}) \big\|_2^2 + \int_0^1 \| u(t) \|^2 \, dt + \big\| v_0^{c,\mu} \big\|_K^2. \qquad (15)$$

In practice, the ODEs (8) and (14) are discretized in n time steps, and an integration method such as Euler or Runge–Kutta is used. We define all the patients' trajectories between t = 0 and t = 1, and use the same discretization for all the patients to ensure that u₀, …, u_{N_c} are estimated at corresponding times. Along with μ(0), these are estimated by gradient descent, as in the case of registration. Setting u₀ = … = u_{N_c} = 0 at all times recovers a geodesic trajectory. We use a kernel bandwidth σ = 15 in all the experiments, and 60 control points for all the deformations of the atlas. The initial control points are fixed for the entire dataset so that the initial momenta can be compared consistently. They have been optimized to register the atlas on all the transported ES frames.
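For concreteness, a minimal sketch of a plain Euler discretization of system (14) for a scalar Gaussian kernel follows; it is our own illustration (array shapes and function names are assumptions), and a production implementation would typically prefer a Runge–Kutta scheme.

import numpy as np

def gaussian_kernel(x, y, sigma):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def spline_flow(c0, mu0, u, sigma, n_steps):
    # Euler integration of the second-order (spline) system (14).
    # c0, mu0: (N, 3) initial control points and momenta;
    # u: (n_steps, N, 3) external forces; u = 0 recovers a geodesic.
    dt = 1.0 / n_steps
    c, mu = c0.copy(), mu0.copy()
    for t in range(n_steps):
        K = gaussian_kernel(c, c, sigma)
        dc = K @ mu                              # velocities of the control points
        diff = c[:, None, :] - c[None, :, :]     # c_k - c_j
        inner = mu @ mu.T                        # mu_k^T mu_j
        # For a Gaussian kernel, -grad_1 K(c_k, c_j) = (2 / sigma^2) (c_k - c_j) K_kj.
        dmu = (2.0 / sigma**2) * ((K * inner)[:, :, None] * diff).sum(axis=1) + u[t]
        c, mu = c + dt * dc, mu + dt * dmu
    return c, mu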
3.5.1.1 Results
Visually, we notice only very slight differences between the geodesic and spline regressions (Fig. 20). We do observe that the fit is slightly less precise in the intermediate frames for the geodesic regression, and this is confirmed by measuring the overall data attachment term (left term of (15)) reported in Table 4, regardless of the normalization method. The spline regression therefore yields a more faithful representation of the normalized cardiac deformations than the geodesic regression, at the cost of a larger set of parameters, encompassing the external time-dependent forces u_t, whose interpretation and analysis are difficult. Indeed, the size of the acceleration term u_t is (d − 1) × N_c × 3 = 1620.

FIG. 20 Fit of the normalized sequence of an ASD patient by geodesic (top) and spline (bottom)
regression. Fits are the purple transparent meshes, and normalized data are the blue meshes with
edges. Upon visual inspection, there is barely any difference between the two fits.

TABLE 4 Comparison of the mean data attachment term for each method.

           SPT          Grad VPPT    Scaling VPPT
Geodesic   1384 ± 952   1659 ± 906   1483 ± 864
Spline     549 ± 413    681 ± 390    554 ± 324

3.5.2 Hotelling tests on velocities


The sequences are normalized by one of the three proposed methods: scaled parallel transport (SPT), volume-preserving parallel transport (VPPT) with gradient projection, or VPPT with scaling projection, and summarized by a regression with one of the two deformation models: geodesic or spline. In this section, we focus on the results obtained with the scaled parallel transport and spline regression. This procedure results in a set of descriptive parameters for each individual that lie in a linear space: the co-tangent space at the atlas. For more stable descriptors of the deformation, we compute the initial velocity vectors at the initial control points with v = K(c, c)μ. We now have all the ingredients to perform statistics on cardiac deformations described by the initial velocities.
We compare the impact of the different diseases by performing pairwise Hotelling tests between each disease group and the control group. One statistic is computed at each control point, testing whether the mean velocity at that point is significantly different. A Bonferroni correction is then applied for multiple testing across the full set of control points, to maintain the type I error risk at α = 0.05.
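A minimal sketch of this per-point testing procedure (our own illustration; the array layouts are assumptions, not the chapter's code), using the standard F-distribution transform of the two-sample Hotelling T² statistic:

import numpy as np
from scipy import stats

def hotelling_two_sample(x, y):
    # Two-sample Hotelling T^2 test for the 3D velocity vectors observed at
    # one control point; x, y are (n, 3) and (m, 3) samples.
    n, m, p = len(x), len(y), x.shape[1]
    diff = x.mean(axis=0) - y.mean(axis=0)
    pooled = ((n - 1) * np.cov(x.T) + (m - 1) * np.cov(y.T)) / (n + m - 2)
    t2 = (n * m) / (n + m) * diff @ np.linalg.solve(pooled, diff)
    f = t2 * (n + m - p - 1) / (p * (n + m - 2))
    return stats.f.sf(f, p, n + m - p - 1)            # p-value

def significant_points(vel_disease, vel_control, alpha=0.05):
    # One test per control point, Bonferroni-corrected across all points.
    # vel_*: (n_subjects, N_c, 3) initial velocities v = K(c, c) mu.
    pvals = np.array([hotelling_two_sample(vel_disease[:, k], vel_control[:, k])
                      for k in range(vel_control.shape[1])])
    return pvals < alpha / len(pvals)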
FIG. 21 Results of the Hotelling tests between velocities. Where the differences are significant, control points and the difference with the control group are in red, while colored arrows represent the mean velocity field for the disease group. All the arrows have been scaled by a factor of 2.3 for visualization purposes. The color map of the meshes reflects the norm of the velocity field at each point, if it is significantly different from the control group. As for the area strain, there is little difference for the ASD group, showing that the volume differences have been well filtered out by the normalization. Moreover, the differences for the PHT and ToF groups reflect deformations with less amplitude than the control group, as observed on the ejection fraction.

The results for the velocities of the spline regression with the scaled PT are displayed in Fig. 21 for the ASD, ToF, and PHT groups and show significant differences between each disease and the control group. The differences are superimposed with the group mean velocity fields, and show that
these differences are mainly in magnitude, with different orientations at a few points along the free wall. The differences observed near the tricuspid valve mainly reflect the difference in magnitude of the deformations, and one should be cautious before drawing further conclusions, as the quality of the mesh may vary near this region. However, it is interesting to notice that very few differences are observed for the ASD group, which corroborates previous results (Moceri et al., 2020). Similarly, only small differences are observed on the septum for the PHT group, showing that the shape differences usually observed in the PHT group have been filtered out. This makes the differences observed on the free wall interesting, and other markers such as the circumferential strain will be studied to confirm these effects. These experiments also highlight the difference between the PHT and ToF groups: significant differences are localized on the inferior part of the free wall and on the inlet (area of the tricuspid valve) for the ToF group, while they are distributed across the whole shape for the PHT group.

The same visualization is provided for the mean acceleration term by disease group, across the group-wise mean trajectory (Fig. 22). Recall that these terms can be interpreted as external forces. They seem to increase the movement toward contraction at the beginning of motion while slowing it down toward the end. These terms thus provide more insight into the dynamics of each disease.

FIG. 22 Results of the Hotelling tests between accelerations. Where the differences are significant, control points and differences with the control group are in red, while colored arrows represent the mean acceleration for the disease group. As for the velocity fields, there are no differences for the ASD group. For the ToF group, the differences are localized, but these sites vary with time, while for the PHT group, differences are distributed on the entire shape.

Similar to the analysis of the velocities, the mean external force of the ASD group is not significantly different from that of the control group. This shows that the volume differences reported in Fig. 15A were successfully normalized by the scaled parallel transport. For the ToF and PHT groups, many
points have significantly different values of acceleration, and those differences are shown by red arrows. They are distributed across the whole RV for the PHT group, while their locations vary in time for the ToF group and concentrate around the apex toward the end of the deformation. For the ToF group, we do find differences on the infundibulum (area of the pulmonary valve) at times ED + 1 and ED + 2, as expected from the consequences of the surgery. This speaks in favor of our method for the analysis of shape deformations, as these differences were not observed with the traditional analysis of strain maps. As in the case of velocities, the differences mainly reflect amplitude differences, especially near the ED and ES times. They thus reflect the late contraction of the RV, and their distribution across the ventricle and in time reflects a longer and asynchronous contraction introduced by the disease. Indeed, the deformation of the control group is more uniformly distributed on the shape; the introduction of heterogeneity across the shape has been previously reported and associated with the disease, and is known to be a factor of arrhythmia with direct consequences on survival.

4 Conclusion
In this chapter, we first summarized recent results using parallel transport for geometric statistics in computational anatomy. We proposed in particular a new implementation of the pole ladder scheme. These results guarantee that parallel transport algorithms are well behaved, enabling their use as a normalization step for longitudinal shape data. This further makes it possible to evaluate the underlying metric and the normalization model beyond numerical stability.
On the application side, the results on data of the motion of the cardiac right ventricle exhibited a strong bias due to the volume differences at a reference time point. Two strategies were investigated to correct for this bias: scaling the parallel transport, and using a metric that decouples volume changes from volume-preserving deformations. Both methods were successful at preserving the ejection fraction, and reduced the bias due to volume differences. When scaling the parallel transport, we discovered a significant relationship between the scaling parameter and the initial volume ratio; however, its interpretation must be understood further. This will be investigated in future work, together with the possible mathematical relation to the volume-preserving metric. The statistical analysis of the normalized deformations was meaningful and coherent with previous knowledge. This framework is readily usable to evaluate the impact of a treatment or a surgery on the deformation. It can also be used to simulate new cardiac sequences for a given shape from a population of cardiac sequences, by performing a principal component analysis of the momentum vectors of the spline or geodesic regression and shooting along the principal modes, or by sampling coefficients from a multivariate normal law to mix the principal components.

This study is however not sufficient to decide which method should be preferred. This comes back to the more general question of choosing the metric. As the numerical properties of the parallel transport algorithm are now well grounded, parallel transport can now be used as a proxy to evaluate the pertinence of the metric used. In this implementation of the LDDMM framework, the choice of metric boils down to the choice of kernel, but the framework itself may also be questioned. As data-driven methods have gained popularity in this field, more papers have proposed strategies to learn the metric from data (Louis et al., 2019; Niethammer et al., 2019; Vialard and Risser, 2014). These solutions are attractive because they remove the arbitrariness of the choice of metric. Yet, beyond the computational complexity, these approaches face the problem of defining "natural" criteria to assess a model. It will be interesting to test the parallel transport normalization method with a metric learned from data.

Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant G-Statistics agreement No 786854). This work has been supported by the French government, through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002.

Abbreviations
LDDMM large deformation diffeomorphic metric mapping
SVF stationary velocity field
RKHS reproducing kernel Hilbert space

References
Arsigny, V., Commowick, O., Pennec, X., Ayache, N., 2006. A log-Euclidean framework for
statistics on diffeomorphisms. In: Larsen, R., Nielsen, M., Sporring, J. (Eds.), Medical Image
Computing and Computer-Assisted Intervention—MICCAI 2006. LNCS 4190. Springer
Berlin Heidelberg, pp. 924–931. https://doi.org/10.1007/11866565_113.
Ashburner, J., Hutton, C., Frackowiak, R., Johnsrude, I., Price, C., Friston, K., 1998. Identifying
global anatomical differences: deformation-based morphometry. Human Brain Mapping 6
(5–6), 348–357.
Bauer, M., Bruveris, M., Michor, P.W., 2014. Overview of the geometries of shape spaces and
diffeomorphism groups. J. Math. Imaging Vis. 50 (1–2), 60–97. https://doi.org/10.1007/
s10851-013-0490-z.
Bône, A., Louis, M., Martin, B., Durrleman, S., 2018. Deformetrica 4: An Open-Source Software for Statistical Shape Analysis. In: Reuter, M., Wachinger, C., Lombaert, H., Paniagua, B., Lüthi, M., Egger, B. (Eds.), Shape in Medical Imaging. ShapeMI 2018. LNCS 11167. Springer, Cham, pp. 3–13. https://doi.org/10.1007/978-3-030-04747-4_1.

Charon, N., Charlier, B., Glaunès, J.A., Gori, P., Roussillon, P., 2020. Fidelity metrics between
curves and surfaces: currents, varifolds, and normal cycles. In: Pennec, X., Sommer, S.,
Fletcher, T. (Eds.), Riemannian Geometric Statistics in Medical Image Analysis. Academic
Press, pp. 441–477. https://doi.org/10.1016/B978-0-12-814725-2.00021-2.
Cury, C., Lorenzi, M., Cash, D., Nicholas, J., Routier, A., Rohrer, J., Ourselin, S., Durrleman, S.,
Modat, M., 2016. Spatio-temporal shape analysis of cross-sectional data for detection of early
changes in neurodegenerative disease. In: SeSAMI 2016—First International Workshop Spec-
tral and Shape Analysis in Medical Imaging. LNCS 10126. Springer, pp. 63–75. https://doi.
org/10.1007/978-3-319-51237-2_6.
Debavelaere, V., Bône, A., Durrleman, S., Allassonnière, S., for the Alzheimer's Disease Neuroimaging Initiative, 2019. Clustering of longitudinal shape data sets using mixture of separate or
branching trajectories. In: Medical Image Computing and Computer Assisted Intervention—
MICCAI 2019. LNCS 11767. Springer, Cham, pp. 66–74. https://doi.org/10.1007/978-3-030-
32251-9_8.
Di Folco, M., Clarysse, P., Moceri, P., Duchateau, N., 2019. Learning interactions between car-
diac shape and deformation: application to pulmonary hypertension. In: Tenth International
Statistical Atlases and Computational Modeling of the Heart (STACOM) Workshop, Held
in Conjunction With MICCAI 2019, Shenzen, China. LNCS 12009. Shenzen, China,
pp. 119–127. https://doi.org/10.1007/978-3-030-39074-7_13.
Di Folco, M., Guigui, N., Clarysse, P., Moceri, P., Duchateau, N., 2021. Investigation of the
impact of normalization on the study of interactions between Myocardial shape and deforma-
tion. In: Functional Imaging and Modeling of the Heart. LNCS 12738. Springer, Cham,
pp. 223–231. https://doi.org/10.1007/978-3-030-78710-3_22.
Dryden, I.L., Mardia, K.V., 2016. Statistical Shape Analysis With Applications in R, second ed.
Wiley Series in Probability and Statistics, Wiley, Chichester, UK; Hoboken, NJ, ISBN:
978-0-470-69962-1.
Duchateau, N., De Craene, M., Piella, G., Silva, E., Doltra, A., Sitges, M., Bijnens, B.H.,
Frangi, A.F., 2011. A spatiotemporal statistical atlas of motion for the quantification of abnor-
mal myocardial tissue velocities. Med. Image Anal. 15 (3), 316–328. https://doi.org/10.1016/
j.media.2010.12.006.
Durrleman, S., Allassonnière, S., Joshi, S., 2013a. Sparse adaptive parameterization of variability
in image ensembles. Int. J. Comput. Vis. 101 (1), 161–183. https://doi.org/10.1007/s11263-
012-0556-1.
Durrleman, S., Pennec, X., Trouve, A., Braga, J., Gerig, G., Ayache, N., 2013b. Toward a com-
prehensive framework for the spatiotemporal statistical analysis of longitudinal shape data.
Int. J. Comput. Vis. 103 (1), 22–59. https://doi.org/10.1007/s11263-012-0592-x.
Durrleman, S., Prastawa, M., Charon, N., Korenberg, J.R., Joshi, S., Gerig, G., Trouve, A., 2014.
Morphometry of anatomical shape complexes with dense deformations and sparse parameters.
NeuroImage 101, 35–49. https://doi.org/10.1016/j.neuroimage.2014.06.043.
Ehlers, J., Pirani, F.A.E., Schild, A., 1972. The geometry of free fall and light propagation. In:
O’Raifeartaigh, L. (Ed.), General Relativity: Papers in Honour of J. L. Synge. Clarendon
Press, Oxford, pp. 63–84.
Fishbaugh, J., Durrleman, S., Prastawa, M., Gerig, G., 2017. Geodesic shape regression with mul-
tiple geometries and sparse parameters. Med. Image Anal. 39, 1–17. https://doi.org/10.1016/
j.media.2017.03.008.
Gerber, S., Tasdizen, T., Fletcher, T., Joshi, S., Whitaker, R., 2010. Manifold modeling for brain
population analysis. Med. Image Anal. 14 (5), 643–653. https://doi.org/10.1016/
j.media.2010.05.008.

Grenander, U., Miller, M., 1998. Computational anatomy: an emerging discipline. Q. Appl. Math.
LVI (4), 617–694.
Guigui, N., Pennec, X., 2021. Numerical accuracy of ladder schemes for parallel transport on
manifolds. Found. Comput. Math. https://doi.org/10.1007/s10208-021-09515-x.
Guigui, N., Moceri, P., Sermesant, M., Pennec, X., 2021. Cardiac motion modeling with parallel
transport and shape splines. In: ISBI 2021—IEEE 18th International Symposium on Biomed-
ical Imaging. IEEE, pp. 1394–1397. https://doi.org/10.1109/ISBI48211.2021.9433887.
Hadj-Hamou, M., Lorenzi, M., Ayache, N., Pennec, X., 2016. Longitudinal analysis of image time
series with diffeomorphic deformations: a computational framework based on stationary
velocity fields. Front. Neurosci. 10 (236). https://doi.org/10.3389/fnins.2016.00236.
Hauberg, S., Lauze, F., Pedersen, K.S., 2013. Unscented Kalman filtering on Riemannian mani-
folds. J. Math. Imaging Vis. 46 (1), 103–120. https://doi.org/10.1007/s10851-012-0372-9.
Higham, N.J., 2005. The scaling and squaring method for the matrix exponential revisited. SIAM
J. Matrix Anal. Appl. 26 (4), 1179–1193. https://doi.org/10.1137/04061101X.
Hinkle, J., Fletcher, T., Joshi, S., 2014. Intrinsic polynomials for regression on Riemannian mani-
folds. J. Math. Imaging Vis. 50 (1–2), 32–52. https://doi.org/10.1007/s10851-013-0489-5.
Kheyfets, A., Miller, W.A., Newton, G.A., 2000. Schild’s ladder parallel transport procedure for
an arbitrary connection. Int. J. Theor. Phys. 39 (12), 2891–2898. https://doi.org/10.1023/
A:1026473418439.
Kleijn, S.A., Aly, M.F.A., Terwee, C.B., van Rossum, A.C., Kamp, O., 2011. Three-dimensional
speckle tracking echocardiography for automatic assessment of global and regional left ven-
tricular function based on area strain. J. Am. Soc. Echocardiogr. 24 (3), 314–321. https://doi.
org/10.1016/j.echo.2011.01.014.
Lorenzi, M., Pennec, X., 2013. Geodesics, parallel transport & one-parameter subgroups for
diffeomorphic image registration. Int. J. Comput. Vis. 105 (2), 111–127. https://doi.org/
10.1007/s11263-012-0598-4.
Lorenzi, M., Pennec, X., 2014. Efficient parallel transport of deformations in time series of
images: from Schild to pole ladder. J. Math. Imaging Vis. 50 (1), 5–17. https://doi.org/
10.1007/s10851-013-0470-3.
Lorenzi, M., Ayache, N., Frisoni, G.B., Pennec, X., ADNI, 2011a. Mapping the effects of Aβ levels on the longitudinal changes in healthy aging: hierarchical modeling based on stationary velocity fields. In: Medical Image Computing and Computer-Assisted Intervention—MICCAI 2011, September. Springer, Berlin, Heidelberg, pp. 663–670.
Lorenzi, M., Ayache, N., Pennec, X., 2011b. Schild's ladder for the parallel transport of deformations in time series of images. In: IPMI—22nd International Conference on Information Processing in Medical Images—2011, July, vol. 6801. Springer, p. 463.
Lorenzi, M., Pennec, X., Frisoni, G.B., Ayache, N., 2015. Disentangling normal aging from
Alzheimer’s disease in structural magnetic resonance images. Neurobiol. Aging 36, S42.
https://doi.org/10.1016/j.neurobiolaging.2014.07.046.
Louis, M., Bône, A., Charlier, B., Durrleman, S., 2017. Parallel transport in shape analysis: a scal-
able numerical scheme. In: Nielsen, F., Barbaresco, F. (Eds.), Geometric Science of Informa-
tion. LNCS 10589. Springer International Publishing, pp. 29–37. https://link.springer.com/
chapter/10.1007/978-3-319-68445-1_4.
Louis, M., Charlier, B., Jusselin, P., Pal, S., Durrleman, S., 2018. A fanning scheme for the par-
allel transport along geodesics on Riemannian manifolds. SIAM J. Numer. Anal. 56 (4),
2563–2584. https://doi.org/10.1137/17M1130617.
Louis, M., Couronne, R., Koval, I., Charlier, B., Durrleman, S., 2019. Riemannian geometry
learning for disease progression modelling. In: Information Processing in Medical Imaging.
Springer, Cham, pp. 542–553.

Mansi, T., Durrleman, S., Bernhardt, B., Sermesant, M., Delingette, H., Voigt, I., Lurz, P.,
Taylor, A.M., Blanc, J., Boudjemline, Y., Pennec, X., Ayache, N., 2009. A statistical model
of right ventricle in tetralogy of fallot for prediction of remodelling and therapy planning. In:
Yang, G.-Z., Hawkes, D., Rueckert, D., Noble, A., Taylor, C. (Eds.), Medical Image Comput-
ing and Computer-Assisted Intervention—MICCAI 2009. LNCS 5761. Springer Berlin,
Heidelberg, Berlin, Heidelberg, pp. 214–221. vol. https://link.springer.com/chapter/
10.1007/978-3-642-04268-3_27.
McLeod, K., Mansi, T., Sermesant, M., Pongiglione, G., Pennec, X., 2013. Statistical shape anal-
ysis of surfaces in medical images applied to the tetralogy of fallot heart. In: Cazals, F.,
Kornprobst, P. (Eds.), Modeling in Computational Biology and Biomedicine:
A Multidisciplinary Endeavor. Springer, Berlin, Heidelberg, pp. 165–191. https://doi.org/
10.1007/978-3-642-31208-3_5.
Micheli, M., Glaunès, J.A., 2014. Matrix-valued Kernels for shape deformation analysis. Geome-
try Imaging Comput. 1 (1), 57–139. https://doi.org/10.4310/GIC.2014.v1.n1.a2.
Miller, M., Trouve, A., Younes, L., 2015. Hamiltonian systems and optimal control in computa-
tional anatomy: 100 years since D’Arcy Thompson. Ann. Rev. Biomed. Eng. 17 (1),
447–509. https://doi.org/10.1146/annurev-bioeng-071114-040601.
Misner, C.W., Thorne, K.S., Wheeler, J.A., 1973. Gravitation. Princeton University Press, ISBN:
978-0-691-17779-3.
Moceri, P., Duchateau, N., Baudouy, D., Schouver, E.-D., Leroy, S., Squara, F., Ferrari, E.,
Sermesant, M., 2018. Three-dimensional right-ventricular regional deformation and survival
in pulmonary hypertension. Eur. Heart J. Cardiovasc. Imaging 19 (4), 450–458. https://doi.
org/10.1093/ehjci/jex163.
Moceri, P., Duchateau, N., Gillon, S., Jaunay, L., Baudouy, D., Squara, F., Ferrari, E.,
Sermesant, M., 2020. 3D right ventricular shape and strain in congenital heart disease patients
with right ventricular chronic volume loading. Eur. Heart J. Cardiovasc. Imaging 22 (10),
1174–1181. https://doi.org/10.1093/ehjci/jeaa189.
Niethammer, M., Vialard, F.-X., 2013. Riemannian metrics for statistics on shapes : parallel trans-
port and scale invariance. In: Proceedings of Miccai Workshop, MFCA. HAL. http://www-
sop.inria.fr/asclepios/events//MFCA13/Proceedings/MFCA2013_1_1.pdf.
Niethammer, M., Kwitt, R., Vialard, F.-X., 2019. Metric learning for image registration. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 8463–8472.
Pennec, X., 2019. Curvature effects on the empirical mean in Riemannian and affine manifolds: a
non-asymptotic high concentration expansion in the small-sample regime. arXiv:1906.07418
[math, stat].
Pennec, X., Arsigny, V., 2012. Exponential barycenters of the canonical cartan connection
and invariant means on lie groups. In: Matrix Information Geometry, May. Springer,
pp. 123–168. https://doi.org/10.1007/978-3-642-30232-9_7.
Pennec, X., Lorenzi, M., 2020. Beyond Riemannian geometry: the affine connection setting for
transformation groups. In: Pennec, X., Sommer, S., Fletcher, T. (Eds.), Riemannian Geomet-
ric Statistics in Medical Image Analysis. Academic Press, pp. 169–229. https://doi.org/
10.1016/B978-0-12-814725-2.00012-1.
Pennec, X., Sommer, S., Fletcher, T., 2020. Riemannian Geometric Statistics in Medical Image
Analysis. Elsevier. https://doi.org/10.1016/C2017-0-01561-6.
Peyrat, J.-M., Delingette, H., Sermesant, M., Xu, C., Ayache, N., 2010. Registration of 4D cardiac
CT sequences under trajectory constraints with multichannel diffeomorphic demons. IEEE
Trans. Med. Imaging 29 (7), 1351–1368. https://doi.org/10.1109/TMI.2009.2038908.

Qiu, A., Younes, L., Miller, M., Csernansky, J.G., 2008. Parallel transport in diffeomorphisms
distinguishes the time-dependent pattern of hippocampal surface deformation due to healthy
aging and the dementia of the Alzheimer’s type. NeuroImage 40 (1), 68–76. https://doi.org/
10.1016/j.neuroimage.2007.11.041.
Qiu, A., Albert, M., Younes, L., Miller, M., 2009. Time sequence diffeomorphic metric mapping
and parallel transport track time-dependent shape changes. NeuroImage 45 (Suppl. 1),
S51–S60. https://doi.org/10.1016/j.neuroimage.2008.10.039.
Rao, A., Chandrashekara, R., Sanchez-Ortiz, G.I., Mohiaddin, R., Aljabar, P., Hajnal, J.V.,
Puri, B.K., Rueckert, D., 2004. Spatial transformation of motion and deformation fields using
nonrigid registration. IEEE Trans. Med. Imaging 23 (9), 1065–1076. https://doi.org/10.1109/
TMI.2004.828681.
Sanz, J., Sánchez-Quintana, D., Bossone, E., Bogaard, H.J., Naeije, R., 2019. Anatomy, function,
and dysfunction of the right ventricle: JACC state-of-the-art review. J. Am. College Cardiol.
73 (12), 1463–1482. https://doi.org/10.1016/j.jacc.2018.12.076.
Schiratti, J.-B., Allassonnière, S., Colliot, O., Durrleman, S., 2015. Learning spatiotemporal
trajectories from manifold-valued longitudinal data. In: Proc Adv. Neural Inf. Process.
Syst., 28. https://proceedings.neurips.cc/paper/2015/hash/186a157b2992e7daed3677ce8e9fe40f-
Abstract.html.
Singh, N., Vialard, F.-X., Niethammer, M., 2015. Splines for diffeomorphisms. Med. Image Anal.
25 (1), 56–71. https://doi.org/10.1016/j.media.2015.04.012.
Sivera, R., Capet, N., Manera, V., Fabre, R., Lorenzi, M., Delingette, H., Pennec, X., Ayache, N.,
Robert, P., 2020. Voxel-based assessments of treatment effects on longitudinal brain changes
in the multidomain Alzheimer preventive trial cohort. Neurobiol. Aging 94, 50. https://doi.
org/10.1016/j.neurobiolaging.2019.11.020.
Thompson, D.W., 1917. On Growth and Form. Cambridge University Press, Cambridge, https://
doi.org/10.1017/CBO9781107325852.
Trouve, A., Vialard, F.-X., 2012. Shape splines and stochastic shape evolutions: a second order
point of view. Q. Appl. Math. 70 (2), 219–251, Publisher: Brown University. https://www.
jstor.org/stable/43639026.
Vialard, F.-X., Risser, L., 2014. Spatially-varying metric learning for diffeomorphic image regis-
tration: a variational framework. In: Medical Image Computing and Computer-Assisted
Intervention—MICCAI 2014. LNCS 8673. Springer, Cham, pp. 227–234. https://doi.org/
10.1007/978-3-319-10404-1_29.
Younes, L., 2007. Jacobi fields in groups of diffeomorphisms and applications. Q. Appl. Math.
65 (1), 113–134. https://doi.org/10.1090/S0033-569X-07-01027-5.
Younes, L., 2019. Shapes and Diffeomorphisms. Applied Mathematical Sciences, vol. 171
Springer Berlin Heidelberg, Berlin, Heidelberg, ISBN: 978-3-662-58495-8, https://doi.org/
10.1007/978-3-662-58496-5.
Young, A.A., Frangi, A.F., 2009. Computational cardiac atlases: from patient to population and
back. Exp. Physiol. 94 (5), 578–596. https://doi.org/10.1113/expphysiol.2008.044081.
Chapter 9

Geometry and mixture models


Paul Marriott*
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
*Corresponding author: e-mail: pmarriott@uwaterloo.ca

Abstract
In this article we look at the geometry of mixture models. This geometry can be affine, convex, differential, information, or algebraic. While mixture models are simple to define, highly interpretable, and widely used in statistical practice, their underlying statistical properties are complex and still not completely resolved theoretically. This complexity often has a geometric cause, and a geometric resolution. The family presents interesting challenges to the development of a complete geometric theory of inference. Throughout, we illustrate the discussion with simple, and often visual, models to aid intuition.
Keywords: Affine geometry, Convex geometry, Geometry of distributions, Information geometry, Mixture models, Nonstandard asymptotics, Singular models

1 Introduction
The objective of this article is to explore the relationship between information geometry and the theory, and statistical properties, of mixture models. Geometry illuminates the nonstandard inferential properties of these models. Conversely, these models provide challenging test cases which have shaped the development of a general geometric theory of statistical inference. We start by reviewing foundational issues in both mixture modeling and the geometry of the space of distributions.

1.1 Fundamentals of modeling with mixtures


Excellent overviews of the theory and application of mixture models can be
found in Everitt and Hand (1981), Titterington et al. (1985), and McLachlan
and Peel (2000) and references therein. Applications of mixture models can
be found across almost all of applied statistics, for example: clustering
(McLachlan and Basford, 1988); robustness (Hampel et al., 2011); measurement


error modeling (Fuller, 1987); repeated measures (Crowder and Hand, 2017); machine learning (ML); and Bayesian inference (Watanabe, 2021a); and many other areas whenever modeling considers latent, or hidden, structure.
Very often analysts select a particular parametric family for the component distributions of a mixture model, i.e., f(x; θ) in Definition 1. For example, multivariate normal or multivariate t-distributions are often used in cluster analysis (McNicholas, 2016), exponential or Weibull distributions in lifetime data analysis (Jewell, 1982), and mixtures of discrete distributions such as the Poisson or binomial in discrete data analysis (Everitt and Hand, 1981, Chapter 4). These component distributions are often selected to lie in the class of exponential (dispersion) families, which are the workhorses of generalized linear models (GLMs) and hence very frequently used in applied statistics. Exponential (dispersion) families have excellent theoretical and computational properties across a range of inference methods, including the frequentist, likelihood, and Bayesian schools.
However, real-world data is complicated and can be highly heterogeneous, with the analyst frequently finding features that cannot be modeled while staying purely within the exponential family framework. It is then very common to use some form of mixture model. These models are very flexible, meaning they can deal with data complexity, and can also be highly interpretable, so they are very attractive to the analyst. Of course there are no free lunches, and this very flexibility comes at a cost. Inference properties of mixture models are more complex than those of exponential families, as are many of the associated computational issues. Balancing these costs and benefits is part of the art of working with mixture models.
We start with the following definition.
Definition 1. Consider a convex mixture of probability mass, or density, functions f(x; θ), for θ ∈ U ⊂ ℝᵈ, written as

$$f(x; K, \rho, \theta) := \sum_{i=1}^{K} \rho_i\, f(x; \theta_i), \qquad (1)$$

with K ∈ ℤ₊, ρ := (ρ₁, …, ρ_K) ∈ ℝᴷ, θ := (θ₁, …, θ_K) ∈ ℝᵈᴷ, and with the constraints Σᵢ₌₁ᴷ ρᵢ = 1, ρᵢ ≥ 0. Notice here that we choose to allow ρᵢ to equal zero in order to work with the closure.
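For concreteness, a minimal sketch of evaluating the density (1), taking normal components as an illustrative choice of f(x; θ):

import numpy as np
from scipy.stats import norm

def mixture_pdf(x, rho, theta):
    # Density of the finite mixture (1) with normal components f(x; theta_i);
    # rho: weights summing to 1; theta: sequence of (mean, sd) pairs.
    return sum(r * norm.pdf(x, m, s) for r, (m, s) in zip(rho, theta))

# Example: a two-component mixture evaluated on a grid.
grid = np.linspace(-4.0, 8.0, 5)
density = mixture_pdf(grid, rho=[0.3, 0.7], theta=[(0.0, 1.0), (4.0, 1.5)])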

The parameter space of an exponential family is an open subset of a fixed-dimensional Euclidean space and hence, under regularity, the exponential family is geometrically a manifold. In Definition 1 the parameter space is more complex. Even if K is treated as fixed and known, there are singularities and identification issues whenever θᵢ = θⱼ or ρᵢ = 0, and through permutation of the component labels {1, …, K}. Furthermore, in many analyses the value of K cannot be assumed known, and the model selection problem of choosing K is very challenging.

A consequence of this (geometric) difference is that inference in mixture modeling has a large number of complications: identification problems; complex likelihood functions with multiple modes and unbounded values; singular Fisher information matrices; and nonstandard asymptotic properties in estimation, testing, and model selection. Since the objects are not manifolds, the geometric tools that we can use to deal with this list of problems come from affine, convex, and algebraic geometry, which reinforce the well-known tools from differential geometry used in information geometry.

1.2 Mixtures and the fundamentals of geometry


When considering the foundations of a geometric theory of statistical modeling, we need to think about questions such as: what is a local neighborhood of a distribution? what is a straight line? how do we measure length? how do we measure angles? and how do we think about convexity? For many of these questions simple mixture models can be helpful, both for illustration and to explore the ways that naive geometric approaches can break down, or at least require care, in infinite-dimensional function spaces.
We start by considering the local neighborhood around a probability mass or density function, f(x). When studying robustness, Hampel (1974) uses a mixture-based contamination model, (1 − ε)f(x) + ε δ(x), where δ(x) is the point mass at x and ε ∈ [0, 1], in the construction of the influence function. This approach explores the complexity of the infinite-dimensional neighborhood, with potentially very large changes in inference being associated with small values of ε. For more on this see Hampel et al. (2011) and, for related geometric considerations, Lindsay (1994) and also Marriott (2002) and Maroufy and Marriott (2020).
This approach can be generalized to construct "straight lines" in distribution space. For example, we can connect two probability mass (density) functions by additive or geometric mixing:

$$f^{(m)}(x; \rho) := (1 - \rho)\, f_1(x) + \rho\, f_2(x), \qquad (2)$$

$$f^{(e)}(x; \theta) := \frac{f_1(x)^{(1-\theta)} f_2(x)^{\theta}}{C(\theta)}, \qquad (3)$$

where C(θ) = ∫ f₁(x)^{(1−θ)} f₂(x)^θ dx. Simple as these models are, it is informative to consider their parameter spaces and issues associated with their support. For f^{(m)}(x; ρ), the parameter ρ can take any value as long as f^{(m)}(x; ρ) ≥ 0 for all x. This space includes ρ ∈ [0, 1] but is not restricted to this interval. Furthermore, the support of f^{(m)}(x; ρ) can depend on ρ and, as is well known, this can give rise to nonstandard inferential behavior. For f^{(e)}(x; θ) the parameter space is defined by the condition C(θ) < ∞, and if f₁(x) and f₂(x) have common support, we have an exponential family. For more detail see the discussion around Definitions 5 and 6 in Section 4.
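A minimal numerical sketch of the two mixings (2) and (3), taking two normal densities as f₁ and f₂ (an arbitrary choice of ours) and computing the normalizing constant C(θ) by quadrature:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f1, f2 = norm(0, 1).pdf, norm(4, 2).pdf

def additive_mix(rho):
    # The additive ("m-type") mixing (2); any rho in [0, 1] gives a density.
    return lambda x: (1 - rho) * f1(x) + rho * f2(x)

def geometric_mix(theta):
    # The geometric ("e-type") mixing (3): normalize f1^(1-theta) f2^theta.
    C, _ = quad(lambda x: f1(x) ** (1 - theta) * f2(x) ** theta,
                -np.inf, np.inf)
    return lambda x: f1(x) ** (1 - theta) * f2(x) ** theta / C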

Given an i.i.d. sample {x₁, …, xₙ}, the log-likelihood function for f^{(m)}(x; ρ),

$$\ell(\rho) := \sum_{i=1}^{n} \log\big((1 - \rho) f_1(x_i) + \rho f_2(x_i)\big),$$

is easily shown to be concave, by computing its second derivative

$$\ell''(\rho) = -\sum_{i=1}^{n} \frac{\big(f_2(x_i) - f_1(x_i)\big)^2}{\big((1 - \rho) f_1(x_i) + \rho f_2(x_i)\big)^2} \le 0,$$

although not necessarily strictly concave. Further, due to potential changes in support, it need not be finite. Using the log-likelihood we can consider measuring local distances and angles using the statistical properties of the score, i.e., through the expected Fisher information. Example 12 shows that even in very simple examples the Fisher information of the linear mixture f^{(m)}(x; ρ) is not necessarily finite or nonsingular. Similarly, Critchley and Marriott (2014, Example 7) show a simple example where the Fisher information is singular for f^{(e)}(x; θ). In both these cases the difficulty is caused not by a change of support but by large differences in tail behavior. Geometric approaches for dealing with singular models are discussed in Section 5.
Next we might think about convex geometry questions, such as how to characterize the convex hull and the extreme points of K distinct distributions {fᵢ(x) | i = 1, …, K}. Of course, studying convexity on general function spaces is more complex than in finite-dimensional Euclidean spaces, since extending hyperplane separation theorems to the infinite case can be difficult. However, when the components fᵢ(x) have been selected from an exponential family, the total positivity properties of these families can be very useful; see Critchley and Marriott (2014, Theorem 1) or Section 2.1, where we apply the methods of Karlin (1968). The situation is much more complex when the mixture is over a continuous set of components f(x; θ). Example 2 shows a case where a lack of identification is inherent in the model. However, and somewhat surprisingly, when we are working with the likelihood, following Lindsay (1995), we can work without loss in finite-dimensional spaces defined through the data. In this situation the maximum likelihood estimate is frequently attained on the boundary of a convex set, and the characterization of the maximum using a derivative needs to be reconsidered.
In summary, simple examples of mixtures exhibit nonstandard inferential properties: changes of support; singular Fisher information; lack of identification; and boundary effects in optimization. The article is structured around discussing these issues.

1.3 Structure of article


Focusing on the relationship between mixture models and geometry, this article has the following structure: in Section 2 the complex geometric structure of mixtures, including boundary and singularity issues, is explored in the finite case; in Section 3 properties of the likelihood are examined using the convex geometry of Lindsay (1995); Section 4 looks at the geometry of general mixture models, in particular the intrinsic information geometry of Amari (1985) and Amari (2016) and the embedding affine geometries of Murray and Rice (1993) and Marriott (2002); in Section 5 the issues surrounding singular mixture models are considered by following the algebraic geometry of Watanabe (2009); in Section 6 the effect of the singular nature of mixture models is shown to generate associated nonstandard asymptotic results in testing and model selection.

2 Identification, singularities, and boundaries


Information geometry recognizes that exponential families are manifolds with a well-defined notion of dimension; see Barndorff-Nielsen (1978), Amari (1985), Murray and Rice (1993), Kass and Vos (2011), and Amari (2016). Furthermore, the Fisher information gives these manifolds a Riemannian structure, to which a set of dual affine connections is added. In this section we show that this framework is not the most natural for mixture models. In particular, while a manifold is locally diffeomorphic to an open subset of a fixed-dimensional Euclidean space, mixture models are unions of structures of different dimensions. More specifically we note: mixture models can have boundaries where these varying-dimensional structures meet; can have distributions with different support in the same model; have identification and overfitting problems; and have intrinsic geometric singularities.
To illustrate the discussion, consider the commonly used mixture of K normal components.

Definition 2. The finite mixture of K normal distributions, N(μ, σ²), with densities φ(x; μ, σ²), is denoted by

$$f_{\mathrm{normal}}(x; K, \rho, \mu, \sigma^2) := \sum_{i=1}^{K} \rho_i\, \phi(x; \mu_i, \sigma_i^2). \qquad (4)$$

The standard identifiability result for this family is that of Yakowitz and Spragins (1968), which requires that (a) K is fixed and known, (b) each ρᵢ is strictly positive, and (c) the component parameters (μᵢ, σᵢ²) are distinct. Furthermore, any identification is only up to permutations of the component labels {1, …, K}.

We note the geometric differences between the closed convex hull of Definition 1 and the strictly identified model in Definition 2. This second definition adds enough conditions to ensure that the model has a manifold structure. However, as we shall see, this comes at a cost: multimodal and infinite likelihoods, maximum likelihood estimates on the boundary of the parameter space, and essentially singular models. This last point is further discussed for a mixture of K normals in a Bayesian context in Example 9.
We illustrate these issues with an analysis of a simulated data set with a three-component mixture of normals.

Example 1. In Fig. 1 we show a simulated example to which we fit two different three-component mixtures of normals,

$$\text{Model 1}: \;\sum_{i=1}^{3} \rho_i\, \phi(x; \mu_i, \sigma_1^2) \qquad \text{and} \qquad \text{Model 2}: \;\sum_{i=1}^{3} \rho_i\, \phi(x; \mu_i, \sigma_2^2),$$

where σ₁ = 1 and σ₂ = 1.5, which, in this case, are considered fixed and known. It is important to note that K = 3 is also treated as fixed and known here, as required in Definition 2. In this example the data seem to form two clusters, each with a different location and spread, with the components' empirical variances seeming larger than 1. It is easy to show numerically that the likelihood for Model 1 has multiple local modes. Panel (a) shows the fit of two of these. They correspond to two different attempts by the model to explain both the heterogeneity in two means and two variances with only the three components. Neither attempt is ideal, each underestimating the spread in one of the empirical clusters. The multimodality comes from the trade-off between different sorts of error.
Model 2 can explain the cluster variance, since its fixed component variances are larger; however, in this case we find that in the fit ρ̂₁ ≈ 0. That is because the likelihood is maximized on the boundary and at a point of singularity. Only two components are needed, and the model tries to remove one of the three provided.
It is perhaps natural to ask why not try a third model, defined by Σᵢ₌₁³ ρᵢ φ(x; μᵢ, σᵢ²), where now each component variance is unknown? In this case the model is so flexible that there are even more local modes, and some of these correspond to the case where σ̂₁² = 0 and the corresponding likelihood is unbounded.

FIG. 1 Fitting three-component mixture models; panels (A) and (B) plot the fitted densities over the data. (A) Model 1: there are at least two local modes in the likelihood. (B) Model 2: the likelihood is maximized on the boundary, resulting in trying to fit a two-component model.
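To see the behavior in Example 1 concretely, here is a minimal EM sketch (our own illustration, not the code used for Fig. 1) for fitting a K-component normal mixture with a fixed, known component standard deviation, as in Models 1 and 2; running it from several seeds exposes the different local modes.

import numpy as np
from scipy.stats import norm

def em_fixed_variance(x, k, sigma, n_iter=200, seed=0):
    # EM for a k-component normal mixture with fixed component sd sigma.
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)          # initial means
    rho = np.full(k, 1.0 / k)                          # initial weights
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each observation.
        dens = rho * norm.pdf(x[:, None], mu, sigma)   # (n, k)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights and means; variances stay fixed.
        nk = resp.sum(axis=0)
        rho, mu = nk / len(x), (resp * x[:, None]).sum(axis=0) / nk
    return rho, mu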

This simple example illustrates some general features of working with mixture models. Model 1 is not flexible enough to capture the main empirical features of the data. Model 2, despite having the same "dimension," is too flexible, and the estimate is at a singular part of the model, in fact on a boundary. The third attempt is so flexible that it contains highly singular models with degenerate support. In summary, we see that despite the mixture being seemingly such a natural approach to model the data shown in Fig. 1, having K fixed and known and ρᵢ > 0, to ensure identification, does not result in satisfactory inference. The conditions for identification came at a cost.
Suppose that we allow K to be unknown, and potentially unbounded. This leads to a generalization of the definition of a mixture model which allows very general mixing structures, where the mixing comes from a general distribution. This distribution might be considered completely unspecified and to be estimated from the data, see Section 3, or it might come from a parametric family such as a Bayesian prior, see Section 5.
Definition 3. Suppose that f(x; θ) is a regular full exponential family with parameter space θ ∈ Θ. We can define the general mixture model via

$$f(x; Q) := \int f(x; \theta)\, dQ(\theta), \qquad (5)$$

where Q is a distribution with support on a subset of the parameter space of θ.

Of course, finite mixture models, as defined by (1), are a special case of this definition. The identification problems of finite mixtures are, as might be expected, still a feature of this more general case. Indeed, we would expect them to be more of a problem in some cases. Consider the following examples.
Example 2. It is well known that when Q(μ) is the N(0, τ²) distribution,

$$\int_{\mu} \phi(x; \mu, \sigma^2)\, dQ(\mu) = \phi(x; 0, \sigma^2 + \tau^2).$$

(To see this, note that the mixture is the distribution of μ + σZ with μ ∼ N(0, τ²) and Z ∼ N(0, 1) independent, a sum of independent normals.) Hence there is a global identification problem in this mixture space.

Example 3. Suppose we consider a kernel density estimate of a distribution from an i.i.d. sample, {x₁, …, xₙ}, given by

$$\hat{f}_h(x) := \frac{1}{n} \sum_{i=1}^{n} \phi(x; x_i, h^2),$$

where h is a bandwidth tuning parameter. This estimate is a mixture of normals, and the ability of these estimates to give excellent approximations to general distributions, at least given enough data, shows that in some sense the convex hull of normal distributions forms a near-universal class. This extreme flexibility comes with the associated possibility of overfitting and identification problems.
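The estimate f̂_h is literally an n-component mixture with equal weights; a minimal sketch:

import numpy as np
from scipy.stats import norm

def kde(x_grid, sample, h):
    # Kernel density estimate: an n-component normal mixture with weights 1/n,
    # means at the data points, and common standard deviation h.
    return norm.pdf(x_grid[:, None], sample[None, :], h).mean(axis=1)

# Example on a simulated sample.
sample = np.random.default_rng(0).normal(size=100)
density = kde(np.linspace(-4, 4, 201), sample, h=0.4)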

We have shown concrete examples of complex likelihood behavior, boundary and singularity issues, and examples of over- and underfitting with corresponding identification problems in commonly used mixture models. This article explores the ways that geometry can help in understanding these issues. In order to illustrate more general ideas later, we start with cases where visual intuition can be a useful illustration of the interrelationship between mixture models and geometry.

2.1 Mixtures of finite distributions


Insight into the geometric structure of mixture models in general, and mixtures of exponential families in particular, can be found by considering the case of mixtures of finite, discrete models. Following the approach of Amari and Nagaoka (2000, p. 40), we assume in this section that a random variable has support in a finite set, χ, and the space of finite distributions over χ is denoted by P, which trivially has the geometry of a closed simplex of dimension D := |χ| − 1.
The following two examples look at distributions over a 2 × 2 contingency table, so that the sample space, χ, comprises the counts {N₀₀, N₁₀, N₀₁, N₁₁} with a fixed total, and D = 3, thus allowing us to visualize the geometry in ℝ³. The general discussion, though, is relevant for all finite D.
Example 4. Consider a 2 × 2 contingency table with cell probabilities πᵢⱼ, i, j ∈ {0, 1}, so here P is a three-dimensional closed simplex. This means

$$P = \Delta^3 = \Big\{ (\pi_{00}, \pi_{10}, \pi_{01}, \pi_{11}) \,\Big|\, \sum_{ij} \pi_{ij} = 1,\; \pi_{ij} \ge 0 \Big\} \subset \mathbb{R}^4.$$

The set of independence models for the 2 × 2 table is defined by

$$(\pi_{00}, \pi_{10}, \pi_{01}, \pi_{11}) = \big( (1 - \rho_{1+})(1 - \rho_{+1}),\; \rho_{1+}(1 - \rho_{+1}),\; (1 - \rho_{1+})\rho_{+1},\; \rho_{1+}\rho_{+1} \big),$$

where ρ₁₊, ρ₊₁ ∈ [0, 1] are the marginal probabilities. It is easily shown that this space is an extended exponential family; the term extended here means taking the closure of a regular exponential family, Barndorff-Nielsen (1978). We mention in passing that the independence space is also a so-called ruled surface, meaning it can be decomposed into a union of line segments in the simplex. The independence space is plotted in Fig. 2A, where it can be seen that the surface connects to all four vertices and four of the six edges of the simplex.
It is also easy to show that the convex hull of the independence space is the whole three simplex, and so it can be decomposed into a union of sets, in fact manifolds, of different dimensions. There are four zero-dimensional components, six one-dimensional, four two-dimensional, and one full three-dimensional manifold, which is the whole relative interior of the simplex.

FIG. 2 The case D = 3 with P being the three simplex: panel (A) shows the two-dimensional independence space, (B) shows a one-dimensional exponential family, and (C) shows this one-dimensional family's convex hull. In each panel the vertices of the simplex are labeled by the cells (0,0), (0,1), (1,0), and (1,1).

We can also see that there will be identification issues in this example. There is a unique representation in terms of mixtures of the extreme points of the convex hull; these extreme points are simply the four vertices. However, there are many different representations of the same distribution as mixtures of independence distributions which lie strictly within the relative interior.
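The following minimal sketch (the grid resolution is an arbitrary choice) traces the independence surface by sweeping the marginal probabilities, and confirms that every point lies in the simplex.

```python
# A sketch of Example 4: points on the independence surface inside the
# 3-simplex, parametrized by the marginals rho_{1+} and rho_{+1}.
import numpy as np

rho1, rho2 = np.meshgrid(np.linspace(0, 1, 21), np.linspace(0, 1, 21))
pi00 = (1 - rho1) * (1 - rho2)
pi10 = rho1 * (1 - rho2)
pi01 = (1 - rho1) * rho2
pi11 = rho1 * rho2
total = pi00 + pi10 + pi01 + pi11
print(np.allclose(total, 1.0))  # True: the surface lies in the simplex
```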

In Example 4 we see that all distributions on the 2 × 2 contingency table are mixtures of independence models, and this fact generalizes to examples of much more complex contingency tables. This is a form of a universality representation property, similar to that discussed in Example 3.

As discussed there, this means mixture models will be extremely flexible, but this can also give rise to potential overfitting and identification problems. We also note in passing recent work on neural networks—which have latent structure, so in many ways are similar to the mixture models discussed here—which have universal representation properties and yet still generalize very well after fitting to training data; see the recent Amari (2020) for more on this subject.
Example 5 looks at a one-dimensional exponential family embedded in the three simplex, where the convex hull is more complex than that of Example 4 but still decomposes into a union of manifolds of different dimensions. We note a general property of exponential families embedded in a simplex of any dimension D, which was discussed in Critchley and Marriott (2014, Theorem 1). By the total positivity properties of exponential families, Karlin (1968), under regularity conditions and generically, any D + 1 distinct distributions in the exponential family will have a D-dimensional convex hull. This means that in general mixtures of exponential families will have maximal dimension although, as shown below, the convex hull need not be universal.
Example 5. We can define a one-dimensional discrete exponential family lying in the space P = Δ³, for some distinct values v00, v10, v01, v11, by
$$\pi_{ij}(\theta) := \pi^0_{ij} \exp\big( \theta v_{ij} - \psi(\theta) \big), \qquad (6)$$
where $\psi(\theta) = \log \sum_{ij} \pi^0_{ij} \exp(\theta v_{ij})$ and θ ∈ ℝ. An example is plotted in Fig. 2B as a curve in the three simplex
$$\pi(\theta) := \{ (\pi_{00}(\theta), \pi_{01}(\theta), \pi_{10}(\theta), \pi_{11}(\theta)) \,|\, \theta \in \mathbb{R} \} \subset \Delta^3.$$


Following Karlin and Shapley (1953), the convex hull of the closure of the curve is plotted in Fig. 2C. It can be shown to be the union of seven manifolds of differing dimensions, as listed in Table 1. Note that the individual parts of the decomposition, such as $C^1_1$, $C^2_1$, or $C^2_2$, are manifolds, but they are not themselves convex in the simplex.
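A minimal sketch of the curve (6), with an illustrative base point π⁰ and scores v_ij, is the following; the normalization step plays the role of exp(−ψ(θ)).

```python
# A sketch of Example 5: the one-dimensional discrete exponential family (6)
# traced inside the 3-simplex. pi0 and the values v are illustrative choices.
import numpy as np

pi0 = np.array([0.25, 0.25, 0.25, 0.25])   # base point pi^0_{ij}
v = np.array([0.0, 1.0, 2.0, 4.0])          # distinct values v_{ij}

def pi_theta(theta):
    w = pi0 * np.exp(theta * v)
    return w / w.sum()                       # division implements exp(-psi(theta))

curve = np.array([pi_theta(t) for t in np.linspace(-3, 3, 61)])
print(curve.min() > 0, np.allclose(curve.sum(axis=1), 1.0))  # valid distributions
```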

These examples illustrate general points. First, Example 5 shows that there exists a trade-off between geometric structures: manifold structures allow gradient methods to be used in (local) optimization, while convexity allows global optimization of a concave log-likelihood. This is related to the discussion in Example 1. In that example, by fixing K = 3, we kept a manifold structure, but it was at the cost of losing convexity, which resulted in multiple local modes.
The second point is to note that since the convex hull in Example 5 does not contain the whole simplex, there are two distinct cases to consider: first, where the empirical distribution of the data lies inside the convex hull, and second, where it lies outside. When it lies inside, the model is clearly flexible enough to fit the data exactly and may actually overfit.

TABLE 1 Decomposition of convex hull to a union of manifolds.

Name     Definition                                                         Dimension
C^0_1    {π_00 = 1}                                                         0
C^0_2    {π_10 = 1}                                                         0
C^1_1    {π(θ) : θ ∈ ℝ}                                                     1
C^1_2    {(ρ, 1 − ρ, 0, 0) : ρ ∈ (0, 1)}                                    1
C^2_1    {ρ(1, 0, 0, 0) + (1 − ρ)π(θ) : ρ ∈ (0, 1), θ ∈ ℝ}                  2
C^2_2    {ρ(0, 1, 0, 0) + (1 − ρ)π(θ) : ρ ∈ (0, 1), θ ∈ ℝ}                  2
C^3      {ρ_1 π(θ_1) + (1 − ρ_1) π(θ_2) : ρ_1 ∈ (0, 1), θ_1 ≠ θ_2 ∈ ℝ}      3

In the second case, the mixture model is not flexible enough to exactly fit the data, and estimates will lie on the boundary of the convex hull. We note the strong parallels between this and the behavior of the fitting of Models 1 and 2 in Example 1.
The third point is that Examples 4 and 5 give a clear visual representation of the singularity and boundary properties of mixture models, which are a theme of this article. It is shown explicitly that the convex hulls are not manifolds but are unions of manifolds of differing dimensions. We also note that the support of the distributions is not fixed, which is one reason that the Fisher information cannot be assumed to always be nonsingular.

3 Likelihood geometry
In the examples of Section 2.1, by working with finite, discrete distributions, applications of convex geometry were straightforward. We now turn to a different, but related, approach to using the tools of finite-dimensional convex geometry, which is applicable to a much wider set of mixture models. The method is due to Lindsay (1995). The embedding affine space is always finite dimensional, but the dimension is determined by the number of distinct observations in the data. This is going to be sufficient since the focus of this approach is the likelihood function, which only depends on the observed data. We therefore call this a likelihood geometry approach.
Throughout this section we work with the general definition of a mixture model on exponential families, i.e., $f(x; Q) := \int_{\theta \in \Theta} f(x; \theta)\, dQ(\theta)$. One of the most important properties of Lindsay's likelihood geometry is that it allows a characterization of, and a way to compute, the nonparametric maximum likelihood estimate (NPMLE) of the mixing distribution, Q̂. Suppose we have an i.i.d.

sample, {x_1, …, x_n}, which has D distinct values; the following definition shows how we can represent the space of mixtures in a finite-dimensional Euclidean space.
Definition 4. For a distinct observation x_d, define L_d(θ) := f(x_d; θ). We represent the model f(x; θ) in D-dimensional Euclidean space via
$$\theta \to L(\theta) := (L_1(\theta), \ldots, L_D(\theta)) \in \mathbb{R}^D.$$
Furthermore, we can represent any mixture via
$$Q \to L(Q) := (L_1(Q), \ldots, L_D(Q)) \in \mathbb{R}^D, \qquad (7)$$
where $L_d(Q) = \int_{\theta \in \Theta} f(x_d; \theta)\, dQ(\theta)$. By the linearity of integration, the representation in ℝ^D of the set of mixtures is exactly the convex hull of $\{L(\theta) \,|\, \theta \in \Theta\}$.

Lindsay (1995) makes extensive use of the convex geometry of ℝ^D to solve the estimation problem of finding a distribution Q which maximizes the likelihood. For most of what we do here, for illustration, we will assume that θ is one dimensional, but otherwise make no assumptions on the distribution Q(θ). Let n(d) be the number of times x_d was observed. The full log-likelihood then can be written as
$$\ell(Q) = \sum_{d=1}^{D} n(d) \log L_d(Q).$$
This is then a concave function ℝ^D → ℝ through the representation (7) in Definition 4, and the optimization over Q can be seen as a finite-dimensional problem. The following result can be found in Lindsay (1995, Page 23).
Theorem 1. (a) The log-likelihood,
$$\ell(Q) = \sum_{i=1}^{n} \log L_i(Q) = \sum_{i=1}^{n} \log \left( \int f(x_i; \theta)\, dQ(\theta) \right),$$
has a unique maximum over the space of all distribution functions Q. Furthermore, the maximizer Q̂ is a discrete distribution with no more than D distinct points of support, where D is the number of distinct points in (x_1, …, x_n).
(b) If Q_0 is a candidate mixing distribution for the NPMLE, then this can be checked by a gradient characterization defined by
$$D_{Q_0}(\theta) := \sum_{d=1}^{D} n(d) \left( \frac{L_d(\theta) - L_d(Q_0)}{L_d(Q_0)} \right),$$
where L_d(θ) is the likelihood at θ for the unmixed model, and we have the characterization
$$Q_0 = \hat{Q} \iff D_{Q_0}(\theta) \leq 0, \qquad (8)$$
for all θ ∈ Θ.

(c) The support points of Q̂ are precisely those points where
$$D_{\hat{Q}}(\theta) = 0.$$
(d) Finally, the fitted values of the maximum likelihood, $(L_1(\hat{Q}), \ldots, L_D(\hat{Q}))$, are unique even if Q̂ itself may not be.

There are some important points to note about Lindsay's result. First, the fact that the NPMLE is achieved at a discrete mixture is analogous to the fact that the discrete CDF based on a set of i.i.d. data can be seen as the "best" estimate of an unknown distribution. Second, the gradient characterization of a maximum likelihood estimate in standard analysis is replaced by a functional version through $D_{Q_0}(\theta)$, which characterizes the NPMLE through Inequality (8). In particular, it is a derivative along the line segment defined by (2) in Section 1.2. The theorem requires that the likelihood is decreasing (strictly speaking, nonincreasing) along all relevant line segments starting at f(x; Q̂).
To illustrate and motivate Theorem 1 we look at a simple example similar
to one in Lesperance and Kalbfleisch (1992).
Example 6. In this example, for direct visualization, the number of observed data points is two: x₁ = 1, x₂ = 4. We assume a mixture model over components N(μ, 1) with unknown mixing distribution Q(μ).
The likelihood components are defined by $(L_1(\mu), L_2(\mu)) := (\phi(1; \mu, 1), \phi(4; \mu, 1))$, where φ(x; μ, 1) is the density function for the N(μ, 1) distribution. The image of the curve $\mu \to (L_1(\mu), L_2(\mu))$ is plotted in Fig. 3A and C as the black curve, with the likelihood contours plotted with (red) dashed lines. To find the NPMLE we want to maximize the likelihood over the convex hull of this curve. From Theorem 1 the NPMLE can be characterized by the gradient function
$$D_{Q_0}(\mu) = \sum_{i=1}^{2} \left( \frac{L_i(\mu) - L_i(Q_0)}{L_i(Q_0)} \right)$$
for all μ in the parameter space. This is illustrated in Fig. 3B and D. In panel (A) a two-component mixture is denoted by the cross on a line segment joining two points, defined by
$$Q_0 = 0.3\, N(2.2, 1) + 0.7\, N(3.2, 1).$$
The corresponding gradient function is shown in panel (B). Since the gradient function is positive at some points, this does not satisfy the conditions of the theorem; hence this candidate mixture is not the NPMLE. On the other hand, in panel (C) we plot the mixture defined by Q₀ = 0.5 N(1, 1) + 0.5 N(4, 1). Its gradient function is shown in panel (D), and it does satisfy the conditions since it is nowhere strictly positive and zero only at the support points, which are a subset of the observed data. Hence this candidate is the NPMLE.
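The following minimal sketch evaluates the gradient function for the two candidate mixing distributions of this example; the grid over μ is an arbitrary choice.

```python
# A sketch of the gradient check of Example 6: data x1 = 1, x2 = 4,
# components N(mu, 1), and the two candidate mixing distributions above.
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 4.0])

def L(mu):                       # likelihood vector (L1(mu), L2(mu))
    return norm.pdf(x, loc=mu, scale=1.0)

def D(mu, support, weights):     # gradient function D_{Q0}(mu)
    LQ = sum(w * L(m) for m, w in zip(support, weights))
    return np.sum((L(mu) - LQ) / LQ)

mus = np.linspace(-10, 10, 2001)
print(max(D(m, [2.2, 3.2], [0.3, 0.7]) for m in mus))  # clearly positive: fails
print(max(D(m, [1.0, 4.0], [0.5, 0.5]) for m in mus))  # close to zero: consistent
                                                       # with the NPMLE in the text
```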

FIG. 3 Applying Lindsay's theorem in a simple model: panels (A) and (C) show the likelihood space with the exponential family embedded (black curves), candidate mixed model (cross), and the level sets of the log-likelihoods (red contours). Panels (B) and (D) show the gradient function for each candidate.

A fast algorithm implementing the results of Theorem 1 was developed in Wang (2007), where it was compared to a number of different ways of computing the NPMLE, including the EM, vertex direction method, vertex exchange method, and the intra-simplex direction method; see Wang (2007) for references. We apply Wang's method in a more realistic example than Example 6 below.
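For readers who want something concrete, the following sketch is not Wang's (2007) algorithm, but a much simpler baseline in the same spirit: EM weight updates over a fixed grid of candidate support points, which approximates the NPMLE on that grid. The simulated data loosely mimic the setting of Example 7 and all settings are illustrative.

```python
# A fixed-grid EM baseline for the NPMLE (not Wang's method; illustrative only).
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
data = np.concatenate([rng.poisson(1.0, 60), rng.poisson(5.0, 60),
                       rng.poisson(10.0, 50)])          # 170 draws, 3 components

grid = np.linspace(0.01, 15, 300)                        # candidate support points
L = poisson.pmf(data[:, None], grid[None, :])            # n x m likelihood matrix
w = np.full(grid.size, 1.0 / grid.size)                  # uniform initial weights

for _ in range(2000):                                    # EM weight update
    post = L * w                                         # responsibilities
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)

# Support points typically appear in tight clusters, as described for Example 7.
keep = w > 1e-4
print(np.round(grid[keep], 2), np.round(w[keep], 3))
```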
Example 7. Fig. 4 shows data generated from a three-component mixture of Poisson distributions. The sample size is 170, and the right-hand panel shows the directional derivative of the solution. The vertical dashed lines in this panel show the values of the support points. At first it looks like the algorithm has identified three components. In fact it identified five distinct components, with two pairs very close to each other. It is very common that the estimates occur in clusters, since it is very hard to separate, purely from the data, support points which are very close.

The fact that the NPMLE results in clusters of support points is a feature
of inference on mixture models. For example, consider a model such as

$$f(x; K, \rho, \theta) := \sum_{i=1}^{K} \rho_i\, f(x; \theta_i),$$
where θ₁ < θ₂ < ⋯ < θ_K but where |θ₁ − θ_K| = ε and ε is small relative to the Fisher information. This model might be formally identified but is very hard to distinguish from the two-component model ρf(x; θ₁) + (1 − ρ)f(x; θ_K), or indeed sometimes from an unmixed model.

FIG. 4 The data for Example 7 plotted as a histogram and the final gradient function. The vertical dashed lines are the support points.
One approach to this is to consider the local mixture model of Marriott (2002). In this approach, when the mixing distribution Q_ε(θ) has small variance, a Laplace expansion gives a good approximation using
$$\int f(x; \theta)\, Q_\epsilon(\theta)\, d\theta \approx f(x; \theta) + \lambda_1 \frac{\partial}{\partial \theta} f(x; \theta) + \lambda_2 \frac{\partial^2}{\partial \theta^2} f(x; \theta),$$
and, through a geometric analysis of this structure, the small number of parameters (θ, λ₁, λ₂) are identified and computationally easy to estimate. These models have found application in measurement error models (Marriott, 2003), survival analysis (Maroufy and Marriott, 2019), robustness (Maroufy and Marriott, 2020), and the idea is generalized in Maroufy and Marriott (2017).
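A minimal sketch of this approximation for normal components f(x; θ) = φ(x; θ, 1) follows; the choices λ₁ = 0 and λ₂ = ε²/2 correspond to a normal mixing distribution with small variance ε², and are our own illustrative settings.

```python
# Local mixture sketch for normal components: the derivatives in theta have
# the closed forms d/dtheta phi = (x - theta) phi and
# d^2/dtheta^2 phi = ((x - theta)^2 - 1) phi.
import numpy as np
from scipy.stats import norm

def local_mixture(x, theta, lam1, lam2):
    z = x - theta
    f = norm.pdf(x, loc=theta)
    return f + lam1 * z * f + lam2 * (z**2 - 1.0) * f

theta, eps = 0.0, 0.3
rng = np.random.default_rng(3)
mu = rng.normal(theta, eps, 100_000)          # a genuine small-variance mixture
x = np.linspace(-4, 4, 9)
exact = np.array([norm.pdf(xi, loc=mu).mean() for xi in x])
approx = local_mixture(x, theta, 0.0, eps**2 / 2)
print(np.max(np.abs(exact - approx)))         # small for small eps
```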

4 General geometric structures


In Section 2 we used finite, discrete mixtures to show that mixture models can be thought of as unions of manifolds of different dimensions, having singularities and boundaries. There we exploited embedding in finite-dimensional convex spaces. In Section 3 we showed that it was sufficient to embed in a data-defined finite-dimensional space when computing the NPMLE for much more general models. In this section we look at the appropriate geometries for describing the structure of general mixture models without restricting to finite dimensions. Moving out of the finite-dimensional world is going to make the discussion more complex, as pointed out in Amari and Nagaoka (2000, p. 40).

There are two approaches to this in the literature: (a) work intrinsically on finite-dimensional parametric models which lie in an infinite-dimensional space, and (b) work extrinsically by defining the geometric structures by embedding in general infinite-dimensional affine spaces. The first of these approaches was taken in Amari (1985), see also Amari (2016), and is shown in Definition 5; the second approach is described in Definition 6.
We start with the dual affine geometry defined intrinsically on a parametric family f(x; θ) satisfying the regularity conditions in Amari (1985).
Definition 5. Consider a parametric family of distributions, f(x; θ), which satisfies the regularity conditions discussed in Amari and Nagaoka (2000, p. 26), giving the family the structure of a smooth manifold. On each tangent space we have the Fisher metric which is, in terms of coordinates,
$$g_{ij}(\theta) := E_{f(x;\theta)} \left( \frac{\partial \log f(x; \theta)}{\partial \theta_i} \frac{\partial \log f(x; \theta)}{\partial \theta_j} \right),$$
giving rise to an inner product on the tangent space. While other metrics are possible—which would generate alternative forms of the Cramér–Rao theorem (Kumar and Mishra, 2020)—we focus here on the standard case. Notions of straight lines, as discussed in Section 1.2, are also constructed intrinsically using a family of affine connections. For this article we are interested in the special case of the α = −1, or mixture, connection. The key result is Amari and Nagaoka (2000, Theorem 2.4), which states that a mixture family such as f(x; ρ) := ρf(x) + (1 − ρ)g(x) is flat in the mixture connection.

This definition is based on recognizing that regular parametric models are Riemannian manifolds, but, as we have seen, mixture models are unions of manifolds of different dimensions and, as discussed in Section 1.2, the Fisher information can be singular. We therefore could start with a more primitive notion of geometry: that of an affine space.
Definition 6. Murray and Rice (1993) construct, on equivalence classes of sets of positive measures with common support, an affine geometry which agrees with the α = +1 (exponential) connection on sets of distributions. Indeed it characterizes finite-dimensional exponential families, with a fixed support, as being finite-dimensional affine subspaces of the general affine structure. In a related way, Marriott (2002) defines a general affine geometry on spaces of measures which integrate to one. This agrees with the α = −1 (mixture) connection when we are in a manifold. It has the property that finite-dimensional affine subspaces are mixtures of unitary measures.

These affine structures generate the same convex structures that we have
used in Sections 2 and 3 and agree with Definition 5 on manifolds. Given that
we have very general definitions of the mixture structure we look at an
example to illustrate some of the issues in general situations.

Example 8. We return to a finite mixture of normal distributions as in Definition 2. The set of all mixtures of normals has a natural convex structure but is infinite dimensional. Suppose we fix K, say to 2, and then work on the set (1 − ρ)φ(x; μ₁, 1) + ρφ(x; μ₂, 1) for μ₁ > μ₂. For ρ ∈ (0, 1) this is a three-dimensional manifold, but at ρ = 0, 1 there are singularities. For example, consider the case μ₁ = 0, ρ = 0; then the intrinsic geometry approach of Definition 5 would start with the score vectors
$$\frac{\partial \ell}{\partial \rho}\bigg|_{\mu_1=0,\,\rho=0} = \exp\left( -\frac{1}{2}\, \mu_2 (\mu_2 - 2x) \right) - 1, \qquad \frac{\partial \ell}{\partial \mu_2}\bigg|_{\mu_1=0,\,\rho=0} = 0,$$
which are mean zero but have singular Fisher information
$$\begin{pmatrix} \exp(\mu_2^2) - 1 & 0 \\ 0 & 0 \end{pmatrix}.$$
Working under the conditions that ρ ∈ (0, 1) and μ₁ > μ₂ does formally exclude exact singularities, but not all the inference problems. The three-dimensional manifold will, due to the total positivity properties (Karlin, 1968), be highly curved in the affine mixture geometry. This curvature can result in multimodal likelihood functions. An example is plotted in Fig. 5, where we plot the profile log-likelihood $\ell(\hat{\rho}(\mu_1, \mu_2), \mu_1, \mu_2)$, where $\hat{\rho}(\mu_1, \mu_2)$ maximizes the likelihood on the one-dimensional line segment (1 − ρ)φ(x; μ₁, 1) + ρφ(x; μ₂, 1). As discussed in Section 1.2, on this line segment we have a concave likelihood. In the example a simulated data set was created which has a variance greater than one, skewness, and outliers. The three-dimensional model is trying to trade off explaining these different features of the data, in a way that results in complex multimodality. The K = 2 model is a manifold but is not flexible enough to capture the complexity of the data at a unique mode.
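This singularity is easy to check numerically; the following sketch (with an arbitrary choice of μ₂) estimates the Fisher information matrix by Monte Carlo under the null model N(0, 1).

```python
# Numerical check of the singular Fisher information in Example 8: at
# mu1 = 0, rho = 0, E[(dl/drho)^2] = exp(mu2^2) - 1 and the mu2-score vanishes.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 1_000_000)   # data from the null model N(0, 1)
mu2 = 0.8

s_rho = np.exp(-0.5 * mu2 * (mu2 - 2 * x)) - 1.0   # score for rho at rho = 0
s_mu2 = np.zeros_like(x)                           # score for mu2 is identically 0

I = np.array([[np.mean(s_rho * s_rho), np.mean(s_rho * s_mu2)],
              [np.mean(s_mu2 * s_rho), np.mean(s_mu2 * s_mu2)]])
print(I, np.expm1(mu2**2))  # top-left entry matches exp(mu2^2) - 1; matrix is rank 1
```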

We have seen that for general spaces of distributions we can define mixture geometry in ways that are consistent with what we have seen in the finite-dimensional cases. We again see boundaries being important, and singularity and curvature resulting in complex likelihood structures. In the next section we look at tools which tackle the singularities directly.

5 Singular learning theory


5.1 Bayesian methods
Bayesian methods are important tools which have often been used in the analysis of mixture models. For finite mixture models a highlight, which allows treating the number of components K in Equation (1) as an unknown integer-valued parameter, is reversible jump Markov Chain Monte Carlo, see Green (1995). This explores different possible values of K as part of the calculation of the posterior.

FIG. 5 Profile log-likelihood contours (black contours) showing multimodal behavior in a mixture of two normals. The shaded (red) region on the left denotes points where $\hat{\rho}(\mu_1, \mu_2) = 0$ or 1, so the MLE lies strictly outside the manifold.

Note that in such a Markov Chain Monte Carlo analysis, whenever the value of K changes, the algorithm needs to map between manifolds of different dimensions.
More fundamentally, the continuous mixture of Definition 3 is a basic expression in Bayesian analysis. For example, it must be used if there are hierarchical or latent structures, or for constructing a predictive distribution from a posterior.
The fundamentally singular nature of mixture models is reflected in Bayesian analysis. We illustrate this with an example of using Markov Chain Monte Carlo (MCMC) in this context.
Example 9. Let us return to Model 2 in Example 1 to illustrate how singularities affect a Bayesian analysis. Here we are treating K as fixed and known. By eye the data appear to have two normal clusters, and we are fitting a three-component model. Further, each of the model components does a good job of fitting an individual data cluster. There is therefore redundancy in the model, and there are many ways that the model can give very good fits. For example, one of the components can be essentially removed by setting one of the ρᵢ parameters to be very small.

FIG. 6 Posterior distribution of (ρ₁, ρ₂) for Model 2.

In this case there is great uncertainty about the value of the corresponding μᵢ, as it basically plays no role in the model. Alternatively, two components can converge, i.e., μᵢ ≈ μⱼ, resulting in no information about the difference ρᵢ − ρⱼ. In Fig. 6 we show the result of estimating the marginal posterior distribution for (ρ₁, ρ₂) using an MCMC analysis, where uniform and independent priors have been used for simplicity. We see that the uncertainty about the mixing parameters, which comes from the singular geometry, is reflected in the posterior not being concentrated around a point in parameter space. Further, note that the posterior here would be very poorly estimated with a normal-based approximation; this point will be important in the discussion below.

Example 9 shows how the nonregular geometry of mixture models can affect the global behavior of the posterior. Watanabe (2009) investigated the effect of the singularities of mixture models, and in general of many machine learning (ML) models, often in the context of predictive modeling. In this work he carefully distinguishes between singular models, such as mixture models, and nonsingular ones, such as regular exponential families. Indeed, the geometry needed for the singular case is no longer the differential geometry of manifolds, but rather algebraic geometry.

5.2 Singularities and algebraic geometry


In this section we review some basic concepts about the treatment of singularities in algebraic geometry, and why this theory is important in an ML context. Examples where this approach can be used include normal, binomial, and other mixtures, but also models from ML including neural networks, Boltzmann machines, and radial basis methods. These models all share the property that there are latent, hierarchical structures which will generate singular behavior.
Watanabe's approach—a recent general review can be found in Watanabe (2021a)—focuses on understanding the predictive power of a Bayesian model, and can be seen as a development of the approach of Akaike (1974) to the singular model case. Assume that we observe data (x₁, …, xₙ) from a distribution defined by q(x). We do not know q(x) but have a putative parametric model f(x; θ) and a prior distribution on θ, say ϑ(θ). Following the approach of Akaike (1974), we do not assume that q(x) lies in the model family. We compute the Bayesian predictive distribution $p(x | x_1, \ldots, x_n) = E_\theta(p(x; \theta))$, where the expectation is over the posterior distribution for θ given the data. Following Akaike, we measure the quality of the predictive distribution by its "distance" to q(x), and we do this with the Kullback–Leibler (KL) divergence
$$K(q(x) \,||\, p(x)) = \int q(x) \log \frac{q(x)}{p(x)}\, dx.$$
This is the foundation on which the Akaike information criterion (AIC) is based, and it is then used for model selection when models are nonsingular. This is a minimization problem, and in the nonsingular case we expect that the objective function is locally quadratic around a single point. Let us investigate the KL divergence in the singular case, where the minimum value is attained on a set of points.
Example 10. Consider again the simple line segment model, Eq. (2), in a normal mixture context, f(x; ρ, μ) = (1 − ρ)φ(x; 0, 1) + ρφ(x; μ, 1), and consider the behavior of
$$KL\big( f(x; \rho_{true}, \mu_{true}) \,||\, f(x; \rho, \mu) \big)$$
for ρ ∈ [0, 1] and μ ∈ [−3, 3]. In Fig. 7 we plot the level sets of this function in two cases. The first is the nonsingular case with ρ_true = 0.5, μ_true = 1, where we see a local minimum of zero at the true value, and locally the function is approximately quadratic; its Hessian would be the (nonsingular) Fisher information. In contrast, if ρ_true = 0 or μ_true = 0, we are in the singular case. An example is shown in panel (B). Here there is a set of values at the minimum of zero, and there will be no locally quadratic approximation to this function around the set. This is exactly the singular case considered by Watanabe (2009). The function is not going to have a quadratic approximation, but rather one of the form
$$KL\big( f(x; \rho_{true}, \mu_{true}) \,||\, f(x; \rho, \mu) \big) \approx C \mu^2 \rho^2. \qquad (9)$$
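The approximation (9) can be checked numerically; in the following sketch (grid and parameter values are illustrative), the ratio KL/(ρ²μ²) is roughly constant near the singular set.

```python
# Numerical check of (9) in the singular case, where the truth is N(0, 1).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
q = norm.pdf(x)                              # true density (rho_true * mu_true = 0)

def kl(rho, mu):
    p = (1 - rho) * norm.pdf(x) + rho * norm.pdf(x, loc=mu)
    return np.sum(q * np.log(q / p)) * dx    # quadrature on the grid

for rho, mu in [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)]:
    print(rho, mu, kl(rho, mu) / (rho**2 * mu**2))  # roughly constant ratio
```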

In the nonsingular case, many very useful approximations and asymptotic expansions are based on a local quadratic approximation around a critical point. In the singular case these fail.

FIG. 7 Level sets of the Kullback–Leibler divergence: (A) nonsingular case with ρ_true = 0.5, μ_true = 1; (B) singular case with ρ_true = 0.5, μ_true = 0.

FIG. 8 Resolving a geometric singularity: (A) level sets of the function f(x, y) := y² − x³ − x², with f(x, y) = 0 plotted in black; (B) the normal crossing singularity which is the resolution.

The tools of algebraic geometry, however, allow us to find local approximations to functions which have singularities. A key result is Hironaka's theorem (Watanabe, 2009, Theorem 2.3), which allows us to resolve the singularities. In the general theory we can think of singularities in sets of solutions of the form {x | f(x) = 0}, where the set is not a manifold. An example is the point of self-intersection of the curve in Fig. 8A, or the set of minimal values in Fig. 7B, which is ρμ = 0 and also has a self-intersection at μ = ρ = 0.
In very general terms, if we have a function f(x₁, …, x_d) which has a singular point at 0, then there exists a transformation to smooth parameters, (u₁, …, u_d), and a "nice" mapping, g(u) = x, such that locally to the singularity we can write

$$f(g(u)) = S\, u_1^{k_1} \cdots u_d^{k_d}, \qquad (10)$$
where S is a nonzero constant and the k_i are nonnegative integers. The theorem constructs a local representation of the function, in a neighborhood of a singular point, in terms of a so-called normal crossing, where the singular set {x | f(x₁, …, x_d) = 0} looks like a union of hyperplanes. The following, purely geometric, example illustrates the theorem.
Example 11. An example of resolving a singularity is shown in Fig. 8. The set V is defined to be the set of solutions of the equation f(x, y) := y² − x³ − x² = 0. This is plotted in Fig. 8A as a black self-crossing curve. Points where the function is greater (less) than zero are plotted in red (blue). The resolution of this is the change of variables to u, v, where we have (locally to x = y = 0)
$$f(g(u, v)) = u^{k_1} v^{k_2}.$$
In this case k₁ = k₂ = 1. This is plotted in panel (B), and locally V is a union of hyperplanes.
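The following sketch verifies this resolution symbolically; the substitution u = y − x√(1+x), v = y + x√(1+x), valid near the origin where 1 + x > 0, is one explicit choice of the change of variables (our own, for illustration).

```python
# Symbolic check of the normal-crossing form in Example 11: with this
# substitution, u * v = y^2 - x^2 - x^3, so f becomes u*v with k1 = k2 = 1.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = y**2 - x**3 - x**2
u = y - x * sp.sqrt(1 + x)
v = y + x * sp.sqrt(1 + x)
print(sp.simplify(u * v - f))  # 0
```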

We now see the way that the singularity of the KL divergence shown in Fig. 7B has the approximation (9): this is the resolution of the singularity, now with k₁ = k₂ = 2.

5.3 Singular learning and model selection


In this section we give a short overview of the highlights of singular learning theory and, in particular, we see how it can be used for model selection in mixture models. When the interest is in using mixture models for predictive inference, it gives a way of learning about the number of components K which, as we have seen, is one of the most important problems when using mixture models.
We start by thinking about the properties of the log-likelihood ratio statistic. In nonsingular models, of course, we have asymptotic χ² distribution approximations around the MLE. By standard arguments these approximations are based on linking local quadratic approximations with the central limit theorem (CLT) after suitable normalization. As explained in Section 5.2, local expansions around singular points are not quadratic, but rather are given by the resolution of the singularity from Hironaka's theorem, Watanabe (2009, Theorem 2.3).
Further, the simple CLT results are replaced with convergence to a Gaussian process. This is summarized in Watanabe (2009, Page 30: Main Formula I), which says there exists a smooth transformation g(u) from parameters (u₁, …, u_d) such that the log-likelihood ratio function can be written as
$$u_1^{2k_1} \cdots u_d^{2k_d} - \frac{1}{\sqrt{n}}\, u_1^{k_1} \cdots u_d^{k_d}\, \xi_n(u_1, \ldots, u_d). \qquad (11)$$
The second term converges to a Gaussian process with a well-defined mean and covariance structure.

This representation of the log-likelihood allows (Watanabe, 2009, Page 34: Main Formula II) an asymptotic expansion of the stochastic complexity, which can be used for model selection. Suppose we have data D = {X₁, …, Xₙ} and a parametric model p(x; θ) with prior ϑ(θ), which generates the posterior
$$p(\theta | D) = \frac{1}{Z_n} \prod_{i=1}^{n} p(X_i; \theta)\, \vartheta(\theta),$$
with Z_n being the normalizing constant or marginal likelihood. The stochastic complexity is defined by F_n = −log Z_n. This is a measure of the evidence the data has about the model, and hence it has an important role in model selection. The second key result of Watanabe (2009) is that the stochastic complexity has the following asymptotic expansion:
$$\lambda \log n - (m - 1) \log \log n + F(\xi) + o_p(1),$$
where the critical terms are λ and m, which are computed from the zeta function of the Kullback–Leibler distance from the true model q(x) to p(x; θ).
These powerful results give a theoretical foundation for working with singular models. Of course, in practice we do not know the true distribution q(x), so to make the tools operational we need to resort to asymptotic expansions. The standard justification of the form of the AIC (Akaike, 1974) comes from using the standard normal approximation to the posterior, which holds for nonsingular models. As we have seen in Fig. 6, this approximation can fail badly in singular models. Instead of using a normal approximation, the representation (11) can be used. Working to a given asymptotic order gives rise to the Watanabe information criterion (WAIC)
$$-\frac{1}{n} \sum_{i=1}^{n} \log E_\theta\big( p(x_i; \theta) \big) + \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}_\theta\big( \log p(x_i; \theta) \big), \qquad (12)$$
where the moments are taken over the posterior and hence can be computed using MCMC.
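A minimal sketch of computing (12) from posterior draws is the following; the function loglik and its signature are illustrative assumptions, not a specific library API.

```python
# WAIC (12) from posterior draws theta_s, s = 1..S; loglik(x, theta) is an
# assumed user-supplied function returning log p(x; theta).
import numpy as np

def waic(x_data, theta_draws, loglik):
    # log_p[s, i] = log p(x_i; theta_s)
    log_p = np.array([[loglik(xi, th) for xi in x_data] for th in theta_draws])
    S = log_p.shape[0]
    # first term: -log of the posterior mean of p(x_i; theta), via log-sum-exp
    log_mean_p = np.logaddexp.reduce(log_p, axis=0) - np.log(S)
    # second term: posterior variance of log p(x_i; theta)
    penalty = log_p.var(axis=0, ddof=1)
    return np.mean(-log_mean_p + penalty)
```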
The theory of the AIC (Akaike, 1974) or DIC (Spiegelhalter et al., 2002) assumes a nonsingular model, and under this assumption the AIC, DIC, and WAIC are all equal to order o(n⁻¹), see Watanabe (2021a, Section 3.1). However, in the singular model case this equivalence will fail. For example, in Watanabe (2021b) it is explained that the WAIC and the generalization error are equal in expectation to order O(n⁻²). In that paper, in a simulation study based on a three-component mixture of normals—a singular model—the WAIC is able to give very good approximations to the generalization error, whereas the AIC is not. Hence, from this study, the WAIC can be used in model selection for prediction problems in mixture models.
The key differences between the singular case and the nonsingular case have been summarized by Watanabe as the following generalizations: positive definite Fisher information matrices are generalized to singular ones; the linear algebra of tangent spaces is generalized to rings of polynomials; and central limit results are generalized to results from empirical process theory. We will see this last point again in the following section.

6 Nonstandard testing problems


As we have seen above, the singular geometry of a mixture model can result in much more complex behavior of the likelihood function when compared to the standard exponential family case. In this section we see how this affects the sampling properties of test statistics for what might seem like simple tests. Here, rather than taking a model selection approach as in the previous section, we look at the problem from a hypothesis testing point of view.
Following Lindsay (1995, Chapter 4), consider the two following hypothesis testing problems:
$$H_0: K = 1 \text{ versus } H_{A_1}: K > 1,$$
$$H_0: K = 1 \text{ versus } H_{A_2}: K = 2.$$
These are clearly different, but related, testing problems, and care is needed in a given scientific context as to which alternative is of interest. We show that the asymptotic sampling behavior of natural test statistics, perhaps measuring overdispersion or using the log-likelihood, can be surprisingly complex.
A standard approach is the Neyman C(α) test for overdispersion, Neyman and Scott (1965), which is a test designed for the $H_{A_1}$ alternative above. We follow the gradient approach of Lindsay (1995, Chapter 4). Recall that the criterion for Q₀ to be the NPMLE comes from the directional derivative inequality
$$D_{Q_0}(\mu) = \sum_{i=1}^{D} \left( \frac{L_i(\mu) - L_i(Q_0)}{L_i(Q_0)} \right) \leq 0$$
for all μ. Under H₀ we have that Q₀ is a degenerate distribution, and it is natural to evaluate it at μ̂, the maximum likelihood estimate for the unmixed model. We use the notation $D_1(\mu) := D_{\Delta(\hat{\mu})}(\mu)$, where $\Delta(\hat{\mu})$ is the degenerate mixing distribution which has support only at the point μ̂. Lindsay (1995) shows that D₁(μ) can be closely approximated by $D_1''(\mu)$. So either term being greater than zero would be evidence that the unmixed model is not the NPMLE; this would represent evidence against the null of having an unmixed model. A direct calculation of $D_1''(\hat{\mu})$ in the exponential family case shows that it is the difference between the empirical variance of the data and the variance as modeled by the unmixed model. This is precisely the C(α) overdispersion criterion.
While overdispersion tests are natural for $H_{A_1}$, when there is interest in power in the directions associated with $H_{A_2}: K = 2$ it would be natural to consider score-based tests. However, we immediately see that this can result in complexity even in very simple cases. The following example is discussed in Li et al. (2009).
Example 12. Consider a two-component mixture of exponential densities. For example, take the case
$$(1 - \rho) f(x; 1) + \rho f(x; \mu),$$
where $f(x; \mu) = \frac{1}{\mu} \exp(-x/\mu)$. The Fisher information for the score for ρ at ρ = 0 is
$$\begin{cases} \dfrac{n(1-\mu)^2}{\mu(2-\mu)} & 0 < \mu < 2, \\[1ex] \infty & \mu \geq 2. \end{cases}$$
Hence we immediately see that traditional score tests will fail in certain directions, since they will not have finite variances and the CLT will fail.

Other natural routes to follow are likelihood ratio test (LRT)-based methods. The following results from Lindsay (1995, Chapter 4) are helpful in seeing that the standard χ²-asymptotic behavior of the log-likelihood, familiar in exponential families, does not hold for mixture models.
Example 13. From Lindsay (1995, Page 95), consider the K = 2 case of mixtures with Bin(2, π) components. The asymptotic distribution of the likelihood ratio statistic is a mixture of χ² distributions. In particular, the test of one versus two components has a $0.5\chi^2_0 + 0.5\chi^2_1$ distribution. However, for only slightly more complex models the limiting distributions are more difficult. For example, even for two-component mixtures of Bin(3, π) distributions, the same likelihood-based test of one against two components now has a
$$(0.5 - \alpha)\chi^2_0 + 0.5\chi^2_1 + \alpha\chi^2_2$$
limit distribution, where α is calculated from the Fisher information of the model and depends on the parameter value, π, of the unmixed model. Hence the effect of the value of π as a nuisance parameter does not vanish in the limit. The dependence of the limit on nuisance parameters makes the test statistic considerably more difficult to use, of course.

In general, the asymptotic behavior of the LRT for mixtures is extremely complex when compared to its behavior in the exponential family situation. Watanabe (2009, Chapter 6) states that in the singular case we expect convergence to a Gaussian process, rather than the traditional convergence in distribution to a member of the χ² family, where the only dependence on the model is through its dimension. It is perhaps not surprising that, since singular models do not have a well-defined dimension, this extremely useful traditional result fails.

Chen et al. (2001) give a clear statement of the situation, which we summarize here in a particular case. Consider a two-component mixture model for θ ∈ Θ ⊂ ℝ, (1 − ρ)f(x; θ₁) + ρf(x; θ₂), where we are testing the hypothesis that the true distribution has only one component; that is, we are in the case where either θ₁ = θ₂ or ρ ∈ {0, 1}. Under the null hypothesis, when θ₀ labels the true distribution, the limit distribution of the LRT is that of the random variable
$$\sup_{\theta} \{ W^{+}(\theta) \}^2, \qquad (13)$$
where W(θ) is a Gaussian process with normalized mean and variance and a given autocovariance function. If we compare using this result with the standard cases, Chen et al. (2001) point out the following problems: (i) the limiting distribution is not pivotal, in that it depends on θ₀, as already discussed in Example 13; (ii) the limit distribution does not just depend on the dimension of θ; for example, it is different for the normal and Poisson cases; and (iii) computing the maximum of a Gaussian process is a complex problem, so the LRT loses a lot of its practical utility and appeal.
Li et al. (2009) explain that the general regularity conditions that are needed to get satisfactory asymptotic behavior, even in the one-dimensional case θ ∈ Θ ⊂ ℝ, typically involve requiring finite Fisher information and the parameter space Θ being compact. Ideally we want a test statistic that keeps the standard pivotal and limiting properties which the LRT has in the exponential family case, while relaxing these stringent regularity conditions. First, consider penalty approaches to regularize the problem.
Definition 7. The penalized likelihood statistic is defined for a two-component model as
$$2 \left\{ \sum_{i=1}^{n} \log\big( \rho f(x_i; \theta_1) + (1 - \rho) f(x_i; \theta_2) \big) + \mathrm{pen}(\rho) \right\},$$
where the penalty term pen(·) is designed to bound ρ away from the boundaries at ρ ∈ {0, 1}. For example, pen(ρ) = C log(4ρ(1 − ρ)) is used in Chen and Kalbfleisch (1996), and pen(ρ) = C log(1 − |1 − 2ρ|) in Li et al. (2009). In both these cases the limiting distributions are of the form $0.5\chi^2_0 + 0.5\chi^2_1$, under regularity conditions.
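A minimal sketch of maximizing such a penalized log-likelihood, using the Chen–Kalbfleisch penalty and illustrative normal components, is the following.

```python
# Penalized log-likelihood sketch with pen(rho) = C * log(4 * rho * (1 - rho));
# the normal components and all settings are purely illustrative.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def penalized_loglik(params, x, C=1.0):
    rho, th1, th2 = params
    mix = rho * norm.pdf(x, loc=th1) + (1 - rho) * norm.pdf(x, loc=th2)
    return np.sum(np.log(mix)) + C * np.log(4 * rho * (1 - rho))

rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, 200)            # data from the one-component null
res = minimize(lambda p: -penalized_loglik(p, x), x0=[0.5, -0.5, 0.5],
               bounds=[(1e-3, 1 - 1e-3), (-5, 5), (-5, 5)])
print(res.x)  # rho is kept away from {0, 1} by the penalty
```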

One approach which allows a relaxation of the regularity conditions, and generalizes to the case where the components have more than one parameter, is to exploit ideas from the EM algorithm.
Of course, for mixture models it is very common in practice to use the EM algorithm (Dempster et al., 1977) for estimation. This is a hill-climbing approach which can find local modes of the likelihood. In principle it does not converge as fast as Newton-based methods. However, it has a number of practical advantages in popular examples, particularly in cases where it is easy to code and also easy to compute the expectation step. The most important geometric advantages are that iterations cannot "jump" outside the fundamental boundaries of the parameter space, and that it will still work in cases, such as Example 12, where the Fisher information is singular.
There is an EM version of the penalized likelihood approach to testing the number of mixing components (Li et al., 2009, Page 415). In this paper the idea is to start from a finite number of initial points and make a fixed number of EM-based iteration steps starting from each of them. After this number of iterations is completed, the maximum of the penalized LRT statistics is computed. The EM algorithm's property of staying in the parameter space, together with the penalization, keeps the solution sufficiently regular to give the result that, asymptotically, the test statistic has a $0.5\chi^2_0 + 0.5\chi^2_1$ distribution, without the need for assumptions on compactness of the parameter space or a finite Fisher information; see Li et al. (2009, Theorem 2). These approaches remain of active research interest; see the recent paper Chen et al. (2020) for a modern review.

7 Discussion
In this article we have described, at a high level, the relationship between the statistical properties of mixture models and geometries of various kinds. We have looked at basic notions of convexity within an affine space structure, tools from differential geometry such as metric tensors and connections, and tools from algebraic geometry. We have shown that mixture models, despite their apparent simplicity, have complex inferential properties which are related to their singular structure.
For statisticians, the sheer complexity of the literature on differential and algebraic geometry can be a major hurdle. These topics have been studied intensely for hundreds of years, and knowing where to start reading is undoubtedly a challenge. In this article we have aimed to give a flavor of the tools that have been used, while deliberately skipping the more technical details, and to point to the literature that we have found helpful.

References
Akaike, H., 1974. A new look at the statistical model identification. IEEE Trans. Automat. Contr.
19 (6), 716–723.
Amari, S.-I., 1985. Differential-Geometrical Methods in Statistics. Springer, p. 293.
Amari, S.-I., 2016. Information Geometry and Its Applications. vol. 194 Springer.
Amari, S.-I., 2020. Any target function exists in a neighborhood of any sufficiently wide random
network: a geometrical perspective. Neural Comput. 32 (8), 1431–1447.
Amari, S.-I., Nagaoka, H., 2000. Methods of Information Geometry. vol. 191 American Mathe-
matical Society.
Barndorff-Nielsen, O.E., 1978. Information and Exponential Families in Statistical Theory. John
Wiley & Sons, p. 238.

Chen, J., Kalbfleisch, J.D., 1996. Penalized minimum-distance estimates in finite mixture models.
Can. J. Stat. 24 (2), 167–175.
Chen, H., Chen, J., Kalbfleisch, J.D., 2001. A modified likelihood ratio test for homogeneity in
finite mixture models. J. R. Stat. Soc. B 63 (1), 19–29.
Chen, J., Li, P., Liu, G., 2020. Homogeneity testing under finite location-scale mixtures. Can. J.
Stat. 48 (4), 670–684.
Critchley, F., Marriott, P., 2014. Computational information geometry in statistics: theory and
practice. Entropy 16 (5), 2454–2471.
Crowder, M.J., Hand, D.J., 2017. Analysis of Repeated Measures. Routledge.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via
the EM algorithm. J. R. Stat. Soc. B 39 (1), 1–22.
Everitt, B.S., Hand, D.J., 1981. Finite Mixture Distributions. Chapman and Hall.
Fuller, W.A., 1987. Measurement Error Models. John Wiley, New York.
Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model
determination. Biometrika 82 (4), 711–732.
Hampel, F.R., 1974. The influence curve and its role in robust estimation. J. Am. Stat. Assoc.
69 (346), 383–393.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 2011. Robust Statistics: The
Approach Based on Influence Functions. vol. 196 John Wiley & Sons.
Jewell, N.P., 1982. Mixtures of exponential distributions. Ann. Stat. 10 (2), 479–484.
Karlin, S., 1968. Total Positivity. vol. 1 Stanford University Press.
Karlin, S., Shapley, L.S., 1953. Geometry of Moment Spaces. vol. 12 American Mathematical
Society.
Kass, R.E., Vos, P.W., 2011. Geometrical Foundations of Asymptotic Inference. vol. 908 John
Wiley & Sons.
Kumar, M.A., Mishra, K.V., 2020. Cramér–Rao lower bounds arising from generalized Csiszár
divergences. Inf. Geom. 3 (1), 33–59.
Lesperance, M.L., Kalbfleisch, J.D., 1992. An algorithm for computing the nonparametric MLE
of a mixing distribution. J. Am. Stat. Assoc. 87 (417), 120–126.
Li, P., Chen, J., Marriott, P., 2009. Non-finite Fisher information and homogeneity: an EM
approach. Biometrika 96 (2), 411–426.
Lindsay, B.G., 1994. Efficiency versus robustness: the case for minimum Hellinger distance and
related methods. Ann. Stat. 22 (2), 1081–1114.
Lindsay, B.G., 1995. Mixture models: theory, geometry and applications. In: NSF-CBMS
Regional Conference Series in Probability and Statistics.
Maroufy, V., Marriott, P., 2017. Mixture models: building a parameter space. Stat. Comput.
27 (3), 591–597.
Maroufy, V., Marriott, P., 2019. Generalising frailty assumptions in survival analysis: a geometric
approach. In: Geometric Structures of Information, Springer, pp. 137–148.
Maroufy, V., Marriott, P., 2020. Local and global robustness with conjugate and sparsity priors.
Stat. Sinica 30, 579–599.
Marriott, P., 2002. On the local geometry of mixture models. Biometrika 89 (1), 77–93.
Marriott, P., 2003. On the geometry of measurement error models. Biometrika 90 (3), 567–576.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering.
vol. 38 M. Dekker New York.
McLachlan, G., Peel, D., 2000. Finite Mixture Models. vol. 198 John Wiley & Sons.
McNicholas, P.D., 2016. Mixture Model-Based Classification. Chapman and Hall/CRC.
Murray, M.K., Rice, J.W., 1993. Differential Geometry and Statistics. Chapman & Hall, p. 272.

Neyman, J., Scott, E., 1965. On the use of C(α) optimal tests of composite hypotheses. Bull.
Int. Stat. Inst. 41 (1), 477–497.
Spiegelhalter, D.J., Best, N.G., Carlin, B.P., Van Der Linde, A., 2002. Bayesian measures of
model complexity and fit. J. R. Stat. Soc. B 64 (4), 583–639.
Titterington, D.M., Smith, A.F.M., Makov, U.E., 1985. Statistical Analysis of Finite Mixture
Distributions. John Wiley & Sons Incorporated.
Wang, Y., 2007. On fast computation of the non-parametric maximum likelihood estimate of a
mixing distribution. J. R. Stat. Soc. B 69 (2), 185–198.
Watanabe, S., 2009. Algebraic Geometry and Statistical Learning Theory. Cambridge University
Press, p. 25.
Watanabe, S., 2021a. Information criteria and cross validation for Bayesian inference in regular
and singular cases. Jpn. J. Stat. Data Sci. 4, 1–19.
Watanabe, S., 2021b. WAIC and WBIC for mixture models. Behaviormetrika 48 (1), 5–21.
Yakowitz, S.J., Spragins, J.D., 1968. On the identifiability of finite mixtures. Ann. Math. Stat.
39 (1), 209–214.
Chapter 10

Gaussian distributions on Riemannian symmetric spaces of nonpositive curvature

Salem Said^{a,*}, Cyrus Mostajeran^{b}, and Simon Heuveline^{c}
^{a} CNRS, Laboratoire LJK, Université Grenoble-Alpes, Grenoble, France
^{b} Department of Engineering, University of Cambridge, Cambridge, United Kingdom
^{c} Centre for Mathematical Sciences, University of Cambridge, Cambridge, United Kingdom
^{*} Corresponding author: e-mail: salem.said@univ-grenoble-alpes.fr

Abstract
This article aims to give a coherent presentation of the theory of Gaussian distributions
on Riemannian symmetric spaces and also to report on recent original developments of
this theory. The initial goal is to define a family of probability distributions, on any suit-
able Riemannian manifold, for which maximum-likelihood estimation, based on a finite
sequence of observations, is equivalent to computation of the Riemannian barycenter of
these observations. As it turns out, this goal is achievable whenever the underlying Rie-
mannian manifold is a Riemannian symmetric space of nonpositive curvature. In this
case, the required Gaussian distributions are exactly the maximum-entropy distribu-
tions, for fixed barycenter and dispersion. The second step is the search for efficient
means of computing the normalizing factors associated with these distributions. This
leads to a fascinating connection with random matrix theory, and even with theoretical
physics (Chern–Simons theory), which yields a series of original results that provide
exact expressions, as well as high-dimensional asymptotic expansions of the normaliz-
ing factors. Another outcome of this connection with random matrix theory is the new
idea of duality, between Gaussian distributions on Riemannian symmetric spaces of
opposite curvatures. The present article also investigates Bayesian inference for Gauss-
ian distributions on symmetric spaces. This investigation motivates original results
regarding Markov-Chain Monte Carlo and convex optimization on Riemannian mani-
folds. It also reveals a new open problem (roughly, this concerns the equality of a pos-
teriori mode with a posteriori barycenter), which should be the focus of future
developments.
Keywords: Gaussian distribution, Symmetric space, Random matrix theory,
Markov-Chain Monte Carlo, Convex optimization, Bayesian inference


1 Introduction
The realization that an essentially new approach, beyond that of classical statistics, is needed in order to learn from data that live in non-Euclidean spaces can be credited to Fréchet, who invented what we today call the Fréchet mean, back in 1948 (Fréchet, 1948).
The Fréchet mean generalizes the concept of the mean (average or expectation) of a sequence of observations (x₁, …, x_N), from the classical case where these observations lie in a Euclidean space, to the general setting where they belong to a non-Euclidean space.
Let us call our sample space M (the observations belong to M). If M is a Euclidean space, it has a vector space structure, and the mean of (x₁, …, x_N) is just the arithmetic mean (x₁ + ⋯ + x_N)/N. If M is a non-Euclidean space, it will have no vector space structure, and this definition will lose all meaning.
To salvage the concept of mean, Fréchet suggested looking at the set of global minima of the sum of squared distances (the factor 1/2 is included for later convenience)
$$E(x) = \frac{1}{2} \sum_{n=1}^{N} d^2(x_n, x) \quad \text{for } x \in M.$$

He noted that any global minimum of E deserves to be called a mean of (x₁, …, x_N). In this way, the mean of a sequence of observations in a non-Euclidean space is well-defined, at the cost of eventually failing to be unique.
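As a minimal illustration, the following sketch computes the Fréchet mean on the circle S¹ by direct minimization over a grid; the data are illustrative angles, and the arc-length distance plays the role of d.

```python
# Frechet mean on the circle, where the arithmetic mean of raw angles fails:
# minimise E(x) = (1/2) * sum_n d(x_n, x)^2 with d the geodesic (arc) distance.
import numpy as np

angles = np.array([0.1, 0.3, 6.0])      # observations clustered near 0 (mod 2*pi)

def geodesic_dist(a, b):
    d = np.abs(a - b) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)

grid = np.linspace(0, 2 * np.pi, 100_000, endpoint=False)
E = 0.5 * sum(geodesic_dist(grid, a)**2 for a in angles)
print(grid[np.argmin(E)])  # ~0.04 rad, unlike the arithmetic mean ~2.13
```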
Fast-forward to the present: learning from data that live in Riemannian manifolds (a particular class of non-Euclidean spaces) has become central to many applications, ranging from radar signal processing to neuroscience (Cabanes, 2021; Congedo et al., 2017), and the Fréchet mean (more descriptively called the Riemannian barycenter) has become a very popular tool in this respect (Afsari, 2010; Said and Manton, 2021).
Naturally, it also became important to provide a statistical (specifically,
inferential) foundation for the use of this tool. One way or another, this leads
to the quest for a suitable definition of a Gaussian distribution on a Rieman-
nian manifold. This appeared inevitable, already because of the intimate con-
nection between arithmetic means and Gaussian distributions, in the classical
Euclidean case (see Section 2.1, for discussion).
Gaussian distributions, defined as maximum entropy distributions on a
Riemannian manifold, for a given Frechet mean and dispersion, were first
introduced by Pennec (2006). For a while, it remained difficult to study these
distributions, as there was no practical means of computing the associated
normalizing factors. However, a first breakthrough came when these factors
were expressed as multiple integrals, in the case of Gaussian distributions
on the space of real positive-definite matrices (Cheng and Vemuri, 2013).

In Said et al. (2017, 2018), the approach of Cheng and Vemuri (2013) was
generalized to Gaussian distributions on Riemannian symmetric spaces of
nonpositive curvature, which include hyperbolic spaces, as well as spaces
of real, complex, and quaternion positive-definite matrices, and spaces of
structured (Toeplitz or block-Toeplitz) positive-definite matrices. This opened
the way to rigorous learning algorithms for data that live in these spaces (this
is partially discussed in Said et al., 2018).
The introduction of Riemannian symmetric spaces reduced normalizing
factors of Gaussian distributions to multiple integrals, which could be com-
puted using Monte Carlo techniques (Zanini et al., 2016). Only very recently,
it was realized that the techniques of random matrix theory made it possible to
write down both analytic expressions and high-dimensional asymptotic expan-
sions of these multiple integrals. This was studied by the theoretical physics
community (Santilli and Tierz, 2021) (see also our paper, currently under
review Heuveline et al., 2021).
The aim of the present article is to give a coherent presentation of the the-
ory of Gaussian distributions on Riemannian symmetric spaces of nonpositive
curvature, and report on recent original developments of this theory, including
(but not limited to) the ones just mentioned. Its main body (Sections 2 and 3)
relies on a variety of new results, mostly contained in the habilitation thesis
(Said, 2021)—one advantage of this situation is that the flow of results is
not interrupted by their sometimes lengthy proofs, given in Said (2021).
In the following, Section 2 introduces Gaussian distributions and their con-
nection with random matrix theory. Section 3 investigates Bayesian inference
of these distributions. Each one of these sections opens with a description of
the original results which it contains.
Another, more modest, contribution of this article is Appendix B (not
based on Said, 2021). This appendix provides new results on the convergence
rates for Riemannian gradient descent, applied to strictly convex and strongly
convex functions, defined on a convex subset of a Riemannian manifold. The
main results are Propositions B.8 and B.9.

2 Gaussian distributions and RMT


The starting point of this section is a historical discussion of the concept of a Gaussian distribution. This leads up to the definition of Gaussian distributions, adopted in Section 2.2, as a family of distributions P(x̄, σ) on a Riemannian manifold M, with parameters x̄ ∈ M and σ > 0, for which maximum-likelihood estimation of x̄ is equivalent to computation of the Riemannian barycenter. It turns out that this definition can be pursued whenever M is a Riemannian symmetric space of nonpositive curvature.
Section 2.3 then gives a general expression of the normalizing factor Z(σ) of the Gaussian distribution P(x̄, σ), in the form of a multiple integral (8). When M is a space of positive-definite matrices, or when M is the so-called

Siegel domain, (8) is further reduced to a kind of integral familiar in random


matrix theory ((10) and (15), respectively).
Section 2.4 states the existence and uniqueness of maximum-likelihood estimates of the parameters x̄ and σ. It also states the maximum-entropy property of the Gaussian distribution P(x̄, σ), in Proposition 5. Section 2.5 provides expressions of the barycenter (shown to be equal to x̄) and the covariance tensor of P(x̄, σ).
Section 2.6 begins the series of results based on random matrix theory
(RMT). These concern Gaussian distributions on the space H(N) of complex
positive-definite matrices. First, the analytic expression of Z(σ) is given in
Proposition 8. Then, an asymptotic expansion of this expression, in the limit
where N goes to infinity while t = Nσ² remains constant, is given in
Proposition 9.
Section 2.7 describes the asymptotic distribution of eigenvalues of a ran-
dom positive-definite matrix in H(N), drawn from the Gaussian distribution
P(I_N, σ) (I_N denotes the N × N identity matrix). This asymptotic distribution
has a probability density function, whose explicit expression is provided in
Proposition 10.
Section 2.8 introduces Θ distributions. These are classical normal distribu-
tions, wrapped around the unitary group U(N), which is the dual symmetric
space of H(N). Proposition 11 uncovers an unexpected relationship between
Θ distributions on U(N) and Gaussian distributions on H(N): the normalizing
factors of these distributions are connected by a simple identity (38).
Proofs of the above-mentioned results may be found in chapter 3 of Said
(2021).

2.1 From Gauss to Shannon


The story of Gaussian distributions is a story of discovery and rediscovery.
Different scientists, at different times, were repeatedly led to these distribu-
tions, through different routes. It seems the story began in 1801, on New Year’s
day, when Giuseppe Piazzi sighted a heavenly body (in fact, the asteroid Ceres),
which he thought to be a new planet. Less than 6 weeks later, this “new planet”
disappeared behind the sun. Using a method of least squares, Gauss predicted
the area in the sky where it reappeared one year later. His justification of this
method of least squares (cast in modern language) is that measurement errors
follow a family of distributions, which satisfies
Property 1: maximum-likelihood estimation is equivalent to the
least-squares problem.
In his Theoria motus corporum coelestium (1809), he used this property to
show that the distribution of measurement errors is (again, in modern lan-
guage) a Gaussian distribution.
In 1810, Laplace studied the distribution of a quantity, which is the aggre-
gate of a great number of elementary observations. He was led in this
(completely different) way, to the same distribution discovered by Gauss.
Laplace was among the first scientists to show
Property 2: the distribution of the sum of a large number of elementary
observations is (asymptotically) a Gaussian distribution.
Around 1860, Maxwell rediscovered Gaussian distributions, through his
investigation of the velocity distribution of particles in an ideal gas (which he
viewed as freely colliding perfect elastic spheres). Essentially, he showed that
Property 3: the distribution of a rotationally invariant random vector,
which has independent components, is a Gaussian distribution.
Kinetic theory led to another fascinating development, related to Gaussian
distributions. Around 1905, Einstein (and, independently, Smoluchowski)
showed that
Property 4: the distribution of the position of a particle, which is under-
going a Brownian motion, is a Gaussian distribution.
In addition to kinetic theory, alternative routes to Gaussian distributions
have been found in quantum mechanics, information theory, and other fields.
In quantum mechanics, a Gaussian distribution is a position distribution with
minimum uncertainty. That is, it achieves equality in Heisenberg’s inequality.
In information theory, one may attribute to Shannon the following maximum-
entropy characterization
Property 5: a probability distribution with maximum entropy, among all
distributions with a given mean and variance, is a Gaussian distribution.
The above list of rediscoveries of Gaussian distributions could be extended
much further. However, the main point is the following. In a Euclidean space,
identified with ℝ^d, any one of the above five properties leads to the same
famous expression of a Gaussian distribution,

P(dx | x̄, σ) = (2πσ²)^{−d/2} exp( −(x − x̄)²/(2σ²) ) dx

as a probability distribution on ℝ^d, with mean vector x̄ ∈ ℝ^d and variance
parameter σ > 0 (here dx denotes the Lebesgue measure on ℝ^d).
In non-Euclidean space, each one of these properties may lead to a different
distribution, which may then be called a Gaussian distribution, but only from a
restricted point of view. People interested in Brownian motion may call the heat
kernel of a Riemannian manifold a Gaussian distribution on that manifold.
However, statisticians will not like this definition, since it will (in general) fail
to have a straightforward connection to maximum-likelihood estimation.

2.2 The “right” Gaussian


From now on, the following definition of Gaussian distributions is chosen.
Gaussian distributions, on a Riemannian manifold M, are a family of distribu-
tions P(x̄, σ), parameterized by x̄ ∈ M and σ > 0, such that a maximum-
likelihood estimate x̂_N of x̄, based on samples (x_n ; n = 1, …, N) from
P(x̄, σ), is a solution of the least-squares problem

minimize over x ∈ M :  E_N(x) = (1/2) Σ_{n=1}^{N} d²(x_n, x)    (1)

This means that x̂_N is an empirical barycenter of the samples (x_n). In order to
construct probability distributions P(x̄, σ) which satisfy this definition, con-
sider the density profile

f(x | x̄, σ) = exp( −d²(x, x̄)/(2σ²) )    (2)

and the normalizing factor,

Z(x̄, σ) = ∫_M f(x | x̄, σ) vol(dx)    (3)

where vol denotes the Riemannian volume. If this is finite, then

P(dx | x̄, σ) = ( Z(x̄, σ) )^{−1} f(x | x̄, σ) vol(dx)    (4)
is a well-defined probability distribution on M. In Section 2.4, it will be seen
that P(x̄, σ), as defined by (4), is indeed a Gaussian distribution, if M is a
Hadamard manifold and also a homogeneous space. The following proposi-
tions will then be helpful.

Proposition 1. Let M be a Hadamard manifold, whose sectional curvatures
lie in [−κ, 0], where κ = c². Then, for any x̄ ∈ M and σ > 0, if Z(x̄, σ) is
given by (3),

Z_0(σ) ≤ Z(x̄, σ) ≤ Z_c(σ)    (5)

where Z_0(σ) = (2πσ²)^{d/2} and Z_c(σ) is positive, given by (d denotes the dimen-
sion of M)

Z_c(σ) = ω_{d−1} ( σ/(2c)^{d−1} ) Σ_{k=0}^{d−1} (−1)^k ( d−1 choose k ) Φ((d−1−2k)σc)/Φ′((d−1−2k)σc)    (6)

where ω_{d−1} denotes the area of the unit sphere in ℝ^d, and Φ denotes the stan-
dard normal distribution function.

Proposition 2. If M is a Riemannian homogeneous space, and Z(x̄, σ) is given
by (3), then Z(x̄, σ) does not depend on x̄. In other words, Z(x̄, σ) = Z(σ).

If M is a Hadamard manifold and also a homogeneous space, then both
Propositions 1 and 2 apply to M. Indeed, if M is a Riemannian homogeneous
space, then its sectional curvatures lie within a bounded subset of the real line.
Therefore, Proposition 1 implies that Z(x̄, σ) is finite for all x̄ ∈ M and σ > 0. On
the other hand, Proposition 2 implies that Z(x̄, σ) = Z(σ). Then, (4) reduces to

P(dx | x̄, σ) = ( Z(σ) )^{−1} exp( −d²(x, x̄)/(2σ²) ) vol(dx)    (7)

and yields a well-defined probability distribution P(x̄, σ) on M. This will be
the main focus throughout the following.
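To make (7) concrete before moving on, here is a minimal numerical sketch (ours, not part of the original text): it evaluates the unnormalized weight exp(−d²(x, x̄)/(2σ²)) on the Poincaré disk model of the hyperbolic plane, whose distance function has a closed form. All function names are illustrative.

import numpy as np

def poincare_dist(x, y):
    # Hyperbolic distance (curvature -1) on the unit disk, with points
    # encoded as complex numbers of modulus < 1.
    return 2.0 * np.arctanh(abs(x - y) / abs(1.0 - np.conj(x) * y))

def gaussian_weight(x, xbar, sigma):
    # Unnormalized density of P(xbar, sigma) with respect to vol(dx), cf. (7).
    return np.exp(-poincare_dist(x, xbar) ** 2 / (2.0 * sigma ** 2))

xbar = 0.2 + 0.1j
for x in (0.2 + 0.1j, 0.5j, -0.7 + 0.2j):
    print(x, gaussian_weight(x, xbar, sigma=0.5))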
Remark. Here, readers may wish to recall the concept of a Hadamard mani-
fold, or of a homogeneous space, from Petersen (2006) (or any other good
Riemannian geometry textbook). The point of appealing to these concepts is
the following. The assumption that M is a Hadamard manifold implies that
geodesic spherical coordinates, which cover all of M, can be introduced at
any point x̄ ∈ M. Proposition 1 is obtained by writing the integral (3) in terms
of these spherical coordinates, and then applying Riemannian volume compar-
ison theorems that state, very roughly speaking, that manifolds with more pos-
itive curvature have less volume. On the other hand, to say that M is a
homogeneous space means that all points x̄ ∈ M are equivalent, so changing
the point of origin x̄ does not change the integral (3). This is the key to
Proposition 2.

2.3 The normalizing factor Z(σ)


Assume now M = G/K is a Riemannian symmetric space which belongs to the
noncompact case, described in Appendix A.1. In particular, M is a Hadamard
manifold and also a homogeneous space. Thus, for each x̄ ∈ M and σ > 0,
there is a well-defined probability distribution P(x̄, σ) on M, given by (7).
Here, the normalizing factor Z(σ) can be expressed as a multiple integral,
using the integral formula (A.13) of Proposition A.1, from Appendix A.1.
Applying this proposition (with o = x̄), it is enough to note

f( φ(s, a) | x̄, σ ) = exp( −‖a‖²_B/(2σ²) )

where ‖a‖²_B = B(a, a), in terms of the Ad(G)-invariant symmetric bilinear
form B (see Appendix A.1). Since this expression only depends on a, (A.13)
yields the following formula

Z(σ) = ( ω(S)/|W| ) ∫_a exp( −‖a‖²_B/(2σ²) ) ∏_{λ∈Δ₊} | sinh λ(a) |^{m_λ} da    (8)

This formula expresses Z(σ) as a multiple integral on the vector space a.
Recall that the dimension of a is known as the rank of M (Helgason, 1962).
Example 1. The easiest instance of (8) arises when M is a hyperbolic space of
dimension d, and constant sectional curvature equal to −1. Then, M has rank
equal to 1, so that a = ℝ â for some unit vector â ∈ a. Since the sectional cur-
vature is equal to −1, there is only one positive root λ, say λ(â) = 1, with
multiplicity m_λ = d − 1. In addition, |W| = 2, because there are two Weyl
chambers, C₊ = { r â : r > 0 } and C₋ = { r â : r < 0 }. Accordingly, (8) reads

Z(σ) = ( ω_{d−1}/2 ) ∫_{−∞}^{+∞} exp( −r²/(2σ²) ) | sinh(r) |^{d−1} dr
     = ω_{d−1} ∫_0^{+∞} exp( −r²/(2σ²) ) sinh^{d−1}(r) dr

In general, if all distances are divided by c > 0, the sectional curvature −1 is
replaced by −c². Thus, when M is a hyperbolic space of dimension d, and sec-
tional curvature −c²,

Z(σ) = ω_{d−1} ∫_0^{+∞} exp( −r²/(2σ²) ) ( c^{−1} sinh(cr) )^{d−1} dr

This is exactly Z_c(σ), expressed analytically in (6).
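The last identity is easy to check numerically. The following sketch (ours, not the chapter's) compares a direct quadrature of the one-dimensional integral above with the closed form (6), evaluating Φ/Φ′ as norm.cdf/norm.pdf; the helper names are illustrative.

import numpy as np
from math import gamma, pi
from scipy.integrate import quad
from scipy.special import binom
from scipy.stats import norm

def unit_sphere_area(d):
    # omega_{d-1}: area of the unit sphere in R^d
    return 2.0 * pi ** (d / 2.0) / gamma(d / 2.0)

def Z_quadrature(sigma, d, c=1.0):
    # Direct quadrature of the integral form of Z(sigma).
    f = lambda r: np.exp(-r**2 / (2*sigma**2)) * (np.sinh(c*r) / c)**(d - 1)
    val, _ = quad(f, 0.0, np.inf)
    return unit_sphere_area(d) * val

def Z_closed_form(sigma, d, c=1.0):
    # Closed form (6).
    s = sum((-1)**k * binom(d - 1, k)
            * norm.cdf((d - 1 - 2*k) * sigma * c)
            / norm.pdf((d - 1 - 2*k) * sigma * c)
            for k in range(d))
    return unit_sphere_area(d) * sigma / (2*c)**(d - 1) * s

print(Z_quadrature(0.5, 3), Z_closed_form(0.5, 3))  # the two should agree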

Example 2. Another example, also amenable to an analytic expression, arises when
M is a space of positive-definite matrices with real, complex, or quaternion
coefficients. Then, M = G/K with G = GL(N, 𝔽), where 𝔽 = ℝ, ℂ, or ℍ (real
numbers, complex numbers, or quaternions), and K ⊂ G a maximal compact
subgroup, K = O(N), U(N), or Sp(N). In each of these three cases, a is the
space of N × N real diagonal matrices, and the positive roots are the linear
maps λ(a) = a_ii − a_jj where i < j, each one having its multiplicity m_λ = β
(β = 1, 2, or 4, for 𝔽 = ℝ, ℂ, or ℍ). In addition, ‖a‖²_B = 4 tr(a²). The
Weyl group W is the group of permutation matrices in K, so |W| = N!, while
S = K/T_N, where T_N is the subgroup of all diagonal matrices in K. Replacing
all of this into (8), it follows that

Z(σ) = ( ω_β(N)/N! ) ∫_a ∏_{i=1}^{N} exp( −2a_ii²/σ² ) ∏_{i<j} | sinh(a_ii − a_jj) |^β da    (9)

where ω_β(N) stands for ω(S), and da = da_11 … da_NN. Introducing x_i =
exp(2a_ii),

Z(σ) = ( ω_β(N)/(2^{N N_β} N!) ) ∫_{ℝ₊^N} |V(x)|^β ∏_{i=1}^{N} ρ(x_i, 2σ²) x_i^{−N_β} dx_i

where N_β = (β/2)(N−1) + 1, ρ(x, k) = exp( −log²(x)/k ), and V(x) = ∏_{i<j}
(x_j − x_i) is the Vandermonde determinant. Finally, using the elementary
identity

ρ(x, k) x^{−α} = exp( (k/4) α² ) ρ( e^{(k/2)α} x, k )
it is immediately found that

Z(σ) = ( ω_β(N)/(2^{N N_β} N!) ) exp( −N N_β² (σ²/2) ) ∫_{ℝ₊^N} |V(u)|^β ∏_{i=1}^{N} ρ(u_i, 2σ²) du_i    (10)

For the case β = 2, the integral in (10) will be expressed analytically in
Section 2.6.

Remark. Curious readers will want to compute ω_β(N). For example, ω_2(N)
can be found using the Weyl integral formula on U(N) (Knapp, 2002). This
yields ω_2(N) = vol(U(N))/(2π)^N. The volume of the unitary group can be
found by looking at the normalizing factor of a Gaussian unitary ensemble
(Mehta, 2004). Specifically, vol(U(N)) = (2π)^{(N²+N)/2}/G(N), in terms of
G(N) = Γ(1) · Γ(2) ⋯ Γ(N) (Γ denotes the Euler Gamma function). □

Example 3. For this last example, let M = D_N be the Siegel domain (Siegel,
1943). This is the set of N × N symmetric complex matrices z, such that I_N −
z†z is positive-definite. Here, M = G/K, where G ≃ Sp(N, ℝ) (real symplectic
group) and K ≃ U(N) (unitary group). Precisely, G is the group of 2N ×
2N complex matrices g, with gᵗ Ω g = Ω and g† Γ g = Γ, where t denotes
the transpose, and where Ω and Γ are the matrices

Ω = ( 0, I_N ; −I_N, 0 )    Γ = ( I_N, 0 ; 0, −I_N )

In addition, K is the group of block-diagonal matrices k = diag(U, U*), where
U ∈ U(N), and * denotes the conjugate. The action of G on M is given by
Möbius transformations,

g · z = (Az + B)(Cz + D)^{−1}    g = ( A, B ; C, D )    (11)

This action preserves the Siegel metric, which is defined by

⟨v, v⟩_z = ‖( I_N − zz† )^{−1} v‖²_B    ‖v‖²_B = (1/2) tr( vv† )    (12)

where each tangent vector v is identified with a symmetric complex matrix.
Now (Terras, 1988), a consists of the matrices

a = diag( a, −a )    a = diag( a_11, …, a_NN )    (13)

The positive roots are λ(a) = a_ii − a_jj for i < j, and λ(a) = a_ii + a_jj for i ≤ j, all
with m_λ = 1. The order of the Weyl group is |W| = 2^N N!, and ω(S) =
vol(U(N))/2^N. Replacing into (8), it follows that

Z(σ) = ( vol(U(N))/(2^{2N} N!) ) ∫_a ∏_{i=1}^{N} exp( −a_ii²/(2σ²) ) ∏_{i<j} sinh| a_ii − a_jj | ∏_{i≤j} sinh| a_ii + a_jj | da    (14)

or, after introducing u_i = cosh(2a_ii),

Z(σ) = 2^{−2N} vol(U(N)) ∫_{C_N} V(u) ∏_{n=1}^{N} w(u_n, 8σ²) du_n    (15)

where C_N = { u ∈ ℝ₊^N : u_1 ≥ u_2 ≥ … ≥ u_N } and w(u, k) = exp( −arccosh²(u)/k ),
while V(u) is the Vandermonde determinant, as in (10).

2.4 MLE and maximum entropy


Let M be a Hadamard manifold which is also a homogeneous space. Consider
the family of distributions P(x̄, σ) on M, given by (7) for x̄ ∈ M and σ > 0.
This family of distributions fits the definition of Gaussian distributions stated
at the beginning of Section 2.2.

Proposition 3. Let P(x̄, σ) be given by (7), for x̄ ∈ M and σ > 0. The
maximum-likelihood estimate of the parameter x̄, based on samples (x_n ; n =
1, …, N) from P(x̄, σ), is unique and equal to the empirical barycenter x̂_N of
the samples (x_n).

This proposition is almost immediate. From (7), one has the log-likelihood
function

ℓ(x̄, σ) = −N log Z(σ) − (1/(2σ²)) Σ_{n=1}^{N} d²(x_n, x̄)    (16)

Since the first term does not depend on x̄, one may maximize ℓ(x̄, σ), first over
x̄ and then over σ. Clearly, maximizing over x̄ is equivalent to minimizing the
sum of squared distances d²(x_n, x̄). This is just the least-squares problem (1),
whose solution is the empirical barycenter x̂_N. Moreover, x̂_N is unique, since
M is a Hadamard manifold (Afsari, 2010; Sturm, 2003).

Consider now maximum-likelihood estimation of σ. This is better carried
out in terms of the natural parameter η = −(2σ²)^{−1}, or in terms of the moment
parameter δ = ψ′(η), where ψ(η) = log Z(σ) and the prime denotes the
derivative.
Proposition 4. The function ψ(η), just defined, is a strictly convex function,
which maps the half-line (−∞, 0) onto ℝ. The maximum-likelihood estimates
of the parameters η and δ are

η̂_N = (ψ′)^{−1}( δ̂_N )    and    δ̂_N = (1/N) Σ_{n=1}^{N} d²(x_n, x̂_N)    (17)

where (ψ′)^{−1} denotes the reciprocal function.

Remark. η̂_N in (17) is well-defined, since the range of ψ′ is equal to (0, ∞).
Indeed, one has the following inequalities, analogous to (5),

ψ′_0(η) ≤ ψ′(η) ≤ ψ′_c(η)    (18)

where ψ_0(η) = log Z_0(σ) and ψ_c(η) = log Z_c(σ). Now, ψ′_0(η) = dσ², which
increases to +∞ when σ increases to +∞. On the other hand, since
η = −(2σ²)^{−1},

ψ′_c(η) = σ³ (d/dσ)( log Z_c(σ) )    (19)

which, from (6), goes to 0 when σ goes to 0. Thus, it follows from (18) that ψ′ maps
the half-line (−∞, 0) onto the half-line (0, +∞). □
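As a small illustration of (17) and (19) (our sketch, not the authors' code), the moment equation ψ′(η) = δ̂_N can be solved for σ by bisection, given any numerical routine for log Z(σ) (for instance, the Z_closed_form sketch after Example 1, in the hyperbolic case). The bracketing interval below is an assumption.

import numpy as np
from scipy.optimize import brentq

def psi_prime(sigma, log_Z, h=1e-5):
    # psi'(eta) = sigma^3 d(log Z)/d sigma, by the chain rule used in (19);
    # the sigma-derivative is taken by central differences.
    return sigma**3 * (log_Z(sigma + h) - log_Z(sigma - h)) / (2.0 * h)

def sigma_mle(sq_dists_to_barycenter, log_Z):
    # delta_N from (17): mean squared distance to the empirical barycenter;
    # then invert psi' on an (assumed) bracketing interval for sigma.
    delta = float(np.mean(sq_dists_to_barycenter))
    return brentq(lambda s: psi_prime(s, log_Z) - delta, 1e-3, 10.0)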

An alternative definition of Gaussian distributions is provided by their
maximum-entropy property, stated in the following proposition. Here, entropy
specifically means Shannon's differential entropy. If P is a probability distri-
bution on M, with probability density function p, this entropy is equal to

S(P) = ∫_M ( −log p(x) ) p(x) vol(dx)

Proposition 5. The Gaussian distribution P(x̄, σ) is the unique distribution on
M having maximum Shannon entropy, among all distributions P with given
barycenter x̄ and dispersion δ = E_P[ d²(x, x̄) ]. Its entropy is equal to ψ*(δ),
where ψ* is the Legendre transform of ψ.

2.5 Barycenter and covariance


Let M be a Hadamard manifold which is also a homogeneous space. Consider
the barycenter and covariance of the Gaussian distribution P(x̄, σ) on M, given
by (7).

First, it should be noted that P(x̄, σ) has a well-defined Riemannian barycenter,
since it has finite second-order moments. To see this, it is enough
to note that

∫_M d²(x, x̄) P(dx | x̄, σ) < ∞

Indeed, this integral is just ψ′(η) in (18).



Proposition 6. Let P(x̄, σ) be given by (7), for x̄ ∈ M and σ > 0. The Rieman-
nian barycenter of P(x̄, σ) is equal to x̄.

The proof of this proposition relies on the fact that the so-called variance
function

E(x) = (1/2) ∫_M d²(x, y) P(dy | x̄, σ)    (20)

is strongly convex (see Said, 2021, Paragraph 2.2.3). Thus, if grad E(x) = 0,
then x is the global minimizer of E, and therefore the barycenter of P(x̄, σ).
However, grad E(x̄) = 0 follows from a direct application of the following
"Fisher's identity,"

∫_M ( grad_x̄ log p(x | x̄, σ) ) P(dx | x̄, σ) = 0

where grad_x̄ denotes the gradient with respect to x̄, defined according to the
Riemannian metric of M, and p(x | x̄, σ) is the probability density function
appearing in (7).

The covariance form of P(x̄, σ) is the symmetric bilinear form C_x̄ on T_x̄M,

C_x̄(u, v) = ∫_M ⟨u, Exp_x̄^{−1}(x)⟩ ⟨Exp_x̄^{−1}(x), v⟩ p(x | x̄, σ) vol(dx)    u, v ∈ T_x̄M    (21)

where Exp denotes the Riemannian exponential map (Exp^{−1} is well-defined,
since M is a Hadamard manifold).
With σ > 0 fixed, the map which assigns to x̄ ∈ M the covariance form C_x̄
is a (0,2)-tensor field on M, here called the covariance tensor of P(x̄, σ). In
order to compute this tensor field, consider the following situation.

Assume M = G/K is a Riemannian symmetric space. Here, K = K_o, the
stabilizer in G of o ∈ M. For k ∈ K and u ∈ T_oM, it is clear that k · u ∈ T_oM. This
defines a representation of K in the tangent space T_oM, called the isotropy rep-
resentation. One says that M is an irreducible symmetric space if this isotropy
representation is irreducible.

If M is not irreducible, then it is a product of irreducible Riemannian
symmetric spaces M = M_1 × ⋯ × M_s (Helgason, 1962) (proposition 5.5,
chapter VIII; this is the de Rham decomposition of M). Accordingly, for
x ∈ M and u ∈ T_xM, one may write x = (x_1, …, x_s) and u = (u_1, …, u_s),
where x_r ∈ M_r and u_r ∈ T_{x_r}M_r. Now, looking back at (7), it may be seen
that

p(x | x̄, σ) = ∏_{r=1}^{s} p(x_r | x̄_r, σ)    p(x_r | x̄_r, σ) = ( Z_r(σ) )^{−1} exp( −d²(x_r, x̄_r)/(2σ²) )    (22)

For the following proposition, let η = −(2σ²)^{−1} and ψ_r(η) = log Z_r(σ).
Proposition 7. Assume that M is a product of irreducible Riemannian sym-
metric spaces, M = M_1 × ⋯ × M_s. The covariance tensor C in (21) is given by

C_x̄(u, u) = Σ_{r=1}^{s} ( ψ′_r(η)/dim M_r ) ‖u_r‖²_{x̄_r}    (23)

for u ∈ T_x̄M, where x̄ = (x̄_1, …, x̄_s) and u = (u_1, …, u_s), with x̄_r ∈ M_r and
u_r ∈ T_{x̄_r}M_r.

Example. Let M = H(N), the space of N × N Hermitian positive-definite
matrices, so M = GL(N, ℂ)/U(N), with U(N) the stabilizer of o = I_N
(the N × N identity matrix). The de Rham decomposition of M is M = M_1 × M_2,
where M_1 = ℝ and M_2 is the submanifold whose elements are those x ∈ M such
that det(x) = 1. Accordingly, each x ∈ M is identified with the couple (x_1, x_2),

x_1 = (1/N) log det(x)    x_2 = ( det(x) )^{−1/N} x

and each u ∈ T_xM is written u = u_1 x + u_2, where

u_1 = (1/N) tr( x^{−1}u )    u_2 = u − (1/N) tr( x^{−1}u ) x

These may be replaced into expression (23),

C_x(u, u) = ψ′_1(η) u_1² + ( ψ′_2(η)/(N²−1) ) ‖u_2‖²_{x_2}    (24)

where ψ_1(η) = (1/2) log(2πσ²) and ψ_2(η) = log Z(σ) − ψ_1(η) (Z(σ) is given by
(26) in Section 2.6). After a direct calculation, this can be brought under the form

C_x(u, u) = g_1(σ) tr²( x^{−1}u ) + g_2(σ) tr( (x^{−1}u)² )    (25)

where g_1(σ) and g_2(σ) are certain functions of σ.

Remark. As a corollary of Proposition 7, the covariance tensor C is a
G-invariant Riemannian metric on M. This is clear, for example, in the
special case of (25), which coincides with the general expression of a
GL(N, ℂ)-invariant metric. □

2.6 Z(σ) from RMT


Random matrix theory is very helpful in the calculation of integrals such as
(10) and (15), leading both to exact expressions and to asymptotic expansions
of these integrals. Here, this is illustrated for the integral (10), with β = 2.
This corresponds to M = H(N), the space of N × N Hermitian
positive-definite matrices. In this case, it is possible to provide an analytic for-
mula for the normalizing factor Z(σ).
Proposition 8. When M = H(N), the normalizing factor Z(σ), given by (10)
with β = 2, admits the following analytic expression

Z(σ) = ( ω_2(N)/2^{N²} ) (2πσ²)^{N/2} exp( ((N³−N)/6) σ² ) ∏_{n=1}^{N−1} ( 1 − e^{−nσ²} )^{N−n}    (26)

The proof of this proposition is a direct application of a well-known for-
mula from random matrix theory (Mehta, 2004). The integral in (10) reads

I_N(σ) = ∫_{ℝ₊^N} |V(u)|^β ∏_{i=1}^{N} ρ(u_i, 2σ²) du_i

According to Mehta (2004) (chapter 5, p. 79), if (p_n ; n = 0, 1, …) are ortho-
normal polynomials with respect to the weight function ρ(u, 2σ²) on ℝ₊, then
I_N(σ) is given by

I_N(σ) = N! ∏_{n=0}^{N−1} p_nn^{−2}    (27)

where p_nn is the leading coefficient of p_n. The required orthonormal polyno-
mials p_n are given by p_n = (2πσ²)^{−1/4} s_n, where the s_n are the Stieltjes–Wigert
polynomials (Szegő, 1939) (p. 33). Looking up the expression of these
polynomials, it is easy to find

p_nn^{−2} = (2πσ²)^{1/2} exp( σ² (2n+1)²/2 ) ∏_{m=1}^{n} ( 1 − e^{−mσ²} )

Then, working out the product (27) and replacing into (10), one arrives at (26).
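For numerical use, (26) is best evaluated in logarithmic form. The sketch below is ours (names illustrative); it combines (26) with the expression of ω_2(N) from the remark after Example 2.

import numpy as np
from math import pi, log, lgamma

def log_Z_HN(sigma, N):
    # log of (26) for M = H(N); log G(N) = sum_n log Gamma(n), and
    # omega_2(N) = vol(U(N)) / (2 pi)^N, vol(U(N)) = (2 pi)^{(N^2+N)/2} / G(N).
    log_G = sum(lgamma(n) for n in range(1, N + 1))
    log_omega2 = ((N**2 + N) / 2.0 - N) * log(2.0 * pi) - log_G
    return (log_omega2 - N**2 * log(2.0)
            + (N / 2.0) * log(2.0 * pi * sigma**2)
            + ((N**3 - N) / 6.0) * sigma**2
            + sum((N - n) * np.log1p(-np.exp(-n * sigma**2))
                  for n in range(1, N)))

print(log_Z_HN(0.1, 10))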
Moving on, it is possible to derive an asymptotic expression of Z(σ), valid in
the limit where N goes to infinity while the product t = Nσ² remains constant.
Proposition 9. Let Z(σ) be given by (26). If N → ∞, while t = Nσ² remains
constant, then the following equivalence holds,

(1/N²) log Z(σ) ≃ −(1/2) log( 2N/π ) + 3/4 + t/6 + ( Li₃(e^{−t}) − ζ(3) )/t²    (28)

where Li₃(x) = Σ_{k=1}^{∞} x^k/k³ for |x| < 1 (the trilogarithm), and ζ is the Riemann
zeta function.

The main idea behind (28) is that, after taking the logarithm in (26), the product
on the right-hand side turns into a Riemann sum for the improper integral

∫_0^1 (1 − x) log( 1 − e^{−tx} ) dx = ( Li₃(e^{−t}) − ζ(3) )/t²

where the equality follows by integrating the power series of the
logarithm term by term.
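The Riemann-sum step can be probed directly. The short sketch below (ours) evaluates (1/N²) Σ_{n=1}^{N−1} (N−n) log(1 − e^{−nt/N}) for increasing N, together with a quadrature of the limiting integral:

import numpy as np
from scipy.integrate import quad

def riemann_sum(N, t):
    n = np.arange(1, N)
    return float(np.sum((N - n) * np.log1p(-np.exp(-n * t / N)))) / N**2

t = 1.0
for N in (10, 100, 1000):
    print(N, riemann_sum(N, t))
limit, _ = quad(lambda x: (1 - x) * np.log1p(-np.exp(-t * x)), 0.0, 1.0)
print("limit", limit)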

2.7 The asymptotic distribution


From the point of view of random matrix theory, a Gaussian distribution
P(I_N, σ) on M = H(N) defines a unitary matrix ensemble. If x is a random
matrix drawn from this ensemble, and (x_i ; i = 1, …, N) are its eigenvalues,
which all belong to (0, ∞), then the empirical distribution ν_N, which is given
by (as usual, δ_{x_i} is the Dirac distribution at x_i)

ν_N(B) = E[ (1/N) Σ_{i=1}^{N} δ_{x_i}(B) ]    (29)

for measurable B ⊂ (0, ∞), converges to an absolutely continuous distribution
ν_t, when N goes to infinity while the product t = Nσ² remains constant.

Proposition 10. Let c = e^{−t}, and let a(t) = c( 1 + √(1−c) )^{−2} and
b(t) = c( 1 − √(1−c) )^{−2}. When N goes to infinity, while the product t = Nσ²
remains constant, the empirical distribution ν_N converges weakly to the distri-
bution ν_t with probability density function

(dν_t/dx)(x) = ( 1/(πtx) ) arctan( √( 4e^t x − (x+1)² )/(x+1) ) 1_{[a(t),b(t)]}(x)    (30)

where 1_{[a(t),b(t)]} denotes the indicator function of the interval [a(t), b(t)].

Remark. As one should expect, when t = 0 (so σ² = 0), a(t) = b(t) = 1. □
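Proposition 10 is easy to explore numerically. The sketch below (ours; names illustrative) evaluates the density (30) and checks that it integrates to approximately one over its support; a(t) and b(t) are the points where the argument of the square root vanishes.

import numpy as np

def nu_t_density(x, t):
    # Density (30) of the limiting eigenvalue distribution.
    c = np.exp(-t)
    a = c * (1.0 + np.sqrt(1.0 - c)) ** (-2)
    b = c * (1.0 - np.sqrt(1.0 - c)) ** (-2)
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    m = (x > a) & (x < b)
    disc = 4.0 * np.exp(t) * x[m] - (x[m] + 1.0) ** 2
    out[m] = np.arctan(np.sqrt(disc) / (x[m] + 1.0)) / (np.pi * t * x[m])
    return out

t = 1.0
xs = np.linspace(1e-4, 12.0, 200001)
dx = xs[1] - xs[0]
print(np.sum(nu_t_density(xs, t)) * dx)  # should be close to 1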

The proof of Proposition 10 is a relatively direct application of a result in
Kuijlaars and Van Assche (1999) (p. 191). In fact, the integration variables in
(10) are u_i = e^t x_i. Let ν̃_N be the empirical distribution of the u_i (this is the
same as (29), but with u_i instead of x_i). By applying (Mehta, 2004) (chapter
5, p. 81),

ν̃_N(B) = (1/N) ∫_B R_N^{(1)}(u) du    (31)

for measurable B ⊂ (0, ∞), where the one-point correlation function R_N^{(1)}(u) is
given by

R_N^{(1)}(u) = ρ(u, 2σ²) Σ_{n=0}^{N−1} p_n²(u)    (32)

in the notation of Section 2.6 (the p_n are orthonormal polynomials, with respect to the
weight ρ(u, 2σ²)). According to Deift (1998) (p. 133), ν̃_N given by (31) con-
verges weakly to the so-called equilibrium distribution ν̃_t, which minimizes
the electrostatic energy functional

E(ν) = ( 1/(2t) ) ∫_0^∞ log²(u) ν(du) − ∫_0^∞ ∫_0^∞ log| u − v | ν(du) ν(dv)    (33)

over probability distributions ν on (0, ∞). Also according to Deift (1998)
(p. 133), this equilibrium distribution is the asymptotic distribution of the
zeros of the polynomial p_N (in the limit N → ∞ while Nσ² = t). Fortunately,
p_N is just a constant multiple of the Stieltjes–Wigert polynomial s_N (Szegő,
1939) (p. 33). Therefore, the required asymptotic distribution of zeros can
be read from Kuijlaars and Van Assche (1999) (p. 191). Finally, (30) follows
by introducing the change of variables x = e^{−t} u.
Remark. In Mariño (2005), the equilibrium distribution ν̃_t is derived directly,
by searching for stationary distributions of the energy functional (33). This
leads to a singular integral equation, whose solution reduces to a Riemann–
Hilbert problem. Astoundingly, the Gaussian distributions on H(N), as intro-
duced in the present chapter, provide a matrix model for Chern–Simons
quantum field theory (a detailed account is given in Mariño, 2005). □

2.8 Duality: The Θ distributions


Recall the Riemannian symmetric space M = H(N) of Section 2.6. Its dual space is
the unitary group M* = U(N) (the definition of duality may be found in
Appendix A.2).

Consider now a family of distributions on M*, which will be called Θ dis-
tributions, and which display an interesting connection with the Gaussian distri-
butions on M studied in Section 2.6. Recall Jacobi's ϑ function,ᵃ

ϑ( e^{iϕ} | σ² ) = Σ_{m=−∞}^{+∞} exp( −m²σ² + 2miϕ )

As a function of ϕ, up to some minor modifications, this is a wrapped normal
distribution (in other words, the heat kernel of the unit circle),

(1/2π) ϑ( e^{iϕ} | σ²/2 ) = (2πσ²)^{−1/2} Σ_{m=−∞}^{+∞} exp( −(2ϕ − 2mπ)²/(2σ²) )

Each x ∈ M* can be written x = k · e^{iθ}, where k ∈ U(N) and
e^{iθ} = diag( e^{iθ_j} ; j = 1, …, N ). Here, k · y = k y k† for y ∈ M*. With this notation,
define the following matrix ϑ function,

Θ( x | σ² ) = k · ϑ( e^{iθ} | σ²/2 )    (34)

which is obtained from x by applying Jacobi's ϑ function to each eigenvalue
of x. Further, consider the positive function,

f*( x | x̄, σ ) = det[ (2πσ²)^{−1/2} Θ( xx̄† | σ² ) ]    (35)

where x̄ ∈ M*. This is also equal to

det[ (2πσ²)^{−1/2} Θ( x̄†x | σ² ) ]

since the matrices xx̄† and x̄†x are similar. Then, let Z_{M*}(σ) denote the normal-
izing constant

Z_{M*}(σ) = ∫_{M*} f*( x | x̄, σ ) vol(dx)    (36)

which does not depend on x̄, as can be seen by introducing the new variable
of integration z = xx̄†, and using the invariance of vol(dx).

Now, define a Θ distribution Θ(x̄, σ) as the probability distribution on M*
whose probability density function, with respect to vol(dx), is given by

p*( x | x̄, σ ) = ( Z_{M*}(σ) )^{−1} f*( x | x̄, σ )    (37)

ᵃ To follow the original notation of Jacobi (Whittaker and Watson, 1950), this should be written
ϑ( e^{iϕ} | q ), where q = e^{−σ²}. In other popular notations, this function is called ϑ_{00} or ϑ_3.
Proposition 11. Let Z_M(σ) = Z(σ) be given by (26), and let Z_{M*}(σ) be given by
(36). Then, the following equality holds

Z_M(σ)/Z_{M*}(σ) = exp( ((N³−N)/6) σ² )    (38)

Remark. The Gaussian density (7) on M and the Θ distribution density (37) on
M* are apparently unrelated. Therefore, it is interesting to note that their normal-
izing constants Z_M(σ) and Z_{M*}(σ) scale together according to the simple rela-
tion (38). The connection between the two distributions is due to the duality
between M and M*. □
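The two expressions of the ϑ function given above (the series in m, and the wrapped normal form) can be compared numerically; the sketch below (ours) truncates both sums at |m| ≤ M:

import numpy as np

def theta_series(phi, s2, M=50):
    # theta(e^{i phi} | s2) = sum_m exp(-m^2 s2 + 2 i m phi), truncated.
    m = np.arange(-M, M + 1)
    return np.sum(np.exp(-m**2 * s2 + 2j * m * phi)).real

def theta_wrapped(phi, s2, M=50):
    # Wrapped normal form of (1/2pi) theta(e^{i phi} | s2/2).
    m = np.arange(-M, M + 1)
    return (np.sum(np.exp(-(2*phi - 2*np.pi*m)**2 / (2.0*s2)))
            / np.sqrt(2*np.pi*s2))

phi, s2 = 0.7, 0.3
print(theta_series(phi, s2/2) / (2*np.pi), theta_wrapped(phi, s2))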

3 Gaussian distributions and Bayesian inference


This section aims to investigate Bayesian inference for Gaussian distributions.
Precisely, it aims to study Bayesian estimation of the parameter x of a Gauss-
ian distribution P(x, σ), when this parameter is assigned a prior density which
is also Gaussian, say P(z, τ). Both the prior density P(z, τ) and the likelihood
density P(x, σ) are defined on a Riemannian symmetric space M of nonpositive
curvature.ᵇ

Section 3.1 begins by expressing the posterior density π(x), based on the
general definition

π(x) ∝ prior density × likelihood density

ᵇ Proofs of the results stated in this section can be found in chapter 4 of Said (2021).

Here, π(x) will remain partially unknown, as the missing normalizing factor
cannot be determined. Then, two Bayesian estimators are studied. The maxi-
mum a posteriori estimator x̂_MAP is the mode of π(x),

x̂_MAP = argmax_{x∈M} π(x)

while the minimum mean square error estimator x̂_MMS, classically understood
as the mean (expectation) of the posterior density, is here the Riemannian
barycenter of π(x).

It is seen that x̂_MAP can be computed directly, being a geodesic convex
combination of the prior barycenter z and a new observation y, with respective
weights 1 − ρ and ρ, where ρ = τ²/(σ² + τ²). On the other hand, x̂_MMS seems
much harder to compute.
However, Proposition 12 states that x̂_MMS = x̂_MAP if ρ = 1/2, and
Proposition 13 states that, in the special case where M is a hyperbolic space,
x̂_MMS is a geodesic convex combination of z and y, just like x̂_MAP, but with
different weights, say 1 − t* and t*.

Section 3.2 reports on numerical experiments which show, again in the special
case where M is a hyperbolic space, that x̂_MMS and x̂_MAP lie very close to each
other, and that they even appear to be equal (this would mean t* = ρ). At present,
the authors are unaware of any mathematical explanation of this phenomenon.
Section 3.3 describes the computational tools employed in calculating
x̂_MMS. First, Proposition 14 provides easy-to-verify sufficient conditions for
the geometric ergodicity of an isotropic Metropolis–Hastings Markov chain
in a Riemannian symmetric space M. These conditions are shown to apply
in the case of the posterior density π(x), making it possible to generate geo-
metrically ergodic samples (x_n ; n ≥ 1) from this density.

Proposition 15 states that the empirical barycenter x̄_N of the samples
(x_1, …, x_N) converges almost-surely to x̂_MMS, so x̄_N may be used to approxi-
mate x̂_MMS to any required accuracy.
Concretely, computing the empirical barycenter x̄_N requires solving a
strongly convex optimization problem on the Riemannian manifold M (here,
convexity is with respect to the Riemannian connection of M (Udriste, 1994);
this has come to be called "geodesic convexity"). Appendix B is devoted to
a brief but systematic study of convex optimization on Riemannian manifolds.
Specifically, it establishes the rate of convergence of Riemannian gradient
descent schemes, applied to strictly convex or strongly convex cost functions.

The gradient descent schemes under consideration are retraction schemes
(not limited to the Riemannian exponential) with a constant step-size. The
problem is then to find the largest possible step-size which guarantees a cer-
tain rate of convergence. Proposition B.8 addresses this problem for strictly
convex functions, and Proposition B.9 for strongly convex functions. For
any strictly convex cost function, and suitable retraction, Proposition B.8
gives the largest possible step-size which guarantees a rate of convergence
at least as fast as O(1/t) (t is the number of iterations). Proposition B.9 does
the same for strongly convex functions, but with an exponential rate of
convergence. In fact, with regard to the original motivation of computing x̄_N,
this ensures that a gradient descent scheme, using the Riemannian exponen-
tial, converges after only a few iterations.

3.1 MAP versus MMS


Assume M = G/K is a Riemannian symmetric space which belongs to the
noncompact case, described in Appendix A.1. Recall the Gaussian distribution
P(x, σ) on M, given by its probability density function (7),

p( y | x, σ ) = ( Z(σ) )^{−1} exp( −d²(y, x)/(2σ²) )    (39)

In Section 2.4, it was seen that maximum-likelihood estimation of the parameter x,
based on samples (y_n ; n = 1, …, N), amounts to computing the empirical bar-
ycenter of these samples. The one-sample maximum-likelihood estimate,
given a single observation y, is therefore x̂_ML = y.
Instead of maximum-likelihood estimation, consider Bayesian estimation
of x, based on the observation y. To do so, assign to x a prior density which
is also Gaussian,

p( x | z, τ ) = ( Z(τ) )^{−1} exp( −d²(x, z)/(2τ²) )    (40)

Upon observation of y, Bayesian inference concerning x is carried out using
the posterior density

π(x) ∝ exp( −d²(y, x)/(2σ²) − d²(x, z)/(2τ²) )    (41)

where ∝ indicates a missing (unknown) normalizing factor.
In particular, the maximum a posteriori estimator x̂_MAP of x is equal to the
mode of the posterior density π(x). In other words, x̂_MAP minimizes the
weighted sum of squared distances d²(y, x)/σ² + d²(x, z)/τ². This is expressed
in the following notation,ᶜ

x̂_MAP = z #_ρ y    where ρ = τ²/(σ² + τ²)    (42)

Thus, x̂_MAP is a geodesic convex combination of the prior barycenter z and the
observation y, with respective weights σ²/(σ² + τ²) and τ²/(σ² + τ²).
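Formula (42) is straightforward to implement. The sketch below (ours; all helper names are illustrative) works in the hyperboloid model of hyperbolic space, where the geodesic from z to y has the closed form γ(t) = ( sinh((1−t)θ) z + sinh(tθ) y )/sinh θ, with θ = d(z, y).

import numpy as np

def mink(x, y):
    # Minkowski pairing <x,y> = -x0*y0 + x1*y1 + ... (hyperboloid model).
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def dist(x, y):
    return np.arccosh(max(-mink(x, y), 1.0))

def geodesic_point(z, y, rho):
    # z #_rho y: the point at fraction rho along the geodesic from z to y.
    theta = dist(z, y)
    if theta < 1e-12:
        return z.copy()
    return (np.sinh((1 - rho) * theta) * z + np.sinh(rho * theta) * y) / np.sinh(theta)

def lift(v):
    # Embed a point of R^d into the hyperboloid {x : <x,x> = -1, x0 > 0}.
    return np.concatenate(([np.sqrt(1.0 + np.dot(v, v))], v))

z, y = lift(np.array([0.3, -0.2])), lift(np.array([1.0, 0.8]))
sigma2, tau2 = 1.0, 0.5
rho = tau2 / (sigma2 + tau2)
x_map = geodesic_point(z, y, rho)
print(x_map, dist(z, x_map) / dist(z, y))  # the ratio should equal rho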
On the other hand, the minimum mean square error estimator x̂_MMS is the
barycenter of the posterior density π(x). That is, x̂_MMS is the unique global
minimizer of

E_π(w) = (1/2) ∫_M d²(w, x) π(x) vol(dx)    (43)

ᶜ If p, q ∈ M and c : [0, 1] → M is a geodesic curve with c(0) = p and c(1) = q, then p #_t q = c(t),
for t ∈ [0, 1]. In other words, p #_t q is a geodesic convex combination of p and q, with respective
weights 1 − t and t.

While it is easy to compute x̂_MAP from (42), it is much harder to find x̂_MMS, as
this requires minimizing the integral (43), where the density π(x) is known
only up to normalization.

Still, there is one special case where these two estimators are equal.

Proposition 12. In the above notation, if σ² = τ² (that is, ρ = 1/2), then
x̂_MMS = x̂_MAP.

When M is a Euclidean space, it is well-known that x̂_MMS = x̂_MAP for any
value of ρ. When M is a space of constant negative curvature, the following
proposition indicates that x̂_MMS and x̂_MAP cannot be too far from one another.

Proposition 13. In the above notation, if M is a space of constant negative
curvature (a hyperbolic space), then x̂_MMS = z #_{t*} y for some t* ∈ (0, 1).

3.2 Bounding the distance


In general, one expects x̂_MMS and x̂_MAP to be different from one another when
ρ ≠ 1/2. However, when M is a space of constant negative curvature,
Proposition 13 shows that the distance between these two estimators is always less
than the distance between z and y.

Surprisingly (again, when M is a space of constant negative curvature),
numerical experiments show that x̂_MMS and x̂_MAP lie very close to each other,
and that they even appear to be equal. The authors are unaware of any math-
ematical explanation of this phenomenon.
It is possible to bound the distance between x̂_MMS and x̂_MAP, using the fun-
damental contraction property (Sturm, 2003) (this is an immediate application
of Jensen's inequality, as explained in the proof of theorem 6.3 in Sturm, 2003),

d( x̂_MMS, x̂_MAP ) ≤ W( π, δ_{x̂_MAP} )    (44)

where W denotes the Kantorovich (L¹-Wasserstein) distance, and δ_{x̂_MAP} denotes
the Dirac probability distribution concentrated at x̂_MAP. Now, the right-hand
side of (44) is equal to the first-order moment

m₁( x̂_MAP ) = ∫_M d( x̂_MAP, x ) π(x) vol(dx)    (45)

Of course, the upper bound in (44) is not tight, since it is strictly positive
even when ρ = 1/2, as one may see from (45).

It will be shown below that a Metropolis–Hastings algorithm, with
Gaussian proposals, can be used to generate geometrically ergodic samples
(x_n ; n ≥ 1) from the posterior density π. It is therefore possible to approximate
(45) by the empirical average

m̄₁( x̂_MAP ) = (1/N) Σ_{n=1}^{N} d( x̂_MAP, x_n )    (46)
N n¼1

In addition, the samples (x_n) can be used to compute a convergent approxima-
tion of x̂_MMS. Precisely, the empirical barycenter x̄_MMS of the samples
(x_1, …, x_N) converges almost-surely to x̂_MMS (this is a consequence of
Proposition 15).

Numerical experiments were conducted in the case where M is a hyper-
bolic space of curvature equal to −1 and of dimension d. The following
table was obtained for the values σ² = τ² = 0.1, using samples (x_1, …, x_N)
where N = 2 × 10⁵.

Dimension d          2     3     4     5     6     7     8     9     10
m̄₁(x̂_MAP)            0.28  0.35  0.41  0.47  0.50  0.57  0.60  0.66  0.70
d(x̄_MMS, x̂_MAP)      0.00  0.00  0.00  0.01  0.01  0.02  0.02  0.02  0.03

and the following table for σ² = 1 and τ² = 0.5, again using N = 2 × 10⁵.

Dimension d          2     3     4     5     6     7     8     9     10
m̄₁(x̂_MAP)            0.75  1.00  1.12  1.44  1.73  1.97  2.15  2.54  2.91
d(x̄_MMS, x̂_MAP)      0.00  0.00  0.03  0.02  0.02  0.03  0.04  0.03  0.12

The first table confirms Proposition 12. The second table, more surprisingly,
shows that x̂_MMS and x̂_MAP can be quite close to each other, even when ρ ≠ 1/2.

In both of these tables, d(x̄_MMS, x̂_MAP) is an approximation of
d(x̂_MMS, x̂_MAP), based on using the empirical barycenter x̄_MMS instead of
x̂_MMS. The main source of error affecting this approximation is the fact that
the samples (x_1, …, x_N) follow from a Metropolis–Hastings algorithm, and
not directly from the posterior density π.

Other values of σ² and τ² lead to similar orders of magnitude for m̄₁(x̂_MAP)
and d(x̄_MMS, x̂_MAP). While m̄₁(x̂_MAP) increases with the dimension d,
d(x̄_MMS, x̂_MAP) does not appear sensitive to increasing dimension.

Based on these experimental results, one is tempted to conjecture that
x̂_MMS = x̂_MAP, even when ρ ≠ 1/2. Of course, numerical experiments do not
equate to a mathematical proof.

3.3 Computing the MMS


3.3.1 Metropolis-Hastings algorithm
A crucial step in Bayesian inference is sampling from the posterior density.
Here, this is π(x), given by (41). Since π(x) is known only up to normalization,
a suitable sampling method is afforded by the Metropolis–Hastings algorithm.
This algorithm generates a Markov chain (x_n ; n ≥ 1), with transition kernel
(Roberts and Rosenthal, 2004)

P f(x) = ∫_M α(x, y) q(x, y) f(y) vol(dy) + ρ(x) f(x)    (47)

for any bounded measurable function f : M → ℝ, where α(x, y) is the proba-
bility of accepting a transition from x to dy, ρ(x) is the probability of stay-
ing at x, and q(x, y) is the proposed transition density,

q(x, y) ≥ 0 and ∫_M q(x, y) vol(dy) = 1 for x ∈ M    (48)

In the following, (x_n) will always be an isotropic Metropolis–Hastings chain,
in the sense that q(x, y) = q(d(x, y)), so q(x, y) only depends on the distance
d(x, y). In this case, the acceptance probability α(x, y) is given by α(x, y) =
min{ 1, π(y)/π(x) }.
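As an illustration (ours, simplified), an isotropic chain of this kind can be realized on the hyperboloid model of hyperbolic space by taking an isotropic Gaussian step in the tangent space and applying the exponential map; the induced proposal density then depends only on d(x, ·), as required. This is a convenient stand-in for the Gaussian proposal q(x, y) = p(y | x, τ_q) discussed below, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def mink(x, y):
    # Minkowski pairing; the hyperboloid is {x : <x,x> = -1, x[0] > 0}.
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map(x, v):
    # Riemannian exponential at x of a tangent vector v (<x,v> = 0).
    nv = np.sqrt(max(mink(v, v), 0.0))
    return x if nv < 1e-12 else np.cosh(nv) * x + np.sinh(nv) * v / nv

def propose(x, tau_q):
    # Isotropic step: Gaussian in the tangent space at x, then exp map.
    e = rng.normal(size=x.size)
    v = e + mink(x, e) * x          # project e onto the tangent space at x
    return exp_map(x, tau_q * v)

def mh_chain(log_pi, x0, n_steps, tau_q=0.3):
    x, chain = x0, []
    for _ in range(n_steps):
        y = propose(x, tau_q)
        # Accept with probability min{1, pi(y)/pi(x)}.
        if np.log(rng.uniform()) < log_pi(y) - log_pi(x):
            x = y
        chain.append(x)
    return chain

Here, log_pi would be x ↦ −d²(y, x)/(2σ²) − d²(x, z)/(2τ²), in accordance with (41).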
The aim of the Metropolis–Hastings algorithm is to produce a Markov
chain (x_n) which is geometrically ergodic. Geometric ergodicity means that the
distribution π_n of x_n converges to π, at a geometric rate, in the sense that
there exist β ∈ (0, 1) and R(x_1) ∈ (0, ∞), as well as a function V : M → ℝ,
such that (in the following, π(dx) = π(x) vol(dx))

V(x) ≥ max{ 1, d²(x, x*) } for some x* ∈ M    (49)

| ∫_M f(x) ( π_n(dx) − π(dx) ) | ≤ R(x_1) βⁿ    (50)

for any function f : M → ℝ with |f| ≤ V. If the chain (x_n) is geometrically
ergodic, then it satisfies the strong law of large numbers (Meyn and
Tweedie, 2008),

(1/N) Σ_{n=1}^{N} f(x_n) → ∫_M f(x) π(dx)    (almost-surely)    (51)

as well as a corresponding central limit theorem (see theorem 17.0.1 in Meyn
and Tweedie, 2008). Then, in practice, the Metropolis–Hastings algorithm can
be used to generate samples (x_n) from the posterior density π(x).
The following general statement can be proved, concerning the geometric
ergodicity of isotropic Metropolis-Hastings chains. The proof (see Said, 2021,
section 4.6) is a generalization of the one carried out in the special case where
M is a Euclidean space (Jarner and Hansen, 1998).
Proposition 14. Let M be a Riemannian symmetric space, which belongs
to the noncompact case. Assume (x_n ; n ≥ 1) is a Markov chain in M, with
transition kernel given by (47), with proposed transition density q(x, y) =
q(d(x, y)), and with strictly positive invariant density π.

The chain (x_n) satisfies (49) and (50) if the following assumptions hold:
(a1) there exists x* ∈ M, such that r(x) = d(x*, x) and ℓ(x) = log π(x) satisfy

limsup_{r(x)→∞} ⟨grad r, grad ℓ⟩_x / r(x) < 0

(a2) if n(x) = grad ℓ(x)/‖grad ℓ(x)‖, then n(x) satisfies

limsup_{r(x)→∞} ⟨grad r, n⟩_x < 0

(a3) there exist δ_q > 0 and ε_q > 0 such that d(x, y) < δ_q implies q(x, y) > ε_q.

Remark. The posterior density π in (41) verifies Assumptions (a1) and (a2).
To see this, let x* = z, and write

grad ℓ(x) = −(1/τ²) r(x) grad r(x) − (1/σ²) grad f_y(x)

where f_y(x) = d²(y, x)/2. Then, taking the scalar product with grad r,

⟨grad r, grad ℓ⟩_x = −(1/τ²) r(x) − (1/σ²) ⟨grad r, grad f_y⟩_x    (52)

since grad r(x) is a unit vector, for all x ∈ M. Now, grad f_y(x) = −Exp_x^{−1}(y)
(see Chavel, 2006). But, since r(x) is a convex function of x, it follows, by
(B.2) in Appendix B.1, that

⟨grad r, Exp_x^{−1}(y)⟩ ≤ r(y) − r(x)

for any y ∈ M. Thus, the right-hand side of (52) is strictly negative as soon as
r(x) > r(y), and Assumption (a1) is indeed verified. That Assumption (a2) is
also verified can be proved by a similar reasoning. □

Remark. On the other hand, Assumption (a3) holds if the proposed transition
density q(x, y) is a Gaussian density, q(x, y) = p(y | x, τ_q). With this choice of
q(x, y), all the assumptions of Proposition 14 are verified, for the posterior
density π in (41). Therefore, Proposition 14 implies that the Metropolis–
Hastings algorithm generates geometrically ergodic samples (x_n ; n ≥ 1) from
this posterior density. □

3.3.2 The empirical barycenter


Let (x_n ; n ≥ 1) be a Metropolis–Hastings Markov chain in M, with transi-
tion kernel (47) and invariant density π. Assume the chain (x_n) is geometri-
cally ergodic, so that it satisfies the strong law of large numbers (51).

Then, let x̄_N denote the empirical barycenter of the first N samples
(x_1, …, x_N). This is the unique global minimizer of the variance function

E_N(w) = ( 1/(2N) ) Σ_{n=1}^{N} d²(w, x_n)    (53)

Let x̂ denote the Riemannian barycenter of the invariant density π. It turns out
that x̄_N converges almost-surely to x̂.

Proposition 15. Let (x_n) be any Markov chain in a Hadamard manifold M,
with invariant distribution π. Denote by x̄_N the empirical barycenter of
(x_1, …, x_N), and by x̂ the Riemannian barycenter of π. If (x_n) satisfies the strong
law of large numbers (51), then x̄_N converges to x̂, almost-surely.

The proof of Proposition 15 is nearly a word-for-word repetition of the
proof in Bhattacharya and Patrangenaru (2003) (that of theorem 2.3).

According to the remarks after Proposition 14, the Metropolis–Hastings
Markov chain (x_n), whose invariant density is the posterior density π(x) given
by (41), is geometrically ergodic. Therefore, by Proposition 15, the empirical
barycenter x̄_MMS of the samples (x_1, …, x_N) converges almost-surely to the
minimum mean square error estimator x̂_MMS (since this is just the barycenter
of the posterior density π). This provides a practical strategy for approximat-
ing x̂_MMS. Indeed, x̄_MMS can be computed using the Riemannian gradient
descent method (this method is discussed in Appendix B.4).
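For concreteness, here is a minimal sketch (ours) of this last step on the hyperboloid model of hyperbolic space: the negative gradient of E_N at w is the average of the Riemannian logarithms log_w(x_n), and each descent iteration applies the exponential map (a unit step-size is used, in line with the fast convergence noted above; names are illustrative).

import numpy as np

def mink(x, y): return -x[0] * y[0] + np.dot(x[1:], y[1:])
def dist(x, y): return np.arccosh(max(-mink(x, y), 1.0))

def exp_map(x, v):
    nv = np.sqrt(max(mink(v, v), 0.0))
    return x if nv < 1e-12 else np.cosh(nv) * x + np.sinh(nv) * v / nv

def log_map(x, y):
    # Inverse of exp_map: tangent vector at x, of length d(x, y), pointing to y.
    d = dist(x, y)
    if d < 1e-12:
        return np.zeros_like(x)
    u = y + mink(x, y) * x          # component of y tangent at x
    return d * u / np.sqrt(mink(u, u))

def empirical_barycenter(samples, n_iter=50):
    # Riemannian gradient descent on (53): the negative gradient at w is
    # the mean of the logarithms log_w(x_n).
    w = samples[0]
    for _ in range(n_iter):
        g = np.mean([log_map(w, x) for x in samples], axis=0)
        w = exp_map(w, g)
    return w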
Remark. This strategy for approximating x̂_MMS provided the numerical results
discussed in Section 3.2. For an additional, visual illustration, consider (as in
Section 3.2) the case where M is a space of constant negative curvature −1,
and of dimension d = 2. Fig. 1 represents M in the shape of the Poincaré disc.
The prior barycenter z is designated by a square □ and the observation y by
a circle ∘. Grey crosses × mark the last 1000 out of N = 100,000 samples
x_n generated using the Metropolis–Hastings kernel (47), and the empirical bar-
ycenter x̄_MMS is designated by a black circle ●. In both Fig. 1A and B,
x̄_MMS is seen to lie on the geodesic connecting z and y, here the dashed
circle arc. Note that Fig. 1A corresponds to Proposition 12 and Fig. 1B to
Proposition 13.

FIG. 1 Poincaré disc with z = □ ; y = ∘ ; x̄_MMS = ●. (A) σ² = τ² = 0.1 (Proposition 12).
(B) σ² = 0.1 ; τ² = 1 (Proposition 13).

3.4 Proof of Proposition 13


Neither this proposition nor its proof appeared in Said (2021). The proof is
given here in a series of lemmas. Recall that M is now a hyperbolic space
(a simply connected space of constant negative curvature −1).

Lemma 1. Let γ : ℝ → M denote the geodesic curve with γ(0) = z and
γ(1) = y. Then, x̂_MMS lies on this geodesic curve γ.

Proof. Recall from Alekseevskij et al. (1993) (section 3) that there exists an
isometry σ : M → M, such that σ ∘ σ is the identity (σ is an involution),
and the set of fixed points of σ is exactly the geodesic curve γ. The key point
in the following, which can be seen from (41), is that

π( σ(x) ) = π(x) for x ∈ M    (54)
In other words, σ leaves the posterior density π invariant. Let E_π be the func-
tion in (43). Then, note that

( E_π ∘ σ )(w) = (1/2) ∫_M d²( w, σ(x) ) π(x) vol(dx)
              = (1/2) ∫_M d²( w, x ) π( σ(x) ) σ*(vol)(dx)

where the first equality follows from (43), because σ is an isometry and an
involution, and the second equality by a change of variables (σ*(vol) denotes
the pullback of the volume form vol by σ). Using (54) and the fact that σ pre-
serves the volume, it now follows that

( E_π ∘ σ )(w) = E_π(w) for w ∈ M    (55)

Finally, taking w = x̂_MMS and recalling that x̂_MMS is the unique global mini-
mizer of E_π, it follows that σ( x̂_MMS ) = x̂_MMS, so that x̂_MMS indeed lies on γ. □

Lemma 2. There exists a continuous function c : ℝ → ℝ such that

grad E_π( γ(t) ) = c(t) γ̇(t) for t ∈ ℝ    (56)

Remark. Taking covariant derivatives in (56),

Hess E_π( γ(t) ) · γ̇(t) = c′(t) γ̇(t)    (57)

where c′(t) = dc(t)/dt. Because E_π is 1-strongly convex (see Item (iii) of
Proposition B.2, Appendix B.1), it follows that c′(t) ≥ 1. In particular, c(t)
is strictly increasing.

Proof. Let w = γ(t) and take the gradient of (55). This yields

dσ · grad E_π( γ(t) ) = grad E_π( γ(t) )    (58)

However, the derivative dσ : T_{γ(t)}M → T_{γ(t)}M is equal to +1 on vectors parallel
to γ̇(t) and to −1 on vectors orthogonal to γ̇(t). Thus, (58) implies that
grad E_π( γ(t) ) must be parallel to γ̇(t). This is equivalent to (56). □

The next step in the proof is to compute c(0) and c(1). This will show
that c(0) is negative and c(1) is positive. Computing c(0) and c(1) requires
taking a closer look at the posterior density π.

Lemma 3. Let Z(z, y, τ, σ) denote the missing normalizing factor in (41).
Then, Z(z, y, τ, σ) = Z(δ, τ, σ), where δ = d²(z, y).

Proof. A hyperbolic space is a two-point homogeneous space (Helgason,
1962) (p. 355). This means that if z′, y′ ∈ M have d(z′, y′) = d(z, y), then there
exists an isometry g : M → M such that g(z) = z′ and g(y) = y′. Now, since
g is an isometry,

Z(z, y, τ, σ) = ∫_M exp( −d²(y, x)/(2σ²) − d²(x, z)/(2τ²) ) vol(dx)
             = ∫_M exp( −d²(g(y), g(x))/(2σ²) − d²(g(x), g(z))/(2τ²) ) vol(dx)

Thus, introducing the change of variables w = g(x),

Z(z, y, τ, σ) = ∫_M exp( −d²(y′, w)/(2σ²) − d²(w, z′)/(2τ²) ) vol(dw) = Z(z′, y′, τ, σ)

In other words, Z(z, y, τ, σ) only depends on the distance between z and y. □

It is now possible to compute c(0) and c(1).

Lemma 4. There exists a positive constant ψ(δ, τ, σ), such that

c(0) = −ψ(δ, τ, σ) τ² and c(1) = ψ(δ, τ, σ) σ²    (59)

in the notation of (56).
Proof. For any value of the parameters (z, y, τ, σ),

∫_M exp( −d²(y, x)/(2σ²) − d²(x, z)/(2τ²) − log Z(δ, τ, σ) ) vol(dx) = 1

where δ = d²(z, y). Taking the gradient of this identity with respect to z,
and using

grad_z d²(x, z) = −2 Exp_z^{−1}(x) ;  grad_z d²(z, y) = −2 Exp_z^{−1}(y)

where grad_z denotes the gradient with respect to z, it follows that

(1/τ²) ∫_M Exp_z^{−1}(x) π(x) vol(dx) − ψ(δ, τ, σ) Exp_z^{−1}(y) = 0    (60)

where ψ(δ, τ, σ) = −2 ∂log Z(δ, τ, σ)/∂δ. However, here one has

∫_M Exp_z^{−1}(x) π(x) vol(dx) = −grad E_π(z) ;  Exp_z^{−1}(y) = γ̇(0)    (61)

Thus, replacing (61) into (60), it follows that

c(0) = −ψ(δ, τ, σ) τ²

which is the first part of (59). The second part can be proved in the same way,
taking the gradient with respect to y rather than z. The fact that ψ(δ, τ, σ) > 0
follows because c(t) is strictly increasing (see the remark after Lemma 2), and
has at most one zero (because E_π has exactly one stationary point). □

It is now possible to complete the proof of Proposition 13. Lemma 4
shows that c(0) is negative and c(1) is positive. Therefore, c(t*) = 0 for some
t* ∈ (0, 1). By (56) of Lemma 2, grad E_π( γ(t*) ) = 0. Since E_π is strongly con-
vex, it follows immediately that γ(t*) is the unique global minimizer of E_π. In
other words, x̂_MMS = γ(t*), as required. □

Appendix A Riemannian symmetric spaces


A Riemannian symmetric space is a Riemannian manifold M such that, for
each x ∈ M, there exists an isometry s_x : M → M, with s_x(x) = x and
d s_x(x) = −Id_x. This isometry s_x is called the geodesic symmetry at x
(Helgason, 1962).

Let G denote the identity component of the isometry group of M, and K =
K_o be the stabilizer in G of some point o ∈ M. Then, M = G/K is a Rieman-
nian homogeneous space. The mapping θ : G → G, θ(g) = s_o ∘ g ∘ s_o, is an
involutive isomorphism of G.

Let g denote the Lie algebra of G, and consider the Cartan decomposition,
g = k + p, where k is the +1 eigenspace of dθ and p is the −1 eigenspace of
dθ. One clearly has the commutation relations,

[k, k] ⊂ k ;  [k, p] ⊂ p ;  [p, p] ⊂ k    (A.1)

In addition, it turns out that k is the Lie algebra of K, and that p may be iden-
tified with T_oM, in a natural way.

The Riemannian metric of M may always be expressed in terms of an
Ad(K)-invariant scalar product Q on g. If x ∈ M is given by x = g · o for some
g ∈ G (where g · o = g(o)), then

⟨u, v⟩_x = Q( g^{−1} · u, g^{−1} · v )    (A.2)

where the vectors g^{−1} · u and g^{−1} · v, which belong to T_oM, are identified with
elements of p. Here, by an abuse of notation, dg^{−1} · u is denoted g^{−1} · u.

Let exp : g → G denote the Lie group exponential. If v ∈ T_oM, then the
Riemannian exponential Exp_o(v) is given by

Exp_o(v) = exp(v) · o    (A.3)

Moreover, if Π_0^t denotes parallel transport along the geodesic c(t) = Exp_o(tv),
then

Π_0^t(u) = exp(tv) · u    (A.4)

for any u ∈ T_oM (note that the identification T_oM ≃ p is always made, implic-
itly). Using (A.4), one can derive the following expression for the Riemann
curvature tensor at o,

R_o(v, u)w = −[[v, u], w]    v, u, w ∈ T_oM    (A.5)

A fundamental property of symmetric spaces is that the curvature tensor is
parallel: ∇R = 0. This is often used to solve the Jacobi equation
(Helgason, 1962; Petersen, 2006), and then express the derivative of the Rie-
mannian exponential,

dExp_x(v)(u) = exp(v) · sh(R_v)(u)    (A.6)

where sh(R_v) = Σ_{n=0}^{∞} (R_v)ⁿ/(2n+1)! for the self-adjoint curvature operator
R_v(u) = [v, [v, u]]. As a result of (A.6), since exp(v) is an isometry, the fol-
lowing expression of the Riemannian volume is immediate

Exp_o*(vol) = | det( sh(R_v) ) | dv    (A.7)

where dv denotes the volume form on T_oM, associated with the restriction of
the scalar product Q to p.
Expression (A.7) yields applicable integral formulae when g is a reductive
Lie algebra (g = z + g_ss : z the center of g and g_ss semisimple). If a is a maximal
Abelian subspace of p, any v ∈ p is of the form v = Ad(k) a for some k ∈ K
and a ∈ a (see Helgason, 1962, lemma 6.3, chapter V). Moreover, using the
fact that Ad(k) is an isomorphism of g,

Ad(k^{−1}) ∘ R_v ∘ Ad(k) = R_a = Σ_{λ∈Δ₊} ( λ(a) )² Π_λ    (A.8)

where each λ ∈ Δ₊ is a linear form λ : a → ℝ, and Π_λ is the orthogonal pro-
jector onto the corresponding eigenspace of R_a. Here, Δ₊ is the set of positive
roots of g with respect to a (Helgason, 1962) (see lemma 2.9, chapter VII).

It is possible to use the diagonalization (A.8) in order to evaluate the
determinant in (A.7). To obtain a regular parameterization, let S = K/K_a, where
K_a is the centralizer of a in K. Then, let φ : S × a → M be given by φ(s, a) =
Exp_o( β(s, a) ), where β(s, a) = Ad(s) a. Now, by (A.7) and (A.8),
φ*(vol) = ∏_{λ∈Δ₊} | sinh λ(a)/λ(a) |^{m_λ} β*(dv)

where m_λ is the multiplicity of λ (the rank of Π_λ). On the other hand, one may
show

β*(dv) = ∏_{λ∈Δ₊} | λ(a) |^{m_λ} da ω(ds)    (A.9)

where da is the volume form on a, and ω is the invariant volume induced on
S from K.

Finally, the Riemannian volume, in terms of the parameterization φ, can
be expressed in the following way

φ*(vol) = ∏_{λ∈Δ₊} | sinh λ(a) |^{m_λ} da ω(ds)    (A.10)

Using (A.10), it is possible to write down integral formulae for Rieman-
nian symmetric spaces, either noncompact or compact.

A.1 The noncompact case


This is the case where g admits an Ad(G)-invariant, nondegenerate, symmetric
bilinear form B, such that Q(u, z) = −B(u, dθ(z)) is an Ad(K)-invariant scalar
product on g.

In this case, B is negative-definite on k and positive-definite on p. More-
over, the linear map ad(z) = [z, ·] is skew-symmetric or symmetric (with
respect to Q), according to whether z ∈ k or z ∈ p.

If u_1, u_2 ∈ p are orthonormal, the sectional curvature of Span(u_1, u_2) is
found from (A.5): κ(u_1, u_2) = −‖[u_1, u_2]‖²_o ≤ 0. Therefore, M has nonposi-
tive sectional curvatures.

In fact, M is a Hadamard manifold. It is geodesically complete, by (A.3). It
is moreover simply connected, because Exp_o : p → M is a diffeomorphism
(Helgason, 1962) (theorem 1.1, chapter VI). Thus, (A.7) yields a first integral
formula,

∫_M f(x) vol(dx) = ∫_p f( Exp_o(v) ) | det( sh(R_v) ) | dv    (A.11)

To obtain an integral formula from (A.10), one should first note that β :
S × a → p is neither regular nor one-to-one. Recall the following:
• the hyperplanes λ(a) = 0, where λ ∈ Δ₊, divide a into finitely many
connected components, which are open and convex sets, known as Weyl
chambers. From (A.9), β is regular on each Weyl chamber.
• let K′_a denote the normalizer of a in K. Then, W = K′_a/K_a is a finite group
of automorphisms of a, called the Weyl group, which acts freely and transi-
tively on the set of Weyl chambers (Helgason, 1962) (theorem 2.12,
chapter VII).

Then, for each Weyl chamber C, β is regular and one-to-one from S × C onto
its image in p. Moreover, if a_r is the union of the Weyl chambers (a ∈ a_r if
and only if λ(a) ≠ 0 for all λ ∈ Δ₊), then β is regular and |W|-to-one from
S × a_r onto its image in p. To obtain the desired integral formula, it only
remains to note that φ is a diffeomorphism from S × C onto its image in
M. However, this image is the set M_r of regular values of φ. By Sard's lemma,
its complement is negligible (Bogachev, 2007).
Proposition A.1. Let M = G/K be a Riemannian symmetric space, which
belongs to the "noncompact case" just described. Then, for any bounded
continuous function f : M → ℝ,

∫_M f(x) vol(dx) = ∫_{C₊} ∫_S f( φ(s, a) ) ∏_{λ∈Δ₊} ( sinh λ(a) )^{m_λ} da ω(ds)    (A.12)

= (1/|W|) ∫_a ∫_S f( φ(s, a) ) ∏_{λ∈Δ₊} | sinh λ(a) |^{m_λ} da ω(ds)    (A.13)

Here, C₊ is the Weyl chamber C₊ = { a ∈ a : λ ∈ Δ₊ ⇒ λ(a) > 0 }.

A.2 The compact case


In this case, g admits an Ad(G)-invariant scalar product Q. Therefore, ad(z) is
skew-symmetric, with respect to Q, for each z ∈ g. Using (A.5), it follows that
M is compact, with nonnegative sectional curvatures.

In fact, the compact case may be obtained from the previous noncompact
case by duality. Denote by g^ℂ the complexification of g, and let g* = k + p*, where
p* = i p. Then, g* is a compact real form of g^ℂ (that is, g* is a compact Lie
algebra, and its complexification is equal to g^ℂ). Denote by G* the connected
Lie group with Lie algebra g*.

If M = G/K is a Riemannian symmetric space which belongs to the non-
compact case, then M* = G*/K is a Riemannian symmetric space which
belongs to the compact case. Formally, to pass from the noncompact case to
the compact case, all one has to do is replace a by i a. Applying this recipe
to (A.10), one obtains

φ*(vol) = ∏_{λ∈Δ₊} | sin λ(a) |^{m_λ} da ω(ds)    (A.14)

where da is the volume form on a* = i a, and ω is the invariant volume
induced on S from K. Note that the image under Exp_o of a* is the torus
T* = a*/a_K, where a_K is the lattice given by a_K = { a ∈ a* : Exp_o(a) = o }.
Recall the following:

• φ(s, a) only depends on t = Exp_o(a). Thus, φ may be considered as a map
from S × T* to M.
• if a ∈ a_K, then exp(2a) = e (the identity element in G*). Thus,
λ(a) ∈ iπℤ for all λ ∈ Δ₊ (Helgason, 1962) (p. 383). Therefore, there
exists a function D : T* → ℝ, such that

D(t) = ∏_{λ∈Δ₊} | sin λ(a) |^{m_λ} whenever t = Exp_o(a)

Now, T* is a totally flat submanifold of M. Therefore, Exp*(dt) = da, where
dt denotes the invariant volume induced on T* from M. With a slight abuse
of notation, (A.14) now reads,

φ*(vol) = D(t) dt ω(ds)    (A.15)

Denote by (T*)_r the set of t ∈ T* such that D(t) ≠ 0. By the same arguments as in
the noncompact case, φ is a regular |W|-to-one map from S × (T*)_r onto M_r,
the set of regular values of φ.
Proposition A.2. Let M = G*/K be a Riemannian symmetric space, which
belongs to the "compact case" just described. For any bounded continuous
function f : M → ℝ,

∫_M f(x) vol(dx) = (1/|W|) ∫_{T*} ∫_S f( φ(s, t) ) D(t) dt ω(ds)    (A.16)

A.3 Example of Propositions A.1 and A.2

Consider M = H(N), the space of N × N Hermitian positive-definite matrices. Here, G = GL(N, ℂ) and K = U(N). Moreover, B(u, z) = Re(tr(uz)) and dθ(z) = −z†. Thus, p is the space of N × N Hermitian matrices, and one may choose for a the space of N × N real diagonal matrices. The positive roots are the linear maps λ(a) = a_ii − a_jj where i < j, and each one has multiplicity equal to 2. The Weyl group W is the group of permutation matrices in U(N) (so |W| = N!). Finally, S = U(N)/Tᴺ ≡ S_N, where Tᴺ is the torus of diagonal unitary matrices.
By (A.13),

$$\int_{H(N)} f(x)\,\mathrm{vol}(dx) = \frac{1}{N!}\int_{a}\int_{S_N} f\big(s\,\exp(2a)\,s^\dagger\big)\prod_{i<j}\sinh^2(a_{ii}-a_{jj})\, da\,\omega(ds) \quad (A.17)$$

where da = da₁₁⋯da_NN. Now, assume f is a class function: f(k · x) = f(x) for k ∈ K and x ∈ H(N). That is, f(x) depends only on the eigenvalues x_i = e^{r_i} of x. By (A.17),
$$\int_{H(N)} f(x)\,\mathrm{vol}(dx) = \frac{\omega(S_N)}{2^N N!}\int_{\mathbb{R}^N} f(\exp(r))\prod_{i<j}\sinh^2\big((r_i-r_j)/2\big)\, dr \quad (A.18)$$

or, by introducing the eigenvalues x_i as integration variables,

$$\int_{H(N)} f(x)\,\mathrm{vol}(dx) = \frac{\omega(S_N)}{2^{N^2} N!}\int_{\mathbb{R}_+^N} f(x_1,\ldots,x_N)\,|V(x)|^2\prod_{i=1}^N x_i^{-N}\, dx_i \quad (A.19)$$

where V(x) = ∏_{i<j}(x_j − x_i) is the Vandermonde determinant.
The dual of H(N) is the unitary group U(N). Here, G* = U(N) × U(N) and K ≅ U(N) is the diagonal group K = {(x, x) : x ∈ U(N)}. The Riemannian metric is given by the trace scalar product Q(u, z) = tr(uz). Moreover, T* = Tᴺ and S = S_N (this is U(N)/Tᴺ). The positive roots are λ(ia) = a_ii − a_jj where i < j and a is N × N, real and diagonal (do not confuse the imaginary number i with the subscript i). By writing the integral over Tᴺ as a multiple integral, (A.16) reads
$$\int_{U(N)} f(x)\,\mathrm{vol}(dx) = \frac{1}{N!}\int_{[0,2\pi]^N}\int_{S_N} f\big(s\,\exp(2ia)\,s^\dagger\big)\prod_{i<j}\sin^2(a_{ii}-a_{jj})\,\omega(ds)\, da \quad (A.20)$$

where da = da₁₁⋯da_NN.
Now, assume f is a class function; that is, f(x) depends only on the eigenvalues e^{iθ_i} of x. Integrating out s, from (A.20) it follows that

$$\int_{U(N)} f(x)\,\mathrm{vol}(dx) = \frac{\omega(S_N)}{2^N N!}\int_{[0,2\pi]^N} f(\exp(i\theta))\prod_{i<j}\sin^2\big((\theta_i-\theta_j)/2\big)\, d\theta \quad (A.21)$$

or, after an elementary manipulation,

$$\int_{U(N)} f(x)\,\mathrm{vol}(dx) = \frac{\omega(S_N)}{2^{N^2} N!}\int_{[0,2\pi]^N} f(\theta_1,\ldots,\theta_N)\,|V(e^{i\theta})|^2\, d\theta_1\cdots d\theta_N \quad (A.22)$$

where V(e^{iθ}) = ∏_{i<j}(e^{iθ_j} − e^{iθ_i}) is the Vandermonde determinant.
Integrals such as (A.19) and (A.22) are familiar in random matrix theory
(Meckes, 2019; Mehta, 2004). The resemblance between these integrals (for
example, in the role played by the Vandermonde determinant) is at the origin
of the sort of “duality” described in Section 2.8.
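The "elementary manipulation" leading from (A.21) to (A.22) rests on the identity |e^{iθ_j} − e^{iθ_i}|² = 4 sin²((θ_i − θ_j)/2); the factor 4^{N(N−1)/2} = 2^{N(N−1)}, absorbed into the constant, is what turns the 2^N of (A.21) into the 2^{N²} of (A.22). A minimal numerical check in Python (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
theta = rng.uniform(0.0, 2.0 * np.pi, size=N)

# |V(e^{i theta})|^2, with V the Vandermonde determinant of the e^{i theta_j}
z = np.exp(1j * theta)
vand_sq = np.prod([abs(z[j] - z[i]) ** 2
                   for i in range(N) for j in range(i + 1, N)])

# Product over pairs i < j of 4 sin^2((theta_i - theta_j)/2)
sin_sq = np.prod([4.0 * np.sin((theta[i] - theta[j]) / 2.0) ** 2
                  for i in range(N) for j in range(i + 1, N)])

assert np.isclose(vand_sq, sin_sq)  # the two densities agree pointwise
```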

Appendix B Convex optimization


B.1 Convex sets and functions
In Euclidean geometry, a convex set A is any set which satisfies the following definition: if points x and y belong to A, then the straight line segment between x and y lies entirely in A. One hopes to extend this definition to Riemannian geometry by letting geodesics play the role of straight lines. However, this leads not to one but to multiple definitions of a convex set. The present article will focus on the following one (Chavel, 2006).
Definition B.1. A subset A of a complete Riemannian manifold M is called strongly convex if, whenever points x and y belong to A, there exists a unique length-minimizing geodesic γ_{x,y} connecting x and y, and γ_{x,y} lies entirely in A.

Remark. As an example of a different way of defining a convex set, consider


the following. A subset A of M is called weakly convex if, whenever points
x and y belong to A, there exists a unique geodesic γ in M, such that γ con-
nects x and y, and γ lies entirely in A (this coincides with the definition in
Chavel (2006), because γ is then the unique length-minimizing curve, among
all curves that connect x and y and lie entirely in A). □

In Euclidean geometry, a ball of any radius is convex. In a Riemannian


manifold, a ball may fail to be strongly (or even weakly) convex, if its radius
is too large. On the other hand, a ball with sufficiently small radius is always
strongly convex (Chavel, 2006; Petersen, 2006).
Proposition B.1. Assume the sectional curvatures of M are bounded above by κ²_max ≥ 0. Then, denoting inj(M) the injectivity radius of M, let

$$R_c(M) = \min\left\{\frac{1}{2}\,\mathrm{inj}(M),\ \frac{\pi}{2\kappa_{\max}}\right\} \quad (B.1)$$

For any x ∈ M and R < R_c(M), the open ball B(x, R) is strongly convex (if κ_max = 0, it should be understood that division by zero yields infinity).

Remark. If (1/2) inj(M) is replaced by inj(M) in (B.1), then R ≤ R_c(M) implies B(x, R) is weakly convex (Chavel, 2006; Kendall, 1990). □

There is a certain class of Riemannian manifolds where balls of any radius are strongly convex: namely, Hadamard manifolds. Recall that a Hadamard manifold is a simply connected, complete Riemannian manifold with nonpositive sectional curvatures. In particular (Lee, 2012), this implies inj(M) = ∞ and κ_max = 0, so that R_c(M) = ∞.
The following definition of a convex function on a Riemannian manifold
directly extends the usual, well-known definition of a convex function on a
Euclidean space.
Definition B.2. Let A be a strongly convex subset of a complete Riemannian manifold M, and f : A → ℝ. Then f is called convex (respectively, strictly convex) if f(γ_{x,y}(t)) is a convex (respectively, strictly convex) function of the time parameter t, for all x, y ∈ A. Further, if there exists α > 0 such that f(γ_{x,y}(t)) is an α-strongly convex function of t, for all x, y ∈ A, then f is called α-strongly convex.
For differentiable functions, it is possible to write down first-order and second-order characterizations of convexity (Udriste, 1994). Recall that γ̇_{x,y}(0) = Exp_x^{−1}(y) for x, y ∈ A, where the dot denotes the time derivative and Exp the Riemannian exponential map (Lee, 2012). In addition, let grad f and Hess f denote the gradient and Hessian of a function f on M (defined with respect to the Riemannian metric and Levi-Civita connection of M).
Proposition B.2. Let A be a strongly convex subset of a complete Riemannian manifold M, and f : A → ℝ.
(i) Assume f is differentiable. Then f is convex if and only if

$$f(y) - f(x) \geq \langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(y)\rangle_x \quad \text{for all } x, y \in A \quad (B.2)$$

Moreover, f is strictly convex if and only if the above inequality is strict whenever y ≠ x.
(ii) Assume f is differentiable. Then f is α-strongly convex if and only if

$$f(y) - f(x) \geq \langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(y)\rangle_x + (\alpha/2)\, d^2(y, x) \quad \text{for all } x, y \in A \quad (B.3)$$

(iii) Assume f is twice differentiable. Then f is convex if and only if Hess f(x) ⪰ 0, and strictly convex if and only if Hess f(x) ≻ 0, for all x ∈ A. Moreover, f is α-strongly convex if and only if Hess f(x) ⪰ α g(x), for all x ∈ A.

Here, ⟨·, ·⟩ and d(·, ·) denote the Riemannian scalar product and distance associated with the Riemannian metric tensor g of M. Moreover, ⪰ stands for the Loewner order. A straightforward consequence of (ii) in Proposition B.2 is the so-called PL inequality (PL stands for Polyak–Łojasiewicz (Karimi et al., 2016)). This inequality will be used in Section B.4.2.
Proposition B.3. Let f : A → ℝ be a twice differentiable, α-strongly convex function. If f has its minimum at x* ∈ A, then

$$\|\mathrm{grad}\, f(x)\|_x^2 \geq 2\alpha\,\big(f(x) - f(x^*)\big) \quad \text{for all } x \in A \quad (B.4)$$
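As a sanity check of (B.4), in the flat Euclidean case (where Exp_x^{−1}(y) = y − x) the quadratic f(x) = (α/2)‖x − x*‖² satisfies (B.4) with equality. A minimal sketch in Python (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.7
x_star = rng.normal(size=3)          # minimizer of f

def f(x):                            # alpha-strongly convex, minimum at x_star
    return 0.5 * alpha * np.sum((x - x_star) ** 2)

def grad_f(x):
    return alpha * (x - x_star)

for _ in range(5):
    x = rng.normal(size=3)
    lhs = np.sum(grad_f(x) ** 2)             # ||grad f(x)||^2
    rhs = 2.0 * alpha * (f(x) - f(x_star))   # 2 alpha (f(x) - f(x*))
    assert lhs >= rhs - 1e-12                # (B.4); equality for this f
```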

B.2 Second-order Taylor formula

Consider the second-order Taylor formula for a twice-differentiable function f : M → ℝ (as usual, M is a complete Riemannian manifold). For x ∈ M and v ∈ T_xM,

$$f(\mathrm{Exp}_x(v)) = f(x) + \langle \mathrm{grad}\, f(x), v\rangle_x + \frac{1}{2}\,\mathrm{Hess}\, f_{\gamma(t^*)}(\dot\gamma, \dot\gamma) \quad (B.5)$$

where γ is the geodesic curve γ(t) = Exp_x(tv) and t* ∈ (0, 1). Formula (B.5) is the first-order Taylor expansion, with Lagrange remainder, of the function f(γ(t)) at t = 0 (Said, 2021). This formula will be the starting point for the study of Riemannian gradient descent in Section B.4. There, it will be applied with v = −μ grad f(x) and μ ∈ (0, 1].
To apply (B.5), it is quite helpful to control the second-order term on its right-hand side. One says that f is L-smooth on B ⊂ M if there exists L ≥ 0 with |Hess f_y(u, u)| ≤ L‖u‖²_y for all y ∈ B and u ∈ T_yM. Then, if γ(t) = Exp_x(tv) belongs to B for all t ∈ (0, 1),

$$f(\mathrm{Exp}_x(v)) \leq f(x) + \langle \mathrm{grad}\, f(x), v\rangle_x + (L/2)\,\|v\|_x^2 \quad (B.6)$$
Inequality (B.6) yields Proposition B.4. To state this proposition, consider a C² function f : M → ℝ, and assume the sublevel set B_c = {x : f(x) ≤ c} is compact (and not empty) for some real c. Of course, B_c is contained in some closed ball B̄ = B̄(z, R). Let G be the maximum of ‖grad f(x)‖_x taken over x ∈ B̄, and B̄′ = B̄(z, R + G). Now, by compactness of B̄′, f is L_c-smooth on B̄′, for some L_c ≥ 0.
Proposition B.4. Let f : M → ℝ be a C² function, with B_c and L_c defined as above, and let y = Exp_x(−μ grad f(x)) for some μ ∈ (0, 1]. If μ ≤ 1/L_c, then

$$f(y) \leq f(x) - (\mu/2)\,\|\mathrm{grad}\, f(x)\|_x^2 \quad \text{for all } x \in B_c \quad (B.7)$$

In particular, x ∈ B_c implies y ∈ B_c.

Remark 1. As a consequence of (B.7), if x* ∈ B_c is such that f(x*) is the minimum of f(x) taken over x ∈ B_c, then

$$2L_c\,\big(f(x) - f(x^*)\big) \geq \|\mathrm{grad}\, f(x)\|_x^2 \quad \text{for all } x \in B_c \quad (B.8)$$

which is complementary to (B.4). □

B.3 Taylor with retractions


It is customary, in practical applications, to approximate the Riemannian
exponential map by another so-called retraction map, which is easier to com-
pute (Absil et al., 2008). In this context, it is helpful to derive new versions of
Formula (B.5) and of Proposition B.4, which apply when the exponential map
Exp is replaced with a retraction Ret.
Recall that a retraction is a smooth map Ret : TM → M (denoted Ret(x, v) = Ret_x(v) for x ∈ M and v ∈ T_xM), such that

$$\mathrm{Ret}_x(0_x) = x \quad \text{and} \quad d\,\mathrm{Ret}_x(0_x) = \mathrm{Id}_x \quad (B.9)$$

for all x ∈ M. Here, 0_x is the zero element of T_xM and Id_x is the identity map of T_xM. Most retractions encountered in practical applications are regular retractions, in the following sense (Said, 2021).
Definition B.3. A retraction Ret : TM → M is regular if there exists a smooth bundle map Φ : TM → TM such that

$$\mathrm{Ret}_x(v) = \mathrm{Exp}_x(\Phi_x(v)) \quad \text{for all } x \in M \text{ and } v \in T_xM \quad (B.10)$$

Here, Φ is denoted Φ(x, v) = Φ_x(v) ("bundle map" means Φ_x(v) ∈ T_xM for all v ∈ T_xM).
If Ret : TM → M is a regular retraction and f : M → ℝ is a twice-differentiable function, then (B.5) and (B.10) directly imply

$$f(\mathrm{Ret}_x(v)) = f(x) + \langle \mathrm{grad}\, f(x), \Phi_x(v)\rangle_x + \frac{1}{2}\,\mathrm{Hess}\, f_{\gamma(t^*)}(\dot\gamma, \dot\gamma) \quad (B.11)$$

where γ is the geodesic curve γ(t) = Exp_x(tΦ_x(v)) and t* ∈ (0, 1). Formula (B.11) is the required new version of (B.5).
The Riemannian exponential Exp is a regular retraction, with Φ_x = Id_x for x ∈ M. For a general regular retraction Ret, each map Φ_x : T_xM → T_xM still agrees with Id_x up to second-order terms (Said, 2021).
Proposition B.5. Let Ret : TM → M be a regular retraction, with Φ : TM → TM given by (B.10). Then, for each x ∈ M, the map Φ_x : T_xM → T_xM verifies
(a) Φ_x(0_x) = 0_x and Φ′_x(0_x) = Id_x (the prime denotes the Fréchet derivative).
(b) Φ″_x(0_x)(v, v) = c̈(0), where the curve c(t) is given by c(t) = Ret_x(tv).

The retraction Ret is called geodesic when Φ″_x(0_x)(v, v) = 0 for x ∈ M and v ∈ T_xM. In this case, Φ_x agrees with Id_x up to third-order terms. For geodesic regular retractions, Proposition B.6 provides a new, general version of Proposition B.4.
Proposition B.6. Let f : M → ℝ be a C² function, with B_c = {x : f(x) ≤ c} compact (and not empty), and let Ret : TM → M be a geodesic regular retraction. There exist constants β_c, δ_c, H_c ≥ 0, which depend on f and Ret, such that, for all x ∈ B_c,

$$f(y) \leq f(x) - \mu\left[1 - (\beta_c H_c/2)\,\mu - \big(\delta_c\,\|\mathrm{grad}\, f(x)\|_x^2\big)\,\mu^2\right]\|\mathrm{grad}\, f(x)\|_x^2 \quad (B.12)$$

whenever y = Ret_x(−μ grad f(x)) for some μ ∈ (0, 1]. In particular,

$$\frac{1}{2} - (\beta_c H_c/2)\,\mu - \big(\delta_c\,\|\mathrm{grad}\, f(x)\|_x^2\big)\,\mu^2 \geq 0 \implies f(y) \leq f(x) - (\mu/2)\,\|\mathrm{grad}\, f(x)\|_x^2 \quad (B.13)$$

Therefore, x ∈ B_c implies y ∈ B_c.

Remark. The application of Proposition B.6 is somewhat simplified when the retraction Ret, in addition to being regular and geodesic, is contractive and uniformly geodesic. Here, contractive means that

$$\|\Phi_x(v)\|_x \leq \|v\|_x \quad \text{for } x \in M \text{ and } v \in T_xM \quad (B.14)$$

In this case, it is always possible to put β_c = 1 and H_c = L_c, where L_c is the same constant as in Proposition B.4. Uniformly geodesic means there exists δ ≥ 0 such that

$$\|\Phi_x(v) - v\|_x \leq \delta\,\|v\|_x^3 \quad \text{for } x \in M \text{ and } v \in T_xM \quad (B.15)$$

In this case, it is always possible to put δ_c = δ (independent of c and even of f). □
The widely used projection retractions for spheres, unitary groups, and Grassmann manifolds are examples of contractive, uniformly geodesic regular retractions (Said, 2021, Sections 1.5 and 1.6). Here is another example, for positive-definite matrices.
Example. Let M = P(N), the space of symmetric positive-definite N × N matrices, equipped with its usual affine-invariant metric (Pennec, 2006). For x ∈ P(N), the tangent space T_xP(N) is identified with the space S(N) of symmetric N × N matrices. Then, recall the Riemannian exponential map Exp_x(v) = x exp(x⁻¹v) (where exp denotes the matrix exponential), and consider the retraction Ret_x(v) = x + v + (1/2)v x⁻¹v. The point of using this retraction is that it never leaves P(N): indeed, Ret_x(v) = (1/2)(x + (x + v)x⁻¹(x + v)) ⪰ x/2, so the eigenvalues of Ret_x(v) are always greater than half the smallest eigenvalue of x. In addition, it is a contractive, uniformly geodesic regular retraction. □
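To make the example concrete, here is a minimal numerical sketch in Python with NumPy/SciPy (all function names are ours), checking that the retraction agrees with the exponential map up to second order and stays inside P(N):

```python
import numpy as np
from scipy.linalg import expm, solve

def sym(a):
    return 0.5 * (a + a.T)

def exp_map(x, v):
    # Exp_x(v) = x expm(x^{-1} v), the affine-invariant exponential on P(N)
    return sym(x @ expm(solve(x, v)))

def retract(x, v):
    # Second-order retraction Ret_x(v) = x + v + (1/2) v x^{-1} v
    return x + v + 0.5 * (v @ solve(x, v))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
x = a @ a.T + 4.0 * np.eye(4)        # a base point in P(4)
v = sym(rng.normal(size=(4, 4)))     # a tangent vector (symmetric matrix)

for t in (1.0, 0.1, 0.01):
    gap = np.linalg.norm(retract(x, t * v) - exp_map(x, t * v))
    print(t, gap)                    # gap decays like t^3: second-order agreement

assert np.all(np.linalg.eigvalsh(retract(x, v)) > 0)   # Ret_x(v) stays in P(N)
```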

B.4 Riemannian gradient descent

Let f : M → ℝ be a C² function, and Ret : TM → M a retraction. Together, these yield the Riemannian gradient descent scheme, where μ ∈ (0, 1] is called the step-size:

$$x_{t+1} = \mathrm{Ret}_{x_t}\big(-\mu\,\mathrm{grad}\, f(x_t)\big), \quad t = 0, 1, \ldots \quad (B.16)$$
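In code, scheme (B.16) is a short loop. A minimal sketch in Python (the names are ours; retract may be the retraction from the example above, or the exponential map itself when it is computable):

```python
def riemannian_gradient_descent(x0, grad_f, retract, mu=0.1, n_steps=100):
    """Iterate scheme (B.16): x <- Ret_x(-mu * grad f(x))."""
    x = x0
    for _ in range(n_steps):
        x = retract(x, -mu * grad_f(x))
    return x
```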
Assume f has a compact (nonempty) sublevel set B_c, and choose a constant L_c ≥ 0 as in Proposition B.4. Also, assume Ret is a contractive, uniformly geodesic regular retraction, as in the remark after Proposition B.6. Specifically, Ret verifies (B.14) and (B.15) for some constant δ ≥ 0. This allows for Ret = Exp, the Riemannian exponential, in which case δ = 0. Now, the following lemma is a direct consequence of (B.13) in Proposition B.6.
Lemma B.1. Under the two assumptions just described on f and Ret, if x₀ ∈ B_c, then

$$\frac{1}{2} - (L_c/2)\,\mu - \big(\delta\,\|\mathrm{grad}\, f\|_{B_c}^2\big)\,\mu^2 \geq 0 \implies f(x_{t+1}) \leq f(x_t) - (\mu/2)\,\|\mathrm{grad}\, f(x_t)\|_{x_t}^2 \quad (B.17)$$

for all t ≥ 0, so that x_t ∈ B_c for all t ≥ 0. Here, ‖grad f‖_{B_c} is the maximum of ‖grad f(x)‖_x taken over x ∈ B_c.

This lemma immediately yields the convergence of the Riemannian gradient descent scheme (B.16) to the stationary points of f in B_c. Here, μ*_c is the infimum of μ ∈ (0, 1] such that 1/2 − (L_c/2)μ − (δ‖grad f‖²_{B_c})μ² < 0.
Proposition B.7. Under the assumptions on f and Ret made in Lemma B.1, if μ ≤ μ*_c, then x₀ ∈ B_c implies that the sequence (x_t) generated by (B.16) converges to the set of stationary points of f in the sublevel set B_c.
The proof of this proposition is straightforward. From Lemma B.1, if μ ≤ μ*_c, then x₀ ∈ B_c implies x_t ∈ B_c and (μ/2)‖grad f(x_t)‖²_{x_t} ≤ f(x_t) − f(x_{t+1}) for all t ≥ 0. Adding these inequalities for t = 0, …, T,

$$\sum_{t=0}^{T} (\mu/2)\,\|\mathrm{grad}\, f(x_t)\|_{x_t}^2 \leq f(x_0) - f(x_{T+1})$$

Then, since f is bounded below on the compact set B_c, the series Σ_{t=0}^{∞} ‖grad f(x_t)‖²_{x_t} must converge. Finally, compactness of B_c ensures every subsequence of (x_t) has a further subsequence that converges to a stationary point of f in B_c.
Proposition B.7 holds without any convexity assumptions on the function f. Consider now the case of a strictly convex, and then of a strongly convex, f.

B.4.1 Strictly convex case


Now, assume the function f is strictly convex on some strongly convex subset A of M. Moreover, assume f has a compact (nonempty) sublevel set B_c ⊂ A. Then, choose a constant L_c ≥ 0 as in Proposition B.4, and let the retraction Ret be as in Lemma B.1.
Assume that f has a unique minimum at x* ∈ B_c. Let R be the radius of the smallest ball B(x*, R) such that B_c ⊂ B(x*, R), D the maximum of ‖grad f(x)‖_x for x ∈ B(x*, R), and R_c = R + D. The following lemma ensures that (B.16) "contracts the distance to x*."
Lemma B.2. Assume that the sectional curvatures of M lie in the interval [−κ²_min, κ²_max], and that R_c < inj(x*). If x₀ ∈ B_c and μ ≤ μ*_c (with μ*_c as in Proposition B.7), then

$$1 - \kappa(R_c)\,L_c\,\mu - \big(2\delta R_c\,\|\mathrm{grad}\, f\|_{B_c}\big)\,L_c\,\mu^2 \geq 0 \implies d(x_{t+1}, x^*) \leq d(x_t, x^*) \quad (B.18)$$

where κ(R_c) = max{κ_min R_c coth(κ_min R_c), |κ_max R_c cot(κ_max R_c)|} and ‖grad f‖_{B_c} is the maximum of ‖grad f(x)‖_x for x ∈ B_c.

Remark. inj(x*) denotes the injectivity radius of M at x* (Petersen, 2006). The condition that R_c < inj(x*) guarantees the x_t stay away from the cut locus Cut(x*). This condition is introduced because the distance function x ↦ d(x, x*) is not differentiable on Cut(x*), where its Hessian may even diverge to −∞.

Let μ*_d denote the infimum of μ such that 1 − κ(R_c)L_cμ − (2δR_c‖grad f‖_{B_c})L_cμ² < 0.
Proposition B.8. Under the same assumptions as in Lemma B.2, if μ ≤ min{μ*_c, μ*_d}, then

$$f(x_{t+1}) - f(x^*) \leq \frac{2\,d^2(x_0, x^*)}{\mu\,(t+1)} \quad (B.19)$$

for all t ≥ 0. In particular, the sequence (x_t) converges to x*.

Remark. The quality of the convergence in (B.19) depends above all on the step-size μ: the smaller it is, the slower the convergence. From the definitions of μ*_c and μ*_d, it is clear that there are two reasons why μ would be smaller: a larger constant δ, and a larger curvature (in absolute value) κ_min. In theory, one can always make δ = 0 by using the retraction Ret = Exp, but this requires the ability to compute the Riemannian exponential Exp with sufficient accuracy. □

Remark. The rate of convergence stated in (B.19) is a partial generalization of the rate found in Nesterov (2018) for gradient descent in a Euclidean space. In the Euclidean setting, δ = 0 and κ_min = κ_max = 0. It then follows from Proposition B.8 that (B.19) obtains whenever μ ≤ 1/L_c. Essentially, this is Corollary 2.1.2 (p. 81) in Nesterov (2018). However, note the restriction μ ∈ (0, 1], which is necessary in a curved Riemannian manifold. □

Here is an optimization problem which falls under the scope of Proposition B.8.
Example. Let M be a Hadamard manifold, with sectional curvatures bounded below by −κ²_min ≤ 0. Fix a cutoff parameter q > 0, and define

$$V_y(x) = q^2\left[1 + \big(d(x, y)/q\big)^2\right]^{1/2} - q^2 \quad \text{for } x, y \in M \quad (B.20)$$

In Said (2021), it was proved that V_y : M → ℝ is strictly convex, but not strongly convex, and that it is (1 + qκ_min)-smooth on M.
Now, let π be a probability distribution on M and consider the problem of minimizing

$$V_\pi(x) = \int_M V_y(x)\,\pi(dy) \quad (B.21)$$

Note that V_π is strictly convex (but not strongly convex), and (1 + qκ_min)-smooth on M, because the same is true of each function V_y. In fact, V_π has compact sublevel sets whenever the distribution π has finite first-order moments (Said, 2021). In this case, V_π(x) is guaranteed to achieve its minimum at some x* ∈ M. This x* is called the robust Riemannian barycenter of π (the adjective "robust" comes from the field of robust statistics (Huber and Ronchetti, 2009)).
When applying Lemma B.2 and Proposition B.8 to the present example (with f = V_π), note that inj(x*) = ∞, since M is a Hadamard manifold, and L_c = (1 + qκ_min) does not depend on c.
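As an illustration, the robust barycenter can be computed by scheme (B.16) with Ret = Exp. The sketch below (Python with NumPy/SciPy; all names are ours) works on the Hadamard manifold P(N); the gradient formula grad V_y(x) = −Exp_x^{−1}(y)/[1 + (d(x, y)/q)²]^{1/2} is our own chain-rule computation from (B.20), stated here without proof:

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm, solve

def sym(a):
    return 0.5 * (a + a.T)

def exp_map(x, v):
    # Exp_x(v) = x expm(x^{-1} v) on P(N)
    return sym(x @ expm(solve(x, v)))

def log_map(x, y):
    # Returns (Exp_x^{-1}(y), d(x, y)) for the affine-invariant metric
    r = np.real(sqrtm(x))
    r_inv = np.linalg.inv(r)
    m = sym(np.real(logm(r_inv @ y @ r_inv)))
    return r @ m @ r, np.linalg.norm(m)

def robust_barycenter(samples, q=1.0, mu=0.5, n_steps=200):
    x = samples[0].copy()
    for _ in range(n_steps):
        g = np.zeros_like(x)
        for y in samples:
            v, d = log_map(x, y)
            g -= v / np.sqrt(1.0 + (d / q) ** 2)   # assumed grad V_y(x)
        x = exp_map(x, -mu * g / len(samples))     # one step of (B.16), Ret = Exp
    return x

rng = np.random.default_rng(1)
samples = []
for _ in range(10):
    a = rng.normal(size=(3, 3))
    samples.append(a @ a.T + np.eye(3))
x_star = robust_barycenter(samples)
print(np.linalg.eigvalsh(x_star))   # a positive-definite, median-like barycenter
```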
B.4.2 Strongly convex case

Here, assume the function f is α-strongly convex on some strongly convex subset A ⊂ M. Let B_c ⊂ A be a sublevel set of f (where c > inf_x f(x)). Because f is strongly convex, B_c is compact, and it is possible to choose a constant L_c ≥ 0 as in Proposition B.4. Then, let μ*_c be given as in Proposition B.7. As usual, f has a unique minimum at x* ∈ B_c.
Proposition B.9. Under the assumptions just described, if μ ≤ μ*_c and x₀ ∈ B_c, then

$$f(x_t) - f(x^*) \leq (1 - \mu\alpha)^t\,\big(f(x_0) - f(x^*)\big) \quad (B.22)$$

for all t ≥ 0. In particular, the sequence (x_t) converges to x*.

The proof of this proposition follows by replacing inequality (B.4) into Lemma B.1. Indeed, if μ ≤ μ*_c and x₀ ∈ B_c, then (B.17) in Lemma B.1 immediately implies

$$f(x_{t+1}) - f(x^*) \leq f(x_t) - f(x^*) - (\mu/2)\,\|\mathrm{grad}\, f(x_t)\|_{x_t}^2 \quad \text{for } t \geq 0$$

Thus, replacing (B.4) into the right-hand side,

$$f(x_{t+1}) - f(x^*) \leq (1 - \mu\alpha)\,\big(f(x_t) - f(x^*)\big)$$

and (B.22) can be obtained by induction.
Remark. (B.22) shows that f(xt) converges to the minimum f(x*), at an expo-
nential rate. In practice, this can still be quite slow, if μα is very small. Indeed,
one should attempt to use μ as large as possible, in order to benefit from the
exponential rate (B.22). From the definition of μ*c , one cannot have μ any
larger than 1/Lc, and μ ¼ 1/Lc is only possible if δ ¼ 0, which corresponds
to using Ret ¼ Exp. □

Example. Let π be a probability distribution on a complete Riemannian manifold M, and define

$$E_\pi(x) = \frac{1}{2}\int_M d^2(x, y)\,\pi(dy) \quad \text{for } x \in M \quad (B.23)$$

If the support of π is contained in a ball B(z, R), where R < R_c(M) is given by (B.1), then E_π is C² on B(z, R), and has a unique global minimum x* ∈ M, such that x* ∈ B(z, R) (Afsari, 2010) (x* is the Riemannian barycenter of π). In addition, if R < R_c(M)/2, then E_π is α-strongly convex on B(z, R), with α equal to 2κ_max R cot(2κ_max R) (= 1 if κ_max = 0).
In this case, it is possible to apply Proposition B.9 to the present example (with f = E_π). If M has positive sectional curvatures, it is always possible to choose L_c = 1. On the other hand, if M has negative sectional curvatures, L_c = 1 + 4κ_min R always works.
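The Riemannian barycenter of (B.23) can be computed by the same descent, using the standard gradient grad E_π(x) = −∫_M Exp_x^{−1}(y) π(dy); for an empirical distribution, this is minus the average of the Riemannian logarithms. A short sketch, reusing log_map and exp_map from the previous example (again, the names are ours):

```python
def riemannian_barycenter(samples, mu=0.5, n_steps=100):
    # grad E_pi(x) = -(1/n) sum_i Exp_x^{-1}(y_i) for an empirical pi
    x = samples[0].copy()
    for _ in range(n_steps):
        g = -sum(log_map(x, y)[0] for y in samples) / len(samples)
        x = exp_map(x, -mu * g)
    return x
```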
Appendix C Proofs for Section B

Proof of Proposition B.3: Write inequality (B.3) in the equivalent form

$$f(y) - f(x) \geq \langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(y)\rangle_x + (\alpha/2)\,\|\mathrm{Exp}_x^{-1}(y)\|_x^2$$

With y = x*, this becomes

$$f(x) - f(x^*) \leq -\langle \mathrm{grad}\, f(x), \mathrm{Exp}_x^{-1}(x^*)\rangle_x - (\alpha/2)\,\|\mathrm{Exp}_x^{-1}(x^*)\|_x^2$$

or, by completing the square on the right-hand side,

$$f(x) - f(x^*) \leq -\frac{1}{2\alpha}\,\|\mathrm{grad}\, f(x) + \alpha\,\mathrm{Exp}_x^{-1}(x^*)\|_x^2 + \frac{1}{2\alpha}\,\|\mathrm{grad}\, f(x)\|_x^2$$

Then, (B.4) follows immediately, by noting that the first term on the right-hand side is negative.
Proof of Proposition B.4: The proof employs the notation introduced before the proposition. Let x ∈ B_c and v = −μ grad f(x). Then, note that ‖v‖_x ≤ ‖grad f(x)‖_x ≤ G. This implies γ(t) = Exp_x(tv) belongs to B̄′ for all t ∈ (0, 1). From the definition of L_c, it now follows by (B.6) that

$$f(y) \leq f(x) + \langle \mathrm{grad}\, f(x), v\rangle_x + (L_c/2)\,\|v\|_x^2$$

and, by recalling v = −μ grad f(x),

$$f(y) \leq f(x) - \mu\big(1 - (L_c/2)\,\mu\big)\,\|\mathrm{grad}\, f(x)\|_x^2$$

Then, (B.7) follows because μ ≤ 1/L_c implies the expression in parentheses is ≥ 1/2.
Proof of Proposition B.5: This is given in Said (2021), Section 1.5.
Proof of Proposition B.6: Assume Ret is a regular geodesic retraction, and let Φ be the corresponding map in (B.10). Since B_c is compact, there exist β_c, δ_c ≥ 0 such that

$$\sup\big\{\|\Phi_x'(u)\|_{\mathrm{op}}\,;\, x \in B_c \text{ and } u \in T_xM,\ \|u\|_x \leq \|\mathrm{grad}\, f(x)\|_x\big\} \leq \beta_c^{1/2} \quad (C.1)$$

$$\sup\big\{\|\Phi_x'''(u)\|_{\mathrm{op}}\,;\, x \in B_c \text{ and } u \in T_xM,\ \|u\|_x \leq \|\mathrm{grad}\, f(x)\|_x\big\} \leq \delta_c \quad (C.2)$$

where ‖·‖_op denotes the operator norm of the linear map Φ′_x(u) : T_xM → T_xM, or of the trilinear map Φ‴_x(u) : T_xM × T_xM × T_xM → T_xM. In terms of these constants β_c and δ_c,

$$\|\Phi_x(-\mu\,\mathrm{grad}\, f(x))\|_x^2 \leq (\beta_c\,\mu^2)\,\|\mathrm{grad}\, f(x)\|_x^2 \quad \text{for } x \in B_c \quad (C.3)$$

$$\|\Phi_x(-\mu\,\mathrm{grad}\, f(x)) + \mu\,\mathrm{grad}\, f(x)\|_x \leq (\delta_c\,\mu^3)\,\|\mathrm{grad}\, f(x)\|_x^3 \quad \text{for } x \in B_c \quad (C.4)$$
Furthermore, let B_c be contained in a closed geodesic ball B̄ = B̄(z, R). Denote G the maximum of ‖grad f(x)‖_x taken over x ∈ B̄, and B̄′ = B̄(z, R + β_c^{1/2}G). By compactness of B̄′, there exists H_c ≥ 0 such that f is H_c-smooth on B̄′.
Now, in order to prove (B.12), note that γ(t) = Exp_x(tΦ_x(−μ grad f(x))) belongs to B̄′ for all t ∈ (0, 1). It follows from (B.11) that (similarly to (B.6))

$$f(y) \leq f(x) + \langle \mathrm{grad}\, f(x), \Phi_x(-\mu\,\mathrm{grad}\, f(x))\rangle_x + (H_c/2)\,\|\Phi_x(-\mu\,\mathrm{grad}\, f(x))\|_x^2$$

Then, using (C.3) and (C.4),

$$f(y) \leq f(x) - \mu\,\|\mathrm{grad}\, f(x)\|_x^2 + (\beta_c H_c/2)\,\mu^2\,\|\mathrm{grad}\, f(x)\|_x^2 + \delta_c\,\mu^3\,\|\mathrm{grad}\, f(x)\|_x^4$$

which is the same as (B.12). Finally, (B.13) is an immediate consequence of (B.12).
Proof of Lemma B.1: Eq. (B.17) is obtained immediately, upon replacing β_c = 1, δ_c = δ, and H_c = L_c into (B.13).
Proof of Proposition B.7: The proof has already been summarized, right after the proposition.
Proof of Lemma B.2: Let L(x) = d²(x, x*)/2. If R_c < inj(x*), then L(x) is κ(R_c)-smooth on B(x*, R_c) (Petersen, 2006). Note that x_t ∈ B_c for all t ≥ 0, because μ ≤ μ*_c as in Proposition B.7. Then, from the Taylor expansion (B.11) of L, and since Ret is contractive ((C.3) holds with β_c = 1),

$$L(x_{t+1}) \leq L(x_t) + \langle \mathrm{grad}\, L(x_t), \Phi_{x_t}(-\mu\,\mathrm{grad}\, f(x_t))\rangle_{x_t} + (\kappa(R_c)/2)\,\mu^2\,\|\mathrm{grad}\, f(x_t)\|_{x_t}^2$$

However, applying (B.8) to the third term on the right-hand side, this implies

$$L(x_{t+1}) \leq L(x_t) + \langle \mathrm{grad}\, L(x_t), \Phi_{x_t}(-\mu\,\mathrm{grad}\, f(x_t))\rangle_{x_t} + \kappa(R_c)\,L_c\,\mu^2\,\big(f(x_t) - f(x^*)\big) \quad (C.5)$$
Now, consider the second term on the right-hand side. Since grad L(x_t) = −Exp_{x_t}^{−1}(x*), this second term is equal to

$$\mu\,\langle \mathrm{Exp}_{x_t}^{-1}(x^*), \mathrm{grad}\, f(x_t)\rangle_{x_t} - \langle \mathrm{Exp}_{x_t}^{-1}(x^*), \Phi_{x_t}(-\mu\,\mathrm{grad}\, f(x_t)) + \mu\,\mathrm{grad}\, f(x_t)\rangle_{x_t} \quad (C.6)$$

Applying (B.2) and (C.4) (with δ_c = δ, since Ret is uniformly geodesic),

$$(C.6) \leq -\mu\,\big(f(x_t) - f(x^*)\big) + (\delta\mu^3)\,\|\mathrm{Exp}_{x_t}^{-1}(x^*)\|_{x_t}\,\|\mathrm{grad}\, f(x_t)\|_{x_t}^3$$

Using (B.8) once again, along with ‖Exp_{x_t}^{−1}(x*)‖_{x_t} ≤ R_c and ‖grad f(x_t)‖_{x_t} ≤ ‖grad f‖_{B_c},

$$(C.6) \leq -\mu\,\big(f(x_t) - f(x^*)\big) + \big(2\delta R_c\,\|\mathrm{grad}\, f\|_{B_c}\big)\,L_c\,\mu^3\,\big(f(x_t) - f(x^*)\big) \quad (C.7)$$
Finally, from (C.5) and (C.7),

$$L(x_{t+1}) \leq L(x_t) - \mu\left[1 - \kappa(R_c)\,L_c\,\mu - \big(2\delta R_c\,\|\mathrm{grad}\, f\|_{B_c}\big)\,L_c\,\mu^2\right]\big(f(x_t) - f(x^*)\big)$$

Since f(x_t) ≥ f(x*), whenever the expression in square brackets is positive, one has L(x_{t+1}) ≤ L(x_t). However, this directly yields (B.18).
Proof of Proposition B.8: Note from Lemma B.1 that μ ≤ μ*_c implies

$$f(x_{t+1}) - f(x^*) \leq f(x_t) - f(x^*) - (\mu/2)\,\|\mathrm{grad}\, f(x_t)\|_{x_t}^2$$

On the other hand, note that

$$\|\mathrm{grad}\, f(x_t)\|_{x_t} \geq \frac{f(x_t) - f(x^*)}{d(x_t, x^*)} \geq \frac{f(x_t) - f(x^*)}{d(x_0, x^*)}$$

where the first inequality follows by applying the Cauchy–Schwarz inequality to (B.2), and the second one from Lemma B.2, since μ ≤ μ*_d. Letting ε(t) = f(x_t) − f(x*), it is now clear that

$$\varepsilon(t+1) \leq \varepsilon(t) - (\mu/2)\,\big(\varepsilon(t)/d(x_0, x^*)\big)^2$$

so that (B.19) can be proved by a straightforward induction.
Proof of Proposition B.9: The proof was summarized after the proposition.

References
Absil, P.A., Mahony, R., Sepulchre, R., 2008. Optimization Algorithms on Matrix Manifolds.
Princeton University Press.
Afsari, B., 2010. Riemannian Lp center of mass: existence, uniqueness and convexity. Proc. Am.
Math. Soc. 139 (2), 655–673.
Alekseevskij, D.V., Vinberg, E.B., Solodovnikov, A.S., 1993. Geometry of Spaces of Constant
Curvature (EMS vol. 29). Springer-Verlag.
Bhattacharya, R., Patrangenaru, V., 2003. Large sample theory of intrinsic and extrinsic sample means on manifolds I. Ann. Stat. 31 (1), 1–29.
Bogachev, V.I., 2007. Measure Theory. vol. I. Springer-Verlag.
Cabanes, Y., 2021. Multidimensional Complex Stationary Centered Gaussian Autoregressive Time Series Classification: Application for Audio and Radar Clutter Machine Learning in Hyperbolic and Siegel Spaces (Ph.D. thesis). University of Bordeaux.
Chavel, I., 2006. Riemannian Geometry, A Modern Introduction. Cambridge University Press.
Cheng, G., Vemuri, B.C., 2013. A novel dynamic system in the space of SPD matrices with appli-
cations to appearance tracking. SIAM J. Imaging Sci. 6 (1), 592–615.
Congedo, M., Barachant, A., Bhatia, R., 2017. Riemannian geometry for EEG-based brain-
computer interfaces; a primer and a review. Brain-Comput. Interfaces 4 (3), 155–174.
Deift, P., 1998. Orthogonal Polynomials and Random Matrices: A Riemann-Hilbert Approach. American Mathematical Society.
Fréchet, M., 1948. Les éléments aléatoires de nature quelconque dans un espace distancié. Ann. l'I.H.P. 10 (4), 215–310.
Helgason, S., 1962. Differential Geometry and Symmetric Spaces. Academic Press, New York
and London.
Heuveline, S., Said, S., Mostajeran, C., 2021. Gaussian distributions on Riemannian symmetric
spaces, random matrices, and planar Feynman diagrams. arXiv:2106.08953.

Huber, P.J., Ronchetti, E.M., 2009. Robust Statistics, second ed. Wiley-Blackwell.
Jarner, S.F., Hansen, E., 1998. Geometric ergodicity of Metropolis algorithms. Stoch. Process.
Appl. 58, 341–361.
Karimi, H., Nutini, J., Schmidt, M., 2016. Linear convergence of gradient and proximal-gradient
methods under the Polyak-Lojasiewicz condition. In: Machine Learning and Knowledge Dis-
covery in Databases.
Kendall, W.S., 1990. Probability, convexity, and harmonic maps with small image I: uniqueness
and fine existence. Proc. Lond. Math. Soc. 61 (2), 371–406.
Knapp, A.W., 2002. Lie Groups Beyond an Introduction, second ed. Birkhäuser.
Kuijlaars, A.B.J., Van Assche, W., 1999. The asymptotic zero distribution of orthogonal polyno-
mials with varying recurrence coefficients. J. Approx. Theory 99, 167–197.
Lee, J.M., 2012. Introduction to Smooth Manifolds, second ed. Springer Science.
Mariño, M., 2005. Chern-Simons Theory, Matrix Models, and Topological Strings. Oxford
University Press.
Meckes, E.S., 2019. The Random Matrix Theory of the Classical Compact Groups. Cambridge
University Press.
Mehta, M.L., 2004. Random Matrices, third ed. Elsevier Ltd.
Meyn, S., Tweedie, R.L., 2008. Markov Chains and Stochastic Stability. Cambridge University Press.
Nesterov, Y., 2018. Lectures on Convex Optimization. Springer Switzerland.
Pennec, X., 2006. Intrinsic statistics on Riemannian manifolds: basic tools for geometric measure-
ments. J. Math. Imaging Vis. 25 (1), 127–154.
Petersen, P., 2006. Riemannian Geometry, second ed. Springer Science.
Roberts, G.O., Rosenthal, J.S., 2004. General state-space Markov chains and MCMC algorithms. Probab. Surv. 1, 20–71.
Said, S., 2021. Statistical models and probabilistic methods on Riemannian manifolds.
arXiv:2101.10855.
Said, S., Manton, J.H., 2021. Riemannian barycentres of Gibbs distributions: new results on con-
centration and convexity. Inf. Geom. 4 (2).
Said, S., Bombrun, L., Berthoumieu, Y., Manton, J.H., 2017. Riemannian Gaussian distributions on
the space of symmetric positive definite matrices. IEEE Trans. Inf. Theory 63 (4), 2153–2170.
Said, S., Hajri, H., Bombrun, L., Vemuri, B.C., 2018. Gaussian distributions on Riemannian sym-
metric spaces: statistical learning with structured covariance matrices. IEEE Trans. Inf. The-
ory 64 (2), 752–772.
Santilli, L., Tierz, M., 2021. Riemannian Gaussian distributions, random matrix ensembles and
diffusion kernels. Nucl. Phys. B 973, 115582.
Siegel, C.L., 1943. Symplectic geometry. Am. J. Math. 65 (1), 1–86.
Sturm, K.T., 2003. Probability measures on metric spaces of nonpositive curvature. Contemp.
Math. 338, 1–34.
Szegő, G., 1939. Orthogonal Polynomials, first ed. American Mathematical Society.
Terras, A., 1988. Harmonic Analysis on Symmetric Spaces and Applications. vol. II. Springer-Verlag.
Udriste, C., 1994. Convex Functions and Optimization Methods on Riemannian Manifolds.
Springer Science.
Whittaker, E.T., Watson, G.N., 1950. A Course of Modern Analysis, fourth ed. Cambridge Uni-
versity Press.
Zanini, P., Congedo, M., Jutten, C., Said, S., Berthoumieu, Y., 2016. Parameters estimate of Rie-
mannian Gaussian distribution in the manifold of covariance matrices. In: Sensor Array and
Multichannel Signal Processing.
Chapter 11

Multilevel contours on bundles


of complex planes☆
Arni S.R. Srinivasa Rao*
Laboratory for Theory and Mathematical Modeling, Medical College of Georgia, Augusta
University, Augusta, GA, United States
Department of Mathematics, Augusta University, Augusta, GA, United States

*Corresponding author: e-mail: arni.rao2020@gmail.com; arrao@augusta.edu

Abstract
A new concept called multilevel contours is introduced in this article by the author. Theorems on contours constructed on a bundle of complex planes are stated and proved. Multilevel contours can transport information from one complex plane to another. Within a random environment, the behavior of contours and multilevel contours passing through the bundles of complex planes is studied. Further properties of contours under a removal process of the data are studied. The concepts of "islands" and "holes" within a bundle are introduced in this article. All these constructions help in understanding the dynamics of the set of points of the bundle. Further research on the topics introduced here will be followed up by the author; this includes closed approximations of the multilevel contour formations and their removal processes. The ideas and results presented in this article are novel.
Keywords: Multilevel complex planes, Spinning, Randomness, Holomorphism, PDEs
MSC: 32L05, 60K3, 32H02

1 Introduction

Let us consider a bundle of eight complex planes ℂ₁, ℂ₂, ℂ₃, ℂ₄, ℂ₅, ℂ₆, ℂ₇, and ℂ₈, as shown in Fig. 1. These planes are considered such that one plane is parallel to any other plane in the bundle, or they could intersect with each other at some angle. Let γ₁ be an arc constructed from the points generated by z(t₁) ∈ ℂ₁ for a₁₁ ≤ t₁ ≤ b₁₁ such that z(a₁₁) = z₁ and z(b₁₁) = z₂.


☆Dedication: This article is dedicated to my friend and collaborator Professor Steven G. Krantz, Washington University in St. Louis, United States, on the occasion of his 70th birthday.

Handbook of Statistics, Vol. 46. https://doi.org/10.1016/bs.host.2022.03.003


Copyright © 2022 Elsevier B.V. All rights reserved. 401
402 SECTION III Advanced geometrical intuition

FIG. 1 A bundle of complex planes and multilevel contours M1, M2, M3.

Here a₁₁, b₁₁ ∈ ℝ. The point z₂ is located at the intersection of the planes ℂ₁ and ℂ₂. We allow constructing an arc in the plane ℂ₂ from z₂ to z₃ for z₃ ∈ ℂ₂, ℂ₃, and ℂ₆. Let γ₂ be an arc constructed from z(t₂) ∈ ℂ₂ for a₁₂ ≤ t₂ ≤ b₁₂ (a₁₂, b₁₂ ∈ ℝ) such that z(a₁₂) = z₂ and z(b₁₂) = z₃. The arc γᵢ is constructed by joining z(a₁ᵢ) = zᵢ and z(b₁ᵢ) = z_{i+1}, generated by the set of points z(tᵢ) for a₁ᵢ ≤ tᵢ ≤ b₁ᵢ, for i = 3, 4, …, 7 and a₁ᵢ, b₁ᵢ ∈ ℝ. We allow the possibility of constructing an arc from an ending point of an arc in a plane to a point located in a different plane if that ending point is located at the intersection of two or more complex planes. We saw above a few points lying at the intersection of two or more planes. The other points lie, for example, as follows: z₄ ∈ ℂ₃, ℂ₄, and ℂ₈; z₅ ∈ ℂ₄ and ℂ₅; z₆ ∈ ℂ₄, ℂ₆, and ℂ₇; z₇ ∈ ℂ₄ and ℂ₇; and z₈ ∈ ℂ₇ and ℂ₈.
Let us form a contour by piecewise joining of the arcs γᵢ for i = 1, 2, …, 7 and call this M₁. Let us rename the arcs corresponding to the contour M₁ as γᵢ^{M₁} for i = 1, 2, …, 7. Two more sample contours M₂ and M₃ are constructed using the points {z₁, z₂, z₃, z₆, z₅} and {z₁, z₂, z₃, z₆, z₈}, respectively. See Fig. 1. Let the piecewise arcs corresponding to the contour M₂ be γᵢ^{M₂} for i = 1, 2, 3, 4, and the piecewise arcs corresponding to the contour M₃ be γᵢ^{M₃} for i = 1, 2, 3, 4. For the sake of visualization, we have separated a single point at the intersecting planes into two or more points in different colors, shown as a smaller oval-shaped object in Fig. 1. Suppose the contour M₂ is constructed using a set of values z(sᵢ) for a₂ᵢ ≤ sᵢ ≤ b₂ᵢ with corresponding arcs γᵢ^{M₂} for i = 1, 2, 3, 4, and the contour M₃ is constructed using z(uᵢ) for a₃ᵢ ≤ uᵢ ≤ b₃ᵢ with corresponding arcs γᵢ^{M₃} for i = 1, 2, 3, 4. Here a₂ᵢ, a₃ᵢ, b₂ᵢ, b₃ᵢ ∈ ℝ. Let

$$t_i = \xi_i(\tau) \quad (\alpha_{1i} \leq \tau \leq \beta_{1i})$$

for i = 1, 2, …, 7 be the parametric representation for a real-valued function ξᵢ mapping [α₁ᵢ, β₁ᵢ] onto [a₁ᵢ, b₁ᵢ]. Then the length of the contour M₁, say L(M₁), is computed through the integral
$$L(M_1) = \sum_{i=1}^{7}\int_{\alpha_{1i}}^{\beta_{1i}} |z'[\xi_i(\tau)]|\,\xi_i'(\tau)\, d\tau. \quad (1)$$

Let sᵢ = ϕᵢ(τ) (α₂ᵢ ≤ τ ≤ β₂ᵢ) for i = 1, 2, 3, 4 be the parametric representation for a real-valued function ϕᵢ mapping [α₂ᵢ, β₂ᵢ] onto [a₂ᵢ, b₂ᵢ], and uᵢ = ψᵢ(τ) (α₃ᵢ ≤ τ ≤ β₃ᵢ) for i = 1, 2, 3, 4 be the parametric representation for a real-valued function ψᵢ mapping [α₃ᵢ, β₃ᵢ] onto [a₃ᵢ, b₃ᵢ].
Then the lengths of the contours M₂ and M₃ can be computed as

$$L(M_2) = \sum_{i=1}^{4}\int_{\alpha_{2i}}^{\beta_{2i}} |z'[\phi_i(\tau)]|\,\phi_i'(\tau)\, d\tau, \quad (2)$$

$$L(M_3) = \sum_{i=1}^{4}\int_{\alpha_{3i}}^{\beta_{3i}} |z'[\psi_i(\tau)]|\,\psi_i'(\tau)\, d\tau. \quad (3)$$
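Numerically, each term in (1)–(3) is a one-dimensional quadrature of |z′[ξ(τ)]| ξ′(τ). A minimal sketch (Python with NumPy/SciPy; the single arc below, a half-circle with a quadratic reparametrization, is a hypothetical example of ours):

```python
import numpy as np
from scipy.integrate import quad

# One arc: z(t) = e^{it} traces the unit half-circle for t in [0, pi],
# reparametrized by t = xi(tau) = tau^2 on [0, sqrt(pi)].
z_prime  = lambda t: 1j * np.exp(1j * t)    # z'(t)
xi       = lambda tau: tau ** 2             # t = xi(tau)
xi_prime = lambda tau: 2.0 * tau            # xi'(tau)

alpha, beta = 0.0, np.sqrt(np.pi)           # xi maps [alpha, beta] onto [0, pi]
integrand = lambda tau: abs(z_prime(xi(tau))) * xi_prime(tau)

length, _ = quad(integrand, alpha, beta)
print(length)   # ~ 3.14159..., the length of the half-circle

# A piecewise contour such as M1 is handled by summing one such
# quadrature per arc, exactly as in Eq. (1).
```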

The contours M₁, M₂, M₃ are located on a bundle of complex planes; we term such contours here multilevel contours. One can draw several such multilevel contours on a bundle of complex planes, as shown in Fig. 1. We have considered eight complex planes and three multilevel contours as an example, but one can extend these examples to demonstrate the intersection of many more complex planes and contours passing through them. Although multilevel contours are newly introduced here in this article, the principles associated with contours on a single complex plane can be found in standard textbooks; see, for example, Ahlfors (1978), Churchill and Brown (1984), Krantz (2004), and Rudin (1987).

2 Infinitely many bundles of complex planes


Let us consider infinitely many (uncountably many) complex planes parallel to each other, as shown in Fig. 2, and call this bundle B(ℂ). We try to form contours passing through these bundles and understand the behavior of the contours at the intersections with other planes. A contour passing through the points (complex numbers) lying in the intersections of planes is given the feature of switching planes. The points at the intersections are assumed to possess special features, and the behavior of those points under a random environment will be seen in this article. Before we study other properties, let us prove a theorem.
Theorem 1. The shortest possible multilevel contour passing through B(ℂ) is the real line.
Proof. Consider a complex plane that is located perpendicular to the bundle B(ℂ) such that it is at 90 degrees with the x-axis. Call this ℂ₀. Using ℂ₀ we slice B(ℂ) vertically at an arbitrary location, as shown in Fig. 2. Each slice will have uncountably many lines distinct from each other, and these lines are parallel to each other.

FIG. 2 Bundle of infinitely many (uncountable) complex planes and shortest contour passing
through them.

Let l and q be two arbitrary lines among these uncountably many lines formed by the above slicing method. Suppose we choose a point z₁ on l. There exists a point z₂ on q such that the x-coordinate and y-coordinate of both z₁ and z₂ are the same. Note that z₁, z₂ ∈ ℂ₀. Depending upon how we visualize the xy-axes of ℂ₀, the following possibilities for the values of z₁ and z₂ will arise:
(i) z₁ = (0, 0) and z₂ = (0, 0), or z₁ ≠ (0, 0) and z₂ ≠ (0, 0).
(ii) If z₁, z₂ ≠ (0, 0), then both z₁ and z₂ will have either a nonzero x-coordinate (and y-coordinate zero) or a nonzero y-coordinate (and x-coordinate zero). We first assume that both z₁ and z₂ are on the y-axis. Let C(z₁, z₂) be a contour described by the equation z(t₁) for a₁ ≤ t₁ ≤ a′₁, where z(a₁) = z₁ for z₁ ∈ ℂ₀, l and z(a′₁) = z₂ for z₂ ∈ ℂ₀, q. Here C(z₁, z₂) is a multilevel contour because z₁ and z₂ are points on the parallel lines l and q in different planes, but both these points also belong to ℂ₀. Suppose

$$t_1 = \varepsilon_1(\tau) \quad (\delta_1 \leq \tau \leq \delta_1')$$

is the parametric representation for C(z₁, z₂), where ε₁ is a real-valued function mapping [δ₁, δ′₁] onto the interval [a₁, a′₁]. The length of the contour C(z₁, z₂) is obtained by

$$L[C(z_1, z_2)] = \int_{\delta_1}^{\delta_1'} |z'[\varepsilon_1(\tau)]|\,\varepsilon_1'(\tau)\, d\tau \quad (4)$$

Suppose we consider a point z₃ on the same plane in which the point z₂ lies, but not on the line q, such that z₃ ∉ ℂ₀. That means z₃ ≠ z₂. Let C(z₂, z₃) be a contour described by the equation z(t₂) for a₂ ≤ t₂ ≤ a′₂, where z(a₂) = z₂ for z₂ ∈ ℂ₀, q and z(a′₂) = z₃ for z₃ ∉ ℂ₀ and z₃ ∉ q. Here C(z₂, z₃) is not a multilevel contour. Suppose

$$t_2 = \varepsilon_2(\tau) \quad (\delta_2 \leq \tau \leq \delta_2')$$

is the parametric representation for C(z₂, z₃), where ε₂ is a real-valued function mapping [δ₂, δ′₂] onto the interval [a₂, a′₂]. The length of the contour C(z₂, z₃) can be obtained by

$$L[C(z_2, z_3)] = \int_{\delta_2}^{\delta_2'} |z'[\varepsilon_2(\tau)]|\,\varepsilon_2'(\tau)\, d\tau. \quad (5)$$

Since z3 is not in 0, we cannot draw a contour directly from z1 to z3. To draw
a contour to z3 from z1, we can have piecewise arcs passing through z2 or
through any other points of the line q to z3. If the contour from z1 to z3 passes
through z2, then
Z δ1
0

jz0 ½ε1 ðτÞjε01 ðτÞdτ <


δ1
Z δ1
0 Z δ2
0 (6)
0
jz ½ε1 ðτÞjε01 ðτÞdτ + jz 0
½ε2 ðτÞjε02 ðτÞdτ,
δ1 δ2

If the contour from z₁ to z₃ passes through an arbitrary point, say z₄ (z₄ ≠ z₂) for z₄ ∈ q, then

$$\int_{\delta_1}^{\delta_1'} |z'[\varepsilon_1(\tau)]|\,\varepsilon_1'(\tau)\, d\tau < \int_{\delta_3}^{\delta_3'} |z'[\varepsilon_3(\tau)]|\,\varepsilon_3'(\tau)\, d\tau, \quad (7)$$

where the R.H.S. of the inequality (7) is the length of the contour C(z₁, z₄) described by the equation z(t₃) for a₃ ≤ t₃ ≤ a′₃, where z(a₃) = z₁ for z₁ ∈ ℂ₀, l and z(a′₃) = z₄ for z₄ ∈ ℂ₀. Here t₃ = ε₃(τ) (δ₃ ≤ τ ≤ δ′₃) is the parametric representation for C(z₁, z₄) and ε₃ is a real-valued function mapping [δ₃, δ′₃] onto the interval [a₃, a′₃]. From (6) and (7), we have
$$\int_{\delta_1}^{\delta_1'} |z'[\varepsilon_1(\tau)]|\,\varepsilon_1'(\tau)\, d\tau + \int_{\delta_2}^{\delta_2'} |z'[\varepsilon_2(\tau)]|\,\varepsilon_2'(\tau)\, d\tau < \int_{\delta_3}^{\delta_3'} |z'[\varepsilon_3(\tau)]|\,\varepsilon_3'(\tau)\, d\tau + \int_{\delta_4}^{\delta_4'} |z'[\varepsilon_4(\tau)]|\,\varepsilon_4'(\tau)\, d\tau, \quad (8)$$

where the second term of the R.H.S. of the inequality (8) is the length of the contour C(z₄, z₃) described by the equation z(t₄) for a₄ ≤ t₄ ≤ a′₄, where z(a₄) = z₄ for z₄ ∈ ℂ₀, q and z(a′₄) = z₃ for z₃ ∉ ℂ₀. Here t₄ = ε₄(τ) (δ₄ ≤ τ ≤ δ′₄) is the parametric representation for the contour C(z₄, z₃) and ε₄ is a real-valued function mapping [δ₄, δ′₄] onto the interval [a₄, a′₄]. From (7) and (8) we conclude that C(z₁, z₂) is the shortest contour, and it is a line. We can choose a point z on a line p for p ⊂ ℂ₀, find a corresponding point z′ on a line, say, p′ for p′ ⊂ ℂ₀, and construct an argument as above to see that L[C(z, z′)] is the shortest. The x and y coordinates of z and z′ are the same. We can construct infinitely many contours between the various points z, z′ lying on the slice such that each L[C(z, z′)] is the shortest. If p and p′ are the lines from adjacent planes, then

$$\sum_{z, z'} \int_{\delta_p}^{\delta_p'} |z'[\varepsilon_p(\tau)]|\,\varepsilon_p'(\tau)\, d\tau = \infty \quad (9)$$

and

$$\bigcup_{z, z'} C(z, z') \subseteq \mathbb{C}_0.$$

In (9), the integral on the L.H.S. is the length of the contour C(z, z′) described by the equation z(t_p) for a_p ≤ t_p ≤ a′_p, where z(a_p) = z for z ∈ ℂ₀, p and z(a′_p) = z′ for z′ ∈ p′, ℂ₀. Here t_p = ε_p(τ) (δ_p ≤ τ ≤ δ′_p) is the parametric representation for the contour C(z, z′) and ε_p is a real-valued function mapping [δ_p, δ′_p] onto the interval [a_p, a′_p]. □

Remark 1. Let (z_n) be a sequence on the slice ℂ₀; then

$$\lim_{n\to\infty} z_n = z$$

since z is equal to z_n for all n.

Remark 2. Remark 1 is true for each slice on the bundle that is parallel to ℂ₀. But the limits of convergence on each slice are different.

Remark 3. The distances between each pair of adjacent points on every contour created by slicing parallel to ℂ₀ are equal.

Suppose we remove the space created by ℂ₀ from the bundle B(ℂ); then let the bundle formed on the left of ℂ₀ be denoted by B_L(ℂ₀) and the bundle formed on the right of ℂ₀ be denoted by B_R(ℂ₀). Then

$$B_L(\mathbb{C}_0) \cup \mathbb{C}_0 \cup B_R(\mathbb{C}_0) = B(\mathbb{C}) \quad (10)$$

The set B(ℂ) − ℂ₀, defined as

$$B(\mathbb{C}) - \mathbb{C}_0 = \{z : z \in B(\mathbb{C}) \text{ and } z \notin \mathbb{C}_0\},$$
forms a disconnected set because

$$B(\mathbb{C}) - \mathbb{C}_0 = B_L(\mathbb{C}_0) \cup B_R(\mathbb{C}_0) \quad (11)$$

and B_L(ℂ₀) and B_R(ℂ₀) are disjoint and nonempty sets within B(ℂ). No multilevel contours can be drawn passing through the various planes of the bundle B_L(ℂ₀) unless B_L(ℂ₀) is sliced similarly as we did for the bundle B(ℂ). Under similar circumstances, no multilevel contours can be drawn passing through the planes of B_R(ℂ₀). Suppose we slice the bundle B_L(ℂ₀) with a complex plane kept parallel to ℂ₀, and call this new plane, used for the slicing, ℂ_r (r > 0). Now ℂ_r intersects with each and every plane of B_L(ℂ₀). Let us remove the space created by ℂ_r from B_L(ℂ₀) to form new disjoint bundles B_L(ℂ_r) and B_R(ℂ_r) such that

$$B_L(\mathbb{C}_r) \cup \mathbb{C}_r \cup B_R(\mathbb{C}_r) = B_L(\mathbb{C}_0), \quad (12)$$

where B_L(ℂ_r) is the bundle formed on the left of ℂ_r and B_R(ℂ_r) is the bundle formed on the right of ℂ_r due to the removal of the space ℂ_r from B_L(ℂ₀). Using an argument similar to (11), we write below B_L(ℂ₀) − ℂ_r as a union of two disconnected sets

$$B_L(\mathbb{C}_0) - \mathbb{C}_r = B_L(\mathbb{C}_r) \cup B_R(\mathbb{C}_r) \quad (r > 0) \quad (13)$$
Although we could not draw a multilevel contour passing through all the planes of the bundle B_L(ℂ_r), one can draw such a contour through the plane ℂ_r where it intersects the bundle B_L(ℂ₀). One can write another disjoint set

$$B_L(\mathbb{C}_r) - \mathbb{C}_{r'} = B_L(\mathbb{C}_{r'}) \cup B_R(\mathbb{C}_{r'}) \quad (r, r' > 0), \quad (14)$$

where ℂ_{r′} is the complex plane used to slice B_L(ℂ_r) and ℂ_r is the complex plane used to slice B_L(ℂ₀).
Let us now consider the right side of the plane ℂ₀ within the bundle B(ℂ), i.e., B_R(ℂ₀). Suppose we slice the bundle B_R(ℂ₀) using a plane parallel to ℂ₀ and call this ℂ_s (s > 0). Let us remove the space created by ℂ_s from B_R(ℂ₀) such that

$$B_R(\mathbb{C}_0) - \mathbb{C}_s = B_L(\mathbb{C}_s) \cup B_R(\mathbb{C}_s) \quad (s > 0). \quad (15)$$
From the above constructions (10) through (15), the set of points of the bundle B(ℂ) with intersecting planes ℂ_r, ℂ₀, ℂ_s is written as

$$B(\mathbb{C}) = B_L(\mathbb{C}_0) \cup \mathbb{C}_0 \cup B_R(\mathbb{C}_0) = \big(B_L(\mathbb{C}_r) \cup \mathbb{C}_r \cup B_R(\mathbb{C}_r)\big) \cup \mathbb{C}_0 \cup \big(B_L(\mathbb{C}_s) \cup \mathbb{C}_s \cup B_R(\mathbb{C}_s)\big) \quad (16)$$

$$\implies B(\mathbb{C}) - (\mathbb{C}_0 \cup \mathbb{C}_r \cup \mathbb{C}_s) = \big(B_L(\mathbb{C}_r) \cup B_R(\mathbb{C}_r)\big) \cup \big(B_L(\mathbb{C}_s) \cup B_R(\mathbb{C}_s)\big) \quad (17)$$
The four disconnected sets in (17) can be used to form infinitely many disconnected sets. Due to the removal of the spaces, as shown in Fig. 3, it is impossible to draw multilevel contours within and between these four disconnected sets in (17). One of the advantages of multilevel contours is to develop trees of arcs, that is, paths that can be used for the transportation of information between two or more interacting (intersecting) complex planes. The bundle B(ℂ) has parallel planes; however, the contours drawn within each of these planes need not be similar. One can construct a functional mapping such that a contour or an arc drawn in one plane could be mapped to a contour or an arc in another plane. But these two sets of contours could be used for transporting information continuously only if the contours in two or more planes are path-connected (Fig. 4).

FIG. 3 Creation of disconnected sets out of the bundle B(ℂ) due to the slicing and removal of the spaces created by the complex planes ℂ₀, ℂ_r, ℂ_s.

FIG. 4 Transportation of information through multilevel contours S₁, …, S₇ passing through four intersecting complex planes ℂ₁, ℂ₂, ℂ₃, and ℂ₄.

Suppose S₁, S₂, S₃, S₄ are four contours drawn for specific purposes in four different complex planes ℂ₁, ℂ₂, ℂ₃, ℂ₄, respectively. Let Sᵢ be described by z(t_{sᵢ}) for a_{sᵢ} ≤ t_{sᵢ} ≤ a′_{sᵢ}, where a_{sᵢ}, a′_{sᵢ} ∈ ℝ for i = 1, 2, 3, 4. See Fig. 4. Had these four planes no intersecting set of points, then one could not construct multilevel contours passing through these planes. Let z(a_{sᵢ}) = z₁^{Sᵢ} be the starting point and z(a′_{sᵢ}) = z₂^{Sᵢ} be the ending point of the contour Sᵢ for i = 1, 2, 3, 4. Suppose each independent contour Sᵢ has specific information stored in it. The information stored in S₁ is transferred to the contour S₂ using a contour T₁. Suppose ℂ₁ ∩ ℂ₂ = {z : z ∈ ℂ₁ and z ∈ ℂ₂}. Then, for the structure of the planes and contours S₁ through S₄ in Fig. 4, we have ℂ₁ ∩ ℂ₂ ≠ ϕ (the empty set) and S₁ ∩ (ℂ₁ ∩ ℂ₂) = ϕ. Let T₁ be a contour drawn from a point in S₁ to a point in S₂ through a set of points for which ℂ₁ ∩ ℂ₂ ≠ ϕ. T₁ is described by z(u_{t₁}) for a_{t₁} ≤ u_{t₁} ≤ a′_{t₁}, where a_{t₁}, a′_{t₁} ∈ ℝ. The starting point of T₁ lies on S₁ and the ending point of T₁ lies on S₂. We call T₁ a transporting contour. Using T₁ the information stored in S₁ and S₂ can be communicated. We will discuss more features of information transfer later. The length of the multilevel contour due to S₁, T₁, S₂, say L[S₁, S₂], is computed as
$$L[S_1, S_2] = \int_{\delta_{s_1}}^{\delta_{s_1}'} |z'[\varepsilon_{s_1}(\tau)]|\,\varepsilon_{s_1}'(\tau)\, d\tau + \int_{\omega_{t_1}}^{\omega_{t_1}'} |z'[\eta_{t_1}(\tau)]|\,\eta_{t_1}'(\tau)\, d\tau + \int_{\delta_{s_2}}^{\delta_{s_2}'} |z'[\varepsilon_{s_2}(\tau)]|\,\varepsilon_{s_2}'(\tau)\, d\tau \quad (18)$$

Here t_{sᵢ} = ε_{sᵢ}(τ) (δ_{sᵢ} ≤ τ ≤ δ′_{sᵢ}) for i = 1, 2, 3, 4 is the parametric representation for the contour Sᵢ with a real-valued function ε_{sᵢ} mapping [δ_{sᵢ}, δ′_{sᵢ}] onto the interval [a_{sᵢ}, a′_{sᵢ}], and u_{t₁} = η_{t₁}(τ) (ω_{t₁} ≤ τ ≤ ω′_{t₁}) is the parametric representation for the contour T₁ with a real-valued function η_{t₁} mapping [ω_{t₁}, ω′_{t₁}] onto the interval [a_{t₁}, a′_{t₁}]. Since the total information stored in S₁ and S₂ is exchanged, we have considered the total lengths of S₁ and S₂, even though T₁ could be connected with any point of S₁ and S₂. Since z(a_{t₁}) ∈ S₁ and z(a′_{t₁}) ∈ S₂, and T₁ describes a contour from z(a_{t₁}) to z(a′_{t₁}), the length given by the middle integral in the R.H.S. of (18) is not constant. The transportation contour can also be used to transport information from S₂ to S₁; let us denote this by the contour T′₁. In that case, the starting point of T′₁ lies on S₂, and the ending point of T′₁ lies on S₁. T′₁ is described by z′(u′_{t₁}) for a_{t₁} ≤ u′_{t₁} ≤ a′_{t₁}, where a_{t₁}, a′_{t₁} ∈ ℝ, and u′_{t₁} = η*_{t₁}(τ) (ω_{t₁} ≤ τ ≤ ω′_{t₁}) is the parametric representation for the contour T′₁ with a real-valued function η*_{t₁} mapping [ω_{t₁}, ω′_{t₁}] onto the interval [a_{t₁}, a′_{t₁}]. When we measure the length from S₂ to S₁, say L[S₂, S₁], the orientation of the transportation contour changes, and it is computed as
$$L[S_2, S_1] = \int_{\delta_{s_1}}^{\delta_{s_1}'} |z'[\varepsilon_{s_1}(\tau)]|\,\varepsilon_{s_1}'(\tau)\, d\tau + \int_{\omega_{t_1}'}^{\omega_{t_1}} |z'[\eta_{t_1}^*(\tau)]|\,(\eta_{t_1}^*)'(\tau)\, d\tau + \int_{\delta_{s_2}}^{\delta_{s_2}'} |z'[\varepsilon_{s_2}(\tau)]|\,\varepsilon_{s_2}'(\tau)\, d\tau \quad (19)$$

The values of the middle integrals in the R.H.S. of (18) and (19) need not be the same unless the condition below is satisfied:

$$z(a_{t_1}),\, z'(a_{t_1}') \in S_1 \quad \text{and} \quad z'(a_{t_1}),\, z(a_{t_1}') \in S_2. \quad (20)$$
The transportation contour T₂ joins S₂ to S₃, and the transportation contour T′₂ joins S₃ to S₂. The contour T₂ is described by z(u_{t₂}) for a_{t₂} ≤ u_{t₂} ≤ a′_{t₂}, where a_{t₂}, a′_{t₂} ∈ ℝ. The starting point of T₂ is in the set S₂ and the ending point of T₂ is in the set S₃. The function u_{t₂} = η_{t₂}(τ) (ω_{t₂} ≤ τ ≤ ω′_{t₂}) is the parametric representation for the contour T₂ with a real-valued function η_{t₂} mapping [ω_{t₂}, ω′_{t₂}] onto the interval [a_{t₂}, a′_{t₂}]. The contour T′₂ is described by z(u′_{t₂}) for a_{t₂} ≤ u′_{t₂} ≤ a′_{t₂}, where a_{t₂}, a′_{t₂} ∈ ℝ, and the function u′_{t₂} = η*_{t₂}(τ) (ω_{t₂} ≤ τ ≤ ω′_{t₂}) is the parametric representation for the contour T′₂ with a real-valued function η*_{t₂} mapping [ω_{t₂}, ω′_{t₂}] onto the interval [a_{t₂}, a′_{t₂}]. The lengths of these transportation contours can be computed, and

$$\int_{\omega_{t_2}}^{\omega_{t_2}'} |z'[\eta_{t_2}(\tau)]|\,\eta_{t_2}'(\tau)\, d\tau = \int_{\omega_{t_2}'}^{\omega_{t_2}} |z'[\eta_{t_2}^*(\tau)]|\,(\eta_{t_2}^*)'(\tau)\, d\tau$$

if and only if

$$z(a_{t_2}),\, z'(a_{t_2}') \in S_2 \quad \text{and} \quad z'(a_{t_2}),\, z(a_{t_2}') \in S_3.$$
The multilevel contour lengths from S₂ to S₃ and from S₃ to S₂ are computed as

$$L[S_2, S_3] = \int_{\delta_{s_2}}^{\delta_{s_2}'} |z'[\varepsilon_{s_2}(\tau)]|\,\varepsilon_{s_2}'(\tau)\, d\tau + \int_{\omega_{t_2}}^{\omega_{t_2}'} |z'[\eta_{t_2}(\tau)]|\,\eta_{t_2}'(\tau)\, d\tau + \int_{\delta_{s_3}}^{\delta_{s_3}'} |z'[\varepsilon_{s_3}(\tau)]|\,\varepsilon_{s_3}'(\tau)\, d\tau \quad (21)$$

$$L[S_3, S_2] = \int_{\delta_{s_2}}^{\delta_{s_2}'} |z'[\varepsilon_{s_2}(\tau)]|\,\varepsilon_{s_2}'(\tau)\, d\tau + \int_{\omega_{t_2}'}^{\omega_{t_2}} |z'[\eta_{t_2}^*(\tau)]|\,(\eta_{t_2}^*)'(\tau)\, d\tau + \int_{\delta_{s_3}}^{\delta_{s_3}'} |z'[\varepsilon_{s_3}(\tau)]|\,\varepsilon_{s_3}'(\tau)\, d\tau \quad (22)$$
The transportation contour T₃ joins S₃ to S₄, and the return transportation contour T′₃ joins S₄ to S₃. The contour T₃ is described by z(u_{t₃}) for a_{t₃} ≤ u_{t₃} ≤ a′_{t₃}, where a_{t₃}, a′_{t₃} ∈ ℝ. The starting point of T₃ is in the set S₃ and the ending point of T₃ is in the set S₄. The function u_{t₃} = η_{t₃}(τ) (ω_{t₃} ≤ τ ≤ ω′_{t₃}) is the parametric representation for T₃ with a real-valued function η_{t₃} mapping [ω_{t₃}, ω′_{t₃}] onto the interval [a_{t₃}, a′_{t₃}]. The contour T′₃ is described by z(u′_{t₃}) for a_{t₃} ≤ u′_{t₃} ≤ a′_{t₃}, where a_{t₃}, a′_{t₃} ∈ ℝ, and the function u′_{t₃} = η*_{t₃}(τ) (ω_{t₃} ≤ τ ≤ ω′_{t₃}) is the parametric representation for T′₃ with a real-valued function η*_{t₃} mapping [ω_{t₃}, ω′_{t₃}] onto the interval [a_{t₃}, a′_{t₃}]. The lengths of these transportation contours can be computed, and

$$\int_{\omega_{t_3}}^{\omega_{t_3}'} |z'[\eta_{t_3}(\tau)]|\,\eta_{t_3}'(\tau)\, d\tau = \int_{\omega_{t_3}'}^{\omega_{t_3}} |z'[\eta_{t_3}^*(\tau)]|\,(\eta_{t_3}^*)'(\tau)\, d\tau$$

if and only if

$$z(a_{t_3}),\, z'(a_{t_3}') \in S_3 \quad \text{and} \quad z'(a_{t_3}),\, z(a_{t_3}') \in S_4.$$
The multilevel contour lengths from S₃ to S₄ and from S₄ to S₃ are computed as

$$L[S_3, S_4] = \int_{\delta_{s_3}}^{\delta_{s_3}'} |z'[\varepsilon_{s_3}(\tau)]|\,\varepsilon_{s_3}'(\tau)\, d\tau + \int_{\omega_{t_3}}^{\omega_{t_3}'} |z'[\eta_{t_3}(\tau)]|\,\eta_{t_3}'(\tau)\, d\tau + \int_{\delta_{s_4}}^{\delta_{s_4}'} |z'[\varepsilon_{s_4}(\tau)]|\,\varepsilon_{s_4}'(\tau)\, d\tau \quad (23)$$

$$L[S_4, S_3] = \int_{\delta_{s_3}}^{\delta_{s_3}'} |z'[\varepsilon_{s_3}(\tau)]|\,\varepsilon_{s_3}'(\tau)\, d\tau + \int_{\omega_{t_3}'}^{\omega_{t_3}} |z'[\eta_{t_3}^*(\tau)]|\,(\eta_{t_3}^*)'(\tau)\, d\tau + \int_{\delta_{s_4}}^{\delta_{s_4}'} |z'[\varepsilon_{s_4}(\tau)]|\,\varepsilon_{s_4}'(\tau)\, d\tau \quad (24)$$

The total lengths of the multilevel contours from S₁ to S₄ and from S₄ to S₁ are obtained by

$$L[S_1, S_4] = \sum_{i=1}^{4}\int_{\delta_{s_i}}^{\delta_{s_i}'} |z'[\varepsilon_{s_i}(\tau)]|\,\varepsilon_{s_i}'(\tau)\, d\tau + \sum_{i=1}^{3}\int_{\omega_{t_i}}^{\omega_{t_i}'} |z'[\eta_{t_i}(\tau)]|\,\eta_{t_i}'(\tau)\, d\tau \quad (25)$$

$$L[S_4, S_1] = \sum_{i=1}^{4}\int_{\delta_{s_i}}^{\delta_{s_i}'} |z'[\varepsilon_{s_i}(\tau)]|\,\varepsilon_{s_i}'(\tau)\, d\tau + \sum_{i=1}^{3}\int_{\omega_{t_i}'}^{\omega_{t_i}} |z'[\eta_{t_i}^*(\tau)]|\,(\eta_{t_i}^*)'(\tau)\, d\tau. \quad (26)$$

Given the fixed shapes of the contours on different planes, as shown arbitrarily in Fig. 4, the above constructions of lengths and transportation contours are to be treated as an example of the usefulness of multilevel contours.
Such constructions can be extended to several other practical situations arising from data. Note that the contours Tᵢ and T′ᵢ pass through the line created by ℂᵢ ∩ ℂ_{i+1} for i = 1, 2, 3. So the corresponding integrals of the second term of the R.H.S. of (25) represent combined lengths created due to the traveling of the contour Tᵢ from a point in Sᵢ to a point in ℂᵢ ∩ ℂ_{i+1}, and then from a point in ℂᵢ ∩ ℂ_{i+1} to S_{i+1}. Similarly, the integrals of the second term of the R.H.S. of (26) represent combined lengths created due to the traveling of the contour T′ᵢ from a point in S_{i+1} to a point in ℂᵢ ∩ ℂ_{i+1}, and then from a point in ℂᵢ ∩ ℂ_{i+1} to Sᵢ. Next, we will see how this combined integral can be subdivided into smaller integrals while computing the shortest distance.
Since Tᵢ was described by z(u_{tᵢ}) for a_{tᵢ} ≤ u_{tᵢ} ≤ a′_{tᵢ}, we further partition Tᵢ into the three contours mentioned in the previous paragraph. Suppose z(a^{m₁}_{tᵢ}) is the point on Sᵢ, for a^{m₁}_{tᵢ} ∈ [a_{tᵢ}, a′_{tᵢ}], that is the closest to the point on the line created due to ℂᵢ ∩ ℂ_{i+1}, say z(a^{m₂}_{tᵢ}), and z(a^{m₃}_{tᵢ}) is the point on the line created due to ℂᵢ ∩ ℂ_{i+1} that is closest to S_{i+1}. The point at which the contour from z(a^{m₃}_{tᵢ}) joins S_{i+1}, say z(a^{m₄}_{tᵢ}) for a^{m₄}_{tᵢ} ∈ [a_{tᵢ}, a′_{tᵢ}], is the closest to a point on the line created due to ℂᵢ ∩ ℂ_{i+1}. Here z(a^{m₁}_{tᵢ}) and z(a^{m₄}_{tᵢ}) are complex numbers on different complex planes. The complex numbers z(u_{tᵢ}) are partitioned as

$$z(u_{t_i}) = \begin{cases} z(u_{t_i}^{I}) & (a_{t_i} \leq u_{t_i} < a_{t_i}^{I}) \\ z(u_{t_i}^{II}) & (a_{t_i}^{I} \leq u_{t_i} < a_{t_i}^{II}) \\ z(u_{t_i}^{III}) & (a_{t_i}^{II} \leq u_{t_i} \leq a_{t_i}') \end{cases}$$

such that

$$(a_{t_i} \leq u_{t_i} < a_{t_i}^{I}) \cup (a_{t_i}^{I} \leq u_{t_i} < a_{t_i}^{II}) \cup (a_{t_i}^{II} \leq u_{t_i} \leq a_{t_i}') = [a_{t_i}, a_{t_i}'].$$
Let us redefine the function u_{tᵢ} below to represent the three partitions mentioned above:

$$u_{t_i} = \begin{cases} \eta_{t_i}^{I}(\tau) & (\omega_{t_i} \leq \tau < \omega_{t_i}^{I}) \\ \eta_{t_i}^{II}(\tau) & (\omega_{t_i}^{I} \leq \tau < \omega_{t_i}^{II}) \\ \eta_{t_i}^{III}(\tau) & (\omega_{t_i}^{II} \leq \tau \leq \omega_{t_i}') \end{cases}$$

such that

$$(\omega_{t_i} \leq \tau < \omega_{t_i}^{I}) \cup (\omega_{t_i}^{I} \leq \tau < \omega_{t_i}^{II}) \cup (\omega_{t_i}^{II} \leq \tau \leq \omega_{t_i}') = [\omega_{t_i}, \omega_{t_i}'].$$
The three shortest distances arising out of the above partitions are, say, L(I : Tᵢ), L(II : Tᵢ), and L(III : Tᵢ). These shortest distances are given by

$$L(I : T_i) = \int_{\omega_{t_i}}^{\omega_{t_i}^{I}} |z'[\eta_{t_i}^{I}(\tau)]|\,(\eta_{t_i}^{I})'(\tau)\, d\tau \quad (27)$$
$$L(II : T_i) = \int_{\omega_{t_i}^{I}}^{\omega_{t_i}^{II}} |z'[\eta_{t_i}^{II}(\tau)]|\,(\eta_{t_i}^{II})'(\tau)\, d\tau \quad (28)$$

$$L(III : T_i) = \int_{\omega_{t_i}^{II}}^{\omega_{t_i}'} |z'[\eta_{t_i}^{III}(\tau)]|\,(\eta_{t_i}^{III})'(\tau)\, d\tau \quad (29)$$

The shortest transportation contour from Sᵢ to S_{i+1} is computed by using (27) through (29):

$$L(T_i^{m}) = L(I : T_i) + L(II : T_i) + L(III : T_i). \quad (30)$$
The shortest transportation contour from S_{i+1} to Sᵢ would be the same as in (30). Next, we compute the farthest transportation contour that joins S_{i+1} from Sᵢ. We partition Tᵢ, described by z(u_{tᵢ}) for a_{tᵢ} ≤ u_{tᵢ} ≤ a′_{tᵢ}, into three contours that represent the longest contour drawn from Sᵢ to S_{i+1}. Suppose z(a^{M₁}_{tᵢ}) is the point on Sᵢ, for a^{M₁}_{tᵢ} ∈ [a_{tᵢ}, a′_{tᵢ}], that is the farthest from the point on the line created due to ℂᵢ ∩ ℂ_{i+1}, say z(a^{M₂}_{tᵢ}), and z(a^{M₃}_{tᵢ}) is the point on the line created due to ℂᵢ ∩ ℂ_{i+1} that is farthest from S_{i+1}. The point at which the contour from z(a^{M₃}_{tᵢ}) joins S_{i+1}, say z(a^{M₄}_{tᵢ}) for a^{M₄}_{tᵢ} ∈ [a_{tᵢ}, a′_{tᵢ}], is the farthest from a point on the line created due to ℂᵢ ∩ ℂ_{i+1}. The complex numbers z(u_{tᵢ}) are partitioned as

$$z(u_{t_i}) = \begin{cases} z(u_{t_i}^{IV}) & (a_{t_i} \leq u_{t_i} < a_{t_i}^{IV}) \\ z(u_{t_i}^{V}) & (a_{t_i}^{IV} \leq u_{t_i} < a_{t_i}^{V}) \\ z(u_{t_i}^{VI}) & (a_{t_i}^{V} \leq u_{t_i} \leq a_{t_i}') \end{cases}$$

such that

$$(a_{t_i} \leq u_{t_i} < a_{t_i}^{IV}) \cup (a_{t_i}^{IV} \leq u_{t_i} < a_{t_i}^{V}) \cup (a_{t_i}^{V} \leq u_{t_i} \leq a_{t_i}') = [a_{t_i}, a_{t_i}'].$$

Let us redefine the function uti below to represent the three partitions men-
tioned above
8 IV 1
> ηti ðωti  τ < ωIV
ti Þ
>
>
< C
V C
uti ¼ ηVti ðωIV  τ < ω Þ
ti C,
>
>
ti
A
>
: VI 0
ηti ðωti  τ  ωti Þ
V

such that
0 0
ðωti  τ < ωIV
ti Þ [ ðωti  τ < ωti Þ [ ðωti  τ  ωti Þ ¼ ½ωti , ωti :
IV V VI

The three longest distances arise out of above partitions are, say, L(IV : Ti),
L(V : Ti), and L(V I : Ti). These longest distances are given by
414 SECTION III Advanced geometrical intuition

Z ωIV
t  IV  0
LðIV : Ti Þ ¼
i
z½η ðτÞ ηIV ðτÞdτ (31)
ti ti
ω ti
Z ωVt  V  0
LðV : Ti Þ ¼
i
z½η ðτÞ ηV ðτÞdτ (32)
ti ti
ωIti

Z
 0
0
ωti  VI
LðVI : Ti Þ ¼ z½η ðτÞ ηVI ðτÞdτ (33)
ti ti
ωIIt
i

The shortest transportation contour from Si to Si+1 is computed by using (31)


through (33)
LðTiM Þ ¼ LðIV : Ti Þ + LðV : Ti Þ + LðVI : Ti Þ: (34)
The shortest distance and the longest distance of a multilevel contour give us an idea about the transportation contours Tᵢ, T′ᵢ and the range of times they take to carry information from one contour to another. The shorter the distance, the quicker the information is transferred between two contours in different planes; the longer the transportation contour, the longer the time for transporting the information. The time taken to reach from z₁ to z₂ in a plane is assumed to be proportional to the distance between z₁ and z₂. A transportation contour is assumed here to carry the information on the shape of the contour; this carries the locations of the data points on the contour in that plane and joins with another contour in another plane. In this way, all the contours lying in different planes are joined so that combined information on contours (multilevel contours) is constructed. The contours lying in different planes are otherwise disjoint. By combining contours in different planes, an information tree is attained that has the locations of all the points lying in the various contours of a bundle.
Theorem 2 (Spinning of bundle theorem). Suppose the bundle $B_\mathbb{R}(\mathbb{C})$ is rotated anticlockwise such that the line passing through $(0, 0)$ of all the planes within $B_\mathbb{R}(\mathbb{C})$ forms an angle $\theta$ ($\theta > 0$) with the y-axis. Suppose the rotation is continued for each $\theta \in (0^\circ, 360^\circ]$. Then the space created due to such a rotation forms a 1-1 correspondence with $B_\mathbb{R}(\mathbb{C})$.
Proof. Suppose we make a copy of the bundle $B_\mathbb{R}(\mathbb{C})$ combined with the positioning of $\mathbb{C}_0$ and place it on the bundle such that these two bundles occupy exactly the same space. Let us call the original bundle with the positioning of $\mathbb{C}_0$ as $B_o(\mathbb{C})$ and its copy as $B_c(\mathbb{C})$. Suppose we tilt $B_c(\mathbb{C})$ to the left such that $B_c(\mathbb{C})$ is inclined at an angle $\theta$ ($\theta > 0$) with the y-axis. See Fig. 5.

FIG. 5 Angle between two bundles $B_o(\mathbb{C})$ and $B_c(\mathbb{C})$ and rotation of the bundle $B_c(\mathbb{C})$ over the y-axis.

The points (complex numbers) on the plane $\mathbb{C}_0$ do not change with this tilting, nor do the points in the space created by $B_c(\mathbb{C})$. Let us consider a plane $\mathbb{C}_p$ before tilting, for $\mathbb{C}_p \in B_o(\mathbb{C})$. The same $\mathbb{C}_p$ in $B_c(\mathbb{C})$ is now inclined away at an angle $\theta$. Let us call the copied bundle $B_c(\mathbb{C})$ that is inclined at an angle $\theta$ as $B_c(\mathbb{C}, \theta)$. Each value of $\mathbb{C}_p$ that was there when $\mathbb{C}_p \in B_o(\mathbb{C})$ is still there after inclination. However, $\mathbb{C}_p$ in $B_c(\mathbb{C}, \theta)$ intersects with an infinite set of planes of $B_o(\mathbb{C})$. Next, we show that $\mathbb{C}_p$ in $B_c(\mathbb{C}, \theta)$ has the same points, which are a subset of $B_o(\mathbb{C})$.

Since $\mathbb{C}_p$ in $B_c(\mathbb{C}, \theta)$ intersects with infinitely many (uncountably many) planes, by Theorem 1 one can draw a contour that passes through all the infinite planes of $B_o(\mathbb{C})$. There is no point of $\mathbb{C}_p$ that does not intersect with points of the bundle $B_o(\mathbb{C})$. This brings the conclusion that

$$ \mathbb{C}_p \in B_c(\mathbb{C}, \theta) \implies \mathbb{C}_p \subset B_o(\mathbb{C}). \tag{35} $$

This means,

$$ \text{every } z \in \mathbb{C}_p \text{ implies } z \in B_o(\mathbb{C}). \tag{36} $$

Statement (36) is true for every $\theta > 0$ and every $B_c(\mathbb{C}, \theta)$. Since $\theta$ is arbitrary, statement (36) is true for each $\theta \in (0^\circ, 360^\circ]$. Hence, the space created by $B_c(\mathbb{C}, \theta)$ for all $\theta \in (0^\circ, 360^\circ]$ is the same as the space of the bundle $B_\mathbb{R}(\mathbb{C})$. □
Remark 4. The n-dimensional complex plane $\mathbb{C}^n$ is also a subset of the space created by the rotation in Theorem 2.
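The 1-1 correspondence in Theorem 2 can be illustrated numerically: a rotation about the y-axis is an invertible linear map, so every rotated point recovers its original position exactly. A minimal sketch, assuming for illustration only that points of the bundle's planes are embedded in three-dimensional real space (the embedding is an assumption, not the chapter's construction):

import numpy as np

def rotate_y(points, theta):
    # Rotation about the y-axis by theta (radians): an invertible map,
    # hence a 1-1 correspondence between the bundle and its rotated copy.
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ R.T

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 3))               # sample points of a tilted plane
back = rotate_y(rotate_y(pts, 0.7), -0.7)   # rotate, then invert
print(np.allclose(pts, back))               # True: the correspondence is 1-1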

3 Multilevel contours in a random environment

Let us consider the bundle $B_\mathbb{R}(\mathbb{C})$ combined with the $\mathbb{C}_0$ intersecting with the bundle as in Section 2. Let the intersection of $\mathbb{C}_0$ on the bundle be arbitrary. Choose a plane $\mathbb{C}_l$ for $\mathbb{C}_l \in B_\mathbb{R}(\mathbb{C})$. Let $X(\gamma_l, \mathbb{C}_l)$ be a random variable describing the position of the complex number on a contour $\gamma_l$ at time $t$ in the plane $\mathbb{C}_l$. Here $\gamma_l$ is described by $z_l(t)$ $(t_0 \le t \le b_0)$ for $t_0, b_0 \in \mathbb{R}$, and $z_l(t_0) = z_{l_0}$ (say). Let $X(z_{l_0}, \mathbb{C}_l)$ be the position of the random variable at $t_0$ in $\mathbb{C}_l$. Here $X(\gamma_l, \mathbb{C}_l)$ can be treated like a standard measurable function. Suppose $X(z_{l_0}, \mathbb{C}_l)$ is picked arbitrarily in the plane $\mathbb{C}_l$. Here $X(z_l(t), \mathbb{C}_l)$ for $t \in [t_0, \infty)$ and $z_l(t) \in \mathbb{C}_l$ is a stochastic process on the complex plane $\mathbb{C}_l$. Once $X(z_{l_0}, \mathbb{C}_l)$ is chosen, the location of $X(z_{l_1}, \mathbb{C}_l)$ for $t_1 > t_0$ could be anywhere within an open disc $D(z_{l_0} : r_0)$, where $X(z_{l_1}, \mathbb{C}_l)$ is the complex number picked by $X(z_l(t), \mathbb{C}_l)$ at $t_1$ and

$$ D(z_{l_0} : r_0) = \{ z : |z - z_{l_0}| < r_0 \text{ for } r_0 > 0 \text{ and } z \in \mathbb{C}_l \}. $$
Suppose $X(z_{l_0}, \mathbb{C}_l) = z_{l_1}$. The selection of the radius is done randomly at each step of selecting a complex number. Once $X(z_l, \mathbb{C}_l)$ chooses a number at $t_0$, then during $[t_0, t_1]$ a value $r_0$ ($r_0 > 0$) is chosen randomly to build $D(z_{l_0} : r_0)$. Once the set of points of $D(z_{l_0} : r_0)$ is available, then $X(z_{l_0}, \mathbb{C}_l)$ will choose the second number during $[t_0, t_1]$. See Fig. 6. This procedure of two-step randomness continues forever in the intervals

$$ \{ [t_0, t_a], (t_a, t_{a'}], (t_{a'}, t_b], (t_b, t_{b'}], \ldots \}, $$

where $a, a', b, b', \ldots \in \mathbb{R}$ and $t_0 < t_a < t_{a'} < t_b < t_{b'} < \cdots$. We can draw a contour from $z_{l_0}$ to $z_{l_1}$ using the set of points generated by $z_l(t)$ within the time interval $[t_0, t_1)$. The set of all possible values that $\{X(z_l, \mathbb{C}_l)\}$ can take over the interval $[t_0, \infty)$ is called the state-space of $X(z_l(t), \mathbb{C}_l)$. This continuous-time stochastic process does not stop once the initial value on the plane $\mathbb{C}_l$ is chosen, and its state-space is $\mathbb{C}_l$. The described process is a continuous-time and continuous-space stochastic process.

FIG. 6 Formation of contours through two-step randomness.

Let $p(z_l(t_1), \mathbb{C}_l)$ be the probability function associated with $X(z_l(t_1), \mathbb{C}_l)$, such that $p(z_l(t_1), \mathbb{C}_l)$ describes the probability that $X(z_l(t_1), \mathbb{C}_l)$ picks $z_{l_1}$ for $z_{l_1} = z_l(t_1)$ $(t_1 > t_0)$. The probability function is defined as

$$ p(z_l(t_1), t_1) = \mathrm{Prob}[X(z_l(t_1), \mathbb{C}_l) = z_{l_1}] \quad \text{for } t_1 > t_0. \tag{37} $$
Once $X(z_l(t_1), \mathbb{C}_l)$ is generated, one can draw a contour from $z_{l_0}$ to $z_{l_1}$ and compute $L(z_{l_0}, z_{l_1})$, the length from $z_{l_0}$ to $z_{l_1}$. The contour $\gamma_l$ is described by

$$ \gamma_l = z_l(t) \quad \text{for } t \in [t_0, \infty) $$

and

$$ t = v_l(\tau) \quad (a_{0_0}^{l} \le \tau < a_{0_1}^{l}) $$

is the parametric representation for $\gamma_l$, with a real-valued function $v_l$ mapping $[a_{0_0}^{l}, a_{0_1}^{l})$ onto the interval $[t_0, t_1)$. This gives us

$$ L(z_{l_0}, z_{l_1}) = \int_{a_{0_0}^{l}}^{a_{0_1}^{l}} |z[v_l(\tau)]| v_l'(\tau)\, d\tau. \tag{38} $$
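The two-step randomness (a random radius, then a random point inside the resulting disc) and the length computation in (38) can be simulated directly. The following is a minimal sketch; the radius law and the step count are arbitrary choices for illustration, since the chapter does not prescribe specific distributions:

import numpy as np

rng = np.random.default_rng(1)

def two_step_randomness(z0, n_steps=100):
    # Step 1: draw a random radius r > 0. Step 2: draw the next complex
    # number uniformly inside the open disc D(z : r). Repeat.
    path = [z0]
    for _ in range(n_steps):
        r = rng.exponential(1.0)              # assumed radius law
        rho = r * np.sqrt(rng.uniform())      # uniform point in the disc
        phi = rng.uniform(0.0, 2.0 * np.pi)
        path.append(path[-1] + rho * np.exp(1j * phi))
    return np.array(path)

path = two_step_randomness(0.0 + 0.0j)
# Polygonal approximation of the contour length between successive
# points, in the spirit of Eq. (38).
length = np.sum(np.abs(np.diff(path)))
print(length)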

Our main focus in this article is on multilevel contours. So initially we assume that once the variable $X(z_l(t), \mathbb{C}_l)$ has picked a value at $t_0$, it cannot reach the same value for $t > t_0$, i.e.,

$$ X(z_l(t_0), \mathbb{C}_l) \ne X(z_l(t), \mathbb{C}_l) \quad \text{for } t > t_0. \tag{39} $$

For simplicity in understanding, we might denote (at certain specific places in the article) a complex number generated at each iteration with an integer index. However, in reality the numbers chosen by $X(z_l(t), \mathbb{C}_l)$ are uncountably infinite.

Definition 1 (Distinct complex numbers by $X(z_l(t), \mathbb{C}_l)$). Suppose $X(z_l(t_0), \mathbb{C}_l) = z_{l_0}$ for $z_{l_0} \in \mathbb{C}_l$. Suppose $X(z_l(t), \mathbb{C}_l) = z_{l_\alpha}$ for $t \in [t_0, t_\alpha]$ and $X(z_l(t), \mathbb{C}_l) = z_{l_\beta}$ for $t \in (t_\alpha, t_\beta]$, for all $t_0 < t_\alpha < t_\beta$ and $\alpha, \beta, t_0, t_\alpha, t_\beta \in \mathbb{R}$; then $z_{l_0} \ne z_{l_\alpha} \ne z_{l_\beta}$. This property assures that $X(z_l(t), \mathbb{C}_l)$ cannot choose the same number that it already chose in any of the previous time intervals after the initial number is chosen (including the initial number).

When the random variable $X(z_l(t), \mathbb{C}_l)$ chooses a number within the disc created, and that value (number) has already been chosen and was part of the contour, then $X(z_l(t), \mathbb{C}_l)$ will choose another number in the disc. This procedure continues until a distinct number is chosen by $X(z_l(t), \mathbb{C}_l)$. Such an assumption, in Definition 1 or in (39), allows quicker formation of a multilevel contour. We will later see the consequences of relaxing the assumption in Definition 1. We have

$$ z_{l_1} \in D(z_{l_0} : r_0) \quad \text{and} \quad z_{l_0} \ne z_{l_1}. \tag{40} $$
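The "resample until distinct" rule above is a rejection step. A schematic sketch follows; note that two draws from a continuous disc coincide with probability zero, so the loop matters only for discretized or recorded values, and the tolerance used below is an assumption:

import numpy as np

rng = np.random.default_rng(2)

def next_distinct(z_prev, visited, r, tol=1e-12):
    # Redraw inside D(z_prev : r) until the point differs from every
    # previously chosen value, as required by Definition 1 / Eq. (39).
    while True:
        rho = r * np.sqrt(rng.uniform())
        phi = rng.uniform(0.0, 2.0 * np.pi)
        z = z_prev + rho * np.exp(1j * phi)
        if all(abs(z - w) > tol for w in visited):
            return z

visited = [0.0 + 0.0j]
z = next_distinct(visited[-1], visited, r=1.0)
visited.append(z)
print(z)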
Let $z_l(t_2)$ be the value of $X(z_l(t_1), \mathbb{C}_l)$ at $t_2$ for $t_2 > t_1$, and $z_l(t_2) \in D(z_{l_1} : r_1)$ for $r_1 > 0$, such that

$$ p(z_l(t_2), t_2) = \mathrm{Prob}[X(z_l(t_1), \mathbb{C}_l) = z_{l_2}] \quad \text{for } t_2 > t_1. $$

The contour $\gamma_l$ with a new parametric representation

$$ t = v_l(\tau) \quad (a_{0_1}^{l} \le \tau < a_{0_2}^{l}), $$

with a real-valued function $v_l$ mapping $[a_{0_1}^{l}, a_{0_2}^{l})$ onto the interval $[t_1, t_2)$, helps us to compute the length $L(z_{l_1}, z_{l_2})$ from $z_{l_1}$ to $z_{l_2} = z_l(t_2)$ for $t_2 > t_1$. That is,

$$ p(z_l(t_2), t_2) = \mathrm{Prob}[X(z_l(t_2), \mathbb{C}_l) = z_l(t_2) \mid X(z_l(t_0), \mathbb{C}_l) = z_{l_0},\ X(z_l(t_1), \mathbb{C}_l) = z_l(t_1)] $$

$$ L(z_{l_1}, z_{l_2}) = \int_{a_{0_1}^{l}}^{a_{0_2}^{l}} |z[v_l(\tau)]| v_l'(\tau)\, d\tau. \tag{41} $$
Note that we are not drawing contours from $z_{l_0}$ to $z_{l_2}$ directly, because $X(z_l(t_1), \mathbb{C}_l)$ will change to $z_{l_2}$ for $t_2 > t_1$. In fact, under the construction explained, the contour starts at $z_{l_0}$ and reaches $z_{l_2}$ only through the point $z_{l_1}$. As these are new ideas, we have explained above the probabilities of picking various values by $X(z_l(t_1), \mathbb{C}_l)$. We will slightly redefine below the probabilities and their transitions to make these concepts easier to follow.
There are infinitely many options around $z_{l_0}$ that $X(z_l(t_1), \mathbb{C}_l)$ can pick during $[t_0, t_1]$, each with a probability $p(z_l(t), t)$ for $t \in [t_0, t_1]$. Let $p(z_{l_0}, z_l(t), t)$ for $t \in [t_0, t_1]$ be the probability that $X(z_l(t), \mathbb{C}_l) = z_{l_0}$ at $t_0$ has transitioned to $X(z_l(t), \mathbb{C}_l) = z_{l_1}$ during $[t_0, t_1]$. For all such probabilities of transitions during $[t_0, t_1]$, we will have

$$ \int_{t_0}^{t_1} p(z_{l_0}, z_l(t), t)\, dt = 1, \tag{42} $$

where $p(z_{l_0}, z_l(t), t)$ can be expressed as

$$ p(z_{l_0}, z_l(t), t) = \mathrm{Prob}[X(z_l(t), \mathbb{C}_l) = z_{l_1} \mid X(z_l(t_0), \mathbb{C}_l) = z_{l_0}]. $$

Let $p(z_{l_1}, z_l(t), t)$ for $t \in (t_1, t_2]$ be the probability that $X(z_l(t), \mathbb{C}_l)$ during $(t_1, t_2]$ will pick a number within $D(z_{l_1}, r_1) \subset \mathbb{C}_l$ among infinitely many options. The transition probability from $z_{l_1}$ to $z_{l_2}$ during $(t_1, t_2]$ is

$$ p(z_{l_1}, z_l(t), t)\ (t \in (t_1, t_2]) = \mathrm{Prob}[X(z_l(t), \mathbb{C}_l) = z_{l_2} \mid X(z_l(t_1), \mathbb{C}_l) = z_{l_1}], $$

such that

$$ \int_{t_1}^{t_2} p(z_{l_1}, z_l(t), t)\, dt = 1. $$

Note that

$$ p(z_{l_0}, z_l(t), t)\ (t \in [t_0, t_2]) = p(z_{l_0}, z_l(t), t)\ (t \in [t_0, t_1]) \times p(z_{l_1}, z_l(t), t)\ (t \in (t_1, t_2]). \tag{43} $$
A direct transition from the complex number $z_{l_0}$ to another complex number $z_{l_2}$ is not possible during $[t_0, t_2]$ under the above framework. By a direct transition we mean here a one-step transition. $X(z_l(t), \mathbb{C}_l)$ has reached $z_{l_1}$ during $[t_0, t_1]$, and then $X(z_l(t), \mathbb{C}_l)$, starting at $z_{l_1}$, has taken the value $z_{l_2}$ during $(t_1, t_2]$. This implies that $p(z_{l_0}, z_l(t), t)$ $(t \in [t_0, t_2])$ is not possible without passing through the value $X(z_l(t), \mathbb{C}_l) = z_{l_1}$ during $[t_0, t_1]$. If $z_{l_2}$ is not picked by $X(z_l(t), \mathbb{C}_l)$ during $(t_1, t_2]$, then it can be picked at some future time by $X(z_l(t), \mathbb{C}_l)$ for $t > t_2$. So,

$$ p(z_{l_0}, z_l(t), t)\ (t \in [t_0, t_2]) \begin{cases} = 0 & \text{(one-step transition)} \\ > 0 & \text{(two or more step transitions).} \end{cases} $$
Theorem 3. A contour formed by the set of points generated by $X(z_l(t), \mathbb{C}_l)$ on $\mathbb{C}_l$, for $\mathbb{C}_l$ in the bundle $B_\mathbb{R}(\mathbb{C})$ combined with the $\mathbb{C}_0$ intersecting with the bundle and satisfying Definition 1, will obey the continuous-time Markov property.

Proof. Let $\gamma_l$ be the contour generated by $X(z_l(t), \mathbb{C}_l)$ over the time interval $[t_0, \infty)$. Suppose $X(z_l(t_0), \mathbb{C}_l) = z_{l_0}$, $X(z_l(t), \mathbb{C}_l)$ has taken the value $z_{l_1}$ during $t \in [t_0, t_1]$, and $X(z_l(t), \mathbb{C}_l)$ has taken the value $z_{l_2}$ during $t \in (t_1, t_2]$. Here $z_{l_1} \in D(z_{l_0} : r_0) \subset \mathbb{C}_l$ for $r_0 > 0$, and $z_{l_2} \in D(z_{l_1} : r_1) \subset \mathbb{C}_l$ for $r_1 > 0$. By this construction, we have

$$ D(z_{l_0} : r_0) \cap D(z_{l_1} : r_1) \ne \phi \ (\text{empty set}). $$

Since Definition 1 holds, we have $z_{l_1} \ne z_{l_0}$. The transition probability for $X(z_l(t), \mathbb{C}_l)$ from $z_{l_0}$ to $z_{l_2}$ is

$$ \begin{aligned} p(z_{l_0}, z_{l_2}, t)\ (t \in [t_0, t_2]) &= p(z_{l_0}, z_{l_1}, t)\ (t \in [t_0, t_1]) \times p(z_{l_1}, z_{l_2}, t)\ (t \in (t_1, t_2]) \\ &= \mathrm{Prob}[X(z_l(t), \mathbb{C}_l) = z_{l_1}\ (t \in [t_0, t_1]) \mid X(z_l(t_0), \mathbb{C}_l) = z_{l_0}] \\ &\quad \times \mathrm{Prob}[X(z_l(t), \mathbb{C}_l) = z_{l_2}\ (t \in (t_1, t_2]) \mid X(z_l(t), \mathbb{C}_l) = z_{l_1}\ (t \in [t_0, t_1])]. \end{aligned} \tag{44} $$

Through (44), we can conclude that the random variable $X(z_l(t), \mathbb{C}_l)$ $(t \in [t_0, t_2])$ obeys the Markov property during the interval $[t_0, t_2]$. In (44), the number $z_{l_2}$ is generated within a disc around the number $z_{l_1}$, but not within the disc with center $z_{l_0}$. A contour is drawn from $z_{l_0}$ to $z_{l_2}$ only through $z_{l_1}$.

In a similar way, the value of $X(z_l(t), \mathbb{C}_l)$ $(t \in (t_{n-1}, t_n])$ is located in the disc $D(z_{l_{n-1}} : r_{n-1})$ for $r_{n-1} > 0$, and not in the discs $D(z_{l_k} : r_k)$ for $r_k > 0$ and $k = 0, 1, \ldots, n-2$. That is,

$$ \begin{aligned} &\mathrm{Prob}[X(z_l(t), \mathbb{C}_l) = z_{l_n}\ (t \in (t_{n-1}, t_n]) \mid X(z_l(t_0), \mathbb{C}_l) = z_{l_0},\ X(z_l(t), \mathbb{C}_l) = z_{l_1}\ (t \in (t_0, t_1]), \ldots, \\ &\qquad\qquad X(z_l(t), \mathbb{C}_l) = z_{l_{n-1}}\ (t \in (t_{n-2}, t_{n-1}])] \\ &= \mathrm{Prob}[X(z_l(t), \mathbb{C}_l) = z_{l_n}\ (t \in (t_{n-1}, t_n]) \mid X(z_l(t), \mathbb{C}_l) = z_{l_{n-1}}\ (t \in (t_{n-2}, t_{n-1}])]. \end{aligned} \tag{45} $$

A contour is drawn from $z_{l_0}$ to $z_{l_n}$ only by connecting the numbers (points) through $z_{l_1}, z_{l_2}, \ldots, z_{l_{n-1}}$. The result in (45) is also true when $t_n \to \infty$. Hence the contour $\gamma_l$ formed using the numbers generated by $X(z_l(t), \mathbb{C}_l)$ $(t \in [t_0, \infty))$ obeys the continuous-time Markov property, i.e., it is a continuous-time Markov chain. □

3.1 Behavior of $X(z_l(t), \mathbb{C}_l)$ at $\mathbb{C}_l \cap \mathbb{C}_0$
Let us understand the behavior of $X(z_l(t), \mathbb{C}_l)$ at the intersection $\mathbb{C}_l \cap \mathbb{C}_0$. The set $\mathbb{C}_l \cap \mathbb{C}_0$ consists of the elements $\mathbb{C}_l \cap \mathbb{C}_0 = \{z : z \in \mathbb{C}_l \text{ and } z \in \mathbb{C}_0\}$. Here $\mathbb{C}_l \cap \mathbb{C}_0 \ne \phi$ (empty). The numbers on $\mathbb{C}_l \cap \mathbb{C}_0$ form a dense set of numbers, and they form a line. Whenever $X(z_l(t), \mathbb{C}_l)$ reaches a number, say $z_{l_a}$, in the set $\mathbb{C}_l \cap \mathbb{C}_0$, then $X(z_l(t), \mathbb{C}_l)$ can choose the next number within the set $\mathbb{C}_l \cap \mathbb{C}_0$, or within the plane $\mathbb{C}_l$, or within the plane $\mathbb{C}_0$. That is, as soon as $X(z_l(t), \mathbb{C}_l)$ chooses a number $z_{l_a}$ after a countable number of transitions from $z_{l_0}$, the two-step randomness will help to form two discs, namely $D(z_{l_a}, r_a, \mathbb{C}_l)$ and $D(z_{l_a}, r_a, \mathbb{C}_0)$. Here

$$ D(z_{l_a}, r_a, \mathbb{C}_l) \subset \mathbb{C}_l \quad \text{and} \quad D(z_{l_a}, r_a, \mathbb{C}_0) \subset \mathbb{C}_0. \tag{46} $$

One can have different radii for the two discs. For both discs we leave the symbol $z_{l_a}$ to indicate that the associated random variable has its origins in $\mathbb{C}_l$. The next number of $X(z_l(t), \mathbb{C}_l)$, say $z_{l_b}$, could fall within $D(z_{l_a}, r_a, \mathbb{C}_l)$ such that it is on the line $\mathbb{C}_l \cap \mathbb{C}_0$ or in the set $\mathbb{C}_l \cap (\mathbb{C}_l \cap \mathbb{C}_0)'$. Alternatively, $z_{l_b}$ could fall within $D(z_{l_a}, r_a, \mathbb{C}_0)$ such that it is on the line $\mathbb{C}_l \cap \mathbb{C}_0$ or in the set $\mathbb{C}_0 \cap (\mathbb{C}_l \cap \mathbb{C}_0)'$, where

$$ \begin{aligned} \mathbb{C}_l \cap (\mathbb{C}_l \cap \mathbb{C}_0)' &= \{z : z \in \mathbb{C}_l \text{ and } z \notin (\mathbb{C}_l \cap \mathbb{C}_0)\} \\ \mathbb{C}_0 \cap (\mathbb{C}_l \cap \mathbb{C}_0)' &= \{z : z \in \mathbb{C}_0 \text{ and } z \notin (\mathbb{C}_l \cap \mathbb{C}_0)\}. \end{aligned} \tag{47} $$

If $z_{l_b} \in \mathbb{C}_0 \cap (\mathbb{C}_l \cap \mathbb{C}_0)'$, then by joining all the numbers generated by $X(z_l(t), \mathbb{C}_l)$ from $z_{l_0}$ through $z_{l_b}$, we form a multilevel contour. If $z_{l_b} \in \mathbb{C}_l \cap (\mathbb{C}_l \cap \mathbb{C}_0)'$, then by joining all the numbers generated by $X(z_l(t), \mathbb{C}_l)$ from $z_{l_0}$ through $z_{l_b}$, we form a contour on $\mathbb{C}_l$. Even if $z_{l_b} \in \mathbb{C}_l \cap (\mathbb{C}_l \cap \mathbb{C}_0)'$, $X(z_l(t), \mathbb{C}_l)$ can still choose at a future iteration a value (number) in the set $(\mathbb{C}_l \cap \mathbb{C}_0)$ and escape the plane through $\mathbb{C}_0$. Again, at a future iteration, the value of $X(z_l(t), \mathbb{C}_l)$ could fall within $\mathbb{C}_l$. As soon as $X(z_l(t), \mathbb{C}_l)$ leaves $\mathbb{C}_l$ (if at all), it reaches another plane, say $\mathbb{C}_p$, because $\mathbb{C}_p \cap \mathbb{C}_0 \ne \phi$ and $\mathbb{C}_p \cap \mathbb{C}_0 \subset \mathbb{C}_p$ for some elements of $\mathbb{C}_0$. Anytime $X(z_l(t), \mathbb{C}_l)$ reaches a set of intersecting planes, $(\mathbb{C}_l \cap \mathbb{C}_0)$ or $(\mathbb{C}_p \cap \mathbb{C}_0)$ or some other similar intersecting planes, it will have the power to generate the next number in two distinct discs, as described above in (47). This feature of $X(z_l(t), \mathbb{C}_l)$ helps to form multilevel contours. It is summarized below:

$$ \begin{aligned} X(z_l(t), \mathbb{C}_l) = z_{l_b} \text{ and } z_{l_b} \in D(z_{l_a}, r_a, \mathbb{C}_0) &\implies \gamma_l \text{ is a multilevel contour} \\ X(z_l(t), \mathbb{C}_l) = z_{l_b} \text{ and } z_{l_b} \in D(z_{l_a}, r_a, \mathbb{C}_l) &\implies \gamma_l \text{ is a contour on } \mathbb{C}_l. \end{aligned} \tag{48} $$

This rule (48) is applicable each time the value of $X(z_l(t), \mathbb{C}_l)$ falls in an intersection of planes. Once a contour attains the multilevel contour property, it will remain a multilevel contour of that particular $X(z_l(t), \mathbb{C}_l)$, even if the value of $X(z_l(t), \mathbb{C}_l)$ returns to and remains in $\mathbb{C}_l$ forever. The value of the radius at each two-step randomness and the location of the next number to be picked by $X(z_l(t), \mathbb{C}_l)$ decide the time taken for a contour to become a multilevel contour (if there is a possibility of becoming one). See Fig. 7. The time interval to reach $z_{l_a}$ from $z_{l_0}$ could be at least one, and it requires at least two time intervals to reach $z_{l_b}$ from $z_{l_0}$ under the framework described above.
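The branching rule (48) is easy to mimic in simulation. A minimal sketch, assuming for illustration that the intersection line $\mathbb{C}_l \cap \mathbb{C}_0$ is the real axis of $\mathbb{C}_l$ and that a walker near the line picks the $\mathbb{C}_0$-disc with probability one half; the chapter fixes neither assumption:

import numpy as np

rng = np.random.default_rng(3)

def walk(z0, n=50):
    z, multilevel = z0, False
    for _ in range(n):
        r = rng.exponential(1.0)
        rho, phi = r * np.sqrt(rng.uniform()), rng.uniform(0, 2 * np.pi)
        z = z + rho * np.exp(1j * phi)
        # Rule (48): when the walker sits (numerically) on the
        # intersection line, its next disc may be drawn in C_0, and the
        # contour becomes multilevel.
        if abs(z.imag) < 1e-2 and rng.uniform() < 0.5:
            multilevel = True
    return multilevel

print(walk(0.0 + 0.0j))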
Suppose $z_{l_b} \in D(z_{l_a}, r_a, \mathbb{C}_0)$. If $z_{l_a}$ is reached from $z_{l_0}$ in one time interval and $z_{l_b}$ is reached from $z_{l_a}$ in one time interval, then

$$ \|z_{l_b} - z_{l_0}\| > \|z_{l_a} - z_{l_0}\| $$

because, measuring the path from $z_{l_0}$ to $z_{l_b}$ along the contour through $z_{l_a}$,

$$ \|z_{l_a} - z_{l_0}\| + \|z_{l_a} - z_{l_b}\| = \|z_{l_b} - z_{l_0}\|. $$

If $z_{l_a}$ is reached from $z_{l_0}$ in more than one time interval, then the length $L(z_{l_0}, z_{l_a})$ of the contour from $z_{l_0}$ to $z_{l_a}$ is still less than $L(z_{l_0}, z_{l_b})$ from $z_{l_0}$ to $z_{l_b}$, because $z_{l_b}$ lies in the disc $D(z_{l_a}, r_a, \mathbb{C}_0)$.

FIG. 7 Contours spreading across one or more planes from $\mathbb{C}_l$ due to random environment.

Suppose $X(z_l(t_0), \mathbb{C}_l) = z_{l_0}$ and $X(z_l(t), \mathbb{C}_l)$ reaches $z_{l_a}$ during the $n$th time interval. Let $X(z_l(t), \mathbb{C}_l) = z_{l_1}$ for $z_{l_1} \in D(z_{l_0}, r_0)$ during the first time interval $[t_0, t_1]$, $X(z_l(t), \mathbb{C}_l) = z_{l_2}$ for $z_{l_2} \in D(z_{l_1}, r_1)$ during the second time interval, and so on, with $X(z_l(t), \mathbb{C}_l) = z_{l_a}$ for $z_{l_a} \in D(z_{l_{n-1}}, r_{n-1})$ during the $n$th time interval. Suppose $X(z_l(t), \mathbb{C}_l) = z_{l_b}$ for $z_{l_b} \in D(z_{l_a}, r_a)$. Suppose $\gamma_l$ is described by

$$ \gamma_l = z_l(t) \quad \text{for } t \in [t_0, \infty) $$

and the parametric representations are given by

$$ t = \begin{cases} v_{l_1}(\tau) & (a_{0_0}^{l} \le \tau \le a_{0_1}^{l}) \\ v_{l_2}(\tau) & (a_{0_1}^{l} < \tau \le a_{0_2}^{l}) \\ \quad \vdots & \\ v_{l_{n-1}}(\tau) & (a_{0_{n-2}}^{l} < \tau \le a_{0_{n-1}}^{l}) \\ v_{l_a}(\tau) & (a_{0_{n-1}}^{l} < \tau \le a_{0_a}^{l}) \\ v_{l_b}(\tau) & (a_{0_a}^{l} < \tau \le a_{0_b}^{l}), \end{cases} $$

where the real-valued functions $v_{l_{i+1}}$ map $(a_{0_i}^{l}, a_{0_{i+1}}^{l}]$ for $i = 1, \ldots, n-2$ onto the intervals $(t_1, t_2], \ldots, (t_{n-2}, t_{n-1}]$. The real-valued function $v_{l_1}$ maps $[a_{0_0}^{l}, a_{0_1}^{l}]$ onto the interval $[t_0, t_1]$, $v_{l_a}$ maps $(a_{0_{n-1}}^{l}, a_{0_a}^{l}]$ onto the interval $(t_{n-1}, t_a]$, and the real-valued function $v_{l_b}$ maps $(a_{0_a}^{l}, a_{0_b}^{l}]$ onto the interval $(t_a, t_b]$. Then

$$ L(z_{l_0}, z_{l_b}) = \sum_{i=1}^{n-1} \int_{a_{0_i}^{l}}^{a_{0_{i+1}}^{l}} |z[v_{l_i}(\tau)]| v_{l_i}'(\tau)\, d\tau + \int_{a_{0_{n-1}}^{l}}^{a_{0_a}^{l}} |z[v_{l_a}(\tau)]| v_{l_a}'(\tau)\, d\tau + \int_{a_{0_a}^{l}}^{a_{0_b}^{l}} |z[v_{l_b}(\tau)]| v_{l_b}'(\tau)\, d\tau > L(z_{l_0}, z_{l_a}) \tag{49} $$

because the first two terms on the R.H.S. of (49) sum to $L(z_{l_0}, z_{l_a})$. Here

$$ \int_{a_{0_i}^{l}}^{a_{0_{i+1}}^{l}} |z[v_{l_i}(\tau)]| v_{l_i}'(\tau)\, d\tau \subset D(z_{l_i}, r_i) \quad \text{for } i = 1, 2, \ldots, n-1, $$

$$ \int_{a_{0_{n-1}}^{l}}^{a_{0_a}^{l}} |z[v_{l_a}(\tau)]| v_{l_a}'(\tau)\, d\tau \subset D(z_{l_{n-1}}, r_a, \mathbb{C}_l), $$

$$ \int_{a_{0_a}^{l}}^{a_{0_b}^{l}} |z[v_{l_b}(\tau)]| v_{l_b}'(\tau)\, d\tau \subset D(z_{l_a}, r_a, \mathbb{C}_0), $$

and

$$ D(z_{l_0}, r_0) \cap D(z_{l_1}, r_1) \cap \cdots \cap D(z_{l_{n-2}}, r_{n-2}) \ne \phi \ (\text{empty}); $$
each of these discs is nonempty, and they have distinct sets of numbers on $\mathbb{C}_l$. The disc $D(z_{l_a}, r_a, \mathbb{C}_0)$ has some elements outside the plane $\mathbb{C}_l$, and

$$ D(z_{l_{n-1}}, r_{n-1}, \mathbb{C}_l) \cap D(z_{l_a}, r_a, \mathbb{C}_0) \ne \phi \ (\text{empty}). $$

Suppose it takes infinitely many time intervals to reach $z_{l_a}$ from $z_{l_0}$ (due to the random environment created). Extending the parametric representation described above, the length of the contour from $z_{l_0}$ to $z_{l_a}$ is

$$ L(z_{l_0}, z_{l_a}) = \sum_{i=1}^{\infty} \int_{a_{0_i}^{l}}^{a_{0_{i+1}}^{l}} |z[v_{l_i}(\tau)]| v_{l_i}'(\tau)\, d\tau < L(z_{l_0}, z_{l_b}) $$

because

$$ L(z_{l_0}, z_{l_b}) = \sum_{i=1}^{\infty} \int_{a_{0_i}^{l}}^{a_{0_{i+1}}^{l}} |z[v_{l_i}(\tau)]| v_{l_i}'(\tau)\, d\tau + \int_{a_{0_\infty}^{l}}^{a_{0_b}^{l}} |z[v_{l_b}(\tau)]| v_{l_b}'(\tau)\, d\tau $$

and

$$ \int_{a_{0_\infty}^{l}}^{a_{0_b}^{l}} |z[v_{l_b}(\tau)]| v_{l_b}'(\tau)\, d\tau \subset D(z_{l_\infty}, r_\infty, \mathbb{C}_0), \qquad \bigcap_{i=0}^{\infty} D(z_{l_i}, r_i) \ne \phi \ (\text{empty}). \tag{50} $$
Theorem 4. Suppose it takes infinitely many time intervals for $X(z_l(t), \mathbb{C}_l)$ to reach $z_{l_a}$ from $z_{l_0}$. Then the infinitely many (uncountably many) discs created while reaching $z_{l_a}$ from $z_{l_0}$ could be nested under the two-step random environment created by $X(z_l(t), \mathbb{C}_l)$, and $\bigcap_{i=0}^{\infty} D(z_{l_i}, r_i) \ne \phi$.

Proof. Suppose $X(z_l(t_0), \mathbb{C}_l) = z_{l_0}$. Let $D(z_{l_0}, r_0)$ be formed out of two-step randomness, and $z_{l_1} \in D(z_{l_0}, r_0)$. Suppose $r_1$ is generated randomly such that $D(z_{l_1}, r_1) \subset D(z_{l_0}, r_0)$. Further, let $r_i$ be generated randomly such that

$$ D(z_{l_i}, r_i) \subset D(z_{l_{i-1}}, r_{i-1}) \quad \text{for } i = 2, 3, \ldots, \infty $$

and

$$ z_{l_i} \in D(z_{l_{i-1}}, r_{i-1}) \quad \text{for } i = 1, 2, \ldots, \infty. $$

See Fig. 8. Given that $z_{l_a}$ has been generated by $X(z_l(t), \mathbb{C}_l)$, it must be in one of the infinitely many discs formed as described before Theorem 4. Moreover, $z_{l_a} \in \mathbb{C}_l \cap \mathbb{C}_0$. So $\bigcap_{i=0}^{\infty} D(z_{l_i}, r_i) \ne \phi$. □
FIG. 8 Nested discs and points on the intersection of planes.

Remark 5. As a consequence of Theorem 4, we will see that

$$ \sum_{i=1}^{\infty} \int_{a_{0_i}^{l}}^{a_{0_{i+1}}^{l}} |z[v_{l_i}(\tau)]| v_{l_i}'(\tau)\, d\tau \subset D(z_{l_0}, r_0). $$
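The nesting in Theorem 4 can be checked numerically: if each radius is drawn as a random fraction of the remaining gap, every disc stays strictly inside its predecessor and the common point survives in all of them. A minimal sketch; the particular shrinking rule is an assumption made for illustration:

import numpy as np

rng = np.random.default_rng(4)

center, radius = 0.0 + 0.0j, 1.0
centers, radii = [center], [radius]
for _ in range(50):
    # Draw the next center inside the current disc, then a radius small
    # enough that D(z_i, r_i) sits inside D(z_{i-1}, r_{i-1}).
    rho = radius * np.sqrt(rng.uniform())
    phi = rng.uniform(0, 2 * np.pi)
    center = center + rho * np.exp(1j * phi)
    radius = rng.uniform() * (radii[-1] - abs(center - centers[-1]))
    centers.append(center)
    radii.append(radius)

# The final center lies in every disc of the nest, so the intersection
# over all discs is nonempty, as asserted in Theorem 4.
print(all(abs(centers[-1] - c) < r for c, r in zip(centers, radii)))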

The random fluctuations and transitions explained in this section could arise infinitely many times during $[t_0, \infty)$. The random environment created on $\mathbb{C}_l$ and its behavior at the intersection of $\mathbb{C}_l$ and $\mathbb{C}_0$ are crucial for the creation of multilevel contours. A contour $\gamma_l$ could remain forever in $\mathbb{C}_l$, or it could go beyond $\mathbb{C}_l$ and reach other planes. Contour $\gamma_l$ also has the potential to reach infinitely many planes, and to return to each and every plane infinitely many times.
Theorem 5. There exists a unique contour $\gamma_l$ for each initial number chosen on $\mathbb{C}_l$, for $\mathbb{C}_l$ in the bundle $B_\mathbb{R}(\mathbb{C})$ combined with the $\mathbb{C}_0$ intersecting with the bundle and satisfying the property in Definition 1.

Proof. Let $\gamma_l$ be a contour described by $z_l(t)$ for $t \in [t_0, \infty)$ that has a starting point on the plane $\mathbb{C}_l$, with $z_l(t_0) = z_{l_0}$ (say). The values of $z_l(t)$ can reach other planes because the bundle $B_\mathbb{R}(\mathbb{C})$ in which $\mathbb{C}_l$ lies is intersecting with the $\mathbb{C}_0$. Due to the property in Definition 1, $X(z_l(t), \mathbb{C}_l)$ keeps on generating new numbers during $[t_0, \infty)$, and $z_{l_a} \ne z_{l_b}$ for $a \ne b$ and $a, b \in \mathbb{R}$. At some time, $\gamma_l$ could become a multilevel contour if the value generated by $X(z_l(t), \mathbb{C}_l)$ reaches another plane, say $\mathbb{C}_p$ for $\mathbb{C}_p \ne \mathbb{C}_l$. Reaching another plane is possible due to the positioning of $\mathbb{C}_0$. Or $\gamma_l$ could remain a contour on $\mathbb{C}_l$ for $t \in [t_0, \infty)$.

In Theorem 3, we saw that a contour drawn from an initial number $z_{l_0}$ reaches $z_{l_n}$ only through distinct numbers $z_{l_1}, z_{l_2}, \ldots, z_{l_{n-1}}$. The numbers $z_{l_1}, z_{l_2}, \ldots, z_{l_{n-1}}$ were generated by $X(z_l(t), \mathbb{C}_l)$. Alternatively, suppose the initial value chosen by $X(z_l(t), \mathbb{C}_l)$ is, say, $z_{l_0}'$ for $z_{l_0}' \ne z_{l_0}$; then even if the numbers $z_{l_1}, z_{l_2}, \ldots, z_{l_{n-1}}$ through which a contour up to $z_{l_n}$ is drawn are the same (due to the randomness in the selection of complex numbers by $X(z_l(t), \mathbb{C}_l)$), a contour drawn from $z_{l_0}'$ to $z_{l_n}$ would be different from the one drawn from $z_{l_0}$. This argument is valid for $\gamma_l$ as a contour on $\mathbb{C}_l$ or as a multilevel contour. Hence, as $t_n \to \infty$, the numbers generated by $X(z_l(t), \mathbb{C}_l)$ would be distinct due to the property in Definition 1. Therefore $\gamma_l$ is unique. □

Each point (number) on $\mathbb{C}_l$ is potentially capable of producing a contour, which could remain forever in $\mathbb{C}_l$ or could become a multilevel contour by crossing out of $\mathbb{C}_l$ through $\mathbb{C}_0$. $X(z_l(t), \mathbb{C}_l)$ has the power to choose the first number of $\mathbb{C}_l$ and generate the rest of the numbers randomly. Due to the property in Definition 1, $X(z_l(t), \mathbb{C}_l)$ loses the power to choose a number that has already been chosen earlier. That is,

$$ p(z_{l_a}, z_{l_a}, t)\ (t \in [t_0, \infty)) = 0, $$

where $p(z_{l_a}, z_{l_a}, t)$, the probability of transition to the same number, is zero. Suppose $z_{l_0}, z_{l_1}, z_{l_2}, \ldots, z_{l_n}, \ldots$ is the set of numbers generated by $X(z_l(t), \mathbb{C}_l)$ over the time. Let $p(z_{l_0}, z_{l_1}, t)^{(1)}$ represent the probability of reaching $z_{l_1}$ from $z_{l_0}$ in the 1-time interval $[t_0, t_1]$, let $p(z_{l_0}, z_{l_2}, t)^{(2)}$ represent the probability of reaching $z_{l_2}$ from $z_{l_0}$ in the 2-time intervals $[t_0, t_1], (t_1, t_2]$, and so on. We note that $z_{l_2}$ cannot be reached directly from $z_{l_0}$ in the 1-time interval $[t_0, t_1]$ or $[t_0, t_2]$, as mentioned previously in the article. So

$$ p(z_{l_0}, z_{l_2}, t)^{(2)} > 0 \quad \text{and} \quad p(z_{l_0}, z_{l_2}, t)^{(1)} = 0, \tag{51} $$

and

$$ p(z_{l_1}, z_{l_3}, t)^{(2)} > 0 \quad \text{and} \quad p(z_{l_1}, z_{l_3}, t)^{(m)} = 0 \quad \text{for } m = 1, 3, 4, 5, \ldots \tag{52} $$

The $n$-time interval transition probabilities between any other two distinct complex numbers on $\mathbb{C}_l$ can be expressed as in (51) and (52).
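Equations (51) and (52) say that the walk visits one new state per interval, so the $m$-step transition probability between the $i$th and $j$th generated values is positive only when $m = j - i$. A toy bookkeeping sketch, using integer indices as a stand-in for the uncountable state space (an assumption made purely for illustration; see also Theorem 6 below):

def n_step_positive(i, j, m):
    # p(z_{l_i}, z_{l_j}, t)^(m) > 0 iff the walk needs exactly j - i
    # sequential intervals, as in Eqs. (51), (52) and Theorem 6.
    return j > i and m == j - i

print(n_step_positive(0, 2, 2))   # True,  cf. (51): p^(2) > 0
print(n_step_positive(0, 2, 1))   # False, cf. (51): p^(1) = 0
print(n_step_positive(1, 3, 4))   # False, cf. (52): p^(m) = 0, m != 2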
Remark 6. The notation $X(z_l(t), \mathbb{C}_l)$ $(t \in [t_0, \infty))$ is used even if $X(z_l(t), \mathbb{C}_l)$ starts generating numbers from different planes after crossing through $\mathbb{C}_0$, once the initial value $z_{l_0}$ has been chosen in $\mathbb{C}_l$. Such notation helps identify the origin plane of $X(z_l(t), \mathbb{C}_l)$.

Theorem 6. Given $X(z_l(t_0), \mathbb{C}_l) = z_{l_0}$, suppose $p(z_{l_a}, z_{l_b}, t)^{(n)}$ represents the $n$-time interval transition probability from $z_{l_a}$ to $z_{l_b}$, where

$$ X(z_l(t), \mathbb{C}_l) = z_{l_a} \ \text{for } t \in (t_{a-1}, t_a] \quad \text{and} \quad X(z_l(t), \mathbb{C}_l) = z_{l_b} \ \text{for } t \in (t_{b-1}, t_b], $$

for $t_b > t_a$. Then

$$ p(z_{l_a}, z_{l_b}, t)^{(n)} \begin{cases} > 0 & \text{if } n = t_b - t_a \text{ sequential time intervals} \\ = 0 & \text{if } n \ne t_b - t_a \text{ sequential time intervals.} \end{cases} $$
Proof. The $n$-time interval transition probability $p(z_{l_a}, z_{l_b}, t)^{(n)}$ is written as

$$ \begin{aligned} p(z_l(t_a), z_l(t_b), t)^{(n)}\ (t \in (t_a, t_b]) = {} & p(z_{l_a}, z_{l_{a_1}}, t)^{(1)}\ (t \in (t_a, t_{a_1}]) \\ & \cdot p(z_{l_{a_1}}, z_{l_{a_2}}, t)^{(1)}\ (t \in (t_{a_1}, t_{a_2}]) \\ & \cdots p(z_{l_{a_{n-1}}}, z_{l_b}, t)^{(1)}\ (t \in (t_{a_{n-1}}, t_b]), \end{aligned} \tag{53} $$

for $t_a < t_{a_1} < \cdots < t_{a_{n-1}} < t_b$. In (53), $z_{l_{a_1}}$ is generated by $X(z_l(t), \mathbb{C}_l)$ during $(t_a, t_{a_1}]$ from the set of numbers of the disc $D(z_{l_a}, r_a)$, $z_{l_{a_2}}$ is generated by $X(z_l(t), \mathbb{C}_l)$ during $(t_{a_1}, t_{a_2}]$ from the set of numbers of the disc $D(z_{l_{a_1}}, r_{a_1})$, and so on; $z_{l_b}$ is generated by $X(z_l(t), \mathbb{C}_l)$ during $(t_{a_{n-1}}, t_b]$ from the set of numbers of the disc $D(z_{l_{a_{n-1}}}, r_{a_{n-1}})$. The numbers $z_{l_{a_1}}, z_{l_{a_2}}, \ldots, z_{l_{a_{n-1}}}, z_{l_b}$ were sequentially generated from the sets of distinct discs

$$ \{ D(z_{l_a}, r_a), D(z_{l_{a_1}}, r_{a_1}), \ldots, D(z_{l_{a_{n-1}}}, r_{a_{n-1}}) \} \tag{54} $$

within the sequential time intervals

$$ \{ (t_a, t_{a_1}], (t_{a_1}, t_{a_2}], \ldots, (t_{a_{n-1}}, t_b] \}. \tag{55} $$

Due to the sequential nature of (54) and (55), we will have

$$ \begin{aligned} p(z_{l_a}, z_{l_{a_1}}, t)^{(1)}\ (t \in (t_a, t_{a_1}]) &> 0 \\ p(z_{l_{a_1}}, z_{l_{a_2}}, t)^{(1)}\ (t \in (t_{a_1}, t_{a_2}]) &> 0 \\ &\ \vdots \\ p(z_{l_{a_{n-1}}}, z_{l_b}, t)^{(1)}\ (t \in (t_{a_{n-1}}, t_b]) &> 0. \end{aligned} \tag{56} $$

From (56) we conclude that

$$ p(z_l(t_a), z_l(t_b), t)^{(n)}\ (t \in (t_a, t_b]) > 0. \tag{57} $$

We also note that

$$ \begin{aligned} t_{a_1} - t_a &= 1\text{-step time interval} \\ t_{a_2} - t_{a_1} &= 1\text{-step time interval} \\ &\ \vdots \\ t_b - t_{a_{n-1}} &= 1\text{-step time interval.} \end{aligned} \tag{58} $$

Summing the terms on the L.H.S. of (58) and equating the result to the sum of the quantities on the R.H.S. of (58), we see that

$$ t_b - t_a = n \text{ sequential time intervals.} \tag{59} $$

From (57) and (59), we conclude that $p(z_{l_a}, z_{l_b}, t)^{(n)} > 0$ when $t_b - t_a = n$ sequential time intervals.

Suppose $t_b - t_a \ne m$ sequential time intervals. This implies that $z_{l_b}$ is generated either through fewer than $m$ sequential time intervals after $z_{l_a}$ is chosen by $X(z_l(t), \mathbb{C}_l)$, or through more than $m$ sequential time intervals after $z_{l_a}$ is chosen by $X(z_l(t), \mathbb{C}_l)$. If $m = 1$, then

$$ t_b - t_a \ne 1, \ \text{which implies } p(z_{l_a}, z_{l_b}, t)^{(1)} = 0; $$

if $m = 2$, then

$$ t_b - t_a \ne 2, \ \text{which implies } p(z_{l_a}, z_{l_b}, t)^{(2)} = 0; $$

and so on; if $m = n$, then

$$ t_b - t_a \ne n, \ \text{which implies } p(z_{l_a}, z_{l_b}, t)^{(n)} = 0. $$

This concludes that $p(z_{l_a}, z_{l_b}, t)^{(n)} = 0$ if $n \ne t_b - t_a$ sequential time intervals. □
Each number in $\mathbb{C}_l$ that was chosen initially by $X(z_l(t), \mathbb{C}_l)$, together with the further points chosen by $X(z_l(t), \mathbb{C}_l)$ over the time $[t_0, \infty)$, forms the state space of $X(z_l(t), \mathbb{C}_l)$, given by

$$ \begin{aligned} A_X(\mathbb{C}_l) = \{ z :\ & z = z_{l_0} \text{ at } X(z_l(t_0), \mathbb{C}_l) \text{ and } X(z_l(t), \mathbb{C}_l)\ (t \in (t_0, \infty)) = z \text{ for } z \in B_\mathbb{R}(\mathbb{C}); \\ & \text{here the order of choosing } z \text{ is preserved, and } z_{l_a} \ne z_{l_b} \text{ if } a \ne b,\ a, b > t_0 \}. \end{aligned} $$

As time progresses, the process keeps on reaching new numbers in $B_\mathbb{R}(\mathbb{C})$. All the elements of the set $A_X(\mathbb{C}_l)$ need not be in $\mathbb{C}_l$, due to the behavior of $X(z_l(t_0), \mathbb{C}_l)$ at $\mathbb{C}_l \cap \mathbb{C}_0$ explained previously. The points of $A_X(\mathbb{C}_l)$ could stay forever in $\mathbb{C}_l$, or they could spread across one or more planes of $B_\mathbb{R}(\mathbb{C})$. Each element in $A_X(\mathbb{C}_l)$ is called a state of the process $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$. If there is no certainty of returning to a state after the random variable leaves that state, we call it a transient state. In this case, once $X(z_l(t), \mathbb{C}_l)$ chooses a state, it will not be able to return to that state forever. Since there is an equal probability for $X(z_l(t), \mathbb{C}_l)$ to choose any complex number on $\mathbb{C}_l$, there is a possibility of forming contours of infinite lengths starting from each point on the plane $\mathbb{C}_l$. However, these infinite lengths of contours need not be identical. Suppose we define another process $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ with the two-step randomness that we used for generating the discs and radii; then we consider these two processes $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ and $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ as not disjoint from each other, that is, both these processes have some probability of choosing identical values, in the same order or in a different order, during $[t_0, \infty)$.

Theorem 7. Two contours formed during $[t_0, \infty)$ need not be identical, but their lengths could be identical.

Proof. Let $\gamma_l(X)$ and $\gamma_l(Y)$ be two contours formed out of the points created by the two processes $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ and $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$, with $X(z_l(t_0), \mathbb{C}_l) = z_{l_0}(X)$ and $Y(z_l(t_0), \mathbb{C}_l) = z_{l_0}(Y)$. Let $\gamma_l(X)$ be described by $z_X(t)$ $(t \in [t_0, \infty))$ and $\gamma_l(Y)$ be described by $z_Y(t)$ $(t \in [t_0, \infty))$. The two state spaces corresponding to the two processes are

$$ \begin{aligned} A_X(\mathbb{C}_l) = \{ z :\ & z = z_{l_0}(X) \text{ at } X(z_l(t_0), \mathbb{C}_l) \text{ and } X(z_l(t), \mathbb{C}_l)\ (t \in (t_0, \infty)) = z \text{ for } z \in B_\mathbb{R}(\mathbb{C}), \\ & \text{with some order of choosing } z \text{ and } z_{l_a}(X) \ne z_{l_b}(X) \text{ if } a \ne b,\ a, b > t_0 \} \end{aligned} $$

$$ \begin{aligned} A_Y(\mathbb{C}_l) = \{ z :\ & z = z_{l_0}(Y) \text{ at } Y(z_l(t_0), \mathbb{C}_l) \text{ and } Y(z_l(t), \mathbb{C}_l)\ (t \in (t_0, \infty)) = z \text{ for } z \in B_\mathbb{R}(\mathbb{C}), \\ & \text{with some order of choosing } z \text{ and } z_{l_a}(Y) \ne z_{l_b}(Y) \text{ if } a \ne b,\ a, b > t_0 \}. \end{aligned} $$

Note that $\gamma_l(X)$ is identical to $\gamma_l(Y)$ if and only if $z_{l_0}(X) = z_{l_0}(Y)$ and all other $z$ values generated out of the infinite iterations of the two processes are identical, i.e., $A_X(\mathbb{C}_l) = A_Y(\mathbb{C}_l)$. Since $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$ are not disjoint, there is a possibility that $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ and $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ may choose the same numbers during $[t_0, \infty)$. If $A_X(\mathbb{C}_l) \ne A_Y(\mathbb{C}_l)$, then in any case $\gamma_l(X)$ is not identical to $\gamma_l(Y)$. Given that $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$ are available, let $L[z_{l_0}(X), z_{l_a}(X)]$ be the length of the contour from $z_{l_0}(X)$ to $z_{l_a}(X)$ and $L[z_{l_a}(X), z_{l_b}(X)]$ be the length of the contour from $z_{l_a}(X)$ to $z_{l_b}(X)$, such that the length of the contour from $z_{l_0}(X)$ to $z_{l_b}(X)$ is computed as

$$ L[z_{l_0}(X), z_{l_b}(X)] = L[z_{l_0}(X), z_{l_a}(X)] + L[z_{l_a}(X), z_{l_b}(X)]. \tag{60} $$

Let $L[z_{l_0}(Y), z_{l_a}(Y)]$ be the length of the contour from $z_{l_0}(Y)$ to $z_{l_a}(Y)$ and $L[z_{l_a}(Y), z_{l_b}(Y)]$ be the length of the contour from $z_{l_a}(Y)$ to $z_{l_b}(Y)$, such that the length of the contour from $z_{l_0}(Y)$ to $z_{l_b}(Y)$ is computed as

$$ L[z_{l_0}(Y), z_{l_b}(Y)] = L[z_{l_0}(Y), z_{l_a}(Y)] + L[z_{l_a}(Y), z_{l_b}(Y)]. \tag{61} $$

In (60) and (61), it is assumed that $z_{l_0}(X) \ne z_{l_0}(Y)$, $z_{l_a}(X) \ne z_{l_a}(Y)$, and $z_{l_b}(X) \ne z_{l_b}(Y)$. Suppose

$$ L[z_{l_0}(X), z_{l_a}(X)] \ne L[z_{l_0}(Y), z_{l_a}(Y)] \quad \text{with} \quad L[z_{l_0}(X), z_{l_a}(X)] < L[z_{l_0}(Y), z_{l_a}(Y)], $$

and

$$ L[z_{l_a}(X), z_{l_b}(X)] \ne L[z_{l_a}(Y), z_{l_b}(Y)] \quad \text{with} \quad L[z_{l_a}(X), z_{l_b}(X)] > L[z_{l_a}(Y), z_{l_b}(Y)], $$

such that

$$ L[z_{l_0}(Y), z_{l_a}(Y)] - L[z_{l_0}(X), z_{l_a}(X)] = L[z_{l_a}(X), z_{l_b}(X)] - L[z_{l_a}(Y), z_{l_b}(Y)]. \tag{62} $$

By (62), we conclude that $L[z_{l_0}(X), z_{l_b}(X)] = L[z_{l_0}(Y), z_{l_b}(Y)]$. Since $t_a$ and $t_b$ are arbitrary, one can extend the result to other contour distances. □

Theorem 8. Two state spaces $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$ that are identical need not imply that the corresponding contours are identical.

Proof. Given that the two state spaces $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$ are identical, the states in $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$ are the same. Suppose the order of the states generated by $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ and $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ is the same; then the two contours $\gamma_l(X)$ and $\gamma_l(Y)$ are identical. Suppose instead that $z_{l_0}(X) = z_{l_0}(Y)$, but the randomness has resulted in a distinct order of the states in $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$, such that

$$ z_{l_a}(X) = z_{l_b}(Y) \quad \text{and} \quad z_{l_b}(X) = z_{l_a}(Y) $$

for some arbitrary $t_a$ and $t_b$. This implies that $\gamma_l(X)$ is no longer identical to $\gamma_l(Y)$. □

The length of a contour is less sensitive to fluctuations due to randomness than the contour itself. When $z_{l_0}(X) \ne z_{l_0}(Y)$, then $\gamma_l(X) \ne \gamma_l(Y)$, but in the long run, say for some $t \in (T, \infty)$, the lengths of $\gamma_l(X)$ and $\gamma_l(Y)$ could be the same. Once $A_X(\mathbb{C}_l)$ and $A_Y(\mathbb{C}_l)$ are formed by the two nondisjoint processes $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ and $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$, then

$$ p(z_{l_0}(X), z_{l_0}(X), t)^{(n)}\ (t \in [t_0, \infty)) = 0 \quad \text{for } n = 1, 2, \ldots $$

$$ p(z_{l_0}(Y), z_{l_0}(Y), t)^{(n)}\ (t \in [t_0, \infty)) = 0 \quad \text{for } n = 1, 2, \ldots $$

and

$$ p(z_{l_a}(X), z_{l_a}(X), t)^{(n)}\ (t \in [t_0, \infty)) = 0 \quad \text{for } n = 1, 2, \ldots $$

for an arbitrary $z_{l_a}(X)$ chosen by $X(z_l(t), \mathbb{C}_l)$, and

$$ p(z_{l_a}(Y), z_{l_a}(Y), t)^{(n)}\ (t \in [t_0, \infty)) = 0 \quad \text{for } n = 1, 2, \ldots $$

for an arbitrary $z_{l_a}(Y)$ chosen by $Y(z_l(t), \mathbb{C}_l)$. These imply

$$ \sum_{n=1}^{\infty} p(z_{l_a}(X), z_{l_a}(X), t)^{(n)}\ (t \in [t_0, \infty)) = 0 $$

$$ \sum_{n=1}^{\infty} p(z_{l_a}(Y), z_{l_a}(Y), t)^{(n)}\ (t \in [t_0, \infty)) = 0. $$

Suppose the real-valued function $v_l(X, \tau)$ maps $[a_0^l, \infty)$ onto the interval $[t_0, \infty)$; then

$$ L(\gamma_l, X) = \int_{a_0^l}^{\infty} |z[v_l(X, \tau)]| v_l'(X, \tau)\, d\tau = \int_{a_0^l}^{\infty} |z[v_l(Y, \tau)]| v_l'(Y, \tau)\, d\tau = L(\gamma_l, Y), $$

where the real-valued function $v_l(Y, \tau)$ maps $[a_0^l, \infty)$ onto the interval $[t_0, \infty)$.
Looking only at the integral expressions used for $L(\gamma_l, X)$ or $L(\gamma_l, Y)$, we are unable to tell whether a contour $\gamma_l(X)$ has traveled to any other planes beyond $\mathbb{C}_l$. The symbol $l$ in $\gamma_l(X)$ stands for the plane from which this contour originated, and $X$ stands for $X(z_l(t), \mathbb{C}_l)$, indicating the random variable responsible for generating the data required to form $\gamma_l(X)$. Suppose we consider infinitely many random variables of the type $X(z_l(t), \mathbb{C}_l)$ satisfying one of two conditions: (i) each of these works nondisjointly, such that they may choose an initial value that was chosen by a different random variable; and (ii) each of these random variables chooses an initial value that is distinct from the others, such that the number of initial values equals the number of random variables. Let $\alpha$ be the number of distinct initial values satisfying condition (i), such that $\alpha$ is less than the cardinality of $\mathbb{C}_l$, and let $X'$ be the index random variable. Then the total length of all the contours originated by all the random variables of condition (i) is

$$ L(\gamma_l, X', \alpha) = \sum_{\alpha} \int_{a_0^l}^{\infty} |z[v_l(X', \alpha, \tau)]| v_l'(X', \alpha, \tau)\, d\tau. \tag{63} $$

Let $\beta$ represent the distinct initial values in condition (ii), due to the distinct random variables within condition (ii); then the total length of all the contours generated due to condition (ii) is

$$ L(\gamma_l, X', \beta) = \sum_{\beta} \int_{a_0^l}^{\infty} |z[v_l(X', \beta, \tau)]| v_l'(X', \beta, \tau)\, d\tau. \tag{64} $$

There is no comparative measure between (63) and (64), but

$$ L(\gamma_l, X', \alpha) \subset B_\mathbb{R}(\mathbb{C}), \qquad L(\gamma_l, X', \beta) \subset B_\mathbb{R}(\mathbb{C}), $$

and

$$ L(\gamma_l, X', \alpha) \subset \bigcup_{\alpha} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \alpha), \qquad L(\gamma_l, X', \beta) \subset \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta), $$
where $D(z_{l_a}, r_a, X', \alpha)$ and $D(z_{l_a}, r_a, X', \beta)$ represent discs generated through two-step randomness for conditions (i) and (ii), respectively. The procedure for generating these discs remains the same as described previously. Currently, we have not considered the spaces created due to overlapping contours by the infinitely many contours generated under conditions (i) and (ii). However,

$$ \bigcup_{\alpha} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \alpha) \ne \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta) $$

if the origins of the contours created in (i) and (ii) are different.


Theorem 9. Suppose infinitely many random variables of the type $X(z_l(t), \mathbb{C}_l)$ are available, whose cardinality is the same as that of $\mathbb{C}_l$, and the two different conditions (i) and (ii) above are given. The unions of the sets of discs formed under these two conditions could be different or the same.

Proof. Let us consider the infinitely many random variables within condition (i). Let the arbitrary variable be $X'(z_l(t), \mathbb{C}_l, \alpha)$ $(t \in [t_0, \infty))$, for $\alpha$ as in condition (i) described above. Let $X'(z_l(t_0), \mathbb{C}_l, \alpha) = z_{l_0}(t_0, X', \alpha)$. The set of discs formed due to each $X'(z_l(t), \mathbb{C}_l, \alpha)$ is infinite. Each point on the plane could be the origin of a contour on $\mathbb{C}_l$. This implies there is a possibility that

$$ \bigcup_{z_{l_0} \in \mathbb{C}_l} D(z_{l_0}, r_0, X', \alpha) = \bigcup_{z_{l_0} \in \mathbb{C}_l} D(z_{l_0}, r_0, X', \beta). \tag{65} $$

Note that $r_0$ is associated with the two-step randomness of each $X'(z_l(t), \mathbb{C}_l, \alpha)$, and $r_0$ values are generated separately for conditions (i) and (ii). So $r_0$ on the L.H.S. of (65) need not be equal to the $r_0$ on the R.H.S. of (65). When each $X'(z_l(t), \mathbb{C}_l, \alpha)$ randomly chooses different origins and the $r_0$ values of each $z_{l_0}(t_0, X', \alpha)$ corresponding to each $X'(z_l(t), \mathbb{C}_l, \alpha)$ are identical, then (65) holds. If these $r_0$ values are different, then (65) does not hold. Once (65) holds, suppose the $r_a$ values for $z_{l_a}(t_a, X', \alpha)$ corresponding to each $X'(z_l(t), \mathbb{C}_l, \alpha)$ are identical to the $r_a$ values for $z_{l_a}(t_a, X', \beta)$ corresponding to each $X'(z_l(t), \mathbb{C}_l, \beta)$ for each of the infinitely many time intervals; then

$$ \bigcup_{\alpha} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \alpha) = \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta); $$

else, if at least one such $r_a$ that was chosen randomly is different in conditions (i) and (ii), then

$$ \bigcup_{\alpha} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \alpha) \ne \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta). $$

We have

$$ \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta) = \mathbb{C}_l, $$

because $z_{l_a} \in \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta)$ implies $z_{l_a} \in \mathbb{C}_l$ for each $z_{l_a}$, and $z_{l_a} \in \mathbb{C}_l$ implies $z_{l_a} \in \bigcup_{\beta} \bigcup_{z_{l_a} \in \mathbb{C}_l} D(z_{l_a}, r_a, X', \beta)$. For condition (i), within every disc there are infinitely many points of other contours, whereas such an assertion is not possible for the discs generated under condition (ii). There is no chance of forming an isolated disc under the two-step randomness procedure, and the Markov property derived earlier still holds for the discs formed under these two conditions. For a general description of the continuous-time Markov property, refer, for example, to Good (1961), Bhat and Deshpande (1986), Chen (1991), Gani and Stals (2005), and Goswami and Rao (2006). □
Remark 7. Under a random environment, the possibility of having identical $r_a$ values in each iteration of $X'(z_l(t), \mathbb{C}_l, \alpha)$ for infinitely many time intervals is very small. So the chances of the equalities below can be treated as a rare event:

$$ \begin{aligned} D(z_{l_a}, r_a, X', \alpha) &= D(z_{l_a}, r_a, X', \beta) \quad \text{for } t \in (t_a, t_{a'}] \\ D(z_{l_{a'}}, r_{a'}, X', \alpha) &= D(z_{l_{a'}}, r_{a'}, X', \beta) \quad \text{for } t \in (t_{a'}, t_{a''}] \\ &\ \vdots \\ D(z_{l_b}, r_b, X', \alpha) &= D(z_{l_b}, r_b, X', \beta) \quad \text{for } t \in (t_b, t_{b'}] \\ &\ \vdots \end{aligned} $$

Remark 8. When we relax the assumption in (39) and Definition 1, allowing $X(z_l(t), \mathbb{C}_l)$ to choose the same state again after that state has been chosen earlier by $X(z_l(t), \mathbb{C}_l)$, then each state in $A_X(\mathbb{C}_l)$ becomes recurrent. For a recurrent state, the probability of returning to the state is certain, even if it takes a very large number of time intervals. We can draw many contours like $\gamma_l(X)$, $\gamma_l(Y)$, etc. Each contour will have its starting point or origin depending upon the initial value chosen by the random variable responsible for generating the required data. A thick forest of contours can be formed from infinitely many random variables. A family of infinitely many random variables of type $X(z_l(t), \mathbb{C}_l)$ could form a forest of contours. Let this family be $F_l$,

$$ F_l = \{ X(z_l(t), \mathbb{C}_l), Y(z_l(t), \mathbb{C}_l), \ldots \}, $$

for the set of random variables defined on $\mathbb{C}_l$. Let $F_l$ satisfy (39) and Definition 1. Each element within $F_l$ has a state space containing infinitely many points; each state space has infinitely many states. A family $F_l$ is recurrent if all the elements of $F_l$ are recurrent; if not, $F_l$ is called transient. A transient family $F_l$ would have a higher possibility of forming a relatively quicker, denser forest of contours than a recurrent family. These dense forests of contours could spread over one or more planes in the bundle $B_\mathbb{R}(\mathbb{C})$.

3.2 Loss of spaces in bundle Bℝ(ℂ)


Suppose we continue our investigations of the behavior of $X(z_l(t), \mathbb{C}_l)$ with the properties of distinct complex numbers as described in Section 3.1. Let $\gamma_l(X, t)$ be the contour formed out of the set of points $z_l(t)$ $(t \in [t_0, \infty))$ sequentially chosen as per the two-step randomness of $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$, with origin in the plane $\mathbb{C}_l$. Suppose the space created by $\gamma_l(X, t)$ $(t \in [t_0, \infty))$ is removed from $B_\mathbb{R}(\mathbb{C})$. Let $[\gamma_l(X, t)]^c$ be the space of all points of $B_\mathbb{R}(\mathbb{C})$ minus the points of the contour. Let us assume that $\gamma_l(X, t)$ is formed out of the distinct complex numbers described in the previous section. That is,

$$ [\gamma_l(X, t)]^c = B_\mathbb{R}(\mathbb{C}) \setminus \gamma_l(X, t) = \{ z_l : z_l \in B_\mathbb{R}(\mathbb{C}) \text{ and } z_l \notin \gamma_l(X, t) \}. $$

When we introduce another random variable $Y(z_l(t), \mathbb{C}_l)$, we assume that the space lost due to $\gamma_l(X, t)$ is not available for $Y(z_l(t), \mathbb{C}_l)$. That is, $Y(z_l(t), \mathbb{C}_l)$ can choose numbers only out of $[\gamma_l(X, t)]^c$. The two-step randomness is flexible enough to choose a radius and the next number within a disc until a number is found. This implies that $\gamma_l(Y, t)$ can be formed continuously, without any obstructions, from the available numbers of the bundle. The timing between introducing the process $X(z_l(t), \mathbb{C}_l)$ and removing the set of data created by $\gamma_l(X, t)$ up to a certain time $t$ forms a removal process. The elements or numbers that the process $X(z_l(t), \mathbb{C}_l)$ occupies over a time interval are nothing new; they are part of $B_\mathbb{R}(\mathbb{C})$. Due to the removal of the data occupied by the contour $\gamma_l(X, t)$ during a time interval, say $[t_0, t_{d_1}]$, the bundle has lost elements. Since $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ is a continuous process, the contour $\gamma_l(X, t)$ is still forming after the removal process has started at $t_{d_1}$. Suppose $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ has generated discs over a long run of time intervals, up to $t_b$, by the time the removal process has started; here $t_b > t_{d_1}$. All the elements of the contour $\gamma_l(X, t)$ that were formed during $[t_0, t_{d_1}]$ are nothing but the elements on the contour up to $t_{d_1}$, whose length is

$$ \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau $$

for a real-valued function $v_{l_{d_1}}$ mapping $[a_{0_0}^{l}, a_0^{l_{d_1}}]$ onto the interval $[t_0, t_{d_1}]$. The set of elements on this contour is the set $z_l(t)$ $(t \in [t_0, t_{d_1}])$, denoted by $\gamma_l(X, t)$ $(t \in [t_0, t_{d_1}])$. The remaining elements in the bundle are
$$ B_\mathbb{R}(\mathbb{C}) - \phi_1(X) B_\mathbb{R}(\mathbb{C}), \quad \text{where } \phi_1(X) = \frac{\gamma_l(X, t)\ (t \in [t_0, t_{d_1}])}{B_\mathbb{R}(\mathbb{C})}, $$

and the differential equation describing the dynamics is

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \phi_1(X) B_\mathbb{R}(\mathbb{C}) \quad \text{for } t \in [t_0, t_{d_1}]. $$
Suppose we remove from $B_\mathbb{R}(\mathbb{C})$ the elements of the contour $\gamma_l(X, t)$ that were formed during $(t_{d_1}, t_{d_2}]$, such that $t_{d_2} - t_{d_1} = t_{d_1} - t_0$. The rate of change in the bundle $B_\mathbb{R}(\mathbb{C}) \setminus \gamma_l(X, t)$ $(t \in [t_0, t_{d_1}])$ during $(t_{d_1}, t_{d_2}]$ is

$$ \left. \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} \right|_{t = t_{d_1}} = B_\mathbb{R}(\mathbb{C})\big|_{t = t_{d_1}} - \phi_2(X)\, B_\mathbb{R}(\mathbb{C})\big|_{t = t_{d_1}} \quad \text{for } t \in (t_{d_1}, t_{d_2}], $$

where $B_\mathbb{R}(\mathbb{C})|_{t = t_{d_1}}$ indicates the space of $B_\mathbb{R}(\mathbb{C})$ that was available at $t = t_{d_1}$ and

$$ \phi_2(X) = \frac{\gamma_l(X, t)\ (t \in (t_{d_1}, t_{d_2}])}{B_\mathbb{R}(\mathbb{C})\big|_{t = t_{d_1}}}. $$

Suppose $t_{d_1}$ is randomly chosen, and the rest of the time intervals are fixed so as to maintain interval lengths equal to $t_{d_1} - t_0$. The time intervals have constant length, but the piecewise contour lengths in these intervals need not be identical, because the contour formation depends on the two-step randomness and the corresponding disc formations. The process of removal continues after $t_{d_1}$; letting $\phi$ be the rate of removal of elements from $B_\mathbb{R}(\mathbb{C})$, this can be expressed with the differential equation

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \phi(X) B_\mathbb{R}(\mathbb{C}). \tag{66} $$
A constant rate of removal of elements is difficult to imagine, because within each of the time intervals

$$ \{ [t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots \} $$

the number of elements to be removed depends on the length of contour formed during these intervals. These contour lengths are

$$ \int_{a_0^{l_{d_1}}}^{a_0^{l_{d_2}}} \left| z[v_{l_{d_2}}(\tau)] \right| v_{l_{d_2}}'(\tau)\, d\tau, \quad \int_{a_0^{l_{d_2}}}^{a_0^{l_{d_3}}} \left| z[v_{l_{d_3}}(\tau)] \right| v_{l_{d_3}}'(\tau)\, d\tau, \ \ldots \tag{67} $$
We know that the two-step randomness creates discs at each iteration, and the spaces occupied by these discs on $B_\mathbb{R}(\mathbb{C})$ need not be identical. That means the lengths of contours formed during $\{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots\}$ need not be identical. The quantity $\phi$ can only be estimated retrospectively from the data on the sets of elements created by the piecewise contours within the intervals $\{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots\}$. So a better way to express the dynamics due to the removal of elements from $B_\mathbb{R}(\mathbb{C})$ through the removal of piecewise contours is

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \phi(X, t) B_\mathbb{R}(\mathbb{C}), \tag{68} $$

where $\phi(X, t)$ can be approximated by

$$ \phi(X, t) = \frac{\gamma_l(X, t)\ (t \in (t_{d'}, t_{d''}])}{B_\mathbb{R}(\mathbb{C})}. $$

Over time, (68) will produce the dynamics within the bundle $B_\mathbb{R}(\mathbb{C})$. The total number of elements inside $B_\mathbb{R}(\mathbb{C})$ keeps decreasing due to the removal of piecewise contours (this can be treated as a death rate of data on piecewise contours). The questions that remain to be understood here are: if the rate of removal of contours is faster than the formation of the contours (a possibility exists), does the removal rate become an instantaneous rate? And if the contour $\gamma_l(X, t)$ is forming continuously, spreading into infinitely many planes of $B_\mathbb{R}(\mathbb{C})$, and we start removing the space created by $\gamma_l(X, t)$, what do the dynamics of $B_\mathbb{R}(\mathbb{C})$ look like?

The rate of removal of $\gamma_l(X, t)$ in an interval will be zero if no contour data is available for that interval. The removal of contour data resumes as soon as the contour data becomes available. This also implies that the removal process could be temporarily discontinued. By the set-up of the time intervals that are used for removing contours, the removal rate of contours might be higher than the formation rate, or vice versa, or they both might be identical. First, an interval of time is decided, and whatever part of the contour lies within this interval, that set of points (numbers) will be removed. If within that chosen time interval no contour data is available, then the removal process halts temporarily. The removal process resumes once data on contours becomes available. It is difficult to model a form for $\phi(t)$ because it depends on the time interval used for removal and on the length of contour formed by the process $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ through the two-step randomness. The lengths that will be removed during these intervals are shown in (67). At a given $t_M > t_0$, the length of $\gamma_l(X, t_M)$ $(t_M \in (t_0, \infty))$ formed up to $t_M$ could be larger than the sum of the lengths over the above intervals, or could be equal; that is,
$$ \gamma_l(X, t_M)\ (t_M \in (t_0, \infty)) \begin{cases} = \displaystyle \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau & (\text{whenever } \phi(t_M) = 0) \\[2ex] > \displaystyle \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau & (\text{otherwise}) \end{cases} \tag{69} $$
Here $\phi(t_M) = 0$ indicates that there is no contour data available to be removed from $B_\mathbb{R}(\mathbb{C})$. The event

$$ \gamma_l(X, t_M)\ (t_M \in (t_0, \infty)) < \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau $$

is impossible. Whenever

$$ \gamma_l(X, t)\ (t_M \in (t_0, \infty)) = \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau \tag{70} $$

at $t = t_b$ (say), then during $(t_b, t_{b'}]$, for $(t_b, t_{b'}] = (t_0, t_{d_1}]$, the amount of data removed could be equal to the amount of $\gamma_l(X, t)$ that is available during $(t_{d_b}, t_{d_{b'}}]$. Also, when (70) is true at $t = t_b$, then

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = 0. \tag{71} $$

Satisfying (71) does not indicate that the removal process has attained a stationary or steady-state solution. As noted earlier, after (71) is attained at some $t > t_0$, the removal resumes soon after the formation of a new piece of contour in $\gamma_l(X, t)$.
Theorem 10. The differential equation describing the removal process,

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \phi(X, t) B_\mathbb{R}(\mathbb{C}), $$

never attains global stability.

Proof. The removal process of the data generated by $\gamma_l(X, t)$ continues even after $\frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = 0$ at some $t \in (t_0, \infty)$. The amount of $\phi(X, t)$ after $\frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = 0$ depends on the available length of $\gamma_l(X, t)$ just after attaining $\frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = 0$, and it could be smaller than, or equal to, the set of data points generated by the piece of contour formed after

$$ \gamma_l(X, t)\ (t_M \in (t_0, \infty)) = \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau $$

holds. We could never attain a situation of

$$ \int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} \left| z[v_{l_{d_{b'}}}(\tau)] \right| v_{l_{d_{b'}}}'(\tau)\, d\tau < \epsilon $$

for every chosen $\epsilon > 0$. Hence, the rate of removal of the space of data in $B_\mathbb{R}(\mathbb{C})$ can never attain stability as long as the contour formation process $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ continues. □

Theorem 11. Suppose $R_M$ is an upper bound such that the length of the contour removed satisfies

$$ \int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} \left| z[v_{l_{d_{b'}}}(\tau)] \right| v_{l_{d_{b'}}}'(\tau)\, d\tau \le R_M \tag{72} $$

for an arbitrary interval $(t_{d_b}, t_{d_{b'}}]$. Then such an $R_M$ does not exist for all the intervals of the type $(t_{d_b}, t_{d_{b'}}]$.

Proof. Suppose the quantity $R_M$ exists for all the intervals of the type $(t_{d_b}, t_{d_{b'}}]$, such that (72) is true. This implies that for any given arbitrary interval $(t_{d_c}, t_{d_{c'}}]$, whether $(t_{d_c}, t_{d_{c'}}]$ occurred prior to $(t_{d_b}, t_{d_{b'}}]$ or after it, the length of the contour whose data is to be removed does not exceed $R_M$. Such an assertion is true only if $R_M \to \infty$, and not for a finite $R_M$, because the piece of the contour $\gamma_l(X, t)$ whose data is to be removed depends on the length of the contour that is available, and there is no upper limit on the length of the contour that can be formed. This contradicts the assumption that an $R_M$ can be attained such that (72) holds. The set of data created by the length

$$ \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau $$

could reach $\gamma_l(X, t)$ from the left (or from below) once or more than once. Contour formation and the corresponding removal process, once initiated, will continue forever. □

One can also use a different strategy to remove the space of data points formed by $\gamma_l(X, t)$ for $t \in [t_0, t_b]$. Suppose we assume that $\phi(X, t)$ follows a certain parametric form to decide the number of elements of the set $z_l(t)$ on $\gamma_l(X, t)$ over various time intervals, say $\{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots\}$. We will know the length of $\gamma_l(X, t)$ at $t_b$, which is

$$ \int_{a_{0_0}^{l}}^{a_0^{l_{d_b}}} \left| z[v_{l_{d_b}}(\tau)] \right| v_{l_{d_b}}'(\tau)\, d\tau. \tag{73} $$

So we choose the removal rate of the set of data created on this contour up to $t_b$ such that $\phi(t)$ at each interval of $\{[t_0, t_{d_1}], (t_{d_1}, t_{d_2}], \ldots\}$ is less than the corresponding pieces of the contours formed during $\{[t_b, t_{b'}], (t_{b'}, t_{b''}], \ldots\}$. Note that these two sets of intervals need not have the same interval lengths. The intervals used to form $\gamma_l(X, t)$ emerge out of the two-step randomness. At $t_b$, we first form

$$ D(z_{l_b}, r_b) \tag{74} $$

using $r_b$, and $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ chooses $z_{l_{b'}}$ from (74). If we choose $\phi(X, t)$ such that
$$ \phi(X, t) = \psi(X, t) \int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} \left| z[v_{l_{d_{b'}}}(\tau)] \right| v_{l_{d_{b'}}}'(\tau)\, d\tau \quad (t \in (t_b, t_{b'}]) \tag{75} $$

for $0 < \psi(X, t) < 1$, then the dynamics in the bundle would be

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \left[ \psi(X, t) \int_{a_0^{l_{d_b}}}^{a_0^{l_{d_{b'}}}} \left| z[v_{l_{d_{b'}}}(\tau)] \right| v_{l_{d_{b'}}}'(\tau)\, d\tau\ (t \in (t_b, t_{b'}]) \right] B_\mathbb{R}(\mathbb{C}) \quad (t \in (t_b, t_{b'}]), $$

and for each of the intervals we can choose a $\psi(t)$, or it could be a constant value in $(0, 1)$. We ensure through (75) that the set of numbers on $\gamma_l(X, t)$ removed during $(t_b, t_{b'}]$ is smaller than the set of numbers formed on the contour during $(t_b, t_{b'}]$. This way, the data of the contour $\gamma_l(X, t)$ remaining unremoved are at least the set of data points required to draw the distance (73).
Similarly, the dynamics in the bundle due to the removal of the set of numbers (data points) removed during $(t_{b'}, t_{b''}]$ are

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \left[ \psi(X, t) \int_{a_0^{l_{d_{b'}}}}^{a_0^{l_{d_{b''}}}} \left| z[v_{l_{d_{b''}}}(\tau)] \right| v_{l_{d_{b''}}}'(\tau)\, d\tau\ (t \in (t_{b'}, t_{b''}]) \right] B_\mathbb{R}(\mathbb{C}) \quad (t \in (t_{b'}, t_{b''}]). $$
Through the strategy explained here, the removal of the space over a long period of time can be approximated by

$$ \frac{d\, B_\mathbb{R}(\mathbb{C})}{dt} = B_\mathbb{R}(\mathbb{C}) - \left[ \psi(X, t) \int_{a_{0_0}^{l}}^{a_0^{l_\infty}} |z[v_{l_\infty}(\tau)]| v_{l_\infty}'(\tau)\, d\tau\ (t \in (t_b, t_\infty]) \right] B_\mathbb{R}(\mathbb{C}) \quad (t \in [t_0, \infty)). \tag{76} $$

The differential equation (76) gives an approximation of the overall dynamics generated in the various intervals $\{[t_b, t_{b'}], (t_{b'}, t_{b''}], \ldots\}$. The amount of space removed would never reach a situation where $\phi(t) = 0$ in these differential equations, because the data points due to the length of the contour in (73) will still be in excess. Through the differential equation (76), we made sure that $\gamma_l(X, t_M)$ $(t_M \in (t_0, \infty))$ of (69) satisfies

$$ \gamma_l(X, t_M)\ (t_M \in (t_0, \infty)) > \int_{a_{0_0}^{l}}^{a_0^{l_{d_1}}} \left| z[v_{l_{d_1}}(\tau)] \right| v_{l_{d_1}}'(\tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d_{i+1}}}}^{a_0^{l_{d_{i+2}}}} \left| z[v_{l_{d_{i+1}}}(\tau)] \right| v_{l_{d_{i+1}}}'(\tau)\, d\tau \tag{77} $$

if the removal process follows $\phi(X, t)$ of (75). There are no specific advantages if (70) holds, unless we are having difficulties with the discontinuity of the removal process.
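The $\psi$-strategy of (75), removing only a fraction $\psi \in (0, 1)$ of the newly formed contour length in each interval, guarantees that formation always outpaces removal. A scalar toy sketch; the exponential "new length per interval" law is an assumption for illustration:

import numpy as np

rng = np.random.default_rng(6)

formed, removed, psi = 0.0, 0.0, 0.4   # psi in (0, 1), as in Eq. (75)
for _ in range(1000):
    new_len = rng.exponential(1.0)     # contour length formed this interval
    formed += new_len
    removed += psi * new_len           # removal rate of Eq. (75), scalarized
# The unremoved data never drops below (1 - psi) of what was formed,
# so the strict inequality (77) is preserved throughout.
print(formed - removed >= (1 - psi) * formed - 1e-9)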
Suppose we want to introduce infinitely many random processes to generate contours as in (63) and (64), and then initiate the corresponding removal processes. Accounting for the space of data lost in $B_\mathbb{R}(\mathbb{C})$ over a period of time intervals, and constructing such sets, would involve careful consideration of the contour formation and removal processes. For the sake of understanding the dynamics in $B_\mathbb{R}(\mathbb{C})$ due to these multiple contour formation and removal processes, let us introduce a second process $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_c}$ for $t_c > t_b$. Recollect that when $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ reached $t = t_b$, we introduced the removal process of $\gamma_l(X, t)$. This implies that $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_c}$ is introduced $t_c - t_b$ time units after the removal process of $\gamma_l(X, t)$ was initiated, and $t_c$ time units after the process $\{X(z_l(t), \mathbb{C}_l)\}_{t \ge t_0}$ was introduced in the bundle $B_\mathbb{R}(\mathbb{C})$. At the time of introduction of $\{Y(z_l(t), \mathbb{C}_l)\}_{t \ge t_c}$, the bundle $B_\mathbb{R}(\mathbb{C})$ has lost some set of data points due to the removal process of $\gamma_l(X, t)$ started at $t = t_b$. The two-step randomness of $\{Y(z_l(t), \mathbb{C}_l)\}$ will choose a number $z_l(Y, t)$ at $t = t_c$ on the plane $\mathbb{C}_l$. Let us call this initial value of the new contour $z_{l_0}(Y)$, and let the contour be $\gamma_l(Y, t)$. Since the space of the contour $\gamma_l(X, t)$ for the period $t_0$ to $t_c$ was removed, the number $z_{l_0}(Y)$ will be a point of $\mathbb{C}_l$ within the set

$$ \{ \mathbb{C}_l \setminus z_l(X, t)\ (t \in [t_0, t_c]) \}, $$

where

$$ \mathbb{C}_l \setminus z_l(X, t)\ (t \in [t_0, t_c]) = \{ z_l : z_l \in \mathbb{C}_l \text{ and } z_l \notin z_l(X, t) \text{ for } t \in [t_0, t_c] \}. $$
Once $z_{l_0}(Y)$ is chosen by $Y(z_l(t), \mathbb{C}_l)$, a disc

$$ D(z_{l_0}(Y), r_0(Y)) \subset \mathbb{C}_l \tag{78} $$

will be formed. As the second step, $Y(z_l(t), \mathbb{C}_l)$ will pick a point within the disc (78), such that the points of this disc are all part of the space in $\mathbb{C}_l$ remaining available after the points (numbers) removed by the removal process of the contour $\gamma_l(X, t)$, from $t = t_0$ to $t = t_c$, are accounted for. Once $z_{l_0}(Y)$ is chosen and the disc (78) is formed, there may be a situation in which the entire space of this disc is not available for choosing $z_{l_1}$, so that a contour or a piecewise contour cannot be formed by joining $z_{l_0}$ to $z_{l_1}$. The process $\{Y(z_l(t), \mathbb{C}_l)\}$ will have to pick a number randomly only in certain locations of the disc. See Fig. 9. The space within $D(z_{l_0}(Y), r_0(Y))$ is divided into three components: the set of points due to the removal of the space of the contour $\gamma_l(X, t)$, say $S_1[D(z_{l_0}(Y), r_0(Y))]$; the set of points available in the disc to which a contour or piecewise arcs can be drawn from $z_{l_0}(Y)$ to $z_{l_1}(Y)$, say $S_2[D(z_{l_0}(Y), r_0(Y))]$; and the set of points, say $S_3[D(z_{l_0}(Y), r_0(Y))]$, available in the disc for choosing $z_{l_1}(Y)$ but such that a contour passing from $z_{l_0}(Y)$ to $z_{l_1}(Y)$ for $z_{l_1}(Y) \in D(z_{l_0}(Y), r_0(Y))$ cannot be drawn. The removal process thus allows the disc $D(z_{l_0}(Y), r_0(Y))$ to be written as the union of these three sets:

$$ D(z_{l_0}(Y), r_0(Y)) = S_1[D(z_{l_0}(Y), r_0(Y))] \cup S_2[D(z_{l_0}(Y), r_0(Y))] \cup S_3[D(z_{l_0}(Y), r_0(Y))]. \tag{79} $$

FIG. 9 Nonavailability of the space for $\{Y(z_l(t), \mathbb{C}_l)\}$ in a disc after the time $t_c$ due to removal of a piece of contour $\gamma_l(X, t)$. The shaded region cannot be reached from $z_{l_0}$ within a disc $D(z_{l_0}(Y), r_0(Y))$.
Here $S_1$, $S_2$, and $S_3$ are disjoint. A point (number) to which a contour can be drawn from $z_{l_0}(Y)$ is located within the set $S_2[D(z_{l_0}(Y), r_0(Y))]$. Similarly, let $z_{l_a}(Y)$ be an arbitrary point, available to draw a contour from a previous iteration, that was chosen by $\{Y(z_l(t), \mathbb{C}_l)\}$ at some $t$. Using $z_{l_a}(Y)$ we can draw a disc $D(z_{l_a}(Y), r_a(Y))$ such that

$$ D(z_{l_a}(Y), r_a(Y)) = S_1[D(z_{l_a}(Y), r_a(Y))] \cup S_2[D(z_{l_a}(Y), r_a(Y))] \cup S_3[D(z_{l_a}(Y), r_a(Y))]. $$

If $\gamma_l(X, t) \cap D(z_{l_a}(Y), r_a(Y)) = \phi$ (empty set), then

$$ S_1[D(z_{l_a}(Y), r_a(Y))] = \phi \quad \text{and} \quad S_3[D(z_{l_a}(Y), r_a(Y))] = \phi, $$

and

$$ D(z_{l_a}(Y), r_a(Y)) = S_2[D(z_{l_a}(Y), r_a(Y))]. $$
Any point in $D(z_{l_a}(Y), r_a(Y))$ can then be randomly chosen by $\{Y(z_l(t), \mathbb{C}_l)\}$, such that a contour can be drawn within $D(z_{l_a}(Y), r_a(Y))$. Suppose in a given $D(z_{l_a}(Y), r_a(Y))$ the next iteration point, say $z_{l_{a'}}(Y)$, for $z_{l_{a'}}(Y) \in D(z_{l_a}(Y), r_a(Y))$, lies such that a direct contour from $z_{l_a}(Y)$ to $z_{l_{a'}}(Y)$ cannot be drawn due to a deleted space of the contour $\gamma_l(X, t)$. Suppose there exists some space outside the deleted space of $\gamma_l(X, t)$ within $D(z_{l_a}(Y), r_a(Y))$, so that piecewise arcs can be drawn from $z_{l_a}(Y)$ to $z_{l_{a'}}(Y)$. In such situations, $Y(z_l(t), \mathbb{C}_l)$ will generate a set of points around the deleted space to draw piecewise arcs joining $z_{l_a}(Y)$ to $z_{l_{a'}}(Y)$, whose distance is

$$ \int_{a_0^{l_{d_a}}}^{a_0^{l_{d_{a'}}}} \left| z[v_{l_{d_a}}(Y, \tau)] \right| v_{l_{d_{a'}}}'(\tau)\, d\tau. \tag{80} $$

Here (80) will be the sum of piecewise arcs, such that

$$ \begin{aligned} \int_{a_0^{l_{d_a}}}^{a_0^{l_{d_{a'}}}} \left| z[v_{l_{d_a}}(Y, \tau)] \right| v_{l_{d_{a'}}}'(\tau)\, d\tau = {} & \int_{a_0^{l_{d_a}}}^{a_0^{l_{a(i)}}} \left| z[v_{l_{a(i)}}(Y, \tau)] \right| v_{l_{a(i)}}'(\tau)\, d\tau \\ & + \sum_{i=1}^{k} \int_{a_0^{l_{a(i)}}}^{a_0^{l_{a(i+1)}}} \left| z[v_{l_{a(i+1)}}(Y, \tau)] \right| v_{l_{a(i+1)}}'(\tau)\, d\tau \\ & + \int_{a_0^{l_{a(k+1)}}}^{a_0^{l_{d_{a'}}}} \left| z[v_{l_{d_{a'}}}(Y, \tau)] \right| v_{l_{d_{a'}}}'(\tau)\, d\tau. \end{aligned} \tag{81} $$
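The decomposition (79) can be emulated by sampling: given a deleted region inside the disc, each sampled point is deleted ($S_1$), reachable from the center by a straight arc avoiding the deleted set ($S_2$), or available but not reachable ($S_3$). A crude sketch, assuming the deleted space is a smaller disc and testing reachability along the straight segment only; both simplifications are assumptions:

import numpy as np

rng = np.random.default_rng(7)

center, R = 0.0 + 0.0j, 1.0          # the disc D(z_{l_a}(Y), r_a(Y))
hole_c, hole_r = 0.4 + 0.0j, 0.2     # deleted space left by gamma_l(X, t)

def classify(z, n_checks=200):
    if abs(z - hole_c) < hole_r:
        return "S1"                  # the point itself was deleted
    # Straight segment from the disc center to z, sampled densely.
    ts = np.linspace(0.0, 1.0, n_checks)
    seg = center + ts * (z - center)
    blocked = np.any(np.abs(seg - hole_c) < hole_r)
    return "S3" if blocked else "S2"

pts = center + R * np.sqrt(rng.uniform(size=500)) * np.exp(
    1j * rng.uniform(0, 2 * np.pi, size=500))
labels = [classify(z) for z in pts]
print({s: labels.count(s) for s in ("S1", "S2", "S3")})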
The process of creation of γ l(Y, t) for t > tc continues as described above.


Contour γ_l(Y, t) can have a space of γ_l(X, t) that does not get deleted by the removal process of X(z_l(t), ℂ_l). All the points of the set γ_l(X, t) ∩ γ_l(Y, t), when

$$\gamma_l(X, t) \cap \gamma_l(Y, t) \neq \phi \ (\text{empty set}),$$

can get deleted due to the removal process of X(z_l(t), ℂ_l). The process Y(z_l(t), ℂ_l), until a removal process for Y(z_l(t), ℂ_l) is introduced, will not influence the differential equation (76) describing the loss of space in the bundle B_ℝ(ℂ). Let us introduce a removal process for γ_l(Y, t) at t = t_g, i.e., starting in the interval (t_{g-1}, t_g]. Suppose a length of this contour equivalent to

$$\int_{a_0^{l_c}}^{a_0^{l_g}} \left| z[v_{l_g}(Y, \tau)] \right| v'_{l_g}(\tau)\, d\tau$$
is always maintained between the newly forming part of the contour and the removal location on this contour, so that the removal rate never becomes zero. Suppose z_{l_{g(1)}} is the point chosen in D(z_{l_g}(Y), r_g(Y)) ⊂ ℂ_l, where

$$z_{l_{g(1)}}(Y) \in S_2\big[D(z_{l_g}(Y), r_g(Y))\big],$$

$$D(z_{l_g}(Y), r_g(Y)) = S_1\big[D(z_{l_g}(Y), r_g(Y))\big] \cup S_2\big[D(z_{l_g}(Y), r_g(Y))\big] \cup S_3\big[D(z_{l_g}(Y), r_g(Y))\big]. \qquad (82)$$

The length from z_{l_g}(Y) to z_{l_{g(1)}}(Y) is

$$\int_{a_0^{l_g}}^{a_0^{l_{g(1)}}} \left| z[v_{l_{g(1)}}(Y, \tau)] \right| v'_{l_{g(1)}}(\tau)\, d\tau. \qquad (83)$$

The removal rate of Y(z_l(t), ℂ_l) we denote here by ϕ(Y, t). The value of ϕ(Y, t) during (t_g, t_{g(1)}] is expressed using (83) as

$$\phi(Y, t) = \psi(Y, t) \int_{a_0^{l_g}}^{a_0^{l_{g(1)}}} \left| z[v_{l_{g(1)}}(Y, \tau)] \right| v'_{l_{g(1)}}(\tau)\, d\tau,$$

for 0 < ψ(Y, t) < 1, and the value of ϕ(Y, t) during (t_g, t_∞] is expressed using (83) as

$$\phi(Y, t) = \psi(Y, t) \int_{a_0^{l_g}}^{a_0^{l_\infty}} \left| z[v_{l_\infty}(Y, \tau)] \right| v'_{l_\infty}(\tau)\, d\tau.$$
The dynamics in the bundle B_ℝ(ℂ) due to removal of sets of points in γ_l(X, t) (t ∈ [t_b, ∞)) and in γ_l(Y, t) (t ∈ [t_g, ∞)) described above can be divided into the following four parts:

(i) Removal of data points due to the removal process introduced on contour γ_l(X, t),
(ii) Removal of data points due to the removal process introduced on contour γ_l(Y, t),
(iii) Removal of data points in the set γ_l(X, t) ∩ γ_l(Y, t), for γ_l(X, t) ∩ γ_l(Y, t) ≠ ϕ (empty set), due to the removal process introduced on contour γ_l(X, t),
(iv) Removal of data points in the set γ_l(X, t) ∩ γ_l(Y, t), for γ_l(X, t) ∩ γ_l(Y, t) ≠ ϕ (empty set), due to the removal process introduced on contour γ_l(Y, t).

Let ϕ(X, t) and ϕ(Y, t) represent the removal rates for the points lying purely on γ_l(X, t) and γ_l(Y, t), respectively, and not on γ_l(X, t) ∩ γ_l(Y, t) ≠ ϕ (empty set). Let ϕ_3(X, t) represent the removal rate for the points purely on γ_l(X, t) ∩ γ_l(Y, t) ≠ ϕ (empty set) for the contour initiated by X(z_l(t), ℂ_l), and ϕ_4(Y, t) the removal rate for the points purely on γ_l(X, t) ∩ γ_l(Y, t) ≠ ϕ (empty set) for the contour initiated by Y(z_l(t), ℂ_l). The dynamics in the bundle B_ℝ(ℂ) due to the four parts above is expressed through the differential equation

$$\begin{aligned}
\frac{dB_{\mathbb{R}}(\mathbb{C})}{dt} &= B_{\mathbb{R}}(\mathbb{C}) - \phi(X, t)B_{\mathbb{R}}(\mathbb{C}) - \phi(Y, t)B_{\mathbb{R}}(\mathbb{C}) - \phi_3(X, t)B_{\mathbb{R}}(\mathbb{C}) - \phi_4(Y, t)B_{\mathbb{R}}(\mathbb{C}) \\
&= B_{\mathbb{R}}(\mathbb{C}) - \left[ \psi(X, t) \int_{a_0^{0}}^{a_0^{l_\infty}} \left| z[v_{l_\infty}(X, \tau)] \right| v'_{l_\infty}(X, \tau)\, d\tau \right] B_{\mathbb{R}}(\mathbb{C}) \\
&\quad - \left[ \psi(Y, t) \int_{a_0^{l_g}}^{a_0^{l_\infty}} \left| z[v_{l_\infty}(Y, \tau)] \right| v'_{l_\infty}(Y, \tau)\, d\tau \right] B_{\mathbb{R}}(\mathbb{C}) - \psi_3(X, t)B_{\mathbb{R}}(\mathbb{C}) - \psi_4(Y, t)B_{\mathbb{R}}(\mathbb{C}) \qquad (84)
\end{aligned}$$

for 0 < ψ_3(X, t) < 1 and 0 < ψ_4(Y, t) < 1 in the same time interval in which ψ(X, t) and ψ(Y, t) are implemented.
Theorem 12. The differential equation

$$\frac{dB_{\mathbb{R}}(\mathbb{C})}{dt} = B_{\mathbb{R}}(\mathbb{C}) - \left[ \phi(X, t) + \phi(Y, t) + \phi_3(X, t) + \phi_4(Y, t) \right] B_{\mathbb{R}}(\mathbb{C}), \qquad (85)$$

where

$$\phi(X, t) = \psi(X, t) \int_{a_0^{0}}^{a_0^{l_\infty}} \left| z[v_{l_\infty}(X, \tau)] \right| v'_{l_\infty}(X, \tau)\, d\tau \quad (0 < \psi(X, t) < 1),$$

$$\phi(Y, t) = \psi(Y, t) \int_{a_0^{l_g}}^{a_0^{l_\infty}} \left| z[v_{l_\infty}(Y, \tau)] \right| v'_{l_\infty}(Y, \tau)\, d\tau \quad (0 < \psi(Y, t) < 1),$$

$$\phi_3(X, t) = \psi_3(X, t) \quad (0 < \psi_3(X, t) < 1),$$

$$\phi_4(Y, t) = \psi_4(Y, t) \quad (0 < \psi_4(Y, t) < 1),$$

will never attain global stability.

Proof. The removal process continuously removes sets of points of the contours γ_l(X, t) and γ_l(Y, t) described in (85). As in the proof of Theorem 10, the contour formation happens continuously, and

$$|B_{\mathbb{R}}(\mathbb{C})| < \epsilon$$

would never arise for any ε > 0: by the construction of ϕ(X, t), ϕ(Y, t), ϕ_3(X, t), and ϕ_4(Y, t) given in the statement, there will always be a space of points that remains unremoved. □
Remark 9. One can also develop an argument similar to the one provided in the proof of Theorem 10 to prove Theorem 12.
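As a numerical illustration of Theorem 12, the sketch below integrates (85) with a forward Euler scheme. The specific rate functions, their bounds, and the step size are illustrative assumptions, not taken from the chapter; with the total removal rate kept below the unit growth rate, |B| stays bounded away from zero, matching the claim that global stability is never attained.

```python
import numpy as np

# Assumed (illustrative) removal rates, each bounded in (0, 1) as in
# the statement of Theorem 12; their exact forms are not from the text.
phi_X = lambda t: 0.20 * (1 - np.exp(-t))
phi_Y = lambda t: 0.15 * (1 - np.exp(-t))
phi_3 = lambda t: 0.10
phi_4 = lambda t: 0.05

def simulate(B0=1.0, T=20.0, dt=1e-3):
    """Forward Euler for dB/dt = B - [phi_X + phi_Y + phi_3 + phi_4] B."""
    B, t = B0, 0.0
    while t < T:
        total_rate = phi_X(t) + phi_Y(t) + phi_3(t) + phi_4(t)
        B += dt * (B - total_rate * B)
        t += dt
    return B

print(simulate())  # grows: |B| never falls below any epsilon > 0
```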
Suppose X^a(z_l(t), ℂ_l) represents an arbitrary random variable out of the infinitely many random variables introduced to form contours with origin in ℂ_l. Let the removal rate of γ_l(X^a, t) be ϕ(X^a, t), and let ϕ_3(X^a, t) be the removal rate of the data points at the intersections of one or more contours. A general differential equation describing the dynamics due to the removal process of points in the bundle is

$$\frac{dB_{\mathbb{R}}(\mathbb{C})}{dt} = B_{\mathbb{R}}(\mathbb{C}) - \int_{X^a} \phi(X^a, t)\, dt\, [B_{\mathbb{R}}(\mathbb{C})] - \int_{X^a} \phi_3(X^a, t)\, dt\, [B_{\mathbb{R}}(\mathbb{C})]. \qquad (86)$$

In (86), the term $\int_{X^a} \phi(X^a, t)\, dt$ represents the overall removal rate for the infinitely many contours. At each iteration of Eq. (86), the quantity $\int_{X^a} \phi(X^a, t)\, dt$ will be updated based on the new removal of a certain contour. Similarly, $\int_{X^a} \phi_3(X^a, t)\, dt$ represents the removal rates of the sets of points available at the intersections of the contours.
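A finite truncation makes the aggregated rates in (86) computable. The sketch below sums per-contour removal rates over N sampled contours and applies Euler updates to B; the number of contours, the rate forms, and the grid are all illustrative assumptions standing in for the infinite family.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                   # finite truncation of the family
psi   = rng.uniform(0.001, 0.005, N)      # assumed per-contour rate scales
psi_3 = 0.1 * psi                         # assumed intersection-removal scales

B, dt = 1.0, 1e-3
for step in range(int(10.0 / dt)):
    t = step * dt
    # Riemann-sum stand-ins for the two aggregated terms in (86).
    Phi   = np.sum(psi * t)
    Phi_3 = np.sum(psi_3 * t)
    B += dt * (B - (Phi + Phi_3) * B)
print(B)  # net growth until the aggregated rates exceed the unit growth rate
```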
4 Islands and holes in B_ℝ(ℂ)

The removal process on the bundle will create islands and holes due to the overlapping (intersections) of infinitely many contours within B_ℝ(ℂ). These islands will never be reachable again by a newly introduced two-step randomness. See Fig. 10.
FIG. 10 Islands of deleted data due to a removal process on infinitely many contours.

Suppose we introduce infinitely many random variables of the type X^a(z_l(t), ℂ_l) that we saw in Section 3, but all introduced at the same time in ℂ_l. Let the number of these random variables be such that they are in one-to-one correspondence with the members of the complex plane ℂ_l. Suppose these start creating data for the formation of infinitely many contours. One contour is assumed not to block another contour from using its data points; this is explained further in the following sentences. Suppose X^a(z_l(t), ℂ_l) and X^b(z_l(t), ℂ_l) are two of these infinitely many random variables, chosen arbitrarily, whose contours were initiated at the same time t_0. By construction, they have different origins in ℂ_l. Then the set of complex numbers {z_l(X^a, t)} (t ∈ [t_0, ∞)) used in the formation of γ_l(X^a, t) and the set of numbers used in the formation of γ_l(X^b, t) could have a nonempty intersection. That is,

$$\{z_l : z_l \in B_{\mathbb{R}}(\mathbb{C}), \{z_l(X^a, t)\}\ (t \in (t_0, \infty))\ \text{and}\ z_{l_0}(X^a) \in \mathbb{C}_l\} \cap \{z_l : z_l \in B_{\mathbb{R}}(\mathbb{C}), \{z_l(X^b, t)\}\ (t \in (t_0, \infty))\ \text{and}\ z_{l_0}(X^b) \in \mathbb{C}_l\} \neq \phi\ (\text{empty}) \qquad (87)$$

or

$$\{z_l : z_l \in B_{\mathbb{R}}(\mathbb{C}), \{z_l(X^a, t)\}\ (t \in (t_0, \infty))\ \text{and}\ z_{l_0}(X^a) \in \mathbb{C}_l\} \cap \{z_l : z_l \in B_{\mathbb{R}}(\mathbb{C}), \{z_l(X^b, t)\}\ (t \in (t_0, \infty))\ \text{and}\ z_{l_0}(X^b) \in \mathbb{C}_l\} = \phi\ (\text{empty}). \qquad (88)$$
Any two contours that have different initial values need not be disjoint. If every z_l(X^a, t), for every t ∈ (t_0, ∞), has no overlap with any element of {z_l(X^b, t)} (t ∈ (t_0, ∞)), then that could be purely due to the random environment created in Section 3. Either of these contours, or both, could be multilevel contours and have origins in ℂ_l. The formation of multilevel contours and the randomness at ℂ_0 ∩ ℂ_l described earlier remain the same. Two contours might have points of intersection within B_ℝ(ℂ), but such points of intersection need not behave like the common points of ℂ_0 ∩ ℂ_l ⊂ B_ℝ(ℂ). This means the set of points on

$$\gamma_l(X^a, t) \cap \gamma_l(X^b, t)$$

for which

$$\gamma_l(X^a, t) \cap \gamma_l(X^b, t) \neq \phi\ (\text{empty})$$

cannot be used for changing the plane of the contours. However, the set of points on

$$\gamma_l(X^a, t) \cap \gamma_l(X^b, t) \subset \mathbb{C}_0 \qquad (89)$$

for any two arbitrary random variables X^a(z_l(t), ℂ_l) and X^b(z_l(t), ℂ_l) could behave similarly to the points on ℂ_0 ∩ ℂ_l. The points on ℂ_0 ∩ ℂ_p, for any arbitrary ℂ_p ⊂ B_ℝ(ℂ), will have similar properties of forming a multilevel contour as described in Section 3.
Suppose

$$\gamma_l(X^a, t) \cap \gamma_l(X^b, t) \subset \mathbb{C}_p \qquad (90)$$

for some arbitrary plane ℂ_p ⊂ B_ℝ(ℂ), where γ_l(X^a, t) and γ_l(X^b, t) have origins in ℂ_l.

Suppose (90) is satisfied at t = t_d; then a disc D(z_{l_d}(X^a), r_d(X^a)) with center z_{l_d}(X^a) and radius r_d(X^a) is formed such that

$$D(z_{l_d}(X^a), r_d(X^a)) \subset \mathbb{C}_p \qquad (91)$$

and the next iteration point of γ_l(X^a, t) after z_{l_d}(X^a) lies in ℂ_p and not in ℂ_0 ∩ ℂ_p. Suppose z_{l_{d'}}(X^a) is the point generated after z_{l_d}(X^a), with z_{l_{d'}}(X^a) ∈ D(z_{l_d}(X^a), r_d(X^a)); then

$$z_{l_{d'}}(X^a) \in \mathbb{C}_p \quad \text{and} \quad z_{l_{d'}}(X^a) \notin \mathbb{C}_0 \cap \mathbb{C}_p.$$

A contour drawn during [t_d, t_{d'}] to reach z_{l_{d'}}(X^a) from z_{l_d}(X^a), with the distance

$$\int_{a_0^{l_d}}^{a_0^{l_{d'}}} \left| z[u_{l_{d'}}(X^a, \tau)] \right| u'_{l_{d'}}(X^a, \tau)\, d\tau, \qquad (92)$$

lies on ℂ_p. Here the real-valued function u_{l_{d'}}(X^a, τ) maps [a_0^{l_d}, a_0^{l_{d'}}) onto the interval [t_d, t_{d'}]. Because z_{l_d}(X^a) lies on γ_l(X^a, t) satisfying (90), it could contribute in the next step to form γ_l(X^a, t) or γ_l(X^b, t). In either situation, the distance in (92) lies in ℂ_p. Hence a point in B_ℝ(ℂ), if it is in ℂ_0 ∩ ℂ_p for some arbitrary ℂ_p, has two options to produce a new point on the contour and so continue contour formation. The description of the formation of contours at the intersection of γ_l(X^a, t) and γ_l(X^b, t) is also true if there are more than two intersecting contours. We also note that the infinitely many contours up to time t_d, which were all introduced at t_0, could have different lengths based on the areas of the discs formed and the points chosen by the corresponding random variables. So the set of lengths
$$\left\{ \int_{a_0^{l_d}}^{a_0^{l_{d'}}} \left| z[u_{l_{d'}}(X^a, \tau)] \right| u'_{l_{d'}}(X^a, \tau)\, d\tau,\ \int_{a_0^{l_d}}^{a_0^{l_{d'}}} \left| z[u_{l_{d'}}(X^b, \tau)] \right| u'_{l_{d'}}(X^b, \tau)\, d\tau,\ \ldots \right\} \qquad (93)$$

could have different spaces occupied in B ðÞ. The location of each contour
after some long time t∞ for t∞≫ td0 could be anywhere in the bundle and they
could be situated in any plane. The set of lengths
(Z l ∞ Z al∞  )
a0 
0
 
l
jz½ul∞ ðXa , τÞju0l∞ ðXa , τÞdτ, l z½ul0∞ ðXb , τÞu0l∞ ðXb , τÞdτ,… (94)
a0d a0d

and the spaces occupied by


γ l ðXa ,tÞ,γ l ðXb , tÞ, …,
are ever evolving within B_ℝ(ℂ). For a point z_{l_{d(2)}}(X^a) ∈ ℂ_p with z_{l_{d(2)}}(X^a) ∉ ℂ_l ∩ ℂ_p, the equality

$$\int_{a_0^{l_d}}^{a_0^{l_{d(1)}}} \left| z[u_{l_{d(1)}}(X^a, \tau)] \right| u'_{l_{d(1)}}(X^a, \tau)\, d\tau + \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau = \int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d^*(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau \qquad (95)$$

holds for z_{l_d}(X^a) ∈ ℂ_0 ∩ ℂ_l and z_{l_{d(1)}}(X^a) ∈ ℂ_0 ∩ ℂ_p. For all such z_{l_d}(X^b) ∈ ℂ_p and z_{l_{d(2)}}(X^b) ∉ ℂ_l ∩ ℂ_p, with z_{l_d}(X^b) ≠ z_{l_d}(X^a), z_{l_{d(1)}}(X^b) ≠ z_{l_{d(1)}}(X^a), and z_{l_{d(2)}}(X^b) ≠ z_{l_{d(2)}}(X^a), the equality

$$\int_{a_0^{l_d}}^{a_0^{l_{d(1)}}} \left| z[u_{l_{d(1)}}(X^b, \tau)] \right| u'_{l_{d(1)}}(X^b, \tau)\, d\tau + \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d(2)}}(X^b, \tau)] \right| u'_{l_{d(2)}}(X^b, \tau)\, d\tau = \int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d^*(2)}}(X^b, \tau)] \right| u'_{l_{d(2)}}(X^b, \tau)\, d\tau \qquad (96)$$

holds for z_{l_d}(X^b) ∈ ℂ_0 ∩ ℂ_l and z_{l_{d(1)}}(X^b) ∈ ℂ_0 ∩ ℂ_p. From (95) and (96), we also see that

$$\int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d^*(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau > \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau \qquad (97)$$

and

$$\int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d^*(2)}}(X^b, \tau)] \right| u'_{l_{d(2)}}(X^b, \tau)\, d\tau > \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d(2)}}(X^b, \tau)] \right| u'_{l_{d(2)}}(X^b, \tau)\, d\tau, \qquad (98)$$

because the multilevel contours γ_l(X^a, t) and γ_l(X^b, t), whose distances are in (95) and (96), have to pass through the plane ℂ_0 ∩ ℂ_p. For all sets of three numbers of the type z_{l_d}(X^b), z_{l_{d(1)}}(X^b), and z_{l_{d(2)}}(X^b) lying in ℂ_0 ∩ ℂ_l, ℂ_0 ∩ ℂ_p, and ℂ_p, for arbitrary ℂ_l and ℂ_p in B_ℝ(ℂ), the equality

$$\sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} \left| z[u_{l_{d(i)}}(X^b, \tau)] \right| u'_{l_{d(i)}}(X^b, \tau)\, d\tau + \sum_{i=1}^{\infty} \int_{a_0^{l_{d(i)}}}^{a_0^{l_{d(i+1)}}} \left| z[u_{l_{d(i+1)}}(X^b, \tau)] \right| u'_{l_{d(i+1)}}(X^b, \tau)\, d\tau = \sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} \left| z[u_{l_{d^*(i)}}(X^b, \tau)] \right| u'_{l_{d(i)}}(X^b, \tau)\, d\tau \qquad (99)$$

holds, and (99) leads to

$$\sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} \left| z[u_{l_{d^*(i)}}(X^b, \tau)] \right| u'_{l_{d(i)}}(X^b, \tau)\, d\tau > \sum_{i=1}^{\infty} \int_{a_0^{l_{d(i)}}}^{a_0^{l_{d(i+1)}}} \left| z[u_{l_{d(i+1)}}(X^b, \tau)] \right| u'_{l_{d(i+1)}}(X^b, \tau)\, d\tau. \qquad (100)$$
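The step from (95) to (97) deserves one line of justification, which the chapter leaves implicit: since the integrand |z[u(τ)]| u'(τ) is nonnegative for an increasing parametrization, subtracting the second term on the left of (95) from both sides gives

$$\int_{a_0^{l_d}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d^*(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau - \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau = \int_{a_0^{l_d}}^{a_0^{l_{d(1)}}} \left| z[u_{l_{d(1)}}(X^a, \tau)] \right| u'_{l_{d(1)}}(X^a, \tau)\, d\tau > 0,$$

where the strict positivity holds because the contour actually traverses a leg of positive length through ℂ_0 ∩ ℂ_p; the same argument yields (98) and (100).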
Consider an arbitrary plane ℂ_q lying somewhere above ℂ_p, with ℂ_p lying somewhere above ℂ_l, for ℂ_l, ℂ_p, and ℂ_q in B_ℝ(ℂ). Let the five points (numbers) in the bundle be arranged as follows:

$$z_{l_d}(X^a) \in \mathbb{C}_0 \cap \mathbb{C}_l, \quad z_{l_{d(1)}}(X^a) \in \mathbb{C}_0 \cap \mathbb{C}_p,$$
$$z_{l_{d(2)}}(X^a) \notin \mathbb{C}_0 \cap \mathbb{C}_p \ \text{and}\ z_{l_{d(2)}}(X^a) \in \mathbb{C}_p, \quad z_{l_{d(3)}}(X^a) \in \mathbb{C}_0 \cap \mathbb{C}_p \ (\text{reachable from}\ z_{l_{d(2)}}(X^a)),$$
$$z_{l_{d(4)}}(X^a) \in \mathbb{C}_0 \cap \mathbb{C}_q, \quad \text{and} \quad z_{l_{d(5)}}(X^a) \notin \mathbb{C}_0 \cap \mathbb{C}_q \ \text{and}\ z_{l_{d(5)}}(X^a) \in \mathbb{C}_q.$$

Then the equality arising out of these points is

$$\begin{aligned}
&\int_{a_0^{l_d}}^{a_0^{l_{d(1)}}} \left| z[u_{l_{d(1)}}(X^a, \tau)] \right| u'_{l_{d(1)}}(X^a, \tau)\, d\tau
+ \int_{a_0^{l_{d(1)}}}^{a_0^{l_{d(2)}}} \left| z[u_{l_{d(2)}}(X^a, \tau)] \right| u'_{l_{d(2)}}(X^a, \tau)\, d\tau \\
&+ \int_{a_0^{l_{d(2)}}}^{a_0^{l_{d(3)}}} \left| z[u_{l_{d(3)}}(X^a, \tau)] \right| u'_{l_{d(3)}}(X^a, \tau)\, d\tau
+ \int_{a_0^{l_{d(3)}}}^{a_0^{l_{d(4)}}} \left| z[u_{l_{d(4)}}(X^a, \tau)] \right| u'_{l_{d(4)}}(X^a, \tau)\, d\tau \\
&+ \int_{a_0^{l_{d(4)}}}^{a_0^{l_{d(5)}}} \left| z[u_{l_{d(5)}}(X^a, \tau)] \right| u'_{l_{d(5)}}(X^a, \tau)\, d\tau
= \int_{a_0^{l_d}}^{a_0^{l_{d(5)}}} \left| z[u_{l_{d^*(5)}}(X^a, \tau)] \right| u'_{l_{d(5)}}(X^a, \tau)\, d\tau, \qquad (101)
\end{aligned}$$
where the real-valued function u_{l_{d^*}} maps the appropriate time intervals after the parametric representation. For all such sets of five points in the bundle, and for the three sets of planes, we will have

$$\begin{aligned}
&\sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(i)}}} \left| z[u_{l_{d(i)}}(X^a, \tau)] \right| u'_{l_{d(i)}}(X^a, \tau)\, d\tau
+ \sum_{i=1}^{\infty} \int_{a_0^{l_{d(1+i)}}}^{a_0^{l_{d(2+i)}}} \left| z[u_{l_{d(2+i)}}(X^a, \tau)] \right| u'_{l_{d(2+i)}}(X^a, \tau)\, d\tau \\
&+ \sum_{i=1}^{\infty} \int_{a_0^{l_{d(2)}}}^{a_0^{l_{d(3)}}} \left| z[u_{l_{d(3)}}(X^a, \tau)] \right| u'_{l_{d(3)}}(X^a, \tau)\, d\tau
+ \sum_{i=1}^{\infty} \int_{a_0^{l_{d(3+i)}}}^{a_0^{l_{d(4+i)}}} \left| z[u_{l_{d(4+i)}}(X^a, \tau)] \right| u'_{l_{d(4+i)}}(X^a, \tau)\, d\tau \\
&+ \sum_{i=1}^{\infty} \int_{a_0^{l_{d(4+i)}}}^{a_0^{l_{d(5+i)}}} \left| z[u_{l_{d(5+i)}}(X^a, \tau)] \right| u'_{l_{d(5+i)}}(X^a, \tau)\, d\tau
= \sum_{i=1}^{\infty} \int_{a_0^{l_d}}^{a_0^{l_{d(5+i)}}} \left| z[u_{l_{d^*(5+i)}}(X^a, \tau)] \right| u'_{l_{d(5+i)}}(X^a, \tau)\, d\tau. \qquad (102)
\end{aligned}$$
A hole is formed in B_ℝ(ℂ) due to the loss of a set of points (numbers) of data in which a piece of a contour was located (before it got deleted due to a removal process), or due to the loss of a set of data points of a group of contours. The line consisting of points is lost due to a removal process, but all the other points around the line, or the area around a hole, could still be chosen by a random variable. We introduce and define two new sets, namely, a hole and an island.

Definition 2. Hole: A hole H is a closed set of points in B_ℝ(ℂ) such that no point of H is available to be chosen by an arbitrary X^a(z_l(t), ℂ_l) for an arbitrary plane ℂ_l.

Example 1. Let {z_l(X^a, t)} be the set of numbers of γ_l(X^a, t), for t ∈ [t_0, t_b], that got deleted due to a removal process. Then X^a(z_l(t), ℂ_l) would not be able to choose a number from {z_l(X^a, t)} for t ∈ [t_0, t_b]. There could be many such holes in B_ℝ(ℂ). If two or more contours use the same set of points for a period of time, then the removal process of one contour could delete the common set of data, so that a hole is formed. We will soon see that the space created by the set H is dynamic.
Definition 3. Island: Let S ⊂ B_ℝ(ℂ) and S ≠ ϕ (empty). The set S is called an island if any element in S is available to be chosen by a random variable, but no contour can be drawn from an element within S to an element outside S, say, in S^c. Here S, S^c ⊂ B_ℝ(ℂ).
Example 2. Consider a disc D(z_{l_d}(X^a), r_d(X^a)) with a center z_{l_d}(X^a) chosen by X^a(z_l(t), ℂ_l) in a previous iteration. Let two contours γ_l(X^b, t) and γ_l(X^c, t) pass through D(z_{l_d}(X^a), r_d(X^a)) and intersect at two locations, say, z_{l_i}(X^b) and z_{l_j}(X^b), as shown in Fig. 11. Suppose the spaces of points of γ_l(X^b, t) and γ_l(X^c, t) passing through D(z_{l_d}(X^a), r_d(X^a)) were lost due to the respective variables' removal processes. Then the space formed between these two points of intersection, including the data on the contours between z_{l_i}(X^b) and z_{l_j}(X^b), is an island.

FIG. 11 An island is created within D(z_{l_d}(X^a), r_d(X^a)).
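The hole/island distinction in Definitions 2 and 3 can be illustrated on a discretized patch of a plane: removed contour points form a hole, and any pocket of still-available points that is cut off from the rest of the space is an island. The grid, the removed curve, and the resolution below are all illustrative assumptions, not constructions from the chapter.

```python
import numpy as np
from scipy.ndimage import label

n = 200
yy, xx = np.mgrid[0:n, 0:n]

# Hole: a closed ring of removed points (deleted contour data).
r = np.hypot(xx - n / 2, yy - n / 2)
removed = (r > 55) & (r < 60)           # removed band plays the role of H
available = ~removed                     # points a random variable may pick

# Connected components of the available space; a component that does not
# touch the outer boundary is an island in the sense of Definition 3.
labels, k = label(available)
border_labels = set(labels[0, :]) | set(labels[-1, :]) | \
                set(labels[:, 0]) | set(labels[:, -1])
islands = [c for c in range(1, k + 1) if c not in border_labels]
print(f"{k} components, islands: {islands}")  # the enclosed disc interior
```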
The two sets H and S are dynamic, as the spaces created by these sets could change because of the dynamic nature of the contour formation and removal processes described in Section 3. Time-dependent versions of the definitions of holes and islands can be given here. A set S(t), for t ∈ [t_0, ∞), satisfying Definition 3 can be called an island at t. The set of elements of S(t_c), for t_c ∈ [t_0, ∞), satisfying Definition 3 might lose all its elements in a removal process and might turn into a hole at a time t_d for t_d > t_c. If S(t) is an island, then one cannot draw a contour from the elements of S(t) to an element in S^c(t), where

$$S^c(t) = \{z : z \notin S(t) \subset B_{\mathbb{R}}(\mathbb{C}) \ \text{and}\ z \in B_{\mathbb{R}}(\mathbb{C})\}. \qquad (103)$$

Similarly, a contour cannot be drawn from an element (point) of S^c(t) to an element in S(t). The spaces of S(t) and S^c(t) are separated by H. The area of a hole could change over time as more removed data points are added to a specific hole.
Theorem 13. Suppose a disc D(z_{l_d}(X^a), r_d(X^a)) formed out of X^a(z_l(t), ℂ_l) is given. Let z_{l_1} and z_{l_2} be two points in the boundary set of D. A contour C_1 formed by an arbitrary process X^b(z_l(t), ℂ_l) enters the disc through z_{l_1} and leaves the disc from z_{l_2}, and another contour C_2 formed by an arbitrary process X^c(z_l(t), ℂ_l) enters the disc through z_{l_1} and leaves the disc from z_{l_2}. The paths of C_1 and C_2 never meet except at the points z_{l_1} and z_{l_2}, and the center z_{l_d}(X^a) lies in between the contours. Suppose the removal processes of the two contours C_1 and C_2 are introduced such that C_1 ∪ C_2 forms a hole at t. Then the set of points lying between C_1 and C_2 forms an island.
Proof. It is given that z_{l_1} and z_{l_2} are located on the boundary of the disc D(z_{l_d}(X^a), r_d(X^a)) and that z_{l_d}(X^a) is located in between C_1 and C_2; see Fig. 12 for a description of the given information and the locations of z_{l_1} and z_{l_2}.

Let S(t) be the set in D(z_{l_d}(X^a), r_d(X^a)) that consists of all points in between C_1 and C_2, as shown in Fig. 12. The set D(z_{l_d}(X^a), r_d(X^a)) \ S(t) consists of the points in (104),

$$D(z_{l_d}(X^a), r_d(X^a)) \setminus S(t) = \{z_l : z_l \in D(z_{l_d}(X^a), r_d(X^a)) \ \text{and}\ z_l \notin S(t)\}. \qquad (104)$$

The disc D(z_{l_d}(X^a), r_d(X^a)) can be partitioned into a disjoint union of three sets as below:

$$D(z_{l_d}(X^a), r_d(X^a)) = [D(z_{l_d}(X^a), r_d(X^a)) \setminus S(t)] \cup S(t) \cup [C_1 \cup C_2]. \qquad (105)$$

By the construction, we cannot draw a contour from a point in S(t) to a point in D(z_{l_d}(X^a), r_d(X^a)) \ S(t). Hence, S(t) is an island. □

FIG. 12 Formation of an island due to a removal process as in Theorem 13.
Theorem 14. The union of the collection of all holes within B_ℝ(ℂ) is compact.

Proof. Each hole is a closed set of points from a contour or a collection of contours. Let H_α be an arbitrary hole. An infinite union of such holes,

$$\bigcup_{\alpha} H_\alpha, \qquad (106)$$

is closed. Each H_α at time t is bounded by the length of the contour formed until the time t + 1, because the removal process follows contour formation with some lag in time. So

$$\left| \bigcup_{\alpha} H_\alpha \right| < \int_{a_0^{l_d}}^{a_0^{l_{d(t+i)}}} \left| z[u_{l_{d(t+i)}}(X^a, \tau)] \right| u'_{l_{d(t+i)}}(X^a, \tau)\, d\tau \qquad (107)$$

for an arbitrary process X^a(z_l(t), ℂ_l). So the union of a collection of holes is bounded. Hence, such a collection is compact. □
Theorem 15. Suppose infinitely many random variables of the type X^a(z_l(t), ℂ_l) are introduced one by one as in Section 3. If every point within an island S(t) is part of a contour at time t, then due to a removal process S(t) becomes H(t_c) for t_c > t.

Proof. It is given that there is an island S(t) at time t. Let there be finitely many contours passing through the region S(t) such that all the points of S(t) lie on one or more of the contours. Once a removal process is introduced at t_c > t, all the points of S(t) will become unavailable to any new random variable. Hence, S(t) will asymptotically become a hole. □
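A minimal set-based sketch of the mechanism in Theorem 15, under the assumption that the island is finite and every island point lies on one of two contours (all values below are hypothetical): once the removal process deletes the contour points, no island point can be chosen anymore, i.e., the island has turned into a hole.

```python
# Points of an island S(t); assume each lies on one of two contours.
island = {(0, 0), (0, 1), (1, 0), (1, 1)}
contours = [
    {(0, 0), (0, 1)},   # contour 1 covers part of the island
    {(1, 0), (1, 1)},   # contour 2 covers the rest
]

available = set(island)
for c in contours:           # removal process deletes the contour points
    available -= c

# No island point remains choosable: S(t) has become a hole H(t_c).
print(available == set())    # True
```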
Let us introduce a removal process for the infinitely many contours introduced earlier in this section. These contours all started at the same time t_0. The number of such contours is equivalent to the set of elements in ℂ_l. Some of the elements in ℂ_l are also in ℂ_0 ∩ ℂ_l. Note that

$$\mathbb{C}_l = (\mathbb{C}_0 \cap \mathbb{C}_l) \cup \mathbb{C}_l = \{z_l : z_l \in \mathbb{C}_l \ \text{or}\ z_l \in \mathbb{C}_0 \cap \mathbb{C}_l\}. \qquad (108)$$

So removing the contour data that was formed from t_0 until a time t_c, at a rate ϕ(X^a, t) for an arbitrary random variable X^a(z_l(t), ℂ_l), would also remove contour data located in the set ℂ_0 ∩ ℂ_l. We described earlier how the islands of sets of data could be formed, and the formation of islands could happen in distinct time intervals once a removal process is introduced.

Remember that all the contours have their origins in ℂ_l only. Now, at t_c, due to a removal process, the points of ℂ_l are not available to be chosen by any of the infinitely many random variables. The paths of these random variables are not disjoint. Some of these random variables may be in another plane outside ℂ_l at t = t_c. Contour formation for these infinitely many random variables of type X^a(z_l(t), ℂ_l) may continue even after t_c because of their presence outside ℂ_l at t = t_c. So the entire plane ℂ_l is not available in the bundle B_ℝ(ℂ), and the remaining space formed will be the set of points B_ℝ(ℂ) \ ℂ_l, where

$$B_{\mathbb{R}}(\mathbb{C}) \setminus \mathbb{C}_l = \{z_l : z_l \in B_{\mathbb{R}}(\mathbb{C}) \ \text{and}\ z_l \notin \mathbb{C}_l\}. \qquad (109)$$
4.1 Consequences of B_ℝ(ℂ)\ℂ_l on multilevel contours
At the time t = t_c, there could be one of the following two situations for the status of the infinitely many random variables that were introduced at t_0.

(i) All the contours formed until t_c are active only in ℂ_l at t = t_c, and for any arbitrary X^a(z_l(t), ℂ_l) the set of points {z_l(X^a, t)} for t ∈ [t_0, t_c] is in ℂ_l, i.e.,

$$\{z_l(X^a, t)\}\ (t \in [t_0, t_c]) \subset \mathbb{C}_l. \qquad (110)$$

This means no contour has crossed the plane until t = t_c.

(ii) Only a fraction of the infinitely many random variables are active in ℂ_l at t = t_c, and the rest of the random variables are active outside ℂ_l.

If every contour satisfies situation (i) at t = t_c, then due to the removal process all the points in ℂ_l cannot be reached to form a contour. In such a situation, the plane ℂ_l forms a hole, and the contour formation process halts. Suppose only a fraction of the infinitely many random variables, say α_1, are inside ℂ_l at t_c; then there will be two options for the location of the remaining random variables, i.e., for the fraction (1 − α_1) that are outside ℂ_l at t_c. A fraction of them, say α_2, will be above ℂ_l, and the remaining fraction (1 − α_2) of the random variables will be somewhere in a plane below ℂ_l. Let T be the set of all contours active at t_c and α_1 the fraction of them that are active and located in ℂ_l; then (1 − α_1)|T| is the number of contours that are active and outside ℂ_l. Then (1 − α_1)|T| is further divided as

$$(1 - \alpha_1)|T| = \alpha_2[(1 - \alpha_1)|T|] + (1 - \alpha_2)[(1 - \alpha_1)|T|], \qquad (111)$$

where α_2 is the fraction of contours that are active at t_c and are located somewhere in a plane above ℂ_l, and 1 − α_2 is the fraction of contours that are active at t_c and are located somewhere in a plane below ℂ_l. Because the plane ℂ_l became a hole, the set of contours, say T_{α_1}, denoting the α_1|T| contours, will stop further formation for t > t_c. The set of contours, say T_{1−α_1}, will be active outside ℂ_l, where

$$T_{1-\alpha_1} = T_{\alpha_2} \cup T_{1-\alpha_2}. \qquad (112)$$

Here T_{α_2} represents the set of contours that are located somewhere in a plane above ℂ_l, and T_{1−α_2} represents the set of contours that are located somewhere in a plane below ℂ_l. The cardinality of T_{1−α_1} is constant for t > t_c, as are the cardinalities of T_{α_2} and T_{1−α_2} at t > t_c. We also partition the set of elements in B_ℝ(ℂ) at t > t_c as

$$B_{\mathbb{R}}(\mathbb{C})|_{t > t_c} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2) \cup B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2), \qquad (113)$$

whereas

$$B_{\mathbb{R}}(\mathbb{C}) = B_{\mathbb{R}}(\mathbb{C}, \alpha_2) \cup \mathbb{C}_l \cup B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2), \qquad (114)$$
where B_ℝ(ℂ, α_2) ⊂ B_ℝ(ℂ) is the set of planes lying above ℂ_l, and B_ℝ(ℂ, 1 − α_2) ⊂ B_ℝ(ℂ) is the set of planes lying below ℂ_l. The set of contours that are active after t_c is located in the set, say, B_ℝ(ℂ, 1 − α_1), and from (113) we have

$$B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2) = B_{\mathbb{R}}(\mathbb{C})|_{t > t_c}. \qquad (115)$$

As described above, the contour formations and removal processes of the contours in B_ℝ(ℂ, 1 − α_2) will continue. Due to the presence of the hole ℂ_l, the active contours of B_ℝ(ℂ, α_2) and B_ℝ(ℂ, 1 − α_2) will not have any further intersecting points. The tails of the contours remaining in B_ℝ(ℂ, α_2) and B_ℝ(ℂ, 1 − α_2) will eventually be lost some time after t_c. Let γ_l(X^a, t) be an arbitrary contour that is active in B_ℝ(ℂ, α_2) and was created by X^a. Suppose the set of points touched by the contour γ_l(X^a, t) prior to t_c was located in B_ℝ(ℂ, α_2), ℂ_l, and B_ℝ(ℂ, 1 − α_2); this contour is described by z_l(X^a, t) (t ∈ [t_0, ∞)), and t = u_{l_c}(X^a, τ) (a_0 ≤ τ ≤ a_c) is the parametric representation for γ_l(X^a, t), with a real-valued function u_l(X^a, τ) mapping [a_0, a_c] onto the interval [t_0, t_c]. Let L(γ_l(X^a, t)(t ∈ [t_0, t_c])) represent the length of γ_l(X^a, t) (t ∈ [t_0, t_c]) up to t_c; then

$$L(\gamma_l(X^a, t)(t \in [t_0, t_c])) = \int_{a_0}^{a_c} \left| z[u_{l_c}(X^a, \tau)] \right| u'_{l_c}(X^a, \tau)\, d\tau \qquad (116)$$
had covered points from each disjoint set of B_ℝ(ℂ) in (114). Let us assume that γ_l(X^a, t) (t ∈ [t_0, t_c]) has visited each of the sets of (114) a multiple number of times before it remained active in B_ℝ(ℂ, α_2) at t = t_c. Then L(γ_l(X^a, t)(t ∈ [t_0, t_c])) in (116) can be expressed as three components, where each component is made up of several contour integrals. Since γ_l(X^a, t) had visited each portion in (116) several times, the length L(γ_l(X^a, t)(t ∈ [t_0, t_c])) in (116) is distributed into corresponding parts. The first part consists of the sum of all the lengths of the piecewise contours of (116) lying in ℂ_l, say, L(γ_l(X^a, t, α_1)(t ∈ [t_0, t_c])), and can be computed using

$$L(\gamma_l(X^a, t, \alpha_1)(t \in [t_0, t_c])) = \int_{a(0)}^{a(1)} \left| z[u_{l_{a(1)}}(X^a, \tau)] \right| u'_{l_{a(1)}}(X^a, \tau)\, d\tau + \sum_{i \in A(\alpha_1)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau, \qquad (117)$$

where u_{l_{a(1)}}(X^a, τ) and u_{l_{a(i+1)}}(X^a, τ) are the real-valued functions used in the parametric representations, with corresponding onto mappings. The notation i ∈ A(α_1) indicates summing the length over all the piecewise contours in the set A(α_1). The set A(α_1) consists of all the piecewise contours of X^a(z_l(t), ℂ_l) until t_c that are lying in ℂ_l. The first integral on the R.H.S. of (117) is the length of the piecewise contour from its origin to the entry point either in B_ℝ(ℂ, α_2) or in B_ℝ(ℂ, 1 − α_2). The sum of integrals on the R.H.S. of (117) is the total length of the piecewise contours due to A(α_1).

The second part in (116) consists of piecewise contours in B_ℝ(ℂ, α_2), whose total length, say, L(γ_l(X^a, t, α_2)(t ∈ [t_0, t_c])), is computed as

$$L(\gamma_l(X^a, t, \alpha_2)(t \in [t_0, t_c])) = \sum_{i \in A(\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau + \int_{a_{c-1}}^{a_c} \left| z[u_{l_c}(X^a, \tau)] \right| u'_{l_c}(X^a, \tau)\, d\tau. \qquad (118)$$

The set A(α_2) in (118) consists of all the piecewise contours of X^a(z_l(t), ℂ_l) until t_c that are lying in B_ℝ(ℂ, α_2). The second integral on the R.H.S. of (118) is the length of the last piece of the contour γ_l(X^a, t) until t = t_c in B_ℝ(ℂ, α_2). Here u_{l_c}(X^a, τ) is the real-valued function used in the parametric representations with corresponding onto mappings. The third part in (116) consists of piecewise contours in B_ℝ(ℂ, 1 − α_2), whose total length, say, L(γ_l(X^a, t, 1 − α_2)(t ∈ [t_0, t_c])), is computed as

$$L(\gamma_l(X^a, t, 1 - \alpha_2)(t \in [t_0, t_c])) = \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau. \qquad (119)$$

Hence the length in (116) can be expressed using (117), (118), and (119) as

$$\begin{aligned}
L(\gamma_l(X^a, t)(t \in [t_0, t_c])) &= \int_{a(0)}^{a(1)} \left| z[u_{l_{a(1)}}(X^a, \tau)] \right| u'_{l_{a(1)}}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(\alpha_1)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau \\
&\quad + \int_{a_{c-1}}^{a_c} \left| z[u_{l_c}(X^a, \tau)] \right| u'_{l_c}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau. \qquad (120)
\end{aligned}$$

Due to the hole ℂ_l created in the bundle B_ℝ(ℂ), the remaining length of the contour that will be subjected to the removal process is obtained by removing the sum of the piecewise contour lengths in ℂ_l, and is given by

$$\begin{aligned}
L(\gamma_l(X^a, t)(t \in [t_0, t_c])) &= \sum_{i \in A(\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau \\
&\quad + \int_{a_{c-1}}^{a_c} \left| z[u_{l_c}(X^a, \tau)] \right| u'_{l_c}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau. \qquad (121)
\end{aligned}$$
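The decomposition (117)–(121) amounts to bookkeeping: each piecewise segment of a contour is assigned to ℂ_l, to the planes above it, or to the planes below it, and the per-region lengths must add up to the total in (116). The sketch below mimics this with a hypothetical contour sampled as discrete segments, each carrying a made-up region label; nothing here comes from the chapter beyond the additivity being checked.

```python
import numpy as np

rng = np.random.default_rng(1)
n_seg = 30
seg_lengths = rng.uniform(0.1, 1.0, size=n_seg)       # piecewise lengths
# Region of each segment: 'l' (the plane C_l), 'above', or 'below'.
regions = rng.choice(['l', 'above', 'below'], size=n_seg)

L_total  = seg_lengths.sum()                           # as in (116)
L_alpha1 = seg_lengths[regions == 'l'].sum()           # (117): in C_l
L_alpha2 = seg_lengths[regions == 'above'].sum()       # (118): above C_l
L_rest   = seg_lengths[regions == 'below'].sum()       # (119): below C_l

assert np.isclose(L_total, L_alpha1 + L_alpha2 + L_rest)   # (120)
remaining = L_total - L_alpha1                              # (121)
print(L_total, remaining)  # length still exposed to removal once C_l is a hole
```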
Since γ_l(X^a, t) is active in B_ℝ(ℂ, α_2), the formation of the contour will continue forever, and the sum of the pieces of the lengths of γ_l(X^a, t) that lie in B_ℝ(ℂ, 1 − α_2) will be deleted from B_ℝ(ℂ, 1 − α_2). This deletion could be according to a removal function ϕ(X^a, t, 1 − α_2), similar to the procedure explained in (75). The differential equation to model the space of points lost due to a removal process ϕ(X^a, t, 1 − α_2) is

$$\left. \frac{dB_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2)}{dt} \right|_{\phi(X^a, t, 1-\alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2)\big|_{t=t_c} - \phi(X^a, t)B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2)\big|_{t=t_c} \qquad (122)$$

for

$$\phi(X^a, t, 1 - \alpha_2) = \psi(X^a, t, 1 - \alpha_2) \sum_{i \in A(1-\alpha_2)} \int_{a(i)}^{a(i+1)} \left| z[u_{l_{a(i+1)}}(X^a, \tau)] \right| u'_{l_{a(i+1)}}(X^a, \tau)\, d\tau \quad \left(t \in (t_c, t_{c'}]\right), \qquad (123)$$

with 0 < ψ(X^a, t, 1 − α_2) < 1. Suppose instead that γ_l(X^a, t) is active in B_ℝ(ℂ, 1 − α_2) rather than in B_ℝ(ℂ, α_2). As above, the set of points touched by the contour γ_l(X^a, t) prior to t_c would have been located in B_ℝ(ℂ, α_2), ℂ_l, and B_ℝ(ℂ, 1 − α_2). We describe this contour by z_l(X^a, t) (t ∈ [t_0, ∞)), and t = w_{l_c}(X^a, τ) (a'_0 ≤ τ ≤ a'_c) is the parametric representation for γ_l(X^a, t), with a real-valued function w_{l_c}(X^a, τ) mapping [a'_0, a'_c] onto the interval [t_0, t_c]. Let L(γ_l(X^a, t, 1 − α_2)(t ∈ [t_0, t_c])) represent the length of γ_l(X^a, t) (t ∈ [t_0, t_c]) up to t_c; then

$$L(\gamma_l(X^a, t, 1 - \alpha_2)(t \in [t_0, t_c])) = \int_{a'_0}^{a'_c} \left| z[w_{l_c}(X^a, \tau)] \right| w'_{l_c}(X^a, \tau)\, d\tau \qquad (124)$$
had covered points from each disjoint set of B_ℝ(ℂ) in (114). As above, let us assume that γ_l(X^a, t) (t ∈ [t_0, t_c]) has visited each of the sets of (114) a multiple number of times before it remained active in B_ℝ(ℂ, 1 − α_2) at t = t_c. Then L(γ_l(X^a, t, 1 − α_2)(t ∈ [t_0, t_c])) in (124) can be expressed as three components, where each component is made up of several contour integrals. Since γ_l(X^a, t) had visited each portion in (124) several times, the length L(γ_l(X^a, t, 1 − α_2)(t ∈ [t_0, t_c])) in (124) is distributed into corresponding parts. The first part consists of the sum of all the lengths of the piecewise contours of (124) lying in ℂ_l, say, L(γ_l(X^a, t, α_1)(t ∈ [t_0, t_c])), and can be computed using

$$L(\gamma_l(X^a, t, \alpha_1)(t \in [t_0, t_c])) = \int_{a'(0)}^{a'(1)} \left| z[w_{l_{a'(1)}}(X^a, \tau)] \right| w'_{l_{a'(1)}}(X^a, \tau)\, d\tau + \sum_{i \in A'(\alpha_1)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau, \qquad (125)$$

where w_{l_{a'(1)}}(X^a, τ) and w_{l_{a'(i+1)}}(X^a, τ) are the real-valued functions used in the parametric representations, with corresponding onto mappings. The notation i ∈ A'(α_1) indicates summing the length over all the piecewise contours in the set A'(α_1). The set A'(α_1) consists of all the piecewise contours of X^a(z_l(t), ℂ_l) until t_c that are lying in ℂ_l. The first integral on the R.H.S. of (125) is the length of the piecewise contour from its origin to the entry point either in B_ℝ(ℂ, α_2) or in B_ℝ(ℂ, 1 − α_2). The sum of integrals on the R.H.S. of (125) is the total length of the piecewise contours due to A'(α_1). The second part in (124) consists of piecewise contours in B_ℝ(ℂ, 1 − α_2), whose total length, say, L(γ_l(X^a, t, 1 − α_2)(t ∈ [t_0, t_c])), is computed as

$$L(\gamma_l(X^a, t, 1 - \alpha_2)(t \in [t_0, t_c])) = \sum_{i \in A'(1-\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau + \int_{a'_{c-1}}^{a'_c} \left| z[w_{l_c}(X^a, \tau)] \right| w'_{l_c}(X^a, \tau)\, d\tau. \qquad (126)$$

Note that the contour is active in B_ℝ(ℂ, 1 − α_2). The set A'(1 − α_2) in (126) consists of all the piecewise contours of X^a(z_l(t), ℂ_l) until t_c that are lying in B_ℝ(ℂ, 1 − α_2). The second integral on the R.H.S. of (126) is the length of the last piece of the contour γ_l(X^a, t) until t = t_c in B_ℝ(ℂ, 1 − α_2). Here w_{l_c}(X^a, τ) is the real-valued function used in the parametric representations with corresponding onto mappings. The third part in (124) consists of piecewise contours in B_ℝ(ℂ, α_2), whose total length, say, L(γ_l(X^a, t, α_2)(t ∈ [t_0, t_c])), is computed as

$$L(\gamma_l(X^a, t, \alpha_2)(t \in [t_0, t_c])) = \sum_{i \in A'(\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau. \qquad (127)$$
The length in (124) can be expressed using (125), (126), and (127) as

$$\begin{aligned}
L(\gamma_l(X^a, t, 1 - \alpha_2)(t \in [t_0, t_c])) &= \int_{a'(0)}^{a'(1)} \left| z[w_{l_{a'(1)}}(X^a, \tau)] \right| w'_{l_{a'(1)}}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(\alpha_1)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(1-\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau \\
&\quad + \int_{a'_{c-1}}^{a'_c} \left| z[w_{l_c}(X^a, \tau)] \right| w'_{l_c}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau. \qquad (128)
\end{aligned}$$

Due to the hole ℂ_l created in the bundle B_ℝ(ℂ), the remaining length of the contour that will be subjected to the removal process is obtained by removing the sum of the piecewise contour lengths in ℂ_l, and is given by

$$\begin{aligned}
L(\gamma_l(X^a, t, 1 - \alpha_2)(t \in [t_0, t_c])) &= \sum_{i \in A'(1-\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau \\
&\quad + \int_{a'_{c-1}}^{a'_c} \left| z[w_{l_c}(X^a, \tau)] \right| w'_{l_c}(X^a, \tau)\, d\tau \\
&\quad + \sum_{i \in A'(\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau. \qquad (129)
\end{aligned}$$

Since γ_l(X^a, t) is active in B_ℝ(ℂ, 1 − α_2), the formation of the contour will continue forever. The tail part, that is, the sum of the pieces of the lengths of γ_l(X^a, t) that lie in B_ℝ(ℂ, α_2), will be deleted from B_ℝ(ℂ, α_2). This deletion could be according to a removal function ϕ(X^a, t, α_2), similar to the procedure explained in (122). The differential equation to model the space of points lost due to a removal process ϕ(X^a, t, α_2) is

$$\left. \frac{dB_{\mathbb{R}}(\mathbb{C}, \alpha_2)}{dt} \right|_{\phi(X^a, t, \alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c} - \phi(X^a, t, \alpha_2)B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c}, \qquad (130)$$

where

$$\phi(X^a, t, \alpha_2) = \psi(X^a, t, \alpha_2) \sum_{i \in A'(\alpha_2)} \int_{a'(i)}^{a'(i+1)} \left| z[w_{l_{a'(i+1)}}(X^a, \tau)] \right| w'_{l_{a'(i+1)}}(X^a, \tau)\, d\tau \quad \left(t \in (t_c, t_{c'}]\right), \qquad (131)$$

for 0 < ψ(X^a, t, α_2) < 1.
4.2 PDEs for the dynamics of lost space
Suppose we consider simultaneously another arbitrary random variable X^b(z_l(t), ℂ_l) along with X^a(z_l(t), ℂ_l) in ℂ_l. As for X^a(z_l(t), ℂ_l), we will develop two models for understanding the dynamics of the space lost: first by assuming that γ_l(X^b, t) at t = t_c is active in B_ℝ(ℂ, α_2), and then by assuming that γ_l(X^b, t) at t = t_c is active in B_ℝ(ℂ, 1 − α_2). When γ_l(X^b, t) at t = t_c is active in B_ℝ(ℂ, α_2), the set of points touched by the contour γ_l(X^b, t) prior to t_c would have been located in B_ℝ(ℂ, α_2), ℂ_l, and B_ℝ(ℂ, 1 − α_2). The contour γ_l(X^b, t) is described by z_l(X^b, t) (t ∈ [t_0, ∞)), and t = O_{l_c}(X^b, τ, α_2) (b_0 ≤ τ ≤ b_c) is the parametric representation for γ_l(X^b, t, α_2), with a real-valued function O_{l_c}(X^b, τ, α_2) mapping [b_0, b_c] onto the interval [t_0, t_c]. Let L(γ_l(X^b, t, α_2)(t ∈ [t_0, t_c])) represent the length of γ_l(X^b, t, α_2) (t ∈ [t_0, t_c]) up to t_c; then (see Fig. 13)

$$L(\gamma_l(X^b, t, \alpha_2)(t \in [t_0, t_c])) = \int_{b_0}^{b_c} \left| z[O_{l_c}(X^b, \tau, \alpha_2)] \right| O'_{l_c}(X^b, \tau, \alpha_2)\, d\tau. \qquad (132)$$

FIG. 13 (A) Formation of contours in a bundle, and (B) formation of a hole in the bundle due to a removal process of an arbitrary plane in which contours originated.
By following the procedure explained in Section 4.1, we partition L(γ_l(X^b, t, α_2)(t ∈ [t_0, t_c])) into a sum of three parts of piecewise contour lengths. The final differential equation to model the space of points lost due to only the removal process ϕ(X^b, t, 1 − α_2) is

$$\left. \frac{dB_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2)}{dt} \right|_{\phi(X^b, t, 1-\alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2)\big|_{t=t_c} - \phi(X^b, t, 1 - \alpha_2)B_{\mathbb{R}}(\mathbb{C}, 1 - \alpha_2)\big|_{t=t_c}, \qquad (133)$$

where

$$\phi(X^b, t, 1 - \alpha_2) = \psi(X^b, t, 1 - \alpha_2) \sum_{i \in B(1-\alpha_2)} \int_{b(i)}^{b(i+1)} \left| z[O_{l_{b(i+1)}}(X^b, \tau)] \right| O'_{l_{b(i+1)}}(X^b, \tau)\, d\tau \quad \left(t \in (t_c, t_{c'}]\right), \qquad (134)$$

for 0 < ψ(X^b, t, 1 − α_2) < 1. The meaning of the set B(1 − α_2) and the procedure to obtain the integral in (134) are similar to the corresponding model in (123).
Alternatively, when γ_l(X^b, t) at t = t_c is active in B_ℝ(ℂ, 1 − α_2), the set of points touched by the contour γ_l(X^b, t) prior to t_c would have been located in B_ℝ(ℂ, α_2), ℂ_l, and B_ℝ(ℂ, 1 − α_2). The contour γ_l(X^b, t) can be described by z_l(X^b, t) (t ∈ [t_0, ∞)), and t = Q_{l_c}(X^b, τ, 1 − α_2) (b'_0 ≤ τ ≤ b'_c) is the parametric representation for γ_l(X^b, t, 1 − α_2), with a real-valued function Q_{l_c}(X^b, τ, 1 − α_2) mapping [b'_0, b'_c] onto the interval [t_0, t_c]. Let L(γ_l(X^b, t, 1 − α_2)(t ∈ [t_0, t_c])) represent the length of γ_l(X^b, t, 1 − α_2) (t ∈ [t_0, t_c]) up to t_c; then

$$L(\gamma_l(X^b, t, 1 - \alpha_2)(t \in [t_0, t_c])) = \int_{b'_0}^{b'_c} \left| z[Q_{l_c}(X^b, \tau, 1 - \alpha_2)] \right| Q'_{l_c}(X^b, \tau, 1 - \alpha_2)\, d\tau. \qquad (135)$$
By following the procedure explained in Section 4.1, we partition L(γ_l(X^b, t, 1 − α_2)(t ∈ [t_0, t_c])) into a sum of three parts of piecewise contour lengths. The corresponding differential equation to model the space of points lost due to only the removal process ϕ(X^b, t, α_2) is

$$\left. \frac{dB_{\mathbb{R}}(\mathbb{C}, \alpha_2)}{dt} \right|_{\phi(X^b, t, \alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c} - \phi(X^b, t, \alpha_2)B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c}, \qquad (136)$$

where

$$\phi(X^b, t, \alpha_2) = \psi(X^b, t, \alpha_2) \sum_{i \in B'(1-\alpha_2)} \int_{b'(i)}^{b'(i+1)} \left| z[Q_{l_{b'(i+1)}}(X^b, \tau)] \right| Q'_{l_{b'(i+1)}}(X^b, \tau)\, d\tau \quad \left(t \in (t_c, t_{c'}]\right), \qquad (137)$$

for 0 < ψ(X^b, t, α_2) < 1. The set B'(1 − α_2) consists of the piecewise contours in B_ℝ(ℂ, 1 − α_2), as described in Section 4.1, and the procedure to obtain the integral in (137) is similar to the corresponding model in (131).
The differential equations (133) and (136) apply when the removal process of X^a(z_l(t), ℂ_l) is not introduced simultaneously, and the differential equations (122) and (130) do not provide the dynamics of the lost spaces when the removal process of X^b(z_l(t), ℂ_l) is introduced simultaneously. Under the simultaneous existence of the two contours γ_l(X^a, t) and γ_l(X^b, t) and the corresponding removal processes, we will have four situations. Suppose γ_l(X^a, t) and γ_l(X^b, t) are active in B_ℝ(ℂ, α_2) at t = t_c; then the partial differential equations describing the dynamics of removal of the spaces created by γ_l(X^a, t) and γ_l(X^b, t) in B_ℝ(ℂ, 1 − α_2) until t_c are

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)}{\partial X^a\, \partial t} \right|_{\phi(X^a, t, 1-\alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^a, t, 1-\alpha_2)\left[L(X^a + X^b, 1-\alpha_2)\right]}{\partial X^a}, \qquad (138)$$

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)}{\partial X^b\, \partial t} \right|_{\phi(X^b, t, 1-\alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^b, t, 1-\alpha_2)\left[L(X^a + X^b, 1-\alpha_2)\right]}{\partial X^b}, \qquad (139)$$

for the removal rates ϕ(X^a, t, 1 − α_2) and ϕ(X^b, t, 1 − α_2), where L(X^a + X^b, 1 − α_2) is the length of the contours formed due to γ_l(X^a, t) and γ_l(X^b, t) in B_ℝ(ℂ, 1 − α_2). Suppose γ_l(X^a, t) is active in B_ℝ(ℂ, α_2) and γ_l(X^b, t) is active in B_ℝ(ℂ, 1 − α_2) at t = t_c; then the partial differential equations describing the dynamics of removal of the space created by γ_l(X^a, t) in B_ℝ(ℂ, 1 − α_2) and by γ_l(X^b, t) in B_ℝ(ℂ, α_2) until t_c are

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)}{\partial X^a\, \partial t} \right|_{\phi(X^a, t, 1-\alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^a, t, 1-\alpha_2)\left[L(X^a, 1-\alpha_2) + L(X^b, \alpha_2)\right]}{\partial X^a}, \qquad (140)$$

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, \alpha_2)}{\partial X^b\, \partial t} \right|_{\phi(X^b, t, \alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^b, t, \alpha_2)\left[L(X^a, 1-\alpha_2) + L(X^b, \alpha_2)\right]}{\partial X^b}, \qquad (141)$$

where ϕ(X^a, t, 1 − α_2) and ϕ(X^b, t, α_2) are the removal rates of γ_l(X^a, t) and γ_l(X^b, t) in B_ℝ(ℂ, 1 − α_2) and B_ℝ(ℂ, α_2), respectively, and the sums of the piecewise lengths in B_ℝ(ℂ, 1 − α_2) and B_ℝ(ℂ, α_2) are represented by L(X^a, 1 − α_2) and L(X^b, α_2), respectively. Suppose γ_l(X^a, t) and γ_l(X^b, t) are active in B_ℝ(ℂ, 1 − α_2) at t = t_c; then the partial differential equations describing the dynamics of removal of the spaces created by γ_l(X^a, t) and γ_l(X^b, t) in B_ℝ(ℂ, α_2) until t_c are

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, \alpha_2)}{\partial X^a\, \partial t} \right|_{\phi(X^a, t, \alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^a, t, \alpha_2)\left[L(X^a + X^b, \alpha_2)\right]}{\partial X^a}, \qquad (142)$$

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, \alpha_2)}{\partial X^b\, \partial t} \right|_{\phi(X^b, t, \alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^b, t, \alpha_2)\left[L(X^a + X^b, \alpha_2)\right]}{\partial X^b}, \qquad (143)$$

where ϕ(X^a, t, α_2) and ϕ(X^b, t, α_2) are the removal rates and L(X^a + X^b, α_2) is the sum of the piecewise lengths of the contours formed due to γ_l(X^a, t) and γ_l(X^b, t) in B_ℝ(ℂ, α_2). Finally, suppose γ_l(X^a, t) is active in B_ℝ(ℂ, 1 − α_2) and γ_l(X^b, t) is active in B_ℝ(ℂ, α_2) at t = t_c; then the partial differential equations describing the dynamics of removal of the space created by γ_l(X^a, t) in B_ℝ(ℂ, α_2) and by γ_l(X^b, t) in B_ℝ(ℂ, 1 − α_2) until t_c are

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, \alpha_2)}{\partial X^a\, \partial t} \right|_{\phi(X^a, t, \alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, \alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^a, t, \alpha_2)\left[L(X^a, \alpha_2) + L(X^b, 1-\alpha_2)\right]}{\partial X^a}, \qquad (144)$$

$$\left. \frac{\partial^2 B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)}{\partial X^b\, \partial t} \right|_{\phi(X^b, t, 1-\alpha_2)} = B_{\mathbb{R}}(\mathbb{C}, 1-\alpha_2)\big|_{t=t_c} - \frac{\partial \phi(X^b, t, 1-\alpha_2)\left[L(X^a, \alpha_2) + L(X^b, 1-\alpha_2)\right]}{\partial X^b}, \qquad (145)$$

where ϕ(X^a, t, α_2) and ϕ(X^b, t, 1 − α_2) are the removal rates of γ_l(X^a, t) and γ_l(X^b, t) in B_ℝ(ℂ, α_2) and B_ℝ(ℂ, 1 − α_2), respectively, and the sums of the piecewise lengths in B_ℝ(ℂ, α_2) and B_ℝ(ℂ, 1 − α_2) are represented by L(X^a, α_2) and L(X^b, 1 − α_2), respectively. The two partitions B_ℝ(ℂ, α_2) and B_ℝ(ℂ, 1 − α_2) behave independently in terms of further removal and growth of contours.
The PDEs represented in (138) through (145) describe the simultaneous removal of two contours at a time. They can be treated as an example of the dynamics of multiple removal processes. The plan is to remove all the infinitely many contours at t_c.
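As a rough numerical illustration of the coupled system (138)–(145), the sketch below evolves the two partition spaces with an Euler scheme, freezing the dependence on X^a and X^b so that the PDEs reduce to ODEs in t. Every rate scale and length value here is a simplifying assumption made only for illustration; the sketch's point is the last sentence above, that the two partitions evolve independently.

```python
# Illustrative removal rates and piecewise lengths at t_c (assumed values).
L_below = 3.0                  # L(X^a + X^b, 1 - alpha_2)
psi_a, psi_b = 0.3, 0.2        # rate scales, each in (0, 1)

B_above, B_below = 1.0, 1.0    # B_R(C, alpha_2), B_R(C, 1 - alpha_2)
dt = 1e-3
for step in range(5000):
    # Case (138)-(139): both contours active above; tails below are removed.
    rate_a = psi_a * L_below   # stands in for phi(X^a, t, 1 - alpha_2)
    rate_b = psi_b * L_below   # stands in for phi(X^b, t, 1 - alpha_2)
    B_below += dt * (B_below - (rate_a + rate_b) * B_below)
    # The partition above evolves independently (no removal acting on it).
    B_above += dt * B_above
print(B_above, B_below)        # one partition grows while the other shrinks
```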
5 Concluding remarks

Multilevel contours passing through a bundle B_ℝ(ℂ) of complex planes can demonstrate interesting properties. The random environment created, together with the removal processes introduced, brings out the dynamic nature of the bundle. The continuous-time Markov properties, the differential equations, and the topological analysis on the bundle give scope to investigate them further using functional approximations. The transportation of information through contours across different complex planes could be extended further to practical situations arising out of transportation problems. There are several applications of complex analysis that are beyond the scope of this article; a wide range of literature is available for interested readers, see, for example, Rao and Krantz (2021), Chanillo et al. (2005), Ponnusamy and Silverman (2006), Campos (2011), Pathak (2019), Cohen (2007), and Krantz (2008). One can also introduce several forms of parametric contour formation by assuming functional growth rates of contours. That could be an independent approach toward modeling the behavior of the contours for a given functional form of contour formation. Similarly, the removal rates can be assumed to follow certain closed-form approximations (special forms of harmonic functions and Poisson integrals). Information carried between the various complex planes discussed in Section 2 would be obstructed if the set of points of a given connected contour gets deleted (lost) due to a removal process.
Acknowledgments

I wish to thank and appreciate our children (daughter: Sheetal Rao; son: GopalKrishna Rao; son: Raghav Rao), who sacrificed several weekends of play time with me while I was occupied with this project during the Summer/Fall of 2021.
References

Ahlfors, L.V., 1978. Complex Analysis: An Introduction to the Theory of Analytic Functions of One Complex Variable, third ed. International Series in Pure and Applied Mathematics, McGraw-Hill Book Co., New York. xi+331 pp.
Bhat, B.R., Deshpande, S.K., 1986. Likelihood ratio test for testing order of continuous time finite Markov chains. Commun. Stat. A Theory Methods 15 (6), 1751–1771.
Campos, L.M.B.C., 2011. Complex Analysis With Applications to Flows and Fields. Mathematics and Physics for Science and Technology, CRC Press, Boca Raton, FL.
Chanillo, S., Cordaro, P.D., Hanges, N., Hounie, J., Meziani, A., 2005. Geometric Analysis of PDE and Several Complex Variables. Dedicated to François Treves. Including Papers From the Workshop Held in São Paulo, August 2003. Contemporary Mathematics, vol. 368, American Mathematical Society, Providence, RI.
Chen, M.F., 1991. On three classical problems for Markov chains with continuous time parameters. J. Appl. Prob. 28 (2), 305–320.
Churchill, R.V., Brown, J.W., 1984. Complex Variables and Applications, fourth ed. McGraw-Hill Book Co., New York. x+339 pp.
Cohen, H., 2007. Complex Analysis With Applications in Science and Engineering, second ed. Springer, New York.
Gani, J., Stals, L., 2005. A continuous time Markov chain model for a plantation-nursery system. Environmetrics 16 (8), 849–861.
Good, I.J., 1961. The frequency count of a Markov chain and the transition to continuous time. Ann. Math. Stat. 32, 41–48.
Goswami, A., Rao, B.V., 2006. A Course in Applied Stochastic Processes. Texts and Readings in Mathematics, vol. 40, Hindustan Book Agency, New Delhi.
Krantz, S.G., 2004. Complex Analysis: The Geometric Viewpoint, second ed. Carus Mathematical Monographs, vol. 23, Mathematical Association of America, Washington, DC. xviii+219 pp.
Krantz, S.G., 2008. Complex Variables: A Physical Approach With Applications and MATLAB®. Textbooks in Mathematics, Chapman & Hall/CRC, Boca Raton, FL.
Pathak, H.K., 2019. Complex Analysis and Applications. Springer, Singapore.
Ponnusamy, S., Silverman, H., 2006. Complex Variables With Applications. Birkhäuser Boston, Inc., Boston, MA.
Rao, A.S.R.S., Krantz, S.G., 2021. Rao distances and conformal mappings. In: Information Geometry. Handbook of Statistics, vol. 45, Elsevier/North-Holland, Amsterdam.
Rudin, W., 1987. Real and Complex Analysis, third ed. McGraw-Hill Book Co., New York. ISBN 0-07-054234-1.
Index

Note: Page numbers followed by “f ” indicate figures, “t” indicate tables, and “b” indicate boxes.

A Barycenter and covariance, 367–369


Accelerated optimization Basu-Harris-Hjort-Jones distances, 162
conservative flows and symplectic Bayesian contexts, 205–208
integrators, 26–28 Bayesian inference
discretizations of continuum system, 25 bounding the distance, 376–377
geometric integration, 25–26 geodesic convexity, 374
rate-matching integrators, smooth MAP vs. MMS, 375–376
optimization, 28–33 parameter, 373
Active inference, 24–25 Bayesian information criterion (BIC), 83
adaptive decision-making, 54–60 Bayesian methods, 343–345
defined, 54 Bergman geometry, 4
Adaptive agents, 24–25 The Bergman metric tool, 4
prediction model, 60 Bhattacharyya coefficient, 236
sequential decision-making under Birkhoff’s ergodic theorem, 37–38
uncertainty, 60–61 Bonferroni correction, 318
world model learning, 61–63 Bouncy particle, 23–24
Affine coadjoint operator, 118 Brackets, 26–27
Affine connection, 226, 232 Bracket vector fields, 26–27
Affine geometry, 342 Bregman divergence, 160–161, 230–231
Affine Lie-Poisson equation, 122–123 Bregman generators, 227, 229, 246–247
Affine Poisson bracket, 118 Bregman manifolds, 226–230
Aggregated/integrated divergences, 201–202 Brenier–McCann techniques, 166–167
Akaike information criteria (AIC), 346
Algebraic geometry, 345–348
α-connections, 226
C
Camassa-Holm equation, 109–110
α-Divergence, 147–149
Canonical Stein operator, 49–50
and autonormalizing, 260–262
“Caravelle” aircrafts, 115
defined, 260
Cartan–Schouten connection model, 288, 293f
Analytic functions, 6–7
Casimir function, 108–109, 111–112
Arbitrary diffusions, 39
Casimir invariant function, 128–131
Arbitrary random variables, 444
Cauchy estimate, 9
Artificial Intelligence, 111
Cauchy integral theorem, 7–9
Asymptotic distribution, 371–372
Cauchy mixture family of order 1, 241–249,
Atlas in medical image analysis, 287
242f, 245f
Atrial septal defect (ASD), 305
Cauchy residue theorem, 10–11
Auto-divergences, 195–197
Cauchy–Riemann (CR) equations, 5
Autoencoder networks, 270–272
Centered quantile functions, 182–183, 191,
Auxiliary Markov chain (AMC), 83, 98–99f
194
Center-outward distribution function, 163–166
B Central limit theorem (CLT), 348
Bahadur slope, 158–159 Classical quantile functions, 182, 189, 193
Baker–Campbell–Hausdorff (BCH) formula, Closed-form dual potentials, 249–252, 250f
25–26, 287–288, 291–292 Coadjoint operator, 118

465
466 Index

Collapsed Gibbs sampler, 84–87 Dimensionality reduction, 262–263, 268–269


“Comonotonic” distribution function, 198 Director of the Center for Theoretical Physics
Compact case, 386–387 of Marseille, 116
Complex bundles, 3–4 Dirichlet process, 84–85
Complex planes, 5–12, 402f, 403–415, 404f Dissimilarity-expressing functional relations,
Complex-valued function, 6–7 167
Computational anatomy, 286 Distortion risk measure (DRM), 151
“Concorde” aircrafts, 115 Distribution and survival functions, 188–189
Conformal Hamiltonian system, 28–29 Distribution functions, 179, 193
Conformal mapping, 11, 12f The Θ distributions, 372–373
Conservative flows, 26 Divergence generator ϕ, 168, 171–173
Context tree, 82f Divergence information geometry, 233
Contour formation, 7–8, 8f, 13f Divergences (directed distances), 147,
Contour lengths, 433–435 159–161, 168–171
Control theory, 24–25 ϕ-Divergences, 147–149, 161–162
Convergence rates, 29–30 DNA sequence modeling, 87
Convex geometry, 330, 337 D-O-Q-R paradigm, 165–167
Convex optimization Dually flat spaces, 230–234
Riemannian gradient descent, 393–396
second-order Taylor formula, 390–391
sets and functions, 388–390
E
Einstein–Hermitian conditions, 3–4
Taylor with retractions, 391–393
Ejection fraction (EF), 305–306
Cramer–Rao inequality and Fisher
Ejection fraction per disease group, 311f
Information, 3–4
Elastic embedding (EE), 270
Cramer transform, 120
Embedding
Cramer–von Mises test statistics, 161, 163, 179
geometry, 258
Cross-entropy closed-form formula, 247
imposed information/complexity, 262–263
Crouzeix identity, 232
Embedding space, 257–258
Crude annual mortality rate, 177
Empirical α-discrepancy, 267–268
Csiszar–Ali–Silvey–Morimoto (CASM)
Empirical barycenter, 379–381
divergences, 147–149, 151, 153
Encoder, 258
Cumulative distribution functions, 176
End-diastole (ED) frame, 289–290
Cumulative Kullback–Leibler information,
End-systolic (ES) frames, 289–290
188–189
Entire function, 5
Entropy, 108–109, 111–112, 120, 122
D Euclidean metric tensor, 229–230
Darboux theorem, 27–28 Euler–Maruyama integration, 40
Expected free energy (EFE), 57
Data geometry, 259
Data processing inequality, 149 Explicit time-dependent Hamiltonian
Decision-making, precise agents, 56–57 formulation, 30
Decoder, 258 Exponential family manifolds
Fisher–Rao manifold, 234–237
Deformation-based morphometry (DBM),
286–287 natural exponential family, 234
Deleted neighborhoods, 10 Exp-parallelization, 289
“Demons” image registration algorithm,
287–288 F
Density functions, 161 Fanning Scheme, 291
Density function zeros, 152–157 f-divergence, invariance property, 276–277
Density power divergences, 162 Fenchel–Moreau theorem, 230
Dependence expressing divergences, 203–205 Finite distributions, 334–337
Diffeomorphometry, 287–288 Fisher information matrix (FIM), 118–119,
Differential geometry, 22 226, 234–235
Index 467

Fisher invariant distance, 109 Hotelling tests on velocities, 318–321,


Fisher-Koszul metric of Information geometry, 319–320f
108–109 Hybrid Monte Carlo methods, 22–23
Fisher metric, 253 Hypoellipticity condition, 38
Foundations of Geometric Structures of Hypothesis, 309–310
Information (FGSI), 110
Freeman–Tukey divergence, 147–149
I
Identity matrix, 266
G “Implicit boundary-describing”
Galton–Watson branching processes, 213 divergence, 185–186
Gauss density, 138–139 Infinitesimal schemes, 297
Gaussian distributions Information geometry, 109, 259, 331
and Bayesian inference (see Bayesian advantages, 4
inference) deformed λ-exponential families, 4
defined, 358 statistical decisions and inferences, 3–4
and RMT (see Random matrix theory Information processing inequality, 149
(RMT)) Integral probability pseudometrics (IPM),
General geometric structures, 341–343 46–47
Generalized linear models (GLM), 174, 328 Integrated statistical functionals, 164
Geodesic and spline regression, 317 Intrinsic complexity, 259
Geodesic parallelogram, 291–292 Islands and holes, 444–463
Geodesics, 315 Isolated singularity, 10
Geometric analysis, 13–17 Isometric embedding, 265
Geometric heat Fourier equation, 129
Geometric integration, 25–26
Geometric Science of Information (GSI), 110
J
Jacobi fields, 292–293
Geometric statistics, 286
Jensen divergence, 246f
Geometric structures, 5
Jensen–Shannon divergence, 213–214, 248
Gradient flow, 34–35, 313
Jordan arcs, 6–7, 6f
Griffiths–Engen–McCloskey distribution, 85
Jordan curves, 7–8, 10–17, 11f, 14f, 16f
GSDMM algorithm, 84–85, 86b

K
H Kantorovich transportation problem (KTP),
Hamiltonian-based accelerated 197–198
sampling Kernel Stein discrepancies, 50–52
description, 37–38 k-nearest neighbor classification methods,
diffusion processes, 38–40 159–160
Hamiltonian formulation of molecular Kostant-Kirillov Souriau (KKS), 122
dynamics, 23–24 Koszul-Fisher Metric, 122
Hamiltonian Monte Carlo (HMC), 22–23, Koszul Poisson cohomology, 131–132
40–45 Koszul-Vinberg characteristic function, 109
Hamiltonian system with dissipation, 22–23, Kullback–Leibler (KL) divergence, 22,
299 147–149, 248, 259
Harmonic functions, 5–7
Heat equation, 108–109
Hellinger distance, 265 L
Hellinger transform, 213–214 Ladder methods
Hessian manifolds, 227–230 Cholesky decomposition, 302
Higher-order Markovian sequences, 91–94 ED and ES, 301f
Hilbertian subspaces, 47 Hamiltonian formulation, 300
Homogeneous regular cone, 227 validation, 302–304
468 Index

Lagrange multipliers, 33–34 Minimum ϕ-divergence/distance estimation


Langevin algorithms, 23–24 problem, 151
Laplace–Beltrami operator, 39–40 Minimum ϕ-divergence estimator
Laplace’s equation, 5 (MDE), 151
Large deformation diffeomorphic metric Minimum Hyv€arinen score matching
mapping (LDDMM) framework, 287, estimators, 24
298–300 Minimum Kantorovich estimators, 24
Latent space, 257–258, 262–263, 272 Minimum reverse-Kullback-Leibler
Laurent series expression, 10–11 divergence (RKLD) estimator, 150–151
Leapfrog, 32–33 Mixture family manifolds
Legendre–Fenchel transformation, 230, 231f definition, 239
Leverrier-Souriau algorithm, 115–116 discrete mixture family, 239–240
Lie and dual Lie algebras, 118 Mixture models
Lie groups thermodynamics, 108–110 applications, 327–328
defined, 111–112 bandwidth tuning parameter, 333
“moment map”, 111–112 decomposition of convex hull, 337t
Likelihood geometry, 337–341, 340f definition, 328
Liouville’s theorem, 9–12 finite distributions, 334–337
LogDet divergence, 266 fundamentals of geometry, 329–330
Longitudinal models, 288–290 identification problem of, 333
Lyapunov function, 28–29 information geometry, 331
K normal components, 331
manifold structures, 336
M multivariate normal/multivariate
Manifold learning, 272 t-distributions, 328
Markov Chain Monte Carlo (MCMC) three-components, 332, 332f
methods, 23–24, 344 Model fitting, 84–87
Markov morphisms, 22 Modified Anderson–Darling test
Markov processes, 39 statistics, 161
MATLAB program, 100 Modified Lie-Poisson variational principle,
MaxEnt’14 Conference, 110 129
MaxEnt’22 Conference, 110 Moment map, 118
Maximum a posteriori (MAP), 85 Momentum vectors, 299
Maximum entropy, 108–109, 120, 366–367 Monge–Kantorovich theory, 166–167
Maximum likelihood estimator (MLE), 83, Monge–Kantorovich transportation
150–151 problem, 24
Maximum mean discrepancy (MMD). Monte Carlo simulation, 91
See also Natural gradient descent Motion normalization, with parallel transport,
defined, 46–47 306–309, 306f
Lebesgue density, 52 mth-order Markovian sequence, 79–80
natural gradient descent, 52–54 m-tuples, 83
smooth measures and KSDs, 48–52 Multilevel contour lengths, 411
topological methods, 47–48 Multilevel contours, 3–4, 402f, 415–444,
Mean embedding, 46–47 453–458
Mean trajectory, 293f, 320 Multinoulli distributions, 232–234
Median-oriented quantile function, 164
M estimator, 24
Metropolis-Hastings algorithm, 23–24, N
377–379 Natural gradient descent
Middle East Respiratory Syndrome (MERS), likelihood-free inference, generative models,
89–90 53–54
Minimal Markov models, 83 minimum stein discrepancy estimators, 53
Index 469

Negative definite matrix, 28–29
Neighborhood embeddings, 268–270
Nested discs and points, planes intersection, 424f
Neyman–Pearson approach, 150
Noether’s theorem, 108
Noncompact case, 385–386
Non-ϕ-divergences, 162
Nonstandard asymptotics, 329
Nonstandard testing problems, 350–353
Normalized deformations, 316–321
Normalizing factor Z(σ), 363–366
n-time intervals transition probabilities, 425
Numerical accuracy, 294–297
Numerical scheme and convergence, 296

O
One-parameter subgroups, 288
Optimal transport and coupling, 197–201
Ordinary differential equations (ODEs), 298–299
Overdamped Langevin diffusion, 37

P
Parallel transport
  intersubject normalization, 291–292
  Schild’s and pole ladders (see Schild’s and pole ladders)
Parallel vector field, 291
Parameter divergence, 233
Parameter inference, 46–54
Partially observable Markov decision process (POMDP), 60–61
Partition Markov models, 83
Partitions and agents, 55f
Pattern, 91
Patternhunter software, 100
PDEs, 459–463
Pearson chisquare divergence, 186–187
Piecewise smooth arcs, 7–8, 8f
Poincaré duality, 49–50
Poincaré metric on plane, 230
Pointwise-BS-distance-type (pBS-type) cost function, 197–199
Poisson brackets, 27, 118
Pole ladder, 292f, 297
Positive definite matrix, 28–29
Presymplectic integrators, 30–31
Probability and Schwartz distributions, 24
Probability theory, 157–159
Profile log-likelihood contours, 344f
Pullback metric, 271–272
Pulmonary hypertension (PH), 305
p-Wasserstein estimator, 24

Q
Quantile density functions, 190–191
Quantile–Quantile-Plot (QQ-Plot), 167
Quantitative geometric system, 159
Quantum physics, 111
Quotient mapping, 81

R
Radon–Nikodym densities, 226, 234
Random matrix theory (RMT)
  asymptotic distribution, 371–372
  barycenter and covariance, 367–369
  The Θ distributions, 372–373
  from Gauss to Shannon, 360–361
  MLE and maximum entropy, 366–367
  normalizing factor, 363–366, 369–370
  Riemannian barycenter, 359
  The “right” Gaussian, 361–363
  Siegel domain, 359–360
Randomness, 416–417, 416f, 421
Random variables, 429–431
Rao distance, 236
Rate of removal, 435–436
Real-valued function, 429–431
Reference measure λ, 171
Reference point, 263
Registering, 299–300
Regular cone manifolds, 237–239
Regularization, SMM, 87–90
Reinforcement learning, 24–25
Relative entropy, 186–187
Removal process, 437, 452
Reproducing kernel, 46–47
Reproducing kernel Hilbert space (RKHS), 209–210
Residual lifetime, 188–189
Residue, 10–11
Reverse Kullback–Leibler divergence, 153–155
Reverse relative entropy, 186–187
Riemann–Christoffel symbols, 232
Riemannian gradient descent, 393–396
Riemannian Hessian metric tensor, 231–232
Riemannian manifold, 33–34, 228, 294, 313–314
Riemannian metric, 108–109
Riemannian symmetric spaces, 383–388
Riemann–Stieltjes integral, 6–7
Riemann surfaces, 3–4
Ruled surface meaning, 334

S
Sampling methods, 23–24
Scaled Bregman divergences/distances, 162, 168
Scaled parallel transport (SPT), 310t, 318
Scaling active inference, 63–64
Scaling and aggregation functions, 173–195
Schild’s and pole ladders
  elementary construction, 295
  infinitesimal schemes, 297
  numerical scheme and convergence, 296
  parallel transport, 294–295, 294f
  pole ladder, 297
  Taylor expansion, 295–296
Schild’s geometric constructions, 294
Score matching, 50–52
Second-order Taylor formula, 390–391
Second principle of Thermodynamics, 130
Shadow function, 27
Shadow vector field, 25
Shannon entropy, 35–37
Shape and deformations, 307–309, 308f
Sharp identifiability, 170
Shooting, 299–300
Siegel domain, 359–360
Singular learning and model selection, 348–350
Singular learning theory
  and algebraic geometry, 345–348
  Bayesian methods, 343–345
  singular learning and model selection, 348–350
Singular models, 333, 345, 349
“Smooth” pointwise ϕ-divergences, 199–200
Souriau cocycle, 122
  1-cocycle, 118
  2-cocycle, 118
Souriau Fundamental Theorem, 121
Souriau, Jean-Marie, 108, 114–115f, 117f
Souriau’s project, 112–114
Spaced seed coverage, 96–100
Sparse Markov models (SMMs), 80–83
  computation, 94–96
  fitting SMM through regularization, 87–90
  pattern statistics, 90–100
Spatiotemporal shape analysis, 304
Spinning of bundle theorem, 414
Stationary solution, steady-state solution, 435–436
Stationary velocity fields (SVF), 287–288, 292
Statistical decision theory, 24–25
Statistical functionals, 163–167
Statistical inference, 24
Statistical model, 226
Statistical motion atlas, 289–290
Statistical physics, 111
Stein’s method, 48
Stein variational gradient descent (SVGD) methods, 50–51
Stochastic differential equations (SDEs), 38
Stochastic neighbor embedding (SNE), 269–270
Störmer–Verlet method, 32–33
Streetlight effect, 57
Strictly convex case, 394–395
Strongly convex case, 396
SU(N,N) Lie group, 136–138
SU(1,1)/U(1) Lie group, 132–136
Symmetric space, 359
Symplectic brackets, 27–28
Symplectic Fisher metric structures, 123–127
Symplectic form, 27–28
Symplectic geometry, 110
Symplectic integrators, 22–23
Symplectification, 30–32

T
Taylor expansion, 295–296
Taylor with retractions, 391–393
Tetralogy of Fallot (ToF), 305
Time intervals, 433–435
Toeplitz/block-Toeplitz positive-definite matrices, 359
Toussaint’s affinity, 213–214
Transportation contour, 414
Transporting scalar values, 289

U
Underdamped Langevin diffusion, 35–37
Unit-dependent functionals, 163–164
Unit-free functionals, 163–164

V
Variable length Markov chains (VLMCs), 80
Variable-order Markov (VOM) models, 80
Variational autoencoders (VAEs), 271
Variational integrators, 27–28
Variational representations, ϕ-divergences, 208–211
Vector bundles, 3–4
Volume-preserving parallel transport (VPPT), 315t, 318

W
Wasserstein space, 35–37
Watanabe information criteria (WAIC), 349
Watanabe’s approach, 346
Wind speeds modeling, 86–87
World model learning, 61–63

Z
Z estimator, 24
Zig-zag samplers, 23–24