
Chapter 13

Information Divergence Geometry and the
Application to Statistical Machine Learning

Shinto Eguchi
Institute of Statistical Mathematics, 4-6-7 Minami-Azabu, Minato-ku, Tokyo 106-8569, Japan
e-mail: eguchi@ism.ac.jp

Abstract This chapter presents an intuitive understanding of statistical learning from
an information geometric point of view. We discuss a wide class of information
divergence indices that quantitatively express the departure between any two
probability density functions. In general, an information divergence leads to a
statistical method through its minimization based on the empirical data available.
We discuss the association of an information divergence with a Riemannian metric
and a pair of conjugate linear connections on a family of probability density
functions. The most familiar example is the Kullback–Leibler divergence, which leads
to the maximum likelihood method associated with the information metric and the
pair of the exponential and mixture connections. For the class of statistical methods
obtained by minimizing a divergence we discuss statistical properties, focusing
on robustness. As applications to statistical learning we discuss minimum divergence
methods for principal component analysis, independent component analysis and
statistical pattern recognition.

13.1 Introduction

Statistical machine learning approaches are very successful in providing powerful and
efficient methods for inductive reasoning in information spaces with uncertainty
[22, 43]. In the following we take a geometric view of learning algorithms and of the
statistical discussion of learning methods. In this context, a challenging problem has
been solved, answering the question of to what extent geometry and the maximum
likelihood method are associated with each other; see [1, 2]. It is known that linear
Gaussian regression is associated with a Euclidean geometry in which the least squares
method is characterized by projection of the observed data
point onto the linear hull of the explanatory data vectors. This is only a special example
of the geometry associated with maximum likelihood. In general, the geometry is
elucidated by a dualistic Riemannian geometry in which the information metric is
introduced as the Riemannian metric together with two linear connections, called the
e-connection and the m-connection. In this framework the two connections are
conjugate with respect to the information metric. The optimality of maximum
likelihood is characterized by the m-projection onto an e-geodesic model. We will
discuss this structure as well as its extension to a class of minimum divergence methods.
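For reference, the conjugacy of the two connections can be written in the standard
form used in information geometry (a generic identity, stated here with g denoting the
information metric, ∇ the e-connection, ∇* the m-connection, and X, Y, Z vector fields):

X\, g(Y, Z) = g(\nabla_X Y,\, Z) + g(Y,\, \nabla^{*}_X Z).

In other words, the e- and m-connections are dual affine connections with respect to g,
rather than each preserving the metric on its own.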
We review a close relation between the maximum likelihood method and the
Kullback–Leibler divergence. Let p and q be probability density functions on a data
space X . The Kullback–Leibler divergence is defined by

D_{\mathrm{KL}}(p, q) = \int_X p(x)\{\log p(x) - \log q(x)\}\,\Lambda(dx),

where Λ is a carrier measure. Consider a statistical situation in which p is an underlying
density function for data and q is a model density function. In this context we
define a statistical model by a parametric family of probability density functions

M = \{q_\theta(x),\ \theta \in \Theta\}.
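A simple concrete instance (introduced here only for illustration, and used again in the
numerical sketch at the end of this section) is the Gaussian location model

q_\theta(x) = (2\pi)^{-1/2} \exp\{-(x - \theta)^2 / 2\}, \qquad \Theta = \mathbb{R}.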

Then the log-likelihood function for the model based on a given dataset is approximated
by −D_KL(p, q_θ), neglecting a constant in θ, where p is the underlying density function
of the dataset. Hence we observe that the minimization of D_KL(p, q) over q in M is
almost surely equivalent to the maximum likelihood method.
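Spelled out (with x_1, …, x_n denoting the observed data, a notation introduced here
only for this illustration), the averaged log-likelihood converges almost surely to its
expectation under p,

\frac{1}{n}\sum_{i=1}^{n} \log q_\theta(x_i) \;\longrightarrow\; \int_X p(x)\,\log q_\theta(x)\,\Lambda(dx) = -D_{\mathrm{KL}}(p, q_\theta) + \int_X p(x)\,\log p(x)\,\Lambda(dx),

and the last integral does not depend on θ, so maximizing the likelihood over θ and
minimizing D_KL(p, q_θ) over θ select the same member of M.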
Since the principle of maximum likelihood was proposed by Fisher [14], it has been
applied to a vast range of datasets of various forms from almost all scientific fields. In
general the maximum likelihood estimator is supported by several properties, such as
invariance under one-to-one data transformations, covariance under parameter
transformations, asymptotic consistency and asymptotic efficiency. Furthermore, several
advantageous theoretical properties are proven under the assumption of an exponential
family; see [3] for a detailed discussion.
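As a small numerical illustration of this equivalence (a minimal sketch in Python, not
taken from the chapter; the unit-variance Gaussian location model, the sample size and
all names below are chosen for illustration, and NumPy/SciPy are assumed available),
the maximum likelihood estimate computed from data drawn from p essentially
coincides with the minimizer of the closed-form Kullback–Leibler divergence:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
mu_true = 1.5
x = rng.normal(loc=mu_true, scale=1.0, size=10_000)  # data drawn from the underlying density p

def avg_neg_log_likelihood(theta):
    # Average negative log-likelihood of the unit-variance Gaussian model q_theta.
    return 0.5 * np.log(2.0 * np.pi) + 0.5 * np.mean((x - theta) ** 2)

def kl_to_model(theta):
    # Closed-form D_KL(p, q_theta) for two unit-variance Gaussians: (mu_true - theta)^2 / 2.
    return 0.5 * (mu_true - theta) ** 2

theta_mle = minimize_scalar(avg_neg_log_likelihood).x  # maximum likelihood estimate
theta_kl = minimize_scalar(kl_to_model).x              # exact minimizer of the divergence

print(theta_mle, theta_kl)  # both close to mu_true = 1.5, differing only by sampling error

In expectation, avg_neg_log_likelihood(θ) equals kl_to_model(θ) plus the differential
entropy of p, which is exactly the θ-free constant neglected in the text above.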

13.2 Class of Information Divergence

The principle of maximum likelihood is established on the basis of a specific property
for a pair of elementary functions. One is an exponential function that defines an
exponential model; the other is a logarithmic function that defines a log-likelihood
function. It is well known that they are connected by conjugate convexity,

\log(s) = \mathop{\mathrm{argmax}}_{t \in \mathbb{R}} \{ts - \exp(t)\}. \qquad (13.1)
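To verify (13.1) directly, note that t ↦ ts − exp(t) is strictly concave for every s > 0,
and its derivative vanishes exactly at the maximizer:

\frac{d}{dt}\{ts - \exp(t)\} = s - \exp(t) = 0 \quad\Longleftrightarrow\quad t = \log(s).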

We will see that this convexity leads to the Kullback–Leibler divergence and the
Boltzmann–Shannon entropy, which lead us to a deeper understanding of the
