Learning with the Minimum Description Length Principle
Kenji Yamanishi
To my wife and my son.
The most important issue that we address throughout this book is what information
is. Once information is properly quantified, then “learning” can be thought of as the
task of extracting the maximum information from data. Shannon [30] quantified the
amount of information within the framework of probability theory. Subsequently,
several attempts were made to quantify the information in terms of Kolmogorov
complexity [15], algorithmic complexity [2], information spectrum [9], etc. The
important point is that no matter how the information is defined, it should be defined relative to a model. Here, a model means a constraint on the representation of the source that generates the data. With the recent progress in machine learning, many kinds of models have been developed to obtain rich knowledge representations. In this book, we start by addressing the issue of how to define the information contained in data with the help of models.
This section shows that the notions of coding, information, and probability distri-
butions are closely related to one another and that they are unified under the MDL
principle. We refer the reader to the seminal book Elements of Information Theory by Cover and Thomas [3] for a review of the fundamental notions of information theory.
Let X be a finite set of symbols and Xn be a set of sequences of n symbols over
X. A source coding is defined as a map φ : X → {0, 1}∗ , where {0, 1}∗ is the set of
sequences over {0, 1}. The purpose of a source coding is to represent the original
data using as short a binary sequence as possible (see Fig. 1.1). Hereafter, we simply
refer to a source coding as a coding.
We require a specific feature for a coding.
Definition 1.1 We define a prefix coding as a coding φ such that for any x ≠ y, φ(x) is not a prefix of φ(y).
Consider the situation where we encode a sequence of symbols in X^n into a binary sequence by concatenating the codewords of the individual symbols. With a prefix coding, we can decode the binary sequence to recover the original symbols even if we do not place a delimiter at the end of each codeword.
For a coding φ, let L : X → R+ be a map so that L(x) is the code-length for φ(x).
We call such L a code-length function associated with φ. The necessary and sufficient
condition for any coding φ to be a prefix coding is given in terms of the code-length
function L.
In the following, we suppose that X is finite. The following theorem follows the
book by Cover and Thomas [3] (pp: 107–110).
Theorem 1.1 (The Kraft inequality) [19] Let L be a code-length function associated with a coding φ. If φ is a prefix coding, then L satisfies the following inequality:

$$\sum_{x\in\mathcal{X}} 2^{-L(x)} \le 1. \qquad(1.1)$$
Conversely, if L satisfies (1.1), then there exists a prefix coding for which L is the
code-length function.
The inequality (1.1) is called the Kraft inequality.
Proof We can represent a codeword for a prefix coding as a leaf node of a binary tree.
That is, any codeword associated with a prefix coding can be represented as a path
from the root to a leaf node. The prefix condition is equivalent to the condition that
a codeword cannot be an ancestor of any other codeword on the tree (see Fig. 1.2).
Let L_max be the maximum length of any codeword over X. For x ∈ X, the codeword of x, located at depth L(x), has 2^{L_max − L(x)} descendants at depth L_max. Under the prefix condition, the codeword of x is a leaf of the tree, so the descendant sets of distinct codewords are disjoint, and their total number at depth L_max is at most 2^{L_max}. This leads to

$$\sum_{x\in\mathcal{X}} 2^{L_{\max}-L(x)} \le 2^{L_{\max}}.$$

Equivalently, we have

$$\sum_{x\in\mathcal{X}} 2^{-L(x)} \le 1.$$
Further, the Kraft inequality extends to the case where X is countably infinite.
Example 1.1 Let X = {A, B, C, D}. We consider the coding φ such that

φ(A) = 000, φ(B) = 001, φ(C) = 01, φ(D) = 1.

All the codewords are located at the leaves of a binary tree, as shown in Fig. 1.2. Hence, we see that φ is a prefix coding over X. Since the code-length function is given by

L(A) = 3, L(B) = 3, L(C) = 2, L(D) = 1,

the Kraft inequality (1.1) is satisfied with equality: 2^{-3} + 2^{-3} + 2^{-2} + 2^{-1} = 1.
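As an illustrative sketch (our own Python example, not part of the original text), the following code checks the Kraft inequality for the code of Example 1.1 and decodes a concatenated bit string without any delimiters; the function and variable names are ours.

    # Sketch: verify the Kraft inequality and prefix-free decoding for Example 1.1.
    from math import fsum

    code = {"A": "000", "B": "001", "C": "01", "D": "1"}

    # Kraft sum: sum_x 2^{-L(x)} must be <= 1 for any prefix code.
    kraft_sum = fsum(2.0 ** -len(w) for w in code.values())
    print("Kraft sum =", kraft_sum)  # equals 1.0 for this code

    def decode(bits, code):
        """Decode a concatenation of codewords; no commas are needed
        because no codeword is a prefix of another."""
        inverse = {w: s for s, w in code.items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in inverse:          # a complete codeword has been read
                out.append(inverse[buf])
                buf = ""
        assert buf == "", "trailing bits do not form a codeword"
        return "".join(out)

    print(decode("000" + "01" + "1" + "001", code))  # -> "ACDB"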
For a code-length function L(x) satisfying the Kraft inequality, if we define p(x) by

$$p(x) = \frac{2^{-L(x)}}{\sum_{y\in\mathcal{X}} 2^{-L(y)}},$$

then p(x) forms a probability distribution over X. Conversely, for a probability distribution p(x) over X, if we define L(x) by L(x) = ⌈− log p(x)⌉, then L(x) satisfies (1.1). Here, ⌈z⌉ is the least integer not smaller than z. Hereafter, the logarithm refers to the base 2 logarithm. We may replace it with the natural logarithm depending on the context. Then, the Kraft inequality (1.1) is replaced with the following:

$$\sum_{x\in\mathcal{X}} e^{-L(x)} \le 1.$$
More generally, if p(x) ≥ 0 and ∫ p(x) μ(dx) = 1, then we call p(x) a probability density function over X, where μ is a given measure on X. We simply write μ(dx) as dx for the Lebesgue measure throughout this book. Hereafter, we proceed with the discussion supposing that X is discrete; the same argument holds for the cases where X is continuous, with the probability mass function replaced by the probability density function.
Let us consider the case where X is continuous. Suppose that X is a 1-dimensional
Euclidean space. In this case, we must truncate x ∈ X with finite precision to obtain
[x]. Then, we encode [x] in a finite code-length. Let X̄ be a countable subset of X
obtained by quantizing X with truncation scale at most δ. Let p(x) be a probability
density function over X. Then, we may instead consider the probability mass function
over X̄:
p̄([x]) ≈ p(x)δ.
By taking sufficiently small δ > 0, the code-length for [x] can be calculated approximately by

$$L([x]) \approx -\log \bar p([x]) = -\log p(x) - \log\delta.$$
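As a small numerical sketch (our own example; the Gaussian density and the precision values are arbitrary choices, not from the text), the following code evaluates the approximate code-length −log p(x) − log δ for a truncated real value.

    # Sketch: approximate code-length of a real value truncated to precision delta,
    # using L([x]) ~ -log2 p(x) - log2 delta for a density p.
    import math

    def gauss_pdf(x, mu=0.0, sigma=1.0):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

    def quantized_code_length(x, delta, pdf=gauss_pdf):
        """Approximate prefix code-length (in bits) for x truncated to a grid of width delta."""
        return -math.log2(pdf(x) * delta)

    for delta in (0.1, 0.01, 0.001):
        print(delta, round(quantized_code_length(0.5, delta), 3))
    # Each tenfold refinement of delta adds about log2(10) ~ 3.32 bits.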
The code-length for any prefix coding can be calculated as in (1.4) using the (inferior)
probability distribution p. When we employ a prefix coding, we call the code-length
function L the prefix code-length function. Although the code-length should be an integer in reality, we allow it to take non-integer values in the discussion that follows, for the sake of mathematical simplicity.
The goal of source coding is data compression. Hence, a coding should be
designed so that the code-length is as short as possible. If the true probability distri-
bution p for generating data is known in advance, the following theorem holds for
the average code-length for the data generated according to p.
Theorem 1.2 (A lower bound on the expected code-length) [3, 30] Suppose that the
data are distributed according to the probability distribution p(x). Then, any prefix
code-length function L(x) satisfies the following inequality:
$$E_p[L(x)] \ge H(p),$$

where $E_p[\cdot]$ denotes the expectation taken with respect to p, and $H(p) \overset{\mathrm{def}}{=} -\sum_{x\in\mathcal{X}^n} p(x)\log p(x)$, which we call the entropy of p(x).
Proof For any prefix code-length function L(x), by Theorem 1.1 there exists an inferior probability distribution q(x) such that L(x) = − log q(x). Then, we have

$$E_p[L(x)] = H(p) + D(p\|q),$$

where $H(p) \overset{\mathrm{def}}{=} E_p[-\log p(x)]$ is the entropy function of p, and $D(p\|q) \overset{\mathrm{def}}{=} E_p[\log(p(x)/q(x))]$ is the Kullback–Leibler divergence between p and q. In general, the following inequality holds:

$$D(p\|q) \ge 0,$$

which yields $E_p[L(x)] \ge H(p)$. This completes the proof.
Theorem 1.2 shows that when we know the true probability distribution p in advance, the minimum expected code-length can be achieved with the coding such that for any x,

$$L(x) = -\log p(x).$$
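The following sketch (our own illustration with an arbitrary toy distribution) verifies Theorem 1.2 numerically: the expected code-length under ideal code-lengths L(x) = −log q(x) equals H(p) + D(p||q) and is therefore minimized at q = p.

    # Sketch: E_p[-log q] = H(p) + D(p||q) >= H(p), with equality iff q = p.
    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def kl(p, q):
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def expected_code_length(p, q):
        # ideal (non-integer) code-lengths L(x) = -log2 q(x)
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]
    q = [0.25, 0.25, 0.25, 0.25]

    print(expected_code_length(p, p), entropy(p))             # both 1.75 bits
    print(expected_code_length(p, q), entropy(p) + kl(p, q))  # both 2.0 bits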
In this section, we consider how to conduct prefix coding in the case where the true probability distribution is not known in advance. We call such a coding a universal coding. It is impossible to design a completely universal coding without any information about the data generation mechanism. Thus, we introduce a class of probability distributions P, which is a set of probability distributions whose representations are restricted under some constraint. We consider how to encode the data with the help of P even though the true distribution is unknown.
First, we consider what we call the two-part coding. The idea of two-part coding is to encode the data sequence in two steps: (1) encode a single p selected from P, and (2) encode the data sequence x relative to p. In the case where P is discrete, the two-part code-length with a selected p is written as

$$L_{\mathrm{two\text{-}part}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} \min_{p\in\mathcal{P}}\{-\log p(x) + L(p)\}, \qquad(1.9)$$

where L(p) denotes the code-length required to encode p itself. In the case where P is continuous, P is first quantized into a discrete class and (1.9) is applied to the quantized class, where the infimum is taken over all the methods for quantization of P.
The two-part coding is the most typical universal coding method. The crucial
problem in it is how we can find the optimal quantization method for a given class
of probability distributions. This is discussed in Sect. 3.2.4.
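As a minimal sketch of two-part coding (our own illustration, not the book's procedure), consider the Bernoulli class with the parameter quantized to a grid of m values; encoding the chosen grid point costs log m bits, and the two-part code-length is the minimum of −log p(x; θ) + log m over the grid.

    # Sketch: two-part code-length for a binary sequence under a quantized Bernoulli class.
    import math

    def neg_log_likelihood(x, theta):
        k, n = sum(x), len(x)                   # number of ones, sequence length
        return -(k * math.log2(theta) + (n - k) * math.log2(1.0 - theta))

    def two_part_code_length(x, m=16):
        """min over the grid of [ -log2 p(x; theta) + log2 m ]; the log2 m bits
        encode which of the m grid points was selected (the first part)."""
        grid = [(i + 0.5) / m for i in range(m)]      # quantized parameter values in (0, 1)
        return min(neg_log_likelihood(x, t) + math.log2(m) for t in grid)

    x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(round(two_part_code_length(x, m=16), 3))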
We next introduce the Bayes coding. We denote the Bayes code-length for a data
sequence x relative to a class P as LBayes (x; P). It is defined as the code-length for x
relative to the Bayesian marginal distribution over P. In the case where P is discrete,
letting π( p) be the prior probability mass function over P, it is written as
$$L_{\mathrm{Bayes}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} -\log\sum_{p\in\mathcal{P}}\pi(p)\,p(x). \qquad(1.11)$$
In the case where P is continuous, the sum in (1.11) is replaced with an integral over P:

$$L_{\mathrm{Bayes}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} -\log\int_{\mathcal{P}}\pi(p)\,p(x)\,\mu(dp), \qquad(1.12)$$

where π(p) is the prior probability density function over P with respect to the measure μ.
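As a sketch of the Bayes coding (our own illustration), for the Bernoulli class with a uniform prior density on θ the marginal probability of a sequence with k ones in n trials is ∫₀¹ θ^k(1−θ)^{n−k} dθ = k!(n−k)!/(n+1)!, so the Bayes code-length has a closed form; the code below also checks it against a direct numerical integration.

    # Sketch: Bayes code-length for the Bernoulli class with a uniform prior on theta.
    import math

    def bayes_code_length_exact(x):
        k, n = sum(x), len(x)
        # marginal likelihood: integral_0^1 theta^k (1-theta)^(n-k) dtheta = B(k+1, n-k+1)
        log_marginal = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
        return -log_marginal / math.log(2)          # convert nats to bits

    def bayes_code_length_numeric(x, grid=100000):
        k, n = sum(x), len(x)
        total = 0.0
        for i in range(grid):                       # Riemann sum over theta in (0, 1)
            t = (i + 0.5) / grid
            total += t ** k * (1.0 - t) ** (n - k)
        return -math.log2(total / grid)

    x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(round(bayes_code_length_exact(x), 4), round(bayes_code_length_numeric(x), 4))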
Next, we introduce the counting coding. Let X = {0, 1, ..., k} and, for a sequence x = x_1 ⋯ x_n ∈ X^n, let n_i be the number of occurrences of the symbol i in x (so that n_0 + ⋯ + n_k = n). The number of sequences of length n having the statistics (n_0, ..., n_k) is given by the multinomial coefficient

$$\binom{n}{n_0\cdots n_k} \overset{\mathrm{def}}{=} \frac{n!}{n_0!\cdots n_k!}.$$

Hence, once (n_0, ..., n_k) is known, the sequence x can be specified with

$$\log\binom{n}{n_0\cdots n_k}$$

bits. We then define the counting code-length L_count(x) for x as

$$L_{\mathrm{count}}(x) \overset{\mathrm{def}}{=} \log\binom{n}{n_0\cdots n_k} + \sum_{i=0}^{k}\log\left(n-\sum_{j=0}^{i-1}n_j+1\right), \qquad(1.13)$$

where the second term is the code-length required to encode the statistics (n_0, ..., n_k) themselves.
By Stirling's formula

$$\log n! = \left(n+\frac{1}{2}\right)\log n - n + \log\sqrt{2\pi} + o(1),$$

the counting code-length (1.13) is expanded as

$$L_{\mathrm{count}}(x) = nH\left(\frac{n_0}{n},\ldots,\frac{n_k}{n}\right) + \frac{1}{2}\log\frac{n}{n_0\cdots n_k} + \sum_{i=0}^{k}\log\left(n-\sum_{j=0}^{i-1}n_j+1\right) - k\log\sqrt{2\pi} + o(1), \qquad(1.14)$$

where $H(z_0,\ldots,z_k) \overset{\mathrm{def}}{=} -\sum_{i=0}^{k} z_i\log z_i$.
At a glance, there seems to be no probabilistic assumption for data generation;
however, we can interpret this as each sequence being uniformly distributed over
the set of sequences of length n with the statistics (n 0 , . . . , n k ). We call the coding
whose code-length is given by (1.14) the counting coding.
The counting coding is effective when no probability distribution of the data is specified. However, if the distribution of the data is known to be biased, this coding does not necessarily attain the shortest code-length on average.
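The following sketch (our own illustration) computes the exact counting code-length (1.13) for a short sequence over X = {0, 1, 2}: the first term identifies the sequence among all sequences with the same counts, and the second term encodes the counts themselves.

    # Sketch: counting code-length (1.13) for a sequence over {0, 1, ..., k}.
    import math
    from collections import Counter

    def counting_code_length(x, k):
        n = len(x)
        freq = Counter(x)
        counts = [freq.get(i, 0) for i in range(k + 1)]
        # log2 of the multinomial coefficient n! / (n_0! ... n_k!)
        log_multinom = (math.lgamma(n + 1)
                        - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
        # bits needed to encode the counts n_0, ..., n_k themselves
        log_counts, remaining = 0.0, n
        for c in counts:
            log_counts += math.log2(remaining + 1)
            remaining -= c
        return log_multinom + log_counts

    x = [0, 1, 1, 2, 0, 1, 2, 1, 1, 0]
    print(round(counting_code_length(x, k=2), 3))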
For a given data sequence x, we consider the maximum likelihood max_{p∈P} p(x), supposing that the maximum exists. Note that the maximum likelihood does not form a probability distribution over X^n because

$$C_n(\mathcal{P}) \overset{\mathrm{def}}{=} \sum_{x\in\mathcal{X}^n}\max_{p\in\mathcal{P}} p(x) > 1.$$
We thus normalize the maximum likelihood to define $p_{\mathrm{NML}}(x) \overset{\mathrm{def}}{=} \max_{p\in\mathcal{P}} p(x)/C_n(\mathcal{P})$ and set

$$L_{\mathrm{NML}}(x) \overset{\mathrm{def}}{=} -\log p_{\mathrm{NML}}(x) = -\log\max_{p\in\mathcal{P}} p(x) + \log C_n(\mathcal{P}), \qquad(1.15)$$

which we call the normalized maximum likelihood (NML) code-length for x relative to P. We call the coding whose code-length function is given by (1.15) the NML coding.
The NML coding is one of the central notions in this book. It is the most recommendable coding because it is defined without any prior information or any quantization of the class. It is defined more precisely in Sect. 1.4 for the case where the model class is parametric. The optimality of the NML coding will be shown in a number of scenarios (see Sects. 1.4.2 and 3.2.3).
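The following sketch (our own illustration) computes the exact NML code-length (1.15) for the Bernoulli class: the normalizer C_n is obtained by summing the maximized likelihood over all sequences, grouped by their number of ones.

    # Sketch: exact NML code-length for the Bernoulli class over binary sequences.
    import math

    def log2_max_likelihood(k, n):
        """log2 of max_theta theta^k (1-theta)^(n-k); the maximum is attained at theta = k/n."""
        out = 0.0
        if k > 0:
            out += k * math.log2(k / n)
        if k < n:
            out += (n - k) * math.log2((n - k) / n)
        return out

    def log2_Cn(n):
        """log2 of C_n = sum over x in {0,1}^n of max_theta p(x; theta),
        grouped by the number of ones k (there are C(n, k) sequences for each k)."""
        total = sum(math.comb(n, k) * 2.0 ** log2_max_likelihood(k, n) for k in range(n + 1))
        return math.log2(total)

    def nml_code_length(x):
        n, k = len(x), sum(x)
        return -log2_max_likelihood(k, n) + log2_Cn(n)

    x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(round(nml_code_length(x), 4))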
Kolmogorov complexity [15, 16] is an approach to quantify the information for a data
sequence. It does not make any explicit assumption for a probabilistic mechanism
of the data generation, but rather defines the complexity in terms of the size of the
computer program.
Definition 1.3 Let ℓ(p) be the length of a program p and U(p) be the output of the universal Turing machine U when the program p is input. We define the Kolmogorov complexity of a data sequence x as

$$K(x) \overset{\mathrm{def}}{=} \min_{p:U(p)=x}\ell(p). \qquad(1.16)$$
The Kolmogorov complexity of x is the length of the shortest program that produces x.
We further define the universal probability as follows:
Definition 1.4 We define the universal probability of x as

$$P_U(x) \overset{\mathrm{def}}{=} \sum_{p:U(p)=x} 2^{-\ell(p)}. \qquad(1.17)$$
According to [3], we have the following relation between the Kolmogorov complexity and the universal probability.

Theorem 1.3 (Equivalence between Kolmogorov complexity and universal probability) [3] For some c < ∞, for all x, we have

$$2^{-K(x)} \le P_U(x) \le c\,2^{-K(x)}. \qquad(1.18)$$
Theorem 1.3 shows that the universal probability and negative exponentiated
Kolmogorov complexity are essentially equivalent through (1.18). This relation is
analogous to that between the probability distribution and Shannon information.
If x can be generated according to a probability distribution in a certain class
P, then the Kolmogorov complexity of x will match the code-length for x with the
help of P (see also [3] pp: 143–145). We do not describe the details of the theory of
Kolmogorov complexity; instead, we recommend the books [3, 20] for readers who
are interested in this topic. We merely mention here that the Kolmogorov complexity
can be thought of as a kind of universal coding.
Let P = {p(x; θ) : θ ∈ Θ} be a parametric class of probability distributions. For a data sequence x, we have

$$\min_\theta\{-\log p(x;\theta)\} = -\log p(x;\hat\theta(x)), \qquad(1.19)$$

where

$$\hat\theta(x) \overset{\mathrm{def}}{=} \operatorname*{argmax}_\theta\, p(x;\theta). \qquad(1.20)$$
Here we suppose that the maximum of p(x; θ ) with respect to θ exists. Equation
(1.19) implies that plugging θ = θ̂ (x) into the negative log likelihood yields the least
Shannon information for x relative to P, which we call the baseline of x with respect
to P. Note that (1.19) does not define a prefix code-length function over Xn because
$$\sum_{x} p(x;\hat\theta(x)) > 1, \qquad(1.21)$$
which implies that (1.19) does not satisfy the Kraft inequality. We see that normalizing
p(x; θ̂ (x)) by (1.21) forms a probability distribution over Xn .
Definition 1.5 We call (1.22) the normalized maximum likelihood distribution (NML distribution) over X^n:

$$p_{\mathrm{NML}}(x) \overset{\mathrm{def}}{=} \frac{p(x;\hat\theta(x))}{\sum_{y\in\mathcal{X}^n} p(y;\hat\theta(y))}. \qquad(1.22)$$

In the case where X is continuous, the sum in the denominator is replaced with an integral. We define the normalized maximum likelihood (NML) code-length for x with respect to P as

$$L_{\mathrm{NML}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} -\log p_{\mathrm{NML}}(x) = -\log p(x;\hat\theta(x)) + \log\sum_{y\in\mathcal{X}^n} p(y;\hat\theta(y)). \qquad(1.23)$$
The NML code-length (1.23) for x relative to P is also called the stochastic
complexity of x with respect to P. We may rewrite this as SC(x; P).
The stochastic complexity can be thought of as an extension of Shannon infor-
mation to the case where the true probability distribution is not given in advance
but rather a class of probability distributions is given instead. This is justified in
Sects. 1.4.2 and 1.4.3.
We pay special attention to the second term in (1.23), which is the logarithm of
the normalization term.
Definition 1.6 We define the parametric complexity of P for data length n as

$$\log C(\mathcal{P}) \overset{\mathrm{def}}{=} \log\sum_{x\in\mathcal{X}^n}\max_\theta\, p(x;\theta) = \log\sum_{x\in\mathcal{X}^n} p(x;\hat\theta(x)). \qquad(1.24)$$
Definition 1.7 For a probability distribution q over X^n, we define the worst-case regret of q with respect to P as

$$R(q;\mathcal{P}) \overset{\mathrm{def}}{=} \max_{x\in\mathcal{X}^n}\{-\log q(x) + \log p(x;\hat\theta(x))\}, \qquad(1.25)$$

where the maximum is taken over all x ∈ X^n. We define the min-max regret with respect to P as

$$R(\mathcal{P}) \overset{\mathrm{def}}{=} \min_q R(q;\mathcal{P}), \qquad(1.26)$$

where the minimum is taken over all probability distributions over X^n.
In Definition 1.7, we assume that there exist the minimum and maximum values
for the regret. If no such values exist, the minimum and maximum would be replaced
with the infimum and supremum, respectively.
We justify the NML code-length from the point of view that it achieves the min-
max regret. The following theorem was first proved by Shtarkov [31].
Theorem 1.4 (Min-max regret optimality of the NML code-length) [31] The NML distribution is the only distribution that achieves the min-max regret. That is,

$$p_{\mathrm{NML}} = \operatorname*{argmin}_q\,\max_{x\in\mathcal{X}^n}\{-\log q(x) + \log p(x;\hat\theta(x))\}, \qquad(1.27)$$

where the minimum is taken over all probability distributions. Then, the min-max regret coincides with the parametric complexity.
The min-max regret on the right-hand side of (1.27) is the minimum of the largest value of any prefix code-length relative to the baseline with respect to P. Theorem 1.4 shows that the probability distribution achieving the min-max regret is the NML distribution. The NML code-length associated with the NML distribution has optimality in this sense.
Proof Let the worst-case regret for q relative to P be R(q; P). Let the probability distribution that achieves the minimum of R(q; P) be q = p*. If p* ≠ p_NML, then for some x, p_NML(x) > p*(x), because both are probability distributions over X^n. For such an x, we have

$$-\log p^*(x) + \log p(x;\hat\theta(x)) > -\log p_{\mathrm{NML}}(x) + \log p(x;\hat\theta(x)) = \log C(\mathcal{P}) = R(p_{\mathrm{NML}};\mathcal{P}).$$

Hence R(p*; P) > R(p_NML; P), which contradicts the fact that p* achieves the minimum of R(q; P). This implies that p* = p_NML. Then R(P) = R(p_NML; P) = log C(P). The proof is completed.
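As a numerical illustration of Theorem 1.4 (our own sketch, not from the book), the worst-case regret max_x{−log q(x) + log p(x; θ̂(x))} can be computed over all binary sequences of a small length for q = p_NML and for a uniform-prior Bayes mixture; the NML distribution attains the smaller worst-case value, which equals log C_n.

    # Sketch: worst-case regret of NML vs. a uniform-prior Bayes mixture (Bernoulli class, small n).
    import math
    from itertools import product

    n = 8

    def max_lik(x):
        k = sum(x)
        return (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0

    seqs = list(product([0, 1], repeat=n))
    Cn = sum(max_lik(x) for x in seqs)              # normalizer of the NML distribution

    def bayes_mixture(x):
        k = sum(x)                                  # uniform prior over theta in (0, 1)
        return math.exp(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2))

    def worst_case_regret(q):
        return max(-math.log2(q(x)) + math.log2(max_lik(x)) for x in seqs)

    print("log2 C_n          =", round(math.log2(Cn), 4))
    print("regret of NML     =", round(worst_case_regret(lambda x: max_lik(x) / Cn), 4))
    print("regret of mixture =", round(worst_case_regret(bayes_mixture), 4))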
Theorem 1.5 (A lower bound on the expected code-length for parametric classes) [23] Let P = {p(x; θ) : θ ∈ Θ} be a k-dimensional parametric class of probability distributions, where Θ is a compact set. Suppose that for each θ ∈ Θ, letting θ̂(x) be the MLE of θ from x, the asymptotic normality holds, i.e., √n(θ̂(x) − θ) is asymptotically distributed according to N(0, I^{-1}(θ)), where N(0, I^{-1}(θ)) is the normal distribution (Gaussian distribution) with mean 0 and variance-covariance matrix I^{-1}(θ), and I(θ) is the Fisher information matrix defined as

$$I(\theta) \overset{\mathrm{def}}{=} \lim_{n\to\infty}\frac{1}{n}\,E_\theta\!\left[-\frac{\partial^2\log p(x;\theta)}{\partial\theta\,\partial\theta^\top}\right], \qquad(1.29)$$

and E_θ denotes the expectation of x taken with respect to p(x; θ). Then, for any prefix code-length function L(x) and any ε > 0, except for θ in a set Θ_0 ⊂ Θ whose Lebesgue volume goes to zero as n → ∞, the following inequality holds:

$$E_\theta[L(x)] \ge E_\theta[-\log p(x;\theta)] + \frac{k-\epsilon}{2}\log n. \qquad(1.30)$$
If for a given L, we happen to choose θ such that L(x) = − log p(x; θ ), then
the lower bound (1.30) is violated; however, Theorem 1.5 means that such a θ only
exists in a range whose volume asymptotically becomes zero.
Below we give the proof of Theorem 1.5 according to [23].
Proof Let us discretize the real-valued parameter space Θ to obtain a finite set Θ̄ so that each axis is discretized with width c/√n, where c is a constant. We denote the discretization as τ : Θ → Θ̄. For θ ∈ Θ̄, let

$$\mathcal{X}^n(\theta) \overset{\mathrm{def}}{=} \{x : \tau(\hat\theta(x)) = \theta\},$$

where θ̂(x) is the MLE of θ from x.
For a given probability mass function p(x; θ), we define P(θ) by

$$P(\theta) \overset{\mathrm{def}}{=} \sum_{x\in\mathcal{X}^n(\theta)} p(x;\theta).$$
Then, by the asymptotic normality, for some n_0 > 0 and some 0 < δ < 1, for all n ≥ n_0 we have P(θ) > 1 − δ. Let q be the (sub)probability distribution associated with the prefix code-length function L, i.e., L(x) = −log q(x), and define Q(θ) by

$$Q(\theta) \overset{\mathrm{def}}{=} \sum_{x\in\mathcal{X}^n(\theta)} q(x).$$
Note that p(x; θ )/P(θ ) and q(x)/Q(θ ) form probability mass functions over
Xn (θ ). We consider the Kullback–Leibler divergence between them to obtain the
following inequality:
$$\sum_{x\in\mathcal{X}^n(\theta)}\frac{p(x;\theta)}{P(\theta)}\log\frac{p(x;\theta)/P(\theta)}{q(x)/Q(\theta)} \ge 0, \qquad(1.31)$$
where we have used the fact that the Kullback–Leibler divergence satisfies D( p||q) ≥
0 for any probability distributions p and q (see the proof of Theorem 1.2). Equiva-
lently, we have
$$\sum_{x\in\mathcal{X}^n(\theta)} p(x;\theta)\log\frac{p(x;\theta)}{q(x)} \ge P(\theta)\log\frac{P(\theta)}{Q(\theta)}. \qquad(1.32)$$
Letting the left-hand side of (1.32) be R(θ), for any given ε > 0, set

$$A_n \overset{\mathrm{def}}{=} \left\{\theta\in\Theta : R(\theta) < \frac{k(1-\epsilon)}{2}\log n\right\}.$$
Then, for any θ ∉ A_n, we have

$$\sum_{x\in\mathcal{X}^n(\theta)} p(x;\theta)\log\frac{p(x;\theta)}{q(x)} \ge \frac{k(1-\epsilon)}{2}\log n. \qquad(1.33)$$
On the other hand, using the inequality log z ≥ 1 − 1/z, we have

$$\sum_{x\in\mathcal{X}^n-\mathcal{X}^n(\theta)} p(x;\theta)\log\frac{p(x;\theta)}{q(x)} \ge \sum_{x\in\mathcal{X}^n-\mathcal{X}^n(\theta)} p(x;\theta)\left(1-\frac{q(x)}{p(x;\theta)}\right) = \sum_{x\in\mathcal{X}^n-\mathcal{X}^n(\theta)}\left(p(x;\theta)-q(x)\right) \ge -1. \qquad(1.34)$$
Since L(x) = − log q(x), combining (1.33) with (1.34) and setting ε′ = kε + 2/log n yields

$$E_\theta[L(x)] \ge E_\theta[-\log p(x;\theta)] + \frac{k-\epsilon'}{2}\log n.$$
In the following, we prove that the Lebesgue measure V_n of A_n goes to zero as n goes to infinity. Let us consider a set of cells, each centered at θ̄ ∈ Θ̄ with width c/√n in each axis. Let the smallest set of such cells covering A_n be Ā_n. Then, we have

$$V_n \le |\bar A_n|\left(\frac{c}{\sqrt n}\right)^k. \qquad(1.35)$$
For θ ∈ Ā_n, by (1.32),

$$\frac{k(1-\epsilon)}{2}\log n > P(\theta)\log\frac{P(\theta)}{Q(\theta)}.$$
This implies, by the asymptotic normality assumption, that for some 0 < δ < 1 and sufficiently large n,

$$Q(\theta) > P(\theta)\exp\left(-\frac{k(1-\epsilon)\log n}{2P(\theta)}\right) > (1-\delta)\exp\left(-\frac{k(1-\epsilon)\log n}{2(1-\delta)}\right).$$
Hence, for sufficiently small ε > 0, for some 0 < α < 1, the following inequality holds:

$$Q(\theta) > n^{-\frac{\alpha k}{2}}.$$
This leads to

$$1 = \sum_{\theta\in\bar\Theta} Q(\theta) \ge |\bar A_n|\, n^{-\frac{\alpha k}{2}}.$$
Thus, we have

$$|\bar A_n| \le n^{\frac{\alpha k}{2}}. \qquad(1.36)$$
Combining (1.35) with (1.36) yields

$$V_n \le c^k\, n^{-\frac{(1-\alpha)k}{2}}. \qquad(1.37)$$

The right-hand side of (1.37) goes to zero as n goes to infinity since 0 < α < 1. This completes the proof.
As will be shown in Sects. 1.5.1 and 1.5.3, under some regularity conditions for
a k-dimensional parametric class P, the following equation asymptotically holds:
$$SC(x;\mathcal{P}) = -\log p(x;\hat\theta(x)) + \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\,d\theta + o(1). \qquad(1.38)$$
Moreover, the following relation holds:

$$E_\theta[-\log p(x;\hat\theta(x))] = E_\theta[-\log p(x;\theta)] - \frac{k}{2} + o(1), \qquad(1.39)$$
where E θ is the expectation of x taken with respect to p(x; θ ).
Therefore, we take the expectation of both sides of (1.40) with respect to p(x; θ).
By the consistency of the MLE and the law of large numbers (see Sect. 9.2.7), as
n becomes sufficiently large, with probability 1, we have
$$\hat I(\hat\theta) \to I(\theta).$$
Therefore,

$$E_\theta[-\log p(x;\theta)] = E_\theta[-\log p(x;\hat\theta)] + \frac{k}{2} + o(1).$$
This completes the proof.
Combining (1.38) with (1.39) yields

$$E_\theta[SC(x;\mathcal{P})] = E_\theta[-\log p(x;\theta)] + \frac{k}{2}\log\frac{n}{2\pi e} + \log\int\sqrt{|I(\theta)|}\,d\theta + o(1).$$
We see that E θ [SC(x; P)] asymptotically matches the lower bound (1.30) within
error o(log n). Recall that this lower bound is a generalization of Shannon’s source
coding theorem to the case where the true probability distribution is not given, but
rather the class of probability distributions is given. In this sense, the stochastic
complexity can be considered as a generalization of Shannon information.
Theorem 1.6 Let P = {p(x; θ) : θ ∈ Θ} be a k-dimensional parametric class of probability distributions. Suppose that the following conditions hold:

1. Let I(θ) be the Fisher information matrix as in (1.29). For any θ ∈ Θ, the limit in (1.29) converges, and for some 0 < c_1, c_2 < ∞, c_1 ≤ |I(θ)| ≤ c_2, where |I(θ)| is the determinant of I(θ).
2. I(θ) is continuous with respect to θ.
3. ∫_Θ √|I(θ)| dθ < ∞.
4. For each θ, the asymptotic normality of the MLE of θ holds. That is, if x is generated according to p(x; θ), then √n(θ̂(x) − θ) is asymptotically normally distributed according to N(0, I^{-1}(θ)) as n → ∞, where θ̂(x) is the MLE of θ from x (i.e., θ̂ = argmax_{θ∈Θ} p(x; θ)).
5. For some positive definite matrix C_0, for all x, I(θ̂(x)) < C_0.
Then, the following equation asymptotically holds for the parametric complexity:

$$\log C(\mathcal{P}) = \frac{k}{2}\log\frac{n}{2\pi} + \log\int_\Theta\sqrt{|I(\theta)|}\,d\theta + o(1). \qquad(1.43)$$
Note that Theorem 1.6 is valid only when the data length is sufficiently large
compared to the number of parameters. It is not necessarily applicable to high-
dimensional parametric models wherein the number of parameters is larger than the
data length. Such high-dimensional models are dealt with in Sects. 1.5.5 and 5.2.
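As a numerical check of the asymptotic formula (our own sketch, not from the book), for the Bernoulli class (k = 1) we have |I(θ)| = 1/(θ(1−θ)) and ∫₀¹√|I(θ)| dθ = π, so (1.43) gives log C(P) ≈ (1/2) log(n/2π) + log π; the code compares this with the exact normalizer computed by enumeration over counts.

    # Sketch: exact vs. asymptotic parametric complexity for the Bernoulli class (k = 1).
    import math

    def exact_log2_C(n):
        total = 0.0
        for k in range(n + 1):                      # group sequences by their number of ones
            ml = (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0
            total += math.comb(n, k) * ml
        return math.log2(total)

    def asymptotic_log2_C(n):
        # (k/2) log(n / 2 pi) + log of the integral of sqrt(|I(theta)|), which equals pi
        return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

    for n in (10, 100, 1000):
        print(n, round(exact_log2_C(n), 4), round(asymptotic_log2_C(n), 4))
    # The two values approach each other as n grows, as (1.43) asserts.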
In the following, we give a number of applications of Theorem 1.6.
First, consider the class of multinomial distributions over X = {0, 1, ..., k}:

$$\mathcal{P}_{\mathrm{Multi}} = \{p(x;\theta) : \theta \in \Delta_k\},$$

where θ = (θ_0, ..., θ_k), θ_i ≥ 0, and ∑_{i=0}^k θ_i = 1 (Δ_k denotes the k-dimensional probability simplex). Letting n_i be the number of occurrences of the symbol i in x ∈ X^n, the MLE of θ is

$$\hat\theta = \left(\frac{n_0}{n},\ldots,\frac{n_k}{n}\right).$$

In Example 2.1, we show how to derive this MLE in detail. Meanwhile, the determinant of the Fisher information matrix is calculated as follows:

$$|I(\theta)| = \prod_{i=0}^{k}\theta_i^{-1}.$$
Thus, applying (1.43) to the calculation of the parametric complexity yields the following formula for the NML code-length for x with respect to P_Multi:

$$L_{\mathrm{NML}}(x;\mathcal{P}_{\mathrm{Multi}}) = -\log p(x;\hat\theta(x)) + \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\,d\theta$$
$$= nH\left(\frac{n_0}{n},\ldots,\frac{n_k}{n}\right) + \frac{k}{2}\log\frac{n}{2\pi} + \log\frac{\pi^{\frac{k+1}{2}}}{\Gamma\!\left(\frac{k+1}{2}\right)}, \qquad(1.45)$$

where $H(z_0,\ldots,z_k) \overset{\mathrm{def}}{=} -\sum_{i=0}^{k} z_i\log z_i$, and $\Gamma(x) = \int_0^\infty e^{-t}t^{x-1}\,dt$ is the Gamma function, which satisfies, for any positive integer m,

$$\Gamma(m) = (m-1)!, \qquad \Gamma(m+1/2) = \frac{1\cdot3\cdot5\cdots(2m-1)}{2^m}\sqrt{\pi}.$$
Comparing (1.45) with the counting code-length (1.14), we see that (1.45) would
be shorter asymptotically.
Next, consider the Poisson distribution with probability mass function

$$p(x;\theta) = \frac{\theta^x e^{-\theta}}{x!}.$$

We denote the class of Poisson distributions as $\mathcal{P}_{\mathrm{Poisson}} = \{p(x;\theta) : 0 < \theta \le 2^a\}$, where a is a positive integer chosen so that the MLE $\hat\theta = \frac{1}{n}\sum_{t=1}^n x_t$ satisfies $\hat\theta \le 2^a$; the Fisher information is I(θ) = 1/θ. Thus, applying (1.43) to the calculation of the parametric complexity yields the following formula for the NML code-length for x with respect to P_Poisson:

$$L_{\mathrm{NML}}(x;\mathcal{P}_{\mathrm{Poisson}}) = \hat\theta n(1-\log\hat\theta) + \log\prod_{t=1}^n x_t! + \frac{1}{2}\log\frac{n}{2\pi} + \left(1+\frac{a}{2}\right)\log 2.$$
Next, consider the class of 1-dimensional Gaussian distributions with density $p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Let τ = σ² and the parameter space be Θ = {(μ, τ) : μ ∈ (−∞, +∞), τ > 0}. In this case, the Fisher information matrix I(μ, τ) is given as follows:

$$I(\mu,\tau) = \begin{pmatrix} 1/\tau & 0 \\ 0 & 1/(2\tau^2) \end{pmatrix}.$$

Thus, we have

$$|I(\mu,\tau)| = \frac{1}{2\tau^3}.$$

The MLEs are

$$\hat\mu = \frac{1}{n}\sum_{t=1}^n x_t, \qquad \hat\tau = \frac{1}{n}\sum_{t=1}^n (x_t-\hat\mu)^2.$$
The problem is that, in applying the asymptotic approximation formula, the third term on the right-hand side of (1.44) diverges if the range of parameters is infinite. One method for avoiding this problem is to restrict the range of parameters as follows. Letting s and r be the least positive integers such that |μ̂| ≤ 2^s and τ̂ ≥ 2^{−2r}, we take the parameter space Θ̃ = {(μ, τ) : |μ| ≤ 2^s, τ ≥ 2^{−2r}} rather than Θ. In other words, the data range is restricted so that μ̂ and τ̂ are in Θ̃. We write the class of 1-dimensional Gaussian distributions with such a restricted parameter space as P_Gauss. A simple calculation leads to

$$\int_{(\mu,\tau)\in\tilde\Theta}\sqrt{|I(\mu,\tau)|}\,d\mu\,d\tau = 2^{s+r+3/2}.$$
Here, L(r, s) is the prefix code-length for r and s, calculated as log* s + log* r, where

$$\log^*(x) \overset{\mathrm{def}}{=} \log c + \log x + \log\log x + \cdots,$$

where c = 2.865 and the sum is taken over all the positive terms. log*(x) is Rissanen's integer code-length function [23] (p. 34), where the base of the logarithm is 2.
Suppose that the probability distribution is decomposed as

$$p(x;\theta) = f(x\,|\,\hat\theta(x))\,g(\hat\theta(x);\theta),$$

where

$$g(\bar\theta;\theta) \overset{\mathrm{def}}{=} \sum_{x:\hat\theta(x)=\bar\theta} p(x;\theta), \qquad f(x\,|\,\hat\theta(x)) \overset{\mathrm{def}}{=} p(x;\theta)/g(\hat\theta(x);\theta).$$
Let the normalization term in the NML distribution be C(P), where log C(P) is the parametric complexity for P. Then, it is calculated using the g-function as follows:

$$C(\mathcal{P}) = \sum_x p(x;\hat\theta(x)) = \int d\hat\theta\sum_{y:\hat\theta(y)=\hat\theta} p(y;\hat\theta) = \int g(\hat\theta;\hat\theta)\,d\hat\theta. \qquad(1.48)$$
For example, consider the class P_Exp of exponential distributions p(x; θ) = θe^{−θx} with θ restricted to [θ_min, θ_max]. The MLE from x = x_1 ⋯ x_n is θ̂(x) = n/∑_{t=1}^n x_t, and g is the probability density function for the Gamma distribution with shape parameter n and scale parameter 1/θ:

$$g(\hat\theta(x);\theta) = \frac{\theta^n n^n}{\Gamma(n)\,\hat\theta(x)^{n+1}}\exp\left(-\theta\cdot\frac{n}{\hat\theta(x)}\right).$$
$$C(\mathcal{P}_{\mathrm{Exp}}) = \int_{\theta_{\min}}^{\theta_{\max}} g(\hat\theta;\hat\theta)\,d\hat\theta = \frac{n^n}{e^n(n-1)!}\int_{\theta_{\min}}^{\theta_{\max}}\frac{1}{\hat\theta}\,d\hat\theta = \frac{n^n}{e^n(n-1)!}\log\frac{\theta_{\max}}{\theta_{\min}}. \qquad(1.50)$$
Next, consider the class of d-dimensional Gaussian distributions. The MLEs of the mean and the covariance matrix from x = x_1 ⋯ x_n are

$$\hat\mu(x) = \frac{1}{n}\sum_{t=1}^n x_t, \qquad \hat\Sigma(x) = \frac{1}{n}\sum_{t=1}^n (x_t-\hat\mu(x))(x_t-\hat\mu(x))^\top.$$
For a parameter vector ξ = (R, ε_{1,1}, ..., ε_{1,d}, ε_2) ∈ (R^{d+2})_+, we restrict the data domain to X^n(ξ), defined as

$$\mathcal{X}^n(\xi) \overset{\mathrm{def}}{=} \left\{x : \|\hat\mu(x)\| \le R,\ \epsilon_{1,j} \le \hat\lambda_j \le \epsilon_2 < 1\ (j=1,\ldots,d),\ \frac{\mathrm{Vol}(O(d))}{2^d}\,\epsilon_2^{\frac{d(d-1)}{2}} < 1\right\},$$

where λ̂_j is the jth largest eigenvalue of Σ̂, ‖x‖ = √(xᵀx), and Vol(O(d)) is the volume of the orthogonal group in dimension d embedded in R^{d×d}:

$$\mathrm{Vol}(O(d)) = \frac{2^d\,\pi^{\frac{d^2}{2}}}{\Gamma_d\!\left(\frac{d}{2}\right)}.$$
Here, Γ_d is the multivariate Gamma function defined as $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}}\prod_{j=1}^d\Gamma\!\left(x+\frac{1-j}{2}\right)$, and Γ is the Gamma function.
For the d-dimensional Gaussian class, the decomposition is written as

$$p(x;\theta) = f(x\,|\,\hat\mu(x),\hat\Sigma(x))\,g_1(\hat\mu(x);\mu,\Sigma)\,g_2(\hat\Sigma(x);\Sigma),$$

where

$$g_1(\hat\mu;\mu,\Sigma) \overset{\mathrm{def}}{=} \frac{1}{(2\pi/n)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{n}{2}(\hat\mu-\mu)^\top\Sigma^{-1}(\hat\mu-\mu)\right),$$

$$g_2(\hat\Sigma;\Sigma) \overset{\mathrm{def}}{=} \frac{|\hat\Sigma|^{\frac{n-d-2}{2}}}{2^{\frac{d(n-1)}{2}}\,\left|\frac{\Sigma}{n}\right|^{\frac{n-1}{2}}\,\Gamma_d\!\left(\frac{n-1}{2}\right)}\exp\left(-\frac{1}{2}\mathrm{Tr}\!\left(n\Sigma^{-1}\hat\Sigma\right)\right).$$

We fix μ̂(x) = μ̂ and Σ̂(x) = Σ̂. We define g(Σ̂) as

$$g(\hat\Sigma) \overset{\mathrm{def}}{=} g_1(\hat\mu;\hat\mu,\hat\Sigma)\,g_2(\hat\Sigma;\hat\Sigma).$$

Then, the parametric complexity is upper-bounded as follows. Suppose that Σ̂ is written as Σ̂ = U diag(λ̂_1, ..., λ̂_d) Uᵀ for some orthogonal matrix U.
We then have

$$\begin{aligned}
C(\mathcal{P}_{\mathrm{Gauss}}^{d}) &= \int_{\mathcal{X}^n(\xi)} p(x;\hat\theta(x))\,dx\\
&= \int_{\|\hat\mu\|\le R} d\hat\mu\int g(\hat\Sigma)\,d\hat\Sigma\\
&= \int_{\|\hat\mu\|\le R} d\hat\mu\int dU\int\prod_{1\le i<j\le d}|\hat\lambda_i-\hat\lambda_j|\;g(\hat\Sigma)\,d(\hat\lambda_1\cdots\hat\lambda_d)\\
&< \frac{1}{2^{d}}\int dU\;\frac{2^{d+1}R^d\left(\prod_{j=1}^{d}\epsilon_{1,j}^{-\frac{d}{2}}-\prod_{j=1}^{d}\epsilon_{2}^{-\frac{d}{2}}\right)}{d^{d+1}\,\Gamma\!\left(\frac{d}{2}\right)}\;\epsilon_2^{\frac{d(d-1)}{2}}\cdot\left(\frac{n}{2e}\right)^{\frac{dn}{2}}\frac{1}{\Gamma_d\!\left(\frac{n-1}{2}\right)}\\
&< \frac{\mathrm{Vol}(O(d))}{2^{d}}\,\epsilon_2^{\frac{d(d-1)}{2}}\,B(d,R,\epsilon_1)\times\left(\frac{n}{2e}\right)^{\frac{dn}{2}}\frac{1}{\Gamma_d\!\left(\frac{n-1}{2}\right)}\\
&< B(d,R,\epsilon_1)\times\left(\frac{n}{2e}\right)^{\frac{dn}{2}}\frac{1}{\Gamma_d\!\left(\frac{n-1}{2}\right)}, \qquad(1.51)
\end{aligned}$$

where

$$B(d,R,\epsilon_1) \overset{\mathrm{def}}{=} \frac{2^{d+1}R^d\prod_{j=1}^{d}\epsilon_{1,j}^{-\frac{d}{2}}}{d^{d+1}\,\Gamma\!\left(\frac{d}{2}\right)}.$$

The last inequality follows from the restriction on the data domain X^n(ξ); in the derivation we have also used the positive definiteness of Σ̂.
This section introduces the Fourier method developed by Suzuki and Yamanishi [32, 33] for calculating the parametric complexity. Let P_k = {p(x; θ) : θ ∈ Θ} be a k-dimensional parametric class of probability distributions. Let p̃(x; ξ) be the Fourier transform of p(x; θ) ∈ P_k with respect to θ, so that

$$\tilde p(x;\xi) = \frac{1}{(2\pi)^{k/2}}\int d\theta\,\exp(-i\xi^\top\theta)\,p(x;\theta), \qquad(1.52)$$

$$p(x;\theta) = \frac{1}{(2\pi)^{k/2}}\int d\xi\,\exp(i\xi^\top\theta)\,\tilde p(x;\xi). \qquad(1.53)$$
We define

$$h(\theta) \overset{\mathrm{def}}{=} \frac{1}{(2\pi)^k}\int d\xi\sum_x p(x;\theta)\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right),$$

where θ̂(x) = argmax_θ p(x; θ) is the MLE of θ from x, which is assumed to exist.
The following theorem shows that the parametric complexity can be calculated
as the integral of h(θ ) with respect to θ .
Theorem 1.7 (Fourier method for computing parametric complexity) [32, 33] For
a k-dimensional parametric class Pk of probability distributions, we denote the
parametric complexity as log Cn (k). We make the following assumptions:
1. $\int d\xi\,|\tilde p(x;\xi)| < \infty$ for each x.
2. We define

$$\phi_{\hat\theta(x)-\theta}(\xi) \overset{\mathrm{def}}{=} \sum_x p(x;\theta)\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right),$$

and assume that $\int d\xi\,|\phi_{\hat\theta(x)-\theta}(\xi)| < \infty$.

Then, the parametric complexity is given by

$$\log C_n(k) = \log\int d\theta\,h(\theta). \qquad(1.55)$$
Proof Letting θ̂(x) be the MLE of θ , we use the Fourier transforms (1.52) and (1.53)
to calculate the parametric complexity for P as follows:
$$\sum_x\max_\theta p(x;\theta) = \sum_x\frac{1}{(2\pi)^{k/2}}\int d\xi\,\exp(i\xi^\top\hat\theta(x))\,\tilde p(x;\xi)$$
$$= \frac{1}{(2\pi)^{k}}\sum_x\int d\xi\int d\theta\,\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right)p(x;\theta)$$
$$= \frac{1}{(2\pi)^{k}}\int d\theta\int d\xi\sum_x p(x;\theta)\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right) \qquad(1.56)$$
$$= \int d\theta\,h(\theta).$$

Equation (1.56) comes from the integrability of p̃(x; ξ) and of ∑_x p(x; θ) exp(iξᵀ(θ̂(x) − θ)) under the assumptions in Theorem 1.7 and Fubini's theorem (see Sect. 9.1.7). This completes the proof of Theorem 1.7.
In the following, we show that for a specific family called the exponential family, the parametric complexity can be calculated simply using Theorem 1.7.

Definition 1.9 We define the exponential family as a class of probability distributions, each with the following form of probability mass function (or probability density function):

$$p(x;\eta) = \exp\!\left(\eta^\top t(x) - \psi(\eta)\right),$$

where η is the natural parameter, t(x) is a sufficient statistic, and ψ(η) is the normalization term. Let τ = τ(η) be the corresponding expectation parameter. We denote its inverse transform as η(τ), assuming that τ is bijective. Note that the following relation holds:

$$\tau(\eta) = E_\eta[t(x)] = \frac{\partial\psi(\eta)}{\partial\eta}.$$