Learning with the Minimum Description Length Principle
Kenji Yamanishi
To my wife and my son.
The most important issue that we address throughout this book is what information
is. Once information is properly quantified, then “learning” can be thought of as the
task of extracting the maximum information from data. Shannon [30] quantified the
amount of information within the framework of probability theory. Subsequently,
several attempts were made to quantify the information in terms of Kolmogorov
complexity [15], algorithmic complexity [2], information spectrum [9], etc. The
important point is that no matter how the information is defined, it should be defined relative to a model. Here, a model means a constraint on the representation of the source that generates the data. With the recent progress in machine learning, many kinds of models have been developed to obtain rich knowledge representations. In this book, we start by addressing the issue of how to define the information contained in data with the help of models.
This section shows that the notions of coding, information, and probability distri-
butions are closely related to one another and that they are unified under the MDL
principle. We refer the reader to the seminal book Elements of Information Theory by Cover and Thomas [3] for a review of the fundamental notions of information theory.
Let X be a finite set of symbols and Xn be a set of sequences of n symbols over
X. A source coding is defined as a map φ : X → {0, 1}∗ , where {0, 1}∗ is the set of
sequences over {0, 1}. The purpose of a source coding is to represent the original
data using as short a binary sequence as possible (see Fig. 1.1). Hereafter, we simply
refer to a source coding as a coding.
We require a specific feature for a coding.
Definition 1.1 We define a prefix coding as a coding φ such that for any x ≠ y, φ(x) is not a prefix of φ(y).
Consider the situation where we encode a sequence of symbols in X^n into a binary sequence by concatenating the codewords of the individual symbols. With a prefix coding, we can decode the binary sequence to recover the original symbols even if we do not place a delimiter at the end of each codeword.
For a coding φ, let L : X → R+ be a map so that L(x) is the code-length for φ(x).
We call such L a code-length function associated with φ. The necessary and sufficient
condition for any coding φ to be a prefix coding is given in terms of the code-length
function L.
In the following, we suppose that X is finite. The following theorem follows the
book by Cover and Thomas [3] (pp: 107–110).
Theorem 1.1 (The Kraft inequality) [19] Let L be a code-length function associated with a coding φ. If φ is a prefix coding, then L satisfies the following inequality:

$$\sum_{x\in\mathcal{X}} 2^{-L(x)} \le 1. \qquad(1.1)$$
Conversely, if L satisfies (1.1), then there exists a prefix coding for which L is the
code-length function.
The inequality (1.1) is called the Kraft inequality.
Proof We can represent a codeword for a prefix coding as a leaf node of a binary tree.
That is, any codeword associated with a prefix coding can be represented as a path
from the root to a leaf node. The prefix condition is equivalent to the condition that
a codeword cannot be an ancestor of any other codeword on the tree (see Fig. 1.2).
Let L_max be the maximum length of any codeword over X. For x ∈ X, the codeword of x, located at depth L(x), has 2^{L_max − L(x)} descendants at depth L_max. Under the prefix condition, the codeword of x is a leaf of the tree, so the descendant sets of distinct codewords are disjoint, and their total number at depth L_max is at most 2^{L_max}. This leads to

$$\sum_{x\in\mathcal{X}} 2^{L_{\max}-L(x)} \le 2^{L_{\max}}.$$

Equivalently, we have

$$\sum_{x\in\mathcal{X}} 2^{-L(x)} \le 1.$$
Further, the Kraft inequality extends to the case where X is countably infinite.
Example 1.1 Let X = {A, B, C, D}. We consider the coding φ such that

φ(A) = 000, φ(B) = 001, φ(C) = 01, φ(D) = 1.

All the codewords are located at the leaves of a binary tree, as shown in Fig. 1.2. Hence, we see that φ is a prefix coding over X. Since the code-length function is given by

L(A) = 3, L(B) = 3, L(C) = 2, L(D) = 1,

the Kraft inequality (1.1) is satisfied with equality: 2^{-3} + 2^{-3} + 2^{-2} + 2^{-1} = 1.
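As an illustrative sketch (our own Python example, not part of the original text), the following code checks the Kraft inequality for the code of Example 1.1 and decodes a concatenated bit string without any delimiters; the function and variable names are ours.

    # Sketch: verify the Kraft inequality and prefix-free decoding for Example 1.1.
    from math import fsum

    code = {"A": "000", "B": "001", "C": "01", "D": "1"}

    # Kraft sum: sum_x 2^{-L(x)} must be <= 1 for any prefix code.
    kraft_sum = fsum(2.0 ** -len(w) for w in code.values())
    print("Kraft sum =", kraft_sum)  # equals 1.0 for this code

    def decode(bits, code):
        """Decode a concatenation of codewords; no commas are needed
        because no codeword is a prefix of another."""
        inverse = {w: s for s, w in code.items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in inverse:          # a complete codeword has been read
                out.append(inverse[buf])
                buf = ""
        assert buf == "", "trailing bits do not form a codeword"
        return "".join(out)

    print(decode("000" + "01" + "1" + "001", code))  # -> "ACDB"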
For a code-length function L(x) satisfying the Kraft inequality, if we define p(x) by

$$p(x) = \frac{2^{-L(x)}}{\sum_{y\in\mathcal{X}} 2^{-L(y)}},$$

then p(x) forms a probability distribution over X. Conversely, for a probability distribution p(x) over X, if we define L(x) by L(x) = ⌈− log p(x)⌉, then L(x) satisfies (1.1). Here, ⌈z⌉ is the least integer not smaller than z. Hereafter, the logarithm refers to the base 2 logarithm. We may replace it with the natural logarithm depending on the context. Then, the Kraft inequality (1.1) is replaced with the following:

$$\sum_{x\in\mathcal{X}} e^{-L(x)} \le 1.$$
More generally, if p(x) ≥ 0 and ∫ p(x) μ(dx) = 1, then we call p(x) a probability density function over X, where μ is a given measure on X. We simply write μ(dx) as dx for the Lebesgue measure throughout this book. Hereafter, we proceed with the discussion supposing that X is discrete; the same argument holds for the cases where X is continuous, with the probability mass function replaced by the probability density function.
Let us consider the case where X is continuous. Suppose that X is a 1-dimensional
Euclidean space. In this case, we must truncate x ∈ X with finite precision to obtain
[x]. Then, we encode [x] in a finite code-length. Let X̄ be a countable subset of X
obtained by quantizing X with truncation scale at most δ. Let p(x) be a probability
density function over X. Then, we may instead consider the probability mass function
over X̄:
p̄([x]) ≈ p(x)δ.
By taking sufficiently small δ > 0, the code-length for [x] can be calculated approximately by

$$L([x]) \approx -\log \bar p([x]) = -\log p(x) - \log\delta.$$
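As a small numerical sketch (our own example; the Gaussian density and the precision values are arbitrary choices, not from the text), the following code evaluates the approximate code-length −log p(x) − log δ for a truncated real value.

    # Sketch: approximate code-length of a real value truncated to precision delta,
    # using L([x]) ~ -log2 p(x) - log2 delta for a density p.
    import math

    def gauss_pdf(x, mu=0.0, sigma=1.0):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

    def quantized_code_length(x, delta, pdf=gauss_pdf):
        """Approximate prefix code-length (in bits) for x truncated to a grid of width delta."""
        return -math.log2(pdf(x) * delta)

    for delta in (0.1, 0.01, 0.001):
        print(delta, round(quantized_code_length(0.5, delta), 3))
    # Each tenfold refinement of delta adds about log2(10) ~ 3.32 bits.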
The code-length for any prefix coding can be calculated as in (1.4) using the (inferior)
probability distribution p. When we employ a prefix coding, we call the code-length
function L the prefix code-length function. Although the code-length should be an integer in reality, we allow it to take non-integer values in the discussion that follows, for the sake of mathematical simplicity.
The goal of source coding is data compression. Hence, a coding should be
designed so that the code-length is as short as possible. If the true probability distri-
bution p for generating data is known in advance, the following theorem holds for
the average code-length for the data generated according to p.
Theorem 1.2 (A lower bound on the expected code-length) [3, 30] Suppose that the
data are distributed according to the probability distribution p(x). Then, any prefix
code-length function L(x) satisfies the following inequality:
$$E_p[L(x)] \ge H(p),$$

where $E_p[\cdot]$ denotes the expectation taken with respect to p, and $H(p) \overset{\mathrm{def}}{=} -\sum_{x\in\mathcal{X}^n} p(x)\log p(x)$, which we call the entropy of p(x).
Proof For any prefix code-length function L(x), by Theorem 1.1 there exists an inferior probability distribution q(x) such that L(x) = − log q(x). Then, we have

$$E_p[L(x)] = H(p) + D(p\|q),$$

where $H(p) \overset{\mathrm{def}}{=} E_p[-\log p(x)]$ is the entropy function of p, and $D(p\|q) \overset{\mathrm{def}}{=} E_p[\log(p(x)/q(x))]$ is the Kullback–Leibler divergence between p and q. In general, the following inequality holds:

$$D(p\|q) \ge 0,$$

which yields $E_p[L(x)] \ge H(p)$. This completes the proof.
Theorem 1.2 shows that when we know the true probability distribution p in advance, the minimum expected code-length can be achieved with the coding such that for any x,

$$L(x) = -\log p(x).$$
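The following sketch (our own illustration with an arbitrary toy distribution) verifies Theorem 1.2 numerically: the expected code-length under ideal code-lengths L(x) = −log q(x) equals H(p) + D(p||q) and is therefore minimized at q = p.

    # Sketch: E_p[-log q] = H(p) + D(p||q) >= H(p), with equality iff q = p.
    import math

    def entropy(p):
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def kl(p, q):
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def expected_code_length(p, q):
        # ideal (non-integer) code-lengths L(x) = -log2 q(x)
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.125, 0.125]
    q = [0.25, 0.25, 0.25, 0.25]

    print(expected_code_length(p, p), entropy(p))             # both 1.75 bits
    print(expected_code_length(p, q), entropy(p) + kl(p, q))  # both 2.0 bits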
In this section, we consider how to conduct prefix coding in the case where the true probability distribution is not known in advance. We call such a coding a universal coding. It is impossible to design a completely universal coding without any information about the data generation mechanism. Thus, we introduce a class of probability distributions P, which is a set of probability distributions whose representations are restricted under some constraint. We consider how to encode the data with the help of P even though the true distribution is unknown.
First, we consider what we call the two-part coding. The idea of two-part coding is to encode the data sequence in two steps: (1) encode a single p selected from P, and (2) encode the data sequence x relative to p. In the case where P is discrete, the two-part code-length with a selected p is written as

$$L_{\mathrm{two\text{-}part}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} \min_{p\in\mathcal{P}}\{-\log p(x) + L(p)\}, \qquad(1.9)$$

where L(p) denotes the code-length required to encode p itself. In the case where P is continuous, P is first quantized into a discrete class and (1.9) is applied to the quantized class, where the infimum is taken over all the methods for quantization of P.
The two-part coding is the most typical universal coding method. The crucial
problem in it is how we can find the optimal quantization method for a given class
of probability distributions. This is discussed in Sect. 3.2.4.
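As a minimal sketch of two-part coding (our own illustration, not the book's procedure), consider the Bernoulli class with the parameter quantized to a grid of m values; encoding the chosen grid point costs log m bits, and the two-part code-length is the minimum of −log p(x; θ) + log m over the grid.

    # Sketch: two-part code-length for a binary sequence under a quantized Bernoulli class.
    import math

    def neg_log_likelihood(x, theta):
        k, n = sum(x), len(x)                   # number of ones, sequence length
        return -(k * math.log2(theta) + (n - k) * math.log2(1.0 - theta))

    def two_part_code_length(x, m=16):
        """min over the grid of [ -log2 p(x; theta) + log2 m ]; the log2 m bits
        encode which of the m grid points was selected (the first part)."""
        grid = [(i + 0.5) / m for i in range(m)]      # quantized parameter values in (0, 1)
        return min(neg_log_likelihood(x, t) + math.log2(m) for t in grid)

    x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(round(two_part_code_length(x, m=16), 3))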
We next introduce the Bayes coding. We denote the Bayes code-length for a data
sequence x relative to a class P as LBayes (x; P). It is defined as the code-length for x
relative to the Bayesian marginal distribution over P. In the case where P is discrete,
letting π( p) be the prior probability mass function over P, it is written as
$$L_{\mathrm{Bayes}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} -\log\sum_{p\in\mathcal{P}}\pi(p)\,p(x). \qquad(1.11)$$
In the case where P is continuous, the sum in (1.11) is replaced with an integral over P:

$$L_{\mathrm{Bayes}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} -\log\int_{\mathcal{P}}\pi(p)\,p(x)\,\mu(dp), \qquad(1.12)$$

where π(p) is the prior probability density function over P with respect to the measure μ.
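As a sketch of the Bayes coding (our own illustration), for the Bernoulli class with a uniform prior density on θ the marginal probability of a sequence with k ones in n trials is ∫₀¹ θ^k(1−θ)^{n−k} dθ = k!(n−k)!/(n+1)!, so the Bayes code-length has a closed form; the code below also checks it against a direct numerical integration.

    # Sketch: Bayes code-length for the Bernoulli class with a uniform prior on theta.
    import math

    def bayes_code_length_exact(x):
        k, n = sum(x), len(x)
        # marginal likelihood: integral_0^1 theta^k (1-theta)^(n-k) dtheta = B(k+1, n-k+1)
        log_marginal = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
        return -log_marginal / math.log(2)          # convert nats to bits

    def bayes_code_length_numeric(x, grid=100000):
        k, n = sum(x), len(x)
        total = 0.0
        for i in range(grid):                       # Riemann sum over theta in (0, 1)
            t = (i + 0.5) / grid
            total += t ** k * (1.0 - t) ** (n - k)
        return -math.log2(total / grid)

    x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(round(bayes_code_length_exact(x), 4), round(bayes_code_length_numeric(x), 4))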
Next, we introduce the counting coding. Let X = {0, 1, ..., k} and, for a sequence x = x_1 ⋯ x_n ∈ X^n, let n_i be the number of occurrences of the symbol i in x (so that n_0 + ⋯ + n_k = n). The number of sequences of length n having the statistics (n_0, ..., n_k) is given by the multinomial coefficient

$$\binom{n}{n_0\cdots n_k} \overset{\mathrm{def}}{=} \frac{n!}{n_0!\cdots n_k!}.$$

Hence, once (n_0, ..., n_k) is known, the sequence x can be specified with

$$\log\binom{n}{n_0\cdots n_k}$$

bits. We then define the counting code-length L_count(x) for x as

$$L_{\mathrm{count}}(x) \overset{\mathrm{def}}{=} \log\binom{n}{n_0\cdots n_k} + \sum_{i=0}^{k}\log\left(n-\sum_{j=0}^{i-1}n_j+1\right), \qquad(1.13)$$

where the second term is the code-length required to encode the statistics (n_0, ..., n_k) themselves.
By Stirling's formula

$$\log n! = \left(n+\frac{1}{2}\right)\log n - n + \log\sqrt{2\pi} + o(1),$$

the counting code-length (1.13) is expanded as

$$L_{\mathrm{count}}(x) = nH\left(\frac{n_0}{n},\ldots,\frac{n_k}{n}\right) + \frac{1}{2}\log\frac{n}{n_0\cdots n_k} + \sum_{i=0}^{k}\log\left(n-\sum_{j=0}^{i-1}n_j+1\right) - k\log\sqrt{2\pi} + o(1), \qquad(1.14)$$

where $H(z_0,\ldots,z_k) \overset{\mathrm{def}}{=} -\sum_{i=0}^{k} z_i\log z_i$.
At a glance, there seems to be no probabilistic assumption for data generation;
however, we can interpret this as each sequence being uniformly distributed over
the set of sequences of length n with the statistics (n 0 , . . . , n k ). We call the coding
whose code-length is given by (1.14) the counting coding.
The counting coding is effective when no probability distribution of the data is specified. However, if the distribution of the data is known to be biased, this coding does not necessarily attain the shortest code-length on average.
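The following sketch (our own illustration) computes the exact counting code-length (1.13) for a short sequence over X = {0, 1, 2}: the first term identifies the sequence among all sequences with the same counts, and the second term encodes the counts themselves.

    # Sketch: counting code-length (1.13) for a sequence over {0, 1, ..., k}.
    import math
    from collections import Counter

    def counting_code_length(x, k):
        n = len(x)
        freq = Counter(x)
        counts = [freq.get(i, 0) for i in range(k + 1)]
        # log2 of the multinomial coefficient n! / (n_0! ... n_k!)
        log_multinom = (math.lgamma(n + 1)
                        - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
        # bits needed to encode the counts n_0, ..., n_k themselves
        log_counts, remaining = 0.0, n
        for c in counts:
            log_counts += math.log2(remaining + 1)
            remaining -= c
        return log_multinom + log_counts

    x = [0, 1, 1, 2, 0, 1, 2, 1, 1, 0]
    print(round(counting_code_length(x, k=2), 3))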
For a given data sequence x, we consider the maximum likelihood max_{p∈P} p(x), supposing that the maximum exists. Note that the maximum likelihood does not form a probability distribution over X^n because

$$C_n(\mathcal{P}) \overset{\mathrm{def}}{=} \sum_{x\in\mathcal{X}^n}\max_{p\in\mathcal{P}} p(x) > 1.$$
We thus normalize the maximum likelihood to define $p_{\mathrm{NML}}(x) \overset{\mathrm{def}}{=} \max_{p\in\mathcal{P}} p(x)/C_n(\mathcal{P})$ and set

$$L_{\mathrm{NML}}(x) \overset{\mathrm{def}}{=} -\log p_{\mathrm{NML}}(x) = -\log\max_{p\in\mathcal{P}} p(x) + \log C_n(\mathcal{P}), \qquad(1.15)$$

which we call the normalized maximum likelihood (NML) code-length for x relative to P. We call the coding whose code-length function is given by (1.15) the NML coding.
The NML coding is one of the central notions in this book. It is the most recommendable coding because it is defined without any prior information or any quantization of the class. It is defined more precisely in Sect. 1.4 for the case where the model class is parametric. The optimality of the NML coding will be shown in a number of scenarios (see Sects. 1.4.2 and 3.2.3).
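The following sketch (our own illustration) computes the exact NML code-length (1.15) for the Bernoulli class: the normalizer C_n is obtained by summing the maximized likelihood over all sequences, grouped by their number of ones.

    # Sketch: exact NML code-length for the Bernoulli class over binary sequences.
    import math

    def log2_max_likelihood(k, n):
        """log2 of max_theta theta^k (1-theta)^(n-k); the maximum is attained at theta = k/n."""
        out = 0.0
        if k > 0:
            out += k * math.log2(k / n)
        if k < n:
            out += (n - k) * math.log2((n - k) / n)
        return out

    def log2_Cn(n):
        """log2 of C_n = sum over x in {0,1}^n of max_theta p(x; theta),
        grouped by the number of ones k (there are C(n, k) sequences for each k)."""
        total = sum(math.comb(n, k) * 2.0 ** log2_max_likelihood(k, n) for k in range(n + 1))
        return math.log2(total)

    def nml_code_length(x):
        n, k = len(x), sum(x)
        return -log2_max_likelihood(k, n) + log2_Cn(n)

    x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
    print(round(nml_code_length(x), 4))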
Kolmogorov complexity [15, 16] is an approach to quantify the information for a data
sequence. It does not make any explicit assumption for a probabilistic mechanism
of the data generation, but rather defines the complexity in terms of the size of the
computer program.
Definition 1.3 Let ℓ(p) be the length of a program p and U(p) be the output of the universal Turing machine U when the program p is input. We define the Kolmogorov complexity of a data sequence x as

$$K(x) \overset{\mathrm{def}}{=} \min_{p:U(p)=x}\ell(p). \qquad(1.16)$$
The Kolmogorov complexity of x is the length of the shortest program that produces x.
We further define the universal probability as follows:
Definition 1.4 We define the universal probability of x as

$$P_U(x) \overset{\mathrm{def}}{=} \sum_{p:U(p)=x} 2^{-\ell(p)}. \qquad(1.17)$$
According to [3], we have the following relation between the Kolmogorov complexity and the universal probability.

Theorem 1.3 (Equivalence between Kolmogorov complexity and universal probability) [3] For some c < ∞, for all x, we have

$$2^{-K(x)} \le P_U(x) \le c\,2^{-K(x)}. \qquad(1.18)$$
Theorem 1.3 shows that the universal probability and negative exponentiated
Kolmogorov complexity are essentially equivalent through (1.18). This relation is
analogous to that between the probability distribution and Shannon information.
If x can be generated according to a probability distribution in a certain class
P, then the Kolmogorov complexity of x will match the code-length for x with the
help of P (see also [3] pp: 143–145). We do not describe the details of the theory of
Kolmogorov complexity; instead, we recommend the books [3, 20] for readers who
are interested in this topic. We merely mention here that the Kolmogorov complexity
can be thought of as a kind of universal coding.
Let P = {p(x; θ) : θ ∈ Θ} be a parametric class of probability distributions. For a data sequence x, we have

$$\min_\theta\{-\log p(x;\theta)\} = -\log p(x;\hat\theta(x)), \qquad(1.19)$$

where

$$\hat\theta(x) \overset{\mathrm{def}}{=} \operatorname*{argmax}_\theta\, p(x;\theta). \qquad(1.20)$$
Here we suppose that the maximum of p(x; θ ) with respect to θ exists. Equation
(1.19) implies that plugging θ = θ̂ (x) into the negative log likelihood yields the least
Shannon information for x relative to P, which we call the baseline of x with respect
to P. Note that (1.19) does not define a prefix code-length function over Xn because
$$\sum_{x} p(x;\hat\theta(x)) > 1, \qquad(1.21)$$
which implies that (1.19) does not satisfy the Kraft inequality. We see that normalizing
p(x; θ̂ (x)) by (1.21) forms a probability distribution over Xn .
Definition 1.5 We call (1.22) the normalized maximum likelihood distribution (NML distribution) over X^n:

$$p_{\mathrm{NML}}(x) \overset{\mathrm{def}}{=} \frac{p(x;\hat\theta(x))}{\sum_{y\in\mathcal{X}^n} p(y;\hat\theta(y))}. \qquad(1.22)$$

In the case where X is continuous, the sum in the denominator is replaced with an integral. We define the normalized maximum likelihood (NML) code-length for x with respect to P as

$$L_{\mathrm{NML}}(x;\mathcal{P}) \overset{\mathrm{def}}{=} -\log p_{\mathrm{NML}}(x) = -\log p(x;\hat\theta(x)) + \log\sum_{y\in\mathcal{X}^n} p(y;\hat\theta(y)). \qquad(1.23)$$
The NML code-length (1.23) for x relative to P is also called the stochastic
complexity of x with respect to P. We may rewrite this as SC(x; P).
The stochastic complexity can be thought of as an extension of Shannon infor-
mation to the case where the true probability distribution is not given in advance
but rather a class of probability distributions is given instead. This is justified in
Sects. 1.4.2 and 1.4.3.
We pay special attention to the second term in (1.23), which is the logarithm of
the normalization term.
Definition 1.6 We define the parametric complexity of P for data length n as

$$\log C(\mathcal{P}) \overset{\mathrm{def}}{=} \log\sum_{x\in\mathcal{X}^n}\max_\theta\, p(x;\theta) = \log\sum_{x\in\mathcal{X}^n} p(x;\hat\theta(x)). \qquad(1.24)$$
Definition 1.7 For a probability distribution q over X^n, we define the worst-case regret of q with respect to P as

$$R(q;\mathcal{P}) \overset{\mathrm{def}}{=} \max_{x\in\mathcal{X}^n}\{-\log q(x) + \log p(x;\hat\theta(x))\}, \qquad(1.25)$$

where the maximum is taken over all x ∈ X^n. We define the min-max regret with respect to P as

$$R(\mathcal{P}) \overset{\mathrm{def}}{=} \min_q R(q;\mathcal{P}), \qquad(1.26)$$

where the minimum is taken over all probability distributions over X^n.
In Definition 1.7, we assume that there exist the minimum and maximum values
for the regret. If no such values exist, the minimum and maximum would be replaced
with the infimum and supremum, respectively.
We justify the NML code-length from the point of view that it achieves the min-
max regret. The following theorem was first proved by Shtarkov [31].
Theorem 1.4 (Min-max regret optimality of the NML code-length) [31] The NML distribution is the only distribution that achieves the min-max regret. That is,

$$p_{\mathrm{NML}} = \operatorname*{argmin}_q\,\max_{x\in\mathcal{X}^n}\{-\log q(x) + \log p(x;\hat\theta(x))\}, \qquad(1.27)$$

where the minimum is taken over all probability distributions. Then, the min-max regret coincides with the parametric complexity.
The min-max regret on the right-hand side of (1.27) is the minimum of the largest value of any prefix code-length relative to the baseline with respect to P. Theorem 1.4 shows that the probability distribution achieving the min-max regret is the NML distribution. The NML code-length associated with the NML distribution has optimality in this sense.
Proof Let the worst-case regret for q relative to P be R(q; P). Let the probability distribution that achieves the minimum of R(q; P) be q = p*. If p* ≠ p_NML, then for some x, p_NML(x) > p*(x), because both are probability distributions over X^n. For such an x, we have

$$-\log p^*(x) + \log p(x;\hat\theta(x)) > -\log p_{\mathrm{NML}}(x) + \log p(x;\hat\theta(x)) = \log C(\mathcal{P}) = R(p_{\mathrm{NML}};\mathcal{P}).$$

Hence R(p*; P) > R(p_NML; P), which contradicts the fact that p* achieves the minimum of R(q; P). This implies that p* = p_NML. Then R(P) = R(p_NML; P) = log C(P). The proof is completed.
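As a numerical illustration of Theorem 1.4 (our own sketch, not from the book), the worst-case regret max_x{−log q(x) + log p(x; θ̂(x))} can be computed over all binary sequences of a small length for q = p_NML and for a uniform-prior Bayes mixture; the NML distribution attains the smaller worst-case value, which equals log C_n.

    # Sketch: worst-case regret of NML vs. a uniform-prior Bayes mixture (Bernoulli class, small n).
    import math
    from itertools import product

    n = 8

    def max_lik(x):
        k = sum(x)
        return (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0

    seqs = list(product([0, 1], repeat=n))
    Cn = sum(max_lik(x) for x in seqs)              # normalizer of the NML distribution

    def bayes_mixture(x):
        k = sum(x)                                  # uniform prior over theta in (0, 1)
        return math.exp(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2))

    def worst_case_regret(q):
        return max(-math.log2(q(x)) + math.log2(max_lik(x)) for x in seqs)

    print("log2 C_n          =", round(math.log2(Cn), 4))
    print("regret of NML     =", round(worst_case_regret(lambda x: max_lik(x) / Cn), 4))
    print("regret of mixture =", round(worst_case_regret(bayes_mixture), 4))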
Theorem 1.5 (A lower bound on the expected code-length for parametric classes) [23] Let P = {p(x; θ) : θ ∈ Θ} be a k-dimensional parametric class of probability distributions, where Θ is a compact set. Suppose that for each θ ∈ Θ, letting θ̂(x) be the MLE of θ from x, the asymptotic normality holds, i.e., √n(θ̂(x) − θ) is asymptotically distributed according to N(0, I^{-1}(θ)), where N(0, I^{-1}(θ)) is the normal distribution (Gaussian distribution) with mean 0 and variance-covariance matrix I^{-1}(θ), and I(θ) is the Fisher information matrix defined as

$$I(\theta) \overset{\mathrm{def}}{=} \lim_{n\to\infty}\frac{1}{n}\,E_\theta\!\left[-\frac{\partial^2\log p(x;\theta)}{\partial\theta\,\partial\theta^\top}\right], \qquad(1.29)$$

and E_θ denotes the expectation of x taken with respect to p(x; θ). Then, for any prefix code-length function L(x) and any ε > 0, except for θ in a set Θ_0 ⊂ Θ whose Lebesgue volume goes to zero as n → ∞, the following inequality holds:

$$E_\theta[L(x)] \ge E_\theta[-\log p(x;\theta)] + \frac{k-\epsilon}{2}\log n. \qquad(1.30)$$
If for a given L, we happen to choose θ such that L(x) = − log p(x; θ ), then
the lower bound (1.30) is violated; however, Theorem 1.5 means that such a θ only
exists in a range whose volume asymptotically becomes zero.
Below we give the proof of Theorem 1.5 according to [23].
Proof Let us discretize the real-valued parameter space Θ to obtain a finite set Θ̄ so that each axis is discretized with width c/√n, where c is a constant. We denote the discretization as τ : Θ → Θ̄. For θ ∈ Θ̄, let

$$\mathcal{X}^n(\theta) \overset{\mathrm{def}}{=} \{x : \tau(\hat\theta(x)) = \theta\},$$

where θ̂(x) is the MLE of θ from x.
For a given probability mass function p(x; θ), we define P(θ) by

$$P(\theta) \overset{\mathrm{def}}{=} \sum_{x\in\mathcal{X}^n(\theta)} p(x;\theta).$$
Then, by the asymptotic normality, for some n_0 > 0 and some 0 < δ < 1, for all n ≥ n_0 we have P(θ) > 1 − δ. Let q be the (sub)probability distribution associated with the prefix code-length function L, i.e., L(x) = −log q(x), and define Q(θ) by

$$Q(\theta) \overset{\mathrm{def}}{=} \sum_{x\in\mathcal{X}^n(\theta)} q(x).$$
Note that p(x; θ )/P(θ ) and q(x)/Q(θ ) form probability mass functions over
Xn (θ ). We consider the Kullback–Leibler divergence between them to obtain the
following inequality:
$$\sum_{x\in\mathcal{X}^n(\theta)}\frac{p(x;\theta)}{P(\theta)}\log\frac{p(x;\theta)/P(\theta)}{q(x)/Q(\theta)} \ge 0, \qquad(1.31)$$
where we have used the fact that the Kullback–Leibler divergence satisfies D( p||q) ≥
0 for any probability distributions p and q (see the proof of Theorem 1.2). Equiva-
lently, we have
$$\sum_{x\in\mathcal{X}^n(\theta)} p(x;\theta)\log\frac{p(x;\theta)}{q(x)} \ge P(\theta)\log\frac{P(\theta)}{Q(\theta)}. \qquad(1.32)$$
Letting the left-hand side of (1.32) be R(θ), for any given ε > 0, set

$$A_n \overset{\mathrm{def}}{=} \left\{\theta\in\Theta : R(\theta) < \frac{k(1-\epsilon)}{2}\log n\right\}.$$
Then, for any θ ∉ A_n, we have

$$\sum_{x\in\mathcal{X}^n(\theta)} p(x;\theta)\log\frac{p(x;\theta)}{q(x)} \ge \frac{k(1-\epsilon)}{2}\log n. \qquad(1.33)$$
On the other hand, using the inequality log z ≥ 1 − 1/z, we have

$$\sum_{x\in\mathcal{X}^n-\mathcal{X}^n(\theta)} p(x;\theta)\log\frac{p(x;\theta)}{q(x)} \ge \sum_{x\in\mathcal{X}^n-\mathcal{X}^n(\theta)} p(x;\theta)\left(1-\frac{q(x)}{p(x;\theta)}\right) = \sum_{x\in\mathcal{X}^n-\mathcal{X}^n(\theta)}\left(p(x;\theta)-q(x)\right) \ge -1. \qquad(1.34)$$
Since L(x) = − log q(x), combining (1.33) with (1.34) and setting ε′ = kε + 2/log n yields

$$E_\theta[L(x)] \ge E_\theta[-\log p(x;\theta)] + \frac{k-\epsilon'}{2}\log n.$$
In the following, we prove that the Lebesgue measure V_n of A_n goes to zero as n goes to infinity. Let us consider a set of cells, each centered at θ̄ ∈ Θ̄ with width c/√n in each axis. Let the smallest set of such cells covering A_n be Ā_n. Then, we have

$$V_n \le |\bar A_n|\left(\frac{c}{\sqrt n}\right)^k. \qquad(1.35)$$
For θ ∈ Ā_n, by (1.32),

$$\frac{k(1-\epsilon)}{2}\log n > P(\theta)\log\frac{P(\theta)}{Q(\theta)}.$$
This implies, by the asymptotic normality assumption, that for some 0 < δ < 1 and sufficiently large n,

$$Q(\theta) > P(\theta)\exp\left(-\frac{k(1-\epsilon)\log n}{2P(\theta)}\right) > (1-\delta)\exp\left(-\frac{k(1-\epsilon)\log n}{2(1-\delta)}\right).$$
Hence, for sufficiently small ε > 0, for some 0 < α < 1, the following inequality holds:

$$Q(\theta) > n^{-\frac{\alpha k}{2}}.$$
This leads to

$$1 = \sum_{\theta\in\bar\Theta} Q(\theta) \ge |\bar A_n|\, n^{-\frac{\alpha k}{2}}.$$
Thus, we have

$$|\bar A_n| \le n^{\frac{\alpha k}{2}}. \qquad(1.36)$$
Combining (1.35) with (1.36) yields

$$V_n \le c^k\, n^{-\frac{(1-\alpha)k}{2}}. \qquad(1.37)$$

The right-hand side of (1.37) goes to zero as n goes to infinity since 0 < α < 1. This completes the proof.
As will be shown in Sects. 1.5.1 and 1.5.3, under some regularity conditions for
a k-dimensional parametric class P, the following equation asymptotically holds:
$$SC(x;\mathcal{P}) = -\log p(x;\hat\theta(x)) + \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\,d\theta + o(1). \qquad(1.38)$$
Moreover, the following relation holds:

$$E_\theta[-\log p(x;\hat\theta(x))] = E_\theta[-\log p(x;\theta)] - \frac{k}{2} + o(1), \qquad(1.39)$$
where E θ is the expectation of x taken with respect to p(x; θ ).
Therefore, we take the expectation of both sides of (1.40) with respect to p(x; θ).
By the consistency of the MLE and the law of large numbers (see Sect. 9.2.7), as
n becomes sufficiently large, with probability 1, we have
$$\hat I(\hat\theta) \to I(\theta).$$
Therefore,

$$E_\theta[-\log p(x;\theta)] = E_\theta[-\log p(x;\hat\theta)] + \frac{k}{2} + o(1).$$
This completes the proof.
Combining (1.38) with (1.39) yields

$$E_\theta[SC(x;\mathcal{P})] = E_\theta[-\log p(x;\theta)] + \frac{k}{2}\log\frac{n}{2\pi e} + \log\int\sqrt{|I(\theta)|}\,d\theta + o(1).$$
We see that E θ [SC(x; P)] asymptotically matches the lower bound (1.30) within
error o(log n). Recall that this lower bound is a generalization of Shannon’s source
coding theorem to the case where the true probability distribution is not given, but
rather the class of probability distributions is given. In this sense, the stochastic
complexity can be considered as a generalization of Shannon information.
Theorem 1.6 Let P = {p(x; θ) : θ ∈ Θ} be a k-dimensional parametric class of probability distributions. Suppose that the following conditions hold:

1. Let I(θ) be the Fisher information matrix as in (1.29). For any θ ∈ Θ, the limit in (1.29) converges, and for some 0 < c_1, c_2 < ∞, c_1 ≤ |I(θ)| ≤ c_2, where |I(θ)| is the determinant of I(θ).
2. I(θ) is continuous with respect to θ.
3. ∫_Θ √|I(θ)| dθ < ∞.
4. For each θ, the asymptotic normality of the MLE of θ holds. That is, if x is generated according to p(x; θ), then √n(θ̂(x) − θ) is asymptotically normally distributed according to N(0, I^{-1}(θ)) as n → ∞, where θ̂(x) is the MLE of θ from x (i.e., θ̂ = argmax_{θ∈Θ} p(x; θ)).
5. For some positive definite matrix C_0, for all x, I(θ̂(x)) < C_0.
Then, the following equation asymptotically holds for the parametric complexity:

$$\log C(\mathcal{P}) = \frac{k}{2}\log\frac{n}{2\pi} + \log\int_\Theta\sqrt{|I(\theta)|}\,d\theta + o(1). \qquad(1.43)$$
Note that Theorem 1.6 is valid only when the data length is sufficiently large
compared to the number of parameters. It is not necessarily applicable to high-
dimensional parametric models wherein the number of parameters is larger than the
data length. Such high-dimensional models are dealt with in Sects. 1.5.5 and 5.2.
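As a numerical check of the asymptotic formula (our own sketch, not from the book), for the Bernoulli class (k = 1) we have |I(θ)| = 1/(θ(1−θ)) and ∫₀¹√|I(θ)| dθ = π, so (1.43) gives log C(P) ≈ (1/2) log(n/2π) + log π; the code compares this with the exact normalizer computed by enumeration over counts.

    # Sketch: exact vs. asymptotic parametric complexity for the Bernoulli class (k = 1).
    import math

    def exact_log2_C(n):
        total = 0.0
        for k in range(n + 1):                      # group sequences by their number of ones
            ml = (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0
            total += math.comb(n, k) * ml
        return math.log2(total)

    def asymptotic_log2_C(n):
        # (k/2) log(n / 2 pi) + log of the integral of sqrt(|I(theta)|), which equals pi
        return 0.5 * math.log2(n / (2 * math.pi)) + math.log2(math.pi)

    for n in (10, 100, 1000):
        print(n, round(exact_log2_C(n), 4), round(asymptotic_log2_C(n), 4))
    # The two values approach each other as n grows, as (1.43) asserts.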
In the following, we give a number of applications of Theorem 1.6.
First, consider the class of multinomial distributions over X = {0, 1, ..., k}:

$$\mathcal{P}_{\mathrm{Multi}} = \{p(x;\theta) : \theta \in \Delta_k\},$$

where θ = (θ_0, ..., θ_k), θ_i ≥ 0, and ∑_{i=0}^k θ_i = 1 (Δ_k denotes the k-dimensional probability simplex). Letting n_i be the number of occurrences of the symbol i in x ∈ X^n, the MLE of θ is

$$\hat\theta = \left(\frac{n_0}{n},\ldots,\frac{n_k}{n}\right).$$

In Example 2.1, we show how to derive this MLE in detail. Meanwhile, the determinant of the Fisher information matrix is calculated as follows:

$$|I(\theta)| = \prod_{i=0}^{k}\theta_i^{-1}.$$
Thus, applying (1.43) to the calculation of the parametric complexity yields the following formula for the NML code-length for x with respect to P_Multi:

$$L_{\mathrm{NML}}(x;\mathcal{P}_{\mathrm{Multi}}) = -\log p(x;\hat\theta(x)) + \frac{k}{2}\log\frac{n}{2\pi} + \log\int\sqrt{|I(\theta)|}\,d\theta$$
$$= nH\left(\frac{n_0}{n},\ldots,\frac{n_k}{n}\right) + \frac{k}{2}\log\frac{n}{2\pi} + \log\frac{\pi^{\frac{k+1}{2}}}{\Gamma\!\left(\frac{k+1}{2}\right)}, \qquad(1.45)$$

where $H(z_0,\ldots,z_k) \overset{\mathrm{def}}{=} -\sum_{i=0}^{k} z_i\log z_i$, and $\Gamma(x) = \int_0^\infty e^{-t}t^{x-1}\,dt$ is the Gamma function, which satisfies, for any positive integer m,

$$\Gamma(m) = (m-1)!, \qquad \Gamma(m+1/2) = \frac{1\cdot3\cdot5\cdots(2m-1)}{2^m}\sqrt{\pi}.$$
Comparing (1.45) with the counting code-length (1.14), we see that (1.45) would
be shorter asymptotically.
Next, consider the Poisson distribution with probability mass function

$$p(x;\theta) = \frac{\theta^x e^{-\theta}}{x!}.$$

We denote the class of Poisson distributions as $\mathcal{P}_{\mathrm{Poisson}} = \{p(x;\theta) : 0 < \theta \le 2^a\}$, where a is a positive integer chosen so that the MLE $\hat\theta = \frac{1}{n}\sum_{t=1}^n x_t$ satisfies $\hat\theta \le 2^a$; the Fisher information is I(θ) = 1/θ. Thus, applying (1.43) to the calculation of the parametric complexity yields the following formula for the NML code-length for x with respect to P_Poisson:

$$L_{\mathrm{NML}}(x;\mathcal{P}_{\mathrm{Poisson}}) = \hat\theta n(1-\log\hat\theta) + \log\prod_{t=1}^n x_t! + \frac{1}{2}\log\frac{n}{2\pi} + \left(1+\frac{a}{2}\right)\log 2.$$
Next, consider the class of 1-dimensional Gaussian distributions with density $p(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. Let τ = σ² and the parameter space be Θ = {(μ, τ) : μ ∈ (−∞, +∞), τ > 0}. In this case, the Fisher information matrix I(μ, τ) is given as follows:

$$I(\mu,\tau) = \begin{pmatrix} 1/\tau & 0 \\ 0 & 1/(2\tau^2) \end{pmatrix}.$$

Thus, we have

$$|I(\mu,\tau)| = \frac{1}{2\tau^3}.$$

The MLEs are

$$\hat\mu = \frac{1}{n}\sum_{t=1}^n x_t, \qquad \hat\tau = \frac{1}{n}\sum_{t=1}^n (x_t-\hat\mu)^2.$$
The problem is that, in applying the asymptotic approximation formula, the third term on the right-hand side of (1.44) diverges if the range of parameters is infinite. One method for avoiding this problem is to restrict the range of parameters as follows. Letting s and r be the least positive integers such that |μ̂| ≤ 2^s and τ̂ ≥ 2^{−2r}, we take the parameter space Θ̃ = {(μ, τ) : |μ| ≤ 2^s, τ ≥ 2^{−2r}} rather than Θ. In other words, the data range is restricted so that μ̂ and τ̂ are in Θ̃. We write the class of 1-dimensional Gaussian distributions with such a restricted parameter space as P_Gauss. A simple calculation leads to

$$\int_{(\mu,\tau)\in\tilde\Theta}\sqrt{|I(\mu,\tau)|}\,d\mu\,d\tau = 2^{s+r+3/2}.$$
Here, L(r, s) is the prefix code-length for r and s, calculated as log* s + log* r, where

$$\log^*(x) \overset{\mathrm{def}}{=} \log c + \log x + \log\log x + \cdots,$$

where c = 2.865 and the sum is taken over all the positive terms. log*(x) is Rissanen's integer code-length function [23] (p. 34), where the base of the logarithm is 2.
Suppose that the probability distribution is decomposed as

$$p(x;\theta) = f(x\,|\,\hat\theta(x))\,g(\hat\theta(x);\theta),$$

where

$$g(\bar\theta;\theta) \overset{\mathrm{def}}{=} \sum_{x:\hat\theta(x)=\bar\theta} p(x;\theta), \qquad f(x\,|\,\hat\theta(x)) \overset{\mathrm{def}}{=} p(x;\theta)/g(\hat\theta(x);\theta).$$
Let the normalization term in the NML distribution be C(P), where log C(P) is the parametric complexity for P. Then, it is calculated using the g-function as follows:

$$C(\mathcal{P}) = \sum_x p(x;\hat\theta(x)) = \int d\hat\theta\sum_{y:\hat\theta(y)=\hat\theta} p(y;\hat\theta) = \int g(\hat\theta;\hat\theta)\,d\hat\theta. \qquad(1.48)$$
For example, consider the class P_Exp of exponential distributions p(x; θ) = θe^{−θx} with θ restricted to [θ_min, θ_max]. The MLE from x = x_1 ⋯ x_n is θ̂(x) = n/∑_{t=1}^n x_t, and g is the probability density function for the Gamma distribution with shape parameter n and scale parameter 1/θ:

$$g(\hat\theta(x);\theta) = \frac{\theta^n n^n}{\Gamma(n)\,\hat\theta(x)^{n+1}}\exp\left(-\theta\cdot\frac{n}{\hat\theta(x)}\right).$$
$$C(\mathcal{P}_{\mathrm{Exp}}) = \int_{\theta_{\min}}^{\theta_{\max}} g(\hat\theta;\hat\theta)\,d\hat\theta = \frac{n^n}{e^n(n-1)!}\int_{\theta_{\min}}^{\theta_{\max}}\frac{1}{\hat\theta}\,d\hat\theta = \frac{n^n}{e^n(n-1)!}\log\frac{\theta_{\max}}{\theta_{\min}}. \qquad(1.50)$$
Next, consider the class of d-dimensional Gaussian distributions. The MLEs of the mean and the covariance matrix from x = x_1 ⋯ x_n are

$$\hat\mu(x) = \frac{1}{n}\sum_{t=1}^n x_t, \qquad \hat\Sigma(x) = \frac{1}{n}\sum_{t=1}^n (x_t-\hat\mu(x))(x_t-\hat\mu(x))^\top.$$
For a parameter vector ξ = (R, ε_{1,1}, ..., ε_{1,d}, ε_2) ∈ (R^{d+2})_+, we restrict the data domain to X^n(ξ), defined as

$$\mathcal{X}^n(\xi) \overset{\mathrm{def}}{=} \left\{x : \|\hat\mu(x)\| \le R,\ \epsilon_{1,j} \le \hat\lambda_j \le \epsilon_2 < 1\ (j=1,\ldots,d),\ \frac{\mathrm{Vol}(O(d))}{2^d}\,\epsilon_2^{\frac{d(d-1)}{2}} < 1\right\},$$

where λ̂_j is the jth largest eigenvalue of Σ̂, ‖x‖ = √(xᵀx), and Vol(O(d)) is the volume of the orthogonal group in dimension d embedded in R^{d×d}:

$$\mathrm{Vol}(O(d)) = \frac{2^d\,\pi^{\frac{d^2}{2}}}{\Gamma_d\!\left(\frac{d}{2}\right)}.$$
Here, Γ_d is the multivariate Gamma function defined as $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}}\prod_{j=1}^d\Gamma\!\left(x+\frac{1-j}{2}\right)$, and Γ is the Gamma function.
For the d-dimensional Gaussian class, the decomposition is written as

$$p(x;\theta) = f(x\,|\,\hat\mu(x),\hat\Sigma(x))\,g_1(\hat\mu(x);\mu,\Sigma)\,g_2(\hat\Sigma(x);\Sigma),$$

where

$$g_1(\hat\mu;\mu,\Sigma) \overset{\mathrm{def}}{=} \frac{1}{(2\pi/n)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{n}{2}(\hat\mu-\mu)^\top\Sigma^{-1}(\hat\mu-\mu)\right),$$

$$g_2(\hat\Sigma;\Sigma) \overset{\mathrm{def}}{=} \frac{|\hat\Sigma|^{\frac{n-d-2}{2}}}{2^{\frac{d(n-1)}{2}}\,\left|\frac{\Sigma}{n}\right|^{\frac{n-1}{2}}\,\Gamma_d\!\left(\frac{n-1}{2}\right)}\exp\left(-\frac{1}{2}\mathrm{Tr}\!\left(n\Sigma^{-1}\hat\Sigma\right)\right).$$

We fix μ̂(x) = μ̂ and Σ̂(x) = Σ̂. We define g(Σ̂) as

$$g(\hat\Sigma) \overset{\mathrm{def}}{=} g_1(\hat\mu;\hat\mu,\hat\Sigma)\,g_2(\hat\Sigma;\hat\Sigma).$$

Then, the parametric complexity is upper-bounded as follows. Suppose that Σ̂ is written as Σ̂ = U diag(λ̂_1, ..., λ̂_d) Uᵀ for some orthogonal matrix U.
We then have

$$\begin{aligned}
C(\mathcal{P}_{\mathrm{Gauss}}^{d}) &= \int_{\mathcal{X}^n(\xi)} p(x;\hat\theta(x))\,dx\\
&= \int_{\|\hat\mu\|\le R} d\hat\mu\int g(\hat\Sigma)\,d\hat\Sigma\\
&= \int_{\|\hat\mu\|\le R} d\hat\mu\int dU\int\prod_{1\le i<j\le d}|\hat\lambda_i-\hat\lambda_j|\;g(\hat\Sigma)\,d(\hat\lambda_1\cdots\hat\lambda_d)\\
&< \frac{1}{2^{d}}\int dU\;\frac{2^{d+1}R^d\left(\prod_{j=1}^{d}\epsilon_{1,j}^{-\frac{d}{2}}-\prod_{j=1}^{d}\epsilon_{2}^{-\frac{d}{2}}\right)}{d^{d+1}\,\Gamma\!\left(\frac{d}{2}\right)}\;\epsilon_2^{\frac{d(d-1)}{2}}\cdot\left(\frac{n}{2e}\right)^{\frac{dn}{2}}\frac{1}{\Gamma_d\!\left(\frac{n-1}{2}\right)}\\
&< \frac{\mathrm{Vol}(O(d))}{2^{d}}\,\epsilon_2^{\frac{d(d-1)}{2}}\,B(d,R,\epsilon_1)\times\left(\frac{n}{2e}\right)^{\frac{dn}{2}}\frac{1}{\Gamma_d\!\left(\frac{n-1}{2}\right)}\\
&< B(d,R,\epsilon_1)\times\left(\frac{n}{2e}\right)^{\frac{dn}{2}}\frac{1}{\Gamma_d\!\left(\frac{n-1}{2}\right)}, \qquad(1.51)
\end{aligned}$$

where

$$B(d,R,\epsilon_1) \overset{\mathrm{def}}{=} \frac{2^{d+1}R^d\prod_{j=1}^{d}\epsilon_{1,j}^{-\frac{d}{2}}}{d^{d+1}\,\Gamma\!\left(\frac{d}{2}\right)}.$$

The last inequality follows from the restriction on the data domain X^n(ξ); in the derivation we have also used the positive definiteness of Σ̂.
This section introduces the Fourier method developed by Suzuki and Yamanishi [32, 33] for calculating the parametric complexity. Let P_k = {p(x; θ) : θ ∈ Θ} be a k-dimensional parametric class of probability distributions. Let p̃(x; ξ) be the Fourier transform of p(x; θ) ∈ P_k with respect to θ, so that

$$\tilde p(x;\xi) = \frac{1}{(2\pi)^{k/2}}\int d\theta\,\exp(-i\xi^\top\theta)\,p(x;\theta), \qquad(1.52)$$

$$p(x;\theta) = \frac{1}{(2\pi)^{k/2}}\int d\xi\,\exp(i\xi^\top\theta)\,\tilde p(x;\xi). \qquad(1.53)$$
We define

$$h(\theta) \overset{\mathrm{def}}{=} \frac{1}{(2\pi)^k}\int d\xi\sum_x p(x;\theta)\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right),$$

where θ̂(x) = argmax_θ p(x; θ) is the MLE of θ from x, which is assumed to exist.
The following theorem shows that the parametric complexity can be calculated
as the integral of h(θ ) with respect to θ .
Theorem 1.7 (Fourier method for computing parametric complexity) [32, 33] For
a k-dimensional parametric class Pk of probability distributions, we denote the
parametric complexity as log Cn (k). We make the following assumptions:
1. $\int d\xi\,|\tilde p(x;\xi)| < \infty$ for each x.
2. We define

$$\phi_{\hat\theta(x)-\theta}(\xi) \overset{\mathrm{def}}{=} \sum_x p(x;\theta)\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right),$$

and assume that $\int d\xi\,|\phi_{\hat\theta(x)-\theta}(\xi)| < \infty$.

Then, the parametric complexity is given by

$$\log C_n(k) = \log\int d\theta\,h(\theta). \qquad(1.55)$$
Proof Letting θ̂(x) be the MLE of θ , we use the Fourier transforms (1.52) and (1.53)
to calculate the parametric complexity for P as follows:
$$\sum_x\max_\theta p(x;\theta) = \sum_x\frac{1}{(2\pi)^{k/2}}\int d\xi\,\exp(i\xi^\top\hat\theta(x))\,\tilde p(x;\xi)$$
$$= \frac{1}{(2\pi)^{k}}\sum_x\int d\xi\int d\theta\,\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right)p(x;\theta)$$
$$= \frac{1}{(2\pi)^{k}}\int d\theta\int d\xi\sum_x p(x;\theta)\exp\!\left(i\xi^\top(\hat\theta(x)-\theta)\right) \qquad(1.56)$$
$$= \int d\theta\,h(\theta).$$

Equation (1.56) comes from the integrability of p̃(x; ξ) and of ∑_x p(x; θ) exp(iξᵀ(θ̂(x) − θ)) under the assumptions in Theorem 1.7 and Fubini's theorem (see Sect. 9.1.7). This completes the proof of Theorem 1.7.
In the following, we show that for a specific family called the exponential family, the parametric complexity can be calculated simply using Theorem 1.7.

Definition 1.9 We define the exponential family as a class of probability distributions, each with the following form of probability mass function (or probability density function):

$$p(x;\eta) = \exp\!\left(\eta^\top t(x) - \psi(\eta)\right),$$

where η is the natural parameter, t(x) is a sufficient statistic, and ψ(η) is the normalization term. Let τ = τ(η) be the corresponding expectation parameter. We denote its inverse transform as η(τ), assuming that τ is bijective. Note that the following relation holds:

$$\tau(\eta) = E_\eta[t(x)] = \frac{\partial\psi(\eta)}{\partial\eta}.$$