Philipp Hennig holds the Chair for the Methods of Machine Learning at the University of Tübingen, and an adjunct position at the Max Planck Institute for Intelligent Systems. He has dedicated most of his career to the development of Probabilistic Numerical Methods. Hennig's research has been supported by Emmy Noether, Max Planck, and ERC fellowships. He is a co-director of the Research Program for the Theory, Algorithms and Computations of Learning Machines at the European Laboratory for Learning and Intelligent Systems (ELLIS).

Michael A. Osborne is Professor of Machine Learning at the University of Oxford, and a co-founder of Mind Foundry Ltd. His research has attracted £10.6M of research funding and has been cited over 15,000 times. He is very, very Bayesian.

Hans P. Kersting is a postdoctoral researcher at INRIA and École Normale Supérieure in Paris, working in machine learning with expertise in Bayesian inference, dynamical systems, and optimisation.
'In this stunning and comprehensive new book, early developments from Kac and Larkin have been comprehensively built upon, formalised, and extended by including modern-day machine learning, numerical analysis, and the formal Bayesian statistical methodology. Probabilistic numerical methodology is of enormous importance for this age of data-centric science and Hennig, Osborne, and Kersting are to be congratulated in providing us with this definitive volume.'
– Mark Girolami, University of Cambridge and The Alan Turing Institute
‘This book presents an in-depth overview of both the past and present of the newly emerging
area of probabilistic numerics, where recent advances in probabilistic machine learning are used to
develop principled improvements which are both faster and more accurate than classical numerical
analysis algorithms. A must-read for every algorithm developer and practitioner in optimization!’
– Ralf Herbrich, Hasso Plattner Institute
‘Probabilistic numerics spans from the intellectual fireworks of the dawn of a new field to its
practical algorithmic consequences. It is precise but accessible and rich in wide-ranging, principled
examples. This convergence of ideas from diverse fields in lucid style is the very fabric of good
science.’
– Carl Edward Rasmussen, University of Cambridge
'An important read for anyone who has thought about uncertainty in numerical methods; an essential read for anyone who hasn't.'
– John Cunningham, Columbia University
‘This is a rare example of a textbook that essentially founds a new field, re-casting numerics on
stronger, more general foundations. A tour de force.’
– David Duvenaud, University of Toronto
‘The authors succeed in demonstrating the potential of probabilistic numerics to transform the way
we think about computation itself.’
– Thore Graepel, Senior Vice President, Altos Labs
PROBABILISTIC NUMERICS
COMPUTATION AS MACHINE LEARNING
www.cambridge.org
Information on this title: www.cambridge.org/9781107163447
DOI: 10.1017/9781316681411
© Philipp Hennig, Michael A. Osborne and Hans P. Kersting 2022
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2022
Printed in the United Kingdom by TJ Books Limited, Padstow, Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-16344-7 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
I Mathematical Background
1 Key Points
2 Probabilistic Inference
3 Gaussian Algebra
4 Regression
5 Gauss–Markov Processes: Filtering and SDEs
6 Hierarchical Inference in Gaussian Models
7 Summary of Part I
II Integration
8 Key Points
9 Introduction
10 Bayesian Quadrature
11 Links to Classical Quadrature
12 Probabilistic Numerical Lessons from Integration
13 Summary of Part II and Further Reading
22 Proofs
23 Summary of Part III
References
Index
Philipp Hennig
Michael A. Osborne
I would like to thank Isis Hjorth, for being the most valuable
source of support I have in life, and our amazing children
Osmund and Halfdan – I wonder what you will think of this
book in a few years?
Hans P. Kersting
Bold symbols ($\mathbf{x}$) are used for vectors, but only where the fact that a variable is a vector is relevant. Square brackets indicate elements of a matrix or vector: if $x = [x_1, \dots, x_N]$ is a row vector, then $[x]_i = x_i$ denotes its entries; if $A \in \mathbb{R}^{n \times m}$ is a matrix, then $[A]_{ij} = A_{ij}$ denotes its entries. Round brackets $(\cdot)$ are used in most other cases (as in the notations listed below).

Notation – Meaning

$a \propto c$ – $a$ is proportional to $c$: there is a constant $k$ such that $a = k \cdot c$.
$A \wedge B$, $A \vee B$ – the logical conjunctions "and" and "or"; i.e. $A \wedge B$ is true iff both $A$ and $B$ are true, $A \vee B$ is true iff $\neg A \wedge \neg B$ is false.
$A \otimes B$ – the Kronecker product of matrices $A, B$. See Eq. (15.2).
$A \otimes_{s} B$ – the symmetric Kronecker product. See Eq. (19.16).
$A \odot B$ – the element-wise product (aka Hadamard product) of two matrices $A$ and $B$ of the same shape, i.e. $[A \odot B]_{ij} = [A]_{ij} \cdot [B]_{ij}$.
$\vec{A}$ – the vector arising from stacking the elements of a matrix $A$ row after row; the inverse operation recovers $A$ from $\vec{A}$. See Eq. (15.1).
$\operatorname{cov}_p(x, y)$ – the covariance of $x$ and $y$ under $p$. That is, $\operatorname{cov}_p(x, y) := \mathbb{E}_p(x \cdot y) - \mathbb{E}_p(x)\,\mathbb{E}_p(y)$.
$C^q(V, \mathbb{R}^d)$ – the set of $q$-times continuously differentiable functions from $V$ to $\mathbb{R}^d$, for some $q, d \in \mathbb{N}$.
$\delta(x - y)$ – the Dirac delta, heuristically characterised by the property $\int f(x)\,\delta(x - y)\,dx = f(y)$ for functions $f : \mathbb{R} \to \mathbb{R}$.
$\delta_{ij}$ – the Kronecker symbol: $\delta_{ij} = 1$ if $i = j$, otherwise $\delta_{ij} = 0$.
$\det(A)$ – the determinant of a square matrix $A$.
$\operatorname{diag}(x)$ – the diagonal matrix with entries $[\operatorname{diag}(x)]_{ij} = \delta_{ij}[x]_i$.
$d\omega_t$ – the notation for an Itô integral in a stochastic differential equation. See Definition 5.4.
$\operatorname{erf}(x)$ – the error function $\operatorname{erf}(x) := \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,dt$.
$\mathbb{E}_p(f)$ – the expectation of $f$ under $p$. That is, $\mathbb{E}_p(f) := \int f(x)\,dp(x)$.
$\mathbb{E}_{|Y}(f)$ – the expectation of $f$ under $p(f \mid Y)$.
$\Gamma(z)$ – the Gamma function $\Gamma(z) := \int_0^\infty x^{z-1} \exp(-x)\,dx$. See Eq. (6.1).
$\mathcal{G}(\cdot; a, b)$ – the Gamma distribution with shape $a > 0$ and rate $b > 0$, with probability density function $\mathcal{G}(z; a, b) := \frac{b^a z^{a-1}}{\Gamma(a)} e^{-bz}$.
$\mathcal{GP}(f; \mu, k)$ – the Gaussian process measure on $f$ with mean function $\mu$ and covariance function (kernel) $k$. See §4.2.
$H_p(x)$ – the (differential) entropy of the distribution $p(x)$. That is, $H_p(x) := -\int p(x) \log p(x)\,dx$. See Eq. (3.2).
$H(x \mid y)$ – the (differential) entropy of the conditional distribution $p(x \mid y)$. That is, $H(x \mid y) := H_{p(\cdot \mid y)}(x)$.
$I(x; y)$ – the mutual information between random variables $X$ and $Y$. That is, $I(x; y) := H(x) - H(x \mid y) = H(y) - H(y \mid x)$.
$I$, $I_N$ – the identity matrix (of dimensionality $N$): $[I]_{ij} = \delta_{ij}$.
$\mathbb{I}(\cdot \in A)$ – the indicator function of a set $A$.
$K_\nu$ – the modified Bessel function for some parameter $\nu \in \mathbb{C}$. That is, $K_\nu(x) := \int_0^\infty \exp(-x \cosh t)\cosh(\nu t)\,dt$.
$L$ – the loss function of an optimization problem (§26.1), or the log-likelihood of an inverse problem (§41.2).
$\mathcal{M}$ – the model $\mathcal{M}$ capturing the probabilistic relationship between the latent object and computable quantities. See §9.3.
$\mathbb{N}, \mathbb{C}, \mathbb{R}, \mathbb{R}_+$ – the natural numbers (excluding zero), the complex numbers, the real numbers, and the positive real numbers, respectively.
$\mathcal{N}(x; \mu, \Sigma) = p(x)$ – the vector $x$ has the Gaussian probability density function with mean vector $\mu$ and covariance matrix $\Sigma$. See Eq. (3.1).
$X \sim \mathcal{N}(\mu, \Sigma)$ – the random variable $X$ is distributed according to a Gaussian distribution with mean $\mu$ and covariance $\Sigma$.
$\mathcal{O}(\cdot)$ – Landau big-Oh: for functions $f, g$ defined on $\mathbb{N}$, the notation $f(n) = \mathcal{O}(g(n))$ means that $f(n)/g(n)$ is bounded for $n \to \infty$.
$p(y \mid x)$ – the conditional probability density function for variable $Y$ having value $y$, conditioned on variable $X$ having value $x$.
$\operatorname{rk}(A)$ – the rank of a matrix $A$.
$\operatorname{span}\{x_1, \dots, x_n\}$ – the linear span of $\{x_1, \dots, x_n\}$.
$\operatorname{St}(\cdot; \mu, \lambda_1, \lambda_2)$ – the Student's-$t$ probability density function with parameters $\mu \in \mathbb{R}$ and $\lambda_1, \lambda_2 > 0$. See Eq. (6.9).
$\operatorname{tr}(A)$ – the trace of matrix $A$. That is, $\operatorname{tr}(A) = \sum_i [A]_{ii}$.
$A^\top$ – the transpose of matrix $A$: $[A^\top]_{ij} = [A]_{ji}$.
$\mathcal{U}_{a,b}$ – the uniform distribution with probability density function $p(u) := \mathbb{I}(u \in (a, b))/(b - a)$, for $a < b$.
$\mathbb{V}_p(x)$ – the variance of $x$ under $p$. That is, $\mathbb{V}_p(x) := \operatorname{cov}_p(x, x)$.
$\mathbb{V}_{|Y}(f)$ – the variance of $f$ under $p(f \mid Y)$.
$\mathcal{W}(V, \nu)$ – the Wishart distribution with probability density function $\mathcal{W}(x; V, \nu) \propto |x|^{(\nu - N - 1)/2}\, e^{-\frac{1}{2}\operatorname{tr}(V^{-1}x)}$. See Eq. (19.1).
$x \perp y$ – $x$ is orthogonal to $y$, i.e. $\langle x, y \rangle = 0$.
$x := a$ – the object $x$ is defined to be equal to $a$.
$x \overset{\Delta}{=} a$ – the object $x$ is equal to $a$ by virtue of its definition.
$x \leftarrow a$ – the object $x$ is assigned the value of $a$ (used in pseudo-code).
$X \sim p$ – the random variable $X$ is distributed according to $p$.
$\mathbf{1}$, $\mathbf{1}_d$ – a column vector of $d$ ones, $\mathbf{1}_d := [1, \dots, 1]^\top \in \mathbb{R}^d$.
$\nabla_x f(x, t)$ – the gradient of $f$ w.r.t. $x$. (We omit the subscript $x$ if redundant.)
will consume. There are almost always choices for the character of an iteration, such as where to evaluate an integrand or an objective function to be optimised. Not all iterations are equal, and it takes an intelligent agent to optimise the cost–benefit trade-off.

On a related note, a well-designed probabilistic numerical agent gives a reliable estimate of its own uncertainty over its result. This helps to reduce bias in subsequent computations. For instance, in ODE inverse problems, we will see how simulating the forward map with a probabilistic solver accounts for the tendency of numerical ODE solvers to systematically over- or underestimate solution curves. While this does not necessarily give a more precise ODE estimate (in the inner loop), it helps the inverse-problem solver to explore the parameter space more efficiently (in the outer loop). As these examples highlight, PN hence promises to make more effective use of computation.
Une question de probabilités ne se pose que par suite de notre ignorance: il n'y aurait place que pour la certitude si nous connaissions toutes les données du problème. D'autre part, notre ignorance ne doit pas être complète, sans quoi nous ne pourrions rien évaluer. Une classification s'opérerait donc suivant le plus ou moins de profondeur de notre ignorance. Ainsi la probabilité pour que la sixième décimale d'un nombre dans une table de logarithmes soit égale à 6 est a priori de 1/10; en réalité, toutes les données du problème sont bien déterminées, et, si nous voulions nous en donner la peine, nous connaîtrions exactement cette probabilité. De même, dans les interpolations, …

[Roughly:] The need for probability only arises out of uncertainty: it has no place if we are certain that we know all aspects of a problem. But our lack of knowledge also must not be complete, otherwise we would have nothing to evaluate. There is thus a spectrum of degrees of uncertainty. While the probability for the sixth decimal digit of a number in a table of logarithms to equal 6 is 1/10 a priori, in reality, all aspects of the corresponding problem are well determined, and, if we wanted to make the effort, we could find out its exact value. The same holds for interpolation, for the integration methods of Cotes or Gauss, etc. (Emphasis in the op. cit.)
to wait over a decade. By then, the plot had thickened, and authors in many communities became interested in Bayesian ideas for numerical analysis – among them Kadane and Wasilkowski (1985), Diaconis (1988), and O'Hagan (1992). Skilling (1991) even ventured boldly toward solving differential equations, displaying the physicist's willingness to cast aside technicalities in the name of progress. Exciting as these insights must have been for their authors, they seem to have missed fertile ground. The development also continued within mathematics, for example in the advancement of information-based complexity⁵ and average-case analysis.⁶ But the wider academic community, in particular users in computer science, seems to have missed much of it. But the advancements in computer science did pave the way for the second of the central insights of PN: that numerics requires thinking about agents.

⁵ Traub, Wasilkowski, and Woźniakowski (1983); Packel and Traub (1987); Novak (2006).
⁶ Ritter (2000).
This Book
research questions.
then the marginal distribution is given by the sum rule

$$p(y) = \int p(x, y)\,dx,$$

and the conditional distribution $p(x \mid y)$ for $x$ given that $Y = y$ is provided implicitly by the product rule

$$p(x \mid y)\,p(y) = p(x, y), \qquad (2.1)$$

whose terms are depicted in Figure 2.1. [Figure 2.1: conceptual sketch of a joint probability distribution $p(x, y)$ over two variables $x, y$, with marginal $p(y)$ and a conditional $p(x \mid y)$.] The corollary of these two rules is Bayes' theorem, which describes how prior knowledge, combined with data generated according to the conditional density $p(y \mid x)$, gives rise to the posterior distribution on $x$:

$$\underbrace{p(x \mid y)}_{\text{posterior}} = \frac{\overbrace{p(y \mid x)}^{\text{likelihood}}\;\overbrace{p(x)}^{\text{prior}}}{\underbrace{\int p(y \mid x)\,p(x)\,dx}_{\text{evidence}}}.$$
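These two rules are mechanical enough to check numerically. The following sketch (illustrative, not from the book: the grid, the joint density, and all variable names are arbitrary choices) discretises a joint $p(x, y)$ and recovers a marginal by the sum rule and a conditional by the product rule:

import numpy as np

# Discretise a joint density p(x, y) on a grid; an illustrative sketch.
x = np.linspace(-2.0, 2.0, 401)
y = np.linspace(-2.0, 2.0, 401)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y, indexing="ij")

# A correlated Gaussian joint, normalised on the grid.
joint = np.exp(-(X**2 - 1.2 * X * Y + Y**2))
joint /= joint.sum() * dx * dy

# Sum rule: p(y) = ∫ p(x, y) dx.
p_y = joint.sum(axis=0) * dx

# Product rule, rearranged: p(x | y) = p(x, y) / p(y).
j = np.searchsorted(y, 0.5)        # condition on Y ≈ 0.5
p_x_given_y = joint[:, j] / p_y[j]
print(p_x_given_y.sum() * dx)      # ≈ 1.0: a valid density over x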
rich analytic theory, and that computers are good at the basic linear operations – addition and multiplication. [Margin note: more on this in a report by MacKay (2006).]

In fact, the connection between linear functions and Gaussian distributions runs deeper: Gaussians are a family of probability distributions that are preserved under all linear operations. The following properties will be used extensively:

If a variable $x \in \mathbb{R}^D$ is normally distributed, then every affine transformation of it also has a Gaussian distribution (Figure 3.1):

$$\text{if } p(x) = \mathcal{N}(x; \mu, \Sigma) \text{ and } y := Ax + b \text{ for } A \in \mathbb{R}^{M \times D},\ b \in \mathbb{R}^M, \text{ then } p(y) = \mathcal{N}(y; A\mu + b, A\Sigma A^\top). \qquad (3.4)$$

The product of two Gaussian probability density functions is again a Gaussian probability density function, scaled by a Gaussian:³

$$\mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) = \mathcal{N}(x; c, C)\,\mathcal{N}(a; b, A + B), \quad \text{where } C := (A^{-1} + B^{-1})^{-1} \text{ and } c := C(A^{-1}a + B^{-1}b). \qquad (3.5)$$

[Margin note: the entropy, $H_p(x) := -\int p(x)\log p(x)\,dx$ (3.2), of the Gaussian is given by $H_{\mathcal{N}(x;\mu,\Sigma)}(x) = \frac{N}{2}(1 + \log(2\pi)) + \frac{1}{2}\log|\Sigma|$. (3.3)]

[Footnote 3: this statement is about the product of two probability density functions. In contrast, the product of two Gaussian random variables is not a Gaussian random variable.]
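Both closure properties, and the entropy formula (3.3), are easy to verify numerically. A minimal sketch (all test matrices, sizes, and tolerances are arbitrary choices, not from the book):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Check Eq. (3.4): an affine map y = A x + b of x ~ N(μ, Σ) has
# mean A μ + b and covariance A Σ Aᵀ.
D, M = 3, 2
mu = rng.normal(size=D)
L = rng.normal(size=(D, D))
Sigma = L @ L.T + np.eye(D)            # a valid covariance matrix
A, b = rng.normal(size=(M, D)), rng.normal(size=M)

xs = rng.multivariate_normal(mu, Sigma, size=500_000)
ys = xs @ A.T + b
print(np.allclose(ys.mean(axis=0), A @ mu + b, atol=0.05))
print(np.allclose(np.cov(ys.T), A @ Sigma @ A.T, atol=0.05))

# Check Eq. (3.3): H = D/2 (1 + log 2π) + 1/2 log|Σ|,
# compared against a Monte Carlo estimate of -E[log p(x)].
H_formula = 0.5 * D * (1 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(Sigma)[1]
H_mc = -multivariate_normal(mu, Sigma).logpdf(xs).mean()
print(H_formula, H_mc)                 # the two numbers agree closely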
then both the posterior and the marginal distribution for $y$ (the evidence) are Gaussian (Figure 3.2). [Figure 3.2: sketch of the prior $p(x)$, the likelihood $p(y \mid x)$, and the posterior $p(x \mid y)$.]
Let us assume that it is possible to collect observations $y := [y_1, \dots, y_N]^\top \in \mathbb{R}^N$ of $f$ corrupted by Gaussian noise with covariance $\Lambda \in \mathbb{R}^{N \times N}$ at the locations $X$ (Figure 4.1). That is, according to the likelihood function⁴

$$p(y \mid f) = \mathcal{N}(y; f_X, \Lambda).$$

[Margin note: in the noise-free limit, the likelihood becomes $p(y \mid f) = \delta(y - f_X)$. This corner case is important in numerical applications where function values can be computed "exactly", i.e. up to machine precision.]

The posterior over the weights $w$ is then also Gaussian, $p(w \mid y) = \mathcal{N}(w; \tilde{\mu}, \tilde{\Sigma})$,

$$\text{with } \tilde{\Sigma} := (\Sigma^{-1} + \Phi_X \Lambda^{-1} \Phi_X^\top)^{-1} \quad \text{and} \quad \tilde{\mu} := \tilde{\Sigma}(\Sigma^{-1}\mu + \Phi_X \Lambda^{-1} y).$$
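In code, computing this posterior is a few lines of linear algebra. The sketch below assumes the setup above, a prior $p(w) = \mathcal{N}(w; \mu, \Sigma)$ and Gaussian observation noise with covariance $\Lambda$; the polynomial feature set, the data, and all constants are illustrative placeholders, not from the book:

import numpy as np

# Posterior over weights w in general linear regression f(x) = Φ(x)ᵀ w,
# using Σ̃ = (Σ⁻¹ + Φ_X Λ⁻¹ Φ_Xᵀ)⁻¹ and μ̃ = Σ̃ (Σ⁻¹ μ + Φ_X Λ⁻¹ y).
def features(x):
    return np.stack([x**i for i in range(4)])   # polynomials; shape (F, N)

rng = np.random.default_rng(1)
X = np.linspace(-2, 2, 50)                      # observation locations
w_true = np.array([1.0, -0.5, 0.1, 0.3])
y = features(X).T @ w_true + rng.normal(scale=0.1, size=X.size)

F = 4
mu, Sigma = np.zeros(F), 10.0 * np.eye(F)       # prior N(w; μ, Σ)
Lam = 0.01 * np.eye(X.size)                     # noise covariance Λ
PhiX = features(X)

Sigma_post = np.linalg.inv(np.linalg.inv(Sigma)
                           + PhiX @ np.linalg.inv(Lam) @ PhiX.T)
mu_post = Sigma_post @ (np.linalg.inv(Sigma) @ mu
                        + PhiX @ np.linalg.inv(Lam) @ y)
print(mu_post)                                  # ≈ w_true for this data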
Using the matrix inversion lemma (Eq. (15.9)), the expressions for $\tilde{\Sigma}$ and $\tilde{\mu}$ can be rewritten so that the features enter only through inner products (cf. Eq. (4.3)).

[Figure 4.2: generalised linear regression with eight different feature sets, plotting $f(x)$ against $x \in [-10, 10]$. Odd-numbered rows: priors arising from $\mu = 0$, the marginal standard deviation $\operatorname{diag}(\Phi_x^\top \Sigma \Phi_x)^{1/2}$ for the choice $\Sigma = I$, and four samples drawn i.i.d. from the joint Gaussian over the function. Even-numbered rows: posteriors arising in these models from the observations shown in Figure 4.1. The feature functions giving rise to these eight different plots are the polynomials $\varphi_i(x) = x^i$, $i = 0, \dots, 3$; the trigonometric functions $\varphi_i(x) = \sin(x/i)$, $i = 1, \dots, 8$, and $\varphi_i(x) = \cos(x/(i-8))$, $i = 9, \dots, 16$; as well as, for $i = -8, -7, \dots, 8$, the "switch" functions $\varphi_i(x) = \operatorname{sign}(x - i)$; the "step" functions $\varphi_i(x) = \mathbb{I}(x - i > 0)$; the linear functions $\varphi_i(x) = |x - i|$; the first 13 Legendre polynomials (scaled to $[-10, 10]$); the absolute's exponential $\varphi_i(x) = e^{-|x - i|}$; and the square exponential $\varphi_i(x) = e^{-(x - i)^2}$. Panel labels: steps, linears, Legendre, exp-abs, exp-square, sigmoids.]
Generalised linear regression allows inference on real-valued functions over arbitrary input domains. By varying the feature set $\Phi$, broad classes of hypotheses can be created. In particular, Gaussian regression models can model nonlinear, discontinuous, even unbounded functions. There are literally no limitations on the choice of $\varphi : \mathbb{X} \to \mathbb{R}$. Neither the posterior mean nor the covariance over function values (Eq. (4.3)) contain "lonely" feature vectors, but only inner products of the form $k_{ab} := \Phi_a^\top \Sigma \Phi_b$ and $m_a := \Phi_a^\top \mu$. For reasons to become clear in the following section, these quantities are known as the covariance function $k : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ and mean function $m : \mathbb{X} \to \mathbb{R}$, respectively.

Exercise 4.1 (easy). Consider the likelihood of Eq. (4.2) with the parametric form for $f$ of Eq. (4.1). Show that the maximum-likelihood estimator for $w$ is given by the ordinary least-squares estimate $w_{\mathrm{ML}} = (\Phi_X \Phi_X^\top)^{-1} \Phi_X y$. To do so, use the explicit form of the Gaussian pdf to write out $\log p(y \mid X, w)$, take the gradient with respect to the elements $[w]_i$ of the vector $w$, and set it to zero. If you find it difficult to do this in vector notation, it may be helpful to write out $\Phi_X^\top w = \sum_i w_i [\Phi_X]_{i:}$, where $[\Phi_X]_{i:}$ is the $i$th column of $\Phi_X$. Calculate the derivative of $\log p(y \mid X, w)$ with respect to $w_i$, which is scalar.
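The claim of the exercise is also easy to test numerically; here is a sketch of such a check (not the book's solution; the data and features are arbitrary):

import numpy as np

# With Gaussian likelihood p(y | X, w) = N(y; Φ_Xᵀ w, Λ), the maximiser of
# log p(y | X, w) is the ordinary least-squares estimate
# w_ML = (Φ_X Φ_Xᵀ)⁻¹ Φ_X y.
rng = np.random.default_rng(2)
X = np.linspace(-2, 2, 25)
PhiX = np.stack([X**i for i in range(4)])       # polynomial features, (F, N)
y = PhiX.T @ np.array([1.0, -0.5, 0.1, 0.3]) + rng.normal(scale=0.5, size=X.size)

w_ml = np.linalg.solve(PhiX @ PhiX.T, PhiX @ y)

# Cross-check against numpy's least-squares routine on Φ_Xᵀ w ≈ y:
w_lstsq = np.linalg.lstsq(PhiX.T, y, rcond=None)[0]
print(np.allclose(w_ml, w_lstsq))               # True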
…parametric case studied in §4.1: recall from Eq. (4.3) that, for general linear regression using features $\varphi$, the mean vector and covariance matrix of the posterior on function values do not contain isolated, explicit forms of the features. Instead, for finite subsets $a, b \subset \mathbb{X}$, it contains only projections and inner products of the form

$$m_a := \Phi_a^\top \mu \quad \text{and} \quad k_{ab} := \Phi_a^\top \Sigma \Phi_b = \sum_{ij}^{F} \varphi_i(a)\,\varphi_j(b)\,\Sigma_{ij}.$$
with scale $\lambda \in \mathbb{R}_+$ over the domain $[c_{\min}, c_{\max}] \subset \mathbb{R}$. That is, set

$$\Sigma = \frac{\sqrt{2}\,\theta^2 (c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\, I,$$

and set $\mu = 0$. It is then possible to convince oneself¹⁰ that the limit of $F \to \infty$ and $c_{\min} \to -\infty$, $c_{\max} \to \infty$ yields

$$k(a, b) = \theta^2 \exp\left(-\frac{(a - b)^2}{2\lambda^2}\right). \qquad (4.4)$$

[Margin note 10, a proof sketch omitting technicalities:

$$k_{ab} = \frac{\sqrt{2}\,\theta^2 (c_{\max} - c_{\min})}{\sqrt{\pi}\,\lambda F}\, e^{-\frac{(a-b)^2}{2\lambda^2}} \sum_{i=1}^{F} e^{-\frac{(c_i - \frac{1}{2}(a+b))^2}{\lambda^2/2}}.$$

In the limit of large $F$, the number of features in a region of width $\delta c$ converges to $F \cdot \delta c/(c_{\max} - c_{\min})$, and the sum becomes the Gaussian integral

$$k_{ab} = \frac{\sqrt{2}\,\theta^2}{\sqrt{\pi}\,\lambda}\, e^{-\frac{(a-b)^2}{2\lambda^2}} \int_{c_{\min}}^{c_{\max}} e^{-\frac{(c - \frac{1}{2}(a+b))^2}{\lambda^2/2}}\, dc.$$

For $c_{\min}, c_{\max} \to \pm\infty$, that integral converges to $\sqrt{\pi}\lambda/\sqrt{2}$.]

This is called a nonparametric formulation of regression, since the parameters $w$ of the model (regardless of whether their number is finite or infinite) are not explicitly represented in the computation. The function $k$ constructed in Eq. (4.4), if used in a Gaussian regression framework, assigns a covariance of full rank,

$$[k_{XX}]_{ij} = k(x_i, x_j).$$
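The limiting construction can be probed numerically. In the sketch below, the constants follow the margin proof sketch above; the radial feature form $\varphi_i(x) = \exp(-(x - c_i)^2/\lambda^2)$ is my reconstruction implied by that sketch, and all numerical values are arbitrary. A large but finite feature set already reproduces Eq. (4.4) to high accuracy:

import numpy as np

# Radial features φ_i(x) = exp(-(x - c_i)²/λ²) on a wide grid of centres,
# with Σ = √2 θ² (c_max - c_min)/(√π λ F) · I, approximate the
# square-exponential kernel k(a, b) = θ² exp(-(a - b)²/(2λ²)) of Eq. (4.4).
theta, lam = 1.5, 0.8
c_min, c_max, F = -30.0, 30.0, 6000
c = np.linspace(c_min, c_max, F)                    # feature centres c_i
sigma = np.sqrt(2) * theta**2 * (c_max - c_min) / (np.sqrt(np.pi) * lam * F)

def phi(x):
    return np.exp(-((x - c) ** 2) / lam**2)         # feature vector Φ_x

a, b = 0.3, -1.1
k_features = sigma * phi(a) @ phi(b)                # Φ_aᵀ Σ Φ_b with Σ = σ I
k_limit = theta**2 * np.exp(-((a - b) ** 2) / (2 * lam**2))
print(k_features, k_limit)                          # nearly identical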