MML Book Errata

Errata: Mathematics for Machine Learning
published by Cambridge University Press, 2020
Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

Last update: December 18, 2023
In this document, we record typos and mistakes of our book Mathematics for Machine
Learning, published by Cambridge University Press (2020).
In this document, we will refer to the pagination of the online book 1 , which differs from the
printed version. An up-to-date version of the book (which includes the changes described in this
document) is available at https://mml-book.github.io/book/mml-book.pdf.
Things to be added are will be marked like this (which is basically a blue coloring); things
to be removed are marked like this (which is a red strikeout [sometimes underlining] of the old
text).
Foreword and Notation

• https://github.com/mml-book/mml-book.github.io/issues/588
p. 1: 2nd-last paragraph: ... these books only spend one or two chapters of on background
mathematics ...
p. 4: github GitHub
p.4: github.com https://github.com
p. 6: Table of symbols: add a row
A \ B A without B: the set of elements in A but not in B
•
p. 7, Table of symbols. Add three rows:
x ⊥ y Vectors x and y are orthogonal
V Vector space
V ⊥ Orthogonal complement of vector space V
p.
PN7, Table of symbols. Add two rows:
QN n=1 xn Sum of the xn : x1 + . . . + xN
x
n=1 n Product of the xn : x1 · . . . · xN
p. 7, Table of symbols. Add two rows:
f∗ = minx f (x) The smallest function value of f
x∗ ∈ arg minx f (x) The value x∗ that minimizes f (note: arg min returns a set of values)
1 https://mml-book.github.io/book/mml-book_printed.pdf
1
Chapter 1
Chapter 2
p. 21, equation (2.9): swap vectors and scalars for better readability.
       
a11 a12 a1n b1
 ..   ..   ..   .. 
x1  .  x1 + x2  .  x2 + · · · + xn  .  xn =  . 
am1 am2 amn bm
p. 24, Remark below Definition 2.3: Replace occurrences of B with A0 : “[...] If we
multiply A with

0 a22 −a12
B A :=
−a21 a11
we obtain

a a − a12 a21
0 0
AB A = 11 22 = (a11 a22 − a12 a21 )I .
0 a11 a22 − a12 a21
[...]”
p.25: (feature request) Switched order of equations (2.30) and (2.31) to mirror equations
(2.27) and (2.28)
p. 33, Example 2.8: ”Let us revisit the matrix in (), which is already in reduced REF:”
p. 36, Definition 2.7, inverse element:
∀x ∈ G ∃y ∈ G : x ⊗ y = e and y ⊗ x = e, where e is the neutral element. We often write
x−1 to denote the inverse element of x.
p. 36, Example 2.10, first item:
• (Z, +) is an Abelian group.
p. 40, remark: Every subspace U ⊆ (Rn , +, ·) is the solution space of a homogeneous
system of homogeneous linear equations Ax = 0 for x ∈ Rn .
• https://github.com/mml-book/mml-book.github.io/issues/497 p. 44, Definition 2.14
(Basis).
A generating set A of V is called minimal if there exists no smaller set Ã⊆ (A ⊆ V that
spans V . [...]
p. 58, Definition 2.23: We also call V and W also the domain and codomain of Φ,
respectively.
p. 58: Intuitively, the kernel is the set of vectors in v ∈ V that Φ maps onto the neutral
element 0W ∈ W .
2
p. 61–62, Example 2.26. Change all x1 , . . . , xn with b1 , . . . , bn (keep support point as
x0 ). The example should read as follows:
• One-dimensional affine subspaces are called lines and can be written as y = x0 +λb1 ,
where λ ∈ R and U = span[b1 ] ⊆ Rn is a one-dimensional subspace of Rn . This
means that a line is defined by a support point x0 and a vector b1 that defines the
direction. See Figure 6.15 for an illustration.
• Two-dimensional affine subspaces of Rn are called planes. The parametric equation
for planes is y = x0 +λ1 b1 +λ2 b2 , where λ1 , λ2 ∈ R and U = span[b1 , b2 ] ⊆ Rn . This
means that a plane is defined by a support point x0 and two linearly independent
vectors b1 , b2 that span the direction space.
• In Rn , the (n − 1)-dimensional affine subspaces are called hyperplanes, and the cor-
Pn−1
responding parametric equation is y = x0 + i=1 λi bi , where b1 , . . . , bn−1 form a
basis of an (n − 1)-dimensional subspace U of Rn . This means that a hyperplane is
defined by a support point x0 and (n − 1) linearly independent vectors b1 , . . . , bn−1
that span the direction space. In R2 , a line is also a hyperplane. In R3 , a plane is
also a hyperplane.
p. 62, Figure 2.13: replace u b1
p. 62: Caption of Figure 2.13 should be as follows: “Lines are affine subspaces. Vectors y
on a line x0 + λb1 lie in an affine subspace L with support point x0 and direction b1 .”
• p. 62. The remark should read as follows:
For A ∈ Rm×n and b x ∈ Rm , the solution of the linear equation system system of linear
equations Ax λ = x is either the empty set or an affine subspace of Rn of dimension n −
rk(A). In particular, the solution of the linear equation λ1 x1 + . . . + λn xn = b λ1 b1 + . . . + λn bn = x,
where (λ1 , . . . , λn ) 6= (0, . . . , 0), is a hyperplane in Rn .
In Rn , every k-dimensional affine subspace is the solution of an inhomogeneous system of
linear equations Ax = b, where A ∈ Rm×n , b ∈ Rm and rk(A) = n − k. Recall that for
homogeneous equation systems Ax = 0 the solution was a vector subspace, which we can
also think of as a special affine space with support point x0 = 0.
• p. 62, just below remark: [...] inhomogeneous linear equations system inhomogeneous
system of linear equations [...]
p. 69, Exercise 2.20 c.: “We consider c1 , c2 , c3 , three vectors of R3 defined in the standard
basis of R R3 as ... ” We consider c1 , c2 , c3 , three vectors of R3 defined in the standard
basis of R3
Chapter 3
p. 80: “We can think of a vector x ∈ Rn as a function with n function values.”
p.94, Section 3.10, paragraph 2: [...] For a broader and more in-depth overview of some
of the concepts we presented, we refer to the following excellent books: [...]
p. 97, exercise 3.7: Let V be a vector space and π an endomorphism of V .
3
• p. 97, exercise 3.9: Let n ∈ N∗
• p. 97, exercise 3.9: x1 + · · · + xn x1 + . . . + xn
Chapter 4
p. 106, Theorem 4.8: λ ∈ R is an eigenvalue ...
• p. 106, Example 4.4: det(I − λI) = (1 − λI)n
p. 110, bullet point for A5 : “[...] It stretches space along the (blue red) eigenvector of λ2
by a factor 1.5 and compresses it along the orthogonal (blue) eigenvector by a factor 0.5.”
1 21

p. 110: A5 = 1 is a shear-and-stretch mapping that shrinks scales space by 75%
2 1
3
since | det(A5 )| = 4
p. 112, Theorem 4.16 holds in its generality only if we consider complex eigenvalues.
Therefore, it should read:
“The determinant of a matrix A ∈ Rn×n is the product of its eigenvalues, i.e.,
n
Y
det(A) = λi ,
i=1
where λi ∈ C are (possibly repeated) eigenvalues of A.”

p. 113, paragraph 3: Replace circumference perimeter
p. 113, caption Fig. 4.6: circumference perimeter
https://github.com/mml-book/mml-book.github.io/issues/579
p. 113, caption of Figure 4.6, last sentence: ... circumference perimeter changes by a
factor of 2 12 (|λ1 | + |λ2 |)
p. 113, Theorem 4.17 holds in its generality only if we consider complex eigenvalues.
Therefore, it should read:
“The trace of a matrix A ∈ Rn×n is the sum of its eigenvalues, i.e.,
n
X
tr(A) = λi , (1)
i=1
where λi ∈ C are (possibly repeated) eigenvalues of A.”
p. 117, caption of Figure 4.7: ... “Top-left to bottom-left: P −1 performs a basis change
(here drawn in R2 and depicted as a rotation-like operation), mapping the eigenvectors
into the standard basis from the standard basis into the eigenbasis.”
4
Ae2
e2 e2
p2 e1
A
e1
p1
Ae1
P −1
P
λ2 p 2
p2 p2
e2 e1 p1 λ1 p 1
D
p1
p. 117: Replace Figure 4.7 with the following figure:
p. 117, last paragraph: red blue and orange arrows in Figure 4.7
p. 117, last paragraph: This identifies the eigenvectors pi (blue and orange arrows in
Figure 4.7) onto the standard basis vectors ei . This defines the eigenvectors pi (orange
arrows in Figure 4.7) as the new coordinate system with respect to which we continue.
p. 118, Example 4.11. We replaced the example with a new example, where matrix A
changes, so that it corresponds to the matrix we used in Figure 4.7. The new example
looks like this:
Example 4.11 (Eigendecomposition)
1 5 −2
Let us compute the eigendecomposition of A = .
2 −2 5
Step 1: Compute eigenvalues and eigenvectors. The characteristic polynomial of
A is
5
det(A − λI) = det 2 −λ −1
5
−1 2 −λ
= ( 52 − λ)2 − 1 = λ2 − 5λ + 21
4 = (λ − 72 )(λ − 23 ) .
Therefore, the eigenvalues of A are λ1 = 72 and λ2 = 32 (the roots of the characteristic

polynomial), and the associated (normalized) eigenvectors are obtained via
7 3
Ap1 = p , Ap2 = p .
2 1 2 2
This yields

1 1 1 1
p1 = √ , p2 = √ .
2 −1 2 1
Step 2: Check for existence. The eigenvectors p1 , p2 form a basis of R2 . Therefore,

A can be diagonalized.
Step 3: Construct the matrix P to diagonalize A. We collect the eigenvectors of Figure 4.7
visualizes
the eigen-
5 decomposition
of
A =
5 −2
as
−2 5
A in P so that
1 1 1
P = [p1 , p2 ] = √ .
2 −1 1
We then obtain
7
−1 2 0
P AP = 3 = D.
0 2
Equivalently, we get (exploiting that P −1 = P > since the eigenvectors p1 and p2 in this
example form an ONB)
1 72 0 1 1 −1

1 5 −2 1 1
=√ √ .
2 −2 5 2 −1 1 0 32 2 1 1
| {z } | {z } | {z } | {z }
A P D P −1
p. 119, Theorem 4.22: Am×n A ∈ Rm×n
• p. 120, Figure 4.8: Replace V 1 , V 2 with v 1 , v 2 in the top-left panel.

p. 121, margin comment: “It is useful to revise review basis changes ...”
p. 123, paragraph above (4.71): The spectral theorem (Theorem 4.15) tells us that a
symmetric matrix possesses a list of eigenvectors that form an ONB the eigenvectors of a
symmetric matrix form an ONB, which also means it can be diagonalized.
p. 125, 2nd paragraph: For n < m n > m, (4.79) holds only for i 6 m i 6 n, and but
(4.79) says nothing about the ui for i > m i > n. However, we know by construction that
they are orthonormal. Conversely, for n < m m < n, (4.79) holds only for i 6 m. For
i > m, we have Av i = 0 and we still know that the v i form an orthonormal set. This
means that the SVD also supplies an orthonormal basis of the kernel (null space) of A,
the set of vectors x with Ax = 0 (see Section 2.7.3).
• p. 125, 3rd paragraph: Remove “Moreover”.

p. 127, 2nd-last bullet point before example: The nonzero singular values of A are
the square roots of the nonzero eigenvalues of both AA> and are equal to the nonzero
eigenvalues of A> A.
p. 128: “An idealized science fiction lover is a purist and only loves science fiction movies,
so a science fiction lover v 1 gives a rating of zero to everything but science fiction themed
– —this logic is implied by the diagonal substructure for the singular value matrix Σ.”
p. 128, just below example: It is worth, to briefly discuss SVD terminology and conventions,
as there are different versions used in the literature. The mathematics remains invariant
to these differences, but these differences can be confusing. It is worth to briefly discuss
SVD terminology and conventions, as there are different versions used in the literature.
While these differences can be confusing, the mathematics remains invariant to them.
6
p. 129, Section 4.5.3, bullet point: It is possible to define the SVD of a rank-r matrix A
so that U is an m × r matrix, Σ a diagonal matrix of size r × r, and V an r × n n × r
matrix.
p. 133, Example 4.15, below (4.100b): This first rank-1 approximation A1 is insightful: it
tells us that Ali and Beatrix like science fiction movies, such as Star Wars and Bladerunner
(entries have values > 4 > 0.4), but fails to capture the ratings of the other movies by
Chandra.
p. 133, equation (4.101b)
   
−0.0154 0.0042 −0.0174 0.0020 0.0042 −0.0231
−0.1338 0.0362 −0.1516  0.0175 0.0362 −0.2014
   
 0.5019 −0.1358 0.5686  −0.0656 −0.1358 0.7556 
0.3928 −0.1063 0.445 −0.0514 −0.1063 0.5914
p. 136: Therefore, the Cholesky decomposition enables us to compute the reparametriza-
tion trick where we want to perform continuous differentiation over random variables, e.g.,
in variational autoencoders (Jimenez Rezende et al., 2014; Kingma and Ba, 2014 Kingma
and Welling, 2014).
• p.137: Exercise 4.3 should read: Compute the eigenspaces of
a.
1 0
A :=
1 1
b.
−2 2
B :=
2 1
Chapter 5
p. 146, (5.6c), add parentheses for clarity:
n
!
n n−1 X n n−i i−1
= lim x + x h
h→0 1 i=2
i
| {z }
→0 as h→0
p. 145, title of Example 5.5: Chain r Rule
p. 150, just after (5.55): “From (5.40), we know that the gradient of f f with respect
to a vector is the row vector of the partial derivatives. In (5.55), every partial derivative
∂f /∂xi ∂f /∂xi is itself a column vector.”
p. 161, just below Figure 5.8: In neural networks with multiple layers, we have functions
fi (xi−1 ) = σ(Ai−1 xi−1 + bi−1 ) in the ith layer.
7
• p. 171, Exercise 5.6: [...] where tr(·) denotes the trace.
• p. 171, Exercise 5.8: df /dx df /dx
171, Exercise 5.9 should read:
We define
g(x,z, ν) := log p(x, z) − log q(z, ν)

z := t(, ν)
for differentiable functions p, q, t, and x ∈ RD , z ∈ RE , ν ∈ RF , ∈ RG . By using the

chain rule, compute the gradient
d
g(x,z, ν) .
dν
Chapter 6
p. 188, equation (6.32): Exd [xd ] EXd [xd ]
p. 199, just above equation (6.64): “To consider the effect of applying the sum rule of
probability and the effect of conditioning, we explicitly write the Gaussian distribution in
terms of the concatenated states [x> , y > ]> , ...”
p. 199, equation (6.68)
Z

p(x) = p(x, y)d dy = N x | µx , Σxx .
p. 202, equation (6.83b), add brackets to integrand:
Z ∞
( αxp1 (x) + (1 − α)xp2 (x) )dx
−∞
p. 202, equation (6.84b), add brackets to integrand:
Z ∞
( αx2 p1 (x) + (1 − α)x2 p2 (x) )dx
−∞
p. 207, remark: We introduced the preceding three distributions ...
p. 217, below (6.133): subsituting substituting
Chapter 7
p. 241, Section 7.3.2, line 1: objctive objective
8
•
p. 241, just above (7.50): Assuming that Q is invertible Since Q is positive definite and
therefore invertible
p. 242, Section 7.3.3, 2nd paragraph. Reformat so that the trailing “.” follows immediately
after “concept”.
p. 242, Section 7.3.3, 11[...] convex sets can be equivalently described by its their sup-
porting hyperplanes [...]”
p. 243, line 3 Section 7.3.3: “To understand Definition 7.4 in a geometric fashion, consider
an nice simple one-dimensional convex and differentiable function.”
Chapter 8
PN
p. 232, beginning of 2nd paragraph: “Consider the term n=1 (∇Ln (θ i )) in (7.15), . We
can reduce the ...”
p. 266, line 9: In other words, once we have chosen the type of function we want as a
predictor, the likelihood provides the probability of observing data x. is the probability
density function of the observed data x given θ.
p. 266, last paragraph: ... the likelihood of involving the whole dataset (Y = {y1 , . . . , yN }
and X = {x1 , . . . , xN }) factorizes into a product of the likelihoods of each individual
example
p. 268, first paragraph: “Maximum likelihood estimation may suffer from overfitting (Sec-
tion 8.3.3), analogous to unregularized empirical risk minimization (Section 9.2.3 8.2.3).”
p. 276–277 (notation consistency): Replace occurrences of p(x | θ, z) with p(x | z, θ).
Chapter 9
pp. 296–297, remark: [...] When we were working without features, we required X > X to
be invertible, which is the case when rk(X) = D, i.e., the rows columns of X are linearly
independent. [...]
p. 299: “We notice that polynomials of low degree (e.g., constants (M = 0) or linear
(M = 1)) fit the data poorly ... ”
p. 299: “For degrees M = 3, . . . , 5 6, the fits look plausible and smoothly interpolate the
data.”
p. 301, Eq. (9.7)
θ ML = ∈ arg max p(Y | X , θ)
θ
9
p. 307, Add margin comment: Since p(θ | X , Y) = N mN , S N , it holds that θ MAP =
mN .
p. 308, bottom: Add margin comment: E[y∗ | X , Y, x∗ ] = φ> (x∗ )mN = φ> (x∗ )θ MAP .
p. 308, bottom: The predictive mean φ> (x∗ )mN coincides with the predictions made
with the MAP estimate θ MAP .
– p. 312, equation (9.62): Eθ [Y | X ] = Eθ, [Xθ + ] = XEθ [θ] = Xm0 .
– p. 312, equation (9.63a,b):
Cov θ [Y|X ] = Covθ, [Xθ + ] = Covθ [Xθ] + σ 2 I
= X Covθ [θ]X > + σ 2 I = XS 0 X > + σ 2 I .
p. 314, equation (9.71), add transpose
K
!
X
> −1 > > >
Φ(Φ Φ) Φ y = ΦΦ y = φk φk y
k=1
p. 314, around (9.71) “This will then lead to the projection
K
!
X
> −1 > > >
Φ(Φ Φ) Φ y = ΦΦ y = φk φk y
k=1
so that the coupling between different features has disappeared and the maximum like-
lihood projection is simply the sum of projections of y onto the individual basis vectors
φk , i.e., the columns of Φ. Furthermore, the coupling between different features has
disappeared due to the orthogonality of the basis.”
• p. 315 (top): [...] one can convert a set of linearly independent basis functions to an
orthogonal basis by using the Gram-Schmidt process; see Section 3.8.3 and (Strang, 2003).
Chapter 10
p. 319, 2nd paragraph after example 10.1: “Based on the motivation of thinking of PCA
as a data compression technique ...”
p. 319, 2nd paragraph after example 10.1: “The linear mapping represented by B can be
thought of as a decoder ...”
p. 326, second-last paragraph: In the following, we use exactly this kind of representation
of x̃ to find optimal coordinates z and basis vectors b1 , . . . , bM such that x̃ is as similar
to the original data point x as possible, i.e., we aim to minimize the (Euclidean) distance
kx − x̃k.
10
p. 327, above (10.29): The similarity measure we use in the following is the squared
2
Euclidean norm distance (Euclidean norm) kx − x̃k between x and x̃.
p. 327, above (10.29): We therefore define our objective as the minimizing the average
squared Euclidean distance
p. 341, Example 10.5, last sentence: “...more an and more ...”
p. 342, equation (10.68a) should read
Z
p(x | B, µ, σ ) = p(x | z, B,µ, σ 2 )p(z)dz
2
p. 344, equation (10.76) should read
N N 2
1 X 1 X
kxn − x̃n k2 = xn − B > B BB > xn .
N n=1 N n=1
p. 346: Add space in front of “However, FA no longer allows ...”.
Chapter 11
p. 356, equation (11.34): The referenced identity should be (5.103) instead of (5.106).
Equation (11.34) should then read:
(5.106) (5.103)
= −Σ−1 > −1
k (xn − µk )(xn − µk ) Σk
p. 361, add the following note next to equations (11.54)–(11.56): Having updated the
means µk in (11.54), they are subsequently used in (11.55) to update the corresponding
covariances.
Chapter 12
p. 371, caption of Figure 12.1: [...] separates red orange crosses from blue dots discs.
p. 371, last paragraph: red orange cross
p. 374, caption of Figure 12.3: [...] separate red orange crosses from blue dots discs.
• p. 371, 2nd paragraph: “In the SVM case, we start by designing an objective function a
loss function that is to be minimized on training data, following the principles of empirical
risk minimization (Section 8.2). This can also be understood as designing a particular loss
function.
11
• p. 382, just below (12.31): “The first term in (12.31) is called the regularization term or
the regularizer (see Section 9.3.2 8.3.2), ...”
p. 384, remark: explaination explanation
p. 386, caption of Figure 12.9(b): [...] (red) (orange) examples
PN
p. 386, equation (12.36): − n=1 αn yn (negate the partial derivative)
p. 393, last paragraph (starting with “The SVM ...”): From the training perspective,
there are many related probabilistic approches approaches.
12

MML Book Errata

Uploaded by

Copyright:

Available Formats

You might also like

MML Book Errata

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

MML Book Errata

Uploaded by

Copyright:

Available Formats

Errata: Mathematics for Machine Learning

published by Cambridge University Press, 2020

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

Foreword and Notation

• p. 97, exercise 3.9: x1 + · · · + xn x1 + . . . + xn

where λi ∈ C are (possibly repeated) eigenvalues of A.”

where λi ∈ C are (possibly repeated) eigenvalues of A.”

Therefore, the eigenvalues of A are λ1 = 72 and λ2 = 32 (the roots of the characteristic

Step 2: Check for existence. The eigenvectors p1 , p2 form a basis of R2 . Therefore,

• p. 120, Figure 4.8: Replace V 1 , V 2 with v 1 , v 2 in the top-left panel.

• p. 125, 3rd paragraph: Remove “Moreover”.

g(x,z, ν) := log p(x, z) − log q(z, ν)

for differentiable functions p, q, t, and x ∈ RD , z ∈ RE , ν ∈ RF , ∈ RG . By using the

– p. 312, equation (9.62): Eθ [Y | X ] = Eθ, [Xθ + ] = XEθ [θ] = Xm0 .

– p. 312, equation (9.63a,b):

Cov θ [Y|X ] = Covθ, [Xθ + ] = Covθ [Xθ] + σ 2 I

= X Covθ [θ]X > + σ 2 I = XS 0 X > + σ 2 I .

You might also like

MML Book Errata

Uploaded by

Copyright:

Available Formats

You might also like

MML Book Errata

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

MML Book Errata

Uploaded by

Copyright:

Available Formats

Errata: Mathematics for Machine Learning

published by Cambridge University Press, 2020

Marc Peter Deisenroth, A. Aldo Faisal, Cheng Soon Ong

Foreword and Notation

• p. 97, exercise 3.9: x1 + · · · + xn x1 + . . . + xn

where λi ∈ C are (possibly repeated) eigenvalues of A.”

where λi ∈ C are (possibly repeated) eigenvalues of A.”

Therefore, the eigenvalues of A are λ1 = 72 and λ2 = 32 (the roots of the characteristic

Step 2: Check for existence. The eigenvectors p1 , p2 form a basis of R2 . Therefore,

• p. 120, Figure 4.8: Replace V 1 , V 2 with v 1 , v 2 in the top-left panel.

• p. 125, 3rd paragraph: Remove “Moreover”.

g(x,z, ν) := log p(x, z) − log q(z, ν)

for differentiable functions p, q, t, and x ∈ RD , z ∈ RE , ν ∈ RF ,  ∈ RG . By using the

– p. 312, equation (9.62): Eθ [Y | X ] = Eθ, [Xθ + ] = XEθ [θ] = Xm0 .

– p. 312, equation (9.63a,b):

Cov θ [Y|X ] = Covθ, [Xθ + ] = Covθ [Xθ] + σ 2 I

= X Covθ [θ]X > + σ 2 I = XS 0 X > + σ 2 I .

You might also like

for differentiable functions p, q, t, and x ∈ RD , z ∈ RE , ν ∈ RF , ∈ RG . By using the

– p. 312, equation (9.62): Eθ [Y | X ] = Eθ, [Xθ + ] = XEθ [θ] = Xm0 .

Cov θ [Y|X ] = Covθ, [Xθ + ] = Covθ [Xθ] + σ 2 I