Machine Learning for Engineers:
Chapter 3. Inference, or Model-Based Prediction - Problems

Osvaldo Simeone

King’s College London

October 13, 2022

Osvaldo Simeone ML4Engineers 1 / 75


Problem 3.1
For the joint distribution p(x, t) in the table below, compute the
optimal soft predictor, the optimal hard predictor for the
detection-error loss, and the corresponding minimum population
detection-error loss.

x\t 0 1
0 0.4 0.3
1 0.1 0.2

How does the optimal hard predictor change when we adopt the loss
function illustrated by the table below?

t\t̂ 0 1
0 0 1
1 0.1 0

Osvaldo Simeone ML4Engineers 2 / 75


Problem 3.1: Solution

The optimal soft predictor is given by the posterior distribution p(t|x) in the table below:

x\t 0 1
0 0.4/(0.3 + 0.4) = 0.57 0.3/(0.3 + 0.4) = 0.43
1 0.1/0.3 = 0.33 0.2/0.3 = 0.67

The optimal hard predictor under the detection-error loss is given by the MAP predictor

t̂*(x) = arg max_t p(t|x),

which yields the predictions t̂*(0) = 0 and t̂*(1) = 1.

Osvaldo Simeone ML4Engineers 3 / 75


Problem 3.1: Solution
The corresponding minimum population detection-error loss is given by

Lp(t̂*(·)) = E_(x,t)∼p(x,t)[1(t ≠ t̂*(x))]
= 0.4 · 1(0 ≠ t̂*(0)) + 0.3 · 1(1 ≠ t̂*(0)) + 0.1 · 1(0 ≠ t̂*(1)) + 0.2 · 1(1 ≠ t̂*(1))
= 0.4 · 1(1 = t̂*(0)) + 0.3 · 1(0 = t̂*(0)) + 0.1 · 1(1 = t̂*(1)) + 0.2 · 1(0 = t̂*(1))
= 0.3 + 0.1 = 0.4.

We can alternatively use the fact that the minimum population detection-error loss is the minimum probability of error

Lp(t̂*(·)) = 1 − E_x∼p(x)[max_t p(t|x)]
= 1 − p(x = 0) max_t p(t|x = 0) − p(x = 1) max_t p(t|x = 1)
= 1 − 0.7 · 0.57 − 0.3 · 0.67 = 0.4.
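
These calculations can be checked numerically. The following MATLAB sketch (not part of the original slides; variable names are illustrative) recomputes the posterior table and the minimum detection-error loss directly from the joint pmf:

P=[0.4 0.3; 0.1 0.2];              % joint pmf p(x,t): rows indexed by x, columns by t
px=sum(P,2);                       % marginal p(x)
post=P./px                         % posterior p(t|x), one row per value of x
Lmin=1-sum(px.*max(post,[],2))     % minimum detection-error loss, equal to 0.4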
Osvaldo Simeone ML4Engineers 4 / 75
Problem 3.1: Solution

Consider now the modified loss function ℓ(t, t̂) given in the second table of the problem. For any hard predictor, the corresponding population loss is

Lp(t̂(·)) = E_(x,t)∼p(x,t)[ℓ(t, t̂(x))]
= 0.4 · 1(0 ≠ t̂(0)) + 0.1 · 0.3 · 1(1 ≠ t̂(0)) + 0.1 · 1(0 ≠ t̂(1)) + 0.1 · 0.2 · 1(1 ≠ t̂(1)).

It follows that the optimal predictor is given as t̂*(0) = 0 and t̂*(1) = 0, and the resulting minimum population loss is

Lp(t̂*(·)) = 0.1 · 0.3 + 0.1 · 0.2 = 0.05.
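
A brute-force check (an illustrative sketch, not from the original slides) can evaluate the population loss of all four deterministic predictors (t̂(0), t̂(1)) and confirm that (0, 0) attains the minimum of 0.05:

P=[0.4 0.3; 0.1 0.2];                       % joint pmf p(x,t)
L=[0 1; 0.1 0];                             % loss table ell(t,that): rows t, columns that
best=inf;
for t0=0:1                                  % candidate prediction for x=0
  for t1=0:1                                % candidate prediction for x=1
    that=[t0 t1];
    Lp=0;
    for x=0:1
      for t=0:1
        Lp=Lp+P(x+1,t+1)*L(t+1,that(x+1)+1);   % accumulate population loss
      end
    end
    if Lp<best, best=Lp; bestpred=that; end
  end
end
bestpred, best                              % returns [0 0] and 0.05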

Osvaldo Simeone ML4Engineers 5 / 75


Problem 3.1: Solution

Equivalently, we can use the formula

t̂*(x) = arg min_t̂ E_t∼p(t|x)[ℓ(t, t̂)]

as

t̂*(0) = arg min_t̂ [0.57 · 1(0 ≠ t̂) + 0.43 · 0.1 · 1(1 ≠ t̂)] = 0.

And similarly for t̂*(1).

Osvaldo Simeone ML4Engineers 6 / 75


Problem 3.2

For the joint distribution p(x, t) in the table below, compute the
optimal (Bayesian) soft predictor, the optimal hard predictor for the
detection-error loss, and the corresponding minimum population
detection-error loss.

x\t 0 1
0 0.16 0.24
1 0.24 0.36

Osvaldo Simeone ML4Engineers 7 / 75


Problem 3.2: Solution

It can be seen that the two variables are independent, since the joint
distribution is equal to the product of marginals defined by the
probabilities p(x = 1) = 0.6 and p(t = 1) = 0.6.
Therefore, using x to predict t is not useful, since the posterior distribution equals the marginal; that is, the optimal soft predictor is

p(t|x) = p(x, t)/p(x) = p(x)p(t)/p(x) = p(t).

It also follows that the optimal hard predictor is a constant t̂* that does not depend on the value of x, i.e.,

t̂* = arg min_t̂ Lp(t̂), with Lp(t̂) = E_t∼p(t)[ℓ(t, t̂)].

Osvaldo Simeone ML4Engineers 8 / 75


Problem 3.2: Solution

Let us compute the population loss for any predictor t̂ as

Lp(t̂) = E_t∼p(t)[ℓ(t, t̂)] = 0.4 · 1(0 ≠ t̂) + 0.6 · 1(1 ≠ t̂).

Therefore, the optimal hard predictor is t̂* = 1, and the minimum population loss is Lp(t̂*) = 0.4.

Osvaldo Simeone ML4Engineers 9 / 75


Problem 3.3

Given the posterior distribution p(t|x) in the table below, obtain the
optimal point predictors under ℓ2 and detection-error losses, and
evaluate the corresponding minimum population losses when
p(x = 1) = 0.5.

x\t 0 1
0 0.9 0.1
1 0.2 0.8

Osvaldo Simeone ML4Engineers 10 / 75


Problem 3.3: Solution
We know that under the ℓ2 loss, the optimal hard predictor is given
by the mean of the posterior distribution

t̂ ∗ (x) = Et∼p(t|x) [t].

Therefore, we have

t̂ ∗ (0) = Et∼p(t|0) [t] = 0.9 · 0 + 0.1 · 1 = 0.1,

and
t̂ ∗ (1) = Et∼p(t|1) [t] = 0.2 · 0 + 0.8 · 1 = 0.8.
Furthermore, the minimum population loss is the average posterior
variance

Lp (t̂ ∗ (·)) = Ex∼p(x) [Var(t|x)].

Osvaldo Simeone ML4Engineers 11 / 75


Problem 3.3: Solution

To compute it, let us first evaluate the two conditional variances

Var(t|x = 0) = p(t = 1|x = 0)(1 − p(t = 1|x = 0)) = 0.1 · 0.9 = 0.09

and

Var(t|x = 1) = p(t = 1|x = 1)(1 − p(t = 1|x = 1)) = 0.8 · 0.2 = 0.16.

We finally obtain

Lp (t̂ ∗ (·)) = Ex∼p(x) [Var(t|x)]


= 0.5 · 0.09 + 0.5 · 0.16 = 0.125.

Osvaldo Simeone ML4Engineers 12 / 75


Problem 3.3: Solution
We know that under the detection-error loss, the optimal hard
predictor is given by the MAP predictor

t̂*(x) = arg max_t p(t|x).

Therefore, we have
t̂ ∗ (0) = 0,
and
t̂ ∗ (1) = 1.
Furthermore, the minimum population loss is the minimum probability
of error
Lp(t̂*(·)) = 1 − E_x∼p(x)[max_t p(t|x)] = 1 − (0.5 · 0.9 + 0.5 · 0.8) = 0.15.
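
Both minimum losses can be verified numerically; the MATLAB sketch below (illustrative, not part of the original) evaluates the average posterior variance and the minimum probability of error:

post=[0.9 0.1; 0.2 0.8];                    % posterior p(t|x), rows indexed by x
px=[0.5; 0.5];                              % marginal p(x)
mmse=sum(px.*(post(:,2).*(1-post(:,2))))    % average posterior variance, equal to 0.125
perr=1-sum(px.*max(post,[],2))              % minimum probability of error, equal to 0.15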

Osvaldo Simeone ML4Engineers 13 / 75


Problem 3.4

If the posterior distribution is given as

(t|x = x) ∼ N (sin(2πx), 0.1),

what is the optimal hard predictor under the ℓ2 loss? What is the
MAP predictor?

Osvaldo Simeone ML4Engineers 14 / 75


Problem 3.4: Solution

The optimal hard predictor under the ℓ2 loss is given by the mean of
the posterior distribution

t̂ ∗ (x) = Et∼p(t|x) [t] = sin(2πx).

The MAP predictor is

t̂*(x) = arg max_t p(t|x) = sin(2πx).

Osvaldo Simeone ML4Engineers 15 / 75


Problem 3.5

Consider the posterior distribution

p(t|x) = 0.5N (t|3, 1) + 0.5N (t| − 3, 1).

Plot this posterior distribution.


What is the optimal hard predictor under the quadratic, or ℓ2 , loss?
What is the MAP hard predictor?

Osvaldo Simeone ML4Engineers 16 / 75


Problem 3.5: Solution

MATLAB code:
taxis=[-6:0.01:6];
plot(taxis,0.5*normpdf(taxis,3,1)+0.5*normpdf(taxis,-3,1),'LineWidth',2);
xlabel('$t$','Interpreter','latex','FontSize',12);
ylabel('$p(t|x)$','Interpreter','latex','FontSize',12);

The optimal hard predictor under the ℓ2 loss is given by the mean of the
posterior distribution

t̂ ∗ (x) = Et∼p(t|x) [t] = 0.5 · 3 + 0.5 · (−3) = 0.

The MAP predictor is

t̂*(x) = arg max_t p(t|x).

In this case, there are two MAP predictors, namely t̂*(x) = −3 and t̂*(x) = 3.
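
To see why the posterior mean is a poor summary of this bimodal posterior, one can compare the density at the mean t = 0 with the density at the modes t = ±3 (an additional illustrative check, reusing the normpdf calls above):

q=@(t) 0.5*normpdf(t,3,1)+0.5*normpdf(t,-3,1);   % mixture posterior p(t|x)
q(0)      % density at the posterior mean, about 0.004
q(3)      % density at each mode, about 0.20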

Osvaldo Simeone ML4Engineers 17 / 75


Problem 3.6

Consider the joint distribution p(x, t) = p(t)p(x|t) defined as


▶ prior: t ∼ N (2, 1)
▶ likelihood: (x|t = t) ∼ N (t, 0.1)
Derive the optimal soft predictor.
Derive the optimal hard predictor under the ℓ2 loss.
Plot prior and optimal soft predictor for x = 1.
Repeat the previous points for prior t ∼ N (2, 0.01) and comment on
the result.

Osvaldo Simeone ML4Engineers 18 / 75


Problem 3.6: Solution

In the problem at hand, we have


▶ prior: t ∼ N(ν = 2, α⁻¹ = 1)
▶ likelihood: (x|t = t) ∼ N(t, β⁻¹ = 0.1).
Using the formula in the notes, we have the optimal soft predictor

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | (2 + 10x)/11, 1/11).

Therefore, the optimal hard predictor under the ℓ2 loss is the conditional mean t̂*(x) = (2 + 10x)/11.
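
The posterior parameters can also be wrapped in a small helper; the MATLAB sketch below (names are illustrative, not from the original slides) returns the posterior mean and variance for given prior and likelihood precisions:

postmean=@(x,nu,alpha,beta) (alpha*nu+beta*x)./(alpha+beta);   % posterior mean
postvar=@(alpha,beta) 1./(alpha+beta);                         % posterior variance
postmean(1,2,1,10)    % for x=1: (2+10)/11, about 1.09
postvar(1,10)         % 1/11, about 0.09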

Osvaldo Simeone ML4Engineers 19 / 75


Problem 3.6: Solution

taxis=[-1:0.01:5];
plot(taxis,normpdf(taxis,2,1),'r','LineWidth',2); % prior N(2,1)
hold on; plot(taxis,normpdf(taxis,12/11,sqrt(1/11)),'b','LineWidth',2); % posterior for x=1
xlabel('$t$','Interpreter','latex','FontSize',12);
Osvaldo Simeone ML4Engineers 20 / 75


Problem 3.6: Solution

In the problem at hand, we have


▶ prior: t ∼ N(ν = 2, α⁻¹ = 0.01)
▶ likelihood: (x|t = t) ∼ N(t, β⁻¹ = 0.1).
Using the formula in the notes, we have the optimal soft predictor

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | (200 + 10x)/110, 1/110).

Therefore, the optimal hard predictor under the ℓ2 loss is the conditional mean t̂*(x) = (200 + 10x)/110.

Osvaldo Simeone ML4Engineers 21 / 75


Problem 3.6: Solution

taxis=[-1:0.01:5];
plot(taxis,normpdf(taxis,2,sqrt(0.01)),'r','LineWidth',2); % prior N(2,0.01): standard deviation sqrt(0.01)=0.1
hold on; plot(taxis,normpdf(taxis,210/110,sqrt(1/110)),'b','LineWidth',2); % posterior for x=1
xlabel('$t$','Interpreter','latex','FontSize',12);

With the much more concentrated prior N(2, 0.01), the posterior mean for x = 1 stays close to the prior mean (210/110 ≈ 1.91), whereas with the prior N(2, 1) it was pulled much closer to the observation (12/11 ≈ 1.09): a more confident prior makes the observation less influential.

Osvaldo Simeone ML4Engineers 22 / 75


Problem 3.7

Consider the joint distribution p(x, t) = p(t)p(x|t) defined as


▶ prior: t ∼ N(0_2, I_2)
▶ likelihood defined as x = [2 1]t + z, with noise z ∼ N(0, 0.1)
Derive the optimal soft predictor.
Derive the optimal hard predictor under the ℓ2 loss.
Plot contour lines of prior and optimal soft predictor for x = 1 by
using MATLAB.
Comment on the result.

Osvaldo Simeone ML4Engineers 23 / 75


Problem 3.7: Solution
In the problem at hand, we have
▶ prior: t ∼ N(ν = 0_2, α⁻¹I_2 = I_2)
▶ likelihood: (x|t = t) ∼ N(At, β⁻¹) with A = [2 1] and β = 10.
Using the formula in the notes, we have the optimal soft predictor

p(t|x) = N(Θ⁻¹(αν + βAᵀx), Θ⁻¹) = N(Θ⁻¹([0; 0] + [20; 10]x), Θ⁻¹),

where

Θ = αI_2 + βAᵀA = I_2 + 10 [2; 1][2 1] = [1 0; 0 1] + 10 [4 2; 2 1] = [41 20; 20 11].

Osvaldo Simeone ML4Engineers 24 / 75


Problem 3.7: Solution

Using MATLAB to compute the inverse, we have the optimal soft predictor

p(t|x) = N([0.39; 0.19] x, [0.21 −0.39; −0.39 0.80]).

Therefore, the optimal hard predictor under the ℓ2 loss is the conditional mean

t̂*(x) = [0.39; 0.19] x.
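
The matrix algebra can be reproduced directly; the MATLAB sketch below (not in the original slides) computes Θ, its inverse, and the posterior mean for x = 1 (the prior mean is zero, so the αν term vanishes):

A=[2 1]; alpha=1; beta=10;
Theta=alpha*eye(2)+beta*(A'*A)     % precision matrix, equal to [41 20; 20 11]
Sigmapost=inv(Theta)               % posterior covariance, matching the matrix above up to rounding
mupost=Sigmapost*(beta*A'*1)       % posterior mean for x=1, matching the vector above up to rounding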

Osvaldo Simeone ML4Engineers 25 / 75


Problem 3.7: Solution

x1=[-2*sqrt(2):0.1:2*sqrt(2)];
x2=[-2*sqrt(2):0.1:2*sqrt(2)];
[X1,X2] = meshgrid(x1,x2);
X = [X1(:) X2(:)];
mup=[0,0];
Sigmap=eye(2);
y = mvnpdf(X,mup,Sigmap);
y = reshape(y,length(x2),length(x1));
contour(x1,x2,y);
hold on;
mupost=[0.39,0.19];
Sigmapost=[0.21,-0.39;-0.39,0.8];
y = mvnpdf(X,mupost,Sigmapost);
y = reshape(y,length(x2),length(x1));
contour(x1,x2,y);
xlabel('$x_1$','Interpreter','latex','FontSize',14);
ylabel('$x_2$','Interpreter','latex','FontSize',14);

The posterior contours are much more concentrated than the prior contours and show a negative correlation between the two entries of t: the observation constrains the combination 2t1 + t2, so a larger t1 must be compensated by a smaller t2.

Osvaldo Simeone ML4Engineers 26 / 75


Problem 3.8

Plot the KL divergence KL(p∥q) between p(t) = Bern(t|0.4) and q(t) = Bern(t|q) as a function of q ∈ [0, 1]. Prepare two plots, one in nats and one in bits.
Repeat for KL(q∥p).

Osvaldo Simeone ML4Engineers 27 / 75


Problem 3.8: Solution

The KL divergence is a measure of the “difference” between two distributions. For Bernoulli rvs, it can be directly computed (in nats) as

KL(p||q) = 0.4 log(0.4/q) + (1 − 0.4) log((1 − 0.4)/(1 − q)).
MATLAB code
q=[0:0.01:1];
plot(q,0.4*log(0.4./q)+(1-0.4)*log((1-0.4)./(1-q)),'r','LineWidth',2); %KL(p||q) in nats
hold on
plot(q,0.4*log2(0.4./q)+(1-0.4)*log2((1-0.4)./(1-q)),'b','LineWidth',2); %KL(p||q) in bits
xlabel('$q$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 28 / 75


Problem 3.8: Solution

We have (in nats)

KL(q||p) = q log(q/0.4) + (1 − q) log((1 − q)/(1 − 0.4)).

MATLAB code
figure
q=[0:0.01:1];
plot(q,q.*log(q/0.4)+(1-q).*log((1-q)/(1-0.4)),'r','LineWidth',2); %KL(q||p) in nats
hold on
plot(q,q.*log2(q/0.4)+(1-q).*log2((1-q)/(1-0.4)),'b','LineWidth',2); %KL(q||p) in bits
xlabel('$q$','Interpreter','latex','FontSize',12);
ylabel('KL$(q||p)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 29 / 75


Problem 3.9

Compute KL(p∥q) between p(t) = Cat(t|[0.4, 0.1, 0.5]) and q(t) = Cat(t|[0.1, 0.5, 0.4]) in nats.

Osvaldo Simeone ML4Engineers 30 / 75


Problem 3.9: Solution

Using the general definition of KL divergence, we have

KL(p||q) = 0.4 log(0.4/0.1) + 0.1 log(0.1/0.5) + 0.5 log(0.5/0.4) = 0.5 nats.
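
A one-line MATLAB check of this value (illustrative only):

p=[0.4 0.1 0.5]; q=[0.1 0.5 0.4];
sum(p.*log(p./q))     % KL(p||q) in nats, about 0.50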

Osvaldo Simeone ML4Engineers 31 / 75


Problem 3.10

Plot the KL divergence between p(t) = N(t| − 1, 1) and q(t) = N(t|µ, 1) as a function of µ (in nats).
Plot the KL divergence between p(t) = N(t| − 1, 1) and q(t) = N(t| − 1, σ²) as a function of σ² (in nats).
Plot the KL divergence KL(q∥p) for the same distributions as at the previous point (in nats).

Osvaldo Simeone ML4Engineers 32 / 75


Problem 3.10: Solution

For Gaussian pdfs, the KL divergence is given as

KL(p∥q) = (1/2)[σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)]
= (1/2)[1/1 + (−1 − µ)²/1 − 1 + log(1/1)]
= (1/2)(1 + µ)².
MATLAB code
mu=[-3:0.01:1];
plot(mu,0.5*(1+mu).^2,'r','LineWidth',2); %KL(p||q) in nats
xlabel('$\mu$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 33 / 75


Problem 3.10: Solution

For Gaussian pdfs, the KL divergence is given as

KL(p∥q) = (1/2)[σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)]
= (1/2)[1/σ² − 1 + log σ²].

MATLAB code
s2=[0.01:0.01:3];
plot(s2,0.5*(1./s2-1+log(s2)),'r','LineWidth',2); %KL(p||q) in nats
ylim([0 2])
xlabel('$\sigma^2$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 34 / 75


Problem 3.10: Solution

For Gaussian pdfs, the KL divergence is given as

KL(q∥p) = (1/2)[σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)], now with σ1² = σ², σ2² = 1 and equal means, so that

KL(q∥p) = (1/2)[σ² − 1 + log(1/σ²)].
MATLAB code
s2=[0.01:0.01:3];
hold on; plot(s2,0.5*(s2-1+log(1./s2)),'b','LineWidth',2); %KL(q||p) in nats
ylim([0 2])
xlabel('$\sigma^2$','Interpreter','latex','FontSize',12);
ylabel('KL$(q||p)$ and KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 35 / 75


Problem 3.11

Consider a soft predictor q(t) = Bern(t|0.6). Recall that a soft predictor assigns score q(t) to each value of t ∈ {0, 1}.
Calculate the log-loss, in nats, for all possible values of t and comment on your result.
With this soft predictor, which value of t has the smallest log-loss?
Interpret your conclusions in terms of information-theoretic surprise.

Osvaldo Simeone ML4Engineers 36 / 75


Problem 3.11: Solution
The log-loss is given as
− log q(t),
which equals

− log q(1) = − log(0.6) = 0.51 nats


− log q(0) = − log(0.4) = 0.91 nats

The log-loss is hence smaller for t = 1, since the soft predictor assigns
a larger probability, or score, namely 0.6, to t = 1. However, the
log-loss values are not too different, since the soft predictor is quite
uncertain between the two values of t.
The soft predictor q(t) is less “surprised” when observing t = 1 since
it guesses that a realization t = 1 will be observed with a 60% chance.
In information-theoretic terms, we can say that observing t = 1 yields
− log2 (0.6) = 0.73 bits of information, while observing t = 0 yields
− log2 (0.4) = 1.3 bits of information.
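
The numbers quoted above can be reproduced directly (a quick illustrative check):

-log(0.6), -log(0.4)       % log-loss in nats for t=1 and t=0
-log2(0.6), -log2(0.4)     % surprise in bits for t=1 and t=0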
Osvaldo Simeone ML4Engineers 37 / 75
Problem 3.12

Consider a rv with distribution p(t) = Bern(t|0.4), and a soft predictor q(t) = Bern(t|0.6).
Calculate the population log-loss, also known as the cross entropy between p(t) and q(t), in nats.
Repeat with the soft predictor q(t) = Bern(t|0.4), and compare with the result obtained at the point above.

Osvaldo Simeone ML4Engineers 38 / 75


Problem 3.12: Solution
The population log-loss, also known as cross entropy between p(t)
and q(t), is given as

Lp (q(·)) = Et∼p(t) [− log q(t)] = H(p, q).

It measures the average log-loss when the true (population) distribution of rv t is p(t).
We can directly compute it for this example as

Lp (Bern(·|0.6)) = H(Bern(t|0.4)||Bern(t|0.6))
= p(0) · (− log q(0)) + p(1) · (− log q(1))
= 0.6 · (− log(0.4)) + 0.4 · (− log(0.6)) = 0.75.

We expect this population loss to be relatively large as compared to other soft predictors, since the soft predictor q(t) gives the maximum score to t = 1, while the true probability p(t = 1) = 0.4 is smaller than p(t = 0) = 0.6.
Osvaldo Simeone ML4Engineers 39 / 75
Problem 3.12: Solution
With the soft predictor q(t) =Bern(t|0.4), we instead have

Lp (Bern(·|0.4)) = H(Bern(t|0.4)||Bern(t|0.4))
= p(0) · (− log q(0)) + p(1) · (− log q(1))
= 0.4 · (− log(0.4)) + 0.6 · (− log(0.6)) = 0.67.

This is smaller than the cross entropy obtained with q(t) = Bern(t|0.6). This is intuitively reasonable, since this soft predictor q(t) gives the maximum score to t = 0, and the true probability p(t = 0) = 0.6 is indeed larger than p(t = 1) = 0.4.
In fact, we know that the soft predictor q(t) = p(t) minimizes the
cross entropy, and the minimum cross entropy is known as the
entropy of rv t ∼ p(t)

H(Bern(t|0.4)||Bern(t|0.4)) = H(Bern(t|0.4)) = H(t) = 0.67.
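
Both cross entropies can be checked numerically with the sketch below (illustrative MATLAB, variable names are assumptions):

p=[0.6 0.4];                  % true pmf [p(t=0) p(t=1)]
Hcross=@(q) -sum(p.*log(q));  % cross entropy H(p||q) in nats
Hcross([0.4 0.6])             % soft predictor Bern(0.6), about 0.75
Hcross([0.6 0.4])             % soft predictor Bern(0.4), equal to the entropy, about 0.67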

Osvaldo Simeone ML4Engineers 40 / 75


Problem 3.13

Consider the distribution p(t) = Bern(t|0.4) and the soft predictor q(t) = Bern(t|0.6).
Compute the KL divergence KL(Bern(0.4)||Bern(0.6)) in two
different ways:
▶ through direct calculation by using the definition of KL divergence;
▶ using the relationship of the KL divergence with cross entropy and
entropy.
Recall that the latter relationship views the KL divergence as a
measure of regret for not having used the correct predictor.

Osvaldo Simeone ML4Engineers 41 / 75


Problem 3.13: Solution

Using the definition of KL divergence:

KL(p||q) = 0.4 log(0.4/0.6) + 0.6 log(0.6/0.4) = 0.08.

Using the relationship with the cross entropy H(p||q) and the entropy H(p):

KL(p||q) = H(p||q) − H(p)
= [−0.4 log(0.6) − 0.6 log(0.4)] − [−0.4 log(0.4) − 0.6 log(0.6)]
= 0.75 − 0.67 = 0.08.
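
A quick numerical confirmation of both routes (illustrative MATLAB):

p=[0.6 0.4]; q=[0.4 0.6];                   % Bern(0.4) and Bern(0.6) as pmfs over t=0,1
KLdirect=sum(p.*log(p./q))                  % direct definition, about 0.08
KLviaH=-sum(p.*log(q))-(-sum(p.*log(p)))    % cross entropy minus entropy, same value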

Osvaldo Simeone ML4Engineers 42 / 75


Problem 3.14

Consider a soft predictor q(t) = N(t| − 1, 3).
Plot the log-loss, in nats, across a suitable range of values of t and comment on your result.
With this soft predictor, which value of t has the smallest log-loss?
Interpret your conclusions in terms of information-theoretic surprise.

Osvaldo Simeone ML4Engineers 43 / 75


Problem 3.14: Solution

The log-loss is given as

− log q(t) = − log N (t| − 1, 3).

MATLAB code:
t=[-10:0.01:8];
plot(t,-log(normpdf(t,-1,sqrt(3))),'r','LineWidth',2); %log-loss in nats
xlabel('$t$','Interpreter','latex','FontSize',12);
ylabel('$-\log q(t)$','Interpreter','latex','FontSize',12);

The log-loss is smallest at t = −1, since the soft predictor assigns the largest probability, or score, to t = −1.
The soft predictor q(t) is less “surprised” when observing t = −1, since it guesses that realizations around t = −1 will be observed with the largest probability.

Osvaldo Simeone ML4Engineers 44 / 75


Problem 3.15

Consider a rv with distribution p(t) = N(t| − 2, 1), and a soft predictor q(t) = N(t|1, 3).
Calculate the population log-loss, also known as the cross entropy between p(t) and q(t), in nats.
Repeat with the soft predictor q(t) = N(t| − 2, 1), and compare with the result obtained at the point above.

Osvaldo Simeone ML4Engineers 45 / 75


Problem 3.15: Solution

The population log-loss, also known as the cross entropy between p(t) and q(t), is given as

Lp(q(·)) = E_t∼p(t)[− log q(t)]
= H(N(t| − 2, 1)||N(t|1, 3))
= E_t∼N(t|−2,1)[− log N(t|1, 3)].

It measures the average log-loss when the true (population) distribution of rv t is p(t).
To compute it, we use the relationship KL(p||q) = H(p||q) − H(p), or H(p||q) = KL(p||q) + H(p).

Osvaldo Simeone ML4Engineers 46 / 75


Problem 3.15: Solution

We have the KL divergence

KL(N(t| − 2, 1)||N(t|1, 3)) = (1/2)[1/3 + (−2 − 1)²/3 − 1 + log(3/1)] = 1.71

and the (differential) entropy for the Gaussian N(t|µ = −2, σ² = 1) as

H(N(t| − 2, 1)) = (1/2) log(2πeσ²) = (1/2) log(2πe) = 1.41.

Therefore, we can compute the cross entropy

H(N(t| − 2, 1)||N(t|1, 3)) = 1.71 + 1.41 = 3.12.
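
This value can also be estimated by Monte Carlo sampling; the MATLAB sketch below (an illustrative check, not in the original) draws samples from p(t) and averages the log-loss under q(t):

t=-2+randn(1e6,1);                     % samples from p(t)=N(-2,1)
mean(-log(normpdf(t,1,sqrt(3))))       % estimate of the cross entropy, close to the value above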

Osvaldo Simeone ML4Engineers 47 / 75


Problem 3.15: Solution

With the soft predictor q(t) = p(t) = N(t| − 2, 1), the cross entropy equals the entropy

H(N(t| − 2, 1)) = (1/2) log(2πeσ²) = (1/2) log(2πe) = 1.41.

This is the optimal soft predictor, and indeed its cross entropy is smaller than that obtained with any other soft predictor, such as q(t) = N(t|1, 3). The difference between the two is given by KL(N(t| − 2, 1)||N(t|1, 3)).

Osvaldo Simeone ML4Engineers 48 / 75


Problem 3.16

Consider a soft predictor q(t) = N(t|µ, 1) for some value of µ. The population distribution of the rv to be predicted is p(t) = N(t| − 1, 1).
Plot the cross entropy H(p||q), the entropy H(p), and the KL divergence KL(p||q) in the same figure and discuss their relationship.

Osvaldo Simeone ML4Engineers 49 / 75


Problem 3.16: Solution

We need to plot

KL(N(t| − 1, 1)||N(t|µ, 1)) = (1/2)(1 + µ)²,

H(N(t| − 1, 1)) = (1/2) log(2πe) = 1.41,

and

H(N(t| − 1, 1)||N(t|µ, 1)) = 1.41 + (1/2)(1 + µ)².

Osvaldo Simeone ML4Engineers 50 / 75


Problem 3.16: Solution

MATLAB code
mu=[-3:0.01:1];
plot(mu,0.5*(1+mu).^2,'r','LineWidth',2); %KL(p||q) in nats
hold on; plot(mu,1.41*ones(size(mu)),'k--','LineWidth',2); %entropy H(p)
plot(mu,1.41+0.5*(1+mu).^2,'b','LineWidth',2); %cross entropy H(p||q)
xlabel('$\mu$','Interpreter','latex','FontSize',12);

Note that the minimum of both the cross entropy and the KL divergence is obtained when p(t) = q(t), i.e., when µ = −1.

Osvaldo Simeone ML4Engineers 51 / 75


Problem 3.17

For the joint distribution p(x, t) below, compute the population log-loss obtained by the soft predictor q(t = 1|x = 0) = 0.4 and q(t = 1|x = 1) = 0.6 in two ways:
▶ using directly the definition of the population log-loss;
▶ using the expression in terms of the cross entropy.
Derive the optimal soft predictor and the corresponding population log-loss.
What are the conditional entropy H(t|x) and the entropy H(t)?

x\t 0 1
0 0.3 0.4
1 0.1 0.2

Osvaldo Simeone ML4Engineers 52 / 75


Problem 3.17: Solution

Using its definition, the population log-loss is obtained as

Lp(q(·|·)) = E_(x,t)∼p(x,t)[− log q(t|x)]
= p(0, 0) · (− log q(0|0)) + p(0, 1) · (− log q(1|0)) + p(1, 0) · (− log q(0|1)) + p(1, 1) · (− log q(1|1))
= 0.3 · (− log(0.6)) + 0.4 · (− log(0.4)) + 0.1 · (− log(0.4)) + 0.2 · (− log(0.6)) = 0.71.
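
The same number is obtained with a few lines of MATLAB (an illustrative check, not part of the original slides):

P=[0.3 0.4; 0.1 0.2];      % joint pmf p(x,t)
Q=[0.6 0.4; 0.4 0.6];      % soft predictor q(t|x), rows indexed by x
sum(sum(-P.*log(Q)))       % population log-loss, about 0.71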

Osvaldo Simeone ML4Engineers 53 / 75


Problem 3.17: Solution

The population log-loss can also be expressed as the average cross entropy

Lp(q(·|·)) = E_x∼p(x)[H(p(t|x)||q(t|x))], where H(p(t|x)||q(t|x)) = E_t∼p(t|x)[− log q(t|x)] is the cross entropy.

To evaluate this expression, we first obtain the conditional distribution p(t|x) as

x\t 0 1
0 0.42 0.58
1 0.33 0.67

Osvaldo Simeone ML4Engineers 54 / 75


Problem 3.17: Solution

Now, we can compute the cross entropies

H(p(t|x = 0)||q(t|x = 0)) = −0.42 log(0.6) − 0.58 log(0.4) = 0.74
H(p(t|x = 1)||q(t|x = 1)) = −0.33 log(0.4) − 0.67 log(0.6) = 0.64

and then the average

Lp(q(·|·)) = 0.7 · H(p(t|x = 0)||q(t|x = 0)) + 0.3 · H(p(t|x = 1)||q(t|x = 1)) = 0.7 · 0.74 + 0.3 · 0.64 = 0.71.

Osvaldo Simeone ML4Engineers 55 / 75


Problem 3.17: Solution
The optimal soft predictor is the conditional distribution p(t|x)

x\t 0 1
0 0.42 0.58
1 0.33 0.67

Its population cross-entropy loss, aka population log-loss, is

Lp(p(·|·)) = E_(x,t)∼p(x,t)[− log p(t|x)]
= p(0, 0) · (− log p(0|0)) + p(0, 1) · (− log p(1|0)) + p(1, 0) · (− log p(0|1)) + p(1, 1) · (− log p(1|1))
= 0.3 · (− log(0.42)) + 0.4 · (− log(0.58)) + 0.1 · (− log(0.33)) + 0.2 · (− log(0.67)) = 0.66.

This is also known as the conditional entropy H(t|x).


Osvaldo Simeone ML4Engineers 56 / 75
Problem 3.17: Solution

We have computed above the conditional entropy, which corresponds to the minimum population log-loss (aka population cross-entropy loss): H(t|x) = 0.66.
The entropy is given as

H(t) = E_t∼p(t)[− log p(t)] = 0.4 · (− log(0.4)) + 0.6 · (− log(0.6)) = 0.67.

Note that the conditional entropy must always be no larger than the
entropy: the conditional entropy corresponds to the population
log-loss when x is observed, while the entropy is the population
log-loss when no correlated observation is available.

Osvaldo Simeone ML4Engineers 57 / 75


Problem 3.18

Compute the conditional entropy H(t|x) and the entropy H(t) for the
joint distribution

x\t 0 1
0 0.16 0.24
1 0.24 0.36

Osvaldo Simeone ML4Engineers 58 / 75


Problem 3.18: Solution

The rvs x and t are independent with

p(t = 1|x = 0) = p(t = 1) = 0.6.

Therefore, we have

H(t|x) = E(x,t)∼p(x,t) [− log p(t|x)]


= Et∼p(t) [− log p(t)]
= H(t)
= 0.6(− log(0.6)) + 0.4(− log(0.4)) = 0.67.

Osvaldo Simeone ML4Engineers 59 / 75


Problem 3.19
For the joint distribution p(x, t) in the table below, compute the
mutual information I(x; t).

x\t 0 1
0 0 0.5
1 0.5 0

For the joint distribution p(x, t) in the table below, compute the
mutual information I(x; t).

x\t 0 1
0 0.16 0.24
1 0.24 0.36

Osvaldo Simeone ML4Engineers 60 / 75


Problem 3.19: Solution

The mutual information measures by how much we can decrease the population log-loss when we have access to x. It is a measure of statistical dependence between x and t.
It is defined as

I(x; t) = H(t) − H(t|x)

or equivalently as

I(x; t) = KL(p(x, t)∥p(x)p(t)).

Osvaldo Simeone ML4Engineers 61 / 75


Problem 3.19: Solution
For this example, it is easier to compute it using the first formula. In
fact, we have

H(t|x = 0) = E_t∼p(t|x=0)[− log p(t|x = 0)] = 0 · (− log(0)) + 1 · (− log(1)) = 0 (using the convention 0 · log 0 = 0)

and similarly H(t|x = 1) = 0. Hence,

H(t|x) = p(x = 0)H(t|x = 0) + p(x = 1)H(t|x = 1) = 0.

This is because knowing x perfectly determines t, and hence the residual uncertainty on t given x is zero.
We then have

I(x; t) = H(t) = − log(0.5) = 0.69.

Osvaldo Simeone ML4Engineers 62 / 75


Problem 3.19: Solution

As we have seen, with the second joint pmf, the two rvs are
independent, which implies the equality p(x, t) = p(x)p(t), and

I(x; t) = KL(p(x, t)∥p(x)p(t)) = 0.

This is because knowing x brings no information about t, and hence we have H(t|x) = H(t): the uncertainty about t is the same whether or not x is known.
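
Both cases can be verified with a small MATLAB function for the mutual information of a discrete joint pmf (an illustrative sketch; the 0 · log 0 terms in the joint pmf are handled by dropping zero entries):

mi=@(P) sum(P(P>0).*log(P(P>0))) ...
       -sum(sum(P,2).*log(sum(P,2))) ...
       -sum(sum(P,1).*log(sum(P,1)));   % I(x;t)=H(x)+H(t)-H(x,t)
mi([0 0.5; 0.5 0])                      % about 0.69 nats
mi([0.16 0.24; 0.24 0.36])              % 0, since the rvs are independent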

Osvaldo Simeone ML4Engineers 63 / 75


Problem 3.20

Consider the population distribution p(x, t) = p(t)p(x|t) with


▶ t ∼ N (0, 1), and
▶ (x|t = t) ∼ N (t, 1), i.e., x = t + z, with independent noise
z ∼ N (0, 1).
For the soft predictor q(t|x) = N (t| − x, 0.5), compute the
population log-loss.
What is the optimal soft predictor?
What is the corresponding minimum population log-loss?

Osvaldo Simeone ML4Engineers 64 / 75


Problem 3.20: Solution

First, we compute the conditional distribution

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | x/2, 1/2).

Therefore, for every value x, we have

KL(p(t|x)||q(t|x)) = KL(N(0.5x, 0.5)||N(−x, 0.5))
= (0.5x − (−x))²/(2 · 0.5)
= (9/4)x².

Osvaldo Simeone ML4Engineers 65 / 75


Problem 3.20: Solution

We now need to compute E_x∼p(x)[KL(p(t|x)||q(t|x))].
Since x = t + z, with independent rvs t ∼ N(0, 1) and z ∼ N(0, 1), the marginal is p(x) = N(x|0, 1 + 1) = N(x|0, 2).
We can then write

E_x∼p(x)[KL(p(t|x)||q(t|x))] = (9/4) E_x∼N(0,2)[x²] = (9/4) · 2 = 9/2.

Osvaldo Simeone ML4Engineers 66 / 75


Problem 3.20: Solution

Recall now that we have the equality

Lp (q(·|·)) = Ex∼p(x) [KL(p(t|x)||q(t|x))] + H(t|x),

where the conditional entropy is given as

H(t|x) = H(N(t|0.5x, 0.5)) = (1/2) log(2πe · 0.5) = 1.07.

Therefore, we conclude that the population log-loss is given as

Lp(q(·|·)) = 4.5 + 1.07 = 5.57.
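
As a sanity check (illustrative MATLAB, not part of the original), the population log-loss of the mismatched predictor can be estimated by sampling from the joint distribution:

t=randn(1e6,1); x=t+randn(1e6,1);      % samples from p(t)p(x|t)
mean(-log(normpdf(t,-x,sqrt(0.5))))    % population log-loss of q(t|x)=N(-x,0.5), about 5.57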

Osvaldo Simeone ML4Engineers 67 / 75


Problem 3.20: Solution

The optimal soft predictor is given by the conditional distribution

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | x/2, 1/2).

The corresponding minimum population log-loss is the conditional entropy

H(t|x) = H(N(t|0.5x, 0.5)) = (1/2) log(2πe · 0.5) = 1.07.

Osvaldo Simeone ML4Engineers 68 / 75


Problem 3.21

We are given the population distribution p(x, t) = p(t)p(x|t) with


▶ t ∼ N (0, 1), and
▶ (x|t = t) ∼ N (t, β −1 ), i.e., x = t + z, with independent noise
z ∼ N (0, β −1 )
Plot the mutual information I(x; t) as a function of β and comment
on your results.

Osvaldo Simeone ML4Engineers 69 / 75


Problem 3.21: Solution

We have

I(x; t) = (1/2) log(1 + β/α) = (1/2) log(1 + β).
MATLAB code:
beta=[0.01:0.01:10];
plot(beta,0.5*log(1+beta),'r','LineWidth',2); %mutual information in nats
xlabel('$\beta$','Interpreter','latex','FontSize',12);
ylabel('I$(\mathrm{x};\mathrm{t})$','Interpreter','latex','FontSize',12);

As the precision β of the observation increases, the mutual information increases accordingly.

Osvaldo Simeone ML4Engineers 70 / 75


Problem 3.22

Prove that the entropy H(x) of a random vector x = (x1, ..., xM) with independent entries is given by the sum H(x) = Σ_{m=1}^M H(xm).
Use this result to compute the (differential) entropy of the vector x ∼ N(0_M, I_M).

Osvaldo Simeone ML4Engineers 71 / 75


Problem 3.22: Solution

We have

H(x) = E_{x∼∏ p(xm)}[− log(∏_{m=1}^M p(xm))]
= E_{x∼∏ p(xm)}[− Σ_{m=1}^M log p(xm)]
= Σ_{m=1}^M E_{xm∼p(xm)}[− log p(xm)]
= Σ_{m=1}^M H(xm).

Osvaldo Simeone ML4Engineers 72 / 75


Problem 3.22: Solution

Therefore, we have

H(N(0_M, I_M)) = M · H(N(0, 1)) = (M/2) log(2πe).

Osvaldo Simeone ML4Engineers 73 / 75


Problem 3.23

Prove that the optimal constant predictor of a rv t ∼ N(µ, σ²) under the ℓ2 loss is the mean µ.

Osvaldo Simeone ML4Engineers 74 / 75


Problem 3.23: Solution

Compute the derivative of the population ℓ2 loss with respect to the constant prediction t̂ and set it equal to zero (see Chapter 6): d/dt̂ E_t∼N(µ,σ²)[(t − t̂)²] = −2 E[t − t̂] = 2(t̂ − µ) = 0, which yields t̂* = µ.

Osvaldo Simeone ML4Engineers 75 / 75
