Machine Learning for Engineers:
Chapter 3. Inference, or Model-Based Prediction - Problems

Osvaldo Simeone

King’s College London

October 13, 2022

Osvaldo Simeone ML4Engineers 1 / 75


Problem 3.1
For the joint distribution p(x, t) in the table below, compute the
optimal soft predictor, the optimal hard predictor for the
detection-error loss, and the corresponding minimum population
detection-error loss.

x\t 0 1
0 0.4 0.3
1 0.1 0.2

How does the optimal hard predictor change when we adopt the loss
function illustrated by the table below?

t\t̂ 0 1
0 0 1
1 0.1 0

Osvaldo Simeone ML4Engineers 2 / 75


Problem 3.1: Solution

The optimal soft predictor is given by the posterior distribution p(t|x) in the table below:

x\t 0 1
0 0.4/(0.3 + 0.4) = 0.57 0.3/(0.3 + 0.4) = 0.43
1 0.1/0.3 = 0.33 0.2/0.3 = 0.67

The optimal hard predictor under the detection-error loss is given by the MAP predictor

t̂*(x) = arg max_t p(t|x),

which yields the predictions t̂*(0) = 0 and t̂*(1) = 1.

Osvaldo Simeone ML4Engineers 3 / 75


Problem 3.1: Solution
The corresponding minimum population detection-error loss is given by

Lp(t̂*(·)) = E_(x,t)∼p(x,t)[1(t ≠ t̂*(x))]
= 0.4 · 1(0 ≠ t̂*(0)) + 0.3 · 1(1 ≠ t̂*(0)) + 0.1 · 1(0 ≠ t̂*(1)) + 0.2 · 1(1 ≠ t̂*(1))
= 0.4 · 1(1 = t̂*(0)) + 0.3 · 1(0 = t̂*(0)) + 0.1 · 1(1 = t̂*(1)) + 0.2 · 1(0 = t̂*(1))
= 0.3 + 0.1 = 0.4.

We can alternatively use the fact that the minimum population detection-error loss is the minimum probability of error

Lp(t̂*(·)) = 1 − E_x∼p(x)[max_t p(t|x)]
= 1 − p(x = 0) max_t p(t|x = 0) − p(x = 1) max_t p(t|x = 1)
= 1 − 0.7 · 0.57 − 0.3 · 0.67 = 0.4.
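
These calculations can be checked numerically. The following MATLAB sketch (not part of the original slides; variable names are illustrative) recomputes the posterior table and the minimum detection-error loss directly from the joint pmf:

P=[0.4 0.3; 0.1 0.2];              % joint pmf p(x,t): rows indexed by x, columns by t
px=sum(P,2);                       % marginal p(x)
post=P./px                         % posterior p(t|x), one row per value of x
Lmin=1-sum(px.*max(post,[],2))     % minimum detection-error loss, equal to 0.4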
Osvaldo Simeone ML4Engineers 4 / 75
Problem 3.1: Solution

Consider now the modified loss function ℓ(t, t̂) given in the second table of the problem. For any hard predictor, the corresponding population loss is

Lp(t̂(·)) = E_(x,t)∼p(x,t)[ℓ(t, t̂(x))]
= 0.4 · 1(0 ≠ t̂(0)) + 0.1 · 0.3 · 1(1 ≠ t̂(0)) + 0.1 · 1(0 ≠ t̂(1)) + 0.1 · 0.2 · 1(1 ≠ t̂(1)).

It follows that the optimal predictor is given as t̂*(0) = 0 and t̂*(1) = 0, and the resulting minimum population loss is

Lp(t̂*(·)) = 0.1 · 0.3 + 0.1 · 0.2 = 0.05.
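
A brute-force check (an illustrative sketch, not from the original slides) can evaluate the population loss of all four deterministic predictors (t̂(0), t̂(1)) and confirm that (0, 0) attains the minimum of 0.05:

P=[0.4 0.3; 0.1 0.2];                       % joint pmf p(x,t)
L=[0 1; 0.1 0];                             % loss table ell(t,that): rows t, columns that
best=inf;
for t0=0:1                                  % candidate prediction for x=0
  for t1=0:1                                % candidate prediction for x=1
    that=[t0 t1];
    Lp=0;
    for x=0:1
      for t=0:1
        Lp=Lp+P(x+1,t+1)*L(t+1,that(x+1)+1);   % accumulate population loss
      end
    end
    if Lp<best, best=Lp; bestpred=that; end
  end
end
bestpred, best                              % returns [0 0] and 0.05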

Osvaldo Simeone ML4Engineers 5 / 75


Problem 3.1: Solution

Equivalently, we can use the formula

t̂*(x) = arg min_t̂ E_t∼p(t|x)[ℓ(t, t̂)]

as

t̂*(0) = arg min_t̂ [0.57 · 1(0 ≠ t̂) + 0.43 · 0.1 · 1(1 ≠ t̂)] = 0.

And similarly for t̂*(1).

Osvaldo Simeone ML4Engineers 6 / 75


Problem 3.2

For the joint distribution p(x, t) in the table below, compute the
optimal (Bayesian) soft predictor, the optimal hard predictor for the
detection-error loss, and the corresponding minimum population
detection-error loss.

x\t 0 1
0 0.16 0.24
1 0.24 0.36

Osvaldo Simeone ML4Engineers 7 / 75


Problem 3.2: Solution

It can be seen that the two variables are independent, since the joint
distribution is equal to the product of marginals defined by the
probabilities p(x = 1) = 0.6 and p(t = 1) = 0.6.
Therefore, using x to predict t is not useful, since the posterior distribution equals the marginal; that is, the optimal soft predictor is

p(t|x) = p(x, t)/p(x) = p(x)p(t)/p(x) = p(t).

It also follows that the optimal hard predictor is a constant t̂* that does not depend on the value of x, i.e.,

t̂* = arg min_t̂ Lp(t̂), with Lp(t̂) = E_t∼p(t)[ℓ(t, t̂)].

Osvaldo Simeone ML4Engineers 8 / 75


Problem 3.2: Solution

Let us compute the population loss for any predictor t̂ as

Lp(t̂) = E_t∼p(t)[ℓ(t, t̂)] = 0.4 · 1(0 ≠ t̂) + 0.6 · 1(1 ≠ t̂).

Therefore, the optimal hard predictor is t̂* = 1, and the minimum population loss is Lp(t̂*) = 0.4.

Osvaldo Simeone ML4Engineers 9 / 75


Problem 3.3

Given the posterior distribution p(t|x) in the table below, obtain the
optimal point predictors under ℓ2 and detection-error losses, and
evaluate the corresponding minimum population losses when
p(x = 1) = 0.5.

x\t 0 1
0 0.9 0.1
1 0.2 0.8

Osvaldo Simeone ML4Engineers 10 / 75


Problem 3.3: Solution
We know that under the ℓ2 loss, the optimal hard predictor is given
by the mean of the posterior distribution

t̂ ∗ (x) = Et∼p(t|x) [t].

Therefore, we have

t̂ ∗ (0) = Et∼p(t|0) [t] = 0.9 · 0 + 0.1 · 1 = 0.1,

and
t̂ ∗ (1) = Et∼p(t|1) [t] = 0.2 · 0 + 0.8 · 1 = 0.8.
Furthermore, the minimum population loss is the average posterior
variance

Lp (t̂ ∗ (·)) = Ex∼p(x) [Var(t|x)].

Osvaldo Simeone ML4Engineers 11 / 75


Problem 3.3: Solution

To compute it, let us first evaluate the two conditional variances

Var(t|x = 0) = p(t = 1|x = 0)(1 − p(t = 1|x = 0)) = 0.1 · 0.9 = 0.09

and

Var(t|x = 1) = p(t = 1|x = 1)(1 − p(t = 1|x = 1)) = 0.8 · 0.2 = 0.16.

We finally obtain

Lp (t̂ ∗ (·)) = Ex∼p(x) [Var(t|x)]


= 0.5 · 0.09 + 0.5 · 0.16 = 0.125.

Osvaldo Simeone ML4Engineers 12 / 75


Problem 3.3: Solution
We know that under the detection-error loss, the optimal hard
predictor is given by the MAP predictor

t̂*(x) = arg max_t p(t|x).

Therefore, we have
t̂ ∗ (0) = 0,
and
t̂ ∗ (1) = 1.
Furthermore, the minimum population loss is the minimum probability
of error
Lp(t̂*(·)) = 1 − E_x∼p(x)[max_t p(t|x)] = 1 − (0.5 · 0.9 + 0.5 · 0.8) = 0.15.
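
Both minimum losses can be verified numerically; the MATLAB sketch below (illustrative, not part of the original) evaluates the average posterior variance and the minimum probability of error:

post=[0.9 0.1; 0.2 0.8];                    % posterior p(t|x), rows indexed by x
px=[0.5; 0.5];                              % marginal p(x)
mmse=sum(px.*(post(:,2).*(1-post(:,2))))    % average posterior variance, equal to 0.125
perr=1-sum(px.*max(post,[],2))              % minimum probability of error, equal to 0.15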

Osvaldo Simeone ML4Engineers 13 / 75


Problem 3.4

If the posterior distribution is given as

(t|x = x) ∼ N (sin(2πx), 0.1),

what is the optimal hard predictor under the ℓ2 loss? What is the
MAP predictor?

Osvaldo Simeone ML4Engineers 14 / 75


Problem 3.4: Solution

The optimal hard predictor under the ℓ2 loss is given by the mean of
the posterior distribution

t̂ ∗ (x) = Et∼p(t|x) [t] = sin(2πx).

The MAP predictor is

t̂*(x) = arg max_t p(t|x) = sin(2πx).

Osvaldo Simeone ML4Engineers 15 / 75


Problem 3.5

Consider the posterior distribution

p(t|x) = 0.5N (t|3, 1) + 0.5N (t| − 3, 1).

Plot this posterior distribution.


What is the optimal hard predictor under the quadratic, or ℓ2 , loss?
What is the MAP hard predictor?

Osvaldo Simeone ML4Engineers 16 / 75


Problem 3.5: Solution

MATLAB code:
taxis=[-6:0.01:6];
plot(taxis,0.5*normpdf(taxis,3,1)+0.5*normpdf(taxis,-3,1),'LineWidth',2);
xlabel('$t$','Interpreter','latex','FontSize',12);
ylabel('$p(t|x)$','Interpreter','latex','FontSize',12);

The optimal hard predictor under the ℓ2 loss is given by the mean of the
posterior distribution

t̂ ∗ (x) = Et∼p(t|x) [t] = 0.5 · 3 + 0.5 · (−3) = 0.

The MAP predictor is

t̂*(x) = arg max_t p(t|x).

In this case, there are two MAP predictors, namely t̂*(x) = −3 and t̂*(x) = 3.
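
To see why the posterior mean is a poor summary of this bimodal posterior, one can compare the density at the mean t = 0 with the density at the modes t = ±3 (an additional illustrative check, reusing the normpdf calls above):

q=@(t) 0.5*normpdf(t,3,1)+0.5*normpdf(t,-3,1);   % mixture posterior p(t|x)
q(0)      % density at the posterior mean, about 0.004
q(3)      % density at each mode, about 0.20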

Osvaldo Simeone ML4Engineers 17 / 75


Problem 3.6

Consider the joint distribution p(x, t) = p(t)p(x|t) defined as


▶ prior: t ∼ N (2, 1)
▶ likelihood: (x|t = t) ∼ N (t, 0.1)
Derive the optimal soft predictor.
Derive the optimal hard predictor under the ℓ2 loss.
Plot prior and optimal soft predictor for x = 1.
Repeat the previous points for prior t ∼ N (2, 0.01) and comment on
the result.

Osvaldo Simeone ML4Engineers 18 / 75


Problem 3.6: Solution

In the problem at hand, we have


▶ prior: t ∼ N(ν = 2, α⁻¹ = 1)
▶ likelihood: (x|t = t) ∼ N(t, β⁻¹ = 0.1).
Using the formula in the notes, we have the optimal soft predictor

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | (2 + 10x)/11, 1/11).

Therefore, the optimal hard predictor under the ℓ2 loss is the conditional mean t̂*(x) = (2 + 10x)/11.
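
The posterior parameters can also be wrapped in a small helper; the MATLAB sketch below (names are illustrative, not from the original slides) returns the posterior mean and variance for given prior and likelihood precisions:

postmean=@(x,nu,alpha,beta) (alpha*nu+beta*x)./(alpha+beta);   % posterior mean
postvar=@(alpha,beta) 1./(alpha+beta);                         % posterior variance
postmean(1,2,1,10)    % for x=1: (2+10)/11, about 1.09
postvar(1,10)         % 1/11, about 0.09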

Osvaldo Simeone ML4Engineers 19 / 75


Problem 3.6: Solution

taxis=[-1:0.01:5];
plot(taxis,normpdf(taxis,2,1),'r','LineWidth',2); % prior N(2,1)
hold on; plot(taxis,normpdf(taxis,12/11,sqrt(1/11)),'b','LineWidth',2); % posterior for x=1
xlabel('$t$','Interpreter','latex','FontSize',12);
Osvaldo Simeone ML4Engineers 20 / 75


Problem 3.6: Solution

In the problem at hand, we have


▶ prior: t ∼ N(ν = 2, α⁻¹ = 0.01)
▶ likelihood: (x|t = t) ∼ N(t, β⁻¹ = 0.1).
Using the formula in the notes, we have the optimal soft predictor

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | (200 + 10x)/110, 1/110).

Therefore, the optimal hard predictor under the ℓ2 loss is the conditional mean t̂*(x) = (200 + 10x)/110.

Osvaldo Simeone ML4Engineers 21 / 75


Problem 3.6: Solution

taxis=[-1:0.01:5];
plot(taxis,normpdf(taxis,2,sqrt(0.01)),'r','LineWidth',2); % prior N(2,0.01): standard deviation sqrt(0.01)=0.1
hold on; plot(taxis,normpdf(taxis,210/110,sqrt(1/110)),'b','LineWidth',2); % posterior for x=1
xlabel('$t$','Interpreter','latex','FontSize',12);

With the much more concentrated prior N(2, 0.01), the posterior mean for x = 1 stays close to the prior mean (210/110 ≈ 1.91), whereas with the prior N(2, 1) it was pulled much closer to the observation (12/11 ≈ 1.09): a more confident prior makes the observation less influential.

Osvaldo Simeone ML4Engineers 22 / 75


Problem 3.7

Consider the joint distribution p(x, t) = p(t)p(x|t) defined as


▶ prior: t ∼ N(0_2, I_2)
▶ likelihood defined as x = [2 1]t + z, with noise z ∼ N(0, 0.1)
Derive the optimal soft predictor.
Derive the optimal hard predictor under the ℓ2 loss.
Plot contour lines of prior and optimal soft predictor for x = 1 by
using MATLAB.
Comment on the result.

Osvaldo Simeone ML4Engineers 23 / 75


Problem 3.7: Solution
In the problem at hand, we have
▶ prior: t ∼ N(ν = 0_2, α⁻¹I_2 = I_2)
▶ likelihood: (x|t = t) ∼ N(At, β⁻¹) with A = [2 1] and β = 10.
Using the formula in the notes, we have the optimal soft predictor

p(t|x) = N(Θ⁻¹(αν + βAᵀx), Θ⁻¹) = N(Θ⁻¹([0; 0] + [20; 10]x), Θ⁻¹),

where

Θ = αI_2 + βAᵀA = I_2 + 10 [2; 1][2 1] = [1 0; 0 1] + 10 [4 2; 2 1] = [41 20; 20 11].

Osvaldo Simeone ML4Engineers 24 / 75


Problem 3.7: Solution

Using MATLAB to compute the inverse, we have the optimal soft predictor

p(t|x) = N([0.39; 0.19] x, [0.21 −0.39; −0.39 0.80]).

Therefore, the optimal hard predictor under the ℓ2 loss is the conditional mean

t̂*(x) = [0.39; 0.19] x.
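
The matrix algebra can be reproduced directly; the MATLAB sketch below (not in the original slides) computes Θ, its inverse, and the posterior mean for x = 1 (the prior mean is zero, so the αν term vanishes):

A=[2 1]; alpha=1; beta=10;
Theta=alpha*eye(2)+beta*(A'*A)     % precision matrix, equal to [41 20; 20 11]
Sigmapost=inv(Theta)               % posterior covariance, matching the matrix above up to rounding
mupost=Sigmapost*(beta*A'*1)       % posterior mean for x=1, matching the vector above up to rounding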

Osvaldo Simeone ML4Engineers 25 / 75


Problem 3.7: Solution

x1=[-2*sqrt(2):0.1:2*sqrt(2)];
x2=[-2*sqrt(2):0.1:2*sqrt(2)];
[X1,X2] = meshgrid(x1,x2);
X = [X1(:) X2(:)];
mup=[0,0];
Sigmap=eye(2);
y = mvnpdf(X,mup,Sigmap);
y = reshape(y,length(x2),length(x1));
contour(x1,x2,y);
hold on;
mupost=[0.39,0.19];
Sigmapost=[0.21,-0.39;-0.39,0.8];
y = mvnpdf(X,mupost,Sigmapost);
y = reshape(y,length(x2),length(x1));
contour(x1,x2,y);
xlabel('$x_1$','Interpreter','latex','FontSize',14);
ylabel('$x_2$','Interpreter','latex','FontSize',14);

The posterior contours are much more concentrated than the prior contours and show a negative correlation between the two entries of t: the observation constrains the combination 2t1 + t2, so a larger t1 must be compensated by a smaller t2.

Osvaldo Simeone ML4Engineers 26 / 75


Problem 3.8

Plot the KL divergence KL(p∥q) between p(t) = Bern(t|0.4) and q(t) = Bern(t|q) as a function of q ∈ [0, 1]. Prepare two plots, one in nats and one in bits.
Repeat for KL(q∥p).

Osvaldo Simeone ML4Engineers 27 / 75


Problem 3.8: Solution

The KL divergence is a measure of the “difference” between two distributions. For Bernoulli rvs, it can be directly computed (in nats) as

KL(p||q) = 0.4 log(0.4/q) + (1 − 0.4) log((1 − 0.4)/(1 − q)).
MATLAB code
q=[0:0.01:1];
plot(q,0.4*log(0.4./q)+(1-0.4)*log((1-0.4)./(1-q)),'r','LineWidth',2); %KL(p||q) in nats
hold on
plot(q,0.4*log2(0.4./q)+(1-0.4)*log2((1-0.4)./(1-q)),'b','LineWidth',2); %KL(p||q) in bits
xlabel('$q$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 28 / 75


Problem 3.8: Solution

We have (in nats)

KL(q||p) = q log(q/0.4) + (1 − q) log((1 − q)/(1 − 0.4)).

MATLAB code
figure
q=[0:0.01:1];
plot(q,q.*log(q/0.4)+(1-q).*log((1-q)/(1-0.4)),'r','LineWidth',2); %KL(q||p) in nats
hold on
plot(q,q.*log2(q/0.4)+(1-q).*log2((1-q)/(1-0.4)),'b','LineWidth',2); %KL(q||p) in bits
xlabel('$q$','Interpreter','latex','FontSize',12);
ylabel('KL$(q||p)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 29 / 75


Problem 3.9

Compute KL(p∥q) between p(t) = Cat(t|[0.4, 0.1, 0.5]) and q(t) = Cat(t|[0.1, 0.5, 0.4]) in nats.

Osvaldo Simeone ML4Engineers 30 / 75


Problem 3.9: Solution

Using the general definition of KL divergence, we have

KL(p||q) = 0.4 log(0.4/0.1) + 0.1 log(0.1/0.5) + 0.5 log(0.5/0.4) = 0.5 nats.
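
A one-line MATLAB check of this value (illustrative only):

p=[0.4 0.1 0.5]; q=[0.1 0.5 0.4];
sum(p.*log(p./q))     % KL(p||q) in nats, about 0.50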

Osvaldo Simeone ML4Engineers 31 / 75


Problem 3.10

Plot the KL divergence between p(t) = N(t| − 1, 1) and q(t) = N(t|µ, 1) as a function of µ (in nats).
Plot the KL divergence between p(t) = N(t| − 1, 1) and q(t) = N(t| − 1, σ²) as a function of σ² (in nats).
Plot the KL divergence KL(q∥p) for the same distributions as at the previous point (in nats).

Osvaldo Simeone ML4Engineers 32 / 75


Problem 3.10: Solution

For Gaussian pdfs, the KL divergence is given as

KL(p∥q) = (1/2)[σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)]
= (1/2)[1/1 + (−1 − µ)²/1 − 1 + log(1/1)]
= (1/2)(1 + µ)².
MATLAB code
mu=[-3:0.01:1];
plot(mu,0.5*(1+mu).^2,'r','LineWidth',2); %KL(p||q) in nats
xlabel('$\mu$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 33 / 75


Problem 3.10: Solution

For Gaussian pdfs, the KL divergence is given as

KL(p∥q) = (1/2)[σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)]
= (1/2)[1/σ² − 1 + log σ²].

MATLAB code
s2=[0.01:0.01:3];
plot(s2,0.5*(1./s2-1+log(s2)),'r','LineWidth',2); %KL(p||q) in nats
ylim([0 2])
xlabel('$\sigma^2$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 34 / 75


Problem 3.10: Solution

For Gaussian pdfs, the KL divergence is given as

KL(q∥p) = (1/2)[σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)], now with σ1² = σ², σ2² = 1 and equal means, so that

KL(q∥p) = (1/2)[σ² − 1 + log(1/σ²)].
MATLAB code
s2=[0.01:0.01:3];
hold on; plot(s2,0.5*(s2-1+log(1./s2)),'b','LineWidth',2); %KL(q||p) in nats
ylim([0 2])
xlabel('$\sigma^2$','Interpreter','latex','FontSize',12);
ylabel('KL$(q||p)$ and KL$(p||q)$','Interpreter','latex','FontSize',12);

Osvaldo Simeone ML4Engineers 35 / 75


Problem 3.11

Consider a soft predictor q(t) = Bern(t|0.6). Recall that a soft predictor assigns score q(t) to each value of t ∈ {0, 1}.
Calculate the log-loss, in nats, for all possible values of t and comment on your result.
With this soft predictor, which value of t has the smallest log-loss?
Interpret your conclusions in terms of information-theoretic surprise.

Osvaldo Simeone ML4Engineers 36 / 75


Problem 3.11: Solution
The log-loss is given as
− log q(t),
which equals

− log q(1) = − log(0.6) = 0.51 nats


− log q(0) = − log(0.4) = 0.91 nats

The log-loss is hence smaller for t = 1, since the soft predictor assigns
a larger probability, or score, namely 0.6, to t = 1. However, the
log-loss values are not too different, since the soft predictor is quite
uncertain between the two values of t.
The soft predictor q(t) is less “surprised” when observing t = 1 since
it guesses that a realization t = 1 will be observed with a 60% chance.
In information-theoretic terms, we can say that observing t = 1 yields
− log2 (0.6) = 0.73 bits of information, while observing t = 0 yields
− log2 (0.4) = 1.3 bits of information.
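
The numbers quoted above can be reproduced directly (a quick illustrative check):

-log(0.6), -log(0.4)       % log-loss in nats for t=1 and t=0
-log2(0.6), -log2(0.4)     % surprise in bits for t=1 and t=0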
Osvaldo Simeone ML4Engineers 37 / 75
Problem 3.12

Consider a rv with distribution p(t) = Bern(t|0.4), and a soft predictor q(t) = Bern(t|0.6).
Calculate the population log-loss, also known as the cross entropy between p(t) and q(t), in nats.
Repeat with the soft predictor q(t) = Bern(t|0.4), and compare with the result obtained at the point above.

Osvaldo Simeone ML4Engineers 38 / 75


Problem 3.12: Solution
The population log-loss, also known as cross entropy between p(t)
and q(t), is given as

Lp (q(·)) = Et∼p(t) [− log q(t)] = H(p, q).

It measures the average log-loss when the true (population) distribution of rv t is p(t).
We can directly compute it for this example as

Lp (Bern(·|0.6)) = H(Bern(t|0.4)||Bern(t|0.6))
= p(0) · (− log q(0)) + p(1) · (− log q(1))
= 0.6 · (− log(0.4)) + 0.4 · (− log(0.6)) = 0.75.

We expect this population loss to be relatively large as compared to other soft predictors, since the soft predictor q(t) gives the maximum score to t = 1, while the true probability p(t = 1) = 0.4 is smaller than p(t = 0) = 0.6.
Osvaldo Simeone ML4Engineers 39 / 75
Problem 3.12: Solution
With the soft predictor q(t) =Bern(t|0.4), we instead have

Lp (Bern(·|0.4)) = H(Bern(t|0.4)||Bern(t|0.4))
= p(0) · (− log q(0)) + p(1) · (− log q(1))
= 0.4 · (− log(0.4)) + 0.6 · (− log(0.6)) = 0.67.

This is smaller than the cross entropy obtained with q(t) = Bern(t|0.6). This is intuitively reasonable, since this soft predictor q(t) gives the maximum score to t = 0, and the true probability p(t = 0) = 0.6 is indeed larger than p(t = 1) = 0.4.
In fact, we know that the soft predictor q(t) = p(t) minimizes the
cross entropy, and the minimum cross entropy is known as the
entropy of rv t ∼ p(t)

H(Bern(t|0.4)||Bern(t|0.4)) = H(Bern(t|0.4)) = H(t) = 0.67.
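
Both cross entropies can be checked numerically with the sketch below (illustrative MATLAB, variable names are assumptions):

p=[0.6 0.4];                  % true pmf [p(t=0) p(t=1)]
Hcross=@(q) -sum(p.*log(q));  % cross entropy H(p||q) in nats
Hcross([0.4 0.6])             % soft predictor Bern(0.6), about 0.75
Hcross([0.6 0.4])             % soft predictor Bern(0.4), equal to the entropy, about 0.67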

Osvaldo Simeone ML4Engineers 40 / 75


Problem 3.13

Consider the distribution p(t) = Bern(t|0.4) and the soft predictor q(t) = Bern(t|0.6).
Compute the KL divergence KL(Bern(0.4)||Bern(0.6)) in two
different ways:
▶ through direct calculation by using the definition of KL divergence;
▶ using the relationship of the KL divergence with cross entropy and
entropy.
Recall that the latter relationship views the KL divergence as a
measure of regret for not having used the correct predictor.

Osvaldo Simeone ML4Engineers 41 / 75


Problem 3.13: Solution

Using the definition of KL divergence:

KL(p||q) = 0.4 log(0.4/0.6) + 0.6 log(0.6/0.4) = 0.08.

Using the relationship with the cross entropy H(p||q) and the entropy H(p):

KL(p||q) = H(p||q) − H(p)
= [−0.4 log(0.6) − 0.6 log(0.4)] − [−0.4 log(0.4) − 0.6 log(0.6)]
= 0.75 − 0.67 = 0.08.
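
A quick numerical confirmation of both routes (illustrative MATLAB):

p=[0.6 0.4]; q=[0.4 0.6];                   % Bern(0.4) and Bern(0.6) as pmfs over t=0,1
KLdirect=sum(p.*log(p./q))                  % direct definition, about 0.08
KLviaH=-sum(p.*log(q))-(-sum(p.*log(p)))    % cross entropy minus entropy, same value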

Osvaldo Simeone ML4Engineers 42 / 75


Problem 3.14

Consider a soft predictor q(t) = N(t| − 1, 3).
Plot the log-loss, in nats, across a suitable range of values of t and comment on your result.
With this soft predictor, which value of t has the smallest log-loss?
Interpret your conclusions in terms of information-theoretic surprise.

Osvaldo Simeone ML4Engineers 43 / 75


Problem 3.14: Solution

The log-loss is given as

− log q(t) = − log N (t| − 1, 3).

MATLAB code:
t=[-10:0.01:8];
plot(t,-log(normpdf(t,-1,sqrt(3))),'r','LineWidth',2); %log-loss in nats
xlabel('$t$','Interpreter','latex','FontSize',12);
ylabel('$-\log q(t)$','Interpreter','latex','FontSize',12);

The log-loss is smallest at t = −1, since the soft predictor assigns the largest probability, or score, to t = −1.
The soft predictor q(t) is less “surprised” when observing t = −1, since it guesses that realizations around t = −1 will be observed with the largest probability.

Osvaldo Simeone ML4Engineers 44 / 75


Problem 3.15

Consider a rv with distribution p(t) = N(t| − 2, 1), and a soft predictor q(t) = N(t|1, 3).
Calculate the population log-loss, also known as the cross entropy between p(t) and q(t), in nats.
Repeat with the soft predictor q(t) = N(t| − 2, 1), and compare with the result obtained at the point above.

Osvaldo Simeone ML4Engineers 45 / 75


Problem 3.15: Solution

The population log-loss, also known as the cross entropy between p(t) and q(t), is given as

Lp(q(·)) = E_t∼p(t)[− log q(t)]
= H(N(t| − 2, 1)||N(t|1, 3))
= E_t∼N(t|−2,1)[− log N(t|1, 3)].

It measures the average log-loss when the true (population) distribution of rv t is p(t).
To compute it, we use the relationship KL(p||q) = H(p||q) − H(p), or H(p||q) = KL(p||q) + H(p).

Osvaldo Simeone ML4Engineers 46 / 75


Problem 3.15: Solution

We have the KL divergence

KL(N(t| − 2, 1)||N(t|1, 3)) = (1/2)[1/3 + (−2 − 1)²/3 − 1 + log(3/1)] = 1.71

and the (differential) entropy for the Gaussian N(t|µ = −2, σ² = 1) as

H(N(t| − 2, 1)) = (1/2) log(2πeσ²) = (1/2) log(2πe) = 1.41.

Therefore, we can compute the cross entropy

H(N(t| − 2, 1)||N(t|1, 3)) = 1.71 + 1.41 = 3.12.
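
This value can also be estimated by Monte Carlo sampling; the MATLAB sketch below (an illustrative check, not in the original) draws samples from p(t) and averages the log-loss under q(t):

t=-2+randn(1e6,1);                     % samples from p(t)=N(-2,1)
mean(-log(normpdf(t,1,sqrt(3))))       % estimate of the cross entropy, close to the value above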

Osvaldo Simeone ML4Engineers 47 / 75


Problem 3.15: Solution

With the soft predictor q(t) = p(t) = N(t| − 2, 1), the cross entropy equals the entropy

H(N(t| − 2, 1)) = (1/2) log(2πeσ²) = (1/2) log(2πe) = 1.41.

This is the optimal soft predictor, and indeed its cross entropy is smaller than that obtained with any other soft predictor, such as q(t) = N(t|1, 3). The difference between the two is given by KL(N(t| − 2, 1)||N(t|1, 3)).

Osvaldo Simeone ML4Engineers 48 / 75


Problem 3.16

Consider a soft predictor q(t) = N(t|µ, 1) for some value of µ. The population distribution of the rv to be predicted is p(t) = N(t| − 1, 1).
Plot the cross entropy H(p||q), the entropy H(p), and the KL divergence KL(p||q) in the same figure and discuss their relationship.

Osvaldo Simeone ML4Engineers 49 / 75


Problem 3.16: Solution

We need to plot

KL(N(t| − 1, 1)||N(t|µ, 1)) = (1/2)(1 + µ)²,

H(N(t| − 1, 1)) = (1/2) log(2πe) = 1.41,

and

H(N(t| − 1, 1)||N(t|µ, 1)) = 1.41 + (1/2)(1 + µ)².

Osvaldo Simeone ML4Engineers 50 / 75


Problem 3.16: Solution

MATLAB code
mu=[-3:0.01:1];
plot(mu,0.5*(1+mu).^2,'r','LineWidth',2); %KL(p||q) in nats
hold on; plot(mu,1.41*ones(size(mu)),'k--','LineWidth',2); %entropy H(p)
plot(mu,1.41+0.5*(1+mu).^2,'b','LineWidth',2); %cross entropy H(p||q)
xlabel('$\mu$','Interpreter','latex','FontSize',12);

Note that the minimum of both the cross entropy and the KL divergence is obtained when p(t) = q(t), i.e., when µ = −1.

Osvaldo Simeone ML4Engineers 51 / 75


Problem 3.17

For the joint distribution p(x, t) below, compute the population log-loss obtained by the soft predictor q(t = 1|x = 0) = 0.4 and q(t = 1|x = 1) = 0.6 in two ways:
▶ using directly the definition of the population log-loss;
▶ using the expression in terms of the cross entropy.
Derive the optimal soft predictor and the corresponding population log-loss.
What are the conditional entropy H(t|x) and the entropy H(t)?

x\t 0 1
0 0.3 0.4
1 0.1 0.2

Osvaldo Simeone ML4Engineers 52 / 75


Problem 3.17: Solution

Using its definition, the population log-loss is obtained as

Lp(q(·|·)) = E_(x,t)∼p(x,t)[− log q(t|x)]
= p(0, 0) · (− log q(0|0)) + p(0, 1) · (− log q(1|0)) + p(1, 0) · (− log q(0|1)) + p(1, 1) · (− log q(1|1))
= 0.3 · (− log(0.6)) + 0.4 · (− log(0.4)) + 0.1 · (− log(0.4)) + 0.2 · (− log(0.6)) = 0.71.
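
The same number is obtained with a few lines of MATLAB (an illustrative check, not part of the original slides):

P=[0.3 0.4; 0.1 0.2];      % joint pmf p(x,t)
Q=[0.6 0.4; 0.4 0.6];      % soft predictor q(t|x), rows indexed by x
sum(sum(-P.*log(Q)))       % population log-loss, about 0.71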

Osvaldo Simeone ML4Engineers 53 / 75


Problem 3.17: Solution

The population log-loss can also be expressed as the average cross entropy

Lp(q(·|·)) = E_x∼p(x)[H(p(t|x)||q(t|x))], where H(p(t|x)||q(t|x)) = E_t∼p(t|x)[− log q(t|x)] is the cross entropy.

To evaluate this expression, we first obtain the conditional distribution p(t|x) as

x\t 0 1
0 0.42 0.58
1 0.33 0.67

Osvaldo Simeone ML4Engineers 54 / 75


Problem 3.17: Solution

Now, we can compute the cross entropies

H(p(t|x = 0)||q(t|x = 0)) = −0.42 log(0.6) − 0.58 log(0.4) = 0.74
H(p(t|x = 1)||q(t|x = 1)) = −0.33 log(0.4) − 0.67 log(0.6) = 0.64

and then the average

Lp(q(·|·)) = 0.7 · H(p(t|x = 0)||q(t|x = 0)) + 0.3 · H(p(t|x = 1)||q(t|x = 1)) = 0.7 · 0.74 + 0.3 · 0.64 = 0.71.

Osvaldo Simeone ML4Engineers 55 / 75


Problem 3.17: Solution
The optimal soft predictor is the conditional distribution p(t|x)

x\t 0 1
0 0.42 0.58
1 0.33 0.67

Its population cross-entropy loss, aka population log-loss, is

Lp(p(·|·)) = E_(x,t)∼p(x,t)[− log p(t|x)]
= p(0, 0) · (− log p(0|0)) + p(0, 1) · (− log p(1|0)) + p(1, 0) · (− log p(0|1)) + p(1, 1) · (− log p(1|1))
= 0.3 · (− log(0.42)) + 0.4 · (− log(0.58)) + 0.1 · (− log(0.33)) + 0.2 · (− log(0.67)) = 0.66.

This is also known as the conditional entropy H(t|x).


Osvaldo Simeone ML4Engineers 56 / 75
Problem 3.17: Solution

We have computed above the conditional entropy, which corresponds to the minimum population log-loss (aka population cross-entropy loss): H(t|x) = 0.66.
The entropy is given as

H(t) = E_t∼p(t)[− log p(t)] = 0.4 · (− log(0.4)) + 0.6 · (− log(0.6)) = 0.67.

Note that the conditional entropy must always be no larger than the
entropy: the conditional entropy corresponds to the population
log-loss when x is observed, while the entropy is the population
log-loss when no correlated observation is available.

Osvaldo Simeone ML4Engineers 57 / 75


Problem 3.18

Compute the conditional entropy H(t|x) and the entropy H(t) for the
joint distribution

x\t 0 1
0 0.16 0.24
1 0.24 0.36

Osvaldo Simeone ML4Engineers 58 / 75


Problem 3.18: Solution

The rvs x and t are independent with

p(t = 1|x = 0) = p(t = 1) = 0.6.

Therefore, we have

H(t|x) = E(x,t)∼p(x,t) [− log p(t|x)]


= Et∼p(t) [− log p(t)]
= H(t)
= 0.6(− log(0.6)) + 0.4(− log(0.4)) = 0.67.

Osvaldo Simeone ML4Engineers 59 / 75


Problem 3.19
For the joint distribution p(x, t) in the table below, compute the
mutual information I(x; t).

x\t 0 1
0 0 0.5
1 0.5 0

For the joint distribution p(x, t) in the table below, compute the
mutual information I(x; t).

x\t 0 1
0 0.16 0.24
1 0.24 0.36

Osvaldo Simeone ML4Engineers 60 / 75


Problem 3.19: Solution

The mutual information measures by how much we can decrease the population log-loss when we have access to x. It is a measure of statistical dependence between x and t.
It is defined as

I(x; t) = H(t) − H(t|x)

or equivalently as

I(x; t) = KL(p(x, t)∥p(x)p(t)).

Osvaldo Simeone ML4Engineers 61 / 75


Problem 3.19: Solution
For this example, it is easier to compute it using the first formula. In
fact, we have

H(t|x = 0) = E_t∼p(t|x=0)[− log p(t|x = 0)] = 0 · (− log(0)) + 1 · (− log(1)) = 0 (using the convention 0 · log 0 = 0)

and similarly H(t|x = 1) = 0. Hence,

H(t|x) = p(x = 0)H(t|x = 0) + p(x = 1)H(t|x = 1) = 0.

This is because knowing x perfectly determines t, and hence the residual uncertainty on t given x is zero.
We then have

I(x; t) = H(t) = − log(0.5) = 0.69.

Osvaldo Simeone ML4Engineers 62 / 75


Problem 3.19: Solution

As we have seen, with the second joint pmf, the two rvs are
independent, which implies the equality p(x, t) = p(x)p(t), and

I(x; t) = KL(p(x, t)∥p(x)p(t)) = 0.

This is because knowing x brings no information about t, and hence we have H(t|x) = H(t): the uncertainty about t is the same whether or not x is known.
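
Both cases can be verified with a small MATLAB function for the mutual information of a discrete joint pmf (an illustrative sketch; the 0 · log 0 terms in the joint pmf are handled by dropping zero entries):

mi=@(P) sum(P(P>0).*log(P(P>0))) ...
       -sum(sum(P,2).*log(sum(P,2))) ...
       -sum(sum(P,1).*log(sum(P,1)));   % I(x;t)=H(x)+H(t)-H(x,t)
mi([0 0.5; 0.5 0])                      % about 0.69 nats
mi([0.16 0.24; 0.24 0.36])              % 0, since the rvs are independent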

Osvaldo Simeone ML4Engineers 63 / 75


Problem 3.20

Consider the population distribution p(x, t) = p(t)p(x|t) with


▶ t ∼ N (0, 1), and
▶ (x|t = t) ∼ N (t, 1), i.e., x = t + z, with independent noise
z ∼ N (0, 1).
For the soft predictor q(t|x) = N (t| − x, 0.5), compute the
population log-loss.
What is the optimal soft predictor?
What is the corresponding minimum population log-loss?

Osvaldo Simeone ML4Engineers 64 / 75


Problem 3.20: Solution

First, we compute the conditional distribution

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | x/2, 1/2).

Therefore, for every value x, we have

KL(p(t|x)||q(t|x)) = KL(N(0.5x, 0.5)||N(−x, 0.5))
= (0.5x − (−x))²/(2 · 0.5)
= (9/4)x².

Osvaldo Simeone ML4Engineers 65 / 75


Problem 3.20: Solution

We now need to compute E_x∼p(x)[KL(p(t|x)||q(t|x))].
Since x = t + z, with independent rvs t ∼ N(0, 1) and z ∼ N(0, 1), the marginal is p(x) = N(x|0, 1 + 1) = N(x|0, 2).
We can then write

E_x∼p(x)[KL(p(t|x)||q(t|x))] = (9/4) E_x∼N(0,2)[x²] = (9/4) · 2 = 9/2.

Osvaldo Simeone ML4Engineers 66 / 75


Problem 3.20: Solution

Recall now that we have the equality

Lp (q(·|·)) = Ex∼p(x) [KL(p(t|x)||q(t|x))] + H(t|x),

where the conditional entropy is given as

H(t|x) = H(N(t|0.5x, 0.5)) = (1/2) log(2πe · 0.5) = 1.07.

Therefore, we conclude that the population log-loss is given as

Lp(q(·|·)) = 4.5 + 1.07 = 5.57.
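
As a sanity check (illustrative MATLAB, not part of the original), the population log-loss of the mismatched predictor can be estimated by sampling from the joint distribution:

t=randn(1e6,1); x=t+randn(1e6,1);      % samples from p(t)p(x|t)
mean(-log(normpdf(t,-x,sqrt(0.5))))    % population log-loss of q(t|x)=N(-x,0.5), about 5.57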

Osvaldo Simeone ML4Engineers 67 / 75


Problem 3.20: Solution

The optimal soft predictor is given by the conditional distribution

p(t|x) = N(t | (αν + βx)/(α + β), 1/(α + β)) = N(t | x/2, 1/2).

The corresponding minimum population log-loss is the conditional entropy

H(t|x) = H(N(t|0.5x, 0.5)) = (1/2) log(2πe · 0.5) = 1.07.

Osvaldo Simeone ML4Engineers 68 / 75


Problem 3.21

We are given the population distribution p(x, t) = p(t)p(x|t) with


▶ t ∼ N (0, 1), and
▶ (x|t = t) ∼ N (t, β −1 ), i.e., x = t + z, with independent noise
z ∼ N (0, β −1 )
Plot the mutual information I(x; t) as a function of β and comment
on your results.

Osvaldo Simeone ML4Engineers 69 / 75


Problem 3.21: Solution

We have

I(x; t) = (1/2) log(1 + β/α) = (1/2) log(1 + β).
MATLAB code:
beta=[0.01:0.01:10];
plot(beta,0.5*log(1+beta),'r','LineWidth',2); %mutual information in nats
xlabel('$\beta$','Interpreter','latex','FontSize',12);
ylabel('I$(\mathrm{x};\mathrm{t})$','Interpreter','latex','FontSize',12);

As the precision β of the observation increases, the mutual information increases accordingly.

Osvaldo Simeone ML4Engineers 70 / 75


Problem 3.22

Prove that the entropy H(x) of a random vector x = (x1, ..., xM) with independent entries is given by the sum H(x) = Σ_{m=1}^M H(xm).
Use this result to compute the (differential) entropy of the vector x ∼ N(0_M, I_M).

Osvaldo Simeone ML4Engineers 71 / 75


Problem 3.22: Solution

We have

H(x) = E_{x∼∏ p(xm)}[− log(∏_{m=1}^M p(xm))]
= E_{x∼∏ p(xm)}[− Σ_{m=1}^M log p(xm)]
= Σ_{m=1}^M E_{xm∼p(xm)}[− log p(xm)]
= Σ_{m=1}^M H(xm).

Osvaldo Simeone ML4Engineers 72 / 75


Problem 3.22: Solution

Therefore, we have

H(N(0_M, I_M)) = M · H(N(0, 1)) = (M/2) log(2πe).

Osvaldo Simeone ML4Engineers 73 / 75


Problem 3.23

Prove that the optimal constant predictor of a rv t ∼ N(µ, σ²) under the ℓ2 loss is the mean µ.

Osvaldo Simeone ML4Engineers 74 / 75


Problem 3.23: Solution

Compute the derivative of the population ℓ2 loss with respect to the constant prediction t̂ and set it equal to zero (see Chapter 6): d/dt̂ E_t∼N(µ,σ²)[(t − t̂)²] = −2 E[t − t̂] = 2(t̂ − µ) = 0, which yields t̂* = µ.

Osvaldo Simeone ML4Engineers 75 / 75
