x\t 0 1
0 0.4 0.3
1 0.1 0.2
How does the optimal hard predictor change when we adopt the loss
function illustrated by the table below?
t\t̂ 0 1
0 0 1
1 0.1 0
x\t   0                        1
0     0.4/(0.3 + 0.4) = 0.57   0.3/(0.3 + 0.4) = 0.43
1     0.1/0.3 = 0.33           0.2/0.3 = 0.67
as
t̂*(0) = arg min_{t̂(0)} [0.57 · 1(0 ≠ t̂(0)) + 0.43 · 0.1 · 1(1 ≠ t̂(0))] = 0,
and, similarly,
t̂*(1) = arg min_{t̂(1)} [0.33 · 1(0 ≠ t̂(1)) + 0.67 · 0.1 · 1(1 ≠ t̂(1))] = 0,
so, under this asymmetric loss, the optimal hard predictor selects t̂ = 0 for both values of x.
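As a quick numerical check, the same minimization can be reproduced in MATLAB (a minimal sketch using the posterior and loss tables above; variable names are illustrative):

MATLAB code:
post = [0.57 0.43; 0.33 0.67]; % p(t|x), one row per value of x
loss = [0 1; 0.1 0];           % loss(t+1,that+1) = loss for true t and prediction that
risk = post*loss;              % expected loss of each prediction, one row per value of x
[~,idx] = min(risk,[],2);
that = idx-1                   % optimal hard predictor: [0; 0]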
For the joint distribution p(x, t) in the table below, compute the
optimal (Bayesian) soft predictor, the optimal hard predictor for the
detection-error loss, and the corresponding minimum population
detection-error loss.
x\t 0 1
0 0.16 0.24
1 0.24 0.36
It can be seen that the two variables are independent, since the joint
distribution is equal to the product of marginals defined by the
probabilities p(x = 1) = 0.6 and p(t = 1) = 0.6.
Therefore, using x to predict t is not useful, since the posterior distribution equals the marginal; that is, the optimal soft predictor is
p(t|x) = p(x, t)/p(x) = p(x)p(t)/p(x) = p(t).
Accordingly, the optimal hard predictor under the detection-error loss is t̂*(x) = 1 for both x = 0 and x = 1, and the corresponding minimum population detection-error loss is 1 − p(t = 1) = 0.4.
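The independence check and the resulting soft predictor can also be verified numerically (a minimal sketch for the joint pmf above):

MATLAB code:
P = [0.16 0.24; 0.24 0.36]; % joint pmf p(x,t), rows: x, columns: t
px = sum(P,2);              % marginal p(x) = [0.4; 0.6]
pt = sum(P,1);              % marginal p(t) = [0.4 0.6]
post = P./px                % posterior p(t|x): both rows equal pt
norm(P - px*pt)             % (numerically) zero, confirming independence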
Given the posterior distribution p(t|x) in the table below, obtain the
optimal point predictors under ℓ2 and detection-error losses, and
evaluate the corresponding minimum population losses when
p(x = 1) = 0.5.
x\t 0 1
0 0.9 0.1
1 0.2 0.8
Therefore, we have
t̂*(0) = Et∼p(t|0)[t] = 0.9 · 0 + 0.1 · 1 = 0.1
and
t̂*(1) = Et∼p(t|1)[t] = 0.2 · 0 + 0.8 · 1 = 0.8.
Furthermore, the minimum population loss is the average posterior
variance
Var(t|x = 0) = p(t = 1|x = 0)(1 − p(t = 1|x = 0)) = 0.1 · 0.9 = 0.09
and
Var(t|x = 1) = p(t = 1|x = 1)(1 − p(t = 1|x = 1)) = 0.8 · 0.2 = 0.16.
We finally obtain
Lp(t̂*(·)) = 0.5 · Var(t|x = 0) + 0.5 · Var(t|x = 1) = 0.5 · 0.09 + 0.5 · 0.16 = 0.125.
Under the detection-error loss, the optimal hard predictor is the MAP predictor. Therefore, we have
t̂ ∗ (0) = 0,
and
t̂ ∗ (1) = 1.
Furthermore, the minimum population loss is the minimum probability of error
Lp(t̂*(·)) = 1 − Ex∼p(x)[max_t p(t|x)]
= 1 − (0.5 · 0.9 + 0.5 · 0.8) = 0.15.
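Both predictors and both minimum population losses can be verified with a few lines of MATLAB (a minimal sketch using the posterior table above and p(x = 1) = 0.5):

MATLAB code:
post = [0.9 0.1; 0.2 0.8];            % p(t|x), rows: x = 0,1
px = [0.5; 0.5];                      % p(x)
that_l2 = post(:,2)                   % posterior means: [0.1; 0.8]
L_l2 = px'*(post(:,2).*(1-post(:,2))) % average posterior variance = 0.125
[pmax,idx] = max(post,[],2);
that_map = idx-1                      % MAP predictor: [0; 1]
L_de = 1 - px'*pmax                   % minimum probability of error = 0.15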
For a posterior distribution given by the mixture p(t|x) = 0.5 · N(t|3, 1) + 0.5 · N(t|−3, 1), what is the optimal hard predictor under the ℓ2 loss? What is the MAP predictor?
The optimal hard predictor under the ℓ2 loss is given by the mean of the posterior distribution, namely t̂*(x) = 0.5 · 3 + 0.5 · (−3) = 0.
MATLAB code:
% plot the posterior p(t|x) = 0.5 N(t|3,1) + 0.5 N(t|-3,1)
taxis=[-6:0.01:6];
plot(taxis,0.5*normpdf(taxis,3,1)+0.5*normpdf(taxis,-3,1),'LineWidth',2);
xlabel('$t$','Interpreter','latex','FontSize',12);
ylabel('$p(t|x)$','Interpreter','latex','FontSize',12);
In this case, there are two MAP predictors, namely t̂ ∗ (x) = −3 or t̂ ∗ (x) = 3.
% red: N(t|2,1); blue: N(t|12/11, 1/11)
taxis=[-1:0.01:5];
plot(taxis,normpdf(taxis,2,1),'r','LineWidth',2);
hold on; plot(taxis,normpdf(taxis,12/11,sqrt(1/11)),'b','LineWidth',2);
xlabel('$t$','Interpreter','latex','FontSize',12);
% red: N(t|2, 0.01^2); blue: N(t|210/110, 1/110)
taxis=[-1:0.01:5];
plot(taxis,normpdf(taxis,2,0.01),'r','LineWidth',2);
hold on; plot(taxis,normpdf(taxis,210/110,sqrt(1/110)),'b','LineWidth',2);
xlabel('$t$','Interpreter','latex','FontSize',12);
where
Θ = αI2 + βAᵀA = I2 + 10 [2; 1][2 1]
= [1 0; 0 1] + 10 [4 2; 2 1] = [41 20; 20 11].
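As a sanity check, the matrix Θ and its inverse, which appears as the posterior covariance in the contour plot below, can be computed directly (a minimal sketch assuming α = 1, β = 10, and A = [2 1]):

MATLAB code:
alpha = 1; beta = 10;
A = [2 1];                          % data matrix (a single row here)
Theta = alpha*eye(2) + beta*(A'*A)  % = [41 20; 20 11]
inv(Theta)                          % ~ [0.21 -0.39; -0.39 0.80], the Sigmapost used below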
% contours of N(mup,Sigmap) and of N(mupost,Sigmapost)
x1=[-2*sqrt(2):0.1:2*sqrt(2)];
x2=[-2*sqrt(2):0.1:2*sqrt(2)];
[X1,X2] = meshgrid(x1,x2);
X = [X1(:) X2(:)];
mup=[0,0];
Sigmap=eye(2);
y = mvnpdf(X,mup,Sigmap);
y = reshape(y,length(x2),length(x1));
contour(x1,x2,y);
hold on;
mupost=[0.39,0.19];
Sigmapost=[0.21,-0.39;-0.39,0.8]; % approximately inv(Theta) computed above
y = mvnpdf(X,mupost,Sigmapost);
y = reshape(y,length(x2),length(x1));
contour(x1,x2,y);
xlabel('$x_1$','Interpreter','latex','FontSize',14);
ylabel('$x_2$','Interpreter','latex','FontSize',14)
MATLAB code
figure
q=[0:0.01:1];
plot(q,q.*log(q/0.4)+(1-q).*log((1-q)/(1-0.4)),'r','LineWidth',2);
%KL(q||p) in nats
hold on
plot(q,q.*log2(q/0.4)+(1-q).*log2((1-q)/(1-0.4)),'b','LineWidth',2);
%KL(q||p) in bits
xlabel('$q$','Interpreter','latex','FontSize',12);
ylabel('KL$(q||p)$','Interpreter','latex','FontSize',12);
For p = N(t|−1, 1) and q = N(t|µ, 1), we have
KL(p∥q) = (1/2) [σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)]
= (1/2) [1/1 + (−1 − µ)²/1 − 1 + log(1/1)]
= (1/2) (1 + µ)².
MATLAB code
mu=[-3:0.01:1];
plot(mu,0.5*(1+mu).^2,'r','LineWidth',2); %KL(p||q) in nats
xlabel('$\mu$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);
Similarly, for two Gaussians with equal means and variances σ1² = 1 and σ2² = σ², we have
KL(p∥q) = (1/2) [σ1²/σ2² + (µ1 − µ2)²/σ2² − 1 + log(σ2²/σ1²)]
= (1/2) [1/σ² − 1 + log σ²].
MATLAB code
s2=[0.01:0.01:3];
plot(s2,0.5*(1./s2-1+log(s2)),'r','LineWidth',2); %KL(p||q) in nats
ylim([0 2])
xlabel('$\sigma^2$','Interpreter','latex','FontSize',12);
ylabel('KL$(p||q)$','Interpreter','latex','FontSize',12);
In the reverse direction, we have
KL(q∥p) = (1/2) [σ2²/σ1² + (µ2 − µ1)²/σ1² − 1 + log(σ1²/σ2²)]
= (1/2) [σ² − 1 + log(1/σ²)].
MATLAB code
s2=[0.01:0.01:3];
hold on; plot(s2,0.5*(s2-1+log(1./s2)),'b','LineWidth',2); %KL(q||p) in nats
ylim([0 2])
xlabel('$\sigma^2$','Interpreter','latex','FontSize',12);
ylabel('KL$(q||p)$ and KL$(p||q)$','Interpreter','latex','FontSize',12);
The log-loss is hence smaller for t = 1, since the soft predictor assigns
a larger probability, or score, namely 0.6, to t = 1. However, the
log-loss values are not too different, since the soft predictor is quite
uncertain between the two values of t.
The soft predictor q(t) is less "surprised" when observing t = 1, since
it guesses that a realization t = 1 will be observed with a 60% chance.
In information-theoretic terms, we can say that observing t = 1 yields
− log2 (0.6) = 0.73 bits of information, while observing t = 0 yields
− log2 (0.4) = 1.3 bits of information.
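These information values can be checked directly (a quick numerical check of the two log-losses in bits):

MATLAB code:
-log2(0.6) % information from observing t = 1 (in bits)
-log2(0.4) % information from observing t = 0 (in bits)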
Problem 3.12
Lp(Bern(·|0.6)) = H(Bern(t|0.4)∥Bern(t|0.6))
= p(0) · (− log q(0)) + p(1) · (− log q(1))
= 0.6 · (− log(0.4)) + 0.4 · (− log(0.6)) = 0.75.
Lp(Bern(·|0.4)) = H(Bern(t|0.4)∥Bern(t|0.4))
= p(0) · (− log p(0)) + p(1) · (− log p(1))
= 0.6 · (− log(0.6)) + 0.4 · (− log(0.4)) = 0.67.
Using the relationship between cross entropy H(p||q), entropy H(p), and KL divergence, we obtain
KL(Bern(t|0.4)∥Bern(t|0.6)) = H(Bern(t|0.4)∥Bern(t|0.6)) − H(Bern(t|0.4)) = 0.75 − 0.67 = 0.08.
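The three quantities can be checked numerically (a minimal sketch for p = Bern(t|0.4) and q = Bern(t|0.6)):

MATLAB code:
p = [0.6 0.4];   % p(t=0), p(t=1) for Bern(t|0.4)
q = [0.4 0.6];   % q(t=0), q(t=1) for Bern(t|0.6)
Hpq = -p*log(q)' % cross entropy H(p||q) = 0.75 nats
Hp  = -p*log(p)' % entropy H(p) = 0.67 nats
Hpq - Hp         % KL(p||q) = 0.08 nats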
MATLAB code:
% log-loss -log q(t) of the soft predictor q(t) = N(t|-1,3) as a function of t
t=[-10:0.01:8];
plot(t,-log(normpdf(t,-1,sqrt(3))),'r','LineWidth',2); %log-loss in nats
xlabel('$t$','Interpreter','latex','FontSize',12);
ylabel('$-\log q(t)$','Interpreter','latex','FontSize',12);
KL(N(t|−2, 1)∥N(t|1, 3)) = (1/2) [1/3 + (−2 − 1)²/3 − 1 + log(3/1)] = 1.71
and
H(N(t|−2, 1)) = (1/2) log(2πeσ²) = (1/2) log(2πe) = 1.41.
Therefore, we can compute the cross entropy
H(N(t|−2, 1), N(t|1, 3)) = 1.71 + 1.41 = 3.12.
With the soft predictor q(t) = p(t) = N(t|−2, 1), the cross entropy equals the entropy
H(N(t|−2, 1)) = (1/2) log(2πeσ²) = (1/2) log(2πe) = 1.41.
This is the optimal soft predictor, and indeed its cross entropy is
smaller than that obtained with any other soft predictor, such as
q(t) = N (t|1, 3). The difference between the two is given by
KL(N (t| − 2, 1)||N (t|1, 3)).
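The same decomposition can be evaluated numerically (a minimal sketch for p = N(t|−2, 1) and q = N(t|1, 3)):

MATLAB code:
mu1 = -2; v1 = 1; mu2 = 1; v2 = 3;                 % means and variances of p and q
KL = 0.5*(v1/v2 + (mu1-mu2)^2/v2 - 1 + log(v2/v1)) % KL(p||q), approx. 1.7 nats
Hp = 0.5*log(2*pi*exp(1)*v1)                       % entropy H(p), approx. 1.4 nats
KL + Hp                                            % cross entropy H(p,q), approx. 3.1 nats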
We need to plot
KL(N(t|−1, 1)∥N(t|µ, 1)) = (1/2) (1 + µ)²,
H(N(t|−1, 1)) = (1/2) log(2πe) = 1.41,
and
H(N(t|−1, 1)∥N(t|µ, 1)) = 1.41 + (1/2) (1 + µ)².
MATLAB code
mu=[-3:0.01:1];
plot(mu,0.5*(1+mu).^2,'r','LineWidth',2); %KL(p||q) in nats
hold on; plot(mu,1.41*ones(size(mu)),'k--','LineWidth',2); %entropy H(p)
plot(mu,1.41+0.5*(1+mu).^2,'b','LineWidth',2); %cross entropy H(p||q)
xlabel('$\mu$','Interpreter','latex','FontSize',12);
x\t 0 1
0 0.3 0.4
1 0.1 0.2
x\t 0 1
0 0.42 0.58
1 0.33 0.67
Note that the conditional entropy must always be no larger than the
entropy: the conditional entropy corresponds to the population
log-loss when x is observed, while the entropy is the population
log-loss when no correlated observation is available.
Compute the conditional entropy H(t|x) and the entropy H(t) for the
joint distribution
x\t 0 1
0 0.16 0.24
1 0.24 0.36
Therefore, since x and t are independent, we have
H(t|x) = H(t) = −0.4 · log(0.4) − 0.6 · log(0.6) = 0.67 nats.
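The two entropies can also be evaluated numerically from the joint pmf (a minimal sketch, in nats):

MATLAB code:
P = [0.16 0.24; 0.24 0.36];       % joint pmf p(x,t)
px = sum(P,2); pt = sum(P,1);
post = P./px;                     % p(t|x)
Htx = -px'*sum(post.*log(post),2) % conditional entropy H(t|x) = 0.67 nats
Ht = -pt*log(pt)'                 % entropy H(t) = 0.67 nats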
x\t 0 1
0 0 0.5
1 0.5 0
For the joint distribution p(x, t) in the table below, compute the
mutual information I(x; t).
x\t 0 1
0 0.16 0.24
1 0.24 0.36
The mutual information can be computed as I(x; t) = H(t) − H(t|x), or equivalently as I(x; t) = KL(p(x, t)∥p(x)p(t)). As we have seen, with the second joint pmf the two rvs are independent, which implies the equality p(x, t) = p(x)p(t), and hence I(x; t) = 0.
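This conclusion can be double-checked by evaluating the KL divergence between the joint pmf and the product of its marginals (a minimal sketch):

MATLAB code:
P = [0.16 0.24; 0.24 0.36];    % joint pmf p(x,t)
px = sum(P,2); pt = sum(P,1);
Q = px*pt;                     % product of marginals p(x)p(t)
I = sum(P(:).*log(P(:)./Q(:))) % mutual information, (numerically) zero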
We have
I(x; t) = (1/2) log(1 + β/α) = (1/2) log(1 + β).
MATLAB code:
beta=[0.01:0.01:10];
plot(beta,0.5*log(1+beta),'r','LineWidth',2); %mutual information in nats
xlabel('$\beta$','Interpreter','latex','FontSize',12);
ylabel('I$(\mathrm{x};\mathrm{t})$','Interpreter','latex','FontSize',12);
Prove that the entropy H(x) of a random vector x = (x1, ..., xM) with independent entries (x1, ..., xM) is given by the sum H(x) = Σ_{m=1}^{M} H(xm).
Use this result to compute the (differential) entropy of the vector x ∼ N(0M, IM).
We have
H(x) = Ex∼∏_{m=1}^{M} p(xm) [− log (∏_{m=1}^{M} p(xm))]
= Ex∼∏_{m=1}^{M} p(xm) [− Σ_{m=1}^{M} log p(xm)]
= Σ_{m=1}^{M} Exm∼p(xm) [− log p(xm)]
= Σ_{m=1}^{M} H(xm).
Therefore, we have
H(x) = Σ_{m=1}^{M} H(xm) = M · (1/2) log(2πe) = (M/2) log(2πe),
since each entry xm ∼ N(0, 1) has differential entropy (1/2) log(2πe).
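The result can be verified against the general formula for the differential entropy of a multivariate Gaussian (a minimal sketch with an illustrative dimension M = 4):

MATLAB code:
M = 4;                           % illustrative dimension
M*0.5*log(2*pi*exp(1))           % sum of M scalar Gaussian entropies
0.5*log(det(2*pi*exp(1)*eye(M))) % differential entropy of N(0_M, I_M), same value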