
MATH4267 Solutions JL

1. (a) The conditional PDFs of (X1, X2 | Y = 1) and (X1, X2 | Y = 0) are:

\[
f_{X_1,X_2|Y=1}(x_1,x_2) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1-1)^2 + x_2^2\right)\right)
\]
\[
f_{X_1,X_2|Y=0}(x_1,x_2) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1+1)^2 + x_2^2\right)\right)
\]

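and we have P(Y = 1) = α. So the unconditional PDF of (X1, X2) is given by:

\[
\begin{aligned}
f_{X_1,X_2}(x_1,x_2) &= P(Y=1)\,f_{X_1,X_2|Y=1}(x_1,x_2) + P(Y=0)\,f_{X_1,X_2|Y=0}(x_1,x_2) \\
&= \alpha\,\frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1-1)^2 + x_2^2\right)\right)
 + (1-\alpha)\,\frac{1}{2\pi}\exp\left(-\frac{1}{2}\left((x_1+1)^2 + x_2^2\right)\right) \qquad (1)
\end{aligned}
\]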

[3 marks]
(b) The generalisation error of d is:
\[
\mathbb{E}_{(X_1,X_2),Y}\left[\mathrm{Loss}(Y,\hat{Y})\right] = \mathbb{E}_{(X_1,X_2),Y}\left[c\left(Y, d(X_1,X_2)\right)\right] \qquad (2)
\]

NB: it is fine not to write this out as an integral, but an answer written as an
integral is also acceptable. [SEEN]
[3 marks]

1 of 7 2024

(c) The generalisation error for a zero-one loss is minimised by the function

\[
d_{\min}(x_1,x_2) =
\begin{cases}
1 & \text{if } P(Y=1 \mid (X_1,X_2)=(x_1,x_2)) > P(Y=0 \mid (X_1,X_2)=(x_1,x_2)) \\
0 & \text{otherwise}
\end{cases}
\]

Since we have

\[
\begin{aligned}
P(Y=1 \mid (X_1,X_2)=(x_1,x_2)) &= \frac{f_{X_1,X_2|Y=1}(x_1,x_2)\,P(Y=1)}{f_{X_1,X_2}(x_1,x_2)}
 = \frac{\alpha\,f_{X_1,X_2|Y=1}(x_1,x_2)}{f_{X_1,X_2}(x_1,x_2)} \\
P(Y=0 \mid (X_1,X_2)=(x_1,x_2)) &= \frac{f_{X_1,X_2|Y=0}(x_1,x_2)\,P(Y=0)}{f_{X_1,X_2}(x_1,x_2)}
 = \frac{(1-\alpha)\,f_{X_1,X_2|Y=0}(x_1,x_2)}{f_{X_1,X_2}(x_1,x_2)}
\end{aligned}
\]

the common denominator cancels, and dmin(x1, x2) in our case is:

\[
d_{\min}(x_1,x_2) =
\begin{cases}
1 & \text{if } \alpha\,f_{X_1,X_2|Y=1}(x_1,x_2) > (1-\alpha)\,f_{X_1,X_2|Y=0}(x_1,x_2) \\
0 & \text{otherwise}
\end{cases}
\qquad (3)
\]

[4 marks]
NB: it is fine to use abbreviations of previously defined functions, rather than
writing them out in full each time.
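As an illustration of rule (3), the following is a minimal Python sketch, not part of the marked solution. The function names `cond_pdf`, `d_min` and `threshold` are hypothetical. Taking logarithms of both sides of the inequality in (3) suggests the rule reduces to a cut-off on x1 alone, namely x1 > (1/2) ln((1 − α)/α); the sketch checks that reduction numerically.

```python
import math

def cond_pdf(x1, x2, y):
    """Conditional density of (X1, X2) given Y = y from part (a):
    a unit-variance Gaussian centred at (+1, 0) for y = 1 and (-1, 0) for y = 0."""
    mean = 1.0 if y == 1 else -1.0
    return math.exp(-0.5 * ((x1 - mean) ** 2 + x2 ** 2)) / (2 * math.pi)

def d_min(x1, x2, alpha):
    """Rule (3): predict 1 when alpha * f(.|Y=1) > (1 - alpha) * f(.|Y=0)."""
    lhs = alpha * cond_pdf(x1, x2, 1)
    rhs = (1 - alpha) * cond_pdf(x1, x2, 0)
    return 1 if lhs > rhs else 0

def threshold(alpha):
    """Cut-off on x1 obtained by taking logs of both sides of rule (3)."""
    return 0.5 * math.log((1 - alpha) / alpha)
```

With α = 1/2 the cut-off is x1 = 0, so the classifier simply picks the nearer of the two class means; x2 plays no role because it enters both densities identically.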


2. (a) Given a finite set S, a σ-algebra Σ on S is a collection of subsets of S (that is, Σ ⊆ 2^S) such that:
• S ∈ Σ;
• Σ is closed under complementation, so if A ∈ Σ then (S \ A) ∈ Σ;
• Σ is closed under union/intersection, so if A, B ∈ Σ then (A ∪ B) ∈ Σ (and (A ∩ B) ∈ Σ).
NB: the concept of countable union or intersection is not relevant here, since S is
finite, but there is no penalty for mentioning it. Only one of union/intersection
closure need be mentioned, but there is no penalty for mentioning both. [SEEN]
[3 marks]
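The three conditions above can be checked mechanically for a small finite set. The following is a minimal Python sketch; the function name `is_sigma_algebra` and the example collections `good` and `bad` are hypothetical and not part of the question.

```python
from itertools import combinations

def is_sigma_algebra(S, Sigma):
    """Check the three defining conditions for a collection Sigma of subsets of a finite set S."""
    S = frozenset(S)
    Sigma = {frozenset(A) for A in Sigma}
    # Every member must be a subset of S.
    if any(not A <= S for A in Sigma):
        return False
    # Condition 1: S itself belongs to Sigma.
    if S not in Sigma:
        return False
    # Condition 2: closure under complementation.
    if any(S - A not in Sigma for A in Sigma):
        return False
    # Condition 3: closure under (pairwise) union.
    return all(A | B in Sigma for A, B in combinations(Sigma, 2))

S = {1, 2, 3, 4}
good = [set(), {1, 2}, {3, 4}, S]
bad = [set(), {1}, {2}, S]  # fails: the complement of {1} is missing
```

Only closure under union is tested, which is enough here because closure under intersection then follows from closure under complementation.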
(b) Since the map from X^(ℓ) to X^(ℓ+1) is a deterministic function, we have S(X^(ℓ)) ⊇ S(X^(ℓ+1)),
from which the statement follows. Equality will hold in each case if the function
from X^(ℓ) to X^(ℓ+1) is one-to-one (it is fine to say bijective, invertible, or
injective). [SEEN]
[3 marks]
(c) Suppose that the output of the original network is Y = (Y1, Y2, ..., Ym), where
there are m neurons, and the output of the new network is Y′ = (Y′1, Y′2, ..., Y′m−1).
Then Y′1 = Y1, Y′2 = Y2, ..., Y′m−1 = Ym−1, and we can define a deterministic function
from Y to Y′. Hence S(Y′) ⊆ S(Y). [SEEN SIMILAR]
[4 marks]
(d) Suppose A, B ∈ Σ and we have closure under union. Then by closure under
complement, (S \ A) ∈ Σ and (S \ B) ∈ Σ. By the assumed closure under union,
we have (S \ A) ∪ (S \ B) ∈ Σ. But the set T = (S \ A) ∪ (S \ B) contains
exactly the elements of S that are not in both A and B, so (A ∩ B) = (S \ T). Since
T ∈ Σ, we have S \ T ∈ Σ by closure under complement, so (A ∩ B) ∈ Σ.
Suppose A, B ∈ Σ and we have closure under intersection. Then by closure
under complement, (S \ A) ∈ Σ and (S \ B) ∈ Σ. By the assumed closure under
intersection, we have (S \ A) ∩ (S \ B) ∈ Σ. But the set T′ = (S \ A) ∩ (S \ B)
contains exactly the elements of S that are in neither A nor B, so (A ∪ B) = (S \ T′).
Since T′ ∈ Σ, we have S \ T′ ∈ Σ by closure under complement, so (A ∪ B) ∈ Σ.
NB: it is fine to show just one way and say ‘The other direction is similar’, as
long as there is acknowledgement of proof the other way too.
[4 marks]
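The two set identities used in the proof, A ∩ B = S \ ((S \ A) ∪ (S \ B)) and A ∪ B = S \ ((S \ A) ∩ (S \ B)), can be confirmed by brute force on a small example. The following Python sketch does this for every pair of subsets of a four-element set; the set `S` and the helper `powerset` are illustrative choices, and an exhaustive check on one set is of course not a proof.

```python
from itertools import chain, combinations

def powerset(S):
    """All subsets of the finite set S, as frozensets."""
    items = list(S)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(c) for c in subsets]

S = frozenset({1, 2, 3, 4})
subsets = powerset(S)

# Intersection written using only complement and union.
intersection_ok = all(
    A & B == S - ((S - A) | (S - B)) for A in subsets for B in subsets
)

# Union written using only complement and intersection.
union_ok = all(
    A | B == S - ((S - A) & (S - B)) for A in subsets for B in subsets
)
```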


3. (a) We have

H(X|Y ) − H(Y |X) = (H(X, Y ) − H(Y )) − (H(X, Y ) − H(X)) = H(X) − H(Y )

We can derive this (this is not necessary for a correct answer) as follows:
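\[
\begin{aligned}
H(X|Y) - H(Y|X) &= -\iint f_{X,Y}(x,y)\ln\frac{f_{X,Y}(x,y)}{f_Y(y)}\,dx\,dy
 + \iint f_{X,Y}(x,y)\ln\frac{f_{X,Y}(x,y)}{f_X(x)}\,dx\,dy \\
&= \iint f_{X,Y}(x,y)\ln\frac{f_{X,Y}(x,y)\,f_Y(y)}{f_{X,Y}(x,y)\,f_X(x)}\,dx\,dy \\
&= \iint f_{X,Y}(x,y)\ln\frac{f_Y(y)}{f_X(x)}\,dx\,dy \\
&= -\iint f_{X,Y}(x,y)\ln\left(f_X(x)\right)dx\,dy + \iint f_{X,Y}(x,y)\ln\left(f_Y(y)\right)dx\,dy \\
&= -\int\left(\int f_{X,Y}(x,y)\,dy\right)\ln\left(f_X(x)\right)dx
 + \int\left(\int f_{X,Y}(x,y)\,dx\right)\ln\left(f_Y(y)\right)dy \\
&= -\int f_X(x)\ln\left(f_X(x)\right)dx + \int f_Y(y)\ln\left(f_Y(y)\right)dy \\
&= H(X) - H(Y)
\end{aligned}
\]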

Since H(X|Y ) − H(Y |X) = H(X) − H(Y ), we have H(X|Y ) = H(Y |X) if and
only if H(X) = H(Y ). [SEEN]
[3 marks]
(b) The K-L divergence of random variables X and Y (where defined) is defined as

\[
D_{KL}(f_X \| f_Y) = \int f_X(x)\ln\frac{f_X(x)}{f_Y(x)}\,dx
\]

where fX(x) and fY(x) are the PDFs of X and Y, and the integral is over the
domain of fX and fY.
To show that the K-L divergence is non-negative, we use Jensen's inequality:
since ln(x) is concave, we have, for a PDF p(x) and a positive-valued function q(x):

\[
\int p(x)\ln(q(x))\,dx \le \ln\left(\int p(x)\,q(x)\,dx\right)
\]


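Applying this with p = fX and q = fY/fX gives

\[
\begin{aligned}
D_{KL}(f_X \| f_Y) &= -\int f_X(x)\ln\frac{f_Y(x)}{f_X(x)}\,dx \\
&\ge -\ln\left(\int f_X(x)\frac{f_Y(x)}{f_X(x)}\,dx\right) = -\ln\left(\int f_Y(x)\,dx\right) \\
&= -\ln(1) = 0
\end{aligned}
\]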

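Alternatively, using the more specific inequality ln(x) ≤ x − 1, we have

\[
\begin{aligned}
D_{KL}(f_X \| f_Y) &= -\int f_X(x)\ln\frac{f_Y(x)}{f_X(x)}\,dx \\
&\ge -\int f_X(x)\left(\frac{f_Y(x)}{f_X(x)} - 1\right)dx \\
&= -\int f_Y(x)\,dx + \int f_X(x)\,dx \\
&= -1 + 1 = 0
\end{aligned}
\]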

[5 marks]
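Non-negativity can be illustrated numerically on discrete distributions, with sums standing in for the integrals. This is a sketch under that assumption only; the probability vectors `p` and `q` and the function name `kl_divergence` are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Discrete K-L divergence sum_i p_i ln(p_i / q_i), in nats.
    Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
d_pp = kl_divergence(p, p)

# Termwise version of the second argument: -ln(t) >= 1 - t for t = q_i / p_i.
termwise_ok = all(-math.log(qi / pi) >= 1 - qi / pi for pi, qi in zip(p, q))
```

Both `d_pq` and `d_qp` are strictly positive and generally unequal, while `d_pp` is zero, consistent with the bound being attained when the two distributions coincide.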


(c) We have

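\[
\begin{aligned}
I(X,Y) &= H(X) + H(Y) - H(X,Y) \\
&= -\int f_X(x)\ln(f_X(x))\,dx - \int f_Y(y)\ln(f_Y(y))\,dy
 + \iint f_{X,Y}(x,y)\ln\left(f_{X,Y}(x,y)\right)dx\,dy \\
&= -\int\left(\int f_{X,Y}(x,y)\,dy\right)\ln(f_X(x))\,dx
 - \int\left(\int f_{X,Y}(x,y)\,dx\right)\ln(f_Y(y))\,dy \\
&\qquad + \iint f_{X,Y}(x,y)\ln\left(f_{X,Y}(x,y)\right)dx\,dy \\
&= -\iint f_{X,Y}(x,y)\ln(f_X(x))\,dx\,dy - \iint f_{X,Y}(x,y)\ln(f_Y(y))\,dx\,dy \\
&\qquad + \iint f_{X,Y}(x,y)\ln\left(f_{X,Y}(x,y)\right)dx\,dy \\
&= \iint f_{X,Y}(x,y)\ln\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)}\,dx\,dy
 \quad \left(= D_{KL}\left(f_{X,Y}(x,y) \,\|\, f_X(x) f_Y(y)\right)\right) \\
&= \iint f_Y(y)\,\frac{f_{X,Y}(x,y)}{f_Y(y)}\,
 \ln\left(\frac{f_{X,Y}(x,y)/f_Y(y)}{f_X(x)}\right)dx\,dy \\
&= \iint f_Y(y)\,f_{X|Y=y}(x)\ln\frac{f_{X|Y=y}(x)}{f_X(x)}\,dx\,dy \\
&= \int f_Y(y)\,D_{KL}\left(f_{X|Y=y} \,\|\, f_X\right)dy \\
&= \mathbb{E}_{y\sim Y}\left[D_{KL}\left(f_{X|Y=y} \,\|\, f_X\right)\right]
\end{aligned}
\]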

as required. [UNSEEN]
[7 marks]
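The final identity, that I(X, Y) equals the expectation over Y of the K-L divergence between the conditional and marginal distributions of X, can be checked on a small discrete example. The sketch below replaces integrals with sums, which is an assumption for illustration; the table `joint` and the helper `kl_cond_vs_marginal` are hypothetical.

```python
import math

# Hypothetical joint distribution; keys are (x, y) pairs.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals of X and Y.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X, Y) as the K-L divergence between the joint and the product of marginals.
mi = sum(
    p * math.log(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0
)

def kl_cond_vs_marginal(y):
    """K-L divergence between the conditional of X given Y = y and the marginal of X."""
    total = 0.0
    for x in p_x:
        p_cond = joint.get((x, y), 0.0) / p_y[y]
        if p_cond > 0:
            total += p_cond * math.log(p_cond / p_x[x])
    return total

# Expectation over Y of the conditional-vs-marginal divergence.
expected_kl = sum(p_y[y] * kl_cond_vs_marginal(y) for y in p_y)
```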


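4. (a) We have h_i = ϕ(W h_{i−1} + U X_i + b). The value of W affects h_i both directly
through this equation, and through h_{i−1}. Thus we have for i > 1:

\[
\alpha_i = \frac{dh_i}{dW} = \frac{\partial h_i}{\partial W} + \frac{\partial h_i}{\partial h_{i-1}}\frac{dh_{i-1}}{dW}
 = \beta_i + \gamma_i\,\alpha_{i-1}
\]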

[4 marks]
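(b) If n = 2 then, using α_1 = β_1, we have

\[
\alpha_2 = \beta_2 + \gamma_2\alpha_1 = \beta_2 + \sum_{i=1}^{2-1}\left(\prod_{j=i+1}^{2}\gamma_j\right)\beta_i
\]

Suppose by induction that the formula in the question holds for all indices up to n − 1.
Then we have

\[
\begin{aligned}
\alpha_n &= \beta_n + \gamma_n\alpha_{n-1} \\
&= \beta_n + \gamma_n\left(\beta_{n-1} + \sum_{i=1}^{n-2}\left(\prod_{j=i+1}^{n-1}\gamma_j\right)\beta_i\right) \\
&= \beta_n + \gamma_n\beta_{n-1} + \sum_{i=1}^{n-2}\gamma_n\left(\prod_{j=i+1}^{n-1}\gamma_j\right)\beta_i \\
&= \beta_n + \sum_{i=n-1}^{n-1}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i
 + \sum_{i=1}^{n-2}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i \\
&= \beta_n + \sum_{i=1}^{n-1}\left(\prod_{j=i+1}^{n}\gamma_j\right)\beta_i
\end{aligned}
\]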

which gives the requisite result by induction. [UNSEEN]

[7 marks]
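The agreement between the recursion α_i = β_i + γ_i α_{i−1} and the closed form can be checked numerically. The sketch below treats every β_i and γ_i as a scalar and takes α_1 = β_1; both are simplifying assumptions made for illustration, and the function names and sample values are hypothetical.

```python
import math

def alpha_recursive(beta, gamma):
    """alpha_1 = beta_1, then alpha_i = beta_i + gamma_i * alpha_{i-1}.
    Lists are 0-based, so beta[0] is beta_1; gamma[0] (gamma_1) is never used."""
    alpha = beta[0]
    for b, g in zip(beta[1:], gamma[1:]):
        alpha = b + g * alpha
    return alpha

def alpha_closed_form(beta, gamma):
    """beta_n + sum_{i=1}^{n-1} (prod_{j=i+1}^{n} gamma_j) * beta_i."""
    n = len(beta)
    total = beta[n - 1]
    for i in range(1, n):  # 1-based i = 1, ..., n-1
        # gamma_{i+1}, ..., gamma_n in 1-based terms are gamma[i], ..., gamma[n-1].
        total += math.prod(gamma[i:n]) * beta[i - 1]
    return total

beta = [0.5, -1.0, 2.0, 0.25]
gamma = [0.0, 1.5, -0.5, 2.0]

rec = alpha_recursive(beta, gamma)
closed = alpha_closed_form(beta, gamma)
```

The closed form makes the role of the products of γ_j explicit: the contribution of an early β_i is scaled by n − i factors, which is where vanishing or exploding behaviour comes from.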
(c) The exploding gradient problem can arise when activation functions have gradients
which exceed 1. An example could be the scaled sigmoid function:

\[
\phi(x) = \left(1 + \exp(-5x)\right)^{-1} \qquad (6)
\]

whose derivative at x = 0 is 5/4.

[4 marks]
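A short Python sketch of why the scaled sigmoid in (6) qualifies: its derivative is 5ϕ(x)(1 − ϕ(x)), which peaks at 5/4 when x = 0. The growth figures below assume, purely for illustration, that every factor in a product of n gradients equals ϕ′(0); the names `phi`, `phi_prime` and `growth` are hypothetical.

```python
import math

def phi(x, scale=5.0):
    """Scaled sigmoid (1 + exp(-scale * x))^(-1)."""
    return 1.0 / (1.0 + math.exp(-scale * x))

def phi_prime(x, scale=5.0):
    """Derivative of the scaled sigmoid: scale * phi(x) * (1 - phi(x))."""
    s = phi(x, scale)
    return scale * s * (1.0 - s)

# Largest slope, attained at x = 0.
max_slope = phi_prime(0.0)

# Illustrative only: a product of n factors, each equal to the maximum slope.
growth = [max_slope ** n for n in (1, 10, 20)]
```

The slope exceeds 1 only near x = 0 (it decays quickly as |x| grows), so whether gradients actually explode depends on where the pre-activations lie and on the weights.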
