
CSC413 Assignment 2

Deadline: Nov 7, 2023 by 6pm EST


Submission: Compile and submit a PDF report containing your written solutions. You may also submit an
image of your legible hand-written solutions. Submissions will be done on Markus.
Late Submission: Please see the syllabus for the late submission criteria. You must work individually on this
assignment.

Question 1. Dead Units (3 pts)


Consider the following neural network, where $x \in \mathbb{R}^2$, $h \in \mathbb{R}^2$, and $y \in \mathbb{R}^2$.

$$h = \mathrm{ReLU}\big(W^{(1)} x + b^{(1)}\big)$$

$$y = w^{(2)} h + b^{(2)}$$

Suppose also that each element of x is between -1 and 1.

Part (a)
Come up with example values of the parameters $W^{(1)}$ and $b^{(1)}$ such that both hidden units $h_1$ and $h_2$ are dead.
Answer:
$$W^{(1)} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \qquad b^{(1)} = \begin{pmatrix} -1 \\ -1 \end{pmatrix}$$

Regardless of the input, the pre-activation is $W^{(1)} x + b^{(1)} = [-1, -1]^T$, so both hidden units $h_1$ and $h_2$ always equal $\mathrm{ReLU}(-1) = \max(-1, 0) = 0$; the units are dead.
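As a quick sanity check (not part of the required answer), a minimal PyTorch sketch with these parameters shows that the hidden activations are zero for any input in $[-1, 1]^2$:

```python
# Minimal numerical check: with W^(1) = 0 and b^(1) = (-1, -1)^T,
# h is zero for every x with entries in [-1, 1], so both units are dead.
import torch

W1 = torch.zeros(2, 2)            # W^(1)
b1 = torch.tensor([-1.0, -1.0])   # b^(1)

for _ in range(5):
    x = torch.empty(2).uniform_(-1.0, 1.0)   # any allowed input
    h = torch.relu(W1 @ x + b1)
    print(h)   # always tensor([0., 0.])
```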

Part (b)
Show that the gradients of $y$ with respect to $W^{(1)}$ and $b^{(1)}$ are zero.
Answer:
Let $z = W^{(1)} x + b^{(1)}$. Then

$$\frac{\partial y}{\partial W^{(1)}} = \frac{\partial y}{\partial h} \cdot \left.\frac{\partial h}{\partial z}\right|_{z_1 = z_2 = -1} \cdot \frac{\partial z}{\partial W^{(1)}} = w^{(2)} \cdot 0 \cdot x = 0$$

$$\frac{\partial y}{\partial b^{(1)}} = \frac{\partial y}{\partial h} \cdot \left.\frac{\partial h}{\partial z}\right|_{z_1 = z_2 = -1} \cdot \frac{\partial z}{\partial b^{(1)}} = w^{(2)} \cdot 0 \cdot I = 0$$

Since $z_1 = z_2 = -1 < 0$, the ReLU derivative $\partial h / \partial z$ is zero, so both gradients vanish.
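A short autograd sketch confirms this (illustrative only; the values chosen for $w^{(2)}$, $b^{(2)}$, and $x$ are arbitrary assumptions):

```python
# Autograd check: gradients w.r.t. W^(1) and b^(1) are zero because the
# ReLU derivative is zero at the (negative) pre-activations of dead units.
import torch

W1 = torch.zeros(2, 2, requires_grad=True)
b1 = torch.full((2,), -1.0, requires_grad=True)
w2 = torch.tensor([1.0, 1.0])     # arbitrary second-layer weights (assumed)
b2 = torch.tensor(0.5)            # arbitrary second-layer bias (assumed)

x = torch.tensor([0.3, -0.7])     # any x in [-1, 1]^2
h = torch.relu(W1 @ x + b1)
y = w2 @ h + b2
y.backward()

print(W1.grad)   # tensor of zeros
print(b1.grad)   # tensor of zeros
```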

Question 2. Dropout (3 pts)


Part (a)
In a dropout layer, instead of “zeroing out” activations at test time, we multiply the weights by 1 − p, where p
is the probability that an activation is set to zero during training. Explain why the multiplication by 1 − p is
necessary for the neural network to make meaningful predictions.
Answer:

During training, dropout sets each activation to zero with probability $p$ (so each unit is kept with probability $1 - p$), which effectively trains a thinned sub-network on every mini-batch. At test time no units are dropped, so every neuron contributes to the next layer. If the weights were left unchanged, the expected input to each unit would then be larger than what the network saw during training. Multiplying the weights by $1 - p$ at test time keeps the expected value of each pre-activation the same as during training, so the learned weights remain calibrated and the network makes meaningful predictions.
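A minimal numerical sketch of this argument (illustrative, not part of the official solution; it simulates dropout by hand with a random mask):

```python
# Zeroing activations with probability p makes a unit's expected output
# (1 - p) times its full value; scaling by (1 - p) at test time matches it.
import torch

p = 0.5
a = torch.ones(100_000)                  # some layer's activations
mask = (torch.rand_like(a) > p).float()  # keep each unit with prob. 1 - p

train_out = mask * a                     # training-time dropout
test_out = (1 - p) * a                   # test-time scaling instead

print(train_out.mean())  # approximately 0.5
print(test_out.mean())   # exactly 0.5, matching the training-time expectation
```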

Part (b)
Explain the difference between model.train() and model.eval() modes of evaluating a network in Pytorch. Does
the Dropout layer in Pytorch behave differently in these two modes? Feel free to look at the online documen-
tation for Pytorch.
Answer:

model.train() puts the network in training mode: Dropout layers are active and randomly zero a fraction of activations on each forward pass, which helps prevent overfitting. model.eval() puts the network in evaluation mode: Dropout layers are disabled and pass all activations through unchanged. Evaluation mode is used during validation, testing, and inference so that predictions are consistent and deterministic. So yes, the Dropout layer behaves differently in the two modes.
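A small sketch with torch.nn.Dropout illustrates the difference (note that PyTorch uses inverted dropout, scaling kept activations by $1/(1-p)$ during training rather than scaling weights at test time):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()     # training mode: units are randomly zeroed, survivors scaled by 2
print(drop(x))   # e.g. tensor([2., 0., 2., 2., 0., ...]) -- random each call

drop.eval()      # evaluation mode: dropout is a no-op
print(drop(x))   # tensor([1., 1., 1., 1., 1., 1., 1., 1.]) -- deterministic
```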

Question 3. Bias-variance decomposition (4 pts)


Let $D = \{(x_i, y_i) \mid i = 1, \dots, n\}$ be a dataset obtained from the true underlying data distribution $P$, i.e. $D \sim P^n$, and let $h_D(\cdot)$ be a classifier trained on $D$. Show the bias-variance decomposition

$$\underbrace{E_{D,x,y}\big[(h_D(x) - y)^2\big]}_{\text{Expected test error}} = \underbrace{E_{D,x}\big[(h_D(x) - \hat{h}(x))^2\big]}_{\text{Variance}} + \underbrace{E_{x,y}\big[(\hat{y}(x) - y)^2\big]}_{\text{Noise}} + \underbrace{E_{x}\big[(\hat{h}(x) - \hat{y}(x))^2\big]}_{\text{Bias}^2}$$

where $\hat{h}(x) = E_{D \sim P^n}[h_D(x)]$ is the expected regressor over possible training sets, given the learning algorithm $A$, and $\hat{y}(x) = E_{y|x}[y]$ is the expected label given $x$. As mentioned in the lecture, labels might not be deterministic given $x$. To carry out the proof, proceed in the following steps:

Part (a)
Show that the following identity holds:

$$E_{D,x,y}\big[(h_D(x) - y)^2\big] = E_{D,x}\big[(h_D(x) - \hat{h}(x))^2\big] + E_{x,y}\big[(\hat{h}(x) - y)^2\big] \tag{1}$$

Answer:
Reformulate (1) as
$$
\begin{aligned}
E_{x,y,D}\big[(h_D(x) - y)^2\big] &= E_{x,y,D}\Big[\big[(h_D(x) - \hat{h}(x)) + (\hat{h}(x) - y)\big]^2\Big] \\
&= E_{x,D}\big[(h_D(x) - \hat{h}(x))^2\big] + 2\,E_{x,y,D}\big[(h_D(x) - \hat{h}(x))(\hat{h}(x) - y)\big] + E_{x,y}\big[(\hat{h}(x) - y)^2\big] \\
&= E_{x,D}\big[(h_D(x) - \hat{h}(x))^2\big] + E_{x,y}\big[(\hat{h}(x) - y)^2\big]
\end{aligned}
$$

Note that the second term in the above equation is zero because
$$
\begin{aligned}
E_{x,y,D}\big[(h_D(x) - \hat{h}(x))(\hat{h}(x) - y)\big] &= E_{x,y}\Big[E_D\big[h_D(x) - \hat{h}(x)\big]\,\big(\hat{h}(x) - y\big)\Big] \\
&= E_{x,y}\Big[\big(E_D[h_D(x)] - \hat{h}(x)\big)\big(\hat{h}(x) - y\big)\Big] \\
&= E_{x,y}\Big[\big(\hat{h}(x) - \hat{h}(x)\big)\big(\hat{h}(x) - y\big)\Big] \\
&= E_{x,y}[0] \\
&= 0
\end{aligned}
$$

Part (b)
Next, show

$$E_{x,y}\big[(\hat{h}(x) - y)^2\big] = E_{x,y}\big[(\hat{y}(x) - y)^2\big] + E_{x}\big[(\hat{h}(x) - \hat{y}(x))^2\big] \tag{2}$$

which completes the proof by substituting (2) into (1).

Answer:
Reformulate (2) as
 2   h  i2 
Ex,y ĥ(x) − y = Ex,y ĥ(x) − ŷ(x) + (ŷ(x) − y)
 2      
2
= Ex ĥ(x) − ŷ(x) + 2Ex,y ĥ(x) − ŷ(x) (ŷ(x) − y) + Ex,y (ŷ(x) − y)
 2   
2
= Ex ĥ(x) − ŷ(x) + Ex,y (ŷ(x) − y)

Note that the second term in the above equation is also zero because
$$
\begin{aligned}
E_{x,y}\big[(\hat{h}(x) - \hat{y}(x))(\hat{y}(x) - y)\big] &= E_{x}\Big[E_{y|x}\big[(\hat{h}(x) - \hat{y}(x))(\hat{y}(x) - y)\big]\Big] \\
&= E_{x}\Big[\big(\hat{h}(x) - \hat{y}(x)\big)\,E_{y|x}\big[\hat{y}(x) - y\big]\Big] \\
&= E_{x}\Big[\big(\hat{h}(x) - \hat{y}(x)\big)\big(\hat{y}(x) - E_{y|x}[y]\big)\Big] \\
&= E_{x}\Big[\big(\hat{h}(x) - \hat{y}(x)\big)\big(\hat{y}(x) - \hat{y}(x)\big)\Big] \\
&= E_{x}[0] \\
&= 0
\end{aligned}
$$
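The decomposition can also be checked numerically. The sketch below is a rough Monte Carlo illustration under an assumed toy setup (a constant predictor equal to the sample mean of the training labels, fit to data generated as $y = x + \varepsilon$); it is not part of the proof:

```python
# Monte Carlo sanity check: expected test error should roughly equal
# variance + noise + bias^2 under the assumed toy data distribution.
import torch

torch.manual_seed(0)
n, trials, n_test = 20, 2000, 5000
noise_std = 0.5

def sample_dataset():
    # Assumed toy distribution: x ~ U(0, 1), y = x + Gaussian noise
    x = torch.rand(n)
    y = x + noise_std * torch.randn(n)
    return x, y

# h_D(x): constant predictor equal to the mean training label of dataset D
preds = torch.stack([sample_dataset()[1].mean() for _ in range(trials)])

x_test = torch.rand(n_test)
y_test = x_test + noise_std * torch.randn(n_test)

h_bar = preds.mean()   # \hat{h}(x) (constant in x for this predictor)
y_bar = x_test         # \hat{y}(x) = E[y | x] = x

expected_test_error = ((preds[:, None] - y_test[None, :]) ** 2).mean()
variance = ((preds - h_bar) ** 2).mean()
noise = ((y_bar - y_test) ** 2).mean()
bias_sq = ((h_bar - y_bar) ** 2).mean()

# The two printed numbers should agree up to Monte Carlo error
print(expected_test_error.item(), (variance + noise + bias_sq).item())
```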

Part (c)
Explain in a sentence or two what overfitting means and which term in this formula represents it.
Answer:

Overfitting means the model fits the training data too closely: it gives accurate predictions on the training data but generalizes poorly to new data. Overfitting is captured by the variance term, $E_{D,x}[(h_D(x) - \hat{h}(x))^2]$: an overfit model changes substantially from one training set $D$ to another, so this term is large, while the squared bias term is typically small.
