Mock Endterm ADL 2021


E9 309 – Advanced Deep Learning

Mock Final Exam


December 2, 2021

Instructions

1. This exam is open book. However, computers, mobile phones and other handheld devices
are not allowed.

2. Any reference materials that are used in the exam (other than materials distributed on
the course webpage) should be pre-approved by the instructor before the exam.

3. No additional resources (other than those pre-approved) are allowed for use in the exam.

4. Academic integrity and ethics of the highest order are expected.

5. Notation - bold symbols are vectors, capital bold symbols are matrices and regular symbols
are scalars.

6. Answer all questions.

7. Total Duration - 240 minutes including answer upload

8. Total Marks - 100 points

Name - ................................

Dept. - ....................

SR Number - ....................
1. Should all samples contribute the same?
Rohan is exploring a new deep autoencoding model where all the training samples are weighted differently in the model training. He has a set of training samples {x_i}_{i=1}^N and a set of validation samples {x_j^v}_{j=1}^M. Let θ = {θ_l}_{l=1}^L denote the parameters of the deep learning model with L layers, with input and output hidden activations denoted as a_l and z_l, i.e.,

$$z_l = \theta_l^T a_{l-1}, \qquad a_l = \sigma(z_l) \quad \text{for } l = 1 \ldots L; \qquad a_0 = x$$

where σ is some element-wise non-linearity.


Let f_i(θ) denote the loss function computed for the i-th training sample.
Further, consider a mini-batch consisting of i = 1..n training samples. Rohan devises a new approach where every training sample is weighted differently. Let f_{i,ϵ}(θ) = ϵ_i f_i(θ) denote the weighted loss function for sample i. The model parameters θ are learned using,

$$\theta^{t+1}(\epsilon) = \theta^{t} - \eta \sum_{i=1}^{n} \left. \frac{\partial f_{i,\epsilon}(\theta)}{\partial \theta} \right|_{\theta = \theta^{t}}$$

The ϵ^t = {ϵ_i^t}_{i=1}^n parameters are defined only on the training samples and they are treated as hyper-parameters which are learned on the validation data after each mini-batch using,

$$\epsilon^{*,t+1} = \arg\min_{\epsilon} \mathbb{E}[f^{v}(\theta^{t+1}(\epsilon))] = \arg\min_{\epsilon} \frac{1}{M} \sum_{j=1}^{M} f_j^{v}(\theta^{t+1}(\epsilon))$$

In this new model training, can you help Rohan by showing the following update equation,

$$\frac{\partial}{\partial \epsilon_i} \mathbb{E}[f^{v}(\theta^{t+1}(\epsilon))] \propto -\frac{1}{M} \sum_{j=1}^{M} \sum_{l=1}^{L} \left( (a_{j,l-1}^{v})^{T} a_{i,l-1} \right) \left( (g_{j,l}^{v})^{T} g_{i,l} \right)$$

where g_{i,l} = ∂f_i/∂z_l denotes the gradient of the loss with respect to the hidden activation.
(Points 20)
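The reweighting scheme above is easy to prototype with automatic differentiation. Below is a minimal PyTorch sketch, assuming a toy one-layer autoencoding-style loss and made-up sizes (D, n, M, η); it only illustrates how the gradient of the validation objective with respect to the per-sample weights ϵ can be obtained through the one-step update θ^{t+1}(ϵ).

```python
# Minimal sketch of the per-sample reweighting set-up (toy sizes, not Rohan's full model).
import torch

D, n, M, eta = 8, 4, 6, 0.1                        # assumed dimensions and step size
theta = torch.randn(D, D, requires_grad=True)      # current parameters theta^t
x_train = torch.randn(n, D)                        # mini-batch of training samples
x_val = torch.randn(M, D)                          # validation samples
eps = torch.zeros(n, requires_grad=True)           # per-sample weights eps_i (zero init, for illustration)

def recon_loss(theta, x):
    # toy autoencoding-style per-sample loss: || relu(theta^T x) - x ||^2 (row-vector convention)
    z = x @ theta
    return ((torch.relu(z) - x) ** 2).sum(dim=1)

# weighted mini-batch update: theta^{t+1}(eps) = theta^t - eta * sum_i d(eps_i f_i)/d theta
f_train = recon_loss(theta, x_train)                                  # f_i(theta)
grad_theta = torch.autograd.grad((eps * f_train).sum(), theta, create_graph=True)[0]
theta_next = theta - eta * grad_theta                                 # differentiable in eps

# validation objective E[f^v(theta^{t+1}(eps))] and its gradient w.r.t. eps
val_loss = recon_loss(theta_next, x_val).mean()
grad_eps = torch.autograd.grad(val_loss, eps)[0]                      # the quantity analysed above
print(grad_eps)
```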
2. Mayur and Shalini are exploring two methods for explaining a deep convolutional model without any pooling and with zero-padding (the size of the feature maps at all layers is the same as the input image size). Let f_k^L denote the 2-D feature map of the last convolutional layer and let h_k^L denote the average of all entries of the feature map f_k^L. Their model has a final layer which is a feed-forward connection of size K × C connecting h^L ∈ R^K with the pre-softmax output a ∈ R^C. Here, K is the number of feature maps in the last convolutional layer while C is the number of classes. Let the last-layer feed-forward weights be denoted as w_{k,c}.

• Mayur’s model of explainability is to use the image m(c) = Σ_k w_{k,c} f_k^L for highlighting the input regions that are used in predicting the class c.
• Shalini’s model of explainability is to use the image m(c) = Σ_k β_{k,c} f_k^L for highlighting the input regions that are used in predicting the class c, where

$$\beta_{k,c} = \sum_{i} \sum_{j} \frac{\partial a_c}{\partial f_k^{L}(i,j)}$$

Is there a connection between their models of explainability?


(Points 10)
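The two saliency definitions can be compared numerically on a toy stand-in for the last layers. A minimal sketch, assuming K random feature maps f_k^L and random final-layer weights w_{k,c} (all sizes are illustrative):

```python
# Toy comparison of Mayur's and Shalini's saliency maps for one class c.
import torch

K, C, H, W_img = 4, 3, 8, 8                          # feature maps, classes, map size (assumed)
f_L = torch.randn(K, H, W_img, requires_grad=True)   # last-layer feature maps f_k^L
w = torch.randn(K, C)                                # final feed-forward weights w_{k,c}

h_L = f_L.mean(dim=(1, 2))                           # h_k^L: average of each feature map
a = h_L @ w                                          # pre-softmax outputs a_c
c = 0                                                # class to explain

# Mayur: m(c) = sum_k w_{k,c} f_k^L
m_mayur = (w[:, c].view(K, 1, 1) * f_L).sum(dim=0)

# Shalini: beta_{k,c} = sum_{i,j} d a_c / d f_k^L(i,j), then m(c) = sum_k beta_{k,c} f_k^L
grads = torch.autograd.grad(a[c], f_L)[0]            # d a_c / d f_k^L(i,j)
beta = grads.sum(dim=(1, 2))                         # beta_{k,c}
m_shalini = (beta.view(K, 1, 1) * f_L).sum(dim=0)

print(torch.allclose(m_mayur, m_shalini))            # numerical check of the two maps
```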
3. A deep neural network with weight parameters W is trained on data D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. For regression/classification, mathematically illustrate how variational learning can be used for Bayesian learning of the model parameters. (No assumptions about the prior are needed.)
(Points 10)
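As one concrete (and deliberately simplified) illustration of the variational approach, the sketch below fits a mean-field Gaussian posterior q(W) for a single linear regression layer by maximising the ELBO with the reparameterisation trick; the standard-normal prior, the Gaussian likelihood and all sizes are assumptions made only for this example.

```python
# Mean-field variational Bayes for one linear layer (illustrative sketch).
import torch

N, D_in, D_out = 32, 10, 1
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)                            # regression targets (toy data)

mu = torch.zeros(D_in, D_out, requires_grad=True)               # variational mean of q(W)
log_sigma = torch.full((D_in, D_out), -3.0, requires_grad=True) # variational log-std of q(W)
opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)

for step in range(200):
    sigma = log_sigma.exp()
    W = mu + sigma * torch.randn_like(mu)            # reparameterised sample W ~ q(W)
    log_lik = -((x @ W - y) ** 2).sum()              # log p(D | W) up to constants (unit noise)
    # closed-form KL( q(W) || p(W) ) for Gaussian q and standard-normal prior p
    kl = 0.5 * (sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma).sum()
    loss = -(log_lik - kl)                           # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
```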
4. Hierarchical Modeling of Vector Series
Meha is working on her project on time-series data, x_t ∈ R^D, t = 1...T, modeled with neural networks, where T = 2^L for all sequences in training. She explores a new model defined as

$$\text{layer } 0: \quad x_{t_0}; \qquad t_0 \in [1..T]$$

$$\text{layer } 1: \quad h_1^{t_1} = \sigma\left( W_1 \left( U_1 x_{2t_1 - 1} \odot V_1 x_{2t_1} \right) \right); \qquad t_1 \in [1..\tfrac{T}{2}]$$

$$\text{layer } l: \quad h_l^{t_l} = \sigma\left( W_l \left( U_l h_{l-1}^{2t_l - 1} \odot V_l h_{l-1}^{2t_l} \right) \right); \qquad t_l \in [1..\tfrac{T}{2^l}]$$

where l = 1..L denotes the layer index, ⊙ denotes element-wise product, and σ denotes the element-wise ReLU non-linearity. The parameters of the model are θ = {W_l, U_l, V_l ∈ R^{D×D}}_{l=1}^L. Let z_l^{t_l} denote the hidden activation of the l-th layer, z_l^{t_l} = (U_l h_{l-1}^{2t_l−1} ⊙ V_l h_{l-1}^{2t_l}). The final hidden layer output h_L is mapped to the output ŷ through feed-forward connections and a softmax non-linearity.

(a) Meha claims that her model definition preserves variance at initialization. Specifically, she assumes that the parameters θ are all initialized with independent, identically distributed random variables with zero mean. Further, the data dimensions are also assumed to be i.i.d. and independent of the model parameters, with all the non-linearities operating in the linear region. If the variance of the entries of U_l, V_l, W_l is chosen as 1/D, then she claims that the variance of the last hidden layer is the same as the variance of the input,

$$\mathrm{Var}(h_L) = \mathrm{Var}(x_t)$$

Is she correct in her claim?


(b) Derive the update equations for the model parameters that will help Meha learn the
model.

(Points 20)
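To make the layer definitions concrete, here is a minimal sketch of the forward pass, assuming toy values of D and L and drawing the parameters with variance 1/D as in part (a):

```python
# Forward pass of the hierarchical pairing model (illustrative sizes).
import torch

D, L = 16, 3
T = 2 ** L
x = torch.randn(T, D)                                # x_t, t = 1..T

# parameters theta = {W_l, U_l, V_l in R^{D x D}}, entries with variance 1/D
params = [{k: torch.randn(D, D) / D**0.5 for k in ("W", "U", "V")} for _ in range(L)]

h = x                                                # layer-0 activations
for l in range(L):
    W, U, V = params[l]["W"], params[l]["U"], params[l]["V"]
    odd, even = h[0::2], h[1::2]                     # h_{l-1}^{2t_l - 1} and h_{l-1}^{2t_l}
    z = (odd @ U.T) * (even @ V.T)                   # z_l^{t_l} = U_l h ⊙ V_l h (row-vector form)
    h = torch.relu(z @ W.T)                          # h_l^{t_l} = σ(W_l z_l^{t_l})

print(h.shape)                                       # (1, D): the single final hidden vector h_L
```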
5. GCN: Let X = {x_1, ..., x_n}, x_i ∈ R^D, be the input features, which are represented as the vertices of a graph G = (V, E). A is the adjacency matrix of the graph, which stores the edge weights between vertices. We build a graph convolutional neural network with two layers of GCN and one fully connected layer with softmax activations for K-class classification. Let z = {z_1, .., z_n} be the true class labels for each input and x_i^0 = x_i ∀ i = {1, .., n}. The update scheme between two layers is:

$$x_i^{l} = \sigma\left( W^{l} \sum_{j} L_{ij}\, x_j^{l-1} \right)$$

where L = D̂^{−1/2} Â D̂^{−1/2} is the normalized affinity matrix with added self-connections Â = A + I_N, D̂ denotes the diagonal degree matrix of Â (the diagonal elements are defined as d̂_{ii} = Σ_j â_{ij}), W^l ∈ R^{D^l × D^{l−1}} is a layer-specific trainable weight matrix, and σ(·) is a ReLU activation. The final output is obtained as y = softmax(W^L x^{L−1}).
Find the update rule of W for all layers if we use the cross-entropy loss for training.

(Points 15)
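A minimal sketch of this set-up, assuming a small random graph and illustrative layer sizes; autograd supplies the gradients ∂loss/∂W^l that the requested update rule describes analytically:

```python
# Two-layer GCN plus softmax classifier on a toy random graph (illustrative only).
import torch

n, D, H, K = 6, 8, 16, 3                             # nodes, input dim, hidden dim, classes
X = torch.randn(n, D)                                # node features
A = (torch.rand(n, n) > 0.7).float()
A = ((A + A.T) > 0).float()                          # random symmetric adjacency

A_hat = A + torch.eye(n)                             # A_hat = A + I_N (self connections)
d = A_hat.sum(dim=1)                                 # degrees of A_hat
L_hat = A_hat / torch.sqrt(d).unsqueeze(1) / torch.sqrt(d).unsqueeze(0)  # D^-1/2 A_hat D^-1/2

W1 = torch.randn(D, H, requires_grad=True)           # layer-1 weights
W2 = torch.randn(H, H, requires_grad=True)           # layer-2 weights
W3 = torch.randn(H, K, requires_grad=True)           # final fully connected weights

h1 = torch.relu(L_hat @ X @ W1)                      # x_i^1 = relu(W^1 sum_j L_ij x_j^0)
h2 = torch.relu(L_hat @ h1 @ W2)                     # second GCN layer
logits = h2 @ W3                                     # pre-softmax outputs
z = torch.randint(0, K, (n,))                        # true labels (toy)
loss = torch.nn.functional.cross_entropy(logits, z)  # cross-entropy loss
loss.backward()                                      # gradients d loss / d W^l for the update rule
```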
6. Robot navigation with GANs
In robot navigation, the key question is: can the robot navigate through a situation?

Figure 1: Example of different situations. Left: traversable; Right: non-traversable

You want to estimate the traversability of a situation for a robot. Traversable data is
easy to collect (e.g. going through a corridor) while non-traversable data is very costly
(e.g. going down the stairs). You have a large and rich dataset X of traversable images,
but no non-traversable images.
The question you are trying to answer is: “Is it possible to train a neural network that
classifies whether or not a situation is traversable using only dataset X ?” More precisely,
if a non-traversable image was fed into the network, you want the network to predict that
it is non-traversable. In this part, you will use a Generative Adversarial Network (GAN)
to solve this problem.

(a) Consider that you have trained a network f_w : R^{n_x×1} → R^{n_y×1}. The parameters of the network are denoted w. Given an input x ∈ R^{n_x×1}, the network outputs ŷ = f_w(x) ∈ R^{n_y×1}.
Given an arbitrary output ŷ*, you would like to use gradient descent optimization to generate an input x* such that f_w(x*) = ŷ*.
(i) Write down the formula of the l2 loss function you would use.
(ii) Write down the update rule of the gradient descent optimizer in terms of l2
norm.
(iii) Calculate the gradient of the loss in your update rule with respect to the input.
(b) Now, let us go back to the traversability problem. In particular, imagine that you have successfully trained a perfect GAN with generator G and discriminator D on X. As a consequence, given a code z ∈ R^C, G(z) will look like a traversable image.
(i) Consider a new image x. How can you find a code z such that the output of the generator G(z) would be as close as possible to x?
(ii) Suppose you’ve found z such that G(z) is the closest possible value to x out of
all possible z. How can you decide if x represents a traversable situation or not?
Give a qualitative explanation.
(iii) Instead of using the method above, Amelia suggests directly running x through
the discriminator D. Amelia believes that if D(x) predicts that it is a real image,
then x is likely a traversable situation. Else, it is likely to be a non-traversable
situation. Do you think that Amelia’s method would work and why?

(Points 15)
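The code-search step in part (b)(i) can be sketched directly as gradient descent on the latent code. In the sketch below the generator is a stand-in MLP and all sizes are assumed; the point is only the optimisation of z and the use of the final reconstruction error.

```python
# Latent-code search: find z minimising ||G(z) - x||_2^2 for a frozen generator G.
import torch

C, n_x = 16, 64                                      # assumed code and image sizes
G = torch.nn.Sequential(torch.nn.Linear(C, 128), torch.nn.ReLU(),
                        torch.nn.Linear(128, n_x))   # stand-in for the trained generator
for p in G.parameters():
    p.requires_grad_(False)                          # the generator stays fixed

x = torch.randn(n_x)                                 # new image to test (toy)
z = torch.zeros(C, requires_grad=True)               # code to optimise
opt = torch.optim.Adam([z], lr=1e-2)

for step in range(500):
    loss = ((G(z) - x) ** 2).sum()                   # l2 reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()

recon_error = ((G(z) - x) ** 2).sum().item()         # error to threshold for traversability
```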
7. A capsule network is defined on MNIST images (input size 28 × 28) with one convolutional layer (256 kernels of size 9 × 9, no pooling) and one capsule layer (9 × 9 convolutions with stride 2, 8-D capsules and 32 kernels), followed by 16-D digit capsules. Define the forward equations with a reconstruction loss and a margin loss. Show the backward propagation for updating the model parameters. Simplify as much as possible. Explain how the routing algorithm works in this case.
(Points 10)
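To make the routing step concrete, below is a minimal sketch of the squash non-linearity and dynamic routing-by-agreement between the primary capsules and the 16-D digit capsules. The prediction vectors û_{j|i} are random placeholders, and the primary-capsule count 32 × 6 × 6 = 1152 assumes valid convolutions at both layers.

```python
# Squash non-linearity and routing-by-agreement (illustrative, not the full network).
import torch

def squash(s, dim=-1):
    # v = (||s||^2 / (1 + ||s||^2)) * s / ||s||
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + 1e-9)

n_primary, n_digit, d_out = 1152, 10, 16             # 32 x 6 x 6 primary capsules, 10 digit capsules
u_hat = torch.randn(n_primary, n_digit, d_out)       # predictions u_hat_{j|i} = W_{ij} u_i (placeholder)

b = torch.zeros(n_primary, n_digit)                  # routing logits b_{ij}
for _ in range(3):                                   # routing iterations
    c = torch.softmax(b, dim=1)                      # coupling coefficients c_{ij}
    s = (c.unsqueeze(-1) * u_hat).sum(dim=0)         # s_j = sum_i c_{ij} u_hat_{j|i}
    v = squash(s)                                    # digit-capsule outputs v_j
    b = b + (u_hat * v.unsqueeze(0)).sum(dim=-1)     # agreement update: b_{ij} += u_hat_{j|i} . v_j
```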
