

Name: ___________ Student ID: A ___________ NUS ID: E____________

1. Which one of the following statements is incorrect?


A. Deep learning models are trained over data
B. Deep learning models can be customized to obtain higher performance on larger datasets
C. Deep learning is a rule-based AI solution
D. Deep learning is also known as feature learning

2. Which one is incorrect?


A. When the model capacity increases, the chance of overfitting increases, and the model
variance increases
B. If the training error of Model A is smaller than that of Model B, we should prefer Model A
over Model B
C. Training data is for training/tuning model parameters
D. Validation data is for tuning model hyper-parameters

3. If model A predicts the probability of an instance as Sunny = 0.2, Raining = 0.5, and Cloudy = 0.3,
and the target value is Cloudy, which statement is incorrect?
A. This problem could be a multi-class single-label classification problem
B. This problem could be a multi-class multi-label classification problem
C. Assume Model A is trained for single-label classification and cross-entropy is the loss; the
loss value is $-\log 0.3$
D. Assume Model A is trained for multi-label classification and binary cross-entropy is the loss;
the loss value for label Sunny is $-\log 0.2$
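
A quick check of question 3's loss values (a minimal Python sketch; the dictionary and variable names are illustrative, not from the quiz): single-label cross-entropy takes $-\log$ of the target class probability, while binary cross-entropy scores every label independently, so a label whose target is 0 contributes $-\log(1 - p)$ rather than $-\log p$.

```python
import math

# Model A's predicted probabilities from question 3.
probs = {"Sunny": 0.2, "Raining": 0.5, "Cloudy": 0.3}

# Single-label cross-entropy: only the target class (Cloudy) contributes.
ce_loss = -math.log(probs["Cloudy"])        # -log 0.3 ≈ 1.204

# Binary cross-entropy treats each label independently.  With Cloudy as
# the target, the Sunny label's target is 0, so its BCE term is
# -log(1 - p); -log p would apply only if Sunny's target were 1.
p = probs["Sunny"]
bce_sunny_if_target_0 = -math.log(1 - p)    # -log 0.8 ≈ 0.223
bce_sunny_if_target_1 = -math.log(p)        # -log 0.2 ≈ 1.609

print(ce_loss, bce_sunny_if_target_0, bce_sunny_if_target_1)
```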

4. Which one is incorrect?


A. If an operation has multiple inputs, then the gradient of the loss with respect to each input $x$
is computed respectively as $\frac{\partial L}{\partial x} = \max\left(\frac{\partial L}{\partial y}, \frac{\partial y}{\partial x}\right)$, where $y$ is the output of the operation
B. If a variable x is used in multiple operations, then the gradient of the loss with respect x is
computed by averaging the gradients derived through all the operations.
C. Back-propagation algorithm calls the forward function of each operation of the computation
graph in topological order, and then calls the backward function of each operation in the
reverse order.
D. Back-propagation algorithm is for computing the gradients of model parameters, and the
SGD algorithm is for updating the model parameters based on the gradients.
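
For question 4's options A and B, a minimal scalar sketch (a hypothetical example, not from the quiz): when a variable feeds multiple operations, back-propagation sums the gradients from each use, and each input of a multi-input operation gets its own chain-rule product rather than a max or an average.

```python
# L = x*y + (x + y).  The variable x feeds two operations, so dL/dx is
# the SUM of the gradients from each use: y (from x*y) plus 1 (from x+y).
x, y = 2.0, 3.0
L = x * y + (x + y)

grad_x = y + 1.0   # analytic gradient: contributions from both uses, summed

# Finite-difference check that the summed gradient is the right one.
eps = 1e-6
L_plus = (x + eps) * y + ((x + eps) + y)
print(grad_x, (L_plus - L) / eps)   # both ≈ 4.0
```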

5. Which one is incorrect?

A. Model parameters should be randomly initialized to break symmetry (i.e., neurons always
having the same values)
B. Data normalization rescales the parameters to a similar scale
C. Early stopping is one approach for reducing overfitting
D. Mini-batch SGD can jump out of local optima and saddle points, and is more stable than
SGD.
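
A small sketch for question 5's option B (illustrative data, not from the quiz): normalization standardizes the input features so they share a similar scale; it does not rescale the model parameters.

```python
import numpy as np

# Two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardization: zero mean, unit variance per feature column.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # ≈ [0. 0.] and [1. 1.]
```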

6. If $y = (\mathbf{A}\mathbf{x})^T(2\mathbf{x} + \mathbf{z})$, where $\mathbf{A}$ is a square matrix, $\mathbf{x}$ and $\mathbf{z}$ are vectors, and $y$ is a scalar, which one is $\frac{\partial y}{\partial \mathbf{x}}$?

A. $2(\mathbf{A} + \mathbf{A}^T)\mathbf{x} + \mathbf{A}^T\mathbf{z}$
B. $2\mathbf{A}\mathbf{x}$
C. $\mathbf{A}^T(2\mathbf{x} + \mathbf{z})$
D. $2(\mathbf{A} + \mathbf{A}^T)\mathbf{x}$
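
A numeric sanity check for question 6 (a minimal NumPy sketch with randomly generated stand-ins for $\mathbf{A}$, $\mathbf{x}$, $\mathbf{z}$): expanding $y = 2\mathbf{x}^T\mathbf{A}^T\mathbf{x} + \mathbf{x}^T\mathbf{A}^T\mathbf{z}$ and differentiating gives $2(\mathbf{A} + \mathbf{A}^T)\mathbf{x} + \mathbf{A}^T\mathbf{z}$, which a finite-difference probe confirms.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
z = rng.standard_normal(3)

def f(x):
    # y = (A x)^T (2x + z), a scalar
    return (A @ x) @ (2 * x + z)

# Analytic gradient: 2 (A + A^T) x + A^T z
analytic = 2 * (A + A.T) @ x + A.T @ z

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = eps
    numeric[i] = (f(x + e) - f(x - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # ~1e-9: the expressions agree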

7. $L = \frac{1}{2}\left(\mathbf{w}^T\mathbf{x} - y\right)^2$. If $\mathbf{x} = (1, 2)$, $\mathbf{w} = (2, 1)$, $y = 0$, compute the gradient: $\frac{\partial L}{\partial \mathbf{w}} = (4, 8)^T$
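
The gradient in question 7 follows from the chain rule, $\frac{\partial L}{\partial \mathbf{w}} = (\mathbf{w}^T\mathbf{x} - y)\,\mathbf{x}$; a minimal NumPy check (variable names are illustrative):

```python
import numpy as np

# L = 0.5 * (w^T x - y)^2, so dL/dw = (w^T x - y) * x.
x = np.array([1.0, 2.0])
w = np.array([2.0, 1.0])
y = 0.0

residual = w @ x - y     # 2*1 + 1*2 = 4
grad_w = residual * x    # 4 * (1, 2) = (4, 8)
print(grad_w)            # [4. 8.]
```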

8. If the input to the ReLU function ($\mathbf{y} = \max(0, \mathbf{x})$) is $\mathbf{x} = (-1, 1)^T$ and $\frac{\partial L}{\partial \mathbf{y}} = (1, 2)^T$, then $\frac{\partial L}{\partial \mathbf{x}} = (0, 2)^T$
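
Question 8 is the standard ReLU backward rule: the upstream gradient passes through where the input is positive and is zeroed elsewhere. A minimal NumPy check (names illustrative):

```python
import numpy as np

# Pass dL/dy through where x > 0, zero it where x <= 0.
x = np.array([-1.0, 1.0])
dL_dy = np.array([1.0, 2.0])

dL_dx = dL_dy * (x > 0)  # elementwise mask
print(dL_dx)             # [0. 2.]
```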

9. For a parameter vector $\mathbf{w} = (0, 1)^T$, if the gradients for the first two iterations are $d\mathbf{w}_1 = (1, -1)^T$ and $d\mathbf{w}_2 = (1, 1)^T$ respectively, what is the value of $\mathbf{w}$ after the two iterations using SGD with momentum as the optimization algorithm (learning rate 0.1, $\beta = 0.9$)? $\mathbf{w} = (-0.29, 1.09)^T$

$v_1 = 0 + (1, -1) = (1, -1)$
$w_1 = w - 0.1 \cdot (1, -1) = (0, 1) - (0.1, -0.1) = (-0.1, 1.1)$
$v_2 = 0.9 \cdot (1, -1) + (1, 1) = (0.9, -0.9) + (1, 1) = (1.9, 0.1)$
$w_2 = (-0.1, 1.1) - 0.1 \cdot (1.9, 0.1) = (-0.1, 1.1) - (0.19, 0.01) = (-0.29, 1.09)$
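
The same update can be replayed in a short loop; a minimal NumPy sketch assuming the convention used in the solution above ($v_t = \beta v_{t-1} + d\mathbf{w}_t$, $\mathbf{w}_t = \mathbf{w}_{t-1} - \eta\, v_t$, with no dampening):

```python
import numpy as np

# Replaying the two momentum updates from the solution above.
lr, beta = 0.1, 0.9
w = np.array([0.0, 1.0])
v = np.zeros(2)

for dw in (np.array([1.0, -1.0]), np.array([1.0, 1.0])):
    v = beta * v + dw        # velocity accumulates the gradients
    w = w - lr * v           # parameter step along the velocity

print(w)   # [-0.29  1.09]
```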
