
BITS F464

Machine Learning 2022-23


Aditya Challa End-Sem Exam

BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI,


K K BIRLA GOA CAMPUS
No Digital Devices Allowed, Open Book Exam.

Subject Name: BITS F464 - Machine Learning, Date: 12 May 2023


Examiner Name: Aditya Challa Marks: 40
Duration: 2.5 hours (2:00 PM - 4:30 PM)

Instructions
• Attempt all questions.
• Marks corresponding to each question are highlighted in bold within square braces at the end of the question.
• You are allowed to carry any handwritten notes or printed material. Laptops are not allowed.
• You should show the reasoning behind each answer in the answer sheet clearly indicating the
problem number. In case of ambiguities in any of the questions, clearly state your assumptions
and attempt the question(s).

Using LOOCV for learning, Problem 1


1. In this problem we shall try to use LOOCV to learn the K-nearest neighbors classifier. Let $D = \{(\vec{x}_i, y_i)\}_{i=1}^{n}$ be the training data set, where $\vec{x}_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$. Let $k$ be a positive integer. Let $D_{-i}$ be the training data set with the $i$-th instance removed. Let $y_i^k$ be the predicted label for $\vec{x}_i$ using the $k$-nearest neighbors classifier trained on $D_{-i}$. Let $E_{LOOCV}$ be the LOOCV error for the K-nearest neighbors classifier.
(a) Write the expression for $E_{LOOCV}$.

Observe that the above procedure has no parameters to learn. So, we try to identify an embedding $A : \mathbb{R}^d \to \mathbb{R}^{d'}$ such that the LOOCV error is minimized. Let $D' = \{(A(\vec{x}_1), y_1), (A(\vec{x}_2), y_2), \ldots, (A(\vec{x}_n), y_n)\}$ be the transformed training data set. Further, fix $k$ (the parameter in KNN) to be $n - 1$. Define the probability that two points $i, j$ have the same label as

$p_{ij} \propto \exp(-\|A\vec{x}_i - A\vec{x}_j\|), \quad i \neq j, \qquad p_{ii} = 0$    (1)

(b) Write the expression for computing the probability that a point $\vec{x}_i$ has label $l$ from the above, $p_i^{(l)}$, in terms of $p_{ij}$.
(c) Recall that the cross-entropy loss for a single sample $i$ is given by $\sum_{l=1}^{L} y_i^{(l)} \log p_i^{(l)}$, where $l$ denotes the labels. Write the expression for the cross-entropy loss for the transformed data set $D'$, in terms of $p_{ij}$.
Remark: Your expressions for (b) and (c) should only involve $p_{ij}$, $\vec{x}_i$, $y_i$.
Assume that there is a procedure to optimize the above loss. Let $A^*$ be an optimal embedding.
(d) State true or false with justification - $f(\vec{x}) = A^* \vec{x} + b$ is also an optimal embedding, where $b \in \mathbb{R}^{d'}$.
(e) State true or false with justification - $f(\vec{x}) = U A^* \vec{x}$ is also an optimal embedding, where $U \in \mathbb{R}^{d' \times d'}$ is an orthogonal matrix.
(f) State true or false with justification - $f(\vec{x}) = A^* V \vec{x}$ is also an optimal embedding, where $V \in \mathbb{R}^{d \times d}$ is an orthogonal matrix.
(g) Consider the vectors $\nu_{ij} = \vec{x}_i - \vec{x}_j$. Let $\tilde{A}$ be a mapping such that $\|\tilde{A}\nu_{ij}\| = 0$ whenever $y_i = y_j$ and $\|\tilde{A}\nu_{ij}\| = 100$ whenever $y_i \neq y_j$. Then what is the cross-entropy loss for the transformed data set $D'$ using $\tilde{A}$? (Assume that $\exp(-100) \approx 0$ for your computations.)
(h) State true or false with justification - If $d' = \binom{n}{2}$, then there exists a mapping $A$ such that the cross-entropy loss for the transformed data set $D'$ is zero.
[1 + 1 + 1 + 1 + 1.5 + 1.5 + 2 + 2 = 11]

Answer of exercise 1

(a) The expression for $E_{LOOCV}$ is given by,

$E_{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq y_i^k)$    (2)
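As an illustration (not part of the original key), here is a minimal numpy sketch of equation (2): the $i$-th point is held out by masking the diagonal of the distance matrix, the $k$ nearest neighbours in $D_{-i}$ vote, and the 0-1 losses are averaged. The two-cluster data is a synthetic placeholder.

```python
import numpy as np

def loocv_knn_error(X, y, k):
    """LOOCV error of equation (2): average of I(y_i != y_i^k),
    where y_i^k is the k-NN prediction from D_{-i}."""
    n = len(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # leave the i-th point out of its own neighbours
    errors = 0
    for i in range(n):
        nn = np.argsort(dists[i])[:k]          # k nearest neighbours within D_{-i}
        y_pred = np.bincount(y[nn]).argmax()   # majority vote
        errors += int(y_pred != y[i])
    return errors / n

# Synthetic two-cluster data (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(loocv_knn_error(X, y, k=3))
```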


(b) The probability that a point $\vec{x}_i$ has label $l$ is given by,

$p_i^{(l)} = \sum_{j=1}^{n} I(y_j = l)\, p_{ij}$    (3)

(c) The cross-entropy loss for the transformed data set $D'$ is given by,

$L = \sum_{i=1}^{n} \sum_{l=0}^{L} I(y_i = l) \log\left( \sum_{j=1}^{n} I(y_j = l)\, p_{ij} \right)$    (4)

(d) True, since adding $b$ leaves all pairwise distances, and hence the probabilities, unchanged.


(e) True since the distances do not change if you multiply with orthogonal matrices, and hence probabilities
do not change.
(f) False since distances are not preserved in this case.
(g) The cross-entropy loss is 0: with $\tilde{A}$, $p_{ij} \approx 0$ whenever $y_i \neq y_j$, so each point assigns probability $\approx 1$ to its own label and every log term vanishes.
n

(h) True. Observe that there are $\binom{n}{2}$ vectors $\nu_{ij}$. So, when $d' = \binom{n}{2}$, one can construct a mapping $\tilde{A}$ as in (g) and get the cross-entropy loss to be 0.
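A short numpy sketch (my own addition) of the quantities in (b)-(c), assuming the proportionality in equation (1) is resolved by normalizing each row of $p_{ij}$: it computes $p_i^{(l)} = \sum_j I(y_j = l)\, p_{ij}$ and the loss of equation (4) (no leading minus, so 0 is the best attainable value). The random data and the untrained map $A$ are placeholders.

```python
import numpy as np

def embedding_loss(A, X, y):
    """Cross-entropy loss of equation (4) for the embedding x -> A x."""
    Z = X @ A.T                                    # embedded points A x_i
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    P = np.exp(-d)
    np.fill_diagonal(P, 0.0)                       # p_ii = 0
    P /= P.sum(axis=1, keepdims=True)              # assume rows are normalized: sum_j p_ij = 1
    loss = 0.0
    for l in (0, 1):
        p_l = (P * (y == l)).sum(axis=1)           # p_i^(l) = sum_j I(y_j = l) p_ij
        loss += np.sum((y == l) * np.log(p_l + 1e-12))   # epsilon only for numerical safety
    return loss

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)
A = rng.normal(size=(2, 3))                        # an arbitrary, untrained embedding
print(embedding_loss(A, X, y))
```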

Boosting for models other than decision trees, Problem 2


2. Consider the boosting algorithm for regression in this problem. The setting is exactly the same as discussed using decision trees, except for the changes in the hypothesis class and the optimization procedure.
(a) Describe the boosting algorithm obtained by replacing the decision trees with linear models.
(b) State true or false with justification - Boosting for linear models gives a hypothesis whose training error is strictly less than the error of the best linear model in the hypothesis class.
(c) State true or false with justification - Boosting for linear models gives a hypothesis whose training error is strictly greater than the error of the best linear model in the hypothesis class.
(d) State true or false with justification - Boosting for neural networks gives a hypothesis whose training error is strictly less than the error of the best linear model in the hypothesis class.
(e) State true or false with justification - Boosting for neural networks gives a hypothesis whose training error is strictly greater than the error of the best linear model in the hypothesis class.

[1 + 2 + 2 + 2 + 2 = 9 Marks]

Answer of exercise 2

(a) Same as the boosting algorithm for regression with decision trees, except that at each stage the base learner fit to the current residuals is a linear model rather than a tree (see the sketch below).


(b), (c) Observe that adding two linear models gives a linear model, so the boosted hypothesis is itself a linear model. Its training error can therefore never be strictly less than that of the best linear model, and boosting run to convergence matches, rather than exceeds, the best linear fit. So both statements are false.
(d), (e) Adding two non-linear models is not the same as a single non-linear model, so the boosted ensemble is not confined to the linear class. Hence the training error will be strictly less than the error of the best linear model in the hypothesis class. So (d) is true and (e) is false.
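A small sketch of the point behind (a)-(c) (my construction, not the textbook's pseudocode verbatim): forward-stagewise boosting for squared error with ordinary least-squares base learners. Since a sum of linear models is still linear, the boosted training error approaches, but never drops below, that of the single best linear fit.

```python
import numpy as np

def fit_linear(X, r):
    """Least-squares fit (with intercept) to the current residuals r; returns a predictor."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, r, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

def boost_linear(X, y, n_rounds=10, lr=0.5):
    """Forward-stagewise boosting: each round fits a linear model to the residuals."""
    pred = np.zeros(len(y))
    for _ in range(n_rounds):
        h = fit_linear(X, y - pred)      # base learner on current residuals
        pred += lr * h(X)
    return pred

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + np.sin(3 * X[:, 0])   # target is not exactly linear

boosted = boost_linear(X, y)
best_linear = fit_linear(X, y)(X)
print("boosted MSE     :", np.mean((y - boosted) ** 2))
print("best linear MSE :", np.mean((y - best_linear) ** 2))   # boosted MSE >= this value
```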

Max-Margin classifier vs Logistic Regression, Problem 3


3. Consider the following construction - Let $\theta \in (0, 1)$ and let $f(\theta) = (f_1(\theta), f_2(\theta), \cdots, f_k(\theta))$. We shall use $f(\theta)$ as features for classification. Assume for simplicity that $df_i/d\theta \geq 1$ for all $i$ and $\theta \in (0, 1)$. Assign the ground-truth label as follows - for a given $\theta_0$, assign the label $-1$ if $\theta < \theta_0$ and label $+1$ for $\theta > \theta_0$. Ignore the data-point $f(\theta_0)$, i.e. assume that it does not belong to your training set.

Consider the max-margin classifier and logistic regression for this problem. Let $\beta_M^t f(\theta) + \beta_{M,0}$ be the optimal classifier for max-margin and $\beta_L^t f(\theta) + \beta_{L,0}$ be the optimal classifier for logistic regression. (Remark: For logistic regression, the label $-1$ is taken to be $0$.) Also assume that $\|\beta_M\| = \|\beta_L\| = 1$.
(a) What are the values of $\beta_M^t f(\theta_0) + \beta_{M,0}$ and $\beta_L^t f(\theta_0) + \beta_{L,0}$?
(b) Is the quantity $\sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta)$ less than or equal to zero? Justify your answer.


(c) Prove or disprove $\sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta) \geq \sum_{i=1}^{k} \beta_{L,i}\, (df_i(\theta_0)/d\theta)$.
Suppose we reverse the labels - i.e. assign the label $+1$ if $\theta < \theta_0$ and label $-1$ for $\theta > \theta_0$ - then answer the following questions.
(d) Is the quantity $\sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta)$ less than or equal to zero? Justify your answer.
(e) Prove or disprove $\sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta) \geq \sum_{i=1}^{k} \beta_{L,i}\, (df_i(\theta_0)/d\theta)$.

Hint: The first-order Taylor approximation is given by $f(x + \epsilon) \approx f(x) + \epsilon\, df/dx$.


[1 + 2 + 2 + 2 + 2 = 9 Marks]
Answer of exercise 3
(a) It is clear that $\beta_M^t f(\theta_0) + \beta_{M,0} = \beta_L^t f(\theta_0) + \beta_{L,0} = 0$.

Consider the first-order approximation $\beta_M^t f(\theta_0 + \epsilon) + \beta_{M,0} \approx \beta_M^t f(\theta_0) + \beta_{M,0} + \epsilon \sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta)$, which is valid for small $\epsilon$. For the maximum-margin classifier we must have $\beta_M^t f(\theta_0 + \epsilon) + \beta_{M,0} \geq 0$ for $\epsilon > 0$, so we get $\sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta) \geq 0$.

Moreover, since the decision boundary passes through $f(\theta_0)$ and $\|\beta_M\| = 1$, the margin of the points closest to $\theta_0$ is, to first order, $\epsilon\, \beta_M^t \nabla f(\theta_0)$; maximizing the margin therefore amounts to choosing the unit vector that maximizes $\beta^t \nabla f(\theta_0)$. For logistic regression the loss function is the cross-entropy loss $-y_i \log(\mathrm{sigmoid}(\beta^t f(\theta) + \beta_0)) - (1 - y_i)\log(1 - \mathrm{sigmoid}(\beta^t f(\theta) + \beta_0))$, which trades off all training points rather than only the closest ones, so its unit-norm solution cannot have a larger slope at $\theta_0$. Hence $\beta_M^t \nabla f(\theta_0) \geq \beta_L^t \nabla f(\theta_0)$.

If the labels are reversed, the signs flip: we get $\sum_{i=1}^{k} \beta_{M,i}\, (df_i(\theta_0)/d\theta) \leq 0$, and since $\beta_M$ now makes this quantity as negative as possible, we get $\beta_M^t \nabla f(\theta_0) \leq \beta_L^t \nabla f(\theta_0)$.

(b) The quantity is ≥ 0.


(c) The statement is True.
(d) The quantity is ≤ 0.
(e) The statement is False.
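Below is a small numerical sketch of the construction (my own, not part of the key): the features $f(\theta) = (\theta, \theta + \theta^2/2)$ and $\theta_0 = 0.5$ are arbitrary choices satisfying $df_i/d\theta \geq 1$, a linear SVM with a large $C$ stands in for the hard-margin classifier, and a weakly regularized logistic regression for the other. The first printed slope is positive, illustrating (b); the comparison between the two slopes only illustrates (c), since for this nearly one-dimensional feature curve the two directions are close and finite regularization and solver tolerance blur small differences.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

theta0 = 0.5                                       # arbitrary threshold for illustration
theta = np.linspace(0.05, 0.95, 60)
theta = theta[theta != theta0]                     # f(theta0) is not in the training set
F = np.column_stack([theta, theta + theta**2 / 2]) # f(theta) = (theta, theta + theta^2/2)
y = np.where(theta > theta0, 1, -1)

grad_f = np.array([1.0, 1.0 + theta0])             # f'(theta0); every component >= 1

svm = SVC(kernel="linear", C=1e6).fit(F, y)                 # hard margin approximated by large C
logreg = LogisticRegression(C=10.0).fit(F, (y + 1) // 2)    # labels {0, 1} for logistic regression

beta_M = svm.coef_.ravel() / np.linalg.norm(svm.coef_)      # normalize to unit norm
beta_L = logreg.coef_.ravel() / np.linalg.norm(logreg.coef_)

print("beta_M . f'(theta0):", beta_M @ grad_f)     # positive, as in part (b)
print("beta_L . f'(theta0):", beta_L @ grad_f)
```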

Neural Networks, Problem 4


4. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of training examples for binary classification. Consider a neural network with ReLU activation, $f_\theta(x)$, trained on this dataset. Note that the value of each hidden neuron is either 0 or not. Let $\Theta(x_i)$ denote the set of hidden neurons whose value is 0.
(a) State true or false - $\Theta(x_i) = \Theta(x_j)$ if $y_i = y_j$.
(b) Assume $\Theta(x_i) = \Theta(x_j)$ for some $i, j$. What is the value of $\partial f_\theta(x_i)/\partial x_i - \partial f_\theta(x_j)/\partial x_j$?
(c) Assume $\Theta(x_i) = \Theta(x_j)$ for all pairs $i, j$, and let the training accuracy be 1. What is the best possible accuracy one can obtain with linear models (logistic regression) on this dataset?
[1 + 2 + 2 = 5 Marks]

Answer of exercise 4
The main idea is the following construction - Let the neural network be $f_\theta(x)$. For a sample $x_i$, $\Theta(x_i)$ denotes the hidden neurons whose value is 0. Replace all the connections to these hidden neurons with weight 0. Then the entire network is simply a composition of linear operators and hence equivalent to a matrix multiplication. Let this matrix be $W$. Then $f_\theta(x_i) = W x_i$. Now we can answer the questions.

(a) Not true. Consider a network with one hidden layer on the XOR problem.


(b) If $\Theta(x_i) = \Theta(x_j)$, then $f_\theta(x_i) = W x_i$ and $f_\theta(x_j) = W x_j$. Hence, $\partial f_\theta(x_i)/\partial x_i - \partial f_\theta(x_j)/\partial x_j = 0$.
(c) Similarly, if Θ(xi ) = Θ(xj ) for all i, j, then fθ (xi ) = W xi for all i. So, there exists a linear model which
can perfectly fit the data. Hence, the best possible accuracy is 1.
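The construction above can be checked numerically. Here is a minimal numpy sketch (my addition), using a random bias-free two-layer ReLU network as a stand-in: once the activation pattern is fixed, the network is the single matrix $W = W_2\,\mathrm{diag}(\text{mask})\,W_1$, so any two inputs with the same pattern share the same Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))          # bias-free two-layer ReLU network, as in the key's f(x) = W x
W2 = rng.normal(size=(1, 5))

def forward(x):
    """Return the network output and the ReLU activation mask (0 exactly on Theta(x))."""
    h = W1 @ x
    mask = (h > 0).astype(float)
    return W2 @ (mask * h), mask

x_i = rng.normal(size=3)
x_j = x_i + 0.01 * rng.normal(size=3)  # a nearby point, very likely with the same pattern

out_i, mask_i = forward(x_i)
out_j, mask_j = forward(x_j)

if np.array_equal(mask_i, mask_j):     # Theta(x_i) = Theta(x_j)
    # With the pattern fixed, the network is the linear map W = W2 diag(mask) W1,
    # so both points have the same Jacobian W and the difference in part (b) is zero.
    W = W2 @ np.diag(mask_i) @ W1
    print(np.allclose(out_i, W @ x_i), np.allclose(out_j, W @ x_j))   # True True
```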

Deterministic models, Problem 5


5. Recall that linear regression models assume the ground truth to be $f(x) = \theta^t x + \theta_0 + \epsilon$, where $\epsilon$ is Gaussian distributed with mean 0 and variance $\sigma^2$. Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote the training sample obtained by sampling $x_i$ from some arbitrary distribution $X$ and setting $y_i = f(x_i)$; $n$ indicates the number of samples. Assume the number of features is $d$, i.e. $x_i \in \mathbb{R}^d$. Let $\hat{\theta}$ denote the estimate obtained from minimizing the mean squared error. State true or false with justification:


(a) If $\sigma^2 = 0$ then the variance of $\hat{\theta}$ is 0 for all possible $n$.

(b) If $\sigma^2 = 0$ then the variance of $\hat{\theta}$ is 0 only if $n > d$.
(c) We really cannot be sure that the variance of $\hat{\theta}$ is 0, irrespective of $n$ and $d$, even if $\sigma^2 = 0$.

[2 + 2 + 2 = 6 Marks]

Answer of exercise 5
The variance of our estimate depends on the underlying distribution. Take the extreme example where $X = 0$ with probability 1. Then, irrespective of $\theta$, every observed response equals $\theta_0$. So essentially we only have one sample $(0, \theta_0)$ repeated several times. Then any value of $\theta$ is a minimizer, and hence the variance of $\hat{\theta}$ can potentially be infinite.

(a) False. See above.


(b) False. See above.
(c) True. See above.
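A tiny numpy sketch (my addition) of the degenerate case used above: all $x_i = 0$ and $\sigma^2 = 0$, so every response equals $\theta_0$ and every coefficient vector with intercept $\theta_0$ attains zero training error. The minimizer, and hence the variance of $\hat{\theta}$, is not pinned down by $\sigma^2 = 0$ alone.

```python
import numpy as np

n, d = 10, 3
theta0 = 2.0

X = np.zeros((n, d))                     # extreme distribution: x_i = 0 with probability 1
y = np.full(n, theta0)                   # sigma^2 = 0, so y_i = f(x_i) = theta0 exactly
Xb = np.hstack([X, np.ones((n, 1))])     # design matrix with an intercept column

def mse(theta):
    return np.mean((y - Xb @ theta) ** 2)

theta_a = np.array([0.0, 0.0, 0.0, theta0])      # two very different coefficient vectors,
theta_b = np.array([100.0, -50.0, 7.0, theta0])  # both with intercept theta0

print(mse(theta_a), mse(theta_b))        # both 0.0: the minimizer is not unique,
                                         # so Var(theta_hat) need not be 0 even though sigma^2 = 0
```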

