
Machine Learning Foundations and Applications

(AI42001): Practice Problem Set 1

August 31, 2023

1. Can a linear regression classifier achieve zero training error on any of the datasets in Fig. 1? Provide justification for your answer.

[Six scatter plots, panels (A)-(F), each plotting x2 against x1.]

Figure 1: The 2-dimensional labeled training sets, where circles correspond to class y = 1 and stars correspond to class y = 0.

Answer: B, D, and F allow zero training error for a linear regression classifier, since a linear decision boundary can achieve zero training error only when the two classes are linearly separable, which is the case in those three datasets.

2. In linear regression, we realize g(x | w) as a linear function with {w0, w1, w2, . . . , wn} as the parameters:

g(x | w) = w0 + w1 x1 + w2 x2 + · · · + wn xn    (1)

The optimal values of the parameters are obtained by minimizing the total (or average) squared error:

E = (1/2) Σ_{i=1}^{N} (g(x_i | w) − y_i)^2    (2)

Suppose that we use L1 and L2 regularization for the parameters. The final loss to be optimized for the L2 case is

E_final = E + λ(w0^2 + w1^2 + · · · + wn^2).    (3)

Similarly, the final loss to be optimized for the L1 case is

E_final = E + λ(|w0| + |w1| + · · · + |wn|).    (4)

Comparing Eqs. (3) and (4), which one will enforce better sparsity on the parameter vector w (i.e., relatively more zeros in w)? Briefly explain using the L2 and L1 losses shown below.

Figure 2: The losses for the L1 and L2 regularization terms alone.

Answer: In the gradient descent algorithm, the weight update depends on the gradient of the loss. As a weight becomes smaller, the gradient of the L2 penalty with respect to it (and hence its contribution to the weight update) also becomes smaller. As a result, there is no strong force driving the small weights all the way to zero. This effect does not exist for the L1 loss, whose gradient magnitude is constant irrespective of the weight value. Therefore, L1 enforces better sparsity.
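A rough numerical illustration of this effect (a sketch with hypothetical data; λ, the learning rate, and the step count are arbitrary choices, and the L1 penalty is handled with the standard soft-thresholding proximal update, since a plain subgradient step only oscillates around zero):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Only the first two features matter; ideally the rest get zero weight.
    y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

    def fit(penalty, lam=0.5, lr=0.1, steps=2000):
        w = np.zeros(5)
        for _ in range(steps):
            w -= lr * X.T @ (X @ w - y) / len(y)  # gradient step on the squared error E
            if penalty == "l2":
                w -= lr * lam * 2 * w             # L2 update shrinks in proportion to w
            else:
                # L1 proximal step: a fixed-size pull toward zero, so small
                # weights are clipped exactly to zero.
                w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        return w

    print("L2:", np.round(fit("l2"), 3))  # small but nonzero weights everywhere
    print("L1:", np.round(fit("l1"), 3))  # the three irrelevant weights are exactly 0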

3. We attempt to solve the binary classification task depicted in Fig. 3 with a simple linear logistic regression model:

P(y = 1 | x, w) = g(w0 + w1 x1 + w2 x2) = 1 / (1 + exp(−w0 − w1 x1 − w2 x2))    (5)

[Three scatter plots, panels (A)-(C), each plotting x2 against x1.]

Figure 3: The 2-dimensional labeled training sets, where circles correspond to class y = 1 and stars correspond to class y = 0.

Consider training regularized linear logistic regression models where we try to maximize

Σ_{i=1}^{N} log P(y_i | x_i, w0, w1, w2) − C w_j^2    (6)

for very large C. The regularization penalties used in penalized conditional log-likelihood estimation are −C w_j^2, where j ∈ {0, 1, 2}. In other words, only one of the parameters is regularized in each case. Given the training datasets in Fig. 3, state which choice of j is not favorable for very large C. Provide a brief justification for each of your answers. (Note: this shows that the regularization for a given problem is not arbitrary, but problem dependent.)
Answer: (A) w2 = 0 is not favorable.
(B) w2 = 0 and w0 = 0 are not favorable.
(C) w1 = 0 and w2 = 0 are not favorable.

4. If we change the form of regularization to the L1-norm (absolute value) and regularize w1 and w2 only (but not w0), we get the following penalized log-likelihood:

Σ_{i=1}^{N} log P(y_i | x_i, w0, w1, w2) − C(|w1| + |w2|)    (7)

Consider again the datasets in Fig. 3 and the same linear logistic regression model

P(y = 1 | x, w) = g(w0 + w1 x1 + w2 x2)    (8)

As we increase the regularization parameter C, which of the following scenarios do you expect to observe? (Choose only one.) Briefly explain your choice:
1. First w1 will become 0, then w2 may become smaller.
2. First w2 will become 0, then w1 may become smaller.
3. w1 and w2 will become zero simultaneously.

4. None of the weights will become exactly zero, only smaller as C increases.

Answer: (A) 1; (B) 1; (C) 4.

5. Consider again the datasets in Fig. 3 and the L1 regularization in Question 4. For very large C, with the same L1-norm regularization for w1 and w2 as above, which value(s) do you expect w0 to take? Explain briefly. (Note that the number of points from each class is roughly the same.) (You can give a range of values for w0 if you deem it necessary.)
Answer: (A) w0 ≈ 0; (B) w0 ≈ 0; (C) w0 ≪ 0

6. A random sample of eight drivers insured with a company and having similar
auto insurance policies was selected. The following table lists their driving
experiences (in years) and monthly auto insurance premiums.

Driving Experience (years)    Monthly Auto Insurance Premium (USD)
 5                            64
 2                            87
12                            50
 9                            71
15                            44
 6                            56
25                            42
16                            60

1. Does the insurance premium depend on the driving experience? Do you expect a positive or a negative relationship between these two variables?
2. Find the least squares regression line by choosing appropriate dependent
and independent variables based on your answer in part 1.
3. Interpret the meaning of the values of a and b calculated in part 2.
4. Predict the monthly auto insurance premium for a driver with 10 years of
driving experience.
1. One finds a negative relationship (the higher the experience, the lower the premium).
2. a = 76.66, b = −1.5476.
3. The fitted line is y = 76.66 − 1.5476x: the intercept a = 76.66 is the predicted monthly premium (in USD) for a driver with no experience, and the slope b = −1.5476 means the premium drops by about 1.55 USD for each additional year of experience, confirming the negative relation.
4. 76.66 − 1.5476 × 10 ≈ 61.18 USD.
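The values of a and b (and the prediction in part 4) can be reproduced with a short script using the table above:

    import numpy as np

    # Driving experience (years) and monthly premium (USD) from the table above.
    x = np.array([5, 2, 12, 9, 15, 6, 25, 16], dtype=float)
    y = np.array([64, 87, 50, 71, 44, 56, 42, 60], dtype=float)

    # Least squares slope and intercept.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    print(round(a, 2), round(b, 4))  # -> 76.66 -1.5476
    print(round(a + b * 10, 2))      # -> 61.18, premium for 10 years of experience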

7. Suppose that X is a discrete random variable with the following PMF: P(X = 0) = 2θ/3, P(X = 1) = θ/3, P(X = 2) = 2(1 − θ)/3, P(X = 3) = (1 − θ)/3, where 0 ≤ θ ≤ 1. The sample consists of the following independent items: X = {3, 0, 2, 1, 3, 2, 1, 0, 2, 1}. Find the MLE of θ.
Answer: The sample contains two 0s, three 1s, three 2s, and two 3s, so the log-likelihood is

L(θ | X) = log Π_{i=1}^{N} P(X_i | θ) = log[(2θ/3)^2 (θ/3)^3 (2(1 − θ)/3)^3 ((1 − θ)/3)^2] = 5 log θ + 5 log(1 − θ) + const.

dL/dθ = 5/θ − 5/(1 − θ) = 0

θ̂ = 0.5
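A quick numerical check of this analytic result, via a simple grid search over θ (a sketch; any sufficiently fine grid works):

    import numpy as np

    sample = [3, 0, 2, 1, 3, 2, 1, 0, 2, 1]

    def log_lik(t):
        pmf = {0: 2 * t / 3, 1: t / 3, 2: 2 * (1 - t) / 3, 3: (1 - t) / 3}
        return sum(np.log(pmf[x]) for x in sample)

    thetas = np.linspace(0.001, 0.999, 999)
    print(thetas[np.argmax([log_lik(t) for t in thetas])])  # -> 0.5, the analytic MLE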

8. Let {X_i}_{i=1}^{N} be iid observations drawn from the following probability density function:

p(x | θ) = (1/(2θ)) exp(−|x|/θ).
Find the MLE of θ.
Answer:

L(θ | X) = Σ_{i=1}^{N} [−log 2 − log θ − |X_i|/θ]

dL/dθ = Σ_{i=1}^{N} [−1/θ + |X_i|/θ^2] = 0

θ̂ = (1/N) Σ_{i=1}^{N} |X_i|

9. Let X = {X_i}_{i=1}^{N} be a sample drawn from a uniform distribution on the interval [0, θ]. Here, θ > 0 is the parameter. Determine the MLE of θ.
Answer: The PDF:

p(x | θ) = 1/θ for 0 ≤ x ≤ θ, and 0 otherwise.

The likelihood:

L(θ | X) = 1/θ^N if 0 ≤ X_i ≤ θ for all i, and 0 otherwise.

The MLE must satisfy θ̂ ≥ X_i for all i. Among such values, the choice should maximize 1/θ^N, so θ̂ should be the smallest admissible value of θ:

θ̂ = max(X_1, X_2, . . . , X_N).
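A quick simulation check of this estimator (the true θ and sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true = 3.0
    X = rng.uniform(0.0, theta_true, size=1_000)

    # The MLE is the smallest theta consistent with every observation.
    print(X.max())  # slightly below 3.0: the estimate never exceeds the true theta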

10. Suppose that we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class positive (+) or negative (−) to a review document. We represent each input observation by the 6 features {x1, x2, . . . , x6} shown in the following text and table (Figs. 4 and 5).

Figure 4: A movie review text.

Figure 5: Features extracted from the text in Fig. 4.

Suppose that we have trained a logistic regression classifier, and the 6 weights corresponding to the 6 features are {2.5, −5.0, −1.2, 0.5, 2.0, 0.7}, while w0 = 0.1.
1. How do you interpret the values of w1 = 2.5 and w2 = −5.0?
2. Calculate P (+ | x) and P (− | x) for the above example.
Answer: (1) The weight w1 indicates how important the count of positive lexicon words (great, nice, enjoyable, etc.) is to a positive sentiment decision, while w2 indicates the corresponding importance of negative lexicon words to a negative decision.
(2) P(+ | x) = σ(w^T x + b) = 0.70
P(− | x) = 1 − σ(w^T x + b) = 0.30
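A sketch of the computation in part (2). The feature values below are assumed for illustration (the actual counts should be read from Fig. 5); the weights and bias are those given above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
    w0 = 0.1
    # Feature values assumed for illustration; substitute the values from Fig. 5.
    x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])

    p_pos = sigmoid(w @ x + w0)
    print(round(p_pos, 2), round(1 - p_pos, 2))  # -> 0.7 0.3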

11. Next, we focus on optimizing a logistic regression classifier using the gradient descent algorithm. We use a simplified version of the previous example in Fig. 4: the model sees a single observation x whose correct label is y = 1 (a positive review), with a feature vector x = {x1, x2} consisting of two features: x1 = 3 (count of positive lexicon words) and x2 = 2 (count of negative lexicon words). Assume that the initial weights and bias are all set to 0, and the learning rate µ is 0.1. What will be the new w after a single step of gradient descent?

Answer: The gradient of the cross-entropy loss for this single example is

∇_{w,w0} L = ((σ(w^T x + w0) − y) x1, (σ(w^T x + w0) − y) x2, σ(w^T x + w0) − y) = ((σ(0) − 1) · 3, (σ(0) − 1) · 2, σ(0) − 1) = (−1.5, −1.0, −0.5)    (9)

θ^(1) = θ^(0) − µ ∇_{w,w0} L = (0.15, 0.10, 0.05)    (10)
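The same single update step in code (a minimal sketch following the computation above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([3.0, 2.0])       # positive- and negative-lexicon counts
    y, mu = 1.0, 0.1
    w, w0 = np.zeros(2), 0.0       # initial weights and bias

    err = sigmoid(w @ x + w0) - y  # sigma(0) - 1 = -0.5
    w, w0 = w - mu * err * x, w0 - mu * err
    print(w, w0)                   # -> [0.15 0.1 ] 0.05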
