
Machine Learning Foundations and Applications

(AI42001): Practice Problem Set 2

August 31, 2023

Qn 1. We attempt to study why entropy is used as a measure in decision trees.


Given a discrete random variable X which takes n values with probabilities
{pi : 1 ≤ i ≤ n}, its entropy is defined as
Entropy(X) = -\sum_{i=1}^{n} p_i \log_2 p_i    (1)

(a) Show that the entropy is greater than or equal to 0.


(b) For which PMF does the entropy take its lowest value?
(For question (c), use the following fact: the maximum value of entropy is attained for the uniform probability mass function.)
(c) Guess how the entropies of the following PMFs compare: (i) {0.1, 0.9, 0, 0}, (ii) {0.25, 0.25, 0.25, 0.25}, (iii) {0, 0, 1, 0}. Calculate the entropies and verify (a short numerical sketch follows this question).
(d) Based on the above answers, explain why entropy is a good measure for
decision trees.
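A minimal numerical sketch (assuming Python with NumPy, which is not part of the original problem) for checking parts (b) and (c): it evaluates the entropy of each PMF from part (c) directly from definition (1). The function name `entropy` is illustrative only.

```python
import numpy as np

def entropy(pmf):
    """Entropy in bits: -sum_i p_i * log2(p_i), taking 0 * log2(0) = 0."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                      # drop zero-probability outcomes
    return float(-np.sum(p * np.log2(p)))

# PMFs from part (c)
print(entropy([0.1, 0.9, 0.0, 0.0]))      # ~0.469 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits (uniform: maximum for 4 outcomes)
print(entropy([0.0, 0.0, 1.0, 0.0]))      # 0 bits (all mass on one outcome: minimum)
```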

Qn 2. We will use the dataset below to learn a decision tree which predicts
if people pass machine learning (True or False), based on their previous GPA
(High, Medium, or Low) and whether or not they studied (True or False).

GPA  Studied  Passed
L    F        F
L    T        T
M    F        F
M    T        T
H    F        T
H    T        T

(a) What is the entropy H(Passed)?

(b) What is the entropy H(Passed | GPA)?
(c) What is the entropy H(Passed | Studied)?
(d) Draw the full decision tree that would be learned for this dataset. You do
not need to show any calculations.
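As a check on parts (a)-(c), here is a small sketch, assuming Python's standard library, that computes the empirical entropy and conditional entropies directly from the six rows of the table; the helper names H and H_cond are illustrative only.

```python
from collections import Counter
import math

def H(labels):
    """Empirical entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def H_cond(attr, labels):
    """Conditional entropy H(labels | attr): entropy within each attribute value, weighted by frequency."""
    n = len(labels)
    groups = {}
    for a, y in zip(attr, labels):
        groups.setdefault(a, []).append(y)
    return sum(len(g) / n * H(g) for g in groups.values())

gpa     = ["L", "L", "M", "M", "H", "H"]
studied = ["F", "T", "F", "T", "F", "T"]
passed  = ["F", "T", "F", "T", "T", "T"]

print(H(passed))                 # H(Passed)
print(H_cond(gpa, passed))       # H(Passed | GPA)
print(H_cond(studied, passed))   # H(Passed | Studied)
```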

Qn 3. Given the following decision tree, show how the new examples in the table
would be classified by filling in the last column in the table. If an example
cannot be classified, enter UNKNOWN in the last column.

Figure 1: Decision tree.

Figure 2: Test set.

Qn 4. Using the dataset below, we want to build a decision tree which classifies Y
as T /F given the binary variables A, B, C.

A B C Y
F F F F
T F T T
T T F T
T T T F

(a) Draw the tree that would be learned by the greedy algorithm with zero
training error.
(b) Is this tree optimal (i.e. does it get zero training error with minimal
depth)? Explain in less than two sentences. If it is not optimal, draw the
optimal tree as well.
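One possible way to sanity-check part (a) is to fit a greedy, entropy-based tree with scikit-learn. This is a sketch rather than the definitive answer: it assumes scikit-learn is available, encodes F/T as 0/1, and the library's exact tree shape can depend on tie-breaking.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Encode F = 0, T = 1 for the (A, B, C) -> Y table above.
X = [[0, 0, 0],
     [1, 0, 1],
     [1, 1, 0],
     [1, 1, 1]]
y = [0, 1, 1, 0]   # Y column: F, T, T, F

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=["A", "B", "C"]))  # structure of the greedy tree
print(clf.score(X, y))                                  # 1.0 means zero training error
```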

Qn 5. Consider the problem of binary classification using the Naive Bayes classifier. You are given two-dimensional features (X1, X2) and the categorical class-conditional distributions in the tables below. The entries in the tables correspond to P(X1 = x1 | Ci) and P(X2 = x2 | Ci), respectively. The two classes (Ci : i = 1, 2) are equally likely.

X1    P(X1 | Class 1)    P(X1 | Class 2)
-1    0.2                0.3
 0    0.4                0.6
 1    0.4                0.1

X2    P(X2 | Class 1)    P(X2 | Class 2)
-1    0.4                0.1
 0    0.5                0.3
 1    0.1                0.6

Given the data point (-1, 1), calculate the following posterior probabilities:
(a) P(C1 | X1 = -1, X2 = 1)

(b) P(C2 | X1 = -1, X2 = 1)
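A small sketch, assuming Python, of the naive Bayes posterior computation for this data point: multiply the prior by the two class-conditional probabilities and normalise over the two classes. The dictionary layout is illustrative only.

```python
# Class-conditional tables from the question, indexed as table[value][class].
p_x1 = {-1: {1: 0.2, 2: 0.3}, 0: {1: 0.4, 2: 0.6}, 1: {1: 0.4, 2: 0.1}}
p_x2 = {-1: {1: 0.4, 2: 0.1}, 0: {1: 0.5, 2: 0.3}, 1: {1: 0.1, 2: 0.6}}
prior = {1: 0.5, 2: 0.5}                      # the two classes are equally likely

x1, x2 = -1, 1                                # the given data point
joint = {c: prior[c] * p_x1[x1][c] * p_x2[x2][c] for c in (1, 2)}
Z = sum(joint.values())                       # evidence P(X1 = -1, X2 = 1)

print(joint[1] / Z)   # P(C1 | X1 = -1, X2 = 1), part (a)
print(joint[2] / Z)   # P(C2 | X1 = -1, X2 = 1), part (b)
```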

Qn 6. Here is a naive Bayes model with the following conditional probability table and prior probabilities over classes.

Word type      a      b      c
P(w | y = 1)   5/10   3/10   2/10
P(w | y = 0)   2/10   2/10   6/10

P(y = 1)   P(y = 0)
8/10       2/10

Consider a binary classification problem: whether a document is about Chandrayaan-3 (class y = 1) or is not about Chandrayaan-3 (y = 0). Consider a document consisting of 2 a's and 1 c.
(a) What is the probability that it is about Chandrayaan-3?
(b) What is the probability that it is not about Chandrayaan-3?
Now suppose that we know the document is about Chandrayaan-3 (y = 1).

(c) True or False: the naive Bayes model is able to tell us the probability of
seeing the document w = (a, a, b, c) under the model.

(d) If True, what is the probability?
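The following sketch, assuming Python 3.8+ (for math.prod) and a bag-of-words naive Bayes treatment in which words are conditionally independent given the class, computes the posteriors for the 2-a, 1-c document and the likelihood of w = (a, a, b, c) under y = 1.

```python
import math

p_w = {
    1: {"a": 0.5, "b": 0.3, "c": 0.2},   # P(word | y = 1)
    0: {"a": 0.2, "b": 0.2, "c": 0.6},   # P(word | y = 0)
}
prior = {1: 0.8, 0: 0.2}

doc = ["a", "a", "c"]                     # document with 2 a's and 1 c

# Unnormalised scores P(y) * prod_w P(w | y), then normalise over the two classes.
score = {y: prior[y] * math.prod(p_w[y][w] for w in doc) for y in (1, 0)}
Z = sum(score.values())
print(score[1] / Z)   # P(y = 1 | doc), part (a)
print(score[0] / Z)   # P(y = 0 | doc), part (b)

# Likelihood of the document w = (a, a, b, c) given y = 1, parts (c)/(d)
print(math.prod(p_w[1][w] for w in ["a", "a", "b", "c"]))
```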

Qn 7. In the following questions you will consider a k-nearest neighbor classifier using the Euclidean distance metric on a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. Note that a point can be its own neighbor.

Figure 3: Solution.

(a) What value of k minimizes the training set error for this dataset? What
is the resulting training error?
(b) Why might using too large a value of k be bad in this dataset? Why might
too small a value of k also be bad?
(c) In Figure 4, sketch the 1-nearest neighbor decision boundary for this
dataset.

Qn 8. A KNN classifier assigns a test instance the majority class associated with
its K nearest training instances. Distance between instances is measured using
Euclidean distance. Suppose we have the following training set of positive (+)
and negative (-) instances and a single test instance (o). All instances are
projected onto a vector space of two real-valued features (X and Y). Answer
the following questions. Assume “unweighted” KNN (every nearest neighbor
contributes equally to the final vote).

Figure 4: Input distribution.

(a) What would be the class assigned to this test instance for K=1?
(b) What would be the class assigned to this test instance for K=3?
(c) What would be the class assigned to this test instance for K=5?
(d) Setting K to a large value seems like a good idea. We get more votes!
Given this particular training set, would you recommend setting K = 11?
Why or why not?
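Since the training points are only given in the figures, here is a generic sketch of an unweighted KNN classifier with Euclidean distance and majority voting, assuming Python with NumPy; the training set and test point below are hypothetical stand-ins for the points shown in Figure 4.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    """Unweighted K-nearest-neighbor vote with Euclidean distance."""
    d = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_test, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]                 # indices of the k closest training points
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]           # majority class among the k neighbors

# Hypothetical 2-D training set (the actual instances are in the figure).
X_train = [(1, 1), (2, 1), (3, 2), (6, 5), (7, 6), (8, 5)]
y_train = ["+", "+", "+", "-", "-", "-"]
x_test  = (4, 3)

for k in (1, 3, 5):
    print(k, knn_predict(X_train, y_train, x_test, k))
```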
