Assignment 10


Introduction to Machine Learning


Prof. B. Ravindran
1. Suppose you have a single cluster of data points: (-2,-2), (-1,-2), (2,1), (1,2).
Find the data point x that has the highest average L2 distance to the other data
points.
(a) (-2,-2)
(b) (-1,-2)
(c) (2,1)
(d) (1,2)
Sol. (a)
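The answer can be checked numerically. Below is a minimal sketch using only the Python standard library; the helper name average_distance is ours, chosen for illustration, and is not part of the assignment.

```python
import math

# The four points from the question, in option order (a)-(d).
points = [(-2, -2), (-1, -2), (2, 1), (1, 2)]


def average_distance(i, pts):
    """Mean Euclidean (L2) distance from pts[i] to every other point."""
    others = [p for j, p in enumerate(pts) if j != i]
    return sum(math.dist(pts[i], q) for q in others) / len(others)


averages = [average_distance(i, points) for i in range(len(points))]
# Index of the point with the largest average distance.
farthest = points[max(range(len(points)), key=averages.__getitem__)]
print(averages, farthest)
```

The distances from (-2,-2) to the other points are 1, 5 and 5, giving an average of 11/3, which is the largest of the four.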

2. Which of the following statements is/are not true about k−means clustering?
(a) It is an unsupervised learning algorithm
(b) Overlapping of clusters is allowed in k−means clustering
(c) It is a hard-clustering technique
(d) k is a hyperparameter
Sol. (b)
k-means assigns every point to exactly one cluster, so clusters cannot overlap. Refer to the lecture.
For the following questions, we will use the MNIST dataset (here, the 8×8 handwritten-digits dataset bundled with scikit-learn), which can be loaded with the following utility from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html
Do not make any changes to the dataset unless directed in the question.
Set seed = 42 for numpy (np.random.seed(seed)).
Use scikit-learn 1.0.2 to run your experiments.
3. (2 Marks) Run K-means on the input features of the MNIST dataset using the following
initialization:
KMeans(n_clusters=10, random_state=seed)
Usually, for clustering tasks, we are not given labels, but since we do have labels for our
dataset, we can use accuracy to determine how good our clusters are.

Assign every point in a cluster the majority true label of that cluster as its predicted class.
E.g., a cluster with true labels {a, a, b} would be labeled as {a, a, a}.
What is the accuracy of the resulting labels?
(a) 0.790
(b) 0.893
(c) 0.702
(d) 0.933
Sol. (a)
Code for the solution can be found here: https://colab.research.google.com/drive/1Ol4gksThvWSuqo5sZdir4feXVy6irAJp?usp=sharing
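The majority-label scoring described in this question can be sketched as follows. This is an illustrative reconstruction, not the code from the linked notebook: the function names majority_label_accuracy and run_kmeans_on_digits are our own, and the scikit-learn imports are kept inside the second function so the scoring helper can be tried without scikit-learn installed.

```python
from collections import Counter


def majority_label_accuracy(y_true, cluster_ids):
    """Relabel each cluster with its most common true label, then score accuracy."""
    members = {}
    for label, cluster in zip(y_true, cluster_ids):
        members.setdefault(cluster, Counter())[label] += 1
    # Map every cluster id to the majority true label among its members.
    majority = {c: counts.most_common(1)[0][0] for c, counts in members.items()}
    y_pred = [majority[c] for c in cluster_ids]
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return y_pred, correct / len(y_true)


def run_kmeans_on_digits(seed=42):
    """Assumed wiring for Q3; requires numpy and scikit-learn."""
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits

    np.random.seed(seed)
    X, y = load_digits(return_X_y=True)
    clusters = KMeans(n_clusters=10, random_state=seed).fit_predict(X)
    return majority_label_accuracy(y.tolist(), clusters.tolist())


# Toy check from the question: the cluster {a, a, b} becomes {a, a, a}.
toy_pred, toy_acc = majority_label_accuracy(["a", "a", "b"], [0, 0, 0])
print(toy_pred, toy_acc)
```

Calling run_kmeans_on_digits() returns the relabeled predictions and their accuracy; the exact value may depend on the scikit-learn version, which is why the assignment pins 1.0.2.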
4. (2 Marks) For the same clusters obtained in the previous question, calculate the rand-index.
The formula for rand-index:
$$R = \frac{a + b}{\binom{n}{2}}$$
where
n = the number of data points, so that $\binom{n}{2}$ is the total number of pairs,
a = the number of pairs of elements that are in the same cluster in both clusterings,
b = the number of pairs of elements that are in different clusters in both clusterings.

Note: The two clusterings are given by: (1) ground truth labels, (2) prediction labels obtained from
clustering as directed in Q3.
(a) 0.879
(b) 0.893
(c) 0.919
(d) 0.933
Sol. (d)
Code for the solution can be found here: https://colab.research.google.com/drive/1Ol4gksThvWSuqo5sZdir4feXVy6irAJp?usp=sharing
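The rand-index formula can be evaluated without enumerating all pairs by counting pairs per cluster. The sketch below is one possible implementation under that approach (the function name rand_index is ours); it is not taken from the linked notebook. scikit-learn's sklearn.metrics.rand_score computes the same quantity.

```python
from collections import Counter
from math import comb


def rand_index(labels_a, labels_b):
    """Rand index R = (a + b) / C(n, 2) for two labelings of the same points."""
    n = len(labels_a)
    total_pairs = comb(n, 2)
    # Pairs that share a cluster in each labeling separately.
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    # a: pairs that share a cluster in both labelings.
    a = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # b: pairs separated in both labelings (inclusion-exclusion).
    b = total_pairs - same_in_a - same_in_b + a
    return (a + b) / total_pairs


# Small hand-checkable example: 3 of the 6 pairs agree.
example = rand_index([0, 0, 1, 1], [0, 0, 0, 1])
print(example)
```

For Q4, the two arguments would be the ground truth labels and the majority-label predictions from Q3.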
5. In the rand-index, a can be viewed as true positives (pairs of points belonging to the same cluster) and
b as true negatives (pairs of points belonging to different clusters). How, then, are the rand-index
and accuracy from the previous two questions related?
(a) rand-index = accuracy
(b) rand-index = 1.18×accuracy
(c) rand-index = accuracy/2
(d) None of the above
Sol. (d)
The accuracy in Q3 is computed over individual points, whereas the rand-index is computed over pairs of points. The
two are, therefore, not directly related.
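A tiny example makes the difference concrete. The sketch below reuses the toy labeling {a, a, b} versus {a, a, a} from Q3 and computes both quantities by brute force; it is an illustration only and says nothing about the values obtained on the full dataset.

```python
from itertools import combinations

y_true = ["a", "a", "b"]
y_pred = ["a", "a", "a"]

# Accuracy: fraction of individual points whose labels match.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Rand index: fraction of point pairs on which the two labelings agree
# about "same cluster" versus "different cluster".
pairs = list(combinations(range(len(y_true)), 2))
agreements = sum(
    (y_true[i] == y_true[j]) == (y_pred[i] == y_pred[j]) for i, j in pairs
)
rand = agreements / len(pairs)

print(accuracy, rand)
```

Here two of three points are labeled correctly, but only one of three pairs is treated consistently, so the two scores differ and no fixed ratio between them holds in general.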
6. Run BIRCH on the input features of MNIST dataset using Birch(n_clusters=10, threshold=1).
What is the rand-index obtained?
(a) 0.91
(b) 0.96
(c) 0.88
(d) 0.98
Sol. (b)
Code for the solution can be found here: https://colab.research.google.com/drive/1Ol4gksThvWSuqo5sZdir4feXVy6irAJp?usp=sharing
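A possible way to run this experiment is sketched below. Two points are assumptions on our part: the function name birch_rand_index is hypothetical, and the rand-index is computed between the ground truth labels and the raw BIRCH cluster ids, since the question does not say whether majority relabeling (as in Q3) should be applied first. The linked notebook may do this differently.

```python
def birch_rand_index(seed=42):
    """Assumed wiring for Q6; requires numpy and scikit-learn."""
    import numpy as np
    from sklearn.cluster import Birch
    from sklearn.datasets import load_digits
    from sklearn.metrics import rand_score

    np.random.seed(seed)
    X, y = load_digits(return_X_y=True)
    # Parameters exactly as given in the question.
    cluster_ids = Birch(n_clusters=10, threshold=1).fit_predict(X)
    # Assumption: compare raw cluster ids with the ground truth labels.
    return rand_score(y, cluster_ids)
```

Calling birch_rand_index() returns a value between 0 and 1 that can be compared against the options above.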

7. (2 Marks) Run PCA on MNIST dataset input features with n_components = 2. Now run
DBSCAN using DBSCAN(eps=0.5, min_samples=5) on both the original features and
the PCA features. What are their respective number of outliers/noisy points detected by
DBSCAN?
As an extra, you can plot the PCA features on a 2D plot using matplotlib.pyplot.scatter with
parameter c = y_pred (where y_pred is the cluster prediction) to visualise the clusters and
outliers.

(a) 1600, 1522
(b) 1500, 1482
(c) 1000, 1000
(d) 1797, 1742

Sol. (d)
Code for the solution can be found here: https://colab.research.google.com/drive/1Ol4gksThvWSuqo5sZdir4feXVy6irAJp?usp=sharing
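The noise counts can be obtained from the labels DBSCAN returns, since scikit-learn marks noisy points with the label -1. The sketch below is an illustrative reconstruction rather than the notebook's code; the names count_noise and dbscan_noise_on_digits are ours.

```python
def count_noise(labels):
    """Count points that DBSCAN marked as noise (label -1)."""
    return sum(1 for label in labels if label == -1)


def dbscan_noise_on_digits(seed=42):
    """Assumed wiring for Q7; requires numpy and scikit-learn."""
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    np.random.seed(seed)
    X, _ = load_digits(return_X_y=True)
    X_pca = PCA(n_components=2).fit_transform(X)
    # Same DBSCAN parameters on the original and the 2D PCA features.
    labels_orig = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    labels_pca = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_pca)
    return count_noise(labels_orig.tolist()), count_noise(labels_pca.tolist())


# Toy check of the counting helper.
toy_noise = count_noise([0, -1, 1, -1])
print(toy_noise)
```

A likely reason nearly every point is noise on the original features: the pixel values are integers, so two distinct images are at least distance 1 apart, which exceeds eps=0.5. This is consistent with the first number in option (d) equaling the dataset size.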
