
Data Mining and Machine Learning

Name: Folusho Arokoyo

Student ID: w1878345


Question 1.

Loading the 'cluster_data.npy' dataset
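A minimal sketch of the loading step, assuming the file sits in the working directory and holds a two-column array of points:

import numpy as np

data = np.load('cluster_data.npy')  # expected to be an (n_samples, 2) array of points
print(data.shape)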

Scatterplot
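A sketch of the raw, unlabelled scatterplot, assuming the data array loaded above:

import matplotlib.pyplot as plt

plt.scatter(data[:, 0], data[:, 1], s=10)
plt.title('cluster_data.npy (unlabelled)')
plt.show()
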
Plot_clusters Utility Function
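The utility is assumed to fit the clustering class passed to it with the given args/kwds, time the fit, and colour the scatterplot by the resulting labels; a minimal sketch along those lines:

import time
import matplotlib.pyplot as plt

def plot_clusters(data, algorithm, args, kwds):
    # Fit the clustering class, time the fit, and colour each point by its label
    start = time.time()
    labels = algorithm(*args, **kwds).fit_predict(data)
    elapsed = time.time() - start
    plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='tab10', s=10)
    plt.title(f'Clusters found by {algorithm.__name__} in {elapsed:.2f} s')
    plt.show()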

Question 2

K-Means
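The K-Means call is not reproduced here; assuming the same plot_clusters utility and the six clusters used in Question 4, it would look like:

from sklearn.cluster import KMeans

plot_clusters(data, KMeans, (), {'n_clusters': 6})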

Affinity Propagation

from sklearn.cluster import AffinityPropagation


plot_clusters(data, AffinityPropagation, (), {'damping': 0.9})
plot_clusters(data, AffinityPropagation, (), {'damping': 0.5})
plot_clusters(data, AffinityPropagation, (), {'max_iter': 200})
plot_clusters(data, AffinityPropagation, (), {'convergence_iter': 200})

Mean Shift
from sklearn.cluster import MeanShift
plot_clusters(data, MeanShift, (), {'bandwidth': 8.0})
plot_clusters(data, MeanShift, (), {'bandwidth': 0.2})
plot_clusters(data, MeanShift, (), {'min_bin_freq': 25})
plot_clusters(data, MeanShift, (), {'bin_seeding': True})

Spectral Clustering

from sklearn.cluster import SpectralClustering


plot_clusters(data, SpectralClustering, (), {'n_clusters': 6})
plot_clusters(data, SpectralClustering, (), {'affinity': 'rbf'})
plot_clusters(data, SpectralClustering, (), {'assign_labels': 'kmeans'})
plot_clusters(data, SpectralClustering, (), {'affinity': 'nearest_neighbors'})
Agglomerative Clustering

from sklearn.cluster import AgglomerativeClustering


plot_clusters(data, AgglomerativeClustering, (), {'affinity': 'euclidean'})
plot_clusters(data, AgglomerativeClustering, (), {'linkage': 'ward'})
plot_clusters(data, AgglomerativeClustering, (), {'n_clusters': 6})

HDBSCAN

import hdbscan
from sklearn.cluster import DBSCAN
from joblib import Memory

plot_clusters(data, hdbscan.HDBSCAN, (), {'metric':'euclidean'})


plot_clusters(data, hdbscan.HDBSCAN, (), {'min_samples': 15})
plot_clusters(data, hdbscan.HDBSCAN, (), {'p': 0.01})
plot_clusters(data, hdbscan.HDBSCAN, (), {'leaf_size': 20})
plot_clusters(data, hdbscan.HDBSCAN, (), {'cluster_selection_method': 'eom'})

Question 3

K-Means

Justification: It clusters the data points into six groups.


Affinity Propagation

Justification: Of the parameter settings tried, this produced the most reasonable output.

Mean Shift

Justification: This appears to be the most reasonable output, as it separates the data points most distinctly.

Spectral Clustering

Justification: The algorithm segmented the data into six clusters, in line with the expectation formed from the initial visualisation.

Agglomerative Clustering

Justification: The algorithm grouped the data points into distinct clusters.

HDBSCAN

Justification: HDBSCAN is less likely to assign noisy data points to clusters. The algorithm classifies the noise around the clusters well [1], making the clusters stand out more clearly.

Question 4

1. K-means clustering

K-means clustering is not a robust algorithm: it can perform poorly on datasets with non-spherical clusters. Because K-means relies on distances between data points, it is sensitive to outliers and noisy data, since noise can introduce spurious distances between points.

The following (algorithm, args, kwds) values were specified for the plot_clusters() function:

Algorithm: k-means
args: None
kwds: {'n_clusters': 6}

Based on domain knowledge and visualisation of the dataset, an n_clusters value of 6 was chosen. The intra-cluster distance for the K-means clustering algorithm is relatively small, which indicates that the data points within each cluster are close together. K-means clustering works well when clusters are well separated from each other.
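One way to quantify this intra-cluster compactness is the fitted model's inertia_ attribute (the sum of squared distances from each point to its cluster centre); a brief sketch, assuming the same data array:

from sklearn.cluster import KMeans

km = KMeans(n_clusters=6).fit(data)
print(km.inertia_)  # smaller values indicate more compact clusters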

The closeness of the clusters does not permit the algorithm to separate the clusters properly. Overall,
K-means clustering is an efficient and fast algorithm but seems not to work well with the given
dataset because of overlapping clusters. It works well when the data points in different clusters are
clearly distinct.
2. Affinity propagation

The affinity propagation algorithm produces relatively separable clusters, but the clusters are not homogeneous or complete. It produces varying results for different parameters and may converge to suboptimal solutions.

Algorithm: AffinityPropagation
args: None
kwds: {'damping': 0.9}

After a series of tests, Affinity Propagation performed best with a damping value of 0.9. The algorithm is slow and computationally expensive (it took 11.04 seconds to produce an output) compared to K-means (0.07 seconds).
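The run times quoted here can be reproduced with a simple wall-clock measurement around each fit; a sketch (absolute figures will vary by machine):

import time
from sklearn.cluster import AffinityPropagation, KMeans

for algorithm, kwds in [(KMeans, {'n_clusters': 6}), (AffinityPropagation, {'damping': 0.9})]:
    start = time.perf_counter()
    algorithm(**kwds).fit(data)  # time only the clustering step
    print(algorithm.__name__, f'{time.perf_counter() - start:.2f} s')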

In general, the algorithm handles non-linearly separable data well and can find clusters of arbitrary shape; on this dataset, however, Affinity Propagation did not cluster the data points well. The data points within each cluster are close to each other (small intra-cluster distances). The number of clusters does not need to be determined before affinity propagation is executed [2].

3. Mean Shift

In the mean shift algorithm, each point tries to find its group by moving towards the weighted mean
of its local area in each step[3].

Algorithm: MeanShift
args: None
kwds: {'bandwidth': 0.2}

After multiple trials, the best keyword parameter was a bandwidth of 0.2. In contrast to K-means, the Mean Shift algorithm created four clusters that are less well defined than those produced by K-means. The algorithm is not highly scalable, as it requires multiple nearest-neighbour searches during its execution [4].
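scikit-learn also provides estimate_bandwidth as a data-driven starting point for this parameter, which can then be refined by hand; a minimal sketch (the quantile shown is an assumed, illustrative value):

from sklearn.cluster import MeanShift, estimate_bandwidth

bw = estimate_bandwidth(data, quantile=0.2)  # data-driven initial bandwidth
labels = MeanShift(bandwidth=bw).fit_predict(data)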

4. Spectral Clustering
This algorithm is sensitive to parameter choice and may produce varying results for different parameters. It is also computationally expensive; for example, it took 1.18 seconds to compute. The inter-cluster distances are large and sparsely distributed, while the points within each cluster are well compacted.

Algorithm: SpectralClustering
args: None
kwds: {'n_clusters': 6}
5. Agglomerative Clustering
This type of algorithm seems to be sensitive to the choice of distance metric and linkage criterion.

Algorithm: AgglomerativeClustering
args: None
kwds: {'n_clusters': 6}

Different keyword parameters were tried to obtain the best output, and {'n_clusters': 6} outperformed the others. The algorithm clustered the data points into six groups, similar to the K-means result with the same number of clusters. The intra-cluster distances between data points are relatively large, while the inter-cluster distances are small, with the clusters sitting close to one another.

6. HDBSCAN

This algorithm outperformed the other algorithms, especially in dealing with noise. Being a density-based clustering algorithm, HDBSCAN was able to identify the clusters cleanly against the surrounding noisy data points. Different keyword parameters (such as min_samples) were tried, as shown in the Google Colab notebook. The best parameter was min_samples of 15, which creates a clear distinction between the clusters.

Algorithm: hdbscan.HDBSCAN
args: None
kwds: {'min_samples': 15}

The intra-cluster relationship shows closely packed data points within each cluster, while the inter-cluster distances are large and sparsely distributed.
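Since HDBSCAN labels points it treats as noise with -1, the amount of detected noise can be read directly off the fitted labels; a short sketch with the chosen min_samples:

import hdbscan
import numpy as np

labels = hdbscan.HDBSCAN(min_samples=15).fit_predict(data)
print('noise points:', int(np.sum(labels == -1)), 'of', len(labels))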

Conclusion.

HDBSCAN is a density-based algorithm that is robust to outliers. It outperformed the other algorithms: it was able to identify the noise around each cluster and produced the expected number of clusters (six), matching the initial visualisation. Overall, HDBSCAN produces good results for the given dataset.

References

1. Understanding Density-based Clustering, https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html
2. Park S, Jo H-S, Mun C, Yook J-G. RRH Clustering Using Affinity Propagation Algorithm with Adaptive Thresholding and Greedy Merging in Cloud Radio Access Network. Sensors. 2021; 21(2):480. https://doi.org/10.3390/s21020480
3. Understanding Mean Shift Clustering and Implementation with Python, https://towardsdatascience.com/understanding-mean-shift-clustering-and-implementation-with-python-6d5809a2ac40, last accessed: 2023/05/02
4. Clustering: Mean Shift, https://scikit-learn.org/stable/modules/clustering.html#mean-shift, last accessed: 2023/05/02
