Data Mining and Machine Learning PDF

Data Mining and Machine Learning
Name: Folusho Arokoyo
Student ID: w1878345

Question 1.
Loading the 'cluster_data.npy' dataset
Scatterplot
Plot_clusters Utility Function
Questions 2
K-Means
Affinity Propagation
from sklearn.cluster import AffinityPropagation

plot_clusters(data, AffinityPropagation, (), {'damping': 0.9})
plot_clusters(data, AffinityPropagation, (), {'damping': 0.5})
plot_clusters(data, AffinityPropagation, (), {'max_iter': 200})
plot_clusters(data, AffinityPropagation, (), {'convergence_iter': 200})
Mean Shift
from sklearn.cluster import MeanShift
plot_clusters(data, MeanShift, (), {'bandwidth': 8.0})
plot_clusters(data, MeanShift, (), {'bandwidth':0.2})
plot_clusters(data, MeanShift, (), {'min_bin_freq': 25})
plot_clusters(data, MeanShift, (), {'bin_seeding': True})
Spectral Clustering
from sklearn.cluster import SpectralClustering

plot_clusters(data, SpectralClustering, (), {'n_clusters': 6})
plot_clusters(data, SpectralClustering, (), {'affinity': 'rbf'})
plot_clusters(data, SpectralClustering, (), {'assign_labels': 'kmeans'})
plot_clusters(data, SpectralClustering, (), {'affinity': 'nearest_neighbors'})
Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

plot_clusters(data, AgglomerativeClustering, (), {'affinity':'euclidean'})
plot_clusters(data, AgglomerativeClustering, (), {'linkage':'ward'})
plot_clusters(data, AgglomerativeClustering, (), {'n_clusters': 6})
HDBSCAN
import hdbscan
from sklearn.cluster import DBSCAN
from joblib import Memory
plot_clusters(data, hdbscan.HDBSCAN, (), {'metric':'euclidean'})

plot_clusters(data, hdbscan.HDBSCAN, (), {'min_samples': 15})
plot_clusters(data, hdbscan.HDBSCAN, (), {'p': 0.01})
plot_clusters(data, hdbscan.HDBSCAN, (), {'leaf_size': 20})
plot_clusters(data, hdbscan.HDBSCAN, (), {'cluster_selection_method': 'eom'})
Question 3
K-Means
Justification: It cluster the data points into 6 groups.

Affinity Propagation
Justification: It is the most reasonable output based on the tried parameters.
Mean Shift
Justification: This seems to be the reasonable output as it distinctively classifies the data points.
Spectral Clustering
Justification: Visualising the clusters, the algorithm was able to segment the clusters into 6 groups
which are in line with the initial visualised expectation.
Agglomerative Clustering
Justification: The algorithm clustered the data points within each cluster.
HDBSCAN
Justification: HDBSCAN is less likely to assign noisy data points to clusters. The algorithm classifies
noise around the clusters perfectly, [1] thereby making the clusters stand out better.
Question 4
1. K-means clustering
K-means clustering is not a robust algorithm It can perform poorly on datasets with non-spherical
clusters. K-means clustering relies on the distance between data points makes it sensitive to outliers,
and noisy data, since noisy data can spurious distances between data points.
The following (algorithm, args, kwds) values were specified for the plot_clusters() function:
Algorithm: k-means
args: None
kwds: {'n_clusters': 6}
Based on the domain knowledge and visualisation of the dataset, n_clusters of 6 was chosen
The intra-cluster distance for the K-means clustering algorithm is relatively small, which indicates
that the data points within each cluster are close together. K-means clustering works well when
clusters are well separated from each other.
The closeness of the clusters does not permit the algorithm to separate the clusters properly. Overall,
K-means clustering is an efficient and fast algorithm but seems not to work well with the given
dataset because of overlapping clusters. It works well when the data points in different clusters are
clearly distinct.
2. Affinity propagation
The affinity propagation algorithm produces relatively separable clusters, but the clusters are not
homogeneous or complete. They produce varying results for different parameters and may converge
to suboptimal solutions.
Algorithm: AffinityPropagation
args: None
kwds: {'damping': 0.9}
After a series of tests, Affinity Propagation performs well with the damping keyword of 0.9. The
algorithm is slow and computationally expensive (It took 11.04 seconds to produce an output)
compared to K-means (0.07 seconds).
The algorithm is good at handling non-linearly separable data and finding clusters of arbitrary shape.
So far, Affinity Propagation did not do a good job of clustering the data points. The data points in
each cluster are close to each other (intra-cluster). The number of clusters does not need to be
determined before affinity propagation is executed [2].
3. Mean Shift
In the mean shift algorithm, each point tries to find its group by moving towards the weighted mean
of its local area in each step[3].
Algorithm: MeanShift
args: None
kwds: 'bandwidth': 0.2}
After multiple trials, the optimal keyword parameter used was a bandwidth of 0.2. In contrast to the
K-means, the Mean shift clustering algorithm was able to create 4 clusters that are less well-defined
than the K-means clustering algorithm. The algorithm is not highly scalable, as it requires multiple
nearest neighbour searches during the execution of the algorithm[4].
4. Spectral Clustering
This algorithm is sensitive to parameter choice and may produce varying results for different
parameters. It is also computationally expensive for example it took 1.18 seconds to compute. The
distance between the inter-cluster outputs is sparsely distributed while the intra-clusters are well
compacted.
Algorithm: SpectralClustering
args: None
5. Agglomerative Clustering
This type of algorithm seems to be sensitive to the choice of distance metric and linkage criterion.
Algorithm: AgglomerativeClustering
args: None
Different keyword parameters were tried in getting the best output. The keyword parameter
{‘n_clusters’: 6} outperformed the other parameters. The algorithm was able to cluster the data
points into six groups similar to the K-means clustering of the same clustering value. The intra-cluster
distances between the data points are far apart while the inter-cluster distances are close to each
other.
6. HDBSCAN
This algorithm outperformed the other algorithms, especially in dealing with noise. Being a
density-based clustering algorithm, HDBSCAN was able to identify clusters perfectly from the
surrounding noisy data points. Different keyword parameters (min_samples, were tried as shown on
the Google Colab IDE environment. The best parameter is the min_samples of 15. This creates a clear
distinction between the clusters.
Algorithm: hdbscan.HDBSCAN
args: None
Kwds: {'min_samples': 15}
The intra-clusters relationship shows closely packed data points within each cluster while the
inter-clusters distances are sparely represented.
Conclusion.
HDBSCAN is a density-based algorithm that is robust to outliers, it outperformed other algorithms as

it was able to identify the noise around each cluster and also output the expected value for clusters
initially visualised as 6 clusters, matching my expectation. Overall, HDBSCAN produces good results
for the given dataset.
References
1. Understanding Density-based Clustering,

https://www.kdnuggets.com/2020/02/understanding-density-based-clustering.html
2. Park S, Jo H-S, Mun C, Yook J-G. RRH Clustering Using Affinity Propagation Algorithm with Adaptive
Thresholding and Greedy Merging in Cloud Radio Access Network. Sensors. 2021; 21(2):480.
https://doi.org/10.3390/s21020480
3. Understanding Mean Shift Clustering and Implementation with Python,
https://towardsdatascience.com/understanding-mean-shift-clustering-and-implementation-with-pyt
hon-6d5809a2ac40, last accessed: 2023/05/02
4. Clustering: Mean Shift, https://scikit-learn.org/stable/modules/clustering.html#mean-shift, last
accessed: 2023/05/02.

Data Mining and Machine Learning PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining and Machine Learning PDF

Uploaded by

Copyright:

Available Formats

Data Mining and Machine Learning

Name: Folusho Arokoyo

Student ID: w1878345

Loading the 'cluster_data.npy' dataset

from sklearn.cluster import AffinityPropagation

from sklearn.cluster import SpectralClustering

from sklearn.cluster import AgglomerativeClustering

plot_clusters(data, hdbscan.HDBSCAN, (), {'metric':'euclidean'})

Justification: It cluster the data points into 6 groups.

Justification: It is the most reasonable output based on the tried parameters.

HDBSCAN is a density-based algorithm that is robust to outliers, it outperformed other algorithms as

1. Understanding Density-based Clustering,

You might also like