Data Mining and Machine Learning PDF
Scatterplot
Plot_clusters Utility Function
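The plot_clusters helper itself is not reproduced in this document, but the calls in the sections below show its signature is plot_clusters(data, algorithm, args, kwds). The following is a minimal sketch of what such a helper might look like; the timing and colouring details are assumptions, not the assignment's actual code:

```python
import time

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt


def plot_clusters(data, algorithm, args, kwds):
    """Fit algorithm(*args, **kwds) on data, time the fit, and scatter-plot
    the points coloured by cluster label. Returns the labels."""
    start = time.time()
    labels = algorithm(*args, **kwds).fit_predict(data)
    elapsed = time.time() - start
    plt.figure()
    plt.scatter(data[:, 0], data[:, 1], c=labels, s=10, cmap="viridis")
    plt.title(f"{algorithm.__name__}: {elapsed:.2f} s")
    return labels
```

Each section below passes a different clustering estimator class and keyword dictionary to this helper.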
Question 2
K-Means
Affinity Propagation
Mean Shift
from sklearn.cluster import MeanShift
plot_clusters(data, MeanShift, (), {'bandwidth': 8.0})
plot_clusters(data, MeanShift, (), {'bandwidth':0.2})
plot_clusters(data, MeanShift, (), {'min_bin_freq': 25})
plot_clusters(data, MeanShift, (), {'bin_seeding': True})
Spectral Clustering
HDBSCAN
import hdbscan
from sklearn.cluster import DBSCAN
from joblib import Memory
Question 3
K-Means
Mean Shift
Justification: This appears to be a reasonable output, as it classifies the data points into clearly distinct groups.
Spectral Clustering
Justification: On visualising the clusters, the algorithm segmented the data into six groups, which is in line with the initial expectation from the visualisation.
Agglomerative Clustering
Justification: The algorithm kept the data points within each cluster grouped closely together.
HDBSCAN
Justification: HDBSCAN is less likely to assign noisy data points to clusters. The algorithm cleanly separates the noise surrounding the clusters [1], making the clusters stand out better.
Question 4
1. K-means clustering
K-means clustering is not a robust algorithm: it can perform poorly on datasets with non-spherical clusters. Because K-means relies on distances between data points, it is sensitive to outliers and noisy data, since noise can introduce spurious distances between points.
The following (algorithm, args, kwds) values were specified for the plot_clusters() function:
Algorithm: k-means
args: None
kwds: {'n_clusters': 6}
Based on domain knowledge and a visualisation of the dataset, an n_clusters value of 6 was chosen.
The intra-cluster distance for the K-means clustering algorithm is relatively small, which indicates
that the data points within each cluster are close together. K-means clustering works well when
clusters are well separated from each other.
Because the clusters lie close together, the algorithm cannot separate them properly. Overall, K-means clustering is efficient and fast, but it does not work well with the given dataset because of the overlapping clusters; it performs best when the data points in different clusters are clearly distinct.
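The behaviour described above can be reproduced on synthetic data. The blobs below are a stand-in for the assignment dataset (which is not included here); the n_clusters=6 setting matches the kwds listed above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the assignment dataset: six partly overlapping blobs.
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.5, random_state=42)

km = KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)

# inertia_ is the sum of squared intra-cluster distances; a low per-point
# value corresponds to the "small intra-cluster distance" observation above.
print(f"clusters found: {len(set(km.labels_))}")
print(f"mean within-cluster squared distance: {km.inertia_ / len(X):.2f}")
```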
2. Affinity propagation
The affinity propagation algorithm produces relatively separable clusters, but the clusters are not homogeneous or complete. It produces varying results for different parameters and may converge to suboptimal solutions.
Algorithm: AffinityPropagation
args: None
kwds: {'damping': 0.9}
After a series of tests, Affinity Propagation performed best with a damping value of 0.9. The algorithm is slow and computationally expensive: it took 11.04 seconds to produce an output, compared to 0.07 seconds for K-means.
The algorithm is good at handling non-linearly separable data and finding clusters of arbitrary shape. Even so, Affinity Propagation did not do a good job of clustering these data points, although the points within each cluster are close to each other (small intra-cluster distances). A useful property is that the number of clusters does not need to be determined before affinity propagation is executed [2].
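The timing contrast and the self-determined cluster count can be sketched on synthetic data (a stand-in for the assignment dataset; the measured times will differ from the 11.04 s and 0.07 s reported above):

```python
import time

from sklearn.cluster import AffinityPropagation, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=6, cluster_std=0.6, random_state=42)

start = time.time()
ap = AffinityPropagation(damping=0.9, random_state=42).fit(X)
ap_time = time.time() - start

start = time.time()
KMeans(n_clusters=6, n_init=10, random_state=42).fit(X)
km_time = time.time() - start

# Affinity Propagation infers the cluster count from the data itself.
print(f"AP found {len(ap.cluster_centers_indices_)} clusters in "
      f"{ap_time:.2f} s (K-means with n_clusters=6: {km_time:.2f} s)")
```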
3. Mean Shift
In the mean shift algorithm, each point finds its group by moving towards the weighted mean of its local neighbourhood at each step [3].
Algorithm: MeanShift
args: None
kwds: {'bandwidth': 0.2}
After multiple trials, the best keyword parameter found was a bandwidth of 0.2. In contrast to K-means, the Mean Shift algorithm created four clusters, and these are less well defined than those produced by K-means. The algorithm is not highly scalable, as it requires multiple nearest-neighbour searches during execution [4].
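The trial-and-error bandwidth search described above can be shortcut with scikit-learn's estimate_bandwidth helper. A sketch on synthetic stand-in data (the 0.2 value above was tuned to the assignment dataset, so a different bandwidth is appropriate here):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=42)

# estimate_bandwidth gives a data-driven starting point for the bandwidth
# search; bin_seeding speeds up seed initialisation, as tried above.
bw = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bw, bin_seeding=True).fit(X)
print(f"bandwidth={bw:.2f}, clusters found: {len(ms.cluster_centers_)}")
```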
4. Spectral Clustering
This algorithm is sensitive to the choice of parameters and may produce varying results for different settings. It is also computationally expensive; for example, it took 1.18 seconds to compute. The inter-cluster distances are large (the clusters are sparsely distributed), while each cluster is internally compact.
Algorithm: SpectralClustering
args: None
kwds: {'n_clusters': 6}
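A sketch of this configuration on synthetic stand-in data; the comment on the affinity matrix is one plausible explanation for the cost noted above:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=6, cluster_std=0.8, random_state=42)

# The default affinity='rbf' builds a dense n-by-n similarity matrix, which
# is one reason the algorithm is computationally expensive for large n.
labels = SpectralClustering(n_clusters=6, random_state=42).fit_predict(X)
print(f"clusters found: {len(set(labels))}")
```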
5. Agglomerative Clustering
This type of algorithm seems to be sensitive to the choice of distance metric and linkage criterion.
Algorithm: AgglomerativeClustering
args: None
kwds: {'n_clusters': 6}
Different keyword parameters were tried to find the best output. The keyword parameter {'n_clusters': 6} outperformed the other settings. The algorithm clustered the data points into six groups, similar to K-means with the same cluster count. The data points within each cluster are close together (small intra-cluster distances), while the clusters themselves lie close to one another.
6. HDBSCAN
This algorithm outperformed the other algorithms, especially in dealing with noise. Being a density-based clustering algorithm, HDBSCAN was able to identify the clusters cleanly from the surrounding noisy data points. Different keyword parameters (e.g. min_samples) were tried, as shown in the Google Colab notebook. The best parameter was min_samples=15, which creates a clear distinction between the clusters.
Algorithm: hdbscan.HDBSCAN
args: None
kwds: {'min_samples': 15}
Within each cluster the data points are closely packed (small intra-cluster distances), while the inter-cluster distances are large and sparsely represented.
Conclusion.