Clustering

Introduction to Clustering:

https://www.geeksforgeeks.org/clustering-in-machine-learning/
Clustering is a type of unsupervised learning method. An unsupervised learning method is one in which we draw inferences from datasets consisting of input data without labelled responses. Generally, it is used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In other words, it is a grouping of objects on the basis of the similarity and dissimilarity between them.
Applications of Cluster Analysis:
https://www.analyticssteps.com/blogs/5-clustering-methods-and-applications
 Recommendation engines
 Market and Customer segmentation
 Social Network Analysis (SNA)
 Search Result Clustering 
 Biological Data Analysis, Medical Imaging Analysis and Identification of Cancer
Cells
https://www.geeksforgeeks.org/clustering-in-machine-learning/

 Marketing: It can be used to characterize and discover customer segments for marketing purposes.
 Biology: It can be used for classification among different species of plants and animals.
 Libraries: It is used for clustering different books on the basis of topics and information.
 Insurance: It is used to understand customers and their policies and to identify fraud.
Types of Clustering:
https://www.analytixlabs.co.in/blog/types-of-clustering-algorithms/
1. Connectivity-based Clustering (Hierarchical clustering)
2. Centroids-based Clustering (Partitioning methods)     
3. Distribution-based Clustering
4. Density-based Clustering (Model-based methods)
5. Fuzzy Clustering
Centroid Based Clustering
Centroid-based clustering organizes the data into non-hierarchical clusters.
1. K MEANS
2. K-MEDOIDS
3. GENERALIZED K-HARMONIC MEANS

K Means Clustering:
Computational Steps:
1. Select k points at random as centroids/cluster centers.
2. Assign data points to the closest cluster based on Euclidean distance
3. Calculate centroid of all points within the cluster
4. Repeat iteratively till convergence. (Same points are assigned to the clusters in
consecutive iterations)

Example:
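As a minimal sketch of these steps (assuming scikit-learn and NumPy are installed; the data below is synthetic and purely illustrative), scikit-learn's KMeans performs the same assign-and-update loop internally:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data: two loose groups of points
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# k = 2: pick initial centroids, assign points by Euclidean distance,
# recompute centroids, and repeat until the assignments stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)   # final centroids
print(kmeans.labels_[:10])       # cluster index of the first 10 points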
Advantages of k-means
 Relatively simple to implement.
 Scales to large data sets.
 Guarantees convergence.
 Can warm-start the positions of centroids.
 Easily adapts to new examples.
 Generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of k-means:
 It requires the number of clusters (k) to be specified in advance.
 It cannot handle noisy data and outliers well.
 It is not suitable for identifying clusters with non-convex shapes.
 It suffers from the curse of dimensionality.

K-MEDOIDS Clustering
A medoid can be defined as the point in a cluster whose total dissimilarity to all the other points in the cluster is minimum.
The dissimilarity between the medoid (Ci) and an object (Pi) is calculated using E = |Pi - Ci|.

Computational Steps:
1. Select k data points at random as the initial medoids.
2. Assign each remaining data point to the closest medoid.
3. For each cluster, swap the medoid with a non-medoid point if the swap reduces the total dissimilarity (cost) of the configuration.
4. Repeat until the medoids no longer change.
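A simplified NumPy sketch of this idea is given below (an illustrative PAM-style loop, not an optimised library implementation; the function and variable names are chosen here for illustration only):

import numpy as np

def k_medoids(data, k, max_iter=100, rng=None):
    # Assign points to the nearest medoid, then replace each cluster's
    # medoid with the member point minimising total distance to the rest.
    rng = np.random.default_rng(rng)
    n = len(data)
    medoid_idx = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # distances from every point to every current medoid
        dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_medoids = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # pairwise distances within the cluster
            within = np.linalg.norm(data[members][:, None, :] - data[members][None, :, :], axis=2)
            new_medoids[j] = members[within.sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoid_idx)):
            break   # medoids unchanged: converged
        medoid_idx = new_medoids
    return medoid_idx, labels

# toy usage with random 2-D points
points = np.random.default_rng(1).random((30, 2))
medoids, labels = k_medoids(points, k=3, rng=1)
print(points[medoids])  # coordinates of the chosen medoids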
Advantages of K-medoids Algorithms
 Compared to other partitioning algorithms, it deals effectively with noise and outliers in the data, because it uses medoids for partitioning objects into clusters.

 Easily implementable and simple to understand.

 The K-medoids algorithm is comparatively fast compared to other partitional algorithms.

 It outputs the final clusters of objects in a fixed number of iterations.


Disadvantages of K-medoids Algorithms
 It may produce different clusters on different runs over the same dataset, because the initial k medoids are chosen randomly from the data objects and assigned to the clusters one by one, each becoming the initial medoid of its cluster.

 The value of k (the number of clusters/groups) is fixed at the start, so we do not know at what value of k the result is accurate and distinguishable.

 Its overall computational time and the final distribution of objects into clusters or groups depend on the initial partition.

 Objects are assigned to clusters based on their minimum distance from the medoid instead of the centroid as in k-means, so it is not useful for clustering data into arbitrarily shaped clusters.
Fuzzy C-Means Clustering
In fuzzy clustering, data points can potentially belong to multiple clusters.
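A minimal NumPy sketch of the idea follows (the fuzzifier m, the helper names, and the toy data are illustrative assumptions, not taken from the text): each point holds a membership in [0, 1] for every cluster, memberships across clusters sum to 1, and centres are membership-weighted means.

import numpy as np

def fuzzy_c_means(data, c, m=2.0, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(data)
    u = rng.random((n, c))
    u /= u.sum(axis=1, keepdims=True)          # initial random memberships
    for _ in range(n_iter):
        w = u ** m
        # cluster centres as membership-weighted means of the data
        centers = (w.T @ data) / w.sum(axis=0)[:, None]
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # membership update: inverse-distance weighting with fuzzifier m
        u = 1.0 / (dist ** (2.0 / (m - 1.0)))
        u /= u.sum(axis=1, keepdims=True)
    return centers, u

points = np.random.default_rng(2).random((20, 2))
centers, memberships = fuzzy_c_means(points, c=2)
print(memberships[:5])   # each row sums to 1 across the 2 clusters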
Advantages:
1. Gives better results for overlapping data sets and performs comparatively better than the k-means algorithm.
2. Unlike k-means, where a data point must belong exclusively to one cluster centre, here a data point is assigned a membership to each cluster centre, so a data point may belong to more than one cluster centre.

Disadvantages:
1. A priori specification of the number of clusters.
2. With a lower value of β we get better results, but at the expense of more iterations.
3. Euclidean distance measures can unequally weight underlying factors.
Distribution-Based Clustering
Until now, the clustering techniques we have seen are based on either proximity (similarity/distance) or composition (density). There is a family of clustering algorithms that takes a totally different metric into consideration: probability. Distribution-based clustering groups data points based on their likelihood of belonging to the same probability distribution (Gaussian, binomial, etc.) in the data.
As the distance from a distribution's centre increases, the probability that a point belongs to that distribution decreases. When you do not know the type of distribution in your data, you should use a different algorithm.

Gaussian Mixed Models (GMM) with Expectation-Maximization Clustering:


A Gaussian mixture model is a probabilistic model that assumes all the data points are
generated from a mixture of a finite number of Gaussian distributions with unknown
parameters.
Expectation-Maximization:
The expectation-maximization (EM) algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables. It does this by first estimating the values of the latent variables (the E-step), then optimizing the model parameters (the M-step), and repeating these two steps until convergence.
Computational Steps:
1. Initialize the parameters of the k Gaussian components (means, covariances, mixing weights).
2. E-step: for each data point, compute the probability (responsibility) that it belongs to each component.
3. M-step: re-estimate the component parameters using these responsibilities.
4. Repeat the E- and M-steps until the likelihood converges.
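As a brief sketch (assuming scikit-learn is available; the synthetic data and parameter values below are illustrative only), GaussianMixture in scikit-learn runs EM internally:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data drawn from two Gaussian blobs
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([0, 0], 1.0, size=(100, 2)),
    rng.normal([6, 6], 1.5, size=(100, 2)),
])

# Fit a 2-component mixture; EM alternates between estimating the
# per-point component responsibilities (E-step) and re-fitting the
# Gaussian parameters (M-step).
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

print(gmm.means_)                   # estimated component means
print(gmm.predict_proba(data[:5]))  # soft (probabilistic) memberships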
Advantages:
1. The association of a data point with a cluster is quantified using probability metrics, which can be easily interpreted.
2. Proven to be accurate for real-time data sets.
3. Some versions of GMM allow for mixed membership of data points, so it can be a good alternative to Fuzzy C-Means for achieving fuzzy clustering.

Disadvantages:
1. It is a complex algorithm and does not scale well to larger data sets.
2. It is hard to find clusters if the data is not Gaussian, so a lot of data preparation is required.
Density-based Clustering (Model-based Methods)
Density-based clustering methods take density into consideration instead of distances. In most other clustering methods we make two major assumptions: one, that the data is devoid of any noise, and two, that the shape of the clusters formed is purely geometrical (circular or elliptical). Density-based algorithms can give us clusters with arbitrary shapes, clusters without any limitation on cluster size, and clusters with a high level of homogeneity by ensuring the same level of density within them, while also handling outliers and noisy data.

DBSCAN – Density-based Spatial Clustering


Density-based algorithms, in general, are pivotal in application areas where we require non-linear cluster structures based purely on density. One way to realize this principle is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. There are two major underlying concepts in DBSCAN: density reachability and density connectivity.
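A short sketch of how this looks with scikit-learn's DBSCAN (the eps and min_samples values are arbitrary choices for the synthetic data below, not recommendations):

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered noise points
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal([0, 0], 0.3, size=(60, 2)),
    rng.normal([4, 4], 0.3, size=(60, 2)),
    rng.uniform(-2, 6, size=(10, 2)),        # sparse noise
])

# eps is the neighbourhood radius, min_samples the minimum number of
# neighbours for a point to be a core (density-reachable) point.
db = DBSCAN(eps=0.5, min_samples=5).fit(data)

print(set(db.labels_))   # cluster ids; noise points are labelled -1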
Advantages:
1. Doesn't require prior specification of the number of clusters.
2. Can easily deal with noise; not affected by outliers.
3. It imposes no strict cluster shapes, so it can correctly accommodate arbitrarily shaped clusters.

Disadvantages:
1. Cannot work with datasets of varying densities.
2. Sensitive to the clustering hyperparameters: eps and min_points.
3. Fails if the data is too sparse.
