Clustering
Introduction to Clustering:
https://www.geeksforgeeks.org/clustering-in-machine-learning/
Clustering is a type of unsupervised learning method, i.e., a method in which we draw inferences from datasets consisting of input data without labelled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In essence, it groups objects on the basis of the similarity and dissimilarity between them.
Applications of Cluster Analysis:
https://www.analyticssteps.com/blogs/5-clustering-methods-and-applications
Recommendation engines
Market and Customer segmentation
Social Network Analysis (SNA)
Search Result Clustering
Biological Data Analysis, Medical Imaging Analysis and Identification of Cancer Cells
https://www.geeksforgeeks.org/clustering-in-machine-learning/
K Means Clustering:
Computational Steps:
1. Select k points at random as initial centroids/cluster centers.
2. Assign each data point to the closest centroid based on Euclidean distance.
3. Recalculate the centroid of all points within each cluster.
4. Repeat steps 2 and 3 until convergence, i.e., until the same points are assigned to the same clusters in consecutive iterations.
Example:
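As a concrete example, here is a minimal sketch using scikit-learn's KMeans on synthetic data; the dataset and parameter values are illustrative assumptions, not from the original notes. The four computational steps above run internally inside fit_predict:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with 3 well-separated groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# k must be chosen in advance (a known limitation, see the disadvantages below)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # steps 1-4 run internally until convergence

print(kmeans.cluster_centers_)   # final centroids
print(labels[:10])               # cluster assignments of the first 10 points
```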
Advantages of k-means
Relatively simple to implement.
Scales to large data sets.
Guarantees convergence.
Can warm-start the positions of centroids.
Easily adapts to new examples.
With modifications, generalizes to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages:
It requires the number of clusters (k) to be specified in advance.
It cannot handle noisy data and outliers well.
It is not suitable for identifying clusters with non-convex shapes.
It suffers from the curse of dimensionality: in high-dimensional spaces, Euclidean distances become less meaningful.
K-Medoids Clustering
A medoid is the point in a cluster whose total dissimilarity to all the other points in the cluster is minimal.
The dissimilarity between a medoid (Ci) and an object (Pi) is calculated as E = |Pi - Ci|.
Computational Steps:
1. Select k points at random as initial medoids.
2. Assign each data point to the cluster of its closest medoid.
3. Within each cluster, choose as the new medoid the point whose total dissimilarity to all other points in the cluster is minimal.
4. Repeat steps 2 and 3 until the medoids (and hence the assignments) no longer change.
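A minimal from-scratch sketch of these steps follows, using the simple alternating (Voronoi-iteration) variant of k-medoids. The Manhattan (L1) dissimilarity, matching E = |Pi - Ci| above, the function name, and all parameter values are illustrative assumptions, not from the original notes.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Sketch of the alternating (Voronoi-iteration) k-medoids variant."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Manhattan (L1) dissimilarities, matching E = |Pi - Ci|
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(n, size=k, replace=False)      # step 1
    labels = np.argmin(D[:, medoids], axis=1)           # step 2
    for _ in range(max_iter):
        # Step 3: in each cluster, pick the point with minimal total dissimilarity
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) > 0:
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        # Step 2 again: reassign points to the closest medoid
        new_labels = np.argmin(D[:, new_medoids], axis=1)
        # Step 4: stop when medoids and assignments no longer change
        if np.array_equal(new_medoids, medoids) and np.array_equal(new_labels, labels):
            break
        medoids, labels = new_medoids, new_labels
    return medoids, labels

X = np.random.default_rng(1).normal(size=(100, 2))
medoids, labels = k_medoids(X, k=3)
```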
Advantages of K-Medoids:
Compared to other partitioning algorithms, it deals effectively with noise and outliers in the data, because it uses medoids (actual data points) rather than means to partition objects into clusters.
Disadvantages:
Its overall computational time and the final distribution of objects into clusters depend on the initial partition.
Since objects are assigned to clusters based on their minimum distance from a medoid (instead of a centroid, as in k-means), it is not useful for clustering data into arbitrarily shaped clusters.
Fuzzy C-Means Clustering
In fuzzy clustering, data points can potentially belong to multiple clusters.
Advantages:
1. Gives the best results for overlapping data sets and performs comparatively better than the k-means algorithm.
2. Unlike k-means, where a data point must belong exclusively to one cluster center, here each data point is assigned a membership to every cluster center, so a data point may belong to more than one cluster.
Disadvantages:
1. A priori specification of the number of clusters is required.
2. A lower value of β (the termination criterion) gives a better result, but at the expense of more iterations.
3. Euclidean distance measures can weight underlying factors unequally.
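To make the membership idea concrete, here is a minimal from-scratch sketch of fuzzy c-means; the fuzzifier m, the use of β as the termination threshold, and all parameter values are illustrative assumptions, not from the original notes.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, beta=1e-5, max_iter=100, seed=0):
    """Sketch of fuzzy c-means; beta is the termination threshold."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial membership matrix; each row sums to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # Cluster centers as membership-weighted means
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < beta:   # stop when memberships stabilize
            U = U_new
            break
        U = U_new
    return centers, U

X = np.random.default_rng(1).normal(size=(200, 2))
centers, U = fuzzy_c_means(X, c=2)
print(U[0])   # membership of the first point in each cluster; sums to 1
```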
Distribution-Based Clustering
Until now, the clustering techniques as we know are based around either proximity
(similarity/distance) or composition (density). There is a family of clustering algorithms that
take a totally different metric into consideration – probability. Distribution-based clustering
creates and groups data points based on their likely hood of belonging to the same probability
distribution (Gaussian, Binomial etc.) in the data.
As distance from the distribution's center increases, the probability that a point belongs to the
distribution decreases. The bands show that decrease in probability. When you do not know
the type of distribution in your data, you should use a different algorithm.
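As an illustrative sketch (the synthetic data and parameter values are assumptions, not from the original notes), scikit-learn's GaussianMixture fits a mixture of Gaussians and exposes exactly this per-distribution probability:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data drawn from 3 Gaussian blobs (illustrative)
X, _ = make_blobs(n_samples=500, centers=3, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=7).fit(X)

hard_labels = gmm.predict(X)   # most likely distribution for each point
probs = gmm.predict_proba(X)   # probability of belonging to each distribution
print(probs[0])                # a point near a center has one probability close to 1
```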
Disadvantages:
1. The algorithm is complex and does not scale to larger datasets.
2. It is hard to find clusters if the data is not Gaussian, so a lot of data preparation is required.
Density-based Clustering
Density-based clustering methods take density into consideration instead of distances. Most clustering methods rest on two major assumptions: first, that the data is devoid of any noise, and second, that the shape of the clusters formed is purely geometric (circular or elliptical). Density-based algorithms can produce clusters with arbitrary shapes, clusters without any limitation on cluster size, and clusters with a high level of homogeneity by ensuring the same level of density within each cluster; they are also robust to outliers and noisy data, which are labelled as noise rather than forced into a cluster.
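DBSCAN is the best-known algorithm of this family. A minimal sketch with scikit-learn follows; the dataset and hyper-parameter values are illustrative assumptions, chosen to show the arbitrary (non-convex) cluster shapes where k-means struggles:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex clusters (illustrative)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are the hyper-parameters noted in the disadvantages below
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(labels))   # cluster ids; -1 marks points labelled as noise/outliers
```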
Disadvantages:
1. It cannot work well with datasets of varying densities.
2. It is sensitive to the clustering hyper-parameters, eps and min_points.
3. It fails if the data is too sparse.