Clustering in Machine Learning
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters, each consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with any other group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, behavior, etc., and divides the data according to the presence or absence of those patterns. After applying this clustering technique, each cluster or group is given a cluster-ID, which an ML system can use to simplify the processing of large and complex datasets.
The clustering technique is commonly used for
statistical data analysis.
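As a quick illustration of the cluster-IDs mentioned above, here is a sketch using scikit-learn (the library, synthetic dataset, and parameter values are assumptions for illustration, not part of the original text):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A small synthetic unlabelled dataset with 3 natural groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Fit a clustering model; fit_predict returns a cluster-ID per data point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])  # first ten cluster-IDs
```

Every data point receives an integer cluster-ID, which downstream processing can use in place of the raw features.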
Note: Clustering is somewhat similar to the classification algorithm, but the difference is the type of dataset being used. In classification, we work with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft clustering (a data point can belong to more than one group). Beyond this, various other approaches to clustering exist. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Applications of Clustering
Clustering is an unsupervised machine
learning technique with a lot of applications
in the areas of pattern recognition, image
analysis, customer analytics, market
segmentation, social network analysis, and
more. A broad range of industries use
clustering, from airlines to healthcare and
beyond.
It is a type of unsupervised learning, meaning that we do not need labeled data for clustering algorithms; this is one of the biggest advantages of clustering over supervised learning techniques such as classification.

K-Means Algorithm
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where
each data point belongs to only one
group. It tries to make the intra-cluster
data points as similar as possible while
also keeping the clusters as different
(far) as possible. It assigns data points to
a cluster such that the sum of the
squared distance between the data
points and the cluster’s centroid
(arithmetic mean of all the data points
that belong to that cluster) is at the
minimum. The less variation we have
within clusters, the more homogeneous
(similar) the data points are within the
same cluster.

The K-means algorithm works as follows:
1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points for the centroids, without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
   - Compute the sum of the squared distance between data points and all centroids.
   - Assign each data point to the closest cluster (centroid).
   - Compute the centroids of the clusters by taking the average of all data points that belong to each cluster.

K-means follows an Expectation-Maximization
approach to solve the problem. The Expectation-
step is used for assigning the data points to the
closest cluster and the Maximization-step is
used for computing the centroid of each cluster.
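The steps above, including the Expectation (assignment) and Maximization (centroid update) steps, can be sketched in plain NumPy (an illustrative implementation, not the author's code; function and variable names are assumptions):

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Minimal K-means sketch following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly select K data points as initial centroids (no replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # E-step: squared distances to all centroids, then nearest assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: stop when the centroids (hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; note that a poor initialization can still leave a cluster empty, which production implementations guard against.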
While working with the K-means algorithm, we need to take care of the following things:
While working with clustering algorithms
including K-Means, it is recommended to
standardize the data because such
algorithms use distance-based
measurement to determine the similarity
between data points.
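A minimal sketch of this advice using scikit-learn's `StandardScaler` ahead of `KMeans` (the data and parameter values are illustrative assumptions; `n_init` runs K-means from several random initializations and keeps the best result):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two features on very different scales: without standardization the
# second feature would dominate the distance computation.
X = np.array([[1.0, 1000.0],
              [2.0, 1100.0],
              [10.0, 5000.0],
              [11.0, 5100.0]])

model = make_pipeline(
    StandardScaler(),                                # put features on a common scale
    KMeans(n_clusters=2, n_init=10, random_state=0)  # multiple initializations
)
labels = model.fit_predict(X)
```

Here the first two and last two rows end up in the same clusters once both features contribute comparably to the distances.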
Due to the iterative nature of K-Means and the random initialization of the centroids, K-Means may get stuck in a local optimum and fail to converge to the global optimum. That is why it is recommended to try different initializations of the centroids.

Advantages and Disadvantages
Advantages
The following are some advantages of the K-Means clustering algorithm:
- It is very easy to understand and implement.
- With a large number of variables, K-means is faster than hierarchical clustering.
- On re-computation of centroids, an instance can change its cluster.
- K-means tends to form tighter clusters than hierarchical clustering.
Disadvantages
The following are some disadvantages of the K-Means clustering algorithm:
- It is difficult to predict the number of clusters, i.e. the value of k.
- The output is strongly impacted by the initial inputs, such as the number of clusters (value of k).
- The order of the data can have a strong impact on the final output.
- It is very sensitive to rescaling: if we rescale the data by normalization or standardization, the output will change completely.
- It does not do a good clustering job if the clusters have a complicated geometric shape.
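One common way to cope with the first disadvantage, choosing k, is the elbow heuristic: run K-means for several values of k and watch the within-cluster sum of squared distances (inertia). A hedged sketch with scikit-learn (the dataset and range of k are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true groups.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inertia decreases as k grows; the "elbow" where the drop flattens
# (here expected around k = 4) suggests a reasonable number of clusters.
```

The elbow is a heuristic, not a guarantee; silhouette scores are a common complementary check.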
Applications of K-Means
Clustering Algorithm
The main goals of cluster analysis are:
- To get a meaningful intuition from the data we are working with.
- Cluster-then-predict, where different models are built for different subgroups.

K-means clustering performs well enough to fulfill these goals. It can be used in the following applications:
- Market segmentation
- Document clustering
- Image segmentation
- Image compression
- Customer segmentation
- Analyzing trends in dynamic data

Using Clustering for Image Segmentation
Introduction
Image Segmentation is the process of
partitioning an image into multiple regions
based on the characteristics of the pixels in the
original image. Clustering is a technique to group similar entities and label them. Thus, for image segmentation using clustering, we can cluster similar pixels with a clustering algorithm and treat the pixels of a particular cluster as a single segment.
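This idea can be sketched by clustering pixel colors with K-means and reshaping the labels back into image coordinates (the synthetic image and scikit-learn usage are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 20x20 RGB "image": left half dark, right half bright.
img = np.zeros((20, 20, 3), dtype=float)
img[:, 10:] = 0.9

pixels = img.reshape(-1, 3)  # one row per pixel (its RGB color)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)

# Reshape cluster-IDs back to image shape: each pixel tagged with its segment.
segmented = labels.reshape(img.shape[:2])
```

Real photographs usually also include the pixel coordinates as features so that segments stay spatially coherent.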
Let's explore image segmentation using clustering in more detail.
Image Segmentation
The process of image segmentation by
clustering can be carried out using two
methods.
- Agglomerative clustering
- Divisive clustering

In agglomerative clustering, we assign each pixel to a nearby cluster and then grow the clusters iteratively. The following steps outline the process:
- Each pixel is considered to be an individual cluster.
- Similar clusters with smaller inter-cluster distances (WCSS) are merged.
- These steps are repeated.
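A bottom-up clustering of this kind can be sketched with scikit-learn's `AgglomerativeClustering` (Ward linkage, which merges clusters so as to minimize within-cluster variance; the dataset here is an assumption for illustration):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Stand-in for pixel feature vectors: 100 points in 3 natural groups.
X, _ = make_blobs(n_samples=100, centers=3, random_state=7)

# Start from singleton clusters and merge the closest pairs (Ward linkage)
# until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
```

For an actual image, `X` would be the per-pixel feature vectors (e.g. color, optionally with coordinates).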
In divisive clustering, the following process is followed:
- All the pixels are assigned to a single cluster.
- The cluster is split in two along a large inter-cluster distance, over some epochs.
- The steps are repeated until the optimal number of clusters is reached.

Using Clustering for Image Preprocessing
Pre-processing of images.
Before we can extract features from the
images, we need to perform some
preprocessing steps to make sure that
images are comparable in color, value
range, and image size. The preprocessing steps are taken from OpenCV and pipelined in clustimage.
1. colorscale: Conversion of the image into e.g. grayscale (2-D) or color (3-D).
2. scale: Normalize all pixel values into the range [0, 255].
3. dim: Resize each image so that the number of features is the same.

Using Clustering for Semi-Supervised Learning
Clustering methods that can be applied to partially labeled data, or to data with other types of outcome measures, are known as semi-supervised clustering methods (or sometimes as supervised clustering methods). They are examples of semi-supervised learning methods, which use both labeled and unlabeled data.

Why DBSCAN?
Partitioning methods (K-means, PAM
clustering) and hierarchical clustering work
for finding spherical-shaped clusters or
convex clusters. In other words, they are
suitable only for compact and well-separated
clusters. Moreover, they are also severely
affected by the presence of noise and
outliers in the data.
Real-life data may contain irregularities, such as:
1. Clusters can be of arbitrary shape such as
those shown in the figure below.
2. Data may contain noise.
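To illustrate why DBSCAN suits such data, here is a sketch on the classic two-moons dataset, whose clusters are non-convex (the `eps` and `min_samples` values are assumptions chosen for this data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving, crescent-shaped clusters: K-means would split them badly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# DBSCAN grows clusters from dense regions; points in no dense region
# are labeled -1 (noise/outliers).
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_
n_clusters = len(set(labels.tolist()) - {-1})
```

Unlike K-means, DBSCAN needs no preset number of clusters and handles noise explicitly, at the cost of tuning `eps` and `min_samples`.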