UNIT 5

Clustering in Machine Learning

Introduction to Clustering
Clustering is a type of unsupervised learning method, i.e. a method in which we draw inferences from datasets consisting of input data without labeled responses. It is generally used to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In short, it is a grouping of objects on the basis of their similarity and dissimilarity.
For example, data points that lie close together in a scatter plot can be grouped into a single cluster; in a typical illustration, three such clusters can be distinguished by eye.

Clusters need not be spherical; density-based methods, for instance, can recover clusters of arbitrary shape.

Clustering Methods:
● Density-Based Methods: These methods treat clusters as dense regions of the space that stand out from the surrounding lower-density regions. They can recover arbitrarily shaped clusters and can merge two clusters that touch. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc. (A short DBSCAN sketch follows this list.)
● Hierarchical-Based Methods: The clusters formed by these methods make up a tree-like hierarchy, with new clusters formed from the previously formed ones. They fall into two categories:
● Agglomerative (bottom-up approach)
● Divisive (top-down approach)
Examples: CURE (Clustering Using REpresentatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
● Partitioning Methods: These methods partition the objects into k clusters, with each partition forming one cluster. They optimize an objective criterion, typically a distance-based similarity function. Examples: K-means, CLARANS (Clustering Large Applications based upon RANdomized Search), etc.
● Grid-Based Methods: In these methods, the data space is divided into a finite number of cells that form a grid-like structure. Clustering operations performed on this grid are fast and largely independent of the number of data objects. Examples: STING (STatistical INformation Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
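As a concrete illustration of a density-based method, here is a minimal DBSCAN sketch using scikit-learn (the library and the eps/min_samples values are assumptions for illustration, not tuned settings):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that
# density-based methods handle well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighbourhood radius; min_samples is the minimum
# number of neighbours for a point to count as a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # label -1 marks points treated as noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", np.sum(labels == -1))

DBSCAN recovers the two half-moons, a shape that K-means, which prefers compact spherical clusters, would split incorrectly.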
Clustering Algorithms:
K-means clustering algorithm: It is the simplest unsupervised learning algorithm that solves the clustering problem. K-means partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, and that mean serves as a prototype of the cluster.
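A minimal K-means sketch with scikit-learn (assumed available); k=3 is illustrative, chosen only because the toy data is generated with three blobs:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around three centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init restarts the algorithm from several random initializations
# and keeps the best result, since K-means is initialization-sensitive.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", km.cluster_centers_)
print("first ten labels:", km.labels_[:10])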

Applications of Clustering in Different Fields


● Marketing: It can be used to characterize & discover customer segments for marketing
purposes.
● Biology: It can be used for classification among different species of plants and animals.
● Libraries: It is used in clustering different books on the basis of topics and information.
● Insurance: It is used to group customers according to their policies and to identify fraud.
● City Planning: It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.
● Earthquake studies: By clustering earthquake-affected areas, we can identify dangerous zones.
Key Features of K-means Clustering

Listed below are some key features of K-means clustering:

1. It is simple to interpret and easy to implement.
2. When a dataset contains a large number of variables, K-means operates faster than hierarchical clustering.
3. An instance can change its cluster when the cluster centres are recomputed.
4. K-means forms compact clusters.
5. It can work on unlabeled numerical data.
6. Moreover, it is fast, robust, and uncomplicated to understand, and it yields the best outcomes when the clusters in a dataset are distinct (well separated) from each other.

Limitations of K-means Clustering

The following are a few limitations of K-means clustering:

1. It can be quite difficult to predict the number of clusters, i.e. the value of k.
2. The output is highly influenced by the initial inputs, such as the chosen number of clusters and the starting centroids.
3. The order of the data can substantially affect the final results.
4. When clusters have complex spatial shapes, K-means is not a good choice.
5. K-means is also sensitive to rescaling: if the data are rescaled by normalization or standardization, the output can change entirely (see the sketch below).
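A small sketch of point 5, again using scikit-learn (an assumption; the feature scales are deliberately exaggerated to make the effect visible):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 spans roughly 0-1 while feature 1 spans roughly 0-1000,
# so raw Euclidean distances are dominated by feature 1.
X = np.column_stack([rng.random(200), rng.random(200) * 1000])

raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Compare assignments up to a swap of the two cluster labels.
diff = np.sum(raw != scaled)
print("points assigned differently:", min(diff, len(X) - diff))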


Disadvantages of K-means Clustering

1. The algorithm requires the number of clusters/centres to be specified in advance.
2. The algorithm breaks down for non-linearly separable data and is unable to deal with noisy data and outliers.
3. It is not directly applicable to categorical data, since it only operates where a mean can be computed.
4. Also, Euclidean distance can weight the underlying factors unequally.
5. The algorithm is not invariant to non-linear transformations, i.e. it gives different results for different representations of the data.

Adaptive Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another unsupervised machine learning approach for grouping unlabeled datasets into clusters.

The hierarchy of clusters is developed in the form of a tree in this technique, and this tree-shaped structure is known as the dendrogram.

Simply speaking, hierarchical clustering is all about separating data into groups based on some measure of similarity, finding a way to quantify how the groups are alike and different, and narrowing down the data.
Approaches of Hierarchical Clustering

1. Agglomerative clustering:

Agglomerative clustering is a bottom-up strategy in which each data point initially forms a cluster of its own, and pairs of clusters are merged as one moves up the hierarchy. At each step, the two nearest clusters are joined to form one single cluster. (A short code sketch follows the two approaches below.)

2. Divisive clustering:

The divisive clustering algorithm is a top-down strategy in which all points in the dataset are initially assigned to one cluster, which is then split iteratively as one moves down the hierarchy.
It separates the data points that differ most into distinct clusters, and this process continues until the desired number of clusters is obtained.
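A brief agglomerative sketch using SciPy's linkage and dendrogram functions (the libraries and the Ward linkage choice are assumptions for illustration):

from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# A small toy dataset keeps the dendrogram readable.
X, _ = make_blobs(n_samples=30, centers=3, random_state=1)

# Ward linkage merges, at each step, the pair of clusters whose
# union least increases the total within-cluster variance.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.title("Agglomerative clustering dendrogram")
plt.show()

Cutting the resulting dendrogram at a chosen height yields a flat clustering with the corresponding number of clusters.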

Gaussian Mixture Model

Suppose there is a set of data points that needs to be grouped into several parts or clusters based on similarity. In machine learning, this is known as clustering.
There are several methods available for clustering like:

● K Means Clustering
● Hierarchical Clustering
● Gaussian Mixture Models

What are Gaussian mixture models (GMM)?

The Gaussian mixture model is a clustering algorithm used to discover the underlying groups in data. It can be understood as a probabilistic model in which each group is assumed to follow a Gaussian distribution whose parameters are a mean and a covariance. A GMM therefore consists of mean vectors (μ) and covariance matrices (Σ), together with a mixing weight for each component.
A Gaussian distribution is a continuous probability distribution whose density takes a bell-shaped curve; another name for the Gaussian distribution is the normal distribution.
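A minimal GMM sketch with scikit-learn's GaussianMixture (the library and n_components=2 are assumptions for illustration):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data drawn around two centres.
X, _ = make_blobs(n_samples=400, centers=2, random_state=3)

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
print("means:\n", gmm.means_)        # the mean vectors (mu)
print("weights:", gmm.weights_)      # mixing proportions
labels = gmm.predict(X)              # hard cluster assignments
probs = gmm.predict_proba(X[:3])     # soft, probabilistic assignments
print("soft assignments for first 3 points:\n", probs)

Unlike K-means, a GMM gives each point a probability of belonging to every component rather than a single hard label.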

What is the expectation-maximization (EM) method in relation to GMM?


In Gaussian mixture models, the expectation-maximization (EM) method is used to find the mixture model's parameters. Expectation is termed E and maximization is termed M.
EM is a two-step iterative algorithm. In the expectation (E) step, the current parameter estimates are used to compute, for each data point, the posterior probability (the responsibility) that each Gaussian component generated it. In the maximization (M) step, the means, covariances, and mixing weights are updated with their maximum-likelihood estimates computed from those responsibilities. This iterative process is repeated until the Gaussian parameters converge.
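A bare-bones EM sketch for a one-dimensional, two-component mixture, written with NumPy only (the synthetic data and initial guesses are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussian components.
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 0.5, 150)])

# Arbitrary initial guesses for means, standard deviations, weights.
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E step: responsibility of each component for each point.
    dens = pi * gauss_pdf(x[:, None], mu, sigma)   # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M step: maximum-likelihood updates from the responsibilities.
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print("means:", mu, "stds:", sigma, "weights:", pi)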
