Clustering Techniques


Two motivating examples:

The periodic table powered by machine learning: clustering groups elements with similar properties (e.g., the poor metals).
https://www.chemistryworld.com/opinion/machine-learning-mendeleevs-have-rediscovered-the-periodic-table/3010720.article

The Coronaviridae family: clustering viruses such as Covid-19 by similarity is helpful in vaccine development.
What is Clustering?

Clustering is the process of grouping a set of data objects into multiple groups, or clusters, so that objects within a cluster have high similarity but are very dissimilar to objects in other clusters.

Cluster analysis has been widely used in many applications such as business intelligence, image pattern recognition, web search, biology, and security.
Clustering Techniques

Example application: spam filtering for e-mail.

Clustering methods: Partitioning, Hierarchical, Density based, Grid based.

Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
Partitioning

Centroid-based clustering is the simplest of the clustering types in data mining. It works on the closeness of the data points to a chosen central value: the dataset is divided into a given number of clusters, each referenced by a vector of values (its centroid). Each input data point is compared against the centroids and joins the cluster with the minimal difference.

Pre-defining the number of clusters at the initial stage is the most crucial, yet most complicated, step of this approach. Despite that drawback, it is a widely used clustering approach for exploring and summarizing large datasets. The K-Means algorithm falls into this category.
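The assign-then-recompute loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production K-Means (no convergence check, fixed iteration count, random initial centroids):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means sketch: assign each point to the nearest
    centroid, then recompute centroids, for a fixed number of rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct points to start
    for _ in range(iters):
        # Assignment step: each point joins the cluster whose
        # centroid is closest (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Relocation step: move each centroid to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters

# Two well-separated groups of 2-D points.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, k=2)
```

With well-separated groups like these, the iterative relocation settles on one centroid per group regardless of which two points are drawn initially.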
Clustering Techniques

Example application: document analysis.

Clustering methods: Partitioning, Hierarchical, Density based, Grid based.

A hierarchical method creates a hierarchical decomposition of the given set of data objects. It can be classified as agglomerative or divisive, based on how the hierarchical decomposition is formed.
Hierarchical Clustering

Hierarchical clustering, also known as connectivity-based clustering, is based on the principle that every object is connected to its neighbours depending on their proximity distance (degree of relationship). The clusters are represented in extensive hierarchical structures separated by the maximum distance required to connect the cluster parts.

The clusters are represented as dendrograms, where the X-axis represents the objects that do not merge while the Y-axis is the distance at which clusters merge. Similar data objects have minimal distance and fall in the same cluster, while dissimilar data objects are placed farther apart in the hierarchy.
Hierarchical Clustering - Types

Agglomerative (Bottom Up) and Divisive (Top Down).

Example hierarchy over objects 1-7:

1,2,3,4,5,6,7
1,2,3 | 4,5 | 6,7
1 | 2 | 3 | 4 | 5 | 6 | 7

Agglomerative clustering builds this tree from the bottom up by merging clusters; divisive clustering builds it from the top down by splitting them.
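The bottom-up (agglomerative) direction can be sketched as repeated merging of the two closest clusters. This sketch assumes single linkage (cluster distance = closest pair of points) and stops at k clusters rather than building the full tree:

```python
def agglomerative(points, k):
    """Minimal agglomerative sketch with single linkage: start with each
    point as its own cluster, merge the closest pair until k remain."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]          # every object starts alone
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage
        # distance (closest pair of points, one from each cluster).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair
    return clusters

# 1-D objects: two tight groups and one outlier.
pts = [(1,), (2,), (3,), (10,), (11,), (20,)]
clusters = agglomerative(pts, k=3)
```

Recording the distance at which each merge happens would give exactly the Y-axis of the dendrogram described on the next slide.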
Hierarchical Clustering - Dendrogram

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters.
Hierarchical Clustering - Dendrograms

[Example dendrogram clustering six countries: Canada, United States, Germany, France, United Kingdom, Australia.]

[Example dendrogram over 100 observations.]
Hierarchical Clustering - Similarity Measures

Euclidean distance (or Euclidean metric) is the "ordinary" straight-line distance between two points in Euclidean space. Manhattan distance between two points is the sum of the absolute differences of their Cartesian coordinates.

Example with points X2 = (1, 3) and X1 = (5, 6):

ED = √((x2 − x1)² + (y2 − y1)²)
ED = √((5 − 1)² + (6 − 3)²) = √(16 + 9) = 5

MD = |x2 − x1| + |y2 − y1|
MD = |5 − 1| + |6 − 3| = 4 + 3 = 7
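Both measures translate directly into code; the sketch below reproduces the slide's worked example with the points (1, 3) and (5, 6):

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# The slide's example points X2 = (1, 3) and X1 = (5, 6).
ed = euclidean((1, 3), (5, 6))   # sqrt(4**2 + 3**2) = 5.0
md = manhattan((1, 3), (5, 6))   # 4 + 3 = 7
```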
Clustering Techniques

Example application: traffic problems.

Clustering methods: Partitioning, Hierarchical, Density based, Grid based.

Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty in discovering clusters of arbitrary shapes.

Density based

Density-based clustering, e.g. DBSCAN (Density-Based Spatial Clustering of Applications with Noise), considers density ahead of distance. Data is clustered into regions of high concentration of data objects bounded by areas of low concentration. Each cluster formed is a maximal set of density-connected data points.
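The density idea can be sketched as follows: a point is a "core" point if at least min_pts points lie within radius eps of it, and clusters grow outward from core points; anything unreachable is noise. This is a simplified DBSCAN, not an optimized implementation (it scans all points for every neighbourhood query):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points (points with
    at least min_pts neighbours within eps); leftovers are noise (-1)."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)   # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # not dense enough: provisionally noise
            continue
        labels[i] = cluster
        # Expand the cluster through density-connected neighbours.
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:         # j is a core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

# Two dense blobs plus one isolated noise point.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Note that no number of clusters is given up front, and the isolated point (5, 5) is labelled noise rather than forced into a cluster, which is exactly how density-based methods differ from partitioning.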
Clustering Techniques

Example application: retail customer segmentation.

Clustering methods: Partitioning, Hierarchical, Density based, Grid based.

Grid based

Grid-based methods quantize the object space into a finite number of cells that form a grid structure. Using grids is often an efficient approach to many spatial data mining problems, including clustering. Grid-based methods can also be integrated with other clustering methods such as density-based methods and hierarchical methods.
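A toy illustration of quantizing the space into cells, combined with a density criterion as the slide suggests: points are binned into square cells, cells with enough points are kept, and adjacent dense cells are merged into clusters. The cell size and density threshold here are arbitrary illustrative parameters:

```python
from collections import defaultdict

def grid_cluster(points, cell_size, min_density):
    """Minimal grid-based sketch: quantize 2-D points into square cells,
    keep cells holding at least min_density points, then merge
    neighbouring dense cells into clusters via flood fill."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    dense = {c for c, members in cells.items() if len(members) >= min_density}

    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        # Flood fill over the 8-connected neighbourhood of dense cells.
        group, stack = [], [start]
        seen.add(start)
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(group)
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.1, 5.1), (5.2, 5.3), (5.3, 5.2)]
clusters = grid_cluster(pts, cell_size=1.0, min_density=2)
```

The efficiency win is that after the single binning pass, all work happens on the (small, fixed) set of cells rather than on the raw points.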
Other Clustering Techniques - Constraint-Based

The clustering process, in general, is based on the approach that the data can be divided into an optimal number of "unknown" groups. The underlying stages of all clustering algorithms are to find those hidden patterns and similarities without intervention or predefined conditions. However, in certain business scenarios we might be required to partition the data based on certain constraints. Here is where a supervised version of clustering machine learning techniques comes into play.

A constraint is defined as the desired properties of the clustering results, or a user's expectation of the clusters so formed. This can be in terms of a fixed number of clusters, the cluster size, or important dimensions (variables) that are required for the clustering process.
Other Clustering Techniques - Distribution-Based

Distribution-based clustering uses statistical distributions to model the data objects. A cluster includes the data objects that have a higher probability of belonging to it. Each cluster has a central point; the greater the distance of a data point from the central point, the lower its probability of being included in the cluster.
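The probability idea can be illustrated with the most common choice of distribution, the normal (Gaussian). This sketch assumes two hypothetical 1-D clusters with known means and standard deviations (a full method such as Gaussian mixture modelling would also fit these parameters from the data):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution: the statistical model each
    cluster assumes in this sketch; highest at the central point mu."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def assign(x, components):
    """Assign x to the component (cluster) under which it is most probable."""
    return max(range(len(components)), key=lambda i: gaussian_pdf(x, *components[i]))

# Two hypothetical clusters modelled as normals: N(0, 1) and N(10, 1).
comps = [(0.0, 1.0), (10.0, 1.0)]
labels = [assign(x, comps) for x in (-0.5, 0.3, 9.5, 10.2)]
```

As the slide states, the further a point lies from a cluster's central point (here its mean), the lower its density under that cluster's distribution, and so the less likely it is to be assigned there.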

A constraint is defined as the desired properties of the


clustering results or a user’s expectation of the clusters so
formed – this can be in terms of a fixed number of clusters, the
cluster size, or important dimensions (variables) that are
required for the clustering process.
