
Clustering

Module 4
Contents
• Types of data in Cluster analysis
• Partitioning Methods (k-Means, k-Medoids)
• Hierarchical Methods (Agglomerative,
Divisive).
MUMBAI UNIVERSITY EXAM QUESTIONS

Clustering

1. Explain What Is Meant By Clustering. State And Explain The Various Types With Suitable Example For Each. (May 2012)

2. Give five examples of applications that can use clustering. Describe any one clustering algorithm with the help of an example. (Dec 2012)
What is a Cluster?

• A cluster is a subset of similar objects.

• A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
Clustering

• Clustering is “the process of organizing objects into


groups whose members are similar in some way”.

• Finding groups of objects such that the objects in a


group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups.
Different ways of representing clusters
(a) Division with boundaries
(b) Spheres
(c) Probabilistic (a table of cluster-membership probabilities for each object)
(d) Dendrograms
[Figure: objects a-k shown under each of the four representations]
Example
Suppose we are a marketing manager, and we have a new, tempting product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?
Clustering Applications
1. Marketing: Help marketers discover distinct groups in
their customer bases, and then use this knowledge to
develop targeted marketing programs
2. Land use: Identification of areas of similar land use in
an earth observation database
3. Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost
4. City-planning: Identifying groups of houses according
to their house type, value, and geographical location
5. Earthquake studies: observed earthquake epicenters should be clustered along continental faults
A Clustering Example

Cluster 1: Income: High, Children: 1, Car: Luxury
Cluster 2: Income: Low, Children: 0, Car: Compact
Cluster 3: Income: Medium, Children: 3, Car: Sedan
Cluster 4: Income: Medium, Children: 2, Car: Truck
Clustering is ambiguous
⚫ There is no correct or incorrect solution for clustering.
[Figure: how many clusters? The same set of points grouped as two clusters, four clusters, or six clusters]
Properties of Clustering

1. Scalability − We need highly scalable clustering algorithms to deal with large databases.

2. Ability to deal with different kinds of attributes − Algorithms should be applicable to any kind of data, such as interval-based (numerical), categorical, and binary data.

3. High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.

4. Ability to deal with noisy data − Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may lead to poor-quality clusters.

5. Interpretability − The clustering results should be interpretable, comprehensible, and usable.
Types of data in Cluster analysis
Clustering Methods

Clustering methods can be classified into the following


categories −

1. Partitioning Method
2. Hierarchical Method
3. Density-based Method
4. Grid-Based Method
5. Model-Based Method
6. Constraint-based Method
Clustering Methods

Clustering is broadly divided into Hierarchical methods (Agglomerative, Divisive) and Partitioning methods (k-Medoids, k-Means).

Partitioning Clustering
K-MEANS, K-MEDOIDS
I. Partition Clustering
Explain partitioning methods for clustering. (Dec
2010)

[Figure: original points and a partitional clustering of them]
• Finding all clusters at once
• A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Partitioning Algorithms: Basic Concept

⚫ Partitioning method: Construct a partition of a database


D of n objects into a set of k clusters
⚫ Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
⚪ Global optimal: exhaustively enumerate/tally all partitions
⚪ Heuristic methods: k-means and k-medoids algorithms
1. k-means : Each cluster is represented by the
center of the cluster
2. k-medoids or PAM (Partition around medoids): Each
cluster is represented by one of the objects in the cluster
Partitioning: K-Means
Write short note on K-Means Clustering. (May 2013)
Algorithm
⚫ Each cluster is represented by the mean value of the objects in the cluster
⚫ Input: a set of n objects, the number of clusters k
⚫ Output: a set of k clusters
⚫ Algorithm (k-Means, array form):
1. Partition the objects into k non-empty subsets, i.e., randomly assign the data to each cluster.
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
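A minimal sketch of this array-style k-means in Python (the `kmeans_1d` name, the shuffled round-robin initial partition, and treating an empty cluster's mean as 0 are assumptions of these notes, not part of the slides):

```python
import random

def kmeans_1d(data, k, seed=0):
    """Array-style k-means on 1-D values: iterate mean / reassign until stable."""
    rng = random.Random(seed)
    points = list(data)
    rng.shuffle(points)
    # Step 1: randomly partition the objects into k subsets (round-robin after a shuffle)
    clusters = [points[i::k] for i in range(k)]
    while True:
        # Step 2: compute the seed point (mean) of each cluster;
        # an empty cluster's mean is taken as 0 here
        means = [sum(c) / len(c) if c else 0.0 for c in clusters]
        # Step 3: assign each object to the cluster with the nearest mean
        new_clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - means[j]))
            new_clusters[nearest].append(x)
        # Step 4: stop when no object changes its cluster
        if [sorted(c) for c in new_clusters] == [sorted(c) for c in clusters]:
            return [sorted(c) for c in new_clusters], means
        clusters = new_clusters

clusters, means = kmeans_1d([2, 3, 6, 8, 9, 12, 15, 18, 22], k=3)
print(clusters, means)
```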
Example 1
Given: {2, 3, 6, 8, 9, 12, 15, 18, 22}. Assume k = 3.
⚫ Solution:
⚪ Randomly partition the given data set:
⯍ K1 = {2, 8, 15}, mean = 8.3
⯍ K2 = {3, 9, 18}, mean = 10
⯍ K3 = {6, 12, 22}, mean = 13.3
⚪ Reassign:
⯍ K1 = {2, 3, 6, 8, 9}, mean = 5.6
⯍ K2 = {} (empty), mean taken as 0
⯍ K3 = {12, 15, 18, 22}, mean = 16.75
⚪ Reassign:
⯍ K1 = {3, 6, 8, 9}, mean = 6.5
⯍ K2 = {2}, mean = 2
⯍ K3 = {12, 15, 18, 22}, mean = 16.75
⚪ Reassign:
⯍ K1 = {6, 8, 9}, mean = 7.67
⯍ K2 = {2, 3}, mean = 2.5
⯍ K3 = {12, 15, 18, 22}, mean = 16.75
⚪ Reassign:
⯍ K1 = {6, 8, 9, 12}, mean = 8.75
⯍ K2 = {2, 3}, mean = 2.5
⯍ K3 = {15, 18, 22}, mean = 18.33
⚫ STOP (a further pass produces no new assignments)
Example 2
What is clustering? Explain K-means clustering algorithm. Suppose the data for clustering is
{2,4,10,12,3,20,30,11,25} consider K=2, Cluster the given data using above algorithm. (Dec
2010)

1. Given {2,4,10,12,3,20,30,11,25}. Assume the number of clusters k = 2.
⚫ Solution:
⚪ Randomly partition the given data set:
⯍ K1 = {4, 12, 20, 11}, mean = 11.75
⯍ K2 = {2, 10, 3, 30, 25}, mean = 14
⚪ Reassign:
⯍ K1 = {2, 3, 4, 10, 11, 12}, mean = 7
⯍ K2 = {20, 25, 30}, mean = 25
⚪ Reassign:
⯍ K1 = {2, 3, 4, 10, 11, 12}, mean = 7
⯍ K2 = {20, 25, 30}, mean = 25
So the final answer is K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 25, 30}.
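As a quick check, the `kmeans_1d` sketch shown after the algorithm slide (an assumption of these notes, not part of the exam answer) settles on the same two groups for a typical random start:

```python
clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2)
print(clusters)   # expected: [[2, 3, 4, 10, 11, 12], [20, 25, 30]] (cluster order may vary)
print(means)      # roughly 7 and 25
```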
K-Means: Advantages & Disadvantages
Advantages
• k-Means is relatively scalable and efficient in processing large data sets.
• The computational complexity of the algorithm is O(nkt), where
  n: the total number of objects
  k: the number of clusters
  t: the number of iterations
  Normally k << n and t << n (far fewer clusters and iterations than objects).
Disadvantages
• Can be applied only when the mean of a cluster is defined
• Users need to specify k
• k-Means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes
Questions on Partitioning
method-K-Means
1. What Is K-means Clustering? Confer The K-means Algorithm With The Following Data For Two Clusters. Data Set {10,4,2,12,3,20,30,11,25}. (May 2010)
2. Explain K-means clustering algorithm. Suppose the data for clustering is
{2,4,10,12,3,20,30,11,25} consider K=2, Cluster the given data using
above algorithm. (Dec 2010)
3. Write short note on K-Means Clustering. (May 2013)
4. May 2016

5. May 2017
⚫ D = {1, 2, 6, 7, 8, 10, 15, 17, 20}
⚪ K1 = {2, 7, 10, 17}
⚪ K2 = {1, 6, 8, 15, 20}
⚫ Iteration 1: m1 = 9, m2 = 10
⚪ K1 (9) = {1, 2, 6, 7, 8}
⚪ K2 (10) = {10, 15, 17, 20}
⚫ Iteration 2: m1 = 4.8, m2 = 15.5
⚪ K1 (4.8) = {1, 2, 6, 7, 8, 10}
⚪ K2 (15.5) = {15, 17, 20}
⚫ Iteration 3: m1 = 5.67, m2 = 17.33
⚪ K1 (5.67) = {1, 2, 6, 7, 8, 10}
⚪ K2 (17.33) = {15, 17, 20}
⚫ Iteration 4: m1 = 5.67, m2 = 17.33; no change in assignments, so stop.
K-Means (Graph)

⚫ Step 1: Form k centroids, randomly.
⚫ Step 2: Calculate the distance between the centroids and each object.
⚪ Use the Euclidean distance to determine the minimum distance:
d(A, B) = sqrt((x2 − x1)² + (y2 − y1)²)
⚫ Step 3: Assign objects to the k clusters based on minimum distance.
⚫ Step 4: Calculate the centroid of each cluster using
C = ((x1 + x2 + … + xn)/n , (y1 + y2 + … + yn)/n)
⚪ Go to Step 2.
⚪ Repeat until there is no change in the centroids.
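A small 2-D sketch of the same loop, using the Euclidean distance and centroid formulas above (the function names are assumptions of these notes; points are assumed to be (x, y) tuples):

```python
import math
import random

def euclidean(a, b):
    """d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)"""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def centroid(cluster):
    """C = ((x1 + ... + xn) / n, (y1 + ... + yn) / n)"""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def kmeans_2d(points, k, seed=0):
    # Step 1: form k centroids, randomly (here: k random data points)
    centroids = random.Random(seed).sample(points, k)
    while True:
        # Steps 2-3: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[j].append(p)
        # Step 4: recompute centroids; keep the old one if a cluster went empty
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # repeat until no change in centroids
            return clusters, centroids
        centroids = new_centroids
```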
Algorithm

1. Choose k random points (data points from the data set or some other points). These points are also called "centroids" or "means".
2. Assign all the data points in the data set to the closest centroid by applying a distance formula such as Euclidean distance, Manhattan distance, etc.
3. Now choose new centroids by calculating the mean of all the data points in each cluster, and go to step 2.
4. Continue step 3 until no data point changes its cluster between two iterations.
Example 1
• There are four types of medicines and each have two
attributes, as shown below. Find a way to group them into 2
groups based on their features.

Medicine Weight pH
A 1 1
B 2 1
C 4 3
D 5 4

Solution
• Plot the values on a graph.
• Mark any k centroids

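Applying the 2-D sketch from the previous slides to the four medicines (weight, pH) with k = 2 gives, for a typical random start, the grouping {A, B} and {C, D}:

```python
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
clusters, centers = kmeans_2d(list(medicines.values()), k=2)
print(clusters)   # expected: [(1, 1), (2, 1)] and [(4, 3), (5, 4)]
print(centers)    # roughly (1.5, 1.0) and (4.5, 3.5)
```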
Questions on
Partitioning
method-K-Means
May 2018
May
2019
What is the problem of k-Means Method?

⚫ The k-means algorithm is sensitive to outliers !


⚪ Since an object with an extremely large value may substantially
distort the distribution of the data.
⚫ K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
The K-Medoids Clustering Method
In the K-Medoids algorithm, each cluster is represented by one of its actual data points, called a medoid.
The medoids serve as cluster centers.
A medoid is a point whose sum of distances to all other points in the same cluster is minimum.
For distance, any suitable metric like Euclidean distance or Manhattan distance can be used.
The complete data is divided into K clusters after the algorithm is applied.
There are three types of algorithms for K-Medoids clustering:

1. PAM (Partitioning Around Medoids)

2. CLARA (Clustering LARge Applications)

3. CLARANS (Clustering Large Applications based on RANdomized Search)

PAM is the most popular method. Its one disadvantage is that it takes a lot of time.

The K-Medoids algorithm is applied in such a way that:

• A single point can belong to only one cluster
• Each cluster has at least one point

The dissimilarity between a medoid (Ci) and an object (Pi) is calculated using

E = |Pi − Ci|

The total cost in the K-Medoids algorithm is the sum of these dissimilarities over all objects, i.e., Cost = Σ |Pi − Ci|, taken over every object Pi and the medoid Ci of its cluster.
K-Medoids (PAM) Algorithm

⚫ Also called Partitioning Around Medoids.

1. Given k.
2. Randomly pick k instances as the initial medoids.
3. Assign each data point to the nearest medoid x.
4. Calculate the objective function:
⯍ the sum of dissimilarities of all points to their nearest medoids (absolute-error criterion / Manhattan distance).
5. Randomly select a non-medoid point y.
6. Swap x with y if the swap reduces the objective function.
7. Repeat steps 3-6 until there is no change.
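A minimal sketch of this PAM loop in Python, assuming points are tuples and Manhattan distance is used for the absolute-error criterion (the function names and the simple exhaustive swap search are a simplification made in these notes, not a standard library API):

```python
import random

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    """Absolute-error criterion: sum of each point's distance to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                  # Step 2: pick k initial medoids
    best = total_cost(points, medoids)               # Step 4: objective function
    improved = True
    while improved:                                  # Step 7: repeat until no change
        improved = False
        for i in range(k):
            for y in points:                         # Step 5: candidate non-medoid y
                if y in medoids:
                    continue
                candidate = medoids[:i] + [y] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:                      # Step 6: keep the swap if it helps
                    medoids, best = candidate, cost
                    improved = True
    # Step 3: final assignment of each point to its nearest medoid
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: manhattan(p, m))].append(p)
    return medoids, best, clusters
```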
Typical k-medoids algorithm (PAM)
Example
Hierarchical Clustering
II. Hierarchical Clustering (May 17, May 18)
⚫ In this algorithm, we develop the hierarchy of
clusters in the form of a tree, and this tree-shaped
structure is known as the dendrogram.
i. Agglomerative
⚪ Initially each item in its own cluster
⚪ Iteratively clusters are merged together
⚪ Bottom Up
ii. Divisive
⚪ Initially all items in one cluster
⚪ Large clusters are successively divided
⚪ Top Down
Hierarchical Clustering
⚫ Agglomerative approach
Initialization: each object is a cluster.
Iteration: merge the two clusters which are most similar to each other, until all objects are merged into a single cluster (bottom-up).
[Figure: objects a-e merged step by step, Step 0 to Step 4: {a,b}, {d,e}, {c,d,e}, then {a,b,c,d,e}]
Hierarchical Clustering
⚫ Divisive approach
Initialization: all objects stay in one cluster.
Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one object (top-down).
[Figure: the same objects a-e split step by step in the reverse order, Step 4 down to Step 0]
Agglomerative Hierarchical
Clustering
1. Agglomerative hierarchical clustering

⚫ This bottom-up strategy starts by placing each object in


its own cluster and then merges these atomic clusters
into larger and larger clusters, until all of the objects are
in a single cluster or until certain termination conditions
are satisfied.
⚫ Most hierarchical clustering methods belong to this
category.
Agglomerative Algorithm
(Single Link)

⚫ Step 1: Make each object a cluster.
⚫ Step 2: Calculate the Euclidean distance from every point to every other point, i.e., construct a distance matrix.
⚫ Step 3: Identify the two clusters with the shortest distance.
⚪ Merge them.
⚪ Go to Step 2.
⚪ Repeat until all objects are in one cluster.
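A sketch of this single-link loop in Python, working directly on a dictionary of pairwise distances (the data layout, helper names, and the toy distances below are assumptions of these notes; replacing `min` with `max` or an average gives the complete-link and average-link variants discussed next):

```python
def single_link(c1, c2, dist):
    """Cluster distance = smallest pairwise distance between their members."""
    return min(dist[frozenset((a, b))] for a in c1 for b in c2)

def agglomerative(objects, dist, linkage=single_link):
    # Step 1: make each object its own cluster
    clusters = [frozenset([o]) for o in objects]
    merges = []                                  # record of merges for the dendrogram
    while len(clusters) > 1:
        # Steps 2-3: find the pair of clusters with the shortest linkage distance
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], dist))
        d = linkage(clusters[i], clusters[j], dist)
        merges.append((set(clusters[i]), set(clusters[j]), d))
        # merge them and repeat until all objects are in one cluster
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] | clusters[j]]
    return merges

# Toy symmetric distances, made up purely for illustration
dist = {frozenset(p): v for p, v in {("a", "b"): 1.0, ("a", "c"): 4.0, ("a", "d"): 5.0,
                                     ("b", "c"): 3.5, ("b", "d"): 4.5, ("c", "d"): 2.0}.items()}
for step in agglomerative(["a", "b", "c", "d"], dist):
    print(step)
```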
How does Agglomerative Hierarchical Clustering work?
Measures for the distance between two clusters

1. Single Linkage: the shortest distance between the closest points of the two clusters.
2. Complete Linkage: the farthest distance between two points of two different clusters. It is one of the popular linkage methods.
3. Average Linkage: the distance between each pair of points (one from each cluster) is added up and then divided by the total number of pairs to calculate the average distance between the two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: the distance between the centroids of the two clusters.
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(dis(tip, tjq))

• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(dis(tip, tjq))

• Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(dis(tip, tjq))

• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)

Dendrogram
⚫ Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
⚫ Each level shows the clusters for that level.
⚪ Leaf – individual clusters
⚪ Root – one cluster
⚫ A cluster at level i is the union of its child clusters at level i + 1.
Example
Solution
Example 2:
Calculation steps:
Step 1: Draw the graph.
Step 2: Compute the pairwise distances; the same formula can be used for (p1,p3), (p1,p4), (p1,p5), (p1,p6). [The resulting distance matrix is not reproduced here.]
Step 3: Find the minimum-value element in the distance matrix. The minimum-value element is (p3,p6), with value 0.11, i.e., our first cluster is (p3,p6).
May 2010 university question
1. What is clustering technique? Discuss the agglomerative algorithm
using following data and plot a dendrogram using link approach.
The following figure contains sample data items indicating the
distance between the elements.

Step 1: Merge E and A as they have the minimum distance. (We get E/A.)
Step 2: Merge B and C as they have the minimum distance. (We get B/C.)
Step 3: Merge E/A and B/C as they have the minimum distance. (We get E/A/B/C.)
Step 4: Merge E/A/B/C and D as it is the last cluster. (We get E/A/B/C/D.)
Resulting Dendrogram
Dec 2016
Dec 2017
Dec 2019
May 2019
Agglomerative – Complete Link Algorithm
1. Discuss the agglomerative algorithm using the following data and plot a dendrogram using the complete link approach. The following table contains sample data items indicating the distance between the elements.

      1     2     3     4     5
1   1.00  0.90  0.10  0.65  0.20
2   0.90  1.00  0.70  0.60  0.50
3   0.10  0.70  1.00  0.40  0.30
4   0.65  0.60  0.40  1.00  0.80
5   0.20  0.50  0.30  0.80  1.00

Step 1: Merge 1 and 3 as they have the minimum distance (0.1). (We get 1/3.) While merging, consider the maximum distance.
Step 2: Merge 1/3 and 5 as they have the minimum distance (0.3). (We get 1/3/5.)
Step 3: Merge 4 and 2 as they have the minimum distance (0.6). (We get 4/2.)
Step 4: Merge 1/3/5 and 4/2 as it is the last cluster. (We get 1/3/5/4/2.)
Resulting Dendrogram
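If SciPy is available, the same complete-link merge heights (0.10, 0.30, 0.60, 0.90) can be verified and the dendrogram drawn from the matrix above; the diagonal is zeroed so it can be treated as a distance matrix (drawing the plot additionally needs matplotlib):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

D = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
np.fill_diagonal(D, 0.0)                          # self-distance is 0

Z = linkage(squareform(D), method="complete")     # condensed distances -> merge tree
print(Z[:, 2])                                    # merge heights: 0.10, 0.30, 0.60, 0.90
dendrogram(Z, labels=["1", "2", "3", "4", "5"])   # plots the dendrogram (needs matplotlib)
```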
Agglomerative – Average Link Algorithm
Apply Agglomerative Hierarchical Clustering and draw
single link and average link dendrogram for the following
distance matrix. (May 2013)

Step 1: Merge A and B as they have the minimum distance (2). (We get A/B.) While merging, using average link, consider the average distance.
Step 2: Merge D and E as they have the minimum distance (4). (We get D/E.)
Step 3: Merge A/B and C as they have the minimum distance (4.5). (We get A/B/C.)
Step 4: Merge A/B/C and D/E as it is the last cluster, with distance 7.5. (We get A/B/C/D/E.)
Comparison between Single, complete & average
link
1. Single Link
Advantage: Can handle non-elliptical shapes
Disadvantages:
⯍ Sensitive to noise and outliers
⯍ Produces long, elongated clusters

2. Complete Link
Advantage: Less susceptible to noise and outliers
Disadvantages:
⯍ Tends to break large clusters
⯍ Biased towards globular clusters

3. Average Link
Advantage: Less susceptible to noise and outliers
Disadvantage: Biased towards globular clusters


Divisive
Hierarchical
Clustering
2. Divisive Hierarchical Clustering
Divisive Analysis (DIANA)
⚫ This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster.
⚫ It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until it satisfies certain termination conditions, such as a desired number of clusters or the diameter of each cluster being within a certain threshold.
DIANA Explored
⚫ First, all of the objects form one cluster.
⚫ The cluster is split according to some principle,
such as the minimum Euclidean distance
between the closest neighboring objects in the
cluster.
⚫ The cluster splitting process repeats until,
eventually, each new cluster contains a single
object or a termination condition is met.
Splitting Process of DIANA
Initialization:
1. Choose the object Oh which is most dissimilar to the other objects in C.
2. Let C1 = {Oh}, C2 = C − C1.
[Figure: cluster C split into C1 and C2]
Splitting Process of DIANA (Cont'd)
Iteration:
3. For each object Oi in C2, determine whether it is closer to C1 or to the other objects in C2:
Di = avg(j ∈ C2) d(Oi, Oj) − avg(j ∈ C1) d(Oi, Oj)
4. Choose the object Ok with the greatest D score.
5. If Dk > 0, move Ok from C2 to C1, and repeat steps 3-5.
6. Otherwise, stop the splitting process.
[Figure: objects moving from C2 to C1 one at a time]
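A minimal sketch of one DIANA split in Python, assuming d(x, y) is any pairwise distance function over the objects (the helper names are assumptions of these notes):

```python
def avg_dist(o, group, d):
    """Average distance from object o to the other objects in group."""
    others = [x for x in group if x != o]
    return sum(d(o, x) for x in others) / len(others) if others else 0.0

def diana_split(C, d):
    """Split cluster C into (C1, C2) using DIANA's splinter procedure."""
    # Initialization: move the most dissimilar object Oh into its own group C1
    splinter = max(C, key=lambda o: avg_dist(o, C, d))
    C1, C2 = [splinter], [o for o in C if o != splinter]
    while len(C2) > 1:
        # Step 3: Di = avg distance to the rest of C2 minus avg distance to C1
        scores = {o: avg_dist(o, C2, d) - avg_dist(o, C1, d) for o in C2}
        # Step 4: pick the object with the greatest D score
        best = max(scores, key=scores.get)
        # Steps 5-6: move it to C1 only while its score is positive, otherwise stop
        if scores[best] <= 0:
            break
        C2.remove(best)
        C1.append(best)
    return C1, C2

# Example: splitting 1-D values with absolute difference as the distance
print(diana_split([1, 2, 3, 10, 11, 12], d=lambda x, y: abs(x - y)))  # -> ([1, 2, 3], [10, 11, 12])
```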
Discussion on Hierarchical Approaches

⚫ Strengths
⚪ Do not need to input k, the number of clusters
⚫ Weaknesses
⚪ Do not scale well; time complexity of at least O(n²), where n is the total number of objects
⚪ Can never undo what was done previously
Hierarchical clustering comparison

• Agglomerative (bottom up)
1. Start with each point as a cluster (singleton).
2. Recursively merge two or more appropriate clusters at each iteration.
3. Stop when all the clusters are merged into one big cluster.

• Divisive (top down)
1. Start with one big cluster.
2. Recursively divide into smaller clusters at each iteration.
3. Stop when k number of clusters is achieved.
THANK
YOU
