Cluster Analysis: Basic Concepts and Algorithms
K. Ramachandra Murthy
Outline
Introduction
Types of data in cluster analysis
Types of clusters
Clustering Techniques
– Partitional
K-Means
Bisecting K-Means
K-Medoids
CLARA
CLARANS
– Hierarchical
Agglomerative
Divisive
– Density
DBSCAN
INTRODUCTION
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or
related) to one another and different from (or unrelated to) the objects in other
groups
Intra-cluster distances are minimized; inter-cluster distances are maximized
Quality: What Is Good Clustering?
Dissimilarity matrix
– (one mode): an n × n lower-triangular matrix of pairwise distances

    0
    d(2,1)  0
    d(3,1)  d(3,2)  0
    :       :       :
    d(n,1)  d(n,2)  ...  0
Types of data in cluster analysis
Interval-scaled variables
Binary variables
Standardize data
– Calculate the mean absolute deviation:
s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + … + |x_nf - m_f|)
where m_f = (1/n) (x_1f + x_2f + … + x_nf)
– Calculate the standardized measurement (z-score): z_if = (x_if - m_f) / s_f
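The standardization step above can be sketched in a few lines of Python (an illustration, not the lecture's own code; the z-score step is assumed as the standard follow-up to computing s_f):

```python
def standardize(column):
    """Standardize one variable f: z_if = (x_if - m_f) / s_f."""
    n = len(column)
    m_f = sum(column) / n                           # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n     # mean absolute deviation
    return [(x - m_f) / s_f for x in column]

print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```

Using the mean absolute deviation rather than the standard deviation makes the scale estimate less sensitive to outliers.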
If q = 2, d is the Euclidean distance:
d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + … + |x_ip - x_jp|^2)
– Properties
d(i, j) >= 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) <= d(i, k) + d(k, j)
Nominal variables: use simple matching,
d(i, j) = (p - m) / p
where m is the number of matches and p is the total number of variables
– map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
z_if = (r_if - 1) / (M_f - 1)
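The rank-to-[0, 1] mapping above can be sketched as (illustrative helper, not from the lecture):

```python
def ordinal_to_interval(states, value):
    """Map an ordinal value to [0, 1]: z_if = (r_if - 1) / (M_f - 1)."""
    M_f = len(states)                   # number of ordered states of variable f
    r_if = states.index(value) + 1      # rank of the value, 1..M_f
    return (r_if - 1) / (M_f - 1)

print(ordinal_to_interval(["low", "medium", "high"], "medium"))  # 0.5
```

The lowest state maps to 0, the highest to 1, so the result can be fed into any interval-scaled distance.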
– compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables
yif = log(xif)
Variables of mixed types: combine with a weighted formula,
d(i, j) = sum_{f=1..p} delta_ij(f) d_ij(f) / sum_{f=1..p} delta_ij(f)
– f is binary or nominal:
d_ij(f) = 0 if x_if = x_jf, or d_ij(f) = 1 otherwise
– f is ordinal or ratio-scaled:
compute ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
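A minimal sketch of the mixed-type formula, assuming interval values are already mapped into [0, 1] (e.g. via standardization or the rank mapping above); `mixed_dissimilarity` and its arguments are illustrative names, not from the lecture:

```python
def mixed_dissimilarity(obj_i, obj_j, kinds):
    """d(i,j) = sum_f delta_ij(f) * d_ij(f) / sum_f delta_ij(f)."""
    num = den = 0.0
    for f, kind in enumerate(kinds):
        xi, xj = obj_i[f], obj_j[f]
        if xi is None or xj is None:        # delta_ij(f) = 0: skip missing values
            continue
        if kind in ("binary", "nominal"):
            d_f = 0.0 if xi == xj else 1.0  # simple matching contribution
        else:                               # interval-scaled, assumed in [0, 1]
            d_f = abs(xi - xj)
        num += d_f                          # delta_ij(f) = 1 otherwise
        den += 1.0
    return num / den
```

Each variable contributes its own per-type distance, and the indicator delta_ij(f) drops variables with missing values from both the numerator and the denominator.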
Vector Objects
TYPES OF CLUSTERS
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Well-Separated Clusters:
3 well-separated clusters
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
– The center of a cluster is often a centroid, the average of all the points
in the cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
Types of Clusters: Contiguity-Based
8 contiguous clusters
Types of Clusters: Density-Based
Density-based
6 density-based clusters
Types of Clusters: Conceptual Clusters
2 Overlapping Circles
Types of Clusters: Objective Function
Map the clustering problem to a different domain and solve a related problem in that domain
– Proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the
weighted edges represent the proximities between points
– Clustering is equivalent to breaking the graph into connected components, one for each cluster
– Want to minimize the edge weight between clusters and maximize the edge weight within clusters
CLUSTERING TECHNIQUES
Types of Clustering
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
[Figure: points p1-p4 shown as a traditional nested clustering and as the corresponding traditional dendrogram.]
Partitional Clustering
Hierarchical clustering
Density-based clustering
Partitional Clustering
Given
– A data set of n objects
Organize the objects into k partitions (k ≤ n), where each partition represents a cluster.
The clusters are formed to optimize an objective partitioning criterion:
– Objects within a cluster are similar, whereas objects of different clusters are dissimilar
Complexity is O( n * K * I * d )
– n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
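The O(n * K * I * d) loop can be seen directly in a minimal K-means sketch (1-D points for brevity; an illustration, not the lecture's code):

```python
import random

def kmeans_1d(points, k, iters=100):
    """Lloyd's K-means on 1-D points: assign to nearest centroid, recompute means."""
    centroids = random.sample(points, k)            # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):                          # at most I iterations
        clusters = [[] for _ in range(k)]
        for p in points:                            # n points x K centroid tests
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]     # recompute each centroid as a mean
        if new == centroids:                        # converged: assignments are stable
            break
        centroids = new
    return centroids, clusters
```

With d-dimensional points the inner distance test costs d operations, giving the O(n * K * I * d) total above.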
Evaluating K-means Clusters
– The most common measure is the Sum of Squared Error (SSE): for each point, the error is its distance to the nearest cluster centroid, so SSE = sum_{i=1..K} sum_{x in Ci} dist(x, ci)^2
– Given two clusterings, we can choose the one with the smallest error
One easy way to reduce SSE is to increase K, i.e. the number of clusters
– But a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
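The SSE criterion above is a one-liner (1-D points again, to match the earlier sketch; illustrative only):

```python
def sse(clusters, centroids):
    """SSE = sum over clusters i, points x in C_i, of dist(x, c_i)^2 (1-D here)."""
    return sum((x - c) ** 2 for c, pts in zip(centroids, clusters) for x in pts)

# Tight clusters around well-placed centroids give a low SSE:
print(sse([[1.0, 2.0], [10.0, 11.0]], [1.5, 10.5]))  # 1.0
```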
Two different K-means Clusterings
[Figure: the original points and two different K-means clusterings of them (one optimal, one sub-optimal), followed by iterations 1-6 of K-means for one choice of initial centroids.]
Importance of Choosing Initial Centroids (Case ii)
[Figure: iterations 1-5 of K-means for a second choice of initial centroids, which converges to a poor clustering.]
Problems with Selecting Initial Points
If there are K 'real' clusters then the chance of selecting one centroid
from each cluster is small.
– The chance is relatively small when K is large; if the clusters are the same size, the probability is K!/K^K (for example, about 0.00036 for K = 10)
– Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example (Good Clusters)
[Figure: iterations 1-4 of K-means, starting with two initial centroids in one cluster of each pair of clusters; all ten clusters are recovered.]
10 Clusters Example (Bad Clusters)
[Figure: iteration 4 of K-means when the initial centroids are unevenly spread; the ten clusters are not recovered.]
Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Post-processing
Bisecting K-means
– Not as susceptible to initialization issues
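The "multiple runs" option can be sketched as follows; `run_clustering` is a hypothetical stand-in for any randomly initialized clustering routine that returns an (SSE, labels) pair:

```python
def best_of_runs(run_clustering, data, k, n_runs=10):
    """Rerun a randomly initialized clustering and keep the lowest-SSE result."""
    best = None
    for seed in range(n_runs):
        sse_value, labels = run_clustering(data, k, seed=seed)
        if best is None or sse_value < best[0]:
            best = (sse_value, labels)
    return best
```

More runs raise the chance that at least one initialization lands near the 'real' clusters, but as noted above the probability is still not on your side for large K.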
Pre-processing and Post-processing
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Eliminate small clusters that may represent outliers
– Merge clusters that are ‘close’ and that have relatively low SSE
K-means has problems when clusters have
– Differing densities
– Non-globular shapes
K-means is also sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
Solution: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
The K-Medoids Clustering Method
PAM (Partitioning Around Medoids):
– starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering.
– PAM works effectively for small data sets, but does not scale well for large data sets.
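The swap-until-no-improvement idea can be sketched directly (a naive illustration that recomputes the total error for every candidate swap, not the lecture's own, more efficient cost bookkeeping):

```python
from itertools import product

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_error(points, medoids):
    # E: sum of distances from every object to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, medoids):
    """Swap a medoid with a non-medoid whenever it lowers E; stop at a local minimum."""
    medoids = list(medoids)
    improved = True
    while improved:
        improved = False
        for i, h in product(range(len(medoids)), points):
            if h in medoids:
                continue                      # h must be a non-selected object
            candidate = medoids[:i] + [h] + medoids[i + 1:]
            if total_error(points, candidate) < total_error(points, medoids):
                medoids, improved = candidate, True
    return medoids
```

The per-object swap costs C_jih discussed next let PAM evaluate each candidate swap incrementally instead of recomputing E from scratch.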
2. For each pair of a selected object i and a non-selected object h, calculate the total swap cost TCih: the sum, over every remaining non-selected object j, of the cost contribution Cjih.
[Figure: two clusters, C1 = {x1, x2, x3} and C2 = {y1, y2, y3}.]
Case (i): Computation of Cjih
[Figure: object j currently belongs to the cluster of medoid i; after i is swapped with h, j is closest to h.]
Therefore, in future, j belongs to the cluster represented by h.
E = sum_{i=1..k} sum_{j in Ci} d(j, i)
Cjih = d(j, h) - d(j, i)
Case (ii): Computation of Cjih
[Figure: object j currently belongs to the cluster of medoid i; after the swap, j is closest to another medoid t.]
Therefore, in future, j belongs to the cluster represented by t.
Cjih = d(j, t) - d(j, i)
Case (iii): Computation of Cjih
[Figure: object j currently belongs to the cluster of another medoid t, and remains closer to t than to h.]
Therefore, in future, j belongs to the cluster represented by t itself.
Cjih = 0
Case (iv): Computation of Cjih
[Figure: object j currently belongs to the cluster of another medoid t; after the swap, j is closest to h.]
Therefore, in future, j belongs to the cluster represented by h.
Cjih = d(j, h) - d(j, t)
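All four cases collapse to one comparison: j's distance to its nearest representative after the swap, minus its distance to its nearest representative before. A hedged sketch (`swap_cost` and its arguments are illustrative names, not from the lecture):

```python
def swap_cost(d, j, i, h, medoids):
    """C_jih: change in j's contribution to E when medoid i is replaced by h."""
    old = min(d(j, m) for m in medoids)                    # nearest medoid now
    remaining = [m for m in medoids if m != i]
    new = min([d(j, m) for m in remaining] + [d(j, h)])    # nearest after the swap
    return new - old
```

With 1-D points and d = |a - b|, medoids {0, 10}, and the swap 0 -> 3: the object j = 1 loses its nearby medoid (cost +1), while j = 9 stays with medoid 10 (cost 0, case iii).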
Data Objects
       A1  A2
O1      2   6
O2      3   4
O3      3   8
O4      4   7
O5      6   2
O6      6   4
O7      7   3
O8      7   4
O9      8   5
O10     7   6
[Figure: the ten objects plotted in the A1-A2 plane.]
Goal: create two clusters
Choose randomly two medoids: O8 = (7,4) and O2 = (3,4)
PAM or K-Medoids: Example
[Figure: the ten objects, with cluster1 and cluster2 marked.]
Assign each object to the closest representative object.
Using the L1 metric (Manhattan), we form the following clusters:
Cluster1 = {O1, O2, O3, O4}
Cluster2 = {O5, O6, O7, O8, O9, O10}
Compute the absolute-error criterion [for the set of medoids (O2, O8)]:
E = sum_{i=1..k} sum_{p in Ci} |p - Oi|
  = (|O1 - O2| + |O3 - O2| + |O4 - O2|) + (|O5 - O8| + |O6 - O8| + |O7 - O8| + |O9 - O8| + |O10 - O8|)
The absolute-error criterion [for the set of medoids (O2, O8)]:
E = (3+4+4) + (3+1+1+2+2) = 20
PAM or K-Medoids: Example
• Choose a random object, O7
• Swap O8 and O7
• Compute the absolute-error criterion [for the set of medoids (O2, O7)]:
E = (3+4+4) + (2+2+1+3+3) = 22
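Both E values above can be recomputed mechanically with the L1 metric (a verification sketch, using the object coordinates from the table):

```python
# Object coordinates O1..O10 from the example above.
objs = {1: (2, 6), 2: (3, 4), 3: (3, 8), 4: (4, 7), 5: (6, 2),
        6: (6, 4), 7: (7, 3), 8: (7, 4), 9: (8, 5), 10: (7, 6)}

def absolute_error(medoid_ids):
    """E: each object contributes its Manhattan distance to the nearest medoid."""
    medoids = [objs[m] for m in medoid_ids]
    return sum(min(abs(x - mx) + abs(y - my) for mx, my in medoids)
               for x, y in objs.values())

print(absolute_error([2, 8]))  # 20
print(absolute_error([2, 7]))  # 22
```

The medoids themselves contribute zero, so only the eight non-medoid objects add to each sum.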
PAM or K-Medoids: Example
Compute the cost function:
S = Absolute error [O2, O7] - Absolute error [O2, O8]
S = 22 - 20 = 2
S > 0 => it is a bad idea to replace O8 by O7
PAM or K-Medoids: Example
The same decision, via the individual contributions Cjih:
C687 = d(O7,O6) - d(O8,O6) = 2 - 1 = 1
C587 = d(O7,O5) - d(O8,O5) = 2 - 3 = -1
C187 = 0, C387 = 0, C487 = 0
C987 = d(O7,O9) - d(O8,O9) = 3 - 2 = 1
C1087 = d(O7,O10) - d(O8,O10) = 3 - 2 = 1
TCih = sum_j Cjih = 1 - 1 + 1 + 1 = 2 = S
PAM or K-Medoids : Different View
[Figure: the search over medoid sets for a 5-object data set. The initial medoids {O3, O2} give E = 20; neighbouring medoid sets include {O1, O4} with E = 26, {O1, O5} with E = 17, and {O4, O5} with E = 9.]
PAM or K-Medoids: Properties
PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
PAM works efficiently for small data sets but does not scale well for large data sets.
CLARA draws multiple samples of the data set, applies PAM on each sample, and returns its best clustering as the output.
Weakness:
– Only the sampled objects are examined in order to find the medoids; medoids are chosen from the sample.
– Note that the algorithm cannot find the best solution if one of the best k medoids is not among the selected sample.
CLARA (Clustering Large Applications)
[Figure: CLARA draws samples sample1, sample2, …, samplem from the data set, runs PAM on each, and chooses the best clustering; that clustering is returned as the output.]
The clustering accuracy is measured by the average dissimilarity over the entire data set.
Experiments show that 5 samples of size 40 + 2k give satisfactory results.
CLARA (Clustering Large Applications)
For i= 1 to 5, repeat the following steps
– Draw a sample of 40 + 2k objects randomly from the entire data set, and call Algorithm PAM to find k medoids of the sample.
– For each object in the entire data set, determine which of the k medoids is the most similar to it.
– Calculate the average dissimilarity ON THE ENTIRE DATASET of the clustering obtained in the previous step. If
this value is less than the current minimum, use
this value as the current minimum, and retain the k medoids found in Step (1) as
the best set of medoids obtained so far.
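The steps above can be sketched as follows; `pam` is a hypothetical stand-in for any PAM routine that returns k medoids from a sample:

```python
import random

def clara(points, k, pam, dist, n_samples=5):
    """Run PAM on n_samples random samples of size 40 + 2k; keep the medoids
    whose total dissimilarity over the ENTIRE data set is smallest."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = random.sample(points, min(len(points), 40 + 2 * k))
        medoids = pam(sample, k)          # medoids come from the sample only
        cost = sum(min(dist(p, m) for m in medoids) for p in points)
        if cost < best_cost:              # evaluated on all objects, not the sample
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```

Note the asymmetry that defines CLARA: medoids are searched within the sample, but the quality of each candidate set is judged against the whole data set.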
CLARA (Clustering Large Applications)
PAM finds the best k medoids among the entire data set; CLARA finds the best k medoids among the objects it samples.
Problems
– The best k medoids may not be selected during the sampling process; in this case, CLARA will never find the best clustering.
– If the sampling is biased, we cannot have a good clustering.
PAM or K-Medoids : Different View
[Figure: with n = 5 patterns and k = 2 clusters, PAM starts from initial medoids {O2, O3} with E = 20 and examines neighbouring medoid sets such as {O1, O4} (E = 26), {O1, O5} (E = 17), and {O4, O5} (E = 21).]
CLARA: Different View
[Figure: with n = 5 patterns and k = 2 clusters, CLARA draws a sample and evaluates only medoid sets chosen from that sample: {O3, O1} E = 30, {O3, O4} E = 10, {O3, O5} E = 12, {O1, O2} E = 19, {O4, O2} E = 33, {O5, O2} E = 14; the initial medoids {O2, O3} have E = 20, while sets outside the sample, e.g. {O1, O4} E = 26, {O1, O5} E = 17, {O4, O5} E = 21, are never examined.]
CLARANS (“Randomized” CLARA)
CLARANS draws a sample of neighbours dynamically at each step of the search, rather than fixing a sample of objects up front as CLARA does: the clustering process is a search of a graph in which every node is a potential solution (a set of k medoids), and when a local optimum is found the search restarts from a new randomly selected node.
CLARANS
Advantages
– Experiments show that CLARANS is more effective than both PAM and CLARA
– Handles outliers
Disadvantages
– The computational complexity of CLARANS is O(n²), where n is the number of objects