
Major Clustering Approaches

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: DIANA, AGNES, BIRCH, CHAMELEON

 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue

 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
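Not from the slides, but as a rough orientation: three of these approaches are available directly in scikit-learn (grid-based methods such as STING or CLIQUE are not); the data X and all parameter values below are placeholders.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X = np.random.RandomState(0).rand(200, 2)          # toy data standing in for a real dataset

part_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)          # partitioning
hier_labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X)  # hierarchical
dens_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)                           # density-based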

1
Partitioning Algorithms: Basic Concept
 Partitioning method: Partition a database D of n objects into a set of k clusters such that the sum of squared distances is minimized, where c_i is the centroid or medoid of cluster C_i (a small code sketch of this criterion appears at the end of this slide):

E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2

 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is
represented by the center of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
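A minimal sketch of the criterion E above (function and variable names are my own, not from the slides): the sum of squared Euclidean distances from each point to the centroid of its assigned cluster.

import numpy as np

def sse(points, labels, centers):
    """points: (n, d) array; labels: cluster index of each point; centers: (k, d) array."""
    diffs = points - centers[labels]      # vector from each point p to its own centroid c_i
    return float(np.sum(diffs ** 2))      # E = sum_i sum_{p in C_i} ||p - c_i||^2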
2
K-means Algorithm
 Given the cluster number K, the K-means algorithm is carried out in three steps after initialization (a minimal code sketch follows below):

Initialization: set seed points (randomly)

1) Assign each object to the cluster of the nearest seed point, measured with a specific distance metric
2) Compute new seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e., the mean point, of the cluster)
3) Go back to Step 1; stop when there are no new assignments (i.e., membership in each cluster no longer changes)
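A minimal NumPy sketch of these three steps (illustrative only; Euclidean distance assumed, empty clusters not handled):

import numpy as np

def kmeans(X, seeds, max_iter=100):
    centers = np.asarray(seeds, dtype=float)
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest seed point / centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its current members
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(len(centers))])
        # Step 3: stop when the centroids (and hence the memberships) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers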

3
K-means Example Sec. 16.4

(K=2)

[Figure: K-means iterations with K=2: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged]
Example

Suppose we have 4 types of medicines and each has two attributes (weight and pH index). Our goal is to group these objects into K=2 groups of medicine.

Medicine   Weight   pH Index
A          1        1
B          2        1
C          4        3
D          5        4

[Figure: scatter plot of A, B, C, D in the weight/pH plane]
5
Example
 Step 1: Use initial seed points for partitioning
c_1 = A = (1, 1),   c_2 = B = (2, 1)

Euclidean distance:
d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5
d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24

Assign each object to the cluster with the nearest seed point.

6
Example
 Step 2: Compute new centroids of the
current partition
Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships:

c_1 = (1, 1)
c_2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3)

7
Example
 Step 2: Renew membership based on
new centroids
Compute the distance of all
objects to the new centroids

Assign the membership to objects

8
Example
 Step 3: Repeat the first two steps until
its convergence
Knowing the members of each cluster, we now compute the new centroid of each group based on these new memberships:

c_1 = ((1+2)/2, (1+1)/2) = (1.5, 1)
c_2 = ((4+5)/2, (3+4)/2) = (4.5, 3.5)

9
Example

 Step 3: Repeat the first two steps until


its convergence
Compute the distance of all objects to the new centroids.

Stop: there are no new assignments, i.e., membership in each cluster no longer changes.

10
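As a quick check of this worked example, the kmeans sketch defined a few slides earlier (assumed to be in scope) can be run on the four medicines with A and B as the seeds; the expected output matches the result above.

import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D (weight, pH index)
labels, centers = kmeans(X, seeds=[[1, 1], [2, 1]])           # seeds c1 = A, c2 = B
print(labels)    # expected: [0 0 1 1]  -> clusters {A, B} and {C, D}
print(centers)   # expected: [[1.5 1. ] [4.5 3.5]]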
Exercise
For the medicine data set, use K-means with the Manhattan distance
metric for clustering analysis by setting K=2 and initialising seeds as
C1 = A and C2 = C. Answer two questions as follows:
1. What are memberships of two clusters after convergence?
2. What are centroids of two clusters after convergence?

Medicine   Weight   pH Index
A          1        1
B          2        1
C          4        3
D          5        4

[Figure: scatter plot of A, B, C, D]

11
Sec. 16.4

Termination conditions
 Several possibilities, e.g.,
 A fixed number of iterations.
 Centroid positions don’t change.

 RSS (Residual Sum of Squares) falls below a threshold.
 The decrease in RSS falls below a threshold.
Comments on the K-Means Method

 Strength: Efficient: O(tkn), where n is # objects, k is #


clusters, and t is # iterations. Normally, k, t << n.
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-
dimensional space
 Need to specify k, the number of clusters, in advance
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex
shapes

13
Sec. 16.4

Seed Choice

 Results can vary based on random seed selection.
 Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
 Select good seeds using a heuristic (e.g., a point least similar to any existing mean)
 Try out multiple starting points
 Initialize with the results of another method.

Example showing sensitivity to seeds: in the example figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
k-means++
 Instead of choosing the initial cluster centers randomly, choose them in a smarter way:
1) Randomly choose one of the observations to be a cluster center
2) For each observation x, determine d(x), the minimal distance from x to a current cluster center
3) Choose the next cluster center from the data points, with the probability of making an observation x a cluster center proportional to d(x)^2
4) Repeat steps 2 and 3 until you have chosen the right number of clusters

16
Example

17
Example
Cluster centers: {(7, 4)}

Cluster centers: {(7, 4), (1, 3)}

18
Example
Cluster centers: {(7, 4), (1, 3)}

Cluster centers: {(7, 4), (1, 3), (5, 9)}


k-means++
 Instead of choosing the initial cluster centers randomly, choose them in a smarter way:
1) Randomly choose one of the observations to be a cluster center
2) For each observation x, determine d(x), the minimal distance from x to a current cluster center
3) Choose the next cluster center from the data points, with the probability of making an observation x a cluster center proportional to d(x)^2
4) Repeat steps 2 and 3 until you have chosen the right number of clusters

This process has a setup cost, but convergence tends to be faster and better (lower heterogeneity); a small sketch of the seeding step follows below.

20
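A short sketch of the seeding procedure described above (function and variable names are illustrative, not library code); scikit-learn's KMeans uses this seeding by default (init="k-means++").

import numpy as np

def kmeans_pp_seeds(X, k, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    centers = [X[rng.integers(len(X))]]                 # 1) one random observation as the first center
    while len(centers) < k:
        # 2) d(x): minimal distance from each x to the current centers
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2), axis=1) ** 2
        # 3) pick the next center with probability proportional to d(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)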
What Is the Problem of the K-Means Method?

 The k-means algorithm is sensitive to outliers !


 Since an object with an extremely large value may substantially distort the distribution of the data
 K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
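A small sketch of the medoid itself (illustrative names): the cluster member that minimizes the total distance to the other members. A full PAM-style k-medoids is available in third-party packages such as scikit-learn-extra, if installed.

import numpy as np

def medoid(cluster_points):
    d = np.linalg.norm(cluster_points[:, None, :] - cluster_points[None, :, :], axis=2)
    return cluster_points[d.sum(axis=1).argmin()]   # the row with the smallest total distance to the rest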


22
Hierarchical Clustering
 Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging a and b into ab, d and e into de, then c and de into cde, then ab and cde into abcde; divisive clustering (DIANA) runs in the reverse order, from Step 4 back to Step 0]
25
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster


26
Example
 Five objects: a, b, c, d, e

Distance matrix:
      a    b    c    d    e
 a    0
 b   12    0
 c    4   10    0
 d   15    6   12    0
 e    7    8    5   13    0

[Dendrogram in progress: leaves reordered a, c, b, d, e after the first merge of a and c]

27
Example
Distance matrix:
      a    b    c    d    e
 a    0
 b   12    0
 c    4   10    0
 d   15    6   12    0
 e    7    8    5   13    0

After merging a and c (single link):
     ac    b    d    e
 ac   0
 b   10    0
 d   12    6    0
 e    5    8   13    0

[Dendrogram in progress: leaves ordered a, c, e, b, d]
28
Example
After merging a and c:
     ac    b    d    e
 ac   0
 b   10    0
 d   12    6    0
 e    5    8   13    0

After merging ac and e:
     ace    b    d
 ace   0
 b     8    0
 d    12    6    0

[Dendrogram in progress: leaves ordered a, c, e, b, d]

29
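The merges above can be reproduced with SciPy, assuming it is available; the matrix below is the example's distance matrix with the objects in the order a, b, c, d, e.

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[ 0, 12,  4, 15,  7],
              [12,  0, 10,  6,  8],
              [ 4, 10,  0, 12,  5],
              [15,  6, 12,  0, 13],
              [ 7,  8,  5, 13,  0]], dtype=float)

Z = linkage(squareform(D), method="single")   # single-link AGNES on the condensed matrix
print(Z)   # first row merges a and c at distance 4, then {a, c} and e at distance 5, ...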
Exercise
 Five objects: a, b, c, d, e

Distance matrix:
      a    b    c    d    e
 a    0
 b   12    0
 c    4   10    0
 d   15    6   12    0
 e    7    8    5   13    0

Use the complete-link method.

30
dendrogram

Decompose data objects into several levels of nested partitioning, called a dendrogram.

 Shows how clusters are merged

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

[Figure: dendrogram over the leaves a, c, e, b, d]

31
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own


32
Hierarchical Clustering (Cont.)
 Most hierarchical clustering algorithms are variants of the single-link, complete-link, or average-link methods.

 Of these, single-link and complete-link are the most popular.

 In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn one from each cluster.

 In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between pairs of patterns drawn one from each cluster.

 In the average-link algorithm, the distance between two clusters is the average of all pairwise distances between pairs of patterns drawn one from each cluster.
Distance Between Clusters
 Single Link: smallest distance between any
pair of points from two clusters
 Complete Link: largest distance between any pair
of points from two clusters
Distance Between Clusters (Cont.)
 Average Link: average distance between points
from two clusters

 Centroid: distance between centroids


of the two clusters
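A short sketch of the four inter-cluster distances just defined, for two small point sets A and B (illustrative data), using SciPy's pairwise distances.

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 3.0], [5.0, 4.0]])

D = cdist(A, B)                                                   # all pairwise distances
single_link   = D.min()                                           # smallest pairwise distance
complete_link = D.max()                                           # largest pairwise distance
average_link  = D.mean()                                          # average pairwise distance
centroid_dist = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))   # distance between centroids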
Single Link vs. Complete Link (Cont.)

[Figure: one data set where single link works but complete link does not, and another where complete link works but single link does not]
Single Link vs. Complete Link (Cont.)

[Figure: points labeled 1 and 2 forming two clusters; single link works, complete link doesn't]


Single Link vs. Complete Link (Cont.)

[Figure: a 1-cluster case with noise and a 2-cluster case; single link doesn't work, complete link does]


Hierarchical vs. Partitional

 Hierarchical algorithms are more versatile than


partitional algorithms.
 For example, the single-link clustering algorithm works
well on data sets containing non-isotropic (non-
roundish) clusters including well-separated, chain-like,
and concentric clusters, whereas a typical partitional
algorithm such as the k-means algorithm works well
only on data sets having isotropic clusters.
 On the other hand, the time and space
complexities of the partitional algorithms are
typically lower than those of the hierarchical
algorithms.
More on Hierarchical Clustering Methods

 Major weakness of agglomerative clustering


methods
 do not scale well: time complexity of at least O(n²), where n is the total number of objects
 can never undo what was done previously (greedy
algorithm)
 Integration of hierarchical with distance-based
clustering
 BIRCH (1996): uses Clustering Feature tree (CF-tree)
and incrementally adjusts the quality of sub-clusters
 CURE (1998): selects well-scattered points from the
cluster and then shrinks them towards the center of
the cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a
hierarchical data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree
(a multi-level compression of the data that tries to
preserve the inherent clustering structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster
the leaf nodes of the CF-tree

41
The CF Tree Structure
Root node:       [CF1 | CF2 | CF3 | ... | CF6]   each entry CFi points to childi

Non-leaf node:   [CF1 | CF2 | CF3 | ... | CF5]   each entry CFi points to childi

Leaf nodes:      prev <-> [CF1 | CF2 | ... | CF6] <-> [CF1 | CF2 | ... | CF4] <-> next
                 (leaf nodes are chained together by prev/next pointers)
42
Clustering Feature Vector in BIRCH

Clustering Feature (CF):  CF = (N, LS, SS)

N: number of data points
LS: linear sum of the N points:  LS = \sum_{i=1}^{N} X_i
SS: square sum of the N points:  SS = \sum_{i=1}^{N} X_i^2

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

43
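A sketch of the clustering feature and its additivity (a minimal, assumed representation): because LS and SS are simple sums, two CFs can be merged by component-wise addition, which is what lets BIRCH update the tree incrementally.

import numpy as np

def cf(points):
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)        # (N, LS, SS)

def merge_cf(a, b):
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]              # CFs are additive

print(cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, array([16., 30.]), array([ 54., 190.])) -- the CF from the slide above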
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the

0-th, 1st, and 2nd moments of the subcluster from


the statistical point of view
 Registers crucial measurements for computing clusters and utilizes storage efficiently

44
CF-Tree in BIRCH

A CF tree is a height-balanced tree that stores the


clustering features for a hierarchical clustering
 A nonleaf node in a tree has descendants or

“children”
 The nonleaf nodes store sums of the CFs of their

children
 A CF tree has two parameters
 Branching factor: max # of children

 Threshold: max diameter of sub-clusters stored at

the leaf nodes

45
5. BIRCH algorithm

• An example of the CF Tree

Initially, all data points are in one cluster, A; the root has a single entry pointing to A.
5. BIRCH algorithm

• An example of the CF Tree

As data arrives, a check is made whether the size (diameter) of cluster A exceeds the threshold T.
5. BIRCH algorithm

• An example of the CF Tree

If the cluster grows too big, it is split into two clusters, A and B, and the points are redistributed.
5. BIRCH algorithm

• An example of the CF Tree

At each node of the tree, the CF tree keeps information about the mean of each cluster and the mean of the sum of squares, so that the size of the clusters can be computed efficiently.
5. BIRCH algorithm

• Another example of the CF Tree Insertion

[Figure: a CF tree whose root has three leaf children, LN1 = {sc1, sc2, sc3}, LN2 = {sc4, sc5, sc6} and LN3 = {sc7}; a new sub-cluster sc8, closest to LN1, is inserted there]
5. BIRCH algorithm

• Another example of the CF Tree Insertion

If the branching factor of a leaf node cannot exceed 3, then LN1 (which now holds four entries) is split into LN1’ and LN1’’.

[Figure: the tree after the leaf split; the root’s children are now LN1’, LN1’’, LN2 and LN3]
5. BIRCH algorithm

• Another example of the CF Tree Insertion

If the branching factor of a non-leaf node cannot exceed 3 either, then the root is split and the height of the CF Tree increases by one.

[Figure: the new root has two non-leaf children, NLN1 = {LN1’, LN1’’} and NLN2 = {LN2, LN3}]
5. BIRCH algorithm

• Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.

• Phase 2: Condense the tree to a desirable size by building a smaller CF tree.

• Phase 3: Global clustering.

• Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results.
The BIRCH Algorithm
 Cluster diameter:  D = \sqrt{ \frac{1}{n(n-1)} \sum_{i \neq j} (x_i - x_j)^2 }

 For each point in the input


 Find closest leaf entry

 Add point to leaf entry and update CF

 If entry diameter > max_diameter, then split leaf, and possibly

parents
 Algorithm is O(n)
 Concerns
 Sensitive to insertion order of data points

 Since the size of leaf nodes is fixed, the resulting clusters may not be natural
 Clusters tend to be spherical given the radius and diameter

measures
54
Property

(Here t_i t_j denotes a dot product; M_1 = \sum_i t_i, M_2 = \sum_i t_i^2, and M_1^2 = M_1 \cdot M_1.)

D_m^2 = \frac{\sum_{j=1}^{N} \sum_{i=1}^{N} (t_i - t_j)^2}{N(N-1)}

      = \frac{\sum_{j=1}^{N} \sum_{i=1}^{N} (t_i^2 - 2 t_i t_j + t_j^2)}{N(N-1)}

      = \frac{\sum_{j=1}^{N} \left( \sum_{i=1}^{N} t_i^2 - 2 t_j \sum_{i=1}^{N} t_i + N t_j^2 \right)}{N(N-1)}

      = \frac{\sum_{j=1}^{N} (M_2 - 2 t_j M_1 + N t_j^2)}{N(N-1)}

      = \frac{\sum_{j=1}^{N} M_2 - 2 M_1 \sum_{j=1}^{N} t_j + N \sum_{j=1}^{N} t_j^2}{N(N-1)}

      = \frac{N M_2 - 2 M_1^2 + N M_2}{N(N-1)}

When clusters C_a and C_b with CF_a = (n_a, M_{a1}, M_{a2}) and CF_b = (n_b, M_{b1}, M_{b2}) are merged:

  N = n_a + n_b
  M_1 = M_{a1} + M_{b1}
  M_2 = M_{a2} + M_{b2}

BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
 Scales linearly: finds a good clustering with a
single scan and improves the quality with a few
additional scans

 Weakness: handles only numeric data, and is sensitive to the order of the data records

58
