Data Mining: Concepts and Techniques
Chapter 7. Cluster Analysis
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
© 2006 Jiawei Han and Micheline Kamber. All rights reserved.
October 4, 20
Cluster analysis
Typical applications
Pattern Recognition
Image Processing
WWW
Document classification
Examples of Clustering Applications
Scalability
High dimensionality
Data Structures
Data matrix (two modes):
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity matrix (one mode):
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
Interval-scaled variables
Binary variables
Interval-valued variables
Standardize data:
Calculate the mean:
\[ m_f = \tfrac{1}{n}\,(x_{1f} + x_{2f} + \cdots + x_{nf}) \]
Calculate the standardized measurement (z-score):
\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]
where s_f is the mean absolute deviation of variable f.
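As an illustration (not part of the original slides), a minimal Python sketch of this standardization, assuming plain lists of numbers and taking s_f as the mean absolute deviation:

```python
# Standardize one interval-scaled variable f, per the slide's definition.
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                          # mean m_f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in values]       # z-scores z_if

print(standardize([1.0, 2.0, 3.0, 4.0]))  # [-1.5, -0.5, 0.5, 1.5]
```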
Some popular ones include the Minkowski distance:
\[ d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q} \]
where i = (x_{i1}, ..., x_{ip}) and j = (x_{j1}, ..., x_{jp}) are two p-dimensional data objects and q is a positive integer.
If q = 1, d is the Manhattan distance:
\[ d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}| \]
If q = 2, d is the Euclidean distance:
\[ d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2} \]
Properties:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
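The Minkowski family above can be sketched in a few lines of Python (an illustrative helper, not from the original material):

```python
# Minkowski distance of order q between two p-dimensional objects.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

manhattan = minkowski((1, 2), (4, 6), 1)   # |1-4| + |2-6| = 7
euclidean = minkowski((1, 2), (4, 6), 2)   # sqrt(9 + 16) = 5
```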
Binary Variables
A contingency table for binary data:

            Object j
              1     0    sum
Object i  1   a     b    a+b
          0   c     d    c+d
        sum  a+c   b+d    p

Distance measure for symmetric binary variables:
\[ d(i,j) = \frac{b + c}{a + b + c + d} \]
Distance measure for asymmetric binary variables:
\[ d(i,j) = \frac{b + c}{a + b + c} \]
Jaccard coefficient (similarity measure for asymmetric binary variables):
\[ sim_{Jaccard}(i,j) = \frac{a}{a + b + c} \]
Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack    M       Y      N      P       N       N       N
Mary    F       Y      N      P       N       P       N
Jim     M       Y      P      N       N       N       N
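A worked sketch of this example (an assumption worth stating: Gender is treated as symmetric and skipped, the remaining attributes as asymmetric binary, coding Y and P as 1 and N as 0):

```python
# Asymmetric binary dissimilarity d(i,j) = (b + c) / (a + b + c),
# over the attributes Fever, Cough, Test-1 .. Test-4 only.
def d_asym(i, j):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Fever=Y, Cough=N, Test-1=P, rest N
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(d_asym(jack, mary))   # 1/3
print(d_asym(jack, jim))    # 2/3
print(d_asym(jim, mary))    # 3/4
```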
Nominal Variables
Ordinal Variables
Replace x_{if} by its rank r_{if} ∈ {1, ..., M_f} and map the rank onto [0, 1]:
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
Ratio-Scaled Variables
Methods:
A weighted formula combines their effects:
\[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
If f is binary or nominal:
d_{ij}^{(f)} = 0 if x_{if} = x_{jf}, and d_{ij}^{(f)} = 1 otherwise
If f is interval-based: use the normalized distance
If f is ordinal or ratio-scaled:
compute ranks r_{if}, set z_{if} = (r_{if} - 1)/(M_f - 1), and treat z_{if} as interval-scaled
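A tiny sketch of the weighted combination formula above, assuming the per-variable distances d_f and the 0/1 indicators δ_f have already been computed:

```python
# Mixed-type dissimilarity: weighted average of per-variable distances,
# where delta_f = 0 marks a variable that cannot be compared for this pair.
def d_mixed(deltas, ds):
    num = sum(delta * d for delta, d in zip(deltas, ds))
    den = sum(deltas)
    return num / den

# e.g. three variables, all measurable (delta = 1):
print(d_mixed([1, 1, 1], [0.0, 1.0, 0.5]))  # 0.5
```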
Vector Objects
Cosine measure
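A minimal sketch of the cosine measure for vector objects (e.g. keyword-frequency vectors); the helper name is illustrative, not from the slides:

```python
import math

# Cosine similarity: dot product divided by the product of the vector norms.
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

print(cosine([1, 0, 1], [1, 1, 0]))  # ≈ 0.5
```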
Partitioning approach:
Hierarchical approach:
Density-based approach:
Grid-based approach:
Model-based:
A model is hypothesized for each of the clusters; the idea is to find the best fit of the data to the given model
Frequent pattern-based:
User-guided or constraint-based:
Centroid: the "middle" of a cluster
\[ C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} \]
Radius: square root of the average distance from any point of the cluster to its centroid
\[ R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}} \]
Diameter: square root of the average mean squared distance between all pairs of points in the cluster
\[ D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}} \]
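A sketch of the three quantities above for a one-dimensional cluster (the function names are illustrative):

```python
import math

def centroid(ts):
    return sum(ts) / len(ts)

def radius(ts):
    c = centroid(ts)
    return math.sqrt(sum((t - c) ** 2 for t in ts) / len(ts))

def diameter(ts):
    # average squared distance over all ordered pairs, N(N-1) of them
    n = len(ts)
    return math.sqrt(sum((ti - tj) ** 2 for ti in ts for tj in ts) / (n * (n - 1)))

pts = [1.0, 3.0]
print(centroid(pts))  # 2.0
print(radius(pts))    # 1.0
print(diameter(pts))  # 2.0
```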
Example (K = 2):
(Figure: scatter plots of the successive iterations, axes 0-10.)
Arbitrarily choose K objects as initial cluster centers.
Assign each object to the most similar center.
Update the cluster means; reassign.
Update the cluster means again; reassign until no change.
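A hedged sketch of the k-means loop illustrated above, on 1-D points for brevity (initialization, helper names, and the toy data are my own):

```python
import random

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # arbitrarily choose k initial centers
    assign = None
    while True:
        # assign each object to the most similar (nearest) center
        new_assign = [min(range(k), key=lambda c: abs(p - centers[c]))
                      for p in points]
        if new_assign == assign:          # no change: done
            return centers, assign
        assign = new_assign
        # update the cluster means
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:                   # keep the old center if a cluster empties
                centers[c] = sum(members) / len(members)

centers, assign = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
print(sorted(centers))  # ≈ [1.0, 9.0]
```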
Weakness
Dissimilarity calculations
PAM works effectively for small data sets, but does not
scale well for large data sets
(Figure: a typical K-medoids (PAM) run, K = 2, scatter plots of each step on 0-10 axes; the initial configuration is labeled Total Cost = 20, the candidate swap Total Cost = 26.)
Arbitrarily choose k objects as initial medoids.
Assign each remaining object to the nearest medoid.
Randomly select a non-medoid object, O_random.
Compute the total cost of swapping; swap O and O_random if quality is improved.
Do loop until no change.
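A minimal sketch of the swap-based PAM idea just illustrated, on 1-D points (the cost function and toy data are my own; a real PAM implementation caches the per-swap cost terms C_jih rather than recomputing totals):

```python
# Total cost of a medoid set: sum of distances to each object's nearest medoid.
def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

# Repeatedly try every (medoid, non-medoid) swap; accept a swap only if it
# lowers the total cost; stop when no swap improves quality.
def pam(points, medoids):
    medoids = list(medoids)
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue
                cand = [o if x == m else x for x in medoids]
                if total_cost(points, cand) < total_cost(points, medoids):
                    medoids = cand
                    improved = True
    return sorted(medoids)

pts = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(pam(pts, [1.0, 12.0]))  # [2.0, 11.0]
```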
PAM Clustering: total swapping cost TC_ih = Σ_j C_jih
(Figure: the reassignment cases for objects when medoid i is swapped with candidate h; in the case illustrated, C_jih = 0.)
Weakness:
Hierarchical Clustering
(Figure: Step 0 to Step 4, agglomerative (AGNES): a and b merge into ab; d and e merge into de; c joins to form cde; ab and cde merge into abcde. Read in reverse, Step 4 to Step 0, divisive (DIANA) splits abcde back apart.)
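The agglomerative (AGNES-style) process above can be sketched as follows, using single-link (minimum pairwise) distance between 1-D clusters; the stopping criterion by target cluster count is my own simplification:

```python
# Start with singleton clusters; repeatedly merge the two clusters with the
# smallest single-link distance until target_k clusters remain.
def agnes(points, target_k):
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agnes([1.0, 1.2, 5.0, 5.1, 9.0], target_k=3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```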
Go on in a non-descending fashion.
(Figure: AGNES merging two nearby clusters, axes 0-10.)
BIRCH (1996)
Clustering feature example:
CF = (5, (16,30), (54,190)) for the five points (3,4), (2,6), (4,5), (4,7), (3,8)
(Figure: the five points plotted on 0-10 axes.)
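The CF triple above can be checked directly (a small sketch; note that CF vectors are additive, which is what lets BIRCH merge subclusters cheaply):

```python
# Clustering feature CF = (N, LS, SS): count, per-dimension linear sum,
# and per-dimension square sum of the points in a subcluster.
def clustering_feature(points):
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return n, ls, ss

cf = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # (5, (16, 30), (54, 190))
```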
CF-Tree in BIRCH
Clustering feature:
summary of the statistics for a given subcluster: the 0-th, 1st and
2nd moments of the subcluster from the statistical point of view.
(Figure: a CF tree with L = 6. The root holds entries CF1, CF2, CF3, ..., CF6, each with a child pointer child1 ... child6. A non-leaf node holds entries CF1 ... CF5 with child pointers child1 ... child5. Leaf nodes hold CF entries, e.g. CF1, CF2, ..., CF6, and are chained together by prev/next pointers.)
C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Jaccard co-efficient may lead to wrong clustering result:
within C1: from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
across C1 and C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
Jaccard co-efficient-based similarity function:
\[ Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} \]
Ex. Let T_1 = {a, b, c}, T_2 = {c, d, e}:
\[ Sim(T_1, T_2) = \frac{|\{c\}|}{|\{a, b, c, d, e\}|} = \frac{1}{5} = 0.2 \]
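The similarity function above, sketched with Python sets (reproducing the slide's two example values):

```python
# Jaccard-coefficient similarity between two transactions (as sets).
def sim_jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

print(sim_jaccard({'a', 'b', 'c'}, {'c', 'd', 'e'}))  # 0.2
print(sim_jaccard({'a', 'b', 'c'}, {'a', 'b', 'd'}))  # 0.5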
C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a,
c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
A two-phase algorithm
1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(p): {q belongs to D | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
p belongs to N_Eps(q)
core point condition: |N_Eps(q)| ≥ MinPts
(Figure: p directly density-reachable from q; MinPts = 5, Eps = 1 cm.)
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n, p_1 = q, p_n = p such that p_{i+1} is directly density-reachable from p_i
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(Figure: chain q → p_1 → p; and points p, q both density-reachable from o.)
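A hedged DBSCAN sketch on 1-D points, directly using the definitions above: grow a cluster from each unvisited core point by adding density-reachable points; points reachable from no core point stay noise. The data and label conventions (cluster ids 0, 1, ..., noise = -1) are my own:

```python
def dbscan(points, eps, min_pts):
    labels = [-1] * len(points)            # -1 = noise / unvisited

    def neighbors(i):                      # N_Eps(points[i]), including i itself
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point; may later be claimed
            continue
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:         # j is core: keep expanding
                queue.extend(nj)
            # otherwise j is a border point: claimed but not expanded
        cluster += 1
    return labels

print(dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0], eps=0.3, min_pts=2))
# [0, 0, 0, 1, 1, 1, -1]
```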
(Figure: core points of a cluster, Eps = 1 cm, MinPts = 5.)
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
DBSCAN: Sensitive to Parameters
Index-based:
k = number of dimensions
N = 20
p = 75%
M = N(1 − p) = 5
Complexity: O(kN²)
Core Distance and Reachability Distance
(Figure: the core-distance of object o, and the reachability-distances of p_1 and p_2 from o; MinPts = 5.)
(Figure: OPTICS reachability plot, reachability-distance (with undefined values) against the cluster order of the objects.)
Influence function (Gaussian):
\[ f_{Gaussian}(x, y) = e^{-\frac{d(x,y)^2}{2\sigma^2}} \]
Density function:
\[ f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
Gradient:
\[ \nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x)\, e^{-\frac{d(x,x_i)^2}{2\sigma^2}} \]
Major features
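A 1-D sketch of the three Gaussian functions above (σ and the toy data are assumptions for illustration):

```python
import math

# Gaussian influence of point y on point x.
def influence(x, y, sigma):
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

# Density at x: sum of influences of all data points.
def density(x, data, sigma):
    return sum(influence(x, xi, sigma) for xi in data)

# Gradient of the density at x (1-D case).
def gradient(x, data, sigma):
    return sum((xi - x) * influence(x, xi, sigma) for xi in data)

data = [0.0, 0.0, 4.0]
print(density(0.0, data, sigma=1.0))   # 2 + e^-8 ≈ 2.000335
print(gradient(0.0, data, sigma=1.0))  # 4 * e^-8 ≈ 0.001342
```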
Density Attractor
Comments on STING
Advantages:
Disadvantages:
Wavelet Transform
Input parameters
# of grid cells for each dimension
the wavelet, and the # of applications of wavelet transform
Why is wavelet transformation useful for clustering?
Use hat-shape filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries
Effective removal of outliers, multi-resolution, cost effective
Major features:
Complexity O(N)
Detect arbitrary shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
Both grid-based and density-based
Quantization
& Transformation
Model-Based Clustering
EM (Expectation Maximization)
An extension to k-means
General idea
Maximization step:
Estimation of model parameters
Conceptual Clustering
Conceptual clustering
A form of clustering in machine learning
Produces a classification scheme for a set of unlabeled objects
Finds a characteristic description for each concept (class)
COBWEB (Fisher87)
A popular and simple method of incremental conceptual learning
Creates a hierarchical clustering in the form of a classification tree
Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Clustering Method
A classification tree
Limitations of COBWEB
CLASSIT
Popular in industry
SOMs, also called topological ordered maps or Kohonen Self-Organizing Feature Maps (KSOMs)
The unit whose weight vector is closest to the current object wins
SOMs are believed to resemble processing that can occur in the brain
The result of SOM clustering of 12088 Web articles
The picture on the right: drilling down on the keyword "mining"
Based on the websom.hut.fi Web page
Major challenges:
Methods
Identify clusters
(Figure: CLIQUE example with τ = 3. One grid plots Salary (×$10,000, 0-7) against age (20-60); another plots Vacation (weeks, 0-7) against age (20-60). The dense units in the two subspaces intersect over roughly 30 ≤ age ≤ 50, yielding a candidate cluster in the combined Salary-Vacation-age space.)
Strength
automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces
insensitive to the order of records in input and does not presume some canonical data distribution
scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
Weakness
The accuracy of the clustering result may be degraded at the expense of simplicity of the method
Typical methods: CLIQUE, ProClus
Why p-Clustering?
A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0, where
\[ d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}, \quad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \quad d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij} \]
p-Clustering: Clustering by Pattern Similarity
pScore of a 2×2 submatrix:
\[ pScore\!\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})| \]
Properties of δ-pCluster
Downward closure
For scaling patterns, one can observe, taking logarithm on d_{xa}/d_{ya} = d_{xb}/d_{yb}
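The pScore above is a one-liner; a quick sketch showing that a perfect shifting pattern (rows differing by a constant) scores 0:

```python
# pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]].
def pscore(d_xa, d_xb, d_ya, d_yb):
    return abs((d_xa - d_xb) - (d_ya - d_yb))

print(pscore(3, 5, 13, 15))  # 0: rows (3,5) and (13,15) differ by a constant 10
print(pscore(3, 5, 13, 16))  # 1
```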
User-specified constraints
Outlier Discovery: Statistical Approaches
Density-Based Local Outlier Detection
Distance-based outlier detection is based on the global distance distribution
It has difficulty identifying outliers if the data is not uniformly distributed
Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, plus 2 outlier points o1, o2
Distance-based methods cannot identify o2 as an outlier
Need the concept of a local outlier
Summary
References (1)
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering Points to Identify the Clustering Structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02.
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD'00.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD'95.
References (2)
S. Guha, R. Rastogi, and K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
E. Knorr and R. Ng. Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
R. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. VLDB'94.
References (3)
E. Schikuta. Grid Clustering: An Efficient Hierarchical Clustering Method for Very Large Data Sets. Proc. 1996 Int. Conf. on Pattern Recognition.
W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB'97.