Professional Documents
Culture Documents
Concepts and Techniques: - Chapter 7
Concepts and Techniques: - Chapter 7
— Chapter 7 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
If q = 2, d is Euclidean distance:
d (i, j) (| x x | 2 | x x |2 ... | x x |2 )
i1 j1 i2 j2 ip jp
Properties
d(i,j) 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) d(i,k) + d(k,j)
Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
disimilarity measures
f is ordinal or ratio-scaled
compute ranks rif and
and treat zif as interval-scaled zif r 1
if
f M 1
February 16, 2022 16
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
February 16, 2022 17
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSACN, OPTICS, DenClue
Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj)
= dis(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) =
dis(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
February 16, 2022 20
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the “middle” of a cluster iN 1(t )
Cm N
ip
N N (t t ) 2
Dm i 1 i 1 ip iq
N ( N 1)
Example
10 10
10
9 9
9
8 8
8
7 7
7
6 6
6
5 5
5
4 4
4
Assign 3 Update 3
3
2 each
2 the 2
1
objects
1
0
cluster 1
0
0
0 1 2 3 4 5 6 7 8 9 10 to most
0 1 2 3 4 5 6 7 8 9 10 means 0 1 2 3 4 5 6 7 8 9 10
similar
center reassign reassign
10 10
K=2 9 9
8 8
Arbitrarily choose K 7 7
6 6
object as initial 5 5
2
the 3
1 cluster 1
0
0 1 2 3 4 5 6 7 8 9 10
means 0
0 1 2 3 4 5 6 7 8 9 10
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
10 10
10
9 9
9
8 8
8
7 7
7
6 6
6
5 5
5
4 4
4
3 3
3
2 2
2
1 1
1
0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
Outlier
Border
Eps = 1cm
Core MinPts = 5
Customer segmentation
Medical analysis
Drawbacks
most tests are for single attribute
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm