Professional Documents
Culture Documents
DS4 Nonpartitional Clustering
DS4 Nonpartitional Clustering
6 5
0.2
4
3 4
0.15 2
5
2
0.1
1
0.05
3 1
0
1 3 2 5 4 6
– Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point (or there are k
clusters)
p2
p3
p4
p5
.
.
. Proximity Matrix
C2
C3
C3
C4
C4
C5
Proximity Matrix
C1
C2 C5
We want to merge the two closest clusters (C2 and C5) and update
the proximity matrix. C1 C2 C3 C4 C5
C1
C2
C3
C3
C4
C4
C5
Proximity Matrix
C1
C2 C5
C1 ?
C2 U C5 ? ? ? ?
C3
C3 ?
C4
C4 ?
Proximity Matrix
C1
C2 U C5
p1 p2 p3 p4 p5 ...
p1
Similarity?
p2
p3
p4
p5
MIN (nearest neighbor)
.
MAX (farthest neighbor)
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
(variances)
Department of IEM, NYCU, Hsinchu, Taiwan
How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
MIN
.
MAX
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
MIN
.
MAX
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
MIN
.
MAX
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
MIN
.
MAX
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00 1 2 3 4 5
5
1
3
0.2
5
2 1 0.15
2 3 6 0.1
0.05
4
4 0
3 6 2 5 4 1
I1 I2 I3 I4 I5
I1 1.00 0.90 0.10 0.65 0.20
I2 0.90 1.00 0.70 0.60 0.50
I3 0.10 0.70 1.00 0.40 0.30
I4 0.65 0.60 0.40 1.00 0.80
I5 0.20 0.50 0.30 0.80 1.00 1 2 3 4 5
4 1
2 5 0.4
0.35
5
2 0.3
0.25
3 6 0.2
3 0.15
1 0.1
4 0.05
0
3 6 4 1 2 5
p jClusterj
proximity(Clusteri , Clusterj )
|Clusteri ||Clusterj |
5 4 1
2 0.25
5 0.2
2
0.15
3 6 0.1
1 0.05
4 0
3 3 6 4 1 2 5
Strengths
– Less susceptible to noise and outliers
Limitations
– Biased towards globular clusters
5
1 4 1
3
2 5
5 5
2 1 2
MIN MAX
2 3 6 3 6
3
1
4 4
4
5
1 5 4 1
2 2
5 Ward’s Method 5
2 2
3 6 Group Average 3 6
3
4 1 1
4 4
3
• Resistant to Noise
• Can handle clusters of different shapes and sizes
(MinPts=4, Eps=9.75).
Original Points
• Varying densities
• High-dimensional data
(MinPts=4, Eps=9.92)
Department of IEM, NYCU, Hsinchu, Taiwan
DBSCAN: Determining EPS and MinPts
BSS i
Ci (m mi )2
– Where |Ci| is the size of cluster i
Department of IEM, NYCU, Hsinchu, Taiwan
Internal Measures: Cohesion and Separation
Example: SSE
– BSS + WSS = constant
m
1 m1 2 3 4 m2 5
A proximity graph based approach can also be used for cohesion and
separation.
– Cluster cohesion is the sum of the weight of all links within a cluster.
– Cluster separation is the sum of the weights between nodes in the cluster and
nodes outside the cluster.
cohesion separation
Silhouette Coefficient combine ideas of both cohesion and separation, but for
individual points, as well as clusters and clusterings
For an individual point, i
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to points in another cluster)
– The silhouette coefficient for a point is then given by
b
– Typically between 0 and 1. a
– The closer to 1 the better.
6 9
8
4
7
2 6
SSE
5
0
4
-2 3
2
-4
1
-6 0
2 5 10 15 20 25 30
5 10 15
K
Department of IEM, NYCU, Hsinchu, Taiwan
Internal Measures: SSE (failure)
1
2 6
3
4
0.9 0.9
0.8 0.8
0.7 0.7
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
1 1
0.9 0.9
0.5 0.5
y
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
Department of IEM, NYCU, Hsinchu, Taiwan
External Measures of Cluster Validity: Entropy and Purity
Two matrices
– Proximity Matrix
– “Incidence” Matrix
One row and one column for each data point
An entry is 1 if the associated pair of points belong to the same cluster
An entry is 0 if the associated pair of points belongs to different clusters
1 1
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
y
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x