Clustering: K-Means, Agglomerative, DBSCAN: Tan, Steinbach, Kumar
Clustering for utility: summarization and compression of the data.
[Figure: the same set of points grouped into two, four, or six clusters]
Types of Clusterings

A clustering is a set of clusters. An important distinction is between hierarchical and partitional sets of clusters.

Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
[Figure: original points and a partitional clustering of them]
[Figure: a hierarchical clustering of points p1–p4 and its dendrogram]
[Figure: 4 center-based clusters; 6 density-based clusters]
Clustering Algorithms
K-center (the previous lecture). Think: how?
This lecture: K-means, hierarchical clustering, and DBSCAN.
K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid, and each point is assigned to the cluster with the closest centroid.
The centroid of a point set S is the point p whose x- (respectively y-) coordinate is the mean of the x- (y-) coordinates of the points in S.
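The alternating assign-and-recompute procedure (Lloyd's algorithm) can be sketched in a few lines of plain Python. The function name, the sample points, and the fixed iteration cap are illustrative choices, not part of the original slides:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal sketch of Lloyd's algorithm for 2-D points.

    points : list of (x, y) tuples
    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                        # keep the old centroid if empty
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return centroids, labels

# Two well-separated blobs; k = 2 should recover them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, labs = kmeans(pts, 2)
```

In practice the loop stops once the assignment no longer changes; the fixed `iters` cap keeps the sketch short.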
Example with k = 3

[Figure: K-means on a 2-D point set, shown over iterations 1–6; the centroids move at each step until the assignment stabilizes]
[Figure: a second run from different initial centroids, shown over iterations 2–5]
Choosing the initial centroids carefully (the k-means++ seeding) lets k-means achieve an approximation ratio of O(lg k) in expectation. Namely, if the optimal k clusters have SSE s, then the algorithm returns clusters whose expected SSE is at most O(s lg k).
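The SSE (sum of squared errors) used in this guarantee is the objective k-means minimizes: the sum, over all points, of the squared distance from each point to its assigned centroid. A minimal sketch, with an illustrative toy dataset:

```python
import math

def sse(points, labels, centroids):
    """Sum of squared errors: squared distance from each point to its
    assigned centroid (the objective that k-means tries to minimize)."""
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

# Toy check: four points in a single cluster centered at the origin.
pts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
print(sse(pts, [0, 0, 0, 0], [(0.0, 0.0)]))  # each point contributes 1, total 4.0
```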
[Figures: original points compared with K-means output, for two 3-cluster runs and one 2-cluster run]
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
[Figure: six points, their nested clusters, and the corresponding dendrogram with merge heights between 0.05 and 0.2]
Starting Situation
Start with clusters of individual points and a proximity matrix.

[Figure: points p1–p5 and the proximity matrix over p1, …, p5]
Intermediate Situation
After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and the proximity matrix over C1, …, C5]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5 with C2 and C5 highlighted, and the proximity matrix]
After Merging
The question is: how do we update the proximity matrix?

[Figure: the proximity matrix after merging, with the row and column for C2 ∪ C5 marked "?"]
How do we define the similarity between two clusters?

[Figure: candidate definitions of inter-cluster similarity (e.g., MIN/single link and MAX/complete link) illustrated on points p1–p5 and the proximity matrix]
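Under the MIN (single-link) rule, the distance between two clusters is the distance between their closest pair of points, so after merging Ci and Cj the updated matrix entry is d(Ci ∪ Cj, Ck) = min(d(Ci, Ck), d(Cj, Ck)) (MAX uses max instead). A naive pure-Python sketch of single-link agglomerative clustering, with illustrative example points (for simplicity it recomputes distances rather than updating the matrix in place):

```python
import math

def single_link(points, target_k):
    """Naive agglomerative clustering with the MIN (single-link) rule:
    repeatedly merge the two clusters whose closest pair of points is
    nearest, until target_k clusters remain."""
    clusters = [[p] for p in points]           # start: one cluster per point

    def dist(a, b):
        # MIN rule: distance between the closest pair across the clusters.
        # (An in-place matrix update would instead take the min of the
        # merged rows; MAX/complete link would take the max.)
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i ...
        del clusters[j]                          # ... and drop j
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = single_link(pts, 3)
```

The brute-force pair search makes this O(n^3) overall; real implementations maintain the proximity matrix and update only the merged row.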
[Figure: MIN (single-link) nested clusters and the corresponding dendrogram]
Strength of MIN

Single link can handle clusters of non-elliptical shapes.

[Figure: original points and the two clusters found]
Limitations of MIN

Single link is sensitive to noise and outliers.

[Figure: original points and the two clusters found]
[Figure: MAX (complete-link) nested clusters and the corresponding dendrogram]
Strength of MAX

Complete link is less susceptible to noise and outliers.

[Figure: original points and the two clusters found]
Limitations of MAX

Tends to break large clusters. Biased towards globular clusters.

[Figure: original points and the two clusters found]
[Figure: nested clusters and the corresponding dendrogram]
Limitations
Biased towards ball-like clusters
DBSCAN Algorithm
1. Eliminate noise points.
2. Put an edge between each pair of core points within distance Eps of each other.
3. Make each group of connected core points into a separate cluster.
4. Assign each border point arbitrarily to one of the clusters containing its associated core points.
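The steps above can be sketched in pure Python. The parameter names `eps` and `min_pts`, the breadth-first component search, and the example points are illustrative choices; the noise label -1 follows common convention:

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Sketch of the DBSCAN steps above: find core points, connect core
    points within eps, grow a cluster from each connected group of cores,
    and attach border points; everything left over is noise (label -1)."""
    n = len(points)
    neighbors = [[j for j in range(n)
                  if j != i and math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # A core point has at least min_pts points within eps (counting itself).
    core = [len(neighbors[i]) + 1 >= min_pts for i in range(n)]

    labels = [-1] * n                           # -1 = noise until proven otherwise
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # BFS over the connected component of core points containing i.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if labels[v] == -1:
                    labels[v] = cluster_id      # core or border point joins cluster
                    if core[v]:
                        queue.append(v)         # only core points expand the cluster
        cluster_id += 1
    return labels

# Two dense blobs plus one isolated noise point.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labs = dbscan(pts, eps=1.5, min_pts=3)
```

Border points reached from two different components keep whichever cluster label they receive first, matching step 4's arbitrary assignment.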
[Figure: original points and the clusters found by DBSCAN]
Drawback of DBSCAN