Clusters
Everything Data
CompSci 216 Spring 2018
Geo-tags of tweets
[Figure: scatter plot of tweet geo-tags, longitude on the x-axis and latitude on the y-axis]
Trending topics
• How would you compute trending topics?
– Most frequent hashtags
– Frequent keywords or phrases (which are not
stopwords)
– …
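As a rough sketch of the first idea, hashtags can be counted directly; the tweet texts below are made up, and a real pipeline would also handle retweets, casing, and time windows.

    from collections import Counter

    def trending_hashtags(tweets, top_k=10):
        """Return the top_k most frequent hashtags across a list of tweet texts."""
        counts = Counter()
        for text in tweets:
            for token in text.split():
                token = token.strip('.,!?:;').lower()
                if token.startswith('#') and len(token) > 1:
                    counts[token] += 1
        return counts.most_common(top_k)

    # e.g. trending_hashtags(["Go #Duke! #MarchMadness", "#duke wins again"], top_k=2)
    #      -> [('#duke', 2), ('#marchmadness', 1)]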
[Figure: scatter plot of tweet geo-tags with a color scale from 0 to 10]
Customer segmentation example (Esri Tapestry Segmentation): http://www.esriro.ro/library/fliers/pdfs/tapestry_segmentation.pdf#page=2
Other Examples
• Image segmentation
• Document clustering
• De-duplication …
Outline
• K-means Clustering
• Distance Metrics
• K-medoids
• Hierarchical Clustering
[Figure: scatter plot of tweet geo-tags with a color scale from 0 to 10]
K-means
• Partition a set of points X = {x1, x2, …, xn} into k partitions C = {C1, C2, …, Ck} that minimizes

RSS(C) = \sum_{i=1}^{k} \sum_{j=1}^{n} a_{ij} \, \lVert x_j - \mu_i \rVert^2

where the assignment function a_{ij} is 1 if x_j is assigned to cluster C_i (and 0 otherwise), and \mu_i is the representative of cluster C_i.
Chicken-and-Egg problem
• How do we minimize RSS(C)?
– If we knew the representatives µi, assigning points would be easy; if we knew the assignments, computing the representatives would be easy. We know neither, so we alternate between the two.
K-means Algorithm
• Idea: Alternate these two steps.
– Pick some initialization for cluster
representatives µ0.
– E-step:
Assign points to the closest representative in µi.
– M-step:
Recompute the representatives µi+1 as means of
the current clusters.
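A minimal numpy sketch of these two alternating steps, assuming random initialization, Euclidean distance, and a fixed iteration cap:

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialization: k distinct points from X as the starting representatives.
        mu = X[rng.choice(len(X), size=k, replace=False)]
        assign = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            # E-step: assign every point to its closest representative.
            dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # M-step: recompute each representative as the mean of its cluster
            # (an empty cluster keeps its previous representative).
            new_mu = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else mu[i]
                               for i in range(k)])
            if np.allclose(new_mu, mu):   # stop once representatives no longer change
                break
            mu = new_mu
        return mu, assign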
K-means: Initialization
[Figure "geo_data_head10000_run_0": tweet geo-tags with the initial cluster representatives marked by +]
K-means: Iteration 1
[Figure "geo_data_head10000_run_1": cluster assignments and representatives (+) after iteration 1]
K-means: Iteration 2
[Figure "geo_data_head10000_run_2": cluster assignments and representatives (+) after iteration 2]
K-means: Iteration 10
[Figure "geo_data_head10000_kmeans_10": cluster assignments and representatives (+) after iteration 10]
Initialization
• Many heuristics
– Random: K random points in the dataset
– Farthest First:
• Pick the first center at random
• Pick the i-th center as the point "farthest away" from the (i-1) centers chosen so far, i.e., the point whose distance to its nearest chosen center is largest
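A sketch of the farthest-first heuristic, assuming Euclidean distance and a numpy array X of points:

    import numpy as np

    def farthest_first_init(X, k, seed=0):
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]            # first center: a random point
        for _ in range(k - 1):
            # distance from every point to its nearest already-chosen center
            d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
            centers.append(X[np.argmax(d)])            # next center: the point farthest away
        return np.array(centers)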
Stopping
• Alternate E and M steps until the cluster representatives no longer change.
• Guaranteed to converge
– to a local optimum …
– … but not necessarily to a global optimum
[Figures: the same points clustered before and after rescaling the x-axis to 0.5 * x, giving a different k-means solution]
Limitations of k-means
• Scaling/changing the feature space can
change the solution.
• Clusters points into spherical regions, so elongated or irregularly shaped clusters are captured poorly.
Distance Metrics
• A function d that maps a pair of points x, y to a real number (often normalized to lie between 0 and 1)
Euclidean Distance
\lVert x - y \rVert_2 = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
• Straight line distance between two points
x = [x1, x2, …, xd] and y = [y1, y2, …, yd]
• L2 = ?
Lp Distance

L_p(x, y) = \left( \sum_{i=1}^{d} \lvert x_i - y_i \rvert^p \right)^{1/p}

• L2 = Euclidean
• L1 = city block / Manhattan
• L∞ = the largest coordinate difference, \max_i \lvert x_i - y_i \rvert
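A small numpy illustration of the three special cases (the vectors x and y are made up):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    l1   = np.sum(np.abs(x - y))          # city block / Manhattan: 5.0
    l2   = np.sqrt(np.sum((x - y) ** 2))  # Euclidean: sqrt(13), about 3.61
    linf = np.max(np.abs(x - y))          # largest coordinate difference: 3.0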
Vector-based Similarities
• Cosine Similarity (inverse of a distance)
\cos(x, y) = \frac{\sum_{i=1}^{d} x_i y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \, \sqrt{\sum_{i=1}^{d} y_i^2}} = \frac{x \cdot y}{\lVert x \rVert_2 \, \lVert y \rVert_2}

(the dot product divided by the product of the L2 norms)
Vector-based Similarities
• Pearson’s Correlation Coefficient
– cosine similarity on mean normalized vectors
r(x, y) = \frac{\sum_{i=1}^{d} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{d} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{d} (y_i - \bar{y})^2}}
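Both similarities are a few lines of numpy; a sketch on made-up vectors:

    import numpy as np

    def cosine_similarity(x, y):
        # dot product divided by the product of the L2 norms
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def pearson_correlation(x, y):
        # cosine similarity after subtracting each vector's mean
        return cosine_similarity(x - x.mean(), y - y.mean())

    x = np.array([2.0, 4.0, 6.0])
    y = np.array([1.0, 2.0, 3.0])
    print(cosine_similarity(x, y))    # ~1.0 (same direction)
    print(pearson_correlation(x, y))  # ~1.0 (perfectly linearly related)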
Set-based Distances
• Let A and B be two sets.
Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}
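A quick illustration with Python sets (the example sets are made up):

    def jaccard(A, B):
        # size of the intersection divided by size of the union
        return len(A & B) / len(A | B)

    print(jaccard({"data", "science", "duke"}, {"duke", "data", "basketball"}))  # 2/4 = 0.5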
K-medoids
• Allows a general distance metric d(x,y).
K-medoids
– Pick some initialization for cluster
representatives µ0.
– E-step:
Assign points to the closest representative in µi.
– M-step:
Recompute the representatives µi+1 as the medoid, i.e., the point in the cluster with the minimum total distance to all the other points in it.
Medoid
• m is the medoid of a set of points P if
m = \arg\min_{x \in P} \sum_{y \in P} d(x, y)
Point that minimizes the sum of distances to
all other points in the set.
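A small sketch of this computation for an arbitrary distance function d (a plain double loop over the points; the 1-D example in the comment is made up):

    def medoid(P, d):
        """Return the point in P minimizing the sum of distances to all other points."""
        return min(P, key=lambda x: sum(d(x, y) for y in P))

    # e.g. medoid([1, 2, 2, 3, 100], d=lambda a, b: abs(a - b)) -> 2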
K-medoids summary
• Same algorithm as K-means, but uses medoids
instead of means
Hierarchical Clustering
• Rather than compute a single clustering,
compute a family of clusterings.
Agglomerative Clustering
• Initialize each point to its own cluster
• Repeat until only one cluster is left:
– Pick the two clusters that are closest
– Merge them into one cluster
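In practice the whole merge sequence can be computed with scipy's hierarchical clustering routines; a minimal sketch on made-up 2-D data (linkage criteria such as 'single' are defined on the later slides):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.random.default_rng(0).normal(size=(20, 2))  # 20 made-up 2-D points

    # Z encodes the full sequence of merges (the dendrogram); 'single' = single linkage.
    Z = linkage(X, method='single')

    # Cut the tree to recover a flat clustering, e.g. 3 clusters.
    labels = fcluster(Z, t=3, criterion='maxclust')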
Example
Step 1: {1} {2} {3} {4} {5} {6} {7}
Step 2: {1} {2, 3} {4} {5} {6} {7}
Step 3: {1, 7} {2, 3} {4} {5} {6}
Step 4: {1, 7} {2, 3} {4, 5} {6}
Step 5: {1, 7} {2, 3, 6} {4, 5}
Step 6: {1, 7} {2, 3, 4, 5, 6}
Step 7: {1, 2, 3, 4, 5, 6, 7}
Example based on Ryan Tibshirani’s slides
Dendrogram
[Figure: dendrogram for the example. Each node is a cluster; the leaves are the individual points in the dataset; the height of a node is proportional to the distance between its children clusters.]
Single Linkage

d_{single}(C_i, C_j) = \min_{x \in C_i, \, y \in C_j} d(x, y)

Complete Linkage

d_{complete}(C_i, C_j) = \max_{x \in C_i, \, y \in C_j} d(x, y)
[Figure: worked example with points plotted on a 0 to 7 axis]
Single Linkage
[Figure: which clusters merge next under single linkage (closest pair of points across clusters)]
Complete Linkage
[Figure: which clusters merge next under complete linkage (farthest pair of points across clusters)]
In both cases …
[Figure: a second example, points plotted on a 0 to 8 axis]
Single Linkage
[Figure: the clustering of the second example under single linkage]
Complete Linkage
[Figure: the clustering of the second example under complete linkage]
Average Linkage
d_{avg}(C_i, C_j) = \frac{\sum_{x \in C_i, \, y \in C_j} d(x, y)}{|C_i| \cdot |C_j|}
Distance between two clusters is
the average distance between
every pair of points in the
clusters.
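For reference, a direct Python rendering of the three cluster-distance functions (Ci and Cj are any collections of points, d is any pairwise distance function):

    def single_linkage(Ci, Cj, d):
        return min(d(x, y) for x in Ci for y in Cj)

    def complete_linkage(Ci, Cj, d):
        return max(d(x, y) for x in Ci for y in Cj)

    def average_linkage(Ci, Cj, d):
        return sum(d(x, y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))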