PMBD 04 Clustering
Big Data
Clustering
• Clustering
• Distance measures
• Hierarchical clustering
• k-means
• CURE
Clustering
Motivation
Source: J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014
Clustering problem
Example of clustering
What is similarity?
Clustering is a hard problem
d(A, B) ≥ 0
d(A, B) = d(B, A)
• $L_2$-norm
$$d(x, y) = \|x - y\|_2 = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
• $L_p$-norm
$$d(x, y) = \|x - y\|_p = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
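As a minimal sketch of the two norms above, the function below computes the $L_p$ distance, with the $L_2$ (Euclidean) case as the default. The name `lp_distance` is hypothetical and the points are assumed to be equal-length sequences of numbers.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length numeric sequences (p >= 1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    # Sum the p-th powers of the coordinate differences, then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


euclidean = lp_distance([0, 0], [3, 4])        # p = 2
manhattan = lp_distance([0, 0], [3, 4], p=1)   # p = 1
print(euclidean, manhattan)
```

With p = 1 the same function gives the Manhattan distance, so one implementation covers the whole family.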
$$d(C_1, C_2) = 1 - \mathrm{SIM}(C_1, C_2) = 1 - \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}$$
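The set-based distance above (one minus the ratio of intersection size to union size) can be sketched as follows. The name `jaccard_distance` and the convention that two empty sets have distance 0 are assumptions made for this illustration.

```python
def jaccard_distance(c1, c2):
    """1 - |C1 ∩ C2| / |C1 ∪ C2| for two collections treated as sets."""
    c1, c2 = set(c1), set(c2)
    union = c1 | c2
    if not union:
        # Assumed convention: two empty sets are considered identical.
        return 0.0
    return 1.0 - len(c1 & c2) / len(union)


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # intersection of size 2, union of size 4
```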
Distance Measures
Examples
• Cosine distance
• makes sense in Euclidean spaces or discrete versions
of Euclidean spaces, such as spaces where points are
vectors with integer components or boolean (0 or 1)
components
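$$d(x, y) = \frac{x \cdot y}{\|x\|_2 \, \|y\|_2} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$$
• this ratio is the cosine of the angle between x and y; Leskovec et al. take the cosine distance to be that angle, i.e. the arccosine of the ratio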
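A minimal sketch of the cosine computation: the first function evaluates the ratio of the dot product to the product of the L2 norms, and the second converts it to an angle in degrees, which is how Leskovec et al. define the cosine distance. Both function names are hypothetical, and zero vectors are rejected because the angle is undefined for them.

```python
import math


def cosine_of_angle(x, y):
    """Dot product of x and y divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance_degrees(x, y):
    """Angle between x and y in degrees (0 to 180)."""
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    c = max(-1.0, min(1.0, cosine_of_angle(x, y)))
    return math.degrees(math.acos(c))


print(cosine_distance_degrees([1, 0], [0, 1]))
```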
• Edit distance
• this distance is used when points are strings
1. delete b
2. insert f after c
3. insert g after e
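The three steps above use only deletions and insertions. Under that assumption, the edit distance equals the sum of the string lengths minus twice the length of their longest common subsequence (LCS). The sketch below computes it that way; the function name is hypothetical, and the strings "abcde" and "acfdeg" are assumed to be the ones behind the steps above, since they match the corresponding example in Leskovec et al.

```python
def edit_distance(x, y):
    """Insert/delete edit distance: len(x) + len(y) - 2 * LCS(x, y)."""
    m, n = len(x), len(y)
    # lcs[i][j] = length of the LCS of x[:i] and y[:j]
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return m + n - 2 * lcs[m][n]


print(edit_distance("abcde", "acfdeg"))  # one deletion plus two insertions
```

Variants that also allow substitutions (such as the Levenshtein distance) would give different values; this sketch deliberately does not include them.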
• Hamming distance
• the Hamming distance between two vectors is the
number of components in which they differ
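A minimal sketch of the definition above; the name `hamming_distance` is hypothetical, and the inputs are assumed to be equal-length sequences (strings, lists or tuples).

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("10101", "11110"))  # differs in positions 2, 4 and 5
```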
Cluster strategies
• Hierarchical or agglomerative
• start each point in its own cluster
• Point assignment
• points are considered in some order
• The algorithm:
• Questions:
• continue until there is only one cluster and then return the
tree representing the association of clusters
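The procedure above (start from single-point clusters, merge repeatedly, stop at one cluster and return the tree) might look like the following sketch. It assumes Euclidean points given as tuples, a non-empty input, and a merge rule that joins the two clusters whose centroids are closest; the merge rule and all names are assumptions for illustration. The naive search over all pairs makes this cubic in the number of points.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def agglomerate(points):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the merge tree as nested pairs; leaves are the original points.
    """
    # Each cluster is (tree, members); every point starts in its own cluster.
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(
                centroid(clusters[ij[0]][1]), centroid(clusters[ij[1]][1])
            ),
        )
        (tree_i, pts_i), (tree_j, pts_j) = clusters[i], clusters[j]
        merged = ((tree_i, tree_j), pts_i + pts_j)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


tree = agglomerate([(0, 0), (0, 1), (10, 10), (10, 12)])
print(tree)
```

To stop at a fixed number of clusters instead of building the full tree, the loop condition could compare `len(clusters)` against that number.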
Hierarchical Clustering
Example
Non-Euclidean case
• pick one of the points to represent the cluster, usually close to all
the points in the cluster; we call this point the clustroid
• the point selected for clustroid can also be the point that minimizes:
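A minimal sketch of clustroid selection for an arbitrary distance function. The three criteria offered (sum of distances, maximum distance, sum of squared distances to the other points) are the ones listed in Leskovec et al. and are an assumption here, as are the function name and the `criterion` parameter.

```python
def clustroid(points, distance, criterion="sum"):
    """Return the point of the cluster that minimizes the chosen criterion."""
    scores = {
        "sum": lambda ds: sum(ds),                     # sum of distances
        "max": lambda ds: max(ds),                     # largest distance
        "sum_sq": lambda ds: sum(d * d for d in ds),   # sum of squared distances
    }
    score = scores[criterion]
    # The distance from a point to itself is 0, so including it changes nothing.
    return min(points, key=lambda p: score([distance(p, q) for q in points]))


def hamming(a, b):
    """Hamming distance between equal-length strings, used as the example metric."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


words = ["aaaa", "aaab", "abbb"]
print(clustroid(words, hamming))
```

Because only pairwise distances are used, the same function works for strings, sets or any other points for which a centroid cannot be computed.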
• reduce cost to $O(n^2 \log n)$
k-means
Algorithm
• Algorithm:
1. Initialise all clusters by choosing one point for each cluster, either at random
or so that the chosen points are as far away from each other as possible
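The initialisation step above can be sketched together with the usual assign-then-recompute iteration (often called Lloyd's algorithm). Only the initialisation comes from the text; the iteration, the stopping rule, and all names and parameters (`k_means`, `iterations`, `seed`) are assumptions for illustration. Points are assumed to be tuples in a Euclidean space.

```python
import math
import random


def pick_initial_centroids(points, k, rng):
    """First centroid at random, then repeatedly the point farthest from those chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, clusters) after at most `iterations` rounds (at least 1)."""
    rng = random.Random(seed)
    centroids = pick_initial_centroids(points, k, rng)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(centroids))
```

Picking far-apart starting points tends to avoid placing two initial centroids in the same natural cluster, at the cost of being sensitive to outliers.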
Example
How to select k
• increasing k: in the limit, we will have one cluster for each point
• choose for k the value where the graph changes abruptly, that is, at the
"elbow" of the curve
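One hedged way to locate the abrupt change numerically is to look for the largest second difference of the curve. The sketch below assumes the graph plots some clustering cost (for example, average distance to the centroid) against evenly spaced values of k; the function name and the cost values are made up purely for illustration.

```python
def elbow(ks, costs):
    """Return the k at which the cost curve bends most sharply.

    Assumes ks are evenly spaced and costs[i] is the cost obtained with ks[i].
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # Drop in cost before this k minus the drop after it (second difference).
        bend = (costs[i - 1] - costs[i]) - (costs[i] - costs[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical costs: large gains up to k = 3, small gains afterwards.
ks = [1, 2, 3, 4, 5, 6]
costs = [100, 60, 20, 15, 12, 10]
print(elbow(ks, costs))
```

In practice the curve is often noisy, so inspecting the plot by eye remains a reasonable check on any automatic choice.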
Clustering with k-means
Another practical example
CURE
Motivation
Example