
Processamento e Modelação de Big Data
Clustering

João Oliveira & Adriano Lopes - 2020/2021


Outline

• Clustering

• Distance measures

• Hierarchical clustering

• k-means

• CURE

2
Clustering
Clustering
Motivation

• Sometimes data exhibit structure according to some sort of distance measure

• It is then possible to group a collection of points into “clusters”

Source: J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014
4
Clustering
Clustering problem

• Given a collection of “points”, group them into “clusters” according to some distance measure, such that

• points in the same cluster are “similar”

• points in different clusters are “dissimilar”

• We are interested in situations where

• data is very large

• data is in high-dimensional space

• the space may not be Euclidean

• data may not fit in main memory

5
Clustering
Example of clustering

How to cluster them?

6
Clustering
What is similarity?

• “The quality or state of being similar; likeness; resemblance; as, a similarity of features.” — Webster’s Dictionary

• Similarity may be hard to define, but “we know it when we see it”…

7
Clustering
Clustering is a hard problem

• Even clustering in 2 dimensions can be more difficult than expected

• Clustering is usually done in high-dimensional spaces (10, 100, or even 1 000 dimensions)

• The curse of dimensionality: in high dimensions, almost all pairs of points are roughly equidistant from each other
8
Distance Measures
Distance Measures
Definition

• A function d(A, B) is a distance measure between two points A and B if it satisfies the following:

• the distance is always nonnegative, and only the distance between a point and itself is 0

d(A, B) ≥ 0, with d(A, B) = 0 if and only if A = B

• the distance is symmetric

d(A, B) = d(B, A)

• the distance measure must obey the triangle inequality

d(A, B) + d(B, C) ≥ d(A, C)


10
Distance Measures
Examples

• Euclidean: let x = (x1, x2, …, xn) and y = (y1, y2, …, yn)

• L2-norm

d(x, y) = ∥x − y∥2 = √( Σi=1..n (xi − yi)² )

• Lp-norm

d(x, y) = ∥x − y∥p = ( Σi=1..n |xi − yi|^p )^(1/p)

• Jaccard distance: for dissimilarity between sets

d(C1, C2) = 1 − SIM(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
11
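As a quick illustration of these three measures (not part of the original slides), here is a minimal Python/NumPy sketch; the function names and test values are illustrative only:

```python
import numpy as np

def l2_distance(x, y):
    """Euclidean (L2) distance between two real-valued vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def lp_distance(x, y, p):
    """Lp-norm distance; p = 2 recovers the Euclidean case, p = 1 is the Manhattan distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def jaccard_distance(c1, c2):
    """Jaccard distance: 1 minus the Jaccard similarity of two sets."""
    c1, c2 = set(c1), set(c2)
    return 1.0 - len(c1 & c2) / len(c1 | c2)

print(l2_distance([0, 0], [3, 4]))             # 5.0
print(lp_distance([0, 0], [3, 4], p=1))        # 7.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
```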
Distance Measures
Examples

• Cosine distance

• makes sense in Euclidean spaces or in discrete versions of Euclidean spaces, such as spaces where points are vectors with integer or boolean (0 or 1) components

• points are thought of as directions; we do not distinguish between a vector and a multiple of that vector

• the cosine of the angle between x and y is

cos(x, y) = (x ⋅ y) / (∥x∥2 ∥y∥2) = Σi=1..n xi yi / ( √(Σi=1..n xi²) √(Σi=1..n yi²) )

and the cosine distance d(x, y) is the angle itself, i.e. the arccosine of this value

12
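A small sketch of this measure, assuming NumPy is available; it computes the cosine of the angle and then returns the angle as the distance:

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance as the angle (in radians) between two vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding error

print(cosine_distance([1, 0], [0, 1]))  # ~1.5708 (orthogonal vectors)
print(cosine_distance([1, 2], [2, 4]))  # ~0.0 (a vector and a multiple of it)
```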
Distance Measures
Examples

• Edit distance

• this distance is used when points are strings

• the distance between two strings x = x1x2…xn and y = y1y2…ym is the smallest number of single-character insertions and deletions that will convert x into y

• Example: the edit distance between the strings x = abcde and y = acfdeg is 3

1. delete b

2. insert f after c

3. insert g after e
13
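This insert/delete-only edit distance can be computed with a standard dynamic programming table; the sketch below is not from the slides, but it reproduces the example above:

```python
def edit_distance(x, y):
    """Edit distance with insertions and deletions only (no substitutions);
    equals len(x) + len(y) - 2 * LCS(x, y)."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete the remaining characters of x
    for j in range(n + 1):
        d[0][j] = j          # insert the remaining characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],   # delete x[i-1]
                                  d[i][j - 1])   # insert y[j-1]
    return d[m][n]

print(edit_distance("abcde", "acfdeg"))  # 3, as in the slide example
```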
Distance Measures
Examples

• Hamming distance

• the Hamming distance between two vectors is the number of components in which they differ

• for example, the Hamming distance between the vectors 10011 and 11101 is 3

14
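A tiny NumPy check of that example (illustrative only):

```python
import numpy as np

def hamming_distance(x, y):
    """Number of components in which two equal-length vectors differ."""
    return int(np.sum(np.asarray(x) != np.asarray(y)))

print(hamming_distance([1, 0, 0, 1, 1], [1, 1, 1, 0, 1]))  # 3
```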
Distance Measures
Cluster strategies

• Hierarchical or agglomerative

• start with each point in its own cluster

• clusters are combined based on their “closeness”

• combination stops when further combination leads to undesirable clusters

• Point assignment

• points are considered in some order

• usually there is a short initial phase in which initial clusters are estimated

• each point is assigned to the cluster into which it best fits, typically the “nearest” cluster
15
Hierarchical Clustering
Hierarchical Clustering
Building a dendrogram

• The algorithm:

• at the start, every point is its own cluster

• repeat: combine the two “nearest” clusters into one

• Questions:

• How to represent clusters?

• How to choose which two clusters to merge?

• When to stop combining clusters?

• We will consider two cases

• the distance measure is Euclidean

• the distance measure is non-Euclidean


17
Hierarchical Clustering
Euclidean case

• How to represent clusters?

• a cluster can be represented by its centroid, i.e. the average of the points in the cluster

• How to choose which two clusters to merge?

• use the Euclidean distance between centroids and merge the closest pair

• When to stop combining clusters?

• when we reach the number of clusters we believe exists in the data

• when the best available combination of clusters would produce an inadequate cluster

• or continue until there is only one cluster, and then return the tree (dendrogram) representing the association of clusters
18
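A minimal sketch of this centroid-based agglomerative clustering using SciPy; the synthetic data, the cut thresholds and the library choice are assumptions, not part of the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 2-D points around three centres (illustrative data)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Repeatedly merge the two clusters whose centroids are closest (Euclidean case)
Z = linkage(points, method="centroid")

# Stopping rule 1: we believe there are 3 clusters in the data
labels = fcluster(Z, t=3, criterion="maxclust")

# Stopping rule 2: stop merging once the best merge is "too far" (threshold 2.0 here)
labels_by_distance = fcluster(Z, t=2.0, criterion="distance")
```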
Hierarchical Clustering
Example

Sequence of figures (slides 19–27) illustrating the successive merging of the two nearest clusters.
Hierarchical Clustering
Non-Euclidean case

• How to represent clusters?

• pick one of the points to represent the cluster, usually a point close to all the other points in the cluster; we call this point the clustroid

• the clustroid can be chosen as the point that minimizes:

• the sum of the distances to the other points

• the maximum distance to the other points

• the sum of the squares of the distances to the other points

• How to choose which two clusters to merge?

• use the distance between clustroids

• or other criteria measuring the density of a cluster, based on its radius or diameter

• When to stop combining clusters?

• same as in the Euclidean case


28
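A small sketch of clustroid selection under the first criterion (minimum sum of distances); the helper names and the Jaccard example are illustrative:

```python
def clustroid(points, distance):
    """Clustroid = the point minimizing the sum of distances to the others
    (alternatives: minimize the maximum distance, or the sum of squared distances)."""
    return min(points, key=lambda p: sum(distance(p, q) for q in points))

# Works with any distance measure, e.g. Jaccard distance between sets
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

cluster = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}, {1, 3}]
print(clustroid(cluster, jaccard_distance))  # {1, 2, 3}: smallest total distance to the others
```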
Hierarchical Clustering
Efficiency

• Hierarchical clustering is not very efficient

• at each step we must compute the distance between every pair of clusters, and then merge the closest pair

• cost O(n³) (n = number of points)

• More efficient implementation

• based on priority queues

• reduces the cost to O(n² log n)

• still infeasible for large n

29
k-means
k-means
Algorithm

• One of the most popular algorithms following a point-assignment strategy

• All points are of the quantitative type, so it assumes a Euclidean space/distance

• The number of clusters k is set in advance

• Algorithm:

1. Initialise the clusters by choosing one point for each cluster at random (or, better, points as far away from each other as possible)

2. Assign each point to the closest centroid

3. Compute the new centroid of each cluster

4. Reassign all points to their closest centroid (points can move between clusters)

5. Repeat steps 2–4 until no points are reassigned

31
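A plain NumPy sketch of the algorithm above (random initialisation; empty clusters are not handled, and all names, data and parameters are illustrative):

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Naive k-means: random initialisation, then alternate between assigning
    points to the closest centroid and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # step 1
    labels = None
    for _ in range(max_iter):
        # steps 2 / 4: assign every point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # step 5: nothing moved, stop
        labels = new_labels
        # step 3: recompute the centroid of each cluster
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Four well-separated blobs, clustered with k = 4
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 3], [5, 4])])
labels, centroids = k_means(data, k=4)
```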
k-means
Example

Sequence of figures (slides 32–46) illustrating the algorithm step by step:

• 1st step: choose k = 4

• 2nd step: initialise clusters (choose 4 random points)

• 3rd step: assign points to clusters

• 4th step: compute the centroid of each cluster

• 5th step: reassign points to clusters

• 6th step: recompute centroids

• 7th step: reassign points to clusters

• … recompute centroids and reassign points, stopping when no point changes cluster
k-means
How to select k

• What is the impact of k?

• increasing k: in the limit, we will have one cluster for each point

• decreasing k: the average diameter of the clusters will increase

• If we plot a measure such as the average radius, average diameter or average global error (measured to the centroid) as a function of k, the curve has an L-shape

• choose for k the value at which the curve stops dropping abruptly, i.e. the bend (“elbow”) of the L

47
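One way to visualise this, sketched below with scikit-learn and matplotlib; the library choice, the synthetic data and the range of k are assumptions, not part of the slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Four synthetic blobs; the "true" k is 4
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [5, 0], [0, 5], [5, 5])])

# Average squared error to the centroid as a function of k
ks = range(1, 10)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ / len(data)
          for k in ks]

plt.plot(list(ks), errors, marker="o")
plt.xlabel("k")
plt.ylabel("average squared error to centroid")
plt.show()   # the curve drops sharply up to k = 4 and flattens afterwards
```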
Clustering with k-means
Another practical example
CURE
CURE
Motivation

• There are examples where k-means will fail

• traditional algorithms can wrongly split large clusters in order to minimize the squared error

• Alternative: CURE (Clustering Using REpresentatives)

• more robust to outliers

• better for non-spherical shapes

• clusters do not have to be normally distributed

• clusters can have strange bends, S-shapes, or even rings

• Idea: represent each cluster by a set of representative points


50
CURE
Algorithm

1. Take a random sample of the data, small enough to fit in main memory

2. Cluster the sample data (using a hierarchical method)

3. Select a small set of points from each cluster to be its representatives (choose points as far from each other as possible)

4. Move each representative a small fixed fraction of the distance towards the centroid of its cluster

5. Merge two clusters if they have a pair of representatives, one from each, that are close enough

6. Repeat step 5 until no more clusters can be merged

7. Use the representatives to label the data on disk: each point is assigned to the cluster with the closest representative
51
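A sketch of steps 3-5, assuming Euclidean data in NumPy; the number of representatives, the shrink fraction and the merge threshold are illustrative choices, not values prescribed by the slides:

```python
import numpy as np

def cure_representatives(cluster_points, n_rep=4, shrink=0.2):
    """Steps 3-4: pick n_rep well-scattered points (farthest-first),
    then move each one a fraction `shrink` of the way towards the centroid."""
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(pts)):
        # next representative = point farthest from all representatives chosen so far
        dists = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(dists)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)

def should_merge(reps_a, reps_b, threshold):
    """Step 5: merge two clusters if some pair of their representatives is close enough."""
    pair_dists = np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2)
    return np.min(pair_dists) < threshold
```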
CURE
Example

Sequence of figures (slides 52–58) illustrating the algorithm on an example dataset.
References

J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.

59
