
Processamento e Modelação de Big Data
Clustering

João Oliveira & Adriano Lopes - 2020/2021


Outline

• Clustering

• Distance measures

• Hierarchical clustering

• k-means

• CURE

2
Clustering
Clustering
Motivation

• Sometimes data exhibit structure according to some sort of distance measure

• It is then possible to group a collection of points into “clusters”

Source: J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014
4
Clustering
Clustering problem

• Given a collection of “points”, group them into “clusters” according to some distance measure, such that

• points in the same cluster are “similar”

• points in different clusters are “dissimilar”

• We are interested in situations where

• data is very large

• data is in high-dimensional space

• the space may not be Euclidean

• data may not fit in main memory

5
Clustering
Example of clustering

How to cluster them?

6
Clustering
What is similarity?

• “The quality or state of being similar; likeness; resemblance; as, a similarity of features.” — Webster’s Dictionary

• Similarity may be hard to define, but “we know it when we see it”…

7
Clustering
Clustering is a hard problem

• Even clustering in 2 dimensions can be more difficult than expected

• Clustering is usually done in high-dimensional spaces (10, 100, or even 1 000 dimensions)

• The curse of dimensionality: in high dimensions, almost all pairs of points are roughly equidistant from each other
8
Distance Measures
Distance Measures
Definition

• A function d(A, B) is a distance measure between two points A and B if it satisfies the following:

• the distance is always nonnegative, and only the distance between a point and itself is 0

d(A, B) ≥ 0, with d(A, B) = 0 if and only if A = B

• the distance is symmetric

d(A, B) = d(B, A)

• the distance measure must obey the triangle inequality

d(A, B) + d(B, C) ≥ d(A, C)


10
Distance Measures
Examples

• Euclidean: let x = (x1, x2, …, xn) and y = (y1, y2, …, yn)

• L2-norm

d(x, y) = ∥x − y∥2 = √( Σi=1..n (xi − yi)² )

• Lp-norm

d(x, y) = ∥x − y∥p = ( Σi=1..n |xi − yi|^p )^(1/p)

• Jaccard distance: for dissimilarity between sets

d(C1, C2) = 1 − SIM(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
11
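As a quick illustration of these three measures (not part of the original slides), here is a minimal Python/NumPy sketch; the function names and test values are illustrative only:

```python
import numpy as np

def l2_distance(x, y):
    """Euclidean (L2) distance between two real-valued vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def lp_distance(x, y, p):
    """Lp-norm distance; p = 2 recovers the Euclidean case, p = 1 is the Manhattan distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def jaccard_distance(c1, c2):
    """Jaccard distance: 1 minus the Jaccard similarity of two sets."""
    c1, c2 = set(c1), set(c2)
    return 1.0 - len(c1 & c2) / len(c1 | c2)

print(l2_distance([0, 0], [3, 4]))             # 5.0
print(lp_distance([0, 0], [3, 4], p=1))        # 7.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
```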
Distance Measures
Examples

• Cosine distance

• makes sense in Euclidean spaces or in discrete versions of Euclidean spaces, such as spaces where points are vectors with integer or boolean (0 or 1) components

• points are thought of as directions; we do not distinguish between a vector and a multiple of that vector

• the cosine of the angle between x and y is

cos(x, y) = (x ⋅ y) / (∥x∥2 ∥y∥2) = Σi=1..n xi yi / ( √(Σi=1..n xi²) √(Σi=1..n yi²) )

and the cosine distance d(x, y) is the angle itself, i.e. the arccosine of this value

12
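A small sketch of this measure, assuming NumPy is available; it computes the cosine of the angle and then returns the angle as the distance:

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance as the angle (in radians) between two vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding error

print(cosine_distance([1, 0], [0, 1]))  # ~1.5708 (orthogonal vectors)
print(cosine_distance([1, 2], [2, 4]))  # ~0.0 (a vector and a multiple of it)
```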
Distance Measures
Examples

• Edit distance

• this distance is used when points are strings

• the distance between two strings x = x1x2…xn and y = y1y2…ym is the smallest number of single-character insertions and deletions that will convert x into y

• Example: the edit distance between the strings x = abcde and y = acfdeg is 3

1. delete b

2. insert f after c

3. insert g after e
13
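This insert/delete-only edit distance can be computed with a standard dynamic programming table; the sketch below is not from the slides, but it reproduces the example above:

```python
def edit_distance(x, y):
    """Edit distance with insertions and deletions only (no substitutions);
    equals len(x) + len(y) - 2 * LCS(x, y)."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete the remaining characters of x
    for j in range(n + 1):
        d[0][j] = j          # insert the remaining characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],   # delete x[i-1]
                                  d[i][j - 1])   # insert y[j-1]
    return d[m][n]

print(edit_distance("abcde", "acfdeg"))  # 3, as in the slide example
```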
Distance Measures
Examples

• Hamming distance

• the Hamming distance between two vectors is the number of components in which they differ

• for example, the Hamming distance between the vectors 10011 and 11101 is 3

14
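A tiny NumPy check of that example (illustrative only):

```python
import numpy as np

def hamming_distance(x, y):
    """Number of components in which two equal-length vectors differ."""
    return int(np.sum(np.asarray(x) != np.asarray(y)))

print(hamming_distance([1, 0, 0, 1, 1], [1, 1, 1, 0, 1]))  # 3
```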
Distance Measures
Cluster strategies

• Hierarchical or agglomerative

• start with each point in its own cluster

• clusters are combined based on their “closeness”

• combination stops when further combination leads to undesirable clusters

• Point assignment

• points are considered in some order

• usually there is a short initial phase in which initial clusters are estimated

• each point is assigned to the cluster into which it best fits, typically the “nearest” cluster
15
Hierarchical Clustering
Hierarchical Clustering
Building a dendrogram

• The algorithm:

• at the start, every point is its own cluster

• repeat: combine the two “nearest” clusters into one

• Questions:

• How to represent clusters?

• How to choose which two clusters to merge?

• When to stop combining clusters?

• We will consider two cases

• the distance measure is Euclidean

• the distance measure is non-Euclidean


17
Hierarchical Clustering
Euclidean case

• How to represent clusters?

• a cluster can be represented by its centroid, i.e. the average of the points in the cluster

• How to choose which two clusters to merge?

• use the Euclidean distance between centroids and merge the closest pair

• When to stop combining clusters?

• when we reach the number of clusters we believe exists in the data

• when the best available combination of clusters would produce an inadequate cluster

• or continue until there is only one cluster, and then return the tree (dendrogram) representing the association of clusters
18
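A minimal sketch of this centroid-based agglomerative clustering using SciPy; the synthetic data, the cut thresholds and the library choice are assumptions, not part of the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic 2-D points around three centres (illustrative data)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Repeatedly merge the two clusters whose centroids are closest (Euclidean case)
Z = linkage(points, method="centroid")

# Stopping rule 1: we believe there are 3 clusters in the data
labels = fcluster(Z, t=3, criterion="maxclust")

# Stopping rule 2: stop merging once the best merge is "too far" (threshold 2.0 here)
labels_by_distance = fcluster(Z, t=2.0, criterion="distance")
```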
Hierarchical Clustering
Example

Sequence of figures (slides 19–27) illustrating the successive merging of the two nearest clusters.
Hierarchical Clustering
Non-Euclidean case

• How to represent clusters?

• pick one of the points to represent the cluster, usually a point close to all the other points in the cluster; we call this point the clustroid

• the clustroid can be chosen as the point that minimizes:

• the sum of the distances to the other points

• the maximum distance to the other points

• the sum of the squares of the distances to the other points

• How to choose which two clusters to merge?

• use the distance between clustroids

• or other criteria measuring the density of a cluster, based on its radius or diameter

• When to stop combining clusters?

• same as in the Euclidean case


28
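A small sketch of clustroid selection under the first criterion (minimum sum of distances); the helper names and the Jaccard example are illustrative:

```python
def clustroid(points, distance):
    """Clustroid = the point minimizing the sum of distances to the others
    (alternatives: minimize the maximum distance, or the sum of squared distances)."""
    return min(points, key=lambda p: sum(distance(p, q) for q in points))

# Works with any distance measure, e.g. Jaccard distance between sets
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

cluster = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 5}, {1, 3}]
print(clustroid(cluster, jaccard_distance))  # {1, 2, 3}: smallest total distance to the others
```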
Hierarchical Clustering
Efficiency

• Hierarchical clustering is not very efficient

• at each step we must compute the distance between every pair of clusters, and then merge the closest pair

• cost O(n³) (n = number of points)

• More efficient implementation

• based on priority queues

• reduces the cost to O(n² log n)

• still infeasible for large n

29
k-means
k-means
Algorithm

• One of the most popular algorithms following a point-assignment strategy

• All points are of the quantitative type, so it assumes a Euclidean space/distance

• The number of clusters k is set in advance

• Algorithm:

1. Initialise the clusters by choosing one point for each cluster at random (or, better, points as far away from each other as possible)

2. Assign each point to the closest centroid

3. Compute the new centroid of each cluster

4. Reassign all points to their closest centroid (points can move between clusters)

5. Repeat steps 2–4 until no points are reassigned

31
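A plain NumPy sketch of the algorithm above (random initialisation; empty clusters are not handled, and all names, data and parameters are illustrative):

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Naive k-means: random initialisation, then alternate between assigning
    points to the closest centroid and recomputing centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # step 1
    labels = None
    for _ in range(max_iter):
        # steps 2 / 4: assign every point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # step 5: nothing moved, stop
        labels = new_labels
        # step 3: recompute the centroid of each cluster
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Four well-separated blobs, clustered with k = 4
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in ([0, 0], [4, 0], [2, 3], [5, 4])])
labels, centroids = k_means(data, k=4)
```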
k-means
Example

Sequence of figures (slides 32–46) illustrating the algorithm step by step:

• 1st step: choose k = 4

• 2nd step: initialise clusters (choose 4 random points)

• 3rd step: assign points to clusters

• 4th step: compute the centroid of each cluster

• 5th step: reassign points to clusters

• 6th step: recompute centroids

• 7th step: reassign points to clusters

• … recompute centroids and reassign points, stopping when no point changes cluster
k-means
How to select k

• What is the impact of k?

• increasing k: in the limit, we will have one cluster for each point

• decreasing k: the average diameter of the clusters will increase

• If we plot a measure such as the average radius, average diameter or average global error (measured to the centroid) as a function of k, the curve has an L-shape

• choose for k the value at which the curve stops dropping abruptly, i.e. the bend (“elbow”) of the L

47
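One way to visualise this, sketched below with scikit-learn and matplotlib; the library choice, the synthetic data and the range of k are assumptions, not part of the slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Four synthetic blobs; the "true" k is 4
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ([0, 0], [5, 0], [0, 5], [5, 5])])

# Average squared error to the centroid as a function of k
ks = range(1, 10)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ / len(data)
          for k in ks]

plt.plot(list(ks), errors, marker="o")
plt.xlabel("k")
plt.ylabel("average squared error to centroid")
plt.show()   # the curve drops sharply up to k = 4 and flattens afterwards
```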
Clustering with k-means
Another practical example
CURE
CURE
Motivation

• There are examples where k-means will fail

• traditional algorithms can wrongly split large clusters in order to minimize the squared error

• Alternative: CURE (Clustering Using REpresentatives)

• more robust to outliers

• better for non-spherical shapes

• clusters do not have to be normally distributed

• clusters can have strange bends, S-shapes, or even rings

• Idea: represent each cluster by a set of representative points


50
CURE
Algorithm

1. Take a random sample of the data, small enough to fit in main memory

2. Cluster the sample data (using a hierarchical method)

3. Select a small set of points from each cluster to be its representatives (choose points as far from each other as possible)

4. Move each representative a small fixed fraction of the distance towards the centroid of its cluster

5. Merge two clusters if they have a pair of representatives, one from each, that are close enough

6. Repeat step 5 until no more clusters can be merged

7. Use the representatives to label the data on disk: each point is assigned to the cluster with the closest representative
51
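A sketch of steps 3-5, assuming Euclidean data in NumPy; the number of representatives, the shrink fraction and the merge threshold are illustrative choices, not values prescribed by the slides:

```python
import numpy as np

def cure_representatives(cluster_points, n_rep=4, shrink=0.2):
    """Steps 3-4: pick n_rep well-scattered points (farthest-first),
    then move each one a fraction `shrink` of the way towards the centroid."""
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(n_rep, len(pts)):
        # next representative = point farthest from all representatives chosen so far
        dists = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(dists)])
    reps = np.array(reps)
    return reps + shrink * (centroid - reps)

def should_merge(reps_a, reps_b, threshold):
    """Step 5: merge two clusters if some pair of their representatives is close enough."""
    pair_dists = np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2)
    return np.min(pair_dists) < threshold
```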
CURE
Example

Sequence of figures (slides 52–58) illustrating the algorithm on an example dataset.
References

J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2014.

59
