04 - K-Means Clustering


Machine Learning

Unsupervised Learning – K-Means Clustering


ADF

Outline
Introduction:
– Supervised vs Unsupervised
– Clustering

Partitional Clustering: K-Means



Motivation



Supervised vs Unsupervised Learning
Supervised
– You have labeled data
– Find a function that maps data to its label
– Discover patterns that relate data attributes
to a target (class)

Unsupervised
– You have unlabeled data
– Discover the underlying structure of the data
– Try to understand the data
– Not predicting anything specific



Clustering
The process of grouping a set of objects into classes of
similar objects
– Data within a cluster should be similar (or related).
– Data from different clusters should be dissimilar (or unrelated).

Cluster: A collection/group of data objects/points

Cluster analysis
– Find similarities between the data according to their underlying
characteristics and group similar data objects into
clusters



Clustering
Try to answer:
– How many sub-populations (groups)?
– What are their sizes?
– Do elements in a sub-population have any common properties?
– Are sub-populations cohesive? Can they be further split up?
– Are there outliers?



Clustering
How many clusters?

What is a good cluster?

Define how to cluster



Types of Clustering
Hierarchical (Connectivity-based)
– Objects are more related to nearby objects than to objects
farther away

Partitional (Centroid-based)
– Each cluster represented by a centroid
– Determined by a proximity measurement

Other types:
– Distribution-based
– Density-based
– Etc.



Hierarchical (Connectivity-based)
Objects are more related to nearby objects than to
objects farther away

– In this approach, data vectors are arranged in a tree, where nearby
(‘similar’) vectors xi and xj are placed close to each other in the tree
– Any horizontal cut corresponds to a partitional clustering

[Figure: dendrogram; the 3 colors have been added manually for emphasis
(they are not produced by the algorithm)]
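
As a concrete sketch of this idea, SciPy's hierarchy module can build such a tree and take a horizontal cut (the toy data, the choice of average linkage, and the variable names are illustrative, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data vectors

Z = linkage(X, method="average")                   # build the tree bottom-up
clusters = fcluster(Z, t=3, criterion="maxclust")  # a horizontal cut into 3 clusters
print(clusters)                                    # cluster id per data vector
```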



Partitional (Centroid-based)
Each cluster is represented by a centroid, determined by a
proximity measurement

– Typically K and a proximity measure are selected by the user, and the
chosen algorithm then learns the actual partitions



Types of Clustering
Grouping criteria
– Monothetic (using some common attribute)
– Polythetic (using similarity/distance measure)

Overlap criteria
– Hard clustering
– Soft clustering



Clustering Algorithms
K-Means
– Polythetic, Partitional, Hard Clustering

Fuzzy C-Means
– Polythetic, Partitional, Soft Clustering

Agglomerative
– Monothetic, Hierarchical, Hard Clustering

K-D Trees
– Monothetic, Hierarchical, Hard Clustering

EM clustering
– Polythetic, Partitional, Soft Clustering

Quality Threshold

Etc.



Usage
Discover classes of unlabeled data

Dimensionality reduction

Graph coloring

Color-based image segmentation

Social network analysis

Market segmentation

Etc.



K-Means Algorithm



K-Means Algorithm
A simple and often used partitional clustering method

Hard, polythetic clustering

Data partitioned into K clusters


– Need to determine K

Points in each cluster are similar to a “centroid”



K-Means Algorithm

1. Initialize the centroids c_1, …, c_K
2. Repeat until the cluster assignments no longer change:
3.   Assign each data point x_i to its nearest centroid
4.   Recalculate each centroid from the points assigned to it

Line 1: the simplest solution is to initialize the c_j to equal K random
vectors from the input data

Line 3: for simplicity, use the Euclidean distance

Line 4: recalculate using the cluster mean, $c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
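
A minimal runnable sketch of these four lines in NumPy (the function name, parameters, and convergence check are illustrative choices, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Basic K-means on an (n, d) data matrix X with k clusters."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Line 1: initialize the c_j to k random vectors from the input data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Line 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged: converged
        labels = new_labels
        # Line 4: recalculate each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```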



K-Means
Example



K-Means - Example
[Figure: eight points o1–o8 on the (x1, x2) plane; o1, o2, o5, o6 lie
along x2 = 5 and o4, o3, o8, o7 along x2 = 2]

Set K = 2,
initialize 2 centroids randomly



K-Means - Example
[Figure: the same points with the initial centroids c1 and c2 marked +]

For each data point o, find nearest centroid c



K-Means - Example
[Figure: the clusters produced by the first assignment, with c1 and c2 marked +]

Calculate the mean of each cluster,

update the centroids



K-Means - Example
[Figure: iteration 2, with the centroids moved to their new positions]

Iteration 2, new centroids

For each data point o, find the nearest centroid c



K-Means - Example
[Figure: the updated assignments for iteration 2, with c1 and c2 marked +]

Calculate the mean of each cluster,

update the centroids



K-Means - Example
[Figure: iteration 3, with c1 over the left group and c2 over the right group]

Iteration 3, new centroids

For each data point o, find the nearest centroid c



K-Means - Example
[Figure: the assignments are unchanged; c1 and c2 stay at the group centers]

Calculate the mean of each cluster;

the clusters are not updated, so the iteration stops
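
As a usage sketch, the kmeans function above can be run on a small two-band dataset like the one in these figures (the coordinates below are illustrative, only roughly read off the plots, not exact values from the slides):

```python
import numpy as np

# Roughly the layout of the figures: o1, o2, o5, o6 on top; o4, o3, o8, o7 below
X = np.array([[1, 5], [3, 5], [2, 2], [3, 2],     # left group
              [8, 5], [10, 5], [9, 2], [11, 2]],  # right group
             dtype=float)

centroids, labels = kmeans(X, k=2, seed=0)  # kmeans defined in the sketch above
print(labels)     # typically the left and right groups get two different labels
print(centroids)  # one centroid per group
```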



K-Means
Example in Euclidean Space



K-Means - Example
[Figure: iteration 1, with the initial centroids c1 and c2 marked +]

Iteration 1



K-Means - Example

Iteration 1



K-Means - Example

[Figure: iteration 2, with the centroids at their updated positions]

Iteration 2



K-Means - Example
[Figure: iteration 3, with c1 over the left group and c2 over the right group]

Iteration 3



K-Means
Problems



K-Means - Problem 1

Watch the example:


we run K-means twice from 2 different initial centroids



K-Means - Problem 1

Try 1
Iteration 1



K-Means - Problem 1

Try 1
Iteration 2



K-Means - Problem 1

Try 1
Iteration 3



K-Means - Problem 1

Try 1
Iteration 4, iteration stopped



K-Means - Problem 1

Let’s try a different set of initial centroids



K-Means - Problem 1

Try 2
Iteration 1



K-Means - Problem 1

Try 2
Iteration 1



K-Means - Problem 1

Try 2
Iteration 2



K-Means - Problem 1

Try 2
Iteration 3, iteration stopped



K-Means - Problem 1

Nearby points may not end up in the same cluster



K-Means - Problem 2
Which cluster?



K-Means - Problem 2



K-Means - Problem 3

Are these centroids good?


Are they accurate?

[Figure: centroids c1 and c2 marked + away from the apparent cluster centers]



K-Means - Problem 3

Where should the centroids be?



K-Means
Pros and Cons



Pros and Cons
Pros:
– Relatively simple to implement
– Good for neat (rounded/convex-shaped) data
– Efficient: time complexity = O(#data × #clusters × #iterations)

Cons:
– Need to specify K
– Sensitive to the initial assignment of centroids
– Sensitive to noise and outliers
– Not suitable for discovering clusters with non-convex shapes
– Non-deterministic: can be inconsistent from one run to another
– Need to define a measurement to evaluate the performance



Cluster Quality
Sum of Squared Errors (SSE)
– Aggregate intra-cluster distance
$$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert c_j - x_i \rVert_2^2$$

Silhouette coefficient

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$

– a(i) = the average distance of point i to all other
data within the same cluster

– b(i) = the lowest average distance of point i
to all points in any other cluster
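
Both measures are straightforward to compute with scikit-learn, assuming it is available (KMeans exposes the SSE of the fitted clustering as its inertia_ attribute; the toy data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))  # toy data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("SSE:", km.inertia_)                             # aggregate intra-cluster distance
print("silhouette:", silhouette_score(X, km.labels_))  # mean s(i) over all points
```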



Cluster Quality
Davies–Bouldin index, Dunn Index, Mutual Information

Purity, F-Measure, Jaccard Index, Etc.
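
Several of these indices are also available in scikit-learn; a brief sketch for the Davies–Bouldin index (toy data and parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower is better
```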



Things to try and observe
Dealing with outliers
– Try to remove some data points considered as noise

– Perform random sampling

Dealing with initial seeds


– Try out multiple starting points and
choose the clustering result with the smallest SSE
(or any other performance measure), as sketched below
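
A minimal sketch of this restart strategy, reusing the kmeans function from the algorithm section (the sse helper is illustrative; scikit-learn's KMeans applies the same idea through its n_init parameter):

```python
import numpy as np

def sse(X, centroids, labels):
    # aggregate intra-cluster squared distance
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def kmeans_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):  # each seed gives different initial centroids
        centroids, labels = kmeans(X, k, seed=seed)
        score = sse(X, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best[1], best[2]  # the run with the smallest SSE
```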



Things to try and observe
Dealing with the number of clusters
– Hopkins Statistic
– Elbow method
– Cross-validation
– Silhouette Analysis



K-Means
Determining the number of clusters



Elbow method
The more clusters, the lower the SSE

Choose the smallest number of clusters at which
the SSE starts to level out (the “elbow”), as sketched below

Use cross-validation and average the SSE over the folds
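
A short elbow-curve sketch with scikit-learn and matplotlib (the toy data and the range of K are illustrative; inertia_ is the SSE of the fitted clustering):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data

ks = range(1, 11)
sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sses, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("SSE")
plt.show()  # pick the K where the curve bends and starts to level out
```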



Silhouette Analysis
A measure of how close each point in one cluster is to points in the
neighboring clusters

A measure of how similar an object is to its own cluster (cohesion)
compared to other clusters (separation), as sketched below.
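
A sketch of silhouette analysis for choosing K with scikit-learn (toy data and K range are illustrative; a higher mean silhouette indicates a better K):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(300, 2))  # toy data

for k in range(2, 8):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # pick the K with the highest score
```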



Questions?



THANK YOU
