Chapter 2, Part 1: K-Means


Chapter 2: Generative Models
K-means
Machine Learning Team
UP GL-BD
Learning Outcomes

- Apply unsupervised models that exploit the geometric relationships between data points, through the K-means algorithm.

- Evaluate the performance of generative models.



Plan
1. Introduction

2. Unsupervised learning categories

3. Clustering

4. K-means

5. Bibliography



Introduction
• Discriminative models draw boundaries in the data space, while generative models try to model
how data is placed throughout the space.

• A generative model focuses on explaining how the data was generated, while a discriminative model focuses on predicting the labels of the data.



Introduction
• An unsupervised learning method is a method in which we draw inferences from datasets consisting of input data without labeled responses.

• Generally, it is used as a process to find:
o meaningful structure,
o explanatory underlying processes,
o generative features,
o and groupings inherent in a set of examples.





Unsupervised learning categories

Different tasks are associated with unsupervised learning:
• Clustering
• Dimensionality reduction
• Association rules





Clustering
Definition
• Clustering is the task of dividing the population or data points into a number of groups (clusters).

• It basically groups objects on the basis of similarity and dissimilarity between them:
• Data points in the same group are as similar as possible to one another.
• Data points in different groups are dissimilar.

• No predefined classes => unlabeled data.

🡺 The quality of a clustering depends on the similarity measure.
🡺 A good method will produce clusters whose elements have:
- strong intra-class similarity,
- low inter-class similarity.



Clustering
Similarity Measure
• Similarity between objects depends on:
- the type of data,
- the type of similarity.

Numerical data:
• Manhattan distance: the data needs to be normalized before using this measure; it does not overweight outliers.
• Euclidean distance: works great when we have low-dimensional data; overweights outliers; the calculation times are particularly long.
• Minkowski distance: allows a huge amount of flexibility over the distance metric; the parameter p can be troublesome to work with, as finding the right value can be quite computationally inefficient depending on the use case.

Binary data:
• Binary distance: d(0,0) = d(1,1) = 0 and d(0,1) = d(1,0) = 1.

Enumerated data:
• Distance is zero if the values are equal and 1 otherwise.
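
As a quick illustration of the numerical measures above, here is a minimal NumPy sketch (the points x and y are made-up examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))          # L1 norm: 5.0
euclidean = np.sqrt(np.sum((x - y) ** 2))  # L2 norm: ~3.606

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(manhattan, euclidean, minkowski(x, y, 3))
```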
Clustering
Applications of Clustering

• Bank & insurance: used to understand customers and their policies, and to identify fraud.

• Medicine: patient segmentation (grouping similar behaviors); locating tumors in the brain.

• City planning: used to group houses and to study their values based on their geographical locations and other factors.


Clustering
Types
• Centroid-based clustering: finds k sets of points grouped by their proximity to a centroid.

• Density-based clustering: connects areas of high example density into clusters.

• Distribution-based clustering: assumes the data is composed of distributions, such as Gaussian distributions.

• Hierarchical clustering: builds a tree of nested clusters.



Centroid-based Clustering:
K-means (MacQueen '67)



K-Means
Working principle
• Find homogeneous groups within a heterogeneous population.



K-Means
Working principle
Objective: identify groups (clusters) of observations with similar characteristics (e.g. discovering customer segments for marketing purposes, or clustering books on the basis of topics and information):
(1) individuals in the same group are as similar as possible;
(2) individuals in different groups stand out as much as possible.
Why?
o Identify underlying structures in the data
o Summarize behaviors
o Assign new individuals to categories



K-Means
Working principle
• The K-means clustering algorithm computes the centroids and iterates until it finds optimal centroids.

• It assumes that the number of clusters is already known; this number is represented by 'K' in K-means.

• The data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and their centroid is minimal.
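
As a concrete illustration, here is a minimal sketch using scikit-learn's KMeans on made-up 2-D data (the two-blob dataset and K = 2 are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two made-up blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# K must be chosen up front; here we assume K = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the K centroids found
print(km.labels_[:5])       # cluster index assigned to each point
print(km.inertia_)          # sum of squared distances to closest centroid
```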





K-Means
Working principle
• Initialize the K means with random values.



K-Means
Working principle
• Find the mean closest to each item by calculating the Euclidean distance between the item and each of the means.
• Assign the item to that mean.



K-Means
Working principle
• Update each mean by shifting it to the average of the items in its cluster.



K-Means
Working principle
• Reassign each item to the closest updated mean; repeat until assignments no longer change.



K-Means
Working principle
Total inertia = between-class inertia B + within-class inertia W (Huygens' theorem)

◼ Weight of an observation (by default) = 1/(number of observations)
◼ Cluster weight = sum of the weights of the observations in the cluster

• Between-class inertia B: dispersion of the cluster barycenters around the global barycenter; an indicator of cluster separability.
• Within-class inertia W: dispersion within each group; an indicator of cluster compactness.

The objective of automatic clustering is to minimize the within-class inertia W, at a fixed number of clusters K.
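
With observation weights w_i, cluster weights μ_k, cluster barycenters g_k and global barycenter g, Huygens' decomposition can be written as follows (a standard formulation, stated here for reference):

```latex
T \;=\; \underbrace{\sum_{k=1}^{K} \mu_k \,\lVert g_k - g \rVert^2}_{B\ \text{(between-class)}}
\;+\; \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} w_i \,\lVert x_i - g_k \rVert^2}_{W\ \text{(within-class)}}
```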
K-Means
Pseudocode
Input: X (n observations, p variables), K (number of clusters)
Initialize the K means with random values
REPEAT
• Assign each individual to the cluster whose center is closest
• Recalculate the cluster centers from the attached individuals
UNTIL convergence
Output: a partition of the individuals, characterized by the K cluster centers Gk

Fundamental property: the intra-class inertia decreases at each step.
Convergence is reached after a fixed number of iterations, or when:
• no individual changes class,
• or W no longer decreases,
• or the Gk are steady.
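
A minimal NumPy implementation of this pseudocode (the function name, random-individual initialization, and defaults are illustrative choices, not taken from the slides):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means following the pseudocode above (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct individuals as starting centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each individual joins the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no individual changed class: converged
        labels = new_labels
        # Update step: each center becomes the mean of its cluster
        # (the within-class inertia W can only decrease here).
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```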



K-Means
Number of clusters
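
A common heuristic for choosing the number of clusters is the elbow method: run K-means for several values of K and look at where the within-class inertia W stops decreasing sharply. A sketch with scikit-learn (the three-blob dataset is a made-up example):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Made-up data: three blobs, so the elbow should appear near K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-class inertia W

# Plot K against W and look for the "elbow" where the curve flattens.
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K"); plt.ylabel("Within-class inertia W")
plt.show()
```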





K-Means
Advantages

• Relatively simple to implement.

• Scales to large data sets.

• Easy to interpret.

• Easily adapts to new examples.



K-Means
Disadvantages

• The value of k must be chosen manually.

• The result depends on the initial values (a common mitigation is sketched below).

• Clustering data of varying sizes and densities: k-means has trouble when clusters vary in size and density.

• Clustering outliers: centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
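
The dependence on initial values is commonly mitigated by running several random restarts and keeping the solution with the lowest within-class inertia; in scikit-learn this corresponds to the n_init parameter, and the default k-means++ seeding also helps. A minimal sketch:

```python
from sklearn.cluster import KMeans

# Several restarts (n_init) with k-means++ seeding; the run with the
# lowest within-class inertia is kept automatically.
km = KMeans(n_clusters=3, init="k-means++", n_init=10)
```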



K-Means
Exercise
Four drugs, each described by two attributes (concentration and efficacy); we want to create two clusters (k = 2).

Drug   Concentration   Efficacy
A      1               1
B      2               1
C      4               3
D      5               4

We randomly designate A and B as the class centers: C1 = A and C2 = B.

NB: the distance used is the Euclidean distance.
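
A quick numerical check of this exercise, following the pseudocode above (a sketch; variable names are illustrative):

```python
import numpy as np

# Drugs A, B, C, D with (concentration, efficacy).
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
names = ["A", "B", "C", "D"]

# Initial centers: C1 = A, C2 = B.
centers = X[[0, 1]].copy()

labels = None
while True:
    # Assign each drug to the closest center (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if labels is not None and np.array_equal(new_labels, labels):
        break  # no drug changed cluster: converged
    labels = new_labels
    # Recompute each center as the mean of its cluster.
    for k in range(2):
        centers[k] = X[labels == k].mean(axis=0)

for name, lab in zip(names, labels):
    print(name, "-> cluster", lab + 1)
print("final centers:", centers)
# Converges to {A, B} around (1.5, 1) and {C, D} around (4.5, 3.5).
```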



Bibliography
• MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations." Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., Vol. 1, Univ. of Calif. Press, pp. 281-297.

• Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." KDD-96 Proceedings.

• Rokach, L., and Maimon, O. (2005). "Clustering methods." Data Mining and Knowledge Discovery Handbook, Springer US, pp. 321-352.

• Hartigan, J. A., and Wong, M. A. (1979). "Algorithm AS 136: A K-Means Clustering Algorithm." Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1, pp. 100-108.
