Chapter 2, Part 1: K-Means


Chapter 2: Generative Models
K-means
Machine Learning Team
UP GL-BD
Learning Outcomes

- Apply unsupervised models that exploit the geometric relationships between data points, through the K-means algorithm.

- Evaluate the performance of generative models.



Plan
1. Introduction

2. Unsupervised learning categories

3. Clustering

4. K-means

5. Bibliography



Introduction
• Discriminative models draw boundaries in the data space, while generative models try to model
how data is placed throughout the space.

• A generative model focuses on explaining how the data was generated, while a discriminative model focuses on predicting the labels of the data.



Introduction
• An unsupervised learning method is a method in which we draw inferences from datasets consisting of input data without labeled responses.

• Generally, it is used as a process to find:
o meaningful structure,
o explanatory underlying processes,
o generative features,
o and groupings inherent in a set of examples.





Unsupervised learning categories

Different tasks are associated with unsupervised learning:
• Clustering
• Dimensionality reduction
• Association rules





Clustering
Definition
• Clustering is the task of dividing the population or data points into a number of groups (clusters).

• It basically groups objects on the basis of similarity and dissimilarity between them:
• Data points in the same group are as similar as possible to one another.
• Data points in different groups are dissimilar.

• No predefined classes => unlabeled data.

🡺 The quality of a clustering depends on the similarity measure.
🡺 A good method will produce clusters whose elements have:
- strong intra-class similarity,
- low inter-class similarity.



Clustering
Similarity Measure
• Similarity between objects depends on:
- the type of data,
- the type of similarity.

Numerical data:
• Manhattan distance: the data needs to be normalized before using this measure; it does not overweight outliers.
• Euclidean distance: works great when we have low-dimensional data; overweights outliers; the calculation times are particularly long.
• Minkowski distance: allows a huge amount of flexibility over the distance metric; the parameter p can be troublesome to work with, as finding the right value can be quite computationally inefficient depending on the use case.

Binary data:
• Binary distance: d(0,0) = d(1,1) = 0 and d(0,1) = d(1,0) = 1.

Enumerated data:
• Distance is zero if the values are equal and 1 otherwise.
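
As a quick illustration of the numerical measures above, here is a minimal NumPy sketch (the points x and y are made-up examples):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

manhattan = np.sum(np.abs(x - y))          # L1 norm: 5.0
euclidean = np.sqrt(np.sum((x - y) ** 2))  # L2 norm: ~3.606

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(manhattan, euclidean, minkowski(x, y, 3))
```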
Clustering
Applications of Clustering

• Bank & insurance: used to understand customers and their policies, and to identify fraud.

• Medicine: patient segmentation (grouping similar behaviors); locating tumors in the brain.

• City planning: used to group houses and to study their values based on their geographical locations and other factors.


Clustering
Types
• Centroid-based clustering: finds k sets of points grouped by their proximity to a centroid.

• Density-based clustering: connects areas of high example density into clusters.

• Distribution-based clustering: assumes the data is composed of distributions, such as Gaussian distributions.

• Hierarchical clustering: builds a tree of nested clusters.



Centroid-based Clustering:
K-means (MacQueen '67)



K-Means
Working principle
• Find homogeneous groups within a heterogeneous population.



K-Means
Working principle
Objective: identify groups (clusters) of observations with similar characteristics (e.g. discovering customer segments for marketing purposes, or clustering books on the basis of topics and information):
(1) individuals in the same group are as similar as possible;
(2) individuals in different groups stand out as much as possible.
Why?
o Identify underlying structures in the data
o Summarize behaviors
o Assign new individuals to categories



K-Means
Working principle
• The K-means clustering algorithm computes the centroids and iterates until it finds optimal centroids.

• It assumes that the number of clusters is already known; this number is represented by 'K' in K-means.

• The data points are assigned to clusters in such a manner that the sum of the squared distances between the data points and their centroid is minimal.
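
As a concrete illustration, here is a minimal sketch using scikit-learn's KMeans on made-up 2-D data (the two-blob dataset and K = 2 are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two made-up blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# K must be chosen up front; here we assume K = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # the K centroids found
print(km.labels_[:5])       # cluster index assigned to each point
print(km.inertia_)          # sum of squared distances to closest centroid
```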





K-Means
Working principle
• Initialize the K means with random values.



K-Means
Working principle
• Find the mean closest to each item by calculating the Euclidean distance between the item and each of the means.
• Assign the item to that mean.



K-Means
Working principle
• Update each mean by shifting it to the average of the items in its cluster.



K-Means
Working principle
• Reassign each item to the closest updated mean; repeat until assignments no longer change.



K-Means
Working principle
Total inertia = between-class inertia B + within-class inertia W (Huygens' theorem)

◼ Weight of an observation (by default) = 1/(number of observations)
◼ Cluster weight = sum of the weights of the observations in the cluster

• Between-class inertia B: dispersion of the cluster barycenters around the global barycenter; an indicator of cluster separability.
• Within-class inertia W: dispersion within each group; an indicator of cluster compactness.

The objective of automatic clustering is to minimize the within-class inertia W, at a fixed number of clusters K.
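
With observation weights w_i, cluster weights μ_k, cluster barycenters g_k and global barycenter g, Huygens' decomposition can be written as follows (a standard formulation, stated here for reference):

```latex
T \;=\; \underbrace{\sum_{k=1}^{K} \mu_k \,\lVert g_k - g \rVert^2}_{B\ \text{(between-class)}}
\;+\; \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} w_i \,\lVert x_i - g_k \rVert^2}_{W\ \text{(within-class)}}
```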
K-Means
Pseudocode
Input: X (n observations, p variables), K (number of clusters)
Initialize the K means with random values
REPEAT
• Assign each individual to the cluster whose center is closest
• Recalculate the cluster centers from the attached individuals
UNTIL convergence
Output: a partition of the individuals, characterized by the K cluster centers Gk

Fundamental property: the intra-class inertia decreases at each step.
Convergence is reached after a fixed number of iterations, or when:
• no individual changes class,
• or W no longer decreases,
• or the Gk are steady.
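
A minimal NumPy implementation of this pseudocode (the function name, random-individual initialization, and defaults are illustrative choices, not taken from the slides):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain K-means following the pseudocode above (a minimal sketch)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick K distinct individuals as starting centers.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each individual joins the closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no individual changed class: converged
        labels = new_labels
        # Update step: each center becomes the mean of its cluster
        # (the within-class inertia W can only decrease here).
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```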



K-Means
Number of clusters
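
A common heuristic for choosing the number of clusters is the elbow method: run K-means for several values of K and look at where the within-class inertia W stops decreasing sharply. A sketch with scikit-learn (the three-blob dataset is a made-up example):

```python
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Made-up data: three blobs, so the elbow should appear near K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-class inertia W

# Plot K against W and look for the "elbow" where the curve flattens.
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("K"); plt.ylabel("Within-class inertia W")
plt.show()
```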





K-Means
Advantages

• Relatively simple to implement.

• Scales to large data sets.

• Easy to interpret.

• Easily adapts to new examples.



K-Means
Disadvantages

• The value of k must be chosen manually.

• The result depends on the initial values (a common mitigation is sketched below).

• Clustering data of varying sizes and densities: k-means has trouble when clusters vary in size and density.

• Clustering outliers: centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
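
The dependence on initial values is commonly mitigated by running several random restarts and keeping the solution with the lowest within-class inertia; in scikit-learn this corresponds to the n_init parameter, and the default k-means++ seeding also helps. A minimal sketch:

```python
from sklearn.cluster import KMeans

# Several restarts (n_init) with k-means++ seeding; the run with the
# lowest within-class inertia is kept automatically.
km = KMeans(n_clusters=3, init="k-means++", n_init=10)
```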



K-Means
Exercise
Four drugs, each described by two attributes (concentration and efficacy); we want to create two clusters (k = 2).

Drug   Concentration   Efficacy
A      1               1
B      2               1
C      4               3
D      5               4

We randomly designate A and B as the class centers: C1 = A and C2 = B.

NB: the distance used is the Euclidean distance.
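
A quick numerical check of this exercise, following the pseudocode above (a sketch; variable names are illustrative):

```python
import numpy as np

# Drugs A, B, C, D with (concentration, efficacy).
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
names = ["A", "B", "C", "D"]

# Initial centers: C1 = A, C2 = B.
centers = X[[0, 1]].copy()

labels = None
while True:
    # Assign each drug to the closest center (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if labels is not None and np.array_equal(new_labels, labels):
        break  # no drug changed cluster: converged
    labels = new_labels
    # Recompute each center as the mean of its cluster.
    for k in range(2):
        centers[k] = X[labels == k].mean(axis=0)

for name, lab in zip(names, labels):
    print(name, "-> cluster", lab + 1)
print("final centers:", centers)
# Converges to {A, B} around (1.5, 1) and {C, D} around (4.5, 3.5).
```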



Bibliography
• MacQueen, J. (1967). "Some methods for classification and analysis of multivariate observations." Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., Vol. 1, Univ. of Calif. Press, pp. 281-297.

• Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." KDD-96 Proceedings.

• Rokach, L., and Maimon, O. (2005). "Clustering methods." Data Mining and Knowledge Discovery Handbook, Springer US, pp. 321-352.

• Hartigan, J. A., and Wong, M. A. (1979). "Algorithm AS 136: A K-Means Clustering Algorithm." Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 28, No. 1, pp. 100-108.
