K Means Clustering Algorithm: Explained


September 26, 2015 by DnI Institute


Segmentation problems can be solved through objective segmentation or subjective segmentation.

A non-technical explanation ( http://dni-institute.in/blogs/segmentation-a-perspective-2/ ) covers when to use a subjective segmentation technique such as K-Means clustering and when to use an objective segmentation method such as a decision tree.

K-Means is one of the most frequently used unsupervised algorithms. K-Means clustering is an exploratory data analysis technique and a non-hierarchical method of grouping objects together.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar (in some sense or another) to each other than to those
in other groups (clusters).

In this blog, we aim to explain the algorithm in simple steps and with an example.

Business Scenario: We have height and weight information, and we need to group the objects using these two variables.


If you look at the chart above, you can see two visible clusters/segments, and we want these to be identified by the K-Means algorithm.

Data Sample

Height Weight

185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
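To make the example reproducible, here is a minimal R sketch (R is the language used in the follow-up post) that loads the sample into a data frame and plots it; the names height, weight, and df are our own choices, not part of the original post:

```r
# Sample data: height and weight for the 12 observations listed above
height <- c(185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177)
weight <- c(72, 56, 60, 68, 72, 77, 71, 70, 84, 88, 67, 76)
df <- data.frame(height, weight)

# A simple scatter plot already suggests two groups of observations
plot(df$height, df$weight, xlab = "Height", ylab = "Weight")
```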

Step 1: Input
Dataset, clustering variables, and the maximum number of clusters (K in K-Means clustering).
In this dataset, only two variables – height and weight – are considered for clustering.

Height Weight

185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76

Step 2: Initialize cluster centroids

In this example, the value of K is set to 2, and the cluster centroids are initialized with the first two observations (sketched in R below).

Initial Centroids

Cluster  Height  Weight
K1       185     72
K2       170     56
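Continuing the R sketch from above, this initialization step might look as follows (taking the first two rows as seeds is the article's choice; base R's kmeans() samples starting rows at random by default):

```r
# Initialize K = 2 cluster centroids with the first two observations
K <- 2
centroids <- df[1:2, ]
rownames(centroids) <- c("K1", "K2")
centroids
#    height weight
# K1    185     72
# K2    170     56
```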

Step 3: Calculate Euclidean Distance

Euclidean distance is one of the distance measures used in the K-Means algorithm. The Euclidean distance between each observation and the cluster centroids is calculated, and the observation is assigned to the cluster with the minimum distance.

First two observations:

Height  Weight
185     72
170     56


The initial cluster centroids are:

Cluster  Height  Weight
K1       185     72
K2       170     56

The Euclidean distance of each observation from each cluster centroid is calculated:

Observation  Distance from Cluster 1             Distance from Cluster 2             Assignment
(185, 72)    √((185-185)² + (72-72)²) = 0        √((185-170)² + (72-56)²) = 21.93    1
(170, 56)    √((170-185)² + (56-72)²) = 21.93    √((170-170)² + (56-56)²) = 0        2

We considered these first two observations only because their assignments are already known: they serve as the initial centroids, so the centroids do not change after they are assigned.
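A small helper function makes the distance-and-assignment step concrete; this is a sketch continuing the R snippets above, and euclid is a name of our own choosing:

```r
# Euclidean distance between an observation and a centroid (both numeric vectors)
euclid <- function(obs, centroid) {
  sqrt(sum((obs - centroid)^2))
}

# First observation (185, 72) against both initial centroids
euclid(as.numeric(df[1, ]), as.numeric(centroids["K1", ]))  # 0
euclid(as.numeric(df[1, ]), as.numeric(centroids["K2", ]))  # approx. 21.93
```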
Step 4: Move on to the next observation and calculate the Euclidean distance

Height  Weight
168     60

Distance from Cluster 1            Distance from Cluster 2           Assignment
√((168-185)² + (60-72)²) = 20.81   √((168-170)² + (60-56)²) = 4.47   2

Since the distance from cluster 2 is the minimum, the observation is assigned to cluster 2. The cluster centroids are then revised: the mean height and weight of a cluster's members become its new centroid. Only cluster 2 gained an observation, so only the centroid of cluster 2 is updated (see the R sketch after the table below).
Updated cluster centroids:

Cluster  Height               Weight
K1       185                  72
K2       (170 + 168)/2 = 169  (56 + 60)/2 = 58
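The same step expressed in R, continuing the sketch above: compute the distances for the third observation, assign it to the nearer cluster, and recompute only that cluster's centroid.

```r
# Third observation: (168, 60)
obs3 <- as.numeric(df[3, ])
euclid(obs3, as.numeric(centroids["K1", ]))  # approx. 20.81 -> farther
euclid(obs3, as.numeric(centroids["K2", ]))  # approx. 4.47  -> nearer

# Cluster 2 now contains rows 2 and 3, so its centroid becomes their mean:
# ((170 + 168)/2, (56 + 60)/2) = (169, 58)
centroids["K2", ] <- colMeans(df[c(2, 3), ])
```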

Step 5: Calculate the Euclidean distance for the next observation, assign it based on the minimum distance, and update the cluster centroids

Next observation:

Height  Weight
179     68

Euclidean distance calculation and assignment:

Distance from Cluster 1  Distance from Cluster 2  Assignment
7.21                     14.14                    1

Updated cluster centroids:

Cluster  Height               Weight
K1       (185 + 179)/2 = 182  (72 + 68)/2 = 70
K2       169                  58

Continue these steps until all observations have been assigned and the centroids no longer change. In this example, every remaining observation ends up in cluster 1 (a code sketch of the full pass follows the table below).

Final cluster centroids:

Cluster  Height  Weight
K1       181.4   74.5
K2       169     58
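Putting the whole walkthrough together as a rough R sketch under the same assumptions as the snippets above (seed with the first two rows, then assign each remaining row to its nearest centroid and immediately recompute that centroid), and comparing the result with base R's kmeans(), which repeats the assign-then-update passes until nothing changes:

```r
# Sequential version of the steps described in this post
centroids <- df[1:2, ]
members <- list(1, 2)  # row indices currently belonging to each cluster
for (i in 3:nrow(df)) {
  d <- c(euclid(as.numeric(df[i, ]), as.numeric(centroids[1, ])),
         euclid(as.numeric(df[i, ]), as.numeric(centroids[2, ])))
  k <- which.min(d)                               # nearest cluster
  members[[k]] <- c(members[[k]], i)              # add the observation to it
  centroids[k, ] <- colMeans(df[members[[k]], ])  # recompute that centroid
}
centroids  # should end up near (181.4, 74.5) and (169, 58)

# Classic k-means (Lloyd's algorithm) started from the same seeds
kmeans(df, centers = as.matrix(df[1:2, ]), algorithm = "Lloyd")$centers
```

On this small sample, both versions end up with the same two groups, so the sequential pass and the fully iterated algorithm agree here.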


This is what we expected initially from the two-dimensional plot.

A few important considerations in K-Means:

The scale of measurement influences the Euclidean distance, so variable standardisation becomes necessary (see the sketch after this list).
Depending on expectations, you may require outlier treatment.
K-Means clustering may be biased by the initial centroids, called cluster seeds.
The maximum number of clusters is typically an input and also impacts the clusters that get created.
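As a brief illustration of the first and third points, here is a hedged R sketch: scale() standardises the variables, and the nstart argument of kmeans() runs several random initialisations and keeps the best one, reducing dependence on the cluster seeds.

```r
# Standardise height and weight so that scale does not dominate the distance
df_scaled <- scale(df)

# Try 25 random sets of starting centroids and keep the best solution
set.seed(42)
km <- kmeans(df_scaled, centers = 2, nstart = 25)
km$centers  # centroids on the standardised scale
km$cluster  # cluster assignment of each observation
```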

In the next blog, we focus on creating clusters using R: K Means Clustering using R.

29 thoughts on “K Means Clustering Algorithm: Explained”

Vishal Nigam
September 25, 2016 at 1:57 pm | Reply

Excellent Example. No better example found

DnI Institute
September 25, 2016 at 2:21 pm | Reply

Thanks Vishal

Nitesh
October 8, 2016 at 5:55 am | Reply

Very good example..
but there is a text mistake in step 4: the Euclidean distance from cluster 2.

DnI Institute
October 8, 2016 at 6:13 am | Reply


Thanks Nitesh.. We have corrected the spelling.

Kumar P
October 14, 2016 at 9:22 pm | Reply

I am a little confused. This is not an accurate depiction of the k-Means algorithm. k-Means algorithm steps:

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.

One iteration:
1. Assign labels (clusters) to all observations
2. Calculate the new Centroid values using mean

Ref: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

In your example you are updating the centroid values even before assigning all the
observations to clusters.

Please clarify.

DnI Institute
October 15, 2016 at 3:04 pm | Reply

Thanks Kumar for your comment.. I do not think there is any overall approach-wise disconnect between the steps we explained and those mentioned in the link.. If you read Step 3, it calculates the Euclidean distance, assigns the observation to a cluster, and updates the cluster centroids. Hope it helps

Pranay A
March 7, 2018 at 4:19 pm | Reply

@DnI Institute,
From step 3, the assignment is only done to the new data points, and the centroids are updated. But what if the data points assigned in the previous iterations have to change from one cluster to the other due to the change in the centroids? I mean the Euclidean distance changes, right? So there could be a possibility that a data point in one cluster is closer to a data point in the other cluster than to a data point in the same cluster. I hope my explanation is good.

Van Tuyen
November 25, 2016 at 10:18 am | Reply

Great, professional.
Your demo is perfect!

Thanks Bro.

Hazim
December 1, 2016 at 6:00 pm | Reply

Perfect.....
Mistake on calculation:
in Step 5, the updated centroid weight value is incorrect, I think.

DnI Institute
December 5, 2016 at 4:57 pm | Reply

Thanks Hazim.. Let us check and correct if required..


Eliazar
January 8, 2017 at 11:51 pm | Reply

Great introduction to k-mean clustering

Noor
May 10, 2017 at 10:14 am | Reply

Great ,easy to understand

Anant Lalchandani
August 15, 2017 at 8:08 pm | Reply

Excellent explanation. I was stuck making my own implementation of K-Means in R, but this post made it easier to implement. Thank you.

Tirtha Chakraborty
October 6, 2017 at 3:11 pm | Reply

There's probably a calculation mistake in the updated centroid values in step 5. But,apart
from that, wonderfully explained. Such a complex thing made so easy!

Rakesh Mondal
November 10, 2017 at 6:46 pm | Reply

The simplest way of learning K Means..

Devendra Shukla
November 20, 2017 at 10:31 am | Reply

Hello,

I have one doubt on K-Means clustering. How can a team work on the K-Means clustering algorithm in a real-time project? If multiple values of K are tried, multiple sets of clusters will be created, so does only one person work on K-Means, or how is it used in a real-time project?

DnI Institute
November 20, 2017 at 2:58 pm | Reply

When a k-means clustering project is being done, multiple values of k are considered. There are a few considerations for selecting the final clustering:

Observation % in each of the clusters

R² value of each of the variables

Overall R² value and other clustering performance statistics, e.g. CCC

Devendra Shukla
November 21, 2017 at 6:46 am | Reply


Thanks, DnI Institute, for your reply. Do you have any link or site where K-Means is used like this?

Observation % in each of the clusters
R² value of each of the variables
Overall R² value and other clustering performance statistics

DnI Institute
November 25, 2017 at 12:52 am | Reply

These are practical steps and considerations, so you may not see a lot of information about them on the internet.

Tas
December 29, 2017 at 6:37 pm | Reply

I am afraid your K-means method is not correct.

You are not updating the centroids of the clusters every time an observation changes a cluster but instead you update when all observations are assigned a cluster (or not!).

Have a look here: https://pdfs.semanticscholar.org/99d0/ea088fb5f545c7ab4a0d77b2df7c68f031ae.pdf

and here (page 388): http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf


Thanks
T.

DnI Institute
February 12, 2018 at 8:50 am | Reply

Thanks for the comments.. We have advised that it is directional and not a pure research
blog. Also, we have mentioned that all objects are reconsidered for the reassignment..

Vijeta
January 29, 2018 at 3:49 am | Reply

What if we only have height, then how will you apply it?

Like if we have the transactions of a person, how do we apply K-Means on that?

Tim
February 24, 2018 at 11:50 pm | Reply

I am afraid this is not a correct version of k-means clustering. First, the centroid is revised every time a new data point is assigned; that may be OK. The more serious problem is that after all data points are assigned, the clustering ends. Assigning all data points is only one step in k-means clustering; the next step is to update the centroids, and these two steps are repeated until no data point changes cluster.

Aarti
May 3, 2018 at 4:38 am | Reply

I think there is a mistake in step 5, in the updated centroid: how do 182.8 and 72 come? There is a calculation mistake. But the rest of the steps are well explained.

Kleber
May 24, 2018 at 11:11 am | Reply


Good job !

Please fix the table in Step 4: the values used in the calculation of the distance to Cluster 2
are not correct (you used the Cluster 1 values again by mistake).

Abhijit Choudhary
August 20, 2018 at 3:24 am | Reply

This is great!!!

Manju Gupta
September 21, 2018 at 8:11 am | Reply

I appreciate your work on Data Science. It's such a wonderful read on Data Science course.
Keep sharing stuffs like this. I am also educating people on similar Data Science training so if
you are interested to know more you can watch this Data Science tutorial:-
https://www.youtube.com/watch?v=h_GnVUIISk0&

RAnu
September 25, 2018 at 8:44 am | Reply

Simplest and clearest example I ever found.

Mahmoud Shaban
January 24, 2019 at 8:07 pm | Reply


Thanks for ever
