K Means Clustering Algorithm: Explained


September 26, 2015 by DnI Institute


Segmentation problems can be solved through objective segmentation or subjective segmentation.

A non-technical explanation ( http://dni-institute.in/blogs/segmentation-a-perspective-2/ ) covers when to use a subjective segmentation technique such as K-Means clustering and when to use an objective segmentation method such as a decision tree.

K-Means is one of the most frequently used unsupervised algorithms. K-Means clustering is an exploratory data analysis technique and a non-hierarchical method of grouping objects together.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the
same group (called a cluster) are more similar (in some sense or another) to each other than to those
in other groups (clusters).

In this blog, we aim to explain the algorithm in simple steps and with an example.

Business Scenario: We have height and weight information, and we need to group the objects using these two variables.


If you look at the chart above, you can see two visible clusters/segments, and we want these to be identified by the K-Means algorithm.

Data Sample

Height Weight

185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76
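To make the example reproducible, here is a minimal R sketch (R is the language used in the follow-up post) that loads the sample into a data frame and plots it; the names height, weight, and df are our own choices, not part of the original post:

```r
# Sample data: height and weight for the 12 observations listed above
height <- c(185, 170, 168, 179, 182, 188, 180, 180, 183, 180, 180, 177)
weight <- c(72, 56, 60, 68, 72, 77, 71, 70, 84, 88, 67, 76)
df <- data.frame(height, weight)

# A simple scatter plot already suggests two groups of observations
plot(df$height, df$weight, xlab = "Height", ylab = "Weight")
```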

Step 1: Input
Dataset, clustering variables, and the maximum number of clusters (K in K-Means clustering).
In this dataset, only two variables – height and weight – are considered for clustering.

Height Weight

185 72
170 56
168 60
179 68
182 72
188 77
180 71
180 70
183 84
180 88
180 67
177 76

Step 2: Initialize cluster centroids

In this example, the value of K is set to 2, and the cluster centroids are initialized with the first two observations (sketched in R below).

Initial Centroids

Cluster  Height  Weight
K1       185     72
K2       170     56
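Continuing the R sketch from above, this initialization step might look as follows (taking the first two rows as seeds is the article's choice; base R's kmeans() samples starting rows at random by default):

```r
# Initialize K = 2 cluster centroids with the first two observations
K <- 2
centroids <- df[1:2, ]
rownames(centroids) <- c("K1", "K2")
centroids
#    height weight
# K1    185     72
# K2    170     56
```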

Step 3: Calculate Euclidean Distance

Euclidean distance is one of the distance measures used in the K-Means algorithm. The Euclidean distance between each observation and the cluster centroids is calculated, and the observation is assigned to the cluster with the minimum distance.

First two observations:

Height  Weight
185     72
170     56


The initial cluster centroids are:

Cluster  Height  Weight
K1       185     72
K2       170     56

The Euclidean distance of each observation from each cluster centroid is calculated:

Observation  Distance from Cluster 1             Distance from Cluster 2             Assignment
(185, 72)    √((185-185)² + (72-72)²) = 0        √((185-170)² + (72-56)²) = 21.93    1
(170, 56)    √((170-185)² + (56-72)²) = 21.93    √((170-170)² + (56-56)²) = 0        2

We considered these first two observations only because their assignments are already known: they serve as the initial centroids, so the centroids do not change after they are assigned.
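A small helper function makes the distance-and-assignment step concrete; this is a sketch continuing the R snippets above, and euclid is a name of our own choosing:

```r
# Euclidean distance between an observation and a centroid (both numeric vectors)
euclid <- function(obs, centroid) {
  sqrt(sum((obs - centroid)^2))
}

# First observation (185, 72) against both initial centroids
euclid(as.numeric(df[1, ]), as.numeric(centroids["K1", ]))  # 0
euclid(as.numeric(df[1, ]), as.numeric(centroids["K2", ]))  # approx. 21.93
```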
Step 4: Move on to the next observation and calculate the Euclidean distance

Height  Weight
168     60

Distance from Cluster 1            Distance from Cluster 2           Assignment
√((168-185)² + (60-72)²) = 20.81   √((168-170)² + (60-56)²) = 4.47   2

Since the distance from cluster 2 is the minimum, the observation is assigned to cluster 2. The cluster centroids are then revised: the mean height and weight of a cluster's members become its new centroid. Only cluster 2 gained an observation, so only the centroid of cluster 2 is updated (see the R sketch after the table below).
Updated cluster centroids:

Cluster  Height               Weight
K1       185                  72
K2       (170 + 168)/2 = 169  (56 + 60)/2 = 58
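The same step expressed in R, continuing the sketch above: compute the distances for the third observation, assign it to the nearer cluster, and recompute only that cluster's centroid.

```r
# Third observation: (168, 60)
obs3 <- as.numeric(df[3, ])
euclid(obs3, as.numeric(centroids["K1", ]))  # approx. 20.81 -> farther
euclid(obs3, as.numeric(centroids["K2", ]))  # approx. 4.47  -> nearer

# Cluster 2 now contains rows 2 and 3, so its centroid becomes their mean:
# ((170 + 168)/2, (56 + 60)/2) = (169, 58)
centroids["K2", ] <- colMeans(df[c(2, 3), ])
```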

Step 5: Calculate the Euclidean distance for the next observation, assign it based on the minimum distance, and update the cluster centroids

Next observation:

Height  Weight
179     68

Euclidean distance calculation and assignment:

Distance from Cluster 1  Distance from Cluster 2  Assignment
7.21                     14.14                    1

Updated cluster centroids:

Cluster  Height               Weight
K1       (185 + 179)/2 = 182  (72 + 68)/2 = 70
K2       169                  58

Continue these steps until all observations have been assigned and the centroids no longer change. In this example, every remaining observation ends up in cluster 1 (a code sketch of the full pass follows the table below).

Final cluster centroids:

Cluster  Height  Weight
K1       181.4   74.5
K2       169     58
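Putting the whole walkthrough together as a rough R sketch under the same assumptions as the snippets above (seed with the first two rows, then assign each remaining row to its nearest centroid and immediately recompute that centroid), and comparing the result with base R's kmeans(), which repeats the assign-then-update passes until nothing changes:

```r
# Sequential version of the steps described in this post
centroids <- df[1:2, ]
members <- list(1, 2)  # row indices currently belonging to each cluster
for (i in 3:nrow(df)) {
  d <- c(euclid(as.numeric(df[i, ]), as.numeric(centroids[1, ])),
         euclid(as.numeric(df[i, ]), as.numeric(centroids[2, ])))
  k <- which.min(d)                               # nearest cluster
  members[[k]] <- c(members[[k]], i)              # add the observation to it
  centroids[k, ] <- colMeans(df[members[[k]], ])  # recompute that centroid
}
centroids  # should end up near (181.4, 74.5) and (169, 58)

# Classic k-means (Lloyd's algorithm) started from the same seeds
kmeans(df, centers = as.matrix(df[1:2, ]), algorithm = "Lloyd")$centers
```

On this small sample, both versions end up with the same two groups, so the sequential pass and the fully iterated algorithm agree here.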


This is what we expected initially from the two-dimensional plot.

A few important considerations in K-Means:

The scale of measurement influences the Euclidean distance, so variable standardisation becomes necessary (see the sketch after this list).
Depending on expectations, you may require outlier treatment.
K-Means clustering may be biased by the initial centroids, called cluster seeds.
The maximum number of clusters is typically an input and also impacts the clusters that get created.
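As a brief illustration of the first and third points, here is a hedged R sketch: scale() standardises the variables, and the nstart argument of kmeans() runs several random initialisations and keeps the best one, reducing dependence on the cluster seeds.

```r
# Standardise height and weight so that scale does not dominate the distance
df_scaled <- scale(df)

# Try 25 random sets of starting centroids and keep the best solution
set.seed(42)
km <- kmeans(df_scaled, centers = 2, nstart = 25)
km$centers  # centroids on the standardised scale
km$cluster  # cluster assignment of each observation
```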

In the next blog, we focus on creating clusters using R: K Means Clustering using R.

29 thoughts on “K Means Clustering Algorithm: Explained”

Vishal Nigam
September 25, 2016 at 1:57 pm | Reply

Excellent Example. No better example found

DnI Institute
September 25, 2016 at 2:21 pm | Reply

Thanks Vishal

Nitesh
October 8, 2016 at 5:55 am | Reply

Very good example..
but there is a text mistake in step 4: the Euclidean distance from cluster 2.

DnI Institute
October 8, 2016 at 6:13 am | Reply


Thanks Nitesh.. We have corrected the spelling.

Kumar P
October 14, 2016 at 9:22 pm | Reply

I am a little confused. This is not an accurate depiction of the k-Means algorithm. k-Means algorithm steps:

K-Means finds the best centroids by alternating between (1) assigning data points to clusters based on the current centroids and (2) choosing centroids (points which are the center of a cluster) based on the current assignment of data points to clusters.

One iteration:
1. Assign labels (clusters) to all observations
2. Calculate the new Centroid values using mean

Ref: http://stanford.edu/~cpiech/cs221/handouts/kmeans.html

In your example you are updating the centroid values even before assigning all the
observations to clusters.

Please clarify.

DnI Institute
October 15, 2016 at 3:04 pm | Reply

Thanks Kumar for your comment.. I do not think there is any overall approach-wise disconnect between the steps we explained and those mentioned in the link.. If you read Step 3, it calculates the Euclidean distance, assigns the observation to a cluster, and updates the cluster centroids. Hope it helps

Pranay A
March 7, 2018 at 4:19 pm | Reply

@DnI Institute,
From step 3, the assignment is only done to the new data points, and the centroids are updated. But what if the data points assigned in the previous iterations have to change from one cluster to the other due to the change in the centroids? I mean the Euclidean distance changes, right? So there could be a possibility that a data point in one cluster is closer to a data point in the other cluster than to a data point in the same cluster. I hope my explanation is good.

Van Tuyen
November 25, 2016 at 10:18 am | Reply

Great, professional.
Your demo is perfect!

Thanks Bro.

Hazim
December 1, 2016 at 6:00 pm | Reply

Perfect.....
Mistake on calculation:
in Step 5, the updated centroid weight value is incorrect, I think.

DnI Institute
December 5, 2016 at 4:57 pm | Reply

Thanks Hazim.. Let us check and correct if required..


Eliazar
January 8, 2017 at 11:51 pm | Reply

Great introduction to k-mean clustering

Noor
May 10, 2017 at 10:14 am | Reply

Great ,easy to understand

Anant Lalchandani
August 15, 2017 at 8:08 pm | Reply

Excellent explanation. I was stuck making my own implementation of K-Means in R, but this post made it easier to implement. Thank you.

Tirtha Chakraborty
October 6, 2017 at 3:11 pm | Reply

There's probably a calculation mistake in the updated centroid values in step 5. But,apart
from that, wonderfully explained. Such a complex thing made so easy!

Rakesh Mondal
November 10, 2017 at 6:46 pm | Reply

The simplest way of learning K Means..

Devendra Shukla
November 20, 2017 at 10:31 am | Reply

Hello,

I have one doubt on K-Means clustering. How can a team work on the K-Means clustering algorithm in a real-time project? If multiple values of K are tried, multiple sets of clusters will be created, so does only one person work on K-Means, or how is it used in a real-time project?

DnI Institute
November 20, 2017 at 2:58 pm | Reply

When a k-means clustering project is being done, multiple values of k are considered. There are a few considerations for selecting the final clustering:

Observation % in each of the clusters

R² value of each of the variables

Overall R² value and other clustering performance statistics, e.g. CCC

Devendra Shukla
November 21, 2017 at 6:46 am | Reply


Thanks, DnI Institute, for your reply. Do you have any link or site where K-Means is used like this?

Observation % in each of the clusters
R² value of each of the variables
Overall R² value and other clustering performance statistics

DnI Institute
November 25, 2017 at 12:52 am | Reply

These are practical steps and considerations, so you may not see a lot of information about them on the internet.

Tas
December 29, 2017 at 6:37 pm | Reply

I am afraid your K-means method is not correct.

You are not updating the centroids of the clusters every time an observation changes a cluster but instead you update when all observations are assigned a cluster (or not!).

Have a look here: https://pdfs.semanticscholar.org/99d0/ea088fb5f545c7ab4a0d77b2df7c68f031ae.pdf

and here (page 388): http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf


Thanks
T.

DnI Institute
February 12, 2018 at 8:50 am | Reply

Thanks for the comments.. We have advised that it is directional and not a pure research
blog. Also, we have mentioned that all objects are reconsidered for the reassignment..

Vijeta
January 29, 2018 at 3:49 am | Reply

What if we only have height, then how will you apply it?

Like if we have the transactions of a person, how do we apply K-Means on that?

Tim
February 24, 2018 at 11:50 pm | Reply

I am afraid this is not a correct version of k-means clustering. First, the centroid is revised every time a new data point is assigned; that may be OK. The more serious problem is that after all data points are assigned, the clustering ends. Assigning all data points is only one step in k-means clustering; the next step is to update the centroids, and these two steps are repeated until no data point changes cluster.

Aarti
May 3, 2018 at 4:38 am | Reply

I think there is a mistake in step 5, in the updated centroid: how do 182.8 and 72 come? There is a calculation mistake. But the rest of the steps are well explained.

Kleber
May 24, 2018 at 11:11 am | Reply


Good job !

Please fix the table in Step 4: the values used in the calculation of the distance to Cluster 2
are not correct (you used the Cluster 1 values again by mistake).

Abhijit Choudhary
August 20, 2018 at 3:24 am | Reply

This is great!!!

Manju Gupta
September 21, 2018 at 8:11 am | Reply

I appreciate your work on Data Science. It's such a wonderful read on Data Science course.
Keep sharing stuffs like this. I am also educating people on similar Data Science training so if
you are interested to know more you can watch this Data Science tutorial:-
https://www.youtube.com/watch?v=h_GnVUIISk0&

RAnu
September 25, 2018 at 8:44 am | Reply

Simplest and clearest example I ever found.

Mahmoud Shaban
January 24, 2019 at 8:07 pm | Reply


Thanks for ever
