
Distance Functions

Given two p-dimensional data objects i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp), the following common distance functions can be defined:

Euclidean Distance Function:
d(i, j) = sqrt((xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xip - xjp)^2)

Manhattan Distance Function:
d(i, j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|
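These two functions can be written in a few lines of Python (a minimal sketch; the function names are our own):

```python
import math

def euclidean_distance(i, j):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan_distance(i, j):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(i, j))

# Example with p = 2 dimensions
print(euclidean_distance((1, 1), (4, 3)))  # sqrt(13) = 3.605...
print(manhattan_distance((1, 1), (4, 3)))  # 5
```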

K-Means Clustering Algorithm:
1. Choose a value of k, the number of clusters to be formed.
2. Randomly select k data points from the data set as the initial cluster centroids/centers.
3. For each data point:
   a. Compute the distance between the data point and each cluster centroid.
   b. Assign the data point to the closest centroid.
4. For each cluster, calculate the new mean based on the data points in the cluster.
5. Repeat steps 3 and 4 until the means of the clusters stop changing or the maximum number of iterations is reached.
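The five steps above can be sketched in plain Python (an illustrative sketch, not a library implementation; the initial centroids are passed in explicitly here, so step 2's random selection is left to the caller and the worked examples below can be reproduced):

```python
def k_means_1d(points, centroids, max_iterations=100):
    """Cluster 1-D points around the given initial centroids (steps 3-5)."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        # Step 3: assign each data point to its closest centroid
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda k: abs(x - centroids[k]))
            clusters[nearest].append(x)
        # Step 4: recompute each cluster mean (keep the old mean if a cluster is empty)
        new_centroids = [sum(c) / len(c) if c else m
                         for c, m in zip(clusters, centroids)]
        # Step 5: stop when the means no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids
```

Calling `k_means_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [4, 11])` reproduces the final clusters of the one-dimensional example below.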
Understanding with a simple example (for one-dimensional data): We will apply k-means to the following 1-dimensional data set for K = 2.
Data set: {2, 4, 10, 12, 3, 20, 30, 11, 25}


Iteration 1
M1, M2 are the two randomly selected centroids/means where
M1= 4, M2=11
and the initial clusters are
C1= {4}, C2= {11}
Computing the distance of each data point to M1 and M2, the points 2 and 3 are added to cluster C1 and all the other data points are added to cluster C2.
Therefore
C1= {2, 4, 3}
C2= {10, 12, 20, 30, 11, 25}
Iteration 2

Calculate new mean of datapoints in C1 and C2.


Therefore
M1= (2+3+4)/3= 3
M2= (10+12+20+30+11+25)/6= 18
Calculating the distance of each data point to M1 = 3 and M2 = 18 and updating the clusters accordingly:
New Clusters
C1= {2, 3, 4, 10}
C2= {12, 20, 30, 11, 25}
Iteration 3

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10)/4= 4.75

M2= (12+20+30+11+25)/5= 19.6

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

Iteration 4

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10+12+11)/6=7

M2= (20+30+25)/3= 25

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}


As we can see, the data points in clusters C1 and C2 in iteration 4 are the same as the data points in clusters C1 and C2 of iteration 3. This means that none of the data points has moved to another cluster, and the means/centroids of these clusters remain constant. So this becomes the stopping condition for our algorithm.
Suppose the data for clustering is {2, 4, 6, 3, 31, 15, 12, 16, 38, 35, 14, 21, 30, 25, 23}. Apply the k-means algorithm to the above data using c1 = 2, c2 = 16 and c3 = 38 as the initial cluster centers.
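A short self-contained script (our own sketch, using the same assign-then-update rule as the worked example above) can be used to check your answer to this exercise:

```python
data = [2, 4, 6, 3, 31, 15, 12, 16, 38, 35, 14, 21, 30, 25, 23]
centroids = [2.0, 16.0, 38.0]  # c1, c2, c3 from the exercise

while True:
    # Assign each data point to its nearest centroid
    clusters = [[] for _ in centroids]
    for x in data:
        nearest = min(range(len(centroids)),
                      key=lambda k: abs(x - centroids[k]))
        clusters[nearest].append(x)
    # Recompute the means; stop when they no longer change
    new_centroids = [sum(c) / len(c) for c in clusters]
    if new_centroids == centroids:
        break
    centroids = new_centroids

for c, m in zip(clusters, centroids):
    print(sorted(c), "mean =", m)
```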
Example 2 (For two-dimensional data)

Example: Suppose we have 4 objects as data points and each object has 2 attributes. Each attribute represents a coordinate of the object.

Object      Attribute 1 (X): weight index   Attribute 2 (Y): pH
Medicine A  1                               1
Medicine B  2                               1
Medicine C  4                               3
Medicine D  5                               4
Thus, we also know beforehand that these objects belong to two groups of
medicine (cluster 1 and cluster 2).
The problem now is to determine which medicines belong to cluster 1 and which medicines belong to the other cluster. Each medicine represents one point with two coordinate components.
1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).

2. Objects-Centroids distance: We calculate the distance between each cluster centroid and each object. Let us use Euclidean distance; the distance matrix at iteration 0 is

D0 = [0     1     3.61  5   ]   C1 = (1, 1)  Group 1
     [1     0     2.83  4.24]   C2 = (2, 1)  Group 2
      A     B     C     D

For example, the distance from medicine C = (4, 3) to the first centroid is
sqrt((4-1)^2 + (3-1)^2) = sqrt(13) ≈ 3.61
and to the second centroid is
sqrt((4-2)^2 + (3-1)^2) = sqrt(8) ≈ 2.83


3. Objects clustering: We assign each object based on the minimum distance. Thus, medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and medicine D to group 2. The element of the Group matrix below is 1 if and only if the object is assigned to that group.

G0 = [1  0  0  0]   Group 1
     [0  1  1  1]   Group 2
      A  B  C  D

4. Iteration-1, determine centroids: Knowing the members of each group, we now compute the new centroid of each group based on these new memberships. Group 1 only has one member, thus the centroid remains c1 = (1, 1). Group 2 now has three members, thus the centroid is the average coordinate of the three members: ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3).

5. Iteration-1, Objects-Centroids distances: The next step is to compute the distance of all objects to the new centroids. Similar to step 2, the distance matrix at iteration 1 is

D1 = [0     1     3.61  5   ]   C1 = (1, 1)       Group 1
     [3.14  2.36  0.47  1.89]   C2 = (11/3, 8/3)  Group 2
      A     B     C     D

6. Iteration-1, Objects clustering: Similar to step 3, we assign each object based on the minimum distance. Based on the new distance matrix, we move medicine B to Group 1 while all the other objects remain where they are. The Group matrix is shown below:

G1 = [1  1  0  0]   Group 1
     [0  0  1  1]   Group 2
      A  B  C  D
7. Iteration-2, determine centroids: Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Group 1 and Group 2 both have two members, thus the new centroids are c1 = ((1+2)/2, (1+1)/2) and c2 = ((4+5)/2, (3+4)/2),
i.e. c1 = (3/2, 1) and c2 = (9/2, 7/2).

D2 = [0.50  0.50  3.20  4.61]   C1 = (3/2, 1)    Group 1
     [4.30  3.54  0.71  0.71]   C2 = (9/2, 7/2)  Group 2
      A     B     C     D
8. Iteration-2, Objects clustering: Again, we assign each object based on the minimum distance.

G2 = [1  1  0  0]   Group 1
     [0  0  1  1]   Group 2
      A  B  C  D
We obtain the result G2 = G1. Comparing the grouping of the last iteration and this iteration reveals that the objects no longer move between groups. Thus, the computation of k-means clustering has reached stability and no more iterations are needed. We get the final grouping as the result:

Object      Attribute 1 (X): weight index   Attribute 2 (Y): pH   Group result
Medicine A  1                               1                     1
Medicine B  2                               1                     1
Medicine C  4                               3                     2
Medicine D  5                               4                     2
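The whole two-dimensional walk-through can be replayed with a short script (a sketch; the variable names are ours, and groups 0/1 here correspond to Group 1/2 above):

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1.0, 1.0), (2.0, 1.0)]  # medicines A and B as initial centroids

def dist(p, q):
    # Euclidean distance between two 2-D points
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

while True:
    # Assign each medicine to its nearest centroid
    groups = {name: min(range(2), key=lambda k: dist(p, centroids[k]))
              for name, p in points.items()}
    # Recompute each centroid as the mean of its members
    new_centroids = []
    for k in range(2):
        members = [points[n] for n, g in groups.items() if g == k]
        new_centroids.append((sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members)))
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(groups)     # A and B end up together; C and D end up together
print(centroids)  # final centroids (3/2, 1) and (9/2, 7/2)
```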
