K-Means Clustering


K-Means Clustering

• K-Means is used when the number of classes is fixed. Distance is used to separate observations into different groups in clustering algorithms.
• Hierarchical clustering is used when the number of classes is unknown.
Cont.
• The most common clustering algorithm covered in machine learning for beginners is K-Means. The first step is to create K new points among our unlabelled data and place them randomly; these are called centroids, and the number of centroids represents the number of output classes. Each step of the iterative process first assigns every observation to its nearest centroid (in terms of Euclidean distance). Next, for each class, the average of all the points attributed to that class is computed; the output is the new centroid of the class.
• With every iteration, observations can be reassigned to another centroid. After several iterations, the centroids' change in location becomes less significant as the initial random centroids converge with the real ones, and the process ends when the centroids' positions no longer change. Many methods can be employed to choose the number of clusters, but a common one is the 'elbow method'. It looks for a low level of variation within the clusters, measured by the within-cluster sum of squares (WCSS). Since WCSS decreases as the number of centroids grows, reaching zero when every observation is its own centroid, simply setting the highest possible number of centroids would be inconsistent; instead, one picks the point where adding more centroids stops producing large reductions in WCSS.
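
As an illustration of the elbow method, the sketch below (assuming scikit-learn and matplotlib are available, and using synthetic placeholder data) plots WCSS against the number of centroids; the bend, or 'elbow', in the curve suggests a reasonable K:

```python
# Elbow method sketch: plot within-cluster sum of squares (WCSS) against K
# and look for the bend where adding centroids stops paying off.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # placeholder data

wcss = []
ks = range(1, 8)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ = within-cluster sum of squares

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of centroids K")
plt.ylabel("WCSS")
plt.show()
```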
K-Means Clustering
• Initially, we determine the number of clusters K and assume the centroids or centers of these clusters.
• We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.
• Then, the K-means algorithm repeats the following three steps until convergence, i.e., until stable (no object moves group); see the sketch after this list:
1. Determine the centroid coordinates
2. Determine the distance of each object to the centroids
3. Group the objects based on minimum distance
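
A minimal NumPy sketch of these three steps follows; the function name kmeans and its defaults are illustrative assumptions, not part of the original material:

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    # Take the first k objects as the initial centroids (one of the options above)
    centroids = X[:k].astype(float)
    groups = None
    for _ in range(max_iter):
        # Step 2: Euclidean distance of each object to each centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: group each object by minimum distance
        new_groups = d.argmin(axis=1)
        if groups is not None and np.array_equal(new_groups, groups):
            break  # stable: no object moved group
        groups = new_groups
        # Step 1: recompute each centroid as the mean of its members
        for j in range(k):
            if np.any(groups == j):
                centroids[j] = X[groups == j].mean(axis=0)
    return centroids, groups
```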
Example
• Suppose we have several objects (4 types of medicines), and each object has two attributes or features, as shown in the table below.
• The goal is to group these objects into K = 2 groups of medicines based on the two features (weight index and pH).
Object        Attribute 1 (X): weight index    Attribute 2 (Y): pH
Medicine A    1                                1
Medicine B    2                                1
Medicine C    4                                3
Medicine D    5                                4

Each medicine represents one point with two attributes (X, Y), so we can represent it as a coordinate in attribute space, as shown in the figure below.
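
The figure can be reproduced with a quick matplotlib scatter plot (a sketch, assuming matplotlib is installed):

```python
import matplotlib.pyplot as plt

xs, ys = [1, 2, 4, 5], [1, 1, 3, 4]           # (weight index, pH) for A-D
for x, y, name in zip(xs, ys, ["A", "B", "C", "D"]):
    plt.annotate(name, (x, y))                 # label each medicine
plt.scatter(xs, ys)
plt.xlabel("Attribute 1 (X): weight index")
plt.ylabel("Attribute 2 (Y): pH")
plt.show()
```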
1. Initial value of centroids:
• Suppose we use medicine A and medicine B as the first centroids.
• Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
2. Objects-Centroids Distance:
• We calculate the distance from each cluster centroid to each object.
• Using Euclidean distance, the distance matrix at iteration 0 is

$$D^0 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{bmatrix} \begin{matrix} \leftarrow c_1 = (1,1),\ \text{group 1} \\ \leftarrow c_2 = (2,1),\ \text{group 2} \end{matrix}$$

where the columns correspond to objects A, B, C, D with coordinates X = (1, 2, 4, 5) and Y = (1, 1, 3, 4).
Cont.
• Each column in the distance matrix corresponds to one object. The first row holds the distance of each object to the first centroid, and the second row the distance of each object to the second centroid. For example, the distance from medicine C = (4, 3) to the first centroid c1 = (1, 1) is

$$\sqrt{(4-1)^2 + (3-1)^2} = \sqrt{13} \approx 3.61$$

• and the distance to the second centroid c2 = (2, 1) is

$$\sqrt{(4-2)^2 + (3-1)^2} = \sqrt{8} \approx 2.83$$
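
The whole distance matrix can be checked in one call (a sketch assuming SciPy is available):

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])  # medicines A, B, C, D
centroids = np.array([[1, 1], [2, 1]])               # c1, c2
print(np.round(cdist(centroids, points), 2))         # pairwise Euclidean distances
# [[0.   1.   3.61 5.  ]
#  [1.   0.   2.83 4.24]]
```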
3. Object Clustering:
• We assign each object based on the minimum distance.
• Thus, medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2, and medicine D to group 2.
• The element of the group matrix below is 1 if and only if the object is assigned to that group:

$$C^0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{bmatrix} \begin{matrix} \leftarrow \text{group 1} \\ \leftarrow \text{group 2} \end{matrix}$$

with columns A, B, C, D.
4. Iteration 1: Determine Centroids
• Knowing the members of each group, we now compute the centroid of each group based on these new memberships.
• Group 1 has only one member, so its centroid remains c1 = (1, 1).
• Group 2 now has three members, so its centroid is the average coordinate of the three members:

$$c_2 = \left(\frac{2+4+5}{3}, \frac{1+3+4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right)$$
5. Iteration 1: Objects-Centroids Distances
• The next step is to compute the distance of all objects to the new centroids.
• Similar to step 2, the distance matrix at iteration 1 is

$$D^1 = \begin{bmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{bmatrix} \begin{matrix} \leftarrow c_1 = (1, 1),\ \text{group 1} \\ \leftarrow c_2 = \left(\tfrac{11}{3}, \tfrac{8}{3}\right),\ \text{group 2} \end{matrix}$$

with columns A, B, C, D as before.
6. Iteration 1: Objects Clustering
• Similar to step 3, we assign each object based on the minimum distance.
• Based on the new distance matrix, medicine B moves to group 1 while all the other objects remain.
• The group matrix becomes:

$$G^1 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{matrix} \leftarrow \text{group 1} \\ \leftarrow \text{group 2} \end{matrix}$$

with columns A, B, C, D.
7. Iteration 2: Determine Centroids
• Now we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration.
• Both group 1 and group 2 now have two members, so the new centroids are:

$$c_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = \left(\frac{3}{2}, 1\right), \qquad c_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = \left(\frac{9}{2}, \frac{7}{2}\right)$$
8. Iteration 2: Objects-Centroids Distances
• Repeating step 2, the new distance matrix at iteration 2 is

$$D^2 = \begin{bmatrix} 0.5 & 0.5 & 3.20 & 4.61 \\ 4.30 & 3.54 & 0.71 & 0.71 \end{bmatrix} \begin{matrix} \leftarrow c_1 = \left(\tfrac{3}{2}, 1\right),\ \text{group 1} \\ \leftarrow c_2 = \left(\tfrac{9}{2}, \tfrac{7}{2}\right),\ \text{group 2} \end{matrix}$$

with columns A, B, C, D.
9. Iteration 2: Objects Clustering
• Again, we assign each object based on the minimum distance.
• The group matrix is:

$$G^2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix} \begin{matrix} \leftarrow \text{group 1} \\ \leftarrow \text{group 2} \end{matrix}$$

with columns A, B, C, D.
• We obtain G^2 = G^1: comparing the grouping of the last iteration with this one reveals that the objects no longer move between groups. Thus, the K-means computation has reached stability and no more iterations are needed.
• Therefore, we have the final result.
Final Grouping of the Medicines

Object        Attribute 1 (X): weight index    Attribute 2 (Y): pH    Group result
Medicine A    1                                1                      1
Medicine B    2                                1                      1
Medicine C    4                                3                      2
Medicine D    5                                4                      2
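
Running the kmeans sketch from earlier on the medicine data reproduces this grouping (group labels are 0-based in the code, so 1 is added for display):

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])  # medicines A-D: (weight index, pH)
centroids, groups = kmeans(X, k=2)              # kmeans() from the earlier sketch
print(groups + 1)   # [1 1 2 2]: A, B in group 1; C, D in group 2
print(centroids)    # [[1.5 1. ]
                    #  [4.5 3.5]]
```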
K-Means vs Hierarchical
As we know, clustering is a subjective statistical analysis, and there is more than one appropriate algorithm for any given dataset and type of problem. So how do we choose between K-means and hierarchical clustering?
• If there is a specific number of clusters in the dataset, but the groups the observations belong to are unknown, choose K-means.
• If the distinctions are based on prior beliefs, hierarchical clustering should be used to learn the number of clusters.
• With a large number of variables, K-means computes faster.
• The result of K-means is unstructured, but that of hierarchical clustering is more interpretable and informative.
• It is easier to determine the number of clusters from hierarchical clustering's dendrogram (see the sketch after this list).
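
As a sketch of reading cluster counts off a dendrogram, the snippet below builds one for the four medicines (assuming SciPy and matplotlib are installed; the average-linkage choice is an assumption for illustration):

```python
# Build and plot a dendrogram for the four medicines.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = [[1, 1], [2, 1], [4, 3], [5, 4]]      # medicines A-D
Z = linkage(X, method="average")          # agglomerative merge history
dendrogram(Z, labels=["A", "B", "C", "D"])
plt.ylabel("merge distance")              # cut the tree at a height to pick K
plt.show()
```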
K-means Clustering vs Hierarchical Clustering

• K-means uses a pre-specified number of clusters; the method assigns records to each cluster to find mutually exclusive clusters of spherical shape based on distance.
  Hierarchical methods can be either divisive or agglomerative.
• K-means needs advance knowledge of K, i.e., the number of clusters into which one wants to divide the data.
  In hierarchical clustering, one can stop at any number of clusters found appropriate by interpreting the dendrogram.
• In K-means, one can use the median or mean as a cluster center to represent each cluster.
  Agglomerative methods begin with n clusters and sequentially combine similar clusters until only one cluster is obtained.
• K-means methods are normally less computationally intensive and are suited to very large datasets.
  Divisive methods work in the opposite direction, beginning with one cluster that includes all the records; hierarchical methods are especially useful when the target is to arrange the clusters into a natural hierarchy.
• In K-means, since one starts with a random choice of centroids, the results produced by running the algorithm many times may differ.
  In hierarchical clustering, the results are reproducible.
• K-means clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
  A hierarchical clustering is a set of nested clusters arranged as a tree.
• K-means clustering is found to work well when the structure of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
  Hierarchical clustering does not work well when the shape of the clusters is hyperspherical.

Advantages of K-means:
1. Convergence is guaranteed.
2. Specialized to clusters of different sizes and shapes.
Disadvantages of K-means (from the points above):
1. The number of clusters K must be known in advance.
2. Results depend on the random choice of initial centroids, so repeated runs may differ.

Advantages of hierarchical clustering:
1. Ease of handling any form of similarity or distance.
2. Consequently, applicability to any attribute type.
Disadvantages of hierarchical clustering (from the points above):
1. More computationally intensive than K-means.
2. Less suited to very large datasets.
