Cluster Analysis
Talha Farooq
Faizan Ali
Muhammad Abdul Basit
1. k-means clustering
2. Hierarchical clustering
2 K-means Clustering
K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
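As a small illustration (hypothetical toy data, not the data set analysed later in this report), the centers component returned by R's kmeans function is exactly this per-cluster mean:

set.seed(1)
toy <- matrix(rnorm(20), ncol = 2)   # 10 hypothetical two-dimensional observations
km  <- kmeans(toy, centers = 2)      # partition them into k = 2 clusters
km$centers                           # the centroid (mean) of each cluster
colMeans(toy[km$cluster == 1, ])     # matches the first row of km$centers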
The choice of distance measures is a critical step in clustering. It defines how the similarity of two elements (x, y) is calculated and it will influence the shape of the clusters. The classical methods for distance measures are Euclidean and Manhattan distances, which are defined as follows:
Euclidean Distance:
$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Manhattan Distance:
$$d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
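As a quick sketch (the two vectors are made-up values, used only to illustrate the formulas), both measures are available through the dist function in base R:

x <- c(1, 2, 3)
y <- c(4, 0, 3)
dist(rbind(x, y), method = "euclidean")   # sqrt((1-4)^2 + (2-0)^2 + (3-3)^2)
dist(rbind(x, y), method = "manhattan")   # |1-4| + |2-0| + |3-3|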
The within-cluster variation for a cluster $C_k$ is defined as:
$$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2$$
where $\mu_k$ is the mean value of the points assigned to the cluster $C_k$. Each observation ($x_i$) is assigned to a given cluster such that the sum of squared (SS) distances of the observations to their assigned cluster centers ($\mu_k$) is minimized.
We define the total within-cluster variation as follows:
$$\text{tot.withinss} = \sum_{k=1}^{k} W(C_k) = \sum_{k=1}^{k} \sum_{x_i \in C_k} (x_i - \mu_k)^2$$
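As a minimal sketch (again on hypothetical toy data), this total within-cluster variation can be recomputed by hand and compared with the tot.withinss value that kmeans reports:

set.seed(1)
toy <- matrix(rnorm(40), ncol = 2)            # 20 hypothetical observations
km  <- kmeans(toy, centers = 2, nstart = 20)
wss <- sapply(1:2, function(k) {
  pts <- toy[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)         # W(C_k): squared distances to the cluster mean
})
sum(wss)           # total within-cluster variation computed by hand
km$tot.withinss    # the same quantity as reported by kmeans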
The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution. The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.
Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance between the object and the cluster mean. This step is called the cluster assignment step. After the assignment step, the algorithm computes the new mean value of each cluster.
The term cluster centroid update is used to describe this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned again using the updated cluster means. The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration.
The k-means algorithm can be summarized as follows:
1. Specify the number of clusters (k) to be created.
2. Select randomly k objects from the data set as the initial cluster centers or means.
3. Assign each observation to its closest centroid, based on the Euclidean distance between the object and the centroid.
4. For each of the k clusters, update the cluster centroid by calculating the new mean value of all the data points in the cluster.
5. Iteratively minimize the total within sum of square. That is, iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached (a minimal sketch of this iteration follows the list).
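To make these steps concrete, the sketch below implements the same iteration in plain R on hypothetical toy data (the helper name assign_to_nearest is ours, not part of the report; in practice the built-in kmeans function is used, as in Section 5):

set.seed(1)
toy <- matrix(rnorm(60), ncol = 2)                # 30 hypothetical observations
k   <- 2
centers <- toy[sample(nrow(toy), k), ]            # step 2: random initial centroids
assign_to_nearest <- function(pts, ctrs) {
  # step 3: index of the closest centroid (Euclidean distance) for every point
  apply(pts, 1, function(p) which.min(colSums((t(ctrs) - p)^2)))
}
repeat {
  cl <- assign_to_nearest(toy, centers)           # cluster assignment step
  new_centers <- t(sapply(1:k, function(j)        # step 4: recompute each cluster mean
    colMeans(toy[cl == j, , drop = FALSE])))
  if (all(new_centers == centers)) break          # step 5: stop when nothing changes
  centers <- new_centers
}
table(cl)                                         # cluster sizes after convergence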
The data set used in the following analysis is available at:
https://archive.ics.uci.edu/ml/datasets/parkinson+Disease+Spiral+Drawings+Using+Digitized+Graphics+Tablet
First of all, we load our data in R using the read.delim command; read.delim reads delimited text files. We use the fix(X) command to inspect and edit the loaded data. To show multiple plots at the same time in one figure, we set up the plotting window accordingly; for our data, we set it up with three plots in a row.
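A minimal sketch of this loading step (the file path and the semicolon separator are taken from the full listing in Section 5):

X <- read.delim("hw_dataset/parkinson/P_02100001.txt", sep = ";")   # read the semicolon-separated text file
fix(X)                                                              # open the data frame in an editor window
dim(X)                                                              # number of rows and columns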
We group our data into 2 clusters. The kmeans function also has an nstart option, which runs the algorithm with multiple random starting assignments and reports only the best result. The output of kmeans is a list with several bits of information, the most important being:
withinss : Vector of within-cluster sum of squares, one component
per cluster.
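For example, with the 2-cluster solution km.out computed in Section 5, these components can be inspected directly:

km.out$withinss       # within-cluster sum of squares, one value per cluster
km.out$tot.withinss   # their sum: the total within-cluster variation
km.out$cluster        # the cluster assigned to each observation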
3 Hierarchical Cluster Analysis
Hierarchical clustering is an alternative approach to k-means clustering for
identifying groups in the dataset. It does not require us to predened the
number of clusters to be generated as is required by the k-means
approach.Furthermore, hierarchical clustering has an added advantage over
K-means clustering in that it results in an attractive tree-based
representation of the observations, called a dendrogram.
Agglomerative clustering
However, a bigger question is: how do we measure the dissimilarity between two clusters of observations? A number of different cluster agglomeration methods (i.e., linkage methods) have been developed to answer this question.
We can see the differences between these approaches in the following dendrograms:
3.3 Hierarchical Clustering with R
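A minimal sketch of the commands that produce this dendrogram (sd.data is the scaled data prepared in Section 5, where the full listing appears):

data.dist <- dist(sd.data)    # Euclidean distance matrix of the scaled data
hc.out <- hclust(data.dist)   # agglomerative clustering, complete linkage by default
plot(hc.out, main = "Complete Linkage", xlab = "", sub = "", ylab = "")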
In the dendrogram displayed above, each leaf corresponds to one
observation. As we move up the tree, observations that are similar to each
other are combined into branches. The height of the cut to the dendrogram
controls the number of clusters obtained. It plays the same role as the k in
k-means clustering. In order to identify sub-groups (i.e., clusters), we can cut the dendrogram with cutree.
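For example, cutting the tree from the sketch above into two groups (the choice of k = 2 is ours, for illustration only):

groups <- cutree(hc.out, k = 2)   # cut the dendrogram so that 2 clusters remain
table(groups)                     # number of observations in each cluster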
4 Determining the Optimal Number of Clusters
The optimal number of clusters can be determined with the following popular methods:
1. Elbow method
2. Silhouette method
3. Gap statistic
Recall that the basic idea behind cluster partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation (known as total within-cluster variation or total within-cluster sum of square) is minimized:
$$\text{minimize} \left( \sum_{k=1}^{k} W(C_k) \right)$$
where $C_k$ is the $k^{\text{th}}$ cluster and $W(C_k)$ is the within-cluster variation. The total within-cluster sum of square (wss) measures the compactness of the clustering and we want it to be as small as possible. Thus, we can use the following procedure to define the optimal clusters:
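A minimal sketch of this procedure (the elbow method) with factoextra; computing the curve for up to 10 clusters is an illustrative choice, and an equivalent manual loop appears in Section 5:

library(factoextra)
# Plot the total within-cluster sum of squares against k; the bend ("elbow")
# in the curve suggests a suitable number of clusters.
fviz_nbclust(sd.data, kmeans, method = "wss", k.max = 10)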
4.2 Average Silhouette Method
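The average silhouette approach measures how well each observation lies within its cluster; a high average silhouette width indicates a good clustering. A minimal sketch with factoextra (default settings assumed):

# Average silhouette width for a range of k; the k that maximizes it is preferred.
fviz_nbclust(sd.data, kmeans, method = "silhouette")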
5 R Codes
X <- read.delim("hw_dataset/parkinson/P_02100001.txt", sep = ";")   # load the spiral-drawing data
library(tidyverse)
library(cluster)
library(factoextra)
fix(X)                           # inspect/edit the data frame in an editor window
dim(X)
sd.data <- scale(X)              # standardize the variables
head(sd.data)
distance <- get_dist(sd.data)    # distance matrix (Euclidean by default)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
par(mfrow = c(1, 4))             # arrange subsequent base plots in one row of four
km.out <- kmeans(sd.data, 2, nstart = 20)    # k-means with k = 2 and 20 random starts
str(km.out)
km.out
fviz_cluster(km.out, data = sd.data)         # visualize the 2-cluster solution
km.out$cluster
plot(sd.data, col = (km.out$cluster + 1),
     main = "K Means Clustering Results with K=2",
     xlab = "", ylab = "", pch = 20, cex = 2)
set.seed(4)
km.out3 <- kmeans(sd.data, 3, nstart = 20)   # k-means with k = 3
km.out3
plot(sd.data, col = (km.out3$cluster + 1),
     main = "K Means Clustering Results with K=3",
     xlab = "", ylab = "", pch = 20, cex = 2)
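# NOTE: the original listing uses k.values and wss_values below without defining
# them; a plausible reconstruction of the elbow-method loop is:
wss <- function(k) kmeans(sd.data, k, nstart = 10)$tot.withinss
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)   # purrr::map_dbl, loaded via tidyverse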
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
############## gap statistics ###################
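# NOTE: gap_stat is not defined in the original listing; it is presumably computed
# with cluster::clusGap, for example:
gap_stat <- clusGap(sd.data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)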
fviz_gap_stat ( gap_stat )
##################### final cluster ################
set . seed (123)
final <- kmeans(sd.data, 9, nstart = 25)   # final k-means model with 9 clusters
print(final)
X %>%
  mutate(Cluster = final$cluster) %>%   # append the cluster label to the raw data
  group_by(Cluster) %>%
  summarise_all("mean")                 # mean of every variable within each cluster
########### Hierarchical Clustering ##########
data.dist <- dist(sd.data)    # Euclidean distance matrix of the scaled data
nci.data <- X
plot(hclust(data.dist),
     main = "Complete Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "average"),
     main = "Average Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "single"),
     main = "Single Linkage", xlab = "", sub = "", ylab = "")
hc.out <- hclust(dist(sd.data))
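# NOTE: the original listing calls abline() without a preceding plot; presumably
# the dendrogram was drawn first, for example:
plot(hc.out, main = "Complete Linkage", xlab = "", sub = "", ylab = "")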
abline(h = 8, col = "red")    # horizontal line marking a cut height of 8
hc.out
################################ Ward's method ##########################
# Ward's method
hc5 <- hclust(data.dist, method = "ward.D2")
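# NOTE: sub_grp is not defined in the original listing; presumably the Ward
# dendrogram is cut into 9 groups (matching the k = 9 used above), for example:
sub_grp <- cutree(hc5, k = 9)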
table(sub_grp)                           # number of observations in each group
X %>%
  mutate(cluster = sub_grp) %>%          # attach the group label to the raw data
  head
plot(hc5, cex = 0.6)                     # Ward dendrogram
rect.hclust(hc5, k = 9, border = 2:10)   # draw borders around the 9 clusters