
Cluster Analysis

Talha Farooq
Faizan Ali
Muhammad Abdul basit

School of Natural Sciences


MS Statistics

National University of Sciences & Technology


Pakistan
1 Cluster Analysis
Cluster analysis is a form of exploratory data analysis in which
observations are divided into different groups that share common
characteristics.
The purpose of cluster analysis (also known as classification) is to
construct groups (or classes, or clusters) while ensuring the following
properties:

1. Within a group, the observations must be as similar as possible.

2. Observations belonging to different groups must be as different
   as possible.

There are two main types of classification:

1. k-means clustering

2. Hierarchical clustering

The first is generally used when the number of classes is fixed in
advance, while the second is generally used for an unknown number of
classes and helps to determine this optimal number.

2 K-means Clustering
K-means clustering is the most commonly used unsupervised machine
learning algorithm for partitioning a given data set into a set of k groups
(i.e., k clusters), where k represents the number of groups pre-specified by
the analyst. It classifies objects into multiple groups (i.e., clusters), such that
objects within the same cluster are as similar as possible (i.e., high
intra-class similarity), whereas objects from different clusters are as
dissimilar as possible (i.e., low inter-class similarity). In k-means clustering,
each cluster is represented by its center (i.e., centroid), which corresponds to
the mean of the points assigned to the cluster.

2.1 Clustering Distance Measures

The classification of observations into groups requires some method for
computing the distance or the (dis)similarity between each pair of
observations. The result of this computation is known as a dissimilarity or
distance matrix. There are many methods to calculate this distance
information; the choice of distance measure is a critical step in clustering.
It defines how the similarity of two elements (x, y) is calculated and it will
influence the shape of the clusters. The classical methods for distance
measures are the Euclidean and Manhattan distances, which are defined as
follows:

Euclidean Distance:

\[ d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

Manhattan Distance:

\[ d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

where x and y are two vectors of length n.

The choice of distance measure is very important, as it has a strong
influence on the clustering results. For most common clustering software,
the default distance measure is the Euclidean distance. However,
depending on the type of data and the research questions, other
dissimilarity measures might be preferred, and you should be aware of the
options.
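As a quick illustration (the numeric vectors below are arbitrary, made-up
values), the two distances can be computed directly in R or with the
built-in dist function:

x <- c(1, 2, 3, 4)
y <- c(2, 4, 1, 3)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((x - y)^2))

# Manhattan distance: sum of absolute differences
sum(abs(x - y))

# The same results using the built-in dist() function
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")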

2.2 The Basic Idea

The basic idea behind k-means clustering consists of defining clusters so
that the total intra-cluster variation (known as the total within-cluster
variation) is minimized. There are several k-means algorithms available.
The standard algorithm is the Hartigan-Wong algorithm (1979), which
defines the total within-cluster variation as the sum of squared Euclidean
distances between items and the corresponding centroid:

\[ W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2 \]

Where:

• xi is a data point belonging to the cluster Ck.

• µk is the mean value of the points assigned to the cluster Ck.

Each observation (xi) is assigned to a given cluster such that the sum of
squared (SS) distances of the observation to its assigned cluster center
(µk) is minimized.
We define the total within-cluster variation as follows:

\[ \text{tot.withinss} = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \]

The total within-cluster sum of squares measures the compactness
(i.e., goodness) of the clustering, and we want it to be as small as possible.
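As a small numerical check of this definition (on arbitrary simulated data,
not the Parkinson's data set used later), the quantity reported by kmeans()
as tot.withinss equals the sum of squared distances of every point to its
assigned centroid:

set.seed(1)
toy <- matrix(rnorm(100), ncol = 2)          # 50 points, 2 variables
km <- kmeans(toy, centers = 3, nstart = 20)

# Sum of squared distances of each point to its own cluster centre
manual <- sum((toy - km$centers[km$cluster, ])^2)
all.equal(manual, km$tot.withinss)           # TRUE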

2.3 K-means Algorithm

The first step when using k-means clustering is to indicate the number of
clusters (k) that will be generated in the final solution. The algorithm
starts by randomly selecting k objects from the data set to serve as the
initial centers for the clusters. The selected objects are also known as
cluster means or centroids.
Next, each of the remaining objects is assigned to its closest centroid,
where closest is defined using the Euclidean distance between the object
and the cluster mean. This step is called the cluster assignment step. After
the assignment step, the algorithm computes the new mean value of each
cluster; the term cluster centroid update is used to describe this step. Now
that the centers have been recalculated, every observation is checked again
to see whether it might be closer to a different cluster, and all the objects
are reassigned using the updated cluster means. The cluster assignment and
centroid update steps are iteratively repeated until the cluster assignments
stop changing (i.e., until convergence is achieved); that is, the clusters
formed in the current iteration are the same as those obtained in the
previous iteration.

The k-means algorithm can be summarized as follows (a minimal sketch in
R is given after this list):

1. Specify the number of clusters (k) to be created.

2. Randomly select k objects from the data set as the initial cluster
   centers (means).

3. Assign each observation to its closest centroid, based on the
   Euclidean distance between the object and the centroid.

4. For each of the k clusters, update the cluster centroid by calculating
   the new mean value of all the data points in the cluster. The
   centroid of the kth cluster is a vector of length p containing the means
   of all variables for the observations in the kth cluster, where p is the
   number of variables.

5. Iteratively minimize the total within-cluster sum of squares. That is,
   iterate steps 3 and 4 until the cluster assignments stop changing or
   the maximum number of iterations is reached.
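The following is a minimal sketch of these steps in base R, for illustration
only; the function name simple_kmeans and the max_iter safeguard are our
own additions, and in practice the built-in kmeans() function should be
preferred:

simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 2: randomly pick k rows as the initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # Step 3: assign each observation to its closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    # Step 5: stop once the cluster assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 4: recompute each centroid as the mean of its assigned points
    # (this sketch ignores the rare case of a cluster becoming empty)
    for (j in 1:k) {
      centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers)
}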

2.4 Computing k-means clustering in R

We can compute k-means in R with the kmeans function. The data are
taken from the UCI Machine Learning Repository; the link is given below:

https://archive.ics.uci.edu/ml/datasets/parkinson+Disease+Spiral+Drawings+Using+Digitized+Graphics+Tablet

2.4.1 Data set information


The handwriting database consists of 62 PWP (people with Parkinson's
disease) and 15 healthy individuals. Three types of recordings (Static Spiral
Test, Dynamic Spiral Test and Stability Test) are taken.

Data set characteristics:     Multivariate
Attribute characteristics:    Integer
Number of instances:          77
Number of attributes:         7
Associated tasks:             Classification, Regression, Clustering

First of all, we load our data into R using the read.delim command, which
reads delimited text files. We use the fix(X) command to display the data
in a separate editor window. Then we use the scale(X) function, which
standardizes each variable in the data set. The par command is used to
display several plots in one figure; here we set it up to show the plots side
by side in a single row.
We group our data into 2 clusters. The kmeans function also has an
nstart option that attempts multiple initial configurations and reports
the best one.

The output of kmeans is a list with several components. The most
important are:

• cluster: a vector of integers (from 1:k) indicating the cluster to
  which each point is allocated.

• centers: a matrix of cluster centers.

• totss: the total sum of squares.

• withinss: a vector of within-cluster sums of squares, one component
  per cluster.

• tot.withinss: the total within-cluster sum of squares, i.e.,
  sum(withinss).

• betweenss: the between-cluster sum of squares, i.e.,
  totss - tot.withinss.

• size: the number of points in each cluster.

We can also view our results by plotting them; a minimal example follows.
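A minimal sketch of this workflow is shown below (the file path follows
the one used in the R code section at the end; any numeric data frame
would work):

X <- read.delim("hw_dataset/parkinson/P_02100001.txt", sep = ";")
sd.data <- scale(X)                    # standardize every variable

km.out <- kmeans(sd.data, centers = 2, nstart = 20)
km.out$cluster                         # cluster allocation of each observation
km.out$centers                         # matrix of cluster centers
km.out$tot.withinss                    # total within-cluster sum of squares
km.out$size                            # number of points in each cluster

# Visualize the clusters (points projected onto the first two principal components)
library(factoextra)
fviz_cluster(km.out, data = sd.data)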

3 Hierarchical Cluster Analysis
Hierarchical clustering is an alternative approach to k-means clustering for
identifying groups in the data set. It does not require us to predefine the
number of clusters to be generated, as is required by the k-means
approach. Furthermore, hierarchical clustering has an added advantage over
k-means clustering in that it results in an attractive tree-based
representation of the observations, called a dendrogram.

3.1 Hierarchical Clustering Algorithms

Hierarchical clustering can be divided into two main types:

• Agglomerative clustering

• Divisive hierarchical clustering

3.1.1 Agglomerative clustering:


It is also known as AGNES (Agglomerative Nesting) and works in a
bottom-up manner. That is, each object is initially considered as a
single-element cluster (leaf). At each step of the algorithm, the two
clusters that are the most similar are combined into a new, bigger cluster
(node). This procedure is iterated until all points are members of just one
single big cluster (root). The result is a tree which can be plotted as a
dendrogram.

3.1.2 Divisive hierarchical clustering:


It is also known as DIANA (Divisive Analysis) and works in a top-down
manner. The algorithm is the inverse of AGNES. It begins with the
root, in which all objects are included in a single cluster. At each step of
the iteration, the most heterogeneous cluster is divided into two. The
process is iterated until all objects are in their own cluster.

Note that agglomerative clustering is good at identifying small clusters,
whereas divisive hierarchical clustering is good at identifying large clusters.
A brief sketch of both approaches using the cluster package is given below.
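The following sketch uses the cluster package and the standardized data
sd.data defined in the k-means section; the choice of complete linkage for
AGNES is only for illustration:

library(cluster)

# Agglomerative clustering (AGNES), bottom-up
hc.agnes <- agnes(sd.data, method = "complete")
hc.agnes$ac        # agglomerative coefficient (closer to 1 = stronger structure)

# Divisive clustering (DIANA), top-down
hc.diana <- diana(sd.data)
hc.diana$dc        # divisive coefficient

# Both results can be plotted as dendrograms
pltree(hc.agnes, cex = 0.6, main = "AGNES dendrogram")
pltree(hc.diana, cex = 0.6, main = "DIANA dendrogram")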

As we learned for k-means, we measure the (dis)similarity of
observations using distance measures (e.g., Euclidean distance, Manhattan
distance, etc.).

However, a bigger question is:

How do we measure the dissimilarity between two clusters of
observations?

A number of different cluster agglomeration methods (i.e., linkage methods)
have been developed to answer this question. The most common methods
are:

• Maximum or complete linkage clustering: It computes all
  pairwise dissimilarities between the elements in cluster 1 and the
  elements in cluster 2, and considers the largest value (i.e., maximum
  value) of these dissimilarities as the distance between the two
  clusters. It tends to produce more compact clusters.

• Minimum or single linkage clustering: It computes all pairwise
  dissimilarities between the elements in cluster 1 and the elements in
  cluster 2, and considers the smallest of these dissimilarities as the
  linkage criterion. It tends to produce long, loose clusters.

• Mean or average linkage clustering: It computes all pairwise
  dissimilarities between the elements in cluster 1 and the elements in
  cluster 2, and considers the average of these dissimilarities as the
  distance between the two clusters.

• Centroid linkage clustering: It computes the dissimilarity
  between the centroid for cluster 1 (a mean vector of length p, where p
  is the number of variables) and the centroid for cluster 2.

• Ward's minimum variance method: It minimizes the total
  within-cluster variance. At each step the pair of clusters with the
  minimum between-cluster distance is merged.

We can see the differences between these approaches by comparing their
dendrograms.

3.2 Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared
as follows:

1. Rows are observations (individuals) and columns are variables.

2. Any missing value in the data must be removed or estimated.

3. The data must be standardized (i.e., scaled) to make variables
   comparable.

3.3 Hierarchical Clustering with R

There are different functions available in R for computing hierarchical
clustering. We will use hclust. For measuring distances between the
clusters, we use different linkage methods such as complete linkage,
average linkage and single linkage.

We can then plot the dendrogram.

In a dendrogram, each leaf corresponds to one observation. As we move up
the tree, observations that are similar to each other are combined into
branches. The height of the cut to the dendrogram controls the number of
clusters obtained; it plays the same role as the k in k-means clustering. In
order to identify sub-groups (i.e., clusters), we can cut the dendrogram
with cutree, as sketched below.
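A minimal sketch of these steps (distance matrix, hclust with the linkage
methods above, and cutree); the choice of k = 2 is for illustration only:

data.dist <- dist(sd.data)                         # Euclidean distance matrix

hc.complete <- hclust(data.dist, method = "complete")
hc.average  <- hclust(data.dist, method = "average")
hc.single   <- hclust(data.dist, method = "single")

# Plot one of the dendrograms
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)

# Cut the tree to obtain cluster labels
clusters <- cutree(hc.complete, k = 2)
table(clusters)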

4 Determining Optimal Clusters


As you may recall, the analyst specifies the number of clusters to use;
preferably, the analyst would like to use the optimal number of clusters. To
aid the analyst, the following explains the three most popular methods for
determining the optimal number of clusters:

1. Elbow method

2. Silhouette method

3. Gap statistic

4.1 Elbow Method

Recall that the basic idea behind cluster partitioning methods, such as
k-means clustering, is to define clusters such that the total intra-cluster
variation (known as the total within-cluster variation or total within-cluster
sum of squares) is minimized:

\[ \text{minimize} \left( \sum_{k=1}^{K} W(C_k) \right) \]

where Ck is the kth cluster and W(Ck) is the within-cluster variation. The
total within-cluster sum of squares (wss) measures the compactness of the
clustering and we want it to be as small as possible. Thus, we can use the
following algorithm to define the optimal number of clusters (an R sketch
follows the list):

1. Compute the clustering algorithm (e.g., k-means clustering) for
   different values of k, for instance by varying k from 1 to 10 clusters.

2. For each k, calculate the total within-cluster sum of squares (wss).

3. Plot the curve of wss according to the number of clusters k.

4. The location of a bend (knee) in the plot is generally considered as
   an indicator of the appropriate number of clusters.
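A hedged sketch of this procedure, both by hand and with fviz_nbclust
from the factoextra package (the range k = 1, ..., 10 is an arbitrary choice):

library(factoextra)

set.seed(123)
# Manual computation of the total within-cluster sum of squares for each k
wss <- sapply(1:10, function(k) kmeans(sd.data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

# The same plot in one call
fviz_nbclust(sd.data, kmeans, method = "wss", k.max = 10)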

4.2 Average Silhouette Method

In short, the average silhouette approach measures the quality of a
clustering; that is, it determines how well each object lies within its
cluster. A high average silhouette width indicates a good clustering. The
average silhouette method computes the average silhouette of observations
for different values of k. The optimal number of clusters k is the one that
maximizes the average silhouette over a range of possible values of k.
We can use the silhouette function in the cluster package to compute
the average silhouette width, as sketched below.
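A minimal sketch, following the same workflow as the R code section at the
end (the range of k values is an arbitrary choice):

library(cluster)
library(factoextra)

avg_sil <- function(k) {
  km.res <- kmeans(sd.data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(sd.data))
  mean(ss[, 3])                 # third column holds the silhouette widths
}

k.values <- 2:10
avg_sil_values <- sapply(k.values, avg_sil)
plot(k.values, avg_sil_values, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Average silhouette width")

# Or directly:
fviz_nbclust(sd.data, kmeans, method = "silhouette")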

4.3 Gap Statistic Method

The gap statistic was published by R. Tibshirani, G. Walther and T. Hastie
(Stanford University, 2001). The approach can be applied to any clustering
method (e.g., k-means clustering, hierarchical clustering). The gap statistic
compares the total intra-cluster variation for different values of k with their
expected values under a null reference distribution of the data (i.e., a
distribution with no obvious clustering). The reference data set is generated
using Monte Carlo simulations of the sampling process; that is, for each
variable (xi) in the data set we compute its range and generate values for
the n points uniformly from the interval between its minimum and
maximum.
To compute the gap statistic we can use the clusGap function, which
provides the gap statistic and its standard error; a short sketch follows.
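A sketch of the computation, mirroring the call used in the R code section
below (B = 50 Monte Carlo reference data sets):

library(cluster)
library(factoextra)

set.seed(123)
gap_stat <- clusGap(sd.data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
print(gap_stat, method = "firstmax")    # suggested number of clusters ("firstmax" rule)
fviz_gap_stat(gap_stat)                 # plot the gap statistic against k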

5 R Codes
# Load the data and the required packages
X <- read.delim("hw_dataset/parkinson/P_02100001.txt", sep = ";")
library(tidyverse)
library(cluster)
library(factoextra)

fix(X)                              # inspect the data in a separate editor window
dim(X)

# Standardize the variables
sd.data <- scale(X)
head(sd.data)

# Distance matrix and its visualization
distance <- get_dist(sd.data)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white",
                                    high = "#FC4E07"))

par(mfrow = c(1, 4))                # display several plots in one row

# K-means with k = 2
km.out <- kmeans(sd.data, 2, nstart = 20)
str(km.out)
km.out
fviz_cluster(km.out, data = sd.data)
km.out$cluster
plot(sd.data, col = (km.out$cluster + 1),
     main = "K-Means Clustering Results with K=2",
     xlab = "", ylab = "", pch = 20, cex = 2)

# K-means with k = 3, 4 and 5
set.seed(4)
km.out3 <- kmeans(sd.data, 3, nstart = 20)
km.out3
plot(sd.data, col = (km.out3$cluster + 1),
     main = "K-Means Clustering Results with K=3",
     xlab = "", ylab = "", pch = 20, cex = 2)

km.out4 <- kmeans(sd.data, 4, nstart = 20)
km.out4
plot(sd.data, col = (km.out4$cluster + 1),
     main = "K-Means Clustering Results with K=4",
     xlab = "", ylab = "", pch = 20, cex = 2)

km.out5 <- kmeans(sd.data, 5, nstart = 20)
km.out5
plot(sd.data, col = (km.out5$cluster + 1),
     main = "K-Means Clustering Results with K=5",
     xlab = "", ylab = "", pch = 20, cex = 2)

# Compare the four solutions side by side
p1 <- fviz_cluster(km.out,  geom = "point", data = sd.data) + ggtitle("k = 2")
p2 <- fviz_cluster(km.out3, geom = "point", data = sd.data) + ggtitle("k = 3")
p3 <- fviz_cluster(km.out4, geom = "point", data = sd.data) + ggtitle("k = 4")
p4 <- fviz_cluster(km.out5, geom = "point", data = sd.data) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)

############## elbow method ###################
set.seed(123)
wss <- function(k) {
  kmeans(sd.data, k, nstart = 10)$tot.withinss
}
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

set.seed(123)
fviz_nbclust(sd.data, kmeans, method = "wss")

############## average silhouette ###################
avg_sil <- function(k) {
  km.res <- kmeans(sd.data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(sd.data))
  mean(ss[, 3])
}
k.values <- 2:15
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Average Silhouettes")

fviz_nbclust(sd.data, kmeans, method = "silhouette")

############## gap statistic ###################
set.seed(123)
gap_stat <- clusGap(sd.data, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
fviz_gap_stat(gap_stat)

##################### final cluster ################
set.seed(123)
final <- kmeans(sd.data, 9, nstart = 25)
print(final)
fviz_cluster(final, data = sd.data)

# Cluster means of the original variables
X %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")

########### Hierarchical Clustering ##########
data.dist <- dist(sd.data)
nci.data <- X
plot(hclust(data.dist), main = "Complete Linkage",
     xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "average"),
     main = "Average Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "single"),
     main = "Single Linkage", xlab = "", sub = "", ylab = "")
hc.out <- hclust(dist(sd.data))
abline(h = 8, col = "red")          # cut line drawn on the most recent dendrogram
hc.out

################################ Ward's method ##########################
hc5 <- hclust(data.dist, method = "ward.D2")

# Cut the tree into 9 groups
sub_grp <- cutree(hc5, k = 9)
table(sub_grp)

X %>%
  mutate(cluster = sub_grp) %>%
  head

plot(hc5, cex = 0.6)
rect.hclust(hc5, k = 9, border = 2:10)

fviz_cluster(list(data = sd.data, cluster = sub_grp))

######## Comparison ##########
set.seed(2)
km.out <- kmeans(sd.data, 9, nstart = 20)
km.clusters <- km.out$cluster
hc.clusters <- cutree(hc.out, 9)    # added so the comparison runs; 9 groups to match k-means
table(km.clusters, hc.clusters)
