
Cluster Analysis

Talha Farooq
Faizan Ali
Muhammad Abdul basit

School of Natural Sciences


MS Statistics

National University of Sciences & Technology


Pakistan
1 Cluster Analysis
Cluster analysis is a form of exploratory data analysis in which
observations are divided into different groups that share common
characteristics.
The purpose of cluster analysis (also known as classification) is to
construct groups (or classes, or clusters) while ensuring the following
properties:

1. Within a group, the observations must be as similar as possible.

2. Observations belonging to different groups must be as different
   as possible.

There are two main types of classification:

1. k-means clustering

2. Hierarchical clustering

The first is generally used when the number of classes is fixed in
advance, while the second is generally used for an unknown number of
classes and helps to determine this optimal number.

2 K-means Clustering
K-means clustering is the most commonly used unsupervised machine
learning algorithm for partitioning a given data set into a set of k groups
(i.e., k clusters), where k represents the number of groups pre-specified by
the analyst. It classifies objects into multiple groups (i.e., clusters), such that
objects within the same cluster are as similar as possible (i.e., high
intra-class similarity), whereas objects from different clusters are as
dissimilar as possible (i.e., low inter-class similarity). In k-means clustering,
each cluster is represented by its center (i.e., centroid), which corresponds to
the mean of the points assigned to the cluster.

2.1 Clustering Distance Measures

The classification of observations into groups requires some method for
computing the distance or the (dis)similarity between each pair of
observations. The result of this computation is known as a dissimilarity or
distance matrix. There are many methods to calculate this distance
information; the choice of distance measure is a critical step in clustering.
It defines how the similarity of two elements (x, y) is calculated and it will
influence the shape of the clusters. The classical methods for distance
measures are the Euclidean and Manhattan distances, which are defined as
follows:

Euclidean Distance:

\[ d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

Manhattan Distance:

\[ d_{man}(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

where x and y are two vectors of length n.

The choice of distance measure is very important, as it has a strong
influence on the clustering results. For most common clustering software,
the default distance measure is the Euclidean distance. However,
depending on the type of data and the research questions, other
dissimilarity measures might be preferred, and you should be aware of the
options.
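As a quick illustration (the numeric vectors below are arbitrary, made-up
values), the two distances can be computed directly in R or with the
built-in dist function:

x <- c(1, 2, 3, 4)
y <- c(2, 4, 1, 3)

# Euclidean distance: square root of the sum of squared differences
sqrt(sum((x - y)^2))

# Manhattan distance: sum of absolute differences
sum(abs(x - y))

# The same results using the built-in dist() function
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")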

2.2 The Basic Idea

The basic idea behind k-means clustering consists of defining clusters so
that the total intra-cluster variation (known as the total within-cluster
variation) is minimized. There are several k-means algorithms available.
The standard algorithm is the Hartigan-Wong algorithm (1979), which
defines the total within-cluster variation as the sum of squared Euclidean
distances between items and the corresponding centroid:

\[ W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2 \]

Where:

• xi is a data point belonging to the cluster Ck.

• µk is the mean value of the points assigned to the cluster Ck.

Each observation (xi) is assigned to a given cluster such that the sum of
squared (SS) distances of the observation to its assigned cluster center
(µk) is minimized.
We define the total within-cluster variation as follows:

\[ \text{tot.withinss} = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \]

The total within-cluster sum of squares measures the compactness
(i.e., goodness) of the clustering, and we want it to be as small as possible.
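As a small numerical check of this definition (on arbitrary simulated data,
not the Parkinson's data set used later), the quantity reported by kmeans()
as tot.withinss equals the sum of squared distances of every point to its
assigned centroid:

set.seed(1)
toy <- matrix(rnorm(100), ncol = 2)          # 50 points, 2 variables
km <- kmeans(toy, centers = 3, nstart = 20)

# Sum of squared distances of each point to its own cluster centre
manual <- sum((toy - km$centers[km$cluster, ])^2)
all.equal(manual, km$tot.withinss)           # TRUE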

2.3 K-means Algorithm

The first step when using k-means clustering is to indicate the number of
clusters (k) that will be generated in the final solution. The algorithm
starts by randomly selecting k objects from the data set to serve as the
initial centers for the clusters. The selected objects are also known as
cluster means or centroids.
Next, each of the remaining objects is assigned to its closest centroid,
where closest is defined using the Euclidean distance between the object
and the cluster mean. This step is called the cluster assignment step. After
the assignment step, the algorithm computes the new mean value of each
cluster; the term cluster centroid update is used to describe this step. Now
that the centers have been recalculated, every observation is checked again
to see whether it might be closer to a different cluster, and all the objects
are reassigned using the updated cluster means. The cluster assignment and
centroid update steps are iteratively repeated until the cluster assignments
stop changing (i.e., until convergence is achieved); that is, the clusters
formed in the current iteration are the same as those obtained in the
previous iteration.

The k-means algorithm can be summarized as follows (a minimal sketch in
R is given after this list):

1. Specify the number of clusters (k) to be created.

2. Randomly select k objects from the data set as the initial cluster
   centers (means).

3. Assign each observation to its closest centroid, based on the
   Euclidean distance between the object and the centroid.

4. For each of the k clusters, update the cluster centroid by calculating
   the new mean value of all the data points in the cluster. The
   centroid of the kth cluster is a vector of length p containing the means
   of all variables for the observations in the kth cluster, where p is the
   number of variables.

5. Iteratively minimize the total within-cluster sum of squares. That is,
   iterate steps 3 and 4 until the cluster assignments stop changing or
   the maximum number of iterations is reached.
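The following is a minimal sketch of these steps in base R, for illustration
only; the function name simple_kmeans and the max_iter safeguard are our
own additions, and in practice the built-in kmeans() function should be
preferred:

simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 2: randomly pick k rows as the initial centroids
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # Step 3: assign each observation to its closest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_assignment <- apply(d, 1, which.min)
    # Step 5: stop once the cluster assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 4: recompute each centroid as the mean of its assigned points
    # (this sketch ignores the rare case of a cluster becoming empty)
    for (j in 1:k) {
      centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers)
}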

2.4 Computing k-means clustering in R

We can compute k-means in R with the kmeans function. The data are
taken from the UCI Machine Learning Repository; the link is given below:

https://archive.ics.uci.edu/ml/datasets/parkinson+Disease+Spiral+Drawings+Using+Digitized+Graphics+Tablet

2.4.1 Data set information


The handwriting database consists of 62 PWP (people with Parkinson's
disease) and 15 healthy individuals. Three types of recordings (Static Spiral
Test, Dynamic Spiral Test and Stability Test) are taken.

Data set characteristics:     Multivariate
Attribute characteristics:    Integer
Number of instances:          77
Number of attributes:         7
Associated tasks:             Classification, Regression, Clustering

First of all, we load our data into R using the read.delim command, which
reads delimited text files. We use the fix(X) command to display the data
in a separate editor window. Then we use the scale(X) function, which
standardizes each variable in the data set. The par command is used to
display several plots in one figure; here we set it up to show the plots side
by side in a single row.
We group our data into 2 clusters. The kmeans function also has an
nstart option that attempts multiple initial configurations and reports
the best one.

The output of kmeans is a list with several components. The most
important are:

• cluster: a vector of integers (from 1:k) indicating the cluster to
  which each point is allocated.

• centers: a matrix of cluster centers.

• totss: the total sum of squares.

• withinss: a vector of within-cluster sums of squares, one component
  per cluster.

• tot.withinss: the total within-cluster sum of squares, i.e.,
  sum(withinss).

• betweenss: the between-cluster sum of squares, i.e.,
  totss - tot.withinss.

• size: the number of points in each cluster.

We can also view our results by plotting them; a minimal example follows.
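A minimal sketch of this workflow is shown below (the file path follows
the one used in the R code section at the end; any numeric data frame
would work):

X <- read.delim("hw_dataset/parkinson/P_02100001.txt", sep = ";")
sd.data <- scale(X)                    # standardize every variable

km.out <- kmeans(sd.data, centers = 2, nstart = 20)
km.out$cluster                         # cluster allocation of each observation
km.out$centers                         # matrix of cluster centers
km.out$tot.withinss                    # total within-cluster sum of squares
km.out$size                            # number of points in each cluster

# Visualize the clusters (points projected onto the first two principal components)
library(factoextra)
fviz_cluster(km.out, data = sd.data)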

3 Hierarchical Cluster Analysis
Hierarchical clustering is an alternative approach to k-means clustering for
identifying groups in the data set. It does not require us to predefine the
number of clusters to be generated, as is required by the k-means
approach. Furthermore, hierarchical clustering has an added advantage over
k-means clustering in that it results in an attractive tree-based
representation of the observations, called a dendrogram.

3.1 Hierarchical Clustering Algorithms

Hierarchical clustering can be divided into two main types:

• Agglomerative clustering

• Divisive hierarchical clustering

3.1.1 Agglomerative clustering:


It is also known as AGNES (Agglomerative Nesting) and works in a
bottom-up manner. That is, each object is initially considered as a
single-element cluster (leaf). At each step of the algorithm, the two
clusters that are the most similar are combined into a new, bigger cluster
(node). This procedure is iterated until all points are members of just one
single big cluster (root). The result is a tree which can be plotted as a
dendrogram.

3.1.2 Divisive hierarchical clustering:


It is also known as DIANA (Divisive Analysis) and works in a top-down
manner. The algorithm is the inverse of AGNES. It begins with the
root, in which all objects are included in a single cluster. At each step of
the iteration, the most heterogeneous cluster is divided into two. The
process is iterated until all objects are in their own cluster.

Note that agglomerative clustering is good at identifying small clusters,
whereas divisive hierarchical clustering is good at identifying large clusters.
A brief sketch of both approaches using the cluster package is given below.
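The following sketch uses the cluster package and the standardized data
sd.data defined in the k-means section; the choice of complete linkage for
AGNES is only for illustration:

library(cluster)

# Agglomerative clustering (AGNES), bottom-up
hc.agnes <- agnes(sd.data, method = "complete")
hc.agnes$ac        # agglomerative coefficient (closer to 1 = stronger structure)

# Divisive clustering (DIANA), top-down
hc.diana <- diana(sd.data)
hc.diana$dc        # divisive coefficient

# Both results can be plotted as dendrograms
pltree(hc.agnes, cex = 0.6, main = "AGNES dendrogram")
pltree(hc.diana, cex = 0.6, main = "DIANA dendrogram")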

As we learned for k-means, we measure the (dis)similarity of
observations using distance measures (e.g., Euclidean distance, Manhattan
distance, etc.).

However, a bigger question is:

How do we measure the dissimilarity between two clusters of
observations?

A number of different cluster agglomeration methods (i.e., linkage methods)
have been developed to answer this question. The most common methods
are:

• Maximum or complete linkage clustering: It computes all
  pairwise dissimilarities between the elements in cluster 1 and the
  elements in cluster 2, and considers the largest value (i.e., maximum
  value) of these dissimilarities as the distance between the two
  clusters. It tends to produce more compact clusters.

• Minimum or single linkage clustering: It computes all pairwise
  dissimilarities between the elements in cluster 1 and the elements in
  cluster 2, and considers the smallest of these dissimilarities as the
  linkage criterion. It tends to produce long, loose clusters.

• Mean or average linkage clustering: It computes all pairwise
  dissimilarities between the elements in cluster 1 and the elements in
  cluster 2, and considers the average of these dissimilarities as the
  distance between the two clusters.

• Centroid linkage clustering: It computes the dissimilarity
  between the centroid for cluster 1 (a mean vector of length p, where p
  is the number of variables) and the centroid for cluster 2.

• Ward's minimum variance method: It minimizes the total
  within-cluster variance. At each step the pair of clusters with the
  minimum between-cluster distance is merged.

We can see the differences between these approaches by comparing their
dendrograms.

3.2 Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared
as follows:

1. Rows are observations (individuals) and columns are variables.

2. Any missing value in the data must be removed or estimated.

3. The data must be standardized (i.e., scaled) to make variables
   comparable.

3.3 Hierarchical Clustering with R

There are different functions available in R for computing hierarchical
clustering. We will use hclust. For measuring distances between the
clusters, we use different linkage methods such as complete linkage,
average linkage and single linkage.

We can then plot the dendrogram.

In a dendrogram, each leaf corresponds to one observation. As we move up
the tree, observations that are similar to each other are combined into
branches. The height of the cut to the dendrogram controls the number of
clusters obtained; it plays the same role as the k in k-means clustering. In
order to identify sub-groups (i.e., clusters), we can cut the dendrogram
with cutree, as sketched below.
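A minimal sketch of these steps (distance matrix, hclust with the linkage
methods above, and cutree); the choice of k = 2 is for illustration only:

data.dist <- dist(sd.data)                         # Euclidean distance matrix

hc.complete <- hclust(data.dist, method = "complete")
hc.average  <- hclust(data.dist, method = "average")
hc.single   <- hclust(data.dist, method = "single")

# Plot one of the dendrograms
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.6)

# Cut the tree to obtain cluster labels
clusters <- cutree(hc.complete, k = 2)
table(clusters)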

4 Determining Optimal Clusters


As you may recall, the analyst specifies the number of clusters to use;
preferably, the analyst would like to use the optimal number of clusters. To
aid the analyst, the following explains the three most popular methods for
determining the optimal number of clusters:

1. Elbow method

2. Silhouette method

3. Gap statistic

4.1 Elbow Method

Recall that the basic idea behind cluster partitioning methods, such as
k-means clustering, is to define clusters such that the total intra-cluster
variation (known as the total within-cluster variation or total within-cluster
sum of squares) is minimized:

\[ \text{minimize} \left( \sum_{k=1}^{K} W(C_k) \right) \]

where Ck is the kth cluster and W(Ck) is the within-cluster variation. The
total within-cluster sum of squares (wss) measures the compactness of the
clustering and we want it to be as small as possible. Thus, we can use the
following algorithm to define the optimal number of clusters (an R sketch
follows the list):

1. Compute the clustering algorithm (e.g., k-means clustering) for
   different values of k, for instance by varying k from 1 to 10 clusters.

2. For each k, calculate the total within-cluster sum of squares (wss).

3. Plot the curve of wss according to the number of clusters k.

4. The location of a bend (knee) in the plot is generally considered as
   an indicator of the appropriate number of clusters.
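A hedged sketch of this procedure, both by hand and with fviz_nbclust
from the factoextra package (the range k = 1, ..., 10 is an arbitrary choice):

library(factoextra)

set.seed(123)
# Manual computation of the total within-cluster sum of squares for each k
wss <- sapply(1:10, function(k) kmeans(sd.data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

# The same plot in one call
fviz_nbclust(sd.data, kmeans, method = "wss", k.max = 10)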

4.2 Average Silhouette Method

In short, the average silhouette approach measures the quality of a
clustering; that is, it determines how well each object lies within its
cluster. A high average silhouette width indicates a good clustering. The
average silhouette method computes the average silhouette of observations
for different values of k. The optimal number of clusters k is the one that
maximizes the average silhouette over a range of possible values of k.
We can use the silhouette function in the cluster package to compute
the average silhouette width, as sketched below.
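A minimal sketch, following the same workflow as the R code section at the
end (the range of k values is an arbitrary choice):

library(cluster)
library(factoextra)

avg_sil <- function(k) {
  km.res <- kmeans(sd.data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(sd.data))
  mean(ss[, 3])                 # third column holds the silhouette widths
}

k.values <- 2:10
avg_sil_values <- sapply(k.values, avg_sil)
plot(k.values, avg_sil_values, type = "b", pch = 19,
     xlab = "Number of clusters K", ylab = "Average silhouette width")

# Or directly:
fviz_nbclust(sd.data, kmeans, method = "silhouette")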

4.3 Gap Statistic Method

The gap statistic was published by R. Tibshirani, G. Walther and T. Hastie
(Stanford University, 2001). The approach can be applied to any clustering
method (e.g., k-means clustering, hierarchical clustering). The gap statistic
compares the total intra-cluster variation for different values of k with their
expected values under a null reference distribution of the data (i.e., a
distribution with no obvious clustering). The reference data set is generated
using Monte Carlo simulations of the sampling process; that is, for each
variable (xi) in the data set we compute its range and generate values for
the n points uniformly from the interval between its minimum and
maximum.
To compute the gap statistic we can use the clusGap function, which
provides the gap statistic and its standard error; a short sketch follows.
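A sketch of the computation, mirroring the call used in the R code section
below (B = 50 Monte Carlo reference data sets):

library(cluster)
library(factoextra)

set.seed(123)
gap_stat <- clusGap(sd.data, FUN = kmeans, nstart = 25, K.max = 10, B = 50)
print(gap_stat, method = "firstmax")    # suggested number of clusters ("firstmax" rule)
fviz_gap_stat(gap_stat)                 # plot the gap statistic against k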

5 R Codes
# Load the data and the required packages
X <- read.delim("hw_dataset/parkinson/P_02100001.txt", sep = ";")
library(tidyverse)
library(cluster)
library(factoextra)

fix(X)                              # inspect the data in a separate editor window
dim(X)

# Standardize the variables
sd.data <- scale(X)
head(sd.data)

# Distance matrix and its visualization
distance <- get_dist(sd.data)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white",
                                    high = "#FC4E07"))

par(mfrow = c(1, 4))                # display several plots in one row

# K-means with k = 2
km.out <- kmeans(sd.data, 2, nstart = 20)
str(km.out)
km.out
fviz_cluster(km.out, data = sd.data)
km.out$cluster
plot(sd.data, col = (km.out$cluster + 1),
     main = "K-Means Clustering Results with K=2",
     xlab = "", ylab = "", pch = 20, cex = 2)

# K-means with k = 3, 4 and 5
set.seed(4)
km.out3 <- kmeans(sd.data, 3, nstart = 20)
km.out3
plot(sd.data, col = (km.out3$cluster + 1),
     main = "K-Means Clustering Results with K=3",
     xlab = "", ylab = "", pch = 20, cex = 2)

km.out4 <- kmeans(sd.data, 4, nstart = 20)
km.out4
plot(sd.data, col = (km.out4$cluster + 1),
     main = "K-Means Clustering Results with K=4",
     xlab = "", ylab = "", pch = 20, cex = 2)

km.out5 <- kmeans(sd.data, 5, nstart = 20)
km.out5
plot(sd.data, col = (km.out5$cluster + 1),
     main = "K-Means Clustering Results with K=5",
     xlab = "", ylab = "", pch = 20, cex = 2)

# Compare the four solutions side by side
p1 <- fviz_cluster(km.out,  geom = "point", data = sd.data) + ggtitle("k = 2")
p2 <- fviz_cluster(km.out3, geom = "point", data = sd.data) + ggtitle("k = 3")
p3 <- fviz_cluster(km.out4, geom = "point", data = sd.data) + ggtitle("k = 4")
p4 <- fviz_cluster(km.out5, geom = "point", data = sd.data) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)

############## elbow method ###################
set.seed(123)
wss <- function(k) {
  kmeans(sd.data, k, nstart = 10)$tot.withinss
}
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

set.seed(123)
fviz_nbclust(sd.data, kmeans, method = "wss")

############## average silhouette ###################
avg_sil <- function(k) {
  km.res <- kmeans(sd.data, centers = k, nstart = 25)
  ss <- silhouette(km.res$cluster, dist(sd.data))
  mean(ss[, 3])
}
k.values <- 2:15
avg_sil_values <- map_dbl(k.values, avg_sil)
plot(k.values, avg_sil_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Average Silhouettes")

fviz_nbclust(sd.data, kmeans, method = "silhouette")

############## gap statistic ###################
set.seed(123)
gap_stat <- clusGap(sd.data, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
fviz_gap_stat(gap_stat)

##################### final cluster ################
set.seed(123)
final <- kmeans(sd.data, 9, nstart = 25)
print(final)
fviz_cluster(final, data = sd.data)

# Cluster means of the original variables
X %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")

########### Hierarchical Clustering ##########
data.dist <- dist(sd.data)
nci.data <- X
plot(hclust(data.dist), main = "Complete Linkage",
     xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "average"),
     main = "Average Linkage", xlab = "", sub = "", ylab = "")
plot(hclust(data.dist, method = "single"),
     main = "Single Linkage", xlab = "", sub = "", ylab = "")
hc.out <- hclust(dist(sd.data))
abline(h = 8, col = "red")          # cut line drawn on the most recent dendrogram
hc.out

################################ Ward's method ##########################
hc5 <- hclust(data.dist, method = "ward.D2")

# Cut the tree into 9 groups
sub_grp <- cutree(hc5, k = 9)
table(sub_grp)

X %>%
  mutate(cluster = sub_grp) %>%
  head

plot(hc5, cex = 0.6)
rect.hclust(hc5, k = 9, border = 2:10)

fviz_cluster(list(data = sd.data, cluster = sub_grp))

######## Comparison ##########
set.seed(2)
km.out <- kmeans(sd.data, 9, nstart = 20)
km.clusters <- km.out$cluster
hc.clusters <- cutree(hc.out, 9)    # added so the comparison runs; 9 groups to match k-means
table(km.clusters, hc.clusters)
