
CSE3506 - Essentials of Data Analytics

Facilitator: Dr Sathiya Narayanan S

Assistant Professor (Senior)


School of Electronics Engineering (SENSE), VIT-Chennai

Email: sathiyanarayanan.s@vit.ac.in
Handphone No.: +91-9944226963

Winter Semester 2020-21

Suggested Readings

Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, “An Introduction to Statistical Learning with Applications in R”, Springer Texts in Statistics, 2013 (Facilitator’s Recommendation).

Ethem Alpaydin, “Introduction to Machine Learning”, 3rd Edition, PHI Learning Private Limited, 2019.

Contents

1 Module 3: Clustering

Module 3: Clustering

Topics to be covered in Module-3

Introduction to Clustering
K-Means Clustering
K-Medoids Clustering
Hierarchical Clustering
Applications of Clustering

Module 3: Clustering

Introduction to Clustering

Clustering algorithms group samples/data points/features/objects into clusters by natural association, according to some similarity measure (say, Euclidean distance).
Clustering serves two purposes: (i) understanding the structure in the data (i.e. data exploration) and (ii) finding the similarities between instances (data points) and thus grouping them.
After grouping the data points, the groups can be named and their attributes can be defined (using domain knowledge). This paves the way for supervised learning; in this case, clustering becomes part of the preprocessing stage.
In most cases, labelling the data is costly. Therefore, preceding supervised learning (regression or classification) with unsupervised learning (clustering) is advantageous.

Module 3: Clustering

K-Means Clustering

Partitioning-based clustering methods group data points based on their similarity and characteristics.
K-means and K-medoids (partitioning around medoids) are the two popular partitioning-based methods.
The K-means algorithm partitions the data points into K clusters, and the value of K should be known a priori (i.e. it needs to be specified beforehand).
The K-means algorithm is based on the minimization of the sum of squared distances.
The K-means procedure attempts to form clusters such that intracluster similarity is high and intercluster similarity is low. The similarity of a cluster is determined based on the centroid (i.e. mean value) of the data points in it.

Module 3: Clustering

K-Means Clustering - Procedure

Step 1: Initialize the iteration count: $n = 1$; and arbitrarily choose $K$ samples as the initial cluster centres $z_1(n), z_2(n), \ldots, z_K(n)$.

Step 2: Distribute the pattern samples $x$ among the $K$ clusters according to the following rule:

$$x \in G_i(n) \quad \text{if} \quad \|x - z_i(n)\| < \|x - z_j(n)\| \quad \text{for } j = 1, 2, \ldots, K;\ j \neq i.$$

Step 3: Compute $z_i(n+1)$ for $i = 1, 2, \ldots, K$:

$$z_i(n+1) = \frac{1}{N_i} \sum_{x \in G_i(n)} x$$

where $N_i$ is the number of pattern samples assigned to class $G_i(n)$.

Step 4: If $z_i(n+1) = z_i(n)$ for all $i$, then STOP. Otherwise, go to Step 2.
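
The four steps translate directly into code. Below is a minimal NumPy sketch of the procedure (the function name and structure are illustrative, not from the slides):

```python
import numpy as np

def k_means(X, initial_centres, max_iter=100):
    """Minimal K-means sketch following Steps 1-4 above.
    X: (n, d) array of samples; initial_centres: (K, d) array."""
    z = np.asarray(initial_centres, dtype=float)        # Step 1
    for _ in range(max_iter):
        # Step 2: assign each sample to its nearest centre (Euclidean norm)
        dists = np.linalg.norm(X[:, None, :] - z[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centre as the mean of its cluster
        # (assumes no cluster becomes empty, for simplicity)
        new_z = np.array([X[labels == i].mean(axis=0) for i in range(len(z))])
        # Step 4: stop when no centre moves; otherwise repeat from Step 2
        if np.allclose(new_z, z):
            break
        z = new_z
    return z, labels
```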

Module 3: Clustering

K-Means Clustering

Advantages of K-means clustering: (i) simple and efficient; and (ii) low computational complexity.
Drawbacks: (i) K must be known/decided beforehand; and (ii) the final clusters usually depend on the order of presentation of the training samples and on the initial cluster centres.

Question 3.1
Apply K-means clustering to cluster the following samples/data points:
(0,0), (0,1), (1,0), (3,3), (5,6), (8,9), (9,8) and (9,9).
Fix K = 2 and choose (0,0) and (5,6) as the initial cluster centres.
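
As a cross-check, the same clustering can be run with scikit-learn (assuming it is installed), passing the initial centres given in the question explicitly:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0, 0], [0, 1], [1, 0], [3, 3],
              [5, 6], [8, 9], [9, 8], [9, 9]])
init = np.array([[0, 0], [5, 6]])   # initial centres from the question

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)            # cluster assignment of each data point
print(km.cluster_centers_)   # final cluster centres
```

Working through the iterations by hand, (0,0), (0,1), (1,0) and (3,3) end up in one cluster with centre (1,1), and the remaining four points form the other cluster with centre (7.75,8).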

Module 3: Clustering
K-Means Clustering
The ‘elbow method’ (a heuristic approach) for determining K: plot the ‘explained variation’ (say, the ‘distortion’) as a function of K, and pick the ‘elbow’ or ‘knee’ of the curve as the value of K (as shown in Figure 1).

Figure 1: Elbow method using distortion.
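
A typical way to produce such a plot, assuming scikit-learn and matplotlib are available, is to run K-means over a range of K values and record the distortion (scikit-learn's inertia_, the sum of squared distances to the nearest centre, is used as the distortion measure here):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X: the (n, d) data array, e.g. from the Question 3.1 snippet above
distortions = []
ks = range(1, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    distortions.append(km.inertia_)   # distortion for this K

plt.plot(ks, distortions, marker='o')
plt.xlabel('K')
plt.ylabel('Distortion (inertia)')
plt.show()                            # pick K at the 'elbow' of the curve
```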


Module 3: Clustering

K-Medoids Clustering

In K-medoids clustering, each cluster is represented by a cluster medoid, which is one of the data points in the cluster.
The medoid of a cluster is defined as the data point in the cluster whose average dissimilarity to all the other data points in the cluster is minimal. As the medoid is the most centrally located point in the cluster, the cluster representatives can be interpreted more easily (compared to K-means).
K-medoids can use arbitrary dissimilarity measures, whereas K-means generally requires Euclidean distance for good performance. In general, K-medoids uses the Manhattan distance and minimizes the sum of pairwise dissimilarities.
As in the case of K-means, the value of K needs to be specified beforehand. A heuristic approach, the ‘silhouette method’, can be used for determining the optimal value of K (sketched below).
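
A sketch of the silhouette method, assuming scikit-learn (K-means is used here for brevity; the same idea applies to K-medoids): compute the mean silhouette score for each candidate K and pick the K that maximizes it.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: the (n, d) data array
scores = {}
for k in range(2, 8):                 # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the highest mean silhouette
```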
Module 3: Clustering

K-Medoids Clustering

As the K-medoids clustering problem is NP-hard to solve exactly, many heuristic approaches/solutions exist. The most common approach is the Partitioning Around Medoids (PAM) algorithm.
The PAM algorithm has two phases: a BUILD phase and a SWAP phase.
The BUILD phase greedily selects K data points from the available data points and initializes them as cluster medoids.
The SWAP phase associates each data point with the closest medoid and swaps a cluster medoid with a non-medoid data point whenever the cost of the resulting configuration decreases.
PAM is faster than exhaustive search but, being a greedy search, it may not find the optimal solution.
K-medoids clustering is more robust (i.e. less sensitive to outliers and noise) compared to K-means.
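
A compact from-scratch sketch of the PAM idea with Manhattan distance (note: the BUILD phase is simplified here to random initialization rather than PAM's greedy selection, and the function name is illustrative):

```python
import numpy as np

def pam(X, k, seed=0, max_iter=100):
    """PAM-style K-medoids sketch: Manhattan distance, simplified BUILD."""
    rng = np.random.default_rng(seed)
    n = len(X)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)  # pairwise Manhattan
    medoids = rng.choice(n, size=k, replace=False)         # simplified BUILD
    cost = lambda m: D[:, m].min(axis=1).sum()             # total dissimilarity
    best = cost(medoids)
    for _ in range(max_iter):                              # SWAP phase
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = h                               # try the swap
                c = cost(trial)
                if c < best:                               # keep it if cheaper
                    medoids, best, improved = trial, c, True
        if not improved:
            break
    return medoids, D[:, medoids].argmin(axis=1)           # medoids, labels
```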
Module 3: Clustering

Hierarchical Clustering

The hierarchical clustering procedure groups similar data points into clusters. It can be performed with either raw data (i.e. data points) or a similarity matrix. When raw data is provided, a similarity matrix S should be computed:

$$S_{i,j} = \frac{1}{d_{i,j} + 1}$$

where $d_{i,j}$ is the Euclidean distance between data points $i$ and $j$. In recent years, many other distance metrics have been developed.
Agglomerative hierarchical clustering is an iterative, two-step procedure that starts by considering each data point as a separate cluster. In each iteration, it executes the following steps: (i) identify the two clusters that are closest together, and (ii) merge these two most similar clusters. This iterative process continues until all the clusters are merged together.
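
When only raw data is given, the similarity matrix above can be computed in a couple of lines (a sketch assuming SciPy is available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# X: the (n, d) data array
d = squareform(pdist(X, metric='euclidean'))  # d[i, j]: Euclidean distance
S = 1.0 / (d + 1.0)                           # S[i, j] = 1 / (d[i, j] + 1)
```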
Module 3: Clustering

Hierarchical Clustering
The dendrogram obtained at the end of hierarchical clustering shows the hierarchical relationship between the clusters.
After each merging step, the similarity matrix must be updated. The update can be based on (i) the closest pair of points across two clusters (single-linkage), (ii) the farthest pair of points across two clusters (complete-linkage), or (iii) the centres of the clusters (mean or average-linkage). Refer to Figure 2 and the sketch after this list.
The choice of similarity or distance metric and the choice of linkage criterion are always application-dependent.
Hierarchical clustering can also be performed by initially treating all data points as one cluster and then successively splitting it. This approach is called divisive hierarchical clustering.
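
In practice, the whole agglomerative procedure, including the linkage choice and the dendrogram, is available in SciPy. A minimal sketch ('single' can be swapped for 'complete' or 'average'):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# X: the (n, d) data array
Z = linkage(X, method='single')   # agglomerative merge history
dendrogram(Z)                     # plot the hierarchy of merges
plt.show()
```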
Module 3: Clustering

Hierarchical Clustering

Figure 2: Three linkage types used in hierarchical clustering. Source: https://www.dexlabanalytics.com/blog/hierarchical-clustering-foundational-concepts-and-example-of-agglomerative-clustering

Module 3: Clustering

Question 3.2
Consider the similarity matrix given below. Determine the hierarchy of clusters created by
(a) the single-linkage clustering algorithm, and
(b) the complete-linkage clustering algorithm.
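
The question's matrix appears as an image in the original slides and is not reproduced here, but the general approach can be sketched. Using the earlier relation $S_{i,j} = 1/(d_{i,j}+1)$, similarities convert to distances via $d_{i,j} = 1/S_{i,j} - 1$, after which SciPy runs both linkages. The 4x4 matrix below is a hypothetical placeholder, not the one from the question:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.0, 0.8, 0.2, 0.1],    # hypothetical similarity matrix
              [0.8, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.7],
              [0.1, 0.2, 0.7, 1.0]])
d = 1.0 / S - 1.0                      # invert S = 1 / (d + 1)
for method in ('single', 'complete'):
    Z = linkage(squareform(d), method=method)  # condensed distances in
    print(method, Z, sep='\n')         # merge order = hierarchy of clusters
```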

Module 3: Clustering

Applications of Clustering

Clustering analysis finds application in market research, image processing, etc.
Consider Customer Relationship Management, for example. Assume the customers of a company are described in terms of their demographic attributes and transactions with the company. If those customers are grouped into K clusters, a better understanding of the customer base is possible. Based on this understanding, the company can adopt different strategies for different types of customers. The company can also identify unique customers (those who do not fall into any large group) and develop strategies for them; for example, ‘churning’ customers require immediate attention.
In image segmentation, clustering can be used to group the pixels that ‘belong together’ (say, foreground and background pixels), as sketched below.
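
A sketch of the segmentation use case, assuming scikit-learn: cluster the pixel colours with K = 2 and use the labels as a foreground/background mask (the random image is a placeholder; substitute a real RGB array):

```python
import numpy as np
from sklearn.cluster import KMeans

img = np.random.rand(64, 64, 3)        # placeholder (H, W, 3) RGB image
pixels = img.reshape(-1, 3)            # one row per pixel
labels = KMeans(n_clusters=2, n_init=10).fit_predict(pixels)
mask = labels.reshape(img.shape[:2])   # 0/1 foreground-background map
```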

Module 3: Clustering

Module-3 Summary

Clustering: grouping data points based on similarity measures
Partitioning-based methods: K-means and K-medoids
The elbow method and the silhouette method: heuristic approaches for deciding the optimal value of K for partitioning-based methods
Hierarchical clustering: single-linkage, complete-linkage, etc.
Dendrogram: shows the hierarchical relationship between the clusters
Applications of clustering: customer relationship management, image segmentation, etc.
