Unit 4 Clustering - K-Means and Hierarchical

Clustering

Poonam Saini (Senior Member IEEE)


Faculty, Computer Science and Engineering Dept.
Convener and Adjunct Faculty, CoE Data Science
Faculty In-charge IoT, Siemens-PEC CoE
poonamsaini@pec.edu.in
Why Learn from Data?

• Learning is used when:
  o Human expertise does not exist
  o Humans are unable to explain their expertise
  o The solution changes over time and needs to be adapted

• When we talk about "Learning", what do we mean?
  o Learning general models from data consisting of particular examples
  o Building a model that is a good and useful approximation to the data
Applications
• Association Analysis
  o Basket analysis: P(Y | X), the probability that somebody who
    buys X also buys Y, where X and Y are products/services
    (see the sketch below)
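As a rough illustration of this idea (not taken from the slides), P(Y | X) can be estimated from transaction data by simple counting; the products and transactions below are invented for the example.

# Hypothetical transactions; estimate P(chips | beer) by counting co-occurrences.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"milk", "bread"},
    {"beer", "salsa"},
]

X, Y = "beer", "chips"
n_X  = sum(1 for t in transactions if X in t)
n_XY = sum(1 for t in transactions if X in t and Y in t)
print(f"P({Y} | {X}) = {n_XY / n_X:.2f}")  # fraction of X-buyers who also bought Y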

• Supervised Learning
  o Classification
  o Regression/Prediction

  Example: Customer Credit Scoring (a minimal sketch follows)
  Discriminant: IF income > θ1 AND savings > θ2
                THEN low-risk ELSE high-risk
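A minimal sketch of the discriminant above; the threshold values θ1 and θ2 are placeholders chosen only for illustration.

THETA1 = 30000  # income threshold θ1 (illustrative value)
THETA2 = 10000  # savings threshold θ2 (illustrative value)

def credit_risk(income, savings):
    # IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45000, 15000))  # low-risk
print(credit_risk(45000,  5000))  # high-risk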

• Unsupervised Learning
o Clustering
The Importance of Clustering and Classification

• To make sense of and extract value from large sets of
  structured and unstructured data

• Due to huge volumes of unstructured data, there is a need
  to partition data into logical groupings
  o to take a comprehensive view of large data
  o to form logical structures based on the findings
  o to do deeper analysis of the datasets

• Is classification the same as clustering?


Classification and Clustering

• With classification, the number of classes into which the data
  should be grouped is already known
  o you specify what class each data point should be assigned to
  o the data in the dataset being learned from is labelled

• With clustering algorithms, there is no predefined notion of the
  appropriate number of clusters
  o we rely on the clustering algorithm to sort and cluster the data
  o there is learning from unlabelled data
Cluster and Clustering Algorithm

In the simplest form,

• clusters are sets of data points that share similar attributes,
  and clustering algorithms are the methods that group these
  data points into different clusters based on their similarities

• diverse applications: disease classification in medical science,
  customer classification in marketing research, and environmental
  health risk assessment in environmental engineering
Why do we need clustering?
Types of Clustering

• Hard clustering: grouping the data items such that each item
  is assigned to exactly one cluster

• Soft clustering: grouping the data items such that an item
  can belong to multiple clusters (both are illustrated below)
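To make the distinction concrete, here is a small sketch (assuming scikit-learn is available): k-Means produces a hard assignment, while a Gaussian mixture model produces soft membership probabilities.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

# Hard clustering: each item is assigned to exactly one cluster.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("hard labels:", hard_labels)

# Soft clustering: each item gets a membership probability for every cluster.
soft = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft memberships:\n", soft.predict_proba(X).round(2))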
What is clustering?
• The organization of unlabeled data into
  similarity groups called clusters
• A cluster is a collection of data items that are
  "similar" to one another and "dissimilar" to data
  items in other clusters
Historic application of clustering
What do we need for clustering?
Distance (dissimilarity) measures
Cluster evaluation (a hard problem)
• Intra-cluster cohesion (compactness):
  – Cohesion measures how near the data points in a
    cluster are to the cluster centroid
  – The sum of squared error (SSE) is a commonly used measure

• Inter-cluster separation (isolation):
  – Separation means that different cluster centroids should
    be far away from one another (both measures are sketched below)

Note: In most applications, expert judgement is still the key
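A small NumPy sketch of both measures, computing the intra-cluster SSE and the distance between centroids for a toy clustering (the data below is made up for illustration).

import numpy as np

# Toy data already split into two clusters (assumed for illustration).
clusters = [
    np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9]]),
    np.array([[5.0, 5.0], [5.3, 4.7], [4.8, 5.2]]),
]
centroids = [c.mean(axis=0) for c in clusters]

# Intra-cluster cohesion: sum of squared distances to the cluster centroid (SSE).
sse = sum(((c - m) ** 2).sum() for c, m in zip(clusters, centroids))
print("SSE (cohesion):", round(float(sse), 3))

# Inter-cluster separation: distance between the two centroids.
print("separation:", round(float(np.linalg.norm(centroids[0] - centroids[1])), 3))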


How many clusters?
Clustering Techniques
[Figure: taxonomy of clustering techniques, including divisive hierarchical clustering]
Clustering Techniques: Overview
Popular Clustering Algorithms

• k-Means

• Mean Shift

• DBSCAN

• Agglomerative Hierarchical
k-Means clustering
• k-Means (MacQueen, 1967) is a partitional
  clustering algorithm

• Let the set of data points D be {x1, x2, …, xn},
  where xi = (xi1, xi2, …, xir) is a vector in X ⊆ R^r, and r is
  the number of dimensions

• The k-means algorithm partitions the given data
  into k clusters:
  – Each cluster has a cluster center, called the centroid
  – k is specified by the user
k-Means algorithm
Given k, the k-Means algorithm works as follows (a sketch is given after the steps):

1. Choose k (random) data points (seeds) to be the
   initial centroids (cluster centers)
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current
   cluster memberships
4. If a convergence criterion is not met, repeat
   steps 2 and 3
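A minimal NumPy sketch of the four steps, with a centroid-change stopping rule; this is an illustrative implementation, not production code.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute the centroids from the current memberships.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids (essentially) stop moving.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
print(kmeans(X, k=2))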
k-Means convergence (stopping) criterion
• no (or minimal) re-assignment of data points to
  different clusters, or

• no (or minimal) change of centroids, or

• minimal decrease in the sum of squared error (SSE),

      SSE = Σ_{j=1}^{k} Σ_{x ∈ Cj} d(x, mj)²

  – Cj is the jth cluster,
  – mj is the centroid of cluster Cj (the mean vector of all the data
    points in Cj),
  – d(x, mj) is the (Euclidean) distance between data point x and
    centroid mj.
k-Means clustering example: step 1
k-Means clustering example: step 2
k-Means clustering example: step 3
k-Means clustering example
k-Means clustering example
k-Means clustering example
Why use k-means?
• Strengths
  – Simple: easy to understand and to implement
  – Efficient: time complexity is O(tkn), where
      n is the number of data points,
      k is the number of clusters, and
      t is the number of iterations
  – Since both k and t are small compared to n, k-means is
    considered a linear algorithm with complexity O(n)

• k-means is the most popular clustering algorithm

• Note that it terminates at a local optimum if SSE is used;
  the global optimum is hard to find due to complexity
Weaknesses of k-Means
• The algorithm is only applicable if the mean is defined
  – For categorical data, use k-modes, where the centroid is
    represented by the most frequent values

• The user needs to specify k

• The algorithm is sensitive to outliers
  – Outliers are data points that are very far away from
    other data points
  – Outliers could be errors in the data recording or
    special data points with very different values
Outliers
Dealing with outliers
• Remove data points that are much farther away from the
  centroids than other data points
  – To be safe, we may want to monitor these possible outliers
    over a few iterations and then decide whether to remove them

• Perform random sampling: by choosing a small subset of the
  data points, the chance of selecting an outlier is much smaller
  – Assign the rest of the data points to the clusters by
    distance or similarity comparison, or by classification
Sensitivity to initial seeds

[Figure: two runs with different random selections of seeds
(centroids), each shown at Iteration 1 and Iteration 2]


Special data structures
• The k-means algorithm is not suitable for
discovering clusters that are not hyper-ellipsoids
(or hyper-spheres)
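A quick sketch of this limitation, assuming scikit-learn is available: on the two-moons dataset, whose clusters are not hyper-spherical, k-means cuts across the true groups.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are not hyper-spheres.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true two-moon grouping is well below 1,
# because k-means splits each moon rather than separating them.
print("adjusted Rand index:", round(adjusted_rand_score(y_true, labels), 2))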
k-Means variant: k-Medians
k-Medians is similar to k-Means, except

• Instead of re-computing the group centre points using the
  mean, we use the median vector of the group (see the sketch below)

• This method is less sensitive to outliers (because of
  using the median)

• However, it is much slower for larger datasets, as sorting
  is required in each iteration when computing the
  median vector
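A brief sketch of the one step that changes relative to k-means, the centre update: the per-dimension median replaces the mean (NumPy assumed; data invented for illustration).

import numpy as np

cluster_points = np.array([[1.0, 2.0],
                           [2.0, 3.0],
                           [3.0, 4.0],
                           [50.0, 60.0]])  # one outlier

mean_centre   = cluster_points.mean(axis=0)        # k-Means update
median_centre = np.median(cluster_points, axis=0)  # k-Medians update

print("mean centre:  ", mean_centre)    # dragged toward the outlier
print("median centre:", median_centre)  # stays near the bulk of the points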
k-Means Summary
• Despite its weaknesses, k-means is still the most
  popular algorithm due to its simplicity and efficiency

• There is no clear evidence that any other clustering
  algorithm performs better in general

• Comparing different clustering algorithms is a difficult task

• No one knows the correct clusters!


Hierarchical Agglomerative
• Hierarchical clustering algorithms are either top-down or bottom-up

• Bottom-up algorithms treat each data point as a single cluster
  at the outset and then successively merge (or agglomerate)
  pairs of clusters
  o until all clusters have been merged into a single cluster
    that contains all data points

• Bottom-up clustering is therefore called hierarchical
  agglomerative clustering, or HAC
• The cluster hierarchy is represented as a tree (dendrogram)
• The root of the tree is the unique cluster that gathers all the samples
• The leaves are the clusters with only one sample
Graphic Illustration- HAC
Explanation- HAC
• We begin by treating each data point as a single cluster
  o if there are X data points in a dataset, there will be X clusters

• Next, select a distance metric that measures the distance
  between two clusters
  o As an example, we will use average linkage, which defines the
    distance between two clusters to be the average distance
    between data points in the first cluster and data points in
    the second cluster

• On each iteration, we combine two clusters into one
  o Policy: the two clusters to be combined are selected as those
    with the smallest average linkage (see the sketch below)
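A short sketch of this procedure using SciPy (assumed available): linkage with method='average' performs the bottom-up merges, and fcluster cuts the resulting dendrogram into a chosen number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [5.0, 5.0], [5.1, 4.9], [8.0, 1.0]])

# Each point starts as its own cluster; at every step the two clusters with the
# smallest average linkage (mean pairwise distance) are merged.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram into a flat clustering with at most 3 clusters.
print(fcluster(Z, t=3, criterion="maxclust"))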
HAC Summary
• Does not require specifying the number of clusters in advance

• Additionally, the algorithm is not sensitive to the choice of
  distance metric

• Use case: when the underlying data has a hierarchical structure

• Lower efficiency, as it has a time complexity of O(n³)


Exercise- k-means

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three
clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
ρ(a, b) = |x2 – x1| + |y2 – y1|

Use the k-Means algorithm to find the three cluster centers after the second
iteration.
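If you want to verify a hand computation, the following sketch (NumPy assumed) runs two iterations of k-means with the Manhattan distance defined above, updating each centre as the mean of its assigned points as in the earlier slides.

import numpy as np

points  = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                    [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)  # A1..A8
centres = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)         # A1, A4, A7

for it in range(2):
    # Assign each point to the nearest centre under the Manhattan distance.
    dists = np.abs(points[:, None, :] - centres[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Re-compute each centre as the mean of its assigned points
    # (no cluster becomes empty for this particular exercise).
    centres = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"centres after iteration {it + 1}:\n{centres}")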
THANK YOU
