Unit 4 Clustering - K-Means and Hierarchical

Clustering

Poonam Saini (Senior Member IEEE)


Faculty, Computer Science and Engineering Dept.
Convener and Adjunct Faculty, CoE Data Science
Faculty In-charge IoT, Siemens-PEC CoE
poonamsaini@pec.edu.in
Why Learn from Data?

• Learning is used when:
  o Human expertise does not exist
  o Humans are unable to explain their expertise
  o The solution changes over time and needs to be adapted

• When we talk about "Learning", what do we mean?
  o Learning general models from data consisting of particular examples
  o Building a model that is a good and useful approximation to the data
Applications
• Association Analysis
  o Basket analysis: P(Y | X), the probability that somebody who
    buys X also buys Y, where X and Y are products/services
    (see the sketch below)
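As a rough illustration of this idea (not taken from the slides), P(Y | X) can be estimated from transaction data by simple counting; the products and transactions below are invented for the example.

# Hypothetical transactions; estimate P(chips | beer) by counting co-occurrences.
transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer"},
    {"milk", "bread"},
    {"beer", "salsa"},
]

X, Y = "beer", "chips"
n_X  = sum(1 for t in transactions if X in t)
n_XY = sum(1 for t in transactions if X in t and Y in t)
print(f"P({Y} | {X}) = {n_XY / n_X:.2f}")  # fraction of X-buyers who also bought Y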

• Supervised Learning
  o Classification
  o Regression/Prediction

  Example: Customer Credit Scoring (a minimal sketch follows)
  Discriminant: IF income > θ1 AND savings > θ2
                THEN low-risk ELSE high-risk
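A minimal sketch of the discriminant above; the threshold values θ1 and θ2 are placeholders chosen only for illustration.

THETA1 = 30000  # income threshold θ1 (illustrative value)
THETA2 = 10000  # savings threshold θ2 (illustrative value)

def credit_risk(income, savings):
    # IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
    return "low-risk" if income > THETA1 and savings > THETA2 else "high-risk"

print(credit_risk(45000, 15000))  # low-risk
print(credit_risk(45000,  5000))  # high-risk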

• Unsupervised Learning
o Clustering
The Importance of Clustering and Classification

• To make sense of and extract value from large sets of
  structured and unstructured data

• Due to huge volumes of unstructured data, there is a need
  to partition data into logical groupings
  o to take a comprehensive view of large data
  o to form logical structures based on the findings
  o to do deeper analysis of the datasets

• Is classification the same as clustering?


Classification and Clustering

• With classification, the number of classes into which the data
  should be grouped is already known
  o you specify what class each data point should be assigned to
  o the data in the dataset being learned from is labelled

• With clustering algorithms, there is no predefined notion of the
  appropriate number of clusters
  o we rely on the clustering algorithm to sort and cluster the data
  o there is learning from unlabelled data
Cluster and Clustering Algorithm

In the simplest form,

• clusters are sets of data points that share similar attributes,
  and clustering algorithms are the methods that group these
  data points into different clusters based on their similarities

• diverse applications: disease classification in medical science,
  customer classification in marketing research, and environmental
  health risk assessment in environmental engineering
Why do we need clustering?
Types of Clustering

• Hard clustering: grouping the data items such that each item
  is assigned to exactly one cluster

• Soft clustering: grouping the data items such that an item
  can belong to multiple clusters (both are illustrated below)
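To make the distinction concrete, here is a small sketch (assuming scikit-learn is available): k-Means produces a hard assignment, while a Gaussian mixture model produces soft membership probabilities.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])

# Hard clustering: each item is assigned to exactly one cluster.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("hard labels:", hard_labels)

# Soft clustering: each item gets a membership probability for every cluster.
soft = GaussianMixture(n_components=2, random_state=0).fit(X)
print("soft memberships:\n", soft.predict_proba(X).round(2))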
What is clustering?
• The organization of unlabeled data into
  similarity groups called clusters
• A cluster is a collection of data items that are
  "similar" to one another and "dissimilar" to data
  items in other clusters
Historic application of clustering
What do we need for clustering?
Distance (dissimilarity) measures
Cluster evaluation (a hard problem)
• Intra-cluster cohesion (compactness):
  – Cohesion measures how near the data points in a
    cluster are to the cluster centroid
  – The sum of squared error (SSE) is a commonly used measure

• Inter-cluster separation (isolation):
  – Separation means that different cluster centroids should
    be far away from one another (both measures are sketched below)

Note: In most applications, expert judgement is still the key
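A small NumPy sketch of both measures, computing the intra-cluster SSE and the distance between centroids for a toy clustering (the data below is made up for illustration).

import numpy as np

# Toy data already split into two clusters (assumed for illustration).
clusters = [
    np.array([[1.0, 1.0], [1.5, 1.2], [0.8, 0.9]]),
    np.array([[5.0, 5.0], [5.3, 4.7], [4.8, 5.2]]),
]
centroids = [c.mean(axis=0) for c in clusters]

# Intra-cluster cohesion: sum of squared distances to the cluster centroid (SSE).
sse = sum(((c - m) ** 2).sum() for c, m in zip(clusters, centroids))
print("SSE (cohesion):", round(float(sse), 3))

# Inter-cluster separation: distance between the two centroids.
print("separation:", round(float(np.linalg.norm(centroids[0] - centroids[1])), 3))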


How many clusters?
Clustering Techniques
[Figure: taxonomy of clustering techniques, including divisive hierarchical clustering]
Clustering Techniques: Overview
Popular Clustering Algorithms

• k-Means

• Mean Shift

• DBSCAN

• Agglomerative Hierarchical
k-Means clustering
• k-Means (MacQueen, 1967) is a partitional
  clustering algorithm

• Let the set of data points D be {x1, x2, …, xn},
  where xi = (xi1, xi2, …, xir) is a vector in X ⊆ R^r, and r is
  the number of dimensions

• The k-means algorithm partitions the given data
  into k clusters:
  – Each cluster has a cluster center, called the centroid
  – k is specified by the user
k-Means algorithm
Given k, the k-Means algorithm works as follows (a sketch is given after the steps):

1. Choose k (random) data points (seeds) to be the
   initial centroids (cluster centers)
2. Assign each data point to the closest centroid
3. Re-compute the centroids using the current
   cluster memberships
4. If a convergence criterion is not met, repeat
   steps 2 and 3
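A minimal NumPy sketch of the four steps, with a centroid-change stopping rule; this is an illustrative implementation, not production code.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute the centroids from the current memberships.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids (essentially) stop moving.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])
print(kmeans(X, k=2))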
k-Means convergence (stopping) criterion
• no (or minimal) re-assignment of data points to
  different clusters, or

• no (or minimal) change of centroids, or

• minimal decrease in the sum of squared error (SSE),

      SSE = Σ_{j=1}^{k} Σ_{x ∈ Cj} d(x, mj)²

  – Cj is the jth cluster,
  – mj is the centroid of cluster Cj (the mean vector of all the data
    points in Cj),
  – d(x, mj) is the (Euclidean) distance between data point x and
    centroid mj.
k-Means clustering example: step 1
k-Means clustering example: step 2
k-Means clustering example: step 3
k-Means clustering example
k-Means clustering example
k-Means clustering example
Why use k-means?
• Strengths
  – Simple: easy to understand and to implement
  – Efficient: time complexity is O(tkn), where
      n is the number of data points,
      k is the number of clusters, and
      t is the number of iterations
  – Since both k and t are small compared to n, k-means is
    considered a linear algorithm with complexity O(n)

• k-means is the most popular clustering algorithm

• Note that it terminates at a local optimum if SSE is used;
  the global optimum is hard to find due to complexity
Weaknesses of k-Means
• The algorithm is only applicable if the mean is defined
  – For categorical data, use k-modes, where the centroid is
    represented by the most frequent values

• The user needs to specify k

• The algorithm is sensitive to outliers
  – Outliers are data points that are very far away from
    other data points
  – Outliers could be errors in the data recording or
    special data points with very different values
Outliers
Dealing with outliers
• Remove data points that are much farther away from the
  centroids than other data points
  – To be safe, we may want to monitor these possible outliers
    over a few iterations and then decide whether to remove them

• Perform random sampling: by choosing a small subset of the
  data points, the chance of selecting an outlier is much smaller
  – Assign the rest of the data points to the clusters by
    distance or similarity comparison, or by classification
Sensitivity to initial seeds

[Figure: two runs with different random selections of seeds
(centroids), each shown at Iteration 1 and Iteration 2]


Special data structures
• The k-means algorithm is not suitable for
discovering clusters that are not hyper-ellipsoids
(or hyper-spheres)
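A quick sketch of this limitation, assuming scikit-learn is available: on the two-moons dataset, whose clusters are not hyper-spherical, k-means cuts across the true groups.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are not hyper-spheres.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true two-moon grouping is well below 1,
# because k-means splits each moon rather than separating them.
print("adjusted Rand index:", round(adjusted_rand_score(y_true, labels), 2))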
k-Means variant: k-Medians
k-Medians is similar to k-Means, except

• Instead of re-computing the group centre points using the
  mean, we use the median vector of the group (see the sketch below)

• This method is less sensitive to outliers (because of
  using the median)

• However, it is much slower for larger datasets, as sorting
  is required in each iteration when computing the
  median vector
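A brief sketch of the one step that changes relative to k-means, the centre update: the per-dimension median replaces the mean (NumPy assumed; data invented for illustration).

import numpy as np

cluster_points = np.array([[1.0, 2.0],
                           [2.0, 3.0],
                           [3.0, 4.0],
                           [50.0, 60.0]])  # one outlier

mean_centre   = cluster_points.mean(axis=0)        # k-Means update
median_centre = np.median(cluster_points, axis=0)  # k-Medians update

print("mean centre:  ", mean_centre)    # dragged toward the outlier
print("median centre:", median_centre)  # stays near the bulk of the points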
k-Means Summary
• Despite its weaknesses, k-means is still the most
  popular algorithm due to its simplicity and efficiency

• There is no clear evidence that any other clustering
  algorithm performs better in general

• Comparing different clustering algorithms is a difficult task

• No one knows the correct clusters!


Hierarchical Agglomerative
• Hierarchical clustering algorithms are either top-down or bottom-up

• Bottom-up algorithms treat each data point as a single cluster
  at the outset and then successively merge (or agglomerate)
  pairs of clusters
  o until all clusters have been merged into a single cluster
    that contains all data points

• Bottom-up clustering is therefore called hierarchical
  agglomerative clustering, or HAC
• The cluster hierarchy is represented as a tree (dendrogram)
• The root of the tree is the unique cluster that gathers all the samples
• The leaves are the clusters with only one sample
Graphic Illustration- HAC
Explanation- HAC
• We begin by treating each data point as a single cluster
  o if there are X data points in a dataset, there will be X clusters

• Next, select a distance metric that measures the distance
  between two clusters
  o As an example, we will use average linkage, which defines the
    distance between two clusters to be the average distance
    between data points in the first cluster and data points in
    the second cluster

• On each iteration, we combine two clusters into one
  o Policy: the two clusters to be combined are selected as those
    with the smallest average linkage (see the sketch below)
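A short sketch of this procedure using SciPy (assumed available): linkage with method='average' performs the bottom-up merges, and fcluster cuts the resulting dendrogram into a chosen number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 0.8],
              [5.0, 5.0], [5.1, 4.9], [8.0, 1.0]])

# Each point starts as its own cluster; at every step the two clusters with the
# smallest average linkage (mean pairwise distance) are merged.
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram into a flat clustering with at most 3 clusters.
print(fcluster(Z, t=3, criterion="maxclust"))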
HAC Summary
• Does not require specifying the number of clusters in advance

• Additionally, the algorithm is not sensitive to the choice of
  distance metric

• Use case: when the underlying data has a hierarchical structure

• Lower efficiency, as it has a time complexity of O(n³)


Exercise- k-means

Problem-01:

Cluster the following eight points (with (x, y) representing locations) into three
clusters:

A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)

Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).

The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as
ρ(a, b) = |x2 – x1| + |y2 – y1|

Use the k-Means algorithm to find the three cluster centers after the second
iteration.
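If you want to verify a hand computation, the following sketch (NumPy assumed) runs two iterations of k-means with the Manhattan distance defined above, updating each centre as the mean of its assigned points as in the earlier slides.

import numpy as np

points  = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                    [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)  # A1..A8
centres = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)         # A1, A4, A7

for it in range(2):
    # Assign each point to the nearest centre under the Manhattan distance.
    dists = np.abs(points[:, None, :] - centres[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Re-compute each centre as the mean of its assigned points
    # (no cluster becomes empty for this particular exercise).
    centres = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"centres after iteration {it + 1}:\n{centres}")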
THANK YOU
