
Module-3: Unsupervised Learning

Types of Unsupervised Learning: K-means clustering, Hierarchical clustering; Applications of unsupervised learning; Dimensionality reduction techniques (PCA, LDA)

Introduction to Unsupervised Learning:


In unsupervised learning the data have no target attribute; we explore the data to find some intrinsic structure in them.

Clustering:
Clustering is a technique for finding similarity groups in data, called clusters. That is,
o it groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
 Clustering is often called an unsupervised learning task as no class values denoting an a priori
grouping of the data instances are given, which is the case in supervised learning.
 Due to historical reasons, clustering is often considered synonymous with unsupervised
learning.
o In fact, association rule mining is also unsupervised
For example, a data set may contain three natural groups of data points, i.e., 3 natural clusters.

 Let us see some real-life examples


o Example 1: group people of similar sizes together to make “small”, “medium” and
“large” T-shirts.
 Tailor-made for each person: too expensive
 One-size-fits-all: does not fit all.
o Example 2: In marketing, segment customers according to their similarities
 To do targeted marketing.
o Example 3: Given a collection of text documents, we want to organize them according
to their content similarities,
 To produce a topic hierarchy
In fact, clustering is one of the most utilized data mining techniques.
It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
In recent years, due to the rapid increase of online documents, text clustering has become increasingly important.

Video:
1. Unsupervised Learning | Clustering and Association Algorithms: URL:
https://www.youtube.com/watch?v=UhVn2WrzMnI

K-means clustering
 K-means is a partitional clustering algorithm

Let the set of data points (or instances) D be {x1, x2, …, xn},

o where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.

 The k-means algorithm partitions the given data into k clusters.

o Each cluster has a cluster center, called centroid.

o k is specified by the user

K-means algorithm
 Given k, the k-means algorithm works as follows:

1) Randomly choose k data points (seeds) to be the initial centroids, cluster centers

2) Assign each data point to the closest centroid

3) Re-compute the centroids using the current cluster memberships.

4) If a convergence criterion is not met, go to 2).
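
The four steps above can be written as a short NumPy sketch (illustrative only, not taken from the notes; the function name, the random seeding and the convergence test via np.allclose are our own choices):

import numpy as np

def kmeans(X, k, max_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    # 1) randomly choose k data points (seeds) as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):                       # cap the number of iterations
        # 2) assign each data point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) re-compute each centroid as the mean of the points assigned to it
        # (a fuller implementation would also handle clusters that become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) stop when the centroids no longer move (one convergence criterion)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()      # sum of squared error (SSE)
    return labels, centroids, sse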


Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,

2. no (or minimum) change of centroids, or

3. minimum decrease in the sum of squared error (SSE),

   SSE = Σ_{j=1..k} Σ_{x ∈ Cj} dist(x, mj)²

 where Cj is the jth cluster, mj is the centroid of cluster Cj (the mean vector of all the data points
in Cj), and dist(x, mj) is the distance between data point x and centroid mj.
An example distance function
A commonly used distance function is the Euclidean distance: dist(x, mj) = sqrt((x1 − mj1)² + … + (xr − mjr)²). This is also the distance used in the worked examples below.

A disk version of k-means


 K-means can be implemented with data on disk

o In each iteration, it scans the data once.

o as the centroids can be computed incrementally

 It can be used to cluster large datasets that do not fit in main memory

 We need to control the number of iterations

o In practice, a limit is set (e.g., < 50 iterations).

 Not the best method. There are other scale-up algorithms, e.g., BIRCH.
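
As a rough sketch of the disk-based idea (our own illustration, assuming a hypothetical text file with one comma-separated data point per line): a single scan of the file accumulates, for every cluster, a running sum and a count, from which the new centroids are computed without ever holding the whole data set in memory.

import numpy as np

def one_disk_pass(path, centroids):
    """One k-means iteration over data stored on disk (single scan)."""
    k, r = centroids.shape
    sums = np.zeros((k, r))          # running sum of points per cluster
    counts = np.zeros(k)             # running count of points per cluster
    with open(path) as f:
        for line in f:               # the data are streamed, never loaded at once
            if not line.strip():
                continue
            x = np.array(line.strip().split(","), dtype=float)
            j = int(np.linalg.norm(centroids - x, axis=1).argmin())
            sums[j] += x
            counts[j] += 1
    # the centroids are re-computed incrementally from the accumulated statistics
    return sums / np.maximum(counts[:, None], 1)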
Strengths of k-means
 Strengths:

o Simple: easy to understand and to implement

o Efficient: Time complexity: O(tkn),

o where n is the number of data points,


o k is the number of clusters, and
o t is the number of iterations.

o Since both k and t are usually small, k-means is considered a linear algorithm (linear in n).

 K-means is the most popular clustering algorithm.

Note that k-means terminates at a local optimum if SSE is used as the objective. The global optimum is hard to find due
to the complexity of the problem.

Weaknesses of k-means
 The algorithm is only applicable if the mean is defined.

o For categorical data, use k-modes instead, where the centroid is represented by the most frequent values.

 The user needs to specify k.

 The algorithm is sensitive to outliers

o Outliers are data points that are very far away from other data points.
o Outliers could be errors in the data recording or some special data points with very
different values.

Weaknesses of k-means: Problems with outliers

Weaknesses of k-means: To deal with outliers

 One method is to remove some data points in the clustering process that are much further away
from the centroids than other data points.

o To be safe, we may want to monitor these possible outliers over a few iterations and
then decide to remove them.

 Another method is to perform random sampling. Since in sampling we only choose a small
subset of the data points, the chance of selecting an outlier is very small.

o Assign the rest of the data points to the clusters by distance or similarity comparison, or
classification
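
A minimal sketch of the first idea (removing points that lie much farther from their centroid than the rest; the cut-off of 3 times the median distance is our own illustrative choice, not from the notes):

import numpy as np

def drop_possible_outliers(X, labels, centroids, factor=3.0):
    # distance of every point to the centroid of its own cluster
    d = np.linalg.norm(X - centroids[labels], axis=1)
    # keep only points whose distance is not unusually large
    keep = d <= factor * np.median(d)
    return X[keep], labels[keep]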

The algorithm is sensitive to initial seeds.

o Different choices of initial seeds can lead to quite different results: some seeds give poor clusterings, while others give good results.

The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids
(or hyper-spheres).

K-means summary
Despite its weaknesses, k-means is still the most popular clustering algorithm due to its simplicity and efficiency, and because
o other clustering algorithms have their own lists of weaknesses.
 No clear evidence that any other clustering algorithm performs better in general
o although they may be more suitable for some specific types of data or applications.
 Comparing different clustering algorithms is a difficult task. No one knows the correct
clusters!

Example:
As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores
of two variables on each of seven individuals:

Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
 
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let
the A & B values of the two individuals furthest apart (using the Euclidean distance measure) define the
initial cluster means, giving:
            Individual   Mean Vector (centroid)
Group 1     1            (1.0, 1.0)
Group 2     4            (5.0, 7.0)
 
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a
new member is added. This leads to the following series of steps:

Step   Cluster 1: Individuals   Cluster 1: Mean Vector (centroid)   Cluster 2: Individuals   Cluster 2: Mean Vector (centroid)
1      1                        (1.0, 1.0)                          4                        (5.0, 7.0)
2      1, 2                     (1.2, 1.5)                          4                        (5.0, 7.0)
3      1, 2, 3                  (1.8, 2.3)                          4                        (5.0, 7.0)
4      1, 2, 3                  (1.8, 2.3)                          4, 5                     (4.2, 6.0)
5      1, 2, 3                  (1.8, 2.3)                          4, 5, 6                  (4.3, 5.7)
6      1, 2, 3                  (1.8, 2.3)                          4, 5, 6, 7               (4.1, 5.4)
 
Now the initial partition has changed, and the two clusters at this stage have the following
characteristics:

            Individual      Mean Vector (centroid)
Cluster 1   1, 2, 3         (1.8, 2.3)
Cluster 2   4, 5, 6, 7      (4.1, 5.4)
 
But we cannot yet be sure that each individual has been assigned to the right cluster.  So, we compare
each individual’s distance to its own cluster mean and to
that of the opposite cluster. And we find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
 
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In
other words, each individual's distance to its own cluster mean should be smaller than the distance to
the other cluster's mean (which is not the case with individual 3). Thus, individual 3 is relocated to
Cluster 2, resulting in the new partition:

            Individual        Mean Vector (centroid)
Cluster 1   1, 2              (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7     (3.9, 5.1)
 
The iterative relocation would now continue from this new partition until no more relocations occur.
However, in this example each individual is now nearer its own cluster mean than that of the other
cluster, so the iteration stops and the latest partitioning is taken as the final cluster solution.
Also, it is possible that the k-means algorithm takes many iterations to settle on a final solution. In that
case it is a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
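
The same seven-point example can be reproduced in code (a sketch assuming scikit-learn and NumPy are installed; we seed with individuals 1 and 4, as in the worked example, instead of using random seeds):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])   # subjects 1..7

# start from the two individuals furthest apart, as in the worked example
init = np.array([[1.0, 1.0], [5.0, 7.0]])
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)           # expected grouping: {1, 2} and {3, 4, 5, 6, 7}
print(km.cluster_centers_)  # approx. (1.25, 1.5) and (3.9, 5.1)
print(km.inertia_)          # final SSE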
Example 2:
Suppose we want to group the visitors to a website using just their age (one-dimensional space) as follows:
n = 19
15,15,16,19,19,20,20,21,22,28,35,40,41,42,43,44,60,61,65
 
Initial centroids (random choice or averages):
k = 2
c1 = 16
c2 = 22

 
Iteration 1 (centroids used: c1 = 16, c2 = 22):

xi    c1   c2   Distance 1   Distance 2   Nearest Cluster
15    16   22   1            7            1
15    16   22   1            7            1
16    16   22   0            6            1
19    16   22   3            3            2
19    16   22   3            3            2
20    16   22   4            2            2
20    16   22   4            2            2
21    16   22   5            1            2
22    16   22   6            0            2
28    16   22   12           6            2
35    16   22   19           13           2
40    16   22   24           18           2
41    16   22   25           19           2
42    16   22   26           20           2
43    16   22   27           21           2
44    16   22   28           22           2
60    16   22   44           38           2
61    16   22   45           39           2
65    16   22   49           43           2

(The two 19s are equidistant from both centroids; here they are assigned to cluster 2.)
New centroids: c1 = mean(15, 15, 16) = 15.33; c2 = mean of the remaining 16 points = 36.25
 
Iteration 2 (centroids used: c1 = 15.33, c2 = 36.25):

xi    c1      c2      Distance 1   Distance 2   Nearest Cluster
15    15.33   36.25   0.33         21.25        1
15    15.33   36.25   0.33         21.25        1
16    15.33   36.25   0.67         20.25        1
19    15.33   36.25   3.67         17.25        1
19    15.33   36.25   3.67         17.25        1
20    15.33   36.25   4.67         16.25        1
20    15.33   36.25   4.67         16.25        1
21    15.33   36.25   5.67         15.25        1
22    15.33   36.25   6.67         14.25        1
28    15.33   36.25   12.67        8.25         2
35    15.33   36.25   19.67        1.25         2
40    15.33   36.25   24.67        3.75         2
41    15.33   36.25   25.67        4.75         2
42    15.33   36.25   26.67        5.75         2
43    15.33   36.25   27.67        6.75         2
44    15.33   36.25   28.67        7.75         2
60    15.33   36.25   44.67        23.75        2
61    15.33   36.25   45.67        24.75        2
65    15.33   36.25   49.67        28.75        2

New centroids: c1 = 18.56, c2 = 45.90
 
Iteration 3 (centroids used: c1 = 18.56, c2 = 45.90):

xi    c1      c2     Distance 1   Distance 2   Nearest Cluster
15    18.56   45.9   3.56         30.9         1
15    18.56   45.9   3.56         30.9         1
16    18.56   45.9   2.56         29.9         1
19    18.56   45.9   0.44         26.9         1
19    18.56   45.9   0.44         26.9         1
20    18.56   45.9   1.44         25.9         1
20    18.56   45.9   1.44         25.9         1
21    18.56   45.9   2.44         24.9         1
22    18.56   45.9   3.44         23.9         1
28    18.56   45.9   9.44         17.9         1
35    18.56   45.9   16.44        10.9         2
40    18.56   45.9   21.44        5.9          2
41    18.56   45.9   22.44        4.9          2
42    18.56   45.9   23.44        3.9          2
43    18.56   45.9   24.44        2.9          2
44    18.56   45.9   25.44        1.9          2
60    18.56   45.9   41.44        14.1         2
61    18.56   45.9   42.44        15.1         2
65    18.56   45.9   46.44        19.1         2

New centroids: c1 = 19.50, c2 = 47.89
  
Iteration 4 (centroids used: c1 = 19.50, c2 = 47.89):

xi    c1     c2      Distance 1   Distance 2   Nearest Cluster
15    19.5   47.89   4.50         32.89        1
15    19.5   47.89   4.50         32.89        1
16    19.5   47.89   3.50         31.89        1
19    19.5   47.89   0.50         28.89        1
19    19.5   47.89   0.50         28.89        1
20    19.5   47.89   0.50         27.89        1
20    19.5   47.89   0.50         27.89        1
21    19.5   47.89   1.50         26.89        1
22    19.5   47.89   2.50         25.89        1
28    19.5   47.89   8.50         19.89        1
35    19.5   47.89   15.50        12.89        2
40    19.5   47.89   20.50        7.89         2
41    19.5   47.89   21.50        6.89         2
42    19.5   47.89   22.50        5.89         2
43    19.5   47.89   23.50        4.89         2
44    19.5   47.89   24.50        3.89         2
60    19.5   47.89   40.50        12.11        2
61    19.5   47.89   41.50        13.11        2
65    19.5   47.89   45.50        17.11        2

New centroids (unchanged): c1 = 19.50, c2 = 47.89
  
No change occurs between iterations 3 and 4, so the algorithm stops. Clustering has identified two groups of visitors: ages 15-28 and
ages 35-65. Because the initial choice of centroids can affect the output clusters, the algorithm is often run multiple times with
different starting conditions in order to get a fair view of what the clusters should be.
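
The one-dimensional run above can be checked with a short NumPy sketch (our own illustration; the fixed starting centroids 16 and 22 follow the example, while the iteration cap of 50 is an arbitrary choice):

import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65], dtype=float)
centroids = np.array([16.0, 22.0])        # initial centroids c1 and c2

for _ in range(50):                        # iterate until the centroids stop moving
    labels = np.abs(ages[:, None] - centroids[None, :]).argmin(axis=1)
    new = np.array([ages[labels == j].mean() for j in range(2)])
    if np.allclose(new, centroids):
        break
    centroids = new

print(centroids)   # approx. [19.5, 47.89]
print(labels)      # label 0 = ages 15-28 (cluster 1), label 1 = ages 35-65 (cluster 2)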

Video Tutorials
1. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=1XqG0kaJVHY&feature=emb_logo
2. K Means Clustering Algorithm: URL: https://www.youtube.com/watch?v=EItlUEPCIzM

Hierarchical Clustering
Hierarchical clustering is another unsupervised learning algorithm that is used to group together
unlabeled data points having similar characteristics. Hierarchical clustering algorithms fall into the following
two categories.

Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is initially
treated as a single cluster, and pairs of clusters are then successively merged or agglomerated (bottom-up approach).
The hierarchy of the clusters is represented as a dendrogram or tree structure.

Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data
points are treated as one big cluster, and the process of clustering involves dividing (top-down approach)
the one big cluster into various small clusters.

Agglomerative clustering
The agglomerative or bottom-up clustering method works as follows:
1) Assign each observation to its own cluster.
2) Compute the similarity (e.g., distance) between each pair of clusters.
3) Join the two most similar clusters.
4) Repeat steps 2 and 3 until only a single cluster is left.
Before any clustering is performed, it is required to determine the proximity matrix containing the distance between each pair of points,
computed using a distance function. The matrix is then updated after each merge to give the distance between clusters. The following methods differ in
how the distance between two clusters is measured.

(An example of the step-by-step working of the algorithm is given at the end of this section.)

Measuring the distance of two clusters


 A few ways to measure distances of two clusters.
 Results in different variations of the algorithm.

o Single link

o Complete link

o Average link

o Centroids

o …

Single link method


In single-linkage hierarchical clustering, the distance between two clusters is defined as the shortest
distance between two points, one in each cluster. For example, the distance between clusters “r” and “s”
is the distance between their two closest points.

Complete link method


In complete-linkage hierarchical clustering, the distance between two clusters is defined as the longest
distance between two points, one in each cluster. For example, the distance between clusters “r” and “s”
is the distance between their two furthest points.
Average link method:
In average-linkage hierarchical clustering, the distance between two clusters is defined as the average
distance between each point in one cluster and every point in the other cluster. For example, the distance
between clusters “r” and “s” is the average of the distances between every pair of points, one from each
cluster.

Centroid method:
 In this method, the distance between two clusters is the distance between their centroids
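
The different ways of measuring cluster-to-cluster distance correspond to the method argument of SciPy's linkage function (a sketch assuming SciPy and NumPy; the random 2-D data are only for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))              # 20 illustrative 2-D points

# The 'method' argument selects how the distance between two clusters is measured.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                      # full merge history
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, labels)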

The complexity
All the algorithms are at least O(n²), where n is the number of data points.

Single link can be done in O(n²).

Complete and average links can be done in O(n² log n).

Due to the complexity, these methods are hard to use for large data sets. Possible remedies:

o Sampling

o Scale-up methods (e.g., BIRCH).

An Example

Let’s now see a simple example: a hierarchical clustering of distances in kilometers between some Italian
cities. The method used is single-linkage.

Input distance matrix (L = 0 for all the clusters):

  BA FI MI NA RM TO
BA 0 662 877 255 412 996
FI 662 0 295 468 268 400
MI 877 295 0 754 564 138
NA 255 468 754 0 219 869
RM 412 268 564 219 0 669
TO 996 400 138 869 669 0

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called
"MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single link
clustering the rule is that the distance from the compound object to another object is equal to the
shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to
RM is chosen to be 564, which is the distance from MI to RM, and so on.

After merging MI with TO we obtain the following matrix:

  BA FI MI/TO NA RM
BA 0 662 877 255 412
FI 662 0 295 468 268
MI/TO 877 295 0 754 564
NA 255 468 754 0 219
RM 412 268 564 219 0


min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m=2

  BA FI MI/TO NA/RM
BA 0 662 877 255
FI 662 0 295 268
MI/TO 877 295 0 564
NA/RM 255 268 564 0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m=3

  BA/NA/RM FI MI/TO
BA/NA/RM 0 268 564
FI 268 0 295
MI/TO 564 295 0

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m=4

  BA/FI/NA/RM MI/TO
BA/FI/NA/RM 0 295
MI/TO 295 0

Finally, we merge the last two clusters at level 295.

The process is summarized by a hierarchical tree (dendrogram) whose successive merge levels are 138, 219, 255, 268 and 295.
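
The same single-link merge sequence can be reproduced with SciPy (a sketch assuming SciPy and NumPy; the city order BA, FI, MI, NA, RM, TO matches the input matrix above):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]])

# single-linkage clustering on the condensed form of the distance matrix
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels: 138, 219, 255, 268, 295

Passing Z to scipy.cluster.hierarchy.dendrogram would draw the corresponding hierarchical tree.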


Video:

1. Hierarchical Clustering: URL: https://www.youtube.com/watch?v=tlIv3IT_hHk&feature=emb_logo

2. Hierarchical Clustering: URL: https://www.youtube.com/watch?v=9U4h6pZw6f8
