
DISCLAIMER

In preparation of these slides, material has been taken from different online sources such as books, websites, research papers, and presentations. The author does not intend to claim any of this material as her/his own. This lecture (audio, video, slides, etc.) is prepared and delivered only for educational purposes and is not intended to infringe upon copyrighted material. Sources have been acknowledged where applicable. The views expressed are the presenter's alone and do not necessarily represent those of the original author(s) or the institution.
Cluster Analysis
What is Cluster Analysis?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters

• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
What is Cluster Analysis?
• Cluster analysis is an important human activity
• Early in childhood, we learn how to distinguish between cats and dogs

• Unsupervised learning: no predefined classes

• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Clustering
• Hard vs. Soft
– Hard: each object belongs to exactly one cluster
– Soft: an object may belong to several clusters (e.g., with degrees of membership)
Clustering
• Flat vs. Hierarchical
– Flat: a single-level partition of the data, with no nesting
– Hierarchical: clusters form a tree
• Agglomerative (bottom-up; see the sketch below)
• Divisive (top-down)
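For the agglomerative (bottom-up) case, a minimal sketch using SciPy; the four toy points and the choice of Ward linkage are illustrative assumptions, not from the slides:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy 2-D points forming two obvious groups.
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])

Z = linkage(X, method="ward")                    # bottom-up merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 flat clusters
print(labels)                                    # e.g. [1 1 2 2]

Cutting the tree at a different level yields a different flat clustering, which is the practical link between hierarchical and flat methods.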
Clustering: Rich Applications and
Multidisciplinary Efforts

• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters or for other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
(Similar to one another within the same cluster)
– low inter-class similarity
(Dissimilar to the objects in other clusters)
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
Similarity and Dissimilarity Between
Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects
• Some popular ones include the Minkowski distance:

d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:

d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|


Similarity and Dissimilarity Between
Objects (Cont.)

• If q = 2, d is the Euclidean distance:

d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

• Also, one can use weighted distance, parametric Pearson correlation, or other dissimilarity measures
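These distances differ only in the exponent q. A minimal, self-contained Python sketch (the sample points are made up for illustration):

# Minkowski distance of order q between two p-dimensional points.
def minkowski(x, y, q):
    # q-th root of the sum of q-th powers of coordinate differences.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = (1.0, 2.0, 3.0)
j = (4.0, 6.0, 3.0)
print(minkowski(i, j, 1))  # q = 1, Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(i, j, 2))  # q = 2, Euclidean: sqrt(9 + 16 + 0) = 5.0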
Major Clustering Approaches
• Partitioning approach:
– Construct various partitions and then evaluate them by some criterion, e.g., minimizing
the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
– Create a hierarchical decomposition of the set of data (or objects) using some criterion
– Typical methods: AGNES (agglomerative), DIANA (divisive), BIRCH, ROCK, CHAMELEON
• Density-based approach:
– Based on connectivity and density functions
– Typical methods: DBSCAN, OPTICS, DenClue
Clustering Approaches
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
Partitioning Algorithms: Basic Concept
• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion
• k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the center of
the cluster
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the
current partition (the centroid is the center, i.e., mean point, of
the cluster)
3. Assign each object to the cluster with the nearest seed
point
4. Go back to Step 2; stop when no object changes cluster (a minimal sketch follows this list)
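A minimal one-dimensional sketch of these four steps (plain Python; seeding with k random points is an assumption, since the slides only say "partition into k nonempty subsets"):

import random

def k_means(points, k, iters=100):
    # Step 1: pick k distinct points as initial seeds.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Step 2: recompute each centroid as the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:  # Step 4: stop when nothing changes.
            break
        centroids = new
    return centroids, clusters

The worked example a few slides below fixes the initial centroids instead of sampling them.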
K-means Clustering
[Figures: successive iterations of k-means on a 2-D point set]
The K-Means Clustering Method
[Figure: the k-means loop with K = 2 on a 2-D point set. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again until no reassignment occurs.]
Example
• Run k-means clustering with 3 clusters on the one-dimensional data
2, 3, 4, 7, 9, 10, 11, 12, 16, 18, 19, 23, 24, 25, 30
(initial centroids: 3, 16, 25) for at least 2 iterations
Example
• Iteration 1 (assign each point to the nearest centroid, then recompute the means):
– 3: {2, 3, 4, 7, 9} → new centroid 5
– 16: {10, 11, 12, 16, 18, 19} → new centroid 14.33
– 25: {23, 24, 25, 30} → new centroid 25.5
Example
• Iteration 2:
– 5: {2, 3, 4, 7, 9} → new centroid 5
– 14.33: {10, 11, 12, 16, 18, 19} → new centroid 14.33
– 25.5: {23, 24, 25, 30} → new centroid 25.5
• No assignment changed, so the algorithm has converged (see the sketch below)
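The two iterations above can be checked mechanically; a short sketch assuming the 15 one-dimensional points listed in the example:

data = [2, 3, 4, 7, 9, 10, 11, 12, 16, 18, 19, 23, 24, 25, 30]
centroids = [3.0, 16.0, 25.0]

for it in range(2):
    # Assign each point to its nearest centroid, then recompute the means.
    clusters = [[] for _ in centroids]
    for p in data:
        nearest = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters]
    print(it + 1, [round(m, 2) for m in centroids])
# prints: 1 [5.0, 14.33, 25.5]
#         2 [5.0, 14.33, 25.5]  <- unchanged, so k-means has converged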


Practice Exercise
• Run k-means clustering with 3 clusters on the same data (initial centroids: 3, 12, 19) for at least 2 iterations
What Is the Problem of the K-Means Method?
• The k-means algorithm is sensitive to outliers!
– An object with an extremely large value may substantially distort the distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, use the medoid, the most centrally located object in the cluster
Limitations of K-means: Differing Sizes

[Figure: original points (left) vs. k-means result with 3 clusters (right)]


Limitations of K-means: Differing Density

[Figure: original points (left) vs. k-means result with 3 clusters (right)]


Limitations of K-means: Non-globular Shapes

[Figure: original points (left) vs. k-means result with 2 clusters (right)]


The K-Medoids Clustering Method

• Find representative objects, called medoids, in clusters
• Medoids are located in the center of the
clusters.
– Given data points, how to find the medoid? (one way is sketched below)
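One direct answer (a sketch, not from the slides): take the object whose total distance to all other objects in the cluster is smallest. Tested here on the second cluster from the worked example below:

def medoid(points, dist):
    # The medoid minimizes the sum of distances to all points in the cluster.
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
cluster = [(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)]
print(medoid(cluster, manhattan))  # (7, 4), total distance 9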
Categorical Values

• Handling categorical data: k-modes (Huang’98)

– Replacing means of clusters with modes


• Mode of an attribute: its most frequent value in the cluster

• k-modes is the categorical analogue of k-means

– Uses a frequency-based method to update the modes of clusters (sketched below)

– For a mixture of categorical and numerical data: the k-prototype method

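A hedged sketch of the frequency-based mode update (the attribute values are made-up illustrations):

from collections import Counter

def mode_update(records):
    # New cluster "center": the most frequent value of each attribute.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
print(mode_update(cluster))  # ('red', 'small')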
K-medoids Example
X1 = (2, 6)
X2 = (3, 4)
X3 = (3, 8)
X4 = (4, 7)
X5 = (6, 2)
X6 = (6, 4)
X7 = (7, 3)
X8 = (7, 4)
X9 = (8, 5)
X10 = (7, 6)
• Initialize k medoids
• Let us assume c1 = (3, 4) and c2 = (7, 4)
• Calculate distances so as to associate each data object with its nearest medoid
i    data object Xi    cost to c1 = (3, 4)    cost to c2 = (7, 4)
1    (2, 6)            3                      7
3    (3, 8)            4                      8
4    (4, 7)            4                      6
5    (6, 2)            5                      3
6    (6, 4)            3                      1
7    (7, 3)            5                      1
9    (8, 5)            6                      2
10   (7, 6)            6                      2

(cost = Manhattan distance; X2 and X8 are the medoids themselves, with cost 0)

Cluster1 = {(3, 4), (2, 6), (3, 8), (4, 7)}
Cluster2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
Total cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
• Select one of the non-medoids O′. Let us assume O′ = (7, 3)
• Now the medoids are c1 = (3, 4) and O′ = (7, 3)

i    data object Xi    cost to c1 = (3, 4)    cost to O′ = (7, 3)
1    (2, 6)            3                      8
3    (3, 8)            4                      9
4    (4, 7)            4                      7
5    (6, 2)            5                      2
6    (6, 4)            3                      2
8    (7, 4)            4                      1
9    (8, 5)            6                      3
10   (7, 6)            6                      3

Total cost (each object to its nearer medoid) = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
• Swap cost S = 22 − 20 = 2 > 0, so do not change the medoid: keep (7, 4) (a sketch of this computation follows)
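The swap test can be reproduced directly; the points and both medoid configurations below come from the example, and the cost is the total Manhattan distance of each point to its nearest medoid:

points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))

def total_cost(medoids):
    # Each point contributes its distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

before = total_cost([(3, 4), (7, 4)])  # 20
after = total_cost([(3, 4), (7, 3)])   # 22
print(after - before)                  # S = 2 > 0, so the swap is rejected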


Acknowledgements
 Introduction to Machine Learning, Ethem Alpaydin
 "Pattern Classification", Duda et al., John Wiley & Sons
 Read GMM from "Automated Detection of Exudates in Colored Retinal Images for Diagnosis of Diabetic Retinopathy", Applied Optics, Vol. 51, No. 20, 4858-4866, 2012
 Material in these slides has been taken from Biomisa.org resources

