
Clustering

1. General Concept of Clustering
2. Basic problems in determining clusters
3. Definition of distance functions between clusters
4. Introduce the K-Means Clustering Algorithm

Clustering is the art of grouping
together pattern vectors that in some
sense belong together because they
have similar characteristics and are
different from other pattern vectors.

In the most general problem, the
number of clusters or subgroups is
unknown, as are the properties that
make them similar.

Question:
How do we start the process of finding
clusters and identifying similarities?

Answer:
First, realize that clustering is an art:
there is no single correct answer, only feasible
alternatives.
Second, explore the structure of the data,
similarity measures, and the limitations of
various clustering procedures.

Formalization of the Problem of Clustering

Given a set S of $N_S$ n-dimensional pattern vectors:

$S = \{ x_j ;\ j = 1, 2, \ldots, N_S \}$

Clustering is the process of partitioning S
into K subsets, $Cl_k$, $k = 1, 2, \ldots, K$, called
clusters, that satisfy the following conditions:

1. The members in each subset are in
some sense similar and not similar to
members in the other subsets.

2. $Cl_k \neq \Phi$   (not empty)

3. $Cl_k \cap Cl_j = \Phi$ for $k \neq j$   (pairwise disjoint)

4. $\bigcup_{k=1}^{K} Cl_k = S$   (exhaustive)

$\Phi$ is the null set.

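As a minimal sketch, conditions 2 to 4 can be checked mechanically for a candidate set of clusters. The function name `is_valid_clustering` and the use of Python sets are illustrative assumptions, not part of the original slides. Condition 1 (similarity) is a judgment call that this check does not attempt.

```python
from itertools import combinations


def is_valid_clustering(S, clusters):
    """Check conditions 2-4: non-empty, pairwise disjoint, exhaustive."""
    S = set(S)
    clusters = [set(c) for c in clusters]
    # Condition 2: no cluster is the null set.
    non_empty = all(len(c) > 0 for c in clusters)
    # Condition 3: every pair of distinct clusters has an empty intersection.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    # Condition 4: the union of all clusters recovers S.
    exhaustive = set().union(*clusters) == S
    return non_empty and disjoint and exhaustive


S = ["a", "b", "c", "d"]
ok = is_valid_clustering(S, [{"a", "c"}, {"b", "d"}])
overlapping = is_valid_clustering(S, [{"a", "c"}, {"c", "b", "d"}])
incomplete = is_valid_clustering(S, [{"a"}, {"b", "d"}])
```

Here `ok` is true, while `overlapping` fails condition 3 and `incomplete` fails condition 4.
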
[Figure: Illustration of clusters and cluster centers]

We will now look at two examples that
illustrate problems in performing
meaningful clustering:

Example 1: Problems with scaling

Example 2: The nonuniqueness of results

Example 1:
Given the data below, obtained by
measuring the weight and diameter of 4
large foam balls labeled a, b, c, and d.

Find two clusters from the set { a, b, c, d }


Solution:
The plot of the points in the 2-dimensional
pattern space is given below.

By closeness in pattern space, select

Cl1 = { a,c }
Cl2 = { b,d }

The plot of the same points in the 2-dimensional
pattern space, with Diameter shown in inches
rather than feet (a different scale), is given below.

By closeness in pattern space, select

Cl1 = { a,b }
Cl2 = { c,d }

Which set of clusters is the
correct answer?
#1: Cl1 = { a,c }, Cl2 = { b,d }   (measured in feet)
#2: Cl1 = { a,b }, Cl2 = { c,d }   (measured in inches)
#3: Cl1 = { a,d }, Cl2 = { b,c }   (other measurement units)
#4: None of the above
#5: All of the above

Answer:
There is no single correct answer. The
clusters provide us with different
interpretations of the data, in which the
closeness of patterns is measured with
different definitions of similarity.

One approach to solving the scaling problem is to
normalize each dimension separately when the dimensions
represent different properties, such as weight and
diameter.

For our problem we have:
[Figure: normalized data for our problem, with axes Weight and Diameter]

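The effect of units, and of per-dimension normalization, can be sketched in a few lines. The measurements below are hypothetical (the values from the example are not reproduced here), and z-score normalization (subtract the mean, divide by the standard deviation) is only one common choice, assumed for illustration; the slides' own normalization formula is not shown. The helper names `nearest_to` and `normalize` are likewise illustrative.

```python
import math
from statistics import mean, pstdev

labels = ["a", "b", "c", "d"]
# Hypothetical data: (weight in lb, diameter in ft) for four balls.
feet = [(10.0, 1.0), (16.0, 1.2), (11.0, 2.5), (17.0, 2.8)]
# Same balls with the diameter converted to inches.
inches = [(w, 12.0 * d) for w, d in feet]


def nearest_to(points, i):
    """Label of the point closest (Euclidean distance) to point i."""
    others = [j for j in range(len(points)) if j != i]
    j = min(others, key=lambda j: math.dist(points[i], points[j]))
    return labels[j]


def normalize(points):
    """Z-score each dimension separately (assumed normalization)."""
    columns = list(zip(*points))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sigmas))
            for p in points]


nn_feet = nearest_to(feet, 0)      # weight dominates the distance
nn_inches = nearest_to(inches, 0)  # diameter dominates the distance
nn_norm_feet = nearest_to(normalize(feet), 0)
nn_norm_inches = nearest_to(normalize(inches), 0)
```

With these hypothetical values the nearest neighbor of `a` is `c` in feet but `b` in inches, while after normalization the answer no longer depends on the unit chosen.
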
We now concentrate on quantitative data and
examine measures of similarity between
pattern samples and clusters.

Euclidean distance between
two pattern vectors x and y:

$d(x, y) = \| x - y \| = \sqrt{(x - y)^T (x - y)}$

The smaller the distance, the larger
the similarity.

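A minimal sketch of the Euclidean distance for two pattern vectors of equal dimension; the function name `euclidean` is an illustrative choice, not taken from the slides.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two pattern vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("pattern vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


d = euclidean((0.0, 0.0), (3.0, 4.0))  # 3-4-5 right triangle, so d is 5.0
```
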
Measures of distance between
two pattern classes Si and Sj

1. Minimum distance, dmin
2. Average distance
3. Distance between means, dmean
4. Distance between medians
5. Maximum distance, dmax

[Figure: Interpretation of dmax, dmean, dmin]
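The five measures can be sketched as follows, using their commonly used definitions, which are assumed here rather than taken from the slides: the minimum, average, and maximum are taken over all pairs with one sample from each class, and the remaining two compare the class means and the component-wise class medians. All function names are illustrative.

```python
import math
from statistics import mean, median


def d_min(Si, Sj):
    """Smallest distance over all pairs (x in Si, y in Sj)."""
    return min(math.dist(x, y) for x in Si for y in Sj)


def d_max(Si, Sj):
    """Largest distance over all pairs (x in Si, y in Sj)."""
    return max(math.dist(x, y) for x in Si for y in Sj)


def d_avg(Si, Sj):
    """Average distance over all pairs (x in Si, y in Sj)."""
    return mean(math.dist(x, y) for x in Si for y in Sj)


def d_mean(Si, Sj):
    """Distance between the mean vectors of the two classes."""
    mi = tuple(mean(col) for col in zip(*Si))
    mj = tuple(mean(col) for col in zip(*Sj))
    return math.dist(mi, mj)


def d_median(Si, Sj):
    """Distance between component-wise medians (an assumed definition)."""
    mi = tuple(median(col) for col in zip(*Si))
    mj = tuple(median(col) for col in zip(*Sj))
    return math.dist(mi, mj)


Si = [(0.0, 0.0), (1.0, 0.0)]
Sj = [(3.0, 0.0), (5.0, 0.0)]
results = (d_min(Si, Sj), d_avg(Si, Sj), d_mean(Si, Sj),
           d_median(Si, Sj), d_max(Si, Sj))
```

For these two small classes on a line, the minimum distance is 2, the maximum is 5, and the other three measures all equal 3.5.
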


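Measure of Performance for Clustering

Overall performance measure J for a
given set of clusters $Cl_k$, $k = 1, 2, \ldots, K$:

$J = \sum_{k=1}^{K} \sum_{x_i \in Cl_k} \| x_i - M_k \|^2$

where the mean of each cluster is

$M_k = \frac{1}{N_k} \sum_{x_i \in Cl_k} x_i$

and $N_k$ is the number of samples in $Cl_k$.
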
If K = NS, the number of samples, then
each cluster center equals the single sample in
its cluster and J would be 0.
If K = 1, then all samples are in just one
cluster and J would be at its maximum.
Neither of these extremes provides useful
information!

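The two extremes can be illustrated numerically. This sketch assumes that J is the sum of squared Euclidean distances from each sample to the mean of its cluster; the sample values and the function names `cluster_mean` and `performance` are hypothetical.

```python
def cluster_mean(cluster):
    """Component-wise mean of the samples in one cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def performance(clusters):
    """J: sum of squared distances from each sample to its cluster mean."""
    J = 0.0
    for cluster in clusters:
        m = cluster_mean(cluster)
        for x in cluster:
            J += sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return J


# Hypothetical samples at the corners of a 4-by-2 rectangle.
samples = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

J_one = performance([samples])                    # K = 1
J_two = performance([samples[:2], samples[2:]])   # K = 2
J_each = performance([[x] for x in samples])      # K = N_S
```

For these samples J falls from 20 with one cluster, to 4 with two clusters, to 0 when every sample is its own cluster, so a small J by itself does not indicate a meaningful clustering.
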
Methods for Clustering Quantitative Data

* 1. K-Means Clustering Algorithm
2. Hierarchical Clustering Algorithm
3. ISODATA Clustering Algorithm
4. Fuzzy Clustering Algorithm

K-Means Clustering Algorithm: Basic Procedure

- Randomly select K cluster centers
  from the pattern space.
- Assign each pattern to the nearest cluster
  center (minimum distance).
- Compute a new cluster center for each
  cluster.
- Repeat the assignment and update steps until the cluster
  centers do not change.

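The basic procedure can be sketched in plain Python as below. Several details are assumptions made for illustration and are not specified in the slides: the initial centers are drawn from the patterns themselves, each new center is the mean of its cluster, an empty cluster keeps its previous center, and `max_iter` caps the number of passes. The function name `k_means` and the sample data are hypothetical.

```python
import math
import random


def k_means(patterns, K, seed=0, max_iter=100):
    """Return (centers, clusters) after iterating assign/update steps."""
    rng = random.Random(seed)
    # Assumption: initial centers are K distinct patterns chosen at random.
    centers = rng.sample(patterns, K)
    clusters = [[] for _ in range(K)]
    for _ in range(max_iter):
        # Assignment step: each pattern goes to its nearest center.
        clusters = [[] for _ in range(K)]
        for x in patterns:
            k = min(range(K), key=lambda k: math.dist(x, centers[k]))
            clusters[k].append(x)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[k]
            for k, c in enumerate(clusters)
        ]
        # Stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical patterns forming two well-separated groups.
patterns = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(patterns, K=2)
```

On data this well separated the two groups are recovered regardless of the random start; on less clear-cut data, different starting centers can lead to different final clusters.
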
[Figure: Flow diagram for the K-Means algorithm]
