lec-3

Clustering
1. General Concept of Clustering
2. Basic problems in determining clusters
3. Definition of distance functions between clusters
4. Introduce the K-Means Clustering Algorithm

Clustering is the art of grouping together pattern vectors that in some sense belong together because they have similar characteristics and are different from other pattern vectors.
In the most general problem the number of clusters or subgroups is unknown, as are the properties that make them similar.

Question:
How do we start the process of finding clusters and identifying similarities?
Answer:
First, realize that clustering is an art: there is no correct answer, only feasible alternatives.
Second, explore the structure of the data, similarity measures, and the limitations of various clustering procedures.

Formalization of the Problem of Clustering
Given a set $S$ of $N_S$ $n$-dimensional pattern vectors:
$S = \{ x_j ;\ j = 1, 2, \ldots, N_S \}$
Clustering is the process of partitioning $S$ into $K$ subsets $Cl_k$, $k = 1, 2, \ldots, K$, called clusters, that satisfy the following conditions.
1. The members of each subset are in some sense similar to one another and not similar to members of the other subsets.
2. $Cl_k \neq \Phi$ for every $k$ (not empty)
3. $Cl_k \cap Cl_j = \Phi$ for $k \neq j$ (pairwise disjoint)
4. $\bigcup_{k=1}^{K} Cl_k = S$ (exhaustive)
$\Phi$ is the null set.
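The set conditions (not empty, pairwise disjoint, exhaustive) can be checked mechanically. The following is a minimal sketch, assuming each pattern is identified by a hashable label; the function name `is_valid_partition` is hypothetical and is not part of the lecture. Condition 1 (similarity) depends on the chosen similarity measure and is not checked here.

```python
from itertools import combinations

def is_valid_partition(S, clusters):
    """Check conditions 2-4 for a proposed clustering of the set S."""
    clusters = [set(c) for c in clusters]
    # Condition 2: no cluster is empty.
    non_empty = all(len(c) > 0 for c in clusters)
    # Condition 3: every pair of clusters is disjoint.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    # Condition 4: the union of all clusters is S.
    exhaustive = set().union(*clusters) == set(S)
    return non_empty and disjoint and exhaustive

print(is_valid_partition("abcd", [{"a", "c"}, {"b", "d"}]))
print(is_valid_partition("abcd", [{"a", "c"}, {"b"}]))
```

The first call describes a valid partition of { a, b, c, d }; the second leaves d unassigned and so fails the exhaustive condition.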
[Figure: illustration of clusters and cluster centers]

We will now look at two examples that illustrate problems in performing meaningful clustering:
Example 1: Problems with scaling
Example 2: The nonuniqueness of results

Example 1:
Given the data below, obtained by measuring the weight and diameter of 4 large foam balls labeled a, b, c, and d, find two clusters from the set { a, b, c, d }.
[Table: weight and diameter of balls a, b, c, and d]

Solution:
The plot of the points in the 2-dimensional pattern space is given below.
[Figure: points a, b, c, d in pattern space, Diameter in feet]
By closeness in pattern space select
Cl1 = { a,c }
Cl2 = { b,d }

The plot of the same points in the 2-dimensional pattern space, with Diameter shown in inches rather than feet (a different scale), is given below.
[Figure: points a, b, c, d in pattern space, Diameter in inches]
By closeness in pattern space select
Cl1 = { a,b }
Cl2 = { c,d }

Which set of clusters is the correct answer?
#1: Cl1 = { a,c }, Cl2 = { b,d } (measured in feet)
#2: Cl1 = { a,b }, Cl2 = { c,d } (measured in inches)
#3: Cl1 = { a,d }, Cl2 = { b,c } (other measurement units)
#4: None of the above
#5: All of the above

Answer:
There is no correct answer. The clusters provide us with different interpretations of the data, in which the closeness of patterns is measured with different definitions of similarity.

One approach to solving the scaling problem is to normalize each dimension separately when the dimensions represent different properties, such as weight and diameter.
For our problem we have
[Figure: normalized Diameter and Weight]

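The scaling effect and the normalization remedy can be tried numerically. The sketch below uses hypothetical measurements (the lecture's data table is not reproduced here) and assumes z-score normalization, which is only one possible way to normalize each dimension separately; the helper names `nearest` and `normalize` are illustrative.

```python
import math
from statistics import mean, pstdev

# Hypothetical data, NOT the lecture's table: (weight, diameter in feet).
balls_ft = {"a": (1.0, 1.0), "b": (3.0, 1.2), "c": (1.2, 2.0), "d": (3.2, 2.2)}
# The same balls with the diameter converted to inches.
balls_in = {k: (w, 12.0 * d) for k, (w, d) in balls_ft.items()}

def nearest(points, name):
    """Label of the point closest to `name` in Euclidean distance."""
    others = (k for k in points if k != name)
    return min(others, key=lambda k: math.dist(points[k], points[name]))

def normalize(points):
    """Z-score each dimension separately (an assumed normalization)."""
    cols = list(zip(*points.values()))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return {k: tuple((x - m) / s for x, m, s in zip(v, mu, sd))
            for k, v in points.items()}

print(nearest(balls_ft, "a"))  # raw data, diameter in feet
print(nearest(balls_in, "a"))  # raw data, diameter in inches
print(nearest(normalize(balls_ft), "a"), nearest(normalize(balls_in), "a"))
```

With these made-up values the nearest neighbor of a changes when the diameter unit changes, while the normalized data give the same neighbor in both unit systems, because z-scores are unaffected by a change of scale.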
Concentrate now on quantitative data and examine measures of similarity between pattern samples and clusters.
Euclidean distance between two pattern vectors $x$ and $y$:
$d(x, y) = \lVert x - y \rVert = \sqrt{(x - y)^{T}(x - y)}$
The smaller the distance, the larger the similarity.

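Measures of distance between two pattern classes $S_i$ and $S_j$, with $N_i$ and $N_j$ members respectively, where $d(x, y)$ is the Euclidean distance between pattern vectors $x$ and $y$:
1. Minimum distance: $d_{min}(S_i, S_j) = \min_{x \in S_i,\, y \in S_j} d(x, y)$
2. Average distance: $d_{avg}(S_i, S_j) = \frac{1}{N_i N_j} \sum_{x \in S_i} \sum_{y \in S_j} d(x, y)$
3. Distance between means: $d_{mean}(S_i, S_j) = d(M_i, M_j)$, where $M_i = \frac{1}{N_i} \sum_{x \in S_i} x$
4. Distance between medians: $d_{med}(S_i, S_j) = d(\tilde{M}_i, \tilde{M}_j)$, where $\tilde{M}_i$ is the median of $S_i$
5. Maximum distance: $d_{max}(S_i, S_j) = \max_{x \in S_i,\, y \in S_j} d(x, y)$
[Figure: interpretation of $d_{max}$, $d_{mean}$, $d_{min}$]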
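The minimum, average, between-means, and maximum distances between two classes can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, the example classes are made up, and the between-medians measure is left out because the lecture text does not say how the median of a set of vectors is defined.

```python
import math
from itertools import product

def centroid(S):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(S) for col in zip(*S)]

def d_min(Si, Sj):
    """Smallest distance over all cross-class pairs."""
    return min(math.dist(x, y) for x, y in product(Si, Sj))

def d_max(Si, Sj):
    """Largest distance over all cross-class pairs."""
    return max(math.dist(x, y) for x, y in product(Si, Sj))

def d_avg(Si, Sj):
    """Average distance over all cross-class pairs."""
    total = sum(math.dist(x, y) for x, y in product(Si, Sj))
    return total / (len(Si) * len(Sj))

def d_mean(Si, Sj):
    """Distance between the two class means."""
    return math.dist(centroid(Si), centroid(Sj))

# Hypothetical classes, chosen so the values are easy to verify by hand.
Si = [(0.0, 0.0), (1.0, 0.0)]
Sj = [(3.0, 0.0), (5.0, 0.0)]
print(d_min(Si, Sj), d_avg(Si, Sj), d_mean(Si, Sj), d_max(Si, Sj))
```

For these two classes the minimum distance never exceeds the distance between means, which in turn never exceeds the maximum distance; that ordering holds in general because the difference of the means is an average of the pairwise differences.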

Measure of Performance for Clustering
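Overall performance measure $J$ for a given set of clusters $Cl_k$, $k = 1, 2, \ldots, K$:
$J = \sum_{k=1}^{K} \sum_{x_i \in Cl_k} \lVert x_i - M_k \rVert^2$
where the mean of each cluster is
$M_k = \frac{1}{N_k} \sum_{x_i \in Cl_k} x_i$
and $N_k$ is the number of samples in $Cl_k$.
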
If $K = N_S$, the number of samples, then each cluster center equals the single sample in its cluster and $J$ would be 0.
If $K = 1$, then all samples are in just one cluster and $J$ would be maximum.
There is no useful information in either one of these conditions!

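The two extreme cases can be verified numerically. The sketch below assumes that J is the sum of squared Euclidean distances from each sample to its cluster mean; the function name `performance_J` and the sample values are hypothetical.

```python
def centroid(cluster):
    """Component-wise mean of the samples in one cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def performance_J(clusters):
    """Sum of squared distances from each sample to its cluster mean
    (assumed form of the performance measure)."""
    total = 0.0
    for cluster in clusters:
        m = centroid(cluster)
        for x in cluster:
            total += sum((xi - mi) ** 2 for xi, mi in zip(x, m))
    return total

# Hypothetical samples: the corners of a 4-by-2 rectangle.
samples = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
J_single = performance_J([samples])                # K = 1
J_two = performance_J([samples[:2], samples[2:]])  # K = 2
J_each = performance_J([[x] for x in samples])     # K = N_S
print(J_single, J_two, J_each)
```

J falls as K grows and reaches 0 when every sample is its own cluster, which is why J alone cannot be used to choose K.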
Methods for Clustering Quantitative Data
* 1. K-Means Clustering Algorithm
2. Hierarchical Clustering Algorithm
3. ISODATA Clustering Algorithm
4. Fuzzy Clustering Algorithm

K-Means Clustering Algorithm: Basic Procedure
- Randomly select K cluster centers from the pattern space.
- Assign each pattern to the nearest cluster center (minimum distance).
- Compute a new cluster center for each cluster.
- Repeat the assignment and update steps until the cluster centers do not change.

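The basic procedure can be sketched as follows. This is a minimal illustration, not a definitive implementation, and it makes several assumptions the slides do not state: the initial centers are drawn from the patterns themselves, the new center of a cluster is the mean of its members, an empty cluster keeps its previous center, and a `max_iter` cap guards against endless looping. The names `k_means`, `seed`, and `max_iter` are hypothetical.

```python
import math
import random

def centroid(cluster):
    """Component-wise mean of the patterns in one cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def k_means(patterns, K, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K initial centers (assumption: sampled from the patterns).
    centers = rng.sample(patterns, K)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each pattern to the nearest center.
        clusters = [[] for _ in range(K)]
        for x in patterns:
            k = min(range(K), key=lambda i: math.dist(x, centers[i]))
            clusters[k].append(x)
        # Step 3: recompute centers (assumption: an empty cluster keeps its center).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = k_means(points, K=2)
print(centers)
print(clusters)
```

On data this well separated the result does not depend on the random start; in general, different initial centers can lead to different final clusters.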
[Figure: flow diagram for the K-Means algorithm]
