Pattern Recognition Lecture 3
Contents
• Cluster Analysis
• Applications of Clustering
• Major Clustering Approaches
• Clustering Algorithms
  ■ K-means Algorithm
  ■ Nearest Neighbor Algorithm
  ■ Agglomerative Algorithm
  ■ Divisive Algorithm
• Conclusion
• References
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: two groups of points; intra-cluster distances are minimized, inter-cluster distances are maximized]
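The "intra-cluster distances minimized, inter-cluster distances maximized" idea can be sketched numerically. The two 1-D point sets below are made up purely for illustration:

```python
# Minimal numeric sketch of intra- vs inter-cluster distance.
# The two 1-D clusters are invented for illustration only.
from itertools import combinations

cluster_a = [1.0, 1.2, 0.8]
cluster_b = [9.0, 9.5, 8.7]

def avg_pairwise(points):
    """Average distance between all pairs of points within one cluster."""
    pairs = list(combinations(points, 2))
    return sum(abs(p - q) for p, q in pairs) / len(pairs)

# Average within-cluster (intra) distance across both clusters.
intra = (avg_pairwise(cluster_a) + avg_pairwise(cluster_b)) / 2

# Average between-cluster (inter) distance over all cross-cluster pairs.
inter = sum(abs(p - q) for p in cluster_a for q in cluster_b) / (
    len(cluster_a) * len(cluster_b)
)

# For a good clustering, intra is much smaller than inter.
```

Here `intra` comes out to about 0.4 while `inter` is about 8, so these two groups form a good clustering under this criterion.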
Cluster Analysis
• Cluster: a collection of data objects
  ■ Similar to one another within the same cluster
  ■ Dissimilar to the objects in other clusters
• Cluster analysis
  ■ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes
What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  ■ high intra-class similarity
  ■ low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Applications of Clustering
• Applications of clustering algorithms include
  ■ Pattern recognition
  ■ Spatial data analysis
  ■ Image processing
  ■ Economic science (especially market research)
  ■ Web analysis and classification of documents
  ■ Classification of astronomical data and of objects found in archaeological studies
  ■ Medical science
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Outliers
• Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
[Figure: a cluster of points with a few isolated outliers]
Cluster Analysis
[Diagram: clustering methods divided into Partitioning Methods, Hierarchical Methods, Grid-Based Methods, and Model-Based Methods]
K-means Example
Table 1: Marks of 10 students
Student Age Mark1 Mark2 Mark3
S1 18 73 75 57
S2 18 79 85 75
S3 23 70 70 52
S4 20 55 55 55
S5 22 85 86 87
S6 19 91 90 89
S7 20 70 65 60
S8 21 53 56 59
S9 19 82 82 60
S10 47 75 76 77
K-means Example (contd.)
• Steps 1 and 2: Let the three seeds be the first three students.
Table 2: The three seeds
Cluster Student Age Mark1 Mark2 Mark3
C1 S1 18 73 75 57
C2 S2 18 79 85 75
C3 S3 23 70 70 52
• Step 3: Compute the Manhattan distance of each student from each seed. For example, for S2:
Dist(S2, C1) = |18-18| + |79-73| + |85-75| + |75-57| = 0 + 6 + 10 + 18 = 34
Dist(S2, C2) = 0 + 0 + 0 + 0 = 0
Dist(S2, C3) = 5 + 9 + 15 + 23 = 52
• S2 is therefore assigned to its nearest seed, C2.
K-means Example (contd.)
Table 3: Distance of each student from each seed, and the resulting cluster
Student Age Mark1 Mark2 Mark3 C1 C2 C3 Cluster
S1 18 73 75 57 0 34 18 C1
S2 18 79 85 75 34 0 52 C2
S3 23 70 70 52 18 52 0 C3
S4 20 55 55 55 42 76 36 C3
S5 22 85 86 87 57 23 67 C2
S6 19 91 90 89 66 32 82 C2
S7 20 70 65 60 18 46 16 C3
S8 21 53 56 59 44 74 40 C3
S9 19 82 82 60 20 22 36 C1
S10 47 75 76 77 52 44 60 C2
K-means Example (contd.)
• Step 4: Recompute the cluster means from the assigned members. For Cluster-1:
Student Age Mark1 Mark2 Mark3
S1 18 73 75 57
S9 19 82 82 60
AVG 18.5 77.5 78.5 58.5
• Cluster membership:
Cluster-1: S1, S9
Cluster-2: S2, S5, S6, S10
Cluster-3: S3, S4, S7, S8
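The worked example above can be reproduced in a few lines. The sketch below performs one assignment step of k-means with Manhattan distance, using the lecture's student data and the first three students as seeds; the variable and function names are our own:

```python
# One assignment step of k-means with Manhattan distance, reproducing
# the lecture's worked example. Data and seed choice are from the slides.
students = {
    "S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75), "S3": (23, 70, 70, 52),
    "S4": (20, 55, 55, 55), "S5": (22, 85, 86, 87), "S6": (19, 91, 90, 89),
    "S7": (20, 70, 65, 60), "S8": (21, 53, 56, 59), "S9": (19, 82, 82, 60),
    "S10": (47, 75, 76, 77),
}

def manhattan(p, q):
    """Sum of absolute attribute differences, as used in the example."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Steps 1 and 2: the first three students are the initial seeds.
seeds = {"C1": students["S1"], "C2": students["S2"], "C3": students["S3"]}

# Step 3: assign each student to its nearest seed.
clusters = {c: [] for c in seeds}
for sid, row in students.items():
    nearest = min(seeds, key=lambda c: manhattan(row, seeds[c]))
    clusters[nearest].append(sid)

# Step 4: recompute each cluster centre as the per-attribute mean.
means = {
    c: tuple(sum(col) / len(members)
             for col in zip(*(students[s] for s in members)))
    for c, members in clusters.items()
}
```

A full k-means run would repeat steps 3 and 4 with the recomputed means until the assignments stop changing. Here Cluster-1's mean comes out to (18.5, 77.5, 78.5, 58.5), matching the AVG row above.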
K-Means
• Strengths
  ■ Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  ■ Often terminates at a local optimum.
• Weaknesses
  ■ Applicable only when a mean is defined (what about categorical data?)
  ■ Need to specify k, the number of clusters, in advance
  ■ Trouble with noisy data and outliers
  ■ Not suitable for discovering clusters with non-convex shapes
K-Means (contd.)
  ■ The results of the k-means method depend strongly on the initial choice of seeds.
  ■ The k-means method can be sensitive to outliers. If an outlier is picked as a starting seed, it may end up in a cluster of its own. Also, if an outlier moves from one cluster to another during the iterations, it can have a major impact on the clusters, because the means of the two clusters are likely to change significantly.
  ■ Although some local-optimum solutions discovered by the k-means method are satisfactory, often the local optimum is not as good as the global optimum.
  ■ The k-means method does not consider the size of the clusters: some clusters may be large and some very small.
  ■ The k-means method does not deal with overlapping clusters.
Nearest Neighbor Algorithm
• An algorithm similar to the single-link technique is called the nearest neighbor algorithm.
• With this serial algorithm, items are iteratively merged into the existing cluster that is closest.
• In this algorithm a threshold, t, is used to determine whether an item is added to an existing cluster or a new cluster is created.
Nearest Neighbor Algorithm Example
Table: Distances among items A, B, C, D, E
Item A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Nearest Neighbor Algorithm Example
• A is placed in a cluster by itself:
K1 = {A}
Nearest Neighbor Algorithm Example
• Consider B: should it be added to K1 or form a new cluster?
• Dist(A, B) = 1, which is less than the threshold value t = 2
• So K1 = {A, B}
Nearest Neighbor Algorithm Example
• For C we calculate the distance from both A and B.
• Dist(AB, C) = min{Dist(A, C), Dist(B, C)} = min{2, 2} = 2, which does not exceed the threshold
• So K1 = {A, B, C}
Nearest Neighbor Algorithm Example
• Dist(ABC, D) = min{Dist(A, D), Dist(B, D), Dist(C, D)}
              = min{2, 4, 1}
              = 1
• So K1 = {A, B, C, D}
Nearest Neighbor Algorithm Example
• Dist(ABCD, E) = min{Dist(A, E), Dist(B, E), Dist(C, E), Dist(D, E)}
               = min{3, 3, 5, 3}
               = 3, which is greater than the threshold value
• So K1 = {A, B, C, D}
• And K2 = {E}
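The steps above can be sketched as a small function. The distance matrix below is the table from the example and the threshold is t = 2; the function name and structure are illustrative, not a library API:

```python
# Threshold-based nearest-neighbour clustering, reproducing the A-E
# example with threshold t = 2. Distance matrix is the lecture's table.
dist = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}

def nearest_neighbor_clustering(items, dist, t):
    """Serially place each item in the cluster holding its nearest
    already-clustered item, or start a new cluster if that distance
    exceeds the threshold t."""
    clusters = []
    for item in items:
        best = None  # (distance, cluster index) of the closest cluster
        for ci, cluster in enumerate(clusters):
            d = min(dist[item][other] for other in cluster)
            if best is None or d < best[0]:
                best = (d, ci)
        if best is not None and best[0] <= t:
            clusters[best[1]].append(item)  # join the nearest cluster
        else:
            clusters.append([item])         # start a new cluster
    return clusters

clusters = nearest_neighbor_clustering(list("ABCDE"), dist, t=2)
```

Running this yields the same result as the slides: {A, B, C, D} and {E}. Note that, like the slides' walkthrough, the outcome depends on the order in which the items are presented.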
Thank you