What Is Cluster Analysis?
Cluster: a collection of data objects
Cluster analysis
Grouping a set of data objects into clusters
Examples of Clustering
Applications
Requirements of Clustering in Data Mining
Scalability
High dimensionality
Data Structures
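Data matrix (two modes)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode)

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$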
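The relationship between the two structures can be sketched in a few lines of Python: starting from an n-by-p data matrix, compute the lower-triangular dissimilarity matrix d(i, j). This is only an illustration. The slides do not fix a distance measure, so Euclidean distance is an assumption here, and the function name `dissimilarity_matrix` is hypothetical.

```python
import math


def dissimilarity_matrix(data):
    """Build the lower-triangular matrix d(i, j) from an n-by-p data matrix.

    Assumption: Euclidean distance is used as the dissimilarity measure.
    """
    matrix = []
    for i in range(len(data)):
        # Row i holds d(i, 0), ..., d(i, i-1), followed by d(i, i) = 0.
        row = [math.dist(data[i], data[j]) for j in range(i)]
        row.append(0.0)
        matrix.append(row)
    return matrix


# Three objects (n = 3) described by two variables (p = 2).
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
d = dissimilarity_matrix(data)
for row in d:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0, which is why the dissimilarity matrix has a single mode.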
Major Clustering Approaches
Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: Based on connectivity and density functions
Grid-based: Based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model
[Figure: K-means example with K=2. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means.]
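The K-means loop in the figure (choose centers, assign, update means, reassign) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: Euclidean distance is assumed, the first k objects stand in for the "arbitrary" initial centers, and the names `kmeans` and `max_iter` are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal K-means sketch returning (labels, centers)."""
    # Assumption: the first k objects serve as the arbitrary initial centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster
        labels = new_labels
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

Because both initial centers here come from the same natural group, the first assignment is poor and the result only settles after the means are updated and the objects reassigned, which mirrors the reassign step in the figure.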
Weakness
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex shapes
[Figure: K-medoids example with K=2. Panels: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object, O_random; compute the total cost of swapping; swap O and O_random if quality is improved; do loop until no change. Total costs shown: 20 and 26.]
[Figure: swapping-cost cases for objects h and i, including C_jih = 0.]
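The medoid-swapping loop in the figure can be sketched as below. Several details are assumptions made for illustration: Euclidean distance is used for the cost, the first k objects are the initial medoids, and instead of drawing O_random at random the sketch tries every non-medoid object in turn so that the result is deterministic. The names `total_cost` and `pam` are hypothetical.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Minimal K-medoids sketch returning (medoid indices, total cost)."""
    # Assumption: the first k objects serve as the arbitrary initial medoids.
    medoids = list(range(k))
    cost = total_cost(points, medoids)
    improved = True
    while improved:  # do loop until no change
        improved = False
        for pos in range(k):
            for h in range(len(points)):
                if h in medoids:
                    continue
                # Cost of swapping the medoid at position pos with object h.
                candidate = medoids[:pos] + [h] + medoids[pos + 1:]
                new_cost = total_cost(points, candidate)
                if new_cost < cost:  # swap only if quality is improved
                    medoids, cost = candidate, new_cost
                    improved = True
    return medoids, cost


points = [(1,), (2,), (3,), (10,), (11,), (12,)]
medoids, cost = pam(points, k=2)
print(sorted(medoids), cost)
```

Since a medoid is always one of the actual objects, no mean needs to be defined, and a single extreme value cannot drag a cluster representative away from the data.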
K-Means Example
Clustering Approaches
Clustering
  Hierarchical
    Agglomerative
    Divisive
  Partitional
  Categorical
  Large DB
    Sampling
    Compression
Hierarchical Clustering
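[Figure: hierarchical clustering of objects a, b, c, d, e shown step by step. Agglomerative (AGNES) proceeds by merging, e.g. ab, de, cde, abcde; divisive (DIANA) runs through the same steps in the opposite direction.]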
Hierarchical Clustering
Clusters are created in levels, producing a set of clusters at each level.
Agglomerative
Initially each item in its own cluster
Iteratively clusters are merged together
Bottom Up
Divisive
Initially all items in one cluster
Large clusters are successively divided
Top Down
Hierarchical Algorithms
Single Link
MST Single Link
Complete Link
Average Link
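The link criteria named above differ only in how the distance between two clusters is derived from the pairwise distances of their members. The sketch below uses the commonly cited definitions (minimum, maximum, and mean pairwise distance); Euclidean distance and the function names are assumptions made for illustration.

```python
import math
from itertools import product


def single_link(a, b):
    """Distance between the closest pair of members, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Distance between the farthest pair of members, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Mean distance over all pairs of members, one from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]
print(single_link(cluster_a, cluster_b))
print(complete_link(cluster_a, cluster_b))
print(average_link(cluster_a, cluster_b))
```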
Dendrogram
Dendrogram: a tree data structure that illustrates hierarchical clustering techniques.
Each level shows the clusters for that level.
Leaf: individual clusters
Root: one cluster
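One way to hold a dendrogram in code is as a nested tuple: a leaf is an item label, and an internal node is `(left, right, height)`, where `height` is the distance at which the two subtrees were merged. The sketch below builds such a tree bottom-up. The single-link merge rule, the Euclidean distance, the sample coordinates, and the name `agglomerative_dendrogram` are all assumptions chosen for illustration.

```python
import math


def agglomerative_dendrogram(labels, points):
    """Build a dendrogram as nested tuples (left, right, merge_height)."""
    # Each entry pairs a subtree with the indices of the items it contains.
    clusters = [(label, [i]) for i, label in enumerate(labels)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumption: single link (closest pair of members).
                dist = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a][1]
                    for j in clusters[b][1]
                )
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merged = (
            (clusters[a][0], clusters[b][0], dist),
            clusters[a][1] + clusters[b][1],
        )
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


labels = ["a", "b", "c", "d", "e"]
points = [(0.0,), (1.0,), (5.0,), (8.0,), (9.5,)]
tree = agglomerative_dendrogram(labels, points)
print(tree)
```

The leaves of the returned tree are the individual items and the root is the single cluster containing everything, matching the description above.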
Levels of Clustering
Agglomerative Example
[Figure: agglomerative example on items A, B, C, D, E, with dendrogram levels at thresholds of 1, 2, 3, 4, 5.]
MST Example
[Figure: MST example on items A, B, C, D, E.]
Agglomerative Algorithm
Single Link
View all items as a graph with links (distances) between them.
Find the maximal connected components in this graph.
Two clusters are merged if there is at least one edge
which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
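The graph view of single link can be sketched directly: keep only the links whose distance is at most the current threshold, then report the connected components as clusters. The sketch uses a small union-find structure; Euclidean distance, the sample points, and the name `single_link_clusters` are assumptions made for illustration.

```python
import math


def single_link_clusters(points, threshold):
    """Return clusters (lists of item indices) at a given distance threshold."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two clusters are merged if at least one edge within the threshold
    # connects them.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


points = [(0,), (1,), (3,), (6,), (7,)]
for threshold in (1, 2, 3):
    print(threshold, single_link_clusters(points, threshold))
```

Raising the threshold one level at a time reproduces the agglomerative reading; lowering it from the maximum distance gives the divisive one.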
AGNES (Agglomerative Nesting)
Merging goes on in a non-descending fashion
Readings
BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, Volume 1, Issue 2 (1997)