
ADVANCED CLUSTER ANALYSIS
Clustering high-dimensional data
SYLLABUS
Clustering techniques:
 hierarchical,
 K-means,
 clustering high-dimensional data,
 CLIQUE and PROCLUS,
 frequent-pattern-based clustering methods,
 clustering in non-Euclidean space,
 clustering for streams and parallelism
 Probabilistic model-based clustering
 Clustering high-dimensional data
 Clustering graph and network data
 Clustering with constraints

CLUSTERING HIGH-DIMENSIONAL DATA
 The clustering methods we have studied so far work well
when the dimensionality is not high, that is, when the data
have fewer than 10 attributes.
 There are, however, important applications involving
high-dimensional data.
 “How can we conduct cluster analysis on high-dimensional
data?”
EXAMPLE
 All Electronics keeps track of the products purchased by
every customer.
 As a customer-relationship manager, you want to cluster
customers into groups according to what they purchased
from All Electronics.
 All Electronics carries tens of thousands of products.
 From the customers’ purchase vectors, it is easy to see that

dist(Ada, Bob) = dist(Bob, Cathy) = dist(Ada, Cathy) = √2.


 According to Euclidean distance, the three customers are
equivalently similar (or dissimilar) to each other.
 However, a close look tells us that Ada should be more
similar to Cathy than to Bob because Ada and Cathy
share one common purchased item, P1.
 The traditional distance measures can be ineffective on
high-dimensional data.
 Such distance measures may be dominated by the noise
in many dimensions.
 Therefore, clusters in the full, high-dimensional space
can be unreliable, and finding such clusters may not be
meaningful.
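To make the point concrete, here is a small sketch of the customer example. The purchase table itself is not reproduced in these notes, so the binary vectors below are hypothetical, chosen only so that all three pairwise Euclidean distances come out to √2 while Ada and Cathy still share product P1:

```python
import math

NUM_PRODUCTS = 10_000  # All Electronics carries tens of thousands of products

def purchases(*bought):
    """Return a binary purchase vector with 1s at the bought product indices."""
    v = [0] * NUM_PRODUCTS
    for p in bought:
        v[p] = 1
    return v

# Hypothetical assignments (not from the original table):
ada = purchases(0, 1)    # Ada bought P1 and P2
cathy = purchases(0, 2)  # Cathy bought P1 and P3 -> shares P1 with Ada
bob = purchases()        # Bob bought none of these products

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean(ada, bob))    # sqrt(2)
print(euclidean(ada, cathy))  # sqrt(2)
print(euclidean(bob, cathy))  # sqrt(2)
```

Euclidean distance reports all three customers as equally similar, even though the shared purchase of P1 suggests Ada and Cathy belong together.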

 Clustering high-dimensional data is the search for
clusters and the space in which they exist.
FIRST CHALLENGE
 A major issue is how to create appropriate models for clusters
in high-dimensional data.
 Unlike conventional clusters in low-dimensional spaces,
clusters hidden in high-dimensional data are often
significantly smaller.
 For example, when clustering customer-purchase data, we
would not expect many users to have similar purchase
patterns.
 Searching for such small but meaningful clusters is like
finding needles in a haystack.
 We often have to consider more sophisticated
techniques that can model correlations and consistency among
objects in subspaces.
SECOND CHALLENGE

 There are typically an exponential number of possible
subspaces or dimensionality-reduction options, so the
optimal solutions are often computationally prohibitive.
 For example, if the original data space has 1000
dimensions and we want to find clusters of
dimensionality 10, then there are C(1000, 10) ≈ 2.63×10^23
possible subspaces.
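The count above is simply the number of ways to choose 10 attributes out of 1000, which can be checked directly:

```python
import math

# Number of 10-dimensional axis-parallel subspaces of a
# 1000-dimensional space: choose 10 attributes out of 1000.
n_subspaces = math.comb(1000, 10)
print(f"{n_subspaces:.2e}")  # about 2.63e+23
```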
TWO MAJOR KINDS OF METHODS

 Subspace clustering approaches search for clusters
existing in subspaces of the given high-dimensional data
space, where a subspace is defined using a subset of
attributes in the full space.

 Dimensionality reduction approaches try to construct a
much lower-dimensional space and search for clusters in
such a space. Often, a method may construct new
dimensions by combining some dimensions from the
original data.
SUBSPACE CLUSTERING METHODS
 Subspace search methods
 Correlation-based clustering methods
 Biclustering methods
SUBSPACE SEARCH METHODS
 A subspace search method searches various subspaces
for clusters.
 Here, a cluster is a subset of objects that are similar to
each other in a subspace.
 The similarity is often captured by conventional
measures such as distance or density.

 A major challenge that subspace search methods face is
how to search a series of subspaces effectively and
efficiently.
GENERALLY THERE ARE TWO KINDS OF
STRATEGIES:
 Bottom-up approaches start from low-dimensional
subspaces and search higher dimensional subspaces only
when there may be clusters in those higher-dimensional
subspaces.

 Various pruning techniques are explored to reduce the
number of higher-dimensional subspaces that need to be
searched.
 CLIQUE is an example of a bottom-up approach.
TOP-DOWN APPROACHES
 Top-down approaches start from the full space and
search smaller and smaller subspaces recursively.
 Top-down approaches are effective only if the locality
assumption holds, which requires that the subspace of a
cluster can be determined by the local neighborhood.
 PROCLUS is an example of a top-down subspace
approach.
CLIQUE: A DIMENSION-GROWTH
SUBSPACE CLUSTERING METHOD
 CLIQUE (CLustering In QUEst) was the first algorithm
proposed for dimension-growth subspace clustering in
high-dimensional space.
 In dimension-growth subspace clustering, the clustering
process starts at single-dimensional subspaces and grows
upward to higher-dimensional ones (grid structure).
 It can also be viewed as an integration of density-based
and grid-based clustering methods.
 Its overall approach is typical of subspace clustering for
high-dimensional space.
EXAMPLE
 The ideas of the CLIQUE clustering algorithm are
outlined as follows:
 Given a large set of multidimensional data points, the
data space is usually not uniformly occupied by the data
points.
 CLIQUE’s clustering identifies the sparse and the crowded
areas in space, thereby discovering the overall distribution
patterns of the data set.
 A unit is dense if the fraction of total data points contained in
it exceeds an input model parameter.
 In CLIQUE, a cluster is defined as a maximal set of
connected dense units.
HOW DOES CLIQUE WORK?
 STEP I: CLIQUE partitions the d-dimensional data
space into nonoverlapping rectangular units, identifying
the dense units among these.

 STEP II: The subspaces representing these dense units
are intersected to form a candidate search space in which
dense units of higher dimensionality may exist.
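A minimal sketch of Step I, plus the Apriori-style candidate generation that feeds Step II. The grid resolution `xi` and density threshold `tau`, the `[0, 1)` data range, and the toy points are all illustrative assumptions, not values from the lecture:

```python
from collections import Counter
from itertools import combinations

def dense_units_1d(points, xi=4, tau=0.5, lo=0.0, hi=1.0):
    """Step I sketch: split each dimension into xi equal-width intervals
    and keep the 1-D units holding more than a tau fraction of all points."""
    n, d = len(points), len(points[0])
    width = (hi - lo) / xi
    counts = Counter()
    for p in points:
        for dim in range(d):
            cell = min(int((p[dim] - lo) / width), xi - 1)
            counts[(dim, cell)] += 1  # unit = (dimension, interval index)
    return {u for u, c in counts.items() if c / n > tau}

def candidate_units_2d(dense_1d):
    """Step II sketch: 2-D candidates come only from pairs of dense
    1-D units in different dimensions (these would then be re-counted)."""
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2)
            if u[0] != v[0]}

points = [(0.1, 0.1), (0.15, 0.12), (0.2, 0.18), (0.9, 0.9)]
d1 = dense_units_1d(points)
print(d1)                     # dense 1-D units near the origin
print(candidate_units_2d(d1))
```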
HOW EFFECTIVE IS CLIQUE?
 CLIQUE automatically finds the subspaces of the highest
dimensionality such that high-density clusters exist in
those subspaces.
 It is insensitive to the order of input objects.
 It scales linearly with the size of the input and has good
scalability as the number of dimensions in the data is
increased.
 Clustering results are dependent on proper tuning of the
grid size and the density threshold.
GRAPHICAL DEFINITION
A clique is a group of nodes in a
graph such that all nodes in the
clique are connected to each other.
 K = number of nodes in a clique
The clique percolation method is as follows:
1) All K-cliques present in graph G are extracted.
2) A new clique graph GC is created:
a) each extracted K-clique is compressed into one
vertex;
b) two vertices are connected by an edge in GC if they
have K - 1 common vertices.
3) Connected components in GC are identified.
4) Each connected component in GC represents a
community.
5) The set C of communities formed for G is returned.
[Figure: example cliques for K = 2, K = 3, and K = 4 over nodes N1-N4]
COMMUNITY
 A community is a group of K-cliques that can be reached
from one another through adjacent cliques, where adjacent
cliques share K - 1 nodes in common.
CLIQUE PERCOLATION METHOD (CPM)
[Figure: K-cliques merging into a community]
CLIQUE & COMMUNITY

Here, for K = 3:

CLIQUE 1 = {N1, N2, N3}
CLIQUE 2 = {N1, N2, N4}

COMMUNITY =
{CLIQUE 1, CLIQUE 2}
EXAMPLE
CLIQUE (K = 3)
a) {1, 2, 3}
b) {1, 2, 8}
c) {2, 6, 5}
d) {2, 6, 4}
e) {2, 5, 4}
f) {4, 5, 6}

Community 1 = {a, b}
Community 2 = {c, d, e, f}
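The five CPM steps can be sketched directly. The edge list below is assumed from the clique listing in this example (it reproduces cliques a-f); the brute-force clique enumeration is only suitable for small graphs like this one:

```python
from itertools import combinations

def cpm_communities(edges, k):
    """Clique percolation sketch: extract k-cliques, link cliques sharing
    k-1 nodes, and read communities off the connected components."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Step 1: enumerate all k-cliques by brute force.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Steps 2-3: union-find over cliques that share k-1 vertices.
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Steps 4-5: each component of the clique graph is a community.
    comms = {}
    for i, c in enumerate(cliques):
        comms.setdefault(find(i), set()).update(c)
    return list(comms.values())

# Edges assumed from cliques a-f above (nodes 1-6 and 8).
edges = [(1, 2), (1, 3), (2, 3), (1, 8), (2, 8),
         (2, 5), (2, 6), (5, 6), (2, 4), (4, 6), (4, 5)]
print(cpm_communities(edges, k=3))
# Two communities: {1, 2, 3, 8} (cliques a, b) and {2, 4, 5, 6} (c-f)
```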
EXAMPLE
Identify the cliques for K = 5 and K = 4.
[Figure: graph over nodes 1, 2, 3, 5, 6, 7, 9, 10]
PROCLUS
 Choose a sample set of data points randomly.
 Choose a set of data points that are likely to be the
medoids of the clusters.
INPUT AND OUTPUT FOR PROCLUS
 Input:
 The set of data points
 Number of clusters, denoted by k
 Average number of dimensions per cluster,
denoted by L
 Output:
 The clusters found, and the dimensions associated with
these clusters
Three Phases of PROCLUS:
 Initialization Phase
 Iterative Phase
 Refinement Phase
INITIALIZATION PHASE
 Choose a sample set of data points randomly.
 Choose a set of data points that are likely to be the medoids
of the clusters.
ITERATIVE PHASE
 From the Initialization Phase, we get a set of data points that
should contain the medoids (denoted by M).
 In this phase, we find the best medoids from M.
 Randomly pick a set of points Mcurrent, and replace the “bad”
medoids with other points from M if necessary.
For the medoids, the following is done:
 Find the dimensions related to the medoids
 Assign data points to the medoids
 Evaluate the clusters formed
 Find the bad medoid, and try the result of replacing the bad medoid
 The above procedure is repeated until a satisfactory result is obtained.

REFINEMENT PHASE - HANDLING
OUTLIERS
 For each medoid mi with dimension set Di, find the
smallest Manhattan segmental distance δi to any of the
other medoids with respect to that set of dimensions.
 δi defines the sphere of influence of the medoid mi.
 A data point is an outlier if it is not within any sphere of
influence.
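A small sketch of the Manhattan segmental distance used here: the Manhattan distance restricted to a dimension set D, normalized by |D| so that distances measured in subspaces of different sizes are comparable. The medoid values are hypothetical:

```python
def manhattan_segmental(x, y, dims):
    """Manhattan distance over the dimensions in `dims`, averaged by |dims|."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

m1 = (1.0, 5.0, 2.0)  # hypothetical medoids in a 3-D space
m2 = (2.0, 9.0, 4.0)
# Restricted to dimensions {0, 2}: (|1-2| + |2-4|) / 2 = 1.5
print(manhattan_segmental(m1, m2, dims=[0, 2]))  # 1.5
```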
