
ADVANCED CLUSTER ANALYSIS
Clustering high-dimensional data
SYLLABUS
Clustering techniques:
 hierarchical,
 K-means,
 clustering high-dimensional data,
 CLIQUE and PROCLUS,
 frequent-pattern-based clustering methods,
 clustering in non-Euclidean space,
 clustering for streams and parallelism
 Probabilistic model-based clustering
 Clustering high-dimensional data
 Clustering graph and network data
 Clustering with constraints

CLUSTERING HIGH-DIMENSIONAL DATA
 The clustering methods we have studied so far work well
when the dimensionality is not high, that is, when the data
have fewer than 10 attributes.
 There are, however, important applications involving
high-dimensional data.
 “How can we conduct cluster analysis on high-dimensional
data?”
EXAMPLE
 All Electronics keeps track of the products purchased by
every customer.
 As a customer-relationship manager, you want to cluster
customers into groups according to what they purchased
from All Electronics.
 All Electronics carries tens of thousands of products.
 From the customers’ purchase vectors, it is easy to see that

dist(Ada, Bob) = dist(Bob, Cathy) = dist(Ada, Cathy) = √2.


 According to Euclidean distance, the three customers are
equivalently similar (or dissimilar) to each other.
 However, a close look tells us that Ada should be more
similar to Cathy than to Bob because Ada and Cathy
share one common purchased item, P1.
 The traditional distance measures can be ineffective on
high-dimensional data.
 Such distance measures may be dominated by the noise
in many dimensions.
 Therefore, clusters in the full, high-dimensional space
can be unreliable, and finding such clusters may not be
meaningful.
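To make the point concrete, here is a small sketch of the customer example. The purchase table itself is not reproduced in these notes, so the binary vectors below are hypothetical, chosen only so that all three pairwise Euclidean distances come out to √2 while Ada and Cathy still share product P1:

```python
import math

NUM_PRODUCTS = 10_000  # All Electronics carries tens of thousands of products

def purchases(*bought):
    """Return a binary purchase vector with 1s at the bought product indices."""
    v = [0] * NUM_PRODUCTS
    for p in bought:
        v[p] = 1
    return v

# Hypothetical assignments (not from the original table):
ada = purchases(0, 1)    # Ada bought P1 and P2
cathy = purchases(0, 2)  # Cathy bought P1 and P3 -> shares P1 with Ada
bob = purchases()        # Bob bought none of these products

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean(ada, bob))    # sqrt(2)
print(euclidean(ada, cathy))  # sqrt(2)
print(euclidean(bob, cathy))  # sqrt(2)
```

Euclidean distance reports all three customers as equally similar, even though the shared purchase of P1 suggests Ada and Cathy belong together.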

 Clustering high-dimensional data is the search for
clusters and the space in which they exist.
FIRST CHALLENGE
 A major issue is how to create appropriate models for clusters
in high-dimensional data.
 Unlike conventional clusters in low-dimensional spaces,
clusters hidden in high-dimensional data are often
significantly smaller.
 For example, when clustering customer-purchase data, we
would not expect many users to have similar purchase
patterns.
 Searching for such small but meaningful clusters is like
finding needles in a haystack.
 We often have to consider more sophisticated
techniques that can model correlations and consistency among
objects in subspaces.
SECOND CHALLENGE

 There are typically an exponential number of possible
subspaces or dimensionality-reduction options, so the
optimal solutions are often computationally prohibitive.
 For example, if the original data space has 1000
dimensions and we want to find clusters of
dimensionality 10, then there are C(1000, 10) ≈ 2.63×10^23
possible subspaces.
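The count above is simply the number of ways to choose 10 attributes out of 1000, which can be checked directly:

```python
import math

# Number of 10-dimensional axis-parallel subspaces of a
# 1000-dimensional space: choose 10 attributes out of 1000.
n_subspaces = math.comb(1000, 10)
print(f"{n_subspaces:.2e}")  # about 2.63e+23
```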
TWO MAJOR KINDS OF METHODS

 Subspace clustering approaches search for clusters
existing in subspaces of the given high-dimensional data
space, where a subspace is defined using a subset of
attributes in the full space.

 Dimensionality reduction approaches try to construct a
much lower-dimensional space and search for clusters in
such a space. Often, a method may construct new
dimensions by combining some dimensions from the
original data.
SUBSPACE CLUSTERING METHODS
 Subspace search methods
 Correlation-based clustering methods
 Biclustering methods
SUBSPACE SEARCH METHODS
 A subspace search method searches various subspaces
for clusters.
 Here, a cluster is a subset of objects that are similar to
each other in a subspace.
 The similarity is often captured by conventional
measures such as distance or density.

 A major challenge that subspace search methods face is
how to search a series of subspaces effectively and
efficiently.
GENERALLY THERE ARE TWO KINDS OF
STRATEGIES:
 Bottom-up approaches start from low-dimensional
subspaces and search higher dimensional subspaces only
when there may be clusters in those higher-dimensional
subspaces.

 Various pruning techniques are explored to reduce the
number of higher-dimensional subspaces that need to be
searched.
 CLIQUE is an example of a bottom-up approach.
TOP-DOWN APPROACHES
 Top-down approaches start from the full space and
search smaller and smaller subspaces recursively.
 Top-down approaches are effective only if the locality
assumption holds, which requires that the subspace of a
cluster can be determined by the local neighborhood.
 PROCLUS is an example of a top-down subspace
approach.
CLIQUE: A DIMENSION-GROWTH
SUBSPACE CLUSTERING METHOD
 CLIQUE (CLustering In QUEst) was the first algorithm
proposed for dimension-growth subspace clustering in
high-dimensional space.
 In dimension-growth subspace clustering, the clustering
process starts at single-dimensional subspaces and grows
upward to higher-dimensional ones (grid structure).
 It can also be viewed as an integration of density-based
and grid-based clustering methods.
 Its overall approach is typical of subspace clustering for
high-dimensional space.
EXAMPLE
 The ideas of the CLIQUE clustering algorithm are
outlined as follows:
 Given a large set of multidimensional data points, the
data space is usually not uniformly occupied by the data
points.
 CLIQUE’s clustering identifies the sparse and the crowded
areas in space, thereby discovering the overall distribution
patterns of the data set.
 A unit is dense if the fraction of total data points contained in
it exceeds an input model parameter.
 In CLIQUE, a cluster is defined as a maximal set of
connected dense units.
HOW DOES CLIQUE WORK?
 STEP I: CLIQUE partitions the d-dimensional data
space into nonoverlapping rectangular units, identifying
the dense units among these.

 STEP II: The subspaces representing these dense units
are intersected to form a candidate search space in which
dense units of higher dimensionality may exist.
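A minimal sketch of Step I, plus the Apriori-style candidate generation that feeds Step II. The grid resolution `xi` and density threshold `tau`, the `[0, 1)` data range, and the toy points are all illustrative assumptions, not values from the lecture:

```python
from collections import Counter
from itertools import combinations

def dense_units_1d(points, xi=4, tau=0.5, lo=0.0, hi=1.0):
    """Step I sketch: split each dimension into xi equal-width intervals
    and keep the 1-D units holding more than a tau fraction of all points."""
    n, d = len(points), len(points[0])
    width = (hi - lo) / xi
    counts = Counter()
    for p in points:
        for dim in range(d):
            cell = min(int((p[dim] - lo) / width), xi - 1)
            counts[(dim, cell)] += 1  # unit = (dimension, interval index)
    return {u for u, c in counts.items() if c / n > tau}

def candidate_units_2d(dense_1d):
    """Step II sketch: 2-D candidates come only from pairs of dense
    1-D units in different dimensions (these would then be re-counted)."""
    return {(u, v) for u, v in combinations(sorted(dense_1d), 2)
            if u[0] != v[0]}

points = [(0.1, 0.1), (0.15, 0.12), (0.2, 0.18), (0.9, 0.9)]
d1 = dense_units_1d(points)
print(d1)                     # dense 1-D units near the origin
print(candidate_units_2d(d1))
```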
HOW EFFECTIVE IS CLIQUE?
 CLIQUE automatically finds the subspaces of the highest
dimensionality such that high-density clusters exist in
those subspaces.
 It is insensitive to the order of input objects.
 It scales linearly with the size of the input and has good
scalability as the number of dimensions in the data is
increased.
 Clustering results are dependent on proper tuning of the
grid size and the density threshold.
GRAPHICAL DEFINITION
A clique is a group of nodes in a
graph such that all nodes in the
clique are connected to each other.
 K = number of nodes in a clique
The clique percolation method is as follows:
1) All K-cliques present in graph G are extracted.
2) A new clique graph GC is created:
a) each extracted K-clique is compressed into one
vertex;
b) two vertices are connected by an edge in GC if they
have K - 1 common vertices.
3) Connected components in GC are identified.
4) Each connected component in GC represents a
community.
5) The set C of communities formed for G is returned.
[Figure: example cliques for K = 2, K = 3, and K = 4 over nodes N1-N4]
COMMUNITY
 A community is a group of K-cliques that can be reached
from one another through adjacent cliques, where adjacent
cliques share K - 1 nodes in common.
CLIQUE PERCOLATION METHOD (CPM)
[Figure: K-cliques merging into a community]
CLIQUE & COMMUNITY

Here, for K = 3:

CLIQUE 1 = {N1, N2, N3}
CLIQUE 2 = {N1, N2, N4}

COMMUNITY =
{CLIQUE 1, CLIQUE 2}
EXAMPLE
CLIQUE (K = 3)
a) {1, 2, 3}
b) {1, 2, 8}
c) {2, 6, 5}
d) {2, 6, 4}
e) {2, 5, 4}
f) {4, 5, 6}

Community 1 = {a, b}
Community 2 = {c, d, e, f}
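The five CPM steps can be sketched directly. The edge list below is assumed from the clique listing in this example (it reproduces cliques a-f); the brute-force clique enumeration is only suitable for small graphs like this one:

```python
from itertools import combinations

def cpm_communities(edges, k):
    """Clique percolation sketch: extract k-cliques, link cliques sharing
    k-1 nodes, and read communities off the connected components."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Step 1: enumerate all k-cliques by brute force.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    # Steps 2-3: union-find over cliques that share k-1 vertices.
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Steps 4-5: each component of the clique graph is a community.
    comms = {}
    for i, c in enumerate(cliques):
        comms.setdefault(find(i), set()).update(c)
    return list(comms.values())

# Edges assumed from cliques a-f above (nodes 1-6 and 8).
edges = [(1, 2), (1, 3), (2, 3), (1, 8), (2, 8),
         (2, 5), (2, 6), (5, 6), (2, 4), (4, 6), (4, 5)]
print(cpm_communities(edges, k=3))
# Two communities: {1, 2, 3, 8} (cliques a, b) and {2, 4, 5, 6} (c-f)
```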
EXAMPLE
Identify the cliques for K = 5 and K = 4.
[Figure: graph over nodes 1, 2, 3, 5, 6, 7, 9, 10]
PROCLUS
 Choose a sample set of data points randomly.
 Choose a set of data points that are likely to be the
medoids of the clusters.
INPUT AND OUTPUT FOR PROCLUS
 Input:
 The set of data points
 Number of clusters, denoted by k
 Average number of dimensions per cluster,
denoted by L
 Output:
 The clusters found, and the dimensions associated with
these clusters
Three Phases of PROCLUS:
 Initialization Phase
 Iterative Phase
 Refinement Phase
INITIALIZATION PHASE
 Choose a sample set of data points randomly.
 Choose a set of data points that are likely to be the medoids
of the clusters.
ITERATIVE PHASE
 From the Initialization Phase, we get a set of data points that
should contain the medoids (denoted by M).
 In this phase, we find the best medoids from M.
 Randomly pick a set of points Mcurrent, and replace the “bad”
medoids with other points from M if necessary.
For the medoids, the following is done:
 Find the dimensions related to the medoids
 Assign data points to the medoids
 Evaluate the clusters formed
 Find the bad medoid, and try the result of replacing the bad medoid
 The above procedure is repeated until a satisfactory result is obtained.

REFINEMENT PHASE - HANDLING
OUTLIERS
 For each medoid mi with dimension set Di, find the
smallest Manhattan segmental distance δi to any of the
other medoids with respect to that set of dimensions.
 δi defines the sphere of influence of the medoid mi.
 A data point is an outlier if it is not within any sphere of
influence.
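A small sketch of the Manhattan segmental distance used here: the Manhattan distance restricted to a dimension set D, normalized by |D| so that distances measured in subspaces of different sizes are comparable. The medoid values are hypothetical:

```python
def manhattan_segmental(x, y, dims):
    """Manhattan distance over the dimensions in `dims`, averaged by |dims|."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

m1 = (1.0, 5.0, 2.0)  # hypothetical medoids in a 3-D space
m2 = (2.0, 9.0, 4.0)
# Restricted to dimensions {0, 2}: (|1-2| + |2-4|) / 2 = 1.5
print(manhattan_segmental(m1, m2, dims=[0, 2]))  # 1.5
```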
