Clustering High Dimensional Data


Introduction

Most clustering methods are designed for clustering low-dimensional
data and encounter challenges when the dimensionality of the data
grows really high (say, over 10 dimensions, or even over thousands
of dimensions for some tasks).

Issues:

• Noise

• The distance measure becomes meaningless


Often only a small number of dimensions are relevant to a given
cluster; the remaining, irrelevant dimensions produce noise and mask
the real clusters.

What happens when dimensionality increases?

• Data become increasingly sparse because the data points are likely
located in different dimensional subspaces.

• Data points come to appear almost equally distant from one another,
so the distance measure, which is essential for cluster analysis,
becomes meaningless.
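To make the last point concrete, below is a minimal sketch (Python
with NumPy; the point counts, dimensions, and random data are
arbitrary choices for illustration) that measures how the gap between
a point's nearest and farthest neighbour shrinks, relative to the
distances themselves, as dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    points = rng.random((500, d))   # 500 uniform points in [0, 1]^d
    query = rng.random(d)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast (max - min) / min shrinks toward 0 as d grows,
    # so "nearest" and "farthest" become nearly indistinguishable.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast = {contrast:.3f}")
```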
Solution Techniques

• Feature/Attribute Transformation

• Feature/Attribute Selection

• Subspace Clustering
Feature Transformation

• Transforms the data onto a smaller space while preserving the
original relative distances between objects.

• Summarizes the data by creating linear combinations of the
attributes.
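Principal component analysis (PCA) is a standard example of such a
transformation. Below is a minimal sketch, assuming scikit-learn and
NumPy are available; the data shape and number of components are
arbitrary illustration choices.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))     # 200 records, 50 attributes

pca = PCA(n_components=5)          # transform onto a 5-dimensional space
X_small = pca.fit_transform(X)     # each new axis is a linear
                                   # combination of all 50 attributes
print(X_small.shape)               # (200, 5)
```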
Feature Selection

• It is commonly used for data reduction by removing irrelevant or
redundant dimensions (or attributes).

• Given a set of attributes, attribute subset selection finds the
subset of attributes that are most relevant to the data mining task.

• Attribute subset selection involves searching through various
attribute subsets and evaluating these subsets using certain criteria.

• Supervised learning: the most relevant set of attributes is found
with respect to the given class labels.

• Unsupervised process: for example, entropy analysis, which is based
on the property that entropy tends to be low for data that contain
tight clusters (see the sketch below).
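The following is a minimal sketch of the entropy idea in Python. It is
one simple histogram-based variant invented for illustration, not a
specific published algorithm; the bin count, the test data, and the
decision to keep a single attribute are all arbitrary assumptions.
Attributes whose values form tight clusters concentrate in few
histogram bins and therefore score low entropy.

```python
import numpy as np

def attribute_entropy(column, bins=10):
    """Shannon entropy of a histogram of one attribute's values."""
    counts, _ = np.histogram(column, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                    # skip empty bins
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(2)
# One attribute with two tight clusters, one uniformly noisy attribute.
clustered = np.concatenate([rng.normal(0, 0.1, 100),
                            rng.normal(5, 0.1, 100)])
noisy = rng.uniform(0, 5, 200)
X = np.column_stack([clustered, noisy])

scores = [attribute_entropy(X[:, j]) for j in range(X.shape[1])]
keep = np.argsort(scores)[:1]       # keep the lowest-entropy attribute
print(scores, "-> keep attribute", keep)
```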
Subspace Clustering

• It is an extension to attribute subset selection that has shown its
strength at high-dimensional clustering.

• It is based on the observation that different subspaces may contain
different, meaningful clusters.

• Subspace clustering searches for groups of clusters within different
subspaces of the same data set.

• The problem becomes how to find such subspace clusters effectively
and efficiently.
Feature Transformation Issues

• Feature transformations do not remove any of the original attributes
from analysis, i.e. the irrelevant information may mask the real
clusters, even after transformation.

• Thus, feature transformation is only suited to data sets where most
of the dimensions are relevant to the clustering task.

• Unfortunately, real-world data sets tend to have many highly
correlated, or redundant, dimensions.
High-Dimensional Data Clustering Approaches

01 Dimension-Growth Subspace Clustering, represented by CLIQUE.

02 Dimension-Reduction Projected Clustering, represented by PROCLUS.

03 Frequent Pattern-Based Clustering, represented by pCluster.
CLIQUE: Dimension-Growth Subspace Clustering

• CLIQUE is used for the clustering of high-dimensional data present
in large tables. By high-dimensional data we mean records that have
many attributes.

• CLIQUE identifies the dense units present in the subspaces of the
high-dimensional data space, and uses these subspaces to provide more
efficient clustering.
CLIQUE Overall Approach

Goal: cluster a set of records in terms of n attributes
(an n-dimensional space).

MAJOR STEPS:

• CLIQUE partitions each subspace of dimension 1 into the same number
of equal-length intervals.

• Using this as a basis, it partitions the n-dimensional data space
into non-overlapping rectangular units.

• CLIQUE finds dense units of higher dimensionality by combining the
dense units already found in lower-dimensional subspaces (see the
sketch below).
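Below is a minimal, illustrative sketch of the bottom-up step in
Python, not the full CLIQUE algorithm. The function name, the
parameters xi (intervals per dimension) and tau (density threshold as
a fraction of the total point count), and the random test data are all
assumptions made for this example.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, xi=10, tau=0.05):
    """Return {(dim, interval): count} for 1-D units with >= tau * n points."""
    n, d = X.shape
    dense = {}
    for j in range(d):
        col = X[:, j]
        edges = np.linspace(col.min(), col.max(), xi + 1)
        cells = np.clip(np.digitize(col, edges) - 1, 0, xi - 1)
        for cell in range(xi):
            count = int(np.sum(cells == cell))
            if count >= tau * n:
                dense[(j, cell)] = count
    return dense

rng = np.random.default_rng(3)
X = rng.random((1000, 4))
d1 = dense_units_1d(X)
# Candidate 2-D dense units combine dense 1-D units from different
# dimensions; a 2-D unit can only be dense if both of its 1-D
# projections are dense (the Apriori-style pruning CLIQUE relies on).
candidates = [(a, b) for a, b in combinations(d1, 2) if a[0] != b[0]]
print(len(d1), "dense 1-D units,", len(candidates), "2-D candidates")
```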
Strength and Weakness of CLIQUE

Strength:

• It automatically finds subspaces of the highest dimensionality such
that high-density clusters exist in those subspaces.

• It is insensitive to the order of records in the input and does not
presume any canonical data distribution.

• It scales linearly with the size of the input and has good
scalability as the number of dimensions in the data increases.

Weakness:

• Obtaining meaningful clustering results depends on proper tuning of
the grid size (a structure fixed in advance) and the density
threshold; therefore the accuracy of the clustering result may be
degraded.
PROCLUS: Dimension-Reduction Projected Clustering

• Projected cluster: a subset of data points, together with a subset
of dimensions, such that the points are closely clustered in the
corresponding subspace.

• Objective: find the cluster centroids (medoids) and the set of
dimensions in which each cluster exists.
PROCLUS: Overall Approach

• Initialization: an initial (super)set of medoids is chosen.

• Iterative phase: find the dimensions within a locality of each
medoid and compute the resulting clustering; improve the quality of
the medoids; repeat until the stop criterion is satisfied (see the
sketch below).
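The following is a simplified sketch of one PROCLUS-style iteration in
Python. The helper names, the locality size, and the per-cluster
dimension count k are assumptions for illustration; the real algorithm
additionally uses greedy medoid selection, hill-climbing medoid
replacement, and a refinement phase.

```python
import numpy as np

def find_dimensions(X, medoid, k=2, locality=50):
    """Pick the k dimensions with the smallest average deviation
    among the medoid's nearest neighbours (its locality)."""
    dists = np.abs(X - medoid).sum(axis=1)          # Manhattan distance
    neighbours = X[np.argsort(dists)[:locality]]
    spread = np.abs(neighbours - medoid).mean(axis=0)
    return np.argsort(spread)[:k]

def assign(X, medoids, dims_per_medoid):
    """Assign each point to the medoid with the smallest average
    (segmental) distance, measured only in that medoid's dimensions."""
    scores = np.stack([
        np.abs(X[:, dims] - m[dims]).mean(axis=1)
        for m, dims in zip(medoids, dims_per_medoid)
    ])
    return scores.argmin(axis=0)

rng = np.random.default_rng(4)
X = rng.random((300, 8))
medoids = X[rng.choice(len(X), size=3, replace=False)]
dims = [find_dimensions(X, m) for m in medoids]
labels = assign(X, medoids, dims)
print(dims, np.bincount(labels))
```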
PROCLUS Drawbacks:

1. The algorithm requires the average number of dimensions per cluster
as an input parameter.

2. The performance of PROCLUS is highly sensitive to the value of this
input parameter.

3. If the average number of dimensions is erroneously estimated, the
performance of PROCLUS significantly worsens.
