Density Based Clustering Algorithm

Density Based Clustering

Algorithms
Kehkashan Fatima 202090202
Amulya Viswambharan 202090007
Sruthi Krishnan 202090333
Outline
• Introduction
• DBSCAN
• OPTICS
• DENCLUE
• Conclusion
Introduction
• Clustering is a useful technique in data mining for grouping similar objects.
• Density-based clustering groups objects according to the density of points in their neighborhoods.

DENSITY-BASED CLUSTERING ALGORITHMS

DBSCAN OPTICS DENCLUE

Represents various density based clustering algorithms


Density-based spatial clustering of applications
with noise (DBSCAN)
• ε-Neighborhood:
  The ε-neighborhood, N_ε(p), of a data point p is the set of points within a specified radius ε around p:

  N_ε(p) = { q ∈ D | d(p, q) ≤ ε }

  where d is some distance measure and ε > 0. Note that the point p is always in its own ε-neighborhood, i.e., p ∈ N_ε(p) always holds.
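As a concrete illustration, the ε-neighborhood can be computed directly; this is a minimal sketch (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def eps_neighborhood(points, p_idx, eps):
    """Return indices of all points within distance eps of points[p_idx].

    The point itself is always included, since d(p, p) = 0 <= eps.
    """
    dists = np.linalg.norm(points - points[p_idx], axis=1)
    return np.flatnonzero(dists <= eps)

# Three 1-D points at 0.0, 0.5, and 5.0.
pts = np.array([[0.0], [0.5], [5.0]])
print(eps_neighborhood(pts, 0, eps=1.0))  # indices 0 and 1
```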
DBSCAN
•  Point classes:
  A point p is classified as:
  • Core point: if p has high density, i.e., |N_ε(p)| ≥ minPts, where minPts is a user-specified density threshold.
  • Border point: if p is not a core point, but it is in the neighborhood of a core point q, i.e., p ∈ N_ε(q), or
  • Noise point, otherwise.
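The three point classes can be computed from pairwise distances; a brute-force sketch (names are illustrative):

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (DBSCAN point classes).

    A point is core if its eps-neighborhood (itself included) holds at
    least min_pts points; border if it is not core but lies in some core
    point's neighborhood; noise otherwise.
    """
    n = len(points)
    # Pairwise distance matrix; fine for small n (O(n^2) memory).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = d <= eps
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif neighbors[i][is_core].any():   # in some core point's neighborhood
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Tight cluster of 3 points plus one far-away point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(classify_points(pts, eps=0.5, min_pts=3))  # 3 core points, 1 noise point
```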


DBSCAN
•  Directly density-reachable:
  A point q is directly density-reachable from a point p with respect to ε and minPts if, and only if,
  1. |N_ε(p)| ≥ minPts, and
  2. q ∈ N_ε(p).
o Density-reachable: A point q is density-reachable from p if there exists in D an ordered sequence of points (p₁, p₂, ..., pₙ) with p₁ = p and pₙ = q such that pᵢ₊₁ is directly density-reachable from pᵢ for all i ∈ {1, 2, ..., n − 1}.
o Density-connected: A point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o.
DBSCAN

Figure: (a) the three point classes (core, border, and noise points); (b) the concepts of density-reachability and density-connectivity.
DBSCAN
• Procedure:
• Initially, all objects are marked unvisited.
• Randomly select an unvisited object m and mark it as visited.
• If the ε-neighborhood of m contains fewer than minPts objects, mark m as a noise point.
• Otherwise, form a new cluster C and add m to C.
• Add the remaining objects of m's ε-neighborhood to a candidate set N, and expand C by visiting them in turn.
• Repeat the procedure until all objects are visited.
• The run time of the basic DBSCAN algorithm is O(n²).
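The procedure above corresponds to the following compact, non-optimized sketch (helper names are illustrative; a production implementation would use a spatial index for the neighborhood queries):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster id per point (-1 = noise)."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    labels = [None] * n          # None = unvisited
    cluster_id = -1
    for m in range(n):
        if labels[m] is not None:
            continue
        neighbors = list(np.flatnonzero(d[m] <= eps))
        if len(neighbors) < min_pts:
            labels[m] = -1       # mark m as noise (may later become border)
            continue
        cluster_id += 1          # form a new cluster C and add m to it
        labels[m] = cluster_id
        seeds = [i for i in neighbors if i != m]   # candidate set N
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id             # noise -> border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = list(np.flatnonzero(d[q] <= eps))
            if len(q_neighbors) >= min_pts:        # q is a core point: expand
                seeds.extend(q_neighbors)
    return labels

pts = np.array([[0.0], [0.2], [0.4], [10.0]])
print(dbscan(pts, eps=0.5, min_pts=2))  # one cluster of three points, one noise point
```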
DBSCAN
• Advantages:
• Fast in practice.
• Finds clusters of arbitrary shape.
• Requires no prior knowledge of the data distribution.
• Needs only two input parameters (ε and minPts).
• The number of clusters does not need to be defined in advance.
Disadvantages:
• Sensitive to the initial values of ε and minPts.
• The computational complexity is O(n²); with spatial indexing it is O(n log n).
• Struggles with clusters of variable density, since a single ε cannot fit them all.
• Degrades on high-dimensional data, where distance measures lose contrast.
OPTICS

Ordering Points To Identify The Clustering Structures


Ankerst, Breunig, Kriegel and Sander
OPTICS: Ordering Points To Identify the
Clustering Structure
• DBSCAN
• Input parameter – hard to determine.
• Algorithm very sensitive to input parameters.
• Problem of detecting meaningful clusters in data of varying density.
• OPTICS
• Does not produce clusters explicitly.
• Produces a special order of the database w.r.t its density-based clustering
structure
• Contains info equivalent to the density-based clustering corresponding to a
broad range of parameter settings
• Good for both automatic and interactive cluster analysis
• Can be represented graphically or using visualization techniques
Core- and Reachability Distance

• Parameters: “generating” distance Ɛ, fixed value MinPts


core-distanceƐ,MinPts(p):
the smallest distance ε′ ≤ Ɛ such that p is a core object, i.e., the distance from p to its MinPts-th nearest point (undefined if p is not a core object within Ɛ).

reachability-distanceƐ,MinPts(p, q):
the smallest distance such that q is directly density-reachable from p, i.e., max(core-distance(p), d(p, q)) (undefined if p is not a core object).

Order points by shortest reachability distance to guarantee


that clusters w.r.t. higher density are finished first.
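Under these definitions, both distances can be computed directly; a sketch assuming the neighborhood counts include the point itself (names are illustrative):

```python
import numpy as np

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes p a core object, or None if p is not
    a core object even at radius eps."""
    dists = np.sort(np.linalg.norm(points - points[p], axis=1))
    # dists[0] is p's distance to itself (0); dists[min_pts - 1] is the
    # distance to p's min_pts-th nearest point, itself included.
    if dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, p, q, eps, min_pts):
    """Smallest distance such that q is directly density-reachable
    from p: max(core-distance(p), d(p, q))."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, np.linalg.norm(points[p] - points[q]))

pts = np.array([[0.0], [1.0], [2.0], [10.0]])
print(core_distance(pts, 1, eps=3.0, min_pts=3))             # 1.0
print(reachability_distance(pts, 1, 2, eps=3.0, min_pts=3))  # 1.0
```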
OPTICS Algorithm
• Basic data structure:
• Memorize shortest reachability distances seen so far
(“distance of a jump to that point”)
• Visit each point
• Make always a shortest jump
• Output:
• order of points
• core-distance of points
• reachability-distance of points
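A simplified sketch of this ordering loop, assuming Ɛ = ∞ so every point is treated as a core object (names and the test data are illustrative):

```python
import numpy as np

def optics_order(points, min_pts):
    """Minimal OPTICS sketch: returns the cluster ordering and the
    reachability distance assigned to each point."""
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Core distance: distance to the min_pts-th nearest point (self included).
    core = np.sort(d, axis=1)[:, min_pts - 1]
    reach = np.full(n, np.inf)
    processed = np.zeros(n, dtype=bool)
    order = []
    for _ in range(n):
        # Always make the shortest jump: pick the unprocessed point
        # with the smallest reachability distance seen so far.
        candidates = np.flatnonzero(~processed)
        p = candidates[np.argmin(reach[candidates])]
        processed[p] = True
        order.append(int(p))
        # Memorize the shortest jump distance to each unprocessed point.
        for q in np.flatnonzero(~processed):
            reach[q] = min(reach[q], max(core[p], d[p, q]))
    return order, reach

# Two well-separated 1-D groups; the large reach value marks the boundary.
pts = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
order, reach = optics_order(pts, min_pts=2)
print(order)
print(reach)
```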
OPTICS: The Reachability Plot

represents the density-based clustering structure


easy to analyze
independent of the dimension of the data

[Figure: two reachability plots, with the reachability distance on the y-axis and the cluster ordering on the x-axis; valleys in the plot correspond to clusters.]

OPTICS
Advantages
 It solves the problem of finding good clusters when the data has varying density.
 It outputs the objects in a cluster ordering.

Disadvantages
 Higher memory cost.
 It expects some kind of density drop in order to detect cluster borders.
 It is less sensitive to erroneous data.
DENCLUE
(DENsity-based CLUstEring)
Proposed by
A. Hinneburg and D. A. Keim
DENCLUE
• It can be considered a special case of Kernel Density Estimation (KDE).
• KDE is a non-parametric estimation technique that aims to find dense regions of points.
• DENCLUE was developed to cluster large multimedia databases, since such databases contain a high volume of noise and require clustering of high-dimensional feature vectors.
STEPS in DENCLUE

PRE-CLUSTERING
1. Construct the hyper-rectangle enclosing the data.
2. Determine the populated cubes inside it.
3. Construct a map by connecting the populated cubes.

CLUSTERING
4. Determine the density attractors using the Hill Climbing algorithm.
5. Connect the density attractors having the same path to form the final clusters.
Steps…
DENCLUE is based on calculating the influence that points exert on one another.

The total sum of these influence functions represents the density function.

The (Gaussian) influence function between two points x and y is:

f(x, y) = exp( − d(x, y)² / (2σ²) )

where
d(x, y) is the Euclidean distance between x and y, and
σ represents the radius of the neighbourhood containing x.

Fig.: A hyper-rectangle with cubes.
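Assuming the Gaussian influence function commonly used with DENCLUE, the computation is a one-liner (names are illustrative):

```python
import numpy as np

def gaussian_influence(x, y, sigma):
    """Gaussian influence of point y on point x:
    exp(-d(x, y)^2 / (2 * sigma^2)), with d the Euclidean distance."""
    d = np.linalg.norm(np.asarray(x) - np.asarray(y))
    return np.exp(-d**2 / (2 * sigma**2))

print(gaussian_influence([0.0, 0.0], [0.0, 0.0], sigma=1.0))  # 1.0 at zero distance
print(gaussian_influence([0.0, 0.0], [3.0, 4.0], sigma=1.0))  # exp(-12.5), nearly 0
```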
DENSITY FUNCTION & DENSITY ATTRACTOR
The density function is obtained by summing the influence functions over all points:

f^D(x) = Σᵢ₌₁ᴺ f(x, xᵢ)

where D represents the set of points in the database, and N its cardinality.

• To determine the clusters, DENCLUE calculates a density attractor for each point in the database.

• This attractor is a local maximum of the density function. The maximum is found by the Hill Climbing algorithm, which is based on a gradient-ascent approach.
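The gradient ascent toward a density attractor can be sketched as follows, assuming the Gaussian kernel density above (the step size and tolerance are illustrative choices, not from the slides):

```python
import numpy as np

def density(x, data, sigma):
    """Kernel density at x: sum of Gaussian influences of all data points."""
    d = np.linalg.norm(data - x, axis=1)
    return np.exp(-d**2 / (2 * sigma**2)).sum()

def hill_climb(x, data, sigma, step=0.1, tol=1e-5, max_iter=1000):
    """Follow the density gradient from x to a local maximum
    (the density attractor of x)."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        d = data - x
        w = np.exp(-np.linalg.norm(d, axis=1)**2 / (2 * sigma**2))
        grad = (w[:, None] * d).sum(axis=0) / sigma**2  # gradient of the density
        if np.linalg.norm(grad) < tol:
            break                                       # reached an attractor
        x = x + step * grad
    return x

# Points clustered around 0; hill climbing from 1.0 should land near 0.
data = np.array([[0.0], [0.2], [-0.2]])
attractor = hill_climb([1.0], data, sigma=0.5)
print(attractor)
```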
Procedure..

The points forming a path to a density attractor are called attracted points.

Clusters are formed by taking into account each density attractor together with its attracted points.

Overall:
i. A statistical density estimation method is used to estimate the kernel density; this yields the local density maxima.

ii. Clusters are formed around these local density maxima.

iii. If the local density value of an object is very small, the object is discarded as noise.

iv. Objects under consideration are assigned to a cluster through their density attractors, using a step-wise hill-climbing procedure.
ADVANTAGES:
• It handles noisy data very well.
• It finds non-spherical (arbitrarily shaped) clusters in high-dimensional data sets.
• Its processing is much faster than DBSCAN's.

DISADVANTAGES:
• It needs many parameters (such as σ and the noise threshold).
• It is less sensitive to outliers.
Problem with the Hill Climbing Algorithm

• The result of the Hill Climbing algorithm can be either optimal or complete, but not both: gradient ascent may stop at a local maximum of the objective function instead of the global maximum.

• DENCLUE suffers in terms of execution time due to the Hill Climbing algorithm.

[Figure: objective function y over state space X, showing a local maximum and the global maximum.]
CONCLUSION:
DENCLUE-IM (an improvement to DENCLUE)
• The idea behind it is to speed up the computation by avoiding the crucial step in DENCLUE, which is the Hill Climbing step.
• This step, considered crucial in the DENCLUE algorithm, is based on gradient calculations.
• These calculations are done for each point in order to find its density attractor.
• Performing these calculations for every point cannot achieve results in a reasonable time, especially when operating on large databases.
• DENCLUE-IM instead finds an item equivalent to the density attractor, which represents all the points contained in a hyper-cube, called X_HCube.
Conclusion..
• X_HCube is taken to be the point having the highest density in the hyper-cube:

X_HCube = argmax_{x ∈ Cp} f^D(x)

where Cp is a given populated hyper-cube in the constructed hyper-rectangle.

• Thus each hyper-cube constitutes an initial cluster represented by its X_HCube.
• These clusters will be merged if, and only if, there exists a path between their representatives.
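A sketch of the representative-selection idea; the grid construction, density estimate, and naming here are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np
from collections import defaultdict

def hypercube_representatives(data, edge, sigma):
    """Partition space into hyper-cubes of side `edge`; in each populated
    cube, pick the point with the highest estimated density as its
    representative (the role of X_HCube in DENCLUE-IM)."""
    cubes = defaultdict(list)
    for i, x in enumerate(data):
        key = tuple((x // edge).astype(int))   # cube index of point x
        cubes[key].append(i)
    reps = {}
    for key, idxs in cubes.items():
        # Gaussian kernel density of each cube member w.r.t. the whole data set.
        dens = [np.exp(-np.linalg.norm(data - data[i], axis=1)**2
                       / (2 * sigma**2)).sum() for i in idxs]
        reps[key] = idxs[int(np.argmax(dens))]
    return reps

# Three points fall in cube 0; the middle one has the highest density.
data = np.array([[0.1], [0.2], [0.9], [3.1]])
print(hypercube_representatives(data, edge=1.0, sigma=0.5))
```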
