DBSCAN Algorithm

You might also like

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 15

DBSCAN Algorithm

By
Bijoyeta Roy
CSE Department, SMIT
DBSCAN Clustering in ML
 Clustering analysis or simply Clustering is basically an
Unsupervised learning method that divides the data points into a
number of specific batches or groups, such that the data points in
the same groups have similar properties and data points in
different groups have different properties in some sense.

 Fundamentally, all clustering methods use the same approach i.e.


first we calculate similarities and then we use it to cluster the
data points into groups or batches. Here we will focus on
the Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) clustering method.
DBSCAN Clustering in ML
 Clusters are dense regions in the data space, separated by regions of the
lower density of points. The DBSCAN algorithm is based on this
intuitive notion of “clusters” and “noise”.

 Partitioning methods (K-means, PAM clustering) and hierarchical


clustering work for finding spherical-shaped clusters or convex clusters.
In other words, they are suitable only for compact and well-separated
clusters. Moreover, they are also severely affected by the presence of
noise and outliers in the data.

 DBSCAN is a popular clustering algorithm used for data analysis and


pattern recognition. It groups data points based on their density,
identifying clusters of high-density regions and classifying outliers as
noise

 The key idea is that for each point of a cluster, the neighborhood of a
given radius has to contain at least a minimum number of points.
DBSCAN Clustering in ML
 It was proposed by Martin Ester et al. in 1996. DBSCAN is a
density-based clustering algorithm that works on the assumption
that clusters are dense regions in space separated by regions of
lower density.

 It groups ‘densely grouped’ data points into a single cluster. It


can identify clusters in large spatial datasets by looking at the
local density of the data points. The most exciting feature of
DBSCAN clustering is that it is robust to outliers.

 It can discover clusters of different shapes and sizes from a large


amount of data, which is containing noise and outliers.

 DBSCAN is not just able to cluster the data points correctly, but
it also perfectly detects noise in the dataset.
DBSCAN Clustering in ML
Parameters Required For DBSCAN Algorithm

1. eps(Epsilon): It defines the neighborhood around a data point i.e. if


the distance between two points is lower or equal to ‘eps’ then they
are considered neighbors.

If the eps value is chosen too small then a large part of the data will be
considered as an outlier. If it is chosen very large then the clusters will
merge and the majority of the data points will be in the same clusters.
One way to find the eps value is based on the k-distance graph.

2. MinPts: Minimum number of neighbors (data points) within eps


radius. The larger the dataset, the larger value of MinPts must be
chosen. As a general rule, the minimum MinPts can be derived from
the number of dimensions D in the dataset as, MinPts >= D+1. The
minimum value of MinPts must be chosen at least 3.
DBSCAN Clustering in ML
In this algorithm, we have 3 types of data points.

Core Point: A point is a core point if it has more than MinPts


points within eps.

Border Point: A point which has fewer than MinPts within eps but
it is in the neighborhood of a core point.

Noise or outlier: A point which is not a core point or border point.


Reachability and Connectivity
 Reachability states if a data point can be accessed from
another data point directly or indirectly, whereas
Connectivity states whether two data points belong to the
same cluster or not.

 In terms of reachability and connectivity, two points in


DBSCAN can be referred to as:

• Directly Density-Reachable
• Density-Reachable
• Density-Connected
Directly Density-Reachable
A point X is directly density-reachable from point Y w.r.t epsilon,
minPoints if,

1.X belongs to the neighborhood of Y, i.e, dist(X, Y) <= epsilon

2. Y is a core point

Here, X is directly density-reachable from


Y, but vice versa is not valid.
Density-Reachable
A point X is density-reachable from point Y w.r.t epsilon,
minPoints if there is a chain of points p1, p2, p3, …, pn and
p1=X and pn=Y such that pi+1 is directly density-reachable from pi.
A point X is density-connected from point Y w.r.t epsilon and
minPoints if there exists a point O such that both X and Y are
density-reachable from O w.r.t to epsilon and minPoints.
Algorithmic steps for DBSCAN clustering

•The algorithm proceeds by arbitrarily picking up a point


in the dataset (until all points have been visited).

•If there are at least ‘minPoint’ points within a radius of


‘ε’(eps) to the point then we consider all these points to be
part of the same cluster.

•The clusters are then expanded by recursively repeating


the neighborhood calculation for each neighboring point
Questions???

You might also like