DBSCAN Python Example - The Optimal Value For Epsilon (EPS) - by Cory Maklin - Towards Data Science

09/07/2021 DBSCAN Python Example: The Optimal Value For Epsilon (EPS) | by Cory Maklin | Towards Data Science
Open in app
Follow 558K Followers
You have 2 free member-only stories left this month. Upgrade for unlimited access.
DBSCAN Python Example: The Optimal Value

For Epsilon (EPS)
Cory Maklin Jun 30, 2019 · 4 min read
DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an

unsupervised machine learning algorithm. Unsupervised machine learning algorithms
are used to classify unlabeled data. In other words, the samples used to train our model
do not come with predefined categories. In comparison to other clustering algorithms,
DBSCAN is particularly well suited for problems which require:
https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc 1/7
1. Minimal domain knowledge to determine the input parameters (i.e. K in k-means

and Dmin in hierarchical clustering)
2. Discovery of clusters with arbitrary shapes
3. Good efficiency on large databases
If you’re interested in reading up on DBSCAN, the original paper can be found here.
Algorithm
As is the case in most machine learning algorithms, the model’s behaviour is dictated
by several parameters. In the proceeding article, we’ll touch on three.
eps: Two points are considered neighbors if the distance between the two points is
below the threshold epsilon.
min_samples: The minimum number of neighbors a given point should have in

order to be classified as a core point. It’s important to note that the point itself is
included in the minimum number of samples.
metric: The metric to use when calculating distance between instances in a feature
array (i.e. euclidean distance).
The algorithm works by computing the distance between every point and all other
points. We then place the points into one of three categories.
Core point: A point with at least min_samples points whose distance with respect to
the point is below the threshold defined by epsilon.
Border point: A point that isn’t in close proximity to at least min_samples points but is
close enough to one or more core point. Border points are included in the cluster of the
closest core point.
Noise point: Points that aren’t close enough to core points to be considered border
points. Noise points are ignored. That is to say, they aren’t part of any cluster.
Code
Let’s take a look at how we could go about implementing DBSCAN in python. To get
started, import the following libraries.
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
As opposed to importing data, we can use scikit-learn to generate nicely defined

clusters.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,

random_state=0)
plt.scatter(X[:,0], X[:,1])
As mentioned previously, we must provide a value for epsilon which defines the
maximum distance between two points. The following paper, describes an approach
for automatically determining the optimal value for Eps.
https://iopscience.iop.org/article/10.1088/1755-1315/31/1/012012/pdf
In layman’s terms, we find a suitable value for epsilon by calculating the distance to the
nearest n points for each point, sorting and plotting the results. Then we look to see
where the change is most pronounced (think of the angle between your arm and
forearm) and select that as epsilon.
We can calculate the distance from each point to its closest neighbour using the
NearestNeighbors . The point itself is included in n_neighbors . The kneighbors method
returns two arrays, one which contains the distance to the closest n_neighbors points
and the other which contains the index for each of those points.
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X)
distances, indices = nbrs.kneighbors(X)
Next, we sort and plot results.
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)
The optimal value for epsilon will be found at the point of maximum curvature.
We train our model, selecting 0.3 for eps and setting min_samples to 5.
m = DBSCAN(eps=0.3, min_samples=5)
m.fit(X)
The labels_ property contains the list of clusters and their respective points.
clusters = m.labels_
Then, we map every individual cluster to a color.
colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid',

'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy']
vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
The model classified the densely populated areas. As we can see, all the dark blue
points were categorized as noise.
plt.scatter(X[:,0], X[:,1], c=vectorizer(clusters))
Final Thoughts
Unlike k-means, DBSCAN will figure out the number of clusters. DBSCAN works by
determining whether the minimum number of points are close enough to one another
to be considered part of a single cluster. DBSCAN is very sensitive to scale since epsilon
is a fixed value for the maximum distance between two points.
Sign up for The Variable

By Towards Data Science
Every Thursday, the Variable delivers the very best of Towards Data Science: from hands-on tutorials
and cutting-edge research to original features you don't want to miss. Take a look.
Emails will be sent to rafaelmborgognoni@gmail.com.

Get this newsletter
Not you?
Artificial Intelligence Machine Learning Programming Towards Data Science Data Science
About Write Help Legal
Get the Medium app

DBSCAN Python Example - The Optimal Value For Epsilon (EPS) - by Cory Maklin - Towards Data Science

Uploaded by

Copyright:

Available Formats

You might also like

DBSCAN Python Example - The Optimal Value For Epsilon (EPS) - by Cory Maklin - Towards Data Science

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DBSCAN Python Example - The Optimal Value For Epsilon (EPS) - by Cory Maklin - Towards Data Science

Uploaded by

Copyright:

Available Formats

09/07/2021 DBSCAN Python Example: The Optimal Value For Epsilon (EPS) | by Cory Maklin | Towards Data Science

Follow 558K Followers

DBSCAN Python Example: The Optimal Value

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is an

1. Minimal domain knowledge to determine the input parameters (i.e. K in k-means

2. Discovery of clusters with arbitrary shapes

3. Good efficiency on large databases

min_samples: The minimum number of neighbors a given point should have in

from sklearn.datasets.samples_generator import make_blobs

from sklearn.neighbors import NearestNeighbors

from sklearn.cluster import DBSCAN

from matplotlib import pyplot as plt

import seaborn as sns

As opposed to importing data, we can use scikit-learn to generate nicely defined

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60,

distances, indices = nbrs.kneighbors(X)

Next, we sort and plot results.

distances = np.sort(distances, axis=0)

Then, we map every individual cluster to a color.

colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid',

vectorizer = np.vectorize(lambda x: colors[x % len(colors)])

plt.scatter(X[:,0], X[:,1], c=vectorizer(clusters))

Sign up for The Variable

Emails will be sent to rafaelmborgognoni@gmail.com.

About Write Help Legal

Get the Medium app

You might also like