A Density-Based Algorithm For Discovering Clusters in Large Spatial Databases With Noise
226 KDD-96
clusters found by a partitioning algorithm is convex, which is very restrictive.

Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced, which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases of thousands of objects. Ng & Han (1994) also discuss methods to determine the "natural" number k_nat of clusters in a database. They propose to run CLARANS once for each k from 2 to n. For each of the discovered clusterings the silhouette coefficient (Kaufman & Rousseeuw 1990) is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the "natural" clustering. Unfortunately, the run time of this approach is prohibitive for large n, because it implies O(n) calls of CLARANS. CLARANS assumes that all objects to be clustered can reside in main memory at the same time, which does not hold for large databases. Furthermore, the run time of CLARANS is prohibitive on large databases. Therefore, Ester, Kriegel & Xu (1995) present several focusing techniques which address both of these problems by focusing the clustering process on the relevant parts of the database. First, the focus is small enough to be memory resident and second, the run time of CLARANS on the objects of the focus is significantly less than its run time on the whole database.

Hierarchical algorithms create a hierarchical decomposition of D. The hierarchical decomposition is represented by a dendrogram, a tree that iteratively splits D into smaller subsets until each subset consists of only one object. In such a hierarchy, each node of the tree represents a cluster of D. The dendrogram can either be created from the leaves up to the root (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters at each step. In contrast to partitioning algorithms, hierarchical algorithms do not need k as an input. However, a termination condition has to be defined indicating when the merge or division process should be terminated. One example of a termination condition in the agglomerative approach is the critical distance Dmin between all the clusters of Q.

So far, the main problem with hierarchical clustering algorithms has been the difficulty of deriving appropriate parameters for the termination condition, e.g. a value of Dmin which is small enough to separate all "natural" clusters and, at the same time, large enough such that no cluster is split into two parts. Recently, in the area of signal processing, the hierarchical algorithm Ejcluster has been presented (García, Fdez-Valdivia, Cortijo & Molina 1994), automatically deriving a termination condition. Its key idea is that two points belong to the same cluster if you can walk from the first point to the second one by a "sufficiently small" step. Ejcluster follows the divisive approach. It does not require any input of domain knowledge. Furthermore, experiments show that it is very effective in discovering non-convex clusters. However, the computational cost of Ejcluster is O(n^2) due to the distance calculation for each pair of points. This is acceptable for applications such as character recognition with moderate values for n, but it is prohibitive for applications on very large databases.

Jain (1988) explores a density based approach to identify clusters in k-dimensional point sets. The data set is partitioned into a number of nonoverlapping cells and histograms are constructed. Cells with relatively high frequency counts of points are the potential cluster centers and the boundaries between clusters fall in the "valleys" of the histogram. This method has the capability of identifying clusters of any shape. However, the space and run-time requirements for storing and searching multidimensional histograms can be enormous. Even if the space and run-time requirements are optimized, the performance of such an approach crucially depends on the size of the cells.

3. A Density Based Notion of Clusters

When looking at the sample sets of points depicted in figure 1, we can easily and unambiguously detect clusters of points and noise points not belonging to any of those clusters.

figure 1: Sample databases (database 1, database 2, database 3)

The main reason why we recognize the clusters is that within each cluster we have a typical density of points which is considerably higher than outside of the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters.

In the following, we try to formalize this intuitive notion of "clusters" and "noise" in a database D of points of some k-dimensional space S. Note that both our notion of clusters and our algorithm DBSCAN apply as well to 2D or 3D Euclidean space as to some high dimensional feature space. The key idea is that for each point of a cluster the neighborhood of a given radius has to contain at least a minimum number of points, i.e. the density in the neighborhood has to exceed some threshold. The shape of a neighborhood is determined by the choice of a distance function for two points p and q, denoted by dist(p,q). For instance, when using the Manhattan distance in 2D space, the shape of the neighborhood is rectangular. Note that our approach works with any distance function, so that an appropriate function can be chosen for some given application. For the purpose of proper visualization, all examples will be in 2D space using the Euclidean distance.

Definition 1: (Eps-neighborhood of a point) The Eps-neighborhood of a point p, denoted by N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}.

A naive approach could require for each point in a cluster that there are at least a minimum number (MinPts) of points in an Eps-neighborhood of that point. However, this ap-
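Definition 1 and the naive MinPts condition are easy to express in code. The following Python sketch is ours, not the paper's: the names dist, eps_neighborhood and is_core_point are illustrative, and the linear scan over D stands in for a region query.

```python
import math

def dist(p, q):
    """Euclidean distance between 2D points, the dist(p, q) of the text."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def eps_neighborhood(D, p, eps):
    """N_Eps(p) = {q in D | dist(p, q) <= Eps} (Definition 1).
    Note that p belongs to its own Eps-neighborhood."""
    return [q for q in D if dist(p, q) <= eps]

def is_core_point(D, p, eps, min_pts):
    """The naive MinPts condition: p's Eps-neighborhood must contain
    at least MinPts points."""
    return len(eps_neighborhood(D, p, eps)) >= min_pts
```

Any other distance function can be substituted for dist here, matching the paper's remark that the approach works with an arbitrary distance function.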
there is no easy way to get this information in advance for all clusters of the database. However, there is a simple and effective heuristic (presented in section 4.2) to determine the parameters Eps and MinPts of the "thinnest", i.e. least dense, cluster in the database. Therefore, DBSCAN uses global values for Eps and MinPts, i.e. the same values for all clusters. The density parameters of the "thinnest" cluster are good candidates for these global parameter values specifying the lowest density which is not considered to be noise.

4.1 The Algorithm

To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. Eps and MinPts. If p is a core point, this procedure yields a cluster wrt. Eps and MinPts (see Lemma 2). If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.

Since we use global values for Eps and MinPts, DBSCAN may merge two clusters according to definition 5 into one cluster, if two clusters of different density are "close" to each other. Let the distance between two sets of points S1 and S2 be defined as dist(S1, S2) = min{dist(p,q) | p ∈ S1, q ∈ S2}. Then, two sets of points having at least the density of the thinnest cluster will be separated from each other only if the distance between the two sets is larger than Eps. Consequently, a recursive call of DBSCAN may be necessary for the detected clusters with a higher value for MinPts. This is, however, no disadvantage because the recursive application of DBSCAN yields an elegant and very efficient basic algorithm. Furthermore, the recursive clustering of the points of a cluster is only necessary under conditions that can be easily detected.

In the following, we present a basic version of DBSCAN omitting details of data types and generation of additional information about clusters:

  DBSCAN (SetOfPoints, Eps, MinPts)
  // SetOfPoints is UNCLASSIFIED
    ClusterId := nextId(NOISE);
    FOR i FROM 1 TO SetOfPoints.size DO
      Point := SetOfPoints.get(i);
      IF Point.ClId = UNCLASSIFIED THEN
        IF ExpandCluster(SetOfPoints, Point,
             ClusterId, Eps, MinPts) THEN
          ClusterId := nextId(ClusterId)
        END IF
      END IF
    END FOR
  END; // DBSCAN

SetOfPoints is either the whole database or a discovered cluster from a previous run. Eps and MinPts are the global density parameters determined either manually or according to the heuristics presented in section 4.2. The function SetOfPoints.get(i) returns the i-th element of SetOfPoints. The most important function used by DBSCAN is ExpandCluster, which is presented below:

  ExpandCluster (SetOfPoints, Point, ClId, Eps,
                 MinPts) : Boolean;
    seeds := SetOfPoints.regionQuery(Point, Eps);
    IF seeds.size < MinPts THEN // no core point
      SetOfPoints.changeClId(Point, NOISE);
      RETURN False;
    ELSE // all points in seeds are density-
         // reachable from Point
      SetOfPoints.changeClIds(seeds, ClId);
      seeds.delete(Point);
      WHILE seeds <> Empty DO
        currentP := seeds.first();
        result := SetOfPoints.regionQuery(currentP, Eps);
        IF result.size >= MinPts THEN
          FOR i FROM 1 TO result.size DO
            resultP := result.get(i);
            IF resultP.ClId
                IN {UNCLASSIFIED, NOISE} THEN
              IF resultP.ClId = UNCLASSIFIED THEN
                seeds.append(resultP);
              END IF;
              SetOfPoints.changeClId(resultP, ClId);
            END IF; // UNCLASSIFIED or NOISE
          END FOR;
        END IF; // result.size >= MinPts
        seeds.delete(currentP);
      END WHILE; // seeds <> Empty
      RETURN True;
    END IF
  END; // ExpandCluster

A call of SetOfPoints.regionQuery(Point, Eps) returns the Eps-neighborhood of Point in SetOfPoints as a list of points. Region queries can be supported efficiently by spatial access methods such as R*-trees (Beckmann et al. 1990) which are assumed to be available in a SDBS for efficient processing of several types of spatial queries (Brinkhoff et al. 1994). The height of an R*-tree is O(log n) for a database of n points in the worst case and a query with a "small" query region has to traverse only a limited number of paths in the R*-tree. Since the Eps-neighborhoods are expected to be small compared to the size of the whole data space, the average run time complexity of a single region query is O(log n). For each of the n points of the database, we have at most one region query. Thus, the average run time complexity of DBSCAN is O(n * log n).

The ClId (clusterId) of points which have been marked to be NOISE may be changed later, if they are density-reachable from some other point of the database. This happens for border points of a cluster. Those points are not added to the seeds-list because we already know that a point with a ClId of NOISE is not a core point. Adding those points to seeds would only result in additional region queries which would yield no new answers.

If two clusters C1 and C2 are very close to each other, it might happen that some point p belongs to both C1 and C2. Then p must be a border point in both clusters because otherwise C1 would be equal to C2 since we use global parameters.
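The pseudocode of DBSCAN and ExpandCluster translates almost line for line into Python. The sketch below is our own, under simplifying assumptions: cluster ids live in a labels list instead of being attached to the points, NOISE is encoded as -1, and region_query is a naive linear scan rather than an R*-tree query.

```python
import math

UNCLASSIFIED, NOISE = None, -1  # label conventions used by this sketch

def region_query(points, i, eps):
    """Naive O(n) stand-in for SetOfPoints.regionQuery(Point, Eps):
    indices of all points in the Eps-neighborhood of point i."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:            # no core point
        labels[i] = NOISE
        return False
    for j in seeds:                     # all seeds are density-reachable from i
        labels[j] = cluster_id
    seeds.remove(i)
    while seeds:
        current = seeds.pop(0)
        result = region_query(points, current, eps)
        if len(result) >= min_pts:      # current is a core point
            for j in result:
                if labels[j] in (UNCLASSIFIED, NOISE):
                    if labels[j] is UNCLASSIFIED:
                        seeds.append(j)
                    # former NOISE points become border points of the
                    # cluster but are not added to seeds (not core points)
                    labels[j] = cluster_id
    return True

def dbscan(points, eps, min_pts):
    """Returns one cluster id per point (NOISE = -1)."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels
```

In this sketch, a border point that is density-reachable from two clusters keeps the id of the cluster discovered first, since its label is no longer UNCLASSIFIED or NOISE when the second cluster is expanded.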
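An R*-tree is beyond the scope of a short example, but the effect of a spatial access method on region queries can be illustrated with a much simpler structure. The sketch below (our own, not from the paper) buckets points into a uniform grid with cells of side Eps, so a region query inspects at most the 3x3 block of cells around the query point instead of scanning the whole database.

```python
from collections import defaultdict
from math import floor, hypot

class GridIndex:
    """Toy stand-in for the R*-tree: points are hashed into square
    cells of side eps, so every point within distance eps of p lies
    in one of the 9 cells surrounding p's cell."""

    def __init__(self, points, eps):
        self.eps = eps
        self.cells = defaultdict(list)
        for p in points:
            self.cells[self._cell(p)].append(p)

    def _cell(self, p):
        return (floor(p[0] / self.eps), floor(p[1] / self.eps))

    def region_query(self, p):
        """Eps-neighborhood of p; only 9 cells are inspected."""
        cx, cy = self._cell(p)
        hits = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for q in self.cells.get((cx + dx, cy + dy), ()):
                    if hypot(p[0] - q[0], p[1] - q[1]) <= self.eps:
                        hits.append(q)
        return hits
```

With roughly uniform data each cell holds O(1) points, so a region query touches only a constant number of points on average; the R*-tree assumed in the paper achieves the O(log n) behavior stated above without the uniformity assumption.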
figure 6: Clusterings discovered by DBSCAN (database 1, database 2, database 3)

To test the efficiency of DBSCAN and CLARANS, we use the SEQUOIA 2000 benchmark data. The SEQUOIA 2000 benchmark database (Stonebraker et al. 1993) uses real data sets that are representative of Earth Science tasks. There are four types of data in the database: raster data, point data, polygon data and directed graph data. The point data set contains 62,584 Californian names of landmarks, extracted from the US Geological Survey's Geographic Names Information System, together with their location. The point data set occupies about 2.1 Mbytes. Since the run time of CLARANS on the whole data set is very high, we have extracted a series of subsets of the SEQUOIA 2000 point data set containing from 2% to 20% representatives of the whole set. The run time comparison of DBSCAN and CLARANS on these databases is shown in table 1.

Table 1: run time in seconds

  number of points | 1252 | 2503 |  3910 |  5213 |  6256
  DBSCAN           |  3.1 |  6.7 |  11.3 |  16.0 |  17.8
  CLARANS          |  758 | 3026 |  6845 | 11745 | 18029

  number of points |  7820 |  8937 | 10426 | 12512
  DBSCAN           |  24.5 |  28.2 |  32.7 |  41.7
  CLARANS          | 29826 | 39265 | 60540 | 80638

The results of our experiments show that the run time of DBSCAN is slightly higher than linear in the number of points. The run time of CLARANS, however, is close to quadratic in the number of points. The results show that DBSCAN outperforms CLARANS by a factor of between 250 and 1900 which grows with increasing size of the database.

6. Conclusions

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the well-known algorithms suffer from severe drawbacks when applied to large spatial databases. In this paper, we presented the clustering algorithm DBSCAN which relies on a density-based notion of clusters. It requires only one input parameter and supports the user in determining an appropriate value for it. We performed a performance evaluation on synthetic data and on real data of the SEQUOIA 2000 benchmark. The results of these experiments demonstrate that DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS. Furthermore, the experiments have shown that DBSCAN outperforms CLARANS by a factor of at least 100 in terms of efficiency.

Future research will have to consider the following issues. First, we have only considered point objects. Spatial databases, however, may also contain extended objects such as polygons. We have to develop a definition of the density in an Eps-neighborhood in polygon databases for generalizing DBSCAN. Second, applications of DBSCAN to high dimensional feature spaces should be investigated. In particular, the shape of the k-dist graph in such applications has to be explored.

WWW Availability

A version of this paper in larger font, with large figures and clusterings in color is available under the following URL: http://www.dbs.informatik.uni-muenchen.de/dbs/project/publikationen/veroeffentlichungen.html.

References

Beckmann N., Kriegel H.-P., Schneider R., and Seeger B. 1990. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990, pp. 322-331.

Brinkhoff T., Kriegel H.-P., Schneider R., and Seeger B. 1994. Efficient Multi-Step Processing of Spatial Joins. Proc. ACM SIGMOD Int. Conf. on Management of Data, Minneapolis, MN, 1994, pp. 197-208.

Ester M., Kriegel H.-P., and Xu X. 1995. A Database Interface for Clustering in Large Spatial Databases. Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, Canada, 1995, AAAI Press.

García J.A., Fdez-Valdivia J., Cortijo F.J., and Molina R. 1994. A Dynamic Approach for Clustering Data. Signal Processing, Vol. 44, No. 2, 1994, pp. 181-196.

Gueting R.H. 1994. An Introduction to Spatial Database Systems. The VLDB Journal 3(4): 357-399.

Jain Anil K. 1988. Algorithms for Clustering Data. Prentice Hall.

Kaufman L., and Rousseeuw P.J. 1990. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons.

Matheus C.J.; Chan P.K.; and Piatetsky-Shapiro G. 1993. Systems for Knowledge Discovery in Databases. IEEE Transactions on Knowledge and Data Engineering 5(6): 903-913.

Ng R.T., and Han J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. 20th Int. Conf. on Very Large Data Bases, Santiago, Chile, pp. 144-155.

Stonebraker M., Frew J., Gardels K., and Meredith J. 1993. The SEQUOIA 2000 Storage Benchmark. Proc. ACM SIGMOD Int. Conf. on Management of Data, Washington, DC, 1993, pp. 2-11.