

Indexing the Distance: An Efficient Method to KNN Processing

Cui Yu 1, Beng Chin Ooi 1, Kian-Lee Tan 1, H. V. Jagadish 2

1 Department of Computer Science, National University of Singapore
2 Department of Electrical Engineering & Computer Science, University of Michigan

Abstract

In this paper, we present an efficient method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional space. iDistance partitions the data and selects a reference point for each partition. The data in each cluster are transformed into a single dimensional space based on their similarity with respect to a reference point. This allows the points to be indexed using a B+-tree structure and KNN search to be performed using one-dimensional range search. The choice of partition and reference point provides the iDistance technique with degrees of freedom most other techniques do not have. We describe how appropriate choices here can effectively adapt the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 27th VLDB Conference, Roma, Italy, 2001

1 Introduction

Many emerging database applications such as image, time series and scientific databases manipulate high-dimensional data. In these applications, one of the most frequently used and yet expensive operations is to find objects in the high-dimensional database that are similar to a given query object. Nearest neighbor search is a central requirement in such cases.

There is a long stream of research on solving the nearest neighbor search problem, and a large number of multidimensional indexes have been developed for this purpose. However, most of these structures are not adaptive with respect to data distribution. In consequence, they tend to perform well for some data sets and poorly for others.

In this paper, we propose a new technique for KNN search that can be adapted based on the data distribution. For uniform distributions, it behaves similarly to techniques such as the pyramid tree [1] and iMinMax [13], which are known to be good for such situations. For highly clustered distributions, it behaves as if a hierarchical clustering tree had been created especially for this problem.

Our technique, called iDistance, relies on partitioning the data and defining a reference point for each partition. We then index the distance of each data point to the reference point of its partition. Since this distance is a simple scalar, with a small mapping effort to keep partitions distinct, it is possible to use a standard B+-tree to index this distance. As such, it is easy to graft our technique on top of an existing commercial relational database.

Finally, our technique permits the immediate generation of a few results while additional results are searched for. In other words, we are able to support online query answering: an important facility for interactive querying and data analysis.

The effectiveness of iDistance depends on how the data are partitioned, and how reference points are selected. We therefore also present several partitioning strategies, as well as reference point selection strategies.

We implemented the iDistance method together with the partitioning and reference point selection strategies, and conducted an extensive performance study to evaluate their effectiveness. Our results show that the proposed schemes can provide fast initial response time without sacrificing the quality of the answers. Moreover, through an appropriate choice of partitioning scheme, the iDistance method can compute the complete answer set much faster than linear scan, iMinMax and the A-tree [15].

In Section 2, we review some related work. Section 3 provides the background for metric-based KNN search. In Section 4, we present the proposed iDistance algorithm, and in Section 5, we discuss the clustering mechanisms and policies for selecting reference points. Section 6 presents our performance study and the results. Finally, we conclude in Section 7 with directions for future work.

2 Related Work

Many indexing techniques have been proposed for nearest neighbor and approximate search in high-dimensional databases. Existing multi-dimensional indexes [4] such as R-trees [11] have been shown to be inefficient even for supporting range/window queries in high-dimensional databases; they, however, form the basis for indexes designed for high-dimensional databases [12, 17]. To reduce the effect of high dimensionality, the use of bigger nodes [3], dimensionality reduction [7] and filter-and-refine methods [2, 16] have been proposed. Indexes were also specifically designed to facilitate metric-based query processing [6, 8]. However, linear scan remains one of the best strategies for KNN search [5]. This is because there is a high tendency for data points to be equidistant to query points in a high-dimensional space. More recently, the P-Sphere tree [9] was proposed to support approximate nearest neighbor (NN) search, where the answers can be obtained quickly but they may not necessarily be the nearest neighbors.

To reduce the effect of high dimensionality as experienced by the R-tree, the A-tree stores virtual bounding boxes that approximate actual minimum bounding boxes instead of the actual bounding boxes. Bounding boxes are approximated by their relative positions with respect to the parents' bounding region. The rationale is similar to that of the X-tree [3]: tree nodes with more entries may lead to less overlap and hence a more efficient search. However, maintenance of effective virtual bounding boxes is expensive should the database be very dynamic: a change in the parent bounding box will affect all virtual bounding boxes referencing it. Further, additional CPU time for deriving the actual bounding boxes is incurred. Notwithstanding, the performance study in [15] shows that it is more efficient than the SR-tree [12] and the VA-file [16].

3 Background

To search for the K nearest neighbors of a query point q, the distance of the Kth nearest neighbor to q defines the minimum radius required for retrieving the complete answer set. Unfortunately, such a distance cannot be predetermined with 100% accuracy. Hence, an iterative approach that examines an increasingly larger sphere in each iteration can be employed. The algorithm works as follows. Given a query point q, finding the K nearest neighbors (NN) begins with a query sphere defined by a relatively small radius about q, querydist(q). All data spaces that intersect the query sphere have to be searched. Gradually, the search region is expanded till all the K nearest points are found and all the data subspaces that intersect with the current query space are checked. The K data points are the nearest neighbors when further enlargement of the query sphere does not introduce new answer points. We note that starting the search with a small initial radius keeps the search space as tight as possible, and hence minimizes unnecessary search (had a larger radius that contains all the K nearest points been used).

4 Our Proposal

In this section, we propose a new KNN processing scheme, called iDistance, to facilitate efficient distance-based KNN search. The design of iDistance is motivated by the following observations. First, the (dis)similarity between data points can be derived with reference to a chosen reference point. Second, data points can be ordered based on their distances to a reference point. Third, distance is essentially a single dimensional value. This allows us to represent high-dimensional data in a single dimensional space, thereby enabling reuse of existing single dimensional indexes such as the B+-tree.

iDistance is designed to support similarity search, both similarity range search and KNN search. However, we note that similarity range search is a window search with a fixed radius and is simpler in computation than KNN search. Thus, we shall concentrate on the KNN operations from here onwards.

For the rest of this section, we assume a set of data points, Points, in a unit d-dimensional space. Let dist be a metric distance function for pairs of points. In our research, we use the Euclidean distance as the distance function, although other distance functions may be used for certain applications.

4.1 The Data Structure

In iDistance, high-dimensional points are transformed into a single dimensional space. This is done using a three-step algorithm. In the first step, the high-dimensional data space is split into a set of partitions. In the second step, a reference point is identified for each partition. Without loss of generality, let us suppose that we have m partitions, P0, P1, ..., Pm-1, and their corresponding reference points, O0, O1, ..., Om-1. We shall defer the discussion on how the partitions and reference points are obtained to the next section.

Finally, in the third step, all data points are represented in a single dimension space as follows. A data point p(x1, x2, ..., xd), 0 ≤ xj ≤ 1, 1 ≤ j ≤ d, has an index key, y, based on the distance from the nearest reference point Oi as follows:

    y = i × c + dist(p, Oi)

where dist(Oi, p) is a distance function that returns the distance between Oi and p, and c is some constant to stretch the data ranges. Essentially, c serves to partition the single dimension space into regions so that all points in partition Pi will be mapped to the range [i × c, (i + 1) × c).
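To make the mapping concrete, here is a minimal sketch in Python (our own illustration, not the authors' C implementation; the function names and the way c is supplied are assumptions): given the list of reference points, the key of a point is computed from its nearest reference point, and the partition index is returned alongside the key.

    import math

    def dist(a, b):
        # Euclidean distance, the metric used in the paper
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def idistance_key(p, reference_points, c):
        # assign p to the partition of its nearest reference point O_i and
        # map it to the one-dimensional key y = i * c + dist(p, O_i);
        # c should exceed the largest distance within any partition so that
        # the key ranges [i * c, (i + 1) * c) of different partitions stay disjoint
        i = min(range(len(reference_points)),
                key=lambda j: dist(p, reference_points[j]))
        return i, i * c + dist(p, reference_points[i])

In the unit d-dimensional space, any c no smaller than the diameter of the space (for Euclidean distance, the square root of d, or simply a round constant such as 10 for moderate d) keeps these ranges disjoint.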
In iDistance, we also employ two data structures:

- A B+-tree is used to index the transformed points to facilitate speedy retrieval. We use the B+-tree since it is available in all commercial DBMSs. In our implementation of the B+-tree, leaf nodes are linked to both the left and right siblings [14]. This is to facilitate searching the neighboring nodes when the search region is gradually enlarged.

- An array is also required to store the m reference points and their respective nearest and farthest radii that define the data space.

Clearly, iDistance is lossy in the sense that multiple data points in the high-dimensional space may be mapped to the same value in the single dimensional space. For example, different points within a partition that are equidistant from the reference point have the same transformed value.
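A sketch of how these two structures could be populated, continuing the Python illustration above (a sorted in-memory list stands in for the disk-based B+-tree, and build_index, radii are our own names, not the paper's):

    import bisect

    def build_index(points, reference_points, c):
        # "B+-tree": a list of (key, point) pairs kept sorted by key
        btree = []
        # auxiliary array: [dist_min_i, dist_max_i] for each partition P_i
        radii = [[float("inf"), 0.0] for _ in reference_points]
        for p in points:
            i, key = idistance_key(p, reference_points, c)
            bisect.insort(btree, (key, p))
            d = key - i * c                      # distance of p from O_i
            radii[i][0] = min(radii[i][0], d)
            radii[i][1] = max(radii[i][1], d)
        return btree, radii

The lossiness noted above shows up here directly: two different points with the same partition and the same distance to its reference point receive identical keys and simply become separate entries under equal key values.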
4.2 KNN Search in iDistance

Before we examine the KNN algorithm for iDistance, let us look at the search regions. Let Oi be the reference point of partition i, and let dist_maxi and dist_mini be the maximum and minimum distance between Oi and the points in partition Pi respectively. We note that the region bounded by the spheres obtained from these two radii defines the effective data space that needs to be searched. Let q be a query point and querydist(q) be the radius of the sphere obtained about q. For iDistance, in conducting NN search, if dist(Oi, q) - querydist(q) ≤ dist_maxi, then Pi has to be searched for NN points. The range to be searched within an affected partition in the single dimensional space is [max(0, dist_mini), min(dist_maxi, dist(Oi, q) + querydist(q))]. Figure 1 shows an example. Here, for query point q1, only partition P1 needs to be searched; for query point q2, both P2 and P3 have to be searched. From the figure, it is clear that all points along a fixed radius have the same value after transformation due to the lossy transformation of data points into distances with respect to the reference points. As such, the shaded regions are the areas that need to be checked.

Figure 1: Search regions for NN queries q1 and q2.

Figures 2 to 4 summarize the algorithm for KNN with the iDistance method. The algorithm is similar to its high-dimensional counterpart. It begins by searching a small 'sphere', and incrementally enlarges the search space till all K nearest neighbors are found. The search stops when the distance of the farthest object in S (the answer set) from query point q is less than or equal to the current search radius r.

The algorithm iDistanceKNN is highly abstracted. Before examining it, let us briefly discuss some of the important routines and notations. Routine farthest(S, q) returns the object in S farthest in distance from q. sphere(q, r) denotes the sphere with radius r and centroid q. Since the routines Inward and Outward are similar, we shall only explain routine Inward. Given a leaf node, routine Inward examines the entries of the node to determine if they are among the K nearest neighbors, and updates the answers accordingly. We note that because iDistance is lossy, it is possible that points with the same transformed value are actually not close to one another: some may be closer to q, while others are far from it. If the first element (or last element for Outward) of the node is contained in the query sphere, then it is likely that its predecessor with respect to distance from the reference point (or successor for Outward) may also be close to q. As such, the left (or right for Outward) sibling is examined. In other words, Inward (Outward) searches the space towards (away from) the reference point of the partition. The routine LocateLeaf is a typical B+-tree traversal algorithm which locates a leaf node given a search value, and hence a detailed description of it is omitted.

We are now ready to explain the search algorithm. Searching in iDistance begins by scanning the auxiliary structure to identify the reference points whose data space (the sphere area of the partition) overlaps with the query region. The search starts with a small radius (querydist), and step by step, the radius is increased to form a bigger query sphere. For each enlargement, there are three main cases to consider.

1. The data space contains the query point, q. In this case, we want to traverse the data space sufficiently to determine the K nearest neighbors. This can be done by first locating the leaf node where q may be stored. Since this node does not necessarily contain points whose distances are closest to q compared to its sibling nodes, we need to search inward or outward from the reference point accordingly.
2. The data space intersects the query sphere. In this case, we only need to search inward since the query point is outside the data space.

3. The data space does not intersect the query sphere. Here, we do not need to examine the data space.

The search stops when the K nearest neighbors have been identified from the data subspaces that intersect with the current query sphere and when further enlargement of the query sphere does not change the K nearest list. In other words, all points outside the subspaces intersecting with the query sphere will definitely be at a distance D from the query point such that D is greater than querydist. This occurs when the distance of the farthest object in the answer set, S, from query point q is less than or equal to the current search radius r. Therefore, the answers returned by iDistance are of 100% accuracy.

An interesting by-product of iDistance is that it can provide approximate KNN answers quickly. In fact, at each iteration of algorithm iDistanceKNN, we have a set of K candidate NN points. These results can be returned to the users immediately and refined as more accurate answers are obtained in subsequent iterations. It is important to note that these K candidate NN points can be partitioned into two categories: those that we are certain to be in the answer set, and those for which we have no such guarantee. The first category can be easily determined, since all those points with distance smaller than the current spherical radius of the query must be in the answer set. Users who can tolerate some amount of inaccuracy can obtain quick approximate answers and terminate the processing prematurely (as long as they are satisfied with the guarantee). Alternatively, max_r can be specified with an appropriate value to terminate iDistanceKNN prematurely.

Algorithm iDistanceKNN(q, Δr, max_r)
    r = 0;
    Stopflag = FALSE;
    initialize lp[], rp[], oflag[];
    while r < max_r and Stopflag == FALSE
        r = r + Δr;
        SearchO(q, r);

Figure 2: iDistance KNN main search algorithm.

Algorithm SearchO(q, r)
    pfarthest = farthest(S, q);
    if dist(pfarthest, q) < r and |S| == K
        Stopflag = TRUE; break;
    for i = 0 to m - 1
        dis = dist(Oi, q);
        if not oflag[i]                          /* Oi has not been searched before */
            if sphere(Oi, dist_maxi) contains q
                oflag[i] = TRUE;
                lnode = LocateLeaf(btree, i * c + dis);
                lp[i] = Inward(lnode, i * c + dis - r);
                rp[i] = Outward(lnode, i * c + dis + r);
            else if sphere(Oi, dist_maxi) intersects sphere(q, r)
                oflag[i] = TRUE;
                lnode = LocateLeaf(btree, i * c + dist_maxi);
                lp[i] = Inward(lnode, i * c + dis - r);
        else
            if lp[i] not nil
                l[i] = lp[i] -> leftnode;
                lp[i] = Inward(l[i], i * c + dis - r);
            if rp[i] not nil
                r[i] = rp[i] -> rightnode;
                rp[i] = Outward(r[i], i * c + dis + r);

Figure 3: iDistance KNN search algorithm: SearchO.

Algorithm Inward(node, ivalue)
    for each entry e in node (e = ej, j = 1, 2, ..., Number_of_entries)
        if |S| == K
            pfarthest = farthest(S, q);
            if dist(e, q) < dist(pfarthest, q)
                S = S - pfarthest;
                S = S ∪ e;
        else
            S = S ∪ e;
    if e1.key > ivalue
        lnode = node -> leftnode;
        node = Inward(lnode, ivalue);
    if end of partition is reached
        node = nil;
    return(node);

Figure 4: iDistance KNN search algorithm: Inward.
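The sketch below (Python, built on the dist and build_index helpers from the earlier sketches; our own illustration, not the authors' C implementation) mirrors the overall control flow of iDistanceKNN. For simplicity it re-scans the affected key range of each partition at every enlargement, instead of resuming from the saved leaf positions lp[i] and rp[i] as SearchO, Inward and Outward do.

    import bisect, heapq

    def idistance_knn(q, K, btree, radii, reference_points, c, delta_r, max_r):
        keys = [k for k, _ in btree]
        best = []                 # max-heap of (-distance, point), at most K entries
        r = 0.0
        while r < max_r:
            r += delta_r
            for i, o_i in enumerate(reference_points):
                dist_min_i, dist_max_i = radii[i]
                d_oq = dist(q, o_i)
                if d_oq - r > dist_max_i:
                    continue      # sphere(O_i, dist_max_i) cannot intersect sphere(q, r)
                # key range of partition P_i that can hold points within r of q
                lo = i * c + max(dist_min_i, d_oq - r)
                hi = i * c + min(dist_max_i, d_oq + r)
                for j in range(bisect.bisect_left(keys, lo),
                               bisect.bisect_right(keys, hi)):
                    d = dist(q, btree[j][1])
                    if len(best) < K:
                        heapq.heappush(best, (-d, btree[j][1]))
                    elif d < -best[0][0]:
                        heapq.heapreplace(best, (-d, btree[j][1]))
            # stopping criterion of SearchO: the K-th candidate already lies inside
            # the current query sphere, so no unexamined point can be closer
            if len(best) == K and -best[0][0] <= r:
                break
        return sorted((-negd, p) for negd, p in best)

At any intermediate iteration, best holds the K candidate NN points discussed above, so the loop can also be cut short to return approximate answers.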
5 Selection of Reference Points and Data Space Partitioning

To support distance-based similarity search, we need to split the data space into partitions and, for each partition, we need a reference point. In this section we look at some choices.

5.1 Equal Partitioning of Data Space

A straightforward approach to data space partitioning is to sub-divide it into equal partitions. In a d-dimensional space, we have 2d hyperplanes. The method we adopt is to partition the space into 2d pyramids with the centroid of the unit cube space as their top, and each hyperplane forming the base of each pyramid.¹ Now, we expect equi-partitioning to be effective if the data are uniformly distributed. We study the following possible reference points, where the actual data spaces of the hyperspheres do not overlap.
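For illustration, a minimal sketch (Python, continuing the helpers above; our own code, not the paper's) of this equal partitioning: each point falls into the pyramid whose base (hyperplane) centroid is nearest to it, which is the first of the reference-point choices studied below. The farthest-distance variant only replaces min by max.

    def hyperplane_centroids(d):
        # the 2d reference points: the centre of each of the 2d bounding
        # hyperplanes (faces) of the unit cube, i.e. (0.5, ..., 0.5) with one
        # coordinate pushed to 0 or 1
        centroids = []
        for j in range(d):
            for v in (0.0, 1.0):
                o = [0.5] * d
                o[j] = v
                centroids.append(tuple(o))
        return centroids

    def closest_partition(p, centroids):
        # (centroid of hyperplane, closest distance): p belongs to the pyramid
        # whose base-centre is nearest to it
        return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))

An external reference point can then be obtained by sliding each centroid further away from the space centroid along the same line, so that it falls outside the data space, as described below.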
Figure 5: Space partitioning methods. (a) (Centroid, Closest) (b) (Centroid, Farthest) (c) (External point, Closest). The area bounded by the arcs is the affected search area of KNN(Q, r).


Centroid of hyperplane, Closest Distance. The centroid of each hyperplane can be used as a reference point, and the partition associated with the point contains all points that are nearest to it. Figure 5(a) shows an example in a 2-dimensional space of a query region and the affected space. For a query point along the central axis, the set of points retrieved along the axis is exactly the same as in a pyramid tree. When dealing with query and data points off the axis, the sets of points are not exactly identical, due to the curvature of the hypersphere as compared to the partitioning along axial hyperplanes in the case of the pyramid tree; nonetheless, these sets are likely to have considerable overlap.

Centroid of hyperplane, Farthest Distance. The centroid of each hyperplane can be used as a reference point, and the partition associated with the point contains all points that are farthest from it. Figure 5(b) shows an example in 2-dimensional space. As shown, the affected area can be greatly reduced (as compared to the closest distance counterpart).

External point. Any point along the line formed by the centroid of a hyperplane and the centroid of the corresponding data space can also be used as a reference point.² By external point, we refer to a reference point that falls outside the data space. This heuristic is expected to perform well when the affected area is quite large, especially when the data are uniformly distributed. We note that both closest and farthest distance can also be supported. Figure 5(c) shows an example of closest distance for a 2-dimensional space. Again, we observe that the affected space under an external point is reduced (compared to using the centroid of the hyperplane). The generalization here conceptually corresponds to the manner in which the iMinMax tree generalizes the pyramid tree. The iDistance metric can perform approximately like the iMinMax tree, as we shall see in our experimental study.

¹ We note that the space is similar to that of the Pyramid method [1]. However, the rationale behind the design and the mapping function are different; in the Pyramid method, a d-dimensional data point is associated with a pyramid based on an attribute value, and is represented as a value away from the centroid of the space.
² We note that the other two reference points are actually special cases of this.

5.2 Cluster Based Partitioning

As mentioned, equi-partitioning is expected to be effective only if the data are uniformly distributed. However, data points are often correlated. In these cases, a more appropriate partitioning strategy would be to identify clusters from the data space. However, in high-dimensional data space, the distribution of data points is mostly sparse, and hence clustering is not as straightforward as in low-dimensional databases. There are several existing clustering schemes in the literature such as BIRCH [20] and CURE [10]. While our metric based indexing is not dependent on the underlying clustering method, we expect the clustering strategy to have an influence on retrieval performance. In our experiment, we adopt a sampling-based approach. The method comprises four steps. First, we obtain a sample of the database. Second, from the sample, we obtain the statistics on the distribution of data in each dimension. Third, we select ki values from dimension i. These ki values are those values whose frequencies exceed a certain threshold value. (Histograms can be used to observe these values directly.) We can then form ∏ ki centroids from these values. For example, in a 2-dimensional data set, we can pick 2 high frequency values, say 0.2 and 0.8, on one dimension, and 2 high frequency values, say 0.3 and 0.6, on the other dimension. Based on this, we can predict that the clusters could be around (0.2, 0.3), (0.2, 0.6), (0.8, 0.3) or (0.8, 0.6), which can be treated as the clusters' centroids. Fourth, we count the data points that are nearest to each of the centroids; if there is a sufficient number of data points around a centroid, then we can estimate that there is a cluster there.
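The four-step sampling procedure can be sketched as follows (Python, continuing the helpers above; the bin width, frequency threshold and minimum support are our own illustrative parameters, not values from the paper):

    import itertools
    from collections import Counter

    def candidate_centroids(sample, bin_width=0.1, freq_threshold=0.05):
        # steps 2-3: per-dimension histograms over the sample; keep the bin
        # centres whose relative frequency exceeds the threshold
        d = len(sample[0])
        high_freq = []
        for j in range(d):
            counts = Counter(int(p[j] // bin_width) for p in sample)
            keep = [(b + 0.5) * bin_width
                    for b, n in counts.items() if n / len(sample) > freq_threshold]
            high_freq.append(keep or [0.5])   # fall back to the space centre
        # the cross product of the kept values yields the prod(k_i) candidates
        return [list(cand) for cand in itertools.product(*high_freq)]

    def confirmed_clusters(sample, centroids, min_support=0.02):
        # step 4: count the sample points nearest to each candidate centroid and
        # keep only the candidates with enough data around them
        support = Counter(
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in sample)
        return [centroids[i] for i, n in support.items()
                if n / len(sample) >= min_support]

Merging nearby candidate centroids, or splitting an oversized cluster, can then be applied to tune the final number of partitions, as discussed below.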
Figure 6: Cluster-based space partitioning. (a) Cluster centroid: reference points O1 (0.20, 0.70) and O2 (0.67, 0.31). (b) Cluster edge: reference points O1 (0, 1) and O2 (1, 0).


We note that the third step of the algorithm is crucial since the number of clusters can have an impact on the search area and the number of traversals from the root to the leaf nodes. When the number of clusters is small, more points are likely to have similar distances to a given reference point. On the other hand, when the number of clusters is large, more data spaces, defined by spheres with respect to the centroids of clusters, are likely to overlap, and incur additional traversal and searching. Our solution is simple: if there are too many clusters, we can merge those whose centroids are closest; similarly, if there are too few clusters, we can split a large cluster into two smaller ones. We expect the number of clusters to be a tuning parameter, which may vary for different applications and domains.

Once the clusters are obtained, we need to select the reference points. Again, we have several possible options when selecting reference points.

Centroid of cluster. The centroid of a cluster is a natural candidate as a reference point. Figure 6(a) shows an example with two clusters in 2-dimensional space.

Edge of cluster. As shown in Figure 6(a), when the centroid is used, the sphere areas of both clusters have to be enlarged to include outlier points, leading to significant overlap in the data space. To minimize the overlap, we can select points on the edge of the hyperplanes as reference points. Figure 6(b) is an example for a 2-dimensional data space. There are two clusters and the edge points are O1: (0, 1) and O2: (1, 0). As shown, the overlap of the two partitions is smaller than that using cluster centroids as reference points.

6 Performance Study

In this section, we report the results of an extensive performance study that we have conducted to evaluate the performance of iDistance. Variant indexing strategies of iDistance are tested on different data sets, varying data set dimension, data set size and data distribution.

We also extended the iMinMax(θ) scheme [13] to support KNN queries, and to return approximate answers progressively [19]. iMinMax(θ) maps a high-dimensional point to either the maximum or minimum value among the various dimensions of the point, and a range query requires d subqueries.

The comparative study was done between iDistance, iMinMax, and the A-tree [15]. We also compare iDistance against linear scan, which has been shown to be effective for KNN queries in high-dimensional data space.

6.1 Experiment Setup

We implemented iMinMax(θ) and the iDistance technique and their search algorithms in C, and used the B+-tree as the single dimensional index structure. We obtained the source code of the A-tree from the authors [15]. For all indexes, we set the index page size to 4 KB. All the experiments were performed on a 300-MHz SUN Ultra 10 with 64 megabytes of main memory, running SUN Solaris.

We conducted many experiments using various data sets, some deriving from the LUV color histograms of 20,000 images. Here, we report some of the more interesting results on KNN queries; more results can be found in [18]. For each query, a d-dimensional point is used. We issue five hundred query points, and take the average I/O cost as the performance metric.

In the experiments, we generated 8, 16 and 30-dimensional uniform data sets. The data size ranges from 100,000 to 500,000 points. For the clustered data sets, we used a clustering algorithm similar to BIRCH to generate the data sets.
Figure 7: Effect of search radius on query accuracy, using the outside (external) point and furthest distance scheme. (a) Dimension = 8 (b) Dimension = 16 (c) Dimension = 30. Each plot shows the percentage of the K nearest points found (K = 1, 10, 20) against the search radius.


6.2 Effect of Search Radius on Accuracy

In high-dimensional KNN search, searching around the neighborhood is required to find the K nearest neighbors. Typically, a small search sphere is used and enlarged when the search condition cannot be met. Hence, it is important to study the effect of the search radius on the proposed index methods.

In this experiment, we used 8-dimensional, 16-dimensional and 30-dimensional, 100K tuple uniformly distributed data sets. We use only the (external point, farthest distance) combination in this experiment. Figures 7(a)-(c) show the average percentage of KNN answers returned over multiple query points with respect to the search radius (querydist), for the 8, 16 and 30 dimensional data sets respectively. The results show that as the radius increases, the accuracy improves and hits 100% at a certain query distance. A query with smaller K requires less searching to retrieve the required answers. As the number of dimensions increases, the radius required to obtain 100% also increases, due to the increase in possible distance between two points and the sparsity of the data space in higher-dimensional space. However, we should note that the seemingly large increase is not out of proportion with respect to the total possible dissimilarity. We also observe that iDistance can return a significant number of nearest neighbors with a small query radius.

In Figure 8, we see the retrieval efficiency of iDistance for 10-NN queries. First, we note that we have stopped at a radius of around 0.5. This is because the algorithm is able to detect all the nearest neighbors once the radius reaches that length. As shown, iDistance can provide fast initial answers quickly (compared to linear scan). Moreover, iDistance can produce the complete answers much faster than linear scan for a reasonable number of dimensions (i.e., 8 and 16). When the number of dimensions reaches 30, iDistance takes a longer time to produce the complete answers. This is expected since the data are uniformly distributed. However, because of its ability to produce approximate answers, iDistance is a promising strategy to adopt.

Figure 8: Retrieval efficiency. Average page access cost against search radius for iDistance and sequential scan at d = 8, 16 and 30.

6.3 Effect of Reference Points on Equi-Partitioning Schemes

In this experiment, we evaluate the efficiency of equi-partitioning-based iDistance schemes using one of the previous data sets.

Figure 9: Effect of reference points (100K uniform data set, dimension = 30). Average page access cost against search radius for sequential scan, the hyperplane center, and external reference points at increasing distances (outside point, far point, farther point).

Figure 9 shows the results for the (centroid, closest) combination and three (external point, closest) schemes. Each of the external points is farther away from the hyperplane centroid than the others.
Figure 10: On cluster-based schemes. (a) Percentage trend with varying search radius. (b) Effect of the number of partitions on iDistance. (c) Effect of data size on search radius. (d) Effect of data size on I/O cost. (e) Effect of reference points in clustered data sets. (f) Effect of clustered data size.
First, we note that the I/O cost increases with the radius when doing KNN search. This is expected since a larger radius would mean an increasing number of false hits and more data being examined. We also notice that the iDistance-based schemes are very efficient in producing fast first answers, as well as the complete answers. Moreover, we note that the farther away the reference point is from the hyperplane centroid, the better the performance. This is because the data space that is required to be traversed is smaller in these cases as the point gets farther away.

6.4 Performance of Cluster-based Schemes

In this experiment, we tested a data set with 100K data points of 20 and 50 clusters, some of which overlapped each other. To test the effect of the number of partitions on KNN, we merge some number of close clusters and treat them as one cluster. We use an edge point near the cluster as its reference point for the partition. We notice in Figure 10(a), as with the other experiments, that the complete answer set can be obtained with a reasonably small radius. We also notice that a smaller number of partitions performs better in returning the K points. This is probably due to the larger partitions when the number of partitions is small.

The I/O results in Figure 10(b) show a slightly different trend. Here, we note that a smaller number of partitions incurs higher I/O cost. This is reasonable since a smaller number of partitions would mean that each partition is larger, and the number of false drops being accessed is also higher. As before, we observe that the cluster-based scheme can obtain the complete set of answers in a short time.

We also repeated the experiments for a larger data set of 500K points with 50 clusters, using the edge of cluster strategy in selecting the reference points. Figure 10(c) shows the search radius required for locating the K (K = 1, 10, 20, 100) nearest neighbors when 50 partitions were used. The results show that the search radius does not increase (compared to the small data set) in order to get a good percentage of KNN. However, the data size does have a great impact on the query cost. Figure 10(d) shows the I/O cost for 10-NN queries. We can see that iDistance has a speedup factor of 4 over linear scan when all 10 NNs were retrieved.

Figure 10(e) and Figure 10(f) show how the I/O cost is affected as the nearest neighbors are being returned. Here, a point (x, y) in the graph means that x percent of the K nearest points are obtained after y number of I/Os. We note that all the proposed schemes can produce 100% of the answers at a much lower cost than linear scan. In fact, the improvement can be as much as five times. The results also show that picking an edge point as the reference point is generally better because it can reduce the amount of overlap.
Figure 11: A comparative study. (a) iDistance vs iMinMax (100K uniform data set, dimension = 30, k = 10). (b) iDistance vs iMinMax (100K clustered data set, dimension = 30, k = 10; cluster centroid and cluster edge variants). (c) iDistance vs A-tree (100K clustered data set, dimension = 30; K = 1, 10, 20). Each plot shows the average page access cost against the accuracy rate (%).


6.5 Comparative Study

In this study, we compare the iDistance cluster-based schemes with iMinMax and the A-tree.

Our first experiment uses a 100K 30-dimensional uniform data set. The query is a 10-NN query. For iDistance, we use the (external point, farthest) scheme. Figure 11(a) shows the result of the experiment. First, we note that both iMinMax and iDistance can produce quality approximate answers very quickly compared to linear scan. As shown, the I/O cost is lower than linear scan with up to 95% accuracy. However, because the data is uniformly distributed, retrieving all the 10 NNs takes a longer time than linear scan since all points are almost equidistant to one another. Second, we note that iMinMax and iDistance perform equally well.

In another set of experiments, we use a 100K 30-dimensional clustered data set. The query is still a 10-NN query. Here, we study two versions of cluster-based iDistance: one that uses the edge of the cluster as a reference point, and another that uses the centroid of the cluster. Figure 11(b) summarizes the result. First, we observe that among the two cluster-based schemes, the one that employs the edge reference points performs best. This is because of the smaller overlaps in space of this scheme. Second, as in earlier experiments, we see that the cluster-based scheme can return an initial approximate answer quickly, and can eventually produce the final answer set much faster than the linear scan. Third, we note that iMinMax can also produce approximate answers quickly. However, its performance starts to degenerate as the radius increases, as it attempts to search for the exact K NNs. Unlike iDistance, which terminates once the K nearest points are determined, iMinMax cannot terminate until the entire data set is examined. As such, to obtain the final answer set, iMinMax performs poorly. Finally, we see that the relative performance between iMinMax and iDistance for the clustered data set is different from that of the uniform data set. Here, iDistance outperforms iMinMax by a wide margin because of the larger number of false drops produced by iMinMax.

We also compare the iDistance cluster-based schemes with the recently proposed A-tree structure [15], which has been shown to be far more efficient than the SR-tree [12] and significantly more efficient than the VA-file [16]. Upon investigation, we note that the gain is not only due to the use of virtual bounding boxes, but also the smaller sized logical pointers. Instead of storing actual pointers in the internal nodes, it has to store a memory-resident mapping table. In particular, the size of the table is very substantial. To have a fair comparison with the A-tree, we also implemented a version of the iDistance method in a similar fashion; the resultant effect is that the fan-out is greatly increased.

In this set of experiments, we use a 100K 30-dimensional clustered data set, and 1-NN, 10-NN and 20-NN queries. The page size we used is 4K bytes. Figure 11(c) summarizes the result. The results show that iDistance is clearly more efficient than the A-tree for the three types of queries we used. Moreover, it is able to produce answers progressively, which the A-tree is unable to achieve. The result also shows that the difference in performance for different K values is not significant when K gets larger.

6.6 CPU Cost

While linear scan incurs less seek time, linear scan of a feature file entails examination of each data point and calculation of the distance between each data point and the query point. This results in high CPU cost for linear scan. Figure 12 shows the CPU time of linear scan and iDistance for the same experiment as in Figure 10(b). It is interesting to note that the performance in terms of CPU time approximately reflects the trend in page accesses. The results show that the best iDistance method achieves about a seven fold increase in speed. We omit iMinMax in our comparison as iMinMax has to search the whole index in order to ensure 100% accuracy, and its CPU time at that point is much higher than that of linear scan.

Figure 12: CPU time. Average CPU cost (sec) against search radius for sequential scan and iDistance with 20 to 50 partitions (100K clustered data set, dimension = 30).

7 Conclusion

In this paper, we have presented a simple and efficient approach to nearest neighbor processing, called iDistance. Extensive experiments were conducted and the results show that iDistance is both effective and efficient. In fact, iDistance achieves a speedup factor of seven over linear scan without blowing up in storage requirement or compromising on the accuracy of the results (unless it is done with the intention of even faster response time). Moreover, it is well suited for integration into existing DBMSs. iDistance also outperforms the A-tree and iMinMax schemes, and is robust and adaptive to different data distributions. As an extension of the current work, we are implementing a similarity join algorithm using iDistance.

Acknowledgement

We would like to thank the authors of [15] for providing us with the source code of the A-tree structure that we used in our comparative study. The work of H. V. Jagadish was supported in part by NSF under grant number IIS-0002356.

References

[1] S. Berchtold, C. Böhm, and H.-P. Kriegel. The pyramid-technique: Towards breaking the curse of dimensionality. In Proc. 1998 ACM SIGMOD International Conference on Management of Data, pages 142-153, 1998.
[2] S. Berchtold, C. Böhm, H.-P. Kriegel, J. Sander, and H. V. Jagadish. Independent quantization: An index compression technique for high-dimensional data spaces. In Proc. 16th International Conference on Data Engineering, pages 577-588, 2000.
[3] S. Berchtold, D. A. Keim, and H.-P. Kriegel. The X-tree: An index structure for high-dimensional data. In Proc. 22nd International Conference on Very Large Data Bases, pages 28-37, 1996.
[4] E. Bertino et al. Indexing Techniques for Advanced Database Systems. Kluwer Academic, 1997.
[5] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In Proc. International Conference on Database Theory, 1999.
[6] T. Bozkaya and M. Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In Proc. 1997 ACM SIGMOD International Conference on Management of Data, pages 357-368, 1997.
[7] K. Chakrabarti and S. Mehrotra. Local dimensionality reduction: a new approach to indexing high dimensional spaces. In Proc. 26th International Conference on Very Large Data Bases, pages 89-100, 2000.
[8] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proc. 23rd International Conference on Very Large Data Bases, pages 426-435, 1997.
[9] J. Goldstein and R. Ramakrishnan. Contrast plots and P-Sphere trees: space vs. time in nearest neighbor searches. In Proc. 26th International Conference on Very Large Data Bases, pages 429-440, 2000.
[10] S. Guha, R. Rastogi, and K. Shim. CURE: an efficient clustering algorithm for large databases. In Proc. 1998 ACM SIGMOD International Conference on Management of Data, 1998.
[11] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. 1984 ACM SIGMOD International Conference on Management of Data, pages 47-57, 1984.
[12] N. Katayama and S. Satoh. The SR-tree: An index structure for high-dimensional nearest neighbor queries. In Proc. 1997 ACM SIGMOD International Conference on Management of Data, 1997.
[13] B. C. Ooi, K. L. Tan, C. Yu, and S. Bressan. Indexing the edge: a simple and yet efficient approach to high-dimensional indexing. In Proc. 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 166-174, 2000.
[14] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 2000.
[15] Y. Sakurai, M. Yoshikawa, and S. Uemura. The A-tree: An index structure for high-dimensional spaces using relative approximation. In Proc. 26th International Conference on Very Large Data Bases, pages 516-526, 2000.
[16] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th International Conference on Very Large Data Bases, pages 194-205, 1998.
[17] D. A. White and R. Jain. Similarity indexing with the SS-tree. In Proc. 12th International Conference on Data Engineering, pages 516-523, 1996.
[18] C. Yu. High-Dimensional Indexing. PhD thesis, Department of Computer Science, National University of Singapore, 2001.
[19] C. Yu, B. C. Ooi, and K. L. Tan. Progressive KNN search using B+-trees. http://www.comp.nus.edu.sg/ooibc/appimm.ps, 20 pages, 2001.
[20] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, 1996.
