Continuous Nearest Neighbor Search
Figure 2.1: R-tree and point-NN example

The most common type of nearest neighbor search is the point-kNN query, which finds the k objects of a dataset P that are closest to a query point q. Existing algorithms search the R-tree of P in a branch-and-bound manner. For instance, Roussopoulos et al. [RKV95] propose a depth-first method that, starting from the root of the tree, visits the entry with the minimum distance from q (e.g., entry E1 in Figure 2.1). The process is repeated recursively until the leaf level (node N4), where the first potential nearest neighbor (f) is found. During backtracking to the upper level (node N1), the algorithm only visits entries whose minimum distance is smaller than the distance of the nearest neighbor already found. In the example of Figure 2.1, after discovering f, the algorithm backtracks to the root level (without visiting N3) and then follows the path N2, N6, where the actual NN l is found.

Another approach [HS99] implements a best-first traversal that follows the entry with the smallest distance among all those visited. In order to achieve this, the algorithm keeps a heap with the candidate entries and their minimum distances from q.
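To make the best-first traversal concrete, here is a minimal sketch; the in-memory Node layout and the mindist helper are illustrative assumptions of ours, not structures prescribed by [HS99].

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical in-memory R-tree node (leaves carry data points)."""
    mbr: tuple                                    # (x1, y1, x2, y2)
    children: list = field(default_factory=list)  # child Nodes
    points: list = field(default_factory=list)    # (x, y) tuples, leaves only

def mindist(mbr, q):
    """Minimum distance between rectangle mbr and point q."""
    (x1, y1, x2, y2), (qx, qy) = mbr, q
    return math.hypot(max(x1 - qx, 0.0, qx - x2), max(y1 - qy, 0.0, qy - y2))

def best_first_nn(root, q):
    """Best-first NN: always expand the heap entry with the smallest
    (minimum) distance from q; the first data point popped is the NN."""
    tie = itertools.count()                       # tie-breaker for equal keys
    heap = [(mindist(root.mbr, q), next(tie), root)]
    while heap:
        d, _, item = heapq.heappop(heap)
        if not isinstance(item, Node):            # a data point: it is the NN
            return item, d
        for p in item.points:                     # leaf: enqueue its points
            heapq.heappush(heap, (math.dist(p, q), next(tie), p))
        for c in item.children:                   # inner node: enqueue children
            heapq.heappush(heap, (mindist(c.mbr, q), next(tie), c))
    return None, math.inf

# e.g., a two-level tree: the first point popped, (2, 3), is the NN of (0, 0)
leaf = Node(mbr=(2, 2, 4, 5), points=[(2, 3), (4, 5)])
print(best_first_nn(Node(mbr=(2, 2, 4, 5), children=[leaf]), (0, 0)))
```

Because entries are popped strictly in ascending distance order, the traversal is incremental: continuing to pop data points yields the 2nd, 3rd, ... nearest neighbors, the "distance browsing" property of [HS99].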
A technique that does not incur false misses is based on the concept of time-parameterized (TP) queries [TP02]. The output of a TP query has the general form <R, T, C>, where R is the current result of the query (the methodology applies to general spatial queries), T is the validity period of R, and C is the set of objects that will affect R at the end of T. From the current result R and the set of objects C that will cause changes, we can incrementally compute the next result. We refer to R as the conventional component of the query, and to (T, C) as its time-parameterized component.

Figures 2.2 and 2.3 illustrate how the problem of Figure 1.1 can be processed using TP NN queries. Initially, a point-NN query is performed at the starting point (s) to retrieve the first nearest neighbor (a). Then, the influence point s_x of each object x in the dataset P is computed as the point where x will start to get closer to the line segment than the current NN. Figure 2.2 shows the influence points after the retrieval of a. Some of the points (e.g., b) will never influence the result, meaning that they will never come closer to [s, e] than a. Identifying the influence point (s_c) that will change the result (rendering c the next neighbor) can be thought of as a conventional NN query, where the goal is to find the point x with the minimum dist(s, s_x). Thus, traditional point-NN algorithms (e.g., [RKV95]) can be applied with appropriate transformations (for details see [TP02]); a sketch of the underlying computation follows the next paragraph.

Recently, Benetis et al. [BJKS02] addressed CNN queries from a mathematical point of view. Our algorithm, on the other hand, exploits several geometric characteristics of the problem. Furthermore, we provide a performance analysis and discuss complex query types (e.g., trajectory nearest neighbor search).
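To see why the influence point is cheap to obtain: writing p(t) = s + t(e - s) for t in [0, 1], the condition dist(p(t), x) = dist(p(t), a) is linear in t, because the quadratic terms of the two squared distances cancel. The sketch below is our own illustration of this observation, not code from [TP02]:

```python
def influence_point(s, e, x, a):
    """Smallest t in [0, 1] at which candidate x becomes at least as close to
    p(t) = s + t*(e - s) as the current NN a; None if that never happens
    (points like b in Figure 2.2). |p(t)-x|^2 - |p(t)-a|^2 = c + m*t."""
    dx, dy = e[0] - s[0], e[1] - s[1]
    c = ((s[0] - x[0])**2 + (s[1] - x[1])**2
         - (s[0] - a[0])**2 - (s[1] - a[1])**2)
    m = 2.0 * (dx * (a[0] - x[0]) + dy * (a[1] - x[1]))
    if c <= 0:                  # x is already as close as a at s itself
        return 0.0
    if m >= 0:                  # the gap never shrinks along the segment
        return None
    t = -c / m                  # root of c + m*t = 0
    return t if t <= 1.0 else None

# s=(0,0), e=(10,0): x=(9,1) overtakes a=(1,1) exactly at the midpoint
print(influence_point((0, 0), (10, 0), (9, 1), (1, 1)))   # 0.5
```

Minimizing this t (equivalently, dist(s, s_x)) over all x yields the object that changes the result first, which is exactly the conventional NN formulation described above.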
$n_{NN} = N \cdot \mathrm{area}(R_{SEARCH}) = N \left( \pi d_{NN}^2 + 2\, d_{NN} \cdot q.l \right)$   (5-5)
Figure 6.3: Updating SL (k=2) – the second step

Finally, note that the performance analysis presented in Section 5 also applies to kCNN queries, except that in all equations $d_{NN}$ is replaced with $d_{k\text{-}NN}$, which corresponds to the distance between a query point and its k-th nearest neighbor. The estimation of $d_{k\text{-}NN}$ has been discussed in [BBK+01]:

$d_{k\text{-}NN} \approx \sqrt{k / (\pi N)}$
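As a quick numerical illustration of equation (5-5) and the estimate above (the concrete values of N, k, and the query length below are hypothetical, and the data space is assumed to be the unit square):

```python
import math

def d_knn(k, N):
    """Estimated distance from a query point to its k-th NN,
    for N uniformly distributed points in a unit square [BBK+01]."""
    return math.sqrt(k / (math.pi * N))

def n_nn(N, q_len, k=1):
    """Equation (5-5): expected number of points in the search region,
    a circle of radius d swept along a segment of length q_len
    (area = pi*d^2 + 2*d*q_len), with d_NN replaced by d_kNN for kCNN."""
    d = d_knn(k, N)
    return N * (math.pi * d ** 2 + 2 * d * q_len)

print(d_knn(5, 130_000))        # ~0.0035 (130K points, k = 5)
print(n_nn(130_000, 0.125, 5))  # ~118.7 expected neighbors
```

As a consistency check, for q.l = 0 equation (5-5) returns exactly k, i.e., a point query retrieving the k nearest neighbors.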
6.2 Trajectory Nearest Neighbor Search

So far we have discussed CNN query processing for a single query segment. In practice, a trajectory nearest neighbor (TNN) query consists of several consecutive segments and retrieves the NN of every point on each segment. An example of such a query is "find my nearest gas station at each point during my route from city A to city B". The adaptation of the proposed techniques to this case is straightforward.

Consider, for instance, Figure 6.4a, where the query consists of 3 line segments q1=[s, u], q2=[u, v], q3=[v, e]. A separate split list (SL1, SL2, SL3) is assigned to each query segment. The pruning heuristics are similar to those for CNN, but take into account all split lists. For example, a counterpart of heuristic 1 is: the sub-tree of entry E can be pruned if it qualifies for pruning with respect to every query segment qi, i.e., if its minimum distance from each qi is no smaller than the maximum distance currently stored in SLi (see the sketch below).
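The test can be phrased compactly. The following is our own minimal rendering of that counterpart heuristic, not code from the paper; the per-segment mindist values and split-list maxima are assumed to be computed elsewhere:

```python
def can_prune(entry_mindists, sl_max_dists):
    """TNN counterpart of heuristic 1 (sketch): skip an entry's sub-tree only
    if, for every query segment q_i, the entry's minimum distance from q_i is
    no smaller than the largest NN distance currently recorded in SL_i."""
    return all(md >= sl_max for md, sl_max in zip(entry_mindists, sl_max_dists))

# entry_mindists[i] = mindist(E.mbr, q_i); sl_max_dists[i] = max distance in SL_i
print(can_prune([0.9, 0.7, 0.8], [0.5, 0.6, 0.4]))  # True: useless for all segments
print(can_prune([0.9, 0.2, 0.8], [0.5, 0.6, 0.4]))  # False: may improve q2's result
```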
7. Experiments

In this section we perform an extensive experimental evaluation of the efficiency of the proposed methods, using one uniform and two real point datasets. The first real dataset, CA, contains 130K sites, while the second one, ST, contains the centroids of 2M MBRs representing street segments in California [Web]. Performance is measured by executing workloads, each consisting of 200 queries generated as follows: (i) the start point of the query is uniformly distributed in the data space, (ii) its orientation (angle with the x-axis) is randomly generated in [0, 2π), and (iii) the query length is fixed for all queries in the same workload. Experiments are conducted with a Pentium IV 1GHz CPU and 256 MBytes of memory. The disk page size is set to 4K bytes, and the maximum fanout of an R-tree node equals 200 entries.
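The workload just described is straightforward to generate; below is a sketch under our own assumptions (unit data space, no clipping of end points at the boundary, which the text does not specify):

```python
import math
import random

def make_workload(n_queries=200, q_len=0.125, seed=0):
    """Query segments as in Section 7: start point uniform in the unit
    data space, orientation uniform in [0, 2*pi), fixed length per workload."""
    rng = random.Random(seed)
    workload = []
    for _ in range(n_queries):
        sx, sy = rng.random(), rng.random()
        theta = rng.uniform(0.0, 2.0 * math.pi)
        e = (sx + q_len * math.cos(theta), sy + q_len * math.sin(theta))
        workload.append(((sx, sy), e))
    return workload

queries = make_workload(q_len=0.125)   # one workload: 200 queries, qlen = 12.5%
```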
The first set of experiments evaluates the accuracy of the analytical model. For estimations on the real datasets we apply the histogram (50×50 bins) discussed in Section 5. Figures 7.1a and 7.1b illustrate the number of node accesses (NA) as a function of the query length qlen (1% to 25% of the axis) for the uniform and CA datasets, respectively (the number of neighbors k is fixed to 5). In particular, each diagram includes: (i) the NA of a CNN implementation based on depth-first (DF) traversal, (ii) the NA of a CNN implementation based on best-first (BF) traversal, and (iii) the estimated NA obtained by equation (5-4). Figures 7.1c (for the uniform dataset) and 7.1d (for CA) contain a similar experiment, where qlen is fixed to 12.5% and k ranges between 1 and 9.

The BF implementation requires about 10% fewer NA than the DF variation of CNN, which agrees with previous results on point-NN queries [HS99].
Figure 7.2: Performance vs. query length (k=5). Panels: (a) NA vs qlen (CA dataset); (b) CPU cost vs qlen (CA dataset); (c) total cost vs qlen (CA dataset); (d) NA vs qlen (ST dataset); (e) CPU cost vs qlen (ST dataset); (f) total cost vs qlen (ST dataset). The total-cost panels annotate each column with its CPU percentage.
In all cases the estimation of the cost model is very close (less than 5% and 10% errors for the uniform and CA datasets, respectively) to the actual NA of BF, which indicates that: (i) the model is accurate, and (ii) BF CNN is nearly optimal. Therefore, in the following discussion we select the BF approach as the representative CNN method. For fairness, BF is also employed in the implementation of the TP approach.

The rest of the experiments compare the CNN and TP algorithms using the two real datasets CA and ST. Unless specifically stated, an LRU buffer with size 10% of the tree is adopted (i.e., the cache allocated to the tree of ST is larger). Figure 7.2 illustrates the performance of the algorithms (NA, CPU time, and total cost) as a function of the query length (k = 5). The first row corresponds to CA, and the second one to the ST dataset. As shown in Figures 7.2a and 7.2d, CNN accesses 1-2 orders of magnitude fewer nodes than TP. Obviously, the performance gap increases with the query length, since more TP queries are required.

The burden of the large number of queries is evident in Figures 7.2b and 7.2e, which depict the CPU overhead. The relative performance of the algorithms on both datasets indicates that similar behavior is expected independently of the input. Finally, Figures 7.2c and 7.2f show the total cost (in seconds) after charging 10ms per I/O. The number on top of each column corresponds to the percentage of CPU time in the total cost. CNN is I/O-bound in all cases, while TP is CPU-bound. Notice that the CPU percentages increase with the query length for both methods. For CNN, this happens because, as the query becomes longer, the number of split points increases, triggering more distance computations. For TP, the buffer absorbs most of the I/O cost since successive queries access similar pages; therefore, the percentage of CPU time dominates the I/O cost as the query length increases. The CPU percentage is higher in ST because of its density, i.e., the dataset contains 2M points (as opposed to 130K) in the same area as CA. Therefore, for the same query length, a larger number of neighbors will be retrieved in ST (than in CA).
Figure 7.3: Comparison with various k values (query length=12.5%). Panels: (a) NA vs. k (CA dataset); (b) CPU cost vs. k (CA dataset); (c) total cost vs. k (CA dataset); (d) NA vs. k (ST dataset); (e) CPU cost vs. k (ST dataset); (f) total cost vs. k (ST dataset). The total-cost panels annotate each column with its CPU percentage.
Next we fix the query length to 12.5% and compare the performance of both methods by varying k from 1 to 9. As shown in Figure 7.3, the CNN algorithm outperforms its competitor significantly in all cases (over an order of magnitude). The performance difference increases with the number of neighbors. This is explained as follows. For CNN, k has little effect on the NA (see Figures 7.3a and 7.3d). On the other hand, the CPU overhead grows due to the higher number of split points that must be considered during the execution of the algorithm. Furthermore, the processing of qualifying points involves a larger number of comparisons (with all NNs of points in the split list). For TP, the number of tree traversals increases with k, which affects both the CPU and the NA significantly. In addition, every query involves a larger number of computations, since each qualifying point must be compared with the k current neighbors.

Finally, we evaluate performance under different buffer sizes, by fixing qlen and k to their standard values (i.e., 12.5% and 5, respectively) and varying the cache size from 1% to 32% of the tree size. Figure 7.4 demonstrates the total query time as a function of the cache size for the CA and ST datasets. CNN receives a larger improvement than TP because its I/O cost accounts for a higher percentage of the total cost.

To summarize, CNN outperforms TP significantly under all settings (by a factor of up to 2 orders of magnitude). The improvement is due to the fact that CNN performs only a single traversal of the dataset to retrieve all split points. Furthermore, according to Figure 7.1, its number of NA is nearly optimal, meaning that CNN visits only the nodes necessary for obtaining the final result. TP is comparable to CNN only when the input line segment is very short.
8. Conclusion

Although CNN is one of the most interesting and intuitive types of nearest neighbor search, it has received rather limited attention. In this paper we study the problem extensively and propose algorithms that avoid the pitfalls of previous ones, namely, false misses and high processing cost. We also propose theoretical bounds for the performance of CNN algorithms and experimentally verify that our methods are nearly optimal in terms of node accesses. Finally, we extend the techniques to the case of k neighbors and trajectory inputs.

Given the relevance of CNN to several applications, such as GIS and mobile computing, we expect this research to trigger further work in the area. An obvious direction refers to datasets of extended objects, where the distance definitions and the pruning heuristics must be revised. Another direction concerns the application of the proposed techniques to dynamic datasets. Several indexes have been proposed for moving objects in the context of spatiotemporal databases [KGT99a, KGT99b, SJLL00]. These indexes can be combined with our techniques to process prediction-CNN queries such as "according to the current movement of the data objects, find my nearest neighbors during the next 10 minutes".

Acknowledgements

This work was supported by grants HKUST 6081/01E and HKUST 6070/00E from Hong Kong RGC.

References

[BBKK97] Berchtold, S., Bohm, C., Keim, D., Kriegel, H.P. A Cost Model for Nearest Neighbor Search in High-Dimensional Data Space. ACM PODS, 1997.
[BBK+01] Berchtold, S., Bohm, C., Keim, D., Krebs, F., Kriegel, H.P. On Optimizing Nearest Neighbor Queries in High-Dimensional Data Spaces. ICDT, 2001.
[BJKS02] Benetis, R., Jensen, C., Karciauskas, G., Saltenis, S. Nearest Neighbor and Reverse Nearest Neighbor Queries for Moving Objects. IDEAS, 2002.
[BKSS90] Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. ACM SIGMOD, 1990.
[BS99] Bespamyatnikh, S., Snoeyink, J. Queries with Segments in Voronoi Diagrams. SODA, 1999.
[CMTV00] Corral, A., Manolopoulos, Y., Theodoridis, Y., Vassilakopoulos, M. Closest Pair Queries in Spatial Databases. ACM SIGMOD, 2000.
[G84] Guttman, A. R-trees: A Dynamic Index Structure for Spatial Searching. ACM SIGMOD, 1984.
[HS98] Hjaltason, G., Samet, H. Incremental Distance Join Algorithms for Spatial Databases. ACM SIGMOD, 1998.
[HS99] Hjaltason, G., Samet, H. Distance Browsing in Spatial Databases. ACM TODS, 24(2), pp. 265-318, 1999.
[KGT99a] Kollios, G., Gunopulos, D., Tsotras, V. On Indexing Mobile Objects. ACM PODS, 1999.
[KGT99b] Kollios, G., Gunopulos, D., Tsotras, V. Nearest Neighbor Queries in a Mobile Environment. Spatio-Temporal Database Management Workshop, 1999.
[KSF+96] Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z. Fast Nearest Neighbor Search in Medical Image Databases. VLDB, 1996.
[RKV95] Roussopoulos, N., Kelly, S., Vincent, F. Nearest Neighbor Queries. ACM SIGMOD, 1995.
[SJLL00] Saltenis, S., Jensen, C., Leutenegger, S., Lopez, M. Indexing the Positions of Continuously Moving Objects. ACM SIGMOD, 2000.
[SK98] Seidl, T., Kriegel, H.P. Optimal Multi-Step k-Nearest Neighbor Search. ACM SIGMOD, 1998.
[SR01] Song, Z., Roussopoulos, N. K-Nearest Neighbor Search for Moving Query Point. SSTD, 2001.
[SRF87] Sellis, T., Roussopoulos, N., Faloutsos, C. The R+-tree: A Dynamic Index for Multi-Dimensional Objects. VLDB, 1987.
[SWCD97] Sistla, P., Wolfson, O., Chamberlain, S., Dao, S. Modeling and Querying Moving Objects. IEEE ICDE, 1997.
[TP02] Tao, Y., Papadias, D. Time Parameterized Queries in Spatio-Temporal Databases. ACM SIGMOD, 2002.
[TSS00] Theodoridis, Y., Stefanakis, E., Sellis, T. Efficient Cost Models for Spatial Queries Using R-trees. IEEE TKDE, 12(1), pp. 19-32, 2000.
[Web] http://dias.cti.gr/~ytheod/research/datasets/spatial.html
[WSB98] Weber, R., Schek, H., Blott, S. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB, 1998.
[YOTJ01] Yu, C., Ooi, B.C., Tan, K.L., Jagadish, H.V. Indexing the Distance: An Efficient Method to KNN Processing. VLDB, 2001.