Professional Documents
Culture Documents
Nearest Neighbor Queries: Instructor: Cyrus Shahabi
Nearest Neighbor Queries: Instructor: Cyrus Shahabi
Nearest Neighbor Queries: Instructor: Cyrus Shahabi
CSCI585
Compute distance d between Q and O re-execute the range query with the distance d around Q. Compute distance of Q from each retrieved object. The object at minimum distance is the nearest neighbor!!!
CSCI585
Nave Approach
A B F E 4 5 6
Query Point Q
A: B C
1 2 3
B: E F C: G H
E 4 5 6
F 1 2 3
G 10 11 12 H 7 8
C 11 G 12 10 7 8 9 H
Issues: how to guess range? The retrieval may be sub-optimal if incorrect range guessed. Would be a problem in high dimensional spaces.
CSCI585
a q
Depth First and Best-First Search using R-trees Goal: avoid visiting nodes that cannot contain results
4
CSCI585
mindist(E1,q) is a lower bound of d(o,q) for every object o in E1. If we have found a candidate NN p, we can prune every MBR whose mindist > d(p, q).
5
CSCI585
MINDIST Property
y axis
10 g 8 6 4 2 e f h
E 2 E 6
E 7
l k j
E 5
i E E 4 1 query d E a 3 b c
Main idea Starting from the root visit nodes according to their mindist in depth-first manner
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a 5 7 E 3 b c E 4 E 3 5 E 4 9 d E 5 5 e E 5 f
E 2 2 E 6 2 g h E 6
Note: distances not actually stored inside nodes. Only for illustration E 7 13 i 2 j 10
E 2 k 13 E 7 l m
CSCI585
DF Search Visit E1
y axis
10 g 8 6 4 b 2 c e f h
E 2 E 6
E 7
l k j
E 5 E 1
a i
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a 5 8 E 3 b c E 4 E 3 5 E 4 9 d E 5 5 e E 5 f
E 2 2 E 6 2 g h E 6 E 7 13 i 2 j E 2 k E 7 l m
y axis
10 g 8 6 4 b 2 c e f h
E 2 E 6
E 7
l k j
E 5 E 1
a i
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a 5 9 E 3 b c E 4 E 3 5 E 4 9 d E 5 5 e E 5 f
E 2 2 E 6 2 g h E 6 E 7 13 i 2 j
E 2 k E 7 l m
CSCI585
E 2 E 6
E 7
l k j
E 5 E 1
a i
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a 5 10 E 3 b c E 4 E 3 5 E 4 9 d E 5 5 e E 5 f
E 2 2 E 6 2 g h E 6 E 7 13 i 2 j
E 2 k E 7 l m
y axis
10 g 8 6 4 b 2 c e f h
E 2 E 6
E 7
l k j
E 5 E 1
a i
E 4
d
query
First Candidate NN: a with distance 5 x axis
E 3
10
Root E 1 1 E 1 a 5 11 E 3 b c E 4 E 3 5 E 4 9 d E 5 5 e E 5 f
E 2 2 E 6 2 g h E 6 E 7 13 i 2 j E 2 k E 7 l m
Optimality
CSCI585
y axis
10 g 8 6 4 b 2 c e f h
E 2 E 6 E 5 E 1
a i
E 7
l k j
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Question: Which is the minimal set of nodes that must be visited by any NN algorithm? Answer: The set of nodes whose MINDIST is smaller than or equal to the distance between q and its NN (e.g., E1, E2, E6).
12
CSCI585
A sorted priority queue based on MINDIST; Nodes traversed in order; Stops when there is an object at the top of the queue; (1-NN found) k-NN can be computed incrementally;
I/O optimal
13
CSCI585
Priority Queue
A:
A B F E 4 5 6
B C
1
B: E F F 4 5 6 1 2 3
C: G G 10
2 3
A B
Query Point Q
11 12 H 7 8 9
C C 5 5 5 F 6 G 8 4 6 G F 4 9 F 6 4 F
E C
C 1211 10 G 7 8 9 H
H 7
1NN
14
CSCI585
Keep a heap H of index entries and objects, ordered by MINDIST. Initially, H contains the root. While H
Extract the element with minimum MINDIST
If it is an index entry, insert its children into H. If it is an object, return it as NN.
End while
15
CSCI585
10 8 6 4 b 2 c e f
E 2 E 6
E 7
l k j
Heap
E 1 E 2 1 2
E 5 E 1
a i
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a
16
E 2 2 E 6 2 f g h E 6 E 7 13 i 2 j 10 E 2 k 13 E 7 l m
E 3 5 c E 4
E 4 9 d
E 5 5 e E 5
E 3
CSCI585
y axis
10 g 8 6 4 b 2 c e f h
BF Search Visit E1
E 2 E 6 E 5 E 1
a i j m
E 7
l k
Heap
E 1 E 2 1 2 E 2 E 5 E 5 E 9 3 5 4 2
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a
17
E 2 2 E 6 2 f g h E 6 E 7 13 i 2 j 10 E 2 k 13 E 7 l m
E 3 5 c E 4
E 4 9 d
E 5 5 e E 5
E 3
CSCI585
y axis
10 g 8 6 4 b 2 c e f h
BF Search Visit E2
E 2 E 6 E 5 E 1
a i j m
E 7
l k
Heap
E 1 E 2 1 2 E 2 E 5 E 5 E 9 3 5 4 2 E 2 E 5 E 5 E 9 E 13 3 5 4 7 6
E 4
d
query
E 3
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a
18
E 2 2 E 6 2 f g h E 6 E 7 13 i 2 j 10 E 2 k 13 E 7 l m
E 3 5 c E 4
E 4 9 d
E 5 5 e E 5
E 3
CSCI585
y axis
10 g 8 6 4 b 2 c e f h
BF Search Visit E6
E 2 E 6 E 5 E 1
a i j m
E 7
l k
Heap
E 1 E 1 2 E 2 E 3 2 E 2 E 3 6 i 2 E3 2 5 E 5 E 9 5 4 5 E 5 E 9 E 13 5 4 7 j 10 E 13 k 5 E 5 E 9 7 5 4
E 4
d
query
E 3
13
x axis
0 2 4 6 8 10
Root E 1 1 E 1 a
19
E 2 2 E 6 2 f g h E 6 E 7 13 i 2 j 10 E 2 k 13 E 7 l m
E 3 5 c E 4
E 4 9 d
E 5 5 e E 5
E 3
CSCI585
10 8 6 4 b 2 c e f
E 2 E 6
E 7
l k j
Heap
E 1 E 1 2 E 2 E 3 2 E 2 E 3 6 i 2 E3 2 5 E 5 E 9 5 4 5 E 5 E 9 E 13 5 4 7 j 10 E 13 k 5 E 5 E 9 7 5 4
E 5 E 1
a i
E 4
d
query
E 3
13
Root E 1 1 E 1 a
20
E 2 2 E 6 2 f g h E 6 E 7 13 i 2 j 10 E 2 k 13 E 7 l m
E 3 5 c E 4
E 4 9 d
E 5 5 e E 5
E 3
CSCI585
Generalizations
Both DF and BF can be easily adapted to (i) extended (instead of point) objects and (ii) retrieval of k (>1) NN. BF can be made incremental; i.e., it can report the NN in increasing order of distance without a given value of k.
Example: find the 10 closest cities to HK with population more than 1 million. We may have to retrieve many (>>10) cities around Hong Kong in order to answer the query.
21
CSCI585
Generalize to k-NN
Keep a sorted buffer of at most k current nearest neighbors Pruning is done according to the distance of the furthest nearest neighbor in this buffer Example:
R
MINDIST
CSCI585
MBR is an n-dimensional Minimal Bounding Rectangle used in R trees, which is the minimal bounding n-dimensional rectangle bounds its corresponding objects. MBR face property: Every face of any MBR contains at least one point of some object in the database.
23
CSCI585
24
CSCI585
25
CSCI585
While the MinDist based algorithm is I/O optimal, its performance may be further improved by pruning nodes from the priority queue.
26
CSCI585
Properties of MINMAXDIST
MINMAXDIST(P,R) is the minimum over all dimensions distances from P to the furthest point of the closest face of R. MINMAXDIST is the smallest possible upper bound of distances from the point P to the rectangle R. MINMAXDIST guarantees there is an object within the R at a distance to P less than or equal to minmaxdist.
CSCI585
28
CSCI585
Rectangle R
29
CSCI585
Pruning 1
If there exists another R such that MINDIST(P,R)> MINMAXDIST(P,R)
R R
MINDIST
MINMAXDIST
30
CSCI585
Pruning 2
If there exists an R such that Actual_dist(P,O) > MINMAXDIST(P,R)
Actual-dist
MINMAXDIST
31
CSCI585
Pruning 3
If an object O is found such that MINDIST(P,R) > Actual_dist(P,O)
MINDIST
P Actual_dist O
32
CSCI585
MINDIST: optimistic (the box that is closer) MINMAXDIST: pessimistic (the box that has at least one object at that distance)
CSCI585
CSCI585
NN-search Algorithm using the mentioned pruning rules (branch and bound)
1. 2.
Initialize the nearest distance as infinite distance Traverse the tree depth-first starting from the root. At each Index node, sort all MBRs using an ordering metric and put them in an Active Branch List (ABL). Apply pruning rules 1 and 2 to ABL Visit the MBRs from the ABL following the order until it is empty If Leaf node, compute actual distances, compare with the best NN so far, update if necessary. At the return from the recursion, use pruning rule 3 When the ABL is empty, the NN search returns.
3. 4. 5. 6. 7.
CSCI585
CSCI585
References
1- N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, pages 71-79, 1995 2- G. R. Hjaltason and H. Samet, Distance browsing in spatial databases, ACM Transactions on Database Systems 24, 2 (June 1999), 265-318 STDBM06 keynote slides by Dimitris Papadias Novel Forms of Nearest Neighbor Search in Spatial and Spatiotemporal Databases