Unit 3
Courtesy:
Modern Information Retrieval
by R. Baeza-Yates and B. Ribeiro-Neto
Introduction
Word-based indexing
» Inverted indices are well suited to searching for individual words
» Queries such as phrases are expensive to solve using inverted files
» For word-based applications, inverted files perform better
Suffix trees and suffix arrays
» handle more complex queries, such as phrases, efficiently
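To make the contrast concrete, here is a minimal sketch of a word-based inverted index. It stores word numbers rather than character offsets, and the names `build_inverted_index` and `phrase_search` are illustrative assumptions, not taken from the book. A single-word query is one dictionary lookup, while a phrase query has to combine several posting lists.

```python
import re
from collections import defaultdict

TEXT = "This is a text. A text has many words. Words are made from letters."

def build_inverted_index(text):
    """Map each lower-cased word to the list of word numbers where it occurs."""
    index = defaultdict(list)
    for number, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index[word].append(number)
    return index

def phrase_search(index, words):
    """Return the word numbers where the words occur consecutively."""
    starts = index.get(words[0], [])
    return [s for s in starts
            if all(s + i in index.get(w, []) for i, w in enumerate(words))]

index = build_inverted_index(TEXT)
print(index["text"])                         # single-word lookup
print(phrase_search(index, ["text", "has"]))  # phrase needs list combination
```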
2
Text Suffixes
This is a text. A text has many words. Words are made from letters.
text. A text has many words. Words are made from letters.
text has many words. Words are made from letters.
many words. Words are made from letters.
words. Words are made from letters.
Words are made from letters.
made from letters.
letters.
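The suffixes listed above start at the index points of the text. A minimal sketch of how they could be generated follows; the stopword set is an assumption chosen only so that the index points match those on the slides (11, 19, 28, 33, 40, 50, 60), and the function names are hypothetical.

```python
import re

TEXT = "This is a text. A text has many words. Words are made from letters."

# Assumption: these words are treated as stopwords and get no index point.
STOPWORDS = {"this", "is", "a", "has", "are", "from"}

def index_points(text, stopwords=STOPWORDS):
    """1-based start positions of all words that are not stopwords."""
    return [m.start() + 1
            for m in re.finditer(r"[A-Za-z]+", text)
            if m.group().lower() not in stopwords]

def suffixes(text, points):
    """Map each index point to the suffix of the text starting there."""
    return {p: text[p - 1:] for p in points}

points = index_points(TEXT)
for p, s in suffixes(TEXT, points).items():
    print(p, s)
```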
The Suffix Trie and Suffix Tree
1 11 19 28 33 40 46 50 60
This is a text. A text has many words. Words are made from letters.
[Figure: suffix trie and suffix tree over the index points 11, 19, 28, 33, 40, 50 and 60 of the text above; the root branches on the letters l, m, t and w]
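A suffix trie over these index points can be sketched as nested dictionaries. This is an illustrative construction, not the book's algorithm: it assumes the text is case-folded, takes the index points as given, and stops each branch as soon as a single suffix remains. The name `build_suffix_trie` is hypothetical.

```python
TEXT = "This is a text. A text has many words. Words are made from letters."
POINTS = [11, 19, 28, 33, 40, 50, 60]  # 1-based index points from the slide

def build_suffix_trie(text, points, depth=0):
    """Return a leaf (an index point) or a dict mapping a character to a subtrie."""
    if len(points) == 1:
        return points[0]
    groups = {}
    for p in points:
        i = p - 1 + depth
        ch = text[i] if i < len(text) else ""  # "" marks the end of the text
        groups.setdefault(ch, []).append(p)
    return {ch: build_suffix_trie(text, ps, depth + 1)
            for ch, ps in groups.items()}

trie = build_suffix_trie(TEXT.lower(), POINTS)
print(trie["m"])  # the two suffixes starting with "ma": many, made
```

A suffix tree is obtained from such a trie by collapsing every chain of single-child nodes into one edge.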
PAT Trees and PAT Arrays
Information Retrieval: Data Structures and Algorithms
by W.B. Frakes and R. Baeza-Yates (Eds.)
Englewood Cliffs, NJ: Prentice Hall, 1992.
(Chapter 5)
Semi-infinite Strings
Example
Text Once upon a time, in a far away land …
sistring 1    Once upon a time …
sistring 2    nce upon a time …
sistring 8    on a time, in a …
sistring 11   a time, in a far …
sistring 22   a far away land …
Compare sistrings (lexicographic order, ignoring case)
22 < 11 < 2 < 8 < 1
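A sistring (semi-infinite string) is the text read from a given position onward. The sketch below reproduces the ordering shown above; it assumes only the visible part of the example text (the elided remainder is left out) and a case-insensitive comparison, and the name `sistring` is illustrative.

```python
TEXT = "Once upon a time, in a far away land"  # visible part of the example only

def sistring(text, position):
    """The sistring starting at a 1-based position of the text."""
    return text[position - 1:]

positions = [1, 2, 8, 11, 22]
# Assumption: comparison ignores case, so "Once" sorts as "once".
ordered = sorted(positions, key=lambda p: sistring(TEXT, p).lower())
print(ordered)
```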
PAT Tree
PAT tree
» a Patricia tree constructed over all the possible sistrings of a text
Patricia tree
» a binary digital tree where the individual bits of the keys are used to
decide on the branching
– A zero bit will cause a branch to the left subtree
– A one bit will cause a branch to the right subtree
» each internal node indicates which bit of the query is used for
branching, stored in one of two ways
– as an absolute bit position
– as a count of the number of bits to skip
» each external node points to a sistring
– stored as the integer displacement into the original text
Example 1
Text         01100100010111 …
sistring 1   01100100010111 …
sistring 2   1100100010111 …
sistring 3   100100010111 …
sistring 4   00100010111 …
sistring 5   0100010111 …
sistring 6   100010111 …
sistring 7   00010111 …
sistring 8   0010111 …
[Figure: PAT tree built over sistrings 1 to 8 of the text; internal nodes hold bit positions, external nodes hold sistring numbers]
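The tree for this example can be sketched by recursive partitioning on bits. This is a simplification for illustration, not the incremental Patricia insertion: internal nodes are tuples `(bit, left, right)` holding absolute bit positions, leaves are sistring numbers, and bits on which all sistrings agree are skipped. It assumes the sistrings can be told apart within the visible part of the text; `build_pat_tree` is a hypothetical name.

```python
TEXT = "01100100010111"  # visible part of the example text

def build_pat_tree(text, positions, k=1):
    """Build a PAT tree over the sistrings at the given 1-based positions.

    k is the absolute (1-based) bit position examined next. Assumption: the
    sistrings differ before the visible text ends, otherwise IndexError.
    """
    if len(positions) == 1:
        return positions[0]
    zeros = [p for p in positions if text[p + k - 2] == "0"]
    ones = [p for p in positions if text[p + k - 2] == "1"]
    if not zeros or not ones:
        # All sistrings agree on bit k, so no node is created for it.
        return build_pat_tree(text, positions, k + 1)
    return (k,
            build_pat_tree(text, zeros, k + 1),
            build_pat_tree(text, ones, k + 1))

tree = build_pat_tree(TEXT, list(range(1, 9)))
print(tree)
```

Reading the leaves of the result from left to right gives the sistrings in lexicographic order.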
Prefix searching
Idea
» every subtree of the PAT tree contains all the sistrings with a given
prefix
Search
» time proportional to the query length
» descend until the prefix is exhausted or an external node is reached
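The descent described above can be sketched as follows, under the same assumptions as the binary example text (nodes as `(bit, left, right)` tuples with absolute bit positions, built here by a simplified recursive partition rather than true Patricia insertion). Because skipped bits are never inspected during the descent, the sketch finishes with one comparison against the text. All function names are hypothetical.

```python
TEXT = "01100100010111"  # visible part of the example text

def build_pat_tree(text, positions, k=1):
    """Simplified PAT tree: (bit, left, right) tuples, leaves are positions."""
    if len(positions) == 1:
        return positions[0]
    zeros = [p for p in positions if text[p + k - 2] == "0"]
    ones = [p for p in positions if text[p + k - 2] == "1"]
    if not zeros or not ones:
        return build_pat_tree(text, positions, k + 1)
    return (k, build_pat_tree(text, zeros, k + 1),
            build_pat_tree(text, ones, k + 1))

def leaves(node):
    """All sistring numbers below a node, in left-to-right order."""
    if isinstance(node, int):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def prefix_search(text, tree, query):
    """Sistrings (1-based positions) that start with the bit string query."""
    node = tree
    # Descend while the node tests a bit that the query still provides.
    while not isinstance(node, int) and node[0] <= len(query):
        k, left, right = node
        node = left if query[k - 1] == "0" else right
    candidates = leaves(node)
    # One final check: all candidates share the same leading bits,
    # so verifying a single one is enough.
    p = candidates[0]
    return candidates if text[p - 1:p - 1 + len(query)] == query else []

tree = build_pat_tree(TEXT, list(range(1, 9)))
print(prefix_search(TEXT, tree, "100"))
```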
Proximity Searching
Range Searching
Longest Repetition Searching
» the match between two different positions of a text where this match is
the longest in the entire text, e.g., in 01100100010111
» the tallest internal node gives a pair of sistrings that match for the
greatest number of characters
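Text         01100100010111
sistring 1   01100100010111
sistring 2   1100100010111
sistring 3   100100010111
sistring 4   00100010111
sistring 5   0100010111
sistring 6   100010111
sistring 7   00010111
sistring 8   0010111
[Figure: PAT tree over sistrings 1 to 8. Internal nodes (bit positions) by level: 1; 2, 2; 3, 3, 4; 5. The tallest internal node (bit 5) has sistrings 4 and 8 as its children.]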
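As a sketch of this search, the code below looks for the internal node with the greatest absolute bit position, taking that as the meaning of "tallest" once skipped bits are counted. The two sistrings below that node agree on every earlier bit. The tree is built by a simplified recursive partition (an assumption for illustration, not Patricia insertion), and the function names are hypothetical.

```python
TEXT = "01100100010111"

def build_pat_tree(text, positions, k=1):
    """Simplified PAT tree: (bit, left, right) tuples, leaves are positions."""
    if len(positions) == 1:
        return positions[0]
    zeros = [p for p in positions if text[p + k - 2] == "0"]
    ones = [p for p in positions if text[p + k - 2] == "1"]
    if not zeros or not ones:
        return build_pat_tree(text, positions, k + 1)
    return (k, build_pat_tree(text, zeros, k + 1),
            build_pat_tree(text, ones, k + 1))

def tallest(node, best=(0, None)):
    """Internal node with the greatest bit position, as (bit, node)."""
    if isinstance(node, int):
        return best
    k, left, right = node
    if k > best[0]:
        best = (k, node)
    return tallest(right, tallest(left, best))

def longest_repetition(text, positions):
    """Return (match length in bits, the two matching sistring positions)."""
    k, node = tallest(build_pat_tree(text, positions))
    # Children of the node with the greatest bit position are both leaves.
    _, left, right = node
    return k - 1, sorted([left, right])

length, pair = longest_repetition(TEXT, list(range(1, 9)))
print(length, pair, TEXT[pair[0] - 1:pair[0] - 1 + length])
```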
“Most Significant” or “Most Frequent” Matching
Building PAT Trees as Patricia Trees (1)
Building PAT Trees as Patricia Trees (2)
PAT Trees Represented as Arrays
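Text   0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
[Figure: the PAT tree over sistrings 1 to 8 of the text, with its external nodes read from left to right: 7, 4, 8, 5, 1, 6, 3, 2]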
Searching PAT Trees as Arrays
PAT array   7 4 8 5 1 6 3 2
Text        0 1 1 0 0 1 0 0 0 1 0 1 1 1 ...
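The PAT array is the list of sistring positions in lexicographic order, so a prefix query becomes two binary searches. A minimal sketch, assuming the visible part of the text and plain string comparison; `pat_array` and `prefix_range` are hypothetical names.

```python
TEXT = "01100100010111"  # visible part of the example text

def pat_array(text, n):
    """Positions 1..n sorted by the sistring that starts at each of them."""
    return sorted(range(1, n + 1), key=lambda p: text[p - 1:])

def prefix_range(text, array, query):
    """All entries of the PAT array whose sistring starts with query."""
    m = len(query)

    def head(p):
        return text[p - 1:p - 1 + m]  # first m symbols of the sistring

    lo, hi = 0, len(array)
    while lo < hi:                    # first entry with head >= query
        mid = (lo + hi) // 2
        if head(array[mid]) < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(array)
    while lo < hi:                    # first entry with head > query
        mid = (lo + hi) // 2
        if head(array[mid]) <= query:
            lo = mid + 1
        else:
            hi = mid
    return array[start:lo]

array = pat_array(TEXT, 8)
print(array)
print(prefix_range(TEXT, array, "100"))
```

Each binary search costs a logarithmic number of comparisons against the text, and the answer is a contiguous slice of the array.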
Comparisons
Signature files
» Use hashing techniques to produce an index
» Advantage
– storage overhead is small (10%-20%)
» Disadvantages
– the search time on the index is linear
– some retrieved answers may not actually match the query (false drops),
so the results must be filtered
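The need for filtering can be seen in a small sketch of superimposed coding, one common way to build signatures. Everything here is an assumption for illustration: the signature width, the number of bits set per word, and the use of SHA-256 as the hash are arbitrary choices, and the function names are hypothetical.

```python
import hashlib

def word_signature(word, bits=64, k=3):
    """Set up to k pseudo-random bits of a bits-wide signature for one word."""
    sig = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{word.lower()}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % bits)
    return sig

def block_signature(words, bits=64, k=3):
    """OR together the signatures of all words in a text block."""
    sig = 0
    for w in words:
        sig |= word_signature(w, bits, k)
    return sig

def maybe_contains(block_sig, word, bits=64, k=3):
    """True if the word may be in the block; a True answer can be a false drop."""
    ws = word_signature(word, bits, k)
    return (block_sig & ws) == ws

words = ["text", "many", "words", "made", "letters"]
sig = block_signature(words)
print(all(maybe_contains(sig, w) for w in words))
```

A word that is in the block always passes the test. A word that is not in the block can also pass when its bits happen to be covered by other words, which is why candidate blocks must be checked against the text.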
Comparisons (Continued)
Inverted files
» storage overhead is 30%-100%
» search time for word searches is logarithmic
PAT arrays
» potential use in other kinds of searches
– phrases
– regular expression searching
– approximate string searching
– longest repetitions
– most frequent searching