Current Challenges in Textual Databases: Gonzalo Navarro
Gonzalo Navarro
Department of Computer Science
University of Chile
Road Map
• Computational biology
– Genome projects have produced gigabytes of nucleotide and protein data.
– Biologists search that data for known or new genes and proteins, and for homologous regions, so as to predict chemical structure, biochemical function, evolutionary history, etc.
– Global and local similarity are used, with different concepts of approximate search, including approximate regular expression searching.
• Information retrieval
– IR systems usually have a text search engine at their kernel.
– Usually it is specific to natural language.
– Approximate searching is useful to cope with spelling, typing and OCR errors.
– Handling from medium to huge texts: jurisprudence databases, linguistic corpora, corporate data, news clipping... and of course the Web.
• Text retrieval in Oriental languages
– Chinese, Korean, Japanese and other Oriental languages have very large alphabets, mixing phonetic and ideographic symbols.
– Word separations must be inferred from the meaning and are difficult to find automatically.
– They have to treat their texts simply as strings of symbols.
– Typical IR systems do not apply to those languages.
• Multimedia databases
– Music databases, in MIDI format for example, with their own concepts of similarity, e.g. transposition invariance.
– Audio databases, where we wish to find patterns independently of volume, tone, etc.
– Video databases, where for example object tracking information is a string of movement directions.
Challenges for Modern Text Databases
• Be small, in order to store large text collections and the extra data structures built on them at a reasonable space cost.
• Be fast, in order to provide access to a large mass of text in reasonable time, possibly to many concurrent users.
• Be flexible, in order to search for complex patterns.
• For fast access we need an index, that is, a data structure built on the text.
• But an index may take much space, and this plays against the goal of a small database.
• For large databases, that index has to be oriented to secondary memory.
• The index should be dynamic, that is, easy to update upon changes in the text collection.
The Simplest Solution: Suffix Tries
[Figure: the suffix trie of the example text "alabar a la alabarda$", with leaves labeled by the starting positions of the suffixes]
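As a concrete illustration (not from the slides), a suffix trie can be built by inserting every suffix of the text character by character; every substring of the text is then a path from the root. The quadratic worst-case space is exactly what the compact variants on the following slides address. A minimal sketch:

```python
# Educational sketch: a suffix trie as nested dictionaries.
# Worst-case space is O(n^2), which motivates suffix trees/DAWGs.

def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):          # insert every suffix text[i:]
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("alabar a la alabarda$")
assert contains(trie, "bar")
assert not contains(trie, "barb")
```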
Minimizing Trie Automata: DAWGs
[Figure: the DAWG (directed acyclic word graph) obtained by minimizing the suffix trie automaton]
Best from Suffix Trees and DAWGs: Compact DAWGs
[Figure: the compact DAWG (CDAWG) of the example text, with string-labeled edges such as "bar" and "da"]
Sampling the Data: Sparse Suffix Trees, Cactuses, and q-Grams
Kärkkäinen, Sutinen, Ukkonen
[Figure: a sparse suffix tree of the example text, indexing only sampled suffixes, and a table of sampled q-grams/substrings (e.g. "ar_" → 4, "ard" → 16, "labar_" → 1, "labard" → 13) mapping each sample to its text positions]
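The q-gram idea can be sketched in a few lines (an assumed minimal form, with q = 2): map every q-gram of the text to the sorted list of positions where it starts, so that searching a pattern reduces to looking up and verifying the position lists of its q-grams.

```python
# Sketch of a q-gram index: q-gram -> sorted list of 0-based start positions.
from collections import defaultdict

def qgram_index(text, q=2):
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)   # positions are generated in order
    return index

idx = qgram_index("alabar a la alabarda$", q=2)
assert idx["ba"] == [3, 15]   # "ba" starts at 0-based positions 3 and 15
```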
Just the Tree Leaves: Suffix Arrays
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18
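An educational sketch (not the construction the slides assume): build the suffix array by plainly sorting the suffixes, then binary-search the contiguous SA interval of a pattern. Note that under ASCII ordering the blank sorts before "$", so the resulting order can differ from the slides' alphabet; real indexes also build the array without materializing suffixes.

```python
# Suffix array by sorting (O(n^2 log n), for illustration only),
# plus binary search for the SA interval of a pattern.

def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_interval(text, sa, pattern):
    """Return [lo, hi) such that sa[lo:hi] lists the suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                               # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                               # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

text = "alabar a la alabarda$"
sa = suffix_array(text)
lo, hi = sa_interval(text, sa, "bar")
assert sorted(sa[lo:hi]) == [3, 15]   # 0-based occurrences of "bar"
```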
Minimizing the Impact of the Tree: Compact PAT Trees
• Separate a suffix tree into the leaves (suffix array) plus the structure.
• The tree structure itself can be represented as a sequence of parentheses.
• This sequence can be traversed with the usual tree operations.
• Basic tools are: ranking of bits and finding closing parentheses.
• The result is functionally similar to a suffix tree.
• Experimental results report 5n to 6n bytes.
• So it is still too large.
• Worse than that, it is built from the plain suffix tree.
• Search in secondary storage is reasonable (2–4 disk accesses), by storing subtrees in single disk pages.
• Dynamism is painful.
• The rank mechanism is very valuable by itself.
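The rank tool mentioned above can be sketched as follows (toy parameters, not the constant-time structure itself, which also uses in-block lookup tables): precompute the number of 1s before each block boundary, so that rank1(i) is a block count plus a short in-block scan.

```python
# Sketch of blocked rank over a bit string: rank1(i) = number of 1s in bits[0:i].

BLOCK = 8  # toy block size; real structures use ~log^2 n plus lookup tables

class RankBits:
    def __init__(self, bits):
        self.bits = bits
        self.blocks = [0]                        # 1s before each block start
        for j in range(0, len(bits), BLOCK):
            self.blocks.append(self.blocks[-1] + bits[j:j + BLOCK].count("1"))

    def rank1(self, i):
        # block prefix count + scan of the partial block
        b = i // BLOCK
        return self.blocks[b] + self.bits[b * BLOCK:i].count("1")

bits = "1101110100100110110100110100110100110100"
rb = RankBits(bits)
assert rb.rank1(5) == bits[:5].count("1")
```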
[Figure: the suffix tree of the example text, its structure encoded as the balanced parentheses sequence below, and its leaves, read left to right, forming the suffix array]
(()((()())())(()(()())(()())(()())(()()))(()())()(()(()()))(()()))
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14
Smaller than Suffix Arrays: Compact Suffix Arrays
Suffix array (positions 0–18):
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14
After one round of replacing repeated areas by links (positions 0–17):
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 (9,4,0)
After further rounds (positions 0–12):
21 7 (3,2,1) (10,2,0) 8 (12,2,0) 1 13 (9,2,0) (5,2,0) 19 10 (6,4,0)
Towards Self-Indexing
Grossi & Vitter
[Figure: the hierarchical compressed suffix array of "alabar a la alabarda$", with D = 110010000000010110010 and C = $_abdlr]
Level 0:  B_0 = 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1 1 1
          Ψ_0 = 9 6 10 16 0 2 3 13 14 17 18 19 20 11 12 4 5 7 8
Level 1:  SA_1 = 6 10 4 2 8 5 1 7 3 9
          B_1 = 1 1 1 1 1 0 0 0 0 0
          Ψ_1 = 7 6 5 8 9 0 3 4 2 1
Level 2:  SA_2 = 3 5 2 1 4
          B_2 = 0 0 1 0 1
          Ψ_2 = 4 3 0 2 1
Level 3:  SA_3 = 1 2
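The Ψ function at the heart of these structures can be sketched from a plain suffix array: Ψ[i] is the SA position of the suffix one character shorter than the one at SA position i. Ψ is increasing within each first-character range, which is what makes it compressible. The function names here are illustrative, not the paper's:

```python
# Sketch: compute Psi from a suffix array, then use it to walk the text.

def psi_from_sa(sa):
    n = len(sa)
    inv = [0] * n
    for pos, suf in enumerate(sa):
        inv[suf] = pos                          # inverse suffix array
    # the suffix starting at sa[i] is followed by the one starting at sa[i]+1
    return [inv[(sa[i] + 1) % n] for i in range(n)], inv

text = "alabar a la alabarda$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
psi, inv = psi_from_sa(sa)

# Psi lets a self-index spell the text without storing it explicitly:
# repeated Psi steps from the SA position of the whole text visit the
# suffixes in text order (a real index stores only first characters).
p, out = inv[0], []
for _ in range(len(text)):
    out.append(text[sa[p]])
    p = psi[p]
assert "".join(out) == text
```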
Compressed Suffix Arrays without Text
Ferragina & Manzini
[Figure: the 21 cyclic rotations of "alabar a la alabarda$" (left) and the same rotations in sorted order (right); arrows pair occurrences such as the 1st "a", the 1st "d", the 2nd "r" and the 9th "a" between the two lists. The last characters of the sorted rotations form the permuted (BWT) text.]
• The index also has a cumulative letter frequency array C.
• As well as Occ[c, i] = number of occurrences of c before position i in the permuted text.
• If we start at a position i in the permuted text, with character c, the previous text character is at position C[c] + Occ[c, i].
• The search for pattern p1 . . . pm is done backwards, in optimal O(m) time.
• First we take the interval for pm: [l, r) = [C[pm], C[pm + 1]).
• The interval for pm−1 pm is [l, r) = [C[pm−1] + Occ[pm−1, l], C[pm−1] + Occ[pm−1, r]).
• C is small and Occ is cumulative, so it is easy to store with blocks and superblocks.
• The in-block computation of Occ is done by scanning the permuted text.
• We can show text contexts by walking the permuted text as explained.
• A problem is how to know which text position we are at!
• Some suffix array pointers are stored; we walk the text until we find one.
• Overall space is 5 Hk n bits, text included, for any k.
• In practice, 0.3 to 0.8 times the text size, and this includes the text.
• Counting the number of occurrences is amazingly fast.
• Reporting their positions and text contexts is very slow, however.
• Search complexity is O(m + R log^ε n).
• Construction needs to start from the suffix array.
• Some (theoretical) provisions for dynamism exist, but search time grows by logarithmic factors.
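The backward search described in the bullets above can be sketched directly over the permuted text (counting only). Here C and Occ are computed naively for clarity; a real index stores Occ in the blocked, cumulative form the slides describe. Function names are illustrative:

```python
# Sketch of FM-index-style backward search over the BWT of a text.

def bwt_via_sa(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)   # i = 0 wraps to the last char

def backward_count(bwt, pattern):
    # C[c] = number of characters smaller than c in the whole BWT
    C = {c: sum(1 for x in bwt if x < c) for c in set(bwt)}
    occ = lambda c, i: bwt[:i].count(c)       # Occ(c, i): c's in bwt[0:i]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):               # extend the pattern backwards
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo                            # size of the suffix array interval

bwt = bwt_via_sa("alabar a la alabarda$")
assert backward_count(bwt, "bar") == 2
assert backward_count(bwt, "a") == 9          # the "9th a" of the matrix figure
```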
Example for the text "alabar a la alabarda$" ("_" denotes the blank):
Text:         alabar a la alabarda$
Suffix array: 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18
BWT:          araadl_ll$_bbaar_aaaa
MTF:          2,6,1,0,5,6,5,1,0,5,2,6,0,5,0,6,3,2,0,0,0
C[$]=0, C[_]=1, C[a]=4, C[b]=13, C[d]=15, C[l]=16, C[r]=19
Occ[$] = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1
Occ[_] = 0,0,0,0,0,0,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3
Occ[a] = 1,1,2,3,3,3,3,3,3,3,3,3,3,4,5,5,5,6,7,8,9
Occ[b] = 0,0,0,0,0,0,0,0,0,0,0,1,2,2,2,2,2,2,2,2,2
Occ[d] = 0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Occ[l] = 0,0,0,0,0,1,1,2,3,3,3,3,3,3,3,3,3,3,3,3,3
Occ[r] = 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2
[Figure: the LF-mapping, pairing each BWT character with its occurrence in the first column of the sorted rotation matrix]
Using just LZ78 Parsing: LZ-Index
[Figure: the LZ78 trie of the phrases of "alabar a la alabarda$" and the corresponding reverse-phrase trie, with nodes labeled by phrase numbers]
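The LZ78 parsing underlying the index can be sketched in a few lines: each phrase is the longest previously seen phrase plus one extra character, so the phrases naturally form a trie (the structure the LZ-index exploits).

```python
# Sketch of LZ78 parsing: phrases as (parent_phrase_number, extending_char).

def lz78_parse(text):
    trie = {}            # phrase trie: (node_id, char) -> node_id
    phrases = []         # phrase k (1-based) is phrases[k-1]
    node = 0             # 0 is the empty root phrase
    for ch in text:
        if (node, ch) in trie:
            node = trie[(node, ch)]          # keep extending current phrase
        else:
            phrases.append((node, ch))       # new phrase = node's phrase + ch
            trie[(node, ch)] = len(phrases)
            node = 0
    if node:                                 # possible partial phrase at the end
        phrases.append((node, ""))
    return phrases

def lz78_decode(phrases):
    decoded, out = [""], []
    for parent, ch in phrases:
        s = decoded[parent] + ch
        decoded.append(s)
        out.append(s)
    return "".join(out)

parsed = lz78_parse("alabar a la alabarda$")
assert lz78_decode(parsed) == "alabar a la alabarda$"
```

The number of phrases is O(n / log_σ n), which is what makes an index over the phrase trie small.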
Secondary Memory and Dynamic Data Structures
• Most of the methods we have seen are not suited to secondary memory.
• This refers both to construction and searching.
• Most of them are difficult to modify if the text changes.
• Those that can be built online can easily incorporate more text.
• Recently, efficient construction of suffix trees in secondary memory has been achieved, both in theory and in practice.
• Suffix arrays can perform decently on secondary memory, but modifying them means rewriting them completely.
• Compact PAT Trees put contiguous subtrees in disk pages, obtaining reasonable performance (2–4 disk accesses, O(k/√m + log_m n)).
• But modifying them is painful.
• Another interesting approach for managing insertions is to have buckets of exponentially increasing sizes, one index per bucket.
• Insertion resembles incrementing a binary number and has good amortized performance.
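The bucket scheme in the last two bullets can be sketched as follows (class and method names are illustrative): each bucket holds a power-of-two number of items, and an insertion merges buckets exactly the way a carry propagates when incrementing a binary number, so each item is rebuilt into a larger bucket at most log n times.

```python
# Sketch of exponentially growing buckets with binary-counter merging.

class BucketedIndex:
    def __init__(self):
        self.buckets = []                     # buckets[k] holds 2^k items or None

    def insert(self, item):
        carry = [item]
        k = 0
        while True:
            if k == len(self.buckets):
                self.buckets.append(None)
            if self.buckets[k] is None:
                self.buckets[k] = carry       # settle, like writing bit 1
                return
            carry = self.buckets[k] + carry   # "merge" = rebuild one sub-index
            self.buckets[k] = None            # the carry propagates, bit -> 0
            k += 1

    def search(self, pred):
        # a query consults every non-empty bucket: O(log n) sub-indexes
        return [x for b in self.buckets if b for x in b if pred(x)]

idx = BucketedIndex()
for word in ["alabar", "a", "la", "alabarda"]:
    idx.insert(word)
assert sorted(idx.search(lambda w: w.startswith("ala"))) == ["alabar", "alabarda"]
```

Here a "bucket" stands in for a whole sub-index; merging two of them models rebuilding one index over their combined texts, which is where the amortized O(log n) rebuild cost per insertion comes from.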
Dynamic in Secondary Memory: String B-Tree
Ferragina & Grossi
[Figure: closing map relating the structures discussed: suffix trees, DAWGs, q-grams, succinct representations, the LZ-index, dynamism, and secondary memory]