
Current Challenges in Textual Databases

Gonzalo Navarro
Department of Computer Science
University of Chile
Road Map

• Text retrieval versus information retrieval


• Applications of text databases
• Challenges for modern text databases
• Classical solutions
– Suffix tries
– Suffix trees
– Directed acyclic word graphs (DAWGs)
• Towards succinct data structures
– Compact DAWGs
– Sparse suffix trees, cactuses and q-grams
– Ziv-Lempel based indexes
– Suffix arrays
– Compact PAT trees
– Compact Suffix Arrays
• Self-indexing data structures
– Compressed Suffix Arrays
– FM-Index
– LZ-Index
• Secondary memory and dynamic data structures
• Open Challenges
Text Retrieval versus Information Retrieval

• In both cases we handle a text collection.


• In Information Retrieval
– The database is seen as a set of documents.
– The user has a (vague) information need.
– The user expresses that information using words and phrases.
– The system tries to guess what the user really wants.
– There are no correct or incorrect answers, just answers that are more or less relevant to the query.
– An underlying text retrieval engine is used to compute relevance.
– It works only for “natural language” text, which excludes several human languages.
– We want speed, but especially good precision and recall.
• In Text Retrieval
– The database is seen as a set of strings.
– The user knows exactly what kind of strings he/she wants.
– The user gives patterns to search for in the text.
– The system retrieves exactly the occurrences of those patterns.
– A postprocessing stage may further filter these results, but this is not the task of the text retrieval engine.
– It works for any kind of text.
– We want basically speed, usually subject to correctness.
Applications of Text Databases

• Computational biology
– Genome projects have produced gigabytes of nucleotide and protein data.
– Biologists search that data for known or new genes and proteins, and for homologous regions, so as to predict chemical structure, biochemical function, evolutionary history, etc.
– Global and local similarity, with different concepts of approximate searching, including approximate regular expression searching.
• Information retrieval
– IR systems usually have a text search engine at their kernel.
– Usually it is specific to natural language.
– Approximate searching is useful to cope with spelling, typing and OCR errors.
– Handling from medium to huge texts: jurisprudence databases, linguistic corpora, corporate data, news clipping... and of course the Web.
• Text retrieval in Oriental languages
– Chinese, Korean, (one) Japanese and other Oriental languages have very large alphabets, mixing phonetic and ideographic symbols.
– Word separations must be inferred from the meaning and are difficult to determine automatically.
– They have to treat their texts simply as strings of symbols.
– Typical IR systems do not apply to those languages.
• Multimedia databases
– Music databases, in MIDI format for example, with their own concept of similarity, e.g. transposition invariance.
– Audio databases, where we wish to find patterns independently of volume, etc.
– Video databases, where for example object tracking information is a string of directions.
Challenges for Modern Text Databases

• Be small, in order to store large text collections and the extra data structures that index them at a reasonable space cost.
• Be fast, in order to provide access to a large mass of text in reasonable time, possibly for many concurrent users.
• Be flexible, in order to search for complex patterns.
• For fast access we need an index, that is, a data structure built on the text.
• But an index may take much space, and this plays against the goal of a small database.
• For large databases, that index has to be oriented to secondary memory.
• The index should be dynamic, that is, easy to update upon changes in the text collection.
The Simplest Solution: Suffix Tries

• A trie that stores all the suffixes of the texts.


• Each leaf represents a unique substring (and its extensions up to a full suffix).
• Each internal node represents a repeating substring.
• Every text substring is the prefix of some suffix.
• So any substring of length m can be found in O(m) time.
• Many other search problems can be solved with suffix tries:
– Longest text repetitions.
– Approximate searching in O(n^λ) time (n = text size).
– Regular expression searching in O(n^λ) time.
– ... and many others.
• Let n be the text size; then the trie
– on average takes O(n) space and construction time...
– but in the worst case it takes O(n^2) space and construction time.
[Figure: suffix trie of "alabar a la alabarda$", leaves labeled with the starting positions of the corresponding suffixes.]
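As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the naive suffix trie and its O(m) substring search; dictionaries play the role of trie nodes:

def build_suffix_trie(text):
    # Insert every suffix; O(n^2) nodes and time in the worst case.
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def contains(trie, pattern):
    # Every substring is a prefix of some suffix, so an O(m) walk suffices.
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True

trie = build_suffix_trie("alabar a la alabarda$")
assert contains(trie, "bar") and not contains(trie, "barb")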
Guaranteeing Linear Space: Suffix Trees

Weiner, McCreight, Ukkonen

• Works essentially as a suffix trie.


• Unary paths are compressed.
• It has O(n) size and construction time, in the worst case.
• Construction can be done online.
• After the O(m) search, the R occurrences can be collected in O(R) time.
• So it’s simply perfect!
• Where is the trick?
– The suffix tree needs at least 20 times the text size.
– It is hard to build and search in secondary memory.
– It is difficult to update upon changes in the text.
• So, it is problematic with respect to our three goals!
• Very compact implementations achieve 10n bytes.
• Other variants: Patricia trees, level-compressed tries... not better in practice.
[Figure: suffix tree of "alabar a la alabarda$": the suffix trie with unary paths compressed into string-labeled edges such as "la", "bar", "laba".]
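The path-compression idea can be sketched in Python as follows. This uses the naive quadratic trie construction for clarity; Weiner/McCreight/Ukkonen achieve O(n) with suffix links and (start, end) edge pointers instead of explicit strings, which is what bounds the size by O(n):

def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root

def compress(node):
    # Collapse unary chains into string-labeled edges (suffix trie -> suffix tree).
    edges = {}
    for c, child in node.items():
        label, cur = c, child
        while len(cur) == 1:          # unary path: extend the edge label
            (c2, cur), = cur.items()
            label += c2
        edges[label] = compress(cur)
    return edges

tree = compress(build_suffix_trie("alabar a la alabarda$"))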
Minimizing Trie Automata: DAWGs

Blumer et al., Crochemore

• The smallest deterministic automaton recognizing every text suffix.


• It would be the result of minimizing the suffix trie.
• At most 2n nodes and 3n edges.
• It can be built in O(n) time with an online algorithm.
• Searching can also be done in O(m + R) time.
• An interesting alternative to suffix trees.
• In practice, it is too large (30n), static and bad for secondary storage.
[Figure: DAWG of "alabar a la alabarda$", obtained by merging the equivalent subtrees of the suffix trie.]
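The online construction can be sketched with the standard suffix automaton algorithm, in the style of Blumer et al. (a Python illustration, not the paper's exact pseudocode):

def build_dawg(text):
    # Each state: transition map, suffix link, and length of its longest string.
    trans, link, length = [{}], [-1], [0]
    last = 0
    for c in text:
        cur = len(trans)
        trans.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and c not in trans[p]:
            trans[p][c] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][c]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Clone q so that the automaton stays minimal.
                clone = len(trans)
                trans.append(dict(trans[q])); link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and trans[p].get(c) == q:
                    trans[p][c] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans

def is_substring(trans, pattern):
    # Any substring is spelled by a path from the initial state.
    s = 0
    for c in pattern:
        if c not in trans[s]:
            return False
        s = trans[s][c]
    return True

dawg = build_dawg("alabar a la alabarda$")
assert is_substring(dawg, "bard") and not is_substring(dawg, "rab")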
Best from Suffix Trees and DAWGs: Compact DAWGs

Blumer et al., Crochemore & Vérin, Takeda & Shinohara

• Obtained by compressing unary paths in the DAWG.


• Also obtained by minimizing the suffix tree.
• It is smaller than the three classical structures.
• It can still be built online in O(n) time.
• It can still search in O(m + R) time.
• It can still implement all the flexible searches.
• But it is still too large (15n), static and bad on disk.
[Figure: CDAWG of "alabar a la alabarda$": the DAWG with unary paths compressed into string-labeled edges such as "alabarda", "la_alabarda", "bar".]
Sampling the Data: Sparse Suffix Trees, Cactuses, and q-Grams

Kärkkäinen, Ukkonen, Sutinen, Tarhio

• Sparse suffix trees


– Only one out of k suffixes is indexed.
– Hence the suffix tree has fewer nodes and is smaller.
– But searches are more expensive.
– In practice, reasonable search time requires about 4n space.
• Suffix cactuses
– Carry path compression one step further.
– They are a cross between a suffix tree and a suffix array.
– Not very promising either (10n space).
• Q-gram indexes
– Index substrings of length q instead of full suffixes.
– Significantly less space, for example 2n extra space.
– But search times are far less attractive if q is not large enough.
– And the index grows exponentially with q, so 4n at least.

[Example: 3-gram index of "alabar a la alabarda$" (underscore = space):
ala 1,13   lab 2,14   aba 3,15   bar 4,16   ar_ 5   r_a 6   _a_ 7
a_l 8   _la 9   la_ 10   a_a 11   _al 12   ard 17   rda 18]
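A minimal q-gram index in Python (an illustration of the idea): candidate positions come from the pattern's first q-gram and are then verified against the text.

from collections import defaultdict

def build_qgram_index(text, q):
    index = defaultdict(list)              # q-gram -> list of text positions
    for i in range(len(text) - q + 1):
        index[text[i:i+q]].append(i)
    return index

def search(text, index, q, pattern):
    # Patterns shorter than q cannot be handled this way (the known weakness).
    assert len(pattern) >= q
    candidates = index.get(pattern[:q], [])
    return [i for i in candidates if text.startswith(pattern, i)]

text = "alabar a la alabarda$"
index = build_qgram_index(text, 3)
assert search(text, index, 3, "alab") == [0, 12]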
Using the Ziv-Lempel Parsing: Ziv-Lempel Indexes

Kärkkäinen, Sutinen, Ukkonen

• The text is divided according to a Ziv-Lempel parsing.


• A sparse suffix tree indexes initial positions of blocks.
• Another indexes the reverse prefixes ending at final block positions.
• Occurrences can either be contained inside a block or not.
• Those that are not have a prefix finishing a block and a suffix starting a block.
• Once each prefix-suffix pair is found, a range search data structure permits intersecting the lexicographical ranges.
• Those contained in another block are found by a structure tracking the repetitions.
• Not implemented, but we estimate 3.5n to 5.5n extra space and good search times.
• Although in theory search time is, for example, O(m^2 + m log n + R log^ε n).
• A version able to search for q-grams only (q < log n) obtains optimal O(m + R) search time and 4n to 6n extra space.
• No good algorithm for longer patterns can be obtained with such a q.
[Figure: Ziv-Lempel parsing of the example text, a.l.ab.ar. .a .la. a.lab.ard.a$, with the sparse suffix tree of block beginnings and the trie of reversed block prefixes.]
Just the Tree Leaves: Suffix Arrays

Manber & Myers, Gonnet et al., Kurtz et al., Kärkkäinen, Baeza-Yates

• An array of all the text positions in lexicographical suffix order.


• Much simpler to implement and smaller than the suffix tree.
• Simple searches result in a range of positions.
• It can simulate all the searches with an O(log n) penalty factor.
• It can be built in O(n) time, but construction is not online.
• In practice it takes about 4 times the text size.
• Linear-time construction needs 8n bytes; otherwise construction takes O(n log n) time.
• Paying 6n extra space, we can get rid of the O(log n) penalty.
• Builds in secondary memory in O(n^2 log(M)/M) time (M = RAM size).
• Searching in secondary memory is painful.
• But it can be improved a lot with sampling supraindexes.
• Still dynamism is a problem.
[Example: suffix array of "alabar a la alabarda$":
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18
A search for "bar" yields a contiguous range of the array.]
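A Python sketch of the suffix array with its binary search; the build here is the naive O(n^2 log n) one, not the linear-time construction mentioned above:

def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def sa_search(text, sa, p):
    # Two binary searches delimit the range of suffixes starting with p: O(m log n).
    lo, hi = 0, len(sa)
    while lo < hi:                         # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                         # first suffix with prefix > p
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sa[left:lo]                     # occurrence positions (0-based)

text = "alabar a la alabarda$"
assert sorted(sa_search(text, suffix_array(text), "bar")) == [3, 15]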
Minimizing the Impact of the Tree: Compact PAT Trees

Jacobson, Munro, Raman, Clark

• Separate a suffix tree into the leaves (suffix array) plus the structure.
• The tree structure itself can be represented as a sequence of parentheses.
• This sequence can be traversed with the usual tree operations.
• Basic tools are: ranking of bits and finding closing parentheses.
• The result is functionally similar to a suffix tree.
• Experimental results report 5n to 6n bytes.
• So it is still too large.
• Worse than that, it is built from the plain suffix tree.
• Search in secondary storage is reasonable (2–4 disk accesses), by storing subtrees in single disk pages.
• Dynamism is painful.
• The rank mechanism is very valuable by itself.
[Figure: suffix tree of the example with its leaves listed as a suffix array; the tree structure is encoded as the balanced parentheses sequence
(()((()())())(()(()())(()())(()())(()()))(()())()(()(()()))(()()))
over which rank values are precomputed blockwise.]
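Since the rank mechanism is so valuable, here is a simplified two-level rank sketch in Python; real implementations use o(n) extra bits and word-level popcounts rather than explicit lists and scans:

class BitVectorRank:
    def __init__(self, bits, block=32):
        self.bits, self.block = bits, block
        # Cumulative number of 1s before each block boundary.
        self.blocks = [0]
        for s in range(0, len(bits), block):
            self.blocks.append(self.blocks[-1] + sum(bits[s:s+block]))

    def rank1(self, i):
        # Number of 1s in bits[0..i]: one table lookup plus an in-block scan.
        b = i // self.block
        return self.blocks[b] + sum(self.bits[b * self.block : i + 1])

bv = BitVectorRank([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
assert bv.rank1(4) == 3 and bv.rank1(9) == 5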
Smaller than Suffix Arrays: Compact Suffix Arrays

• Exploits self-repetitions in the suffix array.


• An area in the array can be equal to another area, provided all the values are displaced by one (a self-repetition).
• Those repetitions are factored out.
• Search time is O(m log n + R) and fast in practice.
• Extra space is around 1.6n.
• Construction needs O(n) time given the suffix array.
• This is much better than anything else seen so far.
• Still no provisions for updating nor secondary memory.
[Figure: the suffix array of "alabar a la alabarda$" and two factoring steps of the compact suffix array, where self-repeating areas are replaced by links such as (9,4,0).]
Towards Self-Indexing

• Up to now, we have focused on compressing the index.


• We have obtained decent extra space, 1.6 times the text size.
• However, we still need the text separately available in plain form.
• Could the text be compressed?
• Moreover, could the compressed text act as an index by itself?
• Self-index: a data structure that acts as an index and comprises the text itself.
• After the index is built, the text can be deleted.
• The retrieval of text passages is done via a request to the index.
• Hence, retrieving text becomes an essential operation of the index.
• Exciting possibility: the whole thing takes less space than the original text!
Exploiting Repetitions Again: Compressed Suffix Arrays

Grossi & Vitter

• We replace the suffix array by a level-wise data structure.


• Level 0 is the suffix array itself.
• Level k + 1 stores only the even pointers of level k, divided by 2.
• Bit array Bk (i) tells whether the i-th suffix array entry is even.
• Array Ψk (i) tells where the pointer to the next text position (SAk [i] + 1) lies.
• Note that level k + 1 needs half the space of level k.
• For large enough k = ℓ = Θ(log log n), the suffix array is stored explicitly.
• If Bk (i) = 1, SAk [i] = 2 × SAk+1 [rank(Bk , i)].
• If Bk (i) = 0, SAk [i] = 2 × SAk+1 [rank(Bk , Ψk (i))] − 1.
• This permits computing SA[i] = SA0 [i] in O(log log n) time.
• Hence searching can be done at O((m + log log n) log n) time.
• We store the Bk and Ψk explicitly.
• Function rank(Bk , ·) is computed in constant time as explained.
• The Ψk vector is stored with differential plus delta encoding, with an absolute sample every log n entries.
• A two-level structure like that for rank computes Ψk in constant time.
• The delta encoding dominates the space, and it works well because of the same self-repetition property.
• By dividing the array by 2^c instead of by 2, for c = ε log log n, we get constant time.
[Figure: the level structure for "alabar a la alabarda$": SA_0 with B_0 and Ψ_0, SA_1 (the even pointers halved), SA_2, SA_3, plus the bit vector D = 110010000000010110010 and the character array C = $_abdlr used in the next section.]
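The recurrences above can be exercised with a small Python sketch (an illustration under simplifying assumptions: naive suffix array build, rank by plain scanning, and "mississippi$" as example). Levels are built while the current length is even; the last level is stored explicitly:

def suffix_array(t):                        # 1-based positions, naive build
    return sorted(range(1, len(t) + 1), key=lambda i: t[i-1:])

def build_levels(sa):
    levels = []
    while len(sa) % 2 == 0:                 # halve while the level splits evenly
        n = len(sa)
        inv = {v: i for i, v in enumerate(sa)}
        B = [1 if v % 2 == 0 else 0 for v in sa]
        # psi[i]: where the pointer to text position sa[i] + 1 lies.
        psi = [inv[v + 1] if v < n else None for v in sa]
        levels.append((B, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return levels, sa                       # sa is now stored explicitly

def rank1(B, i):
    return sum(B[:i + 1])                   # constant time with a real rank structure

def lookup(levels, last, i, k=0):
    # The two recurrences from the slides, applied level by level.
    if k == len(levels):
        return last[i]
    B, psi = levels[k]
    if B[i]:
        return 2 * lookup(levels, last, rank1(B, i) - 1, k + 1)
    return 2 * lookup(levels, last, rank1(B, psi[i]) - 1, k + 1) - 1

t = "mississippi$"                          # length 12, so two levels are built
sa = suffix_array(t)
levels, last = build_levels(sa)
assert all(lookup(levels, last, i) == sa[i] for i in range(len(sa)))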
Compressed Suffix Arrays without Text

• Use only Ψ = Ψ0 and C.


• The text can be discarded: a bit vector D, with a 1 at each change of first character, plus rank over D and the character array C give the first character of the suffix pointed to by SA[i].
• Using Ψ0 we obtain the successive text characters.
• Overall space is n(H0 + 8 + 3 log2 H0) bits, text included.
• In practice, this index takes 0.6n to 0.7n bytes.
• Performance is good for counting, but showing text contexts is slow.
• Search complexity is O(m log n + R).
• Recently shown how to build in little space.
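A Python sketch of this self-indexing idea: only Ψ, D, C (and the rank of the full-text suffix) are kept, and the text is rebuilt from them alone. This illustrates the principle, not the actual compressed representation:

def build(t):                               # 0-based suffix array, naive
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    inv = {p: i for i, p in enumerate(sa)}
    n = len(t)
    psi = [inv[(sa[i] + 1) % n] for i in range(n)]   # wraps at the '$' suffix
    first = [t[p] for p in sa]              # first characters, in SA order
    D = [1 if i == 0 or first[i] != first[i - 1] else 0 for i in range(n)]
    C = ''.join(first[i] for i in range(n) if D[i])  # distinct chars, sorted
    return psi, D, C, inv[0]                # the text itself can now be dropped

def extract(psi, D, C, start, n):
    out, i = [], start
    for _ in range(n):
        out.append(C[sum(D[:i + 1]) - 1])   # rank over D gives the first char
        i = psi[i]                          # move to the suffix one position later
    return ''.join(out)

t = "alabar a la alabarda$"
psi, D, C, start = build(t)
assert extract(psi, D, C, start, len(t)) == t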
Building on the Burrows-Wheeler Transform: FM-Index

Ferragina & Manzini

• Based on the Burrows-Wheeler transform.


• The characters preceding each suffix are collected from the suffix array.
• The result is a more compressible permuted text.
• These are coded with move to front, run-length and δ-coding.
• The transformation is reversible.
• Given a position in the permuted text (last column), we can find the position of the letter preceding it in the original text.
• The trick is that we know which letter is at each position of the sorted array (first column).
• And letters in the first column follow those in the last column.
• Given a letter in the last column, which is the k-th c, we easily find its position in the first column, and hence the character preceding it.
• Starting at the character “$”, we can obtain the text backwards.
[Figure: the 21 rotations of "alabar a la alabarda$" and the same rotations in sorted order; the last column of the sorted list, read top to bottom, is the BWT: araadl ll$ bbaar aaaa.]
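The backward walk can be written in a few lines of Python; C and the per-position occurrence counts provide the last-to-first mapping (a sketch, assuming a single '$' terminator):

from collections import Counter

def invert_bwt(bwt):                        # bwt must contain one '$' terminator
    counts = Counter(bwt)
    C, total = {}, 0                        # C[c]: first row starting with c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    seen, occ = Counter(), []               # occ[i]: copies of bwt[i] before i
    for c in bwt:
        occ.append(seen[c])
        seen[c] += 1
    out, i = [], 0                          # row 0 starts with the smallest char '$'
    for _ in range(len(bwt)):
        c = bwt[i]
        out.append(c)
        i = C[c] + occ[i]                   # last-to-first: row of the previous char
    text = ''.join(reversed(out))           # this rotation starts with '$'
    return text[1:] + text[0]               # rotate so '$' is the terminator again

assert invert_bwt("annb$aa") == "banana$"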
• The index also has a cumulative letter frequency array C.
• As well as Occ[c, i] = number of occurrences of c before position i in the permuted text.
• If we start at position i in the permuted text, with character c, the previous character is at position C[c] + Occ[c, i].
• The search for pattern p1 . . . pm proceeds backwards, in optimal O(m) time.
• First we take the interval for pm: [l, r) = [C[pm], C[pm + 1]).
• The interval for pm−1 pm is then [l, r) = [C[pm−1] + Occ[pm−1, l], C[pm−1] + Occ[pm−1, r)).
• C is small and Occ is cumulative, so it is easy to store with blocks and superblocks.
• The in-block computation of Occ is done by scanning the permuted text.
• We can show text contexts by walking the permuted text as explained.
• A problem is how to know which text position we are at!
• Some suffix array pointers are stored, we walk the text until we find one.
• Overall space is 5Hk n bits, text included, for any k.
• In practice, 0.3 to 0.8 times the text size, and includes the text.
• Counting the number of occurrences is amazingly fast.
• Reporting their positions and text contexts is very slow, however.
• Search complexity is O(m + R log^ε n).
• Construction needs to start from the suffix array.
• Some (theoretical) provisions for dynamism; search time becomes O(m log n).
Example ("alabar a la alabarda$"):
SA = 21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18
BWT = araadl ll$ bbaar aaaa
MTF = 2,6,1,0,5,6,5,1,0,5,2,6,0,5,0,6,3,2,0,0,0
C[$]=0, C[_]=1, C[a]=4, C[b]=13, C[d]=15, C[l]=16, C[r]=19
Occ[$] = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1
Occ[_] = 0,0,0,0,0,0,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3
Occ[a] = 1,1,2,3,3,3,3,3,3,3,3,3,3,4,5,5,5,6,7,8,9
Occ[b] = 0,0,0,0,0,0,0,0,0,0,0,1,2,2,2,2,2,2,2,2,2
Occ[d] = 0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Occ[l] = 0,0,0,0,0,1,1,2,3,3,3,3,3,3,3,3,3,3,3,3,3
Occ[r] = 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2
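A Python sketch of the backward search over this example data; here Occ is computed by scanning, whereas the real index answers it in constant time from the blocks described above:

from collections import Counter

def count_occurrences(bwt, pattern):
    counts = Counter(bwt)
    C, total = {}, 0                        # C[c]: rows starting below character c
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = lambda c, i: bwt[:i].count(c)     # Occ[c, i], by scanning
    l, r = 0, len(bwt)
    for c in reversed(pattern):             # process the pattern backwards
        if c not in C:
            return 0
        l = C[c] + occ(c, l)
        r = C[c] + occ(c, r)
        if l >= r:
            return 0
    return r - l                            # size of the suffix array range

# BWT of "alabar a la alabarda$", taken from the example above.
assert count_occurrences("araadl ll$ bbaar aaaa", "bar") == 2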
Using just LZ78 Parsing: LZ-Index

• Inspired by the idea of LZ-suffix trees, sticking to LZ78.


• Instead of a sparse suffix tree, the very same LZ78 trie is the index.
• We also need the tree of reverse phrases.
• These are represented using the parentheses technique.
• Occurrences can span 2 blocks (search prefix and suffix).
• They can lie inside one block (easily enumerated thanks to LZ78 properties).
• Or they can span 3 blocks or more (complicated, but these are very few thanks to LZ78 properties).
• Text can be retrieved by following parent pointers in the trie (this can be done over the parentheses representation).
• The index needs 4Hk n bits, but 1.2n to 1.6n bytes in practice, text included.
• It needs about the same space as a suffix array to build, and it builds fast.
• Query time is O(m^3 + m^2 log n + R log n), and it is slow in practice.
• But reporting occurrences is much faster, and this can be dominant.
[Figure: LZ78 parsing a.l.ab.ar. .a .la. a.lab.ard.a$ with its phrase trie and the trie of reversed phrases.]
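For concreteness, a small Python LZ78 parser: each phrase extends the longest previously seen phrase by one character, which is exactly what the trie records.

def lz78_parse(text):
    trie = {}                               # (node, char) -> child node
    phrases, node, start, next_id = [], 0, 0, 1
    for i, c in enumerate(text):
        if (node, c) in trie:               # the current phrase keeps growing
            node = trie[(node, c)]
        else:                               # new phrase = known phrase + c
            trie[(node, c)] = next_id
            phrases.append(text[start:i + 1])
            next_id, node, start = next_id + 1, 0, i + 1
    if start < len(text):                   # a final phrase may repeat an old one
        phrases.append(text[start:])
    return phrases

assert lz78_parse("alabar a la alabarda$") == [
    "a", "l", "ab", "ar", " ", "a ", "la", " a", "lab", "ard", "a$"]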
Secondary Memory and Dynamic Data Structures

• Most of the methods we have seen are not suited to secondary memory.
• This refers both to construction and searching.
• Most of them are difficult to modify if the text changes.
• Those that can be built online can easily incorporate more text.
• Recently, efficient construction of suffix trees in secondary memory has been achieved, both in theory and in practice.
• Suffix arrays can perform decently on secondary memory, but modifying them implies rewriting them completely.
• Compact PAT Trees put contiguous subtrees in disk pages, obtaining reasonable performance (2–4 disk accesses, O(k/√m + log_m n)).
• But modifying them is painful.
• Another interesting approach for managing insertions is to have buckets of exponentially increasing sizes, one index per bucket.
• Insertion resembles incrementing a binary number and has good amortized performance, as sketched below.
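A Python sketch of that bucket scheme; build_index is a generic stand-in for whatever per-bucket index is used (the names here are illustrative, not from the slides):

def insert(buckets, doc, build_index):
    # buckets[k] is None or (docs, index) holding exactly 2**k documents.
    carry = [doc]
    k = 0
    while True:
        if k == len(buckets):
            buckets.append(None)
        if buckets[k] is None:              # free slot: place the carry here
            buckets[k] = (carry, build_index(carry))
            return
        docs, _ = buckets[k]                # occupied: merge and keep carrying,
        carry = docs + carry                # like incrementing a binary number
        buckets[k] = None
        k += 1

# Toy usage: the "index" is just a sorted list of the documents' words.
build = lambda docs: sorted(w for d in docs for w in d.split())
buckets = []
for doc in ["a rose", "is a", "rose is", "a rose!"]:
    insert(buckets, doc, build)
assert [b is not None for b in buckets] == [False, False, True]

A query then probes each non-empty bucket and merges the answers.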
Dynamic in Secondary Memory: String B-Tree

Ferragina & Grossi

• A data structure to store a set of strings.


• It is like a B-tree of strings, but each node stores a compressed trie of the set of keys it handles.
• Takes 12n space, but can be reduced to 6n.
• Permits searching substrings of any string.
• Search time is optimal: O((m + R)/B + logB n).
• In practice it takes 4–6 disk accesses per query.
• It can insert a new text of length n′ in O(n′ logB (n + n′)) time.
• Same complexity for deleting a text from the set.
• It uses optimal space, O(n/B) disk pages.
• It has been implemented and is a very attractive alternative.
• Yet, it is not succinct.
Open Challenges

• Many goals have been obtained separately:


– Fast searches and presentation of results.
– Succinct space for construction and usage.
– Efficient construction and search in secondary memory.
– Efficient insertions and deletions of texts.
• However, no existing data structure fits all the requirements.
• Several theoretical proposals remain unimplemented.
• We have only considered exact searching for simple patterns.
[Figure: map of the structures discussed — suffix trie, suffix tree, DAWG, CDAWG, sparse ST, q-grams, LZ-stree, LZ-q, CPT, suffix array, Compact SA, CSA, FM-index, LZ-index, String B-tree — positioned according to the goals: succinct, dynamic, secondary memory.]
