Current Challenges in Textual Databases: Gonzalo Navarro

Current Challenges in Textual Databases

Gonzalo Navarro
Department of Computer Science
University of Chile
Road Map

• Text retrieval versus information retrieval

• Applications of text databases
• Challenges for modern text databases
• Classical solutions
– Suffix tries
– Suffix trees
– Directed acyclic word graphs (DAWGs)
• Towards succinct data structures
– Compact DAWGs
– Sparse suffix trees, cactuses and q-grams
– Ziv-Lempel based indexes
– Suffix arrays
– Compact PAT trees
– Compact Suffix Arrays
• Self-indexing data structures
– Compressed Suffix Arrays
– FM-Index
– LZ-Index
• Secondary memory and dynamic data structures
• Open Challenges
Text Retrieval versus Information Retrieval

• In both cases we handle a text collection.

• In Information Retrieval
– The database is seen as a set of documents.
– The user has a (vague) information need.
– The user expresses that information using words and phrases.
– The system tries to guess what the user really wants.
– There are no correct or incorrect answers, just more or less relevant to
– An underlying text retrieval engine is used to compute relevance.
– It works only for “natural language” text, which excludes several hum
– We want speed, but especially good precision and recall.
• In Text Retrieval
– The database is seen as a set of strings.
– The user knows exactly what kind of strings he/she wants.
– The user gives patterns to search for in the text.
– The system retrieves exactly the occurrences of those patterns.
– A postprocessing stage may further filter these results, but this is not the
of the text retrieval engine.
– It works for any kind of text.
– We want basically speed, usually subject to correctness.
Applications of Text Databases

• Computational biology
– Genome projects have produced gigabytes of nucleotide and protein da
– Biologists search that data for known or new genes and proteins, to
homologous regions so as to predict chemical structure, biochemical func
evolutionary history, etc.
– Global and local similarity, with different concepts of approximate s
including approximate regular expression searching.
• Information retrieval
– IR systems usually have a text search engine at their kernel.
– Usually it is specific to natural language.
– Approximate searching is useful to cope with spelling, typing and OCR
– Handling from medium to huge texts: jurisprudence databases, linguis
porate data, news clipping... and of course the Web.
• Text retrieval in Oriental languages
– Chinese, Korean, (one) Japanese and other Oriental languages have v
alphabets, mixing phonetic and ideographic symbols.
– Word separations must be inferred from the meaning and are difficul
– They have to treat their texts simply as strings of symbols.
– Typical IR systems do not apply to those languages.
• Multimedia databases
– Music databases, in MIDI format for example, with their own concept of s
e.g. transposition invariance.
– Audio databases, where we wish to find patterns independently of volum
– Video databases, for example object tracking information is a string o
Challenges for Modern Text Databases

• Be small, in order to store large text collections and extra data structures
them at a reasonable space cost.
• Be fast, in order to provide access to a large mass of text in reasonable
possibly many concurrent users.
• Be flexible, in order to search for complex patterns.
• For fast access we need an index, that is, a data structure built on the text
• But an index may take much space, and this plays against the goal of
• For large databases, that index has to be oriented to secondary memory.
• The index should be dynamic, that is, easy to update upon changes in
The Simplest Solution: Suffix Tries

• A trie that stores all the suffixes of the texts.

• Each leaf represents a unique substring (and extensions to suffix).
• Each internal node represents a repeating substring.
• Every text substring is the prefix of some suffix.
• So any substring of length m can be found in O(m) time.
• Many other search problems can be solved with suffix tries:
– Longest text repetitions.
– Approximate searching in O(n^λ) time (n = text size).
– Regular expression searching in O(n^λ) time.
– ... and many others.
• Let n be the text size, then the trie
– on average is O(n) space and construction time...
– but in the worst case it is O(n^2) space and construction time.
[Figure: suffix trie for the text "alabar a la alabarda$"; leaves are labeled with the starting positions of the suffixes.]
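A minimal Python sketch of the suffix trie idea above, not taken from the original slides. Every suffix is inserted character by character, and a pattern of length m is located by following m edges. The names `Node`, `build_suffix_trie`, `find` and `occurrences` are illustrative assumptions; spaces are written as "_" as in the figures, and positions are 1-based. Collecting occurrences here walks the whole subtree, which on a plain trie can cost more than O(R).

```python
class Node:
    def __init__(self):
        self.children = {}        # character -> child Node
        self.suffix_start = None  # set on the node where a whole suffix ends


def build_suffix_trie(text):
    """Insert every suffix of text; quadratic in the worst case."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.suffix_start = start + 1  # 1-based text position
    return root


def find(root, pattern):
    """Follow one edge per pattern character: O(m) steps."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def occurrences(node):
    """Starting positions of all suffixes below node."""
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.suffix_start is not None:
            found.append(current.suffix_start)
        stack.extend(current.children.values())
    return sorted(found)


text = "alabar_a_la_alabarda$"
trie = build_suffix_trie(text)
print(occurrences(find(trie, "la")))  # [2, 10, 14]
```

The unique terminator "$" guarantees that every suffix ends at its own leaf, which is why one stored position per node suffices in this sketch.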
Guaranteeing Linear Space: Suffix Trees

Weiner, McCreight, Ukkonen,

• Works essentially as a suffix trie.

• Unary paths are compressed.
• It has O(n) size and construction time, in the worst case.
• Construction can be done online.
• After the O(m) search, the R occurrences can be collected in O(R) time.
• So it’s simply perfect!
• Where is the trick?
– The suffix tree needs at least 20 times the text size.
– It is hard to build and search in secondary memory.
– It is difficult to update upon changes in the text.
• So, it is problematic with respect to our three goals!
• Very compact implementations achieve 10n bytes.
• Other variants: Patricia trees, level-compressed tries... not better in practice.
[Figure: suffix tree for the text "alabar a la alabarda$", with unary paths compressed.]
Minimizing Trie Automata: DAWGs

Blumer et al., Croch

• The smallest deterministic automaton recognizing every text suffix.

• It would be the result of minimizing the suffix trie.
• At most 2n nodes and 3n edges.
• It can be built in O(n) time with an online algorithm.
• Searching can also be done in O(m + R) time.
• An interesting alternative to suffix trees.
• In practice, it is too large (30n), static and bad for secondary storage.
[Figure: DAWG for the text "alabar a la alabarda$".]
Best from Suffix Trees and DAWGs: Compact DAWGs

Blumer et al., Crochemore & Vérin, Takeda & Shin

• Obtained by compressing unary paths in the DAWG.

• Also obtained by minimizing the suffix tree.
• It is smaller than the three classical structures.
• It can still be built online in O(n) time.
• It can still search in O(m + R) time.
• It can still implement all the flexible searches.
• But it is still too large (15n), static and bad on disk.
[Figure: compact DAWG for the text "alabar a la alabarda$".]
Sampling the Data: Sparse Suffix Trees, Cactuses, and q-Grams

Kärkkäinen, Ukkonen, Sutinen, T

• Sparse suffix trees

– Only one out of k suffixes is indexed.
– Hence the suffix tree has fewer nodes and is smaller.
– But searches are more expensive.
– In practice, reasonable search time requires about 4n space.
• Suffix cactuses
– Carry path compression one step further.
– They are a crossing between a suffix tree and a suffix array.
– Not very promising either (10n space).
• Q-gram indexes
– Indexes substrings of length q instead of full suffixes.
– Significantly less space, for example 2n extra space.
– But search times are far less attractive if q is not large enough.
– And the index grows exponentially with q, so 4n at least.

Text: alabar a la alabarda$ (q = 3; "_" denotes a space)

q-gram  positions
ala     1, 13
lab     2, 14
aba     3, 15
bar     4, 16
ar_     5
r_a     6
_a_     7
a_l     8
_la     9
la_     10
a_a     11
_al     12
ard     17
rda     18
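As a hedged illustration of the q-gram index above (a sketch, not the authors' implementation), the Python code below maps each q-gram to its list of 1-based positions and answers a query by looking up the first q characters of the pattern and verifying each candidate against the text. The function names are assumptions made for this example.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every substring of length q to its 1-based starting positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i + 1)
    return index


def search(text, index, q, pattern):
    """Look up the first q-gram of the pattern, then verify each candidate."""
    if len(pattern) < q:
        raise ValueError("pattern must have at least q characters")
    candidates = index.get(pattern[:q], [])
    return [p for p in candidates if text.startswith(pattern, p - 1)]


text = "alabar_a_la_alabarda$"
index = build_qgram_index(text, 3)
print(index["ala"])                           # [1, 13]
print(search(text, index, 3, "alabard"))      # [13]
```

The verification step is where the search time suffers when q is small: a short q-gram has many candidates, and most of them may fail.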
Using the Ziv-Lempel Parsing: Ziv-Lempel Indexes

Kärkkäinen, Sutinen, U

• The text is divided according to a Ziv-Lempel parsing.

• A sparse suffix tree indexes initial positions of blocks.
• Another indexes the reverse prefixes ending at final block positions.
• Occurrences can either be contained inside a block or not.
• Those that are not, have a prefix finishing a block and a suffix starting a b
• Once each prefix-suffix pair is found, a range search data structure perm
secting the lexicographical ranges.
• Those contained in another block are found by a structure tracking the rep
• Not implemented, but we estimate 3.5n to 5.5n extra space and good sear
• Although in theory search time is, for example, O(m^2 + m log n + R log^ε n
• A version able to search for q-grams only (q < log n) obtains optimal O
search time and 4n to 6n extra space.
• No good algorithm for longer patterns can be obtained with such a q.
[Figure: Ziv-Lempel index example for the text "alabar a la alabarda$".]
Just the Tree Leaves: Suffix Arrays

Manber & Myers, Gonnet et al., Kurtz et al., Kärkkäinen, Baeza-Yate

• An array of all the text positions in lexicographical suffix order.

• Much simpler to implement and smaller than the suffix tree.
• Simple searches result in a range of positions.
• It can simulate all the searches with an O(log n) penalty factor.
• It can be built in O(n) time, but construction is not online.
• In practice it takes about 4 times the text size.
• Linear-time construction needs 8n bytes, otherwise construction is O(n log
• Paying 6n extra space, we can get rid of the O(log n) penalty.
• Builds in secondary memory in O(n^2 log(M)/M) time (M = RAM size).
• Searching in secondary memory is painful.
• But it can be improved a lot with sampling supraindexes.
• Still dynamism is a problem.
alabar a la alabarda$

21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18
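The following Python sketch illustrates how a simple search on a suffix array yields a range of positions, using two binary searches with O(m log n) character comparisons. It is an assumption-laden example rather than the construction discussed above: the array is built by naive sorting (not in linear time), spaces are written as "_", and the function names are illustrative.

```python
def build_suffix_array(text):
    """1-based text positions in lexicographic suffix order (naive sort)."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def sa_range(text, sa, pattern):
    """Half-open range [start, end) of suffixes that begin with pattern."""
    m = len(pattern)

    def prefix(k):
        return text[sa[k] - 1: sa[k] - 1 + m]

    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                       # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "alabar_a_la_alabarda$"
sa = build_suffix_array(text)
start, end = sa_range(text, sa, "la")
print(sorted(sa[start:end]))  # [2, 10, 14]
```

The R occurrences are simply the entries sa[start:end], so reporting them costs O(R) once the range is known.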
Minimizing the Impact of the Tree: Compact PAT Trees

Jacobson, Munro, Raman, Clark

• Separate a suffix tree into the leaves (suffix array) plus the structure.
• The tree structure itself can be represented as a sequence of parentheses.
• This sequence can be traversed with the usual tree operations.
• Basic tools are: ranking of bits and finding closing parentheses.
• The result is functionally similar to a suffix tree.
• Experimental results report 5n to 6n bytes.
• So it is still too large.
• Worse than that, it is built from the plain suffix tree.
• Search in secondary storage is reasonable (2–4 disk accesses), by storing subtrees in single disk pages.
• Dynamism is painful.
• The rank mechanism is very valuable by itself.
[Figure: compact PAT tree for the text "alabar a la alabarda$": the tree structure stored separately from the leaves, which form the suffix array.]
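To make the parentheses representation concrete, here is a small Python sketch under stated assumptions: the tree is given as nested lists, it is encoded in preorder as a string of parentheses, and the two basic tools (ranking and finding the closing parenthesis) are implemented by linear scans. A succinct implementation would answer them in constant time with small auxiliary directories; the function names are illustrative only.

```python
def encode(tree):
    """tree is a list of subtrees (a leaf is []); returns its parentheses string."""
    return "(" + "".join(encode(child) for child in tree) + ")"


def rank_open(bp, i):
    """Number of '(' in bp[0..i]; this is the preorder number of node i."""
    return bp.count("(", 0, i + 1)


def find_close(bp, i):
    """Index of the ')' matching the '(' at index i."""
    depth = 0
    for j in range(i, len(bp)):
        depth += 1 if bp[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def subtree_size(bp, i):
    return (find_close(bp, i) - i + 1) // 2


def first_child(bp, i):
    return i + 1 if bp[i + 1] == "(" else None


def next_sibling(bp, i):
    j = find_close(bp, i) + 1
    return j if j < len(bp) and bp[j] == "(" else None


# Root with three children: a leaf, an internal node with two leaves, a leaf.
bp = encode([[], [[], []], []])
print(bp)                    # (()(()())())
print(subtree_size(bp, 3))   # 3
```

Each node costs two bits in this encoding, which is the source of the space savings over a pointer-based tree.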
Smaller than Suffix Arrays: Compact Suffix Arrays

• Exploits self-repetitions in the suffix array.

• An area in the array can be equal to another area, provided all the values are
• Those repetitions are factored out.
• Search time is O(m log n + R) and fast in practice.
• Extra space is around 1.6n.
• Construction needs O(n) time given the suffix array.
• This is much better than anything else seen so far.
• Still no provisions for updating or for secondary memory.
alabar a la alabarda$

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 (9,4,0)

0 1 2 3 4 5 6 7 8 9 10 11 12
21 7 (3,2,1)(10,2,0) 8 (12,2,0) 1 13 (9,2,0) (5,2,0) 19 10 (6,4,0)
Towards Self-Indexing

• Up to now, we have focused on compressing the index.

• We have obtained decent extra space, 1.6 times the text size.
• However, we still need the text separately available in plain form.
• Could the text be compressed?
• Moreover, could the compressed text act as an index by itself?
• Self-index: a data structure that acts as an index and comprises the text
• After the index is built, the text can be deleted.
• The retrieval of text passages is done via a request to the index.
• Hence, retrieving text becomes an essential operation of the index.
• Exciting possibility: the whole thing takes less space than the original text!
Exploiting Repetitions Again: Compressed Suffix Arrays

Grossi &

• We replace the suffix array by a level-wise data structure.

• Level 0 is the suffix array itself.
• Level k + 1 stores only the even pointers of level k, divided by 2.
• Bit array B_k(i) tells whether the i-th suffix array entry is even.
• Array Ψ_k(i) tells where the pointer to position SA_k[i] + 1 is.
• Note that level k + 1 needs half the space of level k.
• For large enough k = ℓ = Θ(log log n), the suffix array is stored explicitly.
• If B_k(i) = 1, SA_k[i] = 2 × SA_{k+1}[rank(B_k, i)].
• If B_k(i) = 0, SA_k[i] = 2 × SA_{k+1}[rank(B_k, Ψ_k(i))] − 1.
• This permits computing SA[i] = SA_0[i] in O(log log n) time.
• Hence searching can be done in O((m + log log n) log n) time.
• We store the B_k and Ψ_k explicitly.
• Function rank(B_k, ·) is computed in constant time as explained.
• The Ψ_k vector is stored with differential plus delta encoding, with an absol
each log n entries.
• A two-level structure like that for rank computes Ψ_k in constant time.
• The delta encoding dominates space, and it is good because of the same r
• By dividing the array by 2^c instead of by 2, for c = ε log log n, we get
constant time.
Text: alabar a la alabarda$

i      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
SA_0  21  7 12  9 20 11  8  3 15  1 13  5 17  4 16 19 10  2 14  6 18
B_0    0  0  1  0  1  0  1  0  0  0  0  0  0  1  1  0  1  1  1  1  1
Ψ_0    9  6 10 16  0  2  3 13 14 17 18 19 20 11 12  4  5  7  8  1 15

i      0  1  2  3  4  5  6  7  8  9
SA_1   6 10  4  2  8  5  1  7  3  9
B_1    1  1  1  1  1  0  0  0  0  0
Ψ_1    7  6  5  8  9  0  3  4  2  1

i      0  1  2  3  4
SA_2   3  5  2  1  4
B_2    0  0  1  0  1
Ψ_2    4  3  0  2  1

i      0  1
SA_3   1  2

D = 110010000000010110010
C = $_abdlr
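The level-wise recurrence can be sketched in Python as follows. This is a simplified illustration under explicit assumptions, not the actual compressed structure: B_k and Ψ_k are kept as plain lists, rank is a linear scan instead of a constant-time directory, Ψ is only filled in where the recurrence needs it (odd entries), and one extra field per level (`odd_tail`, an addition of this sketch) handles a level of odd length, whose largest entry has no successor. All names are hypothetical.

```python
def build_suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def build_csa(sa, depth):
    """Return (levels, explicit): depth sampled levels plus the last array."""
    levels = []
    for _ in range(depth):
        n = len(sa)
        where = {value: i for i, value in enumerate(sa)}
        bits = [1 - value % 2 for value in sa]          # 1 marks an even entry
        psi = [where.get(value + 1, i) if value % 2 else i
               for i, value in enumerate(sa)]           # used for odd entries
        odd_tail = where[n] if n % 2 else None          # odd entry n has no n + 1
        levels.append((bits, psi, odd_tail, n))
        sa = [value // 2 for value in sa if value % 2 == 0]
    return levels, sa


def lookup(levels, explicit, i, k=0):
    """Compute SA_k[i] from the next level, following the two cases above."""
    if k == len(levels):
        return explicit[i]
    bits, psi, odd_tail, n = levels[k]
    if i == odd_tail:
        return n
    j = i if bits[i] else psi[i]          # j points to an even entry
    rank = sum(bits[:j + 1])              # linear-time stand-in for rank(B_k, j)
    value = 2 * lookup(levels, explicit, rank - 1, k + 1)
    return value if bits[i] else value - 1


text = "alabar_a_la_alabarda$"
sa = build_suffix_array(text)
levels, explicit = build_csa(sa, 3)
print(explicit)                        # [1, 2]
print(lookup(levels, explicit, 1))     # 7
```

With 0-based indexes and an inclusive rank, the entry to read at the next level is rank − 1, which is why the code subtracts one.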
Compressed Suffix Arrays without Text

• Use only Ψ = Ψ_0 and C.

• The text can be discarded: rank over a bit vector D with a 1 for each c
first character pointed by SA[i].
• Using Ψ_0 we obtain successive text characters.
• Overall space is n(H_0 + 8 + 3 log_2 H_0) bits, text included.
• In practice, this index takes 0.6n to 0.7n bytes.
• Performance is good for counting, but showing text contexts is slow.
• Search complexity is O(m log n + R).
• Recently shown how to build in little space.
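A short Python sketch of how the text can be recovered from Ψ, D and C alone. It assumes, for illustration only, that the index position of the suffix starting at the first text character is stored (`start`), that rank over D is a linear scan, and that everything is held in plain lists; the function names are hypothetical.

```python
def build(text):
    """Return Psi, the sorted alphabet C, the bit vector D and the start index."""
    n = len(text)
    sa = sorted(range(n), key=lambda p: text[p:])       # 0-based positions here
    where = {p: i for i, p in enumerate(sa)}
    psi = [where[(p + 1) % n] for p in sa]              # index of the next suffix
    alphabet = sorted(set(text))
    d_bits = [1 if i == 0 or text[sa[i]] != text[sa[i - 1]] else 0
              for i in range(n)]                        # 1 where the first char changes
    return psi, alphabet, d_bits, where[0]


def extract(psi, alphabet, d_bits, start, length):
    """Read `length` characters, beginning at the suffix with index `start`."""
    out, i = [], start
    for _ in range(length):
        out.append(alphabet[sum(d_bits[:i + 1]) - 1])   # C[rank(D, i)]
        i = psi[i]
    return "".join(out)


text = "alabar_a_la_alabarda$"
psi, alphabet, d_bits, start = build(text)
print("".join(map(str, d_bits)))                 # 110010000000010110010
print(extract(psi, alphabet, d_bits, start, 6))  # alabar
```

Extracting from an arbitrary text position would additionally need sampled inverse suffix array entries, which this sketch leaves out.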
Building on the Burrows-Wheeler Transform: FM-Index

Ferragina &

• Based on the Burrows-Wheeler transform.

• The characters preceding each suffix are collected from the suffix array.
• The result is a more compressible permuted text.
• These are coded with move to front, run-length and δ-coding.
• The transformation is reversible.
• Given a position in the permuted text (last column), we can find the positio
letter preceding it in the original text.
• The trick is that we know which letter is at each position in the sorted array
(first column).
• And letters in the first column follow those in the last column.
• Given a letter in the last column, which is the k-th c, we easily find its po
the first column, and hence the character preceding it.
• Starting at the character “$”, we can obtain the text backwards.
alabar a la alabarda$
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18

araadl ll$ bbaar aaaa BWT

[Figure: the rotations of "alabar a la alabarda$" in text order (left) and in sorted order (right); the last column of the sorted matrix is the BWT. The marked characters (1st "a", 1st "d", 2nd "r", 9th "a") illustrate the mapping between the last and first columns.]
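The transform and its inversion can be sketched in a few lines of Python. This is an illustrative example, assuming a unique terminator "$" that is smaller than every other character and a naive suffix sort; the function names are not from the slides.

```python
def bwt(text):
    """Collect the character preceding each suffix, in suffix array order."""
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    return "".join(text[p - 1] for p in sa)    # p = 0 wraps around to "$"


def inverse_bwt(last):
    """Rebuild the text backwards, starting from the row that begins with '$'."""
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    c_array, total = {}, 0
    for ch in sorted(counts):                  # C[c]: characters smaller than c
        c_array[ch] = total
        total += counts[ch]
    seen, lf = {}, []
    for ch in last:                            # the k-th c in the last column maps
        lf.append(c_array[ch] + seen.get(ch, 0))   # to the k-th c in the first
        seen[ch] = seen.get(ch, 0) + 1
    out, row = ["$"], 0                        # row 0 is the rotation "$..."
    for _ in range(len(last) - 1):
        out.append(last[row])                  # character preceding the current one
        row = lf[row]
    return "".join(reversed(out))


text = "alabar_a_la_alabarda$"
permuted = bwt(text)
print(permuted)                # araadl_ll$_bbaar_aaaa
print(inverse_bwt(permuted))   # alabar_a_la_alabarda$
```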
• The index also has a cumulative letter frequency array C.
• As well as Occ[c, i] = number of occurrences of c before position i in the permuted text.
• If we start at a position i in the permuted text, with character c, the previo
is at position C[c] + Occ[c, i].
• The search for pattern p_1 ... p_m is done backwards, in optimal O(m) time
• First we take the interval for p_m, [l, r) = [C[p_m], C[p_m + 1]).
• The interval for p_{m−1} p_m is [l, r) = [C[p_{m−1}] + Occ[p_{m−1}, l], C[p_{m−1}] + Occ[p_{m−1}, r]).
• C is small and Occ is cumulative, so it is easy to store with blocks and sup
• The in-block computation of Occ is done by scanning the permuted text.
• We can show text contexts by walking the permuted text as explained.
• A problem is how to know which text position we are at!
• Some suffix array pointers are stored; we walk the text until we find one.
• Overall space is 5 H_k n bits, text included, for any k.
• In practice, 0.3 to 0.8 times the text size, and includes the text.
• Counting the number of occurrences is amazingly fast.
• Reporting their positions and text contexts is very slow, however.
• Search complexity is O(m + R log^ε n).
• Construction needs to start from the suffix array.
• Some (theoretical) provisions for dynamism, search time becomes O(m log
alabar a la alabarda$
21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18   (suffix array)
araadl ll$ bbaar aaaa   (BWT)
2,6,1,0,5,6,5,1,0,5,2,6,0,5,0,6,3,2,0,0,0   (MTF)

C[$]=0, C[_]=1, C[a]=4, C[b]=13, C[d]=15, C[l]=16, C[r]=19

Occ[$] = 0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1
Occ[_] = 0,0,0,0,0,0,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3
Occ[a] = 1,1,2,3,3,3,3,3,3,3,3,3,3,4,5,5,5,6,7,8,9
Occ[b] = 0,0,0,0,0,0,0,0,0,0,0,1,2,2,2,2,2,2,2,2,2
Occ[d] = 0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
Occ[l] = 0,0,0,0,0,1,1,2,3,3,3,3,3,3,3,3,3,3,3,3,3
Occ[r] = 0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2
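The backward search with C and Occ can be illustrated with the Python sketch below. It is a didactic version under clear assumptions: Occ is stored as full uncompressed tables of prefix counts (Occ[c][i] counts occurrences of c before 0-based position i), whereas the real index compresses the permuted text and samples Occ in blocks. The function names are illustrative.

```python
def fm_index(text):
    """Build C and the full Occ tables from the BWT of text."""
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    last = [text[p - 1] for p in sa]             # the permuted text (BWT)
    alphabet = sorted(set(text))
    c_array, total = {}, 0
    for ch in alphabet:                          # C[c]: characters smaller than c
        c_array[ch] = total
        total += text.count(ch)
    occ = {ch: [0] for ch in alphabet}           # occ[c][i]: count of c in last[0:i]
    for symbol in last:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (symbol == ch))
    return c_array, occ


def count(c_array, occ, n, pattern):
    """Number of occurrences of pattern, processing it right to left."""
    lo, hi = 0, n                                # half-open interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_array:
            return 0
        lo = c_array[ch] + occ[ch][lo]
        hi = c_array[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


text = "alabar_a_la_alabarda$"
c_array, occ = fm_index(text)
print(count(c_array, occ, len(text), "la"))      # 3
print(count(c_array, occ, len(text), "alabar"))  # 2
```

Starting from the full interval [0, n) and applying the update for p_m gives the same first interval as [C[p_m], C[p_m + 1]), so the loop needs no special first step.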
Using just LZ78 Parsing: LZ-Index

• Inspired by the idea of LZ-suffix trees, sticking to LZ78.

• Instead of a sparse suffix tree, the very same LZ78 trie is the index.
• We also need the tree of reverse phrases.
• These are represented using the parentheses technique.
• Occurrences can span 2 blocks (search prefix and suffix).
• They can lie inside one block (easily enumerated thanks to LZ78 properties
• Or they can span 3 blocks or more (complicated but very few thanks to LZ78
• Text can be retrieved by following parent pointers in the tree (can be d
parentheses representation).
• The index needs 4 H_k n bits, but 1.2n to 1.6n bytes in practice, text included.
• It needs about the same space as a suffix array to build and builds fast.
• Query time is O(m^3 + m^2 log n + R log n) and slow in practice.
• But reporting occurrences is much faster, and this can be dominant.

[Figure: LZ78 parsing of "alabar a la alabarda$", with the LZ78 trie and the trie of reverse phrases.]
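A minimal Python sketch of the LZ78 parsing that underlies the index, together with text retrieval by following parent pointers. It only illustrates the parsing and the trie; the reverse trie, the parentheses representation and the search itself are not covered. It assumes the text ends with a unique terminator, so the last phrase is always complete, and the names are hypothetical.

```python
def lz78_parse(text):
    """Return (trie, parent, phrases); node 0 is the root of the LZ78 trie."""
    trie = {0: {}}     # node id -> {character: child id}
    parent = {}        # node id -> (parent id, character on the edge)
    phrases = []       # node ids of the phrases, in text order
    node = 0
    for ch in text:
        child = trie[node].get(ch)
        if child is not None:          # still inside a known phrase: extend it
            node = child
            continue
        new_id = len(trie)             # longest known phrase plus one character
        trie[node][ch] = new_id
        trie[new_id] = {}
        parent[new_id] = (node, ch)
        phrases.append(new_id)
        node = 0
    return trie, parent, phrases


def phrase_text(parent, node):
    """Recover a phrase by walking parent pointers up to the root."""
    chars = []
    while node != 0:
        node, ch = parent[node]
        chars.append(ch)
    return "".join(reversed(chars))


text = "alabar_a_la_alabarda$"
trie, parent, phrases = lz78_parse(text)
print([phrase_text(parent, p) for p in phrases])
# ['a', 'l', 'ab', 'ar', '_', 'a_', 'la', '_a', 'lab', 'ard', 'a$']
```

Because every phrase is an earlier phrase plus one character, the set of phrases is prefix-closed, which is the property that makes the trie itself usable as the index.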
Secondary Memory and Dynamic Data Structures

• Most of the methods we have seen are not suited to secondary memory.
• This refers both to construction and searching.
• Most of them are difficult to modify if the text changes.
• Those that can be built online can easily incorporate more text.
• Recently, efficient construction of suffix trees in secondary memory has been a
both in theory and in practice.
• Suffix arrays can perform decently on secondary memory, but modifying them
rewriting them completely.
• Compact PAT Trees put contiguous subtrees in disk pages, obtaining reasonable performance (2–4 disk accesses, O(k/ m + log_m n)).
• But modifying them is painful.
• Another interesting approach for managing insertions is to have buckets of exponentially increasing sizes, one index per bucket.
• Insertion resembles incrementing a binary number and has good amortized
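The bucket scheme can be sketched in Python as below. This is a hedged illustration: each "index" is just a list of documents standing in for a static index that would be rebuilt on every merge, and the class name `BucketedIndex` is an assumption of this example. Bucket k is either empty or holds exactly 2^k documents, so an insertion propagates like a carry in a binary counter.

```python
class BucketedIndex:
    def __init__(self):
        self.buckets = []   # buckets[k] is None or a list of 2**k documents

    def insert(self, doc):
        carry, k = [doc], 0
        while True:
            if k == len(self.buckets):
                self.buckets.append(None)
            if self.buckets[k] is None:
                self.buckets[k] = carry      # a static index would be built here
                return
            carry = self.buckets[k] + carry  # merge two equal-sized buckets
            self.buckets[k] = None
            k += 1

    def search(self, pattern):
        """Query every non-empty bucket and concatenate the answers."""
        return [doc for bucket in self.buckets if bucket
                for doc in bucket if pattern in doc]

    def sizes(self):
        return [len(bucket) if bucket else 0 for bucket in self.buckets]


index = BucketedIndex()
for doc in ["alabar", "a la", "alabarda", "labor", "barda"]:
    index.insert(doc)
print(index.sizes())          # [1, 0, 4]
print(index.search("bar"))    # ['barda', 'alabar', 'alabarda']
```

A query must visit O(log N) buckets for N documents, which is the price paid for never modifying an index in place.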
Dynamic in Secondary Memory: String B-Tree

Ferragina &

• A data structure to store a set of strings.

• It is like a B-tree of strings, but each node stores a compressed trie of the se
• Takes 12n space, but can be reduced to 6n.
• Permits searching substrings of any string.
• Search time is optimal, O((m + R)/B + log_B n).
• In practice it takes 4–6 disk accesses per query.
• It can insert a new text of length n' in O(n' log_B (n + n')) time.
• Same complexity for deleting a text from the set.
• It uses optimal space O(n/B).
• It has been implemented and is a very attractive alternative.
• Yet, it is not succinct.
Open Challenges

• Many goals have been obtained separately:

– Fast searches and presentation of results.
– Succinct space for construction and usage.
– Efficient construction and search in secondary memory.
– Efficient insertions and deletions of texts.
• However, no existing data structure fits all the requirements.
• Several theoretical proposals remain unimplemented.
• We have only considered exact searching for simple patterns.
[Figure: map of the structures covered (suffix trie, suffix tree, DAWG, q-grams, sparse ST, CPT, suffix array, B-tree, LZ-q, Compact SA, CSA, FM, LZ-index), marking which are dynamic and which work in secondary memory.]

