9 Suffix Trees
Let Σ denote a finite alphabet, and let Σ+ = Σ* \ {ε} be the set of all non-empty strings over Σ.
Let T = t_1 t_2 . . . t_n be a text. A prefix of T is a substring of T beginning at the first position in T. A suffix of T is a substring of T ending at the last position in T.
In bioinformatics, the alphabet is usually of size 4 or 20. In other applications, the alphabet can be much larger; for example, in the analysis of web-surfing patterns, Σ consists of the set of all links contained in a collection of web sites.
9.3 The role of suffixes and the sentinel $
Consider the text abab.
It has the following suffixes: abab, bab, ab and b.
The sentinel $: In the following, we want to ensure that no suffix is a prefix of any other. To do so, we append a special character $ ∉ Σ to the end of the text.
Now, consider the text abab$.
It has the following suffixes: abab$, bab$, ab$, b$, and $.
Queries are prefixes of suffixes: To determine whether a given query q is contained in the text, we could simply check whether q is the prefix of one of the suffixes.
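This naive idea can be sketched directly (a toy illustration; the helper name occurs is ours, and without an index this takes O(n·m) time rather than the O(m) a suffix tree will give):

```python
def occurs(query: str, text: str) -> bool:
    """Return True iff `query` occurs in `text`, by checking whether
    it is a prefix of one of the n suffixes of text (naive, O(n * m))."""
    return any(text[i:].startswith(query) for i in range(len(text)))

print(occurs("ab", "abab"))   # True: ab is a prefix of the suffixes abab and ab
print(occurs("bb", "abab"))   # False: bb is a prefix of no suffix
```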
9.4 Sharing prefixes
E.g., the query ab is the prefix of both abab$ and ab$.
To speed up the search for all suffixes that have the query as a prefix, we use a tree structure to share common prefixes between the suffixes.
(a) The suffixes abab$ and ab$ both share the prefix ab.
(b) The suffixes bab$ and b$ both share the prefix b.
(c) The suffix $ doesn't share a prefix.
[Figure: three small trees (a), (b), (c) showing the shared prefixes ab and b, and the lone suffix $.]
Here is an example of how the idea of searching the prefix of every suffix can be optimized by sharing prefixes whenever possible:
138 Bioinformatics I, WS09-10, J. Fischer (script by J. Fischer and D. Huson) February 3, 2010
[Figure: the suffix tree for abab$, with leaves labeled 1 to 5.]
A suffix tree for abab$ is obtained by sharing prefixes whenever possible. The leaves are annotated by the positions of the corresponding suffixes in the text.
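The prefix-sharing idea can be sketched with a plain (uncompacted) suffix trie over text$; the nested-dict representation and the key "leaf" are our own illustrative choices, not the compact Σ+-tree defined next:

```python
def suffix_trie(text: str) -> dict:
    """Build an uncompacted trie of all suffixes of text + '$'.
    Each leaf stores the (1-based) start position of its suffix
    under the key 'leaf'."""
    text += "$"
    root: dict = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:                # insert suffix text[i:]
            node = node.setdefault(c, {})
        node["leaf"] = i + 1              # annotate the leaf
    return root

trie = suffix_trie("abab")
# the suffixes abab$ and ab$ share the path a -> b from the root
assert "a" in trie and "b" in trie and "$" in trie
```

Walking a query from the root character by character visits exactly the suffixes that have the query as a prefix.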
9.5 Σ+-tree
Definition 9.5.1 (Σ+-tree) A compact Σ+-tree is a rooted tree S = (V, E) with edge labels from Σ+ that fulfills the following two constraints:
For all v ∈ V, all outgoing edges from v start with a different letter a ∈ Σ.
Apart from the root, all nodes have out-degree ≠ 1.
You can think of a compact Σ+-tree as a tree that looks like the trie data structure of section 2.7, but with degree-one paths contracted into a single edge.
As usual, a leaf is a node with no children and an edge leading to a leaf is called a leaf edge. A node
with at least two children is called a branching node.
Definition 9.5.2 (Notations in Σ+-trees) Let S = (V, E) be a compact Σ+-tree.
For v ∈ V, v̄ denotes the concatenation of all path labels from the root of S to v.
|v̄| is called the string-depth of v and is denoted by d(v).
S is said to display α ∈ Σ* iff there exist v ∈ V and β ∈ Σ* such that v̄ = αβ.
words(S) = {α : S displays α} is the set of all strings displayed by S.
For i ∈ {1, 2, . . . , n}, t_i t_{i+1} . . . t_n is called the i-th suffix of T and is denoted by T_{i...n}. In general, we use the notation T_{i...j} as an abbreviation of t_i t_{i+1} . . . t_j.
For example, ū = pqr for a node u that is reached from the root by edges labeled p, q and r:
[Figure: a path from the root to the node u with edge labels p, q, r.]
The path label of the root is the empty string.
9.6 Suffix tree
Definition 9.6.1 Let factor(T) denote the set of all factors (= subwords) of T, factor(T) = {T_{i...j} : 1 ≤ i ≤ j ≤ n}. The suffix tree of T is a compact Σ+-tree S with words(S) = factor(T).
Here is an example.
For several reasons, we shall find it useful that each suffix ends in a leaf of S. This can be accomplished by adding a new character $ to the end of T, and building the suffix tree over T$.
From now on, we assume that T terminates with a $, and we define $ to be lexicographically smaller than all other characters in Σ: $ < a for all a ∈ Σ. This gives the desired one-to-one correspondence between T's suffixes and the leaves of S, which implies that we can label the leaves with a function l by the start index of the suffix they represent: l(v) = i iff v̄ = T_{i...n}. This also explains the name suffix tree. (Observe that we did not define suffix trees in terms of suffixes, but in terms of factors!)
For every leaf v with v̄ = T_{j...n} we define L(v) = {l(v)} = {j}. Recursively, for every branching node u we define
L(u) = ⋃_{v is child of u} L(v).
We call L(u) the leaf set of u.
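Computed bottom-up, the leaf sets follow directly from this definition; the sketch below assumes a nested-dict trie in which each leaf's number l(v) is stored under a hypothetical key "leaf":

```python
def leaf_sets(node: dict) -> set:
    """Return L(u) for a nested-dict (suffix) trie node: the union of
    all leaf numbers stored below it ('leaf' marks a leaf number)."""
    L = set()
    for key, child in node.items():
        if key == "leaf":
            L.add(child)              # a leaf contributes {l(v)}
        else:
            L |= leaf_sets(child)     # union over all children
    return L

# a node with a leaf child (number 1) and its own leaf annotation (2)
print(leaf_sets({"a": {"b": {"leaf": 1}, "leaf": 2}}))  # {1, 2}
```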
We can use S to find occurrences of a pattern P in T.
From T we will construct a shorter text T', from whose suffix array A' we can derive A12. To this end, we look at all character triplets t_i t_{i+1} t_{i+2} with i ≢ 0 (mod 3), and sort the resulting set of triplets S = {t_i t_{i+1} t_{i+2} : i ≢ 0 (mod 3)} with a bucket-sort in O(n) time. (To have all triplets well-defined, we pad T with sufficiently many $'s at the end.)
We define T' as the sequence of bucket numbers of the triplets starting at positions i ≡ 1 (mod 3), followed by the bucket numbers of the triplets starting at positions i ≡ 2 (mod 3), and compute the suffix array A' for T' recursively (the recursion ends when |T'| = O(1)). This already gives us a sorting of the suffixes starting at positions i ≢ 0 (mod 3), because of the following:
The suffixes starting at i ≡ 2 (mod 3) in T have a one-to-one correspondence to the suffixes in T' and are hence in correct lexicographic order.
The suffixes starting at i ≡ 1 (mod 3) in T are longer than they should be (because of the bucket numbers of the triplets starting at positions i ≡ 2 (mod 3)), but due to the $'s in the middle of T', this does not affect their lexicographic order. Hence A12 is obtained from A' by mapping positions in T' back to positions in T:
A12[i] = 1 + 3A'[i]              if A'[i] < |T'|/2
A12[i] = 2 + 3(A'[i] - |T'|/2)   otherwise
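The case distinction above can be sketched as follows (our own helper; A' is taken 0-based, with the first |T'|/2 entries of T' corresponding to positions i ≡ 1 (mod 3)):

```python
def derive_A12(A_prime: list, len_T_prime: int) -> list:
    """Map suffix-array positions of the reduced text T' back to
    text positions of T, following the case distinction above."""
    half = len_T_prime // 2
    A12 = []
    for p in A_prime:
        if p < half:
            A12.append(1 + 3 * p)             # came from i = 1 (mod 3)
        else:
            A12.append(2 + 3 * (p - half))    # came from i = 2 (mod 3)
    return A12

print(derive_A12([0, 2, 1, 3], 4))  # [1, 2, 4, 5]
```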
9.9.2 Creation of A0
The following lemma follows immediately from the definition of A12.
Lemma 9.9.2 Let i, j ≡ 0 (mod 3). Then T_{i...n} < T_{j...n} iff t_i < t_j, or t_i = t_j and i + 1 appears before j + 1 in A12.
This suggests the following strategy to construct A0:
1. Initialize A0 with all numbers 0 ≤ i < n for which i ≡ 0 (mod 3).
2. Bucket-sort A0, where A0[i] has sort-key t_{A0[i]}.
3. Bucket-sort each bucket obtained from step 2 again, using the position of A0[i] + 1 in A12 as a sort-key for A0[i]. This step can be realized by a left-to-right scan of A12: if A12[i] ≡ 1 (mod 3), swap A12[i] - 1 with the entry at the current beginning of its bucket, and move the beginning of that bucket one to the right (such that all processed entries remain where they have just been moved to).
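The effect of steps 1-3 can be sketched compactly; for clarity, this sketch replaces the two bucket-sort passes by a single comparison sort on the key (first character, rank of i + 1 in A12), which yields the same order by Lemma 9.9.2, though not the O(n) bound:

```python
def build_A0(T: str, A12: list) -> list:
    """Sort the suffixes starting at i = 0 (mod 3) via Lemma 9.9.2:
    T[i:] < T[j:] iff (T[i], pos of i+1 in A12) < (T[j], pos of j+1)."""
    rank = {p: r for r, p in enumerate(A12)}          # position in A12
    A0 = [i for i in range(len(T)) if i % 3 == 0]
    A0.sort(key=lambda i: (T[i], rank.get(i + 1, -1)))
    return A0

# T = abab$ with 0-based positions: A12 = [4, 2, 1] sorts the
# suffixes at positions 1, 2 and 4 ($ < ab$ < bab$)
print(build_A0("abab$", [4, 2, 1]))  # [0, 3]
```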
9.9.3 Merging A12 with A0
We scan A0 and A12 simultaneously. The suffixes from A0 and A12 can be compared among each other in O(1) time by the following lemma, which again follows directly from the definition of A12.
Lemma 9.9.3 Let i and j be two indices in T with i ≡ 0 (mod 3).
1. If j ≡ 1 (mod 3), then T_{i...n} < T_{j...n} iff t_i < t_j, or t_i = t_j and i + 1 appears before j + 1 in A12.
2. If j ≡ 2 (mod 3), then T_{i...n} < T_{j...n} iff t_i < t_j, or t_i = t_j and t_{i+1} < t_{j+1}, or t_i t_{i+1} = t_j t_{j+1} and i + 2 appears before j + 2 in A12.
One should note that in both of the above cases the values i + 1 and j + 1 (or i + 2 and j + 2, respectively) appear in A12; this is why it is enough to compare at most 2 characters before one can derive the lexicographic order of T_{i...n} and T_{j...n} from A12.
In order to check efficiently whether i appears before j in A12, we need the inverse suffix array A12^{-1} of A12, defined by A12^{-1}[A12[i]] = i for all i. With this, it is easy to see that i appears before j in A12 iff A12^{-1}[i] < A12^{-1}[j].
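Both cases of Lemma 9.9.3 combine into one O(1) comparison; the sketch below represents A12^{-1} as a mapping inv12 (the function name is ours, and indices are assumed to stay in range thanks to the $-padding):

```python
def smaller(i: int, j: int, T: str, inv12: dict) -> bool:
    """Decide T[i:] < T[j:] in O(1) for i = 0 (mod 3), j != 0 (mod 3),
    following Lemma 9.9.3; inv12[p] = position of p in A12."""
    if j % 3 == 1:   # compare one character, then ranks of i+1, j+1
        return (T[i], inv12[i + 1]) < (T[j], inv12[j + 1])
    else:            # j = 2 (mod 3): two characters, then i+2, j+2
        return (T[i], T[i + 1], inv12[i + 2]) < (T[j], T[j + 1], inv12[j + 2])

inv12 = {1: 2, 2: 1, 4: 0}            # from A12 = [4, 2, 1] for abab$
print(smaller(0, 1, "abab$", inv12))  # True:  abab$ < bab$
print(smaller(0, 2, "abab$", inv12))  # False: ab$ < abab$
```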
The running time T(n) of the whole suffix sorting algorithm presented in this section is given by the recursion T(n) = T(2n/3) + O(n), which solves to T(n) = O(n).
Theorem 9.9.4 We can construct the sux array for a text of length n in O(n) time.
9.10 Linear-Time Construction of LCP-Arrays
It remains to be shown how the LCP-array H can be constructed in O(n) time. Here, we assume that we are given T, A, and A^{-1}, the latter being the inverse suffix array.
We will construct H in the order of the inverse suffix array (i.e., filling H[A^{-1}[i]] before H[A^{-1}[i+1]]), because in this case we know that H cannot decrease too much, as shown next.
Going from suffix T_{i...n} to T_{i+1...n}, we see that the latter equals the former, but with the first character t_i truncated. Let h = H[A^{-1}[i]]. Then the suffix T_{j...n}, j = A[A^{-1}[i] - 1], has a longest common prefix with T_{i...n} of length h. So T_{i+1...n} has a longest common prefix with T_{j+1...n} of length h - 1. But every suffix T_{k...n} that is lexicographically between T_{j+1...n} and T_{i+1...n} must have a longest common prefix with T_{j+1...n} that is at least h - 1 characters long (for otherwise T_{k...n} would not be in lexicographic order). We have thus proved the following:
Lemma 9.10.1 For all 1 ≤ i < n: H[A^{-1}[i + 1]] ≥ H[A^{-1}[i]] - 1.
This gives rise to the following elegant algorithm to construct H:

1  h ← 0, H[1] ← 0
2  for i = 1, . . . , n do
3      if A^{-1}[i] ≠ 1 then
4          while t_{i+h} = t_{A[A^{-1}[i]-1]+h} do h ← h + 1
5          H[A^{-1}[i]] ← h
6          h ← max{0, h - 1}
7      end
8  end
The linear running time follows because h is always less than n and is decreased at most n times in line 6. Hence, the number of times h is increased in line 4 is bounded by 2n, so there are at most 2n successful character comparisons in the whole algorithm. We have proved:
Theorem 9.10.2 The LCP-array for a text of length n can be constructed in O(n) time.
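The pseudocode above translates almost line by line into Python (a sketch with 0-based indices, so explicit bound checks replace the sentinel argument):

```python
def lcp_array(T: str, A: list) -> list:
    """Construct the LCP-array H in O(n), following the algorithm
    above: H[r] = length of the longest common prefix of the suffixes
    of rank r-1 and r in the suffix array A (0-based positions)."""
    n = len(T)
    inv = [0] * n
    for r, p in enumerate(A):
        inv[p] = r                      # inverse suffix array A^{-1}
    H = [0] * n
    h = 0
    for i in range(n):                  # text order, so h drops by <= 1
        if inv[i] > 0:
            j = A[inv[i] - 1]           # suffix preceding T[i:] in A
            while i + h < n and j + h < n and T[i + h] == T[j + h]:
                h += 1
            H[inv[i]] = h
            h = max(0, h - 1)
    return H

print(lcp_array("abab$", [4, 2, 0, 3, 1]))  # [0, 0, 2, 0, 1]
```

For abab$ the sorted suffixes are $, ab$, abab$, b$, bab$, so the entry H[2] = 2 records the common prefix ab of ab$ and abab$.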