
Linear Time Construction of Suffix Tree
Presented By
Dr. Shazzad Hosain
Asst. Prof. EECS, NSU
High-level Description of Ukkonen’s Algorithm
• Ukkonen’s algorithm is divided into m phases, where m = |S|. In phase i+1,
tree i+1 (for the prefix S[1… i+1]) is constructed from tree i.
• Each phase i+1 is further divided into i+1 extensions, one for
each of the i+1 suffixes of S[1… i+1].
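
The phase/extension structure can be sketched in a few lines of Python. This only enumerates which suffix each extension handles; it does not build a tree. The function name `phases_and_extensions` is a hypothetical helper introduced here for illustration.

```python
def phases_and_extensions(s):
    """Return, for each phase i, the suffixes of s[1..i] handled by its extensions."""
    phases = []
    for i in range(1, len(s) + 1):      # phase i works on the prefix S[1..i]
        prefix = s[:i]
        # extension j (1-based) handles the suffix S[j..i] of that prefix
        phases.append([prefix[j:] for j in range(i)])
    return phases


for number, suffixes in enumerate(phases_and_extensions("aba"), start=1):
    print(number, suffixes)
```

Phase i always has exactly i extensions, so doing every extension explicitly already costs on the order of m^2 steps before any tree work is counted.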
[Figure: phases and extensions for S = aba]
phase 1 : S[1…1] {a}
phase 2 : S[1…2] {ab, b}
phase 3 : S[1…3] {aba, ba, a}
How do suffix links help?
[Figure: suffix tree for S = MISSISSIPI (positions 1…10), with the phases 1 : M through 10 : MISSISSIPI listed alongside]
Corollary 6.1.1: In Ukkonen’s algorithm, any newly created internal
node will have a suffix link from it by the end of the next extension.
What is achieved so far?

Not so much. Worst-case running time is O(m^2) for a phase.
Trick 1: Skip/Count Trick

There must be a path labeled γ from s(v).

Walking down along γ character by character takes time
proportional to |γ|.

The skip/count trick reduces the traversal time to something
proportional to the number of nodes on the path.
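
A minimal sketch of the skip/count walk, assuming edges are stored as index pairs into the text and that the path being followed is known to exist. The names `Node` and `skip_count_walk` are hypothetical and the tree is built by hand. Only the first character of each edge is looked at; the rest is skipped by length.

```python
class Node:
    """Tree node; each child edge is stored as (start, end, child) over a shared text."""

    def __init__(self):
        self.children = {}  # first character of the edge label -> (start, end, child)


def skip_count_walk(text, node, lo, hi):
    """Walk down from `node` along text[lo:hi], assuming that path exists.

    Returns (last node reached, index in text where the unmatched remainder
    starts, number of edges skipped over).
    """
    hops = 0
    while lo < hi:
        start, end, child = node.children[text[lo]]
        edge_len = end - start
        if hi - lo < edge_len:
            break               # the path ends inside this edge
        lo += edge_len          # jump over the whole edge in one step
        node = child
        hops += 1
    return node, lo, hops


# Hand-built chain whose edges have lengths 2, 2, 3, 3 (10 characters in all).
text = "zabcdefghy"
root = Node()
nodes = [root]
pos = 0
for length in (2, 2, 3, 3):
    child = Node()
    nodes[-1].children[text[pos]] = (pos, pos + length, child)
    nodes.append(child)
    pos += length

end_node, rest, hops = skip_count_walk(text, root, 0, len(text))
print(hops, "edge hops for a path of", len(text), "characters")
```

The loop runs once per edge rather than once per character, which is why the cost depends on the number of nodes on the path.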

[Figure: path labeled zabcdefghy, with edge lengths 2, 2, 3, 3 marked between its nodes]

But what does it buy in terms of worst-case bounds?

Lemma 6.1.2: Let (v, s(v)) be any suffix link traversed during
Ukkonen’s algorithm. At that moment, the node-depth of v is
at most one greater than the node-depth of s(v).

[Figure: example suffix links annotated with node-depths: v=2, s(v)=1; v=3, s(v)=3; v=4, s(v)=5]

Theorem 6.1.1: Using the skip/count trick, any phase of
Ukkonen’s algorithm takes O(m) time.
In a single extension
– The algorithm walks up at most one edge
– Finds a suffix link and traverses it
– Walks down some number of nodes
– Applies the suffix extension rules
– And may add a suffix link

All operations except the down-walk take constant time.
So only the down-walk time needs to be analyzed.
In a single extension
– Walking up at most one edge decreases the current node-depth by at most one
– Traversing the suffix link decreases the node-depth by at most another one
– Each down-walk moves to a node of greater node-depth
Over the entire phase
– The current node-depth is decremented at most 2m times
– Since no node can have depth greater than m, the total possible increment
to the current node-depth is bounded by 3m
– So the total number of edge traversals is bounded by 3m
– Since each edge traversal takes constant time, all the down-walking
in a phase is O(m).
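
The counting argument can be written out as a short derivation. The symbols are introduced here for illustration and are not in the slides: d_start and d_end are the current node-depth at the start and end of the phase, U is the total number of depth increments (down-walk edge traversals), and D is the total number of decrements. A phase has at most m extensions, each decreasing the depth by at most 2, so D ≤ 2m, and every depth lies between 0 and m.

```latex
\begin{align*}
d_{\text{end}} &= d_{\text{start}} + U - D \\
U &= d_{\text{end}} - d_{\text{start}} + D \\
  &\le m - 0 + 2m \\
  &= 3m
\end{align*}
```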
Complexity
• There are m phases
• Each phase takes O(m)
• So the running time is O(m^2)

Two more tricks and we are done


Simple Implementation Detail
• Suffix tree may require O(m^2) space
• Consider a string of 26 distinct characters, e.g. abcdefghijklmnopqrstuvwxyz
• Every suffix begins with a distinct character, so there
are 26 edges out of the root.
• The edge labels require 26x27/2 characters in all
• So O(m) is impossible to achieve in this
representation.
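
The character count can be checked directly. This is a small sketch that assumes all characters of the string are distinct, so that every suffix is a single edge out of the root; `explicit_label_chars` is a hypothetical helper name.

```python
import string


def explicit_label_chars(s):
    """Characters needed to write every edge label explicitly.

    Valid only when all characters of s are distinct, so that each suffix
    hangs off the root as one edge labeled with the whole suffix.
    """
    assert len(set(s)) == len(s), "sketch assumes all characters are distinct"
    return sum(len(s) - i for i in range(len(s)))


alphabet = string.ascii_lowercase            # 26 distinct characters
total = explicit_label_chars(alphabet)
print(total, "==", 26 * 27 // 2)
```

In general a string of m distinct characters needs m(m+1)/2 label characters, which grows quadratically.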
Alternative Representation of Suffix Tree
Edge Label Compression

[Figure: a fragment of the suffix tree (left) and the same fragment with edge labels compressed to index pairs (right); one edge is annotated "Could be 8,9"]

The number of edges is at most 2m – 1, and two numbers are written on each edge, so the space is O(m).
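
A minimal sketch of edge-label compression, using 1-based inclusive positions as the slides do. An edge stores only two integers and the label is recovered from the text on demand. The helper names `edge_label` and `max_edges` are assumptions made for this example.

```python
def edge_label(text, start, end):
    """Recover the label of an edge stored as the 1-based inclusive pair (start, end)."""
    return text[start - 1:end]


def max_edges(m):
    """Upper bound on the number of edges quoted in the slides for a string of length m."""
    return 2 * m - 1


text = "MISSISSIPI"
print(edge_label(text, 3, 4))        # the edge (3, 4) stands for text[3..4]
print(max_edges(len(text)))          # at most this many edges, two integers each
```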
[Figure: suffix tree of MISSISSIPI going from phase 7 (S[1…7]) to phase 8 (S[1…8]), with leaves 1 to 4 and the implicit and explicit extensions marked]

Observation 1: Rule 3 is a show stopper. Once it applies in a phase, we stop further extensions in that phase.


[Figure: the same tree with leaf edges labeled by the index pairs 1,7; 2,7; 3,7; 4,7 and the global end e=8; the explicit extensions are marked as the major cost]
Observation 2: Once a leaf always a leaf

At any phase the cost is only for explicit extension
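
One common way to get "once a leaf always a leaf" for free is to let every leaf edge share a single end index e, so that incrementing e extends all leaves at once. The sketch below assumes 1-based inclusive positions as in the slides; the classes `GlobalEnd` and `LeafEdge` are hypothetical names, not part of the original material.

```python
class GlobalEnd:
    """Shared, mutable end index e (1-based, inclusive)."""

    def __init__(self, value):
        self.value = value


class LeafEdge:
    """Leaf edge stored as (start, e); `end` is the shared object, not a copy."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def label(self, text):
        return text[self.start - 1:self.end.value]


text = "MISSISSIPI"
e = GlobalEnd(7)                                   # end of phase 7
leaves = [LeafEdge(start, e) for start in (1, 2, 3, 4)]
before = [leaf.label(text) for leaf in leaves]

e.value = 8                                        # phase 8: one assignment extends every leaf
after = [leaf.label(text) for leaf in leaves]
print(before)
print(after)
```

No leaf is visited when e changes, so the implicit extensions of a phase cost constant time in total and only the explicit extensions need real work.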


[Figure: the tree after phase 9 (S[1…9] = MISSISSIP), with leaf edges labeled 1,9; 2,9; 3,9; 4,9; 6,9; 9,9, an internal edge labeled 2,5, and the global end e=9]

Since there are only m phases, the total number of explicit
extensions is bounded by 2m.

So the total number of down-walks is bounded by O(m).

Or: the time to construct the suffix tree is bounded by O(m).
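
For reference, here is a compact sketch that combines the ideas above: suffix links, skip/count, the show stopper, and the shared leaf end. It uses the widely used "active point" formulation, which is an assumption of this example and not the exact walk-up and suffix-link procedure described in the slides. Indices are 0-based with exclusive ends, and a leaf's end of `None` stands for the global end. All names are hypothetical.

```python
class Node:
    def __init__(self, start, end):
        self.start = start      # edge into this node is text[start:end]
        self.end = end          # None means "global end" (leaf edge)
        self.children = {}      # first character of edge label -> Node
        self.link = None        # suffix link (internal nodes only)


def build_suffix_tree(text):
    root = Node(-1, -1)
    active_node, active_edge, active_len = root, 0, 0
    remainder = 0                               # suffixes still to be added explicitly
    for pos, ch in enumerate(text):             # one phase per character
        last_internal = None
        remainder += 1
        while remainder > 0:                    # explicit extensions of this phase
            if active_len == 0:
                active_edge = pos
            first = text[active_edge]
            child = active_node.children.get(first)
            if child is None:
                # no path continues with ch: hang a new leaf here
                active_node.children[first] = Node(pos, None)
                if last_internal is not None:
                    last_internal.link = active_node
                    last_internal = None
            else:
                end = pos + 1 if child.end is None else child.end
                edge_len = end - child.start
                if active_len >= edge_len:      # skip/count: hop over the whole edge
                    active_edge += edge_len
                    active_len -= edge_len
                    active_node = child
                    continue
                if text[child.start + active_len] == ch:
                    # show stopper: ch is already in the tree, end the phase
                    active_len += 1
                    if last_internal is not None:
                        last_internal.link = active_node
                    break
                # split the edge and add a new leaf below the new internal node
                split = Node(child.start, child.start + active_len)
                active_node.children[first] = split
                split.children[ch] = Node(pos, None)
                child.start += active_len
                split.children[text[child.start]] = child
                if last_internal is not None:
                    last_internal.link = split
                last_internal = split
            remainder -= 1
            if active_node is root and active_len > 0:
                active_len -= 1
                active_edge = pos - remainder + 1
            elif active_node is not root:
                active_node = active_node.link or root
    return root


def contains(root, text, pattern):
    """Check whether pattern labels a path from the root (i.e. is a substring)."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        end = len(text) if child.end is None else child.end
        label = text[child.start:end]
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
        node = child
    return True


def count_leaves(node):
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children.values())
```

A usage sketch: with a unique terminator appended, as in `build_suffix_tree("MISSISSIPI$")`, every suffix ends at its own leaf, so `count_leaves` should equal the length of the text and `contains` should accept exactly the substrings of the text.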
Reference
• Dan Gusfield, Algorithms on Strings, Trees and Sequences, Chapter 6
