Data Structures
Lecture 11:
Data Structures for Strings
OUTLINE
‣ Tries
‣ R-way tries
‣ Patricia Trees
‣ Suffix Trees (details will be discussed in BBM202)
‣ Suffix Arrays (details will be discussed in BBM202)
Introduction
Numbers as key values are data items of constant size and can be
compared in constant time.
Text fragments have a length; they are not elementary objects that
the computer can process in a single step.
• Pneumonoultramicroscopicsilicovolcanoconiosis !!!
Applications
Search Engines
Spell checkers
Tries
[Figure: example trie with its key/value table: by 4, sea 6, sells 1, she 0, shells 3, shore 7, the 5.]
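Search in a trie
Follow the link for each key character in turn; the value, if any, is stored in the node reached by the last character.
get("shells")
• Follow s, h, e, l, l, s; the last node holds a value (return 3).
get("she")
• Follow s, h, e; the last node holds a value (return 0). The search may end at an internal node.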
get("shell")
• Follow s, h, e, l, l; no value is associated with the last key character (return null).
get("shelter")
• Follow s, h, e, l; there is no link to 't' (return null).
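The three search outcomes above (hit, no value at the last character, missing link) can be sketched as follows. This is a minimal illustration, not the course's reference code: the alphabet size R = 256 and the sentinel NO_VALUE (standing in for the slides' "return null") are assumptions, and nodes are never freed, to keep the sketch short.

```cpp
#include <cstddef>
#include <string>

const int R = 256;        // assumed alphabet size (extended ASCII)
const int NO_VALUE = -1;  // assumed sentinel meaning "no value stored here"

struct Node {
    int value = NO_VALUE;
    Node* next[R] = {};   // all child links start out null
};

// Insert key with value val, creating nodes along the path as needed.
// Returns the (possibly newly created) root of the subtrie.
Node* put(Node* x, const std::string& key, int val, std::size_t d = 0) {
    if (x == nullptr) x = new Node;
    if (d == key.size()) {          // last key character consumed
        x->value = val;             // store or overwrite the value
        return x;
    }
    unsigned char c = key[d];
    x->next[c] = put(x->next[c], key, val, d + 1);
    return x;
}

// Return the value stored for key, or NO_VALUE on a search miss.
int get(const Node* x, const std::string& key) {
    for (std::size_t d = 0; d < key.size(); ++d) {
        if (x == nullptr) return NO_VALUE;   // a link on the path was missing
        x = x->next[static_cast<unsigned char>(key[d])];
    }
    // Miss if the final link was null or the final node holds no value.
    return x == nullptr ? NO_VALUE : x->value;
}
```

Starting from an empty trie (`Node* root = nullptr;`) and inserting the demo keys with `root = put(root, key, value);`, `get(root, "shell")` and `get(root, "shelter")` both report a miss, for the two different reasons shown on the slides.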
Insertion into a trie
put("shore", 7)
• Follow s, h; create new nodes for o, r, e and store the value 7 in the last node.
Trie construction demo
Keys are inserted one at a time into an initially empty trie:
• put("she", 0): creates the path s, h, e and stores 0 at the last node.
• put("sells", 1): shares s, then creates e, l, l, s and stores 1.
• put("sea", 2): shares s, e, then creates a and stores 2.
• put("shells", 3): shares s, h, e, then creates l, l, s and stores 3.
• put("by", 4): creates b, y and stores 4.
• put("the", 5): creates t, h, e and stores 5.
• put("sea", 6): the key already exists; overwrite old value with new value (2 becomes 6).
• put("shore", 7): shares s, h, then creates o, r, e and stores 7.
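Trie representation: implementation
class Node
{
public:
    int value;        // value associated with the key ending at this node
    Node* next[R];    // one child pointer per character of the alphabet
};
• Each node has a child pointer for each character in the alphabet.
• There is no need to search for a character, but a pointer is reserved in
memory for every character.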
R-way trie: implementation
Node root;
R-way trie: implementation (continued)
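// Allocate a new trie node with all R child links cleared.
Node* getNode(){
    // new throws std::bad_alloc on failure, so no NULL check is needed.
    Node* pNode = new Node;
    for (int i = 0; i < R; i++)
        pNode->next[i] = NULL;
    return pNode;
}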
Trie performance
Search miss.
• Could have mismatch on first character.
• Typical case: examine only a few characters (sublinear).
Bottom line. Fast search hit and even faster search miss, but
wastes space.
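To make "wastes space" concrete, the sketch below counts how many of the R links per node are null in an R-way trie. The names insert, countLinks and LinkCount are hypothetical helpers introduced for this illustration, and R = 256 is an assumed alphabet size.

```cpp
#include <cstddef>
#include <string>

const int R = 256;  // assumed alphabet size

struct Node {
    bool isKey = false;   // true if a key ends at this node
    Node* next[R] = {};   // one link per character, initially all null
};

// Insert key into the trie rooted at x and return the subtrie root.
Node* insert(Node* x, const std::string& key, std::size_t d = 0) {
    if (x == nullptr) x = new Node;
    if (d == key.size()) { x->isKey = true; return x; }
    unsigned char c = key[d];
    x->next[c] = insert(x->next[c], key, d + 1);
    return x;
}

struct LinkCount {
    std::size_t nodes = 0;      // number of nodes in the trie
    std::size_t nullLinks = 0;  // number of unused (null) child links
};

// Walk the whole trie, counting nodes and null links.
void countLinks(const Node* x, LinkCount& out) {
    if (x == nullptr) return;
    ++out.nodes;
    for (int c = 0; c < R; ++c) {
        if (x->next[c] == nullptr) ++out.nullLinks;
        else countLinks(x->next[c], out);
    }
}
```

For the seven keys of the construction demo (she, sells, sea, shells, by, the, shore) the trie has 20 nodes, so 20 × 256 = 5120 links are allocated and only 19 of them are used.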
Tries cont.
Strings are sequences of characters from some alphabet. But for use in
the computer, we need one further important piece of information: how to
recognize where the string ends.
• There are many nonprintable ASCII codes that should never occur in
a text and ’\0’ is just one of them.
• There are also many applications in which the strings do not represent
text, but, for example, machine instructions.
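The point about string termination can be illustrated with a short sketch: a '\0'-terminated string cannot contain '\0' itself, whereas a string that stores its length can. The helper names cLength and withEmbeddedNul are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by a '\0'-terminated C string: scanning stops at the first '\0'.
std::size_t cLength(const std::string& s) {
    return std::strlen(s.c_str());
}

// A five-character string with an embedded '\0', built from an explicit
// (pointer, length) pair so that the '\0' is kept as ordinary data.
std::string withEmbeddedNul() {
    return std::string("ab\0cd", 5);
}
```

Here `withEmbeddedNul().size()` is 5, while `cLength(withEmbeddedNul())` is 2: the terminator convention silently truncates data that legitimately contains the terminator.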
Strings:
• exam
• example
• fail
• false
• tree
• trie
• true
[Figure: trie storing these strings, with each string terminated by \0.]
Find, Insert and Delete
implementation                search hit   search miss   insert   space (references)
hashing (separate chaining)   N            N             1        N
R-way trie.
• Method of choice for small R.
• Too much memory for large R.
Alphabet Size
• The problem here is the dependence on the size of the alphabet, which
determines the size of the nodes.
• There are several ways to reduce or avoid the problem of the
alphabet size.
• A simple method is to replace the big nodes by linked lists of all the entries
that are really used.
• Another way to avoid the problem with the alphabet size R is alphabet
reduction. We can represent the alphabet as a set of k-tuples from some
direct product R1 × R2 × … × Rk.
• The path compressed trie contains only nodes with at least two
outgoing edges, that is, every internal node has at least two children.
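The linked-list idea from the list above can be sketched as follows: each node keeps a list of only the children that are actually used, so node size no longer depends on the alphabet size. The type ListNode and the functions findChild, insert and contains are hypothetical names chosen for this illustration.

```cpp
#include <string>

struct ListNode {
    char ch;                      // character on the edge leading to this node
    bool isKey = false;           // true if a key ends at this node
    ListNode* child = nullptr;    // first child in the list of used children
    ListNode* sibling = nullptr;  // next child of the same parent
    explicit ListNode(char c) : ch(c) {}
};

// Scan the parent's child list for the child reached via c (nullptr if absent).
ListNode* findChild(const ListNode* parent, char c) {
    for (ListNode* p = parent->child; p != nullptr; p = p->sibling)
        if (p->ch == c) return p;
    return nullptr;
}

// Insert key below root, adding list entries only for characters actually used.
void insert(ListNode* root, const std::string& key) {
    ListNode* x = root;
    for (char c : key) {
        ListNode* next = findChild(x, c);
        if (next == nullptr) {
            next = new ListNode(c);
            next->sibling = x->child;  // push onto the front of the child list
            x->child = next;
        }
        x = next;
    }
    x->isKey = true;
}

// Report whether key was inserted before.
bool contains(const ListNode* root, const std::string& key) {
    const ListNode* x = root;
    for (char c : key) {
        x = findChild(x, c);
        if (x == nullptr) return false;
    }
    return x->isKey;
}
```

The trade-off is that each step now scans a list whose length is at most the number of used children, instead of indexing an array in constant time.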
Trie
[Figure: trie for S = {ape, apple, org, organ}, with $ marking the end of each string. Nodes with a single child are marked as redundant nodes; merging each chain of such nodes into one edge gives the edge labels ap, e$, ple$, org, $ and an$.]
Compressed Trie
[Figure: compressed trie for S = {ape, apple, org, organ}. The root has the edges ap and org; below ap are the leaves e$ and ple$, and below org are the leaves $ and an$.]
[Figure: the same trie with each node drawn as an array of child links indexed by character (a|b|…|z). Each node is annotated with a length: 0 at the root, 2 for ap, 3 for org, and 2, 4, 1, 3 for e$, ple$, $, an$. Each leaf refers to its string: a|p|e|$, a|p|p|l|e|$, o|r|g|$, o|r|g|a|n|$.]
Alternative Representation-via string array indexes
[0] a p e $
[1] a p p l e $
[2] o r g $
[3] o r g a n $
[Figure: the compressed trie with each edge label replaced by a triple whose first component is the word index. In the figure the other two components match the first and last character positions of the label within that word: the root is (-, -, -), ap is (0, 0, 1), org is (2, 0, 2), e$ is (0, 2, 3), ple$ is (1, 2, 5), $ is (2, 3, 3) and an$ is (3, 3, 5).]
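A small sketch of how such index triples can be turned back into edge labels. It assumes the reading used in the figure (word index, first position, last position, both positions inclusive); the names EdgeRef and label are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reference to a label stored inside one of the input strings.
struct EdgeRef {
    std::size_t word;   // index of the string in the array
    std::size_t first;  // position of the label's first character
    std::size_t last;   // position of the label's last character (inclusive)
};

// Recover the label text that the triple e refers to.
std::string label(const std::vector<std::string>& words, const EdgeRef& e) {
    return words[e.word].substr(e.first, e.last - e.first + 1);
}
```

With the strings ape$, apple$, org$ and organ$ at indexes 0 to 3, the triple (1, 2, 5) decodes to ple$. Every label costs three integers regardless of its length, and no characters are copied into the trie.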
Patricia Tree (a.k.a. Compressed Trie)
• Inserting a string s in a Patricia Tree:
• similar to searching, up until the point where the search gets
stuck
• it gets stuck since s is not in the tree
• if the search is over in the middle of an edge, e, then e is split
into two new edges, joined by a new node u
• the remainder of s becomes a new edge label, which
connects u to the new leaf node
• if the search is over at a node u, the remainder of s becomes a
new edge label, which connects u to the new leaf node
• O(|s|+|Σ|): |s| for search + |Σ| for node creation & initialization
• Σ: alphabet
[0] a p e $
[1] a p p l e $
[2] o r g $
[3] o r g a n $
[Figure: compressed trie for these strings with edges ap (0, 0, 1) and org (2, 0, 2) at the root and leaves e$ (0, 2, 3), ple$ (1, 2, 5), $ (2, 3, 3) and an$ (3, 3, 5).]
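The insertion procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the lecture's implementation: labels are stored as explicit strings instead of index triples, children are kept in a std::map instead of an array of |Σ| links, every inserted string is assumed to end with the marker $, and the names PNode, commonPrefix, insert and contains are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct PNode {
    struct Edge {
        std::string label;             // assumption: explicit label text
        std::unique_ptr<PNode> child;  // node at the lower end of the edge
    };
    std::map<char, Edge> edges;        // keyed by the first character of each label
};

// Number of leading characters of label that match s starting at position pos.
std::size_t commonPrefix(const std::string& label, const std::string& s, std::size_t pos) {
    std::size_t k = 0;
    while (k < label.size() && pos + k < s.size() && label[k] == s[pos + k]) ++k;
    return k;
}

// Insert s (which must end with '$') into the compressed trie rooted at root.
void insert(PNode& root, const std::string& s) {
    PNode* u = &root;
    std::size_t pos = 0;
    while (pos < s.size()) {
        auto it = u->edges.find(s[pos]);
        if (it == u->edges.end()) {
            // Search is over at node u: the remainder of s labels a new edge to a new leaf.
            u->edges.emplace(s[pos], PNode::Edge{s.substr(pos), std::make_unique<PNode>()});
            return;
        }
        PNode::Edge& e = it->second;
        std::size_t k = commonPrefix(e.label, s, pos);
        if (k < e.label.size()) {
            // Search is over in the middle of edge e: split e at a new node.
            auto mid = std::make_unique<PNode>();
            mid->edges.emplace(e.label[k], PNode::Edge{e.label.substr(k), std::move(e.child)});
            e.label.resize(k);
            e.child = std::move(mid);
        }
        u = e.child.get();
        pos += k;
    }
}

// Report whether s (including its trailing '$') is stored in the trie.
bool contains(const PNode& root, const std::string& s) {
    const PNode* u = &root;
    std::size_t pos = 0;
    while (pos < s.size()) {
        auto it = u->edges.find(s[pos]);
        if (it == u->edges.end()) return false;
        const std::string& label = it->second.label;
        if (s.compare(pos, label.size(), label) != 0) return false;
        u = it->second.child.get();
        pos += label.size();
    }
    return true;
}
```

Inserting ape$, apple$, org$ and organ$ produces the root edges ap and org; a later insert of orto$ stops in the middle of org and splits it into or followed by g and to$.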
Patricia Tree (a.k.a. Compressed Trie)
insert “orto”
[0] a p e $
[1] a p p l e $
[2] o r g $
[3] o r g a n $
[4] o r t o $
[Figure: the search for orto$ stops in the middle of the edge org (2, 0, 2). The edge is split at a new node u (the split node): the edge from the root becomes or (2, 0, 1), and u gets two children, the old subtree via g (leading to $ (2, 3, 3) and an$ (3, 3, 5)) and a new leaf node via to$.]
Patricia Tree (a.k.a. Compressed Trie)
insert “apron”
[0] a p e $
[1] a p p l e $
[2] o r g $
[3] o r g a n $
[4] o r t o $
[5] a p r o n $
[Figure: the search for apron$ fails at the node below the edge ap (0, 0, 1), which has the children e$ and ple$ but no edge starting with r. The remainder ron$ becomes the label of a new edge from that node to a new leaf.]
Patricia Tree (a.k.a. Compressed Trie)
[2] o r g $
[3] o r g a n $
[4] o r t o $
[Figure: deletion example. Once the leaf $ (2, 3, 3) is removed, the node w below the edge org (2, 0, 2) is left with a single child, an$ (3, 3, 5). w is joined with its child (the joined node), and the edge from the root becomes (3, 0, 5). The leaves e$ (0, 2, 3) and ple$ (1, 2, 5) are unchanged.]
Suffix Trees
Suffix Arrays
Possible Advantages:
• Its size does not depend on the size of the alphabet.
• It offers a quite different tool to attack the same type of string
problems.
• Its implementation is straightforward, and it is said to be smaller than
suffix trees.
The Underlying Idea
• Search cost: O(|s| log N)
|s|: length of the query string
N: suffix array size
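The O(|s| log N) bound corresponds to a binary search over the sorted suffixes, where each of the O(log N) steps compares at most |s| characters. The sketch below is a minimal illustration: the construction by plain sorting is a naive stand-in chosen for brevity, and the names buildSuffixArray and occurs are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Build the suffix array of text: the start positions of all suffixes,
// sorted by the lexicographic order of the suffixes they denote.
// Naive construction by sorting, used here only for illustration.
std::vector<std::size_t> buildSuffixArray(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Binary search for s: O(log N) steps, each comparing at most |s| characters.
bool occurs(const std::string& text, const std::vector<std::size_t>& sa,
            const std::string& s) {
    std::size_t lo = 0, hi = sa.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        // Compare the first |s| characters of the suffix at sa[mid] with s.
        int cmp = text.compare(sa[mid], s.size(), s);
        if (cmp == 0) return true;   // this suffix starts with s
        if (cmp < 0) lo = mid + 1;   // suffix is smaller: search the right half
        else hi = mid;               // suffix is larger: search the left half
    }
    return false;
}
```

For the text banana the suffix array is 5, 3, 1, 0, 4, 2, and a query such as ana is found after a few comparisons without scanning the whole text.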