Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

9.

Strings
Orientation
chapter combinatorial classes type of class type of GF type of GF
5 Trees unlabeled unlabeled OGFs
6 Permutations labeled labeled EGFs
7 Strings unlabeled unlabeled OGFs
8 Maps labeled labeled EGFs
Second half of text
surveys fundamental combinatorial classes
uses the symbolic method to study them
Bitstrings
10111110100101001100111000100111110110110100000111100001100111011101111101011000
11010010100011110100111100110100111011010111110000010110111001101000000111001110
11101110101100111010111001101000011000111001010111110011001000011001000101010010
10111000011011000110011101110011011011110111110011101011000011001100101000000110
10101100111010001101101110110010010110100101001101111100110000001111101000001111
10000010011000001100011000100001111001110011110000011001111110011011000100100111
10001010101110001110101100000110000011101010100010110001001101111110011110110010
00111011001011100100001100001001111010010011001100001100111010011010000101000111
00111111100110110111011011101010011011011100011111111010111010011000000100101110
10101000111100001010000011001000001101010010100011001100101010101110110111111110
11000000101111011011000101011010110010010000011101110010000001101010000000101000
11101111011011111011111111110100111010010111111011101001110100011000100100010010
00111111100111010110111110000100010001110000111010111100101011111001110101011111
00000010001111001110110101011100110000011110010010010101001100110011010011011110
10111100101000100110111100011001000111001000010100110101110111111010110010011100
01010010001011110110000110110101011010101111011001101101101000100110001111100111
01110110010011001110111000101010001101101001111111001101010111010001100110100001
00100011011010001100011111110011100110011110010110001100110011010001110111011101
23
9
29
6
13
1
24
18
42
5
2
70
25
0
24
7
23
3
Q. What is probability that an
N-bit random bitstring does not
contain 000?
Q. What is expected wait time
for the first occurrence of 000 in
a random bitstring?
Symbolic method for unlabeled objects (review)
Binary strings
Combinatorial class

Size function
Generating function

Construction

Functional equation

Expansion

type notation size GF
0 bit 0 1 z
1 bit 1 1 z
Atoms
() =

||
() =


[

]() =

B = ( + ) a binary string is a sequence of 0 or 1 bits


B, the class of all binary strings
|b |, the number of bits in b
Symbolic method for unlabeled objects (review)
type notation size GF
0 bit 0 1 z
1 bit 1 1 z
Atoms
() =

||
() =


[

]() =

B = + ( + ) B
a binary string is either empty or a
0 or 1 bit followed by a binary string
Binary strings (alternate construction)
Combinatorial class

Size function
Generating function

Construction

Functional equation

Expansion

B, the class of all binary strings
|b |, the number of bits in b
Symbolic method for unlabeled objects (review)
Combinatorial class

Size function
Generating function

Construction

Functional equation

Expansion

type notation size GF
0 bit 0 1 z
1 bit 1 1 z
Atoms

() =

||

() = + + ( +

()

() =
+

() =

+
+
=
+
a binary string with no 00 is either
empty or 0 or it is 1 or 01 followed
by a binary string with no 00
B

= + + ( + ) B

Binary strings with no consecutive 0 bits


=

2

2
vth

2
.
= l.6l803

2
.
= l.l7082
|b |, the number of bits in b
B00, binary strings with no 00
Binary strings without long runs
Combinatorial class

Size function

Binary strings with no k consecutive 0 bits
Generating function

() =

||
a binary string with no 0
k
is either a string of
fewer than k 0s or such a string followed by a
1 followed by a binary string with no 0
k

Construction
class P<k + 0 + 0
2
+ ... + 0
k1
OGF
+ +

+. . . +

= P
<
+P
<
S

Functional equation.

() =

()

Solve.

() =

Expand. [

()

vhere

s the domnunt root o l 2

= expct ormuu uvuube


See Asymptotics lecture
Sk, binary strings with no 00...0
|b |, the number of bits in b
Binary strings without long runs
Theorem. The number of binary strings of length N with no runs of k 0s is
where ck and k are easily-calculated constants.
sage: f2 = 1 - 2*x + x^3
sage: 1.0/f2.find_root(0, .99, x)
1.61803398874989
sage: f3 = 1 - 2*x + x^4
sage: 1.0/f3.find_root(0, .99, x)
1.83928675521416
sage: f4 = 1 - 2*x + x^5
sage: 1.0/f4.find_root(0, .99, x)
1.92756197548293
sage: f5 = 1 - 2*x + x^6
sage: 1.0/f5.find_root(0, .99, x)
1.96594823664510
sage: f6 = 1 - 2*x + x^7
sage: 1.0/f6.find_root(0, .99, x)
1.98358284342432
2
3
4
5
6

Information in GFs for strings

() =

||
=

+
+
=

0
{# o btstrngs o ength vth no runs o 0s}

(/2) =

{# o btstrngs o ength vth no runs o 0s}/2

(l/2) =

0
{# o btstrngs o ength vth no runs o 0s}/2

0
lr {lst bts o u rundom btstrng huve no runs o 0s}
=

0
lr {poston o end o rst run o 0s s > }
Theorem. The expected wait time for
the first run of k 0s in a random
bitstring is

(/) =
+

Theorem. The probability that an N-bit


random bitstring has no run of k 0s is
[

(/)

/)

= Lxpected poston o end o rst run o 0s wait time


Consecutive 0s in random bitstrings
k Sk(z)
approx. probability of no occurrence of
k consecutive 0s in N random bits
approx. probability of no occurrence of
k consecutive 0s in N random bits
approx. probability of no occurrence of
k consecutive 0s in N random bits
wait
time
N 10 100
1 (.5)
N
0.0010 <10
30
2
2 1.1708(.80901)
N
0.1406 <10
9
6
3 1.1375(.91864)
N
0.4869 0.0023 14
4 1.0917(.96328)
N
0.7510 0.0259 30
5 1.0575(.98297)
N
0.8906 0.1898 62
6 1.0350(.99174)
N
0.9526 0.4516 126

+

Validation of results
is always worthwhile when analyzing algorithms.
public class TestOccK
{
public static int find(int[] bits, int k)
{
int cnt = 0;
for (int i = 0; i < bits.length; i++)
{
if (cnt == k) return i;
if (bits[i] == 0) cnt++; else cnt = 0;
}
return bits.length;
}
public static void main(String[] args)
{
int w = Integer.parseInt(args[0]);
int maxK = Integer.parseInt(args[1]);
int[] bits = new int[w];
int[] sum = new int[maxK+1];
int T = 0;
int cnt = 0;
while (!StdIn.isEmpty())
{
T++;
for (int j = 0; j < w; j++)
bits[j] = BitIO.readbit();
for (int k = 1; k <= maxK; k++)
if (find(bits, k) == bits.length) sum[k]++;
}

for (int k = 1; k <= maxK; k++)
StdOut.printf("%8.4f\n", 1.0*sum[k]/T);
StdOut.println(T + trials);
}
}
N/w trials.
Read w-bits from StdIn
For each k to maxK check for
k consecutive 0s.
Print empirical probabilities.
% java TestOccK 100 6 < data/random1M.txt
0.0000
0.0000
0.0004
0.0267
0.1861
0.4502
10000 trials

.0000
.0000
.0023
.0259
.1898
.4516
predicted
by
theory
Autocorrelation
Q. Do the same results hold for 0001?
The probability that an N-bit random
bitstring does not contain 0000
is ~1.0917(. 96328)
N
The expected wait time for the first
occurrence of 0000 in a random
bitstring is 30.
Observation. Consider first occurrence of 000.
0000 and 0001 equally likely BUT
mismatch for 0000 means 0001, so need to wait four more bits and
mismatch for 0001 means 0000, so next bit could give a match.
10111110100101001100111000100111110110110100000111100001
A. No!
Q. What is probability that an
N-bit random bitstring does not
contain 0001?
Q. What is expected wait time
for the first occurrence of 0001
in a random bitstring?
Constructions for strings without specied patterns
Sp, binary strings that do not contain p
Tp, binary strings that end in p and have no other occurrence
10111110101101001100110101001010
10111110101101001100110000011111
p, a pattern
101001010 Ex.
Ex.
Ex.
Given
First construction

Sp and Tp are disjoint


remove the last bit from a string in either Sp or Tp
result is either empty or a string in Sp
S

+T

= +S

{ +}
Constructions for strings without specied patterns
Every pattern has an autocorrelation polynomial
slide the pattern to the left over itself.
for each match of the trailing bits with the leading bits
include a term z
i
where i is amount of the shift.
101001010
101001010
101001010
101001010
101001010
101001010
101001010
101001010
101001010
101001010

() = +

Constructions for strings without specied patterns


Second construction
add the pattern to Sp
for each 1 bit in the autocorrelation, add a tail
result is in Tp
101101001100110000011111101001010
1011010011001100000111111010010101001010
10100101001010
101001010
S

{} = T

=
{

}
Binary strings without specied patterns
Sp and Tp are disjoint.
Removing a bit from either
yields or a string from Sp
Binary strings that do not contain a specified pattern p.
Constructions
Combinatorial classes
Sp, binary strings that do not contain p
Tp, binary strings that end in p
and have no other occurrence
OGFs

() =

||

() =

||
S

+T

= +S

{ +}
S

{} = T

=
{

}
Add the pattern to a string from Sp.
For each 1 bit in the autocorrelation,
add a tail to get a string from T.
Solve.

() =

()

+ ( )

()
GF equations.

()

()()

() +

() = + ()
Expand.
See Asymptotics lecture
[

()

vhere

s the domnunt root o

+ (l 2)

()

= expct ormuu uvuube


Autocorrelation for 4-bit patterns
k
auto-
correlation
OGF
approx. probability of
no occurrence
of pattern in N random bits
approx. probability of
no occurrence
of pattern in N random bits
approx. probability of
no occurrence
of pattern in N random bits
wait
time
N 10 100
0000 1111
1111 (. 96328)
N
0.7510 0.0259 30
0001 0011 0111
1000 1100 1110
1000
(.91964)
N
0.4327 0.0002 16
0010 0100 0110
1001 1011 1101
1001
(.93338)
N
0.5019 0.0010 18
0101 1010
1010 (.94165)
N
0.5481 0.0024 20

constants omitted
(close to 1)
off by < 10%
but indicative
Observation. In 100 random bits,
0000 is 10 times more likely to be missing than 0101
100 times more likely to be missing than 0001 or 0111.
Formal languages and the symbolic method
Definition. A formal language is a set of strings.
Q. How many strings of length N in a given language?
Remark. The symbolic method provides a systematic approach to this problem.
A. Use an OGF to enumerate them.
() =

||
Issue. Ambiguity.
Regular expressions
Theorem. Let A and B be unambiguous REs with OGFs and .
Suppose that A+B, AB, and A* are also unambiguous. Then
enumerates A+B
enumerates AB
enumerates A*
Proof.
Same as for symbolic method; different notation.
() ()
() +()

() +()
()()
Corollary. OGFs that enumerate regular languages are rational.
Proof.
1. There exists an FSA for the language.
2. Kleenes theorem gives unambiguous RE for any FSA.
a* | (a*ba*ba*ba*)*
A rational function is a function
that can be written as the ratio
of two polynomials.
Regular expressions
Example 1. Binary strings with no 000
Expansion.
[

]
4
()
4

4
vth

4
.
= l.92756

4
.
= l.09l66
RE.
( + + + )

( + + + )
OGF.

() =
+ +

( +

)
=


Regular expressions
Example 2. Binary strings that represent multiples of 3
RE.
((

OGF.

() =


( )( + )
Expansion.

()

11
110
1001
1100
1111
10010
10101
11000
.
.
.
Context-free languages
Theorem. Let <A> and <B> be nonterminals in an unambiguous CFG with
OGFs and . Suppose that <A> | <B> and <A><B> are also
unambiguous. Then
enumerates <A> | <B>
enumerates <A><B>
Proof.
Same as for symbolic method; different notation.
() ()
() +()
()()
Corollary. OGFs that enumerate unambiguous CF languages are algebraic.
Proof. Groebner basis elimination (see text).
An algebraic function is a function that satisfies a
polynomial equation whose coefficients are
themselves polynomials with rational coefficients
Context-free languages
Example 1. Binary trees
CFG.
OGF.
Expand.
Solve.
<T> :=

+ <T><T>

() = +

()

() =

() =

The constructions we have considered are CFGs (different notation).


Issue: ambiguity.
Not true for labeled objects.
Symbolic method is more general.
Walks
Definition. A walk is a sequence of + and characters.
+-+++-+---+--+--
Sample applications:
Parenthesis systems
Gamblers ruin problems
Inversions in 2-ordered permutations (see text)
+-+++-+---+--+--
()((()()))())())
Q. How many different walks of length N ?
Q. How many different walks of length N where every prefix has more + than ?
Walks
<U> := <+> | <U><U><>
U
U
<D> := <> | <D><D><+>
D
D
Basis for unambiguous decomposition:
<S> begins and ends at 0
<U> starts with +, ends at +1, never hits 0
<D> starts with , ends at 1, never hits 0
<S> := <U><><S> | <D><+><S>
U
S
S
D
Context-free languages
Example 2. Walks of length 2N that start at and return to 0
CFL.
<S> := <U><><S> | <D><+><S> |
<U> := <U><U><> | <+>
<D> := <D><D><+> | <>
OGFs.
() = ()() + ()() +
() = +

()
() = +

()
Solve simultaneous
equations.
() = () =

() =

()
=

Expand.
Elementary example, but extends
to similar, more difficult problems
[

]() =

Tries
internal
node
root
external
node
leaf
void
external
nodes
and
Definition. A trie is a binary tree with
the following properties:
External nodes may be void.
Siblings of void nodes are internal.
are disallowed
1 trie with 2 external nodes
4 tries with 3 external nodes
17 tries with 4 external nodes
Tries and sets of bitstrings
Definition. A set of bitstrings is prefix-free if no member is the prefix of another.
Definition. The trie associated with a prefix-free set B of bitstrings is
A void external node if B is empty.
A nonvoid external node if B contains a single bitstring.
An internal node connected to the tries associated with B0 and B1, where
B0 is the set of strings from B that begin with 0, with initial bit removed from each, and
B1 is the set of strings from B that begin with 1, with the initial bit removed from each.
01110
01000
01101
11101
11001
10101
10110
11000
11111
01
000
110
101
01110
11101
11001
10101
01000
10110
01101
11000
11111
1101
1001
1000
1111
0101
0110
1110
1000
1101
01 11
101
111
001
000
1
01
00
0
101
110
10 01 10
Example.
Not prefix-free? Allow nonvoid internal nodes
Option for finite strings:
that is the null string
Tries and sets of bitstrings
A trie represents a set of bitstrings.
Each nonvoid external node represents one bitstring (and holds remainder).
Path from the root to a node (plus string at the node) defines the bitstring
1
00
01110
11101
11001
10101
01000
10110
01101
11000
11111
1 1 0 1 0
Examples.
1
0
1
1
0
1
0
Tries and sets of bitstrings (nite string option)
A trie represents a set of bitstrings.
Each nonvoid external node represents one bitstring.
Path from the root to a node defines the bitstring
01110
11101
11001
10101
01000
10110
01101
11000
11111
Examples.
1
0
1
1
0
0
1
0
0
0
Trie applications
Searching and sorting
MSD radix sort
Symbol tables with string keys
Data compression
Huffman and prefix-free codes
LZW compression
Decision making
Collision resolution
Leader election (stay tuned)
Sections 5.1 and 5.2
Sections 5.5
Trie applications
Problem. Elect a leader
Example 4. Leader election
Trie applications
0
0
0
0
0
1 1
1
1
1
1
1
Method.
Each person flips a 0-1 coin.
1 wins, 0 loses
Winners continue to next round.
Trie applications
Method.
Each person flips a 0-1 coin.
1 wins, 0 loses
Winners continue to next round.
Trie applications
0
1
Method.
Each person flips a 0-1 coin.
1 wins, 0 loses
Winners continue to next round.
Application: Distributed leader election
Q. How many rounds?
A. Average length of the rightmost
path in a random trie.
Application: Distributed leader election
Procedure might fail!
Q. Probability of failure?
A. Probability that the rightmost path
in a random trie ends in a void node.
Trie parameters
0
1
2 1
3 1
4 2
5 16
Q. Worst-case search cost?
A. Height.
Q. Space requirement?
A. Number of external nodes.
height
# external
nodes
Q. Expected cost to find a prefix of a random string?
A. External path length.
20 external nodes, external path length 21 + 31 + 42 + 516 = 93
Usual model. Build trie from N infinite random strings.
N = 9
Average external path length in a trie

= +
l
2

) or > l vth
0
=
l
= 0
Caution: When k = 0 and k = N, CN appears on right-hand side.
k strings,
stripped of 0 bit
Nk strings,
stripped of 1 bit
N external nodes
Recurrence. [For comparison with BST and Catalan models.]

BST
Catalan
Trie
Pr {root is of rank k}
Average external path length in a trie

= +
l
2

) or > l vth
0
=
l
= 0 Recurrence.
EGF
() =

!
=

+
/

+
/
(/)

= (

) + (

/
) +
/
(/)
= (

) + (

/
) + (

/
) +
/
(/)

= ![

]() =

Expand.

(
/

) lg
Approximate (exp-log)
Iterate.
() =

() =

+
/
(/) GF equation.
Also available directly
through symbolic method
Average external path length in a trie

(
/

) =

<lg
(
/

) +

lg
(
/

)
= lg

<lg
(
/

) +

lg
(
/

)
= lg

<lg
(
/

) +

lg
(
/

) + (

)
= lg

<
(
/
+lg
) +

(
/
+lg
) + (

)
Goal: isolate periodic terms
= lg {lg }

<

{lg }
+

{lg }
) + (

Average external path length in a trie


10
6
A + B + C
1.33274
0.8
A
A
0.2
B
B
0.7
C
C

= +
l
2

) or > l vth
0
=
l
= 0

/ = lg {lg }

<

{lg }
+

{lg }
) +(

)
Q.
A.
Fluctuating term in trie (and other AofA) results
Q. Is there a reason that such a recurrence should imply periodic behavior?
10
6
1.33274

= +
l
2

) or > l vth
0
=
l
= 0
A. Yes. Stay tuned for the Mellin transform and related topics.
Assignments for next lecture
1. Exercise 7.3 How long a string of random bits should be taken to be 50%
sure that there are at least 32 consecutive 0s?
2. Exercise 7.14 Suppose that a monkey types randomly at a 32-key
keyboard. What is the expected number of characters typed before the
monkey hits upon the phrase TO BE OR NOT TO BE?
3. Exercise 7.51 [Analyze the probability of success of the leader-election
algorithm.]
4. Read pages 361-412 (Strings and Tries) in text.
Prepare a question/discussion point on 1., 2., 3. or 4.

You might also like