8 TextProcessing
Text Processing
String S a b c a b a a b c a b a c
Pattern p a b a a
p a b a a
p a b a a
Data Structures and Algorithms in Java 7/47
Brute-Force algorithm demo - 3
Step 3: compare p[2] with S[2]
S a b c a b a a b c a b a c
p a b a a
A mismatch occurs here.
p a b a a
Finally, a match is found after shifting p three times to the right.
Drawback of this approach: if m is the length of pattern p and n the length of string
S, the matching time is of the order O(mn). This is certainly a very slow
algorithm.
What makes this approach so slow is that elements of S that have already been
compared are compared again and again in later iterations. For example, when the
first mismatch is detected in the comparison of p[2] with S[2], pattern p is moved
one position to the right and matching resumes from there. The first comparison
that then takes place is between p[0]=a and S[1]=b. Note that S[1]=b was already
involved in a comparison in step 2, so this is a repeated use of S[1].
It is these repeated comparisons that lead to the runtime of O(mn).
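The brute-force procedure described above can be sketched in Java as follows. This is a minimal illustration, not the slides' own code; the class name BruteForceMatch and the method name indexOf are assumptions introduced here.

```java
final class BruteForceMatch {
    /**
     * Returns the index of the first occurrence of p in s, or -1 if there is none.
     * Every shift r restarts the comparison at p[0], which is why characters of s
     * can be compared more than once (O(mn) in the worst case).
     */
    static int indexOf(String s, String p) {
        int n = s.length();
        int m = p.length();
        for (int r = 0; r + m <= n; r++) {      // r = current shift of the pattern
            int j = 0;
            while (j < m && s.charAt(r + j) == p.charAt(j)) {
                j++;                             // characters agree, keep going
            }
            if (j == m) {
                return r;                        // all m characters matched
            }
            // mismatch at p[j]: slide the pattern one position to the right
        }
        return -1;
    }

    public static void main(String[] args) {
        // String S and pattern p from the demo slides
        System.out.println(indexOf("abcabaabcabac", "abaa")); // prints 3
    }
}
```

For the demo input, the match is reported at index 3, which agrees with the three shifts mentioned above.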
The Knuth-Morris-Pratt (KMP) Algorithm - 1
Index i 0 1 2 3 4 5 6 7 8 9 10 11 12
Text a 1 0 1 0 0 0 1 0 1 0 1 1 0
Pattern p 1 0 1 0 1 1
1 0 1 0 1 1
1 0 1 0 1 1
Let r denote the starting position of a matching attempt; for the first attempt r = 0. We
compare a[i] with p[j] for j = 0, 1, 2, ..., m-1, where i = r+j. The result is:
a[0] = p[0], a[1] = p[1], a[2] = p[2], a[3] = p[3], a[4] ≠ p[4] (1)
Thus r = 0 is a wrong starting position. The first mismatch occurs at j = k = 4. With the
Brute-Force (BF) algorithm, we would set r' = r+1 = 1 and start a new attempt by comparing:
(a[1],p[0]), (a[2],p[1]), (a[3],p[2]), (a[4], p[3]),...
However, from (1) we can replace these comparisons with the following:
(p[1],p[0]), (p[2],p[1]), (p[3],p[2]), (a[4],p[3]),...
In the general case:
(p[1],p[0]), (p[2],p[1]), ..., (p[k-1],p[k-2]), (a[k],p[k-1]),...
The comparisons of the form (p[.],p[.]) involve the pattern only and can be done in
advance, by sliding a copy pp of the pattern 1 position to the right. In the given
example this comparison fails, because p[1] = 0 while p[0] = 1.
Now we slide the copy pp 2 positions to the right, and the pattern-only
comparisons all match. We therefore only need to resume by comparing a[4] with p[2]:
(p[2],p[0]), (p[3],p[1]), (a[4],p[2]),...
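The idea of sliding a copy of the pattern over itself can be checked directly with a small sketch. The names PatternSlide and shiftAfterMismatch are hypothetical, and the method deliberately uses the naive pattern-against-pattern test described above rather than the efficient table construction; it assumes the mismatch index k is at least 1.

```java
final class PatternSlide {
    /**
     * Given that p[0..k-1] matched and p[k] mismatched (k >= 1), returns the
     * smallest shift s >= 1 such that the copy of the pattern slid s positions
     * to the right agrees with the pattern on the overlap, i.e.
     * p[s..k-1] equals p[0..k-s-1]. Matching then resumes at p[k-s].
     * If no shift smaller than k works, the pattern is moved past the
     * matched part entirely (shift k, resume at p[0]).
     */
    static int shiftAfterMismatch(String p, int k) {
        for (int s = 1; s < k; s++) {
            // compare p[s..k-1] with p[0..k-s-1]: the pattern-only comparisons
            if (p.regionMatches(s, p, 0, k - s)) {
                return s;
            }
        }
        return k;
    }

    public static void main(String[] args) {
        String p = "101011";
        int k = 4;                               // first mismatch in the example
        int s = shiftAfterMismatch(p, k);
        System.out.println("shift = " + s + ", resume at p[" + (k - s) + "]");
        // prints: shift = 2, resume at p[2]
    }
}
```

For the example pattern 101011 with the mismatch at k = 4, shift 1 is rejected and shift 2 is accepted, so comparison resumes with a[4] against p[2], as stated in the text.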
Running Time
Each time through the loop, either we increase i or we slide the pattern
right. Both of these events can occur at most n times, and so the repeat
loop is executed at most 2n times. The cost of each iteration of the repeat
loop is O(1). Therefore, the running time is O(n), assuming that the values
(q) are already computed.
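The loop analysed above can be sketched as follows, under the assumption that the shift table is passed in already computed. The names KmpSearch and search are introduced here for illustration, and the table is assumed to follow the convention with t[0] = -1 used in Example 3; the table values in main were worked out by hand for the pattern 101011.

```java
final class KmpSearch {
    /**
     * Returns the index of the first occurrence of pattern in text, or -1.
     * t is the precomputed table for the pattern, with t[0] = -1.
     * In every iteration either i advances or the pattern slides right
     * (j decreases), and i never moves backwards.
     */
    static int search(String text, String pattern, int[] t) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;                            // empty pattern matches at 0
        }
        int i = 0;                               // position in the text
        int j = 0;                               // position in the pattern
        while (i < n) {
            if (j == -1 || text.charAt(i) == pattern.charAt(j)) {
                i++;                             // advance in the text
                j++;
                if (j == m) {
                    return i - m;                // whole pattern matched
                }
            } else {
                j = t[j];                        // slide the pattern right
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] t = {-1, 0, 0, 1, 2, 3};           // hand-computed for "101011"
        System.out.println(search("1010001010110", "101011", t)); // prints 6
    }
}
```

On the text of the KMP example, the first mismatch at a[4] leads to t[4] = 2, so the next comparison is a[4] with p[2], and the match is eventually found at index 6.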
Example 3:
i     00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17
p[i]   P  A  R  T  I  C  I  P  A  T  E  ␣  I  N  ␣  P  A  R
T[i]  -1  0  0  0  0  0  0  0  1  2  0  0  0  0  0  0  1  2
(␣ denotes a space character in the pattern.)
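A table like the one in Example 3 can be computed with the sketch below. It assumes the pattern is the 18-character string "PARTICIPATE IN PAR" (spaces at indices 11 and 14) and that T[i] is the length of the longest proper prefix of the pattern that is also a suffix of p[0..i-1], with T[0] = -1. The names KmpTable and buildTable are hypothetical.

```java
import java.util.Arrays;

final class KmpTable {
    /**
     * Builds the table t for pattern p, where t[0] = -1 and, for pos >= 1,
     * t[pos] is the length of the longest proper prefix of p that is also
     * a suffix of p[0..pos-1].
     */
    static int[] buildTable(String p) {
        int m = p.length();
        int[] t = new int[m];
        if (m == 0) {
            return t;
        }
        t[0] = -1;
        int cnd = 0;                             // candidate prefix length
        for (int pos = 1; pos < m; pos++) {
            t[pos] = cnd;
            // extend the candidate with p[pos], falling back through the
            // table while the next characters disagree
            while (cnd >= 0 && p.charAt(pos) != p.charAt(cnd)) {
                cnd = t[cnd];
            }
            cnd++;
        }
        return t;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildTable("PARTICIPATE IN PAR")));
        // prints [-1, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 2]
    }
}
```

Each character of the pattern is visited once and cnd can only decrease as often as it has increased, so the table is built in time proportional to the pattern length.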
• Types of compression
– Lossy: MP3, JPG
– Lossless: ZIP, GZ
• Compression Algorithm:
– Huffman Encoding
– Lempel-Ziv
– RLE: Run Length Encoding
• Performance of compression depends on
file types.
Compress data by encoding the symbols
contained in it (lossless compression)
• The information content of the source X = (x1, x2, ..., xn), called the
entropy of the source, is defined by:
Lave = H = P(x1)L(x1) + · · · + P(xn)L(xn)
where L(xi) = - log2 P(xi) is the minimum length of a
codeword for symbol xi (Claude E. Shannon, 1948). Shannon's
entropy represents an absolute limit on the best possible
lossless compression of any communication.
• To compare the efficiency of different data compression
methods when applied to the same data, the same measure
is used; this measure is the compression rate
compression rate = (length(input) - length(output)) / length(input)
Codeword: a sequence of bits of a code corresponding to a symbol.
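Both measures above are simple enough to compute directly. The following is a minimal sketch; the class and method names are assumptions, and the probabilities and lengths used in main are made-up illustration values, not data from the slides.

```java
final class EntropySketch {
    /** Entropy H = sum of P(xi) * L(xi), with L(xi) = -log2 P(xi), in bits per symbol. */
    static double entropy(double[] probs) {
        double h = 0.0;
        for (double p : probs) {
            if (p > 0) {                         // symbols with P = 0 contribute nothing
                h += p * (-Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /** Compression rate = (length(input) - length(output)) / length(input). */
    static double compressionRate(long inputLength, long outputLength) {
        return (double) (inputLength - outputLength) / inputLength;
    }

    public static void main(String[] args) {
        // Hypothetical source with three symbols
        System.out.println(entropy(new double[] {0.5, 0.25, 0.25}));
        // Hypothetical file shrinking from 100 to 60 units
        System.out.println(compressionRate(100, 60));
    }
}
```

For the hypothetical source, the entropy is 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits per symbol, and the hypothetical file has a compression rate of 0.4.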
Uniquely Decodable Codes
A variable length code assigns a bit string (codeword)
of variable length to every message value
e.g. a = 1, b = 01, c = 101, d = 011
What if you get the sequence of bits
1011 ?
Is it aba, ca, or ad?
A uniquely decodable code is a variable length code in
which every bit string can be decomposed into
codewords in exactly one way.
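The ambiguity of the example code can be demonstrated by enumerating every way a bit string splits into codewords. This is an illustrative sketch only; the names DecodeAmbiguity and decodings are hypothetical, and it assumes that no codeword is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DecodeAmbiguity {
    /** Returns every message whose encoding under the given code is exactly bits. */
    static List<String> decodings(String bits, Map<Character, String> code) {
        List<String> out = new ArrayList<>();
        collect(bits, 0, code, new StringBuilder(), out);
        return out;
    }

    private static void collect(String bits, int pos, Map<Character, String> code,
                                StringBuilder message, List<String> out) {
        if (pos == bits.length()) {
            out.add(message.toString());         // consumed all bits: one decoding
            return;
        }
        for (Map.Entry<Character, String> e : code.entrySet()) {
            String word = e.getValue();
            if (bits.startsWith(word, pos)) {    // codeword fits at this position
                message.append(e.getKey());
                collect(bits, pos + word.length(), code, message, out);
                message.setLength(message.length() - 1); // backtrack
            }
        }
    }

    public static void main(String[] args) {
        Map<Character, String> code = new TreeMap<>();
        code.put('a', "1");
        code.put('b', "01");
        code.put('c', "101");
        code.put('d', "011");
        System.out.println(decodings("1011", code)); // prints [aba, ad, ca]
    }
}
```

A code is uniquely decodable exactly when this list never contains more than one entry for any bit string; here 1011 has three decodings, so the example code is not.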
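Building the Huffman tree (at each step the two nodes with the smallest probabilities are merged):
Initial: E .40, A .20, C .15, D .11, B .09, F .05
Step 1: merge B (.09) and F (.05) into BF (.14): E .40, A .20, C .15, BF .14, D .11
Step 2: merge BF (.14) and D (.11) into BFD (.25): E .40, BFD .25, A .20, C .15
Step 3: merge A (.20) and C (.15) into AC (.35): E .40, AC .35, BFD .25
Step 4: merge AC (.35) and BFD (.25) into BFDAC (.60): BFDAC .60, E .40
The two branches below each internal node are then labeled 0 and 1.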
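The merging process shown above can be sketched with a priority queue. This is a hedged illustration rather than the slides' implementation: the names HuffmanSketch and buildCodes are hypothetical, the probabilities are given as integer percentages to avoid floating-point ties, and the choice of which branch receives 0 or 1 is an arbitrary assumption, so the exact bit patterns may differ from the slides while the codeword lengths do not.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

final class HuffmanSketch {
    private static final class Node {
        final int weight;
        final char symbol;                       // meaningful for leaves only
        final Node left, right;

        Node(int weight, char symbol) {
            this.weight = weight; this.symbol = symbol;
            this.left = null; this.right = null;
        }

        Node(Node left, Node right) {
            this.weight = left.weight + right.weight; this.symbol = 0;
            this.left = left; this.right = right;
        }
    }

    /** Builds a Huffman code for the given symbol weights. */
    static Map<Character, String> buildCodes(Map<Character, Integer> weights) {
        PriorityQueue<Node> pq =
                new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (Map.Entry<Character, Integer> e : weights.entrySet()) {
            pq.add(new Node(e.getValue(), e.getKey()));
        }
        Map<Character, String> codes = new TreeMap<>();
        if (pq.isEmpty()) {
            return codes;
        }
        while (pq.size() > 1) {
            Node a = pq.poll();                  // smallest weight
            Node b = pq.poll();                  // second smallest weight
            pq.add(new Node(a, b));              // merged node
        }
        assign(pq.poll(), "", codes);
        return codes;
    }

    private static void assign(Node n, String prefix, Map<Character, String> codes) {
        if (n.left == null) {                    // leaf: record its codeword
            codes.put(n.symbol, prefix.isEmpty() ? "0" : prefix);
            return;
        }
        assign(n.left, prefix + "0", codes);
        assign(n.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        Map<Character, Integer> w = new LinkedHashMap<>();
        w.put('E', 40); w.put('A', 20); w.put('C', 15);
        w.put('D', 11); w.put('B', 9);  w.put('F', 5);
        System.out.println(buildCodes(w));
    }
}
```

With these weights the merges happen in the same order as in the steps above, giving a 1-bit codeword for E, 3-bit codewords for A, C and D, and 4-bit codewords for B and F, for an average length of 2.34 bits per symbol.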
Huffman Coding example - 8
Raw : FFFFOOOOFFFOOFFFFFOOOOOOOO
4 4 3 2 5 7
Compressed: 4F4O3F2O5F7O
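Run-length encoding in the count-then-symbol form shown above can be sketched as follows. The names RunLengthSketch, encode and decode are hypothetical, and the decoder assumes that the symbols themselves are not digits and that the encoded string is well formed.

```java
final class RunLengthSketch {
    /** Replaces each run of a repeated character by its length followed by the character. */
    static String encode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            int run = 1;
            while (i + run < raw.length() && raw.charAt(i + run) == c) {
                run++;                           // extend the current run
            }
            out.append(run).append(c);
            i += run;
        }
        return out.toString();
    }

    /** Inverse of encode, assuming the symbols are not digits. */
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            int run = 0;
            while (i < encoded.length() && Character.isDigit(encoded.charAt(i))) {
                run = run * 10 + (encoded.charAt(i) - '0'); // read the count
                i++;
            }
            char c = encoded.charAt(i++);        // the symbol that follows the count
            for (int k = 0; k < run; k++) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("FFFFOOOOFFF"));   // prints 4F4O3F
        System.out.println(decode("4F4O3F"));        // prints FFFFOOOOFFF
    }
}
```

The scheme only pays off when runs are long: a string without repeated characters, such as ABC, grows to 1A1B1C.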