
Algorithms, Data Structures and Computability
Search
Introduction
In this lecture we will address the naïve string search, understand how it works and analyze its complexity. Then we show how efficient string-searching algorithms, in particular the KMP algorithm, can be used to match specimens against items in very large DNA databases.

We also go back to the ADT Binary Tree and the more specific form of this, the ADT AVL Tree, showing how these can be used in search problems. We will explore methods of improving efficiency by keeping trees in balance, examine the algorithms for doing so, and perform complexity analyses on these.
Agenda
❑ Basic string search (Naïve Search, or brute-force pattern matching)
❑ The Knuth–Morris–Pratt (KMP) algorithm
❑ Search trees
  ▪ Binary search trees
String Matching Problem

• Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.
• Example: T = “AGCTTGA”, P = “GCT”
• Applications:
  • Searching keywords in a file
  • Search engines (like Google and Openfind)
  • Database searching (GenBank)
• More string matching algorithms (with source code):
  http://www-igm.univ-mlv.fr/~lecroq/string/
Terminologies
• S = “AGCTTGA”
• |S| = 7, the length of S
• Substring: S_{i,j} = S_i S_{i+1} … S_j
  • Example: S_{2,4} = “GCT”
• Subsequence of S: obtained by deleting zero or more characters from S
  • “ACT” and “GCTT” are subsequences.
• Prefix of S: S_{1,k}
  • “AGCT” is a prefix of S.
• Suffix of S: S_{h,|S|}
  • “CTTGA” is a suffix of S.
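These notions map directly onto Python string slicing and the `startswith`/`endswith` methods. The snippet below is a small check of each definition against S = “AGCTTGA” (variable names are ours; the slides' 1-based positions become 0-based slices):

```python
# The slides' 1-based positions become 0-based Python slices:
# S_{i,j} corresponds to S[i-1:j].
S = "AGCTTGA"

length = len(S)                    # |S| = 7
substring = S[1:4]                 # S_{2,4} = "GCT"
is_prefix = S.startswith("AGCT")   # a prefix is S_{1,k}
is_suffix = S.endswith("CTTGA")    # a suffix is S_{h,|S|}
```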
A Brute-Force Algorithm

To see the whole process, step through the visualization at:
http://whocouldthat.be/visualizing-string-matching/

Time: O(mn), where m = |P| and n = |T|.
Brute-Force Algorithm
• The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  • a match is found, or
  • all placements of the pattern have been tried
• Brute-force pattern matching runs in time O(nm)
• Example of worst case:
  • T = aaa … ah
  • P = aaah
  • may occur in images and DNA sequences
  • unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or -1 if no such substring exists
  for i ← 0 to n - m
    { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i  { match at i }
  return -1  { no match anywhere }
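A minimal Python rendering of the pseudocode above (the function name is ours, not from the slides):

```python
def brute_force_match(T, P):
    """Return the index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):            # test shift i of the pattern
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            return i                      # match at shift i
    return -1                             # no match anywhere

print(brute_force_match("AGCTTGA", "GCT"))   # → 1
```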
The KMP Algorithm – Motivation
Proposed by Knuth, Morris and Pratt in 1977

• Knuth–Morris–Pratt’s algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm.
• When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
• Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

[Figure: text . . a b a a b x . . . . . aligned over pattern a b a a b a; after a mismatch at position j, the pattern is shifted so that there is no need to repeat the already-matched comparisons, and comparing resumes at the mismatched text character.]
KMP Failure Function
• Knuth–Morris–Pratt’s algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself
• The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
• Knuth–Morris–Pratt’s algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j - 1)

  j    0 1 2 3 4 5
  P[j] a b a a b a
  F(j) 0 0 1 1 2 3


The KMP Algorithm

• The failure function can be represented by an array and can be computed in O(m) time
• At each iteration of the while-loop, either
  • i increases by one, or
  • the shift amount i - j increases by at least one (observe that F(j - 1) < j)
• Hence, there are no more than 2n iterations of the while-loop
• Thus, KMP’s algorithm runs in optimal time O(m + n)

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m - 1
        return i - j  { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j - 1]
      else
        i ← i + 1
  return -1  { no match }
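The algorithm can be sketched in Python as follows; to keep the sketch self-contained it also includes the failure-function construction described on the next slide (function names are ours):

```python
def failure_function(P):
    """F[j] = size of the largest prefix of P[0..j] that is also a
    suffix of P[1..j]."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1                 # matched j + 1 characters
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                 # shift the prefix being matched
        else:
            F[i] = 0
            i += 1
    return F

def kmp_match(T, P):
    """Return the starting index of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    F = failure_function(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j             # match at shift i - j
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]                 # reuse the partial match
        else:
            i += 1
    return -1                            # no match

print(kmp_match("abacaabaccabacabaabb", "abacab"))   # → 10
```

Note that i never moves backwards: on a mismatch only j is reset, which is exactly why the loop runs at most 2n times.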
Computing the Failure Function

• The failure function can be represented by an array and can be computed in O(m) time
• The construction is similar to the KMP algorithm itself
• At each iteration of the while-loop, either
  • i increases by one, or
  • the shift amount i - j increases by at least one (observe that F(j - 1) < j)
• Hence, there are no more than 2m iterations of the while-loop

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j - 1]
    else
      F[i] ← 0  { no match }
      i ← i + 1

To see the whole process:
http://whocouldthat.be/visualizing-string-matching/
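A direct Python rendering of the construction above (the function name is ours); its output can be checked against the failure-function tables in the examples that follow:

```python
def failure_function(P):
    """Compute F[i] = size of the largest prefix of P[0..i] that is
    also a suffix of P[1..i], exactly as in the pseudocode."""
    m = len(P)
    F = [0] * m                      # F[0] = 0
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1             # we have matched j + 1 chars
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]             # use failure function to shift P
        else:
            F[i] = 0                 # no match
            i += 1
    return F

print(failure_function("abaaba"))    # → [0, 0, 1, 1, 2, 3]
```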

Example 1
T: a b a c a a b a c c a b a c a b a a b b
P: a b a c a b

  j    0 1 2 3 4 5
  P[j] a b a c a b
  F(j) 0 0 1 0 1 2

[Figure: the pattern shown aligned under the text at each shift, with the character comparisons numbered 1–19 in the order they are made; the final alignment, at shift 10, completes the match.]
Example 2

T: a b x a b c a b c a b y
P: a b c a b y

  j    0 1 2 3 4 5
  P[j] a b c a b y
  F(j) 0 0 0 1 2 0
Time Complexity of KMP Algorithm

• Time complexity: O(m + n)
  • O(m) for computing the failure function F
  • O(n) for searching for P in T
Binary Search

• Binary search can perform nearest-neighbor queries on an ordered map that is implemented with an array, sorted by key
  • similar to the high–low children’s game
  • at each step, the number of candidate items is halved
  • terminates after O(log n) steps
• Example: find(7) on the array below, where l, m and h are the low, middle and high indices

  0 1 3 4 5 7 8 9 11 14 16 18 19

  1. l = 0, h = 12, m = 6: A[6] = 8 > 7, so search the left half
  2. l = 0, h = 5, m = 2: A[2] = 3 < 7, so search the right half
  3. l = 3, h = 5, m = 4: A[4] = 5 < 7, so search the right half
  4. l = m = h = 5: A[5] = 7 — found
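The search just traced can be sketched in Python (names are ours); running it on the slide's array finds 7 at index 5:

```python
def binary_search(A, k):
    """Return the index of key k in sorted array A, or -1 if absent."""
    low, high = 0, len(A) - 1
    while low <= high:
        mid = (low + high) // 2       # halve the candidate range
        if A[mid] == k:
            return mid
        elif A[mid] < k:
            low = mid + 1             # search the right half
        else:
            high = mid - 1            # search the left half
    return -1

A = [0, 1, 3, 4, 5, 7, 8, 9, 11, 14, 16, 18, 19]
print(binary_search(A, 7))   # → 5
```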
Binary Search Trees

Visualize the BST: https://visualgo.net/en/bst

• A binary search tree is a binary tree storing keys (or key–value items) at its nodes and satisfying the following property:
  • Let u, v, and w be three nodes such that u is in the left subtree of v and w is in the right subtree of v. We have key(u) ≤ key(v) ≤ key(w)
• External nodes do not store items; instead we consider them as None
• An inorder traversal of a binary search tree visits the keys in increasing order

[Figure: a BST with root 6, children 2 and 9, and leaves 1, 4 and 8.]
Binary Search Trees: Search

• To search for a key k, we trace a downward path starting at the root
• The next node visited depends on the comparison of k with the key of the current node
• If we reach a (None) leaf, the key is not found
• Example: find(4)
  • Call TreeSearch(4, root)
  • 4 < 6, go left; 4 > 2, go right; 4 = 4 — found
• The algorithms for nearest-neighbor queries are similar
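TreeSearch can be sketched in Python, with None playing the role of an external node as on the slides (class and function names are ours):

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def tree_search(k, v):
    """Search for key k starting at node v; return the node holding k,
    or None if k is not in the tree."""
    if v is None:
        return None              # reached a (None) leaf: not found
    if k < v.key:
        return tree_search(k, v.left)
    if k > v.key:
        return tree_search(k, v.right)
    return v                     # k == key(v)

# The tree from the slides: root 6, children 2 and 9, leaves 1, 4, 8.
root = Node(6, Node(2, Node(1), Node(4)), Node(9, Node(8)))
print(tree_search(4, root).key)   # → 4
```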
Insertion

• To perform operation put(k, o), we search for key k (using TreeSearch)
• Assume k is not already in the tree, and let w be the (None) leaf reached by the search
• We insert k at node w and expand w into an internal node
• Example: insert 5 — the search goes 6 → 2 → 4 and reaches the None leaf w to the right of 4, where 5 is inserted
Insertion Pseudo-code
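A possible Python sketch of the insertion just described (a recursive formulation rather than the slides' own pseudocode; names are ours):

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, k):
    """Search for k; on reaching a (None) leaf w, expand it into an
    internal node holding k. Returns the (possibly new) root."""
    if root is None:
        return Node(k)               # this is the leaf w from the slides
    if k < root.key:
        root.left = insert(root.left, k)
    elif k > root.key:
        root.right = insert(root.right, k)
    return root                      # assumes k was not already present

def inorder(v):
    return [] if v is None else inorder(v.left) + [v.key] + inorder(v.right)

root = None
for k in [6, 2, 9, 1, 4, 8, 5]:      # insert 5 last, as in the example
    root = insert(root, k)
print(inorder(root))                 # → [1, 2, 4, 5, 6, 8, 9]
```

As the slides note, an inorder traversal of the result visits the keys in increasing order.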
Deletion

• To perform operation remove(k), we search for key k
• Assume key k is in the tree, and let v be the node storing k
• If node v has a (None) leaf child w, we remove v and w from the tree with operation removeExternal(w), which removes w and its parent
• Example: remove 4 — node 4 has a None left child, so 4 is removed and its other child (5) takes its place
Deletion (cont.)

• We consider the case where the key k to be removed is stored at a node v whose children are both internal
  • we find the internal node w that follows v in an inorder traversal (v’s inorder successor)
  • we copy key(w) into node v
  • we remove node w and its left child z (which must be a (None) leaf) by means of operation removeExternal(z)
• Example: remove 3 — its inorder successor holds 5; that key is copied into v, and the node that held 5 is removed
Performance

• Consider an ordered map with n items implemented by means of a binary search tree of height h
  • the space used is O(n)
  • search and update methods take O(h) time
• The height h is O(n) in the worst case and O(log n) in the best case
Thank You
