07 Read Mapping

Introduction to Bioinformatics
Read mapping (String matching)
Subangkar Karmaker Shanto

(CSE, UIU)
Slide Courtesy: Atif Hasan Rahman (CSE, BUET)

Short read mapping
• Short read
– Second generation or mainly Illumina reads
– 35-300bp
• Mapping
– Finding the positions in the genome reads are coming from
• Applications
– Reference guided genome assembly
– Genotyping
– Analysis of *-seq data
Reference guided assembly
Exact pattern matching
• Given a pattern (read) and a text (genome), find all
positions in the text that matches the pattern?
• This leads us to the Pattern Matching Problem.

Pattern Matching Problem
• Goal: Find all occurrences of a pattern in a text.
• Input:
– Pattern p = p1…pm of length m
– Text t = t1…tn of length n
• Output: All positions 1< i < (n – m + 1) such that the

m-letter substring of text t starting at i matches the
pattern p.
Adapted from slides available at
www.bioalgorithms.info
Exact Pattern Matching: Brute Force
PatternMatching(p,t)
1 m  length of pattern p
2 n  length of text t
3 for i  1 to (n – m + 1)
4 if ti…ti+m-1 = p
5 output i

Exact Pattern Matching: An Example
• PatternMatching algorithm for:
– Pattern GCAT
– Text AGCCGCATCT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

– Pattern GCAT
– Text AGCCGCATCT
• AGCCGCATCT
• GCAT

Exact Pattern Matching: Running Time
• PatternMatching runtime: O(nm)
– In the worst case, we have to check m characters at each of
the n letters of the text.
– Example: Text = AAAAAAAAAAAAAAAA, Pattern = AAAC
• This is rare; on average, the run time is more like

O(m).
– Rarely will there be close to m comparisons at each step.

Knuth-Morris-Pratt Algorithm
• Linear time algorithm
• Avoids comparisons with positions in the text that
we have already checked
• AGCAGCTAGCTAGT
• AGCTAGT
• Linear time algorithm
• Avoids comparisons with positions in the text that
we have already checked
• AGCAGCTAGCTAGT
• AGCTAGT
Pattern Matched so far
AGCTAGT AGCTAG
Proper prefix Proper suffix
• Compute length of the longest proper prefix of the

subpattern that matches a proper suffix of the
subpattern
Short read mapping
• KMP runs in O(n+m)
– Preprocesses the pattern
– To map k-reads O(k(n+m)) time is needed
• But typically we need to map many reads to a single
genome
• Better to index the genome
– Hash tables
– Suffix trees
– Suffix arrays
– BWT-FM index
Multiple Pattern Matching (MPM)
Problem
• Goal: Given a set of patterns and a text, find all occurrences of
any of patterns in text.
• Input: k patterns p1,…,pk, and text of n characters t = t1…tn
• Output: Positions 1 < i < n where the substring of text t

starting at i matches one of the patterns pj for 1 < j < k.

Keyword Trees: Example
• Keyword tree:
– Apple

• Keyword tree:
– Apple
– Apropos

• Keyword tree:
– Apple
– Apropos
– Banana

• Keyword tree:
– Apple
– Apropos
– Banana
– Bandana

• Keyword tree:
– Apple
– Apropos
– Banana
– Bandana
– Orange

Keyword Trees: Properties
– Stores a set of keywords in a
rooted labeled tree.
– Each edge is labeled with a
letter from an alphabet.
– Any two edges coming out of
the same vertex have distinct
labels.
– Every keyword stored can be
spelled on a path from root to
some leaf.
– Furthermore, every path from
root to leaf gives a keyword.
Keyword Trees: Threading
• Thread “appeal”
– appeal

– appeal

– appeal

– appeal

• Thread “apple”
– apple

– apple

– apple

– apple

– apple

Suffix Trees = Collapsed Keyword Trees
• Suffix Tree: Similar to keyword tree,
– Suffixes of the text are keywords
– Edges that form paths are collapsed
– Each edge is labeled with a substring of the text
– All internal edges have at least two outgoing edges.
– Leaves are labeled by the index of the pattern.

Suffix Tree of a Text
• The suffix trees of a text is constructed for all its
suffixes.
• Example: Suffix tree of ATCATG:
ATCATG Keyword Suffix

TCATG Tree Tree
CATG
ATG
TG
G
Suffix Tree of a Text
Keyword Tree Suffix Tree

Suffix Trees: Example
• Example: Suffix tree of ATGCATACATGG

Suffix Trees: Runtime
• Say the length of the text is n.
• Construction of the keyword tree takes O(n2) time.
• Constructing the suffix tree takes O(n) time.
• So our method takes O(n2 + n) = O(n2) time.
• However, there exists a method of calculating the suffix tree in O(n) time
without needing the keyword tree for the suffixes.
ATCATG
TCATG
CATG Keyword Suffix
ATG Tree Tree
TG
G
Suffix Trees: Runtime
• Say the length of the text is n.
• Construction of the keyword tree takes O(n2) time.
• Constructing the suffix tree takes O(n) time.
• So our method takes O(n2 + n) = O(n2) time.
• However, there exists a method of calculating the suffix tree in O(n) time
without needing the keyword tree for the suffixes.
ATCATG
TCATG
CATG Keyword Suffix
ATG Tree Tree
TG
G
Pattern Matching with Suffix Trees
• To find any pattern in a text:
1. Build suffix tree for text of length n—O(n) time
2. Thread the pattern of length m through the suffix tree of
the text—O(m) time.
• Therefore the runtime of the Pattern Matching

Problem is only
O(m + n)!

Pattern Matching with Suffix Trees
SuffixTreePatternMatching(p,t)
1. Build suffix tree for text t
2. Thread pattern p through suffix tree
3. if threading is complete
4. output positions of all p-matching leaves in the tree
5. else
6. output “Pattern does not appear in text”

Threading ATG through a Suffix Tree
ATGCATACATGG

Suffix Array
• More space efficient than suffix tree
– Suffix tree index for human genome is about 47GB
• Lexicographically sort all the suffixes
• Store the starting indices of the suffixes along with
the original string
Suffix Array
• ATCATG
1 ATCATG$ 7 $
2 TCATG$ Sort the suffixes 1 ATCATG$
3 CATG$ lexicographically 4 ATG$
4 ATG$ 3 CATG$
5 TG$ 6 G$
6 G$ 2 TCATG$
7 $ 5 TG$
Suffix Array
• There exists algorithms to construct suffix array in
O(n) time
• Searching is done using binary search
– Basic binary search would take O(mlogn)
– Can be done in O(m+logn)
Burrows-Wheeler Transform (BWT)
• Lexicographically sort all the rotations
Figure from slide by Ben Langmead

• Sort order is the same for rotations and suffixes
– Since $ precedes others

• Original motivation was
compression
• Rotating and sorting
tend to increase run
lengths
Figure from original paper on BWT

LF mapping
• Last to first mapping
$abaaba
a$abaab
aaba$ab
aba$aba
abaaba$
ba$abaa
baaba$a
LF mapping
• Assign numbers arbitrarily to distinguish characters
– a b0 a a b1 a $
$abaaba
a $ a b a a b1
a a b a $ a b0
aba$aba
abaaba$
b1 a $ a b a a
b0 a a b a $ a
LF mapping
• They will be in same order in L and F. Why?
– a b0 a a b1 a $
$abaaba
a $ a b a a b1
a a b a $ a b0
aba$aba
abaaba$
b1 a $ a b a a
b0 a a b a $ a
LF mapping
• Assign numbers to distinguish characters
– a0 b0 a1 a2 b1 a3 $
$ a b a a b a3
a3 $ a b a a b1
a1 a b a $ a b0
a2 b a $ a b a1
a0 b a a b a $
b1 a $ a b a a2
b0 a a b a $ a0
LF mapping

LF mapping

BWT reversing
• Reverse BWT from right to left
$ a0
a0 b0
a1 b1
a2 a1
a3 $
b0 a2
b1 a3
a3b1a1a2b0a0$
BWT reversing
• Reverse BWT from right to left
$ a
a b
a b
a a
a $
b a
b a
abaaba$
FM index
• FM index
– Combines the BWT with a few small auxiliary data
structures
– Ferragina and Manzini
• F and L from BWT
– Both can be compressed
FM index querying

FM index querying

FM index querying

FM index issues
• Where are the matches?
– Store positions as in suffix array

Short read mapping tools
• Hash based
– MAQ
– SOAP
• BWT based
– Bowtie, Bowtie2
– SOAP2
– BWA
Hashing
• Hashing is a very efficient way to store and retrieve
data.
• What does hashing do?
– Say we are given a collection of data.
– For different entries, generate a unique integer and store
the entry in an array at this integer location

Hashing DNA sequences
– Each k-mer can be translated into a binary string:
• A can be represented as 00
• T can be represented as 01
• G can be represented as 10
• C can be represented as 11
• Example: ATGCTA = 000110110100
– After assigning a unique integer to each k-mer, it is

easy to obtain all start locations of each k-mer in a
genome.
Hashing: Read Mapping
• To map reads exactly:
1. For all k-mers in the genome, note the start position and
the sequence where k is the read length
2. Generate a hash table index for each k-mer sequence.
3. At each index of the hash table, store all genome start
locations of the k-mer that generated the index.
4. Look up all start positions using the read as the key
– How do we deal with collisions?

Hashing: Collisions
• In order to deal with
collisions:
– “Chain” all start locations of

k-mers.
– This can be done via a data

structure called a linked list.

• Example: Hash table for ATGCATACATGG
AA 0
AC 1
AG 2
AT 3 0
CA 4
CC 5
CG 6
CT 7
GA 8
GC 9
GG 10
GT 11
TA 12
TC 13
TG 14
TT 15
AA 0
AC 1
AG 2
AT 3 0
CA 4
CC 5
CG 6
CT 7
GA 8
GC 9
GG 10
GT 11
TA 12
TC 13
TG 14 1
TT 15
AA 0 • Map TG
AC 1 6
AG 2
– Lookup TG
AT 3 0 4 8 – At positions 1, 9
CA 4 3 7
CC 5 • Map GCAT
CG 6 – Lookup GC and AT
CT 7
GA 8
GC 9 2
GG 10 10
GT 11
TA 12 5
TC 13
TG 14 1 9
TT 15
Short read mapping with errors
• Approximate short read mapping
– Reads may have errors
– For reference based assembly/variant discovery
• Finding optimal mapping is usually too slow
– Sequence alignment (next)
• Fast heuristic methods are often used instead of slow
exact methods
– Seed and extend
– Backtracking using BWT/FM index
Backtracking
• Consider an attempt to find Q = “agc” in T = “acaacg”:
“g”
“c”
“gc” does not
occur in the text
• Instead of giving up, try to “backtrack” to a previous

position and try a different base
Backtracking
• Backtracking attempt for Q = “agc”, T = “acaacg”:
“g”
“c”
“gc” does not
occur in the text
Substitution
“a” “a”
Found this alignment:

acaacg
“c” agc
Backtracking
• Relevant alignments may lie along multiple paths
– E.g., Q = “aaa”, T = “acaacg”
“a” “a”
“a”
“a” “a” “c” “a” “a” “c”
“a” “a”
“c”
acaacg acaacg acaacg

aaa aaa aaa

07 Read Mapping

Uploaded by

Copyright:

Available Formats

You might also like

07 Read Mapping

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

07 Read Mapping

Uploaded by

Copyright:

Available Formats

Introduction to Bioinformatics

Read mapping (String matching)

Subangkar Karmaker Shanto

Slide Courtesy: Atif Hasan Rahman (CSE, BUET)

• This leads us to the Pattern Matching Problem.

• Output: All positions 1< i < (n – m + 1) such that the

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

• This is rare; on average, the run time is more like

Adapted from slides available at

• Compute length of the longest proper prefix of the

• Input: k patterns p1,…,pk, and text of n characters t = t1…tn

• Output: Positions 1 < i < n where the substring of text t

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

• Example: Suffix tree of ATCATG:

ATCATG Keyword Suffix

Keyword Tree Suffix Tree

Adapted from slides available at

• Therefore the runtime of the Pattern Matching

Adapted from slides available at

Adapted from slides available at

Adapted from slides available at

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from original paper on BWT

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Figure from slide by Ben Langmead

Adapted from slides available at

– After assigning a unique integer to each k-mer, it is

– How do we deal with collisions?

Adapted from slides available at