Professional Documents
Culture Documents
Ch04 Motifs
Ch04 Motifs
info
1
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Outline
• Implanting Patterns in Random Text
• Gene Regulation
• Regulatory Motifs
• The Gold Bug Problem
• The Motif Finding Problem
• Brute Force Motif Finding
• The Median String Problem
• Search Trees
• Branch-and-Bound Motif Search
• Branch-and-Bound Median String Search
• Consensus and Pattern Branching: Greedy Motif Search
• PMS: Exhaustive Motif Search
2
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
3
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
4
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
5
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
6
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
7
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
8
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Challenge Problem
9
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
10
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Regulatory Proteins
• Gene X encodes regulatory protein, a.k.a. a
transcription factor (TF)
11
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Regulatory Regions
• Every gene contains a regulatory region (RR) typically
stretching 100-1000 bp upstream of the transcriptional
start site
12
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
13
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
ATCCCG gene
TTCCGG gene
ATCCCG gene
ATGCCG gene
ATGCCC gene
14
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
15
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Motif Logo
TGGGGGA
• Motifs can mutate on non TGAGAGA
important bases
TGGGGGA
• The five motifs in five
TGAGAGA
different genes have
TGAGGGA
mutations in position 3
and 5
• Representations called
motif logos illustrate the
conserved and variable
regions of a motif
16
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
17
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Identifying Motifs
• Genes are turned on or off by regulatory
proteins
• These proteins bind to upstream regulatory
regions of genes to either attract or block an
RNA polymerase
• Regulatory protein (TF) binds to a short DNA
sequence called a motif (TFBS)
• So finding the same motif in multiple genes’
regulatory regions suggests a regulatory
relationship amongst those genes
18
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
20
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
22
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
23
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• English Language:
etaoinsrhldcumfpgwybvkxjqz
25
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• A better approach:
• Examine frequencies of l-tuples,
combinations of 2 symbols, 3 symbols, etc.
• “The” is the most frequent 3-tuple in
English and “;48” is the most frequent 3-
tuple in the encrypted text
• Make inferences of unknown symbols by
examining other frequent l-tuples
26
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e
)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht
he)h+t161t:1eet+?t
27
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• Make inferences:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t
h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e
)h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht
he)h+t161t:1eet+?t
AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGRE
ESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENT
HLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINE
FROMTHETREETHROUGHTHESHOTFIFTYFEETOUT
29
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
30
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
31
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Similarities (cont’d)
• Motif Finding:
• In order to solve the problem, we analyze the
frequencies of patterns in the nucleotide sequences
33
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Similarities (cont’d)
• Motif Finding:
• Knowledge of established motifs reduces
the complexity of the problem
34
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
35
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
37
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
acgtacgt
Consensus String
38
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
39
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
40
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Defining Motifs
41
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
A 3 0 1 0 3 1 1 0
• Construct matrix profile
Profile C 2 4 0 0 1 4 0 0 with frequencies of each
G 0 1 4 0 0 0 3 1 nucleotide in columns
T 0 0 0 5 1 0 1 4
• Consensus nucleotide in
_________________
each position has the
Consensus A C G T A C G T highest score in column
42
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Consensus
43
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Consensus (cont’d)
44
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Evaluating Motifs
• We have a guess about the consensus
sequence, but how “good” is this consensus?
45
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Parameters
l=8 DNA
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
t=5 aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
n = 69
s s1 = 26 s2 = 21 s3= 3 s4 = 56 s5 = 60
47
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring Motifs
l
a G g t a c T t
• Given s = (s1, … st) and DNA: C c A t a c g t
a c g t T A g t t
l a c g t C c A t
max count (k , i ) C c g t a c g G
Score(s,DNA) = i 1 k{ A,T ,C ,G } _________________
A 3 0 1 0 3 1 1 0
C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1
T 0 0 0 5 1 0 1 4
_________________
Consensus a c g t a c g t
Score 3+4+4+5+3+4+3+4=30
48
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
50
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
51
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
BruteForceMotifSearch
1. BruteForceMotifSearch(DNA, t, n, l)
2. bestScore 0
3. for each s=(s1,s2 , . . ., st) from (1,1 . . . 1)
to (n-l+1, . . ., n-
l+1)
4. if (Score(s,DNA) > bestScore)
5. bestScore score(s, DNA)
6. bestMotif (s1,s2 , . . . , st)
7. return bestMotif
52
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
53
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
54
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Hamming Distance
• Hamming distance:
• dH(v,w) is the number of nucleotide pairs
that do not match when v and w are
aligned. For example:
dH(AAAAAA,ACAAAC) = 2
55
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• TotalDistance(v,DNA) = 0
56
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• TotalDistance(v,DNA) = 1+0+2+0+1 = 4
57
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
58
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
59
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
60
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
61
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
a G g t a c T t • At any column i
C c A t a c g t Scorei + TotalDistancei = t
Alignment a c g t T A g t t
a c g t C c A t
C c g t a c g G
_________________
• Because there are l columns
Score + TotalDistance = l * t
A 3 0 1 0 3 1 1 0
Profile C 2 4 0 0 1 4 0 0
G 0 1 4 0 0 0 3 1 • Rearranging:
T 0 0 0 5 1 0 1 4
_________________ Score = l * t - TotalDistance
Consensus a c g t a c g t
• l * t is constant the minimization
Score 3+4+4+5+3+4+3+4
of the right side is equivalent to
TotalDistance 2+1+1+0+2+1+2+1 the maximization of the left side
Sum 5 5 5 5 5 5 5 5
62
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
64
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
65
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
1. MedianStringSearch (DNA, t, n, l)
2. bestWord AAA…A
3. bestDistance ∞
4. for each l-mer s from AAA…A to TTT…T
if TotalDistance(s,DNA) < bestDistance
5. bestDistanceTotalDistance(s,DNA)
6. bestWord s
7. return bestWord
66
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
67
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• Let A = 1, C = 2, G = 3, T = 4
• Then the sequences from AA…A to TT…T become:
l
11…11
11…12
11…13
11…14
.
.
44…44
• Notice that the sequences above simply list all numbers as
if we were counting on base 4 without using 0 as a digit
68
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Linked List
• Suppose l = 2
Start
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
69
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
70
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Search Tree
root
--
a- c- g- t-
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
71
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
72
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
73
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
74
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
NextLeaf (cont’d)
• The algorithm is common addition in radix k:
75
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
NextLeaf: Example
• Moving to the next leaf:
Current Location --
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
76
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Next Location --
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
77
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
--
Order of steps
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
79
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
80
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
81
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example
• Moving to the next vertex:
Current Location
--
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
82
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example
• Moving to the next vertices:
Location after 5
next vertex moves
--
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
83
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Bypass Move
Current Location
--
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
85
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Example
• Bypassing the descendants of “2-”:
Next Location
--
1- 2- 3- 4-
11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
86
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
87
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
1. BruteForceMotifSearchAgain(DNA, t, n, l)
2. s (1,1,…, 1)
3. bestScore Score(s,DNA)
4. while forever
5. s NextLeaf (s, t, n- l +1)
6. if (Score(s,DNA) > bestScore)
7. bestScore Score(s, DNA)
8. bestMotif (s1,s2 , . . . , st)
9. return bestMotif
88
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Can We Do Better?
• Sets of s=(s1, s2, …,st) may have a weak profile for
the first i positions (s1, s2, …,si)
• Every row of alignment may add at most l to Score
• Optimism: all subsequent (t-i) positions (si+1, …st)
add
(t – i ) * l to Score(s,i,DNA)
90
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
1. BranchAndBoundMotifSearch(DNA,t,n,l)
2. s (1,…,1)
3. bestScore 0
4. i1
5. while i > 0
6. if i < t
7. optimisticScore Score(s, i, DNA) +(t – i ) * l
8. if optimisticScore < bestScore
9. (s, i) Bypass(s,i, n-l +1)
10. else
11. (s, i) NextVertex(s, i, n-l +1)
12. else
13. if Score(s,DNA) > bestScore
14. bestScore Score(s)
15. bestMotif (s1, s2, s3, …, st)
16. (s,i) NextVertex(s,i,t,n-l + 1)
17. return bestMotif
91
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
92
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
94
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
95
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
96
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Better Bounds
97
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
98
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
99
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
100
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Score(s,i,DNA)
under the assumption that the first (i-1) l-mers have been already
chosen
• CONSENSUS sacrifices optimal solution for speed: in fact the
bulk of the time is actually spent locating the first 2 l-mers
101
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
102
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
• Output:
• Motif M, of length l
• Variants of interest have a hamming distance of d
from M
103
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
How to proceed?
• Exhaustive search?
104
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
105
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
106
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
A lot of work,
most of it
unecessary
107
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Best Neighbor
Branch from the seed
strings
Find best neighbor -
highest score
Don’t consider
branches where the
upper bound is not as
good as best score so
far
108
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Scoring
• PatternBranching use total distance score:
• For each sequence Si in the sample S = {S1, . . . , Sn}, let
d(A, Si) = min{d(A, P) | P Si}.
• Then the total distance of A from the sample is
d(A, S) = ∑ Si S d(A, Si).
• For a pattern A, let D=Neighbor(A) be the set of patterns which differ from
A in exactly 1 position.
• We define BestNeighbor(A) as the pattern B D=Neighbor(A) with lowest
total distance d(B, S).
109
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
PatternBranching Algorithm
110
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
PatternBranching Performance
111
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
112
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
113
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
114
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
Eliminate duplicates
AAG AGT GTC TCA CAG AGG GGA GAG AGT
AAA AAT ATC ACA AAG AAG AGA AAG AAT
AAC ACT CTC CCA CAA ACG CGA CAG ACT
AAT AGA GAC GCA CAC AGA GAA GAA AGA
ACG AGC GCC TAA CAT AGC GCA GAC AGC
AGG AGG GGC TCC CCG AGT GGC GAT AGG
ATG ATT GTA TCG CGG ATG GGG GCG ATT
CAG CGT GTG TCT CTG CGG GGT GGG CGT
GAG GGT GTT TGA GAG GGG GTA GTG GGT
TAG TGT TTC TTA TAG TGG TGA TAG TGT
115
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info
116
An Introduction to Bioinformatics Algorithms www.bioalgorithms.info