Learning Sequence Motif Models Using Expectation Maximization (EM)
BMI/CS 776
www.biostat.wisc.edu/bmi776/
Spring 2019
Colin Dewey
colin.dewey@wisc.edu
These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Mark Craven, Colin Dewey, and Anthony Gitter
Goals for Lecture
Key concepts
• the motif-finding problem
• using EM to address the motif-finding problem
• the OOPS and ZOOPS models
2
Sequence Motifs
• What is a sequence motif?
– a sequence pattern of biological significance
• Examples
– DNA sequences corresponding to protein binding sites
– protein sequences corresponding to common functions
or conserved pieces of structure
3
Sequence Motifs Example
• Given a set of sequences, the task is to:
– infer a model of the motif
– predict the locations of the motif occurrences in the given sequences
5
Why is this important?
• To further our understanding of which regions of sequences are “functional”
• DNA: the biochemical mechanisms by which the expression of genes is regulated
• Proteins: which regions of proteins interface with other molecules (e.g., DNA binding sites)
• Mutations in these regions may be significant
6
Motifs and Profile Matrices
(a.k.a. Position Weight Matrices)
• Given a set of aligned sequences, it is straightforward to
construct a profile matrix characterizing a motif of interest
[Example: aligned sequences with a shared motif over sequence positions 1–8; the A row of the profile matrix reads 0.1, 0.3, 0.1, 0.2, 0.2, 0.4, 0.3, 0.1]

[Sequence logos of the full alignment and of the shared motif, drawn 5′ to 3′ with the y-axis in bits; generated with weblogo.berkeley.edu]
8
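As an aside (not from the original slides), a minimal Python sketch of this construction: it tallies column-wise character frequencies over already-aligned motif occurrences. The example sequences and the pseudo-count of 1 are made up for illustration.

```python
from collections import Counter

def profile_matrix(aligned_seqs, alphabet="ACGT", pseudocount=1.0):
    """Column-wise character probabilities for aligned, equal-length sequences."""
    width = len(aligned_seqs[0])
    profile = {c: [0.0] * width for c in alphabet}
    for k in range(width):
        counts = Counter(seq[k] for seq in aligned_seqs)
        total = len(aligned_seqs) + pseudocount * len(alphabet)
        for c in alphabet:
            profile[c][k] = (counts[c] + pseudocount) / total
    return profile

# Hypothetical aligned binding-site sequences, for illustration only
occurrences = ["TATAAT", "TATTAT", "TAGAAT", "CATAAT"]
for c, row in profile_matrix(occurrences).items():
    print(c, [round(v, 2) for v in row])
```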
Motifs and Profile Matrices
• How can we construct the profile if the sequences aren’t
aligned?
• In the typical case we don’t know what the motif looks
like.
9
Unaligned Sequence Example
• A ChIP-chip experiment indicates which probes are bound
(though this protocol has been replaced by ChIP-seq)
11
Overview of EM
• Method for finding the maximum likelihood (ML) parameters ($\theta$) of a model ($M$) given data ($D$):

$$\theta_{ML} = \operatorname*{argmax}_{\theta} P(D \mid \theta, M)$$

• Useful when
– it is difficult to optimize $P(D \mid \theta)$ directly
– the likelihood can be decomposed by introducing hidden information ($Z$):

$$P(D \mid \theta) = \sum_{Z} P(D, Z \mid \theta)$$

• EM iteratively maximizes the expected complete-data log likelihood:

$$Q(\theta \mid \theta^{t}) = \sum_{Z} P(Z \mid D, \theta^{t}) \log P(D, Z \mid \theta)$$
13
Representing Motifs in MEME
• MEME: Multiple EM for Motif Elicitation
• A motif is
– assumed to have a fixed width, W
– represented by a matrix of probabilities: $p_{c,k}$ represents the probability of character $c$ in column $k$
14
Representing Motifs in MEME
            0      1      2      3
      A    0.25   0.1    0.5    0.2
  p = C    0.25   0.4    0.2    0.1
      G    0.25   0.3    0.1    0.6
      T    0.25   0.2    0.2    0.1
15
Representing Motif Starting
Positions in MEME
• The element $Z_{i,j}$ of the matrix $Z$ is an indicator random variable that takes value 1 if the motif starts in position $j$ of sequence $i$ (and takes value 0 otherwise)
• Example: given DNA sequences where L=6 and W=3
• Possible starting positions: m = L − W + 1 = 4

                          Z:  1   2   3   4
  seq1   G T C A G G          0   0   1   0
  seq2   G A G A G T          1   0   0   0
  seq3   A C G G A G          0   0   0   1
  seq4   C C A G T C          0   1   0   0
16
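A minimal sketch (not from the slides) of how these two objects can be held in code; the numeric values are copied from the example matrices on these slides, and column 0 of p is the background distribution.

```python
# Motif model p: p[c][k] = probability of character c in motif column k
# (k = 0 is the background distribution), matching the matrix above.
p = {
    "A": [0.25, 0.1, 0.5, 0.2],
    "C": [0.25, 0.4, 0.2, 0.1],
    "G": [0.25, 0.3, 0.1, 0.6],
    "T": [0.25, 0.2, 0.2, 0.1],
}
W = len(p["A"]) - 1  # motif width (3)

# Indicator matrix Z for the L=6, W=3 example: Z[i][j-1] = 1 iff the motif
# starts at position j of sequence i; m = L - W + 1 = 4 possible starts.
sequences = ["GTCAGG", "GAGAGT", "ACGGAG", "CCAGTC"]
Z = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
print(W, Z[0])
```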
Probability of a Sequence Given a
Motif Starting Position
$$P(X_i \mid Z_{i,j}=1, p) = \prod_{k=1}^{j-1} p_{c_k,0} \;\prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1} \;\prod_{k=j+W}^{L} p_{c_k,0}$$

where $X_i$ is the $i$th sequence and $c_k$ is the character at position $k$ of $X_i$: positions before and after the motif use the background column (0), and positions $j, \ldots, j+W-1$ use motif columns $1, \ldots, W$

• Example: $X_i =$ G C T G T A G, $W = 3$, motif starting at $j = 3$:

$$P(X_i \mid Z_{i,3}=1, p) = p_{G,0} \times p_{C,0} \times p_{T,1} \times p_{G,2} \times p_{T,3} \times p_{A,0} \times p_{G,0} = 0.25 \times 0.25 \times 0.2 \times 0.1 \times 0.1 \times 0.25 \times 0.25$$
18
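A small Python sketch of this product; the function name `prob_given_start` is made up, the p matrix is the example from the earlier slide, and the call reproduces the $Z_{i,3}$ example above.

```python
def prob_given_start(seq, j, p, W):
    """P(X | Z_j = 1, p): background column 0 outside the motif,
    motif columns 1..W for 1-based positions j .. j+W-1."""
    prob = 1.0
    for k, c in enumerate(seq, start=1):
        col = k - j + 1 if j <= k <= j + W - 1 else 0
        prob *= p[c][col]
    return prob

p = {"A": [0.25, 0.1, 0.5, 0.2], "C": [0.25, 0.4, 0.2, 0.1],
     "G": [0.25, 0.3, 0.1, 0.6], "T": [0.25, 0.2, 0.2, 0.1]}

# 0.25 * 0.25 * 0.2 * 0.1 * 0.1 * 0.25 * 0.25, as in the example above
print(prob_given_start("GCTGTAG", 3, p, W=3))
```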
Likelihood Function
• EM (indirectly) optimizes the log likelihood of the observed data; the E- and M-steps work with the complete-data log likelihood, which includes the hidden starting positions $Z$:

$$\begin{aligned}
\log P(X, Z \mid p) &= \log \prod_i P(X_i \mid Z_i, p)\, P(Z_i \mid p) \\
&= \log \prod_i \prod_j \left[ \tfrac{1}{m}\, P(X_i \mid Z_{i,j}=1, p) \right]^{Z_{i,j}} \\
&= \sum_i \sum_j Z_{i,j} \log P(X_i \mid Z_{i,j}=1, p) + n \log \tfrac{1}{m}
\end{aligned}$$

where $n$ is the number of sequences
20
Expected Starting Positions
$$Z_{i,j}^{(t)} = P(Z_{i,j}=1 \mid X_i, p^{(t-1)}) = \frac{P(X_i \mid Z_{i,j}=1, p^{(t-1)})\, P(Z_{i,j}=1)}{\sum_{k=1}^{m} P(X_i \mid Z_{i,k}=1, p^{(t-1)})\, P(Z_{i,k}=1)}$$
22
The E-step: Computing Z(t)
• Prior probability of each starting position (OOPS model, one occurrence per sequence): $P(Z_{i,j}=1) = \frac{1}{m}$

$$Z_{i,j}^{(t)} = \frac{P(X_i \mid Z_{i,j}=1, p^{(t-1)})\, P(Z_{i,j}=1)}{\sum_{k=1}^{m} P(X_i \mid Z_{i,k}=1, p^{(t-1)})\, P(Z_{i,k}=1)}$$
23
Example: Computing Z(t)
$X_i =$ G C T G T A G

                     0      1      2      3
               A    0.25   0.1    0.5    0.2
  p^{(t-1)} =  C    0.25   0.4    0.2    0.1
               G    0.25   0.3    0.1    0.6
               T    0.25   0.2    0.2    0.1

$$Z_{i,1}^{(t)} \propto P(X_i \mid Z_{i,1}=1, p^{(t-1)}) = 0.3 \times 0.2 \times 0.1 \times 0.25 \times 0.25 \times 0.25 \times 0.25$$

$$Z_{i,2}^{(t)} \propto P(X_i \mid Z_{i,2}=1, p^{(t-1)}) = 0.25 \times 0.4 \times 0.2 \times 0.6 \times 0.25 \times 0.25 \times 0.25$$

$$\ldots$$

• Then normalize so that $\sum_{j=1}^{m} Z_{i,j}^{(t)} = 1$
24
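A sketch of this E-step computation for the example above: score every candidate start of GCTGTAG under $p^{(t-1)}$ and normalize (the uniform $1/m$ prior cancels). Function names are illustrative.

```python
def prob_given_start(seq, j, p, W):
    """P(X | Z_j = 1, p); column 0 of p is the background distribution."""
    prob = 1.0
    for k, c in enumerate(seq, start=1):
        col = k - j + 1 if j <= k <= j + W - 1 else 0
        prob *= p[c][col]
    return prob

def e_step_one_sequence(seq, p, W):
    """Z_{i,j}^(t): proportional to P(X_i | Z_{i,j}=1, p^(t-1)),
    normalized so the row sums to 1 (the uniform 1/m prior cancels)."""
    m = len(seq) - W + 1
    scores = [prob_given_start(seq, j, p, W) for j in range(1, m + 1)]
    total = sum(scores)
    return [s / total for s in scores]

p_prev = {"A": [0.25, 0.1, 0.5, 0.2], "C": [0.25, 0.4, 0.2, 0.1],
          "G": [0.25, 0.3, 0.1, 0.6], "T": [0.25, 0.2, 0.2, 0.1]}
print(e_step_one_sequence("GCTGTAG", p_prev, W=3))   # Z_{i,1..5}^(t)
```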
The M-step: Estimating p
• Recall that $p_{c,k}$ represents the probability of character $c$ in position $k$; values for $k = 0$ represent the background

$$p_{c,k}^{(t)} = \frac{n_{c,k} + d_{c,k}}{\sum_{b \in \{A,C,G,T\}} (n_{b,k} + d_{b,k})}$$

where the $d_{c,k}$ are pseudo-counts and

$$n_{c,k} = \begin{cases} \displaystyle\sum_i \sum_{\{j \,:\, X_{i,j+k-1} = c\}} Z_{i,j}^{(t)} & k > 0 \quad \text{(sum over positions where } c \text{ appears)} \\[2ex] \displaystyle n_c - \sum_{j=1}^{W} n_{c,j} & k = 0 \quad (n_c \text{ is the total number of } c\text{'s in the data set)} \end{cases}$$
25
Example: Estimating p
• Example with three sequences, W = 3, and pseudo-counts of 1:

A C A G C A    $Z_{1,1}^{(t)} = 0.1,\; Z_{1,2}^{(t)} = 0.7,\; Z_{1,3}^{(t)} = 0.1,\; Z_{1,4}^{(t)} = 0.1$

A G G C A G    $Z_{2,1}^{(t)} = 0.4,\; Z_{2,2}^{(t)} = 0.1,\; Z_{2,3}^{(t)} = 0.1,\; Z_{2,4}^{(t)} = 0.4$

T C A G T C    $Z_{3,1}^{(t)} = 0.2,\; Z_{3,2}^{(t)} = 0.6,\; Z_{3,3}^{(t)} = 0.1,\; Z_{3,4}^{(t)} = 0.1$

$$p_{A,1}^{(t)} = \frac{Z_{1,1}^{(t)} + Z_{1,3}^{(t)} + Z_{2,1}^{(t)} + Z_{3,3}^{(t)} + 1}{\sum_{b \in \{A,C,G,T\}} (n_{b,1} + 1)}$$
27
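A sketch verifying the $p_{A,1}$ update above with a pseudo-count of 1, following the M-step formula from the previous slide; variable names are illustrative.

```python
sequences = ["ACAGCA", "AGGCAG", "TCAGTC"]
Z = [
    [0.1, 0.7, 0.1, 0.1],   # Z_{1,1..4}^(t)
    [0.4, 0.1, 0.1, 0.4],   # Z_{2,1..4}^(t)
    [0.2, 0.6, 0.1, 0.1],   # Z_{3,1..4}^(t)
]
ALPHABET = "ACGT"
k, pseudo = 1, 1.0   # motif column 1, pseudo-count d = 1

# n_{c,1}: expected count of character c aligned to motif column 1
n = {c: sum(z for seq, z_row in zip(sequences, Z)
            for j, z in enumerate(z_row, start=1)
            if seq[j + k - 2] == c)
     for c in ALPHABET}

p_A1 = (n["A"] + pseudo) / sum(n[b] + pseudo for b in ALPHABET)
print(p_A1)   # (0.1 + 0.1 + 0.4 + 0.1 + 1) / (3 + 4) = 1.7 / 7 ≈ 0.243
```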
E-step in the ZOOPS Model
• We need to consider another alternative: the $i$th sequence doesn't contain the motif (ZOOPS: zero or one occurrence per sequence)
• We add another parameter (and its relative)
– $\gamma$: prior probability that a sequence contains a motif
– $\lambda = \gamma / m$: prior probability that the motif starts at any particular position of a sequence
28
E-step in the ZOOPS Model
$$Z_{i,j}^{(t)} = \frac{P(X_i \mid Z_{i,j}=1, p^{(t-1)})\,\lambda^{(t-1)}}{P(X_i \mid Q_i = 0, p^{(t-1)})\,(1 - \gamma^{(t-1)}) + \sum_{k=1}^{m} P(X_i \mid Z_{i,k}=1, p^{(t-1)})\,\lambda^{(t-1)}}$$

where $Q_i$ indicates whether sequence $i$ contains an occurrence of the motif ($Q_i = \sum_j Z_{i,j}$), and

$$P(X_i \mid Q_i = 0, p^{(t-1)}) = \prod_{j=1}^{L} p_{c_j,0}^{(t-1)}, \qquad P(Q_i = 0) = 1 - \gamma^{(t-1)}$$
29
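A sketch of the ZOOPS E-step for a single sequence, assuming $\lambda = \gamma/m$ as defined above; the p matrix is the earlier example and the value $\gamma = 0.8$ is an arbitrary illustration.

```python
def prob_given_start(seq, j, p, W):
    """P(X | Z_j = 1, p); column 0 of p is the background distribution."""
    prob = 1.0
    for k, c in enumerate(seq, start=1):
        col = k - j + 1 if j <= k <= j + W - 1 else 0
        prob *= p[c][col]
    return prob

def zoops_e_step_one_sequence(seq, p, W, gamma):
    """Z_{i,j}^(t) under ZOOPS: the 'no motif' alternative (weight 1 - gamma)
    competes with each of the m starting positions (weight lambda = gamma / m)."""
    m = len(seq) - W + 1
    lam = gamma / m
    background = 1.0
    for c in seq:                       # P(X_i | Q_i = 0, p): all background
        background *= p[c][0]
    scores = [prob_given_start(seq, j, p, W) * lam for j in range(1, m + 1)]
    denom = background * (1 - gamma) + sum(scores)
    return [s / denom for s in scores]

p_prev = {"A": [0.25, 0.1, 0.5, 0.2], "C": [0.25, 0.4, 0.2, 0.1],
          "G": [0.25, 0.3, 0.1, 0.6], "T": [0.25, 0.2, 0.2, 0.1]}
print(zoops_e_step_one_sequence("GCTGTAG", p_prev, W=3, gamma=0.8))
```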
M-step in the ZOOPS Model
• Update $p$ the same as before
• Update $\gamma$ (and thus $\lambda = \gamma/m$) as follows:

$$\gamma^{(t)} \equiv m\,\lambda^{(t)} = \frac{1}{n}\sum_{i=1}^{n} Q_i^{(t)}$$

where $Q_i^{(t)} = \sum_{j=1}^{m} Z_{i,j}^{(t)}$ is the expected value of $Q_i$
30
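A small sketch of this update; the Z values are made-up ZOOPS E-step outputs (rows need not sum to 1, since the "no motif" alternative absorbs the remaining probability).

```python
# Z[i][j-1] = Z_{i,j}^(t) from the ZOOPS E-step (illustrative values)
Z = [
    [0.05, 0.60, 0.10, 0.05],
    [0.10, 0.05, 0.05, 0.10],
    [0.20, 0.50, 0.05, 0.05],
]
Q = [sum(row) for row in Z]     # Q_i^(t): expected "sequence i contains a motif"
gamma_t = sum(Q) / len(Q)       # gamma^(t) = (1/n) * sum_i Q_i^(t)
m = len(Z[0])
lambda_t = gamma_t / m          # lambda^(t) = gamma^(t) / m
print(gamma_t, lambda_t)
```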
Extensions to the Basic EM
Approach in MEME
31
Starting Points in MEME
• EM is susceptible to local maxima, so it’s a good idea
to try multiple starting points
• Insight: motif must be similar to some subsequence
in data set
• For every distinct subsequence of length W in the
training set
– derive an initial p matrix from this subsequence
– run EM for 1 iteration
• Choose motif model (i.e. p matrix) with highest
likelihood
• Run EM to convergence
32
Using Subsequences as Starting
Points for EM
• Set the values matching the letters in the subsequence to some value π
• Set the other values to (1 − π)/(M − 1), where M is the size of the alphabet
• Example: for the subsequence TAT with π =0.7
          1    2    3
     A   0.1  0.7  0.1
     C   0.1  0.1  0.1
 p = G   0.1  0.1  0.1
     T   0.7  0.1  0.7
33
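Putting the pieces together, a compact end-to-end sketch of the search strategy described on the last two slides: seed from every distinct width-W subsequence, run one EM iteration per seed, keep the best-scoring model, then run EM from it. This is an illustration of the OOPS version only, not MEME's actual implementation; all function names and the tiny data set are made up.

```python
import math

ALPHABET = "ACGT"

def prob_given_start(seq, j, p, W):
    """P(X | Z_j = 1, p); column 0 of p is the background distribution."""
    prob = 1.0
    for k, c in enumerate(seq, start=1):
        col = k - j + 1 if j <= k <= j + W - 1 else 0
        prob *= p[c][col]
    return prob

def e_step(seqs, p, W):
    """Expected starting positions (OOPS model, uniform prior over starts)."""
    Z = []
    for seq in seqs:
        scores = [prob_given_start(seq, j, p, W)
                  for j in range(1, len(seq) - W + 2)]
        total = sum(scores)
        Z.append([s / total for s in scores])
    return Z

def m_step(seqs, Z, W, pseudo=1.0):
    """Re-estimate p from expected counts plus pseudo-counts; column 0 = background."""
    n = {c: [pseudo] * (W + 1) for c in ALPHABET}
    for seq, z_row in zip(seqs, Z):
        for j, z in enumerate(z_row, start=1):        # candidate motif start j
            for k in range(1, W + 1):                 # motif column k
                n[seq[j + k - 2]][k] += z
    for c in ALPHABET:                                # background gets the leftover counts
        n[c][0] += sum(s.count(c) for s in seqs) - (sum(n[c][1:]) - pseudo * W)
    return {c: [n[c][k] / sum(n[b][k] for b in ALPHABET) for k in range(W + 1)]
            for c in ALPHABET}

def log_likelihood(seqs, p, W):
    """OOPS log likelihood: sum_i log[(1/m) * sum_j P(X_i | Z_ij = 1, p)]."""
    return sum(math.log(sum(prob_given_start(s, j, p, W)
                            for j in range(1, len(s) - W + 2)) / (len(s) - W + 1))
               for s in seqs)

def seed_matrix(subseq, pi=0.7):
    """Initial p from one subsequence (slide above): matching letter gets pi,
    the others (1 - pi) / 3; column 0 is a uniform background."""
    W = len(subseq)
    p = {c: [0.25] + [(1 - pi) / 3] * W for c in ALPHABET}
    for k, c in enumerate(subseq, start=1):
        p[c][k] = pi
    return p

def find_motif(seqs, W, n_iters=50):
    """MEME-style search: seed from every distinct width-W subsequence, run one EM
    iteration per seed, keep the highest-likelihood model, then run EM from it."""
    seeds = {s[j:j + W] for s in seqs for j in range(len(s) - W + 1)}
    best_p, best_ll = None, -math.inf
    for sub in seeds:
        p = m_step(seqs, e_step(seqs, seed_matrix(sub), W), W)
        ll = log_likelihood(seqs, p, W)
        if ll > best_ll:
            best_p, best_ll = p, ll
    p = best_p
    for _ in range(n_iters):   # run EM from the best start (fixed count for simplicity)
        p = m_step(seqs, e_step(seqs, p, W), W)
    return p

# Tiny made-up data set, for illustration only
print(find_motif(["ACAGCA", "AGGCAG", "TCAGTC", "GTCAGG"], W=3))
```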
MEME web server
http://meme-suite.org/
34