CG6 ML

.
Maximum Likelihood (ML)

Parameter Estimation
with applications to
inferring phylogenetic trees
Comput. Genomics, lecture 6a
Presentation taken from Nir Friedman’s HU course, available at www.cs.huji.ac.il/~pmai.

Changes made by Dan Geiger, Ydo Wexler, and finally by Benny Chor.
.
The Setting
 We have a probabilistic model, M, of some
phenomena. We know exactly the structure of
M, but not the values of its probabilistic
parameters, .
 Each “execution” of M produces an
observation, x[i] , according to the (unknown)
distribution induced by M.
 Goal: After observing x[1] ,…, x[n] , estimate

the model parameters, , that generated the
observed data.
3
Maximum Likelihood Estimation (MLE)
 The likelihood of the observed data, given the

model parameters , as the conditional
probability that the model, M, with parameters
, produces x[1] ,…, x[n] .
L()=Pr(x[1] ,…, x[n] | , M),

 In MLE we seek the model parameters, , that
maximize the likelihood.
4
Maximum Likelihood Estimation (MLE)
 In MLE we seek the model parameters, , that
maximize the likelihood.
 The MLE principle is applicable in a wide
variety of applications, from speech recognition,
through natural language processing, to
computational biology.
 We will start with the simplest example:

Estimating the bias of a coin. Then apply MLE
to inferring phylogenetic trees.
 (will later talk about MAP - Bayesian inference).5
Example: Binomial Experiment
Head Tail
 When tossed, it can land in one of two positions:

Head (H) or Tail (T)
We denote by  the (unknown) probability P(H).

Estimation task:
 Given a sequence of toss samples x[1], x[2],
…, x[M] we want to estimate the probabilities
P(H)= and P(T) = 1 - 
6
Statistical Parameter Fitting (restement)
 Consider instances x[1], x[2], …, x[M]

such that
 The set of values that x can take is known i.i.d.
 Each is sampled from the same distribution Samples
(why??)
 Each sampled independently of the rest
 The task is to find a vector of parameters
 that have generated the given data. This
vector parameter  can be used to predict
future data.
7
The Likelihood Function
 How good is a particular ?
It depends on how likely it is to generate the
observed data
LD ( )  P( D |  )   P( x[m] |  )
m
 The likelihood for the sequence H,T, T, H, H is
L()
LD ( )    (1   )  (1   )  
0 0.2 0.4
 0.6 0.8 1
8
Sufficient Statistics
 To compute the likelihood in the thumbtack
example we only require NH and NT
(the number of heads and the number of tails)
LD ( )   NH
 (1   ) NT
 NH and NT are sufficient statistics for the

binomial distribution
9
Sufficient Statistics
A sufficient statistic is a function of the data that
summarizes the relevant information for the
likelihood
Formally, s(D) is a sufficient statistics if for
any two datasets D and D’
 s(D) = s(D’ )  LD() = LD’ ()
Datasets
Statistics
10
Maximum Likelihood Estimation
MLE Principle:
Choose parameters that maximize the
likelihood function
 This is one of the most commonly used

estimators in statistics
 Intuitively appealing
 One usually maximizes the log-likelihood

function, defined as lD() = ln LD()
11
Example: MLE in Binomial Data
lD    N H log   NT log 1   
Taking derivative and equating it to 0,
we get
NH NT ˆ NH
  
 1 N H  NT
(which coincides with what one would expect)
Example: L()
(NH,NT ) = (3,2)
MLE estimate is 3/5 = 0.6

0 0.2 0.4 0.6 0.8 1
12
From Binomial to Multinomial
 Now suppose X can have the values 1,2,…,K
(For example a die has K=6 sides)
 We want to learn the parameters 1, 2. …, K
Sufficient statistics:
N1, N2, …, NK - the number of times each outcome
is observed K
Likelihood function: LD ( ) 
k 1
 k
 Nk
ˆ Nk
k 
N
MLE: (proof @ assignment 3)
13
Example: Multinomial
 Let x1 x2 ....xn be a protein sequence
 We want to learn the parameters q1, q2,…,q20
corresponding to the frequencies of the 20 amino
acids
 N1, N2, …, N20 - the number of times each amino
acid is observed in the sequence
20
Likelihood function: LD (q )   qk Nk
k 1
Nk
MLE: qk 
n
14
Inferring Phylogenetic Trees
 Let S1 , S2 ,.... , Sn be n sequence (DNA or AA).
Assume for simplicity they are all same length, l.
 We want to learn the parameters of a
phylogenetic tree that maximizes the likelihood.
 But wait: Should first specify a model.
15
A Probabilistic Model
 Our models will consist of a “regular” tree, where
in addition, edges are assigned substituion
probabilities.
 For simplicity, assume our “DNA” has only two
states, say X and Y.

 If edge e is assigned probability pe , this means
that the probability of substitution (X Y)
across e is pe .
16
A Probabilistic Model (2)
 Our models will consist of a “regular” tree, where
in addition, edges are assigned substituion
probabilities.
 For simplicity, assume our “DNA” has only two
states, say X and Y.

that the probability of substitution (X Y)
across e is pe .
17
that the probability of more involved patterns of
substitution across e (e.g. XXYXY YXYXX)
is determined, and easily computed: pe2 (1- pe)3
for this pattern.
 Q.: What if pattern on both sides is known, but pe
is
not known?
 A.: Makes sense to seek pe that maximizes
probability of observation.
 So far, this is identical to coin toss example.
18
But a single edge is a fairly boring tree…
XXYXY YXYXX
pe2
pe1
pe3
????? YYYYX
Now we don’t know the states at internal node(s), nor

the edge parameters pe1, pe2, pe3
19
Two Ways to Go
XXYXY YXYXX
pe2
pe1
pe3
????? YYYYX
1. Maximize over states of internal node(s)

2. Average over states of internal node(s)
In both cases, we maximize over edge parameters

20
Two Ways to Go
XXYXY YXYXX
pe2
pe1
pe3
????? YYYYX
In the first version (average, or sum over states of internal
nodes) we are looking for the “most likely” setting of tree edges.
This is called maximum likelihood (ML) inference of
phylogenetic trees.
ML is probably the inference method most widely (wildly )

used. 21
Two Ways to Go
XXYXY YXYXX
pe2
pe1
pe3
????? YYYYX
In the second version (maximize over states of internal nodes)
we are looking for the “most likely” ancestral states. This is
called ancestral maximum likelihood (AML).
In some sense AML is “between” MP (having ancestral states)

and ML (because the goal is still to maximize likelihood).
22
bust
or a break
.

CG6 ML

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CG6 ML

Uploaded by

Copyright:

Available Formats

.

Maximum Likelihood (ML)

Comput. Genomics, lecture 6a

Presentation taken from Nir Friedman’s HU course, available at www.cs.huji.ac.il/~pmai.

 Goal: After observing x[1] ,…, x[n] , estimate

 The likelihood of the observed data, given the

L()=Pr(x[1] ,…, x[n] | , M),

 We will start with the simplest example:

 When tossed, it can land in one of two positions:

We denote by  the (unknown) probability P(H).

 Consider instances x[1], x[2], …, x[M]

 NH and NT are sufficient statistics for the

 This is one of the most commonly used

 One usually maximizes the log-likelihood

MLE estimate is 3/5 = 0.6

 But wait: Should first specify a model.

states, say X and Y.

states, say X and Y.

Now we don’t know the states at internal node(s), nor

1. Maximize over states of internal node(s)

In both cases, we maximize over edge parameters

ML is probably the inference method most widely (wildly )

In some sense AML is “between” MP (having ancestral states)

You might also like