Lecture 03 Maximum Likelihood Estimation


DATA SCIENCE AND AUTOMATION
Master degree in MECHATRONICS AND SMART TECHNOLOGY ENGINEERING

Lecture 3: Maximum Likelihood Estimation

Speaker: Prof. Mirko Mazzoleni
Place: University of Bergamo
Syllabus
1. Introduction to data science
   1.1 The business perspective
   1.2 Data analysis processes
2. Data visualization
3. Maximum Likelihood Estimation
4. Linear regression
5. Logistic regression
6. Bias-Variance tradeoff
7. Overfitting and regularization
8. Validation and performance metrics
9. Decision trees
10. Neural networks
11. Machine vision
    11.1 Classic approaches
    11.2 CNN and deep learning
12. Unsupervised learning
    12.1 k-means and hierarchical clustering
    12.2 Principal Component Analysis
13. Fault diagnosis
    13.1 Model-based fault diagnosis
    13.2 Signal-based fault diagnosis
    13.3 Data-driven fault diagnosis

Outline

1. Concepts of Maximum Likelihood Estimation (MLE)

2. Properties of the MLE

3. Example: MLE of the mean of a Gaussian distribution

Maximum Likelihood Estimation
The Maximum Likelihood Estimation (MLE) method is an estimation procedure that, given a probabilistic model, estimates its parameters so that they are most consistent with the observed data.

Assume we have $6$ i.i.d. observations $\mathcal{D} = \{y_1, y_2, \dots, y_6\}$, where $y_i \sim \mathcal{N}(\mu, \sigma^2)$

The pdf of a single random variable is

$$f_y(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right]$$

[Figure: Gaussian pdf $\mathcal{N}(\mu, \sigma^2)$ with the observations $y_1, \dots, y_6$ on the horizontal axis; the blue dots mark the pdf values $f_y(y_i \mid \mu, \sigma^2)$ at each observation]
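
As an illustration, a minimal Python sketch (the values of $\mu$, $\sigma$ and of the six observations are made up for this example) that evaluates the Gaussian pdf at each observation, both with the formula above and with scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values: parameters and observations are illustrative only
mu, sigma = 0.0, 1.0
y = np.array([-1.3, -0.4, 0.2, 0.5, 1.1, 2.0])   # y_1, ..., y_6

# Gaussian pdf evaluated at each observation (the "blue dots" of the figure)
pdf_manual = 1.0 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-0.5 * ((y - mu) / sigma)**2)
pdf_scipy = norm.pdf(y, loc=mu, scale=sigma)     # same values via scipy
print(np.allclose(pdf_manual, pdf_scipy))        # True
```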

Maximum Likelihood Estimation
Define the data vector $Y = [y_1, y_2, \dots, y_N]^\top$. Since the observations are i.i.d., the joint pdf of the data vector $Y$ is

$$f_Y(y_1, y_2, \dots, y_N \mid \mu, \sigma^2) = \prod_{i=1}^{N} f_y(y_i \mid \mu, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mu, \sigma^2)$$

The value assumed by the joint pdf $f_Y(Y \mid \mu, \sigma^2)$, with known $\mu$ and $\sigma^2$ and evaluated at the data $\mathcal{D}$, is the product of the blue dots in the previous example, where we had $N = 6$ observations

Maximizing the likelihood means maximizing this product
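
A possible sketch of this computation (same hypothetical parameters and observations as before): the joint pdf of i.i.d. data is the product of the individual pdf values.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                              # hypothetical known parameters
y = np.array([-1.3, -0.4, 0.2, 0.5, 1.1, 2.0])    # hypothetical N = 6 i.i.d. observations

# Joint pdf of i.i.d. data = product of the individual pdf values
joint_pdf = np.prod(norm.pdf(y, loc=mu, scale=sigma))
print(joint_pdf)
```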

Maximum Likelihood Estimation
As a function of the data $Y$, the joint pdf is a multivariate distribution. But we know the value of $Y$, since we observed those data.

If we also knew $\mu$ and $\sigma^2$, we could compute the probability of having observed $Y$. But we do not know $\mu$ and $\sigma^2$! That is exactly what we want to estimate!

When $f_Y(Y \mid \mu, \sigma^2)$ (the joint pdf) is seen as a function of the parameters $\mu$ and $\sigma^2$, it is called the likelihood $\mathcal{L}(\mu, \sigma^2 \mid Y)$.

Only the interpretation changes: $f_Y(Y \mid \mu, \sigma^2)$ and $\mathcal{L}(\mu, \sigma^2 \mid Y)$ are the same mathematical object

Maximum Likelihood Estimation
Summary

• If $f_Y(Y \mid \mu, \sigma^2)$ is a function of the data $Y$ (parameters $\mu, \sigma^2$ KNOWN, data not known variables): multivariate pdf

• If $f_Y(Y \mid \mu, \sigma^2)$ is a function of the parameters $\mu$ and $\sigma^2$ (data $Y$ KNOWN, parameters not known variables): likelihood $\mathcal{L}(\mu, \sigma^2 \mid Y)$

Usually, the notation $f_Y(Y \mid \mu, \sigma^2)$ changes into $\mathcal{L}(\mu, \sigma^2 \mid Y)$, to make clearer what is assumed known («to the right of the bar $\mid$») and what is not known («to the left of the bar $\mid$»)

Maximum Likelihood Estimation
The MLE is the value of the parameter vector $\boldsymbol{\theta}$ that maximizes the likelihood $\mathcal{L}(\boldsymbol{\theta} \mid Y)$.

Example: suppose we have only one datum $y_1 \sim \mathcal{N}(\mu, \sigma^2 = 1)$, and that its value is $y_1 = \bar{y}$. The parameter to be estimated is $\theta = \mu$ (the mean of the distribution).

[Figure: the two pdfs $f_y(y \mid \mu = 1, 1)$ and $f_y(y \mid \mu = 2, 1)$, with the values $\mathcal{L}(\mu = 1 \mid y_1 = \bar{y})$ and $\mathcal{L}(\mu = 2 \mid y_1 = \bar{y})$ marked at $y = \bar{y}$]

Notice that:

$$\mathcal{L}(\mu = 2 \mid y_1 = \bar{y}) > \mathcal{L}(\mu = 1 \mid y_1 = \bar{y})$$

so the observed datum $\bar{y}$ is more likely under $\mu = 2$ than under $\mu = 1$, given this probabilistic model and the data $\bar{y}$ (see the numeric check below)
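
A quick numeric check of this comparison (the observed value $\bar{y}$ is hypothetical, chosen closer to $2$ than to $1$ as in the figure):

```python
from scipy.stats import norm

y_bar = 1.8                                    # hypothetical value of the single datum y_1
lik_mu1 = norm.pdf(y_bar, loc=1.0, scale=1.0)  # L(mu = 1 | y_1 = y_bar)
lik_mu2 = norm.pdf(y_bar, loc=2.0, scale=1.0)  # L(mu = 2 | y_1 = y_bar)
print(lik_mu2 > lik_mu1)                       # True: the datum is more likely under mu = 2
```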

QUIZ!
In this example, the maximum likelihood estimate is:

❑ $\hat{\mu} = 2\bar{y}$

❑ $\hat{\mu} = \bar{y}$

❑ $\hat{\mu} = 2$

[Figure: the two pdfs $f_y(y \mid \mu = 1, 1)$ and $f_y(y \mid \mu = 2, 1)$, with $\mathcal{L}(\mu = 1 \mid y_1 = \bar{y})$ and $\mathcal{L}(\mu = 2 \mid y_1 = \bar{y})$ evaluated at $y = \bar{y}$]

Maximum Likelihood Estimation
For the Gaussian model of the previous example, the maximum likelihood estimate (now of both the mean and the variance) can be expressed as:

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \begin{bmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{bmatrix} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta} \mid Y) = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mu, \sigma^2), \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{2 \times 1}$$

In general, we can attribute to the data any probability distribution $f_Y$, either continuous or discrete:

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta} \mid Y), \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{d \times 1}$$

Maximum Likelihood Estimation
Often, instead of maximizing $\mathcal{L}(\boldsymbol{\theta} \mid Y)$, we maximize its natural logarithm:

• Since the logarithm is a monotonically increasing function, $\ln \mathcal{L}(\boldsymbol{\theta} \mid Y)$ has the same maximizer as $\mathcal{L}(\boldsymbol{\theta} \mid Y)$

• Using the logarithm is also advantageous from an implementation point of view, because it avoids possible underflows caused by the product of many small probabilities (the product is replaced by a sum of log-probabilities), as the sketch below illustrates

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \ln \mathcal{L}(\boldsymbol{\theta} \mid Y), \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{d \times 1}$$

Except in special, fortunate cases, the optimization is carried out with iterative methods
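
A minimal sketch of this numerical issue, using synthetic standard Gaussian data (the sample size is arbitrary): the plain product of densities underflows, while the sum of log-densities does not.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=2000)   # synthetic data, for illustration only

# Product of many small densities underflows to 0.0 in double precision ...
likelihood = np.prod(norm.pdf(y, loc=0.0, scale=1.0))
# ... while the sum of the log-densities stays well inside the floating-point range
log_likelihood = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))

print(likelihood)        # 0.0 (underflow)
print(log_likelihood)    # a finite negative number (around -2.8e3 here)
```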

Maximum Likelihood Estimate: properties
The maximum likelihood estimator has good properties. In fact, it is:

1. Asymptotically correct (unbiased): $\lim_{N \to +\infty} \mathbb{E}[\hat{\boldsymbol{\theta}}_{\text{ML}}] = \boldsymbol{\theta}_0$, where $\boldsymbol{\theta}_0$ is the true parameter vector.
   For finite $N$ the estimator can be biased: for example, the maximum likelihood estimator of the variance of a Gaussian population is biased (see the sketch after this list)

2. Consistent: the larger $N$, the more accurate the estimate

3. Asymptotically efficient: $\lim_{N \to +\infty} \operatorname{Var}[\hat{\boldsymbol{\theta}}_{\text{ML}}] = M^{-1}$, where $M$ is the Fisher information matrix

4. Asymptotically normal: $\hat{\boldsymbol{\theta}}_{\text{ML}} \sim \mathcal{N}(\boldsymbol{\theta}_0, M^{-1})$ for $N \to +\infty$
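
A possible Monte Carlo sketch of the remark on the variance estimator (sample size and number of trials are arbitrary): with ddof=0 NumPy computes the ML estimator of the variance, whose expected value is $(N-1)/N \cdot \sigma^2$, i.e. biased for finite $N$.

```python
import numpy as np

# Monte Carlo check of the bias of the ML variance estimator for a Gaussian population
rng = np.random.default_rng(42)
sigma2_true, N, n_trials = 1.0, 5, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(n_trials, N))
var_ml = samples.var(axis=1, ddof=0)        # ML estimator: divides by N     -> biased
var_unbiased = samples.var(axis=1, ddof=1)  # sample variance: divides by N-1 -> unbiased

print(var_ml.mean())        # ~ (N-1)/N * sigma2_true = 0.8
print(var_unbiased.mean())  # ~ 1.0
```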

MLE of the mean of a Gaussian distribution
Let us consider the case in which we want to estimate the mean $\mu$ of a population of i.i.d. Gaussian random variables, assuming that the variance of the distribution is known.

Assume we have observed only $2$ data points $y_i \sim \mathcal{N}(\mu, \sigma^2 = 1)$, $i = 1, 2$, i.i.d., with values $y_1 = 4$, $y_2 = 6$.

The shape of the pdf of the single random variable $y_i$ is:

$$f_y(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right] = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(y_i - \mu)^2\right]$$

since $\sigma^2 = 1$.

MLE of the mean of a Gaussian distribution
The value assumed by the pdf at the two observations $y_1 = 4$ and $y_2 = 6$ is:

$$f_y(y_1 = 4 \mid \mu, \sigma^2 = 1) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right], \qquad f_y(y_2 = 6 \mid \mu, \sigma^2 = 1) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

The joint distribution is the product of the two single pdfs (since the data are i.i.d.):

$$f_Y(y_1, y_2 \mid \mu, \sigma^2 = 1) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

MLE of the mean of a Gaussian distribution
The joint pdf is now only a function of $\mu$, since the value of the data is known. With this interpretation, the joint pdf is the likelihood function:

$$\mathcal{L}(\mu \mid y_1 = 4, y_2 = 6) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

The estimate $\hat{\mu}_{\text{ML}}$ is the value of $\mu$ that maximizes the likelihood:

$$\hat{\mu}_{\text{ML}} = \arg\max_{\mu} \ln \mathcal{L}(\mu \mid y_1 = 4, y_2 = 6)$$
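
As a cross-check, a small sketch that evaluates the log-likelihood of the two observations over a grid of candidate values of $\mu$ and picks the maximizer (the grid and its resolution are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

y = np.array([4.0, 6.0])                 # the two observations of the example
mu_grid = np.linspace(0.0, 10.0, 1001)   # candidate values for mu

# Log-likelihood ln L(mu | y1 = 4, y2 = 6) evaluated on the grid
log_lik = np.array([norm.logpdf(y, loc=mu, scale=1.0).sum() for mu in mu_grid])
mu_hat = mu_grid[np.argmax(log_lik)]
print(mu_hat)                            # 5.0, the maximizer of the likelihood
```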

MLE of the mean of a Gaussian distribution
It is more convenient to maximize the logarithm of the likelihood. This new function (the log-likelihood) has the same maximizer as the likelihood:

$$\ln \mathcal{L} = \ln\left\{ \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right] \right\}$$

$$= \ln\left\{ \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \right\} + \ln\left\{ \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right] \right\}$$

$$= \ln\frac{1}{\sqrt{2\pi}} + \ln\exp\left[-\frac{1}{2}(4 - \mu)^2\right] + \ln\frac{1}{\sqrt{2\pi}} + \ln\exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

$$= 2\ln\frac{1}{\sqrt{2\pi}} - \frac{1}{2}(4 - \mu)^2 - \frac{1}{2}(6 - \mu)^2$$

MLE of the mean of a Gaussian distribution
By maximizing the obtained expression with respect to $\mu$ we get:

$$\frac{\partial \ln \mathcal{L}}{\partial \mu} = 0 \;\Rightarrow\; (4 - \mu) + (6 - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{\text{ML}} = \frac{4 + 6}{2} = 5$$

The maximum likelihood estimate of the parameter $\mu$ for the Gaussian model is equal to the estimate obtained using the sample mean estimator!

This result, although not generalizable to every distribution, makes the maximum likelihood estimator very interpretable and intuitive
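
A symbolic cross-check of this derivation with SymPy (assuming the package is available); the additive constant $2\ln\frac{1}{\sqrt{2\pi}}$ is dropped since it does not depend on $\mu$ and does not affect the maximizer:

```python
import sympy as sp

mu = sp.symbols('mu', real=True)
# Log-likelihood of y1 = 4, y2 = 6 under N(mu, 1), up to an additive constant
log_lik = -sp.Rational(1, 2) * (4 - mu)**2 - sp.Rational(1, 2) * (6 - mu)**2
mu_hat = sp.solve(sp.Eq(sp.diff(log_lik, mu), 0), mu)
print(mu_hat)   # [5] -> the ML estimate coincides with the sample mean (4 + 6) / 2
```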

MLE of the mean of a Gaussian distribution
Observation: maximizing the «log-likelihood» is equivalent to minimizing the «negative log-likelihood»:

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \ln \mathcal{L}(\boldsymbol{\theta} \mid Y) = \arg\min_{\boldsymbol{\theta}} \left[-\ln \mathcal{L}(\boldsymbol{\theta} \mid Y)\right], \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{d \times 1}$$
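
A minimal sketch of this formulation with an iterative optimizer (scipy.optimize.minimize, BFGS by default); the data are synthetic, with an assumed true mean of 5 and known unit variance:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=1.0, size=100)      # synthetic data, assumed true mean = 5

def neg_log_lik(theta):
    # Negative log-likelihood of a Gaussian with unknown mean and known variance = 1
    return -norm.logpdf(y, loc=theta[0], scale=1.0).sum()

result = minimize(neg_log_lik, x0=np.array([0.0]))  # iterative minimization of -ln L
print(result.x[0], y.mean())                        # the two values essentially coincide
```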

