Lecture 03 Maximum Likelihood Estimation


DATA SCIENCE AND AUTOMATION
Master degree in MECHATRONICS AND SMART TECHNOLOGY ENGINEERING

Lecture 3: Maximum Likelihood Estimation

Speaker: Prof. Mirko Mazzoleni
Place: University of Bergamo
Syllabus
1. Introduction to data science
   1.1 The business perspective
   1.2 Data analysis processes
2. Data visualization
3. Maximum Likelihood Estimation
4. Linear regression
5. Logistic regression
6. Bias-Variance tradeoff
7. Overfitting and regularization
8. Validation and performance metrics
9. Decision trees
10. Neural networks
11. Machine vision
    11.1 Classic approaches
    11.2 CNN and deep learning
12. Unsupervised learning
    12.1 k-means and hierarchical clustering
    12.2 Principal Component Analysis
13. Fault diagnosis
    13.1 Model-based fault diagnosis
    13.2 Signal-based fault diagnosis
    13.3 Data-driven fault diagnosis

Outline

1. Concepts of Maximum Likelihood Estimation (MLE)

2. Properties of the MLE

3. Example: MLE of the mean of a Gaussian distribution

Maximum Likelihood Estimation
The Maximum Likelihood Estimation (MLE) method is an estimation procedure that, given a probabilistic model, estimates its parameters so that they are most consistent with the observed data.

Assume we have $6$ i.i.d. observations $\mathcal{D} = \{y_1, y_2, \dots, y_6\}$, where $y_i \sim \mathcal{N}(\mu, \sigma^2)$

The pdf of a single random variable is

$$f_y(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right]$$

[Figure: Gaussian pdf $\mathcal{N}(\mu, \sigma^2)$ with the observations $y_1, \dots, y_6$ on the horizontal axis; the blue dots mark the pdf values $f_y(y_i \mid \mu, \sigma^2)$ at each observation]
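
As an illustration, a minimal Python sketch (the values of $\mu$, $\sigma$ and of the six observations are made up for this example) that evaluates the Gaussian pdf at each observation, both with the formula above and with scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical values: parameters and observations are illustrative only
mu, sigma = 0.0, 1.0
y = np.array([-1.3, -0.4, 0.2, 0.5, 1.1, 2.0])   # y_1, ..., y_6

# Gaussian pdf evaluated at each observation (the "blue dots" of the figure)
pdf_manual = 1.0 / np.sqrt(2 * np.pi * sigma**2) * np.exp(-0.5 * ((y - mu) / sigma)**2)
pdf_scipy = norm.pdf(y, loc=mu, scale=sigma)     # same values via scipy
print(np.allclose(pdf_manual, pdf_scipy))        # True
```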

Maximum Likelihood Estimation
Define the data vector $Y = [y_1, y_2, \dots, y_N]^\top$. Since the observations are i.i.d., the joint pdf of the data vector $Y$ is

$$f_Y(y_1, y_2, \dots, y_N \mid \mu, \sigma^2) = \prod_{i=1}^{N} f_y(y_i \mid \mu, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mu, \sigma^2)$$

The value assumed by the joint pdf $f_Y(Y \mid \mu, \sigma^2)$, with known $\mu$ and $\sigma^2$ and evaluated at the data $\mathcal{D}$, is the product of the blue dots in the previous example, where we had $N = 6$ observations

Maximizing the likelihood means maximizing this product
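
A possible sketch of this computation (same hypothetical parameters and observations as before): the joint pdf of i.i.d. data is the product of the individual pdf values.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0                              # hypothetical known parameters
y = np.array([-1.3, -0.4, 0.2, 0.5, 1.1, 2.0])    # hypothetical N = 6 i.i.d. observations

# Joint pdf of i.i.d. data = product of the individual pdf values
joint_pdf = np.prod(norm.pdf(y, loc=mu, scale=sigma))
print(joint_pdf)
```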

Maximum Likelihood Estimation
As a function of the data $Y$, the joint pdf is a multivariate distribution. But we know the value of $Y$, since we observed those data.

If we also knew $\mu$ and $\sigma^2$, we could compute the probability of having observed $Y$. But we do not know $\mu$ and $\sigma^2$! That is exactly what we want to estimate!

When $f_Y(Y \mid \mu, \sigma^2)$ (the joint pdf) is seen as a function of the parameters $\mu$ and $\sigma^2$, it is called the likelihood $\mathcal{L}(\mu, \sigma^2 \mid Y)$.

Only the interpretation changes: $f_Y(Y \mid \mu, \sigma^2)$ and $\mathcal{L}(\mu, \sigma^2 \mid Y)$ are the same mathematical object

Maximum Likelihood Estimation
Summary

• If $f_Y(Y \mid \mu, \sigma^2)$ is a function of the data $Y$ (parameters $\mu, \sigma^2$ KNOWN, data not known variables): multivariate pdf

• If $f_Y(Y \mid \mu, \sigma^2)$ is a function of the parameters $\mu$ and $\sigma^2$ (data $Y$ KNOWN, parameters not known variables): likelihood $\mathcal{L}(\mu, \sigma^2 \mid Y)$

Usually, the notation $f_Y(Y \mid \mu, \sigma^2)$ changes into $\mathcal{L}(\mu, \sigma^2 \mid Y)$, to make clearer what is assumed known («to the right of the bar $\mid$») and what is not known («to the left of the bar $\mid$»)

Maximum Likelihood Estimation
The MLE is the value of the parameter vector $\boldsymbol{\theta}$ that maximizes the likelihood $\mathcal{L}(\boldsymbol{\theta} \mid Y)$.

Example: suppose we have only one datum $y_1 \sim \mathcal{N}(\mu, \sigma^2 = 1)$, and that its value is $y_1 = \bar{y}$. The parameter to be estimated is $\theta = \mu$ (the mean of the distribution).

[Figure: the two pdfs $f_y(y \mid \mu = 1, 1)$ and $f_y(y \mid \mu = 2, 1)$, with the values $\mathcal{L}(\mu = 1 \mid y_1 = \bar{y})$ and $\mathcal{L}(\mu = 2 \mid y_1 = \bar{y})$ marked at $y = \bar{y}$]

Notice that:

$$\mathcal{L}(\mu = 2 \mid y_1 = \bar{y}) > \mathcal{L}(\mu = 1 \mid y_1 = \bar{y})$$

so the observed datum $\bar{y}$ is more likely under $\mu = 2$ than under $\mu = 1$, given this probabilistic model and the data $\bar{y}$ (see the numeric check below)
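
A quick numeric check of this comparison (the observed value $\bar{y}$ is hypothetical, chosen closer to $2$ than to $1$ as in the figure):

```python
from scipy.stats import norm

y_bar = 1.8                                    # hypothetical value of the single datum y_1
lik_mu1 = norm.pdf(y_bar, loc=1.0, scale=1.0)  # L(mu = 1 | y_1 = y_bar)
lik_mu2 = norm.pdf(y_bar, loc=2.0, scale=1.0)  # L(mu = 2 | y_1 = y_bar)
print(lik_mu2 > lik_mu1)                       # True: the datum is more likely under mu = 2
```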

QUIZ!
In this example, the maximum likelihood estimate is:

❑ $\hat{\mu} = 2\bar{y}$

❑ $\hat{\mu} = \bar{y}$

❑ $\hat{\mu} = 2$

[Figure: the two pdfs $f_y(y \mid \mu = 1, 1)$ and $f_y(y \mid \mu = 2, 1)$, with $\mathcal{L}(\mu = 1 \mid y_1 = \bar{y})$ and $\mathcal{L}(\mu = 2 \mid y_1 = \bar{y})$ evaluated at $y = \bar{y}$]

Maximum Likelihood Estimation
For the Gaussian model of the previous example, the maximum likelihood estimate (now of both the mean and the variance) can be expressed as:

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \begin{bmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{bmatrix} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta} \mid Y) = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mu, \sigma^2), \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{2 \times 1}$$

In general, we can attribute to the data any probability distribution $f_Y$, either continuous or discrete:

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta} \mid Y), \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{d \times 1}$$

Maximum Likelihood Estimation
Often, instead of maximizing $\mathcal{L}(\boldsymbol{\theta} \mid Y)$, we maximize its natural logarithm:

• Since the logarithm is a monotonically increasing function, $\ln \mathcal{L}(\boldsymbol{\theta} \mid Y)$ has the same maximizer as $\mathcal{L}(\boldsymbol{\theta} \mid Y)$

• Using the logarithm is also advantageous from an implementation point of view, because it avoids possible underflows caused by the product of many small probabilities (the product is replaced by a sum of log-probabilities), as the sketch below illustrates

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \ln \mathcal{L}(\boldsymbol{\theta} \mid Y), \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{d \times 1}$$

Except in special, fortunate cases, the optimization is carried out with iterative methods
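
A minimal sketch of this numerical issue, using synthetic standard Gaussian data (the sample size is arbitrary): the plain product of densities underflows, while the sum of log-densities does not.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=2000)   # synthetic data, for illustration only

# Product of many small densities underflows to 0.0 in double precision ...
likelihood = np.prod(norm.pdf(y, loc=0.0, scale=1.0))
# ... while the sum of the log-densities stays well inside the floating-point range
log_likelihood = np.sum(norm.logpdf(y, loc=0.0, scale=1.0))

print(likelihood)        # 0.0 (underflow)
print(log_likelihood)    # a finite negative number (around -2.8e3 here)
```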

Maximum Likelihood Estimate: properties
The maximum likelihood estimator has good properties. In fact, it is:

1. Asymptotically correct (unbiased): $\lim_{N \to +\infty} \mathbb{E}[\hat{\boldsymbol{\theta}}_{\text{ML}}] = \boldsymbol{\theta}_0$, where $\boldsymbol{\theta}_0$ is the true parameter vector.
   For finite $N$ the estimator can be biased: for example, the maximum likelihood estimator of the variance of a Gaussian population is biased (see the sketch after this list)

2. Consistent: the larger $N$, the more accurate the estimate

3. Asymptotically efficient: $\lim_{N \to +\infty} \operatorname{Var}[\hat{\boldsymbol{\theta}}_{\text{ML}}] = M^{-1}$, where $M$ is the Fisher information matrix

4. Asymptotically normal: $\hat{\boldsymbol{\theta}}_{\text{ML}} \sim \mathcal{N}(\boldsymbol{\theta}_0, M^{-1})$ for $N \to +\infty$
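
A possible Monte Carlo sketch of the remark on the variance estimator (sample size and number of trials are arbitrary): with ddof=0 NumPy computes the ML estimator of the variance, whose expected value is $(N-1)/N \cdot \sigma^2$, i.e. biased for finite $N$.

```python
import numpy as np

# Monte Carlo check of the bias of the ML variance estimator for a Gaussian population
rng = np.random.default_rng(42)
sigma2_true, N, n_trials = 1.0, 5, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=(n_trials, N))
var_ml = samples.var(axis=1, ddof=0)        # ML estimator: divides by N     -> biased
var_unbiased = samples.var(axis=1, ddof=1)  # sample variance: divides by N-1 -> unbiased

print(var_ml.mean())        # ~ (N-1)/N * sigma2_true = 0.8
print(var_unbiased.mean())  # ~ 1.0
```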

MLE of the mean of a Gaussian distribution
Let us consider the case in which we want to estimate the mean $\mu$ of a population of i.i.d. Gaussian random variables, assuming that the variance of the distribution is known.

Assume we have observed only $2$ data points $y_i \sim \mathcal{N}(\mu, \sigma^2 = 1)$, $i = 1, 2$, i.i.d., with values $y_1 = 4$, $y_2 = 6$.

The shape of the pdf of the single random variable $y_i$ is:

$$f_y(y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{y_i - \mu}{\sigma}\right)^2\right] = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(y_i - \mu)^2\right]$$

since $\sigma^2 = 1$.

MLE of the mean of a Gaussian distribution
The value assumed by the pdf at the two observations $y_1 = 4$ and $y_2 = 6$ is:

$$f_y(y_1 = 4 \mid \mu, \sigma^2 = 1) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right], \qquad f_y(y_2 = 6 \mid \mu, \sigma^2 = 1) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

The joint distribution is the product of the two single pdfs (since the data are i.i.d.):

$$f_Y(y_1, y_2 \mid \mu, \sigma^2 = 1) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

MLE of the mean of a Gaussian distribution
The joint pdf is now only a function of $\mu$, since the value of the data is known. With this interpretation, the joint pdf is the likelihood function:

$$\mathcal{L}(\mu \mid y_1 = 4, y_2 = 6) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

The estimate $\hat{\mu}_{\text{ML}}$ is the value of $\mu$ that maximizes the likelihood:

$$\hat{\mu}_{\text{ML}} = \arg\max_{\mu} \ln \mathcal{L}(\mu \mid y_1 = 4, y_2 = 6)$$
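
As a cross-check, a small sketch that evaluates the log-likelihood of the two observations over a grid of candidate values of $\mu$ and picks the maximizer (the grid and its resolution are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

y = np.array([4.0, 6.0])                 # the two observations of the example
mu_grid = np.linspace(0.0, 10.0, 1001)   # candidate values for mu

# Log-likelihood ln L(mu | y1 = 4, y2 = 6) evaluated on the grid
log_lik = np.array([norm.logpdf(y, loc=mu, scale=1.0).sum() for mu in mu_grid])
mu_hat = mu_grid[np.argmax(log_lik)]
print(mu_hat)                            # 5.0, the maximizer of the likelihood
```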

MLE of the mean of a Gaussian distribution
It is more convenient to maximize the logarithm of the likelihood. This new function (the log-likelihood) has the same maximizer as the likelihood:

$$\ln \mathcal{L} = \ln\left\{ \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \cdot \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right] \right\}$$

$$= \ln\left\{ \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(4 - \mu)^2\right] \right\} + \ln\left\{ \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}(6 - \mu)^2\right] \right\}$$

$$= \ln\frac{1}{\sqrt{2\pi}} + \ln\exp\left[-\frac{1}{2}(4 - \mu)^2\right] + \ln\frac{1}{\sqrt{2\pi}} + \ln\exp\left[-\frac{1}{2}(6 - \mu)^2\right]$$

$$= 2\ln\frac{1}{\sqrt{2\pi}} - \frac{1}{2}(4 - \mu)^2 - \frac{1}{2}(6 - \mu)^2$$

MLE of the mean of a Gaussian distribution
By maximizing the obtained expression with respect to $\mu$ we get:

$$\frac{\partial \ln \mathcal{L}}{\partial \mu} = 0 \;\Rightarrow\; (4 - \mu) + (6 - \mu) = 0 \;\Rightarrow\; \hat{\mu}_{\text{ML}} = \frac{4 + 6}{2} = 5$$

The maximum likelihood estimate of the parameter $\mu$ for the Gaussian model is equal to the estimate obtained using the sample mean estimator!

This result, although not generalizable to every distribution, makes the maximum likelihood estimator very interpretable and intuitive
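
A symbolic cross-check of this derivation with SymPy (assuming the package is available); the additive constant $2\ln\frac{1}{\sqrt{2\pi}}$ is dropped since it does not depend on $\mu$ and does not affect the maximizer:

```python
import sympy as sp

mu = sp.symbols('mu', real=True)
# Log-likelihood of y1 = 4, y2 = 6 under N(mu, 1), up to an additive constant
log_lik = -sp.Rational(1, 2) * (4 - mu)**2 - sp.Rational(1, 2) * (6 - mu)**2
mu_hat = sp.solve(sp.Eq(sp.diff(log_lik, mu), 0), mu)
print(mu_hat)   # [5] -> the ML estimate coincides with the sample mean (4 + 6) / 2
```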

MLE of the mean of a Gaussian distribution
Observation: maximizing the «log-likelihood» is equivalent to minimizing the «negative log-likelihood»:

$$\hat{\boldsymbol{\theta}}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \ln \mathcal{L}(\boldsymbol{\theta} \mid Y) = \arg\min_{\boldsymbol{\theta}} \left[-\ln \mathcal{L}(\boldsymbol{\theta} \mid Y)\right], \qquad \hat{\boldsymbol{\theta}}_{\text{ML}} \in \mathbb{R}^{d \times 1}$$
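
A minimal sketch of this formulation with an iterative optimizer (scipy.optimize.minimize, BFGS by default); the data are synthetic, with an assumed true mean of 5 and known unit variance:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=1.0, size=100)      # synthetic data, assumed true mean = 5

def neg_log_lik(theta):
    # Negative log-likelihood of a Gaussian with unknown mean and known variance = 1
    return -norm.logpdf(y, loc=theta[0], scale=1.0).sum()

result = minimize(neg_log_lik, x0=np.array([0.0]))  # iterative minimization of -ln L
print(result.x[0], y.mean())                        # the two values essentially coincide
```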

