MH3500 Statistics

Lecture Notes
Chapter 2
Parameter Estimation
Table of Contents
Section Remark
1. Introduction to Parameter Estimation
2. “Ad-hoc” Parameter Estimation Method
3. Parametric Statistical Models
4. Formal Process of Parameter Estimation
5. Moments of a Random Variable
6. Constructing Good Estimators: Method of Moments
7. Constructing Good Estimators: Maximum Likelihood Method
8. Maximizing Likelihood Functions in the Case of One Unknown Parameter
9. Maximum Likelihood Method for Continuous Distributions
10. Non-Existence of MLEs
11. Non-Uniqueness of MLEs
12. Bias, Standard Error, Mean Squared Error
13. Sample Variance as Estimator for Variance
14. Estimated Standard Error
15. Criteria for Quality of Estimators
16. Consistency of Estimators
17. Consistency of Method of Moments Estimators
18. Consistency of MLEs
19. Best Unbiased Estimator
20. Expected Value and Variance of Score Function
21. Fisher Information: Additivity and Alternative Formula
22. Cramér-Rao Bound and Efficient Estimators
23. Asymptotic Properties of Estimators
24. Asymptotic Normality and Asymptotic Variance of MLEs
25. Review of Concept of Confidence Intervals
26. Pivotal Quantities and Construction of Exact Confidence Intervals
27. Exact Confidence Intervals for Exponential Distributions
28. Asymptotically Pivotal Quantities and Approximate Confidence Intervals
1) Introduction to Parameter
Estimation

Parameter Estimation as Part of


Statistical Inference
Statistical Inference (Inferential Statistics)
“Population” (process that generates data) is given
Data drawn from population is represented by random
(i.i.d.) sample , where the population distribution is
unknown or partially unknown

Goal of Statistical Inference: Extract information from


about the population distribution (“infer” means
“deduce something from evidence”)

Purpose: Understand structure of data in order to facilitate


predictions, recommendations, decisions
Parameter Estimation
Consider i.i.d. sample

Often the type of distribution ( , etc.) of ’s


is known, but its parameters ( are unknown

Parameter estimation: Extract information from


on these parameters

This is one way to do statistical inference

Non-parametric statistics is an alternative, but not covered


in this course
Example: Diastolic Blood Pressure
(DBP)

 This is a histogram for 1000 observations


 We want to find the underlying distribution
 The contour of the histogram should look like the graph
of the PDF of the distribution
 We can assume that the data are normally distributed
Diastolic Blood Pressure (DBP)

 Conclusion:
 Remaining task: Approximately find
 This is the purpose of parameter estimation
Process of Point Estimation
Given are data , viewed as realizations of i.i.d.
random variables

1) Modeling: Identify a suitable type of distribution for ’s,


which depends on parameters

2) Point estimation: Find functions


which approximate (“best
guesses”)

3) Substitute in these functions to get


estimates for
Example for Point Estimation

Data (measurements in a physics experiment):


1.1) Find Appropriate Type of
Distribution
Histogram: Normal distribution
seems appropriate

Here
1.2) Find Functions to Approximate
Parameters
 (by Law of Large Numbers)

 We will show that the sample variance is a “good”


estimator for
1.3) Substitute Data in Functions to
Obtain Estimations

Not bad!

These data were drawn from an distribution


Remarks
 Important distinction: are fixed real numbers
are random variables
 is one realization of - e.g., further
realizations could be used to get more information on the
distribution of the ’s
 The unknown parameters are fixed real
numbers
 The point estimators for the
parameters of the distribution are random variables
 For different realizations of they usually give
different results
2) “Ad-hoc” Estimation Method
Example: Binomial Distribution
 Let be i.i.d. where and are
both unknown
 Given: Observations
 Goal: Estimate and from

 Recall: ,

 Idea: Use the fact that on average is close to


and is close to
Parameter Estimation for Binomial
Distribution, continued
 i.i.d.
We expect:
(1)
(2)

(1) & (2)

(1)
Observations drawn from
Binomial( ) with unknown
12 13 7 10 6 11 12 6 7 10 10 9 10 15 11 14 9 8 9 12 10 16 16 135 11 7 6 9 19 7 16 9 11 14 9 8 13 16 8 7 11 10
10 7 12 11 1312 18 8 8 12 16 11 14 7 7 10 13 4 8 7 6 8 9 12 12 11 14 8 3 8 9 12 9 7 10 8 10 8 5 10 9 13 10
10 8 12 6 8 9 11 10 14 7 7 9 8 12 12 7 9 5 9 9 10 6 8 10 12 13 13 11 10 15 9 9 12 12 10 12 9 6 8 12 3 7 9
11 8 7 10 10 8 15 9 11 12 11 6 7 9 7 8 11 13 3 12 10 11 9 6 11 13 14 10 10 10 8 18 7 15 11 7 6 10 710 7 7 13
7 9 14 12 7 14 10 15 13 12 7 5 14 13 8 8 9 8 9 8 9 8 8 7 12 14 12 6 8 12 6 4 9 10 11 14 7 6 13 9 10 10 8 8
7 9 9 10 8 10 9 13 10 11 13 11 7 9 6 11 8 13 13 9 16 15 9 11 6 11 13 12 12 16 8 10 10 17 7 9 11 9 9 10 12 8 12
3 12 6 10 10 12 14 5 10 3 9 9 14 10 12 14 7 13 11 15 8 4 7 8 8 14 8 11 9 11 8 10 11 8 9 10 11 14 9 13 7 6 8
9 11 7 9 7 9 10 5 9 11 11 7 8 11 3 13 9 8 9 5 11 15 11 4 8 11 14 13 8 14 10 5 10 13 12 5 17 10 8 14 10 10 11
13 8 16 9 10 11 9 6 6 12 10 12 12 9 8 11 11 16 12 8 7 11 12 14 10 16 10 9 8 11 15 11 7 10 8 12 12 9 12 11 9 8
11 9 10 12 7 11 12 12 10 12 6 8 9 13 5 4 15 11 10 10 11 10 12 15 12 11 11 7 8 14 7 14 9 9 7 12 6 10 10 6 12 11
10 9 9 11 11 11 4 12 13 11 11 10 8 7 8 14 9 12 12 13 13 4 10 8 8 10 6 10 16 12 13 10 8 12 9 13 11 9 8 7 8 6
18 6 9 6 10 12 10 9 12 10 7 11 6 8 4 11 9 9 16 10 8 10 8 12 11 13 11 14 14 5 11 7 11 7 9 10 10 9 10 12 12 8 8
9 11 12 9 9 14 8 11 10 12 11 11 12 15 10 11 16 7 11 14 15 12 10 9 12 11 8 17 11 13 10 7 10 12 12 9 11 7 9 11 13
10 10 13 9 6 12 9 12 8 9 14 11 5 8 13 13 15 8 11 5 9 5 14 14 11 14 10 11 11 9 13 10 8 12 7 9 11 7 7 7 12 10
10 9 6 11 13 11 10 10 12 12 9 9 11 12 13 9 7 7 8 8 8 9 13 10 11 10 8 7 10 10 9 11 6 12 8 10 14 8 11 8 17 18
11 9 9 11 7 12 12 16 15 7 7 6 14 15 8 5 9 12 10 7 9 13 8 7 11 11 15 9 8 11 8 9 6 5 11 6 8 5 11 15 7 8 6 7
9 10 10 5 4 4 10 9 7 7 5 6 8 9 4 11 12 13 9 9 8 10 9 9 10 11 11 7 13 13 12 5 10 6 11 10 9 11 12 10 8 11 9 8
14 8 14 7 8 6 16 10 8 7 11 13 12 12 12 9 6 10 7 10 9 12 14 7 7 8 8 6 8 13 8 12 11 8 12 10 7 7 12 13 10 10 13
5 13 13 9 15 10 10 11 13 13 11 9 12 10 12 7 9 10 6 9 16 9 16 8 11 7 13 4 7 10 11 18 10 12 4 10 9 15 11 9 8 15
10 9 16 11 7 6 10 6 9 12 14 11 11 9 10 15 8 13 11 8 9 10 8 12 8 11 9 7 10 6 10 7 9 10 9 7 10 9 6 8 12 11 13
8 14 8 8 7 12 10 15 7 7 10 9 8 8 12 14 4 12 2 10 10 10 13 4 12 12 9 9 12 11 9 13 613 8 8 10 9 9 10 13 9 14
12 7 2 9 12 7 10 15 14 7 12 10 12 10 12 9 13 12 7 14 10 13 13 6 12 9 7 6 7 13 6 6 10 7 11 15 9 15 11 12 9 11
16 9 13 10 10 11 12 11 6 9 10 10 9 10 10 5 8 10 12 10 14 7 5 7 14 9 9 10 7 14 10 9 12 12 13 8 7 10 12 12 13 7
10 12 17 9 7 8 12 12 13 11 10 9 6 14 13 13 11 11
Estimations for and Derived from
Observations

Observations were actually drawn from
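
The moment-matching argument above can be carried out in a few lines of R. A minimal sketch, using the relations mean ≈ np and variance ≈ np(1 − p); the vector obs (a short excerpt of the data), the divisor-m variance, and all variable names are illustrative assumptions:

obs = c(12, 13, 7, 10, 6, 11, 12, 6, 7, 10)   # excerpt; replace with the full data set above
xbar = mean(obs)                               # close to n*p on average
v = mean((obs - xbar)^2)                       # close to n*p*(1-p) on average
p_hat = 1 - v / xbar                           # since np(1-p) / (np) = 1 - p
n_hat = xbar / p_hat                           # since np / p = n
p_hat; n_hat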


Example, Conclusion

 Sample mean and variance can be useful to estimate


unknown parameters of the population distribution
 However, the arguments used are "ad-hoc" and we do
not yet have a way to measure the accuracy of the
estimation
 More systematic methods are needed for parameter
estimation in general
3) Parametric Statistical Models

Most common distributions we covered depend on parameters

Examples:

Distribution Parameters

If the population distribution of an i.i.d. sample depends on unknown parameters , we say the ’s follow a parametric model

One purpose of statistical inference is to extract information on from observations
Notations
Suppose i.i.d. where is a
distribution depending on parameters

Then , Var , moments of , PDF, CDF, and all


probabilities involving ’s also depend on . If there is a
need to, we indicate this as in these examples:

 or
 or
 or
 or (PDF / PMF)
Discrete and Continuous Models

Consider a parametric model i.i.d.

If is continuous / discrete, then the model is also called


continuous / discrete, respectively

Examples:
Discrete parametric model: i.i.d.
Continuous parametric model: i.i.d.
4) Formal Process of Parameter
Estimation
Parameter Estimation
 Given: observations (data)
 are viewed as realizations of i.i.d. random
variables
 Identify a suitable type of population distribution for ’s
which depends on unknown parameters
 Point estimation:
Find functions which
approximate
 Substitute in these functions to get
estimates for
Estimators
 Let be functions which
approximate
 are called estimators for
 Estimators are random variables which only depend on
the sample , that is, they are statistics

Main problem: How to obtain estimators that are as precise


as possible?
5) Moments of a Random Variable
The -th moment ( ) of a random variable is

where is the PDF or PMF, respectively, of

If the integral, respectively the sum, converges, we say the


-th moment of X exists

Fact: If the -th moment exists, then the -th moment also exists
for all (this is proved in the theory of
Lebesgue integration)
Example
Let be a random variable with PDF for
and otherwise

The first and second


moments of exist, but all
other moments of do not
exist


Definition of Sample Moments

Let be an i.i.d. random sample

is called the -th sample moment

Note: Sample mean


Example

Create a vector containing 10 observations in R:


data = c(13, -10, 11, 6, 12, -11, 10, -12, -13, -6)

Compute sample moments:


moment(data,1) (computes , result: 0)
moment(data,2) (computes , result: 114)
moment(data,3) (computes , result: 0)

(to use the function moment, first install and load the
package “moments”)
Example

Verify validity of for an example using R:

data = c(13, -10, 11, 6, 12)


n = length(data) (number of data values)
sample_var = var(data) (computes sample variance)
S_1 = moment(data,1) (first sample moment)
S_2 = moment(data,2) (second sample moment)
sample_var - (n/(n-1))*(S_2-S_1^2)
(result 0, as predicted by )
Connection Between Population
Moments & Sample Moments
i.i.d. random sample
-th population moment:

-th sample moment

Remarks: The expected value of the -th sample moment


is the -th population moment, and by the Law of Large
Numbers converges to in probability when the
appropriate number of moments exist.
6) Constructing Good Estimators:
Method of Moments (Motivation)
Let i.i.d.
We know:

By law of large numbers, we also know that:


converges to in probability
converges to in probability

Is there any way to combine and to get an estimator


for ?
Motivation, continued
Idea is to use and to express
through and :

So, could be a “good” estimator for

That is,
Method of Moments
Let be an i.i.d. sample drawn from a population
distribution , which depends on parameters

 Compute moments:
 Each which depends on , say
, provides one equation for
(namely, )
 Continue until the system of equations ,
with variables has a unique solution
,…,
Method of Moments, continued
 Let ,…, be the
unique solution of ,
 Replace moments in expressions for by
sample moments to obtain estimators

 These are called method of moments estimators for


Example
We reformulate the previous Gamma distribution example
in the general framework for the method of moments:
i.i.d.

Equations for :

(stop computing moments, as this system has a unique


solution ( )
Example, continued

Unique solution:
,

Replace moments by sample moments:

,
(method of moments estimators for Gamma distribution)
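
A minimal R sketch of these estimators, assuming the shape-rate parametrization of the Gamma distribution (mean alpha/lambda, variance alpha/lambda^2); the simulated stand-in data and the variable names are illustrative:

set.seed(1)
data = rgamma(1000, shape = 2, rate = 0.5)    # stand-in data; replace with the real observations
S1 = mean(data)                                # first sample moment
S2 = mean(data^2)                              # second sample moment
lambda_hat = S1 / (S2 - S1^2)                  # lambda expressed through the moments
alpha_hat = S1^2 / (S2 - S1^2)                 # alpha expressed through the moments
alpha_hat; lambda_hat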
Example (Including Data & Modeling)

Data:

Find a suitable statistical model and use the method of


moments to estimate its parameters
Example: Statistical Model
Histogram:

Need to find distribution with a PDF that looks similar


PDF of Gamma Distribution

(here and )
Statistical Model, continued
Data Gamma PDFs

should be suitable
Can use method of moments estimators for
Example, continued
,

Use data from above


7) Constructing Good Estimators:
Maximum Likelihood Method (MLE)
(Motivation)
Among 1000 computer chips produced, exactly 13 were
found defective (to simplify notation, we assume the first 13
are defective)
Model: i.i.d. where if chip is
defective and is an unknown parameter
Given the observations and for
, is there a value, say , for which is “most
probable”?
No! Since is a constant (not a random variable), it is either
equal to or not. No probability involved
Motivation, continued
i.i.d.
Observations: , for

Based on the observations, we cannot decide if


or not, because could also happen if or
even

But still seems much more plausible than

Given the observations, is there any way to pick a “most


plausible” value for ?
Motivation, continued
i.i.d.
Observations: , for
“Most plausible” value for ?

Problem: There is no “most probable” value of (it is not a


random variable)

Main idea of maximum likelihood


Reverse the approach: Find a value of that makes
the observations most likely
Motivation, continued
i.i.d.
Observations: , for
If , then

If , then

Hence is more plausible than


Formal Description of Maximum
Likelihood Method for Discrete
Distributions
Probability of Observations,
(Discrete Case)
Let be i.i.d. where is a discrete distribution
with PMF ,
depending on unknown parameters

Observations are given

We need to compute the probability to observe ,


that is,

This leads us to the so-called likelihood function


Likelihood Function
PMF:

(independence of ’s)
(identical distribution of ’s)
(definition of PMF)

is called a
likelihood function
Interpretation of Likelihood Function,
(Discrete Case)
is the probability
to observe under the assumption that
the true parameters are

The idea of the maximum likelihood method is to choose


those values for as estimators which maximize
this probability
Example
We revisit the example from above and use the likelihood
function
i.i.d. with PMF
,

Observations: , for

Likelihood function:
Example, continued
The maximum likelihood method picks the value of
that maximizes the likelihood function

We can find the maximizer by setting the derivative to zero:

is the maximizer

That is, the maximum likelihood estimate for is


Range of Parameters
For a distribution depending on , the
parameters are often restricted to a certain
subset of

Examples:
for
for

We call the range of the parameters


Likelihood Function with Random
Variables
If we maximize
for given observations , we get estimates for
which are valid only for these observations

To get estimators which work for all possible observations,


we need to substitute the random variables instead
of the observations: Maximize

The resulting maximizers depend on , that


is, they are statistics
Maximum Likelihood Method for
Discrete Parametric Models
Let be i.i.d. where is a discrete distribution
depending on parameters with range
and PMF
Compute likelihood function

If satisfies , then
is called a maximum likelihood estimator (MLE) for
MLE Maximization Problem
To compute the MLE, we need to solve the maximization
problem , where and is the
range of the parameters

That is, we need to find a global maximum of over

This can be difficult, as the likelihood function may be


complicated

But likelihood functions typically have a structure that allows


simplifications in the maximization process
Dependence of Likelihood Function
on Parameters
Recall:

When we solve , we are only interested in


how depends on the parameters
(the ’s are considered as constants during maximization)

That’s why we may use the notation instead of


during the maximization process
Example
i.i.d. with . Find the
MLE for

Recall PMF of :
,

Now the dependence of the PMF on is important, so we


denote the PMF by
Example, continued
PMF:
Likelihood function (with random variables):

Substitute PMF into likelihood function:


Need to find which maximizes this function


( ’s are considered as constants during maximization)
Example, continued

where and

Derivative test for extreme values:

is the MLE for

(for a rigorous justification, see Theorem 1 later)


8) Maximizing Likelihood Functions in
the Case of One Unknown Parameter
The whole concept of maximum likelihood estimators is
based on finding global maximizers

Just setting derivatives to zero and solving the resulting


equations is not enough to justify the results (could yield
local maxima / minima or saddle points etc.)

Our goal now is to provide a simple, but rigorous way to


compute maximizers in the case of one unknown
parameter
Examples of Likelihood Functions

For :

,

For :
,
Typical Properties of
Likelihood Functions
Take only positive values in the range of the parameters
Differentiable with respect to parameters
Limit of the function is zero whenever parameter
approaches a boundary of its range

We call these conditions the “standard conditions for


likelihood functions”
Example

where and are constants

Only positive values: Clear (product of PMFs)


Differentiability:
for all
Limits when parameter approaches a boundary:

( dominates for )
Standard Conditions
Let be a function defined on a (finite or infinite) interval
such that

a) for all

b) exists for all

c)

In this case, we say that satisfies the standard


conditions for likelihood functions
Maximizing Likelihood Functions
Theorem 1
Let be a function defined on a (finite or infinite) interval
that satisfies the standard conditions for likelihood
functions. Then has a global maximizer on and
we have .

Hence a simple method to find a global maximizer is:


Compute all solutions of
Pick the solution with the largest as
Idea of Proof for Theorem 1
Extreme Value Theorem:
)
(1) If is continuous on a finite
interval , then it attains a maximum
on

Since , there is
→ →
a finite sub-interval of such
that

(2) The maximum of over is the


same as its maximum over
(e.g., in the example)

(Figure: a function satisfying the standard conditions)

(1) & (2) attains maximum on
Idea of Proof for Theorem 1, continued
)
attains maximum on

Let be a maximizer of on

Derivative test for extreme


values

(Figure: a function satisfying the standard conditions)
Logarithm-Trick
Chain rule:

So

Conclusion: When using Theorem 1, we can solve


instead of

Often is easier to handle

Remark: always denotes the natural logarithm


How to Maximize Likelihood
Functions: Summary
Given: Likelihood function , defined on interval

 Check that is positive and differentiable and satisfies

 Compute (“log-likelihood function”)

 Find all solutions of


 Solution with largest is a maximizer
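
When the equation from the derivative test has no closed-form solution, the log-likelihood can also be maximized numerically. A minimal R sketch using optimize(); the Poisson model, the made-up counts and the search interval are illustrative assumptions:

x = c(3, 1, 4, 2, 2, 0, 3, 5)                                 # some observed counts
loglik = function(lambda) sum(dpois(x, lambda, log = TRUE))   # log-likelihood of the sample
optimize(loglik, interval = c(0.001, 100), maximum = TRUE)$maximum
mean(x)                                                       # closed-form Poisson MLE, for comparison

For the Poisson model the numerical maximizer agrees with the sample mean, which is the closed-form MLE.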
9) Maximum Likelihood Method for
Continuous Distributions
Recall: Main Idea of Maximum Likelihood Method

Let be an i.i.d. sample whose population


distribution depends on parameters

Given observations , we pick those values of the


parameters as estimators, which maximize the probability
to observe

This works smoothly for discrete distributions, but for


continuous distributions we run into a problem
Goal of Maximum Likelihood Method
for Continuous Distributions

It makes no sense to maximize the probability to observe

Instead, we aim for maximizing

for small , that is, the probability to observe up to


small deviations
Interpretation of PDF

 is the area between PDF and x-axis over


the interval
 If the interval is small and is any point in the interval,
then (assumes continuity of
PDF)
is approximately proportional to the probability of
taking a value close to
Probability for Observations up to
Small Deviations
Let be i.i.d. with PDF

(independence of ’s)

(take , and on previous slide)


Probability for Observations up to
Small Deviations, Conclusion
Have seen:

Conclusion:
Probability to observe up to deviation is
approximately proportional to
Likelihood Function (Continuous Case)
Let i.i.d. where is a continuous
distribution with PDF

Observations:

Likelihood function:

This function measures the probability to approximately


observe if the true parameters are
Maximum Likelihood Estimators for
Continuous Distributions
Let be i.i.d. where is a continuous
distribution depending on parameters with
range and PDF

Compute likelihood function

If satisfies , then
is called a maximum likelihood estimator (MLE) for
MLEs for Continuous Distributions,
Summary
The maximum likelihood method for continuous distributions
is the same as in the discrete case, just with PMF replaced
by PDF
The interpretation for the method in the continuous case,
however, is different (maximize probability to approximately
observe )
The maximization problem can be solved with the same
tools as for discrete distributions, including Theorem 1 and
the logarithm trick
Example
Let be i.i.d. where
PDF: for and otherwise

Likelihood function:

Need to find with


Example, continued

Need to solve

satisfies the standard conditions for


likelihood functions:
(a) for all

(b) exists for all

(c)

Hence it suffices to find all solutions of


Example, continued

This is the only solution of

Theorem 1 is the MLE for


10) Non-Existence of MLEs
Recall that:
i.i.d. with and PMF
or PDF

Likelihood function:

By definition, every maximizer of over is


an MLE for

It can happen, however, that a maximizer does not exist,


i.e., that no MLE for exists
Example for MLE Non-Existence
Let have PMF
for

and otherwise

Here is an unknown parameter with range

Likelihood function (for :


Example for MLE Non-Existence,
continued

Note that only takes values

If , then is strictly decreasing on the


whole real line
If , then is strictly increasing on the
whole real line
In both cases, has no maximizer on , i.e., no
MLE for exists
Reason for MLE Nonexistence
 Parameter space is not compact
 Maximum of continuous function over non-compact set
may not exist (happens here)
 actually is a Bernoulli variable and could be better
modelled by with
(compact)

Rule of thumb: MLE nonexistence can be avoided by


choosing an appropriate model
Remark on MLE for Bernoulli
If we choose the model with
(open interval!), we also will get MLE nonexistence:

If , then , which does not attain a


maximum on . Similarly, if , then the MLE
also does not exist
If we change the range of to (closed interval), then
the MLE always exists (MLE is if and 0 if
)
Another Example for MLE Non-
Existence
Tutorial 5:
PDF , ,
The parameter measures how much this PDF is
concentrated around :
11) Non-Uniqueness of MLE
Let i.i.d. , , where is an unknown
parameter (here

PDF: for and


otherwise
,

, ,

Hence every , , maximizes


, i.e., is an MLE for
Example
PDF: for and otherwise
Reason for Non-Uniqueness

 Likelihood function is constant on a whole interval

 does not distinguish between values of the


parameter which are close to each other

 Reliable MLEs require that is “relatively far away


from ” on average if

 We will see that this issue is related to “Fisher


information” (which we study later in this course)
12) Bias, Standard Error, Mean
Squared Error
Definition of Bias:
Let be a population distribution, parameterized by a real
number and let be an estimator for . Then

Bias

is unbiased if Bias for all , otherwise biased

Here “ ” means the expected value under the assumption


that the distribution parameter is (sometimes the in
is dropped, but we still have to be aware of the assumption)
Interpretation of Bias
Bias is the expected deviation of from the true
parameter

A good estimator must have bias zero or at least its bias


should tend to zero for increasing sample size

Small bias is, however, not sufficient to be sure that is


close to the true on average ( ] also must be small)
Example
Population distribution: Bernoulli ,

Estimator: (for )

Bias

for all
So is unbiased
Definition of Standard Error
Let be a population distribution, parameterized by a real
number and let be an estimator for

Standard error of :

(here the variance is computed under the implicit


assumption that the parameter is ; if necessary, we can
write and to indicate the dependence on )
Interpretation of Standard Error

 Suppose is an unbiased estimator for

 measures how
much on average is off from the true

 Rule of thumb: For large samples, the true will be in


the interval in around 70% of the
cases and in the interval in
around 95% of the cases, if is used repeatedly to
estimate
Graphical illustration (three cases):
• Unbiased, but not precise (large SE)
• Precise (small SE), but biased
• Accurate: unbiased and precise

Example
Let be i.i.d. with population distribution
,
Estimator: (for )

Shown previous lecture:


Recall: If , then
Definition of Mean Squared Error
Let be a population distribution, parameterized by a real
number and let be an estimator for

Mean squared error of :

(here the expected value is computed under the implicit


assumption that the parameter is )

Note:
Connection between Bias, Variance,
and MSE
$\mathrm{MSE}_\theta(\hat\theta) = \operatorname{Var}_\theta(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2$

The proof uses the following facts.

Let be a random variable. We have


(1) for all constants
(2)
Connection between Bias, Variance,
and MSE
$\mathrm{MSE}_\theta(\hat\theta) = \operatorname{Var}_\theta(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2$

Proof:
)
) (by definition of bias and by (1))
(by (2))

(by definition of MSE)
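
For reference, the decomposition written out in full; this is the standard derivation, using that adding a constant does not change the variance and that $\operatorname{Var}(Y) = E[Y^2] - (E[Y])^2$:

$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta\bigl[(\hat\theta - \theta)^2\bigr] = \operatorname{Var}_\theta(\hat\theta - \theta) + \bigl(E_\theta[\hat\theta - \theta]\bigr)^2 = \operatorname{Var}_\theta(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2 .$$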


Interpretation of MSE

 The MSE simultaneously controls the bias and the


variance of an estimator
 If the MSE is small, we can be sure that the estimator on
average will be close to the true parameter
13) Sample Variance as Estimator for Variance
Example: Let be i.i.d. with ,
(not necessarily normally distributed)

Recall:

We now compute the bias of when considered as an


estimator for (this will explain the factor )
Bias of Sample Variance
Bias of Sample Variance (continued)
Bias of Sample Variance (continued)
Have shown:

Hence is an unbiased estimator for whenever


i.i.d. with ,

This would not be true if was replaced by and this is


the explanation for the factor
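
The role of the factor 1/(n-1) versus 1/n can be checked by simulation. A minimal R sketch; the normal model, the chosen n and sigma^2, and the number of replications are illustrative assumptions:

set.seed(1)
n = 10; sigma2 = 4
s2_unbiased = replicate(10000, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))   # divisor n-1
s2_biased = replicate(10000, {x = rnorm(n, mean = 0, sd = sqrt(sigma2)); mean((x - mean(x))^2)})   # divisor n
mean(s2_unbiased)   # close to sigma2 = 4
mean(s2_biased)     # close to (n-1)/n * sigma2 = 3.6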
14) Estimated Standard Error

Let be an estimator for a parameter .

Problem: depends on the true value – which is


unknown

To emphasize the dependence of on , we now use


the notation
Definition of Estimated Standard Error

Standard error depends on unknown

Solution:
Replace by in to obtain the estimated
standard error:
Example (Textbook Page 261)
Experiment: Asbestos suspended in water and spread on a
filter
Counts of asbestos fibers on 23 equally sized samples
from the filter:
31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18,
26, 27, 27, 18, 24, 22, 28, 24, 21, 17, 24

Note: If a random variable represents the number of


events occurring during a certain time interval or within a
certain area, then usually has an approximately Poisson
distribution
Counts of Asbestos Fibers: Model
• number of fibres that end up on sample ,

• is the number of events occurring within a certain area

• The size of the area is the same for all and the counts
on the sample are approximately independent

Assumption i.i.d. reasonable


Estimation of
total number of fibres that end up on sample
Model: i.i.d.
expected number of fibers that end up on
sample

Estimator:
Substitute data

We now use the estimated standard error to measure the


accuracy of this estimate
Estimated Standard Error
, substitute data

SE of :

Estimated SE:
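
A minimal R sketch of this computation for the asbestos counts. Since the estimator is the sample mean of a Poisson sample, its standard error is sqrt(lambda/n), estimated by plugging in lambda_hat; the variable names are illustrative:

counts = c(31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18,
           26, 27, 27, 18, 24, 22, 28, 24, 21, 17, 24)
n = length(counts)                    # 23 samples from the filter
lambda_hat = mean(counts)             # estimate of lambda (about 24.9)
se_hat = sqrt(lambda_hat / n)         # estimated standard error (about 1.04)
lambda_hat; se_hat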
Interpretation of Estimated SE
We have and

In view of the Central Limit Theorem, if we approximate the


distribution of by , noting that

Hence,
15) Criteria for Quality of Estimators
Main Idea for Precise Estimators

We recall the Chebyshev Inequality:

If is a random variable with and , then

for all .

Implies: If is small, then the probability of a


substantial deviation of from is also small

Proof of Chebyshev Inequality: See previous Chapter.


Chebyshev for Estimators
Suppose is an estimator for parameter
(subscript here indicates dependence of on sample size)

If and , then

for all by Chebyshev Inequality


Chebyshev for Estimators (continued)
Suppose and

Have shown:
If holds, then for all

This means: By increasing , the estimator for can be


made arbitrarily precise

Hence, we should aim for estimators which satisfy both


conditions in
How to Achieve
First, we compute .

If the result is a real valued function , then we set


(scaling)

So that we have
How to Achieve
Example: i.i.d. , with both
unknown.

Suppose we want to estimate , which is


Consider the estimator:

Now,
How to Achieve
This usually is automatically satisfied for reasonable
estimators, because they are often based on averages and
the variance of an average tends to for increasing sample
size

Example: i.i.d.
is an estimator for
Constructing Precise Estimators
(Summary)
 Goals: and

 Chebyshev If these conditions are satisfied,


becomes arbitrarily precise for large enough

 can be achieved by computing and


scaling, if necessary

 should automatically be satisfied if is


chosen reasonably
Quality Criteria for Estimators
The ideas for constructing precise estimators are
formalized as follows.

Let be an estimator for ; we would like it to have:


a) should be close to
Small Bias
b) Small expected deviation of from
Small Standard Error
c) Conditions (a) and (b) combined
Small Mean Squared Error
16) Consistency of Estimators
Let be an estimator (based on a sample ) for a
parameter

We say that is consistent if

for all

In other words: Probability that estimator deviates more


than from the true value can be made arbitrarily small by
increasing the sample size
Proving Consistency

Proving consistency requires to show

Need to compute or estimate

Requires knowledge on the distribution of

This can be quite challenging, since often is complicated

Best tool: Theorem 1 (later in this lecture)


Example
i.i.d.
(estimator for )

Shown in previous Lecture:

Let be the CDF of

by

consistent estimator for


Sufficient Conditions for Consistency
Theorem 1: Let be an estimator (based on a sample
) for a parameter

Suppose that
(a)
(b)

Then is consistent.

That is, for all


Proof
By the proof of Chebyshev inequality, we have:

Now,

Cross-term = 0

With conditions (a) & (b) is consistent.
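
In symbols, the argument is the usual Chebyshev/Markov bound applied to the mean squared error:

$$P_\theta\bigl(|\hat\theta_n - \theta| \ge \varepsilon\bigr) \le \frac{E_\theta\bigl[(\hat\theta_n - \theta)^2\bigr]}{\varepsilon^2} = \frac{\operatorname{Var}_\theta(\hat\theta_n) + \mathrm{Bias}_\theta(\hat\theta_n)^2}{\varepsilon^2} \longrightarrow 0 \quad (n \to \infty)$$

under conditions (a) and (b).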


17) Consistency of Method of
Moments Estimators
Recall: Law of Large Numbers

Let be i.i.d. with . Then

for all .

In words: Probability that the average of i.i.d. random


variables deviates by more than any constant from their
expected value tends to 0 for increasing sample size
Illustration of Consistency

is consistent for if for all

Value frequency of consistent estimator for (example):

Recall: Method of Moments
(for Single Unknown Parameter)
Let be an i.i.d. sample drawn from

1) Compute moments until


system of equations

has a unique solution

2) MME: where
Consistency of MMEs
(for Single Unknown Parameter)
Theorem 1

If is continuous, then the method of moments estimator


is a consistent estimator for

Remark: The theorem, with replaced by , also


holds for MMEs in the case of several unknown parameters
Proof of Theorem 1

To show: for all

We know:
a)
b)
c)

d)
e) is continuous
Proof of Theorem 1, continued
Step 1: Apply Law of Large Numbers to

We know: (c) , (d) for

“Probability that the average of i.i.d. random variables


deviates by more than any constant from their expected
value tends to 0 for increasing sample size”

for and all


Proof of Theorem 1, continued
Step 2: Use continuity of to bound

(by (a) and (b))

Continuity of given , there is such that

whenever for all

Hence, for all


Proof of Theorem 1, continued

for all

for at least one

Sub-additivity of probability

By Step 1, each term goes to zero


Example
Let be an i.i.d. sample
We know,

Hence,

with and continuous on

and are consistent estimators


18) Consistency of MLEs
Recall: MLE (for one Parameter)

i.i.d. with

Likelihood function:

If , then is an MLE for


Consistency of MLEs
Theorem 2

i.i.d. with likelihood function

Under appropriate regularity assumptions on , the MLE


for is unique and consistent.

We will have a stronger result later in this Chapter.


19) Best Unbiased Estimator
Example (Motivation)

Let i.i.d. with


PMF: ,

Let and

We have shown that both and are unbiased estimators


of .

To determine which is better, we just need to compare the


variances.
Example (Motivation)
We know, , but is quite a lengthy
calculation.

Even if we could show that for all ,


how about other unbiased estimators such as

for any real constant ?

The idea is to find a lower bound, say , for any unbiased


estimator of , and if we can find an unbiased estimator
satisfying , then we have found a best
unbiased estimator.
Information in Observations
Suppose i.i.d. with

Important question:
How much information about can observations
contain?
That is, how precise can an estimator for based on
be?

The so-called Fisher Information measures how much


information about is contained in on average
(when observations are repeated)
Definition of Fisher Information

Let i.i.d. with likelihood function

The Fisher Information of at is


Remarks on Computation of Fisher
Information

is computed under the assumption that is the true


parameter, that is, that is the PMF/PDF of the ’s

To simplify notation, the true parameter usually just is


denoted by
Example
Let i.i.d. with
We compute the Fisher Information

Preparations:
, , (previous lecture)

and are independent for


Example, continued
Log-Likelihood and Score Function
The function occurring
in the definition of the Fisher Information is called the
log-likelihood function

is the score function

Note that
Fisher Information: Summary of
Relevant Expressions
Let i.i.d. with likelihood function

Log-likelihood function:

Score function:

Fisher Information:
Condition
The following condition is useful for several identities
involving the Fisher Information.

Let be the population PMF or PDF.


Consider the following condition
(for PDF)

(for PMF)

This holds for most common PMFs and PDFs


Example
Let be i.i.d.
for and otherwise

We check if condition holds:


Example (continued)

In summary, condition holds for

Note: We always have , since

Hence, to check if holds, it suffices to determine if the


first integral in is zero
20) Expected Value and Variance of
Score Function
Recall: Condition

Let be the population PMF or PDF.

(for PDF)

(for PMF)
Recall: Score Function and Fisher
Information
Likelihood function:

Score function:

Fisher information:

Our goal now is to learn more about the score function. This will help simplify the computation of the Fisher information in many cases.
Expected Value of Score Function
Consider the case of sample size :
Expected Value of Score Function,
continued
If condition holds, then

Thus,
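
For the continuous case with sample size 1, the computation can be written out as follows (with $f(x \mid \theta)$ the population PDF):

$$E_\theta\!\left[\frac{\partial}{\partial\theta} \ln f(X \mid \theta)\right] = \int \frac{\partial_\theta f(x \mid \theta)}{f(x \mid \theta)} \, f(x \mid \theta) \, dx = \int \partial_\theta f(x \mid \theta) \, dx = 0,$$

where the last equality is exactly the condition stated earlier; it holds whenever differentiation and integration may be interchanged, because the PDF integrates to 1 for every value of the parameter.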
Variance of Score Function
Finally,

Hence, if condition holds, the Fisher Information equals


the variance of the score function
Summary

Score function:

If condition holds, then

(a)

(b)
21) Fisher Information:
Additivity and Alternative Formula
Score function:

If condition holds, then

(a)

(b)
Additivity of Fisher Information
Suppose condition holds. Then
(definition)

(variance of score function)

(definition of score function)

(definition of likelihood function)

(independence of ’s)
Additivity of Fisher Information
Then

(independence of ’s)

(definition of score function)


(identical distribution)
(definition of Fisher Information)
Additivity of Fisher Information,
Conclusion

If holds, then .

That is, the Fisher Information of i.i.d. sample of size


equals times Fisher Information of a size 1 sample
Alternative Formula

If condition holds and also holds with replaced


by , then

This formula, together with the additivity, often simplifies


the computation of the Fisher Information

For a proof, see textbook, Section 8.5, Lemma A


Remark

The condition with replaced by is

(for PDF)

(for PMF)
Alternative Formula (Proof)
To prove,

We notice that

and,
Alternative Formula (Proof)
So we have,

Taking another derivative,

Therefore,
Example
i.i.d. with
PMF: ,
Example

Now, using alternative formula,

Additivity,
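
As a numerical sanity check of the alternative formula in the Bernoulli case, where the Fisher Information of a size-1 sample is 1/(p(1-p)), here is a minimal R sketch; the value p = 0.3 and the simulation size are illustrative assumptions:

set.seed(1)
p = 0.3
x = rbinom(100000, size = 1, prob = p)    # simulated Bernoulli(p) observations
# second derivative in p of log f(x|p) = x*log(p) + (1-x)*log(1-p):
d2 = -x / p^2 - (1 - x) / (1 - p)^2
mean(-d2)                                 # Monte Carlo estimate of the Fisher Information
1 / (p * (1 - p))                         # exact value, about 4.76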
22) Cramer-Rao Bound and Efficient
Estimators
For a “good” estimator for , a basic requirement is to be
unbiased

Among all unbiased estimators for , an optimal one has

minimal variance; such an estimator is called a best (“optimal”) unbiased estimator

How to prove that an estimator has minimal variance?

Main tool: A lower bound on the variance of unbiased


estimators
Cramér-Rao Lower Bound (CRLB)
Theorem: Let be a random sample with joint
PDF where .

Let be any estimator where is a


differentiable function of .

Suppose that satisfies

for any function with .


Theorem: Cramér-Rao Lower Bound

Then

Note:

is called the Cramér-Rao Lower Bound
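
In a commonly stated form, writing $T$ for the estimator and $g(\theta) = E_\theta[T]$ for its (differentiable) mean, the bound reads:

$$\operatorname{Var}_\theta(T) \;\ge\; \frac{\bigl(g'(\theta)\bigr)^2}{E_\theta\!\left[\left(\frac{\partial}{\partial\theta} \ln f(X_1,\dots,X_n \mid \theta)\right)^{2}\right]},$$

and the right-hand side is the Cramér-Rao Lower Bound (the symbols $T$, $g$ and the joint PDF notation are introduced here only for readability).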


Proof of Cramér-Rao’s Theorem
First, we notice that for any 2 random variables and ,
and hence,

If we let and

Then, we just need to show , and


compute .
Proof of Cramér-Rao’s Theorem
To show ,
Proof of Cramér-Rao’s Theorem
To compute ,
Proof of Cramér-Rao’s Theorem
Putting together,
Corollary
Let i.i.d. with PDF where

Let be an unbiased estimator for

Suppose that the conditions for CRLB’s theorem hold. Then

(here is the Fisher Information of a sample of size 1


with PDF )
Proof of Corollary
Since is an unbiased estimator for ,

and,
Example:
Let i.i.d. with
PMF: ,
We know, and

Let be an unbiased estimator of , with

Is a ‘good’ estimator of , in the sense that it achieves the


CRLB?
Example:

Hence, the CRLB is


Definition of Efficient Estimators
An estimator for is called efficient if it satisfies the
following two conditions.

1) is unbiased
2) Equality holds in the Cramér-Rao bound for , that is,

In other words, an efficient estimator has the smallest


possible variance among all unbiased estimators
23) Asymptotic Properties of
Estimators
Asymptotic analysis of estimators means to study their
properties for large sample size (mathematically, for sample size tending to infinity)

Often the exact distribution of an estimator cannot be


determined (too complicated)

But for large sample size, the distribution can be


approximated, which gives valuable information on the
accuracy of the estimator (bias, consistency, confidence
intervals)
Asymptotic Analysis: Example
Let be i.i.d. where is an
unknown parameter.

Question: What is the MLE for ?

Answer: (shown in a previous lecture)

We now consider how behaves for different values of


Behavior of
For each , we generate observations drawn from
, compute , repeat this 1000 times and
represent the outcome by a histogram:
(Histograms of the values of the estimator for increasing sample sizes)

Conclusion:

For increasing , the distribution


of becomes more and more
concentrated around the true
parameter and can be
approximated by a normal
distribution
Advantage of Asymptotic Analysis

 Uses the phenomenon demonstrated by the example to


provide information on accuracy of estimators without
need to know their exact distribution

 Broad applicability
24) Asymptotic Normality and
Asymptotic Variance of MLEs

Recall: Definition of Convergence in Distribution

A sequence of random variables converges to


in distribution if
for all ,

where is the CDF of


Recall: CLT

Let be i.i.d. with and .


Then
for all

In other words, converges to in distribution

We now discuss an extension of the CLT to maximum


likelihood estimators
Regularity Conditions

i.i.d. with PDF or PMF , where is an


unknown parameter

Let be an MLE for based on

Goal: Extend the CLT to (instead of )

This can be done if satisfies the regularity


conditions stated on the next slide
Regularity Conditions, continued

1) Third order partial derivative of with respect to


exists and is continuous

2) There is an open interval with ,


where is the true parameter (that is, is an “inner point”
of )

3) The “support” of , that is, is the


same for all
Asymptotic Normality of MLEs
Theorem 1
Suppose is an MLE based on an i.i.d. sample with
PDF or PMF where .
Let be the true parameter and ) the Fisher
Information of at .
If satisfies the regularity conditions (1)-(3), then

converges to in distribution.

(Proof: See reference books mentioned in Lecture 1, e.g.


page 369 of Hogg, McKean & Craig)
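
In symbols, with $\theta_0$ the true parameter, $I(\theta_0)$ the Fisher Information of a sample of size 1, and $\hat\theta_n$ the MLE, the standard statement is:

$$\sqrt{n\,I(\theta_0)}\,\bigl(\hat\theta_n - \theta_0\bigr) \;\xrightarrow{d}\; N(0,1), \qquad \text{equivalently} \qquad \hat\theta_n \approx N\!\left(\theta_0,\ \frac{1}{n\,I(\theta_0)}\right) \text{ for large } n.$$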
Application of Theorem 1

In most applications, we use the following consequence of


Theorem 1

For large sample size,

for all , where is the CDF of


Significance of Theorem 1
Theorem 1 ) approximately

This enables us to find approximations to:


 Probability for to deviate from the true parameter by


any constant
 Confidence intervals (to be discussed in another lecture)

Usually this is all we need to know about an estimator


Consequence for
Theorem 1
) approximately
)
)

Conclusion: By Theorem 1, the MLE is asymptotically


unbiased
Definition of Asymptotic Variance
Theorem 1
) approximately
)

We call the asymptotic variance of


Example
i.i.d. ,

is MLE for (shown in previous lecture)

Theorem 1 ) converges to
in distribution
When Theorem 1 can not be Applied

a) i.i.d. when the true parameter is


or . In this case, the true parameter is not an
inner point of

b) i.i.d. with unknown parameter . The


support of is and thus is not the same for all
25) Review of Concept of Confidence
Intervals
Consider i.i.d. where is an unknown
parameter

An estimator for only provides a single “best guess” for


(“point estimator”), based on a random sample

Bias and standard error measure average precision of


Both types of information, the best guess and average
precision, can be combined into a confidence interval
(“interval estimation”)
In almost all applications of parameter estimation, confidence
intervals are used (point estimation is not enough)
Definition of Confidence Intervals
i.i.d. where is an unknown parameter and
is the true parameter

Suppose and are statistics


such that for all

Then is called an exact


confidence interval for

The probability is called the confidence level of the


interval
Example of Confidence Interval
Elections and Confidence Intervals

(New York Times, 5 October 2016)

This is a commonly used rule of thumb for accuracy of polls


It actually is a statement about a 95% confidence interval
for a parameter estimator
Example of Confidence Interval
Poll of 1000 persons for US election is conducted

Estimator:

New York Times claim: Roughly 95% of the time, the


estimate should be within of the true answer
This means:
is an (approximately) 95% confidence
interval for
Polls Revisited
Whenever the distribution of an estimator is known,
confidence intervals can easily be constructed

In previous example:
and

where is the CDF of


Polls Revisited, continued
Have shown:

where is the CDF of

When

When

actually is not really an approximate


95% confidence interval for all , but it provides at least
95% confidence.
Polls Revisited, Summary

Rule of thumb: provides at least 95%


confidence for the confidence interval for

This means for all

It is approximately 95% for only.

For away from , the Fisher Information is


larger and thus increases (will be shown
later).
Interpretation of Confidence Intervals
Every set of observations gives an interval

Suppose sets of observations are taken. This produces


intervals
Random Variables

means that we
expect that roughly of these intervals contain

Confidence intervals express what to expect from repeated


observations (samples), not from a single set of observations
Interpretation of Confidence Intervals
If the same population is sampled on numerous occasions, and
interval estimates are made on each occasion, the resulting interval
would bracket the true population mean in approximately 95% of the
cases.
26) Pivotal Quantities and Construction
of Exact Confidence Intervals
Recall that:
If i.i.d. where is an unknown parameter
and is the true parameter

Suppose and are statistics


such that for all

Then is called an
exact confidence interval for .
The probability is called the confidence level of the
interval
Motivation
So, we need to find and with
for all

Possible idea:
Suppose be an estimator for and suppose that only
takes positive values

Try and with a


constant , such that

for all
Motivation, continued
for all
for all
note that and are constants

Eqn can typically only hold if the distribution of


does not depend on (otherwise, Eqn will not
hold for all )

In such a case, is an example of a pivotal quantity:


It involves , but its distribution does not depend on
Meaning of “Distribution Does Not
Depend on a Parameter”
Let i.i.d. ,
Let be a random variable that is a
function of and the unknown parameter

We say that the distribution of does not depend on


if, for all for some
function
In other words, the CDF of does not depend on
Equivalently, the PDF / PMF of does not depend on
Definition of Pivotal Quantities
Let i.i.d. where and the ’s are
unknown parameters
A random variable is called a pivotal quantity
(or pivot) if its distribution does not depend on any unknown
parameter
Note:
 can depend on (in fact, usually does), but
its distribution must not depend on
 usually is not a statistic (a statistic must
only depend on )
How to Prove that an Expression is a
Pivotal Quantity
To prove that is a pivotal quantity, we have
the following options:
 Determine a PDF of and show that it does not involve
any unknown parameter
 Determine the CDF of and show that it does not involve
any unknown parameter
 Show that the MGF exists for all
for some interval containing zero and that
does not involve any unknown parameter
Example

i.i.d. where is unknown

 is not a pivotal quantity


(distribution depends on )

 is a pivotal quantity

 is also a pivotal quantity


Preparation: CDF and PDF of
Let be a random variable with CDF for
.
Let be a constant, then

( )

Conclusion:
CDF:

PDF (in continuous case):


Construction of Exact Confidence
Intervals
Given: i.i.d. where is an unknown
parameter
1) Find a pivotal quantity
2) Find with for all

3) Find and such that

4) Then is an exact
confidence interval for
Construction of Pivotal Quantities
The essential step in the construction of confidence intervals
is to come up with pivotal quantities

Method that often (not always) works:


 Start with the MLE
 Usually its distribution depends on
 Modify to so that the distribution of does not depend
on any more
 To show that is a pivotal quantity, it is enough to show
that the CDF or PDF or PMF of does not involve
Example for Construction of Pivotal
Quantity
Let i.i.d. with

MLE: (see previous lecture)

CDF of is
(see Tutorial Problems)

Our goal is to modify so that the CDF does not depend


on any more
Example for Construction of Pivotal
Quantity, continued

CDF of is

We want to find a constant such that the CDF of does


not depend on

Recall: To get the CDF of , replace by in the CDF of

Question: What should we choose?


Example for Construction of Pivotal
Quantity, Conclusion
CDF of is

CDF of is for

CDF of is for

CDF of does not depend on

is a pivotal quantity
Example for Construction of Confidence
Interval from Pivotal Quantity
We have:
is a pivotal quantity for i.i.d.

CDF of is for

Next step: Find with for all


, where is the true parameter

Use CDF to get an expression for the probability:


Example for Construction of Confidence
Interval from Pivotal Quantity, continued
Need with

There are infinitely many solutions . We can just pick


one of them, say, ,

Next step: Find and such that


Example for Construction of Confidence
Interval from Pivotal Quantity, continued

is exact confidence interval for


Example for Construction of Confidence
Interval from Pivotal Quantity, Conclusion
Have shown: is confidence interval
for in the case of i.i.d.

Example of observations: :

Note
Take

is a 95% confidence interval for


based on these observations
Overview of Pivotal Quantities for
Normal Distribution
i.i.d. ,

confidence intervals for if is known

confidence intervals for if both unknown

confidence intervals for if both unknown


(Figure: density of the pivotal quantity with its upper percentage points marked; the central area between them equals the confidence level, and the interval based on the smaller percentage point is narrower.)
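
As an example, a minimal R sketch of the confidence interval for the mean of a normal sample when both parameters are unknown, based on the classical pivotal quantity $(\bar X - \mu)/(S/\sqrt{n}) \sim t_{n-1}$; the simulated data and the 95% level are illustrative assumptions:

set.seed(1)
x = rnorm(20, mean = 10, sd = 3)                          # stand-in data
n = length(x); alpha = 0.05
half = qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)    # half-width of the interval
c(mean(x) - half, mean(x) + half)                         # exact 95% confidence interval for mu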


27) Exact Confidence Intervals for
Exponential Distributions
Confidence Intervals for
i.i.d. with PDF

Need to find pivotal quantity based on

MLE:

Need to modify so that we get a random variable whose


distribution does not depend on
Start with the simplest possible expression:
Confidence Intervals for , cont.
i.i.d.
, ,

and
and
Mean and variance of both do not depend on
potentially is a pivotal quantity

Need to show: Distribution of does not depend on


Confidence Intervals for , cont.

with i.i.d. Can also use

Goal: Show that the distribution of does not depend on


Instead, we can also consider the distribution of

Tutorial Problems 2:
(does not depend on )
is pivotal quantity
Confidence Intervals for , cont
Have shown: is pivotal quantity
Next step: Find with

Recall upper percentage points:

is defined by
Confidence Intervals for , cont

Have shown:

Next step: Express as a condition on


:
, / , /

, / , /
is confidence interval for
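
A minimal R sketch of this exact interval, assuming the rate parametrization of the Exponential distribution and the standard pivot $2\lambda \sum_i X_i \sim \chi^2_{2n}$ (which may be written differently in the notes); the simulated data and the 95% level are illustrative:

set.seed(1)
x = rexp(25, rate = 2)                         # stand-in sample, true lambda = 2
n = length(x); alpha = 0.05
lower = qchisq(alpha / 2, df = 2 * n) / (2 * sum(x))
upper = qchisq(1 - alpha / 2, df = 2 * n) / (2 * sum(x))
c(lower, upper)                                # exact 95% confidence interval for lambda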
28) Asymptotically Pivotal Quantities
and Approximate Confidence Intervals
Motivation
i.i.d. . Can we find a pivotal quantity
based on ?
MLE:
,

It’s impossible to modify so that its distribution becomes


independent of and even impossible to find any useful
pivotal quantity (see Wikipedia, “Binomial proportion
confidence interval”)
A more general method for constructing confidence
intervals is needed
Motivation, continued
In the case of i.i.d. , pivotal quantities
work perfectly for constructing confidence intervals:

For other distributions (even Bernoulli!), it is much harder or


even impossible to find pivotal quantities. Often it is better
to use methods based on asymptotic normality
Asymptotically Pivotal Quantities
i.i.d. where (unknown
parameters)
A random variable is called an
asymptotically pivotal quantity if it tends to in
distribution for , where is a distribution that does
not depend on any unknown parameter
Usually

Asymptotically pivotal quantities are used to construct


approximate confidence intervals. These are also called
large sample confidence intervals
Main Source for Asymptotically
Pivotal Quantities
Let i.i.d. and let be an MLE for

Asymptotic normality of MLE: ) tends to


in distribution (under regularity conditions)

Hence ) is an asymptotically pivotal


quantity, and this holds for all distributions which satisfy the
regularity conditions
Approximate Confidence Intervals
Based on MLEs
Let i.i.d. , let be an MLE for , and let be
the true parameter. Assume that the regularity conditions for
the asymptotic normality of the MLE hold

tends to in distribution for

if is large

We now use this to obtain a general formula for approximate


confidence intervals
Preparation: Notation for Standard
Normal Percentage Points
Let and
The upper percentage point is defined by

Note: , where is the CDF


of

Since is symmetric around , we have


Approximate Confidence Intervals
Based on MLEs, continued

(see slide on )

/ /

Problem: unknown, since unknown


Solution: Estimate by
Approximate Confidence Intervals
Based on MLEs, continued
/ /

/ /

/ /
is an approximate 100(1 − α)% confidence interval for


Main Result on Approximate
Confidence Intervals (Summary)
Let i.i.d. , let be an MLE for , and let be
the true parameter. Assume that the regularity conditions for
the asymptotic normality of the MLE hold.

Then
/ /
is an approximate 100(1 − α)% confidence interval for


Example
i.i.d. with true parameter
is MLE for (shown in previous Lecture)
(shown in previous Lecture)

Main result on approximate confidence intervals


/ / ⁄ /

is approximate confidence interval for
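
A minimal R sketch of this approximate interval in the Bernoulli case, where the estimated standard error $1/\sqrt{n\,I(\hat p)}$ equals $\sqrt{\hat p(1-\hat p)/n}$; the poll numbers are illustrative assumptions:

n = 1000; successes = 520                      # hypothetical poll outcome
p_hat = successes / n                          # MLE for p
alpha = 0.05
z = qnorm(1 - alpha / 2)                       # upper alpha/2 percentage point, about 1.96
se_hat = sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error of p_hat
c(p_hat - z * se_hat, p_hat + z * se_hat)      # approximate 95% confidence interval for p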


Recall: Claim of New York Times

i.i.d. with true parameter

NYT claim: is approximate 95%


confidence interval for

We have seen that this is only true for

We now compute a correct approximate 95% confidence


interval for
Confidence Interval for Polls
⁄ /
Have seen: is approximate
confidence interval for

Case of polls: ,
is approximate
95% confidence interval for
The bounds depend on

If , then and the interval will be roughly


, which is much smaller than the one
claimed by New York Times
