MH3500 Statistics

Lecture Notes
Chapter 2
Parameter Estimation
Table of Contents
Section Remark
1. Introduction to Parameter Estimation
2. “Ad-hoc” Parameter Estimation Method
3. Parametric Statistical Models
4. Formal Process of Parameter Estimation
5. Moments of a Random Variable
6. Constructing Good Estimators: Method of Moments
7. Constructing Good Estimators: Maximum Likelihood Method
8. Maximizing Likelihood Functions in the Case of One Unknown Parameter
9. Maximum Likelihood Method for Continuous Distributions
10. Non-Existence of MLEs
11. Non-Uniqueness of MLEs
12. Bias, Standard Error, Mean Squared Error
13. Sample Variance as Estimator for Variance
14. Estimated Standard Error
15. Criteria for Quality of Estimators
16. Consistency of Estimators
17. Consistency of Method of Moments Estimators
18. Consistency of MLEs
19. Best Unbiased Estimator
20. Expected Value and Variance of Score Function
21. Fisher Information: Additivity and Alternative Formula
22. Cramér-Rao Bound and Efficient Estimators
23. Asymptotic Properties of Estimators
24. Asymptotic Normality and Asymptotic Variance of MLEs
25. Review of Concept of Confidence Intervals
26. Pivotal Quantities and Construction of Exact Confidence Intervals
27. Exact Confidence Intervals for Exponential Distributions
28. Asymptotically Pivotal Quantities and Approximate Confidence Intervals
1) Introduction to Parameter
Estimation

Parameter Estimation as Part of


Statistical Inference
Statistical Inference (Inferential Statistics)
“Population” (process that generates data) is given
Data drawn from population is represented by random
(i.i.d.) sample , where the population distribution is
unknown or partially unknown

Goal of Statistical Inference: Extract information from


about the population distribution (“infer” means
“deduce something from evidence”)

Purpose: Understand structure of data in order to facilitate


predictions, recommendations, decisions
Parameter Estimation
Consider i.i.d. sample

Often the type of distribution ( , etc.) of ’s


is known, but its parameters ( are unknown

Parameter estimation: Extract information from


on these parameters

This is one way to do statistical inference

Non-parametric statistics is an alternative, but not covered


in this course
Example: Diastolic Blood Pressure
(DBP)

 This is a histogram for 1000 observations


 We want to find the underlying distribution
 The contour of the histogram should look like the graph
of the PDF of the distribution
 We can assume that the data are normally distributed
Diastolic Blood Pressure (DBP)

 Conclusion:
 Remaining task: Approximately find
 This is the purpose of parameter estimation
Process of Point Estimation
Given are data , viewed as realizations of i.i.d.
random variables

1) Modeling: Identify a suitable type of distribution for ’s,


which depends on parameters

2) Point estimation: Find functions


which approximate (“best
guesses”)

3) Substitute in these functions to get


estimates for
Example for Point Estimation

Data (measurements in a physics experiment):


1.1) Find Appropriate Type of
Distribution
Histogram: Normal distribution
seems appropriate

Here
1.2) Find Functions to Approximate
Parameters
 (by Law of Large Numbers)

 We will show that the sample variance is a “good”


estimator for
1.3) Substitute Data in Functions to
Obtain Estimations

Not bad!

These data were drawn from an distribution


Remarks
 Important distinction: are fixed real numbers
are random variables
 is one realization of - e.g., further
realizations could be used to get more information on the
distribution of the ’s
 The unknown parameters are fixed real
numbers
 The point estimators for the
parameters of the distribution are random variables
 For different realizations of they usually give
different results
2) “Ad-hoc” Estimation Method
Example: Binomial Distribution
 Let be i.i.d. where and are
both unknown
 Given: Observations
 Goal: Estimate and from

 Recall: ,

 Idea: Use the fact that on average is close to


and is close to
Parameter Estimation for Binomial
Distribution, continued
 i.i.d.
We expect:
(1)
(2)

(1) & (2)

(1)
Observations drawn from
Binomial( ) with unknown
12 13 7 10 6 11 12 6 7 10 10 9 10 15 11 14 9 8 9 12 10 16 16 135 11 7 6 9 19 7 16 9 11 14 9 8 13 16 8 7 11 10
10 7 12 11 1312 18 8 8 12 16 11 14 7 7 10 13 4 8 7 6 8 9 12 12 11 14 8 3 8 9 12 9 7 10 8 10 8 5 10 9 13 10
10 8 12 6 8 9 11 10 14 7 7 9 8 12 12 7 9 5 9 9 10 6 8 10 12 13 13 11 10 15 9 9 12 12 10 12 9 6 8 12 3 7 9
11 8 7 10 10 8 15 9 11 12 11 6 7 9 7 8 11 13 3 12 10 11 9 6 11 13 14 10 10 10 8 18 7 15 11 7 6 10 710 7 7 13
7 9 14 12 7 14 10 15 13 12 7 5 14 13 8 8 9 8 9 8 9 8 8 7 12 14 12 6 8 12 6 4 9 10 11 14 7 6 13 9 10 10 8 8
7 9 9 10 8 10 9 13 10 11 13 11 7 9 6 11 8 13 13 9 16 15 9 11 6 11 13 12 12 16 8 10 10 17 7 9 11 9 9 10 12 8 12
3 12 6 10 10 12 14 5 10 3 9 9 14 10 12 14 7 13 11 15 8 4 7 8 8 14 8 11 9 11 8 10 11 8 9 10 11 14 9 13 7 6 8
9 11 7 9 7 9 10 5 9 11 11 7 8 11 3 13 9 8 9 5 11 15 11 4 8 11 14 13 8 14 10 5 10 13 12 5 17 10 8 14 10 10 11
13 8 16 9 10 11 9 6 6 12 10 12 12 9 8 11 11 16 12 8 7 11 12 14 10 16 10 9 8 11 15 11 7 10 8 12 12 9 12 11 9 8
11 9 10 12 7 11 12 12 10 12 6 8 9 13 5 4 15 11 10 10 11 10 12 15 12 11 11 7 8 14 7 14 9 9 7 12 6 10 10 6 12 11
10 9 9 11 11 11 4 12 13 11 11 10 8 7 8 14 9 12 12 13 13 4 10 8 8 10 6 10 16 12 13 10 8 12 9 13 11 9 8 7 8 6
18 6 9 6 10 12 10 9 12 10 7 11 6 8 4 11 9 9 16 10 8 10 8 12 11 13 11 14 14 5 11 7 11 7 9 10 10 9 10 12 12 8 8
9 11 12 9 9 14 8 11 10 12 11 11 12 15 10 11 16 7 11 14 15 12 10 9 12 11 8 17 11 13 10 7 10 12 12 9 11 7 9 11 13
10 10 13 9 6 12 9 12 8 9 14 11 5 8 13 13 15 8 11 5 9 5 14 14 11 14 10 11 11 9 13 10 8 12 7 9 11 7 7 7 12 10
10 9 6 11 13 11 10 10 12 12 9 9 11 12 13 9 7 7 8 8 8 9 13 10 11 10 8 7 10 10 9 11 6 12 8 10 14 8 11 8 17 18
11 9 9 11 7 12 12 16 15 7 7 6 14 15 8 5 9 12 10 7 9 13 8 7 11 11 15 9 8 11 8 9 6 5 11 6 8 5 11 15 7 8 6 7
9 10 10 5 4 4 10 9 7 7 5 6 8 9 4 11 12 13 9 9 8 10 9 9 10 11 11 7 13 13 12 5 10 6 11 10 9 11 12 10 8 11 9 8
14 8 14 7 8 6 16 10 8 7 11 13 12 12 12 9 6 10 7 10 9 12 14 7 7 8 8 6 8 13 8 12 11 8 12 10 7 7 12 13 10 10 13
5 13 13 9 15 10 10 11 13 13 11 9 12 10 12 7 9 10 6 9 16 9 16 8 11 7 13 4 7 10 11 18 10 12 4 10 9 15 11 9 8 15
10 9 16 11 7 6 10 6 9 12 14 11 11 9 10 15 8 13 11 8 9 10 8 12 8 11 9 7 10 6 10 7 9 10 9 7 10 9 6 8 12 11 13
8 14 8 8 7 12 10 15 7 7 10 9 8 8 12 14 4 12 2 10 10 10 13 4 12 12 9 9 12 11 9 13 613 8 8 10 9 9 10 13 9 14
12 7 2 9 12 7 10 15 14 7 12 10 12 10 12 9 13 12 7 14 10 13 13 6 12 9 7 6 7 13 6 6 10 7 11 15 9 15 11 12 9 11
16 9 13 10 10 11 12 11 6 9 10 10 9 10 10 5 8 10 12 10 14 7 5 7 14 9 9 10 7 14 10 9 12 12 13 8 7 10 12 12 13 7
10 12 17 9 7 8 12 12 13 11 10 9 6 14 13 13 11 11
Estimations for and Derived from
Observations

Observations were actually drawn from
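
The moment-matching argument above can be carried out in a few lines of R. A minimal sketch, using the relations mean ≈ np and variance ≈ np(1 − p); the vector obs (a short excerpt of the data), the divisor-m variance, and all variable names are illustrative assumptions:

obs = c(12, 13, 7, 10, 6, 11, 12, 6, 7, 10)   # excerpt; replace with the full data set above
xbar = mean(obs)                               # close to n*p on average
v = mean((obs - xbar)^2)                       # close to n*p*(1-p) on average
p_hat = 1 - v / xbar                           # since np(1-p) / (np) = 1 - p
n_hat = xbar / p_hat                           # since np / p = n
p_hat; n_hat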


Example, Conclusion

 Sample mean and variance can be useful to estimate


unknown parameters of the population distribution
 However, the arguments used are "ad-hoc" and we do
not yet have a way to measure the accuracy of the
estimation
 More systematic methods are needed for parameter
estimation in general
3) Parametric Statistical Models

Most common distributions we covered depend on parameters

Examples:

Distribution Parameters

If the population distribution of an i.i.d. sample depends on unknown parameters , we say the ’s follow a parametric model

One purpose of statistical inference is to extract information on from observations
Notations
Suppose i.i.d. where is a
distribution depending on parameters

Then , Var , moments of , PDF, CDF, and all


probabilities involving ’s also depend on . If there is a
need to, we indicate this as in these examples:

 or
 or
 or
 or (PDF / PMF)
Discrete and Continuous Models

Consider a parametric model i.i.d.

If is continuous / discrete, then the model is also called


continuous / discrete, respectively

Examples:
Discrete parametric model: i.i.d.
Continuous parametric model: i.i.d.
4) Formal Process of Parameter
Estimation
Parameter Estimation
 Given: observations (data)
 are viewed as realizations of i.i.d. random
variables
 Identify a suitable type of population distribution for ’s
which depends on unknown parameters
 Point estimation:
Find functions which
approximate
 Substitute in these functions to get
estimates for
Estimators
 Let be functions which
approximate
 are called estimators for
 Estimators are random variables which only depend on
the sample , that is, they are statistics

Main problem: How to obtain estimators that are as precise


as possible?
5) Moments of a Random Variable
The -th moment ( ) of a random variable is

where is the PDF or PMF, respectively, of

If the integral, respectively the sum, converges, we say the


-th moment of X exists

Fact: If the -th moment exists, then the -th moment also exists
for all (this is proved in the theory of
Lebesgue integration)
Example
Let be a random variable with PDF for
and otherwise

The first and second


moments of exist, but all
other moments of do not
exist


Definition of Sample Moments

Let be an i.i.d. random sample

is called the -th sample moment

Note: Sample mean


Example

Create a vector containing 10 observations in R:


data = c(13, -10, 11, 6, 12, -11, 10, -12, -13, -6)

Compute sample moments:


moment(data,1) (computes , result: 0)
moment(data,2) (computes , result: 114)
moment(data,3) (computes , result: 0)

(to use the function moment, first install and load the
package “moments”)
Example

Verify validity of for an example using R:

data = c(13, -10, 11, 6, 12)


n = length(data) (number of data values)
sample_var = var(data) (computes sample variance)
S_1 = moment(data,1) (first sample moment)
S_2 = moment(data,2) (second sample moment)
sample_var - (n/(n-1))*(S_2-S_1^2)
(result 0, as predicted by )
Connection Between Population
Moments & Sample Moments
i.i.d. random sample
-th population moment:

-th sample moment

Remarks: The expected value of the -th sample moment


is the -th population moment, and by the Law of Large
Numbers converges to in probability when the
appropriate number of moments exist.
6) Constructing Good Estimators:
Method of Moments (Motivation)
Let i.i.d.
We know:

By law of large numbers, we also know that:


converges to in probability
converges to in probability

Is there any way to combine and to get an estimator


for ?
Motivation, continued
Idea is to use and to express
through and :

So, could be a “good” estimator for

That is,
Method of Moments
Let be an i.i.d. sample drawn from a population
distribution , which depends on parameters

 Compute moments:
 Each which depends on , say
, provides one equation for
(namely, )
 Continue until the system of equations ,
with variables has a unique solution
,…,
Method of Moments, continued
 Let ,…, be the
unique solution of ,
 Replace moments in expressions for by
sample moments to obtain estimators

 These are called method of moments estimators for


Example
We reformulate the previous Gamma distribution example
in the general framework for the method of moments:
i.i.d.

Equations for :

(stop computing moments, as this system has a unique


solution ( )
Example, continued

Unique solution:
,

Replace moments by sample moments:

,
(method of moments estimators for Gamma distribution)
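
A minimal R sketch of these estimators, assuming the shape-rate parametrization of the Gamma distribution (mean alpha/lambda, variance alpha/lambda^2); the simulated stand-in data and the variable names are illustrative:

set.seed(1)
data = rgamma(1000, shape = 2, rate = 0.5)    # stand-in data; replace with the real observations
S1 = mean(data)                                # first sample moment
S2 = mean(data^2)                              # second sample moment
lambda_hat = S1 / (S2 - S1^2)                  # lambda expressed through the moments
alpha_hat = S1^2 / (S2 - S1^2)                 # alpha expressed through the moments
alpha_hat; lambda_hat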
Example (Including Data & Modeling)

Data:

Find a suitable statistical model and use the method of


moments to estimate its parameters
Example: Statistical Model
Histogram:

Need to find distribution with a PDF that looks similar


PDF of Gamma Distribution

(here and )
Statistical Model, continued
Data Gamma PDFs

should be suitable
Can use method of moments estimators for
Example, continued
,

Use data from above


7) Constructing Good Estimators:
Maximum Likelihood Method (MLE)
(Motivation)
Among 1000 computer chips produced, exactly 13 were
found defective (to simplify notation, we assume the first 13
are defective)
Model: i.i.d. where if chip is
defective and is an unknown parameter
Given the observations and for
, is there a value, say , for which is “most
probable”?
No! Since is a constant (not a random variable), it is either
equal to or not. No probability involved
Motivation, continued
i.i.d.
Observations: , for

Based on the observations, we cannot decide if


or not, because could also happen if or
even

But still seems much more plausible than

Given the observations, is there any way to pick a “most


plausible” value for ?
Motivation, continued
i.i.d.
Observations: , for
“Most plausible” value for ?

Problem: There is no “most probable” value of (it is not a


random variable)

Main idea of maximum likelihood


Reverse the approach: Find a value of that makes
the observations most likely
Motivation, continued
i.i.d.
Observations: , for
If , then

If , then

Hence is more plausible than


Formal Description of Maximum
Likelihood Method for Discrete
Distributions
Probability of Observations,
(Discrete Case)
Let be i.i.d. where is a discrete distribution
with PMF ,
depending on unknown parameters

Observations are given

We need to compute the probability to observe ,


that is,

This leads us to the so-called likelihood function


Likelihood Function
PMF:

(independence of ’s)
(identical distribution of ’s)
(definition of PMF)

is called a
likelihood function
Interpretation of Likelihood Function,
(Discrete Case)
is the probability
to observe under the assumption that
the true parameters are

The idea of the maximum likelihood method is to choose


those values for as estimators which maximize
this probability
Example
We revisit the example from above and use the likelihood
function
i.i.d. with PMF
,

Observations: , for

Likelihood function:
Example, continued
The maximum likelihood method picks the value of
that maximizes the likelihood function

We can find the maximizer by setting the derivative to zero:

is the maximizer

That is, the maximum likelihood estimate for is


Range of Parameters
For a distribution depending on , the
parameters are often restricted to a certain
subset of

Examples:
for
for

We call the range of the parameters


Likelihood Function with Random
Variables
If we maximize
for given observations , we get estimates for
which are valid only for these observations

To get estimators which work for all possible observations,


we need to substitute the random variables instead
of the observations: Maximize

The resulting maximizers depend on , that


is, they are statistics
Maximum Likelihood Method for
Discrete Parametric Models
Let be i.i.d. where is a discrete distribution
depending on parameters with range
and PMF
Compute likelihood function

If satisfies , then
is called a maximum likelihood estimator (MLE) for
MLE Maximization Problem
To compute the MLE, we need to solve the maximization
problem , where and is the
range of the parameters

That is, we need to find a global maximum of over

This can be difficult, as the likelihood function may be


complicated

But likelihood functions typically have a structure that allows


simplifications in the maximization process
Dependence of Likelihood Function
on Parameters
Recall:

When we solve , we are only interested in


how depends on the parameters
(the ’s are considered as constants during maximization)

That’s why we may use the notation instead of


during the maximization process
Example
i.i.d. with . Find the
MLE for

Recall PMF of :
,

Now the dependence of the PMF on is important, so we


denote the PMF by
Example, continued
PMF:
Likelihood function (with random variables):

Substitute PMF into likelihood function:


Need to find which maximizes this function


( ’s are considered as constants during maximization)
Example, continued

where and

Derivative test for extreme values:

is the MLE for

(for a rigorous justification, see Theorem 1 later)


8) Maximizing Likelihood Functions in
the Case of One Unknown Parameter
The whole concept of maximum likelihood estimators is
based on finding global maximizers

Just setting derivatives to zero and solving the resulting


equations is not enough to justify the results (could yield
local maxima / minima or saddle points etc.)

Our goal now is to provide a simple, but rigorous way to


compute maximizers in the case of one unknown
parameter
Examples of Likelihood Functions

For :

,

For :
,
Typical Properties of
Likelihood Functions
Take only positive values in the range of the parameters
Differentiable with respect to parameters
Limit of the function is zero whenever parameter
approaches a boundary of its range

We call these conditions the “standard conditions for


likelihood functions”
Example

where and are constants

Only positive values: Clear (product of PMFs)


Differentiability:
for all
Limits when parameter approaches a boundary:

( dominates for )
Standard Conditions
Let be a function defined on a (finite or infinite) interval
such that

a) for all

b) exists for all

c)

In this case, we say that satisfies the standard


conditions for likelihood functions
Maximizing Likelihood Functions
Theorem 1
Let be a function defined on a (finite or infinite) interval
that satisfies the standard conditions for likelihood
functions. Then has a global maximizer on and
we have .

Hence a simple method to find a global maximizer is:


Compute all solutions of
Pick the solution with the largest as
Idea of Proof for Theorem 1
Extreme Value Theorem:
)
(1) If is continuous on a finite
interval , then it attains a maximum
on

Since , there is
→ →
a finite sub-interval of such
that

(2) The maximum of over is the


same as its maximum over
(e.g., in the example)

(Figure: a function satisfying the standard conditions)

(1) & (2) attains maximum on
Idea of Proof for Theorem 1, continued
)
attains maximum on

Let be a maximizer of on

Derivative test for extreme


values

(Figure: a function satisfying the standard conditions)
Logarithm-Trick
Chain rule:

So

Conclusion: When using Theorem 1, we can solve


instead of

Often is easier to handle

Remark: always denotes the natural logarithm


How to Maximize Likelihood
Functions: Summary
Given: Likelihood function , defined on interval

 Check that is positive and differentiable and satisfies

 Compute (“log-likelihood function”)

 Find all solutions of


 Solution with largest is a maximizer
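
When the equation from the derivative test has no closed-form solution, the log-likelihood can also be maximized numerically. A minimal R sketch using optimize(); the Poisson model, the made-up counts and the search interval are illustrative assumptions:

x = c(3, 1, 4, 2, 2, 0, 3, 5)                                 # some observed counts
loglik = function(lambda) sum(dpois(x, lambda, log = TRUE))   # log-likelihood of the sample
optimize(loglik, interval = c(0.001, 100), maximum = TRUE)$maximum
mean(x)                                                       # closed-form Poisson MLE, for comparison

For the Poisson model the numerical maximizer agrees with the sample mean, which is the closed-form MLE.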
9) Maximum Likelihood Method for
Continuous Distributions
Recall: Main Idea of Maximum Likelihood Method

Let be an i.i.d. sample whose population


distribution depends on parameters

Given observations , we pick those values of the


parameters as estimators, which maximize the probability
to observe

This works smoothly for discrete distributions, but for


continuous distributions we run into a problem
Goal of Maximum Likelihood Method
for Continuous Distributions

It makes no sense to maximize the probability to observe

Instead, we aim for maximizing

for small , that is, the probability to observe up to


small deviations
Interpretation of PDF

 is the area between PDF and x-axis over


the interval
 If the interval is small and is any point in the interval,
then (assumes continuity of
PDF)
is approximately proportional to the probability of
taking a value close to
Probability for Observations up to
Small Deviations
Let be i.i.d. with PDF

(independence of ’s)

(take , and on previous slide)


Probability for Observations up to
Small Deviations, Conclusion
Have seen:

Conclusion:
Probability to observe up to deviation is
approximately proportional to
Likelihood Function (Continuous Case)
Let i.i.d. where is a continuous
distribution with PDF

Observations:

Likelihood function:

This function measures the probability to approximately


observe if the true parameters are
Maximum Likelihood Estimators for
Continuous Distributions
Let be i.i.d. where is a continuous
distribution depending on parameters with
range and PDF

Compute likelihood function

If satisfies , then
is called a maximum likelihood estimator (MLE) for
MLEs for Continuous Distributions,
Summary
The maximum likelihood method for continuous distributions
is the same as in the discrete case, just with PMF replaced
by PDF
The interpretation for the method in the continuous case,
however, is different (maximize probability to approximately
observe )
The maximization problem can be solved with the same
tools as for discrete distributions, including Theorem 1 and
the logarithm trick
Example
Let be i.i.d. where
PDF: for and otherwise

Likelihood function:

Need to find with


Example, continued

Need to solve

satisfies the standard conditions for


likelihood functions:
(a) for all

(b) exists for all

(c)

Hence it suffices to find all solutions of


Example, continued

This is the only solution of

Theorem 1 is the MLE for


10) Non-Existence of MLEs
Recall that:
i.i.d. with and PMF
or PDF

Likelihood function:

By definition, every maximizer of over is


an MLE for

It can happen, however, that a maximizer does not exist,


i.e., that no MLE for exists
Example for MLE Non-Existence
Let have PMF
for

and otherwise

Here is an unknown parameter with range

Likelihood function (for :


Example for MLE Non-Existence,
continued

Note that only takes values

If , then is strictly decreasing on the


whole real line
If , then is strictly increasing on the
whole real line
In both cases, has no maximizer on , i.e., no
MLE for exists
Reason for MLE Nonexistence
 Parameter space is not compact
 Maximum of continuous function over non-compact set
may not exist (happens here)
 actually is a Bernoulli variable and could be better
modelled by with
(compact)

Rule of thumb: MLE nonexistence can be avoided by


choosing an appropriate model
Remark on MLE for Bernoulli
If we choose the model with
(open interval!), we also will get MLE nonexistence:

If , then , which does not attain a


maximum on . Similarly, if , then the MLE
also does not exist
If we change the range of to (closed interval), then
the MLE always exists (MLE is if and 0 if
)
Another Example for MLE Non-
Existence
Tutorial 5:
PDF , ,
The parameter measures how much this PDF is
concentrated around :
11) Non-Uniqueness of MLE
Let i.i.d. , , where is an unknown
parameter (here

PDF: for and


otherwise
,

, ,

Hence every , , maximizes


, i.e., is an MLE for
Example
PDF: for and otherwise
Reason for Non-Uniqueness

 Likelihood function is constant on a whole interval

 does not distinguish between values of the


parameter which are close to each other

 Reliable MLEs require that is “relatively far away


from ” on average if

 We will see that this issue is related to “Fisher


information” (which we study later in this course)
12) Bias, Standard Error, Mean
Squared Error
Definition of Bias:
Let be a population distribution, parameterized by a real
number and let be an estimator for . Then

Bias

is unbiased if Bias for all , otherwise biased

Here “ ” means the expected value under the assumption


that the distribution parameter is (sometimes the in
is dropped, but we still have to be aware of the assumption)
Interpretation of Bias
Bias is the expected deviation of from the true
parameter

A good estimator must have bias zero or at least its bias


should tend to zero for increasing sample size

Small bias is, however, not sufficient to be sure that is


close to the true on average ( ] also must be small)
Example
Population distribution: Bernoulli ,

Estimator: (for )

Bias

for all
So is unbiased
Definition of Standard Error
Let be a population distribution, parameterized by a real
number and let be an estimator for

Standard error of :

(here the variance is computed under the implicit


assumption that the parameter is ; if necessary, we can
write and to indicate the dependence on )
Interpretation of Standard Error

 Suppose is an unbiased estimator for

 measures how
much on average is off from the true

 Rule of thumb: For large samples, the true will be in


the interval in around 70% of the
cases and in the interval in
around 95% of the cases, if is used repeatedly to
estimate
Graphical illustration (three cases):
• Unbiased, but not precise (large SE)
• Precise (small SE), but biased
• Accurate: unbiased and precise

Example
Let be i.i.d. with population distribution
,
Estimator: (for )

Shown previous lecture:


Recall: If , then
Definition of Mean Squared Error
Let be a population distribution, parameterized by a real
number and let be an estimator for

Mean squared error of :

(here the expected value is computed under the implicit


assumption that the parameter is )

Note:
Connection between Bias, Variance,
and MSE
$\mathrm{MSE}_\theta(\hat\theta) = \operatorname{Var}_\theta(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2$

The proof uses the following facts.

Let be a random variable. We have


(1) for all constants
(2)
Connection between Bias, Variance,
and MSE
$\mathrm{MSE}_\theta(\hat\theta) = \operatorname{Var}_\theta(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2$

Proof:
)
) (by definition of bias and by (1))
(by (2))

(by definition of MSE)
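
For reference, the decomposition written out in full; this is the standard derivation, using that adding a constant does not change the variance and that $\operatorname{Var}(Y) = E[Y^2] - (E[Y])^2$:

$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta\bigl[(\hat\theta - \theta)^2\bigr] = \operatorname{Var}_\theta(\hat\theta - \theta) + \bigl(E_\theta[\hat\theta - \theta]\bigr)^2 = \operatorname{Var}_\theta(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2 .$$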


Interpretation of MSE

 The MSE simultaneously controls the bias and the


variance of an estimator
 If the MSE is small, we can be sure that the estimator on
average will be close to the true parameter
13) Sample Variance as Estimator for Variance
Example: Let be i.i.d. with ,
(not necessarily normally distributed)

Recall:

We now compute the bias of when considered as an


estimator for (this will explain the factor )
Bias of Sample Variance
Bias of Sample Variance (continued)
Bias of Sample Variance (continued)
Have shown:

Hence is an unbiased estimator for whenever


i.i.d. with ,

This would not be true if was replaced by and this is


the explanation for the factor
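
The role of the factor 1/(n-1) versus 1/n can be checked by simulation. A minimal R sketch; the normal model, the chosen n and sigma^2, and the number of replications are illustrative assumptions:

set.seed(1)
n = 10; sigma2 = 4
s2_unbiased = replicate(10000, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))   # divisor n-1
s2_biased = replicate(10000, {x = rnorm(n, mean = 0, sd = sqrt(sigma2)); mean((x - mean(x))^2)})   # divisor n
mean(s2_unbiased)   # close to sigma2 = 4
mean(s2_biased)     # close to (n-1)/n * sigma2 = 3.6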
14) Estimated Standard Error

Let be an estimator for a parameter .

Problem: depends on the true value – which is


unknown

To emphasize the dependence of on , we now use


the notation
Definition of Estimated Standard Error

Standard error depends on unknown

Solution:
Replace by in to obtain the estimated
standard error:
Example (Textbook Page 261)
Experiment: Asbestos suspended in water and spread on a
filter
Counts of asbestos fibers on 23 equally sized samples
from the filter:
31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18,
26, 27, 27, 18, 24, 22, 28, 24, 21, 17, 24

Note: If a random variable represents the number of


events occurring during a certain time interval or within a
certain area, then usually has an approximately Poisson
distribution
Counts of Asbestos Fibers: Model
• number of fibres that end up on sample ,

• is the number of events occurring within a certain area

• The size of the area is the same for all and the counts
on the sample are approximately independent

Assumption i.i.d. reasonable


Estimation of
total number of fibres that end up on sample
Model: i.i.d.
expected number of fibers that end up on
sample

Estimator:
Substitute data

We now use the estimated standard error to measure the


accuracy of this estimate
Estimated Standard Error
, substitute data

SE of :

Estimated SE:
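
A minimal R sketch of this computation for the asbestos counts. Since the estimator is the sample mean of a Poisson sample, its standard error is sqrt(lambda/n), estimated by plugging in lambda_hat; the variable names are illustrative:

counts = c(31, 29, 19, 18, 31, 28, 34, 27, 34, 30, 16, 18,
           26, 27, 27, 18, 24, 22, 28, 24, 21, 17, 24)
n = length(counts)                    # 23 samples from the filter
lambda_hat = mean(counts)             # estimate of lambda (about 24.9)
se_hat = sqrt(lambda_hat / n)         # estimated standard error (about 1.04)
lambda_hat; se_hat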
Interpretation of Estimated SE
We have and

In view of the Central Limit Theorem, if we approximate the


distribution of by , noting that

Hence,
15) Criteria for Quality of Estimators
Main Idea for Precise Estimators

We recall the Chebyshev Inequality:

If is a random variable with and , then

for all .

Implies: If is small, then the probability of a


substantial deviation of from is also small

Proof of Chebyshev Inequality: See previous Chapter.


Chebyshev for Estimators
Suppose is an estimator for parameter
(subscript here indicates dependence of on sample size)

If and , then

for all by Chebyshev Inequality


Chebyshev for Estimators (continued)
Suppose and

Have shown:
If holds, then for all

This means: By increasing , the estimator for can be


made arbitrarily precise

Hence, we should aim for estimators which satisfy both


conditions in
How to Achieve
First, we compute .

If the result is a real valued function , then we set


(scaling)

So that we have
How to Achieve
Example: i.i.d. , with both
unknown.

Suppose we want to estimate , which is


Consider the estimator:

Now,
How to Achieve
This usually is automatically satisfied for reasonable
estimators, because they are often based on averages and
the variance of an average tends to for increasing sample
size

Example: i.i.d.
is an estimator for
Constructing Precise Estimators
(Summary)
 Goals: and

 Chebyshev If these conditions are satisfied,


becomes arbitrarily precise for large enough

 can be achieved by computing and


scaling, if necessary

 should automatically be satisfied if is


chosen reasonably
Quality Criteria for Estimators
The ideas for constructing precise estimators are
formalized as follows.

Let be an estimator for ; we would like it to have:


a) should be close to
Small Bias
b) Small expected deviation of from
Small Standard Error
c) Conditions (a) and (b) combined
Small Mean Squared Error
16) Consistency of Estimators
Let be an estimator (based on a sample ) for a
parameter

We say that is consistent if

for all

In other words: Probability that estimator deviates more


than from the true value can be made arbitrarily small by
increasing the sample size
Proving Consistency

Proving consistency requires to show

Need to compute or estimate

Requires knowledge on the distribution of

This can be quite challenging, since often is complicated

Best tool: Theorem 1 (later in this lecture)


Example
i.i.d.
(estimator for )

Shown in previous Lecture:

Let be the CDF of

by

consistent estimator for


Sufficient Conditions for Consistency
Theorem 1: Let be an estimator (based on a sample
) for a parameter

Suppose that
(a)
(b)

Then is consistent.

That is, for all


Proof
By the proof of Chebyshev inequality, we have:

Now,

Cross-term = 0

With conditions (a) & (b) is consistent.
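
In symbols, the argument is the usual Chebyshev/Markov bound applied to the mean squared error:

$$P_\theta\bigl(|\hat\theta_n - \theta| \ge \varepsilon\bigr) \le \frac{E_\theta\bigl[(\hat\theta_n - \theta)^2\bigr]}{\varepsilon^2} = \frac{\operatorname{Var}_\theta(\hat\theta_n) + \mathrm{Bias}_\theta(\hat\theta_n)^2}{\varepsilon^2} \longrightarrow 0 \quad (n \to \infty)$$

under conditions (a) and (b).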


17) Consistency of Method of
Moments Estimators
Recall: Law of Large Numbers

Let be i.i.d. with . Then

for all .

In words: Probability that the average of i.i.d. random


variables deviates by more than any constant from their
expected value tends to 0 for increasing sample size
Illustration of Consistency

is consistent for if for all

Value frequency of consistent estimator for (example):

Recall: Method of Moments
(for Single Unknown Parameter)
Let be an i.i.d. sample drawn from

1) Compute moments until


system of equations

has a unique solution

2) MME: where
Consistency of MMEs
(for Single Unknown Parameter)
Theorem 1

If is continuous, then the method of moments estimator


is a consistent estimator for

Remark: The theorem, with replaced by , also


holds for MMEs in the case of several unknown parameters
Proof of Theorem 1

To show: for all

We know:
a)
b)
c)

d)
e) is continuous
Proof of Theorem 1, continued
Step 1: Apply Law of Large Numbers to

We know: (c) , (d) for

“Probability that the average of i.i.d. random variables


deviates by more than any constant from their expected
value tends to 0 for increasing sample size”

for and all


Proof of Theorem 1, continued
Step 2: Use continuity of to bound

(by (a) and (b))

Continuity of given , there is such that

whenever for all

Hence, for all


Proof of Theorem 1, continued

for all

for at least one

Sub-additivity of probability

By Step 1, each term goes to zero


Example
Let be an i.i.d. sample
We know,

Hence,

with and continuous on

and are consistent estimators


18) Consistency of MLEs
Recall: MLE (for one Parameter)

i.i.d. with

Likelihood function:

If , then is an MLE for


Consistency of MLEs
Theorem 2

i.i.d. with likelihood function

Under appropriate regularity assumptions on , the MLE


for is unique and consistent.

We will have a stronger result later in this Chapter.


19) Best Unbiased Estimator
Example (Motivation)

Let i.i.d. with


PMF: ,

Let and

We have shown that both and are unbiased estimators


of .

To determine which is better, we just need to compare the


variances.
Example (Motivation)
We know, , but is quite a lengthy
calculation.

Even if we could show that for all ,


how about other unbiased estimators such as

for any real constant ?

The idea is to find a lower bound, say , for any unbiased


estimator of , and if we can find an unbiased estimator
satisfying , then we have found a best
unbiased estimator.
Information in Observations
Suppose i.i.d. with

Important question:
How much information about can observations
contain?
That is, how precise can an estimator for based on
be?

The so-called Fisher Information measures how much


information about is contained in on average
(when observations are repeated)
Definition of Fisher Information

Let i.i.d. with likelihood function

The Fisher Information of at is


Remarks on Computation of Fisher
Information

is computed under the assumption that is the true


parameter, that is, that is the PMF/PDF of the ’s

To simplify notation, the true parameter usually just is


denoted by
Example
Let i.i.d. with
We compute the Fisher Information

Preparations:
, , (previous lecture)

and are independent for


Example, continued
Log-Likelihood and Score Function
The function occurring
in the definition of the Fisher Information is called the
log-likelihood function

is the score function

Note that
Fisher Information: Summary of
Relevant Expressions
Let i.i.d. with likelihood function

Log-likelihood function:

Score function:

Fisher Information:
Condition
The following condition is useful for several identities
involving the Fisher Information.

Let be the population PMF or PDF.


Consider the following condition
(for PDF)

(for PMF)

This holds for most common PMFs and PDFs


Example
Let be i.i.d.
for and otherwise

We check if condition holds:


Example (continued)

In summary, condition holds for

Note: We always have , since

Hence, to check if holds, it suffices to determine if the


first integral in is zero
20) Expected Value and Variance of
Score Function
Recall: Condition

Let be the population PMF or PDF.

(for PDF)

(for PMF)
Recall: Score Function and Fisher
Information
Likelihood function:

Score function:

Fisher information:

Our goal now is to learn more about the score function. This will help simplify the computation of the Fisher information in many cases.
Expected Value of Score Function
Consider the case of sample size :
Expected Value of Score Function,
continued
If condition holds, then

Thus,
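
For the continuous case with sample size 1, the computation can be written out as follows (with $f(x \mid \theta)$ the population PDF):

$$E_\theta\!\left[\frac{\partial}{\partial\theta} \ln f(X \mid \theta)\right] = \int \frac{\partial_\theta f(x \mid \theta)}{f(x \mid \theta)} \, f(x \mid \theta) \, dx = \int \partial_\theta f(x \mid \theta) \, dx = 0,$$

where the last equality is exactly the condition stated earlier; it holds whenever differentiation and integration may be interchanged, because the PDF integrates to 1 for every value of the parameter.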
Variance of Score Function
Finally,

Hence, if condition holds, the Fisher Information equals


the variance of the score function
Summary

Score function:

If condition holds, then

(a)

(b)
21) Fisher Information:
Additivity and Alternative Formula
Score function:

If condition holds, then

(a)

(b)
Additivity of Fisher Information
Suppose condition holds. Then
(definition)

(variance of score function)

(definition of score function)

(definition of likelihood function)

(independence of ’s)
Additivity of Fisher Information
Then

(independence of ’s)

(definition of score function)


(identical distribution)
(definition of Fisher Information)
Additivity of Fisher Information,
Conclusion

If holds, then .

That is, the Fisher Information of i.i.d. sample of size


equals times Fisher Information of a size 1 sample
Alternative Formula

If condition holds and also holds with replaced


by , then

This formula, together with the additivity, often simplifies


the computation of the Fisher Information

For a proof, see textbook, Section 8.5, Lemma A


Remark

The condition with replaced by is

(for PDF)

(for PMF)
Alternative Formula (Proof)
To prove,

We notice that

and,
Alternative Formula (Proof)
So we have,

Taking another derivative,

Therefore,
Example
i.i.d. with
PMF: ,
Example

Now, using alternative formula,

Additivity,
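
As a numerical sanity check of the alternative formula in the Bernoulli case, where the Fisher Information of a size-1 sample is 1/(p(1-p)), here is a minimal R sketch; the value p = 0.3 and the simulation size are illustrative assumptions:

set.seed(1)
p = 0.3
x = rbinom(100000, size = 1, prob = p)    # simulated Bernoulli(p) observations
# second derivative in p of log f(x|p) = x*log(p) + (1-x)*log(1-p):
d2 = -x / p^2 - (1 - x) / (1 - p)^2
mean(-d2)                                 # Monte Carlo estimate of the Fisher Information
1 / (p * (1 - p))                         # exact value, about 4.76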
22) Cramer-Rao Bound and Efficient
Estimators
For a “good” estimator for , a basic requirement is to be
unbiased

Among all unbiased estimators for , an optimal one has

minimal variance; such an estimator is called a best (“optimal”) unbiased estimator

How to prove that an estimator has minimal variance?

Main tool: A lower bound on the variance of unbiased


estimators
Cramér-Rao Lower Bound (CRLB)
Theorem: Let be a random sample with joint
PDF where .

Let be any estimator where is a


differentiable function of .

Suppose that satisfies

for any function with .


Theorem: Cramér-Rao Lower Bound

Then

Note:

is called the Cramér-Rao Lower Bound
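
In a commonly stated form, writing $T$ for the estimator and $g(\theta) = E_\theta[T]$ for its (differentiable) mean, the bound reads:

$$\operatorname{Var}_\theta(T) \;\ge\; \frac{\bigl(g'(\theta)\bigr)^2}{E_\theta\!\left[\left(\frac{\partial}{\partial\theta} \ln f(X_1,\dots,X_n \mid \theta)\right)^{2}\right]},$$

and the right-hand side is the Cramér-Rao Lower Bound (the symbols $T$, $g$ and the joint PDF notation are introduced here only for readability).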


Proof of Cramér-Rao’s Theorem
First, we notice that for any 2 random variables and ,
and hence,

If we let and

Then, we just need to show , and


compute .
Proof of Cramér-Rao’s Theorem
To show ,
Proof of Cramér-Rao’s Theorem
To compute ,
Proof of Cramér-Rao’s Theorem
Putting together,
Corollary
Let i.i.d. with PDF where

Let be an unbiased estimator for

Suppose that the conditions for CRLB’s theorem hold. Then

(here is the Fisher Information of a sample of size 1


with PDF )
Proof of Corollary
Since is an unbiased estimator for ,

and,
Example:
Let i.i.d. with
PMF: ,
We know, and

Let be an unbiased estimator of , with

Is a ‘good’ estimator of , in the sense that it achieves the


CRLB?
Example:

Hence, the CRLB is


Definition of Efficient Estimators
An estimator for is called efficient if it satisfies the
following two conditions.

1) is unbiased
2) Equality holds in the Cramér-Rao bound for , that is,

In other words, an efficient estimator has the smallest


possible variance among all unbiased estimators
23) Asymptotic Properties of
Estimators
Asymptotic analysis of estimators means to study their
properties for large sample size (mathematically, for sample size tending to infinity)

Often the exact distribution of an estimator cannot be


determined (too complicated)

But for large sample size, the distribution can be


approximated, which gives valuable information on the
accuracy of the estimator (bias, consistency, confidence
intervals)
Asymptotic Analysis: Example
Let be i.i.d. where is an
unknown parameter.

Question: What is the MLE for ?

Answer: (shown in a previous lecture)

We now consider how behaves for different values of


Behavior of
For each , we generate observations drawn from
, compute , repeat this 1000 times and
represent the outcome by a histogram:
(Histograms of the values of the estimator for increasing sample sizes)

Conclusion:

For increasing , the distribution


of becomes more and more
concentrated around the true
parameter and can be
approximated by a normal
distribution
Advantage of Asymptotic Analysis

 Uses the phenomenon demonstrated by the example to


provide information on accuracy of estimators without
need to know their exact distribution

 Broad applicability
24) Asymptotic Normality and
Asymptotic Variance of MLEs

Recall: Definition of Convergence in Distribution

A sequence of random variables converges to


in distribution if
for all ,

where is the CDF of


Recall: CLT

Let be i.i.d. with and .


Then
for all

In other words, converges to in distribution

We now discuss an extension of the CLT to maximum


likelihood estimators
Regularity Conditions

i.i.d. with PDF or PMF , where is an


unknown parameter

Let be an MLE for based on

Goal: Extend the CLT to (instead of )

This can be done if satisfies the regularity


conditions stated on the next slide
Regularity Conditions, continued

1) Third order partial derivative of with respect to


exists and is continuous

2) There is an open interval with ,


where is the true parameter (that is, is an “inner point”
of )

3) The “support” of , that is, is the


same for all
Asymptotic Normality of MLEs
Theorem 1
Suppose is an MLE based on an i.i.d. sample with
PDF or PMF where .
Let be the true parameter and ) the Fisher
Information of at .
If satisfies the regularity conditions (1)-(3), then

converges to in distribution.

(Proof: See reference books mentioned in Lecture 1, e.g.


page 369 of Hogg, McKean & Craig)
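
In symbols, with $\theta_0$ the true parameter, $I(\theta_0)$ the Fisher Information of a sample of size 1, and $\hat\theta_n$ the MLE, the standard statement is:

$$\sqrt{n\,I(\theta_0)}\,\bigl(\hat\theta_n - \theta_0\bigr) \;\xrightarrow{d}\; N(0,1), \qquad \text{equivalently} \qquad \hat\theta_n \approx N\!\left(\theta_0,\ \frac{1}{n\,I(\theta_0)}\right) \text{ for large } n.$$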
Application of Theorem 1

In most applications, we use the following consequence of


Theorem 1

For large sample size,

for all , where is the CDF of


Significance of Theorem 1
Theorem 1 ) approximately

This enables us to find approximations to:


 Probability for to deviate from the true parameter by


any constant
 Confidence intervals (to be discussed in another lecture)

Usually this is all we need to know about an estimator


Consequence for
Theorem 1
) approximately
)
)

Conclusion: By Theorem 1, the MLE is asymptotically


unbiased
Definition of Asymptotic Variance
Theorem 1
) approximately
)

We call the asymptotic variance of


Example
i.i.d. ,

is MLE for (shown in previous lecture)

Theorem 1 ) converges to
in distribution
When Theorem 1 can not be Applied

a) i.i.d. when the true parameter is


or . In this case, the true parameter is not an
inner point of

b) i.i.d. with unknown parameter . The


support of is and thus is not the same for all
25) Review of Concept of Confidence
Intervals
Consider i.i.d. where is an unknown
parameter

An estimator for only provides a single “best guess” for


(“point estimator”), based on a random sample

Bias and standard error measure average precision of


Both types of information, the best guess and average
precision, can be combined into a confidence interval
(“interval estimation”)
In almost all applications of parameter estimation, confidence
intervals are used (point estimation is not enough)
Definition of Confidence Intervals
i.i.d. where is an unknown parameter and
is the true parameter

Suppose and are statistics


such that for all

Then is called an exact


confidence interval for

The probability is called the confidence level of the


interval
Example of Confidence Interval
Elections and Confidence Intervals

(New York Times, 5 October 2016)

This is a commonly used rule of thumb for accuracy of polls


It actually is a statement about a 95% confidence interval
for a parameter estimator
Example of Confidence Interval
Poll of 1000 persons for US election is conducted

Estimator:

New York Times claim: Roughly 95% of the time, the


estimate should be within of the true answer
This means:
is an (approximately) 95% confidence
interval for
Polls Revisited
Whenever the distribution of an estimator is known,
confidence intervals can easily be constructed

In previous example:
and

where is the CDF of


Polls Revisited, continued
Have shown:

where is the CDF of

When

When

actually is not really an approximate


95% confidence interval for all , but it provides at least
95% confidence.
Polls Revisited, Summary

Rule of thumb: provides at least 95%


confidence for the confidence interval for

This means for all

It is approximately 95% for only.

For away from , the Fisher Information is


larger and thus increases (will be shown
later).
Interpretation of Confidence Intervals
Every set of observations gives an interval

Suppose sets of observations are taken. This produces


intervals
Random Variables

means that we
expect that roughly of these intervals contain

Confidence intervals express what to expect from repeated


observations (samples), not from a single set of observations
Interpretation of Confidence Intervals
If the same population is sampled on numerous occasions, and
interval estimates are made on each occasion, the resulting interval
would bracket the true population mean in approximately 95% of the
cases.
26) Pivotal Quantities and Construction
of Exact Confidence Intervals
Recall that:
If i.i.d. where is an unknown parameter
and is the true parameter

Suppose and are statistics


such that for all

Then is called an
exact confidence interval for .
The probability is called the confidence level of the
interval
Motivation
So, we need to find and with
for all

Possible idea:
Suppose be an estimator for and suppose that only
takes positive values

Try and with a


constant , such that

for all
Motivation, continued
for all
for all
note that and are constants

Eqn can typically only hold if the distribution of


does not depend on (otherwise, Eqn will not
hold for all )

In such a case, is an example of a pivotal quantity:


It involves , but its distribution does not depend on
Meaning of “Distribution Does Not
Depend on a Parameter”
Let i.i.d. ,
Let be a random variable that is a
function of and the unknown parameter

We say that the distribution of does not depend on


if, for all for some
function
In other words, the CDF of does not depend on
Equivalently, the PDF / PMF of does not depend on
Definition of Pivotal Quantities
Let i.i.d. where and the ’s are
unknown parameters
A random variable is called a pivotal quantity
(or pivot) if its distribution does not depend on any unknown
parameter
Note:
 can depend on (in fact, usually does), but
its distribution must not depend on
 usually is not a statistic (a statistic must
only depend on )
How to Prove that an Expression is a
Pivotal Quantity
To prove that is a pivotal quantity, we have
the following options:
 Determine a PDF of and show that it does not involve
any unknown parameter
 Determine the CDF of and show that it does not involve
any unknown parameter
 Show that the MGF exists for all
for some interval containing zero and that
does not involve any unknown parameter
Example

i.i.d. where is unknown

 is not a pivotal quantity


(distribution depends on )

 is a pivotal quantity

 is also a pivotal quantity


Preparation: CDF and PDF of
Let be a random variable with CDF for
.
Let be a constant, then

( )

Conclusion:
CDF:

PDF (in continuous case):


Construction of Exact Confidence
Intervals
Given: i.i.d. where is an unknown
parameter
1) Find a pivotal quantity
2) Find with for all

3) Find and such that

4) Then is an exact
confidence interval for
Construction of Pivotal Quantities
The essential step in the construction of confidence intervals
is to come up with pivotal quantities

Method that often (not always) works:


 Start with the MLE
 Usually its distribution depends on
 Modify to so that the distribution of does not depend
on any more
 To show that is a pivotal quantity, it is enough to show
that the CDF or PDF or PMF of does not involve
Example for Construction of Pivotal
Quantity
Let i.i.d. with

MLE: (see previous lecture)

CDF of is
(see Tutorial Problems)

Our goal is to modify so that the CDF does not depend


on any more
Example for Construction of Pivotal
Quantity, continued

CDF of is

We want to find a constant such that the CDF of does


not depend on

Recall: To get the CDF of , replace by in the CDF of

Question: What should we choose?


Example for Construction of Pivotal
Quantity, Conclusion
CDF of is

CDF of is for

CDF of is for

CDF of does not depend on

is a pivotal quantity
Example for Construction of Confidence
Interval from Pivotal Quantity
We have:
is a pivotal quantity for i.i.d.

CDF of is for

Next step: Find with for all


, where is the true parameter

Use CDF to get an expression for the probability:


Example for Construction of Confidence
Interval from Pivotal Quantity, continued
Need with

There are infinitely many solutions . We can just pick


one of them, say, ,

Next step: Find and such that


Example for Construction of Confidence
Interval from Pivotal Quantity, continued

is exact confidence interval for


Example for Construction of Confidence
Interval from Pivotal Quantity, Conclusion
Have shown: is confidence interval
for in the case of i.i.d.

Example of observations: :

Note
Take

is a 95% confidence interval for


based on these observations
Overview of Pivotal Quantities for
Normal Distribution
i.i.d. ,

confidence intervals for if is known

confidence intervals for if both unknown

confidence intervals for if both unknown


(Figure: density of the pivotal quantity with its upper percentage points marked; the central area between them equals the confidence level, and the interval based on the smaller percentage point is narrower.)
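
As an example, a minimal R sketch of the confidence interval for the mean of a normal sample when both parameters are unknown, based on the classical pivotal quantity $(\bar X - \mu)/(S/\sqrt{n}) \sim t_{n-1}$; the simulated data and the 95% level are illustrative assumptions:

set.seed(1)
x = rnorm(20, mean = 10, sd = 3)                          # stand-in data
n = length(x); alpha = 0.05
half = qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)    # half-width of the interval
c(mean(x) - half, mean(x) + half)                         # exact 95% confidence interval for mu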


27) Exact Confidence Intervals for
Exponential Distributions
Confidence Intervals for
i.i.d. with PDF

Need to find pivotal quantity based on

MLE:

Need to modify so that we get a random variable whose


distribution does not depend on
Start with the simplest possible expression:
Confidence Intervals for , cont.
i.i.d.
, ,

and
and
Mean and variance of both do not depend on
potentially is a pivotal quantity

Need to show: Distribution of does not depend on


Confidence Intervals for , cont.

with i.i.d. Can also use

Goal: Show that the distribution of does not depend on


Instead, we can also consider the distribution of

Tutorial Problems 2:
(does not depend on )
is pivotal quantity
Confidence Intervals for , cont
Have shown: is pivotal quantity
Next step: Find with

Recall upper percentage points:

is defined by
Confidence Intervals for , cont

Have shown:

Next step: Express as a condition on


:
, / , /

, / , /
is confidence interval for
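
A minimal R sketch of this exact interval, assuming the rate parametrization of the Exponential distribution and the standard pivot $2\lambda \sum_i X_i \sim \chi^2_{2n}$ (which may be written differently in the notes); the simulated data and the 95% level are illustrative:

set.seed(1)
x = rexp(25, rate = 2)                         # stand-in sample, true lambda = 2
n = length(x); alpha = 0.05
lower = qchisq(alpha / 2, df = 2 * n) / (2 * sum(x))
upper = qchisq(1 - alpha / 2, df = 2 * n) / (2 * sum(x))
c(lower, upper)                                # exact 95% confidence interval for lambda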
28) Asymptotically Pivotal Quantities
and Approximate Confidence Intervals
Motivation
i.i.d. . Can we find a pivotal quantity
based on ?
MLE:
,

It’s impossible to modify so that its distribution becomes


independent of and even impossible to find any useful
pivotal quantity (see Wikipedia, “Binomial proportion
confidence interval”)
A more general method for constructing confidence
intervals is needed
Motivation, continued
In the case of i.i.d. , pivotal quantities
work perfectly for constructing confidence intervals:

For other distributions (even Bernoulli!), it is much harder or


even impossible to find pivotal quantities. Often it is better
to use methods based on asymptotic normality
Asymptotically Pivotal Quantities
i.i.d. where (unknown
parameters)
A random variable is called an
asymptotically pivotal quantity if it tends to in
distribution for , where is a distribution that does
not depend on any unknown parameter
Usually

Asymptotically pivotal quantities are used to construct


approximate confidence intervals. These are also called
large sample confidence intervals
Main Source for Asymptotically
Pivotal Quantities
Let i.i.d. and let be an MLE for

Asymptotic normality of MLE: ) tends to


in distribution (under regularity conditions)

Hence ) is an asymptotically pivotal


quantity, and this holds for all distributions which satisfy the
regularity conditions
Approximate Confidence Intervals
Based on MLEs
Let i.i.d. , let be an MLE for , and let be
the true parameter. Assume that the regularity conditions for
the asymptotic normality of the MLE hold

tends to in distribution for

if is large

We now use this to obtain a general formula for approximate


confidence intervals
Preparation: Notation for Standard
Normal Percentage Points
Let and
The upper percentage point is defined by

Note: , where is the CDF


of

Since is symmetric around , we have


Approximate Confidence Intervals
Based on MLEs, continued

(see slide on )

/ /

Problem: unknown, since unknown


Solution: Estimate by
Approximate Confidence Intervals
Based on MLEs, continued
/ /

/ /

/ /
is an approximate 100(1 − α)% confidence interval for


Main Result on Approximate
Confidence Intervals (Summary)
Let i.i.d. , let be an MLE for , and let be
the true parameter. Assume that the regularity conditions for
the asymptotic normality of the MLE hold.

Then
/ /
is an approximate 100(1 − α)% confidence interval for


Example
i.i.d. with true parameter
is MLE for (shown in previous Lecture)
(shown in previous Lecture)

Main result on approximate confidence intervals


/ / ⁄ /

is approximate confidence interval for
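
A minimal R sketch of this approximate interval in the Bernoulli case, where the estimated standard error $1/\sqrt{n\,I(\hat p)}$ equals $\sqrt{\hat p(1-\hat p)/n}$; the poll numbers are illustrative assumptions:

n = 1000; successes = 520                      # hypothetical poll outcome
p_hat = successes / n                          # MLE for p
alpha = 0.05
z = qnorm(1 - alpha / 2)                       # upper alpha/2 percentage point, about 1.96
se_hat = sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error of p_hat
c(p_hat - z * se_hat, p_hat + z * se_hat)      # approximate 95% confidence interval for p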


Recall: Claim of New York Times

i.i.d. with true parameter

NYT claim: is approximate 95%


confidence interval for

We have seen that this is only true for

We now compute a correct approximate 95% confidence


interval for
Confidence Interval for Polls
⁄ /
Have seen: is approximate
confidence interval for

Case of polls: ,
is approximate
95% confidence interval for
The bounds depend on

If , then and the interval will be roughly


, which is much smaller than the one
claimed by New York Times
