Week 5 Annotated
Let $Y = Z^2$, where $Z \sim N(0,1)$. Then:
$$F_Y(y) = F_Z(\sqrt{y}) - F_Z(-\sqrt{y}) = 2 F_Z(\sqrt{y}) - 1;$$
$$E[Y] = E\left[Z^2\right] = 1;$$
$$Var(Y) = E\left[Y^2\right] - (E[Y])^2 = E\left[Z^4\right] - \left(E\left[Z^2\right]\right)^2 = 3 - 1 = 2.$$
$$F_Y(y) = \Pr\left(-\sqrt{y} \le Z \le \sqrt{y}\right) = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz = 2 \int_{0}^{\sqrt{y}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz = 2 \int_{0}^{y} \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{2}\, w^{-1/2} e^{-\frac{1}{2}w}\, dw,$$
using the substitution $w = z^2$, so that $dz = \frac{1}{2} w^{-1/2}\, dw$.
Proof (cont.).
$$F_Y(y) = \int_0^y \frac{1}{\sqrt{2\pi}}\, w^{-1/2} e^{-\frac{1}{2}w}\, dw.$$
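The closed form for this c.d.f., $F_Y(y) = 2\Phi(\sqrt{y}) - 1 = \operatorname{erf}(\sqrt{y/2})$, can be checked against a direct Riemann sum of the integrand above. A minimal sketch (standard library only; the midpoint rule, grid size, and test point are arbitrary choices):

```python
import math

def chi2_1_cdf_closed(y):
    """F_Y(y) = 2*Phi(sqrt(y)) - 1 = erf(sqrt(y/2)), Phi the standard normal c.d.f."""
    return math.erf(math.sqrt(y / 2.0))

def chi2_1_cdf_numeric(y, steps=200_000):
    """Midpoint Riemann sum of w^(-1/2) * exp(-w/2) / sqrt(2*pi) over (0, y)."""
    h = y / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * h          # midpoint avoids the integrable singularity at 0
        total += w ** -0.5 * math.exp(-w / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

closed = chi2_1_cdf_closed(2.5)
numeric = chi2_1_cdf_numeric(2.5)
```

The two values agree up to the discretization error of the sum.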
By the fundamental theorem of calculus,
$$\frac{d}{db} \int_a^b f(x)\, dx = f(b),$$
so differentiating $F_Y$ gives the p.d.f. of $Y$.
Let $X = \sum_{i=1}^{n} Y_i$, where the $Y_i \sim \chi^2(1)$ are independent. Then:
$$E[X] = E\left[\sum_{i=1}^{n} Y_i\right] = n\, E[Y_i] = n,$$
$$Var(X) = Var\left(\sum_{i=1}^{n} Y_i\right) = n\, Var(Y_i) = 2n,$$
$$M_X(t) = M_{\sum_{i=1}^{n} Y_i}(t) = \left(M_{Y_i}(t)\right)^n = (1 - 2t)^{-n/2}, \qquad t < 1/2.$$
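The moment results $E[X] = n$ and $Var(X) = 2n$ are easy to sanity-check by simulating sums of squared standard normals. A minimal sketch (seeded `random` module; sample size and tolerances are arbitrary):

```python
import random

random.seed(42)
n = 3            # degrees of freedom
N = 200_000     # number of simulated chi-squared draws

# X = sum of n squared standard normals, i.e. a chi-squared(n) draw
draws = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(N)]

mean = sum(draws) / N
var = sum((d - mean) ** 2 for d in draws) / (N - 1)
# Theory: E[X] = n = 3, Var(X) = 2n = 6
```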
$$f_X(x) = \frac{\lambda^n\, x^{n-1} e^{-\lambda x}}{\Gamma(n)}.$$
Chi-squared probability/cumulative density function
[Figure: chi-squared p.d.f. $f_X(x)$ (left) and c.d.f. $F_X(x)$ (right) for $n = 1, 2, 3, 5, 10, 25$, plotted for $x \in [0, 30]$.]
A $\chi^2(r)$ random variable can be written as $V = \sum_{k=1}^{r} Z_k^2$, where the $Z_k$ are i.i.d. standard normal.
Proof:
Note the p.d.f.s:
$$f_V(v) = \frac{v^{r/2-1}}{2^{r/2}\, \Gamma(r/2)}\, e^{-v/2}, \quad \text{if } 0 \le v < \infty;$$
$$f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}, \quad \text{if } -\infty < z < \infty,$$
and the transformation
$$t = g_2(z, v) = \frac{z}{\sqrt{v/r}}, \qquad s = g_1(z, v) = v.$$
With inverse $v = h_1(s,t) = s$ and $z = h_2(s,t) = t\sqrt{s/r}$, defined for $0 \le s < \infty$ and $-\infty < t < \infty$, the Jacobian is:
$$J(s,t) = \det \begin{pmatrix} \dfrac{\partial h_1(s,t)}{\partial s} & \dfrac{\partial h_1(s,t)}{\partial t} \\[2pt] \dfrac{\partial h_2(s,t)}{\partial s} & \dfrac{\partial h_2(s,t)}{\partial t} \end{pmatrix} = \det \begin{pmatrix} 1 & 0 \\[2pt] \dfrac{t\, s^{-1/2}}{2\sqrt{r}} & \sqrt{s/r} \end{pmatrix} = \sqrt{s/r}.$$
The joint density of $(S, T)$ is then:
$$f_{S,T}(s,t) = \frac{s^{r/2-1} e^{-s/2}}{\Gamma(r/2)\, 2^{r/2}} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2 s}{2r}} \cdot \sqrt{\frac{s}{r}} = \frac{1}{\sqrt{2\pi r}\, \Gamma(r/2)\, 2^{r/2}}\, s^{(r+1)/2-1} \exp\left(-\frac{s}{2}\left(1 + \frac{t^2}{r}\right)\right).$$
5. Therefore, the marginal density of $T$ is given by:
$$f_T(t) = \int_0^\infty f_{S,T}(s,t)\, ds.$$
Substitute
$$s = \frac{2w}{1 + t^2/r}, \quad \text{so that:} \quad dw = \frac{1}{2}\left(1 + \frac{t^2}{r}\right) ds, \qquad ds = \frac{2}{1 + t^2/r}\, dw.$$
So that we have:
$$f_T(t) = \int_0^\infty \frac{1}{\sqrt{2\pi r}\, \Gamma(r/2)\, 2^{r/2}}\, s^{(r+1)/2-1} \exp\left(-\frac{s}{2}\left(1 + \frac{t^2}{r}\right)\right) ds = \int_0^\infty \frac{1}{\sqrt{2\pi r}\, \Gamma(r/2)\, 2^{r/2}} \left(\frac{2w}{1 + \frac{t^2}{r}}\right)^{\frac{r+1}{2}-1} \exp(-w)\, \frac{2}{1 + \frac{t^2}{r}}\, dw.$$
Simplifying:
$$f_T(t) = \frac{1}{\sqrt{2\pi r}\, \Gamma(r/2)\, 2^{r/2}} \left(\frac{2}{1 + t^2/r}\right)^{(r+1)/2} \int_0^\infty w^{(r+1)/2-1} e^{-w}\, dw = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\sqrt{\pi r}\, \Gamma(r/2)} \left(\frac{1}{1 + t^2/r}\right)^{(r+1)/2},$$
using $\int_0^\infty x^{\alpha-1} \exp(-x)\, dx = \Gamma(\alpha)$.
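The final density can be sanity-checked numerically: implemented with `math.gamma`, it should integrate to (nearly) 1, and for $r = 1$ it reduces to the Cauchy density with $f(0) = 1/\pi$. A minimal sketch (trapezoidal rule; the grid limits and degrees of freedom are arbitrary choices):

```python
import math

def t_pdf_factory(r):
    """Student-t density with r degrees of freedom, as in the formula above."""
    c = math.gamma((r + 1) / 2) / (math.sqrt(math.pi * r) * math.gamma(r / 2))
    return lambda t: c * (1.0 + t * t / r) ** (-(r + 1) / 2)

pdf5 = t_pdf_factory(5)
# trapezoidal rule on [-50, 50]; the r = 5 tails beyond that are negligible
a, b, steps = -50.0, 50.0, 100_000
h = (b - a) / steps
area = (sum(pdf5(a + i * h) for i in range(steps + 1))
        - 0.5 * (pdf5(a) + pdf5(b))) * h

pdf1 = t_pdf_factory(1)   # r = 1 is the Cauchy density
```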
Student-$t$ p.d.f. and c.d.f.
[Figure: Student-$t$ p.d.f. $f(x)$ (left) and c.d.f. $F(x)$ (right) for $r = 1, 2, 3, 5, 10, 25$, plotted for $x \in [-5, 5]$.]
Snedecor's F distribution
Suppose $U \sim \chi^2(n_1)$ and $V \sim \chi^2(n_2)$ are two independent chi-squared distributed random variables. Then, the random variable:
$$F = \frac{U/n_1}{V/n_2}$$
has an F distribution with $n_1$ and $n_2$ degrees of freedom. Consider the transformation
$$f = \frac{u/n_1}{v/n_2}, \quad g = v; \qquad \text{with inverse} \quad u = f g\, \frac{n_1}{n_2}, \quad v = g.$$
Snedecor's F distribution
3. Jacobian of the transformation:
$$J(f,g) = \det \begin{pmatrix} \partial u/\partial f & \partial u/\partial g \\ \partial v/\partial f & \partial v/\partial g \end{pmatrix} = \det \begin{pmatrix} g\, \dfrac{n_1}{n_2} & f\, \dfrac{n_1}{n_2} \\ 0 & 1 \end{pmatrix} = g\, \frac{n_1}{n_2}.$$
Absolute value of the Jacobian: $|J(f,g)| = g\, \dfrac{n_1}{n_2}$.
Snedecor's F distribution
4. The joint density of $(F, G)$ is:
$$f_{F,G}(f,g) = f_U\!\left(f g\, \frac{n_1}{n_2}\right) f_V(g)\, |J(f,g)| = \frac{\left(f g\, \frac{n_1}{n_2}\right)^{(n_1-2)/2} \exp\left(-\frac{f g\, n_1}{2 n_2}\right)}{2^{n_1/2}\, \Gamma\!\left(\frac{n_1}{2}\right)} \cdot \frac{g^{(n_2-2)/2} \exp\left(-\frac{g}{2}\right)}{2^{n_2/2}\, \Gamma\!\left(\frac{n_2}{2}\right)} \cdot g\, \frac{n_1}{n_2}.$$
5. The marginal of $F$ is obtained by integrating over all possible values of $G$:
$$f_F(f) = \int_0^\infty f_{F,G}(f,g)\, dg = \text{func}(f) \int_0^\infty g^{(n_1+n_2-2)/2} \exp\left(-g\left(\frac{1}{2} + \frac{f n_1}{2 n_2}\right)\right) dg,$$
where
$$\text{func}(f) = \frac{n_1\, (f n_1)^{(n_1-2)/2}}{2^{(n_1+n_2)/2}\, n_2^{n_1/2}\, \Gamma\!\left(\frac{n_1}{2}\right) \Gamma\!\left(\frac{n_2}{2}\right)}.$$
Continues:
$$f_F(f) = \text{func}(f) \left(\frac{2 n_2}{n_2 + f n_1}\right)^{(n_1+n_2)/2} \Gamma\!\left(\frac{n_1+n_2}{2}\right) = \frac{\Gamma\!\left(\frac{n_1+n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right) \Gamma\!\left(\frac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} f^{n_1/2 - 1} \left(1 + \frac{n_1 f}{n_2}\right)^{-(n_1+n_2)/2}.$$
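The resulting F density can be sanity-checked numerically: it should integrate to (nearly) 1, and its mean should match the known value $n_2/(n_2-2)$ for $n_2 > 2$. A minimal sketch (midpoint rule; the degrees of freedom, grid, and tolerances are arbitrary choices):

```python
import math

def f_pdf_factory(n1, n2):
    """Snedecor F density with (n1, n2) degrees of freedom, as derived above."""
    c = (math.gamma((n1 + n2) / 2)
         / (math.gamma(n1 / 2) * math.gamma(n2 / 2))
         * (n1 / n2) ** (n1 / 2))
    return lambda x: c * x ** (n1 / 2 - 1) * (1.0 + n1 * x / n2) ** (-(n1 + n2) / 2)

pdf = f_pdf_factory(5, 10)
# midpoint rule on (0, 400]; the tail beyond is negligible for these d.o.f.
steps, upper = 200_000, 400.0
h = upper / steps
mids = [(i + 0.5) * h for i in range(steps)]
area = sum(pdf(x) for x in mids) * h
mean = sum(x * pdf(x) for x in mids) * h   # theory: n2/(n2-2) = 10/8 = 1.25
```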
Snedecor's F p.d.f. and c.d.f.
[Figure: F p.d.f. $f_X(x)$ (left) and c.d.f. $F_X(x)$ (right) for $(n_1, n_2) = (2,2), (2,4), (2,6), (2,10), (10,2), (10,10)$, plotted for $x \in [0, 10]$.]
Sample mean and sample variance:
$$\bar{X} = \frac{1}{n} \sum_{k=1}^{n} X_k, \qquad S^2 = \frac{1}{n-1} \sum_{k=1}^{n} \left(X_k - \bar{X}\right)^2,$$
and
$$\frac{(n-1)\, S^2}{\sigma^2} \sim \chi^2(n-1): \text{ the sample variance scaled by the population variance.}$$
Claim: $\dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$.
Proof:
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{S^2}{\sigma^2}}} = \frac{Z}{\sqrt{\dfrac{\chi^2_{n-1}}{n-1}}} \sim t_{n-1},$$
since $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ and $\dfrac{(n-1) S^2}{\sigma^2} \sim \chi^2(n-1)$ are independent.
We have:
$$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} \left( (X_i - \bar{X}) + (\bar{X} - \mu) \right)^2 = \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2 + n\, (\bar{X} - \mu)^2,$$
because the cross term $2 (\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \bar{X}) = 0$. Dividing by $\sigma^2$:
$$\underbrace{\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}}_{\sum_{i=1}^{n} Z_i^2\, \sim\, \chi^2_n} = \underbrace{\frac{(n-1)\, S^2}{\sigma^2}}_{\sim\, \chi^2(n-1)} + \underbrace{\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2}_{Z^2\, \sim\, \chi^2_1}.$$
Hence $\dfrac{(n-1)\, S^2}{\sigma^2} \sim \chi^2(n-1)$, and therefore $\dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$.
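The sum-of-squares decomposition in the middle of the proof is a purely algebraic identity, so it can be verified exactly on any data set. A minimal sketch (made-up numbers):

```python
# Verify: sum (x_i - mu)^2 = sum (x_i - xbar)^2 + n * (xbar - mu)^2
x = [2.1, 3.5, 1.7, 4.2, 2.9]   # arbitrary sample
mu = 3.0                         # arbitrary "population mean"
n = len(x)
xbar = sum(x) / n

lhs = sum((xi - mu) ** 2 for xi in x)
rhs = sum((xi - xbar) ** 2 for xi in x) + n * (xbar - mu) ** 2
```

The cross term vanishes because $\sum_i (x_i - \bar{x}) = 0$ by construction.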
Course overview:
- Probability (review): Weeks 1-5 (video lectures: Week 1 VL - Week 5 VL);
- Estimation: Weeks 6-9;
- Hypothesis testing: Weeks 10-12;
- Linear regression: later weeks.
This week
Parameter estimation:
- Method of Moments;
- Maximum Likelihood method;
- Bayesian estimator.
Definition of an Estimator
Problem of statistical estimation: a population has some characteristics that can be described by a r.v. $X$ with density $f_X(\cdot\,|\,\theta)$. The density has an unknown parameter (or set of parameters) $\theta$. We observe values of the random sample $X_1, X_2, \ldots, X_n$ from the population $f_X(\cdot\,|\,\theta)$. Denote these observed sample values by $x_1, x_2, \ldots, x_n$. We then estimate the parameter (or some function of the parameter) based on this random sample.
Definition of an Estimator
Any statistic, i.e., a function $T(X_1, X_2, \ldots, X_n)$ of observable random variables whose values are used to estimate $\tau(\theta)$, where $\tau(\theta)$ is some function of the parameter $\theta$, is called an estimator of $\tau(\theta)$.
A value $\hat{\theta}$ of the statistic, evaluated at the observed sample values $x_1, x_2, \ldots, x_n$, is called a (point) estimate.
For example:
$$T(X_1, X_2, \ldots, X_n) = \bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j \quad \text{(estimator)}; \qquad \hat{\theta} = 0.23 \quad \text{(point estimate)}.$$
Note: $\theta$ can be a vector; then the estimator is a set of equations.
The sample moments are:
$$m_1 = \frac{1}{n} \sum_{j=1}^{n} x_j, \quad m_2 = \frac{1}{n} \sum_{j=1}^{n} x_j^2, \quad \ldots, \quad m_k = \frac{1}{n} \sum_{j=1}^{n} x_j^k.$$
Solving the resulting system provides the point estimate $\hat{\theta}$; e.g., for a binomial sample this gives $\hat{p} = \bar{x}/n$.
$$\underbrace{E\left[X^2\right]}_{\text{population moment}} = \underbrace{\frac{1}{n} \sum_{j=1}^{n} x_j^2}_{\text{sample moment}}.$$
and
$$\hat{\mu} = E[X] = \bar{x}$$
$$\hat{\sigma}^2 = E\left[X^2\right] - (E[X])^2 = \frac{1}{n} \sum_{j=1}^{n} x_j^2 - \bar{x}^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2 = \frac{n-1}{n}\, s^2,$$
* using $s^2 = \dfrac{\sum_{j=1}^{n} (x_j - \bar{x})^2}{n-1}$, the sample variance.
Note: $E\left[\hat{\sigma}^2\right] \ne \sigma^2$ (biased estimator); more on this next week.
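The relation $\hat{\sigma}^2 = \frac{n-1}{n} s^2$ between the method-of-moments estimate and the sample variance can be checked directly on any data. A minimal sketch (arbitrary numbers):

```python
x = [4.0, 5.5, 3.8, 6.1, 5.0, 4.6]   # arbitrary sample
n = len(x)
xbar = sum(x) / n

# Method-of-moments estimates
mu_hat = xbar
sigma2_hat = sum(xi ** 2 for xi in x) / n - xbar ** 2

# Unbiased sample variance s^2 (divisor n - 1)
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
```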
The likelihood function of the sample is:
$$L(\theta; \mathbf{x}) = \prod_{j=1}^{n} f_X(x_j\,|\,\theta).$$
$$\frac{\partial L(\theta; \mathbf{x})}{\partial \theta_1} = 0, \quad \frac{\partial L(\theta; \mathbf{x})}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial L(\theta; \mathbf{x})}{\partial \theta_k} = 0.$$
For $k = 2$, the gradient and Hessian are:
$$D(L) = \begin{pmatrix} \dfrac{\partial L}{\partial \theta_1} \\[2pt] \dfrac{\partial L}{\partial \theta_2} \end{pmatrix}, \qquad H(L) = \begin{pmatrix} \dfrac{\partial^2 L}{\partial \theta_1^2} & \dfrac{\partial^2 L}{\partial \theta_1 \partial \theta_2} \\[2pt] \dfrac{\partial^2 L}{\partial \theta_1 \partial \theta_2} & \dfrac{\partial^2 L}{\partial \theta_2^2} \end{pmatrix},$$
and the second-order condition for a maximum is:
$$\begin{pmatrix} h_1 & h_2 \end{pmatrix} H(L) \begin{pmatrix} h_1 \\ h_2 \end{pmatrix} < 0, \quad \text{for all } [h_1, h_2] \ne 0.$$
Log-Likelihood function
Generally, maximizing the log-likelihood function is easier. Not surprisingly, we define the log-likelihood function as:
$$\ell(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x}) = \log\left(L(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x})\right) = \log\left(\prod_{j=1}^{n} f_X(x_j\,|\,\theta)\right) = \sum_{j=1}^{n} \log\left(f_X(x_j\,|\,\theta)\right).$$
MLE procedure
The general procedure to find the ML estimator is:
1. Determine the likelihood function $L(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x})$;
2. Determine the log-likelihood function $\ell(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x}) = \log(L(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x}))$;
3. Equate the derivatives of $\ell(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x})$ w.r.t. $\theta_1, \theta_2, \ldots, \theta_k$ to zero (gives a global/local minimum/maximum);
4. Check whether the second derivative is negative (maximum), and check the boundary conditions.
1. For a Poisson($\lambda$) sample, $f_X(x_j\,|\,\lambda) = \dfrac{e^{-\lambda}\, \lambda^{x_j}}{x_j!}$, so:
$$L(\lambda; \mathbf{x}) = \prod_{j=1}^{n} \frac{e^{-\lambda}\, \lambda^{x_j}}{x_j!} = e^{-n\lambda}\, \frac{\lambda^{x_1}\, \lambda^{x_2} \cdots \lambda^{x_n}}{x_1!\, x_2! \cdots x_n!}.$$
2. So that taking the log of both sides, we get:
$$\ell(\lambda; \mathbf{x}) = -n\lambda + \log(\lambda) \sum_{k=1}^{n} x_k - \sum_{k=1}^{n} \log(x_k!).$$
3. Setting the derivative to zero:
$$\frac{\partial}{\partial \lambda}\, \ell(\lambda) = 0 \quad \Longleftrightarrow \quad -n + \frac{1}{\lambda} \sum_{k=1}^{n} x_k = 0,$$
so that:
$$\hat{\lambda} = \frac{1}{n} \sum_{k=1}^{n} x_k = \bar{x}.$$
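That $\hat{\lambda} = \bar{x}$ indeed maximizes the log-likelihood can be verified numerically by evaluating $\ell(\lambda)$ on a grid around the sample mean. A minimal sketch (arbitrary count data):

```python
import math

x = [2, 0, 3, 1, 4, 2, 1]   # arbitrary Poisson-like counts
n = len(x)
xbar = sum(x) / n

def loglik(lam):
    """Poisson log-likelihood: -n*lam + log(lam)*sum(x) - sum(log(x_k!))."""
    return (-n * lam + math.log(lam) * sum(x)
            - sum(math.log(math.factorial(k)) for k in x))

# The log-likelihood at the MLE should beat nearby candidate values
candidates = [xbar + d for d in (-0.5, -0.1, 0.1, 0.5)]
best_is_mle = all(loglik(xbar) > loglik(lam) for lam in candidates)
```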
1. For a Normal($\mu, \sigma^2$) sample, the likelihood is:
$$L(\mu, \sigma; \mathbf{x}) = \prod_{k=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right).$$
Question: Find the MLE of $\mu$ and $\sigma^2$.
$$\ell(\mu, \sigma; \mathbf{x}) = \log \prod_{k=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^2\right) = -n \log(\sigma) - \frac{n}{2} \log(2\pi) - \frac{1}{2\sigma^2} \sum_{k=1}^{n} (x_k - \mu)^2.$$
* using $\log(1/a) = \log(a^{-1}) = -\log(a)$, with $a = \sigma\sqrt{2\pi}$.
$$\frac{\partial}{\partial \mu}\, \ell(\mu, \sigma; \mathbf{x}) = \frac{1}{\sigma^2} \sum_{k=1}^{n} (x_k - \mu) = 0 \quad \Longleftrightarrow \quad \sum_{k=1}^{n} x_k - n\mu = 0 \quad \Longleftrightarrow \quad \hat{\mu} = \bar{x}.$$
$$\frac{\partial}{\partial \sigma}\, \ell(\mu, \sigma; \mathbf{x}) = -\frac{n}{\sigma} + \frac{\sum_{k=1}^{n} (x_k - \mu)^2}{\sigma^3} = 0 \quad \Longleftrightarrow \quad n\sigma^2 = \sum_{k=1}^{n} (x_k - \mu)^2,$$
so that:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \bar{x})^2.$$
Gamma($\alpha, \beta$) distribution:
$$f_X(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}; \qquad M_X(t) = E\left[e^{tX}\right] = \left(\frac{\beta}{\beta - t}\right)^\alpha; \qquad E[X^r] = \frac{\Gamma(\alpha + r)}{\beta^r\, \Gamma(\alpha)}; \qquad Var(X) = \frac{\alpha}{\beta^2}.$$
Method of moments: equate
$$\mu_1 = M_X^{(1)}(t)\Big|_{t=0} = E[X] = \bar{x} \quad \text{and} \quad \mu_2 = M_X^{(2)}(t)\Big|_{t=0} = E\left[X^2\right] = \frac{1}{n} \sum_{i=1}^{n} x_i^2.$$
For the Gamma distribution:
$$\mu_1 = \frac{\alpha}{\beta} \quad \text{and} \quad \mu_2 = \frac{\alpha\, (\alpha + 1)}{\beta^2} = \mu_1 \left(\frac{1}{\beta} + \mu_1\right).$$
Solving for $\beta$ and $\alpha$:
$$\frac{1}{\beta} = \frac{\mu_2}{\mu_1} - \mu_1 = \frac{\mu_2 - \mu_1^2}{\mu_1} \quad \Longrightarrow \quad \beta = \frac{\mu_1}{\mu_2 - \mu_1^2}, \qquad \alpha = \beta\, \mu_1 = \frac{\mu_1^2}{\mu_2 - \mu_1^2}.$$
Using (step 1.) $\mu_1 = \bar{x}$ and $\mu_2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2$, and noting that $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2$:
$$\hat{\beta} = \frac{\bar{x}}{\hat{\sigma}^2}, \qquad \hat{\alpha} = \hat{\beta}\, \bar{x} = \frac{\bar{x}^2}{\hat{\sigma}^2}.$$
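These moment estimators can be sanity-checked on simulated Gamma data via `random.gammavariate` (note: its second argument is the scale $1/\beta$, not the rate). A minimal sketch (seeded; the true parameters, sample size, and tolerances are arbitrary choices):

```python
import random

random.seed(7)
alpha, beta = 3.0, 2.0           # true shape and rate
N = 100_000
x = [random.gammavariate(alpha, 1.0 / beta) for _ in range(N)]

m1 = sum(xi for xi in x) / N          # sample mean, ~ alpha/beta = 1.5
m2 = sum(xi ** 2 for xi in x) / N     # second sample moment
sigma2_hat = m2 - m1 ** 2

# Method-of-moments estimates, as derived above
beta_hat = m1 / sigma2_hat            # should be ~ 2.0
alpha_hat = m1 ** 2 / sigma2_hat      # should be ~ 3.0
```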
For the ML estimator, the likelihood is:
$$L(\alpha, \beta; \mathbf{x}) = \prod_{i=1}^{n} \frac{\beta^\alpha}{\Gamma(\alpha)}\, x_i^{\alpha-1} e^{-\beta x_i},$$
and the first-order conditions are:
$$\frac{\partial}{\partial \alpha}\, \ell(\alpha, \beta; \mathbf{x}) = -n\, \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + n \log(\beta) + \sum_{i=1}^{n} \log(x_i) = 0,$$
$$\frac{\partial}{\partial \beta}\, \ell(\alpha, \beta; \mathbf{x}) = \frac{n\alpha}{\beta} - \sum_{i=1}^{n} x_i = 0 \quad \Longrightarrow \quad \hat{\beta} = \frac{n\, \hat{\alpha}}{\sum_{i=1}^{n} x_i}.$$
For a Uniform$(0, \theta)$ sample, the likelihood is:
$$L(\theta; \mathbf{x}) = \frac{1}{\theta^n} \prod_{k=1}^{n} I_{\{0 \le x_k \le \theta\}},$$
which is zero for $\theta < \max_k x_k$ and decreasing in $\theta$ beyond it.
[Figure: $L(\theta; \mathbf{x})$ as a function of $\theta$.]
[Figure: step plots of $F(\theta)$ after successive observations (1st, 2nd, 4th, 5th), with the order statistic $x_{(2)}$ marked; $\theta \in [0, 0.4]$.]
Introduction
We have seen estimators; to compare them we use a loss function $L(\hat{\theta}\,|\,\theta)$ with:
- $L(\hat{\theta}\,|\,\theta) \ge 0$ for every $\hat{\theta}$;
- $L(\hat{\theta}\,|\,\theta) = 0$ when $\hat{\theta} = \theta$;
- e.g., the squared error loss (mostly used).
The Bayesian estimator minimizes the expected loss:
$$\hat{\theta} = \operatorname*{argmin}_{\hat{\theta}}\, E_\theta\left[E_{\mathbf{x}|\theta}\left[L(\hat{\theta}\,|\,\theta)\right]\right] = \operatorname*{argmin}_{\hat{\theta}}\, E\left[L(\hat{\theta}\,|\,\theta)\right].$$
$$E\left[L(\hat{\theta}\,|\,\theta)\right] = \int\!\!\int L(\hat{\theta}(\mathbf{x}), \theta)\, f_{\mathbf{x},\theta}(\mathbf{x}, \theta)\, d\mathbf{x}\, d\theta = \int \underbrace{\left[\int L(\hat{\theta}(\mathbf{x}), \theta)\, \pi(\theta|\mathbf{x})\, d\theta\right]}_{r(\hat{\theta}|\mathbf{x})} f_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x} = \int r(\hat{\theta}|\mathbf{x})\, f_{\mathbf{x}}(\mathbf{x})\, d\mathbf{x}.$$
Under squared error loss, minimizing $r(\hat{\theta}|\mathbf{x})$ for each $\mathbf{x}$ gives:
$$-2 \int \left(\theta - \hat{\theta}(\mathbf{x})\right) \pi(\theta|\mathbf{x})\, d\theta = 0 \quad \Longrightarrow \quad \hat{\theta}_B(\mathbf{x}) = \int \theta\, \pi(\theta|\mathbf{x})\, d\theta, \quad \text{i.e., } \hat{\theta}_B(\mathbf{x}) = E_{\theta|\mathbf{x}}[\theta].$$
Interpretation: the Bayesian estimator under squared error loss is the expectation of the posterior density, i.e., $\hat{\theta}_B = E[\theta\,|\,\mathbf{x}]$!
$$\pi(\theta|\mathbf{x}) = \frac{f_{X|\Theta}(x_1, x_2, \ldots, x_T\,|\,\theta)\, \pi(\theta)}{\int f_{X|\Theta}(x_1, x_2, \ldots, x_T\,|\,\theta)\, \pi(\theta)\, d\theta} \qquad (1)$$
$$= \frac{f_{X|\Theta}(x_1, x_2, \ldots, x_T\,|\,\theta)\, \pi(\theta)}{f_X(x_1, x_2, \ldots, x_T)} \qquad (2)$$
* Using Bayes' formula: $\Pr(A_i|B) = \dfrac{\Pr(B|A_i)\Pr(A_i)}{\sum_{j=1}^{n} \Pr(B|A_j)\Pr(A_j)}$, with $A_1, \ldots, A_n$ a complete partition of $\Omega$.
** Using the LTP: $\Pr(A) = \sum_{i=1}^{n} \Pr(A|B_i)\Pr(B_i)$ (where $B_1, \ldots, B_n$ is a complete partition of $\Omega$, week 1).
Estimation procedure:
1. The prior is Beta$(a, b)$:
$$\pi(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\, \Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}.$$
2. The likelihood of the Bernoulli sample is:
$$f_{X|\Theta}(\mathbf{x}\,|\,\theta) = \theta^{\sum_{j=1}^{T} x_j} (1-\theta)^{T - \sum_{j=1}^{T} x_j} = \theta^{s}\, (1-\theta)^{T-s},$$
where $s = \sum_{j=1}^{T} x_j$, so that:
$$f_{X|\Theta}(\mathbf{x}\,|\,\theta)\, \pi(\theta) = \frac{\Gamma(a+b)}{\Gamma(a)\, \Gamma(b)}\, \theta^{(a+s)-1} (1-\theta)^{(b+T-s)-1}. \qquad (3)$$
$$f_X(\mathbf{x}) = \int_0^1 f_{X|\Theta}(\mathbf{x}\,|\,\theta)\, \pi(\theta)\, d\theta = \frac{\Gamma(a+b)}{\Gamma(a)\, \Gamma(b)} \int_0^1 \theta^{(a+s)-1} (1-\theta)^{(b+T-s)-1}\, d\theta = \frac{\Gamma(a+b)\, \Gamma(a+s)\, \Gamma(b+T-s)}{\Gamma(a)\, \Gamma(b)\, \Gamma(a+b+T)}.$$
**: using $\int_0^1 x^{\alpha-1} (1-x)^{\beta-1}\, dx = B(\alpha, \beta) = \dfrac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha+\beta)}$.
Posterior density
Using (2):
$$\pi(\theta|\mathbf{x}) = \frac{f_{X|\Theta}(\mathbf{x}\,|\,\theta)\, \pi(\theta)}{f_X(\mathbf{x})} = \frac{\theta^s (1-\theta)^{T-s}\, \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}}{\frac{\Gamma(a+b)\, \Gamma(a+s)\, \Gamma(b+T-s)}{\Gamma(a)\, \Gamma(b)\, \Gamma(a+b+T)}} = \frac{\Gamma(a+b+T)}{\Gamma(a+s)\, \Gamma(b+T-s)}\, \theta^{(a+s)-1} (1-\theta)^{(b+T-s)-1},$$
i.e., the posterior is Beta$(a+s,\, b+T-s)$.
The Bayesian estimator is the posterior mean:
$$\hat{\theta}_B = E[\theta\,|\,X = \mathbf{x}] = \frac{a+s}{a+b+T} = \underbrace{\frac{T}{a+b+T}}_{\text{weight sample}} \cdot \underbrace{\frac{s}{T}}_{\text{sample mean}} + \underbrace{\frac{a+b}{a+b+T}}_{\text{weight prior}} \cdot \underbrace{\frac{a}{a+b}}_{\text{prior mean}}.$$
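The weighted-average form of the posterior mean can be verified numerically for any prior parameters and data. A minimal sketch (arbitrary numbers):

```python
a, b = 2.0, 3.0          # Beta(a, b) prior
T, s = 20, 7             # T Bernoulli trials, s successes

# Posterior mean of the Beta(a + s, b + T - s) posterior
posterior_mean = (a + s) / (a + b + T)

# Decomposition: weight_sample * sample mean + weight_prior * prior mean
weight_sample = T / (a + b + T)
weight_prior = (a + b) / (a + b + T)
combined = weight_sample * (s / T) + weight_prior * (a / (a + b))
```

As $T$ grows, the sample weight tends to 1: the data dominate the prior.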
Exercise Normal-Normal
Let $X_1, X_2, \ldots, X_T$ be i.i.d. Normal$(\theta, \sigma_2^2)$, i.e., $(X_i\,|\,\Theta = \theta) \sim \text{Normal}(\theta, \sigma_2^2)$.
Assume the prior density of $\Theta$ is Normal$(m, \sigma_1^2)$, so that:
$$\pi(\theta) = \frac{1}{\sigma_1 \sqrt{2\pi}} \exp\left(-\frac{(\theta - m)^2}{2\sigma_1^2}\right).$$
Question: Find the Bayesian estimator for $\theta$.
$$f_{X|\Theta}(\mathbf{x}\,|\,\theta) = \prod_{j=1}^{T} \frac{1}{\sigma_2 \sqrt{2\pi}} \exp\left(-\frac{(x_j - \theta)^2}{2\sigma_2^2}\right) = \frac{1}{\left(\sigma_2 \sqrt{2\pi}\right)^T} \exp\left(-\frac{\sum_{j=1}^{T} (x_j - \theta)^2}{2\sigma_2^2}\right).$$
1. Posterior density:
$$\pi(\theta|\mathbf{x}) \propto f_{X|\Theta}(\mathbf{x}|\theta)\, \pi(\theta) \propto \exp\left(-\frac{\sum_{j=1}^{T} (x_j - \theta)^2}{2\sigma_2^2}\right) \exp\left(-\frac{(\theta - m)^2}{2\sigma_1^2}\right)$$
$$= \exp\left(-\frac{\sum_{j=1}^{T} (x_j^2 + \theta^2 - 2\theta x_j)}{2\sigma_2^2} - \frac{\theta^2 + m^2 - 2\theta m}{2\sigma_1^2}\right)$$
$$= \exp\left(-\frac{\sigma_2^2\, (\theta^2 + m^2 - 2\theta m) + \sigma_1^2 \sum_{j=1}^{T} (x_j^2 + \theta^2 - 2\theta x_j)}{2\sigma_2^2 \sigma_1^2}\right)$$
$$\propto \exp\left(-\frac{(\sigma_2^2 + T\sigma_1^2)\, \theta^2 - 2\theta\, (m\sigma_2^2 + T\bar{x}\sigma_1^2)}{2\sigma_2^2 \sigma_1^2}\right) \propto \exp\left(-\frac{\left(\theta - \frac{m\sigma_2^2 + T\bar{x}\sigma_1^2}{\sigma_2^2 + T\sigma_1^2}\right)^2}{2\, \sigma_2^2 \sigma_1^2 / (\sigma_2^2 + T\sigma_1^2)}\right),$$
* and ** dropping factors that are constants given $\mathbf{x}$ (completing the square in $\theta$).
Thus $\theta\,|\,X$ is Normally distributed with mean:
$$\frac{m\sigma_2^2 + T\bar{x}\sigma_1^2}{\sigma_2^2 + T\sigma_1^2} = \frac{\frac{1}{\sigma_1^2}}{\frac{1}{\sigma_1^2} + \frac{T}{\sigma_2^2}}\, m + \frac{\frac{T}{\sigma_2^2}}{\frac{1}{\sigma_1^2} + \frac{T}{\sigma_2^2}}\, \bar{x},$$
and variance:
$$\left(\frac{1}{\sigma_1^2} + \frac{T}{\sigma_2^2}\right)^{-1} = \frac{\sigma_2^2 \sigma_1^2}{\sigma_2^2 + T\sigma_1^2}.$$
The Bayesian estimator (the posterior mean) is therefore:
$$\hat{\theta}_B = \frac{\frac{1}{\sigma_1^2}}{\frac{1}{\sigma_1^2} + \frac{T}{\sigma_2^2}}\, m + \frac{\frac{T}{\sigma_2^2}}{\frac{1}{\sigma_1^2} + \frac{T}{\sigma_2^2}}\, \bar{x}.$$
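The two forms of the posterior mean (direct ratio vs. precision weighting) and of the posterior variance agree, which can be checked numerically. A minimal sketch (arbitrary numbers):

```python
m, s1_sq = 1.0, 4.0      # prior mean and prior variance
s2_sq = 9.0              # known data variance
T, xbar = 12, 2.5        # sample size and sample mean

# Direct ratio form
mean_direct = (m * s2_sq + T * xbar * s1_sq) / (s2_sq + T * s1_sq)
var_direct = s2_sq * s1_sq / (s2_sq + T * s1_sq)

# Precision-weighting form: precisions 1/s1_sq (prior) and T/s2_sq (data)
prec_prior = 1.0 / s1_sq
prec_data = T / s2_sq
mean_weighted = (prec_prior * m + prec_data * xbar) / (prec_prior + prec_data)
var_prec = 1.0 / (prec_prior + prec_data)
```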
Chebyshev's Inequality
Chebyshev's inequality states that for any random variable $X$ with mean $\mu$ and variance $\sigma^2$, the following probability inequality holds for all $\epsilon > 0$:
$$\Pr(|X - \mu| > \epsilon) \le \frac{\sigma^2}{\epsilon^2}.$$
Note that this applies to all distributions, hence also non-symmetric ones! This implies that:
$$\Pr(X - \mu > \epsilon) \le \frac{\sigma^2}{\epsilon^2} \quad \text{and} \quad \Pr(X - \mu < -\epsilon) \le \frac{\sigma^2}{\epsilon^2}.$$
Interesting example: set $\epsilon = k\sigma$; then:
$$\Pr(|X - \mu| > k\sigma) \le \frac{1}{k^2}.$$
This provides us with an upper bound on the probability that $X$ deviates more than $k$ standard deviations from its mean.
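The bound can be illustrated by simulation on a deliberately non-symmetric distribution, e.g., the Exponential(1), whose mean and standard deviation are both 1. A minimal sketch (seeded; sample size is an arbitrary choice):

```python
import random

random.seed(1)
N = 100_000
# Exponential(1): mu = 1, sigma = 1 -- clearly non-symmetric
draws = [random.expovariate(1.0) for _ in range(N)]

mu, sigma, k = 1.0, 1.0, 2.0
frac = sum(1 for d in draws if abs(d - mu) > k * sigma) / N
# Chebyshev guarantees frac <= 1/k^2 = 0.25; the true value,
# Pr(X > 3) = e^(-3) ~ 0.0498, is far below the bound.
```

The bound is loose but universal: it needs only the mean and variance.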
Convergence concepts
Suppose $X_1, X_2, \ldots$ form a sequence of r.v.s. Example: $X_i$ is the sample variance using the first $i$ observations.
$X_n$ is said to converge almost surely (a.s.) to the random variable $X$ as $n \to \infty$ if and only if:
$$\Pr\left(\omega : X_n(\omega) \to X(\omega), \text{ as } n \to \infty\right) = 1,$$
and we write $X_n \xrightarrow{a.s.} X$, as $n \to \infty$.
Sometimes called strong convergence. It means that beyond some point in the sequence, the difference will always be less than some positive $\epsilon$, but that point is random.
OPTIONAL: Also expressed as: $\Pr(|X_n(\omega) - X(\omega)| > \epsilon, \text{ i.o.}) = 0$, where i.o. stands for infinitely often: $\Pr(A_n \text{ i.o.}) = \Pr(\limsup_{n\to\infty} A_n)$.
$X_n$ is said to converge in probability to $X$ if, for every $\epsilon > 0$, $\Pr(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$, and we write $X_n \xrightarrow{p} X$, as $n \to \infty$.
Difference between convergence in probability and convergence almost surely: $\Pr(|X_n - X| > \epsilon)$ goes to zero, instead of eventually equalling zero.
The Law of Large Numbers: $\bar{X}_n \to \mu$, as $n \to \infty$.
Monte Carlo integration: to approximate $I(g) = \int_0^1 g(x)\, dx$, draw $X_1, \ldots, X_n$ i.i.d. Uniform$(0,1)$ and compute:
$$\hat{I}_n(g) = \frac{1}{n} \sum_{k=1}^{n} g(X_k),$$
which converges to
$$E[g(X)] = \int_0^1 g(x) \cdot 1\, dx = \int_0^1 g(x)\, dx = I(g).$$
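A minimal sketch of this Monte Carlo estimator for $g(x) = x^2$ on $[0,1]$, whose exact integral is $1/3$ (seeded; the sample size and tolerance are arbitrary choices):

```python
import random

random.seed(123)
n = 200_000
g = lambda x: x * x

# I_hat_n(g) = (1/n) * sum g(X_k), with X_k ~ Uniform(0, 1)
estimate = sum(g(random.random()) for _ in range(n)) / n
# LLN: estimate -> integral of x^2 over [0, 1] = 1/3
```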
Then, the LLN tells us that the amount each person will end up paying becomes more predictable as the size of the group increases. In effect, this amount will become closer to $\mu$, the average loss each individual expects.
as $n \to \infty$.
This holds for all r.v.s with finite mean and variance, not only normal r.v.s!
Proof and a rewriting of the CLT: see the next slides.
$$\lim_{n \to \infty} \Pr\left(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le x\right) = \Phi(x),$$
i.e., the distribution of
$$Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}}$$
converges to the standard normal distribution.
1061/1074
2. Recall Sn =
n
P
n
P
Xi , the m.g.f. of Zn =
i=1
S
n
n
Xi
n
i=1
is
obtained by:
= MXi
n
* using MaX (t) = MX (a t) ** using Sn is the sum of n i.i.d.
random variables Xi , thus MPni=1 Xi (t) = MXn i (t).
Note that we only assumed that:
MXi (t) =f t, 2 ;
E [Xi ] =;
Var (Xi ) = 2 < ,
1062/1074
Taylor expansion of the m.g.f. of $Y_i = X_i - \mu$ around zero:
$$M(t) = M(0) + t\, M^{(1)}(t)\Big|_{t=0} + \frac{1}{2} t^2\, M^{(2)}(t)\Big|_{t=0} + O(t^3),$$
where $O(t^3)$ covers all terms $c_k t^k$, with $c_k \in \mathbb{R}$ for $k \ge 3$. Here:
$$M^{(1)}(t)\Big|_{t=0} = E[Y_i] = 0, \quad \text{and} \quad M^{(2)}(t)\Big|_{t=0} = E\left[Y_i^2\right] = Var(Y_i) + (E[Y_i])^2 = \sigma^2.$$
Now we can align the results from the previous two slides:
$$\lim_{n\to\infty} M_{Z_n}(t) = \lim_{n\to\infty} \left(M_{X_i - \mu}\left(\frac{t}{\sigma\sqrt{n}}\right)\right)^n = \lim_{n\to\infty} \left(\sum_{i=0}^{\infty} \frac{\left(t/(\sigma\sqrt{n})\right)^i}{i!}\, M^{(i)}(t)\Big|_{t=0}\right)^n$$
$$= \lim_{n\to\infty} \left(1 + 0 + \frac{\sigma^2}{2}\left(\frac{t}{\sigma\sqrt{n}}\right)^2 + O\left(\left(\frac{t}{\sigma\sqrt{n}}\right)^3\right)\right)^n = \lim_{n\to\infty} \exp\left(n\left(\frac{t^2}{2n} + O\left(\left(\frac{1}{n}\right)^{3/2}\right)\right)\right) = e^{t^2/2},$$
since $n \cdot O\left(\left(\frac{1}{n}\right)^{3/2}\right) = O\left(\left(\frac{1}{n}\right)^{1/2}\right) \to 0$ as $n \to \infty$.
* using $\log(1+a) = \sum_{i=1}^{\infty} \frac{(-1)^{i+1} a^i}{i} = a + O(a^2)$, with $a = \frac{t^2}{2n} + O\left(\left(\frac{1}{n}\right)^{3/2}\right)$.
$$\bar{X}_n \approx N\left(\mu,\, \sigma^2/n\right) \quad \Longrightarrow \quad n \bar{X}_n \approx N\left(n\mu,\, n\sigma^2\right).$$
$$0.9772 = \Pr\left(400\, \bar{X}_{400} \le 400 \cdot \$10 \text{ million} + 2 \cdot 20 \cdot \$25 \text{ million}\right) = \Pr\left(400\, \bar{X}_{400} \le \$5 \text{ billion}\right).$$
Thus, $\Pr\left(400\, \bar{X}_{400} > \$5 \text{ billion}\right) = 1 - 0.9772 = 0.0228$.
Normal approximation to the binomial: $X \approx N(np,\, npq)$.
Question: What is the probability that $X = 60$ if $X \sim \text{Bin}(1000, 0.06)$? Not in Binomial tables!
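For this question the exact p.m.f. value can be computed with `math.comb` and compared with the normal approximation $N(np, npq)$ using a continuity correction. A minimal sketch:

```python
import math

n, p, x = 1000, 0.06, 60
q = 1 - p

# Exact binomial probability Pr(X = 60)
exact = math.comb(n, x) * p ** x * q ** (n - x)

# Normal approximation with continuity correction:
# Pr(X = 60) ~ Phi((60.5 - np)/sd) - Phi((59.5 - np)/sd)
mu, sd = n * p, math.sqrt(n * p * q)
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
approx = Phi((x + 0.5 - mu) / sd) - Phi((x - 0.5 - mu) / sd)
```

Both values are near 0.053; the approximation is accurate because $np$ and $nq$ are large.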
[Figure: Binomial p.m.f. vs. normal approximation: Binomial(5, 0.1) with N(0.5, 0.45); Binomial(10, 0.1) with N(1, 0.9); Binomial(30, 0.1) with N(3, 2.7); Binomial(200, 0.1) with N(20, 18).]
1. We have the m.g.f. of $Z$: $M_Z(t) = \exp\left(t^2/2\right)$.
2. Next, we need to find the m.g.f. of $Z_n = \dfrac{X_n - n}{\sqrt{n}}$, where $X_n \sim \text{Poisson}(n)$. We know (week 2): $M_{X_n}(t) = \exp\left(n\left(e^t - 1\right)\right)$. Thus, using the calculation rules for m.g.f.s, we have:
$$M_{Z_n}(t) = M_{\frac{X_n - n}{\sqrt{n}}}(t) = \exp\left(-\sqrt{n}\, t\right) M_{X_n}\left(t/\sqrt{n}\right) = \exp\left(-\sqrt{n}\, t\right) \exp\left(n\left(e^{t/\sqrt{n}} - 1\right)\right) = \exp\left(-\sqrt{n}\, t + n\left(e^{t/\sqrt{n}} - 1\right)\right).$$
3. Find the limit of $M_{Z_n}(t)$ and prove that it equals $M_Z(t)$:
$$\lim_{n\to\infty} \log\left(M_{Z_n}(t)\right) = \lim_{n\to\infty} \left(-t\sqrt{n} + n\left(e^{t/\sqrt{n}} - 1\right)\right) = \lim_{n\to\infty} \left(-t\sqrt{n} + n\left(1 + \frac{t}{\sqrt{n}} + \frac{1}{2!}\left(\frac{t}{\sqrt{n}}\right)^2 + \frac{1}{3!}\left(\frac{t}{\sqrt{n}}\right)^3 + \ldots - 1\right)\right)$$
$$= \lim_{n\to\infty} \left(\frac{1}{2!}\, t^2 + O\left(\frac{1}{\sqrt{n}}\right)\right) = t^2/2,$$
so $\lim_{n\to\infty} M_{Z_n}(t) = \exp\left(t^2/2\right) = M_Z(t)$.
* using the exponential expansion: $e^a = \sum_{i=0}^{\infty} \frac{a^i}{i!}$, with $a = t/\sqrt{n}$.
[Figure: Poisson p.m.f. vs. normal approximation: Poisson(0.1) with N(0.1, 0.1); Poisson(1) with N(1, 1); Poisson(10) with N(10, 10); Poisson(100) with N(100, 100).]
Parameter estimators
Method of moments:
1. Equate the first $k$ sample moments to the corresponding $k$ population moments;
2. Equate the $k$ population moments to the parameters of the distribution;
3. Solve the resulting system of simultaneous equations.
Maximum likelihood:
1. Determine the likelihood function $L(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x})$;
2. Determine the log-likelihood function $\ell(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x}) = \log(L(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x}))$;
3. Equate the derivatives of $\ell(\theta_1, \theta_2, \ldots, \theta_k; \mathbf{x})$ w.r.t. $\theta_1, \theta_2, \ldots, \theta_k$ to zero (gives a global/local minimum/maximum);
4. Check whether the second derivative is negative (maximum), and check the boundary conditions.
Bayesian:
1. Determine the posterior density of the parameter given the data;
2. Take the expectation of the posterior (the Bayesian estimator under squared error loss).