8112 Notes

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 79

Class Notes of Stat 8112

1 Bayes estimators
Here are three methods of estimating parameters:
(1) MLE; (2) Moment Method; (3) Bayes Method.
H We want to estimate g(θ) ∈ R1 .
An example of Bayes argument: Let X ∼ F (x|θ), θ ∈ .
Suppose t(X) is an estimator and look at

MSEθ (t) = Eθ (t(X) − g(θ))2 .

The problem is MSEθ (t) depends on θ. So minimizing one point may costs at other points.
Bayes idea is to average MSEθ (t) over θ and then minimize over t’s. Thus we pretend to
have a distribution for θ, say, π, and look at

H(t) = E(t(X) − g(θ))2

where E now refers to the joint distribution of X and θ, that is


Z
E(t(X) − g(θ)) = (t(x) − g(θ))2 F (dx|θ)π(dθ).
2
(1.1)

Next pick t(X) to minimize H(t). The minimizer is called the Bayes estimator.

LEMMA 1.1 Suppose Z and W are real random variables defined on the same probability
space and H be the set of functions from R1 to R1 . Then

min E(Z − h(W ))2 = E(Z − E(Z|W ))2 .


h∈H

That is, the minimizer above is E(Z|W ).

Proof. Note that

E(Z − h(W ))2 = E(Z − E(Z|W ) + E(Z|W ) − h(W ))2


= E(Z − E(Z|W ))2 + E(E(Z|W ) − h(W ))2
+2E{(Z − E(Z|W ))(E(Z|W ) − h(W ))}.

Conditioning on W, we have that the cross term is zero. Thus

E(Z − h(W ))2 = E(Z − E(Z|W ))2 + E(E(Z|W ) − h(W ))2 .

1
Now E(Z − E(Z|W ))2 is fixed. The second term E(E(Z|W ) − h(W ))2 depends on h. So
choose h(W ) = E(Z|W ) to minimize E(Z − h(W ))2 . 

Now choose W = X, t = h and Z = g(θ). Then

d = E(g(θ)|X), which is the poste-


COROLLARY 1.1 The Bayes estimator in (1.1) is g(θ)
rior mean.

At a later occasion, we call E(g(θ)|X) the posterior mean of g(θ). Similarly, we have the
following multivariate analogue of the above corollary. The proof is in the same spirit of
Lemma 1.1. We omit it.

THEOREM 1 Let X ∈ Rm and X ∼ F (·|θ), θ ∈ .


H Let g(θ) = (g1 (θ), · · · , gk (θ)) ∈ Rk .

The MSE between an estimator t(X) and g(θ) is


Z
Ekt(X) − g(θ)k2 = kt(x) − g(θ)k2 F (dx|θ)π(dθ).

d = E(g(θ)|X) := (E(g1 (θ)|X), · · · , E(gk (θ)|X)).


Then Bayes estimator g(θ)

Definition. The distribution of θ (we thought there is such) , π(θ), is called a prior
distribution of θ. The conditional distribution of θ given X is called the posterior distribu-
tion.

Example. Let X1 , · · · , Xn be i.i.d. Ber(p) and p ∼ Beta(α, β). We know that the
density function of p is
Γ(α + β) α−1
f (p) = p (1 − p)β−1
Γ(α)Γ(β)
and E(p) = α/(α + β). The joint distribution of X1 , · · · , Xn and p is

f (x1 , · · · , xn , p) = f (x1 , · · · , xn |p) · π(p) = px (1 − p)n−x · C(α, β)pα−1 (1 − p)β−1


= C(α, β)px+α−1 (1 − p)n+β−x−1
P
where x = i xi . The marginal density of (X1 , · · · , Xn ) is
Z 1
f (x1 , · · · , xn ) = C(α, β) px+α−1 (1 − p)n+β−x−1 dp = C(α, β) · B(x + α, n + β − x).
0

Therefore,
f (x1 , · · · , xn , p)
f (p|x1 , · · · , xn ) = = D(x, α, β) px+α−1 (1 − p)n+β−x−1 .
f (x1 , · · · , xn )

2
This says that p|x ∼ Beta(x + α, n + β − x). Thus the Bayes estimator is
X +α
p̂ = E(p|X) = ,
n+α+β
Pn
where X = i=1 Xi . One can easily check that
np + α
E(p̂) = .
n+α+β
So p̂ is biased unless α = β = 0, which is not allowed in the prior distribution.
The next lemma is useful

LEMMA 1.2 The posterior distribution depends only on sufficient statistics. Let t(X) be
a sufficient statistic. Then E(g(θ)|X) = E(g(θ)|t(X)).

Proof. The joint density function of X and θ is p(x|θ)π(θ) where π(θ) is the prior distri-
bution of θ. Let t(X) be a sufficient statistic. The by the factorization theorem

p(x|θ) = q(t(x)|θ)h(x).

So the joint distribution function is q(t(x)|θ)h(x)π(θ). It follows that the conditional distri-
bution of θ given X is
q(t(x)|θ)h(x)π(θ) q(t(x)|θ)π(θ)
p(θ|x) = R =R
q(t(x)|ξ)h(x)π(ξ) dξ q(t(x)|ξ)π(ξ) dξ
which is a function of t(x) and θ.
By the above conclusion, E(g(θ)|X) = l(t(X)) for some function l(·). Take conditional
expectation for both sides with respect to the algebra σ(t(X)). Note that σ(t(X)) ⊂ σ(X).
By the Tower theorem, l(t(X)) = E(g(θ)|t(X)). This proves the second conclusion. 

Example. Let X1 , · · · , Xn be iid from N (µ, 1) and µ ∼ π(µ) = N (µ0 , τ0 ). We know


that X̄(∼ N (µ, 1/n)) is a sufficient statistic. The Bayes estimator is µ̂ = E(µ|X̄). We need
to calculate the joint distribution of (µ, X̄)T first.
It is not difficult to see that (µ, X̄)T is bivariate normal. We know that Eµ = µ0 , V ar(µ) =
τ0 , E(X̄) = E(E(X̄|µ)) = E(µ) = µ0 . Now
1
V ar(X̄) = E(X̄ − µ + µ − µ0 )2 = E((1/n) + (µ − µ0 )2 ) = + τ0 ;
n
Cov(X̄, µ) = E(E((X̄ − µ0 )(µ − µ0 )|µ)) = E(µ − µ0 )2 = τ0 .

Thus,
! ! !!
X̄ µ0 (1/n) + τ0 τ0
∼N , .
µ µ0 τ0 τ0

3
Recall the following fact: if
! ! !!
Y1 µ1 Σ11 Σ12
∼N , ,
Y2 µ2 Σ21 Σ22

then

Y1 |Y2 ∼ N µ1 + Σ12 Σ−1
22 (Y2 − µ2 ), Σ11·2

where Σ11·2 = Σ11 − Σ12 Σ−1


22 Σ21 . It follows that
 
τ0 τ02
µ|X̄ ∼ N µ0 + (X̄ − µ0 ), τ0 − .
(1/n) + τ0 (1/n) + τ0

Thus the Bayes estimator is


τ0 τ0 µ0
µ̂ = µ0 + (X̄ − µ0 ) = X̄ + .
(1/n) + τ0 (1/n) + τ0 1 + nτ0

One can see that µ̂ → µ0 as τ0 → 0 and µ̂ → X̄ as τ0 → ∞.

Example. Let X ∼ Np (θ, Ip ) and π(θ) ∼ Np (0, τ Ip ), τ > 0. The Bayes estimator is the
posterior mean E(θ|X). We claim that
! ! !!
X 0 (1 + τ )Ip τ Ip
∼N , .
θ 0 τ Ip τ Ip

Indeed, by the prior we know that Eθ = 0 and Cov(θ)=τ Ip . By conditioning argument one
can verify that E(Xθ ′ ) = τ Ip and E(XX ′ ) = (1 + τ )Ip . So by conditional distribution of
normal random variables,
 
τ τ
θ|X ∼ N X, Ip .
1+τ 1+τ

So the Bayes estimator is t0 (X) = τ X/(1 + τ ).


The usual MVUE is t1 (X) = X. It follows that

MSEt1 (θ) = EkX − θk2 = p

which doesn’t depend on θ. Now


τ τ −1
MSEt0 (θ) = Eθ k X − θk2 = Eθ k (X − θ) + θk2
1+τ 1+τ 1+τ
 2
τ 1
= p+ kθk2 (1.2)
1+τ (1 + τ )2

4
which goes to p as τ → ∞. What happens to the prior when τ → ∞? It “converges
to” the Lebesgue measure, or “uniform distribution over the real line” which generates
no information (recall the information of X ∼ Np (0, (1 + τ )Ip ) is p/(1 + τ ) and θ̄i =
uniform over [−τ, τ ] with θ̄i , 1 ≤ i ≤ p i.i.d. is p/(2τ 2 ). Both go to zero and the later is
faster than the former). This explains why the MSE of the former and the limiting MSE
are identical.
Now let

M = inf sup Eθ kt(X) − θk2


t θ

called the Minimax risk and any estimator t∗ with

M = sup Eθ kt∗ (X) − θk2


θ

is called a Minimax estimator.

THEOREM 2 For any p ≥ 1, t0 (X) = X is a minimax.

Proof. First,

M = inf sup Eθ kt(X) − θk2 ≤ sup Eθ kX − θk2 = p.


t θ θ

Second
Z  2
2 2 τ
M = inf sup Eθ kt(X) − θk ≥ inf Eθ kt(X) − θk πτ (dθ) ≥ p
t θ t 1+τ
for any τ > 0 where πτ (θ) = N (0, τ Ip ), where the last step is from (1.2). Then M ≥ p by
letting p → ∞. The above says that M = supθ Eθ kX − θk2 = p. 

Question. Does there exist another estimator t1 (X) such that

Eθ kt1 (X) − θk2 ≤ Eθ kX − θk2

for all θ, and the strict inequality holds for some θ? If so, we say t0 (X) = X is inadmissible
(because it can be beaten for any θ by some estimator t, and strictly beaten at some θ).
Here is the answer:
(i) For p = 1, there is not. This is proved by Blyth in 1951.
(ii) For p = 2, there is not. This result was shown by Stein in 1961.
(iii) When p ≥ 3, Stein (1956) shows there is such estimator, which is called James-Stein
estimator.


Recall the density function of N (µ, σ 2 ) is φ(x) = ( 2πσ)−1 exp(−(x − µ)2 /(2σ 2 )).

5
LEMMA 1.3 (Stein’s lemma). Let Y ∼ N (µ, σ 2 ) and g(y) be a function such that g(b) −
Rb
g(a) = a g′ (y) dy for any a and b and some function g′ (y). If E|g′ (Y )| < ∞. Then

Eg(Y )(Y − µ) = σ 2 Eg′ (Y ).



Proof. Let φ(y) = (1/ 2πσ) exp(−(y − µ)2 /(2σ 2 )), the density of N (µ, σ 2 ). Then φ′ (y) =
R∞ R y z−µ
−φ(y)(y − µ)/σ 2 . Thus, φ(y) = y z−µ
σ2 φ(z) dz = − −∞ σ2 φ(z) dz. We then have that
Z ∞
Eg′ (Y ) = g′ (y)φ(y) dy
−∞
Z ∞ Z ∞  Z 0 Z y 
′ z−µ ′ z−µ
= g (y) φ(z) dz dy − g (y) φ(z) dz dy
0 y σ2 −∞ −∞ σ
2
Z ∞ Z z  Z 0 Z 0 
z−µ ′ z−µ ′
= φ(z) g (y) dy dz − φ(z) g (y) dy dz
0 σ2 0 −∞ σ
2
z
Z ∞ Z 0 
z−µ
= + φ(z)(g(z) − g(0)) dz
0 −∞ σ2
Z ∞
1 1
= 2
(z − µ)g(z)φ(z) dz = 2 E(Y − µ)g(Y ).
σ −∞ σ

The Fubini’s theorem is used in the third step; the mean of a centered normal random
variable is zero is used in the fifth step. 

Remark 1. Suppose two functions g(x) and h(x) are given such that g(b) − g(a) =
Rb
a h(x) dx for any a < b. This does not mean that g′ (x) = h(x) for all x. Actually, in this
case, g(x) is differentiable almost everywhere and g′ (x) = h(x) a.s. under Lebesgue measure.
For example, let h(x) = 1 if x is an irrational number, and h(x) = 0 if x is rational. Then
Rx
g(x) = x = 0 h(t) dt for any x ∈ R. We know in this case that g′ (x) = h(x) a.s. The
following are true from real analysis:
Fundamental Theorem of Calculus. If f (x) is absolutely continuous on [a, b], then f (x)
is differentiable almost everywhere and
Z x
f (x) − f (a) = f ′ (t) dt, x ∈ [a, b].
a

Another fact. Let f (x) be differentiable everywhere on [a, b] and f ′ (x) is integrable over
[a, b]. Then
Z x
f (x) − f (a) = f ′ (t) dt x ∈ [a, b].
a

Remark 2. What drives the Stein’s lemma is the formula of integration by parts:

6
suppose g(x) is differentiable then
Z Z ∞
2
Eg(Y )(Y − µ) = g(y)(y − µ)φ(y) dy = −σ g(y)φ′ (y) dy
R −∞
Z ∞
= σ 2 lim g(y)φ(y) − σ 2 lim g(y)φ(y) + σ 2 g′ (y)φ(y) dy.
y→−∞ y→+∞ −∞

It is reasonable to assume the two limits are zero. The last term is exactly σ 2 Eg′ (Y ).

THEOREM 3 Let X ∼ Np (θ, Ip ) for some p ≥ 3. Define


 
p−2
δc (X) = 1 − c X.
kXk2
Then

2 c(2 − c)
2
Ekδc (X) − θk = p − (p − 2) E .
kXk2

Proof. Let gi (x) = c(p − 2)xi /kxk2 and g(x) = (g1 (x), · · · , gp (x)) for x = (x1 , x2 , · · · , xp ).
Then g(x) = c(p − 2)x/kxk2 . It follows that

Ekδc (X) − θk2 = Ek(X − θ) − g(X)k2 = EkX − θk2 + Ekg(X)k2 − 2EhX − θ, g(X)i
p
X
= p + Ekg(X)k2 − 2 E(Xi − θi )gi (X).
i=1

Since X1 , · · · , Xp are independent, by conditioning and Stein’s lemma,


 
∂gi
E(Xi − θi )gi (X) = E [E{(Xi − θi )gi (X)|(Xj , j 6= i)} ] = E (X) .
∂xi
It is easy to calculate that
Pp
∂gi x2 − 2x2 kxk2 − 2x2i
Pp k 2 2 i = c(p − 2)
(x) = c(p − 2) k=1 .
∂xi ( k=1 xk ) kxk4
Thus,
p
X p
X  
kXk2 − 2X 2 i 2 1
E(Xi − θi )gi (X) = c(p − 2)E = c(p − 2) E .
kXk4 kXk2
i=1 i=1

On the other hand, Ekg(X)k2 = c2 (p − 2)2 E(1/kXk2 ). Combine all above together, the
conclusion follows. 

We have to show that EkXk−2 < ∞ for p ≥ 3. Indeed,


    Z Z
1 1 1
E ≤ 1+E I(kXk ≤ 1) ≤ 1 + · · · dx1 · · · dxp
kXk2 kXk2 kxk≤1 kxk2

7
since the density of a normal distribution is bounded by one. By the polar-transformation,
the last integral is equal to
Z 1 Z π Z π Z 1
r p−1 J(θ1 , · · · , θp ) p−1
dr dθ1 · · · dθp−1 2
dr ≤ Cπ r p−3 dr < ∞
0 0 0 r 0

if the integer p ≥ 3, where J(θ1 , · · · , θp ) is a multivariate polynomial of sin θi and cos θi , i =


1, 2, · · · , p, bounded by a constant C.
When 0 ≤ c ≤ 2, the term c(2 − c) ≥ 0 and attains the maximum at c = 1. It follows
that

COROLLARY 1.2 The estimator δc (X) dominates X provided 0 < c < 2 and p ≥ 3.

When c = 1, the estimator δ1 (X) is called the James-Stein estimator.

COROLLARY 1.3 The James-Stein estimator δ1 dominates all estimators δc with c 6= 1.

2 Likelihood Ratio Test


We consider test H0 : θ ∈
H 0 vs Ha : θ ∈
H a , where
H 0 and
H a are disjoint. The

probability density or probability mass function of a random observation is f (x|θ). The


likelihood ratio test statistic is
supH0 f (x|θ)
λ(x) = .
supH0 ∪Ha f (x|θ)

When H0 is true, the value of λ(x) tends to be large. So the rejection region is

{λ(x) ≤ c}.

Example. The t-test as a likelihood ratio test.


Let X1 , · · · , Xn be i.i.d. N (µ, σ 2 ) with both parameters unknown. Test H0 : µ = µ0 vs
Ha : µ 6= µ0 . The joint density function of Xi ’s is
 n n
!
1 2 −n/2 1 X
f (x1 , · · · , xn |µ, σ) = √ (σ ) exp − 2 (xi − µ)2 . (2.1)
2π 2σ
i=1

This one has maximum at (X̄, V ) where


n
1X
V = (Xi − X̄)2 .
n
i=1

8
The maximum is
 
1 X
max f (x1 , · · · , xn |µ, σ) = CV −n/2 exp − (Xi − X̄)2 = CV −n/2 e−n/2
µ,σ 2V

Where C = (2π)−n/2 . When µ = µ0 , f (x1 , · · · , xn |µ, σ) has maximum at


n
1X
V0 = (Xi − µ0 )2 .
n
i=1

And the corresponding maximum value is


 
−n/2 1 X 2 −n/2 −n/2
max f (x1 , · · · , xn |µ0 , σ) = CV0 exp − (Xi − µ0 ) = CV0 e .
σ 2V0
So the likelihood ratio is
 −n/2  −n/2
V0 V + (X̄ − µ0 )2
Λ= = .
V V
So the rejection region is
 
X̄ − µ0
{Λ ≤ a} = ≥b
S
P
for some constant b, where S = (1/(n − 1)) ni=1 (Xi − X̄)2 . This yields exactly a student t
test.

Example. Let X1 , · · · , Xn be a random sample from



e−(x−θ) , if x ≥ θ;
f (x|θ) = (2.2)
0, otherwise.

The joint density function is


 P
e− xi +nθ , if θ ≤ x ;
(1)
f (x|θ) =
0, otherwise.

Consider the test H0 : θ ≤ θ0 versus Ha : θ > θ0 . It is easy to see


P
max f (x|θ) = e− xi +nx(1)
,
θ∈R

which is achieved at θ = x(1) . Now consider maxθ≤θ0 f (x|θ). If θ0 ≥ x(1) , the two maxima
are identical; If θ0 < x(1) , it is achieved at θ = θ0 that yields the maximum value
 P
e− xi +nx(1) , if θ0 ≥ x ;
(1)
max f (x|θ) =
θ≤θ0 e− xi +nθ0 , if θ < x .
P
0 (1)

9
This gives the likelihood ratio

1, if x ≤ θ0 ;
(1)
λ(x) =
en(θ0 −x(1) ) , if x > θ0 .
(1)

The rejection region is {λ(x) ≤ a} which is equivalent to {x(1) ≥ b} for some b.

Again, the general form of a hypothesis test is H0 : θ ∈


H 0 vs Ha : θ ∈
H a.

THEOREM 4 Suppose X has pdf or pmf f (x|θ) and T (X) is sufficient for θ. Let λ(x) and
λ∗ (y) be the LRT statistics obtained from X and Y = T (X). Then λ(x) = λ∗ (T (x)) for any
x.

Proof. Suppose the pdf or pmf of T (X) is g(t|θ). To be simple, suppose X is discrete, that
is, Pθ (T (X) = t) = g(t|θ). Then

f (x|θ) = Pθ (X = x, T (X) = T (x))


= Pθ (T (X) = T (x))Pθ (X = x|T (X) = T (x))
= g(T (x)|θ)h(x). (2.3)

by the definition of sufficiency, where h(x) is a function depending only on x (not θ).
Evidently,
supθ∈H0 f (x|θ) supθ∈H0 g(t|θ)
λ(x) = and λ∗ (t) = . (2.4)
supθ∈H0 ∪Ha f (x|θ) supθ∈H0 ∪Ha g(t|θ)

By (2.3) and (2.4)


supθ∈H0 f (T (x)|θ)
λ(x) = = λ∗ (T (x)). 
supθ∈H0 ∪Ha f (T (x)|θ)

This theorem says that the simplification of λ(x) is eventually relevant to x through a
sufficient statistic for the parameter.

3 Evaluating tests
Consider the test H0 : θ ∈
H 0 versus Ha : θ ∈
H a , where
H 0 and
H a are disjoint. When

making decisions, there are two types of mistakes:


Type I error: Reject H0 when it is true.
Type II error: Do not reject H0 when it is false.

10
Not reject H0 reject H0
H0 correct type I error
H1 Type II error correct

Suppose R denotes the rejection region for the test. The chances of making the two
H 0 and Pθ (X ∈ Rc ), θ ∈
types of mistakes are Pθ (X ∈ R) for θ ∈ H a . So


probability of a Type I error, if θ ∈
H 0;
Pθ (X ∈ R) =
1 − probability of a Type II error, if θ ∈
H a.

DEFINITION 3.1 The power function of the test above with the rejection region R is a
function of θ defined by β(θ) = Pθ (X ∈ R).

DEFINITION 3.2 For α ∈ [0, 1], a test with power function β(θ) is a size α test if
H 0 β(θ) = α.
supθ∈

DEFINITION 3.3 For α ∈ [0, 1], a test with power function β(θ) is a level α test if
H 0 β(θ) ≤ α.
supθ∈

There are some differences between the above two definitions when dealing with complicated
models.
Example. Let X1 , · · · , Xn be from N (µ, σ 2 ) with µ unknown but known σ. Look at the
test H0 : µ ≤ µ0 vs Ha : µ > µ0 . It is easy to check that by likelihood rato test, the rejection
region is
 
X̄ − µ0
√ >c .
σ/ n

So the power function is


   
X̄ − µ0 X̄ − µ µ0 − µ
β(µ) = Pθ √ >c = Pθ √ >c+ √
σ/ n σ/ n σ/ n
 
µ0 − µ
= 1−Φ c+ √ .
σ/ n

It is easy to see that

lim β(µ) = 1 lim β(µ) = 0 and β(µ0 ) = a if P (Z > c) = a,


µ→∞ µ→−∞

where Z ∼ N (0, 1).

11
H 0 Pθ (λ(X)
Example. In general, a size α LRT is constructed by choosing c such that supθ∈
≤ c) = α. Let X1 , · · · , Xn be a random sample from N (θ, 1). Test H0 : θ = θ0 vs Ha : θ 6= θ0 .
Here one can easily see that
 n 
λ(x) = exp − (x̄ − θ0 )2
2

So {λ(X) ≤ c} = {|X̄−θ0 | ≥ d}. For α given, d can be chosen such that Pθ0 (|Z| > d n) = α.
For the example in (2.2), the test is H0 : θ ≤ θ0 vs θ > θ0 and the rejection region
{λ(X) ≤ c} = {X(1) > d}. For size α, first choose d such that

α = Pθ0 (X(1) > d) = e−n(d−θ0 ) .

Thus, d = θ0 − (log α)/n. Now

sup Pθ (X(1) > d) = sup Pθ (X(1) > d) = sup e−n(d−θ) = e−n(d−θ0 ) = α.


θ∈
H0 θ≤θ0 θ≤θ0

4 Most Powerful Tests


Consider a hypothesis test H0 : θ ∈
H 0 vs H1 : θ ∈
H a . There are a lot of decision rules.

We plan to select the best one. Since we are not able to reduce the two types of errors at
same time (the sum of chances of two errors and two correct decisions is one). We fix the
the chance of the first type of error and maximize the powers among all decision rules.

DEFINITION 4.1 Let C be a class of tests for testing H0 : θ ∈


H 0 vs H1 : θ ∈
H a . A test

in class C, with power function β(θ), is a uniformly most powerful (UMP) class C test if
β(θ) ≥ β ′ (θ) for every θ ∈
H a and every β ′ (θ) that is a power function of a test in class C.

In this section we only consider C as the set of tests such that the level α is fixed, i.e.,
H 0 β(θ) ≤ α}.
{β(θ); supθ∈
The following gives a way to obtain UMP tests.

THEOREM 5 (Neyman-Pearson Lemma). Let X be a sample with pdf or pmf f (x|θ).


Consider testing H0 : θ = θ0 vs H1 : θ = θ1 , using a test with rejection region R that
satisfies

x ∈ R if f (x|θ1 ) > kf (x|θ0 ) and x ∈ Rc if f (x|θ1 ) < kf (x|θ0 ) (4.1)

for some k ≥ 0, and

α = Pθ0 (X ∈ R). (4.2)

12
Then
(i) (Sufficiency). Any test satisfies (4.1) and (4.2) is a UMP level α test.
(ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with k > 0. Then
every UMP level α test is a size α test (satisfies (4.2)); Every UMP level α test satisfies
(4.1) except on a set A satisfying Pθ (X ∈ A) = 0 for θ = θ0 and θ1 .

Proof. We will prove the theorem only for the case that f (x|θ) is continuous.
Since
H 0 consists of only one point, the level α test and size α test are equivalent.

(i) Let R satisfy (4.1) and (4.2) and φ(x) = I(x ∈ R) with power function β(θ). Now take
another rejection region R′ with β ′ (θ) = Pθ (X ∈ R′ ) and β ′ (θ0 ) ≤ α. Set φ′ (x) = I(x ∈ R′ ).
It suffices to show that

β(θ1 ) ≥ β ′ (θ1 ). (4.3)

Actually, (φ(x) − φ′ (x))(f (x|θ1 ) − kf (x|θ0 )) ≥ 0 (check it based on if f (x|θ1 ) > kf (x|θ0 ) or
not). Then
Z
0 ≤ (φ(x) − φ′ (x))(f (x|θ1 ) − kf (x|θ0 )) dx = β(θ1 ) − β ′ (θ1 ) − k(β(θ0 ) − β ′ (θ0 )). (4.4)

Thus the assertion (4.3) follows from (4.2) and that k ≥ 0.


(ii) Suppose the test satisfying (4.1) and (4.2) has power function β(θ). By (i) the test is
UMP. For the second UMP level α test, say, it has power function β ′ (θ). Then β(θ1 ) = β ′ (θ1 ).
Since k > 0, (4.4) implies β ′ (θ0 ) ≥ β(θ0 ) = α. So β ′ (θ0 ) = α.
Given a UMP level α test with power function β ′ (θ). By the proved (ii), β(θ1 ) = β ′ (θ1 )
and β(θ0 ) = β ′ (θ0 ) = α. Then (4.4) implies that the expectation in there is zero. Being
always nonnegative, (φ(x) − φ′ (x))(f (x|θ1 ) − kf (x|θ0 )) = 0 except on a set A of Lebesgue
measure zero, which leads to (4.1) (by considering f (x|θ1 ) > kf (x|θ0 ) or not). Since X has
density, Pθ (X ∈ A) = 0 for θ = θ0 and θ = θ1 . 

Example. Let X ∼ Bin(2, θ). Consider H0 : θ = 1/2 vs Ha : θ = 3/4. Note that


   x  2−x    x  2−x
2 3 1 2 1 1
f (x|θ1 ) = >k = kf (x|θ0 )
x 4 4 x 2 2

is equivalent to {3x > 4k}.


(i) The case k ≥ 9/4 and the case k < 1/4 correspond to R = ∅ and Ω, respectively.
The according UMP level α = 0 and α = 1.
(ii) If 1/4 ≤ k < 3/4, then R = {1, 2}, and α = P (Bin(2, 1/2) = 1 or 2) = 3/4.
(iii) If 3/4 ≤ k < 9/4, then R = {2}. The level α = P (Bin(2, 1/2) = 2) = 1/4.

13
This example says that if we firmly want a size α test, then there are only two such α’s:
α = 1/4 and α = 3/4.

Example. Let X1 , · · · , Xn be a random sample from N (θ, σ 2 ) with σ known. Look


at the test H0 : θ = θ0 vs H1 : θ = θ1 , where θ0 > θ1 . The density function f (x|θ, σ) =

(1/ 2π σ) exp(−(x − θ)2 /(2σ 2 )). Now f (x|θ1 ) > kf (θ0 ) is equivalent to that R = {X̄ < c}.

Set α = Pθ0 (X̄ < c). One can check easily that c = θ0 + (σzα / n). So the UMP rejection
region is
 
σ
R= X̄ < θ0 + √ zα . 
n
LEMMA 4.1 Let X be a random variable. Let also f (x) and g(x) be non-decreasing real
functions. Then

E(f (X)g(X)) ≥ Ef (X) · Eg(X)

provided all the above three expectations are finite.

Proof. Let the µ be the distribution of X. Noting that (f (x) − f (y))(g(x) − g(y)) ≥ 0 for
any x and y. Then
ZZ
0 ≤ (f (x) − f (y))(g(x) − g(y)) µ(dx) µ(dy)
Z Z
= 2 f (x)g(x) µ(dx) − 2 f (x)g(y) µ(dx) µ(dy)
= 2[E(f (X)g(X)) − Ef (X) · Eg(X)].

The desired result follows. 

Let f (x|θ), θ ∈ R be a pmf or pdf with common support, that is, the set {x : f (x|θ) > 0}
is identical for every θ. This family of distributions is said to have monotone likelihood ratio
(MLR) with respect to a statistic T (x) if
fθ2 (x)
on the support is a non-decreasing function in T (x) for any θ2 > θ1 . (4.5)
fθ1 (x)
LEMMA 4.2 Let X be a random vector with X ∼ f (x|θ), θ ∈ R, where f (x|θ) is a pmf
or pdf with a common support. Suppose f (x|θ) has MLR with a statistic T (x). Let


 1 when T (x) > C,


φ(x) = γ when T (x) = C, (4.6)




0 when T (x) < C

14
for some C ∈ R, where γ ∈ [0, 1]. Then

Eθ1 φ(X) ≤ Eθ2 φ(X)

for any θ1 < θ2 .

Proof. Thanks to MLT, for any θ1 < θ2 there exists a nondecreasing function g(t) such
that f (x|θ2 )/f (x|θ1 ) = g(T (x)), where g(t) is a nondecreasing function. Thus
Z
f (x|θ2 )
Eθ2 φ(X) = φ(x) f (x|θ1 ) dx = Eθ1 (g(T (X)) · h(T (X))) ,
f (x|θ1 )
where


1 when x > C,


h(x) = γ when x = C,




0 when x < C

Since both g(x) and h(x) are non-decreasing, by the positive correlation inequality,

Eθ1 (g(T (X)) · h(T (X))) ≥ Eθ1 g(T (X)) · Eθ1 h(T (X)) = Eθ1 φ(X)

since Eθ1 g(T (X)) = 1. 

In (4.6), if T (X) > C reject H0 ; if T (X) < C, don’t reject H0 ; if T (X) = C, with
chance γ reject H0 and with chance 1 − γ don’t reject H0 . The last case is equivalent to
flipping a coin with chance γ for head and 1 − γ for tail. If the head occurs, reject H0 ;
otherwise, don’t reject H0 . This is a randomization. The purpose is to make the test UMP.
The randomization occurs only when T (X) is discrete.

THEOREM 6 Consider testing H0 : θ ≤ θ0 vs H1 : θ > θ0 . Let X ∼ f (x|θ), θ ∈ R, where


f (x|θ) is a pmf or pdf with a common support. Also let T be a sufficient statistic for θ. If
f (x|θ) has MLR w.r.t. T (x), then φ(X) in (4.6) is a UMP level α test, where α = Eθ0 φ(X).

Proof. Let β(θ) = Eθ φ(X). Then by the previous lemma, supθ≤θ0 β(θ) = Eθ0 φ(X) = α.
We need to show β(θ1 ) ≥ β ′ (θ1 ) for any θ1 > θ0 and β ′ such that supθ≤θ0 β ′ (θ) ≤ α.
Consider test H0 : θ = θ0 vs Ha : θ = θ1 . We know that

α = Pθ0 (X ∈ R) = Eθ0 φ(X) and β ′ (θ1 ) ≤ α. (4.7)

Again, we use the same notation as in Lemma 4.2, f (x|θ1 )/f (x|θ0 ) = g(T (x)), where
g(t) is a non-decreasing function in t. Take k = g(C) in the Neyman-Pearson Lemma 5.

15
If f (x|θ1 )/f (x|θ0 ) > k, since g(t) is non-decreasing, then T (X) > C, which is the same
as saying to reject H0 by the functional form of φ(x) as in (4.6). If f (x|θ1 )/f (x|θ0 ) < k
then T (X) < C, that is, not to reject H0 . By Neyman-Pearson lemma and (4.7), φ(x)
corresponds to a UMP level-α test. Thus, β(θ1 ) ≥ β ′ (θ1 ). 

16
5 Probability convergence
Let (Ω, F, P ) be a probability space. Let Also {X, Xn ; n ≥ 1} be a sequence of ran-
dom variables defined on this probability space. We say Xn → X in probability or Xn is
consistent with X if

P (|Xn − X| ≥ ǫ) → 0 as n → ∞

for any ǫ > 0.


Example. Let {Xn ; n ≥ 1} be a sequence of independent random variables with mean
P
zero and variance one. Set Sn = ni=1 Xi , n ≥ 1. Then Sn /n → 0 in probability as n → ∞.
In fact,
 
Sn E(Sn )2 1

P ≥ǫ ≤ = 2 →0
n nǫ 2 nǫ
by Chebyshev’s inequality.
Recall X and Xn , n ≥ 1 are real measurable functions defined on (Ω, F). We say Xn
converges to X almost surely if

P (ω ∈ Ω : lim Xn (ω) = X(ω)) = 1.


n→∞

For a sequence of events En ; n ≥ 1 in F. Define


∞ [
\
{En ; i.o.} = Ek .
n=1 k≥n

Here “i.o.” means “infinitely often”. If ω ∈ {En ; i.o.}, then there are infinitely many n
/ {En ; i.o.}, or ω ∈ {En ; i.o.}c , then only finitely many
such that ω ∈ En . Conversely, if ω ∈
En contain ω.
In proving almost convergence, the following so-called Borel-Cantelli lemma is very
powerful.

LEMMA 5.1 (Borel-Cantelli Lemma). Let En ; n ≥ 1 be a sequence of events in F. If



X
P (En ) < ∞ (5.1)
n=1

then P (En occurs only finite times) = P ({En , i.o.}c ) = 1.


P∞
Proof. Recall P (En ) = E(IEn ). Let X(ω) = n=1 IEn (ω). Then

X ∞
X
EX = E IEn = P (En ) < ∞.
i=1 i=1

17
Thus, X < ∞ a.s. 

LEMMA 5.2 (Second Borel-Cantelli Lemma). Let En ; n ≥ 1 be a sequence of independent


events in F. If

X
P (En ) = ∞ (5.2)
n=1

then P (En , i.o.) = 1.

Proof. First,
[\
{En , i.o.}c = Ekc .
n k≥n

With pk denoting P (Ek ), we have that


   
\ Y X
P Ekc  = (1 − pk ) ≤ exp − pk  = 0
k≥n k≥n k≥n

as n → ∞, where we used the inequality that 1 − x ≤ e−x for any x ∈ R. Thus,


 
X \
P ({En , i.o.}c ) ≤ P Ekc  = 0. 
n≥1 k≥n

As an application, we have that


Example. Let {Xn ; n ≥ 1} be a sequence of i.i.d. random variables with mean µ and
finite fourth moment. Then
Sn
→ µ a.s.
n
as n → ∞.
Proof. W.L.O.G., assume µ = 0. Let ǫ > 0. Then by Chebyshev’s inequality
 
Sn E(Sn )4
P ≥ ǫ ≤ .
n n 4 ǫ4
Now, by independence,
 
X   X   X
4 4
E(Sn4 ) = E  4
Xk + 2 2
Xi Xj + Xi Xj3 
2 2
k 1≤i6=j≤n 1≤i6=j=n

= nE(X14 ) + 6n(n − 1)E(X12 ).

18
 P 
Thus, P Snn ≥ ǫ = O(n−2 ). Consequently, n≥1 P Snn ≥ ǫ < ∞ for any ǫ > 0. This
means that, with probability one, |Sn /n| ≤ ǫ for sufficiently large n. Therefore,

|Sn |
lim sup ≤ ǫ a.s.
n n
for any ǫ > 0. Let ǫ ↓ 0. The desired conclusion follows. 

By using a truncation method, we have the following Kolmogorov’s Strong Law of Large
Numbers.

THEOREM 7 Let {Xn ; n ≥ 1} be a sequence of i.i.d. random variables with mean µ.


Then
Sn
→ µ a.s.
n
as n → ∞.

There are many variations of the above strong law of large numbers. Among them,
Etemadi showed that Theorem 7 holds if {Xn ; n ≥ 1} is a sequence of pairwise independent
and identically distributed random variables with mean µ.
The next proposition tells the relationship between convergence in probability and con-
vergence almost surely.

PROPOSITION 5.1 Let Xn and X be random variables taking values in Rk . If Xn →


X a.s. as n → ∞, then Xn converges to X in probability; second, Xn converges to X in
probability iff for any subsequence of {Xn } there exists a further subsequence, say, {Xnk ; k ≥
1} such that Xnk → X a.s. as k → ∞.

Proof. The joint almost sure convergence and joint convergence in probability are equiv-
alent to the corresponding pointwise convergences. So W.L.O.G., assume Xn and X are
one-dimensional. Given ǫ > 0. The almost sure convergence implies that

P (|Xn − X| ≥ ǫ occurs only for finitely many n) = 1.

This says that P (|Xn − X| ≥ ǫ, i.o.) = 0. But


 
[
lim P (|Xn − X| ≥ ǫ) ≤ lim P  {|Xk − X| ≥ ǫ} = P (|Xn − X| ≥ ǫ, i.o.) = 0
n n
k≥n

19
where we use the fact that
[
{|Xk − X| ≥ ǫ} ↑ {|Xn − X| ≥ ǫ, i.o.}.
k≥n

Now suppose Xn → X in probability. W.L.O.G., we assume the subsequence is Xn


itself. Then for any ǫ > 0, there exists nǫ > 0 such that supn≥nǫ P (|Xn − X| ≥ ǫ) < ǫ. We
then have a subsequence nk such that P (|Xnk − X| ≥ 2−k ) < 2−k for each k ≥ 1. By the
Borel-Cantelli lemma
 
P |Xnk − X| < 2−k for sufficiently large k = 1

which implies that Xnk → X a.s. as k → ∞.


Now suppose Xn doesn’t converge to X in probability. That is, there exists ǫ0 > 0 and
subsequence {Xkl ; l ≥ 1} such that

P (|Xkl − X| ≥ ǫ0 ) ≥ ǫ0

for all l ≥ 1. Therefore any subsequence of Xkl can not converge to X in probability, hence
it can not converge to X almost surely. 

Example Let Xi , i ≥ 1 be a sequence of i.i.d. random variables with mean zero and

variance one. Let Sn = X1 + · · · + Xn , n ≥ 1. Then Sn / 2n log log n → 0 in pobability but

it does not converge to zero almost surely. Actually lim supn→∞ Sn / 2 log log n = 1 a.s.

PROPOSITION 5.2 Let {Xk ; k ≥ 1} be a sequence of i.i.d. random variables with X1 ∼


N (0, 1). Let Wn = max1≤i≤n Xi . Then
Wn √
lim √ = 2 a.s.
n→∞ log n
Proof. Recall
x 2 1 2
√ e−x /2 ≤ P (X1 ≥ x) ≤ √ e−x /2 (5.3)
2π (1 + x2 ) 2π x
√ √
for any x > 0. Given ǫ > 0, let bn = ( 2 + ǫ) log n. Then
n −b2n /2 1
P (Wn ≥ bn ) ≤ nP (X1 ≥ bn ) ≤ e ≤ ǫ2 /2
bn n
2
for n ≥ 3. Let nk = [k4/ǫ ] + 1 for k ≥ 1. Then,
1
P (Wnk ≥ bnk ) ≤
k2

20
for large k. By the Borel-Cantelli lemma, P (Wnk ≥ bnk , i.o.) = 0. This implies that

Wnk √
lim sup ≤ 2 + ǫ a.s.
k bnk

For any n ≥ n1 there exists k such that nk ≤ n < nk+1 . Note that nk+1 /nk → 1 as n → ∞
and Wnk ≤ Wn ≤ Wnk+1 we have that

Wn √
lim sup √ ≤ 2 + ǫ a.s.
n log n

Let ǫ ↓ 0. We obtain that


Wn √
lim sup √ ≤ 2 a.s. (5.4)
n log n

This proves the upper bound. Now we turn to prove the lower bound. Choose ǫ ∈ (0, 2).
√ √
Set cn = ( 2 − ǫ) log n. Then

P (Wn ≤ cn ) = P (X1 ≤ cn )n = (1 − P (X1 > cn ))n ≤ e−nP (X1 >cn ) .

By (5.3),
ncn 2 C √ 2
nP (X1 > cn ) ≥ √ e−cn /2 ∼ √ n1−( 2−ǫ) /2
2
2π (1 + cn ) 2π log n

as n is sufficiently large, where C is a constant. Since 1 − ( 2 − ǫ)2 /2 > 0,
X
P (Wn ≤ cn ) < ∞.
n≥2

By the Borel-Cantelli again, P (Wn ≤ cn , i.o.) = 0. This implies that


Wn √
lim inf √ ≥ 2 − ǫ a.s.
n log n
Let ǫ ↓ 0. We obtain
Wn √
lim inf √ ≥ 2 a.s.
n log n

This together with (5.4) implies the desired result. 

The above results also hold for random variables taking abstract values. Now let’s
introduce such random variables.

21
A metric space is a set D equipped with a metric. A metric is a map d : D × D → [0, ∞)
with the properties

(i) d(x, y) = 0 if and only if x = y;


(ii) d(x, y) = d(y, x);
(iii) d(x, z) ≤ d(x, y) + d(y, z)

for any x, y and z. A semi-metric satisfies (ii) and (iii) , but not necessarily (i). An open
ball is a set of the form {y : d(x, y) < r}. A subset of a metric space is open if and only if it
is the union of open balls; it is closed if and only if the complement is open. A sequence xn
converges to x if and only if d(xn , x) → 0, and denoted by xn → x. The closure Ā of a set
A is the set of all points that are the limit of a sequence in A; it is the smallest closed set
containing A. The interior Å is the collection of all points x such that x ∈ G ⊂ A for some
open set G; it is the largest open set contained in A. A function f : D → E between metric
spaces is continuous at a point x if and only if f (xn ) → f (x) for every sequence xn → x;
it is continuous at every x if and only if f −1 (G) is also open for every open set G ⊂ E. A
subset of a metric space is dense if and only if its closure is the whole space. A metric space
is separable if and only if it has a countable dense subset. A subset K of a metric space is
compact if and inly if it is closed and every sequence in K has a convergent subsequence. A
set K is totally bounded if and only if for every ǫ > 0 it can be covered by finitely many balls
of radius ǫ > 0. The space is complete if and only if any Cauchy sequence has a limit. A
subset of a complete metric space is compact if and only if it is totally bounded and closed.
A norm space D is a vector space equipped with a norm. A norm is a map k · k : D →
[0, ∞) such that for every x, y in D, and α ∈ R,

(i) kx + yk ≤ kxk + kyk;


(ii) kαxk = |α|kxk;
(iii) kxk = 0 if and only if x = 0.

Definition. The Borel σ-field on a metric space D is the smallest σ-field that contains
the open sets. A function taking values in a metric space is called Borel-measurable if it is
measurable relative to the Borel σ-field. A Borel-measurable map X : Ω → D defined on a
probability space (Ω, F, P ) is referred to as a random element with values in D.
The following lemma is obvious.

LEMMA 5.3 A continuous map between two metric spaces is Borel-measurable.

22
Example. Let C[a, b] = {f (x); f is a continuous function on [a, b]}. Define

kf k = sup |f (x)|.
a≤x≤b

This is a complete normed linear space which is also called a Banach space. This space is
separable. Any separable Banach space is isomorphic to a subset of C[a, b].

Remark. Proposition 5.1 still holds when the random variables taking values in a Polish
space.
Let {X, Xn ; n ≥ 1} be a sequence of random variables. We say Xn converges to X
weakly or in distribution if

P (Xn ≤ x) → P (X ≤ x)

as n → ∞ for every continuous point x, that is, P (X ≤ x) = P (X < x).


Remark. The set of discontinuous points of any random variable is countable.

LEMMA 5.4 Let Xn and X be random variables. The following are equivalent:
(i) Xn converges weakly to X;
(ii) Ef (Xn ) → Ef (X) for any bounded continuous function f (x) defined on R;
(iii) liminfn P (Xn ∈ G) ≥ P (X ∈ G) for any open set G;
(iv) limsupn P (Xn ∈ F ) ≤ P (X ∈ F ) for any closed set F ;

Remark. The above lemma is actually true for any finite dimensional space. Further, it
is true for random variables taking values in separable, complete metric spaces, which are
also called Polish spaces.
Proof. (i) =⇒ (ii). Given ǫ > 0, choose K > 0 such that K and −K are continuous
point of all Xn ’s and X and P (−K ≤ X ≤ K) ≥ 1 − ǫ. By (i)

lim P (|Xn | ≤ K) = lim P (Xn ≤ K) − lim P (Xn ≤ −K) = P (X ≤ K) − P (X ≤ −K)


n n n
≥ 1−ǫ

In other words

lim P (|Xn | > K) ≤ ǫ. (5.5)


n

Since f (x) is continuous on [−K, K], we can cut the interval into, say, m subintervals
[ai , ai+1 ], 1 ≤ i ≤ m such that each ai is a continuous point of X and supai ≤x≤ai+1 |f (x) −
f (ai )| ≤ ǫ. Set
m
X
g(x) = f (ai )I(ai ≤ x < ai+1 ).
i=1

23
It is easy to see that

|f (x) − g(x)| ≤ ǫ for |x| ≤ K and |f (x) − g(x)| ≤ kf k for |x| > K,

where kf k = supx∈R |f (x)|. It follows that

E|f (Y ) − Eg(Y )| ≤ ǫ + kf k · P (|Y | > K)

for any random variable Y. Apply this to Xn and X we obtain from (5.5) that

lim sup |Ef (Xn ) − Eg(Xn )| ≤ (1 + kf k)ǫ and |Ef (X) − Eg(X)| ≤ (1 + kf k)ǫ.
n

Easily Eg(Xn ) → Eg(X). Therefore by triangle inequality,

lim sup |Ef (Xn ) − Ef (X)| ≤ 2(1 + kf k)ǫ.


n

Then (ii) follows by letting ǫ ↓ 0.


(ii) =⇒ (iii). Define fm (x) = (md(x, Gc )) ∧ 1 for m ≥ 1. Since G is open, fm (x) is a
continuous function bounded by one and fm (x) ↑ 1G (x) for each x as m → ∞. It follows
that

lim inf P (Xn ∈ G) ≥ lim inf Efm (Xn ) → Efm (X)


n n

for each m. (iii) follows by letting m → ∞.


(iii) and (iv) are equivalent because F c is open.
(iv) =⇒ (i). Let x be a continuous point. Then by (iii) and (iv)

lim sup P (Xn ≤ x) ≤ P (X ≤ x) = P (X < x) and


n
lim inf P (Xn ≤ x) ≥ lim inf P (Xn < x) ≥ P (X < x).
n n

Then (i) follows. 

Example. Let Xn be uniformly distributed over {1/n, 2/n, · · · , n/n}. Prove that Xn con-
verges weakly to U [0, 1].
Actually, for any bounded continuous function f (x),
n   Z 1
1X i
Ef (Xn ) = f → f (t) dt = Ef (Y )
n n 0
i=1

where Y ∼ U [0, 1].

24
Example. Let Xi , i ≥ 1 be i.i.d. random variables. Define
n n
1X 1X #{i : Xi ∈ A}
µn = δXi through µn (A) = I(Xi ∈ A) =
n n n
i=1 i=1

for any set A. Then, for any measurable function f (x), by the strong law of large numbers,
Z n Z
1X
f (x)µn ( dx) = f (Xi ) → Ef (X1 ) = f (x)µ( dx)
n
i=1

where µ = L(X1 ). This concludes that µn → µ weakly as n → ∞.

COROLLARY 5.1 Suppose Xn converges to X weakly and g(x) is a continuous map de-
fined on R1 (not necessarily one-dimensional). Then g(Xn ) converges to g(X) weakly.

The following two results are called Delta methods.



THEOREM 8 Let Yn be a sequence of random variables that satisfies n(Yn − θ) =⇒
N (0, σ 2 ). Let g be a real-valued function such that g′ (θ) 6= 0. Then

n(g(Yn ) − g(θ)) =⇒ N (0, σ 2 (g′ (θ)2 )).

Proof. By Taylor’s expansion g(x) = g(θ) + g′ (θ)(x − θ) + a(x − θ) for some real function
a(x) such that a(t)/t → 0 as t → 0. Then,
√ √ √
n(g(Yn ) − g(θ)) = g′ (θ) n(Yn − θ) + n · a(Yn − θ).

We only need to show that n · a(Yn − θ) → 0 in probability as n → ∞ (latter we will see
this easily by Slusky’s lemma. But why now?) Indeed, for any m ≥ 1 there exists a δ > 0
such that |a(x)| ≤ |x|/m for all |x| ≤ δ. Thus,
√ √
P (| n · a(Yn − θ)| ≥ b) ≤ P (|Yn − θ| > δ) + P (| n(Yn − θ)| ≥ mb)
→ P (|N (0, σ 2 )| ≥ mb)

as n → ∞. Now let m ↑ ∞. we obtain that n · a(Yn − θ) → 0 in probability. 

Similarly, one can prove that



THEOREM 9 Let Yn be a sequence of random variables that satisfies n(Yn − θ) =⇒
N (0, σ 2 ). Let g be a real-valued function such that g′′ (x) exists around x = θ with g′ (θ) = 0
and g′′ (θ) 6= 0. Then
1 2 ′′
n(g(Yn ) − g(θ)) =⇒ σ g (θ)χ21 .
2

25
Proof. By Taylor’s expansion again,
"   #
1 2 ′′ Yn − θ 2
n(g(Yn ) − g(θ)) = σ g (θ) n · + o(n|Yn − θ|2 ).
2 σ

This leads to the desired conclusion by the same arguments as in Theorem 8. 

PROPOSITION 5.3 Let {X, Xn ; n ≥ 1} be a sequence of random variables on Rk . Look


at the following three statements:
(i) Xn converges to X almost surely;
(ii)Xn converges to X in probability;
(iii) Xn converges to X weakly.
Then (i) =⇒ (ii) =⇒ (iii).

Proof. The implication “(i) =⇒ (ii)” is proved earlier. Now we prove that (ii) =⇒ (iii).
For any closed set F let

F ǫ = {x : inf d(x, y) ≤ ǫ}.


y∈F

Then F ǫ is a closed set. When X takes values in R, the metric d(x, y) = |x − y|. It is easy
to see that

{Xn ∈ F } ⊂ {X ∈ F 1/k } ∪ {|Xn − X| ≥ 1/k}

for any integer k ≥ 1. Therefore

P (Xn ∈ F ) ≤ P (X ∈ F 1/k ) + P (|Xn − X| ≥ 1/k).

Let n → ∞ we have that

lim sup P (Xn ∈ F ) ≤ P (X ∈ F 1/k )


n

Note that F 1/k ↓ F as k → ∞. (iii) follows by letting k → ∞. 

Any random vector X taking values in Rk is tight: For every ǫ > 0 there exists a
number M > 0 such that P (kXk > M ) < ǫ. A set of random vectors {Xα ; α ∈ A} is called
uniformly tight if there exists a constant M > 0 such that

sup P (kXα k > M ) < ǫ.


α∈A

26
A random vector X taking values in a polish space (D, d) is also tight: For every ǫ > 0 there
exists a compact set K ⊂ D such that P (X ∈
/ K) < ǫ. This will be proved later.

Remark. It is easy to see that a finite number of random variables is uniformly tight if every
individual is tight. Also, if {Xα ; α ∈ A} and {Xα ; α ∈ B} are uniformly tight, respectively,
then {Xα , α ∈ A ∪ B} is also tight. Further, if Xn =⇒ X, then {X, Xn ; n ≥ 1} is uniformly
tight. Actually, for any ǫ > 0, choose M > 0 such that P (kXk ≥ M ) ≥ 1 − ǫ. Since the
set {x ∈ Rk , kxk > M } is an open set, lim inf n→∞ P (kXn k > M ) ≥ P (kXk > M ) ≥ 1 − ǫ.
The tightness of {X, Xn } follows.

LEMMA 5.5 Let µ be a probability measure on (D, B), where (D, d) is a Polish space and
B is the Borel σ-algebra. Then µ is tight.

Proof. Since (D, d) is Polish, it is separable, there exists a countable set {x1 , x2 , · · · } ⊂ D
such that it is dense in D.
Given ǫ > 0. For any integer k ≥ 1, since ∪n≥1 B(xn , 1/k) = D, there exists nk such
that µ(∪nn=1
k
B(xn , 1/k)) > 1 − ǫ/2k . Set Dk = ∪nn=1
k
B(xn , 1/k) and M = ∩k≥1 Dk . Then M
is totally bounded (M is always covered by a finite number of balls with any prescribed
radius). Also

X ∞
X
c ǫ
µ(M ) ≤ µ(Dck ) ≤ = ǫ.
2k
k=1 k=1
Let K be the closure of M. Since D is complete, M ⊂ K ⊂ D and K is compact. Also by
the above fact, µ(K c ) < ǫ. 

LEMMA 5.6 Let {Xn ; n ≥ 1} be a sequence of random variables. Let Fn be the cdf of Xn .
Then there exists a subsequence Fnj with the property that Fnj (x) → F (x) at each continuity
point of x of a possibly defective distribution function F. Further, if {Xn } is tight, then F (x)
is a cumulative distribution function.

Proof. Put all rational numbers q1 , q2 , · · · in a sequence. Because the sequence Fn (q1 )
is contained in the interval [0, 1], it has a converging subsequence. Call the indexing sub-
sequence {n1j }∞ 2 1
j=1 and the limit G(q1 ). Next, extract a further subsequence {nj } ⊂ {nj }
along which Fn (q2 ) converges to a limit G(q2 ), a further subsequence {n3j } ⊂ {n2j } along
which Fn (q3 ) converges to a limit G(q3 ), · · · , and so forth. The tail of the diagonal sequence
nj := njj belongs to every sequence nij . Hence Fnj (qi ) → G(qi ) for every i = 1, 2, · · · . Since
each Fn is nondecreasing, G(q) ≤ G(q ′ ) if q ≤ q ′ . Define

F (x) = inf G(q).


q>x

27
It is not difficult to verify that F (x) is decreasing and right-continuous at every point x. It
is also easy to show that Fnj (x) → F (x) at each continuity point of x of F by monotonicity
of F (x).
Now assume {Xn } is tight. For given ǫ ∈ (0, 1), there exists a rational number M > 0
such that P (−M < Xn ≤ M ) ≥ 1 − ǫ. Equivalently, Fn (M ) − Fn (−M ) ≥ 1 − ǫ for all n.
Passing the subsequence nj to ∞. We have that G(M ) − G(−M ) ≥ 1 − ǫ. By definition,
F (M ) − F (−M ) ≥ 1 − ǫ. Therefore F (x) is a cdf. 

THEOREM 10 (Almost-sure representations). If Xn =⇒ X0 then there exist random


variables Yn on ([0, 1], B[0, 1]) such that L(Yn ) = L(Xn ) for each n ≥ 0 and Yn → Y0
almost surely.

Proof. Let Fn (x) be the cdf of Xn . Let U be a random variable uniformly distributed on
([0, 1], B[0, 1]). Define Yn = Fn−1 (U ) where the inverse here is the generalized inverse of a
non-decreasing function defined by

Fn−1 (t) = inf{x : Fn (x) ≥ t}

for 0 ≤ t ≤ 1. One can verify that Fn−1 (t) → F0−1 (t) for every t such that F0 (x) is continuous
at t. We know that the set of the discontinuous points is countable. Therefore the Lebesgue
measure is zero. This shows that Yn = Fn−1 (U ) → F0−1 (U ) = Y0 almost surely.
It is easy to check that L(Xn ) = L(Yn ) for all n = 0, 1, 2, · · · . 

LEMMA 5.7 (Slusky). Let {an }, {bn } and {Xn } are sequences of random variables such
that an → a in probability and bn → b in probability and Xn =⇒ X for constants a and b
and some random variable X. Then an Xn + bn =⇒ aX + b as n → ∞.

Proof. We will show that (an Xn + bn )− (aXn + b) → 0 in probability. If this is the case,
then the desired conclusion follows from the weak convergence of Xn (why?). Given ǫ > 0.
Since {Xn } is convergent, it is tight. So there exists M > 0 such that P (|Xn | ≥ M ) ≤ ǫ.
Moreover, by the given condition,
 ǫ 
P |an − a| ≥ < ǫ and P (|bn − b| ≥ ǫ) < ǫ
2M
as n is sufficiently large. Note that

|(an Xn + bn ) − (aXn + b)| ≤ |an − a||Xn | + |bn − b|.

28
Thus,

P (|(an Xn + bn ) − (aXn + b)| ≥ 2ǫ)


 ǫ 
≤ P |an − a| ≥ + P (|Xn | ≥ M ) + P (||bn − b| ≥ ǫ) < 3ǫ
2M
as n is sufficiently large. 
Remark. All the above statements of weak convergence of random variables Xn can be
replaced by their distributions µn := L(Xn ). So “Xn =⇒ X” is equivalent to “µn =⇒ µ”.

THEOREM 11 (Lévy’s continuity theorem). Let µn ; n = 0, 1, 2, · · · be a sequence of prob-


ability measures on Rk with characteristic function φn . We have that
(i) If µn =⇒ µ0 then φn (t) → φ0 (t) for every t ∈ Rk ;
(ii) If φn (t) converges pointwise to a limit φ0 (t) that is continuous at 0, then the as-
sociated sequence of distributions µn is tight and converges weakly to the measure µ0 with
characteristic function φ0 .

Remark. The condition that “φ0 (t) is continuous at 0” is essential. Let Xn ∼


N (0, n), n ≥ 1. So P (Xn ≤ x) → 1/2 for any x, which means that Xn doesn’t converges
2 /2
weakly. Note that φn (t) = e−nt → 0 as n → ∞ and φn (0) = 1. So φ0 (t) is not continuous.
If one ignores the continuity condition at zero, the conclusion that Xn converges weakly
may wrongly follow.

Proof of Theorem 11. (i) is trivial.


(ii) Because marginal tightness implies joint tightness. W.L.O.G., assume that µn is a
probability measure on R and generated by Xn . Note that for every x and δ > 0,
  Z
sin δx 1 δ
1{|δx| > 2} ≤ 2 1 − = (1 − cos tx) dt.
δx δ −δ

Replace x by Xn , take expectations, and use Fubini’s theorem to obtain that


  Z Z
2 1 δ 1 δ
P |Xn | > ≤ (1 − EeitXn ) dt = (1 − φn (t)) dt.
δ δ −δ δ −δ

By the given condition,


  Z δ
2 1
lim sup P |Xn | > ≤ (1 − φ0 (t)) dt.
n δ δ −δ

The right hand side above goes to zero as δ ↓ 0. So tightness follows.

29
R R
To prove the weak convergence, we have to show that f (x) dun → f (x) dµ for every
bounded continuous function f (x). It is equivalent to that for any subsequence, there is a
R R
further subsequence, say nk , such that f (x) dunk → f (x) dµ0 .
For any subsequence, by Lemma 5.6, the Helly selection principle, there is a further
subsequence, say µnk , such that it converges to µ0 weakly, by the earlier proof, the c.f. of
µ0 is φ0 . We know that a c.f. uniquely determines a distribution. This says that every limit
R R
is identical. So f (x) dunk → f (x) dµ0 . 

THEOREM 12 (Central Limit Theorem). Let X1 , X2 , ·, Xn be a sequence of i.i.d. random



variables with mean zero and variance one. Let X̄n = (X1 + · · · + Xn )/n. Then nX̄n =⇒
N (0, 1).

Proof. Let φ(t) be the c.f. of X1 . Since EX1 = 0 and EX12 = 1 we have that φ′ (0) =
iEX1 = 0 and φ′′ (0) = i2 EX12 = −1. By Taylor’s expansion
   
t t2 1
φ √ =1− +o
n 2n n
as n is sufficiently large. Hence
    n

it nX̄n t n t2 1 2
Ee =φ √ = 1− +o → e−t /2
n 2n n
as n → ∞. By Levy’s continuity theorem, the desired conclusion follows. 

To deal with multivariate random variables, we have the next tool.

THEOREM 13 (Cramér-Wold device). Let Xn and X be random variables taking values


in Rk . Then

Xn =⇒ X if and only if tT Xn =⇒ tT X for all t ∈ Rk .

Proof. By Levy’s continuity Theorem 11, Xn =⇒ X if and only if E exp(itT Xn ) →


E exp(itT X) for each t ∈ Rk , which is equivalent to E exp(iu(tT Xn )) → E exp(iu(tT X)) for
any real number u. This is same as saying that the c.f. of tT Xn converges to that of tT X
for each t ∈ Rk . It is equivalent to that tT Xn =⇒ tT X for all t ∈ Rk . 

As an application, we have the multivariate analogue of the one-dimensional CLT.

THEOREM 14 Let X1 , X2 , · · · be a sequence of i.i.d. random vectors in Rk with mean


vector µ = EX1 and covariance matrix Σ = E(X1 − µ)(X1 − µ)T . Then
n
1 X √
√ (Xi − µ) = n(Ȳn − µ) =⇒ Nk (0, Σ).
n
i=1

30
Proof. By the Cramér-Wold device, we need to show that
n
1 X T
√ (t Xi − tT µ) =⇒ N (0, tT Σt)
n
i=1

for ant t ∈ Rk . Note that {tT Xi − tT µ, i = 1, 2, · · · } are i.i.d. real random variables. By
the one-dimensional CLT,
n
1 X T
√ (t Xi − tT µ) =⇒ N (0, Var(tT X1 )).
n
i=1

This leads to the desired conclusion since Var(tT X1 ) = tT Σt. 


We state the following theorem without giving the proof. The spirit of the proof is
similar to that of Theorem 12. It is called Lindeberg-Feller Theorem.

THEOREM 15 Let kn ; n ≥ 1} be a sequence of positive integers and kn → +∞ as n → ∞.


For each n let Yn,1 , · · · , Yn,kn be independent random vectors with finite variances such that

kn
X kn
X
Cov (Yn,i ) → Σ and EkYn,i k2 I{kYn,i k > ǫ} → 0
i=1 i=1

for every ǫ > 0. Then the sequence Σki=1


n
(Yn,i − EYn,i ) =⇒ N (0, Σ).

The Central Limit Theorems hold not only for independent random variables, it is also
true for some other dependent random variables of special structures.
Let Y1 , Y2 , · · · be a sequence of random variables. A sequence of σ-algebra Fn satisfies
that Fn ⊃ σ(Y1 , · · · , Yn ) and is non-decreasing Fn ⊂ Fn+1 for all n ≥ 0, where F0 = {∅, Ω}.
When E(Yn+1 |Fn ) = 0 for any n ≥ 0, we say {Yn ; n ≥ 1} are martingale differences.

THEOREM 16 For each n ≥ 1, let {Xni , 1 ≤ i ≤ kn } be a sequence of martingale differ-


ences relative to nested σ-algebras {Fni , 1 ≤ i ≤ kn }, that is, Fni ⊂ Fnj for i < j, such
that
1. {maxi |Xni |; n ≥ 1} is uniformly integrable;
2. E (maxi |Xni |) → 0;
P 2
3. i Xni → 1 in probability.
P
Then Sn = i Xni → N (0, 1) in distribution.

Example. Let Xi ; i ≥ 1 be i.i.d. with mean zero and variance one. Then one can

verify that the required conditions in Theorem 16 hold for Xni = Xi / n for i = 1, 2, · · · , n.

31
Pn √
Therefore we recollect the classical CLT that i=1 Xi / n =⇒ N (0, 1). Actually,
√ n √
P (max |Xni | ≥ ǫ) ≤ nP (|X1 | ≥ nǫ) ≤ 2 E|X1 |2 I{|X1 | ≥ nǫ} → 0 and
i nǫ
Pn 2
X
E(max |Xni |)2 ≤ E i=1 i = 1
i n
This shows that maxi |Xni | → 0 in probability and {maxi |Xni |, n ≥ 1} is uniformly inte-
grable. So condition 2 holds.

LEMMA 5.8 Let {Un ; n ≥ 1} and Tn ; n ≥ 1} be two sequences of random variables such
that
1. Un → a in probability, where a is a constant;
2. {Tn } is uniformly integrable;
3. {Un Tn } is uniformly integrable;
4. ETn → 1.
Then E(Tn Un ) → a.

Proof. Write Tn Un = Tn (Un − a) + aTn . Then E(Tn Un ) = E{Tn (Un − a)} + aETn . Evi-
dently, {Tn (Un − a)} is uniformly integrable by the given condition. Since {Tn } is uniformly
integrable, it is not difficult to verify that Tn (Un − a) → 0 in probability. Then the result
follows. 

By the Taylor expansion, we have the following equality


2 /2+r(x)
eix = (1 + ix)e−x (5.6)

for some function r(x) = O(x3 ) as x → ∞. Look at the set-up in Theorem 16, we have that
eitSn = Tn Un , where
Y X X
Tn = (1 + itXnj ) and Un = exp(−(t2 /2) 2
Xnj + r(tXnj )). (5.7)
j j j

We next obtain a general result on CLT by using Lemma 5.8.

THEOREM 17 Let {Xnj , 1 ≤ i ≤ kn , n ≥ 1} be a triangular array of random variables.


Suppose
1. ETn → 1;
2. {Tn } is uniformly integrable;

32
P 2 → 1 in probability;
3. j Xnj
4. E(maxj |Xnj |) → 0.
P
Then Sn = j Xnj converges to N (0, 1) in distribution.

Proof. Since |Tn Un | = |eitSn | = 1, we know that {Tn Un } is uniformly integrable. By Lemma
2
5.8, it is enough to show that Un → e−t /2 in probability. Review (5.7) and condition 3, it
P
suffices to show that j r(tXnj ) → 0 in probability. The assertion r(x) = O(x3 ) say that
there exists A > 0 and δ > 0 such that |r(x)| ≤ A|x3 | as |x| < δ. Then
X X
P (| r(tXnj )| > ǫ) ≤ P (max |Xnj | ≥ δ) + P (| r(tXnj )| > ǫ, max |Xnj | < δ)
j j
j j
X
≤ δ−1 E(max |Xnj |) + P (max |Xnj | · |Xnj |2 > ǫ) → 0
j j
j

as n → ∞ by the given conditions. 

Proof of Theorem 16. Define Zn1 = Xn1 and

Xj−1
2
Znj = Xnj I( Xnk ≤ 2), 2 ≤ j ≤ kn .
k=1

It is easy to check that {Znj } is still a martingale relative to the original σ-algebra. Now
P 2 > 2} ∧ k . Then
define J = inf{j ≥ 1; 1≤k≤j Xnk n

P (Xnk 6= Znk for some k ≤ kn ) = P (J ≤ kn − 1)


X
2
≤ P( Xnk > 2) → 0
1≤k≤kn

Pkn
as n → ∞. Therefore, P (Sn 6= j=1 Znj ) → 0 as n → ∞. To prove the result, we only need
to show
kn
X
Znj =⇒ N (0, 1). (5.8)
j=1

We now apply Theorem 17 to prove this. Replacing Xnj by Znj in (5.7), we have new Tn and
Un . Let’s literally verify the four conditions in Theorem 17. By a martingale property and
iteration, ETn = 1. So condition 1 holds. Now maxj |Znj | ≤ maxj |Xnj | → 0 in probability
P 2
as n → ∞. So condition 4 holds. It is also easy to check that j Znj → 1 in probability.
Then it remains to show condition 2.

33
Qk n Q
By definition, Tn = j=1 (1 + itZnj ) = 1≤j≤J (1 + itZnj ). Thus,

J−1
Y
|Tn | = (1 + t2 Xnk
2 1/2
) |1 + tXnJ |
k=1
J−1
!
t2 X 2
≤ exp Xnk (1 + |t| · |XnJ |)
2
k=1
2
≤ et (1 + |t| · max |Xnj |)
j

which is uniform integrable by condition 1 in Theorem 16. 

The next is an extreme limit theorem, which is different than the CLT. The limiting
distribution is called an extreme distribution.

THEOREM 18 Let Xi ; 1 ≤ i ≤ n be i.i.d. N (0, 1) random variables. Let Wn = max1≤i≤n Xi .


Then
 
p (log n) + x 1
− 2√ ex/2
P Wn ≤ 2 log n − √2 →e π
2 2 log n

for any x ∈ R, where log2 n = log(log n).

Proof. Let tn be the right hand side of “≤” in the probability above. Then

P (Wn ≤ tn ) = P (X1 ≤ tn )n = (1 − P (X1 > tn ))n .

Since (1 − xn )n ∼ e−a as n → ∞ if xn ∼ a/n. To prove the theorem, it suffices to show that


1
P (X1 > tn ) ∼ √ ex/2 . (5.9)
2 πn

Actually, we know that


1 2
P (X1 > x) ∼ √ e−x /2
2π x
as x → +∞. It is easy to calculate that

1 1 t2 p x
√ ∼ √ and n ∼ log n − log( log n) −
2π tn 2 π log n 2 2

as n → ∞. This leads to (5.9). 

34
As an application of the CLT of i.i.d. random variables, we will be able to derive the
classical χ2 -test as follows.
Let’s consider a multinomial distribution with n trails, k classes with parameter p =
(p1 , · · · , pk ). Roll a die twenty times. Let Xi be the number of the occurrences of “i dots”,
1 ≤ i ≤ 6. Then (X1 , X2 , · · · , X6 ) follows a multinomial distribution with “success rate”
p = (p1 , · · · , p6 ). How to test the die is fair or not? From an introductory course, we know
that, the null hypothesis is H0 : p1 = · · · = p6 = 1/6. The test statistic is
6
X (Xi − npi )2
χ2 = .
npi
i=1

We use the fact that χ2 is roughly χ2 (5) as n is large. We will prove this next.
In general, Xn = (Xn,1 , · · · , Xn,k ) follows a multinomial distribution with n trails and
P
“success rate” p = (p1 , · · · , pk ). Of course, ki=1 Xn,i = n. To be more precise,

n! x x
P (Xn,1 = xn,1 , · · · , Xn,k = xn,k ) = p n,1 · · · pk n,k ,
xn,1 ! · · · xn,k ! 1
Pk
where i=1 xn,i = n. Now we prove the following theorem,

THEOREM 19 As n → ∞,
k
X (Xi − npi )2
χ2n = =⇒ χ2 (k − 1).
npi
i=1

We need a lemma.

LEMMA 5.9 Let Y ∼ Nk (0, Σ). Then


k
X
2 d
kY k = λi Zi2 ,
i=1

where λ1 , · · · , λk are eigenvalues of Σ and Z1 , · · · , Zk are i.i.d. N (0, 1).

Proof. Decompose Σ = Odiag(λ1 , · · · , λk )OT for some orthogonal matrix O. Then the k
coordinates of OY are independent with mean zero and variances λi ’s. So ||Y k2 = kOY k2
P
is equal to ki=1 λi Zi2 , where Zi ’s are i.i.d. N (0, 1). 

Another fact we will use is that AXn =⇒ AX for any k × k matrix A if Xn =⇒ X, where
Xn and X are Rk -valued random vectors. This can be shown easily by the Cramer-Wold
device.

35
Proof of the Theorem. Let Y1 , · · · , Yn be i.i.d. with a multinomial distribution with 1
trails and “success rate” p = (p1 , · · · , pk ). Then Xn = Y1 +· · · Yn . Each Yi is a k-dimensional
vector. By CLT,
Xn − EXn
√ =⇒ Nk (0, Cov(Y1 )),
n
where it is easy to see that EXn = np and
 
p (1 − p1 ) −p1 p2 ··· −p1 pk
 1 
 −p p p2 (1 − p2 ) ··· −p2 pk 
 1 2 
Σ1 := Cov(Y1 ) =  .
 ··· ··· ··· ··· 
 
−p1 pk −p2 pk ··· pk (1 − pk )
√ √
Set D = diag(1/ p1 , · · · , 1/ pk ). Then
 √   √ T
p1 p1
 √  √ 
 p2   p2 
  
DΣ1 D T == I −  .   .  .
 ..   .. 
  
√ √
pk pk
Pk
Since i=1 pi = 1, the above matrix is symmetric and idempotent or a projection. Also,
trace of Σ1 is k−1. So Σ1 has eigenvalue 0 with one multiplicity and 1 with k−1 multiplicity.
Therefore, by the lemma, we have that
D(Xn − np)
√ =⇒ N (0, DΣ1 D T ).
n

By the lemma, since g(x) = kxk2 is a continuous function on Rk ,


k
X k−1
X
(Xi − npi )2 D(Xn − np) 2 d

= √ =⇒ Zi2 = χ2 (k − 1). 
npi n
i=1 i=1

6 Central Limit Theorems for Stationary Random Variables


and Markov Chains
Central limit theorems hold not only for independent random variables but also for depen-
dent random variables. In the next we will study the Central Limit Theorems for α-mixing
random variables and and martingale (a special class of dependent random variables).
Let X1 , X2 , · · · be a sequence of random variables. For each n ≥ 1, define αn by

α(n) := sup |P (A ∩ B) − P (A)P (B)| (6.10)

36
where the supremum is taken over all A ∈ σ(X1 , · · · , Xk ), B ∈ σ(Xk+n , Xk+n+1 , · · · ), and
k ≥ 1. Obviously, α(n) is decreasing. When α(n) → 0, then Xk and Xk+n are approximately
independent. In this case the sequence {Xn } is said to be α-mixing. If the distribution of
the random vector (Xn , Xn+1 , · · · , Xn+j ) does not depend on n, the sequence is said to be
stationary.
The sequence is called m-dependent if (X1 , · · · , Xk ) and (Xk+n , · · · , Xk+n+l ) are inde-
pendent for any n > m, k ≥ 1 and l ≥ 1. In this case the sequence is α-mixing with αn = 0
for n > m. An independent sequence is 0-dependent.
For stationary process, we have the following the Central Limit Theorems.

THEOREM 20 Suppose that X1 , X2 , · · · is stationary with EX1 = 0 and |X1 | ≤ A for


P
some constant A > 0. Let Sn = X1 + · · · + Xn . If ∞
i=1 α(n) < ∞, then

X ∞
V ar(Sn )
→ σ 2 = E(X12 ) + 2 E(X1 Xk+1 ) (6.11)
n
k=1

where the series converges absolutely. If σ > 0, then Sn / n =⇒ N (0, σ 2 ).

THEOREM 21 Suppose that X1 , X2 , · · · is stationary with EX1 = 0 and E|X1 |2+δ < ∞
P
for some constant δ > 0. Let Sn = X1 + · · · + Xn . If ∞
i=1 α(n)
δ/(2+δ) < ∞, then

X ∞
V ar(Sn )
→ σ 2 = E(X12 ) + 2 E(X1 Xk+1 )
n
k=1

where the series converges absolutely. If σ > 0, then Sn / n =⇒ N (0, σ 2 ).

Easily we have the following corollary.

COROLLARY 6.1 Let X1 , X2 , · · · be m-dependent random variables with the same distri-

bution. If EX1 = 0 and EX12+δ < ∞ for some δ > 0. Then Sn / n → N (0, σ 2 ), where
P
σ 2 = EX12 + 2 m+1
i=1 E(X1 Xi ).

Remark. If we argue by using the truncation method as in the proof of Theorem 21, we
can show that the above corollary still holds only under EX12 < ∞.

LEMMA 6.1 If Y ∈ σ(X1 , · · · , Xk ) and Z ∈ σ(Xk+n , Xk+n+1 , · · · , Xk+m ) for n ≤ m ≤


+∞. Assume |Y | ≤ C and |Z| ≤ D. Then

|E(Y Z) − (EY )EZ| ≤ 4CDα(n).

37
Proof. W.l.o.g, assume that C = D = 1. Now since expectations can be approximated by
finite sums, so w.l.o.g, we further assume that
m
X n
X
Y = yi Ai and Z = zj Bj
i=1 j=1

where Ai ’s and Bj ’s are respectively partitions of the sample space. Then


X X
|E(Y Z) − (EY )EZ| = | yi zj (P (Ai ∩ Bj ) − P (Ai )P (Bj )) |
i j
X X
≤ | zj (P (Ai ∩ Bj ) − P (Ai )P (Bj )) |.
i j

Let ξi = 1 or −1 depends on whether or not zj (P (Ai ∩ Bj ) − P (Ai )P (Bj ) ≥ 0 or not. Then


P P
the last sum is equal to i j ξi zj (P (Ai ∩ Bj ) − P (Ai )P (Bj )) which is also identical to
P P
j zj j ξi (P (Ai ∩ Bj ) − P (Ai )P (Bj )). By the same argument, there are ηi equal to 1 or
P P
−1 such that the last quantity is bounded by j j ξi ηj (P (Ai ∩ Bj ) − P (Ai )P (Bj )). Let
E1 be the union of Ai such that ξi = 1 and E2 = E1c ; Let F1 and F2 be defined similarly for
Bj according to the sign of ηj ’s. Then
X
|E(Y Z) − (EY )EZ| ≤ |P (Ei ∩ Fj ) − P (Ei )P (Fj )| ≤ 4α(n)
1≤i,j≤2

LEMMA 6.2 If Y ∈ σ(X1 , · · · , Xk ) and Z ∈ σ(Xk+n , Xk+n+1 , · · · ). Assume for some


δ > 0, E(Y 2+δ ) ≤ C and E(Z 2+δ ) ≤ D. Then

|E(Y Z) − (EY )EZ| ≤ 5(1 + C + D)α(n)δ/(2+δ) .

Proof. Let Y0 = Y I(|Y | ≤ a) and Y1 = Y I(|Y | > a); and Z0 = ZI(|Z| ≤ a) and
Z1 = ZI(|Z| > a). Then
X
|E(Y Z) − (EY )EZ| ≤ |E(Yi Zj ) − (EYi )EZj |.
1≤i,j≤2

By the previous lemma, the term corresponding to i = j = 0 is bounded by 4a2 α(n). For
p
i = j = 1, |E(Y1 Z1 ) − (EY1 )EZ1 | = |Cov(Y1 , Z1 )| which is bounded by E(Y12 )E(Z12 ) ≤

CD/aδ since E(Y12 ) ≤ C/aδ . Now

|E(Y0 Z1 ) − (EY0 )EZ1 | ≤ E(|Y0 − EY0 | · |Z1 − EZ1 |) ≤ 2a · 2E|Z1 |

which is less than 4D/aδ . The term for i = 1, j = 0 is bounded by 4C/aδ . In summary

2 CD + 4C + 4D 4.5(C + D)
|E(Y Z) − (EY )EZ| ≤ 4a α(n) + 2
≤ 4a2 α(n) + .
a aδ

38
Now take a = α(n)−1/(2+δ) . The conclusion follows. 

Proof of Theorem 20. By Lemma 6.2, |EX1 X1+k | ≤ 4A2 α(k), so the series (6.11) con-
P
verges absolutely. Moreover, let ρk = E(X1 X1+k ). Then ESn2 = nEX12 +2 1≤i<j≤n E(Xi Xj ) =
Pn−1
nρ0 + 2 k=1 (n − k)ρk . Then it is to see that

X ∞
E(Sn2 )
→ ρ0 + 2 ρk . (6.12)
n
k=1

We claim that

E(Sn4 ) = n3 δn (6.13)
P
for a sequence of constants δn → 0. Actually, E(Sn4 ) ≤ 1≤i,j,k,l≤n |E(Xi Xj Xk Xl )|. Among
the terms in the sum there are n terms of Xi4 ; at most n2 of Xi2 Xj2 and Xi3 Xj respectively
for i 6= j. So the contribution of these terms is bounded by Cn2 for an universal constant
C. There are only two type of terms left: Xi2 Xj Xk and Xi Xj Xk Xl where these indices are
all different. We will deal with the second type of terms only. The first one can be done
similarly and more easily. Note that
X X
|E(Xi Xj Xk Xl )| ≤ 4!n |E(X1 X1+i X1+i+j X1+i+j+k )| (6.14)
1≤i,j,k,l≤n 1≤i+j+k≤n

where the three indices in the last sum are not zero. By Lemma 6.2, the term inside is
bounded by 4A4 αi . By the same argument, it is also bounded by 4A4 αk . Therefore, the
sum in (6.14) is bounded by
X X
4 · 4!A4 n2 min{αi , αk } ≤ 2Kn2 min{αi , αk }
2≤i+k≤n 2≤i+k≤n, i≤k
Xn
≤ 2Kn2 kαk = nδn (6.15)
k=1

for a sequence of constants δn such that 0 < δn → 0. This is because


n
X X X
kαk = kαk + kαk
√ √
k=1 k≤ n n<k≤n
X∞ X

≤ n αk + n αk = o(n).

k=1 k≥ n

39
Therefore (6.13) follows. From this, we know that ESn4 ≤ nδn ≤ n3 (δn + (1/n)) ≤ n3 δn′
where δn′ := supk≥n {δk + (1/k); k ≥ n} is decreasing to 0 and is bounded below by 1/n.
Thus, w.l.o.g., assume δn has this property.

Define pn = [ n log(1/δ[√n] )] and qn = [n/pn ] and rn = [n/(pn + qn )]. Then
!  
√ √ p2n δpn 2 1 2 1
n ≪ pn ≤ n log n and ≤ δpn log ≤ δpn log →0 (6.16)
n δ[√n] δpn
as n → ∞. Now for finite n, break X1 , X2 , · · · , Xn into some blocks as follows:

X1 , · · · , Xpn | Xpn +1 , · · · , Xpn +qn | Xpn +qn +1 , · · · , X2pn +qn | · · · , Xn

For i = 1, 2, · · · , rn , set

Un,i = X(i−1)(pn +qn )+1 + X(i−1)(pn +qn )+2 + · · · + Xi(pn +qn ) ;


Vn,i = Xipn +(i−1)qn +1 + Xipn +(i−1)qn +2 + · · · + Xi(pn +qn ) ;
Wn = Xrn (pn +qn )+1 + Xrn (pn +qn )+1 + · · · + Xn ,
Pn Pn
and Sn′ = ri=1 Un,i and Sn′′ = ri=1 Vn,i . In what follows, we will show that Sn′′ and Wn are
negligible. Now, by (6.12) and its reasoning
rn
X
V ar(Sn′′ ) = rn V ar(Vn,1 ) + 2 (rn − k)E(Vn,1 Vn,k )
k=2
rn
X
≤ Crn qn + 2rn |E(Vn,1 Vn,k )|.
k=2

By assumption, |Vn,i | ≤ Aqn for any i ≥ 1. By Lemma 6.1, |E(Vn,1 Vn,k )| ≤ 4A2 qn2 α(k−1)pn .
Thus, by monotoncity,
rn rn rn (k−1)pn
X X X 1 X
|E(Vn,1 Vn,k )| ≤ 4A2 qn2 α(k−1)pn ≤ 4A2 qn2 αj
pn
k=2 k=2 k=2 j=(k−2)pn +1

4A2 qn2 X
≤ αj .
pn
j=1

Since qn /pn → 0, the above two inequalities imply that


V ar(Sn′′ )
→0 (6.17)
n

as n → ∞. This says that Sn′′ / n → 0 in probability. By Chebyshev’s inequality and (6.12)

we see that Wn / n → 0 in probability. Thus, to prove the theorem, it suffices to show that

Sn′ / n =⇒ N (0, σ 2 ). This will be done through the following two steps
S̃ ′ √ √
√n =⇒ N (0, σ 2 ) and EeitSn / n − EeitS̃n / n → 0 (6.18)
n

40
Prn ′ ′ ; ı = 1, 2, · · · , r } are i.i.d. with the same
for any t ∈ R, where S̃n = i=1 Un,i and {Un,i n
distribution as that of Un,1 . First, by (6.12), V ar(S̃n ) = rn V ar(Un,1 ) ∼ rn pn σ 2 . It follows
that V ar(S̃n )/n → σ 2 . Also, EUn,i
′ = 0. By (6.13) and (6.16),

Xrn
1 ′ 4 rn 4 rn p3n δpn 2
−4 pn δpn
E(Un,i ) ∼ 2 4 E(Un,1 )= ∼ σ →0
V ar(S̃n )2 i=1 n σ n2 σ 4 n

as n → ∞. By the Lyapounov central limit theorem, we know the first assertion in (6.18)
holds. Last, by Lemma 6.12,
rn
X
′ √ √ √
|EeitSn / n
− (EeitUn,1 / n
)E exp(it Un,j / n)| ≤ 16α(qn ).
j=2
P rn √
Then repeat this inequality for E exp(it j=2 Un,j / n) and so on, we obtain that

√ √ √ rn
Y √

EeitUn,j / n
′ ′
|EeitSn / n
− EeitS̃n / n
| = |EeitSn / n
− | ≤ 16α(qn )rn .
j=1
P P
Since α(n) is decreasing and n≥1 α(n) < ∞, we have nα(n)/2 ≤ [n/2]≤i≤n α(i) → 0.
Thus the above bound is o(rn /qn ) = o(n/(pn qn )) → 0 as n → ∞. The second statement in
(6.18) follows. The proof is complete. 

Proof of Theorem 21. By Lemma 6.2, |EX1 Xk | ≤ 5(1 + 2E|X1 |2+δ )α(n)2/(2+δ) ,
P
which is summable by the assumption. Thus the series EX12 +2 ∞i=2 E(X1 Xk ) is absolutely
convergent. By the exact same argument as in proving (6.11), we obtain that V ar(Sn )/n →
σ2 .
Now for a given constant A > 0 and all i = 1, 2, · · · , define

Yi = Xi I{|Xi | ≤ A} − E(Xi I{|Xi | ≤ A}) and Zi = Xi I{|Xi | > A} − E(Xi I{|Xi | > A}),
Pn Pn
and Sn′ = i=1 Yi and Sn′′ = i=1 Zi . Obviously, Sn = Sn′ + Sn′′ . By Theorem 20,

S
√n =⇒ N (0, σ(A)2 ) (6.19)
n
P∞
as n → ∞ for each fixed A > 0, where σ(A)2 := EY12 + 2 i=2 EY1 Yk . By the summable
assumption on α(n), it is easy to check that both the sum in the definition of σ(A)2 and
P
σ̃(A)2 := EZ12 + 2 ∞i=2 E(Z1 Zk ) are absolutely convergent, and

σ(A)2 → σ 2 and σ̃(A)2 → 0 (6.20)

41
as A → +∞. Now
|S ′′ | V ar(Sn′′ ) σ̃(A)
lim sup P ( √n ≥ ǫ) ≤ lim sup 2
→ 2 . (6.21)
n→∞ n n→∞ nǫ ǫ
√ √
by using the same argument as in (6.2). Since P (Sn / n ∈ F ) ≤ P (Sn′ / n ∈ F̄ ǫ ) +

P (|Sn′′ |/ n ≥ ǫ) for any closed set F ⊂ R and any ǫ > 0, where F̄ ǫ is the closed ǫ-
neighborhood of F. Thus, from (6.19) and (6.21)
Sn |S ′′ |
lim sup P ( √ ∈ F ) ≤ P (N (0, σ(A)2 ) ∈ F̄ ǫ ) + lim sup P ( √n ≥ ǫ)
n→∞ n n→∞ n
 
σ σ̃(A)
≤ P N (0, σ 2 ) ∈ F̄ ǫ + 2
σ(A) ǫ

for any A > 0 and ǫ > 0. First let A → +∞, and then let ǫ → 0+ to obtain from (6.20) that
Sn
lim sup P ( √ ∈ F ) ≤ P (N (0, σ 2 ) ∈ F ).
n→∞ n
The proof is complete. 

Define by α̃(n) the supremum of

|E (f (X1 , · · · , Xk )g(Xk+n , · · · , Xk+m )) − Ef (X1 , · · · , Xk ) · Eg(Xk+n , · · · , Xk+m )| (6.22)

over all k ≥ 1, m ≥ 0 and measurable functions f (x) defined on Rm and g(x) defined on
Rm−n+1 with |f (x)| ≤ 1 and |g(x)| ≤ 1 for all x ∈ R.

LEMMA 6.3 The following are true:

(i) α(n) = sup sup sup |P (A ∩ B) − P (A)P (B)|.


k≥1 m≥0 A∈σ(X1 ,··· ,Xk ), B∈σ(Xk+n ,··· ,Xk+n+m )

(ii) For all n ≥ 1, we have that α(n) ≤ α̃(n) ≤ 4α(n).

Proof. (i) For two sets A and B, let A∆B = (A\B) ∪ (B\A), which is their symmetric
difference. Then IA∆B = |IA − IB |. Let Zi ; i ≥ 1 be a sequence of random variables. We
claim that

σ(Z1 , Z2 , · · · ) = {B ∈ σ(Z1 , Z2 , · · · ); for any ǫ > 0, there exists l ≥ 1 and (6.23)


C ∈ σ(Z1 , · · · , Zl ) such that P (B∆C) < ǫ}. (6.24)

If this is true, then for any ǫ > 0, there exists l < ∞ and C ∈ σ(Z1 , · · · , Zl ) such that
|P (B) − P (C)| < ǫ. Thus (ii) follows.

42
Now we prove this claim. First, note that σ(Z1 , Z2 , · · · ) = σ{{Zi ∈ Ei }; Ei ∈ B(R); i ≥
1}. Denote by F the right hand side of (6.23). Then F contains all {Zi ∈ Ei }; Ei ∈
B(R); i ≥ 1. We only need to show that the right hand side is a σ-algebra.
It is easy to see that (i) Ω ∈ F; (ii) B c ∈ F if B ∈ F since A∆B = Ac ∆B c . (iii) If Bi ∈ F
for i ≥ 1, there exist mi < ∞ and Ci ∈ σ(Z1 , · · · , Zmi ) such that P (Bi ∆Ci ) < ǫ/2i for all
i ≥ 1. Evidently, ∪ni=1 Bi ↑ ∪∞
i=1 Bi as n → ∞. Therefore, there exists n0 < ∞ such that
n0 n0 n0 n0
|P (∪∞
i=1 Bi )−P (∪i=1 Bi )| ≤ ǫ/2. It is easy to check that (∪i=1 Bi )∆(∪i=1 Ci ) ⊂ ∪i=1 (Bi ∆Ci ).
n0 n0
Write B = ∪∞
i=1 Bi and B̃ = ∪i=1 Bi and C = ∪i=1 Ci . Note B∆C ⊂ (B\B̃) ∪ (B̃∆C). These
facts show that P (B∆C) < ǫ and C ∈ σ(Z1 , Z2 , · · · , Zl ) for some l < ∞. Thus F is a
σ-algebra.
(ii) Take f and g to be indicator functions, from (i) we get α(n) ≤ α̃(n). By Lemma
6.1, we have that α̃(n) ≤ 4α(n). 

Let {Xi ; i ≥ 1} be a sequence of random variables. We say it is a Markov chain if


P (Xk+1 ∈ A|X1 , · · · , Xk ) = P (Xk+1 ∈ A|Xk ) for any Borel set A and any k ≥ 1. A Markov
chain has the property that given the present observations, the past and future
observations are (conditionally) independent. This is literally stated in Proposition
10.3.
Let’s look at the strong mixing coefficient α̃(n). Recall the definition of α̃(n). If X1 , X2 , · · ·
is a Markov chain, for simplicity, we write f and g, respectively, for f (X1 , · · · , Xk ) and
g(Xk+n , · · · , Xk+m ). Let f˜ = E(f |Xk+1 )) and g̃ = E(g|Xk+n−1 ). Then by Proposition 10.3,
E(f g) = E(f˜g̃), and Ef = E f˜ and Eg = Eg̃. Thus,

E(f g) − (Ef )Eg = E(f˜g̃) − (E f˜)Eg̃.

The point is that f˜ is Xk+1 measurable, and g̃ is Xk+n−1 measurable. Therfore

α(n) ≤ sup |E(f (Xk+1 )g(Xk+n−1 ) − (Ef (Xk+1 ))Eg(Xk+n−1 )| (6.25)


|f |≤1 |g|≤1
≤ 4 sup |P (Xk+1 ∈ A, Xk+n−1 ∈ B) − P (Xk+1 ∈ A)P (Xk+n−1 ∈ B)| (6.26)
A, B

by Lemma 6.1 by taking k there to be k + 1 and k + n to be k + n + 1 and Xi = 0 for all


positive integer i excluding k + 1 and k + n + 1.
If the Markov process is stationary, then

α(n + 1) ≤ 4 sup |P (X1 ∈ A, Xn ∈ B) − P (X1 ∈ A)P (Xn ∈ B)|. (6.27)


A, B

43
Let P n (x, B) represent the probability of Xn+1 ∈ B given X1 = x, and π be the marginal
probability distribution of X1 . Then
Z
P (Xn+1 ∈ B, X1 ∈ A) = P n (x, B)π( dx)
A

From this,
Z
α(n + 2) ≤ 4 · sup | (P n (x, B) − P (Xn ∈ B))π( dx)|.
A,B A

For two probability measures µ and ν, define its total variation distance kµ−νk = supA |µ(A)−
ν(A)|. Then
Z
α(n + 2) ≤ 4 kP n (x, ·) − πkπ( dx), n ≥ 2.

In practice, for example, in the theory of Monto Carlo Markov Chains, we have some
conditions to guarantee the convergence rate of kP n (x, ·) − πk. In other words, we have a
function M (x) ≥ 0 and γ(n) such that

kP n (x, ·) − πk ≤ M (x)γ(n)

for all x and n. Here are some examples:


If a chain is so-called polynomially ergodic of order m, then γ(n) = n−m ;
If a chain is geometrically ergodic, then γ(n) = tn for some t ∈ (0, 1);
If a chain is uniformly geometrically ergodic, then γ(n) = tn for some t ∈ (0, 1) and
supx M (x) < ∞.
These three conditions, combining with the Central Limit Theorems of stationary ran-
dom variables stated in Theorems 20 and 21, imply the following CLT for Markov chains
directly.
R
Let Eπ M = M (x)π( dx).

THEOREM 22 Suppose {X1 , X2 , · · · } is a stationary Markov chain with L(X1 ) = π and


Eπ M < ∞. If the chain is Geometrically ergodic, in particular, if it is uniformly geometri-
cally ergodic, then
Sn − nEX1
√ =⇒ N (0, σ 2 ) (6.28)
n
P
where σ 2 = E(X12 ) + 2 i=2 ∞E(X1 Xk ).

44
THEOREM 23 Suppose {X1 , X2 , · · · } is a stationary Markov chain with L(X1 ) = π and
Eπ M < ∞.
(i) If |X1 | ≤ C for some constant C > 0, and the chain is polynomially ergodic of order
m > 1, then (6.28) holds.
(ii) If the chain is polynomially ergodic of order m, and E|X1 |2+δ < ∞ for some δ > 0
satisfying mδ > 2 + δ, then (6.28) holds.

Remark. In practice, given an initial distribution λ, and a transitional kernel P (x, A),
here is the way to generate a Markov chain: randomly draw X0 = x under distribution λ,
randomly draw X1 under distribution P (X0 , ·) and Xn+1 under P (Xn , ·) for any n ≥ 1.
Through this a Markov chain X1 , X2 , · · · with initial distribution λ and transitional kernel
P (x, ·) has been generated. Let π be the stationary distribution of this chain, that is,
Z
π(·) = P (x, ·)π( dx).

Theorem 17.1.6 (on p. 420 from “Markov Chains and Stochastic Stability” by S.P. Meyn
and R.L. Tweedie) says that a Markov CLT holds for a certain initial distribution λ0 , it
then holds for any initial distribution λ. In particular, if Theorem 22 or Theorem 23 (λ = π
in this case) holds, then the CLT holds for the Markov chain of the same transitional prob-
ability kernel P (x, ·) and stationary distribution π for any initial distribution λ.

7 U -Statistics
Let X1 , · · · , Xn be a sequence of i.i.d. random variables. Sometimes we are interested in
estimating θ = Eh(X1 , · · · , Xm ) for 1 ≤ m ≤ n. To make discussion simple, we assume
h(x1 , · · · , xm ) is a symmetric function. If X(1) ≤ X(2) ≤ · · · ≤ X(n) , the order statistic, is
sufficient and complete, then
1 X
Un = n
 h(Xi1 , · · · , Xim ) (7.1)
m 1≤i1 <···<im ≤n

is an unbiased estimator of θ and is also a function of X(1) , · · · , X(n) , and hence is uniformly
minimu variance unbiased estimator (UMVUE). Tosee Un is a function of X(1) , · · · , X(n) ,
one can rewrite it as
1 X
Un = n
 h(X(i1 ) , · · · , X(im ) ). (7.2)
m 1≤i1 <···<im ≤n

We can see that Un is a conditional expectation of h(x1 , · · · , Xm ). In fact, the following is


true.

45
PROPOSITION 7.1 The equality Un = E(h(X1 , · · · , Xm )|X(1) , · · · , X(n) ) holds for all 1 ≤
m ≤ n.

Proof. One can see that

E(h(X1 , · · · , Xm )|X(1) , · · · , X(n) ) = E(h(Xi1 , · · · , Xim )|X(1) , · · · , X(n) )

for any 1 ≤ i1 < · · · < im ≤ n. Summing both sidees above over all such indices and then
n
dividing by m , we obtain
1 X
E(h(X1 , · · · , Xm )|X(1) , · · · , X(n) ) = n
 h(X(i1 ) , · · · , X(im ) ) = Un
m 1≤i1 <···<im ≤n

by (7.2) 

Example Let θ = E(X12 ). We know E(X12 ) = (1/2)E(X1 − X2 )2 . Taking h(x1 , x2 ) =


(1/2)(x1 − x2 )2 , then the corresponding U -statistic is
X n
2 1 1 X
Un = · (Xi − Xj )2 = (Xi − X̄)2 = S 2 ,
n(n − 1) 2 n−1
1≤i<j≤n I=1

which is the sample variance.

Proof. One-sample Wilcoxon statistic θ = P (X1 + X2 ≤ 0). The statistic is


2 X
Un = I(Xi + Xj ≤ 0)
n(n − 1)
1≤i<j≤n

which is U -statistic.

Example. Gini’s mean difference is to measure the concentration of the distribution,


the parameter is θ = E|X1 − X2 |. The statistic is
2 X
Un = |Xi − Xj |.
n(n − 1)
1≤i<j≤n

This is again a U -statistic.


Now we calculate the variance of U -statistics. Assume E[h(X1 , · · · , Xm )2 ] < ∞. Set

hk (x1 , · · · , xk ) = E (h(X1 , · · · , Xm )|X1 = x1 , · · · , Xk = xk )


= Eh(x1 , · · · , xk , Xk+1 , · · · , Xm ). (7.3)

Evidently, hm (x1 , · · · , xm ) = h(x1 , · · · , xm ).

46
THEOREM 24 (Hoeffding’s Theorem) For a U -statistic given by (7.1) with E[h(X1 , · · · , Xm )2 ] <
∞,
m   
1 X m n−m
V ar(Un ) = n  ζk
m
k m−k
k=1

where ζk = V ar(hk (X1 , · · · , Xk )).

Proof. Consider two sets {i1 , · · · , im } and {j1 , · · · , jm } of m distinct integers from {1, 2, · · · , n}
n  m n−m
with exactly k integers in common. Then the total number of distinct choices is m k m−k .
Now
1 X
V ar(Un ) = 
n 2
Cov(h(Xi1 , · · · , Xim ), h(Xj1 , · · · , Xjm )),
m

where the sum is over all indices 1 ≤ i1 < · · · < im ≤ n and 1 ≤ j1 < · · · < jm ≤ n.
If {i1 , · · · , im } and {j1 , · · · , jm } have no integers in common, the covariance is zero by
independence. Now we classfisy the sum over the number of integers in common. Then
   
1 Pm n m n−m
V ar(Un ) = n 2 k=1 ·
m k m−k
m
Cov(h(X1 , · · · , Xm ), h(X1 , · · · , Xk , Xm+1 , · · · , X2m−k ))
m   
1 X m n−m
= n
 · V ar(ζk )
m
k m−k
k=1

where the fact (from conditioning) that the covariance above equal to V ar(ζk ) is used in
the last step. 

LEMMA 7.1 Under the condition of Theorem 24,


(i) 0 = ζ0 ≤ ζ1 ≤ · · · ≤ ζm = V ar(h);
(ii) For fixed m and some k ≥ 1 such that ζ0 = · · · = ζk−1 = 0 and ζk > 0, we have
m2  
k! k ζk 1
V ar(Un ) = +O
nk nk+1
as n → ∞.

Proof. (i) For any 1 ≤ k ≤ m − 1,

hk (X1 , · · · , Xk ) = E (h(X1 , · · · , Xn )|X1 , · · · , Xk )


= E (hk+1 (X1 , · · · , Xk+1 )|X1 , · · · , Xk ) .

47
Then by Jensen’s inequality, E(h2k ) ≤ E(h2k+1 ). Note that Ehk = Ehk+1 . Then V ar(hk ) ≤
V ar(hk+1 ). So (i) follows.
(ii) By Theorem 24,
   m   
1m n−m 1 X m n−m
V ar(Un ) = n  ζk + n  ζi .
m
k m−k m
i m−i
i=k+1

n n−m
Since m is fixed, m ∼ nm /m! and m−i ≤ nm−i for any k + 1 ≤ i ≤ m. Thus the last
term above is bounded by O(n−(k+1) ) as n → ∞. Now
  
1 m n−m
n ζk
m
k m−k

m! m k ζk (n − m)!
= ·
n(n − 1) · · · (n − m + 1) (m − k)!(n − 2m + k)!
2
k! m
k ζk n · n···n (n − m) · · · (n − 2m + k + 1)
= k
· ·
n n(n − 1) · · · (n − k + 1) (n − k) · · · (n − m + 1)
m2 k−1   2m−k−1   m−1
Y 
k! k ζk Y i −1 Y i i −1
= · 1− · 1− · 1− .
nk n n n
i=1 i=m i=k

Note that since m is fixed, each of the above product is equal to 1 + O(1/n). The conclusion
then follows. 

Let X1 , · · · , Xn be a sample, and Tn be a statistic based on this sample. The projection


of Tn on some random variables Y1 , · · · , Yp is defined by
p
X
Tn′ = ETn + (E(Tn |Yi ) − ETn ).
i=1

Suppose Tn is a symmetric function of X1 , · · · , Xn . Set ψ(Xi ) = E(Tn |Xi ) for i = 1, 2, · · · , n.


Then ψ(X1 ), · · · , ψ(Xn ) are i.i.d. random variables with mean ETn . If V ar(Tn ) < ∞, then
n
X
1
p (ψ(Xi ) − ETn ) → N (0, 1) (7.4)
nV ar(ψ(X1 )) i=1

in distribution as n → ∞. Now Let Tn′ be the projection of Tn on X1 , · · · , Xn . Then


n
X
Tn − Tn′ = (Tn − ETn ) − (ψ(Xi ) − ETn ). (7.5)
i=1

We will next show that Tn − Tn′ is negligible comparing the order appeared in (7.4). Then
the CLT holds for Tn by (7.4) and thre Slusky lemma.

48
LEMMA 7.2 Let Tn be a symmetric statistic with V ar(Tn ) < ∞ for every n, and Tn′ be
the projection of Tn on X1 , · · · , Xn . Then ETn′ = ETn and

E(Tn − Tn′ )2 = V ar(Tn ) − V ar(Tn′ ).

Proof. Since ETn = ETn′ ,

E(Tn − Tn′ )2 = V ar(Tn ) + V ar(Tn′ ) − 2Cov(Tn , Tn′ ).

First, V ar(Tn′ ) = nV ar(E(Tn |X1 )) by independence. Second, by (7.5)

Cov(Tn , Tn′ )
n
X
= V ar(Tn ) + nV ar(E(Tn |X1 )) − 2 Cov(Tn − ETn )(ψ(Xi ) − ETn ). (7.6)
i=1

Now the i-th covariance above is equal to E(Tn E(Tn |Xi ))−{E(E(Tn |Xi ))}2 = V ar(E(Tn |X1 )).
We already see V ar(Tn′ ) = nV ar(E(Tn |X1 )). So the proof is complete by the two facts and
(7.6). 

THEOREM 25 Let Un be the statistic given by (7.1) with E[h(X1 , · · · , Xm )]2 < ∞.

(i) If ζ1 > 0, then n(Un − ETn ) → N (0, m2 ζ1 ) in distribution.
(ii) If ζ1 = 0 and ζ2 > 0, then

m(m − 1) X
n(Un − EUn ) → λj (χ21j − 1),
2
j=1
P∞
where χ21j ’s are i.i.d. χ2 (1)-random variables, and λj ’s are constants satisfying 2
j=1 λj =
ζ2 .

Proof. We will only prove (i) here. The proof of (ii) is very technical, it is omitted. Let
Un′ be the projection of Un on X1 , · · · , Xn . Then
1 X
Un′ = Eh + n  E(h(Xi1 , · · · , Xik )|Xi ) − Eh.
m 1≤i1 <···<im ≤n

Observe that
1 X
n
( E(h(Xi1 , · · · , Xik )|X1 ) − Eh)
m 1≤i1 <···<im ≤n
1 X
= n (h1 (X1 ) − Eh)
m 2≤i2 <···<im ≤n
n−1 
m−1 m
= n  (h1 (X1 ) − Eh) = (h1 (X1 ) − Eh).
m
n

49
Thus
n
mX
Un′ = Eh + (h1 (Xi ) − Eh).
n
i=1

This says that V ar(Un′ ) = (m2 /n)ζ1 , where ζ1 = V ar(h1 (X1 )). By lemma 7.1, V ar(Un ) =
(m2 /n)ζ1 +O(n−2 ). From Lemma 7.2, V ar(Un −Un′ ) = O(n−2 ) as n → ∞. By the Chebyshev

inequality, n(Un − Un′ ) → 0 in probability. The result follows from (7.4) and the Slusky
lemma by noting V art(ψ(X1 )) = V ar(h1 (X1 )) = ζ1 . 

50
8 Empirical Processes
Let X1 , X2 , · · · be a random sample, where Xi takes values in a metric space M. Define a
probability measure µn such that
n
1X
µn (A) = I(Xi ∈ A), A ⊂ M.
n
i=1

The measure µn is called a probability measure. If X1 takes real values, that is, M = R,
set
n
1X
Fn (t) = I(Xi ≤ t), t ∈ R, A = (−∞, t].
n
i=1

The random process {Fn (t), t ∈ R} is called an empirical process. By the LLN and CLT,

Fn (t) → F (t) a.s. and n(Fn (t) − F (t)) =⇒ N (0, F (t)(1 − F (t))),

where F (t) is the cdf of X1 . Among many interesting questions, here we are interested in
the uniform convergence of the above almost sure and weak convergences. Specifically, we
want to answer the following two questions:

1) Does ∆n := sup |Fn (t) − F (t)| → 0 a.s.?


t
2) Fix n ≥ 1, regard {Fn (t); t ∈ R} as one point in a big space, does the CLT hold?

The answers are yes for both questions. We first answer the first question.

THEOREM 26 (Glivenko-Cantelli) Let X1 , X2 , · · · be a sequence of i.i.d. random vari-


ables with cdf F (t). Then

∆n = sup |Fn (t) − F (t)| → 0 a.s.


t

as n → ∞.

Remark. When F (t) is continuous and strictly increasing in the support of X1 , the above
the distribution of ∆n is independent of the distribution of X1 . First, F (Xi )’s are i.i.d.
U [0, 1]-distributed random variables. Second,
1X 1X
∆n = sup | I(Xi ≤ t) − F (t)| = sup | I(F (Xi ) ≤ F (t)) − F (t)|
t n t n
i i
1X
= sup | I(Yi ≤ y) − y|,
0≤y≤1 n i

51
where {Yi = F (Xi ); 1 ≤ i ≤ n} are i.i.d. random variables with uniform distribution over
[0, 1].
Proof. Set
n
1X
Fn (t−) = I(Xi < t) and F (t−) = P (X1 < t), t ∈ R.
n
i=1

By the strong law of large numbers, Fn (t) → F (t) a.s. and Fn (t−) → F (t−) a.s. as n → ∞.
Given ǫ > 0, choose −∞ = t0 < t1 < · · · < tk = ∞ such that F (ti −) − F (ti−1 ) < ǫ for every
i. Now, for ti−1 ≤ t < ti ,

Fn (t) − F (t) ≤ Fn (ti −) − F (ti −) + ǫ


Fn (t) − F (t) ≥ Fn (ti ) − F (ti ) − ǫ.

This says that

∆n ≤ max {|Fn (ti ) − F (ti )|, |Fn (ti −) − F (ti −)|} + ǫ → ǫ a.s.
1≤i≤k

as n → ∞. The conclusion follows by letting ǫ ↓ 0. 


Given (t1 , t2 , · · · , tk ) ∈ Rk , it is easy to check that n(Fn (t1 ) − F (t1 ), · · · , Fn (tk ) −
F (tk )) =⇒ Nk (0, Σ) where

Σ = (σij ) and σij = F (ti ∧ tj ) − F (ti )F (tj ), 1 ≤ i, j ≤ n. (8.1)

Let D[−∞, ∞] be the set of all functions defined on (−∞, ∞) such that they are right-
continuous and the left limits exist everywhere. Equipped with a so-called Skorohod metric,

it becomes a Polish space. The random vector n(Fn − F ), viewed as an element in
D[−∞, ∞], converges weakly to a continuous Gaussian process ξ(t), where ξ(±∞) = 0 and
the covariance structure is given in (8.1). This is usually called a Brownian bridge. It
has same distribution of Gλ ◦ F, where Gλ is the limiting point when F is the cumulative
distribution function of the uniform distribution over [0, 1].

THEOREM 27 (Donsker). If X1 , X2 , · · · are i.i.d. random variables with distribution



function F, then the sequence of empirical processes n(Fn − F ) converges in distribution
in the space D[−∞, ∞] to a random element GF , whose marginal distributions are zero-
mean with covariance function (8.1).

Later we will see that


√ √
nkFn − F k∞ = n sup |Fn (t) − F (t)| =⇒ sup |GF (t)|,
t t

52
where
  ∞
X 2 2
P sup |GF (t)| ≥ x =2 (−1)j+1 e−2j x , x ≥ 0.
t
j=1

Also, the DKW (Dvoretsky, Kiefer, and Wolfowitz) inequality says that
√ 2
P ( nkFn − F k∞ > x) ≤ 2e−2x , x > 0.

Let X1 , X2 , · · · be i.i.d. random variables taking values in (X , A) and L(X1 ) = P. We


R
write µf = f (x)µ( dx). So
n Z
1X
Pn f = f (Xi ) and P f = f (x)P ( dx).
n X
i=1

By the Glivenko-Cantelli theorem

sup |Pn ((−∞, x]) − P ((−∞, x])| → 0 a.s.


x∈R

Also, if P has no point mass, then

sup |Pn (A) − P (A)| = 1


any measurable set A

because Pn (A) = 1 and P (A) = 0 for A = {X1 , X2 , · · · , Xn }. We are searching for F such
that

sup |Pn f − P f | → 0 a.s. (8.2)


f ∈F

The previous two examples say (8.2) holds for F = {I(−∞, x], x ∈ R} but doesn’t hold for
the power set 2R . The class F is called P -Glivenko-Cantelli if (8.2) holds.

Define Gn = n(Pn − P ). Given k measurable functions f1 , · · · , fk such that P fi2 < ∞
for all i, one can also check by the multivariate Central Limit Theorem (CLT) that

(Gn f1 , · · · , Gn fk ) =⇒ Nk (0, Σ)

where Σ = (σij )1≤i,j≤n and σij = P (fi fj ) − (P fi )P fj . This tells us that {Gn f, f ∈ F}
satisfies CLT in Rk . If F has infinite many members, how we define weak convergence? We
first define a space similar to Rk . Set

l∞ (F) = {bounded function z : F → R}

with norm kzk∞ = supF ∈F |z(F )|. Then (l∞ (F), k · k∞ ) is a Banach space; it is separable
if F is countable. When F has finite elements, (l∞ (F), k · k∞ ) = (Rk , k · k∞ ).
So, Gn : F → R is a random element and can be viewed as an element in Banach space
l∞ (F). We say F is P -Donsker if Gn =⇒ G, where G is a tight element in l∞ (F).

53
LEMMA 8.1 (Berstein’s inequality). Let X1 , · · · , Xn be independent random variables
P
mean zero, |Xj | ≤ K for some constant K and all j, and σj2 = EXj2 > 0. Let Sn = ni=1 Xi
P
and s2n = nj=1 σj2 . Then
2 /2(s2 +Kx)
P (|Sn | ≥ x) ≤ 2e−x n , x > 0.

Proof. It suffices to show that


2 /2(s2 +Kx)
P (Sn ≥ x) ≤ e−x n , x > 0. (8.3)

First, for any λ > 0,

P (Sn > x) ≤ e−λx EeλSn = e−λx Πni=1 EeλXi .

Now
∞ ∞
!
X λi σj2 λ2 X σj2 λ2 σj2 λ2
EeλXj = 1 + EXji ≤ 1 + (λK)i−2 = 1 + ≤ exp
i! 2 2(1 − λK) 2(1 − λK)
i=2 i=2

if λK < 1. Thus
 
λ2 s2n
P (Sn > x) ≤ exp −λx +
2(1 − λK)
Now (8.3) follows by choosing λ = x/(s2n + Kx). 

Pn √ √
Recall Gn f = − P f )/ n. The random variable Yi := (f (Xi ) − P f )/ n
i=1 (f (Xi )
P
here corresponds to Xi in the above lemma. Note that V ar( ni=1 Yi ) = V ar(f (X1 )) ≤ P f 2
√ P
and kYi k∞ ≤ 2kf k∞ / n. Apply the above lemma to ni=1 Yi , we obtain

COROLLARY 8.1 For any bounded, measurable function f,


 
1 x2
P (|Gn f | ≥ x) ≤ 2 exp − √
4 P f 2 + xkf k∞ / n
for any x > 0.

Notice that

P (|Gn f | ≥ x) ≤ 2e−Cx when x is large and


2
P (|Gn f | ≥ x) ≤ 2e−Cx when x is small. (8.4)

Let’s estimate E max1≤i≤m |Yi | provided m is large and P (|Yi | ≥ x) ≤ e−x for all i and
x > 0. Then E|Yi |k ≤ k! for k = 1, 2, · · · . The immediate estimate is
m
X
E max |Yi | ≤ E|Yi | ≤ m.
1≤i≤m
i=1

54
By Holder’s inequality,
Xm
2 1/2

E max |Yi | ≤ (E max |Yi | ) ≤( E|Yi |2 )1/2 ≤ 2m.
1≤i≤m 1≤i≤m
i=1

Let ψ(x) = ex − 1, x ≥ 0. Following this logic, by Jensen’s inequality

ψ(E max |Yi |/2) ≤ m · max Eψ(|Yi |/2) ≤ 2m.


1≤i≤m 1≤i≤m

Take inverse ψ −1 for both sides. We obtain that

E max |Yi | ≤ 2 log(2m).


1≤i≤m

This together with (8.4) leads to the following lemma.

LEMMA 8.2 For any finite class F of bounded, measurable, square-integrable functions
with |F| elements, there is an universal constant C > 0 such that
log(1 + |F|) p
C · E max |Gn f | ≤ max kf k∞ √ + max kf kP,2 log(1 + |F|) .
f f n f

where kf kP,2 = (P f 2 )1/2 .



Proof. Define a = 24kf k∞ / n and b = 24P f 2 . Define Af = (Gn f )I{|Gn f | > b/a} and
Bf = (Gn f )I{|Gn f | ≤ b/a}, then Gn f = Af + Bf . It follows that

E max |Gn f | ≤ E max |Af | + E max |Bf |. (8.5)


f f f

For x ≥ b/a and x ≤ b/a the exponent in the Bernstein inequality is bounded above by
−3x/a and −3x2 /b, respectively. By Bernstein’s inequality,
 
3x
P (|Af | ≥ x) ≤ P (|Gn f | ≥ x ∨ b/a) ≤ 2 exp − ,
a
 
3x2
P (|Bf | ≥ x) = P (b/a ≥ |Gn f | ≥ x) ≤ P (|Gn f | ≥ x ∧ b/a) ≤ 2 exp −
b
for all x ≥ 0. Let ψp (x) = exp(xp ) − 1, x ≥ 0, p ≥ 1.
  Z |Af |/a Z ∞
|Af |
Eψ1 =E ex dx = P (|Af | > ax)ex dx ≤ 1.
a 0 0

By a similar argument we find that Eψ2 (|Bf |/ b) ≤ 1. Because ψp (·) is convex for all p ≥ 1,
by Jensen’s inequality
    X  |Af | 
|Af | maxf |Af |
ψ1 E max ≤ Eψ1 ≤E ψ1 ≤ |F|.
f a a a
f

55
Taking inverse function, we have the first term on the right hand side. Similarly we have
the second term. The conclusion follows from (8.5). 

We actually use the following lemma above

LEMMA 8.3 Let X be a real valued random variable and f : [0, ∞) → R be differentiable
Rs R∞
with 0 |f ′ (t)| dt < ∞ for any s > 0 and 0 P (X > t)|f ′ (t)| dt < ∞. Then
Z ∞
Ef (X) = P (X ≥ t)f ′ (t) dt + f (0).
0
Rt
Proof. First, since 0|f ′ (t)| dt < ∞ for any t > 0, we have
Z X Z ∞

f (X) − f (0) = f (t) dt = I(X ≥ t)f ′ (t) dt.
0 0
Then, by Fubini’s theorem,
Z ∞ Z ∞

Ef (X) = E I(X ≥ t)f (t) dt + f (0) = P (X ≥ t)f ′ (t) dt + f (0). 
0 0
Remark. When f (x) is differentiable, f ′ (x) is not necessarily Lebesgue integrable even
though it is Riemann-integrable. The following is an example: Let

x2 cos(1/x2 ), if x =
6 0;
f (x) =
0, otherwise.
Then

2x cos(1/x2 ) + (2/x) sin(1/x2 ), if x 6= 0;
f ′ (x) =
0, otherwise
is not Lebesgue-integrable on [0, 1], but obviously Riemann-integrable. We only need to
show g(x) := (2/x) sin(1/x2 ) is not integrable. Suppose it is,
Z 1   Z
1 1 1 1 | sin t|
sin x2 dx = 2 dt = +∞,
0 x 0 t
yields a contradiction.

Now we describe the size of F. For any f ∈ F, define

kf kP,r = (P |f |r )1/r .

Let l and u be two functions, the bracket [l, u] is set of all functions f such that l ≤ f ≤ u.
An ǫ-bracket in Lr (P ) is a bracket [l, u] such that ku − lkP,r < ǫ. The bracketing number
N[ ] (ǫ, F, Lr (P )) is the minimum numbers of ǫ-brackets needed to cover F. The entropy with
bracketing is the logarithm of the bracketing number.

56
8.1 Outer Measures and Expectations

Recall X is a random variable from (Ω, G) → (R, B(R)) if X −1 (B) ∈ G for any set B ∈ B(R).
For an arbitrary map X the inverse may not be in G), particularly for small G). For
example, when G = {∅, Ω}, many maps are not random variables. In empirical processes,
we will deal with Z =: supt∈T Xt for some index set T. If T is big, then Z may not be
measurable. It does not make sense to study expectations and probabilities related to such
a random variable. But we have another way to get around this.
Definition Let X be an arbitrary map from (Ω, G, P ) → (R, B(R)). Define

E ∗ X = inf{EY ; Y ≥ X and Y is a measurable map : (Ω, G) → (R, B(R))};


P ∗ (X ∈ A) = inf{P (X ∈ B); {X ∈ B} ∈ G, B ⊃ A}, A, B ∈ B(R).

One can easily show that E ∗ X, as an infimum, can be achieved, i.e., there exists a random
variable X ∗ : (Ω, F, P ) → (R, B(R)) such that EX ∗ = E ∗ X. Further, X ∗ is P -almost surely
unique, i.e., if there exist two such random variables X1∗ and X2∗ , then P (X1∗ = X2∗ ) = 1.
We call X ∗ the measurable cover function. Obviously

(X1 + X2 )∗ ≤ X1∗ + X2∗ and X1∗ ≤ X2∗ if X1 ≤ X2 .

One can define E∗ and P∗ similarly.


Let (M, d) be a metric space. A sequence of arbitrary maps Xn : (Ωn , Gn ) → (M, d)
converges in distribution to a random vector X if

E ∗ f (Xn ) → Ef (X)

for any bounded, continuous function f defined on (M, d). We still have an analogue of the
Portmanteau theorem.

THEOREM 28 The following are equivalent:


(i) E ∗ f (Xn ) → Ef (X) for every bounded, continuous function f defined on (M, d);
(ii) E ∗ f (Xn ) → Ef (X) for every bounded, Lipschitz function f defined on (M, d), that
is, there is a constant C > 0 such that |f (x) − f (y)| ≤ Cd(x, y) for any x, y ∈ M ;
(iii) lim inf n P∗ (Xn ∈ G) ≥ P (X ∈ G) for any open set G;
(iv) lim supn P ∗ (Xn ∈ F ) ≤ P (X ∈ F ) for any closed set F ;
(v) lim supn P ∗ (Xn ∈ H) = P (X ∈ H) for any set H such that P (X ∈ ∂H) = 0, where
∂H is the boundary of H.

57
Let
Z δ q
J[ ] (δ, F, L2 (P )) = log N[ ] (ǫ, F, L2 (P )) dǫ.
0

The proof of the following thorem is easy. It is omitted.

THEOREM 29 Every class F of measurable functions such that N[ ] (ǫ, F, L1 (P )) < ∞ for
every ǫ > 0 is P -Glivenko-Cantelli.

THEOREM 30 Every class F of measurable functions with J[ ] (1, F, L2 (P )) < ∞ is P -


Donsker.

To prove the theorem, we need some preparation.

THEOREM 31 A sequence of arbitrary maps Xn = (Xn,t , t ∈ T ) : (Ωn , Gn ) → l∞ (T )


converges weakly to a tight random element if and only if both of the following conditions
hold:
(i) The sequence (Xn,t1 , · · · , Xn,tk ) converges in distribution in Rk for every finite set
of points t1 , · · · , tk in T ;
(ii) for every ǫ, η > 0 there exists a partition of T into finitely many sets T1 , · · · , Tk
such that
!

lim sup P sup sup |Xn,s − Xn,t | ≥ ǫ ≤ η.
n→∞ i s,t∈Ti

Proof. We only prove sufficiency.


Step 1: A preparation. For each integer m ≥ 1, let T1m , · · · , Tkmm be a partition of T such
that
!
1 1
lim sup P ∗ sup sup |Xn,s − Xn,t | ≥ m ≤ . (8.1)
n→∞ j s,t∈Tjm 2 2m

Since the supremum above becomes smaller when a partition becomes more refined, w.l.o.g.,
assume the partitions are successive refinements as m increases. Define a semi-metric

0, if s, t belong to the same partitioning set T m for some j;
j
ρm (s, t) =
1, otherwise.

Easily, by the nesting of the partitions, ρ1 ≤ ρ2 ≤ · · · . Define



X ρm (s, t)
ρ(s, t) = for s, t ∈ T.
2m
m=1

58
P∞ −k
Obviously, ρ(s, t) ≤ k=m+1 2 ≤ 2−m when s, t ∈ Tjm for some j. So (T, ρ) is totally
bounded. Let T0 be the countable ρ-dense subset constructed by choosing an arbitrary
point tm m
j from every Tj .
Step 2: Construct the limit of Xn . For two finite subsets of T, S = (s1 , · · · , sp ) and
U = (s1 , · · · , sp , , · · · , sq ), by assumption (i), there are two probability measures µp on Rp
and µq on Rq such that

(Xn,s1 , · · · , Xn,sp ) =⇒ µp ; (Xn,s1 , · · · , Xn,sp , · · · , Xn,sq ) =⇒ µq


µq (A × Rq−p ) = µp (A) for any Borel set A ⊂ Rp .

By the Kolmogorov consistency theorem there exists a stochastic process {Xt ; t ∈ T0 } on


some probability space such that (Xn,t1 , · · · , Xn,tk ) =⇒ (Xt1 , · · · , Xtk ) for every finite set
of points t1 , · · · , tk in T0 . It follows that, for a finite set S ⊂ T,

sup sup |Xn,s − Xn,t | → sup sup |Xs − Xt |


i s,t∈Tim i s,t∈Tim
s,t∈S s,t∈S

as n → ∞ for fixed m ≥ 1. By the Portmanteau theorem and (8.1)


   
 1   1  1
P sup sup |Xs − Xt | > m  ≤ lim sup P ∗ sup sup |Xn,s − Xn,t | ≥ m  ≤ m
m
i s,t∈Ti 2 n→∞ m
i s,t∈Ti 2 2
s,t∈S s,t∈S

for each m ≥ 1. By Monotone Convergence Theorem, since T is countable, let S ↑ T, we


obtain that
 
 1  1
P sup sup |Xs − Xt | > m  ≤ m .
m
i s,t∈Ti 2 2
s,t∈T0

If ρ(s, t) < 2−m , then ρm (s, t) < 1 (otherwise, ρm (s, t) = 1 implies ρ(s, t) ≥ 2−m ). This
says that s, t ∈ Tjm for the same j. So the above inequality implies that
 
 1  1
P sup |Xs − Xt | > m ≤ m
ρ(s,t)<2−m 2 2
s,t∈T0

for each m ≥ 1. By the Borel-Cantelli lemma, when m is large, with probability one
1
sup |Xs − Xt | ≤
ρ(s,t)<2−m 2m
s,t∈T0

59
as m is large enough. This says that, with probability one, Xt , as a function of t is uniformly
continuous on T0 . Extending this from T0 to T, we obtain a ρ-continuous process Xt , t ∈ T.
We then have that with probability one
1 1
|Xs − Xt | ≤ m
for any s, t ∈ T with ρ(s, t) < m (8.2)
2 2
for sufficiently large m.
Step 3: Show Xn =⇒ X. Define πm : T → T as the map that maps every point in
the partitioning set Tim onto the point tm m
i ∈ Ti . For any bounded, Lipschitz continuous
function f : l∞ (T ) → R,

lim |E ∗ f (Xn ) − Ef (X)| ≤ lim |E ∗ f (Xn ) − E ∗ f (Xn ◦ πm )|


n→∞ n→∞
+ lim |E ∗ f (Xn ◦ πm ) − Ef (X ◦ πm )| + E|f (X ◦ πm ) − f (X)|.
n→∞

Xn ◦ πm and X ◦ πm are essentially finite dimensional random vectors, by assumption (i),


the first one converges to the second in distribution. So the middle term above is zero.
Now, if s, t ∈ Tim , ρ(s, t) ≤ 2−m . It follows that
1
kX ◦ πm − Xk∞ = sup sup |Xt − Xtm |< . (8.3)
i
i
t∈Tim 2m−1

as m is large enough. Thus, E|f (X ◦ πm ) − f (X)| ≤ kf k∞ 2−m+1 . Since f is Lipschitz,



|E ∗ f (Xn ) − E ∗ f (Xn ◦ πm ))| ≤ kf k∞ 2−m + P kXn − Xn ◦ πm k ≥ 2−m

By the same argument as in (8.3), using (8.1) to obtain

lim |E ∗ f (Xn ) − Ef (X)| ≤ kf k∞ 2−m + 2−m + 2−m+1 kf k∞ .


n→∞

Letting m → ∞, we have that E ∗ f (Xn ) → Ef (X) as n → ∞. 

Let F be a class of measurable functions f : X → R. Review


δ
a(δ) = q ,
log N[ ] (δ, F, L2 (P ))
Z δq
J[ ] (δ, F, L2 (P )) = log N[ ] (ǫ, F, L2 (P )) dǫ.
0

LEMMA 8.4 Suppose there is δ > 0 and a measurable function F > 0 such that P ∗ f 2 < δ2
and |f | ≤ F for every f ∈ F. Then there exists a constant C > 0 such that
√ √
C · EP∗ kGn kF ≤ J[ ] (δ, F, L2 (P )) + nP ∗ F {F > na(δ)}.

60
√ P
Proof. Step 1: Truncation. Recall Gn f = (1/ n) ni=1 (f (Xi ) − P f ). If |f | ≤ g, then
n
1 X
|Gn f | ≤ √ (g(Xi ) + P g).
n
i=1

It follows that
√ √ √
E ∗ Gn f I{F > na(δ)} F ≤ 2 nP F {F > na(δ)}.

We will bound E ∗ kGn f I{F ≤ na(δ)}kF next. The bracketing number of the class of

functions f I{F > na(δ)} if f ranges over F are smaller than the bracketing number of

the class F. To simplify notation, we assume, w.l.o.g., |f | ≤ na(δ) for every f ∈ F.
Step 2: Discretization of the integral. Choose an integer q0 such that 4δ ≤ 2−q0 ≤ 8δ.
We claim there exists a nested sequence of F-partitions {Fqi ; 1 ≤ i ≤ Nq }, indexed by the
integers q ≥ q0 , into Nq disjoint subsets and measurable functions ∆qi ≤ 2F such that

X 1p Z δq
C log N q ≤ log N[ ] (ǫ, F, L2 (P )) dǫ, (8.4)
2q 0
q≥q0
1
sup |f − g| ≤ ∆qi , and P ∆2qi ≤ (8.5)
f,g∈Fqi 22q

for an universal constant C > 0. For convenience, write N (ǫ) = N[ ] (ǫ, F, L2 (P )). Thus,
Z δ p Z 1/2q0 +3 p ∞
X Z 1/2q p
log N (ǫ) dǫ ≥ log N (ǫ) dǫ = log N (ǫ) dǫ
0 0 q+1
q=q0 +3 1/2
s  

X 1 1
≥ log N .
2q+1 2q−3
q=q0 +3

Re-indexing the sum, we have that


∞ Z δp
1 X 1p
log Nq ≤ log N (ǫ) dǫ,
16 q=q 2q 0
0

where Nq = N (2−q ). By definition of Nq , there exists a partition {Fqi = [lqi , uqi ]; 1 ≤ i ≤


Nq }, where lqi and uqi are functions such that E(uqi − lqi )2 < 2−q . Set ∆qi = uqi − lqi . Then
∆qi ≤ 2F and supf,g∈Fqi |f − g| ≤ ∆qi , and P ∆2qi ≤ 2−2q .
We can also assume, w.l.o.g., that the partitions {Fqi ; 1 ≤ i ≤ Nq } are successive
refined as q increases. Actually, at q-th level, we can make intersections of elements from
{Fki ; 1 ≤ i ≤ Nk } as k goes from q0 to q : ∩qk=q0 Fki . The total possible number of
such intersections is no more than N̄q := Nq0 Nq0 +1 · · · Nq . Since the current Fqi ’s become

61
smaller, all requirementsqon Fqi still hold obviously except possibly (8.4). Now we verify
P √
(8.4). Actually, noticing log N̄q ≤ qk=q0 log Nk . Then

X 1q q
X X 1p
∞ X
X 1p

X 1p
q
log N̄ q ≤ q
log N k = q
log N k = 2 log Nk .
2 2 2 2k
q≥q0 q≥q0 k=q0 k=q0 q≥k k=q0

So (8.4) still holds when replacing Nq by N̄q .

Step 3: Chaining-a skill. For each fixed q ≥ q0 , choose a fixed element fqi from each
partition set Fqi , and set

πq f = fqi , ∆q f = ∆qi , if f ∈ Fqi . (8.6)

Thus, πq f and ∆q f runs through a set of Nq functions if f run through F. Without loss of
generality, we can assume

∆q f ≤ ∆q−1 f for any f ∈ F and q ≥ q0 + 1. (8.7)

¯ qi be the measurable cover function of supf,g∈F |f −g|. Then (8.5) also holds.
Actually, let ∆ qi

Let Fq−1 j be a partition set at the q−1 level such that Fqi ⊂ Fq−1 j . Then supf,g∈Fqi |f −g| ≤
supf,g∈F |f − g|. Thus, ∆ ¯ qi ≤ ∆
¯ q−1 j . The assertion (8.7) follows by replacing ∆qi with
q−1 j
¯ qi .

P P
By (8.6), P (πq f − f )2 ≤ maxi P ∆2qi < 2−2q . Thus, P q (πq f − f )2 = q P (πq f − f )2 <
P
∞. So q (πq f − f )2 < ∞ a.s. This implies

πq f → f a.s. on P (8.8)

as q → ∞ for any f ∈ F. Define


p
aq = 2−q / log Nq+1 ,

τ = τ (n, f ) = inf{q ≥ q0 : ∆q f > naq }.

The value of τ is thought to be +∞ if the above set is empty. This is the first time ∆q f >

naq . By construction, 2a(δ) = 2δ(log N[ ] (δ, F, L2 (P )))−1/2 ≤ aq0 since the denominator is
√ √
decreasing in δ. By Step 1, we know that |∆q f | ≤ 2 na(δ) ≤ naq0 . This says that τ > q0 .
We claim that

X ∞
X
f − πq 0 f = (f − πq f )I{τ = q} + (πq f − πq−1 f )I{τ ≥ q}, a.s on P. (8.9)
q0 +1 q0 +1
Pq1
In fact, write f − πq0 f = (f − πq1 f ) + q0 +1 (πq f − πq−1 f ) for q1 > q0 . Now,

62
(i) If τ = ∞, the right hand side above is identical to limq→∞ πq f − πq0 f = f − πq0 f a.s.
by (8.8);
Pq1
(ii) If τ = q1 < ∞, the right hand is equal to (f − πq1 f ) + q0 +1 (πq f − πq−1 f ).

Step 4. Bound terms in the chain. Apply Gn to the both sides of (8.9). For |f | ≤ g,

note that |Gn f | ≤ |Gn g| + 2 nP g. By (8.6), |f − πq f | ≤ ∆q f. One obtains that
f

X z }| {
E∗k Gn (f − πq f )I{τ = q} kF
q0 +1

X ∞
√ X
≤ E ∗ kGn (∆q f I{τ = q})kF + 2 n kP (∆q f I{τ = q})kF . (8.10)
q0 +1
| {z } q +1 0
g

Now, by (8.7), ∆q f I{τ = q} ≤ ∆q−1 f I{τ = q} ≤ naq−1 . Moreover, P (∆q f I{τ = q})2 ≤
2−2q . By Lemma 8.2, the middle term above is bounded by

X p
(aq−1 log Nq + 2−q log Nq ).
q0 +1

By Holder’s inequality, P (∆q f I{τ = q}) ≤ (P (∆q f )2 )1/2 P (τ = q)1/2 ≤ 2−q P (∆q f >
√ √ √
naq )1/2 ≤ 2−q ( naq )−1 (P (∆q f )2 )1/2 ≤ 2−2q ( naq )−1 . So the last term in (8.10) is
bounded by 2 · 2−2q /aq . In summary,

X ∞ X∞

p
E
Gn (f − πq f )I{τ = q} ≤ C 2−q log Nq (8.11)
q0 +1 q0 +1
F

for some universal constant C > 0.


Second, there are at most Nq functions πq f −πq−1 f and two values the indicator functions
I(τ ≥ q) takes. Because the partitions are nested, the function |πq f − πq−1 f |I{τ ≥ q} ≤

∆q−1 f I{τ ≥ q} ≤ naq−1 . The L2 (P )-norm of πq f − πq−1 f is bounded by 2−q+1 . Applying
Lemma 8.2 again to obtain

∞ ∞
X X p

E Gn (πq f − πq−1 f )I{τ ≥ q}
≤ (aq−1 log Nq + 2−q log Nq )
q0 +1 q0 +1
F

X p
≤ C 2−q log Nq (8.12)
q0 +1

for some universal constant C > 0.

63
√ √
At last, we consider πq0 f. Because |πq0 f | ≤ F ≤ a(δ) n ≤ naq0 and P (πq0 f )2 ≤ δ2
by assumption, another application of Lemma 8.2 leads to
p
E ∗ kGn πq0 f kF ≤ aq0 log Nq0 + δ log Nq0 .

In view of the choice of q0 , this is no more than the bound in (8.12). All above inequalities
together with (8.4) yield the desired result. 

COROLLARY 8.2 For any class F of measurable functions with envelope function F, there
exists an universal constant C such that

EP∗ kGn kF ≤ C · J[ ] (kF kP,2 , F, L2 (P )).

Proof. Since F has a single bracket [−F, F ], we have that N[ ] (δ, F, L2 (P )) = 1 for δ =
2kF kP,2 . Review the definition in Lemma 8.4. Choose δ = kF kP,2 . It follows that
kF kP,2
a(δ) = q .
log N[ ] (kF kP,2 , F, L2 (P ))

√ √ q
Now nP ∗ F I(F > na(δ)) ≤ kF k2P,2 /a(δ) = kF kP,2 log N[ ] (kF kP,2 , F, L2 (P )) by Markov’s
inequality, which is bounded by J[ ] (kF kP,2 , F, L2 (P )) since the integrand is non-decreasing
and hence
Z kF kP,2 q q
log N[ ] (ǫ, F, L2 (P )) dǫ ≥ kF kP,2 log N[ ] (kF kP,2 , F, L2 (P )). 
0

Proof of Theorem 30. We will use Theorem 31 to prove this theorem. The part (i) is
easily satisfied. Now we verify (ii).
Note there is no cover on F and we don’t know if P f 2 < δ2 . Let G = {f − g, f, g ∈ F}.
With a given set of ǫ-brackets [li , ui ] over F we can construct 2ǫ-brackets over G by taking
differences [li − uj , ui − lj ] of upper and lower bounds. Therefore, the bracketing number
N[ ] (2ǫ, G, L2 (P )) are bounded by the square of the bracketing number N[ ] (ǫ, F, L2 (P )).
Easily, k(ui − lj ) − (li − uj )k ≤ 2ǫ. Hence

N[ ] (2ǫ, G, L2 (P )) ≤ N[ ] (ǫ, F, L2 (P ))2 .

This says that

J[] (ǫ, G, L2 (P )) < ∞. (8.13)

64
For a given, small δ > 0, by the definition of N[ ] (δ, F, L2 (P )), choose a minimal number
of brackets of size δ that cover F, and use them to form a partition of F = ∪i Fi . The subsets
G consisting of differences f − g of functions f and g belonging to the same partitioning
set consists of functions of L2 (P )-norm smaller than δ. Hence, by Lemma 8.4, there exists
a finite number a(δ) := a(δ)F < a(2δ)G and a universal constant C such that

C · E ∗ sup sup |Gn (f − g)| = CE ∗ sup |Gn h|


i f,g∈Fi h∈H
√ √
≤ J[] (δ, H, L2 (P )) + 2 nP F I(F > a(δ) n)
√ √
≤ J[ ] (δ, G, L2 (P )) + 2 nP F I(F > a(δ) n).

since H := {f − g; f, g ∈ Fi , for all i} ⊂ G, where the envelope function F can be taken


equal to the supremum of the absolute values of the upper and lower bounds of finitely
many brackets that cover F, for instance a minimal set of brackets of size 1. This F is

square integrable. The second term above is bounded by a(δ)−1 P F 2 I(F > a(δ) n) → 0 as
n → ∞ for fixed δ. First let n → ∞, then let δ ↓ 0 the left hand side goes to 0. So part (ii)
in (31) is valid by using Markov’s inequality. 

Example. Let F = {ft = I(−∞, t], t ∈ R}. Then

kft − fs k2,P = (F (t) − F (s))1/2 for s < t.

Cut [0, 1] into many pieces with length less than ǫ2 . Since F (t) is increasing, right continuous
and of left limits, there exist a partition, say, −∞ = t0 < t1 < · · · < tk = ∞ with
k = [1/ǫ2 ] + 1 such that

F (ti −) − F (ti−1 ) < ǫ2 for i = 1, 2, · · · , k.

Since kI(−∞, ti )−I(−∞, ti−1 ]kP,2 = (F (ti −)−F (ti−1 ))1/2 < ǫ. Namely, (I(−∞, ti−1 ], I(−∞, ti ))
is an ǫ-bracket for i = 1, 2, · · · , k. It follows that
2
N[ ] (ǫ, F, L2 (P )) ≤ for ǫ ∈ (0, 1).
ǫ2
R1p
But 0 log(1/ǫ) dǫ < ∞. The previous classical Glivenko-Cantelli’s Lemma 26 and Donsker’s
Theorem 27 follow from Theorems 29 and 30.
Since kf k = supt |f (t)| is the norm in l∞ , it is continuous. By the Delta method, we
have

sup n|Fn (t) − F (t)| → sup G(t)
t∈R t∈R

65
weakly, i.e., in distribution, as n → ∞, where G(t) is the Gaussian process we mentioned
before.

Example. Let F = {fθ ; θ ∈ Θ} be a collection of measurable functions with Θ ⊂ Rd


bounded. Suppose P mr < ∞ and |fθ1 (x) − fθ2 (x)| ≤ m(x)kθ1 − θ2 k for any θ1 and θ2 . Then
 
diam(Θ) d
N[ ] (ǫkmkP,r , F, Lr (P )) ≤ K (8.14)
ǫ
for any 0 < ǫ < diam(Θ). As long as kθ1 − θ2 k < ǫ, we have that l(x) := fθ1 (x)− m(x)ǫ(x) ≤
fθ2 (x) ≤ fθ1 (x) + m(x)ǫ(x) =: u(x). Also, ku − lkP,r = ǫkmkP,r . Naturally, [l, u] is an
ǫkmkP,r -bracket. It suffices to calculate the minimal number of balls of radius ǫ need to
cover Θ.
Note that Θ ⊂ Rd is in a cube with side length no bigger than diam(Θ). We can cover
Θ with fewer than (2diam(Θ)/ǫ)d cubes of size ǫ. The circumscribed balls have radius a
multiple of ǫ and also cover Θ. The intersection of these balls with Θ cover Θ. So the claim
is true. This says that F is a P -Donsker class.

Example (Sobolev classes). For k ≥ 1, let


 Z 1 
(k) 2
F = f : [0, 1] → R; kf k∞ ≤ 1 and (f (x)) dx ≤ 1 .
0

Then there exists an universal constant K such that


 1/k
1
log N[ ] (ǫ, F, k · k∞ ) ≤ K
ǫ
Since kf kP,2 ≤ kf k∞ for any P, it is easy to check that N[ ] (ǫ, F, L2 (P )) ≤ N[ ] (ǫ, F, k · k∞ )
for any P. So F is P -Donsker class for any P.

Example (Bounded Variation). Let

F = {f : R → [−1, 1] of bounded variation 1}.

Any function of bounded variation is the difference of two monotone increasing functions.
Then for any r ≥ 1 and probability measure P,
 
1
log N[ ] (ǫ, F, Lr (P )) ≤ K .
ǫ
Therefore F is P -Donsker for every P.

66
9 Consistency and Asymptotic Normality of Maximum Like-
lihood Estimators

9.1 Consistency

Let X1 , X2 , · · · , Xn be a random sample from a population distribution with pdf or pmf


f (x|θ), where θ is an unknown parameter. The MLE θ̂ under certain conditions on f (x|θ)
will be consistent and satisfy Central Limit Theorems. But for some cases those will not
be true. Let’s see a good example and a pathological example.
Example. Let X1 , X2 , · · · , Xn be i.i.d. from Exp(θ) with pdf

θe−θx , if x > 0;
f (x|θ) =
0, otherwise.

So EX1 = 1/θ and V ar(X1 ) = 1/θ 2 . It is easy to check that θ̂ = 1/X̄. By the CLT,
 
√ 1
n X̄ − =⇒ N (0, θ −2 )
θ
as n → ∞. Let g(x) = 1/x, x > 0. Then g(EX1 ) = θ and g′ (EX1 ) = −θ 2 . By the Delta
method,

n(θ̂ − θ) =⇒ N (0, g′ (µ)2 σ 2 ) = N (0, θ 2 )

as n → ∞. Of course, θ̂ → θ in probability.
Example. Let X1 , X2 , · · · , Xn be i.i.d. from U [0, θ], where θ is unknown. The MLE
estimator θ̂ = max Xi . First, it is easy to check that θ̂ → θ in probability. But θ̂ doesn’t
follow the CLT. In fact,
! 
n(θ − θ̂) 1 − e−x , if x > 0;
P ≤x →
θ 0, otherwise
as n → ∞.
For some cases even the consistency, i.e., θ̂ → θ in probability, doesn’t hold. Next, we
will study some sufficient conditions for consistency. Later we provide sufficient conditions
for CLT.
Let X1 , · · · , Xn be a random sample from a density pθ with reference measure µ, that
R
is, Pθ (X1 ∈ A) = A pθ (x)µ( dx), where θ ∈ Θ. The maximum likelihood estimator θ̂n
P
maximizes the function h(θ) := log pθ (Xi ) over Θ, or equivalently, the function
n
1X pθ
Mn (θ) = log (Xi ),
n p θ0
i=1

67
where θ0 is the true parameter. Under suitable conditions, by the weak law of large numbers,
Z
pθ pθ
Mn (θ) → M (θ) := Eθ0 log (X1 ) = pθ0 (x) log (x)µ( dx) (9.1)
p θ0 R p θ0

in probability as n → ∞. The number −M (θ) is called the Kullback-Leibler divergence of


pθ and pθ0 . Let Y = pθ0 (Z)/pθ (Z), where Z ∼ pθ . Then Eθ Y = 1 and

M (θ) = −Eθ (Y log Y ) ≤ −(Eθ Y ) log(Eθ Y ) = 0

by Jensen’s inequality. Of course M (θ0 ) = 0, that is, θ0 attains the maximum of M (θ).
Obviously, M (θ0 ) = Mn (θ0 ) = 0. The following gives the consistency of θ̂n with θ0 .

P
THEOREM 32 Suppose supθ∈Θ |Mn (θ) − M (θ)| → 0 and supθ:d(θ,θ0 )≥ǫ M (θ) < M (θ0 ) for
any ǫ > 0. Then θ̂n → θ0 in probability.

Proof. For any ǫ > 0, we need to show that

P (d(θ̂n , θ0 ) ≥ ǫ) → 0 (9.2)

as n → ∞. By the condition, there exists δ > 0 such that

sup M (θ) < M (θ0 ) − δ.


θ:d(θ,θ0 )≥ǫ

Thus, if d(θ̂n , θ0 ) ≥ ǫ, then M (θ̂n ) < M (θ0 ) − δ. Note that

M (θ̂n ) − M (θ0 ) = Mn (θ̂n ) − Mn (θ0 ) + (Mn (θ0 ) − M (θ0 )) + (M (θ̂n ) − Mn (θ̂n ))


≥ −2 sup |Mn (θ) − M (θ)|
θ∈Θ

since Mn (θ̂n ) ≥ Mn (θ0 ). Consequently,

P (d(θ̂n , θ0 ) ≥ ǫ) ≤ P (M (θ̂n ) − M (θ0 ) < −δ) = P (sup |Mn (θ) − M (θ)| ≥ δ/2) → 0.
θ∈Θ

by the first assumption. 

Let X1 , · · · , Xn be a random sample from a density pθ with reference measure µ, that


R
is, Pθ (X1 ∈ A) = A pθ (x)µ( dx), where θ ∈ Θ. The maximum likelihood estimator θ̂n
P
maximizes the function log pθ (Xi ) over Θ, or equivalently, the function
n
1X pθ
Mn (θ) := log (Xi ),
n p θ0
i=1

68
where θ0 is the true parameter. This is usually practiced by solving the equation ∂Mn (θ)/∂θ =
0 for θ. Let
∂ log(pθ (x)/pθ0 (x))
ψθ (x) = .
∂θ
Then the maximum likelihood estimator θ̂n is the zeroes of the function
n
1X
Ψn (θ) := ψθ (Xi ). (9.3)
n
i=1

Though we still use the notation Ψn in the following theorem, it is not necessarily the one
above. It can be any function of properties stated below.

THEOREM 33 Let Θ be a subset of the real line and let Ψn be random functions and Ψ a
fixed function of θ such that Ψn (θ) → Ψ(θ) in probability for every θ. Assume that Ψn (θ) is
continuous and has exactly one zero θ̂n or is nondecreasing satisfying Ψn (θ̂n ) = oP (1). Let
P
θ0 be a point such that Ψ(θ0 − ǫ) < 0 < Ψ(θ0 + ǫ) for every ǫ > 0. Then θ̂n → θ0 .

Proof. Case 1. Suppose Ψn (θ) is continuous and has exactly one zero θ̂n . Then

P (Ψn (θ0 − ǫ) < 0, Ψn (θ0 + ǫ) > 0) ≤ P (θ0 − ǫ ≤ θ̂n ≤ θ0 + ǫ).

Since Ψn (θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability, and Ψ(θ0 − ǫ) < 0 < Ψ(θ0 + ǫ), the left hand
side goes to one.
Case 2. Suppose Ψn (θ) is nondecreasing satisfying Ψn (θ̂n ) = oP (1). Then

P (|θ̂n − θ0 | ≥ ǫ) ≤ P (θ̂n > θ0 + ǫ) + P (θ̂n < θ0 − ǫ)


≤ P (Ψn (θ̂n ) ≥ Ψn (θ0 + ǫ)) + P (Ψn (θ̂n ) ≤ Ψn (θ0 − ǫ)).

Now, Ψn (θ0 ± ǫ) → Ψ(θ0 ± ǫ) in probability. This together with Ψn (θ̂n ) = oP (1) shows that
P (|θ̂n − θ0 | ≥ ǫ) → 0 as n → ∞. 

Example. Let X1 , · · · , Xn be a random sample from Exp(θ), that is, the density
function is pθ (x) = θ −1 exp(−x/θ)I(x ≥ 0) for θ > 0. We know that, under the true model,
the MLE θ̂n = 1/X̄n → θ0 in probability. Let’s verify that this conclusion indeed can be
deduced from Theorem 33.
Actually, ψθ (x) = (log pθ (x))′ = −θ −1 + xθ −2 for x ≥ 0. Thus Ψn (θ) = −θ −1 + X̄n θ −2 ,
which goes to Ψ(θ) = Eθ0 (−θ −1 + X1 θ −2 ) = θ −2 (θ0 − θ), that is positive or negative
depending on if θ is samller or bigger than θ0 . Also, θ̂n = 1/X̄n . Applying Theorem 33 to
−Ψn and −Ψ, we obtain the consisitency result.

69
9.2 Asymptotic Normality

Now we study the central limit theorems for the MLE.


Now we illustrate the idea of showing the normality of MLE. Recalling (9.3). Do the
Taylor’s expansion for ψθ (Xi ).

. 1
ψθ̂n (Xi ) = ψθ0 (Xi ) + (θ̂n − θ0 )ψ̇θ0 (Xi ) + (θ̂n − θ0 )2 ψ̈θ0 (Xi )
2
We will use the following notation:
Z n
1X
Pf = f (x)µ(dx) for any real function f (x) and Pn = δXi .
n
i=1
Pn
Then Pn f = (1/n) i=1 f (Xi ). Thus,

. 1
0 = Ψn (θ̂n ) = Pn ψθ0 + (θ̂n − θ0 )Pn ψ̇θ0 + (θ̂n − θ0 )2 Pn ψ̈θ0 .
2
Reorganize it in the following form

√ . − n(Pn ψθ0 )
n(θ̂n − θ0 ) = .
Pn ψ̇θ0 + 12 (θ̂n − θ0 )Pn ψ̈θ0

Recall the Fisher information


 2  2 
∂ log pθ (X1 ) 2 ∂ log pθ (X1 )
I(θ0 ) = Eθ0 |θ=θ0 = Eθ0 (ψθ0 (X1 )) = −Eθ0 |θ=θ0
∂θ ∂2θ
= −Eθ0 (ψ̇θ0 (X1 )).

By CLT and LLN, n(Pn ψθ0 ) =⇒ N (0, I(θ0 )), Pn ψ̇θ0 → −I(θ0 ) and Pn ψ̈θ0 → pθ0 ψ̈θ0 (X1 )
in probability. This illustrates that

n(θ̂n − θ0 ) =⇒ N (0, I(θ0 )−1 )

as n → ∞. The next two theorems will make these steps rigorous.


R
Let g(θ) = Eθ0 (mθ (X1 )) = mθ (x)pθ0 (x)µ( dx). We need the following condition:

1
g(θ) = g(θ0 ) + (θ − θ0 )T Vθ0 (θ − θ0 ) + o(kθ − θ0 k2 ), (9.1)
2
where
  
∂mθ
Vθ0 = Eθ0 |θ=θ0 , θ = (θ1 , · · · , θd ) ∈ Rd .
∂θi ∂θj 1≤i,j≤d

70
THEOREM 34 For each θ in an open subset of Euclidean space let x 7→ mθ (x) be a mea-
surable function such that θ 7→ mθ (x) is differentiable at θ0 for P -almost every x with
derivative ṁθ0 (x) and such that , for every θ1 and θ2 in a neighborhood of θ0 and a mea-
surable function n(x) with En(X1 )2 < ∞

|mθ1 (x) − mθ2 (x)| ≤ n(x)kθ1 − θ2 k.

Furthermore, assume the map θ 7→ Emθ (X1 ) has expression as in (9.1). If Pn mθ̂n ≥
P
supθ Pn mθ − oP (n−1 ) and θ̂n → θ0 , then
n
X
√ 1
n(θ̂n − θ0 ) = −Vθ−1 √ ṁθ0 (Xi ) + oP (1).
0 n
i=1

In particular, n(θ̂n − θ0 ) =⇒ N (0, Vθ−1
0
E(ṁθ0 ṁTθ0 )Vθ−1
0
) as n → ∞.

A statistical model (pθ , θ ∈ Θ) is called differentiable in quadratic mean if there exists a


measurable vector-valued function l̇θ0 such that, as θ → θ0 ,
Z  2
√ √ 1 √
pθ − p θ0 − (θ − θ0 )T l˙θ0 pθ0 du = o(kθ − θ0 k2 ).
2

THEOREM 35 Suppose that the model (Pθ : θ ∈ Θ) is differentiable in quadratic mean at


an inner point θ0 of Θ ⊂ Rk . Furthermore, suppose that there exists a measurable function
l(x) with Eθ0 l2 (X1 ) < ∞ such that, for every θ1 and θ2 in a neighborhood of θ0 ,

| log pθ1 (x) − log pθ2 (x)| ≤ l(x)kθ1 − θ2 k.

If the Fisher information matrix Iθ0 is non-singular and θ̂n is consistent, then
n
√ 1 X
n(θ̂n − θ0 ) = Iθ−1 √ l̇θ0 (Xi ) + oP (1)
0 n
i=1

In particular, n(θ̂n − θ0 ) =⇒ N (0, Iθ−1
0
) as n → ∞, where
 2 
∂ log pθ (X1 )
I(θ0 ) = − E .
∂θi ∂θj 1≤i,j≤k

We need some preparations in proving the above theorems.


Given functions x → mθ (x), θ ∈ Rd , we need conditions that ensure that, for a given
sequence rn → ∞ and any sequence h̃n = OP∗ (1),
 
P
Gn rn (mθ0 +h̃n /rn − mθ0 ) − h̃Tn ṁθ0 → 0. (9.2)

71
LEMMA 9.1 For each θ in an open subset of Euclidean space let mθ (x), as a function
of x, be measurable for each θ; and as a function of θ, is differentiable for almost every x
(w.r.t. P ) with derivative ṁθ0 (x) and such that for every θ1 and θ2 in a neighborhood of θ0
and a measurable function ṁ such that pṁ2 < ∞,

kmθ1 (x) − mθ2 (x)k ≤ ṁ(x)kθ1 − θ2 k.

Then (9.2) holds for every random sequence h̃n that is bounded in probability.

Proof. Because h̃n is bounded in probability, to show (9.2), it is enough to show, w.l.o.g.
P
sup |Gn (rn (mθ0 +θ/rn − mθ0 ) − θ̃ T ṁθ0 )| → 0
|θ|≤1

as n → ∞. Define

Fn = {fθ := rn (mθ0 +θ/rn − mθ0 ) − θ T ṁθ0 , |θ| ≤ 1}.

Then

|fθ1 (x) − fθ2 (x)| ≤ 2ṁθ0 (x)kθ1 − θ2 k

for any θ1 and θ2 in the unit ball in Rd by the Lipschitz condition. Further, set

Hn = sup |rn (mθ0 +θ/rn − mθ0 ) − θ T ṁθ0 |.


|θ|≤1

Then Hn is a cover and Hn → 0 as n → ∞ by the definition of partial derivative. By


Bounded Convergence Theorem, δn := (P Hn2 )1/2 → 0. Thus, by Corollary 8.2 and Example
8.14,
Z δn q
EP∗ kGn kFn ≤ C · J[ ] (kHkP,2 , Fn , L2 (P )) ≤ C log(Kǫ−d ) dǫ → 0
0
as n → ∞. The desired conclusion follows. 
We need some preparation before proving Theorem 34.
Let Pn be the empirical distribution of a random sample of size n from a distribution P,
and, for every θ in a metric space (Θ, d), mθ (x) be a measurable function. Let θ̂n (nearly)
maximize the criterion function Pn mθ . The number θ0 is the truth, that is, the maximizer

of mθ over θ ∈ Θ. Recall Gn = n(Pn − P ).

THEOREM 36 (Rate of Convergence) Assume that for fixed constants C and α > β, for
every n and every sufficiently small δ > 0,

sup P (mθ − mθ0 ) ≤ −Cδα ,


d(θ,θ0 )>δ

E∗ sup |Gn (mθ − mθ0 )| ≤ Cδβ .


d(θ,θ0 )<δ

72
If the sequence θ̂n satisfies Pn mθ̂n ≥ Pn mθ0 − OP (nα/(2β−2α) ) and converges in outer prob-
ability to θ0 , then n1/(2α−2β) d(θ̂n , θ0 ) = OP∗ (1).

Proof. Set rn = n1/(2α−2β) and Pn mθ̂n ≥ Pn mθ0 − Rn with 0 ≤ Rn = OP (nα/(2β−2α) ).


Partition (0, ∞) by Sj,n = {θ : 2j−1 < rn d(θ, θ0 ) ≤ 2j } for all integers j. If rn d(θ̂n , θ0 ) ≥ 2M
for a given M, then θ̂n is in one of the shells Sj,n with j ≥ M. In that case, supθ∈Sj,n (Pn mθ −
Pn mθ0 ) ≥ −Rn . It follows that
!
X K
P ∗ (rn d(θ̂n , θ0 ) ≥ 2M ) ≤ P∗ sup (Pn mθ − Pn mθ0 ) ≥ − α (9.3)
θ∈Sj,n rn
j≥M,2j ≤ǫrn

+P ∗ (2d(θ̂n , θ0 ) ≥ ǫ) + P (rnα Rn ≥ K).

The middle term on right goes to zero; the last term can be arbitrarily small by choosing
large K. We only need to show the sum is arbitrarily small when M is large enough as
n → ∞. By the given condition
2j−1 α
sup P (mθ − mθ0 ) ≤ −C .
θ∈Sj,n rnα

For M such that (1/2)C2(M −1)α ≥ K, by the fact that sups (fs + gs ) ≤ sups fs + sups gs ,
! !
∗ K ∗ √ 2(j−1)α
P sup (Pn mθ − Pn mθ0 ) ≥ − α ≤ P sup Gn (mθ − mθ0 ) ≥ C n .
θ∈Sj,n rn θ∈Si,n 2rnα

Therefore, the sum in (9.3) is bounded by


!
X √ 2(j−1)α X (2j /rn )β 2r α
P∗ sup |Gn (mθ − mθ0 )| ≥ C n ≤ √ (j−1)αn
θ∈Si,n 2rnα n2
j≥M,2j ≤ǫrn j≥M

by Markov’s inequality and the definition of rn . The right hand side goes to zero as M → ∞.


COROLLARY 9.1 For each θ in an open subset of Euclidean space, mθ (x) is a measurable
function such that, there exists ṁ(x) with P ṁ2 < ∞ satisfying

|mθ1 (x) − mθ2 (x)| ≤ ṁ(x)kθ1 − θ2 k.

Furthermore, assume
1
P mθ = P mθ0 + (θ − θ0 )T Vθ0 (θ − θ0 ) + o(kθ − θ0 k2 ) (9.4)
2
P √
with Vθ0 nonsingular. If Pn mθ̂n ≥ Pn mθ0 −OP (n−1 ) and θ̂n → θ0 , then n(θ̂n −θ0 ) = OP (1).

73
Proof. Display (9.4) implies that the first condition of Theorem 36 holds with α = 2: since θ0 maximizes P mθ and Vθ0 is nonsingular, Vθ0 is negative definite. Now apply Corollary 8.2 to the class of functions F = {mθ − mθ0 ; ‖θ − θ0‖ < δ} to see that the second condition holds with β = 1. This class has envelope function F = δṁ, while

E* sup_{‖θ−θ0‖<δ} |Gn(mθ − mθ0)| ≤ C ∫_0^{δ‖ṁ‖P,2} √( log N_[ ](ǫ, F, L2(P)) ) dǫ ≤ C ∫_0^{δ‖ṁ‖P,2} √( log( K(δ/ǫ)^d ) ) dǫ = C1 δ

by (8.14) with Diam(Θ) = const · δ, where C1 depends on ‖ṁ‖P,2.
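To see where the final identity C1 δ comes from (a small computation added for completeness), substitute ǫ = δu in the last integral:

∫_0^{δ‖ṁ‖P,2} √( log( K(δ/ǫ)^d ) ) dǫ = δ ∫_0^{‖ṁ‖P,2} √( log(K u^{−d}) ) du =: C1 δ/C,

and the remaining integral is a finite constant depending only on K, d and ‖ṁ‖P,2.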

Proof of Theorem 34. By Lemma 9.1,

Gn( rn(mθ0+h̃n/rn − mθ0) − h̃n^T ṁθ0 ) → 0 in probability.          (9.5)

Expanding P(rn(mθ0+h̃n/rn − mθ0)) by condition (9.1) and combining with (9.5), we obtain, with rn = √n,

nPn(mθ0+h̃n/√n − mθ0) = (1/2) h̃n^T Vθ0 h̃n + h̃n^T Gn ṁθ0 + oP(1).

By Corollary 9.1, √n(θ̂n − θ0) is bounded in probability (that is, tight). Applying the preceding display with h̃n = ĥn := √n(θ̂n − θ0) and with h̃n = −Vθ0^{−1} Gn ṁθ0 in turn, we obtain the expansions of Pn mθ̂n and Pn m_{θ0 − Vθ0^{−1} Gn ṁθ0/√n} as follows:

nPn(mθ̂n − mθ0) = (1/2) ĥn^T Vθ0 ĥn + ĥn^T Gn ṁθ0 + oP(1),

nPn(m_{θ0 − Vθ0^{−1} Gn ṁθ0/√n} − mθ0) = −(1/2) (Gn ṁθ0)^T Vθ0^{−1} Gn ṁθ0 + oP(1),

where the second expansion is obtained through a bit of algebra (spelled out after the proof). By the definition of θ̂n, the left-hand side of the first equation is greater than or equal to that of the second one, and so are the right-hand sides. Taking the difference and completing the square, we obtain

(1/2)(ĥn + Vθ0^{−1} Gn ṁθ0)^T Vθ0 (ĥn + Vθ0^{−1} Gn ṁθ0) + oP(1) ≥ 0.

Since Vθ0 is strictly negative definite, the quadratic form must converge to zero in probability, and hence so does ‖ĥn + Vθ0^{−1} Gn ṁθ0‖; that is, ĥn = −Vθ0^{−1} Gn ṁθ0 + oP(1). □
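The two algebraic steps left implicit above are the following (spelled out here for completeness, using the symmetry of Vθ0). Substituting h̃n = −Vθ0^{−1} Gn ṁθ0 into (1/2) h̃n^T Vθ0 h̃n + h̃n^T Gn ṁθ0 gives

(1/2)(Gn ṁθ0)^T Vθ0^{−1} Gn ṁθ0 − (Gn ṁθ0)^T Vθ0^{−1} Gn ṁθ0 = −(1/2)(Gn ṁθ0)^T Vθ0^{−1} Gn ṁθ0,

which is the second expansion; and subtracting the two expansions and completing the square gives

(1/2) ĥn^T Vθ0 ĥn + ĥn^T Gn ṁθ0 + (1/2)(Gn ṁθ0)^T Vθ0^{−1} Gn ṁθ0 = (1/2)(ĥn + Vθ0^{−1} Gn ṁθ0)^T Vθ0 (ĥn + Vθ0^{−1} Gn ṁθ0).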


10 Appendix
Let A be a collection of subsets of Ω and let B be the σ-algebra generated by A, that is, B = σ(A). Let P be a probability measure on (Ω, B).

LEMMA 10.1 Suppose A has the following properties: (i) Ω ∈ A; (ii) A^c ∈ A if A ∈ A; and (iii) ∪_{i=1}^m Ai ∈ A if Ai ∈ A for all 1 ≤ i ≤ m. Then, for any B ∈ B and ǫ > 0, there exists A ∈ A such that P(B∆A) < ǫ.

Proof. Let B ′ be the set of B ∈ B satisfying the conclusion. Obviously, A ⊂ B ′ ⊂ B. It is


enough to verify that B ′ is a σ-algebra.
It is easy to see that (i) Ω ∈ B′; (ii) B^c ∈ B′ if B ∈ B′, since A∆B = A^c∆B^c; (iii) if Bi ∈ B′ for i ≥ 1, there exist Ai ∈ A such that P(Bi∆Ai) < ǫ/2^{i+1} for all i ≥ 1. Evidently, ∪_{i=1}^n Bi ↑ ∪_{i=1}^∞ Bi as n → ∞. Therefore, there exists n0 < ∞ such that |P(∪_{i=1}^∞ Bi) − P(∪_{i=1}^{n0} Bi)| ≤ ǫ/2. It is easy to check that (∪_{i=1}^{n0} Bi)∆(∪_{i=1}^{n0} Ai) ⊂ ∪_{i=1}^{n0}(Bi∆Ai). Write B = ∪_{i=1}^∞ Bi, B̃ = ∪_{i=1}^{n0} Bi and A = ∪_{i=1}^{n0} Ai. Then A ∈ A. Note B∆A ⊂ (B\B̃) ∪ (B̃∆A). The above facts show that P(B∆A) < ǫ (see the accounting after the proof). Thus B′ is a σ-algebra. □
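The accounting referred to at the end of the proof is the following (added here for completeness, using the choice ǫ/2^{i+1} made above):

P(B∆A) ≤ P(B\B̃) + P(B̃∆A) ≤ ǫ/2 + Σ_{i=1}^{n0} P(Bi∆Ai) < ǫ/2 + Σ_{i=1}^{n0} ǫ/2^{i+1} < ǫ/2 + ǫ/2 = ǫ.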

LEMMA 10.2 Let X1, X2, · · · , Xm, m ≥ 1, be random variables defined on (Ω, F, P). Let f(x1, · · · , xm) be a real measurable function with E|f(X1, · · · , Xm)|^p < ∞ for some p ≥ 1. Then there exists {fn(X1, · · · , Xm); n ≥ 1} such that
(i) fn(X1, · · · , Xm) → f(X1, · · · , Xm) a.s.;
(ii) fn(X1, · · · , Xm) → f(X1, · · · , Xm) in L^p(Ω, F, P);
(iii) for each n ≥ 1, fn(X1, · · · , Xm) = Σ_{i=1}^{kn} ci g_{i1}(X1) · · · g_{im}(Xm) for some kn < ∞, constants ci, and g_{ij}(Xj) = I_{A_{i,j}}(Xj) for some sets A_{i,j} ∈ B(R), for all 1 ≤ i ≤ kn and 1 ≤ j ≤ m.

Proof. To save notation, we write f = f(x1, · · · , xm). Since E|fI(|f| ≤ C) − f|^p → 0 as C → ∞, choose Ck such that E|fI(|f| ≤ Ck) − f|^p ≤ 1/k^2 for k = 1, 2, · · · . We will show that there exists a function gk = gk(x1, · · · , xm) of the form in (iii) such that E|fI(|f| ≤ Ck) − gk|^p ≤ 1/k^2 for all k ≥ 1. Then E|f − gk|^p ≤ 2^p/k^2 for all k ≥ 1 (since |a + b|^p ≤ 2^{p−1}(|a|^p + |b|^p)), and assertion (ii) follows. It also implies E(Σ_{k≥1}|f − gk|^p) < ∞, hence Σ_{k≥1}|f − gk|^p < ∞ a.s., and we obtain (i). Therefore, to prove this lemma, we assume w.l.o.g. that f is bounded, and we need to show that there exist fn = fn(x1, · · · , xm), n ≥ 1, of the form in (iii) such that

E|f − fn|^p ≤ 1/n^2          (10.6)

for all n ≥ 1.

Since f is bounded, for any n ≥ 1 there exists hn such that

sup_{x∈R^m} |f(x) − hn(x)| < 1/(2n^2),          (10.7)

where hn is a simple function, i.e., hn(x) = Σ_{i=1}^{kn} ci I{x ∈ Bi} for some kn < ∞, constants ci's and Bi ∈ B(R^m).
Now set X = (X1, · · · , Xm) ∈ R^m and let µ be the distribution of X under the probability P. Let A be the set of all finite unions of sets in A1 := {Π_{i=1}^m Ai ∈ B(R^m); Ai ∈ B(R)}. By the construction of B(R^m), we know that B(R^m) = σ(A). It is not difficult to verify that A satisfies the conditions in Lemma 10.1. Thus there exist Ei ∈ A such that

∫ |I_{Bi}(x) − I_{Ei}(x)|^p dµ = µ(Bi∆Ei) < 1/(2cn^2)^p   and   Ei = ∪_{j=1}^{ki} Π_{l=1}^m A_{i,j,l},

where c = 1 + Σ_{i=1}^{kn} |ci| and A_{i,j,l} ∈ B(R) for all i, j and l. Now, since ‖ · ‖p is a norm, we have

‖hn(X) − Σ_{i=1}^{kn} ci I_{Ei}(X)‖p ≤ Σ_{i=1}^{kn} |ci| · ‖I_{Ei}(X) − I_{Bi}(X)‖p ≤ 1/(2n^2).          (10.8)

Observe that, for each fixed sample point, ν(E) := I_E(X) is a measure on (R^m, B(R^m)). Note that the intersection of any finite number of product sets Π_{l=1}^m A_{i,j,l} is still in A1. By the inclusion-exclusion formula, I_{Ei}(X) is a finite linear combination of I_F(X) with F ∈ A1. Thus, fn(X) := Σ_{i=1}^{kn} ci I_{Ei}(X) is of the form in (iii). Now, by (10.7) and (10.8),

‖f(X) − fn(X)‖p ≤ ‖f(X) − hn(X)‖p + ‖hn(X) − fn(X)‖p ≤ 1/n^2.
Thus, (10.6) follows. 
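As a small illustration of the inclusion-exclusion step above (added here as an example): if Ei = F1 ∪ F2 with F1 = Π_{l=1}^m A_{i,1,l} and F2 = Π_{l=1}^m A_{i,2,l} in A1, then

I_{Ei} = I_{F1} + I_{F2} − I_{F1∩F2},   and   F1 ∩ F2 = Π_{l=1}^m (A_{i,1,l} ∩ A_{i,2,l}) ∈ A1;

the general case of ki rectangles is the same, with at most 2^{ki} − 1 terms.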

LEMMA 10.3 Let ξ1, · · · , ξk be random variables, and let φ(x), x ∈ R^k, be a measurable function with E|φ(ξ1, · · · , ξk)| < ∞. Let F and G be two σ-algebras. Suppose

E(f1(ξ1) · · · fk(ξk)|F) = E(f1(ξ1) · · · fk(ξk)|G)   a.s.          (10.9)

holds for all bounded, measurable functions fi : R → R, 1 ≤ i ≤ k. Then E(φ(ξ1, · · · , ξk)|F) = E(φ(ξ1, · · · , ξk)|G) almost surely.

Proof. From (10.9), we know that the desired conclusion holds for φ that is a finite linear
combination of f1 · · · fk ’s. By Lemma 10.2, there exist such functions φn ; n ≥ 1, such that

E|φn(ξ1, · · · , ξk) − φ(ξ1, · · · , ξk)| → 0 as n → ∞. Therefore,

E|E(φn(ξ1, · · · , ξk)|F) − E(φ(ξ1, · · · , ξk)|F)| ≤ E|φn(ξ1, · · · , ξk) − φ(ξ1, · · · , ξk)| → 0,

and the same holds with F replaced by G. Since E(φn(ξ1, · · · , ξk)|F) = E(φn(ξ1, · · · , ξk)|G) for all n ≥ 1, the conclusion follows because an L1-limit is unique up to a set of measure zero. □

PROPOSITION 10.1 Let X1 , X2 , · · · be a Markov chain. Let 1 ≤ l ≤ m ≤ n − 1. Assume


φ(x), x ∈ Rn−m is a measurable function such that E|φ(Xm+1 , · · · , Xn )| < ∞. Then

E (φ(Xm+1 , · · · , Xn )|Xm , · · · , Xl ) = E (φ(Xm+1 , · · · , Xn )|Xm ) .

Proof. Let p = n − m. By Lemma 10.3, we only need to prove the proposition for φ(x) = φ1(x1) · · · φp(xp), with x = (x1, · · · , xp) and the φi's bounded functions. We do this by induction on p.
If n = m+1, by the Markov property, Y := E(φ1(Xm+1)|Xm, · · · , X1) = E(φ1(Xm+1)|Xm). Recall the property that E(E(Z|F)|G) = E(E(Z|G)|F) = E(Z|F) if F ⊂ G. We know that

E (φ1 (Xm+1 )|Xm , · · · , Xl ) = E(Y |Xm , · · · , Xl ) = E(φ1 (Xm+1 )|Xm )

since Y = E(φ1 (Xm+1 )|Xm ) is σ(Xm , · · · , Xl )-measurable.


Now assume n ≥ m + 2. Conditioning first on X1, · · · , Xn and using the Markov property as above, we obtain

E{φ1(Xm+1) · · · φp+1(Xn+1)|Xm, · · · , Xl}
    = E{φ1(Xm+1) · · · φp−1(Xn−1)φ̃p(Xn)|Xm, · · · , Xl}
    = E{φ1(Xm+1) · · · φp−1(Xn−1)φ̃p(Xn)|Xm},

where φ̃p(Xn) = φp(Xn) · E(φp+1(Xn+1)|Xn) and the last step is by the induction hypothesis. Since φ1(Xm+1) · · · φp−1(Xn−1)φ̃p(Xn) = E(φ1(Xm+1) · · · φp+1(Xn+1)|Xn, · · · , X1), conditioning on Xm and using the tower property yields the conclusion. □
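For instance, in the case p = 2 (added here as a concrete check of the induction step), write φ̃1(Xm+1) := φ1(Xm+1)E(φ2(Xm+2)|Xm+1). Then

E(φ1(Xm+1)φ2(Xm+2)|Xm, · · · , Xl) = E(φ̃1(Xm+1)|Xm, · · · , Xl) = E(φ̃1(Xm+1)|Xm) = E(φ1(Xm+1)φ2(Xm+2)|Xm),

where the first and last equalities use the tower property together with the Markov property, and the middle equality is the case p = 1.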

The following proposition says that the Markov property also holds in reverse time.

PROPOSITION 10.2 Let X1 , X2 , · · · be a Markov chain. Let 1 ≤ m < n. Assume


φ(x), x ∈ Rm is a measurable function such that E|φ(X1 , · · · , Xm )| < ∞. Then

E(φ(X1 , · · · , Xm )|Xm+1 , · · · , Xn ) = E(φ(X1 , · · · , Xm )|Xm+1 ).

Proof. By Lemma 10.3, we may assume without loss of generality that φ(x) is a bounded function. From the definition of conditional expectation, to prove the proposition it suffices to show that

E{φ(X1, · · · , Xm)g(Xm+1, · · · , Xn)}
    = E{E(φ(X1, · · · , Xm)|Xm+1) · g(Xm+1, · · · , Xn)}          (10.10)

for any bounded, measurable function g(x), x ∈ R^{n−m}. By Lemma 10.3 again, we only need to prove (10.10) for g(x) = g1(x1) · · · gp(xp), where p = n − m. Now

E{E(φ(X1 , · · · , Xm )|Xm+1 ) · g1 (Xm+1 ) · · · gp (Xn )}


= E{Z · g2 (Xm+2 ) · · · gp (Xn )}

where Z = E(φ(X1, · · · , Xm)g1(Xm+1)|Xm+1). Using the fact that E(E(ξ|F) · η) = E(ξ · E(η|F)) for any bounded ξ, η and σ-algebra F (a short verification is given after the proof), we obtain

E{Z · g2 (Xm+2 ) · · · gp (Xn )} = E{φ(X1 , · · · , Xm )g1 (Xm+1 ) · E(η|Xm+1 )}

where η = g2 (Xm+2 ) · · · gp (Xn ). By Proposition 10.1, E(η|Xm+1 ) = E(η|Xm+1 , · · · , X1 ).


Therefore the above is identical to

E{E(φ(X1 , · · · , Xm )g1 (Xm+1 ) · η|Xm+1 , · · · , X1 )}


= E{φ(X1 , · · · , Xm )g(Xm+1 , · · · , Xn )}

since g(x) = g1 (x1 ) · · · gp (xp ). 
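The identity used in the middle of the proof can be checked in one line (a short verification added here, for bounded ξ and η): conditioning on F and pulling out F-measurable factors,

E(E(ξ|F) · η) = E( E(ξ|F) E(η|F) ) = E( E(ξ E(η|F) | F) ) = E(ξ · E(η|F)).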

The following proposition expresses the property that, for a Markov chain, the past and the future are conditionally independent given the present.

PROPOSITION 10.3 Let 1 ≤ k < l ≤ m be such that l − k ≥ 2. Let φ(x), x ∈ R^k, and ψ(x), x ∈ R^{m−l+1}, be two measurable functions satisfying E(φ(X1, · · · , Xk)^2) < ∞ and E(ψ(Xl, · · · , Xm)^2) < ∞. Then

E{φ(X1 , · · · , Xk ) · ψ(Xl , · · · , Xm )|Xk+1 , Xl−1 }


= E{φ(X1 , · · · , Xk )|Xk+1 , Xl−1 } · E{ψ(Xl , · · · , Xm )|Xk+1 , Xl−1 }.

Proof. To save notation, we write φ = φ(X1, · · · , Xk) and ψ = ψ(Xl, · · · , Xm). Then

E(φψ|Xk+1 , Xl−1 ) = E{E(φψ|X1 , · · · Xl−1 ) |Xk+1 , Xl−1 }


= E{φ · E(ψ|Xl−1 ) |Xk+1 , Xl−1 }

by Proposition 10.1 and the fact that φ is σ(X1, · · · , Xl−1)-measurable. The last term is also equal to E(ψ|Xl−1) · E{φ|Xk+1, Xl−1}. The conclusion follows since E(ψ|Xl−1) = E(ψ|Xk+1, Xl−1) (a short verification is given below).
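The final identity can be checked as follows (added here for completeness): since σ(Xk+1, Xl−1) ⊂ σ(X1, · · · , Xl−1), the tower property and Proposition 10.1 give

E(ψ|Xk+1, Xl−1) = E( E(ψ|X1, · · · , Xl−1) | Xk+1, Xl−1 ) = E( E(ψ|Xl−1) | Xk+1, Xl−1 ) = E(ψ|Xl−1),

the last step because E(ψ|Xl−1) is σ(Xk+1, Xl−1)-measurable.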


