8112 Notes
1 Bayes estimators
Here are three methods of estimating parameters:
(1) MLE; (2) Moment Method; (3) Bayes Method.
We want to estimate g(θ) ∈ R^1.
An example of the Bayes argument: Let X ∼ F(x|θ), θ ∈ Θ.
Suppose t(X) is an estimator and look at MSE_θ(t) = E_θ(t(X) − g(θ))².
The problem is that MSE_θ(t) depends on θ, so minimizing it at one point may cost at other points.
The Bayes idea is to average MSE_θ(t) over θ and then minimize over t. Thus we pretend to have a distribution for θ, say π, and look at
H(t) = ∫ MSE_θ(t) π(dθ).
Next pick t(X) to minimize H(t). The minimizer is called the Bayes estimator.
LEMMA 1.1 Suppose Z and W are real random variables defined on the same probability space and let H be the set of measurable functions from R^1 to R^1. Then min_{h∈H} E(Z − h(W))² is attained at h(W) = E(Z|W). Indeed,
E(Z − h(W))² = E(Z − E(Z|W))² + E(E(Z|W) − h(W))².
Now E(Z − E(Z|W ))2 is fixed. The second term E(E(Z|W ) − h(W ))2 depends on h. So
choose h(W ) = E(Z|W ) to minimize E(Z − h(W ))2 .
At a later occasion, we call E(g(θ)|X) the posterior mean of g(θ). Similarly, we have the
following multivariate analogue of the above corollary. The proof is in the same spirit of
Lemma 1.1. We omit it.
Definition. The distribution π(θ) of θ (which we pretend exists) is called a prior distribution of θ. The conditional distribution of θ given X is called the posterior distribution.
Example. Let X1 , · · · , Xn be i.i.d. Ber(p) and p ∼ Beta(α, β). We know that the
density function of p is
f(p) = [Γ(α+β)/(Γ(α)Γ(β))] p^(α−1) (1 − p)^(β−1),  0 < p < 1,
and E(p) = α/(α + β). The joint density of X_1, ..., X_n and p is
f(x_1, ..., x_n, p) = f(p) p^x (1 − p)^(n−x),  where x = Σ_{i=1}^n x_i.
Therefore,
f(p|x_1, ..., x_n) = f(x_1, ..., x_n, p)/f(x_1, ..., x_n) = D(x, α, β) p^(x+α−1) (1 − p)^(n+β−x−1).
This says that p|x ∼ Beta(x + α, n + β − x). Thus the Bayes estimator is
p̂ = E(p|X) = (X + α)/(n + α + β),
where X = Σ_{i=1}^n X_i. One can easily check that
E(p̂) = (np + α)/(n + α + β).
So p̂ is biased unless α = β = 0, which is not allowed in the prior distribution.
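As a small numerical sketch of this conjugate Beta-Bernoulli update (not part of the original notes; the prior parameters, sample size and true p below are made-up values), the posterior is Beta(X + α, n + β − X) and the Bayes estimator is (X + α)/(n + α + β):

# Minimal sketch: Beta prior + Bernoulli data => Beta posterior (hypothetical values).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n, p_true = 2.0, 3.0, 50, 0.3     # assumed prior parameters and truth

x = rng.binomial(n, p_true)                    # sufficient statistic X = sum of the X_i
post_a, post_b = x + alpha, n + beta - x       # posterior Beta(X + alpha, n + beta - X)
bayes_est = post_a / (post_a + post_b)         # posterior mean (X + alpha)/(n + alpha + beta)
mle = x / n

print(f"X = {x}, Bayes estimate = {bayes_est:.3f}, MLE = {mle:.3f}")

The Bayes estimate shrinks the MLE toward the prior mean α/(α + β), which is exactly why it is biased unless α = β = 0.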
The next lemma is useful
LEMMA 1.2 The posterior distribution depends only on sufficient statistics. Let t(X) be
a sufficient statistic. Then E(g(θ)|X) = E(g(θ)|t(X)).
Proof. The joint density function of X and θ is p(x|θ)π(θ) where π(θ) is the prior distri-
bution of θ. Let t(X) be a sufficient statistic. Then by the factorization theorem
p(x|θ) = q(t(x)|θ)h(x).
So the joint distribution function is q(t(x)|θ)h(x)π(θ). It follows that the conditional distri-
bution of θ given X is
p(θ|x) = q(t(x)|θ)h(x)π(θ) / ∫ q(t(x)|ξ)h(x)π(ξ) dξ = q(t(x)|θ)π(θ) / ∫ q(t(x)|ξ)π(ξ) dξ,
which is a function of t(x) and θ.
By the above conclusion, E(g(θ)|X) = l(t(X)) for some function l(·). Take conditional expectations on both sides with respect to the σ-algebra σ(t(X)). Note that σ(t(X)) ⊂ σ(X).
By the Tower theorem, l(t(X)) = E(g(θ)|t(X)). This proves the second conclusion.
Thus,
(X̄, µ)ᵀ ∼ N( (µ_0, µ_0)ᵀ, [[ 1/n + τ_0, τ_0 ], [ τ_0, τ_0 ]] ).
Recall the following fact: if
(Y_1, Y_2)ᵀ ∼ N( (µ_1, µ_2)ᵀ, [[ Σ_11, Σ_12 ], [ Σ_21, Σ_22 ]] ),
then
Y_1 | Y_2 ∼ N( µ_1 + Σ_12 Σ_22^{-1} (Y_2 − µ_2), Σ_{11·2} ),  where Σ_{11·2} = Σ_11 − Σ_12 Σ_22^{-1} Σ_21.
Example. Let X ∼ Np (θ, Ip ) and π(θ) ∼ Np (0, τ Ip ), τ > 0. The Bayes estimator is the
posterior mean E(θ|X). We claim that
(X, θ)ᵀ ∼ N( 0, [[ (1 + τ)I_p, τI_p ], [ τI_p, τI_p ]] ).
Indeed, by the prior we know that Eθ = 0 and Cov(θ)=τ Ip . By conditioning argument one
can verify that E(Xθ ′ ) = τ Ip and E(XX ′ ) = (1 + τ )Ip . So by conditional distribution of
normal random variables,
θ|X ∼ N( (τ/(1+τ)) X, (τ/(1+τ)) I_p ).
Thus the Bayes estimator is E(θ|X) = τX/(1+τ), and its Bayes risk is
∫ E_θ‖τX/(1+τ) − θ‖² π(dθ) = pτ/(1+τ),   (1.2)
which goes to p as τ → ∞. What happens to the prior when τ → ∞? It "converges to" Lebesgue measure, the "uniform distribution over the real line," which carries no information (recall that the information of X ∼ N_p(0, (1 + τ)I_p) is p/(1 + τ), and that of θ̄_i uniform over [−τ, τ], 1 ≤ i ≤ p, i.i.d. is p/(2τ²); both go to zero and the latter does so faster than the former). This explains why the MSE of the former and the limiting MSE are identical.
Now let
Proof. First,
Second
M = inf_t sup_θ E_θ‖t(X) − θ‖² ≥ inf_t ∫ E_θ‖t(X) − θ‖² π_τ(dθ) ≥ pτ/(1 + τ)
for any τ > 0, where π_τ = N(0, τI_p) and the last step is from (1.2). Letting τ → ∞ gives M ≥ p. The above says that M = sup_θ E_θ‖X − θ‖² = p.
Is there an estimator t(X) with E_θ‖t(X) − θ‖² ≤ E_θ‖X − θ‖² for all θ, and strict inequality for some θ? If so, we say t_0(X) = X is inadmissible (because it can be matched at every θ by some estimator t, and strictly beaten at some θ).
Here is the answer:
(i) For p = 1, there is not. This is proved by Blyth in 1951.
(ii) For p = 2, there is not. This result was shown by Stein in 1961.
(iii) When p ≥ 3, Stein (1956) shows that there is such an estimator; it is called the James-Stein estimator.
Recall the density function of N(µ, σ²) is φ(x) = (√(2π) σ)^{-1} exp(−(x − µ)²/(2σ²)).
LEMMA 1.3 (Stein's lemma). Let Y ∼ N(µ, σ²) and let g(y) be a function such that g(b) − g(a) = ∫_a^b g′(y) dy for any a and b and some function g′(y). If E|g′(Y)| < ∞, then E[g(Y)(Y − µ)] = σ² E g′(Y).
Fubini's theorem is used in the third step; the fact that a centered normal random variable has mean zero is used in the fifth step.
Remark 1. Suppose two functions g(x) and h(x) are given such that g(b) − g(a) = ∫_a^b h(x) dx for any a < b. This does not mean that g′(x) = h(x) for all x. It does imply that g(x) is differentiable almost everywhere and g′(x) = h(x) a.e. with respect to Lebesgue measure. For example, let h(x) = 1 if x is irrational and h(x) = 0 if x is rational. Then g(x) = ∫_0^x h(t) dt = x for any x ∈ R, and in this case g′(x) = 1 = h(x) a.e. The following are true from real analysis:
Fundamental Theorem of Calculus. If f(x) is absolutely continuous on [a, b], then f(x) is differentiable almost everywhere and
f(x) − f(a) = ∫_a^x f′(t) dt,  x ∈ [a, b].
Another fact. Let f(x) be differentiable everywhere on [a, b] with f′(x) integrable over [a, b]. Then
f(x) − f(a) = ∫_a^x f′(t) dt,  x ∈ [a, b].
Remark 2. What drives Stein's lemma is the formula of integration by parts: suppose g(x) is differentiable; then
E[g(Y)(Y − µ)] = ∫_R g(y)(y − µ)φ(y) dy = −σ² ∫_{−∞}^{∞} g(y)φ′(y) dy
 = σ² lim_{y→−∞} g(y)φ(y) − σ² lim_{y→+∞} g(y)φ(y) + σ² ∫_{−∞}^{∞} g′(y)φ(y) dy.
It is reasonable to assume the two limits are zero. The last term is exactly σ 2 Eg′ (Y ).
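A quick Monte Carlo check of Stein's lemma can make this concrete (a sketch, not from the notes; the choice g(y) = y³, with g′(y) = 3y², and the values of µ, σ are assumptions for illustration):

# Sketch: verify E[g(Y)(Y - mu)] = sigma^2 E[g'(Y)] by simulation for g(y) = y^3.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 1.5, 2.0, 10**6
y = rng.normal(mu, sigma, size=n)

lhs = np.mean(y**3 * (y - mu))        # E[g(Y)(Y - mu)]
rhs = sigma**2 * np.mean(3 * y**2)    # sigma^2 E[g'(Y)]
print(lhs, rhs)                       # agree up to Monte Carlo error

The two printed numbers should match up to sampling error, illustrating the integration-by-parts identity above.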
Proof. Let gi (x) = c(p − 2)xi /kxk2 and g(x) = (g1 (x), · · · , gp (x)) for x = (x1 , x2 , · · · , xp ).
Then g(x) = c(p − 2)x/kxk2 . It follows that
E‖δ_c(X) − θ‖² = E‖(X − θ) − g(X)‖² = E‖X − θ‖² + E‖g(X)‖² − 2E⟨X − θ, g(X)⟩
 = p + E‖g(X)‖² − 2 Σ_{i=1}^p E(X_i − θ_i)g_i(X).
On the other hand, E‖g(X)‖² = c²(p − 2)² E(1/‖X‖²). Combining all of the above, the conclusion follows.
since the density of a normal distribution is bounded by one. By the polar transformation, the last integral is equal to
∫_0^1 ∫_0^π ⋯ ∫_0^π [ r^{p−1} J(θ_1, ⋯, θ_{p−1}) / r² ] dθ_1 ⋯ dθ_{p−1} dr ≤ C π^{p−1} ∫_0^1 r^{p−3} dr < ∞.
COROLLARY 1.2 The estimator δc (X) dominates X provided 0 < c < 2 and p ≥ 3.
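A small simulation sketch (assumptions: X ∼ N_p(θ, I_p) with a made-up θ, and the shrinkage estimator δ_c(X) = (1 − c(p−2)/‖X‖²)X as in the text) illustrates the dominance in the corollary:

# Sketch: compare empirical risks of X and the shrinkage estimator delta_c(X) for p >= 3.
import numpy as np

rng = np.random.default_rng(2)
p, c, reps = 10, 1.0, 20000
theta = np.ones(p)                                 # hypothetical true mean

x = theta + rng.standard_normal((reps, p))
norm2 = np.sum(x**2, axis=1)
delta = (1.0 - c * (p - 2) / norm2)[:, None] * x   # delta_c(X) = (1 - c(p-2)/||X||^2) X

risk_x = np.mean(np.sum((x - theta)**2, axis=1))        # close to p
risk_js = np.mean(np.sum((delta - theta)**2, axis=1))   # strictly smaller when p >= 3, 0 < c < 2
print(risk_x, risk_js)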
When H0 is true, the value of λ(x) tends to be large. So the rejection region is
{λ(x) ≤ c}.
The maximum is
max_{µ,σ} f(x_1, ⋯, x_n | µ, σ) = C V^{−n/2} exp( −(1/(2V)) Σ (X_i − X̄)² ) = C V^{−n/2} e^{−n/2},
which is achieved at θ = x_(1). Now consider max_{θ≤θ_0} f(x|θ). If θ_0 ≥ x_(1), the two maxima are identical; if θ_0 < x_(1), the maximum is achieved at θ = θ_0, which yields
max_{θ≤θ_0} f(x|θ) = e^{−Σ x_i + n x_(1)}  if θ_0 ≥ x_(1);   e^{−Σ x_i + n θ_0}  if θ_0 < x_(1).
This gives the likelihood ratio
λ(x) = 1  if x_(1) ≤ θ_0;   e^{n(θ_0 − x_(1))}  if x_(1) > θ_0.
THEOREM 4 Suppose X has pdf or pmf f (x|θ) and T (X) is sufficient for θ. Let λ(x) and
λ∗ (y) be the LRT statistics obtained from X and Y = T (X). Then λ(x) = λ∗ (T (x)) for any
x.
Proof. Suppose the pdf or pmf of T(X) is g(t|θ). To keep it simple, suppose X is discrete, that is, P_θ(T(X) = t) = g(t|θ). Then f(x|θ) = P_θ(X = x) = g(T(x)|θ) h(x) by the definition of sufficiency, where h(x) is a function depending only on x (not θ).
Evidently,
λ(x) = sup_{θ∈H0} f(x|θ) / sup_{θ∈H0∪Ha} f(x|θ)   and   λ*(t) = sup_{θ∈H0} g(t|θ) / sup_{θ∈H0∪Ha} g(t|θ).   (2.4)
This theorem says that the simplification of λ(x) ultimately depends on x only through a sufficient statistic for the parameter.
3 Evaluating tests
Consider the test H0 : θ ∈ Θ0 versus Ha : θ ∈ Θa, where Θ0 and Θa are disjoint. When a decision is made, there are two correct outcomes and two types of errors:

                 Do not reject H0      Reject H0
   H0 true       correct               Type I error
   Ha true       Type II error         correct
Suppose R denotes the rejection region for the test. The chances of making the two types of mistakes are P_θ(X ∈ R) for θ ∈ Θ0 and P_θ(X ∈ R^c) for θ ∈ Θa. So
P_θ(X ∈ R) = probability of a Type I error, if θ ∈ Θ0;   1 − probability of a Type II error, if θ ∈ Θa.
DEFINITION 3.1 The power function of the test above with the rejection region R is a
function of θ defined by β(θ) = Pθ (X ∈ R).
DEFINITION 3.2 For α ∈ [0, 1], a test with power function β(θ) is a size α test if sup_{θ∈Θ0} β(θ) = α.
DEFINITION 3.3 For α ∈ [0, 1], a test with power function β(θ) is a level α test if sup_{θ∈Θ0} β(θ) ≤ α.
There are some differences between the above two definitions when dealing with complicated
models.
Example. Let X_1, ⋯, X_n be from N(µ, σ²) with µ unknown and σ known. Consider the test H0 : µ ≤ µ0 vs Ha : µ > µ0. It is easy to check that the likelihood ratio test has rejection region
{ (X̄ − µ0)/(σ/√n) > c }.
Example. In general, a size α LRT is constructed by choosing c such that sup_{θ∈Θ0} P_θ(λ(X) ≤ c) = α. Let X_1, ⋯, X_n be a random sample from N(θ, 1). Test H0 : θ = θ0 vs Ha : θ ≠ θ0. Here one can easily see that
λ(x) = exp( −(n/2)(x̄ − θ0)² ).
So {λ(X) ≤ c} = {|X̄ − θ0| ≥ d}. For given α, d can be chosen such that P(|Z| > d√n) = α, where Z ∼ N(0, 1).
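A minimal sketch of this construction (values of α, n, θ0 below are hypothetical): the cutoff is d = z_{α/2}/√n, and simulation confirms the test has the nominal size.

# Sketch: size-alpha rejection region {|Xbar - theta0| >= d} with d = z_{alpha/2}/sqrt(n).
import numpy as np
from scipy.stats import norm

alpha, n, theta0 = 0.05, 25, 0.0
d = norm.ppf(1 - alpha / 2) / np.sqrt(n)

rng = np.random.default_rng(3)
xbar = rng.normal(theta0, 1.0, size=(100000, n)).mean(axis=1)
print(d, np.mean(np.abs(xbar - theta0) >= d))   # empirical size close to 0.05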
For the example in (2.2), the test is H0 : θ ≤ θ0 vs Ha : θ > θ0 and the rejection region is {λ(X) ≤ c} = {X_(1) > d}. For size α, first choose d such that P_{θ0}(X_(1) > d) = α.
We plan to select the best one. We are not able to reduce the two types of errors at the same time (under each hypothesis, the chance of an error and the chance of a correct decision sum to one). So we fix the chance of the first type of error and maximize the power among all decision rules.
A test in class C, with power function β(θ), is a uniformly most powerful (UMP) class C test if β(θ) ≥ β′(θ) for every θ ∈ Θa and every β′(θ) that is a power function of a test in class C.
In this section we only consider C to be the set of tests with a fixed level α, i.e., C = {β(θ) : sup_{θ∈Θ0} β(θ) ≤ α}.
The following gives a way to obtain UMP tests.
Then
(i) (Sufficiency). Any test satisfying (4.1) and (4.2) is a UMP level α test.
(ii) (Necessity). Suppose there exists a test satisfying (4.1) and (4.2) with k > 0. Then
every UMP level α test is a size α test (satisfies (4.2)); Every UMP level α test satisfies
(4.1) except on a set A satisfying Pθ (X ∈ A) = 0 for θ = θ0 and θ1 .
Proof. We will prove the theorem only for the case that f (x|θ) is continuous.
Since Θ0 consists of only one point, a level α test and a size α test are equivalent.
(i) Let R satisfy (4.1) and (4.2) and φ(x) = I(x ∈ R) with power function β(θ). Now take
another rejection region R′ with β ′ (θ) = Pθ (X ∈ R′ ) and β ′ (θ0 ) ≤ α. Set φ′ (x) = I(x ∈ R′ ).
It suffices to show that
Actually, (φ(x) − φ′ (x))(f (x|θ1 ) − kf (x|θ0 )) ≥ 0 (check it based on if f (x|θ1 ) > kf (x|θ0 ) or
not). Then
0 ≤ ∫ (φ(x) − φ′(x))(f(x|θ_1) − kf(x|θ_0)) dx = β(θ_1) − β′(θ_1) − k(β(θ_0) − β′(θ_0)).   (4.4)
This example says that if we firmly want a size α test, then there are only two such α’s:
α = 1/4 and α = 3/4.
Proof. Let µ be the distribution of X. Note that (f(x) − f(y))(g(x) − g(y)) ≥ 0 for any x and y. Then
0 ≤ ∬ (f(x) − f(y))(g(x) − g(y)) µ(dx) µ(dy)
 = 2 ∫ f(x)g(x) µ(dx) − 2 ∬ f(x)g(y) µ(dx) µ(dy)
 = 2[E(f(X)g(X)) − Ef(X) · Eg(X)].
Let f (x|θ), θ ∈ R be a pmf or pdf with common support, that is, the set {x : f (x|θ) > 0}
is identical for every θ. This family of distributions is said to have monotone likelihood ratio
(MLR) with respect to a statistic T (x) if
f_{θ2}(x)/f_{θ1}(x) is, on the support, a non-decreasing function of T(x) for any θ_2 > θ_1.   (4.5)
LEMMA 4.2 Let X be a random vector with X ∼ f (x|θ), θ ∈ R, where f (x|θ) is a pmf
or pdf with a common support. Suppose f (x|θ) has MLR with a statistic T (x). Let
φ(x) = 1 when T(x) > C;  γ when T(x) = C;  0 when T(x) < C,   (4.6)
for some C ∈ R, where γ ∈ [0, 1]. Then E_θ φ(X) is non-decreasing in θ.
Proof. Thanks to the MLR property, for any θ_1 < θ_2 there exists a nondecreasing function g(t) such that f(x|θ_2)/f(x|θ_1) = g(T(x)). Thus
E_{θ2} φ(X) = ∫ φ(x) [f(x|θ_2)/f(x|θ_1)] f(x|θ_1) dx = E_{θ1}( g(T(X)) · h(T(X)) ),
where
h(x) = 1 when x > C;  γ when x = C;  0 when x < C.
Since both g(x) and h(x) are non-decreasing, by the positive correlation inequality,
Eθ1 (g(T (X)) · h(T (X))) ≥ Eθ1 g(T (X)) · Eθ1 h(T (X)) = Eθ1 φ(X)
In (4.6), if T (X) > C reject H0 ; if T (X) < C, don’t reject H0 ; if T (X) = C, with
chance γ reject H0 and with chance 1 − γ don’t reject H0 . The last case is equivalent to
flipping a coin with chance γ for head and 1 − γ for tail. If the head occurs, reject H0 ;
otherwise, don’t reject H0 . This is a randomization. The purpose is to make the test UMP.
The randomization occurs only when T (X) is discrete.
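A concrete sketch of this randomization (not from the notes; n, p0 and α below are hypothetical): for X ∼ Binomial(n, p) and H0 : p ≤ p0 vs Ha : p > p0, the constants C and γ in (4.6) can be computed so that the size is exactly α.

# Sketch: choose C and gamma so that P(X > C) + gamma * P(X = C) = alpha under p0.
from scipy.stats import binom

n, p0, alpha = 20, 0.5, 0.05
C = min(c for c in range(n + 1) if binom.sf(c, n, p0) <= alpha)   # P(X > C) <= alpha
gamma = (alpha - binom.sf(C, n, p0)) / binom.pmf(C, n, p0)

size = binom.sf(C, n, p0) + gamma * binom.pmf(C, n, p0)
print(C, gamma, size)   # size equals alpha exactly

Without the randomization (γ = 0), only a few exact sizes are attainable because T(X) is discrete.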
Proof. Let β(θ) = Eθ φ(X). Then by the previous lemma, supθ≤θ0 β(θ) = Eθ0 φ(X) = α.
We need to show β(θ1 ) ≥ β ′ (θ1 ) for any θ1 > θ0 and β ′ such that supθ≤θ0 β ′ (θ) ≤ α.
Consider test H0 : θ = θ0 vs Ha : θ = θ1 . We know that
Again, we use the same notation as in Lemma 4.2, f (x|θ1 )/f (x|θ0 ) = g(T (x)), where
g(t) is a non-decreasing function in t. Take k = g(C) in the Neyman-Pearson Lemma 5.
If f (x|θ1 )/f (x|θ0 ) > k, since g(t) is non-decreasing, then T (X) > C, which is the same
as saying to reject H0 by the functional form of φ(x) as in (4.6). If f (x|θ1 )/f (x|θ0 ) < k
then T (X) < C, that is, not to reject H0 . By Neyman-Pearson lemma and (4.7), φ(x)
corresponds to a UMP level-α test. Thus, β(θ1 ) ≥ β ′ (θ1 ).
5 Probability convergence
Let (Ω, F, P) be a probability space and let {X, X_n; n ≥ 1} be a sequence of random variables defined on it. We say X_n → X in probability (or X_n is consistent with X) if
P(|X_n − X| ≥ ε) → 0 as n → ∞
for every ε > 0. For events E_n, write {E_n i.o.} = ∩_{n≥1} ∪_{k≥n} E_k. Here "i.o." means "infinitely often": if ω ∈ {E_n i.o.}, then there are infinitely many n such that ω ∈ E_n; conversely, if ω ∉ {E_n i.o.}, i.e., ω ∈ {E_n i.o.}^c, then only finitely many E_n contain ω.
In proving almost sure convergence, the following Borel-Cantelli lemma is very powerful.
Thus, X < ∞ a.s.
Proof. First,
{E_n i.o.}^c = ∪_n ∩_{k≥n} E_k^c.
Thus, P(|S_n/n| ≥ ε) = O(n^{−2}). Consequently, Σ_{n≥1} P(|S_n/n| ≥ ε) < ∞ for any ε > 0. This means that, with probability one, |S_n/n| ≤ ε for sufficiently large n. Therefore,
lim sup_{n→∞} |S_n|/n ≤ ε   a.s.
for any ε > 0. Let ε ↓ 0. The desired conclusion follows.
By using a truncation method, we have the following Kolmogorov’s Strong Law of Large
Numbers.
There are many variations of the above strong law of large numbers. Among them,
Etemadi showed that Theorem 7 holds if {Xn ; n ≥ 1} is a sequence of pairwise independent
and identically distributed random variables with mean µ.
The next proposition tells the relationship between convergence in probability and con-
vergence almost surely.
Proof. The joint almost sure convergence and joint convergence in probability are equivalent to the corresponding coordinatewise convergences. So, w.l.o.g., assume X_n and X are one-dimensional. Given ε > 0, the almost sure convergence implies that
where we use the fact that
∪_{k≥n} {|X_k − X| ≥ ε} ↓ {|X_n − X| ≥ ε i.o.}   as n → ∞.
P(|X_{k_l} − X| ≥ ε_0) ≥ ε_0
for all l ≥ 1. Therefore no subsequence of X_{k_l} can converge to X in probability, hence none can converge to X almost surely.
Example. Let X_i, i ≥ 1, be a sequence of i.i.d. random variables with mean zero and variance one. Let S_n = X_1 + ⋯ + X_n, n ≥ 1. Then S_n/√(2n log log n) → 0 in probability, but it does not converge to zero almost surely. Actually, lim sup_{n→∞} S_n/√(2n log log n) = 1 a.s. (the law of the iterated logarithm).
for large k. By the Borel-Cantelli lemma, P (Wnk ≥ bnk , i.o.) = 0. This implies that
lim sup_{k} W_{n_k}/b_{n_k} ≤ √2 + ε   a.s.
For any n ≥ n_1 there exists k such that n_k ≤ n < n_{k+1}. Noting that n_{k+1}/n_k → 1 as k → ∞ and W_{n_k} ≤ W_n ≤ W_{n_{k+1}}, we have that
lim sup_{n} W_n/√(log n) ≤ √2 + ε   a.s.
By (5.3),
nP(X_1 > c_n) ≥ [n c_n / (√(2π)(1 + c_n²))] e^{−c_n²/2} ∼ [C/√(2π log n)] n^{1−(√2−ε)²/2}
as n becomes large, where C is a constant. Since 1 − (√2 − ε)²/2 > 0,
Σ_{n≥2} P(W_n ≤ c_n) < ∞.
The above results also hold for random variables taking abstract values. Now let’s
introduce such random variables.
A metric space is a set D equipped with a metric. A metric is a map d : D × D → [0, ∞) with the properties (i) d(x, y) = 0 if and only if x = y; (ii) d(x, y) = d(y, x); (iii) d(x, z) ≤ d(x, y) + d(y, z), for any x, y and z. A semi-metric satisfies (ii) and (iii), but not necessarily (i). An open
ball is a set of the form {y : d(x, y) < r}. A subset of a metric space is open if and only if it
is the union of open balls; it is closed if and only if the complement is open. A sequence xn
converges to x if and only if d(xn , x) → 0, and denoted by xn → x. The closure Ā of a set
A is the set of all points that are the limit of a sequence in A; it is the smallest closed set
containing A. The interior Å is the collection of all points x such that x ∈ G ⊂ A for some
open set G; it is the largest open set contained in A. A function f : D → E between metric
spaces is continuous at a point x if and only if f (xn ) → f (x) for every sequence xn → x;
it is continuous at every x if and only if f −1 (G) is also open for every open set G ⊂ E. A
subset of a metric space is dense if and only if its closure is the whole space. A metric space
is separable if and only if it has a countable dense subset. A subset K of a metric space is compact if and only if it is closed and every sequence in K has a convergent subsequence. A set K is totally bounded if and only if for every ε > 0 it can be covered by finitely many balls of radius ε. The space is complete if and only if any Cauchy sequence has a limit. A
subset of a complete metric space is compact if and only if it is totally bounded and closed.
A normed space D is a vector space equipped with a norm. A norm is a map ‖·‖ : D → [0, ∞) such that for every x, y in D and α ∈ R: (i) ‖x + y‖ ≤ ‖x‖ + ‖y‖; (ii) ‖αx‖ = |α| ‖x‖; (iii) ‖x‖ = 0 implies x = 0.
Definition. The Borel σ-field on a metric space D is the smallest σ-field that contains
the open sets. A function taking values in a metric space is called Borel-measurable if it is
measurable relative to the Borel σ-field. A Borel-measurable map X : Ω → D defined on a
probability space (Ω, F, P ) is referred to as a random element with values in D.
The following lemma is obvious.
Example. Let C[a, b] = {f (x); f is a continuous function on [a, b]}. Define
‖f‖ = sup_{a≤x≤b} |f(x)|.
This is a complete normed linear space which is also called a Banach space. This space is
separable. Any separable Banach space is isomorphic to a subset of C[a, b].
Remark. Proposition 5.1 still holds when the random variables taking values in a Polish
space.
Let {X, X_n; n ≥ 1} be a sequence of random variables. We say X_n converges to X weakly, or in distribution, if
P(X_n ≤ x) → P(X ≤ x)
at every continuity point x of the cdf of X.
LEMMA 5.4 Let Xn and X be random variables. The following are equivalent:
(i) Xn converges weakly to X;
(ii) Ef (Xn ) → Ef (X) for any bounded continuous function f (x) defined on R;
(iii) liminfn P (Xn ∈ G) ≥ P (X ∈ G) for any open set G;
(iv) limsupn P (Xn ∈ F ) ≤ P (X ∈ F ) for any closed set F ;
Remark. The above lemma is actually true for any finite dimensional space. Further, it
is true for random variables taking values in separable, complete metric spaces, which are
also called Polish spaces.
Proof. (i) =⇒ (ii). Given ε > 0, choose K > 0 such that K and −K are continuity points of the cdfs of all the X_n's and of X, and P(−K ≤ X ≤ K) ≥ 1 − ε. By (i), P(−K ≤ X_n ≤ K) → P(−K ≤ X ≤ K); in other words, P(−K ≤ X_n ≤ K) ≥ 1 − 2ε for all large n.
Since f(x) is continuous on [−K, K], we can cut the interval into, say, m subintervals [a_i, a_{i+1}], 1 ≤ i ≤ m, such that each a_i is a continuity point of the cdf of X and sup_{a_i ≤ x ≤ a_{i+1}} |f(x) − f(a_i)| ≤ ε. Set
g(x) = Σ_{i=1}^m f(a_i) I(a_i ≤ x < a_{i+1}).
It is easy to see that
|f(x) − g(x)| ≤ ε for |x| ≤ K   and   |f(x) − g(x)| ≤ ‖f‖ for |x| > K,
so that |Ef(Y) − Eg(Y)| ≤ ε + ‖f‖ P(|Y| > K) for any random variable Y. Applying this to X_n and X, we obtain from (5.5) that
lim sup_n |Ef(X_n) − Eg(X_n)| ≤ (1 + ‖f‖)ε   and   |Ef(X) − Eg(X)| ≤ (1 + ‖f‖)ε.
Example. Let Xn be uniformly distributed over {1/n, 2/n, · · · , n/n}. Prove that Xn con-
verges weakly to U [0, 1].
Actually, for any bounded continuous function f(x),
Ef(X_n) = (1/n) Σ_{i=1}^n f(i/n) → ∫_0^1 f(t) dt = Ef(Y),
where Y ∼ U[0, 1].
Example. Let X_i, i ≥ 1, be i.i.d. random variables. Define
µ_n = (1/n) Σ_{i=1}^n δ_{X_i},  i.e.,  µ_n(A) = (1/n) Σ_{i=1}^n I(X_i ∈ A) = #{i : X_i ∈ A}/n
for any set A. Then, for any measurable function f(x) with E|f(X_1)| < ∞, by the strong law of large numbers,
∫ f(x) µ_n(dx) = (1/n) Σ_{i=1}^n f(X_i) → Ef(X_1) = ∫ f(x) µ(dx)   a.s.
COROLLARY 5.1 Suppose X_n converges to X weakly and g is a continuous map defined on R^1 (its values need not be one-dimensional). Then g(X_n) converges to g(X) weakly.
Proof. By Taylor’s expansion g(x) = g(θ) + g′ (θ)(x − θ) + a(x − θ) for some real function
a(x) such that a(t)/t → 0 as t → 0. Then,
√n(g(Y_n) − g(θ)) = g′(θ) √n(Y_n − θ) + √n · a(Y_n − θ).
We only need to show that √n · a(Y_n − θ) → 0 in probability as n → ∞ (later we will see this easily from Slutsky's lemma, but we have not established it yet). Indeed, for any m ≥ 1 there exists δ > 0 such that |a(x)| ≤ |x|/m for all |x| ≤ δ. Thus,
P(|√n · a(Y_n − θ)| ≥ b) ≤ P(|Y_n − θ| > δ) + P(|√n(Y_n − θ)| ≥ mb) → P(|N(0, σ²)| ≥ mb)
as n → ∞. Now let m ↑ ∞; we obtain that √n · a(Y_n − θ) → 0 in probability.
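A simulation sketch of the delta method just proved (the distribution, sample size and the choice g(x) = x² are assumptions for illustration): if √n(Y_n − θ) ⇒ N(0, σ²), then √n(g(Y_n) − g(θ)) ⇒ N(0, g′(θ)²σ²).

# Sketch: delta method for g(x) = x^2 with Y_n the mean of Exp(1) samples (theta = sigma = 1).
import numpy as np

rng = np.random.default_rng(4)
n, reps, theta = 200, 50000, 1.0
y_n = rng.exponential(1.0, size=(reps, n)).mean(axis=1)

z = np.sqrt(n) * (y_n**2 - theta**2)
print(z.std(), 2.0)   # empirical sd close to |g'(theta)| * sigma = 2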
Proof. By Taylor's expansion again,
n(g(Y_n) − g(θ)) = (1/2) σ² g″(θ) · n((Y_n − θ)/σ)² + o(n|Y_n − θ|²).
Proof. The implication “(i) =⇒ (ii)” is proved earlier. Now we prove that (ii) =⇒ (iii).
For any closed set F let F^ε = {x : d(x, F) ≤ ε}. Then F^ε is a closed set. When X takes values in R, the metric is d(x, y) = |x − y|. It is easy to see that
Any random vector X taking values in R^k is tight: for every ε > 0 there exists a number M > 0 such that P(‖X‖ > M) < ε. A set of random vectors {X_α; α ∈ A} is called uniformly tight if for every ε > 0 there exists a constant M > 0 such that sup_{α∈A} P(‖X_α‖ > M) < ε.
A random vector X taking values in a Polish space (D, d) is also tight: for every ε > 0 there exists a compact set K ⊂ D such that P(X ∉ K) < ε. This will be proved later.
Remark. It is easy to see that a finite number of random variables is uniformly tight if every individual is tight. Also, if {X_α; α ∈ A} and {X_α; α ∈ B} are uniformly tight, respectively, then {X_α; α ∈ A ∪ B} is also uniformly tight. Further, if X_n ⇒ X, then {X, X_n; n ≥ 1} is uniformly tight. Actually, for any ε > 0, choose M > 0 such that P(‖X‖ < M) ≥ 1 − ε. Since the set {x ∈ R^k : ‖x‖ < M} is open, lim inf_{n→∞} P(‖X_n‖ < M) ≥ P(‖X‖ < M) ≥ 1 − ε. The uniform tightness of {X, X_n} follows.
LEMMA 5.5 Let µ be a probability measure on (D, B), where (D, d) is a Polish space and
B is the Borel σ-algebra. Then µ is tight.
Proof. Since (D, d) is Polish, it is separable, so there exists a countable set {x_1, x_2, ⋯} ⊂ D that is dense in D.
Given ε > 0. For any integer k ≥ 1, since ∪_{n≥1} B(x_n, 1/k) = D, there exists n_k such that µ(∪_{n=1}^{n_k} B(x_n, 1/k)) > 1 − ε/2^k. Set D_k = ∪_{n=1}^{n_k} B(x_n, 1/k) and M = ∩_{k≥1} D_k. Then M is totally bounded (M can always be covered by a finite number of balls with any prescribed radius). Also
µ(M^c) ≤ Σ_{k=1}^∞ µ(D_k^c) ≤ Σ_{k=1}^∞ ε/2^k = ε.
Let K be the closure of M. Since D is complete, M ⊂ K ⊂ D and K is compact. Also by
the above fact, µ(K c ) < ǫ.
LEMMA 5.6 Let {Xn ; n ≥ 1} be a sequence of random variables. Let Fn be the cdf of Xn .
Then there exists a subsequence Fnj with the property that Fnj (x) → F (x) at each continuity
point of x of a possibly defective distribution function F. Further, if {Xn } is tight, then F (x)
is a cumulative distribution function.
Proof. Put all rational numbers q1 , q2 , · · · in a sequence. Because the sequence Fn (q1 )
is contained in the interval [0, 1], it has a convergent subsequence. Call the indexing subsequence {n_j^1}_{j=1}^∞ and the limit G(q_1). Next, extract a further subsequence {n_j^2} ⊂ {n_j^1} along which F_n(q_2) converges to a limit G(q_2), then a further subsequence {n_j^3} ⊂ {n_j^2} along which F_n(q_3) converges to a limit G(q_3), and so forth. The tail of the diagonal sequence n_j := n_j^j belongs to every sequence {n_j^i}. Hence F_{n_j}(q_i) → G(q_i) for every i = 1, 2, ⋯. Since each F_n is nondecreasing, G(q) ≤ G(q′) if q ≤ q′. Define F(x) = inf{G(q) : q rational, q > x}.
It is not difficult to verify that F(x) is nondecreasing and right-continuous at every point x. It is also easy to show that F_{n_j}(x) → F(x) at each continuity point x of F, using the monotonicity of F(x).
Now assume {X_n} is tight. For given ε ∈ (0, 1), there exists a rational number M > 0 such that P(−M < X_n ≤ M) ≥ 1 − ε for all n; equivalently, F_n(M) − F_n(−M) ≥ 1 − ε for all n. Letting n_j → ∞ along the subsequence, we have G(M) − G(−M) ≥ 1 − ε. By definition, F(M) − F(−M) ≥ 1 − ε. Therefore F(x) is a cdf.
Proof. Let F_n(x) be the cdf of X_n. Let U be a random variable uniformly distributed on ([0, 1], B[0, 1]). Define Y_n = F_n^{-1}(U), where the inverse is the generalized inverse of a non-decreasing function defined by F_n^{-1}(t) = inf{x : F_n(x) ≥ t} for 0 ≤ t ≤ 1. One can verify that F_n^{-1}(t) → F_0^{-1}(t) at every continuity point t of F_0^{-1}. The set of discontinuity points is countable and therefore has Lebesgue measure zero. This shows that Y_n = F_n^{-1}(U) → F_0^{-1}(U) = Y_0 almost surely.
It is easy to check that L(X_n) = L(Y_n) for all n = 0, 1, 2, ⋯.
LEMMA 5.7 (Slutsky). Let {a_n}, {b_n} and {X_n} be sequences of random variables such that a_n → a in probability, b_n → b in probability, and X_n ⇒ X, for constants a and b and some random variable X. Then a_nX_n + b_n ⇒ aX + b as n → ∞.
Proof. We will show that (an Xn + bn )− (aXn + b) → 0 in probability. If this is the case,
then the desired conclusion follows from the weak convergence of Xn (why?). Given ǫ > 0.
Since {Xn } is convergent, it is tight. So there exists M > 0 such that P (|Xn | ≥ M ) ≤ ǫ.
Moreover, by the given condition,
P( |a_n − a| ≥ ε/(2M) ) < ε   and   P( |b_n − b| ≥ ε ) < ε
as n is sufficiently large. Note that
Thus,
To prove the weak convergence, we have to show that ∫ f(x) dµ_n → ∫ f(x) dµ_0 for every bounded continuous function f(x). This is equivalent to the following: for any subsequence, there is a further subsequence, say n_k, such that ∫ f(x) dµ_{n_k} → ∫ f(x) dµ_0.
For any subsequence, by Lemma 5.6 (the Helly selection principle), there is a further subsequence, say µ_{n_k}, that converges weakly to some µ_0; by the earlier proof, the c.f. of µ_0 is φ_0. We know that a c.f. uniquely determines a distribution, so every such limit is identical. Hence ∫ f(x) dµ_{n_k} → ∫ f(x) dµ_0.
Proof. Let φ(t) be the c.f. of X_1. Since EX_1 = 0 and EX_1² = 1, we have φ′(0) = iEX_1 = 0 and φ″(0) = i²EX_1² = −1. By Taylor's expansion,
φ(t/√n) = 1 − t²/(2n) + o(1/n)
as n → ∞. Hence
E e^{it√n X̄_n} = φ(t/√n)^n = ( 1 − t²/(2n) + o(1/n) )^n → e^{−t²/2}
as n → ∞. By Lévy's continuity theorem, the desired conclusion follows.
Proof. By the Cramér-Wold device, we need to show that
(1/√n) Σ_{i=1}^n (t^T X_i − t^T µ) ⇒ N(0, t^T Σ t)
for any t ∈ R^k. Note that {t^T X_i − t^T µ, i = 1, 2, ⋯} are i.i.d. real random variables. By the one-dimensional CLT,
(1/√n) Σ_{i=1}^n (t^T X_i − t^T µ) ⇒ N(0, Var(t^T X_1)).
Σ_{i=1}^{k_n} Cov(Y_{n,i}) → Σ   and   Σ_{i=1}^{k_n} E‖Y_{n,i}‖² I{‖Y_{n,i}‖ > ε} → 0
The central limit theorem holds not only for independent random variables; it is also true for some dependent random variables with special structure.
Let Y_1, Y_2, ⋯ be a sequence of random variables, and let {F_n} be a non-decreasing sequence of σ-algebras (F_n ⊂ F_{n+1} for all n ≥ 0, with F_0 = {∅, Ω}) such that F_n ⊃ σ(Y_1, ⋯, Y_n). When E(Y_{n+1}|F_n) = 0 for every n ≥ 0, we say {Y_n; n ≥ 1} are martingale differences.
Example. Let Xi ; i ≥ 1 be i.i.d. with mean zero and variance one. Then one can
√
verify that the required conditions in Theorem 16 hold for Xni = Xi / n for i = 1, 2, · · · , n.
Therefore we recover the classical CLT that Σ_{i=1}^n X_i/√n ⇒ N(0, 1). Actually,
P(max_i |X_{ni}| ≥ ε) ≤ nP(|X_1| ≥ √n ε) ≤ (1/ε²) E|X_1|² I{|X_1| ≥ √n ε} → 0, and
E(max_i |X_{ni}|)² ≤ E( Σ_{i=1}^n X_i²/n ) = 1.
This shows that maxi |Xni | → 0 in probability and {maxi |Xni |, n ≥ 1} is uniformly inte-
grable. So condition 2 holds.
LEMMA 5.8 Let {U_n; n ≥ 1} and {T_n; n ≥ 1} be two sequences of random variables such that
1. Un → a in probability, where a is a constant;
2. {Tn } is uniformly integrable;
3. {Un Tn } is uniformly integrable;
4. ETn → 1.
Then E(Tn Un ) → a.
Proof. Write Tn Un = Tn (Un − a) + aTn . Then E(Tn Un ) = E{Tn (Un − a)} + aETn . Evi-
dently, {Tn (Un − a)} is uniformly integrable by the given condition. Since {Tn } is uniformly
integrable, it is not difficult to verify that Tn (Un − a) → 0 in probability. Then the result
follows.
for some function r(x) = O(x³) as x → 0. In the set-up of Theorem 16, we have e^{itS_n} = T_n U_n, where
T_n = Π_j (1 + itX_{nj})   and   U_n = exp( −(t²/2) Σ_j X_{nj}² + Σ_j r(tX_{nj}) ).   (5.7)
3. Σ_j X_{nj}² → 1 in probability;
4. E(max_j |X_{nj}|) → 0.
Then S_n = Σ_j X_{nj} converges to N(0, 1) in distribution.
Proof. Since |T_nU_n| = |e^{itS_n}| = 1, we know that {T_nU_n} is uniformly integrable. By Lemma 5.8, it is enough to show that U_n → e^{−t²/2} in probability. In view of (5.7) and condition 3, it suffices to show that Σ_j r(tX_{nj}) → 0 in probability. The assertion r(x) = O(x³) says that there exist A > 0 and δ > 0 such that |r(x)| ≤ A|x|³ for |x| < δ. Then
P( |Σ_j r(tX_{nj})| > ε ) ≤ P( max_j |X_{nj}| ≥ δ ) + P( |Σ_j r(tX_{nj})| > ε, max_j |X_{nj}| < δ )
 ≤ δ^{−1} E(max_j |X_{nj}|) + P( max_j |X_{nj}| · Σ_j |X_{nj}|² > ε ) → 0
Z_{nj} = X_{nj} I( Σ_{k=1}^{j−1} X_{nk}² ≤ 2 ),   2 ≤ j ≤ k_n.
It is easy to check that {Z_{nj}} is still a martingale difference array relative to the original σ-algebras. Now define J = inf{ j ≥ 1 : Σ_{1≤k≤j} X_{nk}² > 2 } ∧ k_n. Then P(J < k_n) ≤ P( Σ_{k=1}^{k_n} X_{nk}² > 2 ) → 0 as n → ∞. Therefore, P( S_n ≠ Σ_{j=1}^{k_n} Z_{nj} ) → 0 as n → ∞. To prove the result, we only need
to show
Σ_{j=1}^{k_n} Z_{nj} ⇒ N(0, 1).   (5.8)
We now apply Theorem 17 to prove this. Replacing Xnj by Znj in (5.7), we have new Tn and
Un . Let’s literally verify the four conditions in Theorem 17. By a martingale property and
iteration, ETn = 1. So condition 1 holds. Now maxj |Znj | ≤ maxj |Xnj | → 0 in probability
as n → ∞. So condition 4 holds. It is also easy to check that Σ_j Z_{nj}² → 1 in probability.
Then it remains to show condition 2.
By definition, T_n = Π_{j=1}^{k_n} (1 + itZ_{nj}) = Π_{1≤j≤J} (1 + itX_{nj}). Thus,
|T_n| = Π_{k=1}^{J−1} (1 + t²X_{nk}²)^{1/2} · |1 + itX_{nJ}|
 ≤ exp( (t²/2) Σ_{k=1}^{J−1} X_{nk}² ) (1 + |t| · |X_{nJ}|)
 ≤ e^{t²} (1 + |t| · max_j |X_{nj}|).
The next is an extreme-value limit theorem, which is different from the CLT. The limiting distribution is called an extreme-value distribution.
Proof. Let t_n be the right-hand side of "≤" in the probability above. Then
1/(√(2π) t_n) ∼ 1/(2√(π log n))   and   t_n²/2 ∼ log n − log(√(π log n)) − x/2
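A quick simulation sketch of the scaling behind this extreme-value result (not from the notes): the maximum W_n of n i.i.d. N(0, 1) variables concentrates around √(2 log n).

# Sketch: maximum of n standard normals versus sqrt(2 log n).
import numpy as np

rng = np.random.default_rng(5)
for n in (10**3, 10**5, 10**7):
    w_n = rng.standard_normal(n).max()
    print(n, w_n, np.sqrt(2 * np.log(n)))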
As an application of the CLT for i.i.d. random variables, we can derive the classical χ²-test as follows.
Consider a multinomial distribution with n trials and k classes, with parameter p = (p_1, ⋯, p_k). Roll a die twenty times and let X_i be the number of occurrences of "i dots", 1 ≤ i ≤ 6. Then (X_1, X_2, ⋯, X_6) follows a multinomial distribution with "success rate" p = (p_1, ⋯, p_6). How do we test whether the die is fair? From an introductory course, we know that the null hypothesis is H0 : p_1 = ⋯ = p_6 = 1/6 and the test statistic is
χ² = Σ_{i=1}^6 (X_i − np_i)²/(np_i).
We use the fact that χ² is approximately χ²(5) when n is large. We will prove this next.
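For concreteness, here is a small sketch of the test in practice (the counts below are made-up data, not from the notes): compute χ² and compare it with the χ²(5) distribution.

# Sketch: chi-square test of die fairness with hypothetical counts.
import numpy as np
from scipy.stats import chi2

counts = np.array([8, 11, 9, 14, 10, 8])   # hypothetical observed frequencies of 1..6 dots
n = counts.sum()
expected = n / 6.0

chi2_stat = np.sum((counts - expected)**2 / expected)
p_value = chi2.sf(chi2_stat, df=5)
print(chi2_stat, p_value)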
In general, X_n = (X_{n,1}, ⋯, X_{n,k}) follows a multinomial distribution with n trials and "success rate" p = (p_1, ⋯, p_k). Of course, Σ_{i=1}^k X_{n,i} = n. To be more precise,
P(X_{n,1} = x_{n,1}, ⋯, X_{n,k} = x_{n,k}) = [ n! / (x_{n,1}! ⋯ x_{n,k}!) ] p_1^{x_{n,1}} ⋯ p_k^{x_{n,k}},
where Σ_{i=1}^k x_{n,i} = n. Now we prove the following theorem.
THEOREM 19 As n → ∞,
χ_n² = Σ_{i=1}^k (X_i − np_i)²/(np_i) ⇒ χ²(k − 1).
We need a lemma.
Proof. Decompose Σ = O diag(λ_1, ⋯, λ_k) O^T for some orthogonal matrix O. Then the k coordinates of OY are independent with mean zero and variances λ_i. So ‖Y‖² = ‖OY‖² is equal to Σ_{i=1}^k λ_i Z_i², where the Z_i are i.i.d. N(0, 1).
Another fact we will use is that AXn =⇒ AX for any k × k matrix A if Xn =⇒ X, where
Xn and X are Rk -valued random vectors. This can be shown easily by the Cramer-Wold
device.
Proof of the Theorem. Let Y_1, ⋯, Y_n be i.i.d. multinomial with one trial and "success rate" p = (p_1, ⋯, p_k). Then X_n = Y_1 + ⋯ + Y_n. Each Y_i is a k-dimensional vector. By the CLT,
(X_n − EX_n)/√n ⇒ N_k(0, Cov(Y_1)),
where it is easy to see that EX_n = np and
Σ_1 := Cov(Y_1) = [[ p_1(1−p_1), −p_1p_2, ⋯, −p_1p_k ], [ −p_1p_2, p_2(1−p_2), ⋯, −p_2p_k ], ⋯ , [ −p_1p_k, −p_2p_k, ⋯, p_k(1−p_k) ]].
Set D = diag(1/√p_1, ⋯, 1/√p_k). Then
D Σ_1 D^T = I − √p (√p)^T,   where √p = (√p_1, ⋯, √p_k)^T.
Since Σ_{i=1}^k p_i = 1, the matrix DΣ_1D^T is symmetric and idempotent, i.e., a projection. Also, its trace is k − 1, so DΣ_1D^T has eigenvalue 0 with multiplicity one and eigenvalue 1 with multiplicity k − 1. Therefore, by the lemma, we have that
D(X_n − np)/√n ⇒ N(0, DΣ_1D^T).
where the supremum is taken over all A ∈ σ(X1 , · · · , Xk ), B ∈ σ(Xk+n , Xk+n+1 , · · · ), and
k ≥ 1. Obviously, α(n) is decreasing. When α(n) → 0, then Xk and Xk+n are approximately
independent. In this case the sequence {Xn } is said to be α-mixing. If the distribution of
the random vector (Xn , Xn+1 , · · · , Xn+j ) does not depend on n, the sequence is said to be
stationary.
The sequence is called m-dependent if (X1 , · · · , Xk ) and (Xk+n , · · · , Xk+n+l ) are inde-
pendent for any n > m, k ≥ 1 and l ≥ 1. In this case the sequence is α-mixing with αn = 0
for n > m. An independent sequence is 0-dependent.
For stationary processes, we have the following central limit theorems.
Var(S_n)/n → σ² = E(X_1²) + 2 Σ_{k=1}^∞ E(X_1X_{k+1}),   (6.11)
where the series converges absolutely. If σ > 0, then S_n/√n ⇒ N(0, σ²).
THEOREM 21 Suppose that X_1, X_2, ⋯ is stationary with EX_1 = 0 and E|X_1|^{2+δ} < ∞ for some constant δ > 0. Let S_n = X_1 + ⋯ + X_n. If Σ_{n=1}^∞ α(n)^{δ/(2+δ)} < ∞, then
Var(S_n)/n → σ² = E(X_1²) + 2 Σ_{k=1}^∞ E(X_1X_{k+1}),
where the series converges absolutely. If σ > 0, then S_n/√n ⇒ N(0, σ²).
COROLLARY 6.1 Let X_1, X_2, ⋯ be m-dependent random variables with the same distribution. If EX_1 = 0 and E|X_1|^{2+δ} < ∞ for some δ > 0, then S_n/√n ⇒ N(0, σ²), where σ² = EX_1² + 2 Σ_{i=1}^m E(X_1X_{1+i}).
Remark. If we argue by using the truncation method as in the proof of Theorem 21, we
can show that the above corollary still holds only under EX12 < ∞.
Proof. W.l.o.g., assume that C = D = 1. Since expectations can be approximated by finite sums, we may further assume, w.l.o.g., that
Y = Σ_{i=1}^m y_i I_{A_i}   and   Z = Σ_{j=1}^n z_j I_{B_j}.
Proof. Let Y_0 = YI(|Y| ≤ a) and Y_1 = YI(|Y| > a); and Z_0 = ZI(|Z| ≤ a) and Z_1 = ZI(|Z| > a). Then
|E(YZ) − (EY)EZ| ≤ Σ_{i,j∈{0,1}} |E(Y_iZ_j) − (EY_i)EZ_j|.
By the previous lemma, the term corresponding to i = j = 0 is bounded by 4a²α(n). For i = j = 1, |E(Y_1Z_1) − (EY_1)EZ_1| = |Cov(Y_1, Z_1)|, which is bounded by √(E(Y_1²)E(Z_1²)) ≤ √(CD)/a^δ since E(Y_1²) ≤ C/a^δ. Now
which is less than 4D/a^δ. The term for i = 1, j = 0 is bounded by 4C/a^δ. In summary,
|E(YZ) − (EY)EZ| ≤ 4a²α(n) + (√(CD) + 4C + 4D)/a^δ ≤ 4a²α(n) + 4.5(C + D)/a^δ.
Now take a = α(n)−1/(2+δ) . The conclusion follows.
Proof of Theorem 20. By Lemma 6.2, |EX_1X_{1+k}| ≤ 4A²α(k), so the series in (6.11) converges absolutely. Moreover, let ρ_k = E(X_1X_{1+k}). Then ES_n² = nEX_1² + 2 Σ_{1≤i<j≤n} E(X_iX_j) = nρ_0 + 2 Σ_{k=1}^{n−1} (n − k)ρ_k. Then it is easy to see that
E(S_n²)/n → ρ_0 + 2 Σ_{k=1}^∞ ρ_k.   (6.12)
We claim that
E(S_n⁴) = n³ δ_n   (6.13)
for a sequence of constants δ_n → 0. Actually, E(S_n⁴) ≤ Σ_{1≤i,j,k,l≤n} |E(X_iX_jX_kX_l)|. Among the terms in the sum there are n terms of the form X_i⁴, and at most n² each of the forms X_i²X_j² and X_i³X_j with i ≠ j. So the contribution of these terms is bounded by Cn² for a universal constant C. There are only two types of terms left: X_i²X_jX_k and X_iX_jX_kX_l with all indices different. We will deal with the second type only; the first can be handled similarly and more easily. Note that
Σ_{1≤i,j,k,l≤n} |E(X_iX_jX_kX_l)| ≤ 4! n Σ_{1≤i+j+k≤n} |E(X_1X_{1+i}X_{1+i+j}X_{1+i+j+k})|   (6.14)
where the three indices in the last sum are not zero. By Lemma 6.2, the term inside is
bounded by 4A4 αi . By the same argument, it is also bounded by 4A4 αk . Therefore, the
sum in (6.14) is bounded by
X X
4 · 4!A4 n2 min{αi , αk } ≤ 2Kn2 min{αi , αk }
2≤i+k≤n 2≤i+k≤n, i≤k
Xn
≤ 2Kn2 kαk = nδn (6.15)
k=1
Therefore (6.13) follows. From this, we know that ES_n⁴ ≤ n³δ_n ≤ n³(δ_n + (1/n)) ≤ n³δ′_n, where δ′_n := sup_{k≥n} {δ_k + (1/k)} is decreasing to 0 and is bounded below by 1/n. Thus, w.l.o.g., assume δ_n has this property.
Define p_n = [√n log(1/δ_{[√n]})], q_n = [n/p_n] and r_n = [n/(p_n + q_n)]. Then
√n ≪ p_n ≤ √n log n   and   p_n² δ_{p_n}/n ≤ δ_{p_n} ( log(1/δ_{[√n]}) )² ≤ δ_{p_n} ( log(1/δ_{p_n}) )² → 0   (6.16)
as n → ∞. Now, for finite n, break X_1, X_2, ⋯, X_n into blocks as follows:
For i = 1, 2, · · · , rn , set
By assumption, |V_{n,i}| ≤ Aq_n for any i ≥ 1. By Lemma 6.1, |E(V_{n,1}V_{n,k})| ≤ 4A²q_n² α_{(k−1)p_n}. Thus, by monotonicity,
Σ_{k=2}^{r_n} |E(V_{n,1}V_{n,k})| ≤ 4A²q_n² Σ_{k=2}^{r_n} α_{(k−1)p_n} ≤ 4A²q_n² Σ_{k=2}^{r_n} (1/p_n) Σ_{j=(k−2)p_n+1}^{(k−1)p_n} α_j ≤ (4A²q_n²/p_n) Σ_{j=1}^∞ α_j.
for any t ∈ R, where S̃_n = Σ_{i=1}^{r_n} U′_{n,i} and {U′_{n,i}; i = 1, 2, ⋯, r_n} are i.i.d. with the same distribution as U_{n,1}. First, by (6.12), Var(S̃_n) = r_n Var(U_{n,1}) ∼ r_n p_n σ². It follows that Var(S̃_n)/n → σ². Also, EU′_{n,i} = 0. By (6.13) and (6.16),
[1/Var(S̃_n)²] Σ_{i=1}^{r_n} E(U′_{n,i})⁴ ∼ [r_n/(n²σ⁴)] E(U_{n,1}⁴) = r_n p_n³ δ_{p_n}/(n²σ⁴) ∼ σ^{−4} p_n² δ_{p_n}/n → 0
as n → ∞. By the Lyapounov central limit theorem, we know the first assertion in (6.18)
holds. Last, by Lemma 6.12,
|E e^{itS′_n/√n} − ( E e^{itU_{n,1}/√n} ) E exp( it Σ_{j=2}^{r_n} U_{n,j}/√n )| ≤ 16 α(q_n).
Repeating this inequality for E exp( it Σ_{j=2}^{r_n} U_{n,j}/√n ) and so on, we obtain that
|E e^{itS′_n/√n} − E e^{itS̃_n/√n}| = |E e^{itS′_n/√n} − Π_{j=1}^{r_n} E e^{itU_{n,j}/√n}| ≤ 16 α(q_n) r_n.
Since α(n) is decreasing and Σ_{n≥1} α(n) < ∞, we have nα(n)/2 ≤ Σ_{[n/2]≤i≤n} α(i) → 0. Thus the above bound is o(r_n/q_n) = o(n/(p_nq_n)) → 0 as n → ∞. The second statement in (6.18) follows. The proof is complete.
Proof of Theorem 21. By Lemma 6.2, |EX_1X_{1+k}| ≤ 5(1 + 2E|X_1|^{2+δ}) α(k)^{δ/(2+δ)}, which is summable by the assumption. Thus the series EX_1² + 2 Σ_{k=2}^∞ E(X_1X_k) is absolutely convergent. By exactly the same argument as in proving (6.11), we obtain that Var(S_n)/n → σ².
Now for a given constant A > 0 and all i = 1, 2, · · · , define
Yi = Xi I{|Xi | ≤ A} − E(Xi I{|Xi | ≤ A}) and Zi = Xi I{|Xi | > A} − E(Xi I{|Xi | > A}),
and S′_n = Σ_{i=1}^n Y_i, S″_n = Σ_{i=1}^n Z_i. Obviously, S_n = S′_n + S″_n. By Theorem 20,
S′_n/√n ⇒ N(0, σ(A)²)   (6.19)
as n → ∞ for each fixed A > 0, where σ(A)² := EY_1² + 2 Σ_{k=2}^∞ E(Y_1Y_k). By the summability assumption on α(n), it is easy to check that both the series in the definition of σ(A)² and σ̃(A)² := EZ_1² + 2 Σ_{k=2}^∞ E(Z_1Z_k) are absolutely convergent, and
as A → +∞. Now
lim sup_{n→∞} P( |S″_n|/√n ≥ ε ) ≤ lim sup_{n→∞} Var(S″_n)/(nε²) → σ̃(A)²/ε²   (6.21)
by the same argument as in (6.2). Now P(S_n/√n ∈ F) ≤ P(S′_n/√n ∈ F̄^ε) + P(|S″_n|/√n ≥ ε) for any closed set F ⊂ R and any ε > 0, where F̄^ε is the closed ε-neighborhood of F. Thus, from (6.19) and (6.21),
lim sup_{n→∞} P( S_n/√n ∈ F ) ≤ P( N(0, σ(A)²) ∈ F̄^ε ) + lim sup_{n→∞} P( |S″_n|/√n ≥ ε )
 ≤ P( N(0, σ²) ∈ (σ/σ(A)) F̄^ε ) + σ̃(A)²/ε²
for any A > 0 and ε > 0. First let A → +∞, and then let ε → 0+, to obtain from (6.20) that
lim sup_{n→∞} P( S_n/√n ∈ F ) ≤ P( N(0, σ²) ∈ F ).
The proof is complete.
over all k ≥ 1, m ≥ 0 and measurable functions f (x) defined on Rm and g(x) defined on
Rm−n+1 with |f (x)| ≤ 1 and |g(x)| ≤ 1 for all x ∈ R.
Proof. (i) For two sets A and B, let A∆B = (A\B) ∪ (B\A), their symmetric difference. Then I_{A∆B} = |I_A − I_B|. Let Z_i, i ≥ 1, be a sequence of random variables. We claim that
σ(Z_1, Z_2, ⋯) = { B : for any ε > 0 there exist l < ∞ and C ∈ σ(Z_1, ⋯, Z_l) such that P(B∆C) < ε }.   (6.23)
If this is true, then for any B ∈ σ(Z_1, Z_2, ⋯) and any ε > 0, there exist l < ∞ and C ∈ σ(Z_1, ⋯, Z_l) such that |P(B) − P(C)| < ε. Thus (ii) follows.
Now we prove this claim. First, note that σ(Z_1, Z_2, ⋯) = σ{ {Z_i ∈ E_i} : E_i ∈ B(R), i ≥ 1 }. Denote by F the right-hand side of (6.23). Then F contains all the sets {Z_i ∈ E_i}, E_i ∈ B(R), i ≥ 1. We only need to show that F is a σ-algebra.
It is easy to see that (i) Ω ∈ F; (ii) B^c ∈ F if B ∈ F, since A∆B = A^c∆B^c. (iii) If B_i ∈ F for i ≥ 1, there exist m_i < ∞ and C_i ∈ σ(Z_1, ⋯, Z_{m_i}) such that P(B_i∆C_i) < ε/2^i for all i ≥ 1. Evidently, ∪_{i=1}^n B_i ↑ ∪_{i=1}^∞ B_i as n → ∞. Therefore, there exists n_0 < ∞ such that |P(∪_{i=1}^∞ B_i) − P(∪_{i=1}^{n_0} B_i)| ≤ ε/2. It is easy to check that (∪_{i=1}^{n_0} B_i)∆(∪_{i=1}^{n_0} C_i) ⊂ ∪_{i=1}^{n_0} (B_i∆C_i). Write B = ∪_{i=1}^∞ B_i, B̃ = ∪_{i=1}^{n_0} B_i and C = ∪_{i=1}^{n_0} C_i. Note B∆C ⊂ (B\B̃) ∪ (B̃∆C). These facts show that P(B∆C) < ε and C ∈ σ(Z_1, Z_2, ⋯, Z_l) for some l < ∞. Thus F is a σ-algebra.
(ii) Take f and g to be indicator functions, from (i) we get α(n) ≤ α̃(n). By Lemma
6.1, we have that α̃(n) ≤ 4α(n).
Let P^n(x, B) represent the probability of X_{n+1} ∈ B given X_1 = x, and let π be the marginal probability distribution of X_1. Then
P(X_{n+1} ∈ B, X_1 ∈ A) = ∫_A P^n(x, B) π(dx).
From this,
α(n + 2) ≤ 4 · sup_{A,B} | ∫_A ( P^n(x, B) − P(X_n ∈ B) ) π(dx) |.
For two probability measures µ and ν, define their total variation distance ‖µ − ν‖ = sup_A |µ(A) − ν(A)|. Then
α(n + 2) ≤ 4 ∫ ‖P^n(x, ·) − π‖ π(dx),   n ≥ 2.
In practice, for example in the theory of Markov chain Monte Carlo, there are conditions that guarantee a convergence rate for ‖P^n(x, ·) − π‖; in other words, we have a function M(x) ≥ 0 and a rate γ(n) such that
‖P^n(x, ·) − π‖ ≤ M(x) γ(n).
THEOREM 23 Suppose {X1 , X2 , · · · } is a stationary Markov chain with L(X1 ) = π and
Eπ M < ∞.
(i) If |X1 | ≤ C for some constant C > 0, and the chain is polynomially ergodic of order
m > 1, then (6.28) holds.
(ii) If the chain is polynomially ergodic of order m, and E|X1 |2+δ < ∞ for some δ > 0
satisfying mδ > 2 + δ, then (6.28) holds.
Remark. In practice, given an initial distribution λ, and a transitional kernel P (x, A),
here is the way to generate a Markov chain: randomly draw X0 = x under distribution λ,
randomly draw X1 under distribution P (X0 , ·) and Xn+1 under P (Xn , ·) for any n ≥ 1.
Through this a Markov chain X1 , X2 , · · · with initial distribution λ and transitional kernel
P (x, ·) has been generated. Let π be the stationary distribution of this chain, that is,
Z
π(·) = P (x, ·)π( dx).
Theorem 17.1.6 (on p. 420 of "Markov Chains and Stochastic Stability" by S.P. Meyn and R.L. Tweedie) says that if a Markov CLT holds for a certain initial distribution λ_0, then it holds for any initial distribution λ. In particular, if Theorem 22 or Theorem 23 holds (λ = π in that case), then the CLT holds for the Markov chain with the same transition probability kernel P(x, ·) and stationary distribution π, for any initial distribution λ.
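A simulation sketch of such a Markov-chain CLT (not from the notes; the AR(1) example and its parameters are assumptions chosen because the chain is geometrically ergodic with stationary law N(0, 1/(1−ρ²)) and long-run variance σ² = (1/(1−ρ²))·(1+ρ)/(1−ρ)):

# Sketch: CLT for sample means of a stationary Gaussian AR(1) chain.
import numpy as np

rng = np.random.default_rng(6)
rho, n, reps = 0.5, 1000, 1000
sigma2 = (1 / (1 - rho**2)) * (1 + rho) / (1 - rho)   # long-run variance

means = np.empty(reps)
for r in range(reps):
    x = np.empty(n)
    x[0] = rng.normal(0, np.sqrt(1 / (1 - rho**2)))   # start from the stationary law
    for k in range(1, n):
        x[k] = rho * x[k - 1] + rng.standard_normal()
    means[r] = x.mean()

print(np.var(np.sqrt(n) * means), sigma2)   # the two should roughly agree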
7 U -Statistics
Let X1 , · · · , Xn be a sequence of i.i.d. random variables. Sometimes we are interested in
estimating θ = Eh(X1 , · · · , Xm ) for 1 ≤ m ≤ n. To make discussion simple, we assume
h(x1 , · · · , xm ) is a symmetric function. If X(1) ≤ X(2) ≤ · · · ≤ X(n) , the order statistic, is
sufficient and complete, then
U_n = \binom{n}{m}^{-1} Σ_{1≤i_1<⋯<i_m≤n} h(X_{i_1}, ⋯, X_{i_m})   (7.1)
is an unbiased estimator of θ and is also a function of X_(1), ⋯, X_(n); hence it is the uniformly minimum variance unbiased estimator (UMVUE). To see that U_n is a function of X_(1), ⋯, X_(n), one can rewrite it as
U_n = \binom{n}{m}^{-1} Σ_{1≤i_1<⋯<i_m≤n} h(X_(i_1), ⋯, X_(i_m)).   (7.2)
PROPOSITION 7.1 The equality Un = E(h(X1 , · · · , Xm )|X(1) , · · · , X(n) ) holds for all 1 ≤
m ≤ n.
for any 1 ≤ i_1 < ⋯ < i_m ≤ n. Summing both sides above over all such indices and then dividing by \binom{n}{m}, we obtain
E( h(X_1, ⋯, X_m) | X_(1), ⋯, X_(n) ) = \binom{n}{m}^{-1} Σ_{1≤i_1<⋯<i_m≤n} h(X_(i_1), ⋯, X_(i_m)) = U_n
by (7.2).
which is a U-statistic.
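As a concrete sketch (not part of the notes), take m = 2 and the symmetric kernel h(x_1, x_2) = (x_1 − x_2)²/2, which has Eh(X_1, X_2) = Var(X_1); the resulting U_n is exactly the usual unbiased sample variance.

# Sketch: the U-statistic with kernel (x_i - x_j)^2 / 2 equals the unbiased sample variance.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
x = rng.exponential(2.0, size=30)

u_n = np.mean([(x[i] - x[j])**2 / 2 for i, j in combinations(range(len(x)), 2)])
print(u_n, x.var(ddof=1))   # identical up to floating point error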
THEOREM 24 (Hoeffding's Theorem) For a U-statistic given by (7.1) with E[h(X_1, ⋯, X_m)²] < ∞,
Var(U_n) = \binom{n}{m}^{-1} Σ_{k=1}^m \binom{m}{k} \binom{n−m}{m−k} ζ_k.
Proof. Consider two sets {i_1, ⋯, i_m} and {j_1, ⋯, j_m} of m distinct integers from {1, 2, ⋯, n} with exactly k integers in common. The total number of such pairs is \binom{n}{m}\binom{m}{k}\binom{n−m}{m−k}. Now
Var(U_n) = \binom{n}{m}^{-2} Σ Cov( h(X_{i_1}, ⋯, X_{i_m}), h(X_{j_1}, ⋯, X_{j_m}) ),
where the sum is over all indices 1 ≤ i_1 < ⋯ < i_m ≤ n and 1 ≤ j_1 < ⋯ < j_m ≤ n.
If {i_1, ⋯, i_m} and {j_1, ⋯, j_m} have no integers in common, the covariance is zero by independence. Now we classify the sum by the number of integers in common. Then
Var(U_n) = \binom{n}{m}^{-2} Σ_{k=1}^m \binom{n}{m}\binom{m}{k}\binom{n−m}{m−k} · Cov( h(X_1, ⋯, X_m), h(X_1, ⋯, X_k, X_{m+1}, ⋯, X_{2m−k}) )
 = \binom{n}{m}^{-1} Σ_{k=1}^m \binom{m}{k}\binom{n−m}{m−k} · ζ_k,
where the fact (from conditioning) that the covariance above equals ζ_k is used in the last step.
Then by Jensen's inequality, E(h_k²) ≤ E(h_{k+1}²). Note that Eh_k = Eh_{k+1}. Then Var(h_k) ≤ Var(h_{k+1}). So (i) follows.
(ii) By Theorem 24,
Var(U_n) = \binom{n}{m}^{-1}\binom{m}{k}\binom{n−m}{m−k} ζ_k + \binom{n}{m}^{-1} Σ_{i=k+1}^m \binom{m}{i}\binom{n−m}{m−i} ζ_i.
Since m is fixed, \binom{n}{m} ∼ n^m/m! and \binom{n−m}{m−i} ≤ n^{m−i} for any k+1 ≤ i ≤ m. Thus the last term above is bounded by O(n^{−(k+1)}) as n → ∞. Now
\binom{n}{m}^{-1}\binom{m}{k}\binom{n−m}{m−k} ζ_k
 = [ m!/(n(n−1)⋯(n−m+1)) ] · \binom{m}{k} ζ_k · [ (n−m)!/((m−k)!(n−2m+k)!) ]
 = [ k! \binom{m}{k}² ζ_k / n^k ] · [ n·n⋯n / (n(n−1)⋯(n−k+1)) ] · [ (n−m)⋯(n−2m+k+1) / ((n−k)⋯(n−m+1)) ]
 = [ k! \binom{m}{k}² ζ_k / n^k ] · Π_{i=1}^{k−1} (1 − i/n)^{−1} · Π_{i=m}^{2m−k−1} (1 − i/n) · Π_{i=k}^{m−1} (1 − i/n)^{−1}.
Note that since m is fixed, each of the products above equals 1 + O(1/n). The conclusion then follows.
We will next show that T_n − T′_n is negligible compared with the order appearing in (7.4). Then the CLT holds for T_n by (7.4) and the Slutsky lemma.
LEMMA 7.2 Let T_n be a symmetric statistic with Var(T_n) < ∞ for every n, and let T′_n be the projection of T_n on X_1, ⋯, X_n. Then ET′_n = ET_n and, expanding Cov(T_n, T′_n),
E(T_n − T′_n)² = Var(T_n) + n Var(E(T_n|X_1)) − 2 Σ_{i=1}^n Cov( T_n − ET_n, ψ(X_i) − ET_n ).   (7.6)
Now the i-th covariance above is equal to E(Tn E(Tn |Xi ))−{E(E(Tn |Xi ))}2 = V ar(E(Tn |X1 )).
We already see V ar(Tn′ ) = nV ar(E(Tn |X1 )). So the proof is complete by the two facts and
(7.6).
THEOREM 25 Let U_n be the statistic given by (7.1) with E[h(X_1, ⋯, X_m)]² < ∞.
(i) If ζ_1 > 0, then √n(U_n − EU_n) → N(0, m²ζ_1) in distribution.
(ii) If ζ_1 = 0 and ζ_2 > 0, then
n(U_n − EU_n) → [m(m − 1)/2] Σ_{j=1}^∞ λ_j (χ²_{1j} − 1),
where the χ²_{1j} are i.i.d. χ²(1) random variables, and the λ_j are constants satisfying Σ_{j=1}^∞ λ_j² = ζ_2.
Proof. We will only prove (i) here. The proof of (ii) is very technical, it is omitted. Let
Un′ be the projection of Un on X1 , · · · , Xn . Then
U′_n = Eh + Σ_{i=1}^n [ \binom{n}{m}^{-1} Σ_{1≤i_1<⋯<i_m≤n} ( E(h(X_{i_1}, ⋯, X_{i_m})|X_i) − Eh ) ].
Observe that
\binom{n}{m}^{-1} Σ_{1≤i_1<⋯<i_m≤n} ( E(h(X_{i_1}, ⋯, X_{i_m})|X_1) − Eh )
 = \binom{n}{m}^{-1} Σ_{2≤i_2<⋯<i_m≤n} ( h_1(X_1) − Eh )
 = [ \binom{n−1}{m−1} / \binom{n}{m} ] ( h_1(X_1) − Eh ) = (m/n)( h_1(X_1) − Eh ).
Thus
U′_n = Eh + (m/n) Σ_{i=1}^n ( h_1(X_i) − Eh ).
This says that Var(U′_n) = (m²/n)ζ_1, where ζ_1 = Var(h_1(X_1)). By Lemma 7.1, Var(U_n) = (m²/n)ζ_1 + O(n^{−2}). From Lemma 7.2, Var(U_n − U′_n) = O(n^{−2}) as n → ∞. By the Chebyshev inequality, √n(U_n − U′_n) → 0 in probability. The result follows from (7.4) and the Slutsky lemma, noting Var(ψ(X_1)) = Var(h_1(X_1)) = ζ_1.
8 Empirical Processes
Let X1 , X2 , · · · be a random sample, where Xi takes values in a metric space M. Define a
probability measure µ_n by
µ_n(A) = (1/n) Σ_{i=1}^n I(X_i ∈ A),   A ⊂ M.
The measure µ_n is called the empirical measure. If X_1 takes real values, that is, M = R, set
F_n(t) = µ_n((−∞, t]) = (1/n) Σ_{i=1}^n I(X_i ≤ t),   t ∈ R.
The random process {F_n(t), t ∈ R} is called an empirical process. By the LLN and the CLT,
F_n(t) → F(t) a.s.   and   √n(F_n(t) − F(t)) ⇒ N(0, F(t)(1 − F(t))),
where F (t) is the cdf of X1 . Among many interesting questions, here we are interested in
the uniform convergence of the above almost sure and weak convergences. Specifically, we
want to answer the following two questions:
The answers are yes for both questions. We first answer the first question.
as n → ∞.
Remark. When F(t) is continuous and strictly increasing on the support of X_1, the distribution of ∆_n does not depend on the distribution of X_1. First, the F(X_i) are i.i.d. U[0, 1] random variables. Second,
∆_n = sup_t | (1/n) Σ_i I(X_i ≤ t) − F(t) | = sup_t | (1/n) Σ_i I(F(X_i) ≤ F(t)) − F(t) | = sup_{0≤y≤1} | (1/n) Σ_i I(Y_i ≤ y) − y |,
where {Y_i = F(X_i); 1 ≤ i ≤ n} are i.i.d. random variables with the uniform distribution over [0, 1].
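A simulation sketch of this distribution-free property (not from the notes; the exponential example and sample sizes are assumptions): ∆_n computed from exponential data and from uniform data has essentially the same sampling distribution.

# Sketch: Delta_n = sup_t |F_n(t) - F(t)| is distribution-free for continuous F.
import numpy as np

rng = np.random.default_rng(8)

def delta_n(sample, cdf):
    x = np.sort(sample)
    n = len(x)
    grid = np.arange(1, n + 1) / n
    f = cdf(x)
    # the supremum is attained at the sample points
    return max(np.max(grid - f), np.max(f - (grid - 1.0 / n)))

n, reps = 100, 5000
d_exp = [delta_n(rng.exponential(1, n), lambda t: 1 - np.exp(-t)) for _ in range(reps)]
d_unif = [delta_n(rng.uniform(0, 1, n), lambda t: t) for _ in range(reps)]
print(np.quantile(d_exp, [0.5, 0.9, 0.95]))
print(np.quantile(d_unif, [0.5, 0.9, 0.95]))   # nearly identical quantiles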
Proof. Set
F_n(t−) = (1/n) Σ_{i=1}^n I(X_i < t)   and   F(t−) = P(X_1 < t),   t ∈ R.
By the strong law of large numbers, Fn (t) → F (t) a.s. and Fn (t−) → F (t−) a.s. as n → ∞.
Given ε > 0, choose −∞ = t_0 < t_1 < ⋯ < t_k = ∞ such that F(t_i−) − F(t_{i−1}) < ε for every i. Now, for t_{i−1} ≤ t < t_i,
∆_n ≤ max_{1≤i≤k} { |F_n(t_i) − F(t_i)|, |F_n(t_i−) − F(t_i−)| } + ε → ε   a.s.
Given (t_1, t_2, ⋯, t_k) ∈ R^k, it is easy to check that √n( F_n(t_1) − F(t_1), ⋯, F_n(t_k) − F(t_k) ) ⇒ N_k(0, Σ), where
Let D[−∞, ∞] be the set of all functions defined on (−∞, ∞) such that they are right-
continuous and the left limits exist everywhere. Equipped with a so-called Skorohod metric,
√
it becomes a Polish space. The random vector n(Fn − F ), viewed as an element in
D[−∞, ∞], converges weakly to a continuous Gaussian process ξ(t), where ξ(±∞) = 0 and
the covariance structure is given in (8.1). This is usually called a Brownian bridge. It has the same distribution as G_λ ∘ F, where G_λ is the limiting process obtained when F is the cumulative distribution function of the uniform distribution over [0, 1].
where
P( sup_t |G_F(t)| ≥ x ) = 2 Σ_{j=1}^∞ (−1)^{j+1} e^{−2j²x²},   x ≥ 0.
Also, the DKW (Dvoretzky, Kiefer, and Wolfowitz) inequality says that
P( √n ‖F_n − F‖_∞ > x ) ≤ 2e^{−2x²},   x > 0.
because P_n(A) = 1 and P(A) = 0 for A = {X_1, X_2, ⋯, X_n}. We are searching for classes F such that
sup_{f∈F} |P_n f − P f| → 0   a.s.   (8.2)
The previous two examples say that (8.2) holds for F = {I_{(−∞,x]} : x ∈ R} but fails for the power set 2^R. The class F is called P-Glivenko-Cantelli if (8.2) holds.
Define G_n = √n(P_n − P). Given k measurable functions f_1, ⋯, f_k such that Pf_i² < ∞ for all i, one can also check by the multivariate central limit theorem that
(G_n f_1, ⋯, G_n f_k) ⇒ N_k(0, Σ),
where Σ = (σ_{ij})_{1≤i,j≤k} and σ_{ij} = P(f_if_j) − (Pf_i)(Pf_j). This tells us that {G_nf, f ∈ F} satisfies a CLT in R^k. If F has infinitely many members, how do we define weak convergence? We first define a space similar to R^k. Set
l^∞(F) = { z : F → R with sup_{f∈F} |z(f)| < ∞ },
with norm ‖z‖_∞ = sup_{f∈F} |z(f)|. Then (l^∞(F), ‖·‖_∞) is a Banach space; it is separable if F is countable. When F has finitely many elements, (l^∞(F), ‖·‖_∞) = (R^k, ‖·‖_∞).
So, Gn : F → R is a random element and can be viewed as an element in Banach space
l∞ (F). We say F is P -Donsker if Gn =⇒ G, where G is a tight element in l∞ (F).
LEMMA 8.1 (Bernstein's inequality). Let X_1, ⋯, X_n be independent random variables with mean zero, |X_j| ≤ K for some constant K and all j, and σ_j² = EX_j² > 0. Let S_n = Σ_{i=1}^n X_i and s_n² = Σ_{j=1}^n σ_j². Then
P(|S_n| ≥ x) ≤ 2 e^{−x²/(2(s_n² + Kx))},   x > 0.
Now
E e^{λX_j} = 1 + Σ_{i=2}^∞ (λ^i/i!) EX_j^i ≤ 1 + (σ_j²λ²/2) Σ_{i=2}^∞ (λK)^{i−2} = 1 + σ_j²λ²/(2(1 − λK)) ≤ exp( σ_j²λ²/(2(1 − λK)) )
if λK < 1. Thus
P(S_n > x) ≤ exp( −λx + λ²s_n²/(2(1 − λK)) ).
Now (8.3) follows by choosing λ = x/(s_n² + Kx).
Recall G_n f = Σ_{i=1}^n (f(X_i) − Pf)/√n. The random variable Y_i := (f(X_i) − Pf)/√n here corresponds to X_i in the above lemma. Note that Var( Σ_{i=1}^n Y_i ) = Var(f(X_1)) ≤ Pf² and ‖Y_i‖_∞ ≤ 2‖f‖_∞/√n. Applying the above lemma to Σ_{i=1}^n Y_i, we obtain
Notice that
Let's estimate E max_{1≤i≤m} |Y_i| provided m is large and P(|Y_i| ≥ x) ≤ e^{−x} for all i and x > 0. Then E|Y_i|^k ≤ k! for k = 1, 2, ⋯. The immediate estimate is
E max_{1≤i≤m} |Y_i| ≤ Σ_{i=1}^m E|Y_i| ≤ m.
By Hölder's inequality,
E max_{1≤i≤m} |Y_i| ≤ ( E max_{1≤i≤m} |Y_i|² )^{1/2} ≤ ( Σ_{i=1}^m E|Y_i|² )^{1/2} ≤ √(2m).
LEMMA 8.2 For any finite class F of bounded, measurable, square-integrable functions with |F| elements, there is a universal constant C > 0 such that
C · E max_f |G_n f| ≤ max_f ‖f‖_∞ (log(1 + |F|))/√n + max_f ‖f‖_{P,2} √(log(1 + |F|)).
For x ≥ b/a and x ≤ b/a the exponent in the Bernstein inequality is bounded above by
−3x/a and −3x2 /b, respectively. By Bernstein’s inequality,
P(|A_f| ≥ x) ≤ P( |G_n f| ≥ x ∨ b/a ) ≤ 2 exp(−3x/a),
P(|B_f| ≥ x) = P( b/a ≥ |G_n f| ≥ x ) ≤ P( |G_n f| ≥ x ∧ b/a ) ≤ 2 exp(−3x²/b)
for all x ≥ 0. Let ψp (x) = exp(xp ) − 1, x ≥ 0, p ≥ 1.
E ψ_1(|A_f|/a) = E ∫_0^{|A_f|/a} e^x dx = ∫_0^∞ P(|A_f| > ax) e^x dx ≤ 1.
By a similar argument we find that E ψ_2(|B_f|/√b) ≤ 1. Because ψ_p(·) is convex for all p ≥ 1, by Jensen's inequality
ψ_1( E max_f |A_f|/a ) ≤ E ψ_1( max_f |A_f|/a ) ≤ E Σ_f ψ_1(|A_f|/a) ≤ |F|.
Taking the inverse function ψ_1^{-1}, we obtain the first term on the right-hand side; similarly we obtain the second term. The conclusion follows from (8.5).
LEMMA 8.3 Let X be a real-valued random variable and let f : [0, ∞) → R be differentiable with ∫_0^s |f′(t)| dt < ∞ for any s > 0 and ∫_0^∞ P(X > t)|f′(t)| dt < ∞. Then
Ef(X) = ∫_0^∞ P(X ≥ t) f′(t) dt + f(0).
Proof. First, since ∫_0^t |f′(s)| ds < ∞ for any t > 0, we have
f(X) − f(0) = ∫_0^X f′(t) dt = ∫_0^∞ I(X ≥ t) f′(t) dt.
Then, by Fubini's theorem,
Ef(X) = E ∫_0^∞ I(X ≥ t) f′(t) dt + f(0) = ∫_0^∞ P(X ≥ t) f′(t) dt + f(0).
Remark. When f (x) is differentiable, f ′ (x) is not necessarily Lebesgue integrable even
though it is Riemann-integrable. The following is an example: Let
f(x) = x² cos(1/x²)  if x ≠ 0;   0  otherwise.
Then
f′(x) = 2x cos(1/x²) + (2/x) sin(1/x²)  if x ≠ 0;   0  otherwise,
is not Lebesgue-integrable on [0, 1], but obviously Riemann-integrable. We only need to
show that g(x) := (2/x) sin(1/x²) is not integrable. If it were, then
∫_0^1 (1/x) |sin(1/x²)| dx = (1/2) ∫_1^∞ (|sin t|/t) dt = +∞,
which yields a contradiction.
‖f‖_{P,r} = (P|f|^r)^{1/r}.
Let l and u be two functions; the bracket [l, u] is the set of all functions f such that l ≤ f ≤ u. An ε-bracket in L_r(P) is a bracket [l, u] such that ‖u − l‖_{P,r} < ε. The bracketing number N_{[]}(ε, F, L_r(P)) is the minimum number of ε-brackets needed to cover F. The entropy with bracketing is the logarithm of the bracketing number.
8.1 Outer Measures and Expectations
Recall that X is a random variable from (Ω, G) → (R, B(R)) if X^{-1}(B) ∈ G for any set B ∈ B(R). For an arbitrary map X the inverse image may not be in G, particularly for small G. For example, when G = {∅, Ω}, many maps are not random variables. In empirical processes, we will deal with Z := sup_{t∈T} X_t for some index set T. If T is big, then Z may not be measurable, and it does not make sense to study expectations and probabilities related to such a quantity. But there is a way around this.
Definition. Let X be an arbitrary map from (Ω, G, P) → (R, B(R)). Define the outer expectation
E*X = inf{ EU : U ≥ X, U a measurable random variable whose expectation exists }.
One can show that the infimum E*X is achieved, i.e., there exists a random variable X* : (Ω, G, P) → (R, B(R)) such that EX* = E*X. Further, X* is P-almost surely unique: if there are two such random variables X*_1 and X*_2, then P(X*_1 = X*_2) = 1. We call X* the measurable cover function. Obviously
E ∗ f (Xn ) → Ef (X)
for any bounded, continuous function f defined on (M, d). We still have an analogue of the
Portmanteau theorem.
Let
J_{[]}(δ, F, L_2(P)) = ∫_0^δ √( log N_{[]}(ε, F, L_2(P)) ) dε.
THEOREM 29 Every class F of measurable functions such that N[ ] (ǫ, F, L1 (P )) < ∞ for
every ǫ > 0 is P -Glivenko-Cantelli.
Since the supremum above becomes smaller when a partition becomes more refined, w.l.o.g.,
assume the partitions are successive refinements as m increases. Define a semi-metric
ρ_m(s, t) = 0  if s, t belong to the same partitioning set T_j^m for some j;   1  otherwise.
Obviously, ρ(s, t) ≤ Σ_{k=m+1}^∞ 2^{−k} ≤ 2^{−m} when s, t ∈ T_j^m for some j. So (T, ρ) is totally bounded. Let T_0 be the countable ρ-dense subset constructed by choosing an arbitrary point t_j^m from every T_j^m.
Step 2: Construct the limit of X_n. For two finite subsets of T, S = (s_1, ⋯, s_p) and U = (s_1, ⋯, s_p, s_{p+1}, ⋯, s_q), by assumption (i), there are two probability measures µ_p on R^p and µ_q on R^q such that
If ρ(s, t) < 2^{−m}, then ρ_m(s, t) < 1 (otherwise ρ_m(s, t) = 1 would imply ρ(s, t) ≥ 2^{−m}). This says that s, t ∈ T_j^m for the same j. So the above inequality implies that
P( sup_{ρ(s,t)<2^{−m}, s,t∈T_0} |X_s − X_t| > 1/2^m ) ≤ 1/2^m
for each m ≥ 1. By the Borel-Cantelli lemma, with probability one,
sup_{ρ(s,t)<2^{−m}, s,t∈T_0} |X_s − X_t| ≤ 1/2^m
as m is large enough. This says that, with probability one, Xt , as a function of t is uniformly
continuous on T0 . Extending this from T0 to T, we obtain a ρ-continuous process Xt , t ∈ T.
We then have that with probability one
|X_s − X_t| ≤ 1/2^m   for any s, t ∈ T with ρ(s, t) < 1/2^m   (8.2)
for sufficiently large m.
Step 3: Show X_n ⇒ X. Define π_m : T → T as the map that maps every point in the partitioning set T_i^m onto the point t_i^m ∈ T_i^m. For any bounded, Lipschitz continuous function f : l^∞(T) → R,
LEMMA 8.4 Suppose there is δ > 0 and a measurable function F > 0 such that $P^*f^2 < \delta^2$ and |f| ≤ F for every f ∈ F. Then there exists a constant C > 0 such that
$$C\cdot E_P^*\|G_n\|_{\mathcal F} \le J_{[\,]}(\delta, \mathcal F, L_2(P)) + \sqrt n\, P^*F\{F > \sqrt n\, a(\delta)\},$$
where $a(\delta) = \delta\big(\log N_{[\,]}(\delta, \mathcal F, L_2(P))\big)^{-1/2}$.
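The following rough sketch (ours; the class, distribution, sample size, and number of replications are arbitrary, and the constant C in the lemma is unspecified, so we simply report both sides) illustrates Lemma 8.4 for the indicators 1(−∞, t] with P = Uniform(0, 1): the envelope is F ≡ 1, so the truncation term vanishes once √n a(δ) > 1, and the lemma then says E*‖G_n‖_F is at most a constant multiple of the entropy integral.

```python
import numpy as np

# Rough numeric companion to Lemma 8.4 (illustration only).
rng = np.random.default_rng(2)
n, reps = 2_000, 300

# Monte Carlo estimate of E*||G_n||_F = E sup_t sqrt(n)|F_n(t) - t| for P = U(0,1).
sups = []
for _ in range(reps):
    x = np.sort(rng.uniform(size=n))
    upper = np.max(np.abs(np.arange(1, n + 1) / n - x))   # right-continuous side
    lower = np.max(np.abs(np.arange(0, n) / n - x))       # left-limit side
    sups.append(np.sqrt(n) * max(upper, lower))
print("Monte Carlo E||G_n||_F approx:", round(float(np.mean(sups)), 3))

# Bracketing entropy integral with N_[](eps) <= 2/eps^2 (derived later) and delta = 1.
eps = np.linspace(1e-4, 1.0, 100_000)
integrand = np.sqrt(np.log(2.0 / eps ** 2))
J = float(np.sum(integrand) * (eps[1] - eps[0]))          # simple Riemann sum
print("entropy integral J_[](1) approx:", round(J, 3))
```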
Proof. Step 1: Truncation. Recall $G_n f = (1/\sqrt n)\sum_{i=1}^n (f(X_i) - Pf)$. If |f| ≤ g, then
$$|G_n f| \le \frac{1}{\sqrt n}\sum_{i=1}^n \big(g(X_i) + Pg\big).$$
It follows that
$$E^*\big\|G_n f\, I\{F > \sqrt n\, a(\delta)\}\big\|_{\mathcal F} \le 2\sqrt n\, P^*F\{F > \sqrt n\, a(\delta)\}.$$
We will bound $E^*\|G_n f\, I\{F \le \sqrt n\, a(\delta)\}\|_{\mathcal F}$ next. The bracketing numbers of the class of functions $f\, I\{F \le \sqrt n\, a(\delta)\}$, as f ranges over F, are smaller than the bracketing numbers of the class F. To simplify notation, we assume, w.l.o.g., $|f| \le \sqrt n\, a(\delta)$ for every f ∈ F.
Step 2: Discretization of the integral. Choose an integer q₀ such that $4\delta \le 2^{-q_0} \le 8\delta$. We claim there exists a nested sequence of partitions $\mathcal F = \cup_{i=1}^{N_q}\mathcal F_{qi}$, indexed by the integers q ≥ q₀, into $N_q$ disjoint subsets, and measurable functions $\Delta_{qi} \le 2F$ such that
$$C\sum_{q\ge q_0}\frac{1}{2^q}\sqrt{\log N_q} \le \int_0^\delta \sqrt{\log N_{[\,]}(\epsilon, \mathcal F, L_2(P))}\, d\epsilon, \tag{8.4}$$
$$\sup_{f,g\in\mathcal F_{qi}}|f-g| \le \Delta_{qi}, \quad\text{and}\quad P\Delta_{qi}^2 \le \frac{1}{2^{2q}} \tag{8.5}$$
for a universal constant C > 0. For convenience, write $N(\epsilon) = N_{[\,]}(\epsilon, \mathcal F, L_2(P))$. Thus,
$$\int_0^\delta \sqrt{\log N(\epsilon)}\, d\epsilon \ge \int_0^{1/2^{q_0+3}} \sqrt{\log N(\epsilon)}\, d\epsilon = \sum_{q=q_0+3}^{\infty}\int_{1/2^{q+1}}^{1/2^{q}} \sqrt{\log N(\epsilon)}\, d\epsilon \ge \sum_{q=q_0+3}^{\infty}\frac{1}{2^{q+1}}\sqrt{\log N\Big(\frac{1}{2^{q-3}}\Big)}.$$
smaller, all requirements on $\mathcal F_{qi}$ still hold obviously, except possibly (8.4). Now we verify (8.4). Actually, note that $\sqrt{\log \bar N_q} \le \sum_{k=q_0}^{q}\sqrt{\log N_k}$. Then
$$\sum_{q\ge q_0}\frac{1}{2^q}\sqrt{\log \bar N_q} \le \sum_{q\ge q_0}\frac{1}{2^q}\sum_{k=q_0}^{q}\sqrt{\log N_k} = \sum_{k=q_0}^{\infty}\sqrt{\log N_k}\sum_{q\ge k}\frac{1}{2^q} = \sum_{k=q_0}^{\infty}\frac{2}{2^k}\sqrt{\log N_k}.$$
Step 3: Chaining, a useful skill. For each fixed q ≥ q₀, choose a fixed element $f_{qi}$ from each partitioning set $\mathcal F_{qi}$, and set
$$\pi_q f = f_{qi}, \qquad \Delta_q f = \Delta_{qi} \qquad \text{if } f \in \mathcal F_{qi}.$$
Thus, $\pi_q f$ and $\Delta_q f$ run through a set of $N_q$ functions as f runs through F. Without loss of generality, we can assume
$$\Delta_q f \le \Delta_{q-1} f \quad\text{for every } q > q_0. \tag{8.7}$$
Actually, let $\bar\Delta_{qi}$ be the measurable cover function of $\sup_{f,g\in\mathcal F_{qi}}|f-g|$. Then (8.5) also holds. Let $\mathcal F_{q-1,j}$ be a partition set at the (q−1)-level such that $\mathcal F_{qi} \subset \mathcal F_{q-1,j}$. Then $\sup_{f,g\in\mathcal F_{qi}}|f-g| \le \sup_{f,g\in\mathcal F_{q-1,j}}|f-g|$. Thus, $\bar\Delta_{qi} \le \bar\Delta_{q-1,j}$. The assertion (8.7) follows by replacing $\Delta_{qi}$ with $\bar\Delta_{qi}$.
By (8.6), $P(\pi_q f - f)^2 \le \max_i P\Delta_{qi}^2 < 2^{-2q}$. Thus, $P\sum_q(\pi_q f - f)^2 = \sum_q P(\pi_q f - f)^2 < \infty$. So $\sum_q(\pi_q f - f)^2 < \infty$ a.s. This implies
$$\pi_q f \to f \quad \text{a.s. } [P]. \tag{8.8}$$
Define $\tau = \tau(f) := \inf\{q \ge q_0:\ \Delta_q f > \sqrt n\, a_q\}$, with τ = +∞ if the set is empty; that is, τ is the first time $\Delta_q f > \sqrt n\, a_q$. By construction, $2a(\delta) = 2\delta(\log N_{[\,]}(\delta, \mathcal F, L_2(P)))^{-1/2} \le a_{q_0}$ since the denominator is decreasing in δ. By Step 1, we know that $|\Delta_q f| \le 2\sqrt n\, a(\delta) \le \sqrt n\, a_{q_0}$. This says that τ > q₀.
We claim that
$$f - \pi_{q_0} f = \sum_{q=q_0+1}^{\infty} (f - \pi_q f)\, I\{\tau = q\} + \sum_{q=q_0+1}^{\infty} (\pi_q f - \pi_{q-1} f)\, I\{\tau \ge q\}, \quad \text{a.s. } [P]. \tag{8.9}$$
In fact, write $f - \pi_{q_0} f = (f - \pi_{q_1} f) + \sum_{q=q_0+1}^{q_1} (\pi_q f - \pi_{q-1} f)$ for q₁ > q₀. Now,
(i) If τ = ∞, the right-hand side above is identical to $\lim_{q\to\infty}\pi_q f - \pi_{q_0} f = f - \pi_{q_0} f$ a.s. by (8.8);
(ii) If τ = q₁ < ∞, the right-hand side is equal to $(f - \pi_{q_1} f) + \sum_{q=q_0+1}^{q_1} (\pi_q f - \pi_{q-1} f)$.
Step 4. Bound terms in the chain. Apply $G_n$ to both sides of (8.9). For |f| ≤ g, note that $|G_n f| \le |G_n g| + 2\sqrt n\, Pg$. By (8.6), $|f - \pi_q f| \le \Delta_q f$. One obtains that
$$E^*\Big\|\sum_{q=q_0+1}^{\infty}G_n (f - \pi_q f)\, I\{\tau = q\}\Big\|_{\mathcal F} \le \sum_{q=q_0+1}^{\infty} E^*\|G_n (\Delta_q f\, I\{\tau = q\})\|_{\mathcal F} + 2\sqrt n \sum_{q=q_0+1}^{\infty} \|P(\Delta_q f\, I\{\tau = q\})\|_{\mathcal F}. \tag{8.10}$$
Now, by (8.7), $\Delta_q f\, I\{\tau = q\} \le \Delta_{q-1} f\, I\{\tau = q\} \le \sqrt n\, a_{q-1}$. Moreover, $P(\Delta_q f\, I\{\tau = q\})^2 \le 2^{-2q}$. By Lemma 8.2, the middle term above is bounded by
$$\sum_{q=q_0+1}^{\infty}\Big(a_{q-1}\log N_q + 2^{-q}\sqrt{\log N_q}\Big).$$
By Hölder's inequality, $P(\Delta_q f\, I\{\tau = q\}) \le (P(\Delta_q f)^2)^{1/2}\, P(\tau = q)^{1/2} \le 2^{-q}\, P(\Delta_q f > \sqrt n\, a_q)^{1/2} \le 2^{-q}(\sqrt n\, a_q)^{-1}(P(\Delta_q f)^2)^{1/2} \le 2^{-2q}(\sqrt n\, a_q)^{-1}$. So the last term in (8.10) is bounded by $2\sum_{q>q_0} 2^{-2q}/a_q$. In summary,
$$E^*\Big\|\sum_{q=q_0+1}^{\infty}G_n (f - \pi_q f)\, I\{\tau = q\}\Big\|_{\mathcal F} \le C\sum_{q=q_0+1}^{\infty} 2^{-q}\sqrt{\log N_q}. \tag{8.11}$$
At last, we consider $\pi_{q_0} f$. Because $|\pi_{q_0} f| \le F \le a(\delta)\sqrt n \le \sqrt n\, a_{q_0}$ and $P(\pi_{q_0} f)^2 \le \delta^2$ by assumption, another application of Lemma 8.2 leads to
$$E^*\|G_n \pi_{q_0} f\|_{\mathcal F} \le a_{q_0}\log N_{q_0} + \delta\sqrt{\log N_{q_0}}.$$
In view of the choice of q₀, this is no more than the bound in (8.12). All the above inequalities together with (8.4) yield the desired result.
COROLLARY 8.2 For any class F of measurable functions with envelope function F, there exists a universal constant C such that
$$E_P^*\|G_n\|_{\mathcal F} \le C\, J_{[\,]}\big(\|F\|_{P,2}, \mathcal F, L_2(P)\big).$$
Proof. Since F has a single bracket [−F, F], we have $N_{[\,]}(\delta, \mathcal F, L_2(P)) = 1$ for δ = 2‖F‖_{P,2}. Recall the definition of a(δ) in Lemma 8.4 and choose δ = ‖F‖_{P,2}. It follows that
$$a(\delta) = \frac{\|F\|_{P,2}}{\sqrt{\log N_{[\,]}(\|F\|_{P,2}, \mathcal F, L_2(P))}}.$$
Now $\sqrt n\, P^*F\, I(F > \sqrt n\, a(\delta)) \le \|F\|_{P,2}^2/a(\delta) = \|F\|_{P,2}\sqrt{\log N_{[\,]}(\|F\|_{P,2}, \mathcal F, L_2(P))}$ by Markov's inequality, which is bounded by $J_{[\,]}(\|F\|_{P,2}, \mathcal F, L_2(P))$ since the integrand below is non-increasing and hence
$$\int_0^{\|F\|_{P,2}}\sqrt{\log N_{[\,]}(\epsilon, \mathcal F, L_2(P))}\, d\epsilon \ge \|F\|_{P,2}\sqrt{\log N_{[\,]}(\|F\|_{P,2}, \mathcal F, L_2(P))}.$$
Proof of Theorem 30. We will use Theorem 31 to prove this theorem. Part (i) is easily satisfied. Now we verify (ii). Note that no envelope for F is given and we do not know whether P f² < δ². Let G = {f − g : f, g ∈ F}. With a given set of ǫ-brackets [lᵢ, uᵢ] over F we can construct 2ǫ-brackets over G by taking differences [lᵢ − uⱼ, uᵢ − lⱼ] of upper and lower bounds; easily, ‖(uᵢ − lⱼ) − (lᵢ − uⱼ)‖ ≤ 2ǫ. Therefore, the bracketing number $N_{[\,]}(2\epsilon, \mathcal G, L_2(P))$ is bounded by the square of the bracketing number $N_{[\,]}(\epsilon, \mathcal F, L_2(P))$. Hence
For a given small δ > 0, by the definition of $N_{[\,]}(\delta, \mathcal F, L_2(P))$, choose a minimal number of brackets of size δ that cover F, and use them to form a partition F = ∪ᵢ Fᵢ. The subset of G consisting of differences f − g of functions f and g belonging to the same partitioning set consists of functions of L₂(P)-norm smaller than δ. Hence, by Lemma 8.4, there exists a finite number a(δ) := a(δ)_F < a(2δ)_G and a universal constant C such that
Cut [0, 1] into pieces of length less than ǫ². Since F(t) is increasing, right continuous and has left limits, there exists a partition, say, −∞ = t₀ < t₁ < ⋯ < t_k = ∞ with k = [1/ǫ²] + 1 such that $F(t_i-) - F(t_{i-1}) < \epsilon^2$ for every i. Then $\|I(-\infty, t_i) - I(-\infty, t_{i-1}]\|_{P,2} = (F(t_i-) - F(t_{i-1}))^{1/2} < \epsilon$; namely, $(I(-\infty, t_{i-1}], I(-\infty, t_i))$ is an ǫ-bracket for i = 1, 2, ⋯, k. It follows that
$$N_{[\,]}(\epsilon, \mathcal F, L_2(P)) \le \frac{2}{\epsilon^2} \quad\text{for } \epsilon \in (0,1).$$
But $\int_0^1 \sqrt{\log(1/\epsilon)}\, d\epsilon < \infty$. The earlier classical Glivenko–Cantelli Lemma 26 and Donsker Theorem 27 follow from Theorems 29 and 30.
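A small numeric sketch of the bracket construction just described (our own concrete choice P = Uniform(0, 1), so F(t) = t on [0, 1]; the value of ε is arbitrary): it forms the cut points, checks that each bracket has L2(P)-size below ε, and counts the brackets against the bound 2/ε².

```python
import numpy as np

# Companion sketch to the bracket construction above, taking P = Uniform(0,1).
eps = 0.2
k = int(1.0 / eps**2) + 1                 # number of cells, as in the text
t = np.linspace(0.0, 1.0, k + 1)          # here -inf and +inf collapse to 0 and 1

# Each pair (1(-inf, t_{i-1}], 1(-inf, t_i)) is an eps-bracket:
sizes = np.sqrt(np.diff(t))               # ||1(-inf,t_i) - 1(-inf,t_{i-1}]||_{P,2}
print("number of brackets:", k, " (bound 2/eps^2 =", 2.0 / eps**2, ")")
print("largest bracket L2(P)-size:", round(float(sizes.max()), 4), "< eps =", eps)

# Any indicator 1(-inf, s] with s in [0,1] is caught by the bracket containing s:
s = 0.37
i = np.searchsorted(t, s)                 # t_{i-1} < s <= t_i
print(f"1(-inf, {s}] lies between 1(-inf, {t[i-1]:.2f}] and 1(-inf, {t[i]:.2f})")
```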
Since ‖f‖ = sup_t |f(t)| is the norm in ℓ^∞, it is a continuous map. By the continuous mapping theorem, we have
$$\sup_{t\in\mathbb R}\sqrt n\,|F_n(t) - F(t)| \Longrightarrow \sup_{t\in\mathbb R} |G(t)|$$
weakly, i.e., in distribution, as n → ∞, where G(t) is the Gaussian process we mentioned
before.
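Here is a simulation sketch of this statement (ours; taking F = Uniform(0, 1) without loss of generality, with arbitrary sample size and replication count): the empirical distribution of sup_t √n|F_n(t) − F(t)| is compared with the Kolmogorov distribution, i.e., the law of sup_t |G(t)|.

```python
import numpy as np

# The law of sup_t sqrt(n)|F_n(t) - F(t)| should be close to the law of
# sup_t |G(t)| (Kolmogorov distribution) for large n.
rng = np.random.default_rng(3)

def ks_stat(n):
    x = np.sort(rng.uniform(size=n))             # F = Uniform(0,1)
    Fn_right = np.arange(1, n + 1) / n
    Fn_left = np.arange(0, n) / n
    return np.sqrt(n) * max(np.max(np.abs(Fn_right - x)),
                            np.max(np.abs(Fn_left - x)))

def kolmogorov_cdf(z, terms=100):
    """P(sup_t |G(t)| <= z), G the Brownian bridge."""
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * float(np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * z**2)))

samples = np.array([ks_stat(2_000) for _ in range(2_000)])
for z in [0.5, 1.0, 1.5]:                         # empirical vs limiting cdf
    print(z, round(float(np.mean(samples <= z)), 3), round(kolmogorov_cdf(z), 3))
```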
Any function of bounded variation is the difference of two monotone increasing functions.
Then for any r ≥ 1 and probability measure P,
$$\log N_{[\,]}(\epsilon, \mathcal F, L_r(P)) \le K\,\frac{1}{\epsilon}.$$
Therefore F is P -Donsker for every P.
9 Consistency and Asymptotic Normality of Maximum Likelihood Estimators
9.1 Consistency
Example. Let X₁, ⋯, X_n be i.i.d. with density $p_\theta(x) = \theta e^{-\theta x} I(x \ge 0)$, θ > 0. So EX₁ = 1/θ and Var(X₁) = 1/θ². It is easy to check that θ̂ = 1/X̄. By the CLT,
$$\sqrt n\Big(\bar X - \frac{1}{\theta}\Big) \Longrightarrow N(0, \theta^{-2})$$
as n → ∞. Let g(x) = 1/x, x > 0. Then g(EX₁) = θ and g′(EX₁) = −θ². By the Delta method,
$$\sqrt n\,(\hat\theta - \theta) \Longrightarrow N\big(0, g'(\mu)^2\sigma^2\big) = N(0, \theta^2)$$
as n → ∞. Of course, θ̂ → θ in probability.
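A quick Monte Carlo check of this Delta-method conclusion (our own choices of θ, n, and number of replications; the rate parametrization of this example is used).

```python
import numpy as np

# Check that sqrt(n)(1/Xbar - theta) is approximately N(0, theta^2).
rng = np.random.default_rng(4)
theta, n, reps = 2.0, 2_000, 5_000

xbar = rng.exponential(scale=1.0 / theta, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (1.0 / xbar - theta)
print("sample mean of z:", round(float(z.mean()), 3), " (limit mean 0)")
print("sample sd of z  :", round(float(z.std()), 3), " (limit sd = theta =", theta, ")")
```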
Example. Let X₁, X₂, ⋯, X_n be i.i.d. from U[0, θ], where θ is unknown. The MLE is θ̂ = max_i X_i. First, it is easy to check that θ̂ → θ in probability. But θ̂ does not obey the CLT. In fact,
$$P\Big(\frac{n(\theta - \hat\theta)}{\theta} \le x\Big) \to \begin{cases} 1 - e^{-x}, & \text{if } x > 0;\\ 0, & \text{otherwise}\end{cases}$$
as n → ∞.
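A Monte Carlo check of this non-normal limit (ours; θ, n, and the number of replications are arbitrary): the distribution function of n(θ − θ̂)/θ is compared with 1 − e^{−x}.

```python
import numpy as np

# Check that n(theta - max_i X_i)/theta is approximately Exp(1) for X_i iid U[0, theta].
rng = np.random.default_rng(5)
theta, n, reps = 3.0, 500, 20_000

theta_hat = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
w = n * (theta - theta_hat) / theta
for x in [0.5, 1.0, 2.0]:                     # empirical cdf vs 1 - exp(-x)
    print(x, round(float(np.mean(w <= x)), 3), round(1.0 - np.exp(-x), 3))
```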
In some cases even consistency, i.e., θ̂ → θ in probability, fails. Next, we will study some sufficient conditions for consistency. Later we provide sufficient conditions for a CLT.
Let X₁, ⋯, X_n be a random sample from a density $p_\theta$ with reference measure µ, that is, $P_\theta(X_1 \in A) = \int_A p_\theta(x)\,\mu(dx)$, where θ ∈ Θ. The maximum likelihood estimator θ̂_n maximizes the function $h(\theta) := \sum_i \log p_\theta(X_i)$ over Θ, or equivalently, the function
$$M_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{p_\theta}{p_{\theta_0}}(X_i),$$
where θ₀ is the true parameter. Under suitable conditions, by the weak law of large numbers,
$$M_n(\theta) \to M(\theta) := E_{\theta_0}\log\frac{p_\theta}{p_{\theta_0}}(X_1) = \int p_{\theta_0}(x)\log\frac{p_\theta}{p_{\theta_0}}(x)\,\mu(dx) \le 0 \tag{9.1}$$
by Jensen's inequality. Of course M(θ₀) = 0, that is, θ₀ attains the maximum of M(θ). Obviously, M(θ₀) = M_n(θ₀) = 0. The following gives the consistency of θ̂_n for θ₀.
THEOREM 32 Suppose $\sup_{\theta\in\Theta}|M_n(\theta) - M(\theta)| \to 0$ in probability and $\sup_{\theta: d(\theta,\theta_0)\ge\epsilon} M(\theta) < M(\theta_0)$ for any ǫ > 0. Then θ̂_n → θ₀ in probability.
Proof. We need to show that, for any ǫ > 0,
$$P\big(d(\hat\theta_n, \theta_0) \ge \epsilon\big) \to 0. \tag{9.2}$$
Set $\delta := M(\theta_0) - \sup_{\theta: d(\theta,\theta_0)\ge\epsilon} M(\theta) > 0$. On the event $\{d(\hat\theta_n, \theta_0)\ge\epsilon\}$ we have $M(\hat\theta_n) - M(\theta_0) < -\delta$; since also $M_n(\hat\theta_n) \ge M_n(\theta_0)$, this forces $\sup_{\theta\in\Theta}|M_n(\theta) - M(\theta)| \ge \delta/2$. Therefore
$$P\big(d(\hat\theta_n, \theta_0) \ge \epsilon\big) \le P\big(M(\hat\theta_n) - M(\theta_0) < -\delta\big) \le P\Big(\sup_{\theta\in\Theta}|M_n(\theta) - M(\theta)| \ge \delta/2\Big) \to 0.$$
where θ₀ is the true parameter. In practice this is usually done by solving the equation ∂M_n(θ)/∂θ = 0 for θ. Let
$$\psi_\theta(x) = \frac{\partial \log\big(p_\theta(x)/p_{\theta_0}(x)\big)}{\partial\theta}.$$
Then the maximum likelihood estimator θ̂_n is a zero of the function
$$\Psi_n(\theta) := \frac{1}{n}\sum_{i=1}^{n}\psi_\theta(X_i). \tag{9.3}$$
Though we still use the notation Ψn in the following theorem, it is not necessarily the function above. It can be any function with the properties stated below.
THEOREM 33 Let Θ be a subset of the real line, and let Ψ_n be random functions and Ψ a fixed function of θ such that Ψ_n(θ) → Ψ(θ) in probability for every θ. Assume that each Ψ_n(θ) is continuous and has exactly one zero θ̂_n, or is nondecreasing with Ψ_n(θ̂_n) = o_P(1). Let θ₀ be a point such that Ψ(θ₀ − ǫ) < 0 < Ψ(θ₀ + ǫ) for every ǫ > 0. Then θ̂_n → θ₀ in probability.
Proof. Case 1. Suppose Ψ_n(θ) is continuous and has exactly one zero θ̂_n. Then, for any ǫ > 0,
$$P\big(\Psi_n(\theta_0 - \epsilon) < 0 < \Psi_n(\theta_0 + \epsilon)\big) \le P\big(\theta_0 - \epsilon < \hat\theta_n < \theta_0 + \epsilon\big).$$
Since Ψ_n(θ₀ ± ǫ) → Ψ(θ₀ ± ǫ) in probability, and Ψ(θ₀ − ǫ) < 0 < Ψ(θ₀ + ǫ), the left-hand side goes to one, and hence so does the right-hand side.
Case 2. Suppose Ψ_n(θ) is nondecreasing and Ψ_n(θ̂_n) = o_P(1). Then, by monotonicity, on {θ̂_n ≤ θ₀ − ǫ} we have Ψ_n(θ̂_n) ≤ Ψ_n(θ₀ − ǫ), while on {θ̂_n ≥ θ₀ + ǫ} we have Ψ_n(θ̂_n) ≥ Ψ_n(θ₀ + ǫ). Now, Ψ_n(θ₀ ± ǫ) → Ψ(θ₀ ± ǫ) in probability. This together with Ψ_n(θ̂_n) = o_P(1) shows that P(|θ̂_n − θ₀| ≥ ǫ) → 0 as n → ∞.
Example. Let X₁, ⋯, X_n be a random sample from Exp(θ), that is, the density function is $p_\theta(x) = \theta^{-1}\exp(-x/\theta)\, I(x \ge 0)$ for θ > 0. We know that, under the true model, the MLE θ̂_n = X̄_n → θ₀ in probability. Let us verify that this conclusion can indeed be deduced from Theorem 33.
Actually, $\psi_\theta(x) = (\log p_\theta(x))' = -\theta^{-1} + x\theta^{-2}$ for x ≥ 0. Thus $\Psi_n(\theta) = -\theta^{-1} + \bar X_n\theta^{-2}$, which converges to $\Psi(\theta) = E_{\theta_0}(-\theta^{-1} + X_1\theta^{-2}) = \theta^{-2}(\theta_0 - \theta)$; this is positive or negative according as θ is smaller or larger than θ₀. Also, θ̂_n = X̄_n is the unique zero of Ψ_n. Applying Theorem 33 to −Ψ_n and −Ψ, we obtain the consistency result.
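The following numeric sketch (ours; θ₀ and the sample size are arbitrary) traces the same example: Ψ_n is close to Ψ, changes sign at θ₀, and its zero X̄_n is close to θ₀.

```python
import numpy as np

# Psi_n(theta) = -1/theta + Xbar/theta^2 has zero theta_hat = Xbar, and
# Psi_n -> Psi pointwise with Psi(theta0) = 0.
rng = np.random.default_rng(6)
theta0 = 1.7
x = rng.exponential(scale=theta0, size=100_000)    # density theta^{-1} exp(-x/theta)
xbar = x.mean()

def Psi_n(theta):
    return -1.0 / theta + xbar / theta**2

def Psi(theta):                                    # E_{theta0} psi_theta(X1)
    return (theta0 - theta) / theta**2

print("theta_hat = Xbar =", round(float(xbar), 4), " (theta0 =", theta0, ")")
for t in [1.0, theta0, 2.5]:                       # Psi_n close to Psi, sign change at theta0
    print(t, round(Psi_n(t), 4), round(Psi(t), 4))
```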
9.2 Asymptotic Normality
$$\psi_{\hat\theta_n}(X_i) \doteq \psi_{\theta_0}(X_i) + (\hat\theta_n - \theta_0)\,\dot\psi_{\theta_0}(X_i) + \frac{1}{2}(\hat\theta_n - \theta_0)^2\,\ddot\psi_{\theta_0}(X_i).$$
We will use the following notation:
$$Pf = \int f(x)\, P(dx) \ \text{ for any real function } f, \qquad P_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i}.$$
Then $P_n f = (1/n)\sum_{i=1}^n f(X_i)$. Thus,
$$0 = \Psi_n(\hat\theta_n) \doteq P_n\psi_{\theta_0} + (\hat\theta_n - \theta_0)\,P_n\dot\psi_{\theta_0} + \frac{1}{2}(\hat\theta_n - \theta_0)^2\,P_n\ddot\psi_{\theta_0}.$$
Reorganizing, we obtain
$$\sqrt n\,(\hat\theta_n - \theta_0) \doteq \frac{-\sqrt n\,(P_n\psi_{\theta_0})}{P_n\dot\psi_{\theta_0} + \frac{1}{2}(\hat\theta_n - \theta_0)\,P_n\ddot\psi_{\theta_0}}.$$
$$g(\theta) = g(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T V_{\theta_0}(\theta - \theta_0) + o(\|\theta - \theta_0\|^2), \tag{9.1}$$
where
$$V_{\theta_0} = E_{\theta_0}\Big(\frac{\partial^2 m_\theta}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta=\theta_0}\Big)_{1\le i,j\le d}, \qquad \theta = (\theta_1, \cdots, \theta_d) \in \mathbb R^d.$$
THEOREM 34 For each θ in an open subset of Euclidean space let x ↦ m_θ(x) be a measurable function such that θ ↦ m_θ(x) is differentiable at θ₀ for P-almost every x with derivative $\dot m_{\theta_0}(x)$ and such that, for every θ₁ and θ₂ in a neighborhood of θ₀ and a measurable function n(x) with $E\,n(X_1)^2 < \infty$,
$$|m_{\theta_1}(x) - m_{\theta_2}(x)| \le n(x)\,\|\theta_1 - \theta_2\|.$$
Furthermore, assume the map θ ↦ E m_θ(X₁) has an expansion as in (9.1). If $P_n m_{\hat\theta_n} \ge \sup_\theta P_n m_\theta - o_P(n^{-1})$ and θ̂_n → θ₀ in probability, then
$$\sqrt n\,(\hat\theta_n - \theta_0) = -V_{\theta_0}^{-1}\,\frac{1}{\sqrt n}\sum_{i=1}^{n}\dot m_{\theta_0}(X_i) + o_P(1).$$
In particular, $\sqrt n\,(\hat\theta_n - \theta_0) \Longrightarrow N\big(0,\ V_{\theta_0}^{-1}\,E(\dot m_{\theta_0}\dot m_{\theta_0}^T)\,V_{\theta_0}^{-1}\big)$ as n → ∞.
If the Fisher information matrix $I_{\theta_0}$ is non-singular and θ̂_n is consistent, then
$$\sqrt n\,(\hat\theta_n - \theta_0) = I_{\theta_0}^{-1}\,\frac{1}{\sqrt n}\sum_{i=1}^{n}\dot\ell_{\theta_0}(X_i) + o_P(1).$$
In particular, $\sqrt n\,(\hat\theta_n - \theta_0) \Longrightarrow N(0, I_{\theta_0}^{-1})$ as n → ∞, where
$$I(\theta_0) = -E\Big(\frac{\partial^2\log p_\theta(X_1)}{\partial\theta_i\,\partial\theta_j}\Big|_{\theta=\theta_0}\Big)_{1\le i,j\le k}.$$
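A Monte Carlo illustration of this conclusion (our own example, not from the notes): for the Exp(θ) family with density θ⁻¹e^{−x/θ}, the MLE is X̄ and I(θ) = 1/θ², so √n(θ̂_n − θ₀) should have standard deviation close to θ₀.

```python
import numpy as np

# Check sqrt(n)(theta_hat - theta0) ~ N(0, I(theta0)^{-1}) for Exp(theta), MLE = Xbar.
rng = np.random.default_rng(7)
theta0, n, reps = 1.7, 2_000, 5_000

theta_hat = rng.exponential(scale=theta0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)
print("sample sd of sqrt(n)(theta_hat - theta0):", round(float(z.std()), 3))
print("I(theta0)^(-1/2) = theta0               :", theta0)
```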
LEMMA 9.1 For each θ in an open subset of Euclidean space let m_θ(x), as a function of x, be measurable for each θ, and, as a function of θ, be differentiable at θ₀ for almost every x (w.r.t. P) with derivative $\dot m_{\theta_0}(x)$, and suppose there is a measurable function ṁ with $P\dot m^2 < \infty$ such that, for every θ₁ and θ₂ in a neighborhood of θ₀,
$$|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \dot m(x)\,\|\theta_1 - \theta_2\|.$$
Then (9.2) holds for every random sequence $\tilde h_n$ that is bounded in probability.
Proof. Because $\tilde h_n$ is bounded in probability, to show (9.2) it is enough to show, w.l.o.g.,
$$\sup_{|\theta|\le 1}\big|G_n\big(r_n(m_{\theta_0 + \theta/r_n} - m_{\theta_0}) - \theta^T\dot m_{\theta_0}\big)\big| \stackrel{P}{\to} 0$$
as n → ∞. Define
Then
for any θ1 and θ2 in the unit ball in Rd by the Lipschitz condition. Further, set
THEOREM 36 (Rate of Convergence) Assume that, for fixed constants C and α > β, for every n and every sufficiently small δ > 0,
$$P(m_\theta - m_{\theta_0}) \le -C\, d(\theta,\theta_0)^\alpha \ \text{ for all } \theta \text{ with } d(\theta,\theta_0) < \delta, \qquad E^*\sup_{d(\theta,\theta_0)<\delta}|G_n(m_\theta - m_{\theta_0})| \le C\,\delta^\beta.$$
If the sequence θ̂_n satisfies $P_n m_{\hat\theta_n} \ge P_n m_{\theta_0} - O_P(n^{\alpha/(2\beta-2\alpha)})$ and converges in outer probability to θ₀, then $n^{1/(2\alpha-2\beta)}\, d(\hat\theta_n, \theta_0) = O_{P^*}(1)$.
The middle term on the right goes to zero; the last term can be made arbitrarily small by choosing K large. We only need to show that the remaining sum is arbitrarily small for M large enough as n → ∞. By the given condition,
$$\sup_{\theta\in S_{j,n}} P(m_\theta - m_{\theta_0}) \le -C\,\frac{2^{(j-1)\alpha}}{r_n^\alpha}.$$
For M such that $\frac{1}{2}C\,2^{(M-1)\alpha} \ge K$, by the fact that $\sup_s(f_s + g_s) \le \sup_s f_s + \sup_s g_s$,
$$P^*\Big(\sup_{\theta\in S_{j,n}}(P_n m_\theta - P_n m_{\theta_0}) \ge -\frac{K}{r_n^\alpha}\Big) \le P^*\Big(\sup_{\theta\in S_{j,n}} G_n(m_\theta - m_{\theta_0}) \ge C\sqrt n\,\frac{2^{(j-1)\alpha}}{2r_n^\alpha}\Big),$$
and the right-hand side is controlled by Markov's inequality and the definition of r_n. Summing over j ≥ M, the resulting bound goes to zero as M → ∞.
COROLLARY 9.1 For each θ in an open subset of Euclidean space, let m_θ(x) be a measurable function such that there exists ṁ(x) with $P\dot m^2 < \infty$ satisfying
$$|m_{\theta_1}(x) - m_{\theta_2}(x)| \le \dot m(x)\,\|\theta_1 - \theta_2\|.$$
Furthermore, assume
$$P m_\theta = P m_{\theta_0} + \frac{1}{2}(\theta - \theta_0)^T V_{\theta_0}(\theta - \theta_0) + o(\|\theta - \theta_0\|^2) \tag{9.4}$$
with $V_{\theta_0}$ nonsingular. If $P_n m_{\hat\theta_n} \ge P_n m_{\theta_0} - O_P(n^{-1})$ and θ̂_n → θ₀ in probability, then $\sqrt n\,(\hat\theta_n - \theta_0) = O_P(1)$.
Proof. (9.4) implies the first condition of Theorem 36 holds with α = 2, since $V_{\theta_0}$ is nonsingular. Now we apply Corollary 8.2 to the class of functions $\mathcal F = \{m_\theta - m_{\theta_0}:\ \|\theta - \theta_0\| < \delta\}$ to see that the second condition is valid with β = 1. This class has envelope function F = δṁ, while
$$E^*\sup_{\|\theta-\theta_0\|<\delta}|G_n(m_\theta - m_{\theta_0})| \le C\int_0^{\delta\|\dot m\|_{P,2}}\sqrt{\log N_{[\,]}(\epsilon, \mathcal F, L_2(P))}\, d\epsilon \le C\int_0^{\delta\|\dot m\|_{P,2}}\sqrt{\log\Big(K\Big(\frac{\delta}{\epsilon}\Big)^d\Big)}\, d\epsilon = C_1\delta.$$
Moreover, for every random sequence $\tilde h_n$ that is bounded in probability,
$$n P_n\big(m_{\theta_0 + \tilde h_n/\sqrt n} - m_{\theta_0}\big) = \frac{1}{2}\tilde h_n^T V_{\theta_0}\tilde h_n + \tilde h_n^T G_n\dot m_{\theta_0} + o_P(1).$$
By Corollary 9.1, $\sqrt n\,(\hat\theta_n - \theta_0)$ is bounded in probability (i.e., tight). Take $\hat h_n = \sqrt n\,(\hat\theta_n - \theta_0)$ and $\tilde h_n = -V_{\theta_0}^{-1}G_n\dot m_{\theta_0}$; we then obtain the Taylor expansions of $P_n m_{\hat\theta_n}$ and $P_n m_{\theta_0 - V_{\theta_0}^{-1}G_n\dot m_{\theta_0}/\sqrt n}$ as follows:
$$n P_n(m_{\hat\theta_n} - m_{\theta_0}) = \frac{1}{2}\hat h_n^T V_{\theta_0}\hat h_n + \hat h_n^T G_n\dot m_{\theta_0} + o_P(1),$$
$$n P_n\big(m_{\theta_0 - V_{\theta_0}^{-1}G_n\dot m_{\theta_0}/\sqrt n} - m_{\theta_0}\big) = -\frac{1}{2}\,G_n\dot m_{\theta_0}^T\, V_{\theta_0}^{-1}\, G_n\dot m_{\theta_0} + o_P(1),$$
where the second one is obtained through a bit of algebra. By the definition of θ̂_n, the left-hand side of the first equation is greater than that of the second one, up to o_P(1). So are the right-hand sides. Taking the difference and completing the square, we have
$$\frac{1}{2}\big(\hat h_n + V_{\theta_0}^{-1}G_n\dot m_{\theta_0}\big)^T V_{\theta_0}\big(\hat h_n + V_{\theta_0}^{-1}G_n\dot m_{\theta_0}\big) + o_P(1) \ge 0.$$
Since $V_{\theta_0}$ is strictly negative-definite, the quadratic form must converge to zero in probability. So does $\|\hat h_n + V_{\theta_0}^{-1}G_n\dot m_{\theta_0}\|$; that is, $\hat h_n = -V_{\theta_0}^{-1}G_n\dot m_{\theta_0} + o_P(1)$.
10 Appendix
Let A be a collection of subsets of Ω and let B be the σ-algebra generated by A, that is, B = σ(A). Let P be a probability measure on (Ω, B).
Since f is bounded, for any n ≥ 1 there exists h_n such that
$$\sup_{x\in\mathbb R^m}|f(x) - h_n(x)| < \frac{1}{2n^2}, \tag{10.7}$$
where h_n is a simple function, i.e., $h_n(x) = \sum_{i=1}^{k_n} c_i\, I\{x\in B_i\}$ for some $k_n < \infty$, constants $c_i$ and $B_i \in \mathcal B(\mathbb R^m)$.
Now set X = (X₁, ⋯, X_m) ∈ R^m and let µ be the probability distribution of X under P. Let $\mathcal A$ be the set of all finite unions of sets in $\mathcal A_1 := \{\prod_{i=1}^{m} A_i \in \mathcal B(\mathbb R^m):\ A_i \in \mathcal B(\mathbb R)\}$. By the construction of $\mathcal B(\mathbb R^m)$, we know that $\mathcal B(\mathbb R^m) = \sigma(\mathcal A)$. It is not difficult to verify that $\mathcal A$ satisfies the conditions in Lemma 10.1. Thus there exist $E_i \in \mathcal A$ such that
$$\int |I_{B_i}(x) - I_{E_i}(x)|^p\, d\mu = \mu(B_i\,\Delta\, E_i) < \frac{1}{(2cn^2)^p} \quad\text{and}\quad E_i = \bigcup_{j=1}^{k_i}\prod_{l=1}^{m} A_{i,j,l},$$
where $c = 1 + \sum_{i=1}^{k_n}|c_i|$ and $A_{i,j,l} \in \mathcal B(\mathbb R)$ for all i, j and l. Now, since $\|\cdot\|_p$ is a norm, we have
$$\Big\|h_n(X) - \sum_{i=1}^{k_n} c_i\, I_{E_i}(X)\Big\|_p \le \sum_{i=1}^{k_n}|c_i|\cdot\|I_{E_i}(X) - I_{B_i}(X)\|_p \le \frac{1}{2n^2}. \tag{10.8}$$
Observe that $\nu(E) := I_E(X)$ defines a (point-mass) measure on $(\mathbb R^m, \mathcal B(\mathbb R^m))$. Note that the intersection of any finite number of product sets $\prod_{l=1}^{m} A_{i,j,l}$ is still in $\mathcal A_1$. By the inclusion–exclusion formula, $I_{E_i}(X)$ is a finite linear combination of $I_F(X)$ with $F \in \mathcal A_1$. Thus, $f_n(X) := \sum_{i=1}^{k_n} c_i\, I_{E_i}(X)$ is of the form in (iii). Now, by (10.7) and (10.8),
$$\|f(X) - f_n(X)\|_p \le \|f(X) - h_n(X)\|_p + \|h_n(X) - f_n(X)\|_p \le \frac{1}{n^2}.$$
Thus, (10.6) follows.
$$E\big(f_1(\xi_1)\cdots f_k(\xi_k)\,\big|\,\mathcal F\big) = E\big(f_1(\xi_1)\cdots f_k(\xi_k)\,\big|\,\mathcal G\big) \quad \text{a.s.} \tag{10.9}$$
Proof. From (10.9), we know that the desired conclusion holds for any φ that is a finite linear combination of products of the form $f_1\cdots f_k$. By Lemma 10.2, there exist such functions $\phi_n$, n ≥ 1, such that $E|\phi_n(\xi_1,\cdots,\xi_k) - \phi(\xi_1,\cdots,\xi_k)| \to 0$ as n → ∞. Therefore, since conditional expectation is an $L^1$-contraction, $E(\phi_n(\xi_1,\cdots,\xi_k)|\mathcal F) \to E(\phi(\xi_1,\cdots,\xi_k)|\mathcal F)$ and $E(\phi_n(\xi_1,\cdots,\xi_k)|\mathcal G) \to E(\phi(\xi_1,\cdots,\xi_k)|\mathcal G)$ in $L^1$. We know that $E(\phi_n(\xi_1,\cdots,\xi_k)|\mathcal F) = E(\phi_n(\xi_1,\cdots,\xi_k)|\mathcal G)$ for all n ≥ 1, so the conclusion follows since the $L^1$-limit is unique up to a set of measure zero.
Proof. Let p = n − m. By Lemma 10.3, we only need to prove the lemma for $\phi(x) = \phi_1(x_1)\cdots\phi_p(x_p)$ with $x = (x_1, \cdots, x_p)$ and the $\phi_i$'s bounded functions. We do this by induction.
If n = m + 1, by the Markov property, $Y := E(\phi_1(X_{m+1})|X_m, \cdots, X_1) = E(\phi_1(X_{m+1})|X_m)$. Recall the property that $E(E(Z|\mathcal F)|\mathcal G) = E(E(Z|\mathcal G)|\mathcal F) = E(Z|\mathcal F)$ if $\mathcal F \subset \mathcal G$. We know that
where $\tilde\phi_p(X_n) = \phi_p(X_n)\cdot E(\phi_{p+1}(X_{n+1})|X_n)$, and the last step is by induction. Since $E(\phi_1(X_{m+1})\cdots\phi_{p-1}(X_{n-1})\tilde\phi_p(X_n)|X_m, \cdots, X_1)$ is equal to $E(\phi_1(X_{m+1})\cdots\phi_{p+1}(X_{n+1})|X_m, \cdots, X_1)$, the conclusion follows.
The following property says that a Markov chain, run backwards in time, still has the Markov property.
Proof. By Lemma 10.3, we may assume, without loss of generality, that φ(x) is a bounded function. From the definition of conditional expectation, to prove the proposition it suffices to show that
$$E\{\phi(X_1, \cdots, X_m)\, g(X_{m+1}, \cdots, X_n)\} = E\{E(\phi(X_1, \cdots, X_m)|X_{m+1})\cdot g(X_{m+1}, \cdots, X_n)\} \tag{10.10}$$
for any bounded, measurable function g(x), x ∈ R^{n−m}. By Lemma 10.3 again, we only need to prove (10.10) for $g(x) = g_1(x_1)\cdots g_p(x_p)$ where p = n − m. Now
where Z = E(φ(X1 , · · · , Xm )g1 (Xm+1 )|Xm+1 ). Use the fact E(E(ξ|F) · η) = E(ξ · E(η|F))
for any ξ, η and σ-algebra F to obtain that
The following proposition says that a Markov chain has the property: given the present,
the past and future are independent.
by Proposition 10.1 and that φ is σ(X1 , · · · , Xl−1 ) measurable. The last term is also equal
to E(ψ|Xl−1 ) · E{φ|Xk+1 , Xl−1 }. The conclusion follows since E(ψ|Xl−1 ) = E(ψ|Xk , Xl−1 ).
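A small simulation sketch of this conditional-independence property (ours; the two-state transition matrix, the initial distribution, and the sample size are arbitrary choices): given X₁, events about the past X₀ and the future X₂ should factor.

```python
import numpy as np

# Given the present X1, the past X0 and the future X2 of a Markov chain are independent:
# compare P(X0=a, X2=c | X1=b) with P(X0=a | X1=b) * P(X2=c | X1=b).
rng = np.random.default_rng(8)
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])                     # transition matrix of a 2-state chain
n_paths = 400_000
x0 = rng.integers(0, 2, size=n_paths)          # start from Uniform{0,1}
x1 = (rng.random(n_paths) < P[x0, 1]).astype(int)
x2 = (rng.random(n_paths) < P[x1, 1]).astype(int)

a, b, c = 0, 1, 0
sel = x1 == b
joint = float(np.mean((x0[sel] == a) & (x2[sel] == c)))
prod = float(np.mean(x0[sel] == a)) * float(np.mean(x2[sel] == c))
print(round(joint, 4), "approx equals", round(prod, 4))
```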