Notes On Probability
Richard F. Bass
Revised 2001.
1. Basic notions.
A probability or probability measure is a measure whose total mass is one. Because the origins of
probability are in statistics rather than analysis, some of the terminology is different. For example, instead
of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field
(which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events.
Elements of Ω are denoted ω.
Instead of saying a property occurs almost everywhere, we talk about properties occurring almost surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are usually denoted by X or Y or other capital letters. We often abbreviate "random variable" by r.v.
We let A^c = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ A^c.
Integration (in the sense of Lebesgue) is called expectation or expected value, and we write E X for ∫ X dP. The notation E[X; A] is often used for ∫_A X dP.
The random variable 1A is the function that is one if ω ∈ A and zero otherwise. It is called the
indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such
as (ω : X(ω) > a) are almost always abbreviated by (X > a).
Given a random variable X, we can define a probability on R by
P_X(A) = P(X ∈ A), A a Borel subset of R.
P_X is called the law or distribution of X. The function F_X(x) = P_X((−∞, x]) = P(X ≤ x) is called the distribution function of X.
Proposition 1.1. The distribution function F_X satisfies:
(a) F_X is nondecreasing;
(b) F_X is right continuous with left limits;
(c) lim_{x→∞} F_X(x) = 1 and lim_{x→−∞} F_X(x) = 0.
Proof. We prove the first part of (b) and leave the others to the reader. If xn ↓ x, then (X ≤ xn ) ↓ (X ≤ x),
and so P(X≤ xn ) ↓ P(X ≤ x) since P is a measure.
Note that if xn ↑ x, then (X ≤ xn ) ↑ (X < x), and so FX (xn ) ↑ P(X < x).
Any function F : R → [0, 1] satisfying (a)-(c) of Proposition 1.1 is called a distribution function,
whether or not it comes from a random variable.
Proposition 1.2. Suppose F is a distribution function. There exists a random variable X such that
F = FX .
Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F (x) < ω}. It
is routine to check that FX = F .
In the above proof, essentially X = F −1 . However F may have jumps or be constant over some
intervals, so some care is needed in defining X.
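The construction in the proof of Proposition 1.2 can be tried numerically. In the sketch below (our illustration, not part of the notes) we take F to be the Exp(1) distribution function, compute X(ω) = sup{x : F(x) < ω} by bisection, and check that samples X(U) with U uniform on [0, 1] have distribution function F:

```python
import math
import random

def F_exp(x):
    # Exp(1) distribution function (our example; any distribution function works).
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def X_of(omega, lo=0.0, hi=50.0, iters=60):
    # Proposition 1.2: X(omega) = sup{x : F(x) < omega}, located by bisection.
    # Valid here because F_exp is continuous and strictly increasing on [0, oo).
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if F_exp(mid) < omega:
            lo = mid
        else:
            hi = mid
    return lo

random.seed(0)
samples = [X_of(random.random()) for _ in range(20000)]
# The empirical F_X(1) should be close to F_exp(1) = 1 - 1/e.
frac_le_1 = sum(1 for s in samples if s <= 1.0) / len(samples)
```

Since F_exp has no jumps and no flat stretches, X here is just the usual inverse function; the sup definition is what handles the general case.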
Certain distributions or laws are very common. We list some of them.
(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an exercise in calculus to check that such a random variable has density
(1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)}. (1.3)
Proposition 1.3. If g is nonnegative or bounded Borel measurable, then
E g(X) = ∫ g(x) P_X(dx).
Proof. If g is the indicator of an event A, this is just the definition of P_X. By linearity, the result holds for simple functions. By the monotone convergence theorem, the result holds for nonnegative functions, and by linearity again, it holds for bounded g.
If F_X has a density f, then P_X(dx) = f(x) dx. So, for example, E X = ∫ x f(x) dx and E X² = ∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.)
We define the mean of a random variable to be its expectation, and the variance of a random variable
is defined by
Var X = E (X − E X)2 .
For example, it is routine to see that the mean of a standard normal is zero and its variance is one.
Note
Var X = E (X 2 − 2XE X + (E X)2 ) = E X 2 − (E X)2 .
Proposition 1.4. If X ≥ 0, then
E X = ∫_0^∞ P(X > x) dx.
The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).
Proposition 1.5. (Chebyshev's inequality) If X ≥ 0 and a > 0, then
P(X ≥ a) ≤ E X/a.
Proof. We write
P(X ≥ a) = E[1_{[a,∞)}(X)] ≤ E[(X/a) 1_{[a,∞)}(X)] ≤ E X/a,
since X/a is bigger than 1 when X ∈ [a, ∞).
If we apply this to X = (Y − E Y)², we obtain
P(|Y − E Y| ≥ a) = P((Y − E Y)² ≥ a²) ≤ Var Y/a².
This special case of Chebyshev’s inequality is sometimes itself referred to as Chebyshev’s inequality, while
Proposition 1.5 is sometimes called the Markov inequality.
The second inequality we need is Jensen’s inequality, not to be confused with the Jensen’s formula of
complex analysis.
Proposition 1.6. Suppose g is convex and X and g(X) are both integrable. Then
g(E X) ≤ E g(X).
Proof. One property of convex functions is that they lie above their tangent lines, and more generally their support lines. So if x_0 ∈ R, there is a constant c such that
g(x) ≥ g(x_0) + c(x − x_0)
for all x. Take x_0 = E X, set x = X(ω), and take expectations to obtain E g(X) ≥ g(E X).
If A_n is a sequence of events, the event (A_n i.o.), read "A_n infinitely often," is defined by
(A_n i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ A_i.
Proposition 1.7. (Borel–Cantelli lemma, first half) If Σ_n P(A_n) < ∞, then P(A_n i.o.) = 0.
Proof. We have
P(A_n i.o.) = lim_{n→∞} P(∪_{i=n}^∞ A_i).
However,
P(∪_{i=n}^∞ A_i) ≤ Σ_{i=n}^∞ P(A_i),
which tends to 0 as n → ∞ since Σ_i P(A_i) converges.
2. Independence.
Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B). The events A_1, . . . , A_n are independent if
P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_j}) = P(A_{i_1})P(A_{i_2}) · · · P(A_{i_j})
whenever {i_1, . . . , i_j} is a subset of {1, . . . , n}.
Proposition 2.1. A and B are independent if and only if the random variables 1_A and 1_B are independent.
Proof. The σ-field generated by 1_A is {∅, A, A^c, Ω}, and similarly for 1_B. If A and B are independent, each of the required equalities is easily checked; for example, P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A)(1 − P(B)) = P(A)P(B^c). The converse is immediate from the definitions below.
We say two σ-fields F and G are independent if A and B are independent whenever A ∈ F and
B ∈ G. Two random variables X and Y are independent if the σ-field generated by X and the σ-field
generated by Y are independent. (Recall that the σ-field generated by a random variable X is given by
{(X ∈ A) : A a Borel subset of R}.) We define the independence of n σ-fields or n random variables in the
obvious way.
Proposition 2.1 tells us that A and B are independent if the random variables 1_A and 1_B are independent, so the definitions above are consistent.
If f and g are Borel functions and X and Y are independent, then f (X) and g(Y ) are independent.
This follows because the σ-field generated by f (X) is a sub-σ-field of the one generated by X, and similarly
for g(Y ).
Let F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma inside the set means "and.")
Proposition 2.2. FX,Y (x, y) = FX (x)FY (y) if and only if X and Y are independent.
Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above comments. Using this and the definition of independence shows F_{X,Y}(x, y) = F_X(x)F_Y(y).
Conversely, if the equality holds, fix y and let M_y denote the collection of sets A for which P(X ∈ A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). M_y contains all sets of the form (−∞, x]. It follows by additivity that M_y contains all sets of the form (x, z], and then all sets that are finite unions of such half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra generating the Borel σ-field.
generating the Borel σ-field. It is clear that My is a monotone class, so by the monotone class lemma, My
contains the Borel σ-field.
For a fixed set A, let MA denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈
A)P(Y ∈ B). Again, MA is a monotone class and by the preceding paragraph contains the σ-field generated
by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X
and Y are independent.
Proposition 2.3. If X, Y , and XY are integrable and X and Y are independent, then E XY = E XE Y .
Proof. Consider the random variables in σ(X) (the σ-field generated by X) and σ(Y ) for which the
multiplication theorem is true. It holds for indicators by the definition of X and Y being independent.
It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.
It holds for nonnegative random variables by monotone convergence. And it holds for integrable random
variables by linearity again.
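For random variables taking finitely many values, the multiplication theorem of Proposition 2.3 can be verified by direct computation on a product space, mirroring the example that follows; the particular values and probabilities are arbitrary choices of ours:

```python
# X and Y defined on a product space are independent by construction;
# we verify E[XY] = (E X)(E Y) exactly for finite distributions.
xs = [(-1, 0.25), (0, 0.5), (2, 0.25)]   # (value, probability) pairs for X
ys = [(1, 0.6), (3, 0.4)]                # (value, probability) pairs for Y

EX = sum(v * p for v, p in xs)
EY = sum(v * p for v, p in ys)
# Joint probability of (vx, vy) under the product measure is px * py.
EXY = sum(vx * vy * px * py for vx, px in xs for vy, py in ys)
```

The double sum is exactly integration against the product measure P_1 × P_2, which is why it factors.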
Let us give an example of independent random variables. Let Ω = Ω1 × Ω2 and let P = P1 × P2 , where
(Ωi , Fi , Pi ) are probability spaces for i = 1, 2. We use the product σ-field. Then it is clear that F1 and F2
are independent by the definition of P. If X1 is a random variable such that X1 (ω1 , ω2 ) depends only on ω1
and X2 depends only on ω2 , then X1 and X2 are independent.
This example can be extended to n independent random variables, and in fact, if one has independent
random variables, one can always view them as coming from a product space. We will not use this fact.
Later on, we will talk about countable sequences of independent r.v.s and the reader may wonder whether
such things can exist. That it can is a consequence of the Kolmogorov extension theorem; see PTA, for
example.
If X_1, . . . , X_n are independent, then so are X_1 − E X_1, . . . , X_n − E X_n. Assuming everything is integrable,
Var(X_1 + · · · + X_n) = E[(X_1 − E X_1) + · · · + (X_n − E X_n)]² = Σ_i E(X_i − E X_i)² + Σ_{i≠j} E[(X_i − E X_i)(X_j − E X_j)],
using the multiplication theorem to show that the expectations of the cross product terms are zero. We have thus shown
Var(X_1 + · · · + X_n) = Var X_1 + · · · + Var X_n. (2.1)
We finish up this section by proving the second half of the Borel-Cantelli lemma.
Proposition 2.4. Suppose A_n is a sequence of independent events. If Σ_n P(A_n) = ∞, then P(A_n i.o.) = 1.
Note that here the An are independent, while in the first half of the Borel-Cantelli lemma no such
assumption was necessary.
Proof. Note
P(∪_{i=n}^N A_i) = 1 − P(∩_{i=n}^N A_i^c) = 1 − ∏_{i=n}^N P(A_i^c) = 1 − ∏_{i=n}^N (1 − P(A_i)).
By the mean value theorem, 1 − x ≤ e^{−x}, so we have that the right hand side is greater than or equal to 1 − exp(−Σ_{i=n}^N P(A_i)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ A_i) = 1. This holds for all n, which proves the result.
3. Convergence.
In this section we consider three ways a sequence of r.v.s Xn can converge.
We say X_n converges to X almost surely if the event that X_n does not converge to X has probability zero. X_n converges to X in probability if for each ε > 0, P(|X_n − X| > ε) → 0 as n → ∞. X_n converges to X in L^p if E|X_n − X|^p → 0 as n → ∞.
The following proposition shows some relationships among the types of convergence.
Proposition 3.1. (a) If Xn → X a.s., then Xn → X in probability.
(b) If Xn → X in Lp , then Xn → X in probability.
(c) If Xn → X in probability, there exists a subsequence nj such that Xnj converges to X almost surely.
Proof. To prove (a), note Xn − X tends to 0 almost surely, so 1(−ε,ε)c (Xn − X) also converges to 0 almost
surely. Now apply the dominated convergence theorem.
(b) comes from Chebyshev's inequality: for ε > 0,
P(|X_n − X| > ε) = P(|X_n − X|^p > ε^p) ≤ E|X_n − X|^p/ε^p → 0
as n → ∞.
To prove (c), choose nj larger than nj−1 such that P(|Xn − X| > 2−j ) < 2−j whenever n ≥ nj .
So if we let A_i = (|X_{n_j} − X| > 2^{−j} for some j ≥ i), then P(A_i) ≤ Σ_{j=i}^∞ 2^{−j} = 2^{−i+1}. By the Borel–Cantelli lemma, P(A_i i.o.) = 0. This implies X_{n_j} → X on the complement of (A_i i.o.).
Let us give some examples to show there need not be any other implications among the three types
of convergence.
Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let X_n = e^n 1_{(0,1/n)}. Then clearly X_n converges to 0 almost surely and in probability, but E X_n^p = e^{np}/n → ∞ for any p.
Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass 1. Let t_n = Σ_{i=1}^n i^{−1}, and let A_n = {θ : t_{n−1} ≤ θ < t_n}, where angles are taken mod 2π. Let X_n = 1_{A_n}. Since Σ_i i^{−1} = ∞, any point on the unit circle will be in infinitely many A_n, so X_n does not converge almost surely to 0. But P(A_n) = 1/(2πn) → 0, so X_n → 0 in probability and in L^p.
4. Weak law of large numbers.
Suppose Xn is a sequence of independent random variables. Suppose also that they all have the same
distribution, that is, FXn = FX1 for all n. This situation comes up so often it has a name, independent,
identically distributed, which is abbreviated i.i.d.
Define S_n = Σ_{i=1}^n X_i. S_n is called a partial sum process. S_n/n is the average value of the first n of the X_i's.
Theorem 4.1. (Weak law of large numbers) Suppose the Xi are i.i.d. and E X12 < ∞. Then Sn /n → E X1
in probability.
Proof. Since the X_i are i.i.d., they all have the same expectation, and so E S_n = nE X_1. Hence E(S_n/n − E X_1)² is the variance of S_n/n. If ε > 0, by Chebyshev's inequality,
P(|S_n/n − E X_1| > ε) ≤ Var(S_n/n)/ε² = (Σ_{i=1}^n Var X_i)/(n²ε²) = nVar X_1/(n²ε²) = Var X_1/(nε²). (4.1)
Since E X_1² < ∞, then Var X_1 < ∞, and the result follows by letting n → ∞.
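A quick simulation illustrates the theorem and the 1/n rate suggested by (4.1); taking the X_i uniform on [0, 2] (mean 1, variance 1/3) is our arbitrary choice:

```python
import random

random.seed(1)

def sample_mean(n):
    # S_n / n for i.i.d. uniforms on [0, 2]; E X_1 = 1, Var X_1 = 1/3.
    return sum(random.uniform(0, 2) for _ in range(n)) / n

def freq_far_from_mean(n, eps=0.1, reps=200):
    # Empirical estimate of P(|S_n/n - E X_1| > eps).
    return sum(abs(sample_mean(n) - 1.0) > eps for _ in range(reps)) / reps

# (4.1) bounds the probability by Var X_1 / (n eps^2), so it should
# shrink substantially between n = 10 and n = 1000.
f10, f1000 = freq_far_from_mean(10), freq_far_from_mean(1000)
```

At n = 10 the Chebyshev bound is vacuous and the empirical frequency is large; at n = 1000 it is essentially zero, consistent with the bound 1/(3·1000·0.01).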
A nice application of the weak law of large numbers is a proof of the Weierstrass approximation
theorem.
Theorem 4.2. Suppose f is a continuous function on [0, 1] and ε > 0. There exists a polynomial P such
that supx∈[0,1] |f (x) − P (x)| < ε.
Proof. Let
P(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.
Clearly P is a polynomial. Since f is continuous, there exists M such that |f(x)| ≤ M for all x and there exists δ such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ.
Let X_i be i.i.d. Bernoulli r.v.s with parameter x. Then S_n, the partial sum, is a binomial, and hence P(x) = E f(S_n/n). The mean of S_n/n is x. We have
|P(x) − f(x)| ≤ E|f(S_n/n) − f(x)| ≤ ε/2 + 2M P(|S_n/n − x| ≥ δ).
By Chebyshev's inequality and Var X_1 = x(1 − x) ≤ 1/4, the last probability is at most 1/(4nδ²), uniformly in x, so the right-hand side is less than ε for n large.
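The polynomial in the proof (a Bernstein polynomial) is easy to compute directly. As an illustration of ours, we try it on the continuous but non-smooth function |x − 1/2| and measure the worst error on a grid:

```python
import math

def bernstein(f, n, x):
    # P(x) = E f(S_n / n) with S_n binomial(n, x), exactly as in the proof.
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)  # continuous on [0, 1], not differentiable at 1/2
err = max(abs(f(x) - bernstein(f, 400, x))
          for x in [i / 100 for i in range(101)])
```

The largest error occurs at the kink x = 1/2, where it is roughly E|S_n/n − 1/2|, of order n^{−1/2}; smoother functions converge faster.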
Proposition 5.1. Suppose X_i is an i.i.d. sequence with E X_i^4 < ∞ and let S_n = Σ_{i=1}^n X_i. Then S_n/n → E X_1 a.s.
Proof. By looking at X_i − E X_i, we may assume E X_i = 0, and it suffices to show S_n/n → 0 a.s. By Chebyshev's inequality,
P(|S_n/n| > ε) ≤ E(S_n/n)^4/ε^4 = E S_n^4/(n^4 ε^4).
If we expand S_n^4, we will have terms involving X_i^4, terms involving X_i²X_j², terms involving X_i³X_j, terms involving X_i²X_jX_k, and terms involving X_iX_jX_kX_ℓ, with i, j, k, ℓ all being different. By the multiplication theorem and the fact that the X_i have mean 0, the expectations of all the terms will be 0 except for those of the first two types. So
E S_n^4 = Σ_{i=1}^n E X_i^4 + 3 Σ_{i≠j} E X_i² E X_j².
By the finiteness assumption, the first term on the right is bounded by c_1 n. By Cauchy–Schwarz, E X_i² ≤ (E X_i^4)^{1/2} < ∞, and there are at most n² terms in the second sum, so the second term is bounded by c_2 n². Substituting, we have
P(|S_n/n| > ε) ≤ (c_1 n + c_2 n²)/(n^4 ε^4) ≤ c_3/(n² ε^4),
which is summable in n. Consequently P(|S_n/n| > ε i.o.) = 0 by Borel–Cantelli. Since ε is arbitrary, this implies S_n/n → 0 a.s.
Before we can prove the SLLN assuming only the finiteness of first moments, we need some preliminaries.
Proposition 5.2. If Y ≥ 0, then E Y < ∞ if and only if Σ_n P(Y > n) < ∞.
Proof. By Proposition 1.4, E Y = ∫_0^∞ P(Y > x) dx. Since P(Y > x) is nonincreasing in x, the integral is bounded above by Σ_{n=0}^∞ P(Y > n) and bounded below by Σ_{n=1}^∞ P(Y > n).
If X_i is a sequence of r.v.s, the tail σ-field is defined by ∩_{n=1}^∞ σ(X_n, X_{n+1}, . . .). An example of an event in the tail σ-field is (lim sup_{n→∞} X_n > a). Another example is (lim sup_{n→∞} S_n/n > a). The reason for this is that if k < n is fixed,
S_n/n = S_k/n + (Σ_{i=k+1}^n X_i)/n.
The first term on the right tends to 0 as n → ∞. So lim sup S_n/n = lim sup (Σ_{i=k+1}^n X_i)/n, which is in σ(X_{k+1}, X_{k+2}, . . .). This holds for each k. The set (lim sup S_n > a) is easily seen not to be in the tail σ-field.
Theorem 5.3. (Kolmogorov 0-1 law) If the Xi are independent, then the events in the tail σ-field have
probability 0 or 1.
This implies that in the case of i.i.d. random variables, if Sn /n has a limit with positive probability,
then it has a limit with probability one, and the limit must be a constant.
Proof. Let M be the collection of sets in σ(X_{n+1}, . . .) that are independent of every set in σ(X_1, . . . , X_n). M is easily seen to be a monotone class and it contains σ(X_{n+1}, . . . , X_N) for every N > n. Therefore M must be equal to σ(X_{n+1}, . . .).
If A is in the tail σ-field, then for each n, A is independent of σ(X1 , . . . , Xn ). The class MA of sets
independent of A is a monotone class, hence is a σ-field containing σ(X1 , . . . , Xn ) for each n. Therefore MA
contains σ(X1 , . . .).
We thus have that the event A is independent of itself, that is, P(A) = P(A ∩ A) = P(A)P(A) = P(A)². This implies P(A) is zero or one.
The next proposition shows that in considering a law of large numbers we can consider truncated
random variables.
Proposition 5.4. Suppose X_i is an i.i.d. sequence of r.v.s with E|X_1| < ∞. Let X_n' = X_n 1_{(|X_n|≤n)}. Then
(a) X_n converges almost surely if and only if X_n' does;
(b) if S_n' = Σ_{i=1}^n X_i', then S_n/n converges a.s. if and only if S_n'/n does.
Proof. Let A_n = (X_n ≠ X_n') = (|X_n| > n). Then P(A_n) = P(|X_n| > n) = P(|X_1| > n). Since E|X_1| < ∞, then by Proposition 5.2 we have Σ_n P(A_n) < ∞. So by the Borel–Cantelli lemma, P(A_n i.o.) = 0. Thus for almost every ω, X_n = X_n' for n sufficiently large. This proves (a).
For (b), let k (depending on ω) be the largest integer such that X_k'(ω) ≠ X_k(ω). Then S_n/n − S_n'/n = (X_1 + · · · + X_k)/n − (X_1' + · · · + X_k')/n → 0 as n → ∞.
Proposition 5.5. (Kolmogorov's inequality) Suppose the X_i are independent r.v.s with mean zero and finite variance, and S_n = Σ_{i=1}^n X_i. Then for λ > 0,
P(max_{1≤i≤n} |S_i| ≥ λ) ≤ E S_n²/λ².
Proof. Let Ak = (|Sk | ≥ λ, |S1 | < λ, . . . , |Sk−1 | < λ). Note the Ak are disjoint and that Ak ∈ σ(X1 , . . . , Xk ).
Therefore Ak is independent of Sn − Sk . Then
E S_n² ≥ Σ_{k=1}^n E[S_n²; A_k]
= Σ_{k=1}^n E[(S_k² + 2S_k(S_n − S_k) + (S_n − S_k)²); A_k]
≥ Σ_{k=1}^n E[S_k²; A_k] + 2 Σ_{k=1}^n E[S_k(S_n − S_k); A_k].
Using the independence, E [Sk (Sn − Sk )1Ak ] = E [Sk 1Ak ]E [Sn − Sk ] = 0. Therefore
E S_n² ≥ Σ_{k=1}^n E[S_k²; A_k] ≥ λ² Σ_{k=1}^n P(A_k) = λ² P(max_{1≤k≤n} |S_k| ≥ λ).
The last result we need for now is a special case of what is known as Kronecker’s lemma.
Proposition 5.6. Suppose x_i are real numbers and s_n = Σ_{i=1}^n x_i. If Σ_{j=1}^∞ (x_j/j) converges, then s_n/n → 0.
Proof. Let b_n = Σ_{j=1}^n (x_j/j), b_0 = 0, and suppose b_n → b. As is well known, this implies (Σ_{i=1}^n b_i)/n → b. We have n(b_n − b_{n−1}) = x_n, so
s_n/n = (Σ_{i=1}^n i(b_i − b_{i−1}))/n = (n b_n − Σ_{i=1}^{n−1} b_i)/n = b_n − (Σ_{i=1}^{n−1} b_i)/n → b − b = 0.
6. Strong law of large numbers.
This section is devoted to a proof of Kolmogorov’s strong law of large numbers. We showed earlier
that if E Xi2 < ∞, where the Xi are i.i.d., then the weak law of large numbers (WLLN) holds: Sn /n converges
to E X1 in probability. The WLLN can be improved greatly; it is enough that xP(|X1 | > x) → 0 as x → ∞.
Here we show the strong law (SLLN): if one has a finite first moment, then there is almost sure convergence.
First we need a lemma.
Lemma 6.1. Suppose V_i is a sequence of independent r.v.s, each with mean 0. Let W_n = Σ_{i=1}^n V_i. If Σ_{i=1}^∞ Var V_i < ∞, then W_n converges almost surely.
Proof. Choose n_j > n_{j−1} such that Σ_{i=n_j}^∞ Var V_i < 2^{−3j}. If n > n_j, then applying Kolmogorov's inequality shows that
P(max_{n_j ≤ i ≤ n} |W_i − W_{n_j}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.
Letting n → ∞ and applying Borel–Cantelli, we see that almost surely, for all j sufficiently large, sup_{i ≥ n_j} |W_i − W_{n_j}| ≤ 2^{−j}. Hence W_n is a.s. a Cauchy sequence, and so converges.
Theorem 6.2. (SLLN) Let Xi be a sequence of i.i.d. random variables. Then Sn /n converges almost surely
if and only if E |X1 | < ∞.
Proof. Let us first suppose S_n/n converges a.s. and show E|X_1| < ∞. If S_n(ω)/n → a, then
S_{n−1}/n = (S_{n−1}/(n−1)) · ((n−1)/n) → a.
So
X_n/n = S_n/n − S_{n−1}/n → a − a = 0.
Hence X_n/n → 0, a.s. Thus P(|X_n| > n i.o.) = 0. Since the events (|X_n| > n) are independent, the second part of Borel–Cantelli gives Σ_n P(|X_n| > n) < ∞. Since the X_i are i.i.d., this means Σ_{n=1}^∞ P(|X_1| > n) < ∞, and by Proposition 5.2, E|X_1| < ∞.
Now suppose E|X_1| < ∞. By looking at X_i − E X_i, we may suppose without loss of generality that E X_i = 0. We truncate, and let Y_i = X_i 1_{(|X_i|≤i)}. It suffices to show (Σ_{i=1}^n Y_i)/n → 0 a.s., by Proposition 5.4.
Next we estimate. We have
E Y_n = E[X_n; |X_n| ≤ n] = E[X_1; |X_1| ≤ n] → E X_1 = 0.
The convergence follows by the dominated convergence theorem, since the integrands are bounded by |X_1|.
To estimate the second moment of the Y_i, we write
E Y_i² = ∫_0^∞ 2y P(|Y_i| ≥ y) dy = ∫_0^i 2y P(|Y_i| ≥ y) dy ≤ ∫_0^i 2y P(|X_1| ≥ y) dy,
and so
Σ_{i=1}^∞ E(Y_i²/i²) ≤ Σ_{i=1}^∞ (1/i²) ∫_0^i 2y P(|X_1| ≥ y) dy
= 2 Σ_{i=1}^∞ (1/i²) ∫_0^∞ 1_{(y≤i)} y P(|X_1| ≥ y) dy
= 2 ∫_0^∞ (Σ_{i=1}^∞ (1/i²) 1_{(y≤i)}) y P(|X_1| ≥ y) dy
≤ 4 ∫_0^∞ (1/y) y P(|X_1| ≥ y) dy
= 4 ∫_0^∞ P(|X_1| ≥ y) dy = 4 E|X_1| < ∞.
(Here we used Σ_{i≥y} 1/i² ≤ 2/y.)
Let U_i = Y_i − E Y_i. Since Var U_i ≤ E Y_i², we conclude
Σ_{i=1}^∞ Var(U_i/i) < ∞.
By Lemma 6.1 (with V_i = U_i/i), Σ_{i=1}^n (U_i/i) converges almost surely as n → ∞. By Kronecker's lemma, (Σ_{i=1}^n U_i)/n converges to 0 almost surely. Finally, since E Y_i → 0, then (Σ_{i=1}^n E Y_i)/n → 0, hence (Σ_{i=1}^n Y_i)/n → 0.
7. Uniform integrability. Before proceeding to some extensions of the SLLN, we discuss uniform integrability. A sequence of r.v.s X_i is uniformly integrable if
sup_i ∫_{(|X_i|>M)} |X_i| dP → 0
as M → ∞.
Proposition 7.1. Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as
x → ∞, and supi E ϕ(|Xi |) < ∞. Then the Xi are uniformly integrable.
Proof. Let ε > 0 and choose M such that ϕ(x)/x ≥ 1/ε whenever x > M. Then for each i,
∫_{(|X_i|>M)} |X_i| dP = ∫_{(|X_i|>M)} (|X_i|/ϕ(|X_i|)) ϕ(|X_i|) dP ≤ ε E ϕ(|X_i|) ≤ ε sup_i E ϕ(|X_i|).
Proposition 7.2. If Xn and Yn are two uniformly integrable sequences, then Xn + Yn is also a uniformly
integrable sequence.
Proof. Since there exists M0 such that supn E [|Xn |; |Xn | > M0 ] < 1 and supn E [|Yn |; |Yn | > M0 ] < 1,
then supn E |Xn | ≤ M0 + 1, and similarly for the Yn . Let ε > 0 and choose M1 > 4(M0 + 1)/ε such that
supn E [|Xn |; |Xn | > M1 ] < ε/4 and supn E [|Yn |; |Yn | > M1 ] < ε/4. Let M2 = 4M12 .
Note P(|Xn | + |Yn | > M2 ) ≤ (E |Xn | + E |Yn |)/M2 ≤ ε/(4M1 ) by Chebyshev’s inequality. Then
E [|Xn + Yn |; |Xn + Yn | > M2 ] ≤ E [|Xn |; |Xn | > M1 ]
+ E [|Xn |; |Xn | ≤ M1 , |Xn + Yn | > M2 ]
+ E [|Yn |; |Yn | > M1 ]
+ E [|Yn |; |Yn | ≤ M1 , |Xn + Yn | > M2 ].
The first and third terms on the right are each less than ε/4 by our choice of M_1. The second and fourth terms are each less than M_1 P(|X_n + Y_n| > M_2) ≤ M_1 · ε/(4M_1) = ε/4, so the sum of all four terms is at most ε.
Proposition 7.3. Suppose X_n → X a.s. and the X_n are uniformly integrable. Then E X_n → E X.
Proof. By the above proposition, X_n − X is uniformly integrable and tends to 0 a.s., so without loss of generality, we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|X_n|; |X_n| > M] < ε. Then
E|X_n| ≤ ε + E[|X_n| ∧ M] → ε
by dominated convergence. Since ε is arbitrary, E|X_n| → 0.
Theorem 8.1. Suppose the X_i are i.i.d. with E|X_1| < ∞. Then S_n/n → E X_1 in L¹.
Proof. Without loss of generality we may assume E X_1 = 0. By the SLLN, S_n/n → 0 a.s. So we need to show that the sequence S_n/n is uniformly integrable.
Let ε > 0. Pick M_1 such that E[|X_1|; |X_1| > M_1] < ε/2. Pick M_2 = M_1 E|X_1|/ε.
We now consider the “three series criterion.” We prove the “if” portion here and defer the “only if”
to Section 20.
Theorem 8.2. (Three series criterion) Let X_i be a sequence of independent random variables, A > 0, and Y_i = X_i 1_{(|X_i|≤A)}. Then Σ_i X_i converges a.s. if and only if all of the following three series converge: (a) Σ_n P(|X_n| > A); (b) Σ_i E Y_i; (c) Σ_i Var Y_i.
Proof of "if" part. Since (c) holds, Σ_i (Y_i − E Y_i) converges a.s. by Lemma 6.1. Since (b) holds, taking the difference shows Σ_i Y_i converges a.s. Since (a) holds, Σ_i P(X_i ≠ Y_i) = Σ_i P(|X_i| > A) < ∞, so by Borel–Cantelli, P(X_i ≠ Y_i i.o.) = 0. It follows that Σ_i X_i converges a.s.
9. Conditional expectation.
If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional
expectation of X given F, written E [X | F] and read as “the expectation (or expected value) of X given
F,” is any F measurable random variable Y such that E [Y ; A] = E [X; A] for every A ∈ F. The conditional
probability of A ∈ G given F is defined by P(A | F) = E [1A | F].
If Y1 , Y2 are two F measurable random variables with E [Y1 ; A] = E [Y2 ; A] for all A ∈ F, then Y1 = Y2 ,
a.s., or conditional expectation is unique up to a.s. equivalence.
In the case X is already F measurable, E [X | F] = X. If X is independent of F, E [X | F] = E X.
Both of these facts follow immediately from the definition. For another example, which ties this definition
with the one used in elementary probability courses, if {Ai } is a finite collection of disjoint sets whose union
is Ω, P(Ai ) > 0 for all i, and F is the σ-field generated by the Ai s, then
P(A | F) = Σ_i 1_{A_i} P(A ∩ A_i)/P(A_i).
This follows since the right-hand side is F measurable and its expectation over any set Ai is P(A ∩ Ai ).
As an example, suppose we toss a fair coin independently 5 times and let Xi be 1 or 0 depending
whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let Fi =
σ(X1 , . . . , Xi ). Then P(A) = 1/32 while P(A | F1 ) is equal to 1/16 on the event (X1 = 1) and 0 on the event
(X1 = 0). P(A | F2 ) is equal to 1/8 on the event (X1 = 1, X2 = 1) and 0 otherwise.
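The coin example can be checked exhaustively over the 32 outcomes; on each atom of F_i the conditional probability is P(A ∩ atom)/P(atom), as in the finite-partition formula above (the helper name is ours):

```python
from itertools import product

# All 32 outcomes of five independent fair coin tosses (1 = heads),
# each with probability 1/32; A is the event of five heads.
outcomes = list(product([0, 1], repeat=5))

def p_A_given(i, prefix):
    # P(A | F_i) on the atom of F_i where (X_1, ..., X_i) = prefix:
    # equals P(A and atom) / P(atom); counts suffice since outcomes
    # are equally likely.
    atom = [w for w in outcomes if w[:i] == prefix]
    return sum(1 for w in atom if all(w)) / len(atom)
```

For example, `p_A_given(1, (1,))` recovers the value 1/16 stated in the text.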
We have
E[E[X | F]] = E X, (9.1)
which follows by taking A = Ω in the definition of conditional expectation.
It is easy to check that limit theorems such as monotone convergence and dominated convergence
have conditional expectation versions, as do inequalities like Jensen’s and Chebyshev’s inequalities. Thus,
for example, we have the following.
Proposition 9.2. (Jensen’s inequality for conditional expectations) If g is convex and X and g(X) are
integrable,
E [g(X) | F] ≥ g(E [X | F]), a.s.
Proposition 9.3. If X and XY are integrable and Y is measurable with respect to F, then
E [XY | F] = Y E [X | F]. (9.2)
Proposition 9.4. If E ⊆ F are σ-fields, then
E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].
Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left equality, let A ∈ E. Then since A is also in F,
E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].
Since both sides are E measurable, the equality follows.
The existence of E[X | F] is a consequence of the Radon–Nikodym theorem.
Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for
A ∈ F. This is trivially absolutely continuous with respect to P|F , the restriction of P to F. Let E [X | F]
be the Radon-Nikodym derivative of Q with respect to P|F . The Radon-Nikodym derivative is F measurable
by construction and so provides the desired random variable.
When F = σ(Y ), one usually writes E [X | Y ] for E [X | F]. Notation that is commonly used (however,
we will use it only very occasionally and only for heuristic purposes) is E [X | Y = y]. The definition is as
follows. If A ∈ σ(Y ), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y ), or 1A = 1B (Y ). By
linearity and taking limits, if Z is σ(Y ) measurable, Z = f (Y ) for some Borel measurable function f . Set
Z = E [X | Y ] and choose f Borel measurable so that Z = f (Y ). Then E [X | Y = y] is defined to be f (y).
If X ∈ L2 and M = {Y ∈ L2 : Y is F-measurable}, one can show that E [X | F] is equal to the
projection of X onto the subspace M. We will not use this in these notes.
10. Stopping times.
Suppose we have an increasing sequence of σ-fields F_n. A random variable N taking values in {0, 1, 2, . . .} ∪ {∞} is a stopping time if (N ≤ n) ∈ F_n for all n.
Proposition 10.1.
(a) Fixed times n are stopping times.
(b) If N1 and N2 are stopping times, then so are N1 ∧ N2 and N1 ∨ N2 .
(c) If Nn is a nondecreasing sequence of stopping times, then so is N = supn Nn .
(d) If Nn is a nonincreasing sequence of stopping times, then so is N = inf n Nn .
(e) If N is a stopping time, then so is N + n.
We define FN = {A : A ∩ (N ≤ n) ∈ Fn for all n}.
11. Martingales.
In this section we consider martingales. Let Fn be an increasing sequence of σ-fields. A sequence of
random variables Mn is adapted to Fn if for each n, Mn is Fn measurable.
M_n is a martingale if M_n is adapted to F_n, M_n is integrable for all n, and
E[M_n | F_{n−1}] = M_{n−1}, a.s., for each n.
If we have E[M_n | F_{n−1}] ≥ M_{n−1} a.s. for every n, then M_n is a submartingale. If we have E[M_n | F_{n−1}] ≤ M_{n−1}, we have a supermartingale. Submartingales have a tendency to increase.
Let us take a moment to look at some examples. If X_i is a sequence of mean zero i.i.d. random variables and S_n is the partial sum process, then M_n = S_n is a martingale, since E[M_n | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1} | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1}] = M_{n−1}, using independence. If the X_i's have variance one and M_n = S_n² − n, then
E[M_n | F_{n−1}] = S_{n−1}² + 2E[S_{n−1}(S_n − S_{n−1}) | F_{n−1}] + E[(S_n − S_{n−1})² | F_{n−1}] − n = S_{n−1}² − (n − 1) = M_{n−1},
again using independence, so M_n is a martingale.
12. Optional stopping.
Theorem 12.1. Suppose M_n is a martingale and N is a stopping time bounded by K. Then E M_N = E M_0.
Proof. We write E M_N = Σ_{k=0}^K E[M_N; N = k] = Σ_{k=0}^K E[M_k; N = k]. Note (N = k) is F_j measurable if j ≥ k, so
E[M_k; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = · · · = E[M_K; N = k].
Hence
E M_N = Σ_{k=0}^K E[M_K; N = k] = E M_K = E M_0.
The assumption that N be bounded cannot be entirely dispensed with. For example, let M_n be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : M_i = 1}, we will see later on that N < ∞ a.s., but E M_N = 1 ≠ 0 = E M_0.
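Both points can be seen in a simulation (our construction): stopping the ±1 walk at the bounded time N ∧ K keeps the expectation at E M_0 = 0, even though the walk usually ends at the value 1:

```python
import random

random.seed(2)
K = 100  # deterministic bound on the stopping time

def stopped_value():
    # Simple random walk M_n; stop at N = min{i : M_i = 1}, truncated at K.
    m = 0
    for _ in range(K):
        m += random.choice((-1, 1))
        if m == 1:
            return m
    return m

# Optional stopping for the bounded time N ∧ K gives E M_{N ∧ K} = 0:
# the frequent outcome +1 is balanced by rare, very negative values of M_K.
est = sum(stopped_value() for _ in range(20000)) / 20000
```

As K → ∞ the truncated values at time K disappear, and the expectation jumps to 1, which is exactly the failure of unbounded optional stopping described above.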
The same proof as that in Theorem 12.1 gives the following corollary.
Proposition 12.4. If N1 ≤ N2 are stopping times bounded by K and M is a martingale, then E [MN2 |
FN1 ] = MN1 , a.s.
Proof. Suppose A ∈ F_{N_1}. We need to show E[M_{N_1}; A] = E[M_{N_2}; A]. Define a new stopping time N_3 by
N_3(ω) = N_1(ω) if ω ∈ A, and N_3(ω) = N_2(ω) if ω ∉ A.
It is easy to check that N_3 is a stopping time. By Theorem 12.1, E M_0 = E M_{N_3} = E[M_{N_1}; A] + E[M_{N_2}; A^c], while also E M_0 = E M_{N_2} = E[M_{N_2}; A] + E[M_{N_2}; A^c]. Comparing, E[M_{N_1}; A] = E[M_{N_2}; A].
Proposition 12.5. (Doob decomposition) Suppose X_k is a submartingale. Then X_k can be written X_k = M_k + A_k, where M_k is a martingale and the A_k are nondecreasing with A_k measurable with respect to F_{k−1}.
Proof. Let a_k = E[X_k | F_{k−1}] − X_{k−1} for k = 1, 2, . . . Since X_k is a submartingale, then each a_k ≥ 0. Then let A_k = Σ_{i=1}^k a_i. The fact that the A_k are nondecreasing and measurable with respect to F_{k−1} is clear. Set M_k = X_k − A_k. Then
E[M_{k+1} − M_k | F_k] = E[X_{k+1} − X_k | F_k] − a_{k+1} = 0,
or M_k is a martingale.
Corollary 12.6. Suppose X_k is a submartingale, and N_1 ≤ N_2 are bounded stopping times. Then E X_{N_1} ≤ E X_{N_2}.

13. Doob's inequalities.
Let M_n* = max_{1≤j≤n} |M_j|.
Theorem 13.1. If M_n is a martingale, then for a > 0,
P(M_n* ≥ a) ≤ E[|M_n|; M_n* ≥ a]/a.
Proof. Set M_{n+1} = M_n. Let N = min{j : |M_j| ≥ a} ∧ (n + 1). Since | · | is convex, |M_n| is a submartingale. If A = (M_n* ≥ a), then A ∈ F_N and by Corollary 12.3,
a P(A) ≤ E[|M_N|; A] ≤ E[|M_{n+1}|; A] = E[|M_n|; M_n* ≥ a],
since |M_N| ≥ a on A.
Theorem 13.2. (Doob's inequality) If M_n is a martingale and p > 1, then
E(M_n*)^p ≤ (p/(p−1))^p E|M_n|^p.
Proof. Note M_n* ≤ Σ_{i=1}^n |M_i|, hence M_n* ∈ L^p. We write
E(M_n*)^p = ∫_0^∞ p a^{p−1} P(M_n* > a) da ≤ ∫_0^∞ p a^{p−1} E[|M_n| 1_{(M_n* ≥ a)}/a] da
= E ∫_0^{M_n*} p a^{p−2} |M_n| da = (p/(p−1)) E[(M_n*)^{p−1} |M_n|]
≤ (p/(p−1)) (E(M_n*)^p)^{(p−1)/p} (E|M_n|^p)^{1/p}.
The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E(M_n*)^p)^{(p−1)/p} and take pth powers.
Given a process X_k and a < b, let S_1 = min{k : X_k ≤ a}, T_1 = min{k > S_1 : X_k ≥ b}, and
S_{i+1} = min{k > T_i : X_k ≤ a}, T_{i+1} = min{k > S_{i+1} : X_k ≥ b}.
The number of upcrossings U_n of [a, b] before time n is U_n = max{j : T_j ≤ n}.
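Counting upcrossings is a simple scan over the path; this helper (ours) mirrors the S_i, T_i definitions and can be used to experiment with the upcrossing lemma on simulated submartingales:

```python
def upcrossings(xs, a, b):
    # Count completed upcrossings of [a, b]: each time the path, having
    # reached level <= a (an S_i), subsequently reaches level >= b (a T_i).
    count = 0
    looking_for_low = True
    for x in xs:
        if looking_for_low and x <= a:
            looking_for_low = False     # found S_i
        elif not looking_for_low and x >= b:
            count += 1                  # found T_i: one upcrossing done
            looking_for_low = True
    return count

path = [0, -1, 2, 1, -2, 3, 0, 2]  # an arbitrary example path
```

On this path the interval [0, 2] is upcrossed three times, while [−3, 5] is never even entered from below.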
Theorem 14.1. (Upcrossing lemma) If X_k is a submartingale, then
(b − a) E U_n ≤ E(X_n − a)^+.
Proof. The number of upcrossings of [a, b] by X_k is the same as the number of upcrossings of [0, b − a] by Y_k = (X_k − a)^+. Moreover Y_k is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b − a] by the process Y_k, we will have the desired inequality for upcrossings of X.
So we may assume a = 0. Fix n and define Y_{n+1} = Y_n. This will still be a submartingale. Define the S_i, T_i as above, and let S_i' = S_i ∧ (n + 1), T_i' = T_i ∧ (n + 1). Since T_{i+1} > S_{i+1} > T_i, then T'_{n+1} = n + 1.
We write
E Y_{n+1} = E Y_{S_1'} + Σ_{i=1}^{n+1} E[Y_{T_i'} − Y_{S_i'}] + Σ_{i=1}^{n} E[Y_{S_{i+1}'} − Y_{T_i'}].
All the summands in the third term on the right are nonnegative since Y_k is a submartingale (Corollary 12.6). For the jth upcrossing, Y_{T_j'} − Y_{S_j'} ≥ b − a, while Y_{T_j'} − Y_{S_j'} is always greater than or equal to 0. So
Σ_{i=1}^{n+1} (Y_{T_i'} − Y_{S_i'}) ≥ (b − a) U_n.
So, since Y ≥ 0,
(b − a) E U_n ≤ E Y_{n+1} = E Y_n,
which is the desired inequality.
Theorem 14.2. If X_n is a submartingale such that sup_n E X_n^+ < ∞, then X_n converges a.s. as n → ∞.
Proof. For each pair of rationals a < b, let U(a, b) be the total number of upcrossings of [a, b]. By the upcrossing lemma and monotone convergence, (b − a) E U(a, b) ≤ sup_n E(X_n − a)^+ < ∞. So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence X_n(ω) cannot have lim sup X_n > lim inf X_n. Therefore X_n converges a.s., although we still have to rule out the possibility of the limit being infinite. Since X_n is a submartingale, E X_n ≥ E X_0, and thus
E|X_n| = 2E X_n^+ − E X_n ≤ 2E X_n^+ − E X_0,
so sup_n E|X_n| < ∞. By Fatou's lemma, E lim_n |X_n| ≤ sup_n E|X_n| < ∞, or X_n converges a.s. to a finite limit.
Corollary 14.3. If X_n is a nonnegative supermartingale, or a martingale bounded above or below, then X_n converges a.s.
Proof. If X_n is a nonnegative supermartingale, then −X_n is a submartingale with E(−X_n)^+ = 0, and Theorem 14.2 applies.
If X_n is a martingale bounded above, by considering −X_n, we may assume X_n is bounded below. Looking at X_n + M for fixed M will not affect the convergence, so we may assume X_n is bounded below by 0. Now apply the first assertion of the corollary.
Proposition 14.4. If Xn is a martingale with supn E |Xn |p < ∞ for some p > 1, then the convergence is in
Lp as well as a.s. This is also true when Xn is a submartingale. If Xn is a uniformly integrable martingale,
then the convergence is in L1 . If Xn → X∞ in L1 , then Xn = E [X∞ | Fn ].
X_n is a uniformly integrable martingale if the collection of random variables X_n is uniformly integrable.
Proof. The L^p convergence assertion follows by using Doob's inequality (Theorem 13.2) and dominated convergence. The L^1 convergence assertion follows since a.s. convergence together with uniform integrability implies L^1 convergence. Finally, if j < n, we have X_j = E[X_n | F_j]. If A ∈ F_j, then E[X_j; A] = E[X_n; A]; since X_n → X_∞ in L^1, letting n → ∞ gives E[X_j; A] = E[X_∞; A], or X_j = E[X_∞ | F_j].
Proposition 15.1. Suppose the Yi are i.i.d. with E |Y1 | < ∞, N is a stopping time with E N < ∞, and N
is independent of the Yi . Then E SN = (E N )(EY1 ), where the Sn are the partial sums of the Yi .
Proof. S_n − n(E Y_1) is a martingale, so E S_{n∧N} = E(n ∧ N) E Y_1 by optional stopping. The right hand side tends to (E N)(E Y_1) by monotone convergence. S_{n∧N} converges almost surely to S_N, and we need to show the expected values converge.
Note
|S_{n∧N}| = Σ_{k=0}^∞ |S_{n∧k}| 1_{(N=k)} ≤ Σ_{k=0}^∞ Σ_{j=0}^{n∧k} |Y_j| 1_{(N=k)} = Σ_{j=0}^n Σ_{k≥j} |Y_j| 1_{(N=k)} = Σ_{j=0}^n |Y_j| 1_{(N≥j)} ≤ Σ_{j=0}^∞ |Y_j| 1_{(N≥j)}.
By the independence of N and the Y_j, the expectation of the right-hand side is
Σ_{j=0}^∞ (E|Y_j|) P(N ≥ j) ≤ (E|Y_1|)(1 + E N) < ∞.
So by dominated convergence, E S_{n∧N} → E S_N.
Proposition 15.2. Suppose the Y_i are i.i.d. with P(Y_1 = 1) = P(Y_1 = −1) = 1/2, and S_n the partial sum process. Suppose a and b are positive integers, and let N = min{n : S_n ∈ {−a, b}}. Then
P(S_n hits −a before b) = b/(a + b),
and E N = ab.
Proof. Since S_n is a martingale and N satisfies the conditions for optional stopping, E S_N = E S_0 = 0, or
−a P(S_N = −a) + b P(S_N = b) = 0.
We also have
P(S_N = −a) + P(S_N = b) = 1.
Solving these two equations for P(S_N = −a) and P(S_N = b) yields our first result. Since S_n² − n is a martingale, E N = E S_N² = a² P(S_N = −a) + b² P(S_N = b); substituting gives the second result.
Based on this proposition, if we let a → ∞, we see that P(Nb < ∞) = 1 and E Nb = ∞, where
Nb = min{n : Sn = b}.
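Proposition 15.2 is easy to check by simulation; the parameters a = 3, b = 5 are our arbitrary choice, giving the predicted values b/(a+b) = 5/8 and E N = ab = 15:

```python
import random

random.seed(3)

def ruin_trial(a, b):
    # Run the +-1 walk until it hits -a or b; report which and how long.
    s, steps = 0, 0
    while -a < s < b:
        s += random.choice((-1, 1))
        steps += 1
    return s == -a, steps

a, b = 3, 5
results = [ruin_trial(a, b) for _ in range(4000)]
p_hit_a = sum(hit for hit, _ in results) / len(results)      # theory: b/(a+b)
mean_steps = sum(n for _, n in results) / len(results)       # theory: a*b
```

The two estimates track the martingale computations above; the second uses the martingale S_n² − n, the first only S_n itself.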
Proposition 15.3. Suppose A_n ∈ F_n. Then (A_n i.o.) and (Σ_{n=1}^∞ P(A_n | F_{n−1}) = ∞) differ by a null set.
Proof. Let X_n = Σ_{m=1}^n [1_{A_m} − P(A_m | F_{m−1})]. Note |X_n − X_{n−1}| ≤ 1. Also, it is easy to see that E[X_n − X_{n−1} | F_{n−1}] = 0, so X_n is a martingale.
We claim that for almost every ω either lim Xn exists and is finite, or else lim sup Xn = ∞ and
lim inf Xn = −∞. In fact, if N = min{n : Xn ≤ −k}, then Xn∧N ≥ −k − 1, so Xn∧N converges by the
martingale convergence theorem. Therefore lim Xn exists and is finite on (N = ∞). So if lim Xn does not
exist or is not finite, then N < ∞. This is true for all k, hence lim inf Xn = −∞. A similar argument shows
lim sup Xn = ∞ in this case.
Now if lim X_n exists and is finite, then Σ_{n=1}^∞ 1_{A_n} = ∞ if and only if Σ_n P(A_n | F_{n−1}) = ∞. On the other hand, if the limit does not exist or is not finite, then Σ_n 1_{A_n} = ∞ and Σ_n P(A_n | F_{n−1}) = ∞.
where g is a bounded continuous function. The difference between E(1_{[−M,M]} g)(X) and E g(X) is bounded by ‖g‖_∞ P(X ∉ [−M, M]) ≤ 2ε‖g‖_∞. Similarly, when X is replaced by X_n, the difference is bounded by ‖g‖_∞ P(X_n ∉ [−M, M]) → ‖g‖_∞ P(X ∉ [−M, M]). So for n large, it is less than 3ε‖g‖_∞. Since ε is arbitrary, E g(X_n) → E g(X) whenever g is bounded and continuous.
Let us examine the relationship between weak convergence and convergence in probability. The example of Sn/√n shows that one can have weak convergence without convergence in probability.
Proposition 16.2. (a) If Xn converges to X in probability, then it converges weakly.
(b) If Xn converges weakly to a constant, it converges in probability.
(c) (Slutsky’s theorem) If Xn converges weakly to X and Yn converges weakly to a constant c, then
Xn + Yn converges weakly to X + c and Xn Yn converges weakly to cX.
Proof. To prove (a), let g be a bounded and continuous function. If nj is any subsequence, then there
exists a further subsequence such that X(njk ) converges almost surely to X. Then by dominated convergence,
E g(X(njk )) → E g(X). That suffices to show E g(Xn ) converges to E g(X).
For (b), if Xn converges weakly to c,
P(Xn − c > ε) = P(Xn > c + ε) = 1 − P(Xn ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.
We use the fact that if Y ≡ c, then c + ε is a point of continuity for FY . A similar equation shows
P(Xn − c ≤ −ε) → 0, so P(|Xn − c| > ε) → 0.
We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that
x − c is a continuity point of FX . Choose ε so that x − c + ε is again a continuity point. Then
P(Xn + Yn ≤ x) ≤ P(Xn + c ≤ x + ε) + P(|Yn − c| > ε) → P(X ≤ x − c + ε).
So lim sup P(Xn + Yn ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity
point of FX , then lim sup P(Xn + Yn ≤ x) ≤ P(X + c ≤ x). The lim inf is done similarly.
We say a sequence of distribution functions {Fn} is tight if for each ε > 0 there exists M such that Fn(M) ≥ 1 − ε and Fn(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution functions are tight; this is equivalent to the existence, for each ε > 0, of an M such that P(|Xn| ≥ M) ≤ ε for all n.
Theorem 16.3. (Helly’s theorem) Let Fn be a sequence of distribution functions that is tight. There exists
a subsequence nj and a distribution function F such that Fnj converges weakly to F .
What could happen is that Xn = n, so that FXn → 0; the tightness precludes this.
Proof. Let qk be an enumeration of the rationals. Since Fn (qk ) ∈ [0, 1], any subsequence has a further
subsequence that converges. Use diagonalization so that Fnj (qk ) converges for each qk and call the limit
F(qk). F is nondecreasing on the rationals; define F(x) = inf_{qk > x} F(qk). Then F is right continuous and nondecreasing.
If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and
F (s) − ε < F (x) < F (r) + ε. Then
Fnj (x) ≥ Fnj (r) → F (r) > F (x) − ε
and
Fnj (x) ≤ Fnj (s) → F (s) < F (x) + ε.
Since ε is arbitrary, Fnj (x) → F (x).
Since the Fn are tight, there exists M such that Fn (−M ) < ε. Then F (−M ) ≤ ε, which implies
limx→−∞ F (x) = 0. Showing limx→∞ F (x) = 1 is similar. Therefore F is in fact a distribution function.
Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = sup_n E ϕ(|Xn|) < ∞, then the Xn are tight.
Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then
P(|Xn| > M) ≤ ∫ 1_{(|Xn|>M)} [ϕ(|Xn|)/(c/ε)] dP ≤ (ε/c) E ϕ(|Xn|) ≤ ε.
function. Also, if the law of X has a density, that is, PX(dx) = fX(x) dx, then ϕX(t) = ∫ e^{itx} fX(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of fX.
Proposition 17.1. ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) = ϕ(t), and ϕ is uniformly continuous.
Proof. Since |eitx | ≤ 1, everything follows immediately from the definitions except the uniform continuity.
For that we write
|ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E |e^{ihX} − 1|.
|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.
Proposition 17.2. ϕ_{aX}(t) = ϕX(at) and ϕ_{X+b}(t) = e^{itb} ϕX(t).
Proof. The first follows from E e^{it(aX)} = E e^{i(at)X}, and the second is similar.
Proposition 17.3. If X and Y are independent, then ϕ_{X+Y}(t) = ϕX(t)ϕY(t).
Proof. By independence, E e^{it(X+Y)} = E e^{itX} E e^{itY} = ϕX(t)ϕY(t). In particular, if X1 and X2 are independent with the same law, then
ϕ_{X1−X2}(t) = ϕX1(t)ϕ_{−X2}(t) = ϕX1(t)ϕX2(−t) = |ϕX1(t)|²,
since ϕX2(−t) is the complex conjugate of ϕX2(t) by Proposition 17.1.
(e) Binomial: Write X as the sum of n independent Bernoulli r.v.s Bi. So
ϕX(t) = Π_{i=1}^n ϕ_{Bi}(t) = [ϕ_{B1}(t)]^n = [1 − p(1 − e^{it})]^n.
(f) Geometric:
ϕ(t) = Σ_{k=0}^∞ p(1 − p)^k e^{itk} = p Σ_{k=0}^∞ ((1 − p)e^{it})^k = p/(1 − (1 − p)e^{it}).
This can be done by completing the square and then doing a contour integration. Alternatively,
ϕ′(t) = (1/√(2π)) ∫_{−∞}^∞ ix e^{itx} e^{−x²/2} dx
(do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (again doing the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ′(t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t²/2}.
(j) Normal with mean µ and variance σ²: Writing X = σZ + µ, where Z is a standard normal, then
ϕX(t) = e^{iµt} ϕZ(σt) = e^{iµt − σ²t²/2}.
(k) Cauchy: We have
ϕ(t) = (1/π) ∫ e^{itx}/(1 + x²) dx.
This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|}.
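The binomial and geometric entries, (e) and (f), can be verified numerically by computing E e^{itX} straight from the probability mass function; the values of t, p, and n below are arbitrary.

```python
import cmath
from math import comb

# Check the closed forms in (e) and (f) against direct computation of
# E e^{itX} from the probability mass function.
t, p, n = 0.7, 0.3, 10

# (e) Binomial(n, p): finite sum over the pmf.
phi_binom = sum(comb(n, k) * p**k * (1 - p)**(n - k) * cmath.exp(1j * t * k)
                for k in range(n + 1))
phi_binom_closed = (1 - p * (1 - cmath.exp(1j * t)))**n

# (f) Geometric: P(X = k) = p(1-p)^k for k = 0, 1, 2, ...; truncate the
# series (the tail beyond k = 500 is negligible here).
phi_geom = sum(p * (1 - p)**k * cmath.exp(1j * t * k) for k in range(500))
phi_geom_closed = p / (1 - (1 - p) * cmath.exp(1j * t))
```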
Proof. If A = 0, this is clear. The case A < 0 reduces to the case A > 0 by the fact that sin is an odd
function. By a change of variables y = Ax, we reduce to the case A = 1. Part (a) is a standard result
in contour integration, and part (b) comes from the fact that the integral can be written as an alternating
series.
An alternate proof of (a) is the following. Since e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞},
∫_0^a (sin x)/x dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy}/(y² + 1)) (−y sin x − cos x) ]_{x=0}^{x=a} dy
= ∫_0^∞ [ (e^{−ay}/(y² + 1)) (−y sin a − cos a) + 1/(y² + 1) ] dy
= π/2 − sin a ∫_0^∞ (y e^{−ay})/(y² + 1) dy − cos a ∫_0^∞ e^{−ay}/(y² + 1) dy.
The last two integrals tend to 0 as a → ∞ since the integrand is bounded by (1 + y)e−y if a ≥ 1.
Theorem 18.2. (Inversion formula) Let µ be a probability measure and let ϕ(t) = ∫ e^{itx} µ(dx). If a < b, then
lim_{T→∞} (1/2π) ∫_{−T}^T [(e^{−ita} − e^{−itb})/(it)] ϕ(t) dt = µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}).
The example where µ is point mass at 0 (so ϕ(t) = 1), with a = −1 and b = 1, shows that one needs to take a limit: the integrand in this case is 2 sin t/t, which is not integrable.
Proof. By Fubini,
∫_{−T}^T [(e^{−ita} − e^{−itb})/(it)] ϕ(t) dt = ∫_{−T}^T ∫ [(e^{−ita} − e^{−itb})/(it)] e^{itx} µ(dx) dt
= ∫ ∫_{−T}^T [(e^{−ita} − e^{−itb})/(it)] e^{itx} dt µ(dx).
To justify this, we bound the integrand using the mean value theorem: |e^{−ita} − e^{−itb}| ≤ |t|(b − a), so the integrand is bounded by b − a.
Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with
2 ∫ [ ∫_0^T sin(t(x − a))/t dt − ∫_0^T sin(t(x − b))/t dt ] µ(dx).
Using Lemma 18.1 and dominated convergence, this tends to
∫ [π sgn(x − a) − π sgn(x − b)] µ(dx).
Theorem 18.3. If ∫ |ϕ(t)| dt < ∞, then µ has a bounded density f and
f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt.
Proof. We have
µ(a, b) + (1/2)µ({a}) + (1/2)µ({b}) = lim_{T→∞} (1/2π) ∫_{−T}^T [(e^{−ita} − e^{−itb})/(it)] ϕ(t) dt
= (1/2π) ∫_{−∞}^∞ [(e^{−ita} − e^{−itb})/(it)] ϕ(t) dt
≤ ((b − a)/2π) ∫ |ϕ(t)| dt.
Letting b → a shows that µ has no point masses.
We now write
µ(x, x + h) = (1/2π) ∫ [(e^{−itx} − e^{−it(x+h)})/(it)] ϕ(t) dt
= (1/2π) ∫ ∫_x^{x+h} e^{−ity} dy ϕ(t) dt
= ∫_x^{x+h} [ (1/2π) ∫ e^{−ity} ϕ(t) dt ] dy.
So µ has density f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.
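Theorem 18.3 can be illustrated numerically: starting from ϕ(t) = e^{−t²/2}, the inversion integral should reproduce the standard normal density. The trapezoidal quadrature below, with ad hoc truncation level T and step count, is only a sketch.

```python
import cmath
import math

# Approximate f(y) = (1/2pi) \int e^{-ity} phi(t) dt by a trapezoid rule
# on [-T, T]; for phi(t) = e^{-t^2/2} the truncation error is negligible.
def inversion_density(phi, y, T=20.0, steps=4000):
    h = 2 * T / steps
    total = 0.0 + 0.0j
    for k in range(steps + 1):
        t = -T + k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * cmath.exp(-1j * t * y) * phi(t)
    return (total * h / (2 * math.pi)).real

f0 = inversion_density(lambda t: math.exp(-t * t / 2), 0.0)
f1 = inversion_density(lambda t: math.exp(-t * t / 2), 1.0)

# Targets: the N(0,1) density at 0 and at 1.
target0 = 1 / math.sqrt(2 * math.pi)
target1 = math.exp(-0.5) / math.sqrt(2 * math.pi)
```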
Proof. Note
(1/2T) ∫_{−T}^T ϕ(t) dt = (1/2T) ∫_{−T}^T ∫ e^{itx} µ(dx) dt
= ∫ (1/2T) ∫ 1_{[−T,T]}(t) e^{itx} dt µ(dx)
= ∫ (sin Tx)/(Tx) µ(dx).
Since |(sin Tx)/(Tx)| ≤ 1 and |sin Tx| ≤ 1, we have |(sin Tx)/(Tx)| ≤ 1/(2TA) if |x| ≥ 2A. So
|∫ (sin Tx)/(Tx) µ(dx)| ≤ µ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/(2TA)) µ(dx)
= µ([−2A, 2A]) + (1/(2TA))(1 − µ([−2A, 2A]))
= 1/(2TA) + (1 − 1/(2TA)) µ([−2A, 2A]).
Setting T = 1/A,
(A/2) |∫_{−1/A}^{1/A} ϕ(t) dt| ≤ 1/2 + (1/2) µ([−2A, 2A]).
Proposition 19.2. If µn converges weakly to µ, then ϕn converges to ϕ uniformly on every finite interval.
Proof. Let ε > 0 and choose M large so that µ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dµn → ∫ f dµ, if n is large enough then
∫ (1 − f) dµn ≤ 2ε.
We have
|ϕn(t + h) − ϕn(t)| ≤ ∫ |e^{ihx} − 1| µn(dx) ≤ 2 ∫ (1 − f) dµn + |h| ∫ |x| f(x) µn(dx) ≤ 2ε + |h|(M + 1).
So for n large enough and |h| ≤ ε/(M + 1), we have |ϕn(t + h) − ϕn(t)| ≤ 3ε, which says that the ϕn are equicontinuous. Therefore the convergence is uniform on finite intervals.
The interesting result of this section is the converse, Lévy’s continuity theorem.
Theorem 19.3. Suppose µn are probabilities, ϕn (t) converges to a function ϕ(t) for each t, and ϕ is
continuous at 0. Then ϕ is the characteristic function of a probability µ and µn converges weakly to µ.
≥ 1 − 2ε.
By Lemma 19.1 with A = 1/δ, for such n,
Proposition 19.4. If E |X|^k < ∞ for an integer k, then ϕX has a continuous derivative of order k and
ϕX^{(k)}(t) = ∫ (ix)^k e^{itx} PX(dx).
Here is a converse.
Proposition 19.5. If ϕ is the characteristic function of a random variable X and ϕ″(0) exists, then E |X|² < ∞.
Proof. Note
(e^{ihx} − 2 + e^{−ihx})/h² = −2(1 − cos hx)/h² ≤ 0
and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,
∫ x² PX(dx) ≤ lim inf_{h→0} ∫ 2(1 − cos hx)/h² PX(dx) = − lim sup_{h→0} (ϕ(h) − 2ϕ(0) + ϕ(−h))/h² = −ϕ″(0) < ∞.
One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof
is very similar to the proof of the central limit theorem, which we give in the next section.
Another nice use of characteristic functions and martingales is the following.
Proposition 19.6. Suppose Xi is a sequence of independent r.v.s and Sn converges weakly. Then Sn
converges almost surely.
Proof. Suppose Sn converges weakly to W. Then ϕSn(t) → ϕW(t) uniformly on compact sets by Proposition 19.2. Since ϕW(0) = 1 and ϕW is continuous, there exists δ such that |ϕW(t) − 1| < 1/2 if |t| < δ. So for n large, |ϕSn(t)| ≥ 1/4 if |t| < δ.
Note
E [e^{itSn} | X1, . . . , Xn−1] = e^{itSn−1} E [e^{itXn} | X1, . . . , Xn−1] = e^{itSn−1} ϕXn(t).
Since ϕSn(t) = Π_{i=1}^n ϕXi(t), it follows that e^{itSn}/ϕSn(t) is a martingale.
Therefore for |t| < δ and n large, e^{itSn}/ϕSn(t) is a bounded martingale, and hence converges almost surely. Since ϕSn(t) → ϕW(t) ≠ 0, then e^{itSn} converges almost surely if |t| < δ.
Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itSn(ω)} does not converge}. For each t, we have almost sure convergence, so ∫ 1_A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^δ ∫ 1_A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^δ 1_A dt dP = 0. Hence almost surely, ∫ 1_A(ω, t) dt = 0. This means there exists a set N with P(N) = 0 such that if ω ∉ N, then e^{itSn(ω)} converges for almost every t ∈ (−δ, δ).
If ω ∉ N, by dominated convergence, ∫_0^a e^{itSn(ω)} dt converges, provided a < δ. Call the limit Aa. Also
∫_0^a e^{itSn(ω)} dt = (e^{iaSn(ω)} − 1)/(iSn(ω)).
Proof. Since X1 has finite second moment, ϕX1 has a continuous second derivative. By Taylor's theorem,
ϕX1(t) = ϕX1(0) + ϕ′X1(0)t + ϕ″X1(0)t²/2 + R(t),
where R(t)/t² → 0 as t → 0. Since E X1 = 0 and E X1² = 1, this says ϕX1(t) = 1 − t²/2 + R(t). Then
ϕ_{Sn/√n}(t) = ϕSn(t/√n) = (ϕX1(t/√n))^n = [1 − t²/2n + R(t/√n)]^n.
Since t/√n converges to zero as n → ∞, we have
ϕ_{Sn/√n}(t) → e^{−t²/2}.
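The convergence ϕ_{Sn/√n}(t) → e^{−t²/2} can also be observed empirically. Here the steps are ±1 coin flips, and n, the number of trials, and the evaluation point t are arbitrary choices.

```python
import cmath
import math
import random

random.seed(3)

# Estimate phi_{S_n/sqrt(n)}(t) = E exp(i t S_n / sqrt(n)) by Monte Carlo
# for i.i.d. +-1 steps and compare with e^{-t^2/2}.
n, trials, t = 400, 20_000, 1.0
acc = 0.0 + 0.0j
for _ in range(trials):
    s = sum(1 if random.random() < 0.5 else -1 for _ in range(n))
    acc += cmath.exp(1j * t * s / math.sqrt(n))

phi_hat = acc / trials
target = math.exp(-t * t / 2)   # characteristic function of N(0,1) at t
```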
Let us give another proof of this simple CLT that does not use characteristic functions. For simplicity
let Xi be i.i.d. mean zero variance one random variables with E |Xi |3 < ∞.
Proposition 20.2. With Xi as above, Sn/√n converges weakly to a standard normal.
Proof. Let Y1 , . . . , Yn be i.i.d. standard normal r.v.s that are independent of the Xi . Let Z1 = Y2 + · · · + Yn ,
Z2 = X1 + Y3 + · · · + Yn , Z3 = X1 + X2 + Y4 + · · · + Yn , etc.
Let us suppose g ∈ C³ with compact support and let W be a standard normal. Our first goal is to show
|E g(Sn/√n) − E g(W)| → 0. (20.1)
We have
E g(Sn/√n) − E g(W) = E g(Sn/√n) − E g(Σ_{i=1}^n Yi/√n) = Σ_{i=1}^n [E g((Xi + Zi)/√n) − E g((Yi + Zi)/√n)].
By Taylor's theorem,
g((Xi + Zi)/√n) = g(Zi/√n) + g′(Zi/√n) Xi/√n + (1/2) g″(Zi/√n) Xi²/n + Rn,
where |Rn| ≤ ‖g‴‖∞ |Xi|³/n^{3/2}. Taking expectations and using the independence,
E g((Xi + Zi)/√n) = E g(Zi/√n) + 0 + (1/(2n)) E g″(Zi/√n) + E Rn.
We have a very similar expression for E g((Yi + Zi)/√n). Taking the difference,
|E g((Xi + Zi)/√n) − E g((Yi + Zi)/√n)| ≤ ‖g‴‖∞ (E |Xi|³ + E |Yi|³)/n^{3/2}.
Summing over i bounds |E g(Sn/√n) − E g(W)| by a constant times n^{−1/2}, which proves (20.1) for such g.
However,
|E g(Sn/√n) − E (gψ)(Sn/√n)| ≤ ‖g‖∞ P(|Sn/√n| > M) < ε‖g‖∞,
and similarly
|E g(W) − E (gψ)(W)| < ε‖g‖∞.
Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our
proposition.
Proposition 20.3. Suppose for each n the r.v.s Xni, i = 1, . . . , n, are i.i.d. Bernoullis with parameter pn. If npn → λ and Sn = Σ_{i=1}^n Xni, then Sn converges weakly to a Poisson r.v. with parameter λ.
Proof. We write
ϕSn(t) = [ϕ_{Xn1}(t)]^n = [1 + pn(e^{it} − 1)]^n = [1 + (npn/n)(e^{it} − 1)]^n → e^{λ(e^{it}−1)}.
Now apply the continuity theorem.
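Proposition 20.3 can be checked exactly, with no simulation, by comparing the Binomial(n, λ/n) and Poisson(λ) probability mass functions; λ = 2 and the two values of n are illustrative.

```python
from math import comb, exp, factorial

# Compare Binomial(n, lambda/n) probabilities with Poisson(lambda)
# probabilities; the maximal pointwise gap should shrink as n grows.
lam = 2.0

def binom_pmf(n, k):
    p = lam / n
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return exp(-lam) * lam**k / factorial(k)

err_small = max(abs(binom_pmf(10, k) - poisson_pmf(k)) for k in range(8))
err_large = max(abs(binom_pmf(10_000, k) - poisson_pmf(k)) for k in range(8))
```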
A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.
Theorem 20.4. Suppose for each n, Xni, i = 1, . . . , n, are mean zero independent random variables. Suppose
(a) Σ_{i=1}^n E Xni² → σ² > 0 and
(b) for each ε, Σ_{i=1}^n E [|Xni|²; |Xni| > ε] → 0.
Let Sn = Σ_{i=1}^n Xni. Then Sn converges weakly to a normal r.v. with mean zero and variance σ².
Note nothing is said about independence of the Xni for different n.
Let us look at Theorem 20.1 in light of this theorem. Suppose the Yi are i.i.d. and let Xni = Yi/√n. Then
Σ_{i=1}^n E (Yi/√n)² = E Y1²
and
Σ_{i=1}^n E [|Xni|²; |Xni| > ε] = n E [|Y1|²/n; |Y1| > √n ε] = E [|Y1|²; |Y1| > √n ε],
which tends to 0 by dominated convergence.
then Sn /(Var Sn )1/2 converges weakly to a standard normal. This is known as Lyapounov’s theorem; we
leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.
Proof. Let ϕni be the characteristic function of Xni and let σni² be the variance of Xni. We need to show
Π_{i=1}^n ϕni(t) → e^{−t²σ²/2}. (20.2)
For complex numbers ai, bi of modulus at most one,
|Π_{i=1}^n ai − Π_{i=1}^n bi| ≤ Σ_{i=1}^n |ai − bi|.
Applying this with ai = ϕni(t) and bi = 1 − t²σni²/2, and estimating each |ϕni(t) − (1 − t²σni²/2)| by a Taylor expansion of ϕni, we obtain
|Π_{i=1}^n ϕni(t) − Π_{i=1}^n (1 − t²σni²/2)| ≤ cεt³ Σ_{i=1}^n E [|Xni|²] + ct² Σ_{i=1}^n E [|Xni|²; |Xni| ≥ ε],
and by (a) and (b) the right hand side can be made arbitrarily small.
Since sup_i σni² → 0, log(1 − t²σni²/2) is asymptotically equal to −t²σni²/2, and so
Π_i (1 − t²σni²/2) = exp(Σ_i log(1 − t²σni²/2))
is asymptotically equal to
exp(−t² Σ_i σni²/2) = e^{−t²σ²/2}.
21. Framework for Markov chains.
Suppose S is a set with some topological structure that we will use as our state space. Think of S
as being Rd or the positive integers, for example. A sequence of random variables X0 , X1 , . . ., is a Markov
chain if
P(Xn+1 ∈ A | X0 , . . . , Xn ) = P(Xn+1 ∈ A | Xn ) (21.1)
for all n and all measurable sets A. The definition of a Markov chain has this intuition: to predict the probability that Xn+1 is in any set, we only need to know where we currently are; how we got there gives no additional information.
Let’s make some additional comments. First of all, we previously considered random variables as mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X from Ω to S, where (X ∈ A) is F measurable for all open sets A. This agrees with the definition of r.v. in the case S = R.
Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will con-
fine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, . . . , n},
or countable and discrete, in which case we will usually suppose S is the set of positive integers.
We are going to further restrict our attention to Markov chains where
P(Xn+1 = j | Xn = i) = P(X1 = j | X0 = i) for all n, i, and j,
that is, where the transition probabilities do not depend on n. Such Markov chains are said to have stationary transition probabilities.
Define the initial distribution of a Markov chain with stationary transition probabilities by µ(i) =
P(X0 = i). Define the transition probabilities by p(i, j) = P(Xn+1 = j | Xn = i). Since the transition
probabilities are stationary, p(i, j) does not depend on n.
In this case we can use the definition of conditional probability given in undergraduate classes. If
P(Xn = i) = 0 for all n, that means we never visit i and we could drop the point i from the state space.
Proposition 21.1. Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j).
Then
P(Xn = in , Xn−1 = in−1 , . . . , X1 = i1 , X0 = i0 ) = µ(i0 )p(i0 , i1 ) · · · p(in−1 , in ). (21.2)
Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n;
we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then
P(X3 = i3, X2 = i2, X1 = i1, X0 = i0)
= E [P(X3 = i3 | X0 = i0, X1 = i1, X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= E [P(X3 = i3 | X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= p(i2, i3) P(X2 = i2, X1 = i1, X0 = i0).
The above proposition says that the law of the Markov chain is determined by the µ(i) and p(i, j).
The formula (21.2) also gives a prescription for constructing a Markov chain given the µ(i) and p(i, j).
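That prescription is easy to realize in code: sample X0 from µ, then each Xn+1 from the row p(Xn, ·), and compare the empirical frequency of a fixed path with formula (21.2). The 3-state chain below is an arbitrary illustrative example.

```python
import random

random.seed(11)

# A Markov chain sampler built from an initial distribution mu and a
# transition matrix p, checked against the path probability (21.2).
mu = [0.5, 0.3, 0.2]
p = [[0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2],
     [0.3, 0.3, 0.4]]

def sample(dist):
    u, c = random.random(), 0.0
    for i, w in enumerate(dist):
        c += w
        if u < c:
            return i
    return len(dist) - 1

def run_chain(steps):
    x = [sample(mu)]
    for _ in range(steps):
        x.append(sample(p[x[-1]]))
    return x

path = [0, 1, 1]                       # the event (X0=0, X1=1, X2=1)
exact = mu[0] * p[0][1] * p[1][1]      # formula (21.2) gives 0.12
trials = 200_000
hits = sum(run_chain(2) == path for _ in range(trials))
estimate = hits / trials
```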
Proposition 21.2. Suppose µ(i) is a sequence of nonnegative numbers with Σ_i µ(i) = 1, and for each i the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial distribution and p(i, j) as the transition probabilities.
Proof. Define Ω = S^∞. Let F be the σ-field generated by the cylinder sets {ω : ω0 = i0, . . . , ωn = in} for n ≥ 0 and ij ∈ S. An element ω of Ω is a sequence (i0, i1, . . .). Define Xj(ω) = ij if ω = (i0, i1, . . .). Define P(X0 = i0, . . . , Xn = in) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be extended to a probability on Ω.
The above framework is rather abstract, but it is clear that under P the sequence Xn has initial distribution µ(i); what we need to show is that Xn is a Markov chain with transition probabilities p(i, j). We write
P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in) = P(Xn+1 = in+1, Xn = in, . . . , X0 = i0) / P(Xn = in, . . . , X0 = i0) (21.4)
= µ(i0) · · · p(in−1, in) p(in, in+1) / [µ(i0) · · · p(in−1, in)]
= p(in, in+1)
as desired.
To complete the proof we need to show
P(Xn+1 = in+1, Xn = in)/P(Xn = in) = p(in, in+1),
or
P(Xn+1 = in+1 , Xn = in ) = p(in , in+1 )P(Xn = in ). (21.5)
Now
P(Xn = in) = Σ_{i0,...,in−1} P(Xn = in, Xn−1 = in−1, . . . , X0 = i0) = Σ_{i0,...,in−1} µ(i0) · · · p(in−1, in)
and similarly
P(Xn+1 = in+1, Xn = in) = Σ_{i0,...,in−1} P(Xn+1 = in+1, Xn = in, Xn−1 = in−1, . . . , X0 = i0)
= p(in, in+1) Σ_{i0,...,in−1} µ(i0) · · · p(in−1, in).
Comparing the two expressions proves (21.5).
Note in this construction that the Xn sequence is fixed and does not depend on µ or p. Let p(i, j) be
fixed. The probability we constructed above is often denoted Pµ . If µ is point mass at a point i or x, it is
denoted Pi or Px . So we have one probability space, one sequence Xn , but a whole family of probabilities
Pµ .
Later on we will see that this framework allows one to express the Markov property and strong
Markov property in a convenient way. As part of the preparation for doing this, we define the shift operators
θk : Ω → Ω by
θk (i0 , i1 , . . .) = (ik , ik+1 , . . .).
Then Xj ◦ θk = Xj+k. To see this, if ω = (i0, i1, . . .), then Xj(θk ω) = Xj((ik, ik+1, . . .)) = ij+k = Xj+k(ω).
22. Examples.
Random walk on the integers
We let Yi be an i.i.d. sequence of r.v.s, with p = P(Yi = 1) and 1 − p = P(Yi = −1). Let Xn = X0 + Σ_{i=1}^n Yi. Then the Xn can be viewed as a Markov chain with p(i, i + 1) = p, p(i, i − 1) = 1 − p, and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To check that the random walk is Markov,
P(Xn+1 = in+1 | X0 = i0 , . . . , Xn = in )
= P(Xn+1 − Xn = in+1 − in | X0 = i0 , . . . , Xn = in )
= P(Xn+1 − Xn = in+1 − in ),
using the independence, while
P(Xn+1 = in+1 | Xn = in ) = P(Xn+1 − Xn = in+1 − in | Xn = in )
= P(Xn+1 − Xn = in+1 − in ).
Renewal processes
Let Yi be i.i.d. with P(Yi = k) = ak, where the ak are nonnegative and sum to 1. Let T0 = i0 and Tn = T0 + Σ_{i=1}^n Yi. We think of Yi as the lifetime of the ith light bulb and Tn as the time when the nth light bulb burns out. (We replace a light bulb as soon as it burns out.) Let
Xn = min{Tm − n : Tm ≥ n}.
So Xn is the amount of time after time n until the current light bulb burns out.
If Xn = j and j > 0, then Ti = n + j for some i but Ti does not equal n, n + 1, . . . , n + j − 1 for any
i. So Ti = (n + 1) + (j − 1) for some i and Ti does not equal (n + 1), (n + 1) + 1, . . . , (n + 1) + (j − 2) for
any i. Therefore Xn+1 = j − 1. So p(i, i − 1) = 1 if i ≥ 1.
If Xn = 0, then a light bulb burned out at time n; Xn+1 is 0 if the next light bulb burns out immediately, and is j − 1 if the next light bulb has lifetime j, which has probability aj. So p(0, j) = aj+1. All the other p(i, j)'s are 0.
Branching processes
Consider k particles. At the next time interval, some of them die, and some of them split into several
particles. The probability that a given particle will split into j particles is given by aj , j = 0, 1, . . ., where
the aj are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all
the other particles. If Xn is the number of particles at time n, then Xn is a Markov chain. Let Yi be i.i.d.
random variables with P(Yi = j) = aj . The p(i, j) for Xn are somewhat complicated, and can be defined by
p(i, j) = P(Σ_{m=1}^i Ym = j).
Queues
We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according
to a Poisson process. So the probability that the number of customers arriving in a time interval of length t
is k is given by e−λt (λt)k /k! The G refers to the fact that the length of time it takes to serve a customer is
given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server.
Suppose the length of time to serve one customer has distribution function F with density f . The
probability that k customers arrive during the time it takes to serve one customer is
ak = ∫_0^∞ e^{−λt} ((λt)^k / k!) f(t) dt.
Let the Yi be i.i.d. with P(Yi = k − 1) = ak, so that Yi + 1 is the number of customers arriving during the time it takes to serve one customer (one customer leaves when the service is completed). Let Xn+1 = (Xn + Yn+1)+ be the number of customers waiting. Then Xn is a Markov chain with p(0, 0) = a0 + a1 and p(j, j − 1 + k) = ak if j ≥ 1, k ≥ 0.
Ehrenfest urns
Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the
r balls at random and move it to the other urn. Let Xn be the number of balls in the first urn. Xn is a
Markov chain with p(k, k + 1) = (r − k)/r, p(k, k − 1) = k/r, and p(i, j) = 0 otherwise.
One model for this is to consider two containers of air with a thin tube connecting them. Suppose
a few molecules of a foreign substance are introduced. Then the number of molecules in the first container
is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so infinitely often all the
molecules of the foreign substance will be in the first urn. Yet there is a tendency towards equilibrium, so
on average there will be about the same number of molecules in each container for all large times.
Birth and death processes
Suppose there are i particles, and the probability of a birth is ai , the probability of a death is bi ,
where ai , bi ≥ 0, ai + bi ≤ 1. Setting Xn equal to the number of particles, then Xn is a Markov chain with
p(i, i + 1) = ai , p(i, i − 1) = bi , and p(i, i) = 1 − ai − bi .
The right hand side is to be interpreted as ϕ(Xn), where ϕ(y) = E^y f(X1). The randomness on the right hand side all comes from the Xn. If we write f(Xn+1) = f(X1) ◦ θn and we write Y for f(X1), then the above can be rewritten
E^x [Y ◦ θn | Fn] = E^{Xn} Y.
Theorem 23.1. (Markov property) If Y is bounded and measurable with respect to F∞, then
E^x [Y ◦ θn | Fn] = E^{Xn} [Y], P^x-a.s.
Proof. If we can prove this for Y = f1 (X1 ) · · · fm (Xm ), then taking fj (x) = 1ij (x), we will have it for Y ’s
of the form 1(X1 =i1 ,...,Xm =im ) . By linearity (and the fact that S is countable), we will then have it for Y ’s
of the form 1((X1 ,...,Xm )∈B) . A monotone class argument shows that such Y ’s generate F∞ .
We use induction on m, and first we prove it for m = 1. We need to show
E^x [f1(X1) ◦ θn | Fn] = E^{Xn} f1(X1).
Using linearity and the fact that S is countable, it suffices to show this for f1(y) = 1_{j}(y). Using the definition of θn, we need to show
P^x (Xn+1 = j | Fn) = P^{Xn}(X1 = j),
or equivalently,
P^x (Xn+1 = j; A) = E^x [P^{Xn}(X1 = j); A] (23.2)
when A ∈ Fn. By linearity it suffices to consider A of the form A = (X1 = i1, . . . , Xn = in). The left hand side of (23.2) is then
P^x (Xn+1 = j, X1 = i1, . . . , Xn = in),
while
P( X1 = j) if x = k,
E x [g(X0 ); X0 = k] = E x [g(k); X0 = k] = Pk (X1 = j)Px (X0 = k) =
0 6 k.
if x =
It follows that
p(i, j) = Px (X1 = j | X0 = i) = Pi (X1 = j).
as required.
Suppose the result holds for m and we want to show it holds for m + 1. We have
E^x [f1(Xn+1) · · · fm+1(Xn+m+1) | Fn] = E^x [E^x [f1(Xn+1) · · · fm+1(Xn+m+1) | Fn+m] | Fn] = E^x [f1(Xn+1) · · · fm−1(Xn+m−1) h(Xn+m) | Fn].
Here we used the result for m = 1 and we defined h(y) = fm(y)g(y), where g(y) = E^y [fm+1(X1)]. Using the induction hypothesis, this is equal to
E^{Xn} [f1(X1) · · · fm−1(Xm−1) h(Xm)] = E^{Xn} [f1(X1) · · · fm(Xm) E^{Xm} fm+1(X1)]
= E^{Xn} [f1(X1) · · · fm(Xm) E [fm+1(Xm+1) | Fm]]
= E^{Xn} [f1(X1) · · · fm+1(Xm+1)],
as desired.
Define θN(ω) = (θ_{N(ω)})(ω). The strong Markov property is the same as the Markov property, but where the fixed time n is replaced by a stopping time N.
Theorem 23.2. If Y is bounded and measurable and N is a finite stopping time, then
E^x [Y ◦ θN | FN] = E^{XN} [Y].
Once we have this, we can proceed as in the proof of Theorem 23.1 to obtain our result. To show the above equality, we need to show that if B ∈ FN, then
P^x (XN+1 = j, B, N = k) = P^x (Xk+1 = j, B, N = k)
= E^x [P^x (Xk+1 = j | Fk); B, N = k]
= E^x [P^{Xk}(X1 = j); B, N = k]
= E^x [P^{XN}(X1 = j); B, N = k].
Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let
p^n(i, j) = P(Xn = j | X0 = i).
Proof. We write
P(Xn+m = j, X0 = i) = Σ_k P(Xn+m = j, Xn = k, X0 = i)
= Σ_k P(Xn+m = j | Xn = k, X0 = i) P(Xn = k | X0 = i) P(X0 = i)
= Σ_k P(Xn+m = j | Xn = k) p^n(i, k) P(X0 = i)
= Σ_k p^m(k, j) p^n(i, k) P(X0 = i).
Note the resemblance to matrix multiplication. It is clear that if P is the matrix made up of the p(i, j), then P^n will be the matrix whose (i, j) entry is p^n(i, j).
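This is easy to verify directly on a small example; the 2-state matrix below is illustrative.

```python
# Verify on a 2-state chain that the (i, j) entry of the matrix power P^n
# gives p^n(i, j), and that the Chapman-Kolmogorov sum reproduces it.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P2 = matmul(P, P)     # two-step probabilities p^2(i, j)
P3 = matmul(P2, P)    # three-step probabilities p^3(i, j)

# Chapman-Kolmogorov: p^3(i, j) = sum_k p^2(i, k) p(k, j).
p3_00 = sum(P2[0][k] * P[k][0] for k in range(2))
row_sums = [sum(row) for row in P3]    # each row of P^3 still sums to 1
```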
Proof. The case k = 1 is just the definition, so suppose k > 1. Using the strong Markov property,
P^x(Ty^k < ∞) = P^x(Ty ◦ θ_{Ty^{k−1}} < ∞, Ty^{k−1} < ∞)
= E^x [P^x(Ty ◦ θ_{Ty^{k−1}} < ∞ | F_{Ty^{k−1}}); Ty^{k−1} < ∞]
= E^x [P^{X(Ty^{k−1})}(Ty < ∞); Ty^{k−1} < ∞]
= E^x [P^y(Ty < ∞); Ty^{k−1} < ∞]
= r(y, y) P^x(Ty^{k−1} < ∞).
We used here the fact that at time Ty^{k−1} the Markov chain must be at the point y. Repeating this argument k − 2 times yields the result.
Proof. Note
E^y N(y) = Σ_{k=1}^∞ P^y(N(y) ≥ k) = Σ_{k=1}^∞ P^y(Ty^k < ∞) = Σ_{k=1}^∞ r(y, y)^k.
We used the fact that N(y) is the number of visits to y and that the number of visits being at least k is the same as the time of the k-th visit being finite. Since r(y, y) ≤ 1, the left hand side will be finite if and only if r(y, y) < 1.
Observe that
E^y N(y) = Σ_n P^y(Xn = y) = Σ_n p^n(y, y).
If we consider simple symmetric random walk on the integers, then p^n(0, 0) is 0 if n is odd and equal to (n choose n/2) 2^{−n} if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2 positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using Stirling's approximation, we see that p^n(0, 0) ∼ c/√n for n even, so the series Σ_n p^n(0, 0) diverges, and simple random walk in one dimension is recurrent.
Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but
transient in 3 or more dimensions.
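In the one-dimensional estimate the constant is c = √(2/π), which is easy to confirm numerically, along with the divergence of the partial sums.

```python
from math import comb, sqrt, pi

# For simple symmetric random walk, p^n(0,0) = C(n, n/2) 2^{-n} for even n,
# and Stirling gives p^n(0,0) ~ sqrt(2/(pi n)).  Check the ratio and watch
# the partial sums of p^n(0,0) grow (they diverge like sqrt(n)).
def p_return(n):                     # n even
    return comb(n, n // 2) / 2**n    # exact integer arithmetic, then divide

ratios = [p_return(n) / sqrt(2 / (pi * n)) for n in (10, 100, 1000)]
partial_sum = sum(p_return(n) for n in range(2, 2001, 2))
```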
Proposition 24.3. If x is recurrent and r(x, y) > 0, then y is recurrent and r(y, x) = 1.
Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y1 , . . . , yn−1
such that p(x, y1 )p(y1 , y2 ) · · · p(yn−1 , y) > 0. Since this is the smallest n, none of the yi can equal x. Then
Summing over n,
Σ_n p^{L+n+K}(y, y) ≥ p^L(y, x) p^K(x, y) Σ_n p^n(x, x) = ∞.
Σ_y E^x N(y) = Σ_y Σ_n p^n(x, y) = Σ_n Σ_y p^n(x, y) = Σ_n P^x(Xn ∈ C) = Σ_n 1 = ∞.
Theorem 24.5. Let R = {x : r(x, x) = 1}, the set of recurrent states. Then R = ∪_{i=1}^∞ Ri, where each Ri is closed and irreducible.
Proof. Say x ∼ y if r(x, y) > 0. Since every state is recurrent, x ∼ x, and if x ∼ y, then y ∼ x. If x ∼ y and y ∼ z, then p^n(x, y) > 0 and p^m(y, z) > 0 for some n and m. Then p^{n+m}(x, z) > 0, so x ∼ z. Therefore we have an equivalence relation and we let the Ri be the equivalence classes.
Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For
the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are
transient. In the renewal chain, there are two cases. If {k : ak > 0} is unbounded, all states are recurrent.
If K = max{k : ak > 0}, then {0, 1, . . . , K − 1} are recurrent states and the rest are transient.
P
For the queueing model, let µ = kak , the expected number of people arriving during one customer’s
service time. We may view this as a branching process by letting all the customers arriving during one
person’s service time be considered the progeny of that customer. It turns out that if µ ≤ 1, 0 is recurrent
and all other states are also. If µ > 1 all states are transient.
In matrix notation this is µP = µ, or µ is the left eigenvector corresponding to the eigenvalue 1. In the case of
a stationary distribution, Pµ (X1 = y) = µ(y), which implies that X1 , X2 , . . . all have the same distribution.
We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary
measure.
If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the
case of an asymmetric random walk: p(i, i + 1) = p, p(i, i − 1) = q = 1 − p and p 6= q, setting µ(x) = (p/q)x
also works.
In the Ehrenfest urn model, µ(x) = (r choose x) 2^{−r} works. One way to see this is that µ is the distribution one gets if one flips r coins and puts a coin in the first urn when the coin is heads. A transition corresponds to picking a coin at random and turning it over.
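That this binomial measure satisfies µP = µ for the Ehrenfest transition probabilities is a finite computation; the check below uses an illustrative r.

```python
from math import comb

# Verify mu(x) = C(r, x) 2^{-r} is stationary for the Ehrenfest chain with
# p(k, k+1) = (r-k)/r and p(k, k-1) = k/r.
r = 6
mu = [comb(r, x) * 0.5**r for x in range(r + 1)]

def muP(y):
    # (mu P)(y) = sum_x mu(x) p(x, y); only x = y-1, y+1 contribute.
    total = 0.0
    if y >= 1:
        total += mu[y - 1] * (r - (y - 1)) / r
    if y <= r - 1:
        total += mu[y + 1] * (y + 1) / r
    return total

max_err = max(abs(muP(y) - mu[y]) for y in range(r + 1))
mass = sum(mu)    # should be 1
```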
µ(y) = E^a Σ_{n=0}^{T−1} 1_{(Xn = y)}.
The idea of the proof is that µ(y) is the expected number of visits to y by the sequence X0 , . . . , XT −1
while µP is the expected number of visits to y by X1 , . . . , XT . These should be the same because XT =
X0 = a.
Write pn(a, y) = P^a(Xn = y, T > n). Then
µ(y) = Σ_{n=0}^∞ P^a(Xn = y, T > n) = Σ_{n=0}^∞ pn(a, y)
and
Σ_y µ(y) p(y, z) = Σ_y Σ_{n=0}^∞ pn(a, y) p(y, z).
So
Σ_y µ(y) p(y, z) = Σ_n Σ_y pn(a, y) p(y, z) = Σ_{n=0}^∞ pn+1(a, z) = Σ_{n=0}^∞ pn(a, z) = µ(z)
since p0(a, z) = 0.
Third, we consider the case a = z. Then
Σ_y pn(a, y) p(y, z) = Σ_y P^a(hit y in n steps without first hitting a and then go to z in one step) = P^a(T = n + 1).
We next turn to uniqueness of the stationary distribution. We call the stationary measure constructed
in Proposition 25.1 µa .
Proposition 25.2. If the Markov chain is irreducible and all states are recurrent, then the stationary
measure is unique up to a constant multiple.
Proof. Fix a ∈ S. Let µa be the stationary measure constructed above and let ν be any other stationary
measure.
Since ν = νP,
ν(z) = ν(a) p(a, z) + Σ_{y≠a} ν(y) p(y, z)
= ν(a) p(a, z) + Σ_{y≠a} ν(a) p(a, y) p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x) p(x, y) p(y, z)
Proposition 25.3. If a stationary distribution exists, then µ(y) > 0 implies y is recurrent.
Proposition 25.4. If the Markov chain is irreducible and has stationary distribution µ, then
µ(x) = 1/E^x Tx.
Proof. We have µ(x) > 0 for some x. If y ∈ S, then r(x, y) > 0 and so p^n(x, y) > 0 for some n. Hence
µ(y) = Σ_z µ(z) p^n(z, y) ≥ µ(x) p^n(x, y) > 0.
Hence by Proposition 25.3, all states are recurrent. By the uniqueness of the stationary measure, µx is a constant multiple of µ, i.e., µx = cµ. Recall
µx(y) = Σ_{n=0}^∞ P^x(Xn = y, Tx > n),
and so
Σ_y µx(y) = Σ_y Σ_{n=0}^∞ P^x(Xn = y, Tx > n) = Σ_n Σ_y P^x(Xn = y, Tx > n) = Σ_n P^x(Tx > n) = E^x Tx.
Thus c = E^x Tx. Recalling that µx(x) = 1,
µ(x) = µx(x)/c = 1/E^x Tx.
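Proposition 25.4 can be illustrated on a 2-state chain whose stationary distribution is computable by hand: µ = (5/6, 1/6) for the matrix below, so E^0 T0 should be 1/µ(0) = 1.2. The matrix and trial count are arbitrary choices.

```python
import random

random.seed(5)

# Estimate the mean first return time E^x T_x by simulation and compare
# with 1/mu(x), where mu is the stationary distribution.
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = [5 / 6, 1 / 6]       # solves mu P = mu for this P

def step(i):
    return 0 if random.random() < P[i][0] else 1

def mean_return_time(x, trials=100_000):
    total = 0
    for _ in range(trials):
        i, n = step(x), 1
        while i != x:
            i, n = step(i), n + 1
        total += n
    return total / trials

est = mean_return_time(0)          # should be near 1/mu[0] = 1.2
stationarity_gap = abs(mu[0] * P[0][0] + mu[1] * P[1][0] - mu[0])
```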
We make the following distinction for recurrent states. If E^x Tx < ∞, then x is said to be positive recurrent. If x is recurrent but E^x Tx = ∞, x is null recurrent.
Proposition 25.5. Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.
Proof. To show (a), if x is positive recurrent, then there exists a stationary measure µ_x with µ_x(x) = 1 and total mass E^x T_x. Then π(y) = µ_x(y)/E^x T_x will be a stationary distribution.
For (b), suppose µ(x) > 0 for some x. We showed this implies µ(y) > 0 for all y. Then 0 < µ(y) = 1/E^y T_y, which implies E^y T_y < ∞.
We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So (c) follows.
Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as
well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d)
follows.
26. Convergence.
Our goal is to show that under certain conditions pn (x, y) → π(y), where π is the stationary distri-
bution. (In the null recurrent case pn (x, y) → 0.)
Consider a random walk on the set {0, 1}, where with probability one on each step the chain moves
to the other state. Then pn(x, y) = 0 if x ≠ y and n is even. A less trivial case is the simple random walk
on the integers. We need to eliminate this periodicity.
Suppose x is recurrent, let I_x = {n ≥ 1 : pn(x, x) > 0}, and let d_x be the g.c.d. (greatest common divisor) of I_x. d_x is called the period of x.
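The period can be computed directly from the definition: track which diagonal entries of the n-step transition matrix are positive and take the running g.c.d. This sketch (the helper `period` and the example chains are mine) contrasts the deterministic two-state flip, which has period 2, with a lazy chain of period 1:

```python
# Sketch of the period d_x = gcd{n >= 1 : p_n(x, x) > 0}, computed by
# examining powers of the transition matrix up to a cutoff.

from math import gcd

def period(p, x, max_n=50):
    """gcd of {n <= max_n : p_n(x, x) > 0}; for a small finite chain a
    modest cutoff already determines the gcd."""
    S = len(p)
    pn = [row[:] for row in p]          # pn holds the n-step matrix, n = 1
    d = 0
    for n in range(1, max_n + 1):
        if pn[x][x] > 0:
            d = gcd(d, n)               # gcd(0, n) == n, so the first hit sets d
        pn = [[sum(pn[i][k] * p[k][j] for k in range(S)) for j in range(S)]
              for i in range(S)]        # pn <- pn * p
    return d

flip = [[0.0, 1.0], [1.0, 0.0]]         # moves to the other state w.p. 1
lazy = [[0.5, 0.5], [0.5, 0.5]]
print(period(flip, 0), period(lazy, 0))   # 2 1
```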
Proposition 26.1. If r(x, y) > 0, then dy = dx .
Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that p_K(x, y), p_L(y, x) > 0.
Third, pick n_0 ∈ I_x and k > 0 such that n_0 + k ∈ I_x. If k = 1, we are done. Since d_x = 1, there exists n_1 ∈ I_x such that k does not divide n_1. We have n_1 = mk + r for some 0 < r < k. Note (m + 1)(n_0 + k) ∈ I_x and (m + 1)n_0 + n_1 ∈ I_x. The difference between these two numbers is (m + 1)k − n_1 = k − r < k. So now we have two numbers in I_x differing by at most k − 1. Repeating at most k times, we get two numbers in I_x differing by at most 1, and we are done.
Theorem 26.3. Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then
pn (x, y) → π(y) as n → ∞.
Proof. The idea is to take two copies of the chain with different starting distributions, let them run
independently until they couple, i.e., hit each other, and then have them move together. So define
q((x1, y1), (x2, y2)) = { p(x1, x2)p(y1, y2)  if x1 ≠ y1,
                          p(x1, x2)           if x1 = y1, x2 = y2,
                          0                   otherwise.
Then
P(Xn = y) = P(Xn = y, T ≤ n) + P(Xn = y, T > n),
while
P(Yn = y) = P(Yn = y, T ≤ n) + P(Yn = y, T > n).
Since Xn = Yn on (T ≤ n), the first terms on the right-hand sides are equal.
Subtracting,
P(Xn = y) − P(Yn = y) ≤ P(Xn = y, T > n) − P(Yn = y, T > n) ≤ P(Xn = y, T > n) ≤ P(T > n).
Using symmetry,
|P(Xn = y) − P(Yn = y)| ≤ P(T > n).
Suppose we let Y0 have distribution π and X0 = x. Then P(Yn = y) = π(y) for all n, and so |pn(x, y) − π(y)| ≤ P(T > n).
It remains to show P(T > n) → 0. To do this, consider another chain Z′_n = (Xn, Yn), where now we take Xn, Yn independent. Define r((x1, y1), (x2, y2)) = p(x1, x2)p(y1, y2).
The chain under the transition probabilities r is irreducible. To see this, there exist K and L such that p_K(x1, x2) > 0 and p_L(y1, y2) > 0. If M is large, p_{L+M}(x2, x2) > 0 and p_{K+M}(y2, y2) > 0. So p_{K+L+M}(x1, x2) > 0 and p_{K+L+M}(y1, y2) > 0, and hence we have r_{K+L+M}((x1, y1), (x2, y2)) > 0.
It is easy to check that π′(a, b) = π(a)π(b) is a stationary distribution for Z′. Hence Z′_n is recurrent, and hence it will hit (x, x); in particular, the time to hit the diagonal {(y, y) : y ∈ S} is finite a.s. But this time has the same distribution as T, so P(T > n) → 0.
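The conclusion pn(x, y) → π(y) can be seen numerically by iterating the transition matrix from two different starting states: the n-step distributions merge and the common limit is stationary. The chain below is a hypothetical example of my own, not from the notes:

```python
# Numerical illustration of Theorem 26.3: for an irreducible aperiodic chain,
# the n-step distributions forget the starting state.

p = [[0.1, 0.9, 0.0],
     [0.3, 0.3, 0.4],
     [0.5, 0.0, 0.5]]

def step(row):
    """One application of the transition matrix to a row distribution."""
    return [sum(row[y] * p[y][z] for y in range(3)) for z in range(3)]

rowA = [1.0, 0.0, 0.0]   # chain started at x = 0
rowB = [0.0, 0.0, 1.0]   # chain started at x = 2
for _ in range(200):
    rowA, rowB = step(rowA), step(rowB)

# Both rows have converged to the same limit, which is stationary.
print(max(abs(a - b) for a, b in zip(rowA, rowB)))        # ~0
print(max(abs(a - b) for a, b in zip(rowA, step(rowA))))  # ~0
```

The subdominant eigenvalues of this matrix have modulus well below 1, so 200 iterations already give agreement to machine precision.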
Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, and X′, Y′ independent. (We let Ω = [0, 1]², P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X, X′ have the same law, they have the same characteristic function, and similarly for Y, Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.
By taking u = (0, . . . , 0, a, 0, . . . , 0) to be a constant times the unit vector in the jth coordinate direction,
we deduce that each of the X’s is indeed normal.
Proposition 27.2. If the Xi are jointly normal and Cov (Xi , Xj ) = 0 for i 6= j, then the Xi are independent.
Proof. If Cov(X) = BB^t is a diagonal matrix, then the joint characteristic function of the X’s factors, and so by Proposition 27.1, the Xi would in this case be independent.
Proof. If B ⊂ R∞ , let
Then
P((Y0 , Y1 , . . .) ∈ B) = P((X0 , X1 , . . .) ∈ A)
= P((Xk , Xk+1 , . . .) ∈ A)
= P((Yk , Yk+1 , . . .) ∈ B).
On the other hand, if Xk is stationary, define Ω̂ = R^∞ and X̂k(ω) = ωk, where ω = (ω0, ω1, . . .). Define P̂ on Ω̂ so that the law of X̂ under P̂ is the same as the law of X under P. Then define T ω = (ω1, ω2, . . .).
We see that
P̂(A) = P̂((ω0, ω1, . . .) ∈ A) = P̂((X̂0, X̂1, . . .) ∈ A)
= P((X0, X1, . . .) ∈ A) = P((X1, X2, . . .) ∈ A)
= P̂((X̂1, X̂2, . . .) ∈ A) = P̂((ω1, ω2, . . .) ∈ A)
= P̂(T ω ∈ A) = P̂(T⁻¹A).
We say a set A is invariant if T −1 A = A (up to a null set, that is, the symmetric difference has prob-
ability zero). The invariant σ-field I is the collection of invariant sets. A measure preserving transformation
is ergodic if the invariant σ-field is trivial.
In the case of an i.i.d. sequence, A invariant means A = T −n A ∈ σ(Xn , Xn+1 , . . .) for each n. Hence
each invariant set is in the tail σ-field, and by the Kolmogorov 0-1 law, T is ergodic.
In the case of rotations, if θ is a rational multiple of π, T need not be ergodic. For example, let θ = π and A = (0, π/2) ∪ (π, 3π/2). However, if θ is an irrational multiple of π, then T is ergodic. To see this,
recall that if f is measurable and bounded, then f is the L² limit of Σ_{k=−K}^{K} c_k e^{ikx}, where the c_k are the Fourier coefficients. So
f(T^n x) = Σ_k c_k e^{ikx+iknθ} = Σ_k d_k e^{ikx},
where d_k = c_k e^{iknθ}. If f(T^n x) = f(x) a.e., then c_k = d_k, or c_k e^{iknθ} = c_k. But θ is not a rational multiple of π, so e^{iknθ} ≠ 1 for k ≠ 0, and hence c_k = 0 for all k ≠ 0. Therefore f is constant a.e. If we take f = 1A, this says that either A is empty or A is the whole space, up to sets of measure zero.
Our last example was the Bernoulli shift. Let Xi be i.i.d. with P(Xi = 1) = P(Xi = 0) = 1/2. Let Yn = Σ_{m=0}^{∞} 2^{−(m+1)} X_{n+m}. So there exists g such that Yn = g(Xn, X_{n+1}, . . .). If A is invariant for the Bernoulli shift,
A = ((Yn, Y_{n+1}, . . .) ∈ B) = ((Xn, X_{n+1}, . . .) ∈ C),
where C = {x : (g(x0, x1, . . .), g(x1, x2, . . .), . . .) ∈ B}. This is true for all n, so A is in the invariant σ-field for the Xi’s, which is trivial. Therefore T is ergodic.
Therefore
E [X; Mk > 0] ≥ ∫_{(Mk>0)} [max(S1, . . . , Sk)(ω) − Mk(T ω)] = ∫_{(Mk>0)} [Mk(ω) − Mk(T ω)].
Recall I is the invariant σ-field. The ergodic theorem says the following.
Theorem 29.2. Let T be measure preserving and X integrable. Then
(1/n) Σ_{m=0}^{n−1} X(T^m ω) → E [X | I]
almost surely and in L¹.
Proof. We start with the a.s. result. By looking at X − E [X | I], we may suppose E [X | I] = 0. Let ε > 0
and D = {lim sup Sn /n > ε}. We will show P(D) = 0.
Let δ > 0. Since X is integrable, Σ_n P(|Xn(ω)| > δn) = Σ_n P(|X| > δn) < ∞ (cf. proof of Proposition 5.1). By Borel-Cantelli, |Xn|/n will eventually be less than δ. Since δ is arbitrary, |Xn|/n → 0 a.s. Since Sn(T ω) = S_{n+1}(ω) − X(ω), then lim sup(Sn/n)(T ω) = lim sup(Sn/n)(ω), and so D ∈ I. Let X∗(ω) = (X(ω) − ε)1D(ω), and define Sn∗ and Mn∗ analogously to the definitions of Sn and Mn. On D, lim sup(Sn/n) > ε, hence lim sup(Sn∗/n) > 0.
Let F = ∪n (Mn∗ > 0). Note ∪_{i=0}^{n} (Mi∗ > 0) = (Mn∗ > 0). Also |X∗| ≤ |X| + ε is integrable. By Lemma 29.1, E [X∗; Mn∗ > 0] ≥ 0. By dominated convergence, E [X∗; F ] ≥ 0.
We claim D = F, up to null sets. To see this, if lim sup(Sn∗/n) > 0, then ω ∈ ∪n (Mn∗ > 0). Hence D ⊂ F. On the other hand, if ω ∈ F, then Mn∗ > 0 for some n, so Xn∗ ≠ 0 for some n. By the definition of X∗, for some n, T^n ω ∈ D, and since D is invariant, ω ∈ D a.s.
Recall D ∈ I. Then
0 ≤ E [X ∗ ; D] = E [X − ε; D]
= E [E [X | I]; D] − εP(D) = −εP(D),
using the fact that E [X | I] = 0. We conclude P(D) = 0 as desired.
Since we have this for every ε, then lim sup Sn /n ≤ 0. By applying the same argument to −X, we
obtain lim inf Sn /n ≥ 0, and we have proved the almost sure result. Let us now turn to the L1 convergence.
Let M > 0, X′_M = X 1_{(|X|≤M)}, and X″_M = X − X′_M. By the almost sure result,
(1/n) Σ_m X′_M(T^m ω) → E [X′_M | I]
almost surely. Both sides are bounded by M , so
E | (1/n) Σ_m X′_M(T^m ω) − E [X′_M | I] | → 0. (29.1)
Let ε > 0 and choose M large so that E |X″_M| < ε; this is possible by dominated convergence. We have
E | (1/n) Σ_{m=0}^{n−1} X″_M(T^m ω) | ≤ (1/n) Σ_m E |X″_M(T^m ω)| = E |X″_M| ≤ ε
and
E | E [X″_M | I] | ≤ E [E [|X″_M| | I]] = E |X″_M| ≤ ε.
Combining these two bounds with (29.1), lim sup_n E |(1/n) Σ_m X(T^m ω) − E [X | I]| ≤ 2ε. Since ε is arbitrary, this proves the L¹ convergence.
What does the ergodic theorem tell us about our examples? In the case of i.i.d. random variables, we
see Sn /n → E X almost surely and in L1 , since E [X | I] = E X. Thus this gives another proof of the SLLN.
For rotations of the circle with X(ω) = 1A(ω) and θ an irrational multiple of π, E [X | I] = E X = P(A), the normalized Lebesgue measure of A. So the ergodic theorem says that (1/n) Σ_{m=0}^{n−1} 1A(ω + mθ), the fraction of times ω + mθ is in A, converges for almost every ω to the normalized Lebesgue measure of A.
Finally, in the case of the Bernoulli shift, it is easy to see that the ergodic theorem says that the
average number of ones in the binary expansion of almost every point in [0, 1) is 1/2.
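The rotation example is easy to watch numerically. In this sketch (the choices θ = 1, A = (0, π), and the starting point are mine, for illustration) θ = 1 is an irrational multiple of π, so the ergodic averages should approach the normalized Lebesgue measure of A, namely 1/2:

```python
# Sketch of the ergodic average for the rotation T(w) = w + theta (mod 2*pi),
# with theta = 1 (an irrational multiple of pi) and A = (0, pi).

from math import pi

def ergodic_average(w, theta, n):
    """(1/n) sum_{m=0}^{n-1} 1_A(w + m*theta mod 2*pi) for A = (0, pi)."""
    count, x = 0, w
    for _ in range(n):
        if 0.0 < x < pi:
            count += 1
        x = (x + theta) % (2.0 * pi)
    return count / n

avg = ergodic_average(w=0.3, theta=1.0, n=100_000)
print(avg)   # close to 1/2, as the ergodic theorem predicts
```

By Weyl's equidistribution theorem the error decays for every starting point here, not just almost every one; with n = 100,000 the average is within a fraction of a percent of 1/2.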