
Barebones Background for Markov Chains

by
Howard G. Tucker

Part I. Markov Dependence.

0. Introductory Remarks. This collection, which I refer to as "Barebones Background for Markov Chains," is really a set of notes for lectures I gave during the spring quarters of 2001 and 2003 on Markov chains, leading up to Markov Chain Monte Carlo. The prerequisite needed for this is a knowledge of some basic probability theory and some basic analysis. This is really not a basic course in Markov chains. My main purpose here was to begin with a minimum definition of a Markov chain with a countable state space and to proceed in as direct a manner as possible from this definition to the strong law of large numbers for Markov chains, i.e., the ergodic theorem. This prepared the way for me to introduce in a mathematically rigorous way the Hastings-Metropolis algorithm for Markov Chain Monte Carlo, concluding with the optimality theorem of Billera and Diaconis. The only kind of Markov chain that I needed to deal with was an irreducible chain with a countable state space. Thus these notes should not be construed as a basic course in Markov chains but simply a course in the background mathematics needed for a rigorous justification of the Hastings-Metropolis algorithm. These are called notes for the simple reason that, other than statements of definitions and theorems and proofs of theorems, there is no expository prose included other than these introductory remarks.

1. Definition. An $n \times n$ matrix of real numbers, $P = (p_{ij})$, where $2 \le n \le \infty$, is called a stochastic matrix if (i) $p_{ij} \ge 0$ for all $i \ge 1$, $j \ge 1$, (ii) $\sum_{j=1}^{n} p_{ij} = 1$ for all $i$, and (iii) for every $j$ there exists an $i$ such that $p_{ij} > 0$. ($n$ is called the order of the matrix.)

2. Proposition. If $P = (p_{ij})$ and $Q = (q_{ij})$ are stochastic matrices of the same order, then $PQ$ is a stochastic matrix of the same order.
Proof: Let us denote $PQ = (r_{ij})$, where
$$r_{ij} = \sum_{k \ge 1} p_{ik} q_{kj}.$$
Clearly, $r_{ij} \ge 0$. Also, for every $i \ge 1$,
$$\sum_{j \ge 1} r_{ij} = \sum_{j \ge 1} \sum_{k \ge 1} p_{ik} q_{kj} = \sum_{k \ge 1} p_{ik} \sum_{j \ge 1} q_{kj} = \sum_{k \ge 1} p_{ik} = 1.$$
Finally, for every $j \ge 1$ there exists a value of $k \ge 1$ such that $q_{kj} > 0$, and for this $k$ there exists an $i \ge 1$ such that $p_{ik} > 0$. Thus, for every $j \ge 1$ there exists an $i \ge 1$ such that $r_{ij} \ge p_{ik} q_{kj} > 0$. Q.E.D.
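Proposition 2 is easy to check numerically. A minimal sketch (the two matrices are hypothetical examples, and `is_stochastic` simply encodes Definition 1):

```python
import numpy as np

def is_stochastic(M, tol=1e-12):
    """Definition 1: nonnegative entries, each row sums to 1,
    and every column contains a positive entry."""
    M = np.asarray(M, dtype=float)
    return (
        (M >= -tol).all()
        and np.allclose(M.sum(axis=1), 1.0)
        and (M.max(axis=0) > 0).all()
    )

# Two illustrative stochastic matrices of the same order.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
Q = np.array([[0.1, 0.9, 0.0],
              [0.3, 0.3, 0.4],
              [0.5, 0.0, 0.5]])

assert is_stochastic(P) and is_stochastic(Q)
assert is_stochastic(P @ Q)  # Proposition 2: the product is again stochastic
```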

3. Corollary to Proposition 2. If $P = (p_{ij})$ is a stochastic matrix, then $P^n = (p^n_{ij}) = \prod_{j=1}^{n} P$ is a stochastic matrix.
Proof: Apply Proposition 2 and induction on $n$.
4. Definition of Markov Chain. A Markov chain is an infinite sequence, $X_0, X_1, X_2, \ldots$, of random variables, each taking values in a countable set $X$ (which we shall sometimes refer to as the state space), and such that:
(i) $P([X_0 = j]) > 0$ for all $j \in X$,
(ii) $P([X_{n+1} = j_{n+1}] \mid \bigcap_{k=0}^{n} [X_k = j_k]) = P([X_{n+1} = j_{n+1}] \mid [X_n = j_n]) = P([X_1 = j_{n+1}] \mid [X_0 = j_n])$ for $n = 0, 1, 2, \ldots$ and for all $j_i \in X$ for which $P(\bigcap_{k=0}^{n} [X_k = j_k]) > 0$, and
(iii) for every $j \in X$ there exists an $i \in X$ such that $P([X_1 = j] \mid [X_0 = i]) > 0$.
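The definition can be made concrete by simulating a chain from an initial distribution and a stochastic matrix. A sketch, assuming a small hypothetical two-state chain:

```python
import random

def simulate_chain(pi, P, n_steps, rng=random.Random(0)):
    """Simulate X_0, ..., X_{n_steps} for a chain on states 0..N-1 with
    initial distribution pi and row-stochastic transition matrix P."""
    def draw(weights):
        u, acc = rng.random(), 0.0
        for state, w in enumerate(weights):
            acc += w
            if u < acc:
                return state
        return len(weights) - 1  # guard against floating-point round-off

    path = [draw(pi)]                    # X_0 ~ pi        (property (i))
    for _ in range(n_steps):
        path.append(draw(P[path[-1]]))   # X_{n+1} depends only on X_n (property (ii))
    return path

# hypothetical two-state example
pi = [0.5, 0.5]
P = [[0.9, 0.1],
     [0.4, 0.6]]
print(simulate_chain(pi, P, 10))
```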
5. Proposition. If $\{X_n\}$ is a Markov chain, then for every positive integer $n$ and every state $j \in X$, $P([X_n = j]) > 0$.

Proof: We first prove the theorem for $n = 1$. Let $j \in X$ be arbitrary. Then by (iii) in the definition of Markov chain, there exists an $i \in X$ such that $P([X_1 = j] \mid [X_0 = i]) > 0$. Then, since also $P([X_0 = i]) > 0$, we have
$$P([X_1 = j]) = \sum_{k \in X} P([X_1 = j] \mid [X_0 = k]) P([X_0 = k]) \ge P([X_1 = j] \mid [X_0 = i]) P([X_0 = i]) > 0.$$
Now let $n$ be any positive integer for which the theorem is true. Then $P([X_n = k]) > 0$ for all $k \in X$, and, as before, for arbitrary $j \in X$ there exists an $i \in X$ such that $P([X_1 = j] \mid [X_0 = i]) > 0$. By requirement (ii) in the definition of a Markov chain, $P([X_1 = j] \mid [X_0 = i]) = P([X_{n+1} = j] \mid [X_n = i]) > 0$. Thus
$$P([X_{n+1} = j]) = \sum_{k \in X} P([X_{n+1} = j] \mid [X_n = k]) P([X_n = k]) \ge P([X_{n+1} = j] \mid [X_n = i]) P([X_n = i]) > 0. \quad \text{Q.E.D.}$$

6. Theorem for Existence of a Markov Chain. Given a countable state space $X$, a probability measure $\pi$ over $X$ such that $\pi(i) > 0$ for all $i \in X$, and a stochastic matrix $P = (p_{ij})$ indexed by $X$, there exist a probability space $(\Omega, \mathcal{A}, P)$ and $X$-valued random variables $X_0, X_1, X_2, \ldots$ defined over $\Omega$ which form a Markov chain for which (i) $P([X_0 = i]) = \pi(i)$ for all $i \in X$ and (ii) $P([X_{n+1} = j] \mid [X_n = i]) = p_{ij}$ for all $n \ge 0$, all $j \in X$ and all $i \in X$.
Proof: Let us order the elements of $X$ in the order type $1, 2, \ldots, n$ or $1, 2, \ldots, n, \ldots$, depending on whether $X$ is finite or denumerable, so that for all intents and purposes the elements of $X$ are consecutive positive integers. Then denote $\Omega = \prod_{r=0}^{\infty} X$. For every $j_0 \in X$, define
$$A(0, j_0) = \{(x_0, x_1, x_2, \ldots) \in \Omega : x_0 = j_0\}.$$
For arbitrary $j_0, j_1$ in $X$, define
$$A(0, j_0; 1, j_1) = \{(x_0, x_1, x_2, \ldots) \in \Omega : x_0 = j_0, x_1 = j_1\} = \bigcap_{k=0}^{1} \{(x_0, x_1, x_2, \ldots) \in \Omega : x_k = j_k\},$$
and, in general, for arbitrary $j_0, j_1, \ldots, j_n$ in $X$, define
$$A(0, j_0; \ldots; n, j_n) = \{(x_0, x_1, x_2, \ldots) \in \Omega : x_0 = j_0, \ldots, x_n = j_n\} = \bigcap_{k=0}^{n} \{(x_0, x_1, x_2, \ldots) \in \Omega : x_k = j_k\}.$$
Over all subsets of the kind just defined, define a function $P$ by $P(A(0, j_0)) = \pi(j_0)$, $P(A(0, j_0; 1, j_1)) = \pi(j_0) p_{j_0 j_1}$, and, in general,
$$P(A(0, j_0; \ldots; n, j_n)) = \pi(j_0) \prod_{k=1}^{n} p_{j_{k-1} j_k}.$$
Without ambiguity we may define, for integers $0 \le k_0 < \cdots < k_r$ and arbitrary elements of $X$, call them $j_{k_0}, \ldots, j_{k_r}$, the set $A(k_0, j_{k_0}; \ldots; k_r, j_{k_r})$ by taking, for any $n \ge k_r$, the disjoint union of all sets of the form
$$A(0, m_0; 1, m_1; 2, m_2; \ldots; n, m_n),$$
where $m_{k_0} = j_{k_0}, m_{k_1} = j_{k_1}, \ldots, m_{k_r} = j_{k_r}$; then $P(A(k_0, j_{k_0}; \ldots; k_r, j_{k_r}))$ is defined as the sum of $P(A(0, m_0; \ldots; n, m_n))$ over these sets. The manner in which we defined $P(A(k_0, j_{k_0}; \ldots; k_r, j_{k_r}))$ indicates that we can apply the Kolmogorov-Daniell theorem to extend uniquely the function $P$ from the algebra of subsets generated by all sets of the form $A(k_0, j_{k_0}; \ldots; k_r, j_{k_r})$ to the sigma-algebra generated by this algebra. Then define for $k \ge 0$ the function $X_k$ by $X_k(j_0, j_1, j_2, \ldots) = j_k$. We now verify that $\{X_0, X_1, \ldots\}$ is the Markov chain as claimed. Since $[X_0 = j_0] = A(0, j_0)$, it follows that $P([X_0 = j_0]) = \pi(j_0) > 0$. Also, since
$$A(0, j_0; \ldots; n, j_n) = \bigcap_{k=0}^{n} [X_k = j_k]$$
for $n \ge 0$, it follows that
$$P\left(\bigcap_{k=0}^{n} [X_k = j_k]\right) = \pi(j_0) \prod_{k=1}^{n} p_{j_{k-1} j_k}.$$
Now suppose that $P(\bigcap_{k=0}^{n} [X_k = j_k]) > 0$. Then by the above,
$$P\left([X_{n+1} = j_{n+1}] \,\Big|\, \bigcap_{k=0}^{n} [X_k = j_k]\right) = \frac{P(\bigcap_{k=0}^{n+1} [X_k = j_k])}{P(\bigcap_{k=0}^{n} [X_k = j_k])} = \frac{\pi(j_0) \prod_{k=1}^{n+1} p_{j_{k-1} j_k}}{\pi(j_0) \prod_{k=1}^{n} p_{j_{k-1} j_k}} = p_{j_n j_{n+1}}.$$
Also,
$$P([X_{n+1} = j_{n+1}] \mid [X_n = j_n]) = \frac{P([X_{n+1} = j_{n+1}][X_n = j_n])}{P([X_n = j_n])} = \frac{\sum P\left([X_{n+1} = j_{n+1}][X_n = j_n] \bigcap_{k=0}^{n-1} [X_k = j_k]\right)}{\sum P\left(\bigcap_{k=0}^{n} [X_k = j_k]\right)} = p_{j_n j_{n+1}},$$
where the two sums are taken over all $n$-tuples of states, $j_0, j_1, \ldots, j_{n-1}$. From this we also obtain
$$P([X_{n+1} = j_{n+1}] \mid [X_n = j_n]) = P([X_1 = j_{n+1}] \mid [X_0 = j_n]),$$
and thus $\{X_0, X_1, \ldots\}$ satisfies the definition of a Markov chain with state space $X$, stochastic matrix $P$ and initial distribution $\pi(\cdot)$. Q.E.D.
7. Corollary. If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, if $n \ge 1$, and if $i_0, i_1, \ldots, i_n$ is any sequence of states in $X$, then
$$P\left(\bigcap_{j=0}^{n} [X_j = i_j]\right) = \pi(i_0) \prod_{j=1}^{n} p_{i_{j-1} i_j}.$$
Proof: This follows immediately from the proof of Theorem 6.


8. Notation. For $i, j$ in $X$, we denote $p^1_{ij} = p_{ij}$; then
$$p^2_{ij} = \sum_{k \in X} p_{ik} p_{kj},$$
and in general
$$p^{n+1}_{ij} = \sum_{k \in X} p^n_{ik} p_{kj}.$$
9. Proposition. If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then for every nonnegative integer $n$ and every positive integer $k$,
$$P([X_{n+k} = i_1] \mid [X_n = i_0]) = p^k_{i_0 i_1}.$$
Proof: By requirement (ii) in the definition of a Markov chain,
$$P([X_{n+k} = i_1] \mid [X_n = i_0]) = P([X_k = i_1] \mid [X_0 = i_0]).$$
But
$$P([X_k = i_1] \mid [X_0 = i_0]) = \sum P\left([X_k = i_1] \bigcap_{r=1}^{k-1} [X_r = u_r] \, [X_0 = i_0]\right) \Big/ P([X_0 = i_0]),$$
where the sum is taken over all $u_1, \ldots, u_{k-1}$ in $X$. Now apply Corollary 7, the definition of a stochastic matrix and Notation 8 above to obtain the conclusion.
10. Theorem. If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, and if $0 \le k_0 < k_1 < \cdots < k_n$ are integers, then
$$P\left(\bigcap_{j=0}^{n} [X_{k_j} = i_j]\right) = P([X_{k_0} = i_0]) \prod_{j=1}^{n} p^{k_j - k_{j-1}}_{i_{j-1} i_j}.$$
Proof: We prove this for $n = 1$. We first observe that
$$P([X_{k_0} = i_0][X_{k_1} = i_1])$$
is equal to
$$\sum P\left(\bigcap_{q=0}^{k_0 - 1} [X_q = u_q] \, [X_{k_0} = i_0] \bigcap_{r=k_0+1}^{k_1 - 1} [X_r = v_r] \, [X_{k_1} = i_1]\right),$$
where the sum is taken over all $u_q \in X$, $0 \le q \le k_0 - 1$, and all $v_r \in X$, $k_0 + 1 \le r \le k_1 - 1$. Applying the multiplication rule, Corollary 7, the definition of Markov chain, and Notation 8, we obtain that the above sum is equal to the product of two sums of products, namely,
$$\sum \pi(u_0) \left(\prod_{m=1}^{k_0 - 1} p_{u_{m-1} u_m}\right) p_{u_{k_0 - 1} i_0},$$
where the sum is taken over all states $u_0, u_1, \ldots, u_{k_0 - 1}$ in $X$, and
$$\sum p_{i_0 v_{k_0 + 1}} \left(\prod_{m=k_0+1}^{k_1 - 2} p_{v_m v_{m+1}}\right) p_{v_{k_1 - 1} i_1},$$
where the sum is taken over all states $v_{k_0 + 1}, \ldots, v_{k_1 - 1}$ in $X$. The first sum is equal to $P([X_{k_0} = i_0])$, and the second sum is equal to $p^{k_1 - k_0}_{i_0 i_1}$. The proof for general $n$ involves the product of $n + 1$ sums of products. Q.E.D.
11. Theorem. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, if $0 \le k_0 < k_1 < \cdots < k_m < r_0 < r_1 < \cdots < r_n$, and if
$$P\left(\bigcap_{j=0}^{m} [X_{k_j} = u_j]\right) > 0,$$
where $\{u_0, \ldots, u_m\} \subset X$, and if $\{v_0, \ldots, v_n\} \subset X$, then
$$P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \,\Big|\, \bigcap_{j=0}^{m} [X_{k_j} = u_j]\right) = P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \,\Big|\, [X_{k_m} = u_m]\right).$$
Proof: We first write
$$P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \,\Big|\, \bigcap_{j=0}^{m} [X_{k_j} = u_j]\right) = \frac{P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \cap \bigcap_{j=0}^{m} [X_{k_j} = u_j]\right)}{P\left(\bigcap_{j=0}^{m} [X_{k_j} = u_j]\right)}.$$
Then, applying Theorem 10 and doing appropriate cancellation, we obtain
$$P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \,\Big|\, \bigcap_{j=0}^{m} [X_{k_j} = u_j]\right) = p^{r_0 - k_m}_{u_m v_0} \prod_{t=1}^{n} p^{r_t - r_{t-1}}_{v_{t-1} v_t}.$$
Note that Theorem 10 also yields
$$P\left(\bigcap_{j=1}^{n} [X_{k_j} = i_j] \,\Big|\, [X_{k_0} = i_0]\right) = \prod_{j=1}^{n} p^{k_j - k_{j-1}}_{i_{j-1} i_j},$$
which, when applied to the above, gives us
$$P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \,\Big|\, \bigcap_{j=0}^{m} [X_{k_j} = u_j]\right) = P\left(\bigcap_{j=0}^{n} [X_{r_j} = v_j] \,\Big|\, [X_{k_m} = u_m]\right). \quad \text{Q.E.D.}$$

12. Theorem. If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, if $S \subset \prod_{k=0}^{n} X$ for $n \ge 1$, and if $P([X_{n+1} = i][(X_0, X_1, \ldots, X_n) \in S]) > 0$, then
$$P([X_{n+2} = j] \mid [X_{n+1} = i][(X_0, X_1, \ldots, X_n) \in S]) = P([X_{n+2} = j] \mid [X_{n+1} = i])$$
for all $j$ in $X$.
Proof: The left side of the above may be expressed as
$$\frac{P([X_{n+2} = j][X_{n+1} = i][(X_0, X_1, \ldots, X_n) \in S])}{P([X_{n+1} = i][(X_0, X_1, \ldots, X_n) \in S])}.$$
The numerator is equal to
$$\sum P\left([X_{n+2} = j][X_{n+1} = i] \bigcap_{k=0}^{n} [X_k = u_k]\right),$$
where the sum is taken over all $(n + 1)$-tuples $(u_0, \ldots, u_n) \in S$. Applying Corollary 7, this becomes
$$\sum \pi(u_0) \left(\prod_{k=1}^{n} p_{u_{k-1} u_k}\right) p_{u_n i} \, p_{ij},$$
again where the sum is taken over all $(n + 1)$-tuples $(u_0, \ldots, u_n) \in S$. We may factor out the term $p_{ij}$. What remains in the numerator, by applying Corollary 7 again, is $P([X_{n+1} = i][(X_0, X_1, \ldots, X_n) \in S])$, which cancels with the denominator, yielding the theorem.

13. Theorem (Chapman-Kolmogorov equations). If $\{X_n\}$ is a Markov chain, then
$$P([X_{m+n} = j] \mid [X_0 = i]) = \sum_{k \in X} P([X_{m+n} = j] \mid [X_m = k]) P([X_m = k] \mid [X_0 = i]).$$
Proof: By Theorem 11,
$$P([X_{m+n} = j] \mid [X_0 = i]) = \frac{\sum_{u \in X} P([X_{m+n} = j][X_m = u][X_0 = i])}{P([X_0 = i])} = \frac{\sum_{u \in X} P([X_{m+n} = j] \mid [X_m = u]) P([X_m = u] \mid [X_0 = i]) P([X_0 = i])}{P([X_0 = i])},$$
which yields the conclusion.
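In matrix notation the Chapman-Kolmogorov equations say precisely that $P^{m+n} = P^m P^n$, which is immediate to check numerically (the matrix below is a hypothetical example):

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])  # hypothetical stochastic matrix

m, n = 3, 4
lhs = np.linalg.matrix_power(P, m + n)                            # (p^{m+n}_{ij})
rhs = np.linalg.matrix_power(P, m) @ np.linalg.matrix_power(P, n)
assert np.allclose(lhs, rhs)  # Chapman-Kolmogorov in matrix form
```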

Exercises for Part I

1. Prove: If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, and if $Y_n = X_{m+n}$ for some fixed $m$ and $n = 0, 1, 2, \ldots$, then $\{Y_n\}$ is a Markov chain.
2. An equivalent definition of Markov chain appears to be: "The future and the past are conditionally independent, given the present." So check a particular case of this. Prove that if $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then
$$P([X_{n-2} = i][X_{n-1} = j][X_{n+1} = k][X_{n+2} = l] \mid [X_n = r])$$
is equal to
$$P([X_{n-2} = i][X_{n-1} = j] \mid [X_n = r]) \, P([X_{n+1} = k][X_{n+2} = l] \mid [X_n = r])$$
for $n \ge 2$ and for arbitrary elements $i, j, k, l, r$ in $X$.
3. Sometimes that part of the definition of Markov chain that states
$$P\left([X_{n+1} = j_{n+1}] \,\Big|\, \bigcap_{k=0}^{n} [X_k = j_k]\right) = P([X_{n+1} = j_{n+1}] \mid [X_n = j_n])$$
for all states $j_0, j_1, \ldots, j_{n+1}$ in $X$ which satisfy $P(\bigcap_{k=0}^{n} [X_k = j_k]) > 0$ is replaced by the requirement that $P([X_{n+1} = j_{n+1}] \mid \bigcap_{k=0}^{n} [X_k = j_k])$ does not depend on $j_0, j_1, \ldots, j_{n-1}$ for all states $j_0, j_1, \ldots, j_{n+1}$ in $X$ which satisfy $P(\bigcap_{k=0}^{n} [X_k = j_k]) > 0$. Prove that these two statements are equivalent.
4. Consider the set of all $2 \times 2$ tables of nonnegative integers for which the numbers in the first row add up to 6, the numbers in the first column add up to 5, and the sum of all four numbers is 16. An example of one such table is
$$s_2 = \begin{pmatrix} 2 & 4 \\ 3 & 7 \end{pmatrix}.$$
(i) List all six tables that satisfy the requirements.
(ii) These six tables constitute a state space for a Markov chain which we are about to construct. Let $s_i$ denote the table with the number $i$ in the first row and the first column. A table like this one arises from time to time in statistical practice with a probability measure defined over it by
$$\pi(s_i) = \frac{\binom{5}{i} \binom{11}{6-i}}{\binom{16}{6}} \quad \text{for } 0 \le i \le 5.$$
The mechanism for generating a Markov chain here is to select a first state at random according to the distribution $\pi(\cdot)$. Whatever state is obtained, the next state is then selected according to the following procedure. First, select one of the tables
$$\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \quad \text{or} \quad \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$$
with probabilities $\frac{1}{2}, \frac{1}{2}$. Whichever table is obtained, add it termwise to the original table to obtain your next state, unless adding it termwise puts a negative number in one of the entries, in which case the next state is the same as the present state. Display the $6 \times 6$ stochastic matrix for this Markov chain. (For example, $p_{s_0 s_0} = \frac{1}{2}$, $p_{s_3 s_2} = \frac{1}{2}$, $p_{s_2 s_4} = 0$, etc.)
(iii) Compute the density of $X_0$ from the formula given above, then compute the density of $X_1$, and finally compute the joint density of $X_0, X_1$.
(iv) If $P$ denotes the stochastic matrix from part (ii), display these matrices: $P^{10}$, $P^{100}$ and $P^{1000}$.
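For parts (ii) and (iv), one possible computational setup is sketched below (an outline of the calculation, not the requested hand display; the state labels follow the convention $s_i$ above):

```python
import numpy as np
from math import comb

# States s_0,...,s_5, where s_i = [[i, 6-i], [5-i, 5+i]] (part (i)).
pi = np.array([comb(5, i) * comb(11, 6 - i) for i in range(6)], dtype=float)
pi /= comb(16, 6)

# Part (ii): from s_i the +/-1 perturbation tables move to s_{i+1} or
# s_{i-1} with probability 1/2 each; an out-of-range move would create
# a negative entry, so the chain holds at the present state instead.
P = np.zeros((6, 6))
for i in range(6):
    for step in (+1, -1):
        j = i + step
        if 0 <= j <= 5:
            P[i, j] += 0.5
        else:
            P[i, i] += 0.5   # proposed table has a negative entry

assert np.allclose(P.sum(axis=1), 1.0)  # P is stochastic

# Part (iv): display high powers of P.
for n in (10, 100, 1000):
    print(np.linalg.matrix_power(P, n).round(4))
```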

Part II. Aperiodicity, Irreducibility and the First Limit Theorem.

1. Definition. If $a$ and $b$ are positive integers, then $a \mid b$ means there exists a positive integer $n$ such that $b = na$. We shall say in this case that "$a$ divides $b$".
2. Proposition. If $a$ and $b$ are positive integers, and if $a \mid b$, then $a \le b$.
3. Definition. If $a \mid b$ and if $a \mid c$, then $a$ is called a common divisor of $b$ and $c$.
4. Definition. If $a$ is a common divisor of $b$ and $c$, then $a$ is called a greatest common divisor of $b$ and $c$ if for every common divisor $d$ of $b$ and of $c$ we have $a \ge d$. In this case we write $a = (b, c)$. Similarly, a greatest common divisor of more than two positive integers, $a_1, \ldots, a_r$, is a number $d$ that divides each of $a_1, \ldots, a_r$ and is not smaller than any other common divisor; we denote this greatest common divisor by $(a_1, \ldots, a_r)$.
5. Remark. If $a$ and $b$ are positive integers, then $(a, b)$ always exists and is unique. The same holds for $(a_1, \ldots, a_r)$. This follows from Proposition 2.
6. Theorem (Division Algorithm). If $0 < a < b$ are integers, then there exist a unique positive integer $c$ and a unique integer $d$ such that $0 \le d < a$ and $b = ca + d$.
Proof: Since $0 < a < b$, and since $\{ka : k = 1, 2, \ldots\}$ is an unbounded set, we let $c$ denote the largest value of $k$ for which $ka \le b$. Then take $d = b - ca$.
7. Euclidean Algorithm. If $0 < a < b$ are integers, then $(a, b)$ is obtained in the following way. First use the division algorithm to write $b = q_0 a + r_0$, where $q_0 \ge 1$ and $0 \le r_0 < a$. If $r_0 = 0$, then it is clear that $a = (a, b)$. But if $r_0 > 0$, then, since $r_0 < a$, by the division algorithm we may write $a = q_1 r_0 + r_1$, where $q_1 \ge 1$ and $0 \le r_1 < r_0$. If $r_1 = 0$, then $r_0 \mid a$, and then, by the previous equation $b = q_0 a + r_0$, it follows that $r_0 \mid b$. Thus $r_0 \le (a, b)$. But note that by $b = q_0 a + r_0$, it follows that $(a, b) \mid r_0$. Therefore by Proposition 2 above, $(a, b) \le r_0$, so the two inequalities imply $(a, b) = r_0$. But if $r_1 > 0$, then as before we may write $r_0 = q_2 r_1 + r_2$, where $0 \le r_2 < r_1$. Again, if $r_2 = 0$, then in the very same way one can show that $r_1 \mid a$ and $r_1 \mid b$, so that $r_1 \le (a, b)$, while at the same time showing that $(a, b) \mid r_1$. Thus we have, as before, $(a, b) = r_1$. Proceed until the last nonzero remainder is obtained; it equals $(a, b)$.
8. Theorem. If $0 < a < b$ are integers, then there exist integers $u$ and $v$ such that $(a, b) = ua + vb$.
Proof: In the Euclidean algorithm above, let us take the case where $(a, b) = r_1$. Then since $r_1 = a - q_1 r_0$ and since $r_0 = b - q_0 a$, we have $r_1 = a - q_1(b - q_0 a) = (1 + q_0 q_1)a - q_1 b$, which proves the theorem in the case when $(a, b) = r_1$ in the application of the Euclidean algorithm. From here on, the general proof is just a matter of notation.
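The back-substitution in the proof of Theorem 8 is exactly the extended Euclidean algorithm. A minimal sketch:

```python
def extended_gcd(a, b):
    """Return (g, u, v) with g = (a, b) and g = u*a + v*b, by the
    back-substitution of Theorem 8 applied recursively."""
    if b == 0:
        return a, 1, 0
    g, u, v = extended_gcd(b, a % b)   # g = u*b + v*(a % b)
    return g, v, u - (a // b) * v      # substitute a % b = a - (a//b)*b

g, u, v = extended_gcd(12, 42)
assert g == 6 and u * 12 + v * 42 == 6   # here u = -3, v = 1
```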
9. Theorem. If $a$ and $b$ are positive integers, if $d \mid a$ and if $d \mid b$, then $d \mid (a, b)$.
Proof: This follows directly from Theorem 8.
10. Theorem. If $0 < k_1 < k_2 < \cdots < k_r$ are integers, where $r \ge 2$, then
$$(k_1, k_2, \ldots, k_r) = ((k_1, k_2, \ldots, k_{r-1}), k_r),$$
and there exist integers $t_1, \ldots, t_r$ such that
$$(k_1, \ldots, k_r) = \sum_{j=1}^{r} t_j k_j.$$
Proof: We shall prove this by induction on $n$ for the integers $0 < b < a_0 < a_1 < \cdots < a_n$. We first prove it is true for $n = 1$ by proving $(b, a_0, a_1) = ((b, a_0), a_1)$ and then by applying Theorem 8. Let $d = (b, a_0, a_1)$. Then $d \mid b$, $d \mid a_0$ and $d \mid a_1$, so by Theorem 9, $d \mid (b, a_0)$; but also $d \mid a_1$, so again by Theorem 9, $d \mid ((b, a_0), a_1)$, from which it follows from Proposition 2 above that $(b, a_0, a_1) \le ((b, a_0), a_1)$. On the other hand, if $d = ((b, a_0), a_1)$, then $d \mid (b, a_0)$ and $d \mid a_1$, which in turn implies $d \mid b$, $d \mid a_0$ and $d \mid a_1$, so $d \le (b, a_0, a_1)$. Thus the first conclusion of the theorem,
$$(b, a_0, a_1) = ((b, a_0), a_1),$$
is true. Applying Theorem 8 twice, we obtain $(b, a_0, a_1) = k(b, a_0) + m a_1 = krb + ksa_0 + m a_1$ for some integers $k, m, r, s$. Now let $n$ be any positive integer for which the two conclusions of the theorem are true, i.e., $(b, a_0, a_1, \ldots, a_n) = ((b, a_0, a_1, \ldots, a_{n-1}), a_n)$ and
$$(b, a_0, a_1, \ldots, a_n) = jb + \sum_{k=0}^{n} j_k a_k$$
for some integers $j, j_0, \ldots, j_n$. Then by the same proof as in the case $n = 1$ we have
$$(b, a_0, a_1, \ldots, a_n, a_{n+1}) = ((b, a_0, a_1, \ldots, a_n), a_{n+1}).$$
Thus, for some integers $u, v$,
$$(b, a_0, a_1, \ldots, a_n, a_{n+1}) = u(b, a_0, a_1, \ldots, a_n) + v a_{n+1} = ujb + \sum_{k=0}^{n} u j_k a_k + v a_{n+1},$$
which proves the theorem for $n + 1$. Q.E.D.
11. Theorem. If $a_1, \ldots, a_r$ are positive integers satisfying $(a_1, \ldots, a_r) = 1$, then there exists a positive integer $N$ such that for every nonnegative integer $m$ there exist nonnegative integers $d_1, \ldots, d_r$ such that
$$N + m = \sum_{j=1}^{r} d_j a_j.$$
Proof: By Theorem 10 there exist integers $c_1, \ldots, c_r$ such that
$$1 = c_1 a_1 + \cdots + c_r a_r.$$
Without loss of generality we may assume that $c_1 > 0$. Now let us define
$$N = a_1 a_2 |c_2| + a_1 a_3 |c_3| + \cdots + a_1 a_r |c_r|.$$
Claim: It is sufficient to prove the theorem for $0 \le m \le a_1 - 1$. Indeed, suppose the theorem is true for all values of $m$ that satisfy $0 \le m \le a_1 - 1$. Now let $m \ge a_1$. Hence by the division algorithm,
$$N + m = N + c a_1 + \rho,$$
where $c \ge 1$ and $0 \le \rho \le a_1 - 1$. By the hypothesis of the claim, there exist nonnegative integers $d_1, \ldots, d_r$ such that $N + \rho = \sum_{i=1}^{r} d_i a_i$. Thus
$$N + m = (c + d_1) a_1 + \sum_{i=2}^{r} d_i a_i,$$
which proves the claim. So now we have to prove the theorem only for $0 \le m \le a_1 - 1$. In this case
$$N + m = a_1 a_2 |c_2| + a_1 a_3 |c_3| + \cdots + a_1 a_r |c_r| + m.$$
As noted above, $1 = c_1 a_1 + \cdots + c_r a_r$, so
$$N + m = a_1 a_2 |c_2| + a_1 a_3 |c_3| + \cdots + a_1 a_r |c_r| + m \sum_{i=1}^{r} c_i a_i = m c_1 a_1 + (a_1 |c_2| + m c_2) a_2 + \cdots + (a_1 |c_r| + m c_r) a_r,$$
and each coefficient here is nonnegative, since $0 \le m \le a_1 - 1$ implies $a_1 |c_i| + m c_i \ge |c_i|(a_1 - m) \ge 0$, which proves the theorem.
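Theorem 11 can be checked by brute force for a particular pair, say $a_1 = 3$, $a_2 = 5$ (a hypothetical example; here one may take $N = 8$):

```python
from math import gcd
from itertools import product

def representable(t, coeff_bound, a):
    """Can t be written as a nonnegative integer combination of a?"""
    return any(
        sum(d * ai for d, ai in zip(ds, a)) == t
        for ds in product(range(coeff_bound), repeat=len(a))
    )

a = (3, 5)
assert gcd(*a) == 1
N = 8  # every integer >= 8 is a nonnegative combination of 3 and 5
assert all(representable(N + m, 30, a) for m in range(50))
assert not representable(7, 30, a)   # 7 is not, so N = 8 is sharp here
```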
12. Definition. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then state $j$ is said to be accessible from state $i$ if there exists a nonnegative integer $n$ such that $p^n_{ij} > 0$. (Note: for all $i \in X$, $i$ is accessible from $i$, since $p^0_{ii} = 1$.)
13. Proposition. If state $j$ is accessible from state $i$, and if state $k$ is accessible from state $j$, then state $k$ is accessible from state $i$.
Proof: By hypothesis, there exist nonnegative integers $m$ and $n$ such that $p^m_{ij} > 0$ and $p^n_{jk} > 0$. Thus by the Chapman-Kolmogorov equations,
$$p^{m+n}_{ik} = \sum_{t \in X} p^m_{it} p^n_{tk} \ge p^m_{ij} p^n_{jk} > 0,$$
which proves the proposition.


14. Definition. If a state $j$ is accessible from state $i$, and if state $i$ is accessible from state $j$, then states $i$ and $j$ are said to communicate, and this is expressed by: state $i$ communicates with state $j$.

15. Proposition. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then "communicates with" is an equivalence relation and therefore partitions $X$ in a unique manner into disjoint equivalence classes.
Proof: We must prove that "communicates with" is reflexive, symmetric and transitive. Reflexivity follows because $p^0_{ii} = 1$ for all $i \in X$. Transitivity follows from Proposition 13, and symmetry follows from the definition of "communicates with".

16. Definition. A Markov chain is said to be irreducible if there is only one equivalence class, i.e., if every two elements in the state space communicate with each other.

17. Definition. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then for each $i \in X$, the period of $i$, denoted by $d_i$, is defined by $d_i = \gcd\{n \ge 1 : p^n_{ii} > 0\}$.

18. Definition. If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then the state $i \in X$ is called aperiodic if $d_i = 1$.
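The period of a state can be computed directly from Definition 17 by scanning the diagonal entries of the powers $P^n$ (a sketch; the two matrices are hypothetical examples, and the scan is truncated at `n_max`):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, i, n_max=50):
    """d_i = gcd{ n >= 1 : p^n_ii > 0 }, scanning n up to n_max."""
    Pn = np.eye(len(P))
    returns = []
    for n in range(1, n_max + 1):
        Pn = Pn @ P                 # Pn = P^n
        if Pn[i, i] > 0:
            returns.append(n)
    return reduce(gcd, returns) if returns else 0

# hypothetical 3-cycle: every state has period 3
C = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
assert period(C, 0) == 3

# adding a holding probability makes state 0 aperiodic
A = np.array([[0.5, 0.5, 0.],
              [0.,  0.,  1.],
              [1.,  0.,  0.]])
assert period(A, 0) == 1
```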
19. Definition. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, then it is said to satisfy the uniform irreducibility condition if there exists a positive integer $N$ such that, for all $n \ge N$ and all $i, j$ in $X$, $p^n_{ij} > 0$.
20. Theorem. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, and if two distinct states $i$ and $j$ communicate with each other, then the periods of the two states are equal.
Proof: Let $n$ and $m$ be positive integers that satisfy $p^n_{ij} > 0$ and $p^m_{ji} > 0$. Let $k > 0$ satisfy $p^k_{ii} > 0$. Then, by the Chapman-Kolmogorov equations, it follows that
$$p^{m+n+k}_{jj} = \sum_{u, v \in X} p^m_{ju} p^k_{uv} p^n_{vj} \ge p^m_{ji} p^k_{ii} p^n_{ij} = c \, p^k_{ii},$$
where $c = p^m_{ji} p^n_{ij} > 0$ does not depend on $k$. Since $p^{m+n}_{jj} \ge p^m_{ji} p^n_{ij} = c > 0$, it follows that $m + n = k_1 d_j$ for some positive integer $k_1$, and, since $p^{m+n+k}_{jj} \ge c p^k_{ii} > 0$, that $m + n + k = k_2 d_j$. For our particular value of $k$ such that $p^k_{ii} > 0$, we have
$$k = m + n + k - (m + n) = k_2 d_j - k_1 d_j = (k_2 - k_1) d_j.$$
Thus $d_j$ divides all values of $k$ for which $p^k_{ii} > 0$, which implies that $d_j \mid d_i$. By the arbitrariness of $i$ and $j$, one can also prove that $d_i \mid d_j$. Thus by Proposition 2, $d_i = d_j$.

21. Corollary. In an irreducible Markov chain, if one state is aperiodic, then all states are aperiodic. (In this case we shall say that the chain is aperiodic.)
Proof: This follows immediately from Theorem 20 and the definition of irreducibility.

22. Theorem. If $\{X_n\}$ is a Markov chain determined by initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$ and with a finite state space $X$, then it satisfies the uniform irreducibility condition if and only if it is irreducible and aperiodic.
Proof: Assume that the Markov chain satisfies the uniform irreducibility condition. Then there exists a positive integer $m$ such that $p^m_{ij} > 0$ for all ordered pairs of states $i, j$. For any state $j$ we then have, by the Chapman-Kolmogorov equations,
$$p^{m+1}_{jj} = \sum_{k \in X} p^m_{jk} p_{kj}.$$
By the definition of Markov chain, there exists a state $k_0$ such that $p_{k_0 j} > 0$. From this last equation, we observe that $p^{m+1}_{jj} = \sum_{k \in X} p^m_{jk} p_{kj} \ge p^m_{jk_0} p_{k_0 j} > 0$. Since also $p^m_{jj} > 0$ and $(m, m+1) = 1$, it follows that the chain is aperiodic. Since $p^m_{ij} > 0$ for all ordered pairs of states $i, j$, it easily follows that the Markov chain is irreducible. Conversely, suppose that the chain is irreducible and aperiodic. Let $i \in X$, and let $D_i = \{n \ge 1 : p^n_{ii} > 0\}$. Since
$$p^{m+n}_{ii} = \sum_{k \in X} p^m_{ik} p^n_{ki} \ge p^m_{ii} p^n_{ii},$$
it follows that $D_i$ is closed under addition. By the aperiodicity hypothesis, there exist a finite number of distinct elements, $a_1, \ldots, a_r$, in $D_i$ such that $(a_1, \ldots, a_r) = 1$. By Theorem 11, it follows that there exists a positive integer $M_{ii}$ such that $p^n_{ii} > 0$ for all $n > M_{ii}$. Also, for any pair of states $j \ne i$, it follows by irreducibility that there exists a positive integer $k$ such that $p^k_{ij} > 0$. If we define $M_{ij} = M_{ii} + k$, it follows that for every $n > M_{ij}$, $p^n_{ij} \ge p^{n-k}_{ii} p^k_{ij} > 0$. Since by hypothesis $X$ is finite, we may take $N = \max\{M_{ij} : i \in X, j \in X\} + 1 < \infty$, and from this it follows that for every $n \ge N$, $p^n_{ij} > 0$ for all $i \in X, j \in X$. Q.E.D.
23. First Big Limit Theorem. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, and if it satisfies the following three conditions:
(i) it has an aperiodic state,
(ii) it has a finite state space, and
(iii) it is irreducible,
then
$$\lim_{n \to \infty} P^n \text{ exists and equals } \Pi,$$
where all rows of $\Pi$ are the same.
Proof: By Theorem 22, $P$ satisfies the uniform irreducibility condition, which is: there exists a positive integer $m$ such that $p^n_{ij} > 0$ for all $n \ge m$ and for all $i, j$ in $X$. We first prove the theorem in Case 1: $m = 1$. Thus $p_{ij} > 0$ for all pairs of states $i, j$ in $X$. Let $\varepsilon = \min\{p_{ij} : \text{all } i, j \text{ in } X\}$. Since $P$ has a finite number of rows and columns, it follows that $0 < \varepsilon \le p_{ij}$ for all $i, j$ in $X$. Next denote
$$m_j(n) = \min\{p^n_{ij} : i \in X\}$$
and
$$M_j(n) = \max\{p^n_{ij} : i \in X\}$$
for all $j \in X$ and for all $n \ge 1$. Applying the Chapman-Kolmogorov equations, we obtain, for $n \ge 2$,
$$p^n_{ij} = \sum_{k \in X} p_{ik} p^{n-1}_{kj} \ge \sum_{k \in X} p_{ik} m_j(n-1) = m_j(n-1)$$
and
$$p^n_{ij} = \sum_{k \in X} p_{ik} p^{n-1}_{kj} \le \sum_{k \in X} p_{ik} M_j(n-1) = M_j(n-1)$$
for all $i, j \in X$ and for all $n \ge 2$. Taking the $\min_{i \in X}$ of both sides of the first inequality, we get
$$m_j(n) \ge m_j(n-1),$$
and, taking the $\max_{i \in X}$ of the second inequality, we obtain
$$M_j(n) \le M_j(n-1)$$
for all $n \ge 2$ and all $j \in X$. Thus for each $j \in X$, the limits $m_j$ and $M_j$ exist and satisfy
$$m_j = \lim_{n \to \infty} m_j(n) \le \lim_{n \to \infty} M_j(n) = M_j.$$
Note that for every $i \in X$ and every $n \ge 2$,
$$1 = \sum_{j \in X} p^n_{ij} \ge \sum_{j \in X} m_j(n-1)$$
and
$$1 = \sum_{j \in X} p^n_{ij} \le \sum_{j \in X} M_j(n-1).$$
Hence, if we can prove that $m_j = M_j = (\text{some})\ \pi_j$ for all $j \in X$, then $\sum_{j \in X} \pi_j = 1$, and all of the rows of the limit matrix are the same. So it is sufficient to prove $m_j = M_j$ for all $j \in X$. Let $i_0 \in X$ be such that
$$m_j(n) = \min_{i \in X} p^n_{ij} = p^n_{i_0 j},$$
and let $i_1 \in X$ be such that
$$M_j(n-1) = \max_{i \in X} p^{n-1}_{ij} = p^{n-1}_{i_1 j}.$$
Then
$$m_j(n) = p^n_{i_0 j} = \sum_{k \in X} p_{i_0 k} p^{n-1}_{kj} = \varepsilon \, p^{n-1}_{i_1 j} + (p_{i_0 i_1} - \varepsilon) p^{n-1}_{i_1 j} + \sum_{k \ne i_1} p_{i_0 k} p^{n-1}_{kj} \ge \varepsilon M_j(n-1) + \Big\{ p_{i_0 i_1} - \varepsilon + \sum_{k \ne i_1} p_{i_0 k} \Big\} m_j(n-1),$$
or
$$m_j(n) \ge \varepsilon M_j(n-1) + (1 - \varepsilon) m_j(n-1).$$
Taking the limit of both sides as $n \to \infty$, we obtain $m_j \ge \varepsilon M_j + (1 - \varepsilon) m_j$, or $m_j \ge M_j$. Since we established above that $m_j \le M_j$, we may conclude that $m_j = M_j = \pi_j$ for all $j \in X$. Hence $\sum_{j \in X} \pi_j = 1$, and all rows of the limit matrix are identical. This proves the theorem for $m = 1$. We now prove the theorem for $m > 1$. Since $P^m$ is a stochastic matrix with all positive entries, we may replace $P$ in the proof of the case $m = 1$ above by $P^m$ and thereby conclude that
$$P^{mn} \to (\text{some})\ \Pi \text{ as } n \to \infty,$$
where $\Pi$ is a stochastic matrix of which all rows are identical. Then for $1 \le k \le m - 1$,
$$\lim_{n \to \infty} P^{mn+k} = \lim_{n \to \infty} P^k P^{mn} = P^k \Pi.$$
Since all rows of $\Pi$ are the same, and since $P^k$ is a stochastic matrix, $P^k \Pi = \Pi$. So $P^r \to \Pi$ as $r \to \infty$. Q.E.D.
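The conclusion of Theorem 23, and the monotone shrinking of the gap $M_j(n) - m_j(n)$ used in its proof, can be observed numerically (hypothetical matrix):

```python
import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3]])  # hypothetical irreducible, aperiodic chain

Pn = np.linalg.matrix_power(P, 200)
# all rows of the (near-)limit matrix agree, as the theorem asserts
assert np.allclose(Pn, Pn[0])

# M_j(n) - m_j(n) is the columnwise peak-to-peak spread of P^n;
# it is non-increasing in n (geometrically decreasing, in fact)
gaps = [np.ptp(np.linalg.matrix_power(P, n), axis=0).max() for n in (1, 5, 10, 20)]
assert gaps == sorted(gaps, reverse=True)
```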

24. Corollary. If $\{X_n\}$ is a Markov chain determined by state space $X$, initial distribution $\{\pi(i) : i \in X\}$ and stochastic matrix $P = (p_{ij})$, if it satisfies the following three conditions:
(i) it has an aperiodic state,
(ii) it has a finite state space, and
(iii) it is irreducible,
and if $\Pi$ is as in Theorem 23 above, then
$$\Pi P = P \Pi = \Pi.$$
Proof: First observe that $P^{n+1} = P^n P \to \Pi P$ as $n \to \infty$, but also $P^{n+1} \to \Pi$, so $\Pi P = \Pi$. Since $P$ is a stochastic matrix and all rows of $\Pi$ are the same, $P \Pi = \Pi$, which proves the corollary.

25. Corollary. Under the conditions of the theorem and corollary above, if the initial distribution of $X_0$ is the same as the common row of $\Pi$, then the random variables $\{X_n\}$ are identically distributed, and the sequence is stationary.
Proof: This follows from the conclusion in Corollary 24 that states $\Pi P = \Pi$.
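By Corollary 24 the common row $\rho$ of $\Pi$ satisfies $\rho P = \rho$ together with $\sum_j \rho_j = 1$, so it can be computed by solving a linear system rather than by taking high powers. A sketch (hypothetical matrix):

```python
import numpy as np

P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3]])  # hypothetical chain

n = len(P)
# Solve rho (P - I) = 0 together with sum(rho) = 1, as a least-squares system.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
rho, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(rho @ P, rho)                            # rho P = rho
assert np.allclose(np.linalg.matrix_power(P, 200)[0], rho)  # common row of Pi
```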

Part III. Recurrence

1. Notation. If $\{X_n\}$ is a Markov chain with state space $X$, then for $i \in X$ we shall denote $T_i(0) = 0$, and for $k \ge 1$ we shall denote
$$T_i(k) = \min\{m : m > T_i(k-1), \, X_m = i\},$$
provided that this set, $\{m : m > T_i(k-1), \, X_m = i\}$, is not empty; otherwise, $T_i(k) = \infty$.

2. Proposition. If $\{X_n\}$ is a Markov chain, then state $j$ is accessible from state $i$ if and only if
$$P([T_j(1) < \infty] \mid [X_0 = i]) > 0.$$
Proof: We first note that
$$[T_j(1) < \infty] = \bigcup_{n \ge 1} \left( [X_n = j] \cap \bigcap_{r=1}^{n-1} [X_r \ne j] \right).$$
First suppose that state $j$ is accessible from state $i$. Then by the definition of accessible, there exists a positive integer $n$ such that $p^n_{ij} > 0$. Let $n_0 = \min\{n : p^n_{ij} > 0\}$. We now prove the
Claim: $P([X_{n_0} = j] \cap \bigcap_{r=1}^{n_0 - 1} [X_r \ne j] \mid [X_0 = i]) = p^{n_0}_{ij}$. Indeed, we know that
$$p^{n_0}_{ij} = \sum P\left([X_{n_0} = j] \bigcap_{r=1}^{n_0 - 1} [X_r = k_r] \,\Big|\, [X_0 = i]\right),$$
where the sum is taken over all $k_r \in X$, $1 \le r \le n_0 - 1$. However, if, in any of the summands, there is a $t \in \{1, \ldots, n_0 - 1\}$ such that $k_t = j$, then, by our definition of $n_0$, that term satisfies
$$0 \le P\left([X_{n_0} = j] \bigcap_{r=1, r \ne t}^{n_0 - 1} [X_r = k_r] \, [X_t = j] \,\Big|\, [X_0 = i]\right) \le P([X_t = j] \mid [X_0 = i]) = p^t_{ij} = 0,$$
which proves the claim. Thus, by the identity first noted above, $P([T_j(1) < \infty] \mid [X_0 = i]) \ge p^{n_0}_{ij} > 0$. Conversely, if $P([T_j(1) < \infty] \mid [X_0 = i]) > 0$, then by the same identity $P([X_n = j] \mid [X_0 = i]) > 0$ for some $n \ge 1$, i.e., $p^n_{ij} > 0$, so state $j$ is accessible from state $i$. Q.E.D.
3. Definition. If $\{X_n\}$ is a Markov chain, state $i$ is said to be recurrent if
$$P([T_i(1) < \infty] \mid [X_0 = i]) = 1.$$

4. Proposition. If $\{X_n\}$ is a Markov chain, then state $i$ is recurrent if and only if
$$P\left(\bigcup_{n=1}^{\infty} [X_n = i] \,\Big|\, [X_0 = i]\right) = 1.$$
Proof: In the proof of Proposition 2, we noted that
$$[T_i(1) < \infty] = \bigcup_{n \ge 1} \left( [X_n = i] \cap \bigcap_{r=1}^{n-1} [X_r \ne i] \right).$$
But it is also easily shown that
$$\bigcup_{n=1}^{\infty} [X_n = i] = \bigcup_{n \ge 1} \left( [X_n = i] \cap \bigcap_{r=1}^{n-1} [X_r \ne i] \right),$$
and hence
$$[T_i(1) < \infty] = \bigcup_{n=1}^{\infty} [X_n = i],$$
from which the proposition follows.
5. Notation. For every pair of elements $i, j$ in the state space $X$ of a Markov chain $\{X_n\}$, we shall denote
$$f^n_{ij} = P\left(\bigcap_{r=1}^{n-1} [X_r \ne j] \cap [X_n = j] \,\Big|\, [X_0 = i]\right)$$
and
$$f_{ij} = \sum_{n=1}^{\infty} f^n_{ij} = P\left(\bigcup_{n=1}^{\infty} [X_n = j] \,\Big|\, [X_0 = i]\right),$$
with $f^0_{ii} = 0$ for all $i \in X$ and $f^0_{ij} = 0$ for all pairs $i, j$ of states where $i \ne j$; also, $p^0_{ii} = 1$ for all $i \in X$, and $p^0_{ij} = 0$ for $i \ne j$.

6. Remark. If $\{X_n\}$ is a Markov chain, then
$$f^n_{ij} = P([T_j(1) = n] \mid [X_0 = i]).$$

7. Definition of Generating Functions. In the notation given above, we define the generating functions
$$F_{ij}(s) = \sum_{n=0}^{\infty} f^n_{ij} s^n$$
and
$$P_{ij}(s) = \sum_{n=0}^{\infty} p^n_{ij} s^n,$$
both of which converge for $|s| < 1$.

8. Proposition. For every state $i \in X$, and for every $n \ge 1$,
$$p^n_{ii} = \sum_{k=0}^{n} f^k_{ii} \, p^{n-k}_{ii},$$
and
$$P_{ii}(s) = \frac{1}{1 - F_{ii}(s)} \quad \text{for all } s \in (0, 1).$$

Proof: We first observe that
$$[X_n = i] = \bigcup_{k=1}^{n} [T_i(1) = k][X_n = i],$$
so that
$$p^n_{ii} = P([X_n = i] \mid [X_0 = i]) = P\left(\bigcup_{k=1}^{n} [T_i(1) = k][X_n = i] \,\Big|\, [X_0 = i]\right),$$
and since the union of intersections on the right side is a disjoint union, the above is
$$= \sum_{k=1}^{n} P\left(\bigcap_{j=1}^{k-1} [X_j \ne i] \, [X_k = i][X_n = i] \,\Big|\, [X_0 = i]\right)$$
$$= \sum_{k=1}^{n} P\left([X_n = i] \,\Big|\, [X_k = i] \bigcap_{j=1}^{k-1} [X_j \ne i] \, [X_0 = i]\right) P\left([X_k = i] \bigcap_{j=1}^{k-1} [X_j \ne i] \,\Big|\, [X_0 = i]\right)$$
$$= \sum_{k=1}^{n} P([X_n = i] \mid [X_k = i]) \, P\left([X_k = i] \bigcap_{j=1}^{k-1} [X_j \ne i] \,\Big|\, [X_0 = i]\right)$$
$$= \sum_{k=1}^{n} f^k_{ii} \, p^{n-k}_{ii}.$$
Now, since we agreed that $p^0_{ii} = 1$, we have
$$P_{ii}(s) = \sum_{n=0}^{\infty} p^n_{ii} s^n = 1 + \sum_{n=1}^{\infty} p^n_{ii} s^n,$$
and this, plus the fact that $f^0_{ii} = 0$, implies
$$P_{ii}(s) - 1 = \sum_{n=1}^{\infty} \left( \sum_{k=0}^{n} f^k_{ii} \, p^{n-k}_{ii} \right) s^n = \sum_{n=0}^{\infty} \sum_{k=0}^{n} (f^k_{ii} s^k)(p^{n-k}_{ii} s^{n-k}) = \left( \sum_{k=0}^{\infty} f^k_{ii} s^k \right) \left( \sum_{n=0}^{\infty} p^n_{ii} s^n \right) = F_{ii}(s) P_{ii}(s),$$
which gives the conclusion.
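The convolution identity in Proposition 8 can be verified numerically. In the sketch below the first-passage probabilities $f^n_{ii}$ are computed with a taboo-matrix recursion (that recursion is an assumption of this sketch, not a construction from these notes):

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.3, 0.7]])  # hypothetical chain
i, N = 0, 12

# p^n_ii for n = 0..N
p = [np.linalg.matrix_power(P, n)[i, i] for n in range(N + 1)]

# f^n_ii: paths that avoid i at times 1..n-1 and hit i at time n.
# Zeroing column i of P kills any path that re-enters i too early.
Q = P.copy()
Q[:, i] = 0.0
f = [0.0]  # f^0_ii = 0, as agreed in Notation 5
for n in range(1, N + 1):
    f.append((np.linalg.matrix_power(Q, n - 1) @ P)[i, i])

# Proposition 8: p^n_ii = sum_{k=0}^n f^k_ii p^{n-k}_ii for n >= 1
for n in range(1, N + 1):
    assert np.isclose(p[n], sum(f[k] * p[n - k] for k in range(n + 1)))
```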

9. Proposition. If $i$ and $j$ are distinct states, then
$$p^n_{ij} = \sum_{k=0}^{n} f^k_{ij} \, p^{n-k}_{jj}$$
and
$$P_{ij}(s) = F_{ij}(s) P_{jj}(s)$$
for all $s \in (-1, 1)$.
Proof: This proof is the same as the proof of Proposition 8, except in this case one must use the agreement that $f^0_{ij} = p^0_{ij} = 0$.

10. Proposition. The following three statements for a Markov chain
are equivalent: for every state i ∈ X,
(i) state i is recurrent,
(ii) f_ii = 1, and
(iii) Σ_{n=0}^∞ p_ii^n = ∞.
Proof: Since

[T_i(1) < ∞] = ∪_{n≥1} ([X_n = i] ∩_{r=1}^{n-1} [X_r ≠ i]),

it follows that

P([T_i(1) < ∞] | [X_0 = i]) = P(∪_{n≥1} ([X_n = i] ∩_{r=1}^{n-1} [X_r ≠ i]) | [X_0 = i]) = f_ii,

which proves that (i) holds if and only if (ii) is true. We now prove that (ii)
is equivalent to (iii). We first prove

Claim: lim_{s↑1} Σ_{n=1}^∞ f_ii^n s^n = Σ_{n=1}^∞ f_ii^n and lim_{s↑1} Σ_{n=1}^∞ p_ii^n s^n = Σ_{n=1}^∞ p_ii^n.

Let 0 < s_k ↑ 1 as k → ∞. If one applies the Lebesgue monotone convergence
theorem to the measures {f_ii^n, n = 1, 2, ...} and {p_ii^n, n = 1, 2, ...} over the space
{1, 2, ...} with respect to the functions f_k(n) = s_k^n, k = 1, 2, ..., the claim
is established. So, by the above claim and Proposition 8, f_ii = Σ_{n=1}^∞ f_ii^n = 1
if and only if

lim_{s↑1} P_ii(s) = lim_{s↑1} 1 / (1 - F_ii(s)) = Σ_{n=0}^∞ p_ii^n = ∞,

which concludes the proof.
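Criterion (iii) can be seen at work on an illustrative transient chain not discussed in the notes: the asymmetric simple random walk on the integers with up-probability 0.7. There p_00^(2n) = C(2n, n)(0.7·0.3)^n and p_00^(2n+1) = 0, and the classical identity Σ_n C(2n, n) x^n = 1/√(1 - 4x) makes the return-probability series summable:

```python
# For the asymmetric simple random walk (up-probability p = 0.7),
#   sum_{n>=0} p_00^n = sum_n C(2n, n) (p*q)^n = 1 / sqrt(1 - 4 p q),
# which is finite, so by Proposition 10 state 0 is NOT recurrent.
import math

p, q = 0.7, 0.3
total = sum(math.comb(2 * n, n) * (p * q) ** n for n in range(300))
```

Here 1/√(1 - 4·0.21) = 1/√0.16 = 2.5, so the series converges and the walk is transient; with p = q = 1/2 the same series diverges, matching recurrence of the symmetric walk.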

11. Corollary. If {X_n} is an irreducible Markov chain determined by
state space X, initial distribution {λ(i) : i ∈ X} and stochastic matrix
P = (p_ij), and if any one state is recurrent, then all states are recurrent. (In
which case we may refer to the chain as a recurrent chain.)
Proof: Let i and j be any two states in X, and suppose that state i is
recurrent. Since the chain is irreducible, these two states communicate
with each other, so there exist positive integers m and n such that p_ij^m > 0
and p_ji^n > 0. Since P^{n+k+m} = P^n P^k P^m, it follows that

p_jj^{n+k+m} = Σ_{u∈X, v∈X} p_ju^n p_uv^k p_vj^m ≥ p_ji^n p_ii^k p_ij^m = c p_ii^k,

where c = p_ji^n p_ij^m > 0. Since state i is recurrent, then by Proposition 10,
Σ_{k=1}^∞ p_ii^k = ∞, which combined with the above display implies

Σ_{k=1}^∞ p_jj^{n+k+m} = ∞.

Applying Proposition 10 to this last equation implies that state j is recurrent.

12. Proposition. If {X_n} is a recurrent Markov chain, then the random
variables

{T_i(k) - T_i(k-1), k ≥ 1}

are conditionally independent and identically distributed with respect to the
probability measure P(· | [X_0 = i]).
Proof: For ease of notation, we shall do the proof only for T_i(1) - T_i(0) =
T_i(1) and T_i(2) - T_i(1); the proof given suggests the proof in the general case.
Letting u, v be any positive integers, and applying theorem 7 in Part I, we
have

P([T_i(1) = u][T_i(2) - T_i(1) = v] | [X_0 = i])
= P(∩_{k=1}^{u-1} [X_k ≠ i][X_u = i] ∩_{m=u+1}^{u+v-1} [X_m ≠ i][X_{u+v} = i] | [X_0 = i])
= Σ Σ P(∩_{k=1}^{u-1} [X_k = j_k][X_u = i] ∩_{m=u+1}^{u+v-1} [X_m = r_m][X_{u+v} = i] | [X_0 = i])
= Σ Σ p_{i j_1} p_{j_1 j_2} ··· p_{j_{u-1} i} p_{i r_{u+1}} p_{r_{u+1} r_{u+2}} ··· p_{r_{u+v-1} i},

where the above sums are taken over all elements j_1, ..., j_{u-1}, r_{u+1}, ..., r_{u+v-1}
in X that are not equal to i. It is easily recognized that this last sum is the
product of two sums, namely,

= (Σ p_{i j_1} p_{j_1 j_2} ··· p_{j_{u-1} i})(Σ p_{i r_{u+1}} p_{r_{u+1} r_{u+2}} ··· p_{r_{u+v-1} i})
= P([T_i(1) = u] | [X_0 = i]) P([T_i(1) = v] | [X_0 = i]).

Now sum both sides of this equation with respect to u, and we obtain

P([T_i(2) - T_i(1) = v] | [X_0 = i]) = P([T_i(1) = v] | [X_0 = i]),

thus proving the two random variables to be conditionally identically
distributed. Substituting this last quantity into the earlier one, we get

P([T_i(1) = u][T_i(2) - T_i(1) = v] | [X_0 = i])
= P([T_i(1) = u] | [X_0 = i]) P([T_i(2) - T_i(1) = v] | [X_0 = i]),

which establishes independence. The equation that established conditional
identical distribution also implies P([T_i(2) - T_i(1) < ∞] | [X_0 = i]) = 1 and
therefore P([T_i(2) < ∞] | [X_0 = i]) = 1. Q.E.D.

13. Theorem. If {X_n, n ≥ 0} is an irreducible, recurrent Markov chain
determined by (λ, X, P), if f : X → R^1 is any function, and if, for i ∈ X,
random variables {Z_n, n ≥ 1} are defined by

Z_n = Σ_{k=T_i(n-1)+1}^{T_i(n)} f(X_k),

then {Z_n} are independent and identically distributed with respect to
P(· | [X_0 = i]).
Proof: This is proved in much the same way as Proposition 12. Again we
prove it for Z_1 and Z_2; the general proof has only more notation. Let a and
b be any real numbers. Then

P([Z_1 ≤ a][Z_2 ≤ b] | [X_0 = i])
= Σ_{r≥1, s≥1} Σ P([X_{r+s} = i] ∩_{q=r+1}^{r+s-1} [X_q = i_q][X_r = i] ∩_{t=1}^{r-1} [X_t = i_t] | [X_0 = i]),

where the inner sum is taken over all (r+s-2)-tuplets,

(i_1, ..., i_{r-1}, i_{r+1}, ..., i_{r+s-1}),

of elements in X \ {i} that satisfy

f(i_1) + ··· + f(i_{r-1}) + f(i) ≤ a and
f(i_{r+1}) + ··· + f(i_{r+s-1}) + f(i) ≤ b.

As in the proof of Proposition 12, the above double sum becomes

= Σ_{r≥1, s≥1} Σ p_{i i_1} p_{i_1 i_2} ··· p_{i_{r-1} i} p_{i i_{r+1}} ··· p_{i_{r+s-1} i}
= (Σ_{r≥1} Σ p_{i i_1} ··· p_{i_{r-1} i})(Σ_{s≥1} Σ p_{i i_{r+1}} ··· p_{i_{r+s-1} i})
= P([Z_1 ≤ a] | [X_0 = i]) P([Z_1 ≤ b] | [X_0 = i]).

Now let a ↑ ∞ on both sides, from which we get

P([Z_2 ≤ b] | [X_0 = i]) = P([Z_1 ≤ b] | [X_0 = i]),

which establishes independence and identical distributions with respect to
P(· | [X_0 = i]).

14. Corollary. If {X_n} is an irreducible, recurrent Markov chain, then
for every i ∈ X,

P([X_n = i infinitely often] | [X_0 = i]) = 1.

Proof: Denote A = [X_n = i infinitely often]; then clearly

A = ∩_{k=1}^∞ [T_i(k) < ∞] = ∩_{k=1}^∞ [T_i(k) - T_i(k-1) < ∞].

By Proposition 12, {T_i(k) - T_i(k-1), k ≥ 1} are independent and identically
distributed with respect to P(· | [X_0 = i]). By the hypothesis of recurrence,
P([T_i(k) - T_i(k-1) < ∞] | [X_0 = i]) = 1 for k = 1 and hence for all k ≥ 1,
which implies P(A | [X_0 = i]) = 1. Q.E.D.

Exercises for Part III.

1. Refer to problem 2 in the exercises for Part II. Suppose that in the
outcome of 16 plays of the game you observed this table:

        B    B^c
  A     4    2
  A^c   1    9

In much more complicated cases than this, we may feel that having both A
and B occur in as many as 4 trials out of 16 is too large relative to the other
cell counts for the hypothesis of independence. So what we might wish to do
is compute the conditional probability P([X_11 ≥ 4] | [X_{1·} = 6][X_{·1} = 5]) to see
if this "is possible". From your answer last time, this conditional probability
is 0.0357. You might simulate this game until you obtain 10,000 games in
which the first row sum is equal to 6 and the first column sum is equal to 5.
As these occur in your program, find the relative frequency with which the
event [X_11 ≥ 4] occurs. To do so, you might first take α = 6/16 and β = 5/16.
Then you might obtain empirical evidence that this conditional probability
does not depend on α and β by taking α = 0.5 and β = 0.5.
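A minimal sketch of this simulation follows. The choices α = β = 0.5, the random seed, and the reduced number of conditioned games are illustrative, not prescribed by the exercise. (Under independence, conditioning on the margins makes X_11 hypergeometric, so the exact value is 156/4368 ≈ 0.0357, agreeing with the figure quoted above.)

```python
# Monte Carlo estimate of P(X_11 >= 4 | first row sum = 6, first
# column sum = 5): simulate 16 independent plays in which A occurs
# with probability alpha and B, independently, with probability beta,
# keeping only games whose margins match.
import random

random.seed(0)
alpha, beta = 0.5, 0.5
WANT = 2000                        # conditioned games to collect
hits = kept = 0
while kept < WANT:
    a = [random.random() < alpha for _ in range(16)]
    b = [random.random() < beta for _ in range(16)]
    if sum(a) == 6 and sum(b) == 5:
        kept += 1
        x11 = sum(1 for i in range(16) if a[i] and b[i])
        if x11 >= 4:
            hits += 1
estimate = hits / kept
```

Rerunning with α = 6/16 and β = 5/16 should give essentially the same estimate, which is the empirical evidence of invariance the exercise asks for.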

2. One can also create a Markov chain that we shall use later on. The
state space is

X = {s(i) : 0 ≤ i ≤ 5},

where s(i) denotes the table

            B      B^c
  A         i      6-i
  A^c       5-i    5+i

The mechanism for this Markov chain is as follows. Select the state s(i) with
probability

π(s(i)) = P([X_11 = i] | [X_{1·} = 6][X_{·1} = 5]), 0 ≤ i ≤ 5.

If 1 ≤ i ≤ 4, then take p_{s(i)s(i+1)} = 1/2 and p_{s(i)s(i-1)} = 1/2, but take
p_{s(0)s(0)} = 1/2, p_{s(0)s(1)} = 1/2, p_{s(5)s(4)} = 1/2 and p_{s(5)s(5)} = 1/2, with all
the remaining entries in the stochastic matrix taken as zeros.
(i) Compute π(s(i)) for all six states in the state space.
(ii) Display the stochastic matrix associated with this Markov chain.
(iii) Find P([X_k = s(i)]) for 0 ≤ i ≤ 5 and for k = 1 and 2.
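The objects in parts (i) and (ii) can be written down directly; the sketch below assumes (as in exercise 1) that under independence the conditional law of X_11 given the margins is hypergeometric, and builds the reflecting random-walk matrix described above.

```python
# Part (i): pi(s(i)) = C(6, i) C(10, 5 - i) / C(16, 5), the
# hypergeometric probabilities given row sum 6 and column sum 5.
# Part (ii): the nearest-neighbour kernel with holding probability
# 1/2 at the two boundary states s(0) and s(5).
import math

pi = [math.comb(6, i) * math.comb(10, 5 - i) / math.comb(16, 5)
      for i in range(6)]

P = [[0.0] * 6 for _ in range(6)]
for i in range(6):
    P[i][min(i + 1, 5)] += 0.5     # step up (held at s(5))
    P[i][max(i - 1, 0)] += 0.5     # step down (held at s(0))
```

Part (iii) then reduces to the matrix products π^t P and π^t P^2.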

Part IV. Invariant Measures.

1. Definition. An infinite sequence {X_n : n ≥ 0} of random variables
is said to be (strictly) stationary if for all positive integers n and k, the joint
distribution of {X_0, X_1, ..., X_n} is the same as that of {X_k, X_{k+1}, ..., X_{k+n}}.

2. Definition. If {X_n} is a Markov chain with stochastic matrix
P = (p_ij) and state space X, and if

π = (π_1, π_2, ...)^t

is a probability distribution over X, then π is called a stationary distribution
for {X_n} if π_i > 0 for all i ∈ X and π^t = π^t P, i.e., π_k = Σ_{i∈X} π_i p_ik for
all k ∈ X. (π, as well as other measures over X, is a vertical vector whose
coordinates are ordered in the same manner as the rows and columns of X.)

3. Proposition. If {X_n} is a Markov chain with stochastic matrix
P = (p_ij) and state space X, and if the initial distribution π = (π_1, π_2, ...)
is a stationary distribution for {X_n}, then the Markov chain is strictly
stationary.
Proof: We observe that

P([X_1 = i]) = Σ_{j∈X} P([X_1 = i] | [X_0 = j]) P([X_0 = j])
= Σ_{j∈X} π_j p_ji = π_i
= P([X_0 = i]).

Continuing inductively, one can prove P([X_n = i]) = π_i for all n ≥ 1 and all
i ∈ X. From this we obtain for all positive integers m and n,

P(∩_{k=m}^{m+n} [X_k = i_k]) = P([X_m = i_m]) Π_{t=m+1}^{m+n} P([X_t = i_t] | ∩_{q=m}^{t-1} [X_q = i_q])
= π_{i_m} Π_{t=m+1}^{m+n} p_{i_{t-1} i_t}
= P([X_0 = i_m]) Π_{t=m+1}^{m+n} P([X_{t-m} = i_t] | ∩_{q=m}^{t-1} [X_{q-m} = i_q])
= P(∩_{k=m}^{m+n} [X_{k-m} = i_k]). Q.E.D.

4. Definition. Let P = (p_ij) be a stochastic matrix with state space X,
and let

ν = (ν_1, ν_2, ...)^t

be a measure defined over the state space with 0 ≤ ν_i ≤ ∞ for all i ∈ X.
(In other words, ν is a nonnegative function that assigns to each i ∈ X a
number ν_i.) Then ν is called an invariant measure over X if ν^t = ν^t P, i.e.,
ν_k = Σ_{i∈X} ν_i p_ik for all k ∈ X. (Like π, the measure ν is regarded as a
vertical vector indexed by the rows of X.)

5. Proposition. If ν is an invariant measure for an irreducible, recurrent
Markov chain, and if ν_i < ∞ for at least one i ∈ X, then ν_j < ∞ for all
j ∈ X.
Proof: Let j ∈ X, j ≠ i. By hypothesis, the chain is irreducible, so there
exists a positive integer m such that p_ji^m > 0. Since by hypothesis ν is an
invariant measure for the chain, it follows that

ν^t = ν^t P = ··· = ν^t P^m,

from which there follows

ν_i = Σ_{k∈X} ν_k p_ki^m ≥ ν_j p_ji^m.

From the finiteness of ν_i and from p_ji^m > 0, it follows that ν_j < ∞.

6. Theorem. If {X_n, n ≥ 0} is a recurrent and irreducible Markov
chain, if i ∈ X, and if μ is a measure over X defined for every j ∈ X by

μ_j = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i]),

then
(1) μ_i = 1,
(2) μ is a finite invariant measure, and
(3) μ_j = Σ_{n=0}^∞ P([X_n = j][T_i(1) > n] | [X_0 = i]) for all j ∈ X.
Proof: Note that in this theorem and its proof, the measure μ is a function
of i. We first prove that (3) is true. By the monotone convergence theorem,

μ_j = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i])
= E(Σ_{k=1}^∞ Σ_{n=0}^{k-1} I_{[X_n = j][T_i(1) = k]} | [X_0 = i])
= Σ_{k=1}^∞ Σ_{n=0}^{k-1} P([X_n = j][T_i(1) = k] | [X_0 = i])
= Σ_{n=0}^∞ Σ_{k=n+1}^∞ P([X_n = j][T_i(1) = k] | [X_0 = i])
= Σ_{n=0}^∞ P(∪_{k=n+1}^∞ [X_n = j][T_i(1) = k] | [X_0 = i])
= Σ_{n=0}^∞ P([X_n = j][T_i(1) > n] | [X_0 = i]),

which proves (3). We now wish to verify (1): μ_i = 1. Over the event
[X_0 = i], the only index n with 0 ≤ n ≤ T_i(1) - 1 for which X_n = i is n = 0,
since T_i(1) is the time of the first return to i. Hence

μ_i = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = i]} | [X_0 = i]) = E(I_{[X_0 = i]} | [X_0 = i]) = 1,

which proves (1). We now wish to prove (2): μ_j = Σ_{k∈X} μ_k p_kj
for all j ∈ X. We first prove this to be true for j ≠ i. By (3), and since the
n = 0 term vanishes when j ≠ i,

μ_j = Σ_{n=1}^∞ P([X_n = j][T_i(1) > n] | [X_0 = i]).

Since j ≠ i, the event [X_n = j] already excludes [T_i(1) = n], so that
[X_n = j][T_i(1) > n] = [X_n = j][T_i(1) ≥ n]. Thus

μ_j = P([X_1 = j][T_i(1) ≥ 1] | [X_0 = i]) + Σ_{n=2}^∞ P([X_n = j][T_i(1) ≥ n] | [X_0 = i]).

Since [T_i(1) ≥ 1] is the sure event, the first term on the right is
P([X_1 = j] | [X_0 = i]) = p_ij. Decomposing the n-th term over the value
k of X_{n-1} (necessarily k ≠ i on [T_i(1) ≥ n]),

μ_j = p_ij + Σ_{k∈X, k≠i} Σ_{n=2}^∞ P([X_n = j] | [X_{n-1} = k][T_i(1) ≥ n][X_0 = i]) Q(k, n),

where Q(k, n) = P([T_i(1) ≥ n][X_{n-1} = k] | [X_0 = i]). But for n ≥ 2 and
k ≠ i,

[T_i(1) ≥ n][X_{n-1} = k][X_0 = i] = [X_1 ≠ i] ··· [X_{n-2} ≠ i][X_{n-1} = k][X_0 = i],

so by the technical theorems of Part I the conditional probability above
equals p_kj. Since μ_i = 1, substituting m = n - 1 we have

μ_j = μ_i p_ij + Σ_{k∈X, k≠i} p_kj Σ_{m=1}^∞ P([X_m = k][T_i(1) > m] | [X_0 = i]).

Since k ≠ i, the m = 0 term of the inner sum vanishes, so we may start the
inner sum at m = 0. Thus, by (3),

μ_j = μ_i p_ij + Σ_{k∈X, k≠i} p_kj μ_k = Σ_{k∈X} μ_k p_kj,

which establishes (2) for j ≠ i. Last we wish to prove (2) for j = i. But first
we establish a claim.
Claim. If j ≠ i, if n ≥ 1 and if P([X_n = j][T_i(1) > n][X_0 = i]) > 0, then

p_ji = P([X_{n+1} = i] | [X_n = j][T_i(1) > n][X_0 = i]).

Proof: By the technical theorems we obtained in Part I, and since j ≠ i,

P([X_{n+1} = i] | [X_n = j][T_i(1) > n][X_0 = i])
= P([X_{n+1} = i] | [X_n = j] (∩_{q=1}^n [X_q ≠ i]) [X_0 = i])
= P([X_{n+1} = i] | [X_n = j] (∩_{q=1}^{n-1} [X_q ≠ i]) [X_0 = i])
= P([X_{n+1} = i] | [X_n = j]) = p_ji,

which proves the claim. Continuing our proof of (2) for j = i: since in (1)
we established that μ_i = 1, it is therefore sufficient to prove Σ_{k∈X} μ_k p_ki = 1.
The k = i term is μ_i p_ii = p_ii. For the remaining terms, applying (3), the
above claim and the hypothesis of recurrence, we have

Σ_{k∈X, k≠i} μ_k p_ki
= Σ_{k∈X, k≠i} Σ_{n=0}^∞ P([X_n = k][T_i(1) > n] | [X_0 = i]) p_ki
= Σ_{k∈X, k≠i} Σ_{n=0}^∞ P([X_n = k][T_i(1) > n] | [X_0 = i]) P([X_{n+1} = i] | [X_n = k][T_i(1) > n][X_0 = i])
= Σ_{n=0}^∞ Σ_{k∈X, k≠i} P([X_n = k][T_i(1) > n][X_{n+1} = i] | [X_0 = i])
= Σ_{n=1}^∞ P([T_i(1) = n + 1] | [X_0 = i])
= P([T_i(1) ≥ 2] | [X_0 = i]) = 1 - p_ii;

here the n = 0 terms vanish because X_0 = i, for n ≥ 1 the event
[X_n ≠ i][T_i(1) > n][X_{n+1} = i] is exactly [T_i(1) = n + 1], and the last
equality uses P([T_i(1) < ∞] | [X_0 = i]) = 1. Hence

Σ_{k∈X} μ_k p_ki = p_ii + (1 - p_ii) = 1 = μ_i.

Finally, the measure μ is finite by (1) and Proposition 5. Q.E.D.

7. Definition. A state i of a Markov chain {X_n : n ≥ 0} is said to be
positive recurrent if E(T_i(1) | [X_0 = i]) < ∞.

8. Theorem. If {X_n} is a recurrent, irreducible Markov chain, and if state
i ∈ X is a positive recurrent state, then the measure {π_j}, defined by

π_j = μ_j / E(T_i(1) | [X_0 = i]) = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i]) / E(T_i(1) | [X_0 = i]),

is a stationary distribution.
Proof: By Theorem 6, we need only prove that Σ_{j∈X} π_j = 1, i.e., that
Σ_{j∈X} μ_j = E(T_i(1) | [X_0 = i]). We shall make use of this well-known fact:
if U is a nonnegative integer-valued random variable with finite expectation,
then

E(U) = Σ_{n=0}^∞ P([U > n]).

Making use of the Lebesgue monotone convergence theorem, we obtain

Σ_{j∈X} μ_j = Σ_{j∈X} E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i])
= E(Σ_{j∈X} Σ_{k=1}^∞ Σ_{n=0}^{k-1} I_{[X_n = j][T_i(1) = k]} | [X_0 = i])
= E(Σ_{n=0}^∞ Σ_{k=n+1}^∞ Σ_{j∈X} I_{[X_n = j][T_i(1) = k]} | [X_0 = i])
= E(Σ_{n=0}^∞ Σ_{k=n+1}^∞ I_{[T_i(1) = k]} | [X_0 = i])
= Σ_{n=0}^∞ P([T_i(1) > n] | [X_0 = i]) = E(T_i(1) | [X_0 = i]),

from which the conclusion follows.
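Theorem 8 can be watched numerically: simulate many excursions from a state i, accumulate the occupation counts of one excursion, and divide by the accumulated return times. The two-state matrix, the seed, and the number of excursions below are illustrative assumptions; for this matrix the stationary distribution solves π^t P = π^t and equals (4/7, 3/7).

```python
# Simulation sketch of Theorem 8: occupation times of excursions from
# state 0, normalized by the total return time, approximate the
# stationary distribution pi = (4/7, 3/7) of P.
import random

random.seed(1)
P = [[0.7, 0.3], [0.4, 0.6]]

def step(state):
    return 0 if random.random() < P[state][0] else 1

visits = [0.0, 0.0]     # visits per state over all excursions
steps = 0               # total of the excursion lengths T_0(1)
for _ in range(20000):
    state = 0
    while True:
        visits[state] += 1          # counts X_n for n = 0 .. T_0(1)-1
        state = step(state)
        steps += 1
        if state == 0:              # first return to 0: excursion over
            break
pi_hat = [v / steps for v in visits]
```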

9. Lemma. Let P be a stochastic matrix, and let ^rP denote P with the
column for state r ∈ X replaced by zeros. Let us denote (^rP)^n = (^r q_ij^n). If
j ≠ r, then

^r q_ij^n = P([X_n = j][T_r(1) > n] | [X_0 = i]).

Proof: First note that P^n = (p_ij^n), where

p_ij^n = Σ_{k_1∈X} ··· Σ_{k_{n-1}∈X} p_{i k_1} p_{k_1 k_2} ··· p_{k_{n-1} j}.

Since the r-th column of ^rP contains all zeros, then

^r q_ij^n = Σ_{k_1≠r} ··· Σ_{k_{n-1}≠r} p_{i k_1} p_{k_1 k_2} ··· p_{k_{n-1} j}
= P([X_1 ≠ r] ··· [X_{n-1} ≠ r][X_n = j] | [X_0 = i])
= P([X_n = j][T_r(1) > n] | [X_0 = i]). Q.E.D.

10. Theorem. Let {X_n} be a recurrent and irreducible Markov chain
with stochastic matrix P. If μ and ν are invariant measures over X, each of
which is positive and finite at some state, then there exists c > 0 such that
0 < ν_i = c μ_i < ∞ for all i ∈ X.
Proof: Let i ∈ X be an arbitrary fixed state. By Theorem 6, if μ is defined
by

μ_j = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i])

for all j ∈ X, then μ is an invariant measure and μ_i = 1. It is enough to show
that any invariant measure ν satisfying the hypothesis is a positive multiple
of this μ. So let ν be such a measure.
Claim 1: ν_k > 0 for all k ∈ X. Indeed, by hypothesis there exists a state
j ∈ X such that ν_j > 0. Then

ν^t = ν^t P = ··· = ν^t P^m

for all positive integers m. By the hypothesis of irreducibility, for any k ∈ X
there exists a positive integer m such that p_jk^m > 0. Thus

ν_k = Σ_{u∈X} ν_u p_uk^m ≥ ν_j p_jk^m > 0,

which proves the claim. This claim and Proposition 5 imply 0 < ν_k < ∞ for
all k ∈ X. Now fix the state i that defines the measure μ. By Theorem 6,
μ_i = 1. Since the product of any positive constant with ν is still an invariant
measure, we may assume without loss of generality that ν_i = 1.
Claim 2: ν_j ≥ μ_j for all j ∈ X. Indeed, let ^iP = (^i p_rs) denote P with the
column determined by state i replaced by zeros. We first verify that

ν_j = δ_ij + Σ_{k∈X} ν_k ^i p_kj;

indeed, if j = i, then δ_ii = 1 and ^i p_ki = 0 (so in this case the right side
is 1 = ν_i), and if j ≠ i, then δ_ij = 0 and ^i p_kj = p_kj (and in this case, since
ν is an invariant measure, ν_j = Σ_{k∈X} ν_k p_kj). Next recall by Theorem 6 that

μ_j = Σ_{n=0}^∞ P([X_n = j][T_i(1) > n] | [X_0 = i]).

Then by Lemma 9, ^i p_ij^n = P([X_n = j][T_i(1) > n] | [X_0 = i]) for j ≠ i. So
μ_i = 1 and, for j ≠ i, μ_j = Σ_{n=0}^∞ ^i p_ij^n. So denote δ_i as the vertical vector
with all zeros except for a 1 in the i-th place. Then the identity above may
be written

ν^t = δ_i^t + ν^t (^iP).

From this last identity,

ν^t = δ_i^t + (δ_i^t + ν^t (^iP)) (^iP)
= δ_i^t + δ_i^t (^iP) + ν^t (^iP)^2,

and continuing this substitution of ν^t = δ_i^t + ν^t (^iP), we get at stage N,

ν^t = Σ_{n=0}^N δ_i^t (^iP)^n + ν^t (^iP)^{N+1} ≥ Σ_{n=0}^N δ_i^t (^iP)^n,

where this last inequality is meant in a coordinatewise sense. So let N → ∞,
and we get

ν^t ≥ Σ_{n=0}^∞ δ_i^t (^iP)^n coordinatewise.

Since we established above that μ_j = Σ_{n=0}^∞ ^i p_ij^n for j ≠ i, while the i-th
coordinate of the right side is ^i p_ii^0 = 1 = μ_i, we obtain that ν_j ≥ μ_j for all
j ≠ i, and in addition ν_i = μ_i = 1. This verifies Claim 2. But from Claim 2,
ν - μ is nonnegative and is also an invariant measure. If, for any j ∈ X,
ν_j - μ_j > 0, then by Claim 1, ν_k - μ_k > 0 for all states k. But this
contradicts the fact that ν_i - μ_i = 0. Hence the two invariant measures are
identical, and, undoing the normalization, every invariant measure satisfying
the hypothesis is a positive multiple of μ. Q.E.D.

11. Theorem. If {X_n} is a Markov chain with state space X, stochastic
matrix P = (p_ij) and initial distribution λ, and if the following three
conditions are satisfied,
(i) the Markov chain is irreducible,
(ii) the Markov chain is aperiodic, and
(iii) X is finite,
then the Markov chain is positive recurrent.
Proof: The first big limit theorem in Part II and its first corollary state
that the hypotheses here are sufficient for there to exist an invariant
probability measure π for the chain such that P^n → Π as n → ∞, where each
row of Π is π^t. We first claim that the chain is recurrent. Indeed, by
Proposition 10 in Part III, state i ∈ X is recurrent if and only if f_ii = 1 if and
only if Σ_{n=1}^∞ p_ii^n = ∞. Suppose to the contrary that there exists a state
i ∈ X which is not recurrent. Then Σ_{n=1}^∞ p_ii^n < ∞. This implies that
p_ii^n → 0 as n → ∞. But since P^n → Π as n → ∞ as established above, this
means that p_ii^n → π_i. Since π is a probability measure and the chain is
irreducible, π_k > 0 for all k ∈ X (as in Claim 1 in the proof of Theorem 10
above), supplying us with a contradiction. Thus every state in the chain is
recurrent. We now wish to prove that the chain is positive recurrent, in other
words, to prove that E(T_i(1) | [X_0 = i]) < ∞ for any state i. So take any
state i. Since it is recurrent, if the measure μ is defined by

μ_j = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i])

for every j ∈ X, then μ is a finite invariant measure with μ_i = 1, so that μ_j is
finite and positive for all states j. Further, by the uniqueness theorem above,
there exists a finite c > 0 such that π_j = c μ_j for all j ∈ X. By hypothesis,
the state space is finite, so Σ_{j∈X} μ_j < ∞. Thus

∞ > Σ_{j∈X} E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i])
= E(Σ_{n=0}^{T_i(1)-1} Σ_{j∈X} I_{[X_n = j]} | [X_0 = i])
= E(Σ_{n=0}^{T_i(1)-1} 1 | [X_0 = i]) = E(T_i(1) | [X_0 = i]),

which proves the theorem.
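The limit P^n → Π invoked in this proof is easy to observe directly. The 3×3 matrix below is an illustrative finite, irreducible, aperiodic chain (an assumption of this sketch, not an example from the notes); after enough multiplications all rows of P^n agree and are strictly positive, so no p_ii^n can tend to 0.

```python
# Watching P^n converge to a matrix Pi whose rows are all pi^t,
# for an illustrative finite, irreducible, aperiodic chain.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.3, 0.3, 0.4]]

def mat_mul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

Pn = P
for _ in range(200):        # Pn is now P^201
    Pn = mat_mul(Pn, P)
```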

12. Theorem. Suppose a Markov chain {X_n} is irreducible and positive
recurrent. For i ∈ X and all j ∈ X define

μ_j = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i]).

Then π defined by

π_j = μ_j / E(T_i(1) | [X_0 = i]) = E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i]) / E(T_i(1) | [X_0 = i])

is the unique stationary distribution for the chain.
Proof: Theorems 8 and 10.

Exercises for Part IV

1. Refer to problem 2 in the exercises for Part III. There we discovered
that the distribution of X_0 was not the same as the distribution of X_1, and
so the Markov chain was not stationary. In certain applications to be met
later we shall need to change the stochastic matrix so that the original π(·)
that we took is an invariant measure. Here is the description of one such
revision. Suppose at time 0 the event [X_0 = s(i)] occurs. Sample from the six
states according to the distribution p_{s(i)s(0)}, p_{s(i)s(1)}, ..., p_{s(i)s(5)}. If the
outcome is s(i), then decide that [X_1 = s(i)] has occurred. But if the outcome
is s(j) for some j ≠ i, then a decision must be made whether to declare that
[X_1 = s(j)] has occurred or that [X_1 = s(i)] has occurred, and it is made as
follows. Take a random number in [0, 1), call it U, and define

α(s(i), s(j)) = min{1, (π(s(j)) p_{s(j)s(i)}) / (π(s(i)) p_{s(i)s(j)})}.

If [U < α(s(i), s(j))] occurs, then decide that [X_1 = s(j)] has occurred;
otherwise, decide that [X_1 = s(i)] has occurred. Repeat all of the above for
X_2, X_3, ....
(i) Display the stochastic matrix for this altered Markov chain.
(ii) Compute P([X_0 = s(i)]) for 0 ≤ i ≤ 5, and determine whether
P([X_0 = s(i)]) = P([X_1 = s(i)]) for 0 ≤ i ≤ 5.
(iii) Verify that this Markov chain is irreducible, is aperiodic, has a finite
state space and is positive recurrent.

Part V. The Ergodic Theorem.

1. Lemma. If {X_n} is a sequence of nonnegative random variables, and
if Σ_{n=0}^∞ X_n converges almost surely, then for every ε > 0,

P([X_n ≥ ε infinitely often]) = 0.

Proof: Suppose the conclusion is not true. Then there exists an ε > 0
such that

P([X_n ≥ ε infinitely often]) > 0.

Let ω ∈ [X_n ≥ ε infinitely often]; then Σ_{n=0}^∞ X_n(ω) = ∞ over this set of
positive probability, which contradicts the hypothesis. Q.E.D.

2. Lemma. If {x_n} is a sequence of real numbers such that x_n → ∞ as
n → ∞, and if

r_n = max{k : x_k ≤ n},

then r_n → ∞ as n → ∞.
Proof: First note that {r_n} is nondecreasing, since

{k : x_k ≤ n} ⊆ {k : x_k ≤ n + 1}.

(Note also that, because x_k → ∞, the set {k : x_k ≤ n} is finite, so r_n is well
defined for all sufficiently large n.) Thus it is sufficient to prove that {r_n} is
unbounded. If, to the contrary, r_n ≤ k_0 for all n, then for every k > k_0 we
would have x_k > n for every n, which is impossible for a real number x_k.
3. Theorem (Strong Law of Large Numbers). If {X_n} is a sequence
of independent, identically distributed random variables, and if X̄_n = (X_1 +
··· + X_n)/n, then P([X̄_n converges]) = 1 if and only if E|X_1| < ∞, in which
case P([X̄_n → E(X_1) as n → ∞]) = 1.

Theorem 3, due to A. N. Kolmogorov, is a classical theorem in measure-theoretic
probability.

4. Lemma. If {X_n} is a Markov chain that is irreducible and positive
recurrent with unique stationary distribution {π_i, i ∈ X}, and if f : X → R^1
is a function that satisfies Σ_{i∈X} |f(i)| π_i < ∞, then

Σ_{i∈X} f(i) π_i = E(Σ_{n=1}^{T_i(1)} f(X_n) | [X_0 = i]) / E(T_i(1) | [X_0 = i])

for all i ∈ X.
Proof: By Theorem 12 in Part IV,

Σ_{j∈X} f(j) π_j = Σ_{j∈X} f(j) E(Σ_{n=0}^{T_i(1)-1} I_{[X_n = j]} | [X_0 = i]) / E(T_i(1) | [X_0 = i])
= E(Σ_{j∈X} Σ_{n=0}^{T_i(1)-1} f(j) I_{[X_n = j]} | [X_0 = i]) / E(T_i(1) | [X_0 = i])
= E(Σ_{k=1}^∞ I_{[T_i(1) = k]} Σ_{j∈X} Σ_{n=0}^{k-1} f(j) I_{[X_n = j]} | [X_0 = i]) / E(T_i(1) | [X_0 = i])
= E(Σ_{k=1}^∞ I_{[T_i(1) = k]} Σ_{n=0}^{k-1} f(X_n) | [X_0 = i]) / E(T_i(1) | [X_0 = i]).

At this point we should note that in this last sum, f(X_0) = f(i), since
everything takes place over the event [X_0 = i], and when we take n = k, since
we are then over the event [T_i(1) = k], also f(X_n) = f(i); so we can start the
innermost sum at n = 1 and end it at k. Thus

Σ_{j∈X} f(j) π_j = E(Σ_{k=1}^∞ I_{[T_i(1) = k]} Σ_{n=1}^k f(X_n) | [X_0 = i]) / E(T_i(1) | [X_0 = i])
= E(Σ_{n=1}^{T_i(1)} f(X_n) | [X_0 = i]) / E(T_i(1) | [X_0 = i]),

which proves the lemma.

5. Theorem (Ergodic Theorem). If {X_n} is a Markov chain that is
irreducible and positive recurrent with unique stationary distribution {π_i, i ∈
X}, and if f : X → R^1 is any bounded function over X, then

P([lim_{N→∞} (1/N) Σ_{n=0}^N f(X_n) exists and = Σ_{i∈X} f(i) π_i] | [X_0 = i]) = 1

for every i ∈ X, and

P([lim_{N→∞} (1/N) Σ_{n=0}^N f(X_n) exists and = Σ_{i∈X} f(i) π_i]) = 1.

Proof: Let i be any state in X. We first prove the theorem when f is
nonnegative and bounded. For every positive integer N, let us define

B_i(N) = max{k ≥ 0 : T_i(k) ≤ N},

i.e., B_i(N) is the number of return visits to state i among X_1, ..., X_N. Since
we have established that {T_i(k) - T_i(k-1), k ≥ 1} are independent, identically
distributed and positive with respect to P(· | [X_0 = i]), it follows that
T_i(k) → ∞ with probability 1 as k → ∞. Next define

ξ_k = Σ_{n=T_i(k)+1}^{T_i(k+1)} f(X_n).

By Theorem 13 in Part III, {ξ_k, k ≥ 1} are independent and identically
distributed with respect to P(· | [X_0 = i]). We now prove that, in addition,
their common expectation, E(ξ_1 | [X_0 = i]), is finite. Indeed, since the chain
is positive recurrent by hypothesis, we have

E(ξ_1 | [X_0 = i]) ≤ (upper bound of f) E(T_i(1) | [X_0 = i]) < ∞.

So now we may apply the classical strong law of large numbers of Kolmogorov
to obtain

lim_{m→∞} (1/m) Σ_{k=1}^m Σ_{n=T_i(k)+1}^{T_i(k+1)} f(X_n) = E(Σ_{n=1}^{T_i(1)} f(X_n) | [X_0 = i]) a.e. [P(· | [X_0 = i])].

Notice that since the random variable B_i(N) defined above is the largest
value of k such that T_i(k) ≤ N, then T_i(B_i(N)) ≤ N, and for any random
variable strictly greater than B_i(N), T_i evaluated at that random variable is
> N. This means that

T_i(B_i(N)) ≤ N < T_i(B_i(N) + 1).

Thus for f ≥ 0 and f bounded,

Σ_{n=0}^{T_i(B_i(N))} f(X_n) ≤ Σ_{n=0}^N f(X_n) ≤ Σ_{n=0}^{T_i(B_i(N)+1)} f(X_n).

Let us look at the sum at the left of this last string of inequalities; this term
can be written as

Σ_{n=0}^{T_i(B_i(N))} f(X_n) = Σ_{n=0}^{T_i(1)} f(X_n) + Σ_{k=1}^{B_i(N)-1} ξ_k = J + Σ_{k=1}^{B_i(N)-1} ξ_k,

where

J = Σ_{n=0}^{T_i(1)} f(X_n).

Similarly, the sum on the right side of the last string of inequalities becomes

Σ_{n=0}^{T_i(B_i(N)+1)} f(X_n) = J + Σ_{k=1}^{B_i(N)} ξ_k.

Now J is a fixed random variable, and so (1/N) J → 0 a.e. [P(· | [X_0 = i])].
We should take note of the following:

(1/N) Σ_{k=1}^{B_i(N)} ξ_k = ((1/B_i(N)) Σ_{k=1}^{B_i(N)} ξ_k) · (B_i(N)/N).

We next prove a claim.
Claim: B_i(N)/N → 1/E(T_i(1) | [X_0 = i]) as N → ∞ a.e. [P(· | [X_0 = i])].
To prove this claim, let us recall that T_i(n) = Σ_{k=1}^n τ_k, where
τ_k = T_i(k) - T_i(k-1) and τ_1, ..., τ_n are independent and identically
distributed with common distribution P([T_i(1) = ·] | [X_0 = i]). Also, by
hypothesis, E(τ_1 | [X_0 = i]) < ∞. By the strong law of large numbers given
above,

P([T_i(n)/n → E(T_i(1) | [X_0 = i])] | [X_0 = i]) = 1.

We then combine this with the fact that

T_i(B_i(N))/B_i(N) ≤ N/B_i(N) ≤ T_i(B_i(N)+1)/B_i(N) = (T_i(B_i(N)+1)/(B_i(N)+1)) · ((B_i(N)+1)/B_i(N))

and the fact that B_i(N) ↑ ∞ with P(· | [X_0 = i])-probability 1 as N → ∞ to
obtain

T_i(B_i(N))/B_i(N) → E(T_i(1) | [X_0 = i]) and hence N/B_i(N) → E(T_i(1) | [X_0 = i]) a.e. [P(· | [X_0 = i])],

which proves the claim. Recalling that we are assuming for the time being
that f ≥ 0, by the above claim, and by Lemma 4, we have

(1/N) Σ_{n=0}^N f(X_n) ≥ (1/N) Σ_{n=0}^{T_i(B_i(N))} f(X_n)
= (B_i(N)/N) (1/B_i(N)) Σ_{n=0}^{T_i(B_i(N))} f(X_n)
→ (1/E(T_i(1) | [X_0 = i])) E(Σ_{n=1}^{T_i(1)} f(X_n) | [X_0 = i]),

or

lim inf_{N→∞} (1/N) Σ_{n=0}^N f(X_n) ≥ Σ_{j∈X} f(j) π_j a.e. [P(· | [X_0 = i])].

Similarly, we can also make use of the inequality

Σ_{n=0}^N f(X_n) ≤ Σ_{n=0}^{T_i(B_i(N)+1)} f(X_n)

to obtain

lim sup_{N→∞} (1/N) Σ_{n=0}^N f(X_n) ≤ Σ_{j∈X} f(j) π_j a.e. [P(· | [X_0 = i])].

These last two inequalities verify that

lim_{N→∞} (1/N) Σ_{n=0}^N f(X_n) = Σ_{j∈X} f(j) π_j a.e. [P(· | [X_0 = i])].

Now let

A = [lim_{N→∞} (1/N) Σ_{n=0}^N f(X_n) exists and = Σ_{j∈X} f(j) π_j].

Then the absolute probability of A is

P(A) = Σ_{i∈X} P(A | [X_0 = i]) P([X_0 = i]) = Σ_{i∈X} 1 · P([X_0 = i]) = 1.

For a general bounded function f, we make use of f = f⁺ - f⁻. The above
proof holds for each of f⁺ and f⁻; hence it holds for f. Q.E.D.
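The conclusion of the ergodic theorem can be watched on a single simulated trajectory. In this sketch the two-state chain, the bounded function f, the seed, and the run length are all illustrative assumptions; the time average along the path is compared with the space average Σ_i f(i) π_i.

```python
# Time average of f(X_n) along one trajectory vs. the stationary
# expectation, for an illustrative two-state chain with pi = (4/7, 3/7).
import random

random.seed(2)
P = [[0.7, 0.3], [0.4, 0.6]]
pi = [4 / 7, 3 / 7]               # stationary distribution of P
f = [1.0, 5.0]                    # a bounded function on the state space

N = 200000
state = 0
total = 0.0
for _ in range(N):
    total += f[state]
    state = 0 if random.random() < P[state][0] else 1
time_average = total / N
space_average = sum(f[i] * pi[i] for i in range(2))
```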

Problems for Part V

1. Verify the following equation,

Σ_{j∈X} Σ_{n=0}^{k-1} f(j) I_{[X_n = j]} = Σ_{n=0}^{k-1} f(X_n),

which is used without proof in the proof of Lemma 4.

2. Verify the following sentence that appeared in the proof of Theorem 5:
Since we have established that {T_i(k) - T_i(k-1), k ≥ 1} are independent,
identically distributed and positive with respect to P(· | [X_0 = i]), it follows
that T_i(k) → ∞ with P(· | [X_0 = i])-probability 1 as k → ∞.

3. Prove that since B_i(N) ↑ ∞ with P(· | [X_0 = i])-probability 1 as
N → ∞, then

T_i(B_i(N))/B_i(N) → E(T_i(1) | [X_0 = i]) a.e. [P(· | [X_0 = i])] as N → ∞.

Part VI. Markov Chain Monte Carlo.

1. Proposition. If {X_n} is a Markov chain with initial distribution λ,
state space X and stochastic matrix P, then the Markov chain is stationary
if and only if P([X_0 = j]) = P([X_1 = j]) for all j ∈ X.
Proof: Observe that for every nonnegative integer n, every positive integer
k, and all (k+1)-tuplets of states i_0, ..., i_k,

P(∩_{r=0}^k [X_r = i_r]) = P([X_0 = i_0]) Π_{r=1}^k p_{i_{r-1} i_r}

and

P(∩_{r=0}^k [X_{n+r} = i_r]) = P([X_n = i_0]) Π_{r=1}^k p_{i_{r-1} i_r},

from which the conclusion follows.

2. Definition. A Markov chain {X_n} is said to be reversible if

P([X_n = i][X_{n+1} = j]) = P([X_n = j][X_{n+1} = i])

for all pairs of states i, j and every nonnegative integer n.

3. Proposition. If a Markov chain {X_n} is reversible, then it is strictly
stationary.
Proof: Sum both sides of the display equation in Definition 2 over all
j ∈ X, and then apply Proposition 1.

4. Theorem. If {X_n} is an irreducible and aperiodic Markov chain with
initial distribution π, state space X and stochastic matrix K = (K(i, j)) that
satisfies K(i, j) = 0 if and only if K(j, i) = 0, and if the stochastic matrix
MK = (MK(i, j)) indexed by X is defined by

MK(i, j) = K(i, j) α(i, j) if i ≠ j,
MK(i, i) = 1 - Σ_{k∈X\{i}} K(i, k) α(i, k),

where

α(i, j) = min{1, (π(j) K(j, i)) / (π(i) K(i, j))},

then the Markov chain determined by π, MK and X is a reversible (and thus
stationary), irreducible, aperiodic Markov chain.
Proof: It is sufficient to prove: for every pair of distinct states i and j,
π(i) MK(i, j) = π(j) MK(j, i).
Indeed, if π(j) K(j, i) = π(i) K(i, j) for all pairs i, j, there is no change
needed to obtain the conclusion. But if there exist states i ≠ j such that
π(j) K(j, i) < π(i) K(i, j), then α(i, j) = (π(j) K(j, i)) / (π(i) K(i, j)) and
α(j, i) = 1, so that

π(i) MK(i, j) = π(i) K(i, j) · (π(j) K(j, i)) / (π(i) K(i, j)) = π(j) K(j, i) = π(j) MK(j, i),

which establishes reversibility. Since 0 < α(i, j) ≤ 1, then K(i, j) > 0 if
and only if MK(i, j) > 0, so irreducibility and aperiodicity are preserved.
Q.E.D.
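The construction of MK is purely mechanical, and the detailed-balance identity of Theorem 4 can be checked exactly on a small example. The three-state target π and symmetric proposal K below are illustrative assumptions:

```python
# Building MK from (pi, K) as in Theorem 4 and checking
#   pi(i) MK(i,j) = pi(j) MK(j,i)   for all i, j.
pi = [0.5, 0.3, 0.2]
K = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]         # symmetric, so K(i,j) = 0 iff K(j,i) = 0

n = 3
MK = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j and K[i][j] > 0:
            alpha = min(1.0, pi[j] * K[j][i] / (pi[i] * K[i][j]))
            MK[i][j] = K[i][j] * alpha
    MK[i][i] = 1.0 - sum(MK[i][j] for j in range(n) if j != i)
```

Reversibility immediately gives invariance: summing π(i) MK(i, j) = π(j) MK(j, i) over i shows π^t MK = π^t.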

5. Proposition. In Theorem 4, if the original Markov chain is not
aperiodic and not reversible, then the conclusion of the theorem continues to
be true.
Proof: Since the chain is not reversible, there exists a pair of states i
and j such that

π(j) K(j, i) < π(i) K(i, j).

Thus if {Y_n : n ≥ 0} denotes the Markov chain determined by {π, MK}, it
follows that 0 < α(i, j) < 1 and

P([Y_1 = i] | [Y_0 = i]) = MK(i, i) = 1 - Σ_{k≠i} K(i, k) α(i, k) ≥ K(i, j)(1 - α(i, j)) > 0,

from which aperiodicity follows. Reversibility of {Y_n} follows from
Theorem 4.

6. Proposition. Suppose {X_n} is a Markov chain determined by {π, P}
that is irreducible and is reversible (and therefore stationary) but is not
aperiodic. If {Y_n} is a Markov chain determined by the same π but with
stochastic matrix Q = (q_ij), where

q_ij = 0.9 p_ij if i ≠ j,
q_ii = 1 - 0.9 Σ{p_ik : k ≠ i} if j = i,

then {Y_n} is a Markov chain that is irreducible, reversible and aperiodic.
Proof: Since {X_n} is reversible, we know that

π(i) p_ij = π(j) p_ji.

Applying this, we get, for i ≠ j,

π(i) q_ij = 0.9 π(i) p_ij = 0.9 π(j) p_ji = π(j) q_ji,

which means that {Y_n} is reversible and therefore stationary. Since

q_ii = 1 - 0.9 Σ{p_ik : k ≠ i} ≥ 1 - 0.9 = 0.1 > 0,

it follows that {Y_n} is aperiodic. Finally, {Y_n} is irreducible because {X_n}
is irreducible and, for i ≠ j, p_ij > 0 if and only if q_ij > 0.
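The "lazy" modification of Proposition 6 is a one-line transformation. The period-2 flip chain below is an illustrative assumption, the simplest reversible chain that is not aperiodic:

```python
# Proposition 6's modification for the two-state flip chain
# (period 2, pi = (1/2, 1/2)): Q keeps detailed balance with the same
# pi but gains q_ii >= 0.1 > 0, hence aperiodicity.
P = [[0.0, 1.0], [1.0, 0.0]]
n = 2
Q = [[0.9 * P[i][j] if i != j else 0.0 for j in range(n)] for i in range(n)]
for i in range(n):
    Q[i][i] = 1.0 - sum(Q[i][j] for j in range(n) if j != i)
```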

7. Markov Chain Monte Carlo Algorithm. Everything developed


in these notes so far constitutes the prerequisites for this section. Suppose
there is given a …nite state space, i.e., a set, X of elements with a probability
measure P de…ned over it. Suppose there is a function ':X ! R1 whose
value, '(x), is known or can be determined for every x 2 X. An observation
X is taken at random according to the probability measure P , and a state
x is observed. One wishes to test the null hypothesis that P is actually
some known probability , whose value (y) is known for each y 2 X at
least up to a multiplicative constant, i.e., for any pair u; v in X, the ratio
(u)= (v) is known or can be determined. The problem is to obtain an
"accurate estimate" of (fz 2 X :'(z) '(x)g) for some particular x 2 X.
The problem is made di¢ cult in that one does not know completely and
thus cannot simulate repeated observations on X according to the probability
measure in order to estimate (fz 2 X :'(z) '(x)g) by means of the
strong law of large numbers.
A solution to the problem is possible under the following condition. Suppose one can come up with a stochastic matrix K(·,·) that is indexed by X and has a small number of positive entries in each row such that the Markov chain determined by {π, K} is reversible (or at least stationary), irreducible and aperiodic. In this case, one starts with the state x. Since the stochastic matrix we have at our disposal contains only a relatively small number of positive entries in the row determined by x, one can simulate the probability distribution K(x, ·) and take an observation on X according to this distribution. The outcome of this will be called X1. Then, whatever the outcome is, say, X1 = x1, one may simulate an observation on X according to the probability distribution K(x1, ·), and call the outcome X2. In this way one can generate an observable Markov chain {Xn} that is reversible, irreducible and aperiodic. After a very large number n of observations, our simulated observation on X will have, if our hypothesis is correct, approximately none other than the distribution π; this follows from our first big limit theorem. So at some time after taking a large number N of observations on the Markov chain one may begin observing I[φ(Xn) ≥ φ(x)] for n ≥ N. Keeping track of the number of 1's for the next M simulations yields the quantity

SM = Σ{I[φ(Xn) ≥ φ(x)] : N ≤ n ≤ N + M}.

By the second big limit theorem, known as the ergodic theorem, the ratio M⁻¹SM converges with probability one as M → ∞ to π({z ∈ X : φ(z) ≥ φ(x)}).
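The estimation scheme just described can be sketched in code. Everything below is illustrative rather than part of the text: the chain K is represented as a dictionary of sparse rows, and the names (mcmc_estimate, phi) are our own.

```python
import random

def mcmc_estimate(K, phi, x0, c, N, M, seed=0):
    """Simulate the chain {pi, K} started at x0; after discarding the
    first N steps, return the fraction of the next M states X_n with
    phi(X_n) >= c.  By the ergodic theorem this converges, as M grows,
    to pi({z : phi(z) >= c}).  K[x] maps each state y with K(x, y) > 0
    to that transition probability (a sparse row, as in the text)."""
    rng = random.Random(seed)
    x = x0
    hits = 0
    for n in range(N + M):
        ys = list(K[x])
        x = rng.choices(ys, weights=[K[x][y] for y in ys])[0]
        if n >= N:                      # keep only post-burn-in states
            hits += (phi(x) >= c)
    return hits / M
```

For instance, for a two-state chain whose invariant measure is uniform, the estimate of π({1}) should settle near 1/2.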
Next suppose that the only stochastic matrix K that one can come up with that is indexed by X and that has a relatively small number of positive entries in each row is such that the Markov chain determined by {π, K} is irreducible and reversible (and therefore stationary) but not aperiodic. In this case, for every pair of states i ≠ j, define a new Markov chain {π, AK} by AK(i, j) = 0.9 K(i, j), and for j = i, let

AK(i, i) = 1 − Σ{AK(i, j) : j ≠ i}.

In this case, this Markov chain is irreducible, reversible and aperiodic by proposition 6, and one can then proceed to estimate π({z ∈ X : φ(z) ≥ φ(x)}) as above.
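This aperiodicity repair is mechanical enough to write down directly. The following sketch (with matrices as lists of lists, and the name lazify our own) builds AK from K.

```python
def lazify(K, a=0.9):
    """Return AK with AK(i, j) = a*K(i, j) for j != i and the leftover
    mass 1 - a*sum(K(i, k) : k != i) >= 1 - a placed on AK(i, i).
    Every diagonal entry is then positive, so the chain {pi, AK} is
    aperiodic, while reversibility with respect to pi is preserved."""
    n = len(K)
    AK = [[a * K[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        AK[i][i] = 1.0 - sum(AK[i][j] for j in range(n) if j != i)
    return AK
```

Applied to the periodic two-state chain that flips deterministically, this yields diagonal entries 0.1 and off-diagonal entries 0.9.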
Next suppose that the only stochastic matrix K that one can come up with that is indexed by X and that has a relatively small number of positive entries in each row is such that the Markov chain determined by {π, K} is irreducible but is not stationary. In this case, we would like to alter K to obtain a stochastic matrix denoted by MK such that the Markov chain determined by {π, MK} is stationary, irreducible and aperiodic. In this case, define

MK(i, j) = K(i, j)α(i, j) if i ≠ j
         = 1 − Σ{K(i, k)α(i, k) : k ≠ i} if j = i,

where

α(i, j) = min{1, π(j)K(j, i)/(π(i)K(i, j))}.

Then by theorem 4, the Markov chain determined by {π, MK} is reversible and therefore stationary, it is still irreducible, and by proposition 7, it is aperiodic. The procedure in this case and in the previous case is an application of what is referred to in the literature as the Hastings-Metropolis procedure.
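The Hastings-Metropolis alteration of K can likewise be sketched, and a small numerical check of detailed balance then illustrates the reversibility it is designed to produce. The function name metropolize and the dense-matrix representation are our own choices.

```python
def metropolize(K, pi):
    """Return MK with MK(i, j) = K(i, j) * a(i, j) for j != i, where
    a(i, j) = min(1, pi(j)K(j, i) / (pi(i)K(i, j))), and the leftover
    mass on the diagonal; {pi, MK} is then reversible."""
    n = len(K)
    MK = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j != i and K[i][j] > 0:
                a = min(1.0, pi[j] * K[j][i] / (pi[i] * K[i][j]))
                MK[i][j] = K[i][j] * a
        MK[i][i] = 1.0 - sum(MK[i][j] for j in range(n) if j != i)
    return MK
```

For K with both rows (1/2, 1/2) and π = (1/4, 3/4), one finds MK(1, 0) = 1/6 and π(0)MK(0, 1) = π(1)MK(1, 0) = 1/8, as detailed balance requires.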
In the case when the Hastings-Metropolis procedure is applied, the stochastic matrix MK enjoys a certain optimality property, which we now develop.
Let X denote a finite state space, and let π denote a probability measure over X such that π(x) > 0 for all x ∈ X. Let S(X) denote the set of all stochastic matrices indexed by X such that for each K ∈ S(X), the Markov chain determined by {π, K} is irreducible and aperiodic. Let R(π) ⊂ S(X) be defined as the set of all K ∈ S(X) such that the Markov chain determined by {π, K} is reversible. Finally, over S(X) we create a metric d(K, K′) defined by

d(K, K′) = Σ{π(x)|K(x, y) − K′(x, y)| : x ∈ X, y ∈ X, x ≠ y}
         = Σ{π(x) Σ{|K(x, y) − K′(x, y)| : y ≠ x} : x ∈ X}.

It should be noted that d(·,·) satisfies the properties of a metric over S(X).
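In code, this metric is a π-weighted sum of off-diagonal absolute differences; the helper below (our naming) is a direct transcription.

```python
def d(pi, K, Kp):
    """The metric of the text: sum of pi(x)|K(x, y) - Kp(x, y)| over
    all ordered pairs x != y (diagonal entries are excluded)."""
    n = len(K)
    return sum(pi[x] * abs(K[x][y] - Kp[x][y])
               for x in range(n) for y in range(n) if y != x)
```

Note that d(K, K′) = 0 forces equality off the diagonal only, but since the rows of a stochastic matrix sum to 1, the diagonal entries then agree as well; together with π(x) > 0, this is why d is a genuine metric on S(X).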
Next we define the mapping M : S(X) → R(π) by

MK(x, y) = K(x, y) ∧ (π(y)/π(x))K(y, x) if x ≠ y
         = 1 − Σ{MK(x, z) : z ≠ x} if x = y.

By theorem 4 above, the Markov chain determined by {π, MK} is reversible, irreducible and aperiodic. The mapping M : K → MK is called the Metropolis map.

8. Theorem. (Billera and Diaconis) For every K ∈ S(X)\R(π), MK is the unique closest element of R(π) to K in the metric d that is coordinatewise not larger than K off the main diagonal.
Proof: Let x and y denote any two states in X, where x ≠ y. Let us define, for these two fixed states,

Hxy = {K ∈ S(X) : π(x)K(x, y) = π(y)K(y, x)},
Hxy+ = {K ∈ S(X) : π(x)K(x, y) > π(y)K(y, x)}, and
Hxy− = {K ∈ S(X) : π(x)K(x, y) < π(y)K(y, x)}.

Thus, for every x ≠ y, S(X) is the union of these three disjoint sets:

S(X) = Hxy ∪ Hxy+ ∪ Hxy−.
We shall break up the proof into a sequence of claims and subclaims.
Claim 1: If K ∈ S(X), then K ∈ Hxy+ if and only if K ∈ Hyx−. This claim follows directly from the definitions.
Now let N ∈ R(π) be arbitrary. Then by the definition of R(π),

π(x)N(x, y) = π(y)N(y, x)

for all x, y in X.
Claim 2: For all N ∈ R(π) and all K ∈ S(X)\R(π),

d(K, N) ≥ Σ{π(x)|K(x, y) − N(x, y)| + π(y)|K(y, x) − N(y, x)|},

where the sum is taken over only those pairs of elements x, y in X that satisfy x ≠ y and K ∈ Hxy+. To prove this we break up d(K, N) into a sum of three sums:

d(K, N) = Σ{π(x)|K(x, y) − N(x, y)| : x ≠ y, K ∈ Hxy}
        + Σ{π(x)|K(x, y) − N(x, y)| : x ≠ y, K ∈ Hxy+}
        + Σ{π(x)|K(x, y) − N(x, y)| : x ≠ y, K ∈ Hxy−}
        ≥ Σ{π(x)|K(x, y) − N(x, y)| : x ≠ y, K ∈ Hxy+}
        + Σ{π(x)|K(x, y) − N(x, y)| : x ≠ y, K ∈ Hxy−}.

By claim 1, the last sum above is

Σ{π(x)|K(x, y) − N(x, y)| : x ≠ y, K ∈ Hyx+},

and now replace in it each x by a y and each y by an x to obtain

Σ{π(y)|K(y, x) − N(y, x)| : x ≠ y, K ∈ Hxy+},

which proves the claim.
Claim 3: d(K, MK) ≤ d(K, N). To prove this, we first note that since K(x, y) ≥ MK(x, y) for x ≠ y, we obtain

d(K, MK) = Σ{π(x)|K(x, y) − MK(x, y)| : x ≠ y}
         = Σ{π(x)K(x, y) − π(x)MK(x, y) : x ≠ y}
         = Σ{π(x)K(x, y) − π(x)(K(x, y) ∧ (π(y)/π(x))K(y, x)) : x ≠ y}.

Then, by the definitions of Hxy, Hxy+ and Hxy− (and for this fixed K), we continue the above to obtain

d(K, MK) = Σ{π(x)K(x, y) − π(y)K(y, x) : x ≠ y, K ∈ Hxy+}
         + Σ{π(x)K(x, y) − π(x)K(x, y) : x ≠ y, K ∈ Hxy}
         + Σ{π(x)K(x, y) − π(x)K(x, y) : x ≠ y, K ∈ Hxy−}
         = Σ{π(x)K(x, y) − π(y)K(y, x) : x ≠ y, K ∈ Hxy+}.
Since N satisfies π(x)N(x, y) = π(y)N(y, x), the last sum above is equal to

Σ{π(x)(K(x, y) − N(x, y)) + π(y)(N(y, x) − K(y, x)) : x ≠ y, K ∈ Hxy+},

which is

≤ Σ{π(x)|K(x, y) − N(x, y)| + π(y)|N(y, x) − K(y, x)| : x ≠ y, K ∈ Hxy+}.

By claim 2, we obtain d(K, MK) ≤ d(K, N), which proves claim 3.


By claim 3, MK is a closest element of R(π) to K with respect to the metric d that is majorized off the main diagonal by K. We now wish to prove that MK is the unique closest element of R(π) to K that is majorized off the main diagonal by K.
Claim 4: As before, K ∈ S(X)\R(π) is fixed. Let N ∈ R(π) be such that N(x, y) ≤ K(x, y) for all x ≠ y in X. Then there is at least one pair (x, y), x ≠ y, such that N(x, y) < K(x, y) and K ∈ Hxy ∪ Hxy+.
To show this, if, to the contrary, every pair (x, y) satisfied N(x, y) = K(x, y), then K would be in R(π), which contradicts the definition of K. So there is at least one pair (x, y) for which N(x, y) < K(x, y). If for such a pair K ∈ Hxy−, then, since π(y)N(y, x) = π(x)N(x, y) < π(x)K(x, y) < π(y)K(y, x), the transposed pair also satisfies N(y, x) < K(y, x), and by claim 1, K ∈ Hyx+; consequently, a pair with the asserted properties can always be found.
As before, K ∈ S(X)\R(π), but let N ∈ R(π) be such that N(x, y) ≤ K(x, y) for all x ≠ y in X, with N(x, y) < K(x, y) for at least one such pair. Let us define, for all pairs x ≠ y in X, the function

γ(x, y) = N(x, y) − K(x, y).

Note that, for all pairs of distinct elements x, y in X, γ(x, y) ≤ 0, with γ(x, y) < 0 for at least one such pair. Since N(x, y) = K(x, y) + γ(x, y) and N ∈ R(π), we have

N(y, x) = (π(x)/π(y))N(x, y) = (π(x)/π(y))(K(x, y) + γ(x, y)).

By claim 2 and this last expression,

d(K, N) ≥ Σ{π(x)|K(x, y) − N(x, y)| + π(y)|K(y, x) − N(y, x)| : x ≠ y, K ∈ Hxy+}
        = Σ{π(x)|γ(x, y)| + π(y)|K(y, x) − (π(x)/π(y))(K(x, y) + γ(x, y))| : x ≠ y, K ∈ Hxy+}
        = Σ{π(x)|γ(x, y)| : x ≠ y, K ∈ Hxy+}
        + Σ{|π(y)K(y, x) − π(x)K(x, y) − π(x)γ(x, y)| : x ≠ y, K ∈ Hxy+}.
Now for those x ≠ y for which K ∈ Hxy+, the expression π(y)K(y, x) − π(x)K(x, y) is negative, while −π(x)γ(x, y) is nonnegative, with at least one term strictly positive by claim 4. These two observations imply

d(K, N) > Σ{π(x)|γ(x, y)| : x ≠ y, K ∈ Hxy+}
        + Σ{|π(y)K(y, x) − π(x)K(x, y)| : x ≠ y, K ∈ Hxy+}
        ≥ Σ{π(x)K(x, y) − π(y)K(y, x) : x ≠ y, K ∈ Hxy+}.

But recall that in our proof of claim 3 we proved that

d(K, MK) = Σ{π(x)K(x, y) − π(y)K(y, x) : x ≠ y, K ∈ Hxy+}.

Thus d(K, N) > d(K, MK), which proves the theorem.

Exercises for Part VI

These exercises are designed to demonstrate in a very simple manner an application of the Markov Chain Monte Carlo algorithm. Suppose we can take 10 independent observations on a random vector (U, V)′ whose range is

{(1, 1)′, (1, 2)′, (1, 3)′, (2, 1)′, (2, 2)′, (2, 3)′}.

Suppose we desire to test the null hypothesis that U and V are independent, that is, for some unknown positive probabilities p1, p2, q1, q2, q3, the joint density of U, V satisfies

P([U = i][V = j]) = pi qj, i = 1, 2; j = 1, 2, 3.

So we take 10 independent observations on (U, V)′ and obtain results that can be arranged in a table with two rows and three columns, namely,

X11  X12  X13
X21  X22  X23,

where Xij = the number of observations which take U-values in row i and V-values in column j. Thus

Σ{Xij : i = 1, 2; j = 1, 2, 3} = 10.

1. Verify that the total number of possible outcomes of the ten (ordered) observations, and hence of tables produced in this way counted with multiplicity, is 6^10.
2. Now suppose that when we take these ten observations, we obtain the following table:

2  2  1
1  1  3.
The row totals are 5 and 5, and the column totals are 3, 3, 4, respectively.
3. Let us denote the row totals for any of the 6^10 outcomes by X1·, X2· and the column totals by X·1, X·2, X·3. Compute

P([X11 = i][X12 = j] | [X1· = 5][X·1 = 3][X·2 = 3])

for all (i, j)'s that do not violate the conditioning event.
4. Note that in problem 3, this conditional joint distribution of X11 and X12 does not depend on the unknown parameters p1, q1, q2. So now find the conditional density of χ² given [X1· = 5][X·1 = 3][X·2 = 3], where

χ² = Σ{(Xij − Xi·X·j/10)² / (Xi·X·j/10) : i = 1, 2; j = 1, 2, 3}.

5. For the particular table of results obtained in problem 2, compute the value, χ²₀, of the χ²-statistic defined in problem 4.
6. Assuming that it is really true that the selections of the row and column were independent of each other with unknown probabilities p1, p2 for the rows and q1, q2, q3 for the columns, i.e., that U and V are independent, use your results in problem 3 to compute

P([χ² ≥ χ²₀] | [X1· = 5][X·1 = 3][X·2 = 3]).

If this is too small, we would reject the null hypothesis that the selections of the rows and the columns are independent. In this problem, this computation could be easily carried out. However, for more rows and/or columns and for larger sample sizes, this is an impossible task. But this probability can be simulated, and this is by means of the Hastings-Metropolis algorithm, which you are now asked to carry out in a sequence of steps outlined in problems 7, 8, 9 and 10.
7. First note that the conditional density computed in problem 3 has 14 points in its domain. Thus we have a state space of 14 states. In brief, we would like to have at our disposal a Markov chain with this 14-state state space and a transition matrix for which this conditional density is the invariant measure. Having these, we could then simulate the Markov chain, and for a large number of observations we could find the relative frequency with which the event [χ² ≥ χ²₀] occurs. By the ergodic theorem for Markov chains, this relative frequency converges to P([χ² ≥ χ²₀]) with probability 1. So we need a stochastic matrix, which we construct and simulate as follows. Whichever state the Markov chain is at, select two columns in the table at random without replacement. These two columns and the two rows determine a 2 × 2 table or matrix. Then select one of the two matrices

 1 −1        −1  1
−1  1   or    1 −1

at random with probabilities 1/2, 1/2. If you can add whatever you get to the two columns selected at random without replacement and not obtain any negative integers, then do so, and call the resulting 2 × 3 table the new state of the chain. On the other hand, if you do obtain a negative integer, then the next state is the same as the present one. Thus a Markov chain is created whose state space is the set of all 2 × 3 tables of nonnegative integers whose row sums are 5 and 5 and whose column sums are 3, 3 and 4. Display the transition matrix, P = (p(i, j)), indexing the rows and columns in the same manner as the states are indexed for the invariant probability measure, i.e., the conditional density computed in problem 3.
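The state space and the proposal move of problem 7 are easy to realize in code. The sketch below (names ours) enumerates the tables by their top row, which determines the bottom row, and the enumeration confirms the size of the state space.

```python
import itertools, random

def states():
    """All 2 x 3 tables of nonnegative integers with row sums (5, 5)
    and column sums (3, 3, 4); the top row determines the table.
    Direct enumeration yields 14 such tables."""
    out = []
    for top in itertools.product(range(4), range(4), range(5)):
        if sum(top) == 5:
            out.append((top, (3 - top[0], 3 - top[1], 4 - top[2])))
    return out

def step(table, rng):
    """One move of the chain in problem 7: choose two columns without
    replacement, add one of the two +/-1 checkerboard patterns to
    them, and stay put if any entry would become negative."""
    rows = [list(r) for r in table]
    j, k = rng.sample(range(3), 2)
    s = rng.choice((1, -1))
    move = [(0, j, s), (0, k, -s), (1, j, -s), (1, k, s)]
    if all(rows[i][c] + v >= 0 for i, c, v in move):
        for i, c, v in move:
            rows[i][c] += v
    return tuple(tuple(r) for r in rows)
```

Every move preserves both the row sums and the column sums, so the chain never leaves this state space.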
8. With respect to the Markov chain created in problem 7, which could be simulated, verify that this chain is irreducible and aperiodic.
9. If you let π denote the desired invariant probability measure (as a row vector), and if the rows of P are indexed in the same order as the coordinates of π, you should be able to verify that the equation π = πP (matrix multiplication) is not satisfied. Thus the stochastic matrix P is not the one we need for a stationary Markov chain.
10. We now know how to alter P to get a new stochastic matrix Q = (q(i, j)) which, together with the initial distribution,

π = (π(1), …, π(14)),

will determine a stationary Markov chain. So define

q(i, j) = p(i, j)α(i, j) if i ≠ j
        = 1 − Σ{p(i, k)α(i, k) : k ≠ i} if j = i,

where

α(i, j) = min{1, π(j)p(j, i)/(π(i)p(i, j))}

for all those ordered pairs of states (i, j) for which p(i, j) > 0. Now simulate this chain, (π, Q), say, 25,000 times, starting from any state you choose, and then compute the relative frequency with which the event [χ² ≥ χ²₀] occurs over the next 25,000 simulations. This relative frequency should be very close to the conditional probability that was obtained in problem 6. (Epilogue: Why did we simulate the chain 25,000 times before we started estimating P([χ² ≥ χ²₀])? Because one wants the state of the chain to be distributed, at least approximately, "according to the invariant measure π" before collecting the observations that enter the estimate of P([χ² ≥ χ²₀]).)
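Problems 7-10 can be carried out end to end in a short simulation. The sketch below is one possible realization, not a unique answer, and all names are ours. It uses two facts: with all margins fixed, the conditional density of problem 3 is proportional to 1/Π(cell!) (the standard hypergeometric distribution of a contingency table given its margins), and the proposal of problem 7 is symmetric, so α(i, j) reduces to min(1, π(j)/π(i)).

```python
import math, random

def chisq(t):
    """The chi-square statistic of problem 4 (n = 10)."""
    r = [sum(row) for row in t]
    c = [sum(col) for col in zip(*t)]
    return sum((t[i][j] - r[i] * c[j] / 10) ** 2 / (r[i] * c[j] / 10)
               for i in range(2) for j in range(3))

def weight(t):
    """Invariant density up to a constant: with the margins fixed, the
    conditional density of problem 3 is proportional to 1/prod(cell!)."""
    return 1.0 / math.prod(math.factorial(v) for row in t for v in row)

def pvalue(start, burn=25_000, keep=25_000, seed=0):
    """Metropolized chain of problem 10; returns the relative frequency
    of [chi2 >= chi2_0] among the `keep` states after burn-in."""
    rng = random.Random(seed)
    chi0 = chisq(start)
    t = [list(row) for row in start]
    hits = 0
    for n in range(burn + keep):
        j, k = rng.sample(range(3), 2)       # two columns, no replacement
        s = rng.choice((1, -1))
        prop = [row[:] for row in t]
        for i, c, v in [(0, j, s), (0, k, -s), (1, j, -s), (1, k, s)]:
            prop[i][c] += v
        if min(min(row) for row in prop) >= 0:
            # symmetric proposal, so alpha = min(1, pi(new)/pi(old))
            if rng.random() < min(1.0, weight(prop) / weight(t)):
                t = prop
        if n >= burn:
            hits += (chisq(t) >= chi0)
    return hits / keep
```

Started from the table of problem 2, the returned relative frequency settles near the exact conditional probability of problem 6.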
