Markov Chains
A discrete time process {X_n, n = 0, 1, 2, ...} with discrete state space X_n ∈ {0, 1, 2, ...} is a Markov chain if it has the Markov property:

P[X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, ..., X_0 = i_0]
  = P[X_{n+1} = j | X_n = i].

In words, the past is conditionally independent of the future given the present state of the process; equivalently, given the present state, the past contains no additional information on the future evolution of the system.
The Markov property is common in probability
models because, by assumption, one supposes
that the important variables for the system being
modeled are all included in the state space.
We consider homogeneous Markov chains for
which P[Xn+1 = j | Xn = i] = P[X1 = j | X0 = i].
(P²)_ij = Σ_k P[X_{n+1} = k | X_n = i] P[X_{n+2} = j | X_{n+1} = k]
        = Σ_k P[X_{n+2} = j, X_{n+1} = k | X_n = i]   (Markov property)
        = P[X_{n+2} = j | X_n = i].
Generalizing this calculation: the matrix power P^n gives the n-step transition probabilities, (P^n)_ij = P[X_{m+n} = j | X_m = i].

Let π^(n) denote the row vector of state probabilities at time n, so π^(n)_i = P[X_n = i]. Then π^(n) = π^(0) P^n, using standard multiplication of a vector and a matrix. Why?
Example. Set X_0 = 0, and let X_n evolve as a Markov chain with transition matrix

P = [ 1/2  1/2  0
       0    0   1
       1    0   0 ].

Find π^(n)_0 = P[X_n = 0].

[Figure: state diagram with transitions 0 → 0 (prob 1/2), 0 → 1 (prob 1/2), 1 → 2 (prob 1), 2 → 0 (prob 1).]
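The evolution π^(n) = π^(0) P^n can be checked numerically. A minimal sketch for the example chain above using NumPy (the state ordering 0, 1, 2 and array layout are the obvious concrete choices, not from the notes):

```python
import numpy as np

# Transition matrix of the example chain (states 0, 1, 2).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

pi_0 = np.array([1.0, 0.0, 0.0])  # X_0 = 0 with probability 1

def state_probs(n):
    """pi^(n) = pi^(0) P^n: the distribution of X_n."""
    return pi_0 @ np.linalg.matrix_power(P, n)

# P[X_n = 0] for the first few n: 1, 1/2, 1/4, 5/8, ...
for n in range(4):
    print(n, state_probs(n)[0])
```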
Classification of States
State j is accessible from i if (P^k)_ij > 0 for some k ≥ 0.

The transition matrix can be represented as a directed graph, with arrows corresponding to positive one-step transition probabilities: j is accessible from i if there is a path from i to j.
For example,

[Figure: directed graph of a transition matrix.]

...ous by considering a generic directed graph for a periodic state:

[Figure: a cycle i → i+1 → i+2 → ... → i+d−1 → i, so returns to i occur only at multiples of d.]
State i is recurrent if P[re-enter i | X_0 = i] = 1, where {re-enter i} is the event ∪_{n=1}^∞ {X_n = i, X_k ≠ i for k = 1, ..., n−1}.

If i is not recurrent, it is transient.

Let S_0 = 0 and S_1, S_2, ... be the times of successive returns to i, with N_i(t) = max{n : S_n ≤ t} being the corresponding counting process.

If i is recurrent, then N_i(t) is a renewal process, since the Markov property gives independence of the interarrival times X^A_n = S_n − S_{n−1}.

Letting μ_ii = E[X^A_1], the expected return time for i, we then have the following from renewal theory:
(P^n)_ii → 1/μ_ii as n → ∞ (for aperiodic i; the renewal theorem applied to N_i).
Random Walks
The simple random walk is a Markov chain on the integers, Z = {..., −2, −1, 0, 1, 2, ...}, with X_0 = 0 and

P[X_{n+1} = X_n + 1] = p,   P[X_{n+1} = X_n − 1] = 1 − p.
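A quick simulation of the walk, using only the standard library (the function name and seed are illustrative):

```python
import random

def simple_random_walk(n_steps, p=0.5, seed=1):
    """Simulate X_0 = 0 and X_{n+1} = X_n + 1 w.p. p, X_n - 1 w.p. 1 - p."""
    rng = random.Random(seed)
    x, path = 0, [0]
    for _ in range(n_steps):
        x += 1 if rng.random() < p else -1
        path.append(x)
    return path

path = simple_random_walk(1000)
```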
Stationary Distributions
π = {π_i, i = 0, 1, ...} is a stationary distribution for P = [P_ij] if π_j = Σ_{i=0}^∞ π_i P_ij, with π_i ≥ 0 and Σ_{i=0}^∞ π_i = 1.

In matrix notation, π_j = Σ_{i=0}^∞ π_i P_ij is π = πP, where π is a row vector.
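For a finite chain, π = πP together with Σ_i π_i = 1 is a linear system. A sketch solving it for the 3-state example chain from earlier (NumPy; swapping one balance equation for the normalization constraint is a standard trick, not something from the notes):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],   # the 3-state example chain
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# pi (P - I) = 0 transposes to (P - I)^T pi^T = 0; replace the last
# (redundant) balance equation by sum(pi) = 1 to pin down the solution.
A = (P - np.eye(3)).T
A[-1, :] = 1.0
b = np.array([0.0, 0.0, 1.0])
pi = np.linalg.solve(A, b)
print(pi)  # [0.5, 0.25, 0.25]
```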
Proof
Proof continued
E[h(X)] ≈ (1/n) Σ_{i=1}^n h(X_i)
If it is hard to generate an iid sample from π, we may look to generate a sequence from a Markov chain with limiting distribution π.

This idea, called Markov chain Monte Carlo (MCMC), was introduced by Metropolis et al. (1953) and generalized by Hastings (1970). It has become a fundamental computational method for the physical and biological sciences. It is also commonly used for Bayesian statistical inference.
Metropolis-Hastings Algorithm
(i) Choose a transition matrix Q = [qij ]
(ii) Set X0 = 0
(iii) for n = 1, 2, . . .
Proposition: Set

P_ij = q_ij min(1, π_j q_ji / (π_i q_ij))                          j ≠ i
P_ii = q_ii + Σ_{k≠i} q_ik {1 − min(1, π_k q_ki / (π_i q_ik))}     j = i

Then this P is the transition matrix of a Markov chain with stationary distribution π.
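A sketch of the resulting sampler in the simplest setting: a symmetric proposal (q_ij = q_ji), for which the acceptance probability min(1, π_j q_ji / (π_i q_ij)) reduces to min(1, π_j/π_i). The target weights, seed, and step count below are illustrative only:

```python
import random

def metropolis(weights, n_steps, seed=2):
    """Metropolis sampler on {0, ..., m-1} targeting pi_i
    proportional to weights[i].  The proposal draws j uniformly,
    so q_ij = q_ji = 1/m and we accept with prob min(1, pi_j/pi_i);
    on rejection the chain stays put (the j = i case above)."""
    rng = random.Random(seed)
    m, x = len(weights), 0
    visits = [0] * m
    for _ in range(n_steps):
        j = rng.randrange(m)
        if rng.random() < min(1.0, weights[j] / weights[x]):
            x = j          # accept the proposed move
        visits[x] += 1
    return [v / n_steps for v in visits]

freq = metropolis([1, 2, 1], 100_000)
# empirical frequencies approach pi = (1/4, 1/2, 1/4)
```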
Proof.
... = 1/2 + Σ_{(p,q) ∈ N(i,j)} I{Y_pq = R} / 2.
The full distribution of Y is, in principle, specified by the conditional distributions. In practice, one simulates from this distribution using MCMC, where the proposal distribution is to pick a random (i, j) and swap Y_ij from R to D or vice versa.
posterior mean E[θ | X = x],
such that P[θ ∈ A | X = x] = 1 − α.
Z_{n+1} = Σ_{k=1}^{Z_n} X^{(k)}.
(d²/ds²) φ(s) > 0. Why?

(d/ds) φ(s) > 0. Why?
Z = 0 w.p. 1/4
    1 w.p. 1/4
    2 w.p. 1/2.

Find the chance of extinction.
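For this offspring distribution the p.g.f. is φ(s) = 1/4 + s/4 + s²/2, and the extinction probability is the smallest root of s = φ(s) in [0, 1]. Rearranging gives 2s² − 3s + 1 = (2s − 1)(s − 1) = 0, so the answer is 1/2. A fixed-point iteration (one standard way to locate the smallest root) confirms this:

```python
def phi(s):
    """Offspring p.g.f.: Z = 0, 1, 2 w.p. 1/4, 1/4, 1/2."""
    return 0.25 + 0.25 * s + 0.5 * s * s

# Iterating s -> phi(s) from s = 0 converges monotonically to the
# smallest fixed point in [0, 1], the extinction probability.
s = 0.0
for _ in range(100):
    s = phi(s)
print(s)  # -> 0.5 (up to floating point)
```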
Time Reversal
Thinking backwards in time can be a powerful strategy. For any Markov chain {X_k, k = 0, 1, ..., n}, we can set Y_k = X_{n−k}. Then {Y_k} is a Markov chain. Suppose X_k has transition matrix P_ij; then Y_k has inhomogeneous transition matrix P^[k]_ij, where

P^[k]_ij = P[Y_{k+1} = j | Y_k = i]
         = P[X_{n−k−1} = j | X_{n−k} = i]
         = P_ji P[X_{n−k−1} = j] / P[X_{n−k} = i].
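If the chain is stationary, P[X_m = i] = π_i for every m, so the reversed transition matrix is homogeneous: P~_ij = π_j P_ji / π_i. A numerical check of this, using the 3-state example chain from earlier (NumPy; the stationary π is written in by hand):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
pi = np.array([0.5, 0.25, 0.25])     # stationary: pi P = pi

# In stationarity the reversed chain is homogeneous with
# Ptilde[i, j] = pi[j] * P[j, i] / pi[i].
Ptilde = (P.T * pi) / pi[:, None]
```

Each row of Ptilde sums to 1 precisely because π is stationary for P, and π is also stationary for the reversed chain.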
Proof
Let P[X_{n+1} = X_n + 1 | X_n = i] = p_i and P[X_{n+1} = X_n − 1 | X_n = i] = q_i = 1 − p_i.

If {X_n} is positive recurrent, it must be reversible. Heuristic reason:

[Figure: birth-death chain on states 0, 1, 2, 3, 4, ..., with upward probabilities p_0 = 1, p_1, p_2, p_3, ... and downward probabilities q_1, q_2, q_3, q_4, ...]
P_ij = q_ij min(1, π_j / π_i)                          j ≠ i
P_ii = q_ii + Σ_{k≠i} q_ik {1 − min(1, π_k / π_i)}     j = i
Solution
Ergodicity
A stochastic process {X_n} is ergodic if limiting time averages equal limiting probabilities, i.e.,

lim_{n→∞} (time in state i up to time n) / n = lim_{n→∞} P[X_n = i].

Show that an irreducible, aperiodic, positive recurrent Markov chain is ergodic (i.e., this general idea of ergodicity matches our definition for Markov chains).
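A simulation sketch of this property for the earlier 3-state example, whose stationary distribution is (1/2, 1/4, 1/4): the long-run fraction of time in each state should approach these limiting probabilities (the seed and step count are arbitrary):

```python
import random

# 3-state example chain; row i lists P[i -> j] for j = 0, 1, 2.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

def occupation_freqs(n_steps, seed=3):
    """Fraction of time spent in each state up to n_steps."""
    rng = random.Random(seed)
    x, counts = 0, [0, 0, 0]
    for _ in range(n_steps):
        x = rng.choices([0, 1, 2], weights=P[x])[0]
        counts[x] += 1
    return [c / n_steps for c in counts]

print(occupation_freqs(100_000))  # near [0.5, 0.25, 0.25]
```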
Semi-Markov Processes
A semi-Markov process is a discrete state, continuous time process {Z(t), t ≥ 0} for which the sequence of states visited is a Markov chain {X_n, n = 0, 1, ...} and, conditional on {X_n}, the times between transitions are independent.

Specifically, define transition times S_0, S_1, ... by S_0 = 0 and S_n − S_{n−1} ~ F_ij conditional on X_{n−1} = i and X_n = j. Then, Z(t) = X_{N(t)}, where N(t) = sup{n : S_n ≤ t}.

{X_n} is the embedded Markov chain.

The special case where F_ij ~ Exponential(ν_i) is called a continuous time Markov chain.

The special case when

F_ij(t) = 0 for t < 1,   1 for t ≥ 1

results in Z(n) = X_n, so we retrieve the discrete time Markov chain.
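A simulation sketch of the exponential special case (a continuous time Markov chain): the holding time in state i is Exponential(ν_i), and jumps follow the embedded chain. Function names and parameters below are illustrative:

```python
import random

def simulate_ctmc(P, nu, t_end, seed=4):
    """Semi-Markov process with embedded chain P and holding time
    Exponential(nu[i]) in state i, observed until time t_end."""
    rng = random.Random(seed)
    t, x = 0.0, 0
    jump_times, states = [0.0], [0]
    while t < t_end:
        t += rng.expovariate(nu[x])          # S_n - S_{n-1}
        x = rng.choices(range(len(nu)), weights=P[x])[0]
        jump_times.append(t)
        states.append(x)
    return jump_times, states

# two states that always swap, with different mean holding times
times, states = simulate_ctmc([[0, 1], [1, 0]], [1.0, 2.0], t_end=50.0)
```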
lim_{t→∞} P[Z(t) = i | Z(0) = j] = μ_i / μ_ii
  = lim_{t→∞} (time in i before t) / t   w.p. 1
Proof
Proof (continued)