
Stochastic Processes 2 (STAT 433) Course Note

Taught by Professor Yi Shen - Fall 2021

Department of Statistics, University of Waterloo

Tuan Hiep Do

Contents
1. Preparation
1.1. Probability Space and Random Variables
1.2. Stochastic Processes
1.3. Review on Discrete-time Markov Chain (DTMC)
1.4. Classification and Class Properties
1.5. Stationary Distribution and Limiting Behaviour
1.6. Infinite State Space
1.7. Branching Process
1.8. Absorption Probability and Absorption Time
2. Discrete Phase-type Distribution
References


1. Preparation
We will begin by reviewing the materials in STAT 333 before proceeding with new concepts
in STAT 433. The most fundamental of all is the concept of probability space and random
variables.
1.1. Probability Space and Random Variables.
Definition 1. A probability space, or a probability measure space, consists of a triplet
(Ω, E, P) where
(1) Ω is a sample space. This is the collection of all possible outcomes of a random experiment. For example, the collection of all possible outcomes of rolling a die, flipping a
coin, and forecasting weather would be {1, 2, 3, 4, 5, 6}, {H, T}, and {sunny, cloudy, rainy, · · · }
respectively.
(2) E is a σ-algebra/σ-field. It is the collection of all the events. More precisely, an event
E is a subset of Ω for which we can talk about probability. For example, an event of
getting odd numbers when rolling a die would be {1, 3, 5} ⊆ {1, 2, 3, 4, 5, 6}.
(3) P is a probability measure which is a function mapping E to R by assigning each
E ∈ E a real value P(E). In particular, it needs to satisfy the following axioms
(a) 0 ≤ P(E) ≤ 1 for any E ∈ E,
(b) P(∅) = 0, P(Ω) = 1, and
(c) For a countable collection of pairwise disjoint events {E_i}_{i=1}^∞,

P(⋃_{i=1}^∞ E_i) = Σ_{i=1}^∞ P(E_i).

In the language of measure theory, (Ω, E, P) is precisely a measure space subjected to the
normalization condition P(Ω) = 1. Axiom (3)(c) is often referred to as countable additivity.
Readers who are interested can refer to section 1.2 of reference [1] for full details.
Definition 2. A random variable X, abbreviated as R.V., is a real (measurable) function
from Ω to R by mapping w 7→ X(w).
1.2. Stochastic Processes.
A process is a change/evolvement over time and stochastic just means random. As such,
a stochastic process, abbreviated as S.P., informally is a random change/evolvement over
time. We could formulate this in two ways as described in the following diagram.

Number --(+ randomness)--> R.V. --(change over time)--> S.P.

Number --(change over time)--> function of time --(+ randomness)--> S.P.

The first approach would be to add randomness first and then let the random variable
evolve over time. This would yield a stochastic process. For the second approach, one may
let a number change over time in order to yield a function of time which combined with
randomness produces a stochastic process. It is worth noting that the second definition is
very hard to formulate since there is an involvement of a random function. Thus, we will
take the first approach in this course.
Definition 3. A stochastic process {X_t}_{t∈T} is a collection of random variables defined on a
common probability space, indexed by a set T.
In most cases, T corresponds to “time”, which could either be discrete as N ∪ {0} or continuous
as [0, ∞). In discrete cases, we typically write {X_n}_{n=0,1,···}. All the possible values of
Xt where t ∈ T are called the states of the process. Their collection is called the state space
which is denoted by S. Note that there is a subtle difference between the definition of a state
space and that of a sample space. The former implies that there is a time progression; that
is, the system will be in different states as time progresses. On the other hand, the latter
hints that there will be a probability measure defined on our sample space.

Naturally, the state space can be either discrete or continuous. For this section, we will
focus on discrete state space. One may relabel the states in a discrete state space S to get
the standardized state space {0, 1, 2, · · · } or {0, 1, · · · , n} if S is infinite or finite respectively.
1.3. Review on Discrete-time Markov Chain (DTMC).
For two events A and B, if P(A) > 0, then we define the conditional probability of B given
A as
P(B ∩ A)
P(B|A) := .
P(A)
Theorem 4. (Law of Total Probability) For a countable collection of pairwise disjoint events
{A_i}_{i=1}^∞ such that ⋃_i A_i = Ω,

P(B) = Σ_{i=1}^∞ P(B|A_i) · P(A_i).

Proof. By countable additivity and the definition of conditional probability, it follows that

P(B) = P(B ∩ Ω) = P(⋃_{i≥1} (B ∩ A_i)) = Σ_{i≥1} P(B ∩ A_i) = Σ_{i≥1} P(B|A_i) · P(A_i). □

Theorem 5. (Bayes’ Rule) For a countable collection of pairwise disjoint events {A_i}_{i=1}^∞
such that ⋃_i A_i = Ω,

P(A_i|B) = (P(B|A_i) · P(A_i)) / (Σ_{j=1}^∞ P(B|A_j) · P(A_j)).

Proof. Following the definition and the law of total probability, one has

P(A_i|B) = P(A_i ∩ B) / P(B) = (P(B|A_i) · P(A_i)) / (Σ_{j=1}^∞ P(B|A_j) · P(A_j)). □
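As a quick numerical check of the last two results, here is a minimal sketch (the two-event partition and all probabilities below are made-up illustrative numbers, not from the notes):

```python
# Hypothetical partition {A1, A2} of the sample space.
p_A = [0.3, 0.7]          # P(A1), P(A2)
p_B_given_A = [0.9, 0.2]  # P(B|A1), P(B|A2)

# Law of total probability: P(B) = sum_i P(B|Ai) * P(Ai)
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))

# Bayes' rule: P(Ai|B) = P(B|Ai) * P(Ai) / P(B)
p_A_given_B = [pb * pa / p_B for pb, pa in zip(p_B_given_A, p_A)]

print(p_B)                # 0.9*0.3 + 0.2*0.7 = 0.41
print(sum(p_A_given_B))   # the posterior probabilities sum to 1
```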

A very natural notion of independent events can be defined as follows.


Definition 6. Two events A and B are called independent if P(A ∩ B) = P(A) · P(B). We
will denote the fact that A and B are independent as A ⊥⊥ B.
An equivalent formulation of independence could also be stated in terms of conditional
probabilities.
Lemma 7. If P(A) > 0, then A ⊥⊥ B if and only if P(B|A) = P(B).
Proof. Note that

A ⊥⊥ B ⇐⇒ P(A ∩ B) = P(A) · P(B) ⇐⇒ P(B) = P(A ∩ B) / P(A) = P(B|A). □
Intuitively, A and B are independent if the probability of B occurring is not affected by
conditioning on A. Equipped with such background, we can now give a formal definition of
a DTMC.
Definition 8. {X_n}_{n=0,1,···} is called a (time-homogeneous) DTMC with transition matrix
P = {P_{ij}}_{i,j∈S} if for any n and j, i, i_{n−1}, · · · , i_0 ∈ S,

P(X_{n+1} = j | X_n = i, X_{n−1} = i_{n−1}, · · · , X_0 = i_0) = P(X_{n+1} = j | X_n = i) = P_{ij}.

The transition matrix P needs to satisfy P_{ij} ≥ 0 and Σ_{j∈S} P_{ij} = 1 for all i ∈ S.
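To make the definition concrete, here is a minimal simulation sketch (the function name and the two-state example matrix are our own, not from the notes): at each step, the next state is drawn using only the current state's row of P, which is exactly the Markov property.

```python
import random

def simulate_dtmc(P, x0, n_steps, seed=0):
    """Simulate n_steps transitions of a time-homogeneous DTMC.

    P is a list of rows (each row sums to 1); x0 is the starting state.
    """
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        i = path[-1]
        # Sample the next state j with probability P[i][j]: the past
        # influences the future only through the current state i.
        path.append(rng.choices(range(len(P)), weights=P[i])[0])
    return path

# Example: a two-state chain with illustrative transition probabilities.
P = [[0.9, 0.1],
     [0.5, 0.5]]
path = simulate_dtmc(P, x0=0, n_steps=10)
print(path)  # the start state followed by 10 sampled transitions
```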

Intuitively, the definition states that the past/history will only influence the future through
the present (state). Recall the n-step transition matrix P^{(n)} = {P^{(n)}_{ij}}_{i,j∈S} where

P^{(n)}_{ij} := P(X_{m+n} = j | X_m = i) = P(X_n = j | X_0 = i).

The Chapman-Kolmogorov (C-K) equation states that P^{(n+m)} = P^{(m)} · P^{(n)} = P^{(n)} · P^{(m)} and

P^{(n+m)}_{ij} = Σ_{k∈S} P^{(n)}_{ik} · P^{(m)}_{kj}.

As a corollary of the C-K equation, it follows that P^{(n)} = P^n, the n-fold matrix product P · P · · · P.
 
If an initial distribution µ = (µ_0, µ_1, · · ·) = (P(X_0 = 0), P(X_0 = 1), · · ·) is given, then
the distribution of X_n is equal to

µ_n = (P(X_n = 0), P(X_n = 1), · · ·) = µ · P^{(n)} = µ · P^n,
which means that µ and P fully characterize the distribution of a DTMC.
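The identity µ_n = µ · P^n is easy to check numerically; a short sketch using numpy (the two-state chain is our own example):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
mu = np.array([1.0, 0.0])  # start deterministically in state 0

# Distribution after n steps: mu_n = mu * P^n
n = 3
mu_n = mu @ np.linalg.matrix_power(P, n)

print(mu_n.sum())  # mu_n is still a probability vector: it sums to 1
```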

From previous probability courses, readers are familiar with the notion of conditional expectation
E[g(X)|Y = y] at a specific value y of Y, where X, Y are random variables and g is a
real (measurable) function. In particular,

E[g(X)|Y = y] = Σ_x g(x) · P(X = x | Y = y)   in the discrete case,
E[g(X)|Y = y] = ∫ g(x) f_{X|Y}(x|y) dx        in the continuous case.

This is a value which is not necessarily finite. We may regard E[g(X)|Y] as a random variable
and also as a function of Y that takes specific values at w ∈ Ω:

E[g(X)|Y](w) = E[g(X)|Y = Y(w)].

The law of iterated expectation states that E[E(X|Y)] = E(X) and as a result,

E[f(X_n)] = µ · P^n · f′ = µ_n · f′ = µ · f^{(n)}′.

In here, f′ = (f(0), f(1), · · ·)^T and

f^{(n)}′ = (E[f(X_n)|X_0 = 0], E[f(X_n)|X_0 = 1], · · ·)^T.
1.4. Classification and Class Properties.
In this section, we will state results related to classification from STAT 333 and leave out
the proofs. Readers can refer to previous notes for full details.
Definition 9. For two states x and y, x is said to lead to y, which we denote by x → y, if

ρ_{xy} := P_x(T_y < ∞) = P(T_y < ∞ | X_0 = x) > 0,

where T_y := min{n ≥ 1 : X_n = y}. Equivalently, there exists n ∈ N so that P^n_{xy} > 0. We then say
that “x can go to y”.
It is easy to see that “→” satisfies transitivity. Namely, if x, y, and z are states so that
x → y and y → z, then x → z. Thus, this inspires the definition of a communicating class.
Definition 10. We say that C ⊆ S is a communicating class if
(1) For all i, j ∈ C, i ↔ j. That is, i → j and j → i.
(2) For all i ∈ C and j ∉ C, i ↮ j. This means that either i ↛ j or j ↛ i.
A DTMC is called irreducible if all the states are in the same communicating class.
Let

N(y) := Σ_{n=1}^∞ 1{X_n = y} = total number of visits to y.
We have the following results related to transience and recurrence from STAT 333.
Recurrence                       Transience
ρ_yy = 1                         ρ_yy < 1
P_y(N(y) = ∞) = 1                P_y(N(y) < ∞) = 1
E_y(N(y)) = ∞                    E_y(N(y)) < ∞
Σ_{n=1}^∞ P^n_{yy} = ∞           Σ_{n=1}^∞ P^n_{yy} < ∞

It turns out that recurrence and transience are class properties. Other criteria and properties
for recurrence and transience are
(1) If ρ_xy > 0 and ρ_yx < 1, then x is transient. This makes sense intuitively because
there is a positive probability that the chain will not come back to x given that it starts
from x. Taking the contrapositive, if x is recurrent and ρ_xy > 0, then ρ_yx = 1.
(2) If A is a closed set, x ∈ A, and y ∉ A, then P_xy = 0 or equivalently, ρ_xy = 0. In
particular, a closed set with finitely many states has at least one recurrent state. As such,
a closed class with finitely many states must be recurrent, which means that an irreducible
DTMC with finite state space is recurrent.

(3) We have a decomposition of the state space as follows:

S = T ∪ R_1 ∪ R_2 ∪ · · ·

where T is the collection of transient states (not necessarily one class) and the R_i are
closed recurrent classes.
Moreover, it turns out that

E_x(N(y)) = ρ_xy / (1 − ρ_yy).
Property 11. (Strong Markov Property) For any state y ∈ S, {X_{T_y + k}}_{k=0}^∞ behaves like a
Markov chain starting from X_0 = y.
Definition 12. The periodicity of a state x ∈ S is defined as d(x) := gcd{n ≥ 1 : P^n_{xx} > 0}.
If d(x) = 1, then x is said to be aperiodic. The Markov chain is called aperiodic if all states
are aperiodic.
It is not surprising that periodicity is a class property. Furthermore, if Pxx > 0, then x is
trivially aperiodic but the converse is not true.
1.5. Stationary Distribution and Limiting Behaviour.
In this section, we will state results related to stationary distribution as well as the limiting
behaviour for DTMCs without proofs. Readers can refer to previous notes for full details of
the proofs for these results.
Definition 13. A distribution π is called a stationary distribution if πP = π and π1 = 1.
Lemma 14. Let x ∈ S. If the chain is irreducible and recurrent, then

µ_x(y) := Σ_{n=0}^∞ P_x(X_n = y, T_x > n) = E_x(number of visits to y before returning to x)

gives a stationary measure. That is, µ := (µ_x(y))_{y∈S} satisfies µP = µ. It can be normalized
to a stationary distribution if and only if Σ_{y∈S} µ_x(y) = E_x(T_x) < ∞. Note that the condition
E_x(T_x) < ∞ is the condition for positive recurrence.
Lemma 15. If a DTMC is irreducible and has a stationary distribution, then it is recurrent.
We can now state the main theorems for DTMCs.
Theorem 16. (Convergence) Suppose that a DTMC is irreducible, aperiodic, and has a
stationary distribution π. Then,

lim_{n→∞} P^n_{xy} = π_y

for all x, y ∈ S. Note that this limit does not depend on x.
Theorem 17. (Long-run Frequency) Suppose that a DTMC is irreducible and recurrent. Then

lim_{n→∞} N_n(y)/n = 1/E_y(T_y).

Here N_n(y) represents the number of visits to y up to time n. Thus, the quantity N_n(y)/n is the
fraction of time spent in y up to time n. This converges to the reciprocal of the expected
cycle length.

Theorem 18. Suppose that a DTMC is irreducible and has a stationary distribution π. Then

π_y = 1/E_y(T_y).

In particular, the stationary distribution is unique.
Corollary 18.1. If a DTMC is irreducible, aperiodic, and has a stationary distribution π,
then

π_y = lim_{n→∞} P^n_{xy} = lim_{n→∞} N_n(y)/n = 1/E_y(T_y).

That is, the stationary distribution agrees with the limiting transition probability, the long-run
fraction of time, and 1/(expected revisit time).
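These identities suggest two numerical routes to π: solve πP = π with the normalization π1 = 1, or take a large power of P and read off a row. A sketch with an assumed two-state chain (the matrix is our own example):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Solve pi P = pi, pi 1 = 1: replace one row of the singular system
# (P^T - I) pi^T = 0 with the normalization constraint.
A = np.vstack([(P.T - np.eye(2))[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

# Compare with the rows of P^n for large n (convergence theorem).
P_large = np.linalg.matrix_power(P, 100)

print(pi)          # [5/6, 1/6]
print(P_large[0])  # each row of P^n converges to pi
```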
Theorem 19. (Long-run Average) Suppose that the chain is irreducible, has a stationary
distribution π, and Σ_x |f(x)| · π_x < ∞. Then

lim_{n→∞} (1/n) Σ_{m=1}^n f(X_m) = lim_{n→∞} (1/n) Σ_{m=0}^{n−1} f(X_m) = Σ_x f(x) π(x) = π · f′,

where f′ = (f(0), f(1), · · ·)^T.
Definition 20. A distribution π satisfies the detailed balance condition if πx Pxy = πy Pyx for
all x, y ∈ S.
It was shown in STAT 333 that the detailed balance condition implies the existence of a
stationary distribution but the converse is not true. However, the converse will hold if P is
tridiagonal.
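For a tridiagonal (birth-death) chain, the detailed balance condition can be used constructively: each equation π_x P_{x,x+1} = π_{x+1} P_{x+1,x} determines π_{x+1} from π_x. A minimal sketch (the three-state chain below is our own example, not from the notes):

```python
import numpy as np

# A birth-death chain on {0, 1, 2}: only neighbour moves or staying put.
P = np.array([[0.5,  0.5,  0.0],
              [0.25, 0.5,  0.25],
              [0.0,  0.5,  0.5]])

# Detailed balance along the off-diagonal:
# pi_{x+1} = pi_x * P[x, x+1] / P[x+1, x], then normalize.
pi = np.ones(3)
for x in range(2):
    pi[x + 1] = pi[x] * P[x, x + 1] / P[x + 1, x]
pi /= pi.sum()

# Detailed balance implies stationarity: pi P = pi.
print(np.allclose(pi @ P, pi))  # True
```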
Definition 21. Fix n and define Ym = Xn−m for m ∈ {0, · · · , n}. The chain {Ym } is called
a time-reversed chain.
Lemma 22. {Ym } is a DTMC if {Xn } is a DTMC that starts from a stationary distribution.
In particular, {Xn } =d {Ym }.
Theorem 23. Time reversibility is equivalent to the detailed balance condition.
1.6. Infinite State Space.
For any x ∈ S, x is either transient or recurrent. Within recurrence, there are
two further sub-categories that prove to be useful when S is infinite, namely positive re-
currence and null recurrence. A state x is said to be positive recurrent if x is recurrent
and Ex (Tx ) < ∞. On the other hand, x is said to be null recurrent if it is recurrent and
Ex (Tx ) = ∞. Once again, transience, positive recurrence, and null recurrence are class prop-
erties.

With these two notions, it follows that a stationary distribution exists if and only if there
exists at least one positive recurrent class and it is unique if and only if there exists only
one positive recurrent class. Additionally, π_j = 0 for every stationary distribution π if and only if j is transient or null
recurrent. Finally, a DTMC with finite state space must have at least one positive recurrent
class and no null recurrent class/state.

Example: Recall that a simple random walk is transient for p ≠ 1/2 and it is null recurrent if p = 1/2.
1.7. Branching Process.
Consider a branching process {X_n} where X_n is the population size of generation n and
X_0 = 1. Let Y be the random number of offspring of one individual.

Then E(X_n) = µ^n where µ = E(Y). The extinction probability satisfies

P(X_n = 0 for some n) = 1 if µ ≤ 1, and P(X_n = 0 for some n) < 1 if µ > 1.
Moreover, the extinction probability is the smallest solution of the equation s = ϕ(s) between 0 and 1, where

ϕ(s) = E(s^Y) = Σ_{i=0}^∞ s^i · P(Y = i)

is the generating function of Y defined on [0, 1].
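The smallest root of s = ϕ(s) on [0, 1] can be found by iterating s ← ϕ(s) from s = 0, a standard fixed-point approach. A sketch (the function name and the offspring distributions below are our own examples):

```python
def extinction_probability(offspring_pmf, tol=1e-12, max_iter=10_000):
    """Smallest fixed point of the generating function phi on [0, 1],
    found by iterating s <- phi(s) starting from s = 0.

    offspring_pmf[i] = P(Y = i)."""
    phi = lambda s: sum(p * s**i for i, p in enumerate(offspring_pmf))
    s = 0.0
    for _ in range(max_iter):
        s_new = phi(s)
        if abs(s_new - s) < tol:
            break
        s = s_new
    return s

# Supercritical example: P(Y=0) = 0.25, P(Y=2) = 0.75, so mu = 1.5 > 1.
# phi(s) = 0.25 + 0.75 s^2, and the smallest root of s = phi(s) is 1/3.
print(extinction_probability([0.25, 0.0, 0.75]))  # ~0.333...
```

Starting from s = 0 matters: the iterates increase monotonically to the *smallest* fixed point, which is the extinction probability.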


1.8. Absorption Probability and Absorption Time.
Consider first an example of a DTMC with states 1, 2, 3, 4 and transition matrix

         1     2     3     4
    1 ( 0.25  0.6   0     0.15 )
P = 2 ( 0     0.2   0.7   0.1  )
    3 ( 0     0     1     0    )
    4 ( 0     0     0     1    )

so that states 3 and 4 are absorbing.
The main question that we want to answer in this section is of the following form: starting
from state 1 or 2, what is the probability that the chain gets absorbed to state 3 before state
4? In order to answer this type of question, define
h(x) := Px (absorbed to 3) = P(absorbed to 3|X0 = x).
By first step analysis,

h(1) = Σ_{x=1}^4 P(absorbed to 3 | X_1 = x, X_0 = 1) · P(X_1 = x | X_0 = 1).

Note that, by the Markov property, the quantity

P(absorbed to 3 | X_1 = x, X_0 = 1) =
    h(1)  if x = 1,
    h(2)  if x = 2,
    1     if x = 3,
    0     if x = 4.

And so, h(1) = 0.25 · h(1) + 0.6 · h(2). Similarly, h(2) = 0.2 · h(2) + 0.7. Solving the system
of equations yields h(1) = 0.7, h(2) = 7/8. We can generalize this approach in order to obtain
the following result.
Theorem 24. (General Result) Suppose that S = A ∪ B ∪ C where C is finite. Starting
from any state in C, we are interested in the probability that the chain gets absorbed to set
A rather than set B, assuming this probability is positive. Define

h(x) = P(absorbed to A | X_0 = x).

Then h(x) = 1 for x ∈ A and h(x) = 0 for x ∈ B. In particular, one can solve for h′ to obtain
h′ = (I − Q)^{−1} · R′_A where

h′ = (h(x_1), h(x_2), · · ·)^T,
Q = {P_{xy}}_{x,y∈C},
R′_A = (Σ_{y∈A} P_{x_1 y}, Σ_{y∈A} P_{x_2 y}, · · ·)^T,

in which x_i ∈ C.
Proof. For x ∈ C, by using the law of total probability and conditioning on X_1, it follows
that

h(x) = Σ_{y∈S} P(absorbed to A | X_1 = y, X_0 = x) · P(X_1 = y | X_0 = x)
     = Σ_{y∈S} P_{xy} h(y)
     = Σ_{y∈C} P_{xy} h(y) + Σ_{y∈A} P_{xy}.

In matrix notation, this is h′ = Q · h′ + R′_A. Rearranging yields (I − Q)h′ = R′_A. It remains
to show that I − Q is invertible. Indeed, one can modify the transition matrix as follows
(rows and columns ordered C, then A ∪ B):

P = ( Q  R )        P′ = ( Q  R )
    ( *  * )   ↦         ( 0  I )
Since we are only interested in observing the chain before it hits A or B, changing the
transition probabilities going out of the states in A or B will not change the result of this
problem. After this change, the states in A and B are absorbing, and all the states in C are
transient. Hence, for x ∈ C,

0 = lim_{n→∞} P_x(X′_n ∈ C) = lim_{n→∞} Σ_{y∈C} (P′)^n_{xy} = lim_{n→∞} Σ_{y∈C} Q^n_{xy},

where {X′_n} denotes the modified chain.

The second and third equalities are obtained from the fact that P_x(X′_n = y) = (P′)^n_{xy} for any
y ∈ C, together with the block structure

(P′)^n = ( Q^n  * )
         ( 0    I )

in the same ordering C, then A ∪ B. Now, note that

Q^n_{xy} = Σ_{x_1,···,x_{n−1}∈C} Q_{x x_1} Q_{x_1 x_2} · · · Q_{x_{n−1} y}
         = Σ_{x_1,···,x_{n−1}∈C} P(X′_1 = x_1, X′_2 = x_2, · · · , X′_n = y | X′_0 = x).

Therefore,

Σ_{y∈C} Q^n_{xy} = Σ_{x_1,···,x_{n−1},y∈C} P(X′_1 = x_1, X′_2 = x_2, · · · , X′_n = y | X′_0 = x)
                 = P(X′_1 ∈ C, X′_2 ∈ C, · · · , X′_n ∈ C | X′_0 = x)
                 = P_x(no absorption until time n)
                 = P_x(X′_n ∈ C).
Because lim_{n→∞} Σ_{y∈C} Q^n_{xy} = 0 for any x ∈ C, it follows that lim_{n→∞} Q^n = 0. Thus, all the
eigenvalues of Q have norms smaller than 1, which means that there does not exist a non-zero
column vector f′ so that f′ = Qf′, or equivalently (I − Q)f′ = 0. Thus, I − Q is
invertible. Finally, as (I − Q)h′ = R′_A, one then obtains h′ = (I − Q)^{−1} R′_A, which completes
the proof. □
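As a numerical check of the theorem on the four-state example at the start of this section (a sketch using numpy, our choice of tool):

```python
import numpy as np

# Transient set C = {1, 2}; target A = {3}; B = {4}.
Q = np.array([[0.25, 0.6],
              [0.0,  0.2]])   # one-step transitions within C
R_A = np.array([0.0, 0.7])    # one-step probabilities from C into A

# h' = (I - Q)^{-1} R'_A
h = np.linalg.solve(np.eye(2) - Q, R_A)
print(h)  # [0.7, 0.875], matching h(1) = 0.7 and h(2) = 7/8
```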

With a similar theme as above, let us now consider the idea of absorption time. The basic
setting is as follows: S = A ∪ C disjoint, C is finite, and the same assumptions as for the
absorption probability apply. Define
VA := min{n ≥ 0 : Xn ∈ A}.
We want to determine E_x(V_A) for x ∈ C. Denote g(x) := E_x(V_A); then by first step analysis,

g(x) = Σ_{y∈S} E(V_A | X_1 = y, X_0 = x) · P(X_1 = y | X_0 = x) = Σ_{y∈S} P_{xy} (g(y) + 1)
     = Σ_{y∈S} P_{xy} g(y) + 1
     = Σ_{y∈C} Q_{xy} g(y) + 1.

Using matrix notation, this is g′ = Q · g′ + 1′, where g′ = (g(x_1), g(x_2), · · ·)^T with x_i ∈ C
and 1′ = (1, 1, · · ·)^T. Rearranging yields (I − Q)g′ = 1′. Observe that the matrix I − Q
is exactly the same as what we have already seen in the part on absorption probability. As
such, this matrix is invertible, which implies that g′ = (I − Q)^{−1} 1′ and moreover, this is the
unique solution.
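Continuing the same four-state example (now with A = {3, 4} absorbing and C = {1, 2}), the expected absorption times follow from g′ = (I − Q)^{−1} 1′; a sketch:

```python
import numpy as np

Q = np.array([[0.25, 0.6],
              [0.0,  0.2]])  # one-step transitions within C = {1, 2}

# g' = (I - Q)^{-1} 1'
g = np.linalg.solve(np.eye(2) - Q, np.ones(2))
print(g)  # expected number of steps to absorption from states 1 and 2
```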

2. Discrete Phase-type Distribution

Consider a DTMC {X_n}_{n=0,1,···} with transient states 0, · · · , M − 1 and absorbing states
M, · · · , N. The one-step transition probability matrix will have the following form (rows and
columns ordered 0, · · · , M − 1, M, · · · , N):

P = ( Q  R )
    ( 0  I )

Let T := min{n : M ≤ X_n ≤ N} be the time until absorption. Assume that we start from
the initial distribution

α_0 = (α_{0,0}, α_{0,1}, · · · , α_{0,M−1}, α_{0,M}, · · · , α_{0,N}).

In here, α_{0,i} = P(X_0 = i). Let α′_0 = (α_{0,0}, α_{0,1}, · · · , α_{0,M−1}). In this section, we are
interested in the exact distribution of T.

First,
References
[1] K. Davidson. Note for Pure Math 451 - Measure Theory. 2020.
