Reinforcement Learning

Markov Decision Process

Random variables
We denote random variables with capital letters, e.g., X, Y. A lowercase
letter denotes a specific value a random variable may take.

We use the notation Pr{X = x} = p(X = x), or simply p(x), to denote
the probability that the random variable X may assume the value x.

For discrete spaces, the random variable may take a countable number of
values.

Example: rolling a fair die, where the random variable X, representing the
die side facing upwards, can take the discrete values 1, 2, 3, 4, 5, 6, thus:
$$ p(X = 1) = p(X = 2) = \cdots = p(X = 6) = \frac{1}{6} $$
For continuous spaces, the random variable may take any of infinitely
many values in a continuum. Example: Gaussian distribution.

Probability mass function (PMF)
The PMF of a random variable is a function p(x) that specifies the
probability of the random variable taking a particular value.

The PMF is nonnegative everywhere, and the probabilities of all
possible events (values x ∈ X) must add up to 1:
$$ \sum_{x \in \mathcal{X}} p(x) = 1, \qquad p(x) \ge 0 $$

Example: in a fair die, X = {1, 2, 3, 4, 5, 6} and
each value x ∈ X has probability of facing upright
equal to 1/6, and:
$$ \sum_{x \in \mathcal{X}} p(x) = \sum_{i=1}^{6} \frac{1}{6} = 1 $$
The table below shows the probability
distribution of a loaded die, i.e., some numbers
have higher probability than others.

value: probability
x      p(x)
1      0.20
2      0.25
3      0.20
4      0.15
5      0.10
6      0.10
sum    1.0
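
The two PMF conditions (nonnegativity and normalization) can be checked numerically. The following is a minimal sketch, assuming a PMF is stored as a Python dictionary mapping each value x to p(x); the helper name `is_valid_pmf` and the tolerance are illustrative choices, not part of the slides.

```python
import math

def is_valid_pmf(pmf, tol=1e-9):
    """Return True if all probabilities are nonnegative and sum to 1."""
    nonnegative = all(p >= 0 for p in pmf.values())
    normalized = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return nonnegative and normalized

# Fair die: every face has probability 1/6.
fair_die = {x: 1 / 6 for x in range(1, 7)}
# Loaded die: probabilities taken from the table above.
loaded_die = {1: 0.20, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.10, 6: 0.10}

print(is_valid_pmf(fair_die))
print(is_valid_pmf(loaded_die))
```

Both dice pass the check; a dictionary whose values sum to something other than 1 would not.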

Conditional probability

Given two random variables X and Y , the joint probability describes the
probability that X takes on the value x and Y takes the value y. It is
denoted as:
p(x, y) = p(X = x and Y = y)
If the random variables are independent (e.g., rolling two dice, rolling one
die twice),
p(x, y) = p(x)p(y).
The conditional probability describes the probability that X takes the
value x, given that Y has for sure assumed the value y. For p(y) > 0,
the conditional probability is defined as:

$$ p(x \mid y) = \frac{p(x, y)}{p(y)} $$

If the variables are independent, we get p(x | y) = p(x).
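
As an illustration of these definitions, the sketch below builds the joint PMF of two independent fair dice and recovers p(x | y) from it. The function names `marginal_y` and `conditional` are hypothetical helpers introduced for this example, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction
from itertools import product

# Joint PMF of two independent fair dice: p(x, y) = p(x) p(y) = 1/36.
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def marginal_y(joint, y):
    """p(y): sum the joint PMF over all values of x."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def conditional(joint, x, y):
    """p(x | y) = p(x, y) / p(y), defined only when p(y) > 0."""
    p_y = marginal_y(joint, y)
    if p_y == 0:
        raise ValueError("p(y) must be positive")
    return joint.get((x, y), Fraction(0)) / p_y

print(conditional(joint, 3, 5))  # 1/6
```

Because the dice are independent, the conditional probability equals the marginal p(x) = 1/6 for every y.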

Expected value
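The expected value of a random variable X with a discrete probability distribution is:
$$ E(X) = \sum_{i} p(x_i)\, x_i $$
where the sum runs over all possible values x_i of X.

It represents the average value expected for X over a large number N of samples
(here x_i denotes the i-th observed sample):
$$ \frac{1}{N} \sum_{i=1}^{N} x_i \to E(X), \quad \text{as } N \to \infty $$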

Example: the throw of a fair die: each value (1, . . . , 6) has a probability
of 1/6. The expected value is:
$$ E(X) = \sum_{i=1}^{6} p(x_i)\, x_i = \sum_{i=1}^{6} \frac{1}{6}\, i = \frac{1}{6} \sum_{i=1}^{6} i = \frac{1 + \cdots + 6}{6} = \frac{21}{6} = 3.5 $$

Exercise: Expected value of a loaded die
In a loaded die, some numbers have higher
probability than others, i.e., the die does not
have an equiprobable discrete probability
distribution, and hence the die is not fair.

value: probability
x      p(x)
1      0.20
2      0.25
3      0.20
4      0.15
5      0.10
6      0.10
sum    1.0

Throwing this (or any) die a large number of
times n, the average will be
$$ \frac{1}{n} \sum_{i=1}^{n} x_i \approx E(X) $$
What is the expected value E(X) for this die?
$$ E(X) = \sum_{i=1}^{6} p(x_i)\, x_i = \sum_{i=1}^{6} p(i)\, i $$
$$ = 1 \cdot 0.20 + 2 \cdot 0.25 + 3 \cdot 0.20 + 4 \cdot 0.15 + 5 \cdot 0.10 + 6 \cdot 0.10 $$
$$ = 3.0 $$
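
The same computation can be written as a short sketch, assuming the PMF is given as a dictionary from value to probability. The function name `expected_value` is an illustrative choice; the loaded-die probabilities are those of the exercise.

```python
def expected_value(pmf):
    """E(X) = sum over x of p(x) * x, for a discrete PMF given as {x: p(x)}."""
    return sum(p * x for x, p in pmf.items())

fair_die = {x: 1 / 6 for x in range(1, 7)}
loaded_die = {1: 0.20, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.10, 6: 0.10}

# Rounding only hides floating-point noise in the last digits.
print(round(expected_value(fair_die), 6))    # 3.5
print(round(expected_value(loaded_die), 6))  # 3.0
```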

Example: Expected value of a die
Find the expected value of a die with unknown probabilities.
Solution: you have to actually throw the die a large number of times and
compute the average value.

Values of the first 5 throws: 2, 2, 4, 2, 4.

Running average over the first 5 throws:
$$ \frac{1}{n} \sum_{i=1}^{n} x_i: \quad \frac{2}{1},\ \frac{4}{2},\ \frac{8}{3},\ \frac{10}{4},\ \frac{14}{5} $$

[Figure: average value for the first 50 throws; surviving value: 2.8]
[Figure: average value for the last 50 throws (1000 in total); surviving values: 2.96, 2.95]
[Figure: count of values for one thousand throws, over the faces 1 to 6]
[Figure: count of values for one million throws, over the faces 1 to 6]
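
The experiment can be imitated in simulation. The sketch below assumes the loaded-die probabilities from the earlier exercise stand in for the "unknown" die (the experimenter only sees the throws); the helper `running_averages`, the seed, and the number of throws are illustrative choices.

```python
import random

faces = [1, 2, 3, 4, 5, 6]
# Assumed hidden probabilities; the experimenter only observes the throws.
weights = [0.20, 0.25, 0.20, 0.15, 0.10, 0.10]

def running_averages(throws):
    """Return the averages after 1, 2, ..., n throws."""
    averages, total = [], 0
    for n, x in enumerate(throws, start=1):
        total += x
        averages.append(total / n)
    return averages

rng = random.Random(0)  # fixed seed so the run is repeatable
throws = rng.choices(faces, weights=weights, k=100_000)
averages = running_averages(throws)

# Averages after 5, 50, and all throws; the last should be close to E(X).
print(averages[4], averages[49], averages[-1])
```

Applied to the five throws listed above, `running_averages([2, 2, 4, 2, 4])` reproduces the sequence 2/1, 4/2, 8/3, 10/4, 14/5.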

Remarks: loaded die
In the two previous slides, we were dealing with the same loaded die. That
is, the probability mass distribution (PMD) was the same.

First case: we knew the PMD of X.

value: probability
x      p(x)
1      0.20
2      0.25
3      0.20
4      0.15
5      0.10
6      0.10
sum    1.0

The expected value was computed by:
$$ E(X) = \sum_{i=1}^{6} p(x_i)\, x_i = \sum_{i=1}^{6} p(i)\, i = 3.0 $$

Second case: the PMD of X was unknown. We had to find it
experimentally by actually rolling the die a large number of times n.

[Figure: count of values for one million throws, over the faces 1 to 6; vertical axis in units of 10^5]

The expected value was estimated using all the results of the experiment:
$$ \frac{1}{n} \sum_{i=1}^{n} x_i \approx E(X) $$

We will see that many RL approaches rely on experimental estimation,
i.e., Monte Carlo methods: averaging over many random samples.
Example: Monte Carlo Simulations
Estimate the value of the constant π using a Monte Carlo simulation.
Solution: We know that the area of a square with side length
ℓ = 2 is A = ℓ² = 4. A circle inside that square, with
radius r = ℓ/2 = 1, has area πr² = π. Using N
random points uniformly distributed over the square,
we can count how many
points are inside the circle. This provides a stochastic
measure of the area of the circle relative to that of the square.
▶ N: total number of random points (x_i, y_i),
▶ C: number of points that satisfy x_i² + y_i² ≤ 1,
▶ π̄ = (C/N) A: estimated area of the unit circle, with π̄ → π as N → ∞.

N      C       π̄
10     8       3.2
100    76      3.04
1E3    796     3.184
1E4    7840    3.136
1E5    78504   3.1406
1E6    -       3.142812

[Figure: sampled points for N = 100, N = 1000, and N = 5000]
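
A minimal sketch of this simulation follows, assuming the square is [-1, 1] × [-1, 1] with the unit circle centred at the origin. The function name `estimate_pi` and the seed are illustrative, and the printed estimates depend on the random numbers drawn, so they will not match the table digit for digit.

```python
import random

def estimate_pi(n, seed=0):
    """Estimate pi as (C / N) * A, with A = 4 the area of the square."""
    rng = random.Random(seed)
    inside = 0  # C: points that fall inside the unit circle
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    area_square = 4.0
    return inside / n * area_square

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The error shrinks slowly as N grows (roughly with the square root of N), which is typical of Monte Carlo estimates.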

Markov process

[Figure: two-state Markov chain with states s0 and s1; transition probabilities shown: 0.1, 0.9, 1.0]

A Markov process or Markov chain is named
after the Russian mathematician Andrey
Markov.
Informally: "What happens next depends
only on the state of affairs now." (source:
Wikipedia)
A Markov process is a stochastic model describing
▶ a sequence of possible events, in which
▶ the probability of each event depends only on the state attained in
the previous event.
Markov property:
▶ memoryless property of a stochastic process, or
▶ all information about the past that makes a difference for the future
is included in the current state
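
The Markov property can be made concrete with a small simulation. The sketch below uses a hypothetical two-state chain: the state names s0 and s1 follow the slide, but the assignment of probabilities to transitions is an assumption made for illustration, and `step` and `simulate` are illustrative helper names.

```python
import random

# Hypothetical transition probabilities p(next | current); which edge carries
# which probability is an assumption, not read from the slide.
transitions = {
    "s0": {"s0": 0.1, "s1": 0.9},
    "s1": {"s0": 1.0},
}

def step(state, rng):
    """Sample the next state using only the current state (Markov property)."""
    next_states = list(transitions[state])
    probs = list(transitions[state].values())
    return rng.choices(next_states, weights=probs, k=1)[0]

def simulate(start, n_steps, seed=0):
    """Return the list of visited states, starting from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

print(simulate("s0", 10))
```

Note that `step` receives only the current state: nothing about the earlier path is needed to sample the next one.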

Markov Decision Process: intuitively
In an MDP, the probability of transitioning to a new state depends on the
current state and on the action taken in that state. Each state
transition caused by an action has an associated reward.

[Figure: two-state MDP with states s0, s1 and actions a1, a2; each edge is labeled with (probability, reward)]

state   action   next   prob.         reward
s       a        s′     p(s′|s, a)    r(s, a, s′)
s0      a1       s1     0.4           +1
s0      a1       s0     0.6           0
s1      a2       s0     1.0           −1
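
One way to read the table is as a sampling procedure: given a state and an action, draw the next state (and its reward) according to the listed probabilities. The sketch below assumes this dictionary encoding of the table; the names `mdp` and `step` and the fixed seed are illustrative.

```python
import random

# (state, action) -> list of (next_state, probability, reward), from the table.
mdp = {
    ("s0", "a1"): [("s1", 0.4, +1), ("s0", 0.6, 0)],
    ("s1", "a2"): [("s0", 1.0, -1)],
}

def step(state, action, rng):
    """Sample (next_state, reward) for the given state-action pair."""
    outcomes = mdp[(state, action)]
    probs = [p for _, p, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

rng = random.Random(0)
state = "s0"
for _ in range(5):
    # The table lists a single action for each state.
    action = "a1" if state == "s0" else "a2"
    next_state, reward = step(state, action, rng)
    print(state, action, next_state, reward)
    state = next_state
```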

Finite Markov Decision Process
A finite MDP relies on discrete sets with a finite number of elements:
▶ S: set of states
▶ A: set of actions
▶ R: set of rewards
In the RL context, an MDP is an idealized model of how the environment
reacts to an agent's actions. At each discrete time step t = 1, 2, 3, . . .,
the MDP and the agent interact in the following way:
▶ The state is S_{t−1} ∈ S. Note that S_{t−1} is a random variable, and may
take a specific value s_{t−1} with probability p(S_{t−1} = s_{t−1}) = p(s_{t−1}).
▶ At any state s ∈ S, the agent can choose an action
A_{t−1} ∈ A(s) ⊆ A.
▶ The environment reacts to this action by transitioning to a new state
S_t ∈ S and giving the agent a reward R_t ∈ R ⊂ ℝ.
That is, the MDP and the agent create a trajectory:

S_0, A_0, R_1, S_1, A_1, R_2, S_2, . . .


The dynamics of the MDP are governed by the probability function
$$ p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\} $$
The function p : S × R × S × A → [0, 1] is a deterministic function of 4
arguments. Being a probability mass distribution, it satisfies:
$$ \sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall\, s \in \mathcal{S},\ a \in \mathcal{A}(s) $$
MDPs are a mathematically idealized form of the reinforcement learning
problem for which precise theoretical statements can be made.

MDP: definitions
Some fundamental definitions:
▶ The state-transition probability:
$$ p(s' \mid s, a) = \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a) $$

▶ The expected reward for state-action pairs:
$$ r(s, a) = E(R_t \mid S_{t-1} = s, A_{t-1} = a) = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a) $$
Recall that the expected value for a random variable X with discrete
probability is
$$ E(X) = \sum_{i} x_i\, p(x_i) $$

▶ The expected reward for state-action-next-state triplets:
$$ r(s, a, s') = E(R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s') = \sum_{r \in \mathcal{R}} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)} $$

Exercise: MDP
[Figure: MDP with states sx, sy and actions ax, ay, az; edges labeled with (probability, reward): ax from sx with (0.4, +1) and (0.6, 0); ay from sy with (1.0, −1); az from sy with (0.3, −1) and (0.7, +0.5)]

What are the three sets describing this MDP?
S = {sx, sy}
A = {ax, ay, az}
R = {−1, 0, +0.5, +1}

What actions are available in state sy, i.e., A(sy)?
A(sy) = {ay, az}

The MDP as a table:

s     a     s′    p(s′|s, a)   r(s, a, s′)
sx    ax    sy    0.4          +1
sx    ax    sx    0.6          0
sy    ay    sx    1.0          −1
sy    az    sy    1.0          0.05

How is the last row computed?
Taking action az at s = sy implies s′ = sy, guaranteed. However, the
obtained reward varies. The expected reward for the triplet (sy, az, sy) is:
$$ r(s, a, s') = \sum_{r \in \mathcal{R}} r\, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)} = \frac{(-1 \cdot 0.3) + (0.5 \cdot 0.7)}{1.0} = 0.05 $$
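
The quantities in this exercise can be derived mechanically from the four-argument dynamics p(s′, r | s, a). The sketch below assumes the dynamics are stored as a nested dictionary read off the exercise diagram; the function names are illustrative helpers, not notation from the slides.

```python
# p[(s, a)] maps (s_next, r) -> p(s', r | s, a), read off the exercise diagram.
p = {
    ("sx", "ax"): {("sy", 1.0): 0.4, ("sx", 0.0): 0.6},
    ("sy", "ay"): {("sx", -1.0): 1.0},
    ("sy", "az"): {("sy", -1.0): 0.3, ("sy", 0.5): 0.7},
}

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(q for (sn, _), q in p[(s, a)].items() if sn == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over r and s' of r * p(s', r | s, a)."""
    return sum(r * q for (_, r), q in p[(s, a)].items())

def expected_reward_triplet(s, a, s_next):
    """r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a).

    Only defined when p(s' | s, a) > 0.
    """
    denom = transition_prob(s, a, s_next)
    numer = sum(r * q for (sn, r), q in p[(s, a)].items() if sn == s_next)
    return numer / denom

# Each conditional distribution p(., . | s, a) must sum to 1.
for key, dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, key

print(round(expected_reward_triplet("sy", "az", "sy"), 6))  # 0.05
print(round(expected_reward("sx", "ax"), 6))                # 0.4
```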
Discounted return
After taking an action, the agent receives an immediate reward from the
environment. In reinforcement learning, the goal of the agent is to
maximize the cumulative reward.
That is, from the trajectory S_0, A_0, R_1, S_1, A_1, R_2, S_2, . . . we have a
sequence of rewards R_1, R_2, R_3, . . .
To maximize the cumulative reward, we use a function of the reward
sequence called the discounted return:
$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{T} \gamma^k R_{t+k+1} $$
where 0 ≤ γ ≤ 1 is the discount rate, and possibly T = ∞ if γ ≠ 1.

▶ If γ = 0, the agent is myopic, and is only interested in maximizing
the immediate reward: G_t = R_{t+1}.
▶ For values of γ closer to 1, the agent can consider rewards farther in
the future.
▶ It is easy to show that G_t = R_{t+1} + γ G_{t+1}.
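
The recursive form gives a simple way to compute all returns of a finite episode in one backward pass. The sketch below assumes a finite reward list [R_1, ..., R_T] and takes the return after the last reward to be 0; the function name and the example rewards are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Given [R_1, ..., R_T], return [G_0, ..., G_{T-1}] for a finite episode.

    Uses the recursion G_t = R_{t+1} + gamma * G_{t+1}, with the return
    after the final reward taken to be 0.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] holds R_{t+1}
        returns[t] = g
    return returns

rewards = [0, 1, -1, 1]  # hypothetical rewards R_1, ..., R_4
print(discounted_returns(rewards, gamma=0.5))  # [0.375, 0.75, -0.5, 1.0]
print(discounted_returns(rewards, gamma=0.0))  # [0.0, 1.0, -1.0, 1.0]
```

With γ = 0 each return equals the immediate reward, which matches the myopic case described above.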
Policies and value function
A policy is a mapping from states to probabilities of selecting each
possible action. Thus, if at time t the agent follows policy π, then:
$$ \pi(a \mid s) = \Pr\{A_t = a \mid S_t = s\} $$
i.e., the probability of selecting action a given that the state is s.

The value function of state s under policy π, denoted v_π(s) and also
called the state-value function for policy π, is the expected return when
starting in s and following the policy π thereafter:
$$ v_\pi(s) = E_\pi(G_t \mid S_t = s) = E_\pi\!\left( \sum_{k=0}^{T} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right), \quad \forall\, s \in \mathcal{S} $$
Similarly, the action-value function for policy π, denoted q_π(s, a), is the
expected return when starting in s, taking action a, and following the
policy π thereafter:
$$ q_\pi(s, a) = E_\pi(G_t \mid S_t = s, A_t = a) $$
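
Since the state value is an expected return, it can be approximated by averaging sampled returns, in the Monte Carlo spirit used earlier for the loaded die. The sketch below is one possible illustration under stated assumptions: it reuses the exercise MDP (states sx, sy), assumes a policy that is uniform over the available actions, assumes γ = 0.9, and truncates each sampled return after a fixed number of steps. All names, the seed, and these parameter values are hypothetical.

```python
import random

GAMMA = 0.9  # assumed discount rate

# dynamics[(s, a)] lists (next_state, reward, probability), from the exercise MDP.
dynamics = {
    ("sx", "ax"): [("sy", 1.0, 0.4), ("sx", 0.0, 0.6)],
    ("sy", "ay"): [("sx", -1.0, 1.0)],
    ("sy", "az"): [("sy", -1.0, 0.3), ("sy", 0.5, 0.7)],
}
# Assumed policy pi(a | s): uniform over the actions available in each state.
policy = {"sx": {"ax": 1.0}, "sy": {"ay": 0.5, "az": 0.5}}

def sample_return(state, rng, horizon=60):
    """Sample one truncated discounted return G_t starting from `state`."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        actions = list(policy[state])
        action = rng.choices(actions, weights=[policy[state][a] for a in actions])[0]
        outcomes = dynamics[(state, action)]
        state, reward, _ = rng.choices(outcomes, weights=[o[2] for o in outcomes])[0]
        g += discount * reward
        discount *= GAMMA
    return g

def estimate_v(state, episodes=3000, seed=0):
    """Monte Carlo estimate of v_pi(state): the average of sampled returns."""
    rng = random.Random(seed)
    return sum(sample_return(state, rng) for _ in range(episodes)) / episodes

print(estimate_v("sx"), estimate_v("sy"))
```

The truncation is harmless here because γ^60 is small; an estimate of q_π(s, a) would follow the same pattern with the first action fixed instead of sampled from the policy.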

