
Reinforcement Learning: A Comprehensive Open-Source Course
Wilhelm Kirchgässner
Department of Power Electronics and Electrical Drives, Paderborn University, Germany
19 July 2023

Lecture 02: Markov Decision Processes

2 / 50
Preface
● Markov decision processes (MDP) are a mathematically idealized
form of RL problems.
● They allow precise theoretical statements (e.g., on optimal
solutions).
● They deliver insights into RL solutions since many real-world
problems can be abstracted as MDPs.
● In the following we’ll focus on:
● fully observable MDPs (i.e., xk = yk ) and
● finite MDPs (i.e., finite number of states & actions).

                       All states observable?
                       Yes                               No
Actions?   No          Markov chain                      Hidden Markov model
           Yes         Markov decision process (MDP)     Partially observable MDP
3 / 50
Scalar and vectorial representations in finite MDPs
● The position of a chess piece can be represented in two ways:
● Vectorial: x = [xh xv ], i.e., a two-element vector with horizontal
and vertical information,
● Scalar: simple enumeration of all available positions (e.g., x = 3).
● Both ways represent the same amount of information.
● We will stick to the scalar representation of states and actions in
finite MDPs.
Figure: Chessboard grid (files a-h): each square can be addressed either by a coordinate pair or by a single enumeration index (1, 2, 3, ..., 9, 10, ...)

4 / 50
Table of contents

1 Finite Markov chains

2 Finite Markov reward processes

3 Finite Markov decision processes

4 Optimal policies and value functions

5 / 50
Markov chain

A finite Markov chain is a tuple ⟨X , P⟩ with


● X being a finite set of discrete-time states Xk ∈ X,
● P = Pxx' = ℙ[Xk+1 = x' | Xk = x] being the state transition probability.
● Specific stochastic process model
● Sequence of random variables Xk, Xk+1, ...
● 'Memoryless': the next state depends only on the current state (Markov property); the transition probabilities are time invariant
● In a continuous-time framework: Markov process¹

¹ However, this results in a nomenclature inconsistency in the literature with Markov decision/reward 'processes'.

6 / 50
State transition matrix
State transition matrix (STM)
Given a Markov state Xk = x and its successor Xk+1 = x', the state transition probability ∀ {x, x'} ∈ X is defined by the matrix

Pxx' = ℙ[Xk+1 = x' | Xk = x].     (1)

Here, Pxx' ∈ ℝ^(n×n) has the form

        ⎡p11  p12  ⋯  p1n⎤
Pxx' =  ⎢p21           ⋮ ⎥
        ⎢ ⋮            ⋮ ⎥
        ⎣pn1   ⋯   ⋯  pnn⎦

with pij ∈ {ℝ | 0 ≤ pij ≤ 1} being the specific probability to go from the i-th state x = xi to the j-th state x' = xj. Obviously, ∑j pij = 1 ∀i must hold.

7 / 50
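As a quick sanity check of these properties, a candidate STM can be validated numerically. A minimal Python sketch, assuming NumPy; the matrix values below are made up for illustration:

import numpy as np

P = np.array([[0.1, 0.9],
              [0.4, 0.6]])  # made-up 2-state transition matrix

# Every entry must be a probability and every row must sum to one
assert np.all((0 <= P) & (P <= 1))
assert np.allclose(P.sum(axis=1), 1.0)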
Example of a Markov chain (1)

Figure: Forest tree Markov chain

x ∈ {1, 2, 3, 4} = {small, medium, large, gone}

     ⎡0  1−α   0    α⎤
P =  ⎢0   0   1−α   α⎥
     ⎢0   0   1−α   α⎥
     ⎣0   0    0    1⎦

● At x = 1 a small tree is planted ('starting point').
● A tree grows with probability (1 − α).
● If it reaches x = 3 (large), its growth is limited.
● With probability α a natural hazard destroys the tree.
● The state x = 4 is terminal ('infinite loop').

8 / 50
Example of a Markov chain (2)
Figure: Forest tree Markov chain
Possible samples for the given Markov chain example starting from
x = 1 (small tree):
● Small → gone
● Small → medium → gone
● Small → medium → large → gone
● Small → medium → large → large → . . .
9 / 50
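Such sample trajectories can be drawn programmatically. A minimal Python sketch, assuming NumPy; the hazard probability α = 0.2 is an illustrative choice and only fixed later in the lecture:

import numpy as np

rng = np.random.default_rng(0)
alpha = 0.2  # hazard probability (illustrative value)

# Transition matrix P for states 1..4 = {small, medium, large, gone}
P = np.array([[0.0, 1 - alpha, 0.0,       alpha],
              [0.0, 0.0,       1 - alpha, alpha],
              [0.0, 0.0,       1 - alpha, alpha],
              [0.0, 0.0,       0.0,       1.0]])

def sample_chain(P, x0=0, max_steps=10):
    """Draw one state trajectory, stopping at the terminal state (index 3, 'gone')."""
    traj = [x0]
    x = x0
    for _ in range(max_steps):
        x = rng.choice(len(P), p=P[x])
        traj.append(x)
        if x == 3:  # 'gone' is absorbing
            break
    return traj

for _ in range(3):
    print([s + 1 for s in sample_chain(P)])  # print 1-based states as on the slide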
Table of contents

1 Finite Markov chains

2 Finite Markov reward processes

3 Finite Markov decision processes

4 Optimal policies and value functions

10 / 50
Markov reward process

Finite Markov reward process
A finite Markov reward process (MRP) is a tuple ⟨X, P, R, γ⟩ with
● X being a finite set of discrete-time states Xk ∈ X,
● P = Pxx' = ℙ[Xk+1 = x' | Xk = x] being the state transition probability,
● R being a reward function, R = Rx = 𝔼[Rk+1 | Xk = xk], and
● γ being a discount factor, γ ∈ {ℝ | 0 ≤ γ ≤ 1}.
● Markov chain extended with rewards
● Still an autonomous stochastic process without specific inputs
● Reward Rk+1 only depends on state Xk

11 / 50
Example of a Markov reward process
Figure: Forest Markov reward process

● Growing larger trees is rewarded, since it will be


● appreciated by hikers and
● useful for wood production.
● Losing a tree due to a hazard is unrewarded.
12 / 50
Recap on return

Return
The return Gk is the total discounted reward starting from step k
onwards. For episodic tasks it becomes the finite series

Gk = Rk+1 + γRk+2 + γ²Rk+3 + ⋯ = ∑_{i=0}^{N} γ^i Rk+i+1     (2)

terminating at step N, while it is an infinite series for continuing tasks

Gk = Rk+1 + γRk+2 + γ²Rk+3 + ⋯ = ∑_{i=0}^{∞} γ^i Rk+i+1 .     (3)

● The discount factor γ determines the present value of future rewards.

13 / 50
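As a small numerical illustration of (2), the return can be accumulated backwards over a reward sequence. A Python sketch; the reward values below are made up:

def discounted_return(rewards, gamma):
    """Return G_k = sum_i gamma^i * R_{k+i+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: G_k = R_{k+1} + gamma * G_{k+1}
        g = r + gamma * g
    return g

# Hypothetical reward sequence R_{k+1}, R_{k+2}, R_{k+3}
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75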
Value function in MRP

Value function in MRP
The state-value function v(xk) of an MRP is the expected return starting from state xk:

v(xk) = 𝔼[Gk | Xk = xk].     (4)

● Represents the long-term value of being in state Xk.

Figure: Isolines indicate state value of different golf ball locations (source: R. Sutton and G. Barto, Reinforcement learning: an introduction, 2018, CC BY-NC-ND 2.0)

14 / 50
State-value samples of forest MRP
Figure: Forest Markov reward process
Exemplary samples for v̂ with γ = 0.5 starting in x = 1:
x = 1 → 4, v̂ = 1,
x = 1 → 2 → 4, v̂ = 1 + 0.5 ⋅ 2 = 2.0,
x = 1 → 2 → 3 → 4, v̂ = 1 + 0.5 ⋅ 2 + 0.25 ⋅ 3 = 2.75,
x = 1 → 2 → 3 → 3 → 4, v̂ = 1 + 0.5 ⋅ 2 + 0.25 ⋅ 3 + 0.125 ⋅ 3 = 3.13.

15 / 50
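The samples above suggest a Monte Carlo estimate of v(x = 1): average the sampled returns over many episodes. A sketch assuming NumPy, α = 0.2 and the per-state rewards r = [1, 2, 3, 0] read off the samples; both are assumptions, not values fixed on this slide:

import numpy as np

rng = np.random.default_rng(1)
alpha, gamma = 0.2, 0.5              # assumed hazard probability, discount from the slide
r = np.array([1.0, 2.0, 3.0, 0.0])   # assumed per-state rewards (small, medium, large, gone)
P = np.array([[0.0, 1 - alpha, 0.0,       alpha],
              [0.0, 0.0,       1 - alpha, alpha],
              [0.0, 0.0,       1 - alpha, alpha],
              [0.0, 0.0,       0.0,       1.0]])

def sample_return(x0=0, max_steps=200):
    """Accumulate the discounted return of one episode starting in x0 (0-based index)."""
    g, disc, x = 0.0, 1.0, x0
    for _ in range(max_steps):
        g += disc * r[x]              # reward R_{k+1} depends on the current state X_k
        x = rng.choice(4, p=P[x])
        disc *= gamma
        if x == 3:                    # terminal state 'gone'
            break
    return g

estimates = [sample_return() for _ in range(10000)]
print(np.mean(estimates))  # Monte Carlo estimate of v(x = 1)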
Bellman equation for MRPs (1)
Problem: How to calculate all state values in closed form?
Solution: Bellman equation.
v(xk) = 𝔼[Gk | Xk = xk]
      = 𝔼[Rk+1 + γRk+2 + γ²Rk+3 + ⋯ | Xk = xk]
      = 𝔼[Rk+1 + γ(Rk+2 + γRk+3 + ⋯) | Xk = xk]     (5)
      = 𝔼[Rk+1 + γGk+1 | Xk = xk]
      = 𝔼[Rk+1 + γv(Xk+1) | Xk = xk]

Figure: Backup diagram for v (xk )

16 / 50
Bellman equation for MRPs (2)
Assuming a known reward function R(x) for every state x ∈ X

rX = [R(x1) ⋯ R(xn)] = [R1 ⋯ Rn]     (6)

for a finite number of n states with unknown state values

vX = [v(x1) ⋯ v(xn)] = [v1 ⋯ vn]     (7)

one can derive a linear equation system based on

vX = rX + γ Pxx' vX ,

⎡v1⎤   ⎡R1⎤     ⎡p11 ⋯ p1n⎤ ⎡v1⎤
⎢ ⋮⎥ = ⎢ ⋮⎥ + γ ⎢ ⋮      ⋮⎥ ⎢ ⋮⎥ .     (8)
⎣vn⎦   ⎣Rn⎦     ⎣pn1 ⋯ pnn⎦ ⎣vn⎦

17 / 50
Solving the MRP Bellman equation
Above, (8) is a linear equation system in vX:

vX = rX + γ Pxx' vX
⇔ (I − γ Pxx') vX = rX ,     (9)

which has the standard form A x = b with A = (I − γ Pxx'), x = vX and b = rX.

Possible solutions are (among others):


● Direct inversion (Gaussian elimination, O(n³)),
● Matrix decomposition (QR, Cholesky, etc., O(n³)),
● Iterative solutions (e.g., Krylov subspaces, often better than O(n³)).

In RL, identifying and solving (9) is a key task, which is often realized only approximately for high-order state spaces.

18 / 50
Example of a MRP with state values

Figure: Forest Markov reward process including state values

● Discount factor γ = 0.8


● Disaster probability α = 0.2
19 / 50
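The state values of this example can be reproduced by solving (9) directly. A NumPy sketch; the per-state reward vector r = [1, 2, 3, 0] is again an assumption read off the earlier samples:

import numpy as np

alpha, gamma = 0.2, 0.8
r = np.array([1.0, 2.0, 3.0, 0.0])   # assumed per-state rewards (small, medium, large, gone)
P = np.array([[0.0, 1 - alpha, 0.0,       alpha],
              [0.0, 0.0,       1 - alpha, alpha],
              [0.0, 0.0,       1 - alpha, alpha],
              [0.0, 0.0,       0.0,       1.0]])

# Solve (I - gamma * P) v = r instead of inverting explicitly
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)   # state values v(x) for x = 1, ..., 4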
Table of contents

1 Finite Markov chains

2 Finite Markov reward processes

3 Finite Markov decision processes

4 Optimal policies and value functions

20 / 50
Markov decision process

Finite Markov decision process
A finite Markov decision process (MDP) is a tuple ⟨X, U, P, R, γ⟩ with
● X being a finite set of discrete-time states Xk ∈ X,
● U being a finite set of discrete-time actions Uk ∈ U,
● P = P^u_xx' being the state transition probability
  P^u_xx' = ℙ[Xk+1 = x' | Xk = xk, Uk = uk],
● R being a reward function, R = R^u_x = 𝔼[Rk+1 | Xk = xk, Uk = uk], and
● γ being a discount factor, γ ∈ {ℝ | 0 ≤ γ ≤ 1}.
● Markov reward process is extended with actions / decisions.
● Now, rewards also depend on action Uk .

21 / 50
Example of a Markov decision process (1)
Figure: Forest Markov decision process

● Two actions possible in each state:


● Wait u = w : let the tree grow.
● Cut u = c: gather the wood.
● With increasing tree size the wood reward increases as well.
22 / 50
Example of a Markov decision process (2)

For the previous example the state transition probability matrix and
reward function are given as:
P^{u=c}_xx' = ⎡0 0 0 1⎤     P^{u=w}_xx' = ⎡0  1−α   0    α⎤
              ⎢0 0 0 1⎥                   ⎢0   0   1−α   α⎥
              ⎢0 0 0 1⎥                   ⎢0   0   1−α   α⎥
              ⎣0 0 0 1⎦                   ⎣0   0    0    1⎦

r_X^{u=c} = [1 2 3 0] ,   r_X^{u=w} = [0 0 1 0] .

● The term r_X^u is the abbreviated form for the output of R evaluated over the entire state space X given the action u.

23 / 50
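For later reuse, this environment model can be written down directly in code. A sketch assuming NumPy; the function name forest_mdp is a hypothetical helper:

import numpy as np

def forest_mdp(alpha):
    """Return transition matrices and reward vectors for actions c (cut) and w (wait)."""
    P_c = np.array([[0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1]], dtype=float)
    P_w = np.array([[0, 1 - alpha, 0,         alpha],
                    [0, 0,         1 - alpha, alpha],
                    [0, 0,         1 - alpha, alpha],
                    [0, 0,         0,         1]], dtype=float)
    r_c = np.array([1.0, 2.0, 3.0, 0.0])
    r_w = np.array([0.0, 0.0, 1.0, 0.0])
    return {"c": (P_c, r_c), "w": (P_w, r_w)}

model = forest_mdp(alpha=0.2)
print(model["w"][0])  # P^{u=w} for alpha = 0.2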
Policy (1)
In an MDP environment, a policy is a distribution over actions given
states:
π(uk | xk) = ℙ[Uk = uk | Xk = xk] .     (10)

● In MDPs, policies depend only on the current state.


● A policy fully defines the agent’s behavior (which might be
stochastic or deterministic).

Figure: What is your best Monopoly policy? (source: Ylanite Koppens on Pexels)

24 / 50
Policy (2)

Given a finite MDP ⟨X , U, P, R, γ⟩ and a policy π:


● The state sequence Xk, Xk+1, ... is a Markov chain ⟨X, P^π⟩ since the state transition probability depends only on the state:

  P^π_xx' = ∑_{uk∈U} π(uk|xk) P^u_xx' .     (11)

● Consequently, the sequence Xk, Rk+1, Xk+1, Rk+2, ... of states and rewards is a Markov reward process ⟨X, P^π, R^π, γ⟩:

  R^π_x = ∑_{uk∈U} π(uk|xk) R^u_x .     (12)

25 / 50
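Equations (11) and (12) translate into a small helper that collapses the MDP model into the policy-induced MRP. A sketch assuming NumPy; the name induced_mrp and the policy array layout are hypothetical conventions:

import numpy as np

def induced_mrp(P_per_action, r_per_action, policy):
    """Collapse an MDP model into the induced MRP <X, P_pi, R_pi> for a given policy.

    P_per_action: dict u -> (n x n) matrix P^u
    r_per_action: dict u -> (n,) vector R^u
    policy:       (n x m) array with policy[x, i] = pi(u_i | x), actions ordered as in the dicts
    """
    actions = list(P_per_action)
    n = next(iter(P_per_action.values())).shape[0]
    P_pi, r_pi = np.zeros((n, n)), np.zeros(n)
    for i, u in enumerate(actions):
        P_pi += policy[:, [i]] * P_per_action[u]   # eq. (11): weight each row of P^u by pi(u|x)
        r_pi += policy[:, i] * r_per_action[u]     # eq. (12)
    return P_pi, r_pi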
Recap on MDP value functions

The state-value function of an MDP is the expected return starting in xk following policy π:

vπ(xk) = 𝔼π[Gk | Xk = xk] = 𝔼π[∑_{i=0}^{∞} γ^i Rk+i+1 | Xk = xk] .

The action-value function of an MDP is the expected return starting in xk taking action uk following policy π:

qπ(xk, uk) = 𝔼π[Gk | Xk = xk, Uk = uk] = 𝔼π[∑_{i=0}^{∞} γ^i Rk+i+1 | Xk = xk, Uk = uk] .

26 / 50
Bellman expectation equation (1)
Analog to (5), the state value of an MDP can be decomposed into a
Bellman notation:
vπ(xk) = 𝔼π[Rk+1 + γ vπ(Xk+1) | Xk = xk] .     (13)

In finite MDPs, the state value can be directly linked to the action value:

vπ(xk) = ∑_{uk∈U} π(uk|xk) qπ(xk, uk) .     (14)

Figure: Backup diagram for vπ (xk )

27 / 50
Bellman expectation equation (2)
Likewise, the action value of an MDP can be decomposed into a
Bellman notation:
qπ(xk, uk) = 𝔼π[Rk+1 + γ qπ(Xk+1, Uk+1) | Xk = xk, Uk = uk] .     (15)

In finite MDPs, the action value can be directly linked to the state value:

qπ(xk, uk) = R^u_x + γ ∑_{xk+1∈X} p^u_xx' vπ(xk+1) .     (16)

Figure: Backup diagram for qπ (xk , uk )

28 / 50
Bellman expectation equation (3)

Inserting (16) into (14) directly results in:

vπ(xk) = ∑_{uk∈U} π(uk|xk) (R^u_x + γ ∑_{xk+1∈X} p^u_xx' vπ(xk+1)) .     (17)

Conversely, the action value becomes:

qπ(xk, uk) = R^u_x + γ ∑_{xk+1∈X} p^u_xx' ∑_{uk+1∈U} π(uk+1|xk+1) qπ(xk+1, uk+1) .     (18)

29 / 50
Bellman expectation equation in matrix form

Given a policy π and following the same assumptions as for (8), the
Bellman expectation equation can be expressed in matrix form:

v^π_X = r^π_X + γ P^π_xx' v^π_X ,

⎡v^π_1⎤   ⎡R^π_1⎤     ⎡p^π_11 ⋯ p^π_1n⎤ ⎡v^π_1⎤
⎢  ⋮  ⎥ = ⎢  ⋮  ⎥ + γ ⎢  ⋮          ⋮  ⎥ ⎢  ⋮  ⎥ .     (19)
⎣v^π_n⎦   ⎣R^π_n⎦     ⎣p^π_n1 ⋯ p^π_nn⎦ ⎣v^π_n⎦

Here, r^π_X and P^π_xx' are the rewards and state transition probabilities following policy π. Hence, the state value can be calculated by solving (19) for v^π_X, e.g., by direct matrix inversion:

v^π_X = (I − γ P^π_xx')⁻¹ r^π_X .     (20)

30 / 50
Bellman expectation equation & forest tree example (1)
Let's assume the following very simple policy ('fifty-fifty'):

π(u = cut | x) = 0.5,   π(u = wait | x) = 0.5   ∀x ∈ X .

Applied to the given environment behavior

P^{u=c}_xx' = ⎡0 0 0 1⎤     P^{u=w}_xx' = ⎡0  1−α   0    α⎤
              ⎢0 0 0 1⎥                   ⎢0   0   1−α   α⎥
              ⎢0 0 0 1⎥                   ⎢0   0   1−α   α⎥
              ⎣0 0 0 1⎦                   ⎣0   0    0    1⎦

r_X^{u=c} = [1 2 3 0] ,   r_X^{u=w} = [0 0 1 0] ,

one receives:

P^π_xx' = ⎡0  (1−α)/2    0      (1+α)/2⎤     r^π_X = ⎡0.5⎤
          ⎢0     0    (1−α)/2   (1+α)/2⎥             ⎢ 1 ⎥
          ⎢0     0    (1−α)/2   (1+α)/2⎥             ⎢ 2 ⎥
          ⎣0     0       0         1   ⎦             ⎣ 0 ⎦

31 / 50
Bellman expectation equation & forest tree example (2)

Figure: Forest MDP with fifty-fifty policy including state values

● Discount factor γ = 0.8


● Disaster probability α = 0.2
32 / 50
Bellman expectation equation & forest tree example (3)
Using the Bellman expectation eq. (16) the action values can be
directly calculated:

Figure: Forest MDP with fifty-fifty policy plus action values
33 / 50
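Numerically, the fifty-fifty results of the last three slides follow from combining (11), (12), (20) and (16). A NumPy sketch with γ = 0.8, α = 0.2:

import numpy as np

alpha, gamma = 0.2, 0.8
P_c = np.array([[0, 0, 0, 1]] * 4, dtype=float)
P_w = np.array([[0, 1 - alpha, 0, alpha],
                [0, 0, 1 - alpha, alpha],
                [0, 0, 1 - alpha, alpha],
                [0, 0, 0, 1]], dtype=float)
r_c = np.array([1.0, 2.0, 3.0, 0.0])
r_w = np.array([0.0, 0.0, 1.0, 0.0])

# Fifty-fifty policy: induced MRP, eqs. (11) and (12)
P_pi = 0.5 * P_c + 0.5 * P_w
r_pi = 0.5 * r_c + 0.5 * r_w

# State values via eq. (20), action values via eq. (16)
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
q_pi = {"c": r_c + gamma * P_c @ v_pi,
        "w": r_w + gamma * P_w @ v_pi}
print(v_pi, q_pi)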
Table of contents

1 Finite Markov chains

2 Finite Markov reward processes

3 Finite Markov decision processes

4 Optimal policies and value functions

34 / 50
Optimal value functions in an MDP
The optimal state-value function of an MDP is the maximum state-value function over all policies:

v*(x) = max_π vπ(x) .     (21)

The optimal action-value function of an MDP is the maximum action-value function over all policies:

q*(x, u) = max_π qπ(x, u) .     (22)

The optimal value function denotes the best possible agent performance for a given MDP / environment.
A (finite) MDP can be easily solved in an optimal way if q*(x, u) is known.

35 / 50
Optimal policy in an MDP

Define a partial ordering over policies

π ≥ π' if vπ(x) ≥ vπ'(x) ∀x ∈ X .     (23)

For any finite MDP

● there exists an optimal policy π* ≥ π that is better than or equal to all other policies,
● all optimal policies achieve the same optimal state-value function v*(x) = vπ*(x),
● all optimal policies achieve the same optimal action-value function q*(x, u) = qπ*(x, u).

36 / 50
Bellman optimality equation (1)
“An optimal policy has the property that whatever the initial state and
initial decision are, the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first decision.”(R.E.
Bellman, Dynamic Programming, 1957)
● Any policy (i.e., also the optimal one) must satisfy the
self-consistency condition given by the Bellman expectation
equation.
● An optimal policy must deliver the maximum expected return
being in a given state:
v*(xk) = max_u qπ*(xk, u)
       = max_u 𝔼π*[Gk | Xk = xk, Uk = u]
       = max_u 𝔼π*[Rk+1 + γGk+1 | Xk = xk, Uk = u]     (24)
       = max_u 𝔼[Rk+1 + γ v*(Xk+1) | Xk = xk, Uk = u] .

37 / 50
Bellman optimality equation (2)
Again, the Bellman optimality equation can be visualized by a backup
diagram:

Figure: Backup diagram for v ∗ (xk )

For a finite MDP, the following expression results:

v*(xk) = max_{uk} (R^u_x + γ ∑_{xk+1∈X} p^u_xx' vπ*(xk+1)) .     (25)

38 / 50
Bellman optimality equation (3)
Likewise, the Bellman optimality equation is applicable to the action
value:
q*(xk, uk) = 𝔼[Rk+1 + γ max_{uk+1} q*(Xk+1, uk+1) | Xk = xk, Uk = uk] .     (26)

And, in the finite MDP case:

q*(xk, uk) = R^u_x + γ ∑_{xk+1∈X} p^u_xx' max_{uk+1} q*(xk+1, uk+1) .     (27)

Figure: Backup diagram for q ∗ (xk , uk )

39 / 50
Solving the Bellman optimality equation

● In finite MDPs with n states, (25) delivers an algebraic equation system with n unknowns and n state-value equations.
● Likewise, (27) delivers an algebraic equation system with up to n ⋅ m unknowns and n ⋅ m action-value equations (m = number of actions).
● If the environment is exactly known, solving for v* or q* directly delivers the optimal policy (see the sketch after this slide).
  ● If v*(x) is known, a one-step-ahead search is required to get q*(x, u).
  ● If q*(x, u) is known, directly choose the action maximizing q*(x, u).
● Even though the above decisions are very short-sighted (one-step-ahead search for v* or direct choice based on q*), by Bellman's principle of optimality one receives the long-term maximum of the expected return.

40 / 50
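One way to solve these nonlinear equation systems without manual case distinctions is a simple fixed-point iteration on (27), i.e., a preview of the dynamic programming methods of the next lecture. A NumPy sketch for the forest MDP with γ = 0.8, α = 0.2:

import numpy as np

alpha, gamma = 0.2, 0.8
P = {"c": np.array([[0, 0, 0, 1]] * 4, dtype=float),
     "w": np.array([[0, 1 - alpha, 0, alpha],
                    [0, 0, 1 - alpha, alpha],
                    [0, 0, 1 - alpha, alpha],
                    [0, 0, 0, 1]], dtype=float)}
r = {"c": np.array([1.0, 2.0, 3.0, 0.0]),
     "w": np.array([0.0, 0.0, 1.0, 0.0])}

q = {u: np.zeros(4) for u in P}            # initial guess for q*(x, u)
for _ in range(1000):                      # repeatedly apply the right-hand side of (27)
    v = np.maximum(q["c"], q["w"])         # max over actions in the successor state
    q = {u: r[u] + gamma * P[u] @ v for u in P}

print(q["c"], q["w"])                      # optimal action values q*(x, u)
print(np.maximum(q["c"], q["w"]))          # optimal state values v*(x)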
Optimal policy for forest tree MDP
Remember the forest tree MDP example:

Figure: Forest Markov decision process

● Two actions possible in each state:


● Wait u = w : let the tree grow.
● Cut u = c: gather the wood.
● Let's first calculate v*(x) and then q*(x, u).

41 / 50
Optimal policy for forest tree MDP: state value (1)
Start with v (x = 4) (’gone’) and then continue going backwards:
v*(x = 4) = 0 ,

v*(x = 3) = max{1 + γ[(1 − α)v*(x = 3) + αv*(x = 4)],  3 + γv*(x = 4)}
          = max{1 + γ(1 − α)v*(x = 3),  3} ,

v*(x = 2) = max{0 + γ[(1 − α)v*(x = 3) + αv*(x = 4)],  2 + γv*(x = 4)}
          = max{γ(1 − α)v*(x = 3),  2} ,

v*(x = 1) = max{γ(1 − α)v*(x = 2),  1} .

42 / 50
Optimal policy for forest tree MDP: state value (2)
● Possible solutions:
● numerical optimization approach (e.g., simplex method, gradient
descent,...)
● manual case-by-case equation solving (dynamic programming, cf.
next lecture)

Figure: State values under optimal policy (γ = 0.8, α = 0.2)

43 / 50
Optimal policy for forest tree MDP: state value (3)

Figure: State values under optimal policy (γ = 0.9, α = 0.2)

44 / 50
Optimal policy for forest tree MDP: action value (1)
Use uk+1 = u' to set up the equation system:

q*(x = 1, u = c) = 1 ,
q*(x = 1, u = w) = γ(1 − α) max_{u'} q*(x = 2, u') ,
q*(x = 2, u = c) = 2 ,
q*(x = 2, u = w) = γ(1 − α) max_{u'} q*(x = 3, u') ,
q*(x = 3, u = c) = 3 ,
q*(x = 3, u = w) = 1 + γ(1 − α) max_{u'} q*(x = 3, u') .

● There are six action-state pairs in total.


● Three of them can be directly determined.
● Three unknowns and three equations remain.

45 / 50
Optimal policy for forest tree MDP: action value (2)
Rearrange the max expressions for the unknown action values:

q*(x = 1, u = w) = γ(1 − α) max{γ(1 − α) max{1 + γ(1 − α)q*(3, w),  3},  2} ,

q*(x = 2, u = w) = γ(1 − α) max{1 + γ(1 − α)q*(3, w),  3} ,

q*(x = 3, u = w) = 1 + γ(1 − α) max{q*(3, w),  3} .

Again, retrieve unknown optimal action values by numerical


optimization solvers or manual backwards calculation (dynamic
programming).

46 / 50
Optimal policy for forest tree MDP: action value (3)

Figure: Action values under optimal policy (γ = 0.8, α = 0.2)

47 / 50
Optimal policy for forest tree MDP: action value (4)

Figure: Action values under optimal policy (γ = 0.9, α = 0.2)

48 / 50
Direct numerical state and action-value calculation

● Possible only for MDPs with small state and action spaces
● 'Solving' Backgammon with ≈ 10²⁰ states?
● Another issue: total environment knowledge required

Framing the reinforcement learning problem


Facing the above issues, RL addresses mainly two topics:
● Approximate solutions of complex decision problems.
● Learning of such approximations based on data retrieved from
environment interactions potentially without any a priori model
knowledge.

49 / 50
Summary: what you’ve learned in this lecture
● Differentiate finite Markov process models with or w/o rewards and
actions.
● Interpret such stochastic processes as simplified abstractions of
real-world problems.
● Understand the importance of value functions to describe the
agent’s performance.
● Formulate value-function equation systems by the Bellman
principle.
● Recognize optimal policies.
● Set up nonlinear equation systems for retrieving optimal policies by the Bellman principle.
● Solve for different value functions in MRPs/MDPs by brute-force optimization.

50 / 50
