Lecture 02: Markov Decision Processes
Wilhelm Kirchgässner
19 July 2023
Preface
● Markov decision processes (MDPs) are a mathematically idealized form of RL problems.
● They allow precise theoretical statements (e.g., on optimal solutions).
● They deliver insights into RL solutions, since many real-world problems can be abstracted as MDPs.
● In the following we'll focus on:
  ● fully observable MDPs (i.e., x_k = y_k) and
  ● finite MDPs (i.e., a finite number of states & actions).
Scalar and vectorial representations in finite MDPs
● The position of a chess piece can be represented in two ways:
  ● Vectorial: x = [x_h  x_v], i.e., a two-element vector with horizontal and vertical information,
  ● Scalar: simple enumeration of all available positions (e.g., x = 3).
● Both ways represent the same amount of information.
● We will stick to the scalar representation of states and actions in finite MDPs.
Figure: Chessboard whose squares are indexed either by file and rank (a-h, 1-8) or by a single enumeration (1, 2, 3, ...)
Markov chain
A Markov chain¹ is a memoryless random process: a sequence of random states in which the next state depends only on the current state (Markov property).
¹ However, this results in a literature nomenclature inconsistency with Markov decision/reward ’processes’.
State transition matrix (STM)
Given a Markov state X_k = x and its successor X_{k+1} = x', the state transition probability ∀ {x, x'} ∈ X is defined by the matrix
$$\mathcal{P}_{xx'} = \begin{bmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix}$$
with p_ij ∈ {ℝ | 0 ≤ p_ij ≤ 1} being the specific probability to go from state x = X_i to state x' = X_j. Obviously, ∑_j p_ij = 1 ∀ i must hold.
Example of a Markov chain (1)
$$\mathcal{P} = \begin{bmatrix} 0 & 1-\alpha & 0 & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
Figure: Forest tree Markov chain (states: small, medium, large, gone)
Example of a Markov chain (2)
Possible samples for the given Markov chain example starting from
x = 1 (small tree):
● Small → gone
● Small → medium → gone
● Small → medium → large → gone
● Small → medium → large → large → . . .
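A minimal Python sketch of this chain: it encodes the transition matrix above and draws sample trajectories like the ones listed. The value α = 0.2 is an assumption for demonstration only.

```python
import numpy as np

alpha = 0.2  # disaster probability (assumed value for demonstration)

# States: 0 = small, 1 = medium, 2 = large, 3 = gone (row/column order)
P = np.array([
    [0.0, 1 - alpha, 0.0,       alpha],
    [0.0, 0.0,       1 - alpha, alpha],
    [0.0, 0.0,       1 - alpha, alpha],
    [0.0, 0.0,       0.0,       1.0  ],
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row of the STM must sum to one

labels = {0: "small", 1: "medium", 2: "large", 3: "gone"}
rng = np.random.default_rng(0)

def sample_trajectory(start=0, max_steps=10):
    """Sample one state sequence until 'gone' is reached (or max_steps)."""
    x, traj = start, [start]
    for _ in range(max_steps):
        x = rng.choice(4, p=P[x])
        traj.append(x)
        if x == 3:  # 'gone' is absorbing
            break
    return " -> ".join(labels[s] for s in traj)

for _ in range(3):
    print(sample_trajectory())
```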
Markov reward process
A Markov reward process (MRP) is a Markov chain extended by a reward function R and a discount factor γ, i.e., rewards are attached to the state transitions.
Example of a Markov reward process
Figure: Forest Markov reward process

Return
The return G_k is the total discounted reward starting from step k onwards. For episodic tasks it becomes the finite series
$$G_k = R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \cdots = \sum_{i=0}^{N} \gamma^i R_{k+i+1} . \tag{2}$$
Value function in MRP
The state-value function v(x_k) of an MRP is the expected return starting from state x_k:
$$v(x_k) = \mathbb{E}[G_k \mid X_k = x_k] . \tag{4}$$
● Represents the long-term value of being in state X_k.
State-value samples of forest MRP
Figure: Forest Markov reward process
Exemplary samples for v̂ with γ = 0.5 starting in x = 1:
x = 1 → 4, v̂ = 1,
x = 1 → 2 → 4, v̂ = 1 + 0.5 ⋅ 2 = 2.0,
x = 1 → 2 → 3 → 4, v̂ = 1 + 0.5 ⋅ 2 + 0.25 ⋅ 3 = 2.75,
x = 1 → 2 → 3 → 3 → 4, v̂ = 1 + 0.5 ⋅ 2 + 0.25 ⋅ 3 + 0.125 ⋅ 3 = 3.13.
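A short sketch that reproduces these return samples; the per-trajectory reward sequences follow the terms shown above (assumed rewards 1, 2, 3 for small, medium, large and γ = 0.5).

```python
def discounted_return(rewards, gamma=0.5):
    """G_k = sum_i gamma^i * R_{k+i+1} for a finite reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Reward sequences of the four sample trajectories above
samples = {
    "1 -> 4":                [1],
    "1 -> 2 -> 4":           [1, 2],
    "1 -> 2 -> 3 -> 4":      [1, 2, 3],
    "1 -> 2 -> 3 -> 3 -> 4": [1, 2, 3, 3],
}
for name, rewards in samples.items():
    print(f"{name}: v_hat = {discounted_return(rewards):.3g}")
```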
Bellman equation for MRPs (1)
Problem: How to calculate all state values in closed form?
Solution: Bellman equation.
$$\begin{aligned}
v(x_k) &= \mathbb{E}[G_k \mid X_k = x_k] \\
&= \mathbb{E}[R_{k+1} + \gamma R_{k+2} + \gamma^2 R_{k+3} + \ldots \mid X_k = x_k] \\
&= \mathbb{E}[R_{k+1} + \gamma (R_{k+2} + \gamma R_{k+3} + \ldots) \mid X_k = x_k] \\
&= \mathbb{E}[R_{k+1} + \gamma G_{k+1} \mid X_k = x_k] \\
&= \mathbb{E}[R_{k+1} + \gamma v(X_{k+1}) \mid X_k = x_k] .
\end{aligned} \tag{5}$$
Bellman equation for MRPs (2)
Assuming a known reward function R(x) for every state X = x ∈ X, the Bellman equation (5) can be written in matrix form:
$$\mathbf{v}_\mathcal{X} = \mathbf{r}_\mathcal{X} + \gamma \mathcal{P}_{xx'} \mathbf{v}_\mathcal{X} ,$$
$$\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix} =
\begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix} + \gamma
\begin{bmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix}
\begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix} . \tag{8}$$
Solving the MRP Bellman equation
Above, (8) is a linear equation system in v_X:
$$\mathbf{v}_\mathcal{X} = \mathbf{r}_\mathcal{X} + \gamma \mathcal{P}_{xx'} \mathbf{v}_\mathcal{X}
\quad\Leftrightarrow\quad
\underbrace{(\mathbf{I} - \gamma \mathcal{P}_{xx'})}_{\mathbf{A}}\,
\underbrace{\mathbf{v}_\mathcal{X}}_{\mathbf{x}} =
\underbrace{\mathbf{r}_\mathcal{X}}_{\mathbf{b}} . \tag{9}$$
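A minimal numerical sketch of (9) for the forest MRP; α = 0.2, γ = 0.8 and the state rewards [1, 2, 3, 0] are assumed values consistent with the samples above.

```python
import numpy as np

alpha, gamma = 0.2, 0.8               # assumed parameters for demonstration
P = np.array([[0, 1 - alpha, 0,         alpha],
              [0, 0,         1 - alpha, alpha],
              [0, 0,         1 - alpha, alpha],
              [0, 0,         0,         1    ]], dtype=float)
r = np.array([1.0, 2.0, 3.0, 0.0])    # assumed rewards: small, medium, large, gone

# (I - gamma * P) v = r   <=>   A x = b
A = np.eye(4) - gamma * P
v = np.linalg.solve(A, r)
print(dict(zip(["small", "medium", "large", "gone"], v.round(3))))
```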
Example of an MRP with state values
Figure: Forest Markov reward process including state values
Markov decision process
● P is the state transition probability, P^u_{xx'} = P[X_{k+1} = x' | X_k = x_k, U_k = u_k],
● R is a reward function, R^u_x = E[R_{k+1} | X_k = x_k, U_k = u_k], and
● γ is a discount factor, γ ∈ {ℝ | 0 ≤ γ ≤ 1}.
● The Markov reward process is extended with actions / decisions.
● Now, rewards also depend on the action U_k.
Example of a Markov decision process (1)
Figure: Forest Markov decision process
For the previous example the state transition probability matrix and reward function are given as:
$$\mathcal{P}^{u=c}_{xx'} = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
\mathcal{P}^{u=w}_{xx'} = \begin{bmatrix} 0 & 1-\alpha & 0 & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
$$\mathbf{r}^{u=c}_\mathcal{X} = \begin{bmatrix} 1 & 2 & 3 & 0 \end{bmatrix}, \qquad
\mathbf{r}^{u=w}_\mathcal{X} = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}.$$
● The term r^u_X is the abbreviated form for receiving the output of R for the entire state space X given the action u.
Policy (1)
In an MDP environment, a policy is a distribution over actions given states:
$$\pi(u_k \mid x_k) = \mathbb{P}[U_k = u_k \mid X_k = x_k] . \tag{10}$$
Policy (2)
Recap on MDP value functions
State-value function: v_π(x_k) = E_π[G_k | X_k = x_k]; action-value function: q_π(x_k, u_k) = E_π[G_k | X_k = x_k, U_k = u_k].
Bellman expectation equation (1)
Analogous to (5), the state value of an MDP can be decomposed into a Bellman notation:
$$v_\pi(x_k) = \mathbb{E}_\pi[R_{k+1} + \gamma v_\pi(X_{k+1}) \mid X_k = x_k] . \tag{13}$$
In finite MDPs, the state value can be directly linked to the action value:
$$v_\pi(x_k) = \sum_{u_k \in \mathcal{U}} \pi(u_k \mid x_k)\, q_\pi(x_k, u_k) . \tag{14}$$
Bellman expectation equation (2)
Likewise, the action value of an MDP can be decomposed into a Bellman notation:
$$q_\pi(x_k, u_k) = \mathbb{E}_\pi[R_{k+1} + \gamma q_\pi(X_{k+1}, U_{k+1}) \mid X_k = x_k, U_k = u_k] . \tag{15}$$
In finite MDPs, the action value can be directly linked to the state value:
$$q_\pi(x_k, u_k) = \mathcal{R}^u_x + \gamma \sum_{x_{k+1} \in \mathcal{X}} p^u_{xx'}\, v_\pi(x_{k+1}) . \tag{16}$$
Bellman expectation equation (3)
Inserting (16) into (14) and (14) into (16) yields:
$$v_\pi(x_k) = \sum_{u_k \in \mathcal{U}} \pi(u_k \mid x_k) \left( \mathcal{R}^u_x + \gamma \sum_{x_{k+1} \in \mathcal{X}} p^u_{xx'}\, v_\pi(x_{k+1}) \right) , \tag{17}$$
$$q_\pi(x_k, u_k) = \mathcal{R}^u_x + \gamma \sum_{x_{k+1} \in \mathcal{X}} p^u_{xx'} \left( \sum_{u_{k+1} \in \mathcal{U}} \pi(u_{k+1} \mid x_{k+1})\, q_\pi(x_{k+1}, u_{k+1}) \right) . \tag{18}$$
Bellman expectation equation in matrix form
Given a policy π and following the same assumptions as for (8), the Bellman expectation equation can be expressed in matrix form:
$$\begin{bmatrix} v^\pi_1 \\ \vdots \\ v^\pi_n \end{bmatrix} =
\begin{bmatrix} R^\pi_1 \\ \vdots \\ R^\pi_n \end{bmatrix} + \gamma
\begin{bmatrix} p^\pi_{11} & \cdots & p^\pi_{1n} \\ \vdots & & \vdots \\ p^\pi_{n1} & \cdots & p^\pi_{nn} \end{bmatrix}
\begin{bmatrix} v^\pi_1 \\ \vdots \\ v^\pi_n \end{bmatrix} . \tag{19}$$
Here, r^π_X and P^π_{xx'} are the rewards and state transition probabilities obtained when following policy π.
Bellman expectation equation & forest tree example (1)
Let's assume the following very simple policy ('fifty-fifty'):
π(u = cut | x) = 0.5, π(u = wait | x) = 0.5 ∀x ∈ X.
Applied to the given environment behavior
$$\mathcal{P}^{u=c}_{xx'} = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
\mathcal{P}^{u=w}_{xx'} = \begin{bmatrix} 0 & 1-\alpha & 0 & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 1-\alpha & \alpha \\ 0 & 0 & 0 & 1 \end{bmatrix},$$
$$\mathbf{r}^{u=c}_\mathcal{X} = \begin{bmatrix} 1 & 2 & 3 & 0 \end{bmatrix}, \qquad
\mathbf{r}^{u=w}_\mathcal{X} = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix},$$
one receives:
$$\mathcal{P}^{\pi}_{xx'} = \begin{bmatrix} 0 & \tfrac{1-\alpha}{2} & 0 & \tfrac{1+\alpha}{2} \\ 0 & 0 & \tfrac{1-\alpha}{2} & \tfrac{1+\alpha}{2} \\ 0 & 0 & \tfrac{1-\alpha}{2} & \tfrac{1+\alpha}{2} \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
\mathbf{r}^{\pi}_\mathcal{X} = \begin{bmatrix} 0.5 \\ 1 \\ 2 \\ 0 \end{bmatrix} .$$
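A sketch that builds P^π and r^π for the fifty-fifty policy from the action-conditioned quantities above and evaluates v^π via the matrix Bellman equation (19); α = 0.2 and γ = 0.8 are assumed values for demonstration.

```python
import numpy as np

alpha, gamma = 0.2, 0.8                   # assumed parameters for demonstration
P_c = np.array([[0, 0, 0, 1]] * 4, dtype=float)        # cut: always -> gone
P_w = np.array([[0, 1 - alpha, 0,         alpha],
                [0, 0,         1 - alpha, alpha],
                [0, 0,         1 - alpha, alpha],
                [0, 0,         0,         1    ]], dtype=float)
r_c = np.array([1.0, 2.0, 3.0, 0.0])
r_w = np.array([0.0, 0.0, 1.0, 0.0])

pi = 0.5                                   # fifty-fifty policy
P_pi = pi * P_c + pi * P_w                 # policy-averaged transition matrix
r_pi = pi * r_c + pi * r_w                 # policy-averaged rewards: [0.5, 1, 2, 0]

# Solve (I - gamma * P_pi) v_pi = r_pi, cf. (9) and (19)
v_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v_pi.round(3))
```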
Bellman expectation equation & forest tree example (2)
Figure: Forest MDP with fifty-fifty policy including state values
Figure: Forest MDP with fifty-fifty policy plus action values
Optimal value functions in an MDP
The optimal state-value function of an MDP is the maximum state-value function over all policies:
v*(x_k) = max_π v_π(x_k).
Optimal policy in an MDP
Bellman optimality equation (1)
"An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision." (R.E. Bellman, Dynamic Programming, 1957)
● Any policy (i.e., also the optimal one) must satisfy the self-consistency condition given by the Bellman expectation equation.
● An optimal policy must deliver the maximum expected return being in a given state:
$$\begin{aligned}
v^*(x_k) &= \max_u q_{\pi^*}(x_k, u) \\
&= \max_u \mathbb{E}_{\pi^*}[G_k \mid X_k = x_k, U_k = u] \\
&= \max_u \mathbb{E}_{\pi^*}[R_{k+1} + \gamma G_{k+1} \mid X_k = x_k, U_k = u] \\
&= \max_u \mathbb{E}[R_{k+1} + \gamma v^*(X_{k+1}) \mid X_k = x_k, U_k = u] .
\end{aligned} \tag{24}$$
Bellman optimality equation (2)
Again, the Bellman optimality equation can be visualized by a backup diagram.
Bellman optimality equation (3)
Likewise, the Bellman optimality equation is applicable to the action value:
$$q^*(x_k, u_k) = \mathbb{E}\left[R_{k+1} + \gamma \max_{u_{k+1}} q^*(X_{k+1}, U_{k+1}) \,\middle|\, X_k = x_k, U_k = u_k\right] . \tag{26}$$
Solving the Bellman optimality equation
● Due to the max operator, the Bellman optimality equation is non-linear and, in general, has no closed-form solution.
● Instead, one sets up a non-linear equation system and solves it numerically or case by case (cf. the forest tree example below).
Optimal policy for forest tree MDP
Remember the forest tree MDP example:
Figure: Forest Markov decision process
Optimal policy for forest tree MDP: state value (1)
Start with v(x = 4) ('gone') and then continue going backwards:
$$v^*(x=4) = 0 ,$$
$$v^*(x=3) = \max \begin{cases} 1 + \gamma\,[(1-\alpha)\,v^*(x=3) + \alpha\, v^*(x=4)] , \\ 3 + \gamma\, v^*(x=4) , \end{cases}
= \max \begin{cases} 1 + \gamma\,(1-\alpha)\,v^*(x=3) , \\ 3 , \end{cases}$$
$$v^*(x=2) = \max \begin{cases} 0 + \gamma\,[(1-\alpha)\,v^*(x=3) + \alpha\, v^*(x=4)] , \\ 2 + \gamma\, v^*(x=4) , \end{cases}
= \max \begin{cases} \gamma\,(1-\alpha)\,v^*(x=3) , \\ 2 , \end{cases}$$
$$v^*(x=1) = \max \begin{cases} \gamma\,(1-\alpha)\,v^*(x=2) , \\ 1 . \end{cases}$$
Optimal policy for forest tree MDP: state value (2)
● Possible solutions:
  ● numerical optimization approach (e.g., simplex method, gradient descent, ...)
  ● manual case-by-case equation solving (dynamic programming, cf. next lecture)
Figure: State values under optimal policy (γ = 0.8, α = 0.2)
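As a brief illustration of the numerical route, a fixed-point (value) iteration sketch on the state values, using γ = 0.8 and α = 0.2 as in the figure caption above; the fixed point of the update is the optimal state value v*.

```python
import numpy as np

alpha, gamma = 0.2, 0.8   # parameters as in the figure caption above
P = {"c": np.array([[0, 0, 0, 1]] * 4, dtype=float),
     "w": np.array([[0, 1 - alpha, 0,         alpha],
                    [0, 0,         1 - alpha, alpha],
                    [0, 0,         1 - alpha, alpha],
                    [0, 0,         0,         1    ]], dtype=float)}
r = {"c": np.array([1.0, 2.0, 3.0, 0.0]),
     "w": np.array([0.0, 0.0, 1.0, 0.0])}

v = np.zeros(4)
for _ in range(1000):
    # Bellman optimality backup: v*(x) = max_u (R^u_x + gamma * sum_x' p^u_xx' v*(x'))
    v_new = np.maximum.reduce([r[u] + gamma * P[u] @ v for u in ("c", "w")])
    converged = np.max(np.abs(v_new - v)) < 1e-10
    v = v_new
    if converged:
        break
print(v.round(3))  # optimal state values for small, medium, large, gone
```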
Optimal policy for forest tree MDP: state value (3)
Figure: State values under optimal policy (γ = 0.9, α = 0.2)
Optimal policy for forest tree MDP: action value (1)
Use u_{k+1} = u' to set up the equation system:
$$\begin{aligned}
q^*(x=1, u=c) &= 1 , \\
q^*(x=1, u=w) &= \gamma\,(1-\alpha) \max_{u'} q^*(x=2, u') , \\
q^*(x=2, u=c) &= 2 , \\
q^*(x=2, u=w) &= \gamma\,(1-\alpha) \max_{u'} q^*(x=3, u') , \\
q^*(x=3, u=c) &= 3 , \\
q^*(x=3, u=w) &= 1 + \gamma\,(1-\alpha) \max_{u'} q^*(x=3, u') .
\end{aligned}$$
Optimal policy for forest tree MDP: action value (2)
Rearrange max expressions for unknown action values:
Optimal policy for forest tree MDP: action value (3)
Figure: Action values under optimal policy (γ = 0.8, α = 0.2)
Optimal policy for forest tree MDP: action value (4)
Figure: Action values under optimal policy (γ = 0.9, α = 0.2)
Direct numerical state and action-value calculation
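A minimal sketch of such a direct numerical calculation: a fixed-point iteration on the action values q(x, u), again assuming γ = 0.8 and α = 0.2 as in the figures above.

```python
import numpy as np

alpha, gamma = 0.2, 0.8   # assumed parameters, matching the figure captions
actions = ("c", "w")
P = {"c": np.array([[0, 0, 0, 1]] * 4, dtype=float),
     "w": np.array([[0, 1 - alpha, 0,         alpha],
                    [0, 0,         1 - alpha, alpha],
                    [0, 0,         1 - alpha, alpha],
                    [0, 0,         0,         1    ]], dtype=float)}
r = {"c": np.array([1.0, 2.0, 3.0, 0.0]),
     "w": np.array([0.0, 0.0, 1.0, 0.0])}

q = {u: np.zeros(4) for u in actions}
for _ in range(1000):
    v = np.maximum(q["c"], q["w"])                    # v*(x) = max_u q*(x, u)
    # q*(x, u) = R^u_x + gamma * sum_x' p^u_xx' * v*(x'), cf. (26)
    q_new = {u: r[u] + gamma * P[u] @ v for u in actions}
    converged = all(np.allclose(q_new[u], q[u], atol=1e-10) for u in actions)
    q = q_new
    if converged:
        break

for u in actions:
    print(u, q[u].round(3))
# Greedy (optimal) action per state: cut where q(x, c) >= q(x, w), wait otherwise
```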
Summary: what you’ve learned in this lecture
● Differentiate finite Markov process models with or w/o rewards and actions.
● Interpret such stochastic processes as simplified abstractions of real-world problems.
● Understand the importance of value functions to describe the agent’s performance.
● Formulate value-function equation systems by the Bellman principle.
● Recognize optimal policies.
● Set up non-linear equation systems for retrieving optimal policies by the Bellman principle.
● Solve for different value functions in MRP/MDP by brute-force optimization.