08 MDPs
Andrey Markov
(1856-1922)
Policies
In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal.
For MDPs, where action outcomes are uncertain, we instead want an optimal policy π*(s): a choice of action for every state.
[Gridworld figure with labeled cells (1,1), (3,3), (4,1)]
Example: Bridge Crossing
Example: Volcano
MDP Search Tree
From state s = (2,2), an action leads stochastically to successor states such as (1,2) and (3,2), each reached with probability P(s'|s,a), where Σ_{s'} P(s'|s,a) = 1.0.
MDP Search Tree
From state s = (2,2), action a = "east" leads to successor states (2,1), (3,2), and (2,3) with probabilities P(s'|s,a) satisfying Σ_{s'} P(s'|s,a) = 1.0.
MDP Search Tree
From state s = (4,1), action a = "east" leads to successor states (4,2) and (4,1) (the agent may stay in place), again with Σ_{s'} P(s'|s,a) = 1.0.
MDP Search Trees
Each MDP state projects an expectimax-like search tree
s is a state
(s, a) is a q-state
(s, a, s') is called a transition
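To make these pieces concrete, here is a minimal Python sketch (not from the slides) of one way to store an MDP's states, actions, transition probabilities P(s'|s,a), and rewards R(s,a,s'); the tiny two-state MDP it builds is invented purely for illustration.

```python
# Minimal sketch of an MDP as plain Python data (illustrative only).
# Transitions map a q-state (s, a) to a list of (probability, next_state, reward)
# triples, i.e. P(s'|s,a) and R(s,a,s').
from typing import Dict, List, Tuple

Transition = Tuple[float, str, float]          # (P(s'|s,a), s', R(s,a,s'))
MDP = Dict[Tuple[str, str], List[Transition]]  # (s, a) -> possible outcomes

toy_mdp: MDP = {
    ("A", "east"): [(0.8, "B", -0.3), (0.2, "A", -0.3)],
    ("A", "exit"): [(1.0, "done", 1.0)],
    ("B", "west"): [(1.0, "A", -0.3)],
    ("B", "exit"): [(1.0, "done", -1.0)],
}

# Sanity check: for every q-state (s, a), the successor probabilities sum to 1.0.
for (s, a), outcomes in toy_mdp.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```

The later code sketches below reuse this dictionary-of-outcomes format.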
π1 and π2 are two policies, each a table mapping every state s to an action π1(s) or π2(s):
State  Action        State  Action
…      …             …      …
Example: Policy Evaluation
Evaluating the two policies π1 and π2 gives value functions V^{π1}(s) and V^{π2}(s), which can be compared state by state; here V^{π1}(s) ≤ V^{π2}(s).
Policy Evaluation
[Gridworld figure with Exit states]
Why discount?
Sooner rewards probably do have higher utility than later rewards.
Discounting also helps our algorithms converge.
Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + ⋯
Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ·r_1 + γ²·r_2 + ⋯
Quiz: Discounting
Given: [a row of states, including d, with a reward at each end]
Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
Problem: What if the game lasts forever? Do we get infinite rewards?
Solutions:
Finite horizon (similar to depth-limited search): terminate episodes after a fixed T steps (e.g., a lifetime); gives nonstationary policies (π depends on the time left).
Discounting with 0 < γ < 1: the geometric series bounds the utility, U([r_0, r_1, …]) ≤ Σ_t γ^t·R_max = R_max / (1 − γ); a smaller γ means a smaller "horizon" – shorter-term focus.
Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (with probability 1).
Utility of a sequence of actions
[Gridworld figure with Exit states]
Utility = Sum of Discounted Rewards
Utility of a sequence of actions
Living Reward = -0.3, Noise = 0.2
Four example trajectories and their discounted utilities (γ = 0.9, rewards as shown on the grid):
U = (−0.3) + (0.9)×(−0.3) + (0.9)²×(−0.3 + 0.5) = −0.408
U = (−0.3) + (0.9)×(−0.3) + (0.9)²×(−50) = −41.07
U = (−0.3) + (0.9)×(2) = 1.5
U = (−0.3 + 0.5) + (0.9)×(−0.3) + (0.9)²×(−0.3) = −0.313
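As a quick sanity check of this arithmetic, a few lines of Python (the reward sequences below are simply read off the slide, with γ = 0.9; the informal labels in the comments are assumptions):

```python
# Discounted utility U = r_0 + gamma*r_1 + gamma^2*r_2 + ... of a finite reward sequence.
def discounted_utility(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([-0.3, -0.3, -50]))  # ≈ -41.07  (two living-reward steps, then -50)
print(discounted_utility([-0.3, 2]))          # ≈ 1.5     (one living-reward step, then +2)
```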
Expected Utility
Expected Utility = Average (over stochastic outcomes) of the Sum of Discounted Rewards
Solving MDPs
Policy Evaluation
Given the current state, how good is it under a policy π?
state s = (2,2)
Policy Evaluation
State s = (2,2); for every action, the successor probabilities satisfy Σ_{s'} P(s'|s,a) = 1.0.
V^π(s): the expected utility of state s = (2,2) obtained by following π; likewise each successor state, e.g. (1,2), (2,1), (3,2), has its own expected utility V^π(s') by following π.
V^π(s) = Q^π(s, π(s)) = Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π(s') ]
Example: taking action a = "south" from s = (2,2) yields one-step rewards R(s, a, s') = −0.3.
Q^π(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V^π(s') ]: the expected utility of taking action a = "south" in state s = (2,2) and then following π.
Q-values to compute:
State  Action   Expected utility
(2,2)  "south"  ?
(1,2)  "east"   ?
(2,1)  "east"   ?
Definitions:
V^π(s): the state-value function for policy π, i.e. the expected utility obtained by starting from state s and then following policy π.
Q^π(s, a): the action-value function for policy π, i.e. the expected utility obtained by starting from state s, taking action a, and then following policy π.
Policy Evaluation
With a fixed policy π, each state contributes one linear equation:
V^π(s) = Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π(s') ]
State Action
[2,1] “east”
[2,2] “east”
[2,3] “east”
[2,4] “exit”
[1,3] “exit”
… …
Solving a system of linear equations
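One way to read "solving a system of linear equations": for a fixed policy there is one linear equation per state, V^π = r^π + γ·P^π·V^π, which a linear solver handles directly. A rough NumPy sketch with made-up matrices (the 3-state data here is illustrative, not the gridworld on the slides):

```python
import numpy as np

# Fixed-policy MDP in matrix form (invented 3-state example):
# P_pi[s, s2] = P(s2 | s, pi(s)),  r_pi[s] = expected one-step reward under pi.
gamma = 0.9
P_pi = np.array([[0.0, 0.8, 0.2],
                 [0.0, 0.1, 0.9],
                 [0.0, 0.0, 1.0]])   # last state acts as an absorbing exit
r_pi = np.array([-0.3, -0.3, 0.0])

# Bellman equations for a fixed policy:  V = r_pi + gamma * P_pi @ V
# Rearranged into a plain linear system: (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V_pi)
```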
(Iterative) Policy Evaluation
Iterative approach: repeat the update
V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ]
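The same values can be approximated by iterating that update. A minimal sketch, reusing the dictionary-of-outcomes format from the earlier snippet (the toy MDP, policy, and iteration count are assumptions for illustration):

```python
# Iterative policy evaluation:
#   V_t(s) <- sum_{s'} P(s'|s, pi(s)) * [ R(s, pi(s), s') + gamma * V_{t-1}(s') ]
def evaluate_policy(mdp, policy, states, gamma=0.9, iters=100):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    for _ in range(iters):
        V = {s: sum(p * (r + gamma * V.get(s2, 0.0))  # terminal states default to 0
                    for p, s2, r in mdp[(s, policy[s])])
             for s in states}
    return V

# Illustrative usage:
toy_mdp = {
    ("A", "east"): [(0.8, "B", -0.3), (0.2, "A", -0.3)],
    ("B", "exit"): [(1.0, "done", 1.0)],
}
print(evaluate_policy(toy_mdp, {"A": "east", "B": "exit"}, ["A", "B"]))
```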
Example: Policy Evaluation
Noise = 0.2, Discount = 0.9, Living reward = −0.3
π1 = "Always Go Right", π2 = "Always Go Forward":

π1: State  Action        π2: State  Action
    (2,1)  "east"            (2,1)  "north"
    (2,2)  "east"            (2,2)  "north"
    (2,3)  "east"            (2,3)  "north"
    (2,4)  "exit"            (2,4)  "exit"
    (1,3)  "exit"            (1,3)  "exit"
    …      …                 …      …
Iterating policy evaluation (Noise = 0.2, Discount = 0.9, Living reward = −0.3), shown at t = 0, t = 1, …:
V^π_t(s) ← Q^π_{t−1}(s, π(s)), i.e. each V^π_{t−1}(s) is replaced by V^π_t(s)
Exit values: V^{π1}(2,4) = 100, V^{π1}(1,1) = −10
Example update: V^π_7(2,3) = (0.8)×(−0.3 + 0.9·V^π_6(3,3)) + …
Example: Policy Evaluation (Noise = 0.2, Discount = 0.9, Living reward = −0.3)
Comparing the converged values: V^{π1}(s) ≤ V^{π2}(s)
Summary: Policy Evaluation
How do we calculate V^π(s) for a fixed policy π?
Idea 1: Turn the recursive Bellman equations into updates:
V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ]
Efficiency: O(S²) per iteration
Idea 2: Without a max over actions, these Bellman equations are just a linear system, which can also be solved directly (see "Solving a system of linear equations" above).
Value Iteration
Policy Evaluation
Given the current state, how good is it under a policy π?
state s = (2,2)
Value Iteration
Given the current state, how good is it under an optimal policy?
state s = (2,2)
Value Iteration
Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
At a state s there is one q-value Q*(s, a_i) per available action (here "north", "south", "east", "west"):
V*(s) = max_a Q*(s, a)
π*(s) = argmax_a Q*(s, a)
Optimal Policy
Do what π says to do: from each state s, take action π(s), reaching q-state (s, π(s)) and transitions (s, π(s), s').
Do the optimal action: from each state s, take the best action a, reaching q-state (s, a) and transitions (s, a, s').
Expectimax trees max over all actions to compute the optimal values
Optimal Quantities
V*(s): the expected utility starting in s and acting optimally
Q*(s, a): the expected utility after taking action a from state s and thereafter acting optimally
π*(s): the optimal action from state s
Values of States
Recursive definition of value:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
Bellman Equations
Recursive definition of value:
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
Bellman Equation: a necessary condition for optimality in optimization problems formulated as dynamic programming.
Richard E. Bellman (1920–1984)
Dynamic Programming: a process that simplifies an optimization problem by breaking it down into an optimal substructure.
Time-Limited Values
Key idea: time-limited values
Define V_t(s) as the optimal value of s if the game ends in t more time steps.
Example: Value Iteration
Assume no discount!
V*_t(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + V*_{t−1}(s') ]
t = 0: values 0, 0, 0
t = 1: first state: S: 1, F: 0.5×2 + 0.5×2 = 2; second state: S: 0.5×1 + 0.5×1 = 1, F: −10 → values 2, 1, 0
t = 2: first state: S: 1 + 2 = 3, F: 0.5×(2+2) + 0.5×(2+1) = 3.5 → values 3.5, 2.5, 0
Policy Evaluation
Iterative approach: repeat the update
V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ]
Value Iteration
Evaluates an optimal policy:
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
1. Initialize V*_0(s) = 0 for all states s.
2. Repeat until convergence (while the values still change), for each state s:
   V*_t(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ] = max_a Q*_{t−1}(s, a)
Policy Evaluation
Evaluates a specific policy π:
   V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ] = Q^π_{t−1}(s, π(s))
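Here is a hedged sketch of that value iteration loop in Python, reusing the same illustrative dictionary format as the earlier snippets (the toy MDP, action function, and stopping threshold are all assumptions, not the slides' gridworld):

```python
# Value iteration:
#   V_t(s) <- max_a sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_{t-1}(s') ]
def value_iteration(mdp, states, actions, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}                        # V_0(s) = 0 for all states
    while True:
        new_V = {}
        for s in states:
            q_values = [sum(p * (r + gamma * V.get(s2, 0.0))
                            for p, s2, r in mdp[(s, a)])
                        for a in actions(s)]
            new_V[s] = max(q_values)                     # V_t(s) = max_a Q_{t-1}(s, a)
        if max(abs(new_V[s] - V[s]) for s in states) < eps:
            return new_V                                 # values (approximately) converged
        V = new_V

# Illustrative usage with a tiny made-up MDP:
toy_mdp = {
    ("A", "east"): [(0.8, "B", -0.3), (0.2, "A", -0.3)],
    ("A", "exit"): [(1.0, "done", 1.0)],
    ("B", "exit"): [(1.0, "done", 10.0)],
}
actions_of = lambda s: ["east", "exit"] if s == "A" else ["exit"]
print(value_iteration(toy_mdp, ["A", "B"], actions_of))
```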
Value iteration on the gridworld (Noise = 0.2, Discount = 0.9, Living reward = 0), with value snapshots at t = 0, 1, 2, 3, 4, 5, 6, 7, and 100. Each step applies, for every state,
V*_t(s) ← max_{a ∈ Actions(s)} Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ]
i.e. each new value V*_t(s) is the best q-value max_a Q*_{t−1}(s, a) computed from the previous values V*_{t−1}(s').
Policy Extraction
Computing Actions from Values
Let’s imagine we have the optimal values V*(s)
π*(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
This is called policy extraction, since it gets the policy implied by the values
Computing Actions from Q-Values
Let’s imagine we have the optimal q-values:
π*(s) = argmax_a Q*(s, a)
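A small sketch of both extraction rules, following the earlier illustrative data format (extracting from V* needs the transition model, extracting from Q* is a plain argmax; all names here are assumptions):

```python
# Policy extraction from optimal state values V*(s) (needs the model P and R):
#   pi*(s) = argmax_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V*(s') ]
def policy_from_values(mdp, V, states, actions, gamma=0.9):
    def q(s, a):
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in mdp[(s, a)])
    return {s: max(actions(s), key=lambda a: q(s, a)) for s in states}

# Policy extraction from optimal q-values Q*(s, a) (no model needed):
#   pi*(s) = argmax_a Q*(s, a)
def policy_from_qvalues(Q, states, actions):
    return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}
```

Selecting actions from q-values needs only the argmax, which is one reason q-values are often the more convenient quantity to keep around.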
The Bellman Equations
Definition of "optimal utility" via an expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
Value Iteration
Bellman equations characterize the optimal values V*(s); value iteration computes them by repeating the update:
V*_t(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ]
Policy Iteration
Problems with Value Iteration
Value iteration repeats the Bellman updates:
V*_t(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ]
Each sweep touches every state, action, and successor, and the policy implied by the values often converges long before the values themselves do.
Gridworld value snapshots at t = 1 through t = 12 and at t = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0).
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values.
Repeat steps until the policy converges.
1. Initialization: start from some initial policy π_0.
2. Evaluation: for the fixed current policy π_i, find values with policy evaluation; iterate until the values converge:
   V^{π_i}_t(s) ← Σ_{s'} P(s'|s, π_i(s)) [ R(s, π_i(s), s') + γ·V^{π_i}_{t−1}(s') ]
3. Improvement: for fixed values, get a better policy using policy extraction (one-step look-ahead):
   π_{i+1}(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V^{π_i}(s') ]
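A compact sketch of this evaluate/improve loop, again in the illustrative dictionary format used above (the initial policy choice, evaluation iteration count, and any data fed to it are assumptions, not a prescribed implementation):

```python
# Policy iteration: alternate policy evaluation and policy improvement
# until the policy stops changing.
def policy_iteration(mdp, states, actions, gamma=0.9, eval_iters=100):
    def q(s, a, V):
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in mdp[(s, a)])

    policy = {s: actions(s)[0] for s in states}          # arbitrary initial policy
    while True:
        # Evaluation: iterate the fixed-policy Bellman update (approximate convergence).
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: q(s, policy[s], V) for s in states}
        # Improvement: one-step look-ahead (policy extraction) on the evaluated values.
        new_policy = {s: max(actions(s), key=lambda a: q(s, a, V)) for s in states}
        if new_policy == policy:                         # policy converged, so we're done
            return policy, V
        policy = new_policy
```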
Policy Improvement – Policy Evaluation (Noise = 0.2, Discount = 0.9, Living reward = 0)
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
The new policy will be better (or we’re done)