08 MDPs


CS 106: Artificial Intelligence

Markov Decision Processes

Instructor: Ngoc-Hoang LUONG, PhD.

University of Information Technology (UIT), VNU-HCM


[These slides were adapted from the slides created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. http://ai.berkeley.edu.]
Non-Deterministic Search
Example: Grid World
 A maze-like problem
 The agent lives in a grid
 Walls block the agent’s path
 Noisy movement: actions do not always go as planned
 80% of the time, the action North takes the agent North
(if there is no wall there)
 10% of the time, North takes the agent West; 10% East
 If there is a wall in the direction the agent would have
been taken, the agent stays put
 The agent receives rewards each time step
 Small “living” reward each step (can be negative)
 Big rewards come at the end (good or bad)
 Goal: maximize sum of rewards
Grid World Actions
Deterministic Grid World Stochastic Grid World
Markov Decision Processes
 An MDP is defined by:
 A set of states s ∈ S
 A set of actions a ∈ A
 A transition function T(s, a, s’)
 Probability that a from s leads to s’, i.e., P(s’| s, a)
 Also called the model or the dynamics
 A reward function R(s, a, s’)
 Sometimes just R(s) or R(s’)
 A start state
 Maybe a terminal state

 MDPs are non-deterministic search problems


 One way to solve them is with expectimax search
 We’ll have a new tool soon
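The definition above maps directly onto data structures. Below is a minimal Python sketch of how a Grid World transition model T(s, a, s’) with the 80/10/10 noise described earlier might be encoded; the names (step, T, walls) are illustrative, not part of any course codebase.

```python
# Minimal sketch of the slides' Grid World transition model (illustrative only).
DIRS     = {'north': (0, 1), 'south': (0, -1), 'east': (1, 0), 'west': (-1, 0)}
LEFT_OF  = {'north': 'west', 'west': 'south', 'south': 'east', 'east': 'north'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def step(state, direction, walls, width, height):
    """Move one cell in `direction`; stay put if blocked by a wall or the grid edge."""
    x, y = state
    dx, dy = DIRS[direction]
    nxt = (x + dx, y + dy)
    if nxt in walls or not (1 <= nxt[0] <= width and 1 <= nxt[1] <= height):
        return state
    return nxt

def T(state, action, walls, width=4, height=3):
    """Transition model P(s' | s, a): 80% intended direction, 10% each perpendicular."""
    probs = {}
    for direction, p in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        s2 = step(state, direction, walls, width, height)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs  # dict s' -> P(s'|s,a), summing to 1.0

if __name__ == "__main__":
    walls = {(2, 2)}                   # a 4x3 grid with a single wall, as in the figures
    print(T((1, 1), 'north', walls))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```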
What is Markov about MDPs?
 “Markov” generally means that given the present state, the
future and the past are independent

 For Markov decision processes, “Markov” means action outcomes depend only on the current state

Andrey Markov
(1856-1922)
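Written as a formula (the standard statement of the Markov property; the slide gives it only in words), the next state depends on the current state and action but not on the earlier history:

```latex
P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \ldots, S_0 = s_0)
    = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)
```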
Policies
 In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of
actions, from start to a goal

 A policy : S → A gives an action for each state.


 E.g, ((1,1)) = “north”, ((4,1)) = “west”,
((3,3))=“east”, ((3,2))=“north”,…

 For MDPs, we want an optimal policy π*: S → A


 An optimal policy is one that maximizes expected utility if followed
(Figure: optimal policy when R(s, a, s’) = -0.03 for all non-terminals s)

 Expectimax didn’t compute entire policies


 It computed the action for a single state only
Policy
(Figure: the policy shown as an action arrow in each grid state, e.g. at (1,1), (4,1), (3,3).)
Example: Bridge Crossing
Example: Volcano

MDP Search Tree

From state s = (2,2), taking action a = “south” can lead to several successor states, e.g. s’ = (1,2), (2,1), (3,2), each reached with probability P(s’|s, a), where ∑_{s’} P(s’|s, a) = 1.

MDP Search Tree

From state s = (2,2), taking action a = “east” can lead to s’ = (2,1), (3,2), (2,3), each reached with probability P(s’|s, a), where ∑_{s’} P(s’|s, a) = 1.

MDP Search Tree

From state s = (4,1), taking action a = “east” can lead to s’ = (4,2) or back to (4,1) (the agent stays put when blocked), each with probability P(s’|s, a), where ∑_{s’} P(s’|s, a) = 1.
MDP Search Trees
 Each MDP state projects an expectimax-like search tree

 s is a state
 (s, a) is a q-state
 (s, a, s’) is called a transition
 T(s, a, s’) = P(s’|s, a)
 R(s, a, s’) is the reward for that transition
Optimal Policies

(Figures: optimal policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, R(s) = -2.0)


R(s) = the “living reward”
Policy Evaluation
Example: Policy Evaluation
Always Go Right Always Go Forward
Example: Policy Evaluation

π1                          π2
State    π1(s)              State    π2(s)
(2,1)    “east”             (2,1)    “north”
(2,2)    “east”             (2,2)    “north”
(2,3)    “east”             (2,3)    “north”
(2,4)    “exit”             (2,4)    “exit”
(1,3)    “exit”             (1,3)    “exit”
…        …                  …        …
Example: Policy Evaluation

π1                                     π2
State    π1(s)      V^π1(s)            State    π2(s)      V^π2(s)
(2,1)    “east”     -9.06              (2,1)    “north”    32.62
(2,2)    “east”     -8.25              (2,2)    “north”    48.23
(2,3)    “east”     0.76               (2,3)    “north”    69.60
(2,4)    “exit”     100                (2,4)    “exit”     100
(1,3)    “exit”     -10                (1,3)    “exit”     -10
…        …          …                  …        …          …
Policy Evaluation

 V^π(s): the expected utility obtained by starting from state s and following policy π.

 V^π: the state-value function for policy π.

 A policy π is optimal if and only if V^π(s) ≥ V^π′(s), for all states s and all policies π′.


Utility of a sequence of actions

Exit

Living Reward = 0, Noise = 0

U = 0 + 0 + (0 + 0.5) + 0 + 0 + 0 + 20 = 20.5

U = (0 + 0.5) + 0 + 0 + 0 + (0 + 0.5) + 0 + 0 + 0 + 20 = 21

Utility of a sequence of actions

Exit

Living Reward = -0.3, Noise = 0.0

U = -0.3 - 0.3 + (-0.3 + 0.5) - 0.3 - 0.3 - 0.3 + 20 = 18.7

Utility = Sum of Rewards

Utility of a sequence of actions

Exit

Living Reward = -0.3, Noise = 0.2

U = -0.3 - 0.3 + (-0.3 + 0.5) - 0.3 - 0.3 - 0.3 + 20 = 18.7

Exit

U = -0.3 - 0.3 + (-0.3 + 0.5) - 0.3 - 50 = -50.7

U = -0.3 - 0.3 - 50 = -50.6

Expected Utility = Average of (Sum of Rewards)
Utilities of Sequences
Utilities of Sequences
 What preferences should an agent have over reward sequences?

 More or less? [1, 2, 2] or [2, 3, 4]

 Now or later? [0, 0, 1] or [1, 0, 0]


Discounting
 It’s reasonable to maximize the sum of rewards
 It’s also reasonable to prefer rewards now to rewards later
 One solution: values of rewards decay exponentially

Worth 1 Now        Worth γ Next Step        Worth γ² In Two Steps


Discounting
 How to discount?
 Each time we descend a level, we
multiply in the discount once

 Why discount?
 Sooner rewards probably do have
higher utility than later rewards
 Also helps our algorithms converge

 Example: discount of 0.5


 U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
 U([1,2,3]) < U([3,2,1])
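Carrying the arithmetic through with the same discount of 0.5:

```latex
U([1,2,3]) = 1 \cdot 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75,
\qquad
U([3,2,1]) = 1 \cdot 3 + 0.5 \cdot 2 + 0.25 \cdot 1 = 4.25,
```

so indeed U([1,2,3]) < U([3,2,1]).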
Stationary Preferences
 There are two ways to define utilities

 Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + ⋯

 Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + ⋯
Quiz: Discounting
 Given:

 Actions: East, West, and Exit (only available in exit states a, e)


 Transitions: deterministic

 Quiz 1: For γ = 1, what is the optimal policy?

 Quiz 2: For γ = 0.1, what is the optimal policy?

 Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
 Problem: What if the game lasts forever? Do we get infinite rewards?

 Solutions:
 Finite horizon: (similar to depth-limited search)
 Terminate episodes after a fixed T steps (e.g. life)
 Gives nonstationary policies (π depends on time left)

 Discounting: use 0 < γ < 1

 Geometric series (see the bound below)
 Smaller γ means smaller “horizon” – shorter term focus

 Absorbing state: guarantee that for every policy, a terminal state will eventually be reached
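The geometric-series bound behind the discounting solution, assuming every per-step reward is bounded by some R_max (the bound R_max is an added assumption, not named on the slide):

```latex
\left| \sum_{t=0}^{\infty} \gamma^{t} r_{t} \right|
  \le \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
  = \frac{R_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1,\ |r_t| \le R_{\max},
```

so discounted utilities stay finite even when the game lasts forever.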
Utility of a sequence of actions

Exit

Living Reward = -0.3, Noise = 0.0, Discount = 0.9

U = (-0.3) + (0.9) × 2 = 1.5

Utility = Sum of Discounted Rewards

Utility of a sequence of actions

Living Reward = -0.3, Noise = 0.2, Discount = 0.9

Exit
U = (-0.3) + (0.9) × (-0.3) + (0.9)² × (-0.3 + 0.5)

Exit
U = (-0.3) + (0.9) × (-0.3) + (0.9)² × (-50) = -41.07

U = (-0.3) + (0.9) × (2) = 1.5

Exit
U = (-0.3 + 0.5) + (0.9) × (-0.3) + (0.9)² × (-0.3)

Expected Utility = Average of (Sum of Discounted Rewards)
Solving MDPs
Policy Evaluation
Policy Evaluation
Regarding the current state,
how good is it according to a policy π?

state s = (2,2)
Policy Evaluation

Example: state s = (2,2), action a = π(s) = “south”. The possible successors are s’ ∈ {(1,2), (2,1), (3,2)}, each reached with probability P(s’|s, a), where ∑_{s’} P(s’|s, a) = 1, with transition rewards such as R(s, a, s₁) = -0.3. Each successor s’ carries its own expected utility V^π(s’), obtained by following π from there.

Expected utility of state s = (2,2) by following π:

V^π(s) = Q^π(s, π(s)) = ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π(s’)]

Expected utility of taking action a = “south” in state s = (2,2) and then following π:

Q^π(s, a) = ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V^π(s’)]

The policy π (partial):
State    Action
(2,2)    “south”
(1,2)    “east”
(2,1)    “east”
(3,2)    “exit”
…        …
Policy Evaluation

 V^π(s): the expected utility obtained by starting from state s and following policy π.

 V^π: the state-value function for policy π.

 Q^π(s, a): the expected utility obtained by starting from state s, taking action a, and then following policy π.

 Q^π: the action-value function for policy π.

(Figure: expectimax-like tree fragments, s → π(s) → s’ for V^π and s → a → s’ for Q^π.)
Policy Evaluation

V^π(s) = ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π(s’)]

One such equation per state:

State    Action
(2,1)    “east”
(2,2)    “east”
(2,3)    “east”
(2,4)    “exit”
(1,3)    “exit”
…        …

Solving a system of linear equations
(Iterative) Policy Evaluation

V^π(s) = ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π(s’)]

Iterative Approach
1. Initialize V^π_0(s) = 0 for all states s.
2. Repeat until convergence:
   • For each state s:
     V^π_t(s) ← ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π_{t-1}(s’)] = Q^π_{t-1}(s, π(s))
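The two-step procedure above can be written as a short Python function. This is a minimal sketch, assuming the MDP is given as plain dictionaries (transitions, rewards); those names and the tiny two-state example are illustrative, not the course's code.

```python
# Iterative policy evaluation (step 2 above), sketched over dictionary inputs.
def policy_evaluation(states, policy, transitions, rewards, gamma=0.9, tol=1e-6):
    """transitions[(s, a)] -> {s': P(s'|s,a)}, rewards[(s, a, s')] -> R(s,a,s')."""
    V = {s: 0.0 for s in states}                   # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in states:
            a = policy[s]
            V_new[s] = sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                           for s2, p in transitions[(s, a)].items())
        if max(abs(V_new[s] - V[s]) for s in states) < tol:   # converged?
            return V_new
        V = V_new

if __name__ == "__main__":
    # Tiny 2-state example: from 'A' the policy's action reaches 'B' with probability 1.
    states = ['A', 'B']
    policy = {'A': 'go', 'B': 'stay'}
    transitions = {('A', 'go'): {'B': 1.0}, ('B', 'stay'): {'B': 1.0}}
    rewards = {('A', 'go', 'B'): 1.0, ('B', 'stay', 'B'): 0.0}
    print(policy_evaluation(states, policy, transitions, rewards))  # {'A': 1.0, 'B': 0.0}
```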
Example: Policy Evaluation
Always Go Right (π1)        Always Go Forward (π2)
Noise = 0.2, Discount = 0.9, Living reward = -0.3

π1                          π2
State    π1(s)              State    π2(s)
(2,1)    “east”             (2,1)    “north”
(2,2)    “east”             (2,2)    “north”
(2,3)    “east”             (2,3)    “north”
(2,4)    “exit”             (2,4)    “exit”
(1,3)    “exit”             (1,3)    “exit”
…        …                  …        …

Iterating V^π_t(s) ← Q^π_{t-1}(s, π(s)) from t = 0 (all values initialized to 0); at each step the grids of V^π_{t-1}(s) and V^π_t(s) are shown, with sample updates:

t=0 → t=1:
V^π_1(2,4) = 100
V^π_1(1,1) = -10
V^π_1(2,2) = (0.8) × (-0.3 + 0.9 V^π_0(3,2))

t=1 → t=2:
V^π_2(2,2) = (0.8) × (-0.3 + 0.9 V^π_1(3,2))
V^π_2(2,3) = (0.8) × (-0.3 + 0.9 V^π_1(3,3))

t=2 → t=3:
V^π_3(2,2) = (0.8) × (-0.3 + 0.9 V^π_2(3,2))
V^π_3(2,1) = (0.8) × (-0.3 + 0.9 V^π_2(3,1))

t=3 → t=4:
V^π_4(2,2) = (0.8) × (-0.3 + 0.9 V^π_3(3,2))
V^π_4(2,1) = (0.8) × (-0.3 + 0.9 V^π_3(3,1))

t=4 → t=5:
V^π_5(2,2) = (0.8) × (-0.3 + 0.9 V^π_4(3,2))
V^π_5(2,3) = (0.8) × (-0.3 + 0.9 V^π_4(3,3))

t=5 → t=6:
V^π_6(2,2) = (0.8) × (-0.3 + 0.9 V^π_5(3,2))
V^π_6(2,3) = (0.8) × (-0.3 + 0.9 V^π_5(3,3))

t=6 → t=7:
V^π_7(2,2) = (0.8) × (-0.3 + 0.9 V^π_6(3,2))
V^π_7(2,3) = (0.8) × (-0.3 + 0.9 V^π_6(3,3))

Example: Policy Evaluation (converged values)
Noise = 0.2, Discount = 0.9, Living reward = -0.3
(Figure: converged value grids V^π1(s) for Always Go Right and V^π2(s) for Always Go Forward.)
Summary: Policy Evaluation
 How do we calculate V^π(s) for a fixed policy π?

 Idea 1: Turn the recursive Bellman equations into updates

V^π_t(s) ← ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π_{t-1}(s’)]

 Efficiency: O(S²) per iteration

 Idea 2: The Bellman equations are just a linear system
 Solve with Matlab (or your favorite linear system solver) – see the sketch below

V^π(s) = ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π(s’)]
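Idea 2 can also be coded directly: with one row per state, the Bellman equations become (I − γP_π)V = R_π. A minimal NumPy sketch, reusing the same illustrative dictionary encoding as the earlier policy-evaluation sketch:

```python
# Exact policy evaluation by solving the linear system (I - gamma * P_pi) V = R_pi.
import numpy as np

def policy_evaluation_exact(states, policy, transitions, rewards, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))   # P[i, j] = P(s_j | s_i, pi(s_i))
    R = np.zeros(n)        # R[i]    = expected one-step reward under pi from s_i
    for s in states:
        a = policy[s]
        for s2, p in transitions[(s, a)].items():
            P[idx[s], idx[s2]] += p
            R[idx[s]] += p * rewards[(s, a, s2)]
    V = np.linalg.solve(np.eye(n) - gamma * P, R)   # one linear solve, no iteration
    return dict(zip(states, V))
```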
Value Iteration
Policy Evaluation
Regarding the current state,
how good is it according to a policy π?

state s = (2,2)
Value Iteration
Regarding the current state,
how good is it according to an optimal policy?

state s = (2,2)
Value Iteration

Q*(s, a) = ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

From state s there is one q-value per available action (‘north’, ‘south’, ‘east’, ‘west’): Q*(s, a₁), Q*(s, a₂), Q*(s, a₃), Q*(s, a₄).

V*(s) = max_a Q*(s, a)

π*(s) = argmax_a Q*(s, a)
Optimal Policy
Do what  says to do Do the optimal action
s s

(s) a
s, (s) s, a

s, (s),s’ s,a,s’
’s ’s

Expectimax trees max over all actions to compute the optimal values
Optimal Quantities

 The value (utility) of a state s:
V*(s) = expected utility starting in s and acting optimally

 The value (utility) of a q-state (s, a):
Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

 The optimal policy:
π*(s) = optimal action from state s = argmax_a Q*(s, a)

(Tree: s is a state, (s, a) is a q-state, (s, a, s’) is a transition.)
Optimal Values of States
 Fundamental operation: compute the (expectimax) value of a state
 Expected utility under optimal action
 Average sum of (discounted) rewards
 This is just what expectimax computed!

 Recursive definition of optimal values:

Q*(s, a) = ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

V*(s) = max_a Q*(s, a)

π*(s) = argmax_a Q*(s, a)

V*(s) = max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]
Values of States
 Recursive definition of value:

V*(s) = max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]
Bellman Equations
 Recursive definition of value:

V*(s) = max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

 Bellman Equation:
Necessary condition for optimality in optimization problems formulated as Dynamic Programming
(Richard E. Bellman, 1920–1984)

 Dynamic Programming:
Process to simplify an optimization problem by breaking it down into an optimal substructure.
Time-Limited Values
 Key idea: time-limited values

 Define Vk(s) to be the optimal value of s if the game ends


in k more time steps
 Equivalently, it’s what a depth-k expectimax would give from s
Example: Value Iteration

Assume no discount! Apply V*_t(s) ← max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + V*_{t-1}(s’)] to a three-state example:

V*_0:  0     0     0

t=1, first state:  S: 1;   F: 0.5×2 + 0.5×2 = 2
t=1, second state: S: 0.5×1 + 0.5×1 = 1;   F: -10

V*_1:  2     1     0

t=2, first state:  S: 1 + 2 = 3;   F: 0.5×(2+2) + 0.5×(2+1) = 3.5

V*_2:  3.5   2.5   0
Policy Evaluation

V^π(s) = ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π(s’)]

Iterative Approach
1. Initialize V^π_0(s) = 0 for all states s.
2. Repeat until convergence:
   • For each state s:
     V^π_t(s) ← ∑_{s’} P(s’|s, π(s)) [R(s, π(s), s’) + γ V^π_{t-1}(s’)] = Q^π_{t-1}(s, π(s))
Value Iteration

V*(s) = max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

Value Iteration (evaluate an optimal policy)
1. Initialize V*_0(s) = 0 for all states s.
2. Repeat until convergence:
   • For each state s:
     V*_t(s) ← max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*_{t-1}(s’)] = max_a Q*_{t-1}(s, a)

Policy Evaluation (evaluate a specific policy π)
     V^π_t(s) ← Q^π_{t-1}(s, π(s))

(Figure: from state s, q-values Q*(s, a₁), …, Q*(s, a₄) for the available actions; V*(s) = max_a Q*(s, a), π*(s) = argmax_a Q*(s, a).)
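The value iteration loop above, as a minimal Python sketch; as before, the dictionary encoding (states, actions, transitions, rewards) is illustrative rather than any official API.

```python
# Value iteration: V*_t(s) <- max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma * V*_{t-1}(s')]
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-6):
    """actions[s] -> list of actions available in s; returns an estimate of V*."""
    V = {s: 0.0 for s in states}                    # V*_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in states:
            q_values = [sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                            for s2, p in transitions[(s, a)].items())
                        for a in actions[s]]
            V_new[s] = max(q_values) if q_values else 0.0   # terminal: no actions
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```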
Value iteration on the grid world, for t = 0, 1, 2, 3, 4, 5, 6, 7, …, 100 (Noise = 0.2, Discount = 0.9, Living reward = 0):

V*_t(s) ← max_{a ∈ Actions(s)} ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*_{t-1}(s’)]

(Figures: grids of V*_{t-1}(s’), Q*_{t-1}(s, a), and V*_t(s) at each iteration.)
Policy Extraction
Computing Actions from Values
 Let’s imagine we have the optimal values V*(s)

 How should we act?


 It’s not obvious!

 We need to do a mini-expectimax (one step)


π*(s) = argmax_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

 This is called policy extraction, since it gets the policy implied by the values
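The one-step mini-expectimax above is easy to code once V* (or any value estimate) is available. A minimal sketch with the same illustrative dictionary encoding used in the earlier sketches:

```python
# Policy extraction: one-step lookahead on a value function V.
def extract_policy(states, actions, transitions, rewards, V, gamma=0.9):
    policy = {}
    for s in states:
        if not actions[s]:            # terminal state: nothing to choose
            policy[s] = None
            continue
        # pick the action with the highest expected one-step value
        policy[s] = max(actions[s],
                        key=lambda a: sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                                          for s2, p in transitions[(s, a)].items()))
    return policy
```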
Computing Actions from Q-Values
 Let’s imagine we have the optimal q-values:

 How should we act?


 Completely trivial to decide!

π*(s) = argmax_a Q*(s, a)

(equivalently, via one-step lookahead on V*:)

π*(s) = argmax_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]
The Bellman Equations
 Definition of “optimal utility” via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values

V*(s) = max_a Q*(s, a)

Q*(s, a) = ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

V*(s) = max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
Value Iteration
 Bellman equations characterize the optimal values:

V*(s) = max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*(s’)]

 Value iteration computes them:

V*_t(s) ← max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*_{t-1}(s’)]
Policy Iteration
Problems with Value Iteration
 Value iteration repeats the Bellman updates:

V*_t(s) ← max_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V*_{t-1}(s’)]

 Problem 1: It’s slow – O(S²A) per iteration

 Problem 2: The “max” at each state rarely changes

 Problem 3: The policy often converges long before the values


Value iteration run on the grid world, for t = 0, 1, 2, …, 12, …, 100 (Noise = 0.2, Discount = 0.9, Living reward = 0).
(Figures: values and the implied policy at each iteration; the policy settles long before the values do.)
Policy Iteration
 Alternative approach for optimal values:
 Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal
utilities!) until convergence
 Step 2: Policy improvement: update policy using one-step look-ahead with resulting
converged (but not optimal!) utilities as future values
 Repeat steps until policy converges

 This is policy iteration


 It’s still optimal!
 Can converge (much) faster under some conditions
Policy Iteration
1. Initialization: V(s) and π₀(s) arbitrary, for all states s

2. Evaluation: For the fixed current policy π_i, find values with policy evaluation:
 Iterate until values converge:

V^{π_i}_t(s) ← ∑_{s’} P(s’|s, π_i(s)) [R(s, π_i(s), s’) + γ V^{π_i}_{t-1}(s’)]

3. Improvement: For fixed values, get a better policy using policy extraction
 One-step look-ahead:

π_{i+1}(s) = argmax_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V^{π_i}(s’)]

 If π_{i+1}(s) = π_i(s) for all s, then return π_i; else set i ← i+1 and go to 2.
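Steps 1–3 above combine into a short loop. A minimal sketch that reuses the evaluation and extraction ideas from the earlier sketches (same illustrative dictionary encoding); it returns both the converged policy and its values.

```python
# Policy iteration: alternate policy evaluation (step 2) and improvement (step 3).
def policy_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-6):
    policy = {s: (actions[s][0] if actions[s] else None) for s in states}  # arbitrary init
    while True:
        # Step 2: evaluate the current policy until its values converge
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                a = policy[s]
                V_new[s] = 0.0 if a is None else sum(
                    p * (rewards[(s, a, s2)] + gamma * V[s2])
                    for s2, p in transitions[(s, a)].items())
            converged = max(abs(V_new[s] - V[s]) for s in states) < tol
            V = V_new
            if converged:
                break
        # Step 3: one-step lookahead improvement on the evaluated values
        new_policy = {}
        for s in states:
            if not actions[s]:
                new_policy[s] = None
                continue
            new_policy[s] = max(actions[s],
                                key=lambda a: sum(p * (rewards[(s, a, s2)] + gamma * V[s2])
                                                  for s2, p in transitions[(s, a)].items()))
        if new_policy == policy:      # unchanged for all states: done
            return policy, V
        policy = new_policy
```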


Policy Evaluation – t = 0→1, 1→2, 2→3, 3→4, 4→5, 5→6 (Noise = 0.2, Discount = 0.9, Living reward = 0):

V^{π_i}_t(s) ← ∑_{s’} P(s’|s, π_i(s)) [R(s, π_i(s), s’) + γ V^{π_i}_{t-1}(s’)]

Policy Improvement – t = 6:

π_{i+1}(s) = argmax_a ∑_{s’} P(s’|s, a) [R(s, a, s’) + γ V^{π_i}(s’)]

Policy Improvement, then Policy Evaluation again for the new policy:

V^{π_i}_t(s) ← ∑_{s’} P(s’|s, π_i(s)) [R(s, π_i(s), s’) + γ V^{π_i}_{t-1}(s’)]

(Figures: value grids at each step.)
Comparison
 Both value iteration and policy iteration compute the same thing (all optimal values)

 In value iteration:
 Every iteration updates both the values and (implicitly) the policy
 We don’t track the policy, but taking the max over actions implicitly recomputes it

 In policy iteration:
 We do several passes that update utilities with fixed policy (each pass is fast because we
consider only one action, not all of them)
 After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
 The new policy will be better (or we’re done)

 Both are dynamic programs for solving MDPs


Summary: MDP Algorithms
 So you want to….
 Compute optimal values: use value iteration or policy iteration
 Compute values for a particular policy: use policy evaluation
 Turn your values into a policy: use policy extraction (one-step lookahead)

 These all look the same!


 They basically are – they are all variations of Bellman updates
 They all use one-step lookahead expectimax fragments
 They differ only in whether we plug in a fixed policy or max over actions
