08 MDPs
Andrey Markov
(1856-1922)
Policies
In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal.
For MDPs, where action outcomes are uncertain, we instead want an optimal policy π*(s): a choice of action for every state.
[Gridworld figure with labeled cells (1,1), (3,3), (4,1)]
Example: Bridge Crossing
Example: Volcano
MDP Search Tree
From state s = (2,2), an action leads stochastically to successor states such as (1,2) and (3,2), each reached with probability P(s'|s,a), where Σ_{s'} P(s'|s,a) = 1.0.
MDP Search Tree
From state s = (2,2), action a = "east" leads to successor states (2,1), (3,2), and (2,3) with probabilities P(s'|s,a) satisfying Σ_{s'} P(s'|s,a) = 1.0.
MDP Search Tree
From state s = (4,1), action a = "east" leads to successor states (4,2) and (4,1) (the agent may stay in place), again with Σ_{s'} P(s'|s,a) = 1.0.
MDP Search Trees
Each MDP state projects an expectimax-like search tree
s is a state
(s, a) is a q-state
(s, a, s') is called a transition
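To make these pieces concrete, here is a minimal Python sketch (not from the slides) of one way to store an MDP's states, actions, transition probabilities P(s'|s,a), and rewards R(s,a,s'); the tiny two-state MDP it builds is invented purely for illustration.

```python
# Minimal sketch of an MDP as plain Python data (illustrative only).
# Transitions map a q-state (s, a) to a list of (probability, next_state, reward)
# triples, i.e. P(s'|s,a) and R(s,a,s').
from typing import Dict, List, Tuple

Transition = Tuple[float, str, float]          # (P(s'|s,a), s', R(s,a,s'))
MDP = Dict[Tuple[str, str], List[Transition]]  # (s, a) -> possible outcomes

toy_mdp: MDP = {
    ("A", "east"): [(0.8, "B", -0.3), (0.2, "A", -0.3)],
    ("A", "exit"): [(1.0, "done", 1.0)],
    ("B", "west"): [(1.0, "A", -0.3)],
    ("B", "exit"): [(1.0, "done", -1.0)],
}

# Sanity check: for every q-state (s, a), the successor probabilities sum to 1.0.
for (s, a), outcomes in toy_mdp.items():
    assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```

The later code sketches below reuse this dictionary-of-outcomes format.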
π1 and π2 are two policies, each a table mapping every state s to an action π1(s) or π2(s):
State  Action        State  Action
…      …             …      …
Example: Policy Evaluation
Evaluating the two policies π1 and π2 gives value functions V^{π1}(s) and V^{π2}(s), which can be compared state by state; here V^{π1}(s) ≤ V^{π2}(s).
Policy Evaluation
[Gridworld figure with Exit states]
Why discount?
Sooner rewards probably do have higher utility than later rewards.
Discounting also helps our algorithms converge.
Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + ⋯
Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ·r_1 + γ²·r_2 + ⋯
Quiz: Discounting
Given: [a row of states, including d, with a reward at each end]
Quiz 3: For which γ are West and East equally good when in state d?
Infinite Utilities?!
Problem: What if the game lasts forever? Do we get infinite rewards?
Solutions:
Finite horizon (similar to depth-limited search): terminate episodes after a fixed T steps (e.g., a lifetime); gives nonstationary policies (π depends on the time left).
Discounting with 0 < γ < 1: the geometric series bounds the utility, U([r_0, r_1, …]) ≤ Σ_t γ^t·R_max = R_max / (1 − γ); a smaller γ means a smaller "horizon" – shorter-term focus.
Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (with probability 1).
Utility of a sequence of actions
[Gridworld figure with Exit states]
Utility = Sum of Discounted Rewards
Utility of a sequence of actions
Living Reward = -0.3, Noise = 0.2
Four example trajectories and their discounted utilities (γ = 0.9, rewards as shown on the grid):
U = (−0.3) + (0.9)×(−0.3) + (0.9)²×(−0.3 + 0.5) = −0.408
U = (−0.3) + (0.9)×(−0.3) + (0.9)²×(−50) = −41.07
U = (−0.3) + (0.9)×(2) = 1.5
U = (−0.3 + 0.5) + (0.9)×(−0.3) + (0.9)²×(−0.3) = −0.313
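As a quick sanity check of this arithmetic, a few lines of Python (the reward sequences below are simply read off the slide, with γ = 0.9; the informal labels in the comments are assumptions):

```python
# Discounted utility U = r_0 + gamma*r_1 + gamma^2*r_2 + ... of a finite reward sequence.
def discounted_utility(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_utility([-0.3, -0.3, -50]))  # ≈ -41.07  (two living-reward steps, then -50)
print(discounted_utility([-0.3, 2]))          # ≈ 1.5     (one living-reward step, then +2)
```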
Expected Utility
Expected Utility = Average (over stochastic outcomes) of the Sum of Discounted Rewards
Solving MDPs
Policy Evaluation
Given the current state, how good is it under a policy π?
state s = (2,2)
Policy Evaluation
State s = (2,2); for every action, the successor probabilities satisfy Σ_{s'} P(s'|s,a) = 1.0.
V^π(s): the expected utility of state s = (2,2) obtained by following π; likewise each successor state, e.g. (1,2), (2,1), (3,2), has its own expected utility V^π(s') by following π.
V^π(s) = Q^π(s, π(s)) = Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π(s') ]
Example: taking action a = "south" from s = (2,2) yields one-step rewards R(s, a, s') = −0.3.
Q^π(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V^π(s') ]: the expected utility of taking action a = "south" in state s = (2,2) and then following π.
Q-values to compute:
State  Action   Expected utility
(2,2)  "south"  ?
(1,2)  "east"   ?
(2,1)  "east"   ?
Definitions:
V^π(s): the state-value function for policy π, i.e. the expected utility obtained by starting from state s and then following policy π.
Q^π(s, a): the action-value function for policy π, i.e. the expected utility obtained by starting from state s, taking action a, and then following policy π.
Policy Evaluation
With a fixed policy π, each state contributes one linear equation:
V^π(s) = Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π(s') ]
State Action
[2,1] “east”
[2,2] “east”
[2,3] “east”
[2,4] “exit”
[1,3] “exit”
… …
Solving a system of linear equations
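One way to read "solving a system of linear equations": for a fixed policy there is one linear equation per state, V^π = r^π + γ·P^π·V^π, which a linear solver handles directly. A rough NumPy sketch with made-up matrices (the 3-state data here is illustrative, not the gridworld on the slides):

```python
import numpy as np

# Fixed-policy MDP in matrix form (invented 3-state example):
# P_pi[s, s2] = P(s2 | s, pi(s)),  r_pi[s] = expected one-step reward under pi.
gamma = 0.9
P_pi = np.array([[0.0, 0.8, 0.2],
                 [0.0, 0.1, 0.9],
                 [0.0, 0.0, 1.0]])   # last state acts as an absorbing exit
r_pi = np.array([-0.3, -0.3, 0.0])

# Bellman equations for a fixed policy:  V = r_pi + gamma * P_pi @ V
# Rearranged into a plain linear system: (I - gamma * P_pi) V = r_pi
V_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(V_pi)
```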
(Iterative) Policy Evaluation
Iterative approach: repeat the update
V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ]
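The same values can be approximated by iterating that update. A minimal sketch, reusing the dictionary-of-outcomes format from the earlier snippet (the toy MDP, policy, and iteration count are assumptions for illustration):

```python
# Iterative policy evaluation:
#   V_t(s) <- sum_{s'} P(s'|s, pi(s)) * [ R(s, pi(s), s') + gamma * V_{t-1}(s') ]
def evaluate_policy(mdp, policy, states, gamma=0.9, iters=100):
    V = {s: 0.0 for s in states}                      # V_0(s) = 0
    for _ in range(iters):
        V = {s: sum(p * (r + gamma * V.get(s2, 0.0))  # terminal states default to 0
                    for p, s2, r in mdp[(s, policy[s])])
             for s in states}
    return V

# Illustrative usage:
toy_mdp = {
    ("A", "east"): [(0.8, "B", -0.3), (0.2, "A", -0.3)],
    ("B", "exit"): [(1.0, "done", 1.0)],
}
print(evaluate_policy(toy_mdp, {"A": "east", "B": "exit"}, ["A", "B"]))
```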
Example: Policy Evaluation
Noise = 0.2, Discount = 0.9, Living reward = −0.3
π1 = "Always Go Right", π2 = "Always Go Forward":

π1: State  Action        π2: State  Action
    (2,1)  "east"            (2,1)  "north"
    (2,2)  "east"            (2,2)  "north"
    (2,3)  "east"            (2,3)  "north"
    (2,4)  "exit"            (2,4)  "exit"
    (1,3)  "exit"            (1,3)  "exit"
    …      …                 …      …
Iterating policy evaluation (Noise = 0.2, Discount = 0.9, Living reward = −0.3), shown at t = 0, t = 1, …:
V^π_t(s) ← Q^π_{t−1}(s, π(s)), i.e. each V^π_{t−1}(s) is replaced by V^π_t(s)
Exit values: V^{π1}(2,4) = 100, V^{π1}(1,1) = −10
Example update: V^π_7(2,3) = (0.8)×(−0.3 + 0.9·V^π_6(3,3)) + …
Example: Policy Evaluation (Noise = 0.2, Discount = 0.9, Living reward = −0.3)
Comparing the converged values: V^{π1}(s) ≤ V^{π2}(s)
Summary: Policy Evaluation
How do we calculate V^π(s) for a fixed policy π?
Idea 1: Turn the recursive Bellman equations into updates:
V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ]
Efficiency: O(S²) per iteration
Idea 2: Without a max over actions, these Bellman equations are just a linear system, which can also be solved directly (see "Solving a system of linear equations" above).
Value Iteration
Policy Evaluation
Given the current state, how good is it under a policy π?
state s = (2,2)
Value Iteration
Given the current state, how good is it under an optimal policy?
state s = (2,2)
Value Iteration
Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
At a state s there is one q-value Q*(s, a_i) per available action (here "north", "south", "east", "west"):
V*(s) = max_a Q*(s, a)
π*(s) = argmax_a Q*(s, a)
Optimal Policy
Do what π says to do: from each state s, take action π(s), reaching q-state (s, π(s)) and transitions (s, π(s), s').
Do the optimal action: from each state s, take the best action a, reaching q-state (s, a) and transitions (s, a, s').
Expectimax trees max over all actions to compute the optimal values
Optimal Quantities
V*(s): the expected utility starting in s and acting optimally
Q*(s, a): the expected utility after taking action a from state s and thereafter acting optimally
π*(s): the optimal action from state s
Values of States
Recursive definition of value:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
Bellman Equations
Recursive definition of value:
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
Bellman Equation: a necessary condition for optimality in optimization problems formulated as dynamic programming.
Richard E. Bellman (1920–1984)
Dynamic Programming: a process that simplifies an optimization problem by breaking it down into an optimal substructure.
Time-Limited Values
Key idea: time-limited values
Define V_t(s) as the optimal value of s if the game ends in t more time steps.
Example: Value Iteration
Assume no discount!
V*_t(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + V*_{t−1}(s') ]
t = 0: values 0, 0, 0
t = 1: first state: S: 1, F: 0.5×2 + 0.5×2 = 2; second state: S: 0.5×1 + 0.5×1 = 1, F: −10 → values 2, 1, 0
t = 2: first state: S: 1 + 2 = 3, F: 0.5×(2+2) + 0.5×(2+1) = 3.5 → values 3.5, 2.5, 0
Policy Evaluation
Iterative approach: repeat the update
V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ]
Value Iteration
Evaluates an optimal policy:
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
1. Initialize V*_0(s) = 0 for all states s.
2. Repeat until convergence (while the values still change), for each state s:
   V*_t(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ] = max_a Q*_{t−1}(s, a)
Policy Evaluation
Evaluates a specific policy π:
   V^π_t(s) ← Σ_{s'} P(s'|s, π(s)) [ R(s, π(s), s') + γ·V^π_{t−1}(s') ] = Q^π_{t−1}(s, π(s))
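Here is a hedged sketch of that value iteration loop in Python, reusing the same illustrative dictionary format as the earlier snippets (the toy MDP, action function, and stopping threshold are all assumptions, not the slides' gridworld):

```python
# Value iteration:
#   V_t(s) <- max_a sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_{t-1}(s') ]
def value_iteration(mdp, states, actions, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in states}                        # V_0(s) = 0 for all states
    while True:
        new_V = {}
        for s in states:
            q_values = [sum(p * (r + gamma * V.get(s2, 0.0))
                            for p, s2, r in mdp[(s, a)])
                        for a in actions(s)]
            new_V[s] = max(q_values)                     # V_t(s) = max_a Q_{t-1}(s, a)
        if max(abs(new_V[s] - V[s]) for s in states) < eps:
            return new_V                                 # values (approximately) converged
        V = new_V

# Illustrative usage with a tiny made-up MDP:
toy_mdp = {
    ("A", "east"): [(0.8, "B", -0.3), (0.2, "A", -0.3)],
    ("A", "exit"): [(1.0, "done", 1.0)],
    ("B", "exit"): [(1.0, "done", 10.0)],
}
actions_of = lambda s: ["east", "exit"] if s == "A" else ["exit"]
print(value_iteration(toy_mdp, ["A", "B"], actions_of))
```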
Value iteration on the gridworld (Noise = 0.2, Discount = 0.9, Living reward = 0), with value snapshots at t = 0, 1, 2, 3, 4, 5, 6, 7, and 100. Each step applies, for every state,
V*_t(s) ← max_{a ∈ Actions(s)} Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ]
i.e. each new value V*_t(s) is the best q-value max_a Q*_{t−1}(s, a) computed from the previous values V*_{t−1}(s').
Policy Extraction
Computing Actions from Values
Let’s imagine we have the optimal values V*(s)
π*(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
This is called policy extraction, since it gets the policy implied by the values
Computing Actions from Q-Values
Let’s imagine we have the optimal q-values:
π*(s) = argmax_a Q*(s, a)
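A small sketch of both extraction rules, following the earlier illustrative data format (extracting from V* needs the transition model, extracting from Q* is a plain argmax; all names here are assumptions):

```python
# Policy extraction from optimal state values V*(s) (needs the model P and R):
#   pi*(s) = argmax_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V*(s') ]
def policy_from_values(mdp, V, states, actions, gamma=0.9):
    def q(s, a):
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in mdp[(s, a)])
    return {s: max(actions(s), key=lambda a: q(s, a)) for s in states}

# Policy extraction from optimal q-values Q*(s, a) (no model needed):
#   pi*(s) = argmax_a Q*(s, a)
def policy_from_qvalues(Q, states, actions):
    return {s: max(actions(s), key=lambda a: Q[(s, a)]) for s in states}
```

Selecting actions from q-values needs only the argmax, which is one reason q-values are often the more convenient quantity to keep around.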
The Bellman Equations
Definition of "optimal utility" via an expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values:
V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*(s') ]
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
Value Iteration
Bellman equations characterize the optimal values V*(s); value iteration computes them by repeating the update:
V*_t(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ]
Policy Iteration
Problems with Value Iteration
Value iteration repeats the Bellman updates:
V*_t(s) ← max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V*_{t−1}(s') ]
Each sweep touches every state, action, and successor, and the policy implied by the values often converges long before the values themselves do.
Gridworld value snapshots at t = 1 through t = 12 and at t = 100 (Noise = 0.2, Discount = 0.9, Living reward = 0).
Policy Iteration
Alternative approach for optimal values:
Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence.
Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values.
Repeat steps until the policy converges.
1. Initialization: start from some initial policy π_0.
2. Evaluation: for the fixed current policy π_i, find values with policy evaluation; iterate until the values converge:
   V^{π_i}_t(s) ← Σ_{s'} P(s'|s, π_i(s)) [ R(s, π_i(s), s') + γ·V^{π_i}_{t−1}(s') ]
3. Improvement: for fixed values, get a better policy using policy extraction (one-step look-ahead):
   π_{i+1}(s) = argmax_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ·V^{π_i}(s') ]
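A compact sketch of this evaluate/improve loop, again in the illustrative dictionary format used above (the initial policy choice, evaluation iteration count, and any data fed to it are assumptions, not a prescribed implementation):

```python
# Policy iteration: alternate policy evaluation and policy improvement
# until the policy stops changing.
def policy_iteration(mdp, states, actions, gamma=0.9, eval_iters=100):
    def q(s, a, V):
        return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in mdp[(s, a)])

    policy = {s: actions(s)[0] for s in states}          # arbitrary initial policy
    while True:
        # Evaluation: iterate the fixed-policy Bellman update (approximate convergence).
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: q(s, policy[s], V) for s in states}
        # Improvement: one-step look-ahead (policy extraction) on the evaluated values.
        new_policy = {s: max(actions(s), key=lambda a: q(s, a, V)) for s in states}
        if new_policy == policy:                         # policy converged, so we're done
            return policy, V
        policy = new_policy
```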
Policy Improvement – Policy Evaluation (Noise = 0.2, Discount = 0.9, Living reward = 0)
In value iteration:
Every iteration updates both the values and (implicitly) the policy
We don’t track the policy, but taking the max over actions implicitly recomputes it
In policy iteration:
We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
After the policy is evaluated, a new policy is chosen (slow like a value iteration pass)
The new policy will be better (or we’re done)