Logistics: CSE 473 Markov Decision Processes

10/12/2012
Logistics
 PS 2 due Tuesday  Thursday 10/18
 PS 3 due Thursday 10/25
CSE 473 Markov Decision Processes
Dan Weld
Many slides from Chris Bishop, Mausam, Dan Klein,
Stuart Russell, Andrew Moore & Luke Zettlemoyer
MDPs Planning Agent
Static vs. Dynamic
Markov Decision Processes
• Planning Under Uncertainty
Environment
• Mathematical Framework Fully
vs.
• Bellman Equations Partially
Deterministic
ete st c
• Value Iteration Ob
Observable
bl vs.
What action Stochastic
• Real‐Time Dynamic Programming next?
Andrey Markov
• Policy Iteration (1856‐1922)
Perfect Instantaneous
vs. vs.
• Reinforcement Learning Noisy Durative
Percepts Actions
Objective of an MDP
Review: Expectimax
• Find a policy : 6 → $  What if we don’t know what the result of an action
will be? E.g.,
• which optimizes • In solitaire, next card is unknown
• In pacman, the ghosts act randomly max
• minimizes discounted expected cost to reach a
goal or  Can do expectimax search
 Max nodes as in minimax
Max nodes as in minimax search chance
• maximizes undiscount. expected reward  Chance nodes, like min nodes, except
• maximizes expected (reward‐cost) the outcome is uncertain ‐ take
average (expectation) of children
 Calculate expected utilities 10 4 5 7
• given a ____ horizon
• finite
 Today, we formalize as an Markov Decision Process
• infinite  Handle intermediate rewards & infinite plans
• indefinite  More efficient processing
1
10/12/2012
Grid World
 An MDP is defined by:
 Walls block the agent’s path
• A set of states s  S
 Agent’s actions may go astray: • A set of actions a  A
 80% of the time, North action • A transition function T(s,a,s’)
• Prob that a from s leads to s’
takes the agent North • i.e., P(s’ | s,a)
(assuming no wall) • Also called “the model”
 10% ‐ actually go West • A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
 10% ‐ actually go East • A start state (or distribution)
 If there is a wall in the chosen • Maybe a terminal state
direction, the agent stays put
• MDPs: non‐deterministic search
 Small “living” reward each step
Reinforcement learning: MDPs where we don’t
 Big rewards come at the end know the transition or reward functions
 Goal: maximize sum of rewards
What is Markov about MDPs?
Solving MDPs
 In deterministic single-agent search problems, want an optimal
 Andrey Markov (1856‐1922)
plan, or sequence of actions, from start to a goal
 “Markov” generally means that  In an MDP, we want an optimal policy *: S → A
• conditioned on the present state, • A policy  gives an action for each state
p
• the future is independent of the past
p • An optimal policy maximizes expected utility if followed
An optimal policy maximizes expected utility if followed
• Defines a reflex agent
 For Markov decision processes,
“Markov” means:
Optimal policy when
R(s, a, s’) = ‐0.03
for all non‐terminals s
Example Optimal Policies Example Optimal Policies
R(s) = ‐0.01 R(s) = ‐0.03 R(s) = ‐0.01 R(s) = ‐0.03
R(s) = ‐0.4 R(s) = ‐2.0 R(s) = ‐0.4 R(s) = ‐2.0
2
10/12/2012
Example Optimal Policies Example Optimal Policies
R(s) = ‐0.01 R(s) = ‐0.03 R(s) = ‐0.01 R(s) = ‐0.03
R(s) = ‐0.4 R(s) = ‐2.0 R(s) = ‐0.4 R(s) = ‐2.0
Example: High‐Low High‐Low as an MDP
 States:
• 2, 3, 4, done
 Three card types: 2, 3, 4  Actions:
• Infinite deck, twice as many 2’s • High, Low
 Start with 3 showing  Model: T(s, a, s’):
 After each card, you say “high” or “low” • P(s’=4 | 4, Low) = 1/4
 New card is flipped 3 •
•
P(s’=3 | 4, Low) = 1/4
P(s’=2
P(s 2 | 4, Low) /
| 4, Low) = 1/2
3
• If
If you’re right, you win the points shown on
’ i h i h i h
the new card • P(s’=done | 4, Low) = 0
• Ties are no‐ops (no reward)‐0 • P(s’=4 | 4, High) = 1/4
• If you’re wrong, game ends • P(s’=3 | 4, High) = 0
• P(s’=2 | 4, High) = 0
• P(s’=done | 4, High) = 3/4
• …
 Differences from expectimax problems:  Rewards: R(s, a, s’):
 #1: get rewards as you go • Number shown on s’ if s’<s  a=“high” …
 #2: you might play forever! • 0 otherwise
 Start: 3
Search Tree: High‐Low
MDP Search Trees
 Each MDP state gives an expectimax‐like search tree
Low High s is a
s
state
, High a
, Low
(s, a) is a
s, a
q-state
T= T= T = 0, T = (s,a,s’) called a
0.5, R 0.25, R R = 4 0.25, R s,a,s’ transition
=2 =3 =0 T(s,a,s’) = P(s’|s,a)
s’
R(s,a,s’)
High Low High Low High Low
3
10/12/2012
Infinite Utilities?!
Utilities of Sequences
 In order to formalize optimality of a policy, need to  Problem: infinite state sequences have infinite rewards
understand utilities of sequences of rewards
 Typically consider stationary preferences:  Solutions:
• Finite horizon:
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies ( depends on time left)
• Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached (like “done” for High‐Low)
 Theorem: only two ways to define stationary utilities • Discounting: for 0 <  < 1
 Additive utility:
 Discounted utility:
• Smaller  means smaller “horizon” – shorter term focus
Discounting Recap: Defining MDPs
 Markov decision processes:
• States S s
• Start state s0
a
 Typically discount • Actions A
s, a
• Transitions P(s’|s, a)
rewards by  < 1 each (, , )
aka T(s,a,s’) s,a,s’
s,a,s
time step • Rewards R(s,a,s’) (and discount ) s’
• Sooner rewards have
higher utility than  MDP quantities so far:
later rewards • Policy,  = Function that chooses an action for each state
• Also helps the • Utility (aka “return”) = sum of discounted rewards
algorithms converge
Optimal Utilities Why Not Search Trees?
 Define the value of a state s:  Why not solve with expectimax?
V*(s) = expected utility starting in s and acting optimally s
 Define the value of a q‐state (s,a):  Problems:
Q*(s,a) = expected utility starting in s, taking action a
a
• This tree is usually infinite (why?)
and thereafter acting optimally s, a • Same states appear over and over (why?)
 Define the optimal policy: • We would search once per state (why?)
We would search once per state (why?)
*(s) = optimal action from state s s,a,s’’
s’
 Idea: Value iteration
• Compute optimal values for all states all at
once using successive approximations
• Will be a bottom‐up dynamic program similar
in cost to memoization
• Do all planning offline, no replanning needed!
4
10/12/2012
The Bellman Equations Bellman Equations for MDPs
 Definition of “optimal utility” leads to a simple
one‐step look‐ahead relationship between Q*(a, s)
optimal utility values:
(1920‐1984)
s
a
s, a
s,a,s’
s’
Bellman Backup (MDP) Bellman Backup

Q1(s,a1) = 2 +  0
• Given an estimate of V* function (say Vn)
~2
• Backup Vn function at state s
• calculate a new estimate (Vn+1) : Q1(s,a2) = 5 +  0.9~
a1 s1 V0= 0
V1= 6.5 +  0.1~ 2
5 ~ 6.1
5 V s0 a2
Q1(s,a3) = 4.5 +  2
ax s2 V0= 1 ~ 6.5
V a3
• Qn+1(s,a) : value/cost of the strategy: s3 V = 2
• execute action a in s, execute n subsequently max 0
• n = argmaxa∈Ap(s)Qn(s,a)
Value iteration [Bellman’57] Value Iteration
• assign an arbitrary assignment of V0 to each state.  Idea:
• Start with V0*(s) = 0, which we know is right (why?)
• repeat • Given Vi*, calculate the values for all states for depth i+1:
• for all states s
• compute Vnn+11(s) by Bellman backup at s. Iteration n+1
• until maxs |Vn+1(s) – Vn(s)| < 
• This is called a value update or Bellman update
-convergence
Residual(s) • Repeat until convergence
 Theorem: will converge to unique optimal values  Theorem: will converge to unique optimal values
 Basic idea: approximations get refined towards optimal values  Basic idea: approximations get refined towards optimal values
 Policy may converge long before values do  Policy may converge long before values do
5
10/12/2012
Example: =0.9, living
Value Estimates Example: Bellman Updates reward=0, noise=0.2
 Calculate estimates Vk*(s) ? ? ?
• The optimal value considering only next k time steps
(k rewards)
• As k , Vk approaches the optimal value ? ?
 Why: ? ? ? ?
 If discounting, distant rewards become
negligible
 If terminal states reachable from
everywhere, fraction of episodes not
ending becomes negligible
 Otherwise, can get infinite expected
utility and then this approach actually
won’t work
Example: Value Iteration Example: Value Iteration
V1 V2
QuickTime™ and a
GIF decompressor
are needed to see this picture.
 Information propagates outward from terminal
states and eventually all states have correct value
estimates
Practice: Computing Actions Comments
• Decision‐theoretic Algorithm
 Which action should we chose from state s: • Dynamic Programming
• Fixed Point Computation
• Given optimal values Q? • Probabilistic version of Bellman‐Ford Algorithm
• for shortest path computation
• MDP1 : Stochastic Shortest Path Problem
 Time Complexity
• Given optimal values V?
• one iteration: O(|6|2|$ |)
• number of iterations: poly(|6|, |$ |, 1/1‐)
 Space Complexity: O(|6|)
 Factored MDPs = Planning under uncertainty
• Lesson: actions are easier to select from Q’s!
• exponential space, exponential time
6
10/12/2012
Convergence Properties Convergence
• Vn → V* in the limit as n→  Define the max‐norm:
• -convergence: Vn function is within  of V*
• Optimality: current policy is within 2 of optimal
 Theorem: For any two approximations Ut and Vt
• Monotonicity
• V0 ≤p V* ⇒ Vn ≤p V* (Vn monotonic from below)
• I.e. any distinct approximations must get closer to each other, so, in
• V0 ≥p V* ⇒ Vn ≥p V* (Vn monotonic from above) particular, any approximation must get closer to the true V* (aka U)
• otherwise Vn non‐monotonic and value iteration converges to a unique, stable, optimal solution
 Theorem:
• I.e. once the change in our approximation is small, it must also be
close to correct
Value Iteration Complexity MDPs
 Problem size: • Planning Under Uncertainty
• |A| actions and |S| states
• Mathematical Framework
 Each Iteration • Bellman Equations
• Computation: O(|A|⋅|S|2) • Value Iteration
• Space: O(|S|) • Real‐Time Dynamic Programming
Andrey Markov
• Policy Iteration (1856‐1922)
 Num of iterations
• Reinforcement Learning
• Can be exponential in the discount factor γ
Asynchronous Value Iteration Asynchonous Value Iteration

Prioritized Sweeping
 States may be backed up in any order  Why backup a state if values of successors same?
• Instead of systematically, iteration by iteration  Prefer backing a state
• whose successors had most change
 Theorem:
• As long as every state is backed up infinitely often…  Priority Queue of (state, expected change in value)
• Asynchronous value iteration converges to optimal  Backup in the order of priority
 After backing a state update priority queue
• for all predecessors
7
10/12/2012
Asynchonous Value Iteration Why?

Real Time Dynamic Programming
[Barto, Bradtke, Singh’95]
 Why is next slide saying min
• Trial: simulate greedy policy starting from start state;
perform Bellman backup on visited states
• RTDP:
• Repeat Trials until value function converges
RTDP Trial Comments
Vn
• Properties
Qn+1(s0,a)
• if all states are visited infinitely often then Vn → V*
agreedy = a2 Min Vn
?
a1
Vn Goal
• Advantages
a2
Vn+1(s0) s0 ? • Anytime: more probable states explored quickly
Vn
a3
?
Vn
• Disadvantages
Vn
• complete convergence can be slow!
Vn
Labeled RTDP [Bonet&Geffner ICAPS03]

MDPs
 Stochastic Shortest Path Problems Markov Decision Processes
• Policy w/ min expected cost to reach goal • Planning Under Uncertainty
 Initialize v0(s) with admissible heuristic
• Underestimates remaining cost • Mathematical Framework
 Theorem: • Bellman Equations
• if residual of Vk(s) <  and • Value Iteration
Vk(s’) <  for all succ(s), s’, in greedy graph • Real‐Time Dynamic Programming
Andrey Markov
• Then Vk is ‐consistent and will remain so • Policy Iteration (1856‐1922)
 Labeling algorithm detects convergence
• Reinforcement Learning
Goal
s0 ?
8
10/12/2012
Changing the Search Space Utilities for Fixed Policies
• Value Iteration
• Search in value space  Another basic operation: compute
the utility of a state s under a fix s
• Compute the resulting policy (general non‐optimal) policy
 Define the utility of a state s, under (s)
a fixed policy : s, (s)
• Policy Iteration
Policy Iteration V(s) = expected total discounted
rewards (return) starting in s and s, (s),s’
• Search in policy space following 
s’
• Compute the resulting value  Recursive relation (one‐step look‐
ahead / Bellman equation):
Policy Evaluation Policy Iteration
 How do we calculate the V’s for a fixed policy?  Problem with value iteration:
• Considering all actions each iteration is slow: takes |A| times
 Idea one: modify Bellman updates longer than policy evaluation
• But policy doesn’t change each iteration, time wasted
 Alternative to value iteration:
• Step 1: Policy evaluation: calculate utilities for a fixed policy (not
optimal utilities!) until convergence (fast)
 Idea two: it’s just a linear system, solve with Matlab • Step 2: Policy improvement: update policy using one‐step
(or whatever) lookahead with resulting converged (but not optimal!) utilities
(slow but infrequent)
• Repeat steps until policy converges
Policy Iteration Policy iteration [Howard’60]
• assign an arbitrary assignment of 0 to each state.
 Policy evaluation: with fixed current policy , find values with
simplified Bellman updates:
• repeat
• Iterate until values converge
• Policy Evaluation: compute Vn+1the evaluation of n costly: O(n3)
• Policy Improvement: for all states s
• compute
compute n+1(s):
(s): argmax
argmaxa Ap(s)Qn+1(s,a)
(s,a)
• until n+1  n
 Policy improvement: with fixed utilities, find the best action approximate
Modified by value iteration
according to one‐step look‐ahead Policy Iteration
Advantage using fixed policy
• searching in a finite (policy) space as opposed to

uncountably infinite (value) space ⇒ convergence faster.
• all other properties follow!
9
10/12/2012
Modified Policy iteration Policy Iteration Complexity
• assign an arbitrary assignment of 0 to each state.
 Problem size:
• repeat • |A| actions and |S| states
• Policy Evaluation: compute Vn+1 the approx. evaluation of n
• Policy Improvement: for all states s
• compute n+1(s): argmaxa Ap(s)Qn+1(s,a)
 Each Iteration
• until n+1  n • Computation: O(|S|3 + |A|⋅|S|2)
• Space: O(|S|)
Advantage
 Num of iterations
• probably the most competitive synchronous dynamic
programming algorithm.
• Unknown, but can be faster in practice
• Convergence is guaranteed
Comparison Recap: MDPs
 Markov decision processes:
• States S
 In value iteration: • Actions A s
• Every pass (or “backup”) updates both utilities (explicitly, based on current • Transitions P(s’|s,a) (or T(s,a,s’)) a
utilities) and policy (possibly implicitly, based on current policy)
• Rewards R(s,a,s’) (and discount ) s, a
• Start state s0
 In policy iteration:
In policy iteration: s,a,s’
s,a,s
• Several passes to update utilities with frozen policy  Quantities: s’
• Occasional passes to update policies • Returns = sum of discounted rewards
• Values = expected future returns from a state (optimal, or for a
 Hybrid approaches (asynchronous policy iteration): fixed policy)
• Any sequences of partial updates to either policy entries or utilities will • Q‐Values = expected future returns from a q‐state (optimal, or
converge if every state is visited infinitely often for a fixed policy)
10

Logistics: CSE 473 Markov Decision Processes

Uploaded by

Copyright:

Available Formats

You might also like

Logistics: CSE 473 Markov Decision Processes

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Logistics: CSE 473 Markov Decision Processes

Uploaded by

Copyright:

Available Formats

10/12/2012

R(s) = ‐0.01 R(s) = ‐0.03 R(s) = ‐0.01 R(s) = ‐0.03

R(s) = ‐0.4 R(s) = ‐2.0 R(s) = ‐0.4 R(s) = ‐2.0

R(s) = ‐0.01 R(s) = ‐0.03 R(s) = ‐0.01 R(s) = ‐0.03

R(s) = ‐0.4 R(s) = ‐2.0 R(s) = ‐0.4 R(s) = ‐2.0

Bellman Backup (MDP) Bellman Backup

Example: =0.9, living

Value Estimates Example: Bellman Updates reward=0, noise=0.2

Asynchronous Value Iteration Asynchonous Value Iteration

Asynchonous Value Iteration Why?

Labeled RTDP [Bonet&Geffner ICAPS03]

• searching in a finite (policy) space as opposed to

You might also like