Logistics
PS 2 due Thursday 10/18
PS 3 due Thursday 10/25
CSE 473 Markov Decision Processes
Dan Weld
Many slides from Chris Bishop, Mausam, Dan Klein,
Stuart Russell, Andrew Moore & Luke Zettlemoyer
MDPs

Markov Decision Processes
• Planning Under Uncertainty
• Mathematical Framework
• Bellman Equations
• Value Iteration
• Real‐Time Dynamic Programming
• Policy Iteration
• Reinforcement Learning
Andrey Markov (1856‐1922)

Planning Agent

Static vs. Dynamic
Fully vs. Partially Observable
Deterministic vs. Stochastic
Perfect vs. Noisy Percepts
Instantaneous vs. Durative Actions
Environment: What action next?
Objective of an MDP

Find a policy π : S → A which optimizes
• minimizes discounted expected cost to reach a goal, or
• maximizes undiscounted expected reward, or
• maximizes expected (reward − cost)
given a ____ horizon
• finite
• infinite
• indefinite

Review: Expectimax

What if we don’t know what the result of an action will be? E.g.,
• In solitaire, the next card is unknown
• In pacman, the ghosts act randomly

Can do expectimax search
• Max nodes as in minimax search
• Chance nodes, like min nodes, except the outcome is uncertain: take the average (expectation) of the children
• Calculate expected utilities
[Figure: expectimax tree with a max node over chance nodes and leaf utilities 10, 4, 5, 7]

Today, we formalize this as a Markov Decision Process
• Handle intermediate rewards & infinite plans
• More efficient processing
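As a minimal sketch of the expectimax idea (not code from the course), here is the tiny tree pictured on the slide, with its leaf utilities 10, 4, 5, 7; the uniform chance probabilities and the tree encoding are assumptions for illustration:

```python
# Expectimax on a toy tree: max nodes take the best child,
# chance nodes average their children (uniform probabilities assumed).

def expectimax(node):
    kind, children = node
    if kind == "leaf":
        return children                       # `children` holds the utility
    values = [expectimax(c) for c in children]
    if kind == "max":
        return max(values)
    return sum(values) / len(values)          # chance: expectation of children

tree = ("max", [
    ("chance", [("leaf", 10), ("leaf", 4)]),  # expected value 7.0
    ("chance", [("leaf", 5), ("leaf", 7)]),   # expected value 6.0
])
print(expectimax(tree))                       # -> 7.0
```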
Grid World

• Walls block the agent’s path
• The agent’s actions may go astray:
  • 80% of the time, the North action takes the agent North (assuming no wall)
  • 10% of the time, it actually goes West
  • 10% of the time, it actually goes East
  • If there is a wall in the chosen direction, the agent stays put
• Small “living” reward each step
• Big rewards come at the end
• Goal: maximize sum of rewards

Markov Decision Processes

An MDP is defined by:
• A set of states s ∈ S
• A set of actions a ∈ A
• A transition function T(s, a, s’)
  • Prob that a from s leads to s’, i.e., P(s’ | s, a)
  • Also called “the model”
• A reward function R(s, a, s’)
  • Sometimes just R(s) or R(s’)
• A start state (or distribution)
• Maybe a terminal state

MDPs: non‐deterministic search problems
Reinforcement learning: MDPs where we don’t know the transition or reward functions
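To make the definition concrete, here is one possible encoding (a sketch, not the course’s code) of the grid world’s noisy transition function; the 80/10/10 slip model follows the slide, while the grid size and wall set are illustrative parameters:

```python
# T(s, a, s') for the slide's grid world: the intended move succeeds 80%
# of the time and slips to each perpendicular direction 10% of the time.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERP = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def transition(s, a, walls, width, height):
    """Return {s': P(s' | s, a)}; bumping a wall leaves the agent put."""
    def move(state, direction):
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        in_bounds = 0 <= nxt[0] < width and 0 <= nxt[1] < height
        return nxt if in_bounds and nxt not in walls else state

    dist = {}
    left, right = PERP[a]
    for d, p in [(a, 0.8), (left, 0.1), (right, 0.1)]:
        s2 = move(s, d)
        dist[s2] = dist.get(s2, 0.0) + p      # merge outcomes that coincide
    return dist

print(transition((0, 0), "N", walls=set(), width=4, height=3))
# -> {(0, 1): 0.8, (0, 0): 0.1, (1, 0): 0.1}
```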
What is Markov about MDPs?

Andrey Markov (1856‐1922)

“Markov” generally means that
• conditioned on the present state,
• the future is independent of the past

For Markov decision processes, “Markov” means:
P(S_{t+1} = s’ | S_t = s_t, A_t = a_t, S_{t−1} = s_{t−1}, …, S_0 = s_0) = P(S_{t+1} = s’ | S_t = s_t, A_t = a_t)

Solving MDPs

In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from start to a goal.

In an MDP, we want an optimal policy π* : S → A
• A policy π gives an action for each state
• An optimal policy maximizes expected utility if followed
• Defines a reflex agent

[Figure: the optimal grid-world policy when R(s, a, s’) = −0.03 for all non-terminal states s]
Example Optimal Policies

[Figure: four grid worlds showing how the optimal policy changes with the living reward R(s)]
Example: High‐Low

Three card types: 2, 3, 4
• Infinite deck, twice as many 2’s
• Start with 3 showing
• After each card, you say “high” or “low”
• A new card is flipped
• If you’re right, you win the points shown on the new card
• Ties are no‐ops (no reward)
• If you’re wrong, the game ends

Differences from expectimax problems:
• #1: you get rewards as you go
• #2: you might play forever!

High‐Low as an MDP

States:
• 2, 3, 4, done
Actions:
• High, Low
Model: T(s, a, s’):
• P(s’=4 | 4, Low) = 1/4
• P(s’=3 | 4, Low) = 1/4
• P(s’=2 | 4, Low) = 1/2
• P(s’=done | 4, Low) = 0
• P(s’=4 | 4, High) = 1/4
• P(s’=3 | 4, High) = 0
• P(s’=2 | 4, High) = 0
• P(s’=done | 4, High) = 3/4
• …
Rewards: R(s, a, s’):
• The number shown on s’ if the guess was right (s’ > s for a = “high”, s’ < s for a = “low”), …
• 0 otherwise
Start: 3
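The model rows above translate directly into a transition table. A sketch in Python, where T[(s, a)] holds the distribution over s’ (only the rows spelled out on the slide are filled in):

```python
# High-Low transition model: deck is 1/2 twos, 1/4 threes, 1/4 fours.
# Ties are no-ops (stay in the same state); a wrong guess goes to "done".
T = {
    (4, "Low"):  {4: 1/4, 3: 1/4, 2: 1/2, "done": 0.0},
    (4, "High"): {4: 1/4, 3: 0.0, 2: 0.0, "done": 3/4},
    # ... remaining (state, action) rows omitted, as on the slide
}

# Sanity check: each row is a probability distribution.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in T.values())
```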
Search Tree: High‐Low

[Figure: search tree from the start state 3: a choice between Low and High, then chance nodes over outcomes with transition probabilities and rewards (T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0), each outcome followed by another High/Low choice]

MDP Search Trees

Each MDP state gives an expectimax-like search tree:
• s is a state
• (s, a) is a q-state
• (s, a, s’) is called a transition
• T(s, a, s’) = P(s’ | s, a)
• R(s, a, s’)
Utilities of Sequences

In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards.

Typically we consider stationary preferences:
[r, r_0, r_1, r_2, …] ≻ [r, r’_0, r’_1, r’_2, …] ⟺ [r_0, r_1, r_2, …] ≻ [r’_0, r’_1, r’_2, …]

Theorem: there are only two ways to define stationary utilities
• Additive utility: U([r_0, r_1, r_2, …]) = r_0 + r_1 + r_2 + …
• Discounted utility: U([r_0, r_1, r_2, …]) = r_0 + γ r_1 + γ² r_2 + …

Infinite Utilities?!

Problem: infinite state sequences have infinite rewards

Solutions:
• Finite horizon:
  • Terminate episodes after a fixed T steps (e.g. life)
  • Gives nonstationary policies (π depends on the time left)
• Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like “done” for High‐Low)
• Discounting: for 0 < γ < 1,
  U([r_0, …, r_∞]) = Σ_{t≥0} γ^t r_t ≤ R_max / (1 − γ)
  • Smaller γ means a smaller “horizon”: shorter-term focus
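A quick numerical check of the discounting bound (illustrative, not from the lecture): with rewards capped at R_max = 1 and γ = 0.9, the discounted return stays near R_max/(1 − γ) = 10 no matter how long the sequence runs:

```python
# Discounted utility of a reward sequence: U = r0 + g*r1 + g^2*r2 + ...
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0] * 1000                       # a long run of R_max = 1 rewards
print(discounted_return(rewards, 0.9))       # ~10.0 = R_max / (1 - gamma)
```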
Discounting

Typically we discount rewards by γ < 1 each time step
• Sooner rewards have higher utility than later rewards
• Discounting also helps the algorithms converge

Recap: Defining MDPs

Markov decision processes:
• States S
• Start state s_0
• Actions A
• Transitions P(s’ | s, a), aka T(s, a, s’)
• Rewards R(s, a, s’) (and discount γ)

MDP quantities so far:
• Policy π = function that chooses an action for each state
• Utility (aka “return”) = sum of discounted rewards
Optimal Utilities

Define the value of a state s:
• V*(s) = expected utility starting in s and acting optimally

Define the value of a q-state (s, a):
• Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally

Define the optimal policy:
• π*(s) = optimal action from state s

Why Not Search Trees?

Why not solve with expectimax?

Problems:
• This tree is usually infinite (why?)
• The same states appear over and over (why?)
• We would search once per state (why?)

Idea: Value iteration
• Compute optimal values for all states all at once using successive approximations
• Will be a bottom-up dynamic program, similar in cost to memoization
• Do all planning offline; no replanning needed!
The Bellman Equations

Definition of “optimal utility” leads to a simple one-step look-ahead relationship between optimal utility values:

V*(s) = max_a Q*(s, a)
Q*(s, a) = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

Richard Bellman (1920‐1984)

Bellman Equations for MDPs

• Q_{n+1}(s, a): the value/cost of the strategy: execute action a in s, then execute π_n subsequently
• π_n = argmax_{a ∈ Ap(s)} Q_n(s, a)
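Transcribed into code, these equations are a one-step backup. The sketch below assumes T is a dict mapping (s, a) to a {s’: prob} table and R is a function of (s, a, s’); these interfaces are illustrative, not the course’s:

```python
# One-step Bellman backup, directly from the equations above.
def q_value(s, a, V, T, R, gamma):
    """Q(s,a) = sum_s' T(s,a,s') [ R(s,a,s') + gamma V(s') ]"""
    return sum(p * (R(s, a, s2) + gamma * V[s2])
               for s2, p in T[(s, a)].items())

def bellman_backup(s, V, T, R, gamma, actions):
    """V(s) <- max_a Q(s,a), the one-step look-ahead update."""
    return max(q_value(s, a, V, T, R, gamma) for a in actions(s))
```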
Value Iteration [Bellman ’57]

• Assign an arbitrary assignment of V_0 to each state.
• Repeat
  • For all states s: compute V_{n+1}(s) by a Bellman backup at s.
• Until max_s |V_{n+1}(s) − V_n(s)| < ε
  • |V_{n+1}(s) − V_n(s)| is the residual at s; stopping this way is called ε-convergence

Value Iteration

Idea:
• Start with V_0*(s) = 0, which we know is right (why?)
• Given V_i*, calculate the values of all states at depth i+1:
  V_{i+1}(s) ← max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]
• This is called a value update or Bellman update
• Repeat until convergence

Theorem: value iteration will converge to the unique optimal values
• Basic idea: approximations get refined towards the optimal values
• The policy may converge long before the values do
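Putting the backup into the loop above gives the full algorithm. This is a sketch under the same assumed (states, actions, T, R) encoding as the previous snippet, with the ε-residual stopping test from the pseudocode:

```python
# Value iteration: sweep Bellman backups over all states until the
# largest residual max_s |V_{n+1}(s) - V_n(s)| drops below epsilon.
def value_iteration(states, actions, T, R, gamma=0.9, epsilon=1e-6):
    V = {s: 0.0 for s in states}             # V_0 = 0 everywhere
    while True:
        newV, residual = {}, 0.0
        for s in states:
            acts = actions(s)
            if not acts:                     # terminal state: no backup
                newV[s] = 0.0
                continue
            newV[s] = max(sum(p * (R(s, a, s2) + gamma * V[s2])
                              for s2, p in T[(s, a)].items())
                          for a in acts)
            residual = max(residual, abs(newV[s] - V[s]))
        V = newV
        if residual < epsilon:               # epsilon-convergence
            return V
```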
Calculate estimates V_k*(s):
• The optimal value considering only the next k time steps (k rewards)
• As k → ∞, V_k approaches the optimal value

Why:
• If discounting, distant rewards become negligible
• If terminal states are reachable from everywhere, the fraction of episodes that never end becomes negligible
• Otherwise, we can get infinite expected utility, and then this approach actually won’t work
Example: Value Iteration

[Figure: grid-world value estimates V_1 and V_2]

Information propagates outward from terminal states, and eventually all states have correct value estimates.
Practice: Computing Actions

Which action should we choose from state s?
• Given optimal q-values Q*?
  π*(s) = argmax_a Q*(s, a)
• Given optimal values V*?
  π*(s) = argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
• Lesson: actions are easier to select from Q’s!

Comments

• Decision-theoretic algorithm
• Dynamic programming
• Fixed-point computation
• Probabilistic version of the Bellman-Ford algorithm for shortest-path computation
  • MDP1: Stochastic Shortest Path Problem

Time complexity:
• One iteration: O(|S|²|A|)
• Number of iterations: poly(|S|, |A|, 1/(1−γ))

Space complexity: O(|S|)

Factored MDPs = planning under uncertainty
• exponential space, exponential time
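Both extraction rules as sketches, under the same assumed (T, R, actions) interfaces as before; note how much shorter the Q-based version is:

```python
def policy_from_q(s, Q, actions):
    """pi*(s) = argmax_a Q*(s, a): a single argmax, no model needed."""
    return max(actions(s), key=lambda a: Q[(s, a)])

def policy_from_v(s, V, T, R, gamma, actions):
    """From V* we must redo a one-step look-ahead over the model."""
    return max(actions(s),
               key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                 for s2, p in T[(s, a)].items()))
```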
Convergence Properties

• V_n → V* in the limit as n → ∞
• ε-convergence: the V_n function is within ε of V*
• Optimality: the current greedy policy is within 2εγ/(1−γ) of optimal
• Monotonicity:
  • V_0 ≤_p V* ⇒ V_n ≤_p V* (V_n monotonic from below)
  • V_0 ≥_p V* ⇒ V_n ≥_p V* (V_n monotonic from above)
  • otherwise V_n is non-monotonic

Convergence

Define the max-norm: ‖U‖ = max_s |U(s)|

Theorem: for any two approximations U^t and V^t,
‖U^{t+1} − V^{t+1}‖ ≤ γ ‖U^t − V^t‖
• I.e., any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true V* (aka U), and value iteration converges to a unique, stable, optimal solution

Theorem:
‖V^{t+1} − V^t‖ < ε ⇒ ‖V^{t+1} − V*‖ < 2εγ/(1−γ)
• I.e., once the change in our approximation is small, it must also be close to correct
Value Iteration Complexity

Problem size:
• |A| actions and |S| states

Each iteration:
• Computation: O(|A|⋅|S|²)
• Space: O(|S|)

Number of iterations:
• Can be exponential in the discount factor γ

Markov Decision Processes
• Planning Under Uncertainty
• Mathematical Framework
• Bellman Equations
• Value Iteration
• Real‐Time Dynamic Programming
• Policy Iteration
• Reinforcement Learning
Andrey Markov (1856‐1922)
Real‐Time Dynamic Programming (RTDP)

• Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on the visited states
• RTDP: repeat trials until the value function converges

RTDP Trial

[Figure: a trial from s_0: compute Q_{n+1}(s_0, a) for each action a_1, a_2, a_3 from the current V_n, take the greedy action (here a_greedy = a_2), back up V_{n+1}(s_0), sample a successor state, and continue until the goal is reached]

Comments

• Properties
  • If all states are visited infinitely often, then V_n → V*
• Advantages
  • Anytime: more probable states are explored quickly
• Disadvantages
  • Complete convergence can be slow!
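One trial might look like the following sketch (same assumed interfaces as before; the slide’s figure minimizes costs, whereas this version maximizes rewards to stay consistent with the rest of the lecture):

```python
# One RTDP trial: follow the greedy policy from s0, doing a Bellman
# backup at each visited state, until a goal is reached.
import random

def rtdp_trial(s0, V, T, R, gamma, actions, goals, max_steps=1000):
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        qs = {a: sum(p * (R(s, a, s2) + gamma * V[s2])
                     for s2, p in T[(s, a)].items())
              for a in actions(s)}
        a = max(qs, key=qs.get)              # greedy action under current V
        V[s] = qs[a]                         # Bellman backup on visited state
        s2s, ps = zip(*T[(s, a)].items())    # sample next state from T(s,a,.)
        s = random.choices(s2s, weights=ps)[0]
    return V
```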
Changing the Search Space

• Value Iteration
  • Search in value space
  • Compute the resulting policy
• Policy Iteration
  • Search in policy space
  • Compute the resulting value

Utilities for Fixed Policies

Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy.

Define the utility of a state s under a fixed policy π:
• V^π(s) = expected total discounted rewards (return) starting in s and following π

Recursive relation (one-step look-ahead / Bellman equation):
V^π(s) = Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]
Policy Evaluation

How do we calculate the V’s for a fixed policy?

Idea one: modify the Bellman updates
V^π_{i+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_i(s’) ]

Idea two: it’s just a linear system, solve with Matlab (or whatever)

Policy Iteration

Problem with value iteration:
• Considering all actions on each iteration is slow: an iteration takes |A| times longer than policy evaluation
• But the policy doesn’t change on each iteration: time wasted

Alternative to value iteration:
• Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
• Step 2: Policy improvement: update the policy using one-step lookahead with the resulting converged (but not optimal!) utilities (slow but infrequent)
• Repeat the steps until the policy converges
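Both evaluation ideas as sketches, with numpy standing in for Matlab; the (T, R) interface is the same assumed encoding used earlier:

```python
# Idea one: fixed-policy Bellman sweeps.
# Idea two: solve the linear system (I - gamma * T_pi) V = R_pi directly.
import numpy as np

def evaluate_iteratively(pi, states, T, R, gamma, sweeps=100):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        V = {s: sum(p * (R(s, pi[s], s2) + gamma * V[s2])
                    for s2, p in T[(s, pi[s])].items())
             for s in states}
    return V

def evaluate_linearly(pi, states, T, R, gamma):
    idx = {s: i for i, s in enumerate(states)}
    A = np.eye(len(states))                  # builds I - gamma * T_pi
    b = np.zeros(len(states))                # expected one-step reward R_pi
    for s in states:
        for s2, p in T[(s, pi[s])].items():
            A[idx[s], idx[s2]] -= gamma * p
            b[idx[s]] += p * R(s, pi[s], s2)
    V = np.linalg.solve(A, b)
    return {s: V[idx[s]] for s in states}
```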
Policy Iteration

Policy evaluation: with the current policy π fixed, find the values using simplified Bellman updates:
• Iterate until the values converge:
  V^π_{i+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π_i(s’) ]

Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead:
π_{n+1}(s) = argmax_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V^{π_n}(s’) ]

Policy Iteration [Howard ’60]

• Assign an arbitrary policy π_0 to each state.
• Repeat
  • Policy evaluation: compute V_{n+1}, the evaluation of π_n (costly: O(|S|³))
  • Policy improvement: for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s, a)
• Until π_{n+1} = π_n

Modified Policy Iteration approximates the evaluation step by value iteration using the fixed policy.
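A compact sketch of Howard’s loop above, reusing the evaluate_linearly() sketch from the previous block for the costly evaluation step (same assumed interfaces):

```python
# Policy iteration: evaluate the current policy exactly, improve
# greedily, and stop when the policy is stable (pi_{n+1} = pi_n).
def policy_iteration(states, actions, T, R, gamma):
    pi = {s: actions(s)[0] for s in states}  # arbitrary initial policy
    while True:
        V = evaluate_linearly(pi, states, T, R, gamma)   # evaluation
        stable = True
        for s in states:                                 # improvement
            best = max(actions(s),
                       key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                         for s2, p in T[(s, a)].items()))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V
```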
Modified Policy Iteration

• Assign an arbitrary policy π_0 to each state.
• Repeat
  • Policy evaluation: compute V_{n+1}, the approximate evaluation of π_n
  • Policy improvement: for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s, a)
• Until π_{n+1} = π_n

Advantage:
• Probably the most competitive synchronous dynamic programming algorithm

Policy Iteration Complexity

Problem size:
• |A| actions and |S| states

Each iteration:
• Computation: O(|S|³ + |A|⋅|S|²)
• Space: O(|S|)

Number of iterations:
• Unknown, but can be faster in practice
• Convergence is guaranteed
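The only change from the previous sketch is the evaluation step, which now runs a small fixed number of Bellman sweeps (evaluate_iteratively() from the policy-evaluation sketch) instead of an exact solve; a hypothetical sketch, not the lecture’s reference implementation:

```python
# Modified policy iteration: approximate evaluation (k sweeps of the
# fixed-policy update), then the same greedy improvement step.
def modified_policy_iteration(states, actions, T, R, gamma, k=10):
    pi = {s: actions(s)[0] for s in states}
    while True:
        V = evaluate_iteratively(pi, states, T, R, gamma, sweeps=k)
        new_pi = {s: max(actions(s),
                         key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                           for s2, p in T[(s, a)].items()))
                  for s in states}
        if new_pi == pi:                     # pi_{n+1} = pi_n
            return pi, V
        pi = new_pi
```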
Comparison

In value iteration:
• Every pass (or “backup”) updates both the utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)

In policy iteration:
• Several passes update utilities with a frozen policy
• Occasional passes update the policy

Hybrid approaches (asynchronous policy iteration):
• Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Recap: MDPs

Markov decision processes:
• States S
• Actions A
• Transitions P(s’ | s, a) (or T(s, a, s’))
• Rewards R(s, a, s’) (and discount γ)
• Start state s_0

Quantities:
• Returns = sum of discounted rewards
• Values = expected future returns from a state (optimal, or for a fixed policy)
• Q-values = expected future returns from a q-state (optimal, or for a fixed policy)