Markov Decision Processes

Andrew Schaefer
EWO Seminar
October 26, 2006
What are Markov Decision Processes (MDPs)?

• MDPs are a method for formulating and solving stochastic, dynamic decision problems
• MDPs are very flexible, which is an advantage from a modeling perspective but a drawback from a solution viewpoint (general-purpose methods cannot exploit special structure)
• MDPs are pervasive; every Fortune 500 company uses them in some form or another
• They have been particularly successful in inventory management and reliability

Potential for MDPs in EWO

• The flexibility of MDPs may make them useful for certain EWO problems

Outline

• Basic Components of MDPs
• Solution Techniques
• Extensions

Outline – Basic Components

• Inventory Example
• MDP Vocabulary
• 5 Basic Components
  1. decision epochs
  2. states
  3. actions
  4. rewards
  5. transition probabilities
• value function
• decision rule
• policy

Inventory Example

Problem Statement
• Each month, the manager of a warehouse
  – observes the current inventory on hand of a single product
  – decides how much additional stock to order from a supplier
• Monthly demand
  – is uncertain
  – has a known probability distribution
• Tradeoff between the costs of
  – keeping inventory
  – lost sales
• Objective
  – maximize expected profit over the next year

Simplifying Assumptions
• Instantaneous delivery
• No backlogging, i.e. excess demand is lost
• Warehouse capacity of M units
• Product is sold in whole units only
• Revenues, costs and demand distribution do not change from month to month

Inventory Level

[Figure: inventory level st over time, bounded between 0 and the warehouse capacity M, shown for months t, t+1 and t+2.]

MDP Vocabulary

• 5 Basic Components
  1. decision epochs
  2. states
  3. actions
  4. rewards
  5. transition probabilities
• value function
• decision rule
• policy

1. Decision Epochs: t

• Points in time at which decisions are made
  – analogous to period start times in a "Markov process"
• Inventory example
  – first day of month 1, first day of month 2, …, first day of month 12
• In general: 1, 2, …, N
  – N: length of the time horizon (could be infinite)
• Also called
  – periods
  – stages

2. States: st

• Relevant information needed to describe the system
  – analogous to states in "Markov processes"
  – includes all information from the past relevant to the future
• Inventory example
  – amount of inventory on hand at the start of the month
  – st, t = 1, …, 12

3. Actions: at

• Means by which the decision maker interacts with the system
  – permissible actions can be state dependent
• No exact analogy in "Markov processes"
  – the decision is usually modeled outside the Markov process
• Inventory example
  – how much additional stock to order each month
  – denoted at, t = 1, …, 12
  – possible values: 0, 1, 2, …, M − st

Inventory Example

Timeline of month t:
• start of month t: count stock on hand, st
• order at units of new stock; the order arrives instantly, so stock on hand is now st + at
• hold inventory, incurring holding cost h(st + at)
• demand Dt occurs; fill demand, earning sales income f(min{Dt, st + at}) on the amount sold
• start of month t+1: stock on hand is now st+1 = st + at − min{Dt, st + at}

Ordering cost (fixed cost K plus variable cost c) and demand distribution:

O(a_t) = \begin{cases} K + c(a_t) & a_t > 0 \\ 0 & a_t = 0 \end{cases}
\qquad
\Pr(D_t = j) = p_j, \quad j = 0, 1, 2, \ldots

4. Rewards: rt(st, at)

• Inventory example
  – expected income − order cost − holding cost

r_t(s_t, a_t) = E[\text{income in month } t] - O(a_t) - h(s_t + a_t), \quad t = 1, 2, \ldots, 12

E[\text{income in } t]
  = E[\text{income in } t \mid D_t \le s_t + a_t] \Pr\{D_t \le s_t + a_t\}
    + E[\text{income in } t \mid D_t > s_t + a_t] \Pr\{D_t > s_t + a_t\}
  = \sum_{j=0}^{s_t + a_t} f(j)\, p_j \; + \; f(s_t + a_t) \sum_{j = s_t + a_t + 1}^{\infty} p_j

• in the last epoch, we assume that whatever is left over has some salvage value, g(·)

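A minimal code sketch of this expected-income computation, assuming the demand pmf is truncated to a finite list p (so p[j] = Pr(Dt = j)) and f is the sales-income function; both are illustrative placeholders, not values from the talk:

def expected_income(s, a, p, f):
    """E[income] when s units are on hand and a more are ordered.

    p: list with p[j] = Pr(D_t = j), assumed to sum to 1
    f: income from selling j units, e.g. f = lambda j: 8 * j
    """
    stock = s + a
    # Demand j <= stock: sell j units and earn f(j)
    income = sum(f(j) * p[j] for j in range(min(stock, len(p) - 1) + 1))
    # Demand j > stock: sell out, earning f(stock)
    income += f(stock) * sum(p[j] for j in range(stock + 1, len(p)))
    return income
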
5. Transition Probabilities: Pr{st+1 = j | st, at}

• Inventory example
  – we already established that st+1 = st + at − min{Dt, st + at}

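Spelled out, this state update induces the following transition probabilities (not stated explicitly on the slide, but it follows directly from the demand distribution):

\Pr\{s_{t+1} = j \mid s_t, a_t\} =
\begin{cases}
p_{\,s_t + a_t - j} & 1 \le j \le s_t + a_t \\
\sum_{k = s_t + a_t}^{\infty} p_k & j = 0 \\
0 & j > s_t + a_t
\end{cases}
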
Value Function

• Inventory example: vt(st) is the maximum total expected reward from epoch t onward. Each equation below adds the expected immediate reward to the expected remaining reward; v1(s1) is the maximum total expected reward starting in state s1 with 12 decision epochs.

v_{13}(s_{13}) = r_{13}(s_{13}, \cdot) = g(s_{13})

v_{12}(s_{12}) = \max_{a_{12} \in \{0,1,\ldots,M-s_{12}\}} \left\{ r_{12}(s_{12}, a_{12}) + \sum_{j=0}^{M} v_{13}(j) \Pr\{s_{13} = j \mid s_{12}, a_{12}\} \right\}

v_{11}(s_{11}) = \max_{a_{11} \in \{0,1,\ldots,M-s_{11}\}} \left\{ r_{11}(s_{11}, a_{11}) + \sum_{j=0}^{M} v_{12}(j) \Pr\{s_{12} = j \mid s_{11}, a_{11}\} \right\}

\vdots

v_{2}(s_{2}) = \max_{a_{2} \in \{0,1,\ldots,M-s_{2}\}} \left\{ r_{2}(s_{2}, a_{2}) + \sum_{j=0}^{M} v_{3}(j) \Pr\{s_{3} = j \mid s_{2}, a_{2}\} \right\}

v_{1}(s_{1}) = \max_{a_{1} \in \{0,1,\ldots,M-s_{1}\}} \left\{ r_{1}(s_{1}, a_{1}) + \sum_{j=0}^{M} v_{2}(j) \Pr\{s_{2} = j \mid s_{1}, a_{1}\} \right\}

Policy

• A collection of decision rules for all states
• Inventory example

Various factors to consider in choosing the right technique

Outline – Solution Techniques

• Finite-horizon MDPs
  – Backwards induction solution technique
• Infinite-horizon MDPs
  – Optimality criteria
  – Solution techniques
    • value iteration
    • policy iteration
    • linear programming

Finite horizon problems

• We want to find a policy, π, that maximizes:

v^{\pi}(s) = E_{\pi, s}\left\{ \sum_{t=1}^{N} r(s_t, a_t) \right\}

Backwards Induction – Shortest Path Problem

[Figure: a directed network with arc lengths, solved in four snapshots; backwards induction labels each node with its shortest remaining distance to the destination, starting from the destination node (label 0) and working back to the origin.]

Formalization of Backwards Induction

• Assume an N-period horizon with the following parameters:
  – r(s, a): immediate reward for choosing action a when in state s
  – p(j | s, a): probability that the system moves to state j at the next time period given that the current state is s and action a is chosen
  – g(s): salvage value of the system occupying state s at final period N
• Algorithm:
  1. Let vN(s) = g(s) for all s
  2. For t = N−1, …, 1, and for all s, compute:

     a.  v_t(s) = \max_{a \in A} \left\{ r(s, a) + \sum_{j \in S} p(j \mid s, a)\, v_{t+1}(j) \right\}

     b.  a_t(s) = \arg\max_{a \in A} \left\{ r(s, a) + \sum_{j \in S} p(j \mid s, a)\, v_{t+1}(j) \right\}

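A minimal Python sketch of this algorithm, instantiated on a toy version of the inventory example from earlier; every numeric parameter below (capacity, demand pmf, cost and income functions) is an illustrative assumption, not a value from the talk:

M, N = 3, 12                    # warehouse capacity, number of months
p = [0.25, 0.5, 0.25]           # demand pmf: Pr(D_t = j) = p[j]

def O(a): return 4 + 2 * a if a > 0 else 0   # ordering cost K + c(a)
def h(u): return 1 * u                       # holding cost
def f(j): return 8 * j                       # sales income
def g(s): return 2 * s                       # salvage value

def reward(s, a):
    """r(s, a) = expected income - order cost - holding cost."""
    stock = s + a
    income = sum(f(min(j, stock)) * p[j] for j in range(len(p)))
    return income - O(a) - h(stock)

def trans(j, s, a):
    """p(j | s, a) for next inventory level j, via s' = s + a - min(D, s + a)."""
    stock = s + a
    if j > stock:
        return 0.0
    if j == 0:
        return sum(p[k] for k in range(stock, len(p)))
    return p[stock - j] if 0 <= stock - j < len(p) else 0.0

v = [g(s) for s in range(M + 1)]             # terminal values: v_{N+1}(s) = g(s)
policy = [None] * N
for t in reversed(range(N)):                 # t = N-1, ..., 0 (backwards)
    q = [[reward(s, a) + sum(trans(j, s, a) * v[j] for j in range(M + 1))
          for a in range(M - s + 1)] for s in range(M + 1)]
    policy[t] = [max(range(M - s + 1), key=lambda a, s=s: q[s][a]) for s in range(M + 1)]
    v = [max(q[s]) for s in range(M + 1)]

print(v)          # optimal expected 12-month profit from each starting inventory
print(policy[0])  # optimal first-month order quantity for each starting inventory
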
Infinite Horizon Problems

• The infinite horizon case is the limiting value of the finite case as the time horizon tends toward infinity
• Recall the finite horizon value function:

v^{\pi}(s) = E_{\pi, s}\left\{ \sum_{t=1}^{N} r(s_t, a_t) \right\}

• This is the infinite horizon value function:

v^{\pi}(s) = \lim_{N \to \infty} E_{\pi, s}\left\{ \sum_{t=1}^{N} r(s_t, a_t) \right\}

• This is the infinite horizon value function with discounting:

v^{\pi}(s) = \lim_{N \to \infty} E_{\pi, s}\left\{ \sum_{t=1}^{N} \lambda^{t-1} r(s_t, a_t) \right\}

Important Factors for Infinite Horizon Models

• Discount factor
  – If the discount factor is < 1, then the total expected value converges to a finite solution
  – If there is no discounting, then the total expected value may explode
  – However, if the system has an absorbing state with 0 reward, then even undiscounted models may yield finite optimal values
• Objective criteria
  – Total expected value
  – Average expected value over the long run (more detailed analysis is required)
• When is a stationary optimal policy guaranteed to exist?
  – If the state space and action space are finite

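For the discounted criterion, here is a minimal value iteration sketch; the NumPy data layout (P as a list of |S|×|S| matrices, r as an |S|×|A| array) and the stopping rule are assumptions of this writeup, not from the talk:

import numpy as np

def value_iteration(P, r, lam=0.95, eps=1e-6):
    """P[a][s, j] = p(j | s, a); r[s, a] = reward; lam = discount factor < 1."""
    nS, nA = r.shape
    v = np.zeros(nS)
    while True:
        # Bellman update: Q[s, a] = r(s, a) + lam * sum_j p(j | s, a) v(j)
        Q = r + lam * np.stack([P[a] @ v for a in range(nA)], axis=1)
        v_new = Q.max(axis=1)
        # Classical stopping rule: guarantees the greedy policy is eps-optimal
        if np.max(np.abs(v_new - v)) < eps * (1 - lam) / (2 * lam):
            return v_new, Q.argmax(axis=1)
        v = v_new
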
Previous applications of MDPs

• At the heart of every inventory management paper is an MDP

Limitations of MDPs

• "Curse of dimensionality"
  – As the problem size increases, i.e. as the state and/or action space becomes larger, the MDP becomes computationally very difficult to solve
  – There are some methods that are more memory-efficient than the policy iteration and value iteration algorithms
  – There are also some solution techniques that find near-optimal solutions in a short time

Why do MDPs not Scale Well?

• Analogy in the deterministic world: DP vs. IP
  – Any integer program can be formulated as a dynamic program
  – However, DPs struggle with IPs that branch-and-bound solves in less than a second
• Why?
  – IP uses polyhedral theory to get strong bounds
  – DP techniques grow with the size of the problem
• Will this occur in the stochastic setting?

MDP: Approximate Solution Techniques

• Finding the optimal policy takes time polynomial in the size of the state space
  – via a linear programming formulation
• Sometimes there are too many states for this to be practical

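For reference, here is one standard LP formulation for the discounted case (the slides do not spell it out, so this is a sketch): one variable v(s) per state, whose unique optimal solution is the optimal value function.

\min \sum_{s \in S} v(s)
\quad \text{s.t.} \quad
v(s) \ge r(s, a) + \lambda \sum_{j \in S} p(j \mid s, a)\, v(j)
\qquad \forall\, s \in S,\ a \in A
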
Extensions of MDPs

• Thus far, we have discussed discrete-time MDPs, i.e. MDPs in which decisions are made at fixed time periods

Partially Observable MDPs (POMDPs)

• System state is not fully observable
  – The decision maker receives a signal o which is related to the system state by q(o | s)
• Applications
  – Medical diagnosis and treatment
  – Equipment repair

POMDPs – Formal Definition

• POMDPs consist of
  – State space S
  – Action space A
  – Observation space O
  – Transition matrix P
  – Rewards R
  – Observation process distribution Q(o | s)
  – Value function V = V(y; π), where π: O → A is a policy

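The slides stop at the definition, but the core computational object in POMDP methods is the belief state: a distribution b over S, revised by Bayes' rule after each action and observation. A minimal sketch using the notation above (the dict-based data layout is an assumption of this writeup):

def belief_update(b, a, o, P, Q):
    """Bayes update of belief b after taking action a and observing o.

    b: dict s -> Pr(state = s)               (current belief)
    P: dict (s, a) -> dict j -> Pr(j | s, a) (transition law)
    Q: dict (o, s) -> Pr(o | s)              (observation law q(o | s))
    """
    # Predict: push the belief through the transition law
    predicted = {}
    for s, bs in b.items():
        for j, pj in P[(s, a)].items():
            predicted[j] = predicted.get(j, 0.0) + bs * pj
    # Correct: reweight by the observation likelihood, then normalize
    unnorm = {j: Q[(o, j)] * w for j, w in predicted.items()}
    z = sum(unnorm.values())
    return {j: w / z for j, w in unnorm.items()}
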
POMDP – Contd.

• Simple example

POMDP Controlled by a Parameterized Stochastic Policy

• Dynamics of a POMDP controlled by parameterized stochastic policies:

Constrained Markov Decision Processes

• MDPs with probabilistic constraints
  – Typically linear in terms of the limiting probability variables
• Applications
  – Service level constraints in call centers, e.g.
    • P(no delay) ≥ 0.5
    • E[number in queue] ≤ 10

CMDP: LP technique

• The constraint can be modeled as a function of the LP variables
• Why randomized policies?
  – Vertices of the LP polytope without the extra constraints correspond to non-randomized policies
  – The added constraints create new vertices, corresponding to randomized policies

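A sketch of the occupation-measure LP this likely refers to, for the long-run average criterion (the cost function c and bound C stand in for whichever service-level constraint is being modeled): the variables x(s, a) are the limiting state-action probabilities, and an optimal policy randomizes among actions in proportion to x(s, a).

\max \sum_{s, a} r(s, a)\, x(s, a)
\quad \text{s.t.} \quad
\sum_{a} x(j, a) = \sum_{s, a} p(j \mid s, a)\, x(s, a) \ \ \forall j \in S,
\qquad
\sum_{s, a} x(s, a) = 1,
\qquad
\sum_{s, a} c(s, a)\, x(s, a) \le C,
\qquad
x \ge 0
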
Competitive MDPs

• Two (or more) players compete

Why would I prefer an MDP to a Stochastic Program?

• MDPs are much more flexible – existing multistage SP models assume linear relations across stages
• If there is a recursive nature to the problem, MDPs are likely superior
• When the number of states/actions is small relative to the time horizon (e.g. inventory management), MDPs are likely to be superior
• The MDP may be easier to analyze analytically, e.g. the optimality of (s, S) policies in inventory theory

Why would I prefer a Stochastic Program to an MDP?

• When the action and/or state space has a (good) polyhedral representation
• When the set of decisions and/or states is enormous, or infinite
• When the number of stages is not critical relative to the detail required in each stage

Conclusions

• MDPs are a flexible technique for stochastic and dynamic optimization problems
• MDPs have a much longer history of success than stochastic programming
• The curse of dimensionality may render them impractical (e.g. for a VRP, the action space is the set of TSP tours)
• Approximate solution techniques appear promising
• MDPs allow models that are unexplored in SP
  – partial observability
  – continuous time
  – competition