
Markov Decision Processes

Andrew Schaefer
EWO Seminar
October 26, 2006

What are Markov Decision Processes (MDPs)?
• MDPs are a method for formulating and solving stochastic and dynamic decision problems
• MDPs are very flexible, which is an advantage from a modeling perspective but a drawback from a solution viewpoint (they can't take advantage of special structure)
• MDPs are pervasive; every Fortune 500 company uses them in some form or another
• Particular success in inventory management and reliability

Potential for MDPs in EWO
• The flexibility of MDPs may allow them to be useful for certain problems
• It is unlikely that they can work as stand-alone techniques
• Long history of successful applications

Outline
• Basic Components of MDPs
• Solution Techniques
• Extensions
• Big Picture – Potential Roles for MDPs

Outline – Basic Components
• Inventory Example
• MDP Vocabulary
• 5 Basic Components
  1. decision epochs
  2. states
  3. actions
  4. rewards
  5. transition probabilities
• value function
• decision rule
• policy

Inventory Example
• Problem Statement
  • Each month, the manager of a warehouse
    • observes the current inventory on hand of a single product
    • decides how much additional stock to order from a supplier
  • Monthly demand
    • uncertain
    • probability distribution is known
  • Tradeoff between the costs of
    • keeping inventory
    • lost sales
  • Objective
    • maximize expected profit over the next year
• Simplifying Assumptions
  • Instantaneous delivery
  • No backlogging, i.e. excess demand is lost
  • Warehouse capacity of M units
  • Product is sold in whole units only
  • Revenues, costs and demand distribution do not change from month to month

Inventory Level

[Figure: inventory level over months t, t+1, t+2, varying between 0 and the warehouse capacity M, with s_t the level at the start of month t]

$$s_{t+1} = s_t + a_t - \min\{D_t,\ s_t + a_t\}$$
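As a quick check of this update rule, here is a minimal Python sketch that steps the inventory forward one month. The capacity, order quantity, and demand draw are made-up illustration values, not numbers from the slides.

```python
import random

def next_inventory(s_t, a_t, demand):
    """One-month update: s_{t+1} = s_t + a_t - min(D_t, s_t + a_t)."""
    available = s_t + a_t            # stock on hand after the order arrives
    sold = min(demand, available)    # unmet demand is lost (no backlogging)
    return available - sold

# Illustrative run: capacity M = 5, start with 2 units, order 3 (feasible since 3 <= M - s)
s, a = 2, 3
d = random.randint(0, 6)             # stand-in demand draw
print(f"demand {d}: next month's inventory is {next_inventory(s, a, d)}")
```
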


Outline – Basic Components
• Inventory Example
• MDP Vocabulary
• 5 Basic Components
  1. decision epochs
  2. states
  3. actions
  4. rewards
  5. transition probabilities
• value function
• decision rule
• policy

1. Decision Epochs: t
• Points in time at which decisions are made
  • analogous to period start times in a "Markov Process"
• Inventory example
  • first day of month 1, first day of month 2, …, first day of month 12
• In general: 1, 2, …, N
  • N: length of the time horizon (could be infinite)
• Also called
  • periods
  • stages

2. States: s_t
• Relevant information needed to describe the system
  • analogous to states in "Markov Processes"
  • includes all information from the past relevant to the future
• Inventory example
  • amount of inventory on hand at the start of the month
  • s_t, t = 1, …, 12
  • possible values – the state space
    • 0, 1, 2, …, M

3. Actions: a_t
• Means by which the decision maker interacts with the system
  • permissible actions can be state dependent
  • no exact analogy to "Markov Processes"
    • the decision is usually modeled outside the Markov Process
• Inventory example
  • how much additional stock to order each month
  • denoted a_t, t = 1, …, 12
  • possible values: 0, 1, 2, …, M − s_t

Inventory Example

[Figure: timeline of one month, from the start of month t to the start of month t+1]
• Count stock on hand s_t; order a_t units of new stock
• The order arrives; stock on hand is now s_t + a_t
• Incur the holding cost h(s_t + a_t)
• Demand D_t occurs and is filled as far as possible; sales income is f(min{D_t, s_t + a_t}), where min{D_t, s_t + a_t} is the amount sold
• Stock on hand at the start of month t+1 is s_{t+1} = s_t + a_t − min{D_t, s_t + a_t}

Ordering cost (fixed cost K plus variable cost c(a_t)):

$$O(a_t) = \begin{cases} K + c(a_t), & a_t > 0 \\ 0, & a_t = 0 \end{cases}$$

Probability that demand is for j units:

$$\Pr(D_t = j) = p_j, \qquad j = 0, 1, 2, \ldots$$

4. Rewards: r_t(s_t, a_t)
• Expected immediate net income associated with taking a particular action, in a particular state, in a particular epoch
  • analogous to state utilities in "Markov Processes"
• Inventory example (sketched in code below)
  • expected income − ordering cost − holding cost

$$r_t(s_t, a_t) = \mathrm{E}[\text{income in month } t] - O(a_t) - h(s_t + a_t), \qquad t = 1, 2, \ldots, 12$$

$$\mathrm{E}[\text{income in } t] = \mathrm{E}[\text{income in } t \mid D_t \le s_t + a_t]\,\Pr\{D_t \le s_t + a_t\} + \mathrm{E}[\text{income in } t \mid D_t > s_t + a_t]\,\Pr\{D_t > s_t + a_t\}$$

$$= \sum_{j=0}^{s_t + a_t} f(j)\, p_j \;+\; f(s_t + a_t) \sum_{j = s_t + a_t + 1}^{\infty} p_j$$

• in the last epoch, we assume that whatever is left over has some salvage value g(·):

$$r_{13}(s_{13}, \cdot) = g(s_{13}); \qquad \text{in general } r_{N+1}(s_{N+1}, \cdot) = g(s_{N+1})$$

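To make the reward formula concrete, here is a minimal Python sketch of r_t(s, a) for the inventory example. The demand distribution, revenue function, and cost figures are made-up illustration values, and the variable ordering cost is assumed to be linear (c(a) = c·a), which the slides leave unspecified.

```python
def expected_reward(s, a, p, f, K, c, h):
    """r_t(s, a) = expected sales income - ordering cost O(a) - holding cost h*(s + a)."""
    stock = s + a
    # E[income] = sum_{j <= s+a} f(j) p_j  +  f(s+a) * Pr(D > s+a)
    income = sum(f(j) * p[j] for j in range(min(stock, len(p) - 1) + 1))
    income += f(stock) * sum(p[stock + 1:])
    order_cost = (K + c * a) if a > 0 else 0.0   # fixed cost K plus linear variable cost
    return income - order_cost - h * stock

# Illustrative numbers: demand uniform on 0..4, revenue 8 per unit sold,
# fixed order cost 4, unit order cost 2, unit holding cost 1
p = [0.2] * 5                         # p[j] = Pr(D_t = j)
f = lambda units_sold: 8 * units_sold
print(expected_reward(s=1, a=2, p=p, f=f, K=4, c=2, h=1))
```
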
5. Transition Probabilities: p_t(j | s_t, a_t)
• Distribution that governs how the state of the process changes as actions are taken over time
  • depends on the current state and action only, and possibly time
  • that is, the future is independent of the past given the present (the Markov Property)
• Inventory example (sketched in code below)
  • we already established that s_{t+1} = s_t + a_t − min{D_t, s_t + a_t}, which depends on the demand

$$\Pr\{s_{t+1} = j \mid s_t = s,\ a_t = a\} =
\begin{cases}
p_{s+a-j}, & 0 < j \le s+a \quad \text{(leftovers: demand was less than the available inventory)}\\
\displaystyle\sum_{i=s+a}^{\infty} p_i, & j = 0 \quad \text{(nothing left: demand met or exceeded the available inventory)}\\
0, & j > s+a \quad \text{(can't end up with more than you started with)}
\end{cases}$$

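And here is the matching sketch of the transition probabilities, again in Python with the same made-up uniform demand distribution; each call produces one row of the transition matrix for a given state–action pair.

```python
def transition_probs(s, a, p, M):
    """Row of Pr{s_{t+1} = j | s_t = s, a_t = a} for the inventory example.
    p[j] = Pr(D_t = j); demand beyond len(p) - 1 is assumed to have probability 0."""
    stock = s + a                                 # feasible actions keep stock <= M
    row = [0.0] * (M + 1)
    row[0] = sum(p[stock:])                       # demand >= stock: nothing left over
    for j in range(1, stock + 1):                 # demand = stock - j: j units left over
        if stock - j < len(p):
            row[j] = p[stock - j]
    return row                                    # entries with j > stock stay 0

p = [0.2] * 5                                     # uniform demand on 0..4 (illustrative)
row = transition_probs(s=1, a=2, p=p, M=5)
print(row, "sums to", sum(row))                   # a probability row should sum to 1
```
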
Value Function
• Collectively, decision epochs, states, actions, rewards, and transition probabilities form an MDP.
• But how do they fit together?
• Value function, v_t(s_t)
  • maximum total expected reward starting in state s_t with N − t decision epochs remaining
  • $v_t(s_t) = \max_a \{\, r_t(s_t, a_t) + \mathrm{E}[v_{t+1}(s_{t+1})] \,\}$
    • r_t(s_t, a_t): expected immediate reward in period t
    • E[v_{t+1}(s_{t+1})]: expected remaining reward in periods t+1, t+2, …, N

Value Function
• Inventory example

$$v_{13}(s_{13}) = r_{13}(s_{13}, \cdot) = g(s_{13})$$

$$v_{12}(s_{12}) = \max_{a_{12} \in \{0,1,\ldots,M-s_{12}\}} \left\{ r_{12}(s_{12}, a_{12}) + \sum_{j=0}^{M} v_{13}(j)\,\Pr\{s_{13} = j \mid s_{12}, a_{12}\} \right\}$$

$$v_{11}(s_{11}) = \max_{a_{11} \in \{0,1,\ldots,M-s_{11}\}} \left\{ r_{11}(s_{11}, a_{11}) + \sum_{j=0}^{M} v_{12}(j)\,\Pr\{s_{12} = j \mid s_{11}, a_{11}\} \right\}$$

$$\vdots$$

$$v_{2}(s_{2}) = \max_{a_{2} \in \{0,1,\ldots,M-s_{2}\}} \left\{ r_{2}(s_{2}, a_{2}) + \sum_{j=0}^{M} v_{3}(j)\,\Pr\{s_{3} = j \mid s_{2}, a_{2}\} \right\}$$

$$v_{1}(s_{1}) = \max_{a_{1} \in \{0,1,\ldots,M-s_{1}\}} \left\{ r_{1}(s_{1}, a_{1}) + \sum_{j=0}^{M} v_{2}(j)\,\Pr\{s_{2} = j \mid s_{1}, a_{1}\} \right\}$$

• v_1(s_1) is the maximum total expected reward starting in state s_1 with 12 decision epochs remaining; in each equation, the first term inside the braces is the expected immediate reward in that period and the sum is the expected remaining reward in the later periods (periods 2, …, N in the last line)


Decision Rule
• A rule, for a particular state, that prescribes an action for each decision epoch
  • d_t(s) = action to take in state s in decision epoch t
• Inventory example – possible decision rules
  • d_1(0) = 5: if there are 0 items on hand at the beginning of month 1, order 5.
  • d_2(1) = 3: if there is 1 item on hand at the beginning of month 2, order 3.

Policy
• A collection of decision rules for all states and decision epochs
• Inventory example:

$$\begin{bmatrix}
d_1(0) & d_2(0) & \cdots & d_{12}(0) \\
d_1(1) & d_2(1) & \cdots & d_{12}(1) \\
\vdots & \vdots & & \vdots \\
d_1(M) & d_2(M) & \cdots & d_{12}(M)
\end{bmatrix}
=
\begin{bmatrix}
5 & 4 & \cdots & 1 \\
4 & 3 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}$$

• Under a fixed policy, the process behaves according to a Markov chain (sketched in code below).

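To illustrate the last point, here is a small Python sketch that builds the one-month transition matrix of the Markov chain induced by a fixed decision rule. The "order up to 4" rule, the capacity, and the demand distribution are made-up illustration values.

```python
def induced_chain(decision_rule, p, M):
    """Transition matrix of the Markov chain induced by one month's decision rule,
    where decision_rule[s] is the order quantity chosen in state s."""
    P = [[0.0] * (M + 1) for _ in range(M + 1)]
    for s in range(M + 1):
        stock = s + decision_rule[s]
        P[s][0] = sum(p[stock:])                  # demand wipes out the available stock
        for j in range(1, stock + 1):
            if stock - j < len(p):
                P[s][j] = p[stock - j]            # exactly j units left over
    return P

M, p = 5, [0.2] * 5                               # capacity and uniform demand on 0..4
rule = [max(0, 4 - s) for s in range(M + 1)]      # "order up to 4 units" decision rule
for row in induced_chain(rule, p, M):
    print([round(x, 2) for x in row])
```
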
Outline – Solution Techniques
• Finite-horizon MDPs
  • Backwards induction solution technique
• Infinite-horizon MDPs
  • Optimality criteria
  • Solution techniques
    • value iteration
    • policy iteration
    • linear programming

Various factors to consider in choosing the right technique
• Will rewards be discounted?
• What is the objective?
  • Maximize expected total reward, or
  • Maximize expected average reward
• Is the problem formulated as a finite- or infinite-horizon problem?
  • If finite horizon, the solution technique does not depend on discounting or the objective
  • If infinite horizon, the technique does depend on discounting and the objective

Outline – Solution Techniques
• Finite-horizon MDPs
  • Backwards induction solution technique
• Infinite-horizon MDPs
  • Optimality criteria
  • Solution techniques
    • value iteration
    • policy iteration
    • linear programming

Finite Horizon Problems
• We want to find a policy π that maximizes

$$v^{\pi}(s) = \mathrm{E}_{\pi,s}\!\left[\sum_{t=1}^{N} r(s_t, a_t)\right]$$

• The solution technique used is called "backwards induction"

Backwards Induction – Shortest Path Problem

[Figure, shown over four slides: a shortest-path network arranged in stages 1 through N, with arc costs on the edges. Backwards induction labels the single node in the final stage with value 0, then works backwards one stage at a time, labeling each node with the cheapest arc cost plus the label of its successor, until every node back to stage 1 is labeled.]

Idea behind backwards induction
• Envision being in the last time period; for each possible state, decide the best action for that state
• This yields an optimal value for each state in that period
• Next envision being in the next-to-last period; for each possible state, decide the best action, given that you now know the optimal value of being in each state at the next time period
• Continue this process until you reach the present time period (a deterministic shortest-path sketch follows)

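Here is a minimal deterministic sketch of that idea on a small staged shortest-path network, written in Python. The network and its arc costs are invented for illustration, since the figure on the previous slides did not survive extraction.

```python
# arcs[t] maps each node in stage t to {successor_node: arc_cost} in stage t+1.
arcs = [
    {0: {0: 2, 1: 4}},                      # stage 1 -> stage 2
    {0: {0: 3, 1: 2}, 1: {0: 1, 1: 5}},     # stage 2 -> stage 3
    {0: {0: 2}, 1: {0: 3}},                 # stage 3 -> the single terminal node
]

def backwards_induction_shortest_path(arcs):
    """Label every node with its shortest remaining distance, last stage first."""
    values = {0: 0.0}                        # the terminal node has value 0
    for stage in reversed(arcs):
        values = {node: min(cost + values[succ] for succ, cost in succs.items())
                  for node, succs in stage.items()}
    return values                            # values of the stage-1 nodes

print(backwards_induction_shortest_path(arcs))   # {0: 7.0} for these made-up costs
```
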
Formalization of Backwards Induction
• Assume an N-period horizon with the following parameters:
  • r(s,a): immediate reward for choosing action a when in state s
  • p(j | s,a): probability that the system moves to state j at the next time period, given that the current state is s and action a is chosen
  • g(s): salvage value of the system occupying state s at the final period N
• Algorithm (a code sketch for the inventory example follows):
  1. Let v_N(s) = g(s) for all s
  2. For t = N−1, N−2, …, 1 and for all s at each period, compute

$$\text{a.}\quad v_t(s) = \max_{a \in A}\left\{ r(s,a) + \sum_{j \in S} p(j \mid s,a)\, v_{t+1}(j) \right\}$$

$$\text{b.}\quad a_t(s) = \arg\max_{a \in A}\left\{ r(s,a) + \sum_{j \in S} p(j \mid s,a)\, v_{t+1}(j) \right\}$$

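As promised, here is a Python sketch of this algorithm applied to the inventory example. It is a direct transcription of steps 1–2 above; all the specific numbers (horizon, capacity, demand distribution, prices, costs, zero salvage value) are made up for illustration, and the variable ordering cost is again assumed to be linear.

```python
def backwards_induction(N, S, actions, reward, trans, salvage):
    """v_N(s) = g(s);  v_t(s) = max_a { r(s,a) + sum_j p(j|s,a) v_{t+1}(j) }."""
    v = {N: {s: salvage(s) for s in S}}
    policy = {}
    for t in range(N - 1, 0, -1):
        v[t], policy[t] = {}, {}
        for s in S:
            best_val, best_a = float("-inf"), None
            for a in actions(s):
                val = reward(s, a) + sum(pj * v[t + 1][j] for j, pj in trans(s, a).items())
                if val > best_val:
                    best_val, best_a = val, a
            v[t][s], policy[t][s] = best_val, best_a
    return v, policy

# Inventory instance with made-up numbers: M = 3, uniform demand on 0..2,
# revenue 8 per unit sold, fixed order cost 4, unit order cost 2, unit holding cost 1.
M, price, K, c, h = 3, 8, 4, 2, 1
p = [1 / 3] * 3

def reward(s, a):
    stock = s + a
    income = sum(price * min(d, stock) * pd for d, pd in enumerate(p))
    return income - ((K + c * a) if a > 0 else 0) - h * stock

def trans(s, a):
    stock, out = s + a, {}
    for d, pd in enumerate(p):
        j = stock - min(d, stock)
        out[j] = out.get(j, 0.0) + pd
    return out

v, policy = backwards_induction(N=4, S=range(M + 1),
                                actions=lambda s: range(M - s + 1),
                                reward=reward, trans=trans,
                                salvage=lambda s: 0.0)
print(policy[1])   # best first-month order quantity for each starting inventory level
```
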
Infinite Horizon Problems
• The infinite-horizon case is the limiting value of the finite case as the time horizon tends toward infinity
• Recall the finite-horizon value function:

$$v^{\pi}(s) = \mathrm{E}_{\pi,s}\!\left[\sum_{t=1}^{N} r(s_t, a_t)\right]$$

• This is the infinite-horizon value function:

$$v^{\pi}(s) = \lim_{N \to \infty} \mathrm{E}_{\pi,s}\!\left[\sum_{t=1}^{N} r(s_t, a_t)\right]$$

• This is the infinite-horizon value function with discounting:

$$v^{\pi}(s) = \lim_{N \to \infty} \mathrm{E}_{\pi,s}\!\left[\sum_{t=1}^{N} \lambda^{t-1}\, r(s_t, a_t)\right]$$

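The outline lists value iteration as a solution technique for this discounted case, but the slides do not spell it out. Here is a minimal sketch of the standard algorithm, applied to a made-up two-state machine-repair example (all states, actions, rewards, and probabilities below are invented for illustration).

```python
def value_iteration(S, A, reward, trans, lam=0.9, tol=1e-8):
    """Repeatedly apply v(s) <- max_a { r(s,a) + lam * sum_j p(j|s,a) v(j) }
    until the largest change falls below tol (requires lam < 1)."""
    v = {s: 0.0 for s in S}
    while True:
        v_new = {s: max(reward(s, a) + lam * sum(pj * v[j] for j, pj in trans(s, a).items())
                        for a in A(s))
                 for s in S}
        if max(abs(v_new[s] - v[s]) for s in S) < tol:
            return v_new
        v = v_new

# Toy instance: state 0 = machine working, 1 = broken; action 0 = do nothing, 1 = repair
r = {(0, 0): 5, (0, 1): 2, (1, 0): -1, (1, 1): 0}
P = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.95, 1: 0.05},
     (1, 0): {1: 1.0},         (1, 1): {0: 0.9, 1: 0.1}}
v_star = value_iteration(S=[0, 1], A=lambda s: [0, 1],
                         reward=lambda s, a: r[s, a], trans=lambda s, a: P[s, a])
print(v_star)
```
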
Important Factors for Infinite Horizon Models
• Discount factor
  • If the discount factor is < 1, then the total expected value converges to a finite solution
  • If there is no discounting, then the total expected value may explode
    • However, if the system has an absorbing state with 0 reward, then even undiscounted models may yield finite optimal values
• Objective criteria
  • Total expected value
  • Average expected value over the long run
    • More detailed analysis is required
• When is a stationary optimal policy guaranteed to exist?
  • If the state space and action space are finite

Previous applications of MDPs
• At the heart of every inventory management paper is an MDP
• Reliability and replacement problems
• Some routing and logistics problems (e.g. Warren Powell's work in trucking)
• 50 years of successful applications (particularly inventory theory, which is widely applied)

Limitations of MDPs
• "Curse of dimensionality"
  • As the problem size increases, i.e. the state and/or action space becomes larger, it becomes computationally very difficult to solve the MDP
  • There are some methods that are more memory-efficient than the policy iteration and value iteration algorithms
  • There are also solution techniques that find near-optimal solutions in a short time
• Enormous data requirements
  • For each state and action pair, we need transition probabilities and a reward

Limitations of MDPs
• Stationarity assumption
  • In fact, transition probabilities may not stay the same over time
  • Two methods to model non-stationary transition probabilities and rewards:
    1. Use a finite-horizon model
    2. Enlarge the state space by including time

Why do MDPs not Scale Well?
• Analogy in the deterministic world: DP vs. IP
  • Any integer program can be formulated as a dynamic program
  • However, DP struggles with IPs that branch-and-bound solves in less than a second
  • Why?
    • IP uses polyhedral theory to get strong bounds
    • DP techniques grow with the size of the problem
• Will this occur in the stochastic setting?

MDP: Approximate Solution Techniques
• Finding the optimal policy is polynomial time in the size of the state space
  • via a linear programming formulation (sketched below)
  • Sometimes there are too many states for this to be practical
• Examples of approximate solution techniques
  • State aggregation and action elimination
  • Many AI techniques (reinforcement learning)
  • Approximate linear programming (de Farias and Van Roy)
  • In the last 10 years this has become a very active area

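The slides do not write the LP out. One standard primal formulation for the discounted infinite-horizon case (discount factor λ < 1, any strictly positive weights α(s)) looks like the following; it is included here as background, not as something taken from the deck:

```latex
\min_{v}\ \sum_{s \in S} \alpha(s)\, v(s)
\qquad \text{subject to} \qquad
v(s) \;\ge\; r(s,a) \;+\; \lambda \sum_{j \in S} p(j \mid s,a)\, v(j)
\qquad \forall\, s \in S,\ a \in A .
```

The optimal solution is the optimal value function, but there is one variable per state and one constraint per state–action pair, which is why approximate linear programming instead restricts v to a low-dimensional basis, v(s) ≈ Σ_k w_k φ_k(s), and optimizes over the weights w.
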
Extensions of MDPs
• Thus far, we've discussed discrete-time MDPs, i.e. decisions are made at fixed time periods
• Continuous-time MDP models generalize MDPs by
  • allowing the decision maker to choose actions whenever the system changes
  • modeling the system evolution in continuous time
  • allowing the time spent in a particular state to follow an arbitrary probability distribution
• No one has considered a continuous-time stochastic program

Extensions of MDPs
• Most models are completely observable MDPs, i.e. the state of the system is known with certainty at all time periods
• Partially observable MDP models generalize MDPs by relaxing the assumption that the state of the system is completely observable

Partially Observable MDPs (POMDPs)
• The system state is not fully observable
  • The decision maker receives a signal o that is related to the system state by q(o | s)
• The observation process depends on the (unknown) state and is generated from a probability distribution function
• The hidden state variables evolve according to a standard MDP
• Applications
  • Medical diagnosis and treatment
  • Equipment repair

POMDPs – Formal Definition
• POMDPs consist of
  • State space S
  • Action space A
  • Observation space O
  • Transition matrix P
  • Rewards R
  • Observation process distribution Q(o | s)
  • Value function V = V(y; π)
    • where π: O → A is a policy

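The slide stops at the list of ingredients; a standard way they fit together (not shown in the deck) is through a belief state b_t, a probability distribution over S that summarizes the observation history. After taking action a_t and observing o_{t+1}, Bayes' rule gives the update

```latex
b_{t+1}(s') \;=\;
\frac{Q(o_{t+1} \mid s') \displaystyle\sum_{s \in S} P(s' \mid s, a_t)\, b_t(s)}
     {\displaystyle\sum_{s'' \in S} Q(o_{t+1} \mid s'') \sum_{s \in S} P(s'' \mid s, a_t)\, b_t(s)} .
```

Policies that act on beliefs turn the POMDP into an equivalent MDP over belief states, which is one reason POMDPs are much harder to solve than fully observable MDPs.
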
POMDP – Contd.
• Simple example

POMDP controlled by a parameterized stochastic policy
• Dynamics of a POMDP controlled by parameterized stochastic policies:

Constrained Markov Decision Processes
• MDPs with probabilistic constraints
  • Typically linear in terms of the limiting probability variables
• Applications
  • Service-level constraints in call centers
    • P(no delay) ≥ 0.5
    • E[number in queue] ≤ 10
• Value iteration and policy iteration algorithms do not extend; linear programming does.

CMDP: LP technique
• Constraints can be modeled as functions of the LP variables (see the formulation sketched below)
• Results in an optimal randomized policy
  • The optimal choice of action in some states can be randomized
  • Not always practical
    • Difficult to obtain the actual randomization probabilities
• Why randomized?
  • Vertices of the LP polytope without the added constraints correspond to non-randomized policies
  • The added constraints create new vertices, corresponding to randomized policies

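The deck does not show the formulation; a common version for the long-run average-reward case uses the limiting state–action probabilities x(s,a) as variables, with constraint costs c_k(s,a) and bounds d_k (symbols chosen here for illustration):

```latex
\max_{x \ge 0}\ \sum_{s,a} r(s,a)\, x(s,a)
\quad \text{subject to} \quad
\sum_{a} x(j,a) \;=\; \sum_{s,a} p(j \mid s,a)\, x(s,a) \ \ \forall j, \qquad
\sum_{s,a} x(s,a) = 1, \qquad
\sum_{s,a} c_k(s,a)\, x(s,a) \le d_k \ \ \forall k .
```

An optimal x is turned into a stationary policy via π(a | s) = x(s,a) / Σ_{a'} x(s,a'); because the extra constraints can cut new vertices into the feasible region, this policy is generally randomized, which is exactly the point made in the bullets above.
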
Competitive MDPs
• Two (or more) players compete
• The actions of one player affect the state of the other
• May be interesting for modeling various oligopolistic relationships among competitors
• Still a nascent area

Why would I prefer an MDP to a Stochastic Program?
• MDPs are much more flexible – existing multistage SP models assume linear relations across stages
• If there is a recursive nature to the problem, MDPs are likely superior
• When the number of states/actions is small relative to the time horizon (e.g. inventory management), MDPs are likely to be superior
• The MDP may be easier to analyze analytically, e.g. the optimality of (s,S) policies in inventory theory

Why would I prefer a Stochastic Program to an MDP?
• When the action and/or state space has a (good) polyhedral representation
• When the set of decisions and/or states is enormous, or infinite
• When the number of stages is not critical relative to the detail required in each stage

Conclusions
• MDPs are a flexible technique for stochastic and dynamic optimization problems
• MDPs have a much longer history of success than stochastic programming
• The curse of dimensionality may render them impractical (e.g. for a VRP, the action space is the set of TSP tours)
• Approximate solution techniques appear promising
• MDPs allow models that are unexplored in SP
  • Partial observability
  • Continuous time
  • Competition
