UE2MDP2020
Dynamic programming
Xiaolan Xie
Dynamic programming
Basic principle of dynamic programming
Some applications
Introduction
Dynamic programming (DP) is a general optimization
technique based on implicit enumeration of the solution
space.
The problem must have a particular sequential structure,
such that the unknowns can be determined one after another.
It is based on the "principle of optimality".
A wide range of problems can be put in sequential form and
solved by dynamic programming.
Introduction
Applications :
• Optimal control
• Most problems in graph theory
• Investment
• Deterministic and stochastic inventory control
• Project scheduling
• Production scheduling
Illustration of DP by shortest path problem
Problem : We are planning the construction of a
highway from city A to city K. Different construction
alternatives and their costs are given in the following
graph. The problem is to determine the highway
with the minimum total cost.
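[Figure: graph of construction alternatives from city A to city K through intermediate nodes B to J, with a cost on each arc; the arc layout is not recoverable from the extracted text.]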
BELLMAN's principle of optimality
General form:
if C belongs to an optimal path from A to B, then the sub-paths from A
to C and from C to B are also optimal
or
every sub-path of an optimal path is optimal
[Figure: optimal path from A to B through C; both sub-paths are optimal.]
Corollary :
SP(x0, y) = min {SP(x0, z) + l(z, y) | z : predecessor of y}
Solving a problem by DP
1. Extension
Extend the problem to a family of problems of the same nature
2. Recursive Formulation (application of the principle of optimality)
Link optimal solutions of these problems by a recursive relation
3. Decomposition into steps or phases
Define the order of the resolution of the problems in such a way
that, when solving a problem P, optimal solutions of all other
problems needed for computation of P are already known.
4. Computation by steps
Shortest Path in an acyclic graph
• Problem setting : find a shortest path from x0 (root of the graph) to a given
node y0
• Extension : Find a shortest path from x0 to any node y, denoted SP(x0, y)
• Recursive formulation
SP(y) = min { SP(z) + l(z, y) : z predecessor of y}
• Decomposition into steps : At each step k, consider only nodes y with
unknown SP(y) but for which the SP of all predecessors are known.
• Compute SP(y) step by step
Remarks :
• It is a backward dynamic programming
• It is also possible to solve this problem by forward dynamic programming
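The recursion above can be sketched in a few lines of Python. This is a minimal sketch, not the author's implementation: the four-node graph, the function names `dag_shortest_paths` and `path_to`, and the assumption that the node list is already given in topological order are all illustrative.

```python
def dag_shortest_paths(order, arcs, source):
    """SP(y) = min{SP(z) + l(z, y)} over predecessors z, for an acyclic graph.

    order : nodes listed in topological order (assumed to be given)
    arcs  : dict mapping (z, y) to the arc length l(z, y)
    """
    INF = float("inf")
    sp = {y: INF for y in order}
    pred = {}
    sp[source] = 0.0
    for y in order:
        # All predecessors of y come earlier in the order, so their SP is final.
        for (z, head), length in arcs.items():
            if head == y and sp[z] + length < sp[y]:
                sp[y] = sp[z] + length
                pred[y] = z
    return sp, pred


def path_to(pred, source, target):
    """Rebuild the shortest path by following the stored predecessors."""
    path = [target]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return path[::-1]


# Hypothetical example graph (not the highway graph of the slides).
arcs = {("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 1, ("B", "D"): 7, ("C", "D"): 3}
sp, pred = dag_shortest_paths(["A", "B", "C", "D"], arcs, "A")
print(sp["D"], path_to(pred, "A", "D"))
```

Processing nodes in topological order is one way to realize the "decomposition into steps": a node is treated only after all of its predecessors.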
DP from a control point of view
Consider the control of
(i) a discrete-time dynamic system, with
(ii) costs generated over time depending on the states and the
control actions
DP from a control point of view
System dynamics :
x_{t+1} = f_t(x_t, u_t), t = 0, 1, ..., N-1
where
t : time index
x_t : state of the system
u_t : control action to decide at t
DP from a control point of view
Criterion to optimize
DP from a control point of view
Value function or cost-to-go function:
DP from a control point of view
Optimality equation or Bellman equation
Applications
Single machine scheduling (Knapsack)
Problem :
Consider a set of N production requests, each needing a
production time ti on a bottleneck machine and generating
a profit pi. The capacity of the bottleneck machine is C.
Question: determine the production requests to confirm in
order to maximize the total profit.
Formulation:
max Σ_i p_i X_i
subject to:
Σ_i t_i X_i ≤ C, X_i ∈ {0, 1}
Knapsack Problem
Knapsack problem:
• Mr Radin can take 7 kg without paying an overweight fee on
his return flight. He decides to take advantage of it and looks
for some local products that he can sell at home for extra gain.
• He selects the n most interesting objects, weighs each of them,
and bargains over the price.
• Which objects should he buy in order to maximize his gain?
Object (i)          1  2  3  4  5  6
Weight (wi)         2  1  1  3  2  1
Expected gain (ri)  8  5  5  6  3  2
Knapsack Problem
Generic formulation:
• Time = 1, …, 7
• State st = remaining capacity for objects t, t+1, …
• State space = {0, 1, 2, …, 7}
• Action at time t = select object t or not
• Action space At(s) = {1=YES, 0=NO} if s ≥ wt and = {0} if s < wt
• Immediate gain at time t
gt(st, ut) = rt if YES
=0 if NO
• State transition or system dynamics:
st+1 = st – wt if YES
= st if NO
Knapsack Problem
Value function:
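J_n(s) = Maximal gain from objects n, n+1, …, 6 with a
remaining capacity of s kg.
Optimality equation:
J_n(s) = max { J_{n+1}(s), r_n + J_{n+1}(s – w_n) }, where the second term applies only if s ≥ w_n, and J_7(s) = 0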
Knapsack Problem
Value function Jn(s) and optimal action (Y = select, N = do not select)
time     7       6       5       4       3       2       1
wi, ri           1, 2    2, 3    3, 6    1, 5    1, 5    2, 8
state    Jn(s)   Jn(s)   Jn(s)   Jn(s)   Jn(s)   Jn(s)   Jn(s)
0        0       0  N    0  N    0  N    0  N    0  N    0  N
1        0       2  Y    2  N    2  N    5  Y    5  N    5  N
2        0       2  Y    3  Y    3  N    7  Y    10 Y    10 N
3        0       2  Y    5  Y    6  Y    8  Y    12 Y    13 Y
4        0       2  Y    5  Y    8  Y    11 Y    13 Y    18 Y
5        0       2  Y    5  Y    9  Y    13 Y    16 Y    20 Y
6        0       2  Y    5  Y    11 Y    14 Y    18 Y    21 Y
7        0       2  Y    5  Y    11 Y    16 Y    19 Y    24 Y
Gain of each action, YES / NO (-1 = infeasible action)
time     6         5         4         3         2         1
state    YES NO    YES NO    YES NO    YES NO    YES NO    YES NO
0        -1  0     -1  0     -1  0     -1  0     -1  0     -1  0
1        2   0     -1  2     -1  2     5   2     5   5     -1  5
2        2   0     3   2     -1  3     7   3     10  7     8   10
3        2   0     5   2     6   5     8   6     12  8     13  12
4        2   0     5   2     8   5     11  8     13  11    18  13
5        2   0     5   2     9   5     13  9     16  13    20  16
6        2   0     5   2     11  5     14  11    18  14    21  18
7        2   0     5   2     11  5     16  11    19  16    24  19
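The knapsack tables can be reproduced with a short backward recursion. The sketch below assumes the usual recursion J_n(s) = max{J_{n+1}(s), r_n + J_{n+1}(s – w_n)} with J_7(s) = 0; the function names `knapsack_dp` and `selected_objects` are illustrative and not taken from the slides.

```python
def knapsack_dp(weights, gains, capacity):
    """Backward DP: J[t][s] = maximal gain from objects t, ..., n with capacity s."""
    n = len(weights)
    # Row n + 1 is the terminal condition J_{n+1}(s) = 0 (objects are 1-based).
    J = [[0] * (capacity + 1) for _ in range(n + 2)]
    take = [[False] * (capacity + 1) for _ in range(n + 2)]
    for t in range(n, 0, -1):
        w, r = weights[t - 1], gains[t - 1]
        for s in range(capacity + 1):
            no = J[t + 1][s]
            yes = r + J[t + 1][s - w] if s >= w else -1  # -1 marks an infeasible action
            take[t][s] = yes > no  # ties are resolved as NO, as in the table
            J[t][s] = max(yes, no)
    return J, take


def selected_objects(weights, take, capacity):
    """Follow the optimal actions forward from the full capacity."""
    s, chosen = capacity, []
    for t in range(1, len(weights) + 1):
        if take[t][s]:
            chosen.append(t)
            s -= weights[t - 1]
    return chosen


weights = [2, 1, 1, 3, 2, 1]
gains = [8, 5, 5, 6, 3, 2]
J, take = knapsack_dp(weights, gains, 7)
print(J[1][7], selected_objects(weights, take, 7))
```

The column `J[1]` corresponds to the "time 1" column of the value-function table, and the forward pass recovers which objects to buy.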
Applications
Inventory control
Generic formulation:
• Time = 1, …, 7
• State st = Inventory at the beginning of period t
• State space = {0, 1, 2, …, 5}
• Action at time t = purchasing quantity ut of period t
• Action space A(st) = {max(0, dt – st), …, 5 + dt - st}
• Immediate cost at time t
gt(st, ut) = K + pt ut + ht(st + ut - dt) if ut > 0
= ht(st + ut - dt) if ut = 0
• State transition or system dynamics:
st+1 = st + ut - dt
Applications
Inventory control
Value function:
Jn(s) = minimal total cost over periods n, n+1, …, 6 by
starting with an inventory s at the beginning of period n.
Optimality equation:
J_n(s) = min over u ∈ A(s) of { g_n(s, u) + J_{n+1}(s + u – d_n) }, with J_7(s) = 0
Applications
Traveling salesman problem
Problem :
Data: a graph with N nodes and a distance matrix
[dij] between any two nodes i and j.
Question: determine a circuit of minimum total
distance visiting each node exactly once.
Extensions:
C(y, S): length of the shortest path from y to x0 visiting
each node in S exactly once.
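One way to use the extension C(y, S) is the recursion C(y, S) = min over z in S of { d(y, z) + C(z, S \ {z}) }, with C(y, ∅) = d(y, x0). The recursion itself is not shown on the slide and is assumed here; the function name `tsp_dp` and the 4-node distance matrix are illustrative.

```python
from functools import lru_cache


def tsp_dp(d, x0=0):
    """Length of a shortest tour through all nodes, starting and ending at x0."""
    n = len(d)
    others = frozenset(range(n)) - {x0}

    @lru_cache(maxsize=None)
    def C(y, S):
        # Shortest path from y back to x0 visiting each node of S exactly once.
        if not S:
            return d[y][x0]
        return min(d[y][z] + C(z, S - {z}) for z in S)

    return C(x0, others)


# Hypothetical symmetric distance matrix.
d = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]
print(tsp_dp(d))
```

The number of subproblems grows as N·2^N, so this sketch is only practical for small instances.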
Application: Machine scheduling with setups.
Job                 1  2  3
Due date di         5  6  5
Processing time pi  3  2  4
Weight wi           3  1  2
Stochastic dynamic programming
Model
Consider the control of
(i) a discrete-time stochastic dynamic system, with
(ii) costs generated over time
Stochastic dynamic programming
Model
System dynamics :
x_{t+1} = f_t(x_t, u_t, w_t), t = 0, 1, ..., N-1
where
t : time index
x_t : state of the system
u_t : decision at time t
w_t : random perturbation
[Figure: transition from state t to state t+1 under an action and a perturbation, generating a cost.]
Stochastic dynamic programming
Model
Criterion
Stochastic dynamic programming
Example
Consider a problem of ordering a quantity of a certain item at
each of N periods so as to meet a stochastic demand, while
minimizing the incurred expected cost.
xt+1 = xt + ut - wt
Stochastic dynamic programming
Example
Cost :
purchasing cost : c u_t
inventory cost : r(x_t + u_t - w_t)
Total cost:
[Figure: inventory system with stock x_t, order quantity u_t, demand w_t and next stock x_{t+1} = x_t + u_t - w_t.]
Stochastic dynamic programming
Model
Open-loop control:
Order quantities u1, u2, ..., uN-1 are determined once at time 0
Closed-loop control:
Stochastic dynamic programming
Control policy
The rule for selecting at each period t a control action ut
for each possible state xt.
Optimal control:
minimize J_π(x0) over all possible policies π
pij(u, t) = P{xt+1 = j | xt = i, ut = u}
Stochastic dynamic programming
Principle of optimality
Let π* = {μ*_0, ..., μ*_{N-1}} be an optimal policy for the basic
problem for the N time periods.
Then the truncated policy {μ*_i, ..., μ*_{N-1}} is optimal for the
following subproblem
• minimization of the following total cost (called cost-to-go
function) from time i to time N by starting with state xi at
time i
Stochastic dynamic programming
DP algorithm
Theorem: For every initial state x0, the optimal cost J*(x0) of
the basic problem is equal to J0(x0), given by the last step of the
following algorithm, which proceeds backward in time from
period N-1 to period 0
Stochastic dynamic programming
Example
Consider the inventory control problem with the following:
Stochastic dynamic programming
Example
Generic formulation:
• Time = {1, 2, 3, 4=end}
• State xt = inventory level at the beginning of a period
• State space = {0, 1, 2}
• Action ut = order quantity of period t
• Action space = {0, 1, …, 2 – xt}
• Perturbation dt = demand of period t
• Immediate cost = a ut + (xt + ut – dt)^2
• System dynamics xt+1 = max{0, xt + ut – dt}
Stochastic dynamic programming
Example
Value function:
Jn(s) = minimal total cost over periods n, n+1, …, 3 by
starting with an inventory s at the beginning of period n.
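Optimality equation:
J_n(s) = min over u ∈ {0, …, 2 – s} of E_d[ a u + (s + u – d)^2 + J_{n+1}(max{0, s + u – d}) ], with J_4(s) = 0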
Stochastic dynamic programming
Example – Immediate cost
a = 0.25, g(s,u,w) = 0.25u + (s + u – w)^2
mean stage cost = 0.1 g(s,u,0) + 0.2 g(s,u,1) + 0.7 g(s,u,2)
(s,u)   dt=w=0  dt=w=1  dt=w=2  mean stage cost
Pw      0.1     0.2     0.7
(0,0)   0       1       4       3
(0,1)   1.25    0.25    1.25    1.05
(0,2)   4.5     1.5     0.5     1.1
(1,0)   1       0       1       0.8
(1,1)   4.25    1.25    0.25    0.85
(2,0)   4       1       0       0.6
Stochastic dynamic programming
Example – Period 3-problem
Period n = 3 (a = 0.25; period-4 optimum J4(s') = 0 for s' = 0, 1, 2)
(s,u)   dt=w=0  dt=w=1  dt=w=2  mean total
Pw      0.1     0.2     0.7
(0,0)   0       1       4       3
(0,1)   1.25    0.25    1.25    1.05
(0,2)   4.5     1.5     0.5     1.1
(1,0)   1       0       1       0.8
(1,1)   4.25    1.25    0.25    0.85
(2,0)   4       1       0       0.6
Period n = 2 (a = 0.25; period-3 optimum J3(0) = 1.05, J3(1) = 0.8, J3(2) = 0.6)
(s,u)   dt=w=0  dt=w=1  dt=w=2  mean total
Pw      0.1     0.2     0.7
(0,0)   1.05    2.05    5.05    4.05
(0,1)   2.05    1.3     2.3     2.075
(0,2)   5.1     2.3     1.55    2.055
(1,0)   1.8     1.05    2.05    1.825
(1,1)   4.85    2.05    1.3     1.805
(2,0)   4.6     1.8     1.05    1.555
Period n = 1 (a = 0.25; period-2 optimum J2(0) = 2.055, J2(1) = 1.805, J2(2) = 1.555)
(s,u)   dt=w=0  dt=w=1  dt=w=2  mean total
Pw      0.1     0.2     0.7
(0,0)   2.055   3.055   6.055   5.055
(0,1)   3.055   2.305   3.305   3.08
(0,2)   6.055   3.305   2.555   3.055
(1,0)   2.805   2.055   3.055   2.83
(1,1)   5.805   3.055   2.305   2.805
(2,0)   5.555   2.805   2.055   2.555
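The period-by-period tables can be recomputed with a short backward induction. The sketch below uses the data readable from the tables (a = 0.25, demand probabilities 0.1, 0.2, 0.7, stock capacity 2, terminal value 0); the function name `stochastic_inventory_dp` and its parameters are illustrative.

```python
def stochastic_inventory_dp(a=0.25, demand_pmf=None, max_stock=2, periods=3):
    """Backward induction for the example: returns value functions and control maps."""
    if demand_pmf is None:
        demand_pmf = {0: 0.1, 1: 0.2, 2: 0.7}
    states = range(max_stock + 1)
    J = {periods + 1: {s: 0.0 for s in states}}  # terminal condition J_4 = 0
    policy = {}
    for n in range(periods, 0, -1):
        J[n], policy[n] = {}, {}
        for s in states:
            best_u, best_cost = None, float("inf")
            for u in range(max_stock - s + 1):
                # Expected stage cost plus expected cost-to-go of the next state.
                cost = sum(
                    p * (a * u + (s + u - w) ** 2 + J[n + 1][max(0, s + u - w)])
                    for w, p in demand_pmf.items()
                )
                if cost < best_cost - 1e-12:
                    best_u, best_cost = u, cost
            J[n][s], policy[n][s] = best_cost, best_u
    return J, policy


J, policy = stochastic_inventory_dp()
for n in (1, 2, 3):
    print(n, [round(J[n][s], 3) for s in range(3)], policy[n])
```

Each `J[n]` corresponds to the minimum "mean total" per stock level in the table of period n, and `policy[n]` is the order quantity attaining it.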
Stochastic dynamic programming
Example – value function & control
Optimal policy a =0,25
Stock 3-period policy 2-period policy 1-period policy
Stochastic dynamic programming
Example – Control map or policy
Stock   Period 1   Period 2   Period 3
0       2          2          1
1       1          1          0
2       0          0          0
Long-term to short-term:
From long-term policy: (s=0, u=2), (s=1, u=1), (s=2, u=0)
To myopic policy: (s=0, u=1), (s=1, u=0), (s=2, u=0)
Stochastic dynamic programming
Example – Sample paths
Control map
Stock   Period 1   Period 2   Period 3
0       2          2          1
1       1          1          0
2       0          0          0
Applications
Inventory management
Bus engine replacement
Highway pavement maintenance
Bed allocation in hospitals
Personnel staffing in fire departments
Traffic control in communication networks
…
Example
• Consider a system with one machine producing one product. The
processing time of a part is exponentially distributed with rate
p. Demand arrives according to a Poisson process of rate d.
• State Xt = stock level, Action : at = make or rest
[Figure: transition diagram over stock levels 0, 1, 2, 3, …, with demand transitions of rate d.]
Example
• Zero stock policy (M/M/1): π_0 = 1 – ρ, π_{-n} = ρ^n π_0, ρ = d/p
MDP = Markov Decision Process
MDP Model formulation
Decision epochs
Times at which decisions are made.
State and action sets
At each decision epoch, the system occupies a state.
Costs and Transition probabilities
As a result of choosing action a ∈ As in state s at decision epoch t,
• the decision maker receives a cost Ct(s, a) and
• the system state at the next decision epoch is determined by the
probability distribution pt(. |s, a).
A Markov decision process is characterized by {T, S, As, pt(. |s, a), Ct(s, a)}
Example of inventory management
Consider the inventory control problem with the following:
Example of inventory management
Decision epochs : T = {0, 1, 2, …, N}
Set of states : S = {0, 1, 2}, indicating the initial stock Xt
Action set : As, indicating the possible order quantities Ut
A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}
Cost function and transition probabilities:
(s, a)   Cost   p(0|s,a)   p(1|s,a)   p(2|s,a)
(0, 0)   3      1          0          0
(0, 1)   1.05   0.9        0.1        0
(0, 2)   1.1    0.7        0.2        0.1
(1, 0)   0.8    0.9        0.1        0
(1, 1)   0.85   0.7        0.2        0.1
(2, 0)   0.6    0.7        0.2        0.1
Decision Rules
A decision rule prescribes a procedure for action selection in each
state at a specified decision epoch.
Decision Rules
A decision rule can also be either
Decision Rules
As a result, the decision rules can be:
Policies
A policy specifies the decision rule to be used at each decision epoch.
Example
Decision epochs: T = {1, 2, …, N}
State : S = {s1, s2}
Actions: As1 = {a11, a12}, As2 = {a21}
Costs: Ct(s1, a11) = 5, Ct(s1, a12) = 10, Ct(s2, a21) = -1, CN(s1) = CN(s2) = 0
Transition probabilities: pt(s1 |s1, a11) = 0.5, pt(s2|s1, a11) = 0.5, pt(s1 |s1,
a12) = 0, pt(s2|s1, a12) = 1, pt(s1 |s2, a21) = 0, pt(s2 |s2, a21) = 1
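[Figure: transition diagram between S1 and S2; each arc is labeled {cost, probability}: a11 {5, .5} from S1 to S1 and from S1 to S2, a12 {10, 1} from S1 to S2, a21 {-1, 1} from S2 to S2.]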
Example
A deterministic Markov policy
Decision epoch 1:
d1(s1) = a11, d1(s2) = a21
Decision epoch 2:
d2(s1) = a12, d2(s2) = a21
One state, one action (also called control map)
Example
A randomized Markov policy
Decision epoch 1:
P1, s1(a11) = 0.7, P1, s1(a12) = 0.3
P1, s2(a21) = 1
Decision epoch 2:
P2, s1(a11) = 0.4, P2, s1(a12) = 0.6
P2, s2(a21) = 1
One state, one probability distribution over actions
Example
A deterministic history-dependent policy
Decision epoch 1:
d1(s1) = a11
d1(s2) = a21
Decision epoch 2 (one history, one action):
history h        d2(h)
(s1, a11, s1)    a13
(s1, a12, s1)    infeasible
(s1, a13, s1)    a11
(s2, a21, s1)    infeasible
(*, *, s2)       a21
[Figure: transition diagram as before, with an additional action a13 labeled {0, 1}.]
Example
A randomized history-dependent policy
Decision epoch 1:
P1, s1(a11) = 0.6
P1, s1(a12) = 0.3
Decision epoch 2:
history h        P(a = a11)   P(a = a12)   P(a = a13)
(s1, a11, s1)    0.4          0.3          0.3
Cost function and transition probabilities of the inventory example:
(s, a)   Cost   p(0|s,a)   p(1|s,a)   p(2|s,a)
(0, 0)   3      1          0          0
(0, 1)   1.05   0.9        0.1        0
(0, 2)   1.1    0.7        0.2        0.1
(1, 0)   0.8    0.9        0.1        0
(1, 1)   0.85   0.7        0.2        0.1
(2, 0)   0.6    0.7        0.2        0.1
Stochastic inventory control policies
State s = inventory at the beginning of a period
Action a = order quantity such that s + a ≤ 2
MD : Markovian and deterministic
Stationary: {s=0: a = 2, s=1: a=1, s=2: a = 0}
Nonstationary:
{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 5
{(s,a)=(0,1), (1,0), (2,0)} for period 6 on
MR : Markovian and randomized
Stationary: {s=0: a = 2 w.p. 0.5 a=0 w.p. 0.5, s=1: a=1, s=2: a = 0}
Nonstationary:
{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 5
{(s,a)=(0,2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} for period 6 on
where w.p. = with probability
Stochastic inventory control policies
HD : history dependent and deterministic
s   Action a
0   2 if lost sales (s+a-d < 0) for last two periods
    1 if demand for the last period
    0 if no demand for the last period
1   1 if lost sale for the last period
    0 if no demand for the last period
2   0
HR : history dependent and randomized
s   Action a
0   2 if lost sales for last two periods
    2 w.p. 0.5 & 0 w.p. 0.5 if demand for the last period
    1 w.p. 0.3 & 0 w.p. 0.7 if no demand for the last period
1   1 w.p. 0.5 & 0 w.p. 0.5 if lost sale for the last period
    0 if no demand for the last period
2   0
Remarks
Each Markov policy leads to a discrete time Markov Chain
and the policy can be evaluated by solving the related
Markov chain.
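As an illustration of this remark, the sketch below builds the Markov chain induced by a stationary deterministic policy on the inventory example and evaluates it through its stationary distribution. It assumes the example data (a = 0.25, demand probabilities 0.1, 0.2, 0.7), uses the long-run average cost as the evaluation measure, and relies on plain power iteration, which presumes an aperiodic chain with a single recurrent class; all function names are illustrative.

```python
def policy_chain(policy, demand_pmf, a=0.25, max_stock=2):
    """Transition matrix P and expected stage costs c of the chain induced by a policy."""
    n = max_stock + 1
    P = [[0.0] * n for _ in range(n)]
    c = [0.0] * n
    for s in range(n):
        u = policy[s]
        for w, p in demand_pmf.items():
            P[s][max(0, s + u - w)] += p
            c[s] += p * (a * u + (s + u - w) ** 2)
    return P, c


def stationary_distribution(P, iterations=1000):
    """Power iteration pi <- pi P (assumes an aperiodic unichain)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


demand_pmf = {0: 0.1, 1: 0.2, 2: 0.7}
P, c = policy_chain({0: 2, 1: 1, 2: 0}, demand_pmf)
pi = stationary_distribution(P)
average_cost = sum(pi[s] * c[s] for s in range(3))
print(pi, average_cost)
```

For the order-up-to-2 policy every row of `P` is the same, so the stationary distribution is reached after one step.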
Remarks
MD : Markovian and deterministic
s=0: a = 2,
s=1: a=1,
s=2: a = 0
MR : Markovian and randomized
s=0: a = 2 w.p. 0.5 a=0 w.p. 0.5,
s=1: a=1,
s=2: a = 0
Remarks
Nonstationary MD : Markovian and deterministic
{(s,a)=(0,2), (1,1), (2,0)} for period 1 to 2
{(s,a)=(0,1), (1,0), (2,0)} for period 3 on
Assumptions
Assumption 1: The decision epochs T = {1, 2, …, N}
Criterion:
Optimality of Markov deterministic policy
Theorem :
Assume S is finite or countable, and that As is finite for each
s ∈ S.
Optimality equations
Theorem : The following value functions
and the action a that minimizes the above term defines the
optimal policy.
Optimality equations
The optimality equation can also be expressed as:
Backward induction algorithm
•Set t = N and
3. Repeat 2 till t = 1.
Infinite Horizon discounted
Markov decision processes
Assumptions
Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite or countable
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a), do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | ≤ M for all a ∈ As
and all s ∈ S (to be relaxed)
Assumptions
Criterion:
where
0 < λ < 1 is the discounting factor
Π^HR is the set of all possible policies.
Discounting factor
Large discounting factor (λ close to 1) : long-term optimum
Optimality equations
Theorem: Under assumptions 1-5, the following optimal cost
function V*(s) exists:
Computation of optimal policy
Value Iteration
Value iteration algorithm:
1. Select any bounded value function V0, let n = 0
2. For each s ∈ S, compute
Computation of optimal policy
Value Iteration
Theorem: Under assumptions 1-5,
a.Vn converges to V*
b. The stationary policy defined in the value iteration
algorithm converges to an optimal policy.
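A minimal value iteration sketch on the inventory example follows. The update formula of the slide is not available in this text, so the usual discounted backup V(s) ← min over a of { C(s, a) + λ Σ_j p(j|s, a) V(j) } is assumed; the discount factor λ = 0.9, the stopping tolerance and all function names are illustrative choices.

```python
def inventory_mdp(a=0.25, demand_pmf=None, max_stock=2):
    """Expected costs and transition probabilities of the inventory example."""
    if demand_pmf is None:
        demand_pmf = {0: 0.1, 1: 0.2, 2: 0.7}
    cost, trans = {}, {}
    for s in range(max_stock + 1):
        for u in range(max_stock - s + 1):
            cost[s, u] = sum(p * (a * u + (s + u - w) ** 2) for w, p in demand_pmf.items())
            row = {}
            for w, p in demand_pmf.items():
                j = max(0, s + u - w)
                row[j] = row.get(j, 0.0) + p
            trans[s, u] = row
    return cost, trans


def q_value(s, u, V, cost, trans, lam):
    """Stage cost plus discounted expected value of the next state."""
    return cost[s, u] + lam * sum(p * V[j] for j, p in trans[s, u].items())


def value_iteration(cost, trans, lam=0.9, tol=1e-10):
    states = sorted({s for s, _ in cost})
    actions = {s: [u for (s2, u) in cost if s2 == s] for s in states}
    V = {s: 0.0 for s in states}  # any bounded starting function would do
    while True:
        new_V = {s: min(q_value(s, u, V, cost, trans, lam) for u in actions[s]) for s in states}
        gap = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if gap < tol:
            break
    # Stationary policy that is greedy with respect to the final V.
    policy = {s: min(actions[s], key=lambda u: q_value(s, u, V, cost, trans, lam)) for s in states}
    return V, policy


V, policy = value_iteration(*inventory_mdp())
print({s: round(v, 4) for s, v in V.items()}, policy)
```

With these assumed parameters the greedy policy is the order-up-to-2 rule, the "long-term policy" of the finite-horizon example.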
Computation of optimal policy
Policy Iteration
Policy iteration algorithm:
1. Select an arbitrary stationary policy π0, let n = 0
2. (Policy evaluation) Obtain the value function Vn of policy
πn.
3. (Policy improvement) Choose πn+1 = {dn+1, dn+1,…} such
that
Computation of optimal policy
Policy Iteration
Policy evaluation:
For any stationary deterministic policy π = {d, d, …}, its
value function
Computation of optimal policy
Policy Iteration
Theorem:
The value functions Vn generated by the policy iteration
algorithm are such that Vn+1 <= Vn.
Further, if Vn+1 = Vn, then Vn = V*.
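The two steps of policy iteration can be sketched on the same inventory example. The evaluation step below solves the linear system (I – λ P_d) V = C_d and the improvement step picks a minimizing action in each state; these formulas, the discount factor λ = 0.9 and all names are assumptions, since the equations of the slides are not available in this text. States are assumed to be numbered 0, 1, 2, … so that they can index the matrix.

```python
def inventory_mdp(a=0.25, demand_pmf=None, max_stock=2):
    """Expected costs and transition probabilities of the inventory example."""
    if demand_pmf is None:
        demand_pmf = {0: 0.1, 1: 0.2, 2: 0.7}
    cost, trans = {}, {}
    for s in range(max_stock + 1):
        for u in range(max_stock - s + 1):
            cost[s, u] = sum(p * (a * u + (s + u - w) ** 2) for w, p in demand_pmf.items())
            row = {}
            for w, p in demand_pmf.items():
                j = max(0, s + u - w)
                row[j] = row.get(j, 0.0) + p
            trans[s, u] = row
    return cost, trans


def solve_linear(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def policy_iteration(cost, trans, lam=0.9):
    states = sorted({s for s, _ in cost})
    actions = {s: sorted(u for (s2, u) in cost if s2 == s) for s in states}
    policy = {s: actions[s][0] for s in states}  # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - lam * P_d) V = C_d.
        A = [[(1.0 if i == j else 0.0) - lam * trans[i, policy[i]].get(j, 0.0)
              for j in states] for i in states]
        b = [cost[s, policy[s]] for s in states]
        V = dict(zip(states, solve_linear(A, b)))
        # Policy improvement: keep the current action unless another is strictly better.
        new_policy = {}
        for s in states:
            q = {u: cost[s, u] + lam * sum(p * V[j] for j, p in trans[s, u].items())
                 for u in actions[s]}
            best = min(q.values())
            new_policy[s] = policy[s] if q[policy[s]] <= best + 1e-12 else min(q, key=q.get)
        if new_policy == policy:
            return V, policy
        policy = new_policy


V, policy = policy_iteration(*inventory_mdp())
print({s: round(v, 4) for s, v in V.items()}, policy)
```

Keeping the current action on ties is a simple guard against cycling between equally good policies.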
Computation of optimal policy
Linear programming
Recall the optimality equation
Computation of optimal policy
Linear programming
Dual linear program
Extension to Unbounded Costs
Theorem 1. Under the condition C(s, a) ≥ 0 (or C(s, a) ≤ 0) for all
states s and control actions a, the optimal cost function V*(s) among
all stationary deterministic policies satisfies the optimality equation
Theorem 2. Assume that the set of control actions is finite. Then, under
the condition C(s, a) ≥ 0 for all states s and control actions a, we have
where VN(s) is the solution of the value iteration algorithm with V0(s) = 0.
Implication of Theorem 2 : The optimal cost can be obtained as the limit
of value iteration and the optimal stationary policy can also be obtained in
the limit.
Example
• Consider a computer system consisting of M different processors.
• Using processor i for a job incurs a finite cost Ci with C1 < C2 < ... < CM.
• When we submit a job to this system, processor i is assigned to our job with
probability pi.
• At this point we can (a) decide to go with this processor or (b) choose to hold the
job until a lower-cost processor is assigned.
• The system periodically returns to our job and assigns a processor in the same
way.
• Waiting until the next processor assignment incurs a fixed finite cost c.
Question:
How do we decide to go with the processor currently assigned to our job versus
waiting for the next assignment?
Suggestions:
• The state definition should include all information useful for decision
• The problem belongs to the class of so-called stochastic shortest path problems.
Why does it work: Preliminary
• Policy π value function (cost minimization)
Why does it work: DP & optimality equation
• DP (Dynamic Programming)
• Optimality equation
Why does it work: DP & optimality equation
• DP operator T
Why does it work : DP convergence
Lemma 1: If 0 ≤ C(s,a) ≤ M, then VN(s) converges monotonically
and lim N→∞ VN(s) = V*(s)
This property guarantees the existence of V*(s).
Proof. Part one is due to VN(s) ≤ VN+1(s) and VN(s) ≤ M/(1-λ).
Why does it work : convergence of value iteration
Lemma 2: If 0 ≤ C(s,a) ≤ M, then for any bounded function f,
lim N→∞ T^N f(s) = V*(s) and lim N→∞ Tπ^N f(s) = Vπ(s)
Why does it work : optimality equation
Theorem 1: If 0 ≤ C(s,a) ≤ M, V*(s) is the unique bounded
solution of the optimality equation. Moreover, a stationary
policy π is optimal iff π(s) is a minimizer of the right hand term for every s.
Why does it work : optimality equation
Theorem A: If 0 ≤ C(s,a) ≤ M, V*(s) is the unique bounded
solution of the optimality equation. Moreover, a stationary
policy π is optimal iff π(s) is a minimizer of the right hand term for every s.
Why does it work : convergence of policy iteration
Theorem B: The value functions Vn generated by the policy
iteration algorithm are such that Vn+1 ≤ Vn.
Infinite Horizon average cost
Markov decision processes
Assumptions
Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite
Assumption 3: The action space As is finite for each s
Assumption 4: Stationary costs and transition probabilities;
C(s, a) and p(j |s, a) do not vary from decision epoch to
decision epoch
Assumption 5: Bounded costs: | Ct(s, a) | ≤ M for all a ∈ As
and all s ∈ S
Assumption 6: The Markov chain corresponding to any
stationary deterministic policy contains a single recurrent
class. (Unichain)
Assumptions
Criterion:
where
Π^HR is the set of all possible policies.
Optimal policy
Main Theorem: Under Assumptions 1-6,
• There exists an optimal stationary deterministic policy.
• There exists a real g and a value function h(s) that satisfy the
following optimality equation:
• For any solutions (g, h) and (g’, h’) of the optimality equation:
(a) g = g’ is the optimal average reward;
(b) h(s) = h’(s) + k (translation closure);
• Any maximizer of the optimality equation is an optimal policy.
Relation between discounted and average cost MDP
differential
cost
for any given state x0.
Relation between discounted and average cost MDP
Relation between discounted and average cost MDP
• Why
if limits interchangeable
Blackwell optimality
Computation of the optimal policy by LP
Recall the optimality equation:
Computation of optimal policy
Value Iteration
1. Select any bounded value function h0 with h0(s0) = 0, let n = 0
2. For each s ∈ S, compute
Computation of optimal policy : Policy Iteration
3. Policy improvement:
Extensions to unbounded cost
Theorem. Assume that the set of control actions is finite. Suppose
that there exists a finite constant L and some state x0 such that
|Vλ(x) - Vλ(x0)| ≤ L
for all states x and for all λ ∈ (0,1). Then, for some sequence {λn}
converging to 1, the following limits exist and satisfy the optimality
equation.
Why does it work : convergence of policy iteration
Theorem: If all policies generated by policy iteration are unichains,
then gn+1 ≥ gn.
Continuous time Markov decision
processes
Assumptions
Assumptions
Criterion:
Example
• Consider a system with one machine producing one product. The
processing time of a part is exponentially distributed with rate p.
Demand arrives according to a Poisson process of rate d.
• State Xt = stock level, Action : at = make or rest
[Figure: transition diagram over stock levels 0, 1, 2, 3, …, with demand transitions of rate d.]
Uniformization
Any continuous-time Markov chain can be converted to a
discrete-time chain through a process called
« uniformization ».
Uniformization
In order to synchronize (uniformize) the transitions at the same
pace, we choose a uniformization rate
γ ≥ max{μ(i)}
« Uniformized » Markov chain with
• transitions that occur only at instants generated by a common
Poisson process of rate γ (also called standard clock)
• state-transition probabilities
pij = μij / γ
pii = 1 - μ(i)/ γ
where the self-loop transitions correspond to fictitious events.
Uniformization
CTMC: S1 → S2 at rate a, S2 → S1 at rate b
Step 1: Determine the rate of each state
μ(S1) = a, μ(S2) = b
Step 2: Select a uniformization rate
γ ≥ max{μ(i)}
Step 3: Add self-loop transitions to the states of the CTMC
(uniformized CTMC: self-loop of rate γ-a at S1 and γ-b at S2).
Step 4: Derive the corresponding uniformized DTMC
P(S1, S1) = 1-a/γ, P(S1, S2) = a/γ, P(S2, S1) = b/γ, P(S2, S2) = 1-b/γ
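The four steps can be written as a small function. This is a minimal sketch: the function name `uniformize`, the rate-matrix representation and the numerical rates a = 2, b = 3 are illustrative assumptions.

```python
def uniformize(rates, gamma=None):
    """Convert CTMC transition rates into a uniformized DTMC.

    rates[i][j] : transition rate from state i to state j (diagonal ignored)
    gamma       : uniformization rate; defaults to the largest total exit rate
    """
    n = len(rates)
    # Step 1: total rate out of each state.
    out = [sum(rates[i][j] for j in range(n) if j != i) for i in range(n)]
    # Step 2: choose a uniformization rate at least as large as every exit rate.
    if gamma is None:
        gamma = max(out)
    if gamma < max(out):
        raise ValueError("gamma must be at least the largest exit rate")
    # Steps 3 and 4: the self-loop absorbs the remaining probability 1 - out[i] / gamma.
    P = [[rates[i][j] / gamma if j != i else 1.0 - out[i] / gamma for j in range(n)]
         for i in range(n)]
    return P, gamma


# Two-state chain of the slide with hypothetical rates a = 2 and b = 3.
a, b = 2.0, 3.0
P, gamma = uniformize([[0.0, a], [b, 0.0]], gamma=a + b)
print(P, gamma)
```

Choosing a larger `gamma` only adds more fictitious self-loop transitions; each row of `P` still sums to one.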
Uniformization
μ(0,0) = λ1+λ2
μ(1,0) = μ1+λ2
μ(0,1) = λ1+μ2
μ(1,1) = μ1
Uniformization
Uniformization
[Figure: continuous-time chain over stock levels 0, 1, 2, 3 with demand rate d, and its uniformized version with demand transitions of probability d/γ and self-loops labeled (not make, p/γ).]
Uniformization
Under the uniformization,
• a sequence of discrete decision epochs T1, T2, … is generated
where Tk+1 – Tk ~ EXP(γ).
• The discrete-time Markov chain describes the state of the system at
these decision epochs.
• All criteria can be easily converted.
Costs over one uniformized period, with action a chosen in state s and next state j:
• C(s,a): continuous cost per unit time
• K(s,a): fixed cost
• k(s,a,j): fixed cost depending on the next state j
Cost function conversion
for uniformized Markov chain
Optimality equation: discounted cost case
Equivalent discrete time discounted MDP
• a discrete-time Markov chain with uniform transition rate γ
• a discount factor λ = γ/(γ+β)
• a stage cost given by the sum of
─ continuous cost C(s, a)/(β+γ),
─ K(s, a) for fixed cost incurred at T0
─ λ Σ_j k(s,a,j) p(j|s,a) for fixed cost incurred at T1
Optimality equation
Optimality equation: average cost case
Equivalent discrete time average-cost MDP
• a discrete-time Markov chain with uniform transition rate γ
• a stage cost given by C(s, a)/γ whenever a state s is entered
and an action a is chosen.
Optimality equation for average cost per uniformized period:
where
• g = average cost/uniformized period,
• γg = average cost/time unit,
• h(s) = differential cost with respect to reference state s0 and h(s0) = 0
Optimality equation: average cost case
Multiplying both sides of the optimality equation by γ leads to:
where
• G = γg : optimal average cost per time unit
• H(s) = modified differential cost with H(s) = γ(V(s) – V(s0))
Optimality equation: average cost case
Alternative optimality equation 2: Hamilton-Jacobi-Bellman equation
where
Example (continued)
Uniformize the Markov decision process with rate γ = p+d
Example (continued)
From the optimality equation:
V(s+1) –V(s) > 0 and the decision is not producing, for all s >= K and
V(s+1) –V(s) <= 0 and the decision is producing, for all s < K
Example (continued)
Convexity proved by value iteration
Proof by induction.
V0 is convex.
If Vn is convex with minimum
at s = K, then Vn+1 is convex.
s
K-1 K
Example (continued)
Convexity proved by value iteration
•Assume Vn is convex with minimum at s = K.
•Vn+1 is convex if
i.e. ΔU(s) ≤ ΔU(s+1) where ΔU(s) = U(s+1) – U(s)
•True for s +1 < K-1 and s > K-1 by induction.
•Proof established if
DU(K-2) D U(K-1) DVn(K-1) 0
DU(K-1) D U(K) 0 D Vn(K)
V(s+1) V(s)
s
K-1 K
Condition for optimality of
monotone policy
(first order properties)
Monotone policy
Monotone policy
p(s) nondecreasing or nonincreasing in s
Submodularity and Supermodularity
A function g(x, y) is said to be supermodular if for x+ ≥ x- and y+ ≥ y-,
g(x+, y+) + g(x-, y-) ≥ g(x+, y-) + g(x-, y+)
It is said to be submodular if
g(x+, y+) + g(x-, y-) ≤ g(x+, y-) + g(x-, y+)
Submodularity and Supermodularity
Supermodular functions:
Dynamic Programming Operator
• DP operator T
equivalently
DP Operator: monotonicity preservation
DP Operator: control monotonicity
Batch delivery model
• Customer demand Dt for a product arrives over time.
• State set S = {0, 1, …}: quantity of pending demand
• Action set A = {0=no delivery, 1=deliver all pending demand}
• Cost C(s,a) = hs(1-a) + aK
where
h = unit holding cost,
K = fixed delivery cost
• Transition snext = s(1-a) + D where P(D=i) = pi, i=0, 1, …
GOAL: minimize the total cost
Structural property: submodularity, a(s) nondecreasing
Batch delivery model
A machine replacement model
• The machine deteriorates by a random number I of states per period
• State set S = {0, 1, …} from best to worst condition
• Action set A = {1=replace, 0=not replace}
• Reward r(s,a) = R – h(s(1-a)) – aK
where
R = fixed income per period,
h(s) = nondecreasing operation cost
K = replacement cost
• Transition snext = s(1-a) + I where P(I=i) = pi, i=0, 1, …
GOAL: maximize the total reward
Structural property: supermodularity, a(s) nondecreasing
A machine replacement model
A general framework for
value function property analysis
Introduction: event operators
Introduction: a single server queue
• exponential server
• Poisson arrivals of which the admission can be controlled
• λ: arrival rate
• μ: service rate, λ+μ = 1
• c: unit rejection cost
• C(x): holding cost of x customers
Introduction: discrete-time queue
One-dimension models : operators
One-dimension models : properties
One-dimension models : property propagation
Proof of Lemma 1.
Tcosts and Tunif : results follow directly as increasingness and convexity are closed under convex combinations.
TA(1) : results follow directly, by replacing x by x + e1 in the inequalities.
TFS(1) : certain terms cancel out.
TD(1) : Increasingness follows as for TA(1), except if x1 = b1. In this case TD(1)f(x) = TD(1)f(x + e1). Also, for
convexity the only non-trivial case is x1 = b1. This reduces to f(x) ≤ f(x + e1).
TMD(1) : Roughly the same arguments are used.
TAC(1) :
a single server queue
• λ: arrival rate, μ: service rate, λ+μ = 1
• c: unit rejection cost
• C(x): holding cost of x customers
discrete-time queue
Production-inventory system
Multi-machine production-inventory with preemption
Examples of Tenv(i)
Two-dimension models : operators
Two-dimension models : properties
[Figure: diagrams illustrating the properties Conv, Super and SuperC; not recoverable from the extracted text.]
2-dimension models : property propagation
2-dimension models
Control structure under Super(1, 2) + SuperC(1, 2)
TAC(1) & TAC(2) : decreasing switching curve below which customers are admitted.
TCD(1) and TCD(2) can be seen as dual to TAC(1) and TAC(2), with corresponding
results.
2-dimension models
Control structure under
Super(1, 2) + SuperC(1, 2)
SuperC(1, 2):
TR: an increasing switching curve above
(below) which customers are assigned to
queue 1 (2)
SuperC(1, 2):
TCJ(1,2): the optimal control is increasing in
x1 and decreasing in x2, i.e. an increasing
switching curve, below which jockeying
occurs.
[Figure: switching curve in the (queue 1, queue 2) plane.]
2-dimension models : property propagation
2-dimension models
Control structure under Super(1, 2)
2-dimension models : property propagation
2-dimension models
Control structure under Sub(1, 2) + SubC(1, 2)
Sub(1, 2) + SubC(1, 2) Conv(1) + Conv(2)
SubC(1, 2)
TAC(1) (TAC(2)) : increasing switching curve above (below) which customers are
admitted.
Also the effects of TCD(i) amount to balancing in some sense the two queues.
TACF(1,2) has a decreasing switching curve below which customers are admitted
2-dimension models
Control structure under Sub(1, 2) + SubC(1, 2)
[Figure: admission and no-admission regions for TAC(1) and TAC(2) in the (queue 1, queue 2) plane.]
Examples: a queue served by two servers
Examples: a queue served by two servers
TCJ(1,2)
To slow
Examples: production line with Poisson demand
Examples: tandem queues with Poisson demand
x2
M1 produce M2 produce
x1
Examples: admission control of tandem queues
Examples: cyclic tandem queues
Multi-machine production-inventory with non preemption
AC(1)
AC(2)
Examples: stochastic knapsack
Examples