Veinott Dynamic Programming Course Notes

Lectures in Dynamic Programming

and Stochastic Control

Arthur F. Veinott, Jr.

Spring 2008
MS&E 351 Dynamic Programming and Stochastic Control

Department of Management Science and Engineering


Stanford University
Stanford, California 94305
Copyright © 2008 by Arthur F. Veinott, Jr.
Contents
1 Discrete-Time-Parameter Finite Markov Population Decision Chains...................................1
1 FORMULATION................................................................................................................................. 1
2 MAXIMUM N-PERIOD VALUE: RECURSION AND EXAMPLES................................................ 3
Minimum-Cost Chain........................................................................................................................4
Supply Management......................................................................................................................... 5
Exercising a Call Option...................................................................................................................6
System Reliability............................................................................................................................. 7
Maximum Expected N-Period Instantaneous Rate of Return: Portfolio Selection.......................... 8
3 MAXIMUM EXPECTED N-PERIOD UTILITY WITH CONSTANT RISK POSTURE................ 9
Expected Utilities and Risk Aversion/Preference............................................................................. 9
Constant Additive Risk Posture..................................................................................................... 10
Constant Multiplicative Risk Posture.............................................................................................11
Additive and Multiplicative Utility Functions................................................................................11
Maximum Expected N-Period Symmetric Multiplicative Utility................................... 12
Maximum N-Period Instantaneous Rate of Expected Return........................................12
4 MAXIMUM VALUE IN CIRCUITLESS SYSTEMS.........................................................................12
Knapsack Problem (Capital Budgeting)......................................................................................... 13
Portfolio Selection........................................................................................................................... 14
5 DECISIONS, POLICIES, OPTIMAL-RETURN OPERATOR.........................................................14
6 MAXIMUM N-PERIOD VALUE: FORMALITIES..........................................................................15
Advantages of Maximum-N-Period-Value Policies.........................................................................16
Rolling Horizons and the Backward and Forward Recursions........................................................16
Limitations of Maximum-N-Period-Value Policies......................................................................... 17
7 MAXIMUM VALUE IN TRANSIENT SYSTEMS............................................................................17
Why Study Infinite-Horizon Problem............................................................................................. 17
Characterization of Transient Matrices.......................................................................................... 18
Transient System............................................................................................................................ 19
Comparison Lemma........................................................................................................................ 19
Policy-Improvement Method...........................................................................................................20
Stationary Maximum-Value Policies: Existence and Characterization...........................................21
Newton’s Method: A Specialization of Policy-Improvement Method............................................. 22
Successive Approximations: Maximum N-Period Value Converges to Maximum Value............... 22
Geometric Interpretation: a Single-State Reliability Example........................................................23
System Degree, Spectral Radius and Polynomial Boundedness......................................................24
Geometric Convergence of Successive Approximations.................................................................. 27
Contraction Mappings.....................................................................................................................28
Linear-Programming Method.......................................................................................................... 29
State-Action Frequency...................................................................................................................31
Simplex Method: A Specialization of Policy-Improvement Method............................................... 34
Running Times................................................................................................................................35
Stochastic Constraints.....................................................................................................................37
Stationary Randomized Policies......................................................................................................37
Supply Management with Service Constraints............................................................................... 39
Maximum Present Value.................................................................................................................39
Multi-Armed Bandit Problem: Optimality of Largest-Index Rule..................................................41


8 MAXIMUM PRESENT VALUE WITH SMALL INTEREST RATES IN BOUNDED SYSTEMS.... 45


Bounded System..............................................................................................................................46
Strong Maximum Present Value in Bounded Systems................................................................... 46
Cesàro Limits and Neumann Series................................................................................................ 46
Stationary and Deviation Matrices................................................................................................. 47
Laurent Expansion of Resolvent..................................................................................................... 49
Laurent Expansion of Present Value for Small Interest Rates....................................................... 50
Characterization of Stationary Strong Maximum-Present-Value Policies...................................... 50
Application to Controlling Service and Rework Rates in a G/M/∞ Queue.................................. 51
Strong Policy-Improvement Method............................................................................................... 53
Existence and Characterization of Stationary Strong Maximum-Present-Value Policies...............54
Truncation of Infinite Matrices....................................................................................................... 55
n-Optimality: Efficient Implementation of the Strong Policy-Improvement Method.....................57
9 CESÀRO OVERTAKING OPTIMALITY WITH IMMIGRATION IN BOUNDED SYSTEMS.....60
Controlled Queueing Network with Proportional Service Rates.....................................................61
Cash Management...........................................................................................................................62
Manpower Planning........................................................................................................................ 62
Insurance Management................................................................................................................... 62
Asset Management.......................................................................................................................... 63
Immigration Stream........................................................................................................................ 63
Cohort and Markov Policies............................................................................................................63
Overtaking Optimality.................................................................................................................... 65
Cesàro Overtaking Optimality........................................................................................................ 66
Convolutions................................................................................................................................... 66
Binomial Coefficients and Sequences.............................................................................................. 67
Polynomial Expansion of Expected Population Sizes with Binomial Immigration........................ 68
Polynomial Expansion of N-Period Values.....................................................................70
Comparison Lemma for N-Period Values....................................................................... 72
Cesàro Overtaking Optimality with Binomial Immigration Stream...............................................73
Reward-Rate Optimality.................................................................................................................73
Cesàro Overtaking Optimality........................................................................................................ 74
Float Optimality............................................................................................................................. 74
Value Interpretation of Immigration Stream.................................................................................. 75
Combining Physical and Value Immigration Streams.................................................................... 75
Cesàro Overtaking Optimality with More General Immigration Streams...................................... 76
Future-Value Optimality................................................................................................................ 78
10 SUMMARY.......................................................................................................................................78

2 Team Decisions, Certainty Equivalents and Stochastic Programming.................................81


1 FORMULATION AND EXAMPLES................................................................................................ 81
Airline Reservations with Uncertain Demand.................................................................................82
Inventory Control with Uncertain Demand.................................................................................... 83
Transportation Problem with Uncertain Demand.......................................................................... 84
Capacity Planning with Uncertain Demand................................................................................... 84
2 REDUCTION OF STOCHASTIC TO ORDINARY MATHEMATICAL PROGRAMS..................85
Linear and Quadratic Programs......................................................................................................86
Computations.................................................................................................................................. 87
Comparison of Dynamic and Stochastic Programming.................................................................. 87

3 QUADRATIC UNCONSTRAINED TEAM DECISION PROBLEMS............................................. 88


4 SEQUENTIAL QUADRATIC UNCONSTRAINED TEAMS: CERTAINTY EQUIVALENTS...... 89
Interpretation of Solution................................................................................................................90
Computations.................................................................................................................................. 90
Quadratic Control Problem............................................................................................................ 91
Rocket Control................................................................................................................................ 91
Multiproduct Supply Management................................................................................................. 91
Dynamic Programming Solution with Zero Random Errors...........................................................92
Solution with Independent Random Errors.................................................................................... 94
Solution with Dependent Random Errors.......................................................................................94
Strengths and Weaknesses.............................................................................................................. 95

3 Continuous-Time-Parameter Markov Population Decision Processes..................................97


1 FINITE CHAINS: FORMULATION................................................................................................. 97
2 MAXIMUM T-PERIOD VALUE: BELLMAN’S EQUATION AND EXAMPLES...........................98
Controlled Queues...........................................................................................................................99
Supply Management........................................................................................................................99
Project Scheduling...........................................................................................................................99
3 PIECEWISE-CONSTANT POLICIES, GENERATORS, TRANSITION MATRICES................. 100
Nonnegative, Substochastic and Stochastic Transition Matrices..................................................101
4 CHARACTERIZATION OF MAXIMUM T-PERIOD-VALUE POLICIES................................... 101
5 EXISTENCE OF MAXIMUM T-PERIOD-VALUE POLICIES..................................................... 103
6 MAXIMUM T-PERIOD VALUE WITH A SINGLE STATE.........................................................106
7 EQUIVALENCE PRINCIPLE FOR INFINITE-HORIZON PROBLEMS......................................108
8 MAXIMUM PRINCIPLE................................................................................................................. 110
Linear Control Problem................................................................................................................ 112
Markov Population Decision Chain.............................................................................................. 113
9 MAXIMUM PRESENT VALUE FOR CONTROLLED ONE-DIMENSIONAL DIFFUSIONS.....113
Diffusions.......................................................................................................................................113
Probability of Reaching One Boundary Before the Other............................................................ 116
Mean Time to Reach Boundary.................................................................................................... 117
Controlled Diffusions.....................................................................................................................117
Maximizing the Probability of Accumulating Given Wealth........................................................118

Appendix: Functions of Matrices............................................................................................. 121


1 MATRIX NORM.............................................................................................................................. 121
2 EIGENVALUES AND VECTORS...................................................................................................122
3 SIMILARITY....................................................................................................................................122
4 JORDAN FORM.............................................................................................................................. 122
5 SPECTRAL MAPPING THEOREM...............................................................................................123
6 MATRIX DERIVATIVES AND INTEGRALS............................................................................... 124
7 MATRIX EXPONENTIALS............................................................................................................ 125
8 MATRIX DIFFERENTIAL EQUATIONS...................................................................................... 125

References................................................................................................................................. 127
BOOKS................................................................................................................................................ 127
SURVEYS............................................................................................................................................128
ARTICLES.......................................................................................................................................... 128
Index of Symbols.................................................................................................................................... 131

Homework Assignments

Homework 1 (4/11/08)
Exercising a Put Option
Requisition Processing
Matrix Products
Bridge Clearance
Homework 2 (4/18/08)
Dynamic Portfolio Selection with Constant Multiplicative Risk Posture
Airline Overbooking
Sequencing: Optimality of Index Policies
Homework 3 (4/25/08)
Multifacility Linear-Cost Production Planning
Discovering System Transience
Successive Approximations and Newton’s Method Find Nearly Optimal Policies in Linear Time
Homework 4 (5/2/08)
Component Replacement
Optimal Stopping Policy
Simple Stopping Problems
House Buying
Homework 5 (5/9/08)
Bayesian Statistical Quality Control and Repair
Minimum Expected Present Value of Sojourn Times
Optimal Control of Tandem Queues
Homework 6 (5/16/08)
Limiting Present-Value Optimality with Binomial Immigration
Maximizing Reward Rate by Linear Programming
Homework 7 (5/23/08)
Discovering System Boundedness
Finding the Maximum Spectral Radius
Irreducible Systems and Cesàro-Geometric-Overtaking Optimality
Homework 8 (5/30/08)
Element-Wise Product of Symmetric Positive Semi-Definite Matrices
Quadratic Unconstrained Team-Decision Problem with Normally Distributed Observations
Optimal Baking
Homework 9 (6/4/08)
Pricing a House for Sale
Transient Systems in Continuous Time
1
Discrete-Time-Parameter Finite
Markov Population Decision Chains
1 FORMULATION
A discrete-time-parameter finite Markov population decision chain is a system that involves
a finite population evolving over a sequence of periods labeled 1, 2, … and over which one can
exert some control. The system description depends on four data elements, viz., states, actions,
rewards and transition rates. In each period N, each individual is in some state s in a set 𝒮 of
S < ∞ states. The state summarizes all “relevant” information about the system history H as of
period N. Each individual in state s chooses an action a from a finite set A_{sH} = A_s of possible
actions, earns a reward r(s, a, H) = r(s, a), and generates a (finite) expected¹ number p(t | s, a, H)
= p(t | s, a) ≥ 0 of individuals in state t ∈ 𝒮 in period N+1. Call p(t | s, a) the transition rate.
The assumptions that A, r and p depend on H only through s are essential for a state s in a
period to summarize all relevant information about the system history H as of the period. Observe
that the size of the population may vary over time. Also, there is no interaction among the
individuals. Figure 1 illustrates this situation.

¹For nearly all optimality concepts considered in the sequel, it suffices to consider only the expected numbers
of individuals entering each state rather than the distribution of the actual random numbers of individuals
entering each state. For that reason, those distributions are not considered explicitly here.

[Figure 1: an individual in state s takes action a and generates an expected number p(t | s, a) of progeny in each state t.]

Call a system stochastic (resp., substochastic) if ∑_t p(t | s, a) = 1 (resp., ≤ 1) for each a ∈ A_s
and s ∈ 𝒮. An important instance of a stochastic (resp., substochastic) system is that in which
p(t | s, a) is the probability that an individual in state s who takes action a in a period generates
a single individual in state t.² Call a stochastic system deterministic if the p(t | s, a) are all 0 or 1,
because then for each s and a there is a unique t ∈ 𝒮 for which p(t | s, a) = 1 and p(· | s, a) is 0
at every other state.
Examples of States and Actions in Various Applications. The table below gives examples of
states and actions in several application areas.

Application State Action


Manage supply chain Inventory levels of products Choose product order times/quantities
Maintain road Condition of road Select resurfacing option
Invest in securities Portfolio of securities Buy/sell securities: times and amounts
Inspect lot Number of defectives Accept/reject lot, continue sampling
Route calls Nodes in network Send a call at one node to another
Control queueing network Queue sizes at each station Set service rates at each station
Overbook flight Number of reservations Book/decline reservation request
Market product Goodwill Advertise product
Manage reservoirs Water levels at reservoirs Release water from reservoirs
Insure asset Risk category Set policy premium
Patrol area Car locations, service requests Reposition cars
Guide rocket Position and velocity Choose retrorockets to fire
Care for a patient Condition of patient Conduct tests and treatments

In practice, it is often the case that the reward r(s, a, T) that a substochastic system earns
when an individual in a period takes action a in state s depends on s, a and an auxiliary random
variable T whose conditional distribution given s, a and the system history at that time depends
only on s and a. To reduce this problem to the one above, it suffices to let r(s, a) =
E(r(s, a, T) | s, a) be the conditional expected reward that the system earns when it is in state s
and takes action a therein. For example, if T is the next state that the system visits after taking
action a in state s, then r(s, a) = ∑_t r(s, a, t) p(t | s, a).

²The last condition rules out a system in which an individual in a period generates more than one individual
in the next period even though such a system may still be stochastic (resp., substochastic).

2 MAXIMUM N-PERIOD VALUE: RECURSION AND EXAMPLES


Why Maximize Expected Finite-Horizon Rewards? The most common goal in the above setting
is to find a “policy”, i.e., a rule that specifies the action to take in each state with N or fewer
periods to go, that maximizes the expected N-period reward where there is a given terminal value
of being in each state with no periods to go. Finite-horizon optimality concepts like this require
users to specify the terminal value, though it is often difficult to do. While decision makers are
generally not interested in a fixed number of periods, this approach is often used for several reasons.
• Simplicity. Though fixing a particular finite horizon is often rather arbitrary, the concept is simple.
By contrast, optimality concepts for an infinite horizon—perhaps the main alternative—are more
subtle and varied.
• Realism. Optimality over a finite horizon is often more realistic. Users frequently think that they
can specify the terminal value well enough to facilitate making good decisions in earlier periods and
are comfortable with finite horizons. This way, they do not have to make explicit projections beyond
the finite horizon—which they might view as speculative anyway—though of course these projections
must instead be reflected in the terminal-value function. Few users are prepared to think
about the indefinite future of which their own lives occupy such a minuscule part.
• Extension to Nonstationary Data. Optimality concepts over a finite horizon generally adapt easily
to nonstationary data without any increase in computational complexity. By contrast, the computations
needed to assure infinite-horizon optimality rise rapidly with the extent of nonstationarity.
This is an important advantage of finite-horizon concepts because nonstationarity is common in life.

Maximum N-Period Value. We begin by examining this problem informally. Let V_s^N be the
maximum expected N-period reward, called the N-period value, that an individual (and his
progeny) can earn starting from state s, assuming for the moment that this maximum is
achieved. Now if an individual chooses an action a ∈ A_s in the first period and the individual’s
progeny use an “optimal” policy in the remaining N−1 periods, then

   V_s^N ≥ r(s, a) + ∑_{t∈𝒮} p(t | s, a) V_t^{N−1},

the first term on the right being the reward in the first period and the sum being the expected
reward in the remaining N−1 periods when an optimal policy is used therein.

Moreover, if the individual chooses an “optimal” action a in the first period, then equality occurs
above. This idea, called the principle of optimality, implies the dynamic-programming recursion:

(1)   V_s^N = max_{a∈A_s} [ r(s, a) + ∑_{t∈𝒮} p(t | s, a) V_t^{N−1} ]

for s ∈ 𝒮 and N = 1, 2, …, where V_s^0 is the given terminal value in state s. This recursion permits
one to calculate {V_s^1}, then {V_s^2}, then {V_s^3}, and so on. Once V_s^N is computed, one optimal
action a_s^N in state s with N periods to go is simply any action in A_s that achieves the maximum
on the right-hand side of (1). Thus, when there are N periods to go, each individual in
state s can be assumed to take the same action without loss of optimality. The recursion (1) is
of fundamental importance in a broad class of applications.
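To make the recursion concrete, here is a minimal Python sketch of (1) under the assumption that
rewards, transition rates and terminal values are stored in plain dictionaries; the data layout, the
function name and the two-state example are illustrative, not part of the notes.

```python
# Recursion (1): V^N_s = max over a in A_s of [ r(s,a) + sum_t p(t|s,a) * V^{N-1}_t ].
# Illustrative layout: actions[s] lists A_s, r[s][a] is a reward, p[s][a][t] a transition rate.

def n_period_values(states, actions, r, p, V0, N):
    """Return [V^0, ..., V^N] and, for each horizon, an optimal action in each state."""
    V = [dict(V0)]
    best = []
    for n in range(1, N + 1):
        Vn, an = {}, {}
        for s in states:
            Vn[s], an[s] = max(
                (r[s][a] + sum(p[s][a].get(t, 0.0) * V[n - 1][t] for t in states), a)
                for a in actions[s]
            )
        V.append(Vn)
        best.append(an)
    return V, best

# A tiny two-state, substochastic example (numbers made up):
states = ["s1", "s2"]
actions = {"s1": ["a", "b"], "s2": ["a"]}
r = {"s1": {"a": 1.0, "b": 2.0}, "s2": {"a": 0.5}}
p = {"s1": {"a": {"s1": 0.9}, "b": {"s2": 1.0}}, "s2": {"a": {"s1": 0.5, "s2": 0.4}}}
V, best = n_period_values(states, actions, r, p, {"s1": 0.0, "s2": 0.0}, N=3)
print(V[3], best[-1])   # 3-period values and optimal first-period actions
```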
Minimum N-Period Cost. If r(s, a) is instead a “cost” and the aim is to minimize expected
N-period cost, then the “max” in (1) should be replaced by “min”. Then V_s^N is the minimum
expected N-period cost starting from state s. This alternate form of the dynamic-programming
recursion appears often in the sequel with C_s^N replacing V_s^N.
Deterministic Case. In the deterministic case, there may be several actions that take an individual
from state s to t. In that event, it is best to choose one with maximum reward and eliminate
the others. Then one can identify actions in state s with states t visited next, so (1) simplifies to

(2)   V_s^N = max_{t∈A_s} [ r(s, t) + V_t^{N−1} ]

for s ∈ 𝒮 and N = 1, 2, …. As Figure 2 illustrates, V_s^N can be thought of as the maximum N-period
reward that an individual can earn in traversing an N-step chain that begins in state s and
earns a terminal reward V_t^0 when it ends in state t.

[Figure 2: an N-step chain beginning in state s in period N, passing through periods N−1, N−2, …, 1, 0 and earning the terminal reward V_t^0 in its final state t.]

Examples
1 Minimum-Cost Chain. The minimum-cost-chain problem described above has a terminal cost
in each state. One specialization of this problem is to find a minimum-cost chain from every state
to a given terminal state τ in N steps or less. Thus if c(s, t) is the cost of moving from state s to
state t in one step and C_s^N is the minimum N-step-or-less cost of moving from state s to state τ,
then C_τ^N = 0 for N ≥ 1, C_s^1 = c(s, τ) and

(3)   C_s^N = min_{t∈A_s} [ c(s, t) + C_t^{N−1} ],   N = 2, 3, … and s ∈ 𝒮 \ {τ}.

If the associated (directed) graph 𝒢 ≡ (𝒮, 𝒜) with node set 𝒮 and arc set 𝒜 ≡ {(s, t) : t ∈ A_s} has
no circuit (i.e., directed cycle) around which the total cost incurred is negative, then

(4)   C_s^N = C_s^{S−1}   for s ∈ 𝒮 \ {τ} and N ≥ S − 1.

This is because a minimum-cost chain from s to τ need not visit a node twice. For if it did, the
chain could be shortened and its cost reduced by eliminating the circuit created by visiting the
indicated node twice, as Figure 3 illustrates. Thus a minimum-cost chain from any node in 𝒮 \ {τ}
to τ can be assumed to have S − 1 arcs or less, justifying (4). If one takes care to choose the
minimizer t = t_s^N in (3) so that t_s^N = t_s^{N−1} whenever C_s^N = C_s^{N−1}, then the t_s^N = t_s will be independent
of N ≥ S − 1 by (4). Also, the subgraph of 𝒢 with arcs (s, t_s), s ∈ 𝒮 \ {τ}, is a tree, and the
unique simple chain therein from each node s ≠ τ to τ is a minimum-cost chain from s to τ.

2 6 2 6
7 7
1 3 5 1
= =
4
Figure 3

Computational Effort (or Complexity). How much computation is required to calculate C_s ≡
C_s^{S−1} for all s ∈ 𝒮 \ {τ}? Let A be the number of arcs in 𝒜. Then for each N, evaluation of (3)
requires A additions and nearly the same number of comparisons. Since this must be done for
each N = 2, …, S−1, the total number of additions and comparisons is about (S − 2)A.
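A short Python sketch of recursion (3): one pass over the arcs per step, repeated S − 2 times, yields
C_s^{S−1} and a successor tree when there is no negative circuit. The graph representation
(dictionaries keyed by nodes and arcs) is an assumption made only for illustration.

```python
import math

# Recursion (3): C_s <- min over arcs (s,t) of c(s,t) + C_t, with C_tau = 0 throughout.

def min_cost_chains(nodes, arcs, cost, tau):
    C = {s: 0.0 if s == tau else cost.get((s, tau), math.inf) for s in nodes}   # C_s^1
    succ = {s: tau for s in nodes if s != tau}
    for _ in range(len(nodes) - 2):                   # N = 2, ..., S-1
        for s in nodes:
            if s == tau:
                continue
            for t in arcs.get(s, []):
                if cost[s, t] + C[t] < C[s]:          # strict inequality keeps the old minimizer on ties
                    C[s], succ[s] = cost[s, t] + C[t], t
    return C, succ   # the arcs (s, succ[s]) form a tree of minimum-cost chains to tau
```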

2 Supply Management. One of the areas in which dynamic programming has been used most
widely is to find optimal supply-management policies. The reason for this is that nearly all firms
carry inventories and the investment in them is sizable. For example, US manufacturing and
trade inventories alone in 2000 were 1.205 trillion dollars, or 12% of the entire US gross national
product of 9.963 trillion dollars that year!³

As an illustration of the role of dynamic programming in such problems, suppose that the demands
for a single product in successive periods are independent and identically-distributed random
variables. At the beginning of a period, the supply manager observes the initial stock s, 0 ≤ s
≤ S, and orders a nonnegative amount with immediate delivery, bringing the starting stock to a,
s ≤ a ≤ S. If the demand D in the period exceeds a, the excess demand D − a is lost. There is
an ordering cost c(a − s) and a holding and penalty cost h(a − D) in the period. Let C_s^N be the
minimum expected N-period cost starting with the initial stock s. Then (C_·^0 ≡ 0)

   C_s^N = min_{s≤a≤S} [ c(a − s) + E h(a − D) + E C_{(a−D)^+}^{N−1} ]

for 0 ≤ s ≤ S and N = 1, 2, ….

³2001 Statistical Abstract of the United States, Tables 756 and 640.
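A Python sketch of this recursion for a demand D with a known finite distribution; the capacity,
cost functions and demand probabilities below are placeholders chosen only so the example runs.

```python
# C[N][s] = min over a in {s,...,S_bar} of c(a-s) + E h(a-D) + E C[N-1][(a-D)^+],
# where D has probability mass function pmf and excess demand is lost.

def supply_costs(S_bar, order_cost, holding_penalty, pmf, N):
    C = [[0.0] * (S_bar + 1)]                          # C^0 = 0
    order_up_to = []
    for n in range(1, N + 1):
        Cn, an = [0.0] * (S_bar + 1), [0] * (S_bar + 1)
        for s in range(S_bar + 1):
            best = None
            for a in range(s, S_bar + 1):
                exp_cost = order_cost(a - s) + sum(
                    prob * (holding_penalty(a - d) + C[n - 1][max(a - d, 0)])
                    for d, prob in pmf.items()
                )
                if best is None or exp_cost < best:
                    best, an[s] = exp_cost, a
            Cn[s] = best
        C.append(Cn)
        order_up_to.append(an)
    return C, order_up_to

# Example: demand is 0, 1 or 2 with equal probability (illustrative numbers).
C, policy = supply_costs(
    S_bar=4,
    order_cost=lambda q: 0.0 if q == 0 else 2.0 + 1.0 * q,
    holding_penalty=lambda y: 0.5 * y if y >= 0 else -3.0 * y,
    pmf={0: 1 / 3, 1: 1 / 3, 2: 1 / 3},
    N=5,
)
print(C[5][0], policy[-1][0])   # 5-period cost and order-up-to level from empty stock
```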

3 Exercising a Call Option. One of the most active places in which dynamic programming is
used today is Wall Street. To illustrate, consider the problem of determining when to exercise
an (American) call option to buy a stock, ignoring commissions. The option gives the purchaser
the right to buy the stock at the strike price s* ≥ 0 on any of the next N days.

Two questions arise. When should the option be exercised? What is its value? To answer
these questions requires a stock-price model and a dynamic-programming recursion to find the
value of the option as well as an optimal option-exercise policy.⁴

Consider the following stock-price model. Suppose that the stock price is s on day N ≥ 0,
i.e., N days before the option expires. If N > 0, assume that the stock price on the following day
N−1 is sR_N where R_1, R_2, … are independent identically distributed nonnegative random variables
with the same distribution as a nonnegative random variable R. Then r ≡ R − 1 is the rate
of return for a day, and Er is the expected rate of return that day.

Let V_s^N be the value of the option on day N when the market price of the stock is s. There
are two alternatives that day. One is to exercise the option to buy the stock at the strike price
and immediately resell it, which earns s − s*. The other is not to exercise the option that day, in
which case the maximum expected income in the remaining N−1 days is E V_{sR}^{N−1}. Since one seeks
the alternative with higher expected future income, the value of the option on expiration day is
V_s^0 = (s − s*)^+ and on day N > 0 is given recursively by

(5)   V_s^N = max(s − s*, E V_{sR}^{N−1}),   N = 1, 2, ….

Thus it is optimal to exercise the option on expiration day if s − s* > 0, and not do so otherwise.
And it is optimal to exercise the option on day N > 0 if s − s* > E V_{sR}^{N−1}, and not do so otherwise.
Nonnegative Expected Rate of Return. Consider now the question of when it is optimal to
exercise the option. It turns out that as long as the expected rate of return is nonnegative, i.e.,
Er ≥ 0 or equivalently ER ≥ 1, the answer is to wait until the expiration day. To establish this
fact, it suffices to show that

(6)   V_s^N = E V_{sR}^{N−1}
⁴This formulation of the problem of when to exercise an (American) call option addresses the situation faced
by an investor who wishes to profit from speculation on the market price of a stock. By contrast, the formulation
of the problem in finance addresses the problem faced by an institution who wishes to price an option
to avoid market risk and rely on commissions for profits. Though the assumptions differ, the computations
are similar.

for each N > 0 and all s ≥ 0. To that end, observe from (5) for N > 1 and by definition for N =
1 that V_s^{N−1} ≥ s − s* for each s ≥ 0. Thus because ER ≥ 1, it follows that E V_{sR}^{N−1} ≥ E(sR − s*)
≥ s − s*. Hence from (5) again, (6) holds. Hence, if the expected rate of return is nonnegative, it
is optimal not to exercise the option at any market price when N > 0 days remain until expiration.
Furthermore, V_s^N = E(sR^N − s*)^+ is the value of the option, where R^N ≡ ∏_{i=1}^N R_i, so sR^N is the
price at expiration.
Negative Expected Rate of Return. Suppose now that the expected rate of return is negative,
i.e., Er < 0. To analyze equation (5), it turns out to be useful to subtract s − s* from both
sides of (5) and make the change of variables U_s^N = V_s^N − (s − s*). Then (5) reduces to the equivalent
system

(5)′   U_s^N = max(0, sEr + E U_{sR}^{N−1})

for N = 1, 2, … where U_s^0 = (s* − s)^+. Now we claim that U_s^N is decreasing and continuous in
s ≥ 0 for each N and lim_{s→∞} U_s^N = 0. Certainly that is so for N = 0. Suppose it is so for N−1
and consider N. Then E U_{sR}^{N−1} is decreasing and continuous in s and approaches 0 as s → ∞.
Consequently, there is a smallest s = s_N ≥ 0 such that sEr + E U_{sR}^{N−1} ≤ 0. Thus, it follows that
the maximum on the right side of (5)′ is sEr + E U_{sR}^{N−1} if s < s_N and 0 if s ≥ s_N. Since (5) and
(5)′ are equivalent, it follows that this rule is optimal for (5) as well, i.e., it is optimal to wait
if s < s_N and to exercise the option if s ≥ s_N. In short, if the expected rate of return is negative
and if N days remain until expiration of the option, then there is a price limit s_N such that it is
optimal to wait if the price is below s_N and to exercise the option if the price is s_N or higher.
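A sketch of recursion (5) when the daily return R takes just two values; with ER ≥ 1 the computed
value agrees with waiting until expiration, and with ER < 1 exercise becomes optimal above a price
limit each day. The two-point return model and all numbers are assumptions for illustration.

```python
from functools import lru_cache

# Recursion (5) for a two-point return: R = u with prob q, R = d with prob 1-q.
# value(n, s) = max(s - strike, E value(n-1, s*R)), value(0, s) = max(s - strike, 0).

def option_value(s0, strike, u, d, q, N):
    @lru_cache(maxsize=None)
    def value(n, s):
        exercise = s - strike
        if n == 0:
            return max(exercise, 0.0)
        wait = q * value(n - 1, s * u) + (1 - q) * value(n - 1, s * d)
        return max(exercise, wait)
    return value(N, s0)

# Here ER = qu + (1-q)d = 0.995 < 1, so early exercise is optimal at high enough prices.
print(option_value(s0=100.0, strike=95.0, u=1.05, d=0.94, q=0.5, N=10))
```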

4 System Reliability. A system consists of a finite set Z of components. The system is observed
once a period and each component is found to be “working” or “failed.” If s is the subset
of components that is observed to be working in some period, then a subset t \ s of the failed
components may be replaced at a cost r_{t\s}, where t (⊇ s) is the set of working components after
replacement. The expected operating cost incurred during the period is then p_t. Some components
may fail during the period, with the conditional distribution of the random set W of working
components in the next period, given that t is the working set after replacement in the period and
given the past history, depending on t, but not otherwise on the past history. Let C_s^N be the
minimum expected N-period cost when s is the initial working set. Then (C_·^0 ≡ 0),

   C_s^N = min_{s⊆t⊆Z} [ r_{t\s} + p_t + E(C_W^{N−1} | t) ]

for ∅ ⊆ s ⊆ Z and N = 1, 2, ….



5 Maximum Expected N-Period Instantaneous Rate of Return: Portfolio Selection. Suppose
that in a stochastic system, the return the system earns in periods 1, …, N is the product
R_1 ⋯ R_N of the nonnegative random returns R_1, …, R_N in those periods. For example, if the
rate of return in period i is 15%, then the return in period i is R_i = 1.15. Now let 100ρ_N% be
the (random) instantaneous rate of return per period over the N periods assuming continuous
compounding. Then e^{ρ_N N} = R_1 ⋯ R_N, so

(7)   ρ_N = (1/N)[ln R_1 + ⋯ + ln R_N].

Now since ln R_i is the instantaneous rate of return in period i, the problem of maximizing the
expected instantaneous rate of return over N periods reduces to maximizing the sum E ln R_1 +
⋯ + E ln R_N of the expected instantaneous rates of return in those N periods.

Now assume that the conditional distribution of R_i, given that the system starts in state s in
period i and takes action a ∈ A_s in that period, and given the past history, is independent of i
and of the past history. Then the corresponding conditional expected value r(s, a) of ln R_i is independent
of i.

Note that a necessary condition for r(s, a) to be finite in this example is that the conditional
probability that R_i is zero, given that the system starts in state s in period i and takes action
a ∈ A_s in that period, is zero. The reason is that a zero return corresponds to an instantaneous
rate of return equal to −∞ and so is to be avoided at all costs if the goal is to maximize the expected
instantaneous rate of return.

Observe that maximizing the expected instantaneous rate of return is different from maximizing
the instantaneous rate of expected return. The former is equivalent to maximizing the expected
value of the sum of the logarithms of the returns whereas the latter entails maximizing the expected
value of the product of the returns. Thus, the former involves additive rewards while the
latter involves multiplicative rewards. For example, if you have the opportunity to invest in a
security whose return R in a period is 0 or 3, each with equal probability, then E ln R = −∞ and
ER = 1.5. Thus, the expected instantaneous rate of return is −∞ whereas the instantaneous rate
of expected return is 50%.
Optimal Portfolio Selection. A simple example of the above development is portfolio selection.
Suppose that the returns that various securities earn depend on the market state s ∈ 𝒮, e.g., market
index, earnings, interest rates, time, etc. The actions available in market state s form a set A_s of
several portfolios of securities. The one-period return of a portfolio a in market state s is a random
variable whose conditional distribution given s, a and the past history depends only on s and
a. Denote by r(s, a) the corresponding conditional expected value of the instantaneous rate of
return. Finally, the probability that the market state in a period is t, given that the market state
in the prior period is s and that a portfolio a is chosen at that time, and given the history of the
process, depends only on s and t. This reflects the fact that a single portfolio manager generally
does not influence the market. Thus, p(t | s, a) = p(t | s). Let V_s^N be the maximum expected value
of the instantaneous rate of return over N periods when the market state is initially s. Then (1)
holds with p(t | s) replacing p(t | s, a). Consequently, an action in A_s maximizes the right-hand
side of (1) if and only if it maximizes r(s, a), because the term ∑_t p(t | s) V_t^{N−1} is independent
of a. This means that the optimal policy is myopic, i.e., maximizes the (expected) reward in each
period alone without regard for the future. Thus, in the present case, it suffices to separately maximize
the expected instantaneous rate of return in each period.
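The decoupling can be checked numerically: with p(t | s, a) = p(t | s), the action attaining the
maximum in (1) is the same at every horizon and equals the myopic one. A small sketch on
synthetic data (all numbers invented):

```python
import random

# When p(t|s,a) = p(t|s), an action maximizes the right side of (1) iff it maximizes r(s,a).

random.seed(0)
S, A = 3, 4
r = [[random.uniform(-0.05, 0.10) for a in range(A)] for s in range(S)]   # E ln R by state, portfolio
p = [[random.random() for t in range(S)] for s in range(S)]
p = [[w / sum(row) for w in row] for row in p]                            # market transitions p(t|s)

myopic = [max(range(A), key=lambda a: r[s][a]) for s in range(S)]

V = [0.0] * S
for n in range(50):                                   # recursion (1) with p(t|s,a) = p(t|s)
    cont = [sum(p[s][t] * V[t] for t in range(S)) for s in range(S)]
    dp = [max(range(A), key=lambda a: r[s][a] + cont[s]) for s in range(S)]
    V = [r[s][dp[s]] + cont[s] for s in range(S)]
    assert dp == myopic                               # myopic action remains optimal at every horizon
print(V)   # maximum expected sum of instantaneous rates of return over 50 periods
```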

3 MAXIMUM EXPECTED N-PERIOD UTILITY WITH CONSTANT RISK POSTURE


[HM72], [Ar65], [Ro75a]
Expected Utility and Risk Aversion/Preference
Expected Utility. A decision maker’s preferences among gambles, which we take to be
random N-vectors, can often be expressed by a utility function, i.e., a real-valued function u defined
on the range (assumed in ℝ^N) of the set of gambles. When this is so, a decision maker who
has a choice between two gambles will prefer one with higher expected utility, i.e., if X and Y
are gambles, then the decision maker prefers X to Y if Eu(X) ≥ Eu(Y). Eminently plausible
axioms implying that a decision maker’s preferences can be represented by a utility function are
discussed in books on the foundations of decision theory.

Risk Aversion/Preference. A decision maker whose preferences can be represented by a utility
function u is called a risk averter (resp., risk preferrer) if Eu(X) ≤ u(EX) (resp., Eu(X) ≥
u(EX)) for all gambles X with finite expectations, i.e., the decision maker prefers a certain
(resp., an uncertain) gamble to an uncertain (resp., a certain) one having the same expected
value. As examples, managers and investors are usually risk averters, as are buyers of insurance.
By contrast, those who gamble at casinos and at race tracks are usually risk preferrers. Of
course, a risk averter may still prefer an uncertain gamble to a certain one if the former has
higher expectation and u is increasing (which is usually the case). Indeed, in that event only
such gambles will be of interest if they are real valued. For if Y is a constant random variable
and X is an uncertain one that is preferred to Y, then u(Y) ≤ Eu(X) ≤ u(EX), whence Y ≤ EX.
That is why investors do not exhibit much interest in risky securities whose expected returns are
less than those available on safe ones.
A decision maker is a risk averter (resp., preferrer) if and only if u is concave (resp., convex).
To see this, observe first that it suffices to establish the claim for a risk averter, since if u is the
utility function for a risk preferrer, −u is the utility function for a risk averter. To show the
“only if” part, observe that for all N-vectors x, y and nonnegative numbers p, q with p + q = 1,
a risk averter would rather receive the certain gamble px + qy than the gamble that yields x
with probability p and y with probability q (and so has the same expected value), i.e., u(px + qy)
≥ pu(x) + qu(y), which is the definition of concavity. The “if” part is known as Jensen’s inequality
and may be proved as follows. Suppose X is a random N-vector with finite expectation EX and
that u is concave. Then since u is concave, there is a d ∈ ℝ^N (the supergradient of u at EX, or
the gradient there if it exists) such that u(x) ≤ u(EX) + d(x − EX) for all x ∈ ℝ^N. Thus u(X)
≤ u(EX) + d(X − EX), so on taking expected values, Eu(X) ≤ u(EX).
Examples. Suppose c = (c_i), x = (x_i) ∈ ℝ^N. For x ≫ 0, i.e., x_i > 0 for each i, let x^c ≡ x_1^{c_1} ⋯ x_N^{c_N}
and ln x ≡ (ln x_i). Then the functions cx, −e^{cx} and, for c ≥ 0 and x ≫ 0, ln x^c = ∑_i c_i ln x_i
exhibit risk aversion, while the first and the negatives of the last two exhibit risk preference. For
x ≫ 0, the functions ± x^c and, for c having both positive and negative components, ln x^c generally
do not exhibit either risk aversion or risk preference.
General (even concave) utility functions are usually too complex to permit one to do the
computations needed to choose between gambles. For that reason, it is of interest to consider
more tractable utility functions with special structures that arise naturally in applications. One
such class of utility functions that is tractable in dynamic programming is that with constant
risk posture, i.e., for which one’s posture towards a class of gambles is independent of one’s level
of wealth. We now discuss this concept for the classes of additive and multiplicative gambles.
Constant Additive Risk Posture
A decision maker has constant additive risk posture if his posture towards every additive gamble
Y is independent of his wealth x ∈ ℝ^N, i.e., Eu(x + Y) − u(x) has constant sign in x whenever
u(x + Y) has finite expected value for every x. This hypothesis is plausible for large firms
whose total assets are much larger than the investments they consider. In any case, the hypothesis
is satisfied if and only if, apart from an affine transformation au + b of u, u has the form

(1)   u(x) = e^{cx} for all x
or
(2)   u(x) = cx for all x

for some row vector c ∈ ℝ^N. For the “if” part of the above result, observe that the sign of
Eu(x + Y) − u(x), which equals e^{cx}[Ee^{cY} − 1] if (1) holds and cEY if (2) holds, is independent
of x. Notice that au + b is concave (resp., convex) if (1) holds and a ≤ 0 (resp.,
a ≥ 0), in which case the decision maker is a constant additive risk averter (resp., preferrer).

Constant Multiplicative Risk Posture


Alternately, a decision maker has constant multiplicative risk posture if his posture toward
any positive multiplicative gamble Y ≫ 0 is independent of his positive wealth x ∈ ℝ^N, i.e.,
Eu(x ∘ Y) − u(x) has constant sign in x ≫ 0 whenever u(x ∘ Y) has finite expected value for
all x ≫ 0, where x ∘ Y ≡ (x_i Y_i). In most situations this hypothesis seems more reasonable for a
wide range of wealth levels than does constant additive risk posture. In any case, the hypothesis
is satisfied if and only if, apart from an affine transformation au + b of u, u has the form

(3)   u(x) = x^c for all x ≫ 0
or
(4)   u(x) = ln x^c for all x ≫ 0

and some c ∈ ℝ^N. This result follows from that for the additive case on making the change of
variables x′ = ln x and Y′ = ln Y, and defining u′ by the rule u′(x′) = u(x). Then u(x ∘ Y) =
u′(x′ + Y′), so u exhibits constant multiplicative risk posture if and only if u′ exhibits constant
additive risk posture. In particular,

   u(x) = x^c if and only if u′(x′) = e^{cx′},
while
   u(x) = ln x^c if and only if u′(x′) = cx′.

Additive and Multiplicative Utility Functions


A utility function u(r) of rewards r = (r_1, …, r_N) earned in periods 1, …, N that exhibits additive
or multiplicative risk posture is either additive, i.e.,

(5a)   u(r) = u_1(r_1) + ⋯ + u_N(r_N)

for some functions u_1, …, u_N on the real line, or multiplicative, i.e.,

(5b)   u(r) = ± u_1(r_1) ⋯ u_N(r_N)

for some nonnegative functions u_1, …, u_N on a subset of the real line. The situation may be summarized
as follows.

• Constant Additive Risk Posture. Suppose u(r) exhibits constant additive risk posture. Then
up to a positive affine transformation, either u(r) = cr is additive with u_i(v) = c_i v, or u(r) =
± e^{cr} is multiplicative with u_i(v) = e^{c_i v}. Thus, the only strictly risk-averting (resp., -preferring)
constant-additive-risk-posture utility functions are multiplicative and exponential.

• Constant Multiplicative Risk Posture. Suppose u(r) exhibits constant multiplicative risk posture
on the positive orthant. Then, up to a positive affine transformation, either u(r) = ln r^c = c ln r
is additive with u_i(v) = c_i ln v, or u(r) = ± r^c is multiplicative with u_i(v) = v^{c_i} for v > 0. Thus,
the only strictly risk-averting (resp., -preferring) constant-multiplicative-risk-posture utility functions
are additive and logarithmic.

Maximum Expected N-Period Symmetric Multiplicative Utility


Now consider an S-state stochastic system that consists of a single individual with state-space
𝒮, action sets A_s, transition probabilities q(t | s, a) and, for this paragraph only, one-period rewards
r(s, a, t) that depend not only on s and a, but also on t. Suppose also that the utility u(r) of the
rewards r = (r_1, …, r_N) earned in periods 1, …, N is additive or multiplicative, i.e., (5a) or (5b)
holds. For simplicity, also assume that u is symmetric, i.e., u_1 = ⋯ = u_N = û, say. When u is
additive, the dynamic-programming recursion (1) in §1.2 applies directly by simply replacing the
one-period rewards r(s, a, t) by their utilities û(r(s, a, t)). Now consider the case when u is multiplicative.
Then let V_s^N be the maximum expected N-period utility starting from state s. Hence,
since u(r_1, …, r_N) = û(r_1) u(r_2, …, r_N) for N ≥ 2 and u(r_1) = ± û(r_1) where û ≥ 0, one sees that

   V_s^N = max_{a∈A_s} ∑_{t∈𝒮} û(r(s, a, t)) q(t | s, a) V_t^{N−1}

for s ∈ 𝒮 and N = 1, 2, … where V^0 ≡ ±1 (1 is an S-vector of ones). Thus,

(6)   V_s^N = max_{a∈A_s} ∑_{t∈𝒮} p(t | s, a) V_t^{N−1}

for s ∈ 𝒮 and N = 1, 2, … where for each a ∈ A_s and s, t ∈ 𝒮,

(7)   p(t | s, a) ≡ û(r(s, a, t)) q(t | s, a).

Observe that ∑_{t∈𝒮} p(t | s, a) may exceed one even though ∑_{t∈𝒮} q(t | s, a) = 1. Thus (7) is an instance
of branching, and the problem is one of maximizing (resp., minimizing) the size of the expected
total population in all states at the end of N periods from each initial state, where V^0 = 1
(resp., V^0 = −1).
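A minimal sketch of the reduction (6)–(7): scale each transition probability by the utility of the
associated one-period reward, then iterate the ordinary recursion on the resulting transition rates.
The data layout (dictionaries keyed by states and actions) and the function name are assumptions
made only for illustration.

```python
# Reduction (7): p(t|s,a) = u_hat(r(s,a,t)) * q(t|s,a); recursion (6) then runs on the
# (generally non-substochastic) rates p.  Per the text, V0 is +1 or -1 in every state.

def expected_multiplicative_utility(states, actions, q, rew, u_hat, V0, N):
    p = {(s, a): {t: u_hat(rew[s, a, t]) * q[s, a].get(t, 0.0) for t in states}
         for s in states for a in actions[s]}
    V = dict(V0)
    for _ in range(N):
        V = {s: max(sum(p[s, a][t] * V[t] for t in states) for a in actions[s])
             for s in states}
    return V
```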

Maximum N-Period Instantaneous Rate of Expected Return.


The above development also applies to the problem in which the goal is to maximize the instantaneous
rate of expected return. In that event, û(r) = r, so (7) simplifies to

(7)′   p(t | s, a) ≡ r(s, a, t) q(t | s, a).

4 MAXIMUM VALUE IN CIRCUITLESS SYSTEMS


In the preceding two subsections we studied the problem of maximizing the N-period value.
In many circumstances one is interested in earning the maximum value over an infinite horizon.
Problems of this type are not generally well posed because the sum of the expected rewards in
periods 1, 2, … may diverge.

The simplest situation in which this difficulty does not arise is that in which the system is
“circuitless”. To describe this concept, it is useful to introduce the system graph 𝒢, i.e., the directed
graph whose nodes are the states and whose arcs are the ordered pairs (s, t) of states for
which p(t | s, a) > 0 for some a ∈ A_s. Call the system circuitless if the system graph has no circuits,
i.e., there is no sequence of states that begins and ends with the same state and for which
each successive ordered pair (s, t) of states in the sequence is an arc of 𝒢.

For example, the N-period problems of §1.2 and §1.3 are circuitless systems. To see this, include
the period in the state, so the state-space for the N-period problem consists of the pairs
(s, n) ∈ 𝒮 × 𝒩 where 𝒩 = {1, …, N} is the set of periods. Then since no period can be revisited,
the system is circuitless.

In circuitless systems, it is possible to relabel the states as 1, …, S so that each state is accessible
only to states with higher numbers. Consequently, the number of transitions before the
population disappears is at most |𝒮| − 1 = S − 1 because no state can be revisited. Thus, the
maximum value V_s starting from state s is finite because it is a sum of at most S finite terms.
Then by an argument like that used to justify (1) of §1.2, one sees that V_s satisfies

(1)   V_s = max_{a∈A_s} [ r(s, a) + ∑_{t>s} p(t | s, a) V_t ],   s ∈ 𝒮.

The V_s may then be calculated recursively in the order V_S, …, V_1.

Examples
1 Knapsack Problem (Capital Budgeting). The problems of choosing items to fill a knapsack
to maximize total value, and of allocating limited capital to projects to maximize revenue,
are both instances of an important problem called the knapsack problem. To describe the problem,
let C be a finite set of positive integers representing amounts of capital investment required
for each of a group of projects. Let r_c be the projected revenue when c is invested and S be the
(positive) capital available. The problem can be posed as the integer program of finding an integer
vector x = (x_c) that

   maximizes ∑_{c∈C} r_c x_c
subject to
   ∑_{c∈C} c x_c = S and x ≥ 0.

(This formulation allows replication of projects; if this is impossible, then assume x ≤ 1.) Also
assume 1 ∈ C, so the knapsack problem is feasible for S = 0, 1, …. Let 𝒮 = {0, …, S} be the
state-space. The actions available in state s are the investments c ∈ C for which c ≤ s. Since
each investment reduces the amount of capital available for subsequent investments, the system
is circuitless. Let V_s be the maximum revenue that can be earned when the capital available is s.
Now V_0 ≡ 0.

A Dynamic-Programming Recursion. One dynamic-programming recursion for this problem is

   V_s = max_{c∈C, c≤s} [ r_c + V_{s−c} ]

for s = 1, …, S. Of course, V_S is the desired maximum revenue, and (V_S/S − 1)·100% is the internal
rate of return on the invested capital S, assuming that all revenue is received at a common time.
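A Python sketch of this recursion; the set C of admissible investment levels and the revenues are
invented, and including c = 1 with zero revenue keeps every capital level feasible as assumed above.

```python
# V_s = max over c in C with c <= s of r_c + V_{s-c}, with V_0 = 0.

def knapsack_values(revenue, S):
    V = [0.0] + [float("-inf")] * S
    choice = [None] * (S + 1)
    for s in range(1, S + 1):
        for c, rc in revenue.items():
            if c <= s and rc + V[s - c] > V[s]:
                V[s], choice[s] = rc + V[s - c], c
    return V, choice

revenue = {1: 0.0, 2: 3.0, 3: 5.0, 5: 9.0}           # illustrative projects; 1 is in C
V, choice = knapsack_values(revenue, S=11)
print(V[11], (V[11] / 11 - 1) * 100)                 # maximum revenue and internal rate of return (%)
```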
An Alternate Dynamic-Programming Recursion. Dynamic-programming recursions for solving
a problem are not unique. For example, the V_s may also be determined from the alternate
dynamic-programming (branching) recursion

   V_s = max_{u+v=s, u,v>0} [ V_u + V_v ] ∨ r_s,   s ∈ C,
and
   V_s = max_{u+v=s, u,v>0} [ V_u + V_v ],   s ∉ C.
2 Portfolio Selection (Manne). In each period s, 1 ≤ s < S, there is a set A_s of securities
available for investment. One dollar invested in security a ∈ A_s in period s generates p(t | s, a)
≥ 0 dollars in period t = s+1, …, S. Of course ∑_{t=s+1}^S p(t | s, a) normally exceeds one, so the
system is branching. Then the state-space is 𝒮 = {1, …, S}. Since no period can be revisited, the
system is circuitless. The goal is to find, for each s, the maximum income V_s that can be earned
by period S from each dollar invested in period s. Then V_S = 1 and

   V_s = max_{a∈A_s} ∑_{t=s+1}^S p(t | s, a) V_t,   s = 1, …, S−1.

In this case, the internal rate of return on capital invested in period s < S is ((V_s)^{1/(S−s)} − 1)·100%.
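A sketch of the recursion, working backward from period S; the securities and payout numbers are
invented, and the dictionary p[s][a][t] is assumed to hold the dollars paid in period t per dollar
invested in security a in period s.

```python
# V[S] = 1 and V[s] = max_a sum_{t > s} p(t|s,a) * V[t]; the internal rate of return on a
# dollar invested in period s is (V[s]**(1/(S-s)) - 1) * 100 percent.

def best_investment_values(S, p):
    V = {S: 1.0}
    for s in range(S - 1, 0, -1):
        V[s] = max(sum(payout * V[t] for t, payout in p[s][a].items()) for a in p[s])
    return V

p = {1: {"bond": {2: 1.04}, "note": {3: 1.10}},
     2: {"bond": {3: 1.04}},
     3: {"bond": {4: 1.05}, "stock": {4: 0.60, 5: 0.70}},   # pays in two periods: branching
     4: {"bond": {5: 1.05}}}
V = best_investment_values(S=5, p=p)
print({s: round((V[s] ** (1 / (5 - s)) - 1) * 100, 2) for s in V if s < 5})   # IRR in %
```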

5 DECISIONS, POLICIES, OPTIMAL-RETURN OPERATOR


An informal definition of a “policy” appears on page 3. It is now time to make that concept
precise. At the same time, we introduce matrix and operator notation to simplify the development.
A decision is a function $ that assigns to each state = − f an action $ = − E= . Thus ? ´ ‚=−f E=
is the set of decisions. A policy is a sequence 1 œ Ð$" ß $# ß á Ñ of decisions. Using 1 œ Ð$R Ñ means
=
that if an individual is in state = in period R , then the individual uses action $R in that period.
Now ?_ ´ ? ‚ ? ‚ â is the set of policies. Finally, call a policy $ _ œ Ð$ ß $ ß á Ñ stationary if it
uses the same decision $ in each period.
For any $ − ?, let <$ be the W -element column vector whose =>h element is <Ð=ß $ = Ñ, i.e., <$ is
the one-period reward vector when using the decision $ . Also let T$ be the W ‚ W transition ma-
trix whose =>>2 element is :Ð> l =ß $ = Ñ, i.e., T$ is the (one-step) transition matrix using $ . If 1 œ
Ð$" ß $# ß á Ñ is a policy, let T1R ´ T$" â T$R be the R -step transition matrix using 1 ÐT1! ´ MÑ.
Thus T$R ´ T$R_ œ ÐT$ ÑR . Observe that the =>>2 element of T1R is the expected number of indi-
MS&E 351 Dynamic Programming 15 §1 Discrete Time Parameter
Copyright © 2008 by Arthur F. Veinott, Jr.

viduals in state > in period R  " that one individual in state = in period one and his progeny
generate when they use 1. Then ÐZ1! ´ !Ñ the W -element column vector

Z1R ´ " T13" <$3


R
(1)
3œ"

is the expected N-period reward or R -period value of 1.


Define the optimal-return operator e À d W Ä d W by

(2) eZ ´ max Ò<$  T$ Z Ó, Z − d W .


$ −?

Observe that eZ is the maximum expected one-period reward when Z is the terminal reward in
that period. Of course e! Z œ Z , e" Z œ eZ , e# Z œ eÐeZ Ñ, etc. It is important to recognize
that the maximum in (2) is attained; indeed the =>2 element ÐeZ Ñ= of eZ is

ÐeZ Ñ= œ max Ò<Ð=ß +Ñ  " :Ð> l =ß +ÑZ> Ó


+−E= >−f

for all = − f , i.e., the maximum in (2) is taken coordinatewise. Also note that eZ is increasing
in Z , a fact that we will use often in the sequel.
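In matrix form the operator is easy to apply coordinatewise. Here is a minimal numpy sketch in
which r[s, a] and P[s, a, t] hold the rewards and transition rates; the array shapes and the random
test data are assumptions made only for illustration.

```python
import numpy as np

# The operator acts coordinatewise, so it suffices to maximize over a separately for each s.

def optimal_return_operator(r, P, V):
    """Return (RV, delta), where RV = max over decisions of r_delta + P_delta V."""
    Q = r + P @ V                 # Q[s, a] = r(s,a) + sum_t p(t|s,a) V_t
    delta = Q.argmax(axis=1)      # a maximizing decision (one action per state)
    return Q.max(axis=1), delta

# N applications of the operator to a terminal value give the maximum N-period value.
S, A = 3, 2
rng = np.random.default_rng(1)
r = rng.normal(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))            # a stochastic system for the test
V = np.zeros(S)
for _ in range(4):
    V, delta = optimal_return_operator(r, P, V)
print(V, delta)
```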

6 MAXIMUM N-PERIOD VALUE: FORMALITIES


Our goal now is to justify the dynamic-programming recursion (1) of §1.2 in a more formal
way. To that end, let V_π^N(u) ≡ V_π^N + P_π^N u be the N-period value when using π and the terminal
value vector at the end of period N is u ∈ ℝ^S.

Proposition 1. Maximum N-Period Value. ℛ^N u = max_{π∈Δ^∞} V_π^N(u), N = 0, 1, … and u ∈ ℝ^S.

Proof. The result is trivial for N = 0. Suppose it holds for N − 1 ≥ 0 and consider N. Now

   r_δ + P_δ V_π^{N−1}(u) = V_{δπ}^N(u)

for all δ, so because P_δ ≥ 0,

(1a)   ℛ^N u = ℛ(ℛ^{N−1} u) = max_{δ∈Δ} [ r_δ + P_δ ℛ^{N−1} u ] = max_{(δ,π)∈Δ^∞} V_{δπ}^N(u). ∎

Remark 1. The first equality in (1a) can be written in the alternate form

(1b)   V^N = max_{δ∈Δ} [ r_δ + P_δ V^{N−1} ],   N = 1, 2, …

where V^N ≡ ℛ^N u and V^0 ≡ u, or equivalently,

(1c)   V_s^N = max_{a∈A_s} [ r(s, a) + ∑_{t∈𝒮} p(t | s, a) V_t^{N−1} ]

for N = 1, 2, … and s ∈ 𝒮. Of course, (1c) is precisely (1) in §1.2.

Remark 2. The proof is constructive. For if we have found π maximizing V_·^{N−1}(u), then
(δ, π) maximizes V_·^N(u) if and only if δ attains the maximum on the right-hand side of (1b).

Advantages of Maximum-N-Period-Value Policies

Policies having maximum N-period value are very useful in practice for several reasons.
• Ease of Computation. They are easy to compute for moderate size N by the recursion (1c).
• Dependence on Horizon Length. The recursion for computing them automatically finds the best first-period decision for each horizon length N, thus enabling one to study the impact of the horizon length on the best first-period decision.
• Nonstationary Data. The recursion extends immediately to the case of nonstationary data without any additional computational effort. However, in this event, it is no longer possible to study the impact of the horizon length on the best first-period decision without extra computation except in the deterministic case. We now explain briefly why this is so.

Rolling Horizons and the Backward and Forward Recursions


Consider first the deterministic case in which the reward r_n(s, t) earned in period n when an individual moves from state s in period n to state t in period n+1 is nonstationary and one is given the states k, m in which one must begin and end in periods 1 and N respectively. Then the obvious generalization of the recursion (1c) is the backward recursion

(2)    B_s^n = max_t [ r_n(s, t) + B_t^{n+1} ] ,   s ∈ 𝒮 and n = 1, …, N

where B_s^n is the maximum reward that can be earned in periods n, …, N by an individual starting in state s in period n and B_t^{N+1} = 0 or −∞ according as t = m or t ≠ m. Observe that although we have suppressed the fact in the notation, B_s^n depends on N since B_s^n is the maximum reward earned in periods n, …, N. Thus, if one wishes to tabulate the maximum N-period value B_k^1 for N = 1, …, M, then one must compute the B_s^n from (2) for all 1 ≤ n ≤ N ≤ M and s ∈ 𝒮. If we fix the number of state-action pairs, this requires O(M²) additions and comparisons.⁵

However, it is possible to do this computation with at most O(M) additions and comparisons by instead using the forward recursion

(2)′    F_t^n = max_s [ r_n(s, t) + F_s^{n−1} ] ,   t ∈ 𝒮 and n = 1, …, N

where F_t^n is the maximum reward that can be earned in periods 1, …, n by an individual ending in state t in period n and F_s^0 = 0 or −∞ according as s = k or s ≠ k.

⁵If f and g are real-valued functions, write f(x) = O(g(x)) if there is a constant K such that |f(x)| ≤ K g(x) for all x.

Now the maximum N-period value is simply F_m^N, which can be tabulated for all 1 ≤ N ≤ M with at most O(M) additions and comparisons. Thus, if one is interested in studying the effect of changes in the horizon length on the maximum N-period value for a deterministic problem, it is much better to use the forward than the backward recursion.

Unfortunately, the forward recursion does not have a natural generalization to stochastic systems. For that reason one is left only with the nonstationary generalization of the backward recursion (2) in that case.
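A minimal sketch of the two recursions in Python follows; the reward function, the number of states, and the start/end states are hypothetical. It illustrates the point above: the forward pass tabulates the maximum N-period value for every N in one sweep, while the backward pass must be rerun for each fixed N.

import math

# r[n][s][t] = reward r_n(s, t) for moving from state s in period n to state t in period n+1
STATES = range(3)
M = 4
r = {n: {s: {t: ((s + 2 * t + n) % 5) - 1.0 for t in STATES} for s in STATES}
     for n in range(1, M + 1)}
k, m = 0, 2                        # required start and end states (hypothetical)

# Forward recursion: F[n][t] = max_s [ r_n(s,t) + F[n-1][s] ], F[0][s] = 0 if s == k else -inf.
F = {0: {s: (0.0 if s == k else -math.inf) for s in STATES}}
for n in range(1, M + 1):
    F[n] = {t: max(r[n][s][t] + F[n - 1][s] for s in STATES) for t in STATES}
    print(f"maximum {n}-period value ending in state {m}: {F[n][m]}")

# Backward recursion for one fixed horizon N (must be redone for every N):
N = M
B = {N + 1: {t: (0.0 if t == m else -math.inf) for t in STATES}}
for n in range(N, 0, -1):
    B[n] = {s: max(r[n][s][t] + B[n + 1][t] for t in STATES) for s in STATES}
print("backward check, same N-period value:", B[1][k])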

Limitations of Maximum-N-Period-Value Policies

Policies having maximum N-period value do have some significant limitations including the following.

• Computational Burden. The computational effort rises linearly with the horizon length N.

• Nonstationarity of Optimal Policies. Optimal N-period policies are generally nonstationary, even with stationary data, and so are difficult to implement.

• Dependence on Horizon Length. The dependence of the best first-period decision on the horizon length N is generally complex and, especially in the case of large N, requires the user to be rather arbitrary in choosing it.

These considerations lead us to look for an asymptotic theory as N gets large, or alternately, to consider instead the infinite-horizon problem. The latter approach is particularly fruitful because the symmetries present in the problem generally assure that there is an “optimal” policy that is stationary. Also, those policies are “nearly optimal” for all large enough N.

7 MAXIMUM VALUE IN TRANSIENT SYSTEMS [Sh53], [Ho60], [D’E63], [deG60], [Bl62],


[Der62], [Den67], [Ve69a], [Ve74], [Ro75c], [Ro78], [RV92]
Why Study Infinite-Horizon Problems? What is the point of studying infinite-horizon prob-
lems in a stationary environment? After all, life is finite. No doubt for this reason men and wo-
men of practical affairs are usually concerned with the near term, often concerned with the in-
termediate term, and occasionally concerned with the long term. But is the long term infinite?
Probably not. Also, is it reasonable to consider models in which the environment remains immut-
able and unchanging forever? Hard to believe.
In view of these observations, what is the justification for studying the stationary infinite-
horizon problem? Here are some reasons for so doing.
• Near Optimality for Long Finite Horizons. The stationary infinite-horizon problem provides a good approximation to the stationary long finite-horizon problem in the sense that optimal infinite-horizon policies are nearly optimal for long (and often surprisingly short) horizons.

• Optimality of Stationary Policies. The stationarity of the environment in the stationary infinite-horizon problem generally assures that there is an optimal policy that is stationary and independent of the horizon. This fact is important in practice because stationary policies are much easier to store and implement than nonstationary ones.

• Computational Effort Independent of Horizon. By contrast with the finite-horizon problem, the computational effort to find an optimal infinite-horizon policy is independent of the horizon.

• Includes Many Nonstationary Problems. The stationary infinite-horizon problem includes the n-period nonstationary problem as a special case. To see this, append to the original state s each period N that the system could be in that state, so the states of the augmented system are the pairs (s, N) for s ∈ 𝒮 and N = 1, …, n, and transitions are from each period N < n to N+1 and from period n to exiting the system. If n = m + p and transitions in period n are instead to period m + 1, the system repeats every p periods after period m. This reduces such “eventually p-periodic” problems to stationary ones.

A natural approach to the infinite-horizon problem is to seek a policy π = (δ_N) whose value

(1)    V_π ≡ ∑_{N=0}^{∞} P_π^N r_{δ_{N+1}}

is a maximum. However, this definition makes sense only if the right-hand side of (1) is defined and finite. In order for this to be the case, it suffices to assume that π is transient, i.e., ∑_{N=0}^{∞} P_π^N is finite. The st-th element of the last series is the sum of the expected sojourn times of all individuals in state t generated by one individual starting in state s in period one. Indeed, the only way that (1) can be defined and finite for every choice of the rewards is that π be transient. And in that event, V_π is the S-vector of expected infinite-horizon rewards, i.e., the value earned by π starting in each state. Incidentally, it is often convenient to call a decision δ or its transition matrix P_δ transient if δ^∞ is transient. When is a transition matrix transient?

Characterization of Transient Matrices

The sequel often uses the (Tchebychev) norm ‖P‖ of a matrix P = (p_ij) defined by ‖P‖ = max_i ∑_j |p_ij|. This norm has the property that ‖PQ‖ ≤ ‖P‖ ‖Q‖ whenever the matrix product PQ is defined. Though §1 of the Appendix discusses several properties of this and other norms, the term “norm” means the Tchebychev norm throughout the main text.

Lemma 2. Characterization of Transient Matrices. If P is a square complex matrix, the following are equivalent.
1° P^N → 0.
2° The Neumann series ∑_{N=0}^{∞} P^N converges absolutely.
Moreover, each of the above implies
3° I − P is nonsingular and its inverse equals the Neumann series in 2°.

Proof. 2° ⇒ 1°. Immediate.
1° ⇒ 2°. Since P^N → 0, ‖P^M‖ ≤ ½ for some M ≥ 1. Let L = ∑_{j=0}^{M−1} ‖P^j‖. Since ‖P^{j+kM}‖ ≤ ‖P^j‖ ‖P^M‖^k, it follows that ∑_{N=0}^{∞} ‖P^N‖ = ∑_{j=0}^{M−1} ∑_{k=0}^{∞} ‖P^{j+kM}‖ ≤ L ∑_{k=0}^{∞} ‖P^M‖^k ≤ 2L < ∞.
1° ⇒ 3°. Evidently,

    ( ∑_{N=0}^{M−1} P^N ) ( I − P ) = ∑_{N=0}^{M−1} ( P^N − P^{N+1} ) = I − P^M.

Since P^M → 0, the right-hand side above is nonsingular for large M, so I − P is also nonsingular. Thus, postmultiplying by (I − P)^{−1} and letting M → ∞ establishes 3°. ∎

An alternate (more advanced) proof of the equivalence of 1° and 2° of the Lemma is to apply the Spectral Mapping Theorem 4 and Corollary 2 of the Appendix.

Observe that Lemma 2 implies that γ^∞ is transient if and only if P_γ^N → 0, i.e., the expected population size in period N generated by γ^∞ converges to zero as N → ∞. And in that event,

    0 ≤ ∑_{N=0}^{∞} P_γ^N = (I − P_γ)^{−1}.

For example, if P_γ is strictly substochastic, i.e., P_γ 1 ≪ 1, then γ is transient. More generally, if P_γ is substochastic, then γ is transient if and only if P_γ^S is strictly substochastic.

Transient System. As discussed above, V_π is defined and finite for every π and every choice of rewards if and only if every π is transient. In that event, call the system transient. As an example, note that circuitless systems are transient because then P_π^S = 0 for all π.
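A small numerical sketch of the two facts just mentioned (hypothetical substochastic matrix, Python/NumPy): a substochastic P_γ is transient exactly when P_γ^S is strictly substochastic, and in that case the Neumann series equals (I − P_γ)^{-1}.

import numpy as np

P = np.array([[0.0, 1.0, 0.0],       # hypothetical substochastic transition matrix P_gamma
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.9]])
S = P.shape[0]

# Transience test for a substochastic matrix: all row sums of P^S strictly below one.
row_sums_of_PS = np.linalg.matrix_power(P, S).sum(axis=1)
print("transient:", bool((row_sums_of_PS < 1.0).all()))

# In that event the Neumann series sum_N P^N equals (I - P)^{-1} (Lemma 2).
print(np.linalg.inv(np.eye(S) - P))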
The goal now is to show that for transient systems, there exists a policy having maximum value, and one such policy is stationary. The constructive proof of these claims depends on the following fundamental Comparison Lemma. It asserts that the difference between the values of two transient policies π = (γ_N) and θ is the sum over N of the product of two matrices that depend on N. The first is the matrix P_π^N of expected numbers of individuals in each state after N periods starting from each state when π is used in periods 1, …, N. The second is the difference V_{γ_{N+1} θ} − V_θ between the values of the policies (γ_{N+1}, θ)—which uses γ_{N+1} and then θ—and θ.

Comparison Lemma

Lemma 3. Comparison Lemma. If π = (γ_N) and θ are transient policies, then

    V_π − V_θ = ∑_{N=0}^{∞} P_π^N G_{γ_{N+1} θ}

where the comparison function is defined by

    G_{γθ} ≡ r_γ + P_γ V_θ − V_θ   for γ ∈ Δ.

If also π = γ^∞, then

    V_γ − V_θ = ∑_{N=0}^{∞} P_γ^N G_{γθ} = (I − P_γ)^{−1} G_{γθ}

where V_γ ≡ V_{γ^∞}. Moreover, V = V_γ if and only if V = r_γ + P_γ V.

Proof. Since π is transient, V_π^N → V_π is finite and P_π^N → 0, so V_π^N(V_θ) = V_π^N + P_π^N V_θ → V_π. Thus,

    V_π − V_θ = lim_{N→∞} [ V_π^{N+1}(V_θ) − V_π^0(V_θ) ] = ∑_{N=0}^{∞} [ V_π^{N+1}(V_θ) − V_π^N(V_θ) ]

              = ∑_{N=0}^{∞} P_π^N [ V_{γ_{N+1}}^1(V_θ) − V_θ ] = ∑_{N=0}^{∞} P_π^N G_{γ_{N+1} θ}.

If π = γ^∞, the last term above becomes [ ∑_{N=0}^{∞} (P_γ)^N ] G_{γθ}. The remaining assertions then follow from Lemma 2 and what was just shown. ∎

Policy-Improvement Method
It is now time to apply the Comparison Lemma to give a policy-improvement method for finding a maximum-value stationary policy for the case where each stationary policy is transient. Then since (I − P_γ)^{−1} ≥ I ≥ 0, it follows from the Comparison Lemma that V_γ − V_δ ≥ G_{γδ} > 0 (resp., V_γ − V_δ ≤ G_{γδ} ≤ 0) if G_{γδ} > 0 (resp., ≤ 0). For this reason we say γ improves δ if G_{γδ} > 0. If no decision improves δ, we claim that max_{γ∈Δ} G_{γδ} = 0, so V_γ ≤ V_δ for all γ. The claim is an easy consequence of the facts that G_{δδ} = 0 and the s-th element (G_{γδ})_s = r(s, γ_s) + ∑_{t∈𝒮} p(t | s, γ_s) (V_δ)_t − (V_δ)_s of G_{γδ} depends on γ_s, but not γ_t for t ≠ s. For suppose that (G_{γδ})_s > 0 for some γ and s. Define the decision η by η_s = γ_s and η_t = δ_t for t ≠ s. Then (G_{ηδ})_s = (G_{γδ})_s > 0 and (G_{ηδ})_t = (G_{δδ})_t = 0 for t ≠ s. Hence G_{ηδ} > 0 and η improves δ, which is a contradiction and establishes the claim. The above remarks permit us to prove the following result constructively.

Lemma 4. Existence of Maximum-Value Stationary Policy. If every stationary policy is transient, there is a maximum-value stationary policy δ^∞. Moreover,

(2a)    0 = max_{γ∈Δ} G_{γδ} ,

or equivalently, V ≡ V_δ is a fixed point of ℛ, i.e., V = ℛV, or what is the same thing,

(2b)    V = max_{γ∈Δ} [ r_γ + P_γ V ].

Moreover, the maximum value over the stationary policies is the unique fixed point of ℛ.

Proof. Let δ_0 be arbitrary and choose decisions δ_1, δ_2, … inductively so that δ_{N+1} improves δ_N, terminating when no improvement exists. Since the value of each policy in the sequence is higher than that of its predecessor, no decision can occur twice in the sequence. Thus because there are only finitely many decisions, the sequence must terminate with a δ = δ_N having no improvement. Then from the discussion preceding the Lemma, (2a) or equivalently (2b) holds, so δ^∞ is a maximum-value stationary policy. Conversely, if V is a fixed point of ℛ, then V = r_η + P_η V for some η ∈ Δ, whence V = V_η. Thus max_{γ∈Δ} G_{γη} = 0 and so V = V_η is the maximum value over the stationary policies. ∎

The finite algorithm given in the above proof finds a maximum-value stationary policy. Call the algorithm the policy-improvement method. The process of finding an improvement of δ, if one exists, involves two steps, viz., finding the value of δ and then seeking an improvement of δ. The first step entails solving the system

    (I − P_δ) V = r_δ

for V = V_δ. The second is to find, for each state s, an action γ_s ∈ A_s for which

    r(s, γ_s) + ∑_{t∈𝒮} p(t | s, γ_s) (V_δ)_t ≥ (V_δ)_s

with strict inequality holding for some s, i.e., a decision γ = (γ_s) such that r_γ + P_γ V_δ > V_δ.
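A minimal Python/NumPy sketch of these two steps follows, iterated until no improvement exists. The two-state reward and transition data are hypothetical (and strictly substochastic, hence transient); the improvement step here takes a maximizing action in every state, which in particular satisfies the condition above.

import numpy as np

# Hypothetical transient data: r[s][a], P[s][a] = row of transition rates from s under action a.
r = [np.array([1.0, 0.6]), np.array([0.0, 2.0])]
P = [np.array([[0.0, 0.7], [0.2, 0.2]]),
     np.array([[0.1, 0.0], [0.4, 0.4]])]
S = len(r)

def value_of(delta):
    """Solve (I - P_delta) V = r_delta for the value of the stationary policy delta^inf."""
    P_d = np.array([P[s][delta[s]] for s in range(S)])
    r_d = np.array([r[s][delta[s]] for s in range(S)])
    return np.linalg.solve(np.eye(S) - P_d, r_d)

delta = [0] * S                          # arbitrary initial decision
while True:
    V = value_of(delta)
    # Comparison function (G_{gamma delta})_s = r(s,a) + sum_t p(t|s,a) V_t - V_s, over actions a.
    G = [np.array([r[s][a] + P[s][a] @ V - V[s] for a in range(len(r[s]))]) for s in range(S)]
    best = [int(np.argmax(G[s])) for s in range(S)]
    if all(G[s][best[s]] <= 1e-12 for s in range(S)):
        break                            # no decision improves delta: delta^inf has maximum value
    delta = best                         # take a maximizing action in each state
print("maximum-value decision:", delta, "value:", V)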

Stationary Maximum-Value Policies: Existence and Characterization


We now come to our main results for transient systems.

Theorem 5. Characterization of Transient Systems. A system is transient if and only if every stationary policy is transient.

Proof. It suffices to prove the “if ” part. By Lemma 4, there is a maximum-value stationary policy δ^∞, say, for the case r_γ ≡ 1 for all γ. Let π = (γ_1, γ_2, …) be any policy. Then since 1 ≤ V_δ ≪ ∞ is a fixed point of ℛ, ℛ is increasing, and ℛ^M 1 is the maximum M-period value with terminal value 1, it follows that

    V_δ = ℛV_δ = ⋯ = ℛ^M V_δ ≥ ℛ^M 1 ≥ ∑_{N=0}^{M} P_π^N 1

for each M = 0, 1, …, so π is transient. ∎

Theorem 6. Existence and Characterization of Maximum-Value Policies and Maximum Value. In a transient system, there is a stationary maximum-value policy. Moreover, a policy θ has maximum value if and only if max_{γ∈Δ} G_{γθ} = 0. Finally, the maximum value V* is the unique fixed point of ℛ.

Proof. For the first assertion, observe from Lemma 4 that there is a maximum-value stationary policy δ^∞, say, with max_{γ∈Δ} G_{γδ} = 0. Hence, by the Comparison Lemma, V_π ≤ V_δ for all π, so δ^∞ is a maximum-value policy. For the second assertion, if θ has maximum value, V_θ = V_δ so max_{γ∈Δ} G_{γθ} = max_{γ∈Δ} G_{γδ} = 0. Conversely, if max_{γ∈Δ} G_{γθ} = 0, then θ has maximum value by the Comparison Lemma. The third assertion is immediate from the first assertion and Lemma 4. ∎

Key Ideas. The above developments illustrate three important ideas that will recur frequent-
ly in the sequel in studying optimality concepts for nontransient systems.

• Policy Improvement and Comparison Lemma. It is often possible to use suitably generalized versions of the policy-improvement method and Comparison Lemma together to establish constructively the existence of stationary “optimal” policies, and to characterize them for general systems—transient or not.

• Convenient Terminal Values. It is often useful to study a problem with the aid of a terminal value that makes calculations easy and then find the effect of changing that value to the desired one. For example, in the proof of Theorem 5, V_δ ≥ 1 is a fixed point of ℛ whence V_δ = ℛ^M V_δ ≥ ℛ^M 1.

• System Properties. To establish properties of the system that are independent of the rewards, e.g., transience in Theorem 5, it is frequently useful to set all rewards and/or terminal values equal to 1 (or 0 or −1), and then apply existence or characterization results about “optimal” policies and their values.

The fact that V* is a fixed point of ℛ is intuitive on writing that fact as

(3)    V* = max_{γ∈Δ} [ r_γ + P_γ V* ] ,

where V* on the left is the maximum value in periods 1, 2, …, r_γ is the reward in period 1 using γ, and P_γ V* is the maximum value in periods 2, 3, … given that γ is used in period 1.

The fact that V* is the unique solution of the system of S nonlinear equations V = ℛV in S variables suggests the possibility of using classical iterative methods to solve the equations for V*, e.g., Newton’s method, successive approximations, Gauss-Seidel method, etc. We first discuss Newton’s method.

Newton’s Method: A Specialization of Policy-Improvement Method


Newton’s method is an instance of the policy-improvement method. To see this, initiate Newton’s method with V = V_δ. The first step of the method is to find a linear approximation ℛ̃ of ℛ with ℛ̃V = ℛV. The natural linear approximation is ℛ̃U ≡ r_γ + P_γ U for U ∈ ℝ^S where γ is a maximizer of r_γ + P_γ V. This is equivalent to choosing γ to maximize G_{γδ}, i.e., choosing the “best” improvement of δ. The linear approximation ℛ̃ has the properties that ℛ̃V = ℛV and ℛ̃U ≤ ℛU for all U. The second step is to solve the linear equations U = ℛ̃U, i.e., U = r_γ + P_γ U, for U = V_γ.

Now replace δ by γ and repeat the above two steps iteratively. Thus Newton’s method for solving V = ℛV is the specialization of the policy-improvement method that, given δ, chooses a γ that not only improves δ, but also maximizes G_{γδ}. Moreover, the method converges in finitely many steps.

Successive Approximations: Maximum N-Period Value Converges to Maximum Value

Another classical method for finding the fixed point of ℛ is successive approximations. This method entails choosing the sequence V^0, V^1, … iteratively by the rule V^N = ℛV^{N−1}, N = 1, 2, …. The method has special interest in dynamic programming because V^N = ℛ^N V is the maximum N-period value with terminal reward V = V^0. Since in transient systems, P_δ^N → 0 for each δ, it is plausible that V^N → V* in such systems. The sequel establishes this fact and provides the rate of convergence as well.

In order to see what the rate of convergence might be, consider the special case where Δ consists of only a single decision δ. Then

    ‖V* − ℛ^N V‖ = ‖ℛ^N V* − ℛ^N V‖ = ‖P_δ^N (V* − V)‖ ≤ ‖P_δ^N‖ ‖V* − V‖.

Thus the rate at which ℛ^N V converges to V* is bounded above by the rate at which P_δ^N converges to 0. And the rate of convergence of the latter is determined by the spectral radius of P_δ. Before establishing this fact, it seems useful to give an example.
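A minimal sketch of successive approximations (Python/NumPy, the same hypothetical two-state data format used in the earlier sketches): V^N = ℛV^{N−1} with V^0 = 0, stopped when the change is negligible.

import numpy as np

r = [np.array([1.0, 0.6]), np.array([0.0, 2.0])]
P = [np.array([[0.0, 0.7], [0.2, 0.2]]),
     np.array([[0.1, 0.0], [0.4, 0.4]])]
S = len(r)

V = np.zeros(S)                                    # V^0 = 0 (any terminal reward works)
for N in range(1, 200):
    RV = np.array([max(r[s][a] + P[s][a] @ V for a in range(len(r[s]))) for s in range(S)])
    if np.max(np.abs(RV - V)) < 1e-10:             # geometric convergence in transient systems
        V = RV
        break
    V = RV
print(f"approximate maximum value after {N} iterations:", V)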

Geometric Interpretation: a Single-State Reliability Example

It is informative to illustrate the above results geometrically on a concrete example for the case of a transient system with a single state and three decisions σ, μ, γ. The example is interesting for another reason too, viz., the time between transitions is random. Such systems are semi-Markov decision chains, though this does not really introduce anything new for infinite-horizon transient systems as the discussion below reveals.

Suppose that a contractor provides a service, e.g., lighting, automotive transportation, or desktop computing, for a client. The client can cancel the service at the beginning of any period without prior notice. The contractor estimates that the client will cancel independently in each period with probability 1 − λ, 0 < λ < 1. The contractor has three products with which to provide the service, e.g., three types of lights, cars or computers. The service lives of the copies of product δ that the contractor places in service are independent random variables whose distributions coincide with that of another nonnegative integer-valued random variable L_δ. Let P_δ ≡ E λ^{L_δ} be the probability that the client does not cancel the service during the life of product δ. Also let r_δ be the expected revenue that product δ earns during its lifetime. Now V* = max_δ V_δ.
The table below provides the problem data and the value of each policy.

    δ      r_δ     P_δ     (I − P_δ)^{−1}     V_δ
    σ       8      1/5          5/4           10
    μ      12      1/4          4/3           16
    γ       6      3/4           4            24

The system is transient since max(P_σ, P_μ, P_γ) < 1. Also, the maximum expected revenue that the contractor can earn, viz., V = V* = 24, is the unique solution of the nonlinear equation

    V = max( 8 + (1/5)V ,  12 + (1/4)V ,  6 + (3/4)V ),
               (σ)            (μ)            (γ)

Note that γ is optimal even though its expected lifetime revenue is smallest.

It is useful to examine how the policy-improvement method would produce this result with Newton improvements, i.e., the best improvement at each step. To that end, note that the optimal-return operator is given by ℛV = max(8 + (1/5)V, 12 + (1/4)V, 6 + (3/4)V).
• Policy σ. Suppose one begins with σ. The first step is to solve V = 8 + (1/5)V for the value of σ, i.e., for V = V_σ = 10. Next ℛ(10) = max(8 + (1/5)·10, 12 + (1/4)·10, 6 + (3/4)·10) = max(10, 14½, 13½) = 14½, so G_{μσ} = 14½ − 10 = 4½. Therefore μ improves σ.
• Policy μ. Now solve V = 12 + (1/4)V for V = V_μ = 16. Then ℛ(16) = max(8 + (1/5)·16, 12 + (1/4)·16, 6 + (3/4)·16) = max(11 1/5, 16, 18) = 18. Thus, G_{γμ} = 18 − 16 = 2. Therefore γ improves μ.
• Policy γ. Now solve V = 6 + (3/4)V for V = V_γ = 24. Then ℛ(24) = max(8 + (1/5)·24, 12 + (1/4)·24, 6 + (3/4)·24) = 24. Thus, G_{δγ} ≤ 0 for all δ, establishing that γ is optimal.
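The arithmetic above can be checked with a few lines of Python (a sketch; the dictionary keys just name the three decisions in the table):

# Single-state reliability example: V_delta = r_delta / (1 - P_delta), RV = max over decisions.
data = {'sigma': (8.0, 1/5), 'mu': (12.0, 1/4), 'gamma': (6.0, 3/4)}   # (r_delta, P_delta)
values = {d: r / (1 - p) for d, (r, p) in data.items()}
print(values)                                    # {'sigma': 10.0, 'mu': 16.0, 'gamma': 24.0}
R = lambda V: max(r + p * V for r, p in data.values())
print(R(10.0), R(16.0), R(24.0))                 # 14.5, 18.0, 24.0  (24 is the fixed point)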

Figure 4a displays the graph of the optimal-return operator for the above example as a wide bold line. Note that V − ℛV is continuous, strictly increasing and changes sign once from − to +, and so has a unique zero V*, viz., the unique fixed point of ℛ. Also ℛV_δ ≥ V_δ for all decisions δ, and equality holds if and only if V_δ = V*.

Figures 4b and 4c respectively illustrate Newton’s method and successive approximations for the above example. In each case, if U and V are values calculated at successive iterations, wide bold line segments are drawn from (U, U) to (U, ℛU) to (V, V). Together these line segments form a path that traces the trajectory of successive iterates from the initial value to V*. In particular, starting from decision σ, Newton’s method first chooses μ and then γ. By contrast, starting from the value V^0, successive approximations follows an infinite staircase trajectory.

Observe from Figures 4b,c that starting from an initial value V^0 for which V^0 ≤ ℛV^0, the value that Newton’s method calculates at each iteration lies above the one that successive approximations computes, and this is so in general. Thus, Newton’s method converges at least as fast as successive approximations. Of course this does not necessarily mean that Newton’s method is more efficient than successive approximations because the former requires more work at each iteration than the latter.

System Degree, Spectral Radius and Polynomial Boundedness


The degree of a sequence P_0, P_1, … of matrices is 0 if ‖P_N‖ = O(α^N) for some 0 < α < 1 (i.e., ‖P_N‖ ≤ Kα^N for some K), the smallest positive integer k for which ‖P_N‖ = O(N^{k−1}) if such an integer exists and the degree is not 0, and ∞ otherwise. The degree d_P of a square matrix P is the degree of the sequence P^0, P^1, P^2, … of nonnegative integer powers of P.

[Figure 4a. Maximum Value is Unique Fixed Point of Optimal-Return Operator. The graph plots ℛV and the three lines r_σ + P_σ V, r_μ + P_μ V, r_γ + P_γ V against V, with V_σ, V_μ and the fixed point ℛV* = V* = V_γ marked.]

[Figure 4b. Newton's Method. The trajectory of Newton iterates moves from V_σ to V_μ to V* = V_γ along the graph of ℛ.]

[Figure 4c. Successive Approximations. The staircase trajectory V^0, V^1 = ℛV^0, V^2 = ℛV^1, … converges to V* = V_γ.]

Examples of Degrees of Sequences

    Degree of {P_N}:      0         1        2       3       ∞
    P_N:               (1/2)^N   (−1)^N     N       N²      2^N

The spectral radius σ_P of a square complex matrix P is the radius of the smallest circle in the complex plane that is centered at the origin and contains the eigenvalues of P, as Figure 5 illustrates. If σ_P > 0, then the normalized matrix P̄ ≡ σ_P^{−1} P has spectral radius one.

[Figure 5. Spectral Circle. The eigenvalues of P plotted in the complex plane (real and imaginary axes) together with the circle of radius σ_P about the origin.]

Lemma 7. If P is an S × S complex matrix, the following are equivalent:
1° 1 ≤ d_P < ∞; 1 ≤ d_P ≤ S; and σ_P = 1. (polynomially bounded)
Moreover, the following are equivalent:
2° d_P = 0; P^N → 0; and σ_P < 1. (transient)
Finally, the following are equivalent:
3° P^N = 0 for some N ≥ 0; σ_P = 0; and P = TUT^{−1} for some upper-triangular matrix U with zero diagonal elements and some nonsingular matrix T. If also P is real and nonnegative, one such T is a permutation matrix, so the system graph of P is circuitless. (circuitless)

Proof. Use the Jordan form of P. ∎

Corollary 8. If P is an S × S complex matrix with σ ≡ σ_P > 0 and σ^{−1}P has degree d̄, then 1 ≤ d̄ ≤ S and P^N = O(N^{d̄−1} σ^N).

Proof. Since σ^{−1}P has spectral radius one, the result follows from 1° of Lemma 7. ∎

The degree d_π of a policy π is the degree of the sequence (P_π^N) of endogenous expected population sizes that π generates. The degree of a policy is of fundamental importance in the development of the theory and is a natural measure of the endogenous rate of growth of the sequence of expected population sizes when using the policy.

In particular, the degree of π is zero if and only if the sequence of endogenous expected population sizes when using π converges to zero geometrically, so π is transient. If the degree of π is nonzero and finite, it is one plus the order of the lowest-order polynomial majorizing the sequence of endogenous expected population sizes when using π. For example, if the system is stochastic, the degree of each policy is one because the sequence of endogenous expected population sizes is bounded (by one). If the degree of π is infinite, the sequence of endogenous expected population sizes when using π is not majorized by any polynomial.

Put d_δ ≡ d_{δ^∞} and call d ≡ max_{δ∈Δ} d_δ the system degree. Let σ_δ ≡ σ_{P_δ} and call σ ≡ max_{δ∈Δ} σ_δ the system spectral radius. If σ > 0, call the system with transition matrices P̄_δ ≡ σ^{−1} P_δ for all δ ∈ Δ the normalized system.
The next result is very important. It implies that max_{π∈Δ^∞} d_π = d, i.e., the system degree majorizes the degree of every policy.

Theorem 9. System Degree. The sequence max_{π∈Δ^∞} ‖P_π^N‖, N = 0, 1, …, has degree d.

Proof. We give the proof only for the case d = 0. (Later we give the proof for the case d = 1.) Put ℛ̂V ≡ max_{δ∈Δ} [1 + P_δ V] and ℛ̄V ≡ max_{δ∈Δ} P_δ V. Since each stationary policy is transient, ℛ̂ has a unique fixed point V by Lemma 4, and V = 1 + P_δ V ≥ 1. Also 0 ≤ ℛ̄V ≤ ℛ̂V = V. Thus ℛ̄^N V is decreasing in N and is bounded below by 0, so ℛ̄^N V ↓ U, say. Since ℛ̄ is continuous, U = lim_{N→∞} ℛ̄(ℛ̄^N V) = ℛ̄U, so U is a fixed point of ℛ̄, whence U = 0 because 0 is the unique fixed point of ℛ̄ by Lemma 4. Thus ℛ̄^N V ↓ 0. Hence ℛ̄^M V ≤ ½V for some M ≥ 1. Then ℛ̄^{kM} V ≤ (½)^k V for k = 1, 2, …, from which fact it follows that ℛ̄^N V = O(α^N) with α ≡ (½)^{1/M}. Since ℛ̄^N V = max_{π∈Δ^∞} P_π^N V, max_{π∈Δ^∞} ‖P_π^N‖ = O(α^N). ∎

Transient Systems. For the case d = 0, Theorem 9 strengthens Theorem 5 by implying not only that every policy is transient, but also that every policy has degree zero.

Geometric Convergence of Successive Approximations

It is now possible to establish the convergence rate of ℛ^N V to V* in transient systems.

Theorem 10. Geometric Convergence of Successive Approximations. In a system with positive maximum spectral radius σ and degree d̄ of the normalized system,

    ‖ℛ^N U − ℛ^N V‖ = O(N^{d̄−1} σ^N) ‖U − V‖   uniformly in U, V ∈ ℝ^S.

If also the system is transient, i.e., σ < 1, then ℛ^N V → V*.

Proof. Suppose U, V ∈ ℝ^S and let P̄_π^N be the N-step transition matrix for π in the normalized system. Choose π so ℛ^N U = V_π^N + P_π^N U. Then ℛ^N V ≥ V_π^N + P_π^N V, so by Theorem 9 and Lemma 7,

    ℛ^N U − ℛ^N V ≤ P_π^N (U − V) = σ^N P̄_π^N (U − V) = O(N^{d̄−1} σ^N) ‖U − V‖

uniformly in U, V. Interchanging the roles of U, V establishes the first claim. The second claim then follows from the first by choosing U = V* and using the facts that V* is finite (because the system is transient) and ℛ^N V* = V*. ∎

Theorem 10 depends on the proof of Theorem 9 for arbitrary d̄. There is a weaker result that depends on Theorem 9 only for the case d = 0, the case for which that theorem was proved above. The weaker result is that ‖V* − ℛ^N V‖ = O(σ̄^N) ‖V* − V‖ uniformly in V for σ̄ > σ. The proof is identical to that of Theorem 10 except that one normalizes the P_δ with respect to σ̄ rather than σ.
In a transient system, Theorem 10 provides a sharp estimate of the rate of convergence of ℛ^N V to V*. However, the estimate depends on σ which is, in general, difficult to compute. For that reason it is useful to have an upper bound on σ that is easier to compute. The next result gives such a bound.

Lemma 11. If P is a square complex matrix, then σ_P ≤ ‖P‖.

Proof. If λ is an eigenvalue of P, then there is a v ≠ 0 such that λv = Pv. Hence |λ| ‖v‖ = ‖Pv‖ ≤ ‖P‖ ‖v‖. On dividing by ‖v‖, the result follows. ∎

Corollary 12. If μ ≡ max_{δ∈Δ} ‖P_δ‖ < 1, the system is transient and ‖V* − ℛ^N V‖ ≤ μ^N ‖V* − V‖.

Proof. Iterate the inequality ‖V* − ℛV‖ ≤ μ ‖V* − V‖. ∎

Remark. Notice that Corollary 12 is weaker than Theorem 10 when σ ≠ μ because then, by Lemma 11, σ < μ. If σ = μ, then the Corollary gives the constant in Theorem 10. However, then the rate of convergence is the same in both cases because d̄ = 1 since ‖P̄_π^N‖ ≤ 1 for all N and π, and d̄ is the degree of P̄_δ for some δ.

Contraction Mappings. The usual proof of Corollary 12 employs the contraction-mapping theorem. Indeed, when the hypothesis of Corollary 12 is valid, an easy modification of the proof of Theorem 10 shows that the mapping ℛ is a contraction with modulus 0 ≤ μ < 1, i.e., one has ‖ℛU − ℛV‖ ≤ μ ‖U − V‖ for all U, V ∈ ℝ^S. The contraction-mapping theorem in the Appendix then assures that {ℛ^N V} is a Cauchy sequence and converges geometrically to the unique fixed point V* of ℛ at the rate μ that Corollary 12 provides. The development here employs a different approach because, as the above remark discusses, it gives the sharper result of Theorem 10.

Actually, when the system is transient, but the hypothesis of Corollary 12 does not hold, it is possible to modify the contraction-mapping method to establish geometric convergence of ℛ^N V to V* at any rate σ̄ with σ < σ̄ < 1 in two different ways. One is to apply Theorem 10 to show that even though ℛ is not a contraction with modulus σ̄ in the usual (Tchebychev) norm, ℛ^k is a contraction with modulus σ̄^k for some k > 0 in that norm. The other is to show that ℛ is in fact a contraction with modulus σ̄ in a different “weighted” (Tchebychev) norm.

Linear-Programming Method
So far in this section we have explored the possibility of finding maximum-value policies in transient systems by finding the unique fixed point of ℛ. It is possible to find this fixed point by linear programming if we consider instead the larger set of vectors V satisfying V ≥ ℛV, or equivalently the linear inequalities V ≥ r_δ + P_δ V for all δ ∈ Δ. Call such vectors V excessive. As we shall see, there is a least excessive vector, and it is the fixed point. The simplest way to see that there is a least excessive vector is to observe that the set of excessive vectors is nonempty, closed, bounded below (by each V_δ) and a meet sublattice, i.e., whenever U and V are excessive, so is the pointwise minimum U ∧ V. Moreover, the least excessive vector may be found by minimizing any positive linear function of V over the set of excessive vectors V. It might seem odd that this linear program entails minimization while our goal is to maximize value. The explanation for this apparent paradox is simply that this linear program is, as the sequel shows, the dual of the maximum-value problem.

The linear-programming method is interesting in its own right and because the theory and algorithms of linear programming are well developed, codes for solving such problems are widely available, and the method is readily adapted to encompass additional stochastic constraints. A typical stochastic constraint might involve a bound on a linear combination of the expected number of times individuals in various states take various actions, e.g., an upper bound on the expected number of times that individuals enter an undesirable set of states. Another stochastic constraint might assure that if one uses an action in one of a subset of states, then one uses the action in all states in the subset. The last constraint can be expressed in terms of linear inequalities in integer and continuous variables.

In order to formulate the linear programs, suppose that there is an S-element row vector w ≥ 0 of initial population sizes of individuals in each state in period one. Then consider the following pair of dual linear programs for a transient system.

Primal Linear Program. Find S-element row vectors x_δ, δ ∈ Δ, that

    maximize    ∑_{δ∈Δ} x_δ r_δ

subject to

    ∑_{δ∈Δ} x_δ (I − P_δ) = w

    x_δ ≥ 0 ,   δ ∈ Δ.

Dual Linear Program. Find an S-element column vector V that

    minimizes    wV

subject to

    V ≥ r_δ + P_δ V ,   δ ∈ Δ.

In order to understand the primal program, let x = (x_γ) be such that x_δ (I − P_δ) = w for one δ ∈ Δ and x_γ = 0 for all γ ≠ δ. Then by Lemma 2, I − P_δ is nonsingular and x_δ = w(I − P_δ)^{−1} = w ∑_{N=0}^{∞} P_δ^N ≥ 0, so x is a basic feasible solution of the primal. Call I − P_δ the basis associated with δ. The basic feasible solution x has the following interpretation. Since wP_δ^N is the vector of expected population sizes in each state in period N+1 when all individuals use δ^∞, x_δ is the sum of the expected sojourn times of all individuals in each state when they all use δ^∞. Hence since r_δ is the vector of one-period rewards earned by one individual in each state who uses δ^∞, x_δ r_δ = w(I − P_δ)^{−1} r_δ = wV_δ is the expected total reward earned by all individuals in all periods when they use δ^∞. The aim is to find a δ^∞ that maximizes the expected total reward, which motivates the primal objective function. The next result establishes that the primal and dual always have optimal solutions in transient systems.

Theorem 13. Linear Programming. In a transient system, the maximum value V* is the least feasible solution of the dual, is optimal for the dual for all w ≥ 0, and is the unique optimal solution of the dual for w ≫ 0. Moreover, there is a δ such that V* = r_δ + P_δ V*; also each such δ^∞ has maximum value and I − P_δ is an optimal basis for the primal for all w ≥ 0.

Proof. The maximum value V* is a fixed point of ℛ, and so is feasible for the dual. Choose δ ∈ Δ so V* = r_δ + P_δ V*, whence V* = V_δ. Suppose w ≥ 0. Choose x = (x_γ) so x_δ (I − P_δ) = w and x_γ = 0 otherwise. As discussed above, x is a basic feasible solution of the primal. Hence, since x and V* satisfy complementary slackness, they are optimal for the primal and dual respectively. Moreover, since I − P_δ has a nonnegative inverse, it is an optimal basis for all w ≥ 0. Thus, since V* is optimal for the dual for all w ≥ 0, V* is the least feasible solution of the dual. Also, V* is the unique optimal solution of the dual for w ≫ 0 since then wV > wV* for V > V*. ∎

It is important to recognize that although the matrix formulations of the primal and dual programs given above are convenient for theoretical analyses like those above, both are highly redundant because there are ∏_{t≠s} |A_t| distinct decisions that take action a ∈ A_s in state s. As a consequence the primal will have a distinct variable and the dual a distinct inequality for each of those distinct decisions that take action a in state s. On the other hand, since the dual inequalities associated with taking action a in state s are identical, all but one can be omitted because they are redundant. Hence, the primal variables corresponding to those redundant dual inequalities are also redundant and so can also be eliminated.

After stripping out the redundant variables of the primal and corresponding redundant inequalities of the dual, let x_{sa} be the (unique) primal variable corresponding to taking action a in state s. Then the primal and dual programs take the following irredundant forms. It is these irredundant forms that should be used for computation.

Irredundant Primal. Choose x = (x_{sa}) ≥ 0 that

    maximizes    ∑_{s∈𝒮, a∈A_s} r(s, a) x_{sa}

subject to

    ∑_{a∈A_s} x_{sa} − ∑_{t∈𝒮, a∈A_t} p(s | t, a) x_{ta} = w_s ,   s ∈ 𝒮.

Irredundant Dual. Choose V = (V_s) that

    minimizes    ∑_{s∈𝒮} w_s V_s

subject to

    V_s ≥ r(s, a) + ∑_{t∈𝒮} p(t | s, a) V_t ,   a ∈ A_s , s ∈ 𝒮.
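A minimal sketch of solving the irredundant dual with scipy.optimize.linprog follows (hypothetical two-state data). Each dual inequality V_s ≥ r(s,a) + ∑_t p(t|s,a)V_t is rewritten as ∑_t p(t|s,a)V_t − V_s ≤ −r(s,a), and linprog then minimizes wV over the excessive vectors.

import numpy as np
from scipy.optimize import linprog

# Hypothetical transient data: states 0, 1; rewards r[(s,a)] and transition rows p[(s,a)].
r = {(0, 0): 1.0, (0, 1): 0.6, (1, 0): 0.0, (1, 1): 2.0}
p = {(0, 0): [0.0, 0.7], (0, 1): [0.2, 0.2], (1, 0): [0.1, 0.0], (1, 1): [0.4, 0.4]}
S, w = 2, np.array([1.0, 1.0])

A_ub, b_ub = [], []
for (s, a), reward in r.items():
    row = np.array(p[(s, a)], dtype=float)
    row[s] -= 1.0                       # sum_t p(t|s,a) V_t - V_s <= -r(s,a)
    A_ub.append(row)
    b_ub.append(-reward)

res = linprog(c=w, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * S, method="highs")
print("maximum value V* =", res.x)      # the least excessive vector, i.e., the fixed point of R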

We next study the set of all feasible solutions of the irredundant primal. To do this, it is
convenient to introduce two definitions.

State-Action Frequency. The state-action frequency of a policy is the vector whose sa-th component is the expected number of times that individuals are in state s and take action a when the policy is used. If the state-action frequency x = (x_{sa}) of a policy π is finite, e.g., as when π is transient, x is feasible for the irredundant primal. To see this, observe that the expected number of times that individuals leave state s when π is used is ∑_{a∈A_s} x_{sa}. On the other hand the expected number of times that individuals enter state s when π is used is w_s + ∑_{t∈𝒮, a∈A_t} p(s | t, a) x_{ta}. The first term accounts for the initial population in state s in period one and the second for the expected number of times (after period one) that individuals enter state s from the preceding period when π is used. Since x ≥ 0 and the expected number of times that individuals enter and leave state s must coincide when π is used, the state-action frequency x of π is feasible for the irredundant primal.

The above development shows that the state-action frequency of every transient policy is feas-
ible for the irredundant primal. The question arises whether the converse is also true. In short, is
every feasible solution of the irredundant primal the state-action frequency of some policy? The
answer is ‘no’—at least for the “nonrandomized” policies considered so far—as the following ex-
ample shows.

Example. Nonconvexity of Set of State-Action Frequencies of Nonrandomized Policies. Suppose 𝒮 = {1, 2} and w = (1  0). There are only two actions in state one. The first generates no individuals in either state in the next period and the second sends the individual in state one to state two in the next period. There is only one action in state two and it generates no individuals in either state in the next period. Then there are effectively only two nonrandomized policies—both transient—and the set of state-action frequencies is {(1  0  0), (0  1  1)}, which is not convex. Thus there are many feasible solutions of the irredundant primal that are not state-action frequencies of nonrandomized policies.

Initially Randomized Policies. On the other hand, Theorem 15 below asserts that every feasible solution of the irredundant primal is a state-action frequency if the system is transient and the class of admissible policies is enlarged to permit initial randomization. Call a policy initially randomized if at the beginning of period one, the policy selects each stationary policy δ^∞ with some given probability μ_δ ≥ 0, ∑_{δ∈Δ} μ_δ = 1. Call such a policy nonrandomized if all but one of the μ_δ’s is zero. In particular, if only μ_δ is nonzero, and so equal to one, the corresponding nonrandomized policy is simply the stationary one δ^∞.
Theorem 15 also asserts that the extreme solutions of the irredundant primal are state-action frequencies of nonrandomized stationary policies even when the system is not transient. To establish this result, we first require the following characterization of the existence of a solution and of a least solution of the following system of linear inequalities where w and P are nonnegative.

(4)    x = w + xP ,   x ≥ 0.

Lemma 14. Suppose that the row vector w and square matrix P are nonnegative and have a like number of columns, and let y ≡ ∑_{N=0}^{∞} wP^N. Then the following are equivalent.
1° y is finite.
2° The inequalities (4) have a solution, e.g., y.
3° y is finite and is the least solution of the inequalities (4).

Proof. 1° ⇒ 2°. Observe that y = w + ∑_{N=1}^{∞} wP^N = w + yP ≥ 0, i.e., y satisfies (4).
2° ⇒ 3°. If x satisfies (4), then x = w + xP = ⋯ = ∑_{i=0}^{N−1} wP^i + xP^N ≥ ∑_{i=0}^{N−1} wP^i for N = 1, 2, …, so x ≥ y. Thus since 1° ⇒ 2°, y is the least solution of (4).
3° ⇒ 1°. Immediate. ∎

Theorem 15. Feasible Solutions of Irredundant Primal are State-Action Frequencies of Initially Randomized Policies. If w ≥ 0, then x is an extreme solution of the irredundant primal if and only if x is the finite state-action frequency of a nonrandomized stationary policy. If w ≫ 0, the feasible row bases of the irredundant primal are the matrices I − P_δ for which δ^∞ is transient. If w ≥ 0 and the system is transient, then the set of feasible solutions of the irredundant primal coincides with the set of state-action frequencies of the initially randomized policies.

Proof. Suppose w ≥ 0. We begin by proving the sufficiency in the first assertion. Suppose x is the finite state-action frequency of δ, so x_δ = ∑_{N=0}^{∞} wP_δ^N is finite. Also suppose x = ½(u + v) for some feasible solutions u and v of the irredundant primal. Then the nonzero elements of x, u and v are contained among the state-action pairs defining δ, so u_δ and v_δ satisfy (4) with P = P_δ. Since x_δ is the least solution of (4) by Lemma 14, x_δ ≤ u_δ, v_δ, whence because x_δ = ½u_δ + ½v_δ, x_δ = u_δ = v_δ. Hence, x = u = v, so x is an extreme solution of the irredundant primal as asserted.

To establish the necessity in the first assertion, let L be the ∑_{s∈𝒮} |A_s| row by S column constraint matrix of the irredundant primal. Let x be an extreme solution of xL = w, x ≥ 0 and x⁺ be the row subvector of positive elements of x. Then the rows B of L corresponding to x⁺ are linearly independent. Let B⁺ (resp., B⁰) be the columns of B that contain at least one positive (resp., no positive) element, and let w⁺ (resp., w⁰) be the corresponding row subvector of w. Since x⁺B⁰ = w⁰, x⁺ ≫ 0, B⁰ ≤ 0 and w⁰ ≥ 0, it follows that B⁰ = 0 and w⁰ = 0. Thus, since each row of L, and hence of B, has at most one positive element, each column of B⁺ has exactly one positive element. For if not, B⁺ has more rows than columns, contradicting the fact that the rows of B, and hence of B⁺, are linearly independent. Thus B⁺ is nonsingular. Let δ be a decision that uses the state-action pairs corresponding to the rows of B and arbitrary actions in other states. On possibly permuting the rows of L, we can write B⁺ = I − P⁺ for the restriction P⁺ ≥ 0 of P_δ to the set 𝒮⁺ of states corresponding to w⁺. Since B⁺ is nonsingular, x⁺ is the unique solution of x⁺ = w⁺ + x⁺P⁺ ≥ 0, so x⁺ = ∑_{N=0}^{∞} w⁺(P⁺)^N by Lemma 14. Hence since 𝒮⁺ is closed under δ, x is the state-action frequency of δ^∞, which proves the first assertion. The second assertion follows from what was just proved on noting that 𝒮⁺ = 𝒮 when w ≫ 0.

To establish the third assertion, suppose w ≥ 0 and the system is transient. Then the set of feasible solutions of the irredundant primal is bounded because it has an optimal solution when r(s, a) = 1 for all a ∈ A_s and s ∈ 𝒮 by Theorem 13, and so is a polytope, i.e., is polyhedral and bounded. Also, it is known that each element of a polytope is a convex combination of its extreme points. The third assertion then follows from the first if we take the weights on the extreme points in the convex combination to be the initial probabilities of choosing the various nonrandomized stationary policies. ∎

Finding An Optimal Policy. Observe from Theorems 13 and 15 that if the system is transient and x = (x_{sa}) is a basic optimal solution of the irredundant primal program, then one maximum-value decision in state s is to take the unique action a ∈ A_s for which x_{sa} > 0 if such an action exists and to take any action if x_{sa} = 0 for all a ∈ A_s.

Simplex Method: A Specialization of Policy-Improvement Method. Application of the simplex method to solve the irredundant primal is equivalent to the specialization of the policy-improvement method in which the action for only a single state is changed at each iteration. To see this, suppose that δ ∈ Δ. The basis corresponding to δ is I − P_δ and the price vector is given by V_δ = (I − P_δ)^{−1} r_δ, so that the reduced profits corresponding to γ ∈ Δ are r_γ − (I − P_γ)V_δ = G_{γδ}. The simplex method selects a γ and an s for which (G_{γδ})_s > 0 and γ differs from δ only in state s. (This selection rule agrees with the ordinary rule by which the simplex method chooses a variable to leave the basis when w ≫ 0 because then the only feasible bases are associated with decisions. The selection rule breaks ties in the choice of the exiting variable when w ≥ 0 and w is not strictly positive.) By contrast, the policy-improvement method amounts to carrying out block pivots in the irredundant primal program, which although not generally justified in linear programming, is valid here because of the special “totally Leontief ” structure of the transpose of the irredundant primal constraint matrix L defined in the above proof.
We illustrate the simplex method in Figure 6 for a two-state problem in the space of dual inequalities. In that example there are two states 1, 2, three actions b, c, d in state 1, and two actions b, c in state 2. There are five state-action pairs. For each state-action pair sa, we plot the line that is the set of solutions of the linear equation

    V_s = r(s, a) + ∑_{t∈𝒮} p(t | s, a) V_t.

The value of a decision is the intersection of the two lines corresponding to the two state-action pairs defining the decision. A decision is a pair of state-action pairs, one for each state. For example, the maximum-value decision (1d, 2b) takes action d in state one and action b in state two, and the maximum value is the least excessive point. One possible sequence of decisions selected by applying the simplex method to the irredundant primal is: (1b, 2c) → (1c, 2c) → (1c, 2b) → (1d, 2b). The heavy bold line in Figure 6 traces the progress of the simplex method in order through the values of these decisions in the space of dual variables. Notice that the action for exactly one state is changed at each iteration.

Running Times
So far we have discussed three related methods for finding maximum-value policies in trans-
ient systems, viz., successive approximations, policy improvement (including Newton’s method)

and linear-programming methods (including the simplex method and interior-point algorithms). Each of these methods can be used in practice to solve problems with thousands—and sometimes even millions—of states if implemented properly on a computer.

[Figure 6. Progress of Primal Simplex Method in Dual Space for Two-State Problem. The figure plots the five dual lines for the state-action pairs 1b, 1c, 1d, 2b, 2c in the (V_1, V_2) plane, the set of excessive points, the maximum value V*, and the path of the simplex method through the values of the successive decisions.]
Uniformly Strictly Substochastic Systems: Strongly Polynomial-Time Exact and Linear-Time ε-Approximation Algorithms. Suppose the system is uniformly strictly substochastic, i.e., each of the row sums of the transition matrices is uniformly bounded by a fixed constant 0 ≤ β < 1. With rational data, Ye [Ye03] has shown that a maximum-value policy can be found in strongly polynomial time with interior-point methods. Even without rational data, the running time can be reduced to linear time using successive approximations if instead one is satisfied to find an ε-optimal policy. We outline briefly why this is so, leaving a more complete account to a homework problem.

Call a policy ε-optimal if its value is within 100ε% (ε > 0) of the maximum value. Call an algorithm that finds such a policy an ε-approximation algorithm. It turns out that for fixed ε > 0, successive approximations is a strongly linear-time ε-approximation algorithm. That is so because the method can be implemented to find a policy that is ε-optimal with O(n) arithmetic operations (+, −, ×, ÷, ∨, ∧) where n is the number of nonzero one-period rewards and transition rates over all states and actions. To that end, it is necessary to compute V^N = ℛV^{N−1} for N = 1, …, M with V^0 ≡ 0 and M ≡ ⌈ln(1/ε)/ln(1/β)⌉ to assure that any optimal decision in the first period of an M-period problem has value within 100ε% of the maximum with an infinite horizon. Since each of the M needed evaluations of ℛV^{N−1} requires O(n) arithmetic operations and M depends only on ε and β, successive approximations finds an ε-optimal policy in strongly linear time.

Since one must examine all the data of a transient system at least once to be assured of finding a maximum-value policy, the running time of every algorithm for finding a maximum-value policy is at least O(n). Hence, under the above hypothesis and for fixed ε, successive approximations is as efficient as possible up to a constant factor.
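A small sketch of the horizon bound used above (eps and beta are the ε and β of the text; the numerical values are hypothetical):

import math

def horizon_for_eps(eps, beta):
    """M = ceil(ln(1/eps) / ln(1/beta)) successive-approximation steps, as in the argument above,
    when every row sum of every transition matrix is at most beta < 1."""
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / beta))

print(horizon_for_eps(1e-3, 0.9))    # e.g. 66 iterations, independent of the number of states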
Gauss-Seidel Method. One way to improve successive approximations is to update the value V after computing (ℛV)_s for a state s, rather than waiting to compute (ℛV)_s for all states s before updating V. This is the Gauss-Seidel method, and it converges faster than successive approximations.
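A minimal sketch of a Gauss-Seidel sweep (same hypothetical data format as the earlier sketches): each state's value is updated in place and used immediately for the remaining states.

import numpy as np

r = [np.array([1.0, 0.6]), np.array([0.0, 2.0])]
P = [np.array([[0.0, 0.7], [0.2, 0.2]]),
     np.array([[0.1, 0.0], [0.4, 0.4]])]
V = np.zeros(len(r))

for sweep in range(100):
    change = 0.0
    for s in range(len(r)):                        # update V[s] and reuse it immediately
        new = max(r[s][a] + P[s][a] @ V for a in range(len(r[s])))
        change = max(change, abs(new - V[s]))
        V[s] = new
    if change < 1e-10:
        break
print("Gauss-Seidel value:", V)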
Newton’s Method. A different way to reduce the number of iterations of successive approximations is to use Newton’s method. However, this method requires the added work of computing the value of the policy found at each iteration. The number of arithmetical operations to do this by Gaussian elimination is O(S³), which is higher than linear time if the number A of state-action pairs is much larger than S. If instead one uses successive approximations to estimate these values, then the overall number of arithmetic operations to execute Newton’s method falls to O(n) for fixed ε as with successive approximations.
Balancing Pricing and Updating Values. In all three algorithms, one should take care to strike an appropriate balance between the work of pricing (computing ℛV) and updating the estimate of the value V. In particular, this suggests that successive approximations is likely to be more or less efficient than Newton’s method according as the number of actions in each state is respectively relatively small or large. This is simply because ℛV is relatively cheap to compute in the former case and relatively expensive to compute in the latter when compared to the work to update V. Also, successive approximations is likely to be more or less efficient than Newton’s method using Gaussian elimination according as the maximum spectral radius of the transition matrices is respectively relatively small or large. This is because the work to compute the new value by successive approximations relative to Gaussian elimination is relatively large or small according as the maximum spectral radius is respectively relatively large or small.
Simplex Method. If the simplex method is used to solve the irredundant primal, it is probably not a good idea to compute ℛV at each iteration to find the best single state-action pair to improve. (For with the same work, one could use Newton’s method to improve the actions in all states at once!) The reason is simply that too much work is expended in pricing. A better approach is to price out the actions for a single state s at each iteration (which requires merely computing (ℛV)_s), perhaps cycling periodically through the states to be priced. In this way one strikes a better balance between the work of pricing and finding a good estimate of the value.
Transient Systems: Polynomial-Time Algorithms. It is possible to check whether or not a substochastic system is transient in O(n) time with a combinatorial algorithm that depends only on the number n of state-action pairs and positive transition rates. It is possible to check whether or not an arbitrary system is transient by solving the irredundant primal linear program with objective function r(s, a) = 1 for all a ∈ A_s and s ∈ 𝒮. If the data are rational, this linear program and its dual can be solved with O((AS² + A^{1.5}S)L) arithmetic operations where L is the number of bits required to represent the reward and transition rates. Indeed, this can be done with an interior-point algorithm of Vaidya [Va90] and a refinement of Ye, Todd and Mizuno [YTM94]. If the system is transient, a maximum-value policy can then be found by solving the irredundant primal and dual linear programs with the same interior-point method and running time. An open question is whether these methods can be refined to run in linear time.
Stochastic Constraints. Practitioners often want to impose stochastic constraints, i.e., constraints on the state-action frequencies, to achieve service goals or avoid undesirable states. Here are some examples. The expected number of stockouts does not exceed κ. The expected number of backorders is at most β. The expected number of equipment failures is at most φ. Constraints of this type can often be expressed by appending linear inequalities to the irredundant primal. Alternately, a stochastic constraint might be necessary to assure that if the system uses an action in one of a subset of states, then the system uses the action in all states in the subset. A constraint of this type can be expressed in terms of linear inequalities in integer and continuous variables.

If stochastic constraints are imposed on the irredundant primal, then the optimal solution of the linear program will not generally be an extreme point of the irredundant primal. For this reason, it is necessary to use randomized policies. As we have seen, if the system is transient, then one way of doing this is to use an initially randomized policy. However, such a policy has the unattractive feature that the policies used after the initial randomization can have radically different performance. This feature seems artificial and is likely to be unacceptable to users.

Stationary Randomized Policies. In practice it is often more desirable to use a stationary randomized policy δ^∞ with δ = (δ_s) where, for each s ∈ 𝒮, δ_s = (δ_{sa}) is a probability vector on A_s, i.e., a nonnegative vector on A_s whose sum is one. The interpretation is that δ^∞ chooses action a ∈ A_s in state s ∈ 𝒮 with probability δ_{sa} each time an individual enters that state. The stationary (nonrandomized) policies are equivalent to the stationary randomized policies where the probability vectors are all unit vectors.

Theorem 16. Feasible Solutions of the Irredundant Primal are State-Action Frequencies
of Stationary Randomized Policies. Suppose w ≥ 0 and the system is transient. Then the set of
feasible solutions of the irredundant primal coincides with the set of state-action frequencies of
the stationary randomized policies. In particular, if x is a feasible solution of the irredundant
primal, then x is the state-action frequency of the stationary randomized policy δ^∞ where δ = (δ_s)
and the probability vector δ_s = (δ_{sa}) on A_s is given by δ_{sa} = x_{sa} / ∑_{α∈A_s} x_{sα} for each action a ∈ A_s for
states s with ∑_{α∈A_s} x_{sα} > 0, and δ_s is arbitrary for states s with ∑_{α∈A_s} x_{sα} = 0.
Before turning to the proof, it is worthwhile noting why the last claim of the Theorem is
intuitive. The reason is that the probability of choosing action a in state s can reasonably be
expected to be the ratio of the expected number x_{sa} of individual visits to state s that choose
action a therein to the expected number ∑_{α∈A_s} x_{sα} of individual visits to state s.
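As a small illustration of this formula, here is a minimal Python sketch (the states, actions and
frequencies are hypothetical, not data from the notes) that converts a feasible solution x of the
irredundant primal into the stationary randomized policy δ of Theorem 16.

```python
# Sketch: state-action frequencies x[s][a] -> stationary randomized policy delta.
x = {
    's1': {'a1': 2.0, 'a2': 1.0},   # expected visits to s1 that choose a1, a2
    's2': {'a1': 0.0, 'a2': 0.0},   # state with zero expected visits
}

def randomized_policy(x):
    delta = {}
    for s, freqs in x.items():
        total = sum(freqs.values())
        if total > 0:
            delta[s] = {a: v / total for a, v in freqs.items()}
        else:
            # delta_s is arbitrary here; pick the uniform distribution on A_s
            delta[s] = {a: 1.0 / len(freqs) for a in freqs}
    return delta

print(randomized_policy(x))
# {'s1': {'a1': 0.666..., 'a2': 0.333...}, 's2': {'a1': 0.5, 'a2': 0.5}}
```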

Proof. We show first that each stationary randomized policy is transient. To that end, for each
state s ∈ 𝒮, let Γ_s be the set of probability vectors γ_s = (γ_{sa}) on A_s. Consider the stationary
randomized policy γ^∞ where γ = (γ_s) and γ_s = (γ_{sa}) ∈ Γ_s for each s ∈ 𝒮. Let P_γ be the transition
matrix of γ, i.e., the st-th element of P_γ is ∑_a p(t | s,a) γ_{sa}. To see that P_γ is transient, observe from
Lemma 4 that because the system is transient, there is a unique vector V ≥ 1 such that

    V_s = max_{a∈A_s} (1 + ∑_t p(t | s,a) V_t),   s ∈ 𝒮.

This implies that

    V_s = max_{η_s∈Γ_s} (1 + ∑_{t,a} p(t | s,a) η_{sa} V_t),   s ∈ 𝒮,

because, for each s ∈ 𝒮, the maximum on the right-hand side of the above equation is attained by
setting η_{sa} = 1 for some a ∈ A_s and η_{sα} = 0 otherwise. Now on setting η_s = γ_s for each s, it follows
from the above equation that V ≥ 1 + P_γ V. Iterating this inequality shows that

    V ≥ ∑_{i=0}^{N−1} P_γ^i 1 + P_γ^N V ≥ ∑_{i=0}^{N−1} P_γ^i 1,

so ∑_{i=0}^{N} P_γ^i 1 is increasing in N and bounded above by V, whence P_γ is transient as claimed.
Next observe that since each stationary randomized policy is transient, its state-action frequency
is finite and so feasible for the irredundant primal by the argument in the paragraph titled “state-
action frequency” on page 31. Conversely, it remains to show that if x is a feasible solution of the
irredundant primal, then x is the state-action frequency of the stationary randomized policy δ^∞
defined in the Theorem. To that end, let X_s be the expected number of exits from state s when
using δ^∞. Since δ^∞ is transient, the X_s are finite. It remains to show that X_s = ∑_{α∈A_s} x_{sα} for
each state s. To see this, observe first that because the expected number of entries into state s
equals the expected number of exits therefrom, it follows that

(5)    X_s − ∑_{t,a} p(s | t,a) δ_{ta} X_t = w_s,   s ∈ 𝒮.

It is enough to show that this equation has a unique solution since x is feasible for the irredundant
primal and so, as is readily verified by substitution, X_s = ∑_{α∈A_s} x_{sα} is one such solution. Let P_δ
be the matrix whose ts-th element is ∑_a p(s | t,a) δ_{ta}. Rewriting (5) in matrix form yields

    X − X P_δ = w,

which has a unique solution because P_δ is transient. ∎
Example. Supply Management with Service Constraints. Consider a dynamic supply prob-
lem with an obsolescence probability, backorders up to B, say, starting inventories up to I, say,
and a zero lead time for delivery of orders. Then the states are −B, …, I. Now the constraint
that the expected number of stockouts is at most κ can be expressed by the inequality

    ∑_{s=−B}^{−1} ∑_a x_{sa} ≤ κ.

Another possible constraint is to require that the expected fraction of the periods in which stock
is short be at most f. This can be expressed as the ratio constraint

    (∑_{s=−B}^{−1} ∑_a x_{sa}) / (∑_{s=−B}^{I} ∑_a x_{sa}) ≤ f,

or equivalently, the linear inequality

    ∑_{s=−B}^{−1} ∑_a x_{sa} ≤ f ∑_{s=−B}^{I} ∑_a x_{sa}.

On the other hand, the constraint that the expected number of backorders be at most β would
take the form

    −∑_{s=−B}^{−1} s (∑_a x_{sa}) ≤ β.

Augmented Irredundant Primal. In order to find optimal policies with stochastic constraints,
it is necessary to append the stochastic constraints to the irredundant primal program and solve
the resulting program. Call this the augmented irredundant primal.
Number of Randomized Actions. If there are, say, k linear stochastic constraints, then each
basic feasible solution of the augmented irredundant primal will have at most S + k positive
variables because there can be at most one basic variable for each constraint. Thus, there can be
up to S + k state-action pairs that are positive in a basic feasible solution of the augmented irre-
dundant primal. In that event, the number of states in which it may be optimal to randomize
can be as large as k. If k is small in comparison with S, relatively little randomization is neces-
sary. But if k is large in comparison with S, randomization becomes important. In the supply
problem above, there are only three stochastic constraints. Thus, it will be optimal to randomize
in at most three initial inventory levels.
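To make the augmented irredundant primal concrete, here is a hedged Python sketch using
scipy.optimize.linprog on a fabricated two-state transient example; the rewards, transition rates,
immigration stream and the single appended service constraint are illustrative assumptions, not
data from the notes.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical two-state, two-action transient system (illustrative data only).
states, actions = [0, 1], [0, 1]
r = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 0.0, (1, 1): 2.0}     # rewards r(s,a)
p = {(0, 0): [0.5, 0.2], (0, 1): [0.1, 0.7],                  # p(. | s,a), substochastic
     (1, 0): [0.0, 0.4], (1, 1): [0.3, 0.5]}
w = [1.0, 0.0]                                                # immigration stream

cols = [(s, a) for s in states for a in actions]
# Flow-balance rows of the irredundant primal: sum_a x_sa - sum_{t,a} p(s|t,a) x_ta = w_s.
A_eq = np.zeros((len(states), len(cols)))
for i, s in enumerate(states):
    for j, (t, a) in enumerate(cols):
        A_eq[i, j] = (1.0 if t == s else 0.0) - p[(t, a)][s]
# Appended stochastic constraint (illustrative): expected visits to state 1 at most 0.8.
A_ub = np.array([[1.0 if t == 1 else 0.0 for (t, a) in cols]])
b_ub = [0.8]

res = linprog(c=[-r[c] for c in cols], A_eq=A_eq, b_eq=w, A_ub=A_ub, b_ub=b_ub)
print(dict(zip(cols, res.x)))  # a feasible x; randomization may appear where the constraint binds
```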
Maximum Present Value


The maximum-value criterion may be a reasonable one if the population is indifferent between
receiving income in one period or another. However this is often not the case because income has
a time value. One reflection of this is that it is usually possible to borrow and lend money at ap-
propriate interest rates. Assume now that these two interest rates both equal 100ρ% per period,
say. In that event, the amount of money that must be invested in one period to generate one
unit of money in the following period is β ≡ 1/(1 + ρ). Call β the discount factor, reflecting the dis-
count that one enjoys for buying one unit of money one period before receiving it. Thus, the
amount of money that would be needed in period zero to generate the income stream r_1, …, r_N
in periods 1, …, N at the interest rate 100ρ% is

    V^{ρN} ≡ ∑_{i=1}^{N} β^i r_i.

Call this sum the present value of the stream (r_i) discounted to period zero.
The above development tacitly assumes that interest is not taxed. If, as is usually the case,
interest is taxed, it is more natural to let 100ρ% be the after-tax interest rate, e.g., 60% of the
original interest rate for populations in the 40% tax bracket. Then V^{ρN} is the after-tax present
value of the income stream (r_i).
In the above discussion we have also ignored another important factor, viz., inflation. In in-
flationary times, income earned in one period has greater purchasing power than income earned
later. This suggests that income streams and interest rates should be adjusted to account for
inflation. To be concrete about this, suppose that 100ρ′% is the ordinary after-tax interest rate
and 100f% is the rate of inflation. Then the inflation-adjusted after-tax interest rate 100ρ% is
given by 1 + ρ = (1 + ρ′)/(1 + f). Thus, even if ρ′ is positive, ρ will be negative if the rate of inflation ex-
ceeds the after-tax rate of interest, a not uncommon situation. In any event V^{ρN} is the inflation-
adjusted after-tax present value of the income stream (r_i). Since the inflation-adjusted after-tax
present value is equivalent to an ordinary present value with an appropriate choice of the inter-
est rate, we can and do simply speak about interest rates and present values in the sequel.
The above discussion leads us to consider both positive and negative interest rates. We will
make the fairly mild assumption that ρ > −1, however. For if ρ ≤ −1, then either ρ = −1, in
which case money invested in one period is lost in the following period, or ρ < −1, in which case
a unit investment in one period becomes a liability of −(1 + ρ) > 0 in the following period. Both
cases rarely occur, and almost never repeatedly.
The above considerations suggest that policies π = (δ_i) be compared on the basis of their
N-period present values (discounted to period zero)

    V_π^{ρN} ≡ ∑_{i=1}^{N} β^i P_π^{i−1} r_{δ_i}.
On setting P̂_δ ≡ β P_δ, r̂_δ ≡ β r_δ and P̂_π^i ≡ P̂_{δ_1} ⋯ P̂_{δ_i}, the above can be rewritten as

    V_π^{ρN} = ∑_{i=1}^{N} P̂_π^{i−1} r̂_{δ_i}.

This reduces the problem of finding a policy with maximum N-period present value to an equiv-
alent problem of finding a policy with maximum N-period value.
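The reduction is easy to check numerically. The following Python sketch (with made-up two-state
data for P_δ and r_δ) computes the N-period present value of a stationary policy directly and via
the transformed rewards r̂_δ = βr_δ and transition matrix P̂_δ = βP_δ.

```python
import numpy as np

rho, N = 0.1, 20
beta = 1.0 / (1.0 + rho)
P = np.array([[0.6, 0.4], [0.3, 0.7]])   # illustrative P_delta
r = np.array([1.0, -2.0])                # illustrative r_delta

# Direct N-period present value: sum_{i=1}^N beta^i P^{i-1} r.
V1 = sum(beta**i * np.linalg.matrix_power(P, i - 1) @ r for i in range(1, N + 1))
# Equivalent maximum-value form with P_hat = beta*P, r_hat = beta*r.
P_hat, r_hat = beta * P, beta * r
V2 = sum(np.linalg.matrix_power(P_hat, i - 1) @ r_hat for i in range(1, N + 1))
print(np.allclose(V1, V2))  # True
```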
Observe that σ(P̂_δ) < 1 if and only if β σ(P_δ) < 1. Thus the equivalent problem with transition
matrices P̂_δ is transient if and only if βσ < 1, or equivalently, the present value (discounted to
period zero)

    V_π^ρ ≡ ∑_{i=1}^{∞} β^i P_π^{i−1} r_{δ_i}

of each policy π = (δ_i) is absolutely convergent for every choice of the rewards. When this is so, the maxi-
mum-present-value problem is equivalent to a maximum-value problem and so the results for the
latter problem apply at once. In particular, there is a stationary maximum-present-value policy.

Multi-Armed Bandit Problem: Optimality of Largest-Index Rule [GJ74], [KV87], [Gi89]


A fundamental sequential statistical decision problem that attracted the interest of statistic-
ians at least as early as the 1940s is the multi-armed bandit problem. The problem entails find-
ing a maximum-present-value policy in a particular multidimensional Markov decision chain.
The difficulty of the problem was thought to be so great that Allied scientists during World War
II joked that dropping leaflets describing the problem over Axis countries might be a good way to
distract their scientists from the war effort. The problem remained unsolved until 1972 when
Gittins and Jones found a remarkably simple solution. There are many applications of this problem
including sequential allocation of several available treatments to patients and sequential allocation
of effort to several parallel projects. For definiteness in the sequel, consider the problem to be one
of project scheduling.
There are N independent projects, labeled 1, …, N. In each period, only one project may be
active. The states of the N − 1 inactive projects in a period are frozen. The state s_{nt} of project n
at the beginning of the t-th period it is active is a finite-state Markov chain with stationary tran-
sition probabilities. Since the state of project n does not change while it is inactive, s_{n1} is also
the state of that project at the beginning of period 1. Put s_n = (s_{n1}, s_{n2}, …) and assume that s_j
and s_n are independent for all j ≠ n. If project n is active in a period when it is in state s, the
project earns a reward r_s^n. Inactive projects earn no rewards. There is a discount factor
0 < β < 1. A policy is a nonanticipative rule for deciding which project to activate in each per-
iod, i.e., the choice of the project to activate in a period depends only on the projects active in
prior periods and the states of the projects observed in that and prior periods. This is a finite-
state-and-action Markov decision chain. The goal is to find a maximum-present-value policy.
Random and Stopping Times. To do this requires a few definitions. A random time is a ∞- or
nonnegative-integer-valued random variable. A stopping time for project n is a random time
σ that is nonanticipative for project n, i.e., Pr(σ ≤ t | s) is independent of s_{n,t+1}, s_{n,t+2}, … for each
nonnegative integer t and s = (s_1, …, s_N). A policy may be described inductively by selecting
for each t = 1, 2, … a period τ_t and a project that will be active during periods τ_{t−1}+1, …, τ_t
where τ_0 ≡ 0 and τ_{t−1} < τ_t whenever τ_{t−1} < ∞. Since policies are nonanticipative and since s_j
and s_n are independent for j ≠ n, τ_t − τ_{t−1} is the stopping time for the project selected in per-
iod τ_{t−1}+1 relative to the state of the project in that period.
Index of a Project in a State. Let R_t^n ≡ ∑_{j=1}^{t} β^j r^n_{s_{nj}} be the present value of the rewards that pro-
ject n earns when it is active in the first t periods. Put B_t ≡ 1 − β^t. The sequel shows that B_t is
proportional to ∑_{j=1}^{t} β^j. The index of project n in state s is ν_{ns} ≡ max_{τ≥1} E R_τ^n / E B_τ where the max-
imum is over stopping times τ ≥ 1 for project n in state s. Note that the indices for a particular
project in its various states depend only on the rewards and transition law of the Markov chain
for that project—not the others. Consequently, as the development below shows, finding the in-
dices for the project entails finding maximum-present-value policies for a family of Markov deci-
sion chains for that project alone, one for each state.
Largest-Index Rule. The largest-index rule is a policy that, in each period, selects a project
with maximum index. This simple policy turns out to have maximum present value. This result
reduces the N-dimensional Markov decision problem to a collection of one-dimensional ones.
Present Value a Project Earns. The key fact underlying the development of this result is that
for any sequence 1 ≥ α_1 ≥ α_2 ≥ ⋯ ≥ 0 of (Borel) functions of s, there is a random time σ for
which Pr(σ ≥ t | s) = α_t for t = 1, 2, …, so

(6)    E R_σ^n = E ∑_{t=1}^{∞} α_t β^t r^n_{s_{nt}}.

For example, if σ_t is the t-th period that project n is active when using a given policy and if σ is
the random time for which Pr(σ ≥ t | s) = α_t = β^{σ_t − t}, then (6) implies that E R_σ^n = E ∑_{t=1}^{∞} β^{σ_t} r^n_{s_{nt}}
is the present value of the rewards that the policy earns from project n. Thus, the random time σ
adjusts the discounting of rewards that project n earns to reflect its active periods σ_1, σ_2, ….

Theorem 17. Optimality of Largest-Index Rule. In the multi-armed-bandit problem, the largest-
index rule has maximum present value.

Proof. Let π* be a policy that uses the largest-index rule and let τ_{t−1}+1 be the t-th period in
which π* selects a project. By relabeling, assume that π* selects project one, say, in the first per-
iod. Put τ ≡ τ_1 and ν_1 = E R_τ^1 / E B_τ. By hypothesis, ν_1 ≥ ν_{n s_{n1}}, so for each stopping time σ ≥ 1
for project n,
(7)    E R_σ^n ≤ ν_1 E B_σ.

Let V_π herein be the present value of any policy π starting from the given initial state. Let
π′ be an arbitrary policy, ε > 0 be a number and π_0 be a policy that activates each project infin-
itely often and for which V_{π_0} ≥ V_{π′} − ε. Because rewards in distant periods are heavily discount-
ed, π_0 can be formed, for example, by modifying π′ to cyclically activate each project for one
period every N periods starting in a sufficiently distant period. Let π_1 be the policy that
permutes the order in which π_0 activates projects by shifting the first τ times that π_0 activates
project one to the first τ periods.6 Let T be the period in which π_0 activates project one for the
τ-th time if τ is finite and let T = ∞ otherwise. Figure 7 illustrates these definitions for the case
N = 3.

[Figure 7. Active Projects in each Period (N = 3). In periods 1–15, π_0 activates projects
2 3 1 1 3 1 3 2 2 1 2 1 3 3 1, while π_1 activates 1 1 1 1 2 3 3 3 2 2 2 1 3 3 1; here τ = 4 and T = 10.]

For each fixed n and t ≥ 1, let σ_t (resp., l_t) be the period in which π_0 (resp., π_1) activates pro-
ject n for the t-th time. For 1 ≤ σ_t ≤ T, set α_t = β^{σ_t − t} if n = 1 and α_t = β^{σ_t − t} − β^{l_t − t} if n > 1. For
σ_t > T, σ_t = l_t and set α_t = 0. Observe that 0 ≤ α_t ≤ 1 is decreasing in t ≥ 1 because σ_t − t ≥ 0
is increasing and, if n > 1, σ_t − l_t ≤ 0 is increasing in t ≥ 1. Thus, from (6), there is a random
time σ^n ≤ T with Pr(σ^n ≥ t | s) = α_t for t ≥ 1. Also by (6) and the fact that σ_t = l_t for σ_t > T,
the difference between the present values that π_0 and π_1 earn from project n is E ∑_{t=1}^{∞} (β^{σ_t} − β^{l_t}) r^n_{s_{nt}}
= E ∑_{t: σ_t ≤ T} (β^{σ_t} − β^{l_t}) r^n_{s_{nt}}, which is E R^1_{σ^1} − E R^1_τ if n = 1 and E R^n_{σ^n} if n > 1. Hence,

(8)    V_{π_1} − V_{π_0} = E R^1_τ − ∑_{n=1}^{N} E R^n_{σ^n}.

If the reward that each active project earns in each state is instead (1 − β)/β, the left-hand side of
(8) vanishes, so because ∑_{j=1}^{t} β^j (1 − β)/β = 1 − β^t, (8) reduces to

(9)    0 = E B_τ − ∑_{n=1}^{N} E B_{σ^n}.

6For 1" to be a policy, it must be nonanticipative. That this is so may be shown by noting that 1" essen-
tially consists of activating project one for the first 7 periods and then using 1! afresh thereafter, being care-
ful to skip the first 7 times that 1! activates project one.
Now for each n, σ^n + 1 ≥ 1 is a stopping time for project n since Pr(σ^n + 1 ≤ t | s) = 1 − α_t is inde-
pendent of s_{n,t+1}, s_{n,t+2}, …. Hence, from (7)-(9) and the definition of ν_1,

    V_{π_1} − V_{π_0} ≥ ν_1 (E B_τ − ∑_{n=1}^{N} E B_{σ^n}) = 0.

Using the above construction, it follows by induction on t ≥ 1 that there exist policies π_t that
agree with π* through period τ_t ≥ t and have the property that V_{π_t} − V_{π_{t−1}} ≥ 0. Thus, V_{π_t} − V_{π_0}
≥ 0 and V_{π*} = lim_{t→∞} V_{π_t} ≥ V_{π_0} ≥ V_{π′} − ε. Since ε is arbitrary, V_{π*} ≥ V_{π′}. ∎

Restart Problem. To facilitate computation, it is useful to characterize the index of a project
in state s, say, as the value of the project in an associated “restart-in-s” problem. For notational
simplicity, drop the superscript designating the project in the sequel without further mention.
Now the restart-in-s problem entails choosing a maximum-present-value stopping time τ ≥ 1 in
state s for the problem in which one starts in state s, chooses the stopping time τ ≥ 1 and re-
starts afresh in state s in period τ + 1. If v_s has maximum present value for this problem, then
v_s = max_{τ≥1} E(R_τ + β^τ v_s). Now if v_{τs} is the (expected) present value of rewards that the stop-
ping time τ ≥ 1 earns, then v_{τs} = E R_τ + (E β^τ) v_{τs}, so v_{τs} = E R_τ / E B_τ. Hence, v_s = max_{τ≥1} v_{τs} = ν_s
is the index of the project in state s.
In order to examine the computational efficiency of this formulation, suppose that the se-
quence of states that the project enters is a Markov chain on the finite state-space 𝒮 of size S.
The above approach to the restart-in-s problem looks at the system only when it is in state s
and chooses a stopping time. Because the system is Markovian, it suffices to limit attention to a
nonrandomized Markovian stopping time, or equivalently a restarting set, i.e., a set of states in
which to restart the system in state s. Though this approach requires only one state, the number
of actions (restarting sets) in that state is 2^S, and so grows geometrically with S.
A better approach to this problem is to retain the original state-space of the Markov chain.
Then there are only two actions in each state j, viz., continue or restart in s. To implement this
approach, let P_j be the j-th row of the transition matrix of the Markov chain on 𝒮. Then the
maximum-present-value vector for this form of the restart-in-s problem is the unique solution
v = (v_j) of

(10)    v_j = β max(r_j + P_j v, r_s + P_s v),   j ∈ 𝒮.

Of course, ν_s = v_s is the desired index of the project in state s.
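A minimal sketch of this computation, assuming NumPy: it solves (10) by successive approxima-
tions (a choice made here for brevity, not a method prescribed by the notes), with an illustrative
transition matrix, reward vector and discount factor for a single project.

```python
import numpy as np

beta = 0.9
P = np.array([[0.2, 0.8, 0.0],    # illustrative transition matrix of one project
              [0.5, 0.0, 0.5],
              [0.1, 0.3, 0.6]])
r = np.array([1.0, 0.0, 2.0])     # illustrative rewards r_j

def restart_index(s, tol=1e-10):
    # Fixed-point iteration on v_j = beta * max(r_j + P_j v, r_s + P_s v); index = v_s.
    v = np.zeros(len(r))
    while True:
        cont = r + P @ v                       # continue in each state j
        v_new = beta * np.maximum(cont, cont[s])  # or restart in state s
        if np.max(np.abs(v_new - v)) < tol:
            return v_new[s]
        v = v_new

print([restart_index(s) for s in range(3)])   # index of the project in each state
```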


Size Comparison. Suppose that the sizes of the state-spaces of the N Markov chains are each
S. Then the original N-dimensional problem has S^N states, N actions in each state and N S^N
state-action pairs. By contrast, the largest-index rule requires solving SN restarting problems,
one for each state of each project, with each such problem having S states, two actions and 2S
state-action pairs. For example, if S = N = 4, the original problem would have 256 states and
1,024 state-action pairs. By contrast the largest-index rule would require solving 16 restarting prob-
lems each with 4 states and 8 state-action pairs.
On-Line Computation. The “discounted” computational effort required to solve the restarting
problems to find the needed indices can be reduced significantly if one computes on-line. Initially
one must solve N restarting problems, one per project. Thereafter, in order to implement the
largest-index rule, it suffices to wait until the state of the project that was active in the previous
period is observed to be in the restart set and then solve the restart problem for that project and
state only. This entails solving at most one restarting problem in each period. Thus the “discounted”
number of restarting problems that must be solved is at most β(N + β/(1 − β)), and so is independent of S.

8 MAXIMUM PRESENT VALUE WITH SMALL INTEREST RATES IN BOUNDED SYSTEMS


[Bl62], [Ve66], [MV69], [Ve69a], [Ve74], [Ro75c], [RV92]
When a decision maker is indifferent between receiving income in different periods, attention
shifts to problems in which the interest rate is zero. Problems of this type are quite important.
Some examples appear below. In the first, the inflation-adjusted after-tax interest rate is zero. In
all but the first two, there is no interest rate. However, it is useful to introduce small interest rates
to address these problems over long time horizons.
• Zero Inflation-Adjusted After-Tax Interest Rate. Suppose that the interest rate is 5%, the mar-
ginal tax rate is 40% and the rate of inflation is 3%. Then the after-tax rate of interest is 3%, which
coincides with the rate of inflation. Consequently, the inflation-adjusted after-tax interest rate is zero.
• Maximum Present Value for an Interval of Interest Rates. A decision maker may wish to find a
policy that simultaneously has maximum present value for all small enough interest rates major-
izing a given nonnegative interest rate.
• Maximize Sustainable Yield of a Renewable Resource. One may wish to maximize the long-term
sustainable yield of fish, crops, lumber, etc.
• Maximize Expected Instantaneous Rate of Return. This problem arises in infinite-horizon ver-
sions of Example 5 on p. 8.
• Maximize Throughput of a Factory. The general manager of a factory may wish to maximize
its throughput.
• Maximize Yield of a Production Process. A production engineer may wish to maximize the yield
of chips from a fab plant.
• Maximize Expected Number of Patients Treated Successfully. A hospital may wish to maxi-
mize the expected number of patients treated successfully.
• Maximize Expected Number of On-Time Deliveries. A trucking, shipping, or air-cargo company
may wish to maximize the expected number of on-time deliveries.
• Maximize the Expected Number of Customers Served. A firm may wish to maximize the ex-
pected number of satisfied customers.
• Minimize Expected Number of Fatalities. A public agency may wish to minimize the expected
number of fatalities from various hazards.
For transient systems, the maximum-present-value problem with zero interest rates is pre-
cisely the maximum-value problem discussed in §1.7. But for nontransient systems, the value of
a policy is typically not well defined or is infinite. For this reason, generalizing the notion of max-
imum value from transient to nontransient systems requires a different formulation.

Bounded System
Assume throughout this section that the system is bounded, i.e., its system degree d is zero or
one. For example, substochastic systems are bounded because then P_π^N is substochastic for each
N and so is bounded. However, not all bounded systems are substochastic. In bounded systems,
the present value V_π^ρ of a policy π is well defined and finite for all ρ > 0 even though that is not
generally the case when ρ = 0. However, for nontransient bounded systems with ρ = 0, it is not
generally possible or useful to compare policies by means of their present values because those
values may not be well defined or finite.

Strong Maximum Present Value


One way of resolving this dilemma is to consider present-value preference relations for small
(positive) interest rates. Such preference relations are of special interest when decisions are made
frequently, say every week or day. As will be seen subsequently, the strongest preference relation
of this type is the following. A policy π* has strong maximum present value if π* has simultaneous
maximum present value for all sufficiently small positive interest rates, i.e., there is a ρ* > 0
such that

(1)    V_{π*}^ρ − V_π^ρ ≥ 0 for all 0 < ρ < ρ* and all π.

This preference relation will play a fundamental role in what follows, especially for systems in
which there is exogenous immigration. The development to follow shows that stationary strong
maximum-present-value policies exist and generalizes the policy improvement method to find
such a policy.

Cesàro Limits and Neumann Series


First, however, it is useful to generalize the Neumann series expansion of Lemma 2 from
transient to bounded systems. If P_0, P_1, … and P are S × S complex matrices and P̂_N ≡ ∑_{i=0}^{N} P_i,
write P_N → P (C,1) if N^{-1} P̂_{N−1} → P. Similarly, write ∑_{N=0}^{∞} P_N = P (C,1) and say that the
series converges (C,1) if P̂_N → P (C,1), or equivalently, N^{-1} ∑_{i=0}^{N−1} P̂_i → P. On the other
hand, write P_N → P (C,0) if P_N → P. Incidentally, (C,0) (resp., (C,1)) stands for Cesàro limit
of order zero (resp., one). Thus Cesàro limits of order zero are ordinary limits.
It is easy to see that if a sequence converges (C,0), the sequence also converges (C,1), and
the two limits coincide. However the converse is generally false. For example, every periodic se-
quence converges (C,1). But such a sequence does not converge (C,0) except in the trivial case
of a constant sequence. Thus, one important feature of (C,1) convergence is that it assures that
a broader class of sequences converges than does ordinary (C,0) convergence.
In the above terminology, Lemma 2 asserts that if P is a square complex matrix and P^N → 0
(C,0), then I − P is nonsingular and has a (C,0) Neumann series expansion. The next lemma as-
serts that this result remains valid for (C,1) limits as well.

Lemma 18. Cesàro Neumann Series. If P is a square complex matrix and P^N → 0 (C,1), then
I − P is nonsingular and

(2)    (I − P)^{-1} = ∑_{N=0}^{∞} P^N   (C,1).

Proof. Recall as in the proof of Lemma 2 that

    I − P^j = ∑_{i=0}^{j−1} (P^i − P^{i+1}) = (I − P) ∑_{i=0}^{j−1} P^i = (I − P) P̂_{j−1}

where P̂_j ≡ ∑_{i=0}^{j} P^i. Summing the above equation over j from 1 to N and dividing by N yields

(3)    I − N^{-1} P̂_{N−1} P = (I − P)(N^{-1} ∑_{i=0}^{N−1} P̂_i),   N = 1, 2, ….

Thus if P^N → 0 (C,1), then the left-hand side of (3) is nearly I for large enough N and so is
nonsingular. Hence I − P is nonsingular. Now premultiply (3) by (I − P)^{-1} and let N → ∞. ∎

In order to appreciate the significance of this result, observe that since P^N → 0 (C,0) implies
P^N → 0 (C,1), it follows that (2) will certainly hold whenever P^N → 0 (C,0). But (2) holds in
other cases as well. For example, if P = −I, then P^N → 0 (C,1). In that event (2) asserts that
(1/2)I = ∑_{N=0}^{∞} (−I)^N (C,1). Moreover, the last two (C,1) limits cannot be replaced by (C,0) limits.

Stationary and Deviation Matrices


The next two results characterize two important matrices associated with a square matrix P.

Theorem 19. Stationary Matrix. If P is a square complex matrix with d_P ≤ 1, then there is a
stationary matrix P* for P such that

(4)    P^N → P*   (C, d_P).

Moreover,

(5)    P P* = P* P = P* P* = P*.

If P is real (resp., nonnegative, substochastic, stochastic), then so is P*. Also, P* = 0 if, and pro-
vided that P is nonnegative, only if P is transient. Finally, P = I if and only if P* = I.
Proof. If d_P = 0, then P* = 0 and the result is clear. Suppose d_P = 1. Put B_N ≡ N^{-1} ∑_{i=0}^{N−1} P^i.
Then {P^N} and {B_N} are bounded. Hence, to prove (4), it suffices to show that the limit points of
{B_N} coincide. (A limit point of a sequence of matrices is, of course, the limit of a subsequence of
the matrices.) Let J be any limit point of {B_N}. Then B_N + N^{-1}(P^N − I) = P B_N = B_N P have
the common limit point J = P J = J P. Thus J = P^N J = J P^N and so J = B_N J = J B_N. Now let
K be any limit point of {B_N}. Then the preceding equalities have the common limit point J =
K J = J K. Since J and K are arbitrary limit points, they can be interchanged, giving K = J K
= K J. Thus J = K, so (4) holds. Moreover, this also establishes (5). We claim that if T is non-
negative and P* = 0, then P is transient. To show this, observe from Lemma 18 that D ≡ (I − P)^{-1}
exists and is nonnegative, so D = I + P D = ⋯ = ∑_{i=0}^{N} P^i + P^{N+1} D ≥ ∑_{i=0}^{N} P^i for all N. Thus, since
P is nonnegative, it is transient as claimed. The remaining assertions follow from (4) and (5). ∎

In view of (4), when P is nonnegative, the st-th element of P* can be interpreted as the long-
run expected average number of individuals in state t per period generated by one individual
starting in state s. Notice from P* P = P* in (5) that each row of P* is left stationary when
postmultiplied by P. Thus, by (4), the s-th row of P* is the stationary distribution of the associ-
ated Markov population chain to which the N-step transition rates from state s converge (C, d_P).

Theorem 20. Deviation Matrix. If P is an S × S complex matrix with d_P ≤ 1, then P* − Q is
nonsingular where P* is the stationary matrix for P and Q ≡ P − I. Also the deviation matrix D
for P, defined by

(6)    D ≡ (P* − Q)^{-1}(I − P*) = (I − P*)(P* − Q)^{-1},

and P* satisfy

(7)    D = ∑_{N=0}^{∞} (P^N − P*)   (C, d_P),

(8)    D P* = P* D = 0

and

(9)    (P* − Q)^{-1} = P* + D.

Moreover, ℂ^S (ℂ is the complex numbers) is the direct sum of the ranges of P* and D. The sum of
the ranks of P* and D is S. Also P* = 0 if and only if D = (I − P)^{-1}. And P* = I if and only if
D = 0. If further P is real, then so is D.

Proof. It follows readily from (5) that

    (P − P*)^N = P^N − P*   for N ≥ 1.

On setting B = P − P*, it is apparent from (4) and the above equation that B^N → 0 (C, d_P).
Hence, P* − Q = I − B is nonsingular by Lemmas 2 and 18, so by the above facts,
ÐT ‡  UÑ" œ " ÐT  T ‡ ÑR œ " ÐT R  T ‡ Ñ  T ‡ ÐCß .T Ñ.


_ _
Ð10Ñ
R œ! R œ!

Now postmultiply (resp., premultiply) this equation by M  T ‡ and use (5) to establish (6), (7)
and the fact that M  T ‡ and ÐT ‡  UÑ" commute. Next postmultiply (resp., premultiply) (6)
by T ‡ and use (5) to establish (8). Also, (9) follows from (7) and Ð10Ñ. In view of (9), ‚W is the
direct sum of the ranges of T ‡ and H provided B œ T ‡ C œ HD implies that B œ !. To see that the
last is so, observe from (8) that ÐT ‡  HÑB œ T ‡ HD  HT ‡ C œ !, whence B œ ! by (9). The char-
acterizations of T ‡ œ ! and T ‡ œ M follow from Ð6) and Ð8Ñ. The fact that T is real implies that H
is real follows from (4) and (6). è

Notice from (5) and (6) that if P is nonnegative, the st-th element of D is the (C, d_P) limit
of the amount by which the sum of the expected N-period sojourn times of all individuals in
state t generated by one individual starting in state s exceeds that when the system starts in-
stead with the stationary population distribution for state s. Thus D measures the deviation
from stationarity.
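Assuming NumPy, here is a small check of Theorems 19 and 20 for an illustrative irreducible
stochastic matrix: P* (each row the stationary distribution) and the deviation matrix D from (6)
are computed and the identities (5), (8) and (9) are verified numerically.

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [0.6, 0.4]])        # illustrative irreducible stochastic matrix
S = P.shape[0]

# For an irreducible stochastic P, each row of P* is the stationary distribution pi,
# found here by solving pi P = pi together with sum(pi) = 1.
A = np.vstack([P.T - np.eye(S), np.ones(S)])
pi = np.linalg.lstsq(A, np.r_[np.zeros(S), 1.0], rcond=None)[0]
P_star = np.tile(pi, (S, 1))

Q = P - np.eye(S)
D = np.linalg.inv(P_star - Q) @ (np.eye(S) - P_star)       # deviation matrix, (6)

print(np.allclose(P @ P_star, P_star))                               # part of (5)
print(np.allclose(D @ P_star, 0) and np.allclose(P_star @ D, 0))     # (8)
print(np.allclose(np.linalg.inv(P_star - Q), P_star + D))            # (9)
```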

Laurent Expansion of Resolvent


Suppose that P is a square complex matrix and that d_P ≤ 1, so σ_P ≤ 1. Then ρI − Q =
(1 + ρ)I − P is nonsingular for ρ > 0 because (1 + ρ)^{-1} P has spectral radius less than one. Call
R^ρ ≡ (ρI − Q)^{-1} the resolvent of Q for ρ > 0. Also R^ρ = β(I − βP)^{-1} = ∑_{N=0}^{∞} β^{N+1} P^N where
β ≡ 1/(1 + ρ). Therefore, if P is real and nonnegative, then R^ρ is the matrix whose st-th element is
the sum of the expected discounted sojourn times of all individuals in state t generated by one
individual in state s initially, where β is the discount factor and 100ρ% is the interest rate.
In order to study policies with maximum present value for small interest rates ρ > 0, it is
useful to develop an expansion of R^ρ in ρ. It turns out that the desired expansion is a Laurent
series about the origin whose coefficients are the stationary matrix P* and powers of the devia-
tion matrix D.

Theorem 21. Laurent Expansion of Resolvent. If P is a square complex matrix with d_P ≤ 1
and if P* and D are respectively the stationary and deviation matrices for P, then for all ρ > 0
with ρ σ_D < 1,

(11)    R^ρ = ρ^{-1} P* + ∑_{n=0}^{∞} (−ρ)^n D^{n+1}.

If P* = 0, then R^ρ = ∑_{n=0}^{∞} (−ρ)^n (I − P)^{-(n+1)}. If D = 0, then R^ρ = ρ^{-1} I. If P is real, so is R^ρ.

Proof. It follows from Theorem 19 that Q P* = 0 = P* Q, so (ρI − Q) P* = ρ P* = P* (ρI − Q).
Premultiplying the next-to-last equation by R^ρ and postmultiplying the last equation by R^ρ yields

(12)    R^ρ P* = ρ^{-1} P* = P* R^ρ.

Also, from Theorems 19 and 20,


(13)    (I − P*)(I + ρD) = (P* − Q) D + ρ D = (ρI − Q) D.

Now for all ρ > 0 with ρ σ_D < 1, I + ρD is nonsingular, so its inverse exists and has a Neumann
series expansion. Thus, premultiplying (13) by R^ρ and postmultiplying (13) by (I + ρD)^{-1} yields

(14)    R^ρ (I − P*) = D (I + ρD)^{-1} = ∑_{n=0}^{∞} (−ρ)^n D^{n+1}.

Now add the first equation in (12) to (14). The rest follows from (10). ∎

Laurent Expansion of Present Value in Interest Rate


Suppose that the system is bounded. Then on putting Q_δ ≡ P_δ − I and R_δ^ρ ≡ (ρI − Q_δ)^{-1}, it
follows that the present value of δ is

    V_δ^ρ = (∑_{N=0}^{∞} β^{N+1} P_δ^N) r_δ = β(I − β P_δ)^{-1} r_δ = R_δ^ρ r_δ.

This fact and Theorem 21 imply that for all ρ > 0 with ρ σ_{D_δ} < 1,

(15)    V_δ^ρ = ∑_{n=−d}^{∞} ρ^n v_δ^n

where v_δ^{−1} ≡ P_δ* r_δ and v_δ^n ≡ (−1)^n D_δ^{n+1} r_δ for n = 0, 1, …, with P_δ* and D_δ being respectively the
stationary and deviation matrices for P_δ, and d is the system degree. Evidently (15) gives the
desired Laurent expansion of V_δ^ρ in ρ.
In order to obtain a clearer understanding of the fundamentally important Laurent expan-
sion (15) of the present value of a stationary policy δ^∞ for small interest rates, it is useful to
interpret briefly the quantities appearing in the expansion. To that end, observe first that the
present value V_δ^ρ of δ is the product of the expected present value R_δ^ρ of the sojourn times of all
individuals in each state and the unit rewards per period r_δ that individuals earn in those states.
If d = 1, the first coefficient in the Laurent expansion of V_δ^ρ is v_δ^{−1}. In order to interpret this
term, observe that depositing V_δ^ρ in a bank that pays interest at the rate 100ρ% earns the inter-
est payment ρ V_δ^ρ each period in perpetuity. Thus the present value V_δ^ρ is equivalent to the in-
finite stream of equal interest payments ρ V_δ^ρ in each period. Hence, it follows from (15) that the
limit of the interest payments in each period as the interest rate approaches zero is v_δ^{−1}.
There is a related interpretation of v_δ^{−1} that may be described as follows. Observe from V_δ^ρ =
R_δ^ρ r_δ, (12) and (15) that P_δ* V_δ^ρ = ρ^{-1} v_δ^{−1} is the present value of δ starting with its stationary dis-
tribution in each state. Thus, v_δ^{−1} is also the interest payment in each period resulting from depos-
iting P_δ* V_δ^ρ in a bank that pays interest at the rate 100ρ%.
If d = 1, the second coefficient in the expansion (15) of V_δ^ρ is v_δ^0. It can be interpreted by ob-
serving from (15) that V_δ^ρ = ρ^{-1} v_δ^{−1} + v_δ^0 + o(1) where o(1) is a function of ρ that converges to
zero as ρ ↓ 0. Thus v_δ^0 is the limit as ρ ↓ 0 of the difference V_δ^ρ − ρ^{-1} v_δ^{−1} between the present
values of δ starting in each state and starting with the stationary distribution for that state.
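Assuming NumPy, the expansion (15) can be checked numerically for a small stochastic P_δ (the
matrix, rewards and interest rate below are illustrative): the truncated Laurent series closely
matches V_δ^ρ = β(I − βP_δ)^{-1} r_δ for small ρ.

```python
import numpy as np

P = np.array([[0.0, 1.0], [0.6, 0.4]])   # illustrative stochastic P_delta (d = 1)
r = np.array([2.0, -1.0])                # illustrative r_delta
S = P.shape[0]

A = np.vstack([P.T - np.eye(S), np.ones(S)])
pi = np.linalg.lstsq(A, np.r_[np.zeros(S), 1.0], rcond=None)[0]
P_star = np.tile(pi, (S, 1))
Q = P - np.eye(S)
D = np.linalg.inv(P_star - Q) @ (np.eye(S) - P_star)

rho = 0.01
beta = 1 / (1 + rho)
V = beta * np.linalg.inv(np.eye(S) - beta * P) @ r       # V_delta^rho = R_delta^rho r_delta
laurent = P_star @ r / rho + sum((-rho) ** n * np.linalg.matrix_power(D, n + 1) @ r
                                 for n in range(6))      # truncation of (15)
print(np.max(np.abs(V - laurent)))                        # tiny, of order rho**6
```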
Characterization of Stationary Strong Maximum-Present-Value Policies as ∞-Optimal


The Laurent expansion of the present value of a stationary policy in the interest rate plays a
key role in characterizing stationary strong maximum-present-value policies in a bounded sys-
tem. To see this requires a definition. Call a real matrix B lexicographically nonnegative, written
B ⪰ 0, if the first nonvanishing element (if any) of each row of B is positive. Similarly, write
B ≻ 0 if B ⪰ 0 and B ≠ 0. Also write B ⪯ 0 (resp., B ≺ 0) if −B ⪰ 0 (resp., −B ≻ 0).
Let V_δ ≡ (v_δ^{−d} v_δ^{−d+1} ⋯) be the matrix of coefficients of the Laurent expansion (15) of V_δ^ρ.
This matrix has S rows and infinitely many columns. The reader should note that V_δ differs
from its earlier interpretation as the value of a transient policy δ^∞. Indeed, it follows from (15)
that in the present notation, the value of δ^∞ would then be lim_{ρ↓0} V_δ^ρ = v_δ^0 because in that event
v_δ^{−1} = 0. Thus in the sequel the reader will need to be careful to remember which of the two
interpretations of V_δ is intended. The reason for the twin interpretations of V_δ is that many of
the following results regarding strong present-value optimality for bounded systems have the
same form as for maximum-value optimality for transient systems.
Now observe from (15) that

(16)    V_δ^ρ − V_γ^ρ = ∑_{n=−d}^{∞} ρ^n (v_δ^n − v_γ^n).

This difference is nonnegative, i.e., the present value of δ is at least as large as that for γ, for all
ρ > 0 with ρ max_{η∈Δ} σ_{D_η} < 1 if and only if V_δ − V_γ ⪰ 0. The reason for this is that for all such posi-
tive ρ, the sign of each row of (16) is simply the sign of the first nonvanishing coefficient of the
same row of the Laurent expansion. Thus since there is a stationary maximum-present-value
policy for each ρ > 0, it follows that δ^∞ has strong maximum present value if and only if δ^∞ is
∞-optimal, i.e., V_δ ⪰ V_γ for all γ ∈ Δ. Let Δ_∞ be the set of δ ∈ Δ for which δ^∞ is ∞-optimal.
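The lexicographic tests ⪰ 0 and ≻ 0 used throughout the sequel are easy to code; here is a
minimal sketch, assuming NumPy, applied to hypothetical coefficient matrices.

```python
import numpy as np

def lex_nonneg(B, tol=1e-12):
    """B is lexicographically nonnegative: the first nonvanishing entry of each row is positive."""
    for row in np.atleast_2d(B):
        nonzero = np.where(np.abs(row) > tol)[0]
        if nonzero.size and row[nonzero[0]] < 0:
            return False
    return True

def lex_pos(B, tol=1e-12):
    """B is lexicographically positive: lexicographically nonnegative and B != 0."""
    return lex_nonneg(B, tol) and np.any(np.abs(B) > tol)

print(lex_pos(np.array([[0.0, 2.0, -1.0], [0.0, 0.0, 0.0]])))  # True
print(lex_nonneg(np.array([[-0.5, 1.0, 1.0]])))                 # False
```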

Application to Controlling Service and Rework Rates in a G/M/∞ Queue


As an illustration of the above ideas, consider the following application to controlled G/M/∞
queues. Suppose jobs arrive for service at a machine shop at the beginning of each hour accord-
ing to an arbitrary stochastic process. Each job may be processed on a machine during the hour
following arrival by a skilled or unskilled worker since the numbers of machines and of workers
of each type are adequate to serve all arriving jobs (this amounts to assuming that there are
infinitely many channels in the queue). Unskilled workers earn w per hour while working on a
job and skilled workers earn twice that much. A job completed by a worker is shipped and earns
the shop r if the job is acceptable and is reworked (i.e., reprocessed during the next hour) if it is
not. Each job completed by an unskilled worker requires rework with probability p, 1/2 < p < 1,
whether or not the job was previously reworked. By contrast each job completed by a skilled
worker requires rework with probability 2p − 1. Thus, the probability of shipping a job
completed by a skilled worker is 2(1 − p), which is double the corresponding probability 1 − p
for an unskilled worker. The problem, which is illustrated in Figure 8, is to decide whether to
use skilled or unskilled workers to process jobs when the goal is to achieve strong maximum
present value. In short, the issue is whether it is better to use a slow worker with low pay or a
fast worker with high pay.

[Figure 8. Optimal Control of Service and Rework Rates in a G/M/∞ Queue: arriving jobs are
processed and then either shipped or returned for rework.]

Initially it is convenient to ignore the arrival process and instead assume that there is a sin-
gle job awaiting service. Observe that this is a single-state system in which there are two deci-
sions u and k corresponding respectively to using an unskilled or skilled worker to process the
job. Put Δ = {u, k}. Then the one-period reward and transition rates are

    r_u = (1 − p)r − w,   P_u = p,   r_k = 2r_u   and   P_k = 2p − 1.

For each δ ∈ Δ, denote by V_δ^ρ the expected present value of net income (revenue minus wages)
received from a job when a worker of type δ processes the job until it is shipped and the hourly
interest rate is ρ > 0.
Notice that since p ∨ (2p − 1) < 1, the system is transient and so d = 0. Thus, P_δ* = 0 and
D_δ = (1 − P_δ)^{-1} for δ ∈ Δ, so D_u = 1/(1 − p) and D_k = 1/(2(1 − p)). Also, v_δ^0 = D_δ r_δ and v_δ^1 = −D_δ v_δ^0 for
δ ∈ Δ, so v_u^0 = v_k^0 ≡ v^0 ≡ r − w/(1 − p), v_u^1 = −D_u v^0 and v_k^1 = −D_k v^0. Hence

(17)    V_k^ρ − V_u^ρ = (D_u − D_k) v^0 ρ + o(ρ) = v^0 ρ / (2(1 − p)) + o(ρ).

Observe that the expected profit v^0 when a skilled worker processes a job is the same as that
for an unskilled worker, i.e., a job is profitable or unprofitable independently of the type of
worker who processes the job. Thus if the goal of the shop is to use a maximum-value policy
(with ρ = 0), the shop will be indifferent between using a skilled or an unskilled worker. How-
ever, if instead the shop seeks a strong maximum-present-value policy, it follows from (17) that
it is optimal to use a skilled worker if a job is profitable and to use an unskilled worker if the
job is unprofitable! The explanation for this is that if a job is profitable, it is better to use a
skilled worker who earns the profit earlier when the expected discount factor is larger, while if a
job is unprofitable, it is better to use an unskilled worker who incurs the loss later when the ex-
pected discount factor is smaller.
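A quick numerical check of (17) under illustrative parameter values (p, r, w and ρ below are
assumptions, not figures from the notes): for a profitable job the skilled worker yields the larger
present value, and the gap is approximately v^0 ρ / (2(1 − p)).

```python
p, r, w, rho = 0.6, 10.0, 3.0, 0.01
beta = 1 / (1 + rho)

r_u, P_u = (1 - p) * r - w, p        # unskilled: reward and rework probability
r_k, P_k = 2 * r_u, 2 * p - 1        # skilled: double reward rate, faster shipping

V_u = beta * r_u / (1 - beta * P_u)  # single-state present values V_delta^rho
V_k = beta * r_k / (1 - beta * P_k)

v0 = r - w / (1 - p)                 # common expected profit per job
print(V_k - V_u, v0 * rho / (2 * (1 - p)))  # nearly equal for small rho, per (17)
```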
But what if a job is neither profitable nor unprofitable, i.e., v^0 = 0? In that event, v_δ^n =
(−1)^n D_δ^n v^0 = 0 for all n ≥ 0, whence V_k^ρ − V_u^ρ = 0 for all ρ > 0, i.e., the shop is indifferent be-
tween using a skilled or an unskilled worker for all positive interest rates.
Incidentally, so far no consideration has been given to the effect of the arrival process into
the queue. Does it matter? The answer is it does as the developments in the sequel will show.
But for now, return to the general theory.

Strong Policy-Improvement Method


It is now possible to generalize the policy-improvement method to find a strong maximum-
present-value policy for bounded systems. To that end, it is useful first to adapt the Comparison
Lemma to the present-value problem.
Comparison Lemma for Present Values. Observe that the maximum-present-value problem
for a bounded system is equivalent to a maximum-value problem for a transient system with
one-period reward vectors βr_δ and transition matrices βP_δ for each δ ∈ Δ. Then on applying the
Comparison Lemma for transient systems to the maximum-present-value problem, the resulting
Comparison Lemma for stationary policies with ρ > 0 becomes

(18)    V_γ^ρ − V_δ^ρ = R_γ^ρ G_{γδ}^ρ,

where G_{γδ}^ρ ≡ r_γ + Q_γ V_δ^ρ − ρ V_δ^ρ is the present-value comparison function. For a direct proof of this
fact, observe that V_γ^ρ − V_δ^ρ = R_γ^ρ [r_γ − (ρI − Q_γ) V_δ^ρ] = R_γ^ρ G_{γδ}^ρ. Also, note that in the special case in
which γ and δ are transient, setting the interest rate ρ = 0 reduces (18) to the Comparison Lemma
for stationary transient policies.
Laurent Expansion of Present-Value Comparison Function. It is now possible to give the
desired generalization of the policy-improvement method. Observe that if the goal were merely
seeking an improvement of δ for a single fixed positive interest rate ρ > 0, it would be enough to
choose γ so that G_{γδ}^ρ > 0 for that single value of ρ. For then by (18), V_γ^ρ > V_δ^ρ. However, the
goal now is to be sure that γ is an improvement of δ simultaneously for all sufficiently small
positive ρ. One way to achieve this is to choose γ so that G_{γδ}^ρ > 0 simultaneously for all suffic-
iently small positive ρ. To find such a γ, it is useful to develop the Laurent expansion of G_{γδ}^ρ
in ρ > 0. To that end, substitute the Laurent expansion (15) of V_δ^ρ into G_{γδ}^ρ and collect terms of
like powers of ρ. The result is that

(19)    G_{γδ}^ρ = ∑_{n=−d}^{∞} ρ^n g_{γδ}^n

for all small enough ρ > 0 where g_{γδ}^n ≡ r_γ^n + Q_γ v_δ^n − v_δ^{n−1} for n = −d, −d+1, …, r_γ^n ≡ 0 for n ≠ 0,
r_γ^0 ≡ r_γ, v_δ^{−d−1} ≡ 0 and d is the system degree. Let G_{γδ} ≡ (g_{γδ}^{−d} g_{γδ}^{−d+1} g_{γδ}^{−d+2} ⋯) be the matrix of
coefficients of the Laurent expansion (19). Observe that G_{γδ} is a matrix with S rows and infin-
itely many columns. The reader is warned again that G_{γδ} generalizes its earlier interpretation as
comparing the values of the transient policies (γ, δ^∞) and δ^∞ to comparing the present values of
those policies for all small enough positive interest rates. Moreover, in the present notation and
under the hypothesis that the system is transient, lim_{ρ↓0} G_{γδ}^ρ = g_{γδ}^0 = r_γ + P_γ v_δ^0 − v_δ^0 which, be-
cause v_δ^0 is then the value of δ, is the difference between the values of the transient policies (γ, δ^∞)
and δ^∞.
Strong Policy Improvement. It follows from (19) that G_{γδ}^ρ > 0 (resp., ≤ 0) for all small enough
ρ > 0 if and only if G_{γδ} ≻ 0 (resp., G_{γδ} ⪯ 0). Since R_δ^ρ ≥ βI for ρ > 0, it follows from (16), (18)
and (19) that V_γ − V_δ ≻ 0 (resp., ⪯ 0) if G_{γδ} ≻ 0 (resp., ⪯ 0). For this reason, say that γ strongly
improves δ if G_{γδ} ≻ 0. Now observe from (18) that since V_δ^ρ − V_δ^ρ = 0 and R_δ^ρ is nonsingular, G_{δδ}^ρ =
0 for all ρ > 0. Thus, G_{δδ} = 0 by (19). Hence, if no decision strongly improves δ, then, analogously
with the corresponding claim for the transient case, G_{γδ} ⪯ 0 for all γ. To see why this is so, ob-
serve that it is possible to improve the action in one state without affecting the others. This is be-
cause the s-th row G_{γδ s} of G_{γδ} depends on the action γ_s that γ takes in state s, but not on the ac-
tion γ_t that γ takes in any other state t. Thus, if there is any state s and action γ_s ∈ A_s for which
G_{γδ s} is lexicographically positive, then choosing γ_t = δ_t for all t ≠ s will assure that G_{γδ} ≻ 0 since
then G_{γδ t} = G_{δδ t} = 0 for t ≠ s. It follows from the above discussion that δ^∞ is ∞-optimal if and
only if G_{γδ} ⪯ 0 for all γ.

Existence and Characterization of Stationary Strong Maximum-Present-Value Policies


Using these facts, it is now possible to prove the main result of this section.

Theorem 22. Existence and Characterization of Stationary Strong Maximum-Present-Value
Policies. In a bounded system, there is a stationary strong maximum-present-value policy. If δ ∈
Δ, the following are equivalent: 1° δ^∞ has strong maximum present value; 2° δ^∞ is ∞-optimal; and
3° G_{γδ} ⪯ 0 for all γ.

Proof. The equivalence of 2° and 3° (resp., 1°) was shown above (resp., following (16)). Thus,
since 3° implies 1°, it suffices to show that there is a δ for which 3° holds. To that end, let δ_0 ∈
Δ be arbitrary and choose δ_1, δ_2, … in Δ inductively as follows. Given δ_N, let δ_{N+1} strongly im-
prove δ_N, if such a decision exists, and terminate with δ_N otherwise. Since V_{δ_N} increases lexico-
graphically with N, no decision can appear twice in the sequence. Thus because there are only
finitely many decisions, the sequence must terminate with a δ = δ_N that no decision strongly
improves. Then 3° holds from the discussion preceding the Theorem. ∎

Proof of System-Degree Theorem 9 where d = 1. Theorem 22 provides the tool necessary
to prove Theorem 9 where d = 1, i.e., the sequence max_{π∈Δ^∞} ‖P_π^N‖ has degree one. To that end,
put <$ œ " for all $ − ?. Then by Theorem 22, there is a $ such that K#$ £ ! for all # , so that
"
1#$ Ÿ ! for all # and Z$3   ! for all small 3  !, so Z$ ¤ !. On setting @ œ @$" , it follows that
@ œ T$‡ <$   !. Let N ß O be a partition of the states for which @N ¦ ! and @O œ !. Now rewrite
"
the inequality 1#$ Ÿ ! as @   T# @ for all # − ?. Iterating this inequality shows that @   T1R @ for
all 1, so @N   T1RN N @N and ! œ T1RON @N for all 1 and R   ". Because @N ¦ !, the first inequality
implies that max1−?_ mT1RN N m œ SÐ"Ñ and the second implies T1RON œ ! for all 1 and R , i.e., in-
dividuals in O cannot generate individuals in N . Thus, since @O œ ! and Z$ ¤ !, ? ´ @$!O   !.
" !
Also 1#$ O œ ! and so 1#$ O Ÿ ! for all # , i.e., ?   "  T# OO ? for all # . Hence the restriction of the
problem to states in O is a transient system, so by Theorem 9 for that case, max1−?_ mT1ROO m œ
SÐ!R Ñ for some ! Ÿ !  ". Now on setting 1 œ Ð#" ß ## ß á Ñ and 13 œ Ð#3" ß #3# ß á Ñ, it follows that

œ "T13"
R
R 3
T1RN O N N T#3 N O T13 OO .
3œ"
To see this, consider the 45 >2 element of the 3>2 term of the summand on the right. That element
is the expected number of ancestors in state 5 − O in period R " of an individual in state 4 − N
in period one for which some descendant of 4 in N in period 3 generates an ancestor of 5 in O in
period 3". Thus, since max1−?_ mT1RN N m œ SÐ"Ñ and max1−?_ mT1ROO m œ SÐ!R Ñ, it follows that
max1−?_ mT1RN O m œ SÐ"Ñ and so max1−?_ mT1R m œ SÐ"Ñ. è

Call the algorithm given in the proof of Theorem 22 for finding a stationary strong maxi-
mum-present-value policy the strong policy-improvement method. The method requires only fi-
nitely many iterations, each involving two steps. For a given δ, first compute V_δ. Then seek a γ
with G_{γδ} ≻ 0. The last two steps do not appear to be executable in finite time since they seem
to require computing the respective infinite matrices V_δ and G_{γδ}. However, it is possible to refine
the algorithm to run in finite time as the development below shows.

Truncation of Infinite Matrices


In a bounded system, it turns out to be sufficient to compute the finite truncated matrices
V_δ^n ≡ (v_δ^{−d} ⋯ v_δ^n) and G_{γδ}^n ≡ (g_{γδ}^{−d} ⋯ g_{γδ}^n) for n = S − T where T is the rank of P_δ*. As examples,
observe that if P_δ is transient, then T = 0; if P_δ is stochastic and irreducible, then T = 1; if
P_δ = I, then T = S. In order to establish the claim, a preliminary result is necessary.

Lemma 23. If B is a real S × S matrix with rank l, L is a subspace of ℝ^S, x ∈ ℝ^S and B^i x ∈ L
for 1 ≤ i ≤ l, then B^i x ∈ L for i = 1, 2, ….

Proof. The result is trivial if l = 0 since then B = 0. Thus suppose that l ≥ 1. The first step
is to show that there is a positive integer j ≤ l such that B^{j+1} x is a linear combination of
F " Bß á ß F 4 B. Since the dimension of the subspace spanned by the columns of F is 6, the vectors
F " Bß á ß F 6" B are linearly dependent, from which the assertion follows.
Next show by induction on 5 that F 5 B is a linear combination of F " Bß á ß F 4 B for all 5   ",
which, because P is a subspace, will complete the proof. This is so by construction for " Ÿ 5 Ÿ
4". Suppose it holds for 5" Ð   4"Ñ, so F 5" B œ !43œ" -3 F 3 B. Premultiplying this equation by
F gives F 5 B œ !43œ" -3 F 3" B. Since F 4" B is a linear combination of F " Bß á ß F 4 B, so is F 5 B. è

Theorem 24. Truncation. Suppose the system is bounded, γ, δ ∈ Δ and P_δ* has rank T. Then
1° G_{γδ} = 0 if and only if G_{γδ}^{S−T} = 0.
2° V_γ = V_δ if and only if V_γ^{S−T} = V_δ^{S−T}.

Proof. For part 1°, it suffices to show the “if ” part. Observe that since G_{δδ}^n = 0,

(20)    g_{γδ}^n = g_{γδ}^n − g_{δδ}^n = (P_γ − P_δ) v_δ^n,   n = 1, 2, ….

Because G_{γδ}^{S−T} = 0, it follows from (20) that

(21)    (P_γ − P_δ) v_δ^n = 0,  for 1 ≤ n ≤ S−T.

In view of (20), it suffices to show that (21) holds for n > S−T. That this is so follows from
Lemma 23 with B = D_δ, L the null space of P_γ − P_δ, and x = v_δ^0, on noting from Theorem 20
that because the rank of P_δ* is T, the rank of D_δ is S−T.
For part 2°, it suffices to show the “if ” part. Now V_γ^{S−T} = V_δ^{S−T} implies that

(22)    (D_γ − D_δ) v_δ^n = 0,  for 0 ≤ n ≤ S−T−1.

To show that V_γ = V_δ, it suffices to establish (22) for n ≥ S−T. The last follows from Lemma
23 with B = D_δ, L the null space of D_γ − D_δ, and x = r_δ, on noting, as above, that the rank
of D_δ is S−T. ∎

Remark 1. If the rank T of P_δ* is S, then P_δ* = I and D_δ = 0 by Theorem 20. Then v_δ^{−1} = r_δ
and v_δ^n = 0 for n = 0, 1, ….

Remark 2. Part 1° of Theorem 24 implies that if the s-th, say, row of G_{γδ}^{S−T} vanishes, the same
is so of the s-th row of G_{γδ}. As a consequence, if γ strongly improves δ, then G_{γδ}^{S−T} ≻ 0. Thus, in
searching for a decision that strongly improves δ, it is never necessary to look beyond the ma-
trices V_δ^S and G_{γδ}^S, or if T is known, V_δ^{S−T} and G_{γδ}^{S−T}, for γ ∈ Δ. Observe also that T = 0 if and
only if P_δ is transient. Otherwise T ≥ 1.
n-Optimality: Efficient Implementation of the Strong Policy-Improvement Method

A more efficient way to execute the strong policy-improvement method for bounded S-state
systems is to lexicographically maximize V_δ by first maximizing V_δ^{−d}, then V_δ^{−d+1}, then V_δ^{−d+2},
etc. Indeed, as the sequel shows, for each n ≥ −d, it is often of interest to seek a policy δ^∞ that is
n-optimal, i.e., V_δ^n ⪰ V_γ^n for all γ ∈ Δ. Let Δ_n be the set of decisions δ for which δ^∞ is n-optimal.

Theorem 25. Selectivity. In a bounded S-state system with system degree d, Δ ⊇ Δ_{−d} ⊇ Δ_{−d+1}
⊇ ⋯ ⊇ Δ_{S−T} = Δ_{S−T+1} = ⋯ = Δ_∞ where T is the minimum of the ranks of the stationary ma-
trices over all decisions.

Proof. Immediate from the definitions and the Truncation Theorem 24.

n-Improvements. It is useful now to develop a method of finding an n-optimal policy. To that
end, put G_{γδ}^n ≡ (g_{γδ}^{−d} ⋯ g_{γδ}^n). Call γ ∈ Δ an n-improvement of δ ∈ Δ if G_{γδ}^n ≻ 0 and if γ_s = δ_s when-
ever the s-th row of G_{γδ}^n vanishes. Clearly γ strongly improves δ.

Theorem 26. n-Improvements and n-Optimal Policies. In a bounded system with system degree
d, if γ is an n-improvement of δ, then V_γ^n ≻ V_δ^n. If δ has no (n+d)-improvement, then δ^∞ is n-op-
timal.

Proof. If γ is an n-improvement of δ, then γ is a strong improvement of δ because G_{γδ s}^n = 0
implies G_{γδ s} = G_{δδ s} = 0. Thus G_{γδ}^ρ > 0 for all small enough ρ > 0. Moreover, since R_γ^ρ ≥ βI,

    lim_{ρ↓0} ∑_{k=−d}^{n} ρ^{k−n} (v_γ^k − v_δ^k) = lim_{ρ↓0} ρ^{−n} (V_γ^ρ − V_δ^ρ) = lim_{ρ↓0} ρ^{−n} R_γ^ρ G_{γδ}^ρ ≥ lim_{ρ↓0} β ∑_{k=−d}^{n} ρ^{k−n} g_{γδ}^k > 0,

so V_γ^n ≻ V_δ^n as claimed.

Suppose δ has no (n+d)-improvement. Then for each γ, ρ^{−n−d} G_{γδ}^ρ ≤ o(1). Also lim_{ρ↓0} ρ^d R_γ^ρ
exists and is finite by Theorem 20 and R_γ^ρ is nonnegative for all small enough ρ > 0. Thus

    lim_{ρ↓0} ∑_{k=−d}^{n} ρ^{k−n} (v_γ^k − v_δ^k) = lim_{ρ↓0} ρ^{−n} (V_γ^ρ − V_δ^ρ) = lim_{ρ↓0} ρ^d R_γ^ρ · ρ^{−n−d} G_{γδ}^ρ ≤ 0,

whence V_γ^n ⪯ V_δ^n for each γ, i.e., δ is n-optimal. ∎

Example. . œ ", $ Has No 8-Improvement, and $  ?8 . It is natural to hope that if . œ "


and $ has no 8-improvement, then $ − ?8 . However, that is not so. For example, suppose that
there is a single state and two decisions # , $ with T# œ T$ œ ", so U# œ U$ œ !, and <# œ ", <$ œ !.
Then @#" œ " and @$" œ !, so 1#$
"
œ U# @$" œ !, whence $ has no "-improvement. But # is a
!
!-improvement of $ because 1#$ œ <#  U# @$!  @$" œ "  !. Also, # is "-optimal.

Example. . œ ", $ Has a "-Improvement. Suppose there are two states and two decisions
# and $ with # # œ $ # . Let <# œ <$ œ Ð! "ÑX . Let T$ œ M , so U$ œ !. Let T#" œ Ð! "Ñ, so U#" œ
Ð" "Ñ. Then @$" X " " " "
" œ Ð! "Ñ , K#$ " œ U# " @$ " œ " and K#$ # œ K$$ # œ !, so # "-improves $ .
Computing V_δ^n. The only method discussed so far for computing V_δ^n requires first calculat-
ing the stationary and deviation matrices for δ. It turns out that this is not necessary. Indeed, it
suffices to solve a system of linear equations for V_δ^n.

Theorem 27. Computation of V_δ^n. Suppose the system is bounded with system degree d. Let V^n
= (v^{−d} ⋯ v^n) where the v^j are S-vectors and v^{−d−1} ≡ 0. Then V^{n+d} = V_δ^{n+d} satisfies the linear
equations

(23)    r_δ^j + Q_δ v^j = v^{j−1}  for −d ≤ j ≤ n+d.

Conversely, if V^{n+d} satisfies these equations, then V^n = V_δ^n.

8.
Proof. Observe first that since K$$ œ !, then K$$ œ !, or equivalently, Z 8. œ Z$8. satis-
fies (23). Conversely, suppose that Z 8. satisfies (23). Since Z$8. also satisfies (23), it follows
that Y 8. ´ Z 8.  Z$8. satisfies

(24) U$ ?4 œ ?4" for . Ÿ 4 Ÿ 8.

where Y 8. œ Ð?. ß á ß ?8. Ñ and ?." ´ !. Now (24) implies that

(25) U$ ?4 œ ?4" for . Ÿ 4 Ÿ 8

and, on premultiplying (24) by T$‡ and using the facts that T$‡ U$ œ ! and, if . œ !, T$‡ œ !,

(26) T$‡ ?4 œ ! for . Ÿ 4 Ÿ 8.

Subtracting (26) from (25) gives

(27) ÐU$  T$‡ Ñ?4 œ ?4" for . Ÿ 4 Ÿ 8.

Since T$‡  U$ is nonsingular and ?." œ !, it follows from (27) that ?. œ !. Proceeding
inductively, it follows that Y 8 œ !, i.e., Z 8 œ Z$8 as claimed. è

Example. Why V^{n+d} Must Satisfy (23) to Assure V^n = V_δ^n with d = 1. Suppose P_δ = I,
so Q_δ = 0, and n = −1. Then the first equation in (23) is 0·v^{−1} = 0, and so does not constrain
v^{−1} at all. However, the second equation in (23) is r_δ + 0·v^0 = v^{−1}, which shows that v^{−1} = r_δ.

Policy Improvement in Bounded Systems. The above results lead to the following policy-im-
provement method for finding an n-optimal policy for a bounded system with d = 1 that requires
only (n+1)-improvements of decisions. Let δ_0 be given. Choose δ_1, …, δ_N inductively so that δ_k is
an (n+1)-improvement of δ_{k−1} for given k = 1, …, N where N is the first integer (which exists)
for which δ_N has no (n+1)-improvement. Then δ_N is n-optimal. In order to check whether a de-
cision δ has an (n+1)-improvement, it is necessary to compute V^{n+2} satisfying (23) with n+1
replacing n there. Then from Theorem 27, V^{n+1} = V_δ^{n+1}, which is needed to begin the search for
an (n+1)-improvement of δ.

Sequential Implementation. The above generic implementation of policy improvement to find
an element of Δ^n can be made more efficient by sequential implementation, i.e., by first seeking a
δ_{−1} ∈ Δ^{−1}, then a δ_0 ∈ Δ^0, then a δ_1 ∈ Δ^1, etc., until a δ_n ∈ Δ^n is found. More precisely, suppose
one has found a δ_{k−1} ∈ Δ^{k−1}. Then repeatedly seek (k+1)-improvements until a decision δ_k is found
with no (k+1)-improvement. Then by Theorem 26, δ_k ∈ Δ^k.
Sequential implementation is more efficient than generic implementation for two reasons. First,
sequential implementation generally requires fewer iterations because it focuses on maximizing the
coefficients V_δ^n of the Laurent expansion of V_δ^ρ in order of declining importance. Thus, sequential
implementation first finds δ = δ_{−1} maximizing v_δ^{−1}, then from among all such δ a δ = δ_0 maximiz-
ing v_δ^0, then from among all such δ a δ = δ_1 maximizing v_δ^1, etc.
Second, sequential implementation requires much less computation at each iteration than gen-
eric implementation. The implementation below explains why by showing how to use Theorems
26 and 27 to process inputs δ ∈ Δ^{k−1} and V_δ^k to produce outputs γ ∈ Δ^k and V_γ^{k+1}.

(a) Find v_δ^{k+1}. Solve r_δ^{k+1} + Q_δ v^{k+1} = v_δ^k and r_δ^{k+2} + Q_δ v^{k+2} = v^{k+1} for v^{k+1} = v_δ^{k+1} and v^{k+2}.
(b) Find a (k+1)-improvement γ of δ if one exists. Since δ ∈ Δ^{k−1}, δ has no (k−1)-improvement.
Consider the γ for which G_{γδ}^{k−1} = G_{δδ}^{k−1} = 0, so G_{γδ}^{k+1} = (0 g_{γδ}^k g_{γδ}^{k+1}). Choose from among those γ one
for which (g_{γδ}^k g_{γδ}^{k+1}) ≻ 0 and γ^s = δ^s whenever (g_{γδ,s}^k g_{γδ,s}^{k+1}) = 0. If there is a γ that (k+1)-improves δ,
go to (c). Otherwise set γ = δ and go to (d). In both cases V_γ^{k−1} = V_δ^{k−1} and γ ∈ Δ^{k−1}.
(c) Find v_γ^k. Solve r_γ^k + Q_γ v^k = v_δ^{k−1} and r_γ^{k+1} + Q_γ v^{k+1} = v^k for v^k = v_γ^k and v^{k+1}. Reset δ = γ and
go to (a).
(d) Terminate with γ ∈ Δ^k and V_γ^{k+1}. Since γ has no (k+1)-improvement, γ = δ ∈ Δ^k. Solve
r_γ^{k+1} + Q_γ v^{k+1} = v_δ^k and r_γ^{k+2} + Q_γ v^{k+2} = v^{k+1} for v^{k+1} = v_γ^{k+1} and v^{k+2}.

Note that steps (a), (c) and (d) each require solving 2S linear equations. And finding the best (k+1)-improve-
ment γ of δ in step (b) entails lexicographically maximizing (g_{γδ}^k g_{γδ}^{k+1}), which requires up to 2A operations.⁷
Uniqueness. In practice, it is usually so that for relatively small k and δ^∞ k-optimal, only γ = δ
satisfies G_{γδ}^k = G_{δδ}^k = 0. Then, by the Selectivity Theorem 25, δ^∞ is n-optimal for all n ≥ k. Hence
one can terminate sequential implementation early at iteration k.

Policy Improvement in Transient Systems. It is particularly easy to describe the amount of
work required to find a decision in Δ^n in transient systems, or equivalently, d = 0. In that event, let
V^n = (v^0 v^1 ⋯ v^n) = V_δ^n for any δ ∈ Δ^n. Now from Theorem 26, δ ∈ Δ^n if and only if G_{γδ}^n ⪯ 0 for
all γ ∈ Δ, or equivalently from Theorem 27 (v^{−1} = 0, Δ^{−1} = Δ),

(28) v^k = max_{γ ∈ Δ^{k−1}} [r_γ^k − v^{k−1} + P_γ v^k], k = 0, …, n.

⁷ By contrast with sequential implementation, the generic implementation entails solving a system of (k+4)S
linear equations at each iteration. And seeking the best (k+1)-improvement at each iteration requires up
to (k+3)A operations to do each lexicographic maximization.

Moreover, the set of decisions γ that maximize the right-hand side of the k-th equation in (28) is
Δ^k ⊆ Δ^{k−1}. Notice that these equations can be solved for v^0, v^1, …, v^n in that order. Once v^{k−1}
has been found, the work of finding v^k is precisely that of finding a stationary maximum-value
policy for a transient system with a restricted decision set Δ^{k−1} and a new reward vector r_γ^k −
v^{k−1}. Thus the total amount of work required to find a δ ∈ Δ^n is at most n+1 times that re-
quired to find a stationary maximum-value policy in a transient system—and usually much less.
In particular, the work required to find a strong maximum-present-value policy is at most S+1
times that required to find a stationary maximum-value policy in a transient system. Although
we do not discuss it in detail here, we remark that the situation described above is similar for
bounded systems. In short, one should expect to be able to find strong maximum-present-value
policies on PCs for transient or bounded systems with thousands of states.
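
As an illustration of how (28) is used, the sketch below solves the stages k = 0, 1, 2 for a small invented transient system by plain successive approximation, keeping at each stage only the actions that attain the maximum (so the decision sets shrink as Δ^k ⊆ Δ^{k−1}); the data and the crude fixed iteration count are assumptions made for the example only.

    import numpy as np

    # Hypothetical 2-state transient system with two actions per state.
    # P[a] is the substochastic transition matrix and r[a] the reward vector
    # when action a is used; actions may be mixed across states below.
    P = {0: np.array([[0.4, 0.3], [0.2, 0.5]]),
         1: np.array([[0.0, 0.8], [0.6, 0.1]])}
    r = {0: np.array([1.0, 2.0]), 1: np.array([1.5, 0.5])}
    S, n = 2, 2
    allowed = [set(P) for _ in range(S)]        # Delta^{-1}: all actions allowed
    v_prev = np.zeros(S)                        # v^{-1} = 0

    for k in range(n + 1):
        rk = {a: (r[a] if k == 0 else np.zeros(S)) for a in P}
        v = np.zeros(S)
        for _ in range(500):                    # successive approximation of (28)
            v = np.array([max(rk[a][s] - v_prev[s] + P[a][s] @ v for a in allowed[s])
                          for s in range(S)])
        # keep only maximizing actions: Delta^k is a subset of Delta^{k-1}
        allowed = [{a for a in allowed[s]
                    if abs(rk[a][s] - v_prev[s] + P[a][s] @ v - v[s]) < 1e-9}
                   for s in range(S)]
        print(f"v^{k} =", v, " maximizing actions:", allowed)
        v_prev = v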

9 CESÀRO OVERTAKING OPTIMALITY WITH IMMIGRATION IN BOUNDED SYSTEMS


[Ve66], [Ve68], [DM68], [Li69], [Ve74], [RV92]
The preceding section generalizes the maximum-value criterion from transient to bounded
systems by means of maximum-present-value preference relations for small interest rates. Anoth-
er approach to bounded systems is instead to consider preference relations that make a suitable
function of the rewards large over long horizons. For example, it is often of interest to consider
preference relations that focus on making the long-run expected reward rate, total reward, total
float, future value with a negative interest rate, instantaneous rate of return, or expected multi-
plicative utility large. It is possible to encompass these and other preference relations in a uni-
form and simple way by generalizing our model to allow exogenous (possibly negative) immigra-
tion of individuals into the population in each period. Then, each of the above preference relations
reduces to Cesàro overtaking optimality for a suitable choice of the immigration stream.
Call a policy Cesàro overtaking optimal if the difference between the expected R -period re-
wards earned by that policy and any other policy has a nonnegative Cesàro limit inferior of ap-
propriate order. Cesàro limits are needed to smooth out oscillations in the differences of expect-
ed R -period rewards that would otherwise generally prevent the existence of overtaking-optimal
policies from being assured. It might be supposed that a policy that is Cesàro overtaking opti-
mal for each individual immigrant would also be Cesàro overtaking optimal for the immigration
stream itself. But that is not so! The reason is fundamentally that each individual immigrant
would be indifferent between two policies that produce the same permanent income even though
one policy produces more float, i.e., temporary income, than the other. But the system would
prefer the policy that produces more float if, as is possible, the aggregate of the float that each
individual immigrant produces provides extra permanent income for the system. That is the rea-
son why the set of Cesàro-overtaking-optimal policies depends on the immigration stream and
permits several preference relations to be reduced thereto.

Allowing immigration into the population is also interesting in its own right in the study of
controlled populations. For example, exogenous arrivals into a queueing network comprise an im-
migration stream therefor.

Examples
Controlled Queueing Network with Proportional Service Rates. Consider a controlled W -sta-
tion queueing network. Customers arrive exogenously at the various stations according to a gen-
eral stochastic process that is independent of the service process, with AR
= being the (finite) ex-
pected number of such arrivals at station = in period R  ". Each station has infinitely many
independent and identical servers, each with geometrically-distributed services times. A custom-
er arriving exogenously or endogenously at a station at the beginning of a period begins service
there immediately because there are infinitely many servers available. In each period and at
each station, the network manager chooses an action from a finite set that determines the
common service rate of customers at the station and the common rates at which customers leave
the station in the period to exit the network or to go to various stations in the next period. The
cost of servicing a customer and the revenue earned when the customer completes service at a
station depend on the action taken by the manager. This is a Markov decision chain with
immigration in which the states are the S stations, r(s, a) is the reward earned by a customer at
station s when the manager takes action a at that station, p(t | s, a) is the probability that a
customer being served at station s completes service at that station during the period and is sent
to station t at the beginning of the following period, and 1 − Σ_t p(t | s, a) is the probability that
a customer completes service at station s and leaves the system.
Keep in mind that the assumption that a station has infinitely many identical servers has an
alternate interpretation, viz., that the station has a single server whose service rate is propor-
tional to the number of individuals waiting to be served at the station. In this event, selection of
an action by the manager amounts to choosing the proportionality constant in the proportional
service rate there.
Figure 9 illustrates a four-station network. Exogenous arrivals occur only at stations 1 and
2. Stations 1 and 2 generate output at both stations 3 and 4. Stations 2 and 3 also generate
output that must be reworked by starting afresh at stations 2 and 1, respectively.
This model of controlled queueing networks has several strengths. One is that only one state is
required for each station (it is not necessary to keep track of the number of customers in a sta-
tion because there are infinitely many servers), so an W -station network leads to a problem with
only W states. For this reason, it is computationally tractable to find optimal policies for
controlling queueing networks with as many as several thousand stations! Two other strengths
are that the exogenous arrival process is arbitrary and the manager is allowed to control the ser-
vice rates and routing of customers.
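
Because each station has infinitely many servers, the expected station populations evolve linearly, which is why only S states are needed. The following sketch, with invented routing probabilities, rewards and arrival rates and one simple timing convention, propagates the vector of expected populations and accumulates the (linear) expected rewards.

    import numpy as np

    # Hypothetical 3-station network under one fixed choice of service actions.
    # P[s, t] = probability that a customer at station s in this period is at
    # station t next period; the leftover mass 1 - P[s, :].sum() exits the network.
    P = np.array([[0.5, 0.3, 0.1],
                  [0.0, 0.6, 0.3],
                  [0.2, 0.0, 0.5]])
    r = np.array([2.0, -1.0, 4.0])   # net reward per expected customer-period at each station
    w = np.array([1.0, 0.5, 0.0])    # expected exogenous arrivals at each station per period

    x = np.zeros(3)                  # expected number of customers at each station
    total = 0.0
    for N in range(1, 21):
        x = x @ P + w                # survivors route through P, new arrivals join
        total += x @ r               # rewards are linear in the expected populations
        if N % 5 == 0:
            print(f"period {N:2d}: populations {np.round(x, 3)}, cumulative reward {total:.2f}")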

[Figure 9 (network diagram): exogenous arrivals at stations 1 and 2, rework flows back to stations 1 and 2, and departures from the network.]

Figure 9. Optimal Control of Service/Rework Rates in a Four-Station Network of Queues

Of course the model also has limitations. The service rate at each station must be propor-
tional to the number of customers there; the number of such customers cannot be restricted; and
rewards are linear in the expected number of customers at each station.
Cash Management. Bills arrive daily at a firm for payment from all over the country. The
firm pays the bills by writing checks drawn on several banks throughout the country at which it
has accounts. The length of time for a check to clear depends on the location of the payee and
of the bank on which the check is drawn. Generally, the more distant the payee from the bank
on which the check is written, the longer the check takes to clear. From past experience, the
firm knows the distribution of the random time to clear checks drawn on each bank to pay bills
from each payee. The problem is to choose the bank on which to write a check to pay bills from
each payee.
Manpower Planning. A personnel director manages a work force in which individuals are hired,
move between various jobs and skill-levels over time so as to meet an organization’s needs, and
eventually leave. An individual’s state may be differentiated by factors like age, compensation,
current assignment, education, skill, etc. Actions may include policies on recruitment, hiring, train-
ing, compensation, job assignment, promotion, layoff, firing, retirement, etc.
Insurance Management. An insurance company insures its policy holders against risks of var-
ious types. A policy holder’s state could include the type and terms of his policy, age, sex, health,
safety record, occupation, prior claims, etc. The company’s actions might include rules for accept-
ing or rejecting new policy holders, premiums charged, dividends paid, coverage, cancellation pro-
visions, etc.

Asset Management. Managers of assets like vehicles, machines, equipment, computers, build-
ings, etc., maintain them to provide service to internal and/or external customers. An asset’s
state may include its type, condition, current status, reliability, age, capabilities, market value,
location, etc. Actions may include buying, renting, leasing, operating, pricing, maintaining, sell-
ing, storing, moving, etc.

Immigration Stream
As several of the above examples suggest, it is useful to generalize the systems considered
heretofore to allow individuals to enter the population through exogenous uncontrolled immi-
gration. In particular, the (possibly negative) number of immigrants in state = in period R  "
  " is AR R R
= . Let A œ diagÐA= Ñ be the W ‚ W diagonal immigration matrix whose =
>2
diagonal
element is AR
= . Call immigrants in a period positive or negative according as the number of immi-
grants in the period is positive or negative. Positive (resp., negative) immigrants add (resp.,
subtract) their rewards and those of their descendants to (resp., from) that of the system. Call
the sequence A œ ÐA! A" â Ñ of immigration matrices the immigration stream.
An alternate approach is to represent an immigration stream by appending immigrant-gener-
ating states. This device can be useful for studying certain well behaved immigration streams. In-
deed this viewpoint is necessary when it is possible to control the immigration stream. However,
it is usually better not to append additional states to reflect immigration for two reasons. One is
that this approach facilitates the study of the effect of changes in the immigration stream on the
set of Cesàro-overtaking-optimal policies. The other is that this method does not enlarge the state-
space and leads to more efficient computational methods.

Cohort and Markov Policies


Each immigrant in a period generates populations in subsequent periods. Call the collection
of all immigrants in a given period and their descendants a cohort. The R >2 cohort is the one
that the immigrants in period R generate. Of course each individual in the population in a
given period belongs to exactly one cohort, so there is a partition of the population in any given
period into cohorts as Figure 10 illustrates.
In the presence of immigration, a policy π = (δ^N) has at least two different interpretations.
One is as a Markov policy, i.e., every individual in period N uses δ^N. The other is as a cohort
policy, i.e., all members of cohort i use δ^N in period N+i−1. Equivalently, a cohort policy starts
afresh for each cohort. Figure 11 illustrates these two types of policies. Of course, the distinction
between Markov and cohort policies vanishes when they are stationary.
Nonstationary Markov policies might seem more natural than nonstationary cohort policies
because the action an individual takes in a given state intuitively should depend only on the state

[Figure 10 (diagram): the population in periods 1–5 partitioned into Cohorts 1–5, each cohort consisting of the immigrants of one period and their descendants.]

Figure 10. Cohorts

              Markov Policy                        Cohort Policy
                  Period                               Period
    Cohort    1     2     3    ⋯         Cohort    1     2     3    ⋯
      1      δ^1   δ^2   δ^3               1      δ^1   δ^2   δ^3
      2            δ^2   δ^3   ⋯           2            δ^1   δ^2   ⋯
      3                  δ^3               3                  δ^1
      ⋮                   ⋮                ⋮                   ⋮

Figure 11. Interpretations of (δ^N) as Markov and Cohort Policies

and not the cohort to which the individual belongs. However, perhaps surprisingly, it turns out
that cohort policies are more convenient for two reasons. One is that they are easier to work with
because the formulas for the relevant quantities are nicer, and results about Markov policies can
easily be derived from them in any case. The second reason is that cumulative immigration streams
have an alternate interpretation in terms of unit values of rewards earned in different periods. In
that event, when there is no exogenous immigration, all individuals are in the same cohort so co-
hort policies are the most general that need to be considered. For these reasons, in the remainder
of this section policies mean cohort policies unless explicitly stated to the contrary.
Let V_π^{N,w} be the N-period value when w is the immigration stream and π is a (cohort) policy.
Observe that since N−k periods remain when the immigrants in cohort k+1 arrive, then, as
Figure 12 illustrates,

 V_π^{N,w} = Σ_{k=0}^{N} w^k V_π^{N−k}.

              Periods →   1           2             3           ⋯       N
  Cohort 1               w^0 V_π^N
  Cohort 2                        w^1 V_π^{N−1}                          ⋯
  Cohort 3                                      w^2 V_π^{N−2}
    ⋮                                                            ⋱
  Cohort N                                                              w^{N−1} V_π^1

Figure 12. N-Period Value with Immigration Stream w
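
The sketch below evaluates this formula for a stationary policy, using made-up transition, reward and immigration data; V_π^{N−k} is the ordinary (N−k)-period value earned by the cohort that arrives in period k+1.

    import numpy as np

    # Hypothetical stationary policy on 2 states and an immigration stream w of
    # diagonal matrices; all data are invented for illustration.
    P = np.array([[0.9, 0.1],
                  [0.3, 0.6]])
    r = np.array([1.0, 2.0])
    w = [np.diag([1.0, 0.0])] + [np.diag([0.5, 0.5])] * 10   # w^0, w^1, ...

    def V(N):
        """Ordinary N-period value of the stationary policy: sum_{t=1}^N P^{t-1} r."""
        v, Pt = np.zeros(2), np.eye(2)
        for _ in range(N):
            v += Pt @ r
            Pt = Pt @ P
        return v

    def V_with_immigration(N):
        """V^{N,w} = sum_{k=0}^{N} w^k V^{N-k}: cohort k+1 has N-k periods to go."""
        return sum(w[k] @ V(N - k) for k in range(min(N + 1, len(w))))

    print(V_with_immigration(5))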

Overtaking Optimality
It might be expected that the analog of finding a strong maximum-present-value policy, i.e.,
one that simultaneously maximizes the present value for all small enough positive interest rates,
would be to find a policy - that maximizes the R -period value Z-R A simultaneously for all suffi-
ciently large R . Unfortunately, as the example below illustrates, such policies do not generally
exist because the set of maximum-R -period-value policies usually depends on R for large R , even
for transient systems. Of course, there are some exceptions to this general rule, e.g., circuitless
systems with no immigration after the first period, minimum-cost-chain problems, etc.
But a satisfactory treatment of bounded systems requires a weaker concept. To motivate one
such concept, observe that in a transient system, limR Ä_ Z1R œ Z1 for each 1. Thus a policy -
has maximum value, i.e., Z-  Z1   ! for all 1, if and only if lim infR Ä_ ÐZ-R  Z1R Ñ   ! for all
1. Both definitions are well defined in a transient system, but only the latter is generally well
defined in a bounded system. This suggests the following definition in a bounded system.
Call a policy - overtaking optimal for A if

Ð1Ñ lim inf ÐZ-R A  Z1R A Ñ   ! for all 1.


R Ä_

Call - overtaking optimal if Z-R  Z1R replaces Z-R A  Z1R A in (1). Observe that instead of maxi-
mizing a criterion function, overtaking optimality is a binary preference relation on the set of
policies. Moreover, that relation is a quasiorder, i.e., is reflexive and transitive, but not necessar-
ily antisymmetric.
Stationary overtaking-optimal policies exist for transient systems and, when the stopping value
is finite, for stopping problems. Indeed, they exist provided that the limit inferior in Ð1Ñ can be re-
placed by a finite limit, as occurs in convergent systems, and often in positive (all rewards are non-
negative) and negative (all rewards are nonpositive) systems. Unfortunately, overtaking-optimal
policies do not exist in general because Z$R=  Z#R= can oscillate about zero for some states = as the
next example illustrates.

Example. Nonexistence of Overtaking-Optimal Policies when d = 1 and Dependence of Max-
imum-N-Period-Value Policies on N. Consider the three-state deterministic system in Figure 13.
There are two decisions γ and δ, and the rewards are as indicated on each arc. Note that γ^∞ and
δ^∞ are effectively the only policies and d = 1. Now V_δ^{N,0} − V_γ^{N,0} = 2(−1)^{N−1} + 1 alternates between
3 and −1, so δ^∞ has maximum N-period value for odd N and γ^∞ has maximum N-period value
for even N. Thus, no policy has maximum N-period value for all large enough N. Also,

 lim inf_{N→∞} (V_δ^{N,0} − V_γ^{N,0}) = −1 and lim inf_{N→∞} (V_γ^{N,0} − V_δ^{N,0}) = −3,

so no policy is overtaking optimal.

[Figure 13 (a three-state deterministic network; the rewards under the two decisions γ and δ appear on the arcs).]

Figure 13. Nonexistence of Overtaking-Optimal Policies

Cesàro Overtaking Optimality


One way to address this problem is to smooth out the fluctuations in Z-R A  Z1R A by substi-
tuting a (C, d) limit inferior for the limit inferior in (1). If b_0, b_1, … is a sequence, the (C, d) limit
inferior of the sequence is the ordinary limit inferior of the sequence if d = 0, and is the limit
inferior of the sequence of arithmetic averages, viz., lim inf_{N→∞} N^{−1} Σ_{i=0}^{N−1} b_i, if d = 1.
Call - Cesàro overtaking optimal for A if

(1)w lim inf ÐZ-R A  Z1R A Ñ   ! ÐCß .Ñ for all 1.


R Ä_
Call - Cesàro overtaking optimal if Z-R  Z1R replaces Z-R A  Z1R A in (1)w Þ Observe that in the ex-
ample above, $ _ is the unique Cesàro-overtaking-optimal policy, though it is not overtaking opti-
mal. It turns out that stationary Cesàro-overtaking-optimal policies always exist for bounded sys-
tems under natural assumptions on the immigration stream. In order to show this, it is necessary
to develop some preliminary concepts.
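
The sketch below approximates the (C, d) limit inferior on a long finite prefix of a sequence; the sequence used is the oscillating difference 3, −1, 3, −1, … from the example above, whose ordinary limit inferior is −1 while its (C, 1) limit inferior is 1. The truncation and the use of the second half of the prefix as a stand-in for the tail are simplifications for illustration only.

    import numpy as np

    def cesaro_liminf(b, d):
        """Approximate (C, d) limit inferior of the sequence prefix b:
        d = 0 uses the values themselves, d = 1 uses their arithmetic averages."""
        b = np.asarray(b, dtype=float)
        if d == 1:
            b = np.cumsum(b) / np.arange(1, len(b) + 1)
        return b[len(b) // 2:].min()   # crude stand-in for the limit inferior of the tail

    # Oscillating differences as in the example above: 3, -1, 3, -1, ...
    b = [3 if N % 2 == 0 else -1 for N in range(1000)]
    print(cesaro_liminf(b, d=0), cesaro_liminf(b, d=1))    # -1.0 and 1.0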

Convolutions
It is convenient to represent (V_δ^{N,w}) as a “convolution” of the two sequences (V_δ^N) and (w^N).
To define this notion, suppose B = (B^N) and C = (C^N) are sequences for which N ≥ 0 and one
or both are either scalars or matrices of common size and for which the products B^i C^j are well
defined for all i, j. The last is always so if one or both of the sequences are scalars or if both are
matrices and the number of columns of each B^i equals the number of rows of each C^j. Denote
by B ∗ C the sequence whose N-th element (B ∗ C)^N is Σ_{i=0}^{N} B^i C^{N−i}; call B ∗ C the convolution of
B and C. The convolution operation ∗ is associative, i.e., (B ∗ C) ∗ E = B ∗ (C ∗ E), provided that
B ∗ C and C ∗ E are well defined. If also B^i C^j = C^j B^i for all i, j, then the convolution operation
is also commutative, i.e., B ∗ C = C ∗ B. In particular, that is so if B or C is a sequence of sca-
lars. If B and C are each finite sequences with e ≥ 1 elements, denote by B ⊙ C the middle ele-
ment of the convolution B ∗ C of 2e − 1 elements. Thus, B ⊙ C is the sum over k = 1, …, e of the
product of the element of B with k-th smallest index and that of C with k-th largest index.
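
A direct transcription of the convolution into code, on finite prefixes of two hypothetical sequences of matrices; the final line checks associativity numerically.

    import numpy as np

    def conv(B, C):
        """Convolution of finite prefixes of two sequences: (B*C)^N = sum_i B^i C^{N-i}."""
        m = min(len(B), len(C))
        return [sum(B[i] @ C[n - i] for i in range(n + 1)) for n in range(m)]

    # Hypothetical sequences of 2x2 matrices; scalar sequences would work the same way.
    rng = np.random.default_rng(0)
    B = [rng.random((2, 2)) for _ in range(6)]
    C = [rng.random((2, 2)) for _ in range(6)]
    E = [rng.random((2, 2)) for _ in range(6)]

    # Associativity (B*C)*E = B*(C*E) on the common prefix:
    lhs, rhs = conv(conv(B, C), E), conv(B, conv(C, E))
    print(all(np.allclose(x, y) for x, y in zip(lhs, rhs)))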

Binomial Coefficients and Sequences


For any positive integer j, set j! ≡ (j)(j−1)⋯(1), and for any number k, define k choose j,
denoted \binom{k}{j}, by

 \binom{k}{j} ≡ k(k−1)⋯(k−j+1) / j! .

For j = 0, put 0! ≡ 1 and \binom{k}{0} ≡ 1. Of course, if also k ≥ j are nonnegative integers, then

 \binom{k}{j} = k! / (j!(k−j)!) = \binom{k}{k−j}.

It is easy to verify for any k and nonnegative integer j that (with \binom{k}{−1} ≡ 0)

(2) \binom{k}{j−1} + \binom{k}{j} = \binom{k+1}{j}.

It turns out that the immigration streams in which the numbers of immigrants in a state in
a period are independent of the state and in which the common sequence is “binomial” play a
fundamental role in the analysis. Define the binomial sequence of order 8, denoted "8 , by induc-
tion on 8. Call "! ´ Ð" ! ! â Ñ the identity sequence because "! ‡ F œ F for any sequence
F œ ÐF ! F " â Ñ. Call "" ´ Ð" " " â Ñ the summing sequence because the R >2 element of
"" ‡ F is !R 3
3œ! F . The inverse of summing is differencing. Call "" ´ Ð" " ! ! â Ñ the dif-
ferencing sequence because the R >2 element of "" ‡ F is F R  F R " where F " ´ 0.
Now define "8 ´ Ð"R
8 Ñ inductively by the rule:

"8 œ 
"" ‡ "8" , 8  "
"" ‡ "8" , 8  ".

Call "8 ‡ F and "8 ‡ F respectively the 8-fold sum and 8-fold difference of F. Observe that

"7 ‡ "8 œ "78 , 7ß 8 œ !ß „"ß „#ß á .

It is easy to show from Ð2Ñ by induction on 8 that



 β_n^N = \binom{N+n−1}{N}, N, |n| = 0, 1, 2, … .

Thus,

 β_n^N = \binom{N+n−1}{n−1}, n > 0,

which is a polynomial in N of degree n−1, whence β_n has degree n. Also

 β_n^N = (−1)^N \binom{−n}{N}, n ≤ 0,

which alternates in sign for 0 ≤ N ≤ −n and is zero for N > −n.


The table below gives the first five elements of the binomial sequences β_n for |n| ≤ 2.

   n               β_n
  −2       1   −2    1    0    0   ⋯
  −1       1   −1    0    0    0
   0       1    0    0    0    0   ⋯
   1       1    1    1    1    1
   2       1    2    3    4    5   ⋯

Figure 14. Some Binomial Sequences
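
A short sketch that generates the binomial sequences from the closed forms just given and reproduces the table in Figure 14; the last line checks the identity β_m ∗ β_n = β_{m+n} in one instance (m = 2, n = −1). Python's math.comb is assumed for the binomial coefficients.

    from math import comb

    def beta(n, length):
        """First `length` elements of the binomial sequence of order n:
        C(N+n-1, n-1) for n >= 1 and (-1)^N C(-n, N) for n <= 0."""
        if n >= 1:
            return [comb(N + n - 1, n - 1) for N in range(length)]
        return [(-1) ** N * comb(-n, N) for N in range(length)]

    def conv(B, C):
        m = min(len(B), len(C))
        return [sum(B[i] * C[k - i] for i in range(k + 1)) for k in range(m)]

    for n in range(-2, 3):
        print(n, beta(n, 5))                    # reproduces the table in Figure 14

    print(conv(beta(2, 6), beta(-1, 6)) == beta(1, 6))   # beta_2 * beta_{-1} = beta_1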

Polynomial Expansion of Expected Population Sizes with Binomial Immigration


Our goal now is to give a polynomial expansion in N of the N-period value V_π^{N,n} of a policy
π for a binomial immigration stream β_n of order n, i.e., β_n^{N−1} I is the immigration matrix in each
period N ≥ 1. Note that the symbol β_n refers both to the binomial sequence and immigration
stream of order n. The context will make the meaning clear, though either interpretation will
usually do.
For each policy π, let 𝒱_π ≡ (V_π^0 V_π^1 ⋯) and 𝒱_π^n ≡ (V_π^{0,n} V_π^{1,n} ⋯) ≡ β_n ∗ 𝒱_π. If π = δ^∞, re-
place π by δ in these definitions and let 𝒫_δ ≡ (0 P_δ^0 P_δ^1 ⋯). It follows that 𝒱_δ = (β_1 ∗ 𝒫_δ) r_δ and
𝒱_δ^n = (β_{n+1} ∗ 𝒫_δ) r_δ.
There are two interpretations of the (s, t)-th element of (β_{n+1} ∗ 𝒫_δ)^N = (β_n ∗ (β_1 ∗ 𝒫_δ))^N. One is
the expected population size in state t in period N ≥ 1 generated by binomial immigration of
order n+1 into state s. The other is the sum of the expected sojourn times in state t in the first
N ≥ 1 periods of the immigrants into state s and their descendants when there is binomial immi-
gration of order n into state s.
Thus, in order to obtain a polynomial expansion of V_δ^{N,n} in N, it is necessary to find a poly-
nomial expansion of the N-th element of β_{n+1} ∗ 𝒫_δ in N. The polynomial expansion is analogous to
the Laurent expansion of the resolvent of Q_δ that was needed to study the problem with small in-
terest rates. The desired polynomial expansion is provided in the next theorem.

Theorem 28. Polynomial Expansion of Convolutions of Binomial Sequences with Matrix


Powers. Suppose that P is a square complex matrix with .T Ÿ " and with stationary and devia-
tion matrices T ‡ and H respectively. Then for R   " and 8   ",

Ð"8" ‡ ÑR œ Š ‹ " Š ‹Ð"Ñ4 H4"  T R 8 ÐHÑ8" ÐM  T ‡ Ñ


8
R 8 ‡ R 8
Ð3Ñ T 
8" 4œ!
84

where  œ Ð! T ! T " â Ñ.

Before proving this Theorem, it is useful to discuss it. Notice that on letting ƒ8" be the se-
quence of matrices ÐT ‡ H" H# â Ð"Ñ8 H8" Ñ and ,8R be the sequence of binomial coefficients
Š ‹ß á ß Š ‹,
R 8 R 8
the expansion Ð3Ñ can be rewritten more compactly using convolutions as
! 8"

Ð3Ñw Ð"8" ‡ ÑR œ ,8R ì ƒ8"  %8R

where %8R ´ T R 8 ÐHÑ8" ÐM  T ‡ Ñ. Observe that %8R œ 9.T Ð"Ñ, i.e., %8R Ä ! ÐCß .T Ñ, since T and
H commute, T R Ä T ‡ ÐCß .T Ñ and T ‡ T ‡ œ T ‡ . Thus, Ð3Ñw provides an expansion that is a sum
of a polynomial in R of degree 8  " and an error term that converges to ! ÐCß .T Ñ.
The case in which 8 œ ! is of special interest. In that event, Ð3Ñw becomes

" T 3 œ R T ‡  H  9.T Ð"Ñ.


R "
Ð3Ñww
3œ!

Notice that this representation is simply a restatement of the representation Ð7Ñ of the deviation
matrix in §1.8 as may be seen by subtracting R T ‡ from both sides of the above equation and
taking the ÐCß .T Ñ limit as R Ä _.
The expansion Ð3Ñw is the sum of a polynomial in R and a small error term, and is analogous
to a corresponding expansion of 38 V3 as the sum of a polynomial in 3" and a small error
term. The last expansion is obtained from the Laurent expansion of V3 and is

38 V3 œ 38" T ‡  38 H  â  Ð"Ñ8 H8"  9Ð"Ñ.

Observe that the polynomial in this expansion has the same degree 8  " and the same coeffi-
cient matrix ƒ8" as that in Ð3Ñw . Incidentally, the multiplication of V3 by 38 in the above
expression arises because the present value of "8 used in Ð3Ñw , when discounted to period 8, is
38 . In the special case in which 8 œ !, this last expansion becomes

V3 œ 3" T ‡  H  9Ð"Ñ.

The right-hand side of this expansion is similar to Ð3Ñww with 3" replacing R .

Proof of Theorem 28. To prove Ð3Ñ, it suffices to show for R   " and 8   " that

Ð"8" ‡ ÑR T ‡ œ Š ‹T ‡
R 8
Ð4Ñ
8"
and

Ð"8" ‡ ÑR ÐM  T ‡ Ñ œ "Š ‹Ð"Ñ4 H4"  T R 8 ÐHÑ8" ÐM  T ‡ Ñ,


8
R 8
Ð5Ñ
4œ!
84

since Ð3Ñ follows by adding Ð%Ñ and Ð&Ñ. To prove Ð%Ñ, observe that

8# T œ Š ‹T ‡ .
R 8
Ð"8" ‡ ÑR T ‡ œ Ð"8" ‡ "" ÑR " T ‡ œ "R " ‡
8"

The next step is to establish Ð5Ñ by induction on 8. The result is trivial for 8 œ ". Now ob-
serve that Ð"" ‡ ÑÐM  T Ñ œ Ð""  "! ÑM  T . Postmultiplying this equation by H and using the
facts that HT ‡ œ T ‡ H œ ! and HU œ M  T ‡ œ UH from Theorem 20 yields

Ð6Ñ Ð"" ‡ ÑÐM  T ‡ Ñ œ Ð""  "! ÑH  T ÐM  T ‡ ÑH,

establishing Ð&Ñ for 8 œ !. Suppose now that Ð&Ñ holds for 8 Ð   !Ñ and consider 8  ". By taking
the convolution of Ð'Ñ with "8 and then using the bilinearity and associativity of convolution, the
induction hypothesis, the fact that H8" U œ H8 œ UH8" , and Ð#Ñ, it follows that for R   ",

Ð"8" ‡ ÑR ÐM  T ‡ Ñ œ Ð"R R R ‡


8"  "8 ÑH  T Ð"8 ‡ Ñ ÐM  T ÑH

œŠ ‹H  T Š"Š ‹Ð"Ñ4 H4  T R 8" ÐHÑ8 ÐM  T ‡ Ñ‹H


8
R 8" R 8"
8 4œ"
84

œŠ ‹H  "Š ‹Ð"Ñ4 H4 ÐH  MÑ  T R 8 ÐHÑ8" ÐM  T ‡ Ñ


8
R 8" R 8"
8 4œ"
84

œ "ŠŠ ‹Š ‹‹Ð"Ñ4 H4"  T R 8 ÐHÑ8" ÐM  T ‡ Ñ


8
R 8" R 8"
4œ!
84 84"

œ "Š ‹Ð"Ñ4 H4"  T R 8 ÐHÑ8" ÐM  T ‡ Ñ,


8
R 8
4œ!
84

which establishes Ð&Ñ. è

Polynomial Expansion of R -Period Values


Recall that •7 "
$ œ "7 ‡ •$ œ Ð"7 ‡ "" Ñ ‡ Ð"" ‡ •$ Ñ œ "7" ‡ •$ œ Ð"7" ‡ $ Ñ<$ . Thus, it fol-
lows from the polynomial expansion Ð$Ñ and Ð$Ñw of the sum of the expected sojourn times of in-
dividuals in the first R periods with binomial immigration of order 8 œ 7 when using $ _ that

Z$R 7 œ " Š ‹@  T$R ?7


7
R 7 4 R 7 R 7
Ð7Ñ $ œ , 7 ì Z $  T$ ? $
4œ.
74 $
for R   " and 7   ! where ?7 7 7
$ ´ T$ @$ . Observe that this formula gives an expansion of Z$
R7

as the sum of a polynomial in R of degree 7  . with coefficient matrix Z$7 and a term that is
9. Ð"Ñ.
Although the error term is small, it is convenient to eliminate it entirely by including an ap-
propriate terminal reward. This important idea will play a fundamental role in the sequel. To
obtain the desired exact formula, let Z$R 7 Ð?Ñ œ Z$R 7  T$R ? where ? is the terminal reward re-
ceived at the end of period R . It follows from Ð7Ñ that Z$R 7 Ð?7
$ Ñ is a polynomial in R of degree
7  . with

$ Ñ œ "Š ‹@ œ ,7
7
R 7 4
Ð8Ñ Z$R 7 Ð?7 R
ì Z$7
4œ.
74 $
for R   " and 7   !. Observe that since . Ÿ ", it follows from Ð7Ñ, Ð8Ñ and T$‡ H$ œ ! that
limR Ä_ ÒZ$R 7  Z$R 7 Ð?7
$ ÑÓ œ ! ÐCß .Ñ for 7   !. Notice from Ð8Ñ that $ maximizes Z$
R7 7
Ð?$ Ñ over
? for all large enough R if and only if $ maximizes Z$7 lexicographically. This is a strong form
of overtaking optimality for stationary policies except that the terminal value ?7
$ depends on $ .
The case in which 7 œ ! is of special interest. In that event, Ð8Ñ expresses the fact that with
the proper choice of the terminal reward, viz., ?!$ , the R -period value

Z$R Ð?!$ Ñ œ R @$"  @$!

is affine in R . Moreover, the slope

@$" œ T$‡ <$ œ lim R " Z$R


R Ä_
is the stationary reward or reward rate and the intercept

@$! œ H$ <$ œ lim ÐZ$R  R @$" Ñ ÐCß .Ñ


R Ä_
is the ÐCß .Ñ limit of the difference between the R -period values starting from each state and start-
ing with the stationary distribution therein.
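
To illustrate the m = 0 case numerically: the sketch below computes the reward rate v^{−1} = P*_δ r_δ and the intercept v^0 = D_δ r_δ for an invented two-state chain, obtaining P* from the stationary distribution and D from the identity D = (I − P + P*)^{−1} − P* (one standard formula for the deviation matrix, assumed here rather than taken from the text), and then verifies that the N-period value plus the terminal correction P^N v^0 equals N v^{−1} + v^0.

    import numpy as np

    # Hypothetical irreducible 2-state chain (so d = 1) and reward vector.
    P = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    r = np.array([1.0, 3.0])

    # Stationary matrix P*: every row is the stationary distribution pi of P.
    A = np.vstack([(np.eye(2) - P).T, np.ones(2)])
    pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
    Pstar = np.tile(pi, (2, 1))

    # Deviation matrix via the identity D = (I - P + P*)^{-1} - P* (an assumption here).
    D = np.linalg.inv(np.eye(2) - P + Pstar) - Pstar

    v_rate = Pstar @ r    # v^{-1}: the stationary reward (reward rate)
    v_bias = D @ r        # v^0:  the intercept

    # With terminal reward v^0, the N-period value is exactly affine in N.
    for N in (5, 50):
        VN = sum(np.linalg.matrix_power(P, t) @ r for t in range(N))
        print(np.allclose(VN + np.linalg.matrix_power(P, N) @ v_bias,
                          N * v_rate + v_bias))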
For every policy 1 œ Ð#3 Ñ, decision $ , ? − d W , 7ß R   !, and P   ", let 1P ´ Ð#" ß á ß #P Ñ, 1P ´
Ð#P" ß #P# ß á Ñ and 1P $ ´ Ð1P ß $ _ Ñ. If # ß $ − ?, let # P ´ Ð# _ ÑP , #$ ´ Ð#ß $ _ Ñ and # P $ ´ Ð# P ß $ _ Ñ.
Subscripts on decisions will be reserved for enumerating them.
The next lemma expresses Z1R 7 Ð?Ñ as the sum of the expected rewards earned in the first P
periods and those earned in the subsequent Q ´ R  P   ! periods when 7   !. The result fol-
1 œ "7" ‡ •1 , "7" œ Š ‹
R 7
lows readily from the facts that •7 " R
and •" ! "
1 œ Ð! T1 <#" T1 <## âÑ.
7

Lemma 29. If 1 œ Ð#3 Ñ − ?_ , ? − d W , Pß Q ß 7   !, and R œ P  Q   ", then

Z1R 7 Ð?Ñ œ "Š ‹T13" <#3  T1P Z1Q


P
R 37 7
Ð?Ñ.
3œ"
7 P

Proof. Use the fact that Z1R 7 Ð?Ñ œ ! Š R 37 ‹T13" <#3  ! Š R 37 ‹T13" <#3  T1R ?. è
P R
7 7
3œ" 3œP"

Comparison Lemma for R -Period Values


Unfortunately, there is no exact comparison lemma for the difference of the R -period values
of two policies in terms of a comparison function. However, it is possible to give an exact com-
parison lemma for “eventually stationary” policies with a suitable terminal reward. Subsequent-
ly, we show that the error made in altering the problem in this way is comparatively small.
The following lemma gives an exact representation for the difference between the R -period
value Z1RP $7 Ð?7 P
$ Ñ of the eventually stationary policy 1 $ and the R -period value Z$
R7 7
Ð?$ Ñ of the
corresponding stationary policy $ _ , both with the terminal reward ?$7 . The exactness of the
representation is a consequence of this choice of the terminal reward. The representation is in
terms of the time-domain comparison function

´ "Š ‹1 œ ,7
7
57 57 4 5 7
Ð9Ñ K#$ ì K#$
4œ.
74 #$
where 5ß 7   ! and # ,$ − ?.

Lemma 30. If 1 œ Ð#3 Ñ − ?_ , $ − ?, .$ Ÿ ", ! Ÿ P Ÿ R , and 7   !, then

Ð?$ Ñ œ "T13" K#R3 $3ß7 .


P
Z1RP $7 Ð?7
$ Ñ  Z$
R7 7

3œ"
Proof. First establish the result for P œ ". Put # œ #" . By Lemma 29 with P œ ", Z#$R 7 Ð?7
$ Ñ
œŠ ‹<#
R 7"
 T# Z$R "ß7 Ð?7
$ Ñ. Using this fact, Ð)Ñ, the identity Ð#Ñ for binomial coefficients,
7
and @$." œ !, one finds that

Ð?$ Ñ œ Š ‹<#  T# " Š ‹@$  " Š ‹@


7 7
R 7" R 7" 4 R 7 4
Z#$R 7 Ð?7
$ Ñ  Z$
R7 7
7 4œ.
74 4œ.
74 $

œŠ ‹ # " Š ‹ $ "Š ‹@$


7 7"
R 7" R 7" 4 R 7" 4
<  U # @ 
7 4œ.
74 4œ.
74"
R "ß7
œ K#$ ,

which establishes the result for P œ ". This fact and Lemma 29 imply that for P   ",

œ "ÒZ1R3 $7 Ð?7
P
Z1RP $7 Ð?7
$ Ñ  Z$R 7 Ð?7
$ Ñ
R7 7
$ Ñ  Z13" $ Ð?$ ÑÓ
3œ"

œ "T13" ÒZ#R3 $3"ß7 Ð?$7 Ñ  Z$R 3"ß7 Ð?$7 ÑÓ œ "T13" K#R3 $3ß7 . è
P P

3œ" 3œ"
The above result permits us to deduce the following important Comparison Lemma which
gives an approximate representation for the difference between the R -period values of an arbi-
trary policy and a stationary policy—both with a binomial immigration stream of order 7.

Lemma 31. Comparison Lemma. If . Ÿ ", 1 œ Ð#3 Ñ − ?_ , $ − ?, R   Q   ! and 7   !, then

œ " T13" K#R3 $3ß7  SÐR ." Ñ uniformly in 1.


R Q
Z1R 7  Z$R 7
3œ"

Proof. It follows from Lemma 29 and the System-Degree Theorem 9 that

Z1R 7  Z1RR Q
7 7 R Q
$ Ð?$ Ñ œ T1 ÒZ1Q 7
R Q
 Z$Q 7 Ð?7
$ ÑÓ œ SÐR
."
Ñ uniformly in 1

for R   Q , and the same is so when $ _ replaces 1. The result then follows by observing that the
error from replacing Z1R7 7
R Q $ Ð?$ Ñ by Z1
R7
and Z$R7 Ð?7
$ Ñ by Z$
R7
in Lemma 30 is SÐR ." Ñ. è

Cesàro Overtaking Optimality with Binomial Immigration Streams


Now a policy - is Cesàro overtaking optimal for the binomial immigration stream of order 8 if

(10) lim inf ÐZ-R 8  Z1R 8 Ñ   ! ÐCß .Ñ for all 1

or, equivalently, since !R


R Ä_
38 R R Rß8" R
3œ! Z1 œ Ð"" ‡ Ð"8 ‡ •1 ÑÑ œ Ð"8" ‡ •1 Ñ œ Z1 and limRÄ_
R "
œ ",

(10)w lim inf R . ÐZ-R ß8.  Z1R ß8. Ñ   ! for all 1.


R Ä_
If . œ !, Cesàro overtaking optimality for the binomial immigration stream of order 8 is the same
as overtaking optimality therefor. If . œ ", the ÐCß .Ñ limits are needed to damp out the effect of
the (bounded) error term in the Comparison Lemma 31.
Call a policy strong Cesàro overtaking optimal if it is Cesàro overtaking optimal for binomial
immigration streams of all orders. Now apply Lemma 31 to establish the next result.
Theorem 32. Existence and Characterization of Stationary Cesàro-Overtaking-Optimal Poli-
cies. In a bounded system with system degree ., there is a stationary strong Cesàro-overtaking-op-
timal policy. Also, $ _ is Cesàro overtaking optimal for the binomial immigration stream of order
8   . if and only if $ _ is 8-optimal. Further, $ _ is strong Cesàro overtaking optimal if and only
if $ _ is _-optimal.
Proof. Since the system is bounded, it follows from Theorem 22 that there is a stationary
policy $ _ that has strong maximum present value and for which K#$ £ ! for all # − ?. Thus, if
5ß8.
Q  ! is large enough, it follows from (9) that K#$ Ÿ ! for all # − ? and 5   Q . Also since
the system is bounded, .1 Ÿ " for all 1 by the System-Degree Theorem 9. Hence, by the Com-
parison Lemma 31 (with 7 œ 8  . on noting that R  3   Q for 3 Ÿ R  Q ), - œ $ _ satisfies
(10)w , i.e., $ _ is Cesàro overtaking optimal for the binomial immigration stream of order 8. Since
this is so for all 8, $ _ is strong Cesàro overtaking optimal. The two characterizations follow from
(10)w and (7). è
Special Cases. Several examples of Cesàro overtaking optimality for binomial immigration
streams of order 8 appear below. Each of these examples provides an alternate interpretation of
Cesàro overtaking optimality as a natural concept of optimality for the system without exogen-
ous immigration after the first period.
Reward-Rate Optimality. Since "" œ ÐM M ! ! â Ñ, Z1Rß" œ T1R" <#R is the expected re-
ward that 1 œ Ð#3 Ñ earns in period R . Thus, it is natural to call a policy reward-rate optimal if
it is Cesàro overtaking optimal for the binomial immigration stream of order ".

As an example, consider the two-state substochastic deterministic system in Figure 15. There
are two decisions # ß $ with rewards the numbers on each arc. Observe that both # _ and $ _ have
common unit reward rate starting from either state, and so both are reward-rate optimal. By con-
trast, Z$R"  Z#R" œ 2 for R   ", so only $ _ is overtaking optimal.
[Figure 15 (two-state substochastic deterministic network; the rewards under the two decisions γ and δ appear on the arcs).]

Figure 15

Cesàro Overtaking Optimality. Since "! œ ÐM ! ! â Ñ, Z1R ! œ Z1R . Thus, Cesàro overtak-
ing optimality for the binomial stream of order zero is ordinary Cesàro overtaking optimality. In a
transient system, a policy is Cesàro overtaking optimal if and only if it has maximum value. More-
over, Cesàro overtaking optimality is well defined for bounded systems. By contrast, for such sys-
tems, values are often undefined or equal „ _ (as in Figure 15), rendering the maximum-value
criterion insufficiently selective. Thus Cesàro overtaking optimality is the proper way of extending
the concept of maximum value from transient to bounded systems.
Now consider the two-state substochastic deterministic system that Figure 16 illustrates.
There are two decisions # and $ with rewards the numbers on each arc. Then both # and $ have
common value zero starting from state one, and so are Cesàro overtaking optimal.

[Figure 16 (two-state substochastic deterministic network; the rewards under γ and δ appear on the arcs).]

Figure 16

However, if the immigration stream is binomial of order one, each individual who arrives in
state one in period R and uses $ _ earns a unit temporary reward in that period. Moreover, as
Figure 17 illustrates, the aggregate of the temporary rewards that all individuals generate when
they use $ _ provides a permanent unit reward for the system, i.e., Z$R" " œ " for all R   ". Thus
only $ _ maximizes the total reward for the system with binomial immigration of order one.
Float Optimality. Since "" œ ÐM M M â Ñ, Z1R " is the R -period float, i.e., the sum !R
3œ" Z1
3

of the cash positions in each period " Ÿ 3 Ÿ R . For this reason, call a policy float optimal if it is
Cesàro overtaking optimal for the binomial immigration stream of order one. As an example,
consider the W -station queueing network discussed at the beginning of this section in which the
expected exogenous arrivals in each period into each station depends on the station, but not on
the period. Then the stream of expected arrivals is a nonnegative diagonal matrix times the bi-
nomial stream of order one. Thus if the goal is to find a policy that is Cesàro overtaking optimal

[Figure 17 (table): with one immigrant arriving in state one in every period, each cohort earns a unit temporary reward on arrival that a later loss offsets, so the per-period totals for the system are 1, 0, 0, 0, … .]

Figure 17. Rewards each Immigrant Earns in each Period
with w = diag(1 0) · β_1 when Using δ^∞

for the queueing network, it suffices to find a policy that is float optimal with no exogenous
immigration after the first period.

Value Interpretation of Immigration Stream


In each of the above examples, an immigration stream A, or more precisely the cumulative
immigration stream [ œ Ð[ R Ñ ´ A ‡ "" , has an alternate interpretation as values of rewards an
arbitrary policy 1 earns with no immigration after period one. To see this, let •A1 œ ÐZ1R A Ñ be
the sequence of R -period values that 1 earns when the immigration stream is A. Then it follows
that •A1 œ A ‡ •1 œ [ ‡ •"
1 , so that Z1
RA
œ !R3œ! [ Z 1 œ !R
R 3 3ß" 3 R 3ß"
3œ! [ Z1 . Moreover, since
Z13ß" œ T13" <#3 is the expected reward that 1 œ Ð#3 Ñ earns in period 3, it follows that [ 3 can be
thought of alternately as the unit value of rewards in period R 3, i.e., with 3" periods remaining.
As examples, observe that when A is the binomial stream of order ", then [ œ "! so [ R œ !
for R  !, i.e., the weight is on rewards only in the last period. In this event, Cesàro overtaking
optimality is equivalent to maximizing the reward rate. By contrast, if A is the binomial stream of
order !, then [ œ "" , so [ R œ M for all R   !, i.e., there is equal weight on rewards in each per-
iod. In this event, Cesàro overtaking optimality is equivalent to ordinary Cesàro overtaking opti-
mality. If instead, A œ "8 and 8  !, then [ œ "8" , i.e., the weight on rewards with 3 periods re-
maining is a polynomial in 3 of degree 8. Thus as 8 increases, the absolute and relative weights on
rewards in early periods increases.
Evidently, this value interpretation of ÐcumulativeÑ immigration streams encompasses most
standard optimality concepts (in which there is no immigration after the first period). Moreover,
in this setting, the cohort and Markov policies coincide.

Combining Physical and Value Immigration Streams


So far there are two distinct interpretations of immigration streams, viz., physical and value.
Can they be combined as would be desirable, for example, in the study of queueing networks
where both are needed? The answer is ‘yes’. Indeed, Cesàro overtaking optimality for a system
in which there is a physical immigration stream A and a value one Aw is equivalent to Cesàro

overtaking optimality for the system with the single combined immigration stream Aw ‡ A. To
see this, observe that the stream of R -period values with the physical stream A and policy 1 is
simply A ‡ •1 . Now if it is desirable to impose the value stream Aw on this system, the new val-
ue-weighted stream of R -period values is Aw ‡ ÐA ‡ •1 Ñ œ ÐAw ‡ AÑ ‡ •1 , which justifies our claim.
Thus, for example, if the vector of expected arrivals into the stations of the queueing net-
work is the diagonal matrix A! in each period and the goal is to maximize the float of the net-
work, then A œ A! † "" and Aw œ "" , so the combined immigration stream is Aw ‡ A œ A! † "# . On
the other hand if the goal is merely to maximize the network’s reward rate, then Aw œ "" and
the combined stream is Aw ‡ A œ A! † "! .

Cesàro Overtaking Optimality with More General Immigration Streams


It is now time to generalize the above results concerning the existence of stationary Cesàro-
overtaking-optimal policies from binomial to a more general class of immigration streams which
includes positive linear combinations of binomial streams. The basic idea is to reduce Cesàro ov-
ertaking optimality for the more general class of immigration streams to that for corresponding
binomial immigration streams. To see how this can be done, it is useful first to establish a pre-
liminary lemma.

Lemma 33. Nonnegativity of Cesàro Limits of Convolutions. If F œ ÐF ! F " â Ñ and G œ


ÐG ! G " â Ñ are sequences of real matrices or scalars for which the convolution F ‡ G is defined,
F is nonnegative and has degree zero, and lim infR Ä_ G R   ! ÐCß .Ñ with ! Ÿ . Ÿ ", then neces-
sarily lim infR Ä_ ÐF ‡ GÑR   ! ÐCß .Ñ.

s R   ! where G
Proof. Since lim infR Ä_ G R   ! ÐCß .Ñ, lim infR Ä_ R . G s ´ G ‡ ". . Let % ¦ !
3
be a given matrix. Choose Q so that 3. G s   % for 3   Q . Then since F   ! has degree zero,
there is a matrix O not depending on Q or % such that

s R œ " F R 3 G
s 3  " F R 3 G
Q " R
ÐF ‡ GÑ s 3   SÐR ." Ñ  %OR . .
3œ! 3œQ
s R   %O , whence lim infR Ä_ ÐF ‡ GÑR   %O ÐCß .Ñ. Thus be-
Therefore, lim infR Ä_ R . ÐF ‡ GÑ
cause % is arbitrary, lim infR Ä_ ÐF ‡ GÑR   ! ÐCß .Ñ. è

In many practical settings, interest centers on immigration streams that are more general
than binomial. One such class is the set j8 of immigration streams A for which A ‡ "7 is non-
negative and has degree zero for some 7 Ÿ 8. For example, the binomial streams "7 of order
7 Ÿ 8 are in j8 because "7 ‡ "7 œ "! is nonnegative and has degree zero.
Denote by j8 the convex cone that consists of the nonnegative linear combinations of ele-
ments of j8 . It is of interest that j8 ¨ j8 . To see why, let A œ "8  # † "8" , so A − j8 . How-
ever, A ‡ "8 œ "!  # † "" œ Ð$M #M ! ! âÑ is not nonnegative and A ‡ "8" œ ""  # † "! œ

Ð$M M M âÑ does not have degree zero. Thus A ‡ "7 is not nonnegative for 7   8 and does not
have degree zero for 7  8. Hence A Â j8 .

Theorem 34. Reduction of Immigration Streams to Binomial Ones. In a bounded system for
which A − j8 with 8   ", there is a stationary policy that is Cesàro overtaking optimal for A,
viz., any stationary policy that is Cesàro overtaking optimal for the binomial immigration stream of
order 8.

Proof. Suppose $ − ?8 and the system degree is . . By hypothesis, A œ !87œ5 A7 for some
5 Ÿ 8 and immigration streams A5 ß á ß A8 such that A7 ‡ "7 is nonnegative and has degree
zero for 5 Ÿ 7 Ÿ 8. For each policy 1,

" ÒÐA7 ‡ "7 Ñ ‡ Е$7  •17 ÑÓ.


8
•A A
$  •1 œ A ‡ Е$  •1 Ñ œ
7œ5
Let _ (resp., _7 ) denote the ÐCß .Ñ limit inferior of the sequence on the left-hand side (resp., se-
quence in brackets on the right-hand side) of the above equation. Since $ − ?8 © ?7 for each
7 Ÿ 8, it follows that _7   ! by Lemma 33. Hence _   !87œ5 _7   !. è

One may ask whether the hypothesis A − j8 in Theorem 34 can be dropped. The answer is
that it cannot as the following example illustrates.

Example. Nonexistence of a Stationary Cesàro-Overtaking-Optimal Policy When A Â


/
j8 for all 8   ". Consider the two-state substochastic system that Figure 18 illustrates.
The rewards and actions available in each state appear on the arcs. Let A! œ diagÐ" "Ñ and
A3 œ ! for 3   ". The nonstationary policy 1 that takes action + in the first period and action ,
in the second is the unique (Cesàro-) overtaking-optimal policy because it earns the system "
starting from each state. By contrast, all stationary policies earn the system " or ! in some
state and so are not Cesàro overtaking optimal for A. Thus, by Theorem 34, it must be so that
A œ ÐA3 Ñ Â j8 for all 8   ".

[Figure 18 (two-state substochastic network; the actions a and b available in each state and the corresponding rewards appear on the arcs).]

Figure 18

It can be shown that the condition that A ‡ "7 has degree zero is not essential to Theorem
34. But in that event it is necessary to replace Cesàro limits of order one by higher-order Cesàro
limits. For that reason we omit a discussion of the more general case.

Future-Value Optimality
The above result has a useful application to problems in which the interest rate 1003% is
negative and "  3. Recall from §1.7 that the last assumption means that one cannot lose
more than one invests. Also, negative interest rates arise when the decision maker is interested
in inflation-adjusted rewards and in which the after-tax rate of interest is less than the rate of
inflation, a not uncommon situation. In this circumstance, the inflation-adjusted present value
of a policy is often a divergent series, and so is not well defined. The way out of this difficulty is
to instead discount income to the future.
R3
The future value Zs1 in period R of a policy 1 œ Ð#3 Ñ is

Zs1 ´ " !R 3 T13" <#3 œ Ð ‡ •"


R
R3 R R
1 Ñ œ ÐÐ ‡ "" Ñ ‡ •1 Ñ
3œ"
where ! ´ "  3 and  ´ Ð! M !" M !# M â Ñ. Observe that [ œ  can be thought of alternately
!

as a cumulative immigration stream. The hypotheses on 3 imply that !  !  ". Call a policy -
future-value optimal if
R3 R3
lim inf ÐZs-  Zs1 Ñ   ! ÐCß .Ñ for all 1.
R Ä_
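
A small numerical sketch of future values for a stationary policy with invented data: each period's reward is carried forward to the horizon at the (negative) interest rate, i.e., multiplied by α^{N−i} with α = 1 + ρ.

    import numpy as np

    # Hypothetical 2-state stationary policy and a negative interest rate rho.
    P = np.array([[0.8, 0.2],
                  [0.5, 0.5]])
    r = np.array([1.0, -0.5])
    rho = -0.1
    alpha = 1 + rho                  # 0 < alpha < 1

    def future_value(N):
        """Rewards accumulated to the end of period N at rate rho (discounting forward)."""
        return sum(alpha ** (N - i) * (np.linalg.matrix_power(P, i - 1) @ r)
                   for i in range(1, N + 1))

    for N in (10, 50, 200):
        print(N, future_value(N))    # settles down as N grows for this (aperiodic) chain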

Theorem 35. Reduction of Future-Value to Reward-Rate Optimality. In a bounded system,


there is a stationary policy that is future-value optimal for all "  3  !, viz., any stationary re-
ward-rate optimal policy.

Proof. Set A ´  ‡ "" . Then A ‡ "" œ  is nonnegative and has degree zero, so A − j" .
Now apply Theorem 34 with 8 œ ". è

10 SUMMARY
Figures 19 and 20 below provide directed assertion-implication graphs that summarize many
of the results of this chapter, the homework problems and a few others. The graphs describe re-
lations among the various optimality concepts for transient and bounded systems.
The nodes of the graphs are assertions and the arcs are implications. An arc (arrow) from one
assertion to another signifies that the first assertion implies the second. Similarly, a double arrow
between two assertions signifies that the two assertions are equivalent. A chain from one asser-
tion to another assures that the first assertion implies the second. Each strong component of a
graph contains a maximal collection of equivalent assertions. For example, the assertions above
the dashed line are equivalent.
In Figure 20, some arcs have text labels. When this is the case, an arc’s implication is valid
when the statement in the arc’s text label is valid. For example, “$ _ overtaking optimal” implies
“$ _ Cesàro overtaking optimal”. The converse is generally false as the example of Figure 13 illus-
trates. However, it can be shown that “$ _ Cesàro overtaking optimal” implies “$ _ overtaking op-
timal” provided that a stationary overtaking-optimal policy exists.

Since there exists a stationary strong present-value optimal policy, there is a stationary poli-
cy that is optimal for each of the concepts considered here except, as discussed above, overtaking
optimality. However, there is a stationary Cesàro-overtaking-optimal policy, and each such policy
is overtaking optimal provided only that a stationary overtaking-optimal policy exists.

[Figure 19 (assertion–implication graph, not reproduced here): it relates, for δ^∞ in a bounded system with immigration, strong present-value (∞-) optimality, S-optimality, (n+1)-optimality and n-optimality, each paired with limiting present-value optimality and Cesàro overtaking optimality for the binomial immigration stream of the corresponding order and with the associated optimality conditions on the comparison functions; the assertions above the dashed line are equivalent.]

Figure 19. Optimality of δ^∞ for a Bounded System with Immigration



[Figure 20 (assertion–implication graph, not reproduced here): for some special bounded systems it relates float, value, reward-rate, future-value, overtaking, Cesàro overtaking, Cesàro geometric-overtaking, instantaneous-rate-of-return, moment, maximum-value, maximum-N-period-value and Cesàro symmetric-multiplicative-utility-growth optimality of δ^∞, with the side conditions (transient, transient substochastic, circuitless or irreducible system, or the existence of a stationary overtaking-optimal policy) under which the implications hold.]

Figure 20. Optimality of δ^∞ for Some Special Bounded Systems


2
Team Decisions, Certainty Equivalents
and Stochastic Programming
1 FORMULATION AND EXAMPLES [Be55], [Da55], [Ra55], [Ve65], [MR72]
Team decision theory is concerned with a team, i.e., collection, of individuals that each

• control different decision variables,


• base their decisions on possibly different information, and
• share a common objective.

In the sequel, we shall formulate such problems as stochastic programs.


A team is composed of 8 members labeled "ß á ß 8. The 3>2 team member observes a random
vector ^3 assuming values in a set ™3 and then uses a real-valued decision \3 À ™3 Ä d to select
an action \3 Ð^3 Ñ. This formulation encompasses team members that have vector-valued deci-
sions since each such member can be considered to be a group of different team members each
with common information and real-valued decisions. Let ^ œ Ð^" ß á ß ^8" Ñ be the vector whose
elements are the 8 random vectors ^" ß á ß ^8 observed by the 8 team members and a random


vector ^8" that contains a (possibly empty) subset of those random vectors as well as relevant
unobserved information. The distribution of ^ is known to all members of the team. Denote by
™ the range of ^ , which we take to be finite. Call \ ´ Ð\" ß á ß \8 Ñ a team decision. Let — de-
note the set of all team decisions. Let <Ð † ß † Ñ À d 8 ‚ ™ Ä d be the given reward function and
1Ð † ß † Ñ À d 8 ‚ ™ Ä d 7 be the given constraint function. The problem is to choose \ to maxi-
mize

(1) VÐ\Ñ ´ E<Ð\Ð^Ñß ^Ñ

subject to

(2) 1Ð\Ð^Ñß ^Ñ   !

and

(3) \ − —.

One of the principal goals of team decision theory is to study the effect of different informa-
tion structures on the quality of optimal decisions as measured by the maximum expected re-
ward. Of course, the more information that each team member receives, the greater the maxi-
mum expected reward, but generally so also is the cost of collecting and distributing the infor-
mation. Thus one goal of team decision theory is to quantify the value of additional information
to facilitate comparison with the costs of providing and using such information. This enables the
team to choose the proper amount of information to provide each member. From this viewpoint,
team decision theory can be considered to be a branch of the field of management information
systems.

Example 1. Airline Reservations with Uncertain Demand. An airline wishes to give its
agents in n cities rules for deciding how many seats to book on a flight of capacity K > 0 so as
to maximize the expected net revenue from the flight where Z_{n+1} ≡ (D_1, …, D_n) ≥ 0 is the
random vector of demands experienced by agents 1, …, n. Let X_i be the number of tickets sold
by agent i. Then

 r(X, Z) = r Σ_{i=1}^{n} X_i − p (Σ_{i=1}^{n} X_i − K)^+

where r is the ticket price and p is the penalty for each oversold ticket. Also (2) becomes

 0 ≤ X_i ≤ D_i, 1 ≤ i ≤ n.
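
A sketch of evaluating one candidate team decision for this example by simulation; the decentralized booking rule, the quota vector and the demand distribution are hypothetical choices made only to illustrate how an information structure and decision rule plug into the expected-reward criterion (1).

    import numpy as np

    rng = np.random.default_rng(1)
    n, K, price, penalty = 3, 100, 1.0, 3.0

    def reward(X, D):
        """Net revenue: ticket revenue minus the oversale penalty; sales cannot exceed demand."""
        sold = np.minimum(X, D).sum()
        return price * sold - penalty * max(sold - K, 0)

    def decentralized_rule(D, quota):
        """Hypothetical decentralized decision: agent i sees only D_i and books up to quota[i]."""
        return np.minimum(D, quota)

    # Estimate the expected reward of one (arbitrary) quota vector by simulation.
    quota = np.array([40, 40, 40])
    demands = rng.poisson(lam=[30, 45, 60], size=(10000, n))
    est = np.mean([reward(decentralized_rule(D, quota), D) for D in demands])
    print(f"estimated expected net revenue: {est:.1f}")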

Information Structures. The information on ticket demands can be shared among the agents
in many ways. We mention four examples.

(a) Complete Communication. Let Z_i = Z_{n+1} for all i, i.e., every agent gets complete infor-
mation about tickets demanded at other agents.
(b) Decentralization. Let Z_i = D_i for all i, i.e., each agent receives information only on his
own ticket demand.
(c) Partial Communication. Let Z_i = (D_i, Σ_{j=1}^n D_j), i.e., each agent receives information on
his own demand and the total demand at all agents.
(d) Management by Exception. Let Z_i = (D_i, T) where

T = Σ_{j=1}^n D_j if Σ_{j=1}^n D_j > M, and T = 0 otherwise,

i.e., each agent receives information on his demand and, if the total demand at all agents ex-
ceeds a threshold M, the total demand at all agents. In this event, M can be chosen to be at
least K. For if an agent is informed that the total ticket demand at all agents does not exceed
the capacity of the plane, there is nothing to be gained by telling the agent the precise demands
at the other agents.
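To make the formulation concrete, here is a minimal sketch (in Python) of evaluating the expected net revenue R(X) = E r(X(Z), Z) of a given booking rule under the decentralized information structure (b). Every number below — ticket price, penalty, capacity and demand distributions — is a hypothetical assumption chosen only for illustration.

# Hypothetical data throughout; a sketch of evaluating R(X) = E r(X(Z), Z)
# for the airline example under the decentralized structure (b), Z_i = D_i.
from itertools import product

r, p, K = 100.0, 250.0, 3            # ticket price, oversale penalty, capacity (assumed)
demand_pmf = [ {1: 0.5, 2: 0.5},     # P(D_1 = d)  (assumed)
               {1: 0.3, 3: 0.7} ]    # P(D_2 = d)  (assumed)

def reward(x):
    """r(X, Z) = r*sum(X) - p*(sum(X) - K)^+ ."""
    sold = sum(x)
    return r * sold - p * max(sold - K, 0)

def expected_revenue(rules):
    """rules[i] maps agent i's observed demand d_i to a booking level X_i(d_i)."""
    total = 0.0
    for demands in product(*[pmf.keys() for pmf in demand_pmf]):
        prob = 1.0
        for i, d in enumerate(demands):
            prob *= demand_pmf[i][d]
        x = [min(rules[i](d), d) for i, d in enumerate(demands)]  # enforce 0 <= X_i <= D_i
        total += prob * reward(x)
    return total

# Example decentralized rule (hypothetical): each agent books everything it sees.
print(expected_revenue([lambda d: d, lambda d: d]))

Evaluating a rule by enumerating the range of Z in this way is exactly what makes the reduction to an ordinary mathematical program in Section 2 possible, and also what limits that reduction when the range of Z is large.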

Example 2. Inventory Control with Uncertain Demand. An inventory manager seeks to


minimize the expected cost of storage and shortage of a single product over n weeks. Let X_i and
D_i be respectively the cumulative orders and demands in weeks 1, …, i where i = 1, …, n. Let
Z_{n+1} = (D_1, …, D_n). Then the storage and shortage cost is

Σ_{i=1}^n [h(X_i − D_i)^+ + p(D_i − X_i)^+]

where h and p are respectively the unit costs of storage and shortage. The constraints are

X_1 ≤ X_2 ≤ ⋯ ≤ X_n.

Information Structures. Among the information structures that may be appropriate in this
problem are the following.
(a) Complete Information. Let ^3 œ ÐH" ß á ß H3" Ñ œ Ð^3" ß H3" Ñ, i.e., decisions are made
sequentially with the ordering decision in week 3 being based on all previously observed de-
mands.
(b) Delayed Reporting. Let ^3 œ ÐH" ß á ß H3"6 Ñ, i.e., ordering decisions in week 3 are
based on all previously reported demands where there is an 6-week delay in reporting.
(c) Forecasting. Let ^3 œ ÐH" ß á ß H3"6 Ñ, i.e., ordering decisions are based on all previously
observed demands as well as perfect forecasts of the demands in the next 6 weeks.

(d) Reporting Errors. Let ^3 œ ÐH" ß á ß H3" Ñ  Ð%" ß á ß %3" Ñ, i.e., ordering decisions in each
period are based on demands observed plus a random error in reporting.
(e) Current Cumulative Demand. Let ^3 œ H3" , i.e., ordering decisions in week 3 depend
only on cumulative observed demand to date. Although this assumption is a natural one and re-
duces the amount of information needed to make decisions, rules of this type are not generally
optimal—even when demands are independent in each week. For in the latter case, the state of
the system at the beginning of week 3 is the difference between cumulative orders and demands
through week 3  " and so depends on demands in all prior weeks through the cumulative or-
ders.

Example 3. Transportation Problem with Uncertain Demand. Consider the stochastic ver-
sion of the transportation problem in which there are s_i aircraft available at city i = 1, …, n one
evening, there is a random demand D_j at city j = 1, …, m for aircraft the following day, and
the goal is to minimize the expected costs of deadheading, i.e., sending empty aircraft from one
city to another, and lost revenue. There is a cost c_ij of sending an empty aircraft from city i to
city j during the night. The aircraft scheduler must decide the number x_ij of empty aircraft to
send from city i to city j before observing the demands at the various cities. Denote by y_j(D_j)
the difference between the number of aircraft needed at city j and the number sent there. The
expected revenue lost from each aircraft needed at city j that is not available there is r_j. Then
the cost of deadheading and lost revenues is

Σ_{i=1}^n Σ_{j=1}^m c_ij x_ij + Σ_{j=1}^m r_j y_j(D_j)^+.

The constraints are

Σ_{j=1}^m x_ij ≤ s_i,  i = 1, …, n,

Σ_{i=1}^n x_ij + y_j(D_j) = D_j,  j = 1, …, m,

x_ij ≥ 0,  all i, j.

Information Structure. In this case, the decision functions x_ij and y_j are based respectively
on no information, i.e., Z_ij = ∅, and the number of aircraft Z_j = D_j needed at city j.

Example 4. Capacity Planning with Uncertain Demand. A firm seeks a minimum-expect-


ed-cost policy for leasing and renting autos to meet independently distributed employee
demands D_1, …, D_n therefor in weeks 1, …, n. There is a unit cost l_ij of leasing an auto at the

beginning of week i for use during weeks i, …, j. When the firm does not have a leased auto
available for an employee who needs one in a week, the firm rents one at a cost r > 0. Decisions
on leases that begin in week i are made after observing demands in prior weeks, but before ob-
serving the demand in week i. But rental decisions for week i are made after also observing the
demand in that week. Let x_ij be the number of autos leased at the beginning of week i for use
during weeks i, …, j, 1 ≤ i ≤ j ≤ n. Let y_i(D_i) be the difference between the numbers of autos
demanded and leased in week i, 1 ≤ i ≤ n. The total cost of leasing and renting is

Σ_{i≤j} l_ij x_ij + Σ_j r y_j(D_j)^+.

The constraints are

Σ_{i≤j≤k} x_ik + y_j(D_j) = D_j,  j = 1, …, n,

and

x_ij ≥ 0,  all i, j.

Information Structure. In this case, the decision functions x_ij and y_i are based respectively
on no information, i.e., Z_ij = ∅, and the auto demand Z_i = D_i in week i. In order to see why it
is not necessary to allow leasing decisions to depend on any of the demands, notice that the
state of the system at the beginning of a week is the vector of numbers of autos leased in prior
weeks for use in the week and thereafter. Thus since the demands are independent, the state in
a week is a function only of leasing decisions in prior weeks and not the demands in prior weeks
themselves. Hence the transition law for the system is deterministic and only the rental decision
in a week depends on the demand in the week. Incidentally, this conclusion would not be valid if
the demands were dependent, e.g., if they formed a Markov chain. In that event, leasing deci-
sions in a week should depend on the demands in all prior weeks—even when the demands are
Markovian.

2 REDUCTION OF STOCHASTIC TO ORDINARY MATHEMATICAL PROGRAMS


Since 𝒵 is finite, the stochastic program (1)-(3) reduces to that of choosing X(z) ≡
(X_1(z_1), …, X_n(z_n)) (where z = (z_1, …, z_{n+1}) ∈ 𝒵 and p(z) = P(Z = z) > 0) that maximizes

(1)′ Σ_{z∈𝒵} r(X(z), z) p(z)

subject to

(2)′ g(X(z), z) ≥ 0,  z ∈ 𝒵,

which is an ordinary mathematical program. The variables of this mathematical program are
the values of the decision for each team member corresponding to each value of the random vec-
tor the member observes. The constraints (2)′ replicate the original constraints (2) once for each
value of the random vector Z. In applications the number of redundant replications in (2)′ of an
inequality in (2) can be significantly reduced by replicating the inequality only for values of the
subvector of Z that appears in the inequality rather than also so doing for the complementary
subvector of Z.

Linear and Quadratic Programs


It is of interest to examine when the above mathematical program is a linear or concave
quadratic program. If r(·, z) and g(·, z) are affine functions for each z ∈ 𝒵, then (1)′-(2)′ is a
linear program. If instead r(·, z) is quadratic and the associated form is negative semidefinite
for all z ∈ 𝒵 and negative definite for some z ∈ 𝒵, and if g(·, z) is affine for each z ∈ 𝒵, then
(1)′-(2)′ is a strictly-concave quadratic program. Thus, if it is feasible, it has a unique solution.
In particular if g ≡ ∅, i.e., there are no inequality constraints (2)′, then the optimal team de-
cision X* is the unique solution to a system of linear equations obtained by setting the partial
derivative of (1)′ with respect to each decision variable X_i(z_i) equal to zero.
In order to make the above ideas more concrete, we formulate Examples 3 and 4 above as
linear programs. Consider first the Transportation Problem with Uncertain Demand in which
there are N values d_j1, …, d_jN of the random demand D_j and p_jk ≡ P(D_j = d_jk) for k = 1, …, N
and each j. Set y_jk = y_j(d_jk) and write y_jk = y_jk^+ − y_jk^− as a difference of two nonnegative var-
iables. Then the problem becomes the ordinary linear program of choosing x_ij and y_jk^± that mini-
mize

Σ_{i=1}^n Σ_{j=1}^m c_ij x_ij + Σ_{j=1}^m Σ_{k=1}^N r_j p_jk y_jk^+

subject to

Σ_{j=1}^m x_ij ≤ s_i,  i = 1, …, n,

Σ_{i=1}^n x_ij + y_jk^+ − y_jk^− = d_jk,  j = 1, …, m and k = 1, …, N,

x_ij, y_jk^+, y_jk^− ≥ 0,  all i, j, k.
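As a sketch of how this linear program might be assembled numerically, the following Python fragment builds and solves a tiny instance with scipy.optimize.linprog. All data — supplies, costs, lost revenues and demand distributions — are hypothetical.

# A minimal sketch, with hypothetical data, of the stochastic transportation LP above.
# Variables, in order: x[i,j], yplus[j,k], yminus[j,k].
import numpy as np
from scipy.optimize import linprog

n, m = 2, 2                               # supply cities, demand cities (assumed)
s = [3, 2]                                # aircraft available at each supply city
c = np.array([[1.0, 4.0], [3.0, 2.0]])    # deadheading cost c_ij
rev = [10.0, 12.0]                        # lost revenue r_j per unfilled demand
d = np.array([[1, 3], [2, 4]])            # d[j,k]: k-th possible demand at city j
p = np.array([[0.6, 0.4], [0.5, 0.5]])    # p[j,k] = P(D_j = d[j,k])
N = d.shape[1]

nx, ny = n * m, m * N
cost = np.concatenate([c.ravel(), (np.array(rev)[:, None] * p).ravel(), np.zeros(ny)])

# Supply constraints: sum_j x[i,j] <= s_i
A_ub = np.zeros((n, nx + 2 * ny))
for i in range(n):
    A_ub[i, i * m:(i + 1) * m] = 1.0

# Demand constraints: sum_i x[i,j] + yplus[j,k] - yminus[j,k] = d[j,k]
A_eq = np.zeros((m * N, nx + 2 * ny))
b_eq = np.zeros(m * N)
for j in range(m):
    for k in range(N):
        row = j * N + k
        for i in range(n):
            A_eq[row, i * m + j] = 1.0
        A_eq[row, nx + row] = 1.0          # yplus[j,k]
        A_eq[row, nx + ny + row] = -1.0    # yminus[j,k]
        b_eq[row] = d[j, k]

res = linprog(cost, A_ub=A_ub, b_ub=s, A_eq=A_eq, b_eq=b_eq)
print(res.fun, res.x[:nx].reshape(n, m))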

Consider now the problem of Capacity-Planning-with-Uncertain-Demand in which there are


at most N values d_j1, …, d_jN of the demand D_j and p_jk ≡ P(D_j = d_jk) for k = 1, …, N and
each j. Set y_jk = y_j(d_jk) and write y_jk = y_jk^+ − y_jk^− as a difference of two nonnegative variables.
Then the problem becomes the ordinary linear program of choosing x_ij and y_jk^± that minimize

Σ_{i≤j} l_ij x_ij + Σ_{j,k} r p_jk y_jk^+

subject to

Σ_{i≤j≤l} x_il + y_jk^+ − y_jk^− = d_jk,  j = 1, …, n and k = 1, …, N,

and

x_ij, y_jk^+, y_jk^− ≥ 0,  all i, j, k.

Computations
It is useful to examine the size of the mathematical program that results from the above
approach to stochastic programming. To that end, let N_j ≡ |𝒵_j| and let N ≡ max_j N_j be the
maximum number of values of any Z_j. Also, let S_i be the set of indices j of nonconstant random
variables Z_j that appear in the i-th constraint in (2). Let M ≡ max_i |S_i| be the maximum number
of nonconstant Z_j that appear in an inequality. Then the mathematical program may have up
to Σ_{i=1}^n N_i ≤ nN variables and up to Σ_{i=1}^m Π_{j∈S_i} N_j ≤ mN^M inequality constraints. This
problem is generally tractable only if N is not too large and if either M is very small or m = 0.
These two conditions are often fulfilled in a problem that either has a very small number of
time periods, e.g., say three or less, or has a deterministic transition law. The Transportation
Problem with Uncertain Demand (Example 3) is an example of the former type and involves
m(n + 2N) variables and n + mN linear equations or inequalities beyond nonnegativity. The
problem of Capacity Planning with Uncertain Demand (Example 4) is an example of the latter
type and involves up to n²/2 + 2nN variables and nN linear equations as well as nonnegativity
of the variables. In both cases N is generally not too large and M = 1. As a third example, sup-
pose m = 0 and the objective is quadratic and strictly concave. Then one simply sets the partial
derivatives with respect to each of the nN variables equal to zero and solves the resulting sys-
tem of nN linear equations in a like number of unknowns. This last problem is tractable if N is
not too large since m = 0.
The stochastic programming approach is not tractable for many multiperiod stochastic se-
quential decision problems because N is large. To illustrate, consider the problem of Inventory
Control with Uncertain Demand (Example 2) and complete information. If each D_i has K ≥ 2
values, then N = |𝒵_n| = K^{n−1} grows exponentially in the number of periods n. Thus the number
of variables, and hence inequalities, in the mathematical programming formulation of this sto-
chastic program is usually impossibly large. This is so even if the demands in successive periods
are independent.

Comparison of Dynamic and Stochastic Programming


The dynamic- and stochastic-programming approaches to solving sequential decision prob-
lems under uncertainty are generally useful for computational purposes in different circumstances.

The dynamic-programming approach is useful if the number of state-action pairs is not too
large. The stochastic-programming approach is useful if the number of possible values of the in-
formation that one observes is not too large. To illustrate, the dynamic-programming approach
is useful in Example 2 where the demands in successive periods are independent and there is
complete information, whereas the stochastic-programming approach to this problem is generally
intractable. On the other hand, the stochastic-programming approach is useful in Example 3 where
the demands for capacity in successive periods are independent, whereas the dynamic-program-
ming approach to this problem is generally intractable.

3 QUADRATIC UNCONSTRAINED TEAM DECISION PROBLEMS [Ra61, 62], [MR72]


As discussed above, a team decision problem that is especially easy to solve is one that is
strictly concave and quadratic, and unconstrained. In particular, assume that 1 œ g and

"
(1) <Ð\ß ^Ñ œ \ T $ Ð^Ñ  \ T U\
#
where $ œ $ Ð^Ñ œ Ð$3 Ñ, U œ Ð;34 Ñ, ;3 œ Ð;3" ß á ß ;38 Ñ and E^3 $3 ´ EÐ$3 ± ^3 Ñ.

Theorem 1. Quadratic Unconstrained Team Decision Problems. In a quadratic uncon-


strained team decision problem with symmetric positive-definite Q, the team decision X* is opti-
mal if and only if

(2) E^{Z_i}(δ_i − q_i X*) = 0,  i = 1, …, n.

Also there is a unique team decision X* satisfying (2).

Proof. The function r(·, z) is strictly concave for each fixed z ∈ 𝒵. Therefore it follows that
Σ_{z∈𝒵} r(X(z), z) p(z) is strictly concave in {X_i(z_i) : z_i ∈ 𝒵_i for all i} and quadratic. Thus the X_i*(z_i)
are the unique solution to

∂R(X*)/∂X_i(z_i) = 0,  z_i ∈ 𝒵_i,  i = 1, …, n.

Now use (1) to rewrite this system as

0 = ∂R(X*)/∂X_i(z_i) = ∂/∂X_i(z_i) E[E^{Z_i} r(X*(Z), Z)]
  = ∂/∂X_i(z_i) [E^{Z_i = z_i} r(X*(Z), Z)] P(Z_i = z_i)
  = E^{Z_i = z_i}[δ_i(Z) − q_i X*(Z)] P(Z_i = z_i),

so because P(Z_i = z_i) > 0, (2) holds as claimed. ∎



Theorem 1 requires that a system of up to nN linear equations in a like number of variables
be solved and so is not tractable if N is large. In the special case in which δ_i is affine in Z_i for each i
and Z has a joint normal distribution, a homework problem shows that the X_i are affine in the
Z_i. The coefficients can be found by solving two systems of linear equations, one n × n and the
other m × m where m is the number of random variables among Z_1, …, Z_n.
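A minimal sketch of solving the optimality conditions (2) directly: index the unknowns X_i(z_i), write E^{Z_i}(δ_i − q_i X*) = 0 once for each pair (i, z_i), and solve the resulting linear system. The distribution, the coefficients δ(·) and the matrix Q below are hypothetical.

# A minimal sketch of Theorem 1 with hypothetical data: one linear equation per (i, z_i).
import numpy as np
from itertools import product

Q = np.array([[2.0, 0.5], [0.5, 1.0]])            # symmetric positive definite (assumed)
zvals = [[0, 1], [0, 1, 2]]                        # Z_1 takes 2 values, Z_2 takes 3 (assumed)
def pmf(z):                                        # joint distribution P(Z = z), assumed
    return 0.5 * (1.0 / 3)
def delta(z):                                      # delta(z) = (delta_1(z), delta_2(z)), assumed
    return np.array([1.0 + z[0] + 0.2 * z[1], 2.0 - 0.5 * z[1]])

# Index the unknowns X_i(z_i) and assemble the normal equations of (2).
idx = {(i, z): k for k, (i, z) in enumerate((i, z) for i in range(2) for z in zvals[i])}
A = np.zeros((len(idx), len(idx)))
b = np.zeros(len(idx))
for z in product(*zvals):
    p = pmf(z)
    for i in range(2):
        row = idx[(i, z[i])]
        b[row] += p * delta(z)[i]
        for j in range(2):
            A[row, idx[(j, z[j])]] += p * Q[i, j]

x = np.linalg.solve(A, b)                          # the optimal decision rules, tabulated
for (i, z), k in idx.items():
    print(f"X_{i+1}({z}) = {x[k]:.4f}")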

4 SEQUENTIAL QUADRATIC UNCONSTRAINED TEAMS: CERTAINTY EQUIVALENTS


[HMSM60], [MR72]
On the other hand, even without the above assumptions that δ_i is affine in Z_i for each i and
Z has a joint normal distribution, but with the added assumption that the problem is “sequen-
tial”, we now show how to reduce the number of linear equations that need to be solved in such
problems to about two systems of n linear equations in a like number of variables—a dramatic
reduction indeed.
In order to motivate the idea, consider the following example.

Example 5. Single-Person Quadratic Unconstrained Team. Suppose n = 1, Z_1 is the empty
vector, Z_2 is real valued, X does not depend on Z, and r(X, Z) = −(Z_2 − X)². Then

min_X E(Z_2 − X)² = E[(Z_2 − EZ_2) + (EZ_2 − X*)]² = E(Z_2 − EZ_2)² + (EZ_2 − X*)² = Var Z_2

where X* = EZ_2. Since X* depends only on the expected value of Z_2, we would have found the
same solution if we had initially replaced Z_2 by EZ_2 and solved the “equivalent” deterministic
problem min_X (EZ_2 − X)², which equals 0, so again X* = EZ_2. Thus EZ_2 is a “certainty equiva-
lent” for Z_2 with respect to the problem of finding X*.

If U and V are finite-valued random vectors, we say that U determines V if V = f(U) for
some function f. In that event, for any finite-valued random variable W, E^V E^U W = E^V E^{U,V} W
= E^V W. Also, if U determines V and V determines W, then U determines W, i.e., the deter-
mines relation is transitive.
Call the team decision problem sequential if Z_{i+1} determines Z_i for i = 1, …, n−1. The next
result is important because it allows one to solve sequential quadratic unconstrained team de-
cision problems by replacing all random variables by their conditional expectations given the
then current state of information and solving the equivalent deterministic problem.

Theorem 2. Certainty Equivalents for Sequential Quadratic Unconstrained Team Decision


Problems. In a sequential quadratic unconstrained team decision problem with symmetric posi-
tive-definite Q, there is a unique optimal team decision X* that satisfies

(3) E^{Z_i}(δ_j − q_j X*) = 0,  1 ≤ i ≤ j ≤ n.

Moreover, on setting H = {1, …, i−1} and F = {i, …, n} for given i,

(4) E^{Z_i} X_F* = Q_{FF}^{−1} [E^{Z_i} δ_F − Q_{FH} X_H*].

Proof. Recall from Theorem 1 that

E^{Z_j}(δ_j − q_j X*) = 0,

so because Z_j determines Z_i for i ≤ j,

0 = E^{Z_i} E^{Z_j}(δ_j − q_j X*) = E^{Z_i}(δ_j − q_j X*).

For fixed i, these equations can be rewritten as

E^{Z_i}(δ_F − Q_{FH} X_H* − Q_{FF} X_F*) = 0.

Now premultiply this equation by Q_{FF}^{−1} and use the fact that E^{Z_i} X_H* = X_H*. ∎

Interpretation of Solution

Remark 1. Solution as a Sequence of Deterministic Problems. The formula (4) permits
X_i*(Z_i) = E^{Z_i} X_i* to be computed once the variables X_1*(Z_1), …, X_{i−1}*(Z_{i−1}) have been found.
Moreover, since X_i*(Z_i) depends on δ_F only through its conditional expectation E^{Z_i} δ_F given the
then current information Z_i, it follows that X_i*(Z_i) has the same value that it would have for
the corresponding deterministic problem in which the random vector δ_F is replaced by its con-
ditional expectation E^{Z_i} δ_F!

Remark 2. On-Line Implementation or Wait-and-See Solution. It is not necessary to tabu-
late X_i*(·) in order to choose X_i*(Z_i) optimally. All that is required is to wait until Z_i is observed
—in which case Z_1, …, Z_{i−1} are determined—and then choose X_i*(Z_i) as in Remark 1 above.

Remark 3. Linearity of Optimal Decision Function. Observe from (4) that the optimal de-
cision function X_i*(Z_i) for person i is linear in the history X_H* of decisions of prior persons and
the conditional expected value E^{Z_i} δ_F of the future random vector δ_F (yet to be observed except
for δ_i(Z_i)) given the state of information Z_i of person i at the time he chooses his decision.

Computations
If we use Remark 2 above, then except for the work in computing the needed conditional ex-
pected values E^{Z_i} δ_F, one for each i, the work in solving (4) is independent of N_j for all j. If one
first finds the Cholesky factorization of the matrix Q, i.e., the upper triangular matrix R for which
Q = R^T R, then n³/2 operations suffice to find all optimal decisions.

Quadratic Control Problem [Be87], [Ku71]


It is often the case in practice that sequential quadratic team decision problems discussed in
Theorem 2 have additional structure that leads to computational simplifications. To illustrate
these ideas, consider the following quadratic control problem. Choose controls u_1, …, u_T and,
given an initial state vector x_1, state vectors x_2, …, x_T that minimize the expected value of the
quadratic form

(5) Σ_{t=1}^{T−1} (x_t^T Q_t x_t + u_t^T R_t u_t) + x_T^T Q_T x_T

subject to the linear dynamical equations

(6) x_{t+1} = A_t x_t + B_t u_t + α_t,  t = 1, …, T−1,

where the x_t and α_t are column vectors of common dimension, the u_t are column vectors of
common dimension, the matrices Q_t are symmetric and positive semidefinite, the matrices R_t are
symmetric and positive definite, the matrices A_t and B_t have appropriate dimension, the α_t are
random errors whose distributions are independent of the controls applied, and (α_1, …, α_{t−1}) is the
information available to choose the control u_t for each t. It is natural in most applications to
think of t as one of T periods, e.g., days, weeks, months, etc., and we shall use this terminology
in what follows. Problems of this type arise in a variety of settings of which we mention two.

Example 6. Rocket Control. Consider the problem of moving a rocket efficiently from its
initial position x_1 to a terminal position that is close to a target position at time T. Let x_t be
the position of the rocket at time t and u_t be the “control” exerted at that time. The position of
a rocket might be a vector including at least its six location and velocity coordinates. The de-
sired target location and velocity vector at time T is the null vector (the distance and velocity
should be zero at time T). Assume that the motion of the rocket satisfies the linear dynamical
equations (6). The objective function might have Q_t = 0 for all t < T. Then the objective func-
tion would entail minimizing the sum of two terms, the first representing the sum of the costs of
the controls at the first T − 1 times and the second x_T^T Q_T x_T the cost of missing the target
position at time T. Each cost reflects the fact that small quantities have small costs and large
quantities have large costs.

Example 7. Multiproduct Supply Management. Consider the problem of managing inventor-


ies of several products so as to minimize the sum of the expected resource and storage/shortage
costs in the presence of uncertain demands for the products. In this setting, x_t would be the pos-
sibly negative vector of inventories of the products in period t, u_t the possibly negative vector of
several resources (perhaps labor, materials, capital, etc.) applied to production in period t, and
α_t the vector of demands for the products in period t. Negative inventories represent backorders
and negative resource consumption represents returns of resources. The matrix A_t in period t
would be the identity matrix if stocks of products are merely held in storage without transform-
ation. That matrix might differ from the identity matrix if stocks move from one state to an-
other over time, e.g., reflecting age, deterioration, location, etc. The ij-th element of the matrix
B_t would be the rate of production of product i in period t per unit of resource j applied in that
period. The costs reflect the desirability of low inventories and low resource costs.

Reduction to Unconstrained Problem. It might seem that this problem is more general
than the unconstrained sequential decision problem considered to date because of the linear
equality constraints (6). But that is not the case because one can use (6) to eliminate the x_t,
leaving (5) depending only on the u_t. Then the Certainty-Equivalence Theorem 2 applies at
once. As a consequence, in order to determine the optimal choice of u_1, it suffices to replace the
random vectors α_t by their conditional expectations given the state of information at the time
the vector u_1 is chosen. For this reason we can and do assume without loss of generality that
the α_t are constant vectors.
In fact it is possible to eliminate the constant vectors entirely. This may be accomplished by
appending an additional state variable y_t in each period t and appending to (6) the equations

(7) y_{t+1} = y_t,  t = 1, …, T−1,

where y_1 ≡ 1. This assures that y_t = 1 for all t, so we can replace α_t in (6) by the product α_t y_t.
The augmented system with one additional state variable and the same control variables then
has the desired form with zero random errors. If we denote the various quantities in the
augmented system with overbars, those quantities have the following relations to the
corresponding quantities of the original system: ᾱ_t = 0, ū_t = u_t, R̄_t = R_t,

(8) Q̄_t = ( Q_t 0 ; 0 0 ),  Ā_t = ( A_t α_t ; 0 1 ),  B̄_t = ( B_t ; 0 )  and  x̄_t = ( x_t ; y_t ),

where in each case the second matrix in a column (resp., row) partition has only one column
(resp., row).

Dynamic-Programming Solution with Zero Random Errors. It is natural to solve the zero-
random-error problem by dynamic programming. This approach is tractable because even
though the state vector x_t in period t may have many components, the minimum cost is a pos-
itive semidefinite quadratic form that can be calculated explicitly. To see this, let R_t(x) be the
minimum cost in periods t, …, T given x is the state in period t, and let u_t(x) be the corre-
sponding optimal control in that state and period. Evidently

(9) R_t(x) = min_u [x^T Q_t x + u^T R_t u + R_{t+1}(A_t x + B_t u)],  t = 1, …, T−1,

and R_T(x) = x^T Q_T x for all x.

Theorem 3. Optimal Quadratic Control. If the quadratic control problem (5)-(6) has zero
random errors, each Q_t is symmetric and positive semidefinite, and each R_t is symmetric and
positive definite, then the minimum cost R_t and optimal control u_t are

(10) R_t(x) = x^T K_{t−1} x,  t = 1, …, T,

(11) u_t(x) = −S_t^{−1} U_t A_t x,  t = 1, …, T−1,

with symmetric positive semidefinite matrices K_t given from the Riccati recursion

(12) K_{t−1} = Q_t + A_t^T W_t A_t,  t = 1, …, T−1,

and K_{T−1} = Q_T where W_t ≡ K_t − U_t^T S_t^{−1} U_t, S_t ≡ R_t + B_t^T K_t B_t and U_t ≡ B_t^T K_t for t = 1, …,
T−1.

Proof. The proof of (10)-(11) is by induction on t. The claim is trivially true for t = T. Sup-
pose it holds for 1 < t+1 ≤ T and consider t. Then S_t is symmetric and positive definite since
that is so of R_t and since K_t is symmetric and positive semidefinite from the induction hypoth-
esis. Thus S_t is nonsingular, whence by (10) for t+1 and completing the square, the expression
in brackets on the right-hand side of (9) is

x^T Q_t x + u^T R_t u + R_{t+1}(A_t x + B_t u) = x^T (Q_t + A_t^T K_t A_t) x + (u^T S_t u + 2 u^T U_t A_t x)

  = x^T (Q_t + A_t^T W_t A_t) x + (u + S_t^{−1} U_t A_t x)^T S_t (u + S_t^{−1} U_t A_t x).

Now since S_t is symmetric and positive definite, u = −S_t^{−1} U_t A_t x is the unique minimizer of the
above function, so (10) and (11) follow from (9). Furthermore, it is immediate from (12) and the
induction hypothesis that K_{t−1} is symmetric. Finally, K_{t−1} is positive semidefinite because R_t is
nonnegative. The last follows from the fact that the expression in brackets on the right-hand
side of (9) is nonnegative for all x because Q_t, R_t and K_t are symmetric and positive semi-
definite. ∎

Remark 1. Linearity of Optimal Control. Observe from (11) that the optimal control u_t in
period t is linear in the state vector. Thus when there is a random error, u_t is linear in the state
vector in period t and the conditional expected value of the random error in that period given
the random errors in prior periods.

Remark 2. Linearity of Computational Effort in T. Observe that the matrices K_t can be
calculated recursively from (12). Thus the computational effort to determine those matrices and
the optimal controls is linear in T.
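A minimal sketch of the recursion (12) and the control law (11) in Python; the system matrices and horizon below are hypothetical, and time-invariant data are used only to keep the example short.

# A minimal sketch of the Riccati recursion (10)-(12) for the zero-random-error problem.
import numpy as np

T = 5
A = np.array([[1.0, 1.0], [0.0, 1.0]])    # A_t (assumed, constant in t)
B = np.array([[0.0], [1.0]])              # B_t
Qc = np.eye(2)                            # Q_t, symmetric positive semidefinite
Rc = np.array([[0.1]])                    # R_t, symmetric positive definite

# K[t] holds K_t for t = 0,...,T-1, with K_{T-1} = Q_T and (12) run backward.
K = [None] * T
K[T - 1] = Qc.copy()                      # K_{T-1} = Q_T
gain = [None] * T                         # gain[t] = S_t^{-1} U_t A_t, so u_t(x) = -gain[t] @ x
for t in range(T - 1, 0, -1):
    U = B.T @ K[t]                        # U_t = B_t^T K_t
    S = Rc + B.T @ K[t] @ B               # S_t = R_t + B_t^T K_t B_t
    W = K[t] - U.T @ np.linalg.solve(S, U)    # W_t = K_t - U_t^T S_t^{-1} U_t
    K[t - 1] = Qc + A.T @ W @ A           # K_{t-1} = Q_t + A_t^T W_t A_t
    gain[t] = np.linalg.solve(S, U @ A)

# Simulate the closed loop from a hypothetical initial state x_1.
x = np.array([3.0, 0.0])
for t in range(1, T):
    u = -gain[t] @ x                      # (11): u_t(x) = -S_t^{-1} U_t A_t x
    x = A @ x + B @ u                     # (6) with zero random error
print("minimum cost from x_1:", float(np.array([3.0, 0.0]) @ K[0] @ np.array([3.0, 0.0])))

The backward pass touches each period once, which is the point of Remark 2: the effort is linear in T even though the state may have many components.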

Remark 3. Quadratic Plus Linear Costs. It is also possible to apply the above results to the
case in which a linear function of the state and control vectors is added to the objective function
(5). To do this, complete the square, thereby expressing the generalization of (5), apart from a
constant, as

"ÒÐB>  0> ÑT U> ÐB>  0> Ñ  Ð?>  8> ÑT V> Ð?>  8> ÑÓ  ÐBX  0X ÑTUX ÐBX  0X Ñ
X "
w
Ð&Ñ
>œ"

for some translation vectors 0> and 8> for each >. Then substitute  B > ´ B>  0> , 
? > ´ ?>  8> and

! > ´ E> 0>  F> 8>  !>  0>" in both (5)w and (6) for each >. The resulting problem assumes the
form (5)-(6) with  B >, 
? > and 
! > replacing B> , ?> and !> respectively.

Solution with Independent Random Errors. If the random errors in different periods are in-
dependent, then the conditional expectations of the random errors in period > or later given the
random errors in periods prior to > are independent of the latter. Consequently the optimal con-
trol functions for the corresponding augmented zero-random-error system are optimal for the or-
iginal system with independent random errors.

Solution with Dependent Random Errors. If, however, the random errors in different per-
iods are dependent, then the optimal control function in period >  " determined from the cor-
responding augmented zero-random-error system is not generally optimal for the original system

(with random errors) in period >. The reason is that the augmented matrix E > depends on the
conditional expected error !> , whose value depends on the values of the conditioning variables.

Consequently, the matrix O > depends on !>X ´ Ð!> ß á ß !X " Ñ. To examine this dependence,

partition O > like (8) in the form

O> œ  T .
 O> 5>
5> ,>

Then a straightforward, but tedious, computation using (12) for the corresponding augmented
system shows by induction on > that the matrices O> are independent of !>X , the vectors 5>
satisfy the recursion (5X " ´ !)

T T T
Ð13Ñ 5>" œ E> Ò[> !>  ÐM  Y> W>" F> Ñ5> Ó, > œ "ß á ß X 1,

and the optimal control ?> ÐBß !>X Ñ takes the form

T
Ð14Ñ ?> ÐBß !>X Ñ œ W>" ÒY> ÐE> B  !> Ñ  F> 5> Ó, > œ "ß á ß X 1,

Indeed, the matrices O> coincide with the corresponding matrices satisfying the Riccati recursion
for the unaugmented zero-random-error problem. Also, the matrices defining the coefficients of
B, !> and 5> in (13) and (6) are all independent of !>X . Consequently, to compute the optimal
controls ?> from (6) when !>X is altered because of new information, simply recompute the 5> us-
ing the recursion (13). Incidentally, it also follows by iterating (13) that 5> œ 5> Ð!>"ßX Ñ is linear
in !>"ßX . Thus, in view of (6), ?> ÐBß !>X Ñ is linear in B and !>X , confirming Remark 3 following
Theorem 2 in the present instance.

Strengths and Weaknesses


The Certainty-Equivalence Theorem 2 permits sequential decision problems under uncertainty
with thousands of variables and constraints to be solved nearly as easily as corresponding determin-
istic problems provided that the objective function is convex and quadratic, and the constraints
consist only of linear equations—not inequalities. For example, a multiproduct multiperiod
inventory problem with thousands of products that are tied together by joint convex quadratic
production and storage/shortage costs, uncertain demand for each product, and no inequality
constraints can be solved easily by the Certainty-Equivalence Theorem and its implementation as
a quadratic control problem discussed above.
Because of the Certainty-Equivalence Theorem, it is tempting to use its conclusion, viz., re-
placing all random variables by their expectations and solving the resulting deterministic prob-
lem, in circumstances where it cannot be shown to lead to optimal solutions. The reason for so
doing is the hope that the error from so doing will not be great. Indeed, that general approach is
used implicitly in practice for many large sequential decision problems under uncertainty that
could not otherwise be solved.
Nevertheless, it must be recognized that the assumption that the costs are convex and quad-
ratic may not be a reasonable approximation in practice. And the absence of inequality constraints
does not, for example, permit one to require production in each period to be nonnegative, as is us-
ually the case in practice. Thus, if the optimal production schedule is not nonnegative, the optimal
solution may not be meaningful. On the other hand, in practice one can usually modify the
optimal solution in reasonable ways to overcome these limitations—though not without losing
optimality.
Though the certainty-equivalence approach has limitations, it is one of very few approaches
that is useful for finding optimal policies for large sequential decision problems under uncertainty.
For this reason, it is worthy of serious consideration in many circumstances.



3
Continuous-Time-Parameter
Markov Population Decision Processes
1 FINITE CHAINS: FORMULATION
Consider a finite population of individuals that is observed continuously and indefinitely be-
ginning at time zero. At each time, each individual will be in some state in a finite set 𝒮 of S < ∞
states. Each time t ≥ 0 an individual is observed in state s, an action a is chosen from a finite set
A_s of possible actions and a reward rate r(s, a) is earned per unit time. Also the individual gener-
ates a finite expected number q(u|s, a) of additional individuals in state u per unit time. Assume
that q(u|s, a) ≥ 0 for u ≠ s, since otherwise positive populations may produce negative ones. Call
q(u|s, a) the transition rate from state s to state u when using action a.
In the important case where the population consists of exactly (resp., at most) one individual
at each point in time, so Σ_{u∈𝒮} q(u|s, a) = 0 (resp., ≤ 0) for all a ∈ A_s and s ∈ 𝒮, call the system
stochastic (resp., substochastic). In either case q(s|s, a) ≤ 0 for all a ∈ A_s and s ∈ 𝒮. The condi-
tion Σ_{u∈𝒮} q(u|s, a) = 0 has the interpretation that the expected rate of addition of individuals to
the system generated by each individual is zero, so the system population size remains unchanged.
Similarly, in the substochastic case the system population size remains unchanged or falls.


2 MAXIMUM T-PERIOD VALUE: BELLMAN'S EQUATION AND EXAMPLES [Be57]


A fundamental problem in the above setting is to find a policy that maximizes the T-period
expected reward with terminal value vector v = (v_s) in the various states. We begin with an in-
formal analysis and make it rigorous later. Let V_s^t be the maximum expected reward that an in-
dividual (and his progeny) in state s at time t earns during the interval [t, T]. Then it is plausi-
ble that for small increments of time h > 0,

V_s^{t−h} = max_{a∈A_s} [r(s, a)h + Σ_{u∈𝒮} q(u|s, a)h V_u^t + V_s^t + o(h)]

where o(h) is a function for which lim_{h↓0} h^{−1} o(h) = 0. Subtracting V_s^t from both sides, dividing
by h, and letting h ↓ 0 yields Bellman’s equation

(1) −V̇_s^t = max_{a∈A_s} [r(s, a) + Σ_{u∈𝒮} q(u|s, a) V_u^t]

for t ≥ 0 and s ∈ 𝒮 together with the terminal condition V_s^T ≡ v_s for s ∈ 𝒮. (Of course V̇_s^t ≡ (d/dt)V_s^t.)
.>
The expression in brackets on the right-hand side of (1) is the sum of two terms. The first is
the reward rate <Ð=ß +Ñ that an individual in state = earns while taking action + in that state at
time >. The second is the extra reward rate during the interval Ò>ß X Ó that results from changes in
the population size in each state at time > when an individual in state = takes action + in that
state at time > and all individuals use an optimal policy thereafter. To see why this is so, observe
that ;Ð?l=ß +ÑZ?> is the product of the expected rate ;Ð?l=ß +Ñ of addition of individuals to state ?
when an individual in state = takes action + in that state at time > and the maximum expected
reward Z?> that each such individual in state ? and his progeny earn in the interval Ò>ß X Ó. Thus,
the second term !?−f ;Ð?l=ß +ÑZ?> is the total extra expected reward rate during the interval Ò>ß X Ó
that results from addition of individuals to the several states when an individual in state = takes
action + in that state at time >. Finally, the right-hand side of (1) is the maximum expected re-
ward rate that an individual and his progeny earn in state = at time >, and equals the left-hand
side of (1), viz., the negative of the time derivative of the maximum value in state = at time >.
Observe that (1) is a system of W ordinary nonlinear differential equations. The sequel devel-
ops this equation rigorously, shows that it has a unique solution, and establishes the existence of a
maximum X -period-value piecewise-constant (in time) policy.
Sometimes in the sequel, the symbol Z=> will instead mean the maximum >-period value. (The
reader will be warned when this is so.) In that event, the minus sign on the left-hand side of (1)
disappears and the terminal condition becomes instead Z=! ´ @= for = − f .

Example 1. Controlled Queues. Consider controlling a single-channel queue with independ-
ent Poisson arrivals and exponential service times. The state of the system is the number of cus-
tomers in the queue (including the one being served), and this number cannot exceed the queue
capacity S. Let λ_s ≥ 0 be the arrival rate of customers into the system when s customers are in
the queue. Of course, λ_S ≡ 0. The actions are the indices 1, …, K_s of the K_s available service
rates μ_{s1}, …, μ_{sK_s} ≥ 0, say, when s customers are in the system. Of course K_0 = 1 and μ_{01} = 0.
Then for 0 ≤ s, u ≤ S, the transition rates are

q(u|s, a) = μ_{sa} if u = s−1,  −(μ_{sa} + λ_s) if u = s,  λ_s if u = s+1,

and q(u|s, a) = 0 otherwise. Also r(s, a) = r μ_{sa} − c_{sa} where r is the price each customer pays on
completing service and c_{sa} is the service cost rate when the action a is chosen with s customers
in the queue. In this event Bellman’s equation becomes

−V̇_s^t = max_{1≤a≤K_s} [r μ_{sa} − c_{sa} + μ_{sa}(V_{s−1}^t − V_s^t) + λ_s(V_{(s+1)∧S}^t − V_s^t)],  0 ≤ s ≤ S.

Example 2. Supply Management. A supply manager seeks an ordering policy with minimum
expected cost. The states are the possible inventory levels !ß á ß W . An order for one or more it-
ems can be placed or canceled at any time with the time to deliver being exponentially distrib-
uted with mean ." , !  .  _. The actions in state = are the amounts + of stock on hand plus
on order, so = Ÿ + Ÿ W . The times between demands are independent exponentially distributed
random variables with common mean -" , !  -  _. The demands are independent and ident-
ically distributed with .5 being the probability of a demand of size 5 . Unsatisfied demands are
lost. The transition rates are

Ú
Ý
Ý
Ý
Ý
Ý
-.=? ß !  ?  =
-!5  = .5 ß ? œ !
; a?l=ß +b œ Û
Ý
Ý
Ý
Ý
Ý
. ß? œ +  =
Ü a-  .bß ? œ =

and ;Ð?l=ß +Ñ œ ! otherwise. Also, the expected-cost-rate function is -Ð=ß +Ñ œ -Ð+  =Ñ.  2Ð=Ñ 
!_Aœ" :ÐAÑ-.=A where -Ð † Ñ and :Ð † Ñ are the ordering and shortage cost functions, 2Ð † Ñ is the
storage-cost rate, and -Ð!Ñ œ 2Ð!Ñ œ :Ð!Ñ œ !.

Example 3. Project Scheduling. A project manager wants to allocate funds continuously


among the activities of a project to maximize the probability of completing the project by time
X . The project consists of a partially-ordered collection T of activities. An activity cannot be
initiated until all activities preceding it in the partial order are completed. At any given point of
time, T can be partitioned into two sets, viz., the set = of activities that have been completed

and the set =- (the complement of =) of activities that have not been completed. From what we
have said above, the set = is necessarily decreasing, i.e., if an activity is in =, then so is every
activity that precedes it. Similarly, =- is increasing, i.e., if an activity is in =- , then so is every
activity that follows it in the partial order. Denote by f the set of decreasing subsets of T. For
each = − f , denote by = the set of activities in =- that are not preceded by any other activity in
=- . Evidently, = is the set of activities on which effort may be expended when = − f is the set
of completed activities. Funds may be spent continuously on the project at the rate of , million
dollars per week. At each point in time at which = − f is the set of completed activities, the funds
can be allocated among the activities in = in multiples of one million dollars. An action + in
state = − f is an k= k-tuple of nonnegative integers + œ Ð+! Ñ indexed by elements ! of = for
which !!−= +! œ ,. The interpretation of + is that +! million dollars per week are allocated to
activity ! − = for each such !. Denote by E= the set of all such actions if = Á T and the empty
action g if = œ T. The times to complete the activities are independent exponentially distributed
random variables. The completion rate of activity ! − T when 7   ! million dollars per month
are allocated to that activity is -! È7 with -!  !. Let Z=> be the maximum probability of com-
pleting the project by time X when = − f is the subset of activities completed by time ! Ÿ > Ÿ X .
Then @T œ " and @= œ ! otherwise. Also, <Ð=ß +Ñ œ ! for all + − E= and = − f . Moreover, in state
T, i.e., when the project is completed, ;Ð=lTß gÑ œ ! for all = − f . Finally, for each = − f \ ÖT× and
+ œ Ð+! Ñ − E= , the nonzero transition rates are contained among those for which ;Ð=  Ö!×l=ß +Ñ œ
-! È+! for some ! − = or ;Ð=l=ß +Ñ œ !!− -! È+! .
=

3 PIECEWISE-CONSTANT POLICIES, GENERATORS, TRANSITION MATRICES


A decision is a function δ that assigns to each state s in 𝒮 an action δ_s ∈ A_s. Let Δ ≡ ×_{s∈𝒮} A_s
be the set of all decisions. A policy is a mapping π = (δ_t) : [0, ∞) → Δ (t → δ_t) that is piecewise
constant, i.e., there is a sequence of numbers 0 ≡ t_0 < t_1 < t_2 < ⋯ with t_N → ∞ as N → ∞, such
that δ_t = δ_{t_N} for t_N ≤ t < t_{N+1} and each 0 ≤ N. Using π means that an individual in state s at
time t chooses action δ_{ts} at that time. Let Δ^∞ denote the set of all policies. Call a policy δ^∞ that
uses δ at each point in time stationary.
For each δ ∈ Δ, let r_δ ≡ (r(s, δ_s)) be the reward-rate vector and Q_δ ≡ (q(u|s, δ_s)) be the gen-
erator matrix. Let P_δ^t be the S × S transition matrix whose su-th element is the expected number
of individuals in state u at time t given that one individual was in state s at time 0 and δ^∞ is
used. Assume that the P_δ^t satisfy P_δ^0 = I,

(1) P_δ^{t+h} = P_δ^t P_δ^h = P_δ^h P_δ^t for h, t ≥ 0

and

(2) P_δ^h = I + Q_δ h + o(h) for h > 0.



Substituting (2) into (1), subtracting P_δ^t from both sides, dividing by h, and letting h → 0 estab-
lishes that (Ṗ_δ^t ≡ (d/dt)P_δ^t)

(3) P_δ^t Q_δ = Ṗ_δ^t = Q_δ P_δ^t for t ≥ 0.

These equations describe the growth of the population when the decision δ is used and may be
interpreted as follows. Observe that Q_δ is the rate of addition to the population at any time per
individual at that time. The left-hand side of (3) is the product of the population at time t and
the rate of addition to the population per individual at that time. The right-hand side of (3) is
the product of the rate of addition to the population at time zero and the population generated
at time t by each individual added at time zero. Both products are simply different ways of ex-
pressing the rate of addition Ṗ_δ^t to the population at time t. In any case, the unique solution of
each of these two systems of S linear differential equations satisfying P_δ^0 = I is

(4) P_δ^t = e^{Q_δ t} for t ≥ 0.

Nonnegative, Substochastic and Stochastic Transition Matrices

The nonnegativity of the off-diagonal elements of Q_δ implies that P_δ^t is nonnegative and has
positive diagonal elements for t ≥ 0. To see this, write Q_δ = (Q_δ + λI) − λI where λ is a positive
number chosen so that Q_δ + λI ≥ 0. Now (Q_δ + λI)t commutes with −λIt, so that P_δ^t = e^{Q_δ t} =
e^{(Q_δ+λI)t} e^{−λIt}. Also e^{(Q_δ+λI)t} ≥ 0 because (Q_δ + λI)t ≥ 0 and e^{−λIt} = e^{−λt} I ≥ 0, so P_δ^t ≥ 0. Further,
since e^{Q_δ t} = (e^{Q_δ t/n})^n and e^{Q_δ t/n} = I + (t/n)Q_δ + o(t/n) is nonnegative and has positive diagonal elements
for large enough n, it follows that P_δ^t = e^{Q_δ t} also has positive diagonal elements.
If also Q_δ 1 = 0 (resp., ≤ 0), then P_δ^t is stochastic (resp., substochastic). To see this, observe
that

P_δ^t 1 − 1 = (e^{Q_δ t} − I) 1 = (∫_0^t e^{Q_δ u} du) Q_δ 1,

so because the term in large parentheses is nonnegative, P_δ^t is stochastic (resp., substochastic) if
Q_δ 1 = 0 (resp., Q_δ 1 ≤ 0).

4 CHARACTERIZATION OF MAXIMUM T-PERIOD-VALUE POLICIES [Be57], [Mi68b]

Suppose that π = (δ_t) is any policy with δ_t = γ_N for t_N ≤ t < t_{N+1} for some {γ_N} and num-
bers 0 ≡ t_0 < t_1 < t_2 < ⋯ with t_N → ∞ as N → ∞. Then the transition function P_π^t determined
by π is, for t_N ≤ t < t_{N+1}, given by

(1) P_π^t = P_{γ_0}^{t_1−t_0} ⋯ P_{γ_{N−1}}^{t_N−t_{N−1}} P_{γ_N}^{t−t_N}.

Thus, for t_N < t < t_{N+1},

(2) Ṗ_π^t = P_π^t Q_{δ_t}.

Interpret the su-th element of P_π^t as the expected number of individuals that an individual in state
s at time 0 generates in state u at time t when using π. Using π during an interval means that π
uses the decision δ_t at each time t in the interval.
Let π = (γ_t) and π* = (δ_t) be policies. Fix T > 0 and v ∈ ℝ^S. Let π_t π* be the policy that uses
π during [0, t) and π* during [t, ∞), i.e., uses γ_u at u ∈ [0, t) and uses δ_u at u ∈ [t, ∞). Set V_π^t =
(V_{πs}^t) with V_{πs}^t being the expected reward that π earns during the interval [t, T] starting with a
single individual in state s at time t.
The following Comparison Lemma gives an expression for the difference between the T-period
values of two policies.

Lemma 1. Comparison. If π = (γ_t) and π* are policies, then

V_π^0 − V_{π*}^0 = ∫_0^T P_π^t G_{γ_t π*}^t dt

where the comparison function is given by

(3) G_{γπ*}^t ≡ r_γ + Q_γ V_{π*}^t + V̇_{π*}^t,  γ ∈ Δ and 0 ≤ t ≤ T.

Proof. Evidently,

(4) V_{π_t π*}^0 = ∫_0^t P_π^u r_{γ_u} du + P_π^t V_{π*}^t.

Thus, for all continuity points t of π and π*,

(5) (d/dt) V_{π_t π*}^0 = P_π^t [r_{γ_t} + Q_{γ_t} V_{π*}^t + V̇_{π*}^t] = P_π^t G_{γ_t π*}^t.

Hence since V_π^0 − V_{π*}^0 = ∫_0^T ((d/dt) V_{π_t π*}^0) dt, the result follows from (4) and (5). ∎

The optimal-return operator plays a fundamental role in discrete-time-parameter problems.
In continuous-time-parameter problems, the optimal-return generator 𝒱 plays a similar role. De-
fine 𝒱 : ℝ^S → ℝ^S by

𝒱V = max_{δ∈Δ} [r_δ + Q_δ V] for V ∈ ℝ^S.

Observe that the optimal-return generator stands in the same relation to the generator matrices
Q_δ that the optimal-return operator does to the transition matrices P_δ.

Theorem 2. Characterization of Maximum T-Period-Value Policies. The policy π* = (δ_t) has
maximum T-period value with terminal-value vector v if and only if V^t ≡ V_{π*}^t is continuously
differentiable in t on [0, T], V^T = v and

(6) −V̇^t = 𝒱V^t = r_{δ_t} + Q_{δ_t} V^t,  0 ≤ t ≤ T.

Proof. For the “if” part, observe that (6) implies that G_{γπ*}^t ≤ 0 for all γ, so by the Compari-
son Lemma 1, V_π^0 ≤ V_{π*}^0 for all π. For the “only if” part, observe first that if π = π*, (5) becomes
0 = P_{π*}^t G_{δ_t π*}^t. Also P_{π*}^t is nonsingular because it is a product of matrix exponentials, each of which
is nonsingular. Thus G_{δ_t π*}^t = 0 for all continuity points of π*. Now suppose G_{γπ*}^t ≰ 0 for some
continuity point t of π* and some γ. For each state s with G_{γπ*s}^t ≤ 0, redefine γ_s = δ_{ts}, whence
G_{γπ*s}^t = G_{δ_t π*s}^t = 0. Thus G_{γπ*}^u ≩ 0 for u in a sufficiently small neighborhood of t. Let π′ be the
policy that uses γ on that neighborhood and uses π* elsewhere. Observe that P_{π′}^t is nonnegative
and has positive diagonal elements because it is a product of nonnegative matrix exponentials
with those properties. Thus, it follows from the Comparison Lemma 1 that V_{π′}^0 > V_{π*}^0, which is a
contradiction. Hence, max_{γ∈Δ} G_{γπ*}^t = 0 = G_{δ_t π*}^t, or equivalently (6) holds, at continuity points of
π*. But V^t is continuous and so is 𝒱, whence by the first equality in (6) at continuity points of
δ_·, V̇^t is continuous on [0, T]. ∎

Incidentally, −V̇^t = 𝒱V^t is, of course, Bellman’s equation.

5 EXISTENCE OF MAXIMUM T-PERIOD-VALUE POLICIES [Mi68b]

As a first step in showing that there is a maximum T-period-value piecewise-constant policy,
we show how to find a stationary policy that has maximum t-period value for all small enough
t > 0. To that end, let

(1) V̂_δ^t(v) = (∫_0^t e^{Q_δ u} du) r_δ + e^{Q_δ t} v

be the expected t > 0 period reward when δ^∞ is used and the terminal reward is v. For t < 0, V̂_δ^t(v)
in (1) can be interpreted as the terminal reward for which the expected −t > 0 period reward
when δ^∞ is used is v. To see this, notice that with this interpretation, we should have

(2) v = (∫_0^{−t} e^{Q_δ u} du) r_δ + e^{−Q_δ t} V̂_δ^t(v).

Solving (2) for V̂_δ^t(v) gives

V̂_δ^t(v) = −(∫_0^{−t} e^{Q_δ(u+t)} du) r_δ + e^{Q_δ t} v = (∫_0^t e^{Q_δ u} du) r_δ + e^{Q_δ t} v,

which shows that V̂_δ^t(v) satisfies (1) for t < 0.
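The t-period value (1) can be computed without explicit integration by using the standard block-matrix identity expm([[Q, I],[0, 0]] t) = [[e^{Qt}, ∫_0^t e^{Qu} du],[0, I]]; that identity is not part of the text and is used here only as a convenient device. A minimal sketch with a hypothetical generator, reward rates and terminal value:

# Compute \hat V_delta^t(v) = (∫_0^t e^{Q u} du) r + e^{Q t} v via a block matrix exponential.
import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0, 1.0], [0.5, -0.5]])   # generator of some decision delta (assumed)
r = np.array([1.0, 3.0])                    # reward-rate vector r_delta (assumed)
v = np.array([0.0, 2.0])                    # terminal value vector (assumed)

def value(t):
    S = Q.shape[0]
    block = np.zeros((2 * S, 2 * S))
    block[:S, :S] = Q
    block[:S, S:] = np.eye(S)
    M = expm(block * t)
    return M[:S, S:] @ r + M[:S, :S] @ v    # (∫_0^t e^{Qu} du) r + e^{Qt} v

print(value(0.0), value(1.0))               # value(0.0) returns v, as (1) requires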
Now consider the following two problems.

>
I. Maximum Value. Choose $ − ? to maximize Zs$ Ð@Ñ for all small enough >  !.
>
II. Minimum Balloon Payment. Choose $ − ? to minimize Zs$ Ð@Ñ for all large enough >  !.

The first problem has an obvious interpretation. The second problem can be interpreted as
follows. Suppose that one must accumulate a total expected reward @= during the next > periods
starting with one individual in state =. Also suppose that there are no funds available initially
and that if the system earns a total expected reward differing from @ in the > periods, then
each individual in state ? after > periods must make a (possibly negative) balloon payment
>
Zs$ Ð@Ñ? at that time to make up the shortfall. The goal of problem II is to minimize the size of
the balloon payments that each individual must make for small >.

Example 4. Suppose W œ ", ? œ Ö# ß $ ×, <# œ <$ œ !, @ œ ", U# œ " and U$ œ #. Then # _
solves problem I while $ _ solves problem II.

In order to solve problems I and II, substitute the Maclaurin-series expansion for /U$ ? into (1) and
integrate term-by-term yielding

"
_ 8
s „> > 8
(3) „ Z $ Ð@Ñ œ <$„ , >  !
8œ!
8x
where
Ú
œÛ
„@ ß8 œ !
Ü Ð „ U$ Ñ8" Ð<$  U$ @Ñß 8 œ "ß #ß ÞÞÞ Þ
<$8„

Let G8$„ œ Ð<$!„ â <$8„ Ñ and G$„ œ Ð<$!„ <$"„ â Ñ. Let ?„ be the set of $ − ? that maximize
(3) for all sufficiently small >  !. From (3) we have ?„ œ Ö$ − ? À G$„ ¤ G#„ , all # − ?×.

Theorem 3. Maximum-Value and Minimum-Balloon-Payment Stationary Policies for Small


k>k. There is a stationary policy maximizing Ðresp., minimizingÑ Zs. Ð@Ñ over the class of station-
>

ary policies for all sufficiently small >  ! Ðresp., large >  !Ñ, i.e., ? Á g Ðresp., ? Á gÑ.

Proof. Let ?8„ œ Ö$ − ? À G8$„ ¤ G8#„ , all # − ?× for 8 œ !ß "ß á . Clearly ?!„ œ ?, ?„
"
œ
Ö$ − ? À <$  U$ @   <#  U# @, all # − ?× Á g. Now suppose inductively that ?8"
„ is nonempty
and is a product of action sets, one for each state, for some 8  "   ". Then G8" 8"
$ „ œ G„ say,
for $ − ?8" 8 8"
„ . Thus ?„ œ Ö$ − ?„ À <$8„   <#8„ , all # − ?„
8"
×. Hence since <$8„ œ „ U$ <„
8"
for
$ − ?8"
„ , it follows that

?8 œ Ö$ − ?
8" 8"
À U$ < 8"
  U# < 8"
, all # − ? ×

and

?8 œ Ö$ − ?
8" 8"
À U$ < 8"
Ÿ U# < 8"
, all # − ? ×,

so ?8„ is nonempty and is a product of action sets, one for each state. Thus ?„ œ +_ 8
8œ! ?„ Á g. è

Theorem 4. Truncation. ?W"


„ œ ?W#
„ œ â œ ?„ .

Proof. If GW" W"


$ „ œ G# „ , then

(4) ÐU$  U# Ñ<$8„ œ !, 8 œ "ß á ß W .

It suffices to show (4) holds for all 8   " for this implies G$„ œ G#„ from which the Theorem fol-
lows. That (4) holds for all 8   " follows from Lemma 23 of §1.8 with F œ „ U$ and P the null
space of U$  U# . è

Theorem 5. Existence of Maximum X -Period-Value Policies. For X  ! and terminal value-


vector @, there is a policy with maximum X -period value. Also Z œ ÐZ > Ñ, where Z > is the maxi-
mum value on Ò>ß X Ó, is the unique solution of Bellman’s equation.

Proof. Put ?„ Ð@Ñ ´ ?„ . Choose $" − ? Ð@Ñ. Let Zs$> Ð@Ñ be the >-period value when using $ and
the terminal reward is @. Now show that $" maximizes Zs$> Ð@Ñ over all piecewise-constant policies for
all sufficiently small >  !. By Theorem 2, this will be so if # œ $" maximizes <#  U# Zs$> Ð@Ñ for all
small enough >  !. To see that this is so, notice on using Zs$>" Ð@Ñ œ !_
"
>8 8
8œ! <$"  that
8x
#
>
Ð5Ñ <#  U# Zs$>" Ð@Ñ œ Ò<#  U# @Ó  >U# <$""   U# <$#"   â .
#x

Hence, by construction, # œ $" lexicographically maximizes the coefficients of the above series and
so indeed maximizes <#  U# Zs$>" Ð@Ñ for all small enough >  !.
Let >" ´ inf Ö!  > Ÿ X À Z Zs$> Ð@Ñ  <$"  U$" Zs$> Ð@Ñ×  !. Then 1‡ œ $" for X  >" Ÿ > Ÿ X is
" "
>"
optimal. Now on replacing @ by Zs$" Ð@Ñ ´ @" , repeat the above construction obtaining $# − ? Ð@" Ñ.
Then define 1‡ œ $# for X  >"  ># Ÿ >  X"  >" where ># œ inf Ö!  > Ÿ X  >" À ZZs$> Ð@"Ñ  <$#
#

 U$# Zs$># Ð@Ñ×  !. Continuing inductively, construct 1‡ and a sequence >" ß ># ß á of positive num-
bers. If !R ‡
3œ" >3   X for some R , the claim is established. If not, X ´ X 
!_3œ" >3   !. It turns
out that this is impossible.
For let @‡ œ lim>ÆX ‡ Z1>‡ . From Theorem 3, there is a $ that minimizes Zs.>Ð@‡ Ñ over the class
of decisions for all small enough >  !, i.e., $ − ? Ð@‡ Ñ. Now # œ $ maximizes <#  U# Zs$> Ð@‡ Ñ for
all small enough >  !, say for ! Ÿ > Ÿ %, %  !. To see this observe
>#
(6) <#  U# Zs$> Ð@‡ Ñ œ Ò<#  U# @‡ Ó  >U# <$"  U# <$#  â .
#
µ
It follows that on letting Z > ´ Zs$> Ð@‡ Ñ,

µ µ µ
(7) Z > œ Z Z > , Z ! œ @‡ , ! Ÿ > Ÿ %.

But from Theorem 4 and construction, Z > ´ Z1>X
‡ satisfies


(8) Z > œ Z Z > , Z ! œ @‡ , ! Ÿ > Ÿ X  X ‡ .

Now the equations (7) and (8) are identical on Ò!ß %Ó. Thus, since Z is Lipschitz continuous, it fol-
µ
lows from the theory of nonlinear differential equations in the Appendix that Z > œ Z > for ! Ÿ >
Ÿ %. Hence, redefine 1‡ to be $ on ÒX ‡ ß X ‡  %Ó without altering its optimality, contradicting the as-
sumption >R Æ !. è

Nonstationary Data. The above results concerning the existence of maximum X -period-value
policies extend to situations in which the data are nonstationary. The simplest situation is that
in which the action sets E=> , reward rates <> Ð=ß +Ñ and transition rates ;> Ð?l=ß +Ñ at time > are
piecewise constant in > on Ò!ß X Ó, i.e., there exist constants ! œ X!  X"  X#  â  XR œ X such
that E=> , <> Ð=ß +Ñ and ;> Ð?l=ß +Ñ are constant in > on the interval ÒX3" ß X3 Ñ for each 3 œ "ß á ß R . To
see why there is a maximum X -period-value policy, suppose that one has found the maximum
ÐX X3 Ñ-period value Z X3 and the corresponding policy 1 on the interval ÒX3 ß X Ó with terminal value
@ . Now apply Theorem 5 to find a maximum ÐX3 X3" Ñ-period-value Z X3" and the correspond-
ing policy . on the interval ÒX3" ß X3 Ñ with terminal value Z X3 . Then the policy that uses . on the
interval ÒX3" ß X3 Ñ followed by 1 on the interval ÒX3 ß X Ó has maximum ÐX X3" Ñ-period value on
the interval ÒX3" ß X Ó with terminal value @. Repeating this procedure for 3 œ R ß á ß " produces
the desired maximum X -period-value policy and its value.

6 MAXIMUM T-PERIOD VALUE WITH A SINGLE STATE


It is instructive to apply the results for the maximum-T-period-value problem to the case where
there is only a single state, i.e., S = 1. The general procedure is best described with an example.

Example 5. Suppose there is a single-state and five decisions !ß # ß $ ß (ß ). Also suppose the graph
of the piecewise-linear convex function Z appears (landscape) in the right-hand side of Figure 1,
with each linear segment labeled by the decision , for which Z Z œ <,  U, Z . From the graph
of Z in the right-hand side of Figure 1, it is possible to sketch a rough graph of Z † in the left-

hand side of Figure 1 that illustrates several features of the solution to Z > œ Z Z > , ! Ÿ > Ÿ X ,
given any terminal value @. These features include:

• the monotonicity of Z > in > on Ò!ß X Ó,


• the subinterval of Ò!ß X Ó during which it is optimal to use a decision,

• the monotonicity of Z > on each such subinterval,
• the values of Z > at the ends of each such subinterval and
• the order in which the decisions are used.

To see how, suppose one has at hand the value Z > at time >. Now find the value and sign of the

derivative Z > , and the decision , to use at time > by moving horizontally from the value Z > on

[Figure 1. Graphs of V^· and 𝒱 for a Single-State Problem. The right panel shows (in landscape) the piecewise-linear convex graph of 𝒱, with each linear segment labeled by the decision attaining the maximum in 𝒱V = max_δ [r_δ + Q_δ V]; the left panel shows the graphs of V^· obtained by solving Bellman's equation on [0, T] for four different terminal values v, together with the two equilibrium values at which 𝒱 vanishes.]

the vertical axis of the left-hand graph to the same value on the vertical axis of the right-hand

graph and reading off Z > œ Z Z > œ <,  U, Z > . Using this information and starting with the
given terminal value @, construct the graph of Z † . This is done in the left-hand side of Figure 1
for four different terminal values.
To illustrate, consider the second largest v, say, of the four terminal values mentioned
above. For that terminal value, V̇^T = −𝒱v = −(r_3 + Q_3 v) > 0. Let e be the value of V for which
r_3 + Q_3 e = r_7 + Q_7 e, i.e., the value for which one is indifferent between using 3 and 7. Now as t
decreases from T, V^t decreases and the derivative V̇^t increases as long as V^t ≥ e. Call the time
t = τ at which V^τ = e the switching time. Thus V^(·) is concave and increasing on the interval
[τ, T]. Now as t decreases from τ to 0, the derivative V̇^t = −(r_7 + Q_7 V^t) decreases, but remains
positive, with V^t remaining above the value V̲* at which the derivative of V^(·) vanishes, i.e.,
𝒱V̲* = 0. Thus V^(·) is convex and increasing on the interval [0, τ]. And it is optimal to use 7 on
[0, τ) and 3 on [τ, T].

Actually, it is possible to give a closed-form expression for the switching time τ. For at τ, we
have

𝒱e = r_3 + Q_3 V^τ = r_3 + Q_3 [∫_0^{T−τ} e^{Q_3 u} r_3 du + e^{Q_3 (T−τ)} v] = e^{Q_3 (T−τ)} 𝒱v.

Solving this equation for τ yields

τ = T − Q_3^{−1} ln(𝒱e / 𝒱v)

provided that τ ≥ 0. In the contrary event, it is optimal to use 3 on the entire interval [0, T]. Observe that T − τ is independent of T.
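The switching time is easy to check numerically. The sketch below (single-state data made up for illustration, chosen so that the terminal value lies between e and the upper zero of 𝒱) integrates −V̇^t = 𝒱V^t backward from the terminal value with an Euler step and compares the crossing time of e with the closed form above.

import numpy as np

# Illustrative single-state decisions b: (r_b, Q_b); only 3 and 7 are active along this trajectory.
decisions = {3: (-1.0, 1.0), 7: (-1.0, -1.0)}

def script_V(V):
    return max(r + q * V for r, q in decisions.values())

T, v = 3.0, 0.5                                    # horizon and terminal value (illustrative)
r3, q3 = decisions[3]; r7, q7 = decisions[7]
e = (r7 - r3) / (q3 - q7)                          # indifference value: r3 + q3*e = r7 + q7*e

tau = T - np.log(script_V(e) / script_V(v)) / q3   # closed form; here T - tau = ln 2

dt, t, V, tau_num = 1e-4, T, v, None               # backward Euler: V(t - dt) ~ V(t) + dt * script_V(V(t))
while t > 0:
    V += dt * script_V(V)
    t -= dt
    if tau_num is None and V <= e:
        tau_num = t
print(tau, tau_num)                                # both are about 2.307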

It is interesting to note that there are two zeros of 𝒱, viz., V̲* and V̄*. Both are equilibria in
the sense that if v equals either of them, then V^t = v for all 0 ≤ t ≤ T. But the first equilibrium
is a stable attractor and the second is an unstable repeller. The first is so because for all v < V̄*,
V^0 → V̲* < V̄* as T → ∞. The second is so from what was just shown and the fact that for all
v > V̄*, V^0 → ∞.

Theorem 6. Single State: Each Decision Used for an Interval. If there is a single state, there
is a maximum-T-period-value policy for which the set of times at which any given decision is used
is a (possibly empty) interval and the maximum value V^t on [t, T] is monotone in 0 ≤ t ≤ T.

Proof. First show that V̇^t has constant sign on [0, T]. To that end, suppose 𝒱v ≥ 0 (resp., ≤ 0).
Then −V̇^t = 𝒱V^t ≥ 0 (resp., ≤ 0) for all 0 ≤ t ≤ T because V̇^t is continuous in t and if there is a
t for which V̇^t = 0, then V^τ = V^t for all 0 ≤ τ ≤ t.
Next observe that 𝒱V is piecewise linear and convex in V. Hence, for each δ ∈ Δ, the set of values V for which 𝒱V = r_δ + Q_δ V is a subinterval of ℝ. Now combine these facts with Theorem 2. ∎

One important implication of this result is that the number of switching times is bounded above
by |Δ| − 1.

7 EQUIVALENCE PRINCIPLE FOR INFINITE-HORIZON PROBLEMS [Ve69a]

This section studies the infinite-horizon version of the continuous-time-parameter problem.
This problem can, for most purposes, be reduced to solving an “equivalent” discrete-time-parameter
problem. To see this, it is useful to begin by discussing transient policies in continuous
time.
Call a policy π transient if ∫_0^∞ P_π^t dt < ∞. If π = (δ_t) is transient, its value is given by

V_π ≡ ∫_0^∞ P_π^t r_{δ_t} dt.

Now

I − e^{Q_δ T} = −∫_0^T (d/dt)(e^{Q_δ t}) dt = (−Q_δ)(∫_0^T e^{Q_δ t} dt).

For δ^∞ to be transient, it is necessary and sufficient that e^{Q_δ T} → 0 as T → ∞. The necessity is
evident. For the sufficiency, observe that I − e^{Q_δ T} is nonsingular for large enough T, so the same
is so of the two matrices in parentheses following the second equality in the last displayed equation.
Thus, on premultiplying that equation by Q_δ^{−1} and letting T → ∞, it follows that
0 ≤ ∫_0^∞ e^{Q_δ t} dt = −Q_δ^{−1}. Hence

V_δ = (∫_0^∞ e^{Q_δ t} dt) r_δ = −Q_δ^{−1} r_δ.

Now on defining P_δ ≡ I + Q_δ, it follows that

V_δ = (I − P_δ)^{−1} r_δ,

which is the value of δ^∞ for the equivalent discrete-time-parameter problem with P_δ and r_δ provided
that P_δ ≥ 0. The nonnegativity of the P_δ can always be assured by taking the unit of time
to be small enough, e.g., a minute rather than a day, in the continuous-time-parameter problem.
A proportional change in the time unit necessitates a like change in the reward rate and the rate
of addition of individuals to the population. In particular, on putting Q̂_δ ≡ aQ_δ, r̂_δ ≡ ar_δ and
t ≡ aτ where a > 0, and observing that the value V_δ of δ^∞ is independent of the choice of the time
unit, it follows that

V_δ = (∫_0^∞ e^{aQ_δ τ} dτ) ar_δ = −Q̂_δ^{−1} r̂_δ.

Thus if a > 0 is small enough, P̂_δ ≡ I + Q̂_δ ≥ 0 and so V_δ = (I − P̂_δ)^{−1} r̂_δ.
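As a quick numerical illustration of this time-rescaling (the generator and reward rates below are made up): choose a so that P̂_δ = I + aQ_δ is nonnegative and check that the equivalent discrete-time value (I − P̂_δ)^{−1}(ar_δ) equals the continuous-time value −Q_δ^{−1}r_δ.

import numpy as np

# Illustrative transient generator (off-diagonal rates >= 0) and reward-rate vector.
Q = np.array([[-2.0, 1.0],
              [0.5, -1.5]])
r = np.array([1.0, 3.0])

V_cont = np.linalg.solve(-Q, r)                       # V_delta = -Q_delta^{-1} r_delta

a = 0.25                                              # time unit small enough that I + aQ >= 0
P_hat = np.eye(2) + a * Q
assert (P_hat >= 0).all()
V_disc = np.linalg.solve(np.eye(2) - P_hat, a * r)    # (I - P_hat)^{-1} (a r)

print(V_cont, V_disc)                                 # the two values coincide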

Equivalence of Discrete- and Continuous-Time-Parameter Problems. Suppose for this paragraph
and the next that the time unit is chosen so P_δ = I + Q_δ ≥ 0 for each δ and that each δ^∞ is
transient in the continuous-time-parameter problem. As discussed above, the last is so if and only
if each δ^∞ is transient in the equivalent discrete-time-parameter problem with P_δ and r_δ. Moreover,
for each δ, V_δ = −Q_δ^{−1} r_δ = (I − P_δ)^{−1} r_δ is the value of δ^∞ in both problems. Thus, in order to find
a maximum or near-maximum value stationary policy for the continuous-time-parameter problem,
it suffices to find such a policy for the equivalent discrete-time-parameter problem, e.g., by
successive approximations, policy improvement or linear programming.
This equivalence is also useful for theoretical studies. For example, V* is the maximum value
among the stationary policies in the continuous-time-parameter problem if and only if V* is the
maximum value among the stationary policies in the equivalent discrete-time-parameter problem.
Thus, V* is the unique fixed point of ℛ, or equivalently, the unique zero of 𝒱 (= ℛ − I),
i.e., 0 = 𝒱V* or

(1)  0 = max_{δ∈Δ} [r_δ + Q_δ V*].

This is consistent with the results for the t-period problem because if V^t = V* for all t ≥ 0, then
V̇^t = 0, whence (1) reduces to Bellman’s equation for the t-period problem, viz., 0 = −V̇^t = 𝒱V^t
for all t ≥ 0.

Maximum Present Value. Suppose instead that each stationary policy δ^∞ is bounded, i.e.,
P_δ^t = O(1). In this event, interest often centers on present-value optimality criteria. For that case,
observe from what was shown above, for each δ the finite present value of future population sizes
with the nominal interest rate ρ > 0 per unit time is ∫_0^∞ e^{−ρt} P_δ^t dt = ∫_0^∞ e^{(Q_δ − ρI)t} dt = (ρI − Q_δ)^{−1}
= R_δ^ρ where R_δ^ρ is, of course, the resolvent of Q_δ. Also, as long as the time unit is small enough
(which here requires common rescaling of the r_δ, Q_δ and ρ) so that P_δ = I + Q_δ ≥ 0 for all δ, then
as shown on page 49, R_δ^ρ is also the present value of future population sizes in the equivalent
discrete-time-parameter problem for all δ. Moreover, in both cases the present values of δ^∞ coincide
and equal V_δ^ρ = R_δ^ρ r_δ for each δ. Thus both sets of maximum-present-value stationary policies coincide,
and so do their present values. Also, the maximum present value V* among the stationary policies
for the continuous-time-parameter problem is the unique zero of the analog of (1) formed by
replacing Q_δ by Q_δ − ρI, viz.,

(2)  0 = max_{δ∈Δ} [r_δ + Q_δ V* − ρV*].

Thus, in order to find a maximum- or near-maximum-present-value stationary policy for the continuous-time-parameter
problem, it suffices to find such a policy for the equivalent discrete-time-parameter
problem, e.g., by successive approximations, policy improvement or linear programming.
Also, the sets of strong present-value optimal (resp., n-optimal) stationary policies in the continuous-
and equivalent discrete-time-parameter problems coincide. Thus, one such policy can be
found for the continuous-time-parameter problem by solving the equivalent discrete-time-parameter
problem, e.g., by the strong policy-improvement method or linear programming.

8 MAXIMUM PRINCIPLE [PBGM62], [LM67]


It is often useful to consider models in which there are infinitely many states. This section does
that for deterministic control problems whose transition laws are given by differential equations.
The goals are to develop Bellman’s equation and Pontryagin’s celebrated Maximum Principle
for these problems formally. By ‘formal’ is meant that differentiability of appropriate functions
is assumed, but not shown. The Maximum Principle is a necessary condition for optimality that is
also sufficient provided that the problem satisfies suitable convexity conditions, e.g., concavity of
the objective function and linearity of the transition law.
The problem is that of finding a piecewise-continuous control a_(·): [0, T] → ℝ^m that maximizes
the T-period value

(1)  ∫_0^T r(s_t, a_t) dt

subject to

(2)  ṡ_t = f(s_t, a_t),  s_0 = s⁰

and

(3)  a_t ∈ A

for 0 ≤ t ≤ T where s_(·): [0, T] → ℝ^n and s⁰ is the given initial state. The interpretation is that
s_t ∈ ℝ^n is the state of the system at time t, a_t ∈ A ⊆ ℝ^m is the action chosen at time t, r(s_t, a_t)
is the reward rate at time t, and (2) is the transition law governing the evolution of the system
through the states.
In order to obtain the desired necessary condition for optimality, first formally generalize Bellman’s
equation

−(∂/∂t)V_s^t = max_{a∈A} [r(s, a) + (Q_a V^t)_s],  V_s^T = 0,

for 0 ≤ t ≤ T and all s, to the present case where V_s^t is the maximum value in [t, T] starting in
state s. The appropriate definition of (Q_a V)_s in the present case is the differential operator

(Q_a V)_s = lim_{h↓0} (1/h)(V_{s + hf(s,a)} − V_s) = f(s, a)∇_s V_s

where ∇_s V_s = (∂V_s/∂s_i) is the gradient of V_s because if (s_t, a_t) = (s, a), then from (2),

s_{t+h} = s_t + hf(s_t, a_t) + o(h).

Thus Bellman’s equation becomes the nonlinear partial differential equation

(4)  0 = max_{a∈A} [r(s, a) + f(s, a)∇_s V_s^t + (∂/∂t)V_s^t],  V_s^T = 0,

for 0 ≤ t ≤ T and all s. Now let g(s, a, t) be the term in brackets in (4), a_(·) be an optimal control
and s_(·) be the corresponding optimal trajectory satisfying the differential equations (2). Then
from (4),

(5)  0 = max_s g(s, a_t, t) = g(s_t, a_t, t)

because g(s, a, t) ≤ 0 for all states s, actions a ∈ A and 0 ≤ t ≤ T. Thus

(6)  ∇_s g(s_t, a_t, t) = 0,

so on suppressing (s_t, a_t, t),

0 = ∇_s r + ∇_s f ∇_s V + f∇²_{ss} V + ∇²_{st} V

where ∇²_{ss} = (∂²/∂s_i ∂s_j), ∇²_{st} = (∂²/∂s_i ∂t) and ∇_s f = (∂f_j/∂s_i). Thus since ṡ = f,

(7)  (d/dt)∇_s V_{s_t}^t = ṡ_t ∇²_{ss} V_{s_t}^t + ∇²_{st} V_{s_t}^t = −∇_s r − ∇_s f ∇_s V_{s_t}^t.

Now set v_t ≡ ∇_s V_{s_t}^t and observe that V_s^T = 0 for all s, so that (7) becomes the adjoint linear
differential equation

(8)  v̇_t = −∇_s r(s_t, a_t) − ∇_s f(s_t, a_t) v_t,  v_T = 0

for 0 ≤ t ≤ T, and (4) implies the maximum principle

(9)  −(∂/∂t)V_{s_t}^t = max_{a∈A} [r(s_t, a) + f(s_t, a)v_t] = r(s_t, a_t) + f(s_t, a_t)v_t

for 0 ≤ t ≤ T. In order to check whether a control (a_t) satisfies (9), solve the ordinary differential
equations (2) for (s_t) and then the linear differential equations (8) for (v_t). Finally, check whether
a_t attains the maximum on the right-hand side of (9).
In general the conditions (2), (8), and (9) are necessary, but not sufficient, for optimality of
a control (a_t) because (6) is necessary, but not sufficient, for (5). Thus, like the analogous Karush-Kuhn-Tucker-Lagrange
conditions for finite-dimensional mathematical programs, the maximum
principle is necessary, but not sufficient, for optimality. However, again like analogous results
for finite-dimensional concave programs, the maximum principle is sufficient for optimality
under suitable convexity hypotheses, e.g., when f(·, ·) is linear, r(s, a) = p(s) + q(a) is concave,
A is convex, and mild regularity conditions are fulfilled. By contrast, a suitable form of Bellman’s
equation (4) is necessary and sufficient for optimality just as various versions of it introduced
throughout this course are also necessary and sufficient for optimality.

Example 6. Linear Control Problem. For the special case in which r and f are linear, i.e.,

r(s, a) = sc + ad  and  f(s, a) = sC + aD,

the adjoint equations (8) reduce to the linear differential equations with constant coefficients

v̇_t = −c − Cv_t,  v_T = 0

that are independent of the control and trajectory. Also the maximum principle simplifies to

max_{a∈A} a(d + Dv_t) = a_t(d + Dv_t).

Observe that in this case, all that is required to find an optimal control is to solve the adjoint
equations and then choose a = a_t to achieve the maximum on the left-hand side of the above
equation.
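A sketch of this recipe for the linear case (the scalar data and the interval action set A = [−1, 1] below are illustrative assumptions, not from the notes): integrate the adjoint equation backward from v_T = 0, then pick a_t maximizing a(d + Dv_t), which for a box action set is bang-bang.

import numpy as np

# Illustrative scalars: r(s, a) = s*c + a*d, f(s, a) = s*C + a*D, A = [-1, 1].
c, d, C, D = 1.0, 0.2, -0.5, 1.0
T, n = 2.0, 2000
ts = np.linspace(0.0, T, n + 1)
h = ts[1] - ts[0]

# Adjoint: dv/dt = -c - C v, v_T = 0; integrate backward with Euler.
v = np.zeros(n + 1)
for k in range(n, 0, -1):
    v[k - 1] = v[k] + h * (c + C * v[k])     # stepping from t_k to t_{k-1}

# Maximum principle: a_t maximizes a*(d + D v_t) over [-1, 1], so a_t = sign(d + D v_t).
a = np.sign(d + D * v)
# With (a_t) in hand, the state trajectory follows by integrating (2) forward from s_0.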

Example 7. Markov Population Decision Chain. Consider the S-state Markov branching decision
chain discussed in the preceding section. In this event, s_t is the (row) vector of expected numbers
of individuals in each state at time t. Then A = Δ, a = δ, s_0 = s⁰,

r(s, a) = sr_δ  and  f(s, a) = sQ_δ.

The adjoint equations simplify to the linear differential equations with variable coefficients

v̇_t = −r_{δ_t} − Q_{δ_t} v_t,  v_T = 0

that depend on the control, but not the trajectory. Thus, the maximum principle simplifies to

max_{δ∈Δ} [s_t r_δ + s_t Q_δ v_t] = s_t r_{δ_t} + s_t Q_{δ_t} v_t = −s_t v̇_t.

Now this is so for all initial populations s_0 = s⁰ ≥ 0, and hence all s_t ≥ 0 in the corresponding
convex cone of dimension S. Thus it follows by factoring out s_t in the above equation that

−v̇_t = 𝒱v_t = r_{δ_t} + Q_{δ_t} v_t,

which is precisely Bellman’s equation (6) of §3.4 for the continuous-time-parameter problem.

9 MAXIMUM PRESENT-VALUE FOR CONTROLLED ONE-DIMENSIONAL DIFFUSIONS [Ma67, 68], [Pu74]

The goal of this section is to introduce the theory of controlled one-dimensional diffusions in
a manner analogous to the development of continuous-time-parameter Markov decision chains.
As in the previous section, the generators here are differential operators. Otherwise, as will be
seen in the sequel, the main features, formulas and results developed for Markov decision chains
carry over in the same form to controlled one-dimensional diffusions, albeit with a few new com-
plications. The development given below is analytic in the spirit of Feller, Volume I, and has the
advantage of leading quickly to computational methods. A more fundamental treatment would,
of course, require demonstrating that there exist stochastic processes with the properties described
here.

Diffusions
Roughly speaking, a one-dimensional diffusion {S^t : t ≥ 0} with state space a compact interval
of real numbers 𝒮 = [s_0, s_1], −∞ < s_0 < s_1 < ∞, is a Markov process in which for all small h >
0, the conditional distribution of each increment S^{t+h} − S^t given S^t is approximately normal
with mean and variance proportional to h and with the proportionality constants depending on
S^t. (If 𝒮 = ℝ and the proportionality constants are 0 and 1 respectively, the process is a Wiener
process or Brownian motion.) More precisely, assume that there exist piecewise-continuous drift
and diffusion coefficients μ_s and σ_s², s ∈ 𝒮, with σ_s² > 0 being bounded away from zero and bounded
above such that for each ε > 0,

(i)  ∫_{|u−s|>ε} P^h(s, du) = o(h),

(ii)  ∫_{|u−s|≤ε} (u − s) P^h(s, du) = μ_s h + o(h),

and

(iii)  ∫_{|u−s|≤ε} (u − s)² P^h(s, du) = σ_s² h + o(h)

at continuity points of μ and σ² where P^t(s, B) = P(S^t ∈ B | S^0 = s). Observe that (i) asserts
that the probability that the process deviates from any given initial state by more than ε in a
small interval is small. Condition (ii) (resp., (iii)) asserts that the truncated mean drift (resp.,
square deviation) of the process from any initial state s in a small interval is essentially proportional
to the length of the interval with the proportionality constant μ_s (resp., σ_s²) depending on s.
Incidentally, the diffusion coefficient has an alternate interpretation as the infinitesimal variance.
To justify this interpretation, it suffices to show that (ii) and (iii) are equivalent to (ii) and

(iii)'  ∫_{|u−s|≤ε} (u − s − μ_s h)² P^h(s, du) = σ_s² h + o(h).

To see why this is so, observe that

∫_{|u−s|≤ε} (u − s − μ_s h)² P^h(s, du) = ∫_{|u−s|≤ε} (u − s)² P^h(s, du)
  − 2μ_s h ∫_{|u−s|≤ε} (u − s) P^h(s, du) + μ_s² h² ∫_{|u−s|≤ε} P^h(s, du).

Now the next to last term on the right-hand side of this equation is o(h) by (ii) and the last
term is o(h) because the integral is bounded below by zero and above by one. Thus, it follows
from the above equation that (ii) and (iii) are equivalent to (ii) and (iii)' as claimed. While
this formulation has intuitive appeal, it has the drawbacks that the conditions (ii) and (iii)' are
not independent and condition (iii)' is less convenient to work with.
The next step is to derive the generator Q of the process defined by

(1)  (QV)_s ≡ lim_{h↓0} (1/h) [∫ V_u P^h(s, du) − V_s]

for all bounded V with bounded piecewise-continuous V̈ and s an interior continuity point of V̈,
μ and σ². It turns out that at such points

(2)  (QV)_s = ½ σ_s² V̈_s + μ_s V̇_s.

To see this, observe from (i)–(iii) that

(QV)_s = lim_{h↓0} (1/h) ∫_{|u−s|≤ε} (V_u − V_s) P^h(s, du)
  = lim_{h↓0} (1/h) ∫_{|u−s|≤ε} [V̇_s (u − s) + ½ V̈_s (u − s)²] P^h(s, du)
    + [ lim_{h↓0} (1/h) ∫_{|u−s|≤ε} ½ (V̈_θ − V̈_s)(u − s)² P^h(s, du) ]
  = ½ σ_s² V̈_s + μ_s V̇_s

(with |θ − s| ≤ |u − s|) because |V̈_θ − V̈_s|, and hence the bracketed term above, can be made arbitrarily
small by choosing ε > 0 small enough.


The next step is to show that if the reward rate r_s earned per unit time in state s is piecewise
continuous and bounded in s, then the expected present value V^ρ of rewards, viz.,

V^ρ = ∫_0^∞ e^{−ρt} P^t r dt,

for ρ > 0 where (P^t r)_s ≡ ∫_𝒮 r_u P^t(s, du), satisfies

(3)  (ρI − Q)V^ρ = r

at interior continuity points of r, μ and σ².⁸ To see this, observe from (i) that lim_{h↓0} P^h r = r at
such points. Thus

V^ρ = ∫_0^h e^{−ρt} P^t r dt + e^{−ρh} P^h V^ρ = rh + (1 − ρh)P^h V^ρ + o(h).

Subtracting V^ρ, dividing by h and letting h ↓ 0 establishes (3). Thus if, as can be shown, V̈^ρ is
bounded and piecewise continuous, it follows by substituting (2) into (3) that V^ρ satisfies the second-order
linear differential equation

(4)  ½ σ_s² V̈_s^ρ + μ_s V̇_s^ρ − ρV_s^ρ = −r_s,  s ∈ (s_0, s_1)

at continuity points of r, μ and σ². Of course V^ρ is not the only solution to (4) because no boundary
conditions have been specified.
The simplest type of boundary condition is absorption with a terminal value v_i at s_i for i = 0, 1.
This implies that

(5)  V_{s_i}^ρ = v_i,  i = 0, 1.

⁸Note that this equation is the same as the corresponding equation for Markov population decision chains
except that here Q is a differential operator rather than a matrix.

It turns out that (4) has a unique solution subject to the two-point boundary conditions (5). We
show this under the hypotheses that r_s, μ_s and σ_s² are piecewise constant. To see this, put

V̄_s ≡ (V_s^ρ, V̇_s^ρ)ᵀ,  Q̄_s ≡ [[0, 1], [2ρ/σ_s², −2μ_s/σ_s²]]  and  r̄_s ≡ (0, −2r_s/σ_s²)ᵀ,

and rewrite (4) as

(4)'  dV̄_s/ds = r̄_s + Q̄_s V̄_s,  s ∈ (s_0, s_1).

In this form, (4) is evidently the differential equation for the (s − s_0)-period value in a two-state
continuous-time-parameter Markov population chain where r̄_s and Q̄_s are respectively the reward-rate
vector and transition-rate matrix s − s_0 units of time from the end of the process. With this
in mind, let P̄_{st} be the “reverse time” transition matrix from “time” s to “time” t. Since Q̄_s has
positive (1,2)th element for s ∈ 𝒮, so does P̄_{st} for all t > s in 𝒮. Then

V̄_{s_1} = P̄_{s_1 s_0} V̄_{s_0} + constant.

Hence the first component of V̄_{s_1} is a strictly increasing affine function of V̄_{s_0}. Thus since the first
component of V̄_{s_0} is given as v_0, the second component of V̄_{s_0} can be chosen to assure the first component
of V̄_{s_1} equals v_1. Therefore, (4) has a unique solution subject to the boundary conditions (5).
Observe that the above argument remains valid even when ρ = 0, whence V ≡ V^0 is the unique
solution to (4), (5) with ρ = 0. In that event, V_s is the total expected reward earned starting from
state s.
It is useful now to apply these results to solve some simple problems associated with the Wiener
process with zero drift and unit diffusion coefficients.

Example 8. Probability of Reaching One Boundary Before the Other. What is the probability
V_s of reaching s_1 before s_0 starting from s ∈ [s_0, s_1]? Then r_s = 0 on (s_0, s_1) and V satisfies

½ V̈_s = 0,  V_{s_0} = 0,  V_{s_1} = 1.

Thus V_s = αs + β, so 0 = αs_0 + β and 1 = αs_1 + β, whence

V_s = (s − s_0)/(s_1 − s_0).

Example 9. Mean Time to Reach Boundary. What is the mean time V_s until absorption at
the boundaries starting from s ∈ [s_0, s_1]? Then r_s = 1 on (s_0, s_1) and V satisfies

½ V̈_s = −1,  V_{s_0} = V_{s_1} = 0.

Thus

V_s = −s² + αs + β,
0 = −s_0² + αs_0 + β,
0 = −s_1² + αs_1 + β,

so

V_s = (s − s_0)(s_1 − s).
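Both formulas are easy to check by simulating the driftless Wiener process as a scaled random walk (a rough Monte Carlo sketch; the boundaries, start point and step size below are illustrative).

import numpy as np

rng = np.random.default_rng(0)
s0, s1, s = 0.0, 1.0, 0.3           # boundaries and starting state (illustrative)
h = 1e-3                             # time step; increments are N(0, h) for unit diffusion coefficient

hits, times = [], []
for _ in range(2000):
    x, t = s, 0.0
    while s0 < x < s1:
        x += np.sqrt(h) * rng.standard_normal()
        t += h
    hits.append(x >= s1)
    times.append(t)

print(np.mean(hits), (s - s0) / (s1 - s0))       # P(reach s1 before s0)
print(np.mean(times), (s - s0) * (s1 - s))       # mean time to absorption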

Controlled Diffusions
A diffusion process is controlled by changing its drift and diffusion coefficients. Thus suppose
A is the finite set of actions available in any state, and that μ_{sa} and σ_{sa}² > 0 are respectively the
drift and diffusion coefficients and r_{sa} is the reward rate when one chooses action a in state s. Assume
that μ_{sa}, σ_{sa}² > 0 and r_{sa} are each piecewise constant in s and σ_{sa}² is bounded away from zero
and bounded above for each a ∈ A. Also assume that there is absorption with terminal rewards
v_0 and v_1 at the two finite boundaries s_0 < s_1. When the interest rate is ρ > 0, earning the terminal
reward v_i at the boundary s_i is equivalent to earning r_{s_i a} ≡ ρv_i per unit time there forever.
A decision is a piecewise-constant function δ on the interval [s_0, s_1] with δ_s being the action
that δ chooses when the process is in state s ∈ [s_0, s_1]. When using δ, the reward function is r_δ =
(r_{sδ_s}) for all s ∈ [s_0, s_1]. Also, in the interior of the state space, the generator Q_δ is defined by

(6)  (Q_δ V)_s = ½ σ_{sδ_s}² V̈_s + μ_{sδ_s} V̇_s,  s ∈ (s_0, s_1)

at continuity points s of V̈, δ, μ_{sδ_s} and σ_{sδ_s}². Since there is absorption at the boundaries, it follows
from the definition (1) of the generator that Q_δ is defined at the boundaries by

(7)  (Q_δ V)_s = 0,  s = s_0, s_1.



A stationary policy δ^∞ entails using δ at each point in time. If the goal is to find a stationary
policy that maximizes the expected present value of the rewards earned over an infinite time interval,
Bellman’s equation becomes

(8)  0 = max_{γ∈Δ} [r_γ + Q_γ V^ρ − ρV^ρ]

where Δ is the set of decisions and V^ρ is the maximum expected present value of rewards. Note
that with this formulation, the boundary conditions are absorbed into (8) since its restriction to
the boundary states s_i are, in view of (7), equivalent to ρv_i − ρV_{s_i}^ρ = 0 for i = 0, 1, or what is the
same thing, V_{s_i}^ρ = v_i, for i = 0, 1.⁹ To see this, let R_δ^ρ be the nonnegative linear operator that assigns
to r_δ the unique solution V = V_δ^ρ of the equation

(9)  (ρI − Q_δ)V = r_δ.

Thus, V_δ^ρ = R_δ^ρ r_δ and R_δ^ρ is invertible with (R_δ^ρ)^{−1} = ρI − Q_δ. This leads to the following version
of the Comparison Lemma,

V_γ^ρ − V_δ^ρ = R_γ^ρ [r_γ − (R_γ^ρ)^{−1} V_δ^ρ] = R_γ^ρ G_{γδ}^ρ

where the comparison function is defined by

G_{γδ}^ρ ≡ r_γ + Q_γ V_δ^ρ − ρV_δ^ρ.

Thus, γ = δ maximizes V_γ^ρ over Δ if G_{γδ}^ρ ≤ 0 for all γ ∈ Δ since (9) can be rewritten as G_{δδ}^ρ = 0.
Hence Bellman’s equation (8) holds. On substituting the definition (6) and (7) of the differential
operator Q_γ into (8), observe that (8) becomes

(8)'  0 = max_{a∈A} [r_{sa} + ½ σ_{sa}² V̈_s^ρ + μ_{sa} V̇_s^ρ − ρV_s^ρ],  s ∈ (s_0, s_1)

at interior continuity points of μ, σ and r, and

(8)''  V_{s_i}^ρ ≡ v_i,  i = 0, 1

at the boundaries.
Also, it can be shown that there is a stationary optimal policy δ^∞ with δ piecewise constant,
even if the data A, r_{sa}, σ_{sa}² and μ_{sa} are piecewise constant in s. The method of doing this entails
first reducing the infinite-horizon controlled-diffusion problem to an equivalent finite-horizon
(T = s_1 − s_0 period) two-state continuous-time-parameter Markov population decision chain as
discussed above (see (4)'). Then show that a piecewise-constant (now in time) policy is optimal
for the latter problem by using arguments like those given in §3.5—especially the last paragraph
thereof.

⁹Here again the equation (8) is the same as that for the corresponding Markov population decision chain
except that here the generators Q_γ are differential operators rather than matrices.
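One concrete way to carry this out numerically is to discretize the state interval and sweep a fixed-point iteration over the resulting grid version of (8)'. The sketch below is only one of several reasonable schemes and uses made-up data throughout.

import numpy as np

# Illustrative controlled diffusion on [0, 1] with two actions; all data are made up and,
# for simplicity, constant in s.  Boundary values v0, v1 and interest rate rho are also illustrative.
rho, v0, v1 = 0.05, 0.0, 2.0
actions = [(0.5, 1.0, 1.0),      # (drift mu_a, diffusion sigma2_a, reward rate r_a)
           (-0.5, 2.0, 1.5)]

n = 200                          # interior grid points
h = 1.0 / (n + 1)
V = np.linspace(v0, v1, n + 2)   # V[0] and V[-1] stay fixed at the boundary values

for _ in range(2000):            # Gauss-Seidel sweeps of the discretized optimality equation (8)'
    for i in range(1, n + 1):
        best = -np.inf
        for mu, s2, r in actions:
            up = 0.5 * s2 / h**2 + mu / (2.0 * h)    # weight on V[i+1]
            dn = 0.5 * s2 / h**2 - mu / (2.0 * h)    # weight on V[i-1]; h small enough that both are >= 0
            best = max(best, (r + up * V[i + 1] + dn * V[i - 1]) / (s2 / h**2 + rho))
        V[i] = best
# V now approximates the maximum expected present value on the grid, with V[0] = v0 and V[-1] = v1.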

Example 10. Maximizing the Probability of Accumulating Given Wealth. Consider an investor
who seeks to maximize his probability of accumulating a million dollars before going bankrupt
starting with s million dollars where 0 ≤ s ≤ 1. Denote by V_s the desired maximum probability.
Assume that the investor’s wealth is a controlled diffusion with drift coefficient μ and diffusion
coefficient σ² > 1 or 1 respectively according as the investor acts boldly or timidly.
Now V is the unique solution of Bellman’s equation (with ρ = 0)

(10)  0 = (½ σ² V̈_s + μ V̇_s) ∨ (½ V̈_s + μ V̇_s) = [(½ σ² V̈_s) ∨ (½ V̈_s)] + μ V̇_s

for s ∈ (0, 1) subject to the boundary conditions V_0 = 0 and V_1 = 1. Observe from (10) that bold
(resp., timid) investing is optimal if and only if V̈_s ≥ 0 (resp., V̈_s ≤ 0) on [0, 1], i.e., the investor’s
maximum probability V_s is convex (resp., concave) on [0, 1]. This is so if and only if μ ≤ 0 (resp.,
μ ≥ 0). Here is the argument for bold investing; the other case is similar. Bold investing is optimal
if and only if

(11)  0 = ½ σ² V̈_s + μ V̇_s,  s ∈ (0, 1),

V_0 = 0 and V_1 = 1. On rewriting (11) as

½ σ² (V̈_s / V̇_s) = −μ

and then integrating gives

½ σ² ln V̇_s = −μs + α

for s ∈ [0, 1] and some α. Thus V̇ is nondecreasing, i.e., V̈ ≥ 0, if and only if μ ≤ 0 as claimed.
This result has the following intuitive explanation. If the investor earns the mean return, the
investor would eventually accumulate a million dollars if the drift is positive and eventually go
bankrupt if the drift is negative. Consequently, if the drift is positive, timid investing is the better
option since it has higher probability of keeping the wealth close to the mean and this is all that
is needed to reach a million dollars in this case. On the other hand, if the drift is negative, bold
investing is the better option since it has higher probability of raising the wealth significantly
above the mean which is needed to accumulate a million dollars in this case.
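The conclusion can also be checked against the classical hitting-probability formula for Brownian motion with drift, P_s(reach 1 before 0) = (1 − e^{−2μs/σ²})/(1 − e^{−2μ/σ²}); that formula is not derived in these notes, and the numbers below are illustrative. With μ < 0 the bolder (larger σ²) policy gives the larger probability, and with μ > 0 the timid one does.

import numpy as np

def p_reach_one(s, mu, sigma2):
    # P_s(hit 1 before 0) for Brownian motion with drift mu and variance rate sigma2 on [0, 1].
    if mu == 0.0:
        return s
    a = 2.0 * mu / sigma2
    return (1.0 - np.exp(-a * s)) / (1.0 - np.exp(-a))

s, sigma2_bold = 0.4, 4.0              # timid uses sigma2 = 1 (illustrative values)
for mu in (-0.3, 0.3):
    print(mu, p_reach_one(s, mu, sigma2_bold), p_reach_one(s, mu, 1.0))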


Appendix
Functions of Matrices
The purpose of this appendix is to summarize a few useful facts about functions of matrices,
especially the Jordan form and its applications to matrix power series, matrix exponentials, etc.

1 MATRIX NORM
A norm ‖P‖ of a complex matrix P = (p_ij) is a real-valued function that satisfies 1°–4° below.
A norm of a matrix is a measure of its distance from the null matrix. Throughout, all norms are
presumed to satisfy 5° as well. One example is the Tchebychev norm ‖P‖ = max_i Σ_j |p_ij|. Indeed,
the term “norm” means the Tchebychev norm in the main text unless stated otherwise.

1° ‖P‖ ≥ 0.
2° ‖P‖ = 0 if and only if P = 0.
3° ‖cP‖ = |c| ‖P‖ for each complex number c. (homogeneity)
4° ‖P + Q‖ ≤ ‖P‖ + ‖Q‖ if the matrix sum P + Q is well defined. (triangle inequality)
5° ‖PQ‖ ≤ ‖P‖ ‖Q‖ if the matrix product PQ is well defined. (Schwarz inequality)


2 EIGENVALUES AND VECTORS

An eigenvalue of a square complex matrix P is a complex number λ such that λI − P is singular.
The spectrum σ(P) of P is the set of its eigenvalues. The number of eigenvalues, allowing
for multiplicities, equals the order of P. The spectral radius σ_P of P is the radius of the smallest
circle in the complex plane having center at the origin and containing the spectrum of P. For
each eigenvalue λ of P, there is a nonnull vector v for which λv = Pv. Call v an eigenvector of P.
As an application of these definitions, note that σ_{cP} = |c| σ_P and σ_P ≤ ‖P‖. To see the last fact,
let λ be an eigenvalue of P and v be an associated eigenvector. Then since λv = Pv,

|λ| ‖v‖ = ‖λv‖ = ‖Pv‖ ≤ ‖P‖ ‖v‖,

whence on dividing by ‖v‖ ≠ 0, the result follows.

3 SIMILARITY
A square complex matrix P is similar to a square complex matrix Q of like order if there is
a nonsingular matrix T such that P = TQT^{−1}. The similarity relation is evidently an equivalence
relation, i.e., is transitive, reflexive and symmetric. More important is the fact that two
similar matrices have the same spectrum. And if P is similar to Q, P^N is similar to Q^N. More
generally, if f(P) ≡ Σ_{N=0}^∞ a_N P^N is absolutely convergent, then so is f(Q), and f(P) is similar
to f(Q). Also f(P) is absolutely convergent if that is so of f(‖P‖). It turns out that among
each equivalence class of similar matrices, there is a matrix having a particularly simple form and
whose properties it is easier to study. That matrix is the Jordan form.

4 JORDAN FORM
Call a square complex matrix P nilpotent if P^N = 0 for some N ≥ 1. A Jordan matrix of order
S is an S × S matrix J of the form J ≡ λI + E where λ is a complex number and E is the
matrix having ones just above the diagonal and zeroes elsewhere. For example, if S = 4,

E = [0 1 0 0]
    [0 0 1 0]
    [0 0 0 1]
    [0 0 0 0].

Observe that J = λI + E has the following properties:

1° λ is the unique eigenvalue of J;
2° E^S = 0 (so E is nilpotent);
3° J^N = Σ_{k=0}^{S−1} C(N, k) λ^{N−k} E^k, N ≥ S; and
4° J^N → 0 if and only if |λ| < 1.

Property 3° follows from 2° and the binomial theorem

(P + Q)^N = Σ_{k=0}^{N} C(N, k) P^{N−k} Q^k,

which is valid for square matrices P, Q that commute, i.e., PQ = QP, as do I and E. Property
4° follows from 3°. Notice that C(N, k) is a polynomial in N of degree k. Thus 3° implies that
J^N = O(N^{S−1} λ^N).
A square complex matrix is in Jordan form if it is block-diagonal and each diagonal block is
a Jordan matrix. The spectrum of a matrix P in Jordan form is simply the union of the spectra
of its diagonal blocks, say J_1, …, J_k, where J_i ≡ λ_i I + E for all i. Then

P = diag(J_1, J_2, …, J_k)

and {λ_1, …, λ_k} is the spectrum of P.

The following representation theorem for square complex matrices in terms of matrices in Jordan
form is of great importance. It enables many questions about square complex matrices to be
reduced to the same questions about matrices in Jordan form.

Theorem 1. Jordan Form. Every square complex matrix is similar to a matrix in Jordan form.

Corollary 2. Transient Matrices. If P is a square complex matrix, then P^N → 0 if and only if
σ_P < 1.

Proof. The result follows from 4° above for Jordan matrices and so for P by Theorem 1. ∎

Corollary 3. Nilpotent Matrices. If P is an S × S complex matrix, the following are equivalent.

1° P is nilpotent.
2° P^S = 0.
3° σ(P) = {0}.

Proof. The result follows easily from 3° above for Jordan matrices and so is true for P by
Theorem 1. ∎

5 SPECTRAL MAPPING THEOREM

The next result asserts that the spectrum of a matrix power series is the set of complex numbers
that one obtains by replacing the matrix in the power series by each of its eigenvalues. If f is
a function and W is a subset of its domain, let f(W) be the image of W under f, i.e., {f(λ) : λ ∈ W}.

Theorem 4. Spectral Mapping. If P is a square complex matrix and f(P) ≡ Σ_{N=0}^∞ a_N P^N is absolutely
convergent, then f(λ) is absolutely convergent for each λ ∈ σ(P) and σ(f(P)) = f(σ(P)).

Proof. By Theorem 1, it suffices to prove the result for the Jordan form of P, and hence for
each of its Jordan blocks J = λI + E, λ ∈ σ(P). Now J^N is upper triangular with common diagonal
element λ^N, so f(J) is upper triangular with common diagonal element f(λ). Thus if f(J)
is absolutely convergent, then f(λ) is absolutely convergent and is the unique eigenvalue of f(J),
so σ(f(J)) = f(σ(J)). ∎

6 MATRIX DERIVATIVES AND INTEGRALS

Often interest centers on a matrix function X(·) = (x_ij(·)) of real-valued functions x_ij on
an interval T of real numbers with X(t) being m × n for each t ∈ T. The matrix derivative of X
at t, denoted (d/dt)X(t) (or Ẋ(t)), is the matrix ((d/dt)x_ij(t)) (or (ẋ_ij(t))) of first-order derivatives of
the x_ij. Similarly, the matrix integral of X, denoted ∫X(t) dt, is the matrix (∫x_ij(t) dt) of integrals
of the x_ij. Matrix-valued functions inherit many properties of real-valued functions. It is
easy to verify the following facts. If X(t) and Y(t) are matrix-valued functions, then

(d/dt)(X(t) + Y(t)) = (d/dt)X(t) + (d/dt)Y(t)  and  ∫(X(t) + Y(t)) dt = ∫X(t) dt + ∫Y(t) dt

provided the matrices are of the same size. Also,

(d/dt)(X(t)Y(t)) = ((d/dt)X(t))Y(t) + X(t)(d/dt)Y(t)

provided the number of columns of X(t) equals the number of rows of Y(t). Finally,

(d/dt)∫_0^t X(u) du = X(t).

7 MATRIX EXPONENTIALS
There is a powerful and useful method of extending the definition of a function f(x) of a variable
x with a Maclaurin expansion to a function f(Q) of a square matrix Q. The method entails
substituting Q^N for x^N for each N in the Maclaurin expansion of f(x), provided that the Maclaurin
expansion of f(Q) is absolutely convergent.
An important example of this method of defining a matrix function is the matrix exponential.
The function e^x of the variable x has the Maclaurin expansion e^x = Σ_{N=0}^∞ (1/N!) x^N. Hence one
can define the matrix exponential e^Q of the square matrix Q by

e^Q ≡ Σ_{N=0}^∞ (1/N!) Q^N

because the Maclaurin series defining e^Q is absolutely convergent. Thus, from the Spectral Mapping
Theorem, σ(e^Q) = e^{σ(Q)}. If also P is a square complex matrix of the same order as Q and
commutes therewith, it follows easily from the Binomial Theorem in §A.4 above that

(1)  e^{P+Q} = e^P e^Q = e^Q e^P.

One consequence is that e^Q is nonsingular and its inverse is e^{−Q} because I = e^0 = e^{Q−Q} = e^Q e^{−Q}.
Also, (d/dt)e^{Qt} = Qe^{Qt} = e^{Qt}Q.
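A quick numerical check of (1) and the derivative rule (with an arbitrary made-up matrix; scipy's expm is used for the matrix exponential):

import numpy as np
from scipy.linalg import expm

Q = np.array([[-1.0, 0.7],
              [0.3, -0.4]])
P = 2.0 * Q + 3.0 * np.eye(2)         # P commutes with Q, so e^{P+Q} = e^P e^Q

print(np.allclose(expm(P + Q), expm(P) @ expm(Q)))        # True

t, h = 1.3, 1e-6                       # (d/dt) e^{Qt} = Q e^{Qt}, checked by a difference quotient
lhs = (expm(Q * (t + h)) - expm(Q * t)) / h
print(np.allclose(lhs, Q @ expm(Q * t), atol=1e-4))        # True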

8 MATRIX DIFFERENTIAL EQUATIONS

Matrix exponentials are important because they provide an elegant and useful tool for studying
the solution of a system of linear differential equations.

Theorem 5. Solution of Matrix Differential Equation. If Q is an m × m real matrix and q is
a continuous function from the nonnegative real line to ℝ^{m×k}, then the system of matrix linear differential
equations

(2)  Ẋ(t) = q(t) + QX(t),  X(0) = X_0

has a unique continuously differentiable solution, viz.,

(3)  X(t) = ∫_0^t e^{Q(t−u)} q(u) du + e^{Qt} X_0.

In particular, if k = m, q = 0 and X_0 = I, then X(t) = e^{Qt}.

Proof. To show that (3) satisfies (2), rewrite (3) as

X(t) = e^{Qt} [∫_0^t e^{−Qu} q(u) du + X_0].

Then differentiate both sides of this equation. Conversely, if X and X̂ are two solutions of (2),
then W ≡ X − X̂ satisfies Ẇ(t) = QW(t), W(0) = 0. Thus

(d/dt)[e^{−Qt} W(t)] = −Qe^{−Qt} W(t) + e^{−Qt} Ẇ(t) = 0

so e^{−Qt} W(t) is constant in t. But that constant is 0 at t = 0, so e^{−Qt} W(t) = 0. Hence W = 0,
and so X = X̂. ∎
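A numerical check of the closed form (3) for a constant forcing term (made-up Q, q and X_0; scipy's solve_ivp integrates (2), and for constant q the integral in (3) has the exact value Q^{−1}(e^{Qt} − I)q):

import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

Q = np.array([[-1.0, 0.5],
              [0.2, -0.8]])
X0 = np.array([1.0, -1.0])
q0 = np.array([1.0, 2.0])              # constant forcing term
t1 = 2.0

sol = solve_ivp(lambda t, x: q0 + Q @ x, (0.0, t1), X0, rtol=1e-9, atol=1e-12)
X_ode = sol.y[:, -1]

# For constant q, the integral in (3) is Q^{-1} (e^{Q t1} - I) q0.
X_formula = np.linalg.solve(Q, (expm(Q * t1) - np.eye(2)) @ q0) + expm(Q * t1) @ X0

print(X_ode, X_formula)                 # the two agree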

The next result is a continuous-time analog of Corollary 2.

Lemma 6. Transient Matrices in Continuous Time. If Q is a square complex matrix, the following
are equivalent.
1° e^{Qt} → 0 as t → ∞.
2° σ_{e^Q} < 1.
3° |e^λ| < 1 for all λ ∈ σ(Q).
4° The real parts of the eigenvalues of Q are negative.

Proof. 1° ⟺ 2°. Since e^{QN} = (e^Q)^N for N = 0, 1, … by (1), the claim follows from Corollary
2 when t runs through the integers. It remains to show that e^{Q[t]} → 0 implies e^{Qt} → 0 where [t]
is the integer part of t. Evidently, e^{Qt} = e^{Q[t]} e^{Q(t−[t])}, so because ‖e^{Qu}‖ ≤ Σ_{N=0}^∞ ‖Q‖^N/N! = e^{‖Q‖}
for 0 ≤ u ≤ 1,

‖e^{Qt}‖ ≤ ‖e^{Q[t]}‖ sup_{0≤u≤1} ‖e^{Qu}‖ ≤ ‖e^{Q[t]}‖ e^{‖Q‖},

from which the claim follows.

2° ⟺ 3°. Immediate from σ(e^Q) = e^{σ(Q)}.
3° ⟺ 4°. Immediate from standard properties of complex exponentials. ∎
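A check of Lemma 6 with numpy and scipy (the matrices are made up): eigenvalues of Q with negative real parts correspond exactly to e^{Qt} → 0.

import numpy as np
from scipy.linalg import expm

def is_transient_ct(Q):
    return np.all(np.linalg.eigvals(Q).real < 0)          # condition 4 of Lemma 6

for Q in (np.array([[-1.0, 0.5], [0.2, -0.8]]),
          np.array([[0.1, 0.0], [0.0, -1.0]])):
    decays = np.linalg.norm(expm(Q * 50.0)) < 1e-6          # condition 1, checked at a large t
    print(is_transient_ct(Q), decays)                       # the two agree for each matrix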
References
Books
Arrow, K. J. (1965). Aspects of the Theory of Risk Bearing. Academic Bookstore, Helsinki, Finland.
Aho, A., J. Hopcroft, and J. Ullman (1974). The Design and Analysis of Computer Algorithms. Addison-
Wesley, Reading, Mass.
Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ.
Bellman, R. E. and S. E. Dreyfus (1962). Applied Dynamic Programming. Princeton University Press,
Princeton, NJ.
Bensoussan, A., E. G. Hurst, Jr., and B. Naslund (1974). Management Applications of Modern Control
Theory. American Elsevier, New York.
Bertsekas, D. P. (2007). Dynamic Programming and Optimal Control. I and II. 3rd ed. Athena Scientific,
Belmont, Massachusetts.
Bertsekas, D. P. and S. E. Shreve (1978). Stochastic Optimal Control: The Discrete Time Case. Academ-
ic Press, San Francisco.
Bertsekas, D. P. and J. N. Tsitsiklis (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Chow, Y. S., H. Robbins and D. Siegmund (1971). Great Expectations: The Theory of Optimal Stopping.
Houghton Mifflin, Boston.
Denardo, E. V. (1982). Dynamic Programming: Models and Applications. Prentice-Hall, Englewood
Cliffs, NJ.
Derman, C. (1970). Finite State Markovian Decision Processes. Academic Press, New York.
Doob, J. L. (1953). Stochastic Processes. Wiley, New York.
Dubins, L. E. and L. J. Savage (1965). How to Gamble if You Must. McGraw-Hill, New York.
Dynkin, E. B. and A. A. Yushkevich (1969). Markov Processes. Translated by J. Wood. Plenum Press,
New York.
Feller, W. (1957). An Introduction to Probability Theory and Its Applications, Vol. 1, Second Ed. Wiley, New
York.
Fleming, W. H. and R. W. Rishel (1975). Deterministic and Stochastic Control. Springer-Verlag, New York.
Gittens, J. C. (1989). Multiarmed Bandit Allocation Indices. Wiley, Chichester.
Halmos, P. (1958). Finite Dimensional Vector Spaces. Van Nostrand, Princeton, NJ.
Hardy, G. H. (1949). Divergent Series. Clarendon Press, Oxford.
Holt, C. C., F. Modigliani, J. F. Muth and H. A. Simon (1960). Planning Production, Inventories, and Work
Force. Prentice-Hall, Englewood Cliffs, NJ.
Hordijk, A. (1974). Dynamic Programming and Markov Potential Theory. Mathematical Centre Tract
No. 51, Mathematical Center, Amsterdam.
Howard, R. A. (1960). Dynamic Programming and Markov Processes. The Technology Press, MIT, Cam-
bridge, Mass.
Howard, R. A. (1971). Dynamic Probabilistic Systems. Vol. II: Semi-Markov and Decision Processes.
Wiley, New York.
Karlin, S. (1959). Mathematical Methods and Theory in Games, Programming, and Economics, Vol. I.
Addison-Wesley, Reading, Mass.
Kato, T. (1966). Perturbation Theory for Linear Operators. Springer-Verlag, New York.
Kemeny, J. G. and J. L. Snell (1960). Finite Markov Chains. Van Nostrand, Princeton, NJ.
Kushner, H. (1967). Stochastic Stability and Control. Academic Press, New York.
Kushner, H. (1971). Introduction to Stochastic Control. Holt, Rinehart, and Winston, New York.
Lee, E. B. and L. Markus (1967). Foundations of Optimal Control Theory. Wiley, New York.
Mandl, P. (1968). Analytical Treatment of One-Dimensional Markov Processes. Springer-Verlag, New
York.
Marschak, J. and R. Radner (1972). Economic Theory of Teams. Cowles Foundation Monograph. New
Haven and London, Yale University Press.
Martin, J. (1967). Bayesian Decision Problems and Markov Chains. Wiley, New York.


Pontryagin, L. S., V. G. Boltyanskii, R. V. Gamkrelidze and E. F. Mishchenko (1962). The Mathematical


Theory of Optimal Processes. Translated from Russian by K. N. Trirogoff and Edited by L. W.
Neustadt. Interscience Publishers, Wiley, New York.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley,
New York.
Ross, S. M. (1983). Introduction to Stochastic Dynamic Programming. Academic Press, New York.
Veinott, A. F., Jr. (1965). Mathematical Studies in Management Science. Macmillan, New York.
Whittle, P. (1983). Optimization Over Time: Dynamic Programming and Stochastic Control. Vols. I &
II. Wiley, New York.

Surveys
Bellman, R. E. and R. Karush (1964). Dynamic Programming: A Bibliography of Theory and Applica-
tion. RM-3951-PR, 140 pp. and RM-3591-1-PR, 142 pp., the RAND Corporation, Santa Monica, CA.
Breiman, L. (1964). Stopping Rule Problems. In E. Beckenbach (ed.). Applied Combinatorial Mathema-
tics. Wiley, New York.
Dantzig, G. B. (1959). On the Status of Multistage Linear Programming Problems. Man. Sci. 6, 1, 53-72.
Dreyfus, S. E. (1969). An Appraisal of Some Shortest-Path Algorithms. Opns. Res. 17, 3, 395-412.
Fleming, W. (1966). Optimal Control of Diffusion Processes. In E. Caianello (ed.). Functional Analysis
and Optimization. Academic Press, New York, 67-84.
Fulkerson, D. (1966). Flow Networks and Combinatorial Operations Research. Amer. Math. Monthly 73,
115-138.
Mandl, P. (1967). Analytical Methods in the Theory of Controlled Markov Processes. In J. Kozesnik
(ed.). Transactions of the Fourth Prague Conference on Information Theory, Statistical Decision
Functions and Random Processes, 1965. Academic Press, New York, 45-53.
Puterman, M. L. (1990). Markov Decision Processes. Chapter 8 in D. P. Heyman and M. J. Sobel (eds.).
Handbooks in Operations Research and Management Science 2, Elsevier (North-Holland), 331-434.
Veinott, A. F., Jr. (1965). Commentary on Part Two: Stochastic Decision Models. In A. F. Veinott, Jr.
(ed). Mathematical Studies in Management Science. Macmillan, New York, 313-321.
Veinott, A. F., Jr. (1974). Markov Decision Chains. In B. C. Eaves and G. B. Dantzig (eds.). Studies in
Optimization. MAA Studies in Mathematics, Vol. 10, Mathematical Association of America, 124-159.
White, D. J. (1985). Real applications of Markov decision processes. Interfaces 15, 73-83.
White, D. J. (1988). Further real applications of Markov decision processes. Interfaces 18, 55-61.
White, D. J. (1993). A Survey of Applications of Markov Decision Processes. Journal of the Operational
Research Society (UK) 44, 11, 1073-1096.

Articles
Balinski, M. (1961). On Solving Discrete Stochastic Decision Problems. Study 2. Mathematica, Princeton,
NJ.
Beale, E. M. L. (1955). On Minimizing a Convex Function Subject to Linear Inequalities. J. Royal Sta-
tist. Soc. Ser. B 17, 2, 173-184.
Bellman, R. E. (1957). A Markovian Decision Process. J. Math. Mech. 6, 5, 679-684.
Bertsimas, D. and J. Niño-Mora (1996). Conservation Laws, Extended Polymatroids and Multiarmed
Bandit Problems; A Polyhedral Approach to Indexable Systems. Math. Operations Res. 21, 2, 257-306.
Blackwell, D. (1962). Discrete Dynamic Programming. Ann. Math. Stat. 33, 2, 719-726.
Blackwell, D. (1967). Positive Dynamic Programming. In Proceedings of Fifth Berkeley Symposium on
Mathematical Statistics and Probability, Vol. I. Univ. of Calif. Press, Berkeley, Calif., 415-418.
Brown, B. (1965). On the Iterative Method of Dynamic Programming on a Finite Space Discrete Time
Markov Process. Ann. Math Stat. 36 4, 1279-1285.
Cantaluppi, L. J. (1984a). Optimality of Piecewise-Constant Policies in Semi-Markov Decision Chains.
SIAM J. Control Optim. 22, 5, 723-739.
Cantaluppi, L. J. (1984b). Computation of Optimal Policies in Discounted Semi-Markov Decision Chains.
OR Spektrum 6, 3, 147-160.

Chatwin, R. E. (1992). Optimal Airline Overbooking. Ph.D. Dissertation, Department of Operations Re-
search, Stanford University, Stanford, CA, 111 pp.
Dantzig, G. B. (1955). Optimal Solutions of a Dynamic Leontief Model with Substitution. Econometrica
23, 3, 295-302.
Dantzig, G. B. (1955). Linear Programming Under Uncertainty. Man. Sci. 1, 3-4, 197-206.
de Ghellinck, G. (1960). Les Problèms de Décisions Séquentielles. Cahiers du Centre d'Etudes de Re-
cherche Opérationnelle 2, 2, 161-179.
Denardo, E. V. (1967). Contraction Mappings in the Theory Underlying Dynamic Programming. SIAM
Review 9, 2, 165-177.
Denardo, E. V. (1969). Computing a Bias-Optimal Policy in Discrete Time Markov Decision. Report No. 9
Dept. of Admin. Sci., Yale Univ., New Haven, 21 pp. (Revised version of Computing (g,w)-Optimal
Policies in Continuous and Discrete Markov Programs, (1968).)
Denardo, E. V. (1970). On Linear Programming in a Markov Decision Problem. Man. Sci. 16, 5, 281-288.
Denardo, E. V. and B. L. Fox (1968). Multichain Markov Renewal Programs. SIAM J. Appl. Math. 16, 3,
468-487.
Denardo, E. V. and B. L. Miller (1968). An Optimality Condition for Discrete Dynamic Programming with
No Discounting. Ann. Math. Stat. 39, 4, 1220-1227.
D'Epenoux, F. (1963). A Probabilistic Production and Inventory Problem. Man. Sci. 10, 1, 98-108.
(Translation of an article published in Revue Francaise de Recherche Opérationnelle 14 (1960)).
Derman, C. (1962). On Sequential Decisions and Markov Chains. Man. Sci. 9, 1, 16-24.
Derman, C. (1963). Stable Sequential Rules and Markov Chains. J. Math. Anal. and Appl. 6, 2, 257-265.
Derman, C. (1964). On Sequential Control Processes. Ann. Math. Stat. 35, 1, 341-349.
Derman, C. and R. Strauch (1965). A Note on Memoryless Rules for Controlling Sequential Control Pro-
cesses. Ann. Math. Stat. 37, 1, 276-278.
Dynkin, E. B. (1963). The Optimum Choice of the Instants for Stopping a Markov Process. Soviet Math.
(English translation of Doklady Akad. Nauk.) 4, 627-629.
Eaton, J. and L. Zadeh (1962). Optimal Pursuit Strategies in Discrete State Probabilistic Systems.
Transactions of ASME Series D, Journal of Basic Engineering 84, 23-29.
Federgruen, A. (1984). Successive Approximation Methods for Solving Nested Functional Equations in
Markov Decision Problems. Math. Opns. Res. 9, 3, 319-344.
Fox, B. L. and D. M. Landi (1968). An Algorithm for Identifying the Ergodic Subchains and Transient
States of a Stochastic Matrix. Commun. Assoc. Comp. Mach. 11, 9, 619-621.
Gittens, J. C. and D. M. Jones (1974). A Dynamic Allocation Index for the Sequential Design of Experi-
ments. In J. Gani, K. Sarkadi and I. Vince (Eds.) Progress in Statistics. European Meeting of Statisti-
cians, 1972, 1, North Holland, Amsterdam, 214-266.
Hordijk, A. and L. C. M. Kallenberg (1984). Constrained Undiscounted Stochastic Dynamic Program-
ming. Math. Opns. Res. 9, 2, 276-289.
Howard, R. A. and J. E. Matheson (1972). Risk Sensitive Markov Decision Processes. Man. Sci. 18, 356-
369.
Kantorovitch, L. (1939). The Method of Successive Approximations for Functional Equations. Acta
Mathematica 71, 63-97.
Katehakis, M. N. and A. F. Veinott, Jr. (1987). The Multi-Armed Bandit Problem: Decomposition and
Computation. Math. Operations Res. 12, 2, 262-268.
Lanery, E. (1967). Etude Asymptotique des Systemes Markoviens à Commande. Révue d'Informatique et
Recherche Opérationnelle 1, 5, 3-56.
Lippman, S. A. (1968). On the Set of Optimal Policies in Discrete Dynamic Programming. J. Math. Anal.
Appl. 440-445.
Lippman, S. A. (1969). Criterion Equivalence in Discrete Dynamic Programming. Opns. Res. 17, 5, 920-
922.
MacQueen, J. and R. Miller, Jr. (1960). Optimal Persistence Policies. Opns. Res. 8, 3, 362-380.
Manne, A. S. (1960). Linear Programming and Sequential Decisions. Man. Sci. 6, 3, 259-267.
Martin-Lof, A. (1967). Optimal Control of a Continuous-Time Markov Chain with Periodic Transition
Probabilities. Opns. Res. 15, 5, 872-881.

Miller, B. L. (1968a). Finite State Continuous-Time Markov Decision Processes with an Infinite Planning
Horizon. J. Math. Anal. Appl. 22, 552-569.
Miller, B. L. (1968b). Finite State Continuous-Time Markov Decision Processes with a Finite Planning
Horizon. Siam J. Control 6, 2, 266-280.
Miller, B. L. and A. F. Veinott, Jr. (1969). Discrete Dynamic Programming with a Small Interest Rate.
Ann. Math. Stat. 40, 2, 366-370.
Ornstein, D. (1969). On the Existence of Stationary Optimal Strategies. Proc. Amer. Math. Soc. 20, 2,
563-569.
Radner, R. (1955). The Linear Team: An Example of Linear Programming Under Uncertainty. Proc. Sec-
ond Symp. Linear Prog. National Bureau of Standards, Washington, 381-396.
Radner, R. (1961). Evaluation of Information in Organizations. In Proc. 4th Berkeley Symp.
Radner, R. (1962). Team Decision Problems. Ann. Math. Stat. 33, 2, 857-881.
Riis, J. (1965). Discounted Markov Programming in a Periodic Process. Opns. Res. 13, 6, 920-929.
Rosenblatt, D. (1957). On the Graphs and Asymptotic Forms of Finite Boolean Relation Matrices and
Stochastic Matrices. NRLQ 4, 2, 151-167.
Ross, I. and F. Harary (1955). Identification of the Liason Persons of an Organization Using the Struc-
ture Matrix. Man. Sci. 1, 251-258.
Rothblum, U. G. (1975a). Multivariate Constant Risk Posture. J. Econ. Theory 10, 3, 309-332.
Rothblum, U. G. (1975b). Algebraic Eigenspaces of Nonnegative Matrices. Lin. Alg. Appl. 12, 281-292.
Rothblum, U. G. (1975c). Normalized Markov Decision Chains I: Sensitive Discount Optimality. Opns.
Res. 16, 785-795.
Rothblum, U. G. (1978). Normalized Markov Decision Chains II: Optimality of Nonstationary Policies.
SIAM J. Control 15, 221-232.
Rothblum, U. G. (1984). Multiplicative Markov Decision Chains. Math. Opns. Res. 9, 1, 6-24.
Rothblum, U. G. and A. F. Veinott, Jr. (1992). Markov Branching Decision Chains: Immigration-Induced
Optimality. Technical Report No. 45, Department of Operations Research, Stanford University, 100 pp.
Rothblum, U. G. and P. Whittle (1982). Growth Optimality for Branching Markov Decision Chains.
Math. Opns. Res. 7, 4, 582-601.
Rykov, V. V. (1966). Markov Decision Processes with Finite State and Decision Spaces. Theory Prob.
Appl. 112, 2, 302-311.
Shapley, L. S. (1953). Stochastic Games. Proc. Nat. Acad. Sci. 39, 1095-1100.
Sladky, K. (1974). On the Set of Optimal Controls for Markov Chains with Rewards. Kybernetika 10,
350-367.
Strauch, R. E. (1966). Negative Dynamic Programming. Ann. Math. Stat. 37, 4, 871-890.
Strauch, R. E. and A. F. Veinott, Jr. (1966). A Property of Sequential Control Processes. RM-4772-PR,
The RAND Corporation, Santa Monica, Calif., 8 pp.
Tardos, É. (1986). A Strongly Polynomial Algorithm to Solve Combinatorial Linear Programs. Opns.
Res. 34, 2, 250-256.
Tarski, A. (1955). A Lattice Theoretical Fixpoint Theorem and Applications. Pac. J. Math. 5, 285-309.
Veinott, A. F., Jr. (1966). On Finding Optimal Policies in Discrete Dynamic Programming with No Dis-
counting. Ann. Math. Stat. 37, 5, 1284-1294.
Veinott, A. F., Jr. (1968). Discrete Dynamic Programming with Sensitive Optimality Criteria. Prelimin-
ary Report. Ann. Math. Stat. 39, 1372.
Veinott, A. F., Jr. (1969a). Discrete Dynamic Programming with Sensitive Discount Optimality Criteria.
Ann. Math. Stat. 40, 5, 1635-1660.
Veinott, A. F., Jr. (1969b). Minimum Concave Cost Solution of Leontief Substitution Models of Multi-fa-
cility Inventory Systems. Opns. Res. 17, 2, 262-291.
Warshall, S. (1962). A Theorem on Boolean Matrices. J. Assoc. Comp. Mach. 9, 11-12.
Wolfe, P. and G. B. Dantzig (1962). Linear Programming in a Markov Chain. Opns. Res. 10, 5, 702-710.
Zachrisson, L. (1964). Markov Games. In M. Dresher L. S. Shapley, and A. W. Tucker (eds.). Advances
in Game Theory. Princeton University Press, Princeton, N. J., 211-253.
Index of Symbols
appearing on more than two successive pages.
Syntax: symbol, page number, meaning.

+, 1, action 5T , 26, spectral radius of T


E= , 1, actions in state = 8-optimal, 57 5$ , 27, spectral radius of T$
5=# , 114, diffusion coefficient
l † l, 18, norm (Tchebychev)
‡, 67, convolution 8-improvement, 57
#
5=+ , 117, diffusion coef., controlled
,8R , 69, binomial coefficient seq. R -period value, 3
", 40, discount factor ?7 7 7
$ , 71, T$ @$
ì , 67, middle element of convolution "8 , 67, binomial sequence of order 8
"8 , 68, binomial immigration order 8 Z$ , 18, discrete-time value of $ _
G=R , 4, R -period cost in state = "R
8 , 67, coefficient R of "8 Z$ , 108, continuous-time value of $ _
."
Z$ , 51, Ð@.$ @$ âÑ
.1 , 26, degree of 1 :Ð>l=ß +Ñ, 1, discrete-time trans. rate @$ , 50, T$ <$ ß 8 œ "; Ð"Ñ8 H$8" <$ , 8   !
8 ‡

.$ , 27, degree of $ _ T$ , 14, transition matrix for $ Z$8 , 55, Ð@$. â @8$ Ñ
., 27, system degree T$R , 14, R >2 power of T$ Z$R ß 15, R -period value of $ _
H, 48, deviation matrix T1R , 14, R -step trans. matrix for 1 Z1R ß 15, R -period value of 1
H$ , 50, deviation matrix for $ T ‡ ß 47, stationary matrix for T Z1R Ð?Ñ, 15, Z1R  T1R ?
$ , 14, decision $ œ Ð$ = Ñ T$> , 100, >-period trans. matrix for $ Z=R , 3, R -period value from =
?, 14, set of decisions T1> , 101, >-period trans. matrix for 1 Z ‡ , 21, maximum value
$ _ , 14, stationary policy Ð$ ß $ ß á Ñ , 69, Ð! T ! T " â Ñ Z1 , 18, discrete time value of 1
?_ ß 14, set of policies $ , 68, Ð! T$! T$" â Ñ Z1 , 108, continuous time value of 1
?8 , 57, 8-optimal decisions 1, 14, policy Z1> ß 102, value of 1 during Ò>ß X Ó
?_ , 51, _-optimal decisions 1P , 71, use first P decisions of 1 Z13 ß 41, present value of 1
ƒ8" , 69, ÐT ‡ H" â Ð"Ñ8 H8" Ñ 1P , 71, exclude first P decisions of 1 Z$3 ß 50, present value of $ _

/U , 124, matrix exp. !_


1P $ , 71, use 1P then $ _ Z1R A ß 64, R -period value of 1 with A
" R
R œ! Rx U Z1R 8 ß 68, R -period value of 1 with "8
;Ð?l=ß +Ñ, 97, cont-time trans. rate Z1R 7 Ð?Ñ, 71, Z1R 7  T1R ?
18#$ ß 53, <#8  U# @$8  @$8" U$ , 50, T$  M discrete time •1 , 68, ÐZ1R Ñ R -period values of 1
†> U$ , 100, cont.-time generator for $ •$ , 68, ÐZ$R Ñ R -period values of $
K#1
> >
‡ ß 102, <#  U# Z1‡  Z 1‡
8
K#$ .
, 55, Ð1#$ 8
â 1#$ Ñ •81 , 68, "8 ‡ •1
57 5 7 <Ð=ß +Ñ, 1, one-period reward
K#$ , 72, ,7 ì K#$
<$ , 14, one-period reward vector for $ A, 29, initial immigration ÐA= Ñ
K#) ß 19ß <#  T# Z)  Z)
V 3 , 49, Ð3M  UÑ" œ !_
<#8 , 53, <# if 8 œ !; 0 if 8 Á ! A, 63, immigration stream ÐAR Ñ
."
K#$ , 53, Ð1.
!
#$ ß 1#$ ßáÑ
V$3 , 50, Ð3M  U$ Ñ" œ !_
R œ! "
R " R
T [ , 75, cum. immigration Ð[ R Ñ
3 _
K#$ , 53, 8œ. 38 18#$ R œ! " R " R
T$ j8 , 76, set of immigration streams
Z, 12, system graph e, 15, optimal-return operator
Z, 102, optimal-return generator eR , 15, R >2 “power” of e B=+ , 31, state-action frequency
¦ , 10, strictly greater than
¤ ß 51, lexico greater than or equal to =, 1, state
W, 1, number of states
.= , 114, drift coefficient f, 1, state space
.=+ , 117, drift coefficient, controlled 5, 27, system spectral radius

Index
+, sequencing, hw2
E= , supply management,
, test a machine to find a fault, hw2
‡ time to reach boundary, hw5
action, 1 transportation,
application, yield of production process,
accumulating wealth, hw0 yield of renewable resource,
airline overbooking, hw2
R
airline reservation, ,7 ,
asset management, ",
baking, hw8 •
Bayesian statistical quality control and repair, hw5 Bayesian, hw5
bridge clearance, hw1 Bellman's equation,
capacity planning, binary preference relation,
capital budgeting, binomial coefficient,
cash management, binomial immigration stream of order 8,
chain selection, binomial sequence of order 8,
component replacement, hw4 binomial theorem,
customers served, bounded,
deliveries on-time, polynomially,
earning and learning, hw9 branching,
exercising a call option, 0 Brownian motion,
exercising a put option, hw1
factory throughput, certainty equivalent,
fatality reduction, Cesàro limit,
house buying, hw4 Cesàro-geometric-overtaking optimal, hw7
house selling, hw9 Cesàro Neumann series,
insurance management, Cesàro-overtaking optimal,
investment, Cholesky factorization,
knapsack problem, choose function,
lot inspection, combinatorial property,
marketing, comparison function,
matrix products (order of multiplying), hw1 present-value,
medical screening, comparison lemma,
multi-armed bandit, R -period-value,
multi-facility linear-cost production planning, hw3 present-value,
network routing, concave function,
patients treated successfully, contraction mapping,
police patrolling, controlled diffusion,
portfolio selection, hw2 convex function,
probability of reaching boundary, convolution,
project scheduling,
queues, .,
tandem, hw5 .$ ,
network of, .1 ,
reliability, H$ ,
requisition processing, hw1 $,
reservoir management, ?,
restart problem, $_,
road maintenance, ?_ ß
rocket guidance, ?8 ,
scheduling, ?_ ,
search for a plane, hw2 ?8„ ,

?„ , combining physical and value,


ƒ8 , cumulative,
decision, order of binomial,
deficient point, physical,
degree of a matrix, present value of,
degree of a policy, 26 value,
degree of a sequence, immigration streams in j8 ,
degree of a system, 27 improves,
determines, 8-,
deviation matrix, strongly,
diffusion, index rule
controlled, Gittens,
diffusion coefficient, multi-armed-bandit,
discount factor, task sequencing, hw2
discovering system boundedness, inflation,
discovering system transience, hw3 instantaneous rate-of-return, 8
drift coefficient, interest rate,
dynamic-programming recursion, negative,
positive,
/U , real after-tax,
eigenvalue, small positive,
eigenvector, zero,
equivalence principle, irredundant dual,
excessive point, irredundant primal,
least,
expansion, Jordan form,
Cesàro Neumann,
Laurent, Laurent expansion,
Maclaurin, Laurent expansion of present value,
Neumann, Laurent expansion of present-value comparison function,
polynomial, Laurent expansion of resolvent,
lexicographic comparison,
fixed point, linear optimal decision function,
linear program,
18#$ ß dual,
K#1
>
ß finding maximum reward-rate with a, hw6
8 finding maximum stopping-value with a,
K#$ ,
57
finding maximum value with a,
K#$ , irredundant dual,
K# Z , irredundant primal,
K#$ , primal,
3
K#$ ,
Z, .= ,
generator for deterministic control problem, .=+ ,
generator for one-dimensional diffusion, minimum balloon payment,
generator matrix, maximum expected instantaneous rate-of-return,
graph, maximum expected multiplicative symmetric utility,
incidence, maximum growth factor,
minimum-cost chain in, maximum R -period value,
system, maximum reward rate,
maximum reward rate by linear programming, hw6
horizon, maximum present value,
finite, maximum principle,
infinite, maximum spectral radius, hw7
rolling, maximum X -period value,
maximum value,
immigration stream, Markov population decision chain,
binomial, finite continuous-time-parameter,
finite discrete-time-parameter,

mathematical program, stopping-, hw4


matrix, strong Cesàro-overtaking-,
characterization of transient, strong present-value-,
degree of, total instantaneous-rate-of-return-,
deviation, value-,
generator, optimal policy with immigration stream
Jordan, Cesàro-overtaking-,
nilpotent, limiting present-value, hw5
nonnegative, overtaking-,
R -step transition, present-value,
permutation, strong present-value,
positive definite, optimal quadratic control,
positive semidefinite, optimal-return generator,
resolvent, optimal-return operator,
stationary,
similarity to a, :Ð>l=ß +Ñ,
stochastic, T$ ,
strictly substochastic, s $,
T
substochastic, T$R ,
symmetric, T1R ,
transient, T ‡ß
transition, T$> ,
upper-triangular, T1> ,
matrix derivative, ,
matrix differential equation,
lT l,
$ ,
matrix exponential,
matrix integral, 1,
matrix norm, 1P ,
multi-armed bandit, 1P ,
1P $ ,
Neumann series, 1> 1‡ ß
Newton’s method, <$8„ ,
linear-time approximation with, hw3 G8$„ ,
R -period value, 3 <$„ ,
G$„ ,
"8 , Perron-Frobenius Theorem, hw7
"R
8, policy,
on-line implementation, cohort,
multi-armed bandit, deterministic,
sequential quadratic unconstrained team, eventually stationary,
optimal policy, initially randomized,
Cesàro-geometric-overtaking-, hw7 Markov,
Cesàro-overtaking-, nonanticipative,
Cesàro-symmetric-multiplicative-utility-growth-, nonstationary,
float-, periodic,
future-value-, piecewise-constant,
growth factor, randomized,
index, hw2 stationary,
instantaneous-rate-of-return-, stopping, hw4
limiting present-value-, hw6 transient,
moment-, policy-improvement method,
myopic stopping, hw4 limiting present-value,
R -period-value-, maximum reward-rate,
R -period-expected-utility-, 8-optimality,
nonexistence of an overtaking-, Newton’s method: a special case of,
overtaking-, simplex method: a special case of,
present-value-, strong,
reward-rate-, polynomially bounded,
spectral-radius-, hw7

present value, sojourn time,


inflation-adjusted, spectral mapping theorem,
Laurent expansion of, spectral radius,
maximum, spectrum,
strong maximum, state, 0
principle of optimality, state-action frequency,
program, stationary matrix,
linear, stochastic program,
mathematical, stopping policy, hw4
quadratic, stopping problem, hw4
stochastic, simple, hw4
unconstrained, stopping time,
stopping value, hw4
;Ð?l=ß +Ñ, successive approximations,
U$ , linear-time approximation with, hw3
quasiorder, system, 1
bounded, 46
<Ð=ß +Ñ, 1 circuitless, 13
<$ , deterministic, 49
<$8 , discovering boundedness of a, hw7
V$3 , discovering transience of a, hw3
e, irreducible, hw7
deficient point of, normalized, 27
excessive point of, polynomially bounded,
fixed point of, stochastic, 2
eR , strictly substochastic,
random time, substochastic, 2
recursion, transient, 19
backward, system degree,
forward, system graph, 12
resolvent, strongly connected, hw7
Laurent expansion of, system properties,
restart problem,
reward, 1 team decision problem,
reward-rate, unconstrained,
reward-rate vector, quadratic convex,
reward vector, normally distributed observations, hw8
Ricatti recursion, sequential,
risk aversion, transient policy,
risk posture, transient system,
constant additive, transition matrix,
constant multiplicative, transition rate, 1
risk preference,
?7
$ ,
=, utility function,
W, 1 additive,
f, multiplicative,
5, symmetric,
5T ,
5$ , @8$ ,
5=# , Z$8 ,
#
5=+ , Z$ ,
series, Z$R ß
Laurent, Z1R ß
Maclaurin, Z1> ß
Neumann, Z1R 3 ß
stochastic constraint, R3
Zs1 ß
similar,
Z1R Ð?Ñ,
>
Zs$ Ð@Ñß present,
Z=R , stopping,
Z ‡, terminal,
3 transient,
Z$ ß
Z13 ß
Z13A ß A,
Z$R A ß A8 ,
Z1R A ß [,
Z1R 8 ß [ 8,
Z1R 8 Ð?Ñ, j8 ,
•$ , Wiener process,
•1 ,
•81 , B=+ ,
value,
future,
R -period,

Acknowledgement. Mai Vu suggested the creation of this index. It is a pleasure to acknowl-


edge her encouragement and substantial collaboration in its preparation.
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 1 Due April 11

1. Exercising a Put Option. Consider the problem of deciding when to exercise a put option
to sell a stock at the strike price =‡  ! during any of the next R days. Suppose that if the price
of the stock on one day is =   !, then the price the next day is =V where the return V is a ran-
dom variable. Assume that the returns on successive days are independent and have the same dis-
tribution as a nonnegative random variable V whose expected value is finite. The goal is to maxi-
mize the expected net revenue from an option to sell the stock any time in the next R days at
the strike price when the price with R days to expiration is =. Assume that if one exercises the
option to sell the stock on a day at the strike price, then one first buys the stock that day at the
market price. Let < ´ V  " be the rate of return.
(a) Dynamic-Programming Recursion. Write down a dynamic-programming recursion for the
maximum expected net revenue Z=R when the price with R days to expiration is =.
(b) Nonpositive Expected Rate of Return. Show that if E< Ÿ !, then it is optimal not to
exercise the option before expiration.
(c) Positive Expected Rate of Return. Determine the optimal exercise policy with $N$ days to
expiration by induction on $N$ where $Er > 0$. [Hint: First show that $V_0^N = s^*$ for each $N = 0, 1, \ldots$.
Next observe that if the stock price falls from a positive level to $0$, which occurs with probability
$1 - p = \Pr(R = 0)$, then the stock price can never become positive again. For this reason, it
suffices to consider the case of positive stock prices $s > 0$. For that case, rewrite the dynamic-
programming recursion in (a) to reflect this fact and then make the change of variables
$W_s^N \equiv s^{-1}\big(V_s^N - (s^* - s)\big)$.]

2. Requisition Processing. Requisitions arrive independently at a data-processing center in


periods "ß á ß R . At the end of each period, the center manager decides whether to process any
requisitions. If the manager decides to process in a period, he processes all previously unprocessed
requisitions arriving in that and all prior periods. The manager must process requisition in a per-
iod in which the number of unprocessed requisitions exceeds Q  !. The cost of processing a
batch of requisitions of any size is O  ! and the processing time is negligible. The waiting cost
incurred in each period that a requisition awaits processing but is not processed is A  !. All
requisitions, if any, are processed in period R . The problem is to determine the periods in which
to process that minimize the combined expected costs of processing and waiting over R periods.
Suppose that :3 is the known probability that 3 œ 0ß "ß á ß , requisitions arrive in a period where
!,3œ! :3 œ 1. The expected number of requisitions arriving in a period is positive.

(a) Dynamic-Programming Recursion. Write down a dynamic-programming recursion for


the minimum expected costs that permits an optimal R -period processing policy to be deter-
mined.
(b) Optimality of Processing-Point Policy. Show that one optimal processing policy in per-
iod 8 has the following form: if = is the number of requisitions accumulated before processing at
the end of period 8, process in period 8 if =  =8 and wait (do not process) if = Ÿ =8 . Show how
to determine the constants =" ß á ß =R . [Hint: Prove these facts by induction on 8 by showing that
the minimum expected cost in periods 8ß á ß R is increasing in =.]
(c) Optimal Preplanned Processing Periods. Answer part (a) under the hypothesis that all
the processing times must be chosen before any requisitions arrive, that no requisitions are on hand
initially, and that Q œ _. [Hint: It suffices to consider only R states.]
(d) Comparison of Policies in Parts (b) and (c). Will the minimum expected cost in part
(a) (assuming no requisitions are on hand initially and Q œ _) be larger or smaller than that
in part (c), or can one say? What administrative and/or computational advantages does the pol-
icy in part (c) have over that in part (b)?

3. Matrix Products. Let Q" ß á ß Q8 be matrices with Q3 having <3" rows and <3 columns
for 3 œ 1ß 2ß á ß 8 and some positive integers <! ß á ß <8 . The problem is to choose the order of mul-
tiplying the matrices that will minimize the number of multiplications needed to compute the prod-
uct Q" Q# â Q8 . Assume that matrices are multiplied in the “usual” way.
(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion for finding
the optimal order in which to multiply the matrices that requires SÐ8$ Ñ operations.
(b) Example. Find the optimal order in which to multiply the matrices where 8 œ 4 and
Ð<! <" <# <$ <% Ñ œ Ð10 30 70 2 100Ñ.
(c) Comparison of Minimum and Maximum Number of Multiplications. Also solve the
problem of part (b) where the objective is instead to maximize the number of multiplications.
What is the ratio of the maximum to minimum number of multiplications?

4. Bridge Clearance. Consider a network of W cities. Between each pair Ð=ß >Ñ of cities, there
is a road having a bridge whose clearance height is 2Ð=ß >Ñ   0 with 2Ð=ß >Ñ œ 2Ð>ß =Ñ. A trucking
firm wants to determine a route from each city to city W that maximizes the minimum bridge
height along the route subject to the restriction that at most R cities are visited enroute (exclud-
ing the initial city). Give a dynamic-programming recursion for solving the problem.
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 1 Due April 11

1. Exercising a Put Option. Suppose that the (nonnegative) market price of a stock on any
day is a product of its price on the preceding day and its return that day. Assume that the returns
on successive days are independent and have the same distribution as another nonnegative random
variable V whose expectation is finite. Call < ´ V  " the rate of return. Let Z=R be the maximum
expected revenue from a put option to sell a stock at the strike price =‡  ! when the market price
R days before expiration of the option is =   !.
(a) Dynamic-Programming Recursion. The $V_s^N$ satisfy the dynamic-programming recursion

(1) $V_s^N = \max\big(s^* - s,\; E V_{sR}^{N-1}\big)$

for $N = 1, 2, \ldots$ where $V_s^0 = (s^* - s)^+$ for all $s \ge 0$.


(b) Nonpositive Expected Rate of Return. Suppose $Er \le 0$, so $ER \le 1$. We claim that it is
optimal not to exercise the option until the expiration day, and then only if the price that day is
below the strike price. It suffices to show by induction that $V_s^N = E V_{sR}^{N-1}$ for $N > 0$ and for all
$s$. To see this, observe from (1) for $N > 1$ and by definition for $N = 1$ that $V_s^{N-1} \ge s^* - s$ for
each $s \ge 0$. Thus, because $ER \le 1$, it follows that $E V_{sR}^{N-1} \ge s^* - sER \ge s^* - s$ for each $s \ge 0$.
Thus, $V_s^N = E V_{sR}^{N-1}$ for $N > 0$ as claimed.
(c) Positive Expected Rate of Return. Suppose $Er > 0$, so $ER > 1$. For this part, it is help-
ful to note that once one reaches the market price $s = 0$, the subsequent price $Rs$ is also zero, so
the only market price accessible from $0$ is $0$. Hence, when the market price is zero, it is optimal
to exercise the put option to sell the stock at the strike price $s^*$, so $V_0^N = s^*$ for $N = 0, 1, \ldots$. It
is convenient to restrict attention to positive market prices in the sequel. To that end, let $p =
\Pr(R > 0)$ and, for any random variable $W$, let $E^+ W \equiv E(W \mid R > 0)$. Then $E V_{sR}^{N-1} = (1-p)s^* +
p E^+ V_{sR}^{N-1}$, so the restriction of (1) to positive market prices can be rewritten as

(2) $V_s^N = \max\big(s^* - s,\; (1-p)s^* + p E^+ V_{sR}^{N-1}\big)$

for $s > 0$ and $N = 1, 2, \ldots$ where $V_s^0 = (s^* - s)^+$ for all $s > 0$.


Subtract $s^* - s$ from both sides of (2) and make the change of variables $U_s^N = V_s^N - (s^* - s)$
for $N = 0, 1, \ldots$. Then (2) reduces to the equivalent system

$U_s^N = \max\big(0,\; -sEr + p E^+ U_{sR}^{N-1}\big)$

for $s > 0$ and $N = 1, 2, \ldots$ where $U_s^0 = (s - s^*)^+$ for all $s > 0$. Now make the second change of
variables $W_s^N = \tfrac{1}{s} U_s^N$ for $s > 0$. Then the above equation becomes

(2)$'$ $W_s^N = \max\big(0,\; -Er + p E^+ R\, W_{sR}^{N-1}\big)$


for $s > 0$ and $N = 1, 2, \ldots$ where $W_s^0 = (1 - s^*/s)^+$ for $s > 0$. We claim that for each $N$, $W_s^N$ is
increasing and continuous in $s > 0$ with $\lim_{s \uparrow \infty} W_s^N = 1$ and $\lim_{s \downarrow 0} W_s^N = 0$. Certainly that is so
for $N = 0$. Suppose it is so for $N-1$ and consider $N$. Then $p E^+ R\, W_{sR}^{N-1}$ is increasing and contin-
uous in $s > 0$, and converges to $ER$ as $s \uparrow \infty$ and to $0$ as $s \downarrow 0$, whence the claim follows. Conse-
quently, there is a largest $s = s_N$ such that $-Er + p E^+ R\, W_{sR}^{N-1} \le 0$. Thus, the maximum on the
right side of (2)$'$ is $-Er + p E^+ R\, W_{sR}^{N-1}$ if $s > s_N$ and $0$ if $s \le s_N$. Since (2) and (2)$'$ are equivalent,
it follows that this rule is optimal with (2) as well, i.e., it is optimal to wait if $s > s_N$ and to exer-
cise the option if $s \le s_N$. In short, if the expected rate of return is positive and if $N$ days remain
until expiration of the option, then there is an exercise price $s_N$ such that it is optimal to wait if
the price is above $s_N$ and to exercise the option if the price is below $s_N$.
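For concreteness, here is a small computational sketch of recursion (1) (not part of the original solution; the strike price and the two-point return distribution are illustrative assumptions). It evaluates $V_s^N$ by backward induction and checks at a few prices whether immediate exercise attains the maximum, as the threshold rule of part (c) predicts.

from functools import lru_cache

s_star = 100.0                       # strike price s* (assumed)
returns = [(0.9, 0.5), (1.2, 0.5)]   # (R, probability); ER = 1.05 > 1, so Er > 0

@lru_cache(maxsize=None)
def V(N, s):
    # Maximum expected net revenue with N days to expiration and market price s.
    if N == 0:
        return max(s_star - s, 0.0)                       # V_s^0 = (s* - s)^+
    cont = sum(p * V(N - 1, round(s * R, 8)) for R, p in returns)
    return max(s_star - s, cont)                          # recursion (1)

def exercise_now(N, s):
    # True iff immediate exercise attains the maximum in (1), i.e., s <= s_N.
    cont = sum(p * V(N - 1, round(s * R, 8)) for R, p in returns)
    return s_star - s >= cont

for s in [60.0, 80.0, 90.0, 100.0]:
    print(s, round(V(5, s), 2), exercise_now(5, s))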

2. Requisition Processing. The state of the system is the number of requisitions on hand at
the end of a period before processing and the actions are to process or to wait.
(a) Dynamic-Programming Recursion. Let $C_s^n$ be the minimum expected cost in periods
$n, \ldots, N$ when there are $s$ requisitions on hand at the end of period $n$ before processing in that
period. Then $C_s^N = 0$ or $K$ according as $s = 0$ or $s > 0$. Also, for each $1 \le n < N$,

$C_s^n = \min\Big[\, K + \sum_t p_t C_t^{n+1},\;\; ws + \sum_t p_t C_{t+s}^{n+1} \,\Big]$ for $0 \le s \le M$

and

$C_s^n = K + \sum_t p_t C_t^{n+1}$ for $M < s$,

where the first and second terms $g_n$ and $w_{ns}$ in brackets on the right-hand side of the first
equation correspond respectively to processing and to waiting.
(b) Optimality of Processing-Point Policy. We show first that $C_s^n$ is increasing in $s$ by in-
duction on $n$. This is surely true for $n = N$. Thus suppose it is so for $n+1, \ldots, N$ and consider $n$.
Then since $C_{t+s}^{n+1}$ is increasing in $s$ for each $t$ and $ws$ is increasing in $s$, $w_{ns}$ is increasing in $s$. Thus,
since $g_n$ is independent of $s$, $C_s^n = g_n \wedge w_{ns} \le g_n$ is increasing in $s$.
Now let $s_N = 0$ and for each $1 \le n < N$, let $s_n$ be the largest integer $s$ for which $w_{ns} \le g_n$.
Since $w_{ns}$ is increasing in $s$, it follows that $w_{ns} \le g_n$ if and only if $s \le s_n$. Thus one optimal policy
in period $n$ is to wait if $s \le s_n$ and to process if $s_n < s$.
(c) Optimal Preplanned Processing Periods. Let $C_n$ be the minimum expected cost in per-
iods $1, \ldots, n$ when the processing times must be chosen before any requisitions arrive, there are
no requisitions on hand initially, $M = \infty$, and one processes in period $n$. Let $r = \sum_t t p_t$ be the
expected number of requisitions arriving in a period. Without loss of generality, assume that
$r > 0$ because in the contrary event there are never any requisitions to process. The expected
processing and waiting costs $c_i$ incurred when all requisitions arriving in $i$ consecutive periods are
processed in the last of those periods is

$c_i = K + wr \sum_{j=1}^{i} (j-1) = K + wr\, \frac{i(i-1)}{2}.$

Then the $C_n$ may be found from the (deterministic) forward equations ($C_0 \equiv 0$)

$C_n = \min_{0 \le m < n} \big[\, C_m + c_{n-m} \,\big], \qquad n = 1, \ldots, N.$

Let $m_n$ be the least $m$ minimizing the right-hand side of this equation. Then $m_n$ is the next-to-last
optimal period in which to process assuming that one processes in period $n$. Thus since one pro-
cesses last in period $N$, one processes next to last in period $m_N$, 2nd from last in period $m_{m_N}$, etc.
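The forward equations above are easy to carry out; the sketch below is an illustrative implementation (the values of K, w, r and N are assumptions, not data from the problem). It evaluates $c_i = K + wr\,i(i-1)/2$, runs the forward recursion, and backtracks to recover the preplanned processing periods.

K, w, r, N = 10.0, 1.0, 2.0, 8        # setup cost, waiting cost, mean arrivals, horizon (assumed)

c = [0.0] + [K + w * r * i * (i - 1) / 2 for i in range(1, N + 1)]   # c_i

C = [0.0] * (N + 1)      # C_n: minimum expected cost of periods 1..n given processing in period n
m_best = [0] * (N + 1)   # least minimizing m (next-to-last processing period before n)
for n in range(1, N + 1):
    C[n], m_best[n] = min((C[m] + c[n - m], m) for m in range(n))

periods, n = [], N        # backtrack: process in N, then m_N, then m_{m_N}, ...
while n > 0:
    periods.append(n)
    n = m_best[n]
print('minimum expected cost:', C[N], 'process in periods:', sorted(periods))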
(d) Comparison of Policies in Parts (b) and (c). Since the optimal solution in part (b)
allows the decision maker to base his decisions on the information accumulated whereas that in
part (c) does not, the minimum expected cost in part (b) will not exceed that in part (c). On
the other hand, the solution in part (c) has the advantages that it requires less computational
effort, it utilizes deterministic rather than random processing times (which makes scheduling
easier), and it does not require one to keep records on the number of requisitions remaining
unprocessed at the end of each period.
We now give a more precise comparison of the computational effort required to find the op-
timal policies in parts (b) and (c) under the assumption that Q œ S(R ), as seems reasonable to
make the comparison fair. The numbers of comparisons, additions and multiplications required
to find the policies in parts (b) and (c) without further exploiting the special structure of the
problem is summarized in the table below.

Comparisons Additions Multiplications


(b) S(R # ) S(R # ,) S(R # ,)
(c) S(R # ) S(R # ) S(R )

Comparison of Computational Effort to Find Solutions in (b) and (c)

Observe that finding the solution in (b) requires the same order of comparisons, S(,) times as
many additions and S(R ,) times as many multiplications when compared to the number of
operations required for the solution in (c). The factor S(,) reduction in additions arises because
the solution in (c) depends only on the expected number < of requisitions arriving in a period
(which we take to be given) whereas that in (b) depends on the number ," of elements in the
support of the distribution of the number of requisitions arriving in a period.

3. Matrix Products. Every intermediate matrix product must be of the form $M_{i+1} \cdots M_k$, and
that matrix product must be a product of $M_{i+1} \cdots M_j$ and $M_{j+1} \cdots M_k$ for some $i < j < k$. Let $m_{ik}$
be the minimum number of multiplications needed to compute $M_{i+1} \cdots M_k$. Thus, the minimum
number of multiplications needed to compute $M_{i+1} \cdots M_j$ and $M_{j+1} \cdots M_k$ is $m_{ij} + m_{jk}$. Also, the
number of multiplications needed to compute the product of those two matrices is $r_i r_j r_k$. Since $j$
should be chosen optimally, one has the branching recursion ($m_{i,i+1} = 0$ for $0 \le i < n$)

$m_{ik} = \min_{i < j < k} \big[\, m_{ij} + m_{jk} + r_i r_j r_k \,\big]$ for $0 \le i < k \le n$.
(b) Example. With $(r_0\ r_1\ r_2\ r_3\ r_4) = (10\ 30\ 70\ 2\ 100)$:

$m_{02} = r_0 r_1 r_2 = 21000$
$m_{13} = r_1 r_2 r_3 = 4200$
$m_{24} = r_2 r_3 r_4 = 14000$
$m_{03} = \min[\, m_{01} + m_{13} + r_0 r_1 r_3,\; m_{02} + m_{23} + r_0 r_2 r_3 \,] = \min[\, 4200 + 600,\; 21000 + 1400 \,] = 4800.$
$m_{14} = \min[\, m_{12} + m_{24} + r_1 r_2 r_4,\; m_{13} + m_{34} + r_1 r_3 r_4 \,] = \min[\, 14000 + 210000,\; 4200 + 6000 \,] = 10200.$
$m_{04} = \min[\, m_{01} + m_{14} + r_0 r_1 r_4,\; m_{02} + m_{24} + r_0 r_2 r_4,\; m_{03} + m_{34} + r_0 r_3 r_4 \,]$
$\qquad = \min[\, 10200 + 30000,\; 21000 + 14000 + 70000,\; 4800 + 2000 \,] = 6800.$

The minimum-multiplication order of matrix multiplication is thus $(M_1(M_2 M_3))M_4$.
(c) Comparison of Minimum and Maximum Number of Multiplications. If $M_{ik}$ is the max-
imum number of operations required to compute the matrix product $M_{i+1} \cdots M_k$, then $M_{ik}
= m_{ik}$ for $k - i \le 2$. Also $M_{03} = 22400$, $M_{14} = 224000$, and

$M_{04} = \max\big[\, 224000 + 30000,\; 21000 + 14000 + 70000,\; 22400 + 2000 \,\big] = 254000.$

The maximum-multiplication order of matrix multiplication is thus $M_1(M_2(M_3 M_4))$. Also the
ratio of the maximum to the minimum number of multiplications is

$M_{04}/m_{04} = 254000/6800 \approx 37.35.$
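The branching recursion of part (a) is the standard matrix-chain-order computation; the short sketch below reproduces the example of part (b) (only the dimension vector is taken from the example; the implementation itself is an illustration, not code from the course).

r = [10, 30, 70, 2, 100]             # (r0, r1, r2, r3, r4) from part (b)
n = len(r) - 1

m = [[0] * (n + 1) for _ in range(n + 1)]        # m[i][k]: min multiplications for M_{i+1}...M_k
split = [[0] * (n + 1) for _ in range(n + 1)]    # optimal j at which to split the product
for length in range(2, n + 1):
    for i in range(n - length + 1):
        k = i + length
        m[i][k], split[i][k] = min(
            (m[i][j] + m[j][k] + r[i] * r[j] * r[k], j) for j in range(i + 1, k))

def order(i, k):
    # Parenthesization of M_{i+1}...M_k achieving m[i][k].
    if k - i == 1:
        return 'M%d' % k
    j = split[i][k]
    return '(' + order(i, j) + order(j, k) + ')'

print(m[0][n], order(0, n))          # 6800 and ((M1(M2M3))M4), agreeing with part (b)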

4. Bridge Clearance. Let $V_s^n$ be the maximum of the minimum bridge heights along routes
from $s$ to $S$ among those that visit at most $n$ cities. Set $V_S^n \equiv \infty$ for $n = 1, 2, \ldots$. Then

$V_s^1 = h(s, S)$

and

$V_s^n = \max_{t \ne s}\; h(s,t) \wedge V_t^{n-1}$

for $s = 1, \ldots, S-1$ and $n = 2, \ldots, N$.
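A direct implementation of this maximin recursion is below (the clearance matrix is assumed example data; only the recursion comes from the answer). City S is represented by the last index.

INF = float('inf')
h = [[INF, 12, 9, 7],                # h[s][t]: bridge clearance between cities s and t (assumed)
     [12, INF, 11, 8],
     [9, 11, INF, 10],
     [7, 8, 10, INF]]
S = len(h)                           # cities 0, ..., S-1; city S-1 plays the role of city S
N = 3                                # at most N cities visited en route

V = [[0.0] * S for _ in range(N + 1)]    # V[n][s]: max over routes of the min clearance, <= n cities
for s in range(S):
    V[1][s] = h[s][S - 1]
V[1][S - 1] = INF                        # V_S^n is set to infinity
for n in range(2, N + 1):
    V[n][S - 1] = INF
    for s in range(S - 1):
        V[n][s] = max(min(h[s][t], V[n - 1][t]) for t in range(S) if t != s)

print([V[N][s] for s in range(S - 1)])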
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr. rev 4:15 pm 4/13/2008

Revised Homework 2 Due April 18


(Do any three problems—problems 2-4 if you took MS&E 251.)

1. Dynamic Portfolio Selection with Constant Multiplicative Risk Posture. A financial


planner seeks a strategy for dynamically revising a portfolio of Q securities that she manages
for a client to maximize the expected utility of distributions to him over 8 years. At the begin-
ning of each year R  8, the planner observes the market value (in dollars) =  ! of her client’s
portfolio, distributes a portion :! = to him in cash, and invests fractions :" ß á ß :Q of the re-
maining funds Ð"  :! Ñ= in securities "ß á ß Q . The distribution fraction :! and portfolio : ´
Ð:" ß á ß :Q Ñ satisfy !  :  ! Ÿ ", :   ! and !Q3œ" :3 œ ". At the beginning of the last year 8,
the planner distributes the entire market value of the portfolio in cash. There are no transaction
costs for buying or selling securities. Each dollar she invests in security 3 in year R is worth
V3R  ! dollars at the end of the year. The joint distribution of V R ´ ÐV"R ß á ß VQ
R
Ñ in each year
R is known, the random vectors V" ß á ß V 8 are independent, and V3R and ln V3R have finite ex-
pectations for all 3ß R . Her client’s utility of distributing A" ß á ß A8   ! to him in years "ß á ß 8
is !8R œ" ln AR .
(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion satisfied by
the maximum expected utility Z R Ð=Ñ of distributions to the client in years R ß á ß 8 when the mar-
ket value of the client’s portfolio at the beginning of year R is =.
(b) Maximum Expected Utility has Constant Multiplicative Risk Posture. Show by induc-
tion on R that Z R Ð=Ñ œ +R ln =  ,R for some constants +R and ,R , and show how to evaluate
the constants.
(c) Optimal Distribution Fraction and Portfolio. Conclude that the optimal policy in year
R  8 is to choose the distribution fraction :! œ "ÎÐ8R "Ñ (the reciprocal of the number of
years remaining) and portfolio : that maximizes E lnÐ:V R Ñ (invest myopically). Note that both
:! and : are independent of the value of the portfolio at the beginning of year R .

2. Airline Overbooking. An airline seeks a reservation policy for a flight with W seats that
maximizes its expected profit from the flight. Reservation requests arrive hourly according to a
Bernoulli process with : being the probability of a reservation request during an hour. A pass-
enger with a reservation pays the fare 0  ! at flight time. If ,   ! passengers with reservations
are denied boarding at flight time, they do not pay the fare and the airline pays them a penalty
-Ð,Ñ (divided among them) where - is increasing on the nonnegative integers and -Ð!Ñ œ !. Con-
sider the R >2 hour (R  !Ñ before flight time. At the beginning of the hour, the airline reviews
the number of reservations on hand and decides whether to book (accept) or decline any reserva-
tion request during the hour. Each passenger with a reservation after any reservation request is
booked or declined, but before cancellations, independently cancels the reservation with proba-

bility ; during the hour. For this reason, the airline is considering the possibility of overbooking
the flight to compensate for cancellations. Let Z<R be the maximum expected future profit when
< seats have been booked before reservation requests and cancellations during the hour. Let @<R be
the maximum expected future profit when < seats have been booked after booking or declining
any reservation request, but before cancellations, during the hour
(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion from which an
optimal reservation policy can be determined.
(b) Optimality of Booking-Limit Policies. Assume, as can be shown, that if 1 is quasicon-
cave on the integers, then E1ÐF< Ñ is quasiconcave in < where F< is a sum of < independent iden-
tically distributed Bernoulli random variables. Use this result to show the following facts. First,
@<R is quasiconcave in <, i.e., there is a nonnegative integer booking limit ,R such that @<R is in-
creasing in < on [0ß ,R ] and decreasing in < on [,R ß _Ñ. Second, Z<R is quasiconcave in < and
achieves its maximum at < œ ,R . Third, a booking-limit policy is optimal, i.e., accept a reserva-
tion request when R hours remain before flight time if and only if <  ,R .
(c) Solving the Problem with MATLAB. Assume that -Ð,Ñ œ 0 , . Graph Z<R for ! Ÿ R Ÿ $!
and ! Ÿ < Ÿ #! by using MATLAB1 with the following parameter values: W œ "!, 0 œ $300  J ,
: œ .# and .3, and ; œ .05 and .10, where J is the sum of the digits representing the positions in
the alphabet of the first letters of your first and last names. For example, Arthur Veinott would
have F = 1 + 22 = 23 because A is the first and V is the 22nd letter of the alphabet. In each case,
estimate the booking limit ten hours before flight time from your graphs. Discuss whether your
graphs confirm the claim in (b) that Z<R is quasiconcave in <. What conjectures do the graphs
that you found suggest about the optimal reservation policy and/or maximum expected reward
and their variation with the various data elements? You will lose points on your conjectures
only if your graphs are inconsistent with or do not support your conjectures or if you don’t make
enough interesting conjectures. The idea here is to brainstorm intelligently.

3. Optimal Capacitated Supply Policy. A manager seeks a supply policy for a single prod-
uct that minimizes the expected costs of ordering, storage and shortage over 8 periods. The de-
mands H" ß á ß H8 for the product in periods "ß á ß 8 are independent real-valued random varia-
bles. At the beginning of period R , the supply manager observes the possibly negative initial stock
B − d on hand. If B  !, B units are backlogged. The supply manager then orders a nonnegative

1To develop the graphs required for this part and the next one, you need to run the MATLAB m-file
airlin27.m. This file calls the files finhor3r.m and booklim3.m. To do this, download these files from the
MS&E 351 course directory. MATLAB is available on the UNIX clusters in Terman and Gates. Printers
are available in both places for a fee. You can also run MATLAB remotely. Of course, if you own MAT-
LAB, you can run it on your personal computer as well.

quantity at unit cost -   ! with immediate delivery, bringing the starting stock after receipt of
the order to C, B Ÿ C Ÿ B  ? where ?  ! is the given capacity. Unsatisfied demands are back-
logged so the stock on hand at the end of period R is C  HR . There is a convex storage and
shortage cost function 1ÐDÑ where D is the (possibly negative) stock on hand at the end of a per-
iod. Assume that KR ÐCÑ ´ E1ÐC  HR Ñ is finite for all C , KR ÐCÑ Ä _ as C Ä _ and -C  KR ÐCÑ
Ä _ as C Ä _. Let GR ÐBÑ be the minimum expected cost in periods R ß á ß 8 starting with
the initial stock B in period R . Assume that GR ÐBÑ is finite for all B. The terminal cost G8" Ð † Ñ
at the end of period 8 is bounded below. Assume, as can be shown, that all convex functions
arising in this problem are continuous and that convex functions are closed under addition and
positive scalar multiples.

(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion for calculat-


ing GR ÐBÑ.

(b) Convexity of GR ÐBÑ. Show that if the terminal cost is convex, then GR ÐBÑ is convex
and bounded below in B. ÒHint: Use part (d) below.Ó

(c) Optimality of Constrained Base-Stock Policies. Show that one optimal policy CR ÐBÑ in
‡ ‡
period R has the form CR ÐBÑ œ ÐCR ” BÑ • ÐB  ?Ñ for some real number CR .

(d) Projections of Convex Functions. Suppose that 0 ÐBß CÑ is a real-valued convex function
on a nonempty closed convex set [ © d 5 and let J ÐBÑ ´ infÐBßCÑ−[ 0 ÐBß CÑ. Show that if J is
finite on \ ´ ÖB À ÐBß CÑ − [ ×, then J is convex thereon.

4. Sequencing: Optimality of Index Policies. Consider a sequence "ß á ß 8 of 8 tasks. When


a task is executed, it succeeds or fails. Given a permutation of the tasks, execute them in turn
until some task succeeds or until all tasks fail, whichever occurs first. The goal is to choose the
order of executing the tasks that minimizes the resulting expected cost. Let 5 œ Ð5" ß á ß 58 Ñ be a
permutation of the tasks. Assume that the cost -53 of executing task 53 depends on 53 , but not
the other tasks. Let :53 |53 be the conditional probability that task 53 fails given that the tasks
53 ´ Ð5" ß á ß 53" Ñ fail. Assume that :53 |53 has the property that "  :53 |53 œ 053 † 153 for some
positive functions 0 and 1. Let :53 be the (unconditional) probability that the tasks 53 fail. As-
sume that :53 is symmetric, i.e., is independent of the order of executing tasks 5" ß á ß 53" .

(a) Optimality of Index Policies by Interchanging Tasks. Show that there is an index func-
tion MÐ † Ñ such that it is optimal to choose any permutation 5 œ Ð5" ß á ß 58 Ñ for which MÐ53 Ñ is
increasing in 3. [Hint: For an arbitrary permutation 5 œ Ð5" ß á ß 58 Ñ of the tasks, consider when
it is optimal not to interchange the order of executing tasks 53 and 53" .]
Show how to apply the result of problem (a) to solve each of the following two problems.

(b) Testing a Machine to Find a Fault. Suppose that a machine does not operate properly
and 8 independent tests "ß á ß 8 can be performed to find the cause. Test 3 costs -3 and has prob-
ability :3 of failing to find the cause given that the tests preceding it fail to find the cause. Once
the cause is found, no further tests are executed. All tests may fail to find the cause. What is the
minimum-expected-cost sequence to carry out the tests?
(c) Search for a Plane. Suppose a plane is down in one of 8 locations "ß á ß 8 with known pos-
itive probabilities ;" ß á ß ;8 , !3 ;3 œ ". Denote by >3 the time to search location 3 to find the plane
or determine that it is not there. What search sequence minimizes the time to find the plane?
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 2 Due April 18

1. Dynamic Portfolio Selection with Constant Multiplicative Risk Posture.


(a) Dynamic-Programming Recursion. The dynamic-programming recursion is

(1) $V^N(s) = \max_{0 < p_0 < 1,\; p \in P} \big\{ \ln(sp_0) + E\, V^{N+1}\big(s(1-p_0)\,pR^N\big) \big\}, \qquad N = 1, \ldots, n-1,\ \ s > 0,$

where $P$ is the set of row $M$-vectors $\{p : \sum_{i=1}^{M} p_i = 1,\ p \ge 0\}$, $R^N = (R_i^N)$ is a column $M$-vector,
and $V^n(s) = \ln s$ for $s > 0$.
(b) Maximum Expected Utility has Constant Multiplicative Risk Posture.¹ It suffices to
show by induction on $N$ that

(2) $V^N(s) = a_N \ln s + b_N$

where $a_n \equiv 1$ and $b_n \equiv 0$, and for $N = 1, \ldots, n-1$,

(3) $a_N \equiv n - N + 1 = 1 + a_{N+1}$

and

(4) $b_N \equiv \max_{0 < p_0 < 1} \big\{ \ln p_0 + a_{N+1} \ln(1 - p_0) \big\} + a_{N+1} \max_{p \in P} E \ln(pR^N) + b_{N+1}.$

To see this, observe that (2) certainly holds for $N = n$. Suppose (2) holds for $N+1 \le n$. Then
on substituting (2) for $N+1$ into the right-hand side of (1) and using (3) and (4), it follows that
(2) holds for $N$.
(c) Optimal Distribution Fraction and Portfolio. It follows from (4) that the optimal distribu-
tion fraction $p_0 = p_0^N$ in year $N$ maximizes the first term on the right-hand side of (4). Maximiz-
ing that term and using (3) yields $p_0^N = 1/a_N = 1/(n - N + 1)$. Similarly, it follows from (4) that
the optimal portfolio $p = p^N$ in each year $1 \le N < n$ maximizes the second term on the right-
hand side of (4), viz., maximizes $E \ln(pR^N)$ subject to $p \in P$. Note that the optimal distribution
fraction and portfolio in each year are both independent of the level of wealth at the beginning
of the year. Also, the optimal portfolio in each year is myopic, i.e., maximizes the expected in-
stantaneous rate of return for that year alone without regard for subsequent investment oppor-
tunities.

1Incidentally, note that the client’s utility function ?ÐAÑ œ !8R œ" ln AR for distributions A œ ÐAR Ñ ¦ ! in the 8 years
exhibits constant multiplicative risk posture. Indeed, up to positive affine transformations, it is the only such utility
function that is also symmetric and strictly risk averse.
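To illustrate part (c) numerically, the sketch below (entirely illustrative: the two-security return scenarios, the crude grid search, and the horizon are assumptions) computes the myopic portfolio maximizing $E \ln(pR^N)$ for one year together with the distribution fraction $1/(n-N+1)$.

import math

scenarios = [(1.3, 0.9), (0.8, 1.1), (1.1, 1.05), (1.0, 1.2)]   # equally likely (R1, R2) pairs (assumed)

def expected_log(p1):
    # E ln(pR) for the portfolio (p1, 1 - p1).
    return sum(math.log(p1 * r1 + (1 - p1) * r2) for r1, r2 in scenarios) / len(scenarios)

best_p1 = max((i / 1000 for i in range(1001)), key=expected_log)   # grid search over the first weight

n, N = 5, 2                          # horizon and current year (assumed)
p0 = 1.0 / (n - N + 1)               # optimal distribution fraction from part (c)
print('distribute %.3f of wealth; invest %.3f in security 1 and %.3f in security 2'
      % (p0, best_p1, 1 - best_p1))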


2. Airline Overbooking.
(a) Dynamic-Programming Recursion. Let $V_r^N$ be the maximum expected profit that can
be earned from a flight when $r \ge 0$ seats have been booked $N$ hours before flight time. Then the
profit at flight time ($N = 0$) is

$V_r^0 = v_r^0 \equiv fr \wedge \big[\, fS - c\big((r - S)^+\big) \,\big].$

When $N = 1, 2, \ldots$ hours remain before flight time, the airline can either book, i.e., accept, or de-
cline a new reservation request during the ensuing hour. Reflecting these two possibilities, the maxi-
mum expected profit satisfies the recursion

(1) $V_r^N = \max\big\{\, v_r^N,\; (1-p)v_r^N + p\, v_{r+1}^N \,\big\}$

where $v_r^N \equiv E V_{C_r}^{N-1}$ and $C_r$ is the number of the $r$ reservations that do not cancel during the hour.
Then $C_r$ has a binomial distribution with parameters $r$ and $1 - q$. Interpret $v_r^N$ as the maximum
expected profit when there are $r$ reservations after booking or declining new requests, but before
cancellations, with $N$ hours remaining until flight time.
(b) Optimality of Booking-Limit Policies. We prove by induction on $N$ that $V_r^N$ and $v_r^N$ are
quasiconcave in $r$, and a booking-limit policy is optimal with $N$ hours to flight time and booking
limit $b_N$ that is a maximizer of $v_r^N$. This is true for $N = 0$ with booking limit $b_0 = S$ because $V_r^0
= v_r^0 = fr$ is increasing in $r \le S$ and $V_r^0 = v_r^0 = fS - c\big((r-S)^+\big)$ is decreasing in $r \ge S$. Suppose
that $N - 1 \ge 0$ and $V_r^{N-1}$ is quasiconcave in $r$. Then $v_r^N = E V_{C_r}^{N-1}$ is quasiconcave in $r$ by the first
assertion in the statement of (b). It is clear from (1) that $r = b_N$ also maximizes $V_r^N$ and $V_{b_N}^N = v_{b_N}^N$.
Thus since $v_r^N$ is increasing in $r \le b_N$, the same is so of $V_r^N = (1-p)v_r^N + p\, v_{(r+1) \wedge b_N}^N$. Similarly, since
$v_r^N$ is decreasing in $r \ge b_N$, the same is so of $V_r^N = v_r^N$. Thus $V_r^N$ is quasiconcave in $r$. Also, a book-
ing-limit policy is optimal with $N$ hours to flight time and booking limit $b_N$.
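As a stand-in for the MATLAB files used in part (c), here is a minimal sketch of the recursion in (a) (the parameter values and the truncation of the reservation count are assumptions). It computes $v_r^N$ and $V_r^N$ by backward induction with binomial cancellations and reads off the booking limit $b_N$ as a maximizer of $v_r^N$.

from math import comb

S, fare, p, q = 10, 300.0, 0.2, 0.05     # seats, fare, request prob., cancellation prob. (assumed)
R_MAX, HOURS = 40, 30                    # truncation of the reservation count, and horizon

def penalty(b):                          # denied-boarding penalty c(b); c(b) = fare * b assumed here
    return fare * b

V = [fare * min(r, S) - penalty(max(r - S, 0)) for r in range(R_MAX + 1)]   # V_r^0 = v_r^0

for N in range(1, HOURS + 1):
    # v[r] = E V_{C_r}^{N-1}, with C_r ~ Binomial(r, 1 - q)
    v = [sum(comb(r, k) * (1 - q) ** k * q ** (r - k) * V[k] for k in range(r + 1))
         for r in range(R_MAX + 1)]
    booking_limit = max(range(R_MAX + 1), key=lambda r: v[r])
    V = [max(v[r], (1 - p) * v[r] + p * v[min(r + 1, R_MAX)])               # recursion (1), truncated
         for r in range(R_MAX + 1)]
    if N == 10:
        print('estimated booking limit ten hours before flight time:', booking_limit)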
(c) Solving the Problem with MATLAB. The graphs of the Z<R and the corresponding book-
ing limits ,R for R œ !ß á ß $! and < œ !ß á ß 2! with W œ "! seats and 0 œ $$!! per seat are at-
tached for Ð:ß ;Ñ œ Ð.2ß .05Ñ, Ð.2ß .10Ñ, Ð.3ß .05Ñ, Ð.3ß .10Ñ. The booking limits 10 hours before flight
time are respectively 15, 20, 15, 20. The graphs do confirm the conjecture in (b) that Z<R is
quasiconcave in < for each R . The graphs are interpolated linearly between successive integer
values of < to define Z<R for fractional values of <. The graphs suggest the following additional
conjectures:

ì Monotonicity of Booking Limits. ,R is increasing in ÐR ß :ß ;Ñ and ,! œ W .

ì Equilibrium Reservation Level. There is a (possibly fractional) equilibrium reservation


level / Ÿ W such that Z/R œ Z/ is independent of R . The equilibrium appears to be roughly

the reservation level / where the expected number of new reservation requests equals the
expected number of cancellations, viz., : œ Ð/  :Ñ; or / œ :Ð"  ;ÑÎ; Ÿ :Î; . This formu-
la is probably valid only for the case where / is significantly smaller than W . Also, / is in-
creasing in : and decreasing in ; . Finally, if the number of reservations is less than /, the
expected revenue from the flight never exceeds Z/ œ Z/! œ /0 , regardless of the number
of hours before flight time.

ì Monotonicity/Concavity/Quasiconcavity of Maximum Value. Z<R is concave in <, in-


creasing in R for ! Ÿ < Ÿ /, decreasing in R for / Ÿ < Ÿ W , quasiconcave in R for W Ÿ <,
increasing in :, and decreasing in ; for ! Ÿ < Ÿ W . In fact, if / is sufficiently below W , it
appears that Z<R is essentially linear in < for < sufficiently below W . This together with
the fact Z/R œ /0 from the preceding paragraph implies that Z<R ¸ ÒÐ"  ;ÑR Ð<  /Ñ  /Ó0
for < sufficiently below W .

ì Monotonicity of Differences of Maximum R -Period Values. For each ! Ÿ R and


!  5 , let ?5 Z<R ´ Z<R 5  Z<R . Let ?R R
5 be the largest number < such that ?5 Z< œ !.
Then ,R Ÿ ?R
5 Ÿ,
R 5
. Also, ?5 Z<R is positive for <  /, negative for /  <  ?R
5 and pos-
itive for ?R
5  <.

ì Limiting Maximum Value Independent of Initial Reservations. Then limR Ä_ Z<R ex-
ists and equals Z/ for all <.

3. Optimal Capacitated Supply Policy.

(a) Dynamic-Programming Recursion. The desired dynamic-programming recursion is

$C_N(x) = \min_{x \le y \le x+u} \big[\, c \cdot (y - x) + G_N(y) + E\, C_{N+1}(y - D_N) \,\big]$

for $x \in \mathbb{R}$ and $N = 1, 2, \ldots, n$ where $C_{n+1}(x)$ is given.

(b) Convexity of GR ÐBÑ. We claim that GR ÐBÑ is convex and bounded below in B by in-
duction on R . The claim is so by hypothesis for R œ 8  ". Suppose it holds for R  " Ÿ 8  "
and consider R . Then each term in the minimand on the right side of the recursion in (a) is
convex and bounded below on the convex feasible set B Ÿ C Ÿ B  ?. Thus, by (d), GR ÐBÑ is
convex and bounded below.

(c) Optimality of Constrained Base-Stock Policies. We claim that one optimal policy $y_N(x)$
in period $N$ has the form $y_N(x) = (y_N^* \vee x) \wedge (x + u)$ for some real number $y_N^*$. To see this, let
$J_N(y) \equiv cy + G_N(y) + E\, C_{N+1}(y - D_N)$ and $y = y_N^*$ be a minimizer of $J_N(y)$. The existence of
such a minimizer is assured because $E\, C_{N+1}(y - D_N)$ is bounded below in $y$, $cy + G_N(y) \to \infty$ as
$y \to \infty$, $G_N(y) \to \infty$ as $y \to -\infty$, and $J_N(y)$ is convex on the real line, and so continuous. Also,
$y_N(x)$ minimizes $J_N(y)$ subject to $x \le y \le x + u$. To see this, it suffices to consider the following
cases:

• $x + u < y_N^*$. Since $J_N(y)$ is decreasing in $y \le y_N^*$, $y_N(x) = x + u$.
• $x \le y_N^* \le x + u$. Since $y = y_N^*$ minimizes $J_N(y)$, $y_N(x) = y_N^*$.
• $y_N^* \le x$. Since $J_N(y)$ is increasing in $y \ge y_N^*$, $y_N(x) = x$.

(d) Projections of Convex Functions. Let 0 ÐBß CÑ be a real-valued convex function on a non-
empty closed convex set [ © d 5 and let J ÐBÑ ´ infÐBßCÑ−[ 0 ÐBß CÑ. We show that if J is finite on
the set \ ´ ÖB À ÐBß CÑ − [ ×, then J is convex thereon. To see this, let %  !. Suppose :ß :w   !;
:  :w œ "; and Bß Bw − \ . Choose Cß C w such that ÐBß CÑ, ÐBw ß C w Ñ − [ , 0 ÐBß CÑ Ÿ J ÐBÑ  % and 0 ÐBw ß C w Ñ
Ÿ J ÐBw Ñ  %. Then

J Ð:B  :w Bw Ñ Ÿ 0 Ð: † ÐBß CÑ  :w † ÐBw ß C w ÑÑ Ÿ :0 ÐBß CÑ  : w 0 ÐBw ß C w Ñ Ÿ :J ÐBÑ  : w J ÐB w Ñ  %.

Since this is so for all %  !, it is so for % œ !, whence J is convex.
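For intuition about parts (a)-(c), the sketch below runs the recursion on a discretized version of the problem (the demand distribution, cost parameters, grid, and horizon are all assumptions). The printed order-up-to levels exhibit the constrained base-stock form $y_N(x) = (y_N^* \vee x) \wedge (x + u)$.

c, u, n = 1.0, 5, 3                              # unit order cost, capacity, horizon (assumed)
demand = {0: 0.2, 1: 0.3, 2: 0.3, 3: 0.2}        # per-period demand distribution (assumed)
X = list(range(-10, 21))                         # integer grid of stock levels

def g(z):                                        # convex storage/shortage cost (assumed)
    return 1.0 * max(z, 0) + 4.0 * max(-z, 0)

C_next = {x: 0.0 for x in X}                     # terminal cost C_{n+1} = 0
for N in range(n, 0, -1):
    C, policy = {}, {}
    for x in X:
        best = None
        for y in range(x, min(x + u, max(X)) + 1):
            G = sum(pr * g(y - d) for d, pr in demand.items())
            EC = sum(pr * C_next[max(y - d, min(X))] for d, pr in demand.items())
            val = c * (y - x) + G + EC
            if best is None or val < best[0]:
                best = (val, y)
        C[x], policy[x] = best
    C_next = C

print({x: policy[x] for x in range(-3, 6)})      # order-up-to levels in period 1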

4. Sequencing: Optimality of Index Policies. Let $\sigma$ be a sequence of tasks and $C(\sigma)$ be its
associated expected cost. Without loss of generality, assume that $\sigma = (1, \ldots, n)$, i.e., perform the
tasks in numerical order. Let $\bar{1} = \emptyset$ and $\bar{i} = (1, \ldots, i-1)$ for $1 < i \le n$.

(a) Optimality of Index Policies by Interchanging Tasks. Fix $1 \le i < n$ and let $\sigma'$ be the
sequence formed from $\sigma$ by interchanging tasks $i$ and $i+1$. Let $C_i$ be the expected cost of tasks
$\bar{i}$. Let $\bar{C}_j$ be the expected cost of tasks $j, \ldots, n$ for $1 \le j \le n$. Then ($C_0 \equiv \bar{C}_{n+1} \equiv 0$)

$C(\sigma) = C_i + p_{\bar{i}}\big(c_i + p_{i|\bar{i}}\, c_{i+1}\big) + \bar{C}_{i+2}$

and

$C(\sigma') = C_i + p_{\bar{i}}\big(c_{i+1} + p_{i+1|\bar{i}}\, c_i\big) + \bar{C}_{i+2}.$

Consequently, on comparing the above formulas, it follows that $C(\sigma) \le C(\sigma')$ if and only if

$c_i + p_{i|\bar{i}}\, c_{i+1} \le c_{i+1} + p_{i+1|\bar{i}}\, c_i,$

or equivalently,

$\frac{c_i}{1 - p_{i|\bar{i}}} \le \frac{c_{i+1}}{1 - p_{i+1|\bar{i}}}.$

Thus since

(*) $1 - p_{j|\bar{i}} = f_j \cdot g_{\bar{i}}$

for each $j \ge i$, the above inequality simplifies to

$\frac{c_i}{f_i} \le \frac{c_{i+1}}{f_{i+1}}.$

Consider the index rule in which the tests, after possibly renumbering, are carried out in an
order that assures that the index function $I_i \equiv c_i/f_i$ is increasing in $i$. We claim that this index rule
is optimal. For in any other sequence $\sigma$, there must be a first test with higher index than its suc-
cessor. Let $\sigma'$ be the sequence in which one interchanges these two tests. The above argument
shows that $C(\sigma') \le C(\sigma)$. Also, $\sigma'$ is lexicographically smaller than $\sigma$, i.e., the first nonzero ele-
ment of $\sigma - \sigma'$ is positive. Repeating this process up to $\binom{n}{2}$ times produces the index rule be-
cause each sequence is lexicographically smaller than its predecessor and lexicographic order is a
total (or linear) order of the sequences. Thus, the index rule is optimal.
(b) Testing a Machine to Find a Fault. Let $\sigma$ be a permutation of the tests, $g_{\bar{i}} = 1$, $f_j = 1 -
p_j$ and $p_{\bar{i}} \equiv p_1 \cdots p_{i-1}$ for all $i, j \ge 1$. Note that (*) holds and that $g_{\bar{i}}$ and $p_{\bar{i}}$ are symmetric in
$\bar{i}$. Thus, the hypotheses of part (a) are satisfied. Hence from part (a), one optimal sequence $\sigma$ in
which to perform the tests is so that the cost-success ratio $c_i/(1 - p_i)$ is increasing in $i$.
(c) Search for a Plane. Let $\sigma$ be a permutation of the locations, $p_{\bar{i}} \equiv 1 - q_1 - \cdots - q_{i-1}$, $g_{\bar{i}}
= p_{\bar{i}}^{-1}$, $f_j = q_j$ and $c_j = t_j$ for all $i, j \ge 1$. Note that (*) holds and that $g_{\bar{i}}$ and $p_{\bar{i}}$ are sym-
metric in $\bar{i}$. Thus, the hypotheses of part (a) are satisfied. Hence from part (a), one optimal
sequence $\sigma$ in which to search the locations is so that the time-success ratio $t_i/q_i$ is increasing in $i$.
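The index rule of part (a) is just a sort. The sketch below applies it to the search-for-a-plane instance of part (c) (the location probabilities and search times are assumed data): it orders the locations so the time-success ratio $t_i/q_i$ is increasing and evaluates the resulting expected search time.

q = [0.10, 0.40, 0.25, 0.25]          # probability the plane is at each location (assumed, sums to 1)
t = [2.0, 5.0, 1.0, 4.0]              # time to search each location (assumed)

order = sorted(range(len(q)), key=lambda i: t[i] / q[i])    # index rule: t_i/q_i increasing

def expected_search_time(seq):
    # Location seq[k] is searched iff the plane was not found at seq[0], ..., seq[k-1].
    total, prob_not_found = 0.0, 1.0
    for i in seq:
        total += prob_not_found * t[i]
        prob_not_found -= q[i]
    return total

print('optimal order:', order, 'expected search time:', expected_search_time(order))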

Remark. The function $g$ can be dispensed with if one replaces the condition (*) by the
equivalent condition that

$\frac{1 - p_{k|\bar{i}}}{1 - p_{j|\bar{i}}} = \frac{f_k}{f_j}$ for all $k, j \ge i$.

[Graphs: Maximum Expected Flight Revenue (Revenue in $ vs. Reservations) and Optimal Booking Limits vs. Hours to Flight Time (Reservations vs. Hours to Flight Time) for (p, q) = (0.2, 0.05).]
[Graphs: the same pair for (p, q) = (0.2, 0.1).]
[Graphs: the same pair for (p, q) = (0.3, 0.05).]
[Graphs: the same pair for (p, q) = (0.3, 0.1).]
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 3 Due April 25


1. Multifacility Linear-Cost Production Planning. Suppose O facilities, labeled 1ß á ß O , each
produce a single product. Production of each unit at facility 5 in period 8 œ 1ß á ß R costs -85 and
45
consumes /78   0 units at facility 4 in period 7 for each 4ß 5 and " Ÿ 7  8. The cost -85 includes
45
the costs of in-transit storage and shipping /78 units of each product 4 in each period 7  8, but
excludes the cost of supplying those units of product 4 in period 7. The unit cost of storage at
facility 5 at the end of period 8 is 285 . There are known nonnegative demands for the product pro-
duced at each facility in each period. Let G85 be the minimum total cost of satisfying a unit of
demand at facility 5 in period 8. This unit cost includes all costs of production, storage and in-
transit storage at all facilities involved in supplying that unit of demand at facility 5 in period 8.
(a) Dynamic-Programming Recursion. Give a dynamic-programming recursion for solving
the problem. [Hint: Express G85 in terms of the G7
4
for 7  8. Use the fact that the demand at
facility 5 in period 8 can be satisfied either by producing at 5 in 8 or producing at 5 in a prior
period and storing from the previous period 8  1.]
(b) Running Time. How many multiplications, additions and comparisons are required to solve
the problem? Your answer should be of the form :ÐR ß OÑ  SÐ;ÐR ß OÑÑ for some polynomials :
and ; with : being of higher order than ; .
(c) Running Time: Special Cases. How long would it take a computer to execute the opera-
tions needed to solve the problem when O œ 1000 and R œ 50? O œ 10000 and R œ 500? (Assume
that the computer will do 10) multiplications, 10* additions and 10* comparisons respectively per
second. Use :ÐR ß OÑ defined in (b) for these evaluations.)
(d) Nonlinear Costs. Suppose the production and storage costs are nonlinear in the amounts
produced and stored. Can you modify your recursion in part (a) to solve this case without expand-
ing the number of “states” and “actions”? If so, show how to do this; if not, explain why.

2. Discovering System Transience. The purpose of this problem is to give two methods of dis-
covering whether or not a system, i.e., a finite discrete-time-parameter Markov population deci-
sion chain, is transient. Improvements in one of the methods are given for substochastic systems.
(a) Characterization of Transient Matrices. Suppose T is a square real nonnegative matrix.
Show that T R Ä ! if and only if the linear inequalities ÐM  T Ñ@ œ <, @   ! have a solution for
every <   ! (resp., some < ¦ !). [Hint: For the “if ” part, write the first equation in the form
@ œ <  T @ and iterate.]
(b) Discovering System Transience by Policy Improvement. Show how the policy-improve-
ment method can be used to check whether or not a system is transient. [Hint: Let <$ œ " for all
$ and apply the policy-improvement method. Use the result of (a) to show how to detect when a
decision $ is found for which $ is not transient.]
(c) Discovering System Transience in Substochastic Systems by a Combinatorial Method.
Consider a substochastic system with W states and E state-action pairs. Let ;Ð=ß +Ñ be the proba-

bility that an individual who is in state = and chooses action + exits the system in one step. Thus,
;Ð=ß +Ñ ´ "  !>−f :Ð>l=ß +Ñ for each + − E= and = − f . Let ;1R ´ Ð"  T1R Ñ" be the vector of prob-
abilities that an individual exits the system in R   ! steps or less starting from each state while
using 1. For R   !, let X R be the set of states = − f from which it is possible to exit the system
in R steps or less with every policy, i.e., min1 ;1R=  !. Show that the following are equivalent:
"‰ the system is transient; #‰ X W œ f ; $‰ min1 ;1W ¦ !. Give a low order polynomial in W and E
that is an upper bound on the number of comparisons needed to find X W . [Hint: For R   ", let
Y R be the set of states = − f from which it is possible to exit the system in R steps or less with
every policy, but not less than R steps for some policy. Then X R œ X R "  Y R for R   " . Show
that there is an integer ! Ÿ Q Ÿ W for which X Q œ X Q " . Show that "‰ Ê #‰ Ê $‰ Ê "‰ . Show
that "‰ Ê #‰ by contraposition.]

3. Successive Approximations and Newton's Method Find Nearly Optimal Policies in Linear
Time for Discounted Markov Decision Chains. Consider a finite discrete-time-parameter W -state
"
discounted Markov decision chain with discount factor " œ and interest rate 1003%  !, so
"3
T$ " œ " " for all $ . Since adding a constant to the rewards earned in all states does not affect the
set of maximum-value policies, there is no loss of generality in assuming that <$   ! for all $ . Let
8 be the larger of the number of state-action pairs and the number of nonzero data elements
<Ð=ß +Ñ, :Ð>l=ß +Ñ. Let %  ! be given and Z ‡ be the maximum value over all policies. Call a policy
1 %-optimal if lZ ‡  Z1 l Ÿ %lZ ‡ l.
(a) Computing eZ in Linear Time. Show that the number of operations required to com-
pute eZ for any fixed Z − d W is at most 8. (An operation is an addition, a comparison, and a
multiplication).
(b) Successive Approximations Finds an %-Optimal Policy in Linear Time. Show that Z R ´
eR ! is increasing in R . Also show, for fixed % and " , how successive approximations can be used
to find a stationary %-optimal policy with at most SÐ8Ñ operations. [Hint: Let $ _ be a stationary
maximum-value policy and # be such that Z R œ <#  T# Z R " . Show that Z$R Ÿ Z R Ÿ Z# Ÿ
Z$R  T$R Z$ . Also show how to choose R depending on % and " , but not on 8, such that # _ is
%-optimal.]
R
(c) Newton's Method Finds an %-Optimal Policy in Linear Time. Show that the values Z
´ Z$R of the successive decisions $R ß R   "ß selected by Newton’s method are increasing in R , and
R
that Z   Z R for all R . Thus conclude that Newton’s method requires no more iterations to find
a stationary %-optimal policy than does successive approximations. Show that if Gaussian elimina-
tion is used to compute the value of a decision $R in Newton’s method, then the number of opera-
tions needed to find that value is SÐW $ Ñ. Thus conclude that the number of operations required
to find a stationary %-optimal policy with Newton’s method (when computing values by Gaussian
elimination) is SÐ8  W $ Ñ for fixed % and " . Show that this running time can be reduced to SÐ8Ñ if
the values of successive decisions selected by Newton’s method are instead estimated by successive
approximations.
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 3 Due April 25

1. Multifacility Linear-Cost Production Planning.


(a) Dynamic-Programming Recursion. The forward dynamic-programming recursion for finding
the minimum unit costs is ($C_1^k = c_1^k$)

$C_n^k = \min\Big[\; c_n^k + \sum_{j} \sum_{m < n} e_{mn}^{jk}\, C_m^j \;,\;\; h_{n-1}^k + C_{n-1}^k \;\Big]$

for $1 < n \le N$ and $1 \le k \le K$. The right-hand side of the above equation is the minimum of two
terms, the first representing production of product $k$ in period $n$ and the second production of that
product before period $n$ and storing the product from period $n-1$ to period $n$.
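A small illustrative implementation of this forward recursion follows (the two-facility cost and consumption data are assumptions; only the recursion itself comes from the answer). Consumption coefficients are stored sparsely as e[(j, m, k, n)].

K_FAC, N_PER = 2, 3                      # number of facilities and periods (assumed)
c = [[3.0, 4.0, 5.0], [2.0, 2.5, 3.0]]   # c[k][n-1]: unit production cost at facility k in period n
h = [[0.5, 0.5, 0.5], [0.4, 0.4, 0.4]]   # h[k][n-1]: unit storage cost at facility k at end of period n
e = {(1, 1, 0, 2): 0.5, (0, 1, 1, 3): 0.3, (1, 2, 0, 3): 0.5}    # e[(j, m, k, n)], sparse (assumed)

C = [[0.0] * (N_PER + 1) for _ in range(K_FAC)]      # C[k][n]: minimum unit cost at facility k, period n
for k in range(K_FAC):
    C[k][1] = c[k][0]                                # C_1^k = c_1^k
for n in range(2, N_PER + 1):
    for k in range(K_FAC):
        produce = c[k][n - 1] + sum(e.get((j, m, k, n), 0.0) * C[j][m]
                                    for j in range(K_FAC) for m in range(1, n))
        store = h[k][n - 2] + C[k][n - 1]
        C[k][n] = min(produce, store)

print([[round(C[k][n], 3) for n in range(1, N_PER + 1)] for k in range(K_FAC)])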
(b) Running Time. We summarize below the number of additions, multiplications and compar-
isons required to compute the $C_n^k$.

Additions: $\sum_k \sum_n \sum_{m<n} \sum_j 1 = \tfrac{1}{2} N^2 K^2 + O(NK^2)$.

Multiplications: $\sum_k \sum_n \sum_{m<n} \sum_j 1 = \tfrac{1}{2} N^2 K^2 + O(NK^2)$.

Comparisons: $\sum_k \sum_n 1 = NK + O(K)$.
(c) Running Time: Special Cases. The approximate running times for special cases are
summarized in the table below.

K                   1,000      10,000
N                   50         500
Approx. Run Time    14 sec     38 hrs

Approximate Running Times

(d) Nonlinear Costs. It is not possible to modify the recursion in (a) to handle nonlinear costs
without dramatically enlarging the state and actions spaces. The reason is that the recursion in (a)
is based on the fact that the marginal costs are independent of the volume of production. That
would no longer be true with nonlinear cost functions.

2. Discovering System Transience.


(a) Characterization of Transient Matrices. The “only if” part follows from the fact that
$P^N \to 0$ implies $I - P$ is nonsingular and $(I - P)^{-1} = \sum_{N=0}^{\infty} P^N \ge 0$ by Lemma 3. Thus, $v =
(I - P)^{-1} r \ge 0$ solves $(I - P)v = r$, $v \ge 0$ for all $r \ge 0$. For the “if” part, suppose $r \gg 0$ and $v \ge 0$
solves $(I - P)v = r$, $v \ge 0$. Then

$v = r + Pv = \cdots = \sum_{i=0}^{N-1} P^i r + P^N v \ge \sum_{i=0}^{N-1} P^i r$

for $N = 0, 1, \ldots$. Thus since $r \gg 0$, $\sum_{i=0}^{\infty} P^i$ converges, whence $P^N \to 0$.


(b) Using Policy Improvement to Discover System Transience. Let <$ œ 1 for all $ − ?.
Suppose $ − ? and ÐM  T$ Ñ@ œ 1 has no nonnegative solution. Then from (a), $ _ is not trans-
ient. If the equations have a nonnegative solution, then that solution is Z$   ". Now choose an im-
provement # of $ , if one exists, replace $ by # , and repeat the above construction. After finitely
many steps, we will either find a $ _ that is not transient, in which case the system is not trans-
ient, or a $ that has no improvement. In the latter event, Z$   ! is a fixed point of e, whence
Z$ œ max#−? Ð"  T# Z$ Ñ ¦ T# Z$ for all # . Thus, for each # , <# ´ ÐM  T# ÑZ$ ¦ !, so from (a), # is
transient. Hence the system is transient.
(c) Discovering System Transience in Substochastic Systems by a Combinatorial Method.
For N ≥ 0, let T^N be the set of states s ∈ 𝒮 from which it is possible to exit the system in N steps or less with every policy. For N ≥ 1, let U^N be the set of states s ∈ 𝒮 from which it is possible to exit the system in N steps or less with every policy, but not in less than N steps with some policy. Then T^N = T^{N−1} ∪ U^N for N ≥ 1.

1° ⇒ 2°. The proof is by contraposition. If 2° is false, then T^S ⊂ 𝒮. Let T = 𝒮 \ T^S. By definition of the T^N, ∅ = T^0 ⊆ T^1 ⊆ T^2 ⊆ ⋯, so there is a smallest integer 0 ≤ M ≤ S such that U^{M+1} = ∅, whence T^M = T^{M+1}. We claim that U^N = ∅ for all N > M. This is certainly so for N = M + 1. Suppose it is so for N > M and consider N + 1. Then, for each state s ∈ U^{N+1}, each policy that exits the system in N + 1 steps, but not less, first takes an action that sends the system only to states in U^N, which is impossible because U^N = ∅. Hence T^M = T^S. Then there is a decision δ such that P_δ^{TT} is stochastic, whence δ^∞ is not transient.

2° ⇒ 3°. Immediate from the definitions involved.

3° ⇒ 1°. By hypothesis, min_π q_π^S ≫ 0 and so max_π P_π^S 1 ≪ 1. Thus there exists 0 < λ < 1 such that max_π P_π^S 1 ≤ λ1. Iterating this inequality shows that max_π P_π^{iS} 1 ≤ λ^i 1 for i = 1, 2, …. Since max_π P_π^N 1 is decreasing in N ≥ 0, Σ_{N=0}^∞ P_π^N 1 ≤ S Σ_{i=0}^∞ P_π^{iS} 1 ≤ S(1 − λ)^{-1} 1, whence π is transient.

We show that the total number of comparisons to find T^M is at most AS. For N ≥ 0, let A_s^N be the set of actions a ∈ A_s in state s ∈ 𝒮 that assure it is not possible to exit the system in N steps or less starting from s and using a initially. For s ∈ 𝒮, A_s^0 = ∅. For N ≥ 1 and s ∈ T^{N−1}, A_s^N = ∅. For N = 1 (resp., N > 1) and s ∈ 𝒮 \ T^{N−1}, A_s^N is the set of a ∈ A_s (resp., a ∈ A_s^{N−1}) such that q(s, a) = 0 (resp., p(t|s, a) = 0 for t ∈ U^{N−1}); given A_•^{N−1}, U^{N−1}, T^{N−1}, one can find A_s^N for all s ∈ 𝒮 \ T^{N−1} with A (resp., A·|U^{N−1}|) comparisons. For N ≥ 1, U^N is the set of states s ∈ 𝒮 \ T^{N−1} for which A_s^N = ∅, and the U^N are disjoint. If T^M = 𝒮 (resp., T^M ⊂ 𝒮), it suffices to find A_s^N, and so U^N and T^N = T^{N−1} ∪ U^N, for s ∈ 𝒮 \ T^{N−1} and N ≤ M (resp., N ≤ M + 1); the number of comparisons to do so is at most A(1 + Σ_{N=2}^{M} |U^{N−1}|) ≤ A·|T^M| = AS (resp., A(1 + Σ_{N=2}^{M+1} |U^{N−1}|) ≤ A(|𝒮 \ T^M| + |T^M|) = AS).
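
The combinatorial test in (c) can be organized as a backward reachability computation: repeatedly mark the states from which every action either exits at once or reaches an already-marked state with positive probability. Here is a minimal sketch (not the course's MATLAB code); `p[s][a]` is the substochastic row for action a in state s, and the exit probability is q(s, a) = 1 − Σ_t p(t|s, a).

```python
def system_is_transient(p, tol=1e-12):
    """p[s][a][t] = one-step probability of moving from s to t under action a.
    Returns True iff from every state, every policy can exit the system
    (equivalently, the system is transient)."""
    S = len(p)
    T = set()                                      # the sets T^N of part (c)
    changed = True
    while changed:
        changed = False
        for s in range(S):
            if s in T:
                continue
            # s joins T if EVERY action either can exit at once (q(s,a) > 0)
            # or reaches some state already in T with positive probability
            if all(sum(row) < 1.0 - tol or any(row[t] > tol for t in T)
                   for row in p[s]):
                T.add(s)
                changed = True
    return len(T) == S

# Two states; the second action in state 1 never exits and loops on state 1,
# so the system is not transient
p = [[[0.0, 0.5]],                  # state 0: one action, exit probability 0.5
     [[0.0, 0.0], [0.0, 1.0]]]      # state 1: exit surely, or loop forever
print(system_is_transient(p))       # False
```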
3. Successive Approximations and Newton's Method Find an ε-Optimal Policy in Linear
Time for Discounted Markov Decision Chains.

(a) Computing ℛV in Linear Time. Recall that

(ℛV)_s = max_{a∈A_s} {r(s, a) + Σ_{t∈𝒮} p(t|s, a) V_t}

for each V ∈ ℜ^S and s ∈ 𝒮. Observe that each r(s, a) and p(t|s, a) appears exactly once in the formula above for each s, and therefore the number of operations needed to compute (ℛV)_s does not exceed the number n_s of nonzero elements among the r(s, a) and p(t|s, a) for all a ∈ A_s and t ∈ 𝒮, for each fixed s ∈ 𝒮. Thus, the total number of operations required to compute ℛV is bounded above by Σ_s n_s = n. Let V^0 ≡ 0.

(b) Successive Approximations Finds an ε-Optimal Policy in Linear Time. Notice that V^1 = ℛV^0 = ℛ0 ≥ 0. Thus, by the monotonicity of ℛ, and so of ℛ^N, it follows that V^{N+1} = ℛ^N V^1 ≥ ℛ^N 0 = V^N. Also, if γ attains the maximum in V^N = ℛV^{N−1}, then V^N = r_γ + P_γ V^{N−1} ≤ r_γ + P_γ V^N, so (I − P_γ)V^N ≤ r_γ. Premultiplying the last inequality by (I − P_γ)^{-1}, which exists and is nonnegative because γ^∞ is transient, gives V^N ≤ (I − P_γ)^{-1} r_γ = V_γ. Since the system is strictly substochastic, it is transient and there is a stationary maximum-value policy δ^∞. Hence,

V_δ^N ≤ V^N ≤ V_γ ≤ V_δ = V_δ^N + P_δ^N V_δ = V^*.

Therefore,

0 ≤ V^* − V_γ ≤ V^* − V^N ≤ P_δ^N V_δ = P_δ^N V^*

and so

‖V^* − V_γ‖ ≤ ‖V^* − V^N‖ ≤ ‖P_δ^N‖ ‖V^*‖ = β^N ‖V^*‖.

Thus, taking N iterations with N = ⌈ln ε / ln β⌉ = ⌈ln(1/ε) / ln(1 + ρ)⌉ assures that γ^∞ is ε-optimal and that V_γ is within 100ε% of V^*. Since N is independent of n and each iteration requires at most n operations, successive approximations finds a stationary ε-optimal policy with O(n) operations for fixed ε and β.
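
A minimal sketch of the procedure in (b), using the same sparse data format as the earlier apply_R sketch and the stated iteration count (an illustration, not a definitive implementation).

```python
import math

def eps_optimal_by_successive_approximation(data, beta, eps):
    """Run N = ceil(ln(1/eps)/ln(1/beta)) applications of R starting from V = 0
    and return the greedy decision at the last step together with its iterate.
    data[s] is a list of (reward, [(t, prob), ...]) pairs, one per action."""
    S = len(data)
    N = math.ceil(math.log(1.0 / eps) / math.log(1.0 / beta))
    V = [0.0] * S
    for _ in range(N):
        newV, policy = [], []
        for actions in data:
            vals = [r + sum(p * V[t] for t, p in trans) for r, trans in actions]
            a = max(range(len(vals)), key=vals.__getitem__)
            newV.append(vals[a])
            policy.append(a)
        V = newV
    return policy, V
```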
(c) Newton's Method Finds an ε-Optimal Policy in Linear Time. Write V̄^N for the value of the policy that Newton's method produces at iteration N. Since Newton's method is an instance of the policy-improvement method and the values of successive policies increase with that method, it follows that V̄^{N+1} ≥ V̄^N for N ≥ 0. Thus, if V̄^N ≥ V^N, then

V̄^{N+1} ≥ r_{δ_{N+1}} + P_{δ_{N+1}} V̄^N = max_δ {r_δ + P_δ V̄^N} ≥ max_δ {r_δ + P_δ V^N} = V^{N+1}.

This shows that the values at iteration N with Newton's method are at least as large as those with successive approximations. Thus Newton's method requires no more iterations than successive approximations, so the number N of iterations in (b) will find a stationary ε-optimal policy. Each iteration N of Newton's method requires finding the "best" improvement δ_N and then its value V_{δ_N}. The first task can be done in O(n) time as shown in (a), and the second can be done in O(S³) time if Gaussian elimination is used to solve the required system of S linear equations. Since the number of iterations is independent of both n and S, this implementation of Newton's method requires O(n + S³) operations to find an ε-optimal policy. However, if instead of solving a system of linear equations to compute V_{δ_N}, we use successive approximations to estimate this value as in (b), then both steps of Newton's method can be implemented in O(n) time. Then this implementation of Newton's method requires O(n) operations to find an ε-optimal policy.
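
For comparison with (b), here is a minimal sketch of the Newton/policy-improvement iteration in its O(n + S³) form, solving (I − P_δ)V = r_δ exactly at each step (an illustration only; the course's newtonr1.m is not reproduced here).

```python
import numpy as np

def newton_policy_iteration(r_sa, P_sa, max_iter=100):
    """r_sa[s][a] is the reward and P_sa[s][a] the substochastic row for action a
    in state s.  Alternates policy evaluation (linear solve) with greedy
    improvement until the policy repeats."""
    S = len(r_sa)
    delta = [0] * S
    for _ in range(max_iter):
        P = np.array([P_sa[s][delta[s]] for s in range(S)])
        r = np.array([r_sa[s][delta[s]] for s in range(S)])
        V = np.linalg.solve(np.eye(S) - P, r)            # value of current policy
        new_delta = [max(range(len(r_sa[s])),
                         key=lambda a: r_sa[s][a] + np.dot(P_sa[s][a], V))
                     for s in range(S)]
        if new_delta == delta:
            return delta, V                               # no improvement: optimal
        delta = new_delta
    return delta, V
```
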
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 4 Due May 2


(Do Problems 1, 2 and either 3 or 4; if you took MS&E 251, do Problems 2, 3 and 4)

1. Component Replacement. A manager of a fleet of M new machines desires to develop a maintenance policy that maximizes the expected net profit from the machines as long as they are "working". Working machines are rented out each morning and are returned by 5 p.m. that day, with the daily rental revenue per machine being e. Each machine consists of four components, two of type α and two of type β. Each component may be "working" or "failed". Working components fail independently of one another while in service, though customers receive no refunds. The probabilities that working components of types α and β fail during a day are .2 and .1 respectively. A machine is "working" if it has at least one working component of each type; otherwise the machine is "failed". When a machine is returned at the end of a day, the manager observes the numbers of working components of each type therein. If the machine has failed, the manager discards it. If the machine is working, the manager decides which failed components, if any, to replace during the evening in order to prolong the machine's working life. Components of types α and β cost respectively 1 and 2. The labor cost to replace one failed component of either type in a working machine during an evening is 5 and that to replace two failed components, one of each type, is 6. Initially, all components of all machines are working.
(a) Solution by Newton's Method. Find the component replacement policy that maximizes expected profit for e = 4 using the MATLAB file newtonr1.m (Newton's method).
(b) Formulation as Linear Program. Formulate the problem of finding a component replacement policy for the machines that maximizes the expected net profit as a linear program. [Hint: The problem has nine variables and four equality constraints.]
(c) Solution by Linear Programming. Find the optimal policy and the maximum expected profit therefrom by solving the linear program in (b) using the MATLAB file tranlpsc.m. (Type "help linprog" at the MATLAB prompt to see explanations for the linprog solver.) Do this for each of e = .5, 2, 4 and 6, and where M is the sum of the two integers representing the positions in the alphabet of the first letter in your first and last names. For example, Walt Disney would have to solve the problem for the case M = 23 + 4 = 27 since W is the twenty-third and D is the fourth letter of the alphabet. Please be sure to state the value of M that you use. Discuss qualitatively the effect that you observe in the above computations of increasing the daily revenue e per machine on the optimal replacement policy.
(d) Extending the Service Life. The manager has been asked by the Vice-President for Marketing to extend the average working life of a machine to at least 17 days. For the case e = 4, what is the maximum expected profit and corresponding optimal policy among those that satisfy this constraint? Use the MATLAB file tranlpsc.m to solve the problem.


2. Optimal Stopping Policy. Consider a finite S-state Markov branching decision chain. Call a policy π stopping if P_π^N → 0. Call V_π ≡ lim sup_{N→∞} V_π^N the value of π and call the supremum V^* of V_π over all stopping policies π the maximum stopping value. (The limit superior and supremum are taken coordinate-wise.) If there are no stopping policies, let V^* ≡ −∞. Let w ≫ 0 be a row S-vector and consider the primal linear program of finding a vector (x_δ) that

maximizes Σ_{δ∈Δ} x_δ r_δ

subject to

Σ_{δ∈Δ} x_δ (I − P_δ) = w
x_δ ≥ 0, all δ ∈ Δ.

(a) Existence of Stopping Policies. Show that the following statements are equivalent:
1° There is a stopping policy. 2° There is a stationary stopping policy. 3° The primal linear program has a feasible solution. [Hint: Show that 1° implies 2° by establishing that there is a transient policy π = (γ_1, γ_2, …) that is periodic, i.e., for some M ≥ 1, γ_N = γ_{N+M} for N = 1, 2, …. Then, for this part only, replace each r_δ by −1. Show that ℛ^N 0 ↓ V ≥ V_π, that V is a finite negative fixed point of ℛ, and each δ^∞ for which ℛV = −1 + P_δ V is a stopping policy.]
(b) Existence of Optimal Stopping Policies. Show that the following statements are equivalent: 1° V^* is finite. 2° There is a stationary maximum-value stopping policy and V^* is the common least fixed point and least excessive point of ℛ. (An excessive point of ℛ is a vector V ∈ ℜ^S with V ≥ ℛV.) 3° The primal linear program has an optimal solution. 4° There is a stopping policy and an excessive point of ℛ. [Hint: Show that 4° ⇒ 3° ⇒ 2° ⇒ 1° ⇒ 4°. Show that 3° implies 2° by establishing that one optimal solution of the linear program is basic and that for the corresponding decision δ, V_δ is optimal for the dual program, and so is a fixed point of ℛ. Then show that V_δ = V^*. Establish that 1° implies 4° by using part (a) to show that since V^* is finite, there is a stationary stopping policy δ^∞, whence the primal is feasible. Then show that ℛ^N V_δ is nondecreasing in N and converges to a fixed point of ℛ, so the dual is feasible.]
(c) Maximum Value Need Not Equal Maximum Stopping Value. Suppose 𝒮 = {1}, Δ = {γ, δ}, r_γ = 0, P_γ = 1, r_δ = −1 and P_δ = 0. Show that V^* = −1, but that V = 0 is the maximum value over all policies and is the greatest nonpositive fixed point of ℛ.
3. Simple Stopping Problems. Let P be the stochastic transition matrix of an S-state Markov chain. In each period N, say, in which the chain is observed in state s ∈ 𝒮, there are two actions available. One is to stop and receive a terminal reward r_s. The other is to continue by paying an entrance fee c_s that permits the state of the chain to be observed in period N + 1. Put r = (r_s) and c = (c_s).
(a) Existence of Stationary Optimal Stopping Policy. Show that there is a stationary maximum-value stopping policy if and only if V ≥ r ∨ (−c + PV) for some V ∈ ℜ^S.
(b) Solution in S Iterations by Linear Programming. Show that if the simplex method is used to solve the corresponding primal linear program starting with the decision that stops in every state, then the simplex method will terminate in S iterations or less. [Hint: Show that since the value of the stopping policy encountered at each iteration of the simplex method exceeds that of its predecessor, once the simplex method introduces the continuation action in a state, that action will not be changed back to the stopping action in that state at a subsequent iteration.]
(c) Entrance-Fee Problem. An entrance-fee problem is one in which there are no terminal rewards. Show how to reduce the above problem to an equivalent entrance-fee problem. [Hint: Subtract r from the equation V = r ∨ (−c + PV), thereby forming V̂ = 0 ∨ (−ĉ + PV̂) where V̂ ≡ V − r and ĉ ≡ c + r − Pr.]
(d) Optimality of Myopic Policies for Entrance-Fee Problem. Assume 𝒮 = {1, …, S}. Consider an entrance-fee problem in which (i) P is upper triangular with diagonal elements less than 1, except that the last one equals 1, and (ii) there is an s*, 1 ≤ s* ≤ S, such that c_s < 0 for s < s* and c_s ≥ 0 for s ≥ s*. Show that it is optimal to continue if s < s* and to stop if s ≥ s*. Call this policy myopic because it maximizes the one-period reward.

4. House Buying. You have just moved to a city and are considering buying a house. The purchase prices of houses that interest you are independent identically distributed random variables with p_1, …, p_S (Σ_i p_i = 1) being the positive probabilities that the purchase prices are respectively 1, …, S. Each time you look at a house to determine its price, you incur a cost c > 0. Determine explicitly a stationary policy that minimizes the expected cost of looking and buying among those policies in which you eventually buy with probability one. [Hint: Assume that you always have the option of buying the cheapest house that you have observed to date. Then apply the result of Problem 3(d) to find an optimal policy. Also show that one optimal policy has the property that if you do not buy a house in the period in which you observe its price, then you never buy it.]
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 4 Due May 2

1. Component Replacement. The states of a machine at the end of a day are the ordered pairs (i, j) where i and j are respectively the numbers of working components of types α and β, so 𝒮 = {(1,1), (1,2), (2,1), (2,2)}. The actions available in state (i, j) are the ordered pairs (k, l) ≥ (i, j) of numbers of working components of the two types after replacement. Let p(m, n | k, l) be the conditional probability that a machine in state (k, l) after replacements in an evening is in state (m, n) ≤ (k, l) at 5:00 p.m. the following day. Denote by r(i, j, k, l) the net profit earned during a day by a machine in state (i, j) at 5:00 p.m. on the previous day when action (k, l) is chosen for the machine that evening. The p(t | a) are given by the formula

p(m, n | k, l) = C(k, m)(.8)^m(.2)^{k−m} · C(l, n)(.9)^n(.1)^{l−n},

where C(·, ·) is the binomial coefficient, and are tabulated below. The r(s, a) are given by r(s, a) = e − c(s, a) where

c(i, j, k, l) = (k − i) + 2(l − j) + c(k + l − i − j)

and c(u) is the labor cost of replacing u components in the machine, c(0) = 0, c(1) = 5, and c(2) = 6. The c(s, a) are as tabulated below. Also note that the nine state-action pairs are as given in a table below. In this instance, there are four actions in state 11, two in each of states 12 and 21, and one in state 22.

State-Action Pairs
  s     a
  11    11, 12, 21, 22
  12    12, 22
  21    21, 22
  22    22

p(t | a)
         a=11    a=12    a=21    a=22
  t=11   .7200   .1440   .2880   .0576
  t=12           .6480           .2592
  t=21                   .5760   .1152
  t=22                           .5184

c(s, a)
         a=11    a=12    a=21    a=22
  s=11    0       7       6       9
  s=12            0               6
  s=21                    0       7
  s=22                            0

Let p(t | s, a) = p(t | a). Let w_t be the number of machines initially in state t. Assume that w_t = 0 for t ≠ (2, 2).
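
The transition rows and rewards above are easy to regenerate; here is a short sketch (a stand-in for the course's compdata.m, with e = 4 assumed) that rebuilds P and r from the formulas just given.

```python
from math import comb

states  = [(1, 1), (1, 2), (2, 1), (2, 2)]
actions = {s: [a for a in states if a[0] >= s[0] and a[1] >= s[1]] for s in states}
labor   = {0: 0, 1: 5, 2: 6}
e = 4                                            # daily rental revenue

def p(t, a):
    """Probability that a machine left with a = (k, l) working components
    is in working state t = (m, n) at 5 p.m. the next day."""
    (k, l), (m, n) = a, t
    if m > k or n > l:
        return 0.0
    return comb(k, m) * .8**m * .2**(k - m) * comb(l, n) * .9**n * .1**(l - n)

P, r = [], []
for s in states:
    for a in actions[s]:
        P.append([p(t, a) for t in states])
        cost = (a[0] - s[0]) + 2 * (a[1] - s[1]) + labor[a[0] + a[1] - s[0] - s[1]]
        r.append(e - cost)

print(r)    # [4, -3, -2, -5, 4, -2, 4, -3, 4], matching the MATLAB r for e = 4
```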

(a) Solution by Newton's Method. Below is the MATLAB output resulting from running the file compdata.m to create the input data and then the file newtonr1.m in the course directory to solve the component replacement problem:

>> compdata

P =
0.7200 0 0 0
0.1440 0.6480 0 0
0.2880 0 0.5760 0
0.0576 0.2592 0.1152 0.5184
0.1440 0.6480 0 0
0.0576 0.2592 0.1152 0.5184
0.2880 0 0.5760 0
0.0576 0.2592 0.1152 0.5184
0.0576 0.2592 0.1152 0.5184

r =
4
-3
-2
-5
4
-2
4
-3
4

>> newtonr1
stop =
yes
ans =
Halted after 2 iterations

V =
17.6774
20.6774
21.4412
26.6774

LastDecision =
4 2 1 1

The maximum expected net revenues earned and optimal actions in states 11, 12, 21 and 22 are respectively the elements of V above and the actions 22, 22, 21 and 22 indicated in the vector LastDecision (note that action 4 in state 11 is 22, action 2 in state 12 is 22, action 1 in state 21 is 21, and action 1 in state 22 is 22).

(b) Formulation as Linear Program. The problem is to choose the expected numbers x_{sa} ≥ 0 of machine-days in which machines end a day in state s and take replacement action a, for each s, a, to maximize the expected net revenue

(1) Σ_{s,a} r(s, a) x_{sa}

subject to

(2) Σ_a x_{ta} − Σ_{s,a} p(t | s, a) x_{sa} = w_t,  t ∈ 𝒮.

The matrix of nonzero coefficients of (2) is tabulated below (with w_22 = 1) together with r(s, a) for each of the four cases e = .5, 2, 4 and 6.
Daily Reward Vectors r(s, a) for Several e's

        s=11                       s=12         s=21         s=22
  e     a=11  a=12  a=21  a=22     a=12  a=22   a=21  a=22   a=22
  0.5   0.5   -6.5  -5.5  -8.5     0.5   -5.5   0.5   -6.5   0.5
  2     2     -5    -4    -7       2     -4     2     -5     2
  4     4     -3    -2    -5       4     -2     4     -3     4
  6     6     -1     0    -3       6      0     6     -1     6

Coefficient Matrix of Equations (2)

        s=11                            s=12              s=21              s=22
  t     a=11   a=12   a=21   a=22       a=12    a=22      a=21    a=22      a=22     w_t
  11    .28    .856   .712   .9424     -.144   -.0576    -.288   -.0576    -.0576
  12          -.648          -.2592     .352    .7408            -.2592    -.2592
  21                 -.576   -.1152             -.1152    .424    .8848    -.1152
  22                         -.5184             -.5184            -.5184    .4816    1

(c) Solution by Linear Programming. The optimal action to take and the maximum unit value V_s^* of each machine in each state s are tabulated below for each of the required values of e. Notice that the optimal actions in each state are the same for all nonnegative initial populations, and the maximum values are linear therein. In particular, if M is the initial number of machines in state 22, then the maximum value therein is V_22^* M.
Observe that as the daily rental price e increases, it is optimal to replace more components to prolong the lives of the machines. Moreover, if it is optimal to replace any components, then all failed components should be replaced because of the scale economies in the labor cost of replacement. Finally, when a machine has only one failed component and e = 4, then it is optimal to replace component α, but not β, since the former is both cheaper and more likely to fail than the latter.

Optimal Actions and Maximum Unit Values V_s^* in Each State s for Several e's

            Optimal Action in s        Maximum Unit Value V_s^* in s
  e      11    12    21    22        11       12       21       22
  0.5    11    12    21    22        1.79     2.15     2.39     2.98
  2      11    12    21    22        7.14     8.60     9.57     11.92
  4      22    22    21    22        17.68    20.68    21.44    26.68
  6      22    22    22    22        53.90    56.90    55.90    62.90

Below is the MATLAB output resulting from running the file compdata.m to create the input data and then the file tranlpsc.m in the course directory to solve the component replacement problem by linear programming. The MATLAB output for the case e = 4 is:

>> compdata

P =
0.7200 0 0 0
0.1440 0.6480 0 0
0.2880 0 0.5760 0
0.0576 0.2592 0.1152 0.5184
0.1440 0.6480 0 0
0.0576 0.2592 0.1152 0.5184
0.2880 0 0.5760 0
0.0576 0.2592 0.1152 0.5184
0.0576 0.2592 0.1152 0.5184

r =
4
-3
-2
-5
4
-2
4
-3
4

>> w = [0 0 0 1]

w = 0 0 0 1

>> tranlpsc

x = 0 -0.0000 0.0000 1.5696 0 2.9948 3.1392 0 6.9895

V =
17.6774
20.6774
21.4412
26.6774

d = 4 2 1 1

The maximum expected net revenues earned and optimal actions in states 11, 12, 21 and 22 are respectively the elements of V above and the actions 22, 22, 21 and 22 indicated in the vector d (note that action 4 in state 11 is 22, action 2 in state 12 is 22, action 1 in state 21 is 21, and action 1 in state 22 is 22). These results agree with those using Newton's method.

(d) Extending the Service Life. The service-life constraint

(3) Σ_{s,a} x_{sa} ≥ 17

must be appended to the constraints (2).

Below is the MATLAB output resulting from running the file compdata.m with e = 4 and then tranlpsc.m, to solve the component replacement problem with the service constraint by linear programming. The MATLAB output is:

>> compdata

P =
0.7200 0 0 0
0.1440 0.6480 0 0
0.2880 0 0.5760 0
0.0576 0.2592 0.1152 0.5184
0.1440 0.6480 0 0
0.0576 0.2592 0.1152 0.5184
0.2880 0 0.5760 0
0.0576 0.2592 0.1152 0.5184
0.0576 0.2592 0.1152 0.5184

r =
4
-3
-2
-5
4
-2
4
-3
4

>> w = [ 0 0 0 1]

w = 0 0 0 1

>> tranlpsc

x = 0 0.0000 0 1.3973 0 3.9360 1.8148 0.9799 8.8720


pr = 0 0.0000 0 1.0000 0 1.0000 0.6494 0.3506 1.0000

v = 24.9490

Here the elements of pr are the optimal conditional probabilities of taking each action starting in each state. In particular, it is optimal to take action 22 in states 11, 12 and 22. In state 21, it is optimal to choose action 21 with probability .6494 and action 22 with probability .3506. One consequence of adding the service constraint is to reduce the maximum expected reward from V_22 = 26.6774 to v = 24.9490.

2. Optimal Stopping Policy.

(a) Existence of Stopping Policies. 1° ⇒ 2°. Let π be a stopping policy. Then there is an M such that ‖P_π^M‖ < 1/2. Let π = (γ_1, γ_2, …), π_M = (γ_1, …, γ_M) and μ = (π_M, π_M, …), so μ is periodic. Notice that μ is also transient because

‖Σ_{N=0}^∞ P_μ^N‖ = ‖Σ_{k=0}^∞ Σ_{n=0}^{M−1} P_μ^{Mk+n}‖ ≤ Σ_{k=0}^∞ ‖(P_π^M)^k‖ Σ_{n=0}^{M−1} ‖P_π^n‖ ≤ 2 Σ_{n=0}^{M−1} ‖P_π^n‖ < ∞.

Let r_δ = −1 for all δ, V^0 = 0 and V^{N+1} = ℛV^N for N ≥ 0. We claim V^N is nonincreasing in N. This is certainly so for N = 1 since V^1 = −1 ≤ 0 = V^0. Thus, for N > 1, it follows from the monotonicity of ℛ, and so of ℛ^N, that V^{N+1} ≤ V^N. Since r_δ = −1 ≪ 0 for all δ, V^N = ℛ^N 0 ≥ V_μ^N ≥ V_μ ≫ −∞, so the sequence {V^N} is bounded below and has a limit V. Also, by the continuity of ℛ, ℛV = ℛ(lim_{N→∞} ℛ^N 0) = lim_{N→∞} ℛ^{N+1} 0 = V. Thus, there is a decision δ such that V = −1 + P_δ V ≤ −1 ≪ 0, so from Problem 2 of Homework 3, δ^∞ is transient and so stopping.
2° ⇒ 3°. Let δ^∞ be a stationary stopping policy. Then P_δ^N → 0 and, by Lemma 3, (I − P_δ)^{-1} exists and is nonnegative. Thus x_γ = 0 for γ ≠ δ and x_δ = w(I − P_δ)^{-1} ≥ 0 is a basic feasible solution of the primal.
3° ⇒ 1°. If there is a feasible solution to the primal, there is a basic feasible solution. Hence, by Theorem 15, there is a transient policy δ^∞, and δ^∞ is stopping.
(b) Existence of Optimal Stopping Policies. 4° ⇒ 3°. Since there is a stopping policy, by (a), the primal is feasible. If there exists V such that V ≥ ℛV, then the dual is also feasible. This implies that both the primal and dual have optimal solutions.
3° ⇒ 2°. If there is an optimal solution of the primal, then by Theorem 15 there is a basic optimal solution with row basis I − P_δ for some decision δ for which δ^∞ is transient, and so stopping. Thus the basic optimal solution is such that x_δ(I − P_δ) = w and x_γ = 0 for γ ≠ δ, and there exist optimal dual prices V. Since V is feasible for the dual, V ≥ ℛV. And by complementary slackness, V = r_δ + P_δ V, so V = V_δ is finite and a fixed point of ℛ. Then V_δ = ℛV_δ = ⋯ = ℛ^N V_δ ≥ V_π^N + P_π^N V_δ. If π is stopping, P_π^N → 0, so on taking limit superiors on both sides, V_δ ≥ V_π and V^* = V_δ. That V_δ is the least excessive point and least fixed point of ℛ follows from the fact that if V is excessive, then V ≥ r_δ + P_δ V, i.e., (I − P_δ)V ≥ r_δ. Premultiplying the last inequality by (I − P_δ)^{-1}, which exists and is nonnegative, implies that V ≥ (I − P_δ)^{-1} r_δ = V_δ as claimed.
2° ⇒ 1°. Since there exists a stationary maximum-value stopping policy δ^∞, V^* = V_δ and, since δ^∞ is transient, V^* is finite.
1° ⇒ 4°. Since V^* is finite, there is a stopping policy and so, by (a), a stationary stopping policy δ^∞, say. Then ℛV_δ ≥ r_δ + P_δ V_δ = V_δ. Hence, ℛ^{N+1} V_δ ≥ ℛ^N V_δ by the monotonicity of ℛ, and so of ℛ^N. Let π = (γ_i) be a policy such that ℛ^N V_δ = V_{π_N}(V_δ) = V_{π_N δ^∞} where π_N ≡ (γ_1, …, γ_N). Since δ^∞ is stopping, so is π_N δ^∞; and its value is ℛ^N V_δ, whence ℛ^N V_δ ≤ V^*. Hence, since ℛ^N V_δ is bounded above and nondecreasing in N ≥ 0, V ≡ lim_{N→∞} ℛ^N V_δ exists and, by the continuity of ℛ, V = ℛV. Thus V is excessive.
(c) Maximum Value Need Not Equal Maximum Stopping Value. Observe that since the rewards are nonpositive, V_μ is nonpositive for every policy μ. Also, notice that V_γ = 0 is the maximum value. Let π be a stopping policy. Then π uses δ in some period, so V_π = −1 is the maximum stopping value. Now ℛV = V ∨ (−1), and 0 (resp., −1) is the greatest nonpositive (resp., least) fixed point of ℛ.
3. Simple Stopping Problems. The irredundant form of the linear program in Problem 2 is that of choosing S-element row vectors x and y that maximize

x r − y c

subject to

x + y(I − P) = w,  x, y ≥ 0

where w is a positive S-element row vector.

(a) Existence of Stationary Optimal Stopping Policy. For the "if" part, observe that V is excessive since V ≥ r ∨ (−c + PV) = ℛV. Let δ^∞ be the stopping policy that stops in every state, so P_δ = 0. Then there is a stationary maximum-value stopping policy by Problem 2(b).
For the "only if" part, observe that if there is a stationary maximum-value stopping policy, then V^* is finite. Hence, by Problem 2(b), there is an excessive point of ℛ.
(b) Solution in S Iterations by Simplex Method. Recall that the simplex method is the special case of the policy improvement method in which each iteration improves the action in exactly one state. Now suppose the simplex method begins with the decision δ that stops in every state. Each subsequent iteration will either produce a stationary stopping policy (cf. Theorem 15) whose value is greater than that of its predecessor (and strictly so in the state whose action is changed), or will establish that the primal objective function is unbounded above, in which case no stationary maximum-value stopping policy exists. Consider the former case. Observe that it is not possible to improve the action in a state s from "stop" to "continue" and then back to "stop" again at subsequent iterations. This is because the former change increases the value in s from r_s to a strictly higher value while the latter returns the value in s to r_s, which contradicts the fact that the value in a state rises with each iteration. Thus the set of states in which one continues at an iteration properly contains that at the previous iteration. Hence, by the result of (a) and of Problem 2(b), the simplex method terminates in at most S iterations with a stationary maximum-value stopping policy or by finding that the primal objective function is unbounded above, in which case no such policy exists.
(c) Entrance-Fee Problem. Observe that V = r ∨ (−c + PV) if and only if V̂ = 0 ∨ (−ĉ + PV̂), and the latter is an entrance-fee problem. Thus there is a solution of the former system if and only if that is so of the latter, so there is a stationary maximum-value stopping policy in one problem if and only if that is so in the other. In either case, V^* is the least solution of the former system if and only if V̂^* = V^* − r is the least solution of the latter. Also, V_s^* = r_s if and only if V̂_s^* = 0, so the optimal states in which to stop are the same in both problems.
(d) Optimality of Myopic Policies for Entrance-Fee Problems. It is certainly optimal to continue in states s < s* with negative entrance fees c_s < 0 because one earns an immediate reward −c_s > 0 from so doing compared with a zero reward from stopping therein. On the other hand, if s ≥ s* it is optimal to stop and earn a zero reward. This is because the entrance fees are nonnegative in all states in [s*, S], and once one enters this set of states it is never possible to return to any state t < s* with a negative entrance fee because the elements of P below the diagonal are zero. Thus continuation in any state in [s*, S] cannot earn positive expected rewards. A more formal proof may be given by showing by induction on s, starting with s = S, that the least fixed point V^* = (V_s^*) of the optimal return operator is nonnegative and is given by

V_s^* = 0 for s ≥ s*, and V_s^* = −c_s + Σ_{t < s*} p_{st} V_t^* for s < s*.

4. House Buying. This is a special case of the simple stopping problem in Problem 3 in which the states are the lowest prices 1, …, S seen so far. If s is the lowest price seen so far, then one can either stop and buy the house whose price was s, with revenue r_s = −s, or continue looking at a cost c_s = c. Then (Pr)_s = −Σ_{t=1}^S (t ∧ s) p_t. Observe first that V = (V_s) = 0 satisfies V_s ≥ −s ∨ (−c + PV)_s, so by the result of Problem 3(a), a stationary optimal stopping policy exists. Also, by the result of Problem 3(c), the problem is equivalent to the entrance-fee problem in which ĉ = (ĉ_s) = c + r − Pr, so

ĉ_s = c − s + Σ_{t=1}^S (t ∧ s) p_t = c − Σ_{t=1}^S (s − t)^+ p_t.

Also, the conditions for the optimality of myopic policies in Problem 3(d) are fulfilled with the order of the states reversed. For the elements of P = (p_{st}) above the diagonal are zero and the diagonal elements are less than one except for the first, which equals one, since p_{tt} ≤ 1 − p_1 < 1 for t > 1 and p_{11} = 1. Also, ĉ_1 = c > 0, so there is a maximum s = s* such that ĉ_s > 0. Moreover, ĉ_s is nonincreasing in s, so ĉ_s > 0 for s ≤ s* and ĉ_s ≤ 0 for s > s*. Thus, if s is the lowest house price seen to date and s ≤ s*, it is optimal to buy the house whose price was s; otherwise it is optimal to continue looking.
Observe that in implementing this policy, one buys the first time that the lowest price s to date is s* or less. But this happens only if the lowest price to date is in fact that for the last house observed. Thus, the policy of buying the first house whose price is s* or less is optimal for the problem in which the option of buying a house whose price was observed in a previous period is not available.
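
A quick numerical illustration of the threshold: the optimal cutoff s* is the largest s with ĉ_s = c − Σ_{t<s}(s − t)p_t > 0. A short sketch follows, with hypothetical data.

```python
def house_buying_threshold(c, p):
    """p[t-1] is the probability a house costs t (t = 1..S); c > 0 is the search cost.
    Returns the largest price s* such that buying at any observed price <= s* is optimal."""
    S = len(p)
    s_star = 1
    for s in range(1, S + 1):
        c_hat = c - sum((s - t) * p[t - 1] for t in range(1, s))
        if c_hat > 0:
            s_star = s
    return s_star

# Prices 1..5 equally likely, search cost 0.4: keep looking until a price <= s* appears
print(house_buying_threshold(0.4, [0.2] * 5))   # s* = 2
```
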
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 5 Due May 9


(Do Problem 3 and Problem 1 or 2—Problem 2 if you took MS&E 251)

1. Bayesian Statistical Quality Control and Repair. A firm manufactures a product under a continuing contract with the government that allows the government to cancel at any time without penalty. During any given day, the process for producing the product is either in control or out of control. Whether the process is out of control is not observed directly, but can only be inferred from the results of production. Each day the firm produces one item, inspects it and classifies it as good or defective. The government accepts a good item and pays the firm r > 0 for it. The firm discards a defective item. When the process is out of control, each item produced is defective. Thus if a good item is produced, the process was in control during its production. When the process is in control, the (known) probability that it produces a good item is p, 0 < p < 1. The (known) probability that the process is in control at the time of production of an item given that it is in control at the time of production of its predecessor is q, 0 < q < 1. Once the process is out of control, it remains that way until it is repaired. Independently of the production process, there is a probability β, 0 < β < 1, that the firm will retain the contract for another day. The government informs the firm at the beginning of a day whether or not the contract is to be continued that day. If the decision is to cancel, the firm receives no further revenue from the government and incurs no further costs. If the decision is to continue, there are two possibilities that day: immediately repair at a cost K > 0 or don't repair. Repair is done quickly and brings the process into control with probability q (regardless of whether or not it was in control at the time of repair). Repair is the only permissible option when S consecutive defectives have been produced since the later of the times of the last repair and last good item. This Bayesian sequential statistical decision problem is to choose a repair policy that maximizes the expected value of profits before the government cancels the contract.
(a) Posterior Probability the Next Item is Good. Let p_s be the conditional probability that item s + 1 is good given that items 1, …, s were defective and either item 0 was good or the process was repaired just before producing item 1. Give a formula for p_s in terms of p, q and s.
(b) Maximum Expected Value. Write down explicitly an equation (in terms of β, K, r and the p_s) whose solution is the maximum expected profit before the government cancels the contract. Define the states and actions. Is this system transient? Why? Does there exist a stationary policy achieving the maximum expected profit? Explain briefly. Discuss whether or not the policy-improvement method could be used to solve this problem.
(c) Optimality of Control-Limit Policy. Show for the problem in part (b) that a control-limit policy is optimal, i.e., there is a number s* such that if the number of consecutive defective items since the later of the last good item or repair is s, it is optimal to repair if s ≥ s* and not to repair otherwise. [Hint: First show that p_s is decreasing in s. Next show by induction on N that the maximum expected profit over N days is decreasing in the state where the terminal value function is zero. Then use successive approximations to show that the maximum expected profit over the infinite horizon is decreasing in the state. Finally, establish the optimality of a control-limit policy.]
Remark. The approach to establishing the form of the optimal policy in the infinite-horizon problem given in part (c), viz., establishing a desired property of the N-period value function by induction and then retaining the property as N → ∞, is the most commonly used way to establish the form of the optimal policy in infinite-horizon problems.

2. Minimum Expected Present Value of Sojourn Times. Consider a finite transient substochastic Markov decision chain with state space 𝒮, one-period rewards r_δ and transition matrices P_δ for each decision δ ∈ Δ. Append a stopped state τ to 𝒮. The vector of probabilities that the system enters τ from each state in 𝒮 in one step using decision δ is (I − P_δ)1. Let T_{πs} be the sojourn time of the system in 𝒮 before reaching τ starting from state s and using policy π. Let T^ρ_{πs} = Σ_{j=1}^{T_{πs}} β^j be the present value of the sojourn time of the system in 𝒮 starting from state s ∈ 𝒮 and using policy π, where 0 < β = (1 + ρ)^{-1} < 1 is the discount factor. Let T_π = (T_{πs}), ET_π = (ET_{πs}), ET²_π = (ET²_{πs}), Var T_π = (Var T_{πs}) and ET^ρ_π = (ET^ρ_{πs}) be column vectors.
(a) Factorial Moments of Sojourn Times. Let T = T_δ, P = P_δ and H = H_δ. Show that H = (I − P)^{-1} and E C(T + k − 1, k) = H^k 1, k = 1, 2, …, where C(·, ·) denotes the binomial coefficient and E C(T + j, k) = (E C(T_s + j, k)) is a column vector. [Hint: Let I^N_s = 1 if the system starts in state s, uses δ^∞, and does not reach state τ before or in period N, and let I^N_s = 0 otherwise. Then T_{δs} = Σ_{N=0}^∞ I^N_s. Now compute the desired expectations by induction on k using the familiar binomial identity C(T + j + 1, k) = C(T + j, k − 1) + C(T + j, k).]
(b) Minimum Limiting Expected Present Value of Sojourn Times of Individuals and Populations, and Moment-Optimality. Suppose one exogenous individual arrives in each state s in each period. Let T̃^ρ_{δs} be the present value of the sojourn times in 𝒮 of all exogenous individuals who first enter the system in state s and their progeny, and let T̃^ρ_δ = (T̃^ρ_{δs}). Show that the following problems are equivalent:
1° Minimum Limiting Expected Present Value of Population Sojourn Time. Find δ for which lim sup_{ρ↓0} (ET̃^ρ_δ − ET̃^ρ_γ) ≤ 0 for all γ ∈ Δ.
2° Minimum Limiting Expected Discounted Individual Sojourn Time of Order One. Find δ for which lim sup_{ρ↓0} ρ^{-1}(ET^ρ_δ − ET^ρ_γ) ≤ 0 for all γ ∈ Δ.
3° Maximum Expected Population Reward with Negative Unit One-Period Reward. Find δ for which lim inf_{ρ↓0} ρ^{-1}(V^ρ_δ − V^ρ_γ) ≥ 0 for all γ ∈ Δ where r_γ = −1 for all γ ∈ Δ.
4° Policies with No 1-Improvement and Negative Unit One-Period Reward. Find δ that maximizes (v^0_δ, v^1_δ) lexicographically where r_γ = −1 for all γ ∈ Δ.
5° Moment Optimality. Find δ that minimizes (ET_δ, −Var T_δ) lexicographically.
How would you have to modify the above results if the rate of exogenous arrivals into each state were constant, but state dependent? Random (and independent) with constant rate, but state dependent? [Hint: Establish the following equivalences: 1° ⇔ 2° ⇔ 3° ⇔ 4° ⇔ 5°. To establish the first equivalence, show that ρ^{-1} is the present value of the number of exogenous individuals entering the system in each given state. Use the result of (a) to establish the last equivalence.]
(c) Risk Posture. For fixed ρ > 0, does an individual who wishes to minimize ET^ρ_δ exhibit risk aversion, risk preference or neither toward his sojourn time T_δ in 𝒮? Why? Explain briefly why your conclusion is consistent with the equivalence of 2° and 5° in (b).
(d) Limiting Maximum Expected Sojourn Times. The problems in parts (b) and (c) seem most appropriate in situations where the sojourn times are painful, so we want to get them over quickly. But often sojourn times are pleasant, e.g., a firm's time to bankruptcy, a student's stay at Stanford, or even life itself, so we want to prolong them. In this circumstance, how would you modify the results of parts (b) and (c)?

3. Optimal Control of Tandem Queues. Consider a job waiting to be processed by a factory at the beginning of some hour. The job must first be processed at station one and then at station two. The job may be processed at each station by a skilled or unskilled worker. Unskilled workers are paid the hourly rate h_i at station i while working on the job, and skilled workers are paid twice that much. If the job is processed by a worker at station one during an hour and does not require rework, the job is sent on to station two for service at the beginning of the next hour and earns the shop the progress payment r_1. If the job is processed by a worker at station two during an hour and does not require rework, the job is shipped and earns the shop an additional payment r_2. If the job is processed at either station in an hour by an unskilled worker, the job requires rework at that station during the subsequent hour with probability p, 1/2 < p < 1, whether or not the job was previously reworked. By contrast, if the job is processed by a skilled worker at either station in an hour, the job requires rework at that station during the following hour with probability 2p − 1. Thus, the probability that a skilled worker who processes the job at a station during an hour successfully completes the job there is 2(1 − p), which is double the corresponding probability for an unskilled worker. The problem is to decide whether to use a skilled or unskilled worker at each station to process the job there. In short, the issue is whether it is better to use a slow worker with low pay or a fast worker with high pay at each station.
(a) Strong-Maximum-Present-Value Policies by Strong Policy Improvement Method. Assume p = .7, h_i = 3 + 3i and r_i = 17i for i = 1, 2. Determine which stationary policies have maximum value. Use the strong policy improvement method as implemented on page 59 of Lectures in Dynamic Programming and Stochastic Control to find a strong-maximum-present-value policy starting with the policy in which an unskilled worker is used at each station.
(b) Nonstationary Arrivals. Suppose the expected number of jobs arriving for service at the factory in hour N + 1 is w_N = 3000 + N + 8 sin(Nπ/12), N = 0, 1, …. Also assume that the numbers of unskilled and skilled workers available at each station each hour are large enough so that all jobs requiring service during the hour can be processed without delay. What stationary policies have strong maximum present value? [Hint: Show that the present value of profits is the product of the present values of the arrival stream (w_N) and the profits earned by a single job.]
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 5 Due May 9

1. Bayesian Statistical Quality Control and Repair.

(a) Posterior Probability the Next Item is Good. Let G be the event that item s + 1 is good; D be the event that items 1, …, s are defective and either item 0 is good or the process is repaired before producing item 1; and I be the number of the items 1, …, s that are produced while the process is in control. Then D = ∪_{k=0}^s {D, I = k}. Let p_s = Pr(G | D). Then

p_s = Pr(G | D) = Pr(G, D)/Pr(D) = Pr(G, D, I = s) / Σ_{k=0}^s Pr(D, I = k)
    = qp[q(1 − p)]^s / {(1 − q) Σ_{k=0}^{s−1} [q(1 − p)]^k + [q(1 − p)]^s}.
(b) Maximum Expected Value. The states are the numbers 0, 1, …, S of consecutive defective items produced since the later of the time of the last repair or good item. In each state other than S, there are two actions, repair or don't repair. In state S the only action is to repair. Let V_s be the maximum expected net income given that the system is in state s. Observe that p_0 = qp. Then for 0 ≤ s < S,

(*) V_s = β max(rp_0 − K + p_0 V_0 + (1 − p_0)V_1,  rp_s + p_s V_0 + (1 − p_s)V_{s+1}),

and V_S = β(rp_0 − K + p_0 V_0 + (1 − p_0)V_1). Since the probability of retaining the contract on a day is β, 0 ≤ β < 1, the system is strictly substochastic and so transient. Thus there is a stationary maximum-value policy. Hence the policy-improvement method can be used to solve this problem.
(c) Optimality of Control-Limit Policy. To see that a control-limit policy is optimal, i.e., there exists s* such that one repairs if s ≥ s* and doesn't repair otherwise, first show that p_s is decreasing in s as follows. Put α = q(1 − p). Then

p_s = qp α^s / [(1 − q) Σ_{k=0}^{s−1} α^k + α^s] = qp / [(1 − q) Σ_{k=0}^{s−1} α^{k−s} + 1] = qp / [(1 − q) Σ_{k=1}^{s} α^{-k} + 1]

is decreasing in s because Σ_{k=1}^{s} α^{-k} is increasing in s and q < 1.

Next show that the maximum value V_s is decreasing in s = 0, …, S. To this end, put V^0 ≡ 0, so V^0_s is decreasing in s = 0, …, S. Now suppose that V^N_s is decreasing in s = 0, …, S and consider N + 1. Then for s = 0, …, S−1,

(**) V^{N+1}_s = β max(rp_0 − K + p_0 V^N_0 + (1 − p_0)V^N_1,  rp_s + p_s V^N_0 + (1 − p_s)V^N_{s+1}).

Since r > 0 and p_s is decreasing in s, rp_s is decreasing in s = 0, …, S. Also, since p_s > 0, it follows from the induction hypothesis that

p_s V^N_0 + (1 − p_s)V^N_{s+1} ≥ p_{s+1} V^N_0 + (1 − p_{s+1})V^N_{s+1} ≥ p_{s+1} V^N_0 + (1 − p_{s+1})V^N_{s+2},

i.e., p_s V^N_0 + (1 − p_s)V^N_{s+1} is decreasing in s = 0, …, S−1.¹ Thus, the second term in parentheses on the right side of (**) is decreasing in s = 0, …, S−1. Since the first term in parentheses on the right side of (**) is a constant, and the pointwise maximum of a constant and a decreasing function is decreasing, V^{N+1}_s is decreasing in s = 0, …, S−1. Also, V^{N+1}_{S−1} ≥ β(rp_0 − K + p_0 V^N_0 + (1 − p_0)V^N_1) = V^{N+1}_S, so V^{N+1}_s is decreasing in s = 0, …, S.
Now since the system is strictly substochastic, it follows from the fact (shown in Lectures …) that the maximum N-period value converges to the maximum value that lim_{N→∞} V^N_s = V_s is also decreasing in s = 0, …, S. Consequently, the second term in parentheses on the right-hand side of (*) is decreasing in s = 0, …, S−1, and the control-limit policy defined by letting s* be the smallest integer s, 0 ≤ s < S, such that rp_0 − K + p_0 V_0 + (1 − p_0)V_1 ≥ rp_s + p_s V_0 + (1 − p_s)V_{s+1} if there is such an integer, and letting s* ≡ S otherwise, has maximum value, since that decision assures that equality occurs in (*).

¹ This monotonicity argument is due to Gareth James.

2. Minimum Expected Present Value of Sojourn Times.

(a) Factorial Moments of Sojourn Times. Since the system is transient, P* = 0 and H = (I − P)^{-1}. We now show by induction that E C(T + k − 1, k) = H^k 1. This is trivially so for k = 0. Suppose it is true for k ≥ 0 and consider k + 1. Then by the induction hypothesis,

E C(T + k, k + 1) = E C(T + k − 1, k) + E C(T + k − 1, k + 1) = H^k 1 + E C(T + k − 1, k + 1).

Let R_1 be the event that the system reaches τ in the first step and p_1 = Pr(R_1). Then

E C(T + k − 1, k + 1) = E[C(T + k − 1, k + 1) | R_1] p_1 + E[C(T + k − 1, k + 1) | R_1^c](1 − p_1).

Observe that, in the above equation, the first term is 0 because C(k, k + 1) = 0, and the second term is P E C(T + k, k + 1). Thus, E C(T + k, k + 1) = H^k 1 + P E C(T + k, k + 1), so E C(T + k, k + 1) = (I − P)^{-1} H^k 1 = H^{k+1} 1.²

² This induction step is due to Alamuru Krishna.

(b) Minimum Limiting Expected Present Value of Sojourn Times of Individuals and Populations, and Moment-Optimality.
1° ⇔ 2°. The present value of the number of exogenous individuals entering the system in a given state is Σ_{i=1}^∞ β^i = ρ^{-1} for ρ > 0, since exactly one exogenous individual enters the state in each period. Then ET̃^ρ_δ = ρ^{-1} ET^ρ_δ, which implies the claim.
2° ⇔ 3°. As shown in Lectures …, for each γ and all ρ > 0, ET^ρ_γ = R^ρ_γ 1 = −R^ρ_γ(−1) = −V^ρ_γ since r_γ = −1, which establishes the claim.
3° ⇔ 4°. For small values of ρ and each γ, V^ρ_γ = Σ_{n=−1}^∞ ρ^n v^n_γ = Σ_{n=0}^∞ ρ^n v^n_γ since P*_γ = 0. Thus ρ^{-1}(V^ρ_δ − V^ρ_γ) = ρ^{-1}(v^0_δ − v^0_γ) + (v^1_δ − v^1_γ) + o(1). On taking the limit inferior as ρ ↓ 0 on both sides, one sees that the limit inferior on the left is nonnegative if and only if (v^0_δ, v^1_δ) ≽ (v^0_γ, v^1_γ), where ≽ is the usual lexicographic order.
4° ⇔ 5°. It is enough to observe that ET_δ = H_δ 1 = −H_δ r_δ = −v^0_δ and

Var T_δ = 2 E C(T_δ + 1, 2) − E C(T_δ, 1) − (E C(T_δ, 1))² = 2H²_δ 1 − H_δ 1 − (H_δ 1)² = 2v^1_δ + v^0_δ − (v^0_δ)².

No modifications are required in either case.

(c) Risk Posture. Since the present value of the individual's sojourn time is T^ρ_δ, minimizing ET^ρ_δ is equivalent to maximizing E(−T^ρ_δ), and −T^ρ_δ is a convex function of T_δ, so the individual exhibits risk preference toward his sojourn time T_δ. By the equivalence of 2° and 5°, minimizing the sum of the first two terms of the expansion of ρ^{-1} ET^ρ_δ for small ρ > 0 is equivalent to lexicographically minimizing (ET_δ, −Var T_δ). Thus the individual prefers policies that minimize ET_δ, and among policies that achieve this minimum, those that maximize Var T_δ. This is consistent with risk preference.
(d) Limiting Maximum Expected Sojourn Times. In (b), modify 1° and 2° by replacing lim sup by lim inf, and ≤ by ≥; modify 3° and 4° by replacing −1 by 1; and modify 5° by replacing minimizes by maximizes. The modified version of (c) involves an individual who wishes to maximize ET^ρ_δ, and so exhibits risk aversion towards his sojourn time T_δ. By the equivalence of the modified versions of 2° and 5°, maximizing the sum of the first two terms of the expansion of ρ^{-1} ET^ρ_δ for small ρ > 0 is equivalent to lexicographically maximizing (ET_δ, −Var T_δ). Thus the individual prefers policies that maximize ET_δ, and among policies that achieve this maximum, those that minimize Var T_δ. This is consistent with risk aversion.
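
A numerical sanity check of part (a): with H = (I − P)^{-1}, the vectors H1 and 2H²1 − H1 − (H1)² should match the mean and variance of the sojourn time estimated by simulation. A rough sketch (the transition matrix below is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.3, 0.4],
              [0.1, 0.6]])                     # strictly substochastic, hence transient
S = P.shape[0]
H = np.linalg.inv(np.eye(S) - P)

ET   = H @ np.ones(S)                          # E T, by part (a) with k = 1
VarT = 2 * (H @ H) @ np.ones(S) - ET - ET**2   # Var T = 2H^2*1 - H*1 - (H*1)^2

def simulate(s, n=100_000):
    out = np.empty(n)
    for i in range(n):
        t, state = 0, s
        while state is not None:
            t += 1
            u, c, nxt = rng.random(), 0.0, None
            for j in range(S):
                c += P[state, j]
                if u < c:
                    nxt = j
                    break
            state = nxt                        # None means the system was exited
        out[i] = t
    return out.mean(), out.var()

print(ET, VarT)            # exact factorial-moment formulas
print(simulate(0))         # Monte Carlo estimate from state 0 (close to the above)
```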

3. Optimal Control of Tandem Queues.

(a) Strong-Maximum-Present-Value Policies by Strong Policy Improvement Method.
The state space consists of two states 1 and 2, viz., the station at which the job is located. There are four possible decisions: Δ = {uu, us, su, ss}, where us represents using an unskilled worker at station 1 and a skilled worker at station 2, etc. The transition matrices associated with each decision are

  P_uu = | p     1−p |      P_us = | p     1−p  |
         | 0      p  |             | 0     2p−1 |

  P_su = | 2p−1  2(1−p) |   P_ss = | 2p−1  2(1−p) |
         |  0       p   |          |  0      2p−1 |.

Notice that each is transient and hence the system is transient, i.e., the degree is 0. Thus P*_δ = 0 and H_δ = (I − P_δ)^{-1} for all δ ∈ Δ and, if T is the minimum of the ranks of the stationary matrices over all decisions, then T = 0. Theorem 22 on p. 54 implies that a stationary strong-maximum-present-value policy exists and Theorem 25 on p. 57 implies that Δ ⊇ Δ^0 ⊇ Δ^1 ⊇ Δ^2 = Δ^∞.
Next show that Δ = Δ^0 ⊃ Δ^1 = Δ^2 = Δ^∞ and Δ^1 = {ss}, i.e., that using skilled workers at both stations is the unique stationary strong-maximum-present-value policy. Saying that Δ^0 = Δ is equivalent to saying that all stationary policies have the same value. That this is so should be clear from the results of the single-station problem. Since the downstream station (station 2) may be considered as a single station in isolation, whatever action one takes in state 2 must give the same value, i.e., v^0_{u2} = v^0_{s2} ≡ v^0_2 = r_2 − h_2/(1 − p) (= 34 − 9/.3 = 4). Then for station 1, as far as total expected value is concerned, the expected value earned in state 2 is an additional reward for shipping (so r'_1 = r_1 + v^0_2 = 17 + 4 = 21), whence the action taken in state 1 does not affect the expected value obtained in state 2, i.e., v^0_{u1} = v^0_{s1} ≡ v^0_1 = r'_1 − h_1/(1 − p) (= 21 − 6/.3 = 1). Thus the value of all stationary policies is V* = (1, 4)ᵀ.

It is possible to show this directly by calculating V_δ = H_δ r_δ = (I − P_δ)^{-1} r_δ for each δ ∈ Δ, where the reward vectors associated with each decision are

  r_uu = ( (1−p)r_1 − h_1,  (1−p)r_2 − h_2 )ᵀ        r_us = ( (1−p)r_1 − h_1,  2(1−p)r_2 − 2h_2 )ᵀ
  r_su = ( 2(1−p)r_1 − 2h_1,  (1−p)r_2 − h_2 )ᵀ      r_ss = ( 2(1−p)r_1 − 2h_1,  2(1−p)r_2 − 2h_2 )ᵀ.

Now v^0_δ = lim_{ρ↓0} V^ρ_δ = V_δ for each δ ∈ Δ, and so v^0 = (1, 4)ᵀ and Δ^0 = Δ. Now determine v^1 = max_{γ∈Δ^0}(r^1_γ − v^0 + P_γ v^1) = max_{γ∈Δ}(−v^0 + P_γ v^1). As explained on p. 59 of Lectures …, the problem of finding v^1 is precisely that of finding a stationary maximum-value policy for the transient system with the restricted decision set Δ^0 and a new reward vector r^1_γ − v^0 = −v^0. To do this, use policy improvement and, as suggested, begin with the policy uu. Since v^1_{uu} = −v^0 + P_uu v^1_{uu}, it follows that

  v^1_{uu1} = −1 + .7 v^1_{uu1} + .3 v^1_{uu2}
and
  v^1_{uu2} = −4 + .7 v^1_{uu2},

from which one concludes

  v^1_{uu} = ( −50/3, −40/3 )ᵀ.

Now consider improving uu with the aid of the comparison function

  G_γ v^1_{uu} = −v^0 + P_γ v^1_{uu} − v^1_{uu}.

Set γ = ss. Then

  G_ss v^1_{uu} = ( −1, −4 )ᵀ + | .4  .6 | ( −50/3, −40/3 )ᵀ − ( −50/3, −40/3 )ᵀ = ( 1, 4 )ᵀ,
                                | 0   .4 |

and so ss improves uu. But ss is, in fact, the decision that gives the "greatest" improvement of uu, i.e., ss is the decision that Newton's method selects. This is clear since (G_γ v^1_{uu})_t depends on γ_t, but not on γ_s for s ≠ t, so that one has

  G_uu v^1_{uu} = ( 0, 0 )ᵀ,   G_us v^1_{uu} = ( 0, 4 )ᵀ   and   G_su v^1_{uu} = ( 1, 0 )ᵀ.

Now since v^1_{ss} = −v^0 + P_ss v^1_{ss}, it follows that

  v^1_{ss1} = −1 + .4 v^1_{ss1} + .6 v^1_{ss2}
and
  v^1_{ss2} = −4 + .4 v^1_{ss2},

from which follows

  v^1_{ss} = ( −25/3, −20/3 )ᵀ.

Now we claim that G_γ v^1_{ss} ≤ 0 for all γ ∈ Δ^0, which implies that ss ∈ Δ^1. Further, since G_γ v^1_{ss} ≠ 0 unless γ = ss, it follows that Δ^1 = {ss}. Since Δ^∞ is nonempty and is a subset of Δ^1, conclude that Δ^∞ = {ss} and hence that ss is the unique stationary strong-maximum-present-value policy.
(b) Nonstationary Arrivals. Under the stationary policy δ, the Nth cohort (i.e., the jobs that arrive in hour N + 1) earns a value discounted to time N of V^ρ_{δ1}. Discounting this to time 0, the Nth cohort earns the present value β^N V^ρ_{δ1}. Thus the present value of the policy δ^∞ is

  Σ_{N=0}^∞ w_N β^N V^ρ_{δ1} = V^ρ_{δ1} Σ_{N=0}^∞ β^N w_N.

But Σ_{N=0}^∞ β^N w_N is the present value of the arrival stream. This demonstrates the hint. It follows immediately that a policy that has strong maximum present value for a single job also has strong maximum present value for the problem with nonstationary arrivals. Thus ss is the unique stationary strong-maximum-present-value policy for this problem.
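
A quick numerical check of (a), with p = .7, h = (6, 9) and r = (17, 34) as above: every decision has value (1, 4), and the v¹ comparison singles out ss. A sketch:

```python
import numpy as np

p, h, r = 0.7, (6, 9), (17, 34)
rows = {'u': lambda i: ((1 - p) * r[i] - h[i], p),              # reward, rework prob
        's': lambda i: (2 * (1 - p) * r[i] - 2 * h[i], 2 * p - 1)}

def P_and_r(d):
    """Transition matrix and reward vector for decision d in {'uu','us','su','ss'}."""
    (r1, q1), (r2, q2) = rows[d[0]](0), rows[d[1]](1)
    P = np.array([[q1, 1 - q1], [0.0, q2]])
    return P, np.array([r1, r2])

V = {d: np.linalg.solve(np.eye(2) - P_and_r(d)[0], P_and_r(d)[1])
     for d in ('uu', 'us', 'su', 'ss')}
print(V)                                    # every decision has value (1, 4)

v0 = V['ss']
v1_uu = np.linalg.solve(np.eye(2) - P_and_r('uu')[0], -v0)
for d in ('uu', 'us', 'su', 'ss'):
    P, _ = P_and_r(d)
    print(d, -v0 + P @ v1_uu - v1_uu)       # comparison G_d v1_uu; largest for ss
```
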
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 6 Due May 16

1. Limiting Present-Value Optimality with Binomial Immigration. Consider a bounded S-state discrete-time-parameter Markov population decision chain with immigration stream w = (w^N), interest rate 100ρ% > 0 and discount factor β = 1/(1 + ρ). Assume throughout that the series w^ρ ≡ Σ_{N=0}^∞ β^N w^N is absolutely convergent for all ρ > 0. (For example, that is so if w^N = O(N^q) for some q ≥ 0.) The sth diagonal element of w^ρ is the present value of the stream of immigrants into state s discounted to period one. Let V^{ρw}_π be the present value of the (cohort) policy π with the immigration stream w. Call λ limiting present-value optimal for w if lim inf_{ρ↓0} (V^{ρw}_λ − V^{ρw}_π) ≥ 0 for all π. Let Δ^n ≡ {δ ∈ Δ : V^n_δ ≽ V^n_γ, all γ ∈ Δ} and Δ^∞ ≡ {δ ∈ Δ : V_δ ≽ V_γ, all γ ∈ Δ}.
(a) Present Values of Convolutions are Products of Present Values. Suppose that w and ŵ are immigration streams for which the series defining w^ρ and ŵ^ρ are absolutely convergent. Show that (w * ŵ)^ρ = w^ρ ŵ^ρ. Show by induction on n that 1^{ρn} = (1 − β)^{-n} = (1 + ρ)^n ρ^{-n} for n = 0, ±1, ±2, …, where 1^{ρn} is the present value of the binomial sequence of order n discounted to period one.
(b) Limiting Present-Value Optimality for Binomial Immigration Stream of Order n. Show that V^{ρw}_π = w^ρ V^ρ_π. Show also that λ is limiting present-value optimal for the binomial immigration stream of order n (≥ −1) if and only if lim inf_{ρ↓0} ρ^{-n}(V^ρ_λ − V^ρ_π) ≥ 0 for all π.
(c) Δ^n is the Set of Decisions that are Limiting Present-Value Optimal for the Binomial Immigration Stream of Order n. Show that δ^∞ is limiting present-value optimal for the binomial immigration stream of order n (≥ −1) if and only if δ ∈ Δ^n.
(d) Machine Sequencing. Let α_N ≥ 0 be the expected number of jobs arriving at a firm's machine shop in week N + 1, N = 0, 1, …. The machine shop has three types of machines, labeled 1, 2, 3. Each job must be processed through one of the four machine sequences a, b, c or d given in the table below. Each job takes three weeks to process through the shop with each sequence, one week on each machine in the sequence. There are enough machines of each type so that jobs are not delayed at any machine to which they are assigned by the machine shop. The firm incurs a cost of i thousand dollars in each week a job is processed by machine i.

Machines in each Sequence

                 Week
  Sequence   1st   2nd   3rd
     a        1     3     2
     b        2     1     3
     c        2     3     1
     d        1     3     3

Which sequences are (Cesàro) overtaking optimal for the firm when the expected numbers of jobs arriving each week are respectively α_0 > 0 and α_N = 0 for N > 0; α_N = α_0 > 0 for all N ≥ 0; and α_N = α_0(N + 1) > 0 for all N ≥ 0? Which sequences are limiting present-value optimal for each of the above three job-arrival streams? Which sequences have strong maximum present value for each of the three job-arrival streams? [Hint: For each arrival stream α = (α_N) and the relevant pairs γ, δ of decisions, tabulate the difference of the value streams of δ and γ directly from the definitions, without first finding the terms of the expansions.]
(e) Social Security: Aggregating Losses to Produce a Profit. A country that does not permit immigration divides its adult population into two groups, workers (W) and retirees (R). The country has proposed a new social security system that will cover all current and future inhabitants (except retirees in the initial population, who will be supported in a different way). Individuals in the country spend one period of 30 years in each group. Let w^0 = diag(w^0_W, 0) be the initial population in the two groups to be covered by social security. The vector of exogenous individuals (children) entering the population in period N + 1 is p^N w^0, p > 0. Workers in one period have an even chance of surviving to retire in the next period and retirees in a period have no chance of surviving to retire again in the next period. Each worker earns an average income I > 0 during his working years and pays 10% thereof into social security. Each retiree receives 20p% of I from social security during retirement.
(i) Show that the expected social security payments to each individual (except the initial retirees) are greater than, equal to, or less than the amount collected from the individual respectively according as p > 1, p = 1, or p < 1, and yet social security is always in the black!
(ii) Show that the ratio of a worker's retirement income from social security to his income per person before retirement after social security payments is (2/9)p(p + 1), and so is an increasing function of p (assuming that a worker's income is divided equally between himself and his p children). In particular, that ratio is .17, .44 or .83 respectively according as p = .5, 1.0 or 1.5.
(iii) In this example with p > 1, whence the number of immigrants increases geometrically, the government pays out more to each individual than it collects (i.e., V_δ ≪ 0) and yet collects more money from the population as a whole than it pays out (i.e., lim inf_{N→∞} V^N_δ w ≫ 0 (C, 1)). Show that this phenomenon cannot occur in a transient Markov population decision chain (as in this example) with a binomial immigration stream w.

2. Maximizing Reward Rate by Linear Programming. Assume that an S-state discrete-time-parameter Markov population decision chain has degree one. Let w ≥ 0 be a row S-vector of the numbers of individuals initially in each state. Consider the following redundant dual pair of linear programs. The primal is to find row S-vectors x^{-1}_γ and x^0_γ for γ ∈ Δ that

maximize Σ_γ x^{-1}_γ r_γ

subject to

(1) Σ_γ x^{-1}_γ − Σ_γ x^0_γ Q_γ = w

(2) Σ_γ x^{-1}_γ Q_γ = 0

(3) x^{-1}_γ, x^0_γ ≥ 0 for γ ∈ Δ.

The dual is to find column S-vectors v^{-1} and v^0 that

minimize w v^{-1}

subject to

(4) v^{-1} − Q_γ v^0 ≥ r_γ

and

(5) Q_γ v^{-1} ≤ 0

for all γ ∈ Δ.
(a) Solution of Linear Programs. Show that there is a δ ∈ Δ not depending on w ≥ 0 and optimal solutions (x^{-1}_γ, x^0_γ) and (v^{-1}, v^0) respectively of the primal and dual such that x^{-1}_γ = x^0_γ = 0 for all γ ≠ δ, (v^{-1}, v^0) is independent of w ≥ 0, and v^{-1} = max_{γ∈Δ} v^{-1}_γ = v^{-1}_δ is the least feasible value of v^{-1} for the dual. [Hint: Choose δ so that G^0_{γδ} ≼ 0 for all γ ∈ Δ, or equivalently g^{-1}_{γδ} ≤ 0 and N g^{-1}_{γδ} + g^0_{γδ} ≤ 0 for all γ and all large enough N. Next show that (v^{-1}, v^0) ≡ (v^{-1}_δ, N v^{-1}_δ + v^0_δ) is feasible for the dual for all large enough N. Then show that for all large enough N, (x^{-1}_γ, x^0_γ) is feasible for the primal where x^{-1}_γ = x^0_γ = 0 for all γ ≠ δ and, for γ = δ, (x^{-1}_δ, x^0_δ) ≡ (w P*_δ, w(N P*_δ + H_δ)). Do this by using the fact that 0 ≤ R^ρ_δ = ρ^{-1} P*_δ + H_δ + o(1) for all small enough ρ > 0. Next show that for all large enough N, (x^{-1}_γ, x^0_γ) and (v^{-1}, v^0) are respectively optimal for the primal and dual because their objective-function values are equal.]
(b) Average Maximum Equals Maximum Reward Rate. Show that lim_{N→∞} N^{-1} ℛ^N 0 = max_{γ∈Δ} v^{-1}_γ. [Hint: Choose δ so G^0_{γδ} ≼ 0 for all γ ∈ Δ, or equivalently max_{γ∈Δ}[N g^{-1}_{γδ} + g^0_{γδ}] = 0 for all large enough N. Use this fact to show that for fixed large enough M, ℛ^N(M v^{-1}_δ + v^0_δ) = (M + N)v^{-1}_δ + v^0_δ, N = 1, 2, …. The geometric interpretation of this result is that when v^{-1}_δ ≠ 0, ℛ carries the half-line {N v^{-1}_δ + v^0_δ : N ≥ M} into itself. And when v^{-1}_δ = 0, v^0_δ is a fixed point of ℛ. Indeed, Theorem 6 asserts that if the system is transient, then this fixed point is unique and is the maximum value.]

(c) Stochastic Systems with Strongly Connected System Graph. Suppose the system is stochastic and the system graph is strongly connected. The last means that for each pair of states s ≠ t, there is a chain in the system graph from s to t. Show that this implies that there is a decision γ and a positive integer N (both depending on s and t) such that P^N_{γst} > 0. Show that every stationary maximum-reward-rate policy δ^∞ has the property that v⁻¹_{δs} is independent of s. [Hint: Set v⁻¹ = v⁻¹_δ, rewrite (5) as v⁻¹ ≥ P_γ v⁻¹ and iterate.]
(d) Linear Programs for Stochastic Systems with Strongly Connected System Graph. Suppose the system is stochastic, the system graph is strongly connected and Σ_s w_s = 1. Use the result of (c) to reduce the pair of dual linear programs in the problem statement to the equivalent alternate pair of linear programs given below. The alternate primal is to find row S-vectors x⁻¹_γ for γ ∈ Δ that

maximize   Σ_γ x⁻¹_γ r_γ

subject to

(6)   Σ_γ x⁻¹_γ 1 = 1

(7)   Σ_γ x⁻¹_γ Q_γ = 0

(8)   x⁻¹_γ ≥ 0 for γ ∈ Δ.

The alternate dual is to find a number u⁻¹ and a column S-vector v⁰ that

minimize   u⁻¹

subject to

(9)   1u⁻¹ + Q_γ v⁰ ≥ r_γ

for all γ ∈ Δ, where 1 is here a column S-vector of ones.

Determine whether or not these alternate linear programs are dual to one another. Also show that there is a δ ∈ Δ and optimal solutions x⁻¹_γ and (u⁻¹, v⁰) respectively of the alternate primal and dual such that x⁻¹_γ = 0 for all γ ≠ δ and 1u⁻¹ = max_{γ∈Δ} v⁻¹_γ = v⁻¹_δ, and u⁻¹ is the least feasible value of u⁻¹ for the alternate dual. [Hint: Postmultiply (1) by the column S-vector 1 to form the alternate primal.]
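The following sketch is not part of the homework. It sets up the alternate primal (6)-(8) for a small made-up stochastic chain with two states and two actions per state and solves it with scipy.optimize.linprog; the data and the sign convention Q_γ = I − P_γ are assumptions made only for this illustration.

    # A minimal sketch (not from the notes) of the alternate primal (6)-(8) for a tiny
    # made-up stochastic chain with S = 2 states and two actions per state.
    import itertools
    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical data: P[s][a] is the transition row out of state s under action a,
    # and r[s][a] is the corresponding one-period reward.
    P = {0: {0: [1.0, 0.0], 1: [0.2, 0.8]},
         1: {0: [0.0, 1.0], 1: [0.7, 0.3]}}
    r = {0: {0: 1.0, 1: 0.0},
         1: {0: 2.0, 1: 4.0}}
    S = 2
    decisions = list(itertools.product([0, 1], repeat=S))   # a decision picks an action in each state

    # Stack one row S-vector x_gamma per decision into a single variable vector.
    n_var = len(decisions) * S
    c = np.zeros(n_var)                    # linprog minimizes, so negate the rewards
    A_eq = np.zeros((1 + S, n_var))        # row 0 encodes (6); rows 1..S encode (7)
    b_eq = np.zeros(1 + S)
    b_eq[0] = 1.0
    for g, dec in enumerate(decisions):
        P_g = np.array([P[s][dec[s]] for s in range(S)])
        Q_g = np.eye(S) - P_g              # Q_gamma = I - P_gamma (sign convention assumed here)
        r_g = np.array([r[s][dec[s]] for s in range(S)])
        cols = slice(g * S, (g + 1) * S)
        c[cols] = -r_g
        A_eq[0, cols] = 1.0                # x_gamma 1
        A_eq[1:, cols] = Q_g.T             # x_gamma Q_gamma, written column-wise
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n_var, method="highs")
    print("maximum reward rate:", -res.fun)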
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 6 Due May 16

1. Limiting Present-Value Optimality with Binomial Immigration.

(a) Present Values of Convolutions are Products of Present Values. We have

(w ∗ ŵ)^ρ = Σ_{N=0}^∞ β^N (w ∗ ŵ)^N = Σ_{N=0}^∞ β^N Σ_{i=0}^N w^i ŵ^{N−i} = Σ_{i=0}^∞ β^i w^i Σ_{N=i}^∞ β^{N−i} ŵ^{N−i} = w^ρ ŵ^ρ.

We now show by induction that 1^{nρ} = (1^{1ρ})^n = ρ^{−n}(1^{1ρ}ρ)^n. The claim holds for n = 0, 1 since 1^{0ρ} = 1 = (1^{1ρ})^0 and 1^{1ρ} = Σ_{i=0}^∞ β^i = (1^{1ρ})^1. Suppose the claim is true for n > 0. Then 1^{(n+1)ρ} = (1^1 ∗ 1^n)^ρ = 1^{1ρ} 1^{nρ} = (1^{1ρ})^1 (1^{1ρ})^n = (1^{1ρ})^{n+1}.
(b) Limiting Present-Value Optimality for Binomial Immigration Streams of Order n. Evidently, V^ρ_{πw} = Σ_{k=0}^∞ β^k w^k V^ρ_π = w^ρ V^ρ_π. Then λ is limiting present-value optimal for the binomial immigration stream 1^n of order n if lim inf_{ρ↓0} 1^{nρ}(V^ρ_λ − V^ρ_π) ≥ 0 for all π, or equivalently, since 1^{nρ} = ρ^{−n}(1^{1ρ}ρ)^n and lim_{ρ↓0} (1^{1ρ}ρ)^n = 1 > 0, if lim inf_{ρ↓0} ρ^{−n}(V^ρ_λ − V^ρ_π) ≥ 0 for all π.
(c) Δ^n is the Set of Decisions that are Limiting Present-Value Optimal for the Binomial Immigration Stream of Order n. Since the system is bounded, there is a stationary policy that has maximum present value for all small enough ρ > 0. Thus, from (b), δ^∞ is limiting present-value optimal for the binomial immigration stream of order n if and only if lim inf_{ρ↓0} ρ^{−n}(V^ρ_δ − V^ρ_γ) ≥ 0 for all γ ∈ Δ. On substituting the relevant Laurent expansions, the last is so if and only if lim inf_{ρ↓0} ρ^{−n} Σ_{i=−1}^∞ ρ^i (v^i_δ − v^i_γ) = lim inf_{ρ↓0} Σ_{i=−1}^n ρ^{i−n} (v^i_δ − v^i_γ) ≥ 0 for all γ ∈ Δ, or equivalently V^n_δ − V^n_γ ⪰ 0 for all γ, i.e., δ ∈ Δ^n.
(d) Machine Sequencing. The following table contains all the data required to establish optimality of policies under the different preference relations. By a suitable choice of units we can and do assume α₀ = 1, so the three immigration streams are essentially equivalent to binomial streams of orders 0, 1 and 2 respectively. By examining the column corresponding to N ≥ 3, observe that for binomial immigration of order 0, policies a, b and c are Cesàro overtaking optimal, and so by part (c), limiting present-value optimal. For the binomial immigration stream of order 1, note that policies a and b are Cesàro overtaking optimal; it suffices to tabulate the differences for policies a, b and c since they are the only policies in Δ⁰. Therefore a and b are in Δ¹ and, by (c), limiting present-value optimal. Thus, for the third immigration stream, which is binomial of order 2, it suffices to compute the difference of the values of policies a and b. The table shows that policy a is Cesàro overtaking optimal and, by (c), limiting present-value optimal.


                              N = 1     N = 2     N ≥ 3
    V^N_c − V^N_d               1         1         1
    V^N_b − V^N_c               0         2         0
    V^N_a − V^N_b               1        −1         0
    V^{N,1}_b − V^{N,1}_c       0         2         2
    V^{N,1}_a − V^{N,1}_b       1         0         0
    V^{N,2}_a − V^{N,2}_b       1         1         1

The only policy that has maximum present value for all sufficiently small positive interest rates is a since it is the unique element of Δ², and so Δ² = Δ^∞ ≠ ∅ by Theorem 25 of §1.8 and the fact that the system is bounded.
(e) Social Security: Aggregating Losses to Produce a Profit.

(i) The expected social security payments to an individual are (1/2)(.2p)I = .1pI. However, each individual pays .1I to social security during his working years, which is less than, equal to or greater than the expected social security payments to him according as p > 1, p = 1 or p < 1 respectively. Yet social security breaks even in every period except the first one, when it collects money but does not make any payments, as can be seen from the fact that in period N social security collects .1 p^{N−1} w⁰_W I and pays .2pI (1/2) p^{N−2} w⁰_W.

(ii) Observe that the income per person before retirement after social security payments is given by .9I/(p + 1), as the net income is split among an individual and his p children. Then the ratio of a worker's retirement income from social security to his income before retirement after social security payments is (2/9)p(p + 1).
(iii) We claim that lim_{N→∞} V^{N,n}_δ ≤ 0 for all n = −1, 0, … . This is obvious for n = −1 since V^{N,−1}_δ → v⁻¹_δ = 0. Now suppose that n ≥ 0. Choose ε > 0 so that V_δ ≤ −2ε1. It is enough to show by induction on n that V^{N,n}_δ ≤ −ε1 for all large enough N and all n ≥ 0. To that end, observe that since δ is transient, V^N_δ → V_δ, so V^N_δ ≤ −ε1 for all large enough N. Now suppose the claim is so for n and consider n + 1. Then V^{N,n+1}_δ = Σ_{i=0}^N V^{i,n}_δ ≤ −ε1 for all large enough N, as claimed.

2. Maximizing Reward Rate by Linear Programming. Consider the primal program of finding row S-vectors x⁻¹_γ and x⁰_γ for γ ∈ Δ that

maximize   Σ_γ x⁻¹_γ r_γ

subject to

(1)   Σ_γ x⁻¹_γ + Σ_γ x⁰_γ Q_γ = w

(2)   Σ_γ x⁻¹_γ Q_γ = 0

(3)   x⁻¹_γ, x⁰_γ ≥ 0 for γ ∈ Δ.

Then the dual program is to find column S-vectors v⁻¹ and v⁰ that

minimize   w v⁻¹

subject to

(4)   v⁻¹ + Q_γ v⁰ ≥ r_γ

and

(5)   Q_γ v⁻¹ ≥ 0

for all γ ∈ Δ.
(a) Solution of Linear Programs. The first step is to observe from Theorem 22 of §1.8 that there is a δ such that G⁰_{γδ} ⪯ 0 for all γ ∈ Δ, or equivalently,

(6)   g⁻¹_{γδ} ≤ 0 and N g⁻¹_{γδ} + g⁰_{γδ} ≤ 0 for all γ ∈ Δ and all large enough N.

Now from Theorem 26 of §1.8, v⁻¹_δ = max_{γ∈Δ} v⁻¹_γ independently of w ≥ 0. Also (6) holds if and only if (v⁻¹, v⁰) ≡ (v⁻¹_δ, N v⁻¹_δ + v⁰_δ) is feasible for the dual for all large enough N.

The next step is to observe from Theorem 21 of §1.8 that 0 ≤ R^ρ_δ = ρ⁻¹P*_δ + D_δ + o(1) for all small enough ρ > 0. Thus, x⁰_δ ≡ N x̄⁻¹_δ + x̄⁰_δ ≥ 0 for all large enough N and all w ≥ 0, where x̄⁻¹_δ ≡ wP*_δ and x̄⁰_δ ≡ wD_δ. Let x⁻¹_δ ≡ x̄⁻¹_δ and x⁻¹_γ = x⁰_γ = 0 for all γ ≠ δ. We claim that (x⁻¹_γ, x⁰_γ), γ ∈ Δ, is feasible for the primal for all large enough N. The claim follows from x⁻¹_δ Q_δ = wP*_δ Q_δ = 0 and

x⁻¹_δ + x⁰_δ Q_δ = x̄⁻¹_δ + (N x̄⁻¹_δ + x̄⁰_δ) Q_δ = x̄⁻¹_δ + x̄⁰_δ Q_δ = w(P*_δ + D_δ Q_δ) = w

since D_δ Q_δ = D_δ(P*_δ + Q_δ) = I − P*_δ, the last from Theorem 20 of §1.8.

The final step is to observe that x⁻¹_δ r_δ = wP*_δ r_δ = w v⁻¹_δ = w v⁻¹ for all w ≥ 0. This implies that for all large enough N, (x⁻¹_γ, x⁰_γ) and (v⁻¹, v⁰) are respectively optimal for the primal and dual, and v⁻¹_δ is the least feasible value of v⁻¹ for the dual.
(b) Average Maximum Equals Maximum Average Reward Rate. Choose δ so G⁰_{γδ} ⪯ 0 for all γ ∈ Δ. Thus max_{γ∈Δ}(N g⁻¹_{γδ} + g⁰_{γδ}) = 0 for all large enough N, say all N ≥ M, or equivalently, ℛ(N v⁻¹_δ + v⁰_δ) = (N + 1) v⁻¹_δ + v⁰_δ for N ≥ M. Thus, by repeated application of the last equation, ℛ^N(M v⁻¹_δ + v⁰_δ) = ℛ^{N−1}((M + 1) v⁻¹_δ + v⁰_δ) = ⋯ = (M + N) v⁻¹_δ + v⁰_δ, N = 1, 2, …. Now Theorem 10 of §1.7 implies that ‖ℛ^N(M v⁻¹_δ + v⁰_δ) − ℛ^N 0‖ = O(1). Therefore, on combining these facts it follows that lim_{N→∞} N⁻¹ ℛ^N 0 = lim_{N→∞} N⁻¹ ℛ^N(M v⁻¹_δ + v⁰_δ) = v⁻¹_δ as claimed.
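A small sketch, not part of the solution, illustrating the conclusion of part (b): iterating v ← ℛv from v = 0 and dividing by N approaches the maximum reward rate. The two-state data are made up.

    # Made-up two-state data (not from the notes); R v = max_delta (r_delta + P_delta v).
    import numpy as np

    P = [np.array([[1.0, 0.0], [0.0, 1.0]]),     # hypothetical decision delta_1
         np.array([[0.2, 0.8], [0.7, 0.3]])]     # hypothetical decision delta_2
    r = [np.array([1.0, 2.0]), np.array([0.0, 4.0])]

    v = np.zeros(2)
    for N in range(1, 2001):
        v = np.max([r[d] + P[d] @ v for d in range(len(P))], axis=0)   # v <- R v
    print("N^{-1} R^N 0 is approximately", v / N)   # both components approach the maximum reward rate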
(c) Stochastic Systems with Strongly Connected System Graph. Let s ≠ t be states. Since the system graph is strongly connected, there is a chain R in the system graph from s to t. For each state e ∈ R \ {t}, let f_e ∈ R be such that (e, f_e) is an arc in the system graph. Thus there is a decision γ such that p(f_e | e, γ_e) > 0 for each such state e. Hence P^N_{γst} > 0 where N is the number of arcs in R, establishing the first claim.

Now let δ^∞ be a stationary maximum-reward-rate policy, v⁻¹ ≡ v⁻¹_δ, s be a state that minimizes v⁻¹_s and t ≠ s be any other state. As shown above, since the system graph is strongly

connected, there is a decision γ and a positive integer N such that P^N_{γst} > 0. Then from (5), v⁻¹ ≥ P_γ v⁻¹ ≥ ⋯ ≥ P^N_γ v⁻¹, so by the definition of s and the fact that P^N_γ is stochastic, v⁻¹_s ≥ P^N_{γst} v⁻¹_t + (1 − P^N_{γst}) v⁻¹_s. Hence since P^N_{γst} > 0, v⁻¹_s ≥ v⁻¹_t. However, by definition of s, v⁻¹_s ≤ v⁻¹_t, so v⁻¹_s = v⁻¹_t, i.e., v⁻¹_{δs} is independent of s as claimed.
(d) Linear Programs for Stochastic Systems with Strongly Connected System Graph. Suppose the system is stochastic, the system graph is strongly connected and Σ_s w_s = 1. Now postmultiplying (1) by the column S-vector 1 of ones, and using the facts that Q_γ 1 = 0 (because P_γ is stochastic) and w1 = 1, yields (6) of the alternate primal. Thus, the constraint set of the alternate primal contains that of the original primal. Also, since one optimal solution (v⁻¹, v⁰) of the dual takes the form v⁻¹ = 1u⁻¹, the last inequality (5) of the dual is redundant and the dual objective function reduces to w1u⁻¹ = u⁻¹. Hence the dual is equivalent to the alternate dual. Further, the alternate dual is the dual of the alternate primal. Now choose δ as in (a) above. Then there are optimal solutions (x⁻¹_γ, x⁰_γ) and (v⁻¹, v⁰) respectively of the original primal and dual such that x⁻¹_γ = 0 for all γ ≠ δ and max_{γ∈Δ} v⁻¹_γ = v⁻¹_δ = 1u⁻¹. Thus, x⁻¹_γ is optimal for the alternate primal and u⁻¹ is optimal and the least feasible value of u⁻¹ for the alternate dual.
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 7 Due May 23

1. Discovering System Boundedness. Consider the following collections of linear inequalities and nonlinear equations for a discrete-time-parameter finite Markov population decision chain with r_δ = 1 for all δ, viz.,

(1)   P_δ v ≤ v and r_δ + P_δ u ≤ u + v for all δ ∈ Δ, and (v, u) ≥ 0

and

(2)   max_{δ∈Δ} P_δ v = v,  max_{δ∈Δ} (r_δ + P_δ u) = u + v  and  (v, u) ≥ 0.

Parts (a)-(c) below establish that system boundedness is equivalent to the existence of a (v, u) satisfying either of the conditions (1) or (2). Thus system boundedness can be checked by linear programming.
(a) (1) Implies System Boundedness. Show that if there is a pair (v, u) satisfying (1), then the system is bounded. [Hint: Argue as in the proof of the System-Degree Theorem 9 of §1.7 for d = 1. In particular, let J, K be a partition of the states for which v_J ≫ 0 and v_K = 0. Show that P_{δKJ} = 0 for all δ ∈ Δ. Then show the restriction of the system to states in K is transient by considering the second system of inequalities.]

(b) System Boundedness Implies (2). Show that if the system is bounded, then there is a (v, u) that satisfies (2). [Hint: There is a δ^∞ with strong maximum present value, so G⁰_{γδ} ⪯ 0 for all γ ∈ Δ. Then show that (v, u) = (v⁻¹_δ, M v⁻¹_δ + v⁰_δ) satisfies (2) for large enough M > 0.]

(c) (2) Implies (1). Show that if (v, u) satisfies (2), then (v, u) satisfies (1).
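As a sketch of the closing remark of the problem statement (not part of the problem itself), the feasibility of condition (1) can be tested with an off-the-shelf linear-programming solver; the transition matrices below are made up.

    # Check condition (1) by linear programming: find (v, u) >= 0 with
    # P_delta v <= v and 1 + P_delta u <= u + v for every decision (made-up data).
    import numpy as np
    from scipy.optimize import linprog

    P = [np.array([[0.5, 0.5], [0.0, 1.0]]),
         np.array([[0.0, 0.9], [0.3, 0.6]])]
    S = 2

    # Variables z = (v, u); all inequalities are written in the form A_ub z <= b_ub.
    A_ub, b_ub = [], []
    for P_d in P:
        A_ub.append(np.hstack([P_d - np.eye(S), np.zeros((S, S))]))   # P_d v - v <= 0
        b_ub.append(np.zeros(S))
        A_ub.append(np.hstack([-np.eye(S), P_d - np.eye(S)]))         # 1 + P_d u - u - v <= 0
        b_ub.append(-np.ones(S))
    res = linprog(np.zeros(2 * S), A_ub=np.vstack(A_ub), b_ub=np.hstack(b_ub),
                  bounds=[(0, None)] * (2 * S), method="highs")
    print("bounded" if res.status == 0 else "no feasible (v, u); system not bounded")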

2. Finding the Maximum Spectral Radius. Consider a discrete-time-parameter finite Markov population decision chain.

(a) Monotonicity of Spectral Radius. Show that if P and Q are square real matrices of the same order with 0 ≤ P ≤ Q, then σ_P ≤ σ_Q. [Hint: Use Corollary 8 of §1.7.]

(b) Bounds on Spectral Radius. Show that if P = (p_{ij}) is a square real nonnegative matrix, then

m ≡ min_i Σ_j p_{ij} ≤ σ_P ≤ max_i Σ_j p_{ij} ≡ M.

[Hint: Construct matrices P̲ and P̄ for which 0 ≤ P̲ ≤ P ≤ P̄, the row sums of P̲ all equal m and the row sums of P̄ all equal M.]

(c) Bounds on Maximum Spectral Radius. Show that m ≤ σ ≤ M where

m ≡ min_s max_a Σ_t p(t | s, a),  M ≡ max_{a,s} Σ_t p(t | s, a)  and  σ ≡ max_δ σ_δ

is the maximum of the spectral radii of the P_δ.

(d) Finding the Maximum Spectral Radius. Show that if 0 ≤ m < M, m ≤ σ ≤ M, and θ is the midpoint of the interval [m, M], then θ > σ if and only if the system with transition matrices θ⁻¹P_δ for all δ is transient. Use this fact to give an iterative method for estimating σ.


3. Irreducible Systems and Cesàro Geometric-Overtaking Optimality. Consider a discrete-time-parameter finite Markov population decision chain. Let ℛv ≡ max_{δ∈Δ} P_δ v. Call a real number μ an eigenvalue of ℛ if there is a real vector v, called an associated eigenvector, such that ℛv = μv. Call the system irreducible if the system graph is strongly connected, i.e., there is a chain (directed path) from each state to each other state. Let σ ≡ max_{δ∈Δ} σ_δ be the maximum spectral radius.

(a) System Boundedness. Show that if ℛ has a unit eigenvalue (μ = 1) and associated eigenvector v ≫ 0, then the system is bounded.

(b) Unit Eigenvalue of ℛ with Positive Eigenvector. Show that if the system is irreducible and bounded, but not transient, then ℛ has a unit eigenvalue and associated eigenvector v ≫ 0. [Hint: Use the results of parts (a)-(c) of Problem 1. Show that since the system is not transient, the set J in the hint for Problem 1(a) is not empty. Then use the irreducibility to show that the set K is empty.]

(c) Positive Maximum Spectral Radius an Eigenvalue of ℛ with Positive Eigenvector. Show that if the system is irreducible, then σ > 0, σ is an eigenvalue of ℛ and there is an associated eigenvector v ≫ 0. [Hint: Use the irreducibility and Problem 2(c) to show that σ > 0. Next consider the normalized system in which P_δ is replaced by P̄_δ ≡ σ⁻¹P_δ for each δ. Let r_δ = 1 for all δ ∈ Δ and V̄^ρ be the resulting maximum expected present value with interest rate 100ρ% > 0, so that V̄^ρ = max_{δ∈Δ} (β1 + βP̄_δ V̄^ρ) ≥ β1. Let |u| be the sum of the absolute values of the elements of u. Show that ‖V̄^ρ‖⁻¹ V̄^ρ has a limit point v ≥ 0 as ρ ↓ 0 with |v| = 1, and ℛ̄ ≡ σ⁻¹ℛ has a unit eigenvalue with v being an associated eigenvector. Then use the irreducibility to show that v ≫ 0.]

(d) Perron-Frobenius Theorem. Show that if P_δ is irreducible, then σ_δ > 0, σ_δ is an eigenvalue of P_δ and there is an associated eigenvector v ≫ 0, i.e., P_δ v = σ_δ v.

(e) Maximum Spectral Radius is Maximum Population Growth Factor. Show that if the system is irreducible, then max_{δ∈Δ} ‖P^N_δ‖ = O(σ^N). Then show that the normalized system is bounded. [Hint: Apply the result of part (c).]

(f) Existence of Stationary Cesàro Geometric-Overtaking Optimal Policies. Suppose that the system is irreducible. Call λ Cesàro geometrically-overtaking optimal if

lim inf_{N→∞} σ^{−N} (V^{N,−1}_λ − V^{N,−1}_π) ≥ 0 (C, 1) for all π.

Show that there is a stationary Cesàro geometrically-overtaking optimal policy, viz., any stationary maximum-reward-rate policy for the normalized system.

(g) Maximum Long-Run Growth of Expected Symmetric Multiplicative Utility. Discuss the application of the result of part (f) to the problem of maximizing the long-run growth of expected symmetric multiplicative utility for an irreducible finite Markov decision chain.
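A rough power-method sketch, not the argument called for in the problem, that often recovers σ and a positive eigenvector of ℛ for an irreducible (and aperiodic) example; it can fail to converge for periodic systems. The matrices are made up.

    # Normalized successive approximations for (sigma, v) with R v = max_delta P_delta v
    # (made-up nonnegative matrices; populations may grow).
    import numpy as np

    P = [np.array([[0.0, 2.0], [1.0, 0.5]]),
         np.array([[0.5, 1.0], [0.5, 0.0]])]

    v = np.ones(2)
    sigma = 1.0
    for _ in range(500):
        Rv = np.max([P_d @ v for P_d in P], axis=0)          # apply R
        sigma = np.linalg.norm(Rv, 1) / np.linalg.norm(v, 1)
        v = Rv / np.linalg.norm(Rv, 1)                       # renormalize so the iterates stay bounded
    print("sigma is approximately", sigma)
    print("check R v:", np.max([P_d @ v for P_d in P], axis=0), "vs sigma v:", sigma * v)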
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 7 Due May 23

1. Discovering System Boundedness.

(a) (1) Implies System Boundedness. Let J, K be a partition of the states such that v_J ≫ 0 and v_K = 0, so from P_δ v ≤ v it follows that P_{δJJ} v_J ≤ v_J and P_{δKJ} v_J ≤ 0. Then, by induction, P^N_{δJJ} v_J ≤ v_J and so P^N_{δJJ} = O(1). Conclude from P_{δKJ} v_J ≤ 0, v_J ≫ 0 and P_δ ≥ 0 that P_{δKJ} = 0 and thus P^N_{δKJ} = 0 for every N. Also, since (v, u) satisfies (1), 1 + P_{δKK} u_K ≤ u_K, so by Problem 2(a) of Homework 3, the restriction of the system to states in K is transient, whence P^i_{δKK} = O(α^i) for some α, 0 < α < 1. Observe that, since δ is arbitrary, it follows from the System-Degree Theorem that for each policy π = (δ₁, δ₂, …), max_π ‖P^N_{πJJ}‖ = O(1), max_π ‖P^N_{πKJ}‖ = 0 for every N and max_π ‖P^i_{πKK}‖ = O(α^i). Thus P^N_{πJK} = Σ_{i=1}^N P^{i−1}_{πJJ} P_{δ_i JK} P^{N−i}_{πKK}, max_π ‖P^{i−1}_{πJJ}‖ = O(1) and max_π ‖P^{N−i}_{πKK}‖ = O(α^{N−i}), so max_π ‖P^N_{πJK}‖ = O(1). Hence max_π ‖P^N_π‖ = O(1).
(b) System Boundedness Implies (2). Since the system is bounded, by Theorem 22 there is a stationary strong-maximum-present-value policy δ^∞. Then G_{γδ} ⪯ 0 for every γ, so (g⁻¹_{γδ}, g⁰_{γδ}) ⪯ 0, and G_{δδ} = 0. Observe that since the rewards are positive, V^ρ_δ ≥ 0 for all small enough ρ > 0. Thus since V^ρ_δ = Σ_{n=−1}^∞ ρ^n v^n_δ, (v⁻¹_δ, v⁰_δ) ⪰ 0. These facts imply that v⁻¹_δ ≥ 0 and that there exists M > 0 with max_γ g⁻¹_{γδ} = 0, max_γ (g⁰_{γδ} + M g⁻¹_{γδ}) = 0 and M v⁻¹_δ + v⁰_δ ≥ 0. These inequalities restate the fact that (v, u) ≡ (v⁻¹_δ, M v⁻¹_δ + v⁰_δ) satisfies (2).

(c) (2) Implies (1). Immediate.

2. Finding the Maximum Spectral Radius.

(a) Monotonicity of Spectral Radius. The proof is by contradiction. Suppose 0 ≤ P ≤ Q, but σ_Q < σ_P. Then σ_P⁻¹Q has spectral radius σ_P⁻¹σ_Q < 1. Thus σ_P⁻¹Q is transient, and so σ_P⁻¹P is also transient since σ_P⁻¹P ≤ σ_P⁻¹Q. But 1 is the spectral radius of σ_P⁻¹P, contradicting the transience of σ_P⁻¹P.

(b) Bounds on Spectral Radius. Since m ≤ σ_P ≤ M is trivial if M = 0 and the left-hand inequality is trivial if m = 0, it suffices to consider the case that 0 < m ≤ M. Now observe that if Q is a square stochastic matrix, then Q^N 1 = 1 for N = 1, 2, …, so μ_Q = 1, and hence σ_Q = 1 by Lemma 7. Thus respectively reducing the elements of rows of P with sums above m, and increasing the elements of rows of P with sums below M, produces matrices P̲ and P̄ such that 0 ≤ P̲ ≤ P ≤ P̄, P̲1 = m1 and P̄1 = M1. Hence m⁻¹P̲ and M⁻¹P̄ are stochastic and so have spectral radius one, whence σ_{P̲} = m and σ_{P̄} = M. Thus, from part (a), m ≤ σ_P ≤ M.
(c) Bounds on the Maximum Spectral Radius. Choose γ so that m = min_s Σ_t p(t | s, γ_s). Then, by part (b) and Lemma 11 of §1, it follows that m ≤ σ_γ ≤ σ ≤ max_δ ‖P_δ‖ = M where ‖·‖ is the Chebyshev norm.

(d) Finding the Maximum Spectral Radius. The θ-normalized system is the one with transition matrices θ⁻¹P_δ for all δ. Observe from Lemma 7 of §1 that σ < θ if and only if the

malized system is transient, a fact that can be checked by the method of problem #(b) of the
Homework 3. If the )-normalized system is transient, then reset Q œ ); otherwise, reset 7 œ ).
The resulting interval [7ß Q ] contains 5 by what was just shown, and the interval is one-half
the length of the original interval. Now repeat this construction iteratively until Q  7 is small
enough.
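A sketch, not part of the solution, of the bisection just described. Transience of the θ-normalized system is tested here heuristically, by checking whether successive approximations with all rewards equal to one remain bounded, rather than by the method of Homework 3. The matrices are made up.

    # Bisection for the maximum spectral radius (made-up data).
    import numpy as np

    P = [np.array([[0.0, 2.0], [1.0, 0.5]]),
         np.array([[0.5, 1.0], [0.5, 0.0]])]
    S = 2

    def looks_transient(matrices, iters=500, cap=1e8):
        """Run v <- 1 + max_delta P_delta v; bounded iterates suggest a transient system."""
        v = np.zeros(matrices[0].shape[0])
        for _ in range(iters):
            v = 1.0 + np.max([A @ v for A in matrices], axis=0)
            if v.max() > cap:
                return False
        return True

    m = min(max(A[s].sum() for A in P) for s in range(S))     # lower bound from Problem 2(c)
    M = max(A[s].sum() for A in P for s in range(S))          # upper bound from Problem 2(c)
    for _ in range(30):                                       # bisection on [m, M]
        theta = 0.5 * (m + M)
        if looks_transient([A / theta for A in P]):
            M = theta        # sigma < theta
        else:
            m = theta        # sigma >= theta
    print("maximum spectral radius is approximately", 0.5 * (m + M))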

3. Irreducible Systems and Cesàro Geometric-Overtaking Optimality.

(a) System Boundedness. Let v be a positive eigenvector associated with the unit eigenvalue of ℛ. Then v = ℛv = ⋯ = ℛ^N v = max_π P^N_π v, so P^N_π = O(1).

(b) Unit Eigenvalue of ℛ with Positive Eigenvector. Suppose the system is bounded and irreducible, but not transient. From Problem 1, there exists v ≥ 0 with max_{δ∈Δ} P_δ v = v. Thus, v is a nonnegative eigenvector associated with a unit eigenvalue. We claim v is positive. Consider a partition of the state space J, K as in 1(a). Since the system is not transient, J ≠ ∅ because, as shown in 1(a), the restriction of the system to states in K is transient. Furthermore, K = ∅, for if not, P^N_{πKJ} = 0 for every N as in 1(a). But this contradicts the irreducibility of the system. Thus, v = v_J ≫ 0.

(c) Positive Maximum Spectral Radius an Eigenvalue of ℛ with Positive Eigenvector. Since the system is irreducible, for every state s there exists an action a such that Σ_t p(t | s, a) > 0. Thus by 2(c), σ ≥ m = min_s max_a Σ_t p(t | s, a) > 0. Consider the normalized system P̄_δ = σ⁻¹P_δ. This system is not transient because σ̄ = 1. Let r̄_δ = 1 for all δ ∈ Δ. Then the maximum present value V̄^ρ with interest rate 100ρ% > 0 is such that lim_{ρ↓0} ‖V̄^ρ‖ = ∞, because in the contrary event V̄^ρ is bounded and so the system is transient. But since all rewards are nonnegative, V̄^ρ = max_{δ∈Δ} {β1 + βP̄_δ V̄^ρ} ≥ β1 ≫ 0. Now multiplying both sides by ‖V̄^ρ‖⁻¹ yields V̄^ρ‖V̄^ρ‖⁻¹ = max_{δ∈Δ} [β1‖V̄^ρ‖⁻¹ + βP̄_δ V̄^ρ‖V̄^ρ‖⁻¹]. Thus since V̄^ρ‖V̄^ρ‖⁻¹ has unit norm, it has a limit point v ≥ 0 as ρ converges to zero. Also ‖v‖ = 1 and v = max_{δ∈Δ} P̄_δ v = max_{δ∈Δ} σ⁻¹P_δ v, or equivalently, σv = ℛv. This shows that σ is an eigenvalue of ℛ and v is an associated eigenvector. Observe that this implies P_δ v ≤ σv for all δ. Let J, K partition the state space so that v_K = 0 and v_J ≫ 0. Since ‖v‖ = 1, necessarily J ≠ ∅. And, if K ≠ ∅, then it follows from P_δ v ≤ σv that P_{δKJ} v_J ≤ 0 for all δ, which implies P_{δKJ} = 0, contradicting the irreducibility of the system. Thus v ≫ 0.

(d) Perron-Frobenius Theorem. This is part (c) when Δ = {δ}.

(e) Maximum Spectral Radius is Maximum Population Growth Rate. Since the system is irreducible, σ > 0 (from (c)) and there exists v ≫ 0 such that σv = ℛv ≥ P_δ v for all δ ∈ Δ, so max_π P^N_π v ≤ σ^N v, whence max_π ‖P^N_π‖ = O(σ^N). Furthermore, max_π ‖P̄^N_π‖ = σ^{−N} max_π ‖P^N_π‖ = O(1) and so the normalized system is bounded.

(f) Existence of Stationary Cesàro Geometric-Overtaking Optimal Policies. By (e) the normalized system is bounded. Then by Theorem 32 there exists a stationary Cesàro-overtaking-optimal policy δ^∞ for the binomial immigration stream of order 1, i.e.,

lim inf_{N→∞} (V̄^{N,−1}_δ − V̄^{N,−1}_π) ≥ 0 (C, 1)

for all π = (γ₁, γ₂, …), where V̄^{N,−1}_π = P̄^{N−1}_π r_{γ_N} = σ^{−(N−1)} P^{N−1}_π r_{γ_N} = σ^{−(N−1)} V^{N,−1}_π. Thus,

lim inf_{N→∞} σ^{−N} (V^{N,−1}_δ − V^{N,−1}_π) ≥ 0 (C, 1) for all π.

(g) Maximum Long-Run Growth of Expected Symmetric Multiplicative Utility. From equation (6) of §1.4 of the Lectures, the maximum expected utility V^N in periods 1, …, N satisfies V^N = max_{δ∈Δ} P_δ V^{N−1} = max_{π∈Δ^∞} P^N_π V^0 where V^0 = 1. Let r_δ = 1 for all δ. Then V^{N,−1}_π = P^{N−1}_π 1. Since the system is irreducible, apply the result of (f) to find a policy that maximizes the long-run growth of expected symmetric utility.
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 8 Due May 30

1. Element-Wise Product of Symmetric Positive Semi-Definite Matrices. Show that if C is an m × m symmetric positive semi-definite matrix that is partitioned symmetrically into blocks C_{ij} with C_{ii} symmetric positive definite for i, j = 1, …, n ≤ m, and if Q = (q_{ij}) is an n × n symmetric positive definite matrix, then the matrix H composed of blocks q_{ij} C_{ij} is symmetric and positive definite. [Hint: Observe that since C is a symmetric positive semi-definite matrix, there is a diagonal matrix Λ = diag(λ₁, …, λ_m) ≥ 0 of eigenvalues of C and a matrix V with column m-vectors v₁, …, v_m such that C = VΛV^T = Σ_{k=1}^m λ_k v_k v_k^T. One can partition each m-element row vector x^T = (x₁^T, …, x_n^T) corresponding to the partitioning of C, and in particular v_k^T = (v_{k1}^T, …, v_{kn}^T). Let w_k = (x_i^T v_{ki}) be an n-element column vector for each k. Show that x^T H x = Σ_{k=1}^m λ_k w_k^T Q w_k.]

2. Quadratic Unconstrained Team-Decision Problem with Normally Distributed Observations. Consider the quadratic unconstrained team-decision problem in which Q is symmetric and positive definite, and the random column vectors Z₁, …, Z_n have a joint normal distribution with EZ_i = 0, C_{ij} denoting the matrix of covariances between the components of Z_i and Z_j, and C_{ii} being an identity matrix for i, j = 1, …, n. You may assume the known (e.g., Feller, Vol. II, 80-87) facts that the covariance matrix C = (C_{ij}) is symmetric and positive semi-definite, and that E_{Z_i} Z_j = C_{ji} Z_i for i, j = 1, …, n. Also assume that E_{Z_i} δ_i = μ_i Z_i + Eδ_i for some row vector μ_i, i = 1, …, n. Show that there is an optimal team decision function X* that is affine, i.e., X_i*(Z_i) = b_i Z_i + c_i where the row vectors b_i and numbers c_i, i = 1, …, n, are the unique solutions of certain systems of linear equations. You may assume that X* is optimal if, with probability one, E_{Z_i} δ_i = q_i E_{Z_i} X*, i = 1, …, n. [Hint: Apply the result of Problem 1 about element-wise products of symmetric positive semi-definite matrices.]

3. Optimal Baking. At the beginning of a day, n bakeries make forecasts Z₁, …, Z_n of their respective demands for bread during the day. The forecasts are random variables with known means. After observing only its own forecast, bakery i bakes X_i loaves of bread, i = 1, …, n. The actual demands D₁, …, D_n for bread at bakeries 1, …, n are random variables. Assume that D_i and Z_i have finitely many values, that forecasts are unbiased, i.e., E(D_i | Z_i) = Z_i for all i, and that (D_i, Z_i) and (D_j, Z_j) are independent for each i ≠ j. The loss L(X, Z, D), where X = (X_i), Z = (Z_i) and D = (D_i), is

L(X, Z, D) = (1/2) Σ_{i=1}^n a_i (X_i − D_i)² + (1/2) (Σ_{i=1}^n (X_i − D_i) − b)²

where a_i > 0, i = 1, …, n, and b > 0. The problem is to choose a baking policy from the above class that minimizes the expected one-period loss. Solve the problem explicitly. [Hint: Consider affine functions of the form X_i = α_i Z_i + β_i and find the coefficients α_i and β_i.]
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Answers to Homework 8 Due May 30

1. Element-Wise Product of Symmetric Positive Semi-Definite Matrices. Observe that since C is an m × m symmetric positive semi-definite matrix, there exists a matrix V whose column m-vectors are v₁, …, v_m and a diagonal matrix Λ = diag(λ₁, …, λ_m) ≥ 0 of eigenvalues of C such that C = VΛV^T = Σ_{k=1}^m λ_k v_k v_k^T. On partitioning each m-element row vector x^T = (x₁^T, …, x_n^T), m ≥ n, corresponding to the partitioning of C, and in particular doing so for v_k^T = (v_{k1}^T, …, v_{kn}^T), each block C_{ij} = Σ_{k=1}^m λ_k v_{ki} v_{kj}^T. Let w_k = (x_i^T v_{ki}) be an n-element column vector for each k. Therefore, since H_{ij} = q_{ij} C_{ij}, it follows that

x^T H x = Σ_{i,j} x_i^T q_{ij} (Σ_{k=1}^m λ_k v_{ki} v_{kj}^T) x_j = Σ_{k=1}^m λ_k Σ_{i,j} w_k^i q_{ij} w_k^j = Σ_{k=1}^m λ_k w_k^T Q w_k ≥ 0

since Q is positive definite. But from the fact that C_{ii} is symmetric positive definite, it follows that for every x_i ≠ 0,

0 < x_i^T C_{ii} x_i = Σ_{k=1}^m λ_k x_i^T v_{ki} v_{ki}^T x_i = Σ_{k=1}^m λ_k (w_k^i)².

Thus, there exists at least one value of k for which λ_k > 0 and w_k ≠ 0. Hence, for x ≠ 0, there is an i such that x_i ≠ 0 and so x^T H x = Σ_{k=1}^m λ_k w_k^T Q w_k > 0.

2. Quadratic Unconstrained Team-Decision Problem with Normally Distributed Observations. The optimal solution X* must satisfy E_{Z_i} δ_i = q_i E_{Z_i} X* with probability one for i = 1, …, n. But by assumption, E_{Z_i} δ_i = μ_i Z_i + Eδ_i. If X_j*(Z_j) = b_j Z_j + c_j, then

μ_i Z_i + Eδ_i = Σ_{j=1}^n q_{ij} E_{Z_i} X_j* = Σ_{j=1}^n q_{ij} (b_j C_{ji} Z_i + c_j) = Σ_{j=1}^n (b_j q_{ji} C_{ji} Z_i + q_{ij} c_j)

since Q is symmetric. Let H be as in Problem 1. Then μ_i Z_i + Eδ_i = b H_i Z_i + q_i c for all i and all Z_i, where b = (b₁, …, b_n), c^T = (c₁, …, c_n) and H_i^T = (H_{i1}, …, H_{in}). Thus it must be that Eδ = Qc and μ = bH. These systems have the unique solutions c = Q⁻¹Eδ and b = μH⁻¹ since Q is symmetric positive definite and, by Problem 1, so is H.
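A small numerical sketch, not part of the solution, of the resulting affine rule for scalar observations Z_i: it forms H from Q and the covariance matrix and solves c = Q⁻¹Eδ and b = μH⁻¹. All data below are made up.

    # Affine team rule X_i*(Z_i) = b_i Z_i + c_i (made-up data, scalar Z_i).
    import numpy as np

    Q = np.array([[2.0, 0.5], [0.5, 1.0]])      # symmetric positive definite (hypothetical)
    C = np.array([[1.0, 0.3], [0.3, 1.0]])      # covariance of (Z_1, Z_2) with C_ii = 1 (hypothetical)
    mu = np.array([1.0, -0.5])                  # mu_i from E_{Z_i} delta_i = mu_i Z_i + E delta_i
    Edelta = np.array([3.0, 2.0])               # E delta (hypothetical)

    H = Q * C                                   # element-wise product; positive definite by Problem 1
    c = np.linalg.solve(Q, Edelta)              # solves Q c = E delta
    b = np.linalg.solve(H.T, mu)                # solves b H = mu
    print("b =", b, " c =", c)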

3. Optimal Baking. Write the problem in the form of a quadratic unconstrained team-decision problem by defining Q = diag a + 11^T and writing the objective function as r(X, Z) = −(1/2) X^T Q X + δX + constant term, where δ_i = a_i D_i + b + Σ_{j=1}^n D_j. Observe that since a_i > 0 for all i, Q is positive definite. By Theorem 1 of §2.3, X* is optimal if and only if E_{Z_i} δ_i = E_{Z_i} q_i X*, i.e.,

E_{Z_i} (a_i D_i + b + Σ_{j=1}^n D_j) = E_{Z_i} {(Σ_{j≠i} X_j) + (1 + a_i) X_i},

and this is equivalent to

a_i Z_i + b + Z_i + Σ_{j≠i} E D_j = (Σ_{j≠i} E X_j) + (1 + a_i) X_i

since E_{Z_i} D_i = Z_i, E_{Z_i} D_j = E D_j, E_{Z_i} X_i = X_i and E_{Z_i} X_j = E X_j. If X_i = α_i Z_i + β_i, find α_i, β_i such that X_i is the solution to the system above. Do this by substituting X_i in the right-hand side and observing that E D_i = E Z_i for all i, since E D_i = E(E_{Z_i} D_i) = E Z_i. Then this new equivalent system has the solution α_i = 1 and β_i = b / (a_i (1 + Σ_k a_k⁻¹)), and thus X_i* = Z_i + β_i is the optimal solution.
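A quick numerical check, not part of the solution, that the displayed β_i satisfy the first-order conditions a_i β_i + Σ_j β_j = b derived above; the data are made up.

    # Verify the baking rule beta_i = b / (a_i (1 + sum_k 1/a_k)) on hypothetical data.
    import numpy as np

    a = np.array([2.0, 1.0, 4.0])
    b = 3.0

    beta = b / (a * (1.0 + np.sum(1.0 / a)))
    print("beta =", beta)
    print("a_i beta_i + sum_j beta_j =", a * beta + beta.sum(), "(each entry should equal b =", b, ")")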
MS&E 351 Dynamic Programming Spring 2008
Arthur F. Veinott, Jr.

Homework 9 Due June 4


1. Revenue Management: Pricing a House for Sale. An executive must sell her house in an
interval Ò!ß >Ó. Her goal is to maximize her expected sale price. She is willing to vary her asking
price as a piecewise-constant function of time, with her asking price at any moment restricted to
a given finite set T of positive numbers. Whenever her asking price is : − T , buyers willing to
pay that price arrive according to a poisson process with mean rate ;:  0 per unit time, with ;:
being strictly decreasing in :. She sells the house to the first buyer arriving during Ò!ß >Ó at her then
current asking price, or if no buyer arrives, to her company at the guaranteed price :‡ with
0  :‡  : ´ min T . Let Z > be the maximum expected sale price in Ò!ß >Ó.

(a) Optimality Equation. Give an equation whose solution is ÐZ > Ñ.
(b) Existence of Optimal Asking-Price Policy. Explain why there exists a maximum-ex-
pected-asking-price policy.
(c) Monotonicity and Concavity of Maximum Expected Sale Price in Time Remaining.
Show that Z > is positive, strictly increasing and concave in >   0.
(d) Monotonicity of Optimal Asking-Price in Time Remaining. Show that with one optimal
asking-price policy, the asking price :> used when > time units remain has the properties that:
" ‰ :> œ 
: ´ max T for all large enough >  0 and
#‰ :> is increasing in >  0.
(e) Limit of Maximum >-Period Expected Sale Price. Show that lim>Ä_ Z > œ 
:.

2. Transient Systems in Continuous Time. Show that if every stationary policy is transient in a continuous-time-parameter finite Markov population decision chain, then every policy is transient and there is a stationary maximum-value policy. Explain briefly how successive approximations can be used to find a stationary policy having value within ε > 0 of the maximum. [Hint: For the first part, adapt the proof of the corresponding result for the discrete-time-parameter problem.]
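A sketch, not part of the problem, of one standard way to organize the successive approximations: uniformize with a rate Λ at least as large as every |Q_δ(s, s)| and iterate V ← max_δ [r_δ/Λ + (I + Q_δ/Λ)V], whose fixed point is the maximum value when every policy is transient. The generators and rewards below are made up.

    # Successive approximations via uniformization (made-up data).
    import numpy as np

    Q = [np.array([[-2.0, 1.0], [0.5, -1.0]]),    # hypothetical generators; strictly negative row
         np.array([[-3.0, 2.0], [0.0, -0.5]])]    # sums mean every stationary policy is transient
    r = [np.array([1.0, 2.0]), np.array([3.0, 0.5])]

    Lam = max(-Qd[s, s] for Qd in Q for s in range(2))
    Pt = [np.eye(2) + Qd / Lam for Qd in Q]       # substochastic "uniformized" matrices
    rt = [rd / Lam for rd in r]

    V = np.zeros(2)
    for _ in range(100000):
        V_new = np.max([rt[d] + Pt[d] @ V for d in range(len(Q))], axis=0)
        if np.max(np.abs(V_new - V)) < 1e-10:     # stop once successive iterates agree to a tolerance
            V = V_new
            break
        V = V_new
    print("approximate maximum value:", V)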
