

ESTIMATION AND OPTIMAL CONTROL FOR
CONSTRAINED MARKOV CHAINS

by

D.-J. Ma¹, A. M. Makowski¹ and A. Shwartz²

University of Maryland and Technion

¹ Electrical Engineering Department and Systems Research Center, University of Maryland, College Park, Maryland 20742, U.S.A. The work of these authors was supported partially through NSF Grant ECS-85-51836, partially through a grant from AT&T Bell Laboratories and partially through ONR Grant N00014-84K-0614.
² Department of Electrical Engineering, Technion, Haifa 32000, Israel. The work of this author was supported through a grant from AT&T Bell Laboratories.

ABSTRACT

The (optimal) design of many engineering systems can be adequately recast as a Markov decision process, where requirements on system performance are captured in the form of constraints. In this paper, various optimality results for constrained Markov decision processes are briefly reviewed; the corresponding implementation issues are discussed and shown to lead to several problems of parameter estimation. Simple situations where such constrained problems naturally arise are presented in the context of queueing systems, in order to illustrate various points of the theory. In each case, the structure of the optimal policy is exhibited.
1. INTRODUCTION

Controlled queueing systems constitute a natural class of models for a variety of engineering applications, such as the ones arising in computer communication networks and in manufacturing environments [9,10,23]. The theory of Markov decision processes (MDP's) provides one of the main tools to analyse performance optimality for many such stochastic systems. In recent years, these methods were successfully used on several simple queueing systems, to obtain an explicit form for the optimal control strategy [8,12,25,31]. The model considered here is one operating in discrete time; this reflects the increasing use of digital implementations in many of today's systems; moreover, under standard assumptions, the analysis of some classes of continuous-time MDP's reduces to the analysis of an embedded discrete-time MDP [16,27]. A large body of knowledge is available in the literature on this class of problems, ranging from optimality conditions to structural properties of the optimal strategies to algorithmic results. The discussion is typically carried out under the assumption that system performance may be captured through a single cost criterion based on an instantaneous cost which depends on the state, the control action and perhaps time. Standard performance measures include the finite-horizon criterion, as well as the discounted and long-run average costs, both being taken over the infinite horizon.
However, in many practical situations, conflicting goals need to be taken into consideration, a requirement that often cannot be adequately captured through a single cost function. Typical examples arise in computer communication applications, where it may be desirable to maximize throughput while keeping delays small, or in manufacturing systems, where resources are allocated so as to maximize the so-called line throughput and yet prevent inventories from building up.

Such trade-off considerations require that the standard formulation for MDP's be modified so as to accommodate the conflicting objectives. One possible way to achieve this would be to define several cost functions, one for each objective identified by the designer, and to focus attention on the corresponding multi-objective optimization problem. Here instead, conflicting requirements inherent to the (optimal) design problem at hand are incorporated through constraints. Such an approach was considered by Ross [26] in the context of finite-state MDP's under an objective criterion and a constraint function, each being given in the form of a long-run average functional associated with an instantaneous cost; the goal is to maximize the first criterion subject to a bound on the constraint. As will become apparent from the discussion given in forthcoming sections, constrained MDP's form a rich class of stochastic optimization problems whose solution leads to a variety of interesting questions of both a theoretical and practical nature.

This paper, which is devoted to a brief survey of recent work in this area, is organized as follows: In Section 2, the single constraint optimization problem is precisely formulated, a solution technique via Lagrangian arguments is then outlined and general results on the structure of optimal policies are summarized. Extensions to more complex (i.e., non-Markovian) dynamics, as well as to the more difficult situation where several constraints are enforced, are briefly considered in Section 3. Also discussed in Section 3 are various implementation issues which are inherent to the very form of the optimal policies identified in Section 2, and which lead naturally to specific problems of combined estimation and control for Markov chains. Most of the literature on this topic [11] is concerned with indirect adaptive control problems [7]. For constrained MDP's, the situation is somewhat different in that some control parameters may not be available to the decision-maker in practice, even if the model parameters were known. This suggests viewing the design of implementable constrained optimal policies as a direct adaptive control problem [7]. The various ideas and proposals of Section 3 are illustrated in two specific situations of independent interest, which are discussed in Sections 4 and 5, respectively. The first situation centers around a problem of resource sharing, where several discrete-time queues with geometric service requirements compete for the service attention of a single server. The second example is essentially a problem of optimal flow control for a discrete-time M|M|1 queue.
2. MDP'S WITH A SINGLE CONSTRAINT

2.1 The problem formulation

In [26], Ross considers the following version of the constrained optimization problem for Markov chains: Let {X(n)}₁^∞ denote a controlled Markov chain, with countable state space S, compact metric action space U and transition probabilities (p_{xy}(u)) assumed continuous in u. Following the usual formulation of MDP's as given in [13], an admissible policy π generates at time n an action U(n) on the basis of the information H(n) := (X(1), U(1), ..., X(n−1), U(n−1), X(n)). For a given initial state distribution (held fixed hereafter), the policy π induces a probability measure P^π on the natural σ-field that equips the canonical sample space Ω := (S × U)^∞, with corresponding expectation operator E^π. The notation P is reserved for the collection of all admissible policies. The class of (possibly randomized) Markov stationary policies is then denoted by F, while G stands for the subclass of all non-randomized policies in F. Clearly G ⊂ F ⊂ P.

Given are two mappings r, c : S × U → ℝ, which are assumed continuous in the variable u and which are interpreted as the instantaneous reward and cost functions, respectively. For every admissible policy π in P, pose

J(\pi) := \liminf_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} r(X(t), U(t))    (2.1)

and

K(\pi) := \limsup_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} c(X(t), U(t)),    (2.2)

and for every V in ℝ, define

P_V := \{\pi \text{ in } P : K(\pi) \le V\}.    (2.3)

Of interest here is the constrained problem (C_V) defined as

(C_V): maximize J(\pi) over P_V.



2.2 A Lagrangian methodology

A Lagrangian methodology is now described for studying this constrained problem; it requires the introduction of a family of auxiliary problems: For every γ > 0, let the mapping b^γ : S × U → ℝ be given by

b^{\gamma}(x, u) := r(x, u) - \gamma c(x, u)    (2.4)

for all (x, u) in S × U, and define the corresponding Lagrangian functional by

B^{\gamma}(\pi) := \liminf_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} b^{\gamma}(X(t), U(t))    (2.5)

for every policy π in P. As it occurs in Mathematical Programming, the solution of the constrained MDP (C_V) is closely related to various properties of the unconstrained MDP (LP_γ) associated with (2.5), where

(LP_γ): maximize B^γ(π) over P.

To see this, observe that for any policy π in P, the inequality B^γ(π) ≥ J(π) − γK(π) holds, whereas if the policy π yields (2.1) and (2.2) as limits, then B^γ(π) = J(π) − γK(π). If, in addition to this property, a policy π* meets the constraint, i.e., K(π*) = V, and is optimal for the MDP (LP_γ), then necessarily for all π in P,

B^{\gamma}(\pi^*) = J(\pi^*) - \gamma K(\pi^*) \ge B^{\gamma}(\pi) \ge J(\pi) - \gamma K(\pi).    (2.6)

Since K(π) ≤ V for any policy π in P_V, the inequality (2.6) and the fact γ > 0 readily imply that

J(\pi^*) \ge J(\pi) + \gamma [K(\pi^*) - K(\pi)] = J(\pi) + \gamma (V - K(\pi)) \ge J(\pi)    (2.7)

for every policy π in P_V, whence the policy π* solves the constrained optimization problem (C_V).

From this discussion, it should be clear to the reader in what sense the Lagrangian problems {(LP_γ), γ > 0} are useful for solving the original constrained problem. Indeed, as the arguments given above indicate, any policy π* in P which

(R1): yields the expressions J(π*) and K(π*) as limits,
(R2): meets the constraint with K(π*) = V, and
(R3): solves the unconstrained MDP (LP_γ) for some γ > 0,

necessarily solves the constrained problem (C_V). This approach can be used either directly on specific problems mutatis mutandis, as illustrated in Sections 4 and 5, or it can provide a convenient theoretical framework for establishing general results on the existence and structural form of solutions to the constrained optimization problem.

To simplify the discussion, it is convenient to assume that S is finite and that the controlled chain has a single ergodic class under each policy f in F. In that case, under any policy f in F, the expressions (2.1), (2.2) and (2.5) exist as limits and are independent of the initial condition. Moreover, it is well known that an optimal policy for problem (LP_γ) can always be selected to be a pure strategy in G. This follows by standard arguments based on the corresponding Dynamic Programming equation (2.8), which states here that for all x in S, the relation

B^{\gamma} + h^{\gamma}(x) = \max_{u \in U} \Big[ b^{\gamma}(x, u) + \sum_{y \in S} p_{xy}(u)\, h^{\gamma}(y) \Big]    (2.8)

holds for some real constant B^γ and some mapping h^γ : S → ℝ. Their existence is guaranteed under the assumed conditions [13], with B^γ identified with the optimal value of problem (LP_γ). It is also well known that if G^γ denotes the class of policies g^γ in G with the property that for each x in S, the action u = g^γ(x) attains the maximum in the Dynamic Programming equation (2.8), then any policy in G^γ is optimal for the problem (LP_γ).
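Under these finiteness and unichain assumptions, the pair (B^γ, h^γ) of (2.8) and a maximizing policy g^γ can be computed by relative value iteration. The sketch below is merely illustrative: the two-state, two-action model, its transition matrices and its reward and cost tables are hypothetical placeholders, not data from the paper.

```python
import numpy as np

# A minimal sketch (not from the paper): solve the unconstrained problem
# (LP_gamma) for a finite unichain MDP by relative value iteration on the
# Dynamic Programming equation (2.8).
P = {0: np.array([[0.9, 0.1], [0.5, 0.5]]),   # transition matrix under u = 0
     1: np.array([[0.2, 0.8], [0.1, 0.9]])}   # transition matrix under u = 1
r = np.array([[1.0, 0.3], [0.8, 0.2]])        # reward r(x, u)
c = np.array([[0.5, 0.1], [0.9, 0.4]])        # cost c(x, u)
gamma = 0.7                                   # Lagrangian parameter

b = r - gamma * c                             # b_gamma(x, u) as in (2.4)
h = np.zeros(2)                               # relative value function h_gamma
for _ in range(10_000):
    # one DP backup: Q(x, u) = b_gamma(x, u) + sum_y p_xy(u) h(y)
    Q = np.stack([b[:, u] + P[u] @ h for u in (0, 1)], axis=1)
    v = Q.max(axis=1)
    B_gamma = v[0]                            # gain estimate at reference state 0
    h_new = v - B_gamma                       # normalize so that h(0) = 0
    if np.max(np.abs(h_new - h)) < 1e-12:
        break
    h = h_new

g_gamma = Q.argmax(axis=1)                    # a pure maximizer, i.e. a policy in G_gamma
print(g_gamma, B_gamma)                       # optimal action per state, value of (LP_gamma)
```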
2.3 Optimality via randomized policies

In [26], it is shown under the simplifying assumptions stated earlier that, whenever the problem is "feasible", there exists a constrained optimal policy with a simple structure.

Theorem [26]: If for some 0 < γ < ∞, at least one policy g^γ in G^γ has the property that K(g^γ) ≤ V, then there exists a constrained optimal policy f* in F defined by a simple randomization between two pure Markov stationary policies g and ḡ in G^{γ*}. More precisely, there exist 0 < γ* < ∞ and policies g and ḡ in G^{γ*} such that

K(g) \le V \le K(\bar{g}).    (2.9)

If the randomized policies f_q, 0 ≤ q ≤ 1, are defined by

f_q := q\,\bar{g} + (1 - q)\,g    (2.10)

(i.e., in each state x, f_q selects the action ḡ(x) with probability q and the action g(x) with probability 1 − q), then f* = f_{q*}, where the optimal bias q* is determined as the solution to the equation

K(f_q) = V, \quad 0 \le q \le 1.    (2.11)

The discussion of this result can be summarized as the search for a policy in F that satisfies the requirements (R1)-(R3) for some value γ* > 0 of the Lagrangian parameter. The reader is referred to [26] for details.
2.4 Optimality via mixing policies

The existence of policies g and ḡ in G^{γ*} with the property (2.9) can be further exploited to generate an alternate solution to the constrained MDP (C_V) in the one-parameter family of mixing policies {π(p), 0 ≤ p ≤ 1}. For every p in the unit interval [0,1], consider a two-sided coin biased so that the events Head and Tail occur with probability p and 1 − p, respectively. To define the mixing policy π(p), throw this biased coin exactly once at the beginning of times, before starting to operate the system. The policy π(p) is defined as the policy in P that operates according to g (resp. ḡ) if the outcome is Tail (resp. Head). It is not difficult to see that for such a policy, the relations

J(\pi(p)) = (1 - p)\, J(g) + p\, J(\bar{g})    (2.12a)

and

K(\pi(p)) = (1 - p)\, K(g) + p\, K(\bar{g})    (2.12b)

hold true.

Now, under the non-degeneracy assumption K(g) ≠ K(ḡ), pose p* := [V − K(g)] / [K(ḡ) − K(g)] ≥ 0 (owing to (2.9)). Observe from (2.12) that the policy π(p*) steers (2.2) to the value V, yields (2.1) and (2.2) as limits, and that B^{γ*}(π(p*)) = (1 − p*) B^{γ*}(g) + p* B^{γ*}(ḡ) = B^{γ*}. In other words, the policy π(p*) meets the requirements (R1)-(R3), and consequently solves the constrained problem (C_V). It should be noted, in contrast with the policy f* obtained by randomization in Section 2.3, that the evaluation of the optimal mixing parameter p* is immediate if the values K(g) and K(ḡ) are available.
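The contrast between the two constructions is easy to state in code. The sketch below is illustrative only: g and g_bar stand for the two pure policies (mappings from states to actions), and the numerical values of K(g), K(ḡ) and V are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholder values with K(g) <= V <= K(g_bar).
K_g, K_gbar, V = 2.0, 5.0, 3.0

# Section 2.4: the optimal mixing parameter p* is immediate to evaluate.
p_star = (V - K_g) / (K_gbar - K_g)           # here 1/3

def mixing_policy(g, g_bar):
    """pi(p*): toss the biased coin once, then follow one pure policy forever."""
    return g_bar if rng.random() < p_star else g

# Section 2.3: the randomized policy f_q re-randomizes at every step; its
# bias q* solves the implicit equation K(f_q) = V, which in general must
# be found numerically.
def f_q_action(q, g, g_bar, x):
    """Action of f_q in state x: use g_bar(x) with probability q, else g(x)."""
    return g_bar(x) if rng.random() < q else g(x)
```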
3. MORE ON CONSTRAINED OPTIMIZATION

3.1 Generalizations

It is possible to generalize the discussion of Section 2 in several directions:

Firstly, it is of interest to consider system dynamics where more complex probabilistic mechanisms are allowed for state transitions and/or where the state processes live in more general spaces. Beutler and Ross [4] consider a version of (C_V) for general semi-Markov decision processes with finite state space and compact action space. Nain and Ross [21] study a specific constrained MDP with countable (non-finite) state space. In both cases, similar results on the structure of the constrained optimal policy are reported. However, more work seems needed, as no general theory is available to date in the case of non-finite state spaces.

Secondly, as pointed out in the introduction, the main practical motivation for studying constrained optimization problems arises from the desire to handle situations with multiple (conflicting) objectives. The next natural step would consist in formulating constrained MDP's with multiple constraints as a nonlinear programming problem in the space of policies. More precisely, with the notation of Section 2, let J(π) be the cost associated with the policy π in P, and let K^i(π) denote the corresponding value of the i-th constraint, 1 ≤ i ≤ I. For every vector V = (V_1, ..., V_I) in ℝ^I, pose

P_V := \{\pi \text{ in } P : K^{i}(\pi) \le V_i, \ 1 \le i \le I\}    (3.1)

and define the multiple constraint problem (C_V) as in Section 2. Very little is known on the existence and structure of optimal solutions for problems of this type. This probably could be traced back to the fact that the relationship of the corresponding Lagrangian problem to the original problem (C_V) is far more subtle in the multiple constraint situation. In fact, as of the writing of this paper, it is not clear how to obtain in general an optimal policy through randomization and/or mixing procedures similar to the ones presented in Section 2. Results are available only in particular instances; Altman and Shwartz [1] establish existence of a solution to a problem with multiple constraints for the competing queue model considered by Nain and Ross [21].
3.2 Implementation issues

Even if the policies g and ḡ, and the value γ*, were readily available, the computation of the optimal bias q* may prove to be a non-trivial task, for it requires solving for q in the implicit equation K(f_q) = V on the interval [0,1], and makes it necessary to evaluate the expression K(f_q) for each 0 ≤ q ≤ 1. Both steps usually turn out to be highly difficult ones in many applications, and are often possible only via numerical methods; this difficulty is clearly illustrated on the competing queue model discussed in Section 4.

The optimal bias q* acts as an internal parameter; it is available in principle if the external (or model) parameters (i.e., the entries in the transition matrix) are known, but may not be easily available in practice owing to computational difficulties. Of course, this difficulty of implementation is further compounded when some of the external parameters are not known, since "on line" identification of the external parameters does not provide a feasible means to evaluate q*. In any case, this points to adaptive methods for directly estimating q*, now treated as an unknown parameter, and this specifically for the purpose of generating an optimal control; in the terminology of adaptive systems, this is referred to as direct adaptive control [7]. This suggests broadening the notion of adaptive control for Markov chains to view it as a procedure for recursively updating the control to meet the performance criterion. Although this is a well-known problem in the general theory of adaptive systems, it seems to have not been studied much in the context of MDP's, at least to the authors' knowledge.

The reader's attention should be drawn to the fact that direct adaptive control ideas, with q* regarded as an unknown parameter, do not always lead to implementable policies. This was illustrated by Shwartz and Makowski [28,30] on the competing queue problem of Section 4.

The implementation issues discussed above can be addressed in the somewhat more general context of steering the cost (2.2) to a prespecified value V: Given is a parametrized family {f_q, 0 ≤ q ≤ 1} of Markov stationary policies, and assume, with the notation g := f_0 and ḡ := f_1, that K(g) ≤ V ≤ K(ḡ). The problem is then to find a policy f* in the parametrized family {f_q, 0 ≤ q ≤ 1} that steers the cost (2.2) to the value V, i.e., K(f*) = V. If the mapping q → K(f_q) is continuous, this can be achieved by selecting f* to be f_{q*}, with the bias q* being determined as the solution to the equation K(f_q) = V, 0 ≤ q ≤ 1. Although most of the ideas discussed in this paper apply mutatis mutandis to this more general situation, the discussion will be carried out only in the context of constrained MDP's for sake of clarity.
3.3 A time-sharing implementation

Although the optimal mixing policy π(p*) of Section 2.4 has a very simple structure, it is not stationary and ergodic. Indeed, under such a policy π(p*), the sample averages corresponding to K(π(p*)) do not satisfy the constraint, since on a set of probability p* (resp. 1 − p*), these limits will be K(ḡ) (resp. K(g)). This somewhat unappealing feature can be eliminated through the following time-sharing implementation of mixing policies. Assume the existence of a privileged state to which the system returns infinitely often under each one of the policies g and ḡ, and define a cycle as the time T between consecutive visits to that state. Denote the expectation of a cycle duration under policies g and ḡ by E^g(T) and E^{ḡ}(T), respectively. For every p in the unit interval [0,1], the mixing policy π(p) has a time-sharing implementation α_TS(p̃), which is now defined: Let p̃ be the element of [0,1] uniquely defined through the relation

\tilde{p} = \frac{p\, E^{g}(T)}{(1 - p)\, E^{\bar{g}}(T) + p\, E^{g}(T)},

and consider two sequences of non-negative integers {n_j}₁^∞ and {m_j}₁^∞ with the property that

\lim_{J} \frac{n(J)}{N(J)} = 1 - \tilde{p} \quad \text{and} \quad \lim_{J} \frac{m(J)}{N(J)} = \tilde{p},    (3.3)

where the notations

n(J) := \sum_{j=1}^{J} n_j, \quad m(J) := \sum_{j=1}^{J} m_j \quad \text{and} \quad N(J) := n(J) + m(J)    (3.4)

are used for every J in ℕ. The discrete-time axis is divided into contiguous control frames; the (J+1)-st such control frame starts upon completion of the N(J)-th cycle and is made up of n_{J+1} + m_{J+1} cycles. The policy α_TS(p̃) is defined as the policy in P that during the J-th frame operates policy g for n_J cycles, and then policy ḡ for m_J cycles, J = 1, 2, .... Under the condition (3.3), well-known properties of first return times for Markov chains readily imply that

J(\alpha_{TS}(\tilde{p})) = (1 - p)\, J(g) + p\, J(\bar{g}) = J(\pi(p))

and

K(\alpha_{TS}(\tilde{p})) = (1 - p)\, K(g) + p\, K(\bar{g}) = K(\pi(p)),    (3.5)

where the second equality in each case is justified through (2.12) by the definition of p̃. In fact, the convergence (3.5) takes place for the sample averages as well. Note that if p̃ were rational, say of the form p̃ = m̄/(n̄ + m̄) for some integers n̄ and m̄, then the conditions (3.3) would be automatically satisfied upon choosing n_j = n̄ and m_j = m̄ for all j in ℕ.

The reader will readily see that the time-sharing implementation α_TS(p̃) of the optimal mixing policy π(p*) satisfies the requirements (R1)-(R3) and thus solves the constrained problem (C_V).

In Section 4, this approach is shown to be useful for identifying a solution to a multiple constraint problem.
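As a small illustration of the frame construction, the sketch below schedules cycles in the rational case p̃ = m̄/(n̄ + m̄), where the constant choices n_j = n̄ and m_j = m̄ satisfy (3.3); the values of n̄ and m̄ are hypothetical placeholders.

```python
# A minimal sketch of the time-sharing schedule alpha_TS(p~) in the
# rational case p~ = m_bar / (n_bar + m_bar): every frame runs g for
# n_bar cycles and then g_bar for m_bar cycles (here p~ = 2/5).
n_bar, m_bar = 3, 2

def policy_for_cycle(j):
    """Pure policy to operate during the j-th cycle (j = 0, 1, 2, ...),
    a cycle being the time between consecutive visits to the privileged state."""
    return "g" if j % (n_bar + m_bar) < n_bar else "g_bar"

print([policy_for_cycle(j) for j in range(10)])
# ['g', 'g', 'g', 'g_bar', 'g_bar', 'g', 'g', 'g', 'g_bar', 'g_bar']
```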
3.4 A Certainty Equivalence implementation

A possible solution to the difficulties mentioned in Section 3.2 would be to estimate directly the optimal bias q* and then use the Certainty Equivalence Principle at each step. More precisely, this suggests using a possibly recursive estimation scheme that generates a sequence of bias values {q(n)}₁^∞ converging to q*. At step n, the RV q(n) constitutes an estimate of the bias value q*, which is thus interpreted as the (conditional) probability of using ḡ (given the available information H(n)), and it is thus natural to select the control action U(n) according to f_{q(n)}; the adaptive policy so generated by the sequence {q(n)}₁^∞ is denoted by α.

There are as many such adaptive schemes as there are schemes for estimating the optimal bias value q*. In each specific case, optimality of the adaptive policy α will be concluded if it can be established that the policy α

(A1): yields (2.1) and (2.2) as limits,
(A2): meets the constraint, i.e., K(α) = V, and
(A3): yields the same cost as f*, i.e., J(α) = J(f*).

This is done via a separate analysis and typically proceeds by showing that

\lim_{n} |f_{q(n)}(X(n)) - f^{*}(X(n))| = 0,    (3.6)

the convergence taking place under P^α either almost surely or in probability, a property which readily follows from the (weak) consistency of the estimation scheme. Under (3.6) and possibly additional structural model assumptions, a method of proof due to Mandl [19] can be extended to show that J(α) = J(f*) and K(α) = K(f*). Examples of this approach are given in Sections 4 and 5.

At this point, the reader may wonder as to how such an estimation scheme is selected.

(i): Sometimes it is feasible to compute the optimal bias q* as a function q*(θ) of the external parameters θ. In that case, the designer may want to consider using the Certainty Equivalence Principle in conjunction with a parameter estimation scheme, say based on the Maximum Likelihood Principle. This approach is illustrated in Section 5 on a problem of flow control.

(ii): In many applications, the function q → K(f_q) turns out to be continuous and strictly monotone, say increasing for sake of definiteness. In that case, the search for q* can be interpreted as finding the zero of the continuous, strictly monotone function K(f_q) − V, and this brings to mind ideas from the theory of Stochastic Approximations [24]. Here, this circle of ideas suggests generating a sequence of bias values {q(n)}₁^∞ through the recursion

q(n+1) = \Big[ q(n) + a_n \big( V - c(X(n+1), U(n+1)) \big) \Big]_0^1, \quad n = 1, 2, \ldots    (3.7)

with q(1) given in [0,1]. Here [x]_0^1 := 0 ∨ (x ∧ 1) for all x in ℝ, and the sequence of step sizes {a_n}₁^∞ satisfies

0 < a_n \downarrow 0, \quad \sum_{n=1}^{\infty} a_n = \infty, \quad \sum_{n=1}^{\infty} |a_{n+1} - a_n| < \infty.    (3.8)

The corresponding policy α_SA is structurally simple and easy to implement on-line; this simplicity of implementation derives from the fact that the difficult step of directly solving for q* is completely bypassed.
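The recursion (3.7) is straightforward to implement. In the sketch below, the step sizes a_n = 1/n satisfy (3.8), and observe_cost is a hypothetical stand-in for the instantaneous cost c(X(n+1), U(n+1)) observed along the controlled sample path (a noisy function whose mean increases with q, as K(f_q) does).

```python
import numpy as np

rng = np.random.default_rng(1)

def clip01(x):
    """The truncation [x]_0^1 = 0 v (x ^ 1) used in (3.7)."""
    return min(1.0, max(0.0, x))

def observe_cost(q):
    # Hypothetical stand-in for the observed cost c(X(n+1), U(n+1));
    # its mean 2 + 3q is increasing in q, mimicking q -> K(f_q).
    return 2.0 + 3.0 * q + rng.normal()

V = 4.1
q = 0.5                                   # q(1) given in [0, 1]
for n in range(1, 200_000):
    a_n = 1.0 / n                         # step sizes satisfying (3.8)
    q = clip01(q + a_n * (V - observe_cost(q)))

print(q)                                  # near the root of 2 + 3q = V, i.e. 0.7
```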

4. OPTIMAL RESOURCE ALLOCATION

4.1 Model

Consider the following system of K + 1 infinite-capacity queues that compete for the use of a single server: Time is slotted and the service requirement of each customer corresponds exactly to one time slot. At the beginning of each time slot, the controller gives priority to one of the queues. If the k-th queue is given service attention during that slot, then with probability μ_k the serviced customer (if any) completes service and leaves the system, while with probability 1 − μ_k the customer fails to complete service and remains in the queue. The arrival pattern is modelled as a renewal process, in that the batch sizes of customers arriving into the system in each slot are independent and identically distributed from slot to slot. Under these assumptions, the evolution of the system is fully described by an ℕ^{K+1}-valued process {X(n)}₁^∞, with X_k(n), 0 ≤ k ≤ K, representing the number of customers in the k-th queue at the beginning of the slot [n, n+1).

The mean number of customers arriving to the k-th queue is denoted by λ_k. Define the traffic intensity

\rho := \sum_{k=0}^{K} \frac{\lambda_k}{\mu_k}

and assume henceforth that ρ < 1; this guarantees system stability [2].

For some mapping c : ℕ^{K+1} → ℝ, the cost to be minimized is defined by

J(\pi) := \limsup_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} c(X(t))    (4.1)

for every policy π in P. A special case abundantly treated in the literature is the one where c is linear and positive, i.e., for all x in ℕ^{K+1},

c(x) = \sum_{k=0}^{K} c_k x_k,    (4.2)

with c_k ≥ 0, 0 ≤ k ≤ K. For this case, several authors have discussed the problem of selecting a service allocation strategy that minimizes (4.1) over the class P of all admissible service allocation strategies [3,5]. They all show the optimality of the μc-rule, i.e., the fixed priority assignment policy that orders the customer classes in increasing order of priority with the values μ_k c_k, 0 ≤ k ≤ K.
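As an illustration, the μc-rule reduces to a one-line comparison once the service probabilities and cost coefficients are known; the numbers below are hypothetical placeholders.

```python
# A minimal sketch of the mu-c rule: among the nonempty queues, serve the
# one with the largest index mu_k * c_k.
mu = [0.6, 0.3, 0.8]   # hypothetical service completion probabilities
c = [1.0, 4.0, 0.5]    # hypothetical cost coefficients

def muc_rule(x):
    """Queue to serve given the queue-size vector x = (x_0, ..., x_K)."""
    nonempty = [k for k, size in enumerate(x) if size > 0]
    return max(nonempty, key=lambda k: mu[k] * c[k]) if nonempty else None

print(muc_rule([2, 1, 5]))   # -> 1, since mu_1 * c_1 = 1.2 dominates
```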
4.2 Single constrained queue

In [21], Nain and Ross considered the situation where several types of traffic, e.g., voice, video and data, compete for the use of a single synchronous communication channel. They formulate this situation as a system of K + 1 discrete-time queues that compete for the attention of a single server, and solve for the service allocation strategy that minimizes the long-run average of a linear expression in the queue sizes of the K customer classes {1, ..., K} under the constraint that the long-run average queue size of the remaining customer class 0 does not exceed a certain value V. Thus for any policy π in P, define J(π) by (4.1) with c given by (4.2) where c_0 = 0, and pose

K(\pi) := \limsup_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} X_0(t).    (4.3)

Nain and Ross [21] extend some of the optimality results from [3,5] to show that if the constraint can be met in a non-trivial fashion, then an optimal policy with a very simple structure can be identified.

Theorem [21]: If the problem is feasible, there exists a constrained optimal policy f* in F which randomizes between two work-conserving static priority assignment policies.

This result is derived through a Lagrangian argument in the following way: For γ > 0, let the mapping b^γ : ℕ^{K+1} → ℝ be given by b^γ(x) = c(x) + γx_0 and pose

B^{\gamma}(\pi) := \limsup_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} b^{\gamma}(X(t))    (4.4)

for every policy π in P. The unconstrained MDP (LP_γ) is now defined as

(LP_γ): minimize B^γ(π) over P,

and solving the constrained problem (C_V) reduces to finding a policy π* in P which meets the requirements (R1)-(R3).

If the constraint is not satisfied while giving highest priority to the constrained queue (i.e., the 0-th queue), then there is no solution. On the other hand, if the constraint is satisfied while giving lowest priority to the constrained queue and ordering the other queues according to the μc-rule, then this policy is optimal for the constrained problem (C_V). The proof in the remaining case is available in [21], the main idea being that each one of the problems (LP_γ) is solved by a fixed priority assignment policy which admits a description as the μc-rule based on μ_0γ and μ_k c_k, 1 ≤ k ≤ K.

The dynamics of this problem were generalized by Nain and Ross to a semi-Markov decision process [22]. Another generalization, given by Altman and Shwartz [1], involves a more general constraint (4.3) associated with an instantaneous cost d : ℕ^{K+1} → ℝ which is also an arbitrary linear function with positive coefficients (as in (4.2)).

Theorem [1]: If the problem is feasible, then there is a constrained optimal policy f_{q*} which randomizes between two work-conserving static priority assignment policies.
4.3 Single constraint — An adaptive implementation

Even with (4.3), the function q → K(f_q) is not easy to compute, in spite of the linearity of the instantaneous cost. Indeed, as pointed out by Nain and Ross [21], computing the quantity K(f_q) amounts to studying a coupled-processor problem whose solution can be obtained via a reduction to a Riemann-Hilbert problem [6].

The stochastic approximation algorithm that generates the sequence of bias estimates {q(n)}₁^∞ here takes the special form

q(n+1) = \Big[ q(n) + a_n \big( V - X_0(n+1) \big) \Big]_0^1, \quad n = 1, 2, \ldots    (4.5)

with q(1) given in [0,1].

For K = 1, there are only two fixed priority assignment policies, and therefore no a priori knowledge of the various statistics is required in order to implement this algorithm. As such, the proposed policy is implementable and constitutes an adaptive policy in the restricted technical sense understood in the literature on the non-Bayesian adaptive control problem for Markov chains [11]. However, for K > 1, α_SA is implementable only if g and ḡ have been determined.

The basic results assume finite third moments on the initial queue sizes and on the statistics of the arrival pattern. Under these additional conditions:

Theorem [30]: The sequence of biases {q(n)}₁^∞ converges in probability (under P^{α_SA}) to the optimal bias q*.

The derivation of this result makes combined use of results by Kushner and Shwartz [14] on the weak convergence of Stochastic Approximations via ODE methods and of moment estimates obtained for the queue size process in [30]. The condition (3.6) now holds and implies:

Theorem [30]: The policy α_SA solves the constrained optimization problem (C_V) with J(α_SA) = J(f*) and K(α_SA) = K(f*).
4.4 Multiple constraint — A time-sharing implementation

In this section, a special version of the multiple constraint problem (C_V) is studied. For every policy π in P, pose

K^{i}(\pi) := \limsup_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} X_i(t), \quad 0 \le i \le K,    (4.6)

and for every vector V = (V_1, ..., V_K) in ℝ^K, consider the constrained problem

(C_V): minimize K^0(\pi) over P_V,    (4.7)

with P_V defined by (3.1).

There are exactly L := (K+1)! non-idling policies in G which act as fixed priority assignments, the l-th such policy being denoted throughout by g_l, 1 ≤ l ≤ L. An element p = (p_1, ..., p_L) in ℝ^L lies in the L-dimensional simplex whenever 0 ≤ p_l ≤ 1 for all 1 ≤ l ≤ L, with Σ_{l=1}^L p_l = 1. The mixing policy π(p) associated with any element p in the L-dimensional simplex is defined through a procedure that generalizes the one discussed in Section 2.4: Consider an L-sided coin that yields the l-th side with probability p_l, 1 ≤ l ≤ L, and throw the coin exactly once at the beginning of times, before starting to operate the system. The policy π(p) is the one that operates according to the fixed priority policy g_l if the outcome of the throw is the l-th side, 1 ≤ l ≤ L. For every p in the L-dimensional simplex, the relations

K^{i}(\pi(p)) = \sum_{l=1}^{L} p_l K^{i}(g_l), \quad 0 \le i \le K,    (4.8)

hold true and suggest the following Linear Program (LP):

\text{minimize} \quad \sum_{l=1}^{L} x_l K^{0}(g_l)    (4.9a)

subject to the constraints

\sum_{l=1}^{L} x_l K^{i}(g_l) \le V_i, \quad 1 \le i \le K,    (4.9b)

and

0 \le x_l \le 1, \ 1 \le l \le L, \quad \text{and} \quad \sum_{l=1}^{L} x_l = 1.    (4.9c)

The relationship between the Linear Program (LP) and the original constrained problem (C_V) is formalized in the following result.

Theorem [1]: If the Linear Program (LP) has a solution, say p*, then the policy π(p*) solves the multiple constraint problem (C_V). Conversely, if problem (C_V) has a solution, then it can always be implemented by a mixing policy.

The solution of problem (C_V) via mixing policies has the clear advantage of requiring only the solution of the Linear Program (LP); this is to be contrasted with the difficult queueing problem that needs to be solved to find the optimal randomized policy [21]. A dynamic or time-sharing implementation can also be provided in the spirit of Section 3.3. Here the empty state is taken to be the privileged state, and a cycle thus coincides with a busy cycle for the queueing system. For sake of brevity, the discussion is restricted to the case where the element p in the L-dimensional simplex has all its components rational, with p_l = n_l/|n|, 1 ≤ l ≤ L, for some n = (n_1, ..., n_L) in ℕ^L, where |n| = Σ_l n_l. Define the policy ᾱ(n) in P as the one that operates the fixed priority assignment g_l over n_l cycles. With the help of results on busy periods [2,29], it is a simple exercise to show, in analogy with (3.5), that the relations

K^{i}(\bar{\alpha}(n)) = \sum_{l=1}^{L} p_l K^{i}(g_l) = K^{i}(\pi(p)), \quad 0 \le i \le K,

hold true. The more general situation can be treated exactly as in Section 3.3.
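Once the constraint values K^i(g_l) have been computed (or estimated) for the L fixed priority policies, the program (4.9) is a standard finite LP. The sketch below sets it up with scipy.optimize.linprog for K = 2 and L = 3; all numerical entries are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical table K_table[i][l] standing for K^i(g_l), i = 0, ..., K.
K_table = np.array([[4.0, 2.5, 3.0],    # K^0(g_l): objective row (4.9a)
                    [1.0, 3.0, 2.0],    # K^1(g_l)
                    [2.0, 1.0, 4.0]])   # K^2(g_l)
V = np.array([2.5, 3.0])                # constraint levels V_1, V_2

res = linprog(c=K_table[0],             # minimize sum_l x_l K^0(g_l)   (4.9a)
              A_ub=K_table[1:], b_ub=V, # sum_l x_l K^i(g_l) <= V_i     (4.9b)
              A_eq=np.ones((1, 3)), b_eq=[1.0],   # sum_l x_l = 1       (4.9c)
              bounds=[(0.0, 1.0)] * 3)  # 0 <= x_l <= 1                 (4.9c)
p_star = res.x                          # mixing weights for pi(p*)
print(p_star, res.fun)
```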

5. OPTIMAL FLOW CONTROL

5.1 Model and constrained problems

Consider the following flow control model for discrete-time M|M|1 queueing systems [17,18]: At the beginning of each time slot, the controller decides either to admit or reject the arrivals during that slot. An admitted customer joins the queue, while a rejected customer is immediately lost. A customer (if any) that completes service in a slot leaves the system at the end of that slot with probability μ, and fails to complete service in that slot with probability 1 − μ, in which case it remains at the head of the line to await service in the next slot. The arrival pattern is modelled as a Bernoulli sequence with parameter λ, independent of the service process, which is modelled as another Bernoulli sequence with parameter μ. Under these assumptions, the evolution of the system is fully described by an ℕ-valued process {X(n)}₁^∞, with X(n) representing the number of customers in the queue at the beginning of the slot [n, n+1).

The problem considered here is formulated as the search for a policy that maximizes the throughput under the constraint that the long-run average queue size does not exceed a certain value V, where the throughput and the average queue size are expressed as

J(\pi) := \liminf_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} \mu\, 1(X(t) \ne 0)    (5.1)

and

K(\pi) := \limsup_{n} \frac{1}{n} E^{\pi} \sum_{t=1}^{n} X(t),    (5.2)

respectively, for every admissible policy π in P.

A threshold policy is a Markov stationary policy in F with a simple structure determined by two parameters L and q in ℕ and [0,1], respectively, whence a threshold policy is denoted hereafter by (L, q). According to the threshold policy (L, q), an incoming customer is admitted or rejected according as the queue size is < L or > L; if the queue size is exactly L, a biased coin with bias q is flipped, and the outcome then determines whether or not the incoming customer can access the queue. The adopted convention interprets q as the (conditional) probability of accepting an incoming customer.
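The admission rule of a threshold policy (L, q) takes only a few lines; the sketch below is a direct transcription of the description above.

```python
import random

rng = random.Random(2)

def admit(x, L, q):
    """Admission decision of the threshold policy (L, q): admit when the
    queue size x is below L, reject when above L, and at x == L flip a
    coin with bias q (the probability of accepting the incoming customer)."""
    if x != L:
        return x < L
    return rng.random() < q
```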
The policy that admits every single customer is denoted by (∞, 1). If K((∞,1)) ≤ V, then the problem (C_V) corresponding to (5.1)-(5.2) is trivially solved by (∞, 1). On the other hand, if K((∞,1)) > V, then the problem has a non-trivial solution, which can be shown to be of threshold type. This result is derived again through a Lagrangian argument: For γ > 0, let the mapping b^γ : ℕ → ℝ be given by

b^{\gamma}(x) = \mu\, 1(x \ne 0) - \gamma x

and define, for every policy π in P, B^γ(π) as in (2.5). The unconstrained MDP (LP_γ) is now defined as in Section 2, and as in previous instances where this technique was used, solving the constrained problem (C_V) reduces to finding a policy π* in P which meets the requirements (R1)-(R3).

The unconstrained Lagrangian problem (LP_γ) can be solved through a tedious Dynamic Programming argument that shows concavity of the corresponding value function.

Theorem [17]: For every γ > 0, there exists a threshold policy (L_γ, q_γ) that solves the unconstrained Lagrangian problem (LP_γ). Moreover, for each threshold value L in ℕ, there always exists γ(L) > 0 so that any threshold policy (L, q), with q arbitrary in [0,1], solves the Lagrangian problem (LP_{γ(L)}).

Threshold policies are thus unconstrained optimal. For any threshold policy (L, q), the quantities (5.1) and (5.2) exist as limits. Moreover, K((L, q)) increases as L increases, whereas for fixed L in ℕ, the mapping q → K((L, q)) is continuous and strictly monotone increasing on [0,1]. Since K((∞,1)) > V, there exists L* in ℕ such that K((L*, 0)) < V ≤ K((L*, 1)), and the continuity and strict monotonicity of q → K((L*, q)) then imply the existence of q* such that K((L*, q*)) = V.

Theorem [17]: If the constrained problem (C_V) has a non-trivial solution, it can be taken to be a threshold policy f* = (L*, q*).

The quantities L* and q* are determined by the arrival and service rates λ and μ, and by the constraint value V; they can be computed as the unique solution to the equation

K((L, q)) = V, \quad L = 0, 1, \ldots \ \text{and} \ 0 < q \le 1.    (5.3)

This result represents the discrete-time analog of results obtained by Lazar [15]. In contrast with the competing queue problem, a closed-form expression is available here for the quantity K((L, q)) for all L in ℕ and q in [0,1]. However, as in Section 4, it is still reasonable to seek on-line implementations of such a threshold policy.
5.2 A Stochastic Approximation implementation

As in Section 3.4, the two policies g = (L*, 0) and ḡ = (L*, 1) are assumed available, which presumes only knowledge of L*. The following stochastic approximation algorithm generates the sequence of bias estimates {q(n)}₁^∞ through the recursion

q(n+1) = \Big[ q(n) + a_n \big( V - X(n+1) \big) \Big]_0^1, \quad n = 1, 2, \ldots    (5.4)

with q(1) given in [0,1], where in addition to the conditions (3.8), the step sizes satisfy Σ_n a_n² < ∞. The RV q(n) constitutes an estimate of the bias value q* and is interpreted as the conditional probability of using ḡ, or equivalently, of giving admission to a potential customer during the slot [n, n+1) when the queue size X(n) is equal to L*. With this scheme, the control action to be implemented is simply generated according to the optimal threshold policy f* when the queue size is not equal to L*.

The key result is obtained under a fourth moment assumption on the initial queue size, and is proved using ideas proposed by Metivier and Priouret [20] on the almost sure convergence of Stochastic Approximation algorithms.

Theorem [18]: The sequence of biases {q(n)}₁^∞ converges almost surely (under P^{α_SA}) to the optimal bias q*.
The condition (3.6), which is now seen to hold, can then be used to show:

Theorem [18]: The policy α_SA solves the constrained optimization problem (C_V) with J(α_SA) = J(f*) and K(α_SA) = K(f*).

5.3 The time-sharing implementation

Since the quantity K((L, q)) is computable for all values of its arguments, the values L*, K(g) and K(ḡ) are readily available. The optimal mixing parameter p* can then be immediately evaluated, and the optimal mixing policy π(p*) considered in Section 2.4 is thus easily implementable. Here, the threshold nature of the policies g and ḡ suggests level L* as a privileged state, and a cycle is thus defined as the time duration between consecutive visits to level L*. The time-sharing implementation α_TS(p̃) corresponding to π(p*) is defined as in Section 3.3 and provides an easy way to implement an optimal constrained policy. The discussion is similar to the one of Section 3.3 and will be omitted.

5.4 An indirect adaptive control implementation

All previous implementation schemes require knowledge of the optimal threshold L*. In case the external parameters λ and μ are unknown, the parameter L* is certainly not available, and all previously considered policies are thus not implementable in their given form.

In such cases, it is natural to consider a scheme that uses the Certainty Equivalence principle in conjunction with maximum likelihood estimators. The sequence of estimates {(λ(n), μ(n))}₁^∞ of the true parameter (λ, μ) is generated, based on all the past available information, by invoking the principle of maximum likelihood, and the sequence {(L(n), q(n))}₁^∞ is then determined from the estimates λ(n) and μ(n) by solving the equation (5.3). In case there is no solution to the equation (5.3) for a pair (λ(n), μ(n)), simply set L(n) = ∞ and q(n) = 1. The control action to be implemented in the slot [n, n+1) is then generated according to (L(n), q(n)), and the corresponding adaptive policy is denoted by α_ML.

The estimation procedure relies on an information pattern I(n) which is richer than H(n), with

I(n) = \{X(t), U(t), A(t), B(t), \ 1 \le t \le n\}, \quad n = 1, 2, \ldots    (5.5)

where the {0,1}-valued RV's U(n), A(n) and B(n) represent the control action implemented in the slot [n, n+1), the arrival during that slot and the service completion at the end of that slot, respectively.

Using the Strong Law of Large Numbers and results from the theory of Large Deviations, the appropriate version of condition (3.6) is shown to again take place [18], the rate of convergence being exponentially fast; this last fact is crucial to the basic result, which is obtained under a fourth moment assumption on the initial queue size.

Theorem [18]: The policy α_ML solves the constrained optimization problem (C_V) with J(α_ML) = J(f*) and K(α_ML) = K(f*).

It is not clear at this time how to design direct adaptive control schemes when the optimal threshold L* is not available. A stochastic approximation algorithm for generating recursively a sequence of estimates for L*, similar to the one used for q*, naturally comes to mind; however, the corresponding scheme fails to work owing to its sensitivity to variations of the integer-valued threshold [18].
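A sketch of the certainty-equivalence loop is given below. The maximum likelihood estimates of the Bernoulli parameters reduce to empirical frequencies computed from I(n); K_of is a hypothetical stand-in for the closed-form expression of K((L, q)) under (λ, μ) available from [17,18].

```python
import math

def ml_estimates(A, B, busy):
    """Empirical-frequency ML estimates from the information pattern I(n):
    A[t], B[t] are the arrival and service-completion indicators, and
    busy[t] indicates that the queue was nonempty during slot t."""
    lam = sum(A) / len(A)
    mu = sum(B) / max(1, sum(busy))
    return lam, mu

def certainty_equivalent_threshold(lam, mu, V, K_of, L_max=10_000):
    """Solve K((L, q)) = V as in (5.3) under the estimated (lam, mu);
    K_of(L, q, lam, mu) is a hypothetical closed-form stand-in.
    Returns (inf, 1) when no solution exists, as prescribed above."""
    for L in range(L_max + 1):
        if K_of(L, 1.0, lam, mu) >= V:      # smallest L with K((L, 1)) >= V
            lo, hi = 0.0, 1.0               # bisect: q -> K((L, q)) is increasing
            for _ in range(50):
                mid = (lo + hi) / 2.0
                if K_of(L, mid, lam, mu) < V:
                    lo = mid
                else:
                    hi = mid
            return L, hi
    return math.inf, 1.0
```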
REFERENCES

[1] E. Altman and A. Shwartz, "Optimal priority assignment with general constraints", submitted, 24th Allerton Conference on Communication, Control and Computing, October 1986.
[2] J. S. Baras, A. J. Dorsey and A. M. Makowski, "Discrete time competing queues with geometric service requirements: stability, parameter estimation and adaptive control", under revision, SIAM J. Control Opt.
[3] J. S. Baras, D.-J. Ma and A. M. Makowski, "K competing queues with geometric service requirements and linear costs: The μc-rule is always optimal", Systems & Control Letters, vol. 6, pp. 173-180 (1985).
[4] F. J. Beutler and K. W. Ross, "Time-average optimal constrained semi-Markov decision processes", Adv. Appl. Prob. vol. 18, pp. 341-359 (1986).
[5] C. Buyukkoc, P. Varaiya and J. Walrand, "The cμ-rule revisited", Adv. Appl. Prob. vol. 17, pp. 234-235 (1985).
[6] G. Fayolle and R. Iasnogorodski, "Two coupled processors: The reduction to a Riemann-Hilbert problem", Z. Wahr. verw. Gebiete, vol. 47, pp. 325-351 (1979).
[7] G. C. Goodwin and K. S. Sin, Adaptive Filtering, Prediction and Control, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[8] B. Hajek, "Optimal control of two interacting service stations", IEEE Trans. Auto. Control, vol. AC-30, pp. 68-76 (1985).
[9] D. Heyman and M. Sobel, Stochastic Models in Operations Research, Volume II: Stochastic Optimization, McGraw-Hill, New York, 1984.
[10] L. Kleinrock, Queueing Systems, Volume II: Computer Applications, John Wiley & Sons, New York, 1976.
[11] P. R. Kumar, "A survey of some results in stochastic adaptive control", SIAM J. Control Opt. vol. 23, pp. 329-380 (1985).
[12] P. R. Kumar and W. Lin, "Optimal control of a queueing system with two heterogeneous servers", IEEE Trans. Auto. Control, vol. AC-29, pp. 696-703 (1984).
[13] H. J. Kushner, Introduction to Stochastic Control, Holt, Rinehart and Winston, New York, 1971.
[14] H. J. Kushner and A. Shwartz, "An invariant measure approach to the convergence of Stochastic Approximations with state-dependent noise", SIAM J. Control Opt. vol. 22, pp. 13-27 (1984).
[15] A. A. Lazar, "The throughput time delay function of an M|M|1 queue", IEEE Trans. Info. Theory, vol. IT-29, pp. 914-918 (1983).
[16] S. Lippman, "Applying a new device in the optimization of exponential queueing systems", Oper. Res. vol. 23, pp. 687-710 (1975).
[17] D.-J. Ma and A. M. Makowski, "A simple problem of flow control I: Optimality results", in preparation (1986).
[18] D.-J. Ma and A. M. Makowski, "A simple problem of flow control II: Implementation of threshold policies", in preparation (1986).
[19] P. Mandl, "Estimation and control in Markov chains", Adv. Appl. Prob. vol. 6, pp. 40-60 (1974).
[20] M. Metivier and P. Priouret, "Applications of a Kushner and Clark lemma to general classes of stochastic algorithms", IEEE Trans. Info. Theory, vol. IT-30, pp. 140-150 (1984).
[21] P. Nain and K. W. Ross, "Optimal priority assignment with hard constraint", Rapports de Recherche No. 459, INRIA, Rocquencourt, France (November 1985).
[22] P. Nain and K. W. Ross, "Optimal multiplexing of heterogeneous traffic with hard constraint", Proceedings of the Joint Performance '86 and ACM Sigmetrics Conference, Raleigh, North Carolina, May 1986.
[23] M. Reiser, "Performance evaluation of data communication systems", Proc. IEEE, vol. 70, no. 2, pp. 171-196 (1982).
[24] H. Robbins and S. Monro, "A stochastic approximation method", Ann. Math. Stat. vol. 22, pp. 400-407 (1951).
[25] Z. Rosberg, P. V. Varaiya and J. Walrand, "Optimal control of service in tandem queues", IEEE Trans. Auto. Control, vol. AC-27, pp. 600-610 (1982).
[26] K. W. Ross, Constrained Markov Decision Processes with Queueing Applications, Ph.D. thesis, Computer, Information and Control Engineering, University of Michigan, 1985.
[27] R. Serfozo, "An equivalence between discrete and continuous time Markov decision processes", Oper. Res. vol. 27, pp. 616-620 (1979).
[28] A. Shwartz and A. M. Makowski, "An optimal adaptive scheme for two competing queues with constraints", 7th International Conference on Analysis and Optimization of Systems, pp. 515-532, Antibes, France, 1986.
[29] A. Shwartz and A. M. Makowski, "Adaptive policies for a system of competing queues I: Convergence results for the long-run average cost", in preparation (1986).
[30] A. Shwartz and A. M. Makowski, "Adaptive policies for a system of competing queues II: Implementable schemes for optimal server allocation", in preparation (1986).
[31] S. Stidham, Jr., "Optimal control of admission to a queueing system", IEEE Trans. Auto. Control, vol. AC-30, pp. 705-713 (1985).
