
A survey of average cost problems in deterministic discrete-time control systems


Onésimo Hernández-Lerma (I)
CINVESTAV, Department of Mathematics
Av. Politécnico 2508, México City 07360, México

Leonardo R. Laura-Guarachi
SEPI-ESE-IPN
Plan de Agua Prieta 66, Plutarco Elías Calles, 11340 Miguel Hidalgo, México City,
México.

Saul Mendoza-Palacios (II, ∗)
El Colegio de México, CEE.
Carretera Picacho Ajusco 20, Ampliación Fuentes del Pedregal, 14110 Tlalpan, México
City, México.

Abstract
This paper concerns optimal control problems for infinite-horizon discrete-
time deterministic systems with the long-run average cost (AC) criterion.
This optimality criterion can be traced back to a paper by Bellman [6] for
a class of Markov decision processes (MDPs). We present a survey of some
of the main approaches to study the AC problem, namely, the AC optimal-
ity (or dynamic programming) equation, the steady state approach, and the
vanishing discount approach, emphasizing the difference between the deter-
ministic control problem and the corresponding (stochastic) MDP. Several
examples illustrate these approaches and related results. We also state some
open problems.

(I) The work of this author was partially supported by CONACYT grant 263963.
(II) Work partially supported by Consejo Nacional de Ciencia y Tecnología (CONACYT-México) under grants CONACYT (Project No. A1-S-11222) and Ciencia-Frontera 2019-87787.

∗ Corresponding author.
Email: smendoza@math.cinvestav.mx (Saul Mendoza-Palacios)

Preprint submitted to Elsevier December 29, 2021


Keywords: Average cost, Markov decision processes, Dynamic
programming, Discrete time systems.
2020 MSC: 90C40, 90C39, 49J21, 49K21, 49L20.

1. Introduction
This paper concerns deterministic discrete-time control systems in which
the state process {xt , t = 0, 1, ...} ⊂ X evolves as
xt+1 = F (xt , at ), t = 0, 1, 2, ... (1)
where {at , t = 0, 1, ...} ⊂ A is the sequence of control variables or control
actions at each time t. In many applications, the state and action spaces
X and A are subsets of finite-dimensional spaces, say Rn and Rm . Here,
however, we suppose that the state and action (or control) spaces X and A
are so-called Borel spaces (that is, Borel subsets of complete and separable
metric spaces), which include all the spaces that appear in applications, even
finite or countable sets (with the discrete topology).
Given a cost-per-stage function c(x, a), let
J_T(π, x) := \sum_{t=0}^{T-1} c(x_t, a_t)    (2)

be the total cost in the first T stages (T = 1, 2, ...) when using the control
policy (or strategy) π = {a0 , a1 , ...} ⊂ A, given the initial state x0 = x. In
the long-run (or asymptotic or limiting) average cost (AC) control problem
we wish to minimize the objective function (or performance index)
J(π, x) := \limsup_{T→∞} \frac{1}{T} J_T(π, x)   ∀ x_0 = x    (3)
over all policies π, subject to (1). (In Section 2, below, we present a more
detailed description of the AC control problem.)
The AC value function is
J ∗ (x) := inf{J(π, x) : π ∈ Π} (4)
where Π is the set of admissible (or feasible) control policies or strategies
(see Section 2). A control policy π ∗ is said to be average–cost optimal (AC–
optimal) if
J(π ∗ , x) = J ∗ (x) ∀x ∈ X. (5)

Remark 1.1. Concerning the definition in (3), note that at the outset we
do not know if the averages (1/T )JT (π, x) converge as T → ∞; therefore,
we have to use either “lim sup” or “lim inf” to ensure that J(π, x) is well
defined. Moreover, the reason for using “lim sup” rather than “lim inf” is
due to the Abelian theorem introduced in Section 5. We will come back to
this point after Lemma 5.6.
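As a quick numerical illustration of (1)-(3), the following Python sketch simulates a toy scalar system under a fixed feedback rule and prints the finite-horizon averages (1/T) J_T(π, x) for growing T. The dynamics, cost, and policy in the sketch are illustrative choices only; they are not taken from the examples discussed later in the paper.

```python
# A minimal numerical sketch of (1)-(3): simulate a toy scalar system
# x_{t+1} = F(x_t, a_t) under a fixed feedback rule and watch the
# finite-horizon averages (1/T) J_T stabilize.  All names are illustrative.

def F(x, a):
    return 0.5 * x + a          # a simple stable linear dynamic

def c(x, a):
    return x**2 + a**2          # a quadratic stage cost

def average_cost(x0, policy, T):
    """Return (1/T) * J_T(pi, x0) for the feedback policy a_t = policy(x_t)."""
    x, total = x0, 0.0
    for _ in range(T):
        a = policy(x)
        total += c(x, a)
        x = F(x, a)
    return total / T

policy = lambda x: -0.25 * x    # an arbitrary admissible feedback rule
for T in (10, 100, 1000, 10000):
    print(T, average_cost(2.0, policy, T))
# The printed averages stabilize, suggesting the lim sup in (3) is an actual limit here.
```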

Since AC problems concern “convergence of averages” (as in ergodic theory
or in the laws of large numbers), they are also known as ergodic control
problems. (See Arapostathis et al. [1], Arisawa [2], Ghosh and Rao [21], etc.)
The main purpose of this paper is to present, for the first time, a system-
atic and organized overview of different approaches to study the deterministic
AC control problem (1)-(5), namely, the AC optimality equation (ACOE),
the steady (or stationary) state approach, and the vanishing discount ap-
proach. (There is also an infinite-dimensional linear programming approach
to AC problems, but we do not include it in this paper because it would re-
quire too much background material. The interested reader may consult, for
instance, Borkar et al. [10] or Chapter 11 in Hernández-Lerma and Lasserre
[26] and their references.)
The ACOE is also known as the AC dynamic programming equation or
the AC Bellman equation.

1.1. The AC control problems, where do they come from?


The AC problems can be traced back to the 1957 paper “A Markovian
decision problem” by Bellman [6]. (Bellman [4] also coined the term Markov
decision process (MDP).) Bellman [6] was interested in analyzing the growth
rate of finite-horizon costs as in (2). However, the infinite-horizon case (3)
and its applications to problems in operations research, industrial engineering
and many other areas were soon popularized by the books by Bellman [5]
and Howard [28], among other publications. (A Markov decision process is
also known as a Markov control process.)

1.2. MDPs versus deterministic control systems


Since Markov decision processes (MDPs) usually concern stochastic prob-
lems, an obvious question is, what is their connection with the deterministic
AC problem (1)-(5)? Are the latter problems a special class of MDPs? To an-
swer these questions in a precise manner, let us first recall (see, for instance,
Blackwell [9], Costa and Dufour [15], Feinberg et al. [18], Hernández-Lerma

and Lasserre [25, 26],...) that a time-homogeneous MDP can be represented
in compact form as a Markov control model (CM) CM := (X, A, Q, c), with
X, A and c as above, whereas Q represents the process transition law or
transition probability

Q(B|x, a) := Prob[xt+1 ∈ B|xt = x, at = a] (6)

for all B ⊂ X, (x, a) ∈ X ×A, and t = 0, 1, ... In particular, for the determin-
istic system (1) the transition law is the Dirac (or unit) measure concentrated
at F (·, ·), that is,
Q(B|x, a) = δ_{F(x,a)}(B) := 1 if F(x, a) ∈ B, and 0 otherwise.    (7)

Hence, strictly speaking, the deterministic control problem (1)-(5) is a MDP


with a transition function F rather than a transition probability. The corre-
sponding deterministic control model can be expressed as

CMd := (X, A, F, c), (8)

which (by (7)) we might call a “degenerate” MDP.


Paradoxically, however, many results for MDPs are not applicable to the
deterministic case because they require conditions that exclude deterministic
systems. As an example, some MDPs require the transition law Q to be
strongly continuous, which means that the mapping
(x, a) ↦ \int_X v(y) Q(dy|x, a)    (9)

is continuous for any bounded measurable function v on X. In the deterministic
case, (9) becomes

(x, a) ↦ v(F(x, a))    (10)
which clearly is not continuous for an arbitrary measurable function v. (Take
v as, for instance, the indicator function of a set B ⊂ X.) On the other
hand, we do have continuity in (10) if

(x, a) ↦ F(x, a)    (11)

is continuous and, in addition, Q is weakly continuous (or a Feller transition


probability) so, by definition, we have continuity in (9) for any bounded

continuous function v. Due to this fact, our statements concerning MDPs
are always restricted to the weakly continuous (or Feller) case, as in Costa
and Dufour [15], Feinberg et al. [18], and Vega-Amaya [41], among others.
To the best of our knowledge, the analysis of AC problems for MDPs with
weakly continuous (or Feller) transition probabilities was initiated by Schäl
[38].
Unfortunately, having a weakly continuous transition law is still not suf-
ficient for some MDP results to be applicable to the deterministic case. The
reason is that typically MDPs require conditions such as ergodicity, irre-
ducibility and others that do not hold in the deterministic case.

1.3. Organization of the paper


In Section 2 we complete the description of the AC problem. In particular,
we formalize the classes of control policies we are interested in. In Section
3 we introduce the average cost optimality equation (ACOE), also known as
the AC Bellman (or dynamic programming) equation. This is a key tool to
analyze AC control problems because most of the results and techniques are
directly or indirectly related to the ACOE. Section 4 concerns the steady-
state approach, which requires the existence of state-action pairs (x̄, ā) ∈
K such that F (x̄, ā) = x̄. This approach is of relevance in some modern
control techniques such as model predictive control; see Müller [35], Müller
et al. [36], Grüne et al. [24]. Finally, in Section 5 we consider the vanishing
discount approach to AC problems. This approach is an interesting and
clever adaptation to control problems of classical mathematical results such
as Abelian theorems and mean ergodic theorems, as in Bishop et al. [8],
Davies [16], Sznajder and Filar [39], and Yosida [43], for instance. Each
of the sections 3, 4, 5 presents examples illustrating the main results. We
conclude in Section 6 with some general comments and a list of problems for
further research.

2. The AC control problem: technical preliminaries


In this section we, first, complete the description of the AC control prob-
lem (1)-(3), and then we introduce important related concepts.
Consider the deterministic control model CMd = (X, A, F, c) in (8). Re-
call that X and A are Borel spaces. Without further notice, we suppose that
all the sets and functions introduced below are Borel measurable.

For each state x ∈ X, we denote by A(x) ⊂ A the (nonempty) set of
feasible controls (or control actions) in x. The set of feasible state-action
pairs
K := {(x, a) ∈ X × A : a ∈ A(x)} (12)
is the graph of the set-valued function (or multifunction) x → A(x).
We also assume that the mapping from K to X in (11) is continuous.
Let F be the family of functions f : X → A such that f (x) is in A(x)
for all x ∈ X; that is, the graph {(x, f (x)) : x ∈ X} of f is in K. These
functions f are called selectors of the multifunction x → A(x). We assume
that F is nonempty.
For many purposes, an admissible control policy (or strategy) is just a
sequence π = {at , t = 0, 1, ...} such that at is in A(xt ) for all t = 0, 1, ... In
particular, if, for every t = 0, 1, ..., at = ft (xt ) for some function ft ∈ F, then
π is said to be a Markov (or feedback) control policy. Moreover, if there exists
f ∈ F such that ft ≡ f for all t, then π is called a stationary Markov policy
or simply a stationary policy. In this case (following a standard convention
for MDPs), we identify π with f .

Remark 2.1. We denote by Π the family of admissible control policies, and


(by an abuse of terminology) we refer to F, the family of selectors, as the set
of stationary policies.

(a) Given a real-valued function v on K and a stationary policy f ∈ F, we


will usually write v(x, f (x)) as v(x, f ) for all x ∈ X. In particular, for
the stage cost function c, c(x, f (x)) = c(x, f ).

(b) Given an admissible control policy π = {at , t = 0, 1, ...}, we denote by


{xπt , t = 0, 1, ..} the sequence defined by (1) given that xπ0 ≡ x0 and
at ∈ π for all t = 0, 1, .... In particular, for a stationary policy f ∈ F,
we write the states as xft .

(c) For a given function ξ : X → R, let Πξ be the family of control policies


π ∈ Π such that, for every initial state x0 ,
\frac{1}{t} ξ(x_t^π) → 0 as t → ∞.    (13)
Similarly, Fξ denotes the family of stationary policies f ∈ F that satisfy
(13) for every initial state.

As an example, if ξ is bounded, then (13) holds for all π ∈ Π and all
f ∈ F; hence Πξ = Π, and Fξ = F. For special functions ξ, the relation (13)
is a transversality-like condition, which is a standard requirement in some
optimal control problems.

Finally, to avoid trivial situations, we suppose the following.

Assumption 2.2. There is a policy π ∈ Π such that J(π, x) < ∞ for each
x ∈ X.

3. The AC optimality equation


A pair (j ∗ , l) consisting of a real number j ∗ ∈ R and a function l : X → R
is called a solution to the average cost optimality equation (ACOE) if, for
every x ∈ X,
j* + l(x) = \inf_{a∈A(x)} [c(x, a) + l(F(x, a))].    (14)

It can be shown that if (j ∗ , l) is a solution to the ACOE, then j ∗ is unique,


whereas l is unique up to additive constants only. In fact, it is obvious that
if l(·) satisfies (14), then so does l(·) + k for any constant k.
A solution (j ∗ , l) to the ACOE is also known as a canonical pair . If, in
addition, f ∗ is a stationary policy that satisfies (15) below, then (j ∗ , l, f ∗ ) is
called a canonical triplet.
The following theorem gives sufficient conditions to solve the AC control
problem (1)-(3).

Theorem 3.1. Suppose that (j ∗ , l) is a solution to the ACOE (14), and let
Πl be as in Remark 2.1 (c) with ξ = l in (13). Then, for every initial state
x0 = x,

(a) j ∗ ≤ J(π, x) for all π ∈ Πl ; hence

(b) j ∗ ≤ J ∗ (x) if Πl = Π.

Moreover, suppose that there exists a policy f ∗ ∈ Fl such that f ∗ (x) ∈ A(x)
attains the minimum in the right–hand side of (14), i.e.,

j ∗ + l(x) = c(x, f ∗ ) + l(F (x, f ∗ )) ∀ x ∈ X. (15)

Then, for all x ∈ X,

(c) j ∗ = J(f ∗ , x) ≤ J(π, x) for all π ∈ Πl ; hence

(d) f ∗ is AC–optimal and J(f ∗ , ·) ≡ J ∗ (·) ≡ j ∗ if Πl = Π.

Proof. By (14), for every (x, a) ∈ K we have

j ∗ + l(x) ≤ c(x, a) + l(F (x, a)) (16)

Now consider an arbitrary policy π = {at } ∈ Πl . Then, using the notation


in Remark 2.1 (b), for any initial state xπ0 = x0 and t = 0, 1, ..., from (16) we
have
j ∗ + l(xπt ) ≤ c(xπt , at ) + l(xπt+1 ).
Hence, summation over t = 0, 1, ..., T − 1, gives

T j ∗ ≤ JT (π, x) + l(xπT ) − l(x0 ). (17)

Multiplying both sides of the latter inequality by 1/T and then letting T →
∞ we obtain (a), which in turn gives (b) if Πl = Π.
(c) If f ∗ satisfies (15), then we have equality throughout (16)–(17), which
yields the equality in (c). The inequality follows from (a). Finally, (d) is a
consequence of (b) and (c).
As in Remark 2.1(c), if l is bounded, then Πl = Π, as required in parts
(b) and (d) of Theorem 3.1.
The ACOE is very useful in the sense that, under the conditions of The-
orem 3.1, it gives the minimal AC j ∗ and an AC optimal policy f ∗ . It is also
useful to obtain refinements of AC optimality, such as “overtaking optimal”
or “bias optimal” controls. However, if we are only interested in obtaining
AC-optimal controls, it suffices to obtain an optimality inequality. This is
explained in Section 5.
Arguments similar to those in the proof of Theorem 3.1 give other useful
results, such as the following.

Proposition 3.2. (a) Suppose that instead of (16), for some f ∈ F, we


have
j ∗ + l(x) ≥ c(x, f ) + l(F (x, f )) ∀ x ∈ X. (18)
If limt→∞ l(xft )/t ≥ 0, then j ∗ ≥ J(f, x) for all x ∈ X.

(b) If the inequality in (18) is reversed, i.e.,

j ∗ + l(x) ≤ c(x, f ) + l(F (x, f )) ∀ x ∈ X, (19)

and limt→∞ l(xft )/t ≤ 0, then j ∗ ≤ J(f, x).

Proof. (a) Letting xft ≡ xt , we have that

j ∗ ≥ c(xt , f ) + l(xt+1 ) − l(xt ) ∀ t = 0, 1, ...

Hence, summation over t = 0, 1, ..., T − 1 gives

T j ∗ ≥ JT (f, x) + l(xT ) − l(x0 ).

Multiplying both sides by 1/T and letting T → ∞, since limt→∞ l(xft )/t ≥ 0,
we obtain that j ∗ ≥ J(f, x) for all x ∈ X.
(b) The proof only requires changing the inequality in (a).

Corollary 3.3. (a) If (18) holds for some f ∈ F, then

j* + l(x) ≥ \inf_{a∈A(x)} [c(x, a) + l(F(x, a))]   ∀ x ∈ X.

(b) If (19) holds for all f ∈ F, then

j* + l(x) ≤ \inf_{a∈A(x)} [c(x, a) + l(F(x, a))]   ∀ x ∈ X.

In the following Example 3.4 we wish to maximize a long-run average


reward (AR) or average utility (instead of minimizing an AC as in (3)-(4)).
The corresponding AR is defined as a “lim inf” rather than a “lim sup”; see
(21). This is related to Lemma 5.6(a). See Section 5 for details.

Example 3.4 (The Brock–Mirman model). This infinite–horizon model was


introduced by Brock and Mirman [12, 13] with discount and non-discount
criteria. The state and control variables xt and at denote capital and con-
sumption, respectively, at time t = 0, 1, . . . . The state and control spaces are
X = [0, ∞), A = (0, ∞), A(x) = (0, cxθ ], and the dynamics of the system is
given by
xt+1 = cxθt − at , t = 0, 1, 2, . . . , (20)

with a given initial state x0 ∈ X, θ ∈ (0, 1). Consider the objective function
to be optimized as the long–run average reward (AR)
J(π, x_0) = \liminf_{T→∞} \frac{1}{T} \sum_{t=0}^{T-1} log(a_t).    (21)

Note, from (21), that the stage reward is r(x, a) = log(a) (in lieu of the
stage cost c(x, a) in (2)).
To find a canonical triplet (j ∗ , l, f ∗ ) that satisfies the average reward op-
timality equation (AROE)

j* + l(x) = \max_{a∈A(x)} [r(x, a) + l(F(x, a))]   ∀ x ∈ X,    (22)

we consider a function l(x) of the form l(x) := b log(x), where b is an unknown
parameter to be determined. In this case, the right side of (22) reaches its
maximum at a = cx^θ/(1 + b), so (22) becomes

j* + b log(x) = log( cx^θ/(1 + b) ) + b log( cb x^θ/(1 + b) )
             = (1 + b)θ log(x) + log( c/(1 + b) ) + b log( cb/(1 + b) ).

This last equation is satisfied if b = (1 + b)θ, which implies that b = θ/(1 − θ).
Therefore the canonical triplet (j ∗ , l, f ∗ ) is given by

j* = log(c(1 − θ)) + \frac{θ}{1 − θ} log(cθ),    (23)
l(x) = \frac{θ}{1 − θ} log(x),    (24)
f*(x) = c(1 − θ)x^θ.    (25)

It can be shown that, for every initial state x0 > 0,


x_t^{f*} = (cθ)^{(1 − θ^t)/(1 − θ)} x_0^{θ^t}   for t = 1, 2, . . .    (26)

and x_t^{f*} → (cθ)^{1/(1 − θ)}, which implies that f* satisfies (13), that is, f* ∈ F_l. Thus
by Theorem 3.1, j ∗ = J(f ∗ , x) ≥ J(π, x) for all π ∈ Πl . ♦
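The triplet (23)-(25) can also be checked numerically. The following Python sketch, with illustrative values of the constants (c and θ are written c0 and theta in the code), verifies that (23)-(25) satisfy the AROE (22) at a few states and that the path (26) converges to (cθ)^{1/(1−θ)}.

```python
# A sketch checking the canonical triplet (23)-(25) for the Brock-Mirman
# model numerically.  The constants c0 (the "c" in (20)) and theta are
# illustrative choices, not values from the paper.
import math

c0, theta = 2.0, 0.4
b = theta / (1.0 - theta)
jstar = math.log(c0 * (1 - theta)) + b * math.log(c0 * theta)   # (23)
l = lambda x: b * math.log(x)                                   # (24)
fstar = lambda x: c0 * (1 - theta) * x**theta                   # (25)

# Left- and right-hand sides of the AROE (22) at a few states.
for x in (0.5, 1.0, 3.0):
    a = fstar(x)
    lhs = jstar + l(x)
    rhs = math.log(a) + l(c0 * x**theta - a)
    print(x, lhs, rhs)      # the two columns agree up to rounding

# The optimal path (26) converges to (c0*theta)**(1/(1-theta)).
x = 1.7
for _ in range(50):
    x = c0 * x**theta - fstar(x)
print(x, (c0 * theta)**(1 / (1 - theta)))
```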

If the reader is familiar with the dynamic programming or Bellman equa-
tion he/she surely noted in Example 3.4 that the solution (23)-(25) to the
AROE (22) was obtained by means of the “guess and verify” procedure. That
is, in view of the stage reward in (21), we guessed that the function l(·) in
(22) is of the form l(x) = b log(x), and then we verified that this is indeed
the case for some value of b. We will use the same procedure in the following
example on a LQ (linear system, quadratic cost) problem. To simplify the
presentation, in this example we consider a scalar (or one-dimensional) LQ
problem, but similar results hold in a general vector case. For details see, for
instance, Section 7.4 in Bertsekas [7].

Example 3.5 (An LQ problem). Consider the deterministic LQ control


model CMd in (8) with state and control sets X = A = R, transition function

F (x, a) = δx + ηa,

with nonzero coefficients δ, η, and a quadratic stage cost

c(x, a) = qx2 + ra2 ∀x, a ∈ R,

with q ≥ 0 and r > 0. Hence, more explicitly, the AC problem is to minimize


the long-run average cost
J(π, x) = \limsup_{T→∞} \frac{1}{T} \sum_{t=0}^{T-1} (q x_t^2 + r a_t^2)    (27)

subject to
xt+1 = δxt + ηat ∀t = 0, 1, ... (28)
for every initial state x0 = x. From (14), the corresponding ACOE is

j* + l(x) = \min_{a∈R} [qx^2 + ra^2 + l(δx + ηa)].    (29)

In view of (27)-(28) we conjecture that l is of the form l(x) = bx2 for some
constant b. To verify that this is indeed the case, we insert this function l in
(29) to obtain that the minimum is attained at

a* = −Bx   with   B := \frac{bδη}{r + bη^2}.    (30)

With this value of a = a*, (29) becomes

j* + bx^2 = \frac{qr + (qη^2 + rδ^2)b}{r + bη^2} x^2.    (31)

To proceed further, observe that with a* as in (30), it follows that (28) becomes

x_{t+1} = Γ x_t   with   Γ := δ − Bη = \frac{δr}{r + bη^2}.    (32)
Assuming that b is such that |Γ| < 1, the linear system (32) is stable and, as
t → ∞,
xt = Γt x0 → 0
for every initial state x0 . We thus obtain from (30)-(31) that (j ∗ , l, f ∗ ) is a
canonical triplet with j ∗ = 0, l(x) = bx2 , and f ∗ (x) = −Bx, where b is the
unique positive solution of the quadratic (steady-state Riccati) equation

b = \frac{qr + (qη^2 + rδ^2)b}{r + bη^2}.
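As a numerical illustration, the following Python sketch solves the steady-state Riccati equation for b with illustrative coefficients δ, η, q, r, forms the gain B in (30) and the closed-loop factor Γ in (32), and checks that |Γ| < 1.

```python
# A hedged numerical sketch for Example 3.5: solve the steady-state
# Riccati equation for b, form the gain B and the closed-loop factor
# Gamma, and check stability.  The coefficients below are illustrative.
import math

delta, eta, q, r = 1.2, 1.0, 1.0, 0.5

# b = [qr + (q*eta^2 + r*delta^2) b] / (r + b*eta^2)  rearranges to
# eta^2 b^2 + (r - q*eta^2 - r*delta^2) b - q*r = 0; take the positive root.
A2 = eta**2
A1 = r - q * eta**2 - r * delta**2
A0 = -q * r
b = (-A1 + math.sqrt(A1**2 - 4 * A2 * A0)) / (2 * A2)

B = b * delta * eta / (r + b * eta**2)        # optimal gain in (30)
Gamma = delta - B * eta                       # closed-loop factor in (32)
print(b, B, Gamma, abs(Gamma) < 1)            # expect |Gamma| < 1
```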

4. The steady–state approach


In this section we introduce the so-called steady state approach to the
AC problem (1)-(3). Some of the main ideas go back to the paper by Flynn
[19], which apparently is not well known.
A state x∗ ∈ X is called a steady state (or stationary state) of the system
(1) if there exists a∗ ∈ A(x∗ ) such that F (x∗ , a∗ ) = x∗ . In this case, (x∗ , a∗ )
is said to be a steady state-action pair. If, in addition, (x∗ , a∗ ) solves, over
all (x, a) ∈ K, the static problem

minimize c(x, a) subject to F (x, a) = x, (33)

then (x∗ , a∗ ) is called a minimum steady state-action pair for the AC control
problem (1)–(3).
The following Assumption 4.1 summarizes some of the main conditions
required in the steady-state approach to the AC problem, namely, (a) the
existence of a minimum steady state-action pair, (b) a dissipativity condition,
and (c) a stabilizability/reachability condition. (Concerning (b), see Remark
4.2.)

Assumption 4.1. The OCP (1)–(3) satisfies:
(a) There exists a minimum steady state-action pair (x∗ , a∗ ) ∈ K.
(b) The problem (1)–(3) is dissipative, which means that there is a so-called
storage function λ : X → R such that, for every (x, a) ∈ K,

λ(x) − λ(F (x, a)) ≤ c(x, a) − c(x∗ , a∗ ). (34)

(c) Let λ and Πλ be as in (34) and Remark 2.1(c), respectively. For each
initial state x0 ∈ X, there is a control policy π̄ ∈ Πλ (which may
depend on x0 ) such that the corresponding path (x̄t , āt ) converges to
the minimum state-action pair (x∗ , a∗ ) in (a) as t → ∞.
Remark 4.2. (a) The notion of dissipativity was introduced in control
theory by Willems [42]. He considered a differential system, say ẋ(t) =
F (x(t), a(t)), instead of the discrete-time model (1). Dissipativity for
the discrete-time case was introduced by Byrnes and Lin [14]. The
reader should be warned, however, that the terminology is not stan-
dard. For instance, different authors may use different signs in (34).
(b) A function λ that satisfies (34) is called an excessive function by Flynn
[20]. He also uses a Lagrange multipliers approach to study the static
problem (33), which of course is a constrained optimization problem.
(c) If (j ∗ , l) satisfies the ACOE (14), then

j ∗ + l(x) ≤ c(x, a) + l(F (x, a)) ∀ (x, a). (35)

Hence, the dissipativity inequality (34) holds with j ∗ = c(x∗ , a∗ ) and


λ(·) = l(·).
(d) Dissipativity is a key concept in model predictive control (MPC). See,
for instance, Müller et al. [36] and the recent survey by Müller [35],
and their references. Some aspects of dissipativity for MPC have been
extended to discounted control problems by Grüne et al. [24]. (See the
following Section 5 for a review of discounted problems.)
The following Theorem 4.3 is the main result in this section.
Theorem 4.3. Suppose that Assumption 4.1 holds and the stage cost c :
K → R is continuous. Let j ∗ := c(x∗ , a∗ ). Then, for all x ∈ X,

(a) j ∗ = J(π̄, x) ≤ J(π, x) for all π ∈ Πλ , where π̄ satisfies Assumption
4.1(c); hence

(b) the AC value function is J ∗ (·) ≡ j ∗ and π̄ is AC-optimal if λ is such


that Πλ = Π.

Proof. (a) Let π = {at } be an arbitrary control policy in Πλ with correspond-


ing state–action sequence (xt , at ). Then, from (35),

j ∗ ≤ c(xt , at ) + λ(xt+1 ) − λ(xt )

for all t = 0, 1, .... This yields, as in (16)–(17),

T j ∗ ≤ JT (π, x) + λ(xT ) − λ(x0 ) ∀ T = 1, 2, ...,

so, by (13),
j ∗ ≤ J(π, x) ∀x ∈ X. (36)
On the other hand, by the stabilizability in Assumption 4.1(c), there is a pol-
icy π̄ ∈ Πλ for which the state–action sequence (x̄t , āt ) converges to (x∗ , a∗ ).
Therefore, since the stage cost c is continuous, c(x̄t , āt ) converges to j ∗ , which
implies that J(π̄, x) = j* for all x. This proves (a).
(b) If Πλ = Π, then part (a) yields that π̄ is AC-optimal and also that
the AC value function is J ∗ (·) ≡ j ∗ .
In the remainder of this section we present some examples that illustrate
Theorem 4.3.

Example 4.4 (The LQ problem, cont’d.). Consider the LQ problem in Ex-


ample 3.5. Clearly, given the transition function and the stage cost

F (x, a) = δx + ηa and c(x, a) = qx2 + ra2 ,

respectively, the origin (x∗ , a∗ ) = (0, 0) is a minimum steady state-action pair.


Therefore, Assumption 4.1(a) holds. Furthermore, as in (35), the canonical
pair (j ∗ , l(·)) yields the dissipativity inequality (34) with j ∗ = c(x∗ , a∗ ) and
λ(x) = l(x) = bx2 . Finally, the stabilizability condition in Assumption
4.1(c) is obtained with π̄ = f ∗ , where f ∗ (x) = −Bx as in Example 3.5.
Consequently, the conclusions of Theorem 4.3 hold for the LQ problem. ♦

Example 4.5 (The Brock–Mirman model, cont’d.). In Example 3.4, for
the transition function F (x, a) = cxθ − a, and the stage reward (or utility)
function r(x, a) = log(a), it can be verified that the unique solution to the
corresponding steady state-action problem (33),

maximize: log(a) subject to cxθ − a = x,

is given by

(x*, a*) = ( (cθ)^{1/(1 − θ)}, c(1 − θ)(cθ)^{θ/(1 − θ)} ).    (37)
We can obtain the dissipativity inequality (34) from the ACOE in Exam-
ple 3.4. However, it is illustrative to obtain (34) directly as follows.
From (23)-(24) and Theorem 4.3(a), we have

J(π̄, x) = r(x*, a*) = log(a*) = log(c(1 − θ)) + \frac{θ}{1 − θ} log(cθ);    (38)

where π̄ ≡ f*(·) in (25). To verify Assumption 4.1(b), note that F(x, a*) −
F (x, a) = a − a∗ . Thus, the strict concavity of the stage reward r(x, a) :=
log(a) gives
r(x, a) − r(x*, a*) ≤ \frac{∂r}{∂x}(x*, a*)(x − x*) + \frac{∂r}{∂a}(x*, a*)(a − a*)
                   = \frac{1}{a*} (F(x, a*) − F(x, a)).

Moreover, the system function F(x, a*) is also concave in x and \frac{∂F}{∂x}(x*, a*) = 1, so

F(x, a*) − F(x*, a*) ≤ \frac{∂F}{∂x}(x*, a*)(x − x*) = x − x*,
that is, F (x, a∗ ) ≤ x for all x ∈ X. Therefore, from the last two inequalities
we get the corresponding dissipativity condition (34)

r(x, a) − r(x∗ , a∗ ) ≤ λ(x) − λ(F (x, a))

with the storage function λ(x) := (1/a*) x. Notice that λ is different from l in
(24), Example 3.4.
Observe that for any initial state x0 ∈ X, the policy f ∗ in (25) with
corresponding state-control path (xt , at ), given by
x_t = (cθ)^{(1 − θ^t)/(1 − θ)} x_0^{θ^t},   a_t = c(1 − θ)(cθ)^{(θ − θ^{t+1})/(1 − θ)} x_0^{θ^{t+1}},

satisfies the stabilizability Assumption 4.1(c), i.e., (xt , at ) converges to the
optimal stationary pair (x∗ , a∗ ).
Hence, by Theorem 4.3, r(x∗ , a∗ ) = J(π̄, x) ≥ J(π, x) for all π ∈ Πλ and
all x ∈ X. ♦
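The dissipativity inequality (34) with the storage function λ(x) = (1/a*) x can also be checked numerically. The Python sketch below, with illustrative values of c and θ, samples admissible pairs (x, a) and verifies that r(x, a) − r(x*, a*) − [λ(x) − λ(F(x, a))] never exceeds zero.

```python
# A small numerical check, under illustrative parameter values, of the
# dissipativity inequality (34) for the Brock-Mirman model with the
# storage function lambda(x) = x / a*.
import math, random

c0, theta = 2.0, 0.4
xstar = (c0 * theta) ** (1 / (1 - theta))                        # (37)
astar = c0 * (1 - theta) * (c0 * theta) ** (theta / (1 - theta))
storage = lambda x: x / astar

random.seed(0)
worst = -float("inf")
for _ in range(10000):
    x = random.uniform(0.01, 5.0)
    a = random.uniform(1e-6, c0 * x**theta)                      # a in A(x)
    gap = (math.log(a) - math.log(astar)) - (storage(x) - storage(c0 * x**theta - a))
    worst = max(worst, gap)
print(worst)   # stays <= 0 (up to rounding), consistent with (34)
```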
Example 4.6 (The Mitra-Wan forestry model; [33, 31, 30]). Consider a
forestland covered by trees of the same species classified by age-classes from
1 to n. After age n, trees have no economic value. The state space in this
example can be identified with the n–simplex
∆ := { x ∈ R^n : \sum_{i=1}^{n} x_i = 1, x_i ≥ 0, i = 1, ..., n }

where each coordinate xi denotes the proportion of land occupied by i-aged


trees.
Let xt = (x1,t , ..., xn,t ) ∈ ∆ be the forest state at period t. By the end
of the period the forester must decide to harvest a proportion of land in any
age class, say at = (a1,t , ..., an,t ) with 0 ≤ ai,t ≤ xi,t , i = 1, ..., n. Because a
tree has no economic value after age n, an,t = xn,t . Thus, for each x ∈ ∆,
the admissible control set is A(x) = [0, x1 ] × · · · × [0, xn−1 ] × {xn }. Suppose
that the forest evolves according to the dynamic model
x1,t+1 = a1,t + · · · + an,t , (39)
xi+1,t+1 = xi,t − ai,t , i = 1, ..., n − 1, (40)
where (39) means that all harvested area at the end of period t must be sown
by trees of age 1 at the beginning of period t + 1. On the other hand, (40)
states that trees of age i that have not been harvested until the end of period
t become trees of age i + 1 in period t + 1.
For a planning horizon T , (39)–(40) can be written as a discrete-time
linear control system
xt+1 = f (xt , at ) := Axt + Bat for t = 0, 1, ..., T − 1, (41)
where

A := \begin{pmatrix} 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}   and   B := \begin{pmatrix} 1 & 1 & \cdots & 1 & 1 \\ -1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \cdots & -1 & 0 \end{pmatrix}.    (42)

Now, assume the timber production per unit area is related to the tree
age-classes by the biomass vector

ξ = (ξ1 , ξ2 , ..., ξn ) ∈ Rn , ξi ≥ 0, i = 1, 2, . . . , n,

where ξi represents the amount of timber produced by i-aged trees occupying


a unit of land. Hence, the total amount of timber collected at the end of
period t is given by
ξ · at = ξ1 a1,t + · · · + ξn an,t .
Consider a timber price function p : [0, ∞) → [0, ∞), assumed to be increas-
ing and concave. Given a forest state x and an admissible harvest control a,
the stage income is r(x, a) := p(ξ · a). Therefore, the performance index to
maximize is
J_T(π, x) := \sum_{t=0}^{T-1} r(x_t, a_t).    (43)

It can be shown that the control system (41) has a set of stationary states
given by the pairs (x, a) satisfying x1 ≥ x2 ≥ · · · ≥ xn and

a1 = x1 − x2 , a2 = x2 − x3 , ..., an = xn .

Moreover, for each age class i there is a pair of stationary state and con-
trol (x^i, a^i), known as a normal forest, defined as follows: the state is x^i :=
(1/i, ..., 1/i, 0, ..., 0), where each of the first i coordinates is 1/i and the
remaining are 0; and the control is a^i := (0, ..., 0, 1/i, 0, ..., 0), where 1/i is in
the i-th coordinate.
Following the Brock-Mitra-Wan condition [11, 34], we assume that there
is a unique normal forest (x∗ , a∗ ) that satisfies

r(x*, a*) = max { p(ξ · a^i) : i = 1, 2, ..., n }.


 

So, given the concavity of p, there is k ≥ 0 such that

r(x, a) − r(x∗ , a∗ ) ≤ kξ · (a − a∗ ) ∀a ∈ A(x).

Letting N := (1, 2, ..., n) and γ := max{ξ · a^i : i = 1, 2, ..., n}, we get the


vector componentwise inequality ξ ≤ γN . Moreover, from a straightforward
calculation we have N · [x − F (x, a)] = N · (a − a∗ ) for any x ∈ ∆ and all
a ∈ A(x). Therefore,

r(x, a) − r(x∗ , a∗ ) ≤ kγN · [x − F (x, a)] ∀x ∈ ∆, a ∈ A(x).

Introducing the function λ : ∆ → R, defined by λ(x) = kγN · x, and the
value j ∗ := r(x∗ , a∗ ), we have the corresponding dissipativity inequality (35),

j ∗ + λ(x) ≥ r(x, a) + λ(F (x, a)) ∀x ∈ ∆, a ∈ A(x).

Thus in particular we conclude that (x∗ , a∗ ) is an optimal steady state-


action pair. Moreover, this pair (x∗ , a∗ ) can be reached by a finite sequence
of harvest plans from any initial state. Hence, the Mitra-Wan model satisfies
Assumption 4.1 and the conditions of Theorem 4.3, and so the optimal
AC value function is

J*(x) ≡ r(x*, a*) = max { p(ξ_i / i) : i = 1, 2, ..., n }.
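The following Python sketch, with an illustrative biomass vector and an illustrative price function p(y) = sqrt(y), builds the matrices in (42), verifies that every normal forest (x^i, a^i) is a steady state of (41), and evaluates the optimal AC value in the formula above.

```python
# A sketch, under illustrative data, of the Mitra-Wan dynamics (41)-(42):
# build A and B for n age classes and verify that each "normal forest"
# (x^i, a^i) is a steady state of x_{t+1} = A x_t + B a_t.
import numpy as np

n = 4
A = np.zeros((n, n))
A[1:, :-1] = np.eye(n - 1)        # trees age by one class per period
B = np.zeros((n, n))
B[0, :] = 1.0                      # harvested land is replanted with age-1 trees
B[1:, :-1] = -np.eye(n - 1)        # harvested trees leave their age class

for i in range(1, n + 1):
    x = np.zeros(n); x[:i] = 1.0 / i          # normal forest state x^i
    a = np.zeros(n); a[i - 1] = 1.0 / i       # harvest only class i
    assert np.allclose(A @ x + B @ a, x)       # (x^i, a^i) is a steady state

# With an illustrative biomass vector and price p(y) = sqrt(y), the optimal
# steady-state reward in Theorem 4.3 is max_i p(xi_i / i).
xi = np.array([0.0, 1.0, 3.0, 3.5])
print(max(np.sqrt(xi[i] / (i + 1)) for i in range(n)))
```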

5. The vanishing discount approach


The concept of discounted utility was introduced by Samuelson [37] in eco-
nomics, in 1937. This concept motivated the discounted (stochastic) control
problem introduced by Blackwell [9], which was one of the earliest Markov
decision processes (MDPs) studied in detail.
In this section we briefly review discounted MDPs and show some of their
connections to AC problems. It should be noted, however, that from a purely
mathematical viewpoint the connection between some “discounted” power
series and long-run averages go back to basic results such as the Abelian
theorems (see Lemma 5.6(a) below) or the mean ergodic theorems (see, for
instance, Davies [16] or Yosida [43]).
In this section, we first summarize some aspects of discounted control
problems and then we present their connections to AC problems.

5.1. Discounted optimal control problems


Consider the deterministic control model CMd = (X, A, F, c) in (8), but
instead of the AC problem (1)-(3) we now consider the discounted cost OCP:
for each discount factor α ∈ (0, 1) and initial state x0 = x, minimize

V_α(π, x) := \sum_{t=0}^{∞} α^t c(x_t, a_t)    (44)

over all policies π = {at , t = 0, 1, ...} ∈ Π, subject to (1). The corresponding
α-discount value function is

V_α(x) := \inf_{π∈Π} V_α(π, x)   ∀ x ∈ X.    (45)

A policy π ∗ is said to be α-discount optimal (or simply α-optimal) if

Vα (π ∗ , x) = Vα (x) ∀x ∈ X. (46)

To fix ideas, in this section we suppose that the following Assumption


5.1 holds. This assumption consists of conditions which are sufficient for
our present purposes; they are not necessary. For other sets of conditions
ensuring the results in this section, see Costa and Dufour [15], Feinberg et al.
[18], Hernández-Lerma and Lasserre [25, 26], Schäl [38] among many other
publications.
Assumption 5.1. (a) The transition function F : K → X is continuous.

(b) The stage cost c : K → R is nonnegative. Moreover,

(c) c is K-inf compact, which means that, for every sequence {(xt , at )} in
K such that xt → x and {c(xt , at )} is bounded above, it holds that {at }
has an accumulation point in A(x).
Clearly, in Assumption 5.1(b) we may replace “c nonnegative” by “c
bounded below”. On the other hand, a condition ensuring Assumption 5.1(c)
is, for example, that c is inf-compact, that is, for every real number r, the
set {(x, a) ∈ K|c(x, a) ≤ r} is compact. (See Feinberg et al. [18].)
Let L(X) be the family of real-valued functions on X which are lower
semicontinuous (l.s.c.) and bounded below. For each α ∈ (0, 1), we define
the operator Tα on L(X) as

T_α v(x) := \inf_{a∈A(x)} [c(x, a) + α v(F(x, a))]    (47)

for all v ∈ L(X) and x ∈ X.


Theorem 5.2. Suppose that Assumption 5.1 holds and fix an arbitrary α ∈
(0, 1). Then:
(a) The operator Tα in (47) maps L(X) into itself, that is, if v is in L(X),
then so is Tα v.

(b) For every v ∈ L(X) there exists a stationary policy f ∈ F such that
f (x) ∈ A(x) attains the minimum at the right-hand side of (47), i.e.
(using the notation in Remark 2.1(a)),

Tα v(x) = c(x, f ) + αv(F (x, f )) (48)

for all x ∈ X.
(c) The α-discount value function in (45)-(46) is in L(X) and it is a fixed-
point of Tα , that is, Tα Vα = Vα . More explicitly,

V_α(x) = \inf_{a∈A(x)} [c(x, a) + α V_α(F(x, a))].    (49)

(d) There exists fα ∈ F such that

V_α(x) = c(x, f_α) + α V_α(F(x, f_α))    (50)

for all x ∈ X.
(e) A stationary policy f ∈ F is α-optimal if, and only if, f satisfies (50).
Theorem 5.2 is a standard result in discounted dynamic programming.
See, for instance, the references mentioned at the paragraph preceding As-
sumption 5.1.
The equation (49) is called the α-discounted cost optimality equation (α-
DCOE), and it is also known as the discounted cost dynamic programming
or Bellman equation.
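For compact (or discretized) state and action sets, the α-DCOE can be solved numerically by iterating the operator T_α in (47), that is, by value iteration. The Python sketch below does this for a toy model that is not from the paper: the action directly chooses the next state on a grid in [0, 1], so v(F(x, a)) is evaluated exactly at grid points.

```python
# A hedged sketch of computing V_alpha by iterating the operator T_alpha
# in (47) on a finite grid.  The toy model below (state and action grids
# on [0,1], F(x,a) = a) is illustrative, not from the paper.
import numpy as np

alpha = 0.9
grid = np.linspace(0.0, 1.0, 101)                 # state grid = action grid
def cost(x, a):
    return (x - 0.5) ** 2 + 0.1 * (a - x) ** 2    # a simple stage cost

# Since F(x,a) = a lands exactly on the grid, T_alpha v is an exact minimum
# over grid actions; for a general F one would interpolate v at F(x,a).
C = cost(grid[:, None], grid[None, :])            # C[i,j] = c(x_i, a_j)
v = np.zeros_like(grid)
for _ in range(2000):                             # value iteration: v <- T_alpha v
    v_new = np.min(C + alpha * v[None, :], axis=1)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

policy = grid[np.argmin(C + alpha * v[None, :], axis=1)]   # a minimizer as in (48)
print(v[50], policy[50])                           # value and action at x = 0.5
```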
The so-called vanishing discount approach to AC control problems is
based on several connections between the α-discounted costs Vα and the
AC costs as α ↑ 1. These connections are discussed in the remainder of this
section. We begin with the simplest case.

5.2. Convergence of c(xt , at )


Let M be an arbitrary constant. Inside the summation in (44) replace
c(·, ·) by c(·, ·) ± M . Then (44) becomes

V_α(π, x) = \frac{M}{1 − α} + \sum_{t=0}^{∞} α^t [c(x_t, a_t) − M],

i.e.,

(1 − α) V_α(π, x) = M + (1 − α) \sum_{t=0}^{∞} α^t [c(x_t, a_t) − M].    (51)

This equation is useful in several ways. For instance, if M = J(π, x), then
(51) becomes

(1 − α) V_α(π, x) = J(π, x) + (1 − α) \sum_{t=0}^{∞} α^t [c(x_t, a_t) − J(π, x)].    (52)

This suggests that J(π, x) can be approximated by (1 − α)Vα (π, x) as α ↑ 1.


(See Lemma 5.6(a) below.) We can also note the following obvious fact.

Proposition 5.3. (a) If c(xt , at ) → M as t → ∞, then

(1 − α)Vα (π, x) → J(π, x) = M

as α ↑ 1.
(b) In particular, if c is continuous and Assumptions 4.1(a) and (c) hold,
then
\lim_{α↑1} (1 − α) V_α(π̄, x) = J(π̄, x) = j*,

with π̄ and j* := c(x*, a*) as in Assumption 4.1(c) and Theorem 4.3.


The following example is a straightforward application of Proposition


5.3(b). (Other examples are introduced below.)

Example 5.4 (An LQ system, cont’d.). Let (x∗ , a∗ ) = (0, 0) and π̄ = f ∗ be


as in Examples 3.5 and 4.4. Then π̄ is AC-optimal and Proposition 5.3(b)
holds with j ∗ = 0. ♦

Example 5.5 (The Brock–Mirman model, cont’d.). Consider the Example


3.4 and let (x∗ , a∗ ) and π̄ = f ∗ be as in (37) and (25), respectively. Then

log(a_t) = log(c(1 − θ)) + \frac{θ − θ^{t+1}}{1 − θ} log(cθ) + θ^{t+1} log(x_0),
and Proposition 5.3(b) holds with

j* = log(c(1 − θ)) + \frac{θ}{1 − θ} log(cθ).

5.3. An Abelian theorem
Another connection between discounted cost problems and the AC case
is provided by the Abelian theorem in part (a) of the following lemma.

Lemma 5.6. Let {ct } be a sequence bounded below, and consider the lower
and upper limit averages (also known as Cesàro limits)
C^L := \liminf_{n→∞} \frac{1}{n} \sum_{t=0}^{n-1} c_t,   C^U := \limsup_{n→∞} \frac{1}{n} \sum_{t=0}^{n-1} c_t,

and the lower and upper Abelian limits

A^L := \liminf_{α↑1} (1 − α) \sum_{t=0}^{∞} α^t c_t,   A^U := \limsup_{α↑1} (1 − α) \sum_{t=0}^{∞} α^t c_t.

Then

(a) C^L ≤ A^L ≤ A^U ≤ C^U.

(b) If A^L = A^U, then the equality holds in (a), i.e., C^L = A^L = A^U = C^U.

For a proof of Lemma 5.6 see the references in Bishop et al. [8] or Sznajder
and Filar [39]. Part (b) in Lemma 5.6 is known as the Hardy-Littlewood
Theorem.
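A quick numerical illustration of Lemma 5.6: for the bounded alternating sequence c_t = 0, 1, 0, 1, . . ., both the Cesàro averages and the Abel sums tend to 1/2, as the following Python sketch shows.

```python
# A small numerical illustration of Lemma 5.6 for the alternating sequence
# c_t = 0, 1, 0, 1, ...: both the Cesaro averages and the Abel sums tend to 1/2.
def cesaro(n):
    return sum(t % 2 for t in range(n)) / n

def abel(alpha, terms=200000):
    return (1 - alpha) * sum(alpha**t * (t % 2) for t in range(terms))

print([cesaro(n) for n in (10, 100, 1000)])           # -> 0.5
print([abel(a) for a in (0.9, 0.99, 0.999)])          # -> approaches 0.5
```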
From Lemma 5.6(a) we can obtain useful bounds for the average costs
J(π, x). See the following Lemma 5.7 and also Theorem 5.9.

Lemma 5.7. Let π ∈ Π be an arbitrary control policy. Then:


(a) For any initial state x ∈ X,

\liminf_{α↑1} (1 − α) V_α(π, x) ≤ \limsup_{α↑1} (1 − α) V_α(π, x) ≤ J(π, x).

(b) The value functions Vα (·) and J ∗ (·) in (45) and (4), respectively, satisfy
that
\liminf_{α↑1} (1 − α) V_α(x) ≤ \limsup_{α↑1} (1 − α) V_α(x) ≤ J*(x)

for all x ∈ X.

Proof. (a) Consider an arbitrary policy π = {at } and the corresponding
state trajectory {xt }. Let ct := c(xt , at ). By Assumption 5.1(b), ct is
nonnegative for all t = 0, 1, .... Hence part (a) follows from Lemma
5.6(a).
(b) The first inequality in (b) follows from the first inequality in (a). More-
over, from (46) and part (a) again,

\limsup_{α↑1} (1 − α) V_α(x) ≤ J(π, x)   ∀ x ∈ X.

Therefore, since π was arbitrary, we obtain (b).

5.4. Rewriting the α-DCOE


Consider the α-DCOE (49), and let

hα (x) := Vα (x) − mα , and ρα := (1 − α)mα (53)

where mα is a constant which, depending on the context, can be defined in


two possible ways:
(a) mα := inf x∈X Vα (x);
(b) mα := Vα (x̄) for some arbitrary, but fixed, state x̄ ∈ X.
The former case (if the infimum is finite) is convenient because it ensures
that the function hα in (53) is nonnegative. On the other hand, (b) can be
useful because it gives that hα (x̄) = 0, so we “fix” hα at x̄. We use (a) and
(b) in Examples 5.11 and 5.12, respectively.
Using (53) we may rewrite the α-DCOE (49) as

ρ_α + h_α(x) = \inf_{a∈A(x)} [c(x, a) + α h_α(F(x, a))].    (54)

We can relate (54) to the dissipativity inequality (34) as in the following


remark.

Remark 5.8. For each α ∈ (0, 1), let (xe (α), ae (α)) be a steady state-action
pair which is a limit point of some optimal state-action path. Then

V_α(x^e(α)) = \frac{c(x^e(α), a^e(α))}{1 − α}.

According to (53) we have
ρα = (1 − α)Vα (xe (α)) = c(xe (α), ae (α)).
Moreover, as in Assumption 4.1 (b), we define the storage function for the
discounted problem by λα (·) := hα (·). Thus the equation (54) implies a
dissipative-like inequality as in (34),
λα (x) − αλα (F (x, a)) ≤ c(x, a) − c(xe (α), ae (α)) ∀a ∈ A(x). (55)
Note that if α ↑ 1 and (xe (α), ae (α)) converges to the optimal stationary pair
(x∗ , a∗ ) defined in (33), then the inequality (55) yields (34). As an example,
the latter fact holds in the LQ case. Actually, from Example 4.4 we can see
that, for any α ∈ (0, 1], (xe (α), ae (α)) = (x∗ , a∗ ) = (0, 0).
Comparing (54) with the ACOE (14), we may summarize the vanishing
discount approach as follows: Find conditions under which, as α ↑ 1, the pair
(ρα , hα (·)) in (54) converges to a solution (J ∗ , h∗ ) of the ACOE (14).
More explicitly, the idea is to determine a sequence of discount factors
αn → 1 and a pair (J ∗ , h∗ (·)) such that, as n → ∞,
hn ≡ hαn → h∗ and ραn → J ∗ , (56)
and (J ∗ , h∗ (·)) satisfies (14). We show in Subsection 5.5, by means of exam-
ples, that this is indeed a “feasible” approach. However, as far as we can tell,
there are no general results for the deterministic AC problem (1)-(3). All the
known results on the convergence in (56) to a solution of the ACOE refer to
“stochastic” MDPs, not to the “degenerate” case in (8).
The good news is that, in the degenerate case, (56) gives a pair (ρ∗ , h∗ (·))
that satisfies an AC optimality inequality (ACOI), which suffices to obtain
an AC optimal stationary policy f ∗ ∈ F. We will next present this fact.
Let ρα := (1 − α)mα be as in (53), with mα := inf x∈X Vα (x). Moreover,
let ρ∗ := lim supα↑1 ρα and
J* := \inf_{x∈X} J*(x) = \inf_{x∈X} \inf_{π∈Π} J(π, x).    (57)

By Lemma 5.7(b),
ρ∗ ≤ J ∗ . (58)
We now state the ACOI. The next theorem, which we present without proof,
is due to Feinberg et al. [18]. (Vega-Amaya [41] presents a self-contained
proof of Theorem 5.9, shorter than the proof in [18].)

Theorem 5.9. Suppose that Assumption 5.1 holds and, in addition,
(a) J ∗ < ∞;
(b) The function h(·) := \liminf_{α↑1} h_α(·) is finite-valued.

Then there exists a function h∗ ∈ L(X), with h∗ (·) ≤ h(·), and a stationary
policy f ∗ ∈ F such that (J ∗ , h∗ (·)) satisfies the ACOI

ρ* + h*(x) ≥ \inf_{a∈A(x)} [c(x, a) + h*(F(x, a))]

and, moreover,
ρ∗ + h∗ (x) ≥ c(x, f ∗ ) + h∗ [F (x, f ∗ )]. (59)
Hence the policy f ∗ is AC-optimal.

In Theorem 5.9 we included the condition (a) to state the theorem as in


[18], but note that in fact (a) follows from our Assumption 2.2.

Remark 5.10. (a) As in Proposition 3.2(a), the inequality (59) gives that
ρ∗ ≥ J(f ∗ , x) for all x ∈ X. Therefore, from (57) and (58),

ρ∗ ≥ J(f ∗ , ·) ≥ J ∗ ≥ ρ∗ .

This shows that f ∗ is AC-optimal and the AC value function is

J ∗ (·) = J(f ∗ , ·) = J ∗ = ρ∗ .

(b) Let J ∗ be as in (57). If a pair (π̄, x̄) ∈ Π × X is such that J(π̄, x̄) = J ∗ ,
then it is called a minimum pair. For (stochastic) MDPs, the existence
of a minimum pair can be determined in several ways, including infinite-
dimensional linear programming arguments. (See, for instance, Yu [44]
or Section 6.4 in Hernández-Lerma and Lasserre [25].)

To conclude this section we note that Costa and Dufour [15] obtain results
on the ACOI similar to Theorem 5.9 using two different sets of assumptions.
These results are valid in our present deterministic context. On the other
hand, they also obtain the ACOE and the convergence of the policy iteration
algorithm but for MDPs that, to the best of our knowledge, exclude the
deterministic problem (1)-(3).

5.5. Examples
In this subsection we introduce some examples to illustrate the several
approaches to the AC control problem.

Example 5.11 (An LQ system, cont’d.). We consider again the LQ system


in Examples 3.5, 4.4 and 5.4, that is, with state and control space X = A = R,
transition function F (x, a) = δx + ηa, and stage cost c(x, a) = qx2 + ra2 .
But now we have the α-discounted cost

V_α(π, x) := \sum_{t=0}^{∞} α^t (q x_t^2 + r a_t^2)

for any policy π = {at } and initial state x0 = x. Clearly, Assumption 5.1
is satisfied, and the α-DCOE (49) can be obtained by the usual “guess and
verify” procedure. In fact, it is well known (see, for instance, Bersekas [7],
Hernández-Lerma and Lasserre [25],...) that the α-discount value function is
given by
Vα (x) = k(α)x2 (60)
for every α ∈ (0, 1) and x ∈ X, where k(α) is the unique positive solution of
the quadratic (Riccati) equation

k = \frac{qr + (qη^2 + rδ^2)αk}{r + αkη^2}.    (61)

Note that, as α ↑ 1, (61) reduces to the quadratic equation for the AC case
at the end of Example 3.5. In other words, for α = 1, the positive solution
k = k(1) of (61) coincides with the constant b in (30)-(32). Moreover, from
(60),
m_α := \inf_{x∈X} V_α(x) = 0   ∀ α ∈ (0, 1).

Therefore, (53) becomes

ρα = 0 and hα (·) = Vα (·).

This yields that, for any sequence of discount factors αn → 1, the pairs
(ραn , hαn (·)) converge to the solution (j ∗ , l(·)) of the ACOE (29). See also
(56).

To conclude, let us note that, for each α ∈ (0, 1), the α-optimal control
policy is given by

f_α(x) = −B(α)x   with   B(α) := \frac{αk(α)δη}{r + αk(α)η^2},

which, as α ↑ 1, converges to the AC-optimal control a∗ (·) in (30). To put


it another way, for the LQ problem, the vanishing discount approach gives a
canonical triplet as in (15). ♦
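Numerically, the vanishing discount limit in this example can be observed by solving (61) for k(α) with discount factors approaching 1. The Python sketch below, using the same illustrative coefficients as the Riccati sketch after Example 3.5, shows k(α) and B(α) approaching the average-cost quantities b and B.

```python
# A sketch of the vanishing discount limit in Example 5.11: solve the
# discounted Riccati equation (61) for k(alpha) and watch k(alpha) and the
# gain B(alpha) approach the average-cost quantities of Example 3.5.
# The coefficients are illustrative.
import math

delta, eta, q, r = 1.2, 1.0, 1.0, 0.5

def riccati(alpha):
    # alpha*eta^2 k^2 + (r - alpha*(q*eta^2 + r*delta^2)) k - q*r = 0, positive root
    A2 = alpha * eta**2
    A1 = r - alpha * (q * eta**2 + r * delta**2)
    A0 = -q * r
    return (-A1 + math.sqrt(A1**2 - 4 * A2 * A0)) / (2 * A2)

for alpha in (0.9, 0.99, 0.999, 0.9999):
    k = riccati(alpha)
    B = alpha * k * delta * eta / (r + alpha * k * eta**2)
    print(alpha, k, B)        # converges to the pair (b, B) of Example 3.5
```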

Example 5.12 (The Brock-Mirman model, cont’d.). We consider again the


Brock-Mirman model in Examples 3.4 (see (20)), 4.5 and 5.5, but now with
the α-discount optimality criterion. Hence, we now wish to maximize the
discounted reward

V_α(π, x) := \sum_{t=0}^{∞} α^t log(a_t)

for every initial state x ∈ X, where X := [0, ∞). This problem has been
solved by several methods; see, for example, Ulus [40], Domínguez-Corella and
Hernández-Lerma [17], and Le Van and Saglam [32]. The α-optimal control
and the corresponding value function are

f*_α(x) := c(1 − θα)x^θ

and

V*_α(x) = \frac{1}{1 − α} log[c(1 − θα)] + \frac{θα}{(1 − α)(1 − θα)} log(cθα) + \frac{θ}{1 − θα} log(x),

respectively. In (53) take mα := Vα (x̄) with x̄ = 1. Then, for every x ∈ X,

h_α(x) = V_α(x) − m_α = \frac{θ}{1 − θα} log(x),
ρ_α = log[c(1 − θα)] + \frac{θα}{1 − θα} log(cθα).
This yields that, for any sequence of discount factors αn → 1, the pairs
(ραn , hαn (·)) converge to the solution (j ∗ , l(·)) of the AROE (22). Moreover,
the α-optimal control fα∗ (x) converges to the AC-optimal control f ∗ (·) in
(25), as α ↑ 1. ♦
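The convergence claimed in this example is easy to check numerically. The Python sketch below, with illustrative values of c and θ, evaluates ρ_α and the coefficient θ/(1 − θα) of log(x) in h_α for α close to 1 and compares them with j* and θ/(1 − θ).

```python
# A sketch checking the limits in Example 5.12 under illustrative values
# c0 = 2, theta = 0.4: rho_alpha -> j*, and the coefficient theta/(1 - theta*alpha)
# of log(x) in h_alpha tends to theta/(1 - theta), the coefficient in (24).
import math

c0, theta = 2.0, 0.4
jstar = math.log(c0 * (1 - theta)) + (theta / (1 - theta)) * math.log(c0 * theta)

for alpha in (0.9, 0.99, 0.999):
    rho = math.log(c0 * (1 - theta * alpha)) \
        + (theta * alpha / (1 - theta * alpha)) * math.log(c0 * theta * alpha)
    print(alpha, rho, theta / (1 - theta * alpha))
print(jstar, theta / (1 - theta))     # the limiting pair
```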

6. Concluding remarks and open problems
This paper presents the three main approaches to analyze average cost
(AC) control problems for discrete-time deterministic systems, namely, the
AC optimality equation, the steady-state approach, and the vanishing dis-
count approach.
AC problems are a standard topic in the theory and applications of
discrete- and continuous-time “stochastic” Markov decision processes (MDPs).
Hence, since our control problems form a special class of MDPs, one might
expect that the results for MDPs are directly applicable to the deterministic
case. This, however, is not so. Indeed, as noted in Subsection 1.2,
many concepts for stochastic MDPs are not valid for deterministic or “de-
generate” MDPs. As a consequence, there are open problems such as the
following.

1. Find general conditions under which the vanishing discount approach


gives the ACOE (14); in other words, give conditions ensuring (56).
(As already noted above, this is known to be true in some particular
cases, as in Examples 5.11 and 5.12.)
2. Is it advantageous to use other sets of strategies, for instance, random-
ized strategies? To put it another way, it is well known that using
randomized or relaxed controls can be very useful for some theoretical
or practical purposes, as in Artstein [3] or González-Hernández and
Hernández-Lerma [22], for example. Is this true for the deterministic
AC problem (1)-(3)?
3. As noted in Remark 5.10(b), for MDPs, the existence of a minimum
pair can be determined in several forms. Are these forms applicable to
the deterministic or degenerate MDP in (1)-(3)?
4. In dynamic programming there are two important approximation pro-
cedures, the value iteration (or successive approximations) algorithm
and the policy iteration (or Howard improvement) algorithm; see, for
instance, Hernández-Lerma and Lasserre [25, 26]. Extend these two
algorithms to the deterministic AC problem (1)-(3). By the way, Costa
and Dufour [15] introduce a policy iteration algorithm for MDPs with
weakly continuous (or Feller) transition probabilities (see Section 1.2
above), but it does not include the deterministic case (1)-(3).
5. Extend the results in Sections 3, 4, 5 to discrete-time deterministic
noncooperative games. (Perhaps the zero-sum case is not too difficult.)

6. Extend the results in Sections 3, 4, 5 to the “differential case” in which
(1) and (2) are replaced by

ẋ(t) = F (x(t), a(t)) ∀t ≥ 0 (62)

and

J_T(π, x) := \int_0^T c(x(t), a(t)) dt,   T ≥ 0,    (63)

respectively, for every initial state x(0) = x. Some results go back


to Arisawa [2] and Grüne [23]. A particular application is given by
Kawaguchi [29].
7. Extend the results in Sections 3, 4, 5 to differential games. As in
Problem 5, consider first the zero-sum case. This case has been partly
studied by Ghosh and Rao [21]. For a recent reference, see Hochart [27].

References
[1] A. Arapostathis, V. S. Borkar, and M. K. Ghosh. Ergodic Control of
Diffusion Processes. Cambridge University Press, UK, 2012.

[2] M. Arisawa. Ergodic problem for the Hamilton-Jacobi-Bellman equation.
I. Existence of the ergodic attractor. Annales de l’Institut Henri Poincaré
C, Analyse non linéaire, 14(4):415–438, 1997.

[3] Z. Artstein. Stabilization with relaxed controls. Nonlinear Analysis:


Theory, Methods & Applications, 7(11):1163–1173, 1983.

[4] R. Bellman. The theory of dynamic programming. Bulletin of the Amer-


ican Mathematical Society, 60(6):503–515, 1954.

[5] R. Bellman. Dynamic Programming. Princeton University Press, New


Jersey, 1957.

[6] R. Bellman. A Markovian decision process. Journal of Mathematics and


Mechanics, 6(5):679–684, 1957.

[7] D. P. Bertsekas. Dynamic programming: deterministic and stochastic
models. Prentice-Hall, Englewood Cliffs, NJ, 1987.

[8] C. J. Bishop, E. A. Feinberg, and J. Zhang. Examples concerning Abel
and Cesàro limits. Journal of Mathematical Analysis and Applications,
420(2):1654–1661, 2014.
[9] D. Blackwell. Discounted dynamic programming. The Annals of Math-
ematical Statistics, 36(1):226–235, 1965.
[10] V. S. Borkar, V. Gaitsgory, and I. Shvartsman. LP formulations of
discrete time long-run average optimal control problems: the nonergodic
case. SIAM Journal on Control and Optimization, 57(3):1783–1817,
2019.
[11] W. A. Brock. On existence of weakly maximal programmes in a multi-
sector economy. The Review of Economic Studies, 37(2):275–280, 1970.
[12] W. A. Brock and L. J. Mirman. Optimal economic growth and uncer-
tainty: the discounted case. Journal of Economic Theory, 4:479–513,
1972.
[13] W. A. Brock and L. J. Mirman. Optimal economic growth and uncer-
tainty: the no discounting case. International Economic Review, 14:
560–573, 1973.
[14] C. I. Byrnes and W. Lin. Losslessness, feedback equivalence, and the
global stabilization of discrete-time nonlinear systems. IEEE Transac-
tions on Automatic Control, 39(1):83–98, 1994.
[15] O. L. V. Costa and F. Dufour. Average control of Markov decision
processes with Feller transition probabilities and general action spaces.
Journal of Mathematical Analysis and Applications, 396(1):58–69, 2012.
[16] E. B. Davies. One-Parameter Semigroups. Academic Press, London,
1980.
[17] A. Domínguez-Corella and O. Hernández-Lerma. The maximum princi-
ple for discrete-time control systems and applications to dynamic games.
Journal of Mathematical Analysis and Applications, 475(1):253–277,
2019.
[18] E. A. Feinberg, P. O. Kasyanov, and N. V. Zadoianchuk. Average cost
Markov decision processes with weakly continuous transition probabili-
ties. Mathematics of Operations Research, 37(4):591–607, 2012.

[19] J. Flynn. Steady state policies for a class of deterministic dynamic
programming models. SIAM Journal on Applied Mathematics, 28(1):
87–99, 1975.
[20] J. Flynn. Optimal steady states, excessive functions, and deterministic
dynamic programs. Journal of Mathematical Analysis and Applications,
144(2):586–594, 1989.
[21] M. K. Ghosh and K. S. M. Rao. Differential games with ergodic payoff.
SIAM Journal on Control and Optimization, 43(6):2020–2035, 2005.
[22] J. González-Hernández and O. Hernández-Lerma. Extreme points of
sets of randomized strategies in constrained optimization and control
problems. SIAM Journal on Control and Optimization, 15(4):1085–1104,
2005.
[23] L. Grüne. On the relation between discounted and average optimal value
functions. Journal of Differential Equations, 148(1):65–99, 1998.
[24] L. Grüne, M. A. Müller, C. M. Kellett, and S. R. Weller. Strict dissi-
pativity for discrete time discounted optimal control problems. Mathe-
matical Control & Related Fields, 11(4):771–796, 2021.
[25] O. Hernández-Lerma and J. B. Lasserre. Discrete-Time Markov Control
Processes: Basic Optimality Criteria. Springer, New York, 1996.
[26] O. Hernández-Lerma and J. B. Lasserre. Further Topics on Discrete-
Time Markov Control Processes. Springer, New York, 1999.
[27] A. Hochart. Unique ergodicity of deterministic zero-sum differential
games. Dynamic Games and Applications, 11(1):109–136, 2021.
[28] R. A. Howard. Dynamic Programming and Markov Processes. John
Wiley, Cambridge, Mass., 1960.
[29] K. Kawaguchi. Optimal control of pollution accumulation with long-
run average welfare. Environmental and Resource Economics, 26(3):
457–468, 2003.
[30] L. R. Laura-Guarachi. An optimal control problem in forest manage-
ment. In Games and Evolutionary Dynamics: Selected Theoretical and
Applied Developments, pages 189–209. El Colegio de México, 2021.

[31] L. R. Laura-Guarachi and O. Hernández-Lerma. The Mitra-Wan
forestry model: a discrete-time optimal control problem. Natural Re-
source Modeling, 28(2):152–168, 2015.
[32] C. Le Van and H. C. Saglam. Optimal growth models and the Lagrange
multiplier. Journal of Mathematical Economics, 40(3-4):393–410, 2004.
[33] T. Mitra and H. Y. Wan Jr. Some theoretical results on the economics
of forestry. The Review of Economic Studies, 52(2):263–282, 1985.
[34] T. Mitra and H. Y. Wan Jr. On the Faustmann solution to the for-
est management problem. Journal of Economic Theory, 40(2):229–249,
1986.
[35] M. A. Müller. Dissipativity in economic model predictive control: be-
yond steady-state optimality. In Recent Advances in Model Predictive
Control, pages 27–43. Springer, 2021.
[36] M. A. Müller, D. Angeli, and F. Allgöwer. On necessity and robustness
of dissipativity in economic model predictive control. IEEE Transactions
on Automatic Control, 60(6):1671–1676, 2015.
[37] P. A. Samuelson. A note on measurement of utility. The Review of
Economic Studies, 4(2):155–161, 1937.
[38] M. Schäl. Average optimality in dynamic programming with general
state space. Mathematics of Operations Research, 18(1):163–172, 1993.
[39] R. Sznajder and J. A. Filar. Some comments on a theorem of Hardy
and Littlewood. Journal of Optimization Theory and Applications, 75
(1):201–208, 1992.
[40] A. Y. Ulus. On discrete time infinite horizon optimal growth problem.
An International Journal of Optimization and Control: Theories & Ap-
plications (IJOCTA), 8(1):102–116, 2018.
[41] Ó. Vega-Amaya. On the vanishing discount factor approach for Markov
decision processes with weakly continuous transition probabilities. Jour-
nal of Mathematical Analysis and Applications, 426(2):978–985, 2015.
[42] J. C. Willems. Dissipative dynamical systems. Part I: General theory.
Archive for Rational Mechanics and Analysis, 45(5):321–351, 1972.

[43] K. Yosida. Functional Analysis, 6th Edition. Springer, Berlin, 1980.

[44] H. Yu. On the minimum pair approach for average cost Markov decision
processes with countable discrete action spaces and strictly unbounded
costs. SIAM Journal on Control and Optimization, 58(2):660–685, 2020.

