
Global Adaptive Dynamic Programming for


Continuous-Time Nonlinear Systems
Yu Jiang, Member, IEEE, Zhong-Ping Jiang, Fellow, IEEE

arXiv:1401.0020v4 [math.DS] 10 Jan 2017

Abstract—This paper presents a novel method of global adaptive dynamic programming (ADP) for the adaptive optimal control of nonlinear polynomial systems. The strategy consists of relaxing the problem of solving the Hamilton-Jacobi-Bellman (HJB) equation to an optimization problem, which is solved via a new policy iteration method. The proposed method is distinguished from previously known nonlinear ADP methods in that the neural network approximation is avoided, giving rise to significant computational improvement. Instead of being semiglobally or locally stabilizing, the resultant control policy is globally stabilizing for a general class of nonlinear polynomial systems. Furthermore, in the absence of a priori knowledge of the system dynamics, an online learning method is devised to implement the proposed policy iteration technique by generalizing the current ADP theory. Finally, three numerical examples are provided to validate the effectiveness of the proposed method.

Index Terms—Adaptive dynamic programming, nonlinear systems, optimal control, global stabilization.

This work has been supported in part by the National Science Foundation under Grants ECCS-1101401 and ECCS-1230040. The work was done when Y. Jiang was a PhD student in the Control and Networks Lab at New York University Polytechnic School of Engineering. Y. Jiang is with The MathWorks, 3 Apple Hill Dr, Natick, MA 01760. E-mail: yu.jiang@mathworks.com. Z. P. Jiang is with the Department of Electrical and Computer Engineering, Polytechnic School of Engineering, New York University, Five Metrotech Center, Brooklyn, NY 11201 USA. E-mail: zjiang@nyu.edu.

I. INTRODUCTION

Dynamic programming [4] offers a theoretical way to solve optimal control problems. However, it suffers from the inherent computational complexity, also known as the curse of dimensionality. Therefore, the need for approximative methods was recognized as early as the late 1950s [5]. Among these methods, adaptive/approximate dynamic programming (ADP) [6], [7], [38], [46], [58], [60] is a class of heuristic techniques that solves for the cost function by searching for a suitable approximation. In particular, adaptive dynamic programming [58], [59] employs ideas from reinforcement learning [49] to achieve online approximation of the cost function, without using knowledge of the system dynamics. ADP has been extensively studied for Markov decision processes (see, for example, [7] and [38]), as well as dynamic systems (see the review papers [29] and [55]). Stability issues in ADP-based control systems design are addressed in [2], [31], [54]. A robustification of ADP, known as Robust-ADP or RADP, has recently been developed by taking into account dynamic uncertainties [24].

To achieve online approximation of the cost function and the control policy, neural networks are widely used in previous ADP architectures. Although neural networks can be used as universal approximators [18], [36], there are at least two major limitations for ADP-based online implementations. First, in order to approximate unknown functions with high accuracy, a large number of basis functions comprising the neural network is usually required. Hence, it may incur a huge computational burden for the learning system. Besides, it is not trivial to specify the type of basis functions when the target function to be approximated is unknown. Second, neural network approximations generally are effective only on some compact set, but not in the entire state space. Therefore, the resultant control policy may not provide global asymptotic stability for the closed-loop system. In addition, the compact set on which the uncertain functions of interest are to be approximated has to be carefully quantified before one applies the online learning method, such that stability can be assured during the learning process [22].

The main purpose of this paper is to develop a novel ADP methodology to achieve global and adaptive suboptimal stabilization of uncertain continuous-time nonlinear polynomial systems via online learning. As the first contribution of this paper, an optimization problem, whose solutions can be easily parameterized, is proposed to relax the problem of solving the Hamilton-Jacobi-Bellman (HJB) equation. This approach is inspired by the relaxation method used in approximate dynamic programming for Markov decision processes (MDPs) with finite state space [13], and more generally for discrete-time systems [32], [56], [57], [43], [44], [48]. However, methods developed in these papers cannot be trivially extended to the continuous-time setting, or achieve global asymptotic stability of general nonlinear systems. The idea of relaxation was also used in nonlinear H∞ control, where Hamilton-Jacobi inequalities are used for nonadaptive systems [16], [51], [42].

The second contribution of the paper is to propose a relaxed policy iteration method. Each iteration step is formulated as a sum of squares (SOS) program [37], [10], which is equivalent to a semidefinite program (SDP). It is worth pointing out that, different from the inverse optimal control design [27], the proposed method directly finds a suboptimal solution to the original optimal control problem.

The third contribution is an online learning method that implements the proposed iterative scheme using only real-time online measurements, when perfect system knowledge is not available. This method can be regarded as a nonlinear variant of our recent work for continuous-time linear systems with completely unknown system dynamics [23]. It is distinguished from previously known nonlinear ADP methods in that the neural network approximation is avoided for computational benefits and that the resultant suboptimal control policy achieves global asymptotic stabilization, instead of only semi-global or local stabilization.
The remainder of this paper is organized as follows. Section 2 formulates the problem and introduces some basic results regarding nonlinear optimal control. Section 3 relaxes the problem of solving an HJB equation to an optimization problem. Section 4 develops a relaxed policy iteration technique for polynomial systems using sum of squares (SOS) programs [10]. Section 5 develops an online learning method for applying the proposed policy iteration when the system dynamics are not known exactly. Section 6 examines three numerical examples to validate the efficiency and effectiveness of the proposed method. Section 7 gives concluding remarks.

Notations: Throughout this paper, we use C¹ to denote the set of all continuously differentiable functions. P denotes the set of all functions in C¹ that are also positive definite and proper. R+ indicates the set of all non-negative real numbers. For any vector u ∈ Rᵐ and any symmetric matrix R ∈ Rᵐˣᵐ, we define |u|²_R as uᵀRu. A feedback control policy u is called globally stabilizing if, under this control policy, the closed-loop system is globally asymptotically stable at the origin [26]. Given a vector of polynomials f(x), deg(f) denotes the highest polynomial degree of all the entries in f(x). For any positive integers d1, d2 satisfying d2 ≥ d1, m̃_{d1,d2}(x) is the vector of all (n+d2 choose d2) − (n+d1−1 choose d1−1) distinct monic monomials in x ∈ Rⁿ with degree at least d1 and at most d2, arranged in lexicographic order [12]. Also, R[x]_{d1,d2} denotes the set of all polynomials in x ∈ Rⁿ with degree no less than d1 and no greater than d2. ∇V refers to the gradient of a function V : Rⁿ → R.
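To make this notation concrete, the following short Python sketch (ours, not part of the paper or of any toolbox) enumerates the monic monomials collected in m̃_{d1,d2}(x) for a given number of variables and checks that their count equals (n+d2 choose d2) − (n+d1−1 choose d1−1); the sort shown is one possible lexicographic convention.

```python
from itertools import product
from math import comb

def monomial_vector(n, d1, d2):
    """Exponent tuples of the monic monomials in n variables with
    total degree between d1 and d2, in one lexicographic ordering."""
    monos = [e for e in product(range(d2 + 1), repeat=n)
             if d1 <= sum(e) <= d2]
    monos.sort(reverse=True)   # order exponent vectors lexicographically
    return monos

if __name__ == "__main__":
    n, d1, d2 = 2, 2, 4
    m = monomial_vector(n, d1, d2)
    # The count matches C(n+d2, d2) - C(n+d1-1, d1-1).
    assert len(m) == comb(n + d2, d2) - comb(n + d1 - 1, d1 - 1)
    print(m)   # e.g. (2, 0) stands for x1^2, (1, 1) for x1*x2, ...
```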
II. PROBLEM FORMULATION AND PRELIMINARIES

A. Problem formulation

Consider the nonlinear system

ẋ = f(x) + g(x)u    (1)

where x ∈ Rⁿ is the system state, u ∈ Rᵐ is the control input, f : Rⁿ → Rⁿ and g : Rⁿ → Rⁿˣᵐ are polynomial mappings with f(0) = 0.

In conventional optimal control theory [30], the common objective is to find a control policy u that minimizes a certain performance index. In this paper, it is specified as follows:

J(x0, u) = ∫₀^∞ r(x(t), u(t))dt,  x(0) = x0    (2)

where r(x, u) = q(x) + uᵀR(x)u, with q(x) a positive definite polynomial function and R(x) a symmetric positive definite matrix of polynomials. Notice that the purpose of specifying r(x, u) in this form is to guarantee that an optimal control policy can be explicitly determined, if it exists.

Assumption 2.1: Consider system (1). There exist a function V0 ∈ P and a feedback control policy u1, such that

L(V0(x), u1(x)) ≥ 0, ∀x ∈ Rⁿ    (3)

where, for any V ∈ C¹ and u ∈ Rᵐ,

L(V, u) = −∇Vᵀ(x)(f(x) + g(x)u) − r(x, u).    (4)

Under Assumption 2.1, the closed-loop system composed of (1) and u = u1(x) is globally asymptotically stable at the origin, with a well-defined Lyapunov function V0. With this property, u1 is also known as an admissible control policy [3], implying that the cost J(x0, u1) is finite, ∀x0 ∈ Rⁿ. Indeed, integrating both sides of (3) along the trajectories of the closed-loop system composed of (1) and u = u1(x) on the interval [0, +∞), it is easy to show that

J(x0, u1) ≤ V0(x0), ∀x0 ∈ Rⁿ.    (5)

B. Optimality and stability

Here, we recall a basic result connecting optimality and global asymptotic stability in nonlinear systems [45]. To begin with, let us give the following assumption.

Assumption 2.2: There exists V° ∈ P, such that the Hamilton-Jacobi-Bellman (HJB) equation holds:

H(V°) = 0    (6)

where

H(V) = ∇Vᵀ(x)f(x) + q(x) − (1/4)∇Vᵀ(x)g(x)R⁻¹(x)gᵀ(x)∇V(x).

Under Assumption 2.2, it is easy to see that V° is a well-defined Lyapunov function for the closed-loop system comprised of (1) and

u°(x) = −(1/2)R⁻¹(x)gᵀ(x)∇V°(x).    (7)

Hence, this closed-loop system is globally asymptotically stable at x = 0 [26]. Then, according to [45, Theorem 3.19], u° is the optimal control policy, and the value function V°(x0) gives the optimal cost at the initial condition x(0) = x0, i.e.,

V°(x0) = min_u J(x0, u) = J(x0, u°), ∀x0 ∈ Rⁿ.    (8)

It can also be shown that V° is the unique solution to the HJB equation (6) with V° ∈ P. Indeed, let V̂ ∈ P be another solution to (6). Then, by Theorem 3.19 in [45], along the solutions of the closed-loop system composed of (1) and u = û = −(1/2)R⁻¹gᵀ∇V̂, it follows that

V̂(x0) = V°(x0) − ∫₀^∞ |u° − û|²_R dt, ∀x0 ∈ Rⁿ.    (9)

Finally, by comparing (8) and (9), we conclude that V° = V̂.

C. Conventional policy iteration

The above-mentioned result implies that, if there exists a class-P function which solves the HJB equation (6), an optimal control policy can be obtained. However, the nonlinear HJB equation (6) is very difficult to solve analytically in general. As a result, numerical methods have been developed to approximate the solution. In particular, the following policy iteration method is widely used [41].
1) Policy evaluation: For i = 1, 2, ..., solve for the cost function Vi(x) ∈ C¹, with Vi(0) = 0, from the following partial differential equation:

L(Vi(x), ui(x)) = 0.    (10)

2) Policy improvement: Update the control policy by

ui+1(x) = −(1/2)R⁻¹(x)gᵀ(x)∇Vi(x).    (11)
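As an illustration of steps (10)-(11), the sketch below specializes the conventional policy iteration to the linear-quadratic case ẋ = Ax + Bu with r(x, u) = xᵀQx + uᵀRu, where Vi(x) = xᵀPi x and the policy-evaluation equation (10) reduces to a Lyapunov equation (Kleinman's classical algorithm). The system data and the initial stabilizing gain below are illustrative choices of ours, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Conventional policy iteration (10)-(11) for the LQ special case.
A = np.array([[0.0, 1.0], [-1.0, 2.0]])   # example data (ours)
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[0.0, 4.0]])                # initial stabilizing gain, u1 = -K x

for i in range(20):
    Ac = A - B @ K
    # Policy evaluation: Ac' P + P Ac + Q + K' R K = 0     (cf. (10))
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement: u_{i+1} = -R^{-1} B' P x         (cf. (11), with grad V_i = 2 P x)
    K_new = np.linalg.solve(R, B.T @ P)
    if np.linalg.norm(K_new - K) < 1e-9:
        break
    K = K_new

print("P =", P)   # converges to the stabilizing solution of the Riccati equation
```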
The convergence property of the conventional policy iteration algorithm is given in the following theorem, which is an extension of [41, Theorem 4]. For the readers' convenience, its proof is given in Appendix B.

Theorem 2.3: Suppose Assumptions 2.1 and 2.2 hold, and the solution Vi(x) ∈ C¹ satisfying (10) exists, for i = 1, 2, .... Let Vi(x) and ui+1(x) be the functions generated from (10) and (11). Then, the following properties hold, ∀i = 0, 1, ....
1) V°(x) ≤ Vi+1(x) ≤ Vi(x), ∀x ∈ Rⁿ;
2) ui+1 is globally stabilizing;
3) suppose there exist V* ∈ C¹ and u*, such that ∀x0 ∈ Rⁿ we have lim_{i→∞} Vi(x0) = V*(x0) and lim_{i→∞} ui(x0) = u*(x0). Then, V* = V° and u* = u°.

Notice that finding the analytical solution to (10) is still non-trivial. Hence, in practice, the solution is approximated using, for example, neural networks or Galerkin's method [3]. Also, ADP-based online approximation methods can be applied to compute the cost functions numerically via online data [53], [22], when precise knowledge of f or g is not available. However, although approximation methods can give acceptable results on some compact set in the state space, they cannot be used to achieve global stabilization. In addition, in order to reduce the approximation error, huge computational complexity is almost inevitable.

III. SUBOPTIMAL CONTROL WITH RELAXED HJB EQUATION

In this section, we consider an auxiliary optimization problem, which allows us to obtain a suboptimal solution to the minimization problem (2) subject to (1). For simplicity, we will omit the arguments of functions whenever there is no confusion in the context.

Problem 3.1 (Relaxed optimal control problem):

min_V  ∫_Ω V(x)dx    (12)
s.t.   H(V) ≤ 0    (13)
       V ∈ P    (14)

where Ω ⊂ Rⁿ is an arbitrary compact set containing the origin as an interior point. As a subset of the state space, Ω describes the area in which the system performance is expected to be improved the most.

Remark 3.2: Notice that Problem 3.1 is called a relaxed problem of (6). Indeed, if we restrict this problem by replacing the inequality constraint (13) with the equality constraint (6), there will be only one feasible solution left and the objective function can thus be neglected. As a result, Problem 3.1 reduces to the problem of solving (6).

Some useful facts about Problem 3.1 are given as follows.

Theorem 3.3: Under Assumptions 2.1 and 2.2, the following hold.
1) Problem 3.1 has a nonempty feasible set.
2) Let V be a feasible solution to Problem 3.1. Then, the control policy

   ū(x) = −(1/2)R⁻¹gᵀ∇V    (15)

   is globally stabilizing.
3) For any x0 ∈ Rⁿ, an upper bound of the cost of the closed-loop system comprised of (1) and (15) is given by V(x0), i.e.,

   J(x0, ū) ≤ V(x0).    (16)

4) Along the trajectories of the closed-loop system (1) and (7), the following inequalities hold for any x0 ∈ Rⁿ:

   V(x0) + ∫₀^∞ H(V(x(t)))dt ≤ V°(x0) ≤ V(x0).    (17)

5) V° as defined in (8) is a global optimal solution to Problem 3.1.

Proof:
1) Define u0 = −(1/2)R⁻¹gᵀ∇V0. Then,

H(V0) = ∇V0ᵀ(f + gu0) + r(x, u0)
      = ∇V0ᵀ(f + gu1) + r(x, u1) + ∇V0ᵀg(u0 − u1) + |u0|²_R − |u1|²_R
      = ∇V0ᵀ(f + gu1) + r(x, u1) − 2|u0|²_R + 2u0ᵀRu1 + |u0|²_R − |u1|²_R
      = ∇V0ᵀ(f + gu1) + r(x, u1) − |u0 − u1|²_R
      ≤ 0.

Hence, V0 is a feasible solution to Problem 3.1.

2) To show global asymptotic stability, we only need to prove that V is a well-defined Lyapunov function for the closed-loop system composed of (1) and (15). Indeed, along the solutions of the closed-loop system, it follows that

V̇ = ∇Vᵀ(f + gū) = H(V) − r(x, ū) ≤ −q(x).

Therefore, the system is globally asymptotically stable at the origin [26].

3) Along the solutions of the closed-loop system comprised of (1) and (15), we have

V(x0) = −∫₀^T ∇Vᵀ(f + gū)dt + V(x(T))
      = ∫₀^T [r(x, ū) − H(V)]dt + V(x(T))
      ≥ ∫₀^T r(x, ū)dt + V(x(T)).    (18)

By 2), lim_{T→+∞} V(x(T)) = 0. Therefore, letting T → +∞, by (18) and (2), we have

V(x0) ≥ J(x0, ū).    (19)
4) By 3), we have

V(x0) ≥ J(x0, ū) ≥ min_u J(x0, u) = V°(x0).    (20)

Hence, the second inequality in (17) is proved. On the other hand,

H(V) = H(V) − H(V°)
     = (∇V − ∇V°)ᵀ(f + gū) + r(x, ū) − (∇V°)ᵀg(u° − ū) − r(x, u°)
     = (∇V − ∇V°)ᵀ(f + gu°) − |ū − u°|²_R
     ≤ (∇V − ∇V°)ᵀ(f + gu°).

Integrating the above inequality along the solutions of the closed-loop system (1) and (7) on the interval [0, +∞), we have

V(x0) − V°(x0) ≤ −∫₀^∞ H(V(x))dt.    (21)

5) By 3), for any feasible solution V to Problem 3.1, we have V°(x) ≤ V(x). Hence,

∫_Ω V°(x)dx ≤ ∫_Ω V(x)dx    (22)

which implies that V° is a global optimal solution. The proof is therefore complete.

Remark 3.4: A feasible solution V to Problem 3.1 may not necessarily be the true cost function associated with the control policy ū defined in (15). However, by Theorem 3.3, we see that V can be viewed as an upper bound or an overestimate of the actual cost, inspired by the concept of underestimator in [56]. Further, V serves as a Lyapunov function for the closed-loop system and can be more easily parameterized than the actual cost function. For simplicity, V is still called the cost function in the remainder of the paper.

IV. SOS-BASED POLICY ITERATION FOR POLYNOMIAL SYSTEMS

The inequality constraint (13) contained in Problem 3.1 provides us the freedom of specifying desired analytical forms of the cost function. However, solving (13) is non-trivial in general, even for polynomial systems (see, for example, [11], [14], [34], [47], [61]). Indeed, for any polynomial with degree no less than four, deciding its non-negativity is an NP-hard problem [37]. Thanks to the developments in sum of squares (SOS) programs [10], [37], the computational burden can be significantly reduced if inequality constraints can be restricted to SOS constraints. The purpose of this section is to develop a novel policy iteration method for polynomial systems using SOS-based methods [10], [37].

A. Polynomial parametrization

Let us first assume R(x) is a constant, real symmetric matrix. Notice that L(Vi, ui) is a polynomial in x, if ui is a polynomial control policy and Vi is a polynomial cost function. Then, the following implication holds:

L(Vi, ui) is SOS ⇒ L(Vi, ui) ≥ 0.    (23)

In addition, for computational simplicity, we would like to find some positive integer r, such that Vi ∈ R[x]_{2,2r}. Then, the new control policy ui+1 obtained from (27) is a vector of polynomials.

Based on the above discussion, the following assumption is given to replace Assumption 2.1.

Assumption 4.1: There exist smooth mappings V0 : Rⁿ → R and u1 : Rⁿ → Rᵐ, such that V0 ∈ R[x]_{2,2r} ∩ P and L(V0, u1) is SOS.

B. SOS-programming-based policy iteration

Now, we are ready to propose a relaxed policy iteration scheme. As in other policy-iteration-based iterative schemes, an initial globally stabilizing (and admissible) control policy has been assumed in Assumption 4.1.

1) Policy evaluation: For i = 1, 2, ..., solve for an optimal solution pi to the following optimization program:

   min_p  ∫_Ω V(x)dx    (24)
   s.t.   L(V, ui) is SOS    (25)
          Vi−1 − V is SOS    (26)

   where V = pᵀm̃_{2,2r}(x). Then, denote Vi = piᵀm̃_{2,2r}(x).

2) Policy improvement: Update the control policy by

   ui+1 = −(1/2)R⁻¹gᵀ∇Vi.    (27)

   Then, go to Step 1) with i replaced by i + 1.

A flowchart of the practical implementation is given in Figure 1, where ε > 0 is a predefined threshold to balance the exploration/exploitation trade-off. In practice, a larger ε may lead to shorter exploration time and therefore will allow the system to implement the control policy and terminate the exploration noise sooner. On the other hand, using a smaller ε allows the learning system to better improve the control policy but needs longer learning time.

[Fig. 1. Flowchart of the SOS-programming-based policy iteration: initialize (V0, u1) satisfying Assumption 4.1 and set i ← 1; solve Vi = piᵀm̃_{2,2r}(x) from (24)-(26); update the control policy to ui+1 by (27); repeat with i ← i + 1 until |pi − pi−1| ≤ ε, then apply u = ui as the control input.]
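To make the structure of one policy-evaluation step concrete, the following Python sketch poses (24)-(26) for the scalar example of Section VI-A (ẋ = ax² + bu with a = 0.01, b = 1, q(x) = 0.01x² + 0.01x⁴, R = 1, Ω = [−1, 1]), using cvxpy and an explicit Gram-matrix encoding of the SOS constraints. For brevity the value function is restricted to V = p1x² + p2x⁴ (even monomials only), which is a simplification of the parameterization used in the paper; all variable names are ours.

```python
import cvxpy as cp
import numpy as np

# One policy-evaluation step (24)-(26) for xdot = a*x^2 + b*u,
# q(x) = 0.01x^2 + 0.01x^4, R = 1, V(x) = p1*x^2 + p2*x^4, and the
# current policy u_i(x) = k1*x + k3*x^3 (the initial policy of Sec. VI-A).
a, b = 0.01, 1.0
k1, k3 = -0.1, -0.1

p = cp.Variable(2)                       # p = [p1, p2]
Q = cp.Variable((3, 3), symmetric=True)  # Gram matrix of L(V,u_i), basis [x, x^2, x^3]
W = cp.Variable((2, 2), symmetric=True)  # Gram matrix of V_{i-1} - V, basis [x, x^2]
p_prev = np.array([10.0, 10.0])          # V_0 = 10x^2 + 10x^4 from Sec. VI-A

# Coefficients of L(V,u_i) = -V'(x)(f + g*u_i) - q - u_i^2, degrees 2..6,
# with V'(x) = 2*p1*x + 4*p2*x^3 and f + g*u_i = b*k1*x + a*x^2 + b*k3*x^3.
c2 = -2 * p[0] * k1 * b - 0.01 - k1**2
c3 = -2 * p[0] * a
c4 = -(2 * p[0] * k3 + 4 * p[1] * k1) * b - 0.01 - 2 * k1 * k3
c5 = -4 * p[1] * a
c6 = -4 * p[1] * k3 * b - k3**2

constraints = [
    # L(V, u_i) = z'Qz with z = [x, x^2, x^3]   (constraint (25))
    Q[0, 0] == c2, 2 * Q[0, 1] == c3, 2 * Q[0, 2] + Q[1, 1] == c4,
    2 * Q[1, 2] == c5, Q[2, 2] == c6, Q >> 0,
    # V_{i-1} - V = w'Ww with w = [x, x^2]      (constraint (26))
    W[0, 0] == p_prev[0] - p[0], 2 * W[0, 1] == 0,
    W[1, 1] == p_prev[1] - p[1], W >> 0,
]

# Objective (24): the integral of V over [-1, 1] is c'p with c = [2/3, 2/5].
cost = cp.Minimize((2 / 3) * p[0] + (2 / 5) * p[1])
cp.Problem(cost, constraints).solve()
print("p_i =", p.value)
```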
Remark 4.2: The optimization problem (24)-(26) is a well-defined SOS program [10]. Indeed, the objective function (24) is linear in p, since for any V = pᵀm̃_{2,2r}(x) we have ∫_Ω V(x)dx = cᵀp, with c = ∫_Ω m̃_{2,2r}(x)dx.
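For instance, the entries of c can be obtained by symbolic integration of each monomial over Ω. The short sympy sketch below (ours) computes c for the one-dimensional case n = 1, r = 2 with Ω = [−1, 1], reproducing the vector c = [2/3, 0, 2/5]ᵀ used in the example of Section VI-A; for higher-dimensional box-shaped Ω the same idea applies with iterated integrals.

```python
import sympy as sp

x = sp.symbols('x')
r = 2
# m_{2,2r}(x) for n = 1: all monic monomials of degree 2, ..., 2r.
monomials = [x**k for k in range(2, 2 * r + 1)]
# c = entrywise integral of m_{2,2r}(x) over Omega = [-1, 1]  (cf. Remark 4.2)
c = [sp.integrate(m, (x, -1, 1)) for m in monomials]
print(c)   # [2/3, 0, 2/5]
```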
Theorem 4.3: Under Assumptions 2.2 and 4.1, the following are true, for i = 1, 2, ....
1) The SOS program (24)-(26) has a nonempty feasible set.
2) The closed-loop system comprised of (1) and u = ui is globally asymptotically stable at the origin.
3) Vi ∈ P. In particular, the following inequalities hold:

   V°(x0) ≤ Vi(x0) ≤ Vi−1(x0), ∀x0 ∈ Rⁿ.    (28)

4) There exists V*(x) satisfying V*(x) ∈ R[x]_{2,2r} ∩ P, such that, for any x0 ∈ Rⁿ, lim_{i→∞} Vi(x0) = V*(x0).
5) Along the solutions of the system comprised of (1) and (7), the following inequalities hold:

   0 ≤ V*(x0) − V°(x0) ≤ −∫₀^∞ H(V*(x(t)))dt.    (29)

Proof: 1) We prove by mathematical induction.
i) Suppose i = 1. Under Assumption 4.1, we know L(V0, u1) is SOS. Hence, V = V0 is a feasible solution to the problem (24)-(26).
ii) Let V = Vj−1 be an optimal solution to the problem (24)-(26) with i = j − 1 > 1. We show that V = Vj−1 is a feasible solution to the same problem with i = j. Indeed, by definition,

uj = −(1/2)R⁻¹gᵀ∇Vj−1,

and

L(Vj−1, uj) = −∇Vj−1ᵀ(f + guj) − r(x, uj)
            = L(Vj−1, uj−1) − ∇Vj−1ᵀg(uj − uj−1) + uj−1ᵀRuj−1 − ujᵀRuj
            = L(Vj−1, uj−1) + |uj − uj−1|²_R.

Under the induction assumption, we know Vj−1 ∈ R[x]_{2,2r} and L(Vj−1, uj−1) is SOS. Hence, L(Vj−1, uj) is SOS. As a result, Vj−1 is a feasible solution to the SOS program (24)-(26) with i = j.

2) Again, we prove by induction.
i) Suppose i = 1. Under Assumption 4.1, u1 is globally stabilizing. Also, we can show that V1 ∈ P. Indeed, for each x0 ∈ Rⁿ with x0 ≠ 0, we have

V1(x0) ≥ ∫₀^∞ r(x, u1)dt > 0.    (30)

By (30) and the constraint (26), under Assumption 2.2 it follows that

V° ≤ V1 ≤ V0.    (31)

Since both V° and V0 are assumed to belong to P, we conclude that V1 ∈ P.
ii) Suppose ui−1 is globally stabilizing and Vi−1 ∈ P for i > 1. Let us show that ui is globally stabilizing and Vi ∈ P. Indeed, along the solutions of the closed-loop system composed of (1) and u = ui, it follows that

V̇i−1 = ∇Vi−1ᵀ(f + gui) = −L(Vi−1, ui) − r(x, ui) ≤ −q(x).

Therefore, ui is globally stabilizing, since Vi−1 is a well-defined Lyapunov function for the system. In addition, we have

Vi(x0) ≥ ∫₀^∞ r(x, ui)dt > 0, ∀x0 ≠ 0.    (32)

Similarly as in (31), we can show

V°(x0) ≤ Vi(x0) ≤ Vi−1(x0), ∀x0 ∈ Rⁿ,    (33)

and conclude that Vi ∈ P.

3) The two inequalities have been proved in (33).

4) By 3), for each x ∈ Rⁿ, the sequence {Vi(x)}_{i=0}^∞ is monotonically decreasing with 0 as its lower bound. Therefore, the limit exists, i.e., there exists V*(x) such that lim_{i→∞} Vi(x) = V*(x). Let {pi}_{i=1}^∞ be the sequence such that Vi = piᵀm̃_{2,2r}(x). Then, we know lim_{i→∞} pi = p* ∈ R^{n2r}, and therefore V* = p*ᵀm̃_{2,2r}(x). Also, it is easy to show V° ≤ V* ≤ V0. Hence, V* ∈ R[x]_{2,2r} ∩ P.

5) Let

u* = −(1/2)R⁻¹gᵀ∇V*.    (34)

Then, by 4) we know

H(V*) = −L(V*, u*) ≤ 0,    (35)

which implies that V* is a feasible solution to Problem 3.1. The inequalities in 5) can thus be obtained by the fourth property in Theorem 3.3. The proof is thus complete.

Remark 4.4: Notice that Assumption 4.1 holds for any controllable linear system. For general nonlinear systems, such a pair (V0, u1) satisfying Assumption 4.1 may not always exist, and in this paper we focus on polynomial systems that satisfy Assumption 4.1. For this class of systems, both V0 and u1 have to be determined before executing the proposed algorithm. The search for (V0, u1) is not trivial, because it amounts to solving bilinear matrix inequalities (BMIs) [40]. However, this problem has been actively studied in recent years, and several applicable approaches have been developed. For example, a Lyapunov-based approach utilizing state-dependent linear matrix inequalities has been studied in [40]. This method has been generalized to uncertain nonlinear polynomial systems in [61] and [20]. In [19] and [28], the authors proposed a solution for a stochastic HJB; it is shown that this method gives a control Lyapunov function for the deterministic system.

Remark 4.5: Notice that the control policies considered in the proposed algorithm can be extended to polynomial fractions. Indeed, instead of requiring L(V0, u1) to be SOS in Assumption 4.1, let us assume

α²(x)L(V0, u1) is SOS    (36)
where α(x) > 0 is an arbitrary polynomial. Then, the initial control policy u1 can take the form u1 = α(x)⁻¹v(x), with v(x) a column vector of polynomials. Then, we can relax the constraint (25) to the following:

α²(x)L(V, ui) is SOS    (37)

and it is easy to see the SOS-based policy iteration algorithm can still proceed with this new constraint.

Remark 4.6: In addition to Remark 4.5, if R(x) is not constant, by definition there exists a symmetric matrix of polynomials R*(x), such that

R(x)R*(x) = det(R(x))Im    (38)

with det(R(x)) > 0. As a result, the policy improvement step (27) gives

ui+1 = −(1/(2det(R(x))))R*(x)gᵀ(x)∇Vi(x),    (39)

which is a vector of polynomial fractions. Now, if we select α(x) in (37) such that det(R(x)) divides α(x), we see that α²(x)L(V, ui) is a polynomial, and the proposed SOS-based policy iteration algorithm can still proceed with (25) replaced by (37). Following the same reasoning as in the proof of Theorem 4.3, it is easy to show that the solvability and the convergence properties of the proposed policy iteration algorithm will not be affected by the relaxed constraint (37).

V. GLOBAL ADAPTIVE DYNAMIC PROGRAMMING FOR UNCERTAIN POLYNOMIAL SYSTEMS

The proposed policy iteration method requires perfect knowledge of f and g. In practice, precise system knowledge may be difficult to obtain. Hence, in this section, we develop an online learning method based on the idea of ADP to implement the iterative scheme with real-time data, instead of first identifying the system dynamics. For simplicity, here again we assume R(x) is a constant matrix. The results can be straightforwardly extended to non-constant R(x), using the same idea as discussed in Remark 4.6.

To begin with, consider the system

ẋ = f + g(ui + e)    (40)

where ui is a feedback control policy and e is a bounded time-varying function, known as the exploration noise, added for the learning purpose.

A. Forward completeness

Due to the existence of the exploration noise, we need to first show that solutions of system (40) exist globally for positive time. For this purpose, we will show that the system (40) is forward complete [1].

Definition 5.1 ([1]): Consider system (40) with e as the input. The system is called forward complete if, for any initial condition x0 ∈ Rⁿ and every input signal e, the corresponding solution of system (40) is defined for all t ≥ 0.

Lemma 5.2: Consider system (40). Suppose ui is a globally stabilizing control policy and there exists Vi−1 ∈ P such that ∇Vi−1ᵀ(f + gui) + uiᵀRui ≤ 0. Then, the system (40) is forward complete.

Proof: Under Assumptions 2.2 and 4.1, by Theorem 4.3 we know Vi−1 ∈ P. Then, by completing the squares, it follows that

∇Vi−1ᵀ(f + gui + ge) ≤ −uiᵀRui − 2uiᵀRe
                      = −|ui + e|²_R + |e|²_R
                      ≤ |e|²_R
                      ≤ |e|²_R + Vi−1.

According to [1, Corollary 2.11], the system (40) is forward complete.

By Lemma 5.2 and Theorem 4.3, we immediately have the following proposition.

Proposition 5.3: Under Assumptions 2.2 and 4.1, let ui be a feedback control policy obtained at the i-th iteration step in the proposed policy iteration algorithm (24)-(27) and e be a bounded time-varying function. Then, the closed-loop system (1) with u = ui + e is forward complete.

B. Online implementation

Under Assumption 4.1, in the SOS-based policy iteration we have

L(Vi, ui) ∈ R[x]_{2,2d}, ∀i > 1,    (41)

if the integer d satisfies

d ≥ (1/2) max{deg(f) + 2r − 1, deg(g) + 2(2r − 1), deg(q), 2(2r − 1) + 2deg(g)}.

Also, ui obtained from the proposed policy iteration algorithm satisfies ui ∈ R[x]_{1,d}.

Hence, there exists a constant matrix Ki ∈ R^{m×nd}, with nd = (n+d choose d) − 1, such that ui = Ki m̃_{1,d}(x). Also, suppose there exists a constant vector p ∈ R^{n2r}, with n2r = (n+2r choose 2r) − n − 1, such that V = pᵀm̃_{2,2r}(x). Then, along the solutions of the system (40), it follows that

V̇ = ∇Vᵀ(f + gui) + ∇Vᵀge
  = −r(x, ui) − L(V, ui) + ∇Vᵀge
  = −r(x, ui) − L(V, ui) + (R⁻¹gᵀ∇V)ᵀRe.    (42)

Notice that the two terms L(V, ui) and R⁻¹gᵀ∇V rely on f and g. Here, we are interested in obtaining them without identifying f and g.

To this end, notice that for the same pair (V, ui) defined above, we can find a constant vector lp ∈ R^{n2d}, with n2d = (n+2d choose 2d) − n − 1, and a constant matrix Kp ∈ R^{m×nd}, such that

L(V, ui) = lpᵀm̃_{2,2d}(x)    (43)
−(1/2)R⁻¹gᵀ∇V = Kp m̃_{1,d}(x).    (44)

Therefore, calculating L(V, ui) and R⁻¹gᵀ∇V amounts to finding lp and Kp. Substituting (43) and (44) in (42), we have

V̇ = −r(x, ui) − lpᵀm̃_{2,2d}(x) − 2m̃ᵀ_{1,d}(x)KpᵀRe.    (45)
Now, integrating the terms in (45) over the interval [t, t + δt], we have

pᵀ[m̃_{2,2r}(x(t)) − m̃_{2,2r}(x(t + δt))] = ∫_t^{t+δt} [ r(x, ui) + lpᵀm̃_{2,2d}(x) + 2m̃ᵀ_{1,d}(x)KpᵀRe ] dt.    (46)

Eq. (46) implies that lp and Kp can be directly calculated by using real-time online data, without knowing the precise knowledge of f and g. To see how it works, let us define the following matrices: σe ∈ R^{n2d + m·nd}, Φi ∈ R^{qi × (n2d + m·nd)}, Ξi ∈ R^{qi}, and Θi ∈ R^{qi × n2r}, such that

σe = −[ m̃ᵀ_{2,2d}   2m̃ᵀ_{1,d} ⊗ eᵀR ]ᵀ,
Φi = [ ∫_{t0,i}^{t1,i} σe dt   ∫_{t1,i}^{t2,i} σe dt   ···   ∫_{t(qi−1),i}^{tqi,i} σe dt ]ᵀ,
Ξi = [ ∫_{t0,i}^{t1,i} r(x, ui)dt   ∫_{t1,i}^{t2,i} r(x, ui)dt   ···   ∫_{t(qi−1),i}^{tqi,i} r(x, ui)dt ]ᵀ,
Θi = [ m̃_{2,2r}|_{t0,i}^{t1,i}   m̃_{2,2r}|_{t1,i}^{t2,i}   ···   m̃_{2,2r}|_{t(qi−1),i}^{tqi,i} ]ᵀ.

Then, (46) implies

Φi [lpᵀ  vec(Kp)ᵀ]ᵀ = Ξi + Θi p.    (47)

Notice that any pair (lp, Kp) satisfying (47) will satisfy the constraint between lp and Kp as implicitly indicated in (45) and (46).

Assumption 5.4: For each i = 1, 2, ..., there exists an integer qi0 such that the following rank condition holds,

rank(Φi) = n2d + m·nd,    (48)

if qi ≥ qi0.

Remark 5.5: The rank condition (48) is in the spirit of persistency of excitation (PE) in adaptive control (e.g., [21], [50]) and is a necessary condition for parameter convergence.

Given p ∈ R^{n2r} and Ki ∈ R^{m×nd}, suppose Assumption 5.4 is satisfied and qi ≥ qi0 for all i = 1, 2, .... Then, it is easy to see that the values of lp and Kp can be uniquely determined from (46). Indeed,

[lpᵀ  vec(Kp)ᵀ]ᵀ = (ΦiᵀΦi)⁻¹Φiᵀ(Ξi + Θi p).    (49)

Now, the ADP-based online learning method is given below, and a flowchart is provided in Figure 2.

1) Initialization: Find the pair (V0, u1) that satisfies Assumption 4.1. Let p0 be the constant vector such that V0 = p0ᵀm̃_{2,2r}(x), and let i = 1.

2) Collect online data: Apply u = ui + e to the system and compute the data matrices Φi, Ξi, and Θi, until Φi is of full column rank.

3) Policy evaluation and improvement: Find an optimal solution (pi, Ki+1) to the following SOS program:

   min_{p,Kp}  cᵀp    (50)
   s.t.  Φi [lpᵀ  vec(Kp)ᵀ]ᵀ = Ξi + Θi p    (51)
         lpᵀm̃_{2,2d}(x) is SOS    (52)
         (pi−1 − p)ᵀm̃_{2,2r}(x) is SOS    (53)

   where c = ∫_Ω m̃_{2,2r}(x)dx. Then, denote Vi = piᵀm̃_{2,2r}(x), ui+1 = Ki+1 m̃_{1,d}(x), and go to Step 2) with i ← i + 1.

Theorem 5.6: Under Assumptions 2.1, 4.1 and 5.4, the following properties hold.
1) The optimization problem (50)-(53) has a nonempty feasible set.
2) The sequences {Vi}_{i=1}^∞ and {ui}_{i=1}^∞ satisfy properties 2)-5) in Theorem 4.3.

Proof: Given pi ∈ R^{n2r}, there exists a constant matrix Kpi ∈ R^{m×nd} such that (pi, Kpi) is a feasible solution to the optimization problem (50)-(53) if and only if pi is a feasible solution to the SOS program (24)-(26). Therefore, by Theorem 4.3, 1) holds. In addition, since the two optimization problems share the identical objective function, we know that if (pi, Kpi) is an optimal solution to the optimization problem (50)-(53), then pi is also an optimal solution to the SOS program (24)-(26). Hence, the theorem can be obtained from Theorem 4.3.

[Fig. 2. Flowchart of the ADP-based online control method: initialize (V0, u1) satisfying Assumption 4.1 and set i ← 1; apply ui as the control input and compute the matrices Φi, Ξi, Θi until Assumption 5.4 is satisfied; obtain pi and Ki+1 by solving (50)-(53) and let Vi = piᵀm̃_{2,2r}(x), ui+1 = Ki+1 m̃_{1,d}(x); repeat with i ← i + 1 until |pi − pi−1| ≤ ε, then apply u = ui as the control input.]
Remark 5.7: By Assumption 4.1, u1 must be a globally stabilizing control policy. Indeed, if it is only locally stabilizing, there are two reasons why the algorithm may not proceed. First, with a locally stabilizing control law, there is no way to guarantee the forward completeness of solutions, or to avoid finite escape, during the learning phase. Second, the region of attraction associated with the new control policy is not guaranteed to be global. A counterexample is ẋ = θx³ + u with unknown positive θ, for which the choice of a locally stabilizing controller u1 = −x does not lead to a globally stabilizing suboptimal controller.

VI. APPLICATIONS

A. A scalar nonlinear polynomial system

Consider the following polynomial system

ẋ = ax² + bu    (54)

where x ∈ R is the system state, u ∈ R is the control input, and a and b, satisfying a ∈ [0, 0.05] and b ∈ [0.5, 1], are uncertain constants. The cost to be minimized is defined as J(x0, u) = ∫₀^∞ (0.01x² + 0.01x⁴ + u²)dt. An initial stabilizing control policy can be selected as u1 = −0.1x − 0.1x³, which globally asymptotically stabilizes system (54) for any a and b in the given ranges. Further, it is easy to see that V0 = 10(x² + x⁴) and u1 satisfy Assumption 4.1 with r = 2. In addition, we set d = 3.

Only for the purpose of simulation, we set a = 0.01, b = 1, and x(0) = 2. Ω is specified as Ω = {x | x ∈ R and |x| ≤ 1}, so the constant vector c in the objective function (50) is computed as c = [2/3  0  2/5]ᵀ. The proposed global ADP method is applied with the control policy updated every five seconds, and convergence is attained after five iterations, when |pi − pi−1| ≤ 10⁻³. The exploration noise is set to e = 0.01(sin(10t) + sin(3t) + sin(100t)), which is turned off after the fourth iteration.

The simulated state trajectory is shown in Figure 3, where the control policy is updated every five seconds until convergence is attained. The suboptimal control policy and the cost function obtained after four iterations are V* = 0.1020x² + 0.007x³ + 0.0210x⁴ and u* = −0.2039x − 0.02x² − 0.0829x³. For comparison purposes, the exact optimal cost and control policy are given below:

V° = x³/150 + (101x² + 100)^{3/2}/15150 − 20/303,
u° = −(x² + x√(101x² + 100))/100.

Figures 4 and 5 show the comparison of the suboptimal control policy with respect to the exact optimal control policy and the initial control policy.

[Fig. 3. System state trajectory with the GADP-based controller and with the initial controller; the policy is updated at each iteration (1st-4th iterations marked).]

[Fig. 4. Comparison of the value functions: the initial cost V0, the improved cost V4, and the optimal cost V°.]

[Fig. 5. Comparison of the control policies: the initial policy u1, the improved policy u4, and the optimal policy u°.]
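The post-learning comparison illustrated in Figures 3-5 can be checked approximately with the short simulation below, which integrates (54) under the initial policy u1 and under the learned policy u* reported above and accumulates the cost (2). It uses the parameter values of this subsection; the exact figures depend on the learning transient, so this is only an indicative comparison, and all names are ours.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Closed-loop comparison for xdot = a*x^2 + b*u, a = 0.01, b = 1, x(0) = 2,
# with running cost r(x,u) = 0.01x^2 + 0.01x^4 + u^2 (cf. Section VI-A).
a, b = 0.01, 1.0
u_init = lambda x: -0.1 * x - 0.1 * x**3                      # initial policy u1
u_adp  = lambda x: -0.2039 * x - 0.02 * x**2 - 0.0829 * x**3  # learned policy u*

def cost_and_traj(u, x0=2.0, T=50.0):
    # Augmented state [x, J]: integrate the dynamics and the running cost.
    def rhs(t, s):
        x = s[0]
        return [a * x**2 + b * u(x), 0.01 * x**2 + 0.01 * x**4 + u(x) ** 2]
    sol = solve_ivp(rhs, (0.0, T), [x0, 0.0], rtol=1e-8, atol=1e-10)
    return sol.y[0, -1], sol.y[1, -1]

for name, u in [("initial u1", u_init), ("learned u*", u_adp)]:
    x_final, J = cost_and_traj(u)
    print(f"{name}: x(50) = {x_final:.4f}, accumulated cost = {J:.4f}")
```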
B. Fault-tolerant control

Consider the following example [61]:

[ẋ1; ẋ2] = [−x1³ − x1x2² + x1x2;  x1 + 2x2] + [0  β1; β2  β1]u    (55)

where β1, β2 ∈ [0.5, 1] are uncertain parameters. This system can be considered as having a loss-of-effectiveness fault [61], since the actuator gains are smaller than their commanded (nominal) values (β1 = β2 = 1).

Using SOS-related techniques, it has been shown in [61] that the following robust control policy can globally asymptotically stabilize the system (55) at the origin:

u1 = [u1,1; u1,2] = [10.283x1 − 13.769x2; −10.7x1 − 3.805x2].    (56)

However, optimality of the closed-loop system has not been well addressed.

Our control objective is to improve the control policy using the proposed ADP method such that we can reduce the following cost

J(x0, u) = ∫₀^∞ (x1² + x2² + uᵀu)dt    (57)

and at the same time guarantee global asymptotic stability.
In particular, we are interested in improving the performance of the closed-loop system in the set Ω = {x | x ∈ R² and |x| ≤ 1}.

By solving the following feasibility problem using SOSTOOLS [35], [39]

V ∈ R[x]_{2,4}    (58)
L(V, u1) is SOS, ∀β1, β2 ∈ [0.5, 1]    (59)

we have obtained a polynomial function as follows:

V0 = 17.6626x1² − 18.2644x1x2 + 16.4498x2² − 0.1542x1³ + 1.7303x1²x2 − 1.0845x1x2² + 0.1267x2³ + 3.4848x1⁴ − 0.8361x1³x2 + 4.8967x1²x2² + 2.3539x2⁴

which, together with (56), satisfies Assumption 4.1.

Only for simulation purposes, we set β1 = 0.7 and β2 = 0.6. The initial condition is arbitrarily set as x1(0) = 1 and x2(0) = −2. The proposed online learning scheme is applied to update the control policy every four seconds, seven times. The exploration noise is the sum of sinusoidal waves with different frequencies, and it is turned off after the last iteration. The suboptimal and globally stabilizing control policy is u8 = [u8,1, u8,2]ᵀ with

u8,1 = −0.0004x1³ − 0.0067x1²x2 − 0.0747x1² + 0.0111x1x2² + 0.0469x1x2 − 0.2613x1 − 0.0377x2³ − 0.0575x2² − 2.698x2
u8,2 = −0.0005x1³ − 0.009x1²x2 − 0.101x1² + 0.0052x1x2² − 0.1197x1x2 − 1.346x1 − 0.0396x2³ − 0.0397x2² − 3.452x2.

The associated cost function is as follows:

V8 = 1.4878x1² + 0.8709x1x2 + 4.4963x2² + 0.0131x1³ + 0.2491x1²x2 − 0.0782x1x2² + 0.0639x2³ + 0.0012x1³x2 + 0.0111x1²x2² − 0.0123x1x2³ + 0.0314x2⁴.

In Figure 6, we show the system state trajectories during the learning phase and the post-learning phase. At t = 30, we inject an impulse disturbance through the input channel to deviate the state from the origin. Then, we compare the system response under the proposed control policy and the initial control policy given in [61]. The suboptimal cost function and the original cost function are compared in Figure 7.

[Fig. 6. System trajectories: x1 and x2 with the GADP-based controller and with the initial controller (1st and final iterations marked); the disturbance is injected at t = 30.]

[Fig. 7. Comparison of the cost functions (surface plots labeled V0(x1, x2, 0) and V7(x1, x2, 0) in the original figure).]

C. An active suspension system

Consider the quarter-car suspension model as described by the following set of differential equations [15]:

ẋ1 = x2    (60)
ẋ2 = −[ks(x1 − x3) + kn(x1 − x3)³ + bs(x2 − x4) − u]/mb    (61)
ẋ3 = x4    (62)
ẋ4 = [ks(x1 − x3) + kn(x1 − x3)³ + bs(x2 − x4) − kt x3 − u]/mw    (63)

where x1, x2, and mb denote respectively the position, velocity, and mass of the car body; x3, x4, and mw represent respectively the position, velocity, and mass of the wheel assembly; and kt, ks, kn, and bs are the tyre stiffness, the linear suspension stiffness, the nonlinear suspension stiffness, and the damping rate of the suspension.

It is assumed that mb ∈ [250, 350], mw ∈ [55, 65], bs ∈ [900, 1100], ks ∈ [15000, 17000], kn = ks/10, and kt ∈ [180000, 200000]. Then, it is easy to see that, without any control input, the system is globally asymptotically stable at the origin. Here we would like to use the proposed online learning algorithm to design an active suspension control system which reduces the following performance index

J(x0, u) = ∫₀^∞ ( Σ_{i=1}^{4} xi² + u² ) dt    (64)

and at the same time maintains global asymptotic stability.
In particular, we would like to improve the system performance in the set Ω = {x | x ∈ R⁴ and |x1| ≤ 0.05, |x2| ≤ 10, |x3| ≤ 0.05, |x4| ≤ 10}.

To begin with, we first use SOSTOOLS [35], [39] to obtain an initial cost function V0 ∈ R[x]_{2,4} for the polynomial system (60)-(63) with uncertain parameters, whose ranges are given above. This is similar to the SOS feasibility problem described in (58)-(59).

Then, we apply the proposed online learning method with u1 = 0. The initial condition is arbitrarily selected. From t = 0 to t = 120, we apply bounded exploration noise as inputs for learning purposes, until convergence is attained after 10 iterations.

The suboptimal and globally stabilizing control policy we obtain is

u10 = −3.53x1³ − 1.95x1²x2 + 10.1x1²x3 + 1.11x1²x4 − 4.61×10⁻⁸x1² − 0.416x1x2² + 3.82x1x2x3 + 0.483x1x2x4 + 4.43×10⁻⁸x1x2 − 10.4x1x3² − 2.55x1x3x4 − 2.91×10⁻⁸x1x3 − 0.174x1x4² − 2.81×10⁻⁸x1x4 − 49.2x1 − 0.0325x2³ + 0.446x2²x3 + 0.06x2²x4 + 1.16×10⁻⁸x2² − 1.9x2x3² − 0.533x2x3x4 − 6.74×10⁻⁸x2x3 − 0.0436x2x4² − 2.17×10⁻⁸x2x4 − 17.4x2 + 3.96x3³ + 1.5x3²x4 + 7.74×10⁻⁸x3² + 0.241x3x4² + 2.62×10⁻⁸x3x4 + 146.0x3 + 0.0135x4³ + 1.16×10⁻⁸x4² + 12.5x4.

At t = 120, an impulse disturbance is simulated such that the state is deviated from the origin. Then, we compare the post-learning performance of the closed-loop system using the suboptimal control policy with the performance of the original system with no control input (see Figure 8).

To save space, we do not show the explicit forms of V0 and V10, since each of them is the summation of 65 different monomials. In Figure 9, we compare these two cost functions by restricting x3 ≡ x4 ≡ 0. It can be seen that V10 has been significantly reduced from V0.

[Fig. 8. Post-learning performance: the four states x1-x4 in response to the disturbance at t = 120, with the GADP-based controller and with no controller.]

[Fig. 9. Comparison of the cost functions V0(x1, x2, 0, 0) and V10(x1, x2, 0, 0).]

VII. CONCLUSIONS

This paper has proposed, for the first time, a global ADP method for the data-driven (adaptive) optimal control of nonlinear polynomial systems. In particular, a new policy iteration scheme has been developed. Different from conventional policy iteration, the new iterative technique does not attempt to solve a partial differential equation but a convex optimization problem at each iteration step. It has been shown that this method can find a suboptimal solution to continuous-time nonlinear optimal control problems [30]. In addition, the resultant control policy is globally stabilizing. Also, the method can be viewed as a computational strategy to directly solve Hamilton-Jacobi inequalities, which are used in H∞ control problems [16], [51].

When the system parameters are unknown, conventional ADP methods utilize neural networks to approximate the optimal solution online, and a large number of basis functions are required to assure high approximation accuracy on some compact sets. Thus, neural-network-based ADP schemes may result in slow convergence and loss of global asymptotic stability for the closed-loop system. Here, the proposed global ADP method has overcome the two above-mentioned shortcomings, and it yields computational benefits.
It is under our current investigation to extend the proposed methodology to more general (deterministic or stochastic) nonlinear systems [8], [19], [28], as well as to systems with parametric and dynamic uncertainties [25], [22], [24], [9].

ACKNOWLEDGEMENT

The first author would like to thank Dr. Yebin Wang, De Meng, Niao He, and Zhiyuan Weng for many helpful discussions on semidefinite programming, in the summer of 2013.

APPENDIX A
SUM-OF-SQUARES (SOS) PROGRAMS

An SOS program is a convex optimization problem of the following form.

Problem A.1 (SOS programming [10]):

min_y  bᵀy    (65)
s.t.   pi(x; y) are SOS, i = 1, 2, ..., k0    (66)

where pi(x; y) = ai0(x) + Σ_{j=1}^{n0} aij(x)yj, and aij(x) are given polynomials in R[x]_{0,2d}.

In [10, p. 74], it has been pointed out that SOS programs are in fact equivalent to semidefinite programs (SDPs) [52], [10]. SDP is a broad generalization of linear programming. It concerns the optimization of a linear function subject to linear matrix inequality constraints, and is of great theoretical and practical interest [10]. The conversion from an SOS program to an SDP can be performed either manually, or automatically using, for example, the MATLAB toolboxes SOSTOOLS [35], [39], YALMIP [33], and Gloptipoly [17].

APPENDIX B
PROOF OF THEOREM 2.3

Before proving Theorem 2.3, let us first give the following lemma.

Lemma B.1: Consider the conventional policy iteration algorithm described in (10) and (11). Suppose ui(x) is a globally stabilizing control policy and there exists Vi(x) ∈ C¹ with Vi(0) = 0, such that (10) holds. Let ui+1 be defined as in (11). Then, under Assumption 2.2, the following are true:
1) Vi(x) ≥ V°(x);
2) for any Vi−1 ∈ P such that L(Vi−1, ui) ≥ 0, we have Vi ≤ Vi−1;
3) L(Vi, ui+1) ≥ 0.

Proof: 1) Under Assumption 2.2, we have

0 = L(V°, u°) − L(Vi, ui)
  = (∇Vi − ∇V°)ᵀ(f + gui) + r(x, ui) − (∇V°)ᵀg(u° − ui) − r(x, u°)
  = (∇Vi − ∇V°)ᵀ(f + gui) + |ui − u°|²_R.

Therefore, for any x0 ∈ Rⁿ, along the trajectories of system (1) with u = ui and x(0) = x0, we have

Vi(x0) − V°(x0) = ∫₀^T |ui − u°|²_R dt + Vi(x(T)) − V°(x(T)).    (67)

Since ui is globally stabilizing, we know lim_{T→+∞} Vi(x(T)) = 0 and lim_{T→+∞} V°(x(T)) = 0. Hence, letting T → +∞, from (67) it follows that Vi(x0) ≥ V°(x0). Since x0 is arbitrarily selected, we have Vi(x) ≥ V°(x), ∀x ∈ Rⁿ.

2) Let qi(x) be a positive semidefinite function such that

L(Vi−1(x), ui(x)) = qi(x), ∀x ∈ Rⁿ.    (68)

Therefore,

(∇Vi−1 − ∇Vi)ᵀ(f + gui) + qi(x) = 0.    (69)

Similarly as in 1), along the trajectories of system (1) with u = ui and x(0) = x0, we can show

Vi−1(x0) − Vi(x0) = ∫₀^∞ qi(x)dt.    (70)

Hence, Vi(x) ≤ Vi−1(x), ∀x ∈ Rⁿ.

3) By definition,

L(Vi, ui+1) = −∇Viᵀ(f + gui+1) − q − |ui+1|²_R
            = −∇Viᵀ(f + gui) − q − |ui|²_R − ∇Viᵀg(ui+1 − ui) − |ui+1|²_R + |ui|²_R
            = 2ui+1ᵀR(ui+1 − ui) − |ui+1|²_R + |ui|²_R
            = −2ui+1ᵀRui + |ui+1|²_R + |ui|²_R
            ≥ 0.    (71)

The proof is complete.

Proof of Theorem 2.3:

We first prove 1) and 2) by induction. To be more specific, we will show that 1) and 2) are true and Vi ∈ P, for all i = 0, 1, ....

i) If i = 1, by Assumption 2.1 and Lemma B.1 1), we immediately know 1) and 2) hold. In addition, by Assumptions 2.1 and 2.2, we have V° ∈ P and V0 ∈ P. Therefore, Vi ∈ P.

ii) Suppose 1) and 2) hold for i = j > 1, and Vj ∈ P. We show 1) and 2) also hold for i = j + 1, and Vj+1 ∈ P.

Indeed, since V° ∈ P and Vj ∈ P, by the induction assumption we know Vj+1 ∈ P.

Next, by Lemma B.1 3), we have L(Vj+1, uj+2) ≥ 0. As a result, along the solutions of system (1) with u = uj+2, we have

V̇j+1(x) ≤ −q(x).    (72)

Notice that, since Vj+1 ∈ P, it is a well-defined Lyapunov function for the closed-loop system (1) with u = uj+2. Therefore, uj+2 is globally stabilizing, i.e., 2) holds for i = j + 1.

Then, by Lemma B.1 2), we have Vj+2 ≤ Vj+1. Together with the induction assumption, it follows that V° ≤ Vj+2 ≤ Vj+1. Hence, 1) holds for i = j + 1.

Now, let us prove 3). If such a pair (V*, u*) exists, we immediately know u* = −(1/2)R⁻¹gᵀ∇V*. Hence,

H(V*) = −L(V*, u*) = 0.    (73)
Also, since V° ≤ V* ≤ V0, V* ∈ P. However, as discussed in Section II-B, the solution to the HJB equation (6) must be unique. Hence, V* = V° and u* = u°. The proof is complete.

REFERENCES

[1] D. Angeli and E. D. Sontag, "Forward completeness, unboundedness observability, and their Lyapunov characterizations," Systems & Control Letters, vol. 38, no. 4, pp. 209-217, 1999.
[2] S. N. Balakrishnan, J. Ding, and F. L. Lewis, "Issues on stability of ADP feedback controllers for dynamical systems," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 38, no. 4, pp. 913-917, 2008.
[3] R. W. Beard, G. N. Saridis, and J. T. Wen, "Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation," Automatica, vol. 33, no. 12, pp. 2159-2177, 1997.
[4] R. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.
[5] R. Bellman and S. Dreyfus, "Functional approximations and dynamic programming," Mathematical Tables and Other Aids to Computation, vol. 13, no. 68, pp. 247-251, 1959.
[6] D. P. Bertsekas, Dynamic Programming and Optimal Control, 4th ed. Belmont, MA: Athena Scientific, 2007.
[7] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Nashua, NH: Athena Scientific, 1996.
[8] T. Bian, Y. Jiang, and Z. P. Jiang, "Adaptive dynamic programming and optimal control of nonlinear nonaffine systems," Automatica, 2014, available online, DOI: 10.1016/j.automatica.2014.08.023.
[9] ——, "Decentralized and adaptive optimal control of large-scale systems with application to power systems," IEEE Transactions on Industrial Electronics, 2014, available online, DOI: 10.1109/TIE.2014.2345343.
[10] G. Blekherman, P. A. Parrilo, and R. R. Thomas, Eds., Semidefinite Optimization and Convex Algebraic Geometry. Philadelphia, PA: SIAM, 2013.
[11] Z. Chen and J. Huang, "Global robust stabilization of cascaded polynomial systems," Systems & Control Letters, vol. 47, no. 5, pp. 445-453, 2002.
[12] D. Cox, J. Little, and D. O'Shea, Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra, 3rd ed. New York, NY: Springer, 2007.
[13] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Operations Research, vol. 51, no. 6, pp. 850-865, 2003.
[14] G. Franze, D. Famularo, and A. Casavola, "Constrained nonlinear polynomial time-delay systems: A sum-of-squares approach to estimate the domain of attraction," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2673-2679, 2012.
[15] P. Gaspar, I. Szaszi, and J. Bokor, "Active suspension design using linear parameter varying control," International Journal of Vehicle Autonomous Systems, vol. 1, no. 2, pp. 206-221, 2003.
[16] J. W. Helton and M. R. James, Extending H∞ Control to Nonlinear Systems: Control of Nonlinear Systems to Achieve Performance Objectives. Philadelphia, PA: SIAM, 1999.
[17] D. Henrion and J.-B. Lasserre, "GloptiPoly: Global optimization over polynomials with Matlab and SeDuMi," ACM Transactions on Mathematical Software, vol. 29, no. 2, pp. 165-194, 2003.
[18] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
[19] M. B. Horowitz and J. W. Burdick, "Semidefinite relaxations for stochastic optimal control policies," in Proceedings of the 2014 American Control Conference, June 2014, pp. 3006-3012.
[20] W.-C. Huang, H.-F. Sun, and J.-P. Zeng, "Robust control synthesis of polynomial nonlinear systems using sum of squares technique," Acta Automatica Sinica, vol. 39, no. 6, pp. 799-805, 2013.
[21] P. A. Ioannou and J. Sun, Robust Adaptive Control. Upper Saddle River, NJ: Prentice-Hall, 1996.
[22] Y. Jiang and Z. P. Jiang, "Robust adaptive dynamic programming and feedback stabilization of nonlinear systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 882-893, 2014.
[23] ——, "Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics," Automatica, vol. 48, no. 10, pp. 2699-2704, 2012.
[24] Z. P. Jiang and Y. Jiang, "Robust adaptive dynamic programming for linear and nonlinear systems: An overview," European Journal of Control, vol. 19, no. 5, pp. 417-425, 2013.
[25] Z. P. Jiang and L. Praly, "Design of robust adaptive controllers for nonlinear systems with dynamic uncertainties," Automatica, vol. 34, no. 7, pp. 825-840, 1998.
[26] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2002.
[27] M. Krstic and Z.-H. Li, "Inverse optimal design of input-to-state stabilizing nonlinear controllers," IEEE Transactions on Automatic Control, vol. 43, no. 3, pp. 336-350, 1998.
[28] Y. P. Leong, M. B. Horowitz, and J. W. Burdick, "Optimal controller synthesis for nonlinear dynamical systems," arXiv preprint arXiv:1410.0405, 2014.
[29] F. L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits and Systems Magazine, vol. 9, no. 3, pp. 32-50, 2009.
[30] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control, 3rd ed. New York: Wiley, 2012.
[31] F. L. Lewis and D. Liu, Eds., Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Hoboken, NJ: Wiley, 2013.
[32] B. Lincoln and A. Rantzer, "Relaxing dynamic programming," IEEE Transactions on Automatic Control, vol. 51, no. 8, pp. 1249-1260, 2006.
[33] J. Lofberg, "YALMIP: A toolbox for modeling and optimization in MATLAB," in Proceedings of the 2004 IEEE International Symposium on Computer Aided Control Systems Design, 2004, pp. 284-289.
[34] E. Moulay and W. Perruquetti, "Stabilization of nonaffine systems: A constructive method for polynomial systems," IEEE Transactions on Automatic Control, vol. 50, no. 4, pp. 520-526, 2005.
[35] A. Papachristodoulou, J. Anderson, G. Valmorbida, S. Prajna, P. Seiler, and P. A. Parrilo, "SOSTOOLS: Sum of squares optimization toolbox for MATLAB," 2013. [Online]. Available: http://arxiv.org/abs/1310.4716
[36] J. Park and I. W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol. 3, no. 2, pp. 246-257, 1991.
[37] P. A. Parrilo, "Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization," Ph.D. dissertation, California Institute of Technology, Pasadena, CA, 2000.
[38] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. New York: John Wiley & Sons, 2007.
[39] S. Prajna, A. Papachristodoulou, P. Seiler, and P. A. Parrilo, "SOSTOOLS and its control applications," in Positive Polynomials in Control. Springer, 2005, pp. 273-292.
[40] S. Prajna, A. Papachristodoulou, and F. Wu, "Nonlinear control synthesis by sum of squares optimization: A Lyapunov-based approach," in Proceedings of the Asian Control Conference, 2004, pp. 157-165.
[41] G. N. Saridis and C.-S. G. Lee, "An approximation theory of optimal control for trainable manipulators," IEEE Transactions on Systems, Man and Cybernetics, vol. 9, no. 3, pp. 152-159, 1979.
[42] M. Sassano and A. Astolfi, "Dynamic approximate solutions of the HJ inequality and of the HJB equation for input-affine nonlinear systems," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2490-2503, 2012.
[43] C. Savorgnan, J. B. Lasserre, and M. Diehl, "Discrete-time stochastic optimal control via occupation measures and moment relaxations," in Proceedings of the Joint 48th IEEE Conference on Decision and Control and the 28th Chinese Control Conference, Shanghai, P. R. China, 2009, pp. 519-524.
[44] P. J. Schweitzer and A. Seidmann, "Generalized polynomial approximations in Markovian decision processes," Journal of Mathematical Analysis and Applications, vol. 110, no. 2, pp. 568-582, 1985.
[45] R. Sepulchre, M. Jankovic, and P. Kokotovic, Constructive Nonlinear Control. New York: Springer-Verlag, 1997.
[46] J. Si, A. G. Barto, W. B. Powell, and D. C. Wunsch, Eds., Handbook of Learning and Approximate Dynamic Programming. Hoboken, NJ: Wiley, 2004.
[47] E. D. Sontag, "On the observability of polynomial systems, I: Finite-time problems," SIAM Journal on Control and Optimization, vol. 17, no. 1, pp. 139-151, 1979.
[48] T. H. Summers, K. Kunz, N. Kariotoglou, M. Kamgarpour, S. Summers, and J. Lygeros, "Approximate dynamic programming via sum of squares programming," arXiv preprint arXiv:1212.1269, 2012.
[49] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[50] G. Tao, Adaptive Control Design and Analysis. Hoboken, NJ: Wiley, 2003.
[51] A. J. van der Schaft, L2-Gain and Passivity Techniques in Nonlinear Control. Berlin: Springer, 1999.
[52] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Review, vol. 38, no. 1, pp. 49-95, 1996.
[53] D. Vrabie and F. L. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237-246, 2009.
[54] D. Vrabie, K. G. Vamvoudakis, and F. L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. London, UK: The Institution of Engineering and Technology, 2013.
[55] F.-Y. Wang, H. Zhang, and D. Liu, "Adaptive dynamic programming: An introduction," IEEE Computational Intelligence Magazine, vol. 4, no. 2, pp. 39-47, 2009.
[56] Y. Wang and S. Boyd, "Approximate dynamic programming via iterated Bellman inequalities," manuscript preprint, 2010.
[57] ——, "Performance bounds and suboptimal policies for linear stochastic control via LMIs," International Journal of Robust and Nonlinear Control, vol. 21, no. 14, pp. 1710-1728, 2011.
[58] P. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Harvard University, Cambridge, MA, 1974.
[59] ——, "Advanced forecasting methods for global crisis warning and models of intelligence," General Systems Yearbook, vol. 22, pp. 25-38, 1977.
[60] ——, "Reinforcement learning and approximate dynamic programming (RLADP) - Foundations, common misconceptions and the challenges ahead," in Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. L. Lewis and D. Liu, Eds. Hoboken, NJ: Wiley, 2013, pp. 3-30.
[61] J. Xu, L. Xie, and Y. Wang, "Simultaneous stabilization and robust control of polynomial nonlinear systems using SOS techniques," IEEE Transactions on Automatic Control, vol. 54, no. 8, pp. 1892-1897, 2009.