Sayak Mukherjee, He Bai, Aranya Chakrabortty
Automatica
journal homepage: www.elsevier.com/locate/automatica
Brief paper
article info

Article history:
Received 8 October 2019
Received in revised form 10 November 2020
Accepted 11 December 2020
Available online xxxx

Keywords:
Reinforcement learning
Linear quadratic regulator
Singular perturbation
Model-free control
Model reduction

abstract

We present a set of model-free, reduced-dimensional reinforcement learning (RL) based optimal control designs for linear time-invariant singularly perturbed (SP) systems. We first present a state feedback and an output feedback based RL control design for a generic SP system with unknown state and input matrices. We take advantage of the underlying time-scale separation property of the plant to learn a linear quadratic regulator (LQR) for only its slow dynamics, thereby saving a significant amount of learning time compared to the conventional full-dimensional RL controller. We analyze the sub-optimality of the designs using SP approximation theorems, and provide sufficient conditions for closed-loop stability. Thereafter, we extend both designs to clustered multi-agent consensus networks, where the SP property reflects through clustering. We develop both centralized and cluster-wise block-decentralized RL controllers for such networks, in reduced dimensions. We demonstrate the details of the implementation of these controllers using simulations of relevant numerical examples, and compare them with conventional RL designs to show the computational benefits of our approach.

© 2020 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.automatica.2020.109451
0005-1098/© 2020 Elsevier Ltd. All rights reserved.
S. Mukherjee, H. Bai and A. Chakrabortty Automatica 126 (2021) 109451
separation arises due to clustering of the network nodes. Along with a centralized design, a variant is proposed that imposes a block-diagonal structure on the RL controller to facilitate its implementation. Numerical results show that our approach saves a significant amount of learning time compared to the conventional RL while still maintaining a modest closed-loop performance. All the designs are described by implementable algorithms together with theoretical guarantees.

The first design has been presented as a preliminary result in our recent conference paper Mukherjee et al. (2018). The second design, however, is completely new. The multi-agent RL controllers, which were presented only for scalar dynamics in Mukherjee et al. (2018, 2019), are now extended to vector-dimensional states. Moreover, unlike prior results, the consensus model here is more generic as we allow each node to have self dynamics. The simulation examples presented in Section 5 are much larger-dimensional than in Mukherjee et al. (2018) to demonstrate the numerical benefits of the designs.

Notations: RH∞ is the set of all proper, real and rational stable transfer matrices; ⊗ denotes the Kronecker product; 1n denotes a column vector of size n with all ones; ∪ denotes the union operation of sets; blkdiag(m1, . . . , mn) denotes a block-diagonal matrix with m1, . . . , mn as its block diagonal elements; |M| denotes the cardinality of the set M; ∥.∥ denotes the Euclidean norm of a vector and the Frobenius norm of a matrix unless mentioned otherwise.

2. Problem formulation

Consider a linear time-invariant (LTI) system

ẋ = Ax + Bu,  x(0) = x0,  q = Cx, (1)

where x ∈ Rⁿ is the state, u ∈ Rᵐ is the control input, and q ∈ Rᵖ is the output. We assume that the matrices A and B are unknown, although n, m and p are known. The following assumption is made.

Assumption 1. The system with state–space model (1) exhibits a singular perturbation property, i.e., there exist a small parameter 1 ≫ ϵ > 0 and a similarity transform T such that by defining y ∈ Rʳ and z ∈ Rⁿ⁻ʳ as

[y; z] = T x = [T; G] x, (2)

the state–space model (1) can be rewritten as

ẏ = A11 y + A12 z + B1 u,  y(0) = T x0 = y0, (3a)
ϵ ż = A21 y + A22 z + B2 u,  z(0) = G x0 = z0, (3b)
q = C T⁻¹ [y; z]. (3c)

In the transformed model (3), y(t) represents the slow states and z(t) represents the fast states. Since A and B are unknown, the matrices A11, A12, A21, A22, B1 and B2 are unknown as well.

s.t. A − BKT ∈ RH∞. (6)

We assume (A, B) to be stabilizable. We consider y(t) to be directly measurable, or x(t) to be measurable (i.e., C = I) and T to be known so that y(t) can be computed at all times t. This is not a restrictive assumption as in many SP systems the identity of the slow and fast states is often known a priori (Khalil & Kokotovic, 1978) even if the model is unknown. In some cases, for example in the multi-agent model that will be shown later in the paper, offline knowledge of certain structural properties of the system can enable the designer to construct T using measurements of x(t), even when A and B are unknown. The benefit of using y(t) as the feedback variable is that one has to learn only an (m × r) matrix instead of an (m × n) matrix if full state feedback x(t) were used. This will improve the learning time, especially if r ≪ n. Before proceeding with the control design, we make the following assumption.

Assumption 2. A22 in Eq. (3b) is Hurwitz.

This assumption means that the fast dynamics of (3) are stable, which allows us to skip feeding back z(t) in (4).

2.2. Problem statement for output feedback RL

P2. Considering that q(t) is measured and C is known, but A and B are both unknown in (1), estimate the states ŷ(t), ẑ(t) (or, equivalently, estimate x̂(t) and compute ŷ(t) = T x̂(t) assuming that T is known), and learn a controller K ∈ Rᵐˣʳ using online measurements of q(t) and u(t) such that u = −K ŷ = −KT x̂ minimizes J(y(0); u) = ∫₀^∞ (yᵀQy + uᵀRu) dt.

We assume (A, B) to be stabilizable, and (A, C) to be detectable. Our approach would be to estimate the slow states ŷ(t) without knowing (A, B) using an observer employing a neural structure that does not require exact information of the state dynamics, and then to use u(t) and ŷ(t) to learn the controller K using adaptive dynamic programming.

We present the solutions for P1 and P2 with associated stability proofs in the following two respective sections.

3. Reduced-dimensional state feedback RL

Following Khalil (2002), the reduced slow subsystem of (3) can be defined by substituting ϵ = 0, resulting in

ẏs = As ys + Bs us,  ys(0) = y(0),  u = us + uf, (7)

where As = A11 − A12 A22⁻¹ A21 and Bs = B1 − A12 A22⁻¹ B2. Since our intent is to only use the slow variable for feedback, we substitute the fast control input uf = 0, and the slow control input us = u. If the controller were to use ys(t) for feedback, then it would find u = −K̄ ys(t) to solve:

minimize J̄(ys(0); u) = ∫₀^∞ (ysᵀ Q ys + uᵀ R u) dt, (8)
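For a numerical illustration of the reduction in (7), the following NumPy sketch (ours, not from the paper; the specific matrices are hypothetical) forms As and Bs from the partitioned model and computes the ideal slow gain K̄ of (8) by Kleinman's model-based policy iteration (Kleinman, 1968) — the same iteration that the model-free design later reproduces from data alone.

```python
import numpy as np

def slow_subsystem(A11, A12, A21, A22, B1, B2):
    """Quasi-steady-state reduction of (3): set eps = 0 and eliminate z."""
    As = A11 - A12 @ np.linalg.solve(A22, A21)   # As = A11 - A12 A22^{-1} A21
    Bs = B1 - A12 @ np.linalg.solve(A22, B2)     # Bs = B1 - A12 A22^{-1} B2
    return As, Bs

def kleinman_lqr(As, Bs, Q, R, K0, iters=50):
    """Model-based policy iteration for the slow LQR (8).
    K0 must stabilize As - Bs K0; returns the cost matrix P and gain K."""
    n, K = As.shape[0], K0
    for _ in range(iters):
        Ak = As - Bs @ K
        # Policy evaluation: solve Ak' P + P Ak = -(Q + K' R K) by vectorization.
        lhs = np.kron(np.eye(n), Ak.T) + np.kron(Ak.T, np.eye(n))
        P = np.linalg.solve(lhs, -(Q + K.T @ R @ K).reshape(-1)).reshape(n, n)
        # Policy improvement: K <- R^{-1} Bs' P.
        K = np.linalg.solve(R, Bs.T @ P)
    return P, K
```

With Q ≻ 0, R ≻ 0 and a stabilizing K0, the iterates converge to the slow algebraic Riccati solution; the model-free algorithm of the next section replaces the policy-evaluation step with a least-squares problem built purely from measured data.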
Algorithm 1 SP-RL using slow dynamics
Input: Measurements of y(t) and u0(t)
Step 1 - Data storage: Store data (i.e., y(t) and u0(t)) for sufficiently large uniformly sampled time instants (t1, t2, · · · , tl), and construct the following matrices:

δyy = [ y ⊗ y |_{t1}^{t1+T}, · · · , y ⊗ y |_{tl}^{tl+T} ]ᵀ, (11)
Iyy = [ ∫_{t1}^{t1+T} (y ⊗ y) dτ, · · · , ∫_{tl}^{tl+T} (y ⊗ y) dτ ]ᵀ, (12)
Iyu0 = [ ∫_{t1}^{t1+T} (y ⊗ u0) dτ, · · · , ∫_{tl}^{tl+T} (y ⊗ u0) dτ ]ᵀ. (13)

Theorem 2. Assume that ∥ys(t)∥ and ∥u0(t)∥ are bounded for a finite time t ∈ [0, t1]. The solutions of Algorithm 1 are given by P = P̄ + O(ϵ), K = K̄ + O(ϵ), and J = J̄ + O(ϵ).

Proof. See Theorems 2 and 3 in Mukherjee et al. (2018). □

Theorem 2 shows that the controller obtained from Algorithm 1 is O(ϵ) close to that obtained from the ideal design (where ϵ = 0), which also holds for the respective costs. Moreover, it is shown in Chow and Kokotovic (1976) that the reduced-order control cost is O(ϵ)-perturbed from the full-order control cost. This holds for the model-free case as well. As Algorithm 1 is constructed first by learning the ideal slow sub-system, and then by replacing ys(t) with y(t) for the implementation, we can quantify the sub-optimality of this approximation. The next theorem provides a sufficient condition that is required to achieve asymptotic stability for the (k + 1)th iteration of Algorithm 1, assuming that the control policy at the kth iteration stabilizes (3).

Theorem 3. Assume that the control policy u = −Kk y at the kth iteration asymptotically stabilizes (3). Consider R ≻ 0 and Q ≻ 0 with λmin(Q) sufficiently large. Then the control policy at the (k+1)th iteration given by u = −Kk+1 y is asymptotically stabilizing for (3). □

4.1. Sub-optimality and stability analysis

Lemma 1. Define e(t) = x(t) − x̂(t). If e is uniformly ultimately bounded (UUB) with a bound b for all t ≥ t0 + T for some initial time t0, then there exist positive constants ϵ∗ and k̄ such that for all 0 < ϵ ≤ ϵ∗,

∥ŷ(t) − ys(t)∥ ≤ k̄|ϵ| + b := c(ϵ, b) (15)

holds uniformly for t ∈ [t2, t1].

Proof. Since e(t) is UUB, there exist positive constants b and b̂, independent of t0 ≥ 0, such that for every a ∈ (0, b̂), there exists T1 = T1(a, b), independent of t0, such that ∥ŷ(t0) − y(t0)∥ ≤ a,
∥ŷ(t) − y(t)∥ ≤ b,  ∀t ≥ t0 + T1 := t2. (16)

From Theorem 1, it follows that there exist positive constants k̄ and p such that

∥y(t) − ys(t)∥ ≤ k̄|ϵ|,  ∀t ∈ [t0, t1], t1 > t2, ∀|ϵ| < p. (17)

Combining (16) and (17), for t ∈ [t2, t1] we have

∥ŷ(t) − ys(t)∥ ≤ k̄|ϵ| + b := c(ϵ, b). (18)

This completes the proof. □

Corollary 1. If e(t) = O(ϵ) for t ∈ [t2, t1], then ŷ(t) = ys(t) + O(ϵ).

Proof. The proof directly follows from Lemma 1. □

We know that if ys(t) were available for feedback then P̄, K̄ would be the optimal solutions. However, due to the state estimation error bound b and the singular perturbation error O(ϵ), the actual solutions are given as P = P̄ + ∆P, K = K̄ + ∆K, where ∆P and ∆K are matrix perturbations resulting from non-ideal feedback.

Proposition 1. Perturbations ∆P, ∆K are bounded, i.e., there exist two positive constants ρ, ρ1, dependent on b and ϵ, such that ∥∆P∥ ≤ ρ, ∥∆K∥ ≤ ρ1. Moreover, if e(t) = O(ϵ) for t ∈ [t2, t1], then we will recover P = P̄ + O(ϵ), K = K̄ + O(ϵ).

Proof. Please see Mukherjee et al. (2020). □

If e(t) can be made sufficiently small by proper tuning of the observer gain, then we would recover the design characteristics of Algorithm 1. To this end, we present the following stability result.

Theorem 4. Assume that the control policy u = −Kk ŷ is asymptotically stabilizing for the kth iteration. Then, there exist sufficiently small b∗, and 0 < ϵ∗ ≪ 1 such that for b ≤ b∗, 0 < ϵ ≤ ϵ∗, with Q ≻ 0, R ≻ 0, u = −Kk+1 ŷ will asymptotically stabilize (3) at the (k + 1)th iteration.

Proof. Please see Appendix A. □

As shown in Appendix A, the estimation error enters the closed-loop system as an exogenous disturbance. Since Kk+1 is stabilizing, the states converge to a neighborhood of the origin for sufficiently small b∗ and ϵ∗. Note that the designer does not need the explicit knowledge of ϵ∗. Assuming the existence of a small enough ϵ is sufficient. From the physical laws of the dynamics, the states in Eq. (3b) are inherently sufficiently faster than the states in Eq. (3a), ensuring stability. The design takes advantage of this structure to construct a sample-efficient reduced-dimensional learning control methodology.

Remark 2. The convergence of the observer and the RL iterations are handled sequentially. The observer gathers a sufficient amount of data samples to meet the rank condition rank(Θ̂k) = r(r + 1)/2 + rm, after which the control gain is computed iteratively. Θ̂k has the same structure as Θk but with y(t) replaced by ŷ(t). The designer may start gathering data samples after a few initial time-steps over which the observer has converged close to its steady-state. The observer is designed to achieve fast convergence, as discussed next. The state estimation error that may be present in the observer output has been taken into consideration in the sub-optimality and the stability analysis, as discussed in Proposition 1 and Theorem 4.

A candidate observer to estimate ŷ(t) without knowing (A, B) is the neuro-adaptive observer proposed in Abdollahi et al. (2006) that employs a neural network (NN) structure to account for the lack of dynamic model information. This observer guarantees boundedness of e(t), which, with proper tuning, can also be made arbitrarily small. Recalling the mechanism of this observer, we rewrite (1) as

ẋ = Âx + g(x, u),  g(x, u) = (Ax − Âx) + Bu,  q = Cx, (19)

where Â is a Hurwitz matrix, and (C, Â) is observable. We do not have proper knowledge of g(x, u), but a NN with a sufficiently large number of neurons can approximate g(x, u) as g(x, u) = W σ(V x̄) + η(x). Here, x̄ = [x, u], while σ(·) and η(x) are the activation function and the bounded NN approximation error, respectively. W and V are the ideal fixed NN weights. We choose G such that Ac = Â − GC is Hurwitz. The observer dynamics follow as

x̂̇ = Â x̂ + g(x̂, u) + G(q − C x̂),  g(x̂, u) = Ŵ σ(V̂ x̄̂),  q̂ = C x̂, (20)

where Ŵ, V̂ are the NN weights when driven by x̂, and are updated based on the modified back-propagation (BP) algorithm. The observer (20) requires the knowledge of C. Accordingly, we define the output error as q̃ = q − C x̂. The objective is to minimize J = ½ q̃ᵀ q̃. Following Abdollahi et al. (2006), the update law follows from gradient descent as:

Ŵ̇ = −η1 (q̃ᵀ C Ac⁻¹)ᵀ (σ(V̂ x̄̂))ᵀ − ρ1 ∥q̃∥ Ŵ,
V̂̇ = −η2 (q̃ᵀ C Ac⁻¹ Ŵ (I − Λ(V̂ x̄̂)))ᵀ (sgn(x̄̂))ᵀ − ρ2 ∥q̃∥ V̂, (21)

where the first terms of the two updates are the gradient terms −η1 (∂J/∂Ŵ) and −η2 (∂J/∂V̂), η1, η2 > 0 are learning rates and ρ1, ρ2 are small positive numbers, Ac = Â − GC, σ(·) is the activation function, and x̄̂ = [x̂, u]. Considering k neurons, we have Λ(V̂ x̄̂) = diag(σi²(V̂i x̄̂)), i = 1, 2, . . . , k, where σi(V̂i x̄̂) is the ith element of σ(V̂ x̄̂), and sgn(·) is the sign function. The update law (21) depends on the knowledge of C. This observer guarantees the following boundedness property.

Theorem 5 (Abdollahi et al., 2006, Theorem 1). With the update law described as (21), the state estimation error x̃ = x − x̂ and weight estimation errors W̃ = W − Ŵ, Ṽ = V − V̂ are uniformly ultimately bounded (UUB).

The size of the estimation error bound can be made arbitrarily small by properly selecting the parameters and learning rates, as shown in Abdollahi et al. (2006). For example, with higher learning rates the convergence rate can be increased, but these parameters need to be properly tuned to avoid overshoot; selecting Â to have fast eigenvalues will also keep the state estimation error small.

5. Applying to clustered multi-agent networks

We next describe how SP-based RL designs can be applied for the control of clustered multi-agent consensus networks, e.g., power systems, robotic swarms, and biological networks. The LTI model of these networks can be brought into the standard SP form (3) by exploiting the time-scale separation in its dynamics arising from the clustering of nodes.
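To make the mechanics of the weight update (21) concrete before specializing to networks, here is a minimal NumPy sketch of one Euler-discretized update step (our illustration only — sigmoid activation, step size dt, and all dimensions are assumptions; Abdollahi et al. (2006) give the actual tuning and analysis).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def observer_weight_step(W, V, x_hat, u, q, C, Ac, eta1=0.1, eta2=0.1,
                         rho1=1e-3, rho2=1e-3, dt=0.01):
    """One Euler step of the gradient-descent update (21) for the NN weights.
    W: (n, k), V: (k, n+m) for k hidden neurons; Ac = A_hat - G C must be
    Hurwitz (hence invertible)."""
    x_bar = np.concatenate([x_hat, u])          # xbar_hat = [x_hat, u]
    q_tilde = q - C @ x_hat                     # output error q~ = q - C x_hat
    s = sigmoid(V @ x_bar)                      # sigma(V xbar_hat), shape (k,)
    Lam = np.diag(s ** 2)                       # Lambda = diag(sigma_i^2)
    g = q_tilde @ C @ np.linalg.inv(Ac)         # row vector q~' C Ac^{-1}
    dW = -eta1 * np.outer(g, s) - rho1 * np.linalg.norm(q_tilde) * W
    dV = (-eta2 * np.outer(g @ W @ (np.eye(len(s)) - Lam), np.sign(x_bar))
          - rho2 * np.linalg.norm(q_tilde) * V)
    return W + dt * dW, V + dt * dV
```

The e-modification terms −ρi∥q̃∥(·) keep the weights bounded even when the output error does not vanish, which is what underlies the UUB guarantee of Theorem 5.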
5.1. SP representation of clustered networks

Consider a network of n agents, where the dynamics of the ith agent is given by

ẋi = F xi + Σ_{j∈Ni} aij (xj − xi) + bi ui, (22)

where xi ∈ Rˢ is the state, ui ∈ Rᵖ is the input, and Ni denotes the set of agents that are connected to agent i, for i = 1, . . . , n. The graph between agents is assumed to be connected and time-invariant. The constants aij = aji > 0 denote the coupling strengths of the interaction between agents i and j, and vice versa. The matrix F ∈ Rˢˣˢ models the self-feedback of each node. The overall network model is written as

ẋ = Ax + Bu,  x(0) = x0, (23)

where x ∈ Rⁿˢ is the vector of all agent states, u ∈ Rⁿᵖ is the control input, B = diag(b1, . . . , bn), A = In ⊗ F + L ⊗ Is, with L ∈ Rⁿˣⁿ being the weighted network Laplacian matrix satisfying L 1n = 0.

Assumption 3. F is marginally stable.

Let the agents be divided into r non-empty, non-overlapping, distinct groups I1, . . . , Ir such that agents inside each group are strongly connected while the groups themselves are weakly connected, i.e., aij ≫ apq for any two agents i and j inside a group and any other two agents p and q in two different groups. This type of clustering has been shown to induce a two-time-scale behavior in the network dynamics of (22) (Chow & Kokotovic, 1985). Fig. 1a shows an example of such a clustered dynamic network. The clustered nature of the network helps decompose L as L = LI + ϵ LE, where LI (a block-diagonal matrix) and LE (a sparse matrix) represent the internal and the external connections, and ϵ is the singular perturbation parameter arising from the worst-case ratio of the coupling weights inside a cluster to that between the clusters. The slow and fast variables are defined as

[y; z] = [T; G] x,  x = (U G†) [y; z], (24)

where T = T1 ⊗ Is, G = G1 ⊗ Is. Here, T1 = Na⁻¹ Uᵀ ∈ Rʳˣⁿ, Na = diag(n1, n2, . . . , nr), where ni denotes the number of agents in group i, U = diag(U1, U2, . . . , Ur), Uα = 1nα. The matrix G1, following from Chow and Kokotovic (1985), is not required for our design as we are only interested in constructing the slow states. We can see that T is simply an averaging operation. Applying this transformation to (23), and redefining the time-scale as ts = ϵt, the following SP form is obtained:

dy/dts = A11 y + A12 z + B1 u, (25a)
ϵ dz/dts = A21 y + A22 z + B2 u, (25b)

A11 = T(LE ⊗ Is)U + (Ir ⊗ F)/ϵ,  A12 = T(LE ⊗ Is)G†,
A21 = G(LE ⊗ Is)U,  A22 = G(LI ⊗ Is)G† + (In−r ⊗ F) + ϵ G(LE ⊗ Is)G†,
B1 = TB/ϵ,  B2 = GB.

The detailed derivation is shown in Mukherjee et al. (2020). All six matrices are assumed to be unknown. Following Assumption 2, we assume that A22 is Hurwitz.

5.2. Projection of control to agents

One important distinction between controlling the multi-agent system (25) and a generic SP system (3) is that the control input u for the former has a physical meaning in terms of each agent. Therefore, u(t), although designed in lower dimension, must be actuated in its actual dimension. One way to design u(t) is to use u = M ũ, where ũ ∈ Rʳᵖ is the actual control signal learned using ADP, and the matrix M is a projection matrix of the form M = blkdiag(M1, . . . , Mr), Mi = M̄i ⊗ Is, M̄i = 1|Ii|, which projects the reduced-dimensional controller to the full-dimensional plant. M is constructed by the designer with the assumption that the designer knows the cluster identity of each agent. We assume (A, BM) to be stabilizable.

6. Block-decentralized multi-agent RL

The controllers learned in Sections 3 and 4 need to be computed in a centralized way. In this section we show that for the clustered consensus model (25) the clustered nature of the system can also aid in learning a cluster-wise decentralized RL controller. Fig. 1 describes the centralized and block-decentralized architectures.

6.1. Cluster-wise representation

Let the states of the agents in cluster α be denoted as (xα1, xα2, . . . , xαnα) ∈ Rⁿᵅˢ. Following Chow and Kokotovic (1985), the transformation matrix T in (24) is an averaging operation on
the states of the agents inside a cluster, which implies that the slow variable for the cluster α is

yα = (1/nα)(xα1 + xα2 + · · · + xαnα),  α = 1, . . . , r, (26)
y = [y1; y2; . . . ; yr]. (27)

For the cluster-wise decentralized design, the starting point is to consider the scenario where all clusters are decoupled from each other. We denote the states in cluster α in that scenario as xαd1, xαd2, . . . , xαdnα ∈ Rⁿᵅˢ, and the concatenated state vector considering all the clusters is denoted as xd. For this decoupled scenario, yαd and yd are similarly defined following (26) and (27). Then we will have

yα(t) = yαd(t) + O(ϵ),  ∀t ∈ [0, t1]. (32)

Algorithm 2 Cluster-wise Decentralized ADP
For area α = 1, 2, . . . , r
Step 1: Construct matrices δyαyα, Iyαyα, Iyαuα0 having similar structures as δyy, Iyy, Iyu0 but with y(t) replaced by yα(t).
Step 2: Starting with a stabilizing K0α, solve for Kαk+1 iteratively (k = 0, 1, . . .) once the matrices δyαyα, Iyαyα, Iyαuα0 are constructed; the iterative equation can be written for each small learning step as

Θαk [vec(Pαk); vec(Kαk+1)] = Φαk, with
Θαk = [ δyαyα,  −2Iyαyα(Is ⊗ (Kαk)ᵀRα) − 2Iyαuα0(Is ⊗ Rα) ],  Φαk = −Iyαyα vec(Qαk). (34)

The stopping criterion for this update is ∥Pαk − Pαk−1∥ < γ1, where γ1 is a chosen threshold.

Proof. The proof is given in Appendix B.
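The averaging matrix T of (24)/(26) and the projection M of Section 5.2 are the only pieces the designer constructs by hand, and both need only the cluster memberships. A NumPy sketch (ours; the cluster sizes used below are hypothetical):

```python
import numpy as np

def averaging_and_projection(sizes, s):
    """Build the slow-averaging map T = T1 (x) I_s of (24)/(26) and the
    projection M = blkdiag(1_{|I_i|} (x) I_s) of Section 5.2, given cluster
    sizes [n_1, ..., n_r] and per-agent state dimension s."""
    U = np.zeros((sum(sizes), len(sizes)))
    lo = 0
    for i, ni in enumerate(sizes):
        U[lo:lo + ni, i] = 1.0                  # U_alpha = 1_{n_alpha}
        lo += ni
    Na_inv = np.diag([1.0 / ni for ni in sizes])
    T1 = Na_inv @ U.T                           # each row of T1 averages one cluster
    T = np.kron(T1, np.eye(s))                  # slow variable: y = T x
    M = np.kron(U, np.eye(s))                   # broadcast each cluster input to its agents
    return T, M
```

By these definitions T1 U = Na⁻¹ Uᵀ U = Ir, so T M is an identity: broadcasting a cluster-level signal to the agents and averaging back recovers it, which is what lets the reduced-dimensional controller act consistently on the full plant.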
Fig. 3. Comparison of slow state 1 with ϵ = 0.01, 0.001 and reduced slow subsystem for standard SP system.
Fig. 5. Improved closed-loop response for the clustered network (Top panel - Q = 10I5, Bottom panel - Q = 1000I5).

Table 1
Reduction in learning and CPU run times for the slow state feedback-based design with 25 agents.

                               Min. learning time (T = 0.01 s)    CPU run times
Full-state feedback            18.75 s                            72.19 s
Reduced-dim state feedback     0.75 s                             1.34 s

7.3. Output feedback RL (OFRL) design

Fig. 10. Slow state trajectories for the standard SP system (OFRL).
Fig. 11. Fast state trajectories for the standard SP system (OFRL).
Fig. 12. Comparison with state feedback for the ϵ = 0 system (OFRL).
Fig. 13. Learning with full state estimates for the clustered network.
Fig. 14. Learning with slow state estimates for the clustered network.
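The speed-up reported in Table 1 follows directly from the rank condition of Remark 2: the number of unknowns, and hence the minimum number of data samples, drops from n(n+1)/2 + nm in the full dimension to r(r+1)/2 + rm in the slow dimension. A quick sketch (the dimensions below are chosen for illustration only, not the paper's exact configuration):

```python
def min_samples(dim, m):
    """Minimum sample count for the ADP rank condition: dim(dim+1)/2 + dim*m."""
    return dim * (dim + 1) // 2 + dim * m

# Hypothetical full vs. reduced dimensions for illustration:
n, r, m = 25, 5, 1
print(min_samples(n, m), min_samples(r, m))  # 350 vs. 20 samples
```

With a fixed sampling interval T, the minimum learning time scales with this sample count, which is why the reduced design learns an order of magnitude faster.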
One can consider the dynamics ẏs = (As − Bs K̄k+1 + ρ3)ys as a perturbed version of the nominal dynamics ẏs = (As − Bs K̄k+1)ys. Considering a Lyapunov function Vk(t) = ysᵀ P̄k ys, and computing its time-derivative along ẏs = (As − Bs K̄k+1)ys, we get

V̇k(t) = ysᵀ [P̄k(As − Bs K̄k+1) + (As − Bs K̄k+1)ᵀ P̄k] ys,

which, using the proof of Theorem 2, can be shown to reduce to

V̇k(t) = −ysᵀ [(K̄k − K̄k+1)ᵀ R (K̄k − K̄k+1)] ys − ysᵀ [Q + K̄k+1ᵀ R K̄k+1] ys. (A.4)

With Q ≻ 0, the closed loop will be asymptotically stable. The dynamics ẏs = (As − Bs K̄k+1 + ρ3)ys is basically ẏs = (As − Bs K̄k+1)ys perturbed by ρ3 ys vanishing at ys = 0. If the estimation error is small with sufficiently small ϵ, then we will have a sufficiently small upper bound ∥ρ3∥ ≤ ρ̄3, and the vanishing perturbation g(t, ys) = ρ3 ys will satisfy ∥g(t, ys)∥ ≤ ρ̄3 ∥ys∥. With these considerations, we apply Khalil (2002, Lemma 9.1) and conclude that ys = 0 is exponentially stable for a sufficiently small ϵ and state estimation error. The disturbance ρ4(t) depends on the state estimation error bound b and the controller gain Kk+1. With arbitrarily small estimation error, the norm of the disturbance can be bounded by a sufficiently small upper bound ∥ρ4(t)∥ ≤ ρ̄4.

Appendix B. Proof of Theorem 7

We first show that the learned decentralized control gain Kαk+1 can stabilize the decoupled yd dynamics when ϵ is small with a sufficiently large Q. Let the area-wise control be uα = −Mα Kαk+1 yα. Therefore, u = −M Kk+1 y, where Kk+1 = diag(K1k+1, . . . , Krk+1). From Theorem 6, we have Kαk+1 = K̄αk+1 + O(ϵ), implying Kk+1 = K̄k+1 + O(ϵ). Using the learned gains Kk+1 for the decoupled dynamics with F1 = In ⊗ F, we get ẏd = F1 yd + B̃1(−M K̄k+1 yd) − O(ϵ)yd. Next, consider the Lyapunov function Vk(t) = ydᵀ P̄k yd with P̄k ≻ 0, and its time derivative along ẏd as

V̇k(t) = ydᵀ [P̄k(F1 − B̃1 M K̄k+1 − O(ϵ)) + (F1 − B̃1 M K̄k+1 − O(ϵ))ᵀ P̄k] yd. (B.1)

Using the ARE, Akᵀ P̄k + P̄k Ak = −(K̄kᵀ R K̄k + Q) with Ak = F1 − B̃1 M K̄k, and K̄k+1 = R⁻¹ Mᵀ B̃1ᵀ P̄k, it can be shown that V̇k(t)

= (Ir ⊗ F)ys − B̃1 M K ys + ϵ H̃ ys,  H̃ = H11 + H12 H̃2⁻¹ B̃2. (B.5)

The dynamics (B.5) can be viewed as the decoupled dynamics ẏd = F1 yd − B̃1 M K yd perturbed by an O(ϵ) term vanishing at ys = 0. The vanishing perturbation term given by g(t, ys) = ϵ H̃ ys satisfies ∥g(t, ys)∥ ≤ ϵ ∥H̃∥ ∥ys∥. With these considerations, we apply Khalil (2002, Lemma 9.1) and conclude that ys = 0 is exponentially stable for a sufficiently small ϵ. As the slow reduced sub-system model is the perturbed version of the decoupled model with the above-mentioned bound, the learned decentralized controller will exponentially stabilize the slow sub-system dynamics, which in turn stabilizes the entire system with the assumption that the fast sub-system is stable. □

References

Abdollahi, F., Talebi, H. A., & Patel, R. V. (2006). A stable neural network-based observer with application to flexible-joint manipulators. IEEE Transactions on Neural Networks, 17(1), 118–129.
Chow, J., & Kokotovic, P. (1976). A decomposition of near-optimum regulators for systems with slow and fast modes. IEEE Transactions on Automatic Control, 21(5), 701–705.
Chow, J., & Kokotovic, P. (1985). Time scale modeling of sparse dynamic networks. IEEE Transactions on Automatic Control, 30(8), 714–722.
Jiang, Y., & Jiang, Z.-P. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48, 2699–2704.
Jiang, Y., & Jiang, Z.-P. (2017). Robust adaptive dynamic programming. Wiley-IEEE Press.
Khalil, H. (2002). Nonlinear systems. Prentice-Hall, New York.
Khalil, H., & Kokotovic, P. (1978). Control strategies for decision makers using different models of the same system. IEEE Transactions on Automatic Control, 23(2), 289–298.
Kleinman, D. (1968). On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13(1), 114–115.
Kokotovic, P., O'Malley, R., & Sannuti, P. (1976). Singular perturbations and order reduction in control theory: An overview. Automatica, 12, 123–132.
Lewis, F., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Liu, D., & Wei, Q. (2014). Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 25(3), 621–634.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2018). On model-free reinforcement learning of reduced-order optimal control for singularly perturbed systems. In IEEE Conference on Decision and Control 2018. Miami, FL, USA.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2019). Block-decentralized model-free reinforcement learning control of two time-scale networks. In American Control Conference 2019. Philadelphia, PA, USA.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2020). Reduced-dimensional reinforcement learning control using singular perturbation approximations. arXiv preprint arXiv:2004.14501.
Sutton, R., & Barto, A. (1998). Reinforcement learning - An introduction. Cambridge: MIT Press.
Vamvoudakis, K. (2017). Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach. Systems & Control Letters, 100, 14–20.
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., & Lewis, F. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45, 477–484.
Wu, H., & Luo, B. (2012). Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control. IEEE Transactions on Neural Networks and Learning Systems, 23(12), 1884–1895.

He Bai received his B.Eng. degree from the Department of Automation at the University of Science and Technology of China, Hefei, China, in 2005, and the M.S. and Ph.D. degrees in Electrical Engineering from Rensselaer Polytechnic Institute (RPI) in 2007 and 2009, respectively. From 2009 to 2010, he was a Post-doctoral Researcher at Northwestern University, Evanston, IL. From 2010 to 2015, he was a Senior Research and Development Scientist at UtopiaCompression Corporation. In 2015, he joined the Mechanical and Aerospace Engineering Department at Oklahoma State University as an assistant professor. He has published over 80 peer-reviewed journal and conference papers related to control and robotics and a research monograph ''Cooperative control design: a systematic passivity-based approach'' in Springer. He holds one patent on monocular passive ranging. His research interests include reinforcement learning, distributed optimization and learning, multi-agent systems, and autonomous systems.