
Automatica 126 (2021) 109451

Contents lists available at ScienceDirect

Automatica
journal homepage: www.elsevier.com/locate/automatica

Brief paper

Reduced-dimensional reinforcement learning control using singular perturbation approximations✩

Sayak Mukherjee a,∗, He Bai b, Aranya Chakrabortty a
a Department of Electrical and Computer Engineering, North Carolina State University, USA
b Mechanical and Aerospace Engineering Department, Oklahoma State University, USA

article info

Article history:
Received 8 October 2019
Received in revised form 10 November 2020
Accepted 11 December 2020
Available online xxxx

Keywords:
Reinforcement learning
Linear quadratic regulator
Singular perturbation
Model-free control
Model reduction

abstract

We present a set of model-free, reduced-dimensional reinforcement learning (RL) based optimal control designs for linear time-invariant singularly perturbed (SP) systems. We first present a state feedback and an output feedback based RL control design for a generic SP system with unknown state and input matrices. We take advantage of the underlying time-scale separation property of the plant to learn a linear quadratic regulator (LQR) for only its slow dynamics, thereby saving a significant amount of learning time compared to the conventional full-dimensional RL controller. We analyze the sub-optimality of the designs using SP approximation theorems, and provide sufficient conditions for closed-loop stability. Thereafter, we extend both designs to clustered multi-agent consensus networks, where the SP property reflects through clustering. We develop both centralized and cluster-wise block-decentralized RL controllers for such networks, in reduced dimensions. We demonstrate the details of the implementation of these controllers using simulations of relevant numerical examples, and compare them with conventional RL designs to show the computational benefits of our approach.

© 2020 Elsevier Ltd. All rights reserved.

✩ The work of the third author was supported by the National Science Foundation ECCS grant 1940866. The material in this paper was partially presented at: the 57th IEEE Conference on Decision and Control, December 17–19, 2018, Miami Beach, Florida, USA, and the 2019 American Control Conference (ACC), July 10–12, 2019, Philadelphia, PA, USA. This paper was recommended for publication in revised form by Associate Editor Kyriakos G. Vamvoudakis under the direction of Editor Miroslav Krstic.
∗ Corresponding author.
E-mail addresses: smukher8@ncsu.edu (S. Mukherjee), he.bai@okstate.edu (H. Bai), achakra2@ncsu.edu (A. Chakrabortty).
https://doi.org/10.1016/j.automatica.2020.109451
0005-1098/© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

Reinforcement Learning (RL), originally introduced in Sutton and Barto (1998), has recently seen a resurgence in the optimal control of dynamical systems through a variety of papers such as Jiang and Jiang (2012), Lewis and Vrabie (2009), Liu and Wei (2014), Vamvoudakis (2017), Vrabie et al. (2009), and Wu and Luo (2012), using solution techniques such as adaptive dynamic programming (ADP), actor–critic methods, and Q-learning. The curse of dimensionality, however, continues to be an ongoing concern for all of these RL-based control designs. Depending on the size and complexity of the plant, it may take an unacceptably long amount of time to perform system exploration. Our goal in this paper is to counteract this problem by exploiting certain physical characteristics of the plant dynamics so that learning only a reduced-dimensional controller is sufficient for stabilizing the full-dimensional plant. The specific property of the plant model that we study here is based on singular perturbations (SP), i.e., plants whose dynamics are separated into two time-scales. Traditionally, SP theory has been used for model reduction (Chow and Kokotovic (1985), Kokotovic et al. (1976)) and control (Chow and Kokotovic (1976)) of large-scale systems, but only by using knowledge of the full plant model. Its extension to model-free control using RL has not been addressed. To bridge this gap, we present several sets of RL-based control designs where we exploit the underlying SP property of the plant to learn a controller for only its dominant slow time-scale dynamics, thereby saving a significant amount of learning time. We provide sub-optimality and stability results for the resulting closed-loop system.

The main contributions are as follows. Three distinct RL control designs for singularly perturbed systems are presented. The first design assumes that the slow state variable is either directly measurable, or can be constructed from measurements of the full state vector. Using this assumption, we develop a modified ADP algorithm which learns a reduced-dimensional RL controller using only feedback from the slow state variables. The controller is shown to guarantee closed-loop stability of the full-dimensional system if the fast dynamics are stable. The second design extends this algorithm to output feedback control using a neuro-adaptive state estimator (Abdollahi et al., 2006). The third design shows the relevance of these two designs to SP models of multi-agent consensus networks where time-scale

separation arises due to clustering of the network nodes. Along with a centralized design, a variant is proposed that imposes a block-diagonal structure on the RL controller to facilitate its implementation. Numerical results show that our approach saves a significant amount of learning time compared to conventional RL while still maintaining a modest closed-loop performance. All the designs are described by implementable algorithms together with theoretical guarantees.

The first design has been presented as a preliminary result in our recent conference paper Mukherjee et al. (2018). The second design, however, is completely new. The multi-agent RL controllers, which were presented only for scalar dynamics in Mukherjee et al. (2018, 2019), are now extended to vector-dimensional states. Moreover, unlike prior results, the consensus model here is more generic as we allow each node to have self-dynamics. The simulation examples presented in Section 5 are of much larger dimension than in Mukherjee et al. (2018) to demonstrate the numerical benefits of the designs.

Notations: RH∞ is the set of all proper, real and rational stable transfer matrices; ⊗ denotes the Kronecker product; 1n denotes a column vector of size n with all ones; ∪ denotes the union of sets; blkdiag(m1, . . . , mn) denotes a block-diagonal matrix with m1, . . . , mn as its block-diagonal elements; |M| denotes the cardinality of the set M; ∥·∥ denotes the Euclidean norm of a vector and the Frobenius norm of a matrix unless mentioned otherwise.

2. Problem formulation

Consider a linear time-invariant (LTI) system

ẋ = Ax + Bu,  x(0) = x0,  q = Cx,  (1)

where x ∈ Rⁿ is the state, u ∈ Rᵐ is the control input, and q ∈ Rᵖ is the output. We assume that the matrices A and B are unknown, although n, m and p are known. The following assumption is made.

Assumption 1. The system with state-space model (1) exhibits a singular perturbation property, i.e., there exist a small parameter 1 ≫ ϵ > 0 and a similarity transform [T; G] such that by defining y ∈ Rʳ and z ∈ R^(n−r) as

[y; z] = [T; G] x,  (2)

the state-space model (1) can be rewritten as

ẏ = A11 y + A12 z + B1 u,  y(0) = Tx0 = y0,  (3a)
ϵż = A21 y + A22 z + B2 u,  z(0) = Gx0 = z0,  (3b)
q = C [T; G]⁻¹ [y; z].  (3c)

In the transformed model (3), y(t) represents the slow states and z(t) represents the fast states. Since A and B are unknown, the matrices A11, A12, A21, A22, B1 and B2 are unknown as well.

2.1. Problem statement for state-feedback RL

P1. Learn a control gain K ∈ R^(m×r) for the singularly perturbed system (3) without knowing the model, using online measurements of u(t) and y(t), such that

u(t) = −Ky(t) = −KTx(t)  (4)

minimizes

J(y(0); u) = ∫_0^∞ (yᵀQy + uᵀRu) dt,  (5)
s.t. A − BKT ∈ RH∞.  (6)

We assume (A, B) to be stabilizable. We consider y(t) to be directly measurable, or x(t) to be measurable (i.e., C = I) and T to be known so that y(t) can be computed at all times t. This is not a restrictive assumption, as in many SP systems the identities of the slow and fast states are known a priori (Khalil & Kokotovic, 1978) even if the model is unknown. In some cases, for example in the multi-agent model shown later in the paper, offline knowledge of certain structural properties of the system can enable the designer to construct T using measurements of x(t), even when A and B are unknown. The benefit of using y(t) as the feedback variable is that one has to learn only an (m × r) matrix instead of the (m × n) matrix needed if full state feedback x(t) were used. This will improve the learning time, especially if r ≪ n. Before proceeding with the control design, we make the following assumption.

Assumption 2. A22 in Eq. (3b) is Hurwitz.

This assumption means that the fast dynamics of (3) are stable, which allows us to skip feeding back z(t) in (4).

2.2. Problem statement for output feedback RL

P2. Considering that q(t) is measured and C is known, but A and B are both unknown in (1), estimate the states ŷ(t), ẑ(t) (or, equivalently, estimate x̂(t) and compute ŷ(t) = Tx̂(t) assuming that T is known), and learn a controller K ∈ R^(m×r) using online measurements of q(t) and u(t) such that u = −Kŷ = −KTx̂ minimizes J(y(0); u) = ∫_0^∞ (yᵀQy + uᵀRu) dt.

We assume (A, B) to be stabilizable and (A, C) to be detectable. Our approach is to estimate the slow states ŷ(t) without knowing (A, B), using an observer employing a neural structure that does not require exact information of the state dynamics, and then to use u(t) and ŷ(t) to learn the controller K via adaptive dynamic programming.

We present the solutions for P1 and P2, with associated stability proofs, in the following two respective sections.

3. Reduced-dimensional state feedback RL

Following Khalil (2002), the reduced slow subsystem of (3) can be defined by substituting ϵ = 0, resulting in

ẏs = As ys + Bs us,  ys(0) = y(0),  u = us + uf,  (7)

where As = A11 − A12 A22⁻¹ A21 and Bs = B1 − A12 A22⁻¹ B2. Since our intent is to only use the slow variable for feedback, we substitute the fast control input uf = 0 and the slow control input us = u. If the controller were to use ys(t) for feedback, then it would find u = −K̄ys(t) to solve:

minimize J̄(ys(0); u) = ∫_0^∞ (ysᵀQys + uᵀRu) dt,  (8)
s.t. As − Bs K̄ ∈ RH∞.  (9)

The optimal solution for the above problem is given by the following algebraic Riccati equation (ARE):

Asᵀ P̄ + P̄ As + Q − P̄ Bs R⁻¹ Bsᵀ P̄ = 0,  K̄ = R⁻¹ Bsᵀ P̄,

where P̄ = P̄ᵀ ≻ 0. If As and Bs are unknown, then the RL controller K̄ can be learned using measurements of ys(t) and of an exploration input u(t) = u0(t) by the ADP algorithm presented in Jiang and Jiang (2017), which is a model-free version of Kleinman's algorithm (Kleinman, 1968). The control policy u0(t) must be persistently exciting, and can be chosen arbitrarily as long as

the system states remain bounded. For example, one choice of u0 is a sum of sinusoidal signals.

In reality, however, ys is not accessible as ϵ ≠ 0. We, therefore, recall the following theorem from Chow and Kokotovic (1976), which will allow us to replace ys(t) with y(t) in the learning algorithm.

Theorem 1 (Chow & Kokotovic, 1976; Khalil, 2002). Consider the two systems (3) and (7). There exists 0 < ϵ* ≪ 1 such that for all 0 < ϵ ≤ ϵ*, the trajectories y(t) and ys(t) satisfy uniformly for t ∈ [0, t1]

y(t) = ys(t) + O(ϵ).  (10)

Algorithm 1 shows how the controller K is learned using y(t) and u0(t).

Algorithm 1 SP-RL using slow dynamics
Input: Measurements of y(t) and u0(t)
Step 1 - Data storage: Store data (i.e., y(t) and u0(t)) for sufficiently large uniformly sampled time instants (t1, t2, · · · , tl), and construct the following matrices:

δyy = [ (y ⊗ y)|_{t1}^{t1+T}, · · · , (y ⊗ y)|_{tl}^{tl+T} ]ᵀ,  (11)
Iyy = [ ∫_{t1}^{t1+T} (y ⊗ y) dτ, · · · , ∫_{tl}^{tl+T} (y ⊗ y) dτ ]ᵀ,  (12)
Iyu0 = [ ∫_{t1}^{t1+T} (y ⊗ u0) dτ, · · · , ∫_{tl}^{tl+T} (y ⊗ u0) dτ ]ᵀ,  (13)

such that rank([Iyy Iyu0]) = r(r + 1)/2 + rm is satisfied.
Step 2 - Controller update: Starting with a stabilizing K0, solve for K iteratively (k = 0, 1, · · ·) following the update equation:

[ δyy   −2Iyy(Ir ⊗ Kkᵀ R) − 2Iyu0(Ir ⊗ R) ] [ vec(Pk); vec(Kk+1) ] = −Iyy vec(Qk),  (14)

where Θk denotes the matrix on the left-hand side and Φk := −Iyy vec(Qk) the right-hand side. The stopping criterion for this update is ∥Pk − Pk−1∥ < γ, where γ is a chosen small positive threshold.
Step 3 - Applying control: After P and K converge, remove u0 and apply u = −Ky.

The condition rank(Θk) = r(r + 1)/2 + rm can be satisfied, for example, by utilizing data from at least twice as many sampling intervals as the number of unknowns. We next provide the analytical guarantees of Algorithm 1 related to the SP-based approximations.

3.1. Sub-optimality and stability analysis

The optimal controller parameters P, K can be written as P = P̄ + ∆P, K = K̄ + ∆K, where P̄, K̄ are the optimal solutions if ys were available for design, and ∆P, ∆K are matrix perturbations resulting from the fact that ϵ ≠ 0. The following theorem establishes the sub-optimality of the learned controller using y(t).

Theorem 2. Assume that ∥ys(t)∥ and ∥u0(t)∥ are bounded for a finite time t ∈ [0, t1]. The solutions of Algorithm 1 are given by P = P̄ + O(ϵ), K = K̄ + O(ϵ), and J = J̄ + O(ϵ).

Proof. See Theorems 2 and 3 in Mukherjee et al. (2018). □

Theorem 2 shows that the controller obtained from Algorithm 1 is O(ϵ)-close to that obtained from the ideal design (where ϵ = 0), which also holds for the respective costs. Moreover, it is shown in Chow and Kokotovic (1976) that the reduced-order control cost is O(ϵ)-perturbed from the full-order control cost. This holds for the model-free case as well. As Algorithm 1 is constructed first by learning the ideal slow sub-system, and then by replacing ys(t) with y(t) for the implementation, we can quantify the sub-optimality of this approximation. The next theorem provides a sufficient condition to achieve asymptotic stability at the (k + 1)th iteration of Algorithm 1, assuming that the control policy at the kth iteration stabilizes (3).

Theorem 3. Assume that the control policy u = −Kk y at the kth iteration asymptotically stabilizes (3). Consider R ≻ 0 and Q ≻ 0 with λmin(Q) sufficiently large. Then the control policy at the (k + 1)th iteration given by u = −Kk+1 y is asymptotically stabilizing for (3). □

Proof. Please see Theorem 4 in Mukherjee et al. (2018).

Remark 1 (Design Trade-off). The proof of Theorem 3 is based on Lyapunov stability analysis, where Q compensates for the error due to the O(ϵ) approximation of the fast dynamics such that Q − O(ϵ) ≻ 0. This translates to the requirement of a sufficiently large λmin(Q). In practice, assuming that a reliable upper bound for ϵ is known to the designer (even if the exact value of ϵ may not be known), one can start the off-policy RL iteration with a Q such that λmin(Q) is greater than that upper bound, and adjust Q, if necessary, to obtain satisfactory stability and transient performance. Note that the only underlying assumption about the physics of the system is the existence of a strong time-scale separation, i.e., the existence of a sufficiently small ϵ. Estimates of ϵ, or at least of an upper bound for it, can be obtained from offline spectral analyses of the state trajectories, or even from a subset of known network parameters. Also note that our approach is different from classical robust control designs, where a nominal model is used to compute stabilizing controls with a known upper bound on the uncertainty. The physics-informed nature of our design helps with scalability and stronger analytical guarantees.

Finally, the requirement for an initial stabilizing control is common for policy iterations (Jiang & Jiang, 2012). For an open-loop stable system, learning can theoretically be achieved without any initial stabilizing control. However, an initial stabilizing control may improve convergence and accuracy.

4. Reduced-dimensional output feedback RL

We next address the RL design when the full state information is not available. We consider that the system model is in the singularly perturbed form (3), and then design an observer to estimate the state x as x̂(t) = [ŷ(t); ẑ(t)]. The idea then is to simply replace y(t) by ŷ(t) in Algorithm 1. If the system is not directly in the singularly perturbed form (3), but in the generic form (1), then a full-order observer is required to estimate x̂(t), followed by computation of the slow state estimate ŷ(t) using the knowledge of the projection T. In Section 4.2 we will present one such observer. We first analyze the stability properties of the output feedback design.

4.1. Sub-optimality and stability analysis

Lemma 1. Define e(t) = x(t) − x̂(t). If e is uniformly ultimately bounded (UUB) with a bound b for all t ≥ t0 + T for some initial time t0, then there exist positive constants ϵ* and k̄ such that for all 0 < ϵ ≤ ϵ*

∥ŷ(t) − ys(t)∥ ≤ k̄|ϵ| + b := c(ϵ, b)  (15)

holds uniformly for t ∈ [t2, t1].

Proof. Since e(t) is UUB, there exist positive constants b and b̂, independent of t0 ≥ 0, and for every a ∈ (0, b̂), there exists T1 = T1(a, b), independent of t0, such that ∥ŷ(t0) − y(t0)∥ ≤ a,

which implies that 4.2. Neuro-adaptive observer

∥ŷ(t) − y(t)∥ ≤ b, ∀t ≥ t0 + T1 := t2 . (16) A candidate observer to estimate ŷ(t) without knowing (A, B)
is the neuro-adaptive observer proposed in Abdollahi et al. (2006)
From Theorem 1, it follows that there exist positive constants k that employs a neural network (NN) structure to account for
and p such that, the lack of dynamic model information. This observer guarantees
boundedness of e(t), which, with proper tuning, can also be made
∥y(t) − ys (t)∥ ≤ k̄|ϵ| ∀t ∈ [t0 , t1 ], t1 > t2 , ∀|ϵ| < p. (17)
arbitrarily small. Recalling the mechanism of this observer, we
Combining (16) and (17), for t ∈ [t2 , t1 ] we have rewrite (1) as
ẋ = Âx + (Ax − Âx) + Bu, q = C x, (19)
∥ŷ(t) − ys (t)∥ ≤ k̄|ϵ| + b := c(ϵ, b). (18)   
g(x,u)
This completes the proof. □
where  is a Hurwitz matrix, and (C , Â) is observable. We do not
Corollary 1. If e(t) = O(ϵ ) for t ∈ [t2 , t1 ], then ŷ(t) = ys (t) + O(ϵ ). have proper knowledge about g(x, u), and a NN with sufficiently
large number of neurons can approximate g(x, u), as g(x, u) =
W σ (V x̄) + η(x). Here, x̄ = [x, u], while σ (.) and η(x) are the
Proof. The proof directly follows from Lemma 1. □
activation function and the bounded NN approximation error,
We know that if ys (t) were available for feedback then P̄ , K̄
respectively. W and V are the ideal fixed NN weights. We choose
would be the optimal solutions. However, due to the state esti-
mation error bound b and the singular perturbation error O(ϵ ), G such that Ac = Â − GC is Hurwitz. The observer dynamics follow
the actual solutions are given as P = P̄ + ∆P, K = K̄ + ∆K , where as
∆P and ∆K are matrix perturbations resulting from non-ideal x̂˙ = Âx̂ + g(x̂, u) +G(q − C x̂), q̂ = C x̂, (20)
feedback.   
=Ŵ σ (V̂ x̄)
ˆ

Proposition 1. Perturbations ∆P , ∆K are bounded, i.e., there exist where Ŵ , V̂ are NN weights when driven by x̂, and are updated
two positive constants ρ, ρ1 , dependent on b and ϵ , such that based on the modified Back Propagation (BP) algorithm. The
∥∆P ∥ ≤ ρ, ∥∆K ∥ ≤ ρ1 . Moreover, if e(t) = O(ϵ ) for t ∈ [t2 , t1 ], observer (20) requires the knowledge of C . Accordingly, we define
then we will recover P = P̄ + O(ϵ ), K = K̄ + O(ϵ ). the output error as q̃ = q − C x̂. The objective function is to
minimize J = 12 (q̃T q̃). Following Abdollahi et al. (2006), the
Proof. Please see Mukherjee et al. (2020). update law follows from gradient descent as:
˙
If e(t) can be made sufficiently small by proper tuning of the Ŵ = − η1 (q̃T C A−
c ) (σ (V̂ x̄)) −ρ1 ∥q̃∥Ŵ ,
1 T ˆ T (21)
observer gain then we would recover the design characteristics of
  
Algorithm 1. To this end, we present the following stability result. η1 ( ∂ J )
∂ Ŵ
˙
V̂ = − η2 (q̃ C Ac Ŵ (I − Λ(V̂ x̄)))
T −1 ˆ T −ρ2 ∥q̃∥V̂ ,
ˆ T sgn(x̄)
Theorem 4. Assume that the control policy u = −Kk ŷ is asymp-   
totically stabilizing for the kth iteration. Then, there exist sufficiently η2 ( ∂ J )
small b∗ , and 0 < ϵ ∗ ≪ 1 such that for b ≤ b∗ , 0 < ϵ ≤ ϵ ∗ , with ∂ V̂

Q ≻ 0, R ≻ 0, u = −Kk+1 ŷ will asymptotically stabilize (3) at the where η1 , η2 > 0 are learning rates and ρ1 , ρ2 are small positive
(k + 1)th iteration. numbers, Ac = A − GC , σ (·) is the activation function, and x̄ˆ =
[x̂, u]. Considering k neurons we have Λ(V̂ x̄) ˆ = diag(σ 2 (V̂i x̄))
i
ˆ ,i =
Proof. Please see Appendix A. 1, 2, . . . , k, where σi (V̂i x̄)
ˆ is the ith element of σ (V̂ x̄),
ˆ and sgn(·)
is the sign function. The update law (21) depends on the knowl-
As shown in Appendix A, the estimation error enters the edge of C . This observer guarantees the following boundedness
closed-loop system as an exogenous disturbance. Since Kk+1 is property.
stabilizing, the states converge to a neighborhood of the origin for
sufficiently small b∗ and ϵ ∗ . Note that the designer does not need Theorem 5 (Abdollahi et al., 2006, Theorem 1). With the update law
the explicit knowledge of ϵ ∗ . Assuming the existence of a small described as (21), the state estimation error x̃ = x − x̂ and weight
enough ϵ is sufficient. From the physical laws of the dynamics, the estimation errors W̃ = W − Ŵ , Ṽ = V − V̂ are uniformly ultimately
states in Eq. (3b) are inherently sufficiently faster than the states bounded (UUB).
in Eq. (3a) ensuring stability. The design takes advantage of this
The size of the estimation error bound can be made arbitrarily
structure to construct a sample efficient reduced-dimensional
small by properly selecting the parameters and learning rates
learning control methodology.
as shown in Abdollahi et al. (2006). For example, with higher
learning rates the convergence rate can be increased but these
Remark 2. The convergence of the observer and the RL itera-
parameters need to be properly tuned to avoid overshoot, and
tions are handled sequentially. The observer gathers a sufficient
selecting  to have fast eigenvalues will also keep the state
amount of data samples to meet the rank condition rank(Θ̂k ) =
estimation error small.
r(r + 1)/2 + rm, after which the control gain is computed it-
eratively. Θ̂k has same structure as Θk but with y(t) replaced 5. Applying to clustered multi-agent networks
by ŷ(t). The designer may start gathering data samples after a
few initial time-steps over which the observer has converged We next describe how SP-based RL designs can be applied
close to its steady-state. The observer is designed to achieve for the control of clustered multi-agent consensus networks,
fast convergence, as discussed next. The state estimation error e.g., power systems, robotic swarms, and biological networks. The
that may be present in the observer output has been taken into LTI model of these networks can be brought into the standard SP
consideration in the sub-optimality and the stability analysis, as form (3) by exploiting the time-scale separation in its dynamics
discussed in Proposition 1 and Theorem 4. arising from the clustering of nodes.
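As a quick numerical illustration of this time-scale separation (our own sketch, not from the paper; the network size and coupling weights are hypothetical), the spectrum of a clustered graph Laplacian shows one near-zero "slow" eigenvalue in addition to the consensus zero eigenvalue, well separated from the O(1) "fast" eigenvalues:

```python
import numpy as np

# Hypothetical two-cluster network of 6 scalar agents (F = 0, s = 1).
# Strong intra-cluster weights (1.0) and a single weak inter-cluster
# link (eps) induce the two-time-scale behavior described above.
eps = 0.01
n = 6
W = np.zeros((n, n))
for i in range(3):
    for j in range(3):
        if i != j:
            W[i, j] = W[i + 3, j + 3] = 1.0   # internal couplings
W[2, 3] = W[3, 2] = eps                        # weak external coupling
L = np.diag(W.sum(axis=1)) - W                 # weighted graph Laplacian
eigs = np.sort(np.linalg.eigvalsh(L))
# eigs[0] = 0 (consensus), eigs[1] = O(eps) (slow inter-cluster mode),
# remaining eigenvalues = O(1) (fast intra-cluster modes).
print(eigs)
```

The O(eps) gap between `eigs[1]` and `eigs[2]` is exactly what the decomposition L = LI + ϵLE of Section 5.1 captures.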

Fig. 1. Centralized and block-decentralized control architectures.
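In code, the designer-side objects that distinguish the two architectures in Fig. 1 — the averaging transform T = T1 ⊗ Is of Section 5.1 and the projection matrix M = blkdiag(M1, . . . , Mr) of Section 5.2 — can be sketched as follows (an illustration we add here; the cluster sizes and state dimension are hypothetical):

```python
import numpy as np

# Build the averaging matrix T = (Na^{-1} U^T) ⊗ Is and the projection
# matrix M = blkdiag(1_{|I1|}, ..., 1_{|Ir|}) ⊗ Is for a hypothetical
# clustering; sizes=[3, 2] means clusters I1, I2 with 3 and 2 agents.
def cluster_maps(sizes, s):
    n, r = sum(sizes), len(sizes)
    U = np.zeros((n, r))                 # U = diag(1_{n1}, ..., 1_{nr})
    row = 0
    for a, na in enumerate(sizes):
        U[row:row + na, a] = 1.0
        row += na
    T1 = np.diag([1.0 / na for na in sizes]) @ U.T   # T1 = Na^{-1} U^T
    T = np.kron(T1, np.eye(s))   # slow variable y = Tx (cluster averages)
    M = np.kron(U, np.eye(s))    # copies each cluster input to its members
    return T, M

T, M = cluster_maps([3, 2], s=2)
x = np.arange(10.0)              # stacked agent states (5 agents, s = 2)
y = T @ x                        # per-cluster averages, y ∈ R^{rs}
```

Note that T M = I by construction, so the cluster average of a projected input recovers the reduced-dimensional signal exactly.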

5.1. SP representation of clustered networks

Consider a network of n agents, where the dynamics of the ith agent is given by

ẋi = Fxi + Σ_{j∈Ni} aij (xj − xi) + bi ui,  (22)

where xi ∈ Rˢ is the state, ui ∈ Rᵖ is the input, and Ni denotes the set of agents that are connected to agent i, for i = 1, . . . , n. The graph between agents is assumed to be connected and time-invariant. The constants aij = aji > 0 denote the coupling strengths of the interaction between agents i and j. The matrix F ∈ R^(s×s) models the self-feedback of each node. The overall network model is written as

ẋ = Ax + Bu,  x(0) = x0,  (23)

where x ∈ R^(ns) is the vector of all agent states, u is the stacked control input, B = diag(b1, . . . , bn), and A = In ⊗ F + L ⊗ Is, with L ∈ R^(n×n) being the weighted network Laplacian matrix satisfying L1n = 0.

Assumption 3. F is marginally stable.

Let the agents be divided into r non-empty, non-overlapping, distinct groups I1, . . . , Ir such that agents inside each group are strongly connected while the groups themselves are weakly connected, i.e., aij ≫ apq for any two agents i and j inside a group and any two agents p and q in two different groups. This type of clustering has been shown to induce a two-time-scale behavior in the network dynamics of (22) (Chow & Kokotovic, 1985). Fig. 1a shows an example of such a clustered dynamic network. The clustered nature of the network helps decompose L as L = LI + ϵLE, where LI (a block-diagonal matrix) and LE (a sparse matrix) represent the internal and the external connections, and ϵ is the singular perturbation parameter arising from the worst-case ratio of the coupling weights inside a cluster to those between the clusters. The slow and fast variables are defined as

[y; z] = [T; G] x,  x = (U G†) [y; z],  (24)

where T = T1 ⊗ Is, G = G1 ⊗ Is. Here, T1 = Na⁻¹ Uᵀ ∈ R^(r×n), Na = diag(n1, n2, . . . , nr), where ni denotes the number of agents in group i, U = diag(U1, U2, . . . , Ur), Uα = 1nα. The matrix G1, following from Chow and Kokotovic (1985), is not required for our design as we are only interested in constructing the slow states. We can see that T is simply an averaging operation. Applying this transformation to (23), and redefining the time-scale as ts = ϵt, the following SP form is obtained:

dy/dts = A11 y + A12 z + B1 u,  (25a)
ϵ dz/dts = A21 y + A22 z + B2 u,  (25b)

where

A11 = T(LE ⊗ Is)U + (Ir ⊗ F)/ϵ,  A12 = T(LE ⊗ Is)G†,
A21 = G(LE ⊗ Is)U,  A22 = G(LI ⊗ Is)G† + (In−r ⊗ F) + ϵG(LE ⊗ Is)G†,
B1 = TB/ϵ,  B2 = GB.

The detailed derivation is shown in Mukherjee et al. (2020). All six matrices are assumed to be unknown. Following Assumption 2, we assume that A22 is Hurwitz.

5.2. Projection of control to agents

One important distinction between controlling the multi-agent system (25) and a generic SP system (3) is that the control input u for the former has a physical meaning for each agent. Therefore, u(t), although designed in lower dimension, must be actuated in its actual dimension. One way to design u(t) is to use u = Mũ, where ũ ∈ R^(rs) is the actual control signal learned using ADP, and M is a projection matrix of the form M = blkdiag(M1, . . . , Mr), Mi = M̄i ⊗ Is, M̄i = 1|Ii|, which projects the reduced-dimensional controller to the full-dimensional plant. M is constructed by the designer under the assumption that the designer knows the cluster identity of each agent. We assume (A, BM) to be stabilizable.

6. Block-decentralized multi-agent RL

The controllers learned in Sections 3 and 4 need to be computed in a centralized way. In this section we show that for the clustered consensus model (25), the clustered nature of the system can also aid in learning a cluster-wise decentralized RL controller. Fig. 1 describes the centralized and block-decentralized architectures.

6.1. Cluster-wise representation

Let the states of the agents in cluster α be denoted as (x1α, x2α, . . . , xnαα) ∈ R^(nα s). Following Chow and Kokotovic (1985), the transformation matrix T in (24) is an averaging operation on

the states of agents inside a cluster, which implies that the slow Algorithm 2 Cluster-wise Decentralized ADP
variable for the cluster α is For area α = 1, 2, . . . , r
1 α Step 1: Construct matrices δyα yα , Iyα yα , Iyα uα0 having similar structures as
yα = (x + xα2 + · · · + xαnα ), α = 1, . . . , r , (26) δyy , Iyy , Iyu0 but with y(t) replaced by yα (t).
nα 1 Step 2: Starting with a stabilizing K0α , Solve for Kkα+1 iteratively (k = 0, 1, . . . )
y = [y1 ; y2 ; . . . ; yr ]. (27) once matrices δyα yα , Iyα yα , Iyα uα0 are constructed and iterative equation can be
written for each small learning steps as,
For the cluster-wise decentralized design, the starting point is to
δyα yα −2Iyα yα (Is ⊗ KkαT Rα ) − 2Iyα uα0 (Is ⊗ Rα ) ×
[ ]
consider the scenario if all clusters were decoupled from each
other. We denote the states in cluster α in that scenario as
  
Θkα
xαd1 , xαd2 , . . . , xαdnα ∈ Rnα s , and the concatenated state vector [
v ec(Pkα )
]
considering all the clusters are denoted as xd . For this decoupled α = −Iyα yα v ec(Qkα ) . (34)
v ec(Kk+1 )
scenario, yαd and yd are similarly defined following (26) and (27).
  
Φαk
Then we will have,
The stopping criterion for this update is Pkα − Pkα−1  < γ1 , where γ1 is a chosen
 

ẋd = (In ⊗ F + LI ⊗ Is )xd + Bu, (28) small positive threshold.


Step 3: Next ũα = −K α yα is applied and uα0 source is removed.
ẏd = T ẋd = (T1 ⊗ Is )(In ⊗ F + L ⊗ Is )xd + B̃1 u,
I
End For

where B̃1 = TB. As xd = Uyd + G† zd , (28) is reduced to


ẏd = (Ir ⊗ F )yd + B̃1 u. (29) 6.2.1. Analysis and stability for the decentralized design
The controller can be represented cluster-wise as u = In this section we analyze the sub-optimality and stability
[u1 ; u2 ; . . . , ur ]. Using the projected controller discussed in Sec- aspects of the area-wise decentralized controller learned from
tion 5.2, we can design uα (t) as Algorithm 2. The learned controller K α ∈ R for all the areas will
be perturbed from the controller computed using yαd , i.e.,
uα = M α ũα , M α = M̄ α ⊗ Is , M̄ α = 1|Iα | , (30)

where ũα is the controller learned in cluster α, α = 1, . . . , r. Taking a hint from the cluster-wise decentralized structure of the yd -dynamics in (29), we next state our design problem as follows.

P3. Consider the multi-agent consensus model (23) where A and B are unknown. Learn a control gain K α for every area α, α = 1, . . . , r, using yα (t) and ũα (t) such that uα = M α ũα = −M α K α yα stabilizes the closed-loop system and minimizes the following individual cluster-wise objectives

J α (yα (0); ũα ) = ∫0∞ (yα T Q α yα + ũα T Rα ũα ) dt , (31)

for α = 1, . . . , r. We assume that (A, BM) is stabilizable.

6.2. RL algorithm

We exploit a different O(ϵ) separation existing between the trajectories of the actual average variable of an area and the same variable when the areas are decoupled. We start by providing a lemma showing how the actual average variable yα is related to the decoupled average variable yαd for an area α.

Lemma 2. The cluster-wise average variable yα (t) and the decoupled average variable yαd (t) are related as

yα (t) = yαd (t) + O(ϵ), ∀t ∈ [0, t1 ]. (32)

Proof. The proof is presented in Mukherjee et al. (2020). □

We first consider decoupled clusters with the corresponding T for averaging. The decoupled slow dynamics is given in (28). The controller for area α uses the yαd (t) feedback and implements ũα = −K̄ α yαd (t) so that the decoupled dynamics are stabilized and the following objective is minimized for area α, with the ARE solution P̄ α ≻ 0 and the optimal control gain K̄ α:

J̄ α (yαd (0); ũα ) = ∫0∞ (yαd T Q α yαd + ũα T Rα ũα ) dt . (33)

As the decoupled system is fictitious, based on Lemma 2 it is plausible to replace yαd (t) with yα (t) in the learning algorithm, and then follow the same procedure as Kleinman's algorithm. The resulting algorithm is given in Algorithm 2.

The area-wise solutions learned in this way can be written as

P α = P̄ α + ∆P α , K α = K̄ α + ∆K α , (35)

where P̄ α , K̄ α are the optimal solutions if the clusters were decoupled and yαd (t) were available for design, and ∆P α , ∆K α are matrix perturbations. The following theorem shows that the matrix perturbations are O(ϵ) small.

Theorem 6. Assuming ∥yαd (t)∥ and ∥uα0 (t)∥ are bounded, the area-wise decentralized solutions satisfy for α = 1, . . . , r

P α = P̄ α + O(ϵ), K α = K̄ α + O(ϵ), J α = J̄ α + O(ϵ). (36)

Proof. This proof directly follows from the analysis performed for Theorem 2. Here the time-scale separation exists between the decoupled average variable yαd and the actual average variable yα. Using Lemma 2, these variables are O(ϵ) apart, which leads to (36) following the analysis of Theorem 2 and Corollary 1. □

Next we analyze the closed-loop stability conditions for the block-decentralized design.

Theorem 7. Assume that the control policy uα = −M α Kkα yα for area α at the kth iteration is asymptotically stable. Then the control policy at the (k + 1)th iteration given by uα = −M α Kkα+1 yα is asymptotically stable with Rα ≻ 0 and Q α ≻ 0, if ϵ is sufficiently small.

Proof. The proof is given in Appendix B. □

7. Numerical simulations

7.1. Centralized state feedback design

A singularly perturbed system in the form of (3) is considered with two fast and two slow states. We choose ϵ = 0.01, Q = 10I2 , R = I, the initial conditions as [1, 2, 1, 0], and the learning time-step as 0.01 s. The model matrices are taken from Chow and Kokotovic (1976) as

A11 = [0 0.4; 0 0], A12 = [0 0; 0.345 0], A21 = [0 −0.524; 0 0],
A22 = [−0.465 0.262; 0 −1], B1 = B2 = [1; 1].
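The two-time-scale structure of this example can be checked numerically: the full matrix [A11, A12; A21/ϵ, A22/ϵ] has two O(1) slow eigenvalues close to those of the reduced matrix As = A11 − A12 A22⁻¹ A21, and two O(1/ϵ) fast eigenvalues. A minimal numpy sketch; note that the entry placement in the matrices above is reconstructed from a garbled extraction, so the exact values should be treated as an assumption:

```python
import numpy as np

eps = 0.01
# Model matrices as quoted above (entry placement reconstructed; an assumption).
A11 = np.array([[0.0, 0.4], [0.0, 0.0]])
A12 = np.array([[0.0, 0.0], [0.345, 0.0]])
A21 = np.array([[0.0, -0.524], [0.0, 0.0]])
A22 = np.array([[-0.465, 0.262], [0.0, -1.0]])

# Full singularly perturbed state matrix and its slow approximation.
A_full = np.block([[A11, A12], [A21 / eps, A22 / eps]])
A_s = A11 - A12 @ np.linalg.solve(A22, A21)   # As = A11 - A12 A22^{-1} A21

lam = sorted(np.linalg.eigvals(A_full), key=abs)
slow, fast = lam[:2], lam[2:]   # two O(1) and two O(1/eps) eigenvalues
```

For ϵ = 0.01 the slow eigenvalues come out near {0, −0.39}, within O(ϵ) of the eigenvalues of As, while the fast eigenvalues are of magnitude 1/ϵ, which is the separation that the SP approximation theorems exploit.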
Fig. 2. Convergence of P and K for the standard SP system.

Fig. 3. Comparison of slow state 1 with ϵ = 0.01, 0.001 and reduced slow subsystem for standard SP system.

Fig. 4. Learned controller with full-state feedback.

Fig. 5. Improved closed-loop response for the clustered network (Top panel - Q = 10I5 , Bottom panel - Q = 1000I5 ).

Table 1
Reduction in learning and CPU run times for the slow state feedback-based design with 25 agents.

                               Min. learning time (T = 0.01 s)    CPU run times
Full-state feedback            18.75 s                            72.19 s
Reduced-dim state feedback     0.75 s                             1.34 s
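The minimum learning times in Table 1 follow from sample counting: the text quotes r2 + 2r2 = 75 required samples for the reduced design, and with one sample per learning time-step T = 0.01 s the same count, applied at full dimension n = 25, reproduces both rows of the table. A quick arithmetic check:

```python
# Minimum learning time = (required samples) x (sampling period).
def samples(d):
    # Sample count d^2 + 2*d^2, following the r^2 + 2r^2 = 75
    # figure quoted in the text for the reduced design.
    return d**2 + 2 * d**2

T = 0.01                      # learning time-step [s]
t_full = samples(25) * T      # full-state feedback, n = 25
t_reduced = samples(5) * T    # reduced-dimensional design, r = 5
print(t_full, t_reduced)      # 18.75 0.75
```

The 25x reduction in minimum learning time comes entirely from learning in the r-dimensional slow subspace instead of the n-dimensional full state space.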

The system is persistently excited by exploration noise following Jiang and Jiang (2017). The control gain is learned as K = [3.80 1.38], producing a closed-loop objective value J = 7.72 units. The convergence plots for P and K are shown in Fig. 2. For the ideal slow system (i.e., when ϵ = 0), the following controller is learned: K̄ = [3.1623 1.9962], J̄ = 7.2950 units. Fig. 3 compares closed-loop responses learned by ADP for the ideal reduced slow system (ϵ = 0) versus the system with ϵ ̸= 0. The top panel of Fig. 3 shows this comparison for ϵ = 0.01, while the bottom panel shows this for ϵ = 0.001. It can be seen that the responses of the ideal and non-ideal reduced-order systems get closer to each other over time as ϵ decreases.

We next consider a 5-cluster, 25-agent network, each agent with scalar state with F = 0. Therefore the network has 4 slow eigenvalues, one zero eigenvalue, and the rest are the fast eigenvalues. The slow eigenvalues are −0.128, −0.195, −0.196, and −0.2638. The control architecture is shown in Fig. 1(a). Each cluster is assumed to have a local coordinator that averages the states from inside the cluster, and transmits the average state to a central controller, which learns the reduced-dimensional control input ũ(t) ∈ R5 and subsequently back-projects it to individual agents.

Fig. 4 shows the learning of the full-dimensional optimal LQR controller. It takes at least 18.75 s to learn K ∈ R25×25 . The exploration signal here is a sum of sinusoidal signals with different frequencies. With r = 5, the reduced-order controller, on the other hand, requires only r2 + 2r2 = 75 samples for learning. It dominantly affects the slow poles, and with Q = 10I5 , the closed-loop slow poles are placed at −3.14, −3.18, −3.17, −3.15, and −3.16. Dynamic performance is improved with increase in the weights of Q, as shown in Fig. 5. Table 1 gives a comparison between the full and the reduced-order designs in terms of minimum learning time based on the sample requirement and CPU run times.
Fig. 6. Decentralized controllers for ideal decoupled clusters.

Fig. 7. Convergence of K and P for the cluster-wise decentralized design.

Fig. 8. Dynamic performance with cluster-wise decentralized design (Top panel — Q = 10I5 for all areas, Bottom panel — varying Q across different areas).
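The ADP iterations whose convergence is shown in Figs. 2 and 7 are the data-driven counterpart of Kleinman's model-based policy iteration (Kleinman, 1968): given a stabilizing gain, solve a Lyapunov equation for Pk, update the gain, and repeat. A minimal numpy sketch on an illustrative second-order system (assumed for illustration, not the paper's model; the model-free version replaces the Lyapunov solve with least squares on measured trajectory data):

```python
import numpy as np

def lyap(A, Q):
    """Solve A^T P + P A = -Q by Kronecker vectorization (A Hurwitz)."""
    n = A.shape[0]
    M = np.kron(A.T, np.eye(n)) + np.kron(np.eye(n), A.T)
    return np.linalg.solve(M, -Q.reshape(-1)).reshape(n, n)

def kleinman(A, B, Q, R, K0, iters=20):
    """Policy iteration; K0 must stabilize A - B K0."""
    K = K0
    for _ in range(iters):
        P = lyap(A - B @ K, Q + K.T @ R @ K)   # policy evaluation
        K = np.linalg.solve(R, B.T @ P)        # policy improvement
    return P, K

# Illustrative double-integrator example (assumed, not from the paper).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = 10 * np.eye(2), np.eye(1)
P, K = kleinman(A, B, Q, R, K0=np.array([[1.0, 1.0]]))
```

The iteration converges to the stabilizing solution of the continuous-time ARE, which is exactly the fixed point that Algorithm 2 approximates on the averaged cluster variables.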

7.2. Cluster-wise decentralized state feedback design

Fig. 9. Convergence of K and P for the standard SP system (OFRL).

Considering the same multi-agent example, we first perform the ADP-based learning of the controller when the clusters are fully decoupled (i.e., the ideal decentralized scenario). Each area
is equipped with an aggregator. Note that the average of all the
cluster states represents the decoupled slow state yαd for cluster
α . The state evolution of one representative area is shown in
Fig. 6. We consider similar coupling strengths between the agents
inside all the clusters with Q = 10, R = 1 but with different
initial conditions. The computed scalar control gain for each area
is K = 3.1623, and the corresponding objective values are J̄ 1 =
1.317, J̄ 2 = 0.745, J̄ 3 = 1.765, J̄ 4 = 0.8451 and J̄ 5 = 0.5244.
Thereafter, the decentralized ADP computation is performed
on the actual system following Algorithm 2. The average states
from each cluster are used as the feedback signal for the ADP
computation block as shown in Fig. 1(b). Fig. 7 shows the fast
convergence of the ADP iterations. With Q = 10, R = 1 for all the
areas, the cluster-wise decentralized control gains are computed
as K 1 = 3.139, K 2 = 3.195, K 3 = 3.130, K 4 = 3.187, K 5 =
3.173, with the objective values as J 1 = 1.308, J 2 = 0.754, J 3 =
1.7478, J 4 = 0.8524 and J 5 = 0.5261. In Fig. 8, we can see that as Q α , α = 1, . . . , 5, increases, the dynamic performance of the agent states improves. The dynamic performance of different cluster states can be controlled independently
using different Q for the different areas. The learning time is also
decreased because of the reduced number of feedback variables.
The exploration is performed for only 0.2 s.
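The aggregation and back-projection used in this design (cf. (30) and the aggregator of Fig. 1(b)) reduce, for scalar agents, to two structured matrices: a block row-averaging matrix T that produces the cluster averages yα, and M = M̄ ⊗ Is with M̄ α = 1|Iα| that copies each learned cluster input ũα to all of that cluster's agents. A sketch for the 25-agent, 5-cluster example (equal cluster sizes assumed; the gain value is illustrative, close to the scalar gains reported above):

```python
import numpy as np

n_clusters, size = 5, 5             # 5 clusters of |I_alpha| = 5 agents each
n_agents = n_clusters * size        # 25 scalar agents (I_s = 1)

# Aggregation: y^alpha is the mean of the states in cluster alpha.
T = np.kron(np.eye(n_clusters), np.ones((1, size)) / size)   # 5 x 25
# Back-projection M = M_bar (x) I_s with M_bar = blkdiag(1_{|I_alpha|}).
M = np.kron(np.eye(n_clusters), np.ones((size, 1)))          # 25 x 5

x = np.random.default_rng(1).standard_normal(n_agents)  # agent states
y = T @ x                  # cluster averages, R^5
u_tilde = -3.16 * y        # reduced-dimensional input (illustrative gain)
u = M @ u_tilde            # back-projected input for all 25 agents
```

Since T M = Ir, averaging the back-projected input recovers ũ exactly, which is what lets the reduced-dimensional learner operate entirely on the r cluster averages.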

Fig. 10. Slow state trajectories for the standard SP system (OFRL).

7.3. Output feedback RL (OFRL) design

We first consider the singularly perturbed system as in Section 7.1 with ϵ = 0.01 and initial condition [1, 2, 1, 0]. We consider C = [1, 1, 0, 0; 0, 0, 1, 1]. The learning time step is 0.01 s. Data is gathered for 0.7 s with the system being persistently excited with exploration noise. Fig. 9 shows the convergence of P and K during the ADP-based computations using the estimated states. Fig. 10 and Fig. 11 show the actual versus estimated state trajectories using the NN observer. For the design of the NN observer, the

Fig. 11. Fast state trajectories for the standard SP system (OFRL).

Fig. 12. Comparison with state feedback for the ϵ = 0 system (OFRL).

Fig. 13. Learning with full state estimates for the clustered network.

Fig. 14. Learning with slow state estimates for the clustered network.

Hurwitz matrix Â is considered to be of SP structure but different than the original state matrix. We can see from Figs. 11–12 that the estimation error is small, and the ADP controller using these estimates maintains closed-loop stability. Also, Fig. 12 compares the output feedback control responses with the ideal (ϵ = 0) state feedback responses.

We next consider the 5-cluster, 25-agent consensus network. For the estimator design, the Hurwitz matrix Â is taken to be of similar structure as A, but the coupling between the agents in the same cluster is 20% off from the original, while the inter-cluster strengths are 50% off from the original. For the full-order system, Fig. 13 shows a few examples of the state estimation, where the learning takes approximately 20 s. In the reduced-order design, using the NN observer estimates, the aggregator generates the average states for each cluster. These average states and inputs are used for the reduced-order ADP iterations. Fig. 14 shows that the reduced-order design using the NN observer requires approximately 1 s of exploration time. The savings in sample complexity (i.e., minimum learning time) here will be the same as that for the state feedback case once the observer converges. The CPU run time for the reduced-dimensional design is only 13.82 s compared to that for the full-scale design, which is 298 s.

8. Conclusion

The paper presented RL-based optimal control designs incorporating ideas from model reduction following from time-scale separation properties in LTI systems. Both state feedback and output feedback RL designs are reported. The designs are extended to clustered multi-agent networks, for which an additional cluster-wise block-decentralized RL control is also discussed. Sub-optimality and stability analyses for each design are performed using SP approximation theorems. For the state feedback designs only the SP approximation error affects the sub-optimality, whereas for the output feedback designs the state estimation error adds to it. Results are validated using multiple simulation case studies.

Appendix A. Proof of Theorem 4

We know that ∥ŷ − y(t)∥ ≤ b; therefore, by use of slack variables, we can write ŷ = y(t) + b1r − ∆b (t). Also, we have ∥Kk+1 − K̄k+1 ∥ ≤ ρ1 , implying Kk+1 = K̄k+1 + ρ1 I − ∆ρ . Let us denote b2 = b1r − ∆b (t), ρ2 (b, ϵ) = ρ1 I − ∆ρ . The feedback control is given by u = −Kk+1 ŷ = −Kk+1 (y + b2 ), which turns (3) into

ẏ = A11 y + A12 z + B1 (−Kk+1 (y + b2 )), (A.1)
ϵ ż = A21 y + A22 z + B2 (−Kk+1 (y + b2 )). (A.2)

We next re-derive the slow subsystem by substituting ϵ = 0. The slow manifold is given as zs = −A22−1 (A21 − B2 Kk+1 )ys + A22−1 B2 Kk+1 b2 . Therefore, the slow-subsystem dynamics using Kk+1 = K̄k+1 + ρ2 follows as

ẏs = (As − Bs K̄k+1 + ρ3 )ys + ρ4 (t), (A.3)

where As = A11 − A12 A22−1 A21 , Bs = B1 − A12 A22−1 B2 , ρ3 = −Bs ρ2 and ρ4 (t) = −Bs (K̄k+1 b2 (t) + ρ2 (t)b2 (t)). Here, ρ4 (t) acts as a disturbance to the dynamics ẏs = (As − Bs K̄k+1 + ρ3 )ys . Therefore, we investigate stability by analyzing the disturbance-free dynamics. One can consider the dynamics ẏs = (As − Bs K̄k+1 + ρ3 )ys as a perturbed version of the nominal dynamics ẏs = (As − Bs K̄k+1 )ys . Considering a Lyapunov function Vk (t) = ysT P̄k ys , and computing its time-derivative along ẏs = (As − Bs K̄k+1 )ys , we get

V̇k (t) = ysT [P̄k (As − Bs K̄k+1 ) + (As − Bs K̄k+1 )T P̄k ]ys ,

which, using the proof of Theorem 2, can be shown to reduce to

V̇k (t) = −ysT [(K̄k − K̄k+1 )T R(K̄k − K̄k+1 )]ys − ysT [Q + K̄kT+1 RK̄k+1 ]ys . (A.4)

With Q ≻ 0, the closed loop will be asymptotically stable. The dynamics ẏs = (As − Bs K̄k+1 + ρ3 )ys is basically ẏs = (As − Bs K̄k+1 )ys perturbed by ρ3 ys , vanishing at ys = 0. If the estimation error is small with sufficiently small ϵ, then we will have a sufficiently small upper bound ∥ρ3 ∥ ≤ ρ̄3 , and the vanishing perturbation g(t , ys ) = ρ3 ys will satisfy ∥g(t , ys )∥ ≤ ρ̄3 ∥ys ∥. With these considerations, we apply Khalil (2002, Lemma 9.1) and conclude that ys = 0 is exponentially stable for a sufficiently small ϵ and state estimation error. The disturbance ρ4 (t) depends on the state estimation error bound b and the controller gain Kk+1 . With arbitrarily small estimation error, the norm of the disturbance can be bounded by a sufficiently small upper bound ∥ρ4 (t)∥ ≤ ρ̄4 . □

Appendix B. Proof of Theorem 7

We first show that the learned decentralized control gain Kkα+1 can stabilize the decoupled yd dynamics when ϵ is small with a sufficiently large Q. Let the area-wise control be uα = −M α Kkα+1 yα . Therefore, u = −MKk+1 y, where Kk+1 = diag(Kk1+1 , . . . , Kkr+1 ). From Theorem 6, we have Kkα+1 = K̄kα+1 + O(ϵ), implying Kk+1 = K̄k+1 + O(ϵ). Using the learned gains Kk+1 for the decoupled dynamics with F1 = In ⊗ F we get ẏd = F1 yd − B̃1 M K̄k+1 yd − O(ϵ)yd . Next, consider the Lyapunov function Vk (t) = ydT P̄k yd with P̄k ≻ 0, and its time derivative along ẏd as

V̇k (t) = ydT [P̄k (F1 − B̃1 M K̄k+1 − O(ϵ)) + (F1 − B̃1 M K̄k+1 − O(ϵ))T P̄k ]yd . (B.1)

Using the ARE ATk P̄k + P̄k Ak = −(K̄kT RK̄k + Q ) with Ak = F1 − B̃1 M K̄k , and K̄k+1 = R−1 M T B̃1T P̄k , it can be shown that V̇k (t) becomes

V̇k (t) = −ydT [(K̄k − K̄k+1 )T R(K̄k − K̄k+1 ) + Q + K̄kT+1 RK̄k+1 − O(ϵ)]yd . (B.2)

We conclude that, with a sufficiently small ϵ, if Q = diag(Q 1 , . . . , Q r ) has a sufficiently large λmin (Q ) > 0, then V̇k (t) will be negative definite, stabilizing the decoupled dynamics.

Next, consider the reduced slow sub-system dynamics of the actual system. Using the learned feedback in (25) we get

dy/dts = ((Ir ⊗ F )/ϵ)y + H11 y + H12 z + (B̃1 /ϵ)(−MKy), (B.3)
ϵ dz/dts = ϵ H21 y + (H̃2 + ϵ H22 )z + B̃2 (−MKy), (B.4)

where H̃2 = H2 + (In−r ⊗ F ). By substituting ϵ = 0 and using the slow manifold variable zs = −H̃2−1 B̃2 (−MKys ), we obtain the reduced sub-system as

dys /dts = ((Ir ⊗ F )/ϵ)ys + H11 ys + (B̃1 /ϵ − H12 H̃2−1 B̃2 )(−MKys ).

Reverting back to the original time-scale ts = ϵ t, we get

dys /dt = (Ir ⊗ F )ys + ϵ H11 ys + (B̃1 − ϵ H12 H̃2−1 B̃2 )(−MKys )
        = (Ir ⊗ F )ys − B̃1 MKys + ϵ (H11 + H12 H̃2−1 B̃2 MK )ys . (B.5)

The dynamics (B.5) can be viewed as the decoupled dynamics ẏd = F1 yd − B̃1 MKyd perturbed by an O(ϵ) term vanishing at ys = 0. The vanishing perturbation term, given by g(t , ys ) = ϵ H̃ys with H̃ = H11 + H12 H̃2−1 B̃2 MK, satisfies ∥g(t , ys )∥ ≤ ϵ∥H̃ ∥∥ys ∥. With these considerations, we apply Khalil (2002, Lemma 9.1) and conclude that ys = 0 is exponentially stable for a sufficiently small ϵ. As the slow reduced sub-system model is the perturbed version of the decoupled model with the above-mentioned bound, the learned decentralized controller will exponentially stabilize the slow sub-system dynamics, which in turn stabilizes the entire system under the assumption that the fast sub-system is stable. □

References

Abdollahi, F., Talebi, H. A., & Patel, R. V. (2006). A stable neural network-based observer with application to flexible-joint manipulators. IEEE Transactions on Neural Networks, 17(1), 118–129.
Chow, J., & Kokotovic, P. (1976). A decomposition of near-optimum regulators for systems with slow and fast modes. IEEE Transactions on Automatic Control, 21(5), 701–705.
Chow, J., & Kokotovic, P. (1985). Time scale modeling of sparse dynamic networks. IEEE Transactions on Automatic Control, 30(8), 714–722.
Jiang, Y., & Jiang, Z.-P. (2012). Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica, 48, 2699–2704.
Jiang, Y., & Jiang, Z.-P. (2017). Robust adaptive dynamic programming. Wiley-IEEE Press.
Khalil, H. (2002). Nonlinear systems. Prentice-Hall, New York.
Khalil, H., & Kokotovic, P. (1978). Control strategies for decision makers using different models of the same system. IEEE Transactions on Automatic Control, 23(2), 289–298.
Kleinman, D. (1968). On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control, 13(1), 114–115.
Kokotovic, P., O'Malley, R., & Sannuti, P. (1976). Singular perturbations and order reduction in control theory: An overview. Automatica, 12, 123–132.
Lewis, F., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Liu, D., & Wei, Q. (2014). Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 25(3), 621–634.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2018). On model-free reinforcement learning of reduced-order optimal control for singularly perturbed systems. In IEEE conference on decision and control 2018. Miami, FL, USA.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2019). Block-decentralized model-free reinforcement learning control of two time-scale networks. In American control conference 2019. Philadelphia, PA, USA.
Mukherjee, S., Bai, H., & Chakrabortty, A. (2020). Reduced-dimensional reinforcement learning control using singular perturbation approximations. arXiv preprint arXiv:2004.14501.
Sutton, R., & Barto, A. (1998). Reinforcement learning - An introduction. Cambridge: MIT Press.
Vamvoudakis, K. (2017). Q-learning for continuous-time linear systems: A model-free infinite horizon optimal control approach. Systems & Control Letters, 100, 14–20.
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., & Lewis, F. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45, 477–484.
Wu, H., & Luo, B. (2012). Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H∞ control. IEEE Transactions on Neural Networks and Learning Systems, 23(12), 1884–1895.

Sayak Mukherjee received the B.E. degree in Electrical Engineering from Jadavpur University, India, with the medal for second highest percentage, in 2015, and the Ph.D. in Electrical Engineering from NC State University, USA, in 2020. He recently joined Pacific Northwest National Laboratory as a Post-doctorate Research Associate. His research interests include learning-based optimal control techniques and scalable reinforcement learning for dynamic systems, with applications related to power and energy systems.

He Bai received his B.Eng. degree from the Department of Automation at the University of Science and Technology of China, Hefei, China, in 2005, and the M.S. and Ph.D. degrees in Electrical Engineering from Rensselaer Polytechnic Institute (RPI) in 2007 and 2009, respectively. From 2009 to 2010, he was a Post-doctoral Researcher at Northwestern University, Evanston, IL. From 2010 to 2015, he was a Senior Research and Development Scientist at UtopiaCompression Corporation. In 2015, he joined the Mechanical and Aerospace Engineering Department at Oklahoma State University as an assistant professor. He has published over 80 peer-reviewed journal and conference papers related to control and robotics, and a research monograph, "Cooperative control design: a systematic passivity-based approach", in Springer. He holds one patent on monocular passive ranging. His research interests include reinforcement learning, distributed optimization and learning, multi-agent systems, and autonomous systems.

Aranya Chakrabortty received the Ph.D. degree in Electrical Engineering from Rensselaer Polytechnic Institute, Troy, NY, in 2008. From 2008 to 2009 he was a postdoctoral research associate at the University of Washington, Seattle. From 2009 to 2010 he was an Assistant Professor of Electrical Engineering at Texas Tech University. In 2010 he joined the Electrical and Computer Engineering Department at North Carolina State University, where he is currently a Professor. His research interests are in all branches of control theory with applications in electric power systems. He received the NSF CAREER award in 2011.