
J Control Autom Electr Syst (2015) 26:134–148
DOI 10.1007/s40313-015-0167-5

Adaptive Critic Design Using Policy Iteration Technique for LTI Systems: A Comprehensive Performance Analysis

Lal Bahadur Prasad · Barjeev Tyagi · Hari Om Gupta

Received: 5 February 2014 / Revised: 25 December 2014 / Accepted: 5 January 2015 / Published online: 7 February 2015
© Brazilian Society for Automatics–SBA 2015

Abstract  This paper presents adaptive critic design (ACD) using the policy iteration technique for adaptive optimal control of continuous-time linear time-invariant (LTI) dynamical systems. A comprehensive performance analysis of the control scheme is presented. ACD using the online policy iteration technique provides an adaptive optimal control solution for an infinite horizon problem subject to the real-time dynamics of a continuous-time system. In general, ACD involves a critic and an actor, where the critic evaluates the performance of the present control policy and generates a critic signal to update the control action for performance improvement, and the actor provides the control input to the system being controlled. The policy iteration technique, which is based on the actor–critic structure, consists of a two-step iteration: policy evaluation and policy improvement. In this paper, the control scheme is applied to a practical example of LTI systems, the load frequency control of a power system. The modeling, simulation results, and analysis are presented for the system models both without and with integral control. A comparative performance investigation of the adaptive critic control scheme and the linear quadratic regulator is also presented.

Keywords  Actor–critic · Adaptive critic · Adaptive optimal control · Integral control · LFC · LTI system · Policy iteration

L. B. Prasad · B. Tyagi · H. O. Gupta
Department of Electrical Engineering, Indian Institute of Technology Roorkee, Uttarakhand 247667, India
e-mail: erlbprasad@gmail.com; ibpeedee@iitr.ac.in
B. Tyagi, e-mail: btyagfee@iitr.ac.in
H. O. Gupta, e-mail: harifee@iitr.ac.in

1 Introduction

Control systems are decision-making systems designed to provide autonomy to dynamic systems with the desired system response and performance. It is a challenging task to design a control algorithm that is simple and yet guarantees stability and robustness of the closed-loop system in real situations. The performance of controlled systems is desired to be optimal, and this should remain valid when the controller is applied in the real situation. Many optimization and optimal control techniques, as well as adaptation and adaptive control techniques, are available in the literature for linear and nonlinear dynamical systems. Thus, to have both features in one control design, it is desirable to design online adaptive optimal control.


Recently, several researchers have explored intelligent computational techniques together with adaptive and optimal control design, applying certain methodologies to certain applications (Behera and Kar 2009; Bhuvaneswari et al. 2009). Reinforcement learning (RL) (Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Wong and Lee 2010) is an approach to adaptive optimal control, strongly related to direct and indirect adaptive optimal control methods from a theoretical point of view. There are basically two ways of solving the associated optimal control problem: one is Pontryagin's minimum principle and the other is Bellman's dynamic programming (DP) (Lewis and Vrabie 2009). However, the solution of the Hamilton–Jacobi–Bellman (HJB) equation associated with DP has a computational complexity that grows exponentially with the number of state variables, a problem known as the "curse of dimensionality" (Wang et al. 2009; Ferrari and Stengel 2002), and it is an off-line process in which the problem is solved from the end point and approached in the backward direction. To overcome these issues, in 1977, Werbos proposed Adaptive/Approximate Dynamic Programming (ADP) (Lewis and Vrabie 2009; Wang et al. 2009; Murray et al. 2002; Huang et al. 2013; Nosair et al. 2010). Combining the concepts of ADP, RL, and backpropagation (Prokhorov and Wunsch II 1997), he introduced an approach for ADP called adaptive critic designs (ACDs) as a way of solving dynamic programming problems forward in time. ACD utilizes two parametric structures known as the actor and the critic. The actor parameterizes the control policy. The critic approximates a value-related cost function and captures the effect that the control law will have on the future cost, which describes the performance of the control system. At any given time, the critic provides guidance to improve the control policy, and the actor in turn is used to update the critic. There are several versions of ACDs present in the literature (Lewis and Vrabie 2009; Ferrari and Stengel 2002; Huang et al. 2013; Prokhorov and Wunsch II 1997; Hanselmann et al. 2007; Lin 2011; Padhi et al. 2001; Deb et al. 2007; Kulkarni and Venayagamoorthy 2010; Vamvoudakis and Lewis 2010; Vrabie et al. 2009, 2007; Modares et al. 2013; Li and Liu 2012; Luo and Wu 2012; Vrabie and Lewis 2008, 2009; Padhi et al. 2006; Kumar et al. 2007; Gurrala et al. 2009; Prem et al. 2009). Werbos defined actor–critic online learning algorithms to solve the optimal control problem based on Value Iteration (VI) and defined a family of VI algorithms as Adaptive Dynamic Programming (ADP) algorithms. He used a critic neural network for value function approximation and an actor neural network for control policy approximation. Generalized Policy Iteration is a family of optimal learning techniques which has policy iteration (PI) at one extreme (Lewis and Vamvoudakis 2010; Vamvoudakis and Lewis 2010). Policy Iteration (PI) algorithms consist of a two-step iteration: policy evaluation and policy improvement. Instead of solving the HJB equation by a direct approach, the PI algorithm starts by evaluating the cost of a given initial admissible control policy, which is often accomplished by solving a nonlinear Lyapunov equation. This cost is then used to obtain an updated, improved control policy which will have a lower associated cost. This is often accomplished by minimizing a Hamiltonian function with respect to the updated cost (Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Vamvoudakis and Lewis 2010; Vrabie et al. 2009, 2007; Modares et al. 2013; Luo and Wu 2012; Vrabie and Lewis 2008, 2009). This is the so-called 'greedy policy' with respect to the updated cost (Vamvoudakis and Lewis 2010). These two steps of policy evaluation and policy improvement are repeated until the policy improvement step no longer changes the actual policy, thus converging to the optimal control. It is noted that the infinite horizon cost can be evaluated only in the case of admissible and stabilizing control policies. Admissibility is in fact a condition on the control policy which is used to initialize the algorithm (Vamvoudakis and Lewis 2010). The PI algorithm requires an initial stabilizing control policy, but VI does not require it. The ACD using the policy iteration technique performs adaptive optimal control without using complete knowledge of the system dynamics. The online policy iteration algorithm solves the optimal control problem along a single state trajectory, does not require knowledge of the system internal dynamics, and thus can be viewed as a direct adaptive optimal control technique. Unlike regular adaptive controllers, which rely on online identification of the system dynamics followed by model-based controller design, the policy iteration method relies on identification of the cost function associated with a given control policy followed by policy improvement in the sense of minimizing the identified cost.

Recently, certain works have described ACDs for various control applications applying certain approaches (Bhuvaneswari et al. 2009; Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Wong and Lee 2010; Wang et al. 2009; Ferrari and Stengel 2002; Huang et al. 2013; Nosair et al. 2010; Prokhorov and Wunsch II 1997; Hanselmann et al. 2007; Lin 2011; Padhi et al. 2001; Deb et al. 2007; Kulkarni and Venayagamoorthy 2010; Vamvoudakis and Lewis 2010; Vrabie et al. 2009, 2007; Modares et al. 2013; Li and Liu 2012; Luo and Wu 2012; Vrabie and Lewis 2008, 2009; Padhi et al. 2006; Kumar et al. 2007; Gurrala et al. 2009; Prem et al. 2009). ACDs are described for discrete-time systems in Lewis and Vrabie (2009), Lewis and Vamvoudakis (2010), Wang et al. (2009), Huang et al. (2013), Nosair et al. (2010), Padhi et al. (2001), Li and Liu (2012), Padhi et al. (2006), Kumar et al. (2007), Gurrala et al. (2009), for continuous-time systems in Bhuvaneswari et al. (2009), Lewis and Vrabie (2009), Lewis and Vamvoudakis (2010), Wang et al. (2009), Ferrari and Stengel (2002), Prokhorov and Wunsch II (1997), Hanselmann et al. (2007), Lin (2011), Deb et al. (2007), Vamvoudakis and Lewis (2010), Vrabie et al. (2009), Vrabie et al. (2007), Modares et al. (2013), Luo and Wu (2012), Vrabie and Lewis (2008), Vrabie and Lewis (2009), Prem et al. (2009), and for stochastic systems in Wong and Lee (2010). Adaptive optimal control using various approaches is presented in Bhuvaneswari et al. (2009), Lewis and Vrabie (2009), Lewis and Vamvoudakis (2010), Wong and Lee (2010), Wang et al. (2009), Ferrari and Stengel (2002), Murray et al. (2002), Huang et al. (2013), Nosair et al. (2010), Prokhorov and Wunsch II (1997), Hanselmann et al. (2007), Lin (2011), Padhi et al. (2001), Deb et al. (2007), Kulkarni and Venayagamoorthy (2010), Vamvoudakis and Lewis (2010), Vrabie et al. (2009), Vrabie et al. (2007), Modares et al. (2013), Li and Liu (2012), Luo and Wu (2012), Vrabie and Lewis (2008), Vrabie and Lewis (2009), Padhi et al. (2006), Kumar et al. (2007), Gurrala et al. (2009), Prem et al. (2009). Adaptive optimal control by ACD using neural networks in the actor–critic configuration and the policy iteration technique for solving the online optimal control problem without making use of explicit knowledge of the internal dynamics is presented for linear systems in Vamvoudakis and Lewis (2010), Vrabie et al. (2009), Vrabie et al. (2007), Luo and Wu (2012), and for nonlinear systems in Vamvoudakis and Lewis (2010), Modares et al. (2013), Vrabie and Lewis (2008), Vrabie and Lewis (2009). The PI algorithm on an actor–critic structure using neural networks for online optimal control of continuous-time systems in the presence of constraints due to actuator saturation is presented in Modares et al. (2013), where a suitable non-quadratic functional is used to encode the constraints into the optimization formulation. Optimal control using the value iteration technique for discrete-time affine nonlinear systems is presented in Li and Liu (2012). Applications of the 'single network adaptive critic (SNAC)' are presented for nonlinear systems in Padhi et al. (2006), Kumar et al. (2007), Gurrala et al. (2009), and for linear systems in Kumar et al. (2007). ACD using Takagi–Sugeno (T–S) fuzzy systems for optimal control of continuous-time input-affine nonlinear systems is presented in Prem et al. (2009). ACD using an SVM-based tree-type neural network as the critic is presented in Deb et al. (2007). ACD using a PSO-based actor and a neural network-based critic is presented in Kulkarni and Venayagamoorthy (2010). Thus the state of the art shows that ACD methods provide effective adaptive optimal control of dynamical systems.

Adaptive critic designs applying certain approaches to various control applications, for both linear and nonlinear systems and in both discrete-time and continuous-time frameworks, have been presented recently in the literature. Although the adaptive critic has been successfully implemented in several real-life problems, the network architectures, which could give much useful information to control engineers, are not analyzed much in the literature. Also, the performance investigation of the adaptive critic control scheme with practical applications is not explored much in the literature. Even though certain recent papers have appeared on ACDs and the policy iteration technique with certain applications, a comprehensive presentation of the concept with practical applications is much desired. This paper contributes by investigating comprehensively the ACD using the policy iteration technique for continuous-time LTI systems. The performance of the proposed control scheme is investigated for the cases of a structural change in the system dynamics introduced by including integral control, and of a change in system parameters in the real situation at any moment of time. A comparative performance investigation of the adaptive critic control scheme and the linear quadratic regulator is also presented. In this paper, the proposed control scheme is applied to a practical example of LTI systems, the load frequency control (LFC) of a power system, for the system models both without and with integral control. The modeling, analysis, and simulation results are presented. That the proposed approach is partially model-free is demonstrated by simulating also with a change in system parameters at a certain instant of time.

This paper is organized in six sections. Section 1 presents the relevance and general introduction. Section 2 describes the infinite horizon optimal control design using the adaptive critic scheme. The policy iteration technique with a convergence proof of the algorithm is discussed in Sect. 3. Section 4 discusses the adaptive optimal control using the online policy iteration technique for continuous-time LTI systems. In Sect. 5, the application of the control scheme to a practical example of LTI systems, the LFC of a power system, is presented. The system modeling, simulation results, and analysis are presented for the system models both without and with integral control. The conclusion is presented in Sect. 6. At the end, a brief list of references is given.

2 Adaptive Critic Design

2.1 Infinite Horizon Optimal Control of Continuous-Time LTI Systems

The infinite horizon optimal control [i.e., Linear Quadratic Regulator (LQR)] problem for continuous-time LTI systems is presented in this section (Behera and Kar 2009; Hanselmann et al. 2007; Vrabie et al. 2009, 2007).

Consider the continuous-time linear time-invariant (LTI) dynamical system described by

\dot{x}(t) = A x(t) + B u(t)   (1)

where x(t) ∈ R^n, u(t) ∈ R^m and (A, B) is stabilizable, subject to the optimal control problem

u^*(t) = \arg\min_{u(t),\; t_0 \le t \le \infty} V(t_0, x(t_0), u(t))   (2)

where the infinite horizon quadratic cost function to be minimized is expressed as

V(x(t_0), t_0) = \int_{t_0}^{\infty} \left( x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau) \right) d\tau   (3)

with Q ≥ 0, R > 0 and (Q^{1/2}, A) detectable.

The solution of this optimal control problem, determined by Bellman's optimality principle, is given by

u(t) = -K x(t), \quad K = R^{-1} B^T P   (4)

where the matrix P is the unique positive definite solution of the Algebraic Riccati Equation (ARE)

A^T P + P A - P B R^{-1} B^T P + Q = 0   (5)

Equation (4) gives a stabilizing closed-loop controller determined from the unique positive semi-definite solution of the ARE under the detectability condition. Here, to solve (5), both the system matrix A and the control input matrix B must be known, i.e., complete knowledge of the system dynamics is required. For this reason, developing algorithms that converge to the solution of the optimization problem without performing prior system identification and without using explicit models of the system dynamics is of particular interest from the control systems point of view.

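For later comparison with the adaptive critic results of Sect. 5, the model-based LQR solution (4)–(5) can be computed offline when both A and B are known. The following minimal sketch is illustrative only and is not part of the original study; it assumes the NumPy/SciPy stack, and the placeholder matrices stand in for any stabilizable pair (A, B), for example the LFC matrices of Sect. 5.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder LTI system (any stabilizable pair (A, B) may be substituted).
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state weighting, Q >= 0
R = np.eye(1)          # input weighting, R > 0

# Unique stabilizing solution of the ARE (5): A'P + PA - P B R^{-1} B'P + Q = 0
P = solve_continuous_are(A, B, Q, R)

# Optimal state-feedback gain (4): u = -Kx, K = R^{-1} B' P
K = np.linalg.solve(R, B.T @ P)

print("P =", P)
print("K =", K)
print("closed-loop poles:", np.linalg.eigvals(A - B @ K))
```

This model-based solution is what Sect. 5 denotes by RicP and RicK; the policy iteration scheme developed next reaches the same P and K without making use of A.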

2.2 Continuous-Time Adaptive Critic Scheme

In ACD, (4) and (5) are represented by two parametric function approximation networks, namely the action network (actor) and the critic network (critic), respectively. The action network, which provides the control signals, represents the relationship between the state and the input. The critic network, which learns the desired performance index for some performance index/cost function, represents the relationship between the state and the costate vector. The critic evaluates the performance of the actor, and the actor is improved based on the feedback from the critic network. These two functional networks, approximating the HJB equation, successively adapt to determine the optimal control solution for a system. In general, ACD uses incremental optimization combined with a parametric structure to efficiently approximate the optimal cost and control. In ACD, a short-term cost metric is optimized in a way that ensures optimization of the cost over all future times.

In general, neural networks are used as the parametric structures in ACD (Bhuvaneswari et al. 2009; Lewis and Vrabie 2009; Wang et al. 2009; Ferrari and Stengel 2002; Murray et al. 2002; Lin 2011; Padhi et al. 2001; Modares et al. 2013; Vrabie and Lewis 2008, 2009). Other choices for parametric function approximation in ACD are fuzzy systems (Prem et al. 2009), support vector machines (SVM) (Deb et al. 2007), and particle swarm optimization (PSO) (Kulkarni and Venayagamoorthy 2010). A simplified version of the adaptive critic architecture which uses only one network instead of the two required in a standard ACD is the 'single network adaptive critic (SNAC)' using neural networks (Padhi et al. 2006; Kumar et al. 2007; Gurrala et al. 2009). ACDs function as supervised learning systems and reinforcement learning systems (Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Wong and Lee 2010; Prokhorov and Wunsch II 1997). Policy iteration (Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Wong and Lee 2010; Vamvoudakis and Lewis 2010; Vrabie et al. 2009, 2007; Modares et al. 2013; Luo and Wu 2012; Vrabie and Lewis 2008, 2009) and value iteration algorithms (Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Vamvoudakis and Lewis 2010; Li and Liu 2012; Luo and Wu 2012) are other options of ACD for solving the online optimal control problem for linear systems (Vamvoudakis and Lewis 2010; Vrabie et al. 2009, 2007; Li and Liu 2012; Luo and Wu 2012) and nonlinear systems (Vamvoudakis and Lewis 2010; Modares et al. 2013; Li and Liu 2012; Vrabie and Lewis 2008, 2009).

3 Policy Iteration Technique

In this section, the online policy iteration technique, which gives the optimal control solution of the LQR problem without using knowledge of the system internal dynamics (i.e., the system matrix A), is presented. It gives an adaptive controller which converges to the state feedback optimal controller. The policy iteration technique (Lewis and Vrabie 2009; Lewis and Vamvoudakis 2010; Wong and Lee 2010; Vamvoudakis and Lewis 2010; Vrabie et al. 2009, 2007; Modares et al. 2013; Luo and Wu 2012; Vrabie and Lewis 2008, 2009) is based on an actor–critic structure and consists of a two-step iteration: critic update and actor update. For a given stabilizing controller, the critic computes the associated infinite horizon cost. The actor computes the control policy and is represented by its parameters (i.e., the feedback controller gain) (Vrabie et al. 2009, 2007).

3.1 Policy Iteration Algorithm

Let K be a stabilizing gain for (1), under the assumption that (A, B) is stabilizable, such that \dot{x} = (A - BK)x is a stable closed-loop system. Then the corresponding infinite horizon quadratic cost is given by

V(x(t)) = \int_{t}^{\infty} x^T(\tau) (Q + K^T R K) x(\tau) \, d\tau = x^T(t) P x(t)   (6)

where P is the real symmetric positive definite solution of the Lyapunov matrix equation

(A - BK)^T P + P (A - BK) = -(K^T R K + Q)   (7)

and V(x(t)) serves as a Lyapunov function for (1) with controller gain K. The cost function (6) can be written as

V(x(t)) = \int_{t}^{t+T} x^T(\tau) (Q + K^T R K) x(\tau) \, d\tau + V(x(t+T))   (8)

Based on (8), denoting x(t) by x_t, with the parameterization V(x_t) = x_t^T P x_t and considering an initial stabilizing control gain K_1, the following two-step online policy iteration algorithm can be implemented:

1. Policy evaluation

x_t^T P_i x_t = \int_{t}^{t+T} x_\tau^T (Q + K_i^T R K_i) x_\tau \, d\tau + x_{t+T}^T P_i x_{t+T}   (9)

2. Policy improvement

K_{i+1} = R^{-1} B^T P_i   (10)

Equations (9) and (10) formulate a new policy iteration algorithm. It is important to note that this algorithm does not require the system matrix A for its solution; only the control input matrix B must be known for updating K.

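Before the measurement-driven implementation is developed in Sect. 4, the iteration (9)–(10) can be checked offline in its model-based form, in which the policy-evaluation step reduces to the Lyapunov equation (7). The sketch below is a minimal illustration under that assumption (full knowledge of A, which the online algorithm of Sect. 4 avoids); the tolerance, iteration limit, and example matrices are arbitrary choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def policy_iteration_model_based(A, B, Q, R, K0, tol=1e-9, max_iter=50):
    """Offline check of the PI loop: policy evaluation via the Lyapunov
    equation (7)/(13), policy improvement via (10). K0 must be stabilizing."""
    K = K0
    P_prev = None
    for _ in range(max_iter):
        Ai = A - B @ K
        # Policy evaluation: Ai' P + P Ai = -(Q + K' R K)
        P = solve_continuous_lyapunov(Ai.T, -(Q + K.T @ R @ K))
        # Policy improvement: K_{i+1} = R^{-1} B' P_i
        K = np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.max(np.abs(P - P_prev)) < tol:
            break
        P_prev = P
    return P, K

# Placeholder open-loop-stable example, so K0 = 0 is an admissible start.
A = np.array([[-1.0, 2.0], [0.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
P_pi, K_pi = policy_iteration_model_based(A, B, Q, R, K0=np.zeros((1, 2)))
P_are = solve_continuous_are(A, B, Q, R)
print(np.max(np.abs(P_pi - P_are)))   # should be near machine precision
```

Because the example open-loop system is stable, the zero gain is an admissible initial policy; for an unstable plant any stabilizing K0 would serve.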

3.2 Convergence of Policy Iteration Algorithm

The convergence of the policy iteration algorithm is discussed in this subsection with reference to the lemmas, remarks, and theorems in Vrabie et al. (2009), Vrabie et al. (2007).

Let A_i = A - B K_i; then for the system \dot{x} = A_i x, a Lyapunov function may be V_i(x_t) = x_t^T P_i x_t, ∀ x_t, and

\frac{d(x_t^T P_i x_t)}{dt} = x_t^T (A_i^T P_i + P_i A_i) x_t = -x_t^T (K_i^T R K_i + Q) x_t   (11)

Then, ∀ T > 0, from (11) we have

\int_{t}^{t+T} x_\tau^T (Q + K_i^T R K_i) x_\tau \, d\tau = -\int_{t}^{t+T} \frac{d(x_\tau^T P_i x_\tau)}{d\tau} \, d\tau = x_t^T P_i x_t - x_{t+T}^T P_i x_{t+T}   (12)

which is the same as (9). From (11) the Lyapunov equation is

A_i^T P_i + P_i A_i = -(K_i^T R K_i + Q)   (13)

For a stabilizing control policy K_i, the matrix A_i is stable and K_i^T R K_i + Q > 0, so there exists a unique solution P_i > 0 of the Lyapunov equation (13). Thus, if A_i is asymptotically stable, the solution of (9) is the unique solution of (13), and hence (9) and (13) are equivalent. Although the same solution is obtained whether solving (13) or (9), (9) can be solved without using any knowledge of the system matrix A. Thus, the PI algorithm (9) and (10) is equivalent to iterating between (13) and (10), without using knowledge of the system internal dynamics, provided that A_i is stable at each iteration.

Let the control policy K_i be stabilizing with the associated cost V_i(x_t) = x_t^T P_i x_t. For the state trajectories generated while using the controller K_{i+1}, take the positive definite cost function V_i(x_t) as a Lyapunov function candidate. Taking the derivative of V_i(x_t) along the trajectories generated by K_{i+1}, one obtains

\dot{V}_i(x_t) = x_t^T [P_i (A - B K_{i+1}) + (A - B K_{i+1})^T P_i] x_t = x_t^T [P_i (A - B K_i) + (A - B K_i)^T P_i] x_t + x_t^T [P_i B (K_i - K_{i+1}) + (K_i - K_{i+1})^T B^T P_i] x_t   (14)

Using the policy update given by (10) and completing the squares, the second term can be written as

x_t^T [K_{i+1}^T R (K_i - K_{i+1}) + (K_i - K_{i+1})^T R K_{i+1}] x_t = x_t^T [-(K_i - K_{i+1})^T R (K_i - K_{i+1}) - K_{i+1}^T R K_{i+1} + K_i^T R K_i] x_t

Using (13), the first term in the equation can be written as -x_t^T [K_i^T R K_i + Q] x_t, and summing up the two terms one obtains

\dot{V}_i(x_t) = -x_t^T [(K_i - K_{i+1})^T R (K_i - K_{i+1})] x_t - x_t^T [Q + K_{i+1}^T R K_{i+1}] x_t   (15)

Thus, under the initial assumptions from the problem setup, Q ≥ 0, R > 0, V_i(x_t) is a Lyapunov function proving that the updated control policy u = -K_{i+1} x is stabilizing with K_{i+1} given by (10); thus, if (10) is used for updating the control policy, then the new control policy will be stabilizing. It is therefore concluded that if the initial control policy given by K_1 is stabilizing, then all policies obtained using the iteration (9) and (10) will be stabilizing policies.

Let Ric(P_i) be the matrix-valued function defined as

Ric(P_i) = A^T P_i + P_i A - P_i B R^{-1} B^T P_i + Q   (16)

and let Ric'_{P_i} denote the Fréchet derivative of Ric(P_i) taken with respect to P_i. The matrix function Ric'_{P_i} evaluated at a given matrix M is thus

Ric'_{P_i}(M) = (A - B R^{-1} B^T P_i)^T M + M (A - B R^{-1} B^T P_i).

Equations (13) and (10) can be compactly written as

A_i^T P_i + P_i A_i = -(P_{i-1} B R^{-1} B^T P_{i-1} + Q)   (17)

Subtracting A_i^T P_{i-1} + P_{i-1} A_i on both sides gives

A_i^T (P_i - P_{i-1}) + (P_i - P_{i-1}) A_i = -(P_{i-1} A + A^T P_{i-1} - P_{i-1} B R^{-1} B^T P_{i-1} + Q)   (18)

which is Newton's method

P_i = P_{i-1} - (Ric'_{P_{i-1}})^{-1} Ric(P_{i-1})   (19)

Thus, the iteration between (9) and (10) is equivalent to Newton's method formulation (19) by use of the introduced notations Ric(P_i) and Ric'_{P_i}. Newton's method, i.e., the iteration (13) and (10), conditioned by an initial stabilizing policy, will converge to the solution of the ARE, and if the initial policy is stabilizing, all the subsequent control policies will be stabilizing. This proven equivalence between (13) and (10), and (9) and (10), shows that the online policy iteration algorithm will converge to the solution of the optimal control problem (2) with the infinite horizon quadratic cost (3) without using knowledge of the internal dynamics of the controlled system (1). Thus, under the assumptions of stabilizability of (A, B) and detectability of (Q^{1/2}, A), with Q ≥ 0, R > 0 in the cost index (3), the policy iteration (9) and (10), conditioned by an initial stabilizing controller, converges to the optimal control solution given by (4), where the matrix P satisfies the ARE (5).

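The equivalence between one pass of (13)–(10) and the Newton step (19) can also be verified numerically. The following sketch is an illustration only, with arbitrary placeholder matrices: the Newton increment is obtained by solving the Lyapunov equation Ric'_{P_{i-1}}(Δ) = Ric(P_{i-1}), and the result is compared with the P_i produced directly by the policy-evaluation step (13).

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 2.0], [0.0, -3.0]])   # placeholder system
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
Rinv = np.linalg.inv(R)

def ric(P):                                 # Ric(P), Eq. (16)
    return A.T @ P + P @ A - P @ B @ Rinv @ B.T @ P + Q

P_prev = np.eye(2)                          # some symmetric iterate P_{i-1}
K = Rinv @ B.T @ P_prev                     # K_i from (10)
Ai = A - B @ K

# One policy-evaluation step (13): Ai' P_i + P_i Ai = -(K' R K + Q)
P_lyap = solve_continuous_lyapunov(Ai.T, -(K.T @ R @ K + Q))

# One Newton step (19): solve Ai' Delta + Delta Ai = Ric(P_{i-1}),
# then P_i = P_{i-1} - Delta
Delta = solve_continuous_lyapunov(Ai.T, ric(P_prev))
P_newton = P_prev - Delta

print(np.max(np.abs(P_lyap - P_newton)))    # agrees to machine precision
```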
Thus, the only requirement for convergence to the optimal controller consists in an initial stabilizing policy that will guarantee a finite value for the cost V_1(x_t) = x_t^T P_1 x_t. Under the assumption that the system to be controlled is stabilizable and that implementation of an optimal state feedback controller is possible and desired, it is reasonable to assume that a stabilizing (though not optimal) state feedback controller is available to begin the iteration. In fact, in many cases, the system to be controlled is itself stable, such that the initial controller can be chosen as zero.

4 Adaptive Optimal Control Using Online Policy Iteration Technique Based Adaptive Critic Scheme

This section presents the online implementation of the policy iteration algorithm based adaptive optimal control without using knowledge of the system internal dynamics. The implementation of the PI algorithm only needs knowledge of the B matrix, which explicitly appears in (10). The system matrix A is not required for the computation of either of the two steps of the PI algorithm, as that information is embedded in the states x(t) and x(t+T) which are observed online.

Associated with the policy K_i, to find the critic parameters (matrix P_i) of the cost function in (9), the term x^T(t) P_i x(t) is written as

x^T(t) P_i x(t) = \bar{p}_i^T \bar{x}(t)   (20)

where \bar{x}(t) denotes the Kronecker product quadratic polynomial basis vector with the elements \{x_i(t) x_j(t)\}_{i=1,n;\; j=i,n} and \bar{p} = v(P), with v(\cdot) a vector-valued matrix function that acts on symmetric matrices and returns a column vector by stacking the elements of the diagonal and upper triangular part of the symmetric matrix, where the off-diagonal elements are taken as 2P_{ij} (Vrabie et al. 2009, 2007). Using (20), (9) is rewritten as

\bar{p}_i^T (\bar{x}(t) - \bar{x}(t+T)) = \int_{t}^{t+T} x^T(\tau) (Q + K_i^T R K_i) x(\tau) \, d\tau   (21)

In this equation, \bar{p}_i is the vector of unknown parameters and \bar{x}(t) - \bar{x}(t+T) acts as a regression vector. The right-hand side target function is denoted by d(\bar{x}(t), K_i) (also known as the reinforcement on the time interval [t, t+T]),

d(\bar{x}(t), K_i) \equiv \int_{t}^{t+T} x^T(\tau) (Q + K_i^T R K_i) x(\tau) \, d\tau

and is measured based on the system states over the time interval [t, t+T]. System (1) is augmented by introducing a new state V(t) defined by \dot{V}(t) = x^T(t) Q x(t) + u^T(t) R u(t), so the value of d(\bar{x}(t), K_i) can be measured by taking two measurements of this newly introduced system state, since d(\bar{x}(t), K_i) = V(t+T) - V(t). This new state signal is the output of an analog integration block having as inputs the quadratic terms x^T(t) Q x(t) and u^T(t) R u(t), which can also be obtained using an analog processing unit.

The parameter vector \bar{p}_i of the function V_i(x_t) (i.e., the critic), which will then yield the matrix P_i, is found by minimizing, in the least squares sense, the error between the target function d(\bar{x}(t), K_i) and the parameterized left-hand side of (21). Evaluating the right-hand side of (21) at N \ge n(n+1)/2 (the number of independent elements in the matrix P_i) points \bar{x}^i in the state space, over the same time interval T, the least squares solution is obtained as

\bar{p}_i = (X X^T)^{-1} X Y   (22)

where X = [\bar{x}^1_\Delta \; \bar{x}^2_\Delta \; \ldots \; \bar{x}^N_\Delta], \bar{x}^i_\Delta = \bar{x}^i(t) - \bar{x}^i(t+T), and Y = [d(\bar{x}^1, K_i) \; d(\bar{x}^2, K_i) \; \ldots \; d(\bar{x}^N, K_i)]^T.

The least squares problem can be solved in real time after a sufficient number of data points are collected along a single state trajectory, under the presence of an excitation requirement. Alternatively, (22) can also be solved using recursive estimation algorithms (e.g., gradient descent algorithms or the recursive least squares (RLS) algorithm), in which case a persistence of excitation condition is required. For this reason, there are no real issues related to the algorithm becoming computationally expensive with the increase of the state space dimension (Vrabie et al. 2009, 2007).

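A short sketch may make the parameterization (20)–(22) concrete. It is illustrative only: the helper names (quad_basis, p_to_vec, vec_to_p) are hypothetical, and the data matrices X and Y are assumed to have already been built from measured state snapshots and the corresponding observed reinforcements d(x̄, K_i).

```python
import numpy as np

def quad_basis(x):
    """Kronecker-product quadratic basis x̄ with elements {x_i x_j}, j >= i."""
    n = len(x)
    return np.array([x[i] * x[j] for i in range(n) for j in range(i, n)])

def p_to_vec(P):
    """v(P): stack diagonal and upper-triangular entries, off-diagonals as 2*P_ij."""
    n = P.shape[0]
    return np.array([(1.0 if i == j else 2.0) * P[i, j]
                     for i in range(n) for j in range(i, n)])

def vec_to_p(p, n):
    """Inverse of v(.): rebuild the symmetric matrix P from the parameter vector."""
    P = np.zeros((n, n))
    k = 0
    for i in range(n):
        for j in range(i, n):
            P[i, j] = P[j, i] = p[k] if i == j else p[k] / 2.0
            k += 1
    return P

# With X holding columns x̄(t_k) - x̄(t_k + T) and Y the measured reinforcements,
# the batch least-squares critic update (22) reads:
#   p_i = np.linalg.solve(X @ X.T, X @ Y)     # p̄_i = (X X')^{-1} X Y
#   P_i = vec_to_p(p_i, n)
```

The factor of 2 on the off-diagonal entries of v(P) is what makes p̄^T x̄ reproduce x^T P x exactly for a symmetric P.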

In the case in which the cost function (9) is solved for in a single step (e.g., using a method such as the exact least squares described by (22)), the online algorithm has the same quadratic convergence speed as Newton's method. For the case in which the solution of (9) is obtained iteratively, the convergence speed of the online algorithm will decrease. In this case, at each step of the PI algorithm [which involves solving (9) and (10)], a recursive gradient descent algorithm, which most often has exponential convergence, will be used for solving (9). Thus, the convergence speed of the online algorithm will depend on the chosen technique for solving (9). Even though the convergence property of the online algorithm is not affected by the value of the sample time T, it affects the excitation condition necessary in the setup of a numerically well-posed least squares problem and in obtaining the least squares solution (22). More precisely, assuming without loss of generality that the matrix X in (22) is square, and letting ε > 0 be a desired lower bound on the determinant of X, the chosen sampling time T must satisfy

T > \frac{a}{\sum_{I=1}^{n} |\lambda_I(A_c)|}   (23)

where λ_I denotes the eigenvalues of the closed-loop system and a > 0 is a scaling factor. From this point of view, a minimal insight relative to the system dynamics would be required for choosing the sampling time T (Vrabie et al. 2009, 2007).

The online PI algorithm requires only measurements of the states at the discrete moments in time t and t+T, as well as knowledge of the observed cost over the time interval [t, t+T], which is d(\bar{x}(t), K_i). Therefore, knowledge of the system matrix A is not required for the cost evaluation or the control policy update, and only the matrix B is required for the control policy update using (10), which makes the tuning algorithm only partially model-free. The PI algorithm converges to the optimal control solution measuring the cost along a single state trajectory, provided that there is enough initial excitation in the system. Since the algorithm iterates only on stabilizing policies, which will make the system states go to zero, sufficient excitation in the initial state of the system is necessary. In the case that excitation is lost prior to obtaining convergence (the system reaches the equilibrium point), a new experiment needs to be conducted having as a starting point the last policy from the previous experiment. In this case, the control policy is updated at time t+T, after observing the state x(t+T), and it is used for controlling the system during the time interval [t+T, t+2T]. The critic stops updating the control policy when the difference between the system performances evaluated at two consecutive steps crosses below a designer-specified limit, i.e., when the algorithm has converged to the optimal controller. In the case that this error becomes bigger than the specified limit, the critic again starts tuning the actor parameters to obtain an optimal control policy. If there is a sudden change in the system dynamics described by the matrix A, then as long as the present controller is stabilizing for the new matrix A, the algorithm will converge to the solution of the corresponding new ARE. Thus the algorithm is suitable for online implementation from the control theory point of view.

Figure 1 (Vrabie et al. 2009, 2007) shows the schematic block diagram of the adaptive optimal control with actor–critic structure for an LTI system. Since the system is augmented with an extra state V(t) that is part of the adaptive critic control scheme, this controller is actually a dynamic controller with the cost state. This adaptive optimal controller has a hybrid structure with a continuous-time internal state followed by a sampler and a discrete-time update rule. The application of the proposed control scheme is presented in the following Sect. 5.

Fig. 1 Adaptive optimal control with actor–critic structure

5 System Modeling, Simulation Results and Analysis

This section presents the system modeling, simulation results, and analysis to demonstrate the application of ACD using the online PI technique for adaptive optimal control of LTI systems, considering a practical system, the LFC of a power system.


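Putting the pieces together, the hybrid structure of Fig. 1 can be emulated in simulation as sketched below. This is a minimal illustration rather than the authors' implementation: it reuses the hypothetical helpers quad_basis and vec_to_p from the earlier sketch, uses a numerical integrator in place of the analog cost-integration block, and assumes sufficient excitation in the initial state. The matrix A appears only inside the simulated plant, exactly as it would in a real experiment.

```python
import numpy as np
from scipy.integrate import solve_ivp

def online_policy_iteration(A, B, Q, R, K0, x0, T=0.05, n_steps=60):
    """Online PI on measured data: A is used only to simulate the plant."""
    n = A.shape[0]
    N = n * (n + 1) // 2                  # samples per least-squares update
    K, x = K0.copy(), x0.copy()
    X_cols, Y, P = [], [], np.zeros((n, n))
    for _ in range(n_steps):
        # Simulate plant + cost state V̇ = x'Qx + u'Ru over [t, t+T] with u = -Kx
        def f(_, z):
            x_, u = z[:n], -K @ z[:n]
            return np.concatenate([A @ x_ + B @ u,
                                   [x_ @ Q @ x_ + u @ R @ u]])
        z = solve_ivp(f, (0.0, T), np.concatenate([x, [0.0]]),
                      rtol=1e-8, atol=1e-10).y[:, -1]
        x_next, d = z[:n], z[n]           # d = V(t+T) - V(t), the reinforcement
        X_cols.append(quad_basis(x) - quad_basis(x_next))
        Y.append(d)
        x = x_next
        if len(Y) == N:                   # critic update (22), then actor update (10)
            Xm = np.column_stack(X_cols)
            p = np.linalg.solve(Xm @ Xm.T, Xm @ np.array(Y))
            P = vec_to_p(p, n)
            K = np.linalg.solve(R, B.T @ P)
            X_cols, Y = [], []
    return P, K
```

For a three-state plant such as the LFC model of Sect. 5.2 (n = 3), N = 6 reinforcement samples are collected per policy update, which is consistent with the sample time T = 0.05 s and the 60-sample simulation length used there.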
5.1 Load Frequency Control of Power System

The control of active power and frequency is known as load frequency control (LFC), which plays an important role in power system operation, where the main objective is to regulate the output power of each generator at prescribed levels while keeping the frequency fluctuations within predetermined limits (Elgerd 2004; Alomoush 2010). In large-scale interconnected power systems, the system frequency and the inter-area tie-line power are required to be as near to the scheduled values as possible. LFC is required to be robust to unknown external disturbances and to system model and parameter uncertainties. Many control strategies for power system LFC have been presented in the literature (Elgerd 2004; Alomoush 2010; Tan and Xu 2009; Wang et al. 1993, 1994).

In the following Sects. 5.2 and 5.3, respectively, the incremental linear system models of power system LFC without and with integral control are both considered, to investigate the performance of the proposed control scheme also under the system's structural change. The performance is also investigated for a change in system parameters at a certain instant of time, to demonstrate that the proposed approach is partially model-free.

5.2 LFC System Model Without Integral Control

The functional block diagram of the single-area power system LFC model without integral control is shown in Fig. 2 (Elgerd 2004; Alomoush 2010).

The state space model of the system is derived as follows:

\dot{x}(t) = A x(t) + B u(t) + F \Delta P_d(t)   (24)
y = \Delta f(t) = C x(t)   (25)

where the state vector is x(t) = [\Delta f(t) \; \Delta P_g(t) \; \Delta X_g(t)]^T; \Delta f(t) is the incremental frequency deviation (Hz); \Delta P_g(t) is the incremental change in generator output (p.u. MW); \Delta X_g(t) is the incremental change in governor valve position (p.u. MW); \Delta P_d(t) is the load disturbance (p.u. MW); R is the speed regulation due to governor action (Hz p.u. MW^{-1}); T_G is the governor time constant (s); T_T is the turbine time constant (s); T_P is the plant model time constant (s); K_P is the plant gain; and

A = [-1/T_P, K_P/T_P, 0; 0, -1/T_T, 1/T_T; -1/(R T_G), 0, -1/T_G],
B = [0; 0; 1/T_G],
F = [-K_P/T_P; 0; 0],
C = [1, 0, 0]

The range of the LFC system parameters (Alomoush 2010; Tan and Xu 2009; Wang et al. 1993, 1994) is

1/T_P ∈ [0.033, 0.1], K_P/T_P ∈ [4, 12], 1/T_T ∈ [2.564, 4.762], 1/T_G ∈ [9.615, 17.857], 1/(R T_G) ∈ [3.081, 10.639]

Considering values of the system parameters around the above range, let us take

A = [-0.0665, 11.5, 0; 0, -2.5, 2.5; -9.5, 0, -13.7360],
B = [0; 0; 13.7360],
F = [-11.5; 0; 0]

The LFC system transfer function is given by

G(s) = \frac{\Delta f(s)}{u(s)} = \frac{394.9}{s^3 + 16.3 s^2 + 35.42 s + 275.4}

Fig. 2 Block diagram of LFC of power system

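The numerical model of Sect. 5.2 can be reproduced directly from the values above. The sketch below is illustrative only and assumes SciPy's signal module; it rebuilds the state-space matrices and checks the open-loop transfer function from u to Δf.

```python
import numpy as np
from scipy.signal import ss2tf

# Single-area LFC model (24)-(25) with the parameter values used in Sect. 5.2
A = np.array([[-0.0665, 11.5,     0.0],
              [ 0.0,    -2.5,     2.5],
              [-9.5,     0.0, -13.736]])
B = np.array([[0.0], [0.0], [13.736]])
F = np.array([[-11.5], [0.0], [0.0]])        # load-disturbance input ΔPd (not in u -> Δf path)
C = np.array([[1.0, 0.0, 0.0]])              # output y = Δf

num, den = ss2tf(A, B, C, np.zeros((1, 1)))
print(num, den)   # ≈ 394.9 / (s^3 + 16.3 s^2 + 35.42 s + 275.4)
```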
For implementation of the PI algorithm, the initial conditions for the states and the cost function, and the critic parameters, are taken as x0 = [0, 0.1, 0, 0]; P = [0 0 0; 0 0 0; 0 0 0]. The length of the simulation is taken as 60 samples, with sample time T = 0.05 s. The cost function parameters Q and R are taken as identity matrices of appropriate dimensions. The unique positive definite solution of the ARE (5), denoted here by the matrix RicP, and the adaptive optimal critic matrix P of the adaptive critic scheme using PI in (9) and (10) with (22), respectively, are obtained as

RicP = [0.3600, 0.5313, 0.0367; 0.5313, 1.5662, 0.1690; 0.0367, 0.1690, 0.0500],
P = [0.3673, 0.5357, 0.0367; 0.5357, 1.5967, 0.1733; 0.0367, 0.1733, 0.0507]

and the actor gain of the LQR design by (4) and (5), denoted here by RicK, and the actor gain K of the adaptive critic scheme using PI in (10), respectively, are obtained as

RicK = [0.5044, 2.3212, 0.6867], K = [0.5040, 2.3811, 0.6962]

The eigenvalues of the closed-loop system are obtained as

−19.9774, −2.9441 + 3.9285i, −2.9441 − 3.9285i

Simulation with a change in system parameters is also done at sample k = 21 (i.e., t = 1.05 s), such that A(2,2) = −4 and A(2,3) = 4; then the solution is obtained as

RicP = [0.3006, 0.3466, 0.0370; 0.3466, 0.7568, 0.1243; 0.0370, 0.1243, 0.0532],
P = [0.3673, 0.5357, 0.0367; 0.5357, 1.5967, 0.1733; 0.0367, 0.1733, 0.0507]

and the actor gains RicK and K, respectively, are obtained as

RicK = [0.5077, 1.7077, 0.7305], K = [0.5040, 2.3811, 0.6962]

The eigenvalues of the closed-loop system are obtained as

−16.5154, −5.4251 + 4.1489i, −5.4251 − 4.1489i

Fig. 3 Responses with ACD using PI technique for LFC without integral control model: a system states, b control signal, c evolution of poles of closed loop system, d critic parameters, e updating of critic parameters, f evolution of poles of closed loop system with change in system parameters at sample k = 21 (i.e., t = 1.05 s), A(2,2) = −4 and A(2,3) = 4

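As a quick sanity check (again only an illustrative sketch, not part of the original study), the nominal closed-loop poles reported above follow directly from the converged actor gain:

```python
import numpy as np

A = np.array([[-0.0665, 11.5,     0.0],
              [ 0.0,    -2.5,     2.5],
              [-9.5,     0.0, -13.736]])
B = np.array([[0.0], [0.0], [13.736]])
K = np.array([[0.5040, 2.3811, 0.6962]])   # converged actor gain from the PI scheme

print(np.linalg.eigvals(A - B @ K))
# expected ≈ -19.98, -2.94 ± 3.93j (the nominal closed-loop poles reported above)
```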

Figure 3 shows the simulation responses of the LFC model (24) and (25) with ACD using the PI technique. Figure 3a shows the system state trajectories, which converge towards the equilibrium point. Figure 3b shows the control signal trajectory, which also converges towards zero. Figure 3c shows the evolution of the closed-loop poles of the system during the simulation. Figure 3d shows the convergence of the critic parameters of the matrix P towards the optimal values. Figure 3e shows the updating of the P parameters with iteration; here a * at one indicates an update, and a * at zero indicates no update. The simulation responses for the case of a change in system parameters are similar to the above, and just the evolution of the poles changes as the controller adapts. The evolution of the poles of the closed-loop system for this case is shown in Fig. 3f.

Figure 4 presents the closed-loop response of the LFC system without integral control using both approaches, LQR and adaptive critic (AC). It is observed that the closed-loop response with a unit step load disturbance ΔPd(t) has a steady-state error. The adaptive optimal controller using ACD gives responses similar to those of the standard LQR. Figure 5 shows the closed-loop unit step response of the load frequency for the LFC system without integral control and with a change in system parameters, using both approaches, LQR and adaptive critic (AC). It is observed here that the closed-loop response with a unit step load disturbance ΔPd(t) obtained with the ACD is unaffected by the change in system parameters and remains at the same value as before, whereas with LQR it is affected by the change in system parameters. Thus, the controller using ACD adapts to the change in the system internal dynamics.

Fig. 4 Closed loop unit step response of load frequency for LFC system without integral control model

5.3 LFC System Model with Integral Control

The functional block diagram of the single-area power system LFC model with integral control is shown in Fig. 6 (Alomoush 2010; Tan and Xu 2009; Wang et al. 1993, 1994). Introducing an integral control of Δf(t) in the LFC system dynamics, the state space model of the system is derived by modifying (24) and (25) as follows (Alomoush 2010; Tan and Xu 2009; Wang et al. 1993, 1994). The state ΔE(t), the incremental change in the integral control, is included in the state vector x(t) as x(t) = [Δf(t) ΔP_g(t) ΔX_g(t) ΔE(t)]^T, and ΔE(t) may be defined by

\Delta E(t) = K_E \int_0^t \Delta f(\tau) \, d\tau   (26)

to ensure the regulation property of Δf(t), i.e.,

\Delta \dot{E}(t) = K_E \Delta f(t)   (27)

where K_E is the integral control gain, and

A = [-1/T_P, K_P/T_P, 0, 0; 0, -1/T_T, 1/T_T, 0; -1/(R T_G), 0, -1/T_G, -1/T_G; K_E, 0, 0, 0],
B = [0; 0; 1/T_G; 0],
F = [-K_P/T_P; 0; 0; 0],
C = [1, 0, 0, 0]

Considering values of the system parameters around the above range, let us take

A = [-0.0665, 11.5, 0, 0; 0, -2.5, 2.5, 0; -9.5, 0, -13.7360, -13.7360; 0.6, 0, 0, 0],
B = [0; 0; 13.7360; 0],
F = [-11.5; 0; 0; 0]

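The structural change of Sect. 5.3 amounts to appending the integral state ΔE(t) to the three-state model of Sect. 5.2. A minimal sketch of that augmentation, assuming the numerical values listed above (including K_E = 0.6, as implied by the A(4,1) entry), is given below.

```python
import numpy as np

# Three-state LFC model from Sect. 5.2
A3 = np.array([[-0.0665, 11.5,     0.0],
               [ 0.0,    -2.5,     2.5],
               [-9.5,     0.0, -13.736]])
B3 = np.array([[0.0], [0.0], [13.736]])
K_E = 0.6                                   # integral control gain

# Augmented model (26)-(27): x = [Δf, ΔPg, ΔXg, ΔE], with ΔĖ = K_E Δf
# and the integral state feeding back into the governor channel.
A4 = np.zeros((4, 4))
A4[:3, :3] = A3
A4[2, 3] = -13.736                          # -1/TG entry acting on ΔE
A4[3, 0] = K_E
B4 = np.vstack([B3, [[0.0]]])

print(np.linalg.eigvals(A4))                # open-loop poles of the augmented plant
```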
Fig. 5 Closed loop unit step response of load frequency for LFC system without integral control model and with change in system parameters at sample k = 21 (i.e., t = 1.05 s), A(2,2) = −4, A(2,3) = 4

Fig. 6 Block diagram of LFC of power system with integral control

In this case, the LFC system transfer function is given by

G(s) = \frac{\Delta f(s)}{u(s)} = \frac{394.9 s + 2.677 \times 10^{-29}}{s^4 + 16.3 s^3 + 35.42 s^2 + 275.4 s + 236.9}

For implementation of the PI algorithm, the initial conditions for the states and the cost function, and the critic parameters, are taken as x0 = [0, 0.1, 0, 0, 0]; P = [0 0 0 0; 0 0 0 0; 0 0 0 0; 0 0 0 0]. The length of the simulation is taken as 100 samples, with sample time T = 0.05 s. The cost function parameters Q and R are taken as identity matrices of appropriate dimensions. The unique positive definite solution of the ARE (5), denoted here by the matrix RicP, and the adaptive optimal critic matrix P of the adaptive critic scheme using PI in (9) and (10) with (22), respectively, are obtained as

RicP = [0.4600, 0.6911, 0.0519, 0.4642; 0.6911, 1.8668, 0.2002, 0.5800; 0.0519, 0.2002, 0.0533, 0.0302; 0.4642, 0.5800, 0.0302, 2.2106],
P = [0.5428, 0.7621, 0.0552, 0.5619; 0.7621, 2.2278, 0.2504, 0.6393; 0.0552, 0.2504, 0.0610, 0.0302; 0.5619, 0.6393, 0.0302, 2.3280]

and the actor gain of the LQR design by (4) and (5), denoted here by RicK, and the actor gain K of the adaptive critic scheme using PI in (10), respectively, are obtained as

RicK = [0.7135, 2.7499, 0.7323, 0.4142], K = [0.7587, 3.4394, 0.8372, 0.4142]

The eigenvalues of the closed-loop system are obtained as

−20.1027, −3.4914 + 3.3266i, −3.4914 − 3.3266i, −0.7168

Simulation with a change in a system parameter is also done at sample k = 81 (i.e., t = 4.05 s), such that A(4,1) = 0.8; then the solution is obtained as

RicP = [0.4981, 0.7486, 0.0573, 0.4837; 0.7486, 1.9694, 0.2106, 0.5884; 0.0573, 0.2106, 0.0544, 0.0302; 0.4837, 0.5884, 0.0302, 1.7894],
P = [0.5428, 0.7621, 0.0552, 0.5619; 0.7621, 2.2278, 0.2504, 0.6393; 0.0552, 0.2504, 0.0610, 0.0302; 0.5619, 0.6393, 0.0302, 2.3280]

and the actor gains RicK and K, respectively, are obtained as

RicK = [0.7869, 2.8934, 0.7473, 0.4142], K = [0.7587, 3.4394, 0.8372, 0.4142]

The eigenvalues of the closed-loop system are obtained as

−20.0826, −1.0625, −3.3286 + 3.1399i, −3.3286 − 3.1399i

Figure 7 shows the simulation responses of the LFC model with integral control with ACD using the PI technique. Figure 7a shows the system state trajectories, which converge towards the equilibrium point. Figure 7b shows the control signal trajectory, which also converges towards zero. Figure 7c shows the evolution of the closed-loop poles of the system during the simulation. Figure 7d shows the convergence of the critic parameters of the matrix P towards the optimal values. Figure 7e shows the updating of the P parameters with iteration; here a * at one indicates an update, and a * at zero indicates no update. The simulation responses for the case of a change in a system parameter are similar to the above, and just the evolution of the poles changes as the controller adapts. Figure 7f shows the evolution of the closed-loop poles in this case.

Figure 8 presents the closed-loop unit step response of the LFC system with integral control using both approaches, LQR and adaptive critic (AC). It is observed that there is no steady-state error in the closed-loop responses, which is due to including the integral control in the LFC model. The adaptive optimal controller using ACD gives responses similar to those of the standard LQR.

Fig. 7 Responses with ACD using PI technique for LFC with integral control model: a system states, b control signal, c evolution of poles of closed loop system, d critic parameters, e updating of critic parameters, f evolution of poles of closed loop system with change in system parameter at sample k = 81 (i.e., t = 4.05 s), A(4,1) = 0.8


Fig. 8 Closed loop unit step response of load frequency for LFC system with integral control model

Fig. 9 Closed loop unit step response of load frequency for LFC system with integral control model and with change in system parameter at sample k = 81 (i.e., t = 4.05 s), A(4,1) = 0.8

Figure 9 presents the closed-loop response of the load frequency with a unit step load disturbance ΔPd(t) for the case of a change in a system parameter. It is observed here also that the response of the ACD remains exactly the same as before and is not affected by the change in the system parameter. The controller performs well, adapting to the change in system parameters. This demonstrates that the proposed control scheme is effective and robust.

In the above simulation results, it is observed that the critic parameter matrix P and the actor parameter K obtained from the control scheme of ACD using the PI technique converge adaptively to the optimal values, and that RicP and RicK, respectively, are mostly of the same values as those obtained from the LQR approach. Also, in the case of a change in a system parameter in the real situation, the controller adapts to it and converges to the same optimal values; thus the actor K and critic P parameters remain unchanged.

Analyzing the simulation results obtained for the LTI system in all the above-mentioned cases of system models without and with integral control, and with changes in system parameters, applying ACD using the online policy iteration technique, it is established that the proposed control scheme provides a promising adaptive optimal control solution for dynamical systems without complete knowledge of the system dynamics. The structural change introduced in the system dynamics by including integral control augments the system behavior to its credit by removing the steady-state error in the closed-loop responses. A structural change in the system will not be adapted to by the proposed controller, but it will adapt to a change in system parameters in the real situation at any moment of time. Thus this technique is partially model-free, effective, and robust.

6 Conclusion

Adaptive critic design using the online policy iteration technique gives an infinite horizon adaptive optimal control solution subject to the real-time dynamics of a continuous-time LTI system. In general, ACD involves a critic and an actor, where the critic evaluates the performance of the present control policy and generates a critic signal to update the control action for performance improvement, and the actor provides the control input to the system being controlled. The policy iteration technique, which is based on the actor–critic structure, consists of a two-step iteration: policy evaluation and policy improvement. The infinite horizon optimal solution using the ARE requires complete knowledge of the system dynamics; this requirement becomes only partial, i.e., partially model-free, using the online policy iteration technique based adaptive critic scheme. The proposed adaptive optimal controller solves the online continuous-time optimal control problem without using complete knowledge of the system dynamics. Knowledge of the system's internal dynamics (i.e., the matrix A) is not needed for the evaluation of the cost or the update of the control policy; only knowledge of the input matrix B is required for updating the control policy. The convergence of the proposed algorithm, under the condition of an initial stabilizing controller, to the solution of the state feedback optimal control problem has been established. In this paper, a comprehensive performance analysis of the proposed control scheme is presented. The application of the control scheme is implemented for a practical example of LTI systems, the LFC of a power system. The modeling, analysis, and simulation results are presented for the system models both without and with integral control. The simulation results justify the effectiveness and robustness of the control scheme based on ACD using the PI technique.

Acknowledgments This research received no specific grant from any funding agency in the public, commercial, or not-for-profit organizations. The first author is thankful to Madan Mohan Malaviya Engineering College Gorakhpur; QIP Centre, I.I.T. Roorkee; and AICTE, India for sponsoring and financing him for Ph.D. work under the QIP scheme.

References

Alomoush, M. (2010). Load frequency control and automatic generation control using fractional-order controllers. Electrical Engineering, 91, 357–368.
Behera, L., & Kar, I. (2009). Intelligent systems and control: Principles and applications (1st ed.). New Delhi: Oxford University Press.
Bhuvaneswari, N. S., Uma, G., & Rangaswamy, T. R. (2009). Adaptive and optimal control of a non-linear process using intelligent controllers. Applied Soft Computing, 9, 182–190.
Deb, A. K., Jayadeva, Gopal, M., & Chandra, S. (2007). SVM-based tree-type neural networks as a critic in adaptive critic designs for control. IEEE Transactions on Neural Networks, 18(4), 1016–1030.
Elgerd, O. I. (2004). Electric energy systems theory: An introduction (2nd ed., ch. 9). New Delhi: Tata McGraw-Hill Publishing Company Ltd.
Ferrari, S., & Stengel, R. F. (2002, May 8–10). An adaptive critic global controller. In Proceedings of 2002 American control conference, Alaska, USA, Vol. 4, pp. 2665–2670.
Gurrala, G., Sen, I., & Padhi, R. (2009). Single network adaptive critic design for power system stabilisers. IET Generation, Transmission & Distribution, 3(9), 850–858.
Hanselmann, T., Noakes, L., & Zaknich, A. (2007). Continuous-time adaptive critics. IEEE Transactions on Neural Networks, 18(3), 631–647.
Huang, Z., Ma, J., & Huang, H. (2013). An approximate dynamic programming method for multi-input multi-output nonlinear system. Optimal Control Applications and Methods, 34, 80–95.
Kulkarni, R. V., & Venayagamoorthy, G. K. (2010). Adaptive critics for dynamic optimization. Neural Networks, 23, 587–591.
Kumar, S., Padhi, R., & Behera, L. (2007, April 16–18). Direct adaptive control using single network adaptive critic. In Proceedings of 2007 IEEE international conference on system of systems engineering, SoSE '07, pp. 1–6.
Lewis, F. L., & Vamvoudakis, K. G. (2010, June 9–11). Optimal adaptive control for unknown systems using output feedback by reinforcement learning methods. In Proceedings of 8th IEEE international conference on control and automation (ICCA), Xiamen, China, pp. 2138–2145.
Lewis, F. L., & Vrabie, D. (2009). Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits and Systems Magazine, 9(3), 32–50.
Li, H., & Liu, D. (2012). Optimal control for discrete-time affine nonlinear systems using general value iteration. IET Control Theory & Applications, 6(18), 2725–2736.
Lin, C.-K. (2011). Radial basis function neural network-based adaptive critic control of induction motors. Applied Soft Computing, 11, 3066–3074.
Luo, B., & Wu, H.-N. (2012). Online policy iteration algorithm for optimal control of linear hyperbolic PDE systems. Journal of Process Control, 22, 1161–1170.
Modares, H., Sistani, M.-B. N., & Lewis, F. L. (2013). A policy iteration approach to online optimal control of continuous-time constrained-input systems. ISA Transactions, 52(5), 611–621.
Murray, J. J., Cox, C. J., Lendaris, G. G., & Saeks, R. (2002). Adaptive dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 32(2), 140–153.
Nosair, H., Yang, Y., & Lee, J. M. (2010). Min–max control using parametric approximate dynamic programming. Control Engineering Practice, 18, 190–197.
Padhi, R., Balakrishnan, S. N., & Randolph, T. (2001). Adaptive-critic based optimal neuro control synthesis for distributed parameter systems. Automatica, 37, 1223–1234.
Padhi, R., Unnikrishnan, N., Wang, X., & Balakrishnan, S. N. (2006). A single network adaptive critic (SNAC) architecture for optimal control synthesis for a class of nonlinear systems. Neural Networks, 19, 1648–1660.
Prem, K. P., Behera, L., Siddique, N. H., & Prasad, G. (2009, October 11–14). A T-S fuzzy based adaptive critic for continuous-time input affine nonlinear systems. In Proceedings of IEEE international conference on systems, man and cybernetics (SMC), pp. 4329–4334.

Prokhorov, D. V., & Wunsch II, D. C. (1997). Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5), 997–1007.
Tan, W., & Xu, Z. (2009). Robust analysis and design of load frequency controller for power systems. Electric Power Systems Research, 79, 846–853.
Vamvoudakis, K. G., & Lewis, F. L. (2010). Online actor–critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46, 878–888.
Vrabie, D., & Lewis, F. L. (2008, December 9–11). Adaptive optimal control algorithm for continuous-time nonlinear systems based on policy iteration. In Proceedings of 47th IEEE conference on decision and control, Cancun, Mexico, pp. 73–79.
Vrabie, D., & Lewis, F. (2009). Neural network approach to continuous-time direct adaptive optimal control for partially unknown nonlinear systems. Neural Networks, 22, 237–246.
Vrabie, D., Pastravanu, O., & Lewis, F. L. (2007, July 27–29). Policy iteration for continuous-time systems with unknown internal dynamics. In Proceedings of 15th mediterranean conference on control & automation, Athens, Greece, pp. T01–010.
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., & Lewis, F. L. (2009). Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica, 45, 477–484.
Wang, Y., Zhou, R., & Wen, C. (1993). Robust load-frequency controller design for power systems. IEE Proceedings–C, 140(1), 11–16.
Wang, Y., Zhou, R., & Wen, C. (1994). New robust adaptive load-frequency control with system parametric uncertainties. IEE Proceedings–Generation, Transmission and Distribution, 141(3), 184–190.
Wang, F.-Y., Zhang, H., & Liu, D. (2009). Adaptive dynamic programming: An introduction. IEEE Computational Intelligence Magazine, 4(2), 39–47.
Wong, W. C., & Lee, J. H. (2010). A reinforcement learning-based scheme for direct adaptive optimal control of linear stochastic systems. Optimal Control Applications and Methods, 31, 365–374.