
F.L. Lewis
Moncrief-O'Donnell Endowed Chair
Head, Controls & Sensors Group

Supported by:
NSF - Paul Werbos
ARO - Randy Zachery

Automation & Robotics Research Institute (ARRI)


The University of Texas at Arlington

Notes on Optimal Control


Draguna Vrabie

Talk available online at http://ARRI.uta.edu/acs

Mike Athans

PhD from Berkeley in 1961


BOOKS
Optimal Control (McGraw Hill, 1966)
Systems, Networks and Computation: Basic Concepts (McGraw Hill, 1972)
Multivariable Methods (McGraw Hill, 1974)

Third President of the IEEE Control Systems Society, 1972 to 1974

More than 40 PhD students - how many?

Awards
American Automatic Control Council's Donald P. Eckman Award "for outstanding
contributions to the field of automatic control" (first recipient of the award, 1964)
American Society for Engineering Education's Frederick E. Terman Award as "the
outstanding young electrical engineering educator" (first recipient, 1969)
American Control Council's Education Award for "outstanding contributions and
distinguished leadership in automatic control education" (second recipient, 1980)
Fellow of the IEEE (1973)
Fellow of the AAAS (1977)
Distinguished Member of the IEEE Control Systems Society (1983)
IEEE Control Systems Society's 1993 H.W. Bode Prize, including the delivery of the Bode
Plenary Lecture at the 1993 IEEE Conference on Decision and Control
American Automatic Control Council's Richard E. Bellman Control Heritage Award (1995)
"In Recognition of a Distinguished Career in Automatic Control; As a Leader and Champion
of Innovative Research; As a Contributor to Fundamental Knowledge in Optimal, Adaptive,
Robust, Decentralized and Distributed Control; and as a Mentor to his Students"

Athans & Falb, Optimal Control (McGraw Hill, 1966) - the first book on optimal control?

1. Use Neural Networks as Function Approximators in Adaptive Control


2. Optimal Adaptive Control gives new adaptive control algorithms

Two-layer feedforward static neural network (NN)


[Figure: two-layer NN. Inputs x1, x2, ..., xn pass through first-layer weights V^T to a hidden layer of L activation functions, then through second-layer weights W^T to the outputs y1, y2, ..., ym.]

Summation eqs

$y_i = \sum_{k=1}^{L} w_{ik}\,\sigma\Big(\sum_{j=1}^{n} v_{kj} x_j + v_{k0}\Big) + w_{i0}$

Matrix eqs

$y = W^T \sigma(V^T x)$

Two-layer NNs have the universal approximation property

Overcome Barron's fundamental accuracy limitation of 1-layer NNs
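As a rough illustration, a minimal sketch (assumed shapes, sigmoid activations) of the forward map above:

```python
# A minimal sketch of y = W^T sigma(V^T x). Thresholds v_k0, w_i0 are carried
# by augmenting the input and hidden vectors with a constant 1.
import numpy as np

def nn_output(x, V, W):
    """x: (n,) input; V: (n+1, L) first-layer weights; W: (L+1, m) output-layer weights."""
    x_aug = np.append(x, 1.0)                 # carries the first-layer thresholds v_k0
    z = 1.0 / (1.0 + np.exp(-V.T @ x_aug))    # hidden-layer activations sigma(V^T x)
    z_aug = np.append(z, 1.0)                 # carries the output-layer thresholds w_i0
    return W.T @ z_aug                        # y = W^T sigma(V^T x)
```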

Neural Network Robot Controller


Universal Approximation Property

[Figure: feedback-linearization NN controller. The NN estimate f̂(x) forms the nonlinear inner loop / feedforward loop; the tracking error e between the desired trajectory qd and the robot output q drives a PD tracking loop with gain Kv; a robust control term v(t) completes the input to the robot system.]

NN universal basis property means no regression matrix is needed


Extension of Adaptive Control to nonlinear-in-the-parameters systems

Problem: the approximator is nonlinear in the NN weights, so standard adaptive-control proof techniques do not work.

No regression matrix needed
Theorem 1 (NN Weight Tuning for Stability)
Let the desired trajectory qd(t) and its derivatives be bounded. Let the initial tracking error be within a certain allowable set U. Let $Z_M$ be a known upper bound on the Frobenius norm of the unknown ideal weights Z.
Take the control input as (NLIP - nonlinear in the parameters)

$\tau = \hat W^T \sigma(\hat V^T x) + K_v r - v$

with

$v(t) = -K_Z (\|\hat Z\|_F + Z_M)\, r .$

Let weight tuning be provided by (note the extra Jacobian terms from the NLIP network)

$\dot{\hat W} = F \hat\sigma r^T - F \hat\sigma' \hat V^T x\, r^T - \kappa F \|r\| \hat W$

$\dot{\hat V} = G x (\hat\sigma'^T \hat W r)^T - \kappa G \|r\| \hat V$

with any constant matrices $F = F^T > 0$, $G = G^T > 0$, and scalar tuning parameter $\kappa > 0$. Initialize the weight estimates as $\hat W = 0$, $\hat V =$ random.

Then the filtered tracking error r(t) and NN weight estimates $\hat W, \hat V$ are uniformly ultimately bounded. Moreover, arbitrarily small tracking error may be achieved by selecting large control gains $K_v$.
The first terms are backprop terms (Werbos); the $\kappa$ terms are robustifying terms (e-mod, sigma-mod, etc.).
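For concreteness, a hedged sketch (not the slide's own code) of one Euler integration step of these tuning laws, assuming sigmoid hidden units and ignoring the threshold rows:

```python
# One Euler step of the Theorem 1 tuning laws (sketch under stated assumptions).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tune_weights(W, V, x, r, F, G, kappa, dt):
    """W: (L, m), V: (n, L), x: (n,) NN input, r: (m,) filtered tracking error."""
    z = V.T @ x
    sig = sigmoid(z)                              # sigma-hat
    dsig = np.diag(sig * (1.0 - sig))             # sigma-hat' (diagonal Jacobian for sigmoids)
    rn = np.linalg.norm(r)
    # Wdot = F*sig*r' - F*sig'*(V'x)*r' - kappa*F*||r||*W
    W_dot = F @ (np.outer(sig, r) - dsig @ np.outer(z, r)) - kappa * rn * (F @ W)
    # Vdot = G*x*(sig'^T W r)' - kappa*G*||r||*V
    V_dot = G @ np.outer(x, dsig.T @ W @ r) - kappa * rn * (G @ V)
    return W + dt * W_dot, V + dt * V_dot
```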

Flexible & Vibratory Systems


Add an extra feedback loop
Two NN needed
Use passivity to show stability

[Figure: NN backstepping controller. NN #1 provides the estimate F̂1(x) in the nonlinear feedback-linearization (tracking) loop, and NN #2 provides F̂2(x) in the backstepping loop; the tracking loop uses gain Kr and a robust control term vi(t), and the backstepping loop generates the desired current id (through 1/KB1) for the motor dynamics driving the flexible-joint robot system, with desired trajectory qd and output q.]

Neural network backstepping controller for Flexible-Joint robot arm

Advantages over traditional Backstepping- no regression functions needed

Actuator Nonlinearities

Deadzone, saturation, backlash

NN in Feedforward Loop- Deadzone Compensation

[Figure: NN deadzone precompensator in the feedforward loop. The precompensator inverts the deadzone D(u) ahead of the mechanical system; the loop also contains an estimate f(x) of the nonlinear function, a small observer, the tracking error e, and gain Kv.]

Critic and actor NN tuning laws:

$\dot{\hat W}_i = T \sigma_i(U_i^T \hat w) r^T - T \hat W^T \hat\sigma'(U^T u) U^T - k_1 T \|r\| \hat W_i - k_2 T \|r\| \hat W_i \|\hat W_i\|$

$\dot{\hat W} = S \hat\sigma'(U^T u) U^T \hat W_i \sigma_i(U_i^T \hat w) r^T - k_1 S \|r\| \hat W$

The precompensator acts like a 2-layer NN, with enhanced backprop tuning!

Optimality in Biological Systems


Cell Homeostasis

The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.

Permeability control of the cell membrane

Cellular Metabolism

http://www.accessexcellence.org/RC/VL/GG/index.html

Optimality in Control Systems Design


Rocket Orbit Injection

R. Kalman 1960

Dynamics

$\dot r = w$

$\dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi$

$\dot v = -\frac{wv}{r} + \frac{F}{m}\cos\phi$

$\dot m = -F_m$
Objectives
Get to orbit in minimum time
Use minimum fuel
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf

Adaptive Control is generally not Optimal


Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want ONLINE ADAPTIVE OPTIMAL control.

Continuous-Time Optimal Control


System:  $\dot x = f(x,u)$

Cost:  $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

For a given control, the cost satisfies this equation (Hamiltonian):

$0 = \dot V + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$

In LQR, this is a Lyapunov equation.

Optimal cost (Bellman):

$0 = \min_{u(t)} \Big[ r(x,u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x,u) \Big], \qquad V^*(0) = 0$

Optimal control:

$h(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$

In LQR, this is a Riccati equation.

HJB equation:

$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T g R^{-1} g^T \frac{dV^*}{dx}, \qquad V^*(0) = 0$

Linear system, quadratic cost


System:  $\dot x = Ax + Bu$
Utility:  $r(x,u) = x^T Q x + u^T R u; \quad R > 0,\ Q \ge 0$

The cost is quadratic:

$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = x^T(t) P x(t)$

Optimal control (state feedback):

$u(t) = -R^{-1} B^T P x(t) = -L x(t)$

HJB equation is the algebraic Riccati equation (ARE):

$0 = PA + A^T P + Q - P B R^{-1} B^T P$
Full system dynamics must be known
Off-line solution
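For comparison, a small sketch of this off-line design using SciPy's standard ARE solver (the example plant matrices are assumed only for illustration):

```python
# Off-line LQR design: the ARE solution requires full knowledge of (A, B).
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

P = solve_continuous_are(A, B, Q, R)      # 0 = PA + A^T P + Q - P B R^{-1} B^T P
L = np.linalg.solve(R, B.T @ P)           # u = -Lx = -R^{-1} B^T P x
```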

CT Policy Iteration
To avoid solving HJB equation

Utility:  $r(x,u) = Q(x) + u^T R u$

Cost for any given u(t) satisfies the Lyapunov equation:

$0 = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$

Iterative solution:
Pick a stabilizing initial control.
Find the cost:

$0 = \Big(\frac{\partial V_k}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad V_k(0) = 0$

Update the control:

$h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Convergence proved by Saridis 1979 if the Lyapunov equation is solved exactly.
Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov equation.
Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.

Full system dynamics must be known


Off-line solution

LQR Policy iteration = Kleinman algorithm


1. For a given control policy $u = -L_k x$, solve for the cost:

$0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k$  (Lyapunov equation),  $A_k = A - B L_k$

2. Improve the policy:

$L_{k+1} = R^{-1} B^T P_k$

If started with a stabilizing control policy $L_0$, the matrix $P_k$ monotonically converges to the unique positive definite solution of the Riccati equation.
Every iteration step returns a stabilizing controller.
The system has to be known.
Kleinman 1968
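A minimal sketch of this model-based iteration (A and B known, $L_0$ assumed stabilizing), using SciPy's Lyapunov solver and Q in place of $C^T C$:

```python
# Kleinman's iteration: repeat (policy evaluation via Lyapunov eq., policy improvement).
import numpy as np
from scipy.linalg import solve_lyapunov

def kleinman(A, B, Q, R, L0, iters=20):
    L = L0
    for _ in range(iters):
        Ak = A - B @ L
        # 0 = Ak^T P + P Ak + Q + L^T R L   (policy evaluation)
        P = solve_lyapunov(Ak.T, -(Q + L.T @ R @ L))
        L = np.linalg.solve(R, B.T @ P)    # policy improvement
    return P, L
```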

Policy Iteration Solution


Policy iteration:

$(A - BB^T P_i)^T P_{i+1} + P_{i+1}(A - BB^T P_i) + P_i B B^T P_i + Q = 0$

This is in fact a Newton's method. Define

$\mathrm{Ric}(P) \equiv A^T P + P A + Q - P B B^T P$

Then Policy Iteration is

$P_{i+1} = P_i - (\mathrm{Ric}'_{P_i})^{-1}\,\mathrm{Ric}(P_i), \qquad i = 0, 1, \ldots$

with the Frechet derivative

$\mathrm{Ric}'_{P_i}(P) \equiv (A - BB^T P_i)^T P + P (A - BB^T P_i)$

Draguna Vrabie

Policy Iterations without Lyapunov Equations


Dynamic programming is built on Bellman's optimality principle; an alternative form holds for CT systems [Athans & Falb 1966, Lewis & Syrmos 1995]:

$V^*(x(t)) = \min_{u(\tau),\ t \le \tau < t+\Delta t} \Big[ \int_t^{t+\Delta t} r(x(\tau), u(\tau))\,d\tau + V^*(x(t+\Delta t)) \Big]$

f(x) and g(x) do not appear

$r(x(\tau), u(\tau)) = x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau)$

Draguna Vrabie

Solving for the cost - our approach

For a given control  $u = -Lx$,  the cost satisfies

$V(x(t)) = \int_t^{t+T} (x^T Q x + u^T R u)\,d\tau + V(x(t+T))$

f(x) and g(x) do not appear

LQR case:

$x^T(t) P x(t) = \int_t^{t+T} (x^T Q x + u^T R u)\,d\tau + x^T(t+T) P x(t+T)$

Optimal gain is  $L = R^{-1} B^T P$

Draguna Vrabie

Solving for the cost - our approach

1. Policy iteration

Cost update:

$V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\,d\tau + V_{k+1}(x(t+T))$

A and B do not appear

For the LQR case, with  $u_k(t) = -L_k x(t)$:

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

Control gain update:

$L_{k+1} = R^{-1} B^T P_{k+1}$

Initial stabilizing control is needed
B is needed for the control update

Lemma 1

With  $L_k = R^{-1} B^T P_k$,  the equation

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

solves the Lyapunov equation without knowing A or B, i.e. it is equivalent to

$(A - BR^{-1}B^T P_k)^T P_{k+1} + P_{k+1}(A - BR^{-1}B^T P_k) + P_k B R^{-1} B^T P_k + Q = 0$

Proof: along the closed-loop trajectory, with $A_k = A - BR^{-1}B^T P_k$,

$\frac{d(x^T P_{k+1} x)}{dt} = x^T (A_k^T P_{k+1} + P_{k+1} A_k) x = -x^T (L_k^T R L_k + Q) x$

so that

$\int_t^{t+T} x^T (Q + L_k^T R L_k) x\,d\tau = -\int_t^{t+T} d(x^T P_{k+1} x) = x^T(t) P_{k+1} x(t) - x^T(t+T) P_{k+1} x(t+T)$

Theorem (D. Vrabie): this algorithm converges and is equivalent to Kleinman's algorithm.
Only B is needed.
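As a numeric sanity check (a sketch, not part of the talk), one can verify Lemma 1: solve the Lyapunov equation for $P_{k+1}$, simulate the closed loop over one interval of length T, and confirm the interval cost identity. The plant and $P_k$ below are assumed for illustration:

```python
import numpy as np
from scipy.linalg import expm, solve_lyapunov
from scipy.integrate import quad

A = np.array([[0.0, 1.0], [-1.0, -2.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); T = 0.5
Pk = np.eye(2)                                      # current cost estimate (assumed)
Lk = np.linalg.solve(R, B.T @ Pk)
Ak = A - B @ Lk
Pk1 = solve_lyapunov(Ak.T, -(Q + Lk.T @ R @ Lk))    # Lyapunov solution for P_{k+1}

x0 = np.array([1.0, -0.5])
x = lambda t: expm(Ak * t) @ x0                     # closed-loop trajectory
cost, _ = quad(lambda t: x(t) @ (Q + Lk.T @ R @ Lk) @ x(t), 0.0, T)
print(np.isclose(x0 @ Pk1 @ x0, cost + x(T) @ Pk1 @ x(T)))   # True: identity verified
```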

Algorithm Implementation

Critic update:

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

$\bar x(t) = x(t) \otimes x(t)$ is the quadratic basis set.

Using the Kronecker product identity  $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$,  set this up as

$p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + p_{k+1}^T \bar x(t+T)$

that is,

$p_{k+1}^T [\bar x(t) - \bar x(t+T)] = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau \equiv d(t, t+T)$

where $\bar x(t) - \bar x(t+T)$ is the quadratic regression vector and $d(t, t+T)$ is the reinforcement on the time interval [t, t+T].

Now use RLS along the trajectory to get the new weights $p_{k+1}$.
Unpack the weights into the matrix $P_{k+1}$, then find the updated feedback gain  $L_{k+1} = R^{-1} B^T P_{k+1}$.

1. Select an initial stabilizing control policy.
2. Find the associated cost - this solves the Lyapunov equation without knowing the dynamics:

$p_{k+1}^T [\bar x(t) - \bar x(t+T)] = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau = d(t, t+T)$

3. Improve the control:  $L_{k+1} = R^{-1} B^T P_{k+1}$

Measure the cost increment (reinforcement) by adding V as a state:  $\dot V = x^T Q x + (u^k)^T R u^k$.

On each interval: apply $u_k(t) = -L_k x(t)$; observe x(t), the cost integral, and x(t+T); update P; do RLS until convergence to $P_{k+1}$; then update the control gain to $L_{k+1}$.

A is not needed anywhere.
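A small sketch of the quadratic-basis bookkeeping behind this setup (a row-major vec convention is assumed, and the full $n^2$ Kronecker vector is used instead of the n(n+1)/2 independent entries):

```python
import numpy as np

def quad_basis(x):
    return np.kron(x, x)                  # xbar(t) = x(t) kron x(t)

def unpack_P(p, n):
    P = p.reshape(n, n)
    return 0.5 * (P + P.T)                # symmetrize the unpacked matrix

# sanity check of x^T P x = vec(P)^T (x kron x), i.e. vec(ABC) = (C^T kron A) vec(B)
n = 3
x = np.random.randn(n)
P = np.random.randn(n, n); P = P + P.T
assert np.isclose(x @ P @ x, P.reshape(-1) @ quad_basis(x))
```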

Algorithm Implementation

Or use a batch least-squares solution along the trajectory.

The critic update

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

can be set up as

$p_{k+1}^T [\bar x(t) - \bar x(t+T)] = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau \equiv d(\bar x(t), L_k)$

where $\bar x(t) = x(t) \otimes x(t)$ is the quadratic basis set.

Evaluating $d(\bar x(t), L_k)$ for at least n(n+1)/2 trajectory points, one can set up a least-squares problem to solve

$p_{k+1} = (X X^T)^{-1} X Y$

$X = [\bar x_\Delta^1 \ \ \bar x_\Delta^2 \ \ \cdots \ \ \bar x_\Delta^N], \qquad Y = [d(\bar x^1, L_k) \ \ d(\bar x^2, L_k) \ \ \cdots \ \ d(\bar x^N, L_k)]^T$

Draguna Vrabie
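A hedged sketch of one batch least-squares critic update. Online, the samples (x(t), x(t+T), d) are simply measured; here they are generated from the known closed loop only to make the example self-contained:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

def critic_batch_update(samples, n):
    """samples: list of (x_t, x_tT, d) with d = int_t^{t+T} x^T (Q + Lk^T R Lk) x dtau."""
    X = np.array([np.kron(xt, xt) - np.kron(xT, xT) for xt, xT, _ in samples])  # regression rows
    Y = np.array([d for _, _, d in samples])
    p, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solution for p_{k+1}
    P = p.reshape(n, n)
    return 0.5 * (P + P.T)                      # P_{k+1}

# --- data generation for illustration only (the update itself never uses A) ---
A = np.array([[0.0, 1.0], [-1.0, -2.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); T = 0.1
Lk = np.zeros((1, 2))                           # initial stabilizing gain (A itself is stable here)
Ak = A - B @ Lk
M = Q + Lk.T @ R @ Lk
rng = np.random.default_rng(0)
samples = []
for _ in range(8):                              # need at least n(n+1)/2 = 3 points
    xt = rng.standard_normal(2)
    xT = expm(Ak * T) @ xt
    d, _ = quad(lambda s: (expm(Ak * s) @ xt) @ M @ (expm(Ak * s) @ xt), 0.0, T)
    samples.append((xt, xT, d))

Pk1 = critic_batch_update(samples, n=2)
Lk1 = np.linalg.solve(R, B.T @ Pk1)             # gain update uses B only
```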

Direct Optimal Adaptive Controller

[Figure: actor-critic structure. The actor applies the feedback gain $L_k$ (a multiplier) to the system $\dot x = Ax + Bu$, $x_0$; the critic integrates $\dot V = x^T Q x + u^T R u$ and samples V with a zero-order hold of period T to update the feedback gain L.]

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.

Gain update (policy): the control is  $u_k(t) = -L_k x(t)$,  and the gain switches to $L_{k+1}$ at the sample instants.

Sample periods need not be the same; they can be selected on-line in real time.

Continuous-time control with discrete gain updates.

Draguna Vrabie

2. CT ADP - Greedy iteration

Cost update:

$V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\,d\tau + V_k(x(t+T))$

Control policy:  $u_k(t) = -L_k x(t)$

LQR case:

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_k x(t+T)$

Control gain update:

$L_{k+1} = R^{-1} B^T P_{k+1}$

A and B do not appear in the cost update
B is needed for the control update
No initial stabilizing control is needed
Direct Optimal Adaptive Control for Partially Unknown CT Systems
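A short sketch of this greedy critic update in regression form (names and data format assumed). Unlike policy iteration, the previous weights $p_k$ multiply the terminal basis, so each sample gives a direct linear equation for $p_{k+1}$ and no initial stabilizing gain is required:

```python
import numpy as np

def greedy_critic_update(samples, p_k, n):
    """samples: list of (x_t, x_tT, d) with d = measured reinforcement over [t, t+T]."""
    X = np.array([np.kron(xt, xt) for xt, _, _ in samples])             # regression vectors
    Y = np.array([d + p_k @ np.kron(xT, xT) for _, xT, d in samples])   # previous weights on the RHS
    p_k1, *_ = np.linalg.lstsq(X, Y, rcond=None)
    P = p_k1.reshape(n, n)
    return 0.5 * (P + P.T)
```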

Algorithm Implementation

The critic update

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_k x(t+T)$

with the quadratic basis set $\bar x(t) = x(t) \otimes x(t)$ and the Kronecker product identity $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$, is set up as

$p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + p_k^T \bar x(t+T)$

with $\bar x(t)$ the regression vector and the previous weights $p_k$ appearing on the right-hand side.

Now use RLS along the trajectory to get the new weights $p_{k+1}$.
Unpack the weights into the matrix $P_{k+1}$, then find the updated feedback gain  $L_{k+1} = R^{-1} B^T P_{k+1}$.

1. Select a control policy.
2. Find the associated cost - this solves for the cost update without knowing the dynamics:

$p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + p_k^T \bar x(t+T)$

3. Improve the control:  $L_{k+1} = R^{-1} B^T P_{k+1}$

Measure the cost increment by adding V as a state:  $\dot V = x^T Q x + (u^k)^T R u^k$.

On each interval: apply $u_k(t)$; observe x(t), the cost integral, and x(t+T); update P; do RLS until convergence to $P_{k+1}$; then update the control gain to $L_{k+1}$.

No initial stabilizing control is needed.
A is not needed anywhere.

Draguna Vrabie

Direct Optimal Adaptive Controller

A hybrid continuous/discrete dynamic controller whose internal state is the observed value over the interval.

[Figure: the same actor-critic structure as before - the actor gain $L_k$ drives the system $\dot x = Ax + Bu$, $x_0$; the critic integrates $\dot V = x^T Q x + u^T R u$ with a zero-order hold of period T feeding the feedback-gain update.]

Has a different critic cost update.
No initial stabilizing gain is needed.

Draguna Vrabie

Analysis of the algorithm

For a given control policy  $u_k(t) = -L_k x(t)$  with  $L_k = R^{-1} B^T P_k$  and  $A_k = A - B R^{-1} B^T P_k$,  the greedy update

$V_{k+1}(x(t)) = \int_t^{t+T} \big[ x^T Q x + (u_k)^T R u_k \big]\,d\tau + V_k(x(t+T)), \qquad V_0 = 0$

is equivalent to

$P_{k+1} = \int_0^T e^{A_k^T t} (Q + L_k^T R L_k) e^{A_k t}\,dt + e^{A_k^T T} P_k e^{A_k T}$

a strange pseudo-discretized Riccati equation; c.f. the DT Riccati equation

$P_{k+1} = A^T P_k A + Q - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A$

which can be rewritten in closed-loop form in terms of $A_k$ and $L_k$.

Draguna Vrabie
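For analysis only, a sketch of this pseudo-discretized recursion evaluated with matrix exponentials; the online algorithm never forms $A_k$, and the quad_vec quadrature and usage below are assumptions:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad_vec

def greedy_recursion(Pk, A, B, Q, R, T):
    Lk = np.linalg.solve(R, B.T @ Pk)
    Ak = A - B @ Lk
    M = Q + Lk.T @ R @ Lk
    integral, _ = quad_vec(lambda t: expm(Ak.T * t) @ M @ expm(Ak * t), 0.0, T)
    return integral + expm(Ak.T * T) @ Pk @ expm(Ak * T)   # P_{k+1}

# Per the talk's analysis, iterating from P0 = 0 (no stabilizing gain needed)
# converges to the solution of the CT ARE.
```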

Analysis of the algorithm

The extra term means the initial control action need not be stabilizing.

Lemma 2. CT HDP is equivalent to

$P_{k+1} - P_k = \int_0^T e^{A_k^T t} (A^T P_k + P_k A + Q - L_k^T R L_k) e^{A_k t}\,dt, \qquad A_k = A - B R^{-1} B^T P_k$

When ADP converges, the resulting P satisfies the Continuous-Time ARE!

ADP solves the CT ARE without knowledge of the system dynamics A.

Solve the Riccati Equation


WITHOUT knowing the plant dynamics
Model-free ADP
Direct OPTIMAL ADAPTIVE CONTROL
Works for Nonlinear Systems
a neural network is used to approximate the cost

Robustness?
Comparison with adaptive control methods?

Policy Evaluation - Critic update

Let K be any state feedback gain for the system (1). One can measure the associated cost over the infinite time horizon as

$V(t, x(t)) = \int_t^{t+T} x^T(\tau)(Q + K^T R K) x(\tau)\,d\tau + W(t+T, x(t+T))$

where $W(t+T, x(t+T))$ is an initial infinite-horizon cost-to-go.

What to do about the tail - issues in Receding Horizon Control.

Gain update (policy): the control is  $u_k(t) = -L_k x(t)$,  and the gain switches to $L_{k+1}$ at the sample instants.

Sample periods need not be the same.

Continuous-time control with discrete gain updates.

Simulations on: F-16 autopilot; load frequency control for a power system.

The A matrix is not needed.

[Figures: time histories of the control signal, the system states, and the controller parameters over 0-1.5 s, and of the critic parameters P(1,1), P(1,2), P(2,2) versus their optimal values over several seconds.]

The critic parameters converge to the steady-state Riccati equation solution.

Nonlinear Case Works Too

$\dot x = f(x) + g(x) u$

$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

Policy Iteration:

$V_{k+1}(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + V_{k+1}(x(t+T))$

$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g(x)^T \frac{\partial V_{k+1}}{\partial x}$

f(x) is not needed.

Prove this is equivalent to

$0 = \Big(\frac{\partial V_{k+1}}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x}$

which is known to converge, but requires f(x).

Algorithm Implementation

$V_{k+1}(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + V_{k+1}(x(t+T))$

Approximate the value by a neural network  $V = W^T \phi(x)$:

$W_{k+1}^T \phi(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + W_{k+1}^T \phi(x(t+T))$

$W_{k+1}^T \big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau$

where $\phi(x(t)) - \phi(x(t+T))$ is the regression vector and the right-hand side is the reinforcement on the time interval [t, t+T].

Now use RLS along the trajectory to get the new weights $W_{k+1}$.

Then find the updated feedback control

$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x} = -\tfrac{1}{2} R^{-1} g^T(x) \Big(\frac{\partial \phi(x)}{\partial x}\Big)^T W_{k+1}$
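A minimal sketch of this NN critic update with an assumed quadratic polynomial basis $\phi(x)$ for a 2-state example; the reinforcement integrals d are assumed measured along the trajectory, and g(x) must be supplied for the policy update:

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])                 # basis functions (example choice)

def dphi_dx(x):
    x1, x2 = x
    return np.array([[2*x1, 0.0], [x2, x1], [0.0, 2*x2]])    # Jacobian of phi

def critic_update(samples):
    """samples: list of (x_t, x_tT, d) with d = int_t^{t+T} (Q(x) + u_k^T R u_k) dtau."""
    X = np.array([phi(xt) - phi(xT) for xt, xT, _ in samples])
    Y = np.array([d for _, _, d in samples])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)                # new critic weights W_{k+1}
    return W

def improved_policy(x, W, g, Rinv):
    """u_{k+1}(x) = -1/2 R^{-1} g(x)^T (dphi/dx)^T W   (g(x) must be known)."""
    return -0.5 * Rinv @ g(x).T @ dphi_dx(x).T @ W
```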

Continuous-Time Optimal Control


System:  $\dot x = f(x,u)$

Cost:  $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

c.f. the DT value recursion, where f(), g() do not appear:

$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Hamiltonian:

$0 = \dot V + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$

Optimal cost (Bellman):

$0 = \min_{u(t)} \Big[ r(x,u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x,u) \Big]$

Optimal control:

$h(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$

HJB equation:

$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T g R^{-1} g^T \frac{dV^*}{dx}, \qquad V^*(0) = 0$
