
F.L. Lewis
Moncrief-O'Donnell Endowed Chair
Head, Controls & Sensors Group

Supported by:
NSF - Paul Werbos
ARO - Randy Zachery

Automation & Robotics Research Institute (ARRI)


The University of Texas at Arlington

Notes on Optimal Control


Draguna Vrabie

Talk available online at http://ARRI.uta.edu/acs

Mike Athans

PhD from Berkeley in 1961


BOOKS
Optimal Control (McGraw Hill, 1966)
Systems, Networks and Computation: Basic Concepts (McGraw Hill, 1972)
Multivariable Methods (McGraw Hill, 1974)

Third President of the IEEE Control Systems Society, 1972 to 1974

More than 40 PhD students - how many?

Awards
American Automatic Control Council's Donald P. Eckman Award "for outstanding
contributions to the field of automatic control" (first recipient of the award, 1964)
American Society for Engineering Education's Frederick E. Terman Award as "the
outstanding young electrical engineering educator" (first recipient, 1969)
American Control Council's Education Award for "outstanding contributions and
distinguished leadership in automatic control education" (second recipient, 1980)
Fellow of the IEEE (1973)
Fellow of the AAAS (1977)
Distinguished Member of the IEEE Control Systems Society (1983)
IEEE Control Systems Society's 1993 H.W. Bode Prize, including the delivery of the Bode
Plenary Lecture at the 1993 IEEE Conference on Decision and Control
American Automatic Control Council's Richard E. Bellman Control Heritage Award (1995)
"In Recognition of a Distinguished Career in Automatic Control; As a Leader and Champion
of Innovative Research; As a Contributor to Fundamental Knowledge in Optimal, Adaptive,
Robust, Decentralized and Distributed Control; and as a Mentor to his Students"

Athans & Falb, Optimal Control (McGraw Hill, 1966) - the first book on optimal control?

1. Use Neural Networks as Function Approximators in Adaptive Control


2. Optimal Adaptive Control gives new adaptive control algorithms

Two-layer feedforward static neural network (NN)


[Figure: two-layer NN. Inputs x1, x2, ..., xn pass through first-layer weights V^T to a hidden layer of L activation functions, then through second-layer weights W^T to the outputs y1, y2, ..., ym.]

Summation eqs

$y_i = \sum_{k=1}^{L} w_{ik}\,\sigma\Big(\sum_{j=1}^{n} v_{kj} x_j + v_{k0}\Big) + w_{i0}$

Matrix eqs

$y = W^T \sigma(V^T x)$

Two-layer NNs have the universal approximation property

Overcome Barron's fundamental accuracy limitation of 1-layer NNs
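As a rough illustration, a minimal sketch (assumed shapes, sigmoid activations) of the forward map above:

```python
# A minimal sketch of y = W^T sigma(V^T x). Thresholds v_k0, w_i0 are carried
# by augmenting the input and hidden vectors with a constant 1.
import numpy as np

def nn_output(x, V, W):
    """x: (n,) input; V: (n+1, L) first-layer weights; W: (L+1, m) output-layer weights."""
    x_aug = np.append(x, 1.0)                 # carries the first-layer thresholds v_k0
    z = 1.0 / (1.0 + np.exp(-V.T @ x_aug))    # hidden-layer activations sigma(V^T x)
    z_aug = np.append(z, 1.0)                 # carries the output-layer thresholds w_i0
    return W.T @ z_aug                        # y = W^T sigma(V^T x)
```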

Neural Network Robot Controller


Universal Approximation Property

[Figure: feedback-linearization NN controller. The NN estimate f̂(x) forms the nonlinear inner loop / feedforward loop; the tracking error e between the desired trajectory qd and the robot output q drives a PD tracking loop with gain Kv; a robust control term v(t) completes the input to the robot system.]

NN universal basis property means no regression matrix is needed


Extension of Adaptive Control to nonlinear-in-the-parameters systems

Problem: the approximator is nonlinear in the NN weights, so standard adaptive-control proof techniques do not work.

No regression matrix needed
Theorem 1 (NN Weight Tuning for Stability)
Let the desired trajectory qd(t) and its derivatives be bounded. Let the initial tracking error be within a certain allowable set U. Let $Z_M$ be a known upper bound on the Frobenius norm of the unknown ideal weights Z.
Take the control input as (NLIP - nonlinear in the parameters)

$\tau = \hat W^T \sigma(\hat V^T x) + K_v r - v$

with

$v(t) = -K_Z (\|\hat Z\|_F + Z_M)\, r .$

Let weight tuning be provided by (note the extra Jacobian terms from the NLIP network)

$\dot{\hat W} = F \hat\sigma r^T - F \hat\sigma' \hat V^T x\, r^T - \kappa F \|r\| \hat W$

$\dot{\hat V} = G x (\hat\sigma'^T \hat W r)^T - \kappa G \|r\| \hat V$

with any constant matrices $F = F^T > 0$, $G = G^T > 0$, and scalar tuning parameter $\kappa > 0$. Initialize the weight estimates as $\hat W = 0$, $\hat V =$ random.

Then the filtered tracking error r(t) and NN weight estimates $\hat W, \hat V$ are uniformly ultimately bounded. Moreover, arbitrarily small tracking error may be achieved by selecting large control gains $K_v$.
The first terms are backprop terms (Werbos); the $\kappa$ terms are robustifying terms (e-mod, sigma-mod, etc.).
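For concreteness, a hedged sketch (not the slide's own code) of one Euler integration step of these tuning laws, assuming sigmoid hidden units and ignoring the threshold rows:

```python
# One Euler step of the Theorem 1 tuning laws (sketch under stated assumptions).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tune_weights(W, V, x, r, F, G, kappa, dt):
    """W: (L, m), V: (n, L), x: (n,) NN input, r: (m,) filtered tracking error."""
    z = V.T @ x
    sig = sigmoid(z)                              # sigma-hat
    dsig = np.diag(sig * (1.0 - sig))             # sigma-hat' (diagonal Jacobian for sigmoids)
    rn = np.linalg.norm(r)
    # Wdot = F*sig*r' - F*sig'*(V'x)*r' - kappa*F*||r||*W
    W_dot = F @ (np.outer(sig, r) - dsig @ np.outer(z, r)) - kappa * rn * (F @ W)
    # Vdot = G*x*(sig'^T W r)' - kappa*G*||r||*V
    V_dot = G @ np.outer(x, dsig.T @ W @ r) - kappa * rn * (G @ V)
    return W + dt * W_dot, V + dt * V_dot
```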

Flexible & Vibratory Systems


Add an extra feedback loop
Two NN needed
Use passivity to show stability

[Figure: NN backstepping controller. NN #1 provides the estimate F̂1(x) in the nonlinear feedback-linearization (tracking) loop, and NN #2 provides F̂2(x) in the backstepping loop; the tracking loop uses gain Kr and a robust control term vi(t), and the backstepping loop generates the desired current id (through 1/KB1) for the motor dynamics driving the flexible-joint robot system, with desired trajectory qd and output q.]

Neural network backstepping controller for Flexible-Joint robot arm

Advantages over traditional Backstepping- no regression functions needed

Actuator Nonlinearities

Deadzone, saturation, backlash

NN in Feedforward Loop- Deadzone Compensation

[Figure: NN deadzone precompensator in the feedforward loop. The precompensator inverts the deadzone D(u) ahead of the mechanical system; the loop also contains an estimate f(x) of the nonlinear function, a small observer, the tracking error e, and gain Kv.]

Critic and actor NN tuning laws:

$\dot{\hat W}_i = T \sigma_i(U_i^T \hat w) r^T - T \hat W^T \hat\sigma'(U^T u) U^T - k_1 T \|r\| \hat W_i - k_2 T \|r\| \hat W_i \|\hat W_i\|$

$\dot{\hat W} = S \hat\sigma'(U^T u) U^T \hat W_i \sigma_i(U_i^T \hat w) r^T - k_1 S \|r\| \hat W$

The precompensator acts like a 2-layer NN, with enhanced backprop tuning!

Optimality in Biological Systems


Cell Homeostasis

The individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.

Permeability control of the cell membrane

Cellular Metabolism

http://www.accessexcellence.org/RC/VL/GG/index.html

Optimality in Control Systems Design


Rocket Orbit Injection

R. Kalman 1960

Dynamics

$\dot r = w$

$\dot w = \frac{v^2}{r} - \frac{\mu}{r^2} + \frac{F}{m}\sin\phi$

$\dot v = -\frac{wv}{r} + \frac{F}{m}\cos\phi$

$\dot m = -F_m$
Objectives
Get to orbit in minimum time
Use minimum fuel
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf

Adaptive Control is generally not Optimal


Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want ONLINE ADAPTIVE OPTIMAL control.

Continuous-Time Optimal Control


System:  $\dot x = f(x,u)$

Cost:  $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

For a given control, the cost satisfies this equation (Hamiltonian):

$0 = \dot V + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$

In LQR, this is a Lyapunov equation.

Optimal cost (Bellman):

$0 = \min_{u(t)} \Big[ r(x,u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x,u) \Big], \qquad V^*(0) = 0$

Optimal control:

$h(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$

In LQR, this is a Riccati equation.

HJB equation:

$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T g R^{-1} g^T \frac{dV^*}{dx}, \qquad V^*(0) = 0$

Linear system, quadratic cost


System:  $\dot x = Ax + Bu$
Utility:  $r(x,u) = x^T Q x + u^T R u; \quad R > 0,\ Q \ge 0$

The cost is quadratic:

$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = x^T(t) P x(t)$

Optimal control (state feedback):

$u(t) = -R^{-1} B^T P x(t) = -L x(t)$

HJB equation is the algebraic Riccati equation (ARE):

$0 = PA + A^T P + Q - P B R^{-1} B^T P$
Full system dynamics must be known
Off-line solution
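For comparison, a small sketch of this off-line design using SciPy's standard ARE solver (the example plant matrices are assumed only for illustration):

```python
# Off-line LQR design: the ARE solution requires full knowledge of (A, B).
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

P = solve_continuous_are(A, B, Q, R)      # 0 = PA + A^T P + Q - P B R^{-1} B^T P
L = np.linalg.solve(R, B.T @ P)           # u = -Lx = -R^{-1} B^T P x
```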

CT Policy Iteration
To avoid solving HJB equation

Utility:  $r(x,u) = Q(x) + u^T R u$

Cost for any given u(t) satisfies the Lyapunov equation:

$0 = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$

Iterative solution:
Pick a stabilizing initial control.
Find the cost:

$0 = \Big(\frac{\partial V_k}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad V_k(0) = 0$

Update the control:

$h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_k}{\partial x}$

Convergence proved by Saridis 1979 if the Lyapunov equation is solved exactly.
Beard & Saridis used complicated Galerkin integrals to solve the Lyapunov equation.
Abu-Khalaf & Lewis used NNs to approximate V for nonlinear systems and proved convergence.

Full system dynamics must be known


Off-line solution

LQR Policy iteration = Kleinman algorithm


1. For a given control policy $u = -L_k x$, solve for the cost:

$0 = A_k^T P_k + P_k A_k + C^T C + L_k^T R L_k$  (Lyapunov equation),  $A_k = A - B L_k$

2. Improve the policy:

$L_{k+1} = R^{-1} B^T P_k$

If started with a stabilizing control policy $L_0$, the matrix $P_k$ monotonically converges to the unique positive definite solution of the Riccati equation.
Every iteration step returns a stabilizing controller.
The system has to be known.
Kleinman 1968
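A minimal sketch of this model-based iteration (A and B known, $L_0$ assumed stabilizing), using SciPy's Lyapunov solver and Q in place of $C^T C$:

```python
# Kleinman's iteration: repeat (policy evaluation via Lyapunov eq., policy improvement).
import numpy as np
from scipy.linalg import solve_lyapunov

def kleinman(A, B, Q, R, L0, iters=20):
    L = L0
    for _ in range(iters):
        Ak = A - B @ L
        # 0 = Ak^T P + P Ak + Q + L^T R L   (policy evaluation)
        P = solve_lyapunov(Ak.T, -(Q + L.T @ R @ L))
        L = np.linalg.solve(R, B.T @ P)    # policy improvement
    return P, L
```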

Policy Iteration Solution


Policy iteration:

$(A - BB^T P_i)^T P_{i+1} + P_{i+1}(A - BB^T P_i) + P_i B B^T P_i + Q = 0$

This is in fact a Newton's method. Define

$\mathrm{Ric}(P) \equiv A^T P + P A + Q - P B B^T P$

Then Policy Iteration is

$P_{i+1} = P_i - (\mathrm{Ric}'_{P_i})^{-1}\,\mathrm{Ric}(P_i), \qquad i = 0, 1, \ldots$

with the Frechet derivative

$\mathrm{Ric}'_{P_i}(P) \equiv (A - BB^T P_i)^T P + P (A - BB^T P_i)$

Draguna Vrabie

Policy Iterations without Lyapunov Equations


Dynamic programming is built on Bellman's optimality principle; an alternative form holds for CT systems [Athans & Falb 1966, Lewis & Syrmos 1995]:

$V^*(x(t)) = \min_{u(\tau),\ t \le \tau < t+\Delta t} \Big[ \int_t^{t+\Delta t} r(x(\tau), u(\tau))\,d\tau + V^*(x(t+\Delta t)) \Big]$

f(x) and g(x) do not appear

$r(x(\tau), u(\tau)) = x^T(\tau) Q x(\tau) + u^T(\tau) R u(\tau)$

Draguna Vrabie

Solving for the cost - our approach

For a given control  $u = -Lx$,  the cost satisfies

$V(x(t)) = \int_t^{t+T} (x^T Q x + u^T R u)\,d\tau + V(x(t+T))$

f(x) and g(x) do not appear

LQR case:

$x^T(t) P x(t) = \int_t^{t+T} (x^T Q x + u^T R u)\,d\tau + x^T(t+T) P x(t+T)$

Optimal gain is  $L = R^{-1} B^T P$

Draguna Vrabie

Solving for the cost - our approach

1. Policy iteration

Cost update:

$V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\,d\tau + V_{k+1}(x(t+T))$

A and B do not appear

For the LQR case, with  $u_k(t) = -L_k x(t)$:

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

Control gain update:

$L_{k+1} = R^{-1} B^T P_{k+1}$

Initial stabilizing control is needed
B is needed for the control update

Lemma 1

With  $L_k = R^{-1} B^T P_k$,  the equation

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

solves the Lyapunov equation without knowing A or B, i.e. it is equivalent to

$(A - BR^{-1}B^T P_k)^T P_{k+1} + P_{k+1}(A - BR^{-1}B^T P_k) + P_k B R^{-1} B^T P_k + Q = 0$

Proof: along the closed-loop trajectory, with $A_k = A - BR^{-1}B^T P_k$,

$\frac{d(x^T P_{k+1} x)}{dt} = x^T (A_k^T P_{k+1} + P_{k+1} A_k) x = -x^T (L_k^T R L_k + Q) x$

so that

$\int_t^{t+T} x^T (Q + L_k^T R L_k) x\,d\tau = -\int_t^{t+T} d(x^T P_{k+1} x) = x^T(t) P_{k+1} x(t) - x^T(t+T) P_{k+1} x(t+T)$

Theorem (D. Vrabie): this algorithm converges and is equivalent to Kleinman's algorithm.
Only B is needed.
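As a numeric sanity check (a sketch, not part of the talk), one can verify Lemma 1: solve the Lyapunov equation for $P_{k+1}$, simulate the closed loop over one interval of length T, and confirm the interval cost identity. The plant and $P_k$ below are assumed for illustration:

```python
import numpy as np
from scipy.linalg import expm, solve_lyapunov
from scipy.integrate import quad

A = np.array([[0.0, 1.0], [-1.0, -2.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); T = 0.5
Pk = np.eye(2)                                      # current cost estimate (assumed)
Lk = np.linalg.solve(R, B.T @ Pk)
Ak = A - B @ Lk
Pk1 = solve_lyapunov(Ak.T, -(Q + Lk.T @ R @ Lk))    # Lyapunov solution for P_{k+1}

x0 = np.array([1.0, -0.5])
x = lambda t: expm(Ak * t) @ x0                     # closed-loop trajectory
cost, _ = quad(lambda t: x(t) @ (Q + Lk.T @ R @ Lk) @ x(t), 0.0, T)
print(np.isclose(x0 @ Pk1 @ x0, cost + x(T) @ Pk1 @ x(T)))   # True: identity verified
```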

Algorithm Implementation

Critic update:

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

$\bar x(t) = x(t) \otimes x(t)$ is the quadratic basis set.

Using the Kronecker product identity  $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$,  set this up as

$p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + p_{k+1}^T \bar x(t+T)$

that is,

$p_{k+1}^T [\bar x(t) - \bar x(t+T)] = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau \equiv d(t, t+T)$

where $\bar x(t) - \bar x(t+T)$ is the quadratic regression vector and $d(t, t+T)$ is the reinforcement on the time interval [t, t+T].

Now use RLS along the trajectory to get the new weights $p_{k+1}$.
Unpack the weights into the matrix $P_{k+1}$, then find the updated feedback gain  $L_{k+1} = R^{-1} B^T P_{k+1}$.

1. Select an initial stabilizing control policy.
2. Find the associated cost - this solves the Lyapunov equation without knowing the dynamics:

$p_{k+1}^T [\bar x(t) - \bar x(t+T)] = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau = d(t, t+T)$

3. Improve the control:  $L_{k+1} = R^{-1} B^T P_{k+1}$

Measure the cost increment (reinforcement) by adding V as a state:  $\dot V = x^T Q x + (u^k)^T R u^k$.

On each interval: apply $u_k(t) = -L_k x(t)$; observe x(t), the cost integral, and x(t+T); update P; do RLS until convergence to $P_{k+1}$; then update the control gain to $L_{k+1}$.

A is not needed anywhere.
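A small sketch of the quadratic-basis bookkeeping behind this setup (a row-major vec convention is assumed, and the full $n^2$ Kronecker vector is used instead of the n(n+1)/2 independent entries):

```python
import numpy as np

def quad_basis(x):
    return np.kron(x, x)                  # xbar(t) = x(t) kron x(t)

def unpack_P(p, n):
    P = p.reshape(n, n)
    return 0.5 * (P + P.T)                # symmetrize the unpacked matrix

# sanity check of x^T P x = vec(P)^T (x kron x), i.e. vec(ABC) = (C^T kron A) vec(B)
n = 3
x = np.random.randn(n)
P = np.random.randn(n, n); P = P + P.T
assert np.isclose(x @ P @ x, P.reshape(-1) @ quad_basis(x))
```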

Algorithm Implementation

Or use a batch least-squares solution along the trajectory.

The critic update

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_{k+1} x(t+T)$

can be set up as

$p_{k+1}^T [\bar x(t) - \bar x(t+T)] = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau \equiv d(\bar x(t), L_k)$

where $\bar x(t) = x(t) \otimes x(t)$ is the quadratic basis set.

Evaluating $d(\bar x(t), L_k)$ for at least n(n+1)/2 trajectory points, one can set up a least-squares problem to solve

$p_{k+1} = (X X^T)^{-1} X Y$

$X = [\bar x_\Delta^1 \ \ \bar x_\Delta^2 \ \ \cdots \ \ \bar x_\Delta^N], \qquad Y = [d(\bar x^1, L_k) \ \ d(\bar x^2, L_k) \ \ \cdots \ \ d(\bar x^N, L_k)]^T$

Draguna Vrabie
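A hedged sketch of one batch least-squares critic update. Online, the samples (x(t), x(t+T), d) are simply measured; here they are generated from the known closed loop only to make the example self-contained:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad

def critic_batch_update(samples, n):
    """samples: list of (x_t, x_tT, d) with d = int_t^{t+T} x^T (Q + Lk^T R Lk) x dtau."""
    X = np.array([np.kron(xt, xt) - np.kron(xT, xT) for xt, xT, _ in samples])  # regression rows
    Y = np.array([d for _, _, d in samples])
    p, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares solution for p_{k+1}
    P = p.reshape(n, n)
    return 0.5 * (P + P.T)                      # P_{k+1}

# --- data generation for illustration only (the update itself never uses A) ---
A = np.array([[0.0, 1.0], [-1.0, -2.0]]); B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.eye(1); T = 0.1
Lk = np.zeros((1, 2))                           # initial stabilizing gain (A itself is stable here)
Ak = A - B @ Lk
M = Q + Lk.T @ R @ Lk
rng = np.random.default_rng(0)
samples = []
for _ in range(8):                              # need at least n(n+1)/2 = 3 points
    xt = rng.standard_normal(2)
    xT = expm(Ak * T) @ xt
    d, _ = quad(lambda s: (expm(Ak * s) @ xt) @ M @ (expm(Ak * s) @ xt), 0.0, T)
    samples.append((xt, xT, d))

Pk1 = critic_batch_update(samples, n=2)
Lk1 = np.linalg.solve(R, B.T @ Pk1)             # gain update uses B only
```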

Direct Optimal Adaptive Controller

[Figure: actor-critic structure. The actor applies the feedback gain $L_k$ (a multiplier) to the system $\dot x = Ax + Bu$, $x_0$; the critic integrates $\dot V = x^T Q x + u^T R u$ and samples V with a zero-order hold of period T to update the feedback gain L.]

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.

Gain update (policy): the control is  $u_k(t) = -L_k x(t)$,  and the gain switches to $L_{k+1}$ at the sample instants.

Sample periods need not be the same; they can be selected on-line in real time.

Continuous-time control with discrete gain updates.

Draguna Vrabie

2. CT ADP - Greedy iteration

Cost update:

$V_{k+1}(x(t)) = \int_t^{t+T} (x^T Q x + u_k^T R u_k)\,d\tau + V_k(x(t+T))$

Control policy:  $u_k(t) = -L_k x(t)$

LQR case:

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_k x(t+T)$

Control gain update:

$L_{k+1} = R^{-1} B^T P_{k+1}$

A and B do not appear in the cost update
B is needed for the control update
No initial stabilizing control is needed
Direct Optimal Adaptive Control for Partially Unknown CT Systems
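A short sketch of this greedy critic update in regression form (names and data format assumed). Unlike policy iteration, the previous weights $p_k$ multiply the terminal basis, so each sample gives a direct linear equation for $p_{k+1}$ and no initial stabilizing gain is required:

```python
import numpy as np

def greedy_critic_update(samples, p_k, n):
    """samples: list of (x_t, x_tT, d) with d = measured reinforcement over [t, t+T]."""
    X = np.array([np.kron(xt, xt) for xt, _, _ in samples])             # regression vectors
    Y = np.array([d + p_k @ np.kron(xT, xT) for _, xT, d in samples])   # previous weights on the RHS
    p_k1, *_ = np.linalg.lstsq(X, Y, rcond=None)
    P = p_k1.reshape(n, n)
    return 0.5 * (P + P.T)
```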

Algorithm Implementation

The critic update

$x^T(t) P_{k+1} x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + x^T(t+T) P_k x(t+T)$

with the quadratic basis set $\bar x(t) = x(t) \otimes x(t)$ and the Kronecker product identity $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$, is set up as

$p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + p_k^T \bar x(t+T)$

with $\bar x(t)$ the regression vector and the previous weights $p_k$ appearing on the right-hand side.

Now use RLS along the trajectory to get the new weights $p_{k+1}$.
Unpack the weights into the matrix $P_{k+1}$, then find the updated feedback gain  $L_{k+1} = R^{-1} B^T P_{k+1}$.

1. Select a control policy.
2. Find the associated cost - this solves for the cost update without knowing the dynamics:

$p_{k+1}^T \bar x(t) = \int_t^{t+T} x^T(\tau)(Q + L_k^T R L_k) x(\tau)\,d\tau + p_k^T \bar x(t+T)$

3. Improve the control:  $L_{k+1} = R^{-1} B^T P_{k+1}$

Measure the cost increment by adding V as a state:  $\dot V = x^T Q x + (u^k)^T R u^k$.

On each interval: apply $u_k(t)$; observe x(t), the cost integral, and x(t+T); update P; do RLS until convergence to $P_{k+1}$; then update the control gain to $L_{k+1}$.

No initial stabilizing control is needed.
A is not needed anywhere.

Draguna Vrabie

Direct Optimal Adaptive Controller

A hybrid continuous/discrete dynamic controller whose internal state is the observed value over the interval.

[Figure: the same actor-critic structure as before - the actor gain $L_k$ drives the system $\dot x = Ax + Bu$, $x_0$; the critic integrates $\dot V = x^T Q x + u^T R u$ with a zero-order hold of period T feeding the feedback-gain update.]

Has a different critic cost update.
No initial stabilizing gain is needed.

Draguna Vrabie

Analysis of the algorithm

For a given control policy  $u_k(t) = -L_k x(t)$  with  $L_k = R^{-1} B^T P_k$  and  $A_k = A - B R^{-1} B^T P_k$,  the greedy update

$V_{k+1}(x(t)) = \int_t^{t+T} \big[ x^T Q x + (u_k)^T R u_k \big]\,d\tau + V_k(x(t+T)), \qquad V_0 = 0$

is equivalent to

$P_{k+1} = \int_0^T e^{A_k^T t} (Q + L_k^T R L_k) e^{A_k t}\,dt + e^{A_k^T T} P_k e^{A_k T}$

a strange pseudo-discretized Riccati equation; c.f. the DT Riccati equation

$P_{k+1} = A^T P_k A + Q - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A$

which can be rewritten in closed-loop form in terms of $A_k$ and $L_k$.

Draguna Vrabie
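For analysis only, a sketch of this pseudo-discretized recursion evaluated with matrix exponentials; the online algorithm never forms $A_k$, and the quad_vec quadrature and usage below are assumptions:

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad_vec

def greedy_recursion(Pk, A, B, Q, R, T):
    Lk = np.linalg.solve(R, B.T @ Pk)
    Ak = A - B @ Lk
    M = Q + Lk.T @ R @ Lk
    integral, _ = quad_vec(lambda t: expm(Ak.T * t) @ M @ expm(Ak * t), 0.0, T)
    return integral + expm(Ak.T * T) @ Pk @ expm(Ak * T)   # P_{k+1}

# Per the talk's analysis, iterating from P0 = 0 (no stabilizing gain needed)
# converges to the solution of the CT ARE.
```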

Analysis of the algorithm

The extra term means the initial control action need not be stabilizing.

Lemma 2. CT HDP is equivalent to

$P_{k+1} - P_k = \int_0^T e^{A_k^T t} (A^T P_k + P_k A + Q - L_k^T R L_k) e^{A_k t}\,dt, \qquad A_k = A - B R^{-1} B^T P_k$

When ADP converges, the resulting P satisfies the Continuous-Time ARE!

ADP solves the CT ARE without knowledge of the system dynamics A.

Solve the Riccati Equation


WITHOUT knowing the plant dynamics
Model-free ADP
Direct OPTIMAL ADAPTIVE CONTROL
Works for Nonlinear Systems
a neural network is used to approximate the cost

Robustness?
Comparison with adaptive control methods?

Policy Evaluation - Critic update

Let K be any state feedback gain for the system (1). One can measure the associated cost over the infinite time horizon as

$V(t, x(t)) = \int_t^{t+T} x^T(\tau)(Q + K^T R K) x(\tau)\,d\tau + W(t+T, x(t+T))$

where $W(t+T, x(t+T))$ is an initial infinite-horizon cost-to-go.

What to do about the tail - issues in Receding Horizon Control.

Gain update (policy): the control is  $u_k(t) = -L_k x(t)$,  and the gain switches to $L_{k+1}$ at the sample instants.

Sample periods need not be the same.

Continuous-time control with discrete gain updates.

Simulations on: F-16 autopilot; load frequency control for a power system.

The A matrix is not needed.

[Figures: time histories of the control signal, the system states, and the controller parameters over 0-1.5 s, and of the critic parameters P(1,1), P(1,2), P(2,2) versus their optimal values over several seconds.]

The critic parameters converge to the steady-state Riccati equation solution.

Nonlinear Case Works Too

$\dot x = f(x) + g(x) u$

$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

Policy Iteration:

$V_{k+1}(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + V_{k+1}(x(t+T))$

$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g(x)^T \frac{\partial V_{k+1}}{\partial x}$

f(x) is not needed.

Prove this is equivalent to

$0 = \Big(\frac{\partial V_{k+1}}{\partial x}\Big)^T f(x, h_k(x)) + r(x, h_k(x)), \qquad u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x}$

which is known to converge, but requires f(x).

Algorithm Implementation

$V_{k+1}(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + V_{k+1}(x(t+T))$

Approximate the value by a neural network  $V = W^T \phi(x)$:

$W_{k+1}^T \phi(x(t)) = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau + W_{k+1}^T \phi(x(t+T))$

$W_{k+1}^T \big[\phi(x(t)) - \phi(x(t+T))\big] = \int_t^{t+T} \big(Q(x) + u_k^T R u_k\big)\,d\tau$

where $\phi(x(t)) - \phi(x(t+T))$ is the regression vector and the right-hand side is the reinforcement on the time interval [t, t+T].

Now use RLS along the trajectory to get the new weights $W_{k+1}$.

Then find the updated feedback control

$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V_{k+1}}{\partial x} = -\tfrac{1}{2} R^{-1} g^T(x) \Big(\frac{\partial \phi(x)}{\partial x}\Big)^T W_{k+1}$
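A minimal sketch of this NN critic update with an assumed quadratic polynomial basis $\phi(x)$ for a 2-state example; the reinforcement integrals d are assumed measured along the trajectory, and g(x) must be supplied for the policy update:

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2])                 # basis functions (example choice)

def dphi_dx(x):
    x1, x2 = x
    return np.array([[2*x1, 0.0], [x2, x1], [0.0, 2*x2]])    # Jacobian of phi

def critic_update(samples):
    """samples: list of (x_t, x_tT, d) with d = int_t^{t+T} (Q(x) + u_k^T R u_k) dtau."""
    X = np.array([phi(xt) - phi(xT) for xt, xT, _ in samples])
    Y = np.array([d for _, _, d in samples])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)                # new critic weights W_{k+1}
    return W

def improved_policy(x, W, g, Rinv):
    """u_{k+1}(x) = -1/2 R^{-1} g(x)^T (dphi/dx)^T W   (g(x) must be known)."""
    return -0.5 * Rinv @ g(x).T @ dphi_dx(x).T @ W
```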

Continuous-Time Optimal Control


System:  $\dot x = f(x,u)$

Cost:  $V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \big(Q(x) + u^T R u\big)\,d\tau$

c.f. the DT value recursion, where f(), g() do not appear:

$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Hamiltonian:

$0 = \dot V + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$

Optimal cost (Bellman):

$0 = \min_{u(t)} \Big[ r(x,u) + \Big(\frac{\partial V^*}{\partial x}\Big)^T f(x,u) \Big]$

Optimal control:

$h(x(t)) = -\tfrac{1}{2} R^{-1} g^T(x) \frac{\partial V^*}{\partial x}$

HJB equation:

$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T g R^{-1} g^T \frac{dV^*}{dx}, \qquad V^*(0) = 0$
