
HT2013

Introduction to Optimal Control theory


This note covers the material of the last four lectures. We give some basic ideas
and techniques for solving optimal control problems. We introduce the idea of dynamic
programming and the principle of optimality, explain Pontryagin's maximum principle
together with a heuristic derivation of the theorem, and work through several examples in detail.
The topics are selected from the last three chapters of our textbook. Optimal control theory
has an extremely large literature, so the aim of the note is of necessity to give a very
concise treatment. We strongly recommend that readers study the material in the book
for a rigorous mathematical treatment.
We start with some motivation and philosophy behind optimization. Optimization is
a key tool in modelling. Sometimes it is important to solve a problem optimally. Other
times either a near-optimal solution is good enough, or the real problem does not have
a single criterion by which a solution can be judged. However, even then optimization is
useful as a way to test thinking. If the optimal solution is ridiculous it may suggest ways
in which both modelling and thinking can be refined.
Optimal control theory is concerned with dynamical systems and their optimization
over time. It takes stochastic, unknown or imperfect observations into account and deals
with optimization problems over evolving processes. This is in contrast with the optimization
models in most topics in operations research, e.g. our advanced course Optimization, where
things are static and nothing is random or hidden. The three features of optimal control
are imperfect state observations, dynamical evolution and stochastic evolution. They give rise to
new types of optimization problems and require a different way of thinking. This course
does not deal with stochastic systems. However, it is my intention to lead the course to a
more advanced level. So I also give a short introduction to the Kalman filter and the Bellman
principle of optimality applied to Markov processes and, heuristically, to diffusion processes
by the dynamic programming approach, to meet some students' requests. Note that these
topics are beyond the scope of the course curriculum.
1 Performance indices
Measures of performance
We consider systems described by a general set of n nonlinear differential equations
ẋ(t) = f(x, u, t)   (1.1)
subject to
x(t_0) = x_0,   (1.2)
where the components of f are continuous and satisfy standard conditions, such as having
continuous first partial derivatives, so that the solution of (1.1) exists and is unique for given
initial conditions (1.2). A cost functional, or performance index, is a scalar which provides
a measure by which the performance of the system can be judged.
Minimum-time problems. Here u(t) is to be chosen so as to transfer the system from
an initial state x_0 to a specified state in the shortest possible time. This is equivalent to
minimizing the performance index
J = t_1 - t_0 = ∫_{t_0}^{t_1} dt   (1.3)
where t_1 is the first instant of time at which the desired state is reached.
Example 1. An aircraft pursues a ballistic missile and wishes to intercept it as quickly as
possible. For simplicity neglect gravitational and aerodynamic forces and suppose that
the trajectories are horizontal. At t = 0 the aircraft is a distance a from the missile, whose
motion is known to be described by x(t) = a + bt², where b is a positive constant. The
motion of the aircraft is given by ẍ_a(t) = u, where the thrust u(t) is subject to |u| ≤ 1,
with suitably chosen units. Clearly the optimal strategy for the aircraft is to accelerate
with maximum thrust u(t) = +1. After a time t the aircraft has then travelled a distance
ct + (1/2)t², where ẋ_a(0) = c, so interception will occur at a time T where
cT + (1/2)T² = a + bT².
This equation may not have any real positive solution; in other words, this minimum-time
problem may have no solution for certain initial conditions.
Terminal control. In this case the final state x_f = x(t_1) is to be brought as near as
possible to some desired state r(t_1). A suitable performance measure to be minimized is
e′(t_1) M e(t_1),   (1.4)
where e(t) = x(t) - r(t) and M is a real symmetric positive definite n×n matrix. A special
case is when M is the identity matrix and (1.4) reduces to ‖x_f - r(t_1)‖², the squared Euclidean
norm of the terminal error. More generally, if M = diag[m_1, ..., m_n] then the m_i are chosen
so as to weight the relative importance of the deviations [x_i(t_1) - r_i(t_1)]². If some of the
r_i(t_1) are not specified then the corresponding elements of M will be zero and M will be
only positive semidefinite.
Minimum effort. The desired final state is now to be attained with minimum total
expenditure of control effort. Suitable performance indices to be minimized are
∫_{t_0}^{t_1} Σ_i β_i |u_i| dt   (1.5)
or
∫_{t_0}^{t_1} u′Ru dt   (1.6)
where R is real positive definite and the β_i and the entries r_ij of R are weighting factors.
Tracking problems. The aim here is to follow or track as closely as possible some
desired state r(t) throughout the interval t_0 ≤ t ≤ t_1. Following (1.4) and (1.6) a suitable
index is
∫_{t_0}^{t_1} e′Qe dt   (1.7)
where Q is real symmetric positive semidefinite. If the u_i(t) are unbounded then minimization
of (1.7) can lead to a control vector having infinite components. This is unacceptable for
real-life problems, so to restrict the total control effort a combination of (1.6) and (1.7)
can be used, giving
∫_{t_0}^{t_1} (e′Qe + u′Ru) dt.   (1.8)
Expressions of the form (1.6), (1.7) and (1.8) are termed quadratic performance indices.
Example 2. A landing vehicle separates from a spacecraft at time t = 0 at an altitude h
above the surface of a planet, with initial downward velocity ν. For simplicity assume that
gravitational forces can be neglected and that the mass of the vehicle is constant. Consider
vertical motion only, with upwards regarded as the positive direction. Let x_1 denote the
altitude, x_2 the velocity and u(t) the thrust exerted by the rocket motor, subject to |u(t)| ≤ 1
with suitable scaling. The equations of motion are
ẋ_1 = x_2, ẋ_2 = u   (1.9)
and the initial conditions are
x_1(0) = h, x_2(0) = -ν.   (1.10)
For a soft landing at some time t_f we require
x_1(t_f) = 0, x_2(t_f) = 0.   (1.11)
A suitable performance index might be
∫_0^{t_f} (|u| + k) dt,   (1.12)
which is a combination of (1.3) and (1.5). The expression (1.12) represents a sum of
total fuel consumption and time to landing, k being a factor which weights the relative
importance of these two quantities.
The expression for the optimal control which minimizes (1.12) subject to (1.9), (1.10)
and (1.11) will be developed later. Of course the simple equations (1.9) arise in a variety
of situations.
Performance indices given above are termed functionals, since they assign a unique
real number to a set of functions x_i(t), u_j(t). In the classical optimization literature more
general functionals are used; for instance the problem of Bolza is to choose u(t) so as to
minimize
J(u) = Q[x(t_1), t_1] + ∫_{t_0}^{t_1} q(x, u, t)dt   (1.13)
subject to (1.1), the scalar functions Q and q being continuous and having continuous first
partial derivatives.
2 Dynamic programming
We begin with a simple example to illustrate the idea of the Bellman optimality principle.
A key idea is that optimization over time can often be thought of as optimization in stages.
We trade off our desire to obtain the lowest possible cost at the present stage against the
implication this would have for costs at future stages. The best action minimizes the sum
of the cost incurred at the current stage and the least total cost that can be incurred from
all subsequent stages, consequent on this decision.
2.1 Bellman's principle of optimality
The optimality principle stated by Bellman is as follows:
An optimal policy has the property that whatever the initial state and initial decision
are, the remaining decisions must constitute an optimal policy with regard to the state
resulting from the first decision.
According to Bellman, the basic idea of dynamic programming was initiated by himself
in his research done during 1949-1951, mainly for multistage decision problems. He also
found that the technique was applicable to the calculus of variations and to optimal
control problems whose state equations are ordinary differential equations, which led to
a nonlinear PDE now called the Hamilton-Jacobi-Bellman equation. In the 1960s, Kalman
pointed out such a relation and probably was the first to use the name HJB equation for
the equation of the control problem.
As a matter of fact, the idea of the principle of optimality actually goes back to
Jakob Bernoulli while he was solving the famous brachistochrone problem, posed by his
brother Johann Bernoulli in 1696. Moreover, an equation identical to what is nowadays
called the Bellman equation was first derived by Caratheodory in 1926 while he was study-
ing the sufficient conditions of the calculus of variations problem (his approach was named
Caratheodory's royal road). The work of the Swedish mathematician Wald on sequential anal-
ysis in the late 1940s contains some ideas very similar to those of dynamic programming.
Briefly we can state Bellman's principle of optimality as follows. From any point
on an optimal trajectory, the remaining trajectory is optimal for the corresponding problem
initiated at that point. We illustrate this principle by an example.
Example 3. (The shortest path problem) Consider the stagecoach problem in which
a traveler wishes to minimize the length of a journey from town A to town J by first
traveling to one of B, C, or D, then onwards to one of E, F or G, then onwards to one of
H or I, and finally to J. Thus there are four stages. The arcs are marked with distances
between towns.
Figure 1: Road system for the stagecoach problem (nodes A-J; each arc is marked with the distance between the corresponding towns).
Let V(X) be the minimal distance from town X to town J. Then obviously V(J) = 0,
V(H) = 3, V(I) = 4. Recursively we compute the next stage and obtain the following:
V(F) = min(6 + V(H), 3 + V(I)) = 7,
V(E) = min(1 + V(H), 4 + V(I)) = 4,
V(G) = min(3 + V(H), 3 + V(I)) = 6.
(The underlining in the original indicates at which term the minimum is attained.)
V(B) = min(7 + V(E), 4 + V(F), 6 + V(G)) = 11,
V(C) = min(3 + V(E), 2 + V(F), 4 + V(G)) = 7,
V(D) = min(5 + V(G), 1 + V(F), 4 + V(E)) = 8,
and finally
V(A) = min(2 + V(B), 4 + V(C), 3 + V(D)) = 11.
So one shortest path is A → D → F → I → J. From the above calculation we know that the
shortest path is not unique. (Find all of the shortest paths.)
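The backward recursion above is easy to mechanize. The following short Python script is a minimal sketch; the arc lengths are transcribed from the minimizations written out above, so the graph data is only as reliable as that reading. It reproduces the values V(X) and records all minimizing successors, from which every shortest path can be traced.

# Backward dynamic programming for the stagecoach problem of Example 3.
arcs = {
    'A': {'B': 2, 'C': 4, 'D': 3},
    'B': {'E': 7, 'F': 4, 'G': 6},
    'C': {'E': 3, 'F': 2, 'G': 4},
    'D': {'E': 4, 'F': 1, 'G': 5},
    'E': {'H': 1, 'I': 4},
    'F': {'H': 6, 'I': 3},
    'G': {'H': 3, 'I': 3},
    'H': {'J': 3},
    'I': {'J': 4},
    'J': {},
}

V = {'J': 0}                         # value function V(X) = shortest distance X -> J
best = {}                            # minimizing successors (there may be several)
for town in ['H', 'I', 'E', 'F', 'G', 'B', 'C', 'D', 'A']:   # stage by stage, backwards
    costs = {nxt: d + V[nxt] for nxt, d in arcs[town].items()}
    V[town] = min(costs.values())
    best[town] = [nxt for nxt, c in costs.items() if c == V[town]]

print(V['A'])    # 11
print(best)      # tracing 'best' from A recovers all shortest paths A -> ... -> J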
2.2 The optimality equation
To avoid complication of technicality we first derive the optimality equation for the
discrete-time problem. In this case t takes integer values, say t = 0, 1, 2, .... Suppose that u_t is a
control variable whose value is to be chosen at time t. Let U_t = (u_0, u_1, ..., u_{t-1}) denote
the partial sequence of controls (or decisions) taken over the first t stages. Suppose the
cost up to the time horizon N is given by
C = G(U_{N-1}) = G(u_0, u_1, ..., u_{N-1}).
Then the principle of optimality is phrased in the following theorem.
Theorem 2.1 (The principle of optimality). Define the functions
G(U_{t-1}, t) = inf_{u_t, u_{t+1}, ..., u_{N-1}} G(U_{N-1}).
Then these functions obey the recursion
G(U_{t-1}, t) = inf_{u_t} G(U_t, t + 1), t < N,
with terminal condition G(U_{N-1}, N) = G(U_{N-1}).
The proof is immediate, since
G(U_{t-1}, t) = inf_{u_t} inf_{u_{t+1},...,u_{N-1}} G(u_0, u_1, ..., u_t, u_{t+1}, ..., u_{N-1}).
Now we consider the dynamical system
x_{t+1} = f(x_t, u_t, t)
where x ∈ R^n is a state variable, and the control variable u_t is chosen on the basis of knowing
U_{t-1} = (u_0, ..., u_{t-1}), which determines everything else. But a more economical represen-
tation of the past history is often sufficient. Suppose we wish to minimize a cost function
of the form
J = Σ_{t=0}^{N-1} J(x_t, u_t, t) + J_N(x_N)   (2.1)
by choice of controls {u_0, u_1, ..., u_{N-1}}. A cost function that can be written in this way is
called a decomposable cost. Define the so-called cost-to-go function from time t onwards as
J_t = Σ_{τ=t}^{N-1} J(x_τ, u_τ, τ) + J_N(x_N)   (2.2)
and the minimal cost-to-go function as an optimization over {u_t, ..., u_{N-1}} conditional on
x_t = x,
V(x, t) = inf_{u_t,...,u_{N-1}} J_t.
Here V(x, t) is the minimal future cost from time t onwards, given the state x at time
t. The function V(x, t) is also called a value function. Then by induction we can prove
that
V(x, t) = inf_u [J(x, u, t) + V(f(x, u, t), t + 1)], t < N   (2.3)
with terminal condition V(x, N) = J_N(x_N), where x is a generic value of x_t. The min-
imizing u in (2.3) is the optimal control u(x, t) and the values of x_0, ..., x_{t-1} are irrelevant.
The optimality equation (2.3) is also called the dynamic programming equation (DP) or
Bellman equation.
The DP equation defines an optimal control problem in feedback or closed-loop form,
with u_t = u(x_t, t). This is in contrast to the open-loop formulation in which {u_0, ..., u_{N-1}}
are to be determined all at once at time 0. A policy (or strategy) is a rule for choosing the
value of the control variable under all possible circumstances as a function of the perceived
circumstances. Keep the following in mind.
(i) The optimal u_t is a function of x_t and t, i.e., u_t = u(x_t, t).
(ii) The DP equation yields the optimal control u_t in closed-loop form. It is optimal
whatever the past control policy may have been.
(iii) The DP equation is a backward recursion in time (from which we get the optimum
at N - 1, then N - 2 and so on). The later policy is decided first.
It could also be instructive to remember the quotation from Kierkegaard: "Life must be lived
forward and understood backwards."
Example 4. Managing spending and savings. An investor receives annual income from
a building society of x_t kr in year t. He consumes u_t and adds x_t - u_t to his capital,
0 ≤ u_t ≤ x_t. The capital is invested at interest rate 100θ%, and so his income in year
t + 1 increases to
x_{t+1} = f(x_t, u_t) = x_t + θ(x_t - u_t).
He desires to maximize his total consumption over N years, J = Σ_{t=0}^{N-1} u_t.
(What is your guess?)
It is clear that this problem is time invariant, so we can drop t in all functions involved.
Let now the value function V_s(x) denote the maximal reward obtainable, starting in state
x and when there is time s = N - t to go. The DP equation is
V_s(x) = max_{0≤u≤x} [u + V_{s-1}(x + θ(x - u))],
with the terminal condition V_0(x) = 0 (since no more can be obtained once time N is
reached). Remember that we use shorthand notation x and u for x_s and u_s.
The idea is to substitute backwards and soon guess the form of the solution. First,
V_1(x) = max_{0≤u≤x} [u + V_0(x + θ(x - u))] = max_{0≤u≤x} [u + 0] = x.
Next,
V_2(x) = max_{0≤u≤x} [u + V_1(x + θ(x - u))] = max_{0≤u≤x} [u + x + θ(x - u)].
Since the function that is to be maximized is linear in u, its maximum is attained at either
u = 0 or u = x. Thus
V_2(x) = max((1 + θ)x, 2x) = max(1 + θ, 2)x = c_2 x.
This motivates the guess V_{s-1}(x) = c_{s-1}x. We shall show that this is the right answer. The
first induction step is already done. So we assume this is valid at s - 1. We find that
V_s(x) = max_{0≤u≤x} [u + c_{s-1}(x + θ(x - u))] = max[(1 + θ)c_{s-1}, 1 + c_{s-1}]x = c_s x.
Thus our guess is verified and V_s(x) = c_s x, where the c_s satisfy the recursion implicit in the
above computation, that is c_s = c_{s-1} + max[θc_{s-1}, 1], which yields
c_s = s for s ≤ s*, and c_s = (1 + θ)^{s-s*} s* for s ≥ s*,
where s* is the least integer such that s* ≥ 1/θ. The optimal strategy is to invest the whole
of the income in years 0, ..., N - s* - 1 (to build up capital) and then consume the whole
of the income in years N - s*, ..., N - 1.
Is this a surprise to you? What we learn from this example is the following. (i) It is
often useful to frame things in terms of time to go, s. (ii) Although the form of the DP
equation can sometimes look messy, try working backwards from V_0(x) (which is known).
Often we would get a pattern from which we can piece together a solution. (iii) When the
dynamics are linear, the optimal control lies at an extreme point of the set of admissible
controls. This form of policy is known as bang-bang control.
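A quick numerical check of the closed form for c_s is given below. It is a minimal sketch: the interest rate θ and the horizon N are arbitrary illustrative choices, and the script simply runs the recursion c_s = c_{s-1} + max(θc_{s-1}, 1) and compares it with the stated formula.

# Numerical check of Example 4: recursion vs. closed form for c_s.
import math

theta, N = 0.15, 30                       # illustrative parameters
s_star = math.ceil(1.0 / theta)           # least integer s* >= 1/theta

c = 0.0                                   # c_0 = 0 since V_0(x) = 0
for s in range(1, N + 1):
    c = c + max(theta * c, 1.0)           # DP recursion
    closed = s if s <= s_star else (1 + theta) ** (s - s_star) * s_star
    assert abs(c - closed) < 1e-9, (s, c, closed)

print("s* =", s_star, " and V_N(x) = c_N * x with c_N =", c)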
2.3 Markov decision processes*
This section is optional. It is written for completeness and for a student who wishes to
know how the theory can be used in non-deterministic optimal control problems. We shall
state the theory for controlled diffusion processes later.
Consider now stochastic evolution. Let X_t = (x_0, ..., x_t) and U_t = (u_0, ..., u_t) denote
the x and u histories at time t. As before, state structure is defined by a dynamical system
having value x_t at time t with the following properties.
(a) Markov dynamics (i.e., a stochastic dynamical system):
P(x_{t+1} | X_t, U_t) = P(x_{t+1} | x_t, u_t),
where P(·) denotes probability.
(b) Decomposable cost: the cost is given by (2.1).
These assumptions define state structure. To avoid losing insight we also require the
following property.
(c) Perfect state observation: the current value of the state is observable. That is, x_t is
known at the time at which u_t must be chosen. Let W_t denote the observed history
at time t. Assume W_t = (X_t, U_{t-1}). Note that J is determined by W_N, so we might
write J = J(W_N).
These assumptions define what is known as a discrete-time Markov decision process
(MDP). It is widely used in applications. As before, the cost-to-go function is given by
(2.2). Denote the minimal expected cost-to-go function by
V(W_t) = inf_π E_π[J_t | W_t],
where π denotes a policy, i.e., a rule for choosing the controls u_0, ..., u_{N-1}, and E(·) denotes
the mathematical expectation. We have the following theorem.
Theorem 2.2. V(W_t) is a function of x_t and t alone, say V(x_t, t). It satisfies the opti-
mality equation
V(x_t, t) = inf_{u_t} [J(x_t, u_t, t) + E[V(x_{t+1}, t + 1) | x_t, u_t]], t < N,   (2.4)
with terminal condition V(x_N, N) = J_N(x_N). Moreover, a minimizing value of u_t, which
is also a function of x_t and t, in (2.4) is optimal.
Proof. Use induction. First, the value of V(W_N) is J_N(x_N), so the theorem is valid at
time N. Assume it is valid at time t + 1. The DP equation is then
V(W_t) = inf_{u_t} [J(x_t, u_t, t) + E[V(x_{t+1}, t + 1) | X_t, U_t]].
But by assumption (a), the right-hand side of the above equation reduces to the right-hand
side of (2.4). This proves the theorem.
Example 5. Exercising a stock option. The owner of a call option has the option to buy a
share at fixed striking price p. The option must be exercised by day N. If he exercises
the option on day t and then immediately sells the share at the current price x_t, he can
make a profit of x_t - p. Suppose the price sequence obeys the equation x_{t+1} = x_t + ε_t,
where the ε_t are i.i.d. random variables for which E|ε| < ∞. The aim is to exercise the
option optimally.
Let V_s(x) be the value function (maximal expected profit) when the share price is x
and there are s days to go. Show that (i) V_s(x) is non-decreasing in s, (ii) V_s(x) - x is
non-increasing in x and (iii) V_s(x) is continuous in x. Deduce that the optimal policy
can be characterized as follows. There exists a non-decreasing sequence {a_s} such that an
optimal policy is to exercise the option the first time that x ≥ a_s, where x is the current
price and s is the number of days to go before expiry of the option.
The state variable at time t is, strictly speaking, x_t plus a variable which indicates
whether the option has been exercised or not. However, it is only the case in which the
option has not yet been exercised that is of interest, so x is the effective state variable.
Since DP makes its calculations backwards from the terminal point, it is often an advantage
to write things in terms of s, the time to go, as pointed out earlier. Let V_s(x) be the value
function (maximal expected profit) with s days to go; then
V_0(x) = max(x - p, 0)
and so the DP equation is
V_s(x) = max(x - p, E[V_{s-1}(x + ε)]), s = 1, 2, ...
Note that the expectation operator comes outside, not inside, V_{s-1}(·). We can use induc-
tion to show (i)-(iii). For example, (i) is obvious, since increasing s means we have more
time over which to exercise the option. However, for a formal proof we have
V_1(x) = max(x - p, E[V_0(x + ε)]) ≥ max(x - p, 0) = V_0(x).
Now suppose V_{s-1} ≥ V_{s-2}. Then
V_s(x) = max(x - p, E[V_{s-1}(x + ε)]) ≥ max(x - p, E[V_{s-2}(x + ε)]) = V_{s-1}(x).
Therefore V_s is non-decreasing in s. Similarly, an inductive proof of (ii) follows from
V_s(x) - x = max(-p, E[V_{s-1}(x + ε) - (x + ε)] + E(ε)),
since the left-hand side inherits the non-increasing character of the term V_{s-1}(x + ε) - (x + ε)
on the right-hand side. Thus the optimal strategy can be characterized as asserted, because
from (ii) and (iii) and the fact that V_s(x) ≥ x - p it follows that there exists an a_s such
that V_s(x) is greater than x - p if x < a_s and equals x - p if x ≥ a_s. It follows from (i)
that a_s is non-decreasing in s. The constant a_s is the smallest x for which V_s(x) = x - p.
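The threshold structure can be checked numerically. The sketch below is only an illustration: it assumes ε ~ N(0,1) and works on a truncated price grid (both choices are made here, not in the example), evaluates the expectation by Gauss-Hermite quadrature, and reads off approximate thresholds a_s; these should come out non-decreasing in s.

# Value iteration for Example 5: V_s(x) = max(x - p, E[V_{s-1}(x + eps)]), eps ~ N(0,1) assumed.
import numpy as np

p, S = 10.0, 8                                       # striking price and horizon (illustrative)
x = np.linspace(0.0, 40.0, 801)                      # truncated price grid
nodes, w = np.polynomial.hermite_e.hermegauss(21)    # quadrature for a standard normal
w = w / w.sum()

V = np.maximum(x - p, 0.0)                           # V_0
for s in range(1, S + 1):
    cont = sum(wi * np.interp(x + ni, x, V) for wi, ni in zip(w, nodes))  # E[V_{s-1}(x + eps)]
    V = np.maximum(x - p, cont)
    stop = (x - p) >= cont - 1e-3                    # where exercising is (nearly) optimal
    a_s = x[stop].min() if stop.any() else np.inf
    print(f"s = {s}:  a_s approx {a_s:.2f}")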
Example 6. Accepting the best offer: We are to interview N candidates for a job. At
the end of each interview we must either hire or reject the candidate we have just seen,
and may not change this decision later. Candidates are seen in random order and can be
ranked against those seen previously. The aim is to maximize the probability of choosing
the candidate of greatest rank.
Let W_t be the history of observations up to time t, i.e., after we have interviewed
the t-th candidate. All that matters are the value of t and whether the t-th candidate is
better than all her predecessors: let x_t = 1 if this is true and x_t = 0 if it is not. In the
case x_t = 1, the probability she is the best of all N candidates is
P(best of N | best of first t) = P(best of N)/P(best of first t) = (1/N)/(1/t) = t/N.
Now the fact that the t-th candidate is the best of the t candidates seen so far places no
restriction on the relative ranks of the first t - 1 candidates. Thus x_t = 1 and W_{t-1} are
statistically independent and we have
P(x_t = 1 | W_{t-1}) = [P(W_{t-1} | x_t = 1)/P(W_{t-1})] P(x_t = 1) = P(x_t = 1) = 1/t.
Let V(0, t - 1) be the probability that under an optimal policy we select the best candidate,
given that we have seen t - 1 candidates so far and the last one was not the best of those.
DP gives
V(0, t - 1) = ((t - 1)/t) V(0, t) + (1/t) max(t/N, V(0, t)) = max(((t - 1)/t)V(0, t) + 1/N, V(0, t)).
These imply V(0, t - 1) ≥ V(0, t) for all t ≤ N. Therefore, since t/N and V(0, t) are
respectively increasing and non-increasing in t, it must be that for small t we have V(0, t) >
t/N and for large t we have V(0, t) < t/N. Let t_0 be the smallest t such that V(0, t) ≤ t/N.
Then
V(0, t - 1) = V(0, t_0) for t < t_0, and V(0, t - 1) = ((t - 1)/t)V(0, t) + 1/N for t ≥ t_0.
Solving the second of these backwards from the point t = N, V(0, N) = 0, we obtain
V(0, t - 1)/(t - 1) = 1/(N(t - 1)) + V(0, t)/t = ··· = 1/(N(t - 1)) + 1/(Nt) + ··· + 1/(N(N - 1)).
Hence,
V(0, t - 1) = ((t - 1)/N) Σ_{τ=t-1}^{N-1} 1/τ, t ≥ t_0.
Since we require V(0, t_0) ≤ t_0/N, it must be that t_0 is the smallest integer satisfying
Σ_{τ=t_0}^{N-1} 1/τ ≤ 1.
For large N the sum on the left above is about log(N/t_0), so log(N/t_0) ≈ 1 and we find
t_0 ≈ N/e. The optimal policy is to interview about N/e candidates, but without selecting
any of these, and then select the first one thereafter that is the best of all those seen so
far. The probability of success is V(0, t_0) ≈ t_0/N ≈ 1/e = 0.3679. It is surprising that the
probability of success is so large for arbitrarily large N.
There are a couple of things we should learn from this example. (i) It is often useful to
try to establish the fact that terms over which a maximum is being taken are monotone in
opposite directions, as we did with t/N and V(0, t). (ii) A typical approach is to first deter-
mine the form of the solution, then find the optimal cost (reward) function by backward
recursion from the terminal point where its value is known.
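The exact threshold t_0 and the corresponding success probability follow directly from the two displayed formulas. The short sketch below evaluates them for a few values of N and compares with the 1/e asymptotics.

# Example 6: t_0 = smallest integer with sum_{tau=t_0}^{N-1} 1/tau <= 1,
# success probability (as in the text) V(0, t_0) = (t_0/N) * sum_{tau=t_0}^{N-1} 1/tau.
import math

def secretary(N):
    t0 = next(t for t in range(1, N) if sum(1.0 / tau for tau in range(t, N)) <= 1.0)
    prob = (t0 / N) * sum(1.0 / tau for tau in range(t0, N))
    return t0, prob

for N in (10, 100, 1000):
    t0, prob = secretary(N)
    print(N, t0, round(prob, 4), "  (N/e ~", round(N / math.e, 1), ", 1/e ~ 0.3679)")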
2.4 LQ optimal control problem
In this section we shall solve the optimal control problem where the dynamical system is
linear and the cost function is quadratic, using DP. We begin with the discrete-time problem
and then extend the results to continuous time.
Consider now the system in state-space form
x_{t+1} = A_t x_t + B_t u_t
where x ∈ R^n and u ∈ R^m, A is n×n and B is n×m. We wish to minimize the quadratic
cost
J = Σ_{t=0}^{N-1} (x_t′, u_t′) [Q_t  S_t′; S_t  R_t] [x_t; u_t] + x_N′ P_N x_N
where Q and P_N are given and positive semi-definite, and R is given and positive definite.
This is a model for regulation of (x, u) to the point (0, 0). The following lemma will be
used frequently.
Lemma 2.3. Suppose x and u are vectors. Consider the quadratic form
f(x, u) := (x′, u′) [P_xx  P_xu; P_ux  P_uu] [x; u].
Assume that the matrix [P_xx  P_xu; P_ux  P_uu] is symmetric and that P_uu is positive
definite. Then the minimum with respect to u is attained at
u = -P_uu^{-1} P_ux x,
and is equal to
x′ (P_xx - P_xu P_uu^{-1} P_ux) x.
Proof. Suppose that the quadratic form is minimized at u. Then according to the neces-
sary condition for optimality
∂f/∂u = 2P_uu u + 2P_ux x = 0.
Here the symmetry of the matrix is used. So we have
u = -P_uu^{-1} P_ux x.
Since P_uu is positive definite, the function has a global minimum, so the above value of
u is indeed an optimal solution. A straightforward calculation leads to the optimal value
as stated in the lemma.
Now we proceed to the solution of the LQ control problem. Let the cost-to-go function
be
J_t(x_t, u_t, t) = Σ_{τ=t}^{N-1} (x_τ′, u_τ′) [Q_τ  S_τ′; S_τ  R_τ] [x_τ; u_τ] + x_N′ P_N x_N.
Then the optimal cost-to-go function is
V_t(x_t, t) = min_{u_t} J_t(x_t, u_t, t)
and the optimal cost of our problem is V(x_0, 0). According to the optimality principle, we
find the optimal cost function backwards. Obviously V_N(x_N, N) = x_N′ P_N x_N. Next we have
the DP equation
(x
N1
, N 1)
= min
u
N1
[(x

N1
, u
N1
)
_
Q
N1
S

N1
S
N1
R
N1
__
x
N1
u
N1
_
+V
N
(A
N1
x
N1
+B
N1
u
N1
, N)]
= min
u
N1
[(x

N1
, u
N1
)
_
Q
N1
S

N1
S
N1
R
N1
__
x
N1
u
N1
_
+ (A
N1
x
N1
+B
N1
u
N1
)

P
N
(A
N1
x
N1
+B
N1
u
N1
)]
= min
u
N1
[(x

N1
, u
N1
)
_
Q
N1
+A

N1
P
N
A
N1
S

N1
+A

N1
P
N
B
N1
S
N1
+B

N1
P
N
A
N1
R
N1
+B

N1
P
N
B
N1
__
x
N1
u
N1
_
]
Since R
N1
is positive denite and P
N
is positive semidenite the matrix R
N1
+B

N1
P
N
B
N1
is positive denite. By Lemma 2.3
u
N1
= (R
N1
+B

N1
P
N
B
N1
)
1
(S
N1
+B

N1
P
N
A
N1
)x
N1
is the optimal control at this stage and the value function is
V
N1
(x
N1
, N 1) = x

N1
P
N1
x
N1
,
where
P
N1
= Q
N1
+A

N1
P
N
A
N1
(S
N1
+B

N1
P
N
A
N1
)

(R
N1
+B

N1
P
N
B
N1
)
1
(S
N1
+B

N1
P
N
A
N1
).
Continue in this way backwards, we obtain the following theorem.
Theorem 2.4. The value function of the LQ problem has the quadratic form
V_t(x_t, t) = x_t′ P_t x_t, for t ≤ N,
and the optimal control is
u_t = K_t x_t, t < N,
where
K_t = -(R_t + B_t′P_{t+1}B_t)^{-1}(S_t + B_t′P_{t+1}A_t), t < N.
The time-varying matrix P_t satisfies the Riccati equation
P_t = Q_t + A_t′P_{t+1}A_t - (S_t + B_t′P_{t+1}A_t)′(R_t + B_t′P_{t+1}B_t)^{-1}(S_t + B_t′P_{t+1}A_t), t < N,
with the terminal condition P_N given.
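The theorem translates directly into a backward recursion. The following sketch implements it for time-invariant matrices; the particular A, B, Q, R, S, P_N below are illustrative choices only.

# Backward Riccati recursion of Theorem 2.4 (time-invariant data used for illustration).
import numpy as np

def lq_backward(A, B, Q, R, S, P_N, N):
    P = P_N.copy()
    gains = []
    for _ in range(N):                       # t = N-1, N-2, ..., 0
        G = S + B.T @ P @ A                  # S_t + B_t' P_{t+1} A_t
        H = R + B.T @ P @ B                  # R_t + B_t' P_{t+1} B_t  (positive definite)
        K = -np.linalg.solve(H, G)           # feedback gain K_t
        P = Q + A.T @ P @ A - G.T @ np.linalg.solve(H, G)
        gains.append(K)
    return P, gains[::-1]                    # P_0 and K_0, ..., K_{N-1}

A = np.array([[1.0, 1.0], [0.0, 1.0]])       # small illustrative example (no cross term)
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); S = np.zeros((1, 2)); P_N = np.eye(2)
P0, K = lq_backward(A, B, Q, R, S, P_N, N=20)
print(K[0])                                  # u_0 = K_0 x_0; optimal cost is x_0' P0 x_0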
Remark 2.5. (i) Note that S can be normalized to zero by choosing a new control
ū = u + R^{-1}Sx, and setting Ā = A - BR^{-1}S, Q̄ = Q - S′R^{-1}S.
(ii) The optimally controlled process obeys x_{t+1} = Γ_t x_t. Here the matrix Γ_t is the so-called
gain matrix and is defined by
Γ_t = A_t + B_t K_t.
(iii) We can derive the solution of the continuous-time LQ problem from the discrete-time
solution. In continuous time we take the state-space representation ẋ = Ax + Bu
and the cost function
J = ∫_0^{t_f} (x′, u′) [Q  S′; S  R] [x; u] dt + (x′Px)(t_f).
Moving forward in time in increments of δ we have
x_{t+1} → x_{t+δ}, A → I + Aδ, B → Bδ, and R, S, Q → Rδ, Sδ, Qδ.
Then, as before, V(x, t) = x′Px, where P obeys the Riccati equation obtained after a lengthy
though straightforward calculation (letting δ → 0 and dropping the higher order
terms):
dP/dt + Q + A′P + PA - (S′ + PB)R^{-1}(S + B′P) = 0.
Observe that this is simpler than the discrete-time version. The optimal control is
u(t) = K(t)x(t)
where
K(t) = -R^{-1}(S + B′P(t)).
The optimally controlled system is ẋ = Γ(t)x, where Γ(t) = A + BK(t).
(iv) The solvability of the Riccati equation and the existence of a unique positive definite solution
of the Riccati equation are important and have their own interest. However, they
are beyond this course.
2.5 DP in continuous time
We study deterministic dynamic programming in continuous time. Consider the dynamical
system
ẋ = f(x, u, t), x(t_0) = x_0
along with a cost functional
J(t_0, x_0, u(·)) = ∫_{t_0}^{t_f} q(x, u, t)dt + Q(x(t_f), t_f),
where x ∈ R^n, u ∈ U ⊂ R^m, and U is a set of admissible controls. We wish to minimize this
cost. The cost-to-go functional can be defined as
J(t′, x(t′), u(·)) = ∫_{t′}^{t_f} q(x(τ), u(τ), τ)dτ + Q(x(t_f), t_f).
Next we define the optimal cost-to-go functional, the value function, as
J*(t′, x(t′)) = min_{u(·)} J(t′, x(t′), u(·)),
where the optimization is performed over all admissible controls. This means in particular
that the optimization problem can be written as
(P)  min_{u(·)} J(t_0, x_0, u(·))   (2.5)
where the optimization is performed over all admissible controls u(t) ∈ U. Moreover, if
u*(·) is the optimal control function then
V(x_0, t_0) = J*(t_0, x_0) = J(t_0, x_0, u*(·)).
Note that J*(t_0, x_0) should satisfy the boundary condition J*(t_f, x) = Q(x, t_f).
Next we prove the optimality principle.
Proposition 2.6 (The optimality principle). Let u*: [t_0, t_f] → R^m be an optimal
control for (P) that generates the optimal trajectory x*: [t_0, t_f] → R^n. Then for
any t′ ∈ (t_0, t_f], the restriction of the optimal control to [t′, t_f], u*|_[t′, t_f], is optimal for
min_{u(·)} J(t′, x*(t′), u(·)) and the corresponding optimal trajectory is x*|_[t′, t_f].
Proof. By additivity of the cost functional we have
J*(t_0, x_0) = ∫_{t_0}^{t′} q(x*(t), u*(t), t)dt + J(t′, x*(t′), u*|_[t′, t_f]).
Assume u*|_[t′, t_f] were not optimal over the interval [t′, t_f] when the initial point is x(t′) =
x*(t′). Then there would exist an admissible control function ū(·) defined on [t′, t_f] such
that
J(t′, x*(t′), ū(·)) < J(t′, x*(t′), u*|_[t′, t_f]).
The control
u(t) = u*(t) for t ∈ [t_0, t′) and u(t) = ū(t) for t ∈ [t′, t_f]
is admissible and gives the cost
J(t_0, x_0, u(·)) = ∫_{t_0}^{t′} q(x*(t), u*(t), t)dt + J(t′, x*(t′), ū(·))
< ∫_{t_0}^{t′} q(x*(t), u*(t), t)dt + J(t′, x*(t′), u*|_[t′, t_f]) = J*(t_0, x_0).
This contradicts the optimality of u*(·) and we have shown that the restriction u*|_[t′, t_f] is
optimal over [t′, t_f].
Now we are in a position to derive the dynamic programming equation. Following the
logic of the discrete-time case, the value function should satisfy approximately
V(x, t) ≈ inf_u [q(x, u, t)δ + V(x + f(x, u, t)δ, t + δ)].
By Taylor expansion,
V(x + f(x, u, t)δ, t + δ) ≈ V(x, t) + V_t(x, t)δ + V_x(x, t)f(x, u, t)δ.
Then
V(x, t) = inf_{u∈U} [q(x, u, t)δ + V(x, t) + V_t(x, t)δ + V_x(x, t)f(x, u, t)δ].
Since V(x, t) is independent of u, we can pull it outside the minimization. This leads to
the following partial differential equation
V_t(x, t) + inf_{u∈U} [q(x, u, t) + V_x(x, t)f(x, u, t)] = 0   (2.6)
with the boundary condition V(x, t_f) = Q(x, t_f). This optimality equation is
called the Hamilton-Jacobi-Bellman equation.
Remark 2.7. Note that what we have derived is the following. Assume there is an op-
timal control u(·), and the optimal cost-to-go function V(x, t) has continuous partial
derivatives in both x and t, i.e., it is a C¹ function. Then V(x, t) is a solution of the
HJB equation and u(t) is the minimizing argument in (2.6) (pointwise). This necessary
condition for optimality is not so useful since it assumes that the value function V is C¹.
Of course, this is not always the case. This is one of the drawbacks of this approach,
while Pontryagin's maximum principle can do better. Another drawback is that the HJB
equation is a PDE whose closed-form solution is often difficult to find. Moreover, it is
also difficult to solve it numerically, since a numerical procedure requires a discretization
of the whole state space, which is normally very large. However, the advantage of the
HJB equation is that it also gives a sufficient condition for optimality. We shall prove the
so-called verification theorem for DP.
Theorem 2.8 (Verification theorem for DP). Assume
1. V: R^n × [t_0, t_f] → R is C¹ (in both arguments) and solves (HJB);
2. μ(t, x) = arg min_{u∈U} [q(x, u, t) + V_x(x, t)f(x, u, t)] is admissible.
Then
V(x, t) = J*(t, x) for all (t, x) ∈ [t_0, t_f] × R^n
and μ(t, x(t)) is the optimal feedback control.
Proof. Let u(·) be an admissible control on [t_0, t_f] and let x(·) be the corresponding solu-
tion to ẋ(t) = f(x(t), u(t), t), x(t_0) = x_0. We have
V(x(t_f), t_f) - V(x_0, t_0) = ∫_{t_0}^{t_f} (d/dt)V(x(t), t) dt
= ∫_{t_0}^{t_f} (V_t(x(t), t) + V_x(x(t), t)f(x(t), u(t), t)) dt
≥ -∫_{t_0}^{t_f} q(x(t), u(t), t)dt
where the inequality follows since (HJB) implies
V_t(x(t), t) + V_x(x(t), t)f(x(t), u(t), t) ≥ -q(x(t), u(t), t).
Using the boundary condition V(x(t_f), t_f) = Q(x(t_f), t_f) yields
V(x_0, t_0) ≤ Q(x(t_f), t_f) + ∫_{t_0}^{t_f} q(x(t), u(t), t)dt = J(t_0, x_0, u(·)).
This inequality holds for all admissible controls, in particular for the optimal u*(·). Therefore
we have shown that
V(x_0, t_0) ≤ J*(t_0, x_0)
for all initial points. We shall next show that equality is achieved by using u(t) =
μ(t, x(t)). Note that condition (2) means
min_{u∈U} [q(x, u, t) + V_x(x, t)f(x, u, t)] = q(x, μ(t, x), t) + V_x(x, t)f(x, μ(t, x), t).
This, together with (HJB), yields
V_t(x, t) + V_x(x, t)f(x, μ(t, x), t) = -q(x, μ(t, x), t).
Integrating this equation gives
V(x_0, t_0) = Q(x(t_f), t_f) + ∫_{t_0}^{t_f} q(x(t), μ(t, x(t)), t)dt = J(t_0, x_0, u(·)) ≥ J*(t_0, x_0)
where the last inequality follows since J*(t_0, x_0) = min_{u(·)} J(t_0, x_0, u(·)). Thus we have
J*(t_0, x_0) ≤ V(x(t_0), t_0) ≤ J*(t_0, x_0).
This shows that (x(t), μ(t, x(t))) is the optimal state and control trajectory, i.e., x*(t) = x(t)
and u*(t) = μ(t, x(t)). Since t_0 and x_0 are arbitrary this completes the proof of the
theorem.
Note that dynamic programming holds for all initial values. Now we give some
examples to illustrate how to solve the HJB equation.
Example 7. LQ-problem, revisited. Now we derive the solution of the continuous-time LQ
problem by the HJB equation. To make the computation simpler we normalize the matrix
S to zero. So we have the following PDE,
0 = V_t(x, t) + min_u [x′Qx + u′Ru + V_x(x, t)(Ax + Bu)]
and the boundary condition V(x, t_f) = x′P(t_f)x. We make an ansatz: V(x, t) = x′P(t)x,
where P is an n×n symmetric matrix. Then
V_t(x, t) = x′Ṗx, V_x(x, t) = 2x′P.
Substituting this into the previous equation yields
0 = x′Ṗx + min_u [x′Qx + u′Ru + 2x′PAx + 2x′PBu].
Now we find the minimizing u first. That the matrix R is positive definite guarantees
the global minimum. The minimum is attained at
u = -R^{-1}B′P(t)x.
So the equation becomes
0 = x′(Ṗ + Q + PA + A′P - PBR^{-1}B′P)x.
It holds if P satisfies the Riccati equation
Ṗ + Q + PA + A′P - PBR^{-1}B′P = 0.
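This matrix ODE rarely has a closed-form solution, but it is straightforward to integrate backwards from P(t_f). The sketch below does this with scipy for a small illustrative system (the matrices are chosen here, not taken from the text) and recovers the feedback gain K(t) = -R^{-1}B′P(t).

# Backward integration of the continuous-time Riccati equation of Example 7 (S = 0).
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, 0.0]])      # illustrative data (double integrator)
B = np.array([[0.0], [1.0]])
Q = np.eye(2); R = np.array([[1.0]]); P_tf = np.eye(2)
t_f = 5.0

def riccati_rhs(t, p_flat):
    P = p_flat.reshape(2, 2)
    dP = -(Q + A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P))
    return dP.ravel()

sol = solve_ivp(riccati_rhs, (t_f, 0.0), P_tf.ravel(), dense_output=True, rtol=1e-8)
P0 = sol.sol(0.0).reshape(2, 2)
K0 = -np.linalg.solve(R, B.T @ P0)          # K(0) = -R^{-1} B' P(0)
print(P0)
print(K0)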
In the LQ problem we can find a closed-form solution for the HJB equation. The
following example is an application of LQ control.
Example 8. (Tracking under disturbances) A problem which arises often is that of
finding controls so as to force the output of a given system to track (follow) a desired
reference signal r(·). We also allow a disturbance to act on the system.
Let Σ be a linear system over R with outputs
ẋ = A(t)x + B(t)u + φ(t), y = C(t)x,
where φ is an R^n-valued fixed measurable essentially bounded function. Assume given an
R^p-valued fixed measurable essentially bounded function r. We consider the cost criterion
J = ∫_σ^τ [ω(t)′R(t)ω(t) + e(t)′Qe(t)]dt + ξ(τ)′Sξ(τ),
where ξ̇ = Aξ + Bω + φ, ξ(σ) = x^0, and e = Cξ - r is the tracking error. We assume that
R is an m×m symmetric matrix of measurable essentially bounded functions, Q is a
p×p symmetric matrix of measurable essentially bounded functions, and S is a constant
symmetric n×n matrix. Further we assume that R is positive definite and Q and S are
positive semidefinite, for each t.
The problem is then that of minimizing J for a given disturbance φ, reference signal r,
and initial state, by an appropriate choice of controls. This problem can be reduced to the
LQ control problem. See details on pages 372-375 in Sontag.
Example 9. (Deterministic Kalman filtering) Let (A, B, C) be a time-varying continuous-
time linear system, and let Q, R, S be measurable essentially bounded matrix functions of
sizes n×n, m×m and n×n, respectively. Assume that both S and R(·) are positive
definite.
For any given measurable essentially bounded function y on an interval [σ, τ], minimize
the expression
∫_σ^τ [ω(t)′R(t)ω(t) + (C(t)ξ(t) - y(t))′Q(t)(C(t)ξ(t) - y(t))]dt + ξ(σ)′Sξ(σ)
over the set of all possible trajectories (ξ, ω) of ξ̇ = Aξ + Bω on [σ, τ].
By reversing time, so that the cost is imposed on the final state, this problem can
be reduced to the tracking problem in the preceding example. This problem is sometimes
termed (deterministic) Kalman filtering. See motivation and solution on pages 375-379
in Sontag.
Before doing the next example we want to point out a special type of optimal control
problem, namely, optimization of the so-called discounted cost
J = ∫_0^{t_f} e^{-αt} q(x, u, t)dt + e^{-αt_f} Q(x(t_f), t_f).
Let the value function for the discounted optimal cost-to-go be Ṽ(x, t). By the
HJB equation, it satisfies the following PDE
Ṽ_t(x, t) + inf_{u∈U} [e^{-αt}q(x, u, t) + Ṽ_x(x, t)f(x, u, t)] = 0
with the boundary value condition Ṽ(x, t_f) = e^{-αt_f}Q(x, t_f). Now let Ṽ(x, t) = e^{-αt}V(x, t).
Then V(x, t) satisfies the following equation
V_t(x, t) - αV(x, t) + inf_{u∈U} [q(x, u, t) + V_x(x, t)f(x, u, t)] = 0   (2.7)
with V(x, t_f) = Q(x, t_f). We call this equation the discounted HJB equation.
Example 10. Estate planning. A man is considering his lifetime plan of investment and
expenditure. He has an initial level of savings S and no income other than that which he
obtains from investment at a fixed interest rate. His total capital x is therefore governed
by the equation
ẋ = βx - u,
where β > 0 and u denotes his rate of expenditure. His immediate enjoyment due to
expenditure is U(u), where U is his utility function; in his case U(u) = u^{1/2}. Future
enjoyment, at time t, is counted less today, at time 0, by incorporation of a discount term
e^{-αt}. Thus, he wishes to maximize
J = ∫_0^{t_f} e^{-αt} U(u)dt,
assuming β/2 < α.
We try to solve equation (2.7), i.e.,
0 = max_u (√u - αV + V_t + V_x(βx - u))
with V(x, t_f) = 0. We try a solution of the form V(x, t) = f(t)√x. For this to work we
need
0 = max_u (√u - αf(t)√x + ḟ(t)√x + (f(t)/(2√x))(βx - u)).
The maximizing control is u = x/f². (Do not forget to check the sufficient condition!) The
optimal value is thus
(√x/f)(1/2 - (α - β/2)f² + fḟ).
Therefore we shall have a solution if we choose f to make the term within (·) equal to
0. The boundary condition V(x, t_f) = 0 implies that f(t_f)√x(t_f) = 0, which implies
f(t_f) = 0. Hence we shall find the solution to
1/2 - (α - β/2)f² + fḟ = 0
with the terminal condition f(t_f) = 0. Thus we find
f(t)² = (1 - e^{-(2α-β)(t_f-t)})/(2α - β).
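A quick check of the closed form: with g = f² the equation 1/2 - (α - β/2)f² + fḟ = 0 becomes the linear ODE ġ = (2α - β)g - 1 with g(t_f) = 0, and the displayed formula should satisfy it. The sketch below verifies this by finite differences for one arbitrary choice of α, β and t_f.

# Check that g(t) = f(t)^2 = (1 - exp(-(2a - b)(t_f - t))) / (2a - b) solves
# g'(t) = (2a - b) g(t) - 1 with g(t_f) = 0  (Example 10).
import numpy as np

a, b, t_f = 0.5, 0.6, 10.0            # alpha, beta with b/2 < a (illustrative values)
t = np.linspace(0.0, t_f, 2001)
g = (1.0 - np.exp(-(2*a - b) * (t_f - t))) / (2*a - b)

dg = np.gradient(g, t)                # numerical derivative of g
residual = dg - ((2*a - b) * g - 1.0)
print(abs(residual[1:-1]).max())      # small: only discretization error remains
print(g[-1])                          # g(t_f) = 0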
There are few problems that can be solved analytically. We close this section with a
short comment on the use of the HJB equation. For a given optimal control problem of
the form
min J = ∫_0^{t_f} q(x, u, t)dt + Q(x(t_f), t_f)
subject to ẋ = f(x, u, t) with u ∈ U, we take the following steps.
1. Optimize pointwise over u to obtain
u*(p, x, t) = arg min_{u∈U} [q(x, u, t) + p′f(x, u, t)].
Here p ∈ R^n is a parameter vector.
2. Define H(p, x, t) = q(x, u*(p, x, t), t) + p′f(x, u*(p, x, t), t).
3. Solve the PDE
V_t(x, t) = -H(V_x(x, t), x, t)
with the boundary condition V(x, t_f) = Q(x, t_f).
If we define H(p, x, u, t) = q(x, u, t) + p′f(x, u, t) then u*(x, t, p) = arg min_{u∈U} H(p, x, u, t).
The same optimization is a part of the condition in Pontryagin's minimum (maximum)
principle.
3 Pontryagin's principle I: a dynamic programming approach
We explain Pontryagin's maximum principle, derive it first by dynamic programming
(through the HJB equation) under stronger assumptions and then by variational calculus,
and give some examples of its use.
Consider the time-invariant optimal control problem.
Problem: Minimize the cost function
J = ∫_0^{t_f} q(x, u)dt + Q(x(t_f))
subject to
ẋ = f(x, u),
with x(0) = x_0 being fixed and u ∈ U, a set of admissible controls.
The HJB equation is
V_t(x, t) + inf_u [q(x, u) + V_x(x, t)f(x, u)] = 0   (3.1)
and the boundary condition is V(x, t_f) = Q(x).
Assume that (u*, x*) is an optimal solution; then (3.1) gives
0 = V_t(x*, t) + q(x*, u*) + V_x(x*, t)f(x*, u*).   (3.2)
Define now the so-called adjoint variable (or co-state variable)
p′ = V_x(x*, t),
where p ∈ R^n, and the Hamiltonian function
H(p, x*, u*) = q(x*, u*) + p′f(x*, u*).
It is obvious that H_p = f(x*, u*) = ẋ*. For our purpose we make a stronger assumption
on V: V is twice differentiable. Then
ṗ′ = dV_x/dt = V_xt(x*, t) + (ẋ*)′V_xx(x*, t) = V_xt(x*, t) + f(x*, u*)′V_xx(x*, t).
On the other hand, differentiating (3.2) with respect to x gives
V_xt(x*, t) + q_x(x*, u*) + f(x*, u*)′V_xx(x*, t) + V_x(x*, t)f_x(x*, u*) = 0
or equivalently,
V_tx(x*, t) + f(x*, u*)′V_xx(x*, t) = -q_x(x*, u*) - p′f_x(x*, u*).
Therefore we have a system for the adjoint state
ṗ′ = -q_x - p′f_x
with the terminal condition p(t_f)′ = Q_x(x(t_f)). If u* is an optimizing argument then
H(p(t), x(t), v) ≥ H(p(t), x(t), u(t))
for all t, 0 ≤ t ≤ t_f, and all admissible controls v ∈ U.
Note that the optimization of the Hamiltonian function over u is the same as in the
dynamic programming approach. We summarize the above arguments as a theorem.
Theorem 3.1 (Pontryagin's minimum principle (PMP)). Suppose that u(t) ∈ U and
x(t) represent the optimal control and state trajectory for the optimal control problem
Problem. Then there is an adjoint trajectory p(t) such that together u(t), x(t), and p(t)
satisfy
(i) state equation and initial state condition:
ẋ(t) = f(x(t), u(t)), x(0) = x_0,
(ii) adjoint equation and final condition:
ṗ′ = -p′f_x(x(t), u(t)) - q_x(x(t), u(t)), p(t_f)′ = Q_x(x(t_f)),
(iii) minimum condition: for all t, 0 ≤ t ≤ t_f, and all v ∈ U
H(p(t), x(t), v) ≥ H(p(t), x(t), u(t)),
where H is the Hamiltonian
H(p, x, u) = p′f(x, u) + q(x, u).
In PMP we have n state equations, n adjoint equations and m conditions from the
minimum condition. So they determine the 2n + m variables in general. Note that we did not
assume that the optimal cost function is twice differentiable, although we used this in the
derivation. The reason is that it is not needed when you use variational calculus. Neither
do we need the C¹ condition on the optimal cost function. PMP gives a necessary condition
for optimality. Some other remarks are in order. In fact we can claim that the problem
stated earlier covers most optimal control problems where the terminal time is fixed.
Remark 3.2. This theory can easily be extended to problems where the system equations,
the constraints, and the cost functions are all explicit functions of time. Consider the
problem with
ẋ(t) = f(x(t), u(t), t), x(0) = x_0, u(t) ∈ U,
J = ∫_0^{t_f} q(x(t), u(t), t)dt + Q(x(t_f), t_f).   (3.3)
Defining an additional state variable x_{n+1} = t, this problem can be converted to an equivalent
problem without explicit t dependence. Let z(t) = [x; x_{n+1}]. Then z(0) = [x_0; 0] and
f̃ = [f; 1]. Then we have the state equation
ż(t) = f̃(z(t), u(t))
with initial condition z(0) given, and u(t) ∈ U, and the cost function
J = ∫_0^{t_f} q(z(t), u(t))dt + Q(z(t_f)).
It is easy to see that the state equation, the state initial condition and the minimum
condition are the same as in PMP above, but with explicit t dependence. Nevertheless
there are now n + 1 adjoint variables. We keep the same notation as in the theorem,
p̃′ = (p′, p_{n+1}). The first n adjoint variables satisfy the condition
ṗ′ = -p′f_x(x, u, t) - q_x(x, u, t),
while
ṗ_{n+1} = -p′f_t(x, u, t) - q_t(x, u, t)
with the terminal condition
p_{n+1}(t_f) = Q_t(x(t_f), t_f).
Hence the minimum principle for the time-varying optimal control problem is
(i) state equation and initial state condition:
ẋ(t) = f(x(t), u(t), t), x(0) = x_0,
(ii) adjoint equation and final condition:
ṗ(t)′ = -p(t)′f_x(x(t), u(t), t) - q_x(x(t), u(t), t), p(t_f)′ = Q_x(x(t_f), t_f),
ṗ_{n+1} = -p′f_t(x, u, t) - q_t(x, u, t).
Note that the terminal condition for p_{n+1} is free because the terminal time is fixed.
(iii) minimum condition: for all t, 0 ≤ t ≤ t_f, and all v ∈ U
H(x(t), v, p(t)) ≥ H(x(t), u(t), p(t)),
where H is the Hamiltonian
H(x, u, p, t) = p′f(x, u, t) + q(x, u, t).
Remark 3.3. From the theorem and the previous remark we can draw the following con-
clusions:
(i) the adjoint variable is the gradient of the value function with respect to the state
vector.
(ii) In DP the value function is obtained by solving a PDE (the HJB equation). This
is a consequence of the approach of looking for an optimal control from any initial
point.
(iii) In PMP we only solve for the value function (or rather its gradient, which is the adjoint
variable) for a special initial condition. This gives a two-point boundary value prob-
lem for ODEs, which is normally much easier to deal with than the HJB equation.
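Remark 3.3(iii) is easy to see in practice: the PMP conditions form a two-point boundary value problem that standard ODE solvers handle. The sketch below uses scipy's solve_bvp on a small illustrative problem chosen here (it is not one of the examples in the text): minimize ∫_0^1 (x² + u²)dt subject to ẋ = u, x(0) = 1, with x(1) free. Here H = x² + u² + pu, so u = -p/2, ṗ = -2x and p(1) = 0.

# PMP two-point BVP for: min \int_0^1 (x^2 + u^2) dt, xdot = u, x(0) = 1, x(1) free.
import numpy as np
from scipy.integrate import solve_bvp

def rhs(t, y):                      # y[0] = x, y[1] = p
    x, p = y
    return np.vstack((-p / 2.0, -2.0 * x))

def bc(ya, yb):
    return np.array([ya[0] - 1.0,   # x(0) = 1
                     yb[1]])        # p(1) = 0 (terminal cost Q = 0)

t = np.linspace(0.0, 1.0, 11)
sol = solve_bvp(rhs, bc, t, np.zeros((2, t.size)))

u = -sol.sol(t)[1] / 2.0            # optimal control u(t) = -p(t)/2
print(sol.status, u)                # status 0 means the solver converged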
4 Calculus of variations
This subject dates back to Newton, and we have room for only a brief treatment. In
particular we shall not mention the well-known Euler equation approach which can be
found in standard texts on variational calculus.
We consider the problem of minimizing J(u) in (1.13) subject to the system (1.1) and
initial conditions (1.2). We assume that there are no constraints on the control functions
u_i(t) and that J(u) is differentiable, i.e., if u and u + δu are two controls for which J is
defined then
ΔJ = J(u + δu) - J(u) = δJ(u, δu) + j(u, δu)‖δu‖   (4.1)
where δJ is linear in δu and j(u, δu) → 0 as ‖δu‖ → 0 (using any suitable norm). In (4.1)
δJ is termed the variation in J corresponding to a variation δu in u. The control u* is an
extremal, and J has a relative minimum, if there exists an ε > 0 such that for all functions
u satisfying ‖u - u*‖ < ε the difference J(u) - J(u*) is nonnegative. A fundamental result
is the following.
Theorem 4.1. A necessary condition for u* to be an extremal is that δJ(u*, δu) = 0 for
all δu.
The proof is omitted.
We now apply Theorem 4.1. Introduce a vector of Lagrange multipliers p = [p_1, ..., p_n]′
so as to form an augmented functional incorporating the constraints:
J_a = Q[x(t_1), t_1] + ∫_{t_0}^{t_1} [q(x, u, t) + p′(f - ẋ)]dt   (4.2)
Integrating the last term by parts gives
J_a = Q[x(t_1), t_1] - [p′x]_{t_0}^{t_1} + ∫_{t_0}^{t_1} [H + ṗ′x]dt   (4.3)
where the Hamiltonian function is defined by
H(x, u, t) = q(x, u, t) + p′f.   (4.4)
Assume that u is continuous and differentiable on t_0 ≤ t ≤ t_1 and that t_0 and t_1 are
fixed. The variation in J_a corresponding to a variation δu in u is
δJ_a = [(∂Q/∂x - p′)δx]_{t=t_1} + ∫_{t_0}^{t_1} [(∂H/∂x)δx + (∂H/∂u)δu + ṗ′δx] dt   (4.5)
where δx is the variation in x in the differential equations (1.1) due to δu, using the
notation
∂H/∂x = [∂H/∂x_1, ..., ∂H/∂x_n]
and similarly for ∂Q/∂x and ∂H/∂u. Note that since x(t_0) is specified, (δx)_{t=t_0} = 0. It is
convenient to remove the term in (4.5) involving δx by suitably choosing p, i.e. by taking
ṗ_i = -∂H/∂x_i, i = 1, 2, ..., n   (4.6)
and
p_i(t_1) = (∂Q/∂x_i)_{t=t_1}.   (4.7)
Eqn (4.5) then reduces to
δJ_a = ∫_{t_0}^{t_1} (∂H/∂u)δu dt   (4.8)
so by Theorem 4.1 a necessary condition for u* to be an extremal is that
(∂H/∂u_i)_{u=u*} = 0, t_0 ≤ t ≤ t_1, i = 1, ..., m.   (4.9)
We have therefore established:
Theorem 4.2. Necessary conditions for u* to be an extremal for (1.13) subject to (1.1)
and (1.2) are that (4.6), (4.7), and (4.9) hold.
Remark 4.3. The Hamiltonian is constant. Suppose u is unconstrained and f and q are
autonomous, i.e., they do not explicitly depend on t. Let p, x, u satisfy the necessary
conditions for optimality; then H(p, x, u) is constant for t_0 ≤ t ≤ t_1. The following
calculation proves this statement. Let H(p, x, u) = p′f(x, u) + q(x, u). Then
dH/dt = H_x ẋ + H_u (du/dt) + ṗ′f = H_x f - H_x f = 0,
where we used the fact that H_u(p, x, u) = 0 and the state and adjoint equations. Thus
the Hamiltonian is constant. In other words, H is constant along the optimal trajectory.
Example 11. LQ-problem, revisited. We wish to minimize the cost function
J = (1/2)∫_0^{t_1} (x′Qx + u′Ru)dt + (1/2)x(t_1)′P_{t_1}x(t_1)
such that ẋ = Ax + Bu, x(0) = x_0, where Q and P_{t_1} are symmetric positive semi-definite
matrices and R is a symmetric positive definite matrix. The Hamiltonian is
H(p, x, u) = p′Ax + p′Bu + (1/2)x′Qx + (1/2)u′Ru.
Since this is an unconstrained problem, we can use the minimum condition H_u = 0, which
is p′B + u′R = 0, i.e., u = -R^{-1}B′p. This, together with the state equation, gives the
following two-point boundary value problem
ẋ = Ax - BR^{-1}B′p, x(0) = x_0,
ṗ = -Qx - A′p, p(t_1) = P_{t_1}x(t_1).
Or in matrix form
[ẋ; ṗ] = [A  -BR^{-1}B′; -Q  -A′] [x; p], with [x(0); p(t_1)] = [x_0; P_{t_1}x(t_1)].
In DP we learned that u = -R^{-1}B′Px. So we know less from Theorem 4.2. However, we
shall show that the solution is in fact the same. Note that the state and adjoint equations
form a linear system of 2n variables, which can be solved by the state transition matrix
Φ(t, s), partitioned as
[Φ_11(t, s)  Φ_12(t, s); Φ_21(t, s)  Φ_22(t, s)].
Each block is n×n. Therefore
[x_0; p(0)] = [Φ_11(0, t_1)  Φ_12(0, t_1); Φ_21(0, t_1)  Φ_22(0, t_1)] [x(t_1); P_{t_1}x(t_1)]
            = [Φ_11(0, t_1)  Φ_12(0, t_1); Φ_21(0, t_1)  Φ_22(0, t_1)] [I; P_{t_1}] x(t_1).   (4.10)
So we have p(0) in terms of x_0 and the transition matrix:
p(0) = (Φ_21(0, t_1) + Φ_22(0, t_1)P_{t_1})(Φ_11(0, t_1) + Φ_12(0, t_1)P_{t_1})^{-1}x_0,
provided (Φ_11(0, t_1) + Φ_12(0, t_1)P_{t_1})^{-1} exists. In a similar manner, we have
[x(t); p(t)] = [Φ_11(t, t_1)  Φ_12(t, t_1); Φ_21(t, t_1)  Φ_22(t, t_1)] [I; P_{t_1}] x(t_1).   (4.11)
It gives
p(t) = (Φ_21(t, t_1) + Φ_22(t, t_1)P_{t_1})(Φ_11(t, t_1) + Φ_12(t, t_1)P_{t_1})^{-1}x(t) =: P(t)x(t),
which in turn gives u = -R^{-1}B′Px. Now we have to show that P satisfies the Riccati
equation. Substituting the control law obtained and the relation between p and x above
yields
(-Ṗ - PA - A′P + PBR^{-1}B′P - Q)x = 0.
This holds if
-Ṗ - PA - A′P + PBR^{-1}B′P - Q = 0
with terminal condition P(t_1) = P_{t_1}.
We can also prove that P(t) can be obtained as P(t) = Y(t)X(t)^{-1}, where X(t) and
Y(t) solve the linear matrix equation
[Ẋ(t); Ẏ(t)] = [A  -BR^{-1}B′; -Q  -A′] [X; Y], with [X(t_1); Y(t_1)] = [I; P_{t_1}],
integrated backwards in time over the interval [0, t_1].
So far we have assumed that t_1 is fixed and x(t_1) is free. If this is not necessarily the
case, then by considering (4.3) the terms in δJ_a in (4.5) outside the integral are found
to be
[(∂Q/∂x - p′)δx + (H + ∂Q/∂t)δt]_{u=u*, t=t_1}.   (4.12)
The expression in (4.12) must be zero by virtue of Theorem 4.1, since (4.6) and (4.9) still
hold, making the integral in (4.5) zero. The implications of this for some important special
cases are now listed. The initial conditions (1.2) hold throughout.
Final time t_1 specified.
(i) x(t_1) free
In (4.12) we have δt_1 = 0 but δx(t_1) is arbitrary, so the conditions (4.7) must hold
(with the fact that H is a constant when appropriate), as before.
(ii) x(t_1) specified
In this case δt_1 = 0 and δx(t_1) = 0, so (4.12) is automatically zero. The conditions are
thus
x*(t_1) = x_f, the final state,   (4.13)
and (4.13) replaces (4.7).
Final time t_1 free
(iii) x(t_1) free
Both δt_1 and δx(t_1) are arbitrary, so for the expression in (4.12) to vanish, (4.7)
must hold together with
[H + ∂Q/∂t]_{u=u*, t=t_1} = 0.   (4.14)
In particular, if Q, q and f do not explicitly depend on t, then (4.14) and the fact that H is
a constant imply
(H)_{u=u*} = 0, t_0 ≤ t ≤ t_1.   (4.15)
(iv) x(t_1) specified
Only δt_1 is now arbitrary in (4.12), so the conditions are (4.13), (4.14) and
(4.15).
If the preceding conditions on x(t_1) apply to only some of its components, then since
the δx_i(t_1) in (4.12) are independent it follows that the appropriate conditions hold
only for these components.
Example 12. A particle of unit mass moves along the x-axis subject to a force u(t). It is
required to determine the control which transfers the particle from rest at the origin to
rest at x = 1 in unit time, so as to minimize the effort involved, measured by
∫_0^1 u²dt.
The equation of motion is ẍ = u, and taking x_1 = x, x_2 = ẋ we obtain the state equations
ẋ_1 = x_2, ẋ_2 = u   (4.16)
and from (4.4)
H = p_1x_2 + p_2u + u².
From (4.9) the optimal control is given by
2u* + p_2* = 0   (4.17)
and by (4.6) the adjoint equations are
ṗ_1* = 0, ṗ_2* = -p_1*.   (4.18)
Integrating (4.18) gives p_2* = c_1t + c_2, where c_1 and c_2 are constants. From (4.16) and
(4.17) we obtain
ẋ_2* = -(1/2)(c_1t + c_2)
which on integrating, and using the given conditions x_2(0) = 0 = x_2(1), produces
x_2*(t) = (1/2)c_2(t² - t), c_1 = -2c_2.
Finally, integrating the first equation in (4.16) and using x_1(0) = 0, x_1(1) = 1 gives
x_1*(t) = t²(3 - 2t), c_2 = -12,
and hence from (4.17) the optimal control is
u*(t) = 6(1 - 2t).
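As a sanity check, the candidate control can be integrated numerically to confirm that it meets the boundary conditions; the corresponding cost ∫_0^1 u² dt works out to 12. The sketch below carries the running cost as an extra state.

# Verify the candidate control u*(t) = 6(1 - 2t) of Example 12.
from scipy.integrate import solve_ivp

u = lambda t: 6.0 * (1.0 - 2.0 * t)

def rhs(t, y):
    x1, x2, cost = y
    return [x2, u(t), u(t) ** 2]          # x1' = x2, x2' = u, running cost' = u^2

sol = solve_ivp(rhs, (0.0, 1.0), [0.0, 0.0, 0.0], rtol=1e-10, atol=1e-12)
x1_end, x2_end, cost = sol.y[:, -1]
print(x1_end, x2_end, cost)               # ~1.0, ~0.0 and cost ~12.0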
Example 13. A ship moves through a region of strong currents. For simplicity, and by a
suitable choice of coordinates, assume that the current is parallel to the x_1-axis and has
velocity c = -V x_2/a, where a is a positive constant and V is the magnitude (assumed
constant) of the ship's velocity relative to the water. The problem is to steer the ship so
as to minimize the time of travel from some given point A to the origin.
Figure 2: geometry of the problem; the current is parallel to the x_1-axis and the steering angle is u.
We see in the figure that the control variable is the angle u. The equations of motion are
ẋ_1 = V cos u + c = V(cos u - x_2/a)   (4.19)
ẋ_2 = V sin u,   (4.20)
where (x_1(t), x_2(t)) denotes the position of the ship at time t. The performance index is
(1.3) with t_0 = 0, so from (4.4)
H = 1 + p_1V(cos u - x_2/a) + p_2V sin u.   (4.21)
The condition (4.9) gives
-p_1*V sin u* + p_2*V cos u* = 0
so that
tan u* = p_2*/p_1*.   (4.22)
The adjoint equations (4.6) are
ṗ_1* = 0, ṗ_2* = p_1*V/a,   (4.23)
which imply that p_1* = c_1, a constant. Since t_1 is not specified we have case (iv), so that
(4.13)-(4.15) hold. From (4.21) this gives
0 = 1 + p_1*V(cos u* - x_2*/a) + p_2*V sin u*
  = 1 + c_1V(cos u* - x_2*/a) + c_1V sin²u*/cos u*,   (4.24)
using the expression for p_2* in (4.22). Substituting x_2*(t_1) = 0 into (4.24) leads to
c_1 = -cos u_1/V,   (4.25)
where u_1 = u*(t_1). Eqn (4.25) reduces (4.24), after some rearrangement, to
x_2*/a = sec u* - sec u_1.   (4.26)
Differentiating (4.26) with respect to t gives
(du*/dt) sec u* tan u* = ẋ_2*/a = V sin u*/a
by (4.20). Hence
(V/a) dt/du* = sec²u*
which on integration produces
tan u* - tan u_1 = (V/a)(t - t_1).   (4.27)
Use of (4.19), (4.26) and (4.27) and some straightforward manipulation leads to an ex-
pression for x_1* in terms of u* and u_1, which enables the optimal path to be computed. A
typical minimum-time path is shown in Figure 3.
Figure 3: a typical minimum-time path.
If the state at time t_1 (assumed fixed) is to lie on a surface defined by some function
m[x(t)] = 0, where m may in general be a k-vector, then it can be shown that in addition
to the k conditions
m[x*(t_1)] = 0   (4.28)
there are further n conditions which can be written as
∂Q/∂x - p′ = d_1(∂m_1/∂x) + d_2(∂m_2/∂x) + ··· + d_k(∂m_k/∂x),   (4.29)
both sides being evaluated at t = t_1, u = u*, x = x*, p = p*. The d_i in (4.29) are constants
to be determined. Together with the 2n constants of integration there are thus a total of
2n + k unknowns and 2n + k conditions (4.28), (4.29) and (1.2). If t_1 is free then in
addition (4.14) holds.
The conditions which hold at t = t_1 for the various cases we have covered are summa-
rized in the table:

                                    t_1 fixed              t_1 free
x(t_1) free                         (4.7)                  (4.7) and (4.14)
x(t_1) fixed                        (4.13)                 (4.13) and (4.14)
x(t_1) lies on m[x(t_1)] = 0        (4.28) and (4.29)      (4.14), (4.28) and (4.29)
Example 14. If a second-order system is to be transferred from the origin to a circle of
unit radius, centre the origin, at some time t_1, then we must have
[x_1*(t_1)]² + [x_2*(t_1)]² = 1.   (4.30)
Since
m(x) = x_1² + x_2² - 1,
(4.29) gives
[p_1*(t_1), p_2*(t_1)] = d_1[2x_1*(t_1), 2x_2*(t_1)],
assuming Q ≡ 0, and hence
p_1*(t_1)/p_2*(t_1) = x_1*(t_1)/x_2*(t_1).   (4.31)
Eqns (4.30) and (4.31) are the conditions to be satisfied at t = t_1.
Example 15. A system described by
ẋ_1 = x_2, ẋ_2 = -x_2 + u   (4.32)
is to be transferred from x(0) = 0 to the line
ax_1 + bx_2 = c
at time t_1 so as to minimize
∫_0^{t_1} u²dt,
which is of the form (1.6). The values a, b, c and t_1 are given.
From (4.4)
H = u² + p_1x_2 - p_2x_2 + p_2u
and (4.9) gives
u* = -(1/2)p_2*.   (4.33)
The adjoint equations (4.6) are
ṗ_1* = 0, ṗ_2* = -p_1* + p_2*
so that
p_1* = c_1, p_2* = c_2e^t + c_1   (4.34)
where c_1 and c_2 are constants. Substituting (4.33) and (4.34) into (4.32) leads to
x_1* = c_3e^{-t} - (1/4)c_2e^t - (1/2)c_1t + c_4,  x_2* = -c_3e^{-t} - (1/4)c_2e^t - (1/2)c_1,
and the conditions
x_1*(0) = 0, x_2*(0) = 0, ax_1*(t_1) + bx_2*(t_1) = c   (4.35)
must hold.
It is easy to verify that (4.29) produces
p_1*(t_1)/p_2*(t_1) = a/b,   (4.36)
and (4.35) and (4.36) give four equations for the four unknown constants c_i. The
optimal control u*(t) is then obtained from (4.33) and (4.34).
In some problems the restriction on the total amount of control effort which can be expended to carry out a required task may be expressed in the form

∫_{t_0}^{t_1} q(x, u, t) dt = c   (4.37)

where c is a given constant, such a constraint being termed isoperimetric. A convenient way of dealing with (4.37) is to define a new variable

x_{n+1}(t) = ∫_{t_0}^{t} q(x, u, s) ds

so that

ẋ_{n+1}(t) = q(x, u, t).   (4.38)

The differential equation (4.38) is simply added to the original set (1.1) together with the conditions

x_{n+1}(t_0) = 0,  x_{n+1}(t_1) = c

and the procedure then continues as before, ignoring (4.37).
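A minimal sketch of this state augmentation in Python (my own illustration; the double integrator ẋ_1 = x_2, ẋ_2 = u and q(x, u) = u^2 are assumed examples): the isoperimetric constraint simply becomes a boundary condition on the extra coordinate.

import numpy as np

def augmented_dynamics(state, u, f):
    # dynamics of (x, x_{n+1}) where the extra coordinate integrates q(x, u) = u**2
    x = state[:-1]
    return np.append(f(x, u), u**2)

f = lambda x, u: np.array([x[1], u])     # double integrator, for illustration
state = np.array([1.0, 0.0, 0.0])        # (x1, x2, x_{n+1}(t0) = 0)
dt, u = 0.01, 0.5
for _ in range(100):                     # crude Euler integration over [0, 1]
    state = state + dt * augmented_dynamics(state, u, f)
print(state[-1])                         # approximates the integral of u^2; require it to equal c at t1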
Example 16. LQ control. Minimize

J = ∫_0^{t_1} ||u(t)||^2 dt   subject to   ẋ(t) = Ax(t) + Bu(t),  x(0) = x_0,  x(t_1) = 0.

Assume that the system is completely controllable. We start by minimizing the Hamiltonian

H(p, x, u) = p_0 ||u||^2 + p'(Ax + Bu).

There are two cases. You can skip the first part without any serious loss.

Case 1: p_0 = 0. If this is the case, we have arg min_u H(p, x, u) = arg min_u [p'(Ax + Bu)] = −∞ unless p'B = 0. It is, however, impossible to have an unbounded u on a nonzero time interval, since then the cost would be infinite, which clearly cannot be the minimum since we know that the system can be driven to the origin with finite energy expenditure. The other alternative, p(t)'B = 0 for t ∈ [0, t_1], is also impossible. To see this we note that the adjoint equation

ṗ(t) = −A'p(t)

has the solution p(t) = e^{−A't}p(0). Hence, in order for p(t)'B = 0 for t ∈ [0, t_1] we need

p(0)'B = 0, ṗ(0)'B = 0, ..., p^{(n−1)}(0)'B = 0
⟺ p(0)'B = 0, p(0)'AB = 0, ..., p(0)'A^{n−1}B = 0
⟺ p(0)'[B, AB, ..., A^{n−1}B] = 0.

If the system is controllable, then the matrix [B, AB, ..., A^{n−1}B] has full rank, which implies that p(0) = 0. But then p(t) ≡ 0, so the multiplier (p_0, p(t)) vanishes identically, which contradicts the theorem. This leads to the conclusion that p_0 = 0 is impossible for a controllable system.

Case 2: p_0 = 1. We have u(t) = −(1/2)B'p(t) minimizing the Hamiltonian. The adjoint equation is

ṗ(t) = −A'p(t)

which has the solution p(t) = e^{−A't}p(0), with x(0) = x_0. By the variation of constants formula we obtain

x(t_1) = e^{At_1}x_0 − (1/2) ∫_0^{t_1} e^{A(t_1−s)}BB'e^{−A's} ds p(0)
       = e^{At_1}x_0 − (1/2) W(t_1, 0) e^{−A't_1} p(0)

where the reachability Gramian is

W(t_1, 0) = ∫_0^{t_1} e^{A(t_1−s)}BB'e^{A'(t_1−s)} ds.

In our case the system is controllable and therefore W(t_1, 0) is positive definite and thus invertible. We can solve for p(0), which gives

p(0) = 2 e^{A't_1} W(t_1, 0)^{−1} e^{At_1} x_0.

This gives the optimal control

u(t) = −(1/2) B'e^{−A't} p(0) = −B'e^{A'(t_1−t)} W(t_1, 0)^{−1} e^{At_1} x_0

and the optimal cost becomes (after some calculations)

J* = x_0' e^{A't_1} W(t_1, 0)^{−1} e^{At_1} x_0.
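The formulas of Example 16 are easy to check numerically. The Python sketch below uses arbitrary illustrative matrices, approximates the reachability Gramian by a simple quadrature, forms u(t) = −B'e^{A'(t_1−t)}W^{−1}e^{At_1}x_0, and integrates the state to confirm that it is (approximately) driven to the origin at t_1.

import numpy as np
from scipy.linalg import expm

# Illustrative data (assumed): a controllable pair (A, B)
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
x0 = np.array([1.0, -1.0])
t1, N = 2.0, 2000
dt = t1 / N
ts = np.linspace(0.0, t1, N, endpoint=False)

# Reachability Gramian W(t1,0) = int_0^{t1} e^{A(t1-s)} B B' e^{A'(t1-s)} ds (Riemann sum)
W = sum(expm(A*(t1 - s)) @ B @ B.T @ expm(A.T*(t1 - s)) for s in ts) * dt

w = np.linalg.solve(W, expm(A*t1) @ x0)            # W^{-1} e^{A t1} x0
u = lambda t: -(B.T @ expm(A.T*(t1 - t)) @ w)      # optimal open-loop control

x = x0.copy()
for t in ts:                                       # Euler integration of xdot = Ax + Bu
    x = x + dt * (A @ x + B @ u(t))
print("x(t1) ~", x)                                # should be close to the origin
print("J*    =", x0 @ expm(A.T*t1) @ w)            # optimal cost x0' e^{A't1} W^{-1} e^{At1} x0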
Exercises
1. A system is described by

ẋ_1 = −2x_1 + u,

and the control u(t) is to be chosen so as to minimize

∫_0^1 u^2 dt.

Show that the optimal control which transfers the system from x_1(0) = 1 to x_1(1) = 0 is u* = −4e^{2t}/(e^4 − 1).
2. The equations describing a production scheduling problem are

dI/dt = −S + P,  dS/dt = αP

where I(t) is the level of inventory (or stock), S(t) is the rate of sales and α is a positive constant. The production rate P(t) can be controlled and is assumed unbounded. It is also assumed that the rate of production costs is proportional to P^2. It is required to choose the production rate which will change I(0) = I_0, S(0) = S_0 into I(t_f) = I_1, S(t_f) = S_1 in a fixed time t_f whilst minimizing the total production cost. Show that the optimal production rate has the form P* = k_1 + k_2 t and indicate how the constants k_1 and k_2 can be determined.
3. A particle of mass m moves on a smooth horizontal plane with rectangular Cartesian coordinates x and y. Initially the particle is at rest at the origin, and a force of constant magnitude ma is applied to it so as to ensure that after a fixed time t_f the particle is moving along a given line parallel to the x-axis with maximum speed. The angle u(t) made by the force with the positive x direction is the control variable, and is unbounded. Show that the optimal control is given by

tan u* = tan u_0 + ct

where c is a constant and u_0 = u*(0). Hence deduce that

ẏ*(t) = (a/c)(sec u* − sec u_0)

and obtain a similar expression for ẋ*(t) (hint: change the independent variable from t to u).
4. For the system in Example 15 described by eqn (4.32), determine the control which transfers it from x(0) = 0 to the line x_1 + 5x_2 = 15 and minimizes

(1/2)[x_1(2) − 5]^2 + (1/2)[x_2(2) − 2]^2 + (1/2) ∫_0^2 u^2 dt.

5. Complete the details in deriving (4.28) and (4.29).
5 Pontryagin's principle II: a variational calculus approach

In real life problems the control variables are usually subject to constraints on their magnitudes, typically of the form |u_i(t)| ≤ k_i. This implies that the set of final states which can be achieved is restricted. Our aim here is to derive the necessary conditions for optimality corresponding to Theorem 4.2 for the unbounded case. An admissible control is one which satisfies the constraints, and we consider variations such that u* + δu is admissible and δu is sufficiently small so that the sign of

ΔJ = J(u* + δu) − J(u*),

where J is defined in (1.13), is determined by δJ in (4.1). Because of the restriction on δu, Theorem 4.1 no longer applies, and instead a necessary condition for u* to minimize J is

δJ(u*, δu) ≥ 0.   (5.1)

To see this we give a heuristic derivation. Recall that the control u* causes the functional J to have a relative minimum if

J(u) − J(u*) = ΔJ ≥ 0

for all admissible controls sufficiently close to u*. If we let u = u* + δu the increment in J can be expressed as

ΔJ(u*, δu) = δJ(u*, δu) + higher-order terms;

δJ is linear in δu and the higher-order terms approach zero as the norm of δu tends to zero. If the control were unbounded, we could use the linearity of δJ with respect to δu, and the fact that δu can vary arbitrarily, to show that a necessary condition for u* to be an extremal control is that the variation δJ(u*, δu) must be zero for all admissible δu having a sufficiently small norm. Since we are no longer assuming that the admissible controls are unbounded, δu is arbitrary only if the extremal control is strictly within the boundary for all time in the interval [t_0, t_1]. In this case, the boundary has no effect on the problem solution. If, however, an extremal control lies on a boundary during at least one subinterval [t'_1, t'_2] of the interval [t_0, t_1], then admissible control variations δu exist whose negatives (−δu) are not admissible. If only these variations are considered, a necessary condition for u* to minimize J is that δJ(u*, δu) ≥ 0. On the other hand, for variations δu which are nonzero only for t not in the interval [t'_1, t'_2], it is necessary that δJ(u*, δu) = 0; the reasoning used in proving the fundamental theorem applies. Considering all admissible variations with ||δu|| small enough so that the sign of ΔJ is determined by δJ, we see that a necessary condition for u* to minimize J is

δJ(u*, δu) ≥ 0.

It seems reasonable to ask if this result has an analogue in calculus. Recall that the differential df is the linear part of the increment Δf. Consider the end points t_0 and t_1 of the interval, and admissible values of the increment Δt, which are small enough so that the sign of Δf is determined by the sign of df. If t_0 is a point where f has a relative minimum, then df(t_0, Δt) must be greater than or equal to zero. The same requirement applies for f(t_1) to be a relative minimum. Thus necessary conditions for the function f to have relative minima at the end points of the interval are

df(t_0, Δt) ≥ 0, for admissible Δt ≥ 0,
df(t_1, Δt) ≥ 0, for admissible Δt ≤ 0,

and a necessary condition for f to have a relative minimum at an interior point t, t_0 < t < t_1, is

df(t, Δt) = 0.

For the control problem the analogous necessary conditions are

δJ(u*, δu) ≥ 0

if u* lies on the boundary during any portion of the time interval [t_0, t_1], and

δJ(u*, δu) = 0

if u* lies within the boundary during the entire time interval [t_0, t_1].

The development then proceeds as earlier: Lagrange multipliers are introduced to define J_a in (4.2) and are chosen so as to satisfy (4.6) and (4.7). The only difference is that the expression for δJ_a in (4.9) is replaced by

δJ_a(u, δu) = ∫_{t_0}^{t_1} [H(x, u + δu, p, t) − H(x, u, p, t)] dt.   (5.2)

It therefore follows by (5.1) that a necessary condition for u = u* to be a minimizing control is that δJ_a(u*, δu) in (5.2) be nonnegative for all admissible δu. This in turn implies that

H(x*, u* + δu, p*, t) ≥ H(x*, u*, p*, t),   (5.3)

for all admissible δu and all t ∈ [t_0, t_1]; for if (5.3) did not hold in some interval t'_1 ≤ t ≤ t'_2, say, with t'_2 − t'_1 arbitrarily small, then by choosing δu = 0 for t outside this interval δJ_a(u*, δu) would be made negative. Eqn (5.3) states that u* minimizes H, so we have established:

Theorem 5.1 (Pontryagin's minimum principle). Necessary conditions for u* to minimize (1.13) are (4.6), (4.7), and (5.3).

With a slightly different definition of H the principle becomes one of maximizing H, and is then referred to in the literature as the maximum principle. Note that u* is now allowed to be piecewise continuous. We omit the rigorous proof here.

Our derivation assumed that t_1 was fixed and x(t_1) free; the boundary conditions for other situations are precisely the same as those given in the preceding section. It can also be shown that when H does not explicitly depend on t then H is constant and (4.15) still holds for the respective cases when the final time t_1 is fixed or free.
Example 17 (Minimum revolved area). Consider the problem of finding the (Lipschitz continuous) y : [0, 1] → R_+ that joins the points (0, 1) and (1, y_1) and has the property that, when its graph is revolved around the x-axis, there results a surface of minimum area. The surface in question has area

A = ∫_0^1 2πy √(1 + (dy/dx)^2) dx.

Thus, the minimization problem is the endpoint-constrained problem for which x_0 = x(0) = 1, x_f = x(1) = y_1. It can be formulated as an optimal control problem where we drop the factor 2π since it is irrelevant in the minimization:

min J(u) = ∫_0^1 x √(1 + u^2) dt

subject to

ẋ = u,  x(0) = 1,  x(1) = y_1.

Now define the Hamiltonian function to be

H(p, x, u) = x √(1 + u^2) + pu.

Applying PMP, we have

ṗ = −H_x = −√(1 + u^2),
0 = H_u = xu/√(1 + u^2) + p  ⟹  u^2 = p^2/(x^2 − p^2),

which gives us the optimal control. In order to solve the equations for p and x we rewrite the stationarity condition in its equivalent form

p = −xu/√(1 + u^2).

Substituting p into the equation

x √(1 + u^2) + pu = c,

together with ẋ = u, gives

x √(1 + ẋ^2) − xẋ^2/√(1 + ẋ^2) = c   (c is a constant)   (5.4)

(since the problem is autonomous the Hamiltonian is constant on the optimal trajectory), and we obtain

x = c √(1 + ẋ^2)

for some constant c which, because x(0) = 1, must satisfy 0 < c ≤ 1. From

ẋ^2 = (x/c)^2 − 1   (5.5)

it follows that

ẋ ẍ = (1/c^2) x ẋ.

We assume that ẋ is not identically zero (which can only happen if x ≡ c and thus y_1 = 1). So there is some nonempty open interval where

ẍ = x/c^2

and hence there are constants α_1, β_1 so that

x(t) = α_1 cosh(t/c) + β_1 sinh(t/c)

on that interval. (Here we need the property that q is real-analytic, so that x takes this form on the entire interval by the principle of analytic continuation.) The condition x(0) = 1 implies that α_1 = 1 and the equation (5.5) evaluated at t = 0 gives

(β_1/c)^2 = (1/c)^2 − 1

and hence

β_1 = ±√(1 − c^2) ∈ ]−1, 1[.

Pick d ∈ R so that

tanh d = β_1,

which implies that cosh d = 1/c. Then (using cosh(z + y) = cosh z cosh y + sinh z sinh y)

x(t) = cosh(t cosh d + d)/cosh d.

Every minimizing curve must be of this form. We must, however, meet the second boundary condition x(1) = y_1. This can be done if and only if one can solve

y_1 = cosh(cosh d + d)/cosh d

for d, which requires y_1 to be sufficiently large (approximately y_1 > 0.587); in general there may be none, one or two solutions. For example, if y_1 = cosh 1 the solutions are d = 0 and d ≈ −2.3, respectively. The integrals ∫_0^1 x √(1 + ẋ^2) dt are approximately 1.407 and 1.764. So if a minimum exists, it must be x(t) = cosh t. We can show that this function is the unique global minimizer. (In this case, p(t) = −x(t)ẋ(t)/√(1 + ẋ(t)^2) = −cosh t tanh t.)

Remark: If we use the Euler-Lagrange equations to solve this problem, the relevant first integral of the Euler-Lagrange equation is

q(x(t), ẋ(t)) − ẋ(t) q_u(x(t), ẋ(t)) ≡ c.

Comparing this equation with equation (5.4) we can easily see that p = −q_u. This is the best I can explain the Lagrange multiplier, or the costate p, in relation to the Euler-Lagrange equations.

In fact we can prove that the extremal Lagrange multipliers, or costates, are the sensitivity of the minimum value of the performance measure to changes in the state value.
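The boundary condition y_1 = cosh(cosh d + d)/cosh d is easy to explore numerically. The Python sketch below (an illustration, not part of the text) takes y_1 = cosh 1, recovers the two solutions d = 0 and d ≈ −2.3 with a bracketing root finder, and compares the corresponding areas.

import numpy as np
from scipy.optimize import brentq
from scipy.integrate import quad

y1 = np.cosh(1.0)
g = lambda d: np.cosh(np.cosh(d) + d) / np.cosh(d) - y1

# Brackets chosen by inspection of the sign of g
d_roots = [brentq(g, -0.5, 0.5), brentq(g, -3.0, -1.5)]

def area(d):
    # integral of x*sqrt(1 + xdot^2) dt for x(t) = cosh(t*cosh(d) + d)/cosh(d)
    k = np.cosh(d)
    integrand = lambda t: (np.cosh(k*t + d)/k) * np.sqrt(1.0 + np.sinh(k*t + d)**2)
    val, _ = quad(integrand, 0.0, 1.0)
    return val

for d in d_roots:
    print(f"d = {d:.4f}, area = {area(d):.4f}")   # approximately (0, 1.407) and (-2.3, 1.764)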
Example 18. Consider again the soft landing problem described in Example 2 where (1.12) is to be minimized subject to the system equations (1.9). The Hamiltonian (4.4) is

H = |u| + k + p_1 x_2 + p_2 u.   (5.6)

Since the admissible range of control is −1 ≤ u(t) ≤ 1, it follows that H will be minimized by the following:

u*(t) = −1 if p*_2(t) > 1,
u*(t) =  0 if −1 < p*_2(t) < 1,
u*(t) = +1 if p*_2(t) < −1.   (5.7)

Such a control is referred to in the literature by the graphic term bang-zero-bang, since only maximum thrust is applied in a forward or reverse direction; no intermediate nonzero values are used. If there is no period in which u* is zero the control is called bang-bang. For example, a racing-car driver approximates to bang-bang operation, since he tends to use either full throttle or maximum braking when attempting to circuit a track as quickly as possible.

In (5.7) u*(t) switches in value according to the value of p*_2(t), which is therefore termed (in this example) the switching function. The adjoint equations (4.6) are

ṗ*_1(t) = 0,  ṗ*_2 = −p*_1

and integrating these gives

p*_1(t) = c_1,  p*_2(t) = −c_1 t + c_2   (5.8)

where c_1 and c_2 are constants. Since p*_2 is linear in t, it follows that it can take each of the values +1 and −1 at most once in 0 ≤ t ≤ t_f, so u*(t) can switch at most twice. We must however use physical considerations to determine an actual optimal control. Since the landing vehicle begins with a downward velocity ν at an altitude h, logical sequences of control would seem to be either

u* = 0, followed by u* = +1

(upwards is regarded as positive), or

u* = −1, then u* = 0, then u* = +1.   (5.9)

Consider the first possibility and suppose that u* switches from zero to one at time t_1. By virtue of (5.7) this sequence of control is possible if p*_2 decreases with time. It is easy to verify that the solution of (1.9) subject to the initial conditions (1.10) is

x*_1 = h − νt,  x*_2 = −ν,   0 ≤ t ≤ t_1,
x*_1 = h − νt + (1/2)(t − t_1)^2,  x*_2 = −ν + (t − t_1),   t_1 ≤ t ≤ t_f.   (5.10)

Substituting the soft landing requirements (1.11) into (5.10) gives

t_f = h/ν + (1/2)ν,  t_1 = h/ν − (1/2)ν.   (5.11)

Because the final time is not specified and because of the form of H in (5.6), eqn (4.15) holds, so in particular H|_{u=u*} = 0 at t = 0, i.e. with t = 0 in (5.6)

k + p*_1(0)x*_2(0) = 0

or p*_1(0) = k/ν. Hence from (5.8) we have

p*_1(t) = k/ν,  t ≥ 0

and

p*_2(t) = −kt/ν − 1 + kt_1/ν   (5.12)

using the assumption that p*_2(t_1) = −1. Thus the assumed optimal control will be valid if t_1 > 0 and p*_2(0) < 1 (the latter condition being necessary since u*(0) = 0), and using (5.11) and (5.12) these conditions imply

h > (1/2)ν^2,  k < 2ν^2/(h − (1/2)ν^2).   (5.13)

If these inequalities do not hold then some different control strategy, such as (5.9), becomes optimal. For example, if k is increased so that the second inequality in (5.13) is violated, then this means that more emphasis is placed on the time to landing in the performance index (1.12). It is therefore reasonable to expect this time would be reduced by first accelerating downwards with u* = −1 before coasting with u* = 0, as in (5.9). It is interesting to note that provided (5.13) holds then the total time t_f to landing in (5.11) is independent of k.
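A quick sanity check of (5.10)-(5.13) in Python (illustrative values for h, ν and k, chosen so that (5.13) holds): build the control u* = 0 followed by u* = +1 with the switch at t_1 from (5.11), and integrate (1.9) to confirm a soft landing at t_f.

import numpy as np

h, nu, k = 10.0, 2.0, 0.1                    # illustrative values
assert h > 0.5*nu**2 and k < 2*nu**2 / (h - 0.5*nu**2)   # condition (5.13)

tf = h/nu + 0.5*nu                           # eqn (5.11)
t1 = h/nu - 0.5*nu

N = 20000
dt = tf / N
x1, x2 = h, -nu                              # altitude h, downward speed nu at t = 0
for i in range(N):
    u = 0.0 if i*dt < t1 else 1.0            # coast, then full upward thrust
    x1, x2 = x1 + dt*x2, x2 + dt*u           # Euler step of x1' = x2, x2' = u
print(x1, x2)                                # both should be close to zero at t = tf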
Example 19. Suppose that in the preceding example it is now required to determine a control which achieves a soft landing in the least possible time, starting with an arbitrary given initial state x(0) = x_0. The performance index is just (1.3) with t_0 = 0, t_1 = t_f. The Hamiltonian (4.4) is now

H = 1 + p_1 x_2 + p_2 u

and Theorem 5.1 gives the optimal control:

u* = +1 if p*_2 < 0;  u* = −1 if p*_2 > 0,

or more succinctly,

u*(t) = −sgn(p*_2(t))   (5.14)

where sgn stands for the sign function. The optimal control thus has bang-bang form, and we must determine the switching function p*_2(t). Using (4.6) we again obtain

ṗ*_1 = 0,  ṗ*_2 = −p*_1

so

p*_1 = c_1,  p*_2 = −c_1 t + c_2

where c_1 and c_2 are constants. Since p*_2 is a linear function of t it can change sign at most once in [0, t_f], so the optimal control (5.14) must take one of the following forms:

u*(t) = +1,  0 ≤ t ≤ t_f;
u*(t) = −1,  0 ≤ t ≤ t_f;
u*(t) = +1, 0 ≤ t < t_1;  −1, t_1 ≤ t ≤ t_f;
u*(t) = −1, 0 ≤ t < t_2;  +1, t_2 ≤ t ≤ t_f.   (5.15)

Integrating the state equation (1.9) with u = ±1 gives

x_1 = ±(1/2)t^2 + c_3 t + c_4,  x_2 = ±t + c_3.   (5.16)

Eliminating t in (5.16) yields

x_1(t) = (1/2)x_2^2(t) + c_5,  when u* = +1,   (5.17)
x_1(t) = −(1/2)x_2^2(t) + c_6,  when u* = −1.   (5.18)

The trajectories (5.17) and (5.18) represent two families of parabolas, shown in Figure 4. The direction of the arrows represents t increasing.

Figure 4: The families of parabolic trajectories (5.17) (panel (a)) and (5.18) (panel (b)).

We can now investigate the various cases in (5.15).

(i) u* = +1, 0 ≤ t ≤ t_f. The initial state x_0 must lie on the lower part of the curve PO corresponding to c_5 = 0 in Figure 4(a).

(ii) u* = −1, 0 ≤ t ≤ t_f. The initial state x_0 must lie on the upper part of the curve QO corresponding to c_6 = 0 in Figure 4(b).

(iii) With the third case in (5.15), since u* = −1 for t_1 ≤ t ≤ t_f it follows that x*(t_1) must lie on the curve QO. The point x*(t_1) is reached using u* = +1 for 0 ≤ t < t_1, so the initial part of the optimal trajectory must belong to the curves in Figure 4(a). The optimal trajectory will thus be as shown in Figure 5. The point R corresponds to t = t_1, and is where u* switches from +1 to −1; QO is therefore termed the switching curve. By considering Figure 4 it is clear that the situation just described holds for any initial state lying to the left of both PO and QO.

Figure 5: A typical optimal trajectory with one switch, on QO, at the point R.

(iv) A similar argument shows that the last case in (5.15) applies for any initial state lying to the right of PO and QO, a typical optimal trajectory being shown in Figure 6. The switching now takes place on PO, so the complete switching curve is QOP, shown in Figure 7.

Figure 6: A typical optimal trajectory with one switch, on PO.

Figure 7: The complete switching curve QOP.

To summarize, if x_0 lies on the switching curve then u* = ±1 according as x_1(0) is positive or negative. If x_0 does not lie on the switching curve then u* must initially be chosen so as to move x*(t) towards the switching curve.
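The switching curve QOP translates directly into a feedback law: with s = x_1 + x_2|x_2|/2, apply u = −1 when s > 0 (to the right of the curve), u = +1 when s < 0, and u = −sgn(x_2) on the curve itself. A Python sketch of this construction (my own illustration, not part of the text):

import numpy as np

def u_time_optimal(x1, x2):
    # bang-bang feedback derived from the switching curve x1 = -x2*|x2|/2 (curve QOP)
    s = x1 + 0.5 * x2 * abs(x2)
    if s > 0:
        return -1.0          # to the right of the switching curve
    if s < 0:
        return 1.0           # to the left of the switching curve
    return -np.sign(x2)      # on the curve (zero at the origin)

x1, x2, dt = 3.0, 1.0, 1e-3  # arbitrary initial state to the right of the curve
for _ in range(20000):
    u = u_time_optimal(x1, x2)
    x1, x2 = x1 + dt*x2, x2 + dt*u
    if abs(x1) < 1e-2 and abs(x2) < 1e-2:
        break
print(x1, x2)                # near the origin, up to discretization error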
Exercises
1. A system is described by

d^3z/dt^3 = u(t)

where z(t) denotes displacement. Starting from some given initial position with given velocity and acceleration it is required to choose u(t), which is constrained by |u(t)| ≤ k, so as to make displacement, velocity, and acceleration equal to zero in the least possible time. Show using Theorem 5.1 that the optimal control consists of u* = ±k with zero, one, or two switchings.

2. A linear system is described by

d^2z/dt^2 + a dz/dt + bz = u(t)

where a > 0 and a^2 < 4b. The control variable is subject to |u(t)| ≤ k and is to be chosen so that the system reaches the state z(T) = 0, dz(T)/dt = 0 in the minimum possible time. Show that the optimal control is u* = k sgn p(t), where p(t) is a periodic function of t.
3. A rocket is ascending vertically above the earth, which is assumed flat. It is also assumed that aerodynamic forces can be neglected, that the gravitational attraction is constant, and that the thrust from the motor acts vertically downwards. The equations of motion are

d^2h/dt^2 = −g + cu/m,  dm/dt = −u(t)

where h(t) is the vertical height, m(t) is the rocket mass, and c and g are positive constants. The propellant mass flow can be controlled subject to 0 ≤ u(t) ≤ k. The mass, height, and velocity at t = 0 are known and it is required to maximize the height subsequently reached. Show that the optimal control has the form

u*(t) = k, s > 0;  u*(t) = 0, s < 0

where the switching function s(t) satisfies the equation ds/dt = −c/m.

If switching occurs at time t_1 show that

s(t) = (c/k) ln[m(t)/m(t_1)],  0 ≤ t ≤ t_1;
s(t) = c(t_1 − t)/m(t_1),  t_1 ≤ t ≤ T.
4. A reservoir is assumed to have a constant cross-section, and the depth of the water at time t is x_1(t). The net inflow rate of water u(t) can be controlled subject to 0 ≤ u(t) ≤ k, but the reservoir leaks, the differential equation describing the system being

ẋ_1 = −0.1x_1 + u.

Find the control policy which maximizes the total amount of water in the reservoir over 100 units of time, i.e.

∫_0^{100} x_1(t) dt.

If during the same period the total inflow of water is limited by

∫_0^{100} u(t) dt = 60k

determine the optimal control in this case.
5. In Example 19 let x_1(0) = α, x_2(0) = β be an arbitrary initial point to the right of the switching curve in Figure 6, with β ≥ 0. Show that the minimum time to the origin is T* = β + (4α + 2β^2)^{1/2}.

6. Consider the system described by the equations (1.9) subject to |u(t)| ≤ 1. It is required to transfer the system to some point lying on the perimeter of the square in the (x_1, x_2) plane having vertices (±1, ±1), in minimum time, starting from an arbitrary point outside the square. Determine the switching curves.
6 Kalman filtering and certainty equivalence

We present the important concepts of the Kalman filter, certainty equivalence and the separation principle, as stated in the theory of output feedback.
6.1 Imperfect state observation with noise

The elements needed to define a control optimization problem are specification of (i) the dynamics of the process, (ii) which quantities are observable at a given time, and (iii) an optimization criterion.

In the LQG model the system equation and observation relations are linear, the cost is quadratic, and the noise is Gaussian (jointly normal). The LQG model is important because it has a complete theory and introduces some key concepts, such as controllability, observability and the certainty equivalence principle. Imperfect observation is the most important point. The model is

x_t = Ax_{t−1} + Bu_{t−1} + ε_t
y_t = Cx_{t−1} + η_t   (6.1)

where ε_t is process noise, y_t is the observation at time t and η_t is the observation noise. The state observations are degraded in that we observe only Cx_{t−1}. Assume

cov(ε; η) = E[ (ε; η)(ε; η)' ] = [ Q  L ; L'  R ]

and that x_0 ~ N(x̂_0, V_0). Let W_t = (Y_t, U_{t−1}) = (y_1, ..., y_t; u_0, ..., u_{t−1}) denote the observed history up to time t. Of course we assume that A, B, C, Q, L, R, x̂_0 and V_0 are also known; W_t denotes what might be different if the process were rerun. In the next subsection we turn to the question of estimating x from y. We consider the issue of state estimation and optimal control and shall show the following things.

(i) x̂_t can be calculated recursively from the Kalman filter (a linear operator):

x̂_t = Ax̂_{t−1} + Bu_{t−1} + H_t(y_t − Cx̂_{t−1}),

which is like the system equation except that the noise is given by an innovation process, ỹ_t = y_t − Cx̂_{t−1}, rather than the white noise. Compare this with the observer!

(ii) If there is full information (i.e., y_t = x_t) and the optimal control is u_t = K_t x_t, then without full information the optimal control is u_t = K_t x̂_t, where x̂_t is the best linear least squares estimate of x_t based on the information (Y_t, U_{t−1}) at time t.

Many of the ideas we encounter in this analysis are not related to the special state structure and are therefore worth noting as general observations about control with imperfect information.
6.2 The Kalman filter

Consider the state system (6.1). Note that both x_t and y_t can be written as linear functions of the unknown noise and the known values of u_0, ..., u_{t−1}. Thus the distribution of x_t conditional on W_t = (Y_t, U_{t−1}) must be normal, with some mean x̂_t and covariance matrix V_t. The following theorem describes recursive updating relations for these two quantities.

A preliminary result is needed to make the proof simpler.

Lemma 6.1. Suppose that x and y are jointly normal with zero means and covariance matrix

cov(x; y) = [ V_xx  V_xy ; V_yx  V_yy ].

Then the distribution of x conditional on y is Gaussian, with

E(x|y) = V_xy V_yy^{−1} y,

and

cov(x|y) = V_xx − V_xy V_yy^{−1} V_yx.   (6.2)

Proof. Both y and x − V_xy V_yy^{−1} y are linear functions of x and y and therefore they are Gaussian. From E[(x − V_xy V_yy^{−1} y)y'] = 0 it follows that they are uncorrelated, and this implies they are independent. Hence the distribution of x − V_xy V_yy^{−1} y conditional on y is identical with its unconditional distribution, and this is Gaussian with zero mean and the covariance matrix given by (6.2).

The estimate of x in terms of y defined as x̂ = Ky = V_xy V_yy^{−1} y is known as the linear least squares estimate of x in terms of y. Even without the assumption that x and y are jointly normal, this linear function of y has a smaller covariance matrix than any other unbiased estimate for x that is a linear function of y. In the Gaussian case, it is also the maximum likelihood estimator.
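Lemma 6.1 and the linear least squares interpretation can be checked by simulation. The Python sketch below (scalar x and y, illustrative covariance values) compares the empirical regression of x on y with V_xy V_yy^{-1} and the residual variance with (6.2).

import numpy as np

rng = np.random.default_rng(0)
Vxx, Vxy, Vyy = 2.0, 0.8, 1.0                    # illustrative joint covariance of (x, y)
cov = np.array([[Vxx, Vxy], [Vxy, Vyy]])
samples = rng.multivariate_normal([0.0, 0.0], cov, size=200000)
x, y = samples[:, 0], samples[:, 1]

K = Vxy / Vyy                                     # linear least squares gain
residual = x - K * y

print("empirical slope  :", np.sum(x*y) / np.sum(y*y))   # ~ K
print("residual variance:", residual.var())              # ~ Vxx - Vxy^2/Vyy
print("theory (6.2)     :", Vxx - Vxy**2 / Vyy)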
Theorem 6.2 (The Kalman filter). Suppose that, conditional on W_0, the initial state x_0 is distributed N(x̂_0, V_0) and the state and observations satisfy the recursions of the LQG model (6.1). Then conditional on W_t, the current state is distributed N(x̂_t, V_t). The conditional mean and variance satisfy the updating recursions

x̂_t = Ax̂_{t−1} + Bu_{t−1} + H_t(y_t − Cx̂_{t−1}),   (6.3)

V_t = Q + AV_{t−1}A' − (L + AV_{t−1}C')(R + CV_{t−1}C')^{−1}(L' + CV_{t−1}A'),   (6.4)

where

H_t = (L + AV_{t−1}C')(R + CV_{t−1}C')^{−1}.   (6.5)

Proof. We do induction on t. Consider the moment when u_{t−1} has been determined but y_t has not yet been observed. The distribution of (x_t, y_t) conditional on (W_{t−1}, u_{t−1}) is jointly normal with means

E(x_t | W_{t−1}, u_{t−1}) = Ax̂_{t−1} + Bu_{t−1},
E(y_t | W_{t−1}, u_{t−1}) = Cx̂_{t−1}.

Let e_{t−1} := x_{t−1} − x̂_{t−1}, which by an inductive assumption is N(0, V_{t−1}). Consider the innovations

ξ_t = x_t − E(x_t | W_{t−1}, u_{t−1}) = x_t − (Ax̂_{t−1} + Bu_{t−1}) = ε_t + Ae_{t−1},
ζ_t = y_t − E(y_t | W_{t−1}, u_{t−1}) = y_t − Cx̂_{t−1} = η_t + Ce_{t−1}.

Conditional on (W_{t−1}, u_{t−1}), these quantities are normally distributed with zero means and covariance matrix

cov(ε_t + Ae_{t−1}; η_t + Ce_{t−1}) = [ Q + AV_{t−1}A'   L + AV_{t−1}C' ; L' + CV_{t−1}A'   R + CV_{t−1}C' ] =: [ V_ξξ  V_ξζ ; V_ζξ  V_ζζ ].

Thus it follows from Lemma 6.1 that the distribution of ξ_t conditional on knowing (W_{t−1}, u_{t−1}, ζ_t), which is equivalent to knowing W_t, is normal with mean V_ξζ V_ζζ^{−1} ζ_t and covariance matrix V_ξξ − V_ξζ V_ζζ^{−1} V_ζξ. This proves the theorem.
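A minimal Python implementation of the recursions (6.3)-(6.5). The matrices, the simulation horizon and the choice L = 0 are all illustrative assumptions; note that the observation is generated as y_t = Cx_{t-1} + noise, following the timing convention of (6.1).

import numpy as np

def kalman_step(xhat, V, u, y, A, B, C, Q, R, L):
    # one step of (6.3)-(6.5): returns the updated conditional mean and covariance
    S = R + C @ V @ C.T                                        # innovation covariance
    G = L + A @ V @ C.T
    H = G @ np.linalg.inv(S)                                   # gain H_t, eqn (6.5)
    xhat_new = A @ xhat + B @ u + H @ (y - C @ xhat)           # eqn (6.3)
    V_new = Q + A @ V @ A.T - G @ np.linalg.inv(S) @ G.T       # eqn (6.4)
    return xhat_new, V_new

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])        # illustrative LQG data (assumed)
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
Q, R, L = 0.01*np.eye(2), np.array([[0.1]]), np.zeros((2, 1))

x, xhat, V = np.array([1.0, 0.0]), np.zeros(2), np.eye(2)
for t in range(50):
    u = np.zeros(1)                                                    # uncontrolled, for illustration
    x_prev = x
    x = A @ x_prev + B @ u + rng.multivariate_normal(np.zeros(2), Q)   # state equation
    y = C @ x_prev + rng.multivariate_normal(np.zeros(1), R)           # y_t = C x_{t-1} + noise
    xhat, V = kalman_step(xhat, V, u, y, A, B, C, Q, R, L)
print(x, xhat)     # the estimate should track the true state up to the filter error variance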
6.3 Certainty equivalence

We say that a quantity a is policy-independent if E_π(a | W_0) is independent of the policy π.

Theorem 6.3. Suppose the LQG model assumptions hold. Then

(i) V(W_t) = x̂_t' P_t x̂_t + ···   (6.6)

where x̂_t is the linear least squares estimate of x_t whose evolution is determined by the Kalman filter in Theorem 6.2 and '+ ···' indicates terms that are policy-independent;

(ii) the optimal control is given by

u_t = K_t x̂_t,

where P_t and K_t are the same matrices as in the full information case of Theorem 2.4.

It is important to grasp the remarkable fact that (ii) states: the optimal control u_t is exactly the same as it would be if all unknowns were known and took values equal to their linear least squares estimates (equivalently, their conditional means) based on observations up to time t. This is the idea known as certainty equivalence. As we have seen in the previous subsection, the distribution of the estimation error x_t − x̂_t does not depend on U_{t−1}. The fact that the problems of optimal estimation and optimal control can be decoupled in this way is known as the separation principle, as we have seen in the theory of the observer and output feedback.

Proof. The proof is by backward induction. Suppose equation (6.6) holds at t. Recall that

x̂_t = Ax̂_{t−1} + Bu_{t−1} + H_t ζ_t,   e_{t−1} = x_{t−1} − x̂_{t−1}.

Then with a quadratic cost of the form J(x, u) = x'Qx + 2u'Sx + u'Ru, we have

V(W_{t−1}) = min_{u_{t−1}} E[ J(x_{t−1}, u_{t−1}) + x̂_t' P_t x̂_t + ··· | W_{t−1}, u_{t−1} ]
           = min_{u_{t−1}} E[ J(x̂_{t−1} + e_{t−1}, u_{t−1}) + (Ax̂_{t−1} + Bu_{t−1} + H_t ζ_t)' P_t (Ax̂_{t−1} + Bu_{t−1} + H_t ζ_t) + ··· | W_{t−1}, u_{t−1} ]
           = min_{u_{t−1}} [ J(x̂_{t−1}, u_{t−1}) + (Ax̂_{t−1} + Bu_{t−1})' P_t (Ax̂_{t−1} + Bu_{t−1}) ] + ···,

where we used the fact that, conditional on (W_{t−1}, u_{t−1}), both e_{t−1} and ζ_t have zero means and are policy-independent. This ensures that when we expand the quadratics in powers of e_{t−1} and H_t ζ_t the expected values of the linear terms in these quantities are zero and the expected values of the quadratic terms (represented by '+ ···') are policy-independent.
7 Optimal control problems with controlled diffusion processes

We give a brief introduction to controlled continuous-time stochastic models with a continuous state space, i.e., controlled diffusion processes.

7.1 Diffusion processes and controlled diffusion processes
The Wiener process {W(t)} is a scalar process for which W(0) = 0, the increments in W over disjoint time intervals are statistically independent, and W(t) is normally distributed with zero mean and variance t. This is also called Brownian motion. This specification is internally consistent because, for example,

W(t) = W(t_1) + [W(t) − W(t_1)]

and for 0 ≤ t_1 ≤ t the two terms on the right-hand side are independent normal variables of zero mean and with variance t_1 and t − t_1, respectively.

If δW is the increment of W in a time interval of length δt then

E[δW] = 0,  E[(δW)^2] = δt,  E[(δW)^j] = o(δt), for j > 2,

where the expectation is conditional on the past of the process. Note that since

E[(δW/δt)^2] = O((δt)^{−1}),

the formal derivative ε = dW/dt (continuous-time white noise) does not exist in a mean square sense, but expectations such as

E[ (∫ φ(t)ε(t) dt)^2 ] = E[ (∫ φ(t) dW(t))^2 ] = ∫ φ(t)^2 dt

make sense if the integral is convergent.
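The identity E[(∫ φ dW)^2] = ∫ φ^2 dt can be verified by Monte Carlo simulation; the Python sketch below uses φ(t) = sin t on [0, 1] as an arbitrary illustrative integrand.

import numpy as np

rng = np.random.default_rng(1)
n_paths, n_steps = 20000, 1000
dt = 1.0 / n_steps
t = np.arange(n_steps) * dt
phi = np.sin(t)

dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))   # Brownian increments, variance dt
I = dW @ phi                                                  # Ito sums  sum_j phi(t_j) dW_j

print("Monte Carlo E[I^2]:", np.mean(I**2))
print("int phi^2 dt      :", np.sum(phi**2) * dt)             # the two should agree closely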
Now consider a stochastic differential equation

δx = a(x, u)δt + g(x, u)δW,

which we shall write formally as

ẋ = a(x, u) + g(x, u)ε.

This, as a Markov process, has an infinitesimal generator Λ(u) with action

Λ(u)φ(x) = lim_{δt→0} E[ (φ(x(t + δt)) − φ(x(t)))/δt | x(t) = x, u(t) = u ]
         = φ_x a + (1/2)φ_xx g^2 = φ_x a + (1/2)N φ_xx,

where N(x, u) = g(x, u)^2. The DP equation is thus

inf_u [ J + V_t + V_x a + (1/2)N V_xx ] = 0.

In the vector case of a controlled diffusion process this becomes

inf_u [ J + V_t + V_x a + (1/2) tr(N V_xx) ] = 0.
Example 20 (LQG in continuous time). The DP equation is

inf_u [ x'Qx + u'Ru + V_t + V_x'(Ax + Bu) + (1/2) tr(N V_xx) ] = 0.

In analogy with the discrete-time and the deterministic continuous-time cases considered previously, we try a solution of the form

V(x, t) = x'P(t)x + γ(t).

This leads to the same Riccati equation

0 = x'[ Q + PA + A'P − PBR^{−1}B'P + dP/dt ]x,

and also

dγ/dt + tr(N P(t)) = 0,

giving

γ(t) = ∫_t^T tr(N P(τ)) dτ.
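In practice P(t) and γ(t) are computed by integrating backwards from the terminal conditions P(T) = 0, γ(T) = 0. A Python sketch with illustrative (assumed) matrices, using explicit Euler steps:

import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])       # illustrative data (assumed)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
N = 0.05 * np.eye(2)                         # noise covariance
T, steps = 5.0, 5000
dt = T / steps

P, gamma = np.zeros((2, 2)), 0.0             # terminal conditions P(T) = 0, gamma(T) = 0
Rinv = np.linalg.inv(R)
for _ in range(steps):                       # integrate backwards in time
    Pdot = -(Q + P @ A + A.T @ P - P @ B @ Rinv @ B.T @ P)
    P = P - dt * Pdot                        # step from t to t - dt
    gamma = gamma + dt * np.trace(N @ P)     # gamma(t) = int_t^T tr(N P) dtau

print(P)         # approximates P(0)
print(gamma)     # approximates gamma(0), the expected cost due to the noise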
Most parts of Section 1 and the treatment of variational calculus are taken from Introduction to Mathematical Control Theory by Barnett & Cameron. Please send comments, reports of errors and other suggestions to yishao@math.su.se.

Yishao Zhou, September 30, 2013