

Intelligent Control and Fault Diagnosis


Lecture 5: Intelligent Control (RL)

Farzaneh Abdollahi

Department of Electrical Engineering

Amirkabir University of Technology

Winter 2024



Outline

Preliminary (Optimal Control)

Example [1]


Prelim (Dynamic Programming in Continuous Time)

▶ Objective:

  J(x, u, t₀, t_f) = h(x_f, t_f) + ∫_{t₀}^{t_f} g(x(t), u(t), t) dt

  subject to ẋ = f(x(t), u(t), t)
▶ Value fcn (cost-to-go): V(x(t₀), t₀, t_f) = min_u J(x, u, t₀, t_f)
▶ Solution:
  ▶ Bellman generalized the Hamilton–Jacobi equation by considering u(t)
  ▶ Hamilton–Jacobi–Bellman (HJB) equation:

    −∂V/∂t = min_u { g(x(t), u(t), t) + (∂V/∂x)ᵀ f(x(t), u(t), t) }

    where the term in braces is the Hamiltonian H(x(t), u(t), ∂V/∂x, t).
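▶ In the linear-quadratic special case (ẋ = Ax + Bu, g = xᵀQx + uᵀRu, V*(x) = xᵀPx), the HJB reduces to the algebraic Riccati equation, which is cheap to solve. A minimal sketch of this special case; the plant matrices A, B, Q, R below are illustrative choices, not from the lecture:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Toy LQ problem: for xdot = Ax + Bu with cost integrand x'Qx + u'Ru,
# the HJB with V*(x) = x'Px reduces to A'P + PA - P B R^{-1} B'P + Q = 0.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)  # solves the Riccati equation above
K = np.linalg.solve(R, B.T @ P)       # optimal policy: u*(x) = -K x
print("P =\n", P, "\nK =", K)
```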


▶ It is expensive to solve the n-dimensional HJB PDE.
▶ For more details, refer to Steve Brunton's crash courses.
▶ RL can be considered a model-free framework for solving optimal control problems stated as Markov decision processes (MDPs).
▶ RL rests on two pillars of equal importance:
  1. Optimal control: the two most famous RL algorithms, TD- and Q-learning, are all about approximating the value function that is at the heart of optimal control. Similarly, actor-critic methods are based on state feedback, which is motivated by optimal control theory.
  2. Statistics and information theory, especially the topic of exploration.
▶ In control problems, RL can provide an optimal solution when there is no a priori information on the system dynamics.


Example [1]

▶ Consider a nonlinear system

  ẋ = f(x) + g(x)u    (1)

  where f : Rⁿ → Rⁿ is the unknown drift dynamics and g : Rⁿ → Rⁿˣᵐ is the control effectiveness matrix, with n ≥ m and the pseudoinverse of g(x) assumed to exist.
▶ Let x_d ∈ Rⁿ be a time-varying, continuously differentiable desired state trajectory.
▶ Objective: design an optimal tracking controller when f(x) is not known a priori.
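▶ As a running illustration for the sketches in the following slides, consider a hypothetical 2-state, single-input plant of this form; f_true stands in for the unknown drift (used only to simulate the "truth", never by the controller), and g is constant with full column rank so g⁺ exists:

```python
import numpy as np

def f_true(x):
    # hypothetical Van der Pol-like drift; unknown to the controller
    return np.array([x[1], -x[0] + 0.5 * (1.0 - x[0] ** 2) * x[1]])

def g(x):
    # control effectiveness; constant, full column rank (n = 2, m = 1)
    return np.array([[0.0], [1.0]])

def g_pinv(x):
    # g+(x) = (g(x)' g(x))^{-1} g(x)'; exists since g has full column rank
    G = g(x)
    return np.linalg.solve(G.T @ G, G.T)
```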


Design a DNN Identifier

▶ To improve performance, an indirect approach is taken: the unknown dynamics are identified by a DNN.
▶ A multi-timescale DNN-based identifier is applied:
  ▶ The output-layer weights of the DNN are adjusted in real time using adaptive update laws motivated by a Lyapunov-based stability analysis.
  ▶ Concurrent to real-time execution, data are collected, and DNN training algorithms (e.g., stochastic gradient descent) iteratively update the inner-layer DNN features.
▶ i.e., the inner-layer weights are not updated in real time; they are updated discretely and intermittently during task execution, once each round of training is complete.

DNN Identifier
▶ Using the universal approximation theorem, f can be represented as

  f(x) = θᵀ ϕ(Φ*(x)) + ϵ*_θ(x)

  ▶ θ ∈ Rʰˣⁿ: ideal output-layer weight matrix
  ▶ ϕ : Rᵖ → Rʰ: vector of activation fcns
  ▶ Φ* = V_k ϱ_k(V_{k−1} ϱ_{k−1}(V_{k−2} ϱ_{k−2}(. . . x))) : Rⁿ → Rᵖ: unknown inner-layer features of the DNN; k: number of inner layers; V_k, ϱ_k: the inner-layer weights and activation fcns
  ▶ ϵ*_θ(x): bounded function approximation error
▶ The ideal weights are unknown and must be approximated by learning.
▶ The identifier dynamics:

  x̂˙ = θ̂ᵀ ϕ(Φ̂ᵢ(x)) + g(x)u + k₀ x̃

  where x̃ = x − x̂ and k₀ > 0 is the estimator gain.
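▶ A minimal sketch of the identifier, continuing the running example and assuming hypothetical dimensions (n = 2 states, h = 16 features) with a frozen, randomly initialized layer standing in for the offline-trained features ϕ(Φ̂ᵢ(·)):

```python
n, h = 2, 16
rng = np.random.default_rng(0)
V_inner = rng.normal(size=(h, n))   # stand-in inner-layer weights (trained offline)
theta_hat = np.zeros((h, n))        # output-layer weights, adapted online
k0 = 5.0                            # estimator gain

def features(x):
    # stand-in for phi(Phi_hat_i(x)): frozen inner layer with tanh activations
    return np.tanh(V_inner @ x)

def xhat_dot(x, xhat, u):
    # identifier dynamics: theta_hat' phi(Phi_hat_i(x)) + g(x)u + k0*(x - xhat)
    return theta_hat.T @ features(x) + (g(x) @ u).ravel() + k0 * (x - xhat)
```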



DNN Identifier

▶ The proposed update law for the output-layer weights:

  θ̂˙ = Γ_θ ϕ(Φ̂ᵢ(x)) x̃ᵀ + k_θ Γ_θ ∑_{j=1}^{M} ϕ(Φ̂ᵢ(x_j)) (x̄˙_j − g_j(x_j)u_j − θ̂ᵀ ϕ(Φ̂ᵢ(x_j)))ᵀ    (2)

  where Γ_θ ∈ Rʰˣʰ and k_θ > 0 are constant adaptation gains.

▶ Assumption: a history stack of input–output data pairs {x_j, u_j}_{j=1}^{M} and a history stack of numerically computed state derivatives {x̄˙_j}_{j=1}^{M} are available a priori for each index j.
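▶ A hedged sketch of update law (2), continuing the running example: the first term is the instantaneous correction driven by x̃, and the sum replays the recorded history stack. The gains Gamma_theta and k_theta below are illustrative choices, not values from [1]:

```python
Gamma_theta = 0.1 * np.eye(h)   # adaptation gain (illustrative)
k_theta = 1.0                   # concurrent-learning gain (illustrative)

def theta_hat_dot(x, xtilde, history):
    # instantaneous term: phi(Phi_hat_i(x)) xtilde'
    inst = np.outer(features(x), xtilde)
    # replay term: sum over stored triples (x_j, u_j, xbar_dot_j)
    replay = np.zeros((h, n))
    for xj, uj, xbar_dot_j in history:
        resid = xbar_dot_j - (g(xj) @ uj).ravel() - theta_hat.T @ features(xj)
        replay += np.outer(features(xj), resid)
    return Gamma_theta @ (inst + k_theta * replay)
```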


▶ The dynamics are unknown ⇝ u_d(x_d) is not known.
▶ An approximation of the trajectory-tracking component of the controller:

  û_d(x_d, θ̂) = g⁺(x_d)(h_d(x_d) − f̂ᵢ(x_d, θ̂))

  where
  ▶ h_d(x_d) = ẋ_d is a locally Lipschitz fcn
  ▶ g⁺(x) = (gᵀ(x) g(x))⁻¹ gᵀ(x)

▶ For tracking, let us define the error dynamics:

  ξ˙ = F(ξ) + G(ξ)µ

  ▶ ξ = [e, x_d]
  ▶ µ = u − u_d(x_d)
  ▶ F = [f(e + x_d) − h_d(x_d) + g(e + x_d)u_d(x_d); h_d(x_d)]
  ▶ G = [g(e + x_d)ᵀ 0_{m×n}]ᵀ
▶ The control objective is to find a control policy u that minimizes the cost function:

  J(ξ, µ) = ∫₀^∞ r(ξ, µ) dτ,  where r(ξ, µ) = Q̄(ξ) + µᵀRµ
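▶ A sketch of the concatenated error system, continuing the running example; since the ideal u_d is unknown, û_d stands in for it when simulating:

```python
def xi_dot(e, xd, mu, t):
    # xi = [e; x_d]; mu is the transient part of the control
    x = e + xd
    ud = u_d_hat(xd, t)                      # stand-in for the ideal u_d(x_d)
    F_top = f_true(x) - h_d(xd, t) + (g(x) @ ud).ravel()
    F = np.concatenate([F_top, h_d(xd, t)])  # F(xi)
    G = np.vstack([g(x), np.zeros((n, 1))])  # G(xi) = [g(e + x_d); 0]
    return F + (G @ mu).ravel()
```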


Optimal Value Function and Policy
▶ The value fcn for the optimal solution is

  V*(ξ) = min_{µ∈U} ∫₀^∞ r(ξ, µ) dτ

  where, under the minimizing policy, r(ξ, µ) = Q(ξ) + µ*(ξ)ᵀRµ*(ξ).
▶ The optimal value fcn V* is a solution to the corresponding HJB equation

  0 = ∇_ξ V*(ξ)(F(ξ) + G(ξ)µ*(ξ)) + Q(ξ) + µ*(ξ)ᵀRµ*(ξ)    (3)

▶ with boundary condition V*(0) = 0.
▶ The optimal policy:

  µ*(ξ) = −(1/2) R⁻¹ G(ξ)ᵀ (∇_ξ V*(ξ))ᵀ
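▶ The policy formula follows from setting the derivative of the HJB minimand with respect to µ to zero. A small symbolic check of this stationarity condition (scalar case; p stands in for ∇_ξ V*):

```python
import sympy as sp

mu, G, p = sp.symbols('mu G p')          # p stands in for dV*/dxi (scalar case)
R = sp.Symbol('R', positive=True)

H_mu = p * G * mu + R * mu**2            # mu-dependent terms of the HJB minimand
mu_star = sp.solve(sp.Eq(sp.diff(H_mu, mu), 0), mu)[0]
print(mu_star)                           # -G*p/(2*R), i.e. -(1/2) R^{-1} G p
```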

▶ To solve the HJB equation, the optimal value fcn must be found. By the universal approximation property of MLPs, an NN is applied to approximate V* (the critic):

  V* = Wᵀσ(ξ) + ϵ(ξ) ≃ Ŵ_cᵀ σ(ξ)

▶ and likewise the optimal control policy µ* (the actor):

  µ* = −(1/2) R⁻¹ G(ξ)ᵀ (∇_ξσ(ξ)ᵀ W + ∇_ξϵ(ξ)ᵀ) ≃ −(1/2) R⁻¹ G(ξ)ᵀ ∇_ξσ(ξ)ᵀ Ŵ_a

▶ The control signal applied to (1) is

  u = µ̂(ξ, Ŵ_a) + û_d(x_d, θ̂)
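▶ A sketch of evaluating the actor and assembling the applied control, assuming a hypothetical quadratic critic basis σ(ξ) over ξ = [e; x_d] ∈ R⁴ ([1] does not prescribe this basis):

```python
R = np.array([[1.0]])   # control weight in r(xi, mu) (illustrative)

def sigma(xi):
    # hypothetical quadratic basis for the critic
    return np.array([xi[0]**2, xi[1]**2, xi[0]*xi[1], xi[2]*xi[0], xi[3]*xi[1]])

def grad_sigma(xi):
    # Jacobian of sigma with respect to xi (5 x 4)
    return np.array([
        [2*xi[0], 0,       0,     0],
        [0,       2*xi[1], 0,     0],
        [xi[1],   xi[0],   0,     0],
        [xi[2],   0,       xi[0], 0],
        [0,       xi[3],   0,     xi[1]],
    ])

def mu_hat(xi, W_a, e, xd):
    # actor: mu_hat = -1/2 R^{-1} G(xi)' grad_sigma(xi)' W_a
    G = np.vstack([g(e + xd), np.zeros((n, 1))])
    return -0.5 * np.linalg.solve(R, G.T @ (grad_sigma(xi).T @ W_a))

def control(e, xd, t, W_a):
    # u = mu_hat(xi, W_a) + u_d_hat(x_d, theta_hat), applied to system (1)
    xi = np.concatenate([e, xd])
    return mu_hat(xi, W_a, e, xd) + u_d_hat(xd, t)
```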


Update Laws
▶ The HJB equation in (3) equals zero under optimal conditions.
▶ Substituting the approximated fcns V̂(ξ), µ̂(ξ), and f̂ into (3) leaves a residual (Bellman error) δ̂.
▶ δ̂ is taken as the cost fcn to minimize, which defines the continuous-time least-squares-based update laws; these are designed based on the subsequent stability analysis. Here ω is the critic regressor, ρ a normalizing term, Σc, ΣΓ, Σa experience-replay (history-stack) sums, Γ̄ a saturation bound, and Gσ := ∇_ξσ G R⁻¹ Gᵀ ∇_ξσᵀ (the slide leaves these definitions implicit; they follow the usual ADP formulation):

  Ŵ˙_c = −ηc1 Γ (ω/ρ) δ̂ − ηc2 Γ Σc
  Γ˙ = (λΓ − ηc1 (Γωωᵀ Γ)/ρ² − ηc2 Γ ΣΓ Γ) · 1{‖Γ‖ ≤ Γ̄}
  Ŵ˙_a = −ηa1 (Ŵ_a − Ŵ_c) − ηa2 Ŵ_a + ηc1 ((Gσᵀ Ŵ_a ωᵀ)/(4ρ)) Ŵ_c + ηc2 Σa Ŵ_c
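▶ A simplified sketch of the critic update, continuing the running example: the experience-replay sums Σc, ΣΓ and the indicator on Γ˙ are omitted for brevity, the regressor and normalizer take the standard forms assumed above, and the gains are illustrative:

```python
eta_c1, lam = 1.0, 0.1   # adaptation and forgetting gains (illustrative)

def critic_update(W_c, Gamma, xi, xi_dot_val, Q_bar, mu):
    # Bellman error of (3) with V* ~ W_c' sigma(xi):
    # delta = W_c' grad_sigma(xi) xi_dot + Q_bar + mu' R mu
    omega = grad_sigma(xi) @ xi_dot_val          # critic regressor
    rho = 1.0 + omega @ Gamma @ omega            # normalization term
    delta = W_c @ omega + Q_bar + float(mu @ R @ mu)
    W_c_dot = -eta_c1 * Gamma @ omega * delta / rho
    Gamma_dot = lam * Gamma - eta_c1 * Gamma @ np.outer(omega, omega) @ Gamma / rho**2
    return W_c_dot, Gamma_dot
```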


Stability Analysis

▶ Uniform ultimate boundedness (UUB) of e, W̃_c, W̃_a, x̃, θ̃ is shown using the following Lyapunov candidate:

  V_L = V*(ξ) + (1/2) W̃_cᵀ Γ⁻¹ W̃_c + (1/2) W̃_aᵀ W̃_a + (1/2) x̃ᵀ x̃ + (1/2) tr(θ̃ᵀ Γ_θ⁻¹ θ̃)


References I

M. L. Greene, Z. I. Bell, S. Nivison, and W. E. Dixon, “Deep neural network-based approximate optimal tracking for unknown nonlinear systems,” IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 3171–3177, 2023.

D. Silver, Reinforcement Learning Lecture. https://www.davidsilver.uk/teaching/ (accessed Jan. 2023).

J. Klaise, Reinforcement Learning with MATLAB. http://web.khu.ac.kr/~tskim/NE%2010-3%20Reinforcement-learning-ebook.pdf (accessed Jan. 2023).

S. Meyn, Control Systems and Reinforcement Learning. Cambridge University Press, 2022.

K. G. Vamvoudakis, Y. Wan, F. L. Lewis, and D. Cansever, Handbook of Reinforcement Learning and Control. Springer, 2021.


References II

R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2nd ed., 2018.

S. M. N. Mahmud, S. A. Nivison, Z. I. Bell, and R. Kamalapurkar, “Safe model-based reinforcement learning for systems with parametric uncertainties,” Frontiers in Robotics and AI, vol. 8, Article 733104, 2021.
