On The Synthesis of Control Policies From Noisy Example Datasets: A Probabilistic Approach

Davide Gagliardi ∗, Giovanni Russo ∗∗
∗ School of Electrical and Electronic Engineering, University College Dublin, Ireland (e-mail: davide.gagliardi@ucd.ie).
∗∗ Department of Information and Electronic Engineering and Applied Mathematics, University of Salerno, Italy (e-mail: giovarusso@ucd.ie).
arXiv:2001.04428v2 [math.OC] 27 Feb 2020
Abstract: In this note we consider the problem of synthesizing optimal control policies for a system from noisy datasets. We present a novel algorithm that takes the available dataset as input and computes an optimal policy, for possibly stochastic and nonlinear systems, that also satisfies actuation constraints. The algorithm rests on theoretical foundations rooted in a probabilistic interpretation of dynamical systems. The effectiveness of our approach is illustrated on an autonomous car use case: we use our algorithm to synthesize, from noisy data, a control policy that allows the car to merge onto an intersection while satisfying additional constraints on the variance of the car speed.
1. INTRODUCTION

A framework that is becoming particularly appealing to design control algorithms is that of devising the control policy from examples (or demonstrations), see e.g. Hanawal et al. (2019); Wabersich and Zeilinger (2018) and references therein. At their roots these control from demonstration techniques, which are gaining considerable attention under the label of Inverse Reinforcement Learning (IRL), rely on Inverse Optimal Control and Optimization Bryson (1996). Today, IRL/control is recognized as an appealing framework to learn policies from success stories Argall et al. (2009) and potential applications include planning Englert et al. (2017) and preferences/prescriptions learning Xu and Paschalidis (2019).

There is then no surprise that, over the years, a number of techniques have been developed to address the problem of devising control policies from demonstrations, mainly in the context of Markov Decision Processes (MDPs) Sutton and Barto (1998). Results include Ratliff et al. (2009), which leverages a linear programming approach, Ratliff et al. (2006) which relies on a maximum margin approach, Ziebart et al. (2008) that makes use of the maximum entropy principle and Ramachandran and Amir (2007) that formalizes the problem via Bayesian statistics.

In this context, the main contributions of this extended abstract can be summarized as follows. First, we introduce an approach to synthesize control policies from examples which is based on the Fully Probabilistic Design (FPD) Kárný (1996); Kárný and Guy (2006); Herzallah (2015); Pegueroles and Russo (2019); Kárný and Kroupa (2012). This approach formalizes the control problem as an optimization problem where the Kullback-Leibler Divergence (see Section 2.2) between an ideal probability density function (pdf, obtained from e.g. demonstrations) and the pdf modeling the system/plant is minimized. The main technical novelty of our results with respect to the classic works on FPD lies in the fact that we explicitly embed actuation constraints in our formulation, thus solving an optimization problem where the Kullback-Leibler Divergence is minimized subject to constraints on the control variable. By relying on the FPD, one of the main advantages of our results over classic IRL/Control approaches is that policies can be synthesized from noisy data without requiring any assumption on the linearity of the system. The system can in fact be a general stochastic nonlinear dynamical system. Moreover, by embedding actuation constraints into the problem formulation and by solving the resulting optimization, we can export the policy that has been learned on other systems that have different actuation capabilities. As an additional contribution, we devise from our theoretical results an algorithmic procedure. The key reference applications over which the algorithm was tested involved an autonomous driving use case and full results are presented here.

2. MATHEMATICAL PRELIMINARIES

2.1 Notation

Sets, as well as operators, are denoted by calligraphic characters, while vector quantities are denoted in bold. Let $n_z$ be a positive integer and consider the measurable space $(\mathcal{Z}, \mathcal{F}_z)$, with $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$ and with $\mathcal{F}_z$ being a σ-algebra on $\mathcal{Z}$. Then, the random vector (i.e. a multidimensional random variable) on $(\mathcal{Z}, \mathcal{F}_z)$ is denoted by $\mathbf{Z}$ and its realization is denoted by $\mathbf{z}$ (in the paper, we use the convention that these random vectors are row vectors). The probability density function (or simply pdf in what follows) of a continuous $\mathbf{Z}$ is denoted by $f_{\mathbf{Z}}(\mathbf{z})$. For notational convenience, whenever it is clear from the context, we omit the argument and/or the subscript of the pdf. Hence, the support of $f := f_{\mathbf{Z}}(\mathbf{z})$ is denoted by $\mathcal{S}(f)$ and, analogously,
the expectation of a function $h(\cdot)$ of $\mathbf{Z}$ is indicated with $\mathbb{E}_f[h(\mathbf{Z})]$ and defined as $\mathbb{E}_f[h(\mathbf{Z})] := \int_{\mathcal{S}(f)} h(\mathbf{z}) f(\mathbf{z})\, d\mathbf{z}$. We also remark here that whenever we apply the averaging operator to a given function, we use an upper-case letter for the function argument as this is a random vector. The joint pdf of two random vectors, say $\mathbf{Z}$ and $\mathbf{Y}$, is denoted by $f_{[\mathbf{Z},\mathbf{Y}]}(\mathbf{z}, \mathbf{y})$ and abbreviated with $f(\mathbf{z}, \mathbf{y})$. The conditional probability density function (or cpdf in what follows) of $\mathbf{Z}$ with respect to the random vector $\mathbf{Y}$ is denoted by $f(\mathbf{z}|\mathbf{y})$ and, whenever the context is clear, we use the shorthand notation $\tilde{f}_{\mathbf{Z}}$. Finally, given $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$, its indicator function is denoted by $1_{\mathcal{Z}}(\mathbf{z})$. That is, $1_{\mathcal{Z}}(\mathbf{z}) = 1$, $\forall \mathbf{z} \in \mathcal{Z}$ and 0 otherwise. We also make use of the internal product between tensors, which is denoted by $\langle \cdot, \cdot \rangle$.

2.2 The Kullback-Leibler divergence

The control problem considered in this paper will be stated (see Section 3.1) in terms of the Kullback-Leibler (KL, Kullback and Leibler (1951)) divergence, formalized with the following:

Definition 1. (Kullback-Leibler (KL) divergence). Consider two pdfs, $\phi := \phi_{\mathbf{Z}}(\mathbf{z})$ and $g := g_{\mathbf{Z}}(\mathbf{z})$, with $\phi$ being absolutely continuous with respect to $g$. Then, the KL-divergence of $\phi$ with respect to $g$ is
\[
D_{KL}(\phi\|g) := \int_{\mathcal{S}(\phi)} \phi \ln\frac{\phi}{g}\, d\mathbf{z}. \tag{1}
\]

Intuitively, $D_{KL}(\phi\|g)$ is a measure of how well $\phi$ approximates $g$. We now give a property of the KL-divergence, the KL-divergence splitting property, which is used in the proof of Theorem 1.

Property 1. Let $\phi$ and $g$ be two pdfs of the random vector $[\mathbf{Z}, \mathbf{Y}]$, with $\mathbf{Z}$ and $\mathbf{Y}$ being random vectors of dimensions $n_Z$ and $n_Y$, respectively. Then, the following splitting rule holds:
\[
D_{KL}(\phi(\mathbf{y},\mathbf{z})\| g(\mathbf{y},\mathbf{z})) = D_{KL}(\phi(\mathbf{y})\| g(\mathbf{y})) + \mathbb{E}_{\phi(\mathbf{Y})}\left[D_{KL}(\phi(\mathbf{z}|\mathbf{Y})\| g(\mathbf{z}|\mathbf{Y}))\right] \tag{2}
\]
Proof: The proof follows from the definition of $D_{KL}$, the conditioning and independence rules for pdfs. A self-contained proof of this technical result is reported in the appendix. □

3. FORMULATION OF THE CONTROL PROBLEM

Let: (i) $\mathcal{K} := \{k\}_{k=1}^{n}$, $\mathcal{K}_0 := \mathcal{K} \cup \{0\}$ and $\mathcal{T} := \{t_k : k \in \mathcal{K}_0\}$ be the time horizon over which the system is observed; (ii) $\mathbf{x}_k \in \mathbb{R}^{d_x}$ and $\mathbf{u}_k \in \mathbb{R}^{d_u}$ be, respectively, the system state and input at time $t_k \in \mathcal{T}$; (iii) $\mathbf{d}_k := (\mathbf{x}_k, \mathbf{u}_k)$ be the data collected from the system at time $t_k \in \mathcal{T}$ and $\mathbf{d}^k$ the data collected from $t_0 \in \mathcal{T}$ up to time $t_k \in \mathcal{T}$ ($t_k > t_0$). As shown in e.g. Peterka (1981), the system behavior can be described via the joint pdf of the observed data, say $f(\mathbf{d}^n)$. Then, as shown in the same paper, the application of the chain rule for probability density functions leads to the following factorization for $f(\mathbf{d}^n)$:
\[
f(\mathbf{d}^n) = \prod_{k\in\mathcal{K}} f(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})\, f(\mathbf{u}_k|\mathbf{x}_{k-1})\, f(\mathbf{x}_0). \tag{3}
\]
Throughout this work we refer to (3) as the probabilistic description of the closed loop system, or we simply say that (3) is our closed loop system.

Remark 1. The cpdf $f(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})$ describes the system behavior at time $t_k$, given the previous state and the input at time $t_k$. In turn, the input is also generated from the cpdf $f(\mathbf{u}_k|\mathbf{x}_{k-1})$, which is a randomized control policy, returning the input given the previous state. Finally, we also note that the initial conditions are embedded in the probabilistic system description through the prior $f(\mathbf{x}_0)$.

In the rest of the paper we use the following shorthand notations: $\tilde{f}_{\mathbf{X}}^k := f(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})$, $\tilde{f}_{\mathbf{U}}^k := f(\mathbf{u}_k|\mathbf{x}_{k-1})$, $f_0 := f(\mathbf{x}_0)$ and $f^n := f(\mathbf{d}^n)$. Hence, (3) can be compactly written as
\[
f^n = \prod_{k\in\mathcal{K}} \tilde{f}_{\mathbf{X}}^k \tilde{f}_{\mathbf{U}}^k\, f_0 = \tilde{f}^n f_0, \qquad \tilde{f}^n := \prod_{k\in\mathcal{K}} \tilde{f}_{\mathbf{X}}^k \tilde{f}_{\mathbf{U}}^k. \tag{4}
\]

3.1 The control problem

Our goal is to synthesize, from an example dataset, say $\mathbf{d}_e^n$, the control pdf $f(\mathbf{u}_k|\mathbf{x}_{k-1})$ that allows the closed-loop system (4) to achieve the demonstrated behavior, subject to its actuation constraints. As in Kárný (1996); Quinn et al. (2016); Pegueroles and Russo (2019); Kárný and Guy (2006); Herzallah (2015) the behavior illustrated in the example dataset can be specified through the reference pdf $g(\mathbf{d}_e^n)$ extracted from the example dataset (as e.g. its empirical distribution). Following the chain rule for pdfs we have $g(\mathbf{d}_e^n) := \prod_{k\in\mathcal{K}} g(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})\, g(\mathbf{u}_k|\mathbf{x}_{k-1})\, g(\mathbf{x}_0)$. Again, by setting $\tilde{g}_{\mathbf{X}}^k := g(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})$, $\tilde{g}_{\mathbf{U}}^k := g(\mathbf{u}_k|\mathbf{x}_{k-1})$, $g_0 := g(\mathbf{x}_0)$ and $g^n := g(\mathbf{d}_e^n)$ we get:
\[
g^n = \prod_{k\in\mathcal{K}} \tilde{g}_{\mathbf{X}}^k \tilde{g}_{\mathbf{U}}^k\, g_0 = \tilde{g}^n g_0, \tag{5}
\]
where $\tilde{g}^n := \prod_{k\in\mathcal{K}} \tilde{g}_{\mathbf{X}}^k \tilde{g}_{\mathbf{U}}^k$.

The control problem can then be recast as the problem of designing $f(\mathbf{u}_k|\mathbf{x}_{k-1})$ so that $f^n$ approximates $g^n$. This leads to the following formalization:

Problem 1. Determine the sequence of cpdfs, say $\{(\tilde{f}_{\mathbf{U}}^k)^*\}_{k\in\mathcal{K}}$, solving the nonlinear program
\[
\begin{aligned}
&\min_{\{\tilde{f}_{\mathbf{U}}^k\}_{k\in\mathcal{K}}} D_{KL}(f^n\|g^n) \\
&\text{s.t. } \mathbb{E}_{\tilde{f}_{\mathbf{U}}^k}\left[\tilde{\mathbf{h}}_{u,k}(\mathbf{U})\right] = \tilde{\mathbf{H}}_{u,k}, \quad k \in \mathcal{K},
\end{aligned} \tag{6}
\]
where the constraints are algebraically independent.

In Problem 1, the constraints are formalized as expectations. We note that these constraints can be equivalently written as $\int_{\mathcal{S}(\tilde{f}_{\mathbf{U}}^k)} \tilde{f}_{\mathbf{U}}^k\, \tilde{\mathbf{h}}_{u,k}(\mathbf{u})\, d\mathbf{u} = \tilde{\mathbf{H}}_{u,k}$. Also, the constraints of the program are time-varying and the number of constraints can change over time (the number of constraints at time $t_k$ is denoted by $c_{u,k}$). Indeed, in the constraints of (6): (i) $\tilde{\mathbf{H}}_{u,k}$ is a (column) vector of coefficients, i.e. $\tilde{\mathbf{H}}_{u,k} := [H_{u,0,k}, \mathbf{H}_{u,k}^T]^T$ and $\tilde{\mathbf{h}}_{u,k}(\mathbf{z}) := [h_{u,0,k}, \mathbf{h}_{u,k}^T]^T(\mathbf{z})$; (ii) $\mathbf{H}_{u,k} \in \mathbb{R}^{c_{u,k}}$ and $\mathbf{h}_{u,k} : \mathcal{S}(\tilde{f}_{\mathbf{U}}^k) \mapsto \mathbb{R}^{c_{u,k}}$; (iii) $H_{u,0,k} := 1$ and $h_{u,0,k}(\mathbf{z}) := 1_{\mathcal{U}_k}(\mathbf{z})$ ensure that the solution of the program is a cpdf. Finally, in Problem 1 we assume that the constraints are algebraically independent. The notion of algebraically independent constraints is formalized in Definition 2.
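As an illustration of how the expectation constraints of Problem 1 can be evaluated for a candidate control pdf, the following Python sketch approximates $\mathbb{E}_{f}[\tilde{\mathbf{h}}(\mathbf{U})]$ on a uniform grid for a scalar input. Everything specific here is an assumption made for the example and is not taken from the paper: the Gaussian candidate pdf, the choice $\tilde{\mathbf{h}}(u) = [1, u, u^2]^T$ (normalization, first and second moment), the grid, and the helper name `constraint_values`.

```python
import numpy as np

def constraint_values(u_grid, pdf_u, h_tilde):
    """Approximate E_f[h_tilde(U)] with a Riemann sum on a uniform grid."""
    du = u_grid[1] - u_grid[0]
    return np.array([np.sum(h(u_grid) * pdf_u) * du for h in h_tilde])

# Hypothetical candidate control pdf: Gaussian with mean 1 and std 0.5.
u = np.linspace(-4.0, 6.0, 4001)
mu, sigma = 1.0, 0.5
pdf = np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# h_tilde stacks the indicator (normalization) with two moment functions.
h_tilde = [lambda v: np.ones_like(v), lambda v: v, lambda v: v ** 2]
# H_tilde collects the corresponding right-hand sides of the constraints.
H_tilde = np.array([1.0, mu, sigma ** 2 + mu ** 2])

values = constraint_values(u, pdf, h_tilde)
```

In this sketch the first entry of `values` plays the role of the normalization constraint ($H_{u,0,k} = 1$), while the remaining entries correspond to the components of $\mathbf{H}_{u,k}$.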
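Definition 2. Let $\mathbf{Z}$ be a random vector with underlying pdf $f_{\mathbf{Z}}(\mathbf{z})$ and support $\mathcal{Z}$. A set of functions $\mathbf{h} : \mathcal{Z} \mapsto \mathbb{R}^{c_z}$ is said to be algebraically independent if there exists a subset, say $\mathcal{S} \subset \mathcal{Z}$, with non-zero measure (i.e. $\int_{\mathcal{S}} d\mathbf{z} > 0$) and such that:
\[
\exists\, \mathcal{S} \subset \mathcal{Z} : \int_{\mathcal{S}} \langle \mathbf{v}, \mathbf{h}(\mathbf{z}) \rangle^2\, d\mathbf{z} > 0, \quad \forall \mathbf{v} \in \mathbb{R}^{c_z} \setminus \mathbf{0} \tag{7}
\]
In what follows, we simply say that a set of equations (or constraints) of the form of (7) is algebraically independent if the above definition is satisfied. As shown in Guilleminot and Soize (2013), the assumption that the constraints are algebraically independent ensures that Problem 1 is well posed.

4. TECHNICAL RESULTS

We now introduce the main technical results of this paper. The key result behind the algorithm of Section 5 is Theorem 1. The proof of this result, given in this section, makes use of three technical lemmas (i.e. Lemma 1, Lemma 2 and Lemma 3).

Lemma 1. Let: (i) $\mathbf{Z}$ be a random vector on the measurable space $(\mathcal{Z}, \mathcal{F}_z)$; (ii) $f := f_{\mathbf{Z}}(\mathbf{z})$, $g := g_{\mathbf{Z}}(\mathbf{z})$ be two probability distributions over $(\mathcal{Z}, \mathcal{F}_z)$; (iii) $\alpha : \mathcal{Z} \mapsto \mathbb{R}_0^+$ be a nonnegative function of $\mathbf{Z}$, integrable under the measure given by $f_{\mathbf{Z}}(\mathbf{z})$. Assume that $f_{\mathbf{Z}}(\mathbf{z})$ satisfies the following set of algebraically independent equations:
\[
\int f_{\mathbf{Z}}(\mathbf{z})\, \tilde{\mathbf{h}}(\mathbf{z})\, d\mathbf{z} = \tilde{\mathbf{H}}, \tag{8}
\]
where: (i) $\tilde{\mathbf{h}}(\mathbf{z}) := [h_0, \mathbf{h}^T]^T(\mathbf{z})$, with $h_0(\mathbf{z}) := 1_{\mathcal{S}(\mathbf{Z})}(\mathbf{z})$ and $\mathbf{h} : \mathcal{Z} \mapsto \mathbb{R}^{c_z}$ being a measurable map; (ii) $\tilde{\mathbf{H}} := [H_0, \mathbf{H}^T]^T$ with $H_0 := 1$ and $\mathbf{H} \in \mathbb{R}^{c_z}$ being a vector of constants. Then:
(1) the solution of the constrained optimization problem
\[
\min_{f_{\mathbf{Z}}} \mathcal{L}(f) \quad \text{s.t. constraints in (8)} \tag{9}
\]
with
\[
\mathcal{L}(f) := D_{KL}(f\|g) + \int f_{\mathbf{Z}}(\mathbf{z})\, \alpha(\mathbf{z})\, d\mathbf{z} \tag{10}
\]
is the pdf
\[
f^* := f_{\mathbf{Z}}^*(\mathbf{z}) = g(\mathbf{z})\, \frac{e^{-\{\alpha(\mathbf{z}) + \langle \boldsymbol{\lambda}^*, \mathbf{h}(\mathbf{z}) \rangle\}}}{e^{1+\lambda_0^*}}. \tag{11}
\]
In (11) $\lambda_0^*$ and $\boldsymbol{\lambda}^* = [\lambda_1^*, \ldots, \lambda_{c_z}^*]^T$ are the Lagrange multipliers associated to the constraints;
(2) moreover, the corresponding minimum is:
\[
\mathcal{L}^* := \mathcal{L}(f^*) = -\left(1 + \lambda_0^* + \langle \boldsymbol{\lambda}^*, \mathbf{H} \rangle\right). \tag{12}
\]
Proof: See the appendix. □

Note that, in Lemma 1, the optimal solution $f_{\mathbf{Z}}^*(\mathbf{z})$ depends on the Lagrange multipliers (LMs) $\lambda_0^*$ and $\boldsymbol{\lambda}^*$. The first LM, i.e. $\lambda_0^*$, can be obtained by integration, i.e. by imposing that $e^{1+\lambda_0^*}$ normalizes $f_{\mathbf{Z}}^*(\mathbf{z})$ in (11). With the next result, we propose a strategy for finding the LMs $\boldsymbol{\lambda}^*$. In particular, the idea is to recast the problem of finding the solutions of non-linear equations as a minimization problem. In general, the approach can be also used to fit the parameters of a pdf so that it meets a set of pre-specified constraints (for example, to find pdfs that satisfy the Maximum Entropy principle Guilleminot and Soize (2013)).

Lemma 2. Let: (i) $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$ and $\tilde{\Theta} \subseteq \mathbb{R}^{n_z}$; (ii) $\hat{f}_1 : \mathcal{Z} \mapsto \hat{f}_1(\mathbf{z})$ be a positive and integrable function on $\mathcal{Z}$; (iii) $\hat{f}_2 : (\mathcal{Z} \times \tilde{\Theta}) \mapsto \hat{f}_1(\mathbf{z})\, e^{-\langle \tilde{\boldsymbol{\theta}}, \tilde{\mathbf{h}}(\mathbf{z}) \rangle}$, where $\tilde{\mathbf{h}} = [\tilde{h}_1(\mathbf{z}), \ldots, \tilde{h}_{c_z}(\mathbf{z})]^T : \mathcal{Z} \mapsto \mathbb{R}^{c_z}$ are algebraically independent functions. Consider the constraints defined by the set of the following equations:
\[
\int_{\mathcal{Z}} \hat{f}_2(\mathbf{z}, \tilde{\boldsymbol{\theta}})\, \tilde{h}_i(\mathbf{z})\, d\mathbf{z} = \tilde{H}_i, \quad i = 1, \ldots, c_z, \tag{13}
\]
where $\tilde{\mathbf{H}} := [\tilde{H}_1, \ldots, \tilde{H}_{c_z}]^T \in \mathbb{R}^{c_z}$. Then, the unique solution, say $\tilde{\boldsymbol{\theta}}^*$, of the minimization problem
\[
\min_{\tilde{\boldsymbol{\theta}}} J(\tilde{\boldsymbol{\theta}}), \tag{14}
\]
with $J(\tilde{\boldsymbol{\theta}}) := \langle \tilde{\boldsymbol{\theta}}, \tilde{\mathbf{H}} \rangle + \int_{\mathcal{Z}} \hat{f}_2(\mathbf{z}, \tilde{\boldsymbol{\theta}})\, d\mathbf{z}$ is also a solution of (13).
Proof: See the appendix. □

Finally, we introduce here the following technical lemma that is used in the proof of Theorem 1.

Lemma 3. Let $f^n$ and $g^n$ be the pdfs defined in (3) and (5), respectively. Then:
\[
D_{KL}(f^n\|g^n) = D_{KL}(f^{n-1}\|g^{n-1}) + \mathbb{E}_{f^{n-1}}\left[D_{KL}(\tilde{f}^n\|\tilde{g}^n)\right] \tag{15}
\]
Proof: The result is obtained from Property 1 (see the appendix for a proof of this property) by setting $\mathbf{Y} := [\mathbf{X}_0, \mathbf{U}_1, \mathbf{X}_1, \ldots, \mathbf{U}_{n-1}, \mathbf{X}_{n-1}]$ and $\mathbf{Z} := [\mathbf{U}_n, \mathbf{X}_n]$. □

The main result behind the algorithm of Section 5, the proof of which makes use of the above technical results, is presented next.

Theorem 1. The solution, $(\tilde{f}_{\mathbf{U}}^k)^* = f^*(\mathbf{u}_k|\mathbf{x}_{k-1})$, of the control Problem 1 is
\[
(\tilde{f}_{\mathbf{U}}^k)^* = \tilde{g}_{\mathbf{U}}^k\, \frac{e^{-\{\hat{\omega}(\mathbf{u}_k, \mathbf{x}_{k-1}) + \langle \boldsymbol{\lambda}_{u,k}^*, \mathbf{h}_{u,k}(\mathbf{u}_k) \rangle\}}}{e^{1+\lambda_{u,0,k}^*}}, \tag{16}
\]
where:
(1) $\hat{\omega}(\cdot, \cdot)$ is generated via the backward recursion
\[
\hat{\omega}(\mathbf{u}_k, \mathbf{x}_{k-1}) = \hat{\alpha}(\mathbf{u}_k, \mathbf{x}_{k-1}) + \hat{\beta}(\mathbf{u}_k, \mathbf{x}_{k-1}), \tag{17}
\]
with
\[
\begin{aligned}
\hat{\alpha}(\mathbf{u}_k, \mathbf{x}_{k-1}) &:= D_{KL}\left(\tilde{f}_{\mathbf{X}}^k \| \tilde{g}_{\mathbf{X}}^k\right), \\
\hat{\beta}(\mathbf{u}_k, \mathbf{x}_{k-1}) &:= -\mathbb{E}_{\tilde{f}_{\mathbf{X}}^k}\left[\ln \hat{\gamma}(\mathbf{X}_k)\right],
\end{aligned} \tag{18}
\]
with terminal conditions $\hat{\beta}(\mathbf{u}_n, \mathbf{x}_{n-1}) = 0$ and $\hat{\alpha}(\mathbf{u}_n, \mathbf{x}_{n-1}) = D_{KL}(\tilde{f}_{\mathbf{X}}^n \| \tilde{g}_{\mathbf{X}}^n)$;
(2) $\hat{\gamma}(\cdot)$ in (18) is given by
\[
\ln \hat{\gamma}(\mathbf{x}_{k-1}) := \left[\sum_{i=0}^{c_u} \ln\left(\hat{\gamma}_{u,i,k}(\mathbf{x}_{k-1})\right)\right], \tag{19}
\]
with
\[
\hat{\gamma}_{u,0,k}(\mathbf{x}_{k-1}) = \exp\{\lambda_{u,0,k}^* + 1\}, \tag{20}
\]
and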
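\[
\hat{\gamma}_{u,i,k}(\mathbf{x}_{k-1}) := \exp\{\lambda_{u,i,k}^* H_{u,i,k}\}, \quad i = 1, \ldots, c_u, \tag{21}
\]
with terminal conditions $\hat{\gamma}_{u,0,n}(\mathbf{x}_{n-1}) = 1$, i.e. $\lambda_{u,0,n}^* = 0$, and $\lambda_{u,i,n}^* = 0$, $i = 1, \ldots, n$;
(3) $\lambda_{u,0,k}^*$ and $\boldsymbol{\lambda}_{u,k}^* = [\lambda_{u,1,k}^*, \ldots, \lambda_{u,c_u,k}^*]$ in (16) are the Lagrange multipliers (LMs) associated to the constraints at time $t_k$. In particular,
\[
\lambda_{u,0,k}^* = \ln \int \tilde{g}_{\mathbf{U}}^k\, e^{-\{\hat{\omega}(\mathbf{u}_k, \mathbf{x}_{k-1}) + \langle \boldsymbol{\lambda}_{u,k}^*, \mathbf{h}_{u,k}(\mathbf{u}_k) \rangle\}}\, d\mathbf{u}_k - 1,
\]
while all the other LMs can be obtained numerically (via e.g. Lemma 2).

Moreover, the corresponding minimum at time $k$ is given by:
\[
B_k^* := -\mathbb{E}_{p_{\mathbf{X}}^{k-1}}\left[\ln \hat{\gamma}(\mathbf{X}_{k-1})\right], \tag{22}
\]
where $p_{\mathbf{X}}^k$ denotes the pdf of the state at time $t_k$ (i.e. $p_{\mathbf{X}}^k := f(\mathbf{x}_k)$).

Proof: For notational convenience, we use the shorthand notation $\{\mathcal{E}_{u,k}\}$ to denote the set of constraints of Problem 1 at time $t_k$. We also denote by $\{\mathcal{E}_{u,k}\}_{\mathcal{K}}$ the set of constraints over the whole time horizon $\mathcal{K}$ and by $\{\mathcal{E}_{u,k}\}_{k=1}^{n-1}$ the constraints from $t_1$ up to time $t_{n-1}$.

Note that, following Lemma 3, Problem 1 can be re-written as follows:
\[
\min_{\substack{\{\tilde{f}_{\mathbf{U}}^k\}_{k\in\mathcal{K}} \\ \text{s.t.: } \{\mathcal{E}_{u,k}\}_{k\in\mathcal{K}}}} D_{KL}(f^n\|g^n) = \min_{\substack{\{\tilde{f}_{\mathbf{U}}^k\}_{k=1}^{n-1} \\ \text{s.t.: } \{\mathcal{E}_{u,k}\}_{k=1}^{n-1}}} \left\{ D_{KL}(f^{n-1}\|g^{n-1}) + B_n^* \right\} \tag{23}
\]
where:
\[
B_n^* := \min_{\substack{\tilde{f}_{\mathbf{U}}^n \\ \text{s.t.: } \mathcal{E}_{u,n}}} B_n, \qquad B_n := \mathbb{E}_{f^{n-1}}\left[D_{KL}(\tilde{f}^n\|\tilde{g}^n)\right]. \tag{24}
\]
That is, Problem 1 can be approached by solving first the optimization of the last time-instant of the time-horizon $\mathcal{K}$ (the term $B_n$ in (23)) and then by taking into account the result from this optimization problem in the optimization up to the instant $t_{n-1}$. Now we focus on the sub-problem:
\[
B_n^* := \min_{\substack{\tilde{f}_{\mathbf{U}}^n \\ \text{s.t.: } \mathcal{E}_{u,n}}} B_n \tag{25}
\]
For this problem, we first observe that the following equality is satisfied for the term $B_n$:
\[
B_n = \mathbb{E}_{f^{n-1}}\left[D_{KL}(\tilde{f}^n\|\tilde{g}^n)\right] = \mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[D_{KL}(\tilde{f}^n\|\tilde{g}^n)\right]. \tag{26}
\]
Such equality was obtained by noting that $D_{KL}(\tilde{f}^n\|\tilde{g}^n)$ is only a function of the previous state (see also Kárný (1996)) and, for notational convenience, we rename it as $\hat{A}(\cdot)$. Hence, $B_n$ becomes
\[
B_n = \mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[D_{KL}(\tilde{f}^n\|\tilde{g}^n)\right] = \mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[\hat{A}(\mathbf{X}_{n-1})\right]. \tag{27}
\]
Now, note that
\[
B_n^* := \min_{\substack{\tilde{f}_{\mathbf{U}}^n \\ \text{s.t.: } \mathcal{E}_{u,n}}} B_n = \min_{\substack{\tilde{f}_{\mathbf{U}}^n \\ \text{s.t.: } \mathcal{E}_{u,n}}} \mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[\hat{A}(\mathbf{X}_{n-1})\right] = \mathbb{E}_{p_{\mathbf{X}}^{n-1}}\Bigg[\min_{\substack{\tilde{f}_{\mathbf{U}}^n \\ \text{s.t.: } \mathcal{E}_{u,n}}} \hat{A}(\mathbf{X}_{n-1})\Bigg] = \mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[A_n^*\right], \tag{28}
\]
where the above expression was obtained by using the fact that the expectation operator is linear and the fact that the decision variable (i.e. $\tilde{f}_{\mathbf{U}}^n$) is independent of the pdf over which the expectation is performed (i.e. $p_{\mathbf{X}}^{n-1}$). This implies that, once we solve the problem
\[
A_n^* := \min_{\substack{\tilde{f}_{\mathbf{U}}^n \\ \text{s.t.: } \mathcal{E}_{u,n}}} \hat{A}(\mathbf{x}_{n-1}) \tag{29}
\]
for any fixed $\mathbf{x}_{n-1}$, then $B_n^*$ can be obtained by averaging $A_n^*$ over $p_{\mathbf{X}}^{n-1}$. We now focus on solving problem (29). In doing so, we first note that, following (27), $\hat{A}(\mathbf{x}_{n-1})$ can be re-written as follows:
\[
\hat{A}(\mathbf{x}_{n-1}) = \int \tilde{f}_{\mathbf{U}}^n \left[\ln\left(\frac{\tilde{f}_{\mathbf{U}}^n}{\tilde{g}_{\mathbf{U}}^n}\right) + \hat{\alpha}(\mathbf{u}_n, \mathbf{x}_{n-1})\right] d\mathbf{u}_n, \tag{30a}
\]
\[
\hat{\alpha}(\mathbf{u}_n, \mathbf{x}_{n-1}) := D_{KL}\left(\tilde{f}_{\mathbf{X}}^n \| \tilde{g}_{\mathbf{X}}^n\right). \tag{30b}
\]
In turn, (30a) can be compactly written as:
\[
\hat{A}(\mathbf{x}_{n-1}) = D_{KL}\left(\tilde{f}_{\mathbf{U}}^n \| \tilde{g}_{\mathbf{U}}^n\right) + \int \tilde{f}_{\mathbf{U}}^n\, \hat{\alpha}(\mathbf{u}_n, \mathbf{x}_{n-1})\, d\mathbf{u}_n, \tag{31}
\]
where we used the definition of KL-divergence. Hence, Lemma 1 can be used to solve the optimization problem in (29). Indeed, by applying Lemma 1 with $\mathbf{Z} = \mathbf{U}_n$, $f = \tilde{f}_{\mathbf{U}}^n$, $g = \tilde{g}_{\mathbf{U}}^n$, $\mathbf{h} = \mathbf{h}_{u,n}$, $\mathbf{H} = \mathbf{H}_{u,n}$ we get the following solution to (29):
\[
(\tilde{f}_{\mathbf{U}}^n)^* = \tilde{g}_{\mathbf{U}}^n\, \frac{e^{-\{\hat{\alpha}(\mathbf{u}_n, \mathbf{x}_{n-1}) + \langle \boldsymbol{\lambda}_{u,n}^*, \mathbf{h}_{u,n}(\mathbf{u}_n) \rangle\}}}{e^{1+\lambda_{u,0,n}^*}}. \tag{32}
\]
In the above pdf, $\lambda_{u,0,n}^*$ and $\boldsymbol{\lambda}_{u,n}^*$ are the LMs at the last time instant, $t_n$. The LM $\lambda_{u,0,n}^*$ can be obtained by imposing a normalization condition to (32). That is, $\lambda_{u,0,n}^*$ can be found by imposing that
\[
\exp\{\lambda_{u,0,n}^* + 1\} = \int \tilde{g}_{\mathbf{U}}^n\, e^{-\{\hat{\alpha}(\mathbf{u}_n, \mathbf{x}_{n-1}) + \langle \boldsymbol{\lambda}_{u,n}^*, \mathbf{h}_{u,n}(\mathbf{u}_n) \rangle\}}\, d\mathbf{u}_n = \hat{\gamma}_{u,0,n}(\mathbf{x}_{n-1}). \tag{33}
\]
Also, following Lemma 1, the minimum of the problem is given by:
\[
\hat{A}_n^* = -\left(1 + \lambda_{u,0,n}^* + \langle \boldsymbol{\lambda}_{u,n}^*, \mathbf{H}_{u,n} \rangle\right) \tag{34}
\]
or equivalently
\[
\hat{A}_n^* = -\left[\sum_{i=0}^{c_u} \ln\left(\hat{\gamma}_{u,i,n}(\mathbf{x}_{n-1})\right)\right] = -\ln \hat{\gamma}(\mathbf{x}_{n-1}) \tag{35}
\]
where we have used the definitions (20) and (21) for $\hat{\gamma}_{u,i,n}$, $i = 0, \ldots, c_u$. Therefore, the corresponding minimum value for $B_n$ is:
\[
B_n^* = -\mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[\ln \hat{\gamma}(\mathbf{X}_{n-1})\right]. \tag{36}
\]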
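The minimum value given by Lemma 1 and used in (34)-(35) can be checked numerically in a simplified setting. The Python sketch below is only an illustration under assumptions that are not in the paper: the pdfs are discrete (sums replace integrals), only the normalization constraint is active (so that $e^{1+\lambda_0^*}$ reduces to the normalizer, called `gamma` here), and the function names and random test data are hypothetical. Under these assumptions the minimizer is $f^* = g\, e^{-\alpha}/\gamma$ and the minimum cost is $-\ln \gamma$.

```python
import numpy as np

def tilted_solution(g, alpha):
    """Return f* = g * exp(-alpha) / gamma together with the normalizer gamma."""
    unnorm = g * np.exp(-alpha)
    gamma = unnorm.sum()
    return unnorm / gamma, gamma

def cost(f, g, alpha):
    """L(f) = KL(f || g) + E_f[alpha] for strictly positive discrete pdfs."""
    return float(np.sum(f * (np.log(f / g) + alpha)))

rng = np.random.default_rng(0)
g = rng.random(6) + 0.05          # strictly positive reference pdf
g /= g.sum()
alpha = rng.random(6)             # nonnegative cost term

f_star, gamma = tilted_solution(g, alpha)
min_cost = cost(f_star, g, alpha)

# Costs of randomly drawn competitor pdfs, for comparison with min_cost.
others = rng.random((200, 6)) + 0.05
others /= others.sum(axis=1, keepdims=True)
other_costs = [cost(f, g, alpha) for f in others]
```

The comparison with random competitors is not a proof of optimality; it only illustrates that no sampled pdf attains a cost below $-\ln \gamma$, consistently with the identity $\mathcal{L}(f) = D_{KL}(f\|f^*) - \ln \gamma$ that holds in this simplified case.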
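Note now that the solution we found to the problem in (25) only depends on $\mathbf{X}_{n-1}$ and therefore the original problem (23) can be split as
\[
\min_{\substack{\{\tilde{f}_{\mathbf{U}}^k\}_{k=1}^{n-1} \\ \text{s.t.: } \{\mathcal{E}_{u,k}\}_{k=1}^{n-1}}} \left\{ D_{KL}(f^{n-1}\|g^{n-1}) + B_n^* \right\} = \min_{\substack{\{\tilde{f}_{\mathbf{U}}^k\}_{k=1}^{n-2} \\ \text{s.t.: } \{\mathcal{E}_{u,k}\}_{k=1}^{n-2}}} \left\{ D_{KL}(f^{n-2}\|g^{n-2}) + B_{n-1}^* \right\} \tag{37}
\]
where:
\[
B_{n-1}^* := \min_{\substack{\tilde{f}_{\mathbf{U}}^{n-1} \\ \text{s.t.: } \mathcal{E}_{u,n-1}}} B_{n-1}, \tag{38a}
\]
\[
B_{n-1} := \mathbb{E}_{f^{n-2}}\left[D_{KL}(\tilde{f}^{n-1}\|\tilde{g}^{n-1})\right] + B_n^*. \tag{38b}
\]
We approach the above problem in the same way we used to solve the problem in (25). The idea is now to find a function, $\hat{A}(\mathbf{x}_{n-2})$, such that
\[
B_{n-1} = \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[\hat{A}(\mathbf{X}_{n-2})\right]. \tag{39}
\]
Once this is done, we then solve the problem
\[
A_{n-1}^* := \min_{\substack{\tilde{f}_{\mathbf{U}}^{n-1} \\ \text{s.t.: } \mathcal{E}_{u,n-1}}} \hat{A}(\mathbf{x}_{n-2}) \tag{40}
\]
and obtain $B_{n-1}^*$ as
\[
B_{n-1}^* := \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[A_{n-1}^*\right]. \tag{41}
\]
To this end we first note that the following identities
\[
\mathbb{E}_{p_{\mathbf{X}}^{n-1}}\left[\varphi(\mathbf{X}_{n-1})\right] = \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[\mathbb{E}_{\tilde{f}^{n-1}}\left[\varphi(\mathbf{X}_{n-1})\right]\right] \tag{42a}
\]
\[
\mathbb{E}_{\tilde{f}^{n-1}}\left[\varphi(\mathbf{X}_{n-1})\right] = \mathbb{E}_{\tilde{f}_{\mathbf{U}}^{n-1}}\left[\mathbb{E}_{\tilde{f}_{\mathbf{X}}^{n-1}}\left[\varphi(\mathbf{X}_{n-1})\right]\right] \tag{42b}
\]
hold for any function $\varphi$ of $\mathbf{X}_{n-1}$. Therefore, by means of (36) and (42a) we obtain, from (38b):
\[
\begin{aligned}
B_{n-1} &= \mathbb{E}_{f^{n-2}}\left[D_{KL}(\tilde{f}^{n-1}\|\tilde{g}^{n-1})\right] + B_n^* \\
&= \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[D_{KL}(\tilde{f}^{n-1}\|\tilde{g}^{n-1})\right] + B_n^* \\
&= \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[D_{KL}(\tilde{f}^{n-1}\|\tilde{g}^{n-1})\right] - \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[\mathbb{E}_{\tilde{f}^{n-1}}\left[\ln \hat{\gamma}(\mathbf{X}_{n-1})\right]\right] \\
&= \mathbb{E}_{p_{\mathbf{X}}^{n-2}}\Big[\underbrace{D_{KL}(\tilde{f}^{n-1}\|\tilde{g}^{n-1}) + \mathbb{E}_{\tilde{f}^{n-1}}\left[-\ln \hat{\gamma}(\mathbf{X}_{n-1})\right]}_{=: \hat{A}(\mathbf{X}_{n-2})}\Big]
\end{aligned} \tag{43}
\]
and the term $\hat{A}(\mathbf{x}_{n-2})$ can be recognized. Now, following the same reasoning we used to compute $\hat{A}(\mathbf{x}_{n-1})$, we explicitly write $\hat{A}(\mathbf{x}_{n-2})$ in compact form as
\[
\begin{aligned}
\hat{A}(\mathbf{x}_{n-2}) &= D_{KL}(\tilde{f}^{n-1}\|\tilde{g}^{n-1}) + \mathbb{E}_{\tilde{f}_{\mathbf{U}}^{n-1}}\left[\mathbb{E}_{\tilde{f}_{\mathbf{X}}^{n-1}}\left[-\ln \hat{\gamma}(\mathbf{X}_{n-1})\right]\right] \\
&= \int \tilde{f}_{\mathbf{U}}^{n-1} \left\{\ln \frac{\tilde{f}_{\mathbf{U}}^{n-1}}{\tilde{g}_{\mathbf{U}}^{n-1}} + \hat{\omega}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2})\right\} d\mathbf{u}_{n-1},
\end{aligned} \tag{44}
\]
where $\hat{\omega}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2}) = \hat{\alpha}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2}) + \hat{\beta}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2})$ and
\[
\begin{aligned}
\hat{\alpha}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2}) &:= D_{KL}\left(\tilde{f}_{\mathbf{X}}^{n-1} \| \tilde{g}_{\mathbf{X}}^{n-1}\right), \\
\hat{\beta}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2}) &:= -\mathbb{E}_{\tilde{f}_{\mathbf{X}}^{n-1}}\left[\ln \hat{\gamma}(\mathbf{X}_{n-1})\right].
\end{aligned} \tag{45}
\]
The last expression for $\hat{A}(\mathbf{x}_{n-2})$ obtained in (44) allows us to use Lemma 1 to solve the optimization problem defined in (40). Indeed, by applying Lemma 1 with $\mathbf{Z} = \mathbf{U}_{n-1}$, $f = \tilde{f}_{\mathbf{U}}^{n-1}$, $g = \tilde{g}_{\mathbf{U}}^{n-1}$, $\mathbf{h} = \mathbf{h}_{u,n-1}$, $\mathbf{H} = \mathbf{H}_{u,n-1}$, $\alpha(\cdot) = \hat{\omega}(\cdot)$, we get the following solution to the problem in (40):
\[
(\tilde{f}_{\mathbf{U}}^{n-1})^* = \tilde{g}_{\mathbf{U}}^{n-1}\, \frac{e^{-\{\hat{\omega}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2}) + \langle \boldsymbol{\lambda}_{u,n-1}^*, \mathbf{h}_{u,n-1}(\mathbf{u}_{n-1}) \rangle\}}}{e^{1+\lambda_{u,0,n-1}^*}}. \tag{46}
\]
Now, the LM $\lambda_{u,0,n-1}^*$ can be obtained by imposing the normalization condition for $(\tilde{f}_{\mathbf{U}}^{n-1})^*$. That is,
\[
\exp\{\lambda_{u,0,n-1}^* + 1\} = \int \tilde{g}_{\mathbf{U}}^{n-1}\, e^{-\{\hat{\omega}(\mathbf{u}_{n-1}, \mathbf{x}_{n-2}) + \langle \boldsymbol{\lambda}_{u,n-1}^*, \mathbf{h}_{u,n-1}(\mathbf{u}_{n-1}) \rangle\}}\, d\mathbf{u}_{n-1} = \hat{\gamma}_{u,0,n-1}(\mathbf{x}_{n-2}). \tag{47}
\]
All the other LMs, $\boldsymbol{\lambda}_{u,n-1}^*$, can be instead obtained via Lemma 2. Moreover, the minimum value for $B_{n-1}$ corresponding to the above pdf is
\[
B_{n-1}^* = -\mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[\sum_{i=0}^{c_u} \ln \hat{\gamma}_{u,i,n-1}(\mathbf{X}_{n-2})\right] = -\mathbb{E}_{p_{\mathbf{X}}^{n-2}}\left[\ln \hat{\gamma}(\mathbf{X}_{n-2})\right]. \tag{48}
\]
The proof can then be concluded by observing that at each further backward iteration, the solution $(\tilde{f}_{\mathbf{U}}^k)^*$ has the same shape as $(\tilde{f}_{\mathbf{U}}^{n-1})^*$. Indeed, the sub-problems corresponding to each further backward iteration have the exact same structure as the problem solved at the time instant $n-1$. In particular, the problems will have the same structure for the functions $\hat{\alpha}$, $\hat{\beta}$, $\hat{\omega}$, this time evaluated at the previous instants. Moreover, for the last time instant ($n$) the quantity $\hat{\beta}(\mathbf{u}_n, \mathbf{x}_{n-1})$ can be set to 0 (as there are no constraints at iteration $n+1$) and this is in turn equivalent to having $\lambda_{u,i,n+1}^* = 0$, $\forall i$. This completes the proof. □

We are now ready to introduce our algorithm translating the above theoretical results into a computational tool.

5. THE ALGORITHM

We developed an algorithmic procedure that, by leveraging the technical results introduced above, outputs the solution $\{(\tilde{f}_{\mathbf{U}}^k)^*\}_{k\in\mathcal{K}}$ to Problem 1. The only inputs that are necessary to the algorithm are $g(\mathbf{d}_e^n)$, extracted from the example dataset, and the $\tilde{f}_{\mathbf{X}}^k$'s modeling the plant.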
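The backward recursion of Theorem 1 can be sketched in Python under simplifying assumptions that are ours and not the paper's: states and inputs take values in finite sets (sums replace integrals), the models are time-invariant and strictly positive, and only the normalization constraint is active, so that no additional Lagrange multipliers need to be computed via Lemma 2. The function name `fpd_backward` and the array layout are hypothetical choices made for this illustration.

```python
import numpy as np

def fpd_backward(f_x, g_x, g_u, n):
    """Backward recursion on finite state/input sets (normalization constraint only).

    f_x[x, u, y] : plant cpdf      f(x_k = y | u_k = u, x_{k-1} = x)
    g_x[x, u, y] : reference cpdf  g(x_k = y | u_k = u, x_{k-1} = x)
    g_u[x, u]    : reference policy g(u_k = u | x_{k-1} = x)
    Returns a list with policies[k-1][x, u] = f*(u_k = u | x_{k-1} = x).
    """
    n_x = g_u.shape[0]
    log_gamma = np.zeros(n_x)                 # terminal condition: ln(gamma) = 0
    policies = [None] * n
    # alpha_hat = KL(f_x || g_x); constant in k for time-invariant models.
    alpha = np.sum(f_x * np.log(f_x / g_x), axis=2)
    for k in range(n, 0, -1):
        beta = -(f_x @ log_gamma)             # -E_{f_x}[ln gamma(X_k)]
        omega = alpha + beta
        unnorm = g_u * np.exp(-omega)         # g(u|x) * exp(-omega(u, x))
        gamma = unnorm.sum(axis=1)            # normalizer, one value per state
        policies[k - 1] = unnorm / gamma[:, None]
        log_gamma = np.log(gamma)             # passed to iteration k - 1
    return policies

rng = np.random.default_rng(1)

def random_cpdf(shape):
    """Strictly positive random array normalized along the last axis."""
    p = rng.random(shape) + 0.1
    return p / p.sum(axis=-1, keepdims=True)

g_x = random_cpdf((4, 3, 4))
g_u = random_cpdf((4, 3))
f_x = random_cpdf((4, 3, 4))

matched = fpd_backward(g_x, g_x, g_u, n=5)    # plant equal to the reference
policies = fpd_backward(f_x, g_x, g_u, n=5)   # plant different from the reference
```

When the plant model coincides with the reference model, the sketch returns the reference policy at every step, which is the expected behavior since the KL term vanishes. Handling additional constraints would require an inner step that solves for the remaining multipliers, which is deliberately left out here.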
Algorithm 1 Pseudo-code

Inputs:
  $g(\mathbf{d}_e^n)$ and $\tilde{f}_{\mathbf{X}}^k$'s
Output:
  $\{(\tilde{f}_{\mathbf{U}}^k)^*\}_{k\in\mathcal{K}}$ solving Problem 1
Initialize
  $\hat{\gamma}_{u,0,n}(\mathbf{x}_n) = 1$ ($\lambda_{u,0,n}^* = 0$, $\lambda_{u,i,n}^* = 0$), $\hat{\gamma} \equiv \hat{\gamma}_{u,0,n}$;
  $\hat{\beta}(\mathbf{x}_{n-1}, \mathbf{u}_n) = 0$;
for $k = n$ to $1$ do
  By backward recursion
    $\hat{\alpha}(\mathbf{u}_k, \mathbf{x}_{k-1}) \leftarrow \int f(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1}) \ln \frac{f(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})}{g(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1})}\, d\mathbf{x}_k$
    $\hat{\beta}(\mathbf{u}_k, \mathbf{x}_{k-1}) \leftarrow \int f(\mathbf{x}_k|\mathbf{u}_k, \mathbf{x}_{k-1}) \{-\ln(\hat{\gamma}(\mathbf{x}_k))\}\, d\mathbf{x}_k$
    $\hat{\omega}(\mathbf{u}_k, \mathbf{x}_{k-1}) \leftarrow \hat{\alpha}(\mathbf{u}_k, \mathbf{x}_{k-1}) + \hat{\beta}(\mathbf{u}_k, \mathbf{x}_{k-1})$
    $\hat{n}_u(\mathbf{u}_k, \mathbf{x}_{k-1}) \leftarrow g(\mathbf{u}_k|\mathbf{x}_{k-1}) \exp\{-\hat{\omega}(\mathbf{u}_k, \mathbf{x}_{k-1})\}$
    $\tilde{\gamma}_0(\mathbf{x}_{k-1}) \leftarrow \int \hat{n}_u(\mathbf{u}_k, \mathbf{x}_{k-1})\, d\mathbf{u}_k$
    $f(\mathbf{u}_k|\mathbf{x}_{k-1}) \leftarrow \hat{n}_u(\mathbf{u}_k, \mathbf{x}_{k-1}) / \tilde{\gamma}_0(\mathbf{x}_{k-1})$
  Use Lemma 2 with $\mathcal{Z} := \mathcal{S}(f(\mathbf{u}_k|\mathbf{x}_{k-1}))$, $\hat{f}_1 = f$, $\tilde{\mathbf{H}} := \tilde{\mathbf{H}}_{u,k}$, $\tilde{\mathbf{h}} := \tilde{\mathbf{h}}_{u,k}$, $\lambda_0 := \lambda_{u,0,k}$, $\boldsymbol{\lambda} := \boldsymbol{\lambda}_{u,k}$, $\tilde{\boldsymbol{\theta}} := [\theta_0, \boldsymbol{\theta}^T]^T = [1 + \lambda_0, \boldsymbol{\lambda}^T]^T$ to find the Lagrange multipliers:
    $\boldsymbol{\lambda}_{u,k}^* = \boldsymbol{\lambda}^* \leftarrow \boldsymbol{\theta}^*$
    $\lambda_{u,0,k}^*(\mathbf{x}_{k-1}) = \lambda_0^* \leftarrow \theta_0^* - 1$
  Compute the policy and prepare variables for the next iteration, $k - 1$:
    $(\tilde{f}_{\mathbf{U}}^k)^* \leftarrow f(\mathbf{u}_k|\mathbf{x}_{k-1})\, \frac{e^{-\langle \boldsymbol{\lambda}_{u,k}^*, \mathbf{h}_{u,k}(\mathbf{u}_k) \rangle}}{e^{1+\lambda_{u,0,k}^*}}$
    $\hat{\gamma}_{u,i,k}(\mathbf{x}_{k-1}) \leftarrow \exp\{\lambda_{u,i,k}^* H_{u,i,k}\}$, $i = 1, \ldots, c_u$
    $\hat{\gamma}_{u,0,k} = \exp\{\theta_0^*\} \leftarrow \exp\{\lambda_{u,0,k}^* + 1\}$
    $\hat{\gamma}(\mathbf{x}_{k-1}) \leftarrow \exp\left[\sum_{i=0}^{c_u} \ln(\hat{\gamma}_{u,i,k}(\mathbf{x}_{k-1}))\right]$
end for

6. VALIDATION

We used Algorithm 1 to synthesize a control policy (from real data) that would allow an autonomous car to merge on a highway. The scenario considered in our test is described in Fig. 1. Data were collected using the infrastructure of Griggs et al. (2019): GPS position, speed, acceleration and jerk were gathered through an OBD2 connection during 100 test drives.

Fig. 1. Autonomous driving scenario for Section 6: a car that is trying to merge onto a highway. The figure illustrates the stretch of road where the experiments took place. The area is outside the UCD entrance on Stillorgan Road, Dublin 4.

The stretch of road we used for our experiments is shown in Fig. 1 and the corresponding data that were collected are in Fig. 2.

Fig. 2. Data collected during the experiments: speed, acceleration, jerk as a function of distance (measured from the beginning of the trip, the UCD entrance). The vertical line in each panel denotes the physical location of the junction highlighted in Fig. 1. The panels on the left report all the data collected from 100 trips, while panels on the right report the subset of 20 trips with the lowest jerk.

We used the distance between the road junction point and the car position as state variable ($x_k = d(t_k)$) and the car longitudinal speed as control variable ($u_k = v(t_k)$). From the dataset, we extracted the 20 trips with the lowest jerk (in red in Fig. 2). We used this reduced dataset as desired behavior for the car. Given this set-up, we were able to compute both $f(\mathbf{d}^n)$ and $g(\mathbf{d}_e^n)$ from the complete dataset of 100 trips and the reduced dataset of 20 trips respectively. These pdfs are shown in Fig. 3, together with the corresponding control pdf (rightward panel). We also note here that $\mathcal{S}(f(\mathbf{d}^n)) \subseteq \mathcal{S}(g(\mathbf{d}_e^n))$ and this guarantees the absolute continuity of $f(\mathbf{d}^n)$ with respect to $g(\mathbf{d}_e^n)$.

Fig. 3. Pdfs extracted from the datasets of Fig. 2. On the axes, x and u denote the full series of collected distances and speeds.

Finally, we decided to constrain the variance of the acceleration (the control variable) and solved the resulting Problem 1 via Algorithm 1. In particular, to make the problem computationally efficient, we approximated all the above pdfs as Gaussian distributions via the Maximum Entropy Principle. Once this was done, we were able to
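control the closed loop pdf of the system so that it became as close as possible to $g(\mathbf{d}_e^n)$, given the constraint on the variance - see Figure 4. In the figure, the initial condition was $x_0 = 18$ meters (physically, this is a traffic light outside the UCD gate). Also, the equality constraint was set to have a variance of the closed-loop system higher than the variance of $g(\mathbf{d}_e^n)$ - this is why the closed-loop pdf is flatter.

Fig. 4. The results obtained using Algorithm 1. For the sake of clarity, the results are illustrated at time k = 1 and are representative of the other time instants. The optimal control pdf (left panel) and the resulting closed loop pdf (right panel).

7. CONCLUSIONS

We presented an approach to the synthesis of policies from examples. The key technical novelty of the results is the inclusion of actuation constraints in the problem formulation. This in turn yields policies that can be exported to different systems having different actuation capabilities. After presenting the main results we introduced an algorithmic procedure (code is available upon request). If accepted, the presentation will include a sketch of the proofs and a full report of our experimental results, which could not be included here due to space constraints.

Appendix A. SKETCH OF THE PROOFS

Proof of Property 1

To prove this result we start from the definition of KL-divergence. In particular:
\[
\begin{aligned}
D_{KL}(\phi(\mathbf{y},\mathbf{z})\|g(\mathbf{y},\mathbf{z})) &:= \iint \phi(\mathbf{y},\mathbf{z}) \ln\left[\frac{\phi(\mathbf{y},\mathbf{z})}{g(\mathbf{y},\mathbf{z})}\right] d\mathbf{y}\, d\mathbf{z} \\
&= \iint \phi(\mathbf{z}|\mathbf{y})\, \phi(\mathbf{y}) \ln\left[\frac{\phi(\mathbf{z}|\mathbf{y})\, \phi(\mathbf{y})}{g(\mathbf{z}|\mathbf{y})\, g(\mathbf{y})}\right] d\mathbf{y}\, d\mathbf{z} \\
&= \underbrace{\iint \phi(\mathbf{z}|\mathbf{y})\, \phi(\mathbf{y}) \ln\frac{\phi(\mathbf{y})}{g(\mathbf{y})}\, d\mathbf{y}\, d\mathbf{z}}_{(1)} + \underbrace{\iint \phi(\mathbf{y})\, \phi(\mathbf{z}|\mathbf{y}) \ln\frac{\phi(\mathbf{z}|\mathbf{y})}{g(\mathbf{z}|\mathbf{y})}\, d\mathbf{z}\, d\mathbf{y}}_{(2)}.
\end{aligned}
\]
For the term (1) in the above expression we may continue as follows:
\[
\begin{aligned}
\iint \phi(\mathbf{z}|\mathbf{y})\, \phi(\mathbf{y}) \ln\left[\frac{\phi(\mathbf{y})}{g(\mathbf{y})}\right] d\mathbf{y}\, d\mathbf{z} &= \int \left[\int \phi(\mathbf{z}|\mathbf{y})\, d\mathbf{z}\right] \phi(\mathbf{y}) \ln\frac{\phi(\mathbf{y})}{g(\mathbf{y})}\, d\mathbf{y} \\
&= D_{KL}(\phi(\mathbf{y})\|g(\mathbf{y}))
\end{aligned}
\]
where we used Fubini's theorem, the fact that the logarithmic term in square brackets on the first line is independent of $\mathbf{Z}$ and the fact that $\int \phi(\mathbf{z}|\mathbf{y})\, d\mathbf{z} = 1$.

By using again Fubini's theorem, for the term (2) instead we have:
\[
\begin{aligned}
\iint \phi(\mathbf{y})\, \phi(\mathbf{z}|\mathbf{y}) \ln\left[\frac{\phi(\mathbf{z}|\mathbf{y})}{g(\mathbf{z}|\mathbf{y})}\right] d\mathbf{z}\, d\mathbf{y} &= \int \phi(\mathbf{y}) \left[\int \phi(\mathbf{z}|\mathbf{y}) \ln\frac{\phi(\mathbf{z}|\mathbf{y})}{g(\mathbf{z}|\mathbf{y})}\, d\mathbf{z}\right] d\mathbf{y} \\
&= \int \phi(\mathbf{y}) \left[D_{KL}(\phi(\mathbf{z}|\mathbf{y})\|g(\mathbf{z}|\mathbf{y}))\right] d\mathbf{y} \\
&= \mathbb{E}_{\phi(\mathbf{Y})}\left[D_{KL}(\phi(\mathbf{z}|\mathbf{Y})\| g(\mathbf{z}|\mathbf{Y}))\right],
\end{aligned}
\]
thus proving the result. □

Proof of Lemma 1

We prove the result in two steps. First, we rewrite the cost function $\mathcal{L}(f)$ and consider the corresponding augmented Lagrangian. Then, we make use of the Euler-Lagrange (EL) stationary conditions to find $f_{\mathbf{Z}}^*(\mathbf{z})$ (in what follows we omit the dependencies of functions and pdfs on the random variable $\mathbf{z}$ whenever this is clear from the context).

As a first step, note that the cost function $\mathcal{L}(f)$ of the constrained optimization problem in (9) can be conveniently re-written as $\mathcal{L}(f) = \int f \left[\ln\frac{f}{g} + \alpha\right] d\mathbf{z}$. Then, the augmented Lagrangian takes the following form:
\[
\mathcal{L}_{aug}(f, \lambda_0, \boldsymbol{\lambda}) := \int f \left[\ln\frac{f}{g} + \alpha\right] d\mathbf{z} + \lambda_0 \left(\int f\, 1_{\mathcal{S}(\mathbf{Z})}\, d\mathbf{z} - 1\right) + \left\langle \boldsymbol{\lambda}, \int f\, \mathbf{h}(\mathbf{z})\, d\mathbf{z} - \mathbf{H} \right\rangle,
\]
where $\lambda_0$ and $\boldsymbol{\lambda} := [\lambda_1, \ldots, \lambda_{c_z}]^T$ are the (non-negative) Lagrange multipliers (LMs) corresponding to the constraints of the optimization problem. In turn, the above expression can be re-written as
\[
\mathcal{L}_{aug}(f, \lambda_0, \boldsymbol{\lambda}) = \int f \left[\ln\frac{f}{g} + \alpha + \lambda_0 + \langle \boldsymbol{\lambda}, \mathbf{h}(\mathbf{z}) \rangle\right] d\mathbf{z} - \lambda_0 - \langle \boldsymbol{\lambda}, \mathbf{H} \rangle \tag{A.1}
\]
Now, we let
\[
\tilde{\alpha}(\mathbf{z}) = \alpha(\mathbf{z}) + \lambda_0 + \langle \boldsymbol{\lambda}, \mathbf{h}(\mathbf{z}) \rangle \tag{A.2}
\]
and make use of the EL stationary conditions to find the optimal solution. First, we consider the EL stationary condition with respect to the pdf $f$. These conditions can be written in terms of the quantity under the integral in (A.1), i.e. in terms of $l(f) := f\left[\ln\frac{f}{g} + \tilde{\alpha}\right] = f\left[\ln(f) - \ln(g) + \tilde{\alpha}\right]$. In particular, by imposing the stationary condition we obtain:
\[
\frac{\partial l(f)}{\partial f} = \ln\frac{f}{g} + \tilde{\alpha} + 1 = 0. \tag{A.3}
\]
Therefore, it follows that all the optimal solution candidates must be of the form:
\[
f(\mathbf{z}) = g(\mathbf{z})\, e^{-\{1 + \tilde{\alpha}(\mathbf{z})\}}, \tag{A.4}
\]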
which, by definition of $\tilde{\alpha}$, becomes
\[
f(\mathbf{z}) = g(\mathbf{z})\, \frac{e^{-\{\alpha(\mathbf{z}) + \langle \boldsymbol{\lambda}, \mathbf{h}(\mathbf{z}) \rangle\}}}{e^{1+\lambda_0}}. \tag{A.5}
\]
Note that the above candidates are a function of the LMs. These can be computed by applying the EL stationary condition with respect to $\lambda_0, \lambda_1, \ldots, \lambda_{c_z}$. This yields the following set of additional conditions:
\[
\frac{\partial \mathcal{L}_{aug}(f, \lambda_0, \boldsymbol{\lambda})}{\partial \lambda_i} = 0, \quad i = 0, \ldots, c_z. \tag{A.6}
\]
That is, (A.6) implies that the LMs associated to the constraints must satisfy:
\[
\int g(\mathbf{z})\, \frac{e^{-\{\alpha(\mathbf{z}) + \langle \boldsymbol{\lambda}, \mathbf{h}(\mathbf{z}) \rangle\}}}{e^{1+\lambda_0}}\, \tilde{h}_i(\mathbf{z})\, d\mathbf{z} = \tilde{H}_i, \quad i = 0, \ldots, c_z, \tag{A.7}
\]
which was obtained by replacing the expression of the optimal solution candidate (A.5) in (A.6).

Now, the above set of equations can be solved via Lemma 2 and here we let $\lambda_0^*$, $\boldsymbol{\lambda}^*$ be the resulting values of LMs. Substituting the optimal LMs into the expression of the optimal solution candidates yields:
\[
f^* := f^*(\mathbf{z}) = g(\mathbf{z})\, \frac{e^{-\{\alpha(\mathbf{z}) + \langle \boldsymbol{\lambda}^*, \mathbf{h}(\mathbf{z}) \rangle\}}}{e^{1+\lambda_0^*}}.
\]
The proof is then concluded by noticing that $f^*(\mathbf{z})$ is indeed the optimal solution since the Lagrangian is convex in $f$. To show convexity, it suffices to consider the second derivative of $l(f)$ and to observe that this is always positive definite (indeed $\frac{\partial^2 l}{\partial f^2} = \frac{\partial (\ln(f) + \tilde{\alpha} + 1)}{\partial f} = \frac{1}{f} > 0$).

Finally, the second part of the result follows from evaluating $\mathcal{L}(f^*)$. Indeed:
\[
\begin{aligned}
\mathcal{L}(f^*) &= \int f^* \left[\ln\left(\frac{g\, e^{-\{\alpha + \langle \boldsymbol{\lambda}^*, \mathbf{h} \rangle\}}}{e^{1+\lambda_0^*}\, g}\right) + \alpha\right] d\mathbf{z} \\
&= -\int f^* \left(1 + \lambda_0^* + \langle \boldsymbol{\lambda}^*, \mathbf{h} \rangle\right) d\mathbf{z} \\
&= -\left(1 + \lambda_0^* + \langle \boldsymbol{\lambda}^*, \mathbf{H} \rangle\right),
\end{aligned} \tag{A.8}
\]
and this completes the proof. □

Proof of Lemma 2

We prove this result by showing that: (i) $J(\tilde{\boldsymbol{\theta}})$ is strictly convex; (ii) its minimizer must satisfy the set of equations (13).

The proof of statement (ii) comes directly from the evaluation of the first order stationary condition. Indeed, any optimal candidate, say $\tilde{\boldsymbol{\theta}}^*$, must satisfy $\nabla J(\tilde{\boldsymbol{\theta}}^*) = 0$. Now, computing $\nabla J(\tilde{\boldsymbol{\theta}}^*)$ yields
\[
\tilde{\mathbf{H}} - \int_{\mathcal{Z}} \hat{f}_1(\mathbf{z})\, e^{-\langle \tilde{\boldsymbol{\theta}}^*, \tilde{\mathbf{h}}(\mathbf{z}) \rangle}\, \tilde{\mathbf{h}}(\mathbf{z})\, d\mathbf{z} = \tilde{\mathbf{H}} - \int_{\mathcal{Z}} \hat{f}_2(\mathbf{z}, \tilde{\boldsymbol{\theta}}^*)\, \tilde{\mathbf{h}}(\mathbf{z})\, d\mathbf{z} = 0, \tag{A.9}
\]
where we used the definition of $\hat{f}_2$ to obtain the second equality. That is, (A.9) immediately implies that any candidate minimizer of the optimization problem in (14) must fulfil the set of equations (13).

In order to prove strict convexity, we show that the Hessian of $J(\tilde{\boldsymbol{\theta}})$ is strictly positive definite in $\tilde{\Theta}$. Indeed, computing the Hessian yields
\[
\left[\nabla^2 J(\tilde{\boldsymbol{\theta}})\right] = \int_{\mathcal{Z}} \left[\tilde{\mathbf{h}}(\mathbf{z}) \otimes \tilde{\mathbf{h}}(\mathbf{z})\right] \hat{f}_1(\mathbf{z})\, e^{-\langle \tilde{\boldsymbol{\theta}}, \tilde{\mathbf{h}}(\mathbf{z}) \rangle}\, d\mathbf{z}, \tag{A.10}
\]
where $\otimes$ denotes the external product between tensors. Now, since the equations in (13) are algebraically independent, we have (see Definition 2):
\[
\exists\, \mathcal{S} \subset \mathcal{Z} : \forall \tilde{\mathbf{v}} \in \mathbb{R}^{c_z} - \{\mathbf{0}\}, \quad \left\langle \left[\nabla^2 J(\tilde{\boldsymbol{\theta}})\right] \tilde{\mathbf{v}}, \tilde{\mathbf{v}} \right\rangle = \int_{\mathcal{S}} \langle \tilde{\mathbf{v}}, \tilde{\mathbf{h}}(\mathbf{z}) \rangle^2\, \hat{f}_1(\mathbf{z})\, e^{-\langle \tilde{\boldsymbol{\theta}}, \tilde{\mathbf{h}}(\mathbf{z}) \rangle}\, d\mathbf{z} > 0. \tag{A.11}
\]
This implies that $\tilde{\boldsymbol{\theta}}^*$ is the unique minimizer of the optimization problem, thus concluding the proof. □

REFERENCES

Argall, B.D., Chernova, S., Veloso, M., and Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483. doi:https://doi.org/10.1016/j.robot.2008.10.024.
Bryson, A.E. (1996). Optimal control-1950 to 1985. IEEE Control Systems Magazine, 16(3), 26–33.
Englert, P., Vien, N.A., and Toussaint, M. (2017). Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. The International Journal of Robotics Research, 36(13-14), 1474–1488. doi:10.1177/0278364917745980.
Griggs, W., Ordóñez-Hurtado, R., Russo, G., and Shorten, R. (2019). A vehicle-in-the-loop emulation platform for demonstrating intelligent transportation systems. In Control Strategies for Advanced Driver Assistance Systems and Autonomous Driving Functions, 133–154. Springer.
Guilleminot, J. and Soize, C. (2013). On the statistical dependence for the components of random elasticity tensors exhibiting material symmetry properties. Journal of Elasticity, 111(2), 109–130.
Hanawal, M., Liu, H., Zhu, H., and Paschalidis, I. (2019). Learning policies for Markov decision processes from data. IEEE Transactions on Automatic Control, 64, 2298–2309.
Herzallah, R. (2015). Fully probabilistic control for stochastic nonlinear control systems with input dependent noise. Neural Networks, 63, 199–207. doi:10.1016/j.neunet.2014.12.004.
Kárný, M. (1996). Towards fully probabilistic control design. Automatica, 32(12), 1719–1722. doi:10.1016/s0005-1098(96)80009-4.
Kárný, M. and Guy, T.V. (2006). Fully probabilistic control design. Systems & Control Letters, 55(4), 259–265. doi:10.1016/j.sysconle.2005.08.001.
Kullback, S. and Leibler, R. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–87.
Kárný, M. and Kroupa, T. (2012). Axiomatisation of fully probabilistic design. Information Sciences, 186(1), 105–113. doi:https://doi.org/10.1016/j.ins.2011.09.018.
Pegueroles, B.G. and Russo, G. (2019). On robust stability of fully probabilistic control with respect to data-driven model uncertainties. In 2019 18th Eu-
(i.e. statement (i)) ropean Control Conference (ECC), 2460–2465. doi:
we compute the Hessian of J θ̃ and show that this 10.23919/ECC.2019.8795901.
Peterka, V. (1981). Bayesian approach to system identification. 239–304. doi:10.1016/b978-0-08-025683-2.50013-2.
Quinn, A., Kárný, M., and Guy, T.V. (2016). Fully probabilistic design of hierarchical Bayesian models. Information Sciences, 369, 532–547. doi:10.1016/j.ins.2016.07.035.
Ramachandran, D. and Amir, E. (2007). Bayesian inverse reinforcement learning. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, 2586–2591. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Ratliff, N.D., Bagnell, J.A., and Zinkevich, M.A. (2006). Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, 729–736. ACM, New York, NY, USA. doi:10.1145/1143844.1143936.
Ratliff, N.D., Silver, D., and Bagnell, J.A. (2009). Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots, 27(1), 25–53.
Sutton, R.S. and Barto, A.G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition.
Wabersich, K.P. and Zeilinger, M.N. (2018). Scalable synthesis of safety certificates from data with application to learning-based control. In 2018 European Control Conference (ECC), 1691–1697. doi:10.23919/ECC.2018.8550288.
Xu, T. and Paschalidis, I.C. (2019). Learning models for writing better doctor prescriptions. In 2019 18th European Control Conference (ECC), 2454–2459. doi:10.23919/ECC.2019.8796280.
Ziebart, B.D., Maas, A., Bagnell, J.A., and Dey, A.K. (2008). Maximum entropy inverse reinforcement learning. In Proc. AAAI, 1433–1438.
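
Numerical illustration of the proof of Lemma 2. The following is a minimal sketch, not the authors' implementation, of how the stationarity condition (A.9) can be solved numerically. It assumes that the cost has the form $J(\tilde{\theta}) = \langle\tilde{\theta},\tilde{H}\rangle + \int \hat{f}_1(z)\,e^{-\langle\tilde{\theta},\tilde{h}(z)\rangle}dz$, which is one form whose gradient matches (A.9). All concrete choices are hypothetical: a standard normal $\hat{f}_1$, features $\tilde{h}(z) = (1, z)$, targets $\tilde{H} = (1, 0.5)$, a truncated uniform grid, and plain Newton iterations without line search.

```python
import math

# Hypothetical setup (not taken from the paper): scalar z on a truncated grid,
# reference pdf f1 = standard normal, features h(z) = (1, z),
# targets H = (1, 0.5), i.e. unit mass and mean 0.5.
N = 4001
Z_MIN, Z_MAX = -10.0, 10.0
DZ = (Z_MAX - Z_MIN) / (N - 1)
GRID = [Z_MIN + k * DZ for k in range(N)]
H_TARGET = (1.0, 0.5)


def f1(z):
    """Reference pdf (assumed standard normal for this illustration)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def f2(z, theta):
    """Tilted function f1(z) * exp(-<theta, h(z)>) with h(z) = (1, z)."""
    return f1(z) * math.exp(-(theta[0] + theta[1] * z))


def moments(theta):
    """Riemann-sum approximations of the integrals of f2, z*f2 and z^2*f2."""
    m0 = m1 = m2 = 0.0
    for z in GRID:
        w = f2(z, theta) * DZ
        m0 += w
        m1 += w * z
        m2 += w * z * z
    return m0, m1, m2


def solve_dual(theta=(0.0, 0.0), tol=1e-12, max_iter=50):
    """Newton iterations on the gradient H - int f2 * h dz, cf. (A.9)."""
    t0, t1 = theta
    for _ in range(max_iter):
        m0, m1, m2 = moments((t0, t1))
        g0, g1 = H_TARGET[0] - m0, H_TARGET[1] - m1  # gradient of J
        if max(abs(g0), abs(g1)) < tol:
            break
        # Hessian of J is [[m0, m1], [m1, m2]]: the Gram matrix of h under f2,
        # which is positive definite here, cf. (A.11).
        det = m0 * m2 - m1 * m1
        d0 = (m2 * g0 - m1 * g1) / det
        d1 = (-m1 * g0 + m0 * g1) / det
        t0, t1 = t0 - d0, t1 - d1
    return t0, t1
```

Under these assumptions the tilted density is a normal pdf with mean 0.5 and unit variance, so `solve_dual()` should return values close to (0.125, -0.5). A damped or line-search Newton method would be a safer choice for less well-behaved $\hat{f}_1$ or $\tilde{h}$.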
