On The Synthesis of Control Policies From Noisy Example Datasets: A Probabilistic Approach
Abstract: In this note we consider the problem of synthesizing optimal control policies for
a system from noisy datasets. We present a novel algorithm that takes the available dataset
as input and computes an optimal policy, for possibly stochastic and nonlinear systems, that
also satisfies actuation constraints. The algorithm rests on theoretical foundations rooted
in a probabilistic interpretation of dynamical systems. The effectiveness of our approach is
illustrated on an autonomous car use case. For this use case, we use our algorithm to
synthesize, from noisy data, a control policy that allows the car to merge onto an
intersection while satisfying additional constraints on the variance of the car speed.
The control problem considered in this paper will be stated (see Section 3.1) in terms of the Kullback-Leibler (KL, Kullback and Leibler (1951)) divergence, formalized with the following:
Definition 1. (Kullback-Leibler (KL) divergence). Consider two pdfs, $\phi := \phi_Z(z)$ and $g := g_Z(z)$, with $\phi$ being absolutely continuous with respect to $g$. Then, the KL-divergence of $\phi$ with respect to $g$ is
$$D_{KL}(\phi\|g) := \int_{S(\phi)} \phi \ln\frac{\phi}{g}\,dz. \tag{1}$$
Intuitively, $D_{KL}(\phi\|g)$ is a measure of how well $\phi$ approximates $g$. We now give a property of the KL-divergence, the KL-divergence splitting property, which is used in the proof of Theorem 1.
Property 1. Let $\phi$ and $g$ be two pdfs of the random vector $[Z, Y]$, with $Z$ and $Y$ being random vectors of dimensions $n_Z$ and $n_Y$, respectively. Then, the following splitting rule holds:
$$D_{KL}(\phi(y,z)\|g(y,z)) = D_{KL}(\phi(y)\|g(y)) + E_{\phi(Y)}\left[D_{KL}(\phi(z|Y)\|g(z|Y))\right] \tag{2}$$
Proof: The proof follows from the definition of $D_{KL}$, the conditioning and independence rules for pdfs. A self-contained proof of this technical result is reported in the appendix. □
3. FORMULATION OF THE CONTROL PROBLEM
Let: (i) $K := \{k\}_{k=1}^{n}$, $K_0 := K \cup \{0\}$ and $T := \{t_k : k \in K_0\}$ be the time horizon over which the system is observed; (ii) $x_k \in \mathbb{R}^{d_x}$ and $u_k \in \mathbb{R}^{d_u}$ be, respectively, the system state and input at time $t_k \in T$; (iii) $d_k := (x_k, u_k)$ be the data collected from the system at time $t_k \in T$ and $d^k$ the data collected from $t_0 \in T$ up to time $t_k \in T$ ($t_k > t_0$). As shown in e.g. Peterka (1981), the system behavior can be described via the joint pdf of the observed data, say $f(d^n)$. Then, as shown in the same paper, the application of the chain rule for probability density functions leads to the following factorization for $f(d^n)$:
$$f(d^n) = \prod_{k\in K} f(x_k|u_k,x_{k-1})\, f(u_k|x_{k-1})\, f(x_0). \tag{3}$$
Throughout this work we refer to (3) as the probabilistic description of the closed loop system, or we simply say that (3) is our closed loop system.
Our goal is to synthesize, from an example dataset, say $d_e^n$, the control pdf $f(u_k|x_{k-1})$ that allows the closed-loop system (4) to achieve the demonstrated behavior, subject to its actuation constraints. As in Kárný (1996); Quinn et al. (2016); Pegueroles and Russo (2019); Kárný and Guy (2006); Herzallah (2015) the behavior illustrated in the example dataset can be specified through the reference pdf $g(d_e^n)$ extracted from the example dataset (as e.g. its empirical distribution). Following the chain rule for pdfs we have $g(d_e^n) := \prod_{k\in K} g(x_k|u_k,x_{k-1})\, g(u_k|x_{k-1})\, g(x_0)$. Again, by setting $\tilde g_X^k := g(x_k|u_k,x_{k-1})$, $\tilde g_U^k := g(u_k|x_{k-1})$, $g_0 := g(x_0)$ and $g^n := g(d_e^n)$ we get:
$$g^n = \prod_{k\in K} \tilde g_X^k\, \tilde g_U^k\, g_0 = \tilde g^n g_0, \tag{5}$$
where $\tilde g^n := \prod_{k\in K} \tilde g_X^k\, \tilde g_U^k$.
The control problem can then be recast as the problem of designing $f(u_k|x_{k-1})$ so that $f^n$ approximates $g^n$. This leads to the following formalization:
Problem 1. Determine the sequence of cpdfs, say $\{(\tilde f_U^k)^*\}_{k\in K}$, solving the nonlinear program
$$\min_{\{\tilde f_U^k\}_{k\in K}} D_{KL}(f^n\|g^n) \quad \text{s.t.}\ E_{\tilde f_U^k}\left[\tilde h_{u,k}(U)\right] = \tilde H_{u,k},\ k \in K, \tag{6}$$
where the constraints are algebraically independent.
In Problem 1, the constraints are formalized as expectations. We note that these constraints can be equivalently written as $\int_{S(\tilde f_U^k)} \tilde f_U^k\, \tilde h_{u,k}(u)\,du = \tilde H_{u,k}$. Also, the constraints of the program are time-varying and the number of constraints can change over time (the number of constraints at time $t_k$ is denoted by $c_{u,k}$). Indeed, in the constraints of (6): (i) $\tilde H_{u,k}$ is a (column) vector of coefficients, i.e. $\tilde H_{u,k} := [H_{u,0,k}, H_{u,k}^T]^T$ and $\tilde h_{u,k}(z) := [h_{u,0,k}, h_{u,k}^T]^T(z)$; (ii) $H_{u,k} \in \mathbb{R}^{c_{u,k}}$ and $h_{u,k} : S(\tilde f_U^k) \mapsto \mathbb{R}^{c_{u,k}}$; (iii) $H_{u,0,k} := 1$ and $h_{u,0,k}(z) := \mathbb{1}_{U_k}(z)$ ensure that the solution of the program is a cpdf. Finally, in Problem 1 we assume that the constraints are algebraically independent. The notion of algebraically independent constraints is formalized next.
Definition 2. Let $Z$ be a random vector with underlying pdf $f_Z(z)$ and support $\mathcal{Z}$. A set of functions $h : \mathcal{Z} \mapsto \mathbb{R}^{c_z}$ is said to be algebraically independent if there exists a subset, say $S \subset \mathcal{Z}$, with non-zero measure (i.e. $\int_S dz > 0$) and such that:
$$\exists S \subset \mathcal{Z} : \int_S \langle v, h(z)\rangle^2\,dz > 0, \quad \forall v \in \mathbb{R}^{c_z}\setminus 0 \tag{7}$$
In what follows, we simply say that a set of equations (or constraints) of the form of (7) is algebraically independent if the above definition is satisfied. As shown in Guilleminot and Soize (2013), the assumption that the constraints are algebraically independent ensures that Problem 1 is well posed.
4. TECHNICAL RESULTS
We now introduce the main technical results of this paper. The key result behind the algorithm of Section 5 is Theorem 1. The proof of this result, given in this section, makes use of three technical lemmas (i.e. Lemma 1, Lemma 2 and Lemma 3).
Lemma 1. Let: (i) $Z$ be a random vector on the measurable space $(\mathcal{Z}, \mathcal{F}_z)$; (ii) $f := f_Z(z)$, $g := g_Z(z)$ be two probability distributions over $(\mathcal{Z}, \mathcal{F}_z)$; (iii) $\alpha : \mathcal{Z} \mapsto \mathbb{R}_0^+$ be a nonnegative function of $Z$, integrable under the measure given by $f_Z(z)$. Assume that $f_Z(z)$ satisfies the following set of algebraically independent equations:
$$\int f_Z(z)\, \tilde h(z)\,dz = \tilde H, \tag{8}$$
where: (i) $\tilde h(z) := [h_0, h^T(z)]^T$, with $h_0(z) := \mathbb{1}_{S(Z)}(z)$ and $h : \mathcal{Z} \mapsto \mathbb{R}^{c_z}$ being a measurable map; (ii) $\tilde H := [H_0, H^T]^T$ with $H_0 := 1$ and $H \in \mathbb{R}^{c_z}$ being a vector of constants. Then:
(1) the solution of the constrained optimization problem
$$\min_{f_Z} L(f) \quad \text{s.t. constraints in (8)} \tag{9}$$
with
$$L(f) := D_{KL}(f\|g) + \int f_Z(z)\, \alpha(z)\,dz \tag{10}$$
is the pdf
$$f^* := f_Z^*(z) = g(z)\, \frac{e^{-\{\alpha(z) + \langle \lambda^*, h(z)\rangle\}}}{e^{1+\lambda_0^*}}. \tag{11}$$
In (11) $\lambda_0^*$ and $\lambda^* = [\lambda_1^*, \ldots, \lambda_{c_z}^*]^T$ are the Lagrange multipliers associated to the constraints;
(2) moreover, the corresponding minimum is:
$$L^* := L(f^*) = -\left(1 + \lambda_0^* + \langle \lambda^*, H\rangle\right). \tag{12}$$
Proof: See the appendix. □
the Maximum Entropy principle Guilleminot and Soize (2013)).
Lemma 2. Let: (i) $\mathcal{Z} \subseteq \mathbb{R}^{n_z}$ and $\tilde\Theta \subseteq \mathbb{R}^{n_z}$; (ii) $\hat f_1 : \mathcal{Z} \mapsto \hat f_1(z)$ be a positive and integrable function on $\mathcal{Z}$; (iii) $\hat f_2 : (\mathcal{Z} \times \tilde\Theta) \mapsto \hat f_1(z)\, e^{-\langle \tilde\theta, \tilde h(z)\rangle}$, where $\tilde h = [\tilde h_1(z), \ldots, \tilde h_{c_z}(z)]^T : \mathcal{Z} \mapsto \mathbb{R}^{c_z}$ are algebraically independent functions. Consider the constraints defined by the set of the following equations:
$$\int_{\mathcal{Z}} \hat f_2(z, \tilde\theta)\, \tilde h_i(z)\,dz = \tilde H_i, \quad i = 1, \ldots, c_z, \tag{13}$$
where $\tilde H := [\tilde H_1, \ldots, \tilde H_{c_z}]^T \in \mathbb{R}^{c_z}$. Then, the unique solution, say $\tilde\theta^*$, of the minimization problem
$$\min_{\tilde\theta} J(\tilde\theta), \tag{14}$$
with $J(\tilde\theta) := \langle \tilde\theta, \tilde H\rangle + \int_{\mathcal{Z}} \hat f_2(z, \tilde\theta)\,dz$ is also a solution of (13).
Proof: See the appendix. □
Finally, we introduce here the following technical lemma that is used in the proof of Theorem 1.
Lemma 3. Let $f^n$ and $g^n$ be the pdfs defined in (3) and (5), respectively. Then:
$$D_{KL}(f^n\|g^n) = D_{KL}(f^{n-1}\|g^{n-1}) + E_{f^{n-1}}\left[D_{KL}(\tilde f^n\|\tilde g^n)\right] \tag{15}$$
Proof: The result is obtained from Property 1 (see the appendix for a proof of this property) by setting $Y := [X_0, U_1, X_1, \ldots, U_{n-1}, X_{n-1}]$ and $Z := [U_n, X_n]$. □
The main result behind the algorithm of Section 5, the proof of which makes use of the above technical results, is presented next.
Theorem 1. The solution, $(\tilde f_U^k)^* = f^*(u_k|x_{k-1})$, of the control Problem 1 is
$$(\tilde f_U^k)^* = \tilde g_U^k\, \frac{e^{-\{\hat\omega(u_k, x_{k-1}) + \langle \lambda_{u,k}^*, h_{u,k}(u_k)\rangle\}}}{e^{1+\lambda_{u,0,k}^*}}, \tag{16}$$
where:
(1) $\hat\omega(\cdot,\cdot)$ is generated via the backward recursion
$$\hat\omega(u_k, x_{k-1}) = \hat\alpha(u_k, x_{k-1}) + \hat\beta(u_k, x_{k-1}), \tag{17}$$
with
$$\hat\alpha(u_k, x_{k-1}) := D_{KL}(\tilde f_X^k\|\tilde g_X^k), \qquad \hat\beta(u_k, x_{k-1}) := -E_{\tilde f_X^k}\left[\ln \hat\gamma(X_k)\right], \tag{18}$$
while all the other LMs can be obtained numerically (via e.g. Lemma 2).
Moreover, the corresponding minimum at time $k$ is given by:
$$B_k^* := -E_{p_X^{k-1}}\left[\ln \hat\gamma(X_{k-1})\right], \tag{22}$$
where $p_X^k$ denotes the pdf of the state at time $t_k$ (i.e. $p_X^k := f(x_k)$).
Proof: For notational convenience, we use the shorthand notation $\{E_{u,k}\}$ to denote the set of constraints of Problem 1 at time $t_k$. We also denote by $\{E_{u,k}\}_K$ the set of constraints over the whole time horizon $K$ and by $\{E_{u,k}\}_{k=1}^{n-1}$ the constraints from $t_1$ up to time $t_{n-1}$.
Note that, following Lemma 3, Problem 1 can be re-written as follows:
$$\min_{\substack{\{\tilde f_U^k\}_{k\in K} \\ \text{s.t.: } \{E_{u,k}\}_{k\in K}}} D_{KL}(f^n\|g^n) = \min_{\substack{\{\tilde f_U^k\}_{k=1}^{n-1} \\ \text{s.t.: } \{E_{u,k}\}_{k=1}^{n-1}}} D_{KL}(f^{n-1}\|g^{n-1}) + B_n^* \tag{23}$$
where:
$$B_n^* := \min_{\substack{\tilde f_U^n \\ \text{s.t.: } E_{u,n}}} B_n, \qquad B_n := E_{f^{n-1}}\left[D_{KL}(\tilde f^n\|\tilde g^n)\right]. \tag{24}$$
That is, Problem 1 can be approached by solving first the optimization of the last time-instant of the time-horizon $K$ (the term $B_n$ in (23)) and then by taking into account the result from this optimization problem in the optimization up to the instant $t_{n-1}$. Now we focus on the sub-problem:
$$B_n^* := \min_{\substack{\tilde f_U^n \\ \text{s.t.: } E_{u,n}}} B_n \tag{25}$$
For this problem, we first observe that the following equality is satisfied for the term $B_n$:
$$B_n = E_{f^{n-1}}\left[D_{KL}(\tilde f^n\|\tilde g^n)\right] = E_{p_X^{n-1}}\left[D_{KL}(\tilde f^n\|\tilde g^n)\right]. \tag{26}$$
Such equality was obtained by noting that $D_{KL}(\tilde f^n\|\tilde g^n)$ is only a function of the previous state (see also Kárný (1996)) and, for notational convenience, we rename it as $\hat A(\cdot)$. Hence, $B_n$ becomes
$$B_n = E_{p_X^{n-1}}\left[D_{KL}(\tilde f^n\|\tilde g^n)\right] = E_{p_X^{n-1}}\left[\hat A(X_{n-1})\right]. \tag{27}$$
where the above expression was obtained by using the fact that the expectation operator is linear and the fact that the decision variable (i.e. $\tilde f_U^n$) is independent of the pdf over which the expectation is performed (i.e. $p_X^{n-1}$). This implies that, once we solve the problem
$$A_n^* := \min_{\substack{\tilde f_U^n \\ \text{s.t.: } E_{u,n}}} \hat A(x_{n-1}) \tag{29}$$
for any fixed $x_{n-1}$, then $B_n^*$ can be obtained by averaging $A_n^*$ over $p_X^{n-1}$. We now focus on solving problem (29). In doing so, we first note that, following (27), $\hat A(x_{n-1})$ can be re-written as follows:
$$\hat A(x_{n-1}) = \int \tilde f_U^n \left[\ln\left(\frac{\tilde f_U^n}{\tilde g_U^n}\right) + \hat\alpha(u_n, x_{n-1})\right] du_n, \tag{30a}$$
$$\hat\alpha(u_n, x_{n-1}) := D_{KL}(\tilde f_X^n\|\tilde g_X^n). \tag{30b}$$
In turn, (30a) can be compactly written as:
$$\hat A(x_{n-1}) = D_{KL}(\tilde f_U^n\|\tilde g_U^n) + \int \tilde f_U^n\, \hat\alpha(u_n, x_{n-1})\,du_n, \tag{31}$$
where we used the definition of KL-divergence. Hence, Lemma 1 can be used to solve the optimization problem in (29). Indeed, by applying Lemma 1 with $Z = U_n$, $f = \tilde f_U^n$, $g = \tilde g_U^n$, $h = h_{u,n}$, $H = H_{u,n}$ we get the following solution to (29):
$$(\tilde f_U^n)^* = \tilde g_U^n\, \frac{e^{-\{\hat\alpha(u_n, x_{n-1}) + \langle \lambda_{u,n}^*, h_{u,n}(u_n)\rangle\}}}{e^{1+\lambda_{u,0,n}^*}}. \tag{32}$$
In the above pdf, $\lambda_{u,0,n}^*$ and $\lambda_{u,n}^*$ are the LMs at the last time instant, $t_n$. The LM $\lambda_{u,0,n}^*$ can be obtained by imposing a normalization condition to (32). That is, $\lambda_{u,0,n}^*$ can be found by imposing that
$$\exp\{\lambda_{u,0,n}^* + 1\} = \int \tilde g_U^n\, e^{-\{\hat\alpha(u_n, x_{n-1}) + \langle \lambda_{u,n}^*, h_{u,n}(u_n)\rangle\}}\,du_n = \hat\gamma_{u,0,n}(x_{n-1}). \tag{33}$$
Also, following Lemma 1, the minimum of the problem is given by:
$$\hat A_n^* = -\left(1 + \lambda_{u,0,n}^* + \langle \lambda_{u,n}^*, H_{u,n}\rangle\right) \tag{34}$$
or equivalently
$$\hat A_n^* = -\left[\sum_{i=0}^{c_u} \ln\left(\hat\gamma_{u,i,n}(x_{n-1})\right)\right] = -\ln \hat\gamma(x_{n-1}), \tag{35}$$
where we have used the definitions (20) and (21) for $\hat\gamma_{u,i,n}$, $i = 0, \ldots, c_u$. Therefore, the corresponding minimum value for $B_n$ is:
$$B_n^* = -E_{p_X^{n-1}}\left[\ln \hat\gamma(X_{n-1})\right]. \tag{36}$$
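The last-step solution (32)-(36) can be checked numerically. The block below is a minimal sketch and not the authors' code: it assumes a finite input set, so that pdfs become probability vectors and integrals become sums, and it keeps only the normalization constraint ($c_u = 0$). The function names `kl`, `cost` and `tilted`, and the numbers in `g` and `alpha`, are hypothetical. Under these assumptions the candidate $f^* \propto g\, e^{-\alpha}$ should attain the value $-\ln\hat\gamma$, with $\hat\gamma$ the normalizer in (33), and no other probability vector should achieve a lower cost.

```python
import math

def kl(p, q):
    """Discrete KL divergence sum_i p_i ln(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cost(f, g, alpha):
    """Discrete analogue of (31): D_KL(f || g) + E_f[alpha]."""
    return kl(f, g) + sum(fi * ai for fi, ai in zip(f, alpha))

def tilted(g, alpha):
    """Candidate minimiser proportional to g * exp(-alpha), returned with its normaliser."""
    weights = [gi * math.exp(-ai) for gi, ai in zip(g, alpha)]
    gamma = sum(weights)
    return [w / gamma for w in weights], gamma

# Hypothetical data: reference pmf over three inputs and a nonnegative cost per input.
g = [0.5, 0.3, 0.2]
alpha = [0.1, 1.0, 2.0]

f_star, gamma = tilted(g, alpha)
best = cost(f_star, g, alpha)

# A few other pmfs to compare against (hypothetical choices).
others = [[1 / 3, 1 / 3, 1 / 3], g, [0.8, 0.1, 0.1]]

print("minimum cost:", best, " -ln(gamma):", -math.log(gamma))
for f in others:
    print("cost of", f, "=", cost(f, g, alpha))
```

With additional moment constraints the same structure holds, except that the exponent also contains $\langle \lambda^*_{u,n}, h_{u,n}(u_n)\rangle$ and the multipliers have to be computed numerically.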
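Note now that the solution we found to the problem in (25) only depends on $X_{n-1}$ and therefore the original problem (23) can be split as
$$\min_{\substack{\{\tilde f_U^k\}_{k=1}^{n-1} \\ \text{s.t.: } \{E_{u,k}\}_{k=1}^{n-1}}} D_{KL}(f^{n-1}\|g^{n-1}) + B_n^* = \min_{\substack{\{\tilde f_U^k\}_{k=1}^{n-2} \\ \text{s.t.: } \{E_{u,k}\}_{k=1}^{n-2}}} D_{KL}(f^{n-2}\|g^{n-2}) + B_{n-1}^* \tag{37}$$
where:
$$B_{n-1}^* := \min_{\substack{\tilde f_U^{n-1} \\ \text{s.t.: } E_{u,n-1}}} B_{n-1}, \tag{38a}$$
$$B_{n-1} := E_{f^{n-2}}\left[D_{KL}(\tilde f^{n-1}\|\tilde g^{n-1}) + B_n^*\right]. \tag{38b}$$
We approach the above problem in the same way we used to solve the problem in (25). The idea is now to find a function, $\hat A(x_{n-2})$, such that
$$B_{n-1} = E_{p_X^{n-2}}\left[\hat A(X_{n-2})\right]. \tag{39}$$
Once this is done, we then solve the problem
$$A_{n-1}^* := \min_{\substack{\tilde f_U^{n-1} \\ \text{s.t.: } E_{u,n-1}}} \hat A(x_{n-2}) \tag{40}$$
and obtain $B_{n-1}^*$ as
$$B_{n-1}^* := E_{p_X^{n-2}}\left[A_{n-1}^*\right]. \tag{41}$$
where $\hat\omega(u_{n-1}, x_{n-2}) = \hat\alpha(u_{n-1}, x_{n-2}) + \hat\beta(u_{n-1}, x_{n-2})$ and
$$\hat\alpha(u_{n-1}, x_{n-2}) := D_{KL}(\tilde f_X^{n-1}\|\tilde g_X^{n-1}), \qquad \hat\beta(u_{n-1}, x_{n-2}) := -E_{\tilde f_X^{n-1}}\left[\ln \hat\gamma(X_{n-1})\right]. \tag{45}$$
The last expression for $\hat A(x_{n-2})$ obtained in (44) allows us to use Lemma 1 to solve the optimization problem defined in (40). Indeed, by applying Lemma 1 with $Z = U_{n-1}$, $f = \tilde f_U^{n-1}$, $g = \tilde g_U^{n-1}$, $h = h_{u,n-1}$, $H = H_{u,n-1}$, $\hat\alpha(\cdot) = \hat\omega(\cdot)$, we get the following solution to the problem in (40):
$$(\tilde f_U^{n-1})^* = \tilde g_U^{n-1}\, \frac{e^{-\{\hat\omega(u_{n-1}, x_{n-2}) + \langle \lambda_{u,n-1}^*, h_{u,n-1}(u_{n-1})\rangle\}}}{e^{1+\lambda_{u,0,n-1}^*}}. \tag{46}$$
Now, the LM $\lambda_{u,0,n-1}^*$ can be obtained by imposing the normalization condition for $\tilde f_U^{n-1}$. That is,
$$\exp\{\lambda_{u,0,n-1}^* + 1\} = \int \tilde g_U^{n-1}\, e^{-\{\hat\omega(u_{n-1}, x_{n-2}) + \langle \lambda_{u,n-1}^*, h_{u,n-1}(u_{n-1})\rangle\}}\,du_{n-1} = \hat\gamma_{u,0,n-1}(x_{n-2}). \tag{47}$$
All the other LMs, $\lambda_{u,n-1}^*$, can instead be obtained via Lemma 2. Moreover, the minimum value $B_{n-1}^*$ corresponding to the above pdf is
$$B_{n-1}^* = -E_{\tilde f_X^{n-2}}\left[\sum_{i=0}^{c_u} \ln \hat\gamma_{u,i,n-1}(X_{n-2})\right] = -E_{\tilde f_X^{n-2}}\left[\ln \hat\gamma(X_{n-2})\right]. \tag{48}$$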
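The multipliers that the proof defers to Lemma 2 can be computed by minimizing $J(\tilde\theta)$ in (14). The sketch below does this with a plain Newton iteration on a finite grid; it is an illustration under stated assumptions, not the procedure used by the authors. The grid `zs`, the weights `f1` (playing the role of $\hat f_1$), the constraint functions $\tilde h(z) = [1, z]^T$ (normalization and mean), the targets `H` and the function names are all hypothetical. The gradient of $J$ is the residual of the constraints (13) and its Hessian is the discrete version of (A.10).

```python
import math

def moments(theta, zs, f1):
    """Sums of f2, z*f2 and z^2*f2, where f2 = f1 * exp(-<theta, [1, z]>)."""
    m0 = m1 = m2 = 0.0
    for z, w in zip(zs, f1):
        f2 = w * math.exp(-(theta[0] + theta[1] * z))
        m0 += f2
        m1 += z * f2
        m2 += z * z * f2
    return m0, m1, m2

def solve_multipliers(zs, f1, H, iters=50, tol=1e-12):
    """Minimise J(theta) = <theta, H> + sum_z f2(z, theta) with undamped Newton steps.

    Assumption: no line search is used, which is adequate for this small,
    well-conditioned example but is not a general-purpose safeguard.
    """
    theta = [0.0, 0.0]
    for _ in range(iters):
        m0, m1, m2 = moments(theta, zs, f1)
        grad = [H[0] - m0, H[1] - m1]        # dJ/dtheta_i = H_i - sum_z f2 * h_i
        if max(abs(grad[0]), abs(grad[1])) < tol:
            break
        det = m0 * m2 - m1 * m1              # Hessian of J is [[m0, m1], [m1, m2]]
        # Newton step d solves Hessian * d = -grad (Cramer's rule, 2x2 case).
        d0 = (-grad[0] * m2 + grad[1] * m1) / det
        d1 = (-grad[1] * m0 + grad[0] * m1) / det
        theta = [theta[0] + d0, theta[1] + d1]
    return theta

# Hypothetical instance: uniform weights on {0, 1, 2, 3}, unit mass, target mean 1.
zs = [0.0, 1.0, 2.0, 3.0]
f1 = [0.25, 0.25, 0.25, 0.25]
H = [1.0, 1.0]

theta = solve_multipliers(zs, f1, H)
m0, m1, _ = moments(theta, zs, f1)
print("theta:", theta)
print("constraint values (mass, mean):", m0, m1)
```

Because the Hessian is positive definite when the constraint functions are algebraically independent, the minimizer is unique, which is the property Lemma 2 relies on.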
for k = n to 1 do    ▷ By backward recursion
    $\hat\alpha(u_k, x_{k-1}) \leftarrow \int f(x_k|u_k,x_{k-1}) \ln\frac{f(x_k|u_k,x_{k-1})}{g(x_k|u_k,x_{k-1})}\,dx_k$
    $\hat\beta(u_k, x_{k-1}) \leftarrow \int f(x_k|u_k,x_{k-1}) \{-\ln(\hat\gamma(x_k))\}\,dx_k$
    $\hat\omega(u_k, x_{k-1}) \leftarrow \hat\alpha(u_k, x_{k-1}) + \hat\beta(u_k, x_{k-1})$
    $\hat n_u(u_k, x_{k-1}) \leftarrow g(u_k|x_{k-1}) \exp\{-\hat\omega(u_k, x_{k-1})\}$
    $\tilde\gamma_0(x_{k-1}) \leftarrow \int \hat n_u(u_k, x_{k-1})\,du_k$
    $f(u_k|x_{k-1}) \leftarrow \hat n_u(u_k, x_{k-1}) / \tilde\gamma_0(x_{k-1})$
    $\hat\gamma_{u,i,k}(x_{k-1}) \leftarrow \exp\{\lambda_{u,i,k}^* H_{u,i,k}\}$, $i = 1, \ldots, c_u$
    $\hat\gamma_{u,0,k} = \exp\{\theta_0^*\} \leftarrow \exp\{\lambda_{u,0,k}^* + 1\}$
    $\hat\gamma(x_{k-1}) \leftarrow \exp\left[\sum_{i=0}^{c_u} \ln(\hat\gamma_{u,i,k}(x_{k-1}))\right]$
end for
6. VALIDATION
We used Algorithm 1 to synthesize a control policy (from real data) that would allow an autonomous car to merge onto a highway. The scenario considered in our test is described in Fig. 1. Data were collected using the infrastructure of Griggs et al. (2019): GPS position, speed, acceleration and jerk were gathered through an OBD2 connection during 100 test drives.
We used the distance between the road junction point and the car position as state variable ($x_k = d(t_k)$) and the car longitudinal speed as control variable ($u_k = v(t_k)$). From the dataset, we extracted the 20 trips with the lowest jerk (in red in Fig. 2). We used this reduced dataset as desired behavior for the car. Given this set-up, we were able to compute both $f(d^n)$ and $g(d_e^n)$ from the complete dataset of 100 trips and the reduced dataset of 20 trips, respectively. These pdfs are shown in Fig. 3, together with the corresponding control pdf (rightward panel). We also note here that $S(f(d^n)) \subseteq S(g(d_e^n))$, and this guarantees the absolute continuity of $f(d^n)$ with respect to $g(d_e^n)$.
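The backward recursion of Algorithm 1 can be sketched in a few lines for finite state and input sets. The block below is an illustrative sketch under assumptions that are not in the paper: states and inputs are discretized so that integrals become sums, the plant and reference pmfs are time-invariant, only the normalization constraint is enforced (so $\hat\gamma = \hat\gamma_{u,0,k}$), and $\ln\hat\gamma$ is taken to be zero beyond the horizon. The function `synthesize_policy` and all numerical values are hypothetical.

```python
import math

def synthesize_policy(f_x, g_x, g_u, horizon):
    """Backward recursion over k = n, ..., 1 on finite state and input sets.

    f_x[u][xp][x]: plant pmf f(x_k | u_k, x_{k-1})
    g_x[u][xp][x]: reference plant pmf g(x_k | u_k, x_{k-1}), positive where f_x is
    g_u[xp][u]:    reference policy g(u_k | x_{k-1})
    Returns policies[k-1][xp][u] = f(u_k | x_{k-1}).
    """
    n_u, n_x = len(f_x), len(f_x[0])
    log_gamma = [0.0] * n_x            # assumption: ln(gamma_hat) = 0 beyond the horizon
    policies = []
    for _ in range(horizon):
        new_log_gamma = [0.0] * n_x
        policy_k = []
        for xp in range(n_x):
            weights = []
            for u in range(n_u):
                # alpha_hat: KL divergence between plant and reference plant pmfs
                alpha = sum(p * math.log(p / q)
                            for p, q in zip(f_x[u][xp], g_x[u][xp]) if p > 0)
                # beta_hat: expected value of -ln(gamma_hat) at the next state
                beta = -sum(p * lg for p, lg in zip(f_x[u][xp], log_gamma))
                weights.append(g_u[xp][u] * math.exp(-(alpha + beta)))
            gamma0 = sum(weights)      # normaliser of the unnormalised policy
            policy_k.append([w / gamma0 for w in weights])
            new_log_gamma[xp] = math.log(gamma0)
        log_gamma = new_log_gamma
        policies.append(policy_k)
    policies.reverse()                 # computed for k = n first, returned for k = 1 first
    return policies

# Hypothetical two-state, two-input example.
f_x = [[[0.9, 0.1], [0.5, 0.5]], [[0.2, 0.8], [0.1, 0.9]]]
g_x = [[[0.8, 0.2], [0.6, 0.4]], [[0.3, 0.7], [0.2, 0.8]]]
g_u = [[0.5, 0.5], [0.3, 0.7]]

policies = synthesize_policy(f_x, g_x, g_u, 3)
matched = synthesize_policy(g_x, g_x, g_u, 3)   # plant equal to the reference plant
for k, pol in enumerate(policies, start=1):
    print("k =", k, pol)
```

When the plant coincides with the reference plant, $\hat\alpha$ vanishes at every step and the recursion returns the reference policy, which gives a quick sanity check of an implementation.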
We presented an approach to the synthesis of policies from examples. The key technical novelty of the results is the inclusion of actuation constraints in the problem formulation. This in turn yields policies that can be exported to different systems having different actuation capabilities. After presenting the main results we introduced an algorithmic procedure (code is available upon request). If accepted, the presentation will include a sketch of the proofs and a full report of our experimental results, which could not be included here due to space constraints.
Appendix A. SKETCH OF THE PROOFS
Proof of Property 1
To prove this result we start from the definition of KL-divergence. In particular:
$$\begin{aligned} D_{KL}(\phi(y,z)\|g(y,z)) &:= \iint \phi(y,z) \ln\left[\frac{\phi(y,z)}{g(y,z)}\right] dy\,dz \\ &= \iint \phi(z|y)\,\phi(y) \ln\left[\frac{\phi(z|y)\,\phi(y)}{g(z|y)\,g(y)}\right] dy\,dz \\ &= \underbrace{\iint \phi(z|y)\,\phi(y) \ln\frac{\phi(y)}{g(y)}\,dy\,dz}_{(1)} + \underbrace{\iint \phi(y)\,\phi(z|y) \ln\frac{\phi(z|y)}{g(z|y)}\,dz\,dy}_{(2)}. \end{aligned}$$
For the term (1) in the above expression we may continue as follows:
Proof of Lemma 1
the augmented Lagrangian takes the following form:
$$L_{aug}(f, \lambda_0, \lambda) := \int f \left[\ln\frac{f}{g} + \alpha\right] dz + \lambda_0 \left(\int f\, \mathbb{1}_{S(Z)}\,dz - 1\right) + \left\langle \lambda, \int f\, h(z)\,dz - H \right\rangle,$$
where $\lambda_0$ and $\lambda := [\lambda_1, \ldots, \lambda_{c_z}]^T$ are the (non-negative) Lagrange multipliers (LMs) corresponding to the constraints of the optimization problem. In turn, the above expression can be re-written as
$$L_{aug}(f, \lambda_0, \lambda) = \int f \left[\ln\frac{f}{g} + \alpha + \lambda_0 + \langle \lambda, h(z)\rangle\right] dz - \lambda_0 - \langle \lambda, H\rangle \tag{A.1}$$
Now, we let
$$\tilde\alpha(z) = \alpha(z) + \lambda_0 + \langle \lambda, h(z)\rangle \tag{A.2}$$
and make use of the EL stationary conditions to find the optimal solution. First, we consider the EL stationary condition with respect to the pdf $f$. These conditions can be written in terms of the quantity under the integral in (A.1), i.e. in terms of $l(f) := f\left[\ln\frac{f}{g} + \tilde\alpha\right] = f\left[\ln(f) - \ln(g) + \tilde\alpha\right]$. In particular, by imposing the stationary condition we obtain:
$$\frac{\partial l(f)}{\partial f} = \ln\frac{f}{g} + \tilde\alpha + 1 = 0. \tag{A.3}$$
Therefore, it follows that all the optimal solution candidates must be of the form:
$$f(z) = g(z)\, e^{-\{1 + \tilde\alpha(z)\}}, \tag{A.4}$$
which, by definition of $\tilde\alpha$, becomes
$$f(z) = g(z)\, \frac{e^{-\{\alpha(z) + \langle \lambda, h(z)\rangle\}}}{e^{1+\lambda_0}}. \tag{A.5}$$
Note that the above candidates are a function of the LMs. These can be computed by applying the EL stationary condition with respect to $\lambda_0, \lambda_1, \ldots, \lambda_{c_z}$. This yields the following set of additional conditions:
$$\frac{\partial L_{aug}(f, \lambda_0, \lambda)}{\partial \lambda_i} = 0, \quad i = 0, \ldots, c_z. \tag{A.6}$$
That is, (A.6) implies that the LMs associated to the constraints must satisfy:
$$\int g(z)\, \frac{e^{-\{\alpha(z) + \langle \lambda, h(z)\rangle\}}}{e^{1+\lambda_0}}\, \tilde h_i(z)\,dz = \tilde H_i, \quad i = 0, \ldots, c_z, \tag{A.7}$$
which was obtained by replacing the expression of the optimal solution candidate (A.5) in (A.6).
Now, the above set of equations can be solved via Lemma 2 and here we let $\lambda_0^*$, $\lambda^*$ be the resulting values of the LMs. Substituting the optimal LMs into the expression of the optimal solution candidates yields:
$$f^* := f^*(z) = g(z)\, \frac{e^{-\{\alpha(z) + \langle \lambda^*, h(z)\rangle\}}}{e^{1+\lambda_0^*}}.$$
The proof is then concluded by noticing that $f^*(z)$ is indeed the optimal solution since the Lagrangian is convex in $f$. To show convexity, it suffices to consider the second derivative of $l(f)$ and to observe that this is always positive (indeed $\frac{\partial^2 l}{\partial f^2} = \frac{\partial(\ln(f) + \tilde\alpha + 1)}{\partial f} = \frac{1}{f} > 0$).
Finally, the second part of the result follows from evaluating $L(f^*)$. Indeed:
$$L(f^*) = \int f^* \left[\ln\left(\frac{g\, e^{-\{\alpha + \langle \lambda^*, h\rangle\}}}{e^{1+\lambda_0^*}\, g}\right) + \alpha\right] dz =$$
is strictly positive definite in $\tilde\Theta$. Indeed, computing the Hessian yields
$$\nabla^2 J(\tilde\theta) = \int_{\mathcal{Z}} \left[\tilde h(z) \otimes \tilde h(z)\right] \hat f_1(z)\, e^{-\langle \tilde\theta, \tilde h(z)\rangle}\,dz, \tag{A.10}$$
where $\otimes$ denotes the external product between tensors. Now, since the equations in (13) are algebraically independent, we have (see Definition 2):
$$\exists S \subset \mathcal{Z} : \forall \tilde v \in \mathbb{R}^{c_z} - \{0\}, \quad \left\langle \left[\nabla^2 J(\tilde\theta)\right] \tilde v, \tilde v \right\rangle = \int_S \langle \tilde v, \tilde h(z)\rangle^2\, \hat f_1(z)\, e^{-\langle \tilde\theta, \tilde h(z)\rangle}\,dz > 0. \tag{A.11}$$
This implies that $\tilde\theta^*$ is the unique minimizer of the optimization problem, thus concluding the proof. □
REFERENCES
Argall, B.D., Chernova, S., Veloso, M., and Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483. doi:https://doi.org/10.1016/j.robot.2008.10.024.
Bryson, A.E. (1996). Optimal control-1950 to 1985. IEEE Control Systems Magazine, 16(3), 26–33.
Englert, P., Vien, N.A., and Toussaint, M. (2017). Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. The International Journal of Robotics Research, 36(13-14), 1474–1488. doi:10.1177/0278364917745980.
Griggs, W., Ordóñez-Hurtado, R., Russo, G., and Shorten, R. (2019). A vehicle-in-the-loop emulation platform for demonstrating intelligent transportation systems. In Control Strategies for Advanced Driver Assistance Systems and Autonomous Driving Functions, 133–154. Springer.