
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Compressed Conditional Mean Embeddings for Model-Based Reinforcement Learning

Guy Lever (University College London, London, UK; g.lever@cs.ucl.ac.uk)
John Shawe-Taylor (University College London, London, UK; j.shawe-taylor@cs.ucl.ac.uk)
Ronnie Stafford (University College London, London, UK; r.stafford.12@ucl.ac.uk)
Csaba Szepesvári (University of Alberta, Edmonton, Canada; szepesva@cs.ualberta.ca)

Abstract

We present a model-based approach to solving Markov decision processes (MDPs) in which the system dynamics are learned using conditional mean embeddings (CMEs). This class of methods comes with strong performance guarantees, and enables planning to be performed in an induced finite (pseudo-)MDP which approximates the MDP but can be solved exactly using dynamic programming. Two drawbacks of existing methods exist: firstly, the size of the induced finite (pseudo-)MDP scales quadratically with the amount of data used to learn the model, costing much memory and time when planning with the learned model; secondly, learning the CME itself using powerful kernel least-squares is costly, creating a second computational bottleneck. We present an algorithm which maintains a rich kernelized CME model class but solves both problems: firstly, we demonstrate that the loss function for the CME model suggests a principled approach to compressing the induced (pseudo-)MDP, leading to faster planning while maintaining guarantees; secondly, we propose to learn the CME model itself using fast sparse-greedy kernel regression well suited to the RL context. We demonstrate superior performance to existing methods in this class of model-based approaches on a range of MDPs.

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

Several methods have been proposed for model-based reinforcement learning (RL) which, due to the form of the model, induce a finite (pseudo-)MDP¹ which approximates the MDP, such that solving the induced (pseudo-)MDP delivers a good policy for the true MDP (Ormoneit and Sen 2002; Grünewälder et al. 2012b; Yao et al. 2014). In these methods the learned "model" can be viewed as a "conditional mean embedding" (CME) of the MDP transition dynamics. We review this family of methods in Section 3.

¹Pseudo-MDPs relax the potential transition functions from the set of probability measures to signed measures. See Section 3.1.

In general the approach has the following potentially useful properties: a) The induced finite approximate Bellman optimality equation can be solved exactly in finite time, avoiding the instability associated with approximate dynamic programming (e.g. Bertsekas (2011)). As a result, planning is guaranteed to succeed. b) Guarantees on the value of the learned policy exist: e.g. consistency in the large-data limit for the algorithm of Ormoneit and Sen (2002) under mild smoothness assumptions, or bounds on the value of the learned policy in terms of the error of the CME (Grünewälder et al. 2012b; Yao et al. 2014). c) By reducing model learning to a supervised learning problem, the problem of defining a good low-dimensional representation architecture for value functions a priori is to some extent avoided. d) The approach can be data-efficient, since state-of-the-art non-parametric approaches can be used to approximate the CME from few data points.

One drawback is the fact that the size of the induced finite (pseudo-)MDP scales quadratically with the amount of data used to learn the model, significantly slowing down the planning step and straining the algorithm's memory requirements as more data is collected. Further, Grünewälder et al. (2012b) employ non-parametric methods to learn the CME which are data-efficient but computationally expensive. It is therefore of interest to investigate methods for learning CMEs in RL which retain these benefits but enable efficient planning, and which can be learned efficiently even for problems that require huge amounts of data to learn a good model. We present a method of learning a compressed CME which induces a compressed finite pseudo-MDP for efficient planning, decoupling the amount of data collected from the size of the induced pseudo-MDP. The compression guarantees the existence of compressed CMEs maintaining good performance bounds. We optimize the CME non-parametrically but efficiently, using kernel matching pursuit.

2 Preliminaries

2.1 Reinforcement Learning Background

We recall basic concepts associated with reinforcement learning (RL). In RL an agent acts in an environment by sequentially choosing actions over a sequence of time steps, in order to maximize a cumulative reward. We model this as a Markov decision process (MDP), which is defined using a state space S, an action space A, an initial state distribution P1 over S, an (immediate) reward function r : S × A → [0, 1], and a stationary Markov transition kernel (P(s'|s, a))_{(s,a,s') ∈ S×A×S}. These, together with a (stationary, stochastic, Markov) policy π : S → P(A), a map from the state space to the set of distributions over

the action space, give rise to a controlled Markov chain ξ = (S_1, A_1, S_2, A_2, ...) where S_1 ∼ P1, A_t ∼ π(S_t) and S_{t+1} ∼ P(·|S_t, A_t), i.e., P determines the stochastic dynamics of the controlled process. An agent that interacts with an MDP observes (some function of) the states and selects actions sequentially with the goal of accumulating reward comparable to what could be obtained by a policy π* which maximizes the expected return (expected total cumulative discounted reward), J(π) := E[Σ_{t=1}^∞ γ^{t−1} r(S_t, A_t); π], where E[·; π] denotes the expectation with respect to P1, P and π. We recall the value function V^π(s) := E[Σ_{t=1}^∞ γ^{t−1} r(S_t, A_t) | S_1 = s; π] and the action-value function Q^π(s, a) := r(s, a) + γ E_{S'∼P(·|s,a)}[V^π(S')] (generally, we will denote the successor state of s by s'). The optimal value function is defined by V*(s) = sup_{π∈Π} V^π(s). For a given action-value function Q : S × A → R we define the (deterministic) greedy policy w.r.t. Q by π(s) := argmax_{a∈A} Q(s, a) and denote π = greedy(Q) (ties broken arbitrarily). Obtaining V^π for a given π is known as value estimation. Any V^π satisfies the Bellman equation,

    V^π(s) = E_{A∼π(s)}[ r(s, A) + γ E_{S'∼P(·|s,A)}[V^π(S')] ],   (1)

and the map T^π : V → V, over the set V of real-valued functions on S, defined by (T^π V)(s) := E_{A∼π(s)}[ r(s, A) + γ E_{S'∼P(·|s,A)}[V(S')] ], is known as the Bellman operator for π. For finite state spaces an optimal policy can be obtained using dynamic programming methods such as value iteration (Bellman 1957) and policy iteration (Howard 1960). In policy iteration, V^π is obtained by solving (1) for a given deterministic policy π : S → A, followed by taking the greedy policy with respect to V^π and iterating. In MDPs with large or continuous state spaces, value functions are typically represented in some approximation architecture, e.g., as a linear function in some feature space, V^π(s) ≈ ⟨v_π, φ(s)⟩_F =: V̂^π(s), where φ : S → F is a feature map and F a Hilbert space. Choosing a low-dimensional feature map φ(·) a priori can be a problem since it is difficult to balance powerful representation ability and parsimony. Further, in the approximate case value estimation entails solving

    ⟨v_π, φ(s)⟩_F ≈ r(s, π(s)) + γ E_{S'∼P(·|s,π(s))}[⟨v_π, φ(S')⟩_F],

which must be solved approximately (using, for example, LSTD (Bradtke and Barto 1996) or minimizing the Bellman residual (Baird 1995)), since in general no solution in v_π can be found with equality for a given feature map φ(·), which can lead to instabilities (Bertsekas 2012).

Model-based reinforcement learning is an approach to RL in which data is used to estimate the dynamics and/or reward function, followed by solving the resulting estimated MDP. This approach can be data-efficient since the planning stage can be performed offline and does not require interaction with the system. In this work we will present a model-based policy iteration algorithm with a particular form of the transition model. We suppose that the mean reward function r is known. Trivially, any estimate for the mean reward function can be substituted into our algorithm.

Related Work   The approaches most similar to ours are by Ormoneit and Sen (2002), Grünewälder et al. (2012b) and Yao et al. (2014), and we defer a detailed discussion to Section 3.2. Related dynamics models which estimate expected successor feature maps using regression are common in approximate dynamic programming (Parr et al. 2008; Sutton et al. 2008). The key difference is that we will use the model to solve an MDP by deriving an induced finite pseudo-MDP and solving it exactly. Further, we separate the dynamics learning from any particular policy. A similar idea was explored recently by van Hoof, Peters, and Neumann (2015) in the context of a direct policy search algorithm.

3 Conditional Mean Embeddings for RL

The representation of the model that we study in this work is motivated by dynamic programming algorithms. In policy iteration the model is required to solve the Bellman expectation equation (1), and so it is sufficient to compute the conditional expectation of value functions, E_{S'∼P(·|s,a)}[V(S')]. Recalling the discussion in Section 2.1, when V(s) = ⟨v, φ(s)⟩_F we have that E_{S'∼P(·|s,a)}[V(S')] = ⟨E_{S'∼P(·|s,a)}[φ(S')], v⟩_F, and hence it suffices to learn the function μ : S × A → F such that

    μ(s, a) = E_{S'∼P(·|s,a)}[φ(S')].   (2)

In this work our "transition model" is an estimate of μ(·), and for any estimate μ̂(·) we write

    Ê_μ̂(V, (s, a)) = ⟨μ̂(s, a), v⟩_F ≈ E_{S'∼P(·|s,a)}[V(S')],   (3)

where we used the (so far implicit) convention that V(s) = ⟨v, φ(s)⟩_F. Note that Ê_μ̂ may not correspond to any conditional probability measure on S. The loss

    loss(μ̂) := E_{(s,a)∼D, s'∼P(·|s,a)}[‖μ̂(s, a) − φ(s')‖²_F],

where D is some data distribution over S × A, serves as a natural objective for the problem of learning μ(·) (Grünewälder et al. 2012a). Given data² D = {(s_i, a_i, s'_i)}_{i=1}^n, where s'_i ∼ P(·|s_i, a_i), the empirical loss is therefore

    losŝ(μ̂) := (1/n) Σ_{i=1}^n ‖μ̂(s_i, a_i) − φ(s'_i)‖²_F.   (4)

The function μ(·) defined in (2) is known as the conditional mean embedding (CME) of P in F; methods to learn CMEs have been provided by Song, Fukumizu, and Gretton (2013) and Grünewälder et al. (2012a).

²We will use the (sloppy) notation s'_i ∈ D to indicate the successor states in D, in the sense that s'_i ∈ D ⇔ ∃(s_i, a_i, s'_i) ∈ D, and similarly for the actions a_i and predecessor states s_i.

3.1 Key Properties of the CME Approach

The Induced Finite (Pseudo-)MDP   As we will see in Section 3.2, several approaches to learning the CME (2) from data result in a solution of the form μ̂(s, a) = Σ_{i=1}^n α_i(s, a) φ(s'_i). This means that, from (3),

    Ê_μ̂(V, (s, a)) = ⟨μ̂(s, a), v⟩_F = Σ_{i=1}^n α_i(s, a) V(s'_i),   (5)

i.e., our estimates of the conditional expectation can be computed by measuring V only on the sample points {s'_i}_{i=1}^n. In fact α_i(s, a) can be viewed as the "probability"³ of transitioning to state s'_i from state-action pair (s, a) under the learned model (and the probability of transitioning to states beyond the sample is zero). Collect (α_i(s, a))_{i=1}^n into the vector α(s, a) ∈ R^n. If, further,

    ‖α(s'_i, a)‖_1 ≤ 1   (6)

for all successor states s'_i ∈ D and a ∈ A, then the approximation T̂_μ̂^π : V → V of the Bellman operator T^π defined by

    (T̂_μ̂^π V)(s) := E_{A∼π(s)}[ r(s, A) + γ Ê_μ̂(V, (s, A)) ]
                  = E_{A∼π(s)}[ r(s, A) + γ Σ_{i=1}^n α_i(s, A) V(s'_i) ]   (7)

is a contraction on the sample, and so iterated "backups" V ← T̂_μ̂^π V will converge to a fixed point V̂_μ̂^π such that V̂_μ̂^π = T̂_μ̂^π V̂_μ̂^π. The solution to the fixed point equation

    V = T̂_μ̂^π V   (8)

can be determined exactly by solving a linear system in n variables by matrix inversion (or approximately by iterating backups): i.e., to perform this version of policy iteration the value function only needs to be maintained at the successor sample points s'_i ∈ D, even if S is continuous. This should be contrasted with the typical situation in approximate dynamic programming, in which the Bellman equation cannot be solved exactly in general, and backups followed by projection can diverge. For the same reason the greedy policy can be executed anywhere on S using knowledge of V̂_μ̂^π(·) only at the sample points s'_i ∈ D, since we can construct the corresponding action-value function at any state-action pair via

    Q̂_μ̂^π(s, a) = r(s, a) + γ Σ_{i=1}^n α_i(s, a) V̂_μ̂^π(s'_i).   (9)

³The α_i(s, a) are not necessarily positive or normalized.

We refer to a CME such that Σ_j |α_j(s, a)| ≤ 1 for all (s, a) ∈ S × A as a proper CME. When α_j(s'_i, a) ≥ 0 and Σ_k α_k(s'_i, a) = 1 for all i, j, T̂_μ̂^π is the Bellman operator for an MDP defined on the sample {s'_i}_{i=1}^n, whose dynamics are defined by P(s'_j | s'_i, a) = α_j(s'_i, a). Otherwise α_j(s'_i, a) does not define an MDP, hence we refer to the induced pseudo-MDP in general. A theory of pseudo-MDPs, as a method of MDP abstraction, has been developed in Yao et al. (2014). A few important results include the following: the condition that makes CMEs proper allows one to define value functions for policies in the usual fashion (with expectation w.r.t. signed measures); the value functions satisfy the analogous Bellman equations; if one defines the optimal value function as the fixed point of the analogous Bellman optimality equation, then this optimal value function can be found by value iteration; and one can also show that value iteration can be replaced by either linear programming or policy iteration under, e.g., the additional condition that α_j is nonnegative-valued.

Policy Iteration using CMEs   These observations motivate a generic model-based policy iteration algorithm for solving MDPs using CMEs: collect data to learn an approximate proper CME (2) of the transition dynamics; solve the approximate Bellman equation (7) on the samples exactly; construct the approximation (9) and take the greedy policy greedy(Q̂_μ̂^π); iterate if necessary. For clarity this is outlined in pseudocode in Algorithm 1.

Algorithm 1 Generic model-based policy iteration with CMEs
Input: MDP M = (S, A, P1, P, R) to interact with; known mean reward function r, known start-state distribution P1.
Initialize: Q_0 = r, D_0 = ∅, P̂_0, π_1 = greedy(Q_0).
Parameters: n_new, J.
for k = 1, 2, ... do
    Data acquisition: Collect n_new data points D_new (e.g. from the policy π_k or an exploratory policy). Aggregate data: D_k = D_{k−1} ∪ D_new.
    Update dynamics model: Learn the CME μ̂(·) using the data D_k.
    for j = 1, 2, ..., J do
        Policy evaluation: Form an estimate V̂_j of V^{π_k} by solving the approximate Bellman equation V = T̂_μ̂^π V (8). Define Q̂_j(s, a) = r(s, a) + γ Σ_{i=1}^n α_i(s, a) V̂_j(s'_i).
        Policy improvement: π_k ← greedy(Q̂_j).
    end for
    π_{k+1} ← π_k.
end for

Modeling Value Functions   Although we assume value functions can be well approximated in F, we never need to model them in F, i.e., find a weight vector v such that ⟨v, φ(s)⟩_F ≈ V(s). We will see that it is not even necessary to construct φ(s) for any s; knowledge of the kernel function L(s, s') = ⟨φ(s), φ(s')⟩_F is sufficient. Thus the approximation space F for V can be a rich function class.

Performance Guarantees   It is possible to derive bounds on the value of the learned policy in value iteration in terms of the quality of the learned model and how well V* can be modeled in F. For example, we have the following theorem:

Theorem 3.1. (Grünewälder et al. (2012b), Theorem 3.2) Let V̂_k be the kth function obtained after k iterations of value iteration using a proper CME μ̂, let ε = sup_{s,a} ‖μ(s, a) − μ̂(s, a)‖_F and α = ‖V̂_1 − V̂_0‖_∞, and let π_k = greedy(V̂_k) be the resulting greedy policy. Then, for any Ṽ* ∈ F,

    ‖V^{π_k} − V*‖_∞ ≤ (2γ / (1 − γ²)) ( γ^k α + 2‖V* − Ṽ*‖_∞ + ε‖Ṽ*‖_F ).

Theorem 3.1 applies to any proper estimate of the CME, independently of how it is learned.
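To make the planning step of Algorithm 1 concrete, the following minimal numpy sketch solves the evaluation equation (8) exactly on the sample points by a single linear solve and performs the greedy improvement (9). This is our own illustration, not code from the paper: the array layout (per-action weight matrices alpha[a, i, j] ≈ α_j(s'_i, a), precomputed mean rewards, and a finite action set) is an assumption made for the example.

```python
import numpy as np

def evaluate_policy_on_samples(alpha, r, policy, gamma):
    """Solve the approximate Bellman equation (8) exactly on the sample points.

    alpha  : array (n_actions, n, n); alpha[a, i, j] ~ alpha_j(s'_i, a), the
             pseudo-transition weight from sample state s'_i to s'_j under action a.
    r      : array (n_actions, n); r[a, i] = r(s'_i, a).
    policy : array (n,) of action indices (a deterministic policy on the samples).
    gamma  : discount factor in (0, 1).
    Returns V, array (n,), the fixed point of V = T^pi V on the samples.
    """
    n = r.shape[1]
    A_pi = alpha[policy, np.arange(n), :]   # (n, n): row i is alpha(s'_i, policy(s'_i))
    r_pi = r[policy, np.arange(n)]          # (n,)
    # Properness (sum_j |alpha_j(s'_i, a)| <= 1) makes I - gamma * A_pi invertible.
    return np.linalg.solve(np.eye(n) - gamma * A_pi, r_pi)

def greedy_improvement(alpha, r, V, gamma):
    """One improvement step via (9): Q(s'_i, a) = r(s'_i, a) + gamma * sum_j alpha_j(s'_i, a) V_j."""
    Q = r + gamma * alpha.dot(V)            # (n_actions, n)
    return Q.argmax(axis=0)                 # greedy action index at each sample point
```

Alternating these two calls for J steps reproduces the inner loop of Algorithm 1 on the sample points; values at arbitrary states never need to be stored.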

3.2 Review of Current Approaches

We now review the current approaches to learning the CME in model-based reinforcement learning algorithms which result in solving finite MDPs.

Kernel-Based Reinforcement Learning   The idea of using an induced finite MDP from a model estimate was, to our knowledge, introduced in the Kernel-Based Reinforcement Learning (KBRL) algorithm of Ormoneit and Sen (2002), who derive an approximate finite Bellman equation induced by a model and solve it.⁴ If we approximate the CME (2) with μ̂_KS(s, a) = Σ_{i=1}^n α_i^{KS}(s, a) φ(s'_i) using kernel smoothing, the induced finite transition dynamics are

    α_i^{KS}(s, a) := K((s, a), (s_i, a_i)) / Σ_{j=1}^n K((s, a), (s_j, a_j)),   (10)

where K is a smoothing kernel, and condition (6) holds. The KBRL method uses a similar model, though there no smoothing is performed over actions (in fact the method allows for a different kernel per action). The KBRL method is shown to be consistent: under smoothness conditions, the learned value function will converge to the optimal value function with infinite i.i.d. data.

⁴The same idea was used by Rust (1996) and Szepesvári (2001), though in the context of planning, not for estimating a model.

One drawback is that the size of the state space of the induced MDP equals |D| = n, and planning via solving the approximate Bellman equation (7) scales poorly with the data. Versions of KBRL with efficient planning have been proposed using stochastic factorization of the transition matrix of the induced finite MDP (da Motta Salles Barreto, Precup, and Pineau 2011) or a cover tree quantization of the state space (Kveton and Theocharous 2012).

Modeling Transition Dynamics with Kernel Least-Squares CMEs   Grünewälder et al. (2012b) suggest a regularized least-squares approach to learning the CME (2), applying methods from the literature on learning CMEs (Song, Fukumizu, and Gretton 2013; Grünewälder et al. 2012a). In general, a normed hypothesis class H ⊂ F^{S×A} is chosen and the empirical loss (4) plus a regularization term is minimized,

    μ̂_RLS := argmin_{μ∈H} (1/n) Σ_{i=1}^n ‖μ(s_i, a_i) − φ(s'_i)‖²_F + λ‖μ‖²_H.   (11)

When H = H_K is an RKHS (of F-valued functions) with a kernel K on S × A, the solution to (11) is μ̂_KRLS(s, a) = Σ_{j=1}^n α_j^{KRLS}(s, a) φ(s'_j), where

    α_j^{KRLS}(s, a) = Σ_{i=1}^n K((s, a), (s_i, a_i)) W_{ij},   (12)

where W = (K + λI)^{−1} and K_{ij} = K((s_i, a_i), (s_j, a_j)) is the kernel matrix on the data. The approach to guaranteeing the constraint (6) is simply to threshold and normalize the α(s'_i, a) for all s'_i ∈ D and a ∈ A; in practice, however, we find it is better to project the weights using fast Euclidean projection onto the L1-ball (Duchi et al. 2008), followed by normalization. Unlike KBRL (Ormoneit and Sen 2002), this approach minimizes a loss over a rich hypothesis class.

Factored Linear Action Models   The approach of Yao et al. (2014) is to replace F by a Banach space of real-valued functions over S so that F's topological dual F* contains the point evaluation functionals δ_s(·) for any s ∈ S, redefining ⟨·, ·⟩ to be the dual-pairing bracket [·, ·] : F × F* → R with [v, λ] = λ(v), and then defining φ(s) = δ_s(·). In addition, the constraint (6) is imposed in the optimization, rather than by projection after learning. In our preliminary experiments this performed worse than the projection approach, despite the fact that Yao et al. (2014) presented experimental results to the contrary (on different problems).

4 Compressed CMEs for RL

As discussed in Section 3.2, one drawback of the CME approach is that the size of the induced finite (pseudo-)MDP scales with the amount of data observed, so that planning scales poorly. In this section we will present a version of the algorithm of Grünewälder et al. (2012b) which is more efficient to learn, and which induces a small compressed MDP on a subset of the data points to ensure efficient planning. We will learn a compression set C = {c_1, ..., c_m} ⊆ S with m ≪ n = |D|, and a compressed CME μ̂_CMP(·) of the form

    μ̂_CMP(s, a) = Σ_{j=1}^m α_j^{CMP}(s, a) φ(c_j),   (13)

so that the induced finite pseudo-MDP is defined on C. Details of learning such a CME are presented in Section 4.2. First we detail the choice of the compression set C.

4.1 Learning the Compression Set

We now show that learning CME transition models suggests a principled means of learning a compression set: our compression set will guarantee the existence of a compressed CME maintaining good guarantees via Theorem 3.1. We first introduce a useful property of compression sets:

Definition   Given any set P = {s_j}_{j=1}^n and any small error δ, a compression set C = {c_j}_{j=1}^m is a δ-lossy compression of P if, given any proper CME μ̂(·) of the form μ̂(s, a) = Σ_{j=1}^n α_j(s, a) φ(s_j), there exists a proper CME μ̂_CMP(s, a) = Σ_{j=1}^m α_j^{CMP}(s, a) φ(c_j) which approximates μ̂(s, a) in F in the sense that sup_{(s,a)∈S×A} ‖μ̂(s, a) − μ̂_CMP(s, a)‖_F ≤ δ.

Algorithm 2 gives a method to maintain a δ-lossy compression set. The required minimization problem min_{b∈R^m, ‖b‖_1≤1} ‖Σ_{i=1}^m b_i φ(c_i) − φ(s_j)‖_F can be solved using the Lasso (see Appendix B for details). We remark that the feature vectors φ(s) never need to be explicitly computed.

Theorem 4.1. Algorithm 2 with C initialized to ∅ returns a δ-lossy compression set of P.

Proof. See Appendix A. The result follows immediately from Lemma A.1, which states that if max_{1≤j≤n} min_{b∈R^m, ‖b‖_1≤1} ‖Σ_{i=1}^m b_i φ(c_i) − φ(s_j)‖_F ≤ δ, then {c_1, ..., c_m} is a δ-lossy compression set of P.
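The residual test that drives Algorithm 2 (shown next) only requires the Gram matrix of the state kernel, since ‖Σ_i b_i φ(c_i) − φ(s_j)‖²_F = bᵀL_CC b − 2 bᵀℓ_j + L(s_j, s_j), where (L_CC)_{ik} = L(c_i, c_k) and (ℓ_j)_i = L(c_i, s_j). The sketch below is our own illustration of this idea: it solves the L1-constrained inner minimization by projected gradient descent with the L1-ball projection of Duchi et al. (2008), as a simple stand-in for the Lasso reduction the paper uses (Appendix B); all function names and the iteration budget are assumptions.

```python
import numpy as np

def project_l1_ball(v, z=1.0):
    """Euclidean projection of v onto {x : ||x||_1 <= z} (Duchi et al. 2008)."""
    if np.abs(v).sum() <= z:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (cssv - z))[0][-1]
    theta = (cssv[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def rkhs_residual_sq(b, L_CC, l_j, L_jj):
    """||sum_i b_i phi(c_i) - phi(s_j)||_F^2 via the kernel trick; phi is never constructed."""
    return float(b @ L_CC @ b - 2.0 * b @ l_j + L_jj)

def min_residual(L_CC, l_j, L_jj, n_iters=500):
    """Approximately solve min_{||b||_1 <= 1} ||sum_i b_i phi(c_i) - phi(s_j)||_F."""
    b = np.zeros(L_CC.shape[0])
    step = 1.0 / (2.0 * np.linalg.norm(L_CC, 2) + 1e-12)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iters):
        grad = 2.0 * (L_CC @ b - l_j)
        b = project_l1_ball(b - step * grad, 1.0)
    return np.sqrt(max(rkhs_residual_sq(b, L_CC, l_j, L_jj), 0.0)), b

def augment_compression_set(C_idx, candidate_idx, L, delta):
    """Sketch of the loop in Algorithm 2: add a candidate state whenever its best
    L1-constrained approximation over the current compression set has residual > delta.
    L is the Gram matrix of the state kernel over all states involved."""
    C_idx = list(C_idx)
    for j in candidate_idx:
        if not C_idx:                       # an empty set cannot approximate anything
            C_idx.append(j)
            continue
        L_CC = L[np.ix_(C_idx, C_idx)]
        res, _ = min_residual(L_CC, L[C_idx, j], L[j, j])
        if res > delta:
            C_idx.append(j)
    return C_idx
```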

Algorithm 2 augmentCompressionSet(C, δ, P)
Input: Initial compression set C = {c_1, ..., c_m}, candidates P = {s_1, ..., s_n}, tolerance δ.
for j = 1, 2, ..., n do
    if min_{b∈R^m, ‖b‖_1≤1} ‖Σ_{i=1}^m b_i φ(c_i) − φ(s_j)‖_F > δ then
        Augment compression set: C ← C ∪ {s_j}, m ← m + 1
    end if
end for

A δ-lossy compression C guarantees the existence of a good compressed CME defined over C. We now further show that δ-lossy compressions imply compressed CMEs which are good for learning MDPs. We begin with a trivial corollary of Theorem 3.1:

Corollary 4.2. Suppose proper CMEs μ̄(·) and μ̂(·) are such that

    sup_{(s,a)∈S×A} ‖μ̂(s, a) − μ̄(s, a)‖_F ≤ δ.   (14)

Let V̄_k be the kth value function obtained after k iterations of value iteration using μ̄(·), let π_k = greedy_μ̄(V̄_k), and let Ṽ* ∈ F. Let B_k(Ṽ*) be the formal bound in Theorem 3.1 on ‖V^{π_k} − V*‖_∞ (i.e., B_k depends on μ̂). Then,

    ‖V^{π_k} − V*‖_∞ ≤ B_k(Ṽ*) + (2γδ / (1 − γ²)) ‖Ṽ*‖_F.   (15)

Thus, supposing that μ̂(·) is known to be good, and comparing Corollary 4.2 with Theorem 3.1, we only pay an additional term of 2γδ‖Ṽ*‖_F / (1 − γ²) for solving an MDP with the CME μ̄(·) instead of the CME μ̂(·). In particular, if we take μ̄(·) to be a compressed CME μ̂_CMP(·) whose existence is guaranteed by a δ-lossy compression, then the guarantee (15) holds for μ̂_CMP(·). This is an existence proof: in the following section we detail how to learn a compressed CME.

4.2 Learning Compressed CMEs

We now learn a proper compressed CME, denoted μ̂_PCMP(·) = Σ_{j=1}^m α_j^{PCMP}(·) φ(c_j), given a compression set C. We first seek a CME μ̂_CMP(·) of the form (13) in which the α_j^{CMP}(s, a) are of the form (12),

    α_j^{CMP}(s, a) = Σ_{i=1}^n K((s, a), (s_i, a_i)) W_{ij}^{CMP},   (16)

i.e., similar to Grünewälder et al. (2012b), but with a compressed representation. W^{CMP} ∈ R^{n×m} is to be optimized to minimize (4). The exact RKHS-regularized least-squares solution, with one dimension of W^{CMP} scaling with n, the number of samples, is prohibitively expensive to compute, and so we optimize W^{CMP} using kernel matching pursuit (Vincent and Bengio 2002; Mallat and Zhang 1993), which is an incremental sparse-greedy method of optimizing kernel least squares. In our application, matching pursuit maintains a sparse basis of kernel functions B = {K(·, (ŝ_ℓ, â_ℓ))}_{ℓ=1}^d and a d × m weight matrix W^{MP} = (w_1, ..., w_d)ᵀ, where for each ℓ, (ŝ_ℓ, â_ℓ) ∈ D, and where d ≪ n, such that (16) is represented only using the basis B: α_j^{MP}(s, a) = Σ_{ℓ=1}^d K((s, a), (ŝ_ℓ, â_ℓ)) W_{ℓj}^{MP}. Matching pursuit incrementally adds bases K(·, (ŝ_ℓ, â_ℓ)) and weights w_ℓ ∈ R^m "greedily", i.e., to maximally reduce the loss (4),

    (1/n) Σ_{i=1}^n ‖ Σ_{ℓ=1}^d Σ_{j=1}^m K((s_i, a_i), (ŝ_ℓ, â_ℓ)) W_{ℓj}^{MP} φ(c_j) − φ(s'_i) ‖²_F.

Given the next basis element, each new weight w_ℓ can be found in closed form, and so each addition of a (basis, weight) pair requires a sweep over all candidate basis points. There are some non-standard aspects of the application of matching pursuit to optimizing W^{CMP}, and details are included in the supplementary material (Lever et al. 2016), where we demonstrate, for instance, that F can be an infinite-dimensional feature space of an RKHS and yet we can still perform matching pursuit efficiently. The greedy incremental approach is suited to the RL setting since we interact with the system to gather new data, which is then explored by matching pursuit to discover new basis functions, but we do not sweep over the full data set at each iteration (see Algorithm 3 for details).

Once matching pursuit has found the basis B we "backfit" the weights by performing RKHS-regularized least squares in the primal,

    W^{CMP} = argmin_W (1/n) Σ_{i=1}^n ε_i(W) + λ tr(W L^{CC} Wᵀ Kᵀ),   (17)

where, with slight abuse of notation, we redefine W^{CMP} to have dimension d × m, and ε_i(W) := ‖ Σ_{ℓ=1}^d Σ_{j=1}^m K((s_i, a_i), (ŝ_ℓ, â_ℓ)) W_{ℓj} φ(c_j) − φ(s'_i) ‖²_F. The term tr(W L^{CC} Wᵀ Kᵀ) in (17) is the RKHS-norm regularizer ‖μ̂_CMP(·)‖²_K. The solution to (17) is

    W^{CMP} = (Ψᵀ Ψ + λnK)^{−1} Ψᵀ L^{DC} (L^{CC})^{−1},   (18)

where K_{jk} = K((ŝ_j, â_j), (ŝ_k, â_k)), Ψ_{ij} = K((s_i, a_i), (ŝ_j, â_j)), L^{DC}_{jk} = ⟨φ(s'_j), φ(c_k)⟩_F and L^{CC}_{jk} = ⟨φ(c_j), φ(c_k)⟩_F, and λ can be cross-validated. Computing (18) is efficient since the inverse is of a d × d matrix. We add a small ridge term to the kernel matrix inverses to prevent numerical instability.

To ensure a contraction we obtain the proper CME μ̂_PCMP(·) by projecting the learned weights onto the L1-ball: denoting proj(f) := argmin_{β∈R^m, ‖β‖_1≤1} ‖f − Σ_{j=1}^m β_j φ(c_j)‖_F, we set

    α^⊥(s, a) := proj(μ̂_CMP(s, a)),   (19)

which can be done using a Lasso (see Appendix B), and finally we normalize,

    α^{PCMP}(s, a) := α^⊥(s, a) / ‖α^⊥(s, a)‖_1.   (20)

4.3 Policy Iteration with Compressed CMEs

Our algorithm combines the compression set learning of Section 4.1, the CME learning of Section 4.2, and the policy iteration using CMEs of Section 3.1 (specifically with the CME μ̂_PCMP(·)). The algorithm is detailed in Algorithm 3.
Algorithm 3 Policy iteration with compressed kernel CMEs
Input: MDP M = (S, A, P1, P, R) to interact with; known reward function R and P1.
Parameters: Kernel K on S × A; feature map φ(·) on S; compression tolerance δ; policy improvements J; n_new; maximum basis dimension d.
Initialize: action-value function, e.g. Q_0 = r; D_0 = ∅; π_1 = greedy(Q_0); ψ^0(·); n_0 = 0; B_0 = ∅.
for k = 1, 2, ... do
    Data acquisition: Collect n_new data points D_new = {(s_i, a_i, s'_i)}_{i=n_{k−1}+1}^{n_k} from the behavior policy distribution ρ^{ν_k}; n_k ← n_{k−1} + n_new. Aggregate data: D_k = D_{k−1} ∪ D_new = {(s_i, a_i, s'_i)}_{i=1}^{n_k}.
    Augment compression set with new data: C_k = augmentCompressionSet(C_{k−1}, {s'_i}_{i=n_{k−1}+1}^{n_k}, δ).
    Augment candidate feature dictionary with new data: G_k = B_{k−1} ∪ {K(·, (s_i, a_i))}_{i=n_{k−1}+1}^{n_k}.
    Sparse basis selection: Learn a sparse basis B_k = {K(·, (ŝ_ℓ, â_ℓ))}_{ℓ=1}^{d_k}, d_k ≤ d, from the candidates G_k using matching pursuit; set ψ^k_ℓ(·) = K((ŝ_ℓ, â_ℓ), ·) and Ψ_k = (ψ^k(s_1, a_1), ..., ψ^k(s_{n_k}, a_{n_k}))ᵀ.
    Backfit CME weights: W_k^{CMP} = (Ψ_kᵀ Ψ_k + λnK)^{−1} Ψ_kᵀ L^{DC} (L^{CC})^{−1}, as in (18).
    for j = 1, 2, ..., J do
        Policy evaluation: Using the finite pseudo-MDP dynamics α^{PCMP}(c_j, a) (20) for a ∈ A, c_j ∈ C, solve the approximate Bellman equation (7) to obtain an estimate V̂_j of V^{π_k} at the compression points c_j ∈ C. Set Q̂_j(s, a) = r(s, a) + γ Σ_{j=1}^{|C_k|} α_j^{PCMP}(s, a) V̂_j(c_j).
        Policy improvement: π_k ← greedy(Q̂_j).
    end for
    π_{k+1} ← π_k.
end for

5 Experiments

We performed 3 online model-based policy iteration experiments. Experiments in the batch setting, with i.i.d. data, can be found in Lever et al. (2016). For reproducibility, more detail is given in Appendix C. We compared the compressed CME to the kernel smoothing model (10) (as in the KBRL algorithm (Ormoneit and Sen 2002), except that we additionally smooth over actions, which performed slightly better here, and use non-i.i.d. data generated by interactions) and to the kernel least-squares model of Grünewälder et al. (2012b). We outperformed the full kernel smoothing method in terms of policy quality, so we did not compare to the more efficient versions of KBRL, which do not report better performance. We report mean results over 10 experiments, and standard error. Experiments are run on a cluster of single-core processors. For all MDPs, at each iteration the learners were given 2 trajectories of data: one from the current greedy policy and a second from an ε-greedy (ε = 0.3) "exploratory" version (defining ρ^{ν_k} in the algorithm as a mixture of those two policies).

Benchmarks   Our first experiments are the well-known cart-pole and mountain-car benchmark MDPs. These are continuous MDPs with 2-dimensional state spaces and 3 actions. The compressed CME approach consistently optimizes the policy in the mountain-car within fewer than 5 iterations (10 data trajectories), and within fewer than 10 iterations (20 trajectories) in the cart-pole, and performs better than both competitors in terms of policy optimization. We hypothesize that the two least-squares methods outperform the kernel smoothing approach because they directly optimize a loss over a rich hypothesis class. Our compressed CME method outperforms the full kernel least-squares approach because a more principled projection of the weights (19) can be performed. The compressed CME approach obtained finite MDPs of size less than 100 in the mountain-car and about 400 in the cart-pole (from a total of 3000 data points over 15 iterations), decoupling the amount of data gathered from the size of the induced pseudo-MDP, so that planning scales essentially independently of data size, in contrast to the competitors, for which planning becomes a bottleneck. In terms of model and feature learning time, the full kernel least-squares approach is surprisingly fast to learn; this is because "full relearns", in which all parameters were cross-validated, were performed at iterations 1, 2 and 5 (hence the spikes on the graphs), while at other iterations we performed fast online updates to the matrix inverses, using the previous optimal parameters. Since model learning is not a bottleneck for kernel smoothing, we allowed it to cross-validate all parameters at every iteration; kernel smoothing could be similarly sped up by avoiding model selection, with a small degradation in policy quality. Results are shown in Figure 1 and Figure 2.

Simulated Quadrocopter Navigation   The third experiment is a simulated quadrocopter navigation task which uses a simulator (De Nardi 2013) calibrated to model the dynamics of a Pelican™ platform. This is a higher-dimensional problem, S ⊂ R^13 (but effectively reduced to 6 dimensions by the choice of state kernel), s = (x, y, z, θ, φ, ψ, ẋ, ẏ, ż, θ̇, φ̇, ψ̇, F), which consists of the platform position (x, y, z) ∈ R³, roll, pitch and yaw (θ, φ, ψ) ∈ R³, their associated time derivatives, and the thrust F applied to the rotors. The action set is a discretization of a 2-dimensional rectangle into 81 actions which represent desired velocity vectors in the (x, y) plane (the desired z velocity is fixed at zero). An internal PID controller translates these desired velocities into low-level commands issued to the rotors in an attempt to attain those velocities. This creates complex dynamics for the system. The platform is initialized at position (0, 0, −50), a target location is defined at coordinates x_targ = (5, 5), and we define r(s, a) = exp(−(1/25)‖x_targ − s_xy‖²). On this task the compressed CME method is able to learn a near-optimal policy, accelerating before hovering around the target point to collect reward. Both least-squares methods outperformed the KS method, and planning for the compressed CME approach is significantly more efficient. Compression to a finite pseudo-MDP of size 100-150 (from a total of 2400 data points over 12 iterations) was achieved. Results are shown in Figure 3.
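As an illustration of the experimental setup above, the following small sketch (our own code, not from the paper; the environment interface and helper names are assumptions) shows the quadrocopter reward r(s, a) = exp(−‖x_targ − s_xy‖²/25) with x_targ = (5, 5), and the ε-greedy (ε = 0.3) exploratory action selection used to generate the second data trajectory at each iteration.

```python
import numpy as np

X_TARG = np.array([5.0, 5.0])              # target (x, y) location from Section 5

def quad_reward(state_xy):
    """r(s, a) = exp(-||x_targ - s_xy||^2 / 25); depends only on the platform's (x, y) position."""
    d = X_TARG - np.asarray(state_xy, dtype=float)
    return np.exp(-np.dot(d, d) / 25.0)

def epsilon_greedy_action(q_values, epsilon=0.3, rng=np.random):
    """Exploratory behavior policy: greedy w.p. 1 - epsilon, uniformly random otherwise.
    q_values is the vector of action values at the current state."""
    if rng.rand() < epsilon:
        return int(rng.randint(len(q_values)))
    return int(np.argmax(q_values))
```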

Figure 1: Cart-pole reward and timing results

Figure 2: Mountain-car reward and timing results

6 Summary

We have presented a model-based policy iteration algorithm, based on learning a compressed CME of the transition dynamics and solving the induced finite pseudo-MDP exactly using dynamic programming. This results in a stable algorithm with good performance guarantees in terms of the transition model error. An analysis of the CME approach suggested a principled means of choosing a compression set for the CME, which guarantees an at most small loss in the representation power of the CME and only a small degradation in the performance bound. To learn the CME, our approach builds a state-action representation online, based on interactions with the system, using sparse-greedy feature selection from a data-defined dictionary of kernel functions, enhancing the state-action representation to maximally improve the model. We overcome the planning bottleneck of competitor methods and obtain better policies on a range of MDPs. The compression approach could be used in other related algorithms such as KBRL. Future work could include using other representations of the CME in this context, such as deep neural networks. We remark that it is possible to combine the compression approach with the constrained approach of Yao et al. (2014), which we defer to a longer version.

References

Baird, L. C. 1995. Residual algorithms: Reinforcement learning with function approximation. In ICML, 30–37.
Bellman, R. 1957. Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1st edition.
Bertsekas, D. P. 2011. Approximate policy iteration: a survey and some new methods. Journal of Control Theory and Applications 9(3):310–335.
Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 4th edition.
Bradtke, S. J., and Barto, A. G. 1996. Linear least-squares algorithms for temporal difference learning. Machine Learning 22(1-3):33–57.
da Motta Salles Barreto, A.; Precup, D.; and Pineau, J. 2011. Reinforcement learning using kernel-based stochastic factorization. In NIPS 2011, 720–728.
De Nardi, R. 2013. The QRSim quadrotor simulator. Technical Report RN/13/08, Department of Computer Science, University College London, Gower Street, London, UK.
Duchi, J. C.; Shalev-Shwartz, S.; Singer, Y.; and Chandra, T. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In ICML 2008, 272–279.
Grünewälder, S.; Lever, G.; Baldassarre, L.; Patterson, S.; Gretton, A.; and Pontil, M. 2012a. Conditional mean embeddings as regressors. In ICML 2012.
Grünewälder, S.; Lever, G.; Baldassarre, L.; Pontil, M.; and Gretton, A. 2012b. Modelling transition dynamics in MDPs with RKHS embeddings. In ICML 2012.
Howard, R. A. 1960. Dynamic Programming and Markov Processes. Cambridge, MA: MIT Press.
Kveton, B., and Theocharous, G. 2012. Kernel-based reinforcement learning on representative states. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence.
Lagoudakis, M. G., and Parr, R. 2003. Least-squares policy iteration. Journal of Machine Learning Research 4:1107–1149.
Lever, G.; Shawe-Taylor, J.; Stafford, R.; and Szepesvári, C. 2016. Compressed conditional mean embeddings for model-based reinforcement learning (supplementary material). http://www0.cs.ucl.ac.uk/staff/G.Lever/pubs/CME4RLSupp.pdf.
Mallat, S., and Zhang, Z. 1993. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing 41(12):3397–3415.
Ormoneit, D., and Sen, S. 2002. Kernel-based reinforcement learning. Machine Learning 49(2-3):161–178.
Parr, R.; Li, L.; Taylor, G.; Painter-Wakefield, C.; and Littman, M. L. 2008. An analysis of linear models, linear value-function

Figure 3: Simulated quadrocopter navigation reward and timing results

approximation, and feature selection for reinforcement learning. In ICML 2008, 752–759.
Rust, J. 1996. Using randomization to break the curse of dimensionality. Econometrica 65:487–516.
Shawe-Taylor, J., and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.
Singh, S. P., and Sutton, R. S. 1996. Reinforcement learning with replacing eligibility traces. Machine Learning 22(1-3):123–158.
Song, L.; Fukumizu, K.; and Gretton, A. 2013. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine 30(4):98–111.
Sutton, R.; Szepesvári, C.; Geramifard, A.; and Bowling, M. H. 2008. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI 2008, 528–536.
Szepesvári, C. 2001. Efficient approximate planning in continuous space Markovian decision problems. AI Communications 13(3):163–176.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B) 58:267–288.
van Hoof, H.; Peters, J.; and Neumann, G. 2015. Learning of non-parametric control policies with high-dimensional state features. In AISTATS 2015.
Vincent, P., and Bengio, Y. 2002. Kernel matching pursuit. Machine Learning 48(1-3):165–187.
Yao, H.; Szepesvári, C.; Pires, B. A.; and Zhang, X. 2014. Pseudo-MDPs and factored linear action models. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 1–9.

Appendix

A Proof of Theorem 4.1

We begin with a lemma:

Lemma A.1. Given {s_1, s_2, ..., s_n}, suppose C = {c_1, ..., c_m} ⊆ S is such that for all s_j ∈ D there exists b = b(s_j) ∈ R^m, ‖b‖_1 ≤ 1, such that

    ‖ Σ_{i=1}^m b_i(s_j) φ(c_i) − φ(s_j) ‖_F ≤ δ.   (21)

Then let μ̂(·) be defined via μ̂(s, a) = Σ_{j=1}^n α_j(s, a) φ(s_j), with ‖α(s, a)‖_1 ≤ 1 for all (s, a) ∈ S × A. If we define α_i^{CMP}(s, a) = Σ_{j=1}^n b_i(s_j) α_j(s, a), then Σ_{i=1}^m |α_i^{CMP}(s, a)| ≤ 1 and, for all (s, a),

    ‖μ̂(s, a) − μ̂_CMP(s, a)‖_F ≤ δ.

Proof.

    Σ_{i=1}^m |α_i^{CMP}(s, a)| = Σ_{i=1}^m | Σ_{j=1}^n b_i(s_j) α_j(s, a) |
                               ≤ Σ_{j=1}^n |α_j(s, a)| Σ_{i=1}^m |b_i(s_j)| ≤ 1,

and

    ‖μ̂(s, a) − μ̂_CMP(s, a)‖_F = ‖ Σ_{j=1}^n α_j(s, a) φ(s_j) − Σ_{i=1}^m α_i^{CMP}(s, a) φ(c_i) ‖_F
                               = ‖ Σ_{j=1}^n α_j(s, a) ( φ(s_j) − Σ_{i=1}^m b_i(s_j) φ(c_i) ) ‖_F
                               ≤ ( Σ_{j=1}^n |α_j(s, a)| ) max_j ‖ φ(s_j) − Σ_{i=1}^m b_i(s_j) φ(c_i) ‖_F
                               ≤ δ.

Note that by construction the compression set returned by Algorithm 2 satisfies condition (21) and is therefore a δ-lossy compression. Hence Lemma A.1 directly implies Theorem 4.1.

B The Weight Projection via Lasso

Projections of the following form,

    proj(α) = argmin_{β∈R^m, ‖β‖_1≤1} ‖ Σ_{j=1}^m α_j φ(s_j) − Σ_{j=1}^m β_j φ(s_j) ‖_F,   (22)

are used throughout the method, to maintain compression sets and to project weights to obtain proper CMEs. Problem (22) can be reduced to

    proj(α) = argmin_{β∈R^m, ‖β‖_1≤1} ‖Rβ − Rα‖²,

where RᵀR = L and L_{ij} = ⟨φ(s_i), φ(s_j)⟩_F, which can be solved with the Lasso (Tibshirani 1996). Note that the feature vectors do not need to be computed: the Gram matrix is all that is needed. R can be an incomplete (low-rank) Cholesky

decomposition (Shawe-Taylor and Cristianini 2004) of L with few rows, so that the effective number of "data points" in the Lasso is small, and the projection is therefore efficient. For example, we use an incomplete Cholesky factorization with 200 rows in our experiments.

C Further Experimental Details

We begin with settings common to all MDPs. The horizon of each MDP is 100, so that n_new = 200 data points were added at each iteration. To perform planning at each iteration we performed J = 10 policy evaluation/improvement steps before returning to the MDP to collect more data. For the compressed CME, the size of the sparse-greedy feature space was constrained to be no greater than d = 200. For all 3 methods we performed 5-fold cross-validation over 10 bandwidth parameters to optimize the "input" kernel K on S × A. For the two least-squares methods we also cross-validated the regularization parameter over 20 values. The "output" feature map corresponds to a Gaussian kernel, φ(s) = L(s, ·), and the bandwidth of L was chosen using an informal search for each MDP (it is not clear how this parameter can be validated). For planning we set γ = 0.98, but we report results for γ = 0.99.

C.1 Cart-Pole Experiment Details

This problem simulates a pole attached at a pivot to a cart; by applying force to the cart, the pole must be swung to the vertical position and balanced. The problem is under-actuated in the sense that insufficient power is available to drive the pole directly to the vertical position, hence the problem captures the notion of trading off immediate reward for long-term gain. We used the same simulator as Lagoudakis and Parr (2003), except that here we choose a continuous reward signal. The state space is two dimensional, s = (θ, θ̇), representing the angle (θ = 0 when the pole is pointing vertically upwards) and angular velocity of the pole. The action set is {−50, 0, 50}, representing the horizontal force in Newtons applied to the cart. Uniform noise in [−10, 10] is added to each action. The system dynamics are θ_{t+1} = θ_t + Δt θ̇_t, θ̇_{t+1} = θ̇_t + Δt θ̈_t, where

    θ̈ = ( g sin(θ) − α m ℓ (θ̇)² sin(2θ)/2 − α cos(θ) u ) / ( 4ℓ/3 − α m ℓ cos²(θ) ),

where g = 9.8 m/s² is the acceleration due to gravity, m = 2 kg is the mass of the pole, M = 8 kg is the mass of the cart, ℓ = 0.5 m is the length of the pole, and α = 1/(m + M). We choose Δt = 0.1 s. Rewards are R(s, a) = (1 + cos(θ))/2, the discount factor is γ = 0.99, the horizon is H = 100, and the pole begins in the downwards position, s_0 = (π, 0).

The state kernel is a Gaussian, L(s, s') = exp(−(1/(2σ_S²)) (s − s')ᵀ M_S (s − s')), with M_S = diag(1, 1/4), and the state-action kernel is K((s, a), (s', a')) = exp(−(1/(2σ_{S×A}²)) ((s, a) − (s', a'))ᵀ M_{S×A} ((s, a) − (s', a'))), with M_{S×A} = diag(1, 1/4, 1/10000). The output feature map is φ(s) = L(s, ·) with σ_S = 0.5 (chosen by informal search). During model learning we performed 5-fold cross-validation over a range of 10 bandwidths in the range [0.01, 5] to optimize σ_{S×A}. For the compressed CME the tolerance of the compression set was set to δ = 0.1, i.e., we use a δ-lossy compression set C with δ = 0.1.

Cumulative reward of 50 represents a near-optimal policy in which the pole is quickly swung up and balanced for the entire episode. Cumulative reward of 40 to 45 indicates that the pole is swung up and balanced, but either not optimally quickly, or the controller is unable to balance the pole for the entire remainder of the episode.

C.2 Mountain-Car Experiment Details

In this problem the agent controls a car located at the bottom of a valley; the objective is to drive to the top of a hill, but the car does not have enough power to achieve this directly and must climb the opposite hill a short distance before accelerating toward the goal (see e.g. Singh and Sutton (1996)). States s = (x, v) are position and velocity, S = (−1.2, 0.7) × (−0.07, 0.07), A = {−1, 0, 1}, r(s, a) = exp(−8(x − 0.6)²) and s_0 = (−0.5, 0). The dynamics are x' = x + v + ε_1, v' = v + 0.001a − 0.0025 cos(3x) + ε_2/10, where ε_1, ε_2 are Gaussian random variables with standard deviation 0.02. If x > 0.6 then the state is set to (0.6, 0). For a horizon H = 100 the optimal return is around 45. Doing nothing receives almost no reward.

The state kernel is a Gaussian, L(s, s') = exp(−(1/(2σ_S²)) (s − s')ᵀ M_S (s − s')), with M_S = diag(1, 100), and the state-action kernel is K((s, a), (s', a')) = exp(−(1/(2σ_{S×A}²)) ((s, a) − (s', a'))ᵀ M_{S×A} ((s, a) − (s', a'))), with M_{S×A} = diag(1, 100, 1/25). The output feature map is φ(s) = L(s, ·) with σ_S = 0.5 (chosen by informal search). During model learning we performed 5-fold cross-validation over a range of 10 bandwidths in the range [0.01, 5] to optimize σ_{S×A}. For the compressed CME the tolerance of the compression set selection was set to δ = 0.01, i.e., we use a δ-lossy compression set C with δ = 0.01.

C.3 Quadrocopter Experiment Details

The state kernel is a Gaussian, L(s, s') = exp(−(1/(2σ_S²)) (s − s')ᵀ M_S (s − s')), with M_S = diag((1/5)·1, (1/10⁵)·1, 1, 1/10⁵, 0), where 1 = (1, 1, 1) (which essentially reduces the state observations to 6 dimensions), and the state-action kernel is K((s, a), (s', a')) = exp(−(1/(2σ_{S×A}²)) ((s, a) − (s', a'))ᵀ M_{S×A} ((s, a) − (s', a'))), with M_{S×A} = diag((1/5)·1, (1/10⁵)·1, 1, 1/10⁵, 0, (1/100)·1). The output feature map is φ(s) = L(s, ·) with σ_S = 1 (chosen by informal search). During model learning we performed 5-fold cross-validation over a range of 10 bandwidths in the range [0.01, 5] to optimize σ_{S×A}. For the compressed CME the tolerance of the compression set selection was set to δ = 0.1, i.e., we use a δ-lossy compression set C with δ = 0.1.
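For concreteness, a Gaussian kernel with a diagonal metric of the kind used throughout Appendices C.1 to C.3 might be implemented as follows. This is a minimal sketch with our own variable names; the example values reproduce the mountain-car metrics of Appendix C.2, while the bandwidth σ_{S×A} shown is for illustration only (in the paper it was cross-validated).

```python
import numpy as np

def gaussian_kernel(x, y, M_diag, sigma):
    """Gaussian kernel exp(-(x - y)^T M (x - y) / (2 sigma^2)) with diagonal metric M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(M_diag * d, d) / (2.0 * sigma ** 2))

# Example: mountain-car metrics from Appendix C.2.
M_S = np.array([1.0, 100.0])                 # state metric diag(1, 100)
M_SA = np.array([1.0, 100.0, 1.0 / 25.0])    # state-action metric diag(1, 100, 1/25)

L_val = gaussian_kernel([-0.5, 0.0], [-0.45, 0.01], M_S, sigma=0.5)        # state kernel L(s, s')
K_val = gaussian_kernel([-0.5, 0.0, 1.0], [-0.45, 0.01, 0.0], M_SA, sigma=1.0)  # K((s,a), (s',a'))
```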

