Reinforcement Learning with Exploration
A thesis submitted to
The University of Birmingham
for the degree of
Doctor of Philosophy
Acknowledgements
My deepest gratitude goes to my friend and long-time supervisor, Manfred Kerber. My demands on his time over the past five years for discussion, feedback and proof readings could (at best) be described as unreasonable. Unlike many PhD students my work was not tied to any particular grant, supervisor or research topic, and I can think of few other people who would be willing to supervise work outside of their own field. Through Manfred I was lucky to have the freedom to explore the areas that interested me the most, and also to publish my work independently. For reasons I won't discuss here, these freedoms are becoming increasingly rare; any supervisor who provides them has a truly generous nature. Without his constant encouragement (and harassment) and his enormous expertise, I am sure that this thesis would never have reached completion. In several cases, important ideas would have fallen by the wayside without Manfred to point out the interest in them.
For patiently introducing me to the topics that interest me the most I am extremely grateful to Jeremy Wyatt. Through his reinforcement learning reading group, the objectionable became the obsession, and the obfuscated became the obvious. As the only local expert in my field, his enthusiasm for my ideas has been the greatest motivation throughout. Without it I would surely have quit my PhD within the first year.
I thank the other members of my thesis group (past and present), Xin Yao, Russell Beale and John Barnden, for their support and guidance throughout. I also thank my department for funding my study (and extensive worldwide travel) through a Teaching Assistant scheme. Without this, not only would I not have had the freedom to pursue my own research, I would never have had the opportunity to perform research at all.
I thank Remi Munos and Andrew Moore for hosting my enlightening (but ultimately too short) sabbatical with them at Carnegie Mellon, and my department for funding the visit. I thank Geoff Gordon for indulging my long Q+A discussions about his work that led to new contributions.
I am lucky to have benefited from discussions and advice (no matter how brief) with many of the field's other leading luminaries. These include Richard Sutton, Marco Wiering, Doina Precup, Leslie Kaelbling and Thomas Dietterich.
I thank John Bullinaria for finally setting me straight on neural networks.
Thanks to Tim Kovacs who co-founded the reinforcement learning reading group. As my office-mate for many years he has been the person to receive my most uncooked ideas. I look forward to more of his otter-tainment in the future and promise to return all of his pens the next time we meet.
Through discussions about my work (or theirs), by providing technical assistance, or even through alcoholic stress-relief, I have benefited from many other members of my department. Among others, these people include: Adrian Hartley, Axel Groman, Marcin Chady, Johnny Page, Kevin Lucas, John Woodward, Gavin Brown, Achim Jung, Riccardo Poli and Richard Pannell.

My apologies to Dee who I'm sure is the happiest of all to see this finished.
1 Introduction
1.1 Artificial Intelligence and Machine Learning
1.2 Forms of Learning
1.3 Reinforcement Learning
1.3.1 Sequential Decision Tasks and the Delayed Credit Assignment Problem
1.4 Learning and Exploration
1.5 About This Thesis
1.6 Structure of the Thesis
2 Dynamic Programming
2.1 Markov Decision Processes
2.2 Policies, State Values and Return
2.3 Policy Evaluation
2.3.1 Q-Functions
2.3.2 In-Place and Asynchronous Updating
2.4 Optimal Control
2.4.1 Optimality
2.4.2 Policy Improvement
2.4.3 The Convergence and Termination of Policy Iteration
2.4.4 Value Iteration
2.5 Summary
3 Learning from Interaction
3.1 Introduction
3.2 Incremental Estimation of Means
Introduction

chooses to act differently (hopefully for the better) in some situation than it might have done prior to collecting this experience.
How learning occurs depends upon the form of feedback that is available. For example, through observing what happens after leaving home on different days with different kinds of weather, it may be possible to learn the following association between situations, actions and their consequences,

"If the sky is cloudy, and I don't take my umbrella, then I am likely to get wet."

Observing the consequence of leaving home without an umbrella on a cloudy day is a form of feedback. However, the consequences of actions in themselves do not tell how to choose better actions. Whether the agent should prefer to leave home with an umbrella depends on whether it minds getting wet. Clearly, without some form of utility attached to actions, it is impossible to know what changes could lead the agent to act in a better way. Learning without this utility is called unsupervised learning, and cannot directly lead to better agent behaviour.
If feedback is given in the form,

"If it is cloudy, you should take an umbrella,"

then supervised learning is occurring. A teacher (or supervisor) is assumed to be available that knows the best action to take in a given situation. The supervisor can provide advice that corrects the actions taken by the agent.
If feedback is given in the form of positive or negative reinforcements (rewards), for example,

"Earlier it was cloudy. You didn't take your umbrella. Now you got wet. That was pretty bad,"

then the agent learns through reinforcement learning. Learning occurs by making adjustments to the situation-action mapping that maximises the amount of positive reinforcement received and minimises the negative reinforcement. Often reinforcements are scalar values (e.g. −1 for a bad action, +10 for a good one). A wide variety of algorithms are available for learning in this way. This thesis reviews and improves on a number of them.
the hope that these will reveal better ways of acting in the future? This is known as the exploration/exploitation dilemma [130, 150].
This dilemma is particularly important to reinforcement learning. For supervised learning it is often assumed that the way in which exploration of the problem is conducted is the responsibility of the teacher (i.e. not the responsibility of the learning element). For reinforcement learning, the reverse is more often true. The learning agent itself is usually expected to decide which actions to take in order to gain more information about how the problem may better be solved. Finding good general methods for doing so remains a difficult and interesting research question, but it is not the subject of this thesis. A separate question is how reinforcement learning methods can continue to solve the desired problem while exploring (or, more precisely, while not exploiting). Many reinforcement learning algorithms are known to behave poorly while not exploiting. One of this thesis' major contributions is an examination of how these methods can be improved.
Appendix A reviews some basic terminology and proofs about dynamic programming methods that are employed elsewhere in the thesis.

Appendix B shows termination error bounds of a new modified policy-iteration algorithm.

Appendix C contains the forwards-backwards equivalence proof of the batch mode SMDP accumulate trace TD(λ) algorithm.

Appendix D provides a useful guide to notation and terminology.

New contributions are made throughout. Readers with a detailed knowledge of reinforcement learning are recommended to read the contributions section in Chapter 8 before the rest of the thesis.
Chapter 2

Dynamic Programming

Chapter Outline

This chapter reviews the theoretical foundations of value-based reinforcement learning. It covers the standard formal framework used to describe the agent-environment interaction and also techniques for finding optimal control strategies within this framework.
Figure 2.1: A Markov Decision Process. Large circles are states, small black dots are actions. Some states may have many actions. An action may lead to differing successor states with a given probability.
For the RL framework we also add,

a reward function which, given a ⟨s, a, s'⟩ triple, generates a random scalar-valued reward with a fixed distribution. The reward for taking a in s and then entering s' is a random variable whose expectation is defined here as R^a_{ss'}.
A process is said to be Markov if it has the Markov Property. Formally, the Markov Property holds if,

Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …).    (2.1)

That is to say that the probability distribution over states entered at t+1 is conditionally independent of the events prior to (s_t, a_t); knowing the current state and the action taken is sufficient to define what happens at the next step. In reinforcement learning, we also assume the same for the reward function,

Pr(r_{t+1} | s_t, a_t) = Pr(r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …).    (2.2)

The Markov property is a simplifying assumption which makes it possible to reason about optimality and proofs in a more straightforward way.
For a more detailed account of MDPs see [21] or [114]. For the remainder of this section the terms process and environment will be used interchangeably under the assumption that the agent's environment can be exactly modelled as a discrete finite Markov process. In later chapters we examine cases where this assumption does not hold.
Receding Horizon Problems. The agent should act to maximise the finite horizon return at each step (i.e. we act to maximise the k-step return z_t^k for all t, where k is fixed at every step).

Infinite Horizon Problems. The agent should act to maximise the reward available over an infinite future.
Most work in RL has centred around single-step and infinite horizon problems. In the infinite horizon case, it is common to use the total future discounted return as the value of a state:

z_t^∞ = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ + γ^k r_{t+k} + ⋯    (2.5)

The parameter γ ∈ [0, 1] is a discount factor. Choosing γ < 1 denotes a preference for receiving immediate rewards over those in the more distant future. It also ensures that the return is bounded in cases where the agent may collect reward indefinitely (i.e. if the task is non-episodic or non-terminating), since all infinite geometric series have finite sums for a common ratio of |γ| < 1.

The infinite horizon case is also of special interest as it allows the value of a state to be concisely defined recursively:

V^π(s) = E[ r_t + γ z_{t+1}^∞ | s_t = s, π ]
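To make the return and its recursion concrete, the following sketch (with an invented reward sequence and discount factor) computes the discounted return of a finite episode, taking rewards after termination to be zero, and checks the recursion z_t^∞ = r_t + γ z_{t+1}^∞ behind the definition above.

```python
# A sketch of the discounted return (Equation 2.5) for a finite episode.
# The reward sequence and discount factor are invented for illustration;
# rewards after the episode ends are taken to be zero.

def discounted_return(rewards, gamma):
    """z = r_0 + gamma*r_1 + gamma^2*r_2 + ..., computed right-to-left."""
    z = 0.0
    for r in reversed(rewards):
        z = r + gamma * z   # the same recursion that defines V(s) above
    return z

rewards = [1.0, 0.0, 2.0]
gamma = 0.9
z0 = discounted_return(rewards, gamma)   # 1 + 0.9*0 + 0.81*2 = 2.62
z1 = discounted_return(rewards[1:], gamma)
assert abs(z0 - (rewards[0] + gamma * z1)) < 1e-12   # z_t = r_t + gamma*z_{t+1}
```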
Terminal States. Some environments may contain terminal states. Entering such a state means that no more reward can be collected this episode. To be consistent with the infinite horizon formalism, terminal states are usually modelled as a state in which all actions lead to itself and generate no reward. In practice, it is usually easiest to model all terminal states as a single special state, s+, whose value is zero and in which no actions are available.
Suppose the initial estimate differs from the true value function by a constant error, ε₀:

V̂₀(s) = V^π(s) + ε₀.    (2.9)

Applying the policy evaluation update once gives,

V̂₁(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂₀(s') ]
      = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ (V^π(s') + ε₀) ]
      = γε₀ + Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
      = γε₀ + V^π(s)    (2.10)

Note that only the true value-function, V^π, and not its estimate, V̂, appears on the right-hand side of 2.10. Continuing the iteration we have,

V̂₂(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂₁(s') ]
      = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ (γε₀ + V^π(s')) ]
      = γ²ε₀ + Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
      = γ²ε₀ + V^π(s)
  ⋮
V̂_k(s) = γ^k ε₀ + V^π(s)    (2.11)

Thus if 0 < γ < 1 then the convergence of V̂ to V^π is assured in the limit (as k → ∞) since lim_{k→∞} γ^k ε₀ = 0. The following contraction mapping can be derived from 2.11 and states that each update strictly reduces the worst value estimate error in any state by a factor of γ (also see Appendix A) [114, 20, 17]:

max_s |V̂_{k+1}(s) − V^π(s)| ≤ γ max_s |V̂_k(s) − V^π(s)|.    (2.12)
The termination condition in step 8 of the algorithm allows it to stop once a satisfactory maximum error has been reached.

This recursive process of iteratively re-estimating the value function in terms of itself is called bootstrapping. Since Equation 2.6 represents a system of linear equations, several alternative solution methods, such as Gaussian elimination, could be used to exactly find V^π (see [34, 80]). However, most of the learning methods described in this thesis are, in one way or another, derived from iterative policy evaluation and work by making iterative approximations of value estimates.
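As an illustration, iterative policy evaluation can be sketched in a few lines. The two-state MDP below (transition probabilities, expected rewards and a single-action policy) is invented for the example; each synchronous sweep applies the bootstrapping update described above.

```python
# Iterative policy evaluation on a small hypothetical MDP (a sketch).
# P[s][a] maps successor -> probability; R[s][a] is the expected reward
# (taken to be independent of the successor here, for brevity).
gamma = 0.9
P = {0: {0: {0: 0.5, 1: 0.5}}, 1: {0: {1: 1.0}}}
R = {0: {0: 1.0}, 1: {0: 0.0}}
pi = {0: {0: 1.0}, 1: {0: 1.0}}   # one action per state, chosen with probability 1

V = {0: 0.0, 1: 0.0}
for sweep in range(200):          # each sweep shrinks the worst error by gamma
    V = {s: sum(pi[s][a] * sum(p * (R[s][a] + gamma * V[s2])
                               for s2, p in P[s][a].items())
                for a in pi[s])
         for s in V}
# Fixed point here: V(1) = 0, and V(0) = 1 + 0.9 * 0.5 * V(0), i.e. V(0) = 1/0.55.
```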
2.3.1 Q-Functions

In addition to state-values we can also define state-action values (Q-values) as:

Q^π(s, a) = E[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a ]
          = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]    (2.13)
Intuitively, this Q-function (due to Watkins, [163]) gives the value of following an action for one step plus the discounted expected value of following the policy thereafter. The expected value of a state under a given stochastic policy may be found solely from the Q-values at that state:

V^π(s) = Σ_a π(s, a) Q^π(s, a)    (2.14)

and so the Q-function may be fully defined independently of V^π (by combining Equations 2.13 and 2.14):

Q^π(s, a) = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ Σ_{a'} π(s', a') Q^π(s', a') ]    (2.15)
There may be many optimal policies for some MDPs; this only requires that there are states whose actions yield equivalent expected returns. In such cases, there are also stochastic optimal policies for that process. However, every MDP always has at least one deterministic optimal policy. This follows simply from noting that if a state-action pair (SAP) leads to a higher mean return than the other actions for that state, then it is better to always take that action than some mix of actions in that state. As a result most control optimisation methods seek only deterministic policies even though stochastic optimal policies may exist.
2.4.2 Policy Improvement

Improving a policy as a whole simply involves improving the policy in a single state. To do this, we make the policy greedy with respect to Q^π. The greedy action, a^g_s, for a state is defined as,

a^g_s = argmax_a Q^π(s, a)    (2.18)

A greedy policy, π^g, is one which yields a greedy action in every state. An improved policy may be achieved by making it greedy in any state:

π(s) ← argmax_a Q^π(s, a)    (2.19)
1) k ← 0
2) do
3)    find Q^{π_k} for π_k                      (Evaluate policy)
4)    for each s ∈ S do
5)        π_{k+1}(s) = argmax_a Q^{π_k}(s, a)   (Improve policy)
6)    k ← k + 1
7) while π_k ≠ π_{k−1}

Figure 2.4: Policy Iteration. Upon termination π is optimal provided that Q^{π_k} can be found accurately (see 2.4.3). In the improvement step (step 5), ties between equivalent actions should be broken consistently to return a consistent policy for the same Q-function, and so also allow the algorithm to terminate. Step 3 is assumed to evaluate Q^{π_k} exactly.
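The loop of Figure 2.4 can be sketched in Python on an invented two-state, two-action MDP. The evaluation step approximates an exact Q^π by running many sweeps, and ties in the argmax are broken consistently (max returns the first maximiser), as the caption requires.

```python
# Policy iteration (a sketch of Figure 2.4) on a hypothetical MDP.
gamma = 0.9
S, A = [0, 1], [0, 1]
P = {0: {0: {0: 1.0}, 1: {1: 1.0}},   # P[s][a] = {successor: probability}
     1: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0},             # R[s][a] = expected reward
     1: {0: 0.0, 1: 2.0}}

def evaluate(policy, sweeps=500):
    """Approximate Q^pi by iterative evaluation (stands in for step 3)."""
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(sweeps):
        Q = {s: {a: R[s][a] + gamma * sum(p * Q[s2][policy[s2]]
                                          for s2, p in P[s][a].items())
                 for a in A} for s in S}
    return Q

policy = {s: 0 for s in S}
while True:
    Q = evaluate(policy)                                   # evaluate policy
    new = {s: max(A, key=lambda a: Q[s][a]) for s in S}    # improve policy
    if new == policy:                                      # terminate when unchanged
        break
    policy = new
```

Here the optimal policy takes action 1 everywhere (reaching the self-loop worth 2 per step), which the iteration finds after one improvement.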
The policy improvement theorem first stated by Bellman and Dreyfus [16] states that if,

max_a Q^π(s, a) ≥ Σ_a π(s, a) Q^π(s, a)

holds then it is at least as good to take a greedy action in s as to follow π since, if the agent now passes through this state, it can expect to collect at least max_a Q^π(s, a) (in the mean) rather than Σ_a π(s, a) Q^π(s, a) from there onward [16].¹ The actual improvement may be greater since changing the policy at s may also improve the policy for states following from s in the case where s may be revisited during the same episode.
The improved policy can be evaluated and then improved again. This process can be repeated until the policy can be improved no further in any state, at which point an optimal policy must have been found.

The policy iteration algorithm shown in Figure 2.4 (adapted from [150], first devised by Howard [56]) performs essentially this iterative process except that the policy improvement step is applied to every state in-between policy evaluations. Combinations of local improvements upon a fixed Q^π will also produce strictly (globally) improving policies; any local improvement in the policy can only maintain or increase the expected return available from the states that lead into it.
2.4.3 The Convergence and Termination of Policy Iteration

With Exact Q^π. Showing that the policy iteration algorithm terminates with an optimal policy in finite time is straightforward. Note that, i) the policy improvement step only produces deterministic policies, of which there are only |A|^{|S|} and, ii) each new policy strictly improves upon the previous (unless the policy is already optimal). Put these facts together and it is clear that the algorithm must terminate with the optimal policy in less than k^n improvement steps (k = |A|, n = |S|) [114]. In most cases this is a gross overestimate of the required number of iterations until termination. More recently, Mansour and Singh have provided a tighter bound of O(k^n / n) improvement steps [77]. Both of these bounds exclude the cost of evaluating the policy at each iteration.

¹ Implicitly, this statement rests on knowing Q^π accurately.
1) do:
2)    V̂ ← evaluate(π, V̂)
3)    Δ ← 0
4)    for each s ∈ S:
5)        a^g ← argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂(s') ]
6)        v' ← Σ_{s'} P^{a^g}_{ss'} [ R^{a^g}_{ss'} + γ V̂(s') ]
7)        Δ ← max( Δ, |V̂(s) − v'| )
8)        π(s) ← a^g                              (Make π greedy.)
9) while Δ > ε_T

Figure 2.6: Modified Policy Iteration. Upon termination π is optimal to within some small error (see text).
Overcoming this only requires that the main loop terminates when the improvement in the policy in any state has become sufficiently small (see the termination condition in Figure 2.6). The policy iteration algorithm published in [150] also requires the same change to guarantee its termination.

The algorithm in Figure 2.6 guarantees that,

V^π(s) ≥ V*(s) − 2γε_T / (1 − γ)    (2.20)

holds upon termination, for some termination threshold ε_T. If ε_T = 0 then the algorithm is equivalent to modified policy iteration. Part B of the Appendix establishes the straightforward proof of termination and error bounds; these follow directly from the work of Williams and Baird [172]. The proof assumes that the evaluate step of the revised algorithm applies,

V̂(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂(s') ]

at least once for every state, either synchronously or asynchronously.
after k iterations. That is to say that it finds the Q-function for the policy that maximises the expected k-step discounted return:

max_π E[ r_t + γ r_{t+1} + ⋯ + γ^{k−1} r_{t+k−1} ]    (2.24)

which differs from maximising the expected infinite discounted return,

max_π E[ r_t + γ r_{t+1} + ⋯ + γ^{k−1} r_{t+k−1} + ⋯ ]    (2.25)

by an arbitrarily small amount for a large enough k and 0 ≤ γ < 1. Thus, value iteration assures that Q̂_k converges upon Q* as k → ∞ since Q̂_∞ = Q*, given Q̂₀(s, a) = 0 and 0 ≤ γ < 1.

A more rigorous proof that applies for arbitrary (finite) initial value functions was established by Bellman [15] and can be found in Section A.4.
In particular, the following contraction mapping can be shown which avoids the need to assume Q̂₀ = 0,

max_s |V̂_{k+1}(s) − V*(s)| ≤ γ max_s |V̂_k(s) − V*(s)|.    (2.26)

Proofs of convergence for the in-place and asynchronous updating case have also been established [17].
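The contraction in 2.26 can be watched directly in a sketch of value iteration. The two-state MDP and its hand-solvable V* below are invented for the example; the assertion inside the loop checks that every sweep shrinks the worst error by at least a factor of γ.

```python
# Value iteration on a hypothetical MDP, checking the contraction (2.26).
gamma = 0.5
P = {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
V_star = {1: 2 / (1 - gamma),                  # self-loop worth 2 per step
          0: 1 + gamma * 2 / (1 - gamma)}      # move to state 1, then loop

V = {0: 0.0, 1: 0.0}
for sweep in range(60):
    err = max(abs(V[s] - V_star[s]) for s in V)
    V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in (0, 1))
         for s in V}
    new_err = max(abs(V[s] - V_star[s]) for s in V)
    assert new_err <= gamma * err + 1e-12    # Equation 2.26, checked empirically
```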
2.5 Summary

We have seen how dynamic programming methods can be used to evaluate the long-term utility of fixed policies, and how, by making the evaluation policy greedy, optimal policies may also be converged upon. Value iteration and policy iteration form the basis of all of the RL algorithms detailed in this thesis. Although they are a powerful and general tool for solving difficult multi-step decision problems in stochastic environments, the MDP formalism and dynamic programming methods so far presented suffer a number of limitations:

1. Availability of a Model. Dynamic programming methods assume that a model of the environment (P and R) is available in advance, and that no further knowledge of, or interaction with, the environment is required in order to determine how to act optimally within it. However, in many cases of interest, a prior model is not generally available, nor is it always clear how such a model might be constructed in any efficient manner. Fortunately, even without a model, a number of alternatives are available to us. It remains possible to learn a model, or even learn a value function or Q-function directly through experience gained from within the environment. Reinforcement learning through interacting with the environment is the subject of the next chapter.
2. Small Finite Spaces. In many practical problems, a state might correspond to a point in a high dimensional space: s = ⟨x₁, x₂, …, x_n⟩. Each dimension corresponds to a particular feature of the problem being solved. For instance, suppose our task is to design
an optimal strategy for the game of tic-tac-toe. Each component of the board state, x_i, describes the position of one cell in a 3 × 3 grid (1 ≤ i ≤ 9), and can take one of three values ("X", "O" or "empty"). In this case, the size of the state space is 3⁹. For a game of draughts, we have 32 usable tiles and a state space size of the order of 3³². In general, given n features each of which can take k possible values, we have a state space of size kⁿ. In other words, the size of the state space grows exponentially with its dimensionality. Correspondingly, so grows the memory required to store a value function and the time required to solve such a problem. This exponential growth in the space and time costs for a small increase in the problem size is referred to as the "Curse of Dimensionality" (due to Bellman, [15]).

Similarly, if the state-space has infinitely many states (e.g. if the state-space is continuous) then it is simply impossible to exactly store individual values for each state.

In both cases, using a function approximator to represent approximations of the value function or model can help. These are discussed in Chapter 6.
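The sizes quoted above can be checked with a line of arithmetic; the byte count assumes one 8-byte value per state, purely for illustration.

```python
# The exponential growth k**n described above, for the two examples in the text.
tic_tac_toe = 3 ** 9    # 9 cells, each "X", "O" or empty
draughts = 3 ** 32      # 32 usable tiles, three (simplified) values each

# Storing one 8-byte value per draughts state would need on the order of
# 1.5e16 bytes (tens of petabytes): a table-based value function fails here.
table_bytes = 8 * draughts
```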
3. Markov Property. In practice, the Markov property is hard to obtain. There are many cases where the description of the current state may lack important information necessary to choose the best action. For instance, suppose that you find yourself in a large building where many of the corridors look the same. In this case, based upon what is seen locally, it may be impossible to decide upon the best direction to move given that some other part of the building looks the same but where some other direction is best.

In many instances such as this, the environment may really be an MDP, although it may not be the case that the agent can exactly observe its true state. However, the prior sequence of observations (of states, actions, rewards and successors) often reveals useful information about the likely real state of the process (e.g. if I remember how many flights of stairs I went up I can now tell which corridor I am in with greater certainty). This kind of problem can be formalised as a Partially Observable Markov Decision Process (POMDP). A POMDP is often defined as an MDP, which includes S, A, P and a reward function, plus a set of prior observations and a mapping from real states to observations.

These problems and their related solution methods are not examined in this thesis. See [27] or [74] for excellent introductions and field overviews.
4. Discrete Time. The MDP formalism assumes that there is a fixed, discrete amount of time between state observations. In many problems this is untrue and events occur at varying real-valued time intervals (or even occur continuously). A good example is the state of a queue for an elevator [36]. At t = 0 the state of the queue might be empty (s₀). Some time later someone may join the queue (we make a transition to s₁), but the time interval between state transitions can take some real value whose probability may be given by a continuous distribution.

Variable and continuous time interval variants of MDPs are referred to as Semi-Markov Decision Processes (SMDPs) [114], and are examined in Chapter 7.
5. Undiscounted Ergodic Tasks. In cases where reward may be collected indefinitely and discounting is not desired, the discounted return model may not be used since the future sum of rewards with γ = 1 may be unbounded. Furthermore, even in cases where the returns can be shown to be bounded, with γ = 1 the policy-iteration and value-iteration algorithms are not guaranteed to converge upon Q*. This follows as a result of using bootstrapping and the max operator which causes any optimistic initial bias in the Q-function to remain indefinitely.

If discounting is not desired, then an average reward per step formalism can be used. Here the expected return is defined as follows [132, 153, 75, 21]:

ρ^π = lim_{n→∞} (1/n) Σ_{t=1}^{n} E[ r_t | s₀ = s, π ]
This formalism is unproblematic in processes where all states are reachable from any other under the policy (such a process is said to be ergodic), since the long-term average reward is then independent of the starting state. However, even in this case, from some states higher than average return may be gained for some short time and so such a state might be considered to be better. Quantitatively, the value of a state can be defined by the relative difference between the long-term average reward from any state, ρ^π, and the reward following a starting state:
V^π(s) = Σ_{k=1}^{∞} E[ r_{t+k} − ρ^π | s_t = s, π ]

Thus a policy may be improved by modifying it to increase the time that the system spends in high valued states (thereby raising ρ^π). Average reward methods are not examined in this thesis.
Chapter 3

Learning from Interaction
Chapter Outline

In this chapter we see how reinforcement learning problems can be solved solely through interacting with the environment and learning from what is observed. No knowledge of the task being solved needs to be provided. A number of standard algorithms for learning in this way are reviewed. The shortcomings of exploration insensitive model-free control methods are highlighted, and new intuitions about the online behaviour of accumulate trace TD(λ) methods illustrated.
1) for each episode:
2)    Initialise: t ← 0; s_{t=0}
3)    while s_t is not terminal:
4)        select a_t
5)        follow a_t; observe r_{t+1}, s_{t+1}
6)        perform updates to P̂, R̂, Q̂ and/or V̂ using the new experience ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩
7)        t ← t + 1
Ẑ_{k+1} = (1/(k+1)) ( z_{k+1} + (k+1) Ẑ_k − Ẑ_k )
        = Ẑ_k + (1/(k+1)) ( z_{k+1} − Ẑ_k )    (3.3)
Recency Weighted Average. By choosing a constant value for α (where 0 < α < 1), update 3.1 can be used to calculate a recency weighted average. This can be seen more clearly by expanding the right hand side of Equation 3.1:

Ẑ_{t+1} = α z_{t+1} + (1 − α) Ẑ_t    (3.4)

Intuitively, each new observation forms a fixed percentage of the new estimate. Recency weighted averages are useful if the observations are drawn from a non-stationary distribution.

In cases where α₁ ≠ 1 the estimates Ẑ_k (k > 1) may be partially determined by the initial estimate, Ẑ₀. Such estimates are said to be biased by the initial estimate. Ẑ₀ is an initial bias.
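Both running averages can be sketched directly from updates 3.3 and 3.4; the sample values below are invented. The incremental form reproduces the exact sample mean, while a constant α leaves a geometrically decaying trace of the initial bias Ẑ₀.

```python
# Incremental mean (update 3.3) and recency weighted average (update 3.4).

def incremental_mean(samples):
    z_hat, k = 0.0, 0
    for z in samples:
        k += 1
        z_hat += (1.0 / k) * (z - z_hat)          # alpha_k = 1/k
    return z_hat

def recency_weighted(samples, alpha, z0=0.0):
    z_hat = z0                                    # Z_0 is the initial bias
    for z in samples:
        z_hat = alpha * z + (1 - alpha) * z_hat   # update 3.4
    return z_hat

samples = [2.0, 4.0, 6.0]
assert abs(incremental_mean(samples) - sum(samples) / len(samples)) < 1e-12
```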
Mean in the Limit. From standard statistics, with α_k = 1/k, from Equation 3.2 we have,

lim_{k→∞} Ẑ_k = E[z].    (3.5)
However, more usefully, Equation 3.5 also holds if,

1) Σ_{k=1}^{∞} α_k = ∞    (3.6)

2) Σ_{k=1}^{∞} α_k² < ∞    (3.7)

both hold. These are the Robbins-Monro conditions and appear frequently as conditions for the convergence of many stochastic approximation algorithms [126].
The first condition ensures that, at any point, the sum of the remaining stepsizes is infinite and so the current estimate will eventually become insignificant. Thus, if the current estimate contains some kind of bias, then this is eventually eliminated. The second condition ensures that the step sizes eventually become small enough so that any variance in the observations can be overcome.

In most interesting learning problems, there is the possibility of trading lower bias for higher variance, or vice versa. Slowly declining learning rates reduce bias more quickly but converge more slowly. Reducing the learning rate quickly gives fast convergence but slow reductions in bias. If the learning rate is declined too quickly, premature convergence upon a value other than E[z] may occur. The Robbins-Monro conditions guarantee that this cannot happen.
Conditions 1 and 2 are known to hold for,

α_k(s) = 1 / k(s)^ω,    (3.8)

at the kth update of Ẑ(s) and 1/2 < ω ≤ 1 [167].
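Finite partial sums cannot prove conditions 3.6 and 3.7, but they illustrate why step sizes of the form 1/k^ω satisfy them. With ω = 0.75 (an arbitrary choice inside (1/2, 1]), the plain sum keeps growing without bound while the sum of squares stays below a fixed constant.

```python
# Partial sums of alpha_k = 1/k**omega for omega = 0.75 (illustration only).
omega = 0.75
N = 100_000
sum_alpha = sum(1.0 / k ** omega for k in range(1, N + 1))
sum_alpha_sq = sum(1.0 / k ** (2 * omega) for k in range(1, N + 1))

# sum 1/k^1.5 is bounded above by 1 + integral from 1 to inf of x^(-1.5) dx = 3,
# while sum 1/k^0.75 grows roughly like 4 * N^0.25 as N increases.
assert sum_alpha_sq < 3.0
assert sum_alpha > 50.0
```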
where s is visited at times {t₁, …, t_M}. In this case, the RunningAverage update is applied offline, at the end of each episode at the earliest. Each state-value is updated once for each state visit using the return following that visit. M represents the total number of visits to s in all episodes.
3.3 Monte Carlo Methods for Policy Evaluation
Figure 3.2: A simple Markov process for which first-visit and every-visit Monte Carlo approximation initially find different value estimates. The process has a starting state, s, and a terminal state, T. P_s and P_T denote the respective transition probabilities for s → s and for s → T. The respective rewards for these transitions are R_s and R_T.
First Visit Monte Carlo Estimation. The first-visit Monte Carlo estimate is defined as the sample average of returns following the first visit to a state during the episodes in which it was visited:

V̂_F(s) = (1/N) Σ_{i=1}^{N} z_{t_i}^∞    (3.10)

where s is first visited during an episode at times {t₁, …, t_N} and N represents the total number of episodes. The key difference here is that an observed reward may be used to update a state value only once, whereas in the every-visit case, a state value may be defined as the average of several non-independent return estimates, each involving the same reward, if the state is revisited during an episode.
Bias and Variance. In the case where state revisits are allowed within a trial these methods produce different estimators of return. Singh and Sutton analysed these differences which can be characterised by considering the process in Figure 3.2 [139]. For simplicity assume γ = 1, then from the Bellman equation (2.6) the true value for this process is:

V^π(s) = P_s (R_s + V^π(s)) + P_T R_T    (3.11)
      = (P_s R_s + P_T R_T) / (1 − P_s)    (3.12)
      = (P_s / P_T) R_s + R_T    (3.13)
Consider the difference between the methods following one episode with the following experience,

s, s, s, s, T

The first-visit estimate is:

V̂_F(s) = R_s + R_s + R_s + R_T

while the every-visit estimate is:

V̂_E(s) = (R_s + 2R_s + 3R_s + 4R_T) / 4
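With invented numbers for R_s and R_T, the two estimates can be computed for the episode s, s, s, s, T (taking γ = 1):

```python
# First-visit vs every-visit estimates for the episode s,s,s,s,T (gamma = 1).
# Rs and RT are hypothetical reward values chosen for illustration.
Rs, RT = 1.0, 10.0
rewards = [Rs, Rs, Rs, RT]    # one reward per transition; s is visited 4 times

# The return following the i-th visit to s is the sum of the remaining rewards.
returns = [sum(rewards[i:]) for i in range(len(rewards))]

V_first = returns[0]                    # uses only the first visit: 3*Rs + RT
V_every = sum(returns) / len(returns)   # averages all four (dependent) returns
assert V_first == 3 * Rs + RT
assert V_every == (Rs + 2 * Rs + 3 * Rs + 4 * RT) / 4
```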
For both cases, it is possible to find the expectation of the estimate after one trial for some arbitrary experience. This is done by averaging the possible returns that could be observed in the first episode weighted by their probability of being observed. For the first-visit case, it can be shown that after the first episode [139],

E[ V̂₁^F(s) ] = (P_s / P_T) R_s + R_T
             = V^π(s)

and so is an unbiased estimator of V^π(s). After N episodes, V̂_N^F(s) is the sample average of N independent unbiased estimates of V^π(s), and so is also unbiased.
For the every-visit case, it can be shown (in [139]) that after the first episode,

E[ V̂₁^E(s) ] = (P_s / 2P_T) R_s + R_T

where k denotes the number of times that s is visited within the episode. Thus after the first episode the every-visit method does not give an unbiased estimate of V^π(s). Its bias is given by,

BIAS₁^E = V^π(s) − E[ V̂₁^E(s) ] = (P_s / 2P_T) R_s.    (3.14)

Singh and Sutton also show that after M episodes,

BIAS_M^E = (2 / (M + 1)) BIAS₁^E.    (3.15)

Thus the every-visit method is also unbiased as M → ∞.
The bias in the every-visit method comes from the fact that it uses some rewards several times. Thus many of the return observations are not independent. However, the observations between trials are independent, and so as the number of trials grows, its bias shrinks. Both methods converge upon V^π(s) as M or N tend to infinity.

Singh and Sutton also analysed the expected variance in the estimates learned by each method. They found that, while the first-visit method has no bias, it initially has a higher expected variance than the every-visit method. However, its expected variance declines far more rapidly, and is usually lower than for the every-visit method after a very small number of trials. Thus, in the long-run the first-visit method appears to be superior, having no bias and lower variance.
So, assuming for the moment that V̂(s_{t+1}) is a fixed constant, Update 3.16 can be seen as a stochastic version of,

V̂(s) ← Σ_a π(s, a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂(s') ]       (3.17)

where E[r_{t+1} + γ V̂(s_{t+1}) | s = s_t, π] is estimated by V̂(s) in the limit from the observed (sample) return estimates, r_{t+1} + γ V̂_t(s_{t+1}), rather than the target return estimate given by the right-hand side of update 3.17.
TD(0) is reliant upon observing the return estimate, r + γ V̂(s'), and applying it in update 3.16 with the probability distribution defined by R, P and π. This can be done in several
1) for each episode:
2)   initialise s_t
3)   while s_t is not terminal:
4)     select a_t according to π
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     TD(0)-update(s_t, a_t, r_{t+1}, s_{t+1})
7)     t ← t + 1

TD(0)-update(s_t, a_t, r_{t+1}, s_{t+1})
1) V̂(s_t) ← V̂(s_t) + α_{t+1}(s_t) [ r_{t+1} + γ V̂(s_{t+1}) - V̂(s_t) ]

Figure 3.3: The online TD(0) learning algorithm. Evaluates the value-function for the policy followed while gathering experience.
ways, but by far the most straightforward is to actually follow the evaluation policy in the environment and make updates after each step using the experience collected. Figure 3.3 shows this online learning version of TD(0) in full. Note that it makes no use of R or P.
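As a sketch of how Figure 3.3 might look in code, the following applies the TD(0) update online to a small deterministic chain with a single action per state. The chain, step size and episode count are illustrative, not from the text:

```python
# Online TD(0) on a 3-state chain: states 0 -> 1 -> 2 (terminal),
# reward 1.0 on entering the terminal state. The policy has a single
# action, so "select a_t according to pi" is trivial here.
gamma, alpha = 1.0, 0.1
V = [0.0, 0.0, 0.0]  # V[2] is the terminal state and stays 0

def step(s):
    """Environment: deterministic transition right; reward 1 on termination."""
    s_next = s + 1
    r = 1.0 if s_next == 2 else 0.0
    return r, s_next

for episode in range(200):
    s = 0
    while s != 2:
        r, s_next = step(s)
        # TD(0)-update: V(s_t) <- V(s_t) + alpha [ r + gamma V(s') - V(s_t) ]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V)  # both non-terminal values approach the true value 1.0
```

As the figure notes, no model of R or P is used; only sampled transitions drive the updates.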
In general, the value of the correction term (V̂(s_{t+1}) in update 3.16) is not a constant but changes as s_{t+1} is visited and its value updated. The method can be seen to be averaging return estimates sampled from a non-stationary distribution. The return estimate is also biased by the initial value function estimate, V̂_0. Even so, the algorithm can be shown to converge upon V^π as t → ∞ provided that the learning rate is declined under the Robbins-Monro conditions (Σ_{k=1}^∞ α_k(s) = ∞, Σ_{k=1}^∞ α_k^2(s) < ∞), that all value estimates continue to be updated, the process is Markov, all rewards have finite variance, 0 ≤ γ < 1, and that the evaluation policy is followed [148, 38, 158, 59, 21]. In practice it is common to use the fixed learning rate α = 1 if the transitions and rewards are deterministic, or some lower value if they are stochastic. A fixed α also allows continuing adaptation in cases where the reward or transition probability distributions are non-stationary (in which case the Markov property does not hold).
3.4.3 SARSA(0)

Similar to TD(0), SARSA(0) evaluates the Q-function of an evaluation policy [128, 173]. Its update rule is:

Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k [ r_{t+1} + γ Q̂(s_{t+1}, a_{t+1}) - Q̂(s_t, a_t) ]       (3.18)

where a_t and a_{t+1} are selected with the probability specified by the evaluation policy and α_k = α_k(s_t, a_t).
SARSA differs from the standard algorithm pattern given in Figure 3.1 because it needs to know the next action that will be taken when making the value update. The SARSA algorithm is shown in Figure 3.4. An alternative scheme that appears to be equally valid and is more closely related to the policy-evaluation Q-function update (see Equation 2.15
1) for each episode:
2)   initialise s_t
3)   select a_t according to π
4)   while s_t is not terminal:
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     select a_{t+1} according to π
7)     SARSA(0)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
8)     t ← t + 1

SARSA(0)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
1) Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k [ r_{t+1} + γ Q̂(s_{t+1}, a_{t+1}) - Q̂(s_t, a_t) ]

Figure 3.4: The online SARSA(0) learning algorithm. Evaluates the Q-function for the policy followed while gathering experience.
and Figure 2.3) is to replace the target return estimate with [128]:

r_{t+1} + γ Σ_{a'} π(s_{t+1}, a') Q̂(s_{t+1}, a')       (3.19)

An algorithm employing this return does not need to know a_{t+1} to make the update and so can be implemented in the standard framework. Its independence of a_{t+1} also makes this an off-policy method: it doesn't need to actually follow the evaluation policy in order to evaluate it. This property is discussed in more detail later in this chapter. However, unlike regular SARSA, this method does require that the evaluation policy is known, which may not always be the case; experience could be generated by observing an external (e.g. human) controller.
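The alternative update of Equation 3.19 can be sketched as follows. The tiny Q-table, policy probabilities and learning parameters are invented for illustration; the point is only that the target uses the expectation over π rather than the sampled a_{t+1}:

```python
# Sketch of the SARSA(0) variant with the Equation 3.19 target.
# All values here are illustrative.
gamma, alpha = 0.9, 0.5
Q = {('s0', 'L'): 0.0, ('s0', 'R'): 0.0,
     ('s1', 'L'): 0.0, ('s1', 'R'): 0.0}
pi = {('s1', 'L'): 0.25, ('s1', 'R'): 0.75}   # known evaluation policy at s_{t+1}

def expected_update(s, a, r, s_next, actions):
    # target: r + gamma * sum_a' pi(s', a') Q(s', a')  -- a_{t+1} is not needed
    target = r + gamma * sum(pi[(s_next, b)] * Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q[('s1', 'R')] = 1.0
expected_update('s0', 'L', 0.0, 's1', ['L', 'R'])
print(Q[('s0', 'L')])  # 0.5 * 0.9 * (0.25*0 + 0.75*1) = 0.3375
```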
3.4.4 Return Estimate Length

Single-Step Return Estimates

The TD(0) and SARSA(0) algorithms are single-step temporal difference learning methods and apply updates to estimate some target return estimate having the following form:

z_t^(1) = r_t + γ Û(s_t).       (3.20)

It is important to note that it is the dependence upon using only information gained from the immediate reward and the successor state that allows single-step methods to be easily used as online learning algorithms. However, when single-step learning methods are applied in the standard way, by updating V̂(s_t) or Q̂(s_t, a_t) at time t + 1, new return information is propagated back only to the previous state. This can result in extremely slow learning in cases where credit for visiting a particular state or taking a particular action is delayed by many time steps. Figure 3.5 provides an example of this problem. Each episode begins in the leftmost state. Each state to the right is visited in sequence until the rightmost (terminal) state is entered, where a reward of 1 is given (r = 0 in all other states). In such a situation, it would take 1-step methods a minimum of 64 episodes before any information
Figure 3.5: The corridor task (states visited at t = 0 through t = 64; reward r = 1 on entering the terminal state). Single-step updating methods such as TD(0), SARSA(0) and Q-learning can be very slow to propagate any information about the terminal reward to the leftmost state.
about the terminal reward reaches the leftmost state. A Monte Carlo estimate would find the correct solution after just one episode.
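A quick simulation (a sketch, with the update order assumed to follow Figure 3.3) confirms this minimum for the 64-step corridor:

```python
# Corridor task of Figure 3.5: 65 states, terminal state N = 64, reward 1
# on entering the terminal state. One-step TD with updates applied as the
# agent moves left to right.
N = 64
gamma, alpha = 1.0, 1.0     # deterministic task, so alpha = 1 is fine
V = [0.0] * (N + 1)

episodes_needed = None
for episode in range(1, 200):
    for s in range(N):      # states visited in sequence, updated at t + 1
        r = 1.0 if s + 1 == N else 0.0
        V[s] += alpha * (r + gamma * V[s + 1] - V[s])
    if V[0] > 0.0:
        episodes_needed = episode
        break

print(episodes_needed)  # 64: one extra episode per state of backup distance
```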
Multi-Step Return Estimates

By modifying the return estimate to look further ahead than the next state, a single experience can be used to update utility estimates at many previously visited states. For example, the 1-step return in 3.16, z_t^(1) = r_t + γ U(s_t), may be replaced with the corrected n-step truncated return estimate,

z_t^(n) = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n Û(s_{t+n-1})       (3.21)

or we may use,

z_t^λ = (1 - λ) [ z_t^(1) + λ z_t^(2) + λ^2 z_t^(3) + ... ]       (3.22)
      = (1 - λ) [ r_t + γ Û(s_t) ] + λ [ r_t + γ z_{t+1}^λ ]     (3.23)
      = r_t + γ (1 - λ) Û(s_t) + γλ z_{t+1}^λ                    (3.24)

which is a λ-return estimate [147, 148, 163, 128, 107]. The λ-return estimate is important as it is a generalisation of both z^(1) and z^(∞) since, if λ = 0, then z^λ = z^(1), and if λ = 1, then z^λ = z^(∞), the actual discounted return.
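Working backwards from the end of an episode, the recursion of Equation 3.24 gives a cheap way to compute every λ-return. The rewards and state-value estimates below are illustrative; the indices of r and Û are aligned as in the text's convention:

```python
# Compute lambda-returns backwards along one episode using the recursion
# z_t = r_t + gamma*((1 - lam)*U(s_t) + lam*z_{t+1})   (Equation 3.24).
def lambda_returns(r, U, gamma, lam):
    z = [0.0] * len(r)
    z[-1] = r[-1]                    # the final reward precedes the terminal state
    for t in range(len(r) - 2, -1, -1):
        z[t] = r[t] + gamma * ((1 - lam) * U[t] + lam * z[t + 1])
    return z

r = [0.0, 0.0, 1.0]                  # illustrative episode rewards
U = [0.5, 0.5, 0.5]                  # illustrative value estimates
print(lambda_returns(r, U, 1.0, 0.0))  # lam = 0: the 1-step targets
print(lambda_returns(r, U, 1.0, 1.0))  # lam = 1: the actual returns [1.0, 1.0, 1.0]
```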
A key feature of multi-step estimates is that a single observed reward may be used in updating the state-values or Q-values of many previously visited states. Intuitively, this offers the ability to more quickly assign credit for delayed rewards.
The return estimate length can also be seen as managing a tradeoff between bias and variance in the return estimate [163]. When λ is low, the estimate is highly biased toward the initial state-value or Q-function. When λ is high, the estimate involves mainly the actual observed reward and is a less biased estimator. However, unbiased return estimates don't necessarily result in the fastest learning. Typically, longer return estimates have higher variance as there is a greater space of possible values that a multi-step return estimate could take. By contrast, a single-step estimate is limited to taking values formed by combinations of the possible immediate rewards and the values of immediate successor states, and so may typically have lower variance. Also, employing the already-learned value estimates of successor states in updates may help speed up learning, since these values may contain summaries of the complex future that may follow from the state. Best performance is often to be found with intermediate values of λ [148, 128, 73, 139, 150].
However, while multi-step estimates appear to offer faster delayed credit assignment, they seem to suffer the same problem as the Monte-Carlo methods: the updates must either be made off-line, at the end of each episode, or the episodes must be split into stages and the return estimates truncated. Chapter 4 introduces a method which explores the latter case. The next section shows how the effect of using the λ-return estimate can be approximated by a fully incremental online method that makes updates after each step.
3.4.5 Eligibility Traces: TD(λ)

This section shows how λ-return estimates can be applied as an incremental online learning algorithm. This is surprising because it implies that it is not necessary to wait until all the information used by the λ-return estimate is collected before a backup can be made to a previously visited state.
The effect of using z^λ can be closely and incrementally approximated online using eligibility traces [148, 163].
A λ-return algorithm performs the following update,

V̂(s_t) ← V̂(s_t) + α_t(s_t) [ z_{t+1}^λ - V̂(s_t) ].       (3.25)

By Equation 3.24, Sutton showed that the error estimate in this update can be re-written as [148, 163, 107],

z_{t+1}^λ - V̂(s_t) = δ_t + γλ δ_{t+1} + ... + (γλ)^k δ_{t+k} + ...       (3.26)

where δ_t is the 1-step temporal difference error as before,

δ_t = r_{t+1} + γ V̂(s_{t+1}) - V̂(s_t).

If the process is acyclic and finite (and so necessarily also has a terminal state), this allows update 3.25 to be re-written as the following on-line update rule, which overcomes the need to have advance knowledge of the 1-step errors,

V̂(s) ← V̂(s) + Σ_{k=t_0}^{t} α_t(s) δ_t (γλ)^{t-k} I(s, s_k)       (3.27)

where t_0 indicates the time of the start of the episode, and I(s, s_k) is 1 if s was visited at time k, and zero otherwise. This update must be applied to all states visited at time t or before, within the episode.
In the case in which state revisits may occur, the updates may be postponed and a single batch update may be made for each state at the end of the episode,

V̂(s) ← V̂(s) + Σ_{t=t_0}^{T-1} Σ_{k=0}^{t} α_t(s) δ_t (γλ)^{t-k} I(s, s_k)

where s_T is the terminal state.
However, the above methods don't appear to be of any more practical use than the Monte-Carlo or λ-return methods. If the task is acyclic, then there is little benefit in having an online learning algorithm, since the agent cannot make use of the values it updates until the end of the episode. So the assumption preventing state revisits is often relaxed. In this case the error terms may be inexact, since the state-values used as the return correction may have been altered if the state was previously visited. However, intuitively this seems to be a good thing, since the return correction is more up-to-date as a result.
To avoid the expensive recalculation of the summation in 3.27, this term can be redefined as,

V̂(s) ← V̂(s) + α_t e_t(s) δ_t       (3.28)

where e(s) is an (accumulating) eligibility trace. For each state at each step it is updated as follows,

e_t(s) = γλ e_{t-1}(s) + 1,   if s = s_t,
e_t(s) = γλ e_{t-1}(s),       otherwise.       (3.29)

The full online TD(λ) algorithm is shown in Figure 3.6. Both the online and batch TD(λ) algorithms are known to converge upon the true state-value function for the evaluation policy under the same conditions as TD(0) [38, 158, 59, 21].
The intuitive idea behind an eligibility trace is to make a state eligible for learning for several steps after it was visited. If an unexpectedly good or bad event happens (as measured by the temporal difference error, δ), then all of the previously visited states are immediately credited with this. The size of the value adjustment is scaled by the state's eligibility, which decays with the time since the last visit. Moreover, the 1-step error δ_t measures an error in the λ-return used, not just for the previous state, but for all previously visited states in the episode. The eligibility measures the relevance of that error to the values of the previous states, given that they were updated using a λ-return corrected for the error found at the current state. Thus it should be clear why the trace decays as (γλ)^k: the contribution of V̂(s_{t+k}) to z_t^(λ) is (γλ)^k.
The Forward-Backward Equivalence of Batch TD(λ) and λ-Return Updates

If the changes to the value-function that the accumulate-trace algorithm is to make during an episode are summed,

ΔV(s) = Σ_{t=0}^{T-1} α_t e_t(s) δ_t

and applied at the end of the episode (instead of online),

V(s) ← V(s) + ΔV(s)

it can be shown that this is equivalent to applying the λ-return update,

V̂(s) ← V̂(s) + α [ z_{t+1}^λ - V̂(s) ],

at the end of the episode, for each s = s_t visited during the episode [150].¹
Thus in the case where λ = 1 and α_k(s) = 1/k(s), this batch-mode TD(λ) method is equivalent to the every-visit Monte Carlo algorithm. The proof of this can be found in [139] and [150].
Below, the direct λ-return method is referred to as the forward-view, and the eligibility trace method as the backward-view (after [150]).
3.4.6 SARSA(λ)

The equivalent version of TD(λ) for updating a Q-function is SARSA(λ), shown in Figure 3.7 [128, 129]. Here, an eligibility value is maintained for each state-action pair.

SARSA(λ)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
1) δ ← r_{t+1} + γ Q̂(s_{t+1}, a_{t+1}) - Q̂(s_t, a_t)
2) e(s_t, a_t) ← e(s_t, a_t) + 1
3) for each (s, a) ∈ S × A:
3a)   Q̂(s, a) ← Q̂(s, a) + α e(s, a) δ
3b)   e(s, a) ← γλ e(s, a)

Figure 3.7: The accumulating-trace SARSA(λ) update. This update step should replace the SARSA(0)-update in Figure 3.4 for the full learning algorithm. All eligibilities should be set to zero at the start of each episode.
every-visit Monte-Carlo algorithm. An alternative eligibility trace scheme is the replacing trace:

e_t(s) = 1,             if s = s_t,
e_t(s) = γλ e_{t-1}(s), otherwise.       (3.30)

Sutton refers to this as a recency heuristic: the eligibility of a state depends only upon the time since the last visit. By contrast, the accumulating trace is a frequency and recency heuristic.
In [139] Singh and Sutton show that, with λ = 1 and with appropriately declining learning rates, the batch-update TD(λ) algorithms exactly implement the Monte Carlo algorithms. In particular, it can be shown that accumulating traces give the every-visit method, and replacing traces give the first-visit Monte Carlo method.
In addition to the better theoretical properties of first-visit Monte Carlo, the replace-trace method has often performed better in online learning tasks. In [150] Sutton and Barto also prove that the TD(λ) and forward-view λ-return methods are identical in the case of batch (i.e. offline) updating for general λ with a constant α.
When estimating Q-values, two replace-trace schemes exist. These are the state-replacing trace [139, 150],

e_t(s, a) = 1,                  if s = s_t and a = a_t,
e_t(s, a) = 0,                  if s = s_t and a ≠ a_t,
e_t(s, a) = γλ e_{t-1}(s, a),   otherwise,

and the state-action replacing trace, which resets only the trace of the pair actually taken.
3.4.8 Acyclic Environments

If the environment is acyclic, then the different eligibility updates produce identical eligibility values, and so the accumulate and replace trace methods must be identical. In this case, the online and batch versions of the algorithms are also identical, since the return corrections used in return estimates must be fixed within an episode. With λ = 1, the λ-return methods also implement the Monte Carlo methods in acyclic environments. Also, here, both first-visit and every-visit methods are equivalent.
The eligibility trace methods appear to be considerably more expensive than the other model-free methods so far presented. For TD(0) and SARSA(0) the time-cost per experience is O(1). The Monte-Carlo and direct λ-return methods have the same cost if the returns are calculated starting with the most recent experience and working backwards.² Algorithms working in this way will be seen in Chapter 4. By contrast, TD(λ) has a time-cost as high as O(|S|) per experience.
Thus the great benefit afforded by using eligibility traces is that they allow multi-step return estimates to be used for continual online learning and, as a consequence, they can also be used in

² Since all discounted return estimators can be calculated recursively as z_t = f(r_t, s_t, a_t, z_{t+1}, Û), for some function f. If z_{t+1} is known then it is cheap to calculate z_t by working backwards.
Figure 3.8: Number line showing the effect of step-size. Note that a step-size greater than 2 can actually increase the error in the estimate (i.e. move the new estimate into the hashed area).
non-episodic tasks and in cyclical environments in a relatively straightforward way. We will see in the next chapter that the cost of the eligibility trace updates can be greatly reduced.
3.4.9 The Non-Equivalence of Online Methods in Cyclic Environments

Consider the RunningAverage update rule (3.1). It is easy to see that with a large learning rate the algorithm can actually increase the error in the prediction. Let δ_t = z_t - Ẑ_{t-1}; then if α > 2, after an update, |z_t - Ẑ_{t+1}| > |z_t - Ẑ_t|. The problem can be seen visually in Figure 3.8.
This raises new suspicions about the online behaviour of the accumulate trace TD(λ) update. In a worst case environment (see Figure 3.9), in which a state is revisited after every step, after k revisits the eligibility trace becomes,

e_k(s) = 1 + γλ + ... + (γλ)^{k-1}
       = (1 - (γλ)^k) / (1 - γλ).
Thus, for γλ < 1, an upper bound on an accumulating eligibility trace (in any process) is given by,

e_∞(s) = 1 / (1 - γλ).       (3.33)

For γλ = 1 the trace grows without bound if the process is finite and has no terminal state.
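The bound can be checked numerically (the values of γ and λ here are illustrative):

```python
# With a state revisited on every step, the accumulating trace follows
# e_k = 1 + (gamma*lam) + ... + (gamma*lam)^(k-1)  ->  1/(1 - gamma*lam).
gamma, lam = 0.999, 0.9
gl = gamma * lam
e = 0.0
for k in range(10000):
    e = gl * e + 1.0                 # trace update on every step (a revisit)
print(e)                             # close to 1/(1 - gamma*lam), about 9.91
```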
The TD(λ) update (3.28) makes updates of the following form:

V(s) ← V(s) + α_t(s) e_t(s) δ.

Thus it might seem that where α_t(s) e_t(s) > 2 holds, the TD(λ) algorithm could grow in error with each update. These conditions are easily satisfied for γλ close enough to 1 in any non-terminating finite (and therefore cyclic) process. Considering the case where the trace reaches its upper bound, we have in the worst case scenario,

α_t(s) / (1 - γλ) > 2
Figure 3.9: A process where the state's eligibility grows at the maximum rate. The reward is a random variable chosen from the range [-1, 1] with a uniform distribution.

Figure 3.10: The growth of the accumulate-trace update step-size (α_t e_t) for the process in Figure 3.9. The learning rate is α_t = 1/t^{0.55}, γ = 0.999 and λ = 1.0. These settings satisfy the conditions of convergence (Σ_t α_t(s) = ∞, Σ_t α_t^2(s) < ∞),
assuming a constant α_t(s) while the eligibility rises. Yet the convergence of online accumulate trace TD(λ) has already been established [38, 59]. Crucially, these results rely upon the learning rate being declined under the Robbins-Monro conditions, which ensures that α tends to zero (and so α_t(s) e_t(s) must eventually fall below 2). However, even learning rate schedules that satisfy the Robbins-Monro conditions can cause α_t(s) e_t(s) > 2 to hold for a considerable time in the early stages of learning. An example is shown in Figure 3.10. Note that even though a high value of γ is used (i.e. close to 1.0, at which value functions may be ill-defined), by 10000 steps the remaining rewards can be neglected from the value of the state since 0.999^10000 is very small. Even so, at the end of this period, α_t(s) e_t(s) > 2.
What are the practical consequences of this for the online accumulate trace TD(λ) algorithm? Figure 3.11 compares this method with an online forward view algorithm using the process in Figure 3.9. With λ = 1, a forward view λ-return algorithm can be implemented online in this particular task by making the following updates:

z_{t+1} ← (1 - λ) [ r_{t+1} + γ V̂_t(s) ] + λ [ r_{t+1} + γ z_t ]
V̂_{t+1}(s) ← V̂_t(s) + α_t(s) [ z_{t+1} - V̂_t(s) ]

Note that this is "back-to-front": rewards should be included into z with the most recent first. However, this makes no difference in this case since there is only one state and only one reward. Thus with λ = 1, z records the actual observed discounted return (and is also the first-visit estimate) except for some small error introduced by V̂_0(s). V̂_0(s) is set to zero (i.e. the correct value) for all of the methods. In the experiment, the initial estimate has little influence on the general shape of the graphs in Figure 3.11 beyond the first few steps. Also, with α_t(s) = 1/t, V̂_t(s) is the every-visit estimate except for the negligible error
Figure 3.11: Comparison of variance between the online versions of TD(λ) and the forward view methods in the single-state process in Figure 3.9, where γ = 0.999 and λ = 1. The results are the average of 300 runs. The panels plot the error over time for the Accumulate, Replace, Forward View Every-Visit and Forward View First-Visit methods, for the learning rates α_t = 1/t, α_t = 1/t^{0.55} and α = 0.5. The horizontal and vertical axes differ in scaling. The vertical axis measures |ΔV̂(s)| = |V̂(s) - V(s)|, since V(s) = 0.
caused by V̂_0(s). Alternatively, note that the method is exactly the every-visit method for a slightly different process where there is some (very small) probability of entering a zero-valued terminal state (in which case setting V̂_0(s) = 0 is justified). This allows us to closely compare online TD(λ) with the forward-view Monte-Carlo estimates, and even do so with different learning rate schemes. Different learning rate schemes correspond to different recency weightings of the actual return. The "Forward-View, First-Visit" method in Figure 3.11 simply learns the actual observed return at the current time, and is independent of the learning rate. The replace trace method is also shown and is equivalent to TD(0) for this environment.
The results can be seen in Figure 3.11. The most interesting results are those for accumulate trace TD(λ). Here we see that where α_t(s) = 1/t, the method most closely approximates
the every-visit method (at least in the long term). This is predicted as a theoretical result by Singh and Sutton in [139] for the batch update case. With a more slowly declining or a constant α (i.e. more recency biased), the accumulate trace method is considerably higher in error than any of the other methods. This seems to be at odds with the existing theoretical results in [150], where it is shown that TD(λ) is equivalent to the forward view method for constant α (and any λ). However, this equivalence applies only in the offline (batch update) case. The equivalence is approximate in the online learning case, and we see the consequence of this approximation in Figure 3.11. In the fixed α case, the values learned by accumulate trace TD(λ) are so high in variance as to be essentially useless as predictions. Similar results can be expected in other cyclic environments where the eligibility trace can grow very large. There are also numerous examples in the literature where the performance of accumulate trace methods sharply degrades as λ tends to 1 (in particular, see [139, 150]). In contrast, the every-visit method behaves much more reasonably (as do the first-visit and replace trace methods). In part, this motivates a new (practical) online-learning forward view method presented in Chapter 4.
It may seem surprising that the error in the accumulate trace TD(λ) method does not continue to increase indefinitely, since α_t e_t is considerably higher than 2 after the first few updates and remains so. The reason for this is that the observed samples used in updates (r_t + γ V̂(s_{t+1})) are not independent of the learned estimates (V̂(s_t)). Unlike in the basic RunningAverage update case, where divergence to infinity is clear (with z independent of Ẑ), this non-independence appears to be useful in bounding the size of the possible error in this and presumably other cyclic tasks.
In Figure 3.11, we also see that the every-visit method performed marginally better than first-visit in each case. This is consistent with the theoretical results obtained by Singh and Sutton in [139], which predict that (offline) every-visit Monte Carlo will find predictions with a lower mean squared error (i.e. lower variance) for the first few episodes; only one episode occurred in this experiment.
We can conclude that, i) drawing analogies between forward-view methods and online versions of eligibility trace methods is dangerous, since the equivalence of these methods does not extend to the online case, and ii) accumulate trace TD(λ) can perform poorly in cyclic environments where α_t e_t above 2 is maintained. In particular, it can perform far worse than its forward-view counterpart for learning rate declination schemes slower than α(s) = 1/k(s) (where k is the number of visits to s). This can be attributed to the approximate nature of the forward-backward equivalence in the online case. In cyclic tasks, errors due to this approximation can be magnified by large effective step-sizes (αe).
Figure 3.12: The online Q-learning algorithm. Evaluates the greedy policy independently of the policy used to generate experience. This method is exploration insensitive.

behave greedily, avoiding the return lost while exploring, but settle for a policy that may be sub-optimal?
Optimal Bayesian solutions to this dilemma are known, but are intractable in the general multi-step process case [78]. However, there are many good heuristic solutions. Good surveys of early work can be found in [62, 156]; more recent surveys can be found in [85, 174, 63]. Also see [41, 40, 142, 175] for recent work not included in these. Common features of the most successful methods are local definitions of uncertainty (e.g. action counters, Q-value error and variance measures), the propagation of this uncertainty to prior states, and then choosing actions which maximise combined measures of this long-term uncertainty and long-term value.
Multi-Step Methods. Off-policy learning is less straightforward for methods that use multi-step return estimates. For example, if a multi-step return estimate used to update Q̂(s_t, a_t) includes the reward following a non-greedy action, a_{t+k} (k ≥ 1), then there is a bias to learn about the return following a non-greedy policy instead of the greedy policy. That is to say, Q̂(s_{t-1}, a_{t-1}) receives credit for the delayed reward, r_{t+k+1}, which the agent might not observe if it follows the greedy policy after (s_t, a_t). In most cases, learning in this way denies convergence upon Q*. This is straightforward to see when the case is considered where Q̂ = Q* is known to hold. Most updates following non-greedy actions are likely to move Q̂ away from Q* (in expectation).
The most commonly used solution to this problem is to ensure that the exploration policy converges upon the greedy policy in the limit, and so on-policy methods eventually evaluate the greedy policy [135]. However, schemes for doing this must carefully observe the learning
rate. If convergence to the greedy policy is too fast, then the agent may become stuck in a local minimum, since choosing only greedy actions may result in some parts of the environment being under-explored (or under-updated). If convergence upon the greedy policy is too slow, then as the learning rate declines, the Q-function will converge prematurely and remain biased toward the rewards following non-greedy actions. In [135], Singh et al. discuss several exploration methods which are greedy in the limit and allow SARSA(0) to find Q* in the limit. Their results also seem likely to hold for SARSA(λ), although there is as yet no proof of this.
In any case, following or even converging upon the greedy exploration strategy may not always be desirable or even possible. For example:

- Bootstrapping from externally generated experience or some given training policy (such as one provided by a human expert) can greatly reduce the agent's initial learning costs [72, 112]. Even if the agent follows this training policy, we would still like our method to be learning about the greedy policy (and so moving toward the optimal policy).

- There may be a limited amount of time available for exploration (e.g. for commercial or safety critical applications, it might be desirable to have distinct training, testing and application phases). In this case, we may wish to perform as much exploration as possible in the training stage.

- The agent may be trying to learn several policies (behaviours) in parallel, where each policy should maximise its own reward function (as in [58, 79, 143]). At any time the agent may take only one action, yet it remains useful to be able to use this experience to update the Q-functions of all the policies being evaluated.

- The agent's task may be non-stationary, in which case continual exploration is required in order to evaluate actions whose true Q-values are changing [105].

- The agent's Q-function representation may be non-stationary. Continual exploration may be required to evaluate the actions in the new representation.
It has long been known that multi-step return estimates need not lead to exploration-sensitive methods. The method recommended by Watkins is to truncate the λ-return estimate such that the rewards following off-policy (e.g. non-greedy) actions are removed from it [163]. For example, Q̂(s_{t-1}, a_{t-1}) should be updated using the corrected n-step truncated λ-return (see [163, 31]),

z_t^(λ,n) = (1 - λ) [ z_t^(1) + λ z_t^(2) + λ^2 z_t^(3) + ... + λ^{n-2} z_t^(n-1) ] + λ^{n-1} z_t^(n)       (3.36)
          = (1 - λ) [ r_t + γ Û(s_t) ] + λ [ r_t + γ z_{t+1}^(λ,n-1) ]       (3.37)

where,

z_t^(λ,1) = r_t + γ Û(s_t)

and a_{t+n} is the next off-policy action. However, if there is a considerable amount of exploration, then the return estimate may be truncated extremely frequently, and much of the
The eligibility trace methods may also be used for off-policy evaluation of a fixed policy by applying importance sampling [111]. Here, the eligibility trace is scaled by the likelihood that the exploratory policy has of generating the experience seen by the evaluation policy. When used for greedy policy evaluation, the method reduces to Watkins' Q(λ). Like the off-policy SARSA(0) method, the evaluation policy must be known.
Optimistic Q-value Initialisation and Exploration

To encourage exploration of the environment, a common technique in RL is to provide an optimistic initial Q-function and then follow a policy with a strong greedy bias. Examples of these "soft greedy" policies include ε-greedy and Boltzmann selection [135, 150]. Over time each Q-value will decrease as it is updated, but the Q-values of untried actions, or of actions that led to untried actions, will remain artificially high. Thus, even while following a purely greedy policy, the agent can be led to unexplored parts of the state-space.
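The mechanism can be sketched on a hypothetical 5-armed bandit with deterministic rewards; every reward lies below the optimistic initial value, so purely greedy selection still visits every arm:

```python
# Optimistic initialisation drives a purely greedy agent to try every action.
# The bandit, its rewards and the step size are illustrative assumptions.
true_reward = [0.1, 0.3, 0.2, 0.05, 0.25]   # all below the optimistic value 1.0
Q = [1.0] * 5                               # optimistic initial Q-values
alpha = 0.5
tried = set()

for t in range(20):
    a = max(range(5), key=lambda i: Q[i])   # purely greedy selection
    tried.add(a)
    Q[a] += alpha * (true_reward[a] - Q[a]) # sampled (here deterministic) reward

print(sorted(tried))  # all five arms get tried: [0, 1, 2, 3, 4]
```

Each greedy pull drags that arm's value below the untouched optimistic values, so the next pull moves on to another arm.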
However, problems arise if the estimated value of an action should ever fall below its true value (as may easily happen in environments with stochastic rewards or transitions). In this case any method which acts only greedily can become stuck in a local minimum, since the truly best actions are no longer followed.
The original version of PW-Q(λ), as published in [107], assumes that the greedy policy is always followed. As a result, the standard Q-function initialisation for PW-Q(λ) is an optimistic one. Even so, several authors report good results when using PW-Q(λ) and following semi-greedy policies [128, 169]. In this case, PW-Q(λ) is an unsound method in the sense that, like SARSA(λ), it can be shown that it will not converge upon Q* in some environments while
³ The use of the eligibility trace in the Peng and Williams' and Watkins' Q(λ) algorithms presented is the same as the method in [107, 167], but differs from TD(λ) and SARSA(λ). Because, in Figures 3.13 and 3.14, the traces are updated before the Q-values, the trace extends an extra step into the history and an additional update may result in the case of state revisits. The algorithms may be modified to remove this additional update, although in practice this makes little difference.
3.5. TEMPORAL DIFFERENCE LEARNING FOR CONTROL 43
Watkins-Q(λ)-update(s_t, a_t, r_{t+1}, s_{t+1})
1)  if off-policy(s_t, a_t)                                  Test for non-greedy action
2)    for each (s, a) ∈ S × A do:                            Truncate eligibility traces
3)      e(s, a) ← 0
4)  δ_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
5)  for each SAP (s, a) ∈ S × A do:
6)    e(s, a) ← γλ e(s, a)                                   Decay trace
7)    Q̂(s, a) ← Q̂(s, a) + α δ_t e(s, a)
8)  Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k δ_t e(s_t, a_t)
9)  for each a ∈ A(s_t) do:
9a)   e(s_t, a) ← 0
10) e(s_t, a_t) ← e(s_t, a_t) + 1
Figure 3.13: Off-policy (Watkins') Q(λ) with a state replacing trace. This version differs
slightly from the algorithm recently published in the standard text [150]. For an accumulating
trace version omit steps 9 and 9a. For state-action replacing traces, replace steps 9 to 10
with e(s_t, a_t) ← 1.
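As a concrete rendering of the update above, the sketch below implements one accumulating-trace Watkins' Q(λ) backup for a tabular agent. It is an illustration, not the thesis' exact formulation: the dictionary layout, the off-policy test via a greedy-value comparison, and a single learning rate `alpha` (in place of the per-pair α_k) are all assumptions made for the example.

```python
def watkins_q_lambda_update(Q, e, s, a, r, s_next, actions,
                            gamma=0.95, lam=0.9, alpha=0.1, tol=1e-9):
    """One tabular Watkins' Q(lambda) backup with an accumulating trace.

    Q, e: dicts mapping (state, action) -> Q-value / eligibility.
    All traces are truncated whenever the action just taken was non-greedy.
    """
    greedy_value = max(Q.get((s, b), 0.0) for b in actions)
    if Q.get((s, a), 0.0) < greedy_value - tol:   # non-greedy action taken:
        for key in e:                             # truncate all traces
            e[key] = 0.0
    delta = (r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
             - Q.get((s, a), 0.0))
    for key in list(e):
        e[key] *= gamma * lam                     # decay traces
        Q[key] = Q.get(key, 0.0) + alpha * delta * e[key]
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta  # current pair, trace of 1
    e[(s, a)] = e.get((s, a), 0.0) + 1.0            # accumulate
```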
PW-Q(λ)-update(s_t, a_t, r_{t+1}, s_{t+1})
1)  δ'_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
2)  δ_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a)
3)  for each SAP (s, a) ∈ S × A do:
4)    e(s, a) ← γλ e(s, a)
5)    Q̂(s, a) ← Q̂(s, a) + α δ_t e(s, a)
6)  Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k δ'_t e(s_t, a_t)
7)  for each a ∈ A(s_t) do:
7a)   e(s_t, a) ← 0
8)  e(s_t, a_t) ← e(s_t, a_t) + 1
Figure 3.14: Peng and Williams' Q(λ) with a state replacing trace. Modifications for
accumulating and state-action replacing traces are as for Watkins' Q(λ) (Figure 3.13).
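The Peng and Williams update differs from Watkins' mainly in using two TD errors and never truncating the traces. A hedged tabular sketch (same illustrative conventions and assumed single learning rate as the Watkins sketch; not the thesis' exact code):

```python
def pw_q_lambda_update(Q, e, s, a, r, s_next, actions,
                       gamma=0.95, lam=0.9, alpha=0.1):
    """One tabular Peng and Williams' Q(lambda) backup (accumulating trace).

    Two TD errors: delta_prime (relative to Q(s, a)) corrects the current
    pair, while delta (relative to max_a Q(s, .)) is applied along the
    decayed traces. No trace truncation happens on exploratory actions.
    """
    q_next = max(Q.get((s_next, b), 0.0) for b in actions)
    delta_prime = r + gamma * q_next - Q.get((s, a), 0.0)
    delta = r + gamma * q_next - max(Q.get((s, b), 0.0) for b in actions)
    for key in list(e):
        e[key] *= gamma * lam
        Q[key] = Q.get(key, 0.0) + alpha * delta * e[key]
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta_prime
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
```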
⁴ This can be seen straightforwardly in deterministic processes with deterministic rewards. Note that if
Q̂ = Q* is known to hold, then PW-Q(λ) (or SARSA(λ)) may increase ‖Q̂ − Q*‖ if non-greedy actions are
taken. The same is not true for Q-learning and Watkins' Q(λ).
is one of a Markov process). By their "single-step" nature, P̂ and R̂ give rise to methods
that rely heavily on the Markov property. It is not clear how multi-step models can be
learned so as to overcome this dependence on the Markov property. It is also often unclear
how to represent stochastic models with many kinds of function approximator. Function
approximation is covered in more detail in Chapter 5.
3.7 Summary
In this chapter we have seen how reinforcement learning can proceed starting with little
or no prior knowledge of the task being solved. Using only the knowledge gained through
interaction with the environment, optimal solutions to difficult stochastic control problems
can be found.
A number of different dimensions to RL methods have been seen: prediction and control
methods, bias and variance issues, direct and indirect methods, exploration and exploitation,
online and offline methods, on-policy and off-policy methods, and single-step and multi-step
methods.
Online learning in cyclic environments was identified as a particularly interesting class of
problems for model-free methods. Here we see a wider variation in the solution methods
than in the acyclic or offline cases. Also, we have seen how it is difficult to apply forwards
view methods in this case and how (accumulate) trace methods can significantly differ from
their forward view analogues. Also, there appears to be no theoretically sound and experience
efficient model-free control method for online learning while continuing to take non-greedy
actions. Section 3.5.3 listed several examples of why such learning methods are useful.
Apparently sound methods, such as Watkins' Q(λ), suffer from "shortsightedness", while
unsound methods can easily be shown to suffer from a loss of predictive accuracy (practical
examples are given in the next chapter).
Chapter 4
Efficient Off-Policy Control
Chapter Outline
This chapter reviews extensions to the model-free learning algorithms pre-
sented in the previous chapter. We see how their computational costs can
be reduced and their data-efficiency increased, while also allowing for exploratory
actions and online learning. The experimental results using these algorithms
also lead to interesting insights about the role of optimism in reinforcement
learning control methods.
Predictive. Algorithms that predict, from each state in the environment, the expected
return available for following some given policy thereafter.
Exploration Insensitive. Algorithms that can evaluate one policy while following an-
other are exploration insensitive methods (also referred to as off-policy methods) [150, 163].
In the context of control optimisation, we often want to evaluate the greedy policy while
following some exploration policy.
48 CHAPTER 4. EFFICIENT OFF-POLICY CONTROL
Online Learning. Online learning methods immediately apply observed experiences for
learning. Where exploration depends upon the Q-function, online methods can have a
huge advantage over methods which learn offline [128, 65, 165, 168]. For instance, most
exploration strategies quickly reduce the probability of taking actions which lead to large
punishments, provided that the Q-values for those actions are also reduced. If the Q-function
is adjusted offline, or after some long interval, then the exploration strategy may select poor
actions many times more than necessary within a single episode.
Computationally Cheap. Currently, the cheapest online learning control methods have
time complexities of O(|A|) per experience, where |A| is the number of actions currently
available to the agent [168, 163].
Fast Learning. Methods which make effective use of limited real experience. For exam-
ple, methods which learn a model of the environment can make excellent use of experience
but are often computationally far more expensive than O(|A|) when learning online. Exist-
ing model-free methods have attempted to tackle this using eligibility traces [148, 163, 128]
or backwards replay [72, 76]. However, off-policy (exploration insensitive) eligibility trace
methods for control, such as Watkins' Q(λ), are relatively inefficient. Also, backwards replay
is generally regarded as a technique that cannot be used for online learning. Methods such
as SARSA(λ) and Peng and Williams' Q(λ) are exploration sensitive methods; if exploring
actions are continuously taken in the environment then they lose predictive accuracy in
their Q-functions as a result.
Scalable. For an RL algorithm to be practical it must work in cases where there are very
many states or where the state-space is non-discrete. Typically, this involves using a function
approximator to store and update the Q-function. Eligibility trace methods have been
shown to work well when applied with function approximators [163, 149].
" #
T
X X t 1
= Æt0 It (s; a) + ( )t i Æt Ii(s; a) (4.3)
t=1 i=1
XT " t 1
X
#
= Æ0 It (s; a) +
t Æt ( ) t iI
i (s; a) : (4.4)
t=1 i=1
In what follows, let us abbreviate I_t = I_t(s, a) and φ = γλ. Suppose some SAP (s, a)
occurs at steps t_1, t_2, t_3, …; then we may unfold terms of expression (4.4):

Σ_{t=1}^{T} [ δ'_t I_t + δ_t φ^t Σ_{i=1}^{t−1} φ^{−i} I_i ]
  = Σ_{t=1}^{t_1} [ δ'_t I_t + δ_t φ^t Σ_{i=1}^{t−1} φ^{−i} I_i ]
  + Σ_{t=t_1+1}^{t_2} [ δ'_t I_t + δ_t φ^t Σ_{i=1}^{t−1} φ^{−i} I_i ]
  + Σ_{t=t_2+1}^{t_3} [ δ'_t I_t + δ_t φ^t Σ_{i=1}^{t−1} φ^{−i} I_i ] + ⋯        (4.5)
Since I_t(s, a) is 1 only for t = t_1, t_2, t_3, …, where the visits of (s, a) occur, and
I_t(s, a) is 0 otherwise, we can rewrite Equation 4.5 as

δ'_{t_1} + δ'_{t_2} + Σ_{t=t_1+1}^{t_2} δ_t φ^{t−t_1} + δ'_{t_3} + Σ_{t=t_2+1}^{t_3} δ_t ( φ^{t−t_1} + φ^{t−t_2} ) + ⋯

= δ'_{t_1} + δ'_{t_2} + (1/φ^{t_1}) Σ_{t=t_1+1}^{t_2} δ_t φ^t + δ'_{t_3} + (1/φ^{t_1} + 1/φ^{t_2}) Σ_{t=t_2+1}^{t_3} δ_t φ^t + ⋯

= δ'_{t_1} + δ'_{t_2} + (1/φ^{t_1}) ( Σ_{t=1}^{t_2} δ_t φ^t − Σ_{t=1}^{t_1} δ_t φ^t )
  + δ'_{t_3} + (1/φ^{t_1} + 1/φ^{t_2}) ( Σ_{t=1}^{t_3} δ_t φ^t − Σ_{t=1}^{t_2} δ_t φ^t ) + ⋯

Defining Δ_t = Σ_{i=1}^{t} δ_i φ^i, this becomes

δ'_{t_1} + δ'_{t_2} + (1/φ^{t_1})(Δ_{t_2} − Δ_{t_1}) + δ'_{t_3} + (1/φ^{t_1} + 1/φ^{t_2})(Δ_{t_3} − Δ_{t_2}) + ⋯        (4.6)
This will allow the construction of an efficient online Q(λ) algorithm. We define a local
trace e'_t(s, a) = Σ_{i=1}^{t} I_i(s, a)/φ^i, and use (4.6) to write down the total update of
Q(s, a) during an episode:

ΔQ̂(s, a) = Σ_{t=1}^{T} [ δ'_t I_t(s, a) + e'_t(s, a)(Δ_{t+1} − Δ_t) ].        (4.7)
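The equivalence between the postponed-update form (4.7) and the naive double sum (4.3) can be checked numerically. The sketch below is illustrative only: random values stand in for the TD errors, and the visit indicator is drawn at random, so it verifies the algebra rather than any particular learning run.

```python
import random

def check_lazy_identity(T=30, lam=0.9, gamma=0.95, seed=1):
    """Evaluate both forms of the total update for one state-action pair.

    I[t] marks (hypothetical) visits; delta/delta_p stand in for the TD
    errors delta_t and delta'_t. Returns (direct, lazy), which should agree.
    """
    rng = random.Random(seed)
    phi = gamma * lam
    I = [rng.random() < 0.3 for _ in range(T + 1)]        # I[1..T]; I[0] unused
    delta = [rng.uniform(-1, 1) for _ in range(T + 1)]
    delta_p = [rng.uniform(-1, 1) for _ in range(T + 1)]

    # direct form: sum_t delta'_t I_t + sum_t delta_t sum_{i<t} phi^(t-i) I_i
    direct = sum(delta_p[t] * I[t] for t in range(1, T + 1))
    direct += sum(delta[t] * sum(phi ** (t - i) for i in range(1, t) if I[i])
                  for t in range(2, T + 1))

    # lazy form: sum_t delta'_t I_t + e'_t (Delta_{t+1} - Delta_t)
    lazy, e = 0.0, 0.0
    for t in range(1, T + 1):
        if I[t]:
            e += phi ** (-t)                              # local trace e'_t
        if t < T:
            lazy += e * delta[t + 1] * phi ** (t + 1)     # Delta_{t+1} - Delta_t
        lazy += delta_p[t] * I[t]
    return direct, lazy
```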
To exploit this we introduce a global variable Δ keeping track of the cumulative TD(λ) error
since the start of the episode. As long as the SAP (s, a) does not occur we postpone updating
Q̂(s, a). In the update below we need to subtract that part of Δ which has already been
used (see equations 4.6 and 4.7). We use for each SAP (s, a) a local variable δ(s, a) which
records the value of Δ at the moment of the last update, and a local trace variable e'(s, a).
Then, once Q̂(s, a) needs to be known, we update Q̂(s, a) by adding e'(s, a)(Δ − δ(s, a)).
Algorithm overview. The algorithm relies on two procedures: the Local Update procedure
calculates exact Q-values once they are required; the Global Update procedure updates the
4.2. ACCELERATING Q(λ) 51
global variables and the current Q-value. Initially we set the global variables φ⁰ ← 1.0 and
Δ ← 0. We also initialise the local variables δ(s, a) ← 0 and e'(s, a) ← 0 for all SAPs.
Local updates. Q-values for all actions possible in a given state are updated before an
action is selected and before a particular Q-value is calculated. For each SAP (s, a) a
variable δ(s, a) tracks changes since the last update:
Local Update(s_t, a_t):
1) Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t)(Δ − δ(s_t, a_t)) e'(s_t, a_t)
2) δ(s_t, a_t) ← Δ
The global update procedure. After each executed action we invoke the procedure
Global Update, which consists of three basic steps: (1) To calculate max_a Q̂(s_{t+1}, a) (which
may have changed due to the most recent experience), it calls Local Update for the possible
next SAPs. (2) It updates the global variables φ^t and Δ. (3) It updates the Q-value and
trace variable of (s_t, a_t) and stores the current value of Δ (in Local Update).
For state replacing eligibility traces [139] step 8 should be changed as follows: ∀a: e'(s_t, a)
← 0; e'(s_t, a_t) ← 1/φ^t.
Machine precision problem and solution. Adding δ_t φ^t to Δ in line 5 may create
a problem due to limited machine precision: for large absolute values of Δ and small φ^t
there may be significant rounding errors. More importantly, line 8 will quickly overflow any
machine for γλ < 1. The following addendum to the procedure Global Update detects when
φ^t falls below machine precision ε_m and updates all SAPs which have occurred. A list, H,
is used to track SAPs that are not up-to-date. If e'(s, a) < ε_m, the SAP (s, a) is removed
from H. Finally, Δ and φ^t are reset to their initial values.
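The rate at which φ^t underflows (and hence 1/φ^t overflows) is easy to quantify. A small sketch (the threshold value is illustrative, chosen merely to stay well inside double-precision range):

```python
def steps_until_flush(gamma=0.95, lam=0.9, eps_m=1e-250):
    """Count how many steps phi^t = (gamma*lambda)^t survives before
    dropping below a precision threshold eps_m, at which point the
    addendum flushes all postponed updates and resets Delta and phi^t.
    """
    phi = gamma * lam        # 0.855 here, so phi^t shrinks geometrically
    phi_t, t = 1.0, 0
    while phi_t >= eps_m:
        phi_t *= phi
        t += 1
    return t
```

With γλ = 0.855 this gives a flush only every few thousand steps, so the amortised cost of the addendum is negligible.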
Error 1. Step 1 of the original Global Update procedure performs the updates to the
Q-values at s_{t+1} necessary to ensure that Q̂(s_{t+1}, ·) is an up-to-date estimate before steps
2 and 3 where it is used. However, Q̂(s_t, ·) is also used in steps 2 and 3 and may not be
up-to-date. This is easily corrected by adding:
1b) Local Update(s_t, a)
We shall see below that this change is not necessary if Q̂(s_t, ·) is made up-to-date at the
end of the Global Update procedure.
Error 2. When state replacing traces are employed with the original Fast Q(λ) algorithm,
it is possible that the eligibilities of some SAPs are zeroed. In such a case, if these SAPs
previously had non-zero eligibilities then they will not receive any update making use of
δ_t. An exception is Q̂(s_t, a_t), which is made up-to-date in step 6 (and so makes use of δ_t).
However, all other SAPs at s_t with non-zero eligibilities will receive no adjustment toward
δ_t if their eligibilities are zeroed.
Action Selection. Steps 9 and 9a of the Revised Global Update procedure are a pragmatic
change to ensure that all of the Q-values for s_{t+1} are up-to-date by the end of the procedure.
If this were not so, then any code needing to make use of the up-to-date Q-function at s_{t+1},
such as that for selecting the agent's next action, would need to be defined in terms of
the up-to-date Q-function instead. Q⁺ is used to denote the up-to-date Q-function and can be
Watkins' Q(λ). Watkins' Q(λ) requires that the eligibility trace be zeroed after tak-
ing non-greedy actions. The new Fast Q(λ) version works in the same way (by applying
e'(s, a) ← 0 for all SAPs), except that here we must ensure that all non-up-to-date SAPs
are updated before zeroing their traces (see the Flush Updates procedure).
For accumulating traces:

Revised Global Update(s_t, a_t, r_t, s_{t+1}):
1)  ∀a ∈ A do:
1a)   Local Update(s_{t+1}, a)
2)  δ'_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t))         NB. s_t was made up-to-date in step 9
3)  δ_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a))
4)  φ^t ← γλ φ^{t−1}
5)  Δ ← Δ + δ_t φ^t
6)  Local Update(s_t, a_t)
7)  Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t) δ'_t
8)  e'(s_t, a_t) ← e'(s_t, a_t) + 1/φ^t                         Increment eligibility
9)  ∀a ∈ A do:
9a)   Local Update(s_{t+1}, a)                                  Make Q̂(s_{t+1}, ·) up-to-date before action selection

For state-action replacing traces replace step 8 with:
8)  e'(s_t, a_t) ← 1/φ^t                                        Set eligibility to 1

For state replacing traces, replace steps 8 - 9a with:
8)  ∀a ∈ A do:
8a)   Local Update(s_t, a)                                      Make Q̂(s_t, ·) up-to-date before zeroing eligibility
8b)   e'(s_t, a) ← 0                                            Zero eligibility
8c)   Local Update(s_{t+1}, a)                                  Make Q̂(s_{t+1}, ·) up-to-date before action selection
9)  e'(s_t, a_t) ← 1/φ^t                                        Set eligibility to 1

For Watkins' Q(λ) prepend the following to the Revised Global Update procedures:
0)  if off-policy(s_t, a_t)                                     Test whether a non-greedy action was taken
0a)   Flush Updates()

Flush Updates():
1)  ∀(s, a) ∈ H do:
2)    Q̂(s, a) ← Q̂(s, a) + α_k(s, a)(Δ − δ(s, a)) e'(s, a)
3)    δ(s, a) ← 0
4)    e'(s, a) ← 0
5)  H ← {}
6)  Δ ← 0
7)  φ^t ← 1
Figure 4.1: The revised Fast Q(λ) algorithm for accumulating, state replacing and state-
action replacing traces, and for Watkins' Q(λ). The machine precision addendum should be
appended to each algorithm. The Flush Updates procedure can also be called upon entering
a terminal state to make the entire Q-function up-to-date and also reinitialise the eligibility
and error values of each SAP ready for learning in the next episode.
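Putting the pieces together, a minimal tabular sketch of the lazy-update scheme (accumulating traces only, omitting the Watkins flush and the machine precision addendum) might look as follows. The class and attribute names are illustrative; `local_update` and `global_update` mirror the Local Update and Revised Global Update procedures above.

```python
from collections import defaultdict

class FastQLambda:
    """Sketch of the lazy-update bookkeeping behind Fast Q(lambda)."""

    def __init__(self, actions, alpha=0.3, gamma=0.95, lam=0.9):
        self.actions = actions
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.Q = defaultdict(float)    # estimated Q-values
        self.e = defaultdict(float)    # local traces e'(s, a)
        self.d = defaultdict(float)    # delta(s, a): Delta at last update
        self.Delta = 0.0               # cumulative TD(lambda) error
        self.phi_t = 1.0               # (gamma * lambda)^t

    def local_update(self, s, a):
        """Bring Q(s, a) up to date by applying the postponed corrections."""
        self.Q[(s, a)] += self.alpha * (self.Delta - self.d[(s, a)]) * self.e[(s, a)]
        self.d[(s, a)] = self.Delta

    def global_update(self, s, a, r, s_next):
        """Process one experience (s, a, r, s_next)."""
        for b in self.actions:                       # step 1: freshen Q(s_next, .)
            self.local_update(s_next, b)
        q_next = max(self.Q[(s_next, b)] for b in self.actions)
        delta_prime = r + self.gamma * q_next - self.Q[(s, a)]
        delta = r + self.gamma * q_next - max(self.Q[(s, b)] for b in self.actions)
        self.phi_t *= self.gamma * self.lam          # steps 4-5: global variables
        self.Delta += delta * self.phi_t
        self.local_update(s, a)                      # steps 6-8: current pair
        self.Q[(s, a)] += self.alpha * delta_prime
        self.e[(s, a)] += 1.0 / self.phi_t           # accumulating trace
```

Only the visited pair and the actions of the successor state are touched per step, so the per-experience cost stays O(|A|) regardless of how many traces are notionally active.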
4.2.3 Validation
In this section we empirically test how closely the correct and erroneous implementations
of Fast Q(λ) approximate the original versions of Q(λ). Fast Q(λ)⁺ is used to denote the
correct implementation suggested here and Fast Q(λ)⁻ to denote the method that does not
apply a Local Update for all actions in the new state between calls to the Global Update
procedure. Note that if these updates are performed, Fast Q(λ)⁺ and Fast Q(λ)⁻ are
identical methods.¹
The algorithms were tested using the maze task shown in Figure 4.4. This task was chosen
as credit for actions leading to the goal can be significantly delayed (and so eligibility traces
are expected to help) and also because state revisits can frequently occur, causing the
different eligibility trace methods to behave differently.
Actions taken by the agent at each step were selected using ε-greedy [150]. This selects a
greedy action, arg max_a Q̂(s_t, a), with probability 1 − ε, and a random action with probability ε. Fast
Q(λ)⁻ was given the benefit of using the true up-to-date Q-function (i.e. arg max_a Q̂⁺(s_t, a)
was used to choose its greedy action).
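In outline, the ε-greedy rule used here is (a generic sketch, not the experiment code):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """Return the greedy action with probability 1 - epsilon,
    otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```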
Figure 4.2 compares the results for the PW Q(λ) variants. The graphs measure the total
reward collected by each algorithm and the mean squared error (MSE) in the up-to-date
Q-function learned by each algorithm over the course of 200000 time steps. The squared
error was measured as

SE(s) = ( V*(s) − max_a Q̂(s, a) )²,        (4.9)

for both versions of Fast Q(λ). An accurate V* was found by dynamic programming meth-
ods. All of the results in the graphs are the average of 100 runs.
Fast PW Q(λ)⁺ provided equal or better performance than Fast PW Q(λ)⁻ in most in-
stances, and its results also provided an extremely good fit against the original version of
PW Q(λ) in all cases (see Figures 4.2 and 4.3). Similar results were found when comparing
Watkins' Q(λ) and its Fast variants (see Figures 4.5 and 4.6).
Fast Q(λ)⁻ performed especially poorly in terms of error, relative to Fast Q(λ)⁺, for PW with
accumulating or state-action replacing traces. However, in one instance (with a state replacing
trace) the error performance of the revised algorithm was actually worse than the original
(see Figure 4.3). This anomaly was not seen for Watkins' Q(λ) (see Figure 4.6).
¹ The experiments in Wiering's original description of Fast Q(λ) did perform these local updates, and so
we do not repeat the experiments in the original paper [168, 169, 167].
[Figure 4.2 graphs: cumulative reward (left column) and mean squared error (right column)
against steps, for PW, Fast PW (+) and Fast PW (−) with accumulating (top), state
replacing (middle) and state-action replacing (bottom) traces.]
Figure 4.2: Comparison of PW Q(λ), Fast PW Q(λ)⁺ and Fast PW Q(λ)⁻ performance
profiles in the stochastic maze task. Results are the average of 20 runs. The parameters
were Q̂₀ = 100, α = 0.3, ε = 0.1 (low exploration rate), λ = 0.9 and ε_m = 1×10⁻³ for
regular Q(λ) and ε_m = 10⁻¹⁰ for the Fast versions. (left column) Total reward collected.
(right column) Mean squared error in the value function. (top row) With accumulating
traces. (middle row) With state replacing traces. (bottom row) With state-action replacing
traces.
The effect of exploratory actions on PW Q(λ) is also evident in these results. The PW Q(λ)
methods collected less reward and found a far less accurate Q-function in the case of a
high exploration rate than Watkins' methods (compare Figures 4.3 and 4.6). In contrast,
Watkins' variants collected similar or better amounts of reward but found far more accurate
Q-functions than Peng and Williams' methods in both the high and low exploration rate
cases. Similar results concerning the error were reported by Wyatt in [176]. However, this
example clearly demonstrates the benefit of off-policy learning under exploration in terms
of collected return.
[Figure 4.3 graphs: cumulative reward (left column) and mean squared error (right column)
against steps, for PW, Fast PW (+) and Fast PW (−) with accumulating, state replacing
and state-action replacing traces.]
Figure 4.3: Comparison of Peng and Williams' Q(λ) methods with a high exploration rate
(ε = 0.5). All other parameters are as in Figure 4.2. Note that the scale of the vertical axes
differs between experiment sets.
[Figure 4.4 diagram: a 20×20 grid maze.]
Figure 4.4: The large stochastic maze task. At each step the agent may choose one of four
actions (N, S, E, W). Transitions have probabilities of 0.8 of succeeding, 0.08 of moving the
agent laterally and 0.04 of moving in the direction opposite to that intended. Impassable walls
are marked in black and penalty fields of −4 and −1 are marked in dark and light grey
respectively. A reward of 100 is given for entering the top-right corner and −10 for the others.
Episodes start in random states and continue until one of the four terminal corner states is
entered.
[Figure 4.5 graphs: cumulative reward (left column) and mean squared error (right column)
against steps, for WAT, Fast WAT (+) and Fast WAT (−) with accumulating (top), state
replacing (middle) and state-action replacing (bottom) traces.]
Figure 4.5: Comparison of Watkins' Q(λ), Fast Watkins' Q(λ)⁻ and the revised Fast Watkins'
Q(λ)⁺ in the stochastic maze task. All parameters are as in Figure 4.2 (i.e. a low exploration
rate with ε = 0.1).
In addition to showing that the performance of Fast Q(λ)⁺ is similar to Q(λ) in the mean,
we performed a more detailed test. The agents were made to learn from identical experience
gathered over 2000 simulation steps in the small stochastic maze shown in Figure 4.7. At
each time step, the difference between the Q-functions of Q(λ) and the up-to-date Q-
functions of Fast Q(λ)⁺ and Fast Q(λ)⁻ was measured. The largest differences at any time
during the course of learning are shown in Table 4.1. The differences for Fast Q(λ)⁺ are all
in the order of ε_m or better. The differences for Fast Q(λ)⁻ are many orders of magnitude
greater.
[Figure 4.6 graphs: cumulative reward (left column) and mean squared error (right column)
against steps, for WAT, Fast WAT (+) and Fast WAT (−) with accumulating, state replacing
and state-action replacing traces.]
Figure 4.6: Comparison of Watkins' Q(λ) methods with a high exploration rate (ε = 0.5).
All other parameters are as in Figure 4.2.
[Figure 4.7 diagram: a 4×3 grid with terminal values of +1 at (4, 3) and −1 at (4, 2).]
Figure 4.7: A small stochastic maze task (from [130]). Rewards of −1 and +1 are given for
entering (4, 2) and (4, 3), respectively. On non-terminal transitions, r_t = −1/25.
4.3. BACKWARDS REPLAY 61
             Fast Q(λ)⁻   Fast Q(λ)⁺
PW-acc       0.7          1.7×10⁻¹⁵
PW-srepl     1.3          8.8×10⁻¹⁶
PW-sarepl    0.3          1.7×10⁻¹⁵
WAT-acc      1.3          7.6×10⁻¹³
WAT-srepl    2.5          4.2×10⁻¹⁰
WAT-sarepl   0.6          2.9×10⁻¹¹
Table 4.1: The largest differences from the Q-function learned by the original Q(λ) during the
course of 2000 time steps of experience within the small maze task in Figure 4.7. The
experiment parameters were ε_m = 10⁻⁹, α = 0.2, γ = 0.95 and λ = 1.0. The experience
was generated by randomly selecting actions.
Backwards-Replay-Watkins-Q(λ)-update
1) z ← 0                                        Initialise return to value of terminal state
2) for each i in t_T − 1, t_T − 2, …, t_0 do:
3)   z ← λ(r_{i+1} + γz) + (1 − λ)(r_{i+1} + γ max_a Q̂(s_{i+1}, a))
4)   Q̂(s_i, a_i) ← Q̂(s_i, a_i) + α_k (z − Q̂(s_i, a_i))
Figure 4.8: Lin's backwards replay algorithm modified for evaluating the greedy policy
(as Watkins' Q(λ)). The algorithm is applied upon entering a terminal state and may be
executed several times. Terminal states are assumed to have zero value (rewards for entering
a terminal state may be non-zero).
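Figure 4.8 translates almost directly into code. The sketch below is a hypothetical rendering (the function name and experience-tuple layout are assumptions for the example) that replays a finished episode backwards using the recursive λ-return:

```python
def backwards_replay_update(Q, episode, actions, alpha=0.1, gamma=0.95, lam=0.9):
    """Lin-style backwards replay with a Watkins-style (greedy) lambda-return.

    episode: list of (s, a, r_next, s_next) tuples in the order experienced;
    the final s_next is terminal and assumed to have zero value.
    """
    z = 0.0                                 # return estimate at the terminal state
    for (s, a, r, s_next) in reversed(episode):
        q_next = max(Q.get((s_next, b), 0.0) for b in actions)
        # recursive lambda-return: blend the replayed return with the 1-step estimate
        z = lam * (r + gamma * z) + (1 - lam) * (r + gamma * q_next)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (z - Q.get((s, a), 0.0))
```

Because each visited pair is touched once and only its successor's action values are queried, a full replay costs O(|A|) per step of the episode.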
The training experience has the advantage of providing the agent with a relatively good
behaviour from which it may bootstrap its own policy, and also greatly reduces the cost
of exploring the state space. Note that a key difference between this and the training
methods used by supervised learning is that the RL agent aims to actually improve upon the
training behaviour and not simply reproduce it. Experience replay has also been successfully
applied by Zhang and Dietterich for a job shop scheduling system [177], and for mobile robot
navigation [140].
When replaying the recorded experience a great learning efficiency boost can be gained by
replaying the experience in the reverse order to which it was observed. For example, if the
agent observed the experience tuples (s_t, a_t, r_{t+1}), (s_{t+1}, a_{t+1}, r_{t+2}), …, then a Q-learning
update is made to Q̂(s_{t+1}, a_{t+1}) before Q̂(s_t, a_t). In this way, the return estimate used to
update Q̂(s_t, a_t) may use a just-updated value of max_a Q̂(s_{t+1}, a), which itself may have
just changed to include the just-updated value of max_a Q̂(s_{t+2}, a), and so on. Even if 1-step
return estimates are employed in the backups, and experience is only replayed once, in-
formation about a new reward can still be propagated to many prior SAPs. Furthermore, if
λ-return estimates are employed then computational efficiency gains can also be found
by working backwards and employing the recursive form of the λ-return estimate (as in
Equation (3.24) or (3.37)). This is illustrated in a new version of the backwards replay
algorithm modified to use the same return estimate as Watkins' Q(λ) (see Figure 4.8). The
algorithm is extremely simple, can provide learning speedups and also has a natural compu-
tationally efficient implementation; it is just O(|A|) per step. It achieves its computational
efficiency far more elegantly than Fast Q(λ) by directly implementing the forwards view
of λ-return updates. By contrast, Fast Q(λ) performs two complex transformations on the
return estimate.
Figure 4.9 illustrates the advantage of using backwards replay over Q(λ) in the corridor
task shown in Figure 3.5. Note here that backwards replay with λ = 0 can be as good as or
better than Q(λ) (for any λ) where the learning rate is declined with 1/k (k(s, a) = kth
backup of Q̂(s, a)). Similar results are noted by Sutton and Singh [151]. As in this example,
they note that backwards replay reduces bias due to the initial value estimates in acyclic
environments, eliminating it totally in cases where α = 1 at the first value updates.
[Figure 4.9 graphs: learned value against state s for V*, BR(0), BR(0.9), Q(0) and Q(0.9);
(left panel) constant learning rate, (right panel) declining learning rate.]
Figure 4.9: The Q-functions learned by backwards replay and by Q(λ) after 1 episode in
the corridor task shown in Figure 3.5. Values of λ = 0, λ = 0.9 and Q̂₀ = 0 are tested.
(left) Learning with a constant α = 0.8. Backwards replay improves upon its eligibility
trace counterparts in both cases. This learning speed-up for backwards replay is derived
solely from employing more up-to-date information. (right) Learning with α = 1/k. With
any value of λ, backwards replay finds the actual return estimate, while Q(λ) finds it only
if λ = 1.
However, because of its dependence on future information, it is not clear how backwards
replay extends to the case of online learning in cyclic environments.
Truncated TD(λ)
In [30] Cichosz introduced the Truncated TD(λ) (TTD) algorithm to apply backwards
replay online. Figure 4.10 shows how TTD can be modified to be a greedy-policy evaluating,
exploration insensitive method. TTD also directly employs the λ-return due to a state or
SAP by maintaining an experience buffer from which its return is computed. To keep the
buffer to a reasonable length but still allow for online learning, only the last n experiences
are maintained. Updates are delayed; state s_{t−n} is updated at time t, when there is enough
experience to make an n-step truncated λ-return estimate (as introduced in Equation 3.37).
This delay in making backups can lead to the same inefficiencies in the exploration strategy
suffered by purely offline learning methods. As such, TTD is sometimes referred to as
semi-offline, as it still allows for non-episodic learning and exploration [168]. Also, the
method makes updates at a cost of O(n·|A|) per step and so it would seem there is no
computational advantage to learning in this way compared to the approximate method
described in Section 4.2. Thus, the primary benefit of this approach is that it directly
employs the λ-return estimate in updates and is simpler than an eligibility trace method
as a result. Cichosz also argues that since actual λ-return estimates are used, the method
can be applied more easily to a wider range of function approximators than is possible for
eligibility trace methods [31].
Replayed TD(λ)
Replayed TD(λ) is an adaptation of TTD that updates the most recent n states at each
time-step using the most recent n experiences [32] (see Figure 4.11).
Truncated-Watkins-Q(λ)-update(s_{t+1})
1) z ← max_a Q̂(s_{t+1}, a)
2) was-off-policy ← false
3) for each i in t + 1, …, t + 2 − n do:
4)   if was-off-policy:                             True when a_{i+1} was non-greedy
5)     z ← r_i + γ max_a Q̂(s_i, a)
6)   else:
7)     z ← λ(r_i + γz) + (1 − λ)(r_i + γ max_a Q̂(s_i, a))
8)   was-off-policy ← off-policy(s_i, a_i)
9) Q̂(s_{t−n}, a_{t−n}) ← Q̂(s_{t−n}, a_{t−n}) + α_k (z − Q̂(s_{t−n}, a_{t−n}))
Figure 4.10: Cichosz' Truncated TD(λ) algorithm modified for evaluating the greedy policy.
The above update is applied after every step. An experience buffer of the last n experiences
needs to be maintained, and the first and last n updates of an episode need special handling.
These extra details are omitted from the above algorithm (see [31] for full details).
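The inner loop of Figure 4.10 can be sketched over an explicit buffer. This is an illustration under assumptions (the tuple layout, the pluggable `off_policy` predicate, and a single learning rate are invented for the example; the special handling of the first and last n steps is omitted here too):

```python
def truncated_watkins_update(Q, buffer, actions, gamma=0.95, lam=0.9, alpha=0.1,
                             off_policy=lambda Q, s, a: False):
    """Sketch of one Truncated TD(lambda) backup of the oldest buffered pair.

    buffer: the last n experiences as (s, a, r_next, s_next), oldest first.
    The return is built backwards from the newest experience; beyond any
    non-greedy action it is cut back to a 1-step estimate.
    """
    tail_state = buffer[-1][3]
    z = max(Q.get((tail_state, b), 0.0) for b in actions)  # truncation value
    was_off_policy = False
    for (s, a, r, s_next) in reversed(buffer):
        q_next = max(Q.get((s_next, b), 0.0) for b in actions)
        if was_off_policy:
            z = r + gamma * q_next                          # restart at 1-step
        else:
            z = lam * (r + gamma * z) + (1 - lam) * (r + gamma * q_next)
        was_off_policy = off_policy(Q, s, a)
    s0, a0 = buffer[0][0], buffer[0][1]
    Q[(s0, a0)] = Q.get((s0, a0), 0.0) + alpha * (z - Q.get((s0, a0), 0.0))
```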
Replayed-Watkins-Q(λ)-update(s_t)
1) z ← 0                                       Initialise return to value of terminal state
2) for each i in t, …, t − n do:
3)   z ← λ(r_{i+1} + γz) + (1 − λ)(r_{i+1} + γ max_a Q̂(s_{i+1}, a))
Figure 4.11: Cichosz' Replayed TD(λ) modified for evaluating the greedy policy. The above
update is applied after every step.
Note that, for a SAP visited at time t, Q̂(s_t, a_t) will receive updates toward all of the follow-
ing n truncated λ-return estimates: z_{t+1}^{(λ,1)}, z_{t+2}^{(λ,2)}, …, z_{t+n}^{(λ,n)}. Clearly these return estimates
are not independent: all n returns include r_{t+1}, n − 1 include r_{t+2}, and so on. As a result
of updating a Q-value several times towards these similar returns, the algorithm will learn
Q-values that are much more strongly biased towards the most recent experiences than
other methods. In turn this could cause learning problems in highly stochastic environ-
ments (or, more generally, where the return estimate has high variance). There may exist
ways to counteract this (for example, by reducing the learning rate). Even so, it is likely
that the algorithm's aggressive use of experience outweighs these high variance problems,
and Cichosz reports some promising results. However, the algorithm also remains O(n·|A|)
per step (as TTD(λ)), and although it doesn't suffer the same delay in performing updates
that could be detrimental to exploration, immediate credit for actions is propagated to no
more than the last n states.
4.4. EXPERIENCE STACK REINFORCEMENT LEARNING 65
This section introduces the Experience Stack Algorithm. This new method can be seen as
a generalisation of Lin's offline backwards replay and also directly learns from the λ-return
estimate.
To allow the algorithm to work online, backups are made in a lazy fashion; states are
backed-up only when new estimates of Q-values are required (for the purposes of aiding
exploration) and available given the prior experience. Specifically, this occurs when the
learner finds itself in a state it has previously visited and not backed-up.
The details of the algorithm are best explained through a worked example. Consider the
experience in Figure 4.12. A learning episode starts in s_{t1} and the algorithm proceeds,
recording all experiences, until s_{t3} is entered (previously visited at t2). If we continue
exploring without making a backup to s_{t2}, we do so uninformed of the reward received
between t2 + 1 and t3, perhaps to re-collect some negative reward in sequence X. This is
the important disadvantage of an offline algorithm that we wish to avoid. To prevent this,
the algorithm immediately replays (backwards) the experience to update the states from s_{t3−1}
to s_{t2}, using the λ-return truncated at s_{t3}. This obtains a new Q-value at s_{t3} that can be
used to aid exploration. Each replayed experience is discarded from memory. States visited
prior to s_{t2} (sequence W) are not immediately updated. Putting exploration issues aside,
it is often preferable to delay backups for as long as possible in the expectation that the
experience yet to come will provide better Q-values to use in updates.
At a later point (t5) the agent takes an off-policy action. When sequence Y is eventually
updated, it will use a return estimate truncated at s_{t5}, the value of which will have been recently
updated following the experience in sequence Z and beyond. This is a significant improve-
ment over Watkins' Q(λ), which will make no immediate use of the experience collected in
sequence Z in updates to Y.
[Figure 4.12: a timeline of states st1 → st2 (= st3) → st4 → st5 → …, with the intervening experience sequences labelled W, X, Y and Z.]
Figure 4.12: A sequence of experiences. st2 is revisited at t3 and an off-policy action taken
at st5. States in sequence X (including st3) will be updated before those in sequences W, Y
or Z.
66 CHAPTER 4. EFFICIENT OFF-POLICY CONTROL
It is always the case that the earliest state in sequence j was observed as a successor to the
most recent SAP in sequence j − 1. Performing a push operation on an experience sequence
records an experience, and pop operations are used when replaying experience.
The ES-Watkins-replay procedure, shown in Figure 4.14, is used to replay experience such
that a new Q-value estimate at s_stop is obtained. The value of s′ provides the return
correction for the most recent SAP in the stack. s′ must be the successor of the SAP found
at top(top(es)) (i.e. the most recent SAP in the stack).
A counter, B(s), records the number of times s appears in the experience stack in order to
determine how many backups to s_stop experience replay can provide without having to
search through the recorded experience.
How experience is recorded and replayed is determined by the ES-Watkins-update procedure.
Like Watkins' Q(λ), it ensures that ES-Watkins-replay uses λ-return estimates that
are truncated at the point where an off-policy action is taken. Figure 4.13 shows the state of
the stack after the experience described in Figure 4.12. It contains the experience sequences
W, Y and Z from bottom to top (X has already been updated and removed). The ends
of each experience sequence define when return truncations occur. For example, due to the
exploratory action at t5, st5 starts a new experience sequence. Thus, the backup to st4 will
use only rt5 + γ max_a Q̂(st5, a), but Q̂(st5, a) will be up-to-date.
Bias Prevention. Why doesn't sequence Y simply extend sequence W in Figure 4.13?
(That is, why is the return truncated at the end of sequence W?) There is no requirement that
the return estimate used to back up st2−1 involve the actual observed return immediately
[Figure 4.13: the experience stack, bottom to top:
  W = sequence 1 (bottom): (st1, at1, rt1+1), (st1+1, at1+1, rt1+2), …, (st2−1, at2−1, rt2)
  Y = sequence 2: (st3, at3, rt3+1), …, (st4, at4, rt5)
  Z = sequence 3 (top): (st5, at5, rt5+1), …, (st6−1, at6−1, rt6)]
Figure 4.13: The state of the experience stack after the experience in Figure 4.12. The
end of each row (or experience sequence) determines where return truncations occur. The
rightmost states receive 1-step Q-learning backups.
following t2 − 1. Generally, if st+k = st, then the return including and following rt+k is just
as suitable. That is, if,

    E[ rt + Σ_{i=1}^{∞} γ^i rt+i ]  =  E[ rt + Σ_{i=1}^{∞} γ^i rt+i+k ]        (4.11)
However, Condition 4.11 will usually not hold when applying the experience stack algorithm.
For example, suppose that sequence X includes some unusually negative rewards. If the
backups to the states in W were made using a return excluding the rewards in sequence X
then the Q-values in sequence W would become biased (by being over-optimistic). In order
to prevent this biasing, the value of the state at which an experience replay ends is used to
provide an estimate of the future return to all prior states in the stack. In the example, st2
must be updated to include the return in sequences Y and X. The backups to states prior
to st2 should use a return truncated at st2. The algorithm achieves this simply by starting
a new experience sequence at the top of the stack to indicate that a return truncation is
required (step 13 of ES-Watkins-update).
Choice of Bmax. The parameter Bmax varies how many times a state may be revisited
before a backup is made. Its choice is problem dependent. With Bmax = 1, backups are
made on every revisit. If revisits occur often and at short intervals, then experience will
be frequently replayed, which also causes the return estimate to be frequently truncated; an
effect which is similar to lowering λ toward 0. This is in addition to the effect of truncations
that occur after taking off-policy actions.
However, with higher values of Bmax, the algorithm behaves more like an offline learning
method and exploration can benefit less frequently from up-to-date Q-values.
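The bookkeeping described so far (per-state occurrence counts, sequence truncation at off-policy actions, and a Bmax-triggered replay) can be sketched as a small data structure. This is an assumed illustration, not the thesis's ES-Watkins-update pseudocode; the value updates themselves are left out, and `pop` simply consumes transitions in the order a backwards replay would back them up.

```python
class ExperienceStack:
    """Sketch of the experience stack bookkeeping: a stack of experience
    sequences, plus a counter B(s) of how often s occurs as a recorded SAP."""

    def __init__(self, b_max=2):
        self.b_max = b_max
        self.sequences = [[]]   # bottom ... top; each is a list of transitions
        self.count = {}         # B(s)

    def record(self, s, a, r, s_next, was_greedy=True):
        """Push a transition. Returns True when entering s_next should
        trigger a backwards replay (it already occurs b_max times)."""
        if not was_greedy and self.sequences[-1]:
            # off-policy action: start a new sequence so that replayed
            # returns are truncated at this point
            self.sequences.append([])
        self.sequences[-1].append((s, a, r, s_next))
        self.count[s] = self.count.get(s, 0) + 1
        return self.count.get(s_next, 0) >= self.b_max

    def pop(self):
        """Consume the most recent transition (the order in which a
        backwards replay would back it up), dropping emptied sequences."""
        while self.sequences and not self.sequences[-1]:
            self.sequences.pop()
        if not self.sequences:
            return None
        s, a, r, s_next = self.sequences[-1].pop()
        self.count[s] -= 1
        return (s, a, r, s_next)
```

With b_max = 1 every revisit triggers a replay; higher values delay backups, as discussed above.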
Flushing the Stack. Entering a terminal state, s_term, automatically causes the entire
remaining contents of the experience stack to be replayed since s_stop = s_term and s_term
cannot occur in the experience stack (N.B. B(s_term) = 0 at all times). Otherwise, the stack
can be flushed at any time by calling ES-Watkins-replay(s_now, s_term).
Computational Costs. Since each state may appear in the experience stack no more
than Bmax times, the worst-case space-complexity of maintaining the experience stack is
O(|S| · Bmax). The total time-complexity is O(|A|) per experience when averaged over the
entire lifetime of the agent (as Fast Q(λ)). The actual time-cost per timestep may vary
greatly between steps.
Scope. This new technique can easily be adapted to use the return estimates employed
by many other methods. For example, an analogue of Naive Watkins' Q(λ) can be made by
omitting lines 1) and 2) from ES-Watkins-update. An analogue of TD(λ) can be made by
replacing all occurrences of Q̂(x, y) with V̂(x) and replacing step 6) with,

    6) z ← λ(r + γz) + (1 − λ)(r + γV̂(s′))
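Unrolling step 6) gives the familiar recursive λ-return. The sketch below (assumed names; treat the exact form of the recursion as a reconstruction) computes the replayed return backwards over a list of (r, s′) pairs, so that λ = 0 reduces to the 1-step target and λ = 1 to the return truncated only at the bootstrap state.

```python
GAMMA = 0.9

def replayed_return(transitions, v, lam, z_end):
    """Apply step 6) backwards over transitions [(r, s_next), ...]:
    z <- lam*(r + gamma*z) + (1 - lam)*(r + gamma*v[s_next]),
    starting from the bootstrap value z_end at the truncation point."""
    z = z_end
    for (r, s_next) in reversed(transitions):
        z = lam * (r + GAMMA * z) + (1 - lam) * (r + GAMMA * v[s_next])
    return z

v = {"s2": 5.0, "s3": 7.0}
steps = [(1.0, "s2"), (2.0, "s3")]
one_step = replayed_return(steps, v, lam=0.0, z_end=v["s3"])  # r1 + gamma*V(s2)
full = replayed_return(steps, v, lam=1.0, z_end=v["s3"])      # r1 + gamma*(r2 + gamma*V(s3))
```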
Frequent Revisits. In some tasks, such as problems with state aliasing, a single state
may be revisited for several consecutive steps. To prevent the method from using mainly
1-step returns, B(s, a) could be incremented only upon leaving a state. This would require
that the same action be taken until the state is left, although this is often a benefit while
learning with state-aliasing (as we will see in Chapter 7).
In general, there may be better ways to affect when experience is replayed than with the
Bmax parameter. If the purpose of making backups online is to aid exploration, then a better
method might be to try to estimate the benefit to exploration of replaying experience when
deciding whether to update a state.
A Note About Convergence. An open question remains about whether this algorithm
is guaranteed to converge upon the optimal Q-function. Intuitively, it should, and under
the same conditions as 1-step Q-learning since, in a sense, the algorithms differ only slightly.
Both methods approach Q* by estimating the expected return available under the greedy
policy. For general MDPs, the expected update made by both methods appears to be a
fixed point in Q̂ only where Q̂ = Q*.
However, the convergence proof of 1-step Q-learning follows from establishing a form of
equivalence to 1-step value iteration [59]. This relationship does not appear to directly
follow for multi-step return estimates. Moreover, no convergence proof has been established
for any control method with λ > 0 [145, 137].
approximately 10 minutes to complete on a Sun Ultra 5, and each graph point is the average of 15 trials. A
conservative estimate of the total execution time consumed to produce Figures 4.15 to 4.22 is 2050 machine
hours, or 12 machine weeks. In practice the experiment was made feasible by distributing the load over a
cluster of 60 workstations, reducing the real-time cost to approximately 34 hours.
4.5. EXPERIMENTAL RESULTS 71
The Fast Q(λ) machine precision parameter was ε_m = 10^-7 in all cases, and ε_offpol = 10^-4
throughout.
Attention is drawn in the following sections to the ways in which the algorithms are affected
by the different parameters.
The Effects of Q0. The most surprising result is that the initial Q-function, Q0, has such
a counter-intuitive effect on performance. The maze task has an optimal value-function, V*,
whose mean is approximately 68 and has maximum and minimum values of 99.5 and 45.6
respectively. The standard rule of thumb when using ε-greedy (and many other exploration
strategies) is to initialise the Q-function optimistically to encourage the agent to take untried
actions, or actions that lead to untried actions [150]. Yet overall, the performance was
generally worse with Q0 = 100 than when starting with a Q-function that has a higher
initial error given by a pessimistic bias (Figures 4.15 and 4.16 show Q0 varying over a
larger range than the other graphs). Subjectively, the best all-round performance in final
cumulative reward and MSE was obtained with Q0 = 50 for all algorithms. It is possible
that the reason for this is that the lower initial Q-values caused the agent to explore the
environment less thoroughly and settle upon a more exploiting policy more quickly.
Unlike the eligibility trace methods, the experience stack methods also still performed well
with very low initial Q-functions (compare the cumulative reward collected with Q0 = 0 on
all graphs).
Section 4.7 presents a likely explanation as to why optimistic initial Q-functions can be
harmful to learning.
Figure 4.24 shows an overlay of the different methods with a pessimistic initial Q-function.
The experience stack methods outperform the eligibility trace methods in almost all cases
except with high λ. The difference between the methods is even larger with lower Q0.
The Effects of λ. For Q0 < 100 the experience stack methods performed better than or no
worse than their eligibility trace counterparts across the majority of parameter settings. In
particular they were less sensitive to λ and achieved better performance with low λ as a
result. A discussion of the reasons for this is given in Section 4.6.
With Q0 = 100 the experience stack methods were most sensitive to λ and performed worse
than their eligibility trace counterparts in many instances. The experience stack methods
were also more sensitive to Bmax at this setting.
[Figure 4.15: paired plots of cumulative reward and mean squared error against λ (0.1 to 0.9) for panels Q0 = 100, 75, 50, 25, 0 and −25, with series ES-WAT-5, ES-WAT-10 and ES-WAT-50.]
Figure 4.15: Comparison of the effects of λ, Bmax and the initial Q-values on ES-Watkins
with a high exploration rate (ε = 0.5). Results for the end of learning after 200000 steps
in the Maze task. Performance becomes degraded at Q0 = 100, though less so with higher
λ. Performance is less sensitive to λ compared to Watkins' Q(λ) (most plots are more
horizontal than in Figure 4.16).
[Figure 4.16: paired plots of cumulative reward and mean squared error against λ (0.1 to 0.9) for panels Q0 = 100 down to −25, with series FastWAT-srepl and FastWAT-sarepl.]
Figure 4.16: Watkins' Q(λ) with a high exploration rate (ε = 0.5) after 200000 steps in the
Maze task. As with ES-Watkins, performance also becomes degraded at Q0 = 100. Performance
is more sensitive to λ and the trace type, and also degrades more with low Q0 than ES-Watkins.
[Figure 4.17: cumulative reward and mean squared error against λ for panels Q0 = 100, 50 and 0, with series ES-NWAT-1, ES-NWAT-2, ES-NWAT-5 and ES-NWAT-10.]
Figure 4.17: Comparison of the effects of λ and Bmax on ES-NWAT in the Maze task with a
high exploration rate (ε = 0.5).
[Figure 4.18: cumulative reward and mean squared error against λ for panels Q0 = 100, 50 and 0, with series FastPW-srepl, FastPW-sarepl and FastPW-acc.]
Figure 4.18: Comparison of the effects of λ, the trace type and the initial Q-values on Peng
and Williams' Q(λ) in the Maze task with a high exploration rate (ε = 0.5).
[Figure 4.19: cumulative reward and mean squared error against λ for panels Q0 = 100, 50 and 0, with series ES-WAT-1, ES-WAT-2, ES-WAT-5, ES-WAT-10 and ES-WAT-50.]
Figure 4.19: Comparison of the effects of λ, Bmax and the initial Q-values on ES-Watkins
in the Maze task with a low exploration rate (ε = 0.1).
[Figure 4.20: cumulative reward and mean squared error against λ (0.1 to 0.9) for panels Q0 = 100, 50 and 0, with series FastWAT-srepl, FastWAT-sarepl and FastWAT-acc.]
Figure 4.20: Comparison of the effects of λ, trace type and the initial Q-values on Watkins'
Q(λ) in the Maze task with a low exploration rate (ε = 0.1).
[Figure 4.21: cumulative reward and mean squared error against λ for panels Q0 = 100, 50 and 0, with series ES-NWAT-1, ES-NWAT-2, ES-NWAT-5, ES-NWAT-10 and ES-NWAT-50.]
Figure 4.21: Comparison of the effects of λ, Bmax and the initial Q-values on ES-NWAT in
the Maze task with a low exploration rate (ε = 0.1).
[Figure 4.22: cumulative reward and mean squared error against λ for panels Q0 = 100, 50 and 0, with series FastPW-srepl, FastPW-sarepl and FastPW-acc.]
Figure 4.22: Comparison of the effects of λ, trace type and the initial Q-values on Peng and
Williams' Q(λ) in the Maze task with a low exploration rate (ε = 0.1).
[Figure 4.23: cumulative reward and mean squared error against the learning rate schedule parameter (0.1 to 0.9) for FastWAT and ES-WAT variants; top row Q0 = 50, λ = 0.3; bottom row Q0 = 100, λ = 0.9.]
Figure 4.23: Comparison of the effects of the learning rate schedule on Watkins' Q(λ) and
ES-WAT. The top row presents favourable settings for ES-WAT. The bottom row presents
unfavourable settings. ε = 0.5 in both cases. Changes in the learning rate schedule had little
effect on the relative performance of the algorithms. Results were similar for ES-NWAT and
PW-Q(λ).
The Effects of Bmax. The new Bmax parameter appeared to be relatively easy to tune
in the maze task. With Q0 < 100, most settings of Bmax and λ provided improvements
over the original eligibility trace algorithms. In general, Bmax caused the greatest spread in
performance when Q0 was either very high or very low. For example, Bmax = 50 generally
resulted in the poorest relative performance where Q0 = 100 and the best performance with
pessimistic values (e.g. Q0 = 0). Intermediate values (Q0 = 50) gave the least sensitivity
to Bmax as the high values of Bmax switch from providing relatively good to relatively poor
performance.
With Q0 = 100, Bmax = 1 provided a sharp drop in performance compared to slightly
higher values (e.g. Bmax = 2 or Bmax = 3). A possible reason for this is that some states
may be revisited extremely soon regardless of the exploration strategy simply because the
environment is stochastic. As a result there is often little benefit to the exploration strategy
in learning about these revisits. However, the likelihood of a state being quickly revisited
by chance two, three or more times falls extremely rapidly with the increasing number of
revisits. In such cases it is likely that revisits occur as the result of poor exploration, in
which case the exploration strategy may be improved as a result of making an immediate
backup. Curiously, however, this phenomenon is not seen where Q0 < 100.
The Effects of Exploration. As expected, with low exploration levels Watkins' methods
performed very similarly to Peng and Williams' methods (compare Figure 4.19 with 4.21
and Figure 4.20 with 4.22).
However, the main motivation for developing the experience stack algorithm was to allow for
efficient credit assignment and accurate prediction, while still allowing exploratory actions
to be taken. With high exploration levels both of the non-off-policy methods still generally
outperformed Watkins' methods in terms of cumulative reward collected, but performed
worse in terms of their final MSE. This is the effect of trading longer, untruncated return
estimates (which allow temporal difference errors to affect more prior Q-values) for the
theoretical soundness of the algorithms (by using rewards following off-policy actions in the
return estimate).
But the best overall improvements in the entire experiment were found by ES-WAT at
Q0 = 50. At this setting the algorithm outperformed (or performed no worse than) ES-PW,
FastWAT and FastPW in terms of both cumulative reward and error across the entire
range of λ. This is a significant result as it demonstrates that Watkins' Q(λ) has been
improved upon to such an extent that it can outperform methods that don't truncate the
return upon taking exploratory actions.
The Effects of the Learning Rate. In Figures 4.15 to 4.22 the learning rate was
declined with each backup as in Equation 3.8, with the schedule parameter set to 0.5.4 By
chance, this appeared to be a good choice for all of the methods tested. Best overall
performance could be found in most settings with values between 0.3 and 0.5 (see Figure 4.23).
In work by Singh and Sutton [139], the best choice of learning rate has been shown to vary
with λ. This was also found to be the case here. However, unlike in their experiments,
here the learning rate schedule had little effect on the relative performances of the algorithms.
Also, the work by Singh and Sutton aimed to compare replace and accumulate trace
methods using a fixed learning rate. Several experiments were conducted here using a fixed
learning rate. This also had little effect on the relative performances, with the exception
that combinations of high α and λ caused the accumulate trace methods to behave very
poorly in most instances. Section 3.4.9 in the previous chapter suggests why.
Optimised Parameters. Figure 4.25 compares the different methods with optimised
Q0, λ, α and Bmax. In terms of cumulative reward performance, there is little difference
between the methods. However, the experience stack methods are markedly more rapid at
error reduction.
4 High values of the schedule parameter provide the fastest declining learning rate.
[Figure 4.24: two columns of plots, "Off-Policy, ε = 0.5" (FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ES-WAT-1) and "Non-Off-Policy, ε = 0.1" (FastPW-srepl, FastPW-sarepl, FastPW-acc), showing cumulative reward and mean squared error against λ (0.1 to 0.9).]
Figure 4.24: Overlay of results at the end of learning after 200000 steps in the Maze task.
Q0 = 50; the learning rate schedule parameter was 0.5.
[Figure 4.25: two columns of plots against steps (0 to 200000), "Off-Policy, ε = 0.5" (FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ES-WAT-3) and "Non-Off-Policy, ε = 0.1" (FastPW-srepl, FastPW-sarepl, FastPW-acc, ES-NWAT-1), showing cumulative reward (top) and mean squared error (bottom).]
Figure 4.25: Comparison of results during learning in the Maze task with optimised values
of Q0 and λ. The experience stack algorithms provided little improvement in the reward
collected but gave far faster error reduction in the Q-function.
with the highest values of λ. Therefore, it is reasonable to anticipate larger differences in
performance between the two approaches in environments where lower values of λ are best
for eligibility trace methods. In this case, backwards replay methods look likely to provide
stronger improvements since the learned Q-values are more greatly utilised.
[Figure 4.26: action a1 leads from s1 to s2; Q(s1, a1) = 10 and Q(s2, a1) = Q(s2, a2) = … = Q(s2, ak) = 10.]
Figure 4.26: A simple process in which optimistic initial Q-values slow learning. Rewards
are zero on all transitions.
4.7. INITIAL BIAS AND THE MAX OPERATOR. 83
high and so state-values and Q-values are very dependent upon their successors' values.
Although this idea is simple, it does not, to the best of my knowledge, appear in the existing
RL literature.6 The most closely related work appears to be that of Thrun and Schwartz in
[157]. They note that the max operator can cause a systematic overestimation of Q-values
when look-up table representations are replaced by function approximators.
Examples of methods that use max_a Q(s, a) in their return estimates are: value-iteration,
prioritised sweeping, Q-learning, R-learning [132], Watkins' Q(λ) and Peng and Williams'
Q(λ). Similar problems are also expected with "interval estimation" methods for determining
error bounds in value estimates [62].7 Methods which are not expected to suffer in
this way include TD(λ), SARSA(λ) and policy iteration (i.e. methods that evaluate fixed
policies, not greedy ones).
4.7.1 Empirical Demonstration
Value-Iteration
The effect of initial bias on value-iteration was evaluated on several different processes with
known models: the 2-way corridor (Figure 4.28), the small maze in Figure 4.7 and the large
maze in Figure 4.4. In each experiment an initial value function, V0, was chosen with either
an optimistic bias, V0^{A+}, or the same amount of pessimistic bias, V0^{A−}:

    V0^{A+}(s) = V*(s) + bias,    (4.13)
    V0^{A−}(s) = V*(s) − bias,    (4.14)

where "bias" is a positive number and V* is the known solution. This ensures that both
the optimistic and pessimistic methods start the same maximum-norm distance from the
desired value function. This setup is atypical since V* is usually not known in advance
and it also provides value-iteration with some information about the initial policy. However,
with knowledge of the reward function it is often possible to estimate the maximum and
minimum values of V*. A second set of starting conditions was also tested:
    V0^{B+}(s) = max_{s′} V*(s′) + bias,    (4.15)
    V0^{B−}(s) = min_{s′} V*(s′) − bias.    (4.16)

Figure 4.27 compares these initial biasing methods.
Table 4.7 shows the number of applications of update 2.21 to all states in the process
required by value-iteration until V̂ has converged upon V* to within some small degree of
error. bias = 50 in all cases. In all tasks, the pessimistic initial bias ensured convergence in
the fewest updates.
With the corridor task, in the optimistic case, the number of sweeps until termination can
be made arbitrarily high by making γ sufficiently close to 1. However, if all the estimates
start below their lowest true value, then the number of sweeps never exceeds the length of
the corridor since, in this deterministic problem, after each sweep at least one more state
leading to the goal has a correct value.
6 Similar problems are known to occur with applied dynamic programming algorithms. Examples are
continuously updating distance vector network routing algorithms (such as the Bellman-Ford algorithm)
[108]. I thank Thomas Dietterich for pointing out the relationship.
7 I thank Leslie Kaelbling for pointing this out.
[Figure 4.27: two panels sketching the biasing methods: V0^{A+}(s) and V0^{A−}(s) offset from V*(s) by ±bias, and V0^{B+}(s) and V0^{B−}(s) set above max_s V*(s) and below min_s V*(s).]
Figure 4.27: Initial biasing methods.
[Figure 4.28: the 2-way corridor; states s1 … s19 with actions al and ar, a −1 terminal reward at the left end and a +1 terminal reward at the right end.]
[Table 4.7: sweep counts for each task under each initial bias.]
Table 4.7: Comparison of the effects of initial value bias on the required number of
value-iteration sweeps over the state-space until the error in V̂ has become insignificant
(max_s |V*(s) − V̂(s)| < 0.001). Results are the average of 30 independent trials.
Q-Learning
The effect of the initial bias on Q-learning is shown in Table 4.8. The Q-learning agents
were allowed to roam the 2-way corridor and the small maze environments for 30 episodes.
For the large maze, 200000 time steps were allowed. The Q-functions for the agents were
initialised in a similar fashion to the value-iteration case, but with an initial bias of 5.
Throughout learning, random action selection was used to ensure that the learned Q-values
could not affect the agent's experience. At the end of learning, the mean squared error in
the learned value-function, max_a Q̂(s, a), was measured. In all cases, the pessimistic initial
bias provided the best performance.
                         Initial Bias
Task              Q0^{A−}(s)   Q0^{A+}(s)   Q0^{B−}(s)   Q0^{B+}(s)
2-Way Corridor       1.0          20.0         19.3         20.6
Small Maze           1.2          22.1         18.9         24.9
Large Maze           3.1          12.4          7.4        323.0

Table 4.8: Comparison of the effects of initial Q-value bias on Q-learning. Values shown
are the mean squared error, Σ_s (V*(s) − max_a Q̂(s, a))² / |S|, at the end of learning. Results
are the average of 100 independent trials.
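A version of the corridor comparison is simple to reproduce in simulation. The sketch below runs tabular Q-learning under purely random action selection from Q0 = Q* ± 5; all settings (γ = 0.9, a fixed step size of 0.2, 30 episodes starting mid-corridor, a single trial) are assumptions and need not match the thesis's schedule.

```python
import random

N, GAMMA, ALPHA, BIAS, EPISODES = 19, 0.9, 0.2, 5.0, 30
V_STAR = [GAMMA ** (N - i) for i in range(1, N + 1)]   # optimal: walk right

def q_star():
    q = {}
    for i in range(N):
        q[(i, +1)] = 1.0 if i == N - 1 else GAMMA * V_STAR[i + 1]
        q[(i, -1)] = -1.0 if i == 0 else GAMMA * V_STAR[i - 1]
    return q

def final_mse(bias):
    random.seed(0)                       # same experience for both settings
    q = {k: v + bias for k, v in q_star().items()}
    for _ in range(EPISODES):
        i = N // 2
        while 0 <= i < N:
            step = random.choice([-1, +1])     # purely random behaviour
            j = i + step
            if j < 0:    r, boot = -1.0, 0.0
            elif j >= N: r, boot = +1.0, 0.0
            else:        r, boot = 0.0, GAMMA * max(q[(j, -1)], q[(j, +1)])
            q[(i, step)] += ALPHA * (r + boot - q[(i, step)])
            i = j
    return sum((V_STAR[i] - max(q[(i, -1)], q[(i, +1)])) ** 2
               for i in range(N)) / N

mse_optimistic, mse_pessimistic = final_mse(+BIAS), final_mse(-BIAS)
```

As in Table 4.8, the max operator sustains the optimistic bias (it always selects the most overestimated action) while underestimates are corrected much more quickly.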
where the bonus, b, is a positive value that declines with the number of times a has been
taken in s. The bonus should decline as less information remains to be gained about the
effects of taking a in s on collecting reward. The effect of the bonuses is always to make
the Q-values of actions over-optimistic until the environment is thoroughly explored. As
a result, the idea that optimistic initial Q-values can actually be a hindrance to learning
often comes as a counter-intuitive idea to many researchers in RL.
[Figure 4.29: average MSE against steps (×1000, 0 to 300) for two curves, Q-pess and Q-opt, on the large maze.]
Figure 4.29: The effect of initial bias on two Q-learning-like algorithms on the large maze
task. The Q-pess method distinguishes between optimism for exploration and real Q-value
predictions (by maintaining a separate function, B, that is updated using the Q-learning
update) and starts with a pessimistic Q-function. The vertical axis measures the mean
squared error in the learned Q-function (as in Table 4.8). Both methods share identical
exploration strategies.
4.7. INITIAL BIAS AND THE MAX OPERATOR. 87
B0 = 100 = Q0 -opt so that Q-pess may follow an equivalent -greedy exploration strategy
as the Q-learner by hoosing arg maxa B (s; a) with probability at ea h step. However, the
Q-pess method also maintains and updates a Q-fun tion using exa tly the same update as
Q-opt, although di erently initialised. The di erent Q-fun tions are initialised to have the
same size of error from Q. For Q-opt, the error gives an optimisti Q0 , and for Q-pess, it
is hosen to give a pessimisti one.
In this ase separating optimism from exploration has allowed the optimal Q-fun tion to be
approa hed mu h more qui kly without a e ting exploration at all. Still faster onvergen e
an be found with B -pess by hoosing a higher Q0.
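The separation can be sketched directly. In the fragment below (an illustrative construction with assumed names and a toy 5-state chain, not the thesis's implementation), B is initialised optimistically and drives ε-greedy action selection, while Q receives exactly the same Q-learning updates from a pessimistic start; an ordinary optimistic Q-learner run on the same experience is reproduced exactly by B, so the split costs exploration nothing.

```python
import random

GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1
ACTIONS = ("left", "right")
GOAL = 4

def q_update(table, s, a, r, s_next, terminal):
    boot = 0.0 if terminal else GAMMA * max(table[(s_next, b)] for b in ACTIONS)
    table[(s, a)] += ALPHA * (r + boot - table[(s, a)])

# B: optimistic, used only to select actions.  Q: pessimistic, used only
# for predictions.  Q_opt: a plain optimistic Q-learner for comparison.
B = {(s, a): 10.0 for s in range(5) for a in ACTIONS}
Q = {(s, a): -10.0 for s in range(5) for a in ACTIONS}
Q_opt = dict(B)

random.seed(1)
s = 2
for _ in range(300):
    if random.random() < EPS:
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: B[(s, x)])   # greedy w.r.t. B
    s_next = max(0, min(GOAL, s - 1 if a == "left" else s + 1))
    r, terminal = (1.0, True) if s_next == GOAL else (0.0, False)
    for table in (B, Q, Q_opt):                     # identical updates
        q_update(table, s, a, r, s_next, terminal)
    s = 2 if terminal else s_next
```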
Why was the worst overall performance by the experience stack algorithms found where the
initial Q-function was optimistic and λ was low? (See Q0 = 100 in Figures 4.15, 4.17, 4.19 and
4.21.)
Consider the example experience in Figure 4.30 and, as before, assume that γ = 1 and r = 0
on all transitions. States st1 and st2 are so far unvisited, but st3 has been frequently visited
and its true value is now known (for this example, it is only important that max_a Q̂(st2, a) >
max_a Q̂(st3, a)).
If λ = 0 and backwards replay is employed, although Q(st2, at2) may be lowered, this
adjustment will not immediately reach st1 since max_a Q(st2, a) does not change. Thus the
benefit of using backwards replay in this situation is destroyed by the combination of the
optimistic Q-values at st2 and the use of a single-step return (although this is no worse than
single-step Q-learning).
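This failure is easy to verify numerically. The fragment below reproduces the situation of Figure 4.30 with assumed concrete values (three actions per state, γ = 1, r = 0, and α = 1 so that backups are applied in full):

```python
GAMMA, ALPHA = 1.0, 1.0
ACTIONS = ("a", "b", "c")

def replay(q, lam):
    """Back up st2 then st1 (r = 0 everywhere), bootstrapping the
    replayed return from max_a Q(st3, a)."""
    q = dict(q)
    z = GAMMA * max(q[("st3", x)] for x in ACTIONS)   # return at st2
    q[("st2", "a")] += ALPHA * (z - q[("st2", "a")])
    v2 = max(q[("st2", x)] for x in ACTIONS)          # freshly updated max
    target = GAMMA * (lam * z + (1 - lam) * v2)       # lambda-return at st1
    q[("st1", "a")] += ALPHA * (target - q[("st1", "a")])
    return q

q0 = {(s, x): 10.0 for s in ("st1", "st2") for x in ACTIONS}
q0.update({("st3", x): 0.0 for x in ACTIONS})

after_1step = replay(q0, lam=0.0)  # Q(st1, a) untouched: max at st2 is still 10
after_full = replay(q0, lam=1.0)   # Q(st1, a) drops to the replayed return
```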
However, as λ grows and max_a Q(s, a) weighs less in the return estimates compared to
the actual reward, more significant adjustments to Q(st1, at1) will follow (this is true of
both backwards replay and the eligibility trace methods). However, as noted in Section 4.6
there may be little benefit to using the experience stack algorithm with high λ since SAPs
are removed from the experience history after they are updated. It was argued that the
additional return truncations this causes may actually aid backwards replay and offset this
problem; yet it has been shown here that truncated returns can cause backwards replay to be
markedly less effective if Q0 is optimistic. Notably, the experience stack algorithms perform
much worse than the original algorithms in the above experiments only where λ is high and
the Q-function is optimistic. This is contrary to the existing rules of thumb for choosing
good parameter settings and resulted in substantial initial difficulties in demonstrating any
good performance with the experience stack algorithm. The true nature of the method only
became clear when examining different Q0. There appears to be no previous experimental
work in the literature that compares algorithms using different Q0.
In the experiments in Figures 4.15-4.22 in Section 4.5, in almost all cases where Q0 < 100
and λ < 0.9, each experience stack method outperforms its eligibility trace counterpart,
with the exception of a few cases with very high Bmax. We also see that the experience
stack methods are much more robust to the choice of Q0 than the trace methods, except
for Q0 = 100.
Can this problem be avoided by using the method of separating exploration bonuses from
predictions discussed in Section 4.7.3? Note that for the off-policy results in Figures 4.15 and 4.16,
by optimising Q0 such that cumulative reward is maximised (B0 = 25 in the dual learning
method), the experience stack method looks better than any result obtained by Watkins'
Q(λ). However, at this setting the error performance is poor. It is possible to speculate that
this could be avoided by choosing Q0 = 75 as the Q-function used to generate predictions
in the same experiment. However, since the error also depends upon the given experience
(which depends upon B0), to perform a fair comparison one would need to run a series
of experiments where Q0 and B0 are varied to determine where it is possible to provide
better cumulative reward and error than Watkins' Q(λ). These experiments have not been
performed.
4.8. SUMMARY 89
[Figure 4.30: a chain … → st1 → st2 → st3; the three actions available at st1 and at st2 each have Q-value 10, and those at st3 have Q-value 0.]
Figure 4.30: A sequence of experience in a process similar to the one in Figure 4.26. Q-values
before the experience are labelled above the actions. Single-step backwards replay
(λ = 0) performs poorly here. Algorithms that use multistep return estimates (λ > 0) are
less affected by the initial bias than single-step methods.
4.7.6 Initial Bias and SARSA(λ)
In a comparison of different eligibility trace schemes by Rummery in [128], SARSA(λ) was
shown to outperform other versions of Q(λ) in terms of policy performance. The algorithms
were tested under a semi-greedy exploration policy and so it is reasonable to assume that an
optimistic initial Q-function was employed. In this scenario, and in the light of the above
results, it seems likely that SARSA(λ) would suffer less than Peng and Williams' Q(λ) and
Watkins' Q(λ), since it does not explicitly employ the max operator. Performing rigorous
comparisons of these methods is difficult since the exploration method used strongly affects
how the methods differ: under a purely greedy policy, Peng and Williams' Q(λ), Watkins'
Q(λ) and SARSA(λ) are very similar methods. Such a comparison should also take into
account the accuracy of the learned Q-function. In this respect, it is straightforward to
construct situations in which SARSA(λ) performs extremely poorly while following a non-greedy
policy.
4.8 Summary
Over the history of RL an elegant taxonomy has emerged that differentiates RL techniques
by the return estimates they learn from. While eligibility trace methods are a well established
and important RL tool that can learn the expectation of a variety of return
estimates, the traces themselves make understanding and analysing these methods difficult.
This is especially true of efficient (but more complex) techniques for implementing traces
such as Fast-Q(λ).
In Section 3.4.8 we saw that the need for eligibility traces arises only from the need for online
learning; simpler and naturally efficient alternatives exist if the environment is acyclic
or if it is acceptable to learn offline. In Section 3.4.8 we also saw that (at least for accumulate
trace variants) eligibility trace methods don't closely approximate their forward view
counterparts and can suffer from higher variance in their learned estimates as a result. This
led to the idea that the forward view methods which directly learn from λ-return estimates
might be preferable if they could be applied online. In addition, with forward-view methods
it is straightforward and natural to apply backwards replay to derive additional efficiency
gains at no additional computational cost, although it is less obvious how to learn online.
Figure 4.31: Improvement space for experience stack vs. eligibility trace control methods. + denotes that the analysis suggests that the learning speed of a backwards replay method is expected to be as good as or better than for the related eligibility trace method. ?+ and ?- denote that the analysis was inconclusive but the experimental results were positive or negative respectively.
We have seen how backwards replay can be made to work effectively and efficiently online
by postponing updates until the updated values are actually needed. This technique can
be adapted to use most forms of truncated return estimate. Analogues of TD(λ) [148],
SARSA(λ) [128] and the new importance sampling eligibility trace methods of Precup
[111] are easily derived. In general, the method is as computationally cheap as the fastest
way of implementing eligibility traces but is much simpler due to its direct application of
the return estimates when making backups. As a result it is expected that further analysis
and proofs about the online behaviour of the algorithm will follow more easily than for the
related eligibility trace methods.
The focus in this chapter was to find an effective control method that doesn't suffer from
the "short-sightedness" of Watkins' Q(λ) and also doesn't suffer from unsoundness under
continuing exploration (i.e. as can occur with Peng and Williams' Q(λ) or SARSA(λ)).
When should the experience stack method be employed? The experimental results have
shown that, at least in some cases, using backwards replay online can provide faster learning
and faster convergence of the Q-function than the trace methods. Improvements in all cases
in all problem domains are not expected (nor was this found in the experiments). However,
the experimental results (supported by additional analyses) have led to a characterisation
of its performance that is shown in Figure 4.31.
In summary,
- Expect little benefit from using online backwards replay compared to eligibility trace methods with values of λ close to 1.
- With low (and possibly intermediate) values of λ, always expect performance improvements (or at least no performance degradation).
- Expect variants employing the max operator in their estimate of return (e.g. ES-WAT and ES-NWAT) to work poorly with high initial Q-values.
- Expect the algorithm to always provide improvements in acyclic tasks except where λ = 1 (i.e. non-bootstrapping), and so performs identical overall updates to the existing trace or Monte Carlo methods.
In addition, the initial Q-function has been highlighted as having a major and strong
effect upon the learning speed of several reinforcement learning algorithms. Previously,
even in work examining the effects of initial bias or λ, this has not been considered to be an
important factor affecting the relative performance of algorithms, and is often omitted from
the experimental method [171, 151, 139, 106, 150, 31]. The findings here suggest that it can
be at least as important to optimise Q0 as it is to optimise α and λ, and the choice of Q0
affects different methods in different ways.
Chapter 5
Function Approximation

Chapter Outline
This chapter reviews standard function approximation techniques used to represent value functions and Q-functions in large or non-discrete state-spaces. The interaction between bootstrapping reinforcement learning methods and the function approximators' update rules is also reviewed. A new general but weak theorem shows that general discounted return estimating reinforcement learning algorithms cannot diverge to infinity when a form of "linear" function approximator is used for approximating the value-function or Q-function. The results are significant insofar as examples of divergence of the value-function exist where similar linear function approximators are trained using a similar incremental gradient descent rule. A different "gradient descent" error criterion is used to produce a training rule which has a non-expansion property and therefore cannot possibly diverge. This training rule is already in use in reinforcement learning.
how can useful inferences be made about the parts of the environment not visited?
Reinforcement learning turns to techniques more commonly used for supervised learning.
Supervised learning tackles the problem of inferring a function from a set of input-output
examples, or how to predict the desired output for a given input. More generally, the
technique of learning an input-output mapping can be described as function approximation.
This chapter examines the use of function approximators for representing value functions
and Q-functions in continuous state-spaces. The general problem being solved still remains
one of learning to predict expected returns from observed rewards (a reinforcement
learning problem). However, in this context, the function approximation and generalisation
problems are harder than they would be in a supervised learning setting since the training
data (the set of input-output examples) cannot be known in advance. In fact, in the
majority of cases, the training data is determined in part by the output of the learned
function. This causes some severe difficulties in the analysis of RL algorithms, and in many
cases, methods can become unstable.
Sections 5.2-5.5 review common methods for function approximation in reinforcement learning.
Linear methods are focused upon as they have been particularly well studied by RL
researchers from a theoretical standpoint, and have also had a moderate amount of practical
success. Section 5.6 examines the bootstrapping problem which is the source of instability
when combining function approximation with reinforcement learning. Section 5.7 introduces
the linear averager scheme which differs from more common linear schemes only in
the measure of error being minimised. However, also in this section, a new proof establishes
the stability of this method with all discounted return estimating reinforcement learning
algorithms by demonstrating their boundedness. Section 5.8 concludes.
Figure 5.1: (left) The Mountain Car Task. (right) An inverted value function for the Mountain Car task showing the estimated value (steps-to-goal) of a state. This figure is a learned function using a method presented in a later section; the true function is much smoother but still includes the major discontinuity between where it is possible to get to the goal directly, and where the car must reverse away from the goal to gain extra momentum.
between the states, or more generally, an Lp-norm (or Minkowski metric), (see [8])

$$d_p(s, q) = \left( \sum_{j=1}^{k} |s_j - q_j|^p \right)^{1/p}$$
Nearest Neighbour
The output is simply the value of the instance nearest to the query point:

$$V(q) = v_i$$

where,

$$i = \operatorname*{argmin}_{j \in [1..n]} d(s_j, q)$$
with ties broken arbitrarily. Although computationally relatively fast, a disadvantage of this approach is that the resulting value function will be discontinuous between neighbourhoods.
Kernel Based Averaging
In order to produce a smoother (and better fitting) output function, the values of many
instances can be averaged together, but with nearby instances weighted more heavily in the
output than those further away. How heavily the instances are weighted in the average is
controlled by a weighting kernel (or smoothing kernel) which indicates how relevant each
instance is in predicting the output for the query point. For instance, we might use a
Gaussian kernel:

$$K(s, q) = e^{-\frac{d(s, q)^2}{2\sigma^2}},$$

where the parameter σ controls the kernel width. Other possibilities exist; the main
criteria for a kernel are that its output is at a maximum at its centre and declines to zero
with increasing distance from it. The weights for a weighted average can now be found by
normalising the kernel and an output found:

$$V(q) = \frac{\sum_i^n K(s_i, q)\, v_i}{\sum_i^n K(s_i, q)}$$

Atkeson, Moore and Schaal provide an excellent discussion of this form of locally weighted
representation in [8] and [1].
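The two schemes above can be sketched in a few lines. The one-dimensional states, stored values and kernel width below are illustrative only, not taken from the experiments in this thesis:

```python
import math

def gaussian_kernel(s, q, sigma=0.5):
    # K(s, q) = exp(-d(s, q)^2 / (2 sigma^2)), d being the distance between s and q
    d = abs(s - q)  # one-dimensional state for simplicity
    return math.exp(-d * d / (2 * sigma * sigma))

def nearest_neighbour_value(instances, q):
    # V(q) = v_i for the stored instance s_i closest to the query point q
    _, v_i = min(instances, key=lambda sv: abs(sv[0] - q))
    return v_i

def kernel_average_value(instances, q, sigma=0.5):
    # Weighted average: V(q) = sum_i K(s_i, q) v_i / sum_i K(s_i, q)
    weights = [gaussian_kernel(s, q, sigma) for s, _ in instances]
    return sum(w * v for w, (_, v) in zip(weights, instances)) / sum(weights)

# Three stored instances (state, value):
instances = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
print(nearest_neighbour_value(instances, 0.1))        # 1.0: nearest instance wins outright
print(round(kernel_average_value(instances, 0.5), 3)) # 2.0: a smooth blend of neighbours
```

The nearest-neighbour output jumps discontinuously as the query crosses a neighbourhood boundary, while the kernel average varies smoothly, which is exactly the contrast drawn above.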
function approximator. ii) Averagers. Here the learned values of parameters may have an
easily understandable meaning. For example, the parameters may represent the values of
prototype states as in Section 5.2. These methods can be shown to be more stable under a
wider range of conditions ([159, 49]). iii) State-aggregation methods where the state-space
is partitioned into non-overlapping sets. Each set represents a state in some smaller state-space
to which standard RL methods can directly be applied. iv) Table lookup, which is a
special case of state-aggregation.
We assume that there are as many components in θ as there are in x. The reason that this
is called a linear function is because the output is formed from a linear combination of the
inputs:

$$\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

and not some non-linear combination. Alternatively, we might note that Equation 5.1 is
linear because it represents the equation of a hyper-plane in n - 1 dimensions. This might
appear to limit function approximators that employ linear output functions to representing
only planar functions. Happily, through careful choice of φ this need not be the case. In
fact we can see that the nearest neighbour and kernel based average methods are linear
function approximators where φ_i is defined as:

$$\phi(s_q)_i = \frac{K(s_q, s_i)}{\sum_k^n K(s_q, s_k)}. \qquad (5.2)$$
under the standard (Robbins-Monro) conditions for convergence of stochastic approximation:
$\sum_{p=1}^{\infty} \alpha_{ip} = \infty$ and $\sum_{p=1}^{\infty} \alpha_{ip}^2 < \infty$ (which also implies that all weights are updated
infinitely often) [21, 127, 11].
Different error criteria yield different update rules; another is examined later in this chapter.
There is a close relationship between update 5.5 and the update rules used by the eligibility
trace methods in Chapter 3 (which find the LMS error in a set of return estimates). Here
x_i represents the size of parameter θ_i's contribution to the function output. With x_i = 0,
θ_i has no contribution to the output and so is ineligible for change.
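The delta rule of update (5.5), with its "eligibility" reading, can be sketched as follows. The feature vector, target and learning rate are arbitrary illustrative values:

```python
def linear_output(x, theta):
    # f(x, theta) = sum_i x_i * theta_i
    return sum(xi * ti for xi, ti in zip(x, theta))

def lms_update(theta, x, z, alpha=0.1):
    # Delta rule: theta_i <- theta_i + alpha * x_i * (z - f(x, theta)).
    # Each parameter moves in proportion to its contribution x_i, so a
    # parameter with x_i = 0 is "ineligible" and is left unchanged.
    error = z - linear_output(x, theta)
    return [ti + alpha * xi * error for ti, xi in zip(theta, x)]

theta = [0.0, 0.0]
x, z = [1.0, 0.5], 1.0
for _ in range(200):
    theta = lms_update(theta, x, z)
print(round(linear_output(x, theta), 4))  # -> 1.0: the output error is driven to zero
```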
Finally, with the exception of some special cases, the learned parameters themselves may
have no meaning outside of the function approximator. There is (typically) no sense in
which a parameter could be considered by itself to be a prediction of the output. The set of
parameters found is simply that which happens to minimise the error criterion.
Throughout the rest of this chapter, the method presented here is referred to as the linear
least mean square method, to differentiate it from methods that learn using other cost
metrics.
5.4.2 Step Size Normalisation
Finding a sensible range of values for α in update 5.5 that allows for effective learning is
more difficult than with the RunningAverage update rule used by the temporal difference
where $\vec\theta' = \vec\theta + \Delta\vec\theta$ is the parameter vector after training with $(\vec x_p, z_p)$. To find a learning
rate that makes the full step, Equation 5.6 should be solved for $f(\vec x_p, \vec\theta') = z_p$:

$$z_p = f(\vec x_p, \vec\theta) + \sum_i x_{ip}^2 \alpha_{ip} \left( z_p - f(\vec x_p, \vec\theta) \right)$$
$$1 = \sum_i x_{ip}^2 \alpha_{ip}, \qquad (5.7)$$

which should hold in order to make a full step. We can now scale this step size,

$$\alpha'_p = \sum_i x_{ip}^2 \alpha_{ip}, \qquad (5.8)$$

so that choosing $\alpha'_p = 1$ results in the full step to $z_p$, and $\alpha'_p = 0$ results in no learning.
If a single global learning rate is desired ($\alpha_{ip} = \alpha_{jp}$ for all i and j), then (from Equation
5.8) the normalised learning rate is given straightforwardly as,

$$\alpha_{ip} = \frac{\alpha'_p}{\sum_i x_{ip}^2},$$

where $\alpha'_p$ is the new global learning rate at update p.
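The normalisation of (5.8) can be checked numerically. The feature vector and target below are invented for illustration; with the global rate set to 1, a single update lands exactly on the target:

```python
def normalized_alphas(x, alpha_prime):
    # alpha_ip = alpha'_p / sum_i x_ip^2, so that alpha'_p = 1 gives the
    # full step to the target z_p and alpha'_p = 0 gives no learning.
    denom = sum(xi * xi for xi in x)
    return [alpha_prime / denom for _ in x]

def linear_output(x, theta):
    return sum(xi * ti for xi, ti in zip(x, theta))

def lms_update(theta, x, z, alphas):
    error = z - linear_output(x, theta)
    return [ti + a * xi * error for ti, xi, a in zip(theta, x, alphas)]

theta = [0.0, 0.0, 0.0]
x, z = [0.5, 1.0, 0.25], 2.0
theta = lms_update(theta, x, z, normalized_alphas(x, 1.0))
print(linear_output(x, theta))  # -> 2.0 (up to float rounding): one full step
```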
$$x_{near} = \begin{cases} 1, & \text{if } 0 \le s < 1/3, \\ 0, & \text{otherwise.} \end{cases} \qquad (5.9)$$
$$x_{mid} = \begin{cases} 1, & \text{if } 1/3 \le s < 2/3, \\ 0, & \text{otherwise.} \end{cases} \qquad (5.10)$$
$$x_{far} = \begin{cases} 1, & \text{if } 2/3 \le s < 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (5.11)$$
If s has more than one dimension, then the state-space might be quantised into hyper-cubes.
However the partitioning is done, it is assumed that the regions are non-overlapping and
that only one input feature is ever active (e.g. φ(s) = [0, 1, 0, 0, 0, 0, 0, 0]). That is to say
that subsets of the original space are aggregated together into a smaller discrete space. The
nearest neighbour method presented in Section 5.2 and table look-up are special cases of
state aggregation.
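A state-aggregation input mapping in the style of features (5.9)-(5.11) can be sketched directly; the three-way split of [0, 1) matches the example above:

```python
def aggregate_features(s, n_regions=3):
    # Partition [0, 1) into n_regions equal intervals. Exactly one binary
    # feature is active for any state s, as state aggregation requires.
    x = [0] * n_regions
    x[min(int(s * n_regions), n_regions - 1)] = 1
    return x

print(aggregate_features(0.1))  # [1, 0, 0]  ("near" region, 0 <= s < 1/3)
print(aggregate_features(0.5))  # [0, 1, 0]  ("mid" region, 1/3 <= s < 2/3)
print(aggregate_features(0.9))  # [0, 0, 1]  ("far" region, 2/3 <= s < 1)
```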
The main disadvantage of this form of input mapping is that the state space may need to
be partitioned into tiny regions in order to provide the necessary resolution to solve the
problem. If it is not clear from the outset how partitioning should be performed, then
simply partitioning the state-space into uniformly sized hypercubes will typically result in
a huge set of input features (exponential in the number of dimensions of the input space).
Similar problems follow with non-regular but evenly distributed partitioned regions, as may
occur with the nearest neighbour approach.
Devised by Albus [4, 3], the Cerebellar Model Articulation Controller (CMAC) consists of
a number of overlapping input regions, each of which represents a feature (see Figure 5.3).
The features are binary; any region containing the input state represents an input feature
with value 1. All other input features have a value of 0.
Figure 5.3: (left) A CMAC. The horizontal and vertical axes represent dimensions of the state space. (right) The CMAC with a regularised tiling.
If the input tiles are arranged into a regular pattern (e.g. in a grid as in Figure 5.3, right)
then there are particularly efficient ways to directly determine which features are active (i.e.
without search). A similar argument can be made for some classes of state aggregation but
not, in general, for the nearest neighbour method (which usually requires some search).
In the case of a linear output function, since many of the inputs will be zero, we simply
have:

$$f(\vec x_p, \vec\theta) = \sum_i x_{ip}\theta_i = \sum_{i \in \text{active}} \theta_i. \qquad (5.12)$$
This form of input mapping, when combined with the linear output function and delta
learning rule, has been extremely successful in reinforcement learning. Notably, there are
many successful examples using online Q-learning, Q(λ) and SARSA(λ) [71, 149, 70, 131,
167, 150, 64, 141]. [150] provides many others.
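A minimal one-dimensional CMAC in the spirit of Figure 5.3 can make the tiling idea concrete. The number of tilings, tiles per tiling and offset scheme below are arbitrary choices for illustration:

```python
def cmac_features(s, n_tilings=4, tiles_per_tiling=8):
    # Each tiling is a uniform grid over [0, 1), shifted by a fraction of a
    # tile width. The tile containing s in each tiling is one active feature.
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_tiling)
        idx = int((s + offset) * tiles_per_tiling) % tiles_per_tiling
        active.append(t * tiles_per_tiling + idx)
    return active  # indices i with x_i = 1; all other features are 0

def cmac_output(active, theta):
    # Equation (5.12): f is the sum of the active features' parameters.
    return sum(theta[i] for i in active)

theta = [0.0] * 32
for i in cmac_features(0.3):
    theta[i] += 0.25  # store a value of 1.0 at s = 0.3, split across tilings
print(cmac_output(cmac_features(0.3), theta))   # -> 1.0
print(cmac_output(cmac_features(0.35), theta))  # partial generalisation nearby
```

Because the tilings overlap, a nearby query shares some (but not all) active tiles with the trained point, which is the source of the CMAC's local generalisation.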
Figure 5.4 shows how the features of a CMAC or an RBF (introduced in the next section)
are linearly combined to produce an output function.
CMAC (Binary Coarse Coding): $\phi_i(s) = I(\text{dist}(s, \text{centre}_i) < \text{radius}_i)$
Figure 5.4: Example input features and how they are linearly combined to produce complex non-linear functions in a 1 dimensional input space. The left-hand-side curves (the set of features) are summed to produce the curve on the right-hand-side (the output function). A single parameter θ_i determines the vertical scaling of a single feature. It is intended that the parameter vector, θ, is adjusted such that the output function fits some target set of data.
Radial basis functions (RBFs) are superficially similar to the kernel based average method
presented in Section 5.2. With fixed centres and widths, an RBF network is simply a linear
method and so can be trained using the LMS rule, although in this case, the parameters
won't represent "prototypical" values. However, one of the great attractions of an RBF is
its ability to shift the centres and widths of the basis functions.
In a fixed CMAC vs. adaptive Gaussian RBF bakeoff of representations for Q-learning, little
difference was found between the methods [68] (although these results consider only one
test scenario). In some cases it was found that adapting the RBF centres left some parts of
the space under-represented. In similar work with Q(λ) using adaptive RBF centres, poor
performance was found in comparison to the CMAC [167]. In addition to these problems,
RBFs are computationally far more expensive than CMACs.
Good overviews of RBF and related kernel based methods can be found in [98, 99, 8, 1, 90].
1 Since $x_{ip} \in \{0, 1\}$, $0.2/\sum_i x_{ip} = 0.2/\sum_i x_{ip}^2$, and so this learning rate gives a properly normalised step-size of 0.2 as shown in Section 5.4.2.
[Figure: panels showing a sine function target fit from 5, 100 and 10000 training samples, together with the input feature shape used in each case.]
Figure 5.5: The generalisational and representational effect of input features of differing widths and gradient.
Figure 5.6: The expansion problem. Some function approximators, when trained using some functions of their output, can diverge in range.
become small enough that this doesn't happen; most function approximation schemes settle
into some local optimum of parameters if their distribution of training data is fixed a-priori.
However, for a bootstrapping RL system these increases in error can be fed back into the
training data. New return estimates that are used as training data are based upon f. In
the case of TD(0),

$$z = r + \gamma \hat V(s'),$$

is replaced by,

$$z = r + \gamma f(\phi(s'), \vec\theta),$$

and may be greater in error as a result of a previous parameter update. In pathological
cases, this can cause the range of f to diverge to infinity. There are examples where this
happens for both non-linear and linear function approximators [10, 160, 150]. The problem
is shown visually in Figure 5.6.
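The feedback loop can be made concrete with a single parameter. The features and constants below are invented for illustration: because the next state's feature weights the parameter more heavily than the current state's, each update raises the next bootstrap target, and the iteration expands in the manner of Figure 5.6:

```python
def f(x, theta):
    return sum(xi * ti for xi, ti in zip(x, theta))

gamma, alpha = 0.9, 0.5
theta = [0.0]
phi_s, phi_next = [1.0], [2.0]  # next state weights the parameter twice as heavily
r = 1.0

for step in range(3):
    z = r + gamma * f(phi_next, theta)  # bootstrap target uses the current theta
    error = z - f(phi_s, theta)
    theta = [t + alpha * x * error for t, x in zip(theta, phi_s)]
    print(round(z, 3), round(theta[0], 3))  # the target chases the parameters upward
```

Each pass grows both the parameter and the next target (here the recurrence is θ ← 1.4θ + 0.5, which has no finite fixed point), so the range of f expands without bound.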
The following sections review some schemes that deal with this problem.
Grow-Support Methods
The "Grow-Support" solution proposed by Boyan and Moore is to work backwards from a
goal state, which should be known in advance [24] (see also [23]). A set of "stable" states
with accurately known values is maintained around the goal. The accuracy of these values
is verified by performing simulated "rollouts" from the new states using a simulation model
(although in practice this could be done with real experience, but far less efficiently). This
"support region" is then expanded away from the goal, adding new states whose values
depend upon the values of the states in the old support region. In this way, the algorithm
can ensure that the return corrections used by bootstrapping methods have little error, and
so ensure the method's stability.2 In [24], Boyan and Moore also present several simple
environments in which a variety of common function approximators fail to converge or even
find anti-optimal solutions, but succeed when trained using the grow-support method.
2 For similar reasons, one might also expect backwards replay methods (such as the experience stack method) to be more stable with function approximation.
In Baird's residual gradient approach, the update descends the gradient of the squared temporal difference error with respect to the parameters:

$$\Delta\vec\theta = \alpha\left(r_{t+1} + \gamma\hat V(s_{t+1}) - \hat V(s_t)\right)\left(\frac{\partial \hat V(s_t)}{\partial\vec\theta} - \gamma\frac{\partial \hat V(s'_{t+1})}{\partial\vec\theta}\right)$$

In the linear case, we have,

$$\Delta\theta_i = \alpha\left(r_{t+1} + \gamma\hat V(s_{t+1}) - \hat V(s_t)\right)\left(\phi_i(s_t) - \gamma\phi_i(s'_{t+1})\right)$$
The successor states, $s_{t+1}$ and $s'_{t+1}$, should be generated independently, which may mean
that the method is often impractical without a model to generate a sample successor state
[150]. Also, $\phi_i(s_t) - \gamma\phi_i(s'_{t+1})$ may often be small, leading to very slow learning. However,
Baird also discusses ways of combining this approach with the linear LMS method in a way
that attempts to maximise learning speed while also ensuring stability. A later version of
this approach [9] combines the method with value-functionless direct policy search methods,
such as REINFORCE [170].
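A sketch of the residual-gradient update in the linear case, assuming for simplicity a deterministic transition so that $s'_{t+1} = s_{t+1}$. Applied to a single-parameter example where the plain LMS rule would expand (φ(s) = 1, φ(s') = 2, zero rewards), it contracts toward zero instead:

```python
def residual_gradient_update(theta, phi_s, phi_next, r, gamma, alpha):
    # delta_theta_i = alpha * (r + gamma*V(s') - V(s)) * (phi_i(s) - gamma*phi_i(s'))
    # i.e. gradient descent on the squared TD error itself.
    v_s = sum(x * t for x, t in zip(phi_s, theta))
    v_next = sum(x * t for x, t in zip(phi_next, theta))
    td = r + gamma * v_next - v_s
    return [t + alpha * td * (xs - gamma * xn)
            for t, xs, xn in zip(theta, phi_s, phi_next)]

theta = [1.0]
for _ in range(100):
    theta = residual_gradient_update(theta, [1.0], [2.0], 0.0, 0.99, 0.1)
print(theta[0])  # shrinks toward zero rather than diverging
```

Each step multiplies θ by 1 - 0.1·0.98·0.98 ≈ 0.904, so the error decays geometrically on this example, at the slow rate the text warns about when φ_i(s) - γφ_i(s') is small.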
Averagers
The term averager is due to Gordon [49]. The key property of averagers is that they are
non-expansions: they cannot extrapolate from the training values. In [49] Gordon
notes that i) the value-iteration operator is a function that has the contraction property,
ii) many function approximation schemes can be shown to be non-expansions, and
iii) any functional composition of a contraction and a non-expansion is a function that is
also a contraction. This makes it possible to prove that synchronous value-iteration will
converge upon a fixed point in the set of parameters, if one exists, provided that the function
approximator can be shown to be a non-expansion.
Many mean squared error minimising methods do not have this property. A special kind of
averager method is presented in the next section, for which it is clear that any discounted
return based RL method cannot possibly diverge (to infinity) regardless of the sampling
distribution of return and distribution of updates.
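Gordon's composition argument can be checked numerically. The particular contraction and non-expansion below are arbitrary examples chosen for illustration:

```python
def contraction(v):
    # |T(u) - T(v)| <= 0.9 * |u - v|: a contraction with factor 0.9
    return 0.9 * v + 1.0

def non_expansion(v):
    # |M(u) - M(v)| <= |u - v|: clipping to a fixed range never expands distances
    return max(min(v, 100.0), -100.0)

u, v = 3.0, 10.0
lhs = abs(non_expansion(contraction(u)) - non_expansion(contraction(v)))
assert lhs <= 0.9 * abs(u - v)  # the composition is again a 0.9-contraction
print(lhs)
```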
then the gradient descent rule (5.3) yields a slightly different update rule:

$$\theta_i \leftarrow \theta_i + \alpha_{ip}\left(z_p - \theta_i\right)x_{ip}. \qquad (5.13)$$
or,

$$\theta_i \leftarrow \theta_i + \alpha_{ip}\left(\text{desired output}_p - \text{contribution of } \theta_i \text{ to output}\right)$$

Here, the update minimises the weighted (by x_i) squared errors between each θ_i and the
target output, rather than between the actual and target outputs. As before, the learning
rate α_ip should be declined over time. This method is referred to as a linear averager to
differentiate it from the linear LMS gradient descent method.
To make the analysis of this method more straightforward, it is also assumed that the inputs
to the linear averager are normalised,

$$x_{ip} = \frac{x'_{ip}}{\sum_k x'_{kp}},$$

and that $0 \le x_{ip} \le 1$. The purpose of this is to make it clear that $\sum_i x_{ip}\theta_i$ is a weighted
average of the components of θ. It is also assumed that $0 \le \alpha_{ip} x_{ip} \le 1$, in which case
after update (5.13), $|z_p - \theta'_i| \le |z_p - \theta_i|$ must hold.3 In this way it also becomes clear that
each individual θ_i is moving closer to z_p, since update (5.13) has a fixed-point only where
$z_p = \theta_i$. This does not happen with update (5.5), where $z_p = f(\phi(s_p), \vec\theta)$ is the update's
fixed-point. Note that in the linear averager scheme, adjustments may still be made where
$z_p = f(\phi(s_p), \vec\theta)$.
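The different fixed-points of updates (5.5) and (5.13) are easy to exhibit. The single un-normalised input x = 0.5 below is chosen deliberately to make the contrast visible: the averager's parameter converges to the target itself, while the LMS parameter is pushed beyond it so that the output matches the target:

```python
def averager_update(theta, x, z, alpha=0.5):
    # Update (5.13): theta_i <- theta_i + alpha * x_i * (z - theta_i).
    # Each parameter moves toward z itself, so it can never leave the
    # interval spanned by its old value and z (a non-expansion).
    return [t + alpha * xi * (z - t) for t, xi in zip(theta, x)]

def lms_update(theta, x, z, alpha=0.5):
    # Update (5.5) for comparison: the shared output error can push an
    # individual parameter outside the range of the training targets.
    f = sum(xi * t for xi, t in zip(x, theta))
    return [t + alpha * xi * (z - f) for t, xi in zip(theta, x)]

x, z = [0.5], 1.0
t_avg, t_lms = [0.0], [0.0]
for _ in range(100):
    t_avg = averager_update(t_avg, x, z)
    t_lms = lms_update(t_lms, x, z)
print(round(t_avg[0], 3), round(t_lms[0], 3))  # -> 1.0 2.0
```

The averager's output (0.5) undershoots the target, the over-smoothing seen in Figure 5.7, while the LMS parameter (2.0) exaggerates beyond the range of the training values.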
Function approximators that can be trained using this scheme include state-aggregation
(state-aliasing and nearest neighbour methods), k-nearest neighbour, certain kernel based
learners (such as RBF methods with fixed centres and basis widths), piece-wise and barycentric
linear interpolation [80, 37, 93], and table-lookup. All of these methods differ only by
their choice of input mapping, φ, which is often normalised. Many of these methods are
already employed in RL (see [136, 167, 140, 117, 93, 97] for recent examples).
Special cases of this framework for which convergence theorems exist are: Q-learning and
TD(0) with stationary exploration policies and state-aggregation representations [136];
value-iteration where the function approximator update can be shown to be a non-expansion
[48], or is a state-aggregation method [21, 159], or is an adaptive locally linear representation
[93, 97]. The value-iteration based methods assume that a model of the environment
is available; they are also deterministic algorithms and are easier to analyse as a result.
The most significant (and most recent) result is by Szepesvari, where the "almost
sure" convergence of Q-learning with a stationary exploration policy has been shown with
interpolative function approximators whose parameters are modified with update (5.13)
[152].
Figure 5.7 compares the linear LMS (update (5.5)) and linear averager (update (5.13))
methods in a standard supervised learning setting. Linear averagers appear to suffer from
over-smoothing problems if broad input features are used, while the use of narrow input
features (for any function approximator) limits the ability to generalise, since the values of
many input features will be near or at zero, and their associated parameters adjusted by
similarly small amounts. The averager method does not exaggerate the training data in the output
in the way that update (5.5) can. The exaggeration problem is the source of divergence in
RL.4 However, as follows intuitively from its error criterion, the linear LMS method finds a
fit with a lower mean squared error in the supervised learning case.

Figure 5.7: The effect of input feature width and cost functions on incremental linear gradient descent with different cost schemes. (top) A comparison of the functions learned by parameter update rules (5.5) and (5.13) when the training set is taken from 1000 random samples of the target step function. Note that the averager method learns a function that is entirely contained within the vertical bounds of the target function. In contrast the linear LMS gradient descent method does not, but finds a fit with a lower mean squared error. This exaggeration of the training data, in combination with the use of bootstrapping, is the cause of divergence when using function approximation with RL. (middle) The input feature shape used by each method in each column. 50 such features, overlapping and spread uniformly across the extent of the figure, provided the input to the linear output function. Note that update (5.5) still learns well with broad input features. In contrast, the averager method suffers from over-smoothing of the output function and cannot well represent the steep details of the target function. (bottom) A selection of the learned parameters over the extent where their inputs are non-zero. Note that for the averager method, the learned parameters are the average of the target function over the extent where the parameter contributes to the output. For both methods, the learned function in the top row is an average of the functions in the bottom row (since the input features were normalised).

The next two sections show that function approximators which do not exaggerate cannot
diverge when used for return estimation in RL. In particular, the stability (i.e. boundedness)
of the linear averager method is proven for all discounted return estimating RL algorithms.
The rationale behind the proof is simply:
i) All discounted return estimates which bootstrap from $f(\cdot, \vec\theta)$ have specific bounds.

3 These special assumptions may be relaxed where Theorem 2 (below) can be shown to hold.
4 In some work, this exaggeration (extrapolation of the range of training target values) is sometimes confused with extrapolation (which refers to function approximator queries outside the range of states associated with the training data).
ii) Adjusting θ using the linear averager update to better approximate such a return
estimate cannot increase these bounds.
5.7.1 Discounted Return Estimate Functions are Bounded Contractions

Theorem 1. Let r be a bounded real value such that $r_{min} \le r \le r_{max}$. Define a bound on
the maximum achievable discounted return as $[V_{min}, V_{max}]$ where,

$$V_{min} = r_{min} + \gamma r_{min} + \cdots + \gamma^k r_{min} + \cdots = \frac{r_{min}}{1 - \gamma},$$
$$V_{max} = r_{max} + \gamma r_{max} + \cdots + \gamma^k r_{max} + \cdots = \frac{r_{max}}{1 - \gamma},$$

for some γ, $0 \le \gamma < 1$. Let $z(v) = r + \gamma v$.
Under these conditions, z is a bounded contraction. That is to say that:

i) if $v > V_{max}$, then $z(v) < v$ and $z(v) \ge V_{min}$,
ii) if $v < V_{min}$, then $z(v) > v$ and $z(v) \le V_{max}$,
iii) if $V_{min} \le v \le V_{max}$, then $V_{min} \le z(v) \le V_{max}$.
Since $r \ge r_{min}$,

$$r + \gamma v \ge V_{min}$$
$$\Rightarrow z(v) \ge V_{min}.$$

This proves the second part of i). ii) is shown in the same way.
iii) Assume that $V_{min} \le v$ and show the following holds,

$$V_{min} \le z(v) \iff V_{min} \le r + \gamma v.$$

This holds since (from (5.14)),

$$r + \gamma v \ge r_{min} + \gamma v \ge r_{min} + \gamma V_{min} = V_{min}.$$
The above proof method can be applied to a number of reinforcement learning algorithms.
For instance, for Q-learning (where $z = r_{t+1} + \gamma \max_a \hat Q(s_{t+1}, a)$), by redefining v as
$\max_a \hat Q(s_{t+1}, a)$, r as $r_{t+1}$, and each remaining v as $\hat Q(s_{t+1}, a_{t+1})$, the proof holds without
further modification. Similarly, the method can be applied to the return estimates
used by all single step methods (which includes TD(0), SARSA(0), V(0), the asynchronous
value-iteration and value-iteration updates) in the same way.
Contraction bounds for actual return methods (i.e. non-bootstrapping or Monte-Carlo methods)
are more straightforward. Simply note that if,

$$z = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$$

and $r_{min} \le r_i \le r_{max}$ for $i \in \mathbb{N}$, then $V_{min} \le z \le V_{max}$.
Contraction bounds for λ-return methods (i.e. forward view methods as in [150]) can also
be established by showing that n-step truncated corrected return estimates,

$$z^{(n)} = \left( \sum_{i=1}^{n-1} \gamma^{i-1} r_i \right) + \gamma^n v_n$$

(with $-r_{max} < r_i < r_{max}$) are a bounded contraction. This can be done by a method similar to
the proof of Theorem 1. Note that any weighted sum of the form,

$$\sum_i^n x_i z_i,$$

with weights,

$$\sum_i^n x_i = 1 \quad \text{and} \quad 0 \le x_i \le 1,$$

has a bound entirely contained within $[\min_i z_i, \max_i z_i]$. It has been shown in other work that
λ-return estimates are such a weighted sum of n-step truncated corrected return estimates
[163],

$$z^\lambda = (1 - \lambda)\left( z^{(1)} + \lambda z^{(2)} + \lambda^2 z^{(3)} + \cdots \right),$$
Figure 5.8: By Theorem 1, all possible discounted return estimates must be within the bounds shown since v may only take values bounded within $[f_{min}, f_{max}]$. Only return estimates within these bounds can possibly be passed as training data to the function approximator.
and so λ-return estimates are also bounded contractions. More intuitively, note that λ-return
estimates occupy a space of functions between the 1-step methods such as TD(0)
and Q-learning (where λ = 0, n = 1), and the actual return estimates (where λ = 1,
n = ∞).
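The weighted-sum bound can be illustrated directly. The n-step return values below are invented, and since only finitely many terms are kept, the truncated weight sequence $(1-\lambda)\lambda^{n-1}$ is renormalised so that it still sums to one:

```python
def lambda_return(n_step_returns, lam):
    # z^lambda = (1 - lam) * sum_n lam^(n-1) * z^(n): a convex combination,
    # so the result must lie within [min_n z^(n), max_n z^(n)].
    weights = [(1 - lam) * lam ** n for n in range(len(n_step_returns))]
    total = sum(weights)  # renormalise the truncated weight sequence
    return sum(w * z for w, z in zip(weights, n_step_returns)) / total

zs = [1.0, 1.5, 1.8, 2.0]  # illustrative n-step return estimates
z_lam = lambda_return(zs, 0.5)
print(round(z_lam, 4))
assert min(zs) <= z_lam <= max(zs)  # the bound of the weighted sum
```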
Theorem 2. Define θ' to be the new parameter vector after training with some arbitrary
target $z \in \mathbb{R}$. Let the bounds of the new output function, f', be defined as,

$$f'_{min} = \min_{s \in S} f(\phi(s), \vec\theta'),$$
$$f'_{max} = \max_{s \in S} f(\phi(s), \vec\theta').$$

If,

$$\min(V_{min}, f_{min}) \le f'_{min} \le f'_{max} \le \max(V_{max}, f_{max})$$

for any possible training example, then the bounds of f cannot diverge.
$$\theta_{t+1} = \theta_t + \alpha\left(r + \gamma V(s_2) - \theta_t\right) = \theta_t + \alpha\left(2\gamma\theta_t - \theta_t\right) = \theta_t\left(1 + \alpha(2\gamma - 1)\right)$$

Thus $\theta_{t+1}$ is greater in magnitude (i.e. greater in error, since θ = 0 is optimal) than $\theta_t$ for
$(1 + \alpha(2\gamma - 1)) > 1$. Thus, where $2\gamma > 1$ holds and for any positive α, this method increases
in error for each update from s1. Only updates from s2 decrease θ. Thus if s2 is updated
insufficiently in comparison to s1 (as is the case for the uniform distribution), divergence to
infinity occurs. The online update distribution ensures that V(s2) is sufficiently updated
to allow for convergence.
The linear averager method converges upon θ = 0 given $0 \le \gamma < 1$. The features are
assumed to be normalised ($\phi(s_2) = 1$, not 2), and the method therefore reduces to a
standard state-aggregation method. For transitions $s_1 \to s_2$,

$$\theta_{t+1} = \theta_t + \alpha\left(r + \gamma V(s_2) - \theta_t\right) = \theta_t + \alpha\left(\gamma\theta_t - \theta_t\right) = \theta_t\left(1 + \alpha(\gamma - 1)\right)$$

and so θ decreases in magnitude for $0 \le \gamma < 1$, $0 < \alpha \le 1$.
Caveat. In every case, the linear averager method is guaranteed to be bounded. However, because the linear averager method reduces to state aggregation, it is possible that the example above may be a "straw man". It only shows an example where the LMS method diverges and the linear averager method does not. It may be that there are scenarios in which the LMS method converges upon the optimal solution while the averager method does not, or where it converges to its extreme bounds. A fine bottle of single malt whisky may be claimed by the first person to send me the page number of this sentence.
5.7.4 Adaptive Representation Schemes
Many forms of function approximator can adapt their input mapping φ() by shifting which input states activate which input features (as does an RBF network [68]), or simply by adding more features and more parameters [117, 93, 131]. In such cases, it is often easy to provide guarantees that the range of outputs is no larger as a result of this adaptation (for example by ensuring that new parameters are some average of existing ones). In this way, these methods can also be guaranteed to be bounded. An example of an adaptive representation scheme is provided in the next chapter.
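As an illustration of such a guarantee (the splitting scheme below is a hypothetical sketch, not the method of the next chapter): if refinement duplicates a parent parameter into both children, the representable output range cannot grow through adaptation alone:

```python
# Hypothetical refinement step for a 1-d state-aggregation representation:
# params[i] holds the value of region i; splitting region i yields two child
# regions that both inherit the parent's value.
def split_region(params, i):
    return params[:i] + [params[i], params[i]] + params[i + 1:]

params = [0.2, 0.9, -0.4]
refined = split_region(params, 1)

# the output range is unchanged by the refinement itself
assert min(refined) == min(params) and max(refined) == max(params)
```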
[Figure: states s1, s2 and s_term, with V̂(s1) = θ, V̂(s2) = 2θ, V̂(s_term) = 0.]

Figure 5.9: Tsitsiklis and Van Roy's counter-example. A single parameter θ is used to represent the values of two states. All rewards are zero on all transitions and so the optimal value of θ is zero. The feature mapping is arranged such that φ(s1) = 1 and φ(s2) = 2. γ = 0.99 and α = 0.01.
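The two per-update factors derived above can be iterated directly to illustrate the contrast (a sketch, using γ = 0.99 and α = 0.01 from the caption, and assuming repeated updates from s1 only, which mimics the uniform distribution's neglect of s2):

```python
gamma, alpha = 0.99, 0.01

# LMS update from s1: theta <- theta * (1 + alpha * (2*gamma - 1));
# since 2*gamma - 1 > 0, each update grows |theta|
theta_lms = 1.0
for _ in range(2000):
    theta_lms *= 1 + alpha * (2 * gamma - 1)

# linear averager (normalised features): theta <- theta * (1 + alpha * (gamma - 1));
# since gamma - 1 < 0, each update shrinks |theta| towards the optimum 0
theta_avg = 1.0
for _ in range(2000):
    theta_avg *= 1 + alpha * (gamma - 1)
```

After 2000 updates the LMS parameter has grown past 10⁶ while the averager's has shrunk towards zero, matching the divergence argument.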
5.7.5 Discussion
Gordon demonstrated that value-iteration with approximated V̂ must converge upon a fixed point in the set of parameters for any function approximation scheme that has the non-expansion property [48]. This follows from noting simply that the value-iteration update is known to be a contraction, and that any functional composition of a non-expansion and a contraction is also a contraction to a fixed point (if one exists).
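The composition argument can be illustrated numerically (a sketch with a toy γ-contraction and a mean-pooling non-expansion; both operators are invented for illustration):

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
r = rng.normal(size=4)

def T(v):
    # a toy value-update: a gamma-contraction in the max-norm
    return r + gamma * v

def A(v):
    # crude state aggregation: every state gets the group mean (a non-expansion)
    return np.full_like(v, v.mean())

v, w = rng.normal(size=4), rng.normal(size=4)
d0 = np.max(np.abs(v - w))
d1 = np.max(np.abs(A(T(v)) - A(T(w))))
# the composition A∘T contracts max-norm distances by at least gamma
assert d1 <= gamma * d0 + 1e-12
```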
The results here demonstrate the boundedness of general discounted RL with similar function approximators for analogous reasons, by showing that all discounted return estimate functions (with bounded rewards) are bounded contractions (i.e. contractions to within a bounded region), that the linear averager update is a non-expansion, and that the composition of these functions is also a bounded contraction. This provided a more general (and more accessible) demonstration of why function approximator updates having the non-expansion property cannot lead to an unbounded function, and that,

f(φ(s), θ) ∈ [min(Vmin, f⁰min), max(Vmax, f⁰max)],

are the bounds on the output of f over its lifetime ([f⁰min, f⁰max] denotes the initial bounds on the output of f for all s ∈ S). This is a more general statement than is found in [48] (it applies to more RL methods), but it is weaker in the sense that convergence to a fixed-point is not shown. However, this work directly applies to stochastic algorithms whereas the method in [48] considers only deterministic algorithms where a model of the reward and environment dynamics must be available.
Although convergence can be shown with the linear LMS method for some RL algorithms (e.g. for TD(λ)), this only holds given restricted update distributions [10, 160]. Divergence to infinity can be shown in cases where this does not hold. This is a problem for control optimisation methods such as Q-learning (which has TD(0) as a special case) where arbitrary exploration of the environment is desired. It should also be noted that the linear averager method cannot diverge no matter how the return estimates are sampled. This is surprising since the two gradient descent schemes differ only by the error measure being minimised. However, linear averagers appear to be limited to using narrow input features where steep details in the target function need to be represented. Following the review in Section 5.6 this appears to be a common tradeoff in successfully applied function approximators.
5.8 Summary
A variety of representation methods are available to store and update value and Q-functions. In increasing levels of sophistication and empirical success, but decreasing levels of provable stability, these are: i) table lookup, ii) state aggregation, iii) averagers, iv) linear LMS methods and, v) non-linear methods (e.g. MLPs).
A number of heuristics have been reviewed that appear to be useful in aiding the stability of these methods: making updates with the online, on-policy distributions, the use of fixed policy evaluation methods rather than greedy policy evaluating methods, the use of function approximators that do not exaggerate training data, the use of local input features, and the use of non-bootstrapping methods.
It is not clear that attempting to minimise the error between a function approximator's output and the target training values is a good strategy for RL. We have seen that some methods which attempt to do just this may diverge to infinity, while some methods that do not, and learn prototypical state values instead, cannot (although they may still suffer in other ways where bootstrapping is used). Also, for control tasks, it does not follow that predictive accuracy is a necessary requirement for good policies [5, 150]. This is also seen in methods such as SARSA(λ) and Peng and Williams' Q(λ) where good policies may be learned, even where there is considerable error in the Q-function. Although, similarly, it is straightforward to construct situations where reasonably accurate Q-functions (i.e. close to Q*) have a greedy policy that is extremely poor.
Chapter 6
Adaptive Resolution
Representations
Chapter Outline
This chapter introduces a new method for representing Q-functions for continuous state problems. The method is not directly motivated by minimising a function of return estimate error, but aims to refine the Q-function representation in the areas of the state-space that are most critical for decision making.
This chapter discusses autonomous, adaptive methods for representing Q-functions. The initial limits on the system's performance are removed by adding resources to the representation as needed. Over time, the representation is improved through a process of general-to-specific refinement. Although a simple state-aggregation representation is used (for ease of implementation), traditional problems often experienced with these methods can be avoided (e.g. lack of fine control with coarse aggregations and slow learning with fine representations). In the new approach, during the initial stages of learning, broad features allow good generalisation and rapid learning, while in the later stages, as the representation is refined, small details in the learned policy may be represented. Unlike most function approximation methods, the method is not motivated by value function error minimisation, but by seeking out good quality policy representations. It is noted that i) good quality policies can be found long before an accurate Q-function is found (the success of methods such as Peng's and Williams' Q(λ) demonstrates this), and that, ii) in continuous spaces there are often large areas where actions under the optimal policy are the same.
Figure 6.2: The optimal Q-function for SinWorld. The decision boundaries are at s = 90° and s = 270° where Q(s, L) and Q(s, R) intersect.
and many practical problems, is the apparent simplicity of the optimal policy compared to the complexity of its Q-function:

π*(s) = { L,  if 90° ≤ s < 270°;
          R,  otherwise.                (6.1)

It is trivial to construct and learn a two region Q-function which finds the optimal policy given only a few experiences. This, of course, relies upon knowing the decision boundaries (i.e. where Q(s, L) and Q(s, R) intersect) in advance (see Figure 6.2).
Decision boundaries are used to guide the partitioning process since it is here that one can expect to find improvements in policies at a higher resolution; in areas of uniform policy, there is no performance benefit for knowing that the policy is the same in twice as much detail.
While it is true that, in general, we cannot determine π* without first knowing Q*, in many practical cases of interest it is often possible to find near- or even optimal policies with very coarsely represented Q-functions. A good estimate of π* is found if, for every region, the best Q-value in a region is, with some minimum degree of confidence, significantly greater than the other Q-values in the same region.
Similarly, there is little to be gained by knowing more about regions of space where there is a set of two or more near equivalent best actions which are clearly better than others. To cover both cases, decision boundaries are defined to be the parts of a state-space where i) the greedy policy changes and, ii) where the Q-values of those greedy actions diverge after intersecting.
It is important to note that the cost of representing decision boundaries is a function of their surface size and not necessarily the dimensionality of the state-space. Hence, if there are very large areas of uniform policy, then there can be a considerable reduction in the amount of resources required to represent a policy to a given resolution when compared to uniform resolution methods.
6.2.3 The Algorithm
The partitioning pro ess onsiders every pair of adja ent regions in turn. The de ision of
whether to further divide the pair is formed around the following heuristi :
do not onsider splitting if the highest-valued a tions in both regions are the same
(i.e. there is no de ision boundary),
only onsider splitting if all the Q values for both regions are known to a \reasonable"
degree of on den e,
only split if, for either region, taking the re ommended a tion of one region in the
adja ent region is expe ted to be signi antly worse than taking another, better,
a tion in the adja ent region.
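The three conditions can be sketched as a single predicate (a hypothetical rendering; the dictionary-based interface and the names `vis_min` and `delta_min` are illustrative stand-ins for the algorithm's two parameters):

```python
def should_split(q_a, q_b, visits_a, visits_b, vis_min, delta_min):
    """Decide whether a pair of adjacent regions should be divided further."""
    best_a = max(q_a, key=q_a.get)
    best_b = max(q_b, key=q_b.get)
    # 1) no decision boundary: both regions already recommend the same action
    if best_a == best_b:
        return False
    # 2) every Q-value must be known to a "reasonable" degree of confidence
    if min(visits_a.values()) < vis_min or min(visits_b.values()) < vis_min:
        return False
    # 3) split only if following one region's recommendation in its neighbour
    #    is expected to be significantly worse than the neighbour's best action
    loss_in_b = q_b[best_b] - q_b[best_a]   # loss for taking a's action in b
    loss_in_a = q_a[best_a] - q_a[best_b]   # loss for taking b's action in a
    return max(loss_in_a, loss_in_b) > delta_min
```

For example, two regions whose greedy actions agree are never split, however confident and different their Q-value estimates are.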
The second point is important, insofar as the decision to split regions is based solely upon estimates of Q-values. In practice it is very difficult to measure confidence in Q-values since they may ultimately be defined by the values of currently unexplored areas of the state-action space or parts of the space which only appear useful at higher resolutions
[Figure: Q-values for actions a1 and a2 in pairs of adjacent regions, grouped into "Do Split" and "Don't Split" cases.]

Figure 6.3: The Decision Boundary Partitioning Heuristic. The diagrams show Q-values in pairs of adjacent regions. The horizontal axis represents state, and the vertical axis represents value.
(although see [62, 85] for some confidence estimation methods). For both of these reasons, the Q-function is non-stationary during learning which itself causes problems for statistical confidence measures. The naive solution applied here is to require that all the actions in both regions under consideration must have been experienced (and so had their Q-values re-estimated) some minimum number of times, VISmin, which is specified as a parameter of the algorithm. This also has the added advantage of ensuring that infrequently visited states are less likely to be considered for partitioning.
In the final part of the heuristic, the assumption is made that the agent suffers some "significant loss" in return if it cannot determine exactly where it is best to follow the recommended action of one region instead of the recommended action of an adjacent region. If the best action of one region, when taken in an adjacent region, is little better than any of the other actions in the adjacent region, then it is reasonable to assume that between the two regions the agent will not perform much better if it could decide exactly where each action is best. The "significant loss", Δmin, is the second and final parameter for the algorithm. Figure 6.3 shows situations in which partitioning occurs.
Setting Δmin > 0 attempts to ensure that the partitioning process is bounded. For differentiable Q-functions, as the regions become smaller on either side of the decision boundary, the loss for taking the action suggested by the adjacent region must eventually fall below Δmin. In the case where decision boundaries occur at discontinuities in the Q-function, unbounded partitioning along the boundary is the right thing to do provided that there remains the expectation that the extra partitions can reduce the loss that the agent will receive.¹ The fact that there is a boundary indicates that there is some better
¹ This isn't true in the unlikely case that regions are already exactly separated at the boundary. But if this is the case, continued partitioning is still necessary to verify this.
6.2.4 Empirical Results
In this section the variable resolution algorithm is evaluated empirically on three different learning tasks. In all experiments the 1-step Q-learning algorithm is used. Although faster learning can be achieved with other algorithms, Q-learning is employed here because of its ease of implementation and computational efficiency.² Also, throughout, the exploration policy used is ε-greedy [150]. In addition, upon entering a region the agent is committed to following a single action until it leaves the region. This prevents the exploration strategy from dithering within a region and allows larger parts of the environment to be covered more quickly.
In the SinWorld environment (introduced above) the agent has the task of learning the policy which gets it to (and keeps it at) the peak of a sine curve in the shortest time. To prevent a lucky partitioning of the state space which exactly divides the Q-function at the decision boundaries, a random offset for the reward function was chosen for each trial: f(s) = sin(s + random). In each episode the agent is started in a random state and follows its exploration policy for 20 steps. In all trials the agent started with only a two state representation. At the end of each episode, the decision boundary partitioning algorithm was applied.
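A minimal reading of this setup can be sketched as follows (assumptions: the state is an angle in degrees, the two actions L and R move one degree, and the reward is the height of the offset sine curve at the state reached; only f(s) = sin(s + random) is specified in the text, the rest is illustrative):

```python
import math
import random

class SinWorld:
    def __init__(self, n_states=360):
        self.n = n_states
        # a random offset per trial prevents a "lucky" partitioning of the space
        self.offset = random.uniform(0.0, 2.0 * math.pi)
        self.s = random.randrange(self.n)  # episodes start in a random state

    def step(self, action):
        # action "R" moves one state clockwise, "L" anticlockwise
        self.s = (self.s + (1 if action == "R" else -1)) % self.n
        reward = math.sin(math.radians(self.s) + self.offset)
        return self.s, reward
```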
Figure 6.4 shows the final partitioning after 1000 episodes. The highest resolution areas are seen at the decision boundaries (where Q(s, L) and Q(s, R) intersect). At s = 90° partitioning has stopped as the expected loss in discounted reward for not knowing the area in greater detail is less than Δmin. The decline in the partitioning rate as the boundaries are more precisely identified can be seen in Figure 6.5.
Figure 6.6 compares the performance of the variable resolution methods against a number of fixed uniform grid representations. The performance measure used was the average discounted reward collected over 30 evaluations of a 20 step episode under the currently recommended policy. The results were averaged over 100 trials. The initial performance matches that of an 8 state representation. After 1000 episodes, however, the performance is slightly better than a 32 state representation (not shown) which managed much slower improvements in the initial stages. It is important to note that without prior knowledge of the problem it is difficult to assess which fixed resolution representation will provide the best tradeoff between learning speed and convergent performance. Starting with only two states, the adaptive resolution method provided fast learning in the initial stages yet managed near optimal performance overall.
² These experiments were also conducted prior to the experience stack method.
[Figure: Q(s, L), Q(s, R) and r(s) plotted as Value against State, s.]

Figure 6.4: The final partitioning after 1000 episodes in the SinWorld experiment. The highest resolution areas are seen at the decision boundaries (where Q(s, L) and Q(s, R) intersect).
Figure 6.5: The number of regions in the SinWorld experiment. Note that the 1st derivative (the partitioning rate) is decreasing over time.
[Figure: Average Discounted Return against Episode for the adaptive method and fixed 2, 4, 8, 16 and 32 state representations.]

Figure 6.6: Comparison of initial learning performances for the variable vs. fixed resolution representations in the SinWorld task. The performance measure is the average total discounted reward collected over 20 steps from random starting positions and offsets of the reward function.
The Mountain Car Task
In the Mountain Car task the agent has the problem of driving an under-powered car to the top of a steep hill.³

The actions available to the agent are to apply an acceleration, deceleration or neither (coasting) to the car's engine. However, even at full power, gravity provides a stronger force than the engine can counter. In order to reach the goal the agent must reverse back up the hill, gaining sufficient height and momentum to propel itself over the far side. Once the goal is reached, the episode terminates. The value of the goal states are defined to be zero since there is no possibility of future reward. At every time-step the agent receives a punishment of −1, and no discounting was employed (γ = 1). In this special case, the Q-values simply represent the negative of the expected number of steps to reach the goal.
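Since the experiment reproduces [150, p. 214], the dynamics can be sketched from that formulation (an assumption worth flagging: the constants and bounds below are those of Sutton and Barto's version, with action a ∈ {−1, 0, +1} for reverse, coast and full throttle):

```python
import math

def mountain_car_step(x, v, a):
    """One step of the (assumed) [150, p. 214] Mountain Car dynamics."""
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)   # engine force minus gravity
    v = min(max(v, -0.07), 0.07)                   # velocity bound
    x = min(max(x + v, -1.2), 0.5)                 # position bound
    if x == -1.2:        # inelastic collision at the left wall
        v = 0.0
    done = x >= 0.5      # the goal is the right hilltop
    return x, v, -1.0, done   # reward of -1 per step, gamma = 1
```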
Figure 6.7 shows the Q-values of the recommended actions after 5000 learning episodes. The cliff represents a discontinuity in the Q-function. On the high side of the cliff the agent has just enough momentum to reach the goal. If the agent reverses for a single time step at this point it cannot reach the goal and must reverse back down the hill. It is here that there is a decision boundary and a large loss for not knowing exactly which action is best. Figure 6.8 shows how this area of the state-space has been discretised to a high resolution. Regions where the best actions are easy to decide upon are represented more coarsely.
Figure 6.9 shows a performance comparison between the adaptive and the fixed, uniform grid representations. The measure used is the average total reward collected from 30 random starting positions using the currently recommended policy and with learning suspended. Due to the large discontinuity in the Q-function, partitioning continues long after there appears to be a significant performance benefit for doing so (shown in Figure 6.10). This simply reflects that the performance metric measures the policy as a whole from random starting positions. Agents starting on or around the discontinuity still continue to gain some performance improvements.
The same experiment was also conducted but with the ranges of the states chosen to be 10 times larger than previously, giving a new state-space of 100 times the original volume (see Figure 6.8). Starting positions for the learning and evaluation episodes were still chosen to be inside the original volume. These changes had little effect upon the amount of memory used or the convergent performance, although learning proceeded far more slowly in the initial stages.
³ This experiment reproduces the environment described in [150, p. 214].
Figure 6.7: A value function for the Mountain Car experiment after 5000 episodes. The value is measured as max_a Q(s, a) to show the estimated number of steps to the goal under the recommended policy.
Figure 6.8: (left) A partitioning after 5000 episodes in the Mountain Car experiment. Position and velocity are measured along the horizontal and vertical axes respectively. (right) The same experiment but with poorly chosen scaling of axes. This had little effect on the final performance or number of states used.
[Figures 6.9 and 6.10: Average Reward and number of States against Episode, for the adaptive, 256 state and 16 state representations in the Mountain Car experiment.]
⁴ A detailed description of this environment is available at: http://www.cs.bham.ac.uk/~sir/pub/hbeam.html
[Figure: a motor-driven propeller providing Thrust at one end of a beam, with weights g·M_motor, g·M_beam and g·M_counter.]

Figure 6.11: The Hoverbeam Task. The agent must drive the propeller to balance the beam horizontally.
[Figure: Average Reward against Episode for the adaptive method and fixed 8, 64, 512 and 4096 state representations.]

Figure 6.12: The mean performance over 20 experiments using the adaptive and the fixed, uniform representations in the Hoverbeam task. The total reward collected after 200 steps under the currently recommended policy is measured.
`hand'. Reinforcements are provided for reductions in this force. The splitting criterion is to partition regions if the arm's controller is failing to maintain the local punishment below some certain threshold. In cases where the exerted forces were very small, most partitioning occurred and fine control was the result.
In [46] Fernandez shows how the state-space can be discretised prior to learning using the Generalised Lloyd Algorithm. The method provides greater resolution in more highly visited parts of the state-space. Similarly, RBF networks may adapt their cell centres such that some parts of the state-space are represented in greater detail [68, 97]. A criticism of these kinds of approach is that they are based upon similar assumptions made by standard supervised learning algorithms: that a greater proportion of the error minimisation "effort" should be spent on more frequently visited states. It is not clear that this is the best strategy for reinforcement learning where, for instance, the values of states leading to a goal may be infrequently visited but may also define the values of all other states.
G-Learning
In another early work [28], Chapman and Kaelbling's G algorithm employs a decision tree to represent the Q-function over a discrete (binary) space. Each branch of the tree represents a distinction between 0 and 1 for a particular environmental state variable. Each leaf contains an additional "fringe" which keeps information about all of the remaining distinctions that can be made. The decision of whether or not to fix a distinction is made on the basis of two statistical tests (only one need pass). Here it was found that performing Q-learning and using the learned Q-values to make a split was insufficient. Instead, the method learns the future reward distribution:

D(s_t, a_t, r) = Σ_{k=0}^∞ γ^k Pr(r = r_{t+k+1})
The possible rewards are assumed to be drawn from a small discrete set, R. From this, the Q-values can be recovered as follows:

Q̂(s, a) = Σ_{r∈R} r D(s, a, r).
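A small worked instance (hypothetical numbers): suppose a region yields a reward of +1 on the first step and 0 thereafter, with γ = 0.9. Then D assigns a discounted "count" of 1 to reward +1 and γ/(1 − γ) = 9 to reward 0, and the weighted sum recovers the discounted return of 1:

```python
gamma = 0.9
R = [-1.0, 0.0, 1.0]              # the small discrete reward set

# discounted counts D(s, a, r) for the scenario described above:
# reward +1 once at k = 0, reward 0 at every later step
D = {1.0: 1.0, 0.0: gamma / (1 - gamma), -1.0: 0.0}

q = sum(r * D[r] for r in R)      # recovers the discounted return (= 1.0)
```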
Thus the method recovers the same on-policy return estimate as batch, accumulate-trace SARSA(1) (or an every-visit Monte Carlo method), but also has a (non-stationary) future reward distribution for each region. The return distributions of a pair of regions differing by a single input variable are compared using a T-test [42]. The distinction is fixed, and the tree deepened, if it is found that the reward distributions differ with a "significant degree of confidence".⁵
The G algorithm also fixes distinctions on the basis of whether differing distinctions recommend different actions. Intuitively, the method also appears to identify decision boundaries but in discrete spaces.
⁵ The use of significance measures in RL to compare return distributions is almost always heuristic since the return distributions are almost always non-stationary.
Classifier Systems
A classifier system consists of a population of ternary rules of the form ⟨1, 0, #, 1 : 1, 0⟩ [55]. A rule encodes a state-action pair, ⟨state : action⟩. A rule applies and suggests an action if it matches an input state (which should also be a binary string). A # in a rule stands for "don't care". Thus a rule ⟨#, #, #, # : 1, 0⟩ matches any input state, and the rule ⟨0, #, #, # : 1, 0⟩ matches any state where the first bit is 0. In this respect, a classifier system provides similar representations to a binary decision tree where data is stored at many levels; ⟨#, #, #, # : 1, 0⟩ represents the root and ⟨0, #, #, # : 1, 0⟩ is the next level down. In practice, a tree is not used to hold the rules. The population is unstructured; there may be gaps in the state-space covered by the population and several rules may apply in other states.
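The matching rule for conditions over {0, 1, #} can be sketched in a few lines (the string encoding is illustrative):

```python
def matches(condition, state):
    """True if a ternary rule condition applies to a binary input state."""
    # '#' is "don't care"; otherwise the bits must agree position by position
    return all(c == '#' or c == s for c, s in zip(condition, state))

matches("####", "1010")   # the root-like rule: applies to every state
matches("0###", "0110")   # applies to any state whose first bit is 0
```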
Each rule has an associated set of parameters, some of which are used to determine a rule's fitness. Fitness measures the quality of a rule and corresponds to fitness in an evolutionary sense. Periodically, unfit rules are deleted from the population and new rules added by combining fit rules together.
In [96], Munos and Patinel's Partitioning Q-learning, the evolutionary component is replaced with a specialisation operator that replaces rules containing a #, with two new rules in which the # is substituted with a 1 and a 0. Each rule keeps a Q-value for the SAP that it encodes and is updated whenever it is found to apply (several rules may have Q-values updated on each step). The specialisation operator is applied for a fraction of the rules in which the variance in the 1-step error is greatest. This variance is measured as:

(1/n) Σ_{i=1}^n [(r_i + γ max_{a′} Q(s′_i, a′)) − (r_{i−1} + γ max_{a′} Q(s′_{i−1}, a′))]²

where the rule applied and was updated at times {t_0, …, t_i, …, t_n}. The result is that specialisation causes something like the tree deepening as in G-learning. However, unlike the T-test, this method does not distinguish between noise in the 1-step return, and the different distributions of return that follow from adjacent state aggregations.
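This variance measure amounts to the mean squared difference between successive 1-step corrected return targets seen by a rule (a sketch; `targets[i]` stands for r_i + γ max_{a′} Q(s′_i, a′) at the i-th time the rule applied):

```python
def update_variability(targets):
    """Mean squared successive difference of a rule's 1-step return targets."""
    n = len(targets) - 1
    return sum((targets[i] - targets[i - 1]) ** 2
               for i in range(1, len(targets))) / n
```

Note that, as the text observes, a noisy but stationary target and a genuinely mixed pair of aggregated regions can produce the same value under this measure.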
Utile Suffix Memory (USM)
So far, all of the methods discussed (including the DBP approach) assume that the real observed states are those of a large or continuous MDP. However, in some cases, the reward or transitions following from the next action may not simply depend upon the current state and action taken, but may depend upon what happened 2, 3 or more steps ago (i.e. the environment is a partially observable MDP). Similar to the G-algorithm, McCallum's Utile Suffix Memory (USM) also uses a decision tree to attempt to discover the relevant "state" distinctions needed for acting [82, 81]. However, here the agent's perceived state is a recent history of observed environmental inputs and actions taken. Branches in the tree represent distinctions in the recent history of events that allow different Q-value predictions to be distinguished. The top level of the tree represents actions to be taken at the current state for which Q-values are desired. Deeper levels of the tree make distinctions between different prior observations. For example, a branch 3 levels down might distinguish between whether a_{t−2} = a_{10} or whether a_{t−2} = a_5. Distinctions (branches) are added if these
different histories appear to give rise to different distributions of 1-step corrected return, r + γ max_a Q(s′, a). The return distributions following from each history are generated from a pool of stored experiences. The Kolmogorov-Smirnov test is used to decide whether the distributions are different [42].⁶
Continuous U-Tree
In [161] Uther and Veloso apply USM and G-learning ideas to a continuous space. As in the DBP approach, a kd-tree is used to represent the entire state-space, and branches of the tree subdivide the space. As in McCallum's USM, a pool of experience is maintained and replayed to perform offline value-updates. Within a region, the 1-step corrected return is measured for each stored experience, which serves as a sample set. This is compared with a sample from an adjacent region using the Kolmogorov test. Also, an alternative (less "theoretically based") test was used which maintains splits if this reduces the variance in the 1-step return estimates by some threshold.
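The Kolmogorov-Smirnov statistic used by these splitting tests is simply the largest gap between the two empirical cumulative distributions, which is easy to sketch directly (an illustrative implementation; real usage would also need a critical value or p-value):

```python
import bisect

def ks_statistic(x, y):
    """Two-sample KS statistic: max gap between the empirical CDFs of x and y."""
    xs, ys = sorted(x), sorted(y)
    points = sorted(set(xs) | set(ys))

    def cdf(sample, v):
        # fraction of the sample that is <= v
        return bisect.bisect_right(sample, v) / len(sample)

    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in points)
```

Identical samples give a statistic of 0, fully separated samples give 1, and a split test would compare the statistic for two regions' return samples against a significance threshold.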
Comments
An interesting issue with many of these methods is that we actually expect the return following from different regions to be drawn from different distributions in almost all cases: in very many problems, the optimal value function is non-constant throughout almost all of the state-space. This follows as a consequence of using discounting. The return distributions following from adjacent regions are therefore likely to have different means, and so will be shown to be from different distributions under the statistical tests given significant amounts of experience. It may be that the Kolmogorov-Smirnov test or the T-test identify relatively large changes in the value function more quickly than other parts of
⁶ The Kolmogorov-Smirnov test distinguishes samples by the largest difference in their cumulative distributions.
the state-space (e.g. at discontinuities), or where significance tests are passed most quickly (e.g. in areas where most experience occurs). One might hope that these areas also coincide with changes in optimal policy, although this is clearly not always the case.
With experience caching methods (USM and Continuous U-Tree), there is the opportunity to deepen the tree until a lack of recorded experience within leaf regions causes it to be poorly modelled by the stored experience (e.g. either because the region contains no experiences, no experiences which exit the region (causing "false terminal states"), or too few experiences to model the local variance in value and pass any reasonable statistical test). Partitioning so deeply that we have one experience per action per region is unlikely to be desirable and seems certain to lead to overfitting problems.

As the number of regions increases, so too does the cost of performing value-iteration sweeps across the set of regions. If computational costs can be neglected, however, one might expect an approach of partitioning as deeply as possible to make extremely good use of experience (provided overfitting and false terminals can be avoided).
However, if time and space costs are an issue, then it becomes natural to examine ways in which parts of the state-space can be kept coarse. In this respect, the existing methods miss the key insight that it simply is not necessary (in all cases) to represent the value function to a high degree of accuracy in order to represent accurate policies.

It is argued that refinement methods should seek to reduce uncertainty about the best action, and not uncertainty about their values, in order to find better quality policies.
The decision boundary partitioning method offers an initial heuristic way to do this, although it is a less principled approach than one might hope. For instance, in many cases it will follow that reducing uncertainty about the best action requires more certain action value estimates for those actions. In turn it may follow (at least in the case of bootstrapping value estimation algorithms, such as Q-learning and value-iteration) that the only way to reduce the uncertainty in these action value estimates is to increase the resolution of the regions whose values determine the action values that we are uncertain about. This requires a non-local partitioning method. All of the methods considered so far are local methods and do not consider partitioning successor regions in order to reduce uncertainty at the current region.

In the next paragraph, the VRDP approaches of Moore, and Munos and Moore, use a number of different partitioning criteria. In particular, the Influence × Standard Deviation heuristic appears to be a more principled step in the direction of reducing the uncertainty about the best actions to take.
while following the greedy policy from some starting state. A disadvantage of this approach is that every state is on the greedy path from somewhere; attempting to use this method to generate policies from arbitrary starting states causes the method to partition everywhere.

More recent VRDP work by Munos and Moore examines and compares several different partitioning criteria [94, 95, 92]. The method uses a grid-based "finite-element" representation.⁷ The finite elements are the points (states) at the corners of grid cells for which values are to be computed. A discrete transition model is generated by casting short trajectories from an element and noting the nearby successors at the end of the trajectory. Elements near to the trajectory's end are given high transition probabilities in the model.
The following local partitioning rules were initially tested:
i) Measure the utility of a split in a dimension as the size of the local change in value along that dimension. Splits are ranked and a fraction of the best are actually divided.
ii) Measure the local variability of the values in a dimension. Rank and split, as before, but based on this new measure. This causes splits to occur where the value function is non-linear.
iii) Identify where the policy changes along a dimension, and split in that dimension. This refines at decision boundaries.
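The three local rules can be sketched on a one-dimensional slice of a grid representation. This is an illustrative sketch, not Munos and Moore's code; the function names and the use of first and second differences as the "change" and "variability" measures are assumptions.

```python
import numpy as np

def split_utilities(values, actions):
    """Local split utilities along one dimension of a grid:
      value_change  - size of the local change in value         (rule i)
      variability   - local non-linearity of the value function (rule ii)
      policy_change - 1 where the greedy action changes         (rule iii)
    `values` and `actions` are 1-D arrays over adjacent cells."""
    values = np.asarray(values, dtype=float)
    value_change = np.abs(np.diff(values))
    # Second difference: zero wherever the values are locally linear.
    variability = np.pad(np.abs(np.diff(values, n=2)), (0, 1))
    policy_change = (np.diff(np.asarray(actions)) != 0).astype(float)
    return value_change, variability, policy_change

def select_splits(utility, fraction=0.25):
    """Rank candidate split points and return indices of the best fraction."""
    k = max(1, int(len(utility) * fraction))
    return np.argsort(utility)[-k:]
```

Ranking and splitting only a fraction of the best candidates, as the text describes, keeps the refinement incremental.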
The decision boundary method was found to converge upon sub-optimal policies in a different version of the mountain car task requiring finer control. In some cases, the performance of the decision boundary approach was actually worse than for fixed, uniform representations of the same size. The reason for this is errors in the value-approximation of states away from the decision boundary, which actually cause the decision boundaries to be misplaced. Combining the decision boundary and non-linearity heuristics resulted in better performance.
To improve this situation further, an influence heuristic was devised that takes into account the extent to which the value of one element contributes to the value of another. Intuitively, influence is a measure of the size of the change in the value of s that follows from a unit change in the value of s_i. The influence I(s|s_i) of the value of state s_i on s is defined as:

    I(s|s_i) = Σ_{k=0}^{∞} p_k(s, s_i)

where p_k(s, s_i) is the k-step discounted probability of being in s_i after k steps when starting from s and following the greedy policy, π_g. This can be found as follows:8

    p_0(s, s') = 1 if s = s', and 0 if s ≠ s'
    p_1(s, s') = γ^τ P^{π_g(s)}_{ss'}
    p_k(s, s') = γ^τ Σ_x P^{π_g(s)}_{sx} p_{k−1}(x, s')
7 This work was conducted independently of, and in parallel with, the DBP approach [116, 115, 117].
8 Below, τ represents the timescale over which a state-transition model was calculated, or the mean transition time between s and s'. Variable timescale methods are discussed in the next chapter. Assume for now that τ = 1.
The influence of a state s on a set of states, Γ, is defined as:

    I(s|Γ) = Σ_{s_i ∈ Γ} I(s|s_i).
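Since I(s|s_i) sums discounted k-step visitation probabilities under the greedy policy, the whole influence matrix can be obtained in closed form from the greedy transition matrix. A minimal sketch, assuming τ = 1 and a known greedy transition matrix `P_greedy` (the function names are mine, not from the paper):

```python
import numpy as np

def influence(P_greedy, gamma=0.95):
    """Influence matrix I[s, si] = sum_k gamma^k (P^k)[s, si]: the discounted
    expected number of visits to si starting from s under the greedy policy.
    The geometric series sums to (I - gamma * P)^-1 for gamma < 1."""
    n = P_greedy.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P_greedy)

def influence_on_set(P_greedy, boundary_states, gamma=0.95):
    """Total influence I(s|Gamma) of each state s on a set Gamma of states."""
    I = influence(P_greedy, gamma)
    return I[:, boundary_states].sum(axis=1)
```

For large grids one would truncate the series or solve the linear system iteratively rather than invert the matrix, but the closed form makes the definition concrete.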
However, improvements in value representations may not necessarily follow from splitting states with high influence if these states have accurate values. It is assumed that states with high variance in their values (due to having many possible successors with differing values) provide poor value estimates.9 Moreover, since state values depend on their successors' successors, a long-term (discounted) variance measure can also be derived from the local variance measures. These heuristics are combined to provide the following partitioning criteria:
1) Identify the set, Γ, of states along the decision boundary.
2) Calculate the total influence on the decision boundary values, I(s|Γ), for all s.
3) Calculate the long-term discounted variance of each state, σ²(s).
4) Calculate the utility of splitting a state as: σ(s)I(s|Γ).
5) Split a fraction of the highest utility states.
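The criteria above can be sketched end to end. The recursion used here for the long-term discounted variance (local variance plus discounted successor variance) is one plausible reading of "derived from the local variance measures"; the exact recursion used by Munos and Moore may differ, and all names are mine.

```python
import numpy as np

def long_term_variance(local_var, P_greedy, gamma=0.95):
    """Fixed point of sigma2(s) = local_var(s) + gamma^2 * sum_x P[s,x] sigma2(x):
    variance is inherited, discounted, from successor states (a sketch)."""
    n = len(local_var)
    return np.linalg.solve(np.eye(n) - gamma**2 * P_greedy, local_var)

def states_to_split(local_var, infl_on_boundary, P_greedy,
                    gamma=0.95, fraction=0.1):
    """Rank states by sigma(s) * I(s|Gamma) and return the top fraction."""
    sigma = np.sqrt(long_term_variance(local_var, P_greedy, gamma))
    utility = sigma * np.asarray(infl_on_boundary)
    k = max(1, int(len(utility) * fraction))
    return np.argsort(utility)[-k:]
```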
An illustration of this process appears in Figure 6.13. The figures are provided with thanks to Remi Munos [94].
The Standard Deviation×Influence measure, σ(s)I(s|Γ), performed greatly better for equivalent numbers of states, and appears to be the most principled method to date. Although, in their experiments, a complete and accurate environment model was available, it seems clear that the method can naturally be adapted to the case where a model is learned. Model-free versions of this method do not seem possible; there is no obvious way to learn the influence measure without a model.
Note that the influence and variance measures are artefacts of the value estimation procedure and do not directly measure how "good" or "bad" a state is. The influence and variance of states tend to zero with increasing simulation length, and become zero if the simulation enters a terminal state. Thus, there remains the possibility of further developments with this approach that adjust the simulation timescale in order to reduce the number of states with high variance and influence.
9 It is assumed, since only deterministic reward functions and environments are considered, that the source of variance must lie in value uncertainties due to the approximate representation.
[Figure: (a) the optimal policy and several trajectories; (b) influence on 3 points. Axes: position (horizontal) against velocity (vertical), with the goal marked.]

Figure 6.13: Stages of Munos and Moore's variable resolution scheme for a mountain car task. The task differs slightly from the one used in experiments earlier in this chapter and provides the highest reward for reaching the goal with no velocity. The top-left figure shows the optimal policy for this task. Influence measures a state's contribution to the value of a set of other states (top-right). Standard deviation is a measure of the certainty of a state's value. The Influence×Standard Deviation measure is used to decide where to increase the resolution. A fraction of the highest valued (darkest) states by this measure is partitioned.
Parti-Game
The Parti-Game algorithm is an online model-learning method that also employs kd-trees for value and policy representations [86] (see also Ansari et al. for a revised version [2]). The method does not solve generic RL problems but aims to find any path to a known goal state in a deterministic environment.
The method is assumed to have local controllers that enable the agent to steer to adjacent regions (the set of available actions is the set of adjacent regions). The method attempts to minimise the expected number of regions traversed to reach the goal, learning a region transition model and calculating a regions-to-goal value-function as it goes (all untried actions in a region are assumed to lead directly to the goal). The method behaves greedily with respect to its value function at all times. The splitting criterion is to divide regions along the "win/lose" boundary between where it is currently thought possible to reach the goal and where it is not. Importantly, as the resolution increases, high-resolution areas appear expensive to cross because they increase the regions-to-goal value; thus greedy exploration initially avoids the win/lose boundary where it has previously failed to reach the goal. However, as alternative routes become exhausted, the win/lose boundary is eventually explored. This symbiosis of the exploration method and representation appears to be the source of the algorithm's success. The method has been shown to very quickly find paths to a goal state in problems with up to 9-dimensional continuous state.
The experiments showed that the final policies achieved can be better, and are reached more quickly, than those of fixed uniform representations. This is especially true in problems requiring very fine control in a relatively small part of the entire state-space.
The independent study by Munos and Moore shows that partitioning at decision boundaries, and other local partitioning criteria, finds sub-optimal solutions. The non-local heuristic of partitioning states whose values are uncertain and which also influence the values at decision boundaries (and therefore the location of decision boundaries) allows smaller representations of higher quality policies to be found than local methods.
Chapter 7

Value and Model Learning with Discretisation

Chapter Outline

This chapter introduces learning methods for discrete event, continuous time problems (modelled formally as Semi-Markov Decision Processes). We will see how the standard discrete time framework can lead to biasing problems when used with discretised representations of continuous state problems. A new method is proposed that attempts to reduce this bias by adapting learning and control timescales to fit a variable timescale given by the representation. For this purpose Semi-Markov Decision Process learning methods are employed.
Consider the following environment: the learner exists in the corridor shown in Figure 7.1. Episodes always start in the leftmost state. Each action causes a transition one state to the right until the rightmost state is entered, where the episode terminates and a reward of 1 is given. A reward of zero is received for all other actions and γ = 0.95. The environment is discrete and Markov except that the agent's perception of it is limited to four larger discrete states.
Figure 7.2 shows the resulting value-function when standard (1-step) DP and 1-step Q-learning are used with state aliasing. With Q-learning, backup (3.34) was applied after every step. With DP, a maximum-likelihood model was formed by applying backups (3.41) and (3.42) after each step and solving the model using value-iteration. Both methods learn over-estimates of the value-function by the last region.
The modelled MDP in Figure 7.3 is that learned by the 1-step DP method. Over-estimation occurs since the rightmost region learns an average value of the aliased states it contains. Unfortunately, the region which leads into it requires the value of its first state (not the average) as its return correction in order to predict the return for entering that region and acting from there onwards. Since, in this example, the first state of a region always has a lower value than the average, the return correction introduces an over-optimistic bias. These biases accumulate as they are propagated to the predecessor regions.
The effect on Q-learning is worse. Having a high step-size, α, weights Q-values towards the more recent return estimates used in backups. In the extreme case where α = 1, each backup to a region wipes out any previous value; each value records the return observed upon leaving the region. This leads to the case where the leftmost region learns the value for being just 4 steps from the goal. This is especially undesirable in continual learning tasks where α cannot be declined in the standard way.
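The α = 1 pathology is easy to reproduce. The sketch below (my own code, not the thesis's) runs 1-step value backups over the 64-state corridor aliased into four regions; since there is only one action, the Q-learning backup reduces to a value update. With α = 1 the leftmost region converges to γ³, the value of being four steps from the goal, rather than the true γ⁶³ of the leftmost state.

```python
GAMMA = 0.95
N_STATES, N_REGIONS = 64, 4

def region(s):
    """State aliasing: 64 corridor states mapped onto 4 regions."""
    return s * N_REGIONS // N_STATES

def q_learning_corridor(alpha=1.0, episodes=200):
    """1-step backups on the aliased corridor. With alpha = 1, the last
    backup made to a region in each episode wipes out earlier ones, so
    each region records the return observed upon leaving it."""
    V = [0.0] * N_REGIONS
    for _ in range(episodes):
        for s in range(N_STATES - 1):
            terminal = (s == N_STATES - 2)      # entering the goal state
            target = 1.0 if terminal else GAMMA * V[region(s + 1)]
            V[region(s)] += alpha * (target - V[region(s)])
    return V
```

Running it gives V ≈ [0.857, 0.903, 0.95, 1.0], i.e. the leftmost region learns γ³ ≈ 0.857 while the true value of the start state is γ⁶³ ≈ 0.04.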
[Figure: two corridor diagrams. Top: states labelled t = 0 through t = 63, with r = 1 on entering the final state at t = 64. Bottom: the same corridor aliased into four regions with boundaries at t = 16, 32 and 48.]

Figure 7.1: (top) The corridor task. (bottom) The same task with states aliased into four regions.
[Figure: two value plots over states 0 to 60, values 0 to 1. Left: V*(s) and the 1-step DP solution. Right: V*(s) and Q-learning solutions for α = 1.0, 0.8, 0.5, 0.2, 0.1 and 0.01.]

Figure 7.2: Solutions to the corridor task using 1-step DP (left) and 1-step Q-learning (right).
[Figure: a four-region chain; each region self-transitions with probability 1 − p and advances with probability p, with r = 1 on leaving the final region.]

Figure 7.3: A naively constructed maximum likelihood model of the aliased corridor. p = 1/16.
where a = a_t, s = s_t, s' = s_{t+n}, and R̂^a_s is the estimated expected (uncorrected) truncated return for taking a in state s for n steps, and P̂^a_{sx} gives the estimated discounted transition probability.
A multi-time model (P̂ and R̂) concisely represents the effects of following a course of action for several time-steps (and possibly variable amounts of time) instead of the usual one step. Since the amount of discounting that needs to occur (in the mean) is accounted for by the model, γ is dropped from the 1-step DP backup to form the following multi-time backup rule,

    V̂(s) ← max_a ( R̂^a_s + Σ_{s'} P̂^a_{ss'} V̂(s') ).    (7.5)
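One sweep of backup (7.5) can be written compactly; this is a sketch in which the array shapes are my assumptions (R̂ indexed [s, a], P̂ indexed [s, a, s']):

```python
import numpy as np

def multi_time_backup(V, R_hat, P_hat):
    """One sweep of the multi-time backup (7.5):
        V(s) <- max_a [ R_hat[s, a] + sum_s' P_hat[s, a, s'] * V(s') ]
    No explicit gamma appears: P_hat is a *discounted* transition model,
    so its rows sum to gamma^n rather than to 1."""
    return (R_hat + P_hat @ V).max(axis=1)
```

Iterating this sweep to a fixed point is the multi-time analogue of value-iteration.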
More generally, the above multi-time methods are a special case of continuous time discrete event methods for learning in Semi-Markov Decision Processes (SMDPs) (see [61, 114]). Here, n may be a variable, real-valued amount of time. If a successor state is entered after some real-valued duration, Δt > 0, replacing all occurrences of n with Δt in the above updates yields a new set of algorithms suitable for learning in an SMDP. In cases where reward is also provided in continuous time by a reward rate, ρ, the following immediate reward measure can be used while still performing learning in discrete time [91, 25],

    r_t^(Δt) = ∫_0^Δt γ^x ρ^{a_t}_{s_{t+x}} dx.    (7.6)
All λ-return methods may also be adapted to work in this way by defining the return estimate as follows:

    z_t = (1 − λ^Δt) [ r_t^(Δt) + γ^Δt Û(s_{t+Δt}) ]
        + λ^Δt [ r_t^(Δt) + γ^Δt z_{t+Δt} ]    (7.7)

By recording the time interval Δt, along with the states observed, rewards collected and actions taken, Equation 7.7 allows an SMDP variant of backwards replay and the experience stack method to be constructed straightforwardly. Also, from (7.7), the following updates for a continuous time, accumulate trace TD(λ) may be found:
for a ontinuous time, a umulate tra e TD() may be found:
8s 2 S; e(s)
( )t e(s) + 1; if s = st,
( )t e(s); otherwise.
8s 2 S; V^ (s) V^ (s) + (rt t + t V^ (st+t ) V^ (st ))e(s)
( )
A derivation appears in Appendix C. This method di ers from other SMDP TD() methods
(e.g. see [44℄, whi h also onsiders a ontinuous state representation). The derivation of
these updates in Appendix C show that the version here is the analogue of the forward-view
ontinuous time -return estimate (Equation 7.7).
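The trace and value updates above translate directly into code. The following is a sketch under my own naming, not the thesis's implementation; `r_dt` is taken to be the already-discounted reward r_t^(Δt) accumulated over the transition's duration:

```python
def smdp_td_lambda_step(V, e, s, s_next, r_dt, dt,
                        alpha=0.1, gamma=0.95, lam=0.9):
    """One continuous-time accumulate-trace TD(lambda) update.
    V and e are dicts mapping states to values and traces; a terminal
    successor can be passed as None (value 0)."""
    decay = (gamma * lam) ** dt
    for x in list(e):
        e[x] *= decay                       # all traces decay by (gamma*lam)^dt
    e[s] = e.get(s, 0.0) + 1.0              # accumulate at the current state
    td_error = r_dt + gamma ** dt * V.get(s_next, 0.0) - V.get(s, 0.0)
    for x in e:
        V[x] = V.get(x, 0.0) + alpha * td_error * e[x]
```

When every Δt equals 1 this reduces to the ordinary discrete-time accumulate-trace TD(λ) update.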
[Figure: four panels showing a discrete region entered at START, with the sequences of actions (a1, a2) taken along trajectories through the region in each case.]

Figure 7.4: (top-left) Actions taken and updates made by the original every-step algorithms. The discrete region is entered at START. Selecting different actions on each step can cause dithering and poorly measured return for following the policy recommended by the region (which can only be a single action). (top-right) Effect of the commitment policy. Updates are still made after every step. (bottom-left) Multi-time first-state update with commitment policy. Updates are made once per region. (bottom-right) Possible distribution of state values whose mean is learned by first-state methods. It is assumed that states are entered predominantly from one direction.
later and actions A_{s'} are now available. The learning updates should be made here.
The following wrappers transform the original algorithm into one which predicts the return available from the first states of a region entered. It is assumed that the percept, s, denotes a region and not a state.
nextAction'(agent) → action
    if dt = 0 then
        a ← nextAction(agent)
    return a

The variables dt, a, s and multistep_r are global. At the start of each episode dt and multistep_r should be initialised to 0.
The nextAction' wrapper ensures that the agent is committed to taking the action chosen in the first state of s until it leaves. If we seek a policy that prescribes only one action per region, it is important that only single actions are followed within a region, otherwise the return estimates may become biased towards the return available for following mixtures of actions.1
For control optimisation problems it is assumed that there is at least one deterministic policy that is optimal. If the method were instead to be used for policy evaluation, the agent could equally be committed to some (possibly stochastic but still fixed) policy until the region is exited.
The setState' wrapper records the truncated discounted return and the amount of time which has passed, which is necessary for the original variable-time algorithm to make a backup. The value, τ_max, is the maximum possible amount of time for which the agent is committed to following the same action. It may happen that the agent becomes stuck if it continually follows the same course of action in a region. The time bound attempts to avoid such situations.
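The pair of wrappers can be sketched as a single class. This is an illustrative reconstruction, not the thesis's code: the method names on the wrapped `agent` (next_action, update) and the bundling into one object are my assumptions.

```python
class FirstStateWrapper:
    """Commit to one action per region, accumulate the truncated
    discounted return, and hand the wrapped learner one variable-duration
    (SMDP) transition per region visit."""

    def __init__(self, agent, gamma=0.95, t_max=100):
        self.agent, self.gamma, self.t_max = agent, gamma, t_max
        self.dt, self.a, self.multistep_r = 0, None, 0.0
        self.region = None

    def next_action(self, region):
        if self.dt == 0:                     # first state of a new region:
            self.region = region             # choose and commit to an action
            self.a = self.agent.next_action(region)
        return self.a

    def set_state(self, new_region, reward):
        self.multistep_r += self.gamma ** self.dt * reward
        self.dt += 1
        if new_region != self.region or self.dt >= self.t_max:
            # region exited (or commitment timed out): one SMDP backup
            self.agent.update(self.region, self.a, self.multistep_r,
                              new_region, self.dt)
            self.dt, self.multistep_r = 0, 0.0
```

The wrapped learner then sees exactly one (region, action, discounted return, successor region, duration) tuple per region visit, matching the bottom-left panel of Figure 7.4.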
See Figure 7.4 for an intuitive description of first-state methods. Note that the method implicitly assumes that regions are predominantly entered from one direction. If entered from all directions then the expected first-state values will be closer to the real mean state-value of the region as a whole. Thus, in this case, one would not expect the method to provide any significant improvements over every-step update methods.
1 This form of exploration was used in the decision boundary partitioning experiments.
[Figure: value plot over states 0 to 60, values 0 to 0.5.]

Figure 7.5: The value-function found using first-state backups in the corridor task. Every-state Q-learning finds the same solution as every-state DP since a slowly declining learning rate was used.
Mountain Car Task. In the mountain car experiments the agent is presented with a 4×4 uniform grid representation of the state-space, with τ = 1 for all steps, γ = 0.9 and Q_0 = 0. The ε-greedy exploration method was used with ε declining linearly from 0.5 on the first episode to 0 on the last. All episodes start at randomly selected states. For the model-free methods (Q-learning and Peng and Williams' Q(λ)), α is also declined in the same way. Because the first-state methods alter the agent's exploration policy by keeping the choice of action constant for longer, the every-step methods are also tested using the same policy of committing to an action until a region is exited. For the model-based (DP) method, Wiering's version of prioritised sweeping was adapted for the SMDP case in order to allow the method to learn online [167]. 5 value backups were allowed per step during exploration, and the value function was solved using value-iteration for the current model at the end of each episode. Q_0 was used as the value of all untried actions in each region.
Peng and Williams' Q(λ) was also tested. The main purpose of this experiment was to try to establish whether the improvements caused by the wrapper were due to using first-state return estimates or simply through using multi-step returns. We have seen earlier in the thesis how multi-step methods can overcome slow learning problems by using single reward and transition observations to update many value estimates. One might think that this would provide the first-state method with an additional advantage over the every-step methods. However, in this respect each Q-learning method is actually very similar. Each method updates at most one value for each step (unlike λ-return and eligibility trace methods). Even so, PW-Q(λ) was also tested with λ = 1.0, ensuring that the return estimates employ the reward due to actions many steps in the future. The following state-replacing trace method was used (c.f. update (3.31)):

    ∀s, a:  e(s, a) ←  1,           if s = s_t and a = a_t,
                       0,           if s = s_t and a ≠ a_t,
                       γλ e(s, a),  otherwise.
[Figure: three plots over 50 episodes comparing Every-Step DP, Every-Step DP + Commitment Policy and First-State DP: average episode length (offline), mean regret, and mean squared regret.]
Figure 7.6: First-state results for the model-based method in the mountain car task. `Every-Step' indicates that learning updates and action choices for exploration were made after every step. `Every-Step + Commitment Policy' indicates that learning updates were made at every step, but action choices were made only upon entering a new region. `First-State' indicates that the variable timescale learning updates and action choices were made once per visited region. (See Figure 7.4.)
In Figure 7.6 (the model-learning method), we can see that the commitment policy led to big improvements in the learned policy, but no significant difference in performance in this, or the other measures, follows from using the first-state learning method. The commitment policy also led to improvements in terms of the regret measure. The standard every-step method learned values that were consistently over-optimistic and also generally greater in variance than the commitment policy methods.
With the Q-learning and Q(λ) methods (see Figure 7.7), the general picture is that some improvements are seen over the commitment policy method as a result of using the first-state updates. This happens in each measure to some degree. This result is somewhat surprising, especially for Q-learning, which can be viewed as performing a stochastic version of the value-iteration updates used in the model-learning experiment. A possible reason for this is the recency biasing effects of high learning rates (as seen in the Q-learning example in Section 7.2). To test this, the experiment was repeated with a lower and fixed learning rate (α = 0.1). In this case, the difference between the every-state and first-state commitment policy methods shrinks (see Figures 7.9 and 7.10).
[Figure 7.7: Q(0) results in the mountain car task with declining α; three plots over 200 episodes comparing Every-Step Q(0), Every-Step Q(0) + Commitment Policy and First-State Q(0) on episode length, mean regret and mean squared regret.]
[Figure: three plots over 50 episodes comparing Every-Step PW(1.0), Every-Step PW(1.0) + Commitment Policy and First-State PW(1.0) on episode length, mean regret and mean squared regret.]

Figure 7.8: Peng and Williams' Q(λ) results in the mountain car task with declining α.
[Figure 7.9: Q(0) results in the mountain car task with α = 0.1; three plots over 200 episodes comparing Every-Step Q(0), Every-Step Q(0) + Commitment Policy and First-State Q(0) on episode length, mean regret and mean squared regret.]
[Figure: three plots over 50 episodes comparing Every-Step PW(1.0), Every-Step PW(1.0) + Commitment Policy and First-State PW(1.0) on episode length, mean regret and mean squared regret.]

Figure 7.10: Peng and Williams' Q(λ) results in the mountain car task with α = 0.1.
for macro actions must become as low or lower than those for taking MDP-level actions.
The existing work with macro-actions still applies "single-step multi-time" learning updates (e.g. the adaptations of DP and Q-learning in Section 7.3). It seems likely that these methods might also benefit from the use of the new SMDP TD(λ) or SMDP experience stack algorithm, for the same reasons that these methods help in the fixed time interval case. These are multi-step, multi-time methods in the sense that their return estimates may bootstrap from values in the entire future, rather than a small subset of it. Some macro-learning methods learn at lower levels and higher levels in parallel while higher level policies are followed. In this case, efficient off-policy control learning methods such as those presented in Chapter 4 would seem appropriate.
Chapter 8

Summary

Chapter Outline

This chapter summarises the main contributions of the thesis, lists specific contributions and suggests directions for future research.
8.1 Review

This thesis has examined the capabilities of existing reinforcement learning algorithms, developed new algorithms that extend these capabilities where they have been found to be deficient, developed a practical understanding of the new algorithms through experiment and analysis, and has also strengthened elements of reinforcement learning theory.
It has focused upon two existing problems in reinforcement learning: i) problems of off-policy learning, and ii) problems with error-minimising function approximation approaches to reinforcement learning. These are the major contributions of the thesis and are detailed below:
Off-policy Learning. Off-policy learning methods allow agents to learn about one behaviour while following another. For control optimisation problems, agents need to evaluate the return available under the greedy policy in order to converge upon the optimal one. However, experience may be generated in fairly arbitrary ways; for example, generated by a human expert, or by a mechanism that selects actions in order to manage the exploration-exploitation tradeoff. Efficient off-policy learning methods already exist in the form of backward replayed Q-learning. However, it was previously unclear how this could be applied as an online learning algorithm. Online learning is an important feature of any method which efficiently manages the exploration-exploitation tradeoff. On one hand, eligibility trace methods can already be applied online and have enjoyed widespread use as a result. However, as sound off-policy methods they can be very inefficient. Moreover, where offline learning is possible (e.g. if the environment is acyclic), backward-replaying forward view methods seems a generally preferable approach. A forwards-backwards equivalence proof demonstrates that these methods learn from essentially the same estimate of return, but the forward view is more straightforward (analytically) and also has a natural computationally efficient implementation. Furthermore, backwards replay provides extra efficiency gains over eligibility trace methods when bootstrapping estimates of return are used (λ < 1). This comes from learning with information that is simply more up-to-date.
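The "more up-to-date information" point can be seen in a minimal sketch of backward replay. This is the underlying idea only, not the thesis's experience stack algorithm; the data layout is my assumption.

```python
def backward_replay_q(episode, Q, alpha=0.1, gamma=0.95):
    """Backward-replayed Q-learning over one recorded (acyclic) episode:
    processing transitions in reverse means each backup bootstraps from
    successor values that have themselves just been updated.
    `episode` is a list of (s, a, r, s_next); terminal s_next is None."""
    for (s, a, r, s_next) in reversed(episode):
        target = r if s_next is None else r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])
    return Q
```

Replaying the same episode forwards would propagate the terminal reward only one state per pass; the reversed order propagates it along the whole trajectory in a single pass.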
The work with the new experience stack algorithm in Section 4.4 represents an advance by inheriting the desirable properties of backwards replay (and clarifying what these are), while also allowing for online learning. When used for off-policy greedy policy evaluation it provides advantages over Watkins' Q(λ) (and Q-learning), by allowing credit for the current reward to be propagated back further than the last non-greedy action. However, it was shown that achieving this gain is strongly dependent upon whether the Q-values used as bootstrapping value estimates are over-estimates (i.e. whether they are optimistic). It was shown how optimistic initial value-functions (the rule of thumb for many exploration methods) can severely inhibit credit assignment for a variety of control-optimising RL methods. The separation of optimistic value estimates for encouraging exploration from the value estimates used as predictions of return appears to offer a solution to this problem.
Function Approximation for Reinforcement Learning. In order to scale up value-based RL methods to solve practical tasks with many-dimensional state-features, or tasks with continuous (or non-discrete) state, function approximators are employed to represent value functions and Q-functions. But many popular methods are known to suffer from instabilities, particularly when used with control-optimising RL methods or with off-policy update distributions (e.g. if making updates with experience gathered under exploring policies). The well-studied least-mean-squared error minimising gradient descent method is a famous example. It was shown how, through a new choice of error measure to minimise, this method can be made more stable. The boundedness of discounted-return-estimating RL methods was shown with this function approximation method. In particular, the proof holds for off-policy Q-learning and the new experience stack algorithm; the stability of these methods with gradient descent function approximation was not previously known. However, the linear averager method appears to be a less powerful function approximation technique than the original LMS method, although it too has frequently been used successfully for RL in the past.
In Section 6.2 the decision boundary partitioning (DBP) heuristic for representation discretisation was presented. The refinement criteria followed from the idea that, in continuous state-spaces, optimal problem solutions often have large areas of uniform policy. It is expected therefore that, in such cases, compact representations of optimal policies follow from attempting to represent in detail only those areas where the policy changes (decision boundaries). The major contribution here is the idea that function approximation should not be motivated by minimising the error between the learned and observed estimates of return, but by attempting to find the best action available in a state. A new method was introduced to refine the representation in areas where the greedy policy changes. An empirical test found the method to outperform fixed uniform discretisations. Coarse representations in the initial stages allowed fast learning and good initial policy approximations to be quickly learned. The finer discretisations which followed allowed policies of better quality to be learned.
The recent work by Munos and Moore (conducted independently and simultaneously) shows the DBP heuristic to find sub-optimal policies. Non-local refinement is also required in order to achieve accurate value estimates and therefore correct placement of the decision boundaries (at least for heavily bootstrapping value estimation procedures such as value-iteration). However, their method requires a model (or one to be learned) in order to be applied.
8.2 Contributions

The following is a list of the specific contributions, in order of appearance.

• In Section 2.4.3 an adaptation was made to the approximate modified policy iteration algorithm presented by Sutton and Barto in their standard text [150]. Their algorithm appears to be the first of its kind which explicitly claims to terminate and as such is of fundamental importance to the field. An oversight in their algorithm was shown using the new counterexamples in Figure 2.5. The algorithm was corrected and error bounds for the quality of the final policy were provided. A proof is provided in Appendix B which follows straightforwardly from the work of Williams and Baird [171]. The correction features in the errata of [150].

• The approximate equivalence of batch-mode accumulate-trace TD(λ) and a direct λ-return estimating algorithm is well known to the RL community; a derivation can be found in [150] for fixed α. In an empirical demonstration in Section 3.4.9, it was shown that this equivalence does not hold in the online-updating case (even approximately so), in cases where the environment is cyclical such that the accumulating trace value grows above some threshold. This result followed from the intuitive insight that stochastic updating rules of the form Z_{t+1} = Z_t + α(z_t − Z_t), having step-sizes greater than 2, diverge to infinity in cases where z_t is independent of Z_t.
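The divergence condition is easy to verify numerically (an illustrative sketch with a constant target z, which is the simplest case of z_t independent of Z_t):

```python
def iterate(alpha, z=1.0, steps=60):
    """Iterate Z <- Z + alpha * (z - Z) with a constant target z.
    The error (Z - z) is multiplied by (1 - alpha) each step, so the
    rule diverges whenever |1 - alpha| > 1, i.e. alpha > 2 or alpha < 0."""
    Z = 0.0
    for _ in range(steps):
        Z = Z + alpha * (z - Z)
    return Z
```

With alpha = 0.5 the iterate converges to the target; with alpha = 2.5 the error oscillates in sign and grows geometrically.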
• In Section 4.2.2 modifications to Wiering's Fast Q(λ) were described where it was likely that existing published versions of this algorithm might be misinterpreted. An empirical test was performed to demonstrate the algorithm's equivalence to Q(λ). This work was published jointly with Marco Wiering as [125].

• Section 4.4 introduced the Experience Stack algorithm. The existing backward replay method was adapted to allow for efficient model-free online off-policy control optimisation. Unlike other popular online learning methods (such as eligibility trace approaches), the method directly learns from λ-return estimates and also has a natural computationally efficient implementation. An experimental and theoretical analysis of the algorithm's parameters provided a characterisation of when the algorithm is likely to outperform related eligibility trace methods. This work was published as [123, 121].
In Section 4.7 optimistic initial value-functions were found to severely inhibit the
error-reducing abilities of greedy-policy evaluating RL methods. It was also seen how
exploration methods that employ optimism to encourage exploration can avoid these
problems by separating return predictions from the optimistic value estimates used to
encourage exploration. This work was published as [120, 122].
In Section 5.7 a "linear-averager" value function approximation scheme was for-
malised. The approximation scheme is already used for reinforcement learning and
differs from the well-studied incremental least mean square (LMS) gradient descent
scheme only in the error measure being minimised. A proof of finite (but possibly very
large) error in the value function was shown for all discounted return estimating RL
algorithms when employing a linear averager for function approximation. Notably,
the proof covers new cases such as Q-learning with arbitrary experience distributions
(i.e. arbitrary exploration). Examples of divergence in this case exist for the LMS
method. This work was published as [124].
Section 6.2 introduced the decision boundary partitioning (DBP) heuristic for repre-
sentation refinement based upon changes in the greedy action. This work was pub-
lished as [117, 119, 115].
In Chapter 7 an analysis of the biasing problems associated with bootstrapping algo-
rithms in discretised continuous state spaces was performed. A generic RL algorithm
modification was suggested to reduce this bias by attempting to learn the expected
first-state values of continuous regions. Some bias reduction and policy quality im-
provements were observed, but most improvements could be attributed either to fol-
lowing a policy which commits to a single action throughout a region, or to related
problems associated with learning with large learning rates.
In Appendix C, accumulate-trace TD(λ) was adapted to the SMDP case. An equiv-
alence with a forward-view SMDP method was established for the batch-update and
acyclic process case by adapting the proof method for the MDP case found in Sutton
and Barto's standard text [150].
Exploitation of the Optimistic Bias Problem. There are many algorithms that one
may choose to apply in solving RL problems. Which should be used and when? In par-
ticular, for control optimisation there are algorithms which evaluate the greedy policy (e.g.
Q-learning, Watkins' Q(λ), value-iteration). Algorithms for evaluating fixed policies (e.g.
TD(λ), SARSA(λ) and DP policy evaluation methods) may also be used for control by
assuming that an evaluation of a fixed policy is sought, and then making this policy pro-
gressively more greedy. The subtle difference is that fixed policy evaluation methods seem
likely to quickly eliminate unhelpful optimistic biases since their initial fixed policy has a
value function which is less than or equal to the optimal one in every state. However, while
these methods are spending time evaluating a fixed policy, they are not necessarily improv-
ing their policy. With this in mind, future work might aim to examine optimal ways of
selecting how greedy the policy under evaluation should be made in order to reduce value-
function error at the fastest possible rate. Initial work in this direction might examine the
differences between policy-iteration and value-iteration and seek hybrid approaches (similar
to Puterman's modified policy-iteration [114]).
Also, it remains to be seen whether, following from the dual update results in Section 4.7.3,
better exploration strategies can be developed. Improvements could be expected to follow
through providing exploration schemes with more accurate value estimates.
Appendix B

Modified Policy Iteration Termination

B and B^π are bootstrapping operators – they form new value estimates based upon existing
value estimates. BV̂ is a shorthand for a synchronous update sweep across all states (see
Section 2.3.2).
‖BV − BV̂‖∞ = max_s | max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V(s')] − max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V̂(s')] |
            ≤ γ max_{s'} | V(s') − V̂(s') |
            = γ ‖V − V̂‖∞

// Improve
Δ ← 0
for each s ∈ S:
    a_g ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V̂(s')]
    v' ← Σ_{s'} P^{a_g}_{ss'} [R^{a_g}_{ss'} + γ V̂(s')]
    Δ ← max(Δ, |V̂(s) − v'|)
    π'(s) ← a_g
Since v'(s) = BV̂(s), at the end of this we have Δ = ‖V̂ − BV̂‖∞. Thus, a bound on the error
of V̂ from V* at the end of this loop is given by Equation A.10,

‖V̂ − V*‖∞ ≤ ‖V̂ − BV̂‖∞ / (1 − γ)   (B.1)
           = Δ / (1 − γ)            (B.2)
From Equation B.1, Williams and Baird have shown that the following bound can be placed
upon the loss in return for following an improved (i.e. greedy) policy, π', derived from V̂
168 APPENDIX B. MODIFIED POLICY ITERATION TERMINATION
[171]:

V*(s) − V^{π'}(s) ≤ 2γΔ / (1 − γ)   (B.3)

for any state s. π' is derived from V̂ in the above algorithm.
Thus we obtain the full policy-iteration algorithm with a termination threshold T.

1)    do:
2a)     V̂ ← evaluate(π, V̂)
2b)     Δ ← 0
2c)     for each s ∈ S:
2c-1)     a_g ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V̂(s')]
2c-2)     v' ← Σ_{s'} P^{a_g}_{ss'} [R^{a_g}_{ss'} + γ V̂(s')]
2c-3)     Δ ← max(Δ, |V̂(s) − v'|)
2c-4)     π(s) ← a_g      (Make π ← π'.)
3)    while Δ > T

This algorithm guarantees that,

V*(s) − V^π(s) ≤ 2γT / (1 − γ)   (B.4)

upon termination.
Note that Equation B.3 does not rely upon the evaluate procedure returning an exact eval-
uation of V^π. Of course, termination requires that the evaluate/improve process converges
upon V̂ = V*. Puterman and Shin have established that modified policy-iteration will con-
verge if the evaluation step applies V̂ ← B^π V̂ a fixed number of times (i.e. at least once)
[113]. In the case where step 2a) is exactly V̂ ← BV̂, then the above algorithm reduces
to the synchronous value-iteration algorithm.
In practice, the evaluation step does not need to perform synchronous updates since apply-
ing V̂(s) ← B^π V̂(s) at least once for each state in S is generally at least as effective at
reducing ‖V^π − V̂‖ as the synchronous backup.
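The loop above can be sketched as follows (a minimal tabular sketch, assuming the MDP is given as arrays P[s, a, s'] and R[s, a, s']; the function name, the fixed number of evaluation sweeps, and the small test MDP below are illustrative, not from the thesis):

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, T, eval_sweeps=5):
    """Modified policy iteration with termination threshold T.

    P[s, a, s2]: transition probabilities; R[s, a, s2]: (discounted) immediate rewards.
    On termination, max_s (V*(s) - V^pi(s)) <= 2*gamma*T / (1 - gamma).
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)
    while True:
        # 2a) partial evaluation: apply the B^pi backup a fixed number of times
        for _ in range(eval_sweeps):
            V = np.array([np.sum(P[s, pi[s]] * (R[s, pi[s]] + gamma * V))
                          for s in range(n_states)])
        # 2b-2c) greedy improvement, tracking Delta = ||V - BV||
        delta = 0.0
        for s in range(n_states):
            q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * V))
                          for a in range(n_actions)])
            a_g = int(np.argmax(q))
            delta = max(delta, abs(V[s] - q[a_g]))
            pi[s] = a_g
        # 3) terminate once the Bellman residual falls below T
        if delta <= T:
            return pi, V
```

For example, on a two-state MDP where action 1 always moves to state 1 with reward 1 and action 0 moves to state 0 with reward 0 (γ = 0.9), the routine returns the all-ones policy with V̂ within T/(1 − γ) of V* = 10.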
Appendix C

Continuous Time TD(λ)
where r_t represents the discounted reward immediately collected between t−1 and t, and
Δt_{t+1} = t_{t+1} − t_t is the real-valued duration of the step from t to t+1. Then the
continuous time (forward-view) λ-estimate updates states as follows:

V̂(s_t) ← V̂(s_t) + α (z_t − V̂(s_t)),

where the λ-return satisfies the recursion,

z_t = (1 − λ^{Δt_{t+1}}) [r_{t+1} + γ^{Δt_{t+1}} V̂(s_{t+1})] + λ^{Δt_{t+1}} [r_{t+1} + γ^{Δt_{t+1}} z_{t+1}]
    = r_{t+1} + γ^{Δt_{t+1}} V̂(s_{t+1}) − (γλ)^{Δt_{t+1}} V̂(s_{t+1}) + (γλ)^{Δt_{t+1}} z_{t+1}.

Consider the change in this value, based upon a single estimate of λ-return if the update is
applied in batch-mode. (Throughout, for simplicity, α is assumed to be constant.) Repeatedly
expanding z_t with the recursion above and collecting terms,

z_t − V̂(s_t)
  = [r_{t+1} + γ^{Δt_{t+1}} V̂(s_{t+1}) − V̂(s_t)] + (γλ)^{Δt_{t+1}} [z_{t+1} − V̂(s_{t+1})]
  = [r_{t+1} + γ^{Δt_{t+1}} V̂(s_{t+1}) − V̂(s_t)]
    + (γλ)^{Δt_{t+1}} [r_{t+2} + γ^{Δt_{t+2}} V̂(s_{t+2}) − V̂(s_{t+1})]
    + ⋯
  = Σ_{k=t}^{∞} (γλ)^{t_k − t_t} δ_k,

where δ_k = r_{k+1} + γ^{Δt_{k+1}} V̂(s_{k+1}) − V̂(s_k). Summing the batch-mode changes made
at every visit to a state s over an entire (infinite) trial,

ΔV̂(s) = α Σ_{k=0}^{∞} Σ_{t=0}^{k} (γλ)^{t_k − t_t} I(s, s_t) δ_k.   (C.1)

Through reflection in the plane x = y, Σ_{x=L}^{H} Σ_{y=L}^{x} f(x, y) = Σ_{y=L}^{H} Σ_{x=L}^{y} f(y, x),
for any L, H and f,

ΔV̂(s) = α Σ_{t=0}^{∞} Σ_{k=0}^{t} (γλ)^{t_t − t_k} I(s, s_k) δ_t
       = α Σ_{t=0}^{∞} δ_t Σ_{k=0}^{t} (γλ)^{t_t − t_k} I(s, s_k)
Defining an eligibility value for s as:

e_t(s) = Σ_{k=0}^{t} (γλ)^{t_t − t_k} I(s, s_k)

then the eligibility traces for all states may be calculated incrementally as follows:

∀s ∈ S,  e_t(s) ← { (γλ)^{t_t − t_{t−1}} e_{t−1}(s) + 1,   if s = s_t,
                  { (γλ)^{t_t − t_{t−1}} e_{t−1}(s),       otherwise,

and the state values incrementally updated as follows:

∀s ∈ S,  V̂(s) ← V̂(s) + α δ_t e_t(s).
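A single transition of this incremental backward-view update can be sketched as follows (an illustrative sketch assuming tabular state values, a constant step size α, and durations Δt supplied by the environment; all names are hypothetical):

```python
def smdp_td_lambda_step(V, e, s, s_next, r, dt, alpha, gamma, lam):
    """One backward-view TD(lambda) update for a transition of real-valued duration dt.

    V: dict mapping state -> value estimate; e: dict mapping state -> eligibility trace.
    r is the discounted reward collected over the transition.
    """
    decay = (gamma * lam) ** dt          # traces decay by (gamma * lambda)^dt
    for state in e:
        e[state] *= decay
    e[s] = e.get(s, 0.0) + 1.0           # accumulate trace for the visited state
    # TD error: delta = r + gamma^dt * V(s') - V(s)
    delta = r + (gamma ** dt) * V.get(s_next, 0.0) - V.get(s, 0.0)
    for state in e:                      # credit all eligible states
        V[state] = V.get(state, 0.0) + alpha * delta * e[state]
    return delta
```

Setting dt = 1 throughout recovers the ordinary discrete-time accumulate-trace TD(λ) update.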
As for single-step TD(λ), this forward–backward equivalence applies only for the batch
updating and acyclic environment case. The equivalence is approximate for the general
online-learning case since the derivation assumes that V̂, as seen by the TD errors, is fixed
in value throughout the episode.
In cases where episode lengths are finite and s_T is the terminal state, since by definition
δ_k = 0 for k ≥ T, then (C.1) may precisely be rewritten as,

ΔV̂(s) = α Σ_{t=0}^{T−1} I(s, s_t) Σ_{k=t}^{T−1} (γλ)^{t_k − t_t} δ_k.

Using a similar method to the steps following (C.1), the same update rule follows for the
terminating state case as the infinite trial case.
Appendix D

Notation, Terminology and Abbreviations
r_t        Immediate reward received for the action taken immediately prior to time t.
R^a_s      Discounted immediate reward function.
t          Discrete time index. (Or step index in the SMDP case.)
Δt         Real valued time duration.
Û(s)       Generic return correction. Replace with the estimated value at s of fol-
           lowing the evaluation policy from s (e.g. Û(s) = max_a Q̂(s, a) for greedy
           policy evaluation).
V*         The value function for the optimal policy.
V^π        The value function for the policy π.
V̂^π        Estimate of the value function for the policy π.
V̂_0        Initial value function estimate.
X̂          Estimate of E[X].
z          Estimation target. Observed value whose mean we wish to estimate.
z^(1)      1-step corrected truncated return estimate.
z^(n)      n-step corrected truncated return estimate.
z^λ        λ-return estimate.
z^(λ,n)    n-step corrected truncated λ-return estimate.
x =yz y z xy+z
           Global amount of decay.
δ          TD error.
←          Assignment.
backward-view    Eligibility trace method. Updates of the form: V̂(s) ← V̂(s) + αδe(s)
greedy-action    argmax_a Q̂(s, a)
fixed-point      x is the fixed-point of f if x = f(x).
forward-view     Updates of the form: V̂(s) ← V̂(s) + α(z − V̂(s))
λ-return method  A forward view method.
n-step truncated return    r_{t+1} + ⋯ + γ^{n−1} r_{t+n}
n-step truncated corrected return    r_{t+1} + ⋯ + γ^{n−1} r_{t+n} + γ^n U(s_{t+n+1})
off-policy       Different to the policy under evaluation.
on-policy        As the policy under evaluation.
return correction    U(s_{t+n+1}) in a corrected n-step truncated return.
return           Long term measure of reward.
state            Environmental situation.
state-space      Set of all possible environmental situations.
BR     Backwards Replay
DBP    Decision Boundary Partitioning
DP     Dynamic Programming
FA     Function Approximator
LMSE   Least Mean Squared Error
MDP    Markov Decision Process
POMDP  Partially Observable Markov Decision Process
PW     Peng and Williams' Q(λ)
RL     Reinforcement Learning
SAP    State Action Pair
SMDP   Semi-Markov Decision Process (continuous time MDP)
TTD    Truncated TD(λ)
WAT    Watkins' Q(λ)
Bibliography
[1] C. G. Atkeson, A. W. Moore, and S. Schaal. Memory-based learning for control.
Technical Report CMU-RI-TR-95-18, CMU Robotics Institute, April 1995.
[2] M. A. Al-Ansari and R. J. Williams. Efficient, globally-optimized reinforcement learn-
ing with the Parti-game algorithm. In Advances in Neural Information Processing
Systems 11. The MIT Press, Cambridge, MA, 1999.
[3] J. S. Albus. Data storage in the cerebellar model articulation controller (CMAC).
Journal of Dynamic Systems, Measurement and Control, 97(3), 1975.
[4] J. S. Albus. A new approach to manipulator control: the cerebellar model articulation
controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3),
1975.
[5] C. Anderson. Approximating a policy can be easier than approximating a value
function. Technical Report CS-00-101, Department of Computer Science, Colorado
State University, CO, USA, 2000.
[6] C. Anderson and S. Crawford-Hines. Multigrid Q-learning. Technical Report CS-94-
121, Colorado State University, Fort Collins, CO 80523, 1994.
[7] David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. In
Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural
Information Processing Systems, volume 10. The MIT Press, 1998.
[8] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted
learning. AI Review, 11:75–113, 1996.
[9] L. C. Baird and A. W. Moore. Gradient descent for general reinforcement learning.
In Advances in Neural Information Processing Systems, volume 11, 1999.
[10] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approx-
imation. In Proceedings of the Twelfth International Conference on Machine Learning,
pages 30–37, San Francisco, 1995. Morgan Kaufmann.
[11] Leemon C. Baird. Reinforcement Learning Through Gradient Descent. PhD thesis,
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 1999.
Technical Report Number CMU-CS-99-132.
[12] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using
real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
[13] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive
elements that can solve difficult learning problems. IEEE Transactions on Systems,
Man and Cybernetics, 13(5):834–846, September 1983.
[14] R. Beale and T. Jackson. Neural Computing: An Introduction. Institute of Physics
Publishing, Bristol, UK, 1990.
[15] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[16] R. E. Bellman and S. E. Dreyfus. Applied Dynamic Programming. RAND Corp, 1962.
[17] D. P. Bertsekas. Distributed dynamic programming. IEEE Transactions on Auto-
matic Control, 27:610–616, 1982.
[18] D. P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical
Programming, 27:107–120, 1983.
[19] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Pren-
tice Hall, Englewood Cliffs, NJ, 1987.
[20] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical
Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.
[21] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific,
Belmont, MA, 1996.
[22] Michael Bowling and Manuela Veloso. Bounding the suboptimality of reusing sub-
problems. In Proceedings of IJCAI-99, 1999.
[23] Justin Boyan and Andrew Moore. Robust value function approximation by work-
ing backwards. In Proceedings of the Workshop on Value Function Approximation,
Machine Learning Conference, Tahoe City, California, July 9, 1995.
[24] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning:
Safely approximating the value function. In Proceedings of Neural Information Pro-
cessing Systems, volume 7. Morgan Kaufmann, January 1995.
[25] Steven J. Bradtke and Michael O. Duff. Reinforcement learning for continuous-time
Markov decision problems. In Advances in Neural Information Processing Systems,
volume 7, pages 393–400, 1995.
[26] P. V. C. Caironi and M. Dorigo. Training Q agents. Technical Report IRIDIA-94-14,
Université Libre de Bruxelles, 1994.
[27] Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable
Markov Decision Processes. PhD thesis, Brown University, Department of Computer
Science, Providence, RI, 1998.
[28] David Chapman and Leslie Pack Kaelbling. Input generalization in delayed rein-
forcement learning: An algorithm and performance comparisons. In Proceedings of
the Twelfth International Joint Conference on Artificial Intelligence, pages 726–731.
Morgan Kaufmann, San Mateo, CA, 1991.
[29] C. S. Chow and J. N. Tsitsiklis. An optimal one-way multigrid algorithm for discrete-
time stochastic control. IEEE Transactions on Automatic Control, 36:898–914, 1991.
[30] Pawel Cichosz. Truncated temporal differences and sequential replay: Comparison,
integration, and experiments. In Proceedings of the Poster Session of the Ninth In-
ternational Symposium on Methodologies for Intelligent Systems, 1996.
[31] Pawel Cichosz. Reinforcement Learning by Truncating Temporal Differences. PhD
thesis, Warsaw University of Technology, Poland, July 1997.
[32] Pawel Cichosz. TD(λ) learning without eligibility traces: A theoretical analysis.
Artificial Intelligence, 11:239–263, 1999.
[33] Pawel Cichosz. A forwards view of replacing eligibility traces for states and state-
action pairs. Mathematical Algorithms, 1:283–297, 2000.
[34] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to
Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.
[35] Craig Boutilier, Richard Dearden, and Moises Goldszmidt. Stochastic dynamic pro-
gramming with factored representations. Artificial Intelligence. To appear.
[36] Robert H. Crites. Large-Scale Dynamic Optimization Using Teams of Reinforcement
Learning Agents. PhD thesis, (Computer Science) Graduate School of the University
of Massachusetts, Amherst, September 1996.
[37] Scott Davies. Multidimensional triangulation and interpolation for reinforcement
learning. In Advances in Neural Information Processing Systems, volume 9, 1996.
[38] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8:341–362,
1992.
[39] P. Dayan. Improving generalisation for temporal difference learning: The successor
representation. Neural Computation, 5:613–624, 1993.
[40] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration.
In Proceedings of UAI-99, Stockholm, Sweden, 1999.
[41] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Pro-
ceedings of AAAI-98, Madison, WI, 1998.
[42] Morris H. DeGroot. Probability and Statistics. Addison Wesley, 2nd edition, 1989.
[43] Thomas G. Dietterich. State abstraction in MAXQ hierarchical reinforcement learn-
ing. In Advances in Neural Information Processing Systems, volume 12. The MIT
Press, 2000.
[44] Kenji Doya. Temporal difference learning in continuous time and space. In Advances
in Neural Information Processing Systems, volume 8, pages 1073–1079, 1996.
[45] P. Dupuis and M. R. James. Rates of convergence for approximation schemes in
optimal control. SIAM Journal of Control and Optimisation, 36(2), 1998.
[46] Fernando Fernandez and Daniel Borrajo. VQQL. Applying vector quantization to re-
inforcement learning. In M. Veloso, E. Pagello, and Hiroaki Kitano, editors, RoboCup-
99: Robot Soccer World Cup III, number 1856 in Lecture Notes in Artificial Intelli-
gence, pages 171–178. Springer, 2000.
[47] Jerome H. Friedman, Jon L. Bentley, and Raphael A. Finkel. An algorithm for find-
ing best matches in logarithmic expected time. ACM Transactions on Mathematical
Software, 3(3):209–226, September 1977.
[48] G. J. Gordon. Stable function approximation in dynamic programming. In Armand
Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Confer-
ence on Machine Learning, pages 261–268, San Francisco, CA, 1995. Morgan Kauf-
mann.
[49] Geoffrey J. Gordon. Online fitted reinforcement learning from the value function
approximation. In Workshop at ML-95, 1995.
[50] Geoffrey J. Gordon. Chattering in SARSA(λ). CMU Learning Lab internal report.
Available from http://www-2.cs.cmu.edu/~ggordon/, 1996.
[51] Geoffrey J. Gordon. Reinforcement learning with function approximation converges
to a region. In Advances in Neural Information Processing Systems, volume 12. The
MIT Press, 2000.
[52] W. Hackbusch. Multigrid Methods and Applications. Springer-Verlag, 1985.
[53] M. Hauskrecht, N. Meuleau, C. Boutilier, L. Pack Kaelbling, and T. Dean. Hierarchi-
cal solution of Markov decision processes using macro-actions. In Proceedings of the
1998 Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, 1998.
[54] Robert B. Heckendorn and Charles W. Anderson. A multigrid form of value-iteration
applied to a Markov decision process. Technical Report CS-98-113, Computer Science
Department, Colorado State University, Fort Collins, CO 80523, November 1998.
[55] John H. Holland, Lashon B. Booker, Marco Colombetti, Marco Dorigo, David E.
Goldberg, Stephanie Forrest, Rick L. Riolo, Robert E. Smith, Pier Luca Lanzi, Wolf-
gang Stolzmann, and Stewart W. Wilson. What is a Learning Classifier System?
In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, Learning
Classifier Systems. From Foundations to Applications, volume 1813 of LNAI, pages
3–32, Berlin, 2000. Springer-Verlag.
[56] Ronald A. Howard. Dynamic Programming and Markov Decision Processes. The MIT
Press, Cambridge, Massachusetts, 1960.
[57] Mark Humphrys. Action selection methods using reinforcement learning. In From
Animals to Animats 4: Proceedings of the Fourth International Conference on Sim-
ulation of Adaptive Behavior, volume 4, pages 135–144. MIT Press/Bradford Books,
MA, USA, 1996.
[58] Mark Humphrys. Action Selection Methods Using Reinforcement Learning. PhD
thesis, Trinity Hall, University of Cambridge, June 1997.
[59] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative
dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
[60] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learning
algorithm for partially observable Markov problems. In Advances in Neural Informa-
tion Processing Systems, volume 7, 1995.
[61] A. Bryson Jr. and Y. Ho. Applied Optimal Control. Hemisphere Publishing, New
York, 1975.
[62] Leslie Pack Kaelbling. Learning in Embedded Systems. PhD thesis, Department of
Computer Science, Stanford University, Stanford, CA, 1990.
[63] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement
learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[64] Keiko Motoyama, Keiji Suzuki, Masahito Yamamoto, and Azuma Ohuchi. Evolutionary
state space configuration with reinforcement learning for adaptive airship control. In
The Third Australia-Japan Workshop on Intelligent and Evolutionary Systems (Pro-
ceedings), 1999.
[65] S. Koenig and R. G. Simmons. The effect of representation and knowledge on
goal-directed exploration with reinforcement-learning algorithms. Machine Learning,
22:228–250, 1996.
[66] R. E. Korf. Real-time heuristic search. Artificial Intelligence, 42:189–221, 1990.
[67] J. R. Krebs, A. Kacelnik, and P. Taylor. Test of optimal sampling by foraging great
tits. Nature, 275(5675):27–31, 1978.
[68] R. Kretchmar and C. Anderson. Comparison of CMACs and radial basis functions for
local function approximators in reinforcement learning. In Proceedings of the IEEE
International Conference on Neural Networks, Houston, TX, pages 834–837, 1997.
[69] H. J. Kushner and P. Dupuis. Numerical Methods for Stochastic Control Problems in
Continuous Time. Applications of Mathematics. Springer Verlag, 1992.
[70] Leonid Kuvayev and Richard Sutton. Approximation in model-based learning. In
ICML'97 Workshop on Modelling in Reinforcement Learning, 1997.
[71] C. Lin and H. Kim. CMAC-based adaptive critic self-learning control. IEEE Trans-
actions on Neural Networks, 2:530–533, 1991.
[72] L. J. Lin. Self-improving reactive agents based on reinforcement learning, planning
and teaching. Machine Learning, 8:293–321, 1992.
[73] Long-Ji Lin. Scaling up reinforcement learning for robot control. In Proceedings of
the Tenth International Conference on Machine Learning, pages 182–189, Amherst,
MA, June 1993. Morgan Kaufmann.
[74] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity
of solving Markov decision problems. In Proceedings of the Eleventh International
Conference on Uncertainty in Artificial Intelligence, page 9, 1995.
[75] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms and
empirical results. Machine Learning, 22:159–196, 1996.
[76] S. Mahadevan and J. Connell. Automatic programming of behavior based robots.
Artificial Intelligence, 55(2-3):311–365, June 1992.
[77] Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Un-
certainty in Artificial Intelligence, 1999.
[78] J. J. Martin. Bayesian Decision Problems and Markov Chains. John Wiley and Sons,
New York, New York, 1969.
[79] Maja J. Matarić. Interaction and Intelligent Behavior. PhD thesis, MIT AI Lab,
August 1994. AITR-1495.
[80] John H. Mathews. Numerical Methods for Mathematics, Science and Engineering.
Prentice Hall, London, UK, 1995.
[81] Andrew McCallum. Instance-based utile distinctions for reinforcement learning. In
Proceedings of the Twelfth International Conference on Machine Learning, San Fran-
cisco, 1995. Morgan Kaufmann.
[82] Andrew K. McCallum. Reinforcement Learning with Selective Perception and Hid-
den State. PhD thesis, Department of Computer Science, University of Rochester,
Rochester, NY, 14627, USA, 1995.
[83] Amy McGovern, Richard S. Sutton, and Andrew H. Fagg. Roles of macro-actions in
accelerating reinforcement learning. In 1997 Grace Hopper Celebration of Women in
Computing, 1997.
[84] C. Melhuish and T. C. Fogarty. Applying a restricted mating policy to determine
state space niches using delayed reinforcement. In T. C. Fogarty, editor, Proceedings
of Evolutionary Computing, Artificial Intelligence and the Simulation of Behaviour
Workshop, pages 224–237. Springer-Verlag, 1994.
[85] Nicolas Meuleau and Paul Bourgine. Exploration of multi-state environments: Local
measures and back-propagation of uncertainty. Machine Learning, 35(2):117–154,
May 1999.
[86] A. W. Moore and C. G. Atkeson. The Parti-game algorithm for variable resolution
reinforcement learning in multidimensional state-spaces. Machine Learning, 21:199–
233, 1995.
[87] Andrew W. Moore. Variable resolution dynamic programming: Efficiently learning
action maps on multivariate real-valued state-spaces. In L. Birnbaum and G. Collins,
editors, Proceedings of the Eighth International Conference on Machine Learning.
Morgan Kaufmann, June 1991.
[88] Andrew W. Moore and Christopher G. Atkeson. Prioritised sweeping: Reinforcement
learning with less data and less time. Machine Learning, 13:103–130, 1994.
[89] Andrew William Moore. Efficient Memory Based Learning for Robot Control. PhD
thesis, University of Cambridge, Computer Laboratory, November 1990.
[90] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-
based methods. IEEE Transactions on Neural Networks, 12(2):181–202, March 2001.
[91] Remi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic
control problems. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors,
Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
[92] Remi Munos and Andrew Moore. Variable resolution discretization in optimal control.
Machine Learning. To appear.
[93] Remi Munos and Andrew Moore. Barycentric interpolator for continuous space &
time reinforcement learning. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors,
Advances in Neural Information Processing Systems, volume 11. The MIT Press, 1999.
[94] Remi Munos and Andrew Moore. Influence and variance of a Markov chain: Appli-
cation to adaptive discretization in optimal control. In IEEE Conference on Decision
and Control, 1999.
[95] Remi Munos and Andrew Moore. Variable resolution discretization for high-accuracy
solutions of optimal control problems. In Proceedings of the 16th International Joint
Conference on Artificial Intelligence, pages 1348–1355, 1999.
[96] Remi Munos and Jocelyn Patinel. Reinforcement learning with dynamic covering of
state-action space: Partitioning Q-learning. In From Animals to Animats 3: Proceed-
ings of the International Conference on Simulation of Adaptive Behavior, 1994.
[97] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning,
42:241–267, 2001.
[98] Mark J. L. Orr. Introduction to radial basis function networks. Technical report,
Institute for Adaptive Neural Computation, Division of Informatics, University of
Edinburgh, 1996. http://www.anc.ed.ac.uk/~mjo/rbf.html.
[99] Mark J. L. Orr. Recent advances in radial basis function networks. Technical report,
Institute for Adaptive Neural Computation, Division of Informatics, University of
Edinburgh, 1999. http://www.anc.ed.ac.uk/~mjo/rbf.html.
[100] S. Pareigis. Adaptive choice of grid and time in reinforcement learning. In Advances
in Neural Information Processing Systems, volume 10. The MIT Press, Cambridge,
MA, 1997.
[101] S. Pareigis. Multi-grid methods for reinforcement learning in controlled diffusion
processes. In Advances in Neural Information Processing Systems, volume 9. The
MIT Press, Cambridge, MA, 1998.
[102] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines.
In Advances in Neural Information Processing Systems, volume 10, 1997.
[103] M. D. Pendrith and M. R. K. Ryan. Actual return reinforcement learning versus tem-
poral differences: Some theoretical and experimental results. In The Thirteenth In-
ternational Conference on Machine Learning. Morgan Kaufmann, 1996.
[104] M. D. Pendrith and M. R. K. Ryan. C-Trace: A new algorithm for reinforcement learning
of robotic control. In ROBOLEARN-96, Key West, Florida, 19-20 May, 1996, 1996.
[105] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna frame-
work. Adaptive Behaviour, 2:437–454, 1993.
[106] J. Peng and R. J. Williams. Technical note: Incremental Q-learning. Machine Learn-
ing, 22:283–290, 1996.
[107] Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In W. Cohen
and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine
Learning, pages 226–232. Morgan Kaufmann, San Francisco, 1994.
[108] Larry Peterson and Bruce Davie. Computer Networks: A Systems Approach. Morgan
Kaufmann, 2nd edition, 2000.
[109] D. Precup and R. Sutton. Multi-time models for temporally abstract planning. In
Advances in Neural Information Processing Systems, volume 10, 1998.
[110] D. Precup and R. S. Sutton. Multi-time models for reinforcement learning. In Pro-
ceedings of the ICML'97 Workshop on Modelling in Reinforcement Learning, 1997.
[111] D. Precup, R. S. Sutton, and S. Singh. Eligibility trace methods for off-policy eval-
uation. In Proceedings of the 17th International Conference on Machine Learning.
Morgan Kaufmann, 2000.
[112] Bob Price and Craig Boutilier. Implicit imitation in multi-agent reinforcement learn-
ing. In Proceedings of the 16th International Conference on Machine Learning, 1999.
[113] M. L. Puterman and M. C. Shin. Modified policy iteration algorithms for discounted
Markov decision problems. Management Science, 24:1127–1137, 1978.
[114] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Pro-
gramming. John Wiley and Sons, Inc., New York, New York, 1994.
[115] Stuart Reynolds. Decision boundary partitioning: Variable resolution model-free
reinforcement learning. Technical Report CSRP-99-15, School of Computer Sci-
ence, The University of Birmingham, Birmingham, B15 2TT, UK, July 1999.
ftp://ftp.cs.bham.ac.uk/pub/tech-reports/1999/CSRP-99-15.ps.gz.
[116] Stuart I. Reynolds. Issues in adaptive representation reinforcement learning. Presenta-
tion at the 4th European Workshop on Reinforcement Learning, Lugano, Switzerland,
October 1999.
[117] Stuart I. Reynolds. Decision boundary partitioning: Variable resolution model-
free reinforcement learning. In Proceedings of the Seventeenth International Confer-
ence on Machine Learning, pages 783–790, San Francisco, 2000. Morgan Kaufmann.
http://www.cs.bham.ac.uk/~sir/pub/ml2k_DBP.ps.gz.
[119] Stuart I. Reynolds. Adaptive representation methods for reinforcement learning. In
Advances in Artificial Intelligence, Proceedings of AI-2001, Ottawa, Canada, Lecture
Notes in Artificial Intelligence (LNAI 2056), pages 345–348. Springer-Verlag, June
2001. http://www.cs.bham.ac.uk/~sir/pub/ai2001.ps.gz.
[120] Stuart I. Reynolds. The curse of optimism. In Proceedings of the Fifth European
Workshop on Reinforcement Learning, Utrecht, The Netherlands, pages 38–39, Octo-
ber 2001. http://www.cs.bham.ac.uk/~sir/pub/EWRL5_opt.ps.gz.
[121] Stuart I. Reynolds. Experience stack reinforcement learning: An online for-
ward λ-return method. In Proceedings of the Fifth European Workshop on
Reinforcement Learning, Utrecht, The Netherlands, pages 40–41, October 2001.
http://www.cs.bham.ac.uk/~sir/pub/EWRL5_stack.ps.gz.
[122] Stuart I. Reynolds. Optimistic initial Q-values and the max operator. In Qiang Shen,
editor, Proceedings of the UK Workshop on Computational Intelligence, Edinburgh,
UK, pages 63–68. The University of Edinburgh Printing Services, September 2001.
http://www.cs.bham.ac.uk/~sir/pub/UKCI-01.ps.gz.
[123] Stuart I. Reynolds. Experience stack reinforcement learning for off-policy control.
Technical Report CSRP-02-1, School of Computer Science, University of Birmingham,
January 2002. http://www.cs.bham.ac.uk/~sir/pub/ES-CSRP-02-1.ps.gz.
[124] Stuart I. Reynolds. The stability of general discounted reinforcement learning with
linear function approximation. In John Bullinaria, editor, Proceedings of the UK
Workshop on Computational Intelligence, Birmingham, UK, pages 139–146, Septem-
ber 2002. http://www.cs.bham.ac.uk/~sir/pub/ukci-02.ps.gz.
[125] Stuart I. Reynolds and Marco A. Wiering. Fast Q(λ) revisited. Technical Re-
port CSRP-02-2, School of Computer Science, University of Birmingham, May 2002.
http://www.cs.bham.ac.uk/~sir/pub/fastq-CSRP-02-2.ps.gz.
[126] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical
Statistics, 22:400–407, 1951.
[127] David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, volume 1:
Foundations. The MIT Press, Cambridge, MA, 1986.
[128] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems.
Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering
Department, September 1994.
[129] Gavin A. Rummery. Problem Solving with Reinforcement Learning. PhD thesis,
Department of Engineering, University of Cambridge, July 1995.
[130] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice
Hall, London, UK, 1995.
[131] Juan Carlos Santamaria, Richard Sutton, and Ashwin Ram. Experiments with
reinforcement learning in problems with continuous state and action spaces. Adaptive
Behavior, 6(2), 1998.
[132] A. Schwartz. A reinforcement learning algorithm for maximizing undiscounted
rewards. In Proceedings of the Tenth International Conference on Machine Learning,
pages 298–305. Morgan Kaufmann, San Mateo, CA, June 1993.
[133] J. Simons, H. Van Brussel, J. De Schutter, and J. Verhaert. A self-learning automaton
with variable resolution for high precision assembly by industrial robots. IEEE
Transactions on Automatic Control, 27(5):1109–1113, October 1982.
[134] S. Singh. Scaling reinforcement learning algorithms by learning variable temporal
resolution models. In Proceedings of the Ninth Machine Learning Conference, 1992.
[135] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for
single-step on-policy reinforcement-learning algorithms. Machine Learning, 2000.
[136] S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state
aggregation. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, Advances in Neural
Information Processing Systems: Proceedings of the 1994 Conference, pages 359–368.
The MIT Press, Cambridge, MA, 1994.
[137] Satinder Singh. Personal communication, 2001.
[138] Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Learning without state-estimation
in partially observable Markovian decision processes. In Proceedings of the
Eleventh International Conference on Machine Learning, 1994.
[139] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing
eligibility traces. Machine Learning, 22:123–158, 1996.
[140] William D. Smart and Leslie Pack Kaelbling. Practical reinforcement learning in
continuous spaces. In Proceedings of the Seventeenth International Conference on
Machine Learning, San Francisco, 2000. Morgan Kaufmann.
[141] P. Stone and R. S. Sutton. Scaling reinforcement learning toward RoboCup soccer. In
Eighteenth International Conference on Machine Learning, 2001.
[142] Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of
the 17th International Conference on Machine Learning, pages 943–950, San
Francisco, 2000. Morgan Kaufmann.
[143] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework
for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211,
1999.
[144] R. S. Sutton. Planning by incremental dynamic programming. In Proceedings of
the Eighth International Workshop on Machine Learning, pages 353–357. Morgan
Kaufmann, 1991.
[145] R. S. Sutton. Open theoretical questions in reinforcement learning. Extended abstract
of an invited talk at EuroCOLT'99, 1999.
[146] R. S. Sutton and D. Precup. Off-policy temporal-difference learning with function
approximation. In Proceedings of the Eighteenth International Conference on Machine
Learning, 2001.
[147] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD
thesis, University of Massachusetts, 1984.
[148] Richard S. Sutton. Learning to predict by the methods of temporal differences.
Machine Learning, 3:9–44, 1988.
[149] Richard S. Sutton. Generalization in reinforcement learning: Successful examples
using sparse coarse coding. In David S. Touretzky, Michael C. Mozer, and Michael E.
Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages
1038–1044. The MIT Press, Cambridge, MA, 1996.
[150] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction.
The MIT Press, Cambridge, MA, 1998.
[151] Richard S. Sutton and Satinder P. Singh. On step-size and bias in temporal difference
learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning
Systems, pages 91–96, 1994.
[152] Csaba Szepesvari. Convergent reinforcement learning with value function interpolation.
Technical Report TR-2001-02, Mindmaker Ltd., Budapest 1121, Konkoly Th.
M. u. 29-33, Hungary, 2001.
[153] P. Tadepalli and D. Ok. H-learning: A reinforcement learning method to optimize
undiscounted average reward. Technical Report 94-30-01, Oregon State University,
Computer Science Department, Corvallis, 1994.
[154] Vladislav Tadic. On the convergence of temporal-difference learning with linear
function approximation. Machine Learning, 42:241–267, 2001.