
Reinforcement Learning

with Exploration

Stuart Ian Reynolds

A thesis submitted to
The University of Birmingham
for the degree of
Doctor of Philosophy

School of Computer Science


The University of Birmingham
Birmingham B15 2TT
United Kingdom
December 2002
Abstract

Reinforcement Learning (RL) techniques may be used to find optimal controllers for multistep decision problems where the task is to maximise some reward signal. Successful applications include backgammon, network routing and scheduling problems. In many situations it is useful to have methods that learn about one behaviour while actually following another (i.e. `off-policy' methods). For example, the learner may follow an exploring behaviour, while its goal is to learn about the optimal behaviour, and not the exploring one. Existing methods for learning in this way (namely, Q-learning and Watkins' Q(λ)) are notoriously inefficient with their use of experience. More efficient methods exist but are either unsound (i.e. provably non-convergent to optimal solutions), or cannot be applied online. Online learning is important to let the effects of actions be quickly associated with those actions, in turn allowing later decisions to be informed of those effects.

A new algorithm is introduced to overcome these problems. It works online, without `eligibility traces', and has a naturally efficient implementation. Experiments and analysis characterise when it is likely to outperform existing related methods. New insights into the use of optimism for encouraging exploration are also discovered. It is found that standard practices can have strongly negative effects on a large class of RL methods for control optimisation.

Also examined are large and non-discrete state-space problems where `function approximation' is needed, but where many RL methods are known to be unstable. Particularly, these are control optimisation methods and when experience is gathered in `off-policy' distributions (e.g. while exploring). By a new choice of error measure to minimise, the well studied linear gradient descent methods are shown to be `stable' when used with any `discounted return' estimating RL method. The notion of stability is weak (very large, but finite error bounds are shown), but the result is significant insofar as it covers new cases such as off-policy and multi-step methods for control optimisation.

New ways of viewing the goal of function approximation in RL are also examined. Rather than a process of error minimisation between the learned and observed reward signal, the objective is viewed to be that of finding representations that make it possible to identify the best action for given states. A new `decision boundary partitioning' algorithm is presented with this goal in mind. The method recursively refines the value-function representation, increasing its resolution in areas where it is expected that this will result in better decision policies.

Acknowledgements

My deepest gratitude goes to my friend and long-time supervisor, Manfred Kerber. My demands on his time over the past five years for discussion, feedback and proof readings could (at best) be described as unreasonable. Unlike many PhD students my work was not tied to any particular grant, supervisor or research topic and I can think of few other people who would be willing to supervise work outside of their own field. Through Manfred I was lucky to have the freedom to explore the areas that interested me the most, and also to publish my work independently. For reasons I won't discuss here, these freedoms are becoming increasingly rare; any supervisor who provides them has a truly generous nature. Without his constant encouragement (and harassment) and his enormous expertise, I am sure that this thesis would never have reached completion. In several cases, important ideas would have fallen by the wayside without Manfred to point out the interest in them.

For patiently introducing me to the topics that interest me the most I am extremely grateful to Jeremy Wyatt. Through his reinforcement learning reading group, the objectionable became the obsession, and the obfuscated became the Obvious. As the only local expert in my field, his enthusiasm in my ideas has been the greatest motivation throughout. Without it I would surely have quit my PhD within the first year.

I thank the other members of my thesis group (past and present), Xin Yao, Russell Beale and John Barnden, for their support and guidance throughout. I also thank my department for funding my study (and extensive worldwide travel) through a Teaching Assistant scheme. Without this, not only would I not have had the freedom to pursue my own research, I would never have had the opportunity to perform research at all.

I thank Remi Munos and Andrew Moore for hosting my enlightening (but ultimately too short) sabbatical with them at Carnegie Mellon, and my department for funding the visit. I thank Geoff Gordon for indulging my long Q+A discussions about his work that lead to new contributions.

I am lucky to have benefited from discussions and advice (no matter how brief) with many of the field's other leading luminaries. These include Richard Sutton, Marco Wiering, Doina Precup, Leslie Kaelbling and Thomas Dietterich.

I thank John Bullinaria for finally setting me straight on neural networks.

Thanks to Tim Kovacs who co-founded the reinforcement learning reading group. As my office-mate for many years he has been the person to receive my most uncooked ideas. I look forward to more of his otter-tainment in the future and promise to return all of his pens the next time we meet.
Through discussions about my work (or theirs), by providing technical assistance, or even through alcoholic stress-relief, I have benefited from many other members of my department. Among others, these people include: Adrian Hartley, Axel Groman, Marcin Chady, Johnny Page, Kevin Lucas, John Woodward, Gavin Brown, Achim Jung, Riccardo Poli and Richard Pannell.

My apologies to Dee who I'm sure is the happiest of all to see this finished.

For my parents for everything.


Contents

1 Introduction 1
1.1 Artificial Intelligence and Machine Learning . . . 1
1.2 Forms of Learning . . . 1
1.3 Reinforcement Learning . . . 2
1.3.1 Sequential Decision Tasks and the Delayed Credit Assignment Problem . . . 3
1.4 Learning and Exploration . . . 3
1.5 About This Thesis . . . 4
1.6 Structure of the Thesis . . . 4
2 Dynamic Programming 7
2.1 Markov Decision Processes . . . 7
2.2 Policies, State Values and Return . . . 8
2.3 Policy Evaluation . . . 10
2.3.1 Q-Functions . . . 11
2.3.2 In-Place and Asynchronous Updating . . . 12
2.4 Optimal Control . . . 13
2.4.1 Optimality . . . 13
2.4.2 Policy Improvement . . . 13
2.4.3 The Convergence and Termination of Policy Iteration . . . 14
2.4.4 Value Iteration . . . 16
2.5 Summary . . . 18
3 Learning from Interaction 21
3.1 Introduction . . . 21
3.2 Incremental Estimation of Means . . . 22

3.3 Monte Carlo Methods for Policy Evaluation . . . 24
3.4 Temporal Difference Learning for Policy Evaluation . . . 26
3.4.1 Truncated Corrected Return Estimates . . . 26
3.4.2 TD(0) . . . 27
3.4.3 SARSA(0) . . . 28
3.4.4 Return Estimate Length . . . 29
3.4.5 Eligibility Traces: TD(λ) . . . 31
3.4.6 SARSA(λ) . . . 33
3.4.7 Replace Trace Methods . . . 33
3.4.8 Acyclic Environments . . . 34
3.4.9 The Non-Equivalence of Online Methods in Cyclic Environments . . . 35
3.5 Temporal Difference Learning for Control . . . 39
3.5.1 Q(0): Q-learning . . . 39
3.5.2 The Exploration-Exploitation Dilemma . . . 39
3.5.3 Exploration Sensitivity . . . 40
3.5.4 The Off-Policy Predicate . . . 44
3.6 Indirect Reinforcement Learning . . . 44
3.7 Summary . . . 46
4 Efficient Off-Policy Control 47
4.1 Introduction . . . 47
4.2 Accelerating Q(λ) . . . 49
4.2.1 Fast Q(λ) . . . 49
4.2.2 Revisions to Fast Q(λ) . . . 53
4.2.3 Validation . . . 56
4.2.4 Discussion . . . 61
4.3 Backwards Replay . . . 61
4.4 Experience Stack Reinforcement Learning . . . 65
4.4.1 The Experience Stack . . . 66
4.5 Experimental Results . . . 70
4.6 The Effects of λ on the Experience Stack Method . . . 80
4.7 Initial Bias and the max Operator . . . 82
4.7.1 Empirical Demonstration . . . 83
4.7.2 The Need for Optimism . . . 85
4.7.3 Separating Value Predictions from Optimism . . . 86
4.7.4 Discussion . . . 87
4.7.5 Initial Bias and Backwards Replay . . . 88
4.7.6 Initial Bias and SARSA(λ) . . . 89
4.8 Summary . . . 89
5 Function Approximation 93
5.1 Introduction . . . 93
5.2 Example Scenario and Solution . . . 94
5.3 The Parameter Estimation Framework . . . 96
5.3.1 Representing Return Estimate Functions . . . 97
5.3.2 Taxonomy . . . 97
5.4 Linear Methods (Perceptrons) . . . 98
5.4.1 Incremental Gradient Descent . . . 98
5.4.2 Step Size Normalisation . . . 99
5.5 Input Mappings . . . 101
5.5.1 State Aggregation (Aliasing) . . . 101
5.5.2 Binary Coarse Coding (CMAC) . . . 102
5.5.3 Radial Basis Functions . . . 103
5.5.4 Feature Width, Distribution and Gradient . . . 104
5.5.5 Efficiency Considerations . . . 106
5.6 The Bootstrapping Problem . . . 106
5.7 Linear Averagers . . . 110
5.7.1 Discounted Return Estimate Functions are Bounded Contractions . . . 113
5.7.2 Bounded Function Approximation . . . 115
5.7.3 Boundedness Example . . . 117
5.7.4 Adaptive Representation Schemes . . . 117
5.7.5 Discussion . . . 118
5.8 Summary . . . 119
6 Adaptive Resolution Representations 121
6.1 Introduction . . . 121

6.2 Decision Boundary Partitioning (DBP) . . . 122
6.2.1 The Representation . . . 122
6.2.2 Refinement Criteria . . . 122
6.2.3 The Algorithm . . . 124
6.2.4 Empirical Results . . . 127
6.3 Related Work . . . 133
6.3.1 Multigrid Methods . . . 133
6.3.2 Non-Uniform Methods . . . 133
6.4 Discussion . . . 141
7 Value and Model Learning With Discretisation 143
7.1 Introduction . . . 143
7.2 Example: Single Step Methods and the Aliased Corridor Task . . . 144
7.3 Multi-Timescale Learning . . . 145
7.4 First-State Updates . . . 147
7.5 Empirical Results . . . 149
7.6 Discussion . . . 155
8 Summary 157
8.1 Review . . . 157
8.2 Contributions . . . 159
8.3 Future Directions . . . 160
8.4 Concluding Remarks . . . 162
A Foundation Theory of Dynamic Programming 163
A.1 Full Backup Operators . . . 163
A.2 Unique Fixed-Points and Optima . . . 163
A.3 Norm Measures . . . 164
A.4 Contraction Mappings . . . 164
A.4.1 Bellman Residual Reduction . . . 165
B Modified Policy Iteration Termination 167
C Continuous Time TD(λ) 169
D Notation, Terminology and Abbreviations 173
Chapter 1

Introduction

1.1 Artificial Intelligence and Machine Learning


Artificial Intelligence (AI) is the study of artificial machines that exhibit `intelligent' behaviour. Intelligence itself is a notoriously difficult term to define, but commonly we associate it with the ability to learn from experience. Machine learning is the related field that has given rise to intelligent agents (computer programs) that do just this.

The idea of creating machines that imitate what humans do cannot fail to fascinate and inspire. Through the study of AI and machine learning we may discover hidden truths about ourselves. How did we come to be? How do we do what we do? Am I a computer running a computer program? And if so, can we simulate such a program on a computer? AI may even be able to offer insights into age-old philosophical questions. Who am I? And what is the importance of self?

Increasingly though, AI and machine learning are becoming engineering disciplines, rather than natural sciences. In the industrial age we asked, "How can I build machines to do work for me?" In the information age we now ask, "How can I build machines that think for me?" The difficulties faced in building such machines are enormous. How can we build intelligent learning machines when we know so little about the origins of our own intelligence?

In this thesis, I examine Reinforcement Learning (RL) algorithms. We will see how these computer algorithms can learn to solve very complex problems with the bare minimum of information. The algorithms are not hard-wired solutions to specific problems, but instead learn to solve problems through their past experiences.

1.2 Forms of Learning


This thesis examines how agents can learn how to act in order to solve decision problems. The task is to find a mapping from situations to actions that is better than others by some measure.

Learning could be said to have occurred if, on the basis of its prior experience, an agent

chooses to act differently (hopefully for the better) in some situation than it might have done prior to collecting this experience.

How learning occurs depends upon the form of feedback that is available. For example, through observing what happens after leaving home on different days with different kinds of weather, it may be possible to learn the following association between situations, actions and their consequences,

"If the sky is cloudy, and I don't take my umbrella, then I am likely to get wet."

Observing the consequence of leaving home without an umbrella on a cloudy day is a form of feedback. However, the consequences of actions in themselves do not tell how to choose better actions. Whether the agent should prefer to leave home with an umbrella depends on whether it minds getting wet. Clearly, without some form of utility attached to actions, it is impossible to know what changes could lead the agent to act in a better way. Learning without this utility is called unsupervised learning, and cannot directly lead to better agent behaviour.

If feedback is given in the form,

"If it is cloudy, you should take an umbrella,"

then supervised learning is occurring. A teacher (or supervisor) is assumed to be available that knows the best action to take in a given situation. The supervisor can provide advice that corrects the actions taken by the agent.

If feedback is given in the form of positive or negative reinforcements (rewards), for example,

"Earlier it was cloudy. You didn't take your umbrella. Now you got wet. That was pretty bad,"

then the agent learns through reinforcement learning. Learning occurs by making adjustments to the situation-action mapping that maximises the amount of positive reinforcement received and minimises the negative reinforcement. Often reinforcements are scalar values (e.g. -1 for a bad action, +10 for a good one). A wide variety of algorithms are available for learning in this way. This thesis reviews and improves on a number of them.

1.3 Reinforcement Learning


The key difference between supervised learning and reinforcement learning is that, in reinforcement learning an agent is never told the correct action it should take in a situation, but only some measure of how good or how bad an action is. It is up to the learning element itself to decide which actions are best given this information. This is part of the great appeal of reinforcement learning: solutions to complex decision problems may often be found by providing the minimum possible information required to solve the problem.
In many animals, reinforcement learning is the only form of learning that appears to occur and it is an essential part of human behaviour. We burn our hands in a fire and very quickly learn not to do that again. Pleasure and pain are good examples of rewards that reinforce patterns of behaviour.

Successful Example: TD-Gammon  A reinforcement learning system, TD-Gammon, has been used to learn to play the game of backgammon [155]. The system was set up such that a positive reinforcement was given upon winning a game. With little other information, the program learned a level of play equal to that of grandmaster human players with many years of experience. What is spectacular about the success of this system is that it learned entirely through self-play over several days of playing. No external heuristic information or teacher was available to suggest which moves might be best to take, other than the reinforcement received at the end of each game.¹

¹ The learner was provided with a model that predicts the likelihood of the next possible board positions for each of the possible rolls of the dice (i.e. the rules of the game). However, unlike similarly successful chess programs, a lengthy search of possible future moves is never conducted. Instead, the program simply learns the general quality of board configurations, and uses its knowledge about dice rolls and possible moves to choose a move which leads it to the next immediately `best-looking' board configuration.
1.3.1 Sequential Decision Tasks and the Delayed Credit Assignment Problem

Many interesting problems (such as backgammon) can be modelled as sequential decision tasks. Here, the system may contain many states (such as board positions), and differing actions may lead to different states. A whole series of actions may be required to get to a particular state (such as winning a game). The effects of any particular action may not become apparent until some time after it was taken. In the backgammon example, the reinforcement learner must be able to associate the utility of the actions it takes in the opening stages of the game with the likelihood of it winning the game, in order to improve the quality of its opening moves. The problem of learning to associate actions with their long-term consequences is known as the delayed credit assignment problem [130, 150].

This thesis deals with ways of solving delayed credit assignment problems. In particular, it deals primarily with value-function based methods in which the long-term utility for being in a particular state or taking an action in a state is modelled. By learning long-term value-estimates, we will see that these methods transform the difficult problem of determining the long term effects of an action, into the easy problem of deciding what is the best looking immediate action.

1.4 Learning and Exploration


In many practical cases, utility based learning methods (both supervised learning and reinforcement learning) face a difficult dilemma. Given that the methods are often put to work to solve some real-world problem, should the system directly attempt to do its best to solve the problem based upon its prior experience, or should it follow other courses of action in

the hope that these will reveal better ways of acting in the future? This is known as the exploration / exploitation dilemma [130, 150].

This dilemma is particularly important to reinforcement learning. For supervised learning it is often assumed that the way in which exploration of the problem is conducted is the responsibility of the teacher (i.e. not the responsibility of the learning element). For reinforcement learning, the reverse is more often true. The learning agent itself is usually expected to decide which actions to take in order to gain more information about how the problem may better be solved. Finding good general methods for doing so remains a difficult and interesting research question, but it is not the subject of this thesis. A separate question is how reinforcement learning methods can continue to solve the desired problem while exploring (or, more precisely, while not exploiting). Many reinforcement learning algorithms are known to behave poorly while not exploiting. One of this thesis' major contributions is an examination of how these methods can be improved.

1.5 About This Thesis


This thesis began as a piece of research into multi-agent systems in which many agents compete or collaborate to solve individual or collective problems. Reinforcement learning was identified as a technique that can allow agents to do this. Although multi-agent learning is not covered, two questions arose from this work which are now the subject of this thesis:

- In many tasks, the agent's environment may be very large. Typically, the agent cannot hope to visit all of the environment's states within its lifetime. Generalisations (and approximations) must be made in order to infer the best actions to take in unvisited states. If so, can internal representations be found such that the agent's ability to take the best actions is improved?

- If learning while not exploiting, many reinforcement learning algorithms are known to be inefficient, inaccurate or unstable. What can be done to improve this situation?

The second question (although researched most recently) is covered first as it follows more directly from the fundamental material presented in the early chapters. The first question is covered in the final chapters but was researched first. Since this time a great deal of related work has been done by other researchers that tackle the same question. This work is also reviewed.

1.6 Structure of the Thesis


The following items provide an overview of each part of the thesis.

- Chapter 2 introduces some simplifying formalisms, Markov Decision Processes (MDPs), and basic solution methods, Dynamic Programming (DP), upon which reinforcement learning methods build. A minor error in an existing version of the policy-iteration algorithm is corrected.
- Chapter 3 introduces standard reinforcement learning methods for learning without prior knowledge of the environment or the specific task to be solved. Here the need for reinforcement learning while not exploiting is identified, and the deficiencies in existing solution methods are made clear. Also, this chapter challenges a common assumption about a class of existing algorithms. We will see cases where "accumulate trace" methods are not approximately equivalent to their "forward view" counterparts.

- Chapter 4 introduces computationally efficient alternatives to the basic eligibility trace methods. The Fast Q(λ) algorithm is reviewed and minor changes to it are suggested. The backwards replay algorithm is also reviewed and proposed as a simpler and naturally efficient alternative to eligibility trace methods. The method also has the added advantage of learning with information that is more "up-to-date." However, it is not obvious how backwards replay can be employed for online learning in cyclic environments. A new algorithm is proposed to solve this problem and is also intended to provide improvements when learning while exploring. The experimental results with this algorithm lead to a new insight that optimism can inhibit learning in a class of control optimising algorithms. Optimism is commonly encouraged in order to aid exploration, and so this comes as a counter-intuitive idea to many.

- Chapter 5 reviews standard function approximation methods that are used to allow reinforcement learning to be employed in large and non-discrete state spaces. The well-studied and often employed linear gradient descent methods for least mean square error minimisation are known to be unstable in a variety of scenarios. A new error measure is suggested and it is shown that this leads to provably more stable reinforcement learning methods. Although the notion of stability is rather weak (only the boundedness of methods is proved, and very large bounds are given), this stability is established for, i) methods performing stochastic control optimisation and, ii) learning with arbitrary experience distributions, where this was not previously known to hold.

- Chapter 6 examines a new function approximation method that is not motivated by error minimisation, but by adapting the resolution of the agent's internal representation such that its ability to choose between different actions is improved. The decision boundary partitioning heuristic is proposed and compared against similar fixed resolution methods. Recent and simultaneously conducted work along these lines by Munos and Moore is also reviewed.

- Chapter 7 examines reinforcement learning in continuous time. This is a natural extension for methods that learn with adaptive representations. A simple modification of standard reinforcement learning methods is proposed that is intended to reduce biasing problems associated with employing bootstrapping methods in coarsely discretised continuous spaces. An accumulate trace TD(λ) algorithm for the Semi Markov Decision Process (SMDP) case is also developed and a forwards-backwards equivalence proof of the batch mode version of this algorithm is established.

- Chapter 8 concludes, lists the thesis' contributions and suggests future work. The new contributions can be found throughout the thesis.

- Appendix A reviews some basic terminology and proofs about dynamic programming methods that are employed elsewhere in the thesis.

- Appendix B shows termination error bounds of a new modified policy-iteration algorithm.

- Appendix C contains the forwards-backwards equivalence proof of the batch mode SMDP accumulate trace TD(λ) algorithm.

- Appendix D provides a useful guide to notation and terminology.

New contributions are made throughout. Readers with a detailed knowledge of reinforcement learning are recommended to read the contributions section in Chapter 8 before the rest of the thesis.
Chapter 2

Dynamic Programming

Chapter Outline

This chapter reviews the theoretical foundations of value-based reinforcement learning. It covers the standard formal framework used to describe the agent-environment interaction and also techniques for finding optimal control strategies within this framework.

2.1 Markov Decision Processes


A great part of the work done on reinforcement learning, in particular that on convergence proofs, assumes that the interaction between the agent and the environment can be modelled as a discrete-time finite Markov decision process (MDP). In this formalism, a step in the life of an agent proceeds as follows:

At time t, the learner is in some state, s ∈ S, and takes some action, a ∈ A_s, according to a policy, π. Upon taking the action the learner enters another state, s', at t+1 with probability P^a_{ss'}. For making this transition, the learner receives a scalar reward, r_{t+1}, given by a random variable whose expectation is denoted as R^a_{ss'}.

A discrete finite Markov process consists of,

- the state-space, S, which consists of a finite set of states {s_1, s_2, ..., s_N},
- a finite set of actions available from each state, A(s) = {a_{s1}, a_{s2}, ..., a_{sM}},
- a global clock, t = 1, 2, ..., T, counting discrete time steps. T may be infinite,
- a state transition function, P^a_{ss'} = Pr(s_{t+1} = s' | s_t = s, a_t = a), (i.e. the probability of observing s' at t+1 given that action a was taken in state s at time t).

Figure 2.1: A Markov Decision Process. Large circles are states, small black dots are actions. Some states may have many actions. An action may lead to differing successor states with a given probability.
For the RL framework we also add,

- a reward function which, given a ⟨s, a, s'⟩ triple, generates a random scalar valued reward with a fixed distribution. The reward for taking a in s and then entering s' is a random variable whose expectation is defined here as R^a_{ss'}.

A process is said to be Markov if it has the Markov Property. Formally, the Markov Property holds if,

Pr(s_{t+1} | s_t, a_t) = Pr(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...)    (2.1)

That is to say that the probability distribution over states entered at t+1 is conditionally independent of the events prior to (s_t, a_t); knowing the current state and action taken is sufficient to define what happens at the next step. In reinforcement learning, we also assume the same for the reward function,

Pr(r_{t+1} | s_t, a_t) = Pr(r_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...)    (2.2)

The Markov property is a simplifying assumption which makes it possible to reason about optimality and proofs in a more straightforward way.

For a more detailed account of MDPs see [21] or [114]. For the remainder of this section the terms process and environment will be used interchangeably under the assumption that the agent's environment can be exactly modelled as a discrete finite Markov process. In later chapters we examine cases where this assumption does not hold.
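To make the formalism concrete, here is a minimal sketch of how a finite MDP of this kind might be stored as plain Python tables. The two-state chain, the action names and the reward values are invented purely for illustration and do not come from the thesis.

# A finite MDP as plain dictionaries. P[s][a] is a list of (s_next, prob)
# pairs and R[(s, a, s_next)] is the expected reward for that transition.
# The toy two-state chain below is purely illustrative.

S = ["s1", "s2"]
A = {"s1": ["stay", "go"], "s2": ["stay"]}

P = {
    "s1": {"stay": [("s1", 1.0)],
           "go":   [("s1", 0.1), ("s2", 0.9)]},
    "s2": {"stay": [("s2", 1.0)]},              # s2 behaves like a terminal state
}

R = {
    ("s1", "stay", "s1"): 0.0,
    ("s1", "go",   "s1"): 0.0,
    ("s1", "go",   "s2"): 1.0,
    ("s2", "stay", "s2"): 0.0,
}

def check_transition_model(P):
    # Each action's successor probabilities must sum to one.
    for s, actions in P.items():
        for a, successors in actions.items():
            assert abs(sum(p for _, p in successors) - 1.0) < 1e-9

check_transition_model(P)

The Markov property is implicit in this layout: the successor distribution and the expected reward depend only on the current state and action.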

2.2 Policies, State Values and Return


A policy, π, determines how the agent selects actions from the state in which it finds itself. In general, a policy is any mapping from states to actions. A policy may be deterministic, in which case π(s_t) = a_t, or it may specify a distribution over actions, π(s, a) = Pr(a = a_t | s = s_t). Once we have established a policy, we can ask how much return this policy generates from any given state in the process. Return is a measure of reward collected for taking a series of actions. The value of a state is a measure of the expected return we can achieve for being in a state and following a given policy thereafter (i.e. its mean long-term utility). RL problems can therefore be further categorised by what estimate of return we want to maximise:
Single Step Problems. Here agents should act to maximise the immediate reward available from the current state. The value of a state, V(s), is defined as,

V^π(s) = E_π[ r_{t+1} | s = s_t ]    (2.3)

where E_π denotes an expectation given that actions are chosen according to π.

Finite Horizon Problems. Here agents should act to maximise the reward available given that there are just k more steps available to collect the reward. The value of a state is defined as,

V^π_{(k)}(s) = 0,  if k = 0,
V^π_{(k)}(s) = E_π[ r_{t+1} + V^π_{(k-1)}(s_{t+1}) | s = s_t ],  otherwise.    (2.4)

Receding Horizon Problems. The agent should act to maximise the finite horizon return at each step (i.e. we act to maximise V^π_{(k)} for all t, and k is fixed at every step).

Infinite Horizon Problems. The agent should act to maximise the reward available over an infinite future.

Most work in RL has centred around single-step and infinite horizon problems. In the infinite horizon case, it is common to use the total future discounted return as the value of a state:

z^∞_t = r_t + γ r_{t+1} + ... + γ^k r_{t+k} + ...    (2.5)

The parameter, γ ∈ [0, 1], is a discount factor. Choosing γ < 1 denotes a preference for receiving immediate rewards over those in the more distant future. It also ensures that the return is bounded in cases where the agent may collect reward indefinitely (i.e. if the task is non-episodic or non-terminating), since all infinite geometric series have finite sums for a common ratio of |γ| < 1.

The infinite horizon case is also of special interest as it allows the value of a state to be concisely defined recursively:

V^π(s) = E_π[ z^∞_{t+1} | s = s_t ]
       = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s = s_t ]
       = Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )    (2.6)

Equation 2.6 is known as a Bellman equation for V^π (see [15]).
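As a small numerical illustration of Equations 2.5 and 2.6 (not an example from the thesis), the sketch below computes the discounted return of a short reward sequence in two equivalent ways: as a direct weighted sum and by the backwards recursion z_t = r_t + γ z_{t+1} that the Bellman equation exploits. The reward values and the choice γ = 0.9 are arbitrary.

# Discounted return of a finite reward sequence (a prefix of Equation 2.5).
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0]          # r_t, r_{t+1}, r_{t+2}, r_{t+3}

# Direct sum: r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
z_direct = sum((gamma ** k) * r for k, r in enumerate(rewards))

# Backwards accumulation: z = r + gamma * z_next (the recursive view).
z_recursive = 0.0
for r in reversed(rewards):
    z_recursive = r + gamma * z_recursive

assert abs(z_direct - z_recursive) < 1e-12
print(z_direct)                          # 1.0 + 0.9**3 * 5.0 = 4.645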

Terminal States  Some environments may contain terminal states. Entering such a state means that no more reward can be collected this episode. To be consistent with the infinite horizon formalism, terminal states are usually modelled as a state in which all actions lead to itself and generate no reward. In practice, it is usually easiest to model all terminal states as a single special state, s+, whose value is zero and in which no actions are available.

2.3 Policy Evaluation


For some fixed stochastic policy, π, the iterative policy evaluation algorithm shown in Figure 2.2 will find an approximation of its state-value function, V̂^π (see also [114, 150]). The hat notation, x̂, indicates an approximation of some true value, x. Step 5 of the algorithm simply applies the Bellman equation (2.6) upon an old estimate of V^π to generate a new estimate (this is called a backup or update). Making updates for all states is called a sweep.

It is intuitively easy to see that this algorithm will converge upon V^π if 0 ≤ γ < 1. Assume that the initial value function estimate has a worst initial error of ε_0 in any state:

V̂_0(s) = V^π(s) ± ε_0    (2.7)

Throughout, ± is used to denote a bound in order to simplify notation. That is to say,

V^π(s) - ε_0 ≤ V̂_0(s) ≤ V^π(s) + ε_0    (2.8)

and not,

V̂_0(s) = V^π(s) + ε_0,  or,  V̂_0(s) = V^π(s) - ε_0.    (2.9)
1) Initialise V̂_0 with arbitrary finite values; k ← 0
2) do
3)   Δ ← 0
4)   for each s ∈ S
5)     V̂_{k+1}(s) ← Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V̂_k(s') )
6)     Δ ← max(Δ, |V̂_{k+1}(s) - V̂_k(s)|)
7)   k ← k + 1
8) while Δ > ε_T

Figure 2.2: The synchronous iterative policy evaluation algorithm. Determines the value function of a fixed stochastic policy to within a maximum deviation from V^π of ε_T/(1-γ) in any state for 0 ≤ γ < 1.
1) Initialise Q̂_0 with arbitrary finite values; k ← 0
2) do
3)   Δ ← 0
4)   for each ⟨s, a⟩ ∈ S × A(s)
5)     Q̂_{k+1}(s, a) ← Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ Σ_{a'} π(s', a') Q̂_k(s', a') )
6)     Δ ← max(Δ, |Q̂_{k+1}(s, a) - Q̂_k(s, a)|)
7)   k ← k + 1
8) while Δ > ε_T

Figure 2.3: The synchronous iterative policy evaluation algorithm for determining Q^π to within ε_T/(1-γ).
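A Python sketch of the synchronous algorithm of Figure 2.2 is given below, using the dictionary-based model layout assumed in the earlier sketch (P, R, and a stochastic policy pi[s][a] giving action probabilities). The function name, variable names and default tolerance are illustrative choices, not taken from the thesis.

# Synchronous iterative policy evaluation (cf. Figure 2.2).
def policy_evaluation(S, A, P, R, pi, gamma=0.9, eps_T=1e-6):
    V = {s: 0.0 for s in S}                   # arbitrary finite initial values
    while True:
        delta = 0.0
        V_new = {}
        for s in S:                           # one synchronous sweep
            v = 0.0
            for a in A[s]:
                backup = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                             for s2, p in P[s][a])
                v += pi[s][a] * backup
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new
        if delta <= eps_T:                    # termination test (step 8)
            return V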

After the first iteration we have (at worst),

V̂_1(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V̂_0(s') )
       = Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ (V^π(s') ± ε_0) )
       = ± γ ε_0 + Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )
       = ± γ ε_0 + V^π(s)    (2.10)

Note that only the true value-function, V^π, and not its estimate, V̂, appears on the right-hand side of 2.10. Continuing the iteration we have,

V̂_2(s) = Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V̂_1(s') )
       = Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ (V^π(s') ± γ ε_0) )
       = ± γ² ε_0 + Σ_a π(s, a) Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )
       = ± γ² ε_0 + V^π(s)
  ...
V̂_k(s) = ± γ^k ε_0 + V^π(s)    (2.11)

Thus if 0 ≤ γ < 1 then the convergence of V̂ to V^π is assured in the limit (as k → ∞) since lim_{k→∞} [γ^k ε_0] = 0. The following contraction mapping can be derived from 2.11 and states that each update strictly reduces the worst value estimate in any state by a factor of γ (also see Appendix A) [114, 20, 17]:

max_s |V̂_{k+1}(s) - V^π(s)| ≤ γ max_s |V̂_k(s) - V^π(s)|    (2.12)

The termination condition in step 8 of the algorithm allows it to stop once a satisfactory maximum error has been reached.

This recursive process of iteratively re-estimating the value function in terms of itself is called bootstrapping. Since Equation 2.6 represents a system of linear equations, several alternative solution methods, such as Gaussian elimination, could be used to exactly find V^π (see [34, 80]). However, most of the learning methods described in this thesis are, in one way or another, derived from iterative policy evaluation and work by making iterative approximations of value estimates.
2.3.1 Q-Functions
In addition to state-values we can also define state-action values (Q-values) as:

Q^π(s, a) = E[ r_{t+1} + γ V^π(s_{t+1}) | s = s_t, a = a_t ]
          = Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V^π(s') )    (2.13)

Intuitively, this Q-function (due to Watkins, [163]) gives the value of following an action for one step plus the discounted expected value of following the policy thereafter. The expected value of a state under a given stochastic policy may be found solely from the Q-values at that state:

V^π(s) = Σ_a π(s, a) Q^π(s, a)    (2.14)

and so the Q-function may be fully defined independently of V (by combining Equations 2.13 and 2.14):

Q^π(s, a) = Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ Σ_{a'} π(s', a') Q^π(s', a') )    (2.15)

It is straightforward to modify the iterative policy evaluation algorithm to approximate a Q-function instead of a state-value function (see Figure 2.3).

Note that V^π and Q^π are easily interchangeable when given R and P. Also, from equation (2.14), it follows that knowing Q^π and π is enough to determine V^π (without R or P). The reverse is not true.

We will see in Section 2.4.2 that being able to compare action-values makes it trivial to make improvements to the policy.
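Equation 2.14 is a one-line computation once Q^π and π are held as tables; no knowledge of P or R is needed for this direction. A minimal sketch, with an assumed dictionary layout, follows.

# V^pi(s) = sum_a pi(s, a) * Q^pi(s, a)   (Equation 2.14).
def v_from_q(s, Q, pi):
    # Q[(s, a)] holds action values; pi[s][a] holds action probabilities.
    return sum(prob * Q[(s, a)] for a, prob in pi[s].items())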

2.3.2 In-Place and Asynchronous Updating


Step 5 of each algorithm in Figures 2.2 and 2.3 performs updates that have the form Û_{k+1} = f(Û_k), where Û is a utility function, V̂ or Q̂. That is to say, that a new value of every state or state-action pair is given entirely from the last value function or Q-function. The algorithms are usually presented in this way only to simplify proofs about their convergence. This form of updating is called synchronous or Jacobi-style [150].

A better method is to make the updates in-place [17, 150] (i.e. we perform Û(s) ← f(Û(s)) for one state, and make further backups to other states in the same sweep using this new estimate). This requires storing only one value function or Q-function rather than two and is referred to as in-place or Gauss-Seidel updating. This method also usually converges faster since the values in the successor states upon which updates are based may have been updated within the same sweep and so are more up-to-date.

A third alternative is asynchronous updating [20]. This is the same as the in-place method except it allows for states or state-action pairs (SAPs) to be updated in any order and with varying frequency. This method is known to converge provided that all states (or SAPs) are updated infinitely often but with a finite frequency. An advantage of this approach is that the number of updates may be distributed unevenly, with more updates being given to more important parts of the state space [17, 18, 20].
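The sketch below shows one in-place (Gauss-Seidel) sweep under the same assumed model layout as the earlier sketches. The only difference from the synchronous version is that each new value is written back immediately, so later backups in the same sweep already see it.

# One in-place (Gauss-Seidel) sweep of policy evaluation.
def in_place_sweep(S, A, P, R, pi, V, gamma=0.9):
    delta = 0.0
    for s in S:                    # states may be visited in any order
        v_new = sum(pi[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, p in P[s][a])
                    for a in A[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new               # overwrite immediately; only one table kept
    return delta                   # largest change made during the sweep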

2.4 Optimal Control


In the previous section we have seen how to find the long-term utility for being in a state, or being in a state and taking a specific action, and then following a fixed policy thereafter. While it's useful to know how good a policy is, we'd really prefer to know how to produce better policies. Ultimately, we'd like to find optimal policies.
2.4.1 Optimality

An optimal policy, π*, is any which achieves the maximum expected return when starting from any state in the process. The optimal Q-function, Q*, is defined by the Bellman optimality equation [15]:

Q*(s, a) = max_π Q^π(s, a)
         = Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ max_{a'} Q*(s', a') )    (2.16)

Similarly, V* is given as:

V*(s) = max_a Q*(s, a)
      = max_a Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V*(s') )    (2.17)

There may be many optimal policies for some MDPs; this only requires that there are states whose actions yield equivalent expected returns. In such cases, there are also stochastic optimal policies for that process. However, every MDP always has at least one deterministic optimal policy. This follows simply from noting that if a SAP leads to a higher mean return than the other actions for that state, then it is better to always take that action than some mix of actions in that state. As a result most control optimisation methods seek only deterministic policies even though stochastic optimal policies may exist.
2.4.2 Policy Improvement

Improving a policy as a whole simply involves improving the policy in a single state. To do this, we make the policy greedy with respect to Q^π. The greedy action, a^g_s, for a state is defined as,

a^g_s = arg max_a Q(s, a)    (2.18)

A greedy policy, π^g, is one which yields a greedy action in every state. An improved policy may be achieved by making it greedy in any state:

π(s) ← arg max_a Q(s, a)    (2.19)

1) k ← 0
2) do
3)   find Q^{π_k} for π_k      (Evaluate policy)
4)   for each s ∈ S do
5)     π_{k+1}(s) ← arg max_a Q^{π_k}(s, a)      (Improve policy)
6)   k ← k + 1
7) while π_k ≠ π_{k-1}

Figure 2.4: Policy Iteration. Upon termination π is optimal provided that Q^{π_k} can be found accurately (see 2.4.3). In the improvement step (step 5), ties between equivalent actions should be broken consistently to return a consistent policy for the same Q-function, and so also allow the algorithm to terminate. Step 3 is assumed to evaluate Q^{π_k} exactly.
The policy improvement theorem first stated by Bellman and Dreyfus [16] states that if,

max_a Q^π(s, a) ≥ Σ_a π(s, a) Q^π(s, a)

holds then it is at least as good to take a greedy action in s as to follow π since, if the agent now passes through this state it can expect to collect at least max_a Q^π(s, a) (in the mean) rather than Σ_a π(s, a) Q^π(s, a) from there onward [16].¹ The actual improvement may be greater since changing the policy at s may also improve the policy for states following from s in the case where s may be revisited during the same episode.

The improved policy can be evaluated and then improved again. This process can be repeated until the policy can be improved no further in any state, at which point an optimal policy must have been found.

The policy iteration algorithm shown in Figure 2.4 (adapted from [150], first devised by Howard [56]) performs essentially this iterative process except that the policy improvement step is applied to every state in-between policy evaluations. Combinations of local improvements upon a fixed Q^π will also produce strictly (globally) improving policies; any local improvement in the policy can only maintain or increase the expected return available from the states that lead into it.

¹ Implicitly, this statement rests on knowing Q^π accurately.
2.4.3 The Convergence and Termination of Policy Iteration

With Exact Q^π.  Showing that the policy iteration algorithm terminates with an optimal policy in finite time is straightforward. Note that, i) the policy improvement step only produces deterministic policies, of which there are only |A|^{|S|} and, ii) each new policy strictly improves upon the previous (unless the policy is already optimal). Put these facts together and it is clear that the algorithm must terminate with the optimal policy in less than k^n improvement steps (k = |A|, n = |S|) [114]. In most cases this is a gross overestimate of the required number of iterations until termination. More recently, Mansour and Singh have provided a tighter bound of O(k^n / n) improvement steps [77]. Both of these bounds exclude the cost of evaluating the policy at each iteration.

(Diagrams of the two example processes are not reproduced here.)

Left table:
k    V_k(2)        V_k(3)
0    V_0(2)        V_0(3)
1    γ V_0(3)      γ V_0(2)
2    γ² V_0(2)     γ² V_0(3)
3    γ³ V_0(3)     γ³ V_0(2)
4    γ⁴ V_0(2)     γ⁴ V_0(3)
5    ...           ...

Right table:
k    V_k(2)   V_k(3)   V_k(4)   V_k(5)
0    1.000    1.000    1.000    1.000
1    0.900    0.900    0.810    0.729
2    0.810    0.656    0.729    0.656
3    0.590    0.590    0.531    0.478
4    0.531    0.430    0.478    0.430
5    0.387    0.387    0.349    0.314
6    0.349    0.282    0.314    0.282
7    0.254    0.254    0.229    0.206
8    0.229    0.185    0.206    0.185
9    ...      ...      ...      ...

Figure 2.5: Example processes where the modified policy-iteration algorithm in Figure 2.4 converges to optimal estimates but fails to terminate if the evaluation of Q^π is approximate. In both processes, all rewards are zero and so V^π(s) = Q^π(s, a) = 0 for all states and actions. Termination will not occur in each case because the greedy policy in state 1 never stabilises. The actions in this state have equivalent values under the optimal policy but the greedy action flip-flops indefinitely between the two choices while there is any error in the value estimates. In both cases, the error is only eliminated as k → ∞. The value of the successor state selected by the policy in state 1 after policy improvement is shown in bold. (left) Synchronous updating with V_0(2) > V_0(3) > 0, and 0 < γ < 1. (right) In-place updating with γ = 0.9. Updates are made in the sequence given by the state numbers.
With Approximate Q^π.  The above proof requires that an accurate Q^π is found between iterations. Methods to do this are generally computationally expensive. An alternative method is modified policy iteration which employs the iterative (and approximate) policy evaluation algorithm from Section 2.3 to evaluate the policy in step 3 [114, 21]. Using the last found value or Q-function as the initial estimate for the iterative policy evaluation algorithm will usually reduce the number of sweeps required before termination.

However, if Q^{π_k} is only known approximately, then between iterations arg max_a Q^{π_k}(s, a) may oscillate between actions for states where there are equivalent (or near-equivalent) true Q-values for the optimal policy.² This can be true even if Q^{π_k} monotonically continues to move towards Q* since, in some cases, the Q-values of the actions in a state may improve at varying rates and so their relative order may continue to change. Figure 2.5 illustrates this new insight with two examples.

² For practical implementations, it should be noted that due to the limitations of machine precision, even algorithms that are intended to solve Q^π precisely may suffer from this phenomenon.

1) do:
2)   V̂ ← evaluate(π, V̂)
3)   Δ ← 0
4)   for each s ∈ S:
5)     a_g ← arg max_a Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ V̂(s') )
6)     v' ← Σ_{s'} P^{a_g}_{ss'} ( R^{a_g}_{ss'} + γ V̂(s') )
7)     Δ ← max(Δ, |V̂(s) - v'|)
8)     π(s) ← a_g      (Make π greedy.)
9) while Δ ≥ ε_T

Figure 2.6: Modified Policy Iteration. Upon termination π is optimal to within some small error (see text).
The algorithm in Figure 2.6 guarantees that,
V  (s)  V  (s)
2 T (2.20)
1
holds upon termination, for some termination threshold T . If T = 0 then the algorithm is
equivalent to modi ed poli y iteration. Part B of the Appendix establishes the straightfor-
ward proof of termination and error bounds { these follow dire tly from the work of Williams
and Baird [172℄. The proof assumes that the evaluate step of the revised algorithm applies,
V^ (s) a 0 + V^ (s0 )
X X  
(s; a) Pssa 0 Rss k
a s0
at least on e for every state, either syn hronously or asyn hronously.
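The sketch below follows the structure of Figure 2.6 under the same assumed model layout as the earlier sketches: a few approximate evaluation sweeps of the current (deterministic) policy are interleaved with a greedy improvement step, and the loop stops once the largest one-step improvement falls below a threshold eps_T. The number of evaluation sweeps and the threshold are illustrative parameters, not values from the thesis.

# Modified policy iteration with a termination threshold (cf. Figure 2.6).
def modified_policy_iteration(S, A, P, R, pi, gamma=0.9, eps_T=1e-6,
                              eval_sweeps=5):
    V = {s: 0.0 for s in S}                 # pi[s] -> action (deterministic)

    def backup(s, a):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a])

    while True:
        for _ in range(eval_sweeps):        # approximate evaluation of pi
            for s in S:
                V[s] = backup(s, pi[s])
        delta = 0.0
        for s in S:                         # greedy improvement step
            a_g = max(A[s], key=lambda a: backup(s, a))
            delta = max(delta, abs(backup(s, a_g) - V[s]))
            pi[s] = a_g
        if delta < eps_T:
            return pi, V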

2.4.4 Value Iteration


The modified policy iteration algorithm alternates between evaluating Q̂^π for a fixed π and then improving π based upon the new Q-function estimate. We can interleave these methods to a finer degree by using iterative policy evaluation to evaluate the greedy policy rather than a fixed policy. This is done by replacing step 5 of the iterative policy evaluation algorithm in Figure 2.3 with:

Q̂_{k+1}(s, a) ← Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ max_{a'} Q̂_k(s', a') )    (2.21)

Note that, for the synchronous updating case, this new algorithm is exactly equivalent to performing one sweep of policy evaluation followed by a policy improvement sweep; no policy
improvement step needs to be explicitly performed since the greedy policy is implicitly being evaluated by max_{a'} Q̂_k(s', a').
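A Python sketch of value iteration over a Q-table, applying the backup of Equation 2.21 until the largest change is small, is given below. As before, the dictionary layout, names and tolerance are illustrative assumptions rather than code from the thesis.

# Value iteration on a Q-table (cf. Equation 2.21). The greedy policy is
# evaluated implicitly through the max over successor action-values.
def value_iteration(S, A, P, R, gamma=0.9, eps_T=1e-6):
    Q = {(s, a): 0.0 for s in S for a in A[s]}
    while True:
        delta = 0.0
        Q_new = {}
        for s in S:
            for a in A[s]:
                Q_new[(s, a)] = sum(
                    p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in A[s2]))
                    for s2, p in P[s][a])
                delta = max(delta, abs(Q_new[(s, a)] - Q[(s, a)]))
        Q = Q_new
        if delta <= eps_T:
            greedy = {s: max(A[s], key=lambda a: Q[(s, a)]) for s in S}
            return Q, greedy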
It is less than obvious that this new algorithm will converge upon Q*. As a policy evaluation method it is no longer evaluating a fixed policy but chasing a non-stationary one.

In fact the algorithm does converge upon Q* as k → ∞. By considering the case where Q̂_0 = 0, we can see that (synchronous) value-iteration progressively solves a slightly different problem, and in the limit finds Q*.

Let a k-step finite horizon discounted return be defined as follows:

r_t + γ r_{t+1} + ... + γ^{k-1} r_{t+k-1}

To behave optimally in a k-step problem is to act to maximise this return given that there are k steps available to do so. Let π*_{(k)}(s) denote an optimal policy for the k-step finite horizon problem.
Then in the case where Q̂_0 = 0 we have,

Q*_{(1)}(s, a) = E[ r_{t+1} | s = s_t, a = a_t ]    (2.22)
             = Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ max_{a'} Q̂_0(s', a') )

Thus after 1 sweep, the Q-function is the solution of the discounted 1-step finite horizon problem. That is to say, Q*_{(1)} predicts the maximum expected 1-step finite horizon return and so π*_{(1)}(s) = arg max_a Q*_{(1)}(s, a).

Clearly, the optimal value of an action when there are 2 steps to go is the expected value of taking that action and then acting to maximise the discounted expected return with 1 step to go, 1 step on:

Q*_{(2)}(s, a) = E[ r_{t+1} + γ E_{π*_{(1)}}[ r_{t+2} ] | s = s_t, a = a_t ]
             = E[ r_{t+1} + γ max_{a'} Q*_{(1)}(s_{t+1}, a') | s = s_t, a = a_t, π_{t+1} = π*_{(1)} ]
             = Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ max_{a'} Q*_{(1)}(s', a') )

Thus π*_{(2)}(s) = arg max_a Q*_{(2)}(s, a) and is an optimal policy for the 2-step finite horizon problem. With k steps to go we have:

Q*_{(k+1)}(s, a) = E[ r_{t+1} + E_{π*_{(k)}}[ Σ_{i=2}^{k+1} γ^{i-1} r_{t+i} ] | s = s_t, a = a_t ]
               = E[ r_{t+1} + γ max_{a'} Q*_{(k)}(s_{t+1}, a') | s = s_t, a = a_t, π_{t+1} = π*_{(k)} ]
               = Σ_{s'} P^a_{ss'} ( R^a_{ss'} + γ max_{a'} Q*_{(k)}(s', a') )    (2.23)

So, under the assumption that the Q-function is initialised to zero, it is clear that value-iteration (with synchronous updates) has solved the k-step finite horizon problem

after k iterations. That is to say that it finds the Q-function for the policy that maximises the expected k-step discounted return:

max_π E_π[ r_t + γ r_{t+1} + ... + γ^{k-1} r_{t+k-1} ]    (2.24)

which differs from maximising the expected infinite discounted return,

max_π E_π[ r_t + γ r_{t+1} + ... + γ^{k-1} r_{t+k-1} + ... ]    (2.25)

by an arbitrarily small amount for a large enough k and 0 ≤ γ < 1. Thus, value iteration assures that Q̂_k converges upon Q* as k → ∞ since Q̂_{(∞)} = Q*, given Q̂_0(s, a) = 0 and 0 ≤ γ < 1.

A more rigorous proof that applies for arbitrary (finite) initial value functions was established by Bellman [15] and can be found in Section A.4.

In particular, the following contraction mapping can be shown which avoids the need to assume Q̂_0 = 0,

max_s |V̂_{k+1}(s) - V*(s)| ≤ γ max_s |V̂_k(s) - V*(s)|    (2.26)

Proofs of convergence for the in-place and asynchronous updating case have also been established [17].

2.5 Summary
We have seen how dynamic programming methods can be used to evaluate the long-term utility of fixed policies, and how, by making the evaluation policy greedy, optimal policies may also be converged upon. Value iteration and policy iteration form the basis of all of the RL algorithms detailed in this thesis. Although they are a powerful and general tool for solving difficult multi-step decision problems in stochastic environments, the MDP formalism and dynamic programming methods so far presented suffer a number of limitations:

1. Availability of a Model  Dynamic Programming methods assume that a model of the environment (P and R) is available in advance, and that no further knowledge of, or interaction with, the environment is required in order to determine how to act optimally within it. However, in many cases of interest, a prior model is not generally available, nor is it always clear how such a model might be constructed in any efficient manner. Fortunately, even without a model, a number of alternatives are available to us. It remains possible to learn a model, or even learn a value function or Q-function directly through experience gained from within the environment. Reinforcement learning through interacting with the environment is the subject of the next chapter.

2. Small Finite Spaces  In many practical problems, a state might correspond to a point in a high dimensional space: s = ⟨x_1, x_2, ..., x_n⟩. Each dimension corresponds to a particular feature of the problem being solved. For instance, suppose our task is to design
an optimal strategy for the game of tic-tac-toe. Each component of the board state, x_i, describes the position of one cell in a 3 × 3 grid (1 ≤ i ≤ 9), and can take one of three values ("X", "O" or "empty"). In this case, the size of the state space is 3^9. For a game of draughts, we have 32 usable tiles and a state space size of the order of 3^32. In general, given n features each of which can take k possible values, we have a state space of size k^n. In other words, the size of the state space grows exponentially with its dimensionality. Correspondingly, so grows the memory required to store a value function and the time required to solve such a problem. This exponential growth in the space and time costs for a small increase in the problem size is referred to as the "Curse of Dimensionality" (due to Bellman, [15]).

Similarly, if the state-space has infinitely many states (e.g. if the state-space is continuous) then it is simply impossible to exactly store individual values for each state.

In both cases, using a function approximator to represent approximations of the value function or model can help. These are discussed in Chapter 6.
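The growth rates quoted above are easy to check; the lines below simply evaluate k^n for the examples in the text (they are not code from the thesis).

# State-space sizes grow as k**n for n features with k possible values each.
print(3 ** 9)     # tic-tac-toe encodings: 19683
print(3 ** 32)    # draughts with 32 usable tiles: about 1.85e15

for n in (4, 8, 16, 32):       # exponential growth with dimensionality
    print(n, 3 ** n)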

3. Markov Property  In practice, the Markov property is hard to obtain. There are many cases where the description of the current state may lack important information necessary to choose the best action. For instance, suppose that you find yourself in a large building where many of the corridors look the same. In this case, based upon what is seen locally, it may be impossible to decide upon the best direction to move given that some other part of the building looks the same but where some other direction is best.

In many instances such as this, the environment may really be an MDP, although it may not be the case that the agent can exactly observe its true state. However, the prior sequence of observations (of states, actions, rewards and successors) often reveals useful information about the likely real state of the process (e.g. if I remember how many flights of stairs I went up I can now tell which corridor I am in with greater certainty). This kind of problem can be formalised as a Partially Observable Markov Decision Process (POMDP). A POMDP is often defined as an MDP, which includes S, A, P and a reward function, plus a set of prior observations and a mapping from real states to observations.

These problems and their related solution methods are not examined in this thesis. See [27] or [74] for excellent introductions and field overviews.

4. Discrete Time. The MDP formalism assumes that there is a fixed, discrete amount of time between state observations. In many problems this is untrue and events occur at varying real-valued time intervals (or even occur continuously). A good example is the state of a queue for an elevator [36]. At t = 0 the state of the queue might be empty (s_0). Some time later someone may join the queue (we make a transition to s_1), but the time interval between state transitions can take some real value whose probability may be given by a continuous distribution.

Variable and continuous time interval variants of MDPs are referred to as Semi-Markov Decision Processes (SMDPs) [114], and are examined in Chapter 7.

5. Undiscounted Ergodic Tasks. In cases where reward may be collected indefinitely and discounting is not desired, the discounted return model may not be used, since the future sum of rewards with γ = 1 may be unbounded. Furthermore, even in cases where the returns can be shown to be bounded, with γ = 1 the policy-iteration and value-iteration algorithms are not guaranteed to converge upon Q*. This follows as a result of using bootstrapping and the max operator, which causes any optimistic initial bias in the Q-function to remain indefinitely.

If discounting is not desired, then an average reward per step formalism can be used. Here the expected return is defined as follows [132, 153, 75, 21]:

\[ \rho^{\pi} = \lim_{n\to\infty} \frac{1}{n} \sum_{t=1}^{n} E\big[ r_t \mid s_0 = s, \pi \big] \]

This formalism is problematic in processes where all states are reachable from any other under the policy (such a process is said to be ergodic), since the long-term average reward is then the same from every state. However, even in this case, from some states a higher than average return may be gained for some short time, and so such a state might be considered to be better. Quantitatively, the value of a state can be defined by the relative difference between the long-term average reward from any state, ρ^π, and the reward following a starting state:

\[ V^{\pi}(s) = \sum_{k=1}^{\infty} E\big[ r_{t+k} - \rho^{\pi} \mid s_t = s, \pi \big] \]

Thus a policy may be improved by modifying it to increase the time that the system spends in high valued states (thereby raising ρ). Average reward methods are not examined in this thesis.
Chapter 3

Learning from Interaction

Chapter Outline

In this chapter we see how reinforcement learning problems can be solved solely through interacting with the environment and learning from what is observed. No knowledge of the task being solved needs to be provided. A number of standard algorithms for learning in this way are reviewed. The shortcomings of exploration insensitive model-free control methods are highlighted, and new intuitions about the online behaviour of accumulate trace TD(λ) methods illustrated.

3.1 Introduction

The methods in the previous chapter showed how to find optimal solutions to multi-step decision problems. While these techniques are invaluable tools for operations research and planning, it is difficult to think of them as techniques for learning. No experience is gathered; all of the necessary information required to solve their task of finding an optimal policy is known from the outset. The methods presented in this chapter start with no model (i.e. P and R are unknown). Every improvement made follows only from the information collected through interactions with the environment. Most of the methods follow the algorithm pattern shown in Figure 3.1.

Direct and Indirect Reinforcement Learning. Broadly, (value-based) RL methods for learning through interaction can be split into two categories. Indirect methods use their experiences in the environment to construct a model of it, usually by building estimates of the transition and reward functions, P and R. This model can then be used to generate value-functions or Q-functions using, for instance, methods similar to the dynamic programming techniques introduced in the last chapter. Indirect methods are also termed model-based (or model-learning) RL methods.

1) for each episode:
2)   Initialise: t ← 0; s_{t=0}
3)   while s_t is not terminal:
4)     select a_t
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     perform updates to P̂, R̂, Q̂ and/or V̂ using the new experience ⟨s_t, a_t, r_{t+1}, s_{t+1}⟩
7)     t ← t + 1

Figure 3.1: An abstract incremental online reinforcement learning algorithm.

Alternatively, we can learn value-functions and Q-functions directly from the reward signal, forgoing a model. This is called the direct or model-free approach to reinforcement learning.

This chapter first presents an incremental estimation rule. From this we see how the direct methods are derived, and then the indirect methods.

3.2 In remental Estimation of Means


Dire t methods an be thought of as algorithms that attempt to estimate the mean of a
return signal solely from observations of that return signal. For most dire t methods, this
is usually done in rementally by applying an update rule of the following form:
Z^k = RunningAverage(Z^k 1 ; zk ; k )
= Z^k 1 + k (zk Z^k 1) (3.1)
where Z^k is the new estimated mean whi h in ludes the kth observation, zk , of a random
variable, and k 2 [0; 1℄ is a step-size (or learning rate) parameter. Ea h observation is
assumed to be a bounded s alar value given a random variable with a stationary distribution.
By de ning the learning rate in di erent ways the update rule an be given a number of
useful properties. These are listed below.
Running Average. With k = 1=k, Z^k is the sample mean (i.e. average) of the set of
observations fz1 ; : : : ; zk g,
Z^k =
1X
k
zi (3.2)
k i=1
The following derivation of update 3.1 is from [150]:

\[
\begin{aligned}
\hat{Z}_{k+1} &= \frac{1}{k+1}\sum_{i=1}^{k+1} z_i \\
&= \frac{1}{k+1}\Big( z_{k+1} + \sum_{i=1}^{k} z_i \Big) \\
&= \frac{1}{k+1}\Big( z_{k+1} + k\,\frac{1}{k}\sum_{i=1}^{k} z_i \Big) \\
&= \frac{1}{k+1}\big( z_{k+1} + k\hat{Z}_k \big) \\
&= \frac{1}{k+1}\big( z_{k+1} + (k+1)\hat{Z}_k - \hat{Z}_k \big) \\
&= \hat{Z}_k + \frac{1}{k+1}\big( z_{k+1} - \hat{Z}_k \big) \qquad (3.3)
\end{aligned}
\]

Recency Weighted Average. By choosing a constant value for α (where 0 < α < 1), update 3.1 can be used to calculate a recency weighted average. This can be seen more clearly by expanding the right hand side of Equation 3.1:

\[ \hat{Z}_{t+1} = \alpha z_{t+1} + (1-\alpha)\hat{Z}_t \qquad (3.4) \]

Intuitively, each new observation forms a fixed percentage of the new estimate. Recency weighted averages are useful if the observations are drawn from a non-stationary distribution.

In cases where α_1 ≠ 1, the estimates Ẑ_k (k > 1) may be partially determined by the initial estimate, Ẑ_0. Such estimates are said to be biased by the initial estimate; Ẑ_0 is an initial bias.

Mean in the Limit. From standard statistics, with α_k = 1/k, from Equation 3.2 we have,

\[ \lim_{k\to\infty} \hat{Z}_k = E[z]. \qquad (3.5) \]

However, more usefully, Equation 3.5 also holds if,

\[ 1)\quad \sum_{k=1}^{\infty} \alpha_k = \infty \qquad (3.6) \]
\[ 2)\quad \sum_{k=1}^{\infty} \alpha_k^2 < \infty \qquad (3.7) \]

both hold. These are the Robbins-Monro conditions and appear frequently as conditions for convergence of many stochastic approximation algorithms [126].

The first condition ensures that, at any point, the sum of the remaining step-sizes is infinite and so the current estimate will eventually become insignificant. Thus, if the current estimate contains some kind of bias, then this is eventually eliminated. The second condition ensures that the step sizes eventually become small enough so that any variance in the observations can be overcome.
In most interesting learning problems, there is the possibility of trading lower bias for higher variance, or vice versa. Slowly declining learning rates reduce bias more quickly but converge more slowly. Reducing the learning rate quickly gives fast convergence but slow reductions in bias. If the learning rate is declined too quickly, premature convergence upon a value other than E[z] may occur. The Robbins-Monro conditions guarantee that this cannot happen.

Conditions 1 and 2 are known to hold for,

\[ \alpha_k(s) = \frac{1}{k(s)^{\omega}}, \qquad (3.8) \]

at the kth update of Ẑ(s), where 1/2 < ω ≤ 1 [167].
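To make the two step-size choices concrete, the following is a minimal Python sketch (not from the thesis) of update 3.1; the function name and the example observations are illustrative only.

def running_average(z_hat, z, alpha):
    """One application of update (3.1): move the estimate towards the
    new observation z by a fraction alpha of the error."""
    return z_hat + alpha * (z - z_hat)

# Sample mean: alpha_k = 1/k gives the exact average of all observations.
z_hat, k = 0.0, 0
for z in [1.0, 3.0, 2.0, 4.0]:
    k += 1
    z_hat = running_average(z_hat, z, 1.0 / k)   # z_hat is the mean of z_1..z_k

# Recency weighted average: a constant alpha (update 3.4) tracks
# non-stationary observation distributions.
z_hat = 0.0
for z in [1.0, 3.0, 2.0, 4.0]:
    z_hat = running_average(z_hat, z, 0.1)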

3.3 Monte Carlo Methods for Policy Evaluation

This section examines two model-free methods for performing policy evaluation. That is to say, given an evaluation policy, π, they obtain the value-function or Q-function that predicts the expected return available under this policy without the use of an environmental model.

Monte Carlo estimation represents the most basic value prediction method. The idea behind it is simply to find the sample mean of the complete actual return,

\[ z_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \]

for following the evaluation policy after a state, or SAP (state-action pair), until the end of the episode at time T. The evaluation policy is assumed to be fixed and is assumed to be followed while collecting these rewards. If a terminal state is reached then, without loss of generality, the infinite sum can be truncated by redefining it as,

\[ z_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T + \gamma^{T-t} V(s_T) \]

where V(s_T) is the value of the terminal state s_T. Typically, this is defined to be zero. Again this is without loss of generality, since r_T can be redefined to reflect the differing rewards for entering different terminal states.

Singh and Sutton differentiate between two flavours of Monte Carlo estimate: the first-visit and every-visit estimates [139].

Every Visit Monte Carlo Estimation. The every-visit Monte Carlo estimate is defined as the sample average of the observed return following every visit to a state:

\[ \hat{V}^{E}(s) = \frac{1}{M}\sum_{i=1}^{M} z_{t_i}^{(\infty)} \qquad (3.9) \]

where s is visited at times {t_1, ..., t_M}. In this case, the RunningAverage update is applied offline, at the end of each episode at the earliest. Each state-value is updated once for each state visit using the return following that visit. M represents the total number of visits to s in all episodes.
Figure 3.2: A simple Markov process for which first-visit and every-visit Monte Carlo approximation initially find different value estimates. The process has a starting state, s, and a terminal state, T. P_s and P_T denote the respective transition probabilities for s → s and for s → T. The respective rewards for these transitions are R_s and R_T.

First Visit Monte Carlo Estimation. The first-visit Monte Carlo estimate is defined as the sample average of returns following the first visit to a state during the episodes in which it was visited:

\[ \hat{V}^{F}(s) = \frac{1}{N}\sum_{i=1}^{N} z_{t_i}^{(\infty)} \qquad (3.10) \]

where s is first visited during an episode at times {t_1, ..., t_N} and N represents the total number of episodes. The key difference here is that an observed reward may be used to update a state value only once, whereas in the every-visit case, a state value may be defined as the average of several non-independent return estimates, each involving the same reward, if the state is revisited during an episode.
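The two estimators are easy to compare in code. Below is a small Python sketch (not part of the thesis) that computes both the first-visit estimate (3.10) and the every-visit estimate (3.9) offline from completed episodes; the episode format and the reward values in the example are illustrative assumptions.

from collections import defaultdict

def monte_carlo_estimates(episodes, gamma=1.0):
    """Offline first-visit and every-visit estimates from a list of episodes,
    each given as a list of (state, reward_on_leaving_state) pairs."""
    returns_first = defaultdict(list)
    returns_every = defaultdict(list)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            returns_every[s].append(returns[t])      # every visit
            if s not in seen:                        # first visit only
                returns_first[s].append(returns[t])
                seen.add(s)
    V_first = {s: sum(zs) / len(zs) for s, zs in returns_first.items()}
    V_every = {s: sum(zs) / len(zs) for s, zs in returns_every.items()}
    return V_first, V_every

# One episode s, s, s, s, T (as in the example below), with the illustrative
# values R_s = 1 and R_T = 10:
V_first, V_every = monte_carlo_estimates([[('s', 1), ('s', 1), ('s', 1), ('s', 10)]])

On this single episode the sketch returns 13 for the first-visit estimate (3R_s + R_T) and 11.5 for the every-visit estimate ((6R_s + 4R_T)/4), matching the expressions derived next.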
Bias and Variance. In the case where state revisits are allowed within a trial these methods produce different estimators of return. Singh and Sutton analysed these differences, which can be characterised by considering the process in Figure 3.2 [139]. For simplicity assume γ = 1; then from the Bellman equation (2.6) the true value for this process is:

\[
\begin{aligned}
V(s) &= P_s\big( R_s + V(s) \big) + P_T R_T \qquad (3.11)\\
&= \frac{P_s R_s + P_T R_T}{1 - P_s} \qquad (3.12)\\
&= \frac{P_s}{P_T} R_s + R_T \qquad (3.13)
\end{aligned}
\]

Consider the difference between the methods following one episode with the following experience,

s, s, s, s, T

The first-visit estimate is:

\[ \hat{V}^{F}(s) = R_s + R_s + R_s + R_T \]

while the every-visit estimate is:

\[ \hat{V}^{E}(s) = \frac{R_s + 2R_s + 3R_s + 4R_T}{4} \]
For both cases, it is possible to find the expectation of the estimate after one trial for some arbitrary experience. This is done by averaging the possible returns that could be observed in the first episode, weighted by their probability of being observed. For the first-visit case, it can be shown that after the first episode [139],

\[ E\big[ \hat{V}^{F}_1(s) \big] = \frac{P_s}{P_T} R_s + R_T = V(s) \]

and so it is an unbiased estimator of V(s). After N episodes, V̂^F_N(s) is the sample average of N independent unbiased estimates of V(s), and so is also unbiased.

For the every-visit case, it can be shown (in [139]) that after the first episode,

\[ E\big[ \hat{V}^{E}_1(s) \big] = \frac{P_s}{2P_T} R_s + R_T \]

where the expectation is taken over the number of times, k, that s is visited within the episode. Thus after the first episode the every-visit method does not give an unbiased estimate of V(s). Its bias is given by,

\[ \mathrm{BIAS}^{E}_1 = V(s) - E\big[ \hat{V}^{E}_1(s) \big] = \frac{P_s}{2P_T} R_s. \qquad (3.14) \]

Singh and Sutton also show that after M episodes,

\[ \mathrm{BIAS}^{E}_M = \frac{2}{M+1}\, \mathrm{BIAS}^{E}_1. \qquad (3.15) \]

Thus the every-visit method is also unbiased as M → ∞.

The bias in the every-visit method comes from the fact that it uses some rewards several times. Thus many of the return observations are not independent. However, the observations between trials are independent, and so as the number of trials grows, its bias shrinks. Both methods converge upon V(s) as M or N tend to infinity.

Singh and Sutton also analysed the expected variance in the estimates learned by each method. They found that, while the first-visit method has no bias, it initially has a higher expected variance than the every-visit method. However, its expected variance declines far more rapidly, and is usually lower than for the every-visit method after a very small number of trials. Thus, in the long run the first-visit method appears to be superior, having no bias and lower variance.

3.4 Temporal Difference Learning for Policy Evaluation

3.4.1 Truncated Corrected Return Estimates

Because the return estimate used by the Monte Carlo method (i.e. the observed return, z^{(∞)}) looks ahead at the rewards received until the end of the episode, it is impossible to make updates to the value function based upon it during an episode. Updates must be made in between episodes. If the task is non-episodic (e.g. if the environment never enters a terminal state), it seems unlikely that a Monte Carlo method can be used at all. One possibility is to make the task episodic by breaking the episodes into stages. The stages could be separated by a fixed number of steps, by replacing

\[ z^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \cdots \]

with,

\[ z^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} U(s_{t+n}) \]

where U(s_{t+n}) is a return correction and predicts the expected return for following the evaluation policy from s_{t+n}. Note that this is already done to deal with terminal states in the Monte-Carlo method. However, if s_{t+n} is not a terminal state we typically will not know the true utility of s_{t+n} under the evaluation policy. Instead we replace it with an estimate, for example the current V̂(s_{t+n}).

The next section introduces a special case where n = 1. Updates are performed after each and every step using knowledge only about the immediate reward collected and the next state entered.
3.4.2 TD(0)

The temporal difference learning algorithm, TD(0), can be used to evaluate a policy and works through applying the following update [13, 147, 148]:

\[ \hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t(s_t)\big( r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) \big), \qquad (3.16) \]

where r_{t+1} is the reward following a_t, taken from s_t and selected according to the policy under evaluation. Note that this has the same form as the RunningAverage update rule, where the target is E[r_{t+1} + γV̂(s_{t+1})]. Recall from Equation 2.6 that,

\[ \hat{V}^{\pi}(s) = \sum_a \pi(s,a) \sum_{s'} P^{a}_{ss'}\big( R^{a}_{ss'} + \gamma \hat{V}^{\pi}(s') \big) = E\big[ r_{t+1} + \gamma \hat{V}^{\pi}(s_{t+1}) \mid s = s_t, \pi \big]. \]

So, assuming for the moment that V̂(s_{t+1}) is a fixed constant, update 3.16 can be seen as a stochastic version of,

\[ \hat{V}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^{a}_{ss'}\big( R^{a}_{ss'} + \gamma \hat{V}(s') \big), \qquad (3.17) \]

where E[r_{t+1} + γV̂(s_{t+1}) | s = s_t, π] is estimated by V̂(s) in the limit from the observed (sample) return estimates, r_{t+1} + γV̂_t(s_{t+1}), rather than from the target return estimate given by the right-hand side of update 3.17.
TD(0) is reliant upon observing the return estimate, r + γV̂(s'), and applying it in update 3.16 with the probability distribution defined by R, P and π. This can be done in several ways, but by far the most straightforward is to actually follow the evaluation policy in the environment and make updates after each step using the experience collected. Figure 3.3 shows this online learning version of TD(0) in full. Note that it makes no use of R or P.

1) for each episode:
2)   initialise s_t
3)   while s_t is not terminal:
4)     select a_t according to π
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     TD(0)-update(s_t, a_t, r_{t+1}, s_{t+1})
7)     t ← t + 1

TD(0)-update(s_t, a_t, r_{t+1}, s_{t+1})
1)   V̂(s_t) ← V̂(s_t) + α_{t+1}(s_t) ( r_{t+1} + γV̂(s_{t+1}) − V̂(s_t) )

Figure 3.3: The online TD(0) learning algorithm. Evaluates the value-function for the policy followed while gathering experience.
In general, the value of the correction term (V̂(s_{t+1}) in update 3.16) is not a constant but is changing as s_{t+1} is visited and its value updated. The method can be seen to be averaging return estimates sampled from a non-stationary distribution. The return estimate is also biased by the initial value function estimate, V̂_0. Even so, the algorithm can be shown to converge upon V^π as t → ∞ provided that the learning rate is declined under the Robbins-Monro conditions (∑_{k=1}^{∞} α_k(s) = ∞, ∑_{k=1}^{∞} α_k²(s) < ∞), that all value estimates continue to be updated, the process is Markov, all rewards have finite variance, 0 ≤ γ < 1, and that the evaluation policy is followed [148, 38, 158, 59, 21]. In practice it is common to use the fixed learning rate α = 1 if the transitions and rewards are deterministic, or some lower value if they are stochastic. A fixed α also allows continuing adaptation in cases where the reward or transition probability distributions are non-stationary (in which case the Markov property does not hold).
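As a concrete illustration of update 3.16, here is a minimal tabular sketch in Python (not from the thesis); the dictionary-based value table and the parameter values mentioned in the comments are illustrative assumptions.

def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update (3.16) to a tabular value function V, stored
    as a dict mapping states to estimates. Terminal successors have value 0."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    delta = r + gamma * v_next - V.get(s, 0.0)   # one-step TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta

# Usage: while following the evaluation policy, after each observed step
# (s, r, s') call td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95).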

3.4.3 SARSA(0)

Similar to TD(0), SARSA(0) evaluates the Q-function of an evaluation policy [128, 173]. Its update rule is:

\[ \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_k\big( r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \big), \qquad (3.18) \]

where a_t and a_{t+1} are selected with the probability specified by the evaluation policy and α_k = α_k(s_t, a_t).

SARSA differs from the standard algorithm pattern given in Figure 3.1 because it needs to know the next action that will be taken when making the value update. The SARSA algorithm is shown in Figure 3.4.

1) for each episode:
2)   initialise s_t
3)   select a_t according to π
4)   while s_t is not terminal:
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     select a_{t+1} according to π
7)     SARSA(0)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
8)     t ← t + 1

SARSA(0)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
1)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k ( r_{t+1} + γQ̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t) )

Figure 3.4: The online SARSA(0) learning algorithm. Evaluates the Q-function for the policy followed while gathering experience.

An alternative scheme that appears to be equally valid, and is more closely related to the policy-evaluation Q-function update (see Equation 2.15 and Figure 2.3), is to replace the target return estimate with [128]:

\[ r_{t+1} + \gamma \sum_{a'} \pi(s_{t+1}, a')\, \hat{Q}(s_{t+1}, a'). \qquad (3.19) \]

An algorithm employing this return does not need to know a_{t+1} to make the update and so can be implemented in the standard framework. Its independence of a_{t+1} also makes this an off-policy method: it does not need to actually follow the evaluation policy in order to evaluate it. This property is discussed in more detail later in this chapter. However, unlike regular SARSA, this method does require that the evaluation policy is known, which may not always be the case, since experience could be generated by observing an external (e.g. human) controller.
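A sketch of this alternative, off-policy target (3.19) follows; it is illustrative only and assumes the evaluation policy is available as a probability function pi(s, a).

def expected_sarsa0_update(Q, pi, s, a, r, s_next, actions, alpha, gamma,
                           terminal=False):
    """One update using the target (3.19): the expectation of Q over the
    evaluation policy at the successor state. Q is a dict keyed by
    (state, action); pi(s, a) returns the evaluation policy's probability."""
    if terminal:
        expected_next = 0.0
    else:
        expected_next = sum(pi(s_next, a2) * Q.get((s_next, a2), 0.0)
                            for a2 in actions)
    delta = r + gamma * expected_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta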
3.4.4 Return Estimate Length

Single-Step Return Estimates

The TD(0) and SARSA(0) algorithms are single-step temporal difference learning methods and apply updates to estimate some target return estimate having the following form:

\[ z_t^{(1)} = r_{t+1} + \gamma\hat{U}(s_{t+1}). \qquad (3.20) \]

It is important to note that it is the dependence upon using only information gained from the immediate reward and the successor state that allows single-step methods to be easily used as online learning algorithms. However, when single-step learning methods are applied in the standard way, by updating V̂(s_t) or Q̂(s_t, a_t) at time t + 1, new return information is propagated back only to the previous state. This can result in extremely slow learning in cases where credit for visiting a particular state or taking a particular action is delayed by many time steps. Figure 3.5 provides an example of this problem. Each episode begins in the leftmost state. Each state to the right is visited in sequence until the rightmost (terminal) state is entered, where a reward of 1 is given (r = 0 in all other states). In such a situation, it would take 1-step methods a minimum of 64 episodes before any information about the terminal reward reaches the leftmost state. A Monte Carlo estimate would find the correct solution after just one episode.

Figure 3.5: The corridor task. Single-step updating methods such as TD(0), SARSA(0) and Q-learning can be very slow to propagate any information about the terminal reward to the leftmost state.
Multi-Step Return Estimates

By modifying the return estimate to look further ahead than the next state, a single experience can be used to update utility estimates at many previously visited states. For example, the 1-step return in 3.16, z_t^{(1)} = r_{t+1} + γU(s_{t+1}), may be replaced with the corrected n-step truncated return estimate,

\[ z_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} \hat{U}(s_{t+n}) \qquad (3.21) \]

or we may use,

\[
\begin{aligned}
z_t^{\lambda} &= (1-\lambda)\big[ z_t^{(1)} + \lambda z_t^{(2)} + \lambda^2 z_t^{(3)} + \cdots \big] \qquad (3.22)\\
&= (1-\lambda)\big( r_{t+1} + \gamma\hat{U}(s_{t+1}) \big) + \lambda\big( r_{t+1} + \gamma z_{t+1}^{\lambda} \big) \qquad (3.23)\\
&= r_{t+1} + \gamma\big( (1-\lambda)\hat{U}(s_{t+1}) + \lambda z_{t+1}^{\lambda} \big) \qquad (3.24)
\end{aligned}
\]

which is a λ-return estimate [147, 148, 163, 128, 107]. The λ-return estimate is important as it is a generalisation of both z^{(1)} and z^{(∞)}: if λ = 0, then z^λ = z^{(1)}, and if λ = 1, then z^λ = z^{(∞)}, the actual discounted return.

A key feature of multi-step estimates is that a single observed reward may be used in updating the state-values or Q-values of many previously visited states. Intuitively, this offers the ability to more quickly assign credit for delayed rewards.

The return estimate length can also be seen as managing a tradeoff between bias and variance in the return estimate [163]. When λ is low, the estimate is highly biased toward the initial state-value or Q-function. When λ is high the estimate involves mainly the actual observed reward and is a less biased estimator. However, unbiased return estimates do not necessarily result in the fastest learning. Typically, longer return estimates have higher variance as there is a greater space of possible values that a multi-step return estimate could take. By contrast, a single-step estimate is limited to taking values formed by combinations of the possible immediate rewards and the values of immediate successor states, and so may typically have lower variance. Also, employing the already-learned value estimates of successor states in updates may help speed up learning, since these values may contain summaries of the complex future that may follow from the state. Best performance is often found with intermediate values of λ [148, 128, 73, 139, 150].

However, while multi-step estimates appear to offer faster delayed credit assignment, they seem to suffer the same problem as the Monte-Carlo methods: the updates must either be made offline, at the end of each episode, or the episodes must be split into stages and the return estimates truncated. Chapter 4 introduces a method which explores the latter case. The next section shows how the effect of using the λ-return estimate can be approximated by a fully incremental online method that makes updates after each step.
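Before turning to eligibility traces, the forward view itself is simple to state in code. The following Python sketch (not from the thesis) computes the λ-return (3.24) for every step of a finished episode by working backwards; the list-based episode format is an illustrative assumption.

def lambda_returns(rewards, next_values, lam, gamma):
    """Compute the lambda-return (3.24) for each step of one finished episode,
    working backwards. rewards[t] holds r_{t+1} and next_values[t] holds the
    current estimate U(s_{t+1}); for the final step this is the terminal
    state's value (typically 0)."""
    z = [0.0] * len(rewards)
    z_next = next_values[-1]              # z^lambda beyond the terminal state
    for t in reversed(range(len(rewards))):
        z[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * z_next)
        z_next = z[t]
    return z

With lam = 0 this reduces to the one-step targets, and with lam = 1 to the actual discounted return, as noted above.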
3.4.5 Eligibility Traces: TD(λ)

This section shows how λ-return estimates can be applied as an incremental online learning algorithm. This is surprising because it implies that it is not necessary to wait until all the information used by the λ-return estimate is collected before a backup can be made to a previously visited state.

The effect of using z^λ can be closely and incrementally approximated online using eligibility traces [148, 163].

A λ-return algorithm performs the following update,

\[ \hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t(s_t)\big( z_t^{\lambda} - \hat{V}(s_t) \big). \qquad (3.25) \]

By Equation 3.24, Sutton showed that the error estimate in this update can be re-written as [148, 163, 107],

\[ z_t^{\lambda} - \hat{V}(s_t) = \delta_t + \gamma\lambda\,\delta_{t+1} + \cdots + (\gamma\lambda)^{k}\,\delta_{t+k} + \cdots \qquad (3.26) \]

where δ_t is the 1-step temporal difference error as before,

\[ \delta_t = r_{t+1} + \gamma\hat{V}(s_{t+1}) - \hat{V}(s_t). \]

If the process is acyclic and finite (and so necessarily also has a terminal state), this allows update 3.25 to be re-written as the following online update rule, which overcomes the need to have advance knowledge of the 1-step errors,

\[ \hat{V}(s) \leftarrow \hat{V}(s) + \alpha_t(s)\,\delta_t \sum_{k=t_0}^{t} (\gamma\lambda)^{t-k}\, I(s, s_k) \qquad (3.27) \]

where t_0 indicates the time of the start of the episode, and I(s, s_k) is 1 if s was visited at time k (i.e. s = s_k), and zero otherwise. This update must be applied to all states visited at time t or before, within the episode.

In the case in which state revisits may occur, the updates may be postponed and a single batch update may be made for each state at the end of the episode,

\[ \hat{V}(s) \leftarrow \hat{V}(s) + \sum_{t=t_0}^{T-1} \alpha_t(s)\,\delta_t \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, I(s, s_k) \]

where s_T is the terminal state.
TD(λ)-update(s_t, a_t, r_{t+1}, s_{t+1})
1)   δ ← r_{t+1} + γV̂(s_{t+1}) − V̂(s_t)
2)   e(s_t) ← e(s_t) + 1
3)   for each s ∈ S:
3a)    V̂(s) ← V̂(s) + αe(s)δ
3b)    e(s) ← γλe(s)

Figure 3.6: The accumulating-trace TD(λ) update. This update step should replace TD(0)-update in Figure 3.3 for the full learning algorithm. All eligibilities should be set to zero at the start of each episode.

However, the above methods do not appear to be of any more practical use than the Monte-Carlo or λ-return methods. If the task is acyclic, then there is little benefit in having an online learning algorithm, since the agent cannot make use of the values it updates until the end of the episode. So the assumption preventing state revisits is often relaxed. In this case the error terms may be inexact, since the state-values used as the return correction may have been altered if the state was previously visited. However, intuitively this seems to be a good thing, since the return correction is more up-to-date as a result.

To avoid the expensive recalculation of the summation in 3.27, this term can be redefined as,

\[ \hat{V}(s) \leftarrow \hat{V}(s) + \alpha_t\, e_t(s)\,\delta_t \qquad (3.28) \]

where e(s) is an (accumulating) eligibility trace. For each state at each step it is updated as follows,

\[ e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1, & \text{if } s = s_t, \\ \gamma\lambda\, e_{t-1}(s), & \text{otherwise.} \end{cases} \qquad (3.29) \]

The full online TD(λ) algorithm is shown in Figure 3.6. Both the online and batch TD(λ) algorithms are known to converge upon the true state-value function for the evaluation policy under the same conditions as TD(0) [38, 158, 59, 21].

The intuitive idea behind an eligibility trace is to make a state eligible for learning for several steps after it was visited. If an unexpectedly good or bad event happens (as measured by the temporal difference error, δ), then all of the previously visited states are immediately credited with this. The size of the value adjustment is scaled by the state's eligibility, which decays with the time since the last visit. Moreover, the 1-step error δ_t measures an error in the λ-return used, not just for the previous state, but for all previously visited states in the episode. The eligibility measures the relevance of that error to the values of the previous states, given that they were updated using a λ-return corrected for the error found at the current state. Thus it should be clear why the trace decays as (γλ)^k: the contribution of V̂(s_{t+k}) to z_t^{(λ)} is (γλ)^k.
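For reference, the accumulating-trace update of Figure 3.6 can be written as a short Python sketch (illustrative only; dictionaries stand in for the tabular value function and the traces).

def td_lambda_update(V, e, s, r, s_next, alpha, gamma, lam, terminal=False):
    """One accumulating-trace TD(lambda) update (Figure 3.6, update 3.28).
    V and e are dicts over states; e should be reset to {} at episode start."""
    delta = r + (0.0 if terminal else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
    e[s] = e.get(s, 0.0) + 1.0                 # accumulate eligibility (3.29)
    for state in list(e):
        V[state] = V.get(state, 0.0) + alpha * e[state] * delta
        e[state] *= gamma * lam                # decay all traces
    return delta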
The Forward-Backward Equivalence of Batch TD(λ) and λ-Return Updates

If the changes to the value function that the accumulate-trace algorithm is to make during an episode are summed,

\[ \Delta V(s) = \sum_{t=0}^{T-1} \alpha_t\, e_t(s)\,\delta_t \]

and applied at the end of the episode (instead of online),

\[ V(s) \leftarrow V(s) + \Delta V(s), \]

it can be shown that this is equivalent to applying the λ-return update,

\[ \hat{V}(s) \leftarrow \hat{V}(s) + \alpha\big( z_t^{\lambda} - \hat{V}(s) \big), \]

at the end of the episode, for each s = s_t visited during the episode [150].¹

Thus in the case where λ = 1 and α_k(s) = 1/k(s), this batch mode TD(λ) method is equivalent to the every-visit Monte Carlo algorithm. The proof of this can be found in [139] and [150].

Below, the direct λ-return method is referred to as the forward view, and the eligibility trace method as the backward view (after [150]).

¹ See also the special case of the forward-backward equivalence proof in Appendix C where λ = 1. This proof is a generalisation of the one in [150].
3.4.6 SARSA(λ)

The equivalent version of TD(λ) for updating a Q-function is SARSA(λ), shown in Figure 3.7 [128, 129]. Here, an eligibility value is maintained for each state-action pair.

SARSA(λ)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
1)   δ ← r_{t+1} + γQ̂(s_{t+1}, a_{t+1}) − Q̂(s_t, a_t)
2)   e(s_t, a_t) ← e(s_t, a_t) + 1
3)   for each (s, a) ∈ S × A:
3a)    Q̂(s, a) ← Q̂(s, a) + αe(s, a)δ
3b)    e(s, a) ← γλe(s, a)

Figure 3.7: The accumulating-trace SARSA(λ) update. This update step should replace the SARSA(0)-update in Figure 3.4 for the full learning algorithm. All eligibilities should be set to zero at the start of each episode.

3.4.7 Replace Trace Methods

In practice, accumulating trace methods are known to often work poorly, especially with λ close to 1 [139, 149, 150]. In part, this is likely to be the result of their relationship with the every-visit Monte-Carlo algorithm. An alternative eligibility trace scheme is the replacing trace:

\[ e_t(s) = \begin{cases} 1, & \text{if } s = s_t, \\ \gamma\lambda\, e_{t-1}(s), & \text{otherwise.} \end{cases} \qquad (3.30) \]

Sutton refers to this as a recency heuristic: the eligibility of a state depends only upon the time since the last visit. By contrast, the accumulating trace is a frequency and recency heuristic.

In [139] Sutton and Singh show that, with λ = 1 and with appropriately declining learning rates, the batch-update TD(λ) algorithms exactly implement the Monte Carlo algorithms. In particular, it can be shown that accumulating traces give the every-visit method, and replacing traces give the first-visit Monte Carlo method.

In addition to the better theoretical properties of first-visit Monte Carlo, the replace trace method has often performed better in online learning tasks. In [150] Sutton and Barto also prove that the TD(λ) and forward-view λ-return methods are identical in the case of batch (i.e. offline) updating for general λ with a constant α.

When estimating Q-values two replace-trace schemes exist. These are the state-replacing trace [139, 150],

\[ e_t(s,a) = \begin{cases} 1, & \text{if } s = s_t \text{ and } a = a_t, \\ 0, & \text{if } s = s_t \text{ and } a \neq a_t, \\ \gamma\lambda\, e_{t-1}(s,a), & \text{if } s \neq s_t, \end{cases} \qquad (3.31) \]

and the state-action replacing trace [33],

\[ e_t(s,a) = \begin{cases} 1, & \text{if } s = s_t \text{ and } a = a_t, \\ \gamma\lambda\, e_{t-1}(s,a), & \text{otherwise.} \end{cases} \qquad (3.32) \]
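A small Python sketch (not from the thesis) covering both replace-trace schemes; the dictionary layout and the state_replacing flag are illustrative assumptions.

def replacing_trace_update(e, s, a, actions, gamma, lam, state_replacing=True):
    """Update eligibilities after visiting (s, a) with a replacing trace.
    state_replacing=True implements (3.31); False implements (3.32).
    e is a dict keyed by (state, action)."""
    for key in list(e):
        e[key] *= gamma * lam                  # decay every existing trace
    if state_replacing:
        for a2 in actions:                     # zero the other actions in s
            e[(s, a2)] = 0.0
    e[(s, a)] = 1.0                            # replace, rather than accumulate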

3.4.8 Acyclic Environments

If the environment is acyclic, then the different eligibility updates produce identical eligibility values and so the accumulate and replace trace methods must be identical. In this case, the online and batch versions of the algorithms are also identical, since the return corrections used in return estimates must be fixed within an episode. With λ = 1, the λ-return methods also implement the Monte Carlo methods in acyclic environments. Also, here, both first-visit and every-visit methods are equivalent.

The eligibility trace methods appear to be considerably more expensive than the other model-free methods so far presented. For TD(0) and SARSA(0) the time cost per experience is O(1). The Monte-Carlo and direct λ-return methods have the same cost if the returns are calculated starting with the most recent experience and working backwards.² Algorithms working in this way will be seen in Chapter 4. By contrast, TD(λ) has a time cost as high as O(|S|) per experience.

Thus the great benefit afforded by using eligibility traces is that they allow multi-step return estimates to be used for continual online learning and, as a consequence, they can also be used in non-episodic tasks and in cyclical environments in a relatively straightforward way. We will see in the next chapter that the cost of the eligibility trace updates can be greatly reduced.

² Since all discounted return estimators can be calculated recursively as z_t = f(r_t, s_t, a_t, z_{t+1}, U), for some function f. If z_{t+1} is known then it is cheap to calculate z_t by working backwards.

Figure 3.8: Number line showing the effect of the step-size. Note that a step-size greater than 2 can actually increase the error in the estimate (i.e. move the new estimate into the hashed area).
3.4.9 The Non-Equivalence of Online Methods in Cyclic Environments

Consider the RunningAverage update rule (3.1). It is easy to see that with a large learning rate the algorithm can actually increase the error in the prediction. Let δ_t = z_t − Ẑ_t; then if α > 2, after an update, |z_t − Ẑ_{t+1}| > |z_t − Ẑ_t|. The problem can be seen visually in Figure 3.8.

This raises new suspicions about the online behaviour of the accumulate trace TD(λ) update. In a worst case environment (see Figure 3.9), in which a state is revisited after every step, after k revisits the eligibility trace becomes,

\[ e_k(s) = 1 + \gamma\lambda + \cdots + (\gamma\lambda)^{k-1} = \frac{1 - (\gamma\lambda)^{k}}{1 - \gamma\lambda}. \]

Thus, for γλ < 1, an upper bound on an accumulating eligibility trace (in any process) is given by,

\[ e_{\infty}(s) = \frac{1}{1 - \gamma\lambda}. \qquad (3.33) \]

For γλ = 1 the trace grows without bound if the process is finite and has no terminal state.

Figure 3.9: A worst-case environment for accumulating eligibility trace methods, in which the state's eligibility grows at the maximum rate. The reward is a random variable chosen from the range [−1, 1] with a uniform distribution.

The TD(λ) update (3.28) makes updates of the following form:

\[ V(s) \leftarrow V(s) + \alpha_t(s)\, e_t(s)\,\delta_t. \]

Thus it might seem that where α_t(s)e_t(s) > 2 holds, the TD(λ) algorithm could grow in error with each update. These conditions are easily satisfied for γλ close enough to 1 in any non-terminating finite (and therefore cyclic) process. Considering the case where the trace reaches its upper bound, we have in the worst case scenario,

\[ \alpha_t(s)\,\frac{1}{1-\gamma\lambda} > 2, \quad \text{or equivalently,} \quad 1 - \frac{\alpha_t(s)}{2} < \gamma\lambda, \]

assuming a constant α_t(s) while the eligibility rises. Yet the convergence of online accumulate trace TD(λ) has already been established [38, 59]. Crucially, these results rely upon the learning rate being declined under the Robbins-Monro conditions, which ensures that α tends to zero (and so α_t(s)e_t(s) must eventually fall below 2). However, even learning rate schedules that satisfy the Robbins-Monro conditions can cause α_t(s)e_t(s) > 2 to hold for a considerable time in the early stages of learning. An example is shown in Figure 3.10. Note that even though a high value of γ is used (i.e. close to 1.0, at which value functions may be ill-defined), by 10000 steps the remaining rewards can be neglected from the value of the state, since 0.999^10000 is very small. Even so, at the end of this period, α_t(s)e_t(s) > 2.

Figure 3.10: The growth of the accumulate trace update step-size, α_t e_t, for the process in Figure 3.9. The learning rate is α_t = 1/t^0.55, γ = 0.999 and λ = 1.0. These settings satisfy the conditions of convergence for accumulate trace TD(λ).

What are the practical consequences of this for the online accumulate trace TD(λ) algorithm? Figure 3.11 compares this method with an online forward view algorithm using the process in Figure 3.9. With λ = 1, a forward view λ-return algorithm can be implemented online in this particular task by making the following updates:

\[
\begin{aligned}
z_{t+1} &\leftarrow (1-\lambda)\big( r_{t+1} + \gamma\hat{V}_t(s) \big) + \lambda\big( r_{t+1} + \gamma z_t \big)\\
\hat{V}_{t+1}(s) &\leftarrow \hat{V}_t(s) + \alpha_t(s)\big( z_{t+1} - \hat{V}_t(s) \big)
\end{aligned}
\]

Note that this is "back-to-front": rewards should be included into z^λ with the most recent first. However, this makes no difference in this case, since there is only one state and only one reward. Thus with λ = 1, z records the actual observed discounted return (and is also the first-visit estimate) except for some small error introduced by V̂_0(s). V̂_0(s) is set to zero (i.e. the correct value) for all of the methods. In the experiment, the initial estimate has little influence on the general shape of the graphs in Figure 3.11 beyond the first few steps. Also, with α_t(s) = 1/t, V̂_t(s) is the every-visit estimate except for the negligible error caused by V̂_0(s).
[Figure 3.11 consists of six panels plotting the error |ΔV̂(s)| against time, one row for each of the learning rates α_t = 1/t, α_t = 1/t^0.55 and α_t = 0.5, comparing the Accumulate trace, Replace trace, Forward View Every-Visit and Forward View First-Visit methods.]

Figure 3.11: Comparison of variance between the online versions of TD(λ) and the forward view methods in the single state process in Figure 3.9, where γ = 0.999 and λ = 1. The results are the average of 300 runs. The horizontal and vertical axes differ in scaling. The vertical axis measures |ΔV̂(s)| = |V̂(s) − V(s)|, since the true value V(s) = 0.

Alternatively, note that the method is exactly the every-visit method for a slightly different process where there is some (very small) probability of entering a zero-valued terminal state (in which case setting V̂_0(s) = 0 is justified). This allows us to closely compare online TD(λ) with the forward-view Monte-Carlo estimates, and even do so with different learning rate schemes. Different learning rate schemes correspond to different recency weightings of the actual return. The "Forward-View, First-Visit" method in Figure 3.11 simply learns the actual observed return at the current time, and is independent of the learning rate. The replace trace method is also shown and is equivalent to TD(0) for this environment.

The results can be seen in Figure 3.11. The most interesting results are those for accumulate trace TD(λ). Here we see that where α_t(s) = 1/t, the method most closely approximates the every-visit method (at least in the long term). This is predicted as a theoretical result by Singh and Sutton in [139] for the batch update case. With a more slowly declining or a constant α (i.e. more recency-biased), the accumulate trace method is considerably higher in error than any of the other methods. This seems to be at odds with the existing theoretical results in [150], where it is shown that TD(λ) is equivalent to the forward view method for constant α (and any λ). However, this equivalence applies only in the offline (batch update) case. The equivalence is approximate in the online learning case and we see the consequence of this approximation in Figure 3.11. In the fixed α case, the values learned by accumulate trace TD(λ) are so high in variance as to be essentially useless as predictions. Similar results can be expected in other cyclic environments where the eligibility trace can grow very large. There are also numerous examples in the literature where the performance of accumulate trace methods sharply degrades as λ tends to 1 (in particular, see [139, 150]). In contrast, the every-visit method behaves much more reasonably (as do the first-visit and replace trace methods). Partially, this is some motivation for a new (practical) online-learning forward view method presented in Chapter 4.

It may seem surprising that the error in the accumulate trace TD(λ) method does not continue to increase indefinitely, since α_t e_t is considerably higher than 2 after the first few updates and remains so. The reason for this is that the observed samples used in updates (r_{t+1} + γV̂(s_{t+1})) are not independent of the learned estimates (V̂(s_t)). Unlike in the basic RunningAverage update case, where divergence to infinity is clear (with z independent of Ẑ), this non-independence appears to be useful in bounding the size of the possible error in this and presumably other cyclic tasks.

In Figure 3.11, we also see that the every-visit method performed marginally better than first-visit in each case. This is consistent with the theoretical results obtained by Singh and Sutton in [139], which predict that (offline) every-visit Monte Carlo will find predictions with a lower mean squared error (i.e. lower variance) for the first few episodes; only one episode occurred in this experiment.

We can conclude that, i) drawing analogies between forward-view methods and online versions of eligibility trace methods is dangerous, since the equivalence of these methods does not extend to the online case, and ii) accumulate trace TD(λ) can perform poorly in cyclic environments where α_t e_t above 2 is maintained. In particular, it can perform far worse than its forward-view counterpart for learning rate declination schemes slower than α(s) = 1/k(s) (where k is the number of visits to s). This can be attributed to the approximate nature of the forward-backward equivalence in the online case. In cyclic tasks, errors due to this approximation can be magnified by large effective step-sizes (αe).
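The single-state experiment is easy to reproduce in outline. The following Python sketch (illustrative only, with assumed parameter values) runs the accumulate-trace update and the online forward-view update side by side in the process of Figure 3.9; with a constant step-size the accumulate-trace estimate shows much larger fluctuations, which is the behaviour reported above.

import random

def single_state_experiment(steps=10000, gamma=0.999, lam=1.0, alpha=0.5, seed=0):
    """Accumulate-trace TD(lambda) vs. the forward-view estimate in the
    single-state cyclic process of Figure 3.9 (reward uniform in [-1, 1],
    true value 0). A rough sketch of the experiment behind Figure 3.11."""
    rng = random.Random(seed)
    v_acc = v_fwd = 0.0        # accumulate-trace and forward-view estimates
    e = 0.0                    # accumulating eligibility trace
    z = 0.0                    # incrementally built return estimate
    for t in range(steps):
        r = rng.uniform(-1.0, 1.0)
        delta = r + gamma * v_acc - v_acc        # one-step TD error
        e = gamma * lam * e + 1.0                # accumulate trace
        v_acc += alpha * e * delta               # TD(lambda) update
        z = r + gamma * ((1 - lam) * v_fwd + lam * z)
        v_fwd += alpha * (z - v_fwd)             # forward-view update
    return abs(v_acc), abs(v_fwd)                # both errors; true value is 0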
3.5 Temporal Difference Learning for Control

3.5.1 Q(0): Q-learning

Like value-iteration, Q-learning evaluates the greedy policy. It does so using the following update rule:

\[ \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_k(s_t, a_t)\Big( r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t) \Big). \qquad (3.34) \]

Note that the target return estimate used by Q-learning,

\[ r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') \]

is a special case of the one used by the off-policy SARSA update (3.19), in which the evaluation policy is the (non-stationary) greedy policy.

Q-learning is known to converge upon Q* as k → ∞ under similar conditions as TD(0) [163, 164, 59, 21]. However, unlike TD(0), there is no need to follow the evaluation policy (i.e. the greedy policy). Exploratory actions may be taken freely, and yet only the greedy policy is ever evaluated. The method will converge upon the optimal Q-function provided that all SAPs continue to be tried (i.e. each is visited infinitely often), plus other conditions similar to those ensuring the convergence of TD(0).

1) Initialise: t ← 0; s_{t=0}
2) for each episode:
3)   initialise s_t
4)   while s_t is not terminal:
5)     select a_t
6)     follow a_t; observe r_{t+1}, s_{t+1}
7)     Q-learning-update(s_t, a_t, r_{t+1}, s_{t+1})
8)     t ← t + 1

Q-learning-update(s_t, a_t, r_{t+1}, s_{t+1})
1)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k ( r_{t+1} + γ max_{a'} Q̂(s_{t+1}, a') − Q̂(s_t, a_t) )

Figure 3.12: The online Q-learning algorithm. Evaluates the greedy policy independently of the policy used to generate experience. This method is exploration insensitive.
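A minimal tabular Q-learning update (3.34) in Python (illustrative only; the dictionary representation is an assumption, not part of the thesis):

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """One Q-learning update (3.34) on a tabular Q-function stored as a dict
    keyed by (state, action). Exploration-insensitive: the behaviour used to
    pick actions does not change what is being learned."""
    if terminal:
        max_next = 0.0
    else:
        max_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    delta = r + gamma * max_next - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * delta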

3.5.2 The Exploration-Exploitation Dilemma

Why take exploratory actions? Almost all systems that learn control policies through interaction face the exploration-exploitation dilemma. Should the agent sacrifice immediate reward and take actions that reduce the uncertainty about the return following untried actions, in the hope that they will lead to more rewarding policies; or should the agent behave greedily, avoiding the return lost while exploring, but settle for a policy that may be sub-optimal?

Optimal Bayesian solutions to this dilemma are known but are intractable in the general multi-step process case [78]. However, there are many good heuristic solutions. Good surveys of early work can be found in [62, 156]; recent surveys can be found in [85, 174, 63]. Also see [41, 40, 142, 175] for recent work not included in these. Common features of the most successful methods are local definitions of uncertainty (e.g. action counters, Q-value error and variance measures), the propagation of this uncertainty to prior states, and then choosing actions which maximise combined measures of this long-term uncertainty and long-term value.

3.5.3 Exploration Sensitivity

When learning state values or state-action values, we do so with respect to the return obtainable for following some policy after visiting those states. For some learning methods, such as TD(λ) and SARSA(λ), the policy being evaluated is the same as the policy actually followed while gathering experience. These are referred to as on-policy methods [150]. For these, the actual experience affects what these methods converge upon in the limit. By contrast, off-policy methods allow off-policy (or exploratory) actions to be taken in the environment (i.e. actions may be chosen from a distribution different from the evaluation policy). That is to say, they may learn the value-functions or Q-functions for one policy while following another.

To put this into context, for control optimisation problems we are usually evaluating the greedy policy,

\[ \pi_g(s) = \arg\max_a \hat{Q}(s, a). \qquad (3.35) \]

Q(0) is an exploration insensitive method, as it only ever estimates the return available under the greedy policy, regardless of the distribution of, or methods used to obtain, its experience. This is possible because its return estimate, r_{t+1} + γ max_a Q̂(s_{t+1}, a), is independent of a_{t+1}. For the same reason, SARSA(0) using the return estimate in Equation 3.19 is also an off-policy method.

Multi-Step Methods. Off-policy learning is less straightforward for methods that use multi-step return estimates. For example, if a multi-step return estimate used to update Q̂(s_t, a_t) includes the reward following a non-greedy action, a_{t+k} (k ≥ 1), then there is a bias to learn about the return following a non-greedy policy instead of the greedy policy. That is to say, Q̂(s_t, a_t) receives credit for the delayed reward, r_{t+k+1}, which the agent might not observe if it follows the greedy policy after (s_t, a_t). In most cases, learning in this way denies convergence upon Q*. This is straightforward to see when the case is considered where Q̂ = Q* is known to hold: most updates following non-greedy actions are likely to move Q̂ away from Q* (in expectation).

The most commonly used solution to this problem is to ensure that the exploration policy converges upon the greedy policy in the limit, so that on-policy methods eventually evaluate the greedy policy [135]. However, schemes for doing this must carefully observe the learning rate. If convergence to the greedy policy is too fast then the agent may become stuck in a local minimum, since choosing only greedy actions may result in some parts of the environment being under-explored (or under-updated). If convergence upon the greedy policy is too slow, then as the learning rate declines, the Q-function will converge prematurely and remain biased toward the rewards following non-greedy actions. In [135], Singh et al. discuss several exploration methods which are greedy in the limit and allow SARSA(0) to find Q* in the limit. Their results seem likely to hold also for SARSA(λ), although there is as yet no proof of this.

In any case, following or even converging upon the greedy exploration strategy may not always be desirable or even possible. For example:
- Bootstrapping from externally generated experience or some given training policy (such as one provided by a human expert) can greatly reduce the agent's initial learning costs [72, 112]. Even if the agent follows this training policy, we would still like our method to be learning about the greedy policy (and so moving toward the optimal policy).

- There may be a limited amount of time available for exploration (e.g. for commercial or safety critical applications, it might be desirable to have distinct training, testing and application phases). In this case, we may wish to perform as much exploration as possible in the training stage.

- The agent may be trying to learn several policies (behaviours) in parallel, where each policy should maximise its own reward function (as in [58, 79, 143]). At any time the agent may take only one action, yet it remains useful to be able to use this experience to update the Q-functions of all the policies being evaluated.

- The agent's task may be non-stationary, in which case continual exploration is required in order to evaluate actions whose true Q-values are changing [105].

- The agent's Q-function representation may be non-stationary. Continual exploration may be required to evaluate the actions in the new representation.
It has long been known that multi-step return estimates need not lead to exploration-sensitive methods. The method recommended by Watkins is to truncate the λ-return estimate such that the rewards following off-policy (e.g. non-greedy) actions are removed from it [163]. For example, Q̂(s_t, a_t) should be updated using the corrected n-step truncated λ-return (see [163, 31]),

\[
\begin{aligned}
z_t^{(\lambda,n)} &= (1-\lambda)\big[ z_t^{(1)} + \lambda z_t^{(2)} + \lambda^2 z_t^{(3)} + \cdots + \lambda^{n-2} z_t^{(n-1)} \big] + \lambda^{n-1} z_t^{(n)} \qquad (3.36)\\
&= (1-\lambda)\big( r_{t+1} + \gamma\hat{U}(s_{t+1}) \big) + \lambda\big( r_{t+1} + \gamma z_{t+1}^{(\lambda,n-1)} \big) \qquad (3.37)
\end{aligned}
\]

where,

\[ z_t^{(\lambda,1)} = r_{t+1} + \gamma\hat{U}(s_{t+1}) \]

and a_{t+n} is the next off-policy action. However, if there is a considerable amount of exploration then the return estimate may be truncated extremely frequently, and much of the benefit of using a multi-step return estimate can be eliminated. As a result, the method is seldom applied.

For an eligibility trace method, zeroing the eligibilities immediately after taking an off-policy action has the same effect as truncating the λ-return estimate [163].
Figure 3.13 shows Watkins' Q(λ) eligibility trace algorithm and Figure 3.14 shows Peng and Williams' Q(λ).³ Watkins' Q(λ) truncates the return estimate after taking non-greedy actions and is an off-policy method. PW-Q(λ) does not truncate the return and assumes that all rewards are those observed under a greedy policy. It is neither on-policy nor off-policy.

The Watkins' Q(λ) and PW-Q(λ) algorithms are identical methods when purely greedy policies are followed. They differ only in the temporal difference error used to update SAPs visited at t − k (k > 1):

Watkins' Q(λ):  δ_t = r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
PW-Q(λ):        δ_t = r_{t+1} + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a)

The eligibility trace methods may also be used for off-policy evaluation of a fixed policy by applying importance sampling [111]. Here, the eligibility trace is scaled by the likelihood that the exploratory policy has of generating the experience seen by the evaluation policy. When used for greedy policy evaluation, the method reduces to Watkins' Q(λ). Like the off-policy SARSA(0) method, the evaluation policy must be known.

³ The use of the eligibility trace in the Peng and Williams' and Watkins' Q(λ) algorithms presented here is the same as the method in [107, 167], but differs from TD(λ) and SARSA(λ). Because, in Figures 3.13 and 3.14, the traces are updated before the Q-values, the trace extends an extra step into the history and an additional update may be the result in the case of state revisits. The algorithms may be modified to remove this additional update, although in practice this makes little difference.
Optimistic Q-value Initialisation and Exploration

To encourage exploration of the environment, a common technique in RL is to provide an optimistic initial Q-function and then follow a policy with a strong greedy bias. Examples of these "soft greedy" policies include ε-greedy and Boltzmann selection [135, 150]. Over time each Q-value will decrease as it is updated, but the Q-values of untried actions, or of actions that led to untried actions, will remain artificially high. Thus, even while following a purely greedy policy, the agent can be led to unexplored parts of the state-space.

However, problems arise if the estimated value of an action should ever fall below its true value (as may easily happen in environments with stochastic rewards or transitions). In this case any method which acts only greedily can become stuck in a local minimum, since the truly best actions are no longer followed.

The original version of PW-Q(λ), as published in [107], assumes that π_g is always followed. As a result the standard Q-function initialisation for PW-Q(λ) is an optimistic one. Even so, several authors report good results when using PW-Q(λ) and following semi-greedy policies [128, 169]. In this case, PW-Q(λ) is an unsound method in the sense that, like SARSA(λ), it can be shown that it will not converge upon Q* in some environments while exploratory actions continue to be taken.⁴ However, it may gain a greater efficiency in assigning credit to actions over Watkins' Q(λ), as it does not truncate its return estimate when taking off-policy actions. This allows the credit for individual actions to be used to adjust more Q-values in prior states.

⁴ This can be seen straightforwardly in deterministic processes with deterministic rewards. Note that if Q̂ = Q* is known to hold, then PW-Q(λ) (or SARSA(λ)) may increase ‖Q̂ − Q*‖ if non-greedy actions are taken. The same is not true for Q-learning and Watkins' Q(λ).

Watkins-Q(λ)-update(s_t, a_t, r_{t+1}, s_{t+1})
1)   if off-policy(s_t, a_t):                       (test for non-greedy action)
2)     for each (s, a) ∈ S × A do:                  (truncate eligibility traces)
3)       e(s, a) ← 0
4)   δ_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
5)   for each SAP (s, a) ∈ S × A do:
6)     e(s, a) ← γλ e(s, a)                         (decay trace)
7)     Q̂(s, a) ← Q̂(s, a) + α δ_t e(s, a)
8)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k δ_t e(s_t, a_t)
9)   for each a ∈ A(s_t) do:
9a)    e(s_t, a) ← 0
10)  e(s_t, a_t) ← e(s_t, a_t) + 1

Figure 3.13: Off-policy (Watkins') Q(λ) with a state replacing trace. This version differs slightly from the algorithm recently published in the standard text [150]. For an accumulating trace version omit steps 9 and 9a. For state-action replacing traces, replace steps 9 to 10 with e(s_t, a_t) ← 1.

PW-Q(λ)-update(s_t, a_t, r_{t+1}, s_{t+1})
1)   δ'_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
2)   δ_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a)
3)   for each SAP (s, a) ∈ S × A do:
4)     e(s, a) ← γλ e(s, a)
5)     Q̂(s, a) ← Q̂(s, a) + α δ_t e(s, a)
6)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k δ'_t e(s_t, a_t)
7)   for each a ∈ A(s_t) do:
7a)    e(s_t, a) ← 0
8)   e(s_t, a_t) ← e(s_t, a_t) + 1

Figure 3.14: Peng and Williams' Q(λ) with a state replacing trace. Modifications for accumulating and state-action replacing traces are as for Watkins' Q(λ) (Figure 3.13).
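A simplified Python sketch in the spirit of Watkins' Q(λ) (Figure 3.13) follows. It is illustrative only: it uses an accumulating trace, a plain dictionary representation, and a simple greedy test in place of the thesis's exact update ordering and replacing-trace details.

def watkins_q_lambda_update(Q, e, s, a, r, s_next, actions, alpha, gamma, lam,
                            terminal=False):
    """Q and e are dicts keyed by (state, action). Traces are cut whenever a
    non-greedy action is taken, so rewards that follow exploratory actions
    are never credited to earlier state-action pairs."""
    greedy_value = max(Q.get((s, a2), 0.0) for a2 in actions)
    if Q.get((s, a), 0.0) < greedy_value:        # off-policy (non-greedy) action
        e.clear()                                # truncate the return estimate
    e[(s, a)] = e.get((s, a), 0.0) + 1.0
    max_next = 0.0 if terminal else max(Q.get((s_next, a2), 0.0)
                                        for a2 in actions)
    delta = r + gamma * max_next - Q.get((s, a), 0.0)
    for key in list(e):
        Q[key] = Q.get(key, 0.0) + alpha * delta * e[key]
        e[key] *= gamma * lam
    return delta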
3.5.4 The Off-Policy Predicate

For control tasks, the common test used to decide whether an action was off-policy (i.e. non-greedy) is [163, 176, 150],

\[ \text{off-policy}(s_t, a_t) = \begin{cases} \text{true}, & \text{if } a_t \neq \arg\max_a \hat{Q}(s_t, a), \\ \text{false}, & \text{otherwise,} \end{cases} \qquad (3.38) \]

which assumes that only a single action can be greedy. However, consider that in some tasks some states can have several equivalent best actions (e.g. as in the example in Section 2.5). Also, the Q-function might be initialised uniformly, in which case all actions are initially equivalent. For Watkins' Q(λ) the above predicate will result in the return estimate being truncated unnecessarily often. A better alternative, which acknowledges that there may be several equivalent greedy actions, is,

\[ \text{off-policy}(s_t, a_t) = \begin{cases} \text{true}, & \text{if } \max_a \hat{Q}(s_t, a) - \hat{Q}(s_t, a_t) > \kappa_{\text{offpol}}, \\ \text{false}, & \text{otherwise,} \end{cases} \qquad (3.39) \]

where κ_offpol is a constant which provides an upper bound for the maximally tolerated degree to which an action may be off-policy (i.e. the allowable "off-policyness" of an action). With κ_offpol > 0 the off-policy predicate may yield false even for non-greedy actions. For the Watkins' Q(λ) algorithm this means that the return estimate may include the reward following actions that are less greedy. An action, a, is defined here to be nearly-greedy if V̂(s) − Q̂(s, a) ≤ κ_offpol for some small positive value of κ_offpol. If κ_offpol increases further to be greater than (max_{a'} Q̂(s, a')) − Q̂(s, a) for all states over the entire life of the agent, then the Watkins' Q(λ) algorithm is identical to PW-Q(λ), since the off-policy predicate is always false. The intermediate values of κ_offpol define a new space of algorithms (we might call these semi-naive Watkins' Q(λ), after [150]).

The value of κ_offpol suggests the following error in the learned predictions for using the return of nearly-greedy policies as an evaluation of a greedy policy:

\[ \kappa_{\text{offpol}} + \gamma\kappa_{\text{offpol}} + \gamma^2\kappa_{\text{offpol}} + \cdots = \frac{\kappa_{\text{offpol}}}{1-\gamma}, \quad \text{for } 0 < \gamma < 1. \]
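The relaxed predicate (3.39) amounts to a one-line tolerance test. A Python sketch (illustrative only; the tolerance parameter plays the role of the off-policy threshold κ_offpol):

def off_policy(Q, s, a, actions, tolerance=0.0):
    """Predicate (3.39): an action counts as off-policy only if its Q-value
    falls more than `tolerance` below the best action's value. With
    tolerance=0, all tied-greedy actions already count as greedy, which
    improves on the single-greedy-action test (3.38)."""
    best = max(Q.get((s, a2), 0.0) for a2 in actions)
    return best - Q.get((s, a), 0.0) > tolerance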

3.6 Indirect Reinforcement Learning

An alternative method to directly learning the value function is to integrate planning (i.e. value iteration) with online learning. This approach is the DYNA framework; many instantiations are possible [144].
In order to allow planning, maximum likelihood models of R_s^a and P_{ss'}^a can be constructed from the running means of samples of observed immediate rewards and state transitions, or equivalently, by applying the following updates (in order) [143],

    N_s^a ← N_s^a + 1,                                                  (3.40)
    R̂_s^a ← R̂_s^a + (1/N_s^a) (r_t − R̂_s^a),                            (3.41)
    ∀x ∈ S:  P̂_{sx}^a ← P̂_{sx}^a + (1/N_s^a) (I(x, s') − P̂_{sx}^a),      (3.42)

where a = a_t, s = s_t, s' = s_{t+1}, I(x, s') is an identity indicator, equal to 1 if x = s' and 0 otherwise, and N_s^a is a record of the number of times a has been taken in s. Backup (3.42) must be applied for all (s, x) pairs after each observed transition. Note that there is no benefit in learning R̂_{ss'}^a instead of R̂_s^a since once a is chosen in s, there is no control over which s' is entered as the successor.
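As an illustration, a minimal sketch of updates (3.40)–(3.42) for a tabular model is given below. The dictionary names (N, R, P) are hypothetical and the tables are assumed to default to zero.

```python
from collections import defaultdict

N = defaultdict(int)     # N[s, a]: visit counts
R = defaultdict(float)   # R[s, a]: running mean of the immediate reward
P = defaultdict(float)   # P[s, a, x]: estimated transition probability

def update_model(states, s, a, r, s_next):
    # Equations (3.40)-(3.42) as incremental running means.
    N[s, a] += 1
    n = N[s, a]
    R[s, a] += (r - R[s, a]) / n
    for x in states:
        indicator = 1.0 if x == s_next else 0.0
        P[s, a, x] += (indicator - P[s, a, x]) / n
```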
With a model, the dynamic programming methods presented in the previous chapter may now be applied. In practice, fully re-solving the learned MDP given the new model is often too expensive to do online. The Adaptive Real-Time Dynamic Programming (ARDP) solution is to perform value-iteration backups to some small set of states between online steps [12]. Similar approaches were also proposed in [89, 71, 66].
Alternatively, prioritised sweeping focuses the backups where they are expected to most quickly reduce error [88, 105, 167, 7]. Note that if the value of a state changes, then the values of its predecessors are likely to also need updating. When applied online, the current state is updated and the change in error noted. A priority queue is maintained indicating which states are likely to receive the greatest error reduction on the basis of the size of the value changes in their successors. Thus when the current state is updated, its change in value is used to promote the position of its predecessors in the priority queue. Additional updates may then be made, always removing and updating the highest priority state in the queue, and then promoting its predecessors in the queue. More or fewer updates may be made depending upon how much real time is available between experiences.
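A rough sketch of this online loop is given below. It is a simplified illustration of the idea rather than any particular published variant; V, R and P are assumed to be dict-like tables (R and P as in the model sketch above), and preds maps each state to the set of its known predecessors.

```python
import heapq

def backup(V, R, P, gamma, states, actions, s):
    # One value-iteration backup of V[s]; returns the size of the change.
    old = V[s]
    V[s] = max(R[s, a] + gamma * sum(P[s, a, x] * V[x] for x in states)
               for a in actions)
    return abs(V[s] - old)

def prioritised_sweeping_step(V, R, P, preds, gamma, states, actions, s, n_backups):
    # Back up the current state, then spend the remaining budget on the
    # states expected to change most: predecessors are promoted by the
    # size of the change in their successor's value.
    change = backup(V, R, P, gamma, states, actions, s)
    queue = [(-change, p) for p in preds[s]]   # min-heap of (-priority, state)
    heapq.heapify(queue)
    for _ in range(n_backups):                 # as much work as time allows
        if not queue:
            break
        _, x = heapq.heappop(queue)
        change = backup(V, R, P, gamma, states, actions, x)
        for p in preds[x]:
            heapq.heappush(queue, (-change, p))
```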
In practice, it is not always clear whether the value-iteration backups are preferable to model-free methods. In several comparisons, they appear to learn with orders of magnitude less experience than Q-learning [150, 12]. However, value-iteration backups are often far more expensive. For instance, if the environment is very stochastic then a state may have very many successors. In the worst case, a value-iteration update for a single state could cost O(|S| · |A|). Thus, even when updates are distributed in focused ways, their computational expense can still be very great compared to model-free methods. Also, in the next chapter we will see how the computational cost of experience-efficient model-free methods (such as eligibility trace methods) can be brought in line with methods such as Q-learning.
A general rule of thumb seems to be that if experience is costly to obtain then learning a model is a good way to reduce this cost. The most effective way of employing models is, however, still open to debate; model-free methods can also be applied using the model as a simulation. A discussion can be found in [150].
So far, we have also only considered cases where it is feasible to store V̂ or Q̂ in a look-up table. Where this is not possible (e.g. if the state-space is large or non-discrete), then function approximators must be employed to represent these functions, and also the model (P and R). In this case, it seems that the model-free methods provide significant advantages. For instance, λ-return and eligibility trace methods are thought to suffer less in non-Markov settings. (Many function approximation schemes such as state-aggregation can be thought of as providing the learner with a non-Markov view of the world, even if the perceived state is one of a Markov process.) By their "single-step" nature, P and R give rise to methods that heavily rely on the Markov property. It is not clear how multi-step models can be learned and so overcome their dependence on the Markov property. It is also often unclear how to represent stochastic models with many kinds of function approximator. Function approximation is covered in more detail in Chapter 5.

3.7 Summary
In this chapter we have seen how reinforcement learning can proceed starting with little or no prior knowledge of the task being solved. Using only the knowledge gained through interaction with the environment, optimal solutions to difficult stochastic control problems can be found.
A number of different dimensions to RL methods have been seen: prediction and control methods, bias and variance issues, direct and indirect methods, exploration and exploitation, online and offline methods, on-policy and off-policy methods, and single-step and multi-step methods.
Online learning in cyclic environments was identified as a particularly interesting class of problems for model-free methods. Here we see a wider variation in the solution methods than in the acyclic or offline cases. Also, we have seen how it is difficult to apply forwards view methods in this case and how (accumulate) trace methods can significantly differ from their forward view analogues. Also, there appears to be no theoretically sound and experience efficient model-free control method for online learning while continuing to take non-greedy actions. Section 3.5.3 listed several examples of why such learning methods are useful. Apparently sound methods, such as Watkins' Q(λ), suffer from "shortsightedness", while unsound methods can easily be shown to suffer from a loss of predictive accuracy (practical examples are given in the next chapter).
Chapter 4

Efficient Off-Policy Control

Chapter Outline
This chapter reviews extensions to the model-free learning algorithms presented in the previous chapter. We see how their computational costs can be reduced and their data-efficiency increased, while also allowing for exploratory actions and online learning. The experimental results using these algorithms also lead to interesting insights about the role of optimism in reinforcement learning control methods.

4.1 Introduction

The previous chapter introduced a number of RL algorithms. Let's review some of the properties that we would like a method to have:

Predictive. Algorithms that predict, from each state in the environment, the expected return available for following some given policy thereafter.

Optimising Control. Algorithms perform control optimisation if they find or approximate an optimal policy rather than evaluate some fixed policy.

Exploration Insensitive. Algorithms that can evaluate one policy while following another are exploration insensitive methods (also referred to as off-policy methods) [150, 163]. In the context of control optimisation, we often want to evaluate the greedy policy while following some exploration policy.

Online Learning. Online learning methods immediately apply observed experiences for learning. Where exploration depends upon the Q-function, online methods can have a huge advantage over methods which learn offline [128, 65, 165, 168]. For instance, most exploration strategies quickly decline the probability of taking actions which lead to large punishments, provided that the Q-values for those actions are also declined. If the Q-function is adjusted offline, or after some long interval, then the exploration strategy may select poor actions many times more than necessary within a single episode.

Computationally Cheap. Currently, the cheapest online learning control methods have time complexities of O(|A|) per experience, where |A| is the number of actions currently available to the agent [168, 163].

Fast Learning. Methods which make effective use of limited real experience. For example, methods which learn a model of the environment can make excellent use of experience but are often computationally far more expensive than O(|A|) when learning online. Existing model-free methods have attempted to tackle this using eligibility traces [148, 163, 128] or backwards replay [72, 76]. However, off-policy (exploration insensitive) eligibility trace methods for control, such as Watkins' Q(λ), are relatively inefficient. Also, backwards replay is generally regarded as a technique that cannot be used for online learning. Methods such as SARSA(λ) and Peng and Williams' Q(λ) are exploration sensitive methods; if exploring actions are continuously taken in the environment then they lose predictive accuracy in their Q-functions as a result.

Scalable. For an RL algorithm to be practical it must work in cases where there are very many states or where the state-space is non-discrete. Typically, this involves using a function approximator to store and update the Q-function. Eligibility trace methods have been shown to work well when applied with function approximators [163, 149].

This chapter reviews a number of important RL algorithms. It is shown how Lin's backwards replay can be modified to learn online and so provides a good substitute for eligibility trace methods. It is both simpler and in many instances also faster learning. The simplicity gains are derived by directly employing the λ-return estimate in learning updates rather than calculating its effect incrementally. In many instances learning speedups are also derived through the backwards replay mechanism, allowing return estimates to be based on more up-to-date information than for eligibility trace methods.
Special consideration is given to off-policy control methods which, despite having most of the above properties in combination, have received little attention or use in the literature due to their supposed slow learning [128, 150]. Several new off-policy control methods are presented, the last of which is designed to provide significant data-efficiency improvements over Watkins' Q(λ). The general new technique can easily be applied in order to derive analogues of most eligibility trace methods such as TD(λ), SARSA(λ) [150] and importance sampling TD(λ) [111].
Section 4.2 first reviews Fast Q(λ), a method for precisely implementing eligibility trace methods at a cost which is independent of the size of the state-space. This algorithm is used as a state-of-the-art baseline against which the new method is compared. Section 4.3 reviews existing backwards replay methods that provide the basis of the new approach. Section 4.4 introduces the new Experience Stack method, an online-learning version of backwards replay that is as computationally cheap as Fast Q(λ). Section 4.5 provides some experimental results with this algorithm, comparing it against Fast Q(λ). This and the supporting analysis in Sections 4.6 and 4.7 give a useful profile of when backwards replay may provide improvements over eligibility traces. Section 4.7 also provides a surprising new insight into the potentially harmful effects of optimistic initial value biases on learning updates that employ return estimates truncated with max_a Q̂(s, a).
4.2 Accelerating Q(λ)

Naive implementations of Q(λ) (as presented in the previous chapter) are far more expensive than Q(0) as they involve updating the eligibilities and Q-values of all SAPs at each time-step. This gives a time complexity of O(|S||A|) per experience, instead of O(|A|). A simple and well known improvement upon this is to update only those Q-values with significant traces. See [167] for an implementation. For some trace significance, n, n or fewer of the most recently visited states have their eligibilities and values updated, at a cost of O(n|A|) per step. States visited more than n steps ago have eligibilities of zero. n is given such that (γλ)^n < ε, for some small ε. However, if (γλ) → 1 and the environment has an appropriate structure, potentially all of the states in the system may contain significant traces. In this case, n → |S|, and much of the computational saving is nullified.
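A sketch of this truncated-trace bookkeeping is given below (illustrative names, not from [167]); it keeps a bounded deque of the n most recently visited SAPs together with their traces, and treats anything older as having a trace of zero.

```python
from collections import deque

class TruncatedTraceQLambda:
    """Q(lambda) variant that only updates the n most recently visited SAPs
    (those with significant traces)."""
    def __init__(self, n, alpha, gamma, lam):
        self.recent = deque(maxlen=n)   # (s, a, trace) for the n newest SAPs
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def step(self, Q, s, a, delta):
        # Decay the stored traces and apply the TD error delta to each
        # recent SAP, then record the newest SAP with a trace of 1.
        decay = self.gamma * self.lam
        self.recent = deque(((ps, pa, e * decay) for ps, pa, e in self.recent),
                            maxlen=self.recent.maxlen)
        for ps, pa, e in self.recent:
            Q[(ps, pa)] += self.alpha * delta * e
        # Older entries for the same SAP, if any, simply continue to decay.
        self.recent.append((s, a, 1.0))
```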

4.2.1 Fast Q(λ)

Fast Q(λ) is intended as a fully online implementation of Peng and Williams' Q(λ) but with a time complexity of O(|A|) per update. The algorithm is designed for λ > 0; otherwise we can use simple Q-learning. This section is adapted with minor changes from [125], which contains the original description of Fast Q(λ), provided courtesy of Marco Wiering. The description of Fast Q(λ) is not a new contribution.

Main principle. The algorithm is based on the observation that the only Q-values needed at any given time are those for the possible actions given the current state. Hence, using "lazy learning", we can postpone updating Q-values until they are needed.
First note that, as in Equation 3.27 for TD(λ), the increment of Q̂(s, a) made by Peng and Williams' Q(λ) (in Figure 3.14) for a complete episode can be written as follows (for simplicity, a fixed learning rate α is assumed):

    ΔQ̂(s, a) = Q̂_T(s, a) − Q̂_0(s, a)                                                    (4.1)
             = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + Σ_{i=t+1}^{T} (γλ)^{i−t} δ_i I_t(s, a) ]   (4.2)
             = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + Σ_{i=1}^{t−1} (γλ)^{t−i} δ_t I_i(s, a) ]    (4.3)
             = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + δ_t Σ_{i=1}^{t−1} (γλ)^{t−i} I_i(s, a) ].   (4.4)

In what follows, let us abbreviate I_t = I_t(s, a) and φ = γλ. Suppose some SAP (s, a) occurs at steps t1, t2, t3, ..., then we may unfold the terms of the sum in expression (4.4):

    Σ_{t=1}^{T} [ δ'_t I_t + δ_t Σ_{i=1}^{t−1} φ^{t−i} I_i ]
        = Σ_{t=1}^{t1} [ δ'_t I_t + δ_t Σ_{i=1}^{t−1} φ^{t−i} I_i ]
        + Σ_{t=t1+1}^{t2} [ δ'_t I_t + δ_t Σ_{i=1}^{t−1} φ^{t−i} I_i ]
        + Σ_{t=t2+1}^{t3} [ δ'_t I_t + δ_t Σ_{i=1}^{t−1} φ^{t−i} I_i ] + ...               (4.5)

Since I_t(s, a) is 1 only for t = t1, t2, t3, ..., where the revisits of (s, a) occur, and I_t(s, a) is 0 otherwise, we can rewrite Equation 4.5 as

    δ'_{t1} + δ'_{t2} + Σ_{t=t1+1}^{t2} δ_t φ^{t−t1} + δ'_{t3} + Σ_{t=t2+1}^{t3} δ_t (φ^{t−t1} + φ^{t−t2}) + ...

      = δ'_{t1} + δ'_{t2} + (1/φ^{t1}) Σ_{t=t1+1}^{t2} δ_t φ^t
                          + δ'_{t3} + (1/φ^{t1} + 1/φ^{t2}) Σ_{t=t2+1}^{t3} δ_t φ^t + ...

      = δ'_{t1} + δ'_{t2} + (1/φ^{t1}) ( Σ_{t=1}^{t2} δ_t φ^t − Σ_{t=1}^{t1} δ_t φ^t )
                          + δ'_{t3} + (1/φ^{t1} + 1/φ^{t2}) ( Σ_{t=1}^{t3} δ_t φ^t − Σ_{t=1}^{t2} δ_t φ^t ) + ...

Defining Δ_t = Σ_{i=1}^{t} δ_i φ^i, this becomes

    δ'_{t1} + δ'_{t2} + (1/φ^{t1}) (Δ_{t2} − Δ_{t1}) + δ'_{t3} + (1/φ^{t1} + 1/φ^{t2}) (Δ_{t3} − Δ_{t2}) + ...   (4.6)

This will allow the construction of an efficient online Q(λ) algorithm. We define a local trace e'_t(s, a) = Σ_{i=1}^{t} I_i(s, a)/φ^i, and use (4.6) to write down the total update of Q̂(s, a) during an episode:

    ΔQ̂(s, a) = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + e'_t(s, a)(Δ_{t+1} − Δ_t) ].               (4.7)

To exploit this we introduce a global variable Δ keeping track of the cumulative TD(λ) error since the start of the episode. As long as the SAP (s, a) does not occur we postpone updating Q̂(s, a). In the update below we need to subtract that part of Δ which has already been used (see Equations 4.6 and 4.7). We use for each SAP (s, a) a local variable δ(s, a) which records the value of Δ at the moment of the last update, and a local trace variable e'(s, a). Then, once Q̂(s, a) needs to be known, we update Q̂(s, a) by adding e'(s, a)(Δ − δ(s, a)).
Algorithm overview. The algorithm relies on two procedures: the Local Update procedure calculates exact Q-values once they are required; the Global Update procedure updates the global variables and the current Q-value. Initially we set the global variables φ^t ← 1.0 and Δ ← 0. We also initialise the local variables δ(s, a) ← 0 and e'(s, a) ← 0 for all SAPs.

Local updates. Q-values for all actions possible in a given state are updated before an action is selected and before a particular Q-value is calculated. For each SAP (s, a) a variable δ(s, a) tracks changes since the last update:

Local Update(s_t, a_t):
1) Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t)(Δ − δ(s_t, a_t)) e'(s_t, a_t)
2) δ(s_t, a_t) ← Δ

The global update procedure. After each executed action we invoke the procedure Global Update, which consists of three basic steps: (1) To calculate max_a Q̂(s_{t+1}, a) (which may have changed due to the most recent experience), it calls Local Update for the possible next SAPs. (2) It updates the global variables φ^t and Δ. (3) It updates the Q-value and trace variable of (s_t, a_t) and stores the current Δ value (in Local Update).

Global Update(s_t, a_t, r_t, s_{t+1}):
1) ∀a ∈ A do:                                    Make Q̂(s_{t+1}, ·) up-to-date
1a)  Local Update(s_{t+1}, a)
2) δ'_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t))
3) δ_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a))
4) φ^t ← φ φ^{t−1}                               Update global clock
5) Δ ← Δ + δ_t φ^t                               Add new TD-error to global error
6) Local Update(s_t, a_t)                        Make Q̂(s_t, a_t) up-to-date for next step
7) Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t) δ'_t
8) e'(s_t, a_t) ← e'(s_t, a_t) + 1/φ^t           Decay trace

For state replacing eligibility traces [139] step 8 should be changed as follows: ∀a: e'(s_t, a) ← 0; e'(s_t, a_t) ← 1/φ^t.

Machine precision problem and solution. Adding δ_t φ^t to Δ in line 5 may create a problem due to limited machine precision: for large absolute values of Δ and small φ^t there may be significant rounding errors. More importantly, line 8 will quickly overflow any machine for φ < 1. The following addendum to the procedure Global Update detects when φ^t falls below machine precision ε_m and updates all SAPs which have occurred. A list, H, is used to track SAPs that are not up-to-date. If e'(s, a) < ε_m, the SAP (s, a) is removed from H. Finally, Δ and φ^t are reset to their initial values.

Global Update: addendum
9) If (visited(s_t, a_t) = 0):
9a)   H ← H ∪ (s_t, a_t)
9b)   visited(s_t, a_t) ← 1
10) If (φ^t < ε_m):
10a)  ∀(s, a) ∈ H do
10a-1)   Local Update(s, a)
10a-2)   e'(s, a) ← e'(s, a) φ^t
10a-3)   If (e'(s, a) < ε_m):
10a-3-1)    H ← H \ (s, a)
10a-3-2)    visited(s, a) ← 0
10a-4)   δ(s, a) ← 0
10b)  Δ ← 0
10c)  φ^t ← 1.0

Comments. Recall that Local Update sets δ(s, a) ← Δ, and update steps depend on Δ − δ(s, a). Thus, after having updated all SAPs in H, we can set Δ ← 0 and δ(s, a) ← 0. Furthermore, we can simply set e'(s, a) ← e'(s, a) φ^t and φ^t ← 1.0 without affecting the expression e'(s, a) φ^t used in future updates; this just rescales the variables. Note that if γλ = 1, then no sweeps through the history list will be necessary.

Complexity. The algorithm's most expensive part is the set of calls to Local Update, whose total cost is O(|A|). This is not bad: even Q-learning's action selection procedure costs O(|A|) if, say, the Boltzmann rule is used. Concerning the occasional complete sweep through SAPs still in the history list H: during each sweep the traces of SAPs in H are multiplied by φ^t. SAPs are deleted from H once their trace falls below ε_m. In the worst case one sweep per n time steps updates 2n SAPs and costs O(1) on average. This means that there is an additional computational burden at certain time steps, but since this happens infrequently, the method's average update complexity stays O(|A|).
The space complexity of the algorithm remains O(|S||A|). We need to store the following variables for all SAPs: Q-values, eligibility traces, δ values, the "visited" bit, and three pointers to manage the history list (one from each SAP to its place in the history list, and two for the doubly linked list). Finally we need to store the two global variables Δ and φ^t.
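The per-SAP storage just listed can be pictured as a small record. The sketch below is one possible layout (field names are illustrative, and the doubly linked history list is shown only as prev/next references), not the thesis' own implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SAPRecord:
    q: float = 0.0          # Q-value
    e: float = 0.0          # local eligibility trace e'(s, a)
    delta: float = 0.0      # value of the global Delta at the last local update
    visited: bool = False   # whether (s, a) is currently in the history list H
    prev: Optional["SAPRecord"] = None  # doubly linked history list pointers
    next: Optional["SAPRecord"] = None

# The two global variables of the algorithm:
Delta = 0.0   # cumulative weighted TD(lambda) error this episode
phi_t = 1.0   # the "global clock", the current power of phi = gamma * lambda
```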
4.2.2 Revisions to Fast Q(λ)
In this section we see how the original version of Fast Q(λ) is likely to be misapplied, giving rise to two subtle errors. This section also introduces: i) what modifications, if any, are required of action selection mechanisms that are intended to employ the up-to-date Q-function, ii) the state-action replacing trace version of Fast Q(λ), and iii) how the algorithm may be modified for off-policy learning (as Watkins' Q(λ)) [163, 150]. The new algorithms are shown in Figure 4.1.
The new work in this section can be found in a joint technical report co-authored with Marco Wiering [125].

Error 1. Step 1 of the original Global Update procedure performs the updates to the Q-values at s_{t+1} necessary to ensure that Q̂(s_{t+1}, ·) is an up-to-date estimate before steps 2 and 3 where it is used. However, Q̂(s_t, ·) is also used in steps 2 and 3 and may not be up-to-date. This is easily corrected by adding:
1b) Local Update(s_t, a)
We shall see below that this change is not necessary if Q̂(s_t, ·) is made up-to-date at the end of the Global Update procedure.

Error 2. When state replacing traces are employed with the original Fast Q(λ) algorithm, it is possible that the eligibilities of some SAPs are zeroed. In such a case, if these SAPs previously had non-zero eligibilities then they will not receive any update making use of δ_t. An exception is Q̂(s_t, a_t), which is made up-to-date in step 6 (and so makes use of δ_t). However, all other SAPs at s_t with non-zero eligibilities will receive no adjustment toward δ_t if their eligibilities are zeroed:

From the original version of Global Update:
    ...
    3) δ_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a))
    ...
    (Here, each a ≠ a_t with a non-zero trace receives no update using δ_t;
     Q̂(s_t, a_t) is already up-to-date before this point.)
    8) ∀a: e'(s_t, a) ← 0;  e'(s_t, a_t) ← 1/φ^t

To avoid this in the revised algorithm, all of the Q-values at s_t are made up-to-date before zeroing their eligibility traces (step 8a in the state-replacing trace revisions).

Action Selection. Steps 9 and 9a of the Revised Global Update procedure are a pragmatic change to ensure that all of the Q-values for s_{t+1} are up-to-date by the end of the procedure. If this were not so, then any code needing to make use of the up-to-date Q-function at s_{t+1}, such as that for selecting the agent's next action, would need to be defined in terms of the up-to-date Q-function instead. Q⁺ is used to denote the up-to-date Q-function and can be found at any time as follows:

    Q̂⁺(s, a) = Q̂(s, a) + α_k(s, a)(Δ − δ(s, a)) e'(s, a).                    (4.8)
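Equation (4.8) amounts to performing a local update on demand. A small sketch of reading the up-to-date value, and of committing it, is given below; the names are illustrative and sap is assumed to be any object with the q, e and delta fields of the record sketched earlier, with alpha the step size and Delta the global error variable.

```python
def q_plus(sap, Delta, alpha):
    # Equation (4.8): the value (s, a) would have if Local Update ran now.
    return sap.q + alpha * (Delta - sap.delta) * sap.e

def local_update(sap, Delta, alpha):
    # Commit the lazy update and remember how much of Delta has been used.
    sap.q = q_plus(sap, Delta, alpha)
    sap.delta = Delta
```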
From an implementation standpoint, these changes are desirable for at least three reasons. Firstly, the need to use Q̂⁺ for action selection is easy to overlook when implementing the original version of Fast Q(λ) as part of a larger learning agent. Secondly, it reduces coupling between algorithms; with steps 9 and 9a an algorithm that implements action selection based on the up-to-date Q-values of s_{t+1} does not need to use Q̂⁺ or even care that values at different states may be out-of-date. Thirdly, it reduces the duplication of code; we are likely to already have action-selection algorithms that use Q̂(s_{t+1}, ·) and so we don't need to implement others that use Q̂⁺(s_{t+1}, ·) instead.
The original description of Fast Q(λ) assumed that the Local Update procedure was called for all actions in the current state immediately after the Global Update procedure and prior to selecting actions. However, from the original description, it was not clear that this still needs to be done (for the same reason as Error 2, above) even if the Q-values at the current state are not used by the action selection method (for example, if the actions are selected randomly or provided by a trainer). If this is done, then the new and revised algorithms are essentially identical.
The following two sections introduce new features to the algorithm and are not revisions.

State-Action Replacing Traces. From Section 3.4.7 note that the state-action replace trace method sets e(s, a) to 1 instead of adding 1, as in the accumulate trace method. For Fast Q(λ), an effect equivalent to setting an eligibility to 1 is achieved by performing e'_{t+1}(s, a) ← 1/φ^t.

Watkins' Q(λ). Watkins' Q(λ) requires that the eligibility trace be zeroed after taking non-greedy actions. The new Fast Q(λ) version works in the same way (by applying e'(s, a) ← 0 for all SAPs), except that here we must ensure that all non-up-to-date SAPs are updated before zeroing their traces (see the Flush Updates procedure).

For accumulating traces:
Revised Global Update(s_t, a_t, r_t, s_{t+1}):
1) ∀a ∈ A do
1a)  Local Update(s_{t+1}, a)
2) δ'_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t))     NB. s_t was made up-to-date in step 9
3) δ_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a))
4) φ^t ← φ φ^{t−1}
5) Δ ← Δ + δ_t φ^t
6) Local Update(s_t, a_t)
7) Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t) δ'_t
8) e'(s_t, a_t) ← e'(s_t, a_t) + 1/φ^t                    Increment eligibility
9) ∀a ∈ A do
9a)  Local Update(s_{t+1}, a)                             Make Q̂(s_{t+1}, ·) up-to-date before action selection

For state-action replacing traces replace step 8 with:
8) e'(s_t, a_t) ← 1/φ^t                                   Set eligibility to 1

For state replacing traces, replace steps 8 - 9a with:
8) ∀a ∈ A do
8a)  Local Update(s_t, a)                                 Make Q̂(s_t, ·) up-to-date before zeroing eligibility
8b)  e'(s_t, a) ← 0                                       Zero eligibility
8c)  Local Update(s_{t+1}, a)                             Make Q̂(s_{t+1}, ·) up-to-date before action-selection
9) e'(s_t, a_t) ← 1/φ^t                                   Set eligibility to 1

For Watkins' Q(λ) prepend the following to the Revised Global Update procedures.
0) if off-policy(s_t, a_t)                                Test whether a non-greedy action was taken
0a)  Flush Updates()

Flush Updates()
1) ∀(s, a) ∈ H do
2)   Q̂(s, a) ← Q̂(s, a) + α_k(s, a)(Δ − δ(s, a)) e'(s, a)
3)   δ(s, a) ← 0
4)   e'(s, a) ← 0
5) H ← {}
6) Δ ← 0
7) φ^t ← 1

Figure 4.1: The revised Fast Q(λ) algorithm for accumulating, state replacing and state-action replacing traces and for Watkins' Q(λ). The machine precision addendum should be appended to each algorithm. The Flush Updates procedure can also be called upon entering a terminal state to make the entire Q-function up-to-date and also reinitialise the eligibility and error values of each SAP ready for learning in the next episode.

4.2.3 Validation
In this section we empirically test how closely the correct and erroneous implementations of Fast Q(λ) approximate the original versions of Q(λ). Fast Q(λ)+ is used to denote the correct implementation suggested here and Fast Q(λ)− to denote the method that does not apply a Local Update for all actions in the new state between calls to the Global Update procedure. Note that if these updates are performed, Fast Q(λ)+ and Fast Q(λ)− are identical methods.¹
The algorithms were tested using the maze task shown in Figure 4.4. This task was chosen as credit for actions leading to the goal can be significantly delayed (and so eligibility traces are expected to help) and also because state revisits can frequently occur, causing the different eligibility trace methods to behave differently.
Actions taken by the agent at each step were selected using ε-greedy [150]. This selects a greedy action, arg max_a Q̂(s_t, a), with probability 1 − ε, and a random action with probability ε. Fast Q(λ)− was given the benefit of using the true up-to-date Q-function (i.e. arg max_a Q̂⁺(s_t, a) was used to choose its greedy action).
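A minimal ε-greedy selector over the (up-to-date) Q-values of the current state might look like the following sketch; the names are illustrative, with q_values a mapping from each available action to its value.

```python
import random

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon pick a random action, otherwise a greedy one.
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```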
Figure 4.2 compares the results for the PW Q(λ) variants. The graphs measure the total reward collected by each algorithm and the mean squared error (MSE) in the up-to-date Q-function learned by each algorithm over the course of 200000 time steps. The squared error was measured as,

    SE(s) = ( V*(s) − max_a Q̂(s, a) )²,                                     (4.9)

for regular Q(λ), and as,

    SE(s) = ( V*(s) − max_a Q̂⁺(s, a) )²,                                    (4.10)

for both versions of Fast Q(λ). An accurate V* was found by dynamic programming methods. All of the results in the graphs are the average of 100 runs.
Fast PW Q(λ)+ provided equal or better performance than Fast PW Q(λ)− in most instances, and its results also provided an extremely good fit against the original version of PW Q(λ) in all cases (see Figures 4.2 and 4.3). Similar results were found when comparing Watkins' Q(λ) and its Fast variants (see Figures 4.5 and 4.6).
Fast Q(λ)− performed especially poorly in terms of error, relative to Fast Q(λ)+, for PW with accumulating or state-action replacing traces. However, in one instance (with a state replacing trace) the error performance of the revised algorithm was actually worse than the original (see Figure 4.3). This anomaly was not seen for Watkins' Q(λ) (see Figure 4.6).

¹ The experiments in Wiering's original description of Fast Q(λ) did perform these local updates and so we do not repeat the experiments in the original paper [168, 169, 167].
[Plots omitted: cumulative reward (left column) and mean squared error (right column) against steps (0 to 200000) for PW, Fast PW (+) and Fast PW (−), one row per trace type.]

Figure 4.2: Comparison of PW Q(λ), Fast PW Q(λ)+ and Fast PW Q(λ)− performance profiles in the stochastic maze task. Results are the average of 20 runs. The parameters were Q̂_0 = 100, α = 0.3, ε = 0.1 (low exploration rate), λ = 0.9 and ε_m = 1 × 10⁻³ for regular Q(λ) and ε_m = 10⁻¹⁰ for the Fast versions. (left column) Total reward collected. (right column) Mean squared error in the value function. (top row) With accumulating traces. (middle row) With state replacing traces. (bottom row) With state-action replacing traces.

The effect of exploratory actions on PW Q(λ) is also evident in these results. The PW Q(λ) methods collected less reward and found a far less accurate Q-function in the case of a high exploration rate than Watkins' methods (compare Figures 4.3 and 4.6). In contrast, Watkins' variants collected similar or better amounts of reward but found far more accurate Q-functions than Peng and Williams' methods in both the high and low exploration rate cases. Similar results concerning the error were reported by Wyatt in [176]. However, this example clearly demonstrates the benefit of off-policy learning under exploration in terms of collected return.

[Plots omitted: cumulative reward (left column) and mean squared error (right column) against steps (0 to 200000) for PW, Fast PW (+) and Fast PW (−), one row per trace type.]

Figure 4.3: Comparison of Peng and Williams' Q(λ) methods with a high exploration rate (ε = 0.5). All other parameters are as in Figure 4.2. Note that the scale of the vertical axes differs between experiment sets.
[Figure omitted: a 20 × 20 grid-world maze with walls and penalty fields.]

Figure 4.4: The large stochastic maze task. At each step the agent may choose one of four actions (N, S, E, W). Transitions have probabilities of 0.8 of succeeding, 0.08 of moving the agent laterally and 0.04 of moving in the direction opposite to that intended. Impassable walls are marked in black and penalty fields of −4 and −1 are marked in dark and light grey respectively. A reward of 100 is given for entering the top-right corner and 10 for the others. Episodes start in random states and continue until one of the four terminal corner states is entered.
[Plots omitted: cumulative reward (left column) and mean squared error (right column) against steps (0 to 200000) for WAT, Fast WAT (+) and Fast WAT (−), one row per trace type.]

Figure 4.5: Comparison of Watkins' Q(λ), Fast Watkins' Q(λ)− and Revised Fast Watkins' Q(λ)+ in the stochastic maze task. All parameters are as in Figure 4.2 (i.e. a low exploration rate with ε = 0.1).

In addition to showing that the performan e of Fast Q()+ is similar to Q() in the mean,
we performed a more detailed test. The agents were made to learn from identi al experien e
gathered over 2000 simulation steps in the small sto hasti maze shown in Figure 4.7. At
ea h time step, the di eren e between the Q-fun tions of Q() and the up-to-date Q-
fun tions of Fast Q()+ and Fast Q() was measured. The largest di eren es at any time
during the ourse of learning are shown in Table 4.1. The di eren es for Fast Q()+ are all
in the order of m or better. The di eren es for Fast Q() are many orders of magnitude
greater.
[Plots omitted: cumulative reward (left column) and mean squared error (right column) against steps (0 to 200000) for WAT, Fast WAT (+) and Fast WAT (−), one row per trace type.]

Figure 4.6: Comparison of Watkins' Q(λ) methods with a high exploration rate (ε = 0.5). All other parameters are as in Figure 4.2.

[Figure omitted: a 4 × 3 grid-world with terminal states at (4, 3), giving +1, and (4, 2), giving −1.]

Figure 4.7: A small stochastic maze task (from [130]). Rewards of −1 and +1 are given for entering (4, 2) and (4, 3), respectively. On non-terminal transitions, r_t = −1/25.
             Fast Q(λ)−   Fast Q(λ)+
PW-acc       0.7          1.7 × 10⁻¹⁵
PW-srepl     1.3          8.8 × 10⁻¹⁶
PW-sarepl    0.3          1.7 × 10⁻¹⁵
WAT-acc      1.3          7.6 × 10⁻¹³
WAT-srepl    2.5          4.2 × 10⁻¹⁰
WAT-sarepl   0.6          2.9 × 10⁻¹¹

Table 4.1: The largest differences from the Q-function learned by original Q(λ) during the course of 2000 time steps of experience within the small maze task in Figure 4.7. The experiment parameters were ε_m = 10⁻⁹, α = 0.2, λ = 0.95 and γ = 1.0. The experience was generated by randomly selecting actions.

4.2.4 Discussion

Fast Q(λ) provides the means to implement Q(λ) at a greatly reduced computational cost that is independent of the size of the state space. As such, it makes it feasible for RL to tackle problems of greater scale. Independently developed, Pendrith and Ryan's P-Trace and C-Trace algorithms work in a similar way to Fast Q(λ) but are limited to the case where λ = 1 [104, 103].
Although the underlying derivation of Fast Q(λ) is correct, we have seen here that the original algorithmic description is likely to be misinterpreted and incorrectly implemented. Simplifications and clarifications were made, maintaining the algorithm's mean time complexity of O(|A|) per step. Naive implementations of Q(λ) are O(|S| · |A|) per step.
We have also seen how Fast Q(λ) can be modified to use state-action replacing traces or to be used as an exploration insensitive learning method, and reported upon the merits of these modifications. In particular, in the experiments conducted here, the exploration insensitive versions provided similar or better performance in terms of the collected reward, but achieved uniformly better performance in terms of Q-function error. This was found with both high and low amounts of exploration.

4.3 Backwards Replay

In [72] Lin introduced experience replay which, like eligibility trace and (forward-view) λ-return methods, allows a single experience to be used to adjust the values of many predecessor states.
In his experiments, a human controller provides a training policy for a robot to reduce the cost of exploring the environment. This experience is recorded and then repeatedly replayed offline in order to learn a Q-function. The Q-function was represented by a multi-layer neural network and a single-step Q-learning-like update rule was used to make updates. In this way, better use of a small amount of expensive real experience can be made when training the RL agent.
Backwards-Replay-Watkins-Q(λ)-update
1) z ← 0                                               Initialise return to value of terminal state
2) for each i in t_{T−1}, t_{T−2}, ..., t_0 do:
3)   z ← λ(r_{i+1} + γz) + (1 − λ)(r_{i+1} + γ max_a Q̂(s_{i+1}, a))
4)   Q̂(s_i, a_i) ← Q̂(s_i, a_i) + α_k (z − Q̂(s_i, a_i))
5)   if off-policy(s_i, a_i):                          Test for non-greedy action.
6)     z ← max_a Q̂(s_i, a)                             Truncate return estimate.

Figure 4.8: Lin's backwards replay algorithm modified for evaluating the greedy policy (as Watkins' Q(λ)). The algorithm is applied upon entering a terminal state and may be executed several times. Terminal states are assumed to have zero value (rewards for entering a terminal state may be non-zero).

The training experience has the advantage of providing the agent with a relatively good behaviour from which it may bootstrap its own policy, and it also greatly reduces the cost of exploring the state space. Note that a key difference between this and the training methods used by supervised learning is that the RL agent aims to actually improve upon the training behaviour and not simply reproduce it. Experience replay has also been successfully applied by Zhang and Dietterich to a job-shop scheduling system [177], and to mobile robot navigation [140].
When replaying the recorded experience, a great learning efficiency boost can be gained by replaying the experience in the reverse order to which it was observed. For example, if the agent observed the experience tuples (s_t, a_t, r_{t+1}), (s_{t+1}, a_{t+1}, r_{t+2}), ..., then a Q-learning update is made to Q̂(s_{t+1}, a_{t+1}) before Q̂(s_t, a_t). In this way, the return estimate used to update Q̂(s_t, a_t) may use a just-updated value of max_a Q̂(s_{t+1}, a), which itself may have just changed to include the just-updated value of max_a Q̂(s_{t+2}, a), and so on. Even if 1-step return estimates are employed in the backups, and experience is only replayed once, information about a new reward can still be propagated to many prior SAPs. Furthermore, if λ-return estimates are employed then computational efficiency gains can also be found by working backwards and employing the recursive form of the λ-return estimate (as in Equation (3.24) or (3.37)). This is illustrated in a new version of the backwards replay algorithm modified to use the same return estimate as Watkins' Q(λ) (see Figure 4.8). The algorithm is extremely simple, can provide learning speedups and also has a naturally computationally efficient implementation; it is just O(|A|) per step. It achieves its computational efficiency far more elegantly than Fast Q(λ) by directly implementing the forwards view of λ-return updates. By contrast, Fast Q(λ) performs two complex transformations on the return estimate.
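Rendered as code, the update of Figure 4.8 is little more than a reversed loop over the episode. The sketch below uses illustrative names: episode is a list of (s, a, r_next) tuples ending at a terminal state (taken to have zero value), Q maps (s, a) pairs to values, and off_policy is a predicate such as those sketched in Section 3.5.4.

```python
def backwards_replay_watkins(Q, actions, episode, alpha, gamma, lam, off_policy):
    # Replay a finished episode in reverse, carrying the recursive
    # lambda-return z and truncating it after off-policy actions.
    z = 0.0
    next_value = 0.0                       # max_a Q(s_{i+1}, a); 0 for the terminal state
    for s, a, r in reversed(episode):      # episode = [(s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_T)]
        z = lam * (r + gamma * z) + (1.0 - lam) * (r + gamma * next_value)
        Q[(s, a)] += alpha * (z - Q[(s, a)])
        if off_policy(Q, actions, s, a):   # non-greedy action: truncate the return
            z = max(Q[(s, b)] for b in actions)
        next_value = max(Q[(s, b)] for b in actions)   # just-updated value for the next (earlier) step
    return Q
```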
Figure 4.9 illustrates the advantage of using backwards replay over Q(λ) in the corridor task shown in Figure 3.5. Note here that backwards replay with λ = 0 can be as good as, or better than, Q(λ) (for any λ) where the learning rate is declined with 1/k (k(s, a) = kth backup of Q̂(s, a)). Similar results are noted by Sutton and Singh [151]. As in this example, they note that backwards replay reduces bias due to the initial value estimates in acyclic environments, eliminating it totally in cases where α = 1 at the first value updates.
[Plots omitted: learned value against state s (0 to 60) for V*, BR(0), BR(0.9), Q(0) and Q(0.9); left panel with constant α, right panel with α = 1/k.]

Figure 4.9: The Q-functions learned by backwards replay and by Q(λ) after 1 episode in the corridor task shown in Figure 3.5. Values of λ = 0, λ = 0.9 and Q̂_0 = 0 are tested. (left) Learning with a constant α = 0.8. Backwards replay improves upon the eligibility trace counterparts in both cases. This learning speed-up for backwards replay is derived solely from employing more up-to-date information. (right) Learning with α = 1/k. With any value of λ, backwards replay finds the actual return estimate, while Q(λ) finds it only if λ = 1.
However, because of its dependence on future information, it is not clear how backwards replay extends to the case of online learning in cyclic environments.

Truncated TD(λ)
In [30] Cichosz introduced the Truncated TD(λ) (TTD) algorithm to apply backwards replay online. Figure 4.10 shows how TTD can be modified to be a greedy-policy evaluating, exploration insensitive method. TTD also directly employs the λ-return due to a state or SAP by maintaining an experience buffer from which its return is computed. To keep the buffer to a reasonable length but still allow for online learning, only the last n experiences are maintained. Updates are delayed: state s_{t−n} is updated at time t when there is enough experience to make an n-step truncated λ-return estimate (as introduced in Equation 3.37). This delay in making backups can lead to the same inefficiencies in the exploration strategy suffered by purely offline learning methods. As such, TTD is sometimes referred to as semi-offline as it still allows for non-episodic learning and exploration [168]. Also, the method makes updates at a cost of O(n · |A|) per step and so it would seem there is no computational advantage to learning in this way compared to the approximate method described in Section 4.2. Thus, the primary benefit of this approach is that it directly employs the λ-return estimate in updates and is simpler than an eligibility trace method as a result. Cichosz also argues that since actual λ-return estimates are used, the method can be applied more easily to a wider range of function approximators than is possible for eligibility trace methods [31].

Replayed TD(λ)
Replayed TD(λ) is an adaptation of TTD that updates the most recent n states at each time-step using the most recent n experiences [32] (see Figure 4.11).
Truncated-Watkins-Q(λ)-update(s_{t+1})
1) z ← max_a Q̂(s_{t+1}, a)
2) was-off-policy ← false
3) for each i in t + 1, ..., t + 2 − n do:
4)   if was-off-policy:                                True when a_{i+1} was non-greedy.
5)     z ← r_i + γ max_a Q̂(s_i, a)
6)   else:
7)     z ← λ(r_i + γz) + (1 − λ)(r_i + γ max_a Q̂(s_i, a))
8)   was-off-policy ← off-policy(s_i, a_i)
9) Q̂(s_{t−n}, a_{t−n}) ← Q̂(s_{t−n}, a_{t−n}) + α_k (z − Q̂(s_{t−n}, a_{t−n}))

Figure 4.10: Cichosz' Truncated TD(λ) algorithm modified for evaluating the greedy policy. The above update is applied after every step. An experience buffer of the last n experiences needs to be maintained and the first and last n updates of an episode need special handling. These extra details are omitted from the above algorithm (see [31] for full details).

Replayed-Watkins-Q(λ)-update(s_t)
1) z ← 0                                               Initialise return to value of terminal state
2) for each i in t, ..., t − n do:
3)   z ← λ(r_{i+1} + γz) + (1 − λ)(r_{i+1} + γ max_a Q̂(s_{i+1}, a))
4)   Q̂(s_i, a_i) ← Q̂(s_i, a_i) + α_k (z − Q̂(s_i, a_i))
5)   if off-policy(s_i, a_i):                          Test for non-greedy action.
6)     z ← max_a Q̂(s_i, a)                             Truncate return estimate.

Figure 4.11: Cichosz' Replayed TD(λ) modified for evaluating the greedy policy. The above update is applied after every step.

Note that, for a SAP visited at time t, Q̂(s_t, a_t) will receive updates toward all of the following n truncated λ-return estimates: z_{t+1}^{(λ,1)}, z_{t+2}^{(λ,2)}, ..., z_{t+n}^{(λ,n)}. Clearly these return estimates are not independent: all n returns include r_{t+1}, n − 1 include r_{t+2}, and so on. As a result of updating a Q-value several times towards these similar returns, the algorithm will learn Q-values that are much more strongly biased towards the most recent experiences than other methods. In turn this could cause learning problems in highly stochastic environments (or more generally where the return estimate has high variance). There may exist ways to counteract this (for example, by reducing the learning rate). Even so, it is likely that the algorithm's aggressive use of experience outweighs these high variance problems and Cichosz reports some promising results. However, the algorithm also remains O(n|A|) per step (as TTD(λ)), and although it doesn't suffer the same delay in performing updates that could be detrimental to exploration, immediate credit for actions is propagated to no more than the last n states.
4.4 Experience Stack Reinforcement Learning

This section introduces the Experience Stack algorithm. This new method can be seen as a generalisation of Lin's offline backwards replay and also directly learns from the λ-return estimate.
To allow the algorithm to work online, backups are made in a lazy fashion; states are backed-up only when new estimates of Q-values are required (for the purposes of aiding exploration) and available given the prior experience. Specifically, this occurs when the learner finds itself in a state it has previously visited and not backed-up.
The details of the algorithm are best explained through a worked example. Consider the experience in Figure 4.12. A learning episode starts in s_t1 and the algorithm proceeds, recording all experiences until s_t3 is entered (previously visited at t2). If we continue exploring without making a backup to s_t2, we do so uninformed of the reward received between t2 + 1 and t3, perhaps going on to re-collect some negative reward in sequence X. This is the important disadvantage of an offline algorithm that we wish to avoid. To prevent this, the algorithm immediately replays (backwards) experience to update the states from s_{t3−1} to s_t2 using the λ-return truncated at s_t3. This obtains a new Q-value at s_t3 that can be used to aid exploration. Each replayed experience is discarded from memory. States visited prior to s_t2 (sequence W) are not immediately updated. Putting exploration issues aside, it is often preferable to delay backups for as long as possible with the expectation that the experience yet to come will provide better Q-values to use in updates.
At a later point (t5) the agent takes an off-policy action. When sequence Y is eventually updated, it will use a return estimate truncated at s_t5, the value of which will have been recently updated following the experience in sequence Z and beyond. This is a significant improvement over Watkins' Q(λ), which will make no immediate use of the experience collected in sequence Z in updates to Y.

[Diagram omitted: a trajectory starting at s_t1, looping back to s_t2 (revisited as s_t3), then continuing through s_t4 to s_t5, divided into experience sequences W, X, Y and Z.]

Figure 4.12: A sequence of experiences. s_t2 is revisited at t3 and an off-policy action is taken at s_t5. States in sequence X (including s_t3) will be updated before those in sequence W, Y or Z.
4.4.1 The Experience Stack

The algorithm maintains a stack of unreplayed experience sequences, es = ⟨σ_1, σ_2, ..., σ_i⟩, ordered from the earliest sequence, σ_1, to the most recent, σ_i, from the bottom to the top of the stack (see Figure 4.13). Each experience sequence consists of another stack of temporally successive state-action-reward triples,

    σ_j = ⟨(s_t, a_t, r_{t+1}), (s_{t+1}, a_{t+1}, r_{t+2}), ..., (s_{t+k}, a_{t+k}, r_{t+k+1})⟩.

It is always the case that the earliest state in σ_j was observed as a successor to the most recent SAP in σ_{j−1}. Performing a push operation on an experience sequence records an experience, and pop operations are used when replaying experience.
The ES-Watkins-replay procedure, shown in Figure 4.14, is used to replay experience such that a new Q-value estimate at s_stop is obtained. The value of s' provides the return correction for the most recent SAP in the stack. s' must be the successor of the SAP found at top(top(es)) (i.e. the most recent SAP in the stack).
A counter, B(s), records the number of times s appears in the experience stack in order to determine how many backups to s_stop that experience replay can provide, without having to search through the recorded experience.
How experience is recorded and replayed is determined by the ES-Watkins-update procedure. Like Watkins' Q(λ), it ensures that ES-Watkins-replay uses λ-return estimates that are truncated at the point where an off-policy action is taken. Figure 4.13 shows the state of the stack after the experience described in Figure 4.12. It contains the experience sequences W, Y and Z from bottom to top (X has already been updated and removed). The ends of each experience sequence define when return truncations occur. For example, due to the exploratory action at t5, s_t5 starts a new experience sequence. Thus, the backup to s_t4 will use only r_t5 + γ max_a Q̂(s_t5, a), but Q̂(s_t5, ·) will be up-to-date.
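One way to realise this bookkeeping is sketched below (illustrative names, not the thesis' implementation): the stack of sequences is a list of lists, and B counts how many times each state currently appears in it.

```python
from collections import defaultdict

class ExperienceStack:
    """Stack of experience sequences; each sequence is itself a stack of
    (s, a, r) triples, most recent on top."""
    def __init__(self):
        self.sequences = []            # bottom ... top, e.g. [W, Y, Z]
        self.B = defaultdict(int)      # B[s]: pending backups for state s

    def record(self, s, a, r, new_sequence):
        # Start a new sequence after off-policy actions or a replay,
        # otherwise extend the most recent sequence on top of the stack.
        if new_sequence or not self.sequences:
            self.sequences.append([])
        self.sequences[-1].append((s, a, r))
        self.B[s] += 1
```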
[Diagram omitted: the experience stack of Figure 4.12, shown bottom to top as
    W = σ_1: (s_t1, a_t1, r_{t1+1}), (s_{t1+1}, a_{t1+1}, r_{t1+2}), ..., (s_{t2−1}, a_{t2−1}, r_t2)
    Y = σ_2: (s_t3, a_t3, r_{t3+1}), ..., (s_t4, a_t4, r_t5)
    Z = σ_3: (s_t5, a_t5, r_{t5+1}), ..., (s_{t6−1}, a_{t6−1}, r_t6)]

Figure 4.13: The state of the experience stack after the experience in Figure 4.12. The end of each row (or experience sequence) determines where return truncations occur. The rightmost states receive 1-step Q-learning backups.

Bias Prevention  Why doesn't sequence Y simply extend sequence W in Figure 4.13? (That is, why is the return truncated at the end of sequence W?) There is no requirement that the return estimate used to back up s_{t2−1} involve the actual observed return immediately
following t2 − 1. Generally, if s_{t+k} = s_t, then the return including and following r_{t+k} is just as suitable. That is, if,

    E[ r_t + Σ_{i=1}^{∞} γ^i r_{t+i} ] = E[ r_t + Σ_{i=1}^{∞} γ^i r_{t+i+k} ]   (4.11)

holds where s_t = s_{t+k}, then,

    r_t + Σ_{i=1}^{∞} γ^i r_{t+i+k}                                            (4.12)

is clearly a suitable estimate of the return following s_{t−1}, a_{t−1}. Similar arguments apply for truncated n-step and λ-returns.
ES-Watkins-replay(s_stop, s')
1) while not empty(es):
2)   z ← max_a Q̂(s', a)                        Find initial return correction
3)   σ ← pop(es)                               Get most recent experience sequence
4)   while not empty(σ):
5)     ⟨s, a, r⟩ ← pop(σ)                      Get most recent unreplayed experience
6)     z ← λ(r + γz) + (1 − λ)(r + γ max_a Q̂(s', a))
7)     Q̂(s, a) ← Q̂(s, a) + α_k (z − Q̂(s, a))
8)     B(s) ← B(s) − 1                         Decrement pending backups counter for s
9)     if s = s_stop and B(s) = 0:             Have performed required backup?
10)      if not empty(σ):
11)        push(es, σ)                         Return unreplayed experiences to stack
12)      return
13)    s' ← s                                  New Q̂(s, a) is now used in next backup

ES-Watkins-update(s_t, a_t, r_{t+1}, s_{t+1})
1) if off-policy(s_t, a_t):                    Was last action non-greedy?
2)   add-as-first ← true                       Truncates return on off-policy actions
3) if empty(es) or add-as-first:               Record new experience . . .
4)   σ ← create-stack()                        . . . in new sequence
5) else
6)   σ ← pop(es)                               . . . at end of most recent sequence
7) push(σ, ⟨s_t, a_t, r_{t+1}⟩)
8) push(es, σ)
9) add-as-first ← false
10) B(s_t) ← B(s_t) + 1                        Increment pending backups counter for s_t
11) if B(s_{t+1}) ≥ B_max or terminal(s_{t+1}):
12)   ES-Watkins-replay(s_{t+1}, s_{t+1})      Replay experience to obtain a new Q-value at s_{t+1}
13)   add-as-first ← true                      Truncates return to prevent biasing

Figure 4.14: The Experience Stack algorithm for off-policy evaluation of the greedy policy. A version that doesn't truncate return after off-policy actions can be obtained by omitting lines 1 and 2 in ES-Watkins-update. This is later referred to as ES-PW after Peng and Williams' Q(λ). The name add-as-first is for a global variable. It should be set to false at the start of each episode.
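A typical per-step driver for the procedures of Figure 4.14 might be organised as in the following sketch. The names env.step, env.reset and select_action are hypothetical; the agent object is assumed to hold the Q-function, the experience stack, the B counters and B_max, and to expose ES-Watkins-update as a method.

```python
def run_episode(agent, env):
    # The agent holds Q, the experience stack, the B counters and B_max.
    s = env.reset()
    while True:
        a = agent.select_action(s)                 # e.g. epsilon-greedy on Q
        s_next, r, terminal = env.step(a)
        agent.es_watkins_update(s, a, r, s_next)   # records the step; replays
                                                   # when B(s_next) >= B_max
        if terminal:                               # entering a terminal state
            return                                 # flushes the whole stack
        s = s_next
```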
However, Condition 4.11 will usually not hold when applying the experience stack algorithm. For example, suppose that sequence X includes some unusually negative rewards. If the backups to the states in W were made using a return excluding the rewards in sequence X, then the Q-values in sequence W would become biased (by being over-optimistic). In order to prevent this biasing, the value of the state at which an experience replay ends is used to provide an estimate of the future return to all prior states in the stack. In the example, s_t2 must be updated to include the return in sequences Y and X. The backups to states prior to s_t2 should use a return truncated at s_t2. The algorithm achieves this simply by starting a new experience sequence at the top of the stack to indicate that a return truncation is required (step 13 of ES-Watkins-update).

Choice of B_max  The parameter B_max varies how many times a state may be revisited before a backup is made. Its choice is problem dependent. With B_max = 1 backups are made on every revisit. If revisits occur often and at short intervals, then experience will be frequently replayed, which also causes the return estimate to be frequently truncated; an effect which is similar to lowering λ toward 0. This is in addition to the effect of truncations that occur after taking off-policy actions.
However, with higher values of B_max, the algorithm behaves more like an offline learning method and exploration can benefit less frequently from up-to-date Q-values.

Flushing the Stack  Entering a terminal state, s_term, automatically causes the entire remaining contents of the experience stack to be replayed, since s_stop = s_term and s_term cannot occur in the experience stack (N.B. B(s_term) = 0 at all times). Otherwise, the stack can be flushed at any time by calling ES-Watkins-replay(s_now, s_term).

Computational Costs  Since each state may appear in the experience stack no more than B_max times, the worst-case space-complexity of maintaining the experience stack is O(|S| · B_max). The total time-complexity is O(|A|) per experience when averaged over the entire lifetime of the agent (as Fast Q(λ)). The actual time-cost per timestep may vary greatly between steps.

Scope  This new technique can easily be adapted to use the return estimates employed by many other methods. For example, an analogue of Naive Watkins' Q(λ) can be made by omitting lines 1) and 2) from ES-Watkins-update. An analogue of TD(λ) can be made by replacing all occurrences of Q̂(x, y) with V̂(x), replacing step 6) with,

6) z ← λ(r + γz) + (1 − λ)(r + γ V̂(s'))

in ES-Watkins-replay, and omitting lines 1) and 2) from ES-Watkins-update. Analogues of SARSA(λ) and the importance sampling methods in [111] are equally easy to derive.
Special Cases  With B_max = ∞ the algorithm is fully offline, identical to Lin's backwards replay and also only suitable for use in episodic tasks. In this case, if λ = 0 it exactly implements 1-step Q-learning with backwards replay as used in [72, 76]. As noted in Section 3.4.8, acyclic tasks are special in that (non-backward-replaying) λ-return estimating methods and batch eligibility trace methods are equivalent. With λ = 1 the experience stack method is also a member of this equivalence class.
However, in (terminating) cyclic tasks with λ = 1, with sufficiently high B_max to lead to purely offline learning, and where the learning rate is declined with 1/k(s, a), the method implements an every-visit Monte-Carlo algorithm. A first-visit method could be derived by skipping over backups to Q(s, a) where B(s, a) ≠ 1.²

Frequent Revisits  In some tasks, such as problems with state aliasing, a single state may be revisited for several consecutive steps. To prevent the method from using mainly 1-step returns, B(s, a) could be incremented only upon leaving a state. This would require that the same action be taken until the state is left, although this is often a benefit while learning with state-aliasing (as we will see in Chapter 7).
In general, there may be better ways to affect when experience is replayed than with the B_max parameter. If the purpose of making backups online is to aid exploration, then a better method might be to try to estimate the benefit to exploration of replaying experience when deciding whether to update a state.

A Note About Convergence  The open question remains whether this algorithm is guaranteed to converge upon the optimal Q-function. Intuitively, it should, and under the same conditions as 1-step Q-learning since, in a sense, the algorithms differ only slightly. Both methods approach Q* by estimating the expected return available under the greedy policy. For general MDPs, the expected update made by both methods appears to be a fixed point in Q̂ only where Q̂ = Q*.
However, the convergence proof of 1-step Q-learning follows from establishing a form of equivalence to 1-step value iteration [59]. This relationship does not appear to directly follow for multi-step return estimates. Moreover, no convergence proof has been established for any control method with λ > 0 [145, 137].

Use with Function Approximators  For an RL algorithm to be of widespread practical use it must employ some form of generalisation in cases where the state-action space is large or non-discrete (e.g. continuous). Typically, this is achieved using a function approximator to store the Q-function. Although it has not been tested, it is clear that the experience stack method can be made to work with function approximators, as other forward-view implementations already exist [31]. A problem that might be encountered in an implementation is deciding when to replay experience since, unlike the table-lookup or state-aggregation cases, revisits to precisely the same state rarely occur. Several potential solutions to this exist and it remains a subject of future research.

² Similar forward-view analogues of replace trace methods for Q-functions are also discussed by Cichosz [33].

4.5 Experimental Results

In this section versions of the experience stack algorithm are compared against their Fast-Q(λ) counterparts. Fast-Q(λ) was chosen as it is in the same computational class as the experience stack algorithm and so allows a thorough comparison with the various well-studied eligibility trace methods. Explicit comparisons with standard backwards replay are not made, but high values of Bmax provide an approximate comparison.
A comparison with Replayed TTD(λ) was not performed. This algorithm is computationally more expensive.³ Also, a fair comparison in this case would allow the experience stack method to also replay the same experiences several times before removing them from the experience stack.
The algorithms were tested using the large maze shown in Figure 4.4 (p. 58). This task was chosen as it requires online learning to achieve good performance. Offline algorithms that cannot improve their exploration strategies online are expected to find the goal rarely in the early stages of learning.
For Watkins' Q(λ) and PW-Q(λ) three different eligibility trace styles are examined and compared against their on-policy or off-policy experience stack counterparts. ES-NWAT was used for comparison against PW-Q(λ) since PW-Q(λ) has no obvious forward-view analogue. Some comparisons were made against Naive Watkins' Q(λ), but this performed worse than PW-Q(λ) in all cases. These results are omitted.
The learning rate is defined as α_k(s,a) = 1/k(s,a)^β throughout, with β = 0.5 in all cases except in Figure 4.23, which compares different values of β. For the eligibility trace methods, k(s,a) records the number of times a has been taken in s. For the experience stack method k(s,a) records the number of updates to Q̂(s,a). These different schemes are needed to provide a fair comparison and simply reflect the different times at which the algorithms apply return estimates in updates. In both cases, k(s,a) = 1 at the first update, and α_k declines on average at the same rate for each method.
Figures 4.15 to 4.22 below measure the performance of the algorithms along four varying parameter settings: the exploration rate (ε), λ, Bmax and the initial Q-function. The performance measures are the total cumulative reward collected by the agent after the 200000 time steps and the final average mean squared error in its learned value function. Throughout learning the ε-greedy exploration strategy was employed and the results are broadly divided into two sections: high exploration levels (ε = 0.5) and high exploitation levels (ε = 0.1). The difference between Watkins' Q(λ), Naive Watkins' Q(λ) and PW-Q(λ) is expected to be small for nearly greedy policies (where ε is low).
Table 4.3 lists the abbreviations used throughout. Tables 4.4 and 4.6 provide an index to the experimental results in this section.
³ Computational cost was a big issue when running these experiments. Each 200000 step trial took approximately 10 minutes to complete on a Sun Ultra 5 and each graph point is the average of 15 trials. A conservative estimate of the total execution time consumed to produce Figures 4.15 to 4.22 is 2050 machine hours, or 12 machine weeks. In practice the experiment was made feasible by distributing the load over a cluster of 60 workstations, reducing the real-time cost to approximately 34 hours.
The Fast Q(λ) machine precision parameter was ε_m = 10^-7 in all cases. The off-policy tolerance parameter was 10^-4 throughout.
Attention is drawn to the ways in which the algorithms are affected by different parameters in the following sections.

The Effects of Q0.   The most surprising result is that the initial Q-function, Q0, has such a counter-intuitive effect on performance. The maze task has an optimal value-function, V*, whose mean is approximately 68 and has maximum and minimum values of 99.5 and 45.6 respectively. The standard rule of thumb when using ε-greedy (and many other exploration strategies) is to initialise the Q-function optimistically to encourage the agent to take untried actions, or actions that lead to untried actions [150]. Yet overall, the performance was generally worse with Q0 = 100 than when starting with a Q-function that has a higher initial error given by a pessimistic bias (Figures 4.15 and 4.16 show Q0 varying over a larger range than the other graphs). Subjectively, the best all-round performance in final cumulative reward and MSE was obtained with Q0 = 50 for all algorithms. It is possible that the reason for this is that the lower initial Q-values caused the agent to explore the environment less thoroughly and settle upon a more exploiting policy more quickly.
Unlike the eligibility trace methods, the experience stack methods also still performed well with very low initial Q-functions (compare the cumulative reward collected with Q0 = 0 on all graphs).
Section 4.7 presents a likely explanation as to why optimistic initial Q-functions can be harmful to learning.
Figure 4.24 shows an overlay of the different methods with a pessimistic initial Q-function. The experience stack methods outperform the eligibility trace methods in almost all cases except with high λ. The difference between the methods is even larger with lower Q0.

The Effects of λ.   For Q0 < 100 the experience stack methods performed better or no worse than their eligibility trace counterparts across the majority of parameter settings. In particular they were less sensitive to λ and achieved better performance with low λ as a result. A discussion of the reasons for this is given in Section 4.6.
With Q0 = 100 the experience stack methods were most sensitive to λ and performed worse than their eligibility trace counterparts in many instances. The experience stack methods were also more sensitive to Bmax at this setting.

Abbreviation    Description

Fast-WAT-acc    Fast Watkins' Q(λ) with accumulating traces. The eligibility trace is zeroed following non-greedy actions, making this an exploration insensitive method. Alternative suffixes of -srepl and -sarepl respectively denote state-replace and state-action replace trace styles. Figure 4.1 shows the implemented algorithm. The various eligibility trace styles are introduced as equations (3.29), (3.32) and (3.31).
ES-WAT-3        The Experience Stack Algorithm in Figure 4.14 with Bmax = 3. Backups are made at the third state revisit. Non-greedy actions truncate the return estimate and so this is an exploration insensitive method (as Watkins' Q(λ)).
Fast-PW-srepl   Fast Peng and Williams' Q(λ) with state-replacing traces (see Figure 4.1). This is an exploration sensitive method.
ES-NWAT-2       The Experience Stack Algorithm in Figure 4.14 with Bmax = 2. Steps 1 and 2 of the ES-Watkins-Update procedure are omitted so that non-greedy actions do not truncate the return estimate. This is an exploration sensitive method similar to Peng and Williams' Q(λ) and Naive Q(λ).

Table 4.3: Guide to the tested algorithms and abbreviations used.
Figure        Algorithm   ε     Exploration Level
Figure 4.15   ES-WAT      0.5   High
Figure 4.16   Fast-WAT    0.5   High
Figure 4.17   ES-NWAT     0.5   High
Figure 4.18   Fast-PW     0.5   High
Figure 4.19   ES-WAT      0.1   Low
Figure 4.20   Fast-WAT    0.1   Low
Figure 4.21   ES-NWAT     0.1   Low
Figure 4.22   Fast-PW     0.1   Low

Table 4.4: Experimental results showing varying λ and Q0.

Figure        Description
Figure 4.23   Effects of differing learning rate schedules.
Figure 4.24   Overlays of results with the best initial Q-function (Q0 = 50).
Figure 4.25   During-learning performance with optimised Q0 and λ.

Table 4.6: Other results.


[Plots of cumulative reward (left) and mean squared error (right) against λ for ES-WAT-1, ES-WAT-2, ES-WAT-3, ES-WAT-5, ES-WAT-10 and ES-WAT-50, with one row of panels for each of Q0 = 100, 75, 50, 25, 0 and −25.]

Figure 4.15: Comparison of the effects of λ, Bmax and the initial Q-values on ES-Watkins with a high exploration rate (ε = 0.5). Results for the end of learning after 200000 steps in the Maze task. Performance becomes degraded at Q0 = 100, though less so with higher λ. Performance is less sensitive to λ compared to Watkins' Q(λ) (most plots are more horizontal than in Figure 4.16).

[Plots of cumulative reward (left) and mean squared error (right) against λ for FastWAT-srepl, FastWAT-sarepl and FastWAT-acc, with one row of panels for each of Q0 = 100, 75, 50, 25, 0 and −25.]

Figure 4.16: Watkins' Q(λ) with a high exploration rate (ε = 0.5) after 200000 steps in the Maze task. As for ES-Watkins, performance also becomes degraded at Q0 = 100. Performance is more sensitive to λ and also degrades more with low Q0 than ES-Watkins.
[Plots of cumulative reward (left) and mean squared error (right) against λ for ES-NWAT-1, ES-NWAT-2, ES-NWAT-3, ES-NWAT-5, ES-NWAT-10 and ES-NWAT-50, with panels for Q0 = 100, 50 and 0.]

Figure 4.17: Comparison of the effects of λ and Bmax on ES-NWAT in the Maze task with a high exploration rate (ε = 0.5).
[Plots of cumulative reward (left) and mean squared error (right) against λ for FastPW-srepl, FastPW-sarepl and FastPW-acc, with panels for Q0 = 100, 50 and 0.]

Figure 4.18: Comparison of the effects of λ, the trace type and the initial Q-values on Peng and Williams' Q(λ) in the Maze task with a high exploration rate (ε = 0.5).

[Plots of cumulative reward (left) and mean squared error (right) against λ for ES-WAT-1, ES-WAT-2, ES-WAT-3, ES-WAT-5, ES-WAT-10 and ES-WAT-50, with panels for Q0 = 100, 50 and 0.]

Figure 4.19: Comparison of the effects of λ, Bmax and the initial Q-values on ES-Watkins in the Maze task with a low exploration rate (ε = 0.1).
[Plots of cumulative reward (left) and mean squared error (right) against λ for FastWAT-srepl, FastWAT-sarepl and FastWAT-acc, with panels for Q0 = 100, 50 and 0.]

Figure 4.20: Comparison of the effects of λ, trace type and the initial Q-values on Watkins' Q(λ) in the Maze task with a low exploration rate (ε = 0.1).
[Plots of cumulative reward (left) and mean squared error (right) against λ for ES-NWAT-1, ES-NWAT-2, ES-NWAT-3, ES-NWAT-5, ES-NWAT-10 and ES-NWAT-50, with panels for Q0 = 100, 50 and 0.]

Figure 4.21: Comparison of the effects of λ, Bmax and the initial Q-values on ES-NWAT in the Maze task with a low exploration rate (ε = 0.1).
[Plots of cumulative reward (left) and mean squared error (right) against λ for FastPW-srepl, FastPW-sarepl and FastPW-acc, with panels for Q0 = 100, 50 and 0.]

Figure 4.22: Comparison of the effects of λ, trace type and the initial Q-values on Peng and Williams' Q(λ) in the Maze task with a low exploration rate (ε = 0.1).

[Plots of cumulative reward (left) and mean squared error (right) against the learning rate decay exponent β for FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ES-WAT-1, ES-WAT-10 and ES-WAT-50. Top row: Q0 = 50, λ = 0.3. Bottom row: Q0 = 100, λ = 0.9.]

Figure 4.23: Comparison of the effects of the learning rate schedule on Fast-WAT and ES-WAT. The top row presents a favourable setting for ES-WAT. The bottom row presents unfavourable settings. ε = 0.5 in both cases. Changes in β had little effect on the relative performance of the algorithms. Results were similar for ES-NWAT and PW-Q(λ).

The Effects of Bmax.   The new Bmax parameter appeared to be relatively easy to tune in the maze task. With Q0 < 100 most settings of Bmax and λ provided improvements over the original eligibility trace algorithms. In general, Bmax caused the greatest spread in performance when Q0 was either very high or very low. For example, Bmax = 50 generally resulted in the poorest relative performance where Q0 = 100 and the best performance with pessimistic values (e.g. Q0 = 0). Intermediate values (Q0 = 50) gave the least sensitivity to Bmax as the high values of Bmax switch from providing relatively good to relatively poor performance.
With Q0 = 100, Bmax = 1 provided a sharp drop in performance compared to slightly higher values (e.g. Bmax = 2 or Bmax = 3). A possible reason for this is that some states may be revisited extremely soon regardless of the exploration strategy simply because the environment is stochastic. As a result there is often little benefit to the exploration strategy from learning about these revisits. However, the likelihood of a state being quickly revisited by chance two, three or more times falls extremely rapidly with the increasing number of revisits. In such cases it is likely that revisits occur as the result of poor exploration, in which case the exploration strategy may be improved as a result of making an immediate backup. Curiously, however, this phenomenon is not seen where Q0 < 100.

The Effects of Exploration.   As expected, with low exploration levels Watkins' methods performed very similarly to Peng and Williams' methods (compare Figure 4.19 with 4.21 and Figure 4.20 with 4.22).
However, the main motivation for developing the experience stack algorithm was to allow for efficient credit assignment and accurate prediction, while still allowing exploratory actions to be taken. With high exploration levels both of the non-off-policy methods still generally outperformed Watkins' methods in terms of cumulative reward collected, but performed worse in terms of their final MSE. This is the effect of trading longer, untruncated return estimates (which allow temporal difference errors to affect more prior Q-values) for the theoretical soundness of the algorithms (by using rewards following off-policy actions in the return estimate).
But the best overall improvements in the entire experiment were found by ES-WAT at Q0 = 50. At this setting the algorithm outperformed (or performed no worse than) ES-NWAT, FastWAT and FastPW in terms of both cumulative reward and error across the entire range of λ. This is a significant result as it demonstrates that Watkins' Q(λ) has been improved upon to such an extent that it can outperform methods that don't truncate the return upon taking exploratory actions.

The Effects of the Learning Rate.   In Figures 4.15 to 4.22 the learning rate was declined with each backup as in Equation 3.8 with β = 0.5.⁴ By chance, this appeared to be a good choice for all of the methods tested. Best overall performance could be found in most settings with β between 0.3 and 0.5 (see Figure 4.23).
In work by Singh and Sutton [139], the best choice of learning rate has been shown to vary with λ. This was also found to be the case here. However, unlike in their experiments, here the learning rate schedule had little effect on the relative performances of the algorithms. Also, the work by Singh and Sutton aimed to compare replace and accumulate trace methods using a fixed learning rate. Several experiments were conducted here using a fixed learning rate. This also had little effect on the relative performances, with the exception that combinations of high λ and α caused the accumulate trace methods to behave very poorly in most instances. Section 3.4.9 in the previous chapter suggests why.

Optimised Parameters.   Figure 4.25 compares the different methods with optimised Q0, λ and Bmax. In terms of cumulative reward performance, there is little difference between the methods. However, the experience stack methods are markedly more rapid at error reduction.

⁴ High values of β provide the fastest declining learning rate.

[Plots of cumulative reward (top row) and mean squared error (bottom row) against λ. Left column (Off-Policy, ε = 0.5): FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ES-WAT-1, ES-WAT-3, ES-WAT-10 and ES-WAT-50. Right column (Non-Off-Policy, ε = 0.1): FastPW-srepl, FastPW-sarepl, FastPW-acc, ES-NWAT-1, ES-NWAT-3, ES-NWAT-10 and ES-NWAT-50.]

Figure 4.24: Overlay of results at the end of learning after 200000 steps in the Maze task. Q0 = 50, β = 0.5.

[Plots of cumulative reward (top row) and mean squared error (bottom row) against the number of steps (0 to 200000). Left column (Off-Policy, ε = 0.5): FastWAT-srepl, FastWAT-sarepl, FastWAT-acc and ES-WAT-3. Right column (Non-Off-Policy, ε = 0.1): FastPW-srepl, FastPW-sarepl, FastPW-acc and ES-NWAT-1.]

Figure 4.25: Comparison of results during learning in the Maze task with optimised values of Q0 and λ. The experience stack algorithms provided little improvement in the reward collected but gave far faster error reduction in the Q-function.

4.6 The Effects of λ on the Experience Stack Method

Why is the experience stack method often less sensitive to λ than the eligibility trace methods?
The ES methods have two separate and complementary mechanisms for efficiently propagating credit to many prior states: λ-return estimates and backwards replay. The choice of λ determines the extent to which each mechanism is used.
When the value of λ is very low the λ-return estimate weighs observed rewards in the distant future very little (see Equation 3.37) and the ability to propagate credit to many states can come mainly only from backwards replay. However, with very high λ the return estimate employs mainly only observed rewards and very little of the stored Q-values. As a result backwards replay makes little use of, and derives little advantage from using, the newly updated values of successor states.
It might appear that there is little or even no learning benefit to using backwards replay instead of eligibility traces with very high values of λ since, at least superficially, the algorithms appear to be learning in a similar way (i.e. using mainly the λ-return mechanism). In fact one might expect the experience stack methods to actually perform worse in this instance since, when states are revisited, sections of the experience history are pruned from memory and are no longer backed-up as they might be by an eligibility trace method.
However, as explained in Section 4.4, replaying experience requires that additional truncations in return be made. For the eligibility trace algorithms, frequently truncating return (zeroing the eligibility trace) will negate much of the benefit of using the λ-return estimate since the return looks to observed rewards only a few states into the future. However, return truncations may actually aid the backwards replay mechanism since they mean that greater use is made of the recently updated Q-function.
Furthermore, with λ = 0 and Bmax = 1 it is reasonable to expect the experience stack methods to improve upon or do no worse than 1-step Q-learning in all cases. Given the same experiences, the algorithm makes the same updates as Q-learning but in an order that is expected to employ a more recently informed value function. If it could be shown that Q-learning monotonically reduces the expected error in the Q-function with each backup, then a simple proof of this improvement would follow. However, in general, in the initial stages of learning the Q-function error may actually increase (this was seen in Figure 4.25). Faster learning in this case may actually result in this initial error growing more rapidly.⁵ Notably though, the experience stack algorithms improved upon or performed no worse than the original algorithms in all of the above experiments where λ was low (λ = 0.1) and Bmax = 1. Performance was occasionally worse with high Bmax. Presumably this was the result of poor exploration caused through making infrequent updates to the Q-function.
For similar reasons, it is reasonable to expect (but it is not proven) that the experience stack methods will improve upon or do no worse than the eligibility trace methods in acyclic environments for all values of λ. In this case the accumulate, replace and state-replace trace update methods are all equivalent and the eligibility trace methods are known to be exactly equivalent to applying a forward view method in which the Q-function is fixed within the episode. Given the same experiences, the experience stack methods therefore make the same updates as the eligibility trace methods except that each update may be based upon a more informed Q-function due to the backwards replay element. This is not an improvement new to the experience stack methods; the same applies for backwards replay when applied at the end of an episode. However, in this case the difficult issue of how to deal with state revisits does not occur. This is what the experience stack method solves.
Finally, note that in the test environment, the eligibility trace methods performed best with the highest values of λ. Therefore, it is reasonable to anticipate larger differences in performance between the two approaches in environments where lower values of λ are best for eligibility trace methods. In this case, backwards replay methods look likely to provide stronger improvements since the learned Q-values are more greatly utilised.

⁵ This can also be seen where λ > 0 in Figure 4.25.

4.7 Initial Bias and the max Operator.

All of the algorithms tested in Section 4.5 appeared to work better with non-optimistic initial Q-values. This may seem a counter-intuitive result since optimistic initial Q-values are generally thought to work well with ε-greedy policies [150]. An obvious explanation for this is that higher initial Q-values could have caused the agent to explore the environment more and for an unnecessarily long period, while with low initial Q-values the problems of local minima were avoided through using a semi-random exploration policy.
This section explores an alternative explanation: that, independently of the effects of exploration, optimistic Q-values can make learning difficult. More specifically, RL algorithms that update their value estimates based upon a return estimate corrected with max_a Q(s,a) find it more difficult to overcome their initial biases if these biases are optimistic.
To see that this is so, consider the example in Figure 4.26. Assume that all transitions yield a reward of 0. Some learning algorithm is applied that adjusts Q(s1,a1) towards E[max_a Q(s2,a)] (for simplicity assume γ = 1). If all Q-values are initialised optimistically, to 10 for example, then the Q-values of all actions in s2 must be readjusted (i.e. lowered towards zero) before Q(s1,a1) may be lowered. However, if the Q-values are initialised pessimistically by the same amount (to −10), then max_a Q(s2,a) is raised when the value of a single action in s2 is raised. In turn, Q(s1,a1) may then also be raised.
In general, it is clear that it is easier for RL algorithms employing max_a Q(s,a) in their return estimates to raise their Q-value predictions than to lower them. In effect, the max operator causes a resistance to change in value updates that can inhibit learning. More intuitively, note that if the initial Q-function is optimistic, then the agent cannot strengthen good actions; it can only weaken poor ones.
It is also clear that the effect of this is further compounded if: i) the Q-values in s2 are themselves based upon the over-optimistic values of their successors, ii) states have many actions available, and so many Q-values to adjust before max_a Q(s,a) may change, and iii) γ is high and so state-values and Q-values are very dependent upon their successors' values.

[Diagram: a state s1 with a single action a1 leading to a state s2, which has k actions a1, ..., ak; every Q-value shown (Q(s1,a1), Q(s2,a1), ..., Q(s2,ak)) is initialised to 10.]

Figure 4.26: A simple process in which optimistic initial Q-values slow learning. Rewards are zero on all transitions.
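The asymmetry is easy to check numerically. The sketch below (an illustration only; the two-state process, action count and step sizes are simplifications of Figure 4.26 rather than the experiments reported later) applies one Q-learning backup to a single action in s2 followed by one backup to Q(s1,a1):

    # Optimistic vs. pessimistic initialisation on the process of Figure 4.26
    # (all rewards are zero, gamma = 1, so the true Q-values are 0 everywhere).
    def backup(Q, s, a, r, s_next, alpha=0.5, gamma=1.0):
        target = r + gamma * max(Q[s_next].values()) if s_next else r
        Q[s][a] += alpha * (target - Q[s][a])

    def run(q_init, k=5):
        Q = {'s1': {'a1': q_init}, 's2': {f'a{i}': q_init for i in range(k)}}
        backup(Q, 's2', 'a0', r=0.0, s_next=None)   # correct one action in s2
        backup(Q, 's1', 'a1', r=0.0, s_next='s2')   # try to improve Q(s1, a1)
        return Q['s1']['a1']

    print(run(+10.0))   # optimistic: prints 10.0 -- max over s2 is still +10, so no change
    print(run(-10.0))   # pessimistic: prints -7.5 -- one backup already raises max_a Q(s2, a)

With the optimistic table every one of the k actions in s2 must be backed up before Q(s1,a1) can move, whereas a single backup suffices in the pessimistic case.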
Although this idea is simple, it does not, to the best of my knowledge, appear in the existing RL literature.⁶ The most closely related work appears to be that of Thrun and Schwartz in [157]. They note that the max operator can cause a systematic overestimation of Q-values when look-up table representations are replaced by function approximators.
Examples of methods that use max_a Q(s,a) in their return estimates are: value-iteration, prioritised sweeping, Q-learning, R-learning [132], Watkins' Q(λ) and Peng and Williams' Q(λ). Similar problems are also expected with "interval estimation" methods for determining error bounds in value estimates [62].⁷ Methods which are not expected to suffer in this way include TD(λ), SARSA(λ) and policy iteration (i.e. methods that evaluate fixed policies, not greedy ones).
4.7.1 Empirical Demonstration

Value-Iteration
The effect of initial bias on value-iteration was evaluated on several different processes with known models: the 2-way corridor of Figure 4.28, the small maze in Figure 4.7 and the large maze of Figure 4.4. In each experiment an initial value function, V0, was chosen with either an optimistic bias, V0^{A+}, or the same amount of pessimistic bias, V0^{A−}:

    V0^{A+}(s) = V*(s) + bias,                  (4.13)
    V0^{A−}(s) = V*(s) − bias,                  (4.14)

where "bias" is a positive number and V* is the known solution. This ensures that both the optimistic and pessimistic methods start the same maximum-norm distance from the desired value function. This setup is atypical since V* is usually not known in advance and it also provides value-iteration with some information about the initial policy. However, with knowledge of the reward function it is often possible to estimate the maximum and minimum values of V*. A second set of starting conditions was also tested:

    V0^{B+}(s) = max_{s'} V*(s') + bias,        (4.15)
    V0^{B−}(s) = min_{s'} V*(s') − bias.        (4.16)

Figure 4.27 compares these initial biasing methods.
Table 4.7 shows the number of applications of update 2.21 to all states in the process required by value-iteration until V̂ has converged upon V* to within some small degree of error. bias = 50 in all cases. In all tasks, the pessimistic initial bias ensured convergence in the fewest updates.
With the corridor task, in the optimistic case, the number of sweeps until termination can be made arbitrarily high by making γ sufficiently close to 1. However, if all the estimates start below their lowest true value, then the number of sweeps never exceeds the length of the corridor since, in this deterministic problem, after each sweep at least one more state leading to the goal has a correct value.

⁶ Similar problems are known to occur with applied dynamic programming algorithms. Examples are continuously updating distance vector network routing algorithms (such as the Bellman-Ford algorithm) [108]. I thank Thomas Dietterich for pointing out the relationship.
⁷ I thank Leslie Kaelbling for pointing this out.

[Diagram: the top sketch shows V0^{A+}(s) and V0^{A−}(s) as copies of V*(s) shifted up and down by the bias; the bottom sketch shows V0^{B+}(s) and V0^{B−}(s) as constant functions lying above max_s V*(s) and below min_s V*(s) respectively.]

Figure 4.27: Initial biasing methods.
[Diagram: a chain of states s1, ..., s19 with left (a_l) and right (a_r) actions in each state; the terminal to the left of s1 yields −1 and the terminal to the right of s19 yields +1.]

Figure 4.28: A deterministic 2-way corridor. On non-terminal transitions, rt = 0.


                    Initial Bias
Task              V0^{A−}   V0^{A+}   V0^{B−}   V0^{B+}
2-Way Corridor      10.4     207.1      11.5     207.5
Small Maze          18.0     241.0      17.7     241.4
Large Maze          53.9      86.2      54.6     107.5

Table 4.7: Comparison of the effects of initial value bias on the required number of value-iteration sweeps over the state-space until the error in V̂ has become insignificant (max_s |V*(s) − V̂(s)| < 0.001). Results are the average of 30 independent trials.
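The corridor result can be reproduced in outline with the sketch below (an illustration only; the chain length, discount factor and tolerance are assumptions chosen to resemble Figure 4.28 and Table 4.7, not the exact experimental setup):

    # Value iteration on a deterministic corridor: states 0..n-1, a left terminal
    # worth -1, a right terminal worth +1 and zero rewards on all other transitions.
    def sweeps_to_converge(v0, n=19, gamma=0.95, tol=1e-3):
        v_true = [gamma ** (n - 1 - i) for i in range(n)]   # optimal: always move right
        V = [v0] * n
        sweeps = 0
        while max(abs(V[i] - v_true[i]) for i in range(n)) >= tol:
            for i in range(n):                              # one in-place sweep
                right = 1.0 if i == n - 1 else gamma * V[i + 1]
                left = -1.0 if i == 0 else gamma * V[i - 1]
                V[i] = max(left, right)
            sweeps += 1
        return sweeps

    print(sweeps_to_converge(+50.0))   # optimistic start: sweep count grows as gamma -> 1
    print(sweeps_to_converge(-50.0))   # pessimistic start: bounded by the corridor length

With a pessimistic start the +1 terminal fixes at least one more state per sweep, while an optimistic start must wait for the inflated values to decay through the max operator.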

Q-Learning
The effect of the initial bias on Q-learning is shown in Table 4.8. The Q-learning agents were allowed to roam the 2-way corridor and the small maze environments for 30 episodes. For the large maze, 200000 time steps were allowed. The Q-functions for the agents were initialised in a similar fashion to the value-iteration case but with an initial bias of ±5. Throughout learning, random action selection was used to ensure that the learned Q-values could not affect the agent's experience. At the end of learning, the mean squared error in the learned value-function, max_a Q̂(s,a), was measured. In all cases, the pessimistic initial bias provided the best performance.

                    Initial Bias
Task              Q0^{A−}   Q0^{A+}   Q0^{B−}   Q0^{B+}
2-Way Corridor       1.0      20.0      19.3      20.6
Small Maze           1.2      22.1      18.9      24.9
Large Maze           3.1      12.4       7.4     323.0

Table 4.8: Comparison of the effects of initial Q-value bias on Q-learning. Values shown are the mean squared error, Σ_s (V*(s) − max_a Q̂(s,a))² / |S|, at the end of learning. Results are the average of 100 independent trials.

4.7.2 The Need for Optimism

The previous two sections have shown how optimistic initial Q-functions can inhibit reinforcement learning methods that employ max_a Q(s,a) in their return estimates. Independently of the effects of exploration, it has been demonstrated that convergence towards the optimal Q-function can be quicker if the initial Q-values are biased non-optimistically.
However, this does not suggest that performance improvements can in general be obtained simply by making the initial Q-function less optimistic. The reason for this is that in practical RL settings agents must often manage the exploration/exploitation tradeoff.
A common feature of most successful exploration strategies is to introduce an optimistic initial bias into the Q-function and then follow a mainly exploiting strategy (i.e. mainly choose the action with the highest Q-value at each step). For example, ε-greedy exploration strategies assume optimistic Q-values for all untried state-action pairs [150, 175]. At each step the agent acts randomly with some small probability ε and chooses the greedy (i.e. exploiting) action with probability 1 − ε.
More generally, the optimistic bias is introduced and propagated in the form of exploration bonuses as follows [85, 174, 167, 130, 175, 41],

    Q(s,a) ← E[(r + b) + γ max_{a'} Q(s',a')]                  (4.17)

where the bonus, b, is a positive value that declines with the number of times a has been taken in s. The bonus should decline as less information remains to be gained about the effects of taking a in s on collecting reward. The effect of the bonuses is always to make the Q-values of actions over-optimistic until the environment is thoroughly explored. As a result, the idea that optimistic initial Q-values can actually be a hindrance to learning often comes as a counter-intuitive idea to many researchers in RL.

4.7.3 Separating Value Predictions from Optimism

Because of the need for optimism for exploration it is not clear that simply having less optimistic initial Q-functions will always help the agent to learn; this would simply reduce the amount of exploration that it does. However, better methods might be derived by separating out the optimistic bias that is introduced to encourage exploration from the actual Q-value estimates. For example, we may maintain independent Q-functions and bonus (or optimism) functions:

    Q̂(s,a) ← E[r + γ max_{a'} Q̂(s',a')],                       (4.18)
    B(s,a) ← E[(r + b) + γ max_{a'} B(s',a')],                  (4.19)

with the former being used for predictions and the latter for exploration. Regardless of the initial choice of Q̂, actions may still be chosen optimistically through careful choice of the exploration bonuses and the initial values of B. For example, the agent might act in order to maximise B(s,a), or even max(max_a Q̂(s,a), max_a B(s,a)) or Q̂(s,a) + B(s,a). The Q-function can now be initialised non-optimistically, thus allowing an accurate Q-function to be learned more quickly, as seen in Section 4.7.1.
Previous work has separated value estimates for return prediction from the values used to guide exploration (e.g. [85]). However, here we see for the first time that, through knowing how optimistic initial value functions cause inefficient learning, a better initial value function choice may be made and so allow more accurate value estimates to be achieved more quickly.
Example.   Figure 4.29 compares two algorithms that share identical exploration strategies. Q-opt is a regular Q-learning agent that explores using the ε-greedy strategy with ε = 0.01 and Q0 = 100. Q-pess makes the same Q-learning backups as Q-opt, and also,

    B(s_t, a_t) ← B(s_t, a_t) + α[r_{t+1} + γ max_{a'} B(s_{t+1}, a') − B(s_t, a_t)].       (4.20)

[Plot: average MSE against steps (×1000, from 0 to 300) for the Q-pess and Q-opt agents on the large maze task.]

Figure 4.29: The effect of initial bias on two Q-learning-like algorithms on the large maze task. Both methods share identical exploration policies. The Q-pess method distinguishes between optimism for exploration and real Q-value predictions (by maintaining a separate function, B, that is updated using the Q-learning update) and starts with a pessimistic Q-function. The vertical axis measures the mean squared error in the learned Q-function (as in Table 4.8).
B0 = 100 = Q0 for Q-opt, so that Q-pess may follow an equivalent ε-greedy exploration strategy to the Q-learner by choosing arg max_a B(s,a) with probability 1 − ε at each step (and a random action otherwise). However, the Q-pess method also maintains and updates a Q-function using exactly the same update as Q-opt, although differently initialised. The different Q-functions are initialised to have the same size of error from Q*. For Q-opt, the error gives an optimistic Q0, and for Q-pess, it is chosen to give a pessimistic one.
In this case separating optimism from exploration has allowed the optimal Q-function to be approached much more quickly without affecting exploration at all. Still faster convergence can be found with Q-pess by choosing a higher Q0.
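A minimal sketch of this dual-function scheme is given below (illustrative only; the environment interface, step size and initial values are assumptions rather than the exact Q-pess implementation, and, as in the example above, B relies on its optimistic initial value rather than on explicit bonuses):

    import random

    def dual_q_learning(env, n_steps, gamma=0.95, alpha=0.1, eps=0.01,
                        q_init=-100.0, b_init=100.0):
        """Q holds pessimistically initialised return predictions; B holds the
        optimistically initialised values used only to drive exploration."""
        Q, B = {}, {}
        def val(table, s, a, init):
            return table.setdefault((s, a), init)

        s = env.reset()
        for _ in range(n_steps):
            acts = env.actions(s)
            if random.random() < eps:
                a = random.choice(acts)                                  # explore at random
            else:
                a = max(acts, key=lambda a_: val(B, s, a_, b_init))      # greedy w.r.t. B
            s2, r, done = env.step(a)
            q_target = r if done else r + gamma * max(val(Q, s2, a_, q_init) for a_ in env.actions(s2))
            b_target = r if done else r + gamma * max(val(B, s2, a_, b_init) for a_ in env.actions(s2))
            Q[(s, a)] = val(Q, s, a, q_init) + alpha * (q_target - val(Q, s, a, q_init))
            B[(s, a)] = val(B, s, a, b_init) + alpha * (b_target - val(B, s, a, b_init))
            s = env.reset() if done else s2
        return Q, B    # Q is reported as the prediction; B is only used while learning

Only the choice of action depends on B, so the exploration behaviour is unchanged by how Q is initialised; Q, meanwhile, approaches Q* without having to unlearn an optimistic bias.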

4.7.4 Discussion

Distinguishing value-predictions from optimism generally seems like a good idea as we can now deal with these two conceptually different quantities separately (and it adds little to the overall computational complexities of the algorithms). We can now also make explicit separations between exploration and exploitation: at any time we can decide to stop exploring completely and decide to exploit given the best policy we currently have. For example, in gambling or financial trading problems we might wish to learn about the relative return available for making bets or trading shares by initially exploring the problem with a small amount of capital. Later, if we decided to play the game for real and bet the farm for the expected return indicated by the learned Q-values, we might be extremely disappointed to find that this return was in fact a gross over-estimate.
There are also other applications for which accurate Q-values are needed, but in which exploration is still required. An example is deciding whether or not (or where) to refine the agent's internal Q-function representation. This can be done based upon the differences of Q-values in adjacent parts of the space [117, 28]. In different RL frameworks, the agent may be learning to improve several independent policies that maximise several separate reward functions [57]. Deciding which policy to follow at any time is done based upon the Q-values of the actions in each policy.
Finally, note that the goal of most existing exploration methods is only to maximise the return that the agent can collect over its lifetime and not to find accurate Q-functions (in fact some exploration methods fail to find accurate Q-functions but still find policies that are almost optimal in the reward they can collect). Could adapting exploration methods to distinguish between optimism and value prediction still help to maximise the return that the agent collects? Intuitively the answer is yes, since finding accurate Q-values more quickly should allow the agent to better predict the relative value of exploiting instead of exploring.
However, this may only apply to model-free RL methods. For model-learning methods the advantages of separating return predictions from optimistic biases are far fewer. At any time, these methods may calculate the Q-function unbiased by exploration bonuses and so generate a purely exploiting policy. This can be done by discarding the exploration bonuses (i.e. removing b in Equation 4.17) and solving the Q-function under the assumption that the learned model is correct. However, as we have seen, model-based methods that solve the Q-function using, for example, value-iteration can be made greatly more computationally efficient if the Q-function is initialised non-optimistically.

4.7.5 Initial Bias and Backwards Replay.

Why was the worst overall performance by the experience stack algorithms found where the initial Q-function was optimistic and λ was low? (See Q0 = 100 in Figures 4.15, 4.17, 4.19 and 4.21.)
Consider the example experience in Figure 4.30 and, as before, assume that γ = 1 and r = 0 on all transitions. States st1 and st2 are so far unvisited, but st3 has been frequently visited and its true value is now known (for this example, it is only important that max_a Q̂(st2,a) > max_a Q̂(st3,a)).
If λ = 0 and backwards replay is employed, although Q(st2,at2) may be lowered, this adjustment will not immediately reach st1 since max_a Q(st2,a) does not change. Thus the benefit of using backwards replay in this situation is destroyed by the combination of the optimistic Q-values at st2 and using a single step return (although this is no worse than single-step Q-learning).
However, as λ grows and max_a Q(s,a) weighs less in the return estimates compared to the actual reward, more significant adjustments to Q(st1,at1) will follow (this is true of both backwards replay and the eligibility trace methods). However, as noted in Section 4.6 there may be little benefit to using the experience stack algorithm with high λ since SAPs are removed from the experience history after they are updated. It was argued that the additional return truncations this causes may actually aid backwards replay and offset this problem; yet it has been shown here that truncated returns can cause backwards replay to be markedly less effective if Q0 is optimistic. Notably, the experience stack algorithms perform much worse than the original algorithms in the above experiments only where λ is high and the Q-function is optimistic. This is contrary to the existing rules of thumb in choosing good parameter settings and resulted in substantial initial difficulties in demonstrating any good performance with the experience stack algorithm. The true nature of the method only became clear when examining different Q0. There appears to be no previous experimental work in the literature that compares algorithms using different Q0.
In the experiments in Figures 4.15-4.22 in Section 4.5, in almost all cases where Q0 < 100 and λ < 0.9 each experience stack method outperforms its eligibility trace counterpart, with the exception of a few cases with very high Bmax. We also see that the experience stack methods are much more robust to their choice of Q0 than the trace methods, except for Q0 = 100.
Can this problem be avoided by using the method of separating exploration bonuses from predictions discussed in Section 4.7.3? Note that for the off-policy results in Figures 4.15 and 4.16, by optimising Q0 such that cumulative reward is maximised (B0 = 25 in the dual learning method), the experience stack method looks better than any result obtained by Watkins' Q(λ). However, at this setting the error performance is poor. It is possible to speculate that this could be avoided by choosing Q0 = 75 as the Q-function used to generate predictions in the same experiment. However, since the error also depends upon the given experience (which depends upon B0), to perform a fair comparison one would need to run a series of experiments where Q0 and B0 are varied to determine where it is possible to provide better cumulative reward and error than Watkins' Q(λ). These experiments have not been performed.
[Diagram: a sequence of states ... → st1 → st2 → st3; each of the actions available in st1 and st2 has Q-value 10, while each action in st3 has Q-value 0.]

Figure 4.30: A sequence of experience in a process similar to the one in Figure 4.26. Q-values before the experience are labelled above the actions. Single-step backwards replay (λ = 0) performs poorly here. Algorithms that use multistep return estimates (λ > 0) are less affected by the initial bias than single-step methods.
4.7.6 Initial Bias and SARSA(λ)
In a comparison of different eligibility trace schemes by Rummery in [128], SARSA(λ) was shown to outperform other versions of Q(λ) in terms of policy performance. The algorithms were tested under a semi-greedy exploration policy and so it is reasonable to assume that an optimistic initial Q-function was employed. In this scenario, and in the light of the above results, it seems likely that SARSA(λ) would suffer less than Peng and Williams' Q(λ) and Watkins' Q(λ), since it does not explicitly employ the max operator. Performing rigorous comparisons of these methods is difficult since the exploration method used strongly affects how the methods differ; under a purely greedy policy, Peng and Williams' Q(λ), Watkins' Q(λ) and SARSA(λ) are very similar methods. Such a comparison should also take into account the accuracy of the learned Q-function. In this respect, it is straightforward to construct situations in which SARSA(λ) performs extremely poorly while following a non-greedy policy.

4.8 Summary

Over the history of RL an elegant taxonomy has emerged that differentiates RL techniques by the return estimates they learn from. While eligibility trace methods are a well established and important RL tool that can learn the expectation of a variety of return estimates, the traces themselves make understanding and analysing these methods difficult. This is especially true of efficient (but more complex) techniques for implementing traces such as Fast-Q(λ).
In Section 3.4.8 we saw that the need for eligibility traces arises only from the need for online learning; simpler and naturally efficient alternatives exist if the environment is acyclic or if it is acceptable to learn offline. In Section 3.4.8 we also saw that (at least for accumulate trace variants) eligibility trace methods don't closely approximate their forward view counterparts and can suffer from higher variance in their learned estimates as a result. This led to the idea that the forward view methods which directly learn from λ-return estimates might be preferable if they could be applied online. In addition, with forward-view methods it is straightforward and natural to apply backwards replay to derive additional efficiency gains at no additional computational cost, although it is less obvious how to learn online.

[Diagram: a schematic summary over λ (from 0 to 1), the initial Q-function Q0 (pessimistic to optimistic) and whether online learning is needed, offline learning is possible, or the process is acyclic; regions are marked +, ?+ or ?−.]

Figure 4.31: Improvement space for experience stack vs. eligibility trace control methods. + denotes that the analysis suggests that the learning speed of a backwards replay method is expected to be as good as or better than for the related eligibility trace method. ?+ and ?− denote that the analysis was inconclusive but the experimental results were positive or negative respectively.
We have seen how backwards replay can be made to work effectively and efficiently online by postponing updates until the updated values are actually needed. This technique can be adapted to use most forms of truncated return estimate. Analogues of TD(λ) [148], SARSA(λ) [128] and the new importance sampling eligibility trace methods of Precup [111] are easily derived. In general, the method is as computationally cheap as the fastest way of implementing eligibility traces but is much simpler due to its direct application of the return estimates when making backups. As a result it is expected that further analysis and proofs about the online behaviour of the algorithm will follow more easily than for the related eligibility trace methods.
The focus in this chapter was to find an effective control method that doesn't suffer from the "short-sightedness" of Watkins' Q(λ) and also doesn't suffer from unsoundness under continuing exploration (i.e. as can occur with Peng and Williams' Q(λ) or SARSA(λ)).
When should the experience stack method be employed? The experimental results have shown that, at least in some cases, using backwards replay online can provide faster learning and faster convergence of the Q-function than the trace methods. Improvements in all cases in all problem domains are not expected (nor was this found in the experiments). However, the experimental results (supported by additional analyses) have led to a characterisation of its performance that is shown in Figure 4.31.
In summary,
• Expect little benefit from using online backwards replay compared to eligibility trace methods with values of λ close to 1.
• With low (and possibly intermediate) values of λ always expect performance improvements (or at least no performance degradation).
• Expect variants employing the max operator in their estimate of return (e.g. ES-WAT and ES-NWAT) to work poorly with high initial Q-values.
• Expect the algorithm to always provide improvements in acyclic tasks except where λ = 1 (i.e. non-bootstrapping), where it performs identical overall updates to the existing trace or Monte Carlo methods.
In addition, the initial Q-function has been highlighted as having a major effect upon the learning speed of several reinforcement learning algorithms. Previously, even in work examining the effects of initial bias or λ, this has not been considered to be an important factor affecting the relative performance of algorithms, and is often omitted from the experimental method [171, 151, 139, 106, 150, 31]. The findings here suggest that it can be at least as important to optimise Q0 as it is to optimise α and λ, and the choice of Q0 affects different methods in different ways.
Chapter 5

Function Approximation

Chapter Outline
This chapter reviews standard function approximation techniques used to represent value functions and Q-functions in large or non-discrete state-spaces. The interaction between bootstrapping reinforcement learning methods and the function approximators' update rules is also reviewed. A new general but weak theorem shows that general discounted return estimating reinforcement learning algorithms cannot diverge to infinity when a form of "linear" function approximator is used for approximating the value-function or Q-function. The results are significant insofar as examples of divergence of the value-function exist where similar linear function approximators are trained using a similar incremental gradient descent rule. A different "gradient descent" error criterion is used to produce a training rule which has a non-expansion property and therefore cannot possibly diverge. This training rule is already used for reinforcement learning.

5.1 Introduction

So far, all of the reinforcement learning methods discussed have assumed small, discrete state and action spaces, so that it is feasible to exactly store each Q-value in a table. What then, if the environment has thousands or millions of state-action pairs? As the size of the state-action space increases, so does the cost of gathering experience in each state and also the difficulty in using it to accurately update so many table entries. Moreover, if the state or action spaces have continuous dimensions, and so there is an infinite number of states, then representing each state or action value in a table is no longer possible. Therefore, in large or infinite spaces, the problem faced by a reinforcement learning agent is one of generalisation. Given a limited amount of experience within a subset of the environment, how can useful inferences be made about the parts of the environment not visited?
Reinforcement learning turns to techniques more commonly used for supervised learning. Supervised learning tackles the problem of inferring a function from a set of input-output examples, or how to predict the desired output for a given input. More generally, the technique of learning an input-output mapping can be described as function approximation.
This chapter examines the use of function approximators for representing value functions and Q-functions in continuous state-spaces. The general problem being solved still remains one of learning to predict expected returns from observed rewards (a reinforcement learning problem). However, in this context, the function approximation and generalisation problems are harder than they would be in a supervised learning setting since the training data (the set of input-output examples) cannot be known in advance. In fact, in the majority of cases, the training data is determined in part by the output of the learned function. This causes some severe difficulties in the analysis of RL algorithms, and in many cases, methods can become unstable.
Sections 5.2-5.5 review common methods for function approximation in reinforcement learning. Linear methods are focused upon as they have been particularly well studied by RL researchers from a theoretical standpoint, and have also had a moderate amount of practical success. Section 5.5 examines the bootstrapping problem which is the source of instability when combining function approximation with reinforcement learning. Section 5.7 introduces the linear averager scheme which differs from more common linear schemes only in the measure of error being minimised. However, also in this section, a new proof establishes the stability of this method with all discounted return estimating reinforcement learning algorithms by demonstrating their boundedness. Section 5.8 concludes.

5.2 Example Scenario and Solution


Suppose that our reinforcement learning problem was to control the car shown in Figure
5.1. The task is to drive the car to the top of the hill in the shortest possible time [149, 150].
Rewards are -1 on all timesteps and the value of the terminal state is zero. The state of
the system consists of the car's position along the hill and its velocity. There are just two
actions available to the agent – to accelerate or decelerate (reverse).
Suppose also that we wish to represent the value function for this space (see Figure 5.1). We
must represent a function of two continuous valued inputs (a position and velocity vector).
One of the easiest ways to represent a function in a continuous space is to populate the
space with a set of data instances at different states, $\{(\theta_1, s_1), \ldots, (\theta_n, s_n)\}$. Roughly, each
instance is a "prototype" of the function's output at that state (i.e. $V(s_i) \approx \theta_i$) [159]. If
we require a value estimate at some arbitrary query state, $q$, then we can take an average
of the values of nearby instances, possibly weighting nearby instances more greatly in the
output. To do this requires that we define some distance metric,
$$d(s, q) = \text{distance between } s \text{ and } q$$
which quantitatively specifies "nearby". For instance, we might use the Euclidean distance
[Figure 5.1 graphics: (left) the car on the hill with the Goal at the top and Gravity acting on the car; (right) a surface plot of Value against Position and Velocity.]
Figure 5.1: (left) The Mountain Car Task. (right) An inverted value function for the
Mountain Car task showing the estimated value (steps-to-goal) of a state. This figure is a
learned function using a method presented in a later section – the true function is much
smoother but still includes the major discontinuity between where it is possible to get to the
goal directly, and where the car must reverse away from the goal to gain extra momentum.

between the states, or more generally, an $L_p$-norm (or Minkowski metric) (see [8]),
$$d_p(s, q) = \left( \sum_{j=1}^{k} |s_j - q_j|^p \right)^{1/p}$$
for $k$-dimensional vectors $s$ and $q$.


There are also different schemes we might use to decide how nearby instances are combined
to produce the output:

Nearest Neighbour
The output is simply the instance nearest to the query point:
$$V(q) = \theta_i$$

where,
$$i = \operatorname*{argmin}_{j \in [1..n]} d(s_j, q)$$
with ties broken arbitrarily. Although computationally relatively fast, a disadvantage of this
approach is that the resulting value function will be discontinuous between neighbourhoods.
Kernel Based Averaging
In order to produce a smoother (and better fitting) output function, the values of many
instances can be averaged together, but with nearby instances weighted more heavily in the
output than those further away. How heavily the instances are weighted in the average is
controlled by a weighting kernel (or smoothing kernel) which indicates how relevant each
instance is in predicting the output for the query point. For instance, we might use a
Gaussian kernel:
$$K(s, q) = e^{-\frac{d(s,q)^2}{2\sigma^2}},$$
where the parameter $\sigma$ controls the kernel width. Other possibilities exist – the main
criteria for a kernel is that its output is at a maximum at its centre and declines to zero
with increasing distance from it. The weights for a weighted average can now be found by
normalising the kernel and an output found:
$$V(q) = \frac{\sum_i^n \theta_i K(s_i, q)}{\sum_i^n K(s_i, q)}$$
Atkeson, Moore and Schaal provide an excellent discussion of this form of locally weighted
representation in [8] and [1].
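To make the scheme concrete, the following is a minimal sketch (my own, not taken from the thesis) of nearest neighbour and Gaussian-kernel value prediction over a set of prototype instances; the array names and kernel width are illustrative assumptions.

import numpy as np

def nearest_neighbour_value(q, prototypes, values):
    """Return the value of the prototype state closest to the query q."""
    dists = np.linalg.norm(prototypes - q, axis=1)   # Euclidean distance to each s_i
    return values[np.argmin(dists)]                  # ties broken by first minimum

def kernel_average_value(q, prototypes, values, sigma=0.1):
    """Gaussian-kernel weighted average of the prototype values."""
    dists = np.linalg.norm(prototypes - q, axis=1)
    k = np.exp(-dists**2 / (2.0 * sigma**2))         # K(s_i, q)
    return np.dot(k, values) / np.sum(k)             # normalised weighted average

# Example: 2-D Mountain Car style state (position, velocity)
prototypes = np.array([[-0.5, 0.0], [0.0, 0.02], [0.4, -0.01]])
values = np.array([-60.0, -35.0, -15.0])             # V(s_i), i.e. theta_i
print(kernel_average_value(np.array([-0.2, 0.01]), prototypes, values))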

5.3 The Parameter Estimation Framework


The most pervasive and general class of function approximator are the parameter estimation
methods. Here, the approximated function is represented by
$$f(\phi(s), \vec{\theta}),$$
where $f$ is some output function, $\phi$ is an input mapping which returns a feature vector,
$$\phi(s) = \vec{x} = [x_1, \ldots, x_n],$$
and $\vec{\theta}$ is a parameter vector (or weights vector),
$$\vec{\theta}^T = [\theta_1, \ldots, \theta_m],$$
and is a set of adjustable parameters. The problem solved by supervised learning and
statistical regression techniques is how to find a $\vec{\theta}$ that minimises some measure of the error
in the output of $f$, given some set of training data,
$$\{(s_1, z_1), \ldots, (s_j, z_j)\},$$
where $z_p$ ($p \in \{1, \ldots, j\}$) represents the desired output of $f$ for an input $\phi(s_p)$. The training
data is generally assumed to be noisy.
[Figure 5.2 graphics: block diagram in which a state $s$ passes through the input mapping $\phi$ to give features $\vec{x}$, the output function $f$ produces the actual output, and the error against the desired output $z$ drives the parameter adjustment of $\vec{\theta}$.]
Figure 5.2: Parameter Estimation Function Approximation.
5.3.1 Representing Return Estimate Functions
Concretely, for reinforcement learning, if we are interested in approximating a value function,
then we have,
$$\hat{V}(s) = f(\phi(s), \vec{\theta})$$
and say that $f(\phi(\cdot), \vec{\theta})$ is the function which approximates $\hat{V}(\cdot)$. In the case where a Q-function
approximation is required, we might have,
$$\hat{Q}(s, a) = f(\phi(s), \vec{\theta}(a))$$
in which case there is approximation only in the state space and a different set of parameters
is maintained for each available action. Alternatively,
$$\hat{Q}(s, a) = f(\phi(s, a), \vec{\theta})$$
in which case there is approximation in both the state and action space. This formulation
is more suitable for use with large or non-discrete action spaces [131].
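As an illustration of these two parameterisations, here is a minimal sketch (my own; the class names and the phi mappings are assumptions, not thesis code) of a per-action parameter vector versus a single state-action parameter vector:

import numpy as np

class PerActionLinearQ:
    """Q(s,a) = phi(s) . theta_a: a separate weight vector per discrete action."""
    def __init__(self, phi, n_features, n_actions):
        self.phi = phi
        self.theta = np.zeros((n_actions, n_features))

    def value(self, s, a):
        return float(np.dot(self.phi(s), self.theta[a]))

class StateActionLinearQ:
    """Q(s,a) = phi(s,a) . theta: one weight vector, features over (s, a)."""
    def __init__(self, phi_sa, n_features):
        self.phi_sa = phi_sa
        self.theta = np.zeros(n_features)

    def value(self, s, a):
        return float(np.dot(self.phi_sa(s, a), self.theta))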
5.3.2 Taxonomy
Examples of methods which fit this parameter framework are "non-linear" methods such as
multi-layer perceptrons (MLPs). Although these non-linear methods have had some striking
success in practical applications of RL (e.g. [155, 36, 177]), there is little or no practical
theory about their behaviour, other than counter-examples showing how they can become
unstable and diverge when used in combination with RL methods [24, 159].
A much stronger body of theory exists for linear function approximators. Examples include:
i) Linear Least Mean Square methods such as the CMAC [163, 149] and Radial Basis Function
(RBF) methods [131, 150]. Here the goal is to find an optimal set of parameters that
happens to minimise some measure of error between the output function and the training
data. As in an MLP, the learned parameters may have no real meaning outside of the
function approximator. ii) Averagers. Here the learned values of parameters may have an
easily understandable meaning. For example, the parameters may represent the values of
prototype states as in Section 5.2. These methods can be shown to be more stable under a
wider range of conditions ([159, 49]). iii) State-aggregation methods, where the state-space
is partitioned into non-overlapping sets. Each set represents a state in some smaller state-space
to which standard RL methods can directly be applied. iv) Table lookup, which is a
special case of state-aggregation.

5.4 Linear Methods (Perceptrons)


All linear methods produce their output from a weighted sum of the inputs. For example:
$$f(\vec{x}, \vec{\theta}) = \sum_i^n x_i \theta_i = \vec{x} \cdot \vec{\theta} \quad (5.1)$$
We assume that there are as many components in $\vec{\theta}$ as there are in $\vec{x}$. The reason that this
is called a linear function is because the output is formed from a linear combination of the
inputs:
$$\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
and not some non-linear combination. Alternatively, we might note that Equation 5.1 is
linear because it represents the equation of a hyper-plane in $n - 1$ dimensions. This might
appear to limit function approximators that employ linear output functions to representing
only planar functions. Happily, through careful choice of $\phi$ this need not be the case. In
fact we can see that the nearest neighbour and kernel based average methods are linear
function approximators where $\phi_i$ is defined as:
$$\phi(s_q)_i = \frac{K(s_q, s_i)}{\sum_k^n K(s_q, s_k)}. \quad (5.2)$$

5.4.1 Incremental Gradient Descent

Incremental gradient descent is a training rule for modifying the parameter vector based
upon a stream of training examples [127, 14]. Alternative, batch update versions are possible
which make an update based upon the entire training set (see [127]) and are computationally
more efficient. However, non-incremental function approximation is not generally suitable
for use with RL since the training data (return estimates) are not generally available a-priori
but gathered online. The way in which they are gathered usually depends upon
the state of the function approximator during learning – most exploration schemes and all
bootstrapping estimates of return rely upon the current value-function or Q-function.
The basic idea of gradient descent is to consider how the error in $f$ varies with respect to $\vec{\theta}$
(for some training example $(\vec{x}_p, z_p)$), and modify $\vec{\theta}$ in the direction which reduces the error:
$$\Delta\vec{\theta} = -\alpha_p \frac{\partial E_p}{\partial \vec{\theta}} \quad (5.3)$$
for some error function $E_p$ and step size $\alpha$.
Concretely, in the case of the linear output function, if we define the error function as:
$$E_p = \frac{1}{2} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right)^2 \quad (5.4)$$
then:
$$\Delta\vec{\theta} = \alpha \left( z_p - f(\vec{x}_p, \vec{\theta}) \right) \vec{x}_p$$
Each parameter is adjusted as follows:
$$\theta_i \leftarrow \theta_i + \alpha_{ip} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right) x_{ip} \quad (5.5)$$
or
$$\theta_i \leftarrow \theta_i + \alpha_{ip} \left( \text{desired output}_p - \text{actual output}_p \right) \times \text{contribution of } \theta_i \text{ to output}$$
where $\alpha_{ip}$ is the learning rate for parameter $\theta_i$ at the $p$th training pair $(\vec{x}_p, z_p)$.
Update 5.5 (due to Widrow and Hoff, [166]) is known as the Delta Rule or the Least Mean
Square Rule and can be shown to find a local minimum of,
$$\sum_p \frac{1}{2} \left( z_p - f(\phi(s_p), \vec{\theta}) \right)^2,$$
under the standard (Robbins-Monro) conditions for convergence of stochastic approximation:
$\sum_p^\infty \alpha_{ip} = \infty$, and $\sum_p^\infty \alpha_{ip}^2 < \infty$ (which also implies that all weights are updated
infinitely often) [21, 127, 11].
Different error criteria yield different update rules – another is examined later in this chapter.
There is a close relationship between update 5.5 and the update rules used by the eligibility
trace methods in Chapter 3 (which find the LMS error in a set of return estimates). Here
$x_i$ represents the size of parameter $\theta_i$'s contribution to the function output. With $x_i = 0$,
$\theta_i$ has no contribution to the output and so is ineligible for change.
Finally, with the exception of some special cases, the learned parameters themselves may
have no meaning outside of the function approximator. There is (typically) no sense in
which a parameter could be considered by itself to be a prediction of the output. The set of
parameters found is simply that which happens to minimise the error criteria.
Throughout the rest of this chapter, the method presented here is referred to as the linear
least mean square method, to differentiate it from methods that learn using other cost
metrics.
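For concreteness, a minimal sketch of the incremental Delta Rule (Equation 5.5) for a linear output function is given below; the variable names and step size are illustrative assumptions, not thesis code.

import numpy as np

def lms_update(theta, x, z, alpha):
    """Delta rule / LMS update (Equation 5.5): move theta along the negative
    error gradient for one training pair (x, z)."""
    prediction = np.dot(x, theta)            # f(x, theta) = x . theta
    error = z - prediction                   # desired output minus actual output
    return theta + alpha * error * x         # each theta_i moves by alpha * error * x_i

# Incremental training on a stream of (x, z) pairs
theta = np.zeros(3)
stream = [(np.array([1.0, 0.5, 0.0]), 2.0),
          (np.array([0.0, 1.0, 1.0]), -1.0)]
for x, z in stream:
    theta = lms_update(theta, x, z, alpha=0.1)
print(theta)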
5.4.2 Step Size Normalisation
Finding a sensible range of values for $\alpha$ in update 5.5 that allows for effective learning is
more difficult than with the RunningAverage update rule used by the temporal difference
learning algorithms in the previous chapters. Previously, choosing $\alpha = 1$ resulted in a full
step to the new training estimate. That is to say that after training, the learned function
exactly predicts the last training example when presented with the last input. Higher values
can result in strictly increasing the error with the training value. Smaller values result in
smaller steps toward the training value, mixing it with an average of many of the previously
presented training values. No learning occurs with $\alpha = 0$.
However, with update 5.5, choosing $\alpha_{ip} = 1$ does not necessarily result in a `full step'. For
example, even if $x_i = 1$ for all $i$, choosing $\alpha_i = 1$ will usually result in a step that is
far too great – increasing the error between the new training example and old prediction.
Smaller or greater values of $x_i$ effectively result in smaller or greater steps toward the target
value. The useful range of learning rate values clearly depends on the scale of the input
features. How then should the size of the step be chosen?
One solution is to re-normalise the step-size such that sensible values are found in the range
$[0, 1]$. The working below shows how this can be done. First note that the new learned
function may be written as:
$$f(\vec{x}_p, \vec{\theta}') = \sum_i x_{ip}(\theta_i + \Delta\theta_i)$$
$$= \sum_i x_{ip}\theta_i + \sum_i x_{ip}\Delta\theta_i$$
$$= f(\vec{x}_p, \vec{\theta}) + \sum_i x_{ip} \alpha_{ip} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right) x_{ip}, \quad (5.6)$$
where $\vec{\theta}' = \vec{\theta} + \Delta\vec{\theta}$ is the parameter vector after training with $(\vec{x}_p, z_p)$. To find a learning
rate that makes the full step, Equation 5.6 should be solved for $f(\vec{x}_p, \vec{\theta}') = z_p$:
$$z_p = f(\vec{x}_p, \vec{\theta}) + \sum_i x_{ip}^2 \alpha_{ip} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right)$$
$$1 = \sum_i x_{ip}^2 \alpha_{ip}, \quad (5.7)$$
which should hold in order to make a full-step. We can now scale this step size,
$$\alpha'_p = \sum_i x_{ip}^2 \alpha_{ip}, \quad (5.8)$$
so that choosing $\alpha'_p = 1$ results in the full step to $z_p$, and $\alpha'_p = 0$ results in no learning.
If a single global learning rate is desired ($\alpha_{ip} = \alpha_{jp}$ for all $i$ and $j$), then (from Equation
5.8) the normalised learning rate is given straightforwardly as,
$$\alpha_{ip} = \frac{\alpha'_p}{\sum_i x_{ip}^2},$$
where $\alpha'_p$ is the new global learning rate at update $p$.
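The working above can be summarised in code. The sketch below (my own, under the assumption of a single global learning rate) rescales the step so that a rate of 1 gives exactly a full step to the target:

import numpy as np

def normalised_lms_update(theta, x, z, global_alpha):
    """LMS update with the step rescaled (Equations 5.7-5.8) so that
    global_alpha = 1.0 makes the new prediction hit the target z exactly."""
    error = z - np.dot(x, theta)
    denom = np.dot(x, x)                      # sum_i x_i^2
    if denom == 0.0:
        return theta                          # no active features, nothing to update
    alpha = global_alpha / denom              # per-update normalised learning rate
    return theta + alpha * error * x

theta = np.zeros(4)
x = np.array([0.0, 0.5, 0.5, 0.0])
theta = normalised_lms_update(theta, x, z=3.0, global_alpha=1.0)
print(np.dot(x, theta))                       # prints 3.0: the full step was taken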

5.5 Input Mappings


Many function approximation methods can be characterised by their input mapping, $\phi(s)$,
which maps from the environment state, $s$, to the set of input features, $\vec{x}$. The feature
set is often the major characteristic affecting the generalisational properties of the function
approximator, and the same input mapping can be applied with different output functions
or training rules. They also provide a good way to incorporate prior knowledge about
the problem by selecting to scale or warp the inputs in ways that increase the function
approximator's resolution in some important part of the space [131].
All of the methods described in this section can be used with the LMS training method.
However, more generally they might be provided as inputs to more complex function
approximators (such as a multi-layer neural network).
Several common input mappings are reviewed here. Each input mapping can be thought of
as playing a role similar to the weighting kernels in Section 5.2. The inputs may sometimes
be normalised to sum to 1, although this is not always assumed to be done.
5.5.1 State Aggregation (Aliasing)
Suppose that a robot has a range finder that returns real valued distances in the range
$[0, 1)$. We might map this to three binary features: $\phi(s) = [x_{near}, x_{mid}, x_{far}]$,
$$x_{near} = \begin{cases} 1, & \text{if } 0 \le s < 1/3, \\ 0, & \text{otherwise.} \end{cases} \quad (5.9)$$
$$x_{mid} = \begin{cases} 1, & \text{if } 1/3 \le s < 2/3, \\ 0, & \text{otherwise.} \end{cases} \quad (5.10)$$
$$x_{far} = \begin{cases} 1, & \text{if } 2/3 \le s < 1, \\ 0, & \text{otherwise.} \end{cases} \quad (5.11)$$
If $s$ has more than one dimension, then the state-space might be quantised into hyper-cubes.
However the partitioning is done, it is assumed that the regions are non-overlapping and
that only one input feature is ever active (e.g. $\phi(s) = [0, 1, 0, 0, 0, 0, 0, 0]$). That is to say
that subsets of the original space are aggregated together into a smaller discrete space. The
nearest neighbour method presented in Section 5.2 and table look-up are special cases of
state aggregation.
The main disadvantage of this form of input mapping is that the state space may need to
be partitioned into tiny regions in order to provide the necessary resolution to solve the
problem. If it is not clear from the outset how the partitioning should be performed, then
simply partitioning the state-space into uniformly sized hypercubes will typically result in
a huge set of input features (exponential in the number of dimensions of the input space).
Similar problems follow with non-regular but evenly distributed partitioned regions, as may
occur with the nearest neighbour approach.
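As a minimal illustration of such a mapping (my own sketch, not code from the thesis), the three-feature range-finder example of Equations 5.9-5.11 can be written as:

def phi_range_finder(s):
    """State aggregation over a reading s in [0, 1): exactly one binary
    feature is active, as in Equations 5.9-5.11."""
    x_near = 1.0 if 0.0 <= s < 1.0 / 3.0 else 0.0
    x_mid = 1.0 if 1.0 / 3.0 <= s < 2.0 / 3.0 else 0.0
    x_far = 1.0 if 2.0 / 3.0 <= s < 1.0 else 0.0
    return [x_near, x_mid, x_far]

print(phi_range_finder(0.5))   # [0.0, 1.0, 0.0]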

5.5.2 Binary Coarse Coding (CMAC)

Devised by Albus [4, 3], the Cerebellar Model Articulation Controller (CMAC) consists of
a number of overlapping input regions, each of which represents a feature (see Figure 5.3).
The features are binary – any region containing the input state represents an input feature
with value 1. All other input features have a value of 0.

[Figure 5.3 graphics: overlapping tilings of features covering the valid space; the tiles containing the point of query/backup are the active features/tiles.]

Figure 5.3: (left) A CMAC. The horizontal and vertical axes represent dimensions of the
state space. (right) The CMAC with a regularised tiling.

If the input tiles are arranged into a regular pattern (e.g. in a grid as in Figure 5.3, right)
then there are particularly efficient ways to directly determine which features are active (i.e.
without search). A similar argument can be made for some classes of state aggregation but
not, in general, for the nearest neighbour method (which usually requires some search).
In the case of a linear output function, since many of the inputs will be zero, we simply
have:
$$f(\vec{x}_p, \vec{\theta}) = \sum_i x_{ip}\theta_i = \sum_{i \in \text{active}} \theta_i. \quad (5.12)$$
This form of input mapping, when combined with the linear output function and delta
learning rule, has been extremely successful in reinforcement learning. Notably, there are
many successful examples using online Q-learning, Q($\lambda$) and SARSA($\lambda$) [71, 149, 70, 131,
167, 150, 64, 141]. [150] provides many others.
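The sketch below (my own, with assumed tile counts and offsets) shows one common way a regular grid CMAC determines its active features directly, and how the output of Equation 5.12 is then formed:

import numpy as np

def cmac_active_tiles(s, n_tilings=4, tiles_per_dim=8):
    """Return the indices of the active binary features for a 2-D state
    s in [0, 1)^2 under a regular grid CMAC.  Each tiling is offset by a
    fraction of a tile width; exactly one tile per tiling is active."""
    active = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)          # shift this tiling slightly
        ix = int((s[0] + offset) * tiles_per_dim) % tiles_per_dim
        iy = int((s[1] + offset) * tiles_per_dim) % tiles_per_dim
        active.append(t * tiles_per_dim**2 + iy * tiles_per_dim + ix)
    return active

theta = np.zeros(4 * 8 * 8)
s = (0.3, 0.7)
value = sum(theta[i] for i in cmac_active_tiles(s))       # Equation 5.12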
Figure 5.4 shows how the features of a CMAC or an RBF (introduced in the next section)
are linearly combined to produce an output function.
CMAC (Binary Coarse Coding): $\phi_i(s) = I(\text{dist}(s, \text{centre}_i) < \text{radius}_i)$

RBF (Radial Basis Functions): $\phi_i(s) = \text{Gaussian}(s, \text{centre}_i, \text{width}_i)$

Figure 5.4: Example input features and how they are linearly combined to produce complex
non-linear functions in a 1-dimensional input space. The left-hand-side curves (the set of
features) are summed to produce the curve on the right-hand-side (the output function). A
single parameter $\theta_i$ determines the vertical scaling of a single feature. It is intended that
the parameter vector, $\vec{\theta}$, is adjusted such that the output function fits some target set of
data.

5.5.3 Radial Basis Functions

Radial basis functions (RBFs) are superficially similar to the kernel based average method
presented in Section 5.2. With fixed centres and widths, an RBF network is simply a linear
method and so can be trained using the LMS rule, although in this case, the parameters
won't represent "prototypical" values. However, one of the great attractions of an RBF is
its ability to shift the centres and widths of the basis functions.
In a fixed CMAC vs. adaptive Gaussian RBF bake-off of representations for Q-learning, little
difference was found between the methods [68] (although these results consider only one
test scenario). In some cases it was found that adapting the RBF centres left some parts of
the space under-represented. In similar work with Q($\lambda$) using adaptive RBF centres, poor
performance was found in comparison to the CMAC [167]. In addition to these problems,
RBFs are computationally far more expensive than CMACs.
Good overviews of RBF and related kernel based methods can be found in [98, 99, 8, 1, 90].

5.5.4 Feature Width, Distribution and Gradient

The width of a feature can greatly affect the ability to generalise. The wider a feature, the
broader the generalisations that are made about a training instance and the faster learning
can proceed in the initial stages. More concretely, in the case of linear output functions,
if a feature $x_i$ is non-zero for a set of input states, then $\theta_i$ contributes to the output of
those states. Thus if $x_i$ is non-zero for more states (i.e. a wider feature), then updating $\theta_i$
affects the output function for more states in the training set (to a greater or lesser degree
depending upon the magnitude of $x_i$ at those states).
Do broad features smooth out important details in output functions (i.e. do they reduce
its resolution)? Sutton argues not and presents results for the CMAC [150]. Similar results
are replicated in Figure 5.5. However, also shown here are results for smoother kernels (e.g.
the Gaussian of an RBF).
In the example, 100 overlapping features were presented as the inputs to a linear function
approximator trained using update 5.5. Step and sine functions were used as target functions
for approximation. The bottom row shows the shape of the input features used for training
in each column. The learning rate was given by $0.2 / \sum_i x_{ip}$ (as in [150]).¹ With both step
and Gaussian features, broader features allowed broader generalisations to be made from
a limited number of training patterns. However, in the Gaussian case, broad kernels were
disastrous, resulting in extremely poor approximations. Adding more or fewer kernels of the
same width, allowing more training or using different or declining learning rates produces
similar results.
The reason for this is due to the size of the features' gradients. If we have two small segments
of a Gaussian approximated by, $g_1: y = 4x$, and, $g_2: y = 3x$, then summing them we get
$g_1(x) + g_2(x) = 7x$. We see that the gradients of a set of curves are additive when the set is
summed together. Thus an infinite number of Gaussians is required to precisely represent
the steep (infinite) gradient in the step function. In contrast, a CMAC's binary input
features have a steep (infinite) gradient and so can represent the steep details in the target
function, even when the features are wider than the details in the target. Note however,
that this steep gradient doesn't prevent the CMAC from also approximating functions with
shallow gradients. Note that in both cases, the narrow features result in less aggressive
generalisation in the initial stages.

¹ Since $x_{ip} \in \{0, 1\}$, $0.2 / \sum_i x_{ip} = 0.2 / \sum_i x_{ip}^2$ and so this learning rate gives a properly normalised
step-size of 0.2 as shown in Section 5.4.2.

Binary Features Gaussian Features


Training Step Fun tion Target
Samples
5

100

10000
Sine Fun tion Target
5

100

10000

Input
Feature
Shape
Figure 5.5: The generalisational and representational e e t of input features of di ering
widths and gradient.

5.5.5 Efficiency Considerations

k-Nearest Neighbour Selection
Methods such as the RBF, in which every element of the feature vector may be non-zero,
can be expensive to update if there are many active features. If the features are centred at
some state in the input space (such as in the locally weighted averaging example), then a
common trick is to consider only the k-nearest feature centres and treat all others as if their
values were zero [131]. Special data structures, such as a kd-tree, can be used to store the
feature centres and also efficiently determine the k-nearest neighbours to the query point
at a cost of much less than $O(n)$ for a total set of $n$ features and $k \ll n$ [89, 47].
This method can also be used without spatially centred features by choosing only the
k largest valued features, although the difficulty here is how to determine these features
without searching through them all.
In both methods, if $k < n$ then discontinuities appear in the output function at the boundaries
in the state space where the set of nearest-neighbours changes. The discontinuities
will generally be smaller for larger k. A special case is $k = 1$, in which all linear methods
reduce to state aggregation.
Hashing
Hashing is often associated with the CMAC [4, 150], although in principle it may be applied
to the inputs of any kind of function approximator. Hashing simply maps several input
features to the input for a single parameter. This can be done (and is usually assumed
to be done) in fairly arbitrary ways. In this way huge numbers of input features can be
reduced down to arbitrarily small sets. The effect of hashing appears to have been studied
very little, although it has been employed with success with the CMAC and SARSA($\lambda$)
[141].
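A minimal sketch of the idea (my own, with an assumed table size and hash function) is:

def hashed_features(active_tiles, table_size=1024):
    """Map raw tile indices down to a small parameter table by hashing
    (a simple multiplicative hash here); collisions are simply accepted,
    as in a hashed CMAC."""
    return [(t * 2654435761) % table_size for t in active_tiles]

theta = [0.0] * 1024                          # far fewer parameters than raw tiles
active = [1_000_003, 77, 420_000_019]         # arbitrary raw feature indices
value = sum(theta[i] for i in hashed_features(active))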

5.6 The Bootstrapping Problem


The LMS function approximation scheme has often been successfully used for RL in practice
– see [163, 71, 149, 167, 64, 141] for examples. In addition there are several RL methods,
such as TD(0) [148] and TD($\lambda$) [38, 160, 150, 154, 146], for which convergence proofs and
error bounds exist. However, there are also some other methods, such as value-iteration and
Q-learning, for which the range of f diverges [10, 160]. Even the TD(0) algorithm can be
made to diverge if experience is generated from distributions different to the online one
[160]. This is serious cause for concern since TD(0) is a special case of many methods.
The major problem in using RL with any function approximation scheme is that the training
data are not given independently of the output of the function approximator. When an
adjustment is made to $\vec{\theta}$ to minimise the output error with some target z at s, it is possible
that the change reduces the error for s but increases it for other states. This is not usually a
serious problem if the step-sizes are slowly declined, because the increases in error eventually
become small enough that this doesn't happen – most function approximation schemes settle
into some local optimum of parameters if their distribution of training data is fixed a-priori.
However, for a bootstrapping RL system these increases in error can be fed back into the
training data. New return estimates that are used as training data are based upon f. In
the case of TD(0),
$$z = r + \gamma\hat{V}(s'),$$
is replaced by,
$$z = r + \gamma f(\phi(s'), \vec{\theta}),$$
and may be greater in error as a result of a previous parameter update. In pathological
cases, this can cause the range of f to diverge to infinity. There are examples where this
happens for both non-linear and linear function approximators [10, 160, 150]. The problem
is shown visually in Figure 5.6.
[Figure 5.6 graphics: the output range of $f(\cdot)$ grows each time f is trained to fit a function of its own output.]
Figure 5.6: The expansion problem. Some function approximators, when trained using
some functions of their output, can diverge in range.
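To make the feedback loop concrete, the sketch below (my own, with assumed env, policy and phi interfaces) shows online TD(0) with a linear approximator; note that the training target is itself built from the current parameters:

import numpy as np

def td0_linear(env, policy, phi, n_features, alpha=0.1, gamma=0.9, episodes=100):
    """Semi-gradient TD(0) with a linear value function V(s) = phi(s).theta.
    The bootstrapped target r + gamma*V(s') uses the current theta, which is
    the source of the instability discussed above.
    env.reset()/env.step(a) and policy(s) are assumed interfaces."""
    theta = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_next = 0.0 if done else np.dot(phi(s_next), theta)
            target = r + gamma * v_next              # bootstrapped training value z
            delta = target - np.dot(phi(s), theta)   # TD error
            theta += alpha * delta * phi(s)          # LMS-style parameter update
            s = s_next
    return theta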
The following sections review some schemes that deal with this problem.
Grow-Support Methods
The "Grow-Support" solution proposed by Boyan and Moore is to work backwards from a
goal state, which should be known in advance [24] (see also [23]). A set of "stable" states
with accurately known values is maintained around the goal. The accuracy of these values
is verified by performing simulated "rollouts" from the new states using a simulation model
(although in practice this could be done with real experience, but far less efficiently). This
"support region" is then expanded away from the goal, adding new states whose values
depend upon the values of the states in the old support region. In this way, the algorithm
can ensure that the return corrections used by bootstrapping methods have little error, and
so ensure the method's stability.² In [24], Boyan and Moore also present several simple
environments in which a variety of common function approximators fail to converge or even
find anti-optimal solutions, but succeed when trained using the grow-support method.

² For similar reasons, one might also expect backwards replay methods (such as the experience stack
method) to be more stable with function approximation.

Actual Return Methods

The most straightforward solution to the bootstrapping problem is to perform Monte-Carlo
estimation of the actual return. In this case no bootstrapping occurs since the return estimate
does not use the learned values. If the return is collected following a fixed distribution
of experience then it is clear that any function approximator that converges using fixed
(a-priori) training data distributions will also converge in this case. Here we are simply
performing regular supervised learning, and the fact that the target function is the expectation
of the actual observed return is incidental. Also, in the work showing convergence of
TD($\lambda$), the final error bounds can be shown to increase with lower $\lambda$ [160]. In practice, however,
bootstrapping methods can greatly outperform Monte Carlo methods both in terms
of prediction error and policy quality [150].

Online, On-Policy Update Distributions

Note that with the linear LMS training rule (and also with other function approximators)
the error function being minimised is defined in terms of the distribution of training data.
The parameters of states that appear infrequently receive fewer updates and so are likely
to be greater in error as a result.
Convergence theorems for TD($\lambda$) assume that updated states are sampled from the online,
on-policy distribution (i.e. as they occur naturally while following the evaluation policy)
[38, 160, 154]. Following this distribution ensures that states whose values appear as bootstrapping
estimates are sufficiently updated. Failing to update these states means that the
parameters used to represent their values (upon which return estimates depend) may shift
into configurations that minimise the error in unrelated values at other states. Where the
online, on-policy distribution is not followed there are examples where the approximated
function diverges to infinity [10, 160].
This is a problem for off-policy methods, where the parameters defining the Q-values of state-action
pairs that are infrequently taken (and so also infrequently updated) may frequently
define the estimates of return. An obvious example of such a method is Q-learning. Here
$r + \gamma\max_a \hat{Q}(s', a)$ is used as the return estimate, but as an off-policy method there is
typically no assumption that the greedy action is followed. If it is insufficiently followed,
then the greedy action's Q-value is not updated and the parameters used to represent it
shift to minimising errors for other state-action pairs. One might expect that online Q-learning
while following greedy or semi-greedy policies could be stable. However, this is not
the case and there are still examples where divergence to infinity may occur [146]. The
cause of this is probably due to a problem noted by Thrun and Schwartz [157]. If the changes
to weights are thought of as noise in the Q-function, then the effect of the max operator is
to consistently overestimate the value of the greedy action in states where there are actions
with similar values.
Also, Q-learning and semi-greedy policy evaluating algorithms (such as SARSA) suffer
since the greedy policy depends upon the approximated Q-function. This co-dependence
can cause a phenomenon called chattering, where the Q-function, and its associated policy,
oscillate in a bounded region even in simple situations such as state-aliasing [21, 50, 51, 5].
Even so, methods such as Q($\lambda$), Q-learning or value iteration can work well in practice even
when updates are not made with the online distribution [163, 128, 149, 167, 150, 117, 140].
Other recent work shows that variants of TD($\lambda$) or SARSA($\lambda$) can be combined with importance
sampling in a way that does allow off-policy evaluation of a fixed policy while
following a special class of exploration policies [111, 146]. The idea behind importance sampling
is to weight the parameter updates by the probability of making those updates under
the evaluation policy. This allows the overall change in the parameters over the course of
an episode to have the same expected value (but higher variance), even if the evaluation
policy is not followed. It is not clear, however, whether this method can be used effectively
for control optimisation.
Local Input Features
Local (i.e. narrow) input features are a common feature in many practical applications of
function approximation in RL [13, 163, 71, 1, 68, 167, 150, 95, 140, 141]. Why might this
be so?
Consider any goal based task where bootstrapping estimates are employed. Here the values
of states near the goal may completely define the true values of all other states. If broad
features are used and the bulk of the updates are made at states away from the goal (as can
easily happen when updating with the online distribution), then it is likely that parameters
will move away from representing the values of states near the goal and so make it very
difficult for other states to ever approach their true value. The grow-support method is
one solution to avoid "forgetting" the values of states upon which others depend. Another
is to use localised (i.e. narrow) input features. Thus, in cases where the updates are made
far from the goal, the parameters that encode the values of states near the goal are not
modified. A similar argument can be made for non-goal based tasks – the general problem
is one of not forgetting the values of important states while they are not being visited [128].
However, as we have seen earlier, local input features reduce the amount of generalisation
that may occur.
Residual Algorithms
In [10] Baird notes that a simple way to guarantee convergence (under fixed training distributions)
is to make use of our knowledge about the dependence of the training data and the
function approximator, and allow for this by including a bootstrapping term when deriving
a gradient descent update rule. Previously, in the case of TD(0), the gradient descent rule,
$$\Delta\vec{\theta} = -\frac{1}{2}\alpha \frac{\partial \left( z_{t+1} - \hat{V}(s_t) \right)^2}{\partial \vec{\theta}}$$
assumes $z_{t+1}$ to be independent of $\vec{\theta}$, but not $\hat{V}(s_t) = f(\phi(s_t), \vec{\theta})$. In residual gradient
learning, the error is fully defined as,
$$\Delta\vec{\theta} = -\frac{1}{2}\alpha \frac{\partial \left( r_{t+1} + \gamma\hat{V}(s_{t+1}) - \hat{V}(s_t) \right)^2}{\partial \vec{\theta}}$$
$$= \alpha \left( r_{t+1} + \gamma\hat{V}(s_{t+1}) - \hat{V}(s_t) \right) \left( \frac{\partial \hat{V}(s_t)}{\partial \vec{\theta}} - \gamma\frac{\partial \hat{V}(s_{t+1})}{\partial \vec{\theta}} \right)$$
In the linear case, we have,
$$\Delta\theta_i = \alpha \left( r_{t+1} + \gamma\hat{V}(s_{t+1}) - \hat{V}(s_t) \right) \left( \phi_i(s_t) - \gamma\phi_i(s'_{t+1}) \right)$$
The successor states, $s_{t+1}$ and $s'_{t+1}$, should be generated independently, which may mean
that the method is often impractical without a model to generate a sample successor state
[150]. Also, $\phi_i(s_t) - \gamma\phi_i(s'_{t+1})$ may often be small, leading to very slow learning. However,
Baird also discusses ways of combining this approach with the linear LMS method in a way
that attempts to maximise learning speed while also ensuring stability. A later version of
this approach [9] combines the method with value-functionless direct policy search methods,
such as REINFORCE [170].
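The difference between the two derivations can be seen side by side in the sketch below (my own, for the linear case with assumed feature vectors); the residual version includes the successor's features in the gradient term:

import numpy as np

def residual_gradient_update(theta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """One residual-gradient update for a linear value function, following the
    derivation above.  Ideally phi_s_next comes from an independently sampled
    successor state; this sketch ignores that requirement."""
    delta = r + gamma * np.dot(phi_s_next, theta) - np.dot(phi_s, theta)
    return theta + alpha * delta * (phi_s - gamma * phi_s_next)

def semi_gradient_td0_update(theta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """The ordinary TD(0) update for comparison: the bootstrapped target is
    treated as a constant, so only phi_s appears in the gradient term."""
    delta = r + gamma * np.dot(phi_s_next, theta) - np.dot(phi_s, theta)
    return theta + alpha * delta * phi_s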
Averagers
The term averager is due to Gordon [49]. The key property of averagers is that they are
non-expansions – they cannot extrapolate from the training values. In [49] Gordon
notes that i) the value-iteration operator is a function that has the contraction property,
ii) many function approximation schemes can be shown to be non-expansions, and
iii) any functional composition of a contraction and a non-expansion is a function that is
also a contraction. This makes it possible to prove that synchronous value-iteration will
converge upon a fixed point in the set of parameters, if one exists, provided that the function
approximator can be shown to be a non-expansion.
Many mean squared error minimising methods do not have this property. A special kind of
averager method is presented in the next section, for which it is clear that any discounted
return based RL method cannot possibly diverge (to infinity) regardless of the sampling
distribution of return and distribution of updates.

5.7 Linear Averagers


In the LMS scheme, we were minimising:
$$\frac{1}{2}\sum_p \left( z_p - f(\vec{x}_p, \vec{\theta}) \right)^2.$$
By providing a slightly different error function to minimise,
$$\frac{1}{2}\sum_p \sum_i^n x_{ip} \left( z_p - \theta_i \right)^2,$$
then the gradient descent rule (5.3) yields a slightly different update rule:
$$\theta_i \leftarrow \theta_i + \alpha_{ip} \left( z_p - \theta_i \right) x_{ip}, \quad (5.13)$$
or,
$$\theta_i \leftarrow \theta_i + \alpha_{ip} \left( \text{desired output}_p - \theta_i \right) \times \text{contribution of } \theta_i \text{ to output}$$
Here, the update minimises the weighted (by $x_i$) squared errors between each $\theta_i$ and the
target output, rather than between the actual and target outputs. As before, the learning
rate $\alpha_{ip}$ should be declined over time. This method is referred to as a linear averager to
differentiate it from the linear LMS gradient descent method.
To make the analysis of this method more straightforward, it is also assumed that the inputs
to the linear averager are normalised,
$$x_{ip} = \frac{x'_{ip}}{\sum_k x'_{kp}},$$
and that $0 \le x_{ip} \le 1$. The purpose of this is to make it clear that $\sum_i x_{ip}\theta_i$ is a weighted
average of the components of $\vec{\theta}$. It is also assumed that $0 \le \alpha_{ip}x_{ip} \le 1$, in which case
after update (5.13), $|z_p - \theta'_i| \le |z_p - \theta_i|$ must hold.³ In this way it also becomes clear that
each individual $\theta_i$ is moving closer to $z_p$, since update (5.13) has a fixed-point only where
$z_p = \theta_i$. This does not happen with update (5.5), where $z_p = f(\phi(s_p), \vec{\theta})$ is the update's
fixed-point. Note that in the linear averager scheme, adjustments may still be made where
$z_p = f(\phi(s_p), \vec{\theta})$.
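A minimal sketch of update (5.13) (my own, with illustrative values) is given below; note that each parameter moves toward the target value itself:

import numpy as np

def linear_averager_update(theta, x, z, alpha):
    """Linear averager update (Equation 5.13): each parameter theta_i moves
    toward the target z, weighted by its (normalised) feature value x_i.
    Contrast with the LMS rule, which moves toward zero prediction error."""
    x = x / np.sum(x)                       # normalised inputs, summing to 1
    return theta + alpha * (z - theta) * x  # elementwise: theta_i += a*x_i*(z - theta_i)

theta = np.array([0.0, 10.0, -5.0])
theta = linear_averager_update(theta, np.array([0.2, 0.8, 0.0]), z=2.0, alpha=0.5)
print(theta)   # each theta_i has moved toward z by at most alpha * x_i of the gap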
Function approximators that can be trained using this scheme include state-aggregation
(state-aliasing and nearest neighbour methods), k-nearest neighbour, certain kernel based
learners (such as RBF methods with fixed centres and basis widths), piece-wise and barycentric
linear interpolation [80, 37, 93], and table-lookup. All of these methods differ only by
their choice of input mapping, $\phi$, which is often normalised. Many of these methods are
already employed in RL (see [136, 167, 140, 117, 93, 97] for recent examples).
Special cases of this framework for which convergence theorems exist are: Q-learning and
TD(0) with stationary exploration policies and state-aggregation representations [136];
value-iteration where the function approximator update can be shown to be a non-expansion
[48], or is a state-aggregation method [21, 159], or is an adaptive locally linear representation
[93, 97]. The value-iteration based methods assume that a model of the environment
is available, and they are also deterministic algorithms and are easier to analyse as a result.
The most significant (and most recent) result is by Szepesvári, where the "almost
sure" convergence of Q-learning with a stationary exploration policy has been shown with
interpolative function approximators whose parameters are modified with update (5.13)
[152].
Figure 5.7 compares the linear LMS (update (5.5)) and linear averager (update (5.13))
methods in a standard supervised learning setting. Linear averagers appear to suffer from
over-smoothing problems if broad input features are used, while the use of narrow input
features (for any function approximator) limits the ability to generalise, since the values of
many input features will be near or at zero, and their associated parameters adjusted by
similarly small amounts. The method does not exaggerate the training data in the output
in the way that update (5.5) can. The exaggeration problem is the source of divergence in
RL.⁴

³ These special assumptions may be relaxed where Theorem 2 (below) can be shown to hold.

[Figure 5.7 graphics: columns compare the linear LMS method (update (5.5)) and the linear averager (update (5.13)); rows show the learned function $f(\phi(s), \vec{\theta})$, the input feature shapes $\phi(s)_i$, and the scaled features $\phi(s)_i\theta_i$.]
Figure 5.7: The effect of input feature width and cost functions on incremental linear
gradient descent with different cost schemes. (top) A comparison of the functions learned
by parameter update rules (5.5) and (5.13) when the training set is taken from 1000 random
samples of the target step function. Note that the averager method learns a function that is
entirely contained within the vertical bounds of the target function. In contrast, the linear
LMS gradient descent method does not, but finds a fit with a lower mean squared error.
This exaggeration of the training data, in combination with the use of bootstrapping, is the
cause of divergence when using function approximation with RL.
(middle) The input feature shape used by each method in each column. 50 such features,
overlapping and spread uniformly across the extent of the figure, provided the input to the
linear output function. Note that update (5.5) still learns well with broad input features.
In contrast, the averager method suffers from over-smoothing of the output function and
cannot well represent the steep details of the target function.
(bottom) A selection of the learned parameters over the extent where their inputs are non-zero.
Note that for the averager method, the learned parameters are the average of the
target function over the extent where the parameter contributes to the output. For both
methods, the learned function in the top row is an average of the functions in the bottom
row (since the input features were normalised).
However, as follows intuitively from its error criterion, the linear LMS method finds a
fit with a lower mean squared error in the supervised learning case.
The next two sections show that function approximators which do not exaggerate cannot
diverge when used for return estimation in RL. In particular, the stability (i.e. boundedness)
of the linear averager method is proven for all discounted return estimating RL algorithms.
The rationale behind the proof is simply:
i) All discounted return estimates which bootstrap from $f(\cdot, \vec{\theta})$ have specific bounds.
ii) Adjusting $\vec{\theta}$ using the linear averager update to better approximate such a return
estimate cannot increase these bounds.

⁴ In some work, this exaggeration (extrapolation of the range of training target values) is sometimes
confused with extrapolation (which refers to function approximator queries outside the range of states
associated with the training data).
5.7.1 Discounted Return Estimate Functions are Bounded Contractions
Theorem 1 Let $r$ be a bounded real value such that $r_{min} \le r \le r_{max}$. Define a bound on
the maximum achievable discounted return as $[V_{min}, V_{max}]$ where,
$$V_{min} = r_{min} + \cdots + \gamma^k r_{min} + \cdots = \frac{r_{min}}{1 - \gamma},$$
$$V_{max} = r_{max} + \cdots + \gamma^k r_{max} + \cdots = \frac{r_{max}}{1 - \gamma},$$
for some $\gamma$, $0 \le \gamma < 1$. Let $z(v) = r + \gamma v$.
Under these conditions, $z$ is a bounded contraction. That is to say that:
i) if $v > V_{max}$, then $z(v) < v$ and $z(v) \ge V_{min}$,
ii) if $v < V_{min}$, then $z(v) > v$ and $z(v) \le V_{max}$,
iii) if $V_{min} \le v \le V_{max}$, then $V_{min} \le z(v) \le V_{max}$,
for any $v \in \mathbb{R}$.

Proof: i) Assume that $v > V_{max}$ and show that the following holds,
$$z(v) < v \iff r + \gamma v < v \iff \frac{r}{1 - \gamma} < v,$$
which follows from $r \le r_{max}$ since,
$$\frac{r}{1 - \gamma} \le \frac{r_{max}}{1 - \gamma} = V_{max} < v.$$
This proves the first part of i).
We have in general:
$$\frac{r_{min}}{1 - \gamma} = V_{min} \iff r_{min} = (1 - \gamma)V_{min} \iff r_{min} + \gamma V_{min} = V_{min}. \quad (5.14)$$
Since $v > V_{max} \ge V_{min}$ and $\gamma \ge 0$,
$$r_{min} + \gamma v \ge V_{min}.$$
Since $r \ge r_{min}$,
$$r + \gamma v \ge V_{min} \implies z(v) \ge V_{min}.$$
This proves the second part of i).
ii) Is shown in the same way.
iii) Assume that $V_{min} \le v$ and show that the following holds,
$$V_{min} \le z(v) \iff V_{min} \le r + \gamma v.$$
This holds since (from (5.14)),
$$r + \gamma v \ge r_{min} + \gamma v \ge r_{min} + \gamma V_{min} = V_{min}.$$
□
The above proof method can be applied to a number of reinforcement learning algorithms.
For instance, for Q-learning (where $z = r_{t+1} + \gamma\max_a \hat{Q}(s_{t+1}, a)$), by redefining $v$ as
$\max_a \hat{Q}(s_{t+1}, a)$, $r$ as $r_{t+1}$, and each remaining $v$ as $\hat{Q}(s_{t+1}, a_{t+1})$, the proof holds without
further modification. Similarly, the method can be applied to the return estimates
used by all single step methods (which includes TD(0), SARSA(0), V(0), the asynchronous
value-iteration and value-iteration updates) in the same way.
Contraction bounds for actual return methods (i.e. non-bootstrapping or Monte-Carlo methods)
are more straightforward. Simply note that if,
$$z = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$$
and $r_{min} \le r_i \le r_{max}$ for $i \in \mathbb{N}$, then $V_{min} \le z \le V_{max}$.
Contraction bounds for $\lambda$-return methods (i.e. forward view methods as in [150]) can also
be established by showing that n-step truncated corrected return estimates,
$$z^{(n)} = \left( \sum_{i=1}^{n-1} \gamma^{i-1} r_i \right) + \gamma^n v_n$$
(with $r_{min} \le r_i \le r_{max}$) are a bounded contraction. This can be done by a method similar to
the proof of Theorem 1. Note that any weighted sum of the form,
$$\sum_i^n x_i z_i,$$
with weights,
$$\sum_i^n x_i = 1 \quad \text{and} \quad 0 \le x_i \le 1,$$
has a bound entirely contained within $[\min_i z_i, \max_i z_i]$. It has been shown in other work that
$\lambda$-return estimates are such a weighted sum of n-step truncated corrected return estimates
[163],
$$z^\lambda = (1 - \lambda)\left( z^{(1)} + \lambda z^{(2)} + \lambda^2 z^{(3)} + \cdots \right),$$
[Figure 5.8 graphics: a plot of value against state; the band between $\min(V_{min}, f_{min})$ and $\max(V_{max}, f_{max})$ contains the current output function $f$ and all possible return estimates (all training data).]
Figure 5.8: By Theorem 1, all possible discounted return estimates must be within the
bounds shown since $v$ may only take values bounded within $[f_{min}, f_{max}]$. Only return
estimates within these bounds can possibly be passed as training data to the function
approximator.
and so $\lambda$-return estimates are also bounded contractions. More intuitively, note that $\lambda$-return
estimates occupy a space of functions between the 1-step methods such as TD(0)
and Q-learning (where $\lambda = 0$, $n = 1$), and the actual return estimates (where $\lambda = 1$,
$n = \infty$).

5.7.2 Bounded Function Approximation

Define the current bounds on the output of some function approximator to be $[f_{min}, f_{max}]$,
where
$$f_{min} = \min_{s \in S} f(\phi(s), \vec{\theta}), \qquad f_{max} = \max_{s \in S} f(\phi(s), \vec{\theta}).$$
A corollary of Theorem 1 is that,
$$\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}),$$
where $z$ is any of the discounted return estimates given in the previous section, including
any bootstrapping estimates defined in terms of $f$ (e.g. where $v = \hat{V}(s) = f(\phi(s), \vec{\theta})$, in the
case of TD(0)). In other words, the values of possible training data provided to a function
approximator must lie within the combined bounds of $[V_{min}, V_{max}]$ and $[f_{min}, f_{max}]$ (see
Figure 5.8).
Since return estimate functions must lie in these bounds, and due to the following theorem
(satisfied by the linear averager method), the linear averager method is bounded and so
cannot diverge to infinity.

Theorem 2 Define $\vec{\theta}'$ to be the new parameter vector after training with some arbitrary
target $z \in \mathbb{R}$. Let the bounds of the new output function, $f'$, be defined as,
$$f'_{min} = \min_{s \in S} f(\phi(s), \vec{\theta}'), \qquad f'_{max} = \max_{s \in S} f(\phi(s), \vec{\theta}').$$
If,
$$\min(V_{min}, f_{min}) \le f'_{min} \le f'_{max} \le \max(V_{max}, f_{max})$$
for any possible training example, then the bounds of $f$ cannot diverge.

Proof: It follows from Theorem 1 that,
$$[\min(V_{min}, f_{min}), \max(V_{max}, f_{max})],$$
entirely contains,
$$[\min(V_{min}, f'_{min}), \max(V_{max}, f'_{max})].$$
Thus, further training with any possible training data cannot expand the bounds of $f$
beyond its initial bounds before training. □
Many function approximators satisfy the conditions of this theorem for,
$$\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}),$$
(which always holds for the discounted return functions discussed).
Theorem 3 The linear averager function approximator presented in Section 5.7 satisfies
the conditions of Theorem 2 for,
$$\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}).$$
Proof: Note simply that for,
$$\theta'_i \leftarrow \theta_i + \alpha_{ip}(z - \theta_i)x_i$$
where $0 \le \alpha_{ip}x_i < 1$, $\theta'_i$ is no further from $z$ than $\theta_i$ was initially. Since,
$$\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}),$$
then $\theta'_i$ must also be at least as close to being contained within these bounds as it was to
begin with. If it was already within these bounds it remains so, since $z$ is in these bounds.
Also, since $f$ is a weighted average of the components of $\vec{\theta}$, it is bounded by $[\min_i \theta_i, \max_i \theta_i]$
for any input state. Since, as a result of the update, the bounds of all the components of $\vec{\theta}$
are either unchanged, or moving to be contained within $[\min(V_{min}, f_{min}), \max(V_{max}, f_{max})]$,
so then are the bounds of $f$. □
The linear LMS gradient descent methods do not satisfy Theorem 2. The exaggeration
effects in Figure 5.7 are an illustration of this.
5.7.3 Boundedness Example
Figure 5.9 shows Tsitsiklis and Van Roy's counter-example [160]. In the linear LMS method,
divergence with TD(0) can occur if the update distribution differs from the online one. For
instance, if updates are made to $s_1$ and $s_2$ with equal frequency, $\theta$ diverges to infinity. This
occurs since, when updating from $s_1$, the update is:
$$\theta_{t+1} \leftarrow \theta_t + \alpha(z_{s_1} - \theta_t)$$
$$\leftarrow \theta_t + \alpha(r + \gamma V(s_2) - \theta_t)$$
$$\leftarrow \theta_t + \alpha(2\gamma\theta_t - \theta_t)$$
$$\leftarrow \theta_t(1 + \alpha(2\gamma - 1))$$
Thus $\theta_{t+1}$ is greater in magnitude (i.e. greater in error, since $\theta = 0$ is optimal) than $\theta_t$ for
$(1 + \alpha(2\gamma - 1)) > 1$. Thus, where $2\gamma > 1$ holds and for any positive $\alpha$, this method increases
in error for each update from $s_1$. Only updates from $s_2$ decrease $\theta$. Thus if $s_2$ is updated
insufficiently in comparison to $s_1$ (as is the case for the uniform distribution), divergence to
infinity occurs. The online update distribution ensures that $V(s_1)$ is sufficiently updated
to allow for convergence.
The linear averager method converges upon $\theta = 0$ given $0 < \alpha < 1$. The features are
assumed to be normalised ($\phi(s_2) = 1$, not 2) and the method therefore reduces to a
standard state-aggregation method. For transitions, $s_1 \to s_2$,
$$\theta_{t+1} \leftarrow \theta_t + \alpha(r + \gamma V(s_2) - \theta_t)$$
$$\leftarrow \theta_t + \alpha(\gamma\theta_t - \theta_t)$$
$$\leftarrow \theta_t(1 + \alpha(\gamma - 1))$$
and so $\theta$ decreases in magnitude for $0 < \alpha < 1$, $0 \le \gamma < 1$.
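The two per-update factors derived above can be checked numerically. The sketch below (my own; the step size alpha = 0.1 is an assumed value, gamma follows Figure 5.9) repeatedly applies the update from s1 under each method:

gamma, alpha = 0.99, 0.1

theta_lms, theta_avg = 1.0, 1.0
for _ in range(100):
    # LMS update from s1: target r + gamma*V(s2) = 2*gamma*theta, prediction theta.
    theta_lms = theta_lms * (1 + alpha * (2 * gamma - 1))
    # Linear averager update from s1 with normalised features, so V(s2) = theta.
    theta_avg = theta_avg * (1 + alpha * (gamma - 1))

print(theta_lms)   # grows geometrically: each factor 1 + alpha*(2*gamma - 1) exceeds 1
print(theta_avg)   # shrinks toward 0: each factor 1 + alpha*(gamma - 1) is below 1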
Caveat. In every case, the linear averager method is guaranteed to be bounded. However,
because the linear averager method reduces to state aggregation, it is possible that the
example above may be a "straw man". It only shows an example where the LMS method
diverges and the linear averager method does not. It may be that there are scenarios in
which the LMS method converges upon the optimal solution while the averager method
does not, or where it converges to its extreme bounds. A fine bottle of single malt whisky
may be claimed by the first person to send me the page number of this sentence.
5.7.4 Adaptive Representation Schemes
Many forms of function approximator can adapt their input mapping $\phi()$ by shifting which
input states activate which input features (as does an RBF network [68]), or simply by
adding more features and more parameters [117, 93, 131]. In such cases, it is often easy
to provide guarantees that the range of outputs is no larger as a result of this adaptation
(for example by ensuring that new parameters are some average of existing ones). In this
way, these methods can also be guaranteed to be bounded. An example of an adaptive
representation scheme is provided in the next chapter.

[Figure 5.9 graphics: states $s_1$ and $s_2$ and a terminal state, with one transition labelled $1 - \epsilon$; $\hat{V}(s_1) = \theta$, $\hat{V}(s_2) = 2\theta$, $\hat{V}(s_{term}) = 0$.]
Figure 5.9: Tsitsiklis and Van Roy's counter-example. A single parameter is used to represent
the values of two states. All rewards are zero on all transitions and so the optimal
value of $\theta$ is zero. The feature mapping is arranged such that $\phi(s_1) = 1$ and $\phi(s_2) = 2$.
$\gamma = 0.99$ and $\epsilon = 0.01$.
5.7.5 Discussion
Gordon demonstrated that value-iteration with approximated $\hat{V}$ must converge upon a fixed
point in the set of parameters for any function approximation scheme that has the non-expansion
property [48]. This follows from noting simply that the value-iteration update
is known to be a contraction, and that any functional composition of a non-expansion and
contraction is also a contraction to a fixed point (if one exists).
The results here demonstrate the boundedness of general discounted RL with similar function
approximators for analogous reasons by showing that all discounted return estimate
functions (with bounded rewards) are bounded contractions (i.e. contractions to within a
bounded region), that the linear averager update is a non-expansion, and that the composition
of these functions is also a bounded contraction. This provided a more general (and more
accessible) demonstration of why function approximator updates having the non-expansion
property cannot lead to an unbounded function, and that,
$$f(\phi(s), \vec{\theta}) \in [\min(V_{min}, f^0_{min}), \max(V_{max}, f^0_{max})],$$
are the bounds on the output of $f$ over its lifetime ($[f^0_{min}, f^0_{max}]$ denotes the initial bounds
on the output of $f$ for all $s \in S$). This is a more general statement than is found in [48]
(it applies to more RL methods), but it is weaker in the sense that convergence to a fixed-point
is not shown. However, this work directly applies to stochastic algorithms, whereas
the method in [48] considers only deterministic algorithms where a model of the reward and
environment dynamics must be available.
Although convergence can be shown with the linear LMS method for some RL algorithms
(e.g. for TD($\lambda$)), this only holds given restricted update distributions [10, 160]. Divergence
to infinity can be shown in cases where this does not hold. This is a problem for control
optimisation methods such as Q-learning (which has TD(0) as a special case) where arbitrary
exploration of the environment is desired. It should also be noted that the linear averager
method cannot diverge no matter how the return estimates are sampled. This is surprising
since the two gradient descent schemes differ only by the error measure being minimised.
However, linear averagers appear to be limited to using narrow input features where steep
details in the target function need to be represented. Following the review in Section 5.6,
this appears to be a common tradeoff in successfully applied function approximators.

5.8 Summary
A variety of representation methods are available to store and update value and Q-functions.
In increasing levels of sophistication and empirical success, but decreasing levels of provable
stability, these are: i) table lookup, ii) state aggregation, iii) averagers, iv) linear LMS
methods and, v) non-linear methods (e.g. MLPs).
A number of heuristics have been reviewed that appear to be useful in aiding the stability
of these methods: making updates with the online, on-policy distributions, the use of fixed
policy evaluation methods rather than greedy policy evaluating methods, the use of function
approximators that do not exaggerate training data, the use of local input features, and the
use of non-bootstrapping methods.
It is not clear that attempting to minimise the error between a function approximator's
output and the target training values is a good strategy for RL. We have seen that some
methods which attempt to do just this may diverge to infinity, while some methods that
do not, and learn prototypical state values instead, cannot (although they may still suffer
in other ways where bootstrapping is used). Also, for control tasks, it does not follow that
predictive accuracy is a necessary requirement for good policies [5, 150]. This is also seen
in methods such as SARSA($\lambda$) and Peng and Williams' Q($\lambda$), where good policies may be
learned even where there is considerable error in the Q-function. Although, similarly, it is
straightforward to construct situations where reasonably accurate Q-functions (i.e. close to
$Q^*$) have a greedy policy that is extremely poor.
Chapter 6

Adaptive Resolution Representations

Chapter Outline
This chapter introduces a new method for representing Q-functions for continuous
state problems. The method is not directly motivated by minimising
a function of return estimate error, but aims to refine the Q-function representation
in the areas of the state-space that are most critical for decision
making.

6.1 Introduction


There are many questions that the designer of a learning system will need to answer in order
to build a suitable function approximator to represent the value function of a reinforcement
learning agent. How are the feature mappings for a function approximator decided upon?
What are appropriate feature widths and shapes for the problem? How many features should
be used? Should they be uniformly distributed? If not, which areas in the state-space are
the most important to represent? And so on.
In order to answer these questions, help may be found by exploiting some knowledge about
the problem being solved. However, in many tasks the problem may be too abstract or
ill-understood to do this. The result is often an expensive process of trial and error to
find a suitable feature configuration. The function approximation methods presented in
the previous chapter are "static" in the sense that their input mappings or the available
number of adjustable parameters are fixed. In general, this also imposes fixed bounds upon
the possible performance that the system may achieve. If a function approximator's initial
configuration was poorly chosen, poor learning and poor performance may result.

This chapter discusses autonomous, adaptive methods for representing Q-functions. The
initial limits on the system's performance are removed by adding resources to the representation
as needed. Over time, the representation is improved through a process of
general-to-specific refinement. Although a simple state-aggregation representation is used
(for ease of implementation), traditional problems often experienced with these methods
can be avoided (e.g. lack of fine control with coarse aggregations and slow learning with fine
representations). In the new approach, during the initial stages of learning, broad features
allow good generalisation and rapid learning, while in the later stages, as the representation
is refined, small details in the learned policy may be represented. Unlike most function
approximation methods, the method is not motivated by value function error minimisation,
but by seeking out good quality policy representations. It is noted that i) good quality
policies can be found long before an accurate Q-function is found (the success of methods
such as Peng's and Williams' Q($\lambda$) demonstrates this), and that, ii) in continuous spaces
there are often large areas where actions under the optimal policy are the same.

6.2 Decision Boundary Partitioning (DBP)


In this section, a new algorithm is provided that recursively refines the Q-function representation
in the parts of the state-space that appear to be most important for decision making
(i.e. where there is a change in the action currently recommended by the Q-function). The
state-space is assumed to be continuous, and the state transition and reward functions
for this space are assumed to be Markov.

6.2.1 The Representation


The Q-function is represented by partitioning the state-space into hyper-volumes. In practice, this is implemented through a kd-tree (see Figure 6.1) [47, 89]. The root node of the tree represents the entire state-space. Each branch of the tree divides the space into two equally sized discrete sub-spaces halfway along one axis. Only the leaf nodes contain any data. Each leaf stores the Q-values for a small hyper-rectangular subset of the entire state-space. From here on, the discrete areas of continuous space that the leaf nodes cover are referred to as regions. The represented Q-function is uniform within regions and discontinuities exist between them. The aggregate regions are treated as discrete states from the point of view of the value-update rules.
As a state-aggregation method, following the results in Section 5.7 in the last chapter and also those of Singh [138], the method can be expected to be stable (i.e. not prone to diverge to infinity) when used with most RL algorithms.
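To make the data structure concrete, the following is a minimal Python sketch of the kind of kd-tree described above. It is an illustration only, not the thesis implementation; the names (Leaf, Branch, lookup, split) are assumptions made here.

    # A minimal kd-tree sketch for storing Q-values in hyper-rectangular regions.
    # Branch nodes split one axis at its midpoint; only leaves hold data.

    class Leaf:
        def __init__(self, low, high, n_actions, q_init=0.0):
            self.low, self.high = low, high      # bounds of the region (one value per dimension)
            self.q = [q_init] * n_actions        # Q-value per action, uniform within the region
            self.visits = [0] * n_actions        # VIS_i(a): action selections since creation

    class Branch:
        def __init__(self, axis, mid, left, right):
            self.axis, self.mid = axis, mid      # split dimension and midpoint
            self.left, self.right = left, right  # sub-trees covering each half

    def lookup(node, state):
        """Return the leaf (region) containing a continuous state."""
        while isinstance(node, Branch):
            node = node.left if state[node.axis] < node.mid else node.right
        return node

    def split(leaf, axis, n_actions):
        """Replace a leaf by a branch over two halves, copying the parent's Q-values."""
        mid = 0.5 * (leaf.low[axis] + leaf.high[axis])
        lo_high = list(leaf.high); lo_high[axis] = mid
        hi_low = list(leaf.low);   hi_low[axis] = mid
        left  = Leaf(list(leaf.low), lo_high, n_actions)
        right = Leaf(hi_low, list(leaf.high), n_actions)
        left.q, right.q = list(leaf.q), list(leaf.q)   # children inherit the prior estimates
        return Branch(axis, mid, left, right)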

6.2.2 Refinement Criteria


Periodically, the resolution of an area is increased by sub-dividing a region into two smaller regions. How should this be done overall? Subdividing regions uniformly (i.e. subdividing every region) will lead to a doubling of the memory requirements. A more careful approach is required to avoid such an exponential growth in the number of regions as the resolution increases.

Figure 6.1: A kd-tree partitioning of a two-dimensional space.
Consider the following learning task: an agent should maximise its return, where

    S = \{ s \mid 0^\circ \le s < 360^\circ \}
    A = \{ L, R \}   (that is, "go left" and "go right")

    P^a_{ss'} = \begin{cases} 1, & \text{if } s' = s - 15^\circ \text{ and } a = L, \\ 1, & \text{if } s' = s + 15^\circ \text{ and } a = R, \\ 0, & \text{otherwise.} \end{cases}

    R^a_s = \begin{cases} \sin(s - 15^\circ), & \text{if } a = L, \\ \sin(s + 15^\circ), & \text{if } a = R. \end{cases}

    \gamma = 0.9
The world is circular such that f(s) = f(s + 360°). Although this is a very simple problem, finding and representing a good estimate of the optimal Q-function to any degree of accuracy may prove difficult for some classes of function approximator; for instance, the function is both non-linear and non-differentiable.

Figure 6.2: The optimal Q-function for SinWorld. The decision boundaries are at s = 90° and s = 270°, where Q*(s, L) and Q*(s, R) intersect.

However, of particular interest in this and many practical problems is the apparent simplicity of the optimal policy compared to the complexity of its Q-function:

    \pi^*(s) = \begin{cases} L, & \text{if } 90^\circ \le s < 270^\circ, \\ R, & \text{otherwise.} \end{cases}     (6.1)

It is trivial to construct and learn a two-region Q-function which finds the optimal policy given only a few experiences. This, of course, relies upon knowing the decision boundaries (i.e. where Q*(s, L) and Q*(s, R) intersect) in advance (see Figure 6.2).
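For concreteness, the SinWorld dynamics and reward defined above, together with the optimal policy of Equation (6.1), can be written down in a few lines. This is only a sketch of the task; the function names are illustrative assumptions and states are kept in degrees.

    import math

    LEFT, RIGHT = 0, 1
    GAMMA = 0.9

    def step(s, a):
        """One SinWorld transition: returns (next_state, reward)."""
        s_next = (s - 15) % 360 if a == LEFT else (s + 15) % 360
        # R^a_s = sin(s -/+ 15 degrees), i.e. the height of the sine curve at the state moved to.
        return s_next, math.sin(math.radians(s_next))

    def optimal_action(s):
        """The optimal policy of Equation (6.1)."""
        return LEFT if 90 <= s < 270 else RIGHT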
Decision boundaries are used to guide the partitioning process since it is here that one can expect to find improvements in policies at a higher resolution; in areas of uniform policy, there is no performance benefit in knowing, in twice as much detail, that the policy is the same.
While it is true that, in general, we cannot determine π* without first knowing Q*, in many practical cases of interest it is often possible to find near-optimal or even optimal policies with very coarsely represented Q-functions. A good estimate of π* is found if, for every region, the best Q-value in the region is, with some minimum degree of confidence, significantly greater than the other Q-values in the same region.
Similarly, there is little to be gained by knowing more about regions of space where there is a set of two or more near equivalent best actions which are clearly better than the others. To cover both cases, decision boundaries are defined to be the parts of the state-space where i) the greedy policy changes and ii) the Q-values of those greedy actions diverge after intersecting.
It is important to note that the cost of representing decision boundaries is a function of their surface size and not necessarily of the dimensionality of the state-space. Hence, if there are very large areas of uniform policy, then there can be a considerable reduction in the amount of resources required to represent a policy to a given resolution when compared to uniform resolution methods.
6.2.3 The Algorithm
The partitioning process considers every pair of adjacent regions in turn. The decision of whether to further divide the pair is formed around the following heuristic:
• do not consider splitting if the highest-valued actions in both regions are the same (i.e. there is no decision boundary),
• only consider splitting if all the Q-values for both regions are known to a "reasonable" degree of confidence,
• only split if, for either region, taking the recommended action of one region in the adjacent region is expected to be significantly worse than taking another, better, action in the adjacent region.
The second point is important insofar as the decision to split regions is based solely upon estimates of Q-values. In practice it is very difficult to measure confidence in Q-values since they may ultimately be defined by the values of currently unexplored areas of the state-action space, or by parts of the space which only appear useful at higher resolutions (although see [62, 85] for some confidence estimation methods). For both of these reasons the Q-function is non-stationary during learning, which itself causes problems for statistical confidence measures. The naive solution applied here is to require that all the actions in both regions under consideration must have been experienced (and so had their Q-values re-estimated) some minimum number of times, VIS_min, which is specified as a parameter of the algorithm. This also has the added advantage of ensuring that infrequently visited states are less likely to be considered for partitioning.
In the final part of the heuristic, the assumption is made that the agent suffers some "significant loss" in return if it cannot determine exactly where it is best to follow the recommended action of one region instead of the recommended action of an adjacent region. If the best action of one region, when taken in an adjacent region, is little better than any of the other actions in the adjacent region, then it is reasonable to assume that between the two regions the agent will not perform much better even if it could decide exactly where each action is best. The "significant loss" threshold, δ_min, is the second and final parameter for the algorithm. Figure 6.3 shows situations in which partitioning occurs.

Figure 6.3: The Decision Boundary Partitioning Heuristic. The diagrams show Q-values in pairs of adjacent regions. The horizontal axis represents state, and the vertical axis represents value. Splits occur where the greedy action changes across the boundary and the loss for taking the neighbouring region's recommended action appears significant ("Should take a1 here?", "Should take other action here?"); no split is made where there is no change in policy, where the likely improvement is small, or where stepped functions are always expected.
Setting δ_min > 0 attempts to ensure that the partitioning process is bounded. For differentiable Q-functions, as the regions become smaller on either side of the decision boundary, the loss for taking the action suggested by the adjacent region must eventually fall below δ_min. In the case where decision boundaries occur at discontinuities in the Q-function, unbounded partitioning along the boundary is the right thing to do, provided that there remains the expectation that the extra partitions can reduce the loss that the agent will receive. The fact that there is a boundary indicates that there is some better representation of the policy that can be achieved.¹


In both cases, a practical limit to partitioning is also imposed by the amount of exploration available to the agent. The smaller a region becomes, the less likely it is to be visited. As a result, the confidence in the Q-values for a region is expected to increase more slowly the smaller the region is.
The remainder of this section is devoted to a detailed description of the algorithm. To abstract from the implementation details of a kd-tree, the learner is assumed to have available the set REGIONS, where reg_i ∈ REGIONS and reg_i = ⟨Vol_i, Q_i, VIS_i⟩. Q_i(a) is the Q-value for each action, a, within the region, Vol_i is the description of the hyper-rectangle that reg_i covers, and VIS_i(a) records the number of times an action has been chosen within Vol_i since the region was created. The choice of whether to split a region is made as follows:

1) Find the set of adjacent region pairs:

       ADJ = { ⟨reg_i, reg_j⟩ | reg_i, reg_j ∈ REGIONS ∧ neighbours(reg_i, reg_j) }

2) Let SPLIT be the set of regions to subdivide (initially empty).
3) for ⟨reg_i, reg_j⟩ ∈ ADJ
   3a) a_i = argmax_a Q(reg_i, a)
   3b) a_j = argmax_a Q(reg_j, a)
   3c) Find the estimated loss given that, for some states in the region, it appears better to take the recommended action of the adjacent region:
   3d) δ_i = |Q(reg_i, a_i) − Q(reg_i, a_j)|
   3e) δ_j = |Q(reg_j, a_j) − Q(reg_j, a_i)|
   3f) if (a_i ≠ a_j) and                                      (policy difference)
       (δ_i ≥ δ_min or δ_j ≥ δ_min) and                        (sufficient difference)
       ({a ∈ A | VIS_i(a), VIS_j(a) ≥ VIS_min} ≠ ∅)            (sufficient value approximation)
   3f-1) SPLIT := SPLIT ∪ {reg_i, reg_j}
4) Partition every region in SPLIT at the midpoint of its longest dimension, maintaining the prior estimates for each Q-value in the new regions.
5) Mark each new region as unvisited: VIS(a) := 0 for all a.
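A direct transcription of the split test in step 3f might look as follows, assuming the leaf regions carry per-action value and visit-count fields (as in the kd-tree sketch earlier); delta_min and vis_min correspond to δ_min and VIS_min. This is an illustrative sketch rather than the original implementation.

    def argmax(xs):
        return max(range(len(xs)), key=lambda i: xs[i])

    def should_split(reg_i, reg_j, delta_min, vis_min):
        """Decide whether an adjacent pair of regions lies on a decision boundary worth refining."""
        a_i, a_j = argmax(reg_i.q), argmax(reg_j.q)
        if a_i == a_j:
            return False                                   # no change in greedy policy
        # Estimated loss for taking the neighbour's recommended action in each region.
        delta_i = abs(reg_i.q[a_i] - reg_i.q[a_j])
        delta_j = abs(reg_j.q[a_j] - reg_j.q[a_i])
        if delta_i < delta_min and delta_j < delta_min:
            return False                                   # likely improvement is small
        # Require sufficient experience of the actions in both regions.
        if not any(reg_i.visits[a] >= vis_min and reg_j.visits[a] >= vis_min
                   for a in range(len(reg_i.q))):
            return False
        return True

    # Regions for which should_split() holds for any neighbour are then bisected at the
    # midpoint of their longest (normalised) dimension, inheriting the parent's Q-values.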
A good strategy for dividing regions is to always divide along the longest dimension [86] after first normalising the lengths by the size of the state-space. This method does not require that distances along each axis be directly comparable and simply ensures that partitioning occurs in every dimension with equal frequency. The obvious strategy of dividing along the axis of the face that separates the regions appeared to work particularly poorly; in most experiments, this led to some regions having a very large number of neighbours.

¹ This isn't true in the unlikely case that the regions are already exactly separated at the boundary. But if this is the case, continued partitioning is still necessary to verify it.
6.2.4 Empirical Results

In this section the variable resolution algorithm is evaluated empirically on three different learning tasks. In all experiments the 1-step Q-learning algorithm is used. Although faster learning can be achieved with other algorithms, Q-learning is employed here because of its ease of implementation and computational efficiency.² Also, throughout, the exploration policy used is ε-greedy [150]. In addition, upon entering a region the agent is committed to following a single action until it leaves the region. This prevents the exploration strategy from dithering within a region and allows larger parts of the environment to be covered more quickly.
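The exploration rule just described (ε-greedy selection with a commitment to the chosen action until the region is left) can be sketched as a small wrapper. The class and function names below are illustrative assumptions, not the thesis implementation.

    import random

    def epsilon_greedy(q_values, epsilon):
        """Standard epsilon-greedy choice over a region's Q-values."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    class CommittedExplorer:
        """Re-select an action only when the agent's perceived region changes."""
        def __init__(self, epsilon):
            self.epsilon = epsilon
            self.region = None
            self.action = None

        def act(self, region, q_values):
            if region is not self.region:                  # entered a new region
                self.region = region
                self.action = epsilon_greedy(q_values, self.epsilon)
            return self.action                             # otherwise keep the committed action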

The SinWorld Task

In the SinWorld environment (introduced above) the agent has the task of learning the policy which gets it to (and keeps it at) the peak of a sine curve in the shortest time. To prevent a lucky partitioning of the state space which exactly divides the Q-function at the decision boundaries, a random offset for the reward function was chosen for each trial: f(s) = sin(s + random). In each episode the agent is started in a random state and follows its exploration policy for 20 steps. In all trials the agent started with only a two-state representation. At the end of each episode, the decision boundary partitioning algorithm was applied.
Figure 6.4 shows the final partitioning after 1000 episodes. The highest resolution areas are seen at the decision boundaries (where Q(s, L) and Q(s, R) intersect). At s = 90° partitioning has stopped as the expected loss in discounted reward for not knowing the area in greater detail is less than δ_min. The decline in the partitioning rate as the boundaries are more precisely identified can be seen in Figure 6.5.
Figure 6.6 compares the performance of the variable resolution method against a number of fixed uniform grid representations. The performance measure used was the average discounted reward collected over 30 evaluations of a 20-step episode under the currently recommended policy. The results were averaged over 100 trials. The initial performance matches that of an 8-state representation. After 1000 episodes, however, the performance is slightly better than a 32-state representation (not shown) which managed much slower improvements in the initial stages. It is important to note that without prior knowledge of the problem it is difficult to assess which fixed resolution representation will provide the best tradeoff between learning speed and convergent performance. Starting with only two states, the adaptive resolution method provided fast learning in the initial stages yet managed near optimal performance overall.

² These experiments were also conducted prior to the development of the experience stack method.

Figure 6.4: The final partitioning after 1000 episodes in the SinWorld experiment, showing Q(s, L), Q(s, R) and r(s). The highest resolution areas are seen at the decision boundaries (where Q(s, L) and Q(s, R) intersect).
Figure 6.5: The number of regions in the SinWorld experiment. Note that the 1st derivative (the partitioning rate) is decreasing over time.
Figure 6.6: Comparison of initial learning performances for the variable vs. fixed resolution representations (2, 4, 8, 16 and 32 states) in the SinWorld task. The performance measure is the average total discounted reward collected over 20 steps from random starting positions and offsets of the reward function.
The Mountain Car Task
In the Mountain Car task the agent has the problem of driving an under-powered car to the top of a steep hill.³
The actions available to the agent are to apply an acceleration, deceleration or neither (coasting) to the car's engine. However, even at full power, gravity provides a stronger force than the engine can counter. In order to reach the goal the agent must reverse back up the hill, gaining sufficient height and momentum to propel itself over the far side. Once the goal is reached, the episode terminates. The value of the goal states is defined to be zero since there is no possibility of future reward. At every time-step the agent receives a punishment of −1, and no discounting was employed (γ = 1). In this special case, the Q-values simply represent the negative of the expected number of steps to reach the goal.
Figure 6.7 shows the Q-values of the recommended actions after 5000 learning episodes. The cliff represents a discontinuity in the Q-function. On the high side of the cliff the agent has just enough momentum to reach the goal. If the agent reverses for a single time step at this point it cannot reach the goal and must reverse back down the hill. It is here that there is a decision boundary and a large loss for not knowing exactly which action is best. Figure 6.8 shows how this area of the state-space has been discretised to a high resolution. Regions where the best actions are easy to decide upon are represented more coarsely.
Figure 6.9 shows a performance comparison between the adaptive and the fixed, uniform grid representations. The measure used is the average total reward collected from 30 random starting positions using the currently recommended policy and with learning suspended. Due to the large discontinuity in the Q-function, partitioning continues long after there appears to be a significant performance benefit for doing so (shown in Figure 6.10). This simply reflects that the performance metric measures the policy as a whole from random starting positions. Agents starting on or around the discontinuity still continue to gain some performance improvements.
The same experiment was also conducted but with the ranges of the states chosen to be 10 times larger than previously, giving a new state-space of 100 times the original volume (see Figure 6.8). Starting positions for the learning and evaluation episodes were still chosen to be inside the original volume. These changes had little effect upon the amount of memory used or the convergent performance, although learning proceeded far more slowly in the initial stages.

³ This experiment reproduces the environment described in [150, p. 214].

Figure 6.7: A value function for the Mountain Car experiment after 5000 episodes, plotted over position and velocity. The value is measured as max_a Q(s, a) to show the estimated number of steps to the goal under the recommended policy.

Figure 6.8: (left) A partitioning after 5000 episodes in the Mountain Car experiment. Position and velocity are measured along the horizontal and vertical axes respectively. (right) The same experiment but with poorly chosen scaling of axes. This had little effect on the final performance or number of states used.
Figure 6.9: The mean performance over 50 experiments using the adaptive and the fixed, uniform representations (16 and 256 states) in the Mountain Car task. The average total reward collected from 30 random starting positions under the currently recommended policy is measured.

Figure 6.10: The number of regions in the Mountain Car experiment.
The Hoverbeam Task
In the hoverbeam task [84] the agent has the task of horizontally balancing a beam (see Figure 6.11). On one end of the beam is a heavy motor that drives a propeller and produces lift. On the other is a counterbalance. The state-space is three dimensional and includes the angle from the horizontal, θ, the angular velocity of the beam and the speed of the motor. The available actions are to increase or decrease the speed of the motor. In this way we also see how a problem with a continuous action set can be decomposed into a similar problem with a discrete action set and a larger state-space; the problem could also be presented as one with motor speed as the only available action.
The reward function provided to the agent is largest when the beam is horizontal and declines inversely with the absolute angle from horizontal. Each episode terminates after 200 steps or if the angle of the beam deviates more than 30° from horizontal.⁴
This task requires fine control of the motor speed only in a small part of the entire space. Figure 6.12 compares the performance of several fixed resolution representations against the adaptive representation. Policies with coarse representations cause the beam to oscillate around the horizontal, while fixed high-resolution representations (4096 states) take an unacceptably long time to learn. An intermediate (512 state) resolution representation proved best out of the fixed resolution methods. The adaptive resolution method outperformed each fixed resolution method. Approximately 4000 regions were needed by the end of 10000 episodes.

⁴ A detailed description of this environment is available at: http://www.cs.bham.ac.uk/~sir/pub/hbeam.html

Figure 6.11: The Hoverbeam Task. The agent must drive the propeller to balance the beam horizontally. (The forces shown in the diagram are the thrust and the weights g·M_motor, g·M_beam and g·M_counter.)
Figure 6.12: The mean performance over 20 experiments using the adaptive and the fixed, uniform representations (8, 64, 512 and 4096 states) in the Hoverbeam task. The total reward collected after 200 steps under the currently recommended policy is measured.

                          SinWorld     Mountain Car    Hoverbeam
    δ_min                 0.1          10              2
    VIS_min               5            15              15
    Initial regions       2            16              8
    Partition test freq.  1 episode    10 episodes     10 episodes
    α                     0.1          0.15            0.1
    γ                     0.9          1.0             0.995
    Q_{t=0}               10           0               10
    ε                     0.3          0.3             0.3
    Start state           random       random          30°

Table 6.1: Experiment Parameters.

6.3 Related Work


6.3.1 Multigrid Methods
In a multigrid method, uniform representation resolutions are maintained for the entire state-space, although several layers of different resolution may be employed. Lower levels may be initialised by the values of coarse layers or bootstrap from their values [29, 101, 54, 6, 162, 69, 70].
An obvious disadvantage of uniform multi-grid methods is their limited scalability to high-dimensional state-space problems. In order to represent part of the state-space at a resolution of 1/k of the total width of each dimension, k^d regions are represented at the finest resolution. Excluding the cost of the less coarse layers, we can see that memory requirements grow exponentially in the dimensionality of the state-space. In situations where all represented states have values updated, time complexity costs must also grow at least as fast. The chief advantage of multi-grid methods is reduced learning costs for the fine-resolution approximation. As in the DBP approach, the values learned by coarse layers provide broad generalisation and so rapid (but inaccurate) dissemination of return information throughout the space.
Most multigrid work assumes models of the environment are known a priori, although [6] and [162] use Q-learning. In this case, the time complexity costs of the value-updating methods can be less than the space complexity costs. For example, Vollbrecht's kd-Q-learning [162], which starts with a kd-tree that is fully partitioned to a given depth, maintains and updates Q-values at all levels throughout the tree. However, since learning occurs on each level, the time-cost of the method grows more reasonably, as O(n · |A|), for a tree of depth n. Many of the regions at the finest levels, however, will never be visited or ever store values to any useful degree of confidence. To account for this, the method decides at which level in the tree it has most confidence in the value estimates, and uses the region at this level to determine policies and value estimates for bootstrapping. The method can be expected to make better use of experience than the DBP Q-learning approach but is computationally more expensive and is also limited to problems for which a full tree of the required depth can be represented from the outset.
Work in which learning occurs at several layers of abstraction simultaneously is also related to work on learning with macro-actions and options (although here a discrete, but large, MDP is typically assumed) [134, 39, 83, 102, 110, 143, 22, 43]. This work is reviewed in the next chapter.

6.3.2 Non-Uniform Methods


To attack the scalability problem, many methods examine ways to non-uniformly discretise the state-space.
In an early method, Simons uses a non-uniform (state-splitting) grid to control a robotic arm [133]. The task is to find a controller which minimises the forces exerted on the arm's 'hand'. Reinforcements are provided for reductions in this force. The splitting criterion is to partition regions if the arm's controller is failing to maintain the local punishment below some certain threshold. In cases where the exerted forces were very small, most partitioning occurred and fine control was the result.
In [46] Fernandez shows how the state-space can be discretised prior to learning using the Generalised Lloyd Algorithm. The method provides greater resolution in more highly visited parts of the state-space. Similarly, RBF networks may adapt their cell centres such that some parts of the state-space are represented in greater detail [68, 97]. A criticism of these kinds of approach is that they are based upon assumptions similar to those made by standard supervised learning algorithms: that a greater proportion of the error minimisation "effort" should be spent on more frequently visited states. It is not clear that this is the best strategy for reinforcement learning where, for instance, the states leading to a goal may be infrequently visited but may also define the values of all other states.

G-Learning
In another early work [28], Chapman and Kaelbling's G algorithm employs a decision tree to represent the Q-function over a discrete (binary) space. Each branch of the tree represents a distinction between 0 and 1 for a particular environmental state variable. Each leaf contains an additional "fringe" which keeps information about all of the remaining distinctions that can be made. The decision of whether or not to fix a distinction is made on the basis of two statistical tests (only one need pass). Here it was found that performing Q-learning and using the learned Q-values to make a split was insufficient. Instead, the method learns the future reward distribution:

    D(s_t, a_t, r) = \sum_{k=0}^{\infty} \gamma^k \Pr(r = r_{t+k+1})

The possible rewards are assumed to be drawn from a small discrete set, R. From this, the Q-values can be recovered as follows:

    \hat{Q}(s, a) = \sum_{r \in R} r \, D(s, a, r).

Thus the method recovers the same on-policy return estimate as batch, accumulate-trace SARSA(1) (or an every-visit Monte Carlo method), but also has a (non-stationary) future reward distribution for each region. The return distributions of a pair of regions differing by a single input variable are compared using a T-test [42]. The distinction is fixed, and the tree deepened, if it is found that the reward distributions differ with a "significant degree of confidence".⁵
The G algorithm also fixes distinctions on the basis of whether differing distinctions recommend different actions. Intuitively, the method also appears to identify decision boundaries, but in discrete spaces.
⁵ The use of significance measures in RL to compare return distributions is almost always heuristic since the return distributions are almost always non-stationary.
Classifier Systems
A classifier system consists of a population of ternary rules of the form ⟨1, 0, #, 1 : 1, 0⟩ [55]. A rule encodes a state-action pair, ⟨state : action⟩. A rule applies and suggests an action if it matches an input state (which should also be a binary string). A # in a rule stands for "don't care". Thus the rule ⟨#, #, #, # : 1, 0⟩ matches any input state, and the rule ⟨0, #, #, # : 1, 0⟩ matches any state where the first bit is 0. In this respect, a classifier system provides similar representations to a binary decision tree where data is stored at many levels; ⟨#, #, #, # : 1, 0⟩ represents the root and ⟨0, #, #, # : 1, 0⟩ is the next level down. In practice, a tree is not used to hold the rules. The population is unstructured: there may be gaps in the state-space covered by the population, and several rules may apply in other states.
Each rule has an associated set of parameters, some of which are used to determine the rule's fitness. Fitness measures the quality of a rule and corresponds to fitness in an evolutionary sense. Periodically, unfit rules are deleted from the population and new rules are added by combining fit rules together.
In Munos and Patinel's Partitioning Q-learning [96], the evolutionary component is replaced with a specialisation operator that replaces a rule containing a # with two new rules in which the # is substituted with a 1 and a 0. Each rule keeps a Q-value for the SAP that it encodes and this is updated whenever the rule is found to apply (several rules may have their Q-values updated on each step). The specialisation operator is applied to the fraction of the rules in which the variance in the 1-step error is greatest. This variance is measured as:

    \frac{1}{n} \sum_{i=1}^{n} \Big[ \big( r_i + \gamma \max_{a'} Q(s'_i, a') \big) - \big( r_{i-1} + \gamma \max_{a'} Q(s'_{i-1}, a') \big) \Big]^2

where the rule applied and was updated at times {t_0, ..., t_i, ..., t_n}. The result is that specialisation causes something like the tree deepening in G-learning. However, unlike the T-test, this method does not distinguish between noise in the 1-step return and the different distributions of return that follow from adjacent state aggregations.
Utile Suffix Memory (USM)
So far, all of the methods discussed (including the DBP approach) assume that the real observed states are those of a large or continuous MDP. However, in some cases, the reward or transitions following from the next action may not simply depend upon the current state and action taken, but may depend upon what happened 2, 3 or more steps ago (i.e. the environment is a partially observable MDP). Similar to the G algorithm, McCallum's Utile Suffix Memory (USM) also uses a decision tree to attempt to discover the relevant "state" distinctions needed for acting [82, 81]. However, here the agent's perceived state is a recent history of observed environmental inputs and actions taken. Branches in the tree represent distinctions in the recent history of events that allow different Q-value predictions to be distinguished. The top level of the tree represents actions to be taken at the current state for which Q-values are desired. Deeper levels of the tree make distinctions between different prior observations. For example, a branch 3 levels down might distinguish between whether a_{t−2} = a_10 or whether a_{t−2} = a_5. Distinctions (branches) are added if these different histories appear to give rise to different distributions of 1-step corrected return, r + γ max_a Q(s', a). The return distributions following from each history are generated from a pool of stored experiences. The Kolmogorov-Smirnov test is used to decide whether the distributions are different [42].⁶

⁶ The Kolmogorov-Smirnov test distinguishes samples by the largest difference in their cumulative distributions.

Continuous U-Tree
In [161] Uther and Veloso apply USM and G-learning ideas to a continuous space. As in the DBP approach, a kd-tree is used to represent the entire state-space, and branches of the tree subdivide the space. As in McCallum's USM, a pool of experience is maintained and replayed to perform offline value-updates. Within a region, the 1-step corrected return is measured for each stored experience, which serves as a sample set. This is compared with a sample from an adjacent region using the Kolmogorov-Smirnov test. Also, an alternative (less "theoretically based") test was used which maintains splits if doing so reduces the variance in the 1-step return estimates by some threshold.

Dynamically Refactoring Representations


In [35] Boutilier, Dearden and Goldszmidt use a method that seeks to increase the resolution of (decision-tree based) binary state representations where there is evidence that the value is non-constant within an aggregate region. A Bayesian network is used to compactly represent a transition probability function. The compactness of this function follows from noting that (at least for many discrete state tasks) many actions frequently leave many features of the current state unchanged. For example, an action such as "pick up coffee cup" will not affect which room the agent is in. Transitions to other rooms, from any state, after taking this action are compactly represented with a probability of zero of occurring. Value functions are represented as decision trees (as in G-learning). However, here it is noted that it is possible to refactor the tree to provide equivalent but smaller representations, especially in cases where the represented value function has a constant value. A form of modified policy iteration (structured policy iteration) is performed upon the tree. At each iteration, the tree is refactored to maintain its compactness.

Comments
An interesting issue with many of these methods is that we actually expect the return following from different regions to be drawn from different distributions in almost all cases: in very many problems, the optimal value function is non-constant throughout almost all of the state-space. This follows as a consequence of using discounting. The return distributions following from adjacent regions are therefore likely to have different means, and so will be shown to be from different distributions under the statistical tests given significant amounts of experience. It may be that the Kolmogorov-Smirnov test or the T-test identifies relatively large changes in the value function (e.g. at discontinuities) more quickly than changes in other parts of the state-space, or identifies those areas where significance tests are passed most quickly (e.g. areas where most experience occurs). One might hope that these areas also coincide with changes in optimal policy, although this is clearly not always the case.
With experience caching methods (USM and Continuous U-Tree), there is the opportunity to deepen the tree until a lack of recorded experience within leaf regions causes them to be poorly modelled by the stored experience (e.g. because a region contains no experiences, no experiences which exit the region (causing "false terminal states"), or too few experiences to model the local variance in value and pass any reasonable statistical test). Partitioning so deep that we have one experience per action per region is unlikely to be desirable and seems certain to lead to overfitting problems.
As the number of regions increases, so too does the cost of performing value-iteration sweeps across the set of regions. If computational costs can be neglected, however, one might expect an approach of partitioning as deeply as possible to make extremely good use of experience (provided overfitting and false terminals can be avoided).
However, if time and space costs are an issue, then it becomes natural to examine ways in which parts of the state-space can be kept coarse. In this respect, the existing methods miss the key insight that it simply is not necessary (in all cases) to represent the value function to a high degree of accuracy in order to represent accurate policies.
It is argued that refinement methods should seek to reduce uncertainty about the best action, and not uncertainty about the actions' values, in order to find better quality policies.
The decision boundary partitioning method offers an initial heuristic way to do this, although it is a less principled approach than one might hope. For instance, in many cases it will follow that reducing uncertainty about the best action requires more certain value estimates for those actions. In turn it may follow (at least in the case of bootstrapping value estimation algorithms, such as Q-learning and value-iteration) that the only way to reduce the uncertainty in these action value estimates is to increase the resolution of the regions whose values determine the action values that we are uncertain about. This requires a non-local partitioning method. All of the methods considered so far are local methods and do not consider partitioning successor regions in order to reduce uncertainty at the current region.
The VRDP approaches of Moore, and of Munos and Moore, described next, use a number of different partitioning criteria. In particular, the Influence×Standard Deviation heuristic appears to be a more principled step in the direction of reducing the uncertainty about the best actions to take.

Variable Resolution Dynamic Programming


Moore's Variable Resolution Dynamic Programming (VRDP) is a model-based approach that uses a kd-tree for representing a value function [87, 89]. A simulation model is assumed to be available from which a state transition probability function is derived (by simulating experiences from states within a node and noting the successor node). This is used to produce a discrete region transition probability function which is then solved by standard DP techniques. The partitioning criterion is to split at states along the trajectories seen while following the greedy policy from some starting state. A disadvantage of this approach is that every state is on the greedy path from somewhere; attempting to use this method to generate policies from arbitrary starting states causes the method to partition everywhere.
More recent VRDP work by Munos and Moore examines and compares several different partitioning criteria [94, 95, 92]. The method uses a grid-based "finite-element" representation.⁷ The finite elements are the points (states) at the corners of grid cells for which values are to be computed. A discrete transition model is generated by casting short trajectories from an element and noting the nearby successors at the end of the trajectory. Elements near to the trajectory's end are given high transition probabilities in the model.
The following local partitioning rules were initially tested:
i) Measure the utility of a split in a dimension as the size of the local change in value along that dimension. Splits are ranked and a fraction of the best are actually divided.
ii) Measure the local variability of the values in a dimension. Rank and split, as before, but based on this new measure. This causes splits to occur where the value function is non-linear.
iii) Identify where the policy changes along a dimension, and split in that dimension. This refines at decision boundaries.
The decision boundary method was found to converge upon sub-optimal policies in a different version of the mountain car task requiring finer control. In some cases, the performance of the decision boundary approach was actually worse than for fixed, uniform representations of the same size. The reason for this is errors in the value-approximation of states away from the decision boundary, which actually cause the decision boundaries to be misplaced. Combining the decision boundary and non-linearity heuristics resulted in better performance.
To improve this situation further, an influence heuristic was devised that takes into account the extent to which the value of one state contributes to the values of another element. Intuitively, the influence is a measure of the size of the change in the value of s that follows from a unit of change in the value of s_i. The influence, I(s|s_i), of the value of state s_i on s is defined as:

    I(s|s_i) = \sum_{k=0}^{\infty} p_k(s, s_i)

where p_k(s, s_i) is the k-step discounted probability of being in s_i after k steps when starting from s and following the greedy policy, π_g. This can be found as follows:⁸

    p_0(s, s') = 1 if s = s', and 0 if s ≠ s'
    p_1(s, s') = \gamma^\tau P^{\pi_g(s)}_{ss'}
    p_k(s, s') = \gamma^\tau \sum_x P^{\pi_g(s)}_{sx} \, p_{k-1}(x, s')
⁷ This work was conducted independently of, and in parallel with, the DBP approach [116, 115, 117].
⁸ Below, τ represents the timescale over which a state-transition model was calculated, or the mean transition time between s and s'. Variable timescale methods are discussed in the next chapter. Assume for now that τ = 1.
The influence of a state s on a set of states, Σ, is defined as:

    I(s|\Sigma) = \sum_{s_i \in \Sigma} I(s|s_i).
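As an illustration of how the influence values might be computed when a discrete region-transition model is available (as in Munos and Moore's model-based setting), the recursion above can be iterated to a fixed depth. This is a sketch only, with τ = 1 and illustrative names; it is not taken from the original work.

    def influence_on_set(P, gamma, sigma_set, n_states, max_k=100):
        """Approximate I(s | Sigma) = sum_k sum_{s_i in Sigma} p_k(s, s_i) for every state s.

        P[s][x] is the probability of moving from s to x in one step under the greedy policy."""
        # p[s][x] holds p_k(s, x); at k = 0 it is the identity (no discounting yet).
        p = [[1.0 if s == x else 0.0 for x in range(n_states)] for s in range(n_states)]
        influence = [sum(p[s][si] for si in sigma_set) for s in range(n_states)]
        for _ in range(max_k):
            # p_k(s, x) = gamma * sum_y P[s][y] * p_{k-1}(y, x)
            p = [[gamma * sum(P[s][y] * p[y][x] for y in range(n_states))
                  for x in range(n_states)] for s in range(n_states)]
            for s in range(n_states):
                influence[s] += sum(p[s][si] for si in sigma_set)
        return influence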

However, improvements in value representations may not necessarily follow from splitting states with high influence if these states have accurate values. It is assumed that states with high variance in their values (due to having many possible successors with differing values) provide poor value estimates.⁹ Moreover, since state values depend on their successors' successors, a long-term (discounted) variance measure can also be derived from the local variance measures. These heuristics are combined to provide the following partitioning criteria:
1) Identify the set, Σ, of states along the decision boundary.
2) Calculate the total influence on the decision boundary values, I(s|Σ), for all s.
3) Calculate the long-term discounted variance of each state, σ²(s).
4) Calculate the utility of splitting a state as: σ(s)·I(s|Σ).
5) Split a fraction of the highest utility states.
An illustration of this process appears in Figure 6.13. The figures are provided with thanks to Remi Munos [94].
The Standard Deviation×Influence measure, σ(s)·I(s|Σ), performed considerably better for equivalent numbers of states, and appears to be the most principled method to date. Although, in their experiments, a complete and accurate environment model was available, it seems clear that the method can naturally be adapted to the case where a model is learned. Model-free versions of this method don't seem possible: there is no obvious way to learn the influence measure without a model.
Note that the influence and variance measures are artefacts of the value estimation procedure and do not directly measure how "good" or "bad" a state is. The influence and variance of states tend to zero with increasing simulation length, and become zero if the simulation enters a terminal state. Thus, there remains the possibility of further developments with this approach that adjust the simulation timescale in order to reduce the number of states with high variance and influence.

⁹ It is assumed, since only deterministic reward functions and environments are considered, that the source of variance must lie in value uncertainties due to the approximate representation.

Figure 6.13: Stages of Munos and Moore's variable resolution scheme for a mountain car task, plotted over position and velocity. The panels show: the optimal policy and several trajectories; the influence on 3 points; the states of policy disagreement; the influence on these states; the standard deviation; and influence × standard deviation. The task differs slightly from the one used in experiments earlier in this chapter and provides the highest reward for reaching the goal with no velocity. The top-left figure shows the optimal policy for this task. Influence measures a state's contribution to the value of a set of other states (top-right). Standard deviation is a measure of the uncertainty of a state's value. The Influence×Standard Deviation measure is used to decide where to increase the resolution. A fraction of the highest valued (darkest) states by this measure is partitioned.
Parti-Game
The Parti-Game algorithm is an online model-learning method that also employs kd-trees for value and policy representations [86] (see also Ansari et al. for a revised version [2]). The method doesn't solve generic RL problems but aims to find any path to a known goal state in a deterministic environment.
The method is assumed to have local controllers that enable the agent to steer to adjacent regions (the number of available actions is the number of adjacent regions). The method attempts to minimise the expected number of regions traversed to reach the goal, learning a region transition model and calculating a regions-to-goal value-function as it goes (all untried actions in a region are assumed to lead directly to the goal). The method behaves greedily with respect to its value function at all times. The splitting criterion is to divide regions along the "win/lose" boundary between where it is currently thought possible to reach the goal and where it is not. Importantly, as the resolution increases, high-resolution areas appear expensive to cross because they increase the regions-to-goal value; thus greedy exploration initially avoids the win/lose boundary where it has previously failed to reach the goal. However, as alternative routes become exhausted, the win/lose boundary is eventually explored. This symbiosis of the exploration method and representation appears to be the source of the algorithm's success. The method has been shown to very quickly find paths to a goal state in problems with up to 9-dimensional continuous state.

6.4 Discussion


A novel partitioning criterion has been devised to allow the refinement of discretised policy and Q-function representations in continuous spaces. The key insights are that:
• Traditional problems in using fixed discretisations include slow learning if the representation is too fine, poor policies if the representation is too coarse, or otherwise a requirement for problem specific knowledge (or tuning) to achieve appropriate levels of discretisation.
• General-to-specific refinement promises to solve each of these problems by allowing fast learning (through broad generalisation) in the initial stages while the representation is coarse, while still allowing good quality solutions as the representation is increased.
• No (local) improvements in policy quality can be derived by knowing in greater detail that a region of space recommends a single action. This leads to the decision boundary partitioning criterion that increases the representation's resolution at points where the recommended policy significantly changes.
• In continuous spaces, decision boundaries may be smaller, lower-dimensional features within the state-space than the state-space itself. By exploiting this, and seeking only to represent the boundaries between areas of uniform policy, it is thought that the size of the agent's policy or Q-function representation can be kept small, while still allowing good policies to be represented. Areas represented in high detail (and where poor generalisation can occur) can also be kept to a minimum.

The experiments showed that the final policies achieved can be better, and are reached more quickly, than those of fixed uniform representations. This is especially true in problems requiring very fine control in a relatively small part of the entire state-space.
The independent study by Munos and Moore shows that partitioning at decision boundaries, and other local partitioning criteria, find sub-optimal solutions. The non-local heuristic of partitioning states whose values are uncertain and also influence the values at decision boundaries (and therefore the location of decision boundaries) allows smaller representations of higher quality policies to be found than local methods allow.
Chapter 7

Value and Model Learning With


Discretisation

Chapter Outline
This chapter introduces learning methods for discrete event, continuous time problems (modelled formally as Semi-Markov Decision Processes). We will see how the standard discrete time framework can lead to biasing problems when used with discretised representations of continuous state problems. A new method is proposed that attempts to reduce this bias by adapting learning and control timescales to fit a variable timescale given by the representation. For this purpose Semi-Markov Decision Process learning methods are employed.

7.1 Introduction


This chapter presents an analysis of some problems associated with discretising continuous state-spaces. Note that in discretised continuous spaces the agent may see itself as being within the same state for several timesteps before exiting. We will see what effect this can have on bootstrapping RL algorithms that assume the Markov property, and that, at least for some simple toy problems, this problem can be overcome by modifying the RL algorithm to perform a single value backup based upon the entire reward collected until the perceived state changes. The results are RL algorithms that employ both spatial abstraction (through function approximation) and temporal abstraction (through variable timescale RL algorithms) simultaneously.

7.2 Example: Single Step Methods and the Aliased Corridor Task

Consider the following environment: the learner exists in the corridor shown in Figure 7.1. Episodes always start in the leftmost state. Each action causes a transition one state to the right until the rightmost state is entered, where the episode terminates and a reward of 1 is given. A reward of zero is received for all other actions, and γ = 0.95. The environment is discrete and Markov except that the agent's perception of it is limited to four larger discrete states.
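A small simulation of the aliased corridor makes the effect described next easy to reproduce: 1-step Q-learning over the four perceived regions settles on values well above V* towards the left of the corridor. The sketch below assumes 64 underlying states and is illustrative only.

    GAMMA = 0.95
    N_STATES, N_REGIONS = 64, 4                 # underlying corridor states, aliased into 4 regions

    def region(s):
        return s * N_REGIONS // N_STATES        # the index the agent actually perceives

    def run_aliased_q_learning(episodes=5000, alpha=0.1):
        """1-step Q-learning over the perceived regions (one action, so Q is just a value per region)."""
        V = [0.0] * N_REGIONS
        for _ in range(episodes):
            for s in range(N_STATES - 1):       # move right one state per step
                terminal = (s + 1 == N_STATES - 1)
                r = 1.0 if terminal else 0.0    # reward 1 on entering the rightmost state
                target = r + (0.0 if terminal else GAMMA * V[region(s + 1)])
                V[region(s)] += alpha * (target - V[region(s)])
        return V

    # Under these dynamics the true value of the leftmost state is GAMMA ** 62 (roughly 0.04),
    # while the estimate learned for the leftmost region is an over-estimate of it, as in Figure 7.2.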
Figure 7.2 shows the resulting value-function when standard (1-step) DP and 1-step Q-learning are used with state aliasing. With Q-learning, backup (3.34) was applied after every step. With DP, a maximum-likelihood model was formed by applying backups (3.41) and (3.42) after each step and solving the model using value-iteration. Both methods have learned over-estimates of the value-function by the last region.
The modelled MDP in Figure 7.3 is that learned by the 1-step DP method. Over-estimation occurs since the rightmost region learns an average value of the aliased states it contains. Unfortunately, the region which leads into it requires the value of its first state (not the average) as its return correction in order to predict the return for entering that region and acting from there onwards. Since, in this example, the first state of a region always has a lower value than the average, the return correction introduces an over-optimistic bias. These biases accumulate as they are propagated to the predecessor regions.
The effect on Q-learning is worse. A high step-size, α, weights Q-values towards the more recent return estimates used in backups. In the extreme case where α = 1, each backup to a region wipes out any previous value; each value records the return observed upon leaving the region. This leads to the case where the leftmost region learns the value for being just 4 steps from the goal. This is especially undesirable in continual learning tasks where α cannot be declined in the standard way.

Figure 7.1: (top) The corridor task, with states labelled t = 0 to t = 64 and a reward of r = 1 at the rightmost state. (bottom) The same task with states aliased into four regions (with boundaries at t = 0, 16, 32, 48 and 64).
Figure 7.2: Solutions to the corridor task using 1-step DP (left) and 1-step Q-learning with a range of step-sizes, α (right), each compared with V*(s).
Figure 7.3: A naively constructed maximum likelihood model of the aliased corridor, in which each region transitions to itself with probability 1 − p and to its successor with probability p, where p = 1/16, and r = 1 at the goal.

7.3 Multi-Timescale Learning


In Section 3.4.4 we saw how return estimates may employ actual rewards collected over multiple timesteps:

    z_t^{(n)} = r_t^{(n)} + \gamma^n U(s_{t+n}),     (7.1)

where n is the number of steps for which the policy under evaluation is followed, and r_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1} r_{t+k} is an n-step truncated actual return. Here n is assumed to be a variable corresponding to the amount of time it takes for some event to occur. In particular, the amount of time it takes to enter the successor of s_t (i.e. s_{t+n} ≠ s_t) is used.
In [143], Sutton, Precup and Singh describe how to adapt existing 1-step algorithms to use these return estimates (see also [134, 110, 109, 53]). The 1-step Q-learning update becomes

    \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha \big( r_t^{(n)} + \gamma^n \max_{a'} \hat{Q}(s_{t+n}, a') - \hat{Q}(s_t, a_t) \big).     (7.2)
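A sketch of update (7.2) as it might be applied in code, once the truncated return r_t^{(n)} and the total discount γ^n (or γ^{Δt}) to the successor have been accumulated. The function and argument names are illustrative assumptions.

    def multi_time_q_update(Q, s, a, truncated_return, discount_to_successor,
                            s_next, actions_next, alpha):
        """Apply update (7.2):
        Q(s,a) <- Q(s,a) + alpha * (r^(n) + gamma^n * max_a' Q(s_next, a') - Q(s,a)).

        `truncated_return` is r_t^(n), the discounted reward accumulated until the successor
        was entered; `discount_to_successor` is gamma**n (or gamma**dt in the SMDP case)."""
        correction = 0.0
        if s_next is not None:                  # terminal successors contribute no future return
            correction = discount_to_successor * max(Q[(s_next, a2)] for a2 in actions_next)
        Q[(s, a)] += alpha * (truncated_return + correction - Q[(s, a)])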
Similarly, model-learning methods may learn a multi-time model:

    N^a_s \leftarrow N^a_s + 1
    \hat{R}^a_s \leftarrow \hat{R}^a_s + \frac{1}{N^a_s} \big( r_t^{(n)} - \hat{R}^a_s \big),     (7.3)
    \forall x \in S: \quad \hat{P}^a_{sx} \leftarrow \hat{P}^a_{sx} + \frac{1}{N^a_s} \big( \gamma^n I(x, s') - \hat{P}^a_{sx} \big),     (7.4)

where a = a_t, s = s_t, s' = s_{t+n}, \hat{R}^a_s is the estimated expected (uncorrected) truncated return for taking a in state s for n steps, and \hat{P}^a_{sx} gives the estimated discounted transition probabilities given this same course of action:

    \lim_{N^a_s \to \infty} \hat{P}^a_{sx} = \sum_{n=1}^{\infty} \gamma^n \Pr(x = s_{t+n} \mid s = s_t, a = a_t, x \ne s_t)

A multi-time model (\hat{P} and \hat{R}) concisely represents the effects of following a course of action for several time-steps (and possibly variable amounts of time) instead of the usual one step. Since the amount of discounting that needs to occur (in the mean) is accounted for by the model, γ is dropped from the 1-step DP backup to form the following multi-time backup rule:

    \hat{V}(s) \leftarrow \max_a \Big( \hat{R}^a_s + \sum_{s'} \hat{P}^a_{ss'} \hat{V}(s') \Big).     (7.5)
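Updates (7.3), (7.4) and backup (7.5) can be sketched as follows. The dictionaries and names are illustrative assumptions; the model statistics default to zero.

    from collections import defaultdict

    N = defaultdict(int)        # visit counts N^a_s, keyed by (s, a)
    R_hat = defaultdict(float)  # estimated expected truncated return, keyed by (s, a)
    P_hat = defaultdict(float)  # estimated discounted transition probabilities, keyed by (s, a, x)

    def update_multi_time_model(s, a, truncated_return, discount, s_next, states):
        """Apply updates (7.3) and (7.4) after an n-step (or dt-long) stay in s under action a."""
        N[(s, a)] += 1
        k = 1.0 / N[(s, a)]
        R_hat[(s, a)] += k * (truncated_return - R_hat[(s, a)])
        for x in states:
            indicator = discount if x == s_next else 0.0   # gamma**n * I(x, s')
            P_hat[(s, a, x)] += k * (indicator - P_hat[(s, a, x)])

    def multi_time_backup(V, s, actions, states):
        """Backup (7.5); no explicit gamma appears because the model is already discounted."""
        V[s] = max(R_hat[(s, a)] + sum(P_hat[(s, a, x)] * V[x] for x in states)
                   for a in actions)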

More generally, the above multi-time methods are a special case of continuous time, discrete event methods for learning in Semi-Markov Decision Processes (SMDPs) (see [61, 114]). Here, n may be a variable, real-valued amount of time. If a successor state is entered after some real-valued duration, Δt > 0, replacing all occurrences of n with Δt in the above updates yields a new set of algorithms suitable for learning in an SMDP. In cases where reward is also provided in continuous time by a reward rate, ρ, the following immediate reward measure can be used while still performing learning in discrete time [91, 25]:

    r_t^{(\Delta t)} = \int_0^{\Delta t} \gamma^x \rho^a_{sx} \, dx.     (7.6)

All λ-return methods may also be adapted to work in this way by defining the return estimate as follows:

    z_t^\lambda = (1 - \lambda^{\Delta t}) \big[ r_t^{(\Delta t)} + \gamma^{\Delta t} \hat{U}(s_{t+\Delta t}) \big] + \lambda^{\Delta t} \big[ r_t^{(\Delta t)} + \gamma^{\Delta t} z_{t+\Delta t}^\lambda \big]     (7.7)

By recording the time interval Δt, along with the states observed, rewards collected and actions taken, Equation 7.7 allows an SMDP variant of backwards replay and the experience stack method to be constructed straightforwardly. Also, from (7.7), the following updates for a continuous time, accumulate-trace TD(λ) may be found:

    \forall s \in S: \quad e(s) \leftarrow \begin{cases} (\gamma\lambda)^{\Delta t} e(s) + 1, & \text{if } s = s_t, \\ (\gamma\lambda)^{\Delta t} e(s), & \text{otherwise.} \end{cases}

    \forall s \in S: \quad \hat{V}(s) \leftarrow \hat{V}(s) + \alpha \big( r_t^{(\Delta t)} + \gamma^{\Delta t} \hat{V}(s_{t+\Delta t}) - \hat{V}(s_t) \big) e(s)
A derivation appears in Appendix C. This method differs from other SMDP TD(λ) methods (e.g. see [44], which also considers a continuous state representation). The derivation in Appendix C shows that the version here is the analogue of the forward-view continuous time λ-return estimate (Equation 7.7).
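A sketch of the continuous-time accumulate-trace TD(λ) updates above, applied once per observed transition of duration Δt. The names are illustrative; V and e are dictionaries over the (perceived) states.

    def smdp_td_lambda_step(V, e, s, s_next, truncated_return, dt, alpha, gamma, lam, states):
        """One update of the continuous-time accumulate-trace TD(lambda) given in the text.

        `truncated_return` is r_t^(dt); traces and the bootstrap term are discounted by dt."""
        decay = (gamma * lam) ** dt
        for x in states:
            e[x] *= decay                       # (gamma * lambda)**dt e(s) for every state
        e[s] += 1.0                             # accumulate on the visited state
        bootstrap = (gamma ** dt) * V[s_next] if s_next is not None else 0.0
        td_error = truncated_return + bootstrap - V[s]
        for x in states:
            V[x] += alpha * td_error * e[x]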

Figure 7.4: (top-left) Actions taken and updates made by the original every-step algorithms. The discrete region is entered at START. Selecting different actions on each step causes dithering and poorly measured return for following the policy recommended by the region (which can only be a single action). (top-right) Effect of the commitment policy. Updates are still made after every step. (bottom-left) Multi-time first-state update with commitment policy. Updates are made once per region. (bottom-right) Possible distribution of state values whose mean is learned by first-state methods. It is assumed that states are entered predominantly from one direction.

7.4 First-State Updates


Section 7.2 identified a problem with naively using bootstrapping RL updates in environments where there are aggregations of states which the learner sees as a single state. The key problem this causes is that the return correction used by backups upon leaving a region does not necessarily reflect the available return for acting after entering the successor region, but is at best an average of the values of states within the successor. To reduce this bias, learning algorithms can be modified to use return estimates that reflect the return received following the first states of successor regions. This is done by making backups to a region using only return estimates representing the return following its first visited state. This is easy to do if there is a continuous-time (SMDP) algorithm available which has the following two components:

nextAction(agent) → action   Returns the next, possibly exploratory, action selected by the agent.

setState(agent, r, s', A_{s'}, τ)   Informs the agent of the consequences of its last action. The last action generated r immediate discounted reward and put it into state s', τ time later, where actions A_{s'} are now available. The learning updates should be made here.
The following wrappers transform the original algorithm into one which predicts the return available from the first states of a region entered. It is assumed that the percept, s, denotes a region and not a state.

nextAction'(agent) → action
    if dt = 0 then
        a ← nextAction(agent)
    return a

setState'(agent, r, s', A_{s'}, τ)
    multistep_r ← multistep_r + γ^{dt} r
    dt ← dt + τ
    if s' ≠ s or dt ≥ τ_max then
        setState(agent, multistep_r, s', A_{s'}, dt)
        dt ← 0;  multistep_r ← 0;  s ← s'

The variables dt, a, s and multistep_r are global. At the start of each episode, dt and multistep_r should be initialised to 0.
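The wrappers translate directly into code around any SMDP learner exposing the two components above. The following Python sketch assumes a `base` learner object with next_action and set_state methods; the names and the class itself are illustrative, not the thesis implementation.

    class FirstStateWrapper:
        """Commit to one action per region and pass a single aggregated (SMDP) backup to the
        underlying learner when the perceived region changes or the time bound is exceeded."""

        def __init__(self, base, gamma, t_max):
            self.base, self.gamma, self.t_max = base, gamma, t_max
            self.dt = 0.0               # time spent in the current region
            self.multistep_r = 0.0      # discounted reward accumulated since entering it
            self.s = None               # region in which the committed action was chosen
            self.a = None

        def next_action(self):
            if self.dt == 0:                       # only re-select on entering a region
                self.a = self.base.next_action()
            return self.a

        def set_state(self, r, s_next, actions_next, tau):
            self.multistep_r += (self.gamma ** self.dt) * r
            self.dt += tau
            if s_next != self.s or self.dt >= self.t_max:
                self.base.set_state(self.multistep_r, s_next, actions_next, self.dt)
                self.dt, self.multistep_r, self.s = 0.0, 0.0, s_next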
The nextAction' wrapper ensures that the agent is committed to taking the action chosen in the first state of s until it leaves. If we seek a policy that prescribes only one action per region, it is important that only single actions are followed within a region; otherwise the return estimates may become biased towards the return available for following mixtures of actions.¹
For control optimisation problems it is assumed that there is at least one deterministic policy that is optimal. If the method were instead to be used for policy evaluation, the agent could equally be committed to some (possibly stochastic but still fixed) policy until the region is exited.
The setState' wrapper records the truncated discounted return and the amount of time which has passed, which is necessary for the original variable-time algorithm to make a backup. The value τ_max is the maximum possible amount of time for which the agent is committed to following the same action. It may happen that the agent becomes stuck if it continually follows the same course of action in a region; the time bound attempts to avoid such situations.
See Figure 7.4 for an intuitive description of first-state methods. Note that the method implicitly assumes that regions are predominantly entered from one direction. If they are entered from all directions then the expected first-state values will be closer to the real mean state-value of the region as a whole, and in this case one would not expect the method to provide any significant improvement over every-step update methods.
¹ This form of exploration was used in the decision boundary partitioning experiments.

7.5 Empirical Results


The first-state backup rules are evaluated on the corridor task introduced in Section 7.2 and
the mountain car task.
Corridor Task   Figure 7.5 compares the learned value functions of the first-state and 1-step
(or every-step) methods. The learned value function was the same for both model-free
and model-based methods. Even though the first-state methods may have a higher overall
absolute error than their every-step counterparts, it is argued that i) these estimates are
more suitable for bootstrapping and do not suffer from the same progressive overestimation
by the time the reward is propagated to the leftmost region, and ii) the higher error is of
no consequence if we can choose which state values to believe. We know that the
predictions represent values of the expected first states of each region. In these states, the
method has no error.
[Plot: state value against state, comparing V*(s), the every-step methods and the first-state methods.]

Figure 7.5: The value-function found using first-state backups in the corridor task. Every-state
Q-learning finds the same solution as every-state DP since a slowly declining learning
rate was used.

Mountain Car Task   In the mountain car experiments the agent is presented with a
4 × 4 uniform grid representation of the state-space. τ = 1 for all steps, γ = 0.9, Q₀ = 0.
The ε-greedy exploration method was used with ε declining linearly from 0.5 on the first
episode to 0 on the last. All episodes start at randomly selected states. For the model-free
methods (Q-learning and Peng and Williams' Q(λ)), α is also declined in the same way.
Because the first-state methods alter the agent's exploration policy by keeping the choice
of action constant for longer, the every-step methods are also tested using the same policy
of committing to an action until a region is exited. For the model-based (DP) method,
Wiering's version of prioritised sweeping was adapted for the SMDP case in order to allow
the method to learn online [167]. Five value backups were allowed per step during exploration,
and the value function was solved using value-iteration for the current model at the end of
each episode. Q₀ was used as the value of all untried actions in each region.
Peng and Williams' Q(λ) was also tested. The main purpose of this experiment was to try
to establish whether the improvements caused by the wrapper were due to using first-state
return estimates or simply through using multi-step returns. We have seen earlier in the
thesis how multi-step methods can overcome slow learning problems by using single reward
and transition observations to update many value estimates. One might think that this
would provide the first-state method with an additional advantage over the every-step
methods. However, in this respect each Q-learning method is actually very similar.
Each method updates at most one value for each step (unlike λ-return and eligibility trace
methods). Even so, PW-Q(λ) was also tested with λ = 1.0, ensuring that the return
estimates employ the reward due to actions many steps in the future. The following state-replacing
trace method was used (c.f. update (3.31)):
$$\forall s,a \in S \times A: \quad e(s,a) \leftarrow
\begin{cases}
1, & \text{if } s = s_t \text{ and } a = a_t, \\
0, & \text{if } s = s_t \text{ and } a \neq a_t, \\
(\gamma\lambda)^{\tau_t}\, e(s,a), & \text{otherwise.}
\end{cases}$$
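Purely as an illustration of the update above, the trace decay and replacement can be written as the following Python fragment, where e is a dictionary of traces keyed by (state, action) and tau is the real-valued duration of the step just taken; all names are assumptions of the sketch.

    def update_replacing_traces(e, s_t, a_t, gamma, lam, tau):
        """State-replacing eligibility traces for the SMDP case."""
        decay = (gamma * lam) ** tau
        for (s, a) in list(e.keys()):
            if s == s_t:
                # Replace the trace in the visited state: 1 for the taken
                # action, 0 for every other action in that state.
                e[(s, a)] = 1.0 if a == a_t else 0.0
            else:
                # All other traces decay by (gamma * lambda)^tau.
                e[(s, a)] *= decay
        # Ensure the visited state-action pair has a trace entry.
        e[(s_t, a_t)] = 1.0
        return e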
The results of the various methods are shown in Figures 7.6–7.8. The average trial length
measures the quality of the current greedy policy from 30 randomly selected states. Regret
measures the difference between the estimated value of a starting region and the actual
observed return for following the greedy policy for each of these evaluations. Regret is
taken to represent a measure of bias in the learned Q-function, and the mean squared
regret as a measure of variance in the estimate. The results in these graphs are the average
of 100 independent trials. The lack of smoothness in the graphs comes from averaging over
many starting states.
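For clarity, the regret statistics reported here can be sketched as follows; this is an illustrative reconstruction of the evaluation procedure, not the exact code used for the experiments, and Q is assumed to be a nested dictionary of Q-values indexed by region and then action.

    def regret_statistics(Q, evaluations):
        """evaluations: list of (start_region, observed_return) pairs gathered
        by following the greedy policy from randomly selected starting states."""
        regrets = [max(Q[s].values()) - g for (s, g) in evaluations]
        mean_regret = sum(regrets) / len(regrets)
        mean_squared_regret = sum(r * r for r in regrets) / len(regrets)
        return mean_regret, mean_squared_regret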
[Plots: average episode length (offline), mean regret and mean squared regret against episode for Every-Step DP, Every-Step DP + Commitment Policy and First-State DP.]

Figure 7.6: First-state results for the model-based method in the mountain car task. `Every-Step'
indicates that learning updates and action choices for exploration were made after
every step. `Every Step + Commitment Policy' indicates that learning updates were made
at every step, but action choices were made only upon entering a new region. `First-State'
indicates that the variable timescale learning updates and action choices were made once
per visited region. (See Figure 7.4.)
In Figure 7.6 (the model-learning method), we can see that the commitment policy led to
big improvements in the learned policy, but no significant difference in performance in this,
or the other measures, follows from using the first-state learning method. The commitment
policy also led to improvements in terms of the regret measure. The standard every-step
method learned values that were consistently over-optimistic and also generally greater in
variance than the commitment policy methods.
With the Q-learning and Q(λ) methods (see Figure 7.7), the general picture is that some
improvements are seen over the commitment policy method as a result of using the first-state
updates. This happens in each measure to some degree. This result is somewhat surprising,
especially for Q-learning, which can be viewed as performing a stochastic version of the
value-iteration updates used in the model-learning experiment. A possible reason for this
is the recency biasing effects of high learning rates (as seen in the Q-learning example in
Section 7.2). To test this, the experiment was repeated with a lower and fixed learning rate
(α = 0.1). In this case, the difference between the every-state and first-state commitment
policy methods shrinks (see Figures 7.9 and 7.10).
[Plots: average episode length (offline), mean regret and mean squared regret against episode for Every-Step Q(0), Every-Step Q(0) + Commitment Policy and First-State Q(0).]

Figure 7.7: Q-learning results in the mountain car task with declining α.


[Plots: average episode length (offline), mean regret and mean squared regret against episode for Every-Step PW(1.0), Every-Step PW(1.0) + Commitment Policy and First-State PW(1.0).]

Figure 7.8: Peng and Williams' Q(λ) results in the mountain car task with declining α.
[Plots: average episode length (offline), mean regret and mean squared regret against episode for Every-Step Q(0), Every-Step Q(0) + Commitment Policy and First-State Q(0), using the fixed learning rate.]

Figure 7.9: Q-learning results in the mountain car task with α = 0.1.


[Plots: average episode length (offline), mean regret and mean squared regret against episode for Every-Step PW(1.0), Every-Step PW(1.0) + Commitment Policy and First-State PW(1.0), using the fixed learning rate.]

Figure 7.10: Peng and Williams' Q(λ) results in the mountain car task with α = 0.1.

7.6 Discussion


Previous work has identified the benefits of using multi-step return estimates in non-Markov
settings [103, 60, 159]. Here we have seen how discretisation of the state-space can cause
the representation to appear non-Markov and so can introduce biases for bootstrapping
RL algorithms. The first-state methods are intended to reduce this bias by ensuring that
learned values used for bootstrapping are also lower in bias. In cases where SMDP variants
of RL algorithms are available (and we have seen that such methods can be derived
straightforwardly), implementing a first-state method is also straightforward through the
use of wrapper functions. An empirical comparison with fixed timescale methods was provided.
Overall, the experimental results with the mountain car task were disappointing.
Possibly, this may be due to the relatively small time it takes to traverse a region in this
task. The major improvements were found to be a result of following the commitment
policy rather than learning first-state value estimates. Some improvements were seen in the
model-free case as a result of the first-state updates, but only where high learning rates
caused every-step updating methods to become unduly recency biased.
Other work has pointed to the use of adaptive timescale models and updates in adaptive
discretisation schemes. Notably, in [95, 92] Munos and Moore generate a multi-time model
from a state dynamics model. The model is built by running simulated trajectories until
a successor region is entered. This is essentially the same as the first-state method and
was developed independently. However, in this work the aim was simply to produce a
model. The value-biasing problems of state-aliasing are unlikely to be as severe since linear
interpolation occurs between regions.
In [100] and [101], by Pareigis, the local timescale of an update is halved if this causes local
value estimates to increase.² This assumes that learning at the shorter timescale yields
greedy policies with locally greater values, and that the larger timescale does not lead to
overestimates of state-values. Section 7.2 showed how such overestimates can occur.
In genuine SMDPs, RL methods need to learn at varying timescales simply because information
is received from the environment at varying intervals. Other than the first-state
method, some RL methods choose to learn over variable timescales. This includes work using
macro-actions [134, 39, 83, 102, 110, 143, 22, 43]. A macro-action is a prolonged action
composed of several successive actions, such as another lower-level policy, or a hierarchy
of policies, or some handcoded controller. Learning in this way can result in significant
speedups – return information is propagated to states many steps in the past, and committing
to a fixed macro-action can aid exploration in the same way as we have seen above
(i.e. by preventing dithering). In the Options framework [143] (and also the HAM methods
in [102]), if the environment is a discrete MDP, speedups can be provided while also ensuring
convergence to optimality by learning at the abstract and flat (MDP) level simultaneously.
Optimality follows from noting that, if actions in the MDP level have a greater value than
the macro actions, then the optimal solution is to follow these low-level actions. Q-values at
the MDP level may also bootstrap from Q-values in the abstract level – eventually Q-values
for macro actions must become as low or lower than those for taking MDP-level actions.

² Again, to compare the local values at different timescales, a deterministic continuous time model of the
state-dynamics is assumed to be known.

The existing work with macro-actions still applies "single-step multi-time" learning updates
(e.g. the adaptations of DP and Q-learning in Section 7.3). It seems likely that these
methods might also benefit from the use of the new SMDP TD(λ) or SMDP experience
stack algorithm for the same reasons that these methods help in the fixed time interval
case. These are multi-step, multi-time methods in the sense that their return estimates
may bootstrap from values in the entire future, rather than a small subset of it. Some
macro-learning methods learn at lower levels and higher levels in parallel while higher level
policies are followed. In this case, efficient off-policy control learning methods such as those
presented in Chapter 4 would seem appropriate.
Chapter 8

Summary

Chapter Outline
This chapter summarises the main contributions of the thesis, lists specific
contributions and suggests directions for future research.

8.1 Review
This thesis has examined the capabilities of existing reinforcement learning algorithms,
developed new algorithms that extend these capabilities where they have been found to be
deficient, developed a practical understanding of new algorithms through experiment and
analysis, and has also strengthened elements of reinforcement learning theory.
It has focused upon two existing problems in reinforcement learning: i) problems of off-policy
learning, and ii) problems with error-minimising function approximation approaches
to reinforcement learning. These are the major contributions of the thesis and are detailed
below:
Off-policy Learning   Off-policy learning methods allow agents to learn about one behaviour
while following another. For control optimisation problems, agents need to
evaluate the return available under the greedy policy in order to converge upon the optimal
one. However, experience may be generated in fairly arbitrary ways – for example,
generated by a human expert, or by a mechanism that selects actions in order to manage
the exploration-exploitation tradeoff. Efficient off-policy learning methods already exist in
the form of backward replayed Q-learning. However, it was previously unclear how this
could be applied as an online learning algorithm. Online learning is an important feature of
any method which efficiently manages the exploration-exploitation tradeoff. On one hand,
eligibility trace methods can already be applied online and have enjoyed widespread use
as a result. However, as sound off-policy methods they can be very inefficient. Moreover,
where offline learning is possible (e.g. if the environment is acyclic), it would seem
that backward-replaying forward view methods is a generally more preferable approach.
A forwards-backwards equivalence proof demonstrates that these methods learn from essentially
the same estimate of return, but the forward view is more straightforward (analytically)
and also has a natural computationally efficient implementation. Furthermore,
backwards replay provides extra efficiency gains over eligibility trace methods when bootstrapping
estimates of return are used (λ < 1). This comes from learning with information
that is simply more up-to-date.
The work with the new experience stack algorithm in Section 4.4 represents an advance
by inheriting the desirable properties of backwards-replay (and clarifying what these are),
and also allowing for online learning. When used for off-policy greedy policy evaluation it
provides advantages over Watkins' Q(λ) (and Q-learning), by allowing credit for
the current reward to be propagated back further than the last non-greedy action. However,
it was shown that achieving this gain is strongly dependent upon whether the Q-values used
as bootstrapping value estimates are over-estimates (i.e. whether they are optimistic). It
was shown how optimistic initial value-functions (the rule of thumb for many exploration
methods) can severely inhibit credit assignment for a variety of control-optimising RL methods.
The separation of optimistic value estimates for encouraging exploration and the value
estimates used as predictions of return appears to offer a solution to this problem.

Function Approximation for Reinforcement Learning   In order to scale up value-based
RL methods to solve practical tasks with many-dimensional state-features or tasks
with continuous (or non-discrete) state, function approximators are employed to represent
value functions and Q-functions. But many popular methods are known to suffer from
instabilities, particularly when used with control-optimising RL methods or with off-policy
update distributions (e.g. if making updates with experience gathered under exploring policies).
The well-studied least-mean-squared error minimising gradient descent method is a
famous example. It was shown how, through a new choice of error measure to minimise,
this method can be made more stable. The boundedness of discounted return estimating
RL methods was shown with this function approximation method. In particular, the proof
holds for off-policy Q-learning and the new experience stack algorithm – the stability of
these methods with gradient descent function approximation was not previously known.
However, the linear averager method appears to be a less powerful function approximation
technique than the original LMS method, although it has also frequently been used
successfully for RL in the past.
In Section 6.2 the decision boundary partitioning (DBP) heuristic for representation discretisation
was presented. The refinement criteria followed from the idea that, in continuous
state-spaces, optimal problem solutions often have large areas of uniform policy. It is expected
therefore that, in such cases, compact representations of optimal policies follow from
attempting to represent in detail only those areas where the policy changes (decision boundaries).
The major contribution here is the idea that function approximation should not be
motivated by minimising the error between the learned and observed estimates of return,
but by attempting to find the best action available in a state. A new method was introduced
to refine the representation in areas where the greedy policy changes. An empirical test
found the method to outperform fixed uniform discretisations. Coarse representations in
the initial stages allowed fast learning and good initial policy approximations to be quickly
learned. The finer discretisations which followed allowed policies of better quality to be
learned.
The recent work by Munos and Moore (conducted independently and simultaneously) shows
the DBP heuristic to find sub-optimal policies. Non-local refinement is also required in
order to achieve accurate value estimates and therefore correct placement of the decision
boundaries (at least for heavily bootstrapping value estimation procedures such as value-iteration).
However, their method requires a model (or one to be learned) in order to be
applied.

8.2 Contributions
The following is a list of the specific contributions in order of appearance.
• In Section 2.4.3 an adaptation was made to the approximate modified policy iteration
algorithm presented by Sutton and Barto in their standard text [150]. Their algorithm
appears to be the first of its kind which explicitly claims to terminate and as such is
of fundamental importance to the field. An oversight in their algorithm was shown
using the new counterexamples in Figure 2.5. The algorithm was corrected and error
bounds for the quality of the final policy were provided. A proof is provided in
Appendix B which follows straightforwardly from the work of Williams and Baird
[171]. The correction features in the errata of [150].
• The approximate equivalence of batch-mode accumulate-trace TD(λ) and a direct λ-return
estimating algorithm is well known to the RL community – a derivation can be
found in [150] for fixed λ. In an empirical demonstration in Section 3.4.9, it was shown
that this equivalence does not hold in the online-updating case (even approximately
so), in cases where the environment is cyclical such that the accumulating trace value
grows above some threshold. This result followed from the intuitive insight that
stochastic updating rules of the form Z_{t+1} = Z_t + α(z_t − Z_t), having stepsizes α greater
than 2, diverge to infinity in cases where z_t is independent of Z_t.
• In Section 4.2.2 modifications to Wiering's Fast Q(λ) were described where it was
likely that existing published versions of this algorithm might be misinterpreted. An
empirical test was performed to demonstrate the algorithm's equivalence to Q(λ).
This work was published jointly with Marco Wiering as [125].
• Section 4.4 introduced the Experience Stack algorithm. The existing backward replay
method was adapted to allow for efficient model-free online off-policy control
optimisation. Unlike other popular online learning methods (such as eligibility trace
approaches), the method directly learns from λ-return estimates and also has a natural
computationally efficient implementation. An experimental and theoretical analysis
of the algorithm's parameters provided a characterisation of when the algorithm is
likely to outperform related eligibility trace methods. This work was published as
[123, 121].
• In Section 4.7 optimistic initial value-functions were found to severely inhibit the
error-reducing abilities of greedy-policy evaluating RL methods. It was also seen how
exploration methods that employ optimism to encourage exploration can avoid these
problems by separating return predictions from the optimistic value estimates used to
encourage exploration. This work was published as [120, 122].
• In Section 5.7 a "linear-averager" value function approximation scheme was formalised.
The approximation scheme is already used for reinforcement learning and
differs from the well studied incremental least mean square (LMS) gradient descent
scheme only in the error measure being minimised. A proof of finite (but possibly very
large) error in the value function was shown for all discounted return estimating RL
algorithms when employing a linear averager for function approximation. Notably,
the proof covers new cases such as Q-learning with arbitrary experience distributions
(i.e. arbitrary exploration). Examples of divergence in this case exist for the LMS
method. This work was published as [124].
• Section 6.2 introduced the decision boundary partitioning (DBP) heuristic for representation
refinement based upon changes in the greedy action. This work was published
as [117, 119, 115].
• In Chapter 7 an analysis of the biasing problems associated with bootstrapping algorithms
in discretised continuous state spaces was performed. A generic RL algorithm
modification was suggested to reduce this bias by attempting to learn the expected
first-state values of continuous regions. Some bias reduction and policy quality improvements
were observed, but most improvements could be attributed either to following
a policy which commits to a single action throughout a region, or to related
problems associated with learning with large learning rates.
• In Appendix C, accumulate trace TD(λ) was adapted to the SMDP case. An equivalence
with a forward-view SMDP method was established for the batch update and
acyclic process case by adapting the proof method for the MDP case found in Sutton
and Barto's standard text [150].

8.3 Future Directions

Following the advances made in this thesis, a number of questions and avenues for future
research arise.
Experience Stack Reinforcement Learning.   Further work with the Experience Stack
method may yield further refinements to the algorithm. For example, the use of a stack
to store experience sequences was introduced to allow the sequences to be replayed in the
reverse of their observed order. Other methods could replay the sequences in different orders
so that the amount of experience replayed, and the number of states that are no longer
considered for further updating, is minimised. Also, the Bmax parameter could
be replaced by a heuristic that decides whether to immediately replay experience based upon
a measure of the benefit to the exploration strategy that experience replay may yield.
Other extensions might take ideas from Lin's original formulation (and also Cichosz' Replayed
TTD(λ)), where the same experience is replayed several times over. This could also be
done here although, as in the related work, at an increased computational cost and an
increased recency bias in the learned values. Whether these changes would lead to improved
performance could be the subject of further study.
The most pressing extension to the experience stack method is its adaptation for use with
parameter-based function approximators (such as the CMAC). Here the major issue is
how to decide when to replay experience since exact state revisits rarely occur as in the
MDP/table-lookup case. A possible solution is to record the potential scale of change in a
parameter's value that is possible if the stored experience is replayed.

Exploitation of the Optimistic Bias Problem.   There are many algorithms that one
may choose to apply in solving RL problems. Which should be used and when? In particular,
for control optimisation there are algorithms which evaluate the greedy policy (e.g.
Q-learning, Watkins' Q(λ), value-iteration). Algorithms for evaluating fixed policies (e.g.
TD(λ), SARSA(λ) and DP policy evaluation methods) may also be used for control by
assuming that an evaluation of a fixed policy is sought, and then making this policy progressively
more greedy. The subtle difference is that fixed policy evaluation methods seem
likely to quickly eliminate unhelpful optimistic biases since their initial fixed policy has a
value function which is less than or equal to the optimal one in every state. However, while
these methods are spending time evaluating a fixed policy, they are not necessarily improving
their policy. With this in mind, future work might aim to examine optimal ways of
selecting how greedy the policy under evaluation should be made in order to reduce value-function
error at the fastest possible rate. Initial work in this direction might examine the
differences between policy-iteration and value-iteration and seek hybrid approaches (similar
to Puterman's modified policy-iteration [114]).
Also, it remains to be seen whether, following from the dual update results in Section 4.7.3,
better exploration strategies can be developed. Improvements could be expected to follow
through providing exploration schemes with more accurate value estimates.

Non-Orthogonal Partitioning Representations.   The grid-like partitionings of kd-trees
seem unlikely to allow methods employing them to scale well in many problems with
very high dimensional state-spaces. In high dimensional spaces, important features (such as
decision boundaries, or the Parti-Game's win-lose boundary) may be of a low dimensionality
but run diagonally across many dimensions. In this case, partitionings may be required in
every dimension to adequately represent the important features, and the total representation
cost may grow exponentially with the dimensionality of the state-space. The inability to
efficiently represent simple features such as diagonal planes follows from the fact that the
kd-tree makes splits that are orthogonal to all but one axis (i.e. the resolution is increased
in only one dimension per split). To alleviate this, non-orthogonal partitioning could be
employed. For instance, partitionings may be defined by arbitrarily placed hyper-planes,
thus allowing arbitrary planar features to be represented more efficiently.
Exploration with Adaptive Representations.   Where systems with unknown dynamics
must be controlled, RL methods always face the exploration-exploitation tradeoff. Most
of the work concerned with exploration appears to have focused on the case where the environment
is a small discrete MDP. How best to explore continuous state-spaces remains a
difficult problem, but it is one for which we may be able to make additional assumptions that
are not possible, or reasonable, in the discrete MDP cases (e.g. such as similar states having
similar values or similar dynamics). Where adaptive representations are employed, exploration
may be required to explore the finer control possible at higher resolutions. However,
how the relative importance of exploring different parts of the space should be measured
is not at all clear. In particular, the "prior" commonly used by many MDP exploration
methods is to assume that any untried action leads directly to the highest possible valued
state. This seems unreasonable for the Q-values of newly split regions since, intuitively,
the coarser representation should provide some information about the values at the finer
resolution.

8.4 Concluding Remarks

Over the history of reinforcement learning there have been a number of truly outstanding
practical applications. Yet these reports remain in the minority. Much of the work, like the
contributions made here, is concerned with expanding the fringes of theory and understanding
in incremental ways. Most work considers example "toy" problems that serve well
in demonstrating how new methods work where the old ones do not, how the behaviour of
a particular method varies in interesting ways with the adjustment of some parameter, or
shows some formal proof about behaviour. The use of toy problems is to be expected in
any work which tackles such difficult and general problems as those which reinforcement
learning aims to solve.
Even so, the future challenge for reinforcement learning lies in proving itself in the real
world. Its widespread practical usefulness needs to be placed beyond question in ways
similar to those which have been achieved by expert systems, pattern recognition and
genetic algorithms. This can only be done by finding real problems that people have, and
applying reinforcement learning to solve them.
Appendix A

Foundation Theory of Dynamic Programming

This appendix presents some fundamental theorems and notations from the field of Dynamic
Programming.

A.1 Full Backup Operators

This section introduces a notation for the backup operators introduced in Chapter 2.
B^π represents an evaluation of a policy π using one-step lookahead:
$$B^{\pi} \hat{V}(s) = E\left[ r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s, \pi \right] \tag{A.1}$$
$$= \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \hat{V}(s') \right] \tag{A.2}$$

B represents an evaluation of a greedy policy using one-step lookahead:
$$B \hat{V}(s) = \max_{a} E\left[ r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s, a_t = a \right] \tag{A.3}$$
$$= \max_{a} \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \hat{V}(s') \right] \tag{A.4}$$

B^π and B are bootstrapping operators – they form new value estimates based upon existing
value estimates. B V̂ is a shorthand for a synchronous update sweep across all states (see
Section 2.3.2).
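A small Python sketch of the two operators over a tabular model may help make the notation concrete; here P[s][a] is assumed to be a list of (probability, next_state, reward) triples and pi[s][a] the policy's action probabilities, which are conventions of the sketch only.

    def B_pi(V, P, pi, gamma):
        """One application of the policy-evaluation backup operator B^pi (Equation A.2)."""
        return {
            s: sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[s][a])
                   for a in P[s])
            for s in P
        }

    def B(V, P, gamma):
        """One application of the greedy backup operator B (Equation A.4), i.e. a
        synchronous value-iteration sweep."""
        return {
            s: max(sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[s][a])
                   for a in P[s])
            for s in P
        }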

A.2 Unique Fixed-Points and Optima

It had been shown by Bellman that V̂ is the unique value function for the optimal policy
π* if V̂ is a fixed point of B [16]. That is to say, if V̂ = B V̂ then V̂ = V* and so V̂ is optimal.
Similarly, if V̂ = B^π V̂ then V̂ = V^π.

A.3 Norm Measures

The norm operator is denoted as ||X|| and represents some arbitrary distance measure given
by the size of the vector X = (x₁, ..., xₙ). Of interest is,
$$\| X \|_{\infty} = \max_{i} |x_i|, \tag{A.5}$$
and is a maximum-norm distance.
The max-norm measure is of interest in dynamic programming as it provides a useful measure
of the error in a value function. In particular,
$$\left\| V^{*} - \hat{V} \right\|_{\infty} = \max_{s} \left| V^{*}(s) - \hat{V}(s) \right|, \tag{A.6}$$
is a Bellman Error or Bellman Residual and is a measure of the largest difference between
an optimal and estimated value function.

A.4 Contraction Mappings

The backup operators B and B^π are contraction mappings. That is to say that they monotonically
reduce the error in the value estimate. The following proof was first established
by Bellman:¹
$$\left\| V^{*} - B\hat{V} \right\|_{\infty} \leq \gamma \left\| V^{*} - \hat{V} \right\|_{\infty} \tag{A.7}$$
Proof:
$$
\begin{aligned}
\left\| BV^{*} - B\hat{V} \right\|_{\infty}
&= \max_{s} \left| B V^{*}(s) - B\hat{V}(s) \right| \\
&= \max_{s} \left| \max_{a}\left( \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma V^{*}(s') \right] \right)
 - \max_{a}\left( \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma \hat{V}(s') \right] \right) \right| \\
&\leq \max_{s} \max_{a} \left| \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma V^{*}(s') \right]
 - \sum_{s'} P^{a}_{ss'}\left[ R^{a}_{ss'} + \gamma \hat{V}(s') \right] \right| \\
&= \max_{s} \max_{a} \, \gamma \left| \sum_{s'} P^{a}_{ss'}\left[ V^{*}(s') - \hat{V}(s') \right] \right| \\
&\leq \gamma \max_{s'} \left| V^{*}(s') - \hat{V}(s') \right| \\
&= \gamma \left\| V^{*} - \hat{V} \right\|_{\infty}
\end{aligned}
$$
Since B V* = V*, this gives (A.7). Using a similar method it can be shown that,
$$\left\| V^{\pi} - B^{\pi}\hat{V} \right\|_{\infty} \leq \gamma \left\| V^{\pi} - \hat{V} \right\|_{\infty} \tag{A.8}$$

¹ This version is taken from [167].
A.4.1 Bellman Residual Reduction

The following bound follows from the above contraction mapping A.8 [172, 171, 19, 21]:
$$\left\| \hat{V} - V^{\pi} \right\|_{\infty} \leq \frac{\left\| \hat{V} - B^{\pi}\hat{V} \right\|_{\infty}}{1 - \gamma} \tag{A.9}$$
Proof: By the triangle inequality,
$$\left\| \hat{V} - V^{\pi} \right\|_{\infty}
\leq \left\| \hat{V} - B^{\pi}\hat{V} \right\|_{\infty} + \left\| B^{\pi}\hat{V} - V^{\pi} \right\|_{\infty}
\leq \left\| \hat{V} - B^{\pi}\hat{V} \right\|_{\infty} + \gamma \left\| \hat{V} - V^{\pi} \right\|_{\infty},$$
from which it follows that,
$$\left\| \hat{V} - V^{\pi} \right\|_{\infty} \leq \frac{\left\| \hat{V} - B^{\pi}\hat{V} \right\|_{\infty}}{1 - \gamma}.$$
Using the same method, it can be shown that,
$$\left\| \hat{V} - V^{*} \right\|_{\infty} \leq \frac{\left\| \hat{V} - B\hat{V} \right\|_{\infty}}{1 - \gamma}. \tag{A.10}$$
These bounds provide useful practical stopping conditions for DP algorithms since the
right-hand sides can be found without knowledge of V^π or V*.
Appendix B

Modified Policy Iteration Termination

This section establishes termination conditions with error bounds for policy-iteration employing
approximate policy evaluation (i.e. modified policy iteration). The reader is assumed
to be familiar with the notation and results in Appendix A.
First, consider the evaluate-improve steps of the inner loop of the modified policy-iteration
algorithm,

    // Evaluate
    V̂ ← evaluate(π, V̂)        (Find V̂ ≈ V^π.)

    // Improve
    Δ ← 0
    for each s ∈ S:
        a_g ← argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂(s') ]
        v' ← Σ_{s'} P^{a_g}_{ss'} [ R^{a_g}_{ss'} + γ V̂(s') ]
        Δ ← max( Δ, |V̂(s) − v'| )
        π'(s) ← a_g

Since v'(s) = B V̂(s), at the end of this we have Δ = ||V̂ − B V̂||_∞. Thus, a bound on the error
of V̂ from V* at the end of this loop is given by Equation A.10,
$$\left\| \hat{V} - V^{*} \right\|_{\infty} \leq \frac{\left\| \hat{V} - B\hat{V} \right\|_{\infty}}{1 - \gamma} \tag{B.1}$$
$$= \frac{\Delta}{1 - \gamma} \tag{B.2}$$
From Equation B.1, Williams and Baird have shown that the following bound can be placed
upon the loss in return for following an improved (i.e. greedy) policy, π′, derived from V̂
[171]:
$$V^{\pi'}(s) \geq V^{*}(s) - \frac{2\gamma\Delta}{1 - \gamma} \tag{B.3}$$
for any state s. π′ is derived from V̂ in the above algorithm.
Thus we obtain the full policy-iteration algorithm with a termination threshold ε_T.

    1)   do:
    2a)      V̂ ← evaluate(π, V̂)
    2b)      Δ ← 0
    2c)      for each s ∈ S:
    2c-1)        a_g ← argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V̂(s') ]
    2c-2)        v' ← Σ_{s'} P^{a_g}_{ss'} [ R^{a_g}_{ss'} + γ V̂(s') ]
    2c-3)        Δ ← max( Δ, |V̂(s) − v'| )
    2c-4)        π(s) ← a_g        (Make π ← π'.)
    3)   while Δ > ε_T

This algorithm guarantees that,
$$V^{\pi}(s) \geq V^{*}(s) - \frac{2\gamma\,\epsilon_T}{1 - \gamma} \tag{B.4}$$
upon termination.
Note that Equation B.3 does not rely upon the evaluate procedure returning an exact evaluation
of V^π. Of course, termination requires that the evaluate/improve process converges
upon V̂ = V*. Puterman and Shin have established that modified policy-iteration will converge
if the evaluation step applies V̂ ← B^π V̂ a fixed number of times (i.e. at least once)
[113]. In the case where step 2a) is exactly V̂ ← B V̂, then the above algorithm reduces
to the synchronous value-iteration algorithm.
In practice, the evaluation step does not need to perform synchronous updates since applying
V̂(s) ← B^π V̂(s) at least once for each state in S is generally at least as effective at
reducing ||V^π − V̂||_∞ as the synchronous backup.
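A minimal Python sketch of the corrected loop, under the same tabular-model assumptions as the sketch in Appendix A (P[s][a] as a list of (probability, next_state, reward) triples), might look as follows; epsilon_T and the number of evaluation sweeps are illustrative parameters of the sketch.

    def modified_policy_iteration(P, gamma, epsilon_T, eval_sweeps=5):
        """Policy iteration with approximate evaluation and the termination
        test Delta <= epsilon_T, giving the bound of Equation B.4."""
        V = {s: 0.0 for s in P}
        pi = {s: next(iter(P[s])) for s in P}        # arbitrary initial policy
        while True:
            # Evaluate: a fixed number of applications of B^pi (Puterman and Shin).
            for _ in range(eval_sweeps):
                V = {s: sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[s][pi[s]])
                     for s in P}
            # Improve, recording the Bellman residual Delta = ||V - BV||.
            delta = 0.0
            for s in P:
                values = {a: sum(p * (r + gamma * V[s2]) for (p, s2, r) in P[s][a])
                          for a in P[s]}
                a_g = max(values, key=values.get)
                delta = max(delta, abs(V[s] - values[a_g]))
                pi[s] = a_g
            if delta <= epsilon_T:
                return pi, V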
Appendix C

Continuous Time TD(λ)

In this section, the accumulate trace TD(λ) algorithm is derived for the discrete event,
continuous time interval case. By careful choice of notation, the method found in [150] for
showing the equivalence of accumulate trace TD(λ) (the backward view) with the direct
λ-return algorithm (the forwards view), may be used.
State and reward observations are discrete events occurring in continuous time. A state
visit s_t is a discrete event, (t ∈ ℕ). For this section, t identifies an event in continuous
time – it is not a continuous time value itself. To simplify notation it is more convenient to
identify the duration between events. Let τ_t^n identify the time between events t and t + n.
The notation differs from that in Chapter 7.
Let the continuous time λ-return estimate be defined as follows:
$$z_t^{\lambda} = (1 - \lambda^{\tau_t^1})\left[ r_{t+1} + \gamma^{\tau_t^1}\hat{V}(s_{t+1}) \right]
 + \lambda^{\tau_t^1}\left[ r_{t+1} + \gamma^{\tau_t^1} z_{t+1}^{\lambda} \right]$$
where r_t represents the discounted reward immediately collected between t − 1 and t. Then
the continuous time (forward-view) λ-estimate updates states as follows:
$$\hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha\left( z_t^{\lambda} - \hat{V}(s_t) \right)$$
Consider the change in this value, based upon a single estimate of λ-return, if the update is
applied in batch-mode. (Throughout, for simplicity, α is assumed to be constant.)

$$
\begin{aligned}
\tfrac{1}{\alpha}\,\Delta^{\lambda} \hat{V}(s_t)
&= z_t^{\lambda} - \hat{V}(s_t) \\
&= -\hat{V}(s_t)
 + (1 - \lambda^{\tau_t^1})\left[ r_{t+1} + \gamma^{\tau_t^1}\hat{V}(s_{t+1}) \right]
 + \lambda^{\tau_t^1}\left[ r_{t+1} + \gamma^{\tau_t^1} z_{t+1}^{\lambda} \right] \\
&= -\hat{V}(s_t) + r_{t+1} + \gamma^{\tau_t^1}\hat{V}(s_{t+1})
 - (\gamma\lambda)^{\tau_t^1}\hat{V}(s_{t+1})
 + (\gamma\lambda)^{\tau_t^1} z_{t+1}^{\lambda} \\
&= -\hat{V}(s_t) + r_{t+1} + \gamma^{\tau_t^1}\hat{V}(s_{t+1})
 - (\gamma\lambda)^{\tau_t^1}\hat{V}(s_{t+1}) \\
&\qquad + (\gamma\lambda)^{\tau_t^1}\left[ r_{t+2} + \gamma^{\tau_{t+1}^1}\hat{V}(s_{t+2})
 - (\gamma\lambda)^{\tau_{t+1}^1}\hat{V}(s_{t+2})
 + (\gamma\lambda)^{\tau_{t+1}^1} z_{t+2}^{\lambda} \right] \\
&\;\;\vdots \\
&= (\gamma\lambda)^{\tau_t^0}\left[ r_{t+1} + \gamma^{\tau_t^1}\hat{V}(s_{t+1}) - \hat{V}(s_t) \right] \\
&\qquad + (\gamma\lambda)^{\tau_t^1}\left[ r_{t+2} + \gamma^{\tau_{t+1}^1}\hat{V}(s_{t+2}) - \hat{V}(s_{t+1}) \right] \\
&\qquad + (\gamma\lambda)^{\tau_t^2}\left[ r_{t+3} + \gamma^{\tau_{t+2}^1}\hat{V}(s_{t+3}) - \hat{V}(s_{t+2}) \right] \\
&\qquad + \cdots
\end{aligned}
$$
where the expansion is repeated for $z_{t+2}^{\lambda}, z_{t+3}^{\lambda}, \ldots$, using
$\tau_t^1 + \tau_{t+1}^1 = \tau_t^2$ (and so on) together with $(\gamma\lambda)^{\tau_t^0} = 1$.

Let the 1-step continuous time TD error be defined as:
$$\delta_k = r_{k+1} + \gamma^{\tau_k^1} \hat{V}(s_{k+1}) - \hat{V}(s_k),$$
then,
$$\Delta^{\lambda} \hat{V}(s_t) = \alpha \sum_{k=t}^{\infty} (\gamma\lambda)^{\tau_t^{k-t}} \delta_k$$
for a single λ-return estimate. In the case where a state s may be revisited several times
during the episode, we have:
$$\Delta^{\lambda} \hat{V}(s) = \alpha \sum_{t=0}^{\infty} I(s, s_t) \sum_{k=t}^{\infty} (\gamma\lambda)^{\tau_t^{k-t}} \delta_k \tag{C.1}$$
$$= \alpha \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} (\gamma\lambda)^{\tau_t^{k-t}} I(s, s_t)\, \delta_k$$
Since $\sum_{x=L}^{H}\sum_{y=x}^{H} f(x,y) = \sum_{y=L}^{H}\sum_{x=L}^{y} f(x,y)$ for any $L$, $H$ and $f$,
$$\Delta^{\lambda} \hat{V}(s) = \alpha \sum_{k=0}^{\infty}\sum_{t=0}^{k} (\gamma\lambda)^{\tau_t^{k-t}} I(s, s_t)\, \delta_k$$
Through reflection in the plane $x = y$, $\sum_{x=L}^{H}\sum_{y=L}^{x} f(x,y) = \sum_{y=L}^{H}\sum_{x=L}^{y} f(y,x)$, for any $L$, $H$ and $f$,
$$\Delta^{\lambda} \hat{V}(s) = \alpha \sum_{t=0}^{\infty}\sum_{k=0}^{t} (\gamma\lambda)^{\tau_k^{t-k}} I(s, s_k)\, \delta_t
 = \alpha \sum_{t=0}^{\infty} \delta_t \sum_{k=0}^{t} (\gamma\lambda)^{\tau_k^{t-k}} I(s, s_k)$$
Defining an eligibility value for s as:
$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{\tau_k^{t-k}} I(s, s_k),$$
then the eligibility traces for all states may be calculated incrementally as follows:
$$\forall s \in S, \quad e_t(s) \leftarrow
\begin{cases}
(\gamma\lambda)^{\tau_{t-1}^1}\, e_{t-1}(s) + 1, & \text{if } s = s_t, \\
(\gamma\lambda)^{\tau_{t-1}^1}\, e_{t-1}(s), & \text{otherwise,}
\end{cases}$$
and the state values incrementally updated as follows:
$$\forall s \in S, \quad \hat{V}(s) \leftarrow \hat{V}(s) + \alpha\, \delta_t\, e_t(s).$$
As for single-step TD(λ), this forward-backward equivalence applies only for the batch
updating and acyclic environment case. The equivalence is approximate for the general
online-learning case since V̂, as seen by the TD errors, is fixed in value throughout the
episode.
In cases where episode lengths are finite and s_T is the terminal state, since by definition
δ_k = 0, (k ≥ T), then (C.1) may precisely be rewritten as,
$$\Delta^{\lambda} \hat{V}(s) = \alpha \sum_{t=0}^{T-1} I(s, s_t) \sum_{k=t}^{T-1} (\gamma\lambda)^{\tau_t^{k-t}} \delta_k.$$
Using a similar method to the steps following (C.1), the same update rule follows for the
terminating state case as for the infinite trial case.
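The resulting backward-view update can be sketched in a few lines of Python; V and e are dictionaries over states (V assumed to contain an entry for every state), tau is the real-valued duration of the step, and all names are assumptions of the sketch rather than a definitive implementation.

    def smdp_td_lambda_step(V, e, s, r, s_next, tau, alpha, gamma, lam):
        """One backward-view update of continuous-time accumulate-trace TD(lambda).
        r is the discounted reward collected over the step of duration tau."""
        # Accumulate eligibility for the state just left.
        e[s] = e.get(s, 0.0) + 1.0
        # Continuous-time TD error: the correction V(s') is discounted by gamma^tau.
        delta = r + (gamma ** tau) * V[s_next] - V[s]
        decay = (gamma * lam) ** tau
        for x in e:
            # Update every state in proportion to its eligibility, then decay
            # its trace by (gamma * lambda)^tau ready for the next event.
            V[x] += alpha * delta * e[x]
            e[x] *= decay
        return V, e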
Appendix D

Notation, Terminology and Abbreviations

α            Learning step-size.
α_k(s,a)     Learning step-size at the kth update of (s,a).
β            Learning rate schedule parameter where α_k(s,a) = 1/k(s,a)^β.
(·)_offpol   Allowable non-greediness threshold.
(·)_0        Initial value function error.
γ            Discount factor: discounted return = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
ε_T          Small termination error threshold.
ε            Exploration parameter. Likelihood of taking a random action.
e(s)         Eligibility trace for state s.
e′(s)        Fast Q(λ) eligibility trace for state s.
E[x]         Expectation of x.
E[x|y]       Conditional expectation. Expectation of x given y.
I(a,b)       Identity function. Yields 1 if a = b and 0 otherwise.
N(s,a)       Number of times a is observed in s.
π            A policy.
π*           An optimal policy.
π_n          A nearly-greedy policy.
π_g          A greedy policy.
P^a_{ss′}    State transition probability function. Probability of entering s′ after taking a in s.
𝒫^a_{ss′}    Discounted state transition probability function. As P but includes the mean amount of discounting occurring between leaving s and entering s′.
Pr(x)        Probability of event x.
Pr(x|y)      Conditional probability. Probability of x given y.
Q₀           Initial Q-function estimate.
Q(s,a)       A Q-value. The long-term expected return for taking action a in state s.
Q⁺(s,a)      An up-to-date Q-value. See Fast Q(λ).
R^a_{ss′}    Expected immediate reward function for taking a in s and transiting to s′.
r_t          Immediate reward received for the action taken immediately prior to time t.
ℛ^a_s        Discounted immediate reward function.
t            Discrete time index. (Or step index in the SMDP case.)
τ            Real valued time duration.
Û(s)         Generic return correction. Replace with the estimated value at s of following the evaluation policy from s (e.g. U(s) = max_a Q(s,a) for greedy policy evaluation).
V*           The value function for the optimal policy.
V^π          The value function for the policy π.
V̂^π          Estimate of the value function for the policy π.
V̂_0          Initial value function estimate.
X̂            Estimate of E[X].
z            Estimation target. Observed value whose mean we wish to estimate.
z^(1)        1-step corrected truncated return estimate.
z^(n)        n-step corrected truncated return estimate.
z^λ          λ-return estimate.
z^(λ,n)      n-step corrected truncated λ-return estimate.
x ←_y z      Shorthand for the update x ← x + y(z − x).
(·)          Global amount of decay.
δ            TD error.
←            Assignment.

backward-view          Eligibility trace method. Updates of the form: V(s) ← V(s) + αδe(s).
greedy-action          argmax_a Q̂(s,a).
fixed-point            x is the fixed-point of f if x = f(x).
forward-view           Updates of the form: V̂(s) ← V̂(s) + α(z − V̂(s)).
λ-return method        A forward view method.
n-step truncated return            r_{t+1} + ... + γ^{n−1} r_{t+n}.
n-step truncated corrected return  r_{t+1} + ... + γ^{n−1} r_{t+n} + γ^n U(s_{t+n+1}).
off-policy             Different to the policy under evaluation.
on-policy              As the policy under evaluation.
return correction      U(s_{t+n+1}) in a corrected n-step truncated return.
return                 Long term measure of reward.
state                  Environmental situation.
state-space            Set of all possible environmental situations.

BR       Backwards Replay
DBP      Decision Boundary Partitioning
DP       Dynamic Programming
FA       Function Approximator
LMSE     Least Mean Squared Error
MDP      Markov Decision Process
POMDP    Partially Observable Markov Decision Process
PW       Peng and Williams' Q(λ)
RL       Reinforcement Learning
SAP      State Action Pair
SMDP     Semi-Markov Decision Process (continuous time MDP)
TTD      Truncated TD(λ)
WAT      Watkins' Q(λ)
Bibliography
[1] C. G. Atkeson, A. W. Moore, and S. Schaal. Memory-based learning for control. Technical Report CMU-RI-TR-95-18, CMU Robotics Institute, April 1995.
[2] M. A. Al-Ansari and R. J. Williams. Efficient, globally-optimized reinforcement learning with the Parti-game algorithm. In Advances in Neural Information Processing Systems 11. The MIT Press, Cambridge, MA, 1999.
[3] J. S. Albus. Data storage in the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3), 1975.
[4] J. S. Albus. A new approach to manipulator control: the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3), 1975.
[5] C. Anderson. Approximating a policy can be easier than approximating a value function. Technical Report CS-00-101, Department of Computer Science, Colorado State University, CO, USA, 2000.
[6] C. Anderson and S. Crawford-Hines. Multigrid Q-learning. Technical Report CS-94-121, Colorado State University, Fort Collins, CO 80523, 1994.
[7] David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
[8] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. AI Review, 11:75–113, 1996.
[9] L. C. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems, volume 11, 1999.
[10] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–77, San Francisco, 1995. Morgan Kaufmann.
[11] Leemon C. Baird. Reinforcement Learning Through Gradient Descent. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 1999. Technical Report Number CMU-CS-99-132.

[12] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
[13] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning problems. IEEE Transactions on Systems, Man and Cybernetics, 13(5):834–846, September 1983.
[14] R. Beale and T. Jackson. Neural Computing: An Introduction. Institute of Physics Publishing, Bristol, UK, 1990.
[15] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.
[16] R. E. Bellman and S. E. Dreyfus. Applied Dynamic Programming. RAND Corp, 1962.
[17] D. P. Bertsekas. Distributed dynamic programming. IEEE Transactions on Automatic Control, 27:610–616, 1982.
[18] D. P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107–120, 1983.
[19] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, Englewood Cliffs, NJ, 1987.
[20] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.
[21] D. P. Bertsekas and J. N. Tsitsiklis. Neurodynamic Programming. Athena Scientific, Belmont, MA, 1996.
[22] Michael Bowling and Manuela Veloso. Bounding the suboptimality of reusing subproblems. In Proceedings of IJCAI-99, 1999.
[23] Justin Boyan and Andrew Moore. Robust value function approximation by working backwards. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, Tahoe City, California, July 9, 1995.
[24] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Proceedings of Neural Information Processing Systems, volume 7. Morgan Kaufmann, January 1995.
[25] Steven J. Bradtke and Michael O. Duff. Reinforcement learning for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems, volume 7, pages 393–400, 1995.
[26] P. V. C. Caironi and M. Dorigo. Training Q agents. Technical Report IRIDIA-94-14, Universite Libre de Bruxelles, 1994.
[27] Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Brown University, Department of Computer Science, Providence, RI, 1998.
[28] David Chapman and Leslie Pack Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence, pages 726–731. Morgan Kaufmann, San Mateo, CA, 1991.
[29] C. S. Chow and J. N. Tsitsiklis. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36:898–914, 1991.
[30] Pawel Cichosz. Truncated temporal differences and sequential replay: Comparison, integration, and experiments. In Proceedings of the Poster Session of the Ninth International Symposium on Methodologies for Intelligent Systems, 1996.
[31] Pawel Cichosz. Reinforcement Learning by Truncating Temporal Differences. PhD thesis, Warsaw University of Technology, Poland, July 1997.
[32] Pawel Cichosz. TD(λ) learning without eligibility traces: A theoretical analysis. Artificial Intelligence, 11:239–263, 1999.
[33] Pawel Cichosz. A forwards view of replacing eligibility traces for states and state-action pairs. Mathematical Algorithms, 1:283–297, 2000.
[34] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction To Algorithms. The MIT Press, Cambridge, Massachusetts, 1990.
[35] Richard Dearden, Craig Boutilier, and Moises Goldszmidt. Stochastic dynamic programming with factored representations. Artificial Intelligence. To appear.
[36] Robert H. Crites. Large-Scale Dynamic Optimization Using Teams Of Reinforcement Learning Agents. PhD thesis, (Computer Science) Graduate School of the University of Massachusetts, Amherst, September 1996.
[37] Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. In Advances in Neural Information Processing Systems, volume 9, 1996.
[38] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8:341–362, 1992.
[39] P. Dayan. Improving generalisation for temporal difference learning: The successor representation. Neural Computation, 5:613–624, 1993.
[40] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Proceedings of UAI-99, Stockholm, Sweden, 1999.
[41] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Proceedings of AAAI-98, Madison, WI, 1998.
[42] Morris H. DeGroot. Probability and Statistics. Addison Wesley, 2nd edition, 1989.
[43] Thomas G. Dietterich. State abstraction in MAXQ hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.

[44] Kenji Doya. Temporal difference learning in continuous time and space. In Advances in Neural Information Processing Systems, volume 8, pages 1073–1079, 1996.
[45] P. Dupuis and M. R. James. Rates of convergence for approximation schemes in optimal control. SIAM Journal of Control and Optimisation, 360(2), 1998.
[46] Fernando Fernandez and Daniel Borrajo. VQQL. Applying vector quantization to reinforcement learning. In M. Veloso, E. Pagello, and Hiroaki Kitano, editors, RoboCup-99: Robot Soccer WorldCup III, number 1856 in Lecture Notes in Artificial Intelligence, pages 171–178. Springer, 2000.
[47] Jerome H. Friedman, Jon L. Bentley, and Raphael A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.
[48] G. J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 261–268, San Francisco, CA, 1995. Morgan Kaufmann.
[49] Geoffrey J. Gordon. Online fitted reinforcement learning from the value function approximation. In Workshop at ML-95, 1995.
[50] Geoffrey J. Gordon. Chattering in SARSA(λ). CMU Learning Lab internal report. Available from http://www-2.cs.cmu.edu/~ggordon/, 1996.
[51] Geoffrey J. Gordon. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
[52] W. Hackbusch. Multigrid Methods and Applications. Springer-Verlag, 1985.
[53] M. Hauskrecht, N. Meuleau, C. Boutilier, L. Pack Kaelbling, and T. Dean. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the 1998 Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, 1998.
[54] Robert B. Heckendorn and Charles W. Anderson. A multigrid form of value-iteration applied to a Markov decision process. Technical Report CS-98-113, Computer Science Department, Colorado State University, Fort Collins, CO 80523, November 1998.
[55] John H. Holland, Lashon B. Booker, Marco Colombetti, Marco Dorigo, David E. Goldberg, Stephanie Forrest, Rick L. Riolo, Robert E. Smith, Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson. What is a Learning Classifier System? In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, Learning Classifier Systems. From Foundations to Applications, volume 1813 of LNAI, pages 3–32, Berlin, 2000. Springer-Verlag.
[56] Ronald A. Howard. Dynamic Programming and Markov Decision Processes. The MIT Press, Cambridge, Massachusetts, 1960.
[57] Mark Humphrys. Action selection methods using reinforcement learning. In From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, volume 4, pages 135–144. MIT Press/Bradford Books, MA, USA, 1996.
[58] Mark Humphrys. Action Selection Methods Using Reinforcement Learning. PhD thesis, Trinity Hall, University of Cambridge, June 1997.
[59] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
[60] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learning algorithm for partially observable Markov problems. In Advances in Neural Information Processing Systems, volume 7, 1995.
[61] A. Bryson Jr. and Y. Ho. Applied Optimal Control. Hemisphere Publishing, New York, 1975.
[62] Leslie Pack Kaelbling. Learning in Embedded Systems. PhD thesis, Department of Computer Science, Stanford University, Stanford, CA, 1990.
[63] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[64] Keiko Motoyama, Keiji Suzuki, Masahito Yamamoto, and Azuma Ohuchi. Evolutionary state space configuration with reinforcement learning for adaptive airship control. In The Third Australia-Japan Workshop on Intelligent and Evolutionary Systems (Proceedings), 1999.
[65] S. Koenig and R. G. Simmons. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22:228–250, 1996.
[66] R. E. Korf. Real-time heuristic search. Artificial Intelligence, 42:189–221, 1990.
[67] J. R. Krebs, A. Kacelnik, and P. Taylor. Test of optimal sampling by foraging great tits. Nature, 275(5675):27–31, 1978.
[68] R. Kretchmar and C. Anderson. Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning. In Proceedings of the IEEE International Conference on Neural Networks, Houston, TX, pages 834–837, 1997.
[69] J. H. Kushner and Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time. Applications of Mathematics. Springer Verlag, 1992.
[70] Leonid Kuvayev and Richard Sutton. Approximation in model-based learning. In ICML'97 Workshop on Modelling in Reinforcement Learning, 1997.
[71] C. Lin and H. Kim. CMAC-based adaptive critic self-learning control. IEEE Transactions on Neural Networks, 2:530–533, 1991.

[72] L. J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.
[73] Long-Ji Lin. Scaling up reinforcement learning for robot control. In Proceedings of the Tenth International Conference on Machine Learning, pages 182–189, Amherst, MA, June 1993. Morgan Kaufmann.
[74] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh International Conference on Uncertainty in Artificial Intelligence, page 9, 1995.
[75] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms and empirical results. Machine Learning, 22:159–196, 1996.
[76] S. Mahadevan and J. Connell. Automatic programming of behavior based robots. Artificial Intelligence, 55(2-2):311–365, June 1992.
[77] Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Uncertainty in Artificial Intelligence, 1999.
[78] J. J. Martin. Bayesian Decision Problems and Markov Chains. John Wiley and Sons, New York, New York, 1969.
[79] Maja J. Mataric. Interaction and Intelligent Behavior. PhD thesis, MIT AI Lab, August 1994. AITR-1495.
[80] John H. Mathews. Numerical Methods for Mathematics, Science and Engineering. Prentice Hall, London, UK, 1995.
[81] Andrew McCallum. Instance-based utile distinctions for reinforcement learning. In Proceedings of the Twelfth International Machine Learning Conference, San Francisco, 1995. Morgan Kaufmann.
[82] Andrew K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester, Rochester, NY, 14627, USA, 1995.
[83] Amy McGovern, Richard S. Sutton, and Andrew H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In 1997 Grace Hopper Celebration of Women in Computing, 1997.
[84] C. Melhuish and T. C. Fogarty. Applying a restricted mating policy to determine state space niches using delayed reinforcement. In T. C. Fogarty, editor, Proceedings of Evolutionary Computing, Artificial Intelligence and the Simulation of Behaviour Workshop, pages 224–237. Springer-Verlag, 1994.
[85] Nicolas Meuleau and Paul Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35(2):117–154, May 1999.
[86] A. W. Moore and C. G. Atkeson. The Parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21:199-233, 1995.
[87] Andrew W. Moore. Variable resolution dynamic programming: Efficiently learning action maps on multivariate real-valued state-spaces. In L. Birnbaum and G. Collins, editors, Proceedings of the Eighth International Conference on Machine Learning. Morgan Kaufmann, June 1991.
[88] Andrew W. Moore and Christopher G. Atkeson. Prioritised sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103-130, 1994.
[89] Andrew William Moore. Efficient Memory Based Learning for Robot Control. PhD thesis, University of Cambridge, Computer Laboratory, November 1990.
[90] K. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel based methods. IEEE Transactions on Neural Networks, 12(2):181-202, March 2001.
[91] Remi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
[92] Remi Munos and Andrew Moore. Variable resolution discretization in optimal control. Machine Learning. To appear.
[93] Remi Munos and Andrew Moore. Barycentric interpolator for continuous space & time reinforcement learning. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. The MIT Press, 1999.
[94] Remi Munos and Andrew Moore. Influence and variance of a Markov chain: Application to adaptive discretization in optimal control. In IEEE Conference on Decision and Control, 1999.
[95] Remi Munos and Andrew Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 1348-1355, 1999.
[96] Remi Munos and Jocelyn Patinel. Reinforcement learning with dynamic covering of state-action space: Partitioning Q-learning. In From Animals to Animats 3: Proceedings of the International Conference on Simulation of Adaptive Behavior, 1994.
[97] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 42:241-267, 2001.
[98] Mark J. L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive Neural Computation, Division of Informatics, University of Edinburgh, 1996. http://www.anc.ed.ac.uk/~mjo/rbf.html.
[99] Mark J. L. Orr. Recent advances in radial basis function networks. Technical report, Institute for Adaptive Neural Computation, Division of Informatics, University of Edinburgh, 1999. http://www.anc.ed.ac.uk/~mjo/rbf.html.
[100] S. Pareigis. Adaptive choice of grid and time in reinforcement learning. In Advances in Neural Information Processing Systems, volume 10. The MIT Press, Cambridge, MA, 1997.
[101] S. Pareigis. Multi-grid methods for reinforcement learning in controlled diffusion processes. In Advances in Neural Information Processing Systems, volume 9. The MIT Press, Cambridge, MA, 1998.
[102] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, volume 10, 1997.
[103] M. D. Pendrith and M. R. K. Ryan. Actual return reinforcement learning versus temporal differences: Some theoretical and experimental results. In The Thirteenth International Conference on Machine Learning. Morgan Kaufmann, 1996.
[104] M. D. Pendrith and M. R. K. Ryan. C-Trace: A new algorithm for reinforcement learning of robotic control. In ROBOLEARN-96, Key West, Florida, 19-20 May 1996.
[105] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 2:437-454, 1993.
[106] J. Peng and R. J. Williams. Technical note: Incremental Q-learning. Machine Learning, 22:283-290, 1996.
[107] Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In W. Cohen and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine Learning, pages 226-232. Morgan Kaufmann, San Francisco, 1994.
[108] Larry Peterson and Bruce Davie. Computer Networks: A Systems Approach. Morgan Kaufmann, 2nd edition, 2000.
[109] D. Precup and R. Sutton. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems, volume 10, 1998.
[110] D. Precup and R. S. Sutton. Multi-time models for reinforcement learning. In Proceedings of the ICML'97 Workshop on Modelling in Reinforcement Learning, 1997.
[111] D. Precup, R. S. Sutton, and S. Singh. Eligibility trace methods for off-policy evaluation. In Proceedings of the 17th International Conference on Machine Learning. Morgan Kaufmann, 2000.
[112] Bob Price and Craig Boutilier. Implicit imitation in multi-agent reinforcement learning. In Proceedings of the 16th International Conference on Machine Learning, 1999.
[113] M. L. Puterman and M. C. Shin. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24:1137-1137, 1978.
[114] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, New York, 1994.
[115] Stuart Reynolds. Decision boundary partitioning: Variable resolution model-free reinforcement learning. Technical Report CSRP-99-15, School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK, July 1999. ftp://ftp.cs.bham.ac.uk/pub/tech-reports/1999/CSRP-99-15.ps.gz.
[116] Stuart I. Reynolds. Issues in adaptive representation reinforcement learning. Presentation at the 4th European Workshop on Reinforcement Learning, Lugano, Switzerland, October 1999.
[117] Stuart I. Reynolds. Decision boundary partitioning: Variable resolution model-free reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 783-790, San Francisco, 2000. Morgan Kaufmann. http://www.cs.bham.ac.uk/~sir/pub/ml2k_DBP.ps.gz.
[118] Stuart I. Reynolds. A description of state dynamics and experiment parameters for the hoverbeam task. Unpublished Technical Report, http://www.cs.bham.ac.uk/~sir/pub/, April 2000.
[119] Stuart I. Reynolds. Adaptive representation methods for reinforcement learning. In Advances in Artificial Intelligence, Proceedings of AI-2001, Ottawa, Canada, Lecture Notes in Artificial Intelligence (LNAI 2056), pages 345-348. Springer-Verlag, June 2001. http://www.cs.bham.ac.uk/~sir/pub/ai2001.ps.gz.
[120] Stuart I. Reynolds. The curse of optimism. In Proceedings of the Fifth European Workshop on Reinforcement Learning, Utrecht, The Netherlands, pages 38-39, October 2001. http://www.cs.bham.ac.uk/~sir/pub/EWRL5_opt.ps.gz.
[121] Stuart I. Reynolds. Experience stack reinforcement learning: An online forward λ-return method. In Proceedings of the Fifth European Workshop on Reinforcement Learning, Utrecht, The Netherlands, pages 40-41, October 2001. http://www.cs.bham.ac.uk/~sir/pub/EWRL5_stack.ps.gz.
[122] Stuart I. Reynolds. Optimistic initial Q-values and the max operator. In Qiang Shen, editor, Proceedings of the UK Workshop on Computational Intelligence, Edinburgh, UK, pages 63-68. The University of Edinburgh Printing Services, September 2001. http://www.cs.bham.ac.uk/~sir/pub/UKCI-01.ps.gz.
[123] Stuart I. Reynolds. Experience stack reinforcement learning for off-policy control. Technical Report CSRP-02-1, School of Computer Science, University of Birmingham, January 2002. http://www.cs.bham.ac.uk/~sir/pub/ES-CSRP-02-1.ps.gz.
[124] Stuart I. Reynolds. The stability of general discounted reinforcement learning with linear function approximation. In John Bullinaria, editor, Proceedings of the UK Workshop on Computational Intelligence, Birmingham, UK, pages 139-146, September 2002. http://www.cs.bham.ac.uk/~sir/pub/ukci-02.ps.gz.
[125] Stuart I. Reynolds and Marco A. Wiering. Fast Q(λ) revisited. Technical Report CSRP-02-2, School of Computer Science, University of Birmingham, May 2002. http://www.cs.bham.ac.uk/~sir/pub/fastq-CSRP-02-2.ps.gz.
[126] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
[127] David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations. The MIT Press, Cambridge, MA, 1986.
[128] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, September 1994.
[129] Gavin A. Rummery. Problem Solving with Reinforcement Learning. PhD thesis, Department of Engineering, University of Cambridge, July 1995.
[130] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, London, UK, 1995.
[131] Juan Carlos Santamaria, Richard Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 1998.
[132] A. Schwartz. A reinforcement learning algorithm for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298-305. Morgan Kaufmann, San Mateo, CA, June 1993.
[133] J. Simons, H. Van Brussel, J. De Schutter, and J. Verhaert. A self-learning automaton with variable resolution for high precision assembly by industrial robots. IEEE Transactions on Automatic Control, 5(27):1109-1113, October 1982.
[134] S. Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Proceedings of the Ninth Machine Learning Conference, 1992.
[135] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 2000.
[136] S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 359-368. The MIT Press, Cambridge, MA, 1994.
[137] Satinder Singh. Personal communication, 2001.
[138] Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, 1994.
[139] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123-158, 1996.
[140] William D. Smart and Leslie Pack Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, 2000. Morgan Kaufmann.
[141] P. Stone and R. S. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Eighteenth International Conference on Machine Learning, 2001.
[142] Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943-950, San Francisco, 2000. Morgan Kaufmann.
[143] R. Sutton, D. Precup, and S. Singh. Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181-211, 1999.
[144] R. S. Sutton. Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning, pages 353-357. Morgan Kaufmann, 1991.
[145] R. S. Sutton. Open theoretical questions in reinforcement learning. Extended abstract of an invited talk at EuroCOLT'99, 1999.
[146] R. S. Sutton and D. Precup. Off-policy temporal-difference learning with function approximation. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
[147] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.
[148] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
[149] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1038-1044. The MIT Press, Cambridge, MA, 1996.
[150] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[151] Richard S. Sutton and Satinder P. Singh. On step-size and bias in temporal difference learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pages 91-96, 1994.
[152] Csaba Szepesvari. Convergent reinforcement learning with value function interpolation. Technical Report TR-2001-02, Mindmaker Ltd., Budapest 1121, Konkoly Th. M. u. 29-33, Hungary, 2001.
[153] P. Tadepalli and D. Ok. H-learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 94-30-01, Oregon State University, Computer Science Department, Corvallis, 1994.
[154] Vladislav Tadic. On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241-267, 2001.
[155] G. J. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38:58-68, 1995.
[156] S. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, PA, 1992.
[157] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ, 1993.
[158] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185-202, 1994.
[159] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22, 1996.
[160] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674-690, May 1997.
[161] William T. B. Uther and Manuela M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI '98), volume 15, pages 769-774. AAAI Press, 1998.
[162] Hans Vollbrecht. kd-Q-learning with hierarchic generalisation in state space. Technical Report SFB 527, Department of Neural Information Processing, University of Ulm, Ulm, Germany, 1999.
[163] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
[164] C. J. C. H. Watkins and P. Dayan. Technical note: Q-Learning. Machine Learning, 8:279-292, 1992.
[165] S. Whitehead. Reinforcement Learning for the Adaptive Control of Perception and Action. PhD thesis, King's College, Cambridge, U.K., 1992.
[166] B. Widrow and M. E. Hoff. Adaptive switching circuits. In Western Electronic Show and Convention, Convention Record, volume 4, 1960. Reprinted in J. A. Anderson and E. Rosenfeld, editors, Neurocomputing: Foundations and Research, The MIT Press, Cambridge, MA, 1988.
[167] Marco Wiering. Explorations in Efficient Reinforcement Learning. PhD thesis, Universiteit van Amsterdam, The Netherlands, February 1999.
[168] Marco Wiering and Jurgen Schmidhuber. Fast online Q(λ). Machine Learning, 33(1):105-115, 1998.
[169] Marco Wiering and Jurgen Schmidhuber. Speeding up Q(λ)-Learning. In Proceedings of the Tenth European Conference on Machine Learning (ECML'98), 1998.
[170] R. J. Williams. Toward a theory of reinforcement learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, Boston, MA, 1988.
[171] R. J. Williams and L. C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, Yale University, page 6, June 1994.
[172] R. J. Williams and L. C. Baird III. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, College of Computer Science, Northeastern University, Boston, 1993.
[173] Stewart W. Wilson. ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1):1-18, 1994. http://prediction-dynamics.com/.
[174] Jeremy Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, UK, March 1996.
[175] Jeremy Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), pages 593-600, 2001.
[176] Jeremy Wyatt, Gillian Hayes, and John Hallam. Investigating the behaviour of Q(λ). In Colloquium on Self-Learning Robots, IEE, London, February 1996.
[177] W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114-1120. Morgan Kaufmann, 1995.