
Reinforcement learning: The Good, The Bad and The Ugly


Peter Dayan (a) and Yael Niv (b)

Reinforcement learning provides both qualitative and quantitative frameworks for understanding and modeling adaptive decision-making in the face of rewards and punishments. Here we review the latest dispatches from the forefront of this field, and map out some of the territories where lie monsters.

Addresses
(a) UCL, United Kingdom
(b) Psychology Department and Princeton Neuroscience Institute, Princeton University, United States

Corresponding authors: Dayan, Peter (dayan@gatsby.ucl.ac.uk) and Niv, Yael (yael@princeton.edu)

Current Opinion in Neurobiology 2008, 18:185–196
This review comes from a themed issue on Cognitive neuroscience
Edited by Read Montague and John Assad
Available online 22nd August 2008
0959-4388/$ – see front matter
© 2008 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.conb.2008.08.003

Introduction
Reinforcement learning (RL) [1] studies the way that natural and artificial systems can learn to predict the consequences of and optimize their behavior in environments in which actions lead them from one state or situation to the next, and can also lead to rewards and punishments. Such environments arise in a wide range of fields, including ethology, economics, psychology, and control theory. Animals, from the most humble to the most immodest, face a range of such optimization problems [2], and, to an apparently impressive extent, solve them effectively. RL, originally born out of mathematical psychology and operations research, provides qualitative and quantitative computational-level models of these solutions.

However, the reason for this review is the increasing realization that RL may offer more than just a computational, 'approximate ideal learner' theory for affective decision-making. RL algorithms, such as the temporal difference (TD) learning rule [3], appear to be directly instantiated in neural mechanisms, such as the phasic activity of dopamine neurons [4]. That RL appears to be so transparently embedded has made it possible to use it in a much more immediate way to make hypotheses about, and retrodictive and predictive interpretations of, a wealth of behavioral and neural data collected in a huge range of paradigms and systems. The literature in this area is by now extensive, and has been the topic of many recent reviews (including [5–9]). This is in addition to a rapidly accumulating literature on the partly related questions of optimal decision-making in situations involving slowly accumulating information, or social factors such as games [10–12]. Thus here, after providing a brief sketch of the overall RL scheme for control (for a more extensive review, see [13]), we focus only on some of the many latest results relevant to RL and its neural instantiation. We categorize these recent findings into those that fit comfortably with, or flesh out, accepted notions (playfully, 'The Good'), some new findings that are not as snugly accommodated, but suggest the need for extensions or modifications ('The Bad'), and finally some key areas whose relative neglect by the field is threatening to impede its further progress ('The Ugly').

The reinforcement learning framework
Decision-making environments are characterized by a few key concepts: a state space (states are such things as locations in a maze, the existence or absence of different stimuli in an operant box, or board positions in a game), a set of actions (directions of travel, presses on different levers, and moves on a board), and affectively important outcomes (finding cheese, obtaining water, and winning). Actions can move the decision-maker from one state to another (i.e. induce state transitions) and they can produce outcomes. The outcomes are assumed to have numerical (positive or negative) utilities, which can change according to the motivational state of the decision-maker (e.g. food is less valuable to a satiated animal) or direct experimental manipulation (e.g. poisoning). Typically, the decision-maker starts off not knowing the rules of the environment (the transitions and outcomes engendered by the actions), and has to learn or sample these from experience.

In instrumental conditioning, animals learn to choose actions to obtain rewards and avoid punishments, or, more generally, to achieve goals. Various goals are possible, such as optimizing the average rate of acquisition of net rewards (i.e. rewards minus punishments), or some proxy for this such as the expected sum of future rewards, where outcomes received in the far future are discounted compared with outcomes received more immediately. It is the long-term nature of these goals that necessitates consideration not only of immediate outcomes but also of state transitions, and makes choice interesting and difficult. In terms of RL, instrumental conditioning concerns optimal choice, that is, determining an assignment of actions to states, also known as a policy, that optimizes the subject's goals.
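In symbols, using the standard notation of the machine learning literature [1] rather than anything specific to this review, the discounted version of this goal assigns to each state a value under a policy, and instrumental conditioning amounts to finding the policy that maximizes it:

```latex
% Standard discounted-return objective; notation follows [1] and is illustrative only.
V^{\pi}(s_t) \;=\; \mathbb{E}_{\pi}\!\Big[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\Big],
\qquad 0 \le \gamma < 1,
\qquad
\pi^{*} \;=\; \arg\max_{\pi} V^{\pi}(s) \quad \text{for every state } s.
```

Here r denotes the (positive or negative) utilities of outcomes and gamma the per-step discount; exponential weighting by gamma is the theoretically simplest choice, and alternatives are taken up under 'Temporal discounting' below.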




By contrast with instrumental conditioning, Pavlovian (or classical) conditioning traditionally treats how subjects learn to predict their fate in those cases in which they cannot actually influence it. Indeed, although RL is primarily concerned with situations in which action selection is germane, such predictions play a major role in assessing the effects of different actions, and thereby in optimizing policies. In Pavlovian conditioning, the predictions also lead to responses. However, unlike the flexible policies that are learned via instrumental conditioning, Pavlovian responses are hard-wired to the nature and emotional valence of the outcomes.

RL methods can be divided into two broad classes, model-based and model-free, which perform optimization in very different ways (Box 1, [14]). Model-based RL uses experience to construct an internal model of the transitions and immediate outcomes in the environment. Appropriate actions are then chosen by searching or planning in this world model. This is a statistically efficient way to use experience, as each morsel of information from the environment can be stored in a statistically faithful and computationally manipulable way. Provided that constant replanning is possible, this allows action selection to be readily adaptive to changes in the transition contingencies and the utilities of the outcomes. This flexibility makes model-based RL suitable for supporting goal-directed actions, in the terms of Dickinson and Balleine [15]. For instance, in model-based RL, performance of actions leading to rewards whose utilities have decreased is immediately diminished. Via this identification and other findings, the behavioral neuroscience of such goal-directed actions suggests a key role in model-based RL (or at least in its components such as outcome evaluation) for the dorsomedial striatum (or its primate homologue, the caudate nucleus), prelimbic prefrontal cortex, the orbitofrontal cortex, the medial prefrontal cortex (see footnote 1), and parts of the amygdala [9,17–20].

(1) Note that here and henceforth we lump together results from rat and primate prefrontal cortical areas for the sake of brevity and simplicity, despite important and contentious differences [16].

Model-free RL, on the other hand, uses experience to learn directly one or both of two simpler quantities (state/action values or policies) which can achieve the same optimal behavior but without estimation or use of a world model. Given a policy, a state has a value, defined in terms of the future utility that is expected to accrue starting from that state. Crucially, correct values satisfy a set of mutual consistency conditions: a state can have a high value only if the actions specified by the policy at that state lead to immediate outcomes with high utilities, and/or states which promise large future expected utilities (i.e. have high values themselves). Model-free learning rules such as the temporal difference (TD) rule [3] define any momentary inconsistency as a prediction error, and use it to specify rules for plasticity that allow learning of more accurate values and decreased inconsistencies. Given correct values, it is possible to improve the policy by preferring those actions that lead to higher utility outcomes and higher valued states. Direct model-free methods for improving policies without even acquiring values are also known [21].

Model-free methods are statistically less efficient than model-based methods, because information from the environment is combined with previous, and possibly erroneous, estimates or beliefs about state values, rather than being used directly. The information is also stored in scalar quantities from which specific knowledge about rewards or transitions cannot later be disentangled. As a result, these methods cannot adapt appropriately quickly to changes in contingency and outcome utilities. Based on the latter characteristic, model-free RL has been suggested as a model of habitual actions [14,15], in which areas such as the dorsolateral striatum and the amygdala are believed to play a key role [17,18]. However, a far more direct link between model-free RL and the workings of affective decision-making is apparent in the findings that the phasic activity of dopamine neurons during appetitive conditioning (and indeed the fMRI BOLD signal in the ventral striatum of humans, a key target of dopamine projections) has many of the quantitative characteristics of the TD prediction error that is key to learning model-free values [4,22–24]. It is these latter results that underpin the bulk of neural RL.
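To make the TD computation concrete, here is a minimal tabular sketch; the states, rewards and parameters are invented for illustration, and the code is only meant to show how the prediction error both measures a momentary inconsistency between successive value estimates and drives the plasticity that removes it.

```python
# Minimal tabular TD(0) sketch (illustrative; states, rewards and parameters are invented).
states = ["cue", "delay", "outcome", "end"]
reward = {"delay": 0.0, "outcome": 1.0, "end": 0.0}   # utility received on entering each state
V = {s: 0.0 for s in states}                          # cached scalar values
alpha, gamma = 0.1, 0.9                               # learning rate and discount factor

for trial in range(200):
    for s, s_next in zip(states[:-1], states[1:]):
        # TD prediction error: utility obtained plus discounted value of the successor state,
        # minus the current cached estimate -- the quantity compared with phasic dopamine.
        delta = reward[s_next] + gamma * V[s_next] - V[s]
        V[s] += alpha * delta                         # plasticity proportional to the error

print({s: round(V[s], 2) for s in states[:-1]})       # e.g. {'cue': 0.9, 'delay': 1.0, 'outcome': 0.0}
```

Early in training the error is large when the reward arrives; as the cached predictions converge it shrinks toward zero, and an omitted reward would then produce a negative error, the pattern of results ascribed to phasic dopamine [4].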




Box 1 Model-based and model-free reinforcement learning

Reinforcement learning methods can broadly be divided into two classes, model-based and model-free. Consider the problem illustrated in the
figure, of deciding which route to take on the way home from work on Friday evening. We can abstract this task as having states (in this case,
locations, notably of junctions), actions (e.g. going straight on or turning left or right at every intersection), probabilities of transitioning from one
state to another when a certain action is taken (these transitions are not necessarily deterministic, e.g. due to road works and bypasses), and
positive or negative outcomes (i.e. rewards or costs) at each transition from scenery, traffic jams, fuel consumed, etc. (which are again
probabilistic).
Model-based computation, illustrated in the left ‘thought bubble’, is akin to searching a mental map (a forward model of the task) that has been
learned based on previous experience. This forward model comprises knowledge of the characteristics of the task, notably, the probabilities of
different transitions and different immediate outcomes. Model-based action selection proceeds by searching the mental map to work out the long-
run value of each action at the current state in terms of the expected reward of the whole route home, and chooses the action that has the highest
value.
Model-free action selection, by contrast, is based on learning these long-run values of actions (or a preference order between actions) without
either building or searching through a model. RL provides a number of methods for doing this, in which learning is based on momentary
inconsistencies between successive estimates of these values along sample trajectories. These values, sometimes called cached values because
of the way they store experience, encompass all future probabilistic transitions and rewards in a single scalar number that denotes the overall future
worth of an action (or its attractiveness compared with other actions). For instance, as illustrated in the right ‘thought bubble’, experience may have
taught the commuter that on Friday evenings the best action at this intersection is to continue straight and avoid the freeway.
Model-free methods are clearly easier to use in terms of online decision-making; however, much trial-and-error experience is required for the values to become good estimates of future consequences. Moreover, the cached values are inherently inflexible: although hearing about an unexpected
traffic jam on the radio can immediately affect action selection that is based on a forward model, the effect of the traffic jam on a cached propensity
such as ‘avoid the freeway on Friday evening’ cannot be calculated without further trial-and-error learning on days in which this traffic jam occurs.
Changes in the goal of behavior, as when moving to a new house, also expose the differences between the methods: whereas model-based
decision making can be immediately sensitive to such a goal-shift, cached values are again slow to change appropriately. Indeed, many of us have
experienced this directly in daily life after moving house. We clearly know the location of our new home, and can make our way to it by
concentrating on the new route; but we can occasionally take an habitual wrong turn toward the old address if our minds wander. Such
introspection, and a wealth of rigorous behavioral studies (see [15] for a review), suggest that the brain employs both model-free and model-based
decision-making strategies in parallel, with each dominating in different circumstances [14]. Indeed, somewhat different neural substrates underlie
each one [17].
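To make this contrast concrete in code, the sketch below pits a forward-model search against cached action values in a toy commute; the junctions, actions and costs are invented stand-ins rather than the contents of the Box 1 figure, and the point is only the difference in how the two controllers respond when the world changes.

```python
# Toy contrast between model-based and model-free action evaluation (illustrative sketch only;
# the junctions and costs are invented, not taken from the article or the Box 1 figure).

# A learned forward model: transition[state][action] = (next_state, expected_cost)
transition = {
    "work":       {"freeway": ("ramp", -1.0), "surface": ("highstreet", -2.0)},
    "ramp":       {"drive": ("home", -1.0)},
    "highstreet": {"drive": ("home", -3.0)},
    "home":       {},
}

def model_based_value(state, action, depth=5):
    """Evaluate an action by explicitly searching the forward model (planning)."""
    nxt, cost = transition[state][action]
    if depth == 0 or not transition[nxt]:
        return cost
    # assume the best available action is taken at the successor state
    return cost + max(model_based_value(nxt, a, depth - 1) for a in transition[nxt])

# A model-free controller instead keeps cached long-run action values, updated only by
# prediction errors along experienced trajectories (cf. the TD sketch above).
Q = {("work", "freeway"): -2.0, ("work", "surface"): -5.0}

print("model-based:", {a: model_based_value("work", a) for a in transition["work"]})
print("model-free (cached):", Q)

# If roadworks suddenly make the freeway ramp costly, editing one entry of the model changes
# the model-based evaluation immediately; the cached Q values only change after further
# trial-and-error experience.
transition["ramp"]["drive"] = ("home", -10.0)
print("after change, model-based:", {a: model_based_value("work", a) for a in transition["work"]})
```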




'The Good': new findings in neural RL
Daw et al. [5] sketched a framework very similar to this, and reviewed the then current literature which pertained to it. Our first goal is to update this analysis of the literature. In particular, courtesy of a wealth of experiments, just two years later we now know much more about the functional organization of RL systems in the brain, the pathways influencing the computation of prediction errors, and time-discounting. A substantial fraction of this work involves mapping the extensive findings from rodents and primates onto the human brain, largely using innovative experimental designs while measuring the fMRI BOLD signal. This research has proven most fruitful, especially in terms of tracking prediction error signals in the human brain. However, in considering these results it is important to remember that the BOLD signal is a measure of oxyhemoglobin levels and not neural activity, let alone dopamine. That neuromodulators can act directly on capillary dilation, and that even the limited evidence we have about the coupling between synaptic drive or neural activity and the BOLD signal is confined to the cortex (e.g. [25]) rather than the striatum or midbrain areas such as the ventral tegmental area, imply that this fMRI evidence is alas very indirect.

Functional organization: in terms of the mapping from rodents to humans, reinforcer devaluation has been employed to study genuinely goal-directed choice in humans, that is, to search for the underpinnings of behavior that is flexibly adjusted to such things as changes in the value of a predicted outcome. The orbitofrontal cortex (OFC) has duly been revealed as playing a particular role in representing goal-directed value [26]. However, the bulk of experiments has not set out to distinguish model-based from model-free systems, and has rather more readily found regions implicated in model-free processes. There is some indirect evidence [27–29,30] for the involvement of dopamine and dopaminergic mechanisms in learning from reward prediction errors in humans, along with more direct evidence from an experiment involving the pharmacological manipulation of dopamine [31]. These studies, in which prediction errors have been explicitly modeled, along with others which use a more general multivariate approach [32] or an argument based on a theoretical analysis of multiple, separate, cortico-basal ganglia loops [33], also reveal roles for OFC, medial prefrontal cortical structures, and even the cingulate cortex. The contributions of the latter especially have recently been under scrutiny in animal studies focused on the cost-benefit tradeoffs inherent in decision-making [34,35], and the involvement of dopaminergic projections to and from the anterior cingulate cortex, and thus potential interactions with RL, have been suggested ([36], but see [37]). However, novel approaches to distinguishing model-based and model-free control may be necessary to tease apart more precisely the singular contributions of these areas.

Computation of prediction errors: in terms of pathways, one of the more striking recent findings is evidence that the lateral habenula suppresses the activity of dopamine neurons [38,39], in a way which may be crucial for the representation of the negative prediction errors that arise when states turn out to be worse than expected. One natural possibility is then that pauses in the burst firing of dopamine cells might code for these negative prediction errors. This has received some quantitative support [40], despite the low baseline rate of firing of these neurons, which suggests that such a signal would have a rather low bandwidth. Further, studies examining the relationship between learning from negative and positive consequences and genetic polymorphisms related to the D2 dopaminergic receptor have established a functional link between dopamine and learning when outcomes are worse than expected [41,42]. Unfortunately, marring this apparent convergence of evidence is the fact that the distinction between the absence of an expected reward and more directly aversive events is still far from being clear, as we complain below.

Further data on contributions to the computation of the TD prediction error have come from new findings on excitatory pathways into the dopamine system [43]. Evidence about the way that the amygdala [44,45,46] and the medial prefrontal cortex [47] code for both positive and negative predicted values and errors, and the joint coding of actions, the values of actions, and rewards in striatal and OFC activity [48–55,9], is also significant, given the putative roles of these areas in learning and representing various forms of RL values.

In fact, the machine learning literature has proposed various subtly different versions of the TD learning signal, associated with slightly different model-free RL methods. Recent evidence from a primate study [56] looking primarily at one dopaminergic nucleus, the substantia nigra pars compacta, seems to support a version called SARSA [57]; whereas evidence from a rodent study [58] of the other major dopaminergic nucleus, the ventral tegmental area, favors a different version called Q-learning (CJCH Watkins, Learning from delayed rewards, PhD thesis, University of Cambridge, 1989). Resolving this discrepancy, and indeed determining whether these learning rules can be incorporated within the popular Actor/Critic framework for model-free RL in the basal ganglia [59,1], will necessitate further experiments and computational investigation.
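The difference between these variants is easiest to state in terms of the prediction error computed at a choice point. The sketch below uses invented numbers purely to contrast the two bootstrapping targets; it does not model either cited experiment.

```python
# SARSA vs Q-learning prediction errors at a choice point (illustrative sketch; numbers invented).
gamma = 0.9

# Cached action values at the next state s', for two available actions.
Q_next = {"left": 2.0, "right": 5.0}

def sarsa_delta(r, q_current, a_next):
    # SARSA: on-policy -- bootstrap from the action the agent will actually take at s'.
    return r + gamma * Q_next[a_next] - q_current

def q_learning_delta(r, q_current):
    # Q-learning: off-policy -- bootstrap from the best action available at s',
    # whether or not it ends up being chosen.
    return r + gamma * max(Q_next.values()) - q_current

r, q_current = 1.0, 3.0
print("SARSA delta if 'left' will be chosen:", sarsa_delta(r, q_current, "left"))  # 1 + 0.9*2 - 3 = -0.2
print("Q-learning delta (uses the max):", q_learning_delta(r, q_current))          # 1 + 0.9*5 - 3 =  2.5
# The two rules therefore predict different dopaminergic responses whenever the upcoming
# choice is not the best available one -- the empirical handle exploited by [56] and [58].
```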




A more radical change to the rule governing the activity of dopamine cells, which separates out differently the portions associated with the outcome (the primary reward) and the learned predictions, has also been suggested in a modeling study [60]. However, various attractive features of the TD rule, such as its natural account of secondary conditioning and its resulting suitability for optimizing sequences of actions leading up to a reward, are not inherited directly by this rule.

Temporal discounting: a recurrent controversy involves the way that the utilities of proximal and distant outcomes are weighed against each other. Exponential discounting, similar to a uniform interest rate, has attractive theoretical properties, notably the absence of intertemporal choice conflict, the possibility of a recursive calculation scheme, and simple prediction errors [1]. However, the more computationally complex hyperbolic discounting, which shows preference reversals and impulsivity, is a more common psychological finding in humans and animals [61].

The immediate debate concerned the abstraction and simplification of hyperbolic discounting to two evaluative systems, one interested mostly in the here and now, the other in the distant future [62], and their apparent instantiation in different subcortical and cortical structures respectively [63,64]. This idea became somewhat wrapped up with the distinct notion that these neural areas are involved in model-free and model-based learning. Further, other studies found a more unitary neural representation of discounting [65], at least in the BOLD signal, and recent results confirm that dopaminergic prediction errors indeed show the expected effects of discounting [58]. One surprising finding is that the OFC may separate out the representation of the temporal discount factor applied to distant rewards from that of the magnitude of the reward [54], implying a complex problem of how these quantities are then integrated.

There has also been new work based on the theory [8] that the effective interest rate for time is under the influence of the neuromodulator serotonin. In a task that provides a fine-scale view of temporal choice [66], dietary reduction of serotonin levels in humans (tryptophan depletion) gave rise to extra impulsivity, that is, favoring smaller rewards sooner over larger rewards later, which can be effectively modeled by a steeper interest rate [67]. Somewhat more problematically for the general picture of control sketched above, tryptophan depletion also led to a topographically mapped effect in the striatum, with quantities associated with predictions for high interest rates preferentially correlated with more ventral areas, and those for low interest rates with more dorsal areas [68].

The idea that subjects are trying to optimize their long-run rates of acquisition of reward has become important in studies of time-sensitive sensory decision-making [11,69]. It also inspired a new set of modeling investigations into free operant choice tasks, in which subjects are free to execute actions at times of their own choosing, and the dependent variables are quantities such as the rates of responding [70,71]. In these accounts, the rate of reward acts as an opportunity cost for time, thereby penalizing sloth, and is suggested as being coded by the tonic (as distinct from the phasic) levels of dopamine. This captures findings associated with response vigor given dopaminergic manipulations [72,73,74], and, at least assuming a particular coupling between phasic and tonic dopamine, can explain results linking vigor to predictions of reward [75,76].

'The Bad': apparent but tractable inconsistencies
Various research areas which come in close contact with different aspects of RL help extend or illuminate it in not altogether expected ways. These include issues of aversive-appetitive interactions, exploration and novelty, a range of phenomena important in neuroeconomics [77,78] such as risk, Pavlovian-instrumental interactions, and also certain new structural or architectural findings. The existence of multiple control mechanisms makes it challenging to interpret some of these results unambiguously, since rather little is known for sure about the interaction between model-free and model-based systems, or the way the neural areas involved communicate, cooperate and compete.

Appetitive–aversive interactions: one key issue that dogs neural RL [79] is the coding of aversive rather than appetitive prediction errors. Although dopamine neurons are seemingly mostly inhibited by unpredicted punishments [80,81], fMRI studies of the ventral striatum in humans have produced mixed results, with aversive prediction errors sometimes leading to above-baseline BOLD [82,83], but other times below-baseline BOLD, perhaps with complex temporal dynamics [84,23]. Dopamine antagonists suppress the aversive prediction error signal [85], and withdrawing (OFF) or administering (ON) dopamine-boosting medication to patients suffering from Parkinson's disease leads to boosted and suppressed learning from negative outcomes, respectively [86].

One study that set out to compare directly appetitive and aversive prediction errors found a modest spatial separation in the ventral striatal BOLD signal [87], consistent with various other findings about the topography of this structure [88,89]; indeed, there is a similar finding in the OFC [90]. However, given that nearby neurons in the amygdala code for either appetitive or aversive outcomes [44,45,46], fMRI's spotlight may be too coarse to address all such questions adequately.

Aversive predictions are perhaps more complex than appetitive ones, owing to their multifaceted range of effects, with different forms of contextual information (such as defensive distance; [91]) influencing the choice between withdrawal and freezing versus approach and fighting. Like appetitive choice behavior, some of these behavioral effects require vigorous actions, which we discussed above in terms of tonic levels of dopamine [70,71]. Moreover, aversive predictions can lead to active avoidance, which then brings about appetitive predictions associated with the achievement of safety. This latter mechanism is implicated in conditioned avoidance responses [92], and has also been shown to cause increases in BOLD in the same part of the OFC that is also activated by the receipt of rewards [93].

Novelty, uncertainty and exploration: the bulk of work in RL focuses on exploitation, that is, on using past experience to optimize outcomes on the next trial or next period. More ambitious agents seek also to optimize exploration, taking account in their choices not only of the benefits of known (expected) future rewards, but also of the potential benefits of learning about unknown rewards and punishments (i.e. the long-term gain to be harvested due to acquiring knowledge about the values of different states). The balancing act between these requires careful accounting for uncertainty, in which the neuromodulators acetylcholine and norepinephrine have been implicated [94,95].
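One simple way to make this trade-off concrete is to add to each cached action value a bonus that grows with how poorly known that action is, so that uncertain options are sampled more often. The sketch below is generic and uses invented numbers, in the broad spirit of the exploration-bonus and softmax schemes discussed next rather than any specific published model.

```python
import math, random

# Uncertainty-bonus action selection (illustrative sketch; bonus form and numbers are invented).
Q      = {"lever_A": 1.0, "lever_B": 0.8, "lever_C": 0.0}   # cached value estimates
counts = {"lever_A": 50,  "lever_B": 10,  "lever_C": 1}     # how often each action has been tried

def directed_choice(bonus_weight=2.0):
    """Directed exploration: prefer actions whose values are uncertain (here, rarely sampled)."""
    def score(a):
        bonus = bonus_weight / math.sqrt(counts[a])   # crude stand-in for estimation uncertainty
        return Q[a] + bonus
    return max(Q, key=score)

def undirected_choice(temperature=0.5):
    """Undirected exploration: inject unbiased randomness (softmax), ignoring uncertainty."""
    weights = [math.exp(Q[a] / temperature) for a in Q]
    return random.choices(list(Q), weights=weights)[0]

print("directed pick:", directed_choice())    # with this bonus weight, rarely tried lever_C wins
print("undirected pick:", undirected_choice())
```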




Uncertainty is also related to novelty, which seems to involve dopaminergic areas and some of their targets [96,97,98]. Uncertainty and novelty can support different types of exploration: respectively, directed exploration, in which exploratory actions are taken in proportion to known uncertainty, and undirected exploration, in which novel unknown parts of the environment are uniformly explored. Tracking uncertainty has been shown to involve the PFC in humans [99] and complex patterns of activity in lateral and medial prefrontal cortex of monkeys [100], and has also been associated with functions of the amygdala [101].

Model-free and model-based systems in RL adopt rather different approaches to exploration, although because uncertainty generally favors model-based control [14], the model-free approaches may be concealed. Indeed, we may expect that certain forms of explicit change and uncertainty could act to transfer habitized control back to a model-based system [102]. Model-based exploration probably involves frontal areas, as exemplified by one recent imaging study which revealed a particular role for fronto-polar cortex in trials in which nonexploitative actions were chosen [103]. Model-free exploration has been suggested to involve neuromodulatory systems such as dopamine and norepinephrine, instantiating computationally more primitive strategies such as dopaminergic exploration bonuses [98].

Uncertainty should influence learning as well as exploration. In particular, learning (and forgetting) rates should be higher in a rapidly changing environment and slower in a rather stationary one. A recent fMRI study demonstrated that human subjects adjust their learning rates according to the volatility of the environment, and suggested that the anterior cingulate cortex is involved in the online tracking of volatility [104]. In animal conditioning tasks involving a more discrete form of surprise associated with stimuli, substantial studies have shown a role for the central nucleus of the amygdala and the cholinergic neurons in the substantia innominata and nucleus basalis in upregulating subsequent learning associated with that stimulus [94]. More recent work has shown that there is a crucial role for the projection from the dopaminergic (putatively prediction error coding) substantia nigra pars compacta to the central nucleus in this upregulation [105], and that the involvement of the central nucleus is limited to the time of surprise (and that of the cholinergic nuclei to the time of subsequent learning; [106]). This accords with evidence that some amygdala neurons respond directly to both positive and negative surprising events ([45], a minimum requirement for a surprise signal) and to unpredictability itself [101].

Risk, regret and neuroeconomics: the relatively straightforward view of utility optimization that permeates neural RL is challenged by various findings in and around the fields of experimental economics, behavioral economics, and neuroeconomics [107,77]. These study the psychological and neural factors underlying a range of situations in which behavior departs from apparent normative ideals. One example of this concerns the sensitivity of choice behavior to risk associated with variability in the outcomes. There are extensive economic theories about risk, and indeed evidence from functional neuroimaging that variability may be coded in specific cortical and subcortical regions such as the insular cortex, OFC and the ventral striatum [108–112]. However, how risk should influence action selection in model-free and model-based control is not completely clear, partly because variability can have both an indirect influence on values, for instance via a saturating utility function for rewards [113], and direct effects through a risk-sensitive learning process [114,115]. Although nonlinear utilities for rewards will affect both types of controllers, the effects of risk through sampling biases [114] or through a learning rule that is asymmetrically sensitive to positive and negative errors [115] depend on the specific implementation of learning in a model-free or model-based controller, and thus can differ between the two.
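As an illustration of the second, more direct route, here is a sketch of a value-learning rule that is asymmetrically sensitive to positive and negative prediction errors, in the spirit of [115] but with invented parameters; weighting losses more heavily makes the learned value of a variable option settle below its true mean.

```python
import random

# Risk-sensitive value learning via asymmetric learning rates (illustrative sketch;
# the parameters and the coin-flip reward are invented, in the spirit of rules like [115]).
random.seed(0)

def learn_value(alpha_pos, alpha_neg, n_trials=5000):
    v = 0.0
    for _ in range(n_trials):
        r = random.choice([0.0, 2.0])              # risky option: mean reward 1, high variability
        delta = r - v                               # prediction error
        v += (alpha_pos if delta > 0 else alpha_neg) * delta
    return v

print("symmetric rates:", round(learn_value(0.1, 0.1), 2))                          # hovers near the true mean, ~1.0
print("risk-averse (larger weight on losses):", round(learn_value(0.05, 0.2), 2))   # settles well below the mean
```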




A second neuroeconomic concern is regret, in which a form of counterfactual thinking [116] or fictive learning [117] is induced when foregone alternatives turn out to be better than those that were chosen [118,119]. Imaging studies have suggested a role for the OFC and other targets of the dopamine system in processing fictive learning signals [120,121,117]. Here, also, there may be separate instantiations in model-based and model-free systems, as accounting for counterfactuals is straightforward in the former, but requires extra machinery in the latter.

A third issue along these lines is framing, in which different descriptions of a single (typically risky) outcome result in different valuations. Subjects are more reluctant to take chances in the face of apparent gains than losses, even when the options are actually exactly the same. A recent imaging study [122] reported a close correlation between susceptibility to framing and activity in the amygdala, and a negative correlation with activity in OFC and medial PFC.

Finally, there is an accumulating wealth of work on social utility functions, and game-theoretic interactions between subjects [123], including studies in patient populations with problems with social interactions [124]. There are interesting hints for neural RL from studies that use techniques such as transcranial magnetic stimulation to disrupt specific prefrontal activity, which turns out to affect particular forms of social choice such as the ability to reject patently unfair (though still lucrative) offers in games of economic exchange [125]. However, these have yet to be coupled to the nascent more complete RL accounts of the complex behavior in pair and multi-agent tournaments [126,127].

Pavlovian values: although it is conventional in RL to consider Pavlovian or classical conditioning as only being about the acquisition of predictions, with instrumental conditioning being wholly responsible for choice, it is only through a set of (evolutionarily programmed) responses, such as the approach that is engendered by predictions of reward, that Pavlovian predictions become evident in the first place. Indeed, Pavlovian responses can famously out-compete instrumental responses [128]. This is of particular importance in omission schedules or negative automaintenance, when subjects continue responding based on predictions of future rewards, even when their actions actually prevent them from getting those rewards.

Other paradigms such as Pavlovian-instrumental transfer, in which Pavlovian predictors influence the vigor of instrumental responding, are further evidence of non-normative interactions between these two forms of learning, potentially accounting for some of the effects of manipulating reward schedules on effort [52,51]. Part of the architecture of this transfer involving the amygdala, the striatum and dopamine that has been worked out in rodent studies [17] has recently been confirmed in a human imaging experiment [129]; however, a recent surprise was that lesions in rats that disconnected the central nucleus of the amygdala from the ventral tegmental area actually enhanced transfer [130], rather than suppressing it, as would have been expected from earlier work (e.g. [131]).

The Pavlovian preparatory response to aversive predictions is often withdrawal or disengagement. It has been hypothesized that this is a form of serotonergically mediated inhibition [132,133] that acts to arrest paths leading to negative outcomes, and, in model-based terms, excise potentially aversive parts of the search tree [134]. Compromising serotonin should thus damage this reflexive inhibition, leading to systematic problems with evaluation and choice. The consummatory Pavlovian response to immediate threat is panic, probably mediated by the periaqueductal gray [135,91]; such responses have themselves recently been observed in an imaging study which involved a virtual threatening 'chase' [136].

Appetitive and aversive Pavlovian influences have been suggested as being quite pervasive [137], accounting for aspects of impulsivity apparent in hyperbolic discounting, framing and the like. Unfortunately, the sophisticated behavioral and neural analyses of model-free and model-based instrumental values are not paralleled, as of yet, by an equivalently worked-out theory for the construction of Pavlovian values. Moreover, Pavlovian influences may affect model-based and model-free instrumental actions differently, suggesting even more relatively untapped complexity.

Structural findings: finally, there are some recent fascinating and sometimes contrarian studies into the way that striatal pathways function in choice. One set [138,139,140] has revealed a role for the subthalamic nucleus (STN) in slowing down choices between strongly appetitive options, perhaps to give more time for the precise discrimination of the best action from merely good ones [141]. It could again be that the impulsivity created when the STN is suppressed arises from a form of Pavlovian approach [137]. Others have looked at serial processes that shift action competition and choice up a ventral to dorsal axis along the striatum [142,143], perhaps consistent with the suggested spiraling connectivity through dopaminergic nuclei [144,145].

'The Ugly': crucial challenges
The last set of areas of neural RL suffers from a crucial disparity between their importance and the relative dearth of systematic or comprehensive studies. Hopefully, by the next review, this imbalance will be at least partially redressed.

In terms of model-based control, a central question concerns the acquisition and use of hierarchical structures. It seems rather hopeless to plan using only the smallest units of action (think of the twitches of a single muscle), and so there is a great interest in hierarchies in both standard [146] and neural RL [147] (one common formalization is sketched below). Unfortunately, the crucial problem of acquiring appropriate hierarchical structure in the absence of supervision is far from being solved. A second concern has to do with a wide class of learning scenarios in which optimal learning relies on detection of change and, essentially, learning of a 'new' situation, rather than updating the previously learned one. A prime example of this is the paradigm of extinction, in which a cue previously predictive of reward is no longer paired with the rewarding outcome. Behavioral results show that animals and humans do not simply 'unlearn' the previous predictive information, but rather they learn a new predictive relationship that is inhibitory to the old one [148]. How this type of learning is realized within an RL framework, whether model-based or model-free, is at present unclear, though some suggestions have been made [149]. The issues mentioned above to do with uncertainty and change also bear on this.
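For reference, one common formalization of such hierarchical structure in the machine learning literature (not necessarily the specific proposals of [146,147]) packages a temporally extended 'option' as a sub-policy with its own initiation set and termination condition; the sketch below is a generic, invented illustration of that idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

# A minimal "option" in the hierarchical-RL sense: a temporally extended action with its own
# initiation set, internal policy and termination condition. Generic illustration only; not a
# reconstruction of the specific hierarchical proposals cited in the text.
@dataclass
class Option:
    name: str
    initiation: Set[str]                      # states where the option may be invoked
    policy: Dict[str, str]                    # state -> primitive action while the option runs
    terminates: Callable[[str], bool]         # predicate: stop the option in this state?

leave_building = Option(
    name="leave_building",
    initiation={"office", "corridor"},
    policy={"office": "walk_to_corridor", "corridor": "walk_to_exit"},
    terminates=lambda state: state == "street",
)

def run_option(option: Option, state: str, step: Callable[[str, str], str]) -> str:
    """Execute an option until its termination condition holds, returning the final state."""
    assert state in option.initiation
    while not option.terminates(state):
        state = step(state, option.policy[state])
    return state

# Toy environment dynamics for the example (invented).
dynamics = {("office", "walk_to_corridor"): "corridor", ("corridor", "walk_to_exit"): "street"}
print(run_option(leave_building, "office", lambda s, a: dynamics[(s, a)]))   # -> "street"
```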




In terms of model-free control, there are various pressing anomalies (on top of appetitive/aversive opponency) associated with the activity of dopamine cells (and the not-necessarily equivalent release of dopamine at its targets; [150]). One concerns prediction errors inspired by conditioned inhibitors for reward and punishment predictors. While the predictive stimuli we have contemplated so far are stimuli that predict the occurrence of some outcome, a conditioned inhibitor is a stimulus that predicts that an otherwise expected outcome will not happen. How the former (conditioned excitors) and the latter (conditioned inhibitors) interact to generate a single prediction and a corresponding prediction error is not yet clear either neurally or computationally. A fascinating electrophysiological study [151] targeting this issue suggests that the answer is not simple.

A second anomaly is the apparent adaptive scaling of prediction errors depending on the precise range of such errors that is expected, forming an informationally efficient code [152]. As with other cases of adaptation, it is not clear that downstream mechanisms (here, presumably dopamine receptors) have the information or indeed the calculational wherewithal to reverse engineer correctly the state of adaptation of the processes controlling the activity of dopamine cells. This opens up the possibility that there might be biases in the interpretation of the signal, leading to biases in choice. Alternatively, this adaptation might serve a different computational goal, such as adjusting learning appropriately in the face of forms of uncertainty [153].

A timely reminder of the complexity of the dopamine system came from a recent study suggesting a new subclass of dopamine neurons, with particularly unusual neurophysiological properties, that selectively targets the prefrontal cortex, the basolateral amygdala and the core of the accumbens [154]. It is important to remember that most of our knowledge about the behaviorally relevant activity of dopamine neurons comes from recordings during Pavlovian tasks or rather simple instrumental scenarios — basic questions about dopaminergic firing patterns when several actions are necessary in order to obtain an outcome, when there are several outcomes in a trial, or in hierarchical tasks are still unanswered. Lammel et al.'s [154] results add to previous subtle challenges to the common premise that the dopaminergic signal is unitary (e.g. when taking [56,58] together). Future physiological studies and recordings in complex tasks are needed to flesh out what could potentially be a more diverse signal than has so far been considered. The role of dopamine in the prefrontal cortex, which has featured prominently in only relatively few models (notably [155]), and indeed its potential effects on the model-based controller, might also need to be rethought in light of these findings.

A final area of unfortunate lack of theory and dearth of data has to do with timing. The activity pattern of dopamine cells that most convincingly suggests that they report a prediction error is the well-timed dip in responding when an expected reward is not provided [156,40]. The neural mechanisms underlying this timing are most unclear; in fact there are various quite different classes of suggestion in the RL literature and beyond. What is worse is that this form of interval timing is subject to substantial scalar noise [157], to a degree that could make accurate prediction learning, and accurately timed dips in dopamine responding, quite challenging to achieve.

Conclusions
As should be apparent from this review, neural RL is a vibrant and dynamic field, generating new results at a near-overwhelming rate, and spreading its wings well beyond its initial narrow confines of trial-and-error reward learning. We have highlighted many foci of ongoing study, and also some orphaned areas mentioned in the previous section. However, our best hope is that the sterling efforts to link together the substantial theoretically motivated and informative animal studies with human neuroimaging results, along with new data from cyclic voltammetric measurements of phasic dopamine concentrations [158], imminent results on serotonin, and even nascent efforts to activate DA cells in vivo using new optogenetic methods such as targeted channelrhodopsin, will start to pay off in the opposite direction by deciding between, and forcing improvements to, RL models.

Acknowledgements
We are very grateful to Peter Bossaerts, Nathaniel Daw, Michael Frank, Russell Poldrack, Daniel Salzman, Ben Seymour and Wako Yoshida for their helpful comments on a previous version of this manuscript. PD is funded by the Gatsby Charitable Foundation; YN is funded by the Human Frontiers Science Program.

References and recommended reading
Papers of particular interest, published within the period of the review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Sutton RS, Barto AG: Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press; 1998.
2. Montague PR: Why Choose This Book?: How We Make Decisions. Dutton Adult; 2006.
3. Sutton R: Learning to predict by the methods of temporal differences. Mach Learn 1988, 3:9-44.
4. Montague PR, Dayan P, Sejnowski TJ: A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 1996, 16:1936-1947.
5. Daw ND, Doya K: The computational neurobiology of learning and reward. Curr Opin Neurobiol 2006, 16:199-204.
6. Johnson A, van der Meer MA, Redish AD: Integrating hippocampus and striatum in decision-making. Curr Opin Neurobiol 2007, 17:692-697.
7. O'Doherty JP, Hampton A, Kim H: Model-based fMRI and its application to reward learning and decision making. Ann N Y Acad Sci 2007, 1104:35-53.
8. Doya K: Modulators of decision making. Nat Neurosci 2008, 11:410-416.
9. Rushworth MFS, Behrens TEJ: Choice, uncertainty and value in prefrontal and cingulate cortex. Nat Neurosci 2008, 11:389-397.
10. Körding K: Decision theory: what "should" the nervous system do? Science 2007, 318:606-610.
11. Gold JI, Shadlen MN: The neural basis of decision making. Annu Rev Neurosci 2007, 30:535-574.




12. Lee D: Neural basis of quasi-rational decision making. Curr Opin Neurobiol 2006, 16:191-198.
13. Niv Y, Montague PR: Theoretical and empirical studies of learning. In Neuroeconomics: Decision Making and The Brain. Edited by Glimcher PW, Camerer C, Fehr E, Poldrack R. New York, NY: Academic Press; 2008:329–349.
14. Daw ND, Niv Y, Dayan P: Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat Neurosci 2005, 8:1704-1711.
15. Dickinson A, Balleine B: The role of learning in motivation. In Stevens' Handbook of Experimental Psychology. Edited by Gallistel C. New York, NY: Wiley; 2002:497–533.
16. Uylings HBM, Groenewegen HJ, Kolb B: Do rats have a prefrontal cortex? Behav Brain Res 2003, 146:3-17.
17. Balleine BW: Neural bases of food-seeking: affect, arousal and reward in corticostriatolimbic circuits. Physiol Behav 2005, 86:717-730.
18. Killcross S, Coutureau E: Coordination of actions and habits in the medial prefrontal cortex of rats. Cereb Cortex 2003, 13:400-408.
19. Dolan RJ: The human amygdala and orbital prefrontal cortex in behavioural regulation. Philos Trans R Soc Lond B: Biol Sci 2007, 362:787-799.
20. Matsumoto K, Tanaka K: The role of the medial prefrontal cortex in achieving goals. Curr Opin Neurobiol 2004, 14:178-185.
21. Baxter J, Bartlett P: Infinite-horizon policy-gradient estimation. J Artif Intell Res 2001, 15:319-350.
22. Berns GS, McClure SM, Pagnoni G, Montague PR: Predictability modulates human brain response to reward. J Neurosci 2001, 21:2793-2798.
23. O'Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ: Temporal difference models and reward-related learning in the human brain. Neuron 2003, 38:329-337.
24. Haruno M, Kuroda T, Doya K, Toyama K, Kimura M, Samejima K, Imamizu H, Kawato M: A neural correlate of reward-based behavioral learning in caudate nucleus: a functional magnetic resonance imaging study of a stochastic decision task. J Neurosci 2004, 24:1660-1665.
25. Logothetis NK, Wandell BA: Interpreting the BOLD signal. Annu Rev Physiol 2004, 66:735-769.
26. Valentin VV, Dickinson A, O'Doherty JP: Determining the neural substrates of goal-directed learning in the human brain. J Neurosci 2007, 27:4019-4026.
    Rodent behavioral paradigms [15] have played a crucial role in establishing the structural and functional dissociations between different classes of behavior [14]. This paper is one of a series of studies from this group which is importing these paradigms lock, stock and barrel to humans, showing that exactly the same distinctions and indeed homologous anatomical bases apply. These studies form a crucial part of the evidentiary basis of neural RL.
27. O'Doherty JP, Buchanan TW, Seymour B, Dolan RJ: Predictive neural coding of reward preference involves dissociable responses in human ventral midbrain and ventral striatum. Neuron 2006, 49:157-166.
28. Schönberg T, Daw ND, Joel D, O'Doherty JP: Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. J Neurosci 2007, 27:12860-12867.
29. Tobler PN, O'Doherty JP, Dolan RJ, Schultz W: Human neural learning depends on reward prediction errors in the blocking paradigm. J Neurophysiol 2006, 95:301-310.
30. D'Ardenne K, McClure SM, Nystrom LE, Cohen JD: Bold responses reflecting dopaminergic signals in the human ventral tegmental area. Science 2008, 319:1264-1267.
    This study achieves an impressive new level of anatomical precision in recording fMRI BOLD signals from the ventral tegmental area (VTA) in humans, confirming various aspects suggested from recordings in macaques and rats (though the extra precision does not extend to a distinction between dopaminergic and nondopaminergic cells in the VTA, let alone the different classes of dopamine cells). It will be fascinating to see what happens in the substantia nigra pars compacta, whose dopamine cells may be more closely associated with action rather than value learning.
31. Pessiglione M, Seymour B, Flandin G, Dolan RJ, Frith CD: Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 2006, 442:1042-1045.
    Despite the substantial circumstantial data from animals and human neurological cases, there are few direct tests of the specifically dopaminergic basis of normal instrumental conditioning. This study tested this in healthy subjects using L-DOPA and haloperidol, which are expected respectively to enhance and suppress dopamine function. L-DOPA improved choice performance toward monetary gains (compared to haloperidol-treated subjects) but not avoidance of monetary losses. This was accompanied by an increase in ventral striatal BOLD responses corresponding to prediction errors in the gain condition in the L-DOPA group, whose magnitude was sufficient for a standard action-value learning model to explain the effects of the drugs on behavioral choices. These results provide (much needed, but still indirect) evidence that the ventral striatal prediction error BOLD signal indeed reflects (at least partly) dopaminergic activity.
32. Hampton AN, O'Doherty JP: Decoding the neural substrates of reward-related decision making with functional MRI. Proc Natl Acad Sci U S A 2007, 104:1377-1382.
33. Samejima K, Doya K: Multiple representations of belief states and action values in corticobasal ganglia loops. Ann N Y Acad Sci 2007, 1104:213-228.
34. Walton ME, Bannerman DM, Alterescu K, Rushworth MFS: Functional specialization within medial frontal cortex of the anterior cingulate for evaluating effort-related decisions. J Neurosci 2003, 23:6475-6479.
35. Schweimer J, Hauber W: Involvement of the rat anterior cingulate cortex in control of instrumental responses guided by reward expectancy. Learn Mem 2005, 12:334-342.
36. Schweimer J, Hauber W: Dopamine D1 receptors in the anterior cingulate cortex regulate effort-based decision making. Learn Mem 2006, 13:777-782.
37. Walton ME, Croxson PL, Rushworth MFS, Bannerman DM: The mesocortical dopamine projection to anterior cingulate cortex plays no role in guiding effort-related decisions. Behav Neurosci 2005, 119:323-328.
38. Matsumoto M, Hikosaka O: Lateral habenula as a source of negative reward signals in dopamine neurons. Nature 2007, 447:1111-1115.
    The sources of excitation and inhibition of dopaminergic and indeed nondopaminergic cells in the VTA have long been debated. This paper shows that there is an important pathway from a nucleus called the lateral habenula, that has the effect of inhibiting the activity of the dopamine cells. Consistent with this connection, habenula neurons were excited by stimuli that effectively predict the absence of reward, and inhibited by targets that predict its presence.
39. Lecourtier L, Defrancesco A, Moghaddam B: Differential tonic influence of lateral habenula on prefrontal cortex and nucleus accumbens dopamine release. Eur J Neurosci 2008, 27:1755-1762.
40. Bayer HM, Lau B, Glimcher PW: Statistics of midbrain dopamine neuron spike trains in the awake primate. J Neurophysiol 2007, 98:1428-1439.
41. Frank MJ, Moustafa AA, Haughey HM, Curran T, Hutchison KE: Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc Natl Acad Sci U S A 2007, 104:16311-16316.
42. Klein TA, Neumann J, Reuter M, Hennig J, von Cramon DY, Ullsperger M: Genetically determined differences in learning from errors. Science 2007, 318:1642-1645.
43. McHaffie JG, Jiang H, May PJ, Coizet V, Overton PG, Stein BE, Redgrave P: A direct projection from superior colliculus to substantia nigra pars compacta in the cat. Neuroscience 2006, 138:221-234.




44. Paton JJ, Belova MA, Morrison SE, Salzman CD: The primate amygdala represents the positive and negative value of visual stimuli during learning. Nature 2006, 439:865-870.
45. Belova MA, Paton JJ, Morrison SE, Salzman CD: Expectation modulates neural responses to pleasant and aversive stimuli in primate amygdala. Neuron 2007, 55:970-984.
    The amygdala has long been suggested to be an important site for appetitive and aversive learning (in which the difference between valences is crucial) and also a form of uncertainty (in which positive and negative valences are treated together). This study and its colleague [44] provides an interesting view into the coding of valenced and unvalenced predictions and prediction errors in the macaque amygdala. Putting these data together with the rodent studies, with their anatomically and functionally precise dissociations [17], is a pressing task.
46. Salzman CD, Paton JJ, Belova MA, Morrison SE: Flexible neural representations of value in the primate brain. Ann N Y Acad Sci 2007, 1121:336-354.
47. Matsumoto M, Matsumoto K, Abe H, Tanaka K: Medial prefrontal cell activity signaling prediction errors of action values. Nat Neurosci 2007, 10:647-656.
48. Balleine BW, Delgado MR, Hikosaka O: The role of the dorsal striatum in reward and decision-making. J Neurosci 2007, 27:8161-8165.
49. Hikosaka O: Basal ganglia mechanisms of reward-oriented eye movement. Ann N Y Acad Sci 2007, 1104:229-249.
50. Lau B, Glimcher PW: Action and outcome encoding in the primate caudate nucleus. J Neurosci 2007, 27:14502-14514.
51. Hikosaka O, Nakamura K, Nakahara H: Basal ganglia orient eyes to reward. J Neurophysiol 2006, 95:567-584.
52. Simmons JM, Ravel S, Shidara M, Richmond BJ: A comparison of reward-contingent neuronal activity in monkey orbitofrontal cortex and ventral striatum: guiding actions toward rewards. Ann N Y Acad Sci 2007, 1121:376-394.
53. Padoa-Schioppa C: Orbitofrontal cortex and the computation of economic value. Ann N Y Acad Sci 2007, 1121:232-253.
54. Roesch MR, Taylor AR, Schoenbaum G: Encoding of time-discounted rewards in orbitofrontal cortex is independent of value representation. Neuron 2006, 51:509-520.
55. Furuyashiki T, Holland PC, Gallagher M: Rat orbitofrontal cortex separately encodes response and outcome information during performance of goal-directed behaviour. J Neurosci 2008, 28:5127-5138.
56. Morris G, Nevet A, Arkadir D, Vaadia E, Bergman H: Midbrain dopamine neurons encode decisions for future action. Nat Neurosci 2006, 9:1057-1063.
    Most historical recordings from dopamine cells have been in cases in which there is no meaningful choice between different alternatives. This study in macaques, together with Roesch et al.'s [58] in rodents, looks specifically at choice. The two studies confirm the general precepts of neural RL based on temporal difference (TD) learning, but use rather different methods, and come to different conclusions about which of the variants of TD is at work.
57. Rummery G, Niranjan M: On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department; 1994.
58. Roesch MR, Calu DJ, Schoenbaum G: Dopamine neurons encode the better option in rats deciding between differently delayed or sized rewards. Nat Neurosci 2007, 10:1615-1624.
    Along with work from Hyland and colleagues [159,160], this is one of the few studies looking at the activity of dopamine neurons in rodents. Many of its findings are exactly consistent with neural TD, though it comes to a different conclusion about exactly which TD rule is operational from the monkey experiment performed by Morris et al. [56].
59. Barto A: Adaptive critics and the basal ganglia. In Models of Information Processing in the Basal Ganglia. Edited by Houk J, Davis J, Beiser D. Cambridge, MA: MIT Press; 1995:215–232.
60. O'Reilly RC, Frank MJ, Hazy TE, Watz B: PVLV: the primary value and learned value Pavlovian learning algorithm. Behav Neurosci 2007, 121:31-49.
61. Ainslie G: Breakdown of Will. Cambridge University Press; 2001.
62. Loewenstein G, Prelec D: Anomalies in intertemporal choice: evidence and an interpretation. Q J Econ 1992, 107:573-597.
63. McClure SM, Laibson DI, Loewenstein G, Cohen JD: Separate neural systems value immediate and delayed monetary rewards. Science 2004, 306:503-507.
64. McClure SM, Ericson KM, Laibson DI, Loewenstein G, Cohen JD: Time discounting for primary rewards. J Neurosci 2007, 27:5796-5804.
65. Kable JW, Glimcher PW: The neural correlates of subjective value during intertemporal choice. Nat Neurosci 2007, 10:1625-1633.
66. Schweighofer N, Shishida K, Han CE, Okamoto Y, Tanaka SC, Yamawaki S, Doya K: Humans can adopt optimal discounting strategy under real-time constraints. PLoS Comput Biol 2006, 2:e152.
67. Schweighofer N, Bertin M, Shishida K, Okamoto Y, Tanaka SC, Yamawaki S, Doya K: Low-serotonin levels increase delayed reward discounting in humans. J Neurosci 2008, 28:4528-4532.
68. Tanaka SC, Schweighofer N, Asahi S, Shishida K, Okamoto Y, Yamawaki S, Doya K: Serotonin differentially regulates short- and long-term prediction of rewards in the ventral and dorsal striatum. PLoS ONE 2007, 2:e1333.
69. Bogacz R, Brown E, Moehlis J, Holmes P, Cohen JD: The physics of optimal decision making: a formal analysis of models of performance in two-alternative forced-choice tasks. Psychol Rev 2006, 113:700-765.
70. Niv Y, Joel D, Dayan P: A normative perspective on motivation. Trends Cogn Sci 2006, 10:375-381.
71. Niv Y, Daw ND, Joel D, Dayan P: Tonic dopamine: opportunity costs and the control of response vigor. Psychopharmacology (Berl) 2007, 191:507-520.
72. Salamone JD, Correa M: Motivational views of reinforcement: implications for understanding the behavioral functions of nucleus accumbens dopamine. Behav Brain Res 2002, 137:3-25.
73. Nakamura K, Hikosaka O: Role of dopamine in the primate caudate nucleus in reward modulation of saccades. J Neurosci 2006, 26:5360-5369.
74. Mazzoni P, Hristova A, Krakauer JW: Why don't we move faster? Parkinson's disease, movement vigor, and implicit motivation. J Neurosci 2007, 27:7105-7116.
    Patients with Parkinson's disease exhibit bradykinesia, that is, they move more slowly than healthy controls. This paper suggests that this effect is not from the obvious cause of an altered speed-accuracy tradeoff, but rather because the patient's 'motivation to move', defined in a careful way that resonates well with Niv et al.'s [71] notion of vigor, is reduced.
75. Pessiglione M, Schmidt L, Draganski B, Kalisch R, Lau H, Dolan RJ, Frith CD: How the brain translates money into force: a neuroimaging study of subliminal motivation. Science 2007, 316:904-906.
76. Shidara M, Richmond BJ: Differential encoding of information about progress through multi-trial reward schedules by three groups of ventral striatal neurons. Neurosci Res 2004, 49:307-314.
77. Glimcher PW: Decisions, Uncertainty, and the Brain: The Science of Neuroeconomics (Bradford Books). The MIT Press; 2004.
78. Montague PR: Neuroeconomics: a view from neuroscience. Funct Neurol 2007, 22:219-234.
79. Daw ND, Kakade S, Dayan P: Opponent interactions between serotonin and dopamine. Neural Netw 2002, 15:603-616.
80. Ungless MA, Magill PJ, Bolam JP: Uniform inhibition of dopamine neurons in the ventral tegmental area by aversive stimuli. Science 2004, 303:2040-2042.
81. Coizet V, Dommett EJ, Redgrave P, Overton PG: Nociceptive responses of midbrain dopaminergic neurones are modulated by the superior colliculus in the rat. Neuroscience 2006, 139:1479-1493.

82. Seymour B, O'Doherty JP, Dayan P, Koltzenburg M, Jones AK, Dolan RJ, Friston KJ, Frackowiak RS: Temporal difference models describe higher-order learning in humans. Nature 2004, 429:664-667.

83. Jensen J, Smith AJ, Willeit M, Crawley AP, Mikulis DJ, Vitcu I, Kapur S: Separate brain regions code for salience vs. valence during reward prediction in humans. Hum Brain Mapp 2007, 28:294-302.

84. Delgado MR, Nystrom LE, Fissell C, Noll DC, Fiez JA: Tracking the hemodynamic responses to reward and punishment in the striatum. J Neurophysiol 2000, 84:3072-3077.

85. Menon M, Jensen J, Vitcu I, Graff-Guerrero A, Crawley A, Smith MA, Kapur S: Temporal difference modeling of the blood-oxygen level dependent response during aversive conditioning in humans: effects of dopaminergic modulation. Biol Psychiatry 2007, 62:765-772.

86. Frank MJ, Seeberger LC, O'Reilly RC: By carrot or by stick: cognitive reinforcement learning in Parkinsonism. Science 2004, 306:1940-1943.

87. Seymour B, Daw N, Dayan P, Singer T, Dolan R: Differential encoding of losses and gains in the human striatum. J Neurosci 2007, 27:4826-4831.

88. Reynolds SM, Berridge KC: Fear and feeding in the nucleus accumbens shell: rostrocaudal segregation of GABA-elicited defensive behavior versus eating behaviour. J Neurosci 2001, 21:3261-3270.

89. Reynolds SM, Berridge KC: Positive and negative motivation in nucleus accumbens shell: bivalent rostrocaudal gradients for GABA-elicited eating, taste 'liking'/'disliking' reactions, place preference/avoidance, and fear. J Neurosci 2002, 22:7308-7320.

90. O'Doherty J, Kringelbach ML, Rolls ET, Hornak J, Andrews C: Abstract reward and punishment representations in the human orbitofrontal cortex. Nat Neurosci 2001, 4:95-102.

91. McNaughton N, Corr PJ: A two-dimensional neuropsychology of defense: fear/anxiety and defensive distance. Neurosci Biobehav Rev 2004, 28:285-305.

92. Moutoussis M, Williams J, Dayan P, Bentall RP: Persecutory delusions and the conditioned avoidance paradigm: towards an integration of the psychology and biology of paranoia. Cognit Neuropsychiatry 2007, 12:495-510.

93. Kim H, Shimojo S, O'Doherty JP: Is avoiding an aversive outcome rewarding? Neural substrates of avoidance learning in the human brain. PLoS Biol 2006, 4:e233.

94. Holland P, Gallagher M: Amygdala circuitry in attentional and representational processes. Trends Cogn Sci 1999, 3:65-73.

95. Cohen JD, McClure SM, Yu AJ: Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos Trans R Soc Lond B Biol Sci 2007, 362:933-942.

96.• Wittmann BC, Bunzeck N, Dolan RJ, Düzel E: Anticipation of novelty recruits reward system and hippocampus while promoting recollection. Neuroimage 2007, 38:194-202.
Novelty plays a crucial role in learning, and has also been associated with the dopamine system via the construct of bonuses [98]. This paper is part of a series of studies into the coupling between novelty, putatively dopaminergic midbrain activation and hippocampal-dependent long-term memory. Along with their intrinsic interest, the results of these experiments suggest that this form of learning can provide evidence about the characteristics of signals associated with RL.

97. Bunzeck N, Düzel E: Absolute coding of stimulus novelty in the human substantia nigra/VTA. Neuron 2006, 51:369-379.

98. Kakade S, Dayan P: Dopamine: generalization and bonuses. Neural Netw 2002, 15:549-559.

99. Yoshida W, Ishii S: Resolution of uncertainty in prefrontal cortex. Neuron 2006, 50:781-789.

100. Matsumoto M, Matsumoto K, Tanaka K: Effects of novelty on activity of lateral and medial prefrontal neurons. Neurosci Res 2007, 57:268-276.

101. Herry C, Bach DR, Esposito F, Salle FD, Perrig WJ, Scheffler K, Lüthi A, Seifritz E: Processing of temporal unpredictability in human and animal amygdala. J Neurosci 2007, 27:5958-5966.

102. Li J, McClure SM, King-Casas B, Montague PR: Policy adjustment in a dynamic economic game. PLoS ONE 2006, 1:e103.

103. Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ: Cortical substrates for exploratory decisions in humans. Nature 2006, 441:876-879.

104. Behrens TEJ, Woolrich MW, Walton ME, Rushworth MFS: Learning the value of information in an uncertain world. Nat Neurosci 2007, 10:1214-1221.

105. Lee HJ, Youn JM, O MJ, Gallagher M, Holland PC: Role of substantia nigra-amygdala connections in surprise-induced enhancement of attention. J Neurosci 2006, 26:6077-6081.

106. Holland PC, Gallagher M: Different roles for amygdala central nucleus and substantia innominata in the surprise-induced enhancement of learning. J Neurosci 2006, 26:3791-3797.

107. Camerer C: Behavioral Game Theory: Experiments in Strategic Interaction. Princeton, NJ: Princeton University Press; 2003.

108. Kuhnen CM, Knutson B: The neural basis of financial risk taking. Neuron 2005, 47:763-770.

109. Dreher JC, Kohn P, Berman KF: Neural coding of distinct statistical properties of reward information in humans. Cereb Cortex 2006, 16:561-573.

110. Tanaka SC, Samejima K, Okada G, Ueda K, Okamoto Y, Yamawaki S, Doya K: Brain mechanism of reward prediction under predictable and unpredictable environmental dynamics. Neural Netw 2006, 19:1233-1241.

111. Tobler PN, O'Doherty JP, Dolan RJ, Schultz W: Reward value coding distinct from risk attitude-related uncertainty coding in human reward systems. J Neurophysiol 2007, 97:1621-1632.

112. Preuschoff K, Bossaerts P, Quartz SR: Neural differentiation of expected reward and risk in human subcortical structures. Neuron 2006, 51:381-390.

113. Tobler PN, Fletcher PC, Bullmore ET, Schultz W: Learning-related human brain activations reflecting individual finances. Neuron 2007, 54:167-175.

114. Niv Y, Joel D, Meilijson I, Ruppin E: Evolution of reinforcement learning in uncertain environments: a simple explanation for complex foraging behaviors. Adaptive Behavior 2002, 10:5-24.

115. Mihatsch O, Neuneier R: Risk-sensitive reinforcement learning. Mach Learn 2002, 49:267-290.

116. Byrne R: Mental models and counterfactual thoughts about what might have been. Trends Cogn Sci 2002, 6:426-431.

117. Lohrenz T, McCabe K, Camerer CF, Montague PR: Neural signature of fictive learning signals in a sequential investment task. Proc Natl Acad Sci U S A 2007, 104:9493-9498.

118. Bell D: Regret in decision making under uncertainty. Oper Res 1982, 30:961-981.

119. Loomes G, Sugden R: Regret theory: an alternative theory of rational choice under uncertainty. Econ J 1982, 92:805-824.

120. Breiter HC, Aharon I, Kahneman D, Dale A, Shizgal P: Functional imaging of neural responses to expectancy and experience of monetary gains and losses. Neuron 2001, 30:619-639.

121. Coricelli G, Critchley HD, Joffily M, O'Doherty JP, Sirigu A, Dolan RJ: Regret and its avoidance: a neuroimaging study of choice behaviour. Nat Neurosci 2005, 8:1255-1262.

122. De Martino B, Kumaran D, Seymour B, Dolan RJ: Frames, biases, and rational decision-making in the human brain. Science 2006, 313:684-687.

123. Fehr E, Camerer CF: Social neuroeconomics: the neural circuitry of social preferences. Trends Cogn Sci 2007, 11:419-427.
124. Chiu PH, Kayali MA, Kishida KT, Tomlin D, Klinger LG, Klinger MR, Montague PR: Self responses along cingulate cortex reveal quantitative neural phenotype for high-functioning autism. Neuron 2008, 57:463-473.

125. Knoch D, Pascual-Leone A, Meyer K, Treyer V, Fehr E: Diminishing reciprocal fairness by disrupting the right prefrontal cortex. Science 2006, 314:829-832.

126. Shoham Y, Powers R, Grenager T: If multi-agent learning is the answer, what is the question? Artif Intell 2007, 171:365-377.

127. Gmytrasiewicz P, Doshi P: A framework for sequential planning in multi-agent settings. J Artif Intell Res 2005, 24:49-79.

128. Breland K, Breland M: The misbehavior of organisms. Am Psychol 1961, 16:681-684.

129. Talmi D, Seymour B, Dayan P, Dolan RJ: Human Pavlovian-instrumental transfer. J Neurosci 2008, 28:360-368.

130. El-Amamy H, Holland PC: Dissociable effects of disconnecting amygdala central nucleus from the ventral tegmental area or substantia nigra on learned orienting and incentive motivation. Eur J Neurosci 2007, 25:1557-1567.

131. Murschall A, Hauber W: Inactivation of the ventral tegmental area abolished the general excitatory influence of Pavlovian cues on instrumental performance. Learn Mem 2006, 13:123-126.

132. Graeff FG, Guimarães FS, Andrade TGD, Deakin JF: Role of 5-HT in stress, anxiety, and depression. Pharmacol Biochem Behav 1996, 54:129-141.

133. Soubrié P: Reconciling the role of central serotonin neurons in human and animal behaviour. Behav Brain Sci 1986, 9:364.

134. Dayan P, Huys QJM: Serotonin, inhibition, and negative mood. PLoS Comput Biol 2008, 4:e4.

135. Blanchard DC, Blanchard RJ: Ethoexperimental approaches to the biology of emotion. Annu Rev Psychol 1988, 39:43-68.

136. Mobbs D, Petrovic P, Marchant JL, Hassabis D, Weiskopf N, Seymour B, Dolan RJ, Frith CD: When fear is near: threat imminence elicits prefrontal-periaqueductal gray shifts in humans. Science 2007, 317:1079-1083.

137. Dayan P, Niv Y, Seymour B, Daw ND: The misbehavior of value and the discipline of the will. Neural Netw 2006, 19:1153-1160.

138.• Frank MJ, Samanta J, Moustafa AA, Sherman SJ: Hold your horses: impulsivity, deep brain stimulation, and medication in Parkinsonism. Science 2007, 318:1309-1312.
An earlier paper by the same group [141] suggested that the subthalamic nucleus (STN) might play a crucial role in slowing down behavior when a choice has to be made between strongly appetitive options, so the absolute best one can be chosen without impulsive early choices. This very clever study uses two different treatments for Parkinson's disease (dopaminergic therapy and deep brain stimulation of the STN) together with a very important prediction/action learning task [86] to test this hypothesis, and the distinction between STN and dopaminergic effects in the striatum.

139. Aron AR, Poldrack RA: Cortical and subcortical contributions to stop signal response inhibition: role of the subthalamic nucleus. J Neurosci 2006, 26:2424-2433.

140. Aron AR, Behrens TE, Smith S, Frank MJ, Poldrack RA: Triangulating a cognitive control network using diffusion-weighted magnetic resonance imaging (MRI) and functional MRI. J Neurosci 2007, 27:3743-3752.

141. Frank MJ: Hold your horses: a dynamic computational role for the subthalamic nucleus in decision making. Neural Netw 2006, 19:1120-1136.

142. Atallah HE, Lopez-Paniagua D, Rudy JW, O'Reilly RC: Separate neural substrates for skill learning and performance in the ventral and dorsal striatum. Nat Neurosci 2007, 10:126-131.

143. Belin D, Everitt BJ: Cocaine seeking habits depend upon dopamine-dependent serial connectivity linking the ventral with the dorsal striatum. Neuron 2008, 57:432-441.

144. Joel D, Weiner I: The connections of the dopaminergic system with the striatum in rats and primates: an analysis with respect to the functional and compartmental organization of the striatum. Neuroscience 2000, 96:451-474.

145. Haber SN, Fudge JL, McFarland NR: Striatonigrostriatal pathways in primates form an ascending spiral from the shell to the dorsolateral striatum. J Neurosci 2000, 20:2369-2382.

146. Barto A, Mahadevan S: Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 2003, 13:341-379.

147. Botvinick MM: Hierarchical models of behavior and prefrontal function. Trends Cogn Sci 2008, 12:201-208.

148. Bouton ME: Context, ambiguity, and unlearning: sources of relapse after behavioral extinction. Biol Psychiatry 2002, 60:322-328.

149. Redish AD, Jensen S, Johnson A, Kurth-Nelson Z: Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychol Rev 2007, 114:784-805.

150. Montague PR, McClure SM, Baldwin PR, Phillips PEM, Budygin EA, Stuber GD, Kilpatrick MR, Wightman RM: Dynamic gain control of dopamine delivery in freely moving animals. J Neurosci 2004, 24:1754-1759.

151. Tobler PN, Dickinson A, Schultz W: Coding of predicted reward omission by dopamine neurons in a conditioned inhibition paradigm. J Neurosci 2003, 23:10402-10410.

152. Tobler PN, Fiorillo CD, Schultz W: Adaptive coding of reward value by dopamine neurons. Science 2005, 307:1642-1645.

153. Preuschoff K, Bossaerts P: Adding prediction risk to the theory of reward learning. Ann N Y Acad Sci 2007, 1104:135-146.

154. Lammel S, Hetzel A, Häckel O, Jones I, Liss B, Roeper J: Unique properties of mesoprefrontal neurons within a dual mesocorticolimbic dopamine system. Neuron 2008, 57:760-773.

155. O'Reilly RC, Frank MJ: Making working memory work: a computational model of learning in the prefrontal cortex and basal ganglia. Neural Comput 2006, 18:283-328.

156. Schultz W, Dayan P, Montague PR: A neural substrate of prediction and reward. Science 1997, 275:1593-1599.

157. Gibbon J, Malapani C, Dale C, Gallistel C: Toward a neurobiology of temporal cognition: advances and challenges. Curr Opin Neurobiol 1997, 7:170-184.

158. Day JJ, Roitman MF, Wightman RM, Carelli RM: Associative learning mediates dynamic shifts in dopamine signaling in the nucleus accumbens. Nat Neurosci 2007, 10:1020-1028.

159. Hyland BI, Reynolds JNJ, Hay J, Perk CG, Miller R: Firing modes of midbrain dopamine cells in the freely moving rat. Neuroscience 2002, 114:475-492.

160. Pan WX, Schmidt R, Wickens JR, Hyland BI: Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. J Neurosci 2005, 25:6235-6242.