Citation:
Distributional Reinforcement Learning
By: Marc G. Bellemare, Will Dabney, Mark Rowland
DOI: 10.7551/mitpress/14207.001.0001
ISBN (electronic): 9780262374026
Publisher: The MIT Press
Published: 2023
the law of large numbers, and subjective utility theory. In fact, most reinforcement learning textbooks axiomatize the maximization of expectation. Quoting
Richard Bellman, for example:
The general idea, and this is fairly unanimously accepted, is to use some average of
the possible outcomes as a measure of the value of a policy.
This book takes the perspective that modeling the expected return alone
fails to account for many complex, interesting phenomena that arise from
interactions with one’s environment. This is evident in many of the decisions
that we make: the first rule of investment states that expected profits should
be weighed against volatility. Similarly, lottery tickets offer negative expected
returns but attract buyers with the promise of a high payoff. During a snowstorm,
relying on the average frequency at which buses arrive at a stop is likely to lead
to disappointment. More generally, hazards big and small result in a wide range
of possible returns, each with its own probability of occurrence. These returns
and their probabilities can be collectively described by a return distribution, our
main object of study.
Figure 1.1
(a) In Kuhn poker, each player is dealt one card and then bets on whether they hold the
highest card. The diagram depicts one particular play through; the house’s card (bottom)
is hidden until betting is over. (b) A game tree with all possible states shaded according
to their frequency of occurrence in our example. Leaf nodes depict immediate gains and
losses, which we equate with different values of the received reward.
[Figure: four histogram panels, vertical axis labeled "Probability."]
Figure 1.2
Distribution over winnings for the player after playing T rounds. For T = 1, this corresponds to the distribution of immediate gains and losses. For T = 5, we see a single mode appear roughly centered on the expected winnings. For larger T, two additional modes
appear, one in which the agent goes bankrupt and one where the player has successfully
doubled their stake. As T → ∞, only these two outcomes have nonzero probability.
doubles their bet, or check. In response to a raise, the second player can call
and match the new bet or fold and lose their £1 ante. If the first player chooses
to check instead (keep the bet as-is), the option to raise is given to the second
player, symmetrically. If neither player folds, the player with the higher card
wins the pot (£1 or £2, depending on whether the ante was raised). Figure 1.1b
visualizes a single play of the game as a fifty-five-state game tree.
Consider a player who begins with £10 and plays a total of up to T hands
of Kuhn poker, stopping early if they go bankrupt or double their initial stake.
To keep things simple, we assume that this player always goes first and that
their opponent, the house, makes decisions uniformly at random. The player’s
strategy depends on the card they are dealt and also incorporates an element of
randomness. There are two situations in which a choice must be made: whether
to raise or check at first and whether to call or fold when the other player raises.
The following table of probabilities describes a concrete strategy as a function
of the player’s dealt card:
Holding a ...            Jack    Queen    King
Probability of raising   1/3     0        1
Probability of calling   0       2/3      1
If we associate a positive and negative reward with each round’s gains or losses,
then the agent’s random return corresponds to their total winnings at the end of
the T rounds and ranges from −10 to 10.
How likely is the player to go bankrupt? How long does it take before the
player is more likely to be ahead than not? What are the mean and variance of the
player’s winnings after T = 15 hands have been played? These three questions
(and more) can be answered by using distributional dynamic programming
to determine the distribution of returns obtained after T rounds. Figure 1.2
shows the resulting distributions of winnings for several values of T.
In this case, Gπ(x), R, X′, and Gπ(X′) are random variables, and the superscript
D indicates equality between their distributions. Correctly interpreting the distri-
butional Bellman equation requires identifying the dependency between random
variables, in particular between R and X′. It also requires understanding how
discounting affects the probability distribution of Gπ(x) and how to manipulate
the collection of random variables Gπ implied by the definition.
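For reference, the distributional Bellman equation under discussion can be written (here in LaTeX notation) as

    G^\pi(x) \overset{D}{=} R + \gamma \, G^\pi(X'),

where A ~ π(· | x), the pair (R, X′) is the random reward and next state produced by taking action A in state x, and the D over the equals sign denotes equality in distribution.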
Another change concerns how we quantify the behavior of learning algo-
rithms and how we measure the quality of an agent’s predictions. Because
value functions are real-valued vectors, the distance between a value function
estimate and the desired expected return is measured as the absolute difference
between those two quantities. On the other hand, when analyzing a distribu-
tional reinforcement learning algorithm, we must instead measure the distance
between probability distributions using a probability metric. As we will see,
some probability metrics are better suited to distributional reinforcement learn-
ing than others, but no single metric can be identified as the “natural” metric
for comparing return distributions.
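As one concrete example of such a metric (purely illustrative; the book weighs several options), the 1-Wasserstein distance between two discrete return distributions equals the area between their cumulative distribution functions. A minimal NumPy sketch, with illustrative function names:

```python
import numpy as np

def discrete_cdf(support, values, probs):
    """CDF of a discrete distribution (values, probs) evaluated at each point of `support`."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    order = np.argsort(values)
    values, cum = values[order], np.cumsum(probs[order])
    idx = np.searchsorted(values, support, side="right")      # number of atoms <= each point
    return np.where(idx > 0, cum[np.maximum(idx - 1, 0)], 0.0)

def wasserstein_1(values1, probs1, values2, probs2):
    """1-Wasserstein distance: integral of |F1 - F2| over the merged support."""
    support = np.union1d(values1, values2)                    # sorted union of atom locations
    gaps = np.diff(support)                                   # interval lengths between atoms
    f1 = discrete_cdf(support, values1, probs1)[:-1]          # CDFs are constant on each interval
    f2 = discrete_cdf(support, values2, probs2)[:-1]
    return float(np.sum(np.abs(f1 - f2) * gaps))

# A point mass at 0 versus a fair +/-1 coin flip: distance 1.0.
print(wasserstein_1([0.0], [1.0], [-1.0, 1.0], [0.5, 0.5]))
```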
Implementing distributional reinforcement learning algorithms also poses
some concrete computational challenges. In general, the return distribution
is supported on a range of possible returns, and its shape can be quite com-
plex. To represent this distribution with a finite number of parameters, some
approximation is necessary; the practitioner is faced with a variety of choices
and trade-offs. One approach is to discretize the support of the distribution
uniformly and assign a variable probability to each interval, what we call the
categorical representation. Another is to represent the distribution using a finite
number of uniformly weighted particles whose locations are parameterized,
called the quantile representation. In practice and in theory, we find that the
choice of distribution representation impacts the quality of the return function
approximation and also the ease with which it can be computed.
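To illustrate the distinction, here is a minimal sketch of the two parameterizations just described; the class names and the default numbers of atoms and particles are illustrative, and no learning rule is shown.

```python
import numpy as np

class CategoricalRepresentation:
    """Fixed, uniformly spaced support; the probability of each atom is adjustable."""
    def __init__(self, v_min=-10.0, v_max=10.0, n_atoms=51):
        self.support = np.linspace(v_min, v_max, n_atoms)   # fixed atom locations
        self.probs = np.full(n_atoms, 1.0 / n_atoms)        # learnable probabilities (sum to 1)

    def mean(self):
        return float(np.dot(self.probs, self.support))

class QuantileRepresentation:
    """Fixed, uniform probabilities (1/m per particle); particle locations are adjustable."""
    def __init__(self, n_quantiles=32):
        self.locations = np.zeros(n_quantiles)              # learnable particle locations
        self.weight = 1.0 / n_quantiles                      # fixed probability of each particle

    def mean(self):
        return float(self.locations.mean())
```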
Learning return distributions from sampled experience is also more challeng-
ing than learning to predict expected returns. The issue is particularly acute
when learning proceeds by bootstrapping: that is, when the return function
estimate at one state is learned on the basis of the estimate at successor states.
When the return function estimates are defined by a deep neural network, as is
common in practice, one must also take care in choosing a loss function that is
compatible with a stochastic gradient descent scheme.
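For the quantile representation, one loss that meets this requirement is the quantile regression loss, whose (sub)gradients are well defined for stochastic gradient descent; deep reinforcement learning implementations often smooth it with a Huber term, omitted here. A minimal NumPy sketch with illustrative names, where the targets are bootstrapped samples of the form r + γ · (draw from the successor state's estimate):

```python
import numpy as np

def quantile_regression_loss(locations, target_samples):
    """Average quantile regression loss between m particle locations and sampled targets.

    Minimizing this loss moves location i toward the tau_i = (i + 0.5) / m quantile
    of the target distribution."""
    locations = np.asarray(locations, float)
    target_samples = np.asarray(target_samples, float)
    m = locations.shape[0]
    taus = (np.arange(m) + 0.5) / m                          # midpoint quantile levels
    u = target_samples[:, None] - locations[None, :]         # pairwise errors, shape (n, m)
    loss = np.where(u >= 0.0, taus * u, (taus - 1.0) * u)    # asymmetric absolute loss
    return float(loss.mean())

# Toy usage: three particles scored against samples of a bootstrapped target.
targets = 0.5 + 0.9 * np.random.default_rng(0).normal(size=1000)   # stand-in for r + gamma * G'(x')
print(quantile_regression_loss(np.array([0.0, 0.5, 1.0]), targets))
```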
For an agent that only knows about expected returns, it is natural (almost
necessary) to define optimal behavior in terms of maximizing this quantity.
The Q-learning algorithm, which performs credit assignment by maximizing
over state-action values, learns a policy with exactly this objective in mind.
Knowledge of the return function, however, allows us to define behaviors
The MIT Press would like to thank the anonymous peer reviewers who provided
comments on drafts of this book. The generous work of academic experts is essential for
establishing the authority and quality of our publications. We acknowledge with
gratitude the contributions of these otherwise uncredited readers.
Names: Bellemare, Marc G., author. | Dabney, Will, author. | Rowland, Mark
(Research scientist), author.
Title: Distributional reinforcement learning / Marc G. Bellemare, Will Dabney, Mark
Rowland.
Description: Cambridge, Massachusetts : The MIT Press, [2023] | Series: Adaptive
computation and machine learning | Includes bibliographical references and index.
Identifiers: LCCN 2022033240 (print) | LCCN 2022033241 (ebook) | ISBN
9780262048019 (hardcover) | ISBN 9780262374019 (epub) | ISBN 9780262374026
(pdf)
Subjects: LCSH: Reinforcement learning. | Reinforcement learning–Statistical methods.
Classification: LCC Q325.6 .B45 2023 (print) | LCC Q325.6 (ebook) | DDC
006.3/1–dc23/eng20221102
LC record available at https://lccn.loc.gov/2022033240
LC ebook record available at https://lccn.loc.gov/2022033241