Chapter 3
• The methods in the previous chapter were exact, tabular methods that work for
problems of moderate size that fit in memory.
• In this chapter we move to high-dimensional problems with large state spaces
that no longer fit in memory.
• The methods in this chapter are deep, model-free, value-based methods, related
to Q-learning.
• In reinforcement learning the current behavior policy determines which action is
selected next, a process that can be self-reinforcing. There is no ground truth, as
in supervised learning.
• In deep reinforcement learning convergence to 𝑉 and 𝑄 values is based on a
bootstrapping process, and the first challenge is to find training methods that
converge to stable function values.
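The bootstrapping process mentioned above can be made concrete with the tabular Q-learning update: each value estimate is improved using another estimate, rather than a ground-truth label. A minimal sketch (the two-state environment and constants are made up for illustration):

```python
# Q-learning update: the target r + gamma * max_a' Q(s', a') itself
# contains an estimate, Q(s', a') -- this is bootstrapping.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])  # move estimate toward target
    return Q[(s, a)]

# Tiny illustration on a two-state chain with two actions.
actions = [0, 1]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
q_update(Q, s=0, a=0, r=1.0, s_next=1, actions=actions)
# Q[(0, 0)] moves from 0.0 toward the bootstrapped target 1.0
```

Because the target is itself an estimate, errors can feed back into later targets, which is the root of the stability problems this chapter discusses.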
End-to-end Learning
• A paper was presented on an algorithm that could play 1980s Atari video games
just by training on the pixel input of the video screen.
• The algorithm used a combination of deep learning and Q-learning, and was
named Deep Q-Network, or DQN. This was a breakthrough for reinforcement
learning.
• Besides the fact that the problem being solved was easily understood, true
eye-hand coordination of this complexity had not been achieved by a computer
before; furthermore, the end-to-end learning from pixel to joystick implied
artificial behavior that was close to how humans play games.
• DQN essentially launched the field of deep reinforcement learning.
• A major technical challenge that was overcome by DQN was the instability of the
deep reinforcement learning process.
• ALE presents agents with a high-dimensional visual input (210 × 160 RGB video
at 60 Hz, or 60 images per second) of tasks that were designed to be interesting
and challenging for human players.
• The game cartridge ROM holds 2-4 kB of game code, while the console random-
access memory is small, just 128 bytes.
• The actions can be selected via a joystick (9 directions), which has a fire button
(fire on/off), giving 18 actions in total.
• Atari games, with high-resolution video input at high frame rates, are an entirely
different kind of challenge than Grid worlds or board games.
• Indeed, the Atari benchmark called for very different agent algorithms,
prompting the move from tabular algorithms to algorithms based on function
approximation and deep learning. ALE has become a standard benchmark in
deep reinforcement learning research.
Bootstrapping Q-Values
Three Challenges
• Let us have a closer look at the challenges that deep reinforcement learning
faces. There are three problems with our naive deep Q-learner.
• First, convergence to the optimal Q-function depends on full coverage of the
state space, yet the state space is too large to sample fully.
• Second, there is a strong correlation between subsequent training samples, with
a real risk of local optima.
• Third, the loss function of gradient descent literally has a moving target, and
bootstrapping may diverge.
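The third problem, the moving target, can be seen directly in the loss: the bootstrap target is computed with the same parameters that gradient descent is changing. One of DQN's remedies is a frozen target copy that is only synchronized periodically. A minimal sketch using a toy one-parameter linear "network" (all values are illustrative):

```python
# Naive deep Q-learning: the target r + gamma * Q(s'; theta) moves every
# time theta is updated. Freezing a copy theta_minus and synchronizing it
# only every C steps gives gradient descent a stable regression target.
theta = 0.5            # parameter of a toy linear value function
theta_minus = theta    # frozen copy used to compute targets

def q(s, params):      # toy "network": Q(s) = params * s
    return params * s

gamma, alpha, sync_every = 0.9, 0.1, 4
for step in range(8):
    s, r, s_next = 1.0, 1.0, 1.0
    target = r + gamma * q(s_next, theta_minus)   # computed with frozen copy
    theta += alpha * (target - q(s, theta)) * s   # gradient step on theta only
    if (step + 1) % sync_every == 0:
        theta_minus = theta                       # periodic synchronization
```

Between synchronizations the target is constant, so each block of updates is an ordinary regression problem rather than a chase of a moving value.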
Coverage
• Proofs that algorithms such as Q-learning converge to the optimal policy depend
on the assumption of full state space coverage; all state-action pairs must be
sampled.
• Otherwise, the algorithms will not converge to an optimal action value for each
state.
• Clearly, in large state spaces where not all states are sampled, this situation
does not hold, and there is no guarantee of convergence.
Correlation
Convergence
• At the end of the 1980s, Tesauro wrote a program that played very strong
Backgammon based on a neural network.
• The program was called Neurogammon and used supervised learning from
grand-master games.
• In order to improve the strength of the program, he switched to temporal
difference reinforcement learning from self-play games.
• TD-Gammon learned by playing against itself, achieving stable learning in a
shallow network.
• TD-Gammon’s training used a temporal difference algorithm similar to Q-
learning, approximating the value function with a network with one hidden layer,
using raw board input enhanced with hand-crafted heuristic features.
• The results in Atari and later in Go as well as further work have now provided
clear evidence that both stable training and generalizing deep reinforcement
learning are indeed possible, and have improved our understanding of the
circumstances that influence stability and convergence.
Decorrelating States
Experience Replay
• We can reduce correlation, and the local minima it causes, by adding a small
amount of supervised learning.
• To break correlations and to create a more diverse set of training examples, DQN
uses experience replay.
• Experience replay introduces a replay buffer, a cache of previously explored
states, from which it samples training states at random.
• Experience replay stores the last 𝑁 examples in the replay memory, and samples
uniformly when performing updates.
• A typical number for 𝑁 is 10^6.
• By using a buffer, a dynamic dataset from which recent training examples are
sampled, we train on states from a more diverse set, instead of only the most
recent ones.
• The goal of experience replay is to increase the independence of subsequent
training examples.
• The next state to be trained on is no longer a direct successor of the current
state, but one somewhere in a long history of previous states.
• In this way the replay buffer spreads out the learning over more previously seen
states, breaking temporal correlations between samples.
• DQN’s replay buffer (1) improves coverage, and (2) reduces correlation.
• DQN treats all examples equally, old and recent alike.
• Note that, curiously, training by experience replay is a form of off-policy learning,
since the target parameters are different from those used to generate the
sample.
• Off-policy learning is one of the three elements of the deadly triad, and we find
that stable learning can actually be improved by a special form of one of its
problems.
• Zhang et al. [875] study the deadly triad with experience replay, and find that
larger networks resulted in more instabilities, but also that longer multi-step
returns yielded fewer unrealistically high reward values.
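A minimal replay buffer along the lines described above can be sketched as follows (the capacity and transition fields are illustrative; DQN uses N on the order of 10^6):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores the last N transitions and samples uniformly at random,
    breaking the temporal correlation between consecutive updates."""
    def __init__(self, capacity=10**6):
        self.buffer = deque(maxlen=capacity)  # oldest entries fall out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # overfill: only the last 100 survive
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)               # successors of many different past states
```

Because a sampled batch mixes transitions from many points in the history, consecutive gradient updates are far less correlated than when training on each new transition as it arrives.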
Conclusion
• DQN was able to successfully learn end-to-end behavior policies for many
different games (although similar and from the same benchmark set).
• Minimal prior knowledge was used to guide the system, and the agent only got to
see the pixels and the game score.
• The same network architecture and procedure were used on each game;
however, a network trained for one game could not be used to play another
game.
• The main problems that were overcome by Mnih et al. were training divergence
and learning instability.
• Most Atari 2600 games require eye-hand reflexes; they are more about
immediate reactions than about longer-term reasoning.
• In this sense, the problem of playing Atari well is not unlike an image
categorization problem: both problems are to find the right response that
matches an input consisting of a set of pixels.
Improving Exploration
• DQN applies random sampling of its replay buffer, and one of the first
enhancements was prioritized sampling.
• It was found that DQN, being an off-policy algorithm, typically overestimates
action values.
• Double DQN addresses overestimation.
• Dueling DDQN introduces the advantage function to standardize action values.
• Distributional DQN showed that networks that use probability distributions work
better than networks that only use single point expected values.
• In 2017 Hessel et al. [335] performed a large experiment that combined seven
important enhancements.
• The paper has become known as the Rainbow paper, since the major graph
showing the cumulative performance over 57 Atari games of the seven
enhancements is multi-colored.
Overestimation
• DQN samples uniformly over the entire history in the replay buffer, whereas Q-
learning uses only the most recent (and important) state. It stands to reason to
see if a solution in between these two extremes performs well.
• Prioritized experience replay, or PER, is such an attempt.
• In DQN, experience replay lets agents reuse examples from the past, although
experience transitions are uniformly sampled, and actions are simply replayed at
the same frequency that they were originally experienced, regardless of their
significance.
• The PER approach provides a framework for prioritizing experience.
• Important actions are replayed more frequently, and therefore learning efficiency
is improved.
• As a measure of importance, standard proportional prioritized replay is used, with
the absolute TD error to prioritize actions.
• Prioritized replay is used widely in value-based deep reinforcement learning.
• The measure can be computed in the distributional setting using the mean
action values.
• In the Rainbow paper all distributional variants prioritize actions by the Kullback-
Leibler loss.
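Proportional prioritization can be sketched as follows: each transition is sampled with probability proportional to |TD error|^α. This is a simplified sketch (the full algorithm also uses a sum-tree for efficiency and importance-sampling weights to correct the induced bias):

```python
import random

def prioritized_sample(td_errors, k, alpha=0.6, eps=1e-5):
    """Sample k indices with probability proportional to |TD error|**alpha.
    eps keeps transitions with zero error sampleable; alpha interpolates
    between uniform (alpha=0) and fully greedy (large alpha) sampling."""
    prios = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(prios)
    probs = [p / total for p in prios]
    return random.choices(range(len(td_errors)), weights=probs, k=k)

random.seed(0)
errors = [0.01, 0.01, 5.0, 0.01]          # transition 2 is "surprising"
idx = prioritized_sample(errors, k=1000)  # transition 2 dominates the sample
```

Transitions with large TD error, i.e. those the network predicts badly, are replayed far more often than well-predicted ones.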
Advantage Function
• The original DQN uses a single neural network as function approximator; DDQN
(double deep Q-network) uses a separate target Q-Network to evaluate an
action.
• Dueling DDQN, also known as DDDQN, improves on this architecture by using
two separate estimators: a value function and an advantage function
𝐴(𝑠, 𝑎) = 𝑄(𝑠, 𝑎) − 𝑉(𝑠).
• Advantage functions are related to the actor-critic approach.
• An advantage function computes the dicerence between the value of an action
and the value of the state.
• Advantage functions provide better policy evaluation when many actions have
similar values.
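The recombination of the two dueling streams into Q-values can be sketched as follows; the mean advantage is subtracted so that the decomposition into V and A is identifiable (a minimal sketch with made-up numbers):

```python
# Dueling aggregation: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
# Subtracting the mean forces the advantages to be zero on average,
# so V carries the shared state value and A only the per-action offsets.
def dueling_q(v, advantages):
    mean_adv = sum(advantages) / len(advantages)
    return [v + a - mean_adv for a in advantages]

q = dueling_q(v=2.0, advantages=[1.0, 0.0, -1.0])
# the greedy action is the one with the highest advantage
```

When many actions have similar values, the value stream learns the common component once, instead of each action estimate having to learn it separately.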
Distributional Methods
• The original DQN learns a single value, which is the estimated mean of the state
value.
• This approach does not take uncertainty into account.
• To remedy this, distributional Q-learning learns a categorical probability
distribution of discounted returns instead, increasing exploration.
• Bellemare et al. design a new distributional algorithm which applies Bellman’s
equation to the learning of distributions, a method called distributional DQN.
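A distributional network outputs, per action, a categorical distribution over a fixed support of return values ("atoms"); the scalar Q-value used for action selection is recovered as the expectation over that support. A minimal sketch (the atom placement and probabilities are illustrative):

```python
# Categorical return distribution: fixed support of atoms z_i with learned
# probabilities p_i; the usual Q-value is the expectation sum_i p_i * z_i.
def support(v_min, v_max, n_atoms):
    step = (v_max - v_min) / (n_atoms - 1)
    return [v_min + i * step for i in range(n_atoms)]

def expected_value(probs, atoms):
    return sum(p * z for p, z in zip(probs, atoms))

atoms = support(v_min=-10.0, v_max=10.0, n_atoms=5)   # [-10, -5, 0, 5, 10]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]                     # symmetric distribution
q_value = expected_value(probs, atoms)                # mean of the distribution
```

Two actions with the same expected value can have very different return distributions (e.g. a safe action versus a risky one), which the single-value DQN cannot distinguish.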
Noisy DQN
• Noisy DQN uses stochastic network layers that add parametric noise to the
weights.
• The noise induces randomness in the agent’s policy, which increases
exploration.
• The parameters that govern the noise are learned by gradient descent together
with the remaining network weights.
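A noisy layer can be sketched as follows: each weight has a learnable mean μ and noise scale σ, and fresh Gaussian noise perturbs the weights on every forward pass. This is a simplified single-output sketch (real implementations use factorized noise and learn μ and σ by gradient descent):

```python
import random

def noisy_linear(x, mu, sigma, rng):
    """y = sum_i (mu_i + sigma_i * eps_i) * x_i, with eps_i ~ N(0, 1).
    mu and sigma are learned parameters; eps is resampled on every
    forward pass, so the induced policy is stochastic and explores
    without a separate epsilon-greedy mechanism."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return sum((m + s * e) * xi for m, s, e, xi in zip(mu, sigma, eps, x))

rng = random.Random(42)
x = [1.0, 2.0]
mu, sigma = [0.5, -0.25], [0.1, 0.1]
y1 = noisy_linear(x, mu, sigma, rng)
y2 = noisy_linear(x, mu, sigma, rng)   # different noise, different output
```

As training progresses, the network can shrink σ on weights where noise hurts, so the amount of exploration is itself learned.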
Network Architecture
• The network receives the change in game score as a number from the emulator;
these reward signals are clipped to {−1, 0, +1} to indicate decrease, no
change, or improvement of the score. In addition, the error term of the update is
clipped, which corresponds to using the Huber loss.
• To reduce computational demands further, frame skipping is employed.
• Only one in every 3–4 frames was used, depending on the game. To take game
history into account, the network takes as input the last four resulting frames.
• RMSprop is used as the optimizer.
• A variant of 𝜖-greedy is used that starts with an 𝜖 of 1.0 (fully exploring) and
anneals down to 0.1 (90% exploiting).
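The annealing schedule just described can be sketched as a linear interpolation. The one-million-step horizon below is the figure commonly reported for DQN; treat the constants as illustrative:

```python
def epsilon(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from eps_start down to eps_end, then hold.
    Early training is fully exploring; later training is mostly exploiting."""
    frac = min(step / anneal_steps, 1.0)   # clamp so epsilon never undershoots
    return eps_start + frac * (eps_end - eps_start)

# anneals from 1.0 at step 0 toward 0.1 after one million steps, then stays there
schedule = [epsilon(t) for t in (0, 500_000, 2_000_000)]
```

With probability ε a random action is taken, otherwise the greedy action according to the current Q-network.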
Benchmarking Atari
Summary
BOOK QUESTIONS
• What is Gym? Gym is a toolkit for developing and comparing reinforcement learning
algorithms.
• What are the Stable Baselines? Stable Baselines are a set of improved implementations
of reinforcement learning algorithms based on OpenAI Baselines.
• The loss function of DQN uses the Q-function as target. What is a consequence? A
consequence is that the Q-values can be overestimated, leading to instability in training.
• Name one simple exploration/exploitation method. One simple method is the ε-greedy
algorithm.
• What is bootstrapping? Bootstrapping is a method where the agent updates its value
estimates based on other estimated values rather than waiting for the final outcome.
• Describe the architecture of the neural network in DQN. The neural network in DQN
has convolutional layers followed by fully connected layers, which output Q-values for each
action.
• Why is deep reinforcement learning more susceptible to unstable learning than deep
supervised learning? Deep reinforcement learning is more susceptible because its training
targets depend on the parameters being optimized, creating a moving target problem.
• What is the deadly triad? The deadly triad consists of function approximation,
bootstrapping, and off-policy learning, which together can lead to instability and divergence.
• What is the role of the replay buffer? The replay buffer stores past experiences to break
the correlation between consecutive samples and stabilize training.
• How can correlation between states lead to local minima? Correlation between states
can lead to local minima by causing the agent to get stuck in suboptimal policies due to
correlated sampling errors.
• Why should the coverage of the state space be sufficient? Sufficient coverage ensures
that the learning algorithm can explore and learn about all possible states, leading to a more
robust and optimal policy.
• What happens when deep reinforcement learning algorithms do not converge? When
they do not converge, the agent fails to learn an optimal policy, leading to poor performance
or instability.
• How large is the state space of chess estimated to be? 10^47, 10^170, or 10^1685? The
state space of chess is estimated to be 10^47.
• How large is the state space of Go estimated to be? 10^47, 10^170, or 10^1685? The
state space of Go is estimated to be 10^170.
• How large is the state space of StarCraft estimated to be? 10^47, 10^170, or
10^1685? The state space of StarCraft is estimated to be 10^1685.
• What does the rainbow in the Rainbow paper stand for, and what is the main
message? The rainbow stands for the combination of several improvements to DQN: Double
DQN, Prioritized Replay, Dueling Networks, Multi-step Learning, Distributional RL
(categorical DQN), and Noisy Nets. The main message is that these enhancements work well
together to significantly improve performance.
• Mention three Rainbow improvements that are added to DQN. Three Rainbow
improvements are Double DQN, Prioritized Experience Replay, and Dueling Network
Architectures.