Deep Reinforcement Learning: A Brief Survey
Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and the asynchronous advantage actor-critic (A3C). In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.
Introduction
One of the primary goals of the field of AI is to produce fully autonomous agents that interact with their environments to learn optimal behaviors, improving over time through trial and error. Crafting AI systems that are responsive and can effectively learn has been a long-standing challenge, ranging from robots, which can sense and react to the world around them, to purely software-based agents, which can interact with natural language and multimedia. A principled mathematical framework for experience-driven autonomous learning is RL [78]. Although RL had some successes in the past [31], [53], [74], [81], previous approaches lacked scalability and were inherently limited to fairly low-dimensional problems. These limitations exist because RL algorithms share the same complexity issues as other algorithms: memory complexity, computational complexity, and, in the case of machine-learning algorithms, sample complexity [76]. What we have witnessed in recent years—the rise of deep learning, relying on the powerful function approximation and representation learning properties of deep neural networks—has provided us with new tools to overcome these problems.
The advent of deep learning has had a significant impact on many areas in machine learning, dramatically improving the state of the art in tasks such as object detection, speech recognition, and language translation [39]. The most important property of deep learning is that deep neural networks can automatically find compact low-dimensional representations (features) of high-dimensional data (e.g., images, text, and audio). Through crafting inductive biases into neural network architectures, particularly that of hierarchical representations, machine-learning practitioners have made effective progress in addressing the curse of dimensionality [7]. Deep learning has similarly accelerated progress in RL, with the use of deep-learning algorithms within RL defining the field of DRL. The aim of this survey is to cover both seminal and recent developments in DRL, conveying the innovative ways in which neural networks can be used to bring us closer toward developing autonomous agents. For a more comprehensive survey of recent efforts in DRL, we refer readers to the overview by Li [43].

Deep learning enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. Among recent work in the field of DRL, there have been two outstanding success stories. The first, kickstarting the revolution in DRL, was the development of an algorithm that could learn to play a range of Atari 2600 video games at a superhuman level, directly from image pixels [47]. Providing solutions for the instability of function approximation techniques in RL, this work was the first to convincingly demonstrate that RL agents could be trained on raw, high-dimensional observations, solely based on a reward signal. The second standout success was the development of a hybrid DRL system, AlphaGo, that defeated a human world champion in Go [73], paralleling the historic achievement of IBM's Deep Blue in chess two decades earlier [9]. Unlike the handcrafted rules that have dominated chess-playing systems, AlphaGo comprised neural networks that were trained using supervised learning and RL, in combination with a traditional heuristic search algorithm.

DRL algorithms have already been applied to a wide range of problems, such as robotics, where control policies for robots can now be learned directly from camera inputs in the real world [41], [42], succeeding controllers that used to be hand-engineered or learned from low-dimensional features of the robot's state. In Figure 1, we showcase just some of the domains to which DRL has been applied.
Figure 1. A range of visual RL domains. (a) Three classic Atari 2600 video games, Enduro, Freeway, and Seaquest, from the Arcade Learning Environment (ALE) [5]. Due to the range of supported games that vary in genre, visuals, and difficulty, the ALE has become a standard test bed for DRL algorithms [20], [47], [48], [55], [70], [75], [92]. The ALE is one of several benchmarks that are now being used to standardize evaluation in RL. (b) The TORCS car racing simulator, which has been used to test DRL algorithms that can output continuous actions [33], [44], [48] (as the games from the ALE only support discrete actions). (c) Utilizing the potentially unlimited amount of training data that can be amassed in robotic simulators, several methods aim to transfer knowledge from the simulator to the real world [11], [64], [84]. (d) Two of the four robotic tasks designed by Levine et al. [41]: screwing on a bottle cap and placing a shaped block in the correct hole. Levine et al. [41] were able to train visuomotor policies in an end-to-end fashion, showing that visual servoing could be learned directly from raw camera inputs by using deep neural networks. (e) A real room, in which a wheeled robot trained to navigate the building is given a visual cue as input and must find the corresponding location [100]. (f) A natural image being captioned by a neural network that uses RL to choose where to look [99]. (b)–(f) reproduced from [41], [44], [84], [99], and [100], respectively.
Figure 2. The perception-action-learning loop. At time t, the agent receives state $s_t$ from the environment. The agent uses its policy to choose an action $a_t$. Once the action is executed, the environment transitions a step, providing the next state, $s_{t+1}$, as well as feedback in the form of a reward, $r_{t+1}$. The agent uses knowledge of state transitions, of the form $(s_t, a_t, s_{t+1}, r_{t+1})$, to learn and improve its policy.
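To make the loop concrete, the following sketch steps an agent through one episode. The reset/step environment interface follows the common Gym-style convention, and the RandomAgent and run_episode names are illustrative assumptions for this example rather than anything specified in the survey; a learning agent would replace act and learn with its own policy and update rule.

```python
# A minimal sketch of the perception-action-learning loop, assuming a
# Gym-style environment (reset/step) and a placeholder random policy.
import random

class RandomAgent:
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        # A real agent would query its policy pi(a | s) here.
        return random.randrange(self.n_actions)

    def learn(self, state, action, next_state, reward):
        # A real agent would update its policy from the transition
        # (s_t, a_t, s_{t+1}, r_{t+1}); the random agent ignores it.
        pass

def run_episode(env, agent, max_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # choose a_t from s_t
        next_state, reward, done, _ = env.step(action)  # environment transitions a step
        agent.learn(state, action, next_state, reward)  # improve the policy
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```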
Figure 3. Two dimensions of RL algorithms, based on the backups used to learn or construct a policy. At the extremes of these dimensions are (a) dynamic programming, (b) exhaustive search, (c) one-step TD learning, and (d) Monte Carlo approaches. Bootstrapping extends from (c) one-step TD learning to n-step TD learning methods [78], with (d) pure Monte Carlo approaches not relying on bootstrapping at all. Another possible dimension of variation is (c) and (d) choosing to sample actions versus (a) and (b) taking the expectation over all choices. (Figure recreated based on [78].)

Policy gradients
Gradients can provide a strong learning signal as to how to improve a parameterized policy. However, to compute the expected return (1) we need to average over plausible trajectories induced by the current policy parameterization. This averaging requires either deterministic approximations (e.g., linearization) or stochastic approximations via sampling [12]. Deterministic approximations can only be applied in a model-based setting where a model of the underlying transition dynamics is available. In the more common model-free RL setting, a Monte Carlo estimate of the expected return is used instead.
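As a concrete illustration of the sampling-based route, the sketch below forms a Monte Carlo score-function (REINFORCE-style) estimate of the policy gradient for a simple linear-softmax policy over discrete actions. The data layout (trajectories as lists of feature/action/return tuples) and all function names are assumptions made for the example, not notation from the survey.

```python
# A Monte Carlo (score-function/REINFORCE) estimate of the policy gradient
# for a linear-softmax policy, written in plain NumPy.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_estimate(theta, trajectories):
    """Average of grad log pi(a|s) * return over sampled trajectories.

    theta: array of shape (n_actions, n_features).
    trajectories: list of trajectories, each a list of
        (state_features, action, return_to_go) tuples collected under
        the current policy.
    """
    grad = np.zeros_like(theta)
    for trajectory in trajectories:
        for features, action, ret in trajectory:
            probs = softmax(theta @ features)
            one_hot = np.zeros(len(probs))
            one_hot[action] = 1.0
            # grad of log softmax w.r.t. theta, weighted by the sampled return
            grad += np.outer(one_hot - probs, features) * ret
    return grad / max(len(trajectories), 1)

# Gradient ascent on the expected return:
# theta += learning_rate * policy_gradient_estimate(theta, trajectories)
```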
Figure 5. The DQN [47]. The network takes the state—a stack of gray-scale frames from the video game—and processes it with convolutional and fully connected layers, with ReLU nonlinearities between each layer. At the final layer, the network outputs a discrete action, which corresponds to one of the possible control inputs for the game. Given the current state and chosen action, the game returns a new score. The DQN uses the reward—the difference between the new score and the previous one—to learn from its decision. More precisely, the reward is used to update its estimate of Q, and the error between its previous estimate and its new estimate is backpropagated through the network.
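The PyTorch sketch below mirrors this description: a convolutional trunk over stacked gray-scale frames with ReLU nonlinearities, one Q-value output per discrete action, and a one-step TD error driven by the reward. The specific layer sizes follow a commonly cited DQN configuration for 84x84 inputs and should be read as an assumption for illustration, not as the authoritative architecture of [47].

```python
# A sketch of a DQN-style Q-network and its one-step TD error, assuming
# 84x84 gray-scale frames and batched tensors.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # assumes 84x84 input frames
            nn.Linear(512, n_actions),              # one Q-value per discrete action
        )

    def forward(self, state):
        return self.head(self.features(state))

def td_error(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """One-step TD error: (r + gamma * max_a' Q_target(s', a')) - Q(s, a)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return target - q_sa
```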
Q-function modifications
Considering that one of the key components of the DQN is a function approximator for the Q-function, it can benefit from fundamental advances in RL. In [86], van Hasselt showed that the single estimator used in the Q-learning update rule overestimates the expected return due to the use of the maximum action value as an approximation of the maximum expected action value. Double-Q learning provides a better estimate through the use of a double estimator [86]. While double-Q learning requires an additional function to be learned, later work proposed using the already available target network from the DQN algorithm, resulting in significantly better results with only a small change in the update step [87].
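A minimal sketch of that small change, under the assumption of the hypothetical QNetwork from the earlier sketch: the online network selects the greedy next action, while the target network evaluates it, which reduces the overestimation that comes from maximizing over noisy estimates.

```python
# Double DQN target: decouple action selection (online network) from
# action evaluation (target network).
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        # Select the greedy next action with the online network...
        best_actions = q_net(s_next).argmax(dim=1, keepdim=True)
        # ...but evaluate it with the (periodically copied) target network.
        next_q = target_net(s_next).gather(1, best_actions).squeeze(1)
    return r + gamma * (1 - done) * next_q
```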
Figure 6. A saliency map of a trained DQN [47] playing Space Invaders [5]. By backpropagating the training signal to the image space, it is possible to see what a neural-network-based agent is attending to. In this frame, the most salient points—shown with the red overlay—are the laser that the agent recently fired and also the enemy that it anticipates hitting in a few time steps.

Yet another way to adjust the DQN architecture is to decompose the Q-function into meaningful functions, such as constructing $Q^\pi$ by adding together separate layers that compute the state-value function $V^\pi$ and advantage function $A^\pi$ [92]. Rather than having to come up with accurate Q-values for all actions, the duelling DQN [92] benefits from a single baseline for the state in the form of $V^\pi$ and easier-to-learn relative values in the form of $A^\pi$. The combination of the duelling DQN with prioritized experience replay [67] is one of the state-of-the-art techniques in discrete action settings. Further insight into the properties of $A^\pi$ by Gu et al. [19] led them to modify the DQN with a convex advantage layer that extended the algorithm to work over sets of continuous actions, creating the normalized advantage function (NAF) algorithm. Benefiting from experience replay, target networks, and advantage updates, NAF is one of several state-of-the-art techniques in continuous control problems [19].
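The duelling decomposition can be sketched as a network head with separate value and advantage streams that are recombined into Q-values; subtracting the mean advantage is one common way to keep the decomposition identifiable. The module name and layer sizes are illustrative, and the head is assumed to sit on top of a convolutional trunk like the one in the earlier DQN sketch.

```python
# A duelling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
import torch
import torch.nn as nn

class DuellingHead(nn.Module):
    def __init__(self, n_features, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                    # V(s), shape (batch, 1)
        a = self.advantage(features)                # A(s, a), shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s, a)
```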
Policy search The previously discussed REINFORCE rule [97] allows neural
Policy search methods aim to directly find policies by means of gradient-free or gradient-based methods. Prior to the current surge of interest in DRL, several successful methods in DRL eschewed the commonly used backpropagation algorithm in favor of evolutionary algorithms [17], [33], which are gradient-free policy search algorithms. Evolutionary methods rely on evaluating the performance of a population of agents. Hence, they are expensive for large populations or agents with many parameters. However, as black-box optimization methods, they can be used to optimize arbitrary, nondifferentiable models and naturally allow for more exploration in the parameter space. In combination with a compressed representation of neural network weights, evolutionary algorithms can even be used to train large networks; such a technique resulted in the first deep neural network to learn an RL task, straight from high-dimensional visual inputs [33]. Recent work has reignited interest in evolutionary methods for RL as they can potentially be distributed at larger scales than techniques that rely on gradients [65].

Backpropagation through stochastic functions
The workhorse of DRL, however, remains backpropagation. The previously discussed REINFORCE rule [97] allows neural networks to learn stochastic policies in a task-dependent manner, such as deciding where to look in an image to track [69] or caption [99] objects. In these cases, the stochastic variable would determine the coordinates of a small crop of the image and hence reduce the amount of computation needed. This usage of RL to make discrete, stochastic decisions over inputs is known in the deep-learning literature as hard attention and is one of the more compelling uses of basic policy search methods in recent years, having many applications outside of traditional RL domains.

Compounding errors
Searching directly for a policy represented by a neural network with very many parameters can be difficult and suffer from severe local minima. One way around this is to use guided policy search (GPS), which takes a few sequences of actions from another controller (which could be constructed using a separate method, such