
Deep Learning for Visual Understanding

Deep Reinforcement Learning
A brief survey

Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath

Digital Object Identifier 10.1109/MSP.2017.2743240
Date of publication: 13 November 2017

Deep reinforcement learning (DRL) is poised to revolutionize the field of artificial intelligence (AI) and represents a step toward building autonomous systems with a higher-level understanding of the visual world. Currently, deep learning is enabling reinforcement learning (RL) to scale to problems that were previously intractable, such as learning to play video games directly from pixels. DRL algorithms are also applied to robotics, allowing control policies for robots to be learned directly from camera inputs in the real world. In this survey, we begin with an introduction to the general field of RL, then progress to the main streams of value-based and policy-based methods. Our survey will cover central algorithms in deep RL, including the deep Q-network (DQN), trust region policy optimization (TRPO), and asynchronous advantage actor critic. In parallel, we highlight the unique advantages of deep neural networks, focusing on visual understanding via RL. To conclude, we describe several current areas of research within the field.

Introduction
One of the primary goals of the field of AI is to produce fully autonomous agents that interact with their environments to learn optimal behaviors, improving over time through trial and error. Crafting AI systems that are responsive and can effectively learn has been a long-standing challenge, ranging from robots, which can sense and react to the world around them, to purely software-based agents, which can interact with natural language and multimedia. A principled mathematical framework for experience-driven autonomous learning is RL [78]. Although RL had some successes in the past [31], [53], [74], [81], previous approaches lacked scalability and were inherently limited to fairly low-dimensional problems. These limitations exist because RL algorithms share the same complexity issues as other algorithms: memory complexity, computational complexity, and, in the case of machine-learning algorithms, sample complexity [76]. What we have witnessed in recent years—the rise of deep learning, relying on the powerful function approximation and representation learning properties of deep neural networks—has provided us with new tools to overcome these problems.

The advent of deep learning has had a significant impact on many areas in machine learning, dramatically improving the state of the art in tasks such as object detection, speech recognition, and language translation [39]. The most important property of deep learning is that deep neural networks can automatically find compact low-dimensional representations (features) of high-dimensional data (e.g., images, text, and audio). Through crafting inductive biases into neural network architectures, particularly that of hierarchical representations, machine-learning practitioners have made effective progress in addressing the curse of dimensionality [7]. Deep learning has similarly accelerated progress in RL, with the use of deep-learning algorithms within RL defining the field of DRL. The aim of this survey is to cover both seminal and recent developments in DRL, conveying the innovative ways in which neural networks can be used to bring us closer toward developing autonomous agents. For a more comprehensive survey of recent efforts in DRL, we refer readers to the overview by Li [43].

Deep learning enables RL to scale to decision-making problems that were previously intractable, i.e., settings with high-dimensional state and action spaces. Among recent work in the field of DRL, there have been two outstanding success stories. The first, kickstarting the revolution in DRL, was the development of an algorithm that could learn to play a range of Atari 2600 video games at a superhuman level, directly from image pixels [47]. Providing solutions for the instability of function approximation techniques in RL, this work was the first to convincingly demonstrate that RL agents could be trained on raw, high-dimensional observations, solely based on a reward signal. The second standout success was the development of a hybrid DRL system, AlphaGo, that defeated a human world champion in Go [73], paralleling the historic achievement of IBM's Deep Blue in chess two decades earlier [9]. Unlike the handcrafted rules that have dominated chess-playing systems, AlphaGo comprised neural networks that were trained using supervised learning and RL, in combination with a traditional heuristic search algorithm.

DRL algorithms have already been applied to a wide range of problems, such as robotics, where control policies for robots can now be learned directly from camera inputs in the real world [41], [42], succeeding controllers that used to be hand-engineered or learned from low-dimensional features of the robot's state. In Figure 1, we showcase just some of the domains that DRL has been applied to, ranging from playing video games [47] to indoor navigation [100].


Figure 1. A range of visual RL domains. (a) Three classic Atari 2600 video games, Enduro, Freeway, and Seaquest, from the Arcade Learning Environment (ALE) [5]. Due to the range of supported games that vary in genre, visuals, and difficulty, the ALE has become a standard test bed for DRL algorithms [20], [47], [48], [55], [70], [75], [92]. The ALE is one of several benchmarks that are now being used to standardize evaluation in RL. (b) The TORCS car racing simulator, which has been used to test DRL algorithms that can output continuous actions [33], [44], [48] (as the games from the ALE only support discrete actions). (c) Utilizing the potentially unlimited amount of training data that can be amassed in robotic simulators, several methods aim to transfer knowledge from the simulator to the real world [11], [64], [84]. (d) Two of the four robotic tasks designed by Levine et al. [41]: screwing on a bottle cap and placing a shaped block in the correct hole. Levine et al. [41] were able to train visuomotor policies in an end-to-end fashion, showing that visual servoing could be learned directly from raw camera inputs by using deep neural networks. (e) A real room, in which a wheeled robot trained to navigate the building is given a visual cue as input and must find the corresponding location [100]. (f) A natural image being captioned by a neural network that uses RL to choose where to look [99]. (b)–(f) reproduced from [41], [44], [84], [99], and [100], respectively.

Reward-driven behavior
Before examining the contributions of deep neural networks to RL, we will introduce the field of RL in general. The essence of RL is learning through interaction. An RL agent interacts with its environment and, upon observing the consequences of its actions, can learn to alter its own behavior in response to rewards received. This paradigm of trial-and-error learning has its roots in behaviorist psychology and is one of the main foundations of RL [78]. The other key influence on RL is optimal control, which has lent the mathematical formalisms (most notably dynamic programming [6]) that underpin the field.

In the RL setup, an autonomous agent, controlled by a machine-learning algorithm, observes a state s_t from its environment at time step t. The agent interacts with the environment by taking an action a_t in state s_t. When the agent takes an action, the environment and the agent transition to a new state, s_{t+1}, based on the current state and the chosen action. The state is a sufficient statistic of the environment and thereby comprises all the necessary information for the agent to take the best action, which can include parts of the agent such as the position of its actuators and sensors. In the optimal control literature, states and actions are often denoted by x_t and u_t, respectively.

The best sequence of actions is determined by the rewards provided by the environment. Every time the environment transitions to a new state, it also provides a scalar reward r_{t+1} to the agent as feedback. The goal of the agent is to learn a policy (control strategy) π that maximizes the expected return (cumulative, discounted reward). Given a state, a policy returns an action to perform; an optimal policy is any policy that maximizes the expected return in the environment. In this respect, RL aims to solve the same problem as optimal control. However, the challenge in RL is that the agent needs to learn about the consequences of actions in the environment by trial and error, as, unlike in optimal control, a model of the state transition dynamics is not available to the agent. Every interaction with the environment yields information, which the agent uses to update its knowledge. This perception-action-learning loop is illustrated in Figure 2.

Markov decision processes
Formally, RL can be described as a Markov decision process (MDP), which consists of
■ a set of states S, plus a distribution of starting states p(s_0)
■ a set of actions A
■ transition dynamics T(s_{t+1} | s_t, a_t) that map a state-action pair at time t onto a distribution of states at time t + 1
■ an immediate/instantaneous reward function R(s_t, a_t, s_{t+1})
■ a discount factor γ ∈ [0, 1], where lower values place more emphasis on immediate rewards.
In general, the policy π is a mapping from states to a probability distribution over actions, π: S → p(A = a | S). If the MDP is episodic, i.e., the state is reset after each episode of length T, then the sequence of states, actions, and rewards in an episode constitutes a trajectory or rollout of the policy. Every rollout of a policy accumulates rewards from the environment, resulting in the return R = Σ_{t=0}^{T−1} γ^t r_{t+1}. The goal of RL is to find an optimal policy, π*, that achieves the maximum expected return from all states:

π* = argmax_π E[R | π].  (1)


Figure 2. The perception-action-learning loop. At time t, the agent receives state s_t from the environment. The agent uses its policy to choose an action a_t. Once the action is executed, the environment transitions a step, providing the next state, s_{t+1}, as well as feedback in the form of a reward, r_{t+1}. The agent uses knowledge of state transitions, of the form (s_t, a_t, s_{t+1}, r_{t+1}), to learn and improve its policy.

It is also possible to consider nonepisodic MDPs, where T = ∞. In this situation, γ < 1 prevents an infinite sum of rewards from being accumulated. Furthermore, methods that rely on complete trajectories are no longer applicable, but those that use a finite set of transitions still are.

A key concept underlying RL is the Markov property—only the current state affects the next state, or, in other words, the future is conditionally independent of the past given the present state. This means that any decisions made at s_t can be based solely on the current state, rather than on the full history {s_0, s_1, …, s_{t−1}}. Although this assumption is held by the majority of RL algorithms, it is somewhat unrealistic, as it requires the states to be fully observable. A generalization of MDPs are partially observable MDPs (POMDPs), in which the agent receives an observation o_t ∈ Ω, where the distribution of the observation p(o_{t+1} | s_{t+1}, a_t) is dependent on the current state and the previous action [27]. In a control and signal processing context, the observation would be described by a measurement/observation mapping in a state-space model that depends on the current state and the previously applied action.

POMDP algorithms typically maintain a belief over the current state given the previous belief state, the action taken, and the current observation. A more common approach in deep learning is to utilize recurrent neural networks (RNNs) [20], [21], [48], [96], which, unlike feedforward neural networks, are dynamical systems.

Challenges in RL
It is instructive to emphasize some challenges faced in RL:
■ The optimal policy must be inferred by trial-and-error interaction with the environment. The only learning signal the agent receives is the reward.
■ The observations of the agent depend on its actions and can contain strong temporal correlations.
■ Agents must deal with long-range time dependencies: often the consequences of an action only materialize after many transitions of the environment. This is known as the (temporal) credit assignment problem [78].
We will illustrate these challenges in the context of an indoor robotic visual navigation task: if the goal location is specified, we may be able to estimate the distance remaining (and use it as a reward signal), but it is unlikely that we will know exactly what series of actions the robot needs to take to reach the goal. As the robot must choose where to go as it navigates the building, its decisions influence which rooms it sees and, hence, the statistics of the visual sequence captured. Finally, after navigating several junctions, the robot may find itself in a dead end. There is a range of problems, from learning the consequences of actions to balancing exploration versus exploitation, but ultimately these can all be addressed formally within the framework of RL.

RL algorithms
So far, we have introduced the key formalism used in RL, the MDP, and briefly noted some challenges in RL. In the following, we will distinguish between different classes of RL algorithms. There are two main approaches to solving RL problems: methods based on value functions and methods based on policy search. There is also a hybrid actor-critic approach that employs both value functions and policy search. Next, we will explain these approaches and other useful concepts for solving RL problems.

Value functions
Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function V^π(s) is the expected return when starting in state s and following π subsequently:

V^π(s) = E[R | s, π].  (2)

The optimal policy, π*, has a corresponding state-value function V*(s), and vice versa; the optimal state-value function can be defined as

V*(s) = max_π V^π(s)  ∀ s ∈ S.  (3)

If we had V*(s) available, the optimal policy could be retrieved by choosing among all actions available at s_t and picking the action a that maximizes E_{s_{t+1} ∼ T(s_{t+1} | s_t, a)}[V*(s_{t+1})].

In the RL setting, the transition dynamics T are unavailable. Therefore, we construct another function, the state-action value or quality function Q^π(s, a), which is similar to V^π, except that the initial action a is provided and π is only followed from the succeeding state onward:

Q^π(s, a) = E[R | s, a, π].  (4)

The best policy, given Q^π(s, a), can be found by choosing a greedily at every state: argmax_a Q^π(s, a). Under this policy, we can also define V^π(s) by maximizing Q^π(s, a): V^π(s) = max_a Q^π(s, a).

Dynamic programming
To actually learn Q^π, we exploit the Markov property and define the function as a Bellman equation [6], which has the following recursive form:

Q^π(s_t, a_t) = E_{s_{t+1}}[r_{t+1} + γ Q^π(s_{t+1}, π(s_{t+1}))].  (5)

This means that Q^π can be improved by bootstrapping, i.e., we can use the current values of our estimate of Q^π to improve our estimate. This is the foundation of Q-learning [94] and the state-action-reward-state-action (SARSA) algorithm [62]:

Q^π(s_t, a_t) ← Q^π(s_t, a_t) + α δ,  (6)

where α is the learning rate and δ = Y − Q^π(s_t, a_t) the temporal difference (TD) error; here, Y is a target as in a standard regression problem. SARSA, an on-policy learning algorithm, is used to improve the estimate of Q^π by using transitions generated by the behavioral policy (the policy derived from Q^π), which results in setting Y = r_t + γ Q^π(s_{t+1}, a_{t+1}).

Q-learning is off-policy, as Q^π is instead updated by transitions that were not necessarily generated by the derived policy. Instead, Q-learning uses Y = r_t + γ max_a Q^π(s_{t+1}, a), which directly approximates Q*.
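As a concrete illustration of (6) and the two targets above, here is a minimal tabular sketch (ours, not the article's code); `Q` is assumed to be a dictionary of dictionaries mapping states and actions to values, and the SARSA and Q-learning variants differ only in how the target Y is formed.

```python
def td_update(Q, s, a, r, s_next, a_next=None, alpha=0.1, gamma=0.99):
    """One TD update of a tabular Q-function: Q(s,a) <- Q(s,a) + alpha * delta."""
    if a_next is None:
        # Q-learning (off-policy): Y = r + gamma * max_a' Q(s', a')
        target = r + gamma * max(Q[s_next].values())
    else:
        # SARSA (on-policy): Y = r + gamma * Q(s', a'), with a' drawn from the behavioral policy
        target = r + gamma * Q[s_next][a_next]
    delta = target - Q[s][a]          # TD error
    Q[s][a] += alpha * delta
    return delta
```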
To find Q* from an arbitrary Q^π, we use generalized policy iteration, where policy iteration consists of policy evaluation and policy improvement. Policy evaluation improves the estimate of the value function, which can be achieved by minimizing TD errors from trajectories experienced by following the policy. As the estimate improves, the policy can naturally be improved by choosing actions greedily based on the updated value function. Instead of performing these steps separately to convergence (as in policy iteration), generalized policy iteration allows for interleaving the steps, such that progress can be made more rapidly.

Sampling
Instead of bootstrapping value functions using dynamic programming methods, Monte Carlo methods estimate the expected return (2) from a state by averaging the return from multiple rollouts of a policy. Because of this, pure Monte Carlo methods can also be applied in non-Markovian environments. On the other hand, they can only be used in episodic MDPs, as a rollout has to terminate for the return to be calculated. It is possible to get the best of both methods by combining TD learning and Monte Carlo policy evaluation, as is done in the TD(λ) algorithm [78]. Similarly to the discount factor, the λ in TD(λ) is used to interpolate between Monte Carlo evaluation and bootstrapping. As demonstrated in Figure 3, this results in an entire spectrum of RL methods based around the amount of sampling utilized.

Figure 3. Two dimensions of RL algorithms based on the backups used to learn or construct a policy. At the extremes of these dimensions are (a) dynamic programming, (b) exhaustive search, (c) one-step TD learning, and (d) Monte Carlo approaches. Bootstrapping extends from (c) one-step TD learning to n-step TD learning methods [78], with (d) pure Monte Carlo approaches not relying on bootstrapping at all. Another possible dimension of variation is (c) and (d) choosing to sample actions versus (a) and (b) taking the expectation over all choices. (Figure recreated based on [78].)

Another major value-function-based method relies on learning the advantage function A^π(s, a) [3]. Unlike producing absolute state-action values, as with Q^π, A^π instead represents relative state-action values. Learning relative values is akin to removing a baseline or average level of a signal; more intuitively, it is easier to learn that one action has better consequences than another than it is to learn the actual return from taking the action. A^π represents a relative advantage of actions through the simple relationship A^π = Q^π − V^π and is also closely related to the baseline method of variance reduction within gradient-based policy search methods [97]. The idea of advantage updates has been utilized in many recent DRL algorithms [19], [48], [71], [92].

Policy search
Policy search methods do not need to maintain a value function model but directly search for an optimal policy π*. Typically, a parameterized policy π_θ is chosen, whose parameters are updated to maximize the expected return E[R | θ] using either gradient-based or gradient-free optimization [12]. Neural networks that encode policies have been successfully trained using both gradient-free [17], [33] and gradient-based [22], [41], [44], [70], [71], [96], [97] methods. Gradient-free optimization can effectively cover low-dimensional parameter spaces, but, despite some successes in applying them to large networks [33], gradient-based training remains the method of choice for most DRL algorithms, being more sample efficient when policies possess a large number of parameters.

When constructing the policy directly, it is common to output parameters for a probability distribution; for continuous actions, this could be the mean and standard deviations of Gaussian distributions, while for discrete actions this could be the individual probabilities of a multinomial distribution. The result is a stochastic policy from which we can directly sample actions. With gradient-free methods, finding better policies requires a heuristic search across a predefined class of models. Methods such as evolution strategies essentially perform hill climbing in a subspace of policies [65], while more complex methods, such as compressed network search, impose additional inductive biases [33]. Perhaps the greatest advantage of gradient-free policy search is that it can also optimize nondifferentiable policies.
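A minimal sketch (ours) of the parameterization described above: a toy network outputs the mean and log standard deviation of a diagonal Gaussian from which a continuous action is sampled. Layer sizes, the state-independent log standard deviation, and the absence of biases are illustrative simplifications, not a prescribed architecture.

```python
import numpy as np

class GaussianPolicy:
    """Toy stochastic policy: state -> diagonal Gaussian over continuous actions."""
    def __init__(self, state_dim, action_dim, hidden=64):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.W_mu = rng.normal(0.0, 0.1, (hidden, action_dim))
        self.log_std = np.zeros(action_dim)            # state-independent log std, a common choice

    def sample(self, s):
        h = np.tanh(s @ self.W1)                       # hidden features
        mu = h @ self.W_mu                             # mean of the action distribution
        std = np.exp(self.log_std)
        return mu + std * np.random.randn(*mu.shape)   # a ~ N(mu, std^2)
```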

Policy gradients
Gradients can provide a strong learning signal as to how to improve a parameterized policy. However, to compute the expected return (1) we need to average over plausible trajectories induced by the current policy parameterization. This averaging requires either deterministic approximations (e.g., linearization) or stochastic approximations via sampling [12]. Deterministic approximations can only be applied in a model-based setting where a model of the underlying transition dynamics is available. In the more common model-free RL setting, a Monte Carlo estimate of the expected return is determined. For gradient-based learning, this Monte Carlo approximation poses a challenge since gradients cannot pass through these samples of a stochastic function. Therefore, we turn to an estimator of the gradient, known in RL as the REINFORCE rule [97]. Intuitively, gradient ascent using the estimator increases the log probability of the sampled action, weighted by the return. More formally, the REINFORCE rule can be used to compute the gradient of an expectation over a function f of a random variable X with respect to parameters θ:

∇_θ E_X[f(X; θ)] = E_X[f(X; θ) ∇_θ log p(X)].  (7)

As this computation relies on the empirical return of a trajectory, the resulting gradients possess a high variance. By introducing unbiased estimates that are less noisy, it is possible to reduce the variance. The general methodology for performing this is to subtract a baseline, which means weighting updates by an advantage rather than the pure return. The simplest baseline is the average return taken over several episodes [97], but there are many more options available [71].
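The sketch below (ours) shows how (7) is typically used in the policy-gradient setting: the gradient of log π_θ(a_t | s_t) for each sampled action is weighted by the trajectory return minus a baseline and summed. The `grad_log_prob` function is an assumption, standing in for whatever the policy implementation provides.

```python
def reinforce_gradient(trajectory, grad_log_prob, baseline=0.0):
    """Monte Carlo policy-gradient estimate for one sampled trajectory:
       g ~= sum_t grad_theta log pi(a_t | s_t) * (R - b)."""
    states, actions, rewards = trajectory
    R = sum(rewards)                                   # empirical return of the sampled trajectory
    g = None
    for s, a in zip(states, actions):
        term = grad_log_prob(s, a) * (R - baseline)    # score function weighted by (return - baseline)
        g = term if g is None else g + term
    return g
```

Subtracting a nonzero baseline (e.g., the average return over several episodes) leaves the estimate unbiased while reducing its variance, which is exactly the motivation given above.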
Actor-critic methods
It is possible to combine value functions with an explicit representation of the policy, resulting in actor-critic methods, as shown in Figure 4. The "actor" (policy) learns by using feedback from the "critic" (value function). In doing so, these methods trade off variance reduction of policy gradients with bias introduction from value function methods [32], [71].

Actor-critic methods use the value function as a baseline for policy gradients, such that the only fundamental difference between actor-critic methods and other baseline methods is that actor-critic methods utilize a learned value function. For this reason, we will later discuss actor-critic methods as a subset of policy gradient methods.

Figure 4. The actor-critic setup. The actor (policy) receives a state from the environment and chooses an action to perform. At the same time, the critic (value function) receives the state and reward resulting from the previous interaction. The critic uses the TD error calculated from this information to update itself and the actor. (Figure recreated based on [78].)

Planning and learning
Given a model of the environment, it is possible to use dynamic programming over all possible actions [Figure 3(a)], sample trajectories for heuristic search (as was done by AlphaGo [73]), or even perform an exhaustive search [Figure 3(b)]. Sutton and Barto [78] define planning as any method that utilizes a model to produce or improve a policy. This includes distribution models, which include T and R, and sample models, from which only samples of transitions can be drawn.

In RL, we focus on learning without access to the underlying model of the environment. However, interactions with the environment could be used to learn value functions, policies, and also a model. Model-free RL methods learn directly from interactions with the environment, but model-based RL methods can simulate transitions using the learned model, resulting in increased sample efficiency. This is particularly important in domains where each interaction with the environment is expensive. However, learning a model introduces extra complexities, and there is always the danger of suffering from model errors, which in turn affects the learned policy. Although deep neural networks can potentially produce very complex and rich models [14], [55], [75], sometimes simpler, more data-efficient methods are preferable [19]. These considerations also play a role in actor-critic methods with learned value functions [32], [71].

The rise of DRL
Many of the successes in DRL have been based on scaling up prior work in RL to high-dimensional problems. This is due to the learning of low-dimensional feature representations and the powerful function approximation properties of neural networks. By means of representation learning, DRL can deal efficiently with the curse of dimensionality, unlike tabular and traditional nonparametric methods [7]. For instance, convolutional neural networks (CNNs) can be used as components of RL agents, allowing them to learn directly from raw, high-dimensional visual inputs. In general, DRL is based on training deep neural networks to approximate the optimal policy π* and/or the optimal value functions V*, Q*, and A*.

Value functions
The well-known function approximation properties of neural networks led naturally to the use of deep learning to regress functions for use in RL agents. Indeed, one of the earliest success stories in RL is TD-Gammon, a neural network that reached expert-level performance in backgammon in the early 1990s [81]. Using TD methods, the network took in the state of the board to predict the probability of black or white winning. Although this simple idea has been echoed in later work [73], progress in RL research has favored the explicit use of value functions, which can capture the structure underlying the environment.

From early value function methods in DRL, which took simple states as input [61], current methods are now able to tackle visually and conceptually complex environments [47], [48], [70], [100].

Function approximation and the DQN
We begin our survey of value-function-based DRL algorithms with the DQN [47], illustrated in Figure 5, which achieved scores across a wide range of classic Atari 2600 video games [5] that were comparable to those of a professional video games tester. The inputs to the DQN are four gray-scale frames of the game, concatenated over time, which are initially processed by several convolutional layers to extract spatiotemporal features, such as the movement of the ball in Pong or Breakout. The final feature map from the convolutional layers is processed by several fully connected layers, which more implicitly encode the effects of actions. This contrasts with more traditional controllers that use fixed preprocessing steps, which are therefore unable to adapt their processing of the state in response to the learning signal.

Figure 5. The DQN [47]. The network takes the state—a stack of gray-scale frames from the video game—and processes it with convolutional and fully connected layers, with ReLU nonlinearities in between each layer. At the final layer, the network outputs a discrete action, which corresponds to one of the possible control inputs for the game. Given the current state and chosen action, the game returns a new score. The DQN uses the reward—the difference between the new score and the previous one—to learn from its decision. More precisely, the reward is used to update its estimate of Q, and the error between its previous estimate and its new estimate is backpropagated through the network.

A forerunner of the DQN—neural-fitted Q (NFQ) iteration—involved training a neural network to return the Q-value given a state-action pair [61]. NFQ was later extended to train a network to drive a slot car using raw visual inputs from a camera over the race track, by combining a deep autoencoder to reduce the dimensionality of the inputs with a separate branch to predict Q-values [38]. Although the previous network could have been trained for both reconstruction and RL tasks simultaneously, it was both more reliable and computationally efficient to train the two parts of the network sequentially.

The DQN [47] is closely related to the model proposed by Lange et al. [38] but was the first RL algorithm that was demonstrated to work directly from raw visual inputs and on a wide variety of environments. It was designed such that the final fully connected layer outputs Q^π(s, ·) for all action values in a discrete set of actions—in this case, the various directions of the joystick and the fire button. This not only enables the best action, argmax_a Q^π(s, a), to be chosen after a single forward pass of the network, but also allows the network to more easily encode action-independent knowledge in the lower, convolutional layers. With merely the goal of maximizing its score on a video game, the DQN learns to extract salient visual features, jointly encoding objects, their movements, and, most importantly, their interactions. Using techniques originally developed for explaining the behavior of CNNs in object recognition tasks, we can also inspect what parts of its view the agent considers important (see Figure 6).
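To illustrate the output parameterization just described, here is a schematic sketch (ours, not the published DQN code): the final linear layer maps the convolutional features to one Q-value per discrete action, so greedy action selection is a single forward pass followed by an argmax. The `features` function, `W_out`, and `b_out` are assumptions standing in for the trained network.

```python
import numpy as np

def q_values(features, W_out, b_out, state):
    """Map a state to Q(s, a) for every discrete action in one forward pass."""
    phi = features(state)            # e.g., flattened output of the convolutional + hidden layers
    return phi @ W_out + b_out       # one scalar per action (18 joystick/fire combinations on Atari)

def greedy_action(features, W_out, b_out, state):
    """argmax_a Q(s, a) without any per-action network evaluations."""
    return int(np.argmax(q_values(features, W_out, b_out, state)))
```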

The true underlying state of the game is contained within 128 bytes of Atari 2600 random-access memory. However, the DQN was designed to directly learn from visual inputs (210 × 160 pixel 8-bit RGB images), which it takes as the state s. It is impractical to represent Q^π(s, a) exactly as a lookup table: when combined with 18 possible actions, we obtain a Q-table of size |S| × |A| = 18 × 256^(3×210×160). Even if it were feasible to create such a table, it would be sparsely populated, and information gained from one state-action pair cannot be propagated to other state-action pairs. The strength of the DQN lies in its ability to compactly represent both high-dimensional observations and the Q-function using deep neural networks. Without this ability, tackling the discrete Atari domain from raw visual inputs would be impractical.

The DQN addressed the fundamental instability problem of using function approximation in RL [83] by the use of two techniques: experience replay [45] and target networks. Experience replay memory stores transitions of the form (s_t, a_t, s_{t+1}, r_{t+1}) in a cyclic buffer, enabling the RL agent to sample from and train on previously observed data offline. Not only does this massively reduce the number of interactions needed with the environment, but batches of experience can be sampled, reducing the variance of learning updates. Furthermore, by sampling uniformly from a large memory, the temporal correlations that can adversely affect RL algorithms are broken. Finally, from a practical perspective, batches of data can be efficiently processed in parallel by modern hardware, increasing throughput. While the original DQN algorithm used uniform sampling [47], later work showed that prioritizing samples based on TD errors is more effective for learning [67].

The second stabilizing method, introduced by Mnih et al. [47], is the use of a target network that initially contains the weights of the network enacting the policy but is kept frozen for a large period of time.

Rather than having to calculate the TD error based on its own rapidly fluctuating estimates of the Q-values, the policy network uses the fixed target network. During training, the weights of the target network are updated to match the policy network after a fixed number of steps. Both experience replay and target networks have gone on to be used in subsequent DRL works [19], [44], [50], [93].

Figure 6. A saliency map of a trained DQN [47] playing Space Invaders [5]. By backpropagating the training signal to the image space, it is possible to see what a neural-network-based agent is attending to. In this frame, the most salient points—shown with the red overlay—are the laser that the agent recently fired and also the enemy that it anticipates hitting in a few time steps.
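A compact sketch (ours) of the two stabilizing components described above: a cyclic replay memory of (s_t, a_t, s_{t+1}, r_{t+1}, done) transitions, and a periodic copy of the online parameters into a frozen target network. Parameters are assumed to be stored as dictionaries of NumPy arrays; the details are illustrative rather than the original implementation.

```python
import random
from collections import deque

class ReplayMemory:
    """Cyclic buffer of transitions for offline minibatch updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions are overwritten when full

    def push(self, transition):
        self.buffer.append(transition)            # transition = (s, a, s_next, r, done)

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)   # uniform sampling, as in the original DQN

def sync_target(online_params, target_params, every, step):
    """Copy the online network's weights into the frozen target network every `every` steps."""
    if step % every == 0:
        target_params.update({k: v.copy() for k, v in online_params.items()})
```

TD targets are then computed with the frozen `target_params`, while gradient updates are applied only to `online_params`, which is what breaks the feedback loop between the estimate and its own target.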

Q-function modifications
Considering that one of the key components of the DQN is a function approximator for the Q-function, it can benefit from fundamental advances in RL. In [86], van Hasselt showed that the single estimator used in the Q-learning update rule overestimates the expected return due to the use of the maximum action value as an approximation of the maximum expected action value. Double-Q learning provides a better estimate through the use of a double estimator [86]. While double-Q learning requires an additional function to be learned, later work proposed using the already available target network from the DQN algorithm, resulting in significantly better results with only a small change in the update step [87].

Yet another way to adjust the DQN architecture is to decompose the Q-function into meaningful functions, such as constructing Q^π by adding together separate layers that compute the state-value function V^π and advantage function A^π [92]. Rather than having to come up with accurate Q-values for all actions, the duelling DQN [92] benefits from a single baseline for the state in the form of V^π and easier-to-learn relative values in the form of A^π. The combination of the duelling DQN with prioritized experience replay [67] is one of the state-of-the-art techniques in discrete action settings. Further insight into the properties of A^π by Gu et al. [19] led them to modify the DQN with a convex advantage layer that extended the algorithm to work over sets of continuous actions, creating the normalized advantage function (NAF) algorithm. Benefiting from experience replay, target networks, and advantage updates, NAF is one of several state-of-the-art techniques in continuous control problems [19].
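A sketch (ours) of the duelling decomposition described above: separate value and advantage streams are recombined into Q, with the advantage stream centered across actions, which is the identifiability choice commonly used in practice for the duelling architecture [92].

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar state value V(s) and a vector of advantages A(s, .) into Q(s, .):
       Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    return value + advantages - np.mean(advantages)
```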
Policy search
Policy search methods aim to directly find policies by means of gradient-free or gradient-based methods. Prior to the current surge of interest in DRL, several successful methods in DRL eschewed the commonly used backpropagation algorithm in favor of evolutionary algorithms [17], [33], which are gradient-free policy search algorithms. Evolutionary methods rely on evaluating the performance of a population of agents. Hence, they are expensive for large populations or agents with many parameters. However, as black-box optimization methods, they can be used to optimize arbitrary, nondifferentiable models and naturally allow for more exploration in the parameter space. In combination with a compressed representation of neural network weights, evolutionary algorithms can even be used to train large networks; such a technique resulted in the first deep neural network to learn an RL task, straight from high-dimensional visual inputs [33]. Recent work has reignited interest in evolutionary methods for RL as they can potentially be distributed at larger scales than techniques that rely on gradients [65].

Backpropagation through stochastic functions
The workhorse of DRL, however, remains backpropagation. The previously discussed REINFORCE rule [97] allows neural networks to learn stochastic policies in a task-dependent manner, such as deciding where to look in an image to track [69] or caption [99] objects. In these cases, the stochastic variable would determine the coordinates of a small crop of the image and hence reduce the amount of computation needed. This usage of RL to make discrete, stochastic decisions over inputs is known in the deep-learning literature as hard attention and is one of the more compelling uses of basic policy search methods in recent years, having many applications outside of traditional RL domains.

Compounding errors
Searching directly for a policy represented by a neural network with very many parameters can be difficult and suffer from severe local minima.

One way around this is to use guided policy search (GPS), which takes a few sequences of actions from another controller (which could be constructed using a separate method, such as optimal control). GPS learns from them by using supervised learning in combination with importance sampling, which corrects for off-policy samples [40]. This approach effectively biases the search toward a good (local) optimum. GPS works in a loop, by optimizing policies to match sampled trajectories and optimizing trajectory distributions to match the policy and minimize costs. Levine et al. [41] showed that it was possible to train visuomotor policies for a robot "end to end," straight from the RGB pixels of the camera to motor torques, providing one of the seminal works in DRL.

A more commonly used method is to use a trust region, in which optimization steps are restricted to lie within a region where the approximation of the true cost function still holds. By preventing updated policies from deviating too wildly from previous policies, the chance of a catastrophically bad update is lessened, and many algorithms that use trust regions guarantee or practically result in monotonic improvement in policy performance. The idea of constraining each policy gradient update, as measured by the Kullback–Leibler (KL) divergence between the current and proposed policy, has a long history in RL [28]. One of the newer algorithms in this line of work, TRPO, has been shown to be relatively robust and applicable to domains with high-dimensional inputs [70]. To achieve this, TRPO optimizes a surrogate objective function—specifically, it optimizes an (importance sampled) advantage estimate, constrained using a quadratic approximation of the KL divergence. While TRPO can be used as a pure policy gradient method with a simple baseline, later work by Schulman et al. [71] introduced generalized advantage estimation (GAE), which proposed several, more advanced variance reduction baselines. The combination of TRPO and GAE remains one of the state-of-the-art RL techniques in continuous control.
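In schematic form (our notation, not reproduced verbatim from [70]), the constrained problem that a trust-region update of this kind solves can be written as

```latex
\max_{\theta}\;
\mathbb{E}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,\hat{A}(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\right)\right] \le \delta ,
```

where the importance ratio reweights an advantage estimate \(\hat{A}\) (e.g., from GAE [71]) computed under the old policy, and the step-size bound \(\delta\) limits how far the new policy may move in distribution space; as noted above, in practice the KL term is handled via a quadratic approximation.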
Actor-critic methods
Actor-critic approaches have grown in popularity as an effective means of combining the benefits of policy search methods with learned value functions, which are able to learn from full returns and/or TD errors. They can benefit from improvements in both policy gradient methods, such as GAE [71], and value function methods, such as target networks [47]. In the last few years, DRL actor-critic methods have been scaled up from learning simulated physics tasks [22], [44] to real robotic visual navigation tasks [100], directly from image pixels.

One recent development in the context of actor-critic algorithms is deterministic policy gradients (DPGs) [72], which extend the standard policy gradient theorems for stochastic policies [97] to deterministic policies. One of the major advantages of DPGs is that, while stochastic policy gradients integrate over both state and action spaces, DPGs only integrate over the state space, requiring fewer samples in problems with large action spaces. In the initial work on DPGs, Silver et al. [72] introduced and demonstrated an off-policy actor-critic algorithm that vastly improved upon a stochastic policy gradient equivalent in high-dimensional continuous control problems. Later work introduced deep DPG, which utilized neural networks to operate on high-dimensional, visual state spaces [44]. In the same vein as DPGs, Heess et al. [22] devised a method for calculating gradients to optimize stochastic policies by "reparameterizing" [30], [60] the stochasticity away from the network, thereby allowing standard gradients to be used (instead of the high-variance REINFORCE estimator [97]). The resulting stochastic value gradient (SVG) methods are flexible and can be used both with (SVG(0) and SVG(1)) and without (SVG(∞)) value function critics, and with (SVG(∞) and SVG(1)) and without (SVG(0)) models. Later work proceeded to integrate DPGs and SVGs with RNNs, allowing them to solve continuous control problems in POMDPs, learning directly from pixels [21]. Together, DPGs and SVGs can be considered algorithmic approaches for improving learning efficiency in DRL.

An orthogonal approach to speeding up learning is to exploit parallel computation. By keeping a canonical set of parameters that are read by and updated in an asynchronous fashion by multiple copies of a single network, computation can be efficiently distributed over both processing cores in a single central processing unit (CPU), and across CPUs in a cluster of machines. Using a distributed system, Nair et al. [51] developed a framework for training multiple DQNs in parallel, achieving both better performance and a reduction in training time. However, the simpler asynchronous advantage actor-critic (A3C) algorithm [48], developed for both single and distributed machine settings, has become one of the most popular DRL techniques in recent times. A3C combines advantage updates with the actor-critic formulation and relies on asynchronously updated policy and value function networks trained in parallel over several processing threads. The use of multiple agents, situated in their own, independent environments, not only stabilizes improvements in the parameters, but conveys an additional benefit in allowing for more exploration to occur. A3C has been used as a standard starting point in many subsequent works, including the work of Zhu et al. [100], who applied it to robotic navigation in the real world through visual inputs.

There have been several major advancements on the original A3C algorithm that reflect various motivations in the field of DRL. The first is actor-critic with experience replay [93], which adds off-policy bias correction to A3C, allowing it to use experience replay to improve sample complexity. Others have attempted to bridge the gap between value and policy-based RL, utilizing theoretical advancements to improve upon the original A3C [50], [54]. Finally, there is a growing trend toward exploiting auxiliary tasks to improve the representations learned by DRL agents and, hence, improve both the learning speed and final performance of these agents [26], [46].
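The asynchronous parameter-sharing pattern behind such methods can be sketched as follows (a heavily simplified illustration of ours, not the actual A3C implementation of [48]; it omits the n-step rollouts and the separate policy and value losses): several worker threads read a shared parameter vector, compute gradients on their own environment copies, and apply lock-free updates.

```python
import threading
import numpy as np

def worker(params, compute_gradient, lr=1e-3, steps=1000):
    """One asynchronous worker: repeatedly compute a gradient from its own (local) rollouts
    and apply a lock-free, Hogwild-style in-place update to the shared parameter vector."""
    for _ in range(steps):
        grad = compute_gradient(params)   # stands in for an actor-critic gradient from local rollouts
        params -= lr * grad               # in-place update seen immediately by all other workers

def dummy_gradient(params):               # placeholder for a real policy/value gradient computation
    return 1e-3 * np.random.randn(*params.shape)

shared_params = np.zeros(1000)            # canonical parameters shared by every worker thread
threads = [threading.Thread(target=worker, args=(shared_params, dummy_gradient)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```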

Current research and challenges
To conclude, we will highlight some current areas of research in DRL and the challenges that still remain. Previously, we have focused mainly on model-free methods, but we will now examine a few model-based DRL algorithms in more detail. Model-based RL algorithms play an important role in making RL data efficient and in trading off exploration and exploitation. After tackling exploration strategies, we shall then address hierarchical RL (HRL), which imposes an inductive bias on the final policy by explicitly factorizing it into several levels. When available, trajectories from other controllers can be used to bootstrap the learning process, leading us to imitation learning and inverse RL (IRL). For the final topic, we will look at multiagent systems, which have their own special considerations.

Model-based RL
The key idea behind model-based RL is to learn a transition model that allows for simulation of the environment without interacting with the environment directly. Model-based RL does not assume specific prior knowledge. However, in practice, we can incorporate prior knowledge (e.g., physics-based models [29]) to speed up learning. Model learning plays an important role in reducing the number of required interactions with the (real) environment, which may be limited in practice. For example, it is unrealistic to perform millions of experiments with a robot in a reasonable amount of time and without significant hardware wear and tear. There are various approaches to learn predictive models of dynamical systems using pixel information. Based on the deep dynamical model [90], where high-dimensional observations are embedded into a lower-dimensional space using autoencoders, several model-based DRL algorithms have been proposed for learning models and policies from pixel information [55], [91], [95]. If a sufficiently accurate model of the environment can be learned, then even simple controllers can be used to control a robot directly from camera images [14]. Learned models can also be used to guide exploration purely based on simulation of the environment, with deep models allowing these techniques to be scaled up to high-dimensional visual domains [75].

Although deep neural networks can make reasonable predictions in simulated environments over hundreds of time steps [10], they typically require many samples to tune the large number of parameters they contain. Training these models often requires more samples (interaction with the environment) than simpler models. For this reason, Gu et al. [19] train locally linear models for use with the NAF algorithm—the continuous equivalent of the DQN [47]—to improve the algorithm's sample complexity in the robotic domain where samples are expensive. It seems likely that the usage of deep models in model-based DRL could be massively spurred by general advances in improving the data efficiency of neural networks.

Exploration versus exploitation
One of the greatest difficulties in RL is the fundamental dilemma of exploration versus exploitation: When should the agent try out (perceived) nonoptimal actions to explore the environment (and potentially improve the model), and when should it exploit the optimal action to make useful progress? Off-policy algorithms, such as the DQN [47], typically use the simple ε-greedy exploration policy, which chooses a random action with probability ε ∈ [0, 1], and the optimal action otherwise. By decreasing ε over time, the agent progresses toward exploitation. Although adding independent noise for exploration is usable in continuous control problems, more sophisticated strategies inject noise that is correlated over time (e.g., from stochastic processes) to better preserve momentum [44].
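A minimal sketch (ours) of the ε-greedy rule with a decaying ε, as used by DQN-style agents; the annealing schedule and its constants are illustrative choices, not prescribed values.

```python
import random

def epsilon_greedy(q_values, step, eps_start=1.0, eps_end=0.1, decay_steps=100_000):
    """Pick a random action with probability eps (annealed over time), else the greedy one."""
    frac = min(1.0, step / decay_steps)
    eps = eps_start + frac * (eps_end - eps_start)                  # linear annealing toward eps_end
    if random.random() < eps:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit: argmax_a Q(s, a)
```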
The observation that temporal correlation is important led Osband et al. [56] to propose the bootstrapped DQN, which maintains several Q-value "heads" that learn different values through a combination of different weight initializations and bootstrapped sampling from experience replay memory. At the beginning of each training episode, a different head is chosen, leading to temporally extended exploration. Usunier et al. [85] later proposed a similar method that performed exploration in policy space by adding noise to a single output head, using zero-order gradient estimates to allow backpropagation through the policy.

One of the main principled exploration strategies is the upper confidence bound (UCB) algorithm, based on the principle of "optimism in the face of uncertainty" [36]. The idea behind UCB is to pick actions that maximize E[R] + λσ[R], where σ[R] is the standard deviation of the return and λ > 0. UCB therefore encourages exploration in regions with high uncertainty and moderate expected return. While easily achievable in small tabular cases, the use of powerful density models has allowed this algorithm to scale to high-dimensional visual domains with DRL [4].

UCB can also be considered one way of implementing intrinsic motivation, which is a general concept that advocates decreasing uncertainty/making progress in learning about the environment [68]. There have been several DRL algorithms that try to implement intrinsic motivation via minimizing model prediction error [57], [75] or maximizing information gain [25], [49].

Hierarchical RL
In the same way that deep learning relies on hierarchies of features, HRL relies on hierarchies of policies. Early work in this area introduced options, in which, apart from primitive actions (single time-step actions), policies could also run other policies (multitime-step "actions") [79]. This approach allows top-level policies to focus on higher-level goals, while subpolicies are responsible for fine control. Several works in DRL have attempted HRL by using one top-level policy that chooses between subpolicies, where the division of states or goals into subpolicies is achieved either manually [1], [34], [82] or automatically [2], [88], [89]. One way to help construct subpolicies is to focus on discovering and reaching goals, which are specific states in the environment; they may often be locations, to which an agent should navigate. Whether utilized with HRL or not, the discovery and generalization of goals is also an important area of ongoing research [35], [66], [89].

Imitation learning and inverse RL
One may ask why, if given a sequence of "optimal" actions from expert demonstrations, it is not possible to use supervised learning in a straightforward manner—a case of "learning from demonstration." This is indeed possible and is known as behavioral cloning in traditional RL literature.

Taking advantage of the stronger signals available in supervised learning problems, behavioral cloning enjoyed success in earlier neural network research, with the most notable success being ALVINN, one of the earliest autonomous cars [59]. However, behavioral cloning cannot adapt to new situations, and small deviations from the demonstration during the execution of the learned policy can compound and lead to scenarios where the policy is unable to recover. A more generalizable solution is to use provided trajectories to guide the learning of suitable state-action pairs but fine-tune the agent using RL [23].
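For illustration (ours, not the article's code), behavioral cloning reduces to ordinary supervised regression from demonstrated states to demonstrated actions; a least-squares linear policy is the simplest possible instance and is used here only to make the idea concrete.

```python
import numpy as np

def behavioral_cloning(states, actions):
    """Fit a linear policy a ~= W^T s to expert (state, action) pairs by least squares.
    states: array of shape (N, state_dim); actions: array of shape (N, action_dim)."""
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return lambda s: s @ W        # cloned policy; note there is no notion of reward or recovery
```

The limitation discussed above is visible in this formulation: the cloned policy is only ever fitted to states the expert visited, so small execution errors push it into states where its predictions are unconstrained.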
The goal of IRL is to estimate an unknown reward function from observed trajectories that characterize a desired solution [52]; IRL can be used in combination with RL to improve upon demonstrated behavior. Using the power of deep neural networks, it is now possible to learn complex, nonlinear reward functions for IRL [98]. Ho and Ermon [24] showed that policies are uniquely characterized by their occupancies (visited state and action distributions), allowing IRL to be reduced to the problem of measure matching. With this insight, they were able to use generative adversarial training [18] to facilitate reward-function learning in a more flexible manner, resulting in the generative adversarial imitation learning algorithm.

Multiagent RL
Usually, RL considers a single learning agent in a stationary environment. In contrast, multiagent RL (MARL) considers multiple agents learning through RL and often the nonstationarity introduced by other agents changing their behaviors as they learn [8]. In DRL, the focus has been on enabling (differentiable) communication between agents, which allows them to cooperate. Several approaches have been proposed for this purpose, including passing messages to agents sequentially [15], using a bidirectional channel (providing ordering with less signal loss) [58], and an all-to-all channel [77]. The addition of communication channels is a natural strategy to apply to MARL in complex scenarios and does not preclude the usual practice of modeling cooperative or competing agents as applied elsewhere in the MARL literature [8].

Conclusion: Beyond pattern recognition
Despite the successes of DRL, many problems need to be addressed before these techniques can be applied to a wide range of complex real-world problems [37]. Recent work with (nondeep) generative causal models demonstrated superior generalization over standard DRL algorithms [48], [63] in some benchmarks [5], achieved by reasoning about causes and effects in the environment [29]. For example, the schema networks of Kansky et al. [29] trained on the game Breakout immediately adapted to a variant where a small wall was placed in front of the target blocks, while progressive (A3C) networks [63] failed to match the performance of the schema networks even after training on the new domain. Although DRL has already been combined with AI techniques, such as search [73] and planning [80], a deeper integration with other traditional AI approaches promises benefits such as better sample complexity, generalization, and interpretability [16]. In time, we also hope that our theoretical understanding of the properties of neural networks (particularly within DRL) will improve, as it currently lags far behind practice.

To conclude, it is worth revisiting the overarching goal of all of this research: the creation of general-purpose AI systems that can interact with and learn from the world around them. Interaction with the environment is simultaneously the advantage and disadvantage of RL. While there are many challenges in seeking to understand our complex and ever-changing world, RL allows us to choose how we explore it. In effect, RL endows agents with the ability to perform experiments to better understand their surroundings, enabling them to learn even high-level causal relationships. The availability of high-quality visual renderers and physics engines now enables us to take steps in this direction, with works that try to learn intuitive models of physics in visual environments [13]. Challenges remain before this will be possible in the real world, but steady progress is being made in agents that learn the fundamental principles of the world through observation and action. Perhaps, then, we are not too far away from AI systems that learn and act in more human-like ways in increasingly complex environments.

Acknowledgments
Kai Arulkumaran would like to acknowledge Ph.D. funding from the Department of Bioengineering at Imperial College London. This research has been partially funded by a Google Faculty Research Award to Marc Deisenroth.

Authors
Kai Arulkumaran (ka709@imperial.ac.uk) received a B.A. degree in computer science at the University of Cambridge in 2012 and an M.Sc. degree in biomedical engineering from Imperial College London in 2014, where he is currently a Ph.D. degree candidate in the Department of Bioengineering. He was a research intern at Twitter's Magic Pony and Microsoft Research in 2017. His research focus is deep reinforcement learning and transfer learning for visuomotor control.

Marc Peter Deisenroth (m.deisenroth@imperial.ac.uk) received an M.Eng. degree in computer science at the University of Karlsruhe in 2006 and a Ph.D. degree in machine learning at the Karlsruhe Institute of Technology in 2009. He is a lecturer of statistical machine learning in the Department of Computing at Imperial College London and with PROWLER.io. He was awarded an Imperial College Research Fellowship in 2014 and received Best Paper Awards at the International Conference on Robotics and Automation 2014 and the International Conference on Control, Automation, and Systems 2016. He is a recipient of a Google Faculty Research Award and a Microsoft Ph.D. Scholarship. His research is centered around data-efficient machine learning for autonomous decision making.

Miles Brundage (miles.brundage@philosophy.ox.ac.uk) received a B.A. degree in political science at George Washington University, Washington, D.C., in 2010. He is a Ph.D. degree candidate in the Human and Social Dimensions of Science and Technology Department at Arizona State University and a research fellow at the University of Oxford’s Future of Humanity Institute. His research focuses on governance issues related to artificial intelligence.

Anil Anthony Bharath (a.bharath@ic.ac.uk) received his B.Eng. degree in electronic and electrical engineering from University College London in 1988 and his Ph.D. degree in signal processing from Imperial College London in 1993, where he is currently a reader in the Department of Bioengineering. He is also a fellow of the Institution of Engineering and Technology and a cofounder of Cortexica Vision Systems. He was previously an academic visitor in the Signal Processing Group at the University of Cambridge in 2006. His research interests are in deep architectures for visual inference.

References
[1] K. Arulkumaran, N. Dilokthanakul, M. Shanahan, and A. A. Bharath, “Classifying options for deep reinforcement learning,” in Proc. IJCAI Workshop Deep Reinforcement Learning: Frontiers and Challenges, 2016.
[2] P. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in Proc. Association Advancement Artificial Intelligence, 2017, pp. 1726–1734.
[3] L. C. Baird III, “Advantage updating,” Defense Tech. Inform. Center, Tech. Report D-A280 862, Fort Belvoir, VA, 1993.
[4] M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” in Proc. Neural Information Processing Systems, 2016, pp. 1471–1479.
[5] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: an evaluation platform for general agents,” in Proc. Int. Joint Conf. Artificial Intelligence, 2015, pp. 253–279.
[6] R. Bellman, “On the theory of dynamic programming,” Proc. Nat. Acad. Sci., vol. 38, no. 8, pp. 716–719, 1952.
[7] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: a review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, 2013.
[8] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Syst., Man, Cybern., vol. 38, no. 2, pp. 156–172, 2008.
[9] M. Campbell, A. J. Hoane, and F. Hsu, “Deep Blue,” Artificial Intell., vol. 134, no. 1–2, pp. 57–83, 2002.
[10] S. Chiappa, S. Racaniere, D. Wierstra, and S. Mohamed, “Recurrent environment simulators,” in Proc. Int. Conf. Learning Representations, 2017.
[11] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell, J. Tobin, P. Abbeel, and W. Zaremba. (2016). Transfer from simulation to real world through learning deep inverse dynamics model. arXiv. [Online]. Available: https://arxiv.org/abs/1610.03518
[12] M. P. Deisenroth, G. Neumann, and J. Peters, “A survey on policy search for robotics,” Foundations and Trends in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
[13] M. Denil, P. Agrawal, T. D. Kulkarni, T. Erez, P. Battaglia, and N. de Freitas, “Learning to perform physics experiments via deep reinforcement learning,” in Proc. Int. Conf. Learning Representations, 2017.
[14] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel, “Deep spatial autoencoders for visuomotor learning,” in Proc. IEEE Int. Conf. Robotics and Automation, 2016, pp. 512–519.
[15] J. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate with deep multi-agent reinforcement learning,” in Proc. Neural Information Processing Systems, 2016, pp. 2137–2145.
[16] M. Garnelo, K. Arulkumaran, and M. Shanahan, “Towards deep symbolic reinforcement learning,” in NIPS Workshop on Deep Reinforcement Learning, 2016.
[17] F. Gomez and J. Schmidhuber, “Evolving modular fast-weight networks for control,” in Proc. Int. Conf. Artificial Neural Networks, 2005, pp. 383–389.
[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Neural Information Processing Systems, 2014, pp. 2672–2680.
[19] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep Q-learning with model-based acceleration,” in Proc. Int. Conf. Learning Representations, 2016.
[20] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in Association for the Advancement of Artificial Intelligence Fall Symp. Series, 2015.
[21] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, “Memory-based control with recurrent neural networks,” in NIPS Workshop on Deep Reinforcement Learning, 2015.
[22] N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa, “Learning continuous control policies by stochastic value gradients,” in Proc. Neural Information Processing Systems, 2015, pp. 2944–2952.
[23] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, A. Sendonaris, G. Dulac-Arnold, et al. (2017). Learning from demonstrations for real world reinforcement learning. arXiv. [Online]. Available: https://arxiv.org/abs/1704.03732
[24] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Proc. Neural Information Processing Systems, 2016, pp. 4565–4573.
[25] R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. de Turck, and P. Abbeel, “VIME: Variational information maximizing exploration,” in Proc. Neural Information Processing Systems, 2016, pp. 1109–1117.
[26] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” in Proc. Int. Conf. Learning Representations, 2017.
[27] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial Intell., vol. 101, no. 1, pp. 99–134, 1998.
[28] S. M. Kakade, “A natural policy gradient,” in Proc. Neural Information Processing Systems, 2002, pp. 1531–1538.
[29] K. Kansky, T. Silver, D. A. Mély, M. Eldawy, M. Lázaro-Gredilla, X. Lou, N. Dorfman, S. Sidor, S. Phoenix, and D. George, “Schema networks: zero-shot transfer with a generative causal model of intuitive physics,” in Proc. Int. Conf. Machine Learning, 2017, pp. 1809–1818.
[30] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learning Representations, 2014.
[31] N. Kohl and P. Stone, “Policy gradient reinforcement learning for fast quadrupedal locomotion,” in Proc. IEEE Int. Conf. Robotics and Automation, 2004, pp. 2619–2624.
[32] V. R. Konda and J. N. Tsitsiklis, “On actor-critic algorithms,” SIAM J. Control Optim., vol. 42, no. 4, pp. 1143–1166, 2003.
[33] J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Proc. Conf. Genetic and Evolutionary Computation, 2013, pp. 1061–1068.
[34] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Proc. Neural Information Processing Systems, 2016, pp. 3675–3683.
[35] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor reinforcement learning,” in NIPS Workshop on Deep Reinforcement Learning, 2016.
[36] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Adv. Appl. Math., vol. 6, no. 1, pp. 4–22, 1985.
[37] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and think like people,” Behavioral Brain Sci., pp. 1–101, 2016. [Online]. Available: https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/building-machines-that-learn-and-think-like-people/A9535B1D745A0377E16C590E14B94993
[38] S. Lange, M. Riedmiller, and A. Voigtlander, “Autonomous reinforcement learning on raw visual input data in a real world application,” in Proc. Int. Joint Conf. Neural Networks, 2012, pp. 1–8.
[39] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[40] S. Levine and V. Koltun, “Guided policy search,” in Proc. Int. Conf. Learning Representations, 2013.
[41] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” J. Mach. Learning Res., vol. 17, no. 39, pp. 1–40, 2016.
[42] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” in Proc. Int. Symp. Experimental Robotics, 2016, pp. 173–184.
[43] Y. Li. (2017). Deep reinforcement learning: An overview. arXiv. [Online]. Available: https://arxiv.org/abs/1701.07274
[44] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in Proc. Int. Conf. Learning Representations, 2016.
[45] L. Lin, “Self-improving reactive agents based on reinforcement learning, planning and teaching,” Mach. Learning, vol. 8, no. 3–4, pp. 293–321, 1992.
[46] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, et al., “Learning to navigate in complex environments,” in Proc. Int. Conf. Learning Representations, 2017.
[47] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[48] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Learning Representations, 2016.
[49] S. Mohamed and D. J. Rezende, “Variational information maximisation for intrinsically motivated reinforcement learning,” in Proc. Neural Information Processing Systems, 2015, pp. 2125–2133.
[50] O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. (2017). Bridging the gap between value and policy based reinforcement learning. arXiv. [Online]. Available: https://arxiv.org/abs/1702.08892
[51] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. de Maria, V. Panneershelvam, M. Suleyman, et al., “Massively parallel methods for deep reinforcement learning,” in ICML Workshop on Deep Learning, 2015.
[52] A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in Proc. Int. Conf. Machine Learning, 2000, pp. 663–670.
[53] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, “Autonomous inverted helicopter flight via reinforcement learning,” in Proc. Int. Symp. Experimental Robotics, 2006, pp. 363–372.
[54] B. O’Donoghue, R. Munos, K. Kavukcuoglu, and V. Mnih, “PGQ: Combining policy gradient and Q-learning,” in Proc. Int. Conf. Learning Representations, 2017.
[55] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in Atari games,” in Proc. Neural Information Processing Systems, 2015, pp. 2863–2871.
[56] I. Osband, C. Blundell, A. Pritzel, and B. van Roy, “Deep exploration via bootstrapped DQN,” in Proc. Neural Information Processing Systems, 2016, pp. 4026–4034.
[57] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in Proc. Int. Conf. Machine Learning, 2017, pp. 2778–2787.
[58] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, and J. Wang. (2017). Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv. [Online]. Available: https://arxiv.org/abs/1703.10069
[59] D. A. Pomerleau, “ALVINN, an autonomous land vehicle in a neural network,” in Proc. Neural Information Processing Systems, 1989, pp. 305–313.
[60] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in Proc. Int. Conf. Machine Learning, 2014, pp. 1278–1286.
[61] M. Riedmiller, “Neural fitted Q iteration—First experiences with a data efficient neural reinforcement learning method,” in Proc. European Conf. Machine Learning, 2005, pp. 317–328.
[62] G. A. Rummery and M. Niranjan, “On-line Q-learning using connectionist systems,” Dept. Engineering, Univ. Cambridge, Tech. Rep. CUED/F-INFENG/TR 166, 1994.
[63] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. (2016). Progressive neural networks. arXiv. [Online]. Available: https://arxiv.org/abs/1606.04671
[64] A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, and R. Hadsell. (2016). Sim-to-real robot learning from pixels with progressive nets. arXiv. [Online]. Available: https://arxiv.org/abs/1610.04286
[65] T. Salimans, J. Ho, X. Chen, and I. Sutskever. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv. [Online]. Available: https://arxiv.org/abs/1703.03864
[66] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” in Proc. Int. Conf. Machine Learning, 2015, pp. 1312–1320.
[67] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” in Proc. Int. Conf. Learning Representations, 2016.
[68] J. Schmidhuber, “A possibility for implementing curiosity and boredom in model-building neural controllers,” in Proc. Int. Conf. Simulation Adaptive Behavior, 1991, pp. 222–227.
[69] J. Schmidhuber and R. Huber, “Learning to generate artificial fovea trajectories for target detection,” Int. J. Neural Syst., vol. 2, no. 01n02, pp. 125–134, 1991. [Online]. Available: http://www.worldscientific.com/doi/abs/10.1142/S012906579100011X
[70] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proc. Int. Conf. Machine Learning, 2015, pp. 1889–1897.
[71] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” in Proc. Int. Conf. Learning Representations, 2016.
[72] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proc. Int. Conf. Machine Learning, 2014, pp. 387–395.
[73] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[74] S. Singh, D. Litman, M. Kearns, and M. Walker, “Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system,” J. Artificial Intell. Res., vol. 16, pp. 105–133, Feb. 2002.
[75] B. C. Stadie, S. Levine, and P. Abbeel, “Incentivizing exploration in reinforcement learning with deep predictive models,” in NIPS Workshop on Deep Reinforcement Learning, 2015.
[76] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, “PAC model-free reinforcement learning,” in Proc. Int. Conf. Machine Learning, 2006, pp. 881–888.
[77] S. Sukhbaatar, A. Szlam, and R. Fergus, “Learning multiagent communication with backpropagation,” in Proc. Neural Information Processing Systems, 2016, pp. 2244–2252.
[78] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[79] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intell., vol. 112, no. 1–2, pp. 181–211, 1999.
[80] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value iteration networks,” in Proc. Neural Information Processing Systems, 2016, pp. 2154–2162.
[81] G. Tesauro, “Temporal difference learning and TD-gammon,” Commun. ACM, vol. 38, no. 3, pp. 58–68, 1995.
[82] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor, “A deep hierarchical approach to lifelong learning in Minecraft,” in Proc. Association for the Advancement Artificial Intelligence, 2017, pp. 1553–1561.
[83] J. N. Tsitsiklis and B. van Roy, “Analysis of temporal-difference learning with function approximation,” in Proc. Neural Information Processing Systems, 1997, pp. 1075–1081.
[84] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell, “Towards adapting deep visuomotor representations from simulated to real environments,” in Workshop Algorithmic Foundations Robotics, 2016.
[85] N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, “Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks,” in Proc. Int. Conf. Learning Representations, 2017.
[86] H. van Hasselt, “Double Q-learning,” in Proc. Neural Information Processing Systems, 2010, pp. 2613–2621.
[87] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. Association for the Advancement of Artificial Intelligence, 2016, pp. 2094–2100.
[88] A. Vezhnevets, V. Mnih, S. Osindero, A. Graves, O. Vinyals, J. Agapiou, and K. Kavukcuoglu, “Strategic attentive writer for learning macro-actions,” in Proc. Neural Information Processing Systems, 2016, pp. 3486–3494.
[89] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “FeUdal networks for hierarchical reinforcement learning,” in Proc. Int. Conf. Machine Learning, 2017, pp. 3540–3549.
[90] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “Learning deep dynamical models from image pixels,” in Proc. IFAC Symp. System Identification, 2015, pp. 1059–1064.
[91] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From pixels to torques: policy learning with deep dynamical models,” in ICML Workshop on Deep Learning, 2015.
[92] Z. Wang, N. de Freitas, and M. Lanctot, “Dueling network architectures for deep reinforcement learning,” in Proc. Int. Conf. Learning Representations, 2016.
[93] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” in Proc. Int. Conf. Learning Representations, 2017.
[94] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learning, vol. 8, no. 3–4, pp. 279–292, 1992.
[95] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in Proc. Neural Information Processing Systems, 2015, pp. 2746–2754.
[96] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber, “Recurrent policy gradients,” Logic J. IGPL, vol. 18, no. 5, pp. 620–634, 2010.
[97] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Mach. Learning, vol. 8, no. 3–4, pp. 229–256, 1992.
[98] M. Wulfmeier, P. Ondruska, and I. Posner, “Maximum entropy deep inverse reinforcement learning,” in NIPS Workshop on Deep Reinforcement Learning, 2015.
[99] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proc. Int. Conf. Machine Learning, 2015, pp. 2048–2057.
[100] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proc. IEEE Int. Conf. Robotics and Automation, 2017, pp. 3357–3364.