
22 Selected Top Papers On Deep Learning

These papers provide a breadth of information about Deep Learning (a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input) that is generally useful and interesting from a computer science perspective.


Contents

• Deep Learning in Neural Networks: An Overview
• Salient Object Detection: A Discriminative Regional Feature Integration Approach
• Long-term Recurrent Convolutional Networks for Visual Recognition and Description
• MatConvNet: Convolutional Neural Networks for MATLAB
• Image Super-Resolution Using Deep Convolutional Networks
• Beyond Short Snippets: Deep Networks for Video Classification
• U-Net: Convolutional Networks for Biomedical Image Segmentation
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
• Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
• Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
• Generative Adversarial Nets
• Character-level Convolutional Networks for Text Classification
• A Few Useful Things to Know About Machine Learning
• Conditional Random Fields as Recurrent Neural Networks
• Deep Learning Face Attributes in the Wild
• Asynchronous Methods for Deep Reinforcement Learning
• Human-level control through deep reinforcement learning
• Deep learning
• TensorFlow: A System for Large-Scale Machine Learning
• TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
• Visual Madlibs: Fill in the blank Description Generation and Question Answering
• Visualizing and Understanding Convolutional Networks

Deep Learning in Neural Networks: An Overview
Technical Report IDSIA-03-14 / arXiv:1404.7828 v4 [cs.NE] (88 pages, 888 references)

Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale


University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano
Switzerland
8 October 2014

Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous
contests in pattern recognition and machine learning. This historical survey compactly summarises
relevant work, much of it from the previous millennium. Shallow and deep learners are distin-
guished by the depth of their credit assignment paths, which are chains of possibly learnable, causal
links between actions and effects. I review deep supervised learning (also recapitulating the history
of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation,
and indirect search for short programs encoding deep and large networks.

LaTeX source: http://www.idsia.ch/~juergen/DeepLearning8Oct2014.tex


Complete BibTeX file (888 kB): http://www.idsia.ch/~juergen/deep.bib

Preface
This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit
to those who contributed to the present state of the art. I acknowledge the limitations of attempting
to achieve this goal. The DL research community itself may be viewed as a continually evolving,
deep network of scientists who have influenced each other in complex ways. Starting from recent DL
results, I tried to trace back the origins of relevant ideas through the past half century and beyond,
sometimes using “local search” to follow citations of citations backwards in time. Since not all DL
publications properly acknowledge earlier relevant work, additional global search strategies were em-
ployed, aided by consulting numerous neural network experts. As a result, the present preprint mostly
consists of references. Nevertheless, through an expert selection bias I may have missed important
work. A related bias was surely introduced by my special familiarity with the work of my own DL
research group in the past quarter-century. For these reasons, this work should be viewed as merely a
snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send
corrections and suggestions to juergen@idsia.ch.

Contents

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)

2 Event-Oriented Notation for Activation Spreading in NNs

3 Depth of Credit Assignment Paths (CAPs) and of Problems

4 Recurring Themes of Deep Learning
  4.1 Dynamic Programming for Supervised/Reinforcement Learning (SL/RL)
  4.2 Unsupervised Learning (UL) Facilitating SL and RL
  4.3 Learning Hierarchical Representations Through Deep SL, UL, RL
  4.4 Occam's Razor: Compression and Minimum Description Length (MDL)
  4.5 Fast Graphics Processing Units (GPUs) for DL in NNs

5 Supervised NNs, Some Helped by Unsupervised NNs
  5.1 Early NNs Since the 1940s (and the 1800s)
  5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
  5.3 1965: Deep Networks Based on the Group Method of Data Handling
  5.4 1979: Convolution + Weight Replication + Subsampling (Neocognitron)
  5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
    5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
  5.6 Late 1980s-2000 and Beyond: Numerous Improvements of NNs
    5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
    5.6.2 Better BP Through Advanced Gradient Descent (Compare Sec. 5.24)
    5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
    5.6.4 Potential Benefits of UL for SL (Compare Sec. 5.7, 5.10, 5.15)
  5.7 1987: UL Through Autoencoder (AE) Hierarchies (Compare Sec. 5.15)
  5.8 1989: BP for Convolutional NNs (CNNs, Sec. 5.4)
  5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
  5.10 1991: UL-Based History Compression Through a Deep Stack of RNNs
  5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
  5.12 1994: Early Contest-Winning NNs
  5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
  5.14 2003: More Contest-Winning/Record-Setting NNs; Successful Deep NNs
  5.15 2006/7: UL For Deep Belief Networks / AE Stacks Fine-Tuned by BP
  5.16 2006/7: Improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM Stacks
  5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
  5.18 2010: Plain Backprop (+ Distortions) on GPU Breaks MNIST Record
  5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
  5.20 2011: Hessian-Free Optimization for RNNs
  5.21 2012: First Contests Won on ImageNet, Object Detection, Segmentation
  5.22 2013-: More Contests and Benchmark Records
  5.23 Currently Successful Techniques: LSTM RNNs and GPU-MPCNNs
  5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
  5.25 Consequences for Neuroscience
  5.26 DL with Spiking Neurons?

6 DL in FNNs and RNNs for Reinforcement Learning (RL)
  6.1 RL Through NN World Models Yields RNNs With Deep CAPs
  6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
  6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
  6.4 RL Facilitated by Deep UL in FNNs and RNNs
  6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
  6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
  6.7 Deep RL by Indirect Policy Search / Compressed NN Search
  6.8 Universal RL

7 Conclusion and Outlook

8 Acknowledgments

Abbreviations in Alphabetical Order

AE: Autoencoder
AI: Artificial Intelligence
ANN: Artificial Neural Network
BFGS: Broyden-Fletcher-Goldfarb-Shanno
BNN: Biological Neural Network
BM: Boltzmann Machine
BP: Backpropagation
BRNN: Bi-directional Recurrent Neural Network
CAP: Credit Assignment Path
CEC: Constant Error Carousel
CFL: Context Free Language
CMA-ES: Covariance Matrix Estimation ES
CNN: Convolutional Neural Network
CoSyNE: Co-Synaptic Neuro-Evolution
CSL: Context Sensitive Language
CTC: Connectionist Temporal Classification
DBN: Deep Belief Network
DCT: Discrete Cosine Transform
DL: Deep Learning
DP: Dynamic Programming
DS: Direct Policy Search
EA: Evolutionary Algorithm
EM: Expectation Maximization
ES: Evolution Strategy
FMS: Flat Minimum Search
FNN: Feedforward Neural Network
FSA: Finite State Automaton
GMDH: Group Method of Data Handling
GOFAI: Good Old-Fashioned AI
GP: Genetic Programming
GPU: Graphics Processing Unit
GPU-MPCNN: GPU-Based MPCNN
HMM: Hidden Markov Model
HRL: Hierarchical Reinforcement Learning
HTM: Hierarchical Temporal Memory
HMAX: Hierarchical Model “and X”
LSTM: Long Short-Term Memory (RNN)
MDL: Minimum Description Length
MDP: Markov Decision Process
MNIST: Mixed National Institute of Standards and Technology Database
MP: Max-Pooling
MPCNN: Max-Pooling CNN
NE: NeuroEvolution
NEAT: NE of Augmenting Topologies
NES: Natural Evolution Strategies
NFQ: Neural Fitted Q-Learning
NN: Neural Network
OCR: Optical Character Recognition
PCC: Potential Causal Connection
PDCC: Potential Direct Causal Connection
PM: Predictability Minimization
POMDP: Partially Observable MDP
RAAM: Recursive Auto-Associative Memory
RBM: Restricted Boltzmann Machine
ReLU: Rectified Linear Unit
RL: Reinforcement Learning
RNN: Recurrent Neural Network
R-prop: Resilient Backpropagation
SL: Supervised Learning
SLIM NN: Self-Delimiting Neural Network
SOTA: Self-Organising Tree Algorithm
SVM: Support Vector Machine
TDNN: Time-Delay Neural Network
TIMIT: TI/SRI/MIT Acoustic-Phonetic Continuous Speech Corpus
UL: Unsupervised Learning
WTA: Winner-Take-All

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What
changes to them improve performance? This has been called the fundamental credit assignment prob-
lem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that
are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on
the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural
Networks (NNs).
A standard neural network (NN) consists of many simple, connected processors called neurons,
each producing a sequence of real-valued activations. Input neurons get activated through sensors per-
ceiving the environment, other neurons get activated through weighted connections from previously
active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions.
Learning or credit assignment is about finding weights that make the NN exhibit desired behavior,
such as driving a car. Depending on the problem and how the neurons are connected, such behavior
may require long causal chains of computational stages (Sec. 3), where each stage transforms (of-
ten in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately
assigning credit across many such stages.
Shallow NN-like models with few such stages have been around for many decades if not centuries
(Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s
(Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised
Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was
developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep
NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6),
and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically fea-
sible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10 (1991), Sec. 5.15
(2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the
new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming al-
ternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998)
in numerous important applications. In fact, since 2009, supervised deep NNs have won many official
international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first
superhuman visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also
have become relevant for the more general field of Reinforcement Learning (RL) where there is no
supervising teacher (Sec. 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests
(Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they
are general computers more powerful than FNNs, and can in principle create and process memories
of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a).
Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969;
Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential
and parallel information processing in a natural and efficient way, exploiting the massive parallelism
viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.
The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation
that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the
concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is
of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses
on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent
competitions (Sec. 5.17–5.23). Sec. 5 is arranged in a historical timeline format with subsections on
important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic
Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep
NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and
RNNs, including successful policy gradient and evolutionary methods.

2 Event-Oriented Notation for Activation Spreading in NNs


Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit
in the given contexts. Let n, m, T denote positive integer constants.
An NN’s topology may change over time (e.g., Sec. 5.3, 5.6.3). At any given moment, it can
be described as a finite subset of units (or nodes or neurons) $N = \{u_1, u_2, \ldots\}$ and a finite set
H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic.
The first (input) layer is the set of input units, a subset of N . In FNNs, the k-th layer (k > 1) is the set
of all nodes u ∈ N such that there is an edge path of length k − 1 (but no longer path) between some
input unit and u. There may be shortcut connections between distant layers. In sequence-processing,
fully connected RNNs, all units have connections to all non-input units.
The NN’s behavior or program is determined by a set of real-valued, possibly modifiable, param-
eters or weights wi (i = 1, . . . , n). We now focus on a single finite episode or epoch of information
processing and activation spreading, without learning through weight changes. The following slightly
unconventional notation is designed to compactly describe what is happening during the runtime of
the system.
During an episode, there is a partially causal sequence $x_t$ $(t = 1, \ldots, T)$ of real values that I call
events. Each $x_t$ is either an input set by the environment, or the activation of a unit that may directly
depend on other $x_k$ $(k < t)$ through a current NN topology-dependent set $\mathrm{in}_t$ of indices $k$ representing
incoming causal connections or links. Let the function $v$ encode topology information and map such
event index pairs $(k, t)$ to weight indices. For example, in the non-input case we may have $x_t =
f_t(\mathrm{net}_t)$ with real-valued $\mathrm{net}_t = \sum_{k \in \mathrm{in}_t} x_k w_{v(k,t)}$ (additive case) or $\mathrm{net}_t = \prod_{k \in \mathrm{in}_t} x_k w_{v(k,t)}$
(multiplicative case), where $f_t$ is a typically nonlinear real-valued activation function such as $\tanh$.
In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type
$x_t = \max_{k \in \mathrm{in}_t}(x_k)$; some network types may also use complex polynomial activation functions
(Sec. 5.3). $x_t$ may directly affect certain $x_k$ $(k > t)$ through outgoing connections or links represented
through a current set $\mathrm{out}_t$ of indices $k$ with $t \in \mathrm{in}_k$. Some of the non-input events are called output
events.
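
For concreteness, this notation can be sketched in a few lines of Python/NumPy (a hypothetical illustration, not code from the paper; the containers in_t, v, is_input, and env_input are assumed bookkeeping):

import numpy as np

def run_episode(T, w, in_t, v, is_input, env_input, f=np.tanh):
    """Spread activations for one episode of T events (additive case, Sec. 2).
    in_t[t] lists the indices k < t feeding event t; v[(k, t)] maps an event
    index pair to the index of a (possibly shared) weight in w."""
    x = np.zeros(T + 1)    # events x[1..T]
    net = np.zeros(T + 1)  # net[t] = sum over k in in_t[t] of x[k] * w[v[(k, t)]]
    for t in range(1, T + 1):
        if is_input[t]:
            x[t] = env_input[t]  # input event set by the environment
        else:
            net[t] = sum(x[k] * w[v[(k, t)]] for k in in_t[t])
            x[t] = f(net[t])     # typically nonlinear activation such as tanh
    return x, net

Reusing the same weight index w[v[(k, t)]] for many event pairs is exactly the weight sharing across space and/or time discussed next.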
Note that many of the $x_t$ may refer to different, time-varying activations of the same unit in
sequence-processing RNNs (e.g., Williams, 1989, “unfolding in time”), or also in FNNs sequentially
exposed to time-varying input patterns of a large training set encoded as input events. During an
episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in
RNNs, or in convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time.
Weight sharing may greatly reduce the NN’s descriptive complexity, which is the number of bits of
information required to describe the NN (Sec. 4.4).
In Supervised Learning (SL), certain NN output events $x_t$ may be associated with teacher-given,
real-valued labels or targets $d_t$ yielding errors $e_t$, e.g., $e_t = \frac{1}{2}(x_t - d_t)^2$. A typical goal of supervised
NN training is to find weights that yield episodes with small total error $E$, the sum of all such $e_t$. The
hope is that the NN will generalize well in later episodes, causing only small errors on previously
unseen sequences of input events. Many alternative error functions for SL and UL are possible.
SL assumes that input events are independent of earlier output events (which may affect the en-
vironment through actions causing subsequent perceptions). This assumption does not hold in the
broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al.,
1996; Sutton and Barto, 1998; Hutter, 2005; Wiering and van Otterlo, 2012) (Sec. 6). In RL, some
of the input events may encode real-valued reward signals given by the environment, and a typical
goal is to find weights that yield episodes with a high sum of reward signals, through sequences of
appropriate output actions.
Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely,
backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as
RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems


To measure whether credit assignment in a given NN application is of the deep or shallow type, I
introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links
between the events of Sec. 2, e.g., from input through hidden to output layers in FNNs, or through
transformations over time in RNNs.
Let us first focus on SL. Consider two events $x_p$ and $x_q$ $(1 \leq p < q \leq T)$. Depending on the
application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean
predicate $pdcc(p, q)$, which is true if and only if $p \in \mathrm{in}_q$. Then the 2-element list $(p, q)$ is defined to
be a CAP (a minimal one) from $p$ to $q$. A learning algorithm may be allowed to change $w_{v(p,q)}$ to
improve performance in future episodes.
More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the recursively
defined Boolean predicate $pcc(p, q)$, which in the SL case is true only if $pdcc(p, q)$, or if
$pcc(p, k)$ for some $k$ and $pdcc(k, q)$. In the latter case, appending $q$ to any CAP from $p$ to $k$ yields a
CAP from $p$ to $q$ (this is a recursive definition, too). The set of such CAPs may be large but is finite.
Note that the same weight may affect many different PDCCs between successive events listed by a
given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.

Suppose a CAP has the form $(\ldots, k, t, \ldots, q)$, where $k$ and $t$ (possibly $t = q$) are the first successive
elements with modifiable $w_{v(k,t)}$. Then the length of the suffix list $(t, \ldots, q)$ is called the CAP's
depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit
assignment can move down the causal chain to find a modifiable weight.¹
Suppose an episode and its event sequence x1 , . . . , xT satisfy a computable criterion used to
decide whether a given problem has been solved (e.g., total error E below some threshold). Then
the set of used weights is called a solution to the problem, and the depth of the deepest CAP within
the sequence is called the solution depth. There may be other solutions (yielding different event
sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution
is called the problem depth.
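
These definitions translate directly into code. The following hypothetical Python sketch enumerates the (finite) set of CAPs between two events and computes a CAP's depth per the definition above (in_t, v, and modifiable are assumed bookkeeping structures):

def all_caps(p, q, in_t):
    """Enumerate all CAPs from event p to event q. pdcc(k, q) holds
    iff k is in in_t[q]; appending q to a CAP from p to k yields
    a CAP from p to q (the recursive definition of Sec. 3)."""
    if p == q:
        return [[q]]
    caps = []
    for k in in_t.get(q, ()):
        if k >= p:
            caps += [c + [q] for c in all_caps(p, k, in_t)]
    return caps

def cap_depth(cap, modifiable, v):
    """Length of the suffix starting at the first event reached via a
    modifiable weight; 0 if the CAP contains no modifiable link."""
    for i in range(len(cap) - 1):
        if modifiable[v[(cap[i], cap[i + 1])]]:
            return len(cap) - (i + 1)
    return 0

The solution depth is then the largest cap_depth over the CAPs within the event sequence, and the problem depth is the smallest solution depth over all solutions.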
Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a
problem-independent maximal problem depth bounded by the number of non-input layers. Certain
SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass
et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only
the final links in the corresponding CAPs are modifiable. In general, however, RNNs may learn to
solve problems of potentially unlimited depth.
Note that the definitions above are solely based on the depths of causal chains, and agnostic to the
temporal distance between events. For example, shallow FNNs perceiving large “time windows” of
input events may correctly classify long input sequences through appropriate output events, and thus
solve shallow problems involving long time lags between relevant events.
At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with
DL experts have not yet yielded a conclusive response to this question. Instead of committing myself
to a precise answer, let me just define for the purposes of this overview: problems of depth > 10
require Very Deep Learning.

¹ An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this
would not make a difference, but in some it would, e.g., Sec. 6.1.
The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn
to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of
direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first
on shallow problems whose solutions may then generalize to deep problems, or through collapsing
sequences of (non)linear operations into a single (non)linear operation (but see an analysis of non-
trivial aspects of deep linear networks, Baldi and Hornik, 1994, Section B). In general, however,
finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990;
Blum and Rivest, 1992), also in the case of deep NNs (Síma, 1994; de Souto et al., 1999; Windisch,
2005); compare a survey of negative results (Síma, 2002, Section 1).
Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q)
is also true if xp is an output event and xq any later input event—any action may affect the environment
and thus any later perception. (In the real world, the environment may even influence non-input events
computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is
possible to model and replace such unmodifiable environmental PCCs through a part of the NN that
has already learned to predict (through some of its units) input events (including reward signals) from
former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to
other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very
deep CAPs though.
Some DL research is about automatically rephrasing problems such that their depth is reduced
(Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often
Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2.
Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning


4.1 Dynamic Programming for Supervised/Reinforcement Learning (SL/RL)
One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to fa-
cilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself
can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian as-
sumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms
are also essential for systems that combine concepts of NNs and graphical models, such as Hidden
Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization
(EM) (Dempster et al., 1977; Friedman et al., 2001), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and
Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Hastie et al.,
2009; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a; Wu and Shao, 2014).

4.2 Unsupervised Learning (UL) Facilitating SL and RL


Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4)
is normally used to encode raw incoming data such as video or speech streams in a form that is more
convenient for subsequent goal-directed learning. In particular, codes that describe the original data in
a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4),
whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for
dealing with the raw data. UL is closely connected to the topics of regularization and compression
(Sec. 4.4, 5.6.3).

4.3 Learning Hierarchical Representations Through Deep SL, UL, RL
Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as
more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn
hierarchies of more and more abstract data representations. For example, certain methods of syn-
tactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules
to model observations. The partially (un)supervised Automated Mathematician / EURISKO (Lenat,
1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts.
Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also
a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL
(Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression
(Sec. 4.4), e.g., Sec. 5.10.

4.4 Occam’s Razor: Compression and Minimum Description Length (MDL)


Occam’s razor favors simple solutions over complex ones. Given some programming language, the
principle of Minimum Description Length (MDL) can be used to measure the complexity of a so-
lution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964;
Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Solomonoff, 1978;
Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods
explicitly take into account program runtime (Allender, 1992; Watanabe, 1992; Schmidhuber, 1997,
2002); many consider only programs with constant runtime, written in non-universal programming
languages (e.g., Rissanen, 1986; Hinton and van Camp, 1993). In the NN case, the MDL princi-
ple suggests that low NN weight complexity corresponds to high NN probability in the Bayesian
view (e.g., MacKay, 1992; Buntine and Weigend, 1991; Neal, 1995; De Freitas, 2003), and to high
generalization performance (e.g., Baum and Haussler, 1989), without overfitting the training data.
Many methods have been proposed for regularizing NNs, that is, searching for solution-computing
but simple, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely related to
certain UL methods (Sec. 4.2, 5.6.4).

4.5 Fast Graphics Processing Units (GPUs) for DL in NNs


While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Jackel
et al., 1990; Faggin, 1992; Ramacher et al., 1993; Widrow et al., 1994; Heemskerk, 1995; Korkin et al.,
1997; Urlbe, 1999), and at exploiting standard hardware (e.g., Anguita et al., 1994; Muller et al., 1995;
Anguita and Gomes, 1996), the new millennium brought a DL breakthrough in the form of cheap, multi-
processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive
market that has driven down hardware prices. GPUs excel at the fast matrix and vector multiplications
required not only for convincing virtual realities but also for NN training, where they can speed up
learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sec. 5.16–5.19)
have greatly contributed to recent successes in contests for pattern recognition (Sec. 5.19–5.22), image
segmentation (Sec. 5.21), and object detection (Sec. 5.21–5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs


The main focus of current practical applications is on Supervised Learning (SL), which has domi-
nated recent pattern recognition contests (Sec. 5.17–5.23). Several methods, however, use additional
Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and
UL in the same section: often gradient-based methods, such as BP (Sec. 5.5.1), are used to optimize
objective functions of both UL and SL, and the boundary between SL and UL may blur, for example,
when it comes to time series prediction and sequence classification, e.g., Sec. 5.10, 5.12.
A historical timeline format will help to arrange subsections on important inspirations and techni-
cal contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly
mentions early, shallow NN models since the 1940s (and 1800s), Sec. 5.2 additional early neurobio-
logical inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since
1965), to my knowledge the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep
Neocognitron NN (1979) which is very similar to certain modern deep FNN architectures, as it com-
bines convolutional NNs (CNNs), weight pattern replication, and subsampling mechanisms. Sec. 5.5
uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation
(BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981
and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and
mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hier-
archical stack (1987) of coupled UL-based Autoencoders (AEs)—this concept resurfaced in the new
millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs (1989), which is important for today’s
DL applications. Sec. 5.9 explains BP’s Fundamental DL Problem (of vanishing/exploding gradients)
discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-
trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment
Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular winner-take-all (WTA)
method called Max-Pooling (MP, 1992) widely used in today’s deep FNNs. Sec. 5.12 mentions a
first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN
(Long Short-Term Memory, LSTM, 1995) for problems of depth 1000 and more. Sec. 5.14 mentions
an early contest of 2003 won by an ensemble of shallow FNNs, as well as good pattern recognition
results with CNNs and deep FNNs and LSTM RNNs (2003). Sec. 5.15 is mostly about Deep Belief
Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7), both pre-trained by UL to
facilitate subsequent BP-based SL (compare Sec. 5.6.1, 5.10). Sec. 5.16 mentions the first SL-based
GPU-CNNs (2006), BP-trained MPCNNs (2007), and LSTM stacks (2007). Sec. 5.17–5.22 focus on
official competitions with secret test sets won by (mostly purely supervised) deep NNs since 2009,
in sequence recognition, image classification, image segmentation, and object detection. Many RNN
results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code de-
veloped since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19). Sec. 5.24
mentions recent tricks for improving DL in NNs, many of them closely related to earlier tricks from
the previous millennium (e.g., Sec. 5.6.2, 5.6.3). Sec. 5.25 discusses how artificial NNs can help to
understand biological NNs; Sec. 5.26 addresses the possibility of DL in NNs with spiking neurons.

5.1 Early NNs Since the 1940s (and the 1800s)


Early NN architectures (McCulloch and Pitts, 1943) did not learn. The first ideas about UL were
published a few years later (Hebb, 1949). The following decades brought simple NNs trained by
SL (e.g., Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Narendra and Thathatchar, 1974) and
UL (e.g., Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Willshaw and von der Malsburg,
1976), as well as closely related associative memories (e.g., Palm, 1980; Hopfield, 1982).
In a sense NNs have been around even longer, since early supervised NNs were essentially variants
of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809,
1821); Gauss also refers to his work of 1795. Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
Simple cells and complex cells were found in the cat’s visual cortex (e.g., Hubel and Wiesel, 1962;
Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs,
such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This
inspired later deep NN architectures (Sec. 5.4, 5.11) used in certain modern award-winning Deep
Learners (Sec. 5.19–5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling
Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965;
Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feed-
forward Multilayer Perceptron type, although there was earlier work on NNs with a single hidden
layer (e.g., Joseph, 1961; Viglione, 1970). The units of GMDH nets may have polynomial activation
functions implementing Kolmogorov-Gabor polynomials (more general than other widely used NN
activation functions, Sec. 2). Given a training set, layers are incrementally grown and trained by re-
gression analysis (e.g., Legendre, 1805; Gauss, 1809, 1821) (Sec. 5.1), then pruned with the help of
a separate validation set (using today’s terminology), where Decision Regularisation is used to weed
out superfluous units (compare Sec. 5.6.3). The numbers of layers and units per layer can be learned
in problem-dependent fashion. To my knowledge, this was the first example of open-ended, hierar-
chical representation learning in NNs (Sec. 4.3). A paper of 1971 already described a deep GMDH
network with 8 layers (Ivakhnenko, 1971). There have been numerous applications of GMDH-style
nets, e.g. (Ikeda et al., 1976; Farlow, 1984; Madala and Ivakhnenko, 1994; Ivakhnenko, 1995; Kondo,
1998; Kordík et al., 2003; Witczak et al., 2006; Kondo and Ueno, 2008).
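
To make the mechanics concrete, here is a hypothetical Python/NumPy sketch of one GMDH-style growth step (illustrative only; grow_gmdh_layer and the quadratic feature set are my assumptions, a simplification of Kolmogorov-Gabor polynomials):

import numpy as np

def grow_gmdh_layer(X_train, y_train, X_val, y_val, keep=8):
    """Fit quadratic polynomial units on all input pairs by regression,
    then prune: keep only the units that do best on a validation set."""
    units = []
    for a in range(X_train.shape[1]):
        for b in range(a + 1, X_train.shape[1]):
            # simple Kolmogorov-Gabor style features of one input pair
            feats = lambda X, a=a, b=b: np.column_stack(
                [np.ones(len(X)), X[:, a], X[:, b],
                 X[:, a] * X[:, b], X[:, a] ** 2, X[:, b] ** 2])
            coef, *_ = np.linalg.lstsq(feats(X_train), y_train, rcond=None)
            err = np.mean((feats(X_val) @ coef - y_val) ** 2)
            units.append((err, coef, feats))
    units.sort(key=lambda u: u[0])       # weed out superfluous units
    best = units[:keep]
    outputs = np.column_stack([f(X_train) @ c for _, c, f in best])
    return best, outputs                 # outputs feed the next layer

Repeating this step grows layers until validation error stops improving, so the numbers of layers and of units per layer emerge in problem-dependent fashion.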

5.4 1979: Convolution + Weight Replication + Subsampling (Neocognitron)


Apart from deep GMDH networks (Sec. 5.3), the Neocognitron (Fukushima, 1979, 1980, 2013a)
was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate the
neurophysiological insights of Sec. 5.2. It introduced convolutional NNs (today often called CNNs or
convnets), where the (typically rectangular) receptive field of a convolutional unit with given weight
vector (a filter) is shifted step by step across a 2-dimensional array of input values, such as the pixels
of an image (usually there are several such filters). The resulting 2D array of subsequent activation
events of this unit can then provide inputs to higher-level units, and so on. Due to massive weight
replication (Sec. 2), relatively few parameters (Sec. 4.4) may be necessary to describe the behavior of
such a convolutional layer.
Subsampling or downsampling layers consist of units whose fixed-weight connections originate
from physical neighbours in the convolutional layers below. Subsampling units become active if at
least one of their inputs is active; their responses are insensitive to certain small image shifts (compare
Sec. 5.2).
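
A toy Python/NumPy sketch of these two mechanisms (an illustrative assumption, not Fukushima's code; here the subsampling unit takes the maximum of its inputs, so it stays active if at least one input is active):

import numpy as np

def convolve2d_valid(image, filt):
    """Shift one weight vector (filter) step by step across a 2-D input
    array; every position reuses the same filter (weight replication)."""
    H, W = image.shape
    h, w = filt.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * filt)
    return out

def subsample(act, size=2):
    """Fixed-weight downsampling over physical neighbours; responses are
    insensitive to small image shifts within each size x size block."""
    H, W = act.shape
    act = act[:H - H % size, :W - W % size]
    return act.reshape(H // size, size, W // size, size).max(axis=(1, 3))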
The Neocognitron is very similar to the architecture of modern, contest-winning, purely super-
vised, feedforward, gradient-based Deep Learners with alternating convolutional and downsampling
layers (e.g., Sec. 5.19–5.22). Fukushima, however, did not set the weights by supervised backpropa-
gation (Sec. 5.5, 5.8), but by local, WTA-based unsupervised learning rules (e.g., Fukushima, 2013b),
or by pre-wiring. In that sense he did not care for the DL problem (Sec. 5.9), although his architecture
was comparatively deep indeed. For downsampling purposes he used Spatial Averaging (Fukushima,
1980, 2011) instead of Max-Pooling (MP, Sec. 5.11), currently a particularly convenient and popular
WTA mechanism. Today’s DL combinations of CNNs and MP and BP also profit a lot from later
work (e.g., Sec. 5.8, 5.16, 5.19).

5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of
complex, nonlinear, differentiable (Leibniz, 1684), multi-stage, NN-related systems has been dis-
cussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961;
Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Direc-
tor and Rohrer, 1969), initially within the framework of Euler-Lagrange equations in the Calculus of
Variations (e.g., Euler, 1744).
Steepest descent in the weight space of such systems can be performed (Bryson, 1961; Kelley,
1960; Bryson and Ho, 1969) by iterating the chain rule (Leibniz, 1676; L’Hôpital, 1696) à la Dynamic
Programming (DP) (Bellman, 1957). A simplified derivation of this backpropagation method uses the
chain rule only (Dreyfus, 1962).
The systems of the 1960s were already efficient in the DP sense. However, they backpropagated
derivative information through standard Jacobian matrix calculations from one “layer” to the previous
one, without explicitly addressing either direct links across several layers or potential additional effi-
ciency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
Given all the prior work on learning in multilayer NN-like systems (see also Sec. 5.3 on deep non-
linear nets since 1965), it seems surprising in hindsight that a book (Minsky and Papert, 1969) on the
limitations of simple linear perceptrons with a single layer (Sec. 5.1) discouraged some researchers
from further studying NNs.
Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected,
NN-like networks apparently was first described in a 1970 master’s thesis (Linnainmaa, 1970, 1976),
albeit without reference to NNs. BP is also known as the reverse mode of automatic differentia-
tion (Griewank, 2012), where the costs of forward activation spreading essentially equal the costs of
backward derivative calculation. See early FORTRAN code (Linnainmaa, 1970) and closely related
work (Ostrovskii et al., 1971).
Efficient BP was soon explicitly used to minimize cost functions by adapting control parameters
(weights) (Dreyfus, 1973). Compare some preliminary, NN-specific discussion (Werbos, 1974, sec-
tion 5.5.1), a method for multilayer threshold NNs (Bobrowski, 1978), and a computer program for
automatically deriving and implementing BP for given differentiable systems (Speelpenning, 1980).
To my knowledge, the first NN-specific application of efficient BP as above was described in
1981 (Werbos, 1981, 2006). Related work was published several years later (Parker, 1985; LeCun,
1985, 1988). A paper of 1986 significantly contributed to the popularisation of BP for NNs (Rumelhart
et al., 1986), experimentally demonstrating the emergence of useful internal representations in hidden
layers. See generalisations for sequence-processing recurrent NNs (e.g., Williams, 1989; Robinson
and Fallside, 1987; Werbos, 1988; Williams and Zipser, 1988, 1989b,a; Rohwer, 1989; Pearlmutter,
1989; Gherrity, 1989; Williams and Peng, 1990; Schmidhuber, 1992a; Pearlmutter, 1995; Baldi, 1995;
Kremer and Kolen, 2001; Atiya and Parlos, 2000), also for equilibrium RNNs (Almeida, 1987; Pineda,
1987) with stationary inputs.

5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
Using the notation of Sec. 2 for weight-sharing FNNs or RNNs, after an episode of activation spreading
through differentiable $f_t$, a single iteration of gradient descent through BP computes changes of
all $w_i$ in proportion to $\frac{\partial E}{\partial w_i} = \sum_t \frac{\partial E}{\partial \mathrm{net}_t} \frac{\partial \mathrm{net}_t}{\partial w_i}$ as in Algorithm 5.5.1 (for the additive case), where each
weight $w_i$ is associated with a real-valued variable $\Delta_i$ initialized by 0.
The computational costs of the backward (BP) pass are essentially those of the forward pass
(Sec. 2). Forward and backward passes are re-iterated until sufficient performance is reached.

Alg. 5.5.1: One iteration of BP for weight-sharing FNNs or RNNs

for t = T, ..., 1 do
    to compute $\frac{\partial E}{\partial \mathrm{net}_t}$, initialize real-valued error signal variable $\delta_t$ by 0;
    if $x_t$ is an input event then continue with next iteration;
    if there is an error $e_t$ then $\delta_t := x_t - d_t$;
    add to $\delta_t$ the value $\sum_{k \in \mathrm{out}_t} w_{v(t,k)} \delta_k$; (this is the elegant and efficient recursive
    chain rule application collecting impacts of $\mathrm{net}_t$ on future events)
    multiply $\delta_t$ by $f_t'(\mathrm{net}_t)$;
    for all $k \in \mathrm{in}_t$ add to $\Delta_{v(k,t)}$ the value $x_k \delta_t$
end for
change each $w_i$ in proportion to $\Delta_i$ and a small real-valued learning rate
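
A direct Python/NumPy rendering of Alg. 5.5.1 might look as follows (a hypothetical sketch; the episode bookkeeping x, net, in_t, out_t, v follows the hypothetical run_episode sketch of Sec. 2, and d maps output event indices to targets):

import numpy as np

def bp_iteration(T, w, x, net, d, in_t, out_t, v, is_input, fprime, lr=0.01):
    """One iteration of BP for weight-sharing FNNs/RNNs (additive case)."""
    delta = np.zeros(T + 1)   # delta[t] accumulates dE/dnet_t
    Delta = np.zeros_like(w)  # Delta[i] accumulates the gradient for w[i]
    for t in range(T, 0, -1):
        if is_input[t]:
            continue          # input events carry no error signal
        if t in d:            # output event with teacher-given target d[t]
            delta[t] = x[t] - d[t]
        # recursive chain rule: collect impacts of net_t on future events
        delta[t] += sum(w[v[(t, k)]] * delta[k] for k in out_t.get(t, ()))
        delta[t] *= fprime(net[t])
        for k in in_t.get(t, ()):
            Delta[v[(k, t)]] += x[k] * delta[t]
    return w - lr * Delta     # change each w_i in proportion to Delta_i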

As of 2014, this simple BP method is still the central learning algorithm for FNNs and RNNs. No-
tably, most contest-winning NNs up to 2014 (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22) did not augment
supervised BP by some sort of unsupervised learning as discussed in Sec. 5.7, 5.10, 5.15.

5.6 Late 1980s-2000 and Beyond: Numerous Improvements of NNs


By the late 1980s it seemed clear that BP by itself (Sec. 5.5) was no panacea. Most FNN applications
focused on FNNs with few hidden layers. Additional hidden layers often did not seem to offer empiri-
cal benefits. Many practitioners found solace in a theorem (Kolmogorov, 1965a; Hecht-Nielsen, 1989;
Hornik et al., 1989) stating that an NN with a single layer of enough hidden units can approximate
any multivariate continuous function with arbitrary accuracy.
Likewise, most RNN applications did not require backpropagating errors far. Many researchers
helped their RNNs by first training them on shallow problems (Sec. 3) whose solutions then gener-
alized to deeper problems. In fact, some popular RNN algorithms restricted credit assignment to a
single step backwards (Elman, 1990; Jordan, 1986, 1997), also in more recent studies (Jaeger, 2001;
Maass et al., 2002; Jaeger, 2004).
Generally speaking, although BP allows for deep problems in principle, it seemed to work only
for shallow problems. The late 1980s and early 1990s saw a few ideas with a potential to overcome
this problem, which was fully understood only in 1991 (Sec. 5.9).

5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
To deal with long time lags between relevant events, several sequence processing methods were pro-
posed, including Focused BP based on decay factors for activations of units in RNNs (Mozer, 1989,
1992), Time-Delay Neural Networks (TDNNs) (Lang et al., 1990) and their adaptive extension (Bo-
denhausen and Waibel, 1991), Nonlinear AutoRegressive with eXogenous inputs (NARX) RNNs (Lin
et al., 1996), certain hierarchical RNNs (Hihi and Bengio, 1996) (compare Sec. 5.10, 1991), RL
economies in RNNs with WTA units and local learning rules (Schmidhuber, 1989b), and other meth-
ods (e.g., Ring, 1993, 1994; Plate, 1993; de Vries and Principe, 1991; Sun et al., 1993a; Bengio
et al., 1994). However, these algorithms either worked for shallow CAPs only, could not generalize
to unseen CAP depths, had problems with greatly varying time lags between relevant events, needed
external fine tuning of delay constants, or suffered from other problems. In fact, it turned out that
certain simple but deep benchmark problems used to evaluate such methods are more quickly solved
by randomly guessing RNN weights until a solution is found (Hochreiter and Schmidhuber, 1996).
While the RNN methods above were designed for DL of temporal sequences, the Neural Heat
Exchanger (Schmidhuber, 1990c) consists of two parallel deep FNNs with opposite flow directions.
Input patterns enter the first FNN and are propagated “up”. Desired outputs (targets) enter the “oppo-
site” FNN and are propagated “down”. Using a local learning rule, each layer in each net tries to be
similar (in information content) to the preceding layer and to the adjacent layer of the other net. The
input entering the first net slowly “heats up” to become the target. The target entering the opposite net
slowly “cools down” to become the input. The Helmholtz Machine (Dayan et al., 1995; Dayan and
Hinton, 1996) may be viewed as an unsupervised (Sec. 5.6.4) variant thereof (Peter Dayan, personal
communication, 1994).
A hybrid approach (Shavlik and Towell, 1989; Towell and Shavlik, 1994) initializes a poten-
tially deep FNN through a domain theory in propositional logic, which may be acquired through
explanation-based learning (Mitchell et al., 1986; DeJong and Mooney, 1986; Minton et al., 1989).
The NN is then fine-tuned through BP (Sec. 5.5). The NN’s depth reflects the longest chain of
reasoning in the original set of logical rules. An extension of this approach (Maclin and Shavlik,
1993; Shavlik, 1994) initializes an RNN by domain knowledge expressed as a Finite State Automa-
ton (FSA). BP-based fine-tuning has become important for later DL systems pre-trained by UL, e.g.,
Sec. 5.10, 5.15.

5.6.2 Better BP Through Advanced Gradient Descent (Compare Sec. 5.24)


Numerous improvements of steepest descent through BP (Sec. 5.5) have been proposed. Least-
squares methods (Gauss-Newton, Levenberg-Marquardt) (Gauss, 1809; Newton, 1687; Levenberg,
1944; Marquardt, 1963; Schaback and Werner, 1992) and quasi-Newton methods (Broyden-Fletcher-
Goldfarb-Shanno, BFGS) (Broyden et al., 1965; Fletcher and Powell, 1963; Goldfarb, 1970; Shanno,
1970) are computationally too expensive for large NNs. Partial BFGS (Battiti, 1992; Saito and
Nakano, 1997) and conjugate gradient (Hestenes and Stiefel, 1952; Møller, 1993) as well as other
methods (Solla, 1988; Schmidhuber, 1989a; Cauwenberghs, 1993) provide sometimes useful fast al-
ternatives. BP can be treated as a linear least-squares problem (Biegler-König and Bärmann, 1993),
where second-order gradient information is passed back to preceding layers.
To speed up BP, momentum was introduced (Rumelhart et al., 1986), ad-hoc constants were added
to the slope of the linearized activation function (Fahlman, 1988), or the nonlinearity of the slope was
exaggerated (West and Saad, 1995).
Only the signs of the error derivatives are taken into account by the successful and widely used
BP variant R-prop (Riedmiller and Braun, 1993) and the robust variation iRprop+ (Igel and Hüsken,
2003), which was also successfully applied to RNNs.
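
The core sign-based idea can be sketched as follows (a simplified, hypothetical sketch; the published R-prop and iRprop+ variants include extra bookkeeping, e.g., suppressing the update after a sign change):

import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """Adapt a per-weight step size from the signs of successive
    gradients; the gradient's magnitude is ignored entirely."""
    agree = grad * prev_grad
    step = np.where(agree > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(agree < 0, np.maximum(step * eta_minus, step_min), step)
    return w - np.sign(grad) * step, step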
The local gradient can be normalized based on the NN architecture (Schraudolph and Sejnowski,
1996), through a diagonalized Hessian approach (Becker and Le Cun, 1989), or related efficient meth-
ods (Schraudolph, 2002).
Some algorithms for controlling BP step size adapt a global learning rate (Lapedes and Farber,
1986; Vogl et al., 1988; Battiti, 1989; LeCun et al., 1993; Yu et al., 1995), while others compute in-
dividual learning rates for each weight (Jacobs, 1988; Silva and Almeida, 1990). In online learning,
where BP is applied after each pattern presentation, the vario-η algorithm (Neuneier and Zimmer-
mann, 1996) sets each weight’s learning rate inversely proportional to the empirical standard devia-
tion of its local gradient, thus normalizing the stochastic weight fluctuations. Compare a local online
step size adaptation method for nonlinear NNs (Almeida et al., 1997).
Many additional tricks for improving NNs have been described (e.g., Orr and Müller, 1998; Mon-
tavon et al., 2012). Compare Sec. 5.6.3 and recent developments mentioned in Sec. 5.24.

5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
Many researchers used BP-like methods to search for “simple,” low-complexity NNs (Sec. 4.4)
with high generalization capability. Most approaches address the bias/variance dilemma (Geman
et al., 1992) through strong prior assumptions. For example, weight decay (Hanson and Pratt, 1989;
Weigend et al., 1991; Krogh and Hertz, 1992) encourages near-zero weights, by penalizing large
weights. In a Bayesian framework (Bayes, 1763), weight decay can be derived (Hinton and van
Camp, 1993) from Gaussian or Laplacian weight priors (Gauss, 1809; Laplace, 1774); see also (Mur-
ray and Edwards, 1993). An extension of this approach postulates that a distribution of networks with
many similar weights generated by Gaussian mixtures is “better” a priori (Nowlan and Hinton, 1992).
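
For example, L2 weight decay corresponds to adding a penalty term $\frac{\lambda}{2} \sum_i w_i^2$ to the error, which amounts to one extra line per gradient step (a minimal hypothetical sketch):

def weight_decay_update(w, grad, lr=0.01, decay=1e-4):
    """Gradient step with L2 weight decay: large weights are penalized,
    biasing the search towards 'simple' solutions with near-zero weights."""
    return w - lr * (grad + decay * w)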
Often weight priors are implicit in additional penalty terms (MacKay, 1992) or in methods based
on validation sets (Mosteller and Tukey, 1968; Stone, 1974; Eubank, 1988; Hastie and Tibshirani,
1990; Craven and Wahba, 1979; Golub et al., 1979), Akaike’s information criterion and final pre-
diction error (Akaike, 1970, 1973, 1974), or generalized prediction error (Moody and Utans, 1994;
Moody, 1992). See also (Holden, 1994; Wang et al., 1994; Amari and Murata, 1993; Wang et al.,
1994; Guyon et al., 1992; Vapnik, 1992; Wolpert, 1994). Similar priors (or biases towards simplicity)
are implicit in constructive and pruning algorithms, e.g., layer-by-layer sequential network construc-
tion (e.g., Ivakhnenko, 1968, 1971; Ash, 1989; Moody, 1989; Gallant, 1988; Honavar and Uhr, 1988;
Ring, 1991; Fahlman, 1991; Weng et al., 1992; Honavar and Uhr, 1993; Burgess, 1994; Fritzke, 1994;
Parekh et al., 2000; Utgoff and Stracuzzi, 2002) (see also Sec. 5.3, 5.11), input pruning (Moody, 1992;
Refenes et al., 1994), unit pruning (e.g., Ivakhnenko, 1968, 1971; White, 1989; Mozer and Smolen-
sky, 1989; Levin et al., 1994), weight pruning, e.g., optimal brain damage (LeCun et al., 1990b), and
optimal brain surgeon (Hassibi and Stork, 1993).
A very general but not always practical approach for discovering low-complexity SL NNs or
RL NNs searches among weight matrix-computing programs written in a universal programming
language, with a bias towards fast and short programs (Schmidhuber, 1997) (Sec. 6.7).
Flat Minimum Search (FMS) (Hochreiter and Schmidhuber, 1997a, 1999) searches for a “flat”
minimum of the error function: a large connected region in weight space where error is low and re-
mains approximately constant, that is, few bits of information are required to describe low-precision
weights with high variance. Compare perturbation tolerance conditions (Minai and Williams, 1994;
Murray and Edwards, 1993; Hanson, 1990; Neti et al., 1992; Matsuoka, 1992; Bishop, 1993; Ker-
lirzin and Vallet, 1993; Carter et al., 1990). An MDL-based, Bayesian argument suggests that flat
minima correspond to “simple” NNs and low expected overfitting. Compare Sec. 5.6.4 and more
recent developments mentioned in Sec. 5.24.

5.6.4 Potential Benefits of UL for SL (Compare Sec. 5.7, 5.10, 5.15)


The notation of Sec. 2 introduced teacher-given labels dt . Many papers of the previous millennium,
however, were about unsupervised learning (UL) without a teacher (e.g., Hebb, 1949; von der Mals-
burg, 1973; Kohonen, 1972, 1982, 1988; Willshaw and von der Malsburg, 1976; Grossberg, 1976a,b;
Watanabe, 1985; Pearlmutter and Hinton, 1986; Barrow, 1987; Field, 1987; Oja, 1989; Barlow et al.,
1989; Baldi and Hornik, 1989; Sanger, 1989; Ritter and Kohonen, 1989; Rubner and Schulten, 1990;
Földiák, 1990; Martinetz et al., 1990; Kosko, 1990; Mozer, 1991; Palm, 1992; Atick et al., 1992;
Miller, 1994; Saund, 1994; Földiák and Young, 1995; Deco and Parra, 1997); see also post-2000
work (e.g., Carreira-Perpinan, 2001; Wiskott and Sejnowski, 2002; Franzius et al., 2007; Waydo and
Koch, 2008).
Many UL methods are designed to maximize entropy-related, information-theoretic (Boltzmann,
1909; Shannon, 1948; Kullback and Leibler, 1951) objectives (e.g., Linsker, 1988; Barlow et al., 1989;
MacKay and Miller, 1990; Plumbley, 1991; Schmidhuber, 1992b,c; Schraudolph and Sejnowski,
1993; Redlich, 1993; Zemel, 1993; Zemel and Hinton, 1994; Field, 1994; Hinton et al., 1995; Dayan
and Zemel, 1995; Amari et al., 1996; Deco and Parra, 1997).
Many do this to uncover and disentangle hidden underlying sources of signals (e.g., Jutten and
Herault, 1991; Schuster, 1992; Andrade et al., 1993; Molgedey and Schuster, 1994; Comon, 1994;
Cardoso, 1994; Bell and Sejnowski, 1995; Karhunen and Joutsensalo, 1995; Belouchrani et al., 1997;
Hyvärinen et al., 2001; Szabó et al., 2006; Shan et al., 2007; Shan and Cottrell, 2014).
Many UL methods automatically and robustly generate distributed, sparse representations of in-
put patterns (Földiák, 1990; Hinton and Ghahramani, 1997; Lewicki and Olshausen, 1998; Hyvärinen
et al., 1999; Hochreiter and Schmidhuber, 1999; Falconbridge et al., 2006) through well-known fea-
ture detectors (e.g., Olshausen and Field, 1996; Schmidhuber et al., 1996), such as off-center-on-
surround-like structures, as well as orientation sensitive edge detectors and Gabor filters (Gabor,
1946). They extract simple features related to those observed in early visual pre-processing stages
of biological systems (e.g., De Valois et al., 1982; Jones and Palmer, 1987).
UL can also serve to extract invariant features from different data items (e.g., Becker, 1991)
through coupled NNs observing two different inputs (Schmidhuber and Prelinger, 1992), also called
Siamese NNs (e.g., Bromley et al., 1993; Hadsell et al., 2006; Taylor et al., 2011; Chen and Salman,
2011).
UL can help to encode input data in a form advantageous for further processing. In the context
of DL, one important goal of UL is redundancy reduction. Ideally, given an ensemble of input pat-
terns, redundancy reduction through a deep NN will create a factorial code (a code with statistically
independent components) of the ensemble (Barlow et al., 1989; Barlow, 1989), to disentangle the
unknown factors of variation (compare Bengio et al., 2013). Such codes may be sparse and can be
advantageous for (1) data compression, (2) speeding up subsequent BP (Becker, 1991), (3) trivialising
the task of subsequent naive yet optimal Bayes classifiers (Schmidhuber et al., 1996).
Most early UL FNNs had a single layer. Methods for deeper UL FNNs include hierarchical
(Sec. 4.3) self-organizing Kohonen maps (e.g., Koikkalainen and Oja, 1990; Lampinen and Oja, 1992;
Versino and Gambardella, 1996; Dittenbach et al., 2000; Rauber et al., 2002), hierarchical Gaussian
potential function networks (Lee and Kil, 1991), layer-wise UL of feature hierarchies fed into SL
classifiers (Behnke, 1999, 2003a), the Self-Organising Tree Algorithm (SOTA) (Herrero et al., 2001),
and nonlinear Autoencoders (AEs) with more than 3 (e.g., 5) layers (Kramer, 1991; Oja, 1991; DeMers
and Cottrell, 1993). Such AE NNs (Rumelhart et al., 1986) can be trained to map input patterns
to themselves, for example, by compactly encoding them through activations of units of a narrow
bottleneck hidden layer. Certain nonlinear AEs, however, suffer from limitations (Baldi, 2012).
LOCOCODE (Hochreiter and Schmidhuber, 1999) uses FMS (Sec. 5.6.3) to find low-complexity
AEs with low-precision weights describable by few bits of information, often producing sparse or
factorial codes. Predictability Minimization (PM) (Schmidhuber, 1992c) searches for factorial codes
through nonlinear feature detectors that fight nonlinear predictors, trying to become both as infor-
mative and as unpredictable as possible. PM-based UL was applied not only to FNNs but also to
RNNs (e.g., Schmidhuber, 1993b; Lindstädt, 1993). Compare Sec. 5.10 on UL-based RNN stacks
(1991), as well as later UL RNNs (e.g., Klapper-Rybicka et al., 2001; Steil, 2007).

5.7 1987: UL Through Autoencoder (AE) Hierarchies (Compare Sec. 5.15)


Perhaps the first work to study potential benefits of UL-based pre-training was published in 1987. It
proposed unsupervised AE hierarchies (Ballard, 1987), closely related to certain post-2000 feedfor-
ward Deep Learners based on UL (Sec. 5.15). The lowest-level AE NN with a single hidden layer is
trained to map input patterns to themselves. Its hidden layer codes are then fed into a higher-level AE
of the same type, and so on. The hope is that the codes in the hidden AE layers have properties that
facilitate subsequent learning. In one experiment, a particular AE-specific learning algorithm (dif-
ferent from traditional BP of Sec. 5.5.1) was used to learn a mapping in an AE stack pre-trained by
this type of UL (Ballard, 1987). This was faster than learning an equivalent mapping by BP through
a single deeper AE without pre-training. On the other hand, the task did not really require a deep
AE, that is, the benefits of UL were not that obvious from this experiment. Compare an early sur-
vey (Hinton, 1989) and the somewhat related Recursive Auto-Associative Memory (RAAM) (Pollack,
1988, 1990; Melnik et al., 2000), originally used to encode sequential linguistic structures of arbitrary
size through a fixed number of hidden units. More recently, RAAMs were also used as unsupervised
pre-processors to facilitate deep credit assignment for RL (Gisslen et al., 2011) (Sec. 6.4).
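To make the stacking procedure concrete, here is a minimal NumPy sketch of greedy layer-wise AE pre-training; the tied weights, sigmoid hidden units, squared reconstruction error, layer sizes and learning rate are illustrative assumptions, not Ballard's original algorithm:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(X, n_hidden, lr=0.1, epochs=200, seed=0):
    # Train one tied-weight AE to reconstruct X; return its hidden codes.
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, (X.shape[1], n_hidden))
    for _ in range(epochs):
        H = sigmoid(X @ W)                          # encode
        E = H @ W.T - X                             # linear decode; reconstruction error
        dH = (E @ W) * H * (1.0 - H)                # backprop into the hidden layer
        W -= lr * (X.T @ dH + E.T @ H) / len(X)     # tied-weight gradient
    return sigmoid(X @ W)

# Greedy stacking: each AE's hidden code is the next AE's training data.
codes = np.random.default_rng(1).random((100, 64))  # toy "input patterns"
for n_hidden in (32, 16, 8):
    codes = train_ae(codes, n_hidden)
print(codes.shape)                                  # (100, 8): the top-level code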
In principle, many UL methods (Sec. 5.6.4) could be stacked like the AEs above, the history-
compressing RNNs of Sec. 5.10, the Restricted Boltzmann Machines (RBMs) of Sec. 5.15, or hi-
erarchical Kohonen nets (Sec. 5.6.4), to facilitate subsequent SL. Compare Stacked Generaliza-
tion (Wolpert, 1992; Ting and Witten, 1997), and FNNs that profit from pre-training by competitive
UL (e.g., Rumelhart and Zipser, 1986) prior to BP-based fine-tuning (Maclin and Shavlik, 1995). See
also more recent methods using UL to improve subsequent SL (e.g., Behnke, 1999, 2003a; Escalante-
B. and Wiskott, 2013).

5.8 1989: BP for Convolutional NNs (CNNs, Sec. 5.4)


In 1989, backpropagation (Sec. 5.5) was applied (LeCun et al., 1989, 1990a, 1998) to Neocognitron-
like, weight-sharing, convolutional neural layers (Sec. 5.4) with adaptive connections. This combi-
nation, augmented by Max-Pooling (MP, Sec. 5.11, 5.16), and sped up on graphics cards (Sec. 5.19),
has become an essential ingredient of many modern, competition-winning, feedforward, visual Deep
Learners (Sec. 5.19–5.23). This work also introduced the MNIST data set of handwritten digits (Le-
Cun et al., 1989), which over time has become perhaps the most famous benchmark of Machine
Learning. CNNs helped to achieve good performance on MNIST (LeCun et al., 1990a) (CAP depth
5) and on fingerprint recognition (Baldi and Chauvin, 1993); similar CNNs were used commercially
in the 1990s.

5.9 1991: Fundamental Deep Learning Problem of Gradient Descent


A diploma thesis (Hochreiter, 1991) represented a milestone of explicit DL research. As mentioned
in Sec. 5.6, by the late 1980s, experiments had indicated that traditional deep feedforward or re-
current networks are hard to train by backpropagation (BP) (Sec. 5.5). Hochreiter’s work formally
identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or
exploding gradients. With standard activation functions (Sec. 1), cumulative backpropagated error
signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in
the number of layers or CAP depth (Sec. 3), or they explode. This is also known as the long time
lag problem. Much subsequent DL research of the 1990s and 2000s was motivated by this insight.
Later work (Bengio et al., 1994) also studied basins of attraction and their stability under noise from a
dynamical systems point of view: either the dynamics are not robust to noise, or the gradients vanish.
See also (Hochreiter et al., 2001a; Tiňo and Hammer, 2004). Over the years, several ways of partially
overcoming the Fundamental Deep Learning Problem were explored:

I A Very Deep Learner of 1991 (the History Compressor, Sec. 5.10) alleviates the problem
through unsupervised pre-training for a hierarchy of RNNs. This greatly facilitates subsequent
supervised credit assignment through BP (Sec. 5.5). In the FNN case, similar effects can be
achieved through conceptually related AE stacks (Sec. 5.7, 5.15) and Deep Belief Networks
(DBNs, Sec. 5.15).
II LSTM-like networks (Sec. 5.13, 5.16, 5.17, 5.21–5.23) alleviate the problem through a special
architecture unaffected by it.
III Today’s GPU-based computers have a million times the computational power of desktop ma-
chines of the early 1990s. This allows for propagating errors a few layers further down within
reasonable time, even in traditional NNs (Sec. 5.18). That is basically what is winning many of
the image recognition competitions now (Sec. 5.19, 5.21, 5.22). (Although this does not really
overcome the problem in a fundamental way.)
IV Hessian-free optimization (Sec. 5.6.2) can alleviate the problem for FNNs (Møller, 1993;
Pearlmutter, 1994; Schraudolph, 2002; Martens, 2010) (Sec. 5.6.2) and RNNs (Martens and
Sutskever, 2011) (Sec. 5.20).
V The space of NN weight matrices can also be searched without relying on error gradients,
thus avoiding the Fundamental Deep Learning Problem altogether. Random weight guessing
sometimes works better than more sophisticated methods (Hochreiter and Schmidhuber, 1996).
Certain more complex problems are better solved by using Universal Search (Levin, 1973b)
for weight matrix-computing programs written in a universal programming language (Schmid-
huber, 1997). Some are better solved by using linear methods to obtain optimal weights for
connections to output events (Sec. 2), and evolving weights of connections to other events—
this is called Evolino (Schmidhuber et al., 2007). Compare also related RNNs pre-trained by
certain UL rules (Steil, 2007), also in the case of spiking neurons (Yin et al., 2012; Klampfl and
Maass, 2013) (Sec. 5.26). Direct search methods are relevant not only for SL but also for more
general RL, and are discussed in more detail in Sec. 6.6.

5.10 1991: UL-Based History Compression Through a Deep Stack of RNNs


A working Very Deep Learner (Sec. 3) of 1991 (Schmidhuber, 1992b, 2013a) could perform credit as-
signment across hundreds of nonlinear operators or neural layers, by using unsupervised pre-training
for a hierarchy of RNNs.
The basic idea is still relevant today. Each RNN is trained for a while in unsupervised fashion to
predict its next input (e.g., Connor et al., 1994; Dorffner, 1996). From then on, only unexpected inputs
(errors) convey new information and get fed to the next higher RNN which thus ticks on a slower, self-
organising time scale. It can easily be shown that no information gets lost. It just gets compressed
(much of machine learning is essentially about compression, e.g., Sec. 4.4, 5.6.3, 6.7). For each
individual input sequence, we get a series of less and less redundant encodings in deeper and deeper
levels of this History Compressor or Neural Sequence Chunker, which can compress data in both
space (like feedforward NNs) and time. This is another good example of hierarchical representation
learning (Sec. 4.3). There also is a continuous variant of the history compressor (Schmidhuber et al.,
1993).
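The chunking principle can be sketched in a few lines of Python; below, a toy table-based next-symbol predictor stands in for a trained RNN predictor, and the rule it illustrates—pass a symbol up only if it was mispredicted—is the point:

def compress(seq, predictor):
    # Pass only mispredicted symbols (with their positions) up to the
    # next level; correctly predicted symbols carry no new information,
    # so the stream remains reconstructible in principle.
    out, prev = [], None
    for t, s in enumerate(seq):
        if predictor.get(prev) != s:
            out.append((t, s))
        predictor[prev] = s                # learn: after prev, predict s
        prev = s
    return out

seq = list("abababababcabababab")
level1 = compress(seq, {})                 # only 6 of 19 symbols are unexpected
print(level1)   # [(0,'a'), (1,'b'), (2,'a'), (10,'c'), (11,'a'), (13,'a')]
# Stacking repeats this: level1's symbol stream becomes the next level's input.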
The RNN stack is essentially a deep generative model of the data, which can be reconstructed from
its compressed form. Adding another RNN to the stack improves a bound on the data’s description
length—equivalent to the negative logarithm of its probability (Huffman, 1952; Shannon, 1948)—as
long as there is remaining local learnable predictability in the data representation on the corresponding
level of the hierarchy. Compare a similar observation for feedforward Deep Belief Networks (DBNs,
2006, Sec. 5.15).
The system was able to learn many previously unlearnable DL tasks. One ancient illustrative
DL experiment (Schmidhuber, 1993b) required CAPs (Sec. 3) of depth 1200. The top level code of
the initially unsupervised RNN stack, however, got so compact that (previously infeasible) sequence
classification through additional BP-based SL became possible. Essentially the system used UL to
greatly reduce problem depth. Compare earlier BP-based fine-tuning of NNs initialized by rules of
propositional logic (Shavlik and Towell, 1989) (Sec. 5.6.1).
There is a way of compressing higher levels down into lower levels, thus fully or partially col-
lapsing the RNN stack. The trick is to retrain a lower-level RNN to continually imitate (predict) the
hidden units of an already trained, slower, higher-level RNN (the “conscious” chunker), through ad-
ditional predictive output neurons (Schmidhuber, 1992b). This helps the lower RNN (the automatizer)
to develop appropriate, rarely changing memories that may bridge very long time lags. Again, this
procedure can greatly reduce the required depth of the BP process.
The 1991 system was a working Deep Learner in the modern post-2000 sense, and also a first
Neural Hierarchical Temporal Memory (HTM). It is conceptually similar to earlier AE hierarchies
(1987, Sec. 5.7) and later Deep Belief Networks (2006, Sec. 5.15), but more general in the sense
that it uses sequence-processing RNNs instead of FNNs with unchanging inputs. More recently,
well-known entrepreneurs (Hawkins and George, 2006; Kurzweil, 2012) also got interested in HTMs;
compare also hierarchical HMMs (e.g., Fine et al., 1998), as well as later UL-based recurrent sys-
tems (Klapper-Rybicka et al., 2001; Steil, 2007; Klampfl and Maass, 2013; Young et al., 2014).
Clockwork RNNs (Koutník et al., 2014) also consist of interacting RNN modules with different clock
rates, but do not use UL to set those rates. Stacks of RNNs were used in later work on SL with great
success, e.g., Sec. 5.13, 5.16, 5.17, 5.22.

5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
The Neocognitron (Sec. 5.4) inspired the Cresceptron (Weng et al., 1992), which adapts its topol-
ogy during training (Sec. 5.6.3); compare the incrementally growing and shrinking GMDH networks
(1965, Sec. 5.3).
Instead of using alternative local subsampling or WTA methods (e.g., Fukushima, 1980; Schmid-
huber, 1989b; Maass, 2000; Fukushima, 2013a), the Cresceptron uses Max-Pooling (MP) layers. Here
a 2-dimensional layer or array of unit activations is partitioned into smaller rectangular arrays. Each
is replaced in a downsampling layer by the activation of its maximally active unit. A later, more com-
plex version of the Cresceptron (Weng et al., 1997) also included “blurring” layers to improve object
location tolerance.
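A minimal NumPy sketch of this downsampling step (non-overlapping k-by-k pooling windows; strides, padding, and the Cresceptron's remaining machinery are omitted):

import numpy as np

def max_pool(a, k):
    # Partition a 2-D activation array into non-overlapping k-by-k blocks
    # and keep only each block's maximally active unit.
    h, w = a.shape
    return a[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

a = np.arange(16.0).reshape(4, 4)
print(max_pool(a, 2))   # [[ 5.  7.]
                        #  [13. 15.]]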
The neurophysiologically plausible topology of the feedforward HMAX model (Riesenhuber and
Poggio, 1999) is very similar to the one of the 1992 Cresceptron (and thus to the 1979 Neocognitron).
HMAX does not learn though. Its units have hand-crafted weights; biologically plausible learning
rules were later proposed for similar models (e.g., Serre et al., 2002; Teichmann et al., 2012).
When CNNs or convnets (Sec. 5.4, 5.8) are combined with MP, they become Cresceptron-like
or HMAX-like MPCNNs with alternating convolutional and max-pooling layers. Unlike Cresceptron
and HMAX, however, MPCNNs are trained by BP (Sec. 5.5, 5.16) (Ranzato et al., 2007). Advantages
of doing this were pointed out subsequently (Scherer et al., 2010). BP-trained MPCNNs have become
central to many modern, competition-winning, feedforward, visual Deep Learners (Sec. 5.17, 5.19–
5.23).

5.12 1994: Early Contest-Winning NNs


Back in the 1990s, certain NNs already won certain controlled pattern recognition contests with secret
test sets. Notably, an NN with internal delay lines won the Santa Fe time-series competition on chaotic
intensity pulsations of an NH3 laser (Wan, 1994; Weigend and Gershenfeld, 1993). No very deep
CAPs (Sec. 3) were needed though.
5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
Supervised Long Short-Term Memory (LSTM) RNN (Hochreiter and Schmidhuber, 1997b; Gers et al.,
2000; Pérez-Ortiz et al., 2003) could eventually perform similar feats as the deep RNN hierarchy
of 1991 (Sec. 5.10), overcoming the Fundamental Deep Learning Problem (Sec. 5.9) without any
unsupervised pre-training. LSTM could also learn DL tasks without local sequence predictability
(and thus unlearnable by the partially unsupervised 1991 History Compressor, Sec. 5.10), dealing
with very deep problems (Sec. 3) (e.g., Gers et al., 2002).
The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels
(CECs). Each CEC uses the identity function as its activation function f and has a connection to itself with a fixed weight of 1.0. Due to f’s constant derivative of 1.0, errors backpropagated through a CEC
cannot vanish or explode (Sec. 5.9) but stay as they are (unless they “flow out” of the CEC to other,
typically adaptive parts of the NN). CECs are connected to several nonlinear adaptive units (some
with multiplicative activation functions) needed for learning nonlinear behavior. Weight changes of
these units often profit from error signals propagated far back in time through CECs. CECs are
the main reason why LSTM nets can learn to discover the importance of (and memorize) events that
happened thousands of discrete time steps ago, whereas previous RNNs already failed at minimal time lags of only 10 steps.
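The effect of a CEC on backpropagated error signals can be illustrated in a few lines of Python; the self-weight 1.0 with activation slope 1.0 is the CEC itself, while the contrasting weight 0.5 and slope 0.25 are illustrative stand-ins for an ordinary sigmoid unit:

def backprop_error(delta, steps, weight, slope):
    # Chain rule through `steps` recurrent time steps: each step
    # multiplies the error by (self-weight * activation slope).
    for _ in range(steps):
        delta *= weight * slope
    return delta

print(backprop_error(1.0, 100, weight=1.0, slope=1.0))    # CEC: error stays 1.0
print(backprop_error(1.0, 100, weight=0.5, slope=0.25))   # sigmoid-like: vanishes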
Many different LSTM variants and topologies are allowed. It is possible to evolve good problem-
specific topologies (Bayer et al., 2009). Some LSTM variants also use modifiable self-connections of
CECs (Gers and Schmidhuber, 2001).
To a certain extent, LSTM is biologically plausible (O’Reilly, 2003). LSTM learned to solve
many previously unlearnable DL tasks involving: Recognition of the temporal order of widely sep-
arated events in noisy input streams; Robust storage of high-precision real numbers across extended
time intervals; Arithmetic operations on continuous input streams; Extraction of information con-
veyed by the temporal distance between events; Recognition of temporally extended patterns in noisy
input sequences (Hochreiter and Schmidhuber, 1997b; Gers et al., 2000); Stable generation of pre-
cisely timed rhythms, as well as smooth and non-smooth periodic trajectories (Gers and Schmidhuber,
2000). LSTM clearly outperformed previous RNNs on tasks that require learning the rules of regu-
lar languages describable by deterministic Finite State Automata (FSAs) (Watrous and Kuhn, 1992;
Casey, 1996; Siegelmann, 1992; Blair and Pollack, 1997; Kalinke and Lehmann, 1998; Zeng et al.,
1994; Manolios and Fanelli, 1994; Omlin and Giles, 1996; Vahed and Omlin, 2004), both in terms of
reliability and speed.
LSTM also worked on tasks involving context free languages (CFLs) that cannot be represented
by HMMs or similar FSAs discussed in the RNN literature (Sun et al., 1993b; Wiles and Elman, 1995;
Andrews et al., 1995; Steijvers and Grunwald, 1996; Tonkes and Wiles, 1997; Rodriguez et al., 1999;
Rodriguez and Wiles, 1998). CFL recognition (Lee, 1996) requires the functional equivalent of a run-
time stack. Some previous RNNs failed to learn small CFL training sets (Rodriguez and Wiles, 1998).
Those that did not (Rodriguez et al., 1999; Bodén and Wiles, 2000) failed to extract the general rules,
and did not generalize well on substantially larger test sets. The same holds for context-sensitive languages (CSLs) (e.g., Chalup and Blair, 2003). LSTM generalized well though, requiring only the 30 shortest exemplars ($n \leq 10$) of the CSL $a^n b^n c^n$ to correctly predict the possible continuations of sequence prefixes for $n$ up to 1000 and more. A combination of a decoupled extended Kalman filter (Kalman,
1960; Williams, 1992b; Puskorius and Feldkamp, 1994; Feldkamp et al., 1998; Haykin, 2001; Feld-
kamp et al., 2003) and an LSTM RNN (Pérez-Ortiz et al., 2003) learned to deal correctly with values
of n up to 10 million and more. That is, after training the network was able to read sequences of
30,000,000 symbols and more, one symbol at a time, and finally detect the subtle differences be-
tween legal strings such as $a^{10,000,000} b^{10,000,000} c^{10,000,000}$ and very similar but illegal strings such as $a^{10,000,000} b^{9,999,999} c^{10,000,000}$. Compare also more recent RNN algorithms able to deal with long
time lags (Schäfer et al., 2006; Martens and Sutskever, 2011; Zimmermann et al., 2012; Koutník et al.,
2014).
Bi-directional RNNs (BRNNs) (Schuster and Paliwal, 1997; Schuster, 1999) are designed for in-
put sequences whose starts and ends are known in advance, such as spoken sentences to be labeled by
their phonemes; compare (Fukada et al., 1999). To take both past and future context of each sequence
element into account, one RNN processes the sequence from start to end, the other backwards from
end to start. At each time step their combined outputs predict the corresponding label (if there is
any). BRNNs were successfully applied to secondary protein structure prediction (Baldi et al., 1999).
DAG-RNNs (Baldi and Pollastri, 2003; Wu and Baldi, 2008) generalize BRNNs to multiple dimen-
sions. They learned to predict properties of small organic molecules (Lusci et al., 2013) as well as
protein contact maps (Tegge et al., 2009), also in conjunction with a growing deep FNN (Di Lena
et al., 2012) (Sec. 5.21). BRNNs and DAG-RNNs unfold their full potential when combined with the
LSTM concept (Graves and Schmidhuber, 2005, 2009; Graves et al., 2009).
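A minimal NumPy sketch of the BRNN combination scheme follows; for brevity, both directions share one set of random, untrained weights here, whereas real BRNNs train separate parameters for each direction:

import numpy as np

def rnn_states(X, W_in, W_rec):
    # Plain tanh RNN; returns the hidden state at every time step.
    h, states = np.zeros(W_rec.shape[0]), []
    for x in X:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return np.array(states)

def brnn_features(X, W_in, W_rec):
    # One pass from start to end, one from end to start; per step,
    # concatenate both so each element sees past and future context.
    fwd = rnn_states(X, W_in, W_rec)
    bwd = rnn_states(X[::-1], W_in, W_rec)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))   # a length-7 sequence of 3-dimensional inputs
feats = brnn_features(X, 0.3 * rng.normal(size=(5, 3)), 0.3 * rng.normal(size=(5, 5)))
print(feats.shape)            # (7, 10): forward + backward context per element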
Particularly successful in recent competitions are stacks (Sec. 5.10) of LSTM RNNs (Fernan-
dez et al., 2007; Graves and Schmidhuber, 2009) trained by Connectionist Temporal Classifica-
tion (CTC) (Graves et al., 2006), a gradient-based method for finding RNN weights that maxi-
mize the probability of teacher-given label sequences, given (typically much longer and more high-
dimensional) streams of real-valued input vectors. CTC-LSTM performs simultaneous segmentation
(alignment) and recognition (Sec. 5.22).
In the early 2000s, speech recognition was dominated by HMMs combined with FNNs (e.g.,
Bourlard and Morgan, 1994). Nevertheless, when trained from scratch on utterances from the TIDIG-
ITS speech database, in 2003 LSTM already obtained results comparable to those of HMM-based
systems (Graves et al., 2003; Beringer et al., 2005; Graves et al., 2006). In 2007, LSTM outperformed
HMMs in keyword spotting tasks (Fernández et al., 2007); compare recent improvements (Indermuhle
et al., 2011; Wöllmer et al., 2013). By 2013, LSTM also achieved best known results on the famous
TIMIT phoneme recognition benchmark (Graves et al., 2013) (Sec. 5.22). Recently, LSTM RNN /
HMM hybrids obtained best known performance on medium-vocabulary (Geiger et al., 2014) and
large-vocabulary speech recognition (Sak et al., 2014a).
LSTM is also applicable to robot localization (Förster et al., 2007), robot control (Mayer et al.,
2008), online driver distraction detection (Wöllmer et al., 2011), and many other tasks. For example,
it helped to improve the state of the art in diverse applications such as protein analysis (Hochreiter
and Obermayer, 2005), handwriting recognition (Graves et al., 2008, 2009; Graves and Schmidhuber,
2009; Bluche et al., 2014), voice activity detection (Eyben et al., 2013), optical character recogni-
tion (Breuel et al., 2013), language identification (Gonzalez-Dominguez et al., 2014), prosody contour
prediction (Fernandez et al., 2014), audio onset detection (Marchi et al., 2014), text-to-speech syn-
thesis (Fan et al., 2014), social signal classification (Brueckner and Schulter, 2014), machine transla-
tion (Sutskever et al., 2014), and others.
RNNs can also be used for metalearning (Schmidhuber, 1987; Schaul and Schmidhuber, 2010;
Prokhorov et al., 2002), because they can in principle learn to run their own weight change algo-
rithm (Schmidhuber, 1993a). A successful metalearner (Hochreiter et al., 2001b) used an LSTM
RNN to quickly learn a learning algorithm for quadratic functions (compare Sec. 6.8).
Recently, LSTM RNNs won several international pattern recognition competitions and set nu-
merous benchmark records on large and complex data sets, e.g., Sec. 5.17, 5.21, 5.22. Gradient-
based LSTM is no panacea though—other methods sometimes outperformed it at least on certain
tasks (Jaeger, 2004; Schmidhuber et al., 2007; Martens and Sutskever, 2011; Pascanu et al., 2013b;
Koutník et al., 2014); compare Sec. 5.20.
5.14 2003: More Contest-Winning/Record-Setting NNs; Successful Deep NNs
In the decade around 2000, many practical and commercial pattern recognition applications were
dominated by non-neural machine learning methods such as Support Vector Machines (SVMs) (Vap-
nik, 1995; Schölkopf et al., 1998). Nevertheless, at least in certain domains, NNs outperformed other
techniques.
A Bayes NN (Neal, 2006) based on an ensemble (Breiman, 1996; Schapire, 1990; Wolpert, 1992;
Hashem and Schmeiser, 1992; Ueda, 2000; Dietterich, 2000a) of NNs won the NIPS 2003 Feature
Selection Challenge with secret test set (Neal and Zhang, 2006). The NN was not very deep though—
it had two hidden layers and thus rather shallow CAPs (Sec. 3) of depth 3.
Important for many present competition-winning pattern recognisers (Sec. 5.19, 5.21, 5.22) were
developments in the CNN department. A BP-trained (LeCun et al., 1989) CNN (Sec. 5.4, Sec. 5.8) set
a new MNIST record of 0.4% (Simard et al., 2003), using training pattern deformations (Baird, 1990)
but no unsupervised pre-training (Sec. 5.7, 5.10, 5.15). A standard BP net achieved 0.7% (Simard
et al., 2003). Again, the corresponding CAP depth was low. Compare further improvements in
Sec. 5.16, 5.18, 5.19.
Good image interpretation results (Behnke, 2003b) were achieved with rather deep NNs trained
by the BP variant R-prop (Riedmiller and Braun, 1993) (Sec. 5.6.2); here feedback through recurrent
connections helped to improve image interpretation. FNNs with CAP depth up to 6 were used to
successfully classify high-dimensional data (Vieira and Barradas, 2003).
Deep LSTM RNNs started to obtain certain first speech recognition results comparable to those
of HMM-based systems (Graves et al., 2003); compare Sec. 5.13, 5.16, 5.21, 5.22.

5.15 2006/7: UL For Deep Belief Networks / AE Stacks Fine-Tuned by BP


While learning networks with numerous non-linear layers date back at least to 1965 (Sec. 5.3), and ex-
plicit DL research results have been published at least since 1991 (Sec. 5.9, 5.10), the expression Deep
Learning was actually coined around 2006, when unsupervised pre-training of deep FNNs helped to
accelerate subsequent SL through BP (Hinton and Salakhutdinov, 2006; Hinton et al., 2006). Compare
earlier terminology on loading deep networks (Síma, 1994; Windisch, 2005) and learning deep mem-
ories (Gomez and Schmidhuber, 2005). Compare also BP-based (Sec. 5.5) fine-tuning (Sec. 5.6.1) of
(not so deep) FNNs pre-trained by competitive UL (Maclin and Shavlik, 1995).
The Deep Belief Network (DBN) is a stack of Restricted Boltzmann Machines (RBMs) (Smolen-
sky, 1986), which in turn are Boltzmann Machines (BMs) (Hinton and Sejnowski, 1986) with a single
layer of feature-detecting units; compare also Higher-Order BMs (Memisevic and Hinton, 2010).
Each RBM perceives pattern representations from the level below and learns to encode them in un-
supervised fashion. At least in theory under certain assumptions, adding more layers improves a
bound on the data’s negative log probability (Hinton et al., 2006) (equivalent to the data’s description
length—compare the corresponding observation for RNN stacks, Sec. 5.10). There are extensions for
Temporal RBMs (Sutskever et al., 2008).
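For concreteness, here is a minimal NumPy sketch of CD-1 for a single RBM layer, with Bernoulli units, one Gibbs step, biases omitted, and made-up sizes—a simplification of the full algorithm:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, lr=0.1):
    # One Contrastive Divergence step: compare data-driven and
    # one-step-reconstruction-driven visible/hidden correlations.
    h0 = sigmoid(v0 @ W)                            # visible -> hidden probabilities
    h_sample = (rng.random(h0.shape) < h0) * 1.0    # sample binary hidden states
    v1 = sigmoid(h_sample @ W.T)                    # reconstruct the visibles
    h1 = sigmoid(v1 @ W)
    return W + lr * (v0.T @ h0 - v1.T @ h1) / len(v0)

V = (rng.random((20, 6)) < 0.5) * 1.0               # toy binary "data"
W = rng.normal(0.0, 0.1, (6, 4))
for _ in range(100):
    W = cd1_update(V, W)
H = sigmoid(V @ W)   # hidden probabilities: the next RBM's training data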
Without any training pattern deformations (Sec. 5.14), a DBN fine-tuned by BP achieved 1.2%
error rate (Hinton and Salakhutdinov, 2006) on the MNIST handwritten digits (Sec. 5.8, 5.14). This
result helped to arouse interest in DBNs. DBNs also achieved good results on phoneme recognition,
with an error rate of 26.7% on the TIMIT core test set (Mohamed and Hinton, 2010); compare further
improvements through FNNs (Hinton et al., 2012a; Deng and Yu, 2014) and LSTM RNNs (Sec. 5.22).
A DBN-based technique called Semantic Hashing (Salakhutdinov and Hinton, 2009) maps se-
mantically similar documents (of variable size) to nearby addresses in a space of document rep-
resentations. It outperformed previous searchers for similar documents, such as Locality Sensitive
Hashing (Buhler, 2001; Datar et al., 2004). See the RBM/DBN tutorial (Fischer and Igel, 2014).
Autoencoder (AE) stacks (Ballard, 1987) (Sec. 5.7) became a popular alternative way of pre-
training deep FNNs in unsupervised fashion, before fine-tuning (Sec. 5.6.1) them through BP
(Sec. 5.5) (Bengio et al., 2007; Vincent et al., 2008; Erhan et al., 2010). Sparse coding (Sec. 5.6.4)
was formulated as a combination of convex optimization problems (Lee et al., 2007a). Recent surveys
of stacked RBM and AE methods focus on post-2006 developments (Bengio, 2009; Arel et al., 2010).
Unsupervised DBNs and AE stacks are conceptually similar to, but in a certain sense less general
than, the unsupervised RNN stack-based History Compressor of 1991 (Sec. 5.10), which can process
and re-encode not only stationary input patterns, but entire pattern sequences.

5.16 2006/7: Improved CNNs / GPU-CNNs / BP for MPCNNs / LSTM Stacks


Also in 2006, a BP-trained (LeCun et al., 1989) CNN (Sec. 5.4, Sec. 5.8) set a new MNIST record
of 0.39% (Ranzato et al., 2006), using training pattern deformations (Sec. 5.14) but no unsupervised
pre-training. Compare further improvements in Sec. 5.18, 5.19. Similar CNNs were used for off-
road obstacle avoidance (LeCun et al., 2006). A combination of CNNs and TDNNs later learned to
map fixed-size representations of variable-size sentences to features relevant for language processing,
using a combination of SL and UL (Collobert and Weston, 2008).
2006 also saw an early GPU-based CNN implementation (Chellapilla et al., 2006) up to 4 times
faster than CPU-CNNs; compare also earlier GPU implementations of standard FNNs with a reported
speed-up factor of 20 (Oh and Jung, 2004). GPUs or graphics cards have become more and more
important for DL in subsequent years (Sec. 5.18–5.22).
In 2007, BP (Sec. 5.5) was applied for the first time (Ranzato et al., 2007) to Neocognitron-
inspired (Sec. 5.4), Cresceptron-like (or HMAX-like) MPCNNs (Sec. 5.11) with alternating convo-
lutional and max-pooling layers. BP-trained MPCNNs have become an essential ingredient of many
modern, competition-winning, feedforward, visual Deep Learners (Sec. 5.17, 5.19–5.23).
Also in 2007, hierarchical stacks of LSTM RNNs were introduced (Fernandez et al., 2007). They
can be trained by hierarchical Connectionist Temporal Classification (CTC) (Graves et al., 2006). For
tasks of sequence labelling, every LSTM RNN level (Sec. 5.13) predicts a sequence of labels fed to
the next level. Error signals at every level are back-propagated through all the lower levels. On spoken
digit recognition, LSTM stacks outperformed HMMs, despite making fewer assumptions about the
domain. LSTM stacks do not necessarily require unsupervised pre-training like the earlier UL-based
RNN stacks (Schmidhuber, 1992b) of Sec. 5.10.

5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
Stacks of LSTM RNNs trained by CTC (Sec. 5.13, 5.16) became the first RNNs to win official interna-
tional pattern recognition contests (with secret test sets known only to the organisers). More precisely,
three connected handwriting competitions at ICDAR 2009 in three different languages (French, Arabic,
Farsi) were won by deep LSTM RNNs without any a priori linguistic knowledge, performing simul-
taneous segmentation and recognition. Compare (Graves and Schmidhuber, 2005; Graves et al., 2009;
Schmidhuber et al., 2011; Graves et al., 2013; Graves and Jaitly, 2014) (Sec. 5.22).
To detect human actions in surveillance videos, a 3-dimensional CNN (e.g., Jain and Seung, 2009;
Prokhorov, 2010), combined with SVMs, was part of a larger system (Yang et al., 2009) using a bag
of features approach (Nowak et al., 2006) to extract regions of interest. The system won three 2009
TRECVID competitions. These were possibly the first official international contests won with the
help of (MP)CNNs (Sec. 5.16). An improved version of the method was published later (Ji et al.,
2013).
2009 also saw a GPU-DBN implementation (Raina et al., 2009) orders of magnitude faster than
previous CPU-DBNs (see Sec. 5.15); see also (Coates et al., 2013). The Convolutional DBN (Lee
et al., 2009a) (with a probabilistic variant of MP, Sec. 5.11) combines ideas from CNNs and DBNs,
and was successfully applied to audio classification (Lee et al., 2009b).

5.18 2010: Plain Backprop (+ Distortions) on GPU Breaks MNIST Record


In 2010, a new MNIST (Sec. 5.8) record of 0.35% error rate was set by good old BP (Sec. 5.5)
in deep but otherwise standard NNs (Ciresan et al., 2010), using neither unsupervised pre-training
(e.g., Sec. 5.7, 5.10, 5.15) nor convolution (e.g., Sec. 5.4, 5.8, 5.14, 5.16). However, training pattern
deformations (e.g., Sec. 5.14) were important to generate a big training set and avoid overfitting. This
success was made possible mainly through a GPU implementation of BP that was up to 50 times
faster than standard CPU versions. A good value of 0.95% was obtained without distortions except
for small saccadic eye movement-like translations—compare Sec. 5.15.
Since BP was 3-5 decades old by then (Sec. 5.5), and pattern deformations 2 decades (Baird, 1990)
(Sec. 5.14), these results seemed to suggest that advances in exploiting modern computing hardware
were more important than advances in algorithms.

5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance


In 2011, a flexible GPU-implementation (Ciresan et al., 2011a) of Max-Pooling (MP) CNNs or Convnets was described (a GPU-MPCNN), building on earlier MP work (Weng et al., 1992) (Sec. 5.11), on CNNs (Fukushima, 1979; LeCun et al., 1989) (Sec. 5.4, 5.8, 5.16), and on early GPU-based CNNs
without MP (Chellapilla et al., 2006) (Sec. 5.16); compare early GPU-NNs (Oh and Jung, 2004) and
GPU-DBNs (Raina et al., 2009) (Sec. 5.17). MPCNNs have alternating convolutional layers (Sec. 5.4)
and max-pooling layers (MP, Sec. 5.11) topped by standard fully connected layers. All weights are
trained by BP (Sec. 5.5, 5.8, 5.16) (Ranzato et al., 2007; Scherer et al., 2010). GPU-MPCNNs have
become essential for many contest-winning FNNs (Sec. 5.21, Sec. 5.22).
Multi-Column GPU-MPCNNs (Ciresan et al., 2011b) are committees (Breiman, 1996; Schapire,
1990; Wolpert, 1992; Hashem and Schmeiser, 1992; Ueda, 2000; Dietterich, 2000a) of GPU-
MPCNNs with simple democratic output averaging. Several MPCNNs see the same input; their output
vectors are used to assign probabilities to the various possible classes. The class with the on average
highest probability is chosen as the system’s classification of the present input. Compare earlier, more
sophisticated ensemble methods (Schapire, 1990), the contest-winning ensemble Bayes-NN (Neal,
2006) of Sec. 5.14, and recent related work (Shao et al., 2014).
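The democratic averaging itself is a one-liner; in the sketch below, three made-up probability vectors stand in for the softmax outputs of independently trained columns:

import numpy as np

columns = np.array([[0.7, 0.2, 0.1],     # column 1's class probabilities
                    [0.5, 0.4, 0.1],     # column 2
                    [0.6, 0.1, 0.3]])    # column 3
mean_probs = columns.mean(axis=0)        # simple democratic output averaging
print(mean_probs, mean_probs.argmax())   # committee classifies the input as class 0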
An ensemble of GPU-MPCNNs was the first system to achieve superhuman visual pattern recog-
nition (Ciresan et al., 2011b, 2012b) in a controlled competition, namely, the IJCNN 2011 traffic
sign recognition contest in San Jose (CA) (Stallkamp et al., 2011, 2012). This is of interest for fully
autonomous, self-driving cars in traffic (e.g., Dickmanns et al., 1994). The GPU-MPCNN ensem-
ble obtained a 0.56% error rate, roughly half the error rate of human test subjects, a third of that of the closest artificial NN competitor (Sermanet and LeCun, 2011), and a sixth of that of the best non-neural method.
A few months earlier, the qualifying round was won in a 1st stage online competition, albeit by
a much smaller margin: 1.02% (Ciresan et al., 2011b) vs 1.03% for second place (Sermanet and
LeCun, 2011). After the deadline, the organisers revealed that human performance on the test set
was 1.19%. That is, the best methods already seemed human-competitive. However, during the
qualifying it was possible to incrementally gain information about the test set by probing it through
repeated submissions. This is illustrated by better and better results obtained by various teams over
time (Stallkamp et al., 2012) (the organisers eventually imposed a limit of 10 resubmissions). In the
final competition this was not possible.
This illustrates a general problem with benchmarks whose test sets are public, or at least can be
probed to some extent: competing teams tend to overfit on the test set even when it cannot be directly
used for training, only for evaluation.
In 1997 many thought it a big deal that human chess world champion Kasparov was beaten by
an IBM computer. But back then computers could not at all compete with little kids in visual pat-
tern recognition, which seems much harder than chess from a computational perspective. Of course,
the traffic sign domain is highly restricted, and kids are still much better general pattern recognis-
ers. Nevertheless, by 2011, deep NNs could already learn to rival them in important limited visual
domains.
An ensemble of GPU-MPCNNs was also the first method to achieve human-competitive perfor-
mance (around 0.2%) on MNIST (Ciresan et al., 2012c). This represented a dramatic improvement,
since by then the MNIST record had hovered around 0.4% for almost a decade (Sec. 5.14, 5.16, 5.18).
Given all the prior work on (MP)CNNs (Sec. 5.4, 5.8, 5.11, 5.16) and GPU-CNNs (Sec. 5.16),
GPU-MPCNNs are not a breakthrough in the scientific sense. But they are a commercially relevant
breakthrough in efficient coding that has made a difference in several contests since 2011. Today, most
feedforward competition-winning deep NNs are (ensembles of) GPU-MPCNNs (Sec. 5.21–5.23).

5.20 2011: Hessian-Free Optimization for RNNs


Also in 2011 it was shown (Martens and Sutskever, 2011) that Hessian-free optimization (e.g., Møller,
1993; Pearlmutter, 1994; Schraudolph, 2002) (Sec. 5.6.2) can alleviate the Fundamental Deep Learn-
ing Problem (Sec. 5.9) in RNNs, outperforming standard gradient-based LSTM RNNs (Sec. 5.13) on
several tasks. Compare other RNN algorithms (Jaeger, 2004; Schmidhuber et al., 2007; Pascanu et al.,
2013b; Koutnı́k et al., 2014) that also at least sometimes yield better results than steepest descent for
LSTM RNNs.

5.21 2012: First Contests Won on ImageNet, Object Detection, Segmentation


In 2012, an ensemble of GPU-MPCNNs (Sec. 5.19) achieved best results on the ImageNet classifica-
tion benchmark (Krizhevsky et al., 2012), which is popular in the computer vision community. Here
relatively large image sizes of 256×256 pixels were necessary, as opposed to only 48×48 pixels for
the 2011 traffic sign competition (Sec. 5.19). See further improvements in Sec. 5.22.
Also in 2012, the biggest NN so far ($10^9$ free parameters) was trained in unsupervised mode
(Sec. 5.7, 5.15) on unlabeled data (Le et al., 2012), then applied to ImageNet. The codes across its top
layer were used to train a simple supervised classifier, which achieved best results so far on 20,000
classes. Instead of relying on efficient GPU programming, this was done by brute force on 1,000
standard machines with 16,000 cores.
So by 2011/2012, excellent results had been achieved by Deep Learners in image recognition and
classification (Sec. 5.19, 5.21). The computer vision community, however, is especially interested in
object detection in large images, for applications such as image-based search engines, or for biomed-
ical diagnosis where the goal may be to automatically detect tumors etc in images of human tissue.
Object detection presents additional challenges. One natural approach is to train a deep NN classifier
on patches of big images, then use it as a feature detector to be shifted across unknown visual scenes,
using various rotations and zoom factors. Image parts that yield highly active output units are likely
to contain objects similar to those the NN was trained on.
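A minimal sketch of such a scan at a single scale and orientation; the mean-brightness "classifier" is a toy stand-in for a trained NN patch classifier, and real systems add rotations, zoom factors, and post-processing of overlapping hits:

import numpy as np

def scan(image, classify, patch=16, stride=4):
    # Slide a fixed-size window across the image and record
    # (row, col, score) for every position.
    hits = []
    H, W = image.shape
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            hits.append((r, c, classify(image[r:r + patch, c:c + patch])))
    return hits

img = np.zeros((64, 64))
img[20:36, 30:46] = 1.0                          # a bright "object"
best = max(scan(img, lambda p: p.mean()), key=lambda hit: hit[2])
print(best)                                      # a window near the object scores highest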
2012 finally saw the first DL system (an ensemble of GPU-MPCNNs, Sec. 5.19) to win a contest
on visual object detection (Ciresan et al., 2013) in large images of several million pixels (ICPR 2012
Contest on Mitosis Detection in Breast Cancer Histological Images, 2012; Roux et al., 2013). Such
biomedical applications may turn out to be among the most important applications of DL. The world
spends over 10% of GDP on healthcare (> 6 trillion USD per year), much of it on medical diagnosis
through expensive experts. Partial automation of this could not only save lots of money, but also make
expert diagnostics accessible to many who currently cannot afford it. It is gratifying to observe that
today deep NNs may actually help to improve healthcare and perhaps save human lives.
2012 also saw the first pure image segmentation contest won by DL (Ciresan et al., 2012a), again through a GPU-MPCNN ensemble (Segmentation of Neuronal Structures in EM Stacks Challenge, 2012). (It should be mentioned, however, that LSTM RNNs already performed simultaneous segmentation and recognition when they became the first recurrent Deep Learners to win official international pattern recognition contests—see Sec. 5.17.) EM stacks are relevant for the recently approved huge brain projects in Europe and the
US (e.g., Markram, 2012). Given electron microscopy images of stacks of thin slices of animal
brains, the goal is to build a detailed 3D model of the brain’s neurons and dendrites. But human
experts need many hours and days and weeks to annotate the images: Which parts depict neuronal
membranes? Which parts are irrelevant background? This needs to be automated (e.g., Turaga et al.,
2010). Deep Multi-Column GPU-MPCNNs learned to solve this task through experience with many
training images, and won the contest on all three evaluation metrics by a large margin, with superhu-
man performance in terms of pixel error.
Both object detection (Ciresan et al., 2013) and image segmentation (Ciresan et al., 2012a) profit
from fast MPCNN-based image scans that avoid redundant computations. Recent MPCNN scanners
speed up naive implementations by up to three orders of magnitude (Masci et al., 2013; Giusti et al.,
2013); compare earlier efficient methods for CNNs without MP (Vaillant et al., 1994).
Also in 2012, a system consisting of growing deep FNNs and 2D-BRNNs (Di Lena et al., 2012)
won the CASP 2012 contest on protein contact map prediction. On the IAM-OnDoDB benchmark,
LSTM RNNs (Sec. 5.13) outperformed all other methods (HMMs, SVMs) on online mode detec-
tion (Otte et al., 2012; Indermuhle et al., 2012) and keyword spotting (Indermuhle et al., 2011). On the
long time lag problem of language modelling, LSTM RNNs outperformed all statistical approaches
on the IAM-DB benchmark (Frinken et al., 2012); improved results were later obtained through a
combination of NNs and HMMs (Zamora-Martínez et al., 2014). Compare earlier RNNs for object recognition through iterative image interpretation (Behnke and Rojas, 1998; Behnke, 2002, 2003b); see also more recent publications (Wyatte et al., 2012; O'Reilly et al., 2013) extending work on bio-
logically plausible learning rules for RNNs (O’Reilly, 1996).

5.22 2013-: More Contests and Benchmark Records


A stack (Fernandez et al., 2007; Graves and Schmidhuber, 2009) (Sec. 5.10) of bi-directional LSTM
RNNs (Graves and Schmidhuber, 2005) trained by CTC (Sec. 5.13, 5.17) broke a famous TIMIT
speech (phoneme) recognition record, achieving 17.7% test set error rate (Graves et al., 2013), despite
thousands of man-years previously spent on Hidden Markov Model (HMM)-based speech recognition
research. Compare earlier DBN results (Sec. 5.15).
CTC-LSTM also helped to score first at NIST’s OpenHaRT2013 evaluation (Bluche et al., 2014).
For optical character recognition (OCR), LSTM RNNs outperformed commercial recognizers of his-
torical data (Breuel et al., 2013). LSTM-based systems also set benchmark records in language iden-
tification (Gonzalez-Dominguez et al., 2014), medium-vocabulary speech recognition (Geiger et al.,
2014), prosody contour prediction (Fernandez et al., 2014), audio onset detection (Marchi et al.,
2014), text-to-speech synthesis (Fan et al., 2014), and social signal classification (Brueckner and
Schulter, 2014).
An LSTM RNN was used to estimate the state posteriors of an HMM; this system beat the previous
state of the art in large vocabulary speech recognition (Sak et al., 2014b,a). Another LSTM RNN with
hundreds of millions of connections was used to rerank hypotheses of a statistical machine translation
system; this system beat the previous state of the art in English to French translation (Sutskever et al.,
2014).
A new record on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes)
was set on a desktop machine by an ensemble of GPU-MPCNNs (Sec. 5.19) with almost human
performance (Ciresan and Schmidhuber, 2013); compare (Yin et al., 2013).
The MICCAI 2013 Grand Challenge on Mitosis Detection (Veta et al., 2013) also was won by an
object-detecting GPU-MPCNN ensemble (Ciresan et al., 2013). Its data set was even larger and more
challenging than the one of ICPR 2012 (Sec. 5.21): a real-world dataset including many ambiguous
cases and frequently encountered problems such as imperfect slide staining.
Three 2D-CNNs (with mean-pooling instead of MP, Sec. 5.11) observing three orthogonal projec-
tions of 3D images outperformed traditional full 3D methods on the task of segmenting tibial cartilage
in low field knee MRI scans (Prasoon et al., 2013).
Deep GPU-MPCNNs (Sec. 5.19) also helped to achieve new best results on important bench-
marks of the computer vision community: ImageNet classification (Zeiler and Fergus, 2013; Szegedy
et al., 2014) and—in conjunction with traditional approaches—PASCAL object detection (Girshick
et al., 2013). They also learned to predict bounding box coordinates of objects in the Imagenet
2013 database, and obtained state-of-the-art results on tasks of localization and detection (Sermanet
et al., 2013). GPU-MPCNNs also helped to recognise multi-digit numbers in Google Street View
images (Goodfellow et al., 2014b), where part of the NN was trained to count visible digits; compare
earlier work on detecting “numerosity” through DBNs (Stoianov and Zorzi, 2012). This system also
excelled at recognising distorted synthetic text in reCAPTCHA puzzles. Other successful CNN appli-
cations include scene parsing (Farabet et al., 2013), object detection (Szegedy et al., 2013), shadow
detection (Khan et al., 2014), video classification (Karpathy et al., 2014), and Alzheimer's disease
neuroimaging (Li et al., 2014).
Additional contests are mentioned in the web pages of the Swiss AI Lab IDSIA, the University of
Toronto, New York University, and the University of Montreal.

5.23 Currently Successful Techniques: LSTM RNNs and GPU-MPCNNs


Most competition-winning or benchmark record-setting Deep Learners actually use one of two super-
vised techniques: (a) recurrent LSTM (1997) trained by CTC (2006) (Sec. 5.13, 5.17, 5.21, 5.22), or
(b) feedforward GPU-MPCNNs (2011, Sec. 5.19, 5.21, 5.22) based on CNNs (1979, Sec. 5.4) with
MP (1992, Sec. 5.11) trained through BP (1989–2007, Sec. 5.8, 5.16).
Exceptions include two 2011 contests (Goodfellow et al., 2011; Mesnil et al., 2011; Goodfel-
low et al., 2012) specialised on Transfer Learning from one dataset to another (e.g., Caruana, 1997;
Schmidhuber, 2004; Pan and Yang, 2010). However, deep GPU-MPCNNs do allow for pure SL-based
transfer (Ciresan et al., 2012d), where pre-training on one training set greatly improves performance
on quite different sets, also in more recent studies (Oquab et al., 2013; Donahue et al., 2013). In
fact, deep MPCNNs pre-trained by SL can extract useful features from quite diverse off-training-set
images, yielding better results than traditional, widely used features such as SIFT (Lowe, 1999, 2004)
on many vision tasks (Razavian et al., 2014). To deal with changing datasets, slowly learning deep
NNs were also combined with rapidly adapting “surface” NNs (Kak et al., 2010).
Remarkably, in the 1990s a trend went from partially unsupervised RNN stacks (Sec. 5.10) to
purely supervised LSTM RNNs (Sec. 5.13), just like in the 2000s a trend went from partially unsuper-
vised FNN stacks (Sec. 5.15) to purely supervised MPCNNs (Sec. 5.16–5.22). Nevertheless, in many
applications it can still be advantageous to combine the best of both worlds—supervised learning and
unsupervised pre-training (Sec. 5.10, 5.15).
5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
DBN training (Sec. 5.15) can be improved through gradient enhancements and automatic learning rate
adjustments during stochastic gradient descent (Cho et al., 2013; Cho, 2014), and through Tikhonov-
type (Tikhonov et al., 1977) regularization of RBMs (Cho et al., 2012). Contractive AEs (Rifai et al.,
2011) discourage hidden unit perturbations in response to input perturbations, similar to how FMS
(Sec. 5.6.3) for LOCOCODE AEs (Sec. 5.6.4) discourages output perturbations in response to weight
perturbations.
Hierarchical CNNs in a Neural Abstraction Pyramid (e.g., Behnke, 2003b, 2005) were trained to
reconstruct images corrupted by structured noise (Behnke, 2001), thus enforcing increasingly abstract
image representations in deeper and deeper layers. Denoising AEs later used a similar procedure (Vin-
cent et al., 2008).
Dropout (Hinton et al., 2012b; Ba and Frey, 2013) removes units from NNs during training to
improve generalisation. Some view it as an ensemble method that trains multiple data models simul-
taneously (Baldi and Sadowski, 2014). Under certain circumstances, it could also be viewed as a form
of training set augmentation: effectively, more and more informative complex features are removed
from the training data. Compare dropout for RNNs (Pham et al., 2013; Pachitariu and Sahani, 2013;
Pascanu et al., 2013a). A deterministic approximation coined fast dropout (Wang and Manning, 2013)
can lead to faster learning and evaluation and was adapted for RNNs (Bayer et al., 2013). Dropout is
closely related to older, biologically plausible techniques for adding noise to neurons or synapses dur-
ing training (e.g., Hanson, 1990; Murray and Edwards, 1993; Schuster, 1992; Nadal and Parga, 1994;
Jim et al., 1995; An, 1996), which in turn are closely related to finding perturbation-resistant low-
complexity NNs, e.g., through FMS (Sec. 5.6.3). MDL-based stochastic variational methods (Graves,
2011) are also related to FMS. They are useful for RNNs, where classic regularizers such as weight
decay (Sec. 5.6.3) represent a bias towards limited memory capacity (e.g., Pascanu et al., 2013b).
Compare recent work on variational recurrent AEs (Bayer and Osendorfer, 2014).
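A minimal sketch of the original dropout formulation—drop units with probability p during training, rescale activations at test time—where p = 0.5 is merely the commonly cited value:

import numpy as np

def dropout(h, p, rng, train=True):
    # Training: zero each unit independently with probability p.
    # Test: scale activations by (1 - p) so expected values match
    # (modern "inverted" dropout rescales at training time instead).
    if train:
        return h * (rng.random(h.shape) >= p)
    return h * (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10)
print(dropout(h, 0.5, rng))                # roughly half the units removed
print(dropout(h, 0.5, rng, train=False))   # all units present, scaled by 0.5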
The activation function f of Rectified Linear Units (ReLUs) is $f(x) = x$ for $x > 0$, $f(x) = 0$ otherwise—compare the old concept of half-wave rectified units (Malik and Perona, 1990). ReLU
NNs are useful for RBMs (Nair and Hinton, 2010; Maas et al., 2013), outperformed sigmoidal ac-
tivation functions in deep NNs (Glorot et al., 2011), and helped to obtain best results on several
benchmark problems across multiple domains (e.g., Krizhevsky et al., 2012; Dahl et al., 2013).
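In code, the unit is a one-liner:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)    # f(x) = x for x > 0, f(x) = 0 otherwise

print(relu(np.array([-2.0, 0.0, 3.0])))   # -> [0. 0. 3.]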
NNs with competing linear units tend to outperform those with non-competing nonlinear units,
and avoid catastrophic forgetting through BP when training sets change over time (Srivastava et al.,
2013). In this context, choosing a learning algorithm may be more important than choosing activation
functions (Goodfellow et al., 2014a). Maxout NNs (Goodfellow et al., 2013) combine competitive
interactions and dropout (see above) to achieve excellent results on certain benchmarks. Compare
early RNNs with competing units for SL and RL (Schmidhuber, 1989b). To address overfitting,
instead of depending on pre-wired regularizers and hyper-parameters (Hertz et al., 1991; Bishop,
2006), self-delimiting RNNs (SLIM NNs) with competing units (Schmidhuber, 2012) can in principle
learn to select their own runtime and their own numbers of effective free parameters, thus learning
their own computable regularisers (Sec. 4.4, 5.6.3), becoming fast and slim when necessary. One may
penalize the task-specific total length of connections (e.g., Legenstein and Maass, 2002; Schmidhuber,
2012, 2013b; Clune et al., 2013) and communication costs of SLIM NNs implemented on the 3-
dimensional brain-like multi-processor hardware to be expected in the future.
RmsProp (Tieleman and Hinton, 2012; Schaul et al., 2013) can speed up first order gradient de-
scent methods (Sec. 5.5, 5.6.2); compare vario-η (Neuneier and Zimmermann, 1996), Adagrad (Duchi
et al., 2011) and Adadelta (Zeiler, 2012). DL in NNs can also be improved by transforming hidden
unit activations such that they have zero output and slope on average (Raiko et al., 2012). Many ad-
ditional, older tricks (Sec. 5.6.2, 5.6.3) should also be applicable to today’s deep NNs; compare (Orr
and Müller, 1998; Montavon et al., 2012).
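A minimal sketch of the RMSProp update; the decay rate, learning rate, and epsilon below are commonly used defaults rather than values prescribed by the cited works:

import numpy as np

def rmsprop_step(w, grad, ms, lr=1e-3, rho=0.9, eps=1e-8):
    # Keep a running average of squared gradients, then divide the
    # gradient by its root mean square before taking a step.
    ms = rho * ms + (1.0 - rho) * grad**2
    return w - lr * grad / (np.sqrt(ms) + eps), ms

w, ms = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(3000):
    grad = 2.0 * w                 # gradient of the toy loss ||w||^2
    w, ms = rmsprop_step(w, grad, ms)
print(w)                           # both coordinates end up near the minimum at 0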

5.25 Consequences for Neuroscience


It is ironic that artificial NNs (ANNs) can help to better understand biological NNs (BNNs)—see
the ISBI 2012 results mentioned in Sec. 5.21 (Segmentation of Neuronal Structures in EM Stacks
Challenge, 2012; Ciresan et al., 2012a).
The feature detectors learned by single-layer visual ANNs are similar to those found in early
visual processing stages of BNNs (e.g., Sec. 5.6.4). Likewise, the feature detectors learned in deep
layers of visual ANNs should be highly predictive of what neuroscientists will find in deep layers
of BNNs. While the visual cortex of BNNs may use quite different learning algorithms, its objective
function to be minimised may be quite similar to the one of visual ANNs. In fact, results obtained with
relatively deep artificial DBNs (Lee et al., 2007b) and CNNs (Yamins et al., 2013) seem compatible
with insights about the visual pathway in the primate cerebral cortex, which has been studied for
many decades (e.g., Hubel and Wiesel, 1968; Perrett et al., 1982; Desimone et al., 1984; Felleman
and Van Essen, 1991; Perrett et al., 1992; Kobatake and Tanaka, 1994; Logothetis et al., 1995; Bichot
et al., 2005; Hung et al., 2005; Lennie and Movshon, 2005; Connor et al., 2007; Kriegeskorte et al.,
2008; DiCarlo et al., 2012); compare a computer vision-oriented survey (Kruger et al., 2013).

5.26 DL with Spiking Neurons?


Many recent DL results profit from GPU-based traditional deep NNs, e.g., Sec. 5.16–5.19. Current
GPUs, however, are little ovens, much hungrier for energy than biological brains, whose neurons ef-
ficiently communicate by brief spikes (Hodgkin and Huxley, 1952; FitzHugh, 1961; Nagumo et al.,
1962), and often remain quiet. Many computational models of such spiking neurons have been pro-
posed and analyzed (e.g., Gerstner and van Hemmen, 1992; Zipser et al., 1993; Stemmler, 1996;
Tsodyks et al., 1996; Maex and Orban, 1996; Maass, 1996, 1997; Kistler et al., 1997; Amit and
Brunel, 1997; Tsodyks et al., 1998; Kempter et al., 1999; Song et al., 2000; Stoop et al., 2000; Brunel,
2000; Bohte et al., 2002; Gerstner and Kistler, 2002; Izhikevich et al., 2003; Seung, 2003; Deco and
Rolls, 2005; Brette et al., 2007; Brea et al., 2013; Nessler et al., 2013; Kasabov, 2014; Hoerzer et al.,
2014; Rezende and Gerstner, 2014).
Future energy-efficient hardware for DL in NNs may implement aspects of such models (e.g.,
Liu et al., 2001; Roggen et al., 2003; Glackin et al., 2005; Schemmel et al., 2006; Fieres et al., 2008;
Khan et al., 2008; Serrano-Gotarredona et al., 2009; Jin et al., 2010; Indiveri et al., 2011; Neil and Liu,
2014; Merolla et al., 2014). A simulated, event-driven, spiking variant (Neftci et al., 2014) of an RBM
(Sec. 5.15) was trained by a variant of the Contrastive Divergence algorithm (Hinton, 2002). Spiking
nets were evolved to achieve reasonable performance on small face recognition data sets (Wysoski
et al., 2010) and to control simple robots (Floreano and Mattiussi, 2001; Hagras et al., 2004). A
spiking DBN with about 250,000 neurons (as part of a larger NN; Eliasmith et al., 2012; Eliasmith,
2013) achieved 6% error rate on MNIST; compare similar results with a spiking DBN variant of
depth 3 using a neuromorphic event-based sensor (O’Connor et al., 2013). In practical applications,
however, current artificial networks of spiking neurons cannot yet compete with the best traditional
deep NNs (e.g., compare MNIST results of Sec. 5.19).
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
So far we have focused on Deep Learning (DL) in supervised or unsupervised NNs. Such NNs learn
to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act
in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys,
e.g., Kaelbling et al., 1996; Sutton and Barto, 1998; Wiering and van Otterlo, 2012). Here we add a
discussion of DL FNNs and RNNs for RL. It will be shorter than the discussion of FNNs and RNNs
for SL and UL (Sec. 5), reflecting the current size of the various fields.
Without a teacher, solely from occasional real-valued pain and pleasure signals, RL agents must
discover how to interact with a dynamic, initially unknown environment to maximize their expected
cumulative reward signals (Sec. 2). There may be arbitrary, a priori unknown delays between actions
and perceivable consequences. The problem is as hard as any problem of computer science, since any
task with a computable description can be formulated in the RL framework (e.g., Hutter, 2005). For
example, an answer to the famous question of whether $P = NP$ (Levin, 1973b; Cook, 1971) would
also set limits for what is achievable by general RL. Compare more specific limitations, e.g., (Blondel
and Tsitsiklis, 2000; Madani et al., 2003; Vlassis et al., 2012). The following subsections mostly focus
on certain obvious intersections between DL and RL—they cannot serve as a general RL survey.

6.1 RL Through NN World Models Yields RNNs With Deep CAPs


In the special case of an RL FNN controller C interacting with a deterministic, predictable environ-
ment, a separate FNN called M can learn to become C’s world model through system identification,
predicting C’s inputs from previous actions and inputs (e.g., Werbos, 1981, 1987; Munro, 1987; Jor-
dan, 1988; Werbos, 1989b,a; Robinson and Fallside, 1989; Jordan and Rumelhart, 1990; Schmidhu-
ber, 1990d; Narendra and Parthasarathy, 1990; Werbos, 1992; Gomi and Kawato, 1993; Cochocki and
Unbehauen, 1993; Levin and Narendra, 1995; Miller et al., 1995; Ljung, 1998; Prokhorov et al., 2001;
Ge et al., 2010). Assume M has learned to produce accurate predictions. We can use M to substi-
tute the environment. Then M and C form an RNN where M’s outputs become inputs of C, whose outputs (actions) in turn become inputs of M. Now BP for RNNs (Sec. 5.5.1) can be used to achieve desired input events such as high real-valued reward signals: while M’s weights remain fixed, gradient information for C’s weights is propagated back through M down into C and back through M, etc. To a certain extent, the approach is also applicable in probabilistic or uncertain environments, as long as the inner products of M’s C-based gradient estimates and M’s “true” gradients tend to be positive.
In general, this approach implies deep CAPs for C, unlike in DP-based traditional RL (Sec. 6.2).
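
To make this gradient flow concrete, here is a minimal sketch (the toy MLPs, sizes, and names are
illustrative assumptions, not any specific system cited above): the predicted cumulative reward is
differentiated through the frozen model M with respect to the controller C's weights only.

    import jax
    import jax.numpy as jnp

    def init_mlp(key, sizes):
        # toy MLP initializer (assumed architecture)
        params = []
        for m, n in zip(sizes[:-1], sizes[1:]):
            key, k = jax.random.split(key)
            params.append((0.1 * jax.random.normal(k, (m, n)), jnp.zeros(n)))
        return params

    def mlp(params, x):
        for W, b in params[:-1]:
            x = jnp.tanh(x @ W + b)
        W, b = params[-1]
        return x @ W + b

    def predicted_return(c_params, m_params, x0, T=10):
        # M predicts (next input, reward) from (current input, action);
        # M's weights stay fixed, only C's weights receive gradient
        x, total = x0, 0.0
        for _ in range(T):
            a = mlp(c_params, x)                           # C chooses an action
            pred = mlp(m_params, jnp.concatenate([x, a]))  # M simulates the world
            x, total = pred[:-1], total + pred[-1]         # last output = predicted reward
        return total

    key = jax.random.PRNGKey(0)
    C = init_mlp(key, [4, 16, 2])   # 4 inputs -> 2 actions (toy sizes)
    M = init_mlp(key, [6, 32, 5])   # (4 inputs + 2 actions) -> (4 next inputs + 1 reward)
    grad_for_C = jax.grad(predicted_return, argnums=0)(C, M, jnp.zeros(4))
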
Decades ago, the method was used to learn to back up a model truck (Nguyen and Widrow, 1989).
An RL active vision system used it to learn sequential shifts (saccades) of a fovea, to detect targets in
visual scenes (Schmidhuber and Huber, 1991), thus learning to control selective attention. Compare
RL-based attention learning without NNs (Whitehead, 1992).
To allow for memories of previous events in partially observable worlds (Sec. 6.3), the most gen-
eral variant of this technique uses RNNs instead of FNNs to implement both M and C (Schmidhuber,
1990d, 1991c; Feldkamp and Puskorius, 1998). This may cause deep CAPs not only for C but also
for M .
M can also be used to optimize expected reward by planning future action sequences (Schmid-
huber, 1990d). In fact, the winners of the 2004 RoboCup World Championship in the fast
league (Egorova et al., 2004) trained NNs to predict the effects of steering signals on fast robots
with 4 motors for 4 different wheels. During play, such NN models were used to achieve desirable
subgoals, by optimizing action sequences through quickly planning ahead. The approach also was
used to create self-healing robots able to compensate for faulty motors whose effects no longer
match the predictions of the NN models (Gloye et al., 2005; Schmidhuber, 2007).

Typically M is not given in advance. Then an essential question is: which experiments should C
conduct to quickly improve M ? The Formal Theory of Fun and Creativity (e.g., Schmidhuber, 2006a,
2013b) formalizes driving forces and value functions behind such curious and exploratory behavior:
A measure of the learning progress of M becomes the intrinsic reward of C (Schmidhuber, 1991a);
compare (Singh et al., 2005; Oudeyer et al., 2013). This motivates C to create action sequences
(experiments) such that M makes quick progress.

6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
The classical approach to RL (Samuel, 1959; Bertsekas and Tsitsiklis, 1996) makes the simplifying
assumption of Markov Decision Processes (MDPs): the current input of the RL agent conveys all
information necessary to compute an optimal next output event or decision. This allows for greatly
reducing CAP depth in RL NNs (Sec. 3, 6.1) by using the Dynamic Programming (DP) trick (Bellman,
1957). The latter is often explained in a probabilistic framework (e.g., Sutton and Barto, 1998), but
its basic idea can already be conveyed in a deterministic setting. For simplicity, using the notation
of Sec. 2, let input events x_t encode the entire current state of the environment, including a real-
valued reward r_t (no need to introduce additional vector-valued notation, since real values can encode
arbitrary vectors of real values). The original RL goal (find weights that maximize the sum of all
rewards of an episode) is replaced by an equivalent set of alternative goals set by a real-valued value
function V defined on input events. Consider any two subsequent input events x_t, x_k. Recursively
define V(x_t) = r_t + V(x_k), where V(x_k) = r_k if x_k is the last input event. Now search for weights
that maximize the V of all input events, by causing appropriate output events or actions.
Due to the Markov assumption, an FNN suffices to implement the policy that maps input to output
events. Relevant CAPs are not deeper than this FNN. V itself is often modeled by a separate
FNN (also yielding typically short CAPs) learning to approximate V(x_t) only from local information
r_t, V(x_k).
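
As a toy illustration of this recursion (the reward numbers below are made up), the value of every
input event of a deterministic episode can be computed in one backward sweep:

    import jax.numpy as jnp

    rewards = jnp.array([0.0, 0.0, 1.0, 0.0, 5.0])  # hypothetical r_t per input event

    def values(r):
        # backward sweep of V(x_t) = r_t + V(x_k); V of the last event is its reward
        return jnp.cumsum(r[::-1])[::-1]

    print(values(rewards))  # [6. 6. 6. 5. 5.]
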
Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan,
1992; Moore and Atkeson, 1993; Schwartz, 1993; Rummery and Niranjan, 1994; Singh, 1994; Baird,
1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy,
1996; Bradtke et al., 1996; Santamarı́a et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto,
1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and
Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and
Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010; van Hasselt, 2012). Most are formulated in
a probabilistic framework, and evaluate pairs of input and output (action) events (instead of input
events only). To facilitate certain mathematical derivations, some discount delayed rewards, but such
distortions of the original RL problem are problematic.
Perhaps the most well-known RL NN is the world-class RL backgammon player (Tesauro, 1994),
which achieved the level of human world champions by playing against itself. Its nonlinear, rather
shallow FNN maps a large but finite number of discrete board states to values. More recently, a
rather deep GPU-CNN was used in a traditional RL framework to play several Atari 2600 computer
games directly from 84×84 pixel 60 Hz video input (Mnih et al., 2013), using experience replay (Lin,
1993), extending previous work on Neural Fitted Q-Learning (NFQ) (Riedmiller, 2005). Even bet-
ter results are achieved by using (slow) Monte Carlo tree planning to train comparatively fast deep
NNs (Guo et al., 2014). Compare RBM-based RL (Sallans and Hinton, 2004) with high-dimensional
inputs (Elfwing et al., 2010), earlier RL Atari players (Grüttner et al., 2010), and an earlier, raw video-
based RL NN for computer games (Koutnı́k et al., 2013) trained by Indirect Policy Search (Sec. 6.7).
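
A minimal sketch of the experience replay idea used by these systems follows (a toy linear
Q-function and made-up names; DQN itself trains a deep CNN): transitions are stored during play
and later sampled at random for TD updates, which breaks temporal correlations in the data.

    import random
    import jax
    import jax.numpy as jnp

    buffer = []  # stores (x, a, r, x_next) transitions gathered during play

    def q(theta, x):
        # toy linear action-value function: one output per discrete action
        return x @ theta

    def td_loss(theta, batch, gamma=0.99):
        loss = 0.0
        for x, a, r, x2 in batch:
            target = r + gamma * jnp.max(q(theta, x2))  # bootstrapped target
            loss = loss + (q(theta, x)[a] - jax.lax.stop_gradient(target)) ** 2
        return loss / len(batch)

    def replay_step(theta, lr=1e-2, batch_size=32):
        # sample past transitions uniformly at random for one gradient update
        if not buffer:
            return theta
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        return theta - lr * jax.grad(td_loss)(theta, batch)
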

6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
The Markov assumption (Sec. 6.2) is often unrealistic. We cannot directly perceive what is behind our
back, let alone the current state of the entire universe. However, memories of previous events can help
to deal with partially observable Markov decision problems (POMDPs) (e.g., Schmidhuber, 1990d,
1991c; Ring, 1991, 1993, 1994; Williams, 1992a; Lin, 1993; Teller, 1994; Kaelbling et al., 1995;
Littman et al., 1995; Boutilier and Poole, 1996; Jaakkola et al., 1995; McCallum, 1996; Kimura et al.,
1997; Wiering and Schmidhuber, 1996, 1998a; Otsuka et al., 2010). A naive way of implementing
memories without leaving the MDP framework (Sec. 6.2) would be to simply consider a possibly huge
state space, namely, the set of all possible observation histories and their prefixes. A more realistic
way is to use function approximators such as RNNs that produce compact state features as a function
of the entire history seen so far. Generally speaking, POMDP RL often uses DL RNNs to learn which
events to memorize and which to ignore. Three basic alternatives are:
1. Use an RNN as a value function mapping arbitrary event histories to values (e.g., Schmidhuber,
1990b, 1991c; Lin, 1993; Bakker, 2002). For example, deep LSTM RNNs were used in this
way for RL robots (Bakker et al., 2003).
2. Use an RNN controller in conjunction with a second RNN as predictive world model, to obtain
a combined RNN with deep CAPs—see Sec. 6.1.
3. Use an RNN for RL by Direct Search (Sec. 6.6) or Indirect Search (Sec. 6.7) in weight space.
In general, however, POMDPs may imply greatly increased CAP depth.
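
As a sketch of alternative 1 above (a toy parametrisation with assumed names), a simple RNN
compresses an arbitrary observation history into a compact state feature from which a value is read
out:

    import jax
    import jax.numpy as jnp

    def rnn_value(params, observations):
        # compact state feature of the entire event history seen so far
        Wx, Wh, wv = params
        h = jnp.zeros(Wh.shape[0])
        for o in observations:
            h = jnp.tanh(o @ Wx + h @ Wh)  # the RNN decides what to memorize
        return h @ wv                      # value of the full history

    key = jax.random.PRNGKey(1)
    k1, k2, k3 = jax.random.split(key, 3)
    params = (0.1 * jax.random.normal(k1, (3, 8)),   # obs dim 3 -> hidden 8 (toy sizes)
              0.1 * jax.random.normal(k2, (8, 8)),
              0.1 * jax.random.normal(k3, (8,)))
    history = [jnp.ones(3) * t for t in range(5)]
    print(rnn_value(params, history))
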

6.4 RL Facilitated by Deep UL in FNNs and RNNs


RL machines may profit from UL for input preprocessing (e.g., Jodogne and Piater, 2007). In partic-
ular, an UL NN can learn to compactly encode environmental inputs such as images or videos, e.g.,
Sec. 5.7, 5.10, 5.15. The compact codes (instead of the high-dimensional raw data) can be fed into an
RL machine, whose job thus may become much easier (Legenstein et al., 2010; Cuccu et al., 2011),
just like SL may profit from UL, e.g., Sec. 5.7, 5.10, 5.15. For example, NFQ (Riedmiller, 2005) was
applied to real-world control tasks (Lange and Riedmiller, 2010; Riedmiller et al., 2012) where purely
visual inputs were compactly encoded by deep autoencoders (Sec. 5.7, 5.15). RL combined with UL
based on Slow Feature Analysis (Wiskott and Sejnowski, 2002; Kompella et al., 2012) enabled a real
humanoid robot to learn skills from raw high-dimensional video streams (Luciw et al., 2013). To
deal with POMDPs (Sec. 6.3) involving high-dimensional inputs, RBM-based RL was used (Otsuka,
2010), and a RAAM (Pollack, 1988) (Sec. 5.7) was employed as a deep unsupervised sequence en-
coder for RL (Gisslen et al., 2011). Certain types of RL and UL also were combined in biologically
plausible RNNs with spiking neurons (Sec. 5.26) (e.g., Yin et al., 2012; Klampfl and Maass, 2013;
Rezende and Gerstner, 2014).
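
As a minimal sketch of such an UL front end (all shapes and names below are illustrative
assumptions, not any particular system from the text), an autoencoder is trained on reconstruction
error, and only its compact code z, rather than the high-dimensional raw input, would be passed on
to the RL machine:

    import jax
    import jax.numpy as jnp

    def autoencode(params, x):
        We, be, Wd, bd = params
        z = jnp.tanh(x @ We + be)  # compact code fed to the RL machine
        return z, z @ Wd + bd      # reconstruction of the raw input

    def recon_loss(params, xs):
        return jnp.mean(jnp.sum((autoencode(params, xs)[1] - xs) ** 2, axis=-1))

    key = jax.random.PRNGKey(2)
    k1, k2 = jax.random.split(key)
    params = (0.1 * jax.random.normal(k1, (64, 8)), jnp.zeros(8),   # 64-d input -> 8-d code
              0.1 * jax.random.normal(k2, (8, 64)), jnp.zeros(64))
    xs = jax.random.normal(key, (32, 64))             # a batch of raw inputs
    params_grad = jax.grad(recon_loss)(params, xs)    # one plain gradient step would follow
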

6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
Multiple learnable levels of abstraction (Fu, 1977; Lenat and Brown, 1984; Ring, 1994; Bengio
et al., 2013; Deng and Yu, 2014) seem as important for RL as for SL. Work on NN-based Hierar-
chical RL (HRL) has been published since the early 1990s. In particular, gradient-based subgoal
discovery with FNNs or RNNs decomposes RL tasks into subtasks for RL submodules (Schmid-
huber, 1991b; Schmidhuber and Wahnsiedler, 1992). Numerous alternative HRL techniques have
been proposed (e.g., Ring, 1991, 1994; Jameson, 1991; Tenenberg et al., 1993; Weiss, 1994; Moore
and Atkeson, 1995; Precup et al., 1998; Dietterich, 2000b; Menache et al., 2002; Doya et al., 2002;
Ghavamzadeh and Mahadevan, 2003; Barto and Mahadevan, 2003; Samejima et al., 2003; Bakker and
Schmidhuber, 2004; Whiteson et al., 2005; Simsek and Barto, 2008). While HRL frameworks such as
Feudal RL (Dayan and Hinton, 1993) and options (Sutton et al., 1999b; Barto et al., 2004; Singh et al.,
2005) do not directly address the problem of automatic subgoal discovery, HQ-Learning (Wiering and
Schmidhuber, 1998a) automatically decomposes POMDPs (Sec. 6.3) into sequences of simpler sub-
tasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL orga-
nizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control
maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009).

6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution


Not quite as universal as the methods of Sec. 6.8, yet both practical and more general than most
traditional RL algorithms (Sec. 6.2), are methods for Direct Policy Search (DS). Without a need for
value functions or Markovian assumptions (Sec. 6.2, 6.3), the weights of an FNN or RNN are directly
evaluated on the given RL problem. The results of successive trials inform further search for better
weights. Unlike with RL supported by BP (Sec. 5.5, 6.3, 6.1), CAP depth (Sec. 3, 5.9) is not a crucial
issue. DS may solve the credit assignment problem without backtracking through deep causal chains
of modifiable parameters—it neither cares for their existence, nor tries to exploit them.
An important class of DS methods for NNs are Policy Gradient methods (Williams, 1986, 1988,
1992a; Sutton et al., 1999a; Baxter and Bartlett, 2001; Aberdeen, 2003; Ghavamzadeh and Mahade-
van, 2003; Kohl and Stone, 2004; Wierstra et al., 2008; Rückstieß et al., 2008; Peters and Schaal,
2008b,a; Sehnke et al., 2010; Grüttner et al., 2010; Wierstra et al., 2010; Peters, 2010; Grondman
et al., 2012; Heess et al., 2012). Gradients of the total reward with respect to policies (NN weights)
are estimated (and then exploited) through repeated NN evaluations.
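
A minimal sketch of the score-function idea behind such estimators follows (the Gaussian policy and
all shapes are illustrative assumptions; practical methods add baselines and other variance-reduction
machinery):

    import jax
    import jax.numpy as jnp

    def log_prob(theta, x, a):
        # hypothetical Gaussian policy: a ~ N(tanh(x @ theta), I)
        mean = jnp.tanh(x @ theta)
        return -0.5 * jnp.sum((a - mean) ** 2)

    def episode_grad(theta, xs, acts, R):
        # REINFORCE-style estimate from one trial:
        # R * grad_theta sum_t log p(a_t | x_t)
        logp = lambda th: sum(log_prob(th, x, a) for x, a in zip(xs, acts))
        return R * jax.grad(logp)(theta)

    theta = jnp.zeros((4, 2))                 # obs dim 4 -> action dim 2 (toy)
    xs = [jnp.ones(4)] * 3                    # observations of one episode
    acts = [jnp.zeros(2)] * 3                 # actions taken
    g = episode_grad(theta, xs, acts, R=1.0)  # averaged over many trials in practice
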
RL NNs can also be evolved through Evolutionary Algorithms (EAs) (Rechenberg, 1971; Schwe-
fel, 1974; Holland, 1975; Fogel et al., 1966; Goldberg, 1989) in a series of trials. Here several policies
are represented by a population of NNs improved through mutations and/or repeated recombinations
of the population’s fittest individuals (e.g., Montana and Davis, 1989; Fogel et al., 1990; Maniezzo,
1994; Happel and Murre, 1994; Nolfi et al., 1994b). Compare Genetic Programming (GP) (Cramer,
1985) (see also Smith, 1980) which can be used to evolve computer programs of variable size (Dick-
manns et al., 1987; Koza, 1992), and Cartesian GP (Miller and Thomson, 2000; Miller and Harding,
2009) for evolving graph-like programs, including NNs (Khan et al., 2010) and their topology (Turner
and Miller, 2013). Related methods include probability distribution-based EAs (Baluja, 1994; Sar-
avanan and Fogel, 1995; Sałustowicz and Schmidhuber, 1997; Larrañaga and Lozano, 2001), Co-
variance Matrix Adaptation Evolution Strategies (CMA-ES) (Hansen and Ostermeier, 2001; Hansen
et al., 2003; Igel, 2003; Heidrich-Meisner and Igel, 2009), and NeuroEvolution of Augmenting Topolo-
gies (NEAT) (Stanley and Miikkulainen, 2002). Hybrid methods combine traditional NN-based RL
(Sec. 6.2) and EAs (e.g., Whiteson and Stone, 2006).
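
The following is a minimal sketch of such population-based search over NN weight vectors (the
particular selection and recombination scheme, and all parameter names, are simplified assumptions,
not any specific EA cited above):

    import jax
    import jax.numpy as jnp

    def evolve(fitness, key, dim, pop=50, sigma=0.1, elite=10, gens=100):
        # mutate a population of flat weight vectors around the current mean,
        # evaluate each policy on the RL task, recombine the fittest individuals
        w = jnp.zeros(dim)
        for _ in range(gens):
            key, k = jax.random.split(key)
            cand = w + sigma * jax.random.normal(k, (pop, dim))
            scores = jnp.array([fitness(c) for c in cand])  # one trial per policy
            w = cand[jnp.argsort(-scores)[:elite]].mean(axis=0)
        return w

    # usage with a stand-in fitness (a real one would run episodes in the environment):
    best = evolve(lambda w: -jnp.sum(w ** 2), jax.random.PRNGKey(3), dim=20)
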
Since RNNs are general computers, RNN evolution is like GP in the sense that it can evolve
general programs. Unlike sequential programs learned by traditional GP, however, RNNs can mix
sequential and parallel information processing in a natural and efficient way, as already mentioned in
Sec. 1. Many RNN evolvers have been proposed (e.g., Miller et al., 1989; Wieland, 1991; Cliff et al.,
1993; Yao, 1993; Nolfi et al., 1994a; Sims, 1994; Yamauchi and Beer, 1994; Miglino et al., 1995;
Moriarty, 1997; Pasemann et al., 1999; Juang, 2004; Whiteson, 2012). One particularly effective
family of methods coevolves neurons, combining them into networks, and selecting those neurons
for reproduction that participated in the best-performing networks (Moriarty and Miikkulainen, 1996;
Gomez, 2003; Gomez and Miikkulainen, 2003). This can help to solve deep POMDPs (Gomez and
Schmidhuber, 2005). Co-Synaptic Neuro-Evolution (CoSyNE) does something similar on the level of

32
synapses or weights (Gomez et al., 2008); benefits of this were shown on difficult nonlinear POMDP
benchmarks.
Natural Evolution Strategies (NES) (Wierstra et al., 2008; Glasmachers et al., 2010; Sun et al.,
2009, 2013) link policy gradient methods and evolutionary approaches through the concept of Natural
Gradients (Amari, 1998). RNN evolution may also help to improve SL for deep RNNs through
Evolino (Schmidhuber et al., 2007) (Sec. 5.9).

6.7 Deep RL by Indirect Policy Search / Compressed NN Search


Some DS methods (Sec. 6.6) can evolve NNs with hundreds or thousands of weights, but not mil-
lions. How to search for large and deep NNs? Most SL and RL methods mentioned so far somehow
search the space of weights wi . Some profit from a reduction of the search space through shared
wi that get reused over and over again, e.g., in CNNs (Sec. 5.4, 5.8, 5.16, 5.21), or in RNNs for SL
(Sec. 5.5, 5.13, 5.17) and RL (Sec. 6.1, 6.3, 6.6).
It may be possible, however, to exploit additional regularities/compressibilities in the space of so-
lutions, through indirect search in weight space. Instead of evolving large NNs directly (Sec. 6.6), one
can sometimes greatly reduce the search space by evolving compact encodings of NNs, e.g., through
Lindenmayer Systems (Lindenmayer, 1968; Jacob et al., 1994), graph rewriting (Kitano, 1990), Cellu-
lar Encoding (Gruau et al., 1996), HyperNEAT (D’Ambrosio and Stanley, 2007; Stanley et al., 2009;
Clune et al., 2011; van den Berg and Whiteson, 2013) (extending NEAT; Sec. 6.6), and extensions
thereof (e.g., Risi and Stanley, 2012). This helps to avoid overfitting (compare Sec. 5.6.3, 5.24) and is
closely related to the topics of regularisation and MDL (Sec. 4.4).
A general approach (Schmidhuber, 1997) for both SL and RL seeks to compactly encode weights
of large NNs (Schmidhuber, 1997) through programs written in a universal programming lan-
guage (Gödel, 1931; Church, 1936; Turing, 1936; Post, 1936). Often it is much more efficient to
systematically search the space of such programs with a bias towards short and fast programs (Levin,
1973b; Schmidhuber, 1997, 2004), instead of directly searching the huge space of possible NN weight
matrices. A previous universal language for encoding NNs was assembler-like (Schmidhuber, 1997).
More recent work uses more practical languages based on coefficients of popular transforms (Fourier,
wavelet, etc). In particular, RNN weight matrices may be compressed like images, by encoding them
through the coefficients of a discrete cosine transform (DCT) (Koutnı́k et al., 2010, 2013). Compact
DCT-based descriptions can be evolved through NES or CoSyNE (Sec. 6.6). An RNN with over a
million weights learned (without a teacher) to drive a simulated car in the TORCS driving game (Loia-
cono et al., 2009, 2011), based on a high-dimensional video-like visual input stream (Koutnı́k et al.,
2013). The RNN learned both control and visual processing from scratch, without being aided by
UL. (Of course, UL might help to generate more compact image codes (Sec. 6.4, 4.2) to be fed into a
smaller RNN, to reduce the overall computational effort.)
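
A minimal sketch of the decoding step of such compressed network search follows (the shapes are
toy assumptions; the DCT-II basis itself is standard): a handful of low-frequency coefficients expands
into a full weight matrix, so evolution (Sec. 6.6) only has to operate in the small coefficient space.

    import jax.numpy as jnp

    def decode_weights(coeffs, shape):
        # expand a few DCT-II coefficients into a full flat weight vector
        n = shape[0] * shape[1]
        k = jnp.arange(len(coeffs))[:, None]                 # frequency index
        t = jnp.arange(n)[None, :]                           # flat weight index
        basis = jnp.cos(jnp.pi * k * (2 * t + 1) / (2 * n))  # DCT-II basis functions
        return (coeffs[:, None] * basis).sum(axis=0).reshape(shape)

    W = decode_weights(jnp.array([0.5, -0.2, 0.1]), (10, 100))  # 3 numbers -> 1000 weights
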

6.8 Universal RL
General purpose learning algorithms may improve themselves in open-ended fashion and
environment-specific ways in a lifelong learning context (Schmidhuber, 1987; Schmidhuber et al.,
1997b,a; Schaul and Schmidhuber, 2010). The most general type of RL is constrained only by the
fundamental limitations of computability identified by the founders of theoretical computer science
(Gödel, 1931; Church, 1936; Turing, 1936; Post, 1936). Remarkably, there exist blueprints of univer-
sal problem solvers or universal RL machines for unlimited problem depth that are time-optimal in
various theoretical senses (Hutter, 2005, 2002; Schmidhuber, 2002, 2006b). In particular, the Gödel
Machine can be implemented on general computers such as RNNs and may improve any part of its
software (including the learning algorithm itself) in a way that is provably time-optimal in a certain
sense (Schmidhuber, 2006b). It can be initialized by an asymptotically optimal meta-method (Hut-
ter, 2002) (also applicable to RNNs) which will solve any well-defined problem as quickly as the
unknown fastest way of solving it, save for an additive constant overhead that becomes negligible as
problem size grows. Note that most problems are large; only few are small. AI and DL researchers are
still in business because many are interested in problems so small that it is worth trying to reduce the
overhead through less general methods, including heuristics. Here I won’t further discuss universal
RL methods, which go beyond what is usually called DL.

7 Conclusion and Outlook


Deep Learning (DL) in Neural Networks (NNs) is relevant for Supervised Learning (SL) (Sec. 5),
Unsupervised Learning (UL) (Sec. 5), and Reinforcement Learning (RL) (Sec. 6). By alleviating
problems with deep Credit Assignment Paths (CAPs, Sec. 3, 5.9), UL (Sec. 5.6.4) can not only facil-
itate SL of sequences (Sec. 5.10) and stationary patterns (Sec. 5.7, 5.15), but also RL (Sec. 6.4, 4.2).
Dynamic Programming (DP, Sec. 4.1) is important for both deep SL (Sec. 5.5) and traditional RL with
deep NNs (Sec. 6.2). A search for solution-computing, perturbation-resistant (Sec. 5.6.3, 5.15, 5.24),
low-complexity NNs describable by few bits of information (Sec. 4.4) can reduce overfitting and im-
prove deep SL & UL (Sec. 5.6.3, 5.6.4) as well as RL (Sec. 6.7), also in the case of partially observable
environments (Sec. 6.3). Deep SL, UL, RL often create hierarchies of more and more abstract repre-
sentations of stationary data (Sec. 5.3, 5.7, 5.15), sequential data (Sec. 5.10), or RL policies (Sec. 6.5).
While UL can facilitate SL, pure SL for feedforward NNs (FNNs) (Sec. 5.5, 5.8, 5.16, 5.18) and re-
current NNs (RNNs) (Sec. 5.5, 5.13) not only won early contests (Sec. 5.12, 5.14) but also most
of the recent ones (Sec. 5.17–5.22). Especially DL in FNNs profited from GPU implementations
(Sec. 5.16–5.19). In particular, GPU-based (Sec. 5.19) Max-Pooling (Sec. 5.11) Convolutional NNs
(Sec. 5.4, 5.8, 5.16) won competitions not only in pattern recognition (Sec. 5.19–5.22) but also image
segmentation (Sec. 5.21) and object detection (Sec. 5.21, 5.22).
Unlike these systems, humans learn to actively perceive patterns by sequentially directing atten-
tion to relevant parts of the available data. Near future deep NNs will do so, too, extending previous
work since 1990 on NNs that learn selective attention through RL of (a) motor actions such as saccade
control (Sec. 6.1) and (b) internal actions controlling spotlights of attention within RNNs, thus closing
the general sensorimotor loop through both external and internal feedback (e.g., Sec. 2, 5.21, 6.6, 6.7).
Many future deep NNs will also take into account that it costs energy to activate neurons, and to
send signals between them. Brains seem to minimize such computational costs during problem solv-
ing in at least two ways: (1) At a given time, only a small fraction of all neurons is active because local
competition through winner-take-all mechanisms shuts down many neighbouring neurons, and only
winners can activate other neurons through outgoing connections (compare SLIM NNs; Sec. 5.24).
(2) Numerous neurons are sparsely connected in a compact 3D volume by many short-range and
few long-range connections (much like microchips in traditional supercomputers). Often neighbour-
ing neurons are allocated to solve a single task, thus reducing communication costs. Physics seems
to dictate that any efficient computational hardware will in the future also have to be brain-like in
keeping with these two constraints. The most successful current deep RNNs, however, are not. Un-
like certain spiking NNs (Sec. 5.26), they usually activate all units at least slightly, and tend to be
strongly connected, ignoring natural constraints of 3D hardware. It should be possible to improve
them by adopting (1) and (2), and by minimizing non-differentiable energy and communication costs
through direct search in program (weight) space (e.g., Sec. 6.6, 6.7). These more brain-like RNNs
will allocate neighboring RNN parts to related behaviors, and distant RNN parts to less related ones,
thus self-modularizing in a way more general than that of traditional self-organizing maps in FNNs
(Sec. 5.6.4). They will also implement Occam’s razor (Sec. 4.4, 5.6.3) as a by-product of energy min-
imization, by finding simple (highly generalizing) problem solutions that require few active neurons
and few, mostly short connections.
The more distant future may belong to general purpose learning algorithms that improve them-
selves in provably optimal ways (Sec. 6.8), but these are not yet practical or commercially relevant.

8 Acknowledgments
Since 16 April 2014, drafts of this paper have undergone massive open online peer review through
public mailing lists including connectionists@cs.cmu.edu, ml-news@googlegroups.com,
comp-neuro@neuroinf.org, genetic_programming@yahoogroups.com, rl-list@googlegroups.com,
imageworld@diku.dk, and the Google+ machine learning forum. Thanks to numerous NN / DL experts
for valuable comments. Thanks to SNF, DFG, and the European Commission for partially funding my
DL research group in the past quarter-century. The contents of this paper may be used for educational
and non-commercial purposes, including articles for Wikipedia and similar sites.

References
Aberdeen, D. (2003). Policy-Gradient Algorithms for Partially Observable Markov Decision Pro-
cesses. PhD thesis, Australian National University.
Abounadi, J., Bertsekas, D., and Borkar, V. S. (2002). Learning algorithms for Markov decision
processes with average cost. SIAM Journal on Control and Optimization, 40(3):681–698.

Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
Second Intl. Symposium on Information Theory, pages 267–281. Akademinai Kiado.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19(6):716–723.

Allender, A. (1992). Application of time-bounded Kolmogorov complexity in complexity theory. In
Watanabe, O., editor, Kolmogorov complexity and computational complexity, pages 6–22. EATCS
Monographs on Theoretical Computer Science, Springer.
Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial
environment. In IEEE 1st International Conference on Neural Networks, San Diego, volume 2,
pages 609–618.
Almeida, L. B., Langlois, T., Amaral, J. D., and Redol, R. A. (1997). On-line step
size adaptation. Technical report, INESC, 9 Rua Alves Redol, 1000.
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Trans. EC, 16(3):299–307.

Amari, S., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind signal separation.
In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information
Processing Systems (NIPS), volume 8. The MIT Press.
Amari, S. and Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion.
Neural Computation, 5(1):140–153.

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–
276.
Amit, D. J. and Brunel, N. (1997). Dynamics of a recurrent network of spiking neurons before and
following learning. Network: Computation in Neural Systems, 8(4):373–404.
An, G. (1996). The effects of adding noise during backpropagation training on a generalization
performance. Neural Computation, 8(3):643–674.
Andrade, M. A., Chacon, P., Merelo, J. J., and Moran, F. (1993). Evaluation of secondary structure
of proteins from UV circular dichroism spectra using an unsupervised learning neural network.
Protein Engineering, 6(4):383–390.
Andrews, R., Diederich, J., and Tickle, A. B. (1995). Survey and critique of techniques for extracting
rules from trained artificial neural networks. Knowledge-Based Systems, 8(6):373–389.
Anguita, D. and Gomes, B. A. (1996). Mixing floating- and fixed-point formats for neural network
learning on neuroprocessors. Microprocessing and Microprogramming, 41(10):757 – 769.
Anguita, D., Parodi, G., and Zunino, R. (1994). An efficient implementation of BP on RISC-based
workstations. Neurocomputing, 6(1):57 – 65.
Arel, I., Rose, D. C., and Karnowski, T. P. (2010). Deep machine learning – a new frontier in artificial
intelligence research. Computational Intelligence Magazine, IEEE, 5(4):13–18.
Ash, T. (1989). Dynamic node creation in backpropagation neural networks. Connection Science,
1(4):365–375.

Atick, J. J., Li, Z., and Redlich, A. N. (1992). Understanding retinal color coding from first principles.
Neural Computation, 4:559–572.
Atiya, A. F. and Parlos, A. G. (2000). New results on recurrent network training: unifying the algo-
rithms and accelerating convergence. IEEE Transactions on Neural Networks, 11(3):697–709.

Ba, J. and Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural
Information Processing Systems (NIPS), pages 3084–3092.
Baird, H. (1990). Document image defect models. In Proceedings, IAPR Workshop on Syntactic and
Structural Pattern Recognition, Murray Hill, NJ.

Baird, L. and Moore, A. W. (1999). Gradient descent for general reinforcement learning. In Advances
in neural information processing systems 12 (NIPS), pages 968–974. MIT Press.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In
International Conference on Machine Learning, pages 30–37.
Bakker, B. (2002). Reinforcement learning with Long Short-Term Memory. In Dietterich, T. G.,
Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14,
pages 1475–1482. MIT Press, Cambridge, MA.
Bakker, B. and Schmidhuber, J. (2004). Hierarchical reinforcement learning based on subgoal dis-
covery and subpolicy specialization. In et al., F. G., editor, Proc. 8th Conference on Intelligent
Autonomous Systems IAS-8, pages 438–445, Amsterdam, NL. IOS Press.

Bakker, B., Zhumatiy, V., Gruener, G., and Schmidhuber, J. (2003). A robot that reinforcement-learns
to identify and memorize important previous observations. In Proceedings of the 2003 IEEE/RSJ
International Conference on Intelligent Robots and Systems, IROS 2003, pages 430–435.
Baldi, P. (1995). Gradient descent learning algorithms overview: A general dynamical systems per-
spective. IEEE Transactions on Neural Networks, 6(1):182–195.

Baldi, P. (2012). Autoencoders, unsupervised learning, and deep architectures. Journal of Machine
Learning Research (Proc. 2011 ICML Workshop on Unsupervised and Transfer Learning), 27:37–
50.
Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., and Soda, G. (1999). Exploiting the past and the
future in protein secondary structure prediction. Bioinformatics, 15:937–946.

Baldi, P. and Chauvin, Y. (1993). Neural networks for fingerprint recognition. Neural Computation,
5(3):402–418.
Baldi, P. and Chauvin, Y. (1996). Hybrid modeling, HMM/NN architectures, and protein applications.
Neural Computation, 8(7):1541–1565.

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from
examples without local minima. Neural Networks, 2:53–58.
Baldi, P. and Hornik, K. (1994). Learning in linear networks: a survey. IEEE Transactions on Neural
Networks, 6(4):837–858. 1995.
Baldi, P. and Pollastri, G. (2003). The principled design of large-scale recursive neural network
architectures – DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res.,
4:575–602.
Baldi, P. and Sadowski, P. (2014). The dropout learning algorithm. Artificial Intelligence, 210C:78–
122.

Ballard, D. H. (1987). Modular learning in neural networks. In Proc. AAAI, pages 279–284.
Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic
search based function optimization and competitive learning. Technical Report CMU-CS-94-163,
Carnegie Mellon University.
Balzer, R. (1985). A 15 year perspective on automatic programming. IEEE Transactions on Software
Engineering, 11(11):1257–1268.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1(3):295–311.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. (1989). Finding minimum entropy codes. Neural
Computation, 1(3):412–423.

Barrow, H. G. (1987). Learning receptive fields. In Proceedings of the IEEE 1st Annual Conference
on Neural Networks, volume IV, pages 115–121. IEEE.
Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning.
Discrete Event Dynamic Systems, 13(4):341–379.

Barto, A. G., Singh, S., and Chentanez, N. (2004). Intrinsically motivated learning of hierarchical col-
lections of skills. In Proceedings of International Conference on Developmental Learning (ICDL),
pages 112–119. MIT Press, Cambridge, MA.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can
solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics,
SMC-13:834–846.

Battiti, R. (1989). Accelerated backpropagation learning: two optimization methods. Complex Sys-
tems, 3(4):331–342.
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and New-
ton’s method. Neural Computation, 4(2):141–166.

Baum, E. B. and Haussler, D. (1989). What size net gives valid generalization? Neural Computation,
1(1):151–160.
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state
Markov chains. The Annals of Mathematical Statistics, pages 1554–1563.

Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. J. Artif. Int. Res.,
15(1):319–350.
Bayer, J. and Osendorfer, C. (2014). Variational inference of latent state sequences using recurrent
networks. arXiv preprint arXiv:1406.1655.
Bayer, J., Osendorfer, C., Chen, N., Urban, S., and van der Smagt, P. (2013). On fast dropout and its
applicability to recurrent networks. arXiv preprint arXiv:1311.0701.
Bayer, J., Wierstra, D., Togelius, J., and Schmidhuber, J. (2009). Evolving memory cell structures for
sequence learning. In Proc. ICANN (2), pages 755–764.
Bayes, T. (1763). An essay toward solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society of London, 53:370–418. Communicated by R. Price, in a letter
to J. Canton.
Becker, S. (1991). Unsupervised learning procedures for neural networks. International Journal of
Neural Systems, 2(1 & 2):17–33.
Becker, S. and Le Cun, Y. (1989). Improving the convergence of back-propagation learning with
second order methods. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proc. 1988 Con-
nectionist Models Summer School, pages 29–37, Pittsburg 1988. Morgan Kaufmann, San Mateo.
Behnke, S. (1999). Hebbian learning and competition in the neural abstraction pyramid. In Pro-
ceedings of the International Joint Conference on Neural Networks (IJCNN), volume 2, pages
1356–1361.

Behnke, S. (2001). Learning iterative image reconstruction in the neural abstraction pyramid. Inter-
national Journal of Computational Intelligence and Applications, 1(4):427–438.
Behnke, S. (2002). Learning face localization using hierarchical recurrent networks. In Proceedings
of the 12th International Conference on Artificial Neural Networks (ICANN), Madrid, Spain, pages
1319–1324.

Behnke, S. (2003a). Discovering hierarchical speech features using convolutional non-negative matrix
factorization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN),
volume 4, pages 2758–2763.
Behnke, S. (2003b). Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of
Lecture Notes in Computer Science. Springer.

Behnke, S. (2005). Face localization and tracking in the Neural Abstraction Pyramid. Neural Com-
puting and Applications, 14(2):97–103.
Behnke, S. and Rojas, R. (1998). Neural abstraction pyramid: A hierarchical image understand-
ing architecture. In Proceedings of International Joint Conference on Neural Networks (IJCNN),
volume 2, pages 820–825.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation
and blind deconvolution. Neural Computation, 7(6):1129–1159.
Bellman, R. (1957). Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1st
edition.

Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., and Moulines, E. (1997). A blind source separation
technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2):434–444.
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition. PhD
thesis, McGill University, (Computer Science), Montreal, Qc., Canada.
Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning,
V2(1). Now Publishers.
Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new per-
spectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep
networks. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information
Processing Systems 19 (NIPS), pages 153–160. MIT Press.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Beringer, N., Graves, A., Schiel, F., and Schmidhuber, J. (2005). Classifying unprompted speech by
retraining LSTM nets. In Duch, W., Kacprzyk, J., Oja, E., and Zadrozny, S., editors, Artificial
Neural Networks: Biological Inspirations - ICANN 2005, LNCS 3696, pages 575–581. Springer-
Verlag Berlin Heidelberg.
Bertsekas, D. P. (2001). Dynamic Programming and Optimal Control. Athena Scientific.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific, Bel-
mont, MA.
Bichot, N. P., Rossi, A. F., and Desimone, R. (2005). Parallel and serial neural mechanisms for visual
search in macaque area V4. Science, 308:529–534.
Biegler-König, F. and Bärmann, F. (1993). A learning algorithm for multilayered neural networks
based on linear least squares problems. Neural Networks, 6(1):127–131.

Bishop, C. M. (1993). Curvature-driven smoothing: A learning algorithm for feed-forward networks.
IEEE Transactions on Neural Networks, 4(5):882–884.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blair, A. D. and Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation,
9(5):1127–1142.

Blondel, V. D. and Tsitsiklis, J. N. (2000). A survey of computational complexity results in systems
and control. Automatica, 36(9):1249–1274.
Bluche, T., Louradour, J., Knibbe, M., Moysset, B., Benzeghiba, F., and Kermorvant, C. (2014).
The A2iA Arabic Handwritten Text Recognition System at the OpenHaRT2013 Evaluation. In
International Workshop on Document Analysis Systems.

Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural
Networks, 5(1):117–127.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1987). Occam’s razor. Information
Processing Letters, 24:377–380.

Bobrowski, L. (1978). Learning processes in multilayer threshold nets. Biological Cybernetics, 31:1–
6.
Bodén, M. and Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural
networks. Connection Science, 12(3-4):197–210.

Bodenhausen, U. and Waibel, A. (1991). The Tempo 2 algorithm: Adjusting time-delays by super-
vised learning. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
Information Processing Systems 3, pages 155–161. Morgan Kaufmann.
Bohte, S. M., Kok, J. N., and La Poutre, H. (2002). Error-backpropagation in temporally encoded
networks of spiking neurons. Neurocomputing, 48(1):17–37.

Boltzmann, L. (1909). In Hasenöhrl, F., editor, Wissenschaftliche Abhandlungen (collection of Boltz-
mann’s articles in scientific journals). Barth, Leipzig.
Bottou, L. (1991). Une approche théorique de l’apprentissage connexioniste; applications à la re-
connaissance de la parole. PhD thesis, Université de Paris XI.

Bourlard, H. and Morgan, N. (1994). Connectionist Speech Recognition: A Hybrid Approach.
Kluwer Academic Publishers.
Boutilier, C. and Poole, D. (1996). Computing optimal policies for partially observable Markov
decision processes using compact representations. In Proceedings of the AAAI, Portland, OR.
Bradtke, S. J., Barto, A. G., and Kaelbling, L. P. (1996). Linear least-squares algorithms for temporal
difference learning. In Machine Learning, pages 22–33.
Brafman, R. I. and Tennenholtz, M. (2002). R-MAX—a general polynomial time algorithm for near-
optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231.
Brea, J., Senn, W., and Pfister, J.-P. (2013). Matching recall and storage in sequence learning with
spiking neural networks. The Journal of Neuroscience, 33(23):9565–9575.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.
Brette, R., Rudolph, M., Carnevale, T., Hines, M., Beeman, D., Bower, J. M., Diesmann, M., Morri-
son, A., Goodman, P. H., Harris Jr, F. C., et al. (2007). Simulation of networks of spiking neurons:
a review of tools and strategies. Journal of Computational Neuroscience, 23(3):349–398.
Breuel, T. M., Ul-Hasan, A., Al-Azawi, M. A., and Shafait, F. (2013). High-performance OCR for
printed English and Fraktur using LSTM networks. In 12th International Conference on Document
Analysis and Recognition (ICDAR), pages 683–687. IEEE.
Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Sackinger, E., and Shah, R.
(1993). Signature verification using a Siamese time delay neural network. International Journal of
Pattern Recognition and Artificial Intelligence, 7(4):669–688.
Broyden, C. G. et al. (1965). A class of methods for solving nonlinear simultaneous equations. Math.
Comp, 19(92):577–593.
Brueckner, R. and Schuller, B. (2014). Social signal classification using deep BLSTM recurrent neural
networks. In Proceedings 39th IEEE International Conference on Acoustics, Speech, and Signal
Processing, ICASSP 2014, Florence, Italy, pages 4856–4860.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking
neurons. Journal of Computational Neuroscience, 8(3):183–208.
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and control. Blais-
dell Pub. Co.
Bryson, A. E. (1961). A gradient method for optimizing multi-stage allocation processes. In Proc.
Harvard Univ. Symposium on digital computers and their applications.
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving optimum pro-
gramming problems. Technical Report BR-1303, Raytheon Company, Missile and Space Division.
Buhler, J. (2001). Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinfor-
matics, 17(5):419–428.
Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5:603–643.
Burgess, N. (1994). A constructive algorithm that converges for real-valued input patterns. Interna-
tional Journal of Neural Systems, 5(1):59–66.
Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc.
EUSIPCO, pages 776–779.
Carreira-Perpinan, M. A. (2001). Continuous latent variable models for dimensionality reduction and
sequential data reconstruction. PhD thesis, University of Sheffield UK.
Carter, M. J., Rudolph, F. J., and Nucci, A. J. (1990). Operational fault tolerance of CMAC networks.
In Touretzky, D. S., editor, Advances in Neural Information Processing Systems (NIPS) 2, pages
340–347. San Mateo, CA: Morgan Kaufmann.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41–75.
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural
networks and finite state machine extraction. Neural Computation, 8(6):1135–1178.

Cauwenberghs, G. (1993). A fast stochastic error-descent algorithm for supervised learning and op-
timization. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
Information Processing Systems 5, pages 244–244. Morgan Kaufmann.
Chaitin, G. J. (1966). On the length of programs for computing finite binary sequences. Journal of
the ACM, 13:547–569.

Chalup, S. K. and Blair, A. D. (2003). Incremental training of first order recurrent neural networks to
predict a context-sensitive language. Neural Networks, 16(7):955–972.
Chellapilla, K., Puri, S., and Simard, P. (2006). High performance convolutional neural networks for
document processing. In International Workshop on Frontiers in Handwriting Recognition.
Chen, K. and Salman, A. (2011). Learning speaker-specific characteristics with a deep neural archi-
tecture. IEEE Transactions on Neural Networks, 22(11):1744–1756.
Cho, K. (2014). Foundations and Advances in Deep Learning. PhD thesis, Aalto University School
of Science.
Cho, K., Ilin, A., and Raiko, T. (2012). Tikhonov-type regularization for restricted Boltzmann ma-
chines. In Intl. Conf. on Artificial Neural Networks (ICANN) 2012, pages 81–88. Springer.
Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines.
Neural Computation, 25(3):805–831.
Church, A. (1936). An unsolvable problem of elementary number theory. American Journal of
Mathematics, 58:345–363.

Ciresan, D. C., Giusti, A., Gambardella, L. M., and Schmidhuber, J. (2012a). Deep neural networks
segment neuronal membranes in electron microscopy images. In Advances in Neural Information
Processing Systems (NIPS), pages 2852–2860.
Ciresan, D. C., Giusti, A., Gambardella, L. M., and Schmidhuber, J. (2013). Mitosis detection in
breast cancer histology images with deep neural networks. In Proc. MICCAI, volume 2, pages
411–418.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural
nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011a). Flexible, high
performance convolutional neural networks for image classification. In Intl. Joint Conference on
Artificial Intelligence IJCAI, pages 1237–1242.
Ciresan, D. C., Meier, U., Masci, J., and Schmidhuber, J. (2011b). A committee of neural networks
for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages
1918–1921.

Ciresan, D. C., Meier, U., Masci, J., and Schmidhuber, J. (2012b). Multi-column deep neural network
for traffic sign classification. Neural Networks, 32:333–338.
Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012c). Multi-column deep neural networks for image
classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012. Long
preprint arXiv:1202.2745v1 [cs.CV].

Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012d). Transfer learning for Latin and Chinese char-
acters with deep neural networks. In International Joint Conference on Neural Networks (IJCNN),
pages 1301–1306.
Ciresan, D. C. and Schmidhuber, J. (2013). Multi-column deep neural networks for offline handwrit-
ten Chinese character classification. Technical report, IDSIA. arXiv:1309.0261.
Cliff, D. T., Husbands, P., and Harvey, I. (1993). Evolving recurrent dynamical networks for robot
control. In Artificial Neural Nets and Genetic Algorithms, pages 428–435. Springer.
Clune, J., Mouret, J.-B., and Lipson, H. (2013). The evolutionary origins of modularity. Proceedings
of the Royal Society B: Biological Sciences, 280(1755):20122863.
Clune, J., Stanley, K. O., Pennock, R. T., and Ofria, C. (2011). On the performance of indirect
encoding across the continuum of regularity. Trans. Evol. Comp, 15(3):346–367.
Coates, A., Huval, B., Wang, T., Wu, D. J., Ng, A. Y., and Catanzaro, B. (2013). Deep learning with
COTS HPC systems. In Proc. International Conference on Machine learning (ICML’13).
Cochocki, A. and Unbehauen, R. (1993). Neural networks for optimization and signal processing.
John Wiley & Sons, Inc.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep
neural networks with multitask learning. In Proceedings of the 25th International Conference on
Machine Learning (ICML), pages 160–167. ACM.
Comon, P. (1994). Independent component analysis – a new concept? Signal Processing, 36(3):287–
314.
Connor, C. E., Brincat, S. L., and Pasupathy, A. (2007). Transformation of shape information in the
ventral pathway. Current Opinion in Neurobiology, 17(2):140–147.
Connor, J., Martin, D. R., and Atlas, L. E. (1994). Recurrent neural networks and robust time series
prediction. IEEE Transactions on Neural Networks, 5(2):240–254.
Cook, S. A. (1971). The complexity of theorem-proving procedures. In Proceedings of the 3rd Annual
ACM Symposium on the Theory of Computing (STOC’71), pages 151–158. ACM, New York.
Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs.
In Grefenstette, J., editor, Proceedings of an International Conference on Genetic Algorithms and
Their Applications, Carnegie-Mellon University, July 24-26, 1985, Hillsdale NJ. Lawrence Erl-
baum Associates.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct
degree of smoothing by the method of generalized cross-validation. Numer. Math., 31:377–403.
Cuccu, G., Luciw, M., Schmidhuber, J., and Gomez, F. (2011). Intrinsically motivated evolutionary
search for vision-based reinforcement learning. In Proceedings of the 2011 IEEE Conference on
Development and Learning and Epigenetic Robotics IEEE-ICDL-EPIROB, volume 2, pages 1–7.
IEEE.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural net-
works for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE
Transactions on, 20(1):30–42.

Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR
using rectified linear units and dropout. In IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 8609–8613. IEEE.
D’Ambrosio, D. B. and Stanley, K. O. (2007). A novel generative encoding for exploiting neural net-
work sensor and output geometry. In Proceedings of the Conference on Genetic and Evolutionary
Computation (GECCO), pages 974–981.

Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme
based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational
Geometry, pages 253–262. ACM.
Dayan, P. and Hinton, G. (1993). Feudal reinforcement learning. In Lippman, D. S., Moody, J. E.,
and Touretzky, D. S., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages
271–278. Morgan Kaufmann.
Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8):1385–
1403.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural
Computation, 7:889–904.
Dayan, P. and Zemel, R. (1995). Competition and multiple cause models. Neural Computation,
7:565–579.
De Freitas, J. F. G. (2003). Bayesian methods for neural networks. PhD thesis, University of Cam-
bridge.

de Souto, M. C., Souto, M. C. P. D., and Oliveira, W. R. D. (1999). The loading problem for pyramidal
neural networks. In Electronic Journal on Mathematics of Computation.
De Valois, R. L., Albrecht, D. G., and Thorell, L. G. (1982). Spatial frequency selectivity of cells in
macaque visual cortex. Vision Research, 22(5):545–559.

de Vries, B. and Principe, J. C. (1991). A theory for neural networks with time delays. In Lippmann,
R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing
Systems (NIPS) 3, pages 162–168. Morgan Kaufmann.
Deco, G. and Parra, L. (1997). Non-linear feature extraction by redundancy reduction in an unsuper-
vised stochastic neural network. Neural Networks, 10(4):683–691.

Deco, G. and Rolls, E. T. (2005). Neurodynamics of biased competition and cooperation for attention:
a model with spiking neurons. Journal of Neurophysiology, 94(1):295–313.
DeJong, G. and Mooney, R. (1986). Explanation-based learning: An alternative view. Machine
Learning, 1(2):145–176.

DeMers, D. and Cottrell, G. (1993). Non-linear dimensionality reduction. In Hanson, S. J., Cowan,
J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages
580–587. Morgan Kaufmann.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, B, 39.

Deng, L. and Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.

Desimone, R., Albright, T. D., Gross, C. G., and Bruce, C. (1984). Stimulus-selective properties of
inferior temporal neurons in the macaque. The Journal of Neuroscience, 4(8):2051–2062.
Deville, Y. and Lau, K. K. (1994). Logic program synthesis. Journal of Logic Programming,
19(20):321–350.

Di Lena, P., Nagata, K., and Baldi, P. (2012). Deep architectures for protein contact map prediction.
Bioinformatics, 28:2449–2457.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. (2012). How does the brain solve visual object recogni-
tion? Neuron, 73(3):415–434.

Dickmanns, D., Schmidhuber, J., and Winklhofer, A. (1987). Der genetische Algorithmus:
Eine Implementierung in Prolog. Technical Report, Inst. of Informatics, Tech. Univ. Munich.
http://www.idsia.ch/˜juergen/geneticprogramming.html.
Dickmanns, E. D., Behringer, R., Dickmanns, D., Hildebrandt, T., Maurer, M., Thomanek, F., and
Schiehlen, J. (1994). The seeing passenger car ’VaMoRs-P’. In Proc. Int. Symp. on Intelligent
Vehicles ’94, Paris, pages 68–73.
Dietterich, T. G. (2000a). Ensemble methods in machine learning. In Multiple classifier systems,
pages 1–15. Springer.
Dietterich, T. G. (2000b). Hierarchical reinforcement learning with the MAXQ value function de-
composition. J. Artif. Intell. Res. (JAIR), 13:227–303.

Director, S. W. and Rohrer, R. A. (1969). Automated network design - the frequency-domain case.
IEEE Trans. Circuit Theory, CT-16:330–337.
Dittenbach, M., Merkl, D., and Rauber, A. (2000). The growing hierarchical self-organizing map.
In IEEE-INNS-ENNS International Joint Conference on Neural Networks, volume 6, pages 6015–
6015. IEEE Computer Society.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2013). De-
CAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint
arXiv:1310.1531.
Dorffner, G. (1996). Neural networks for time series processing. In Neural Network World.

Doya, K., Samejima, K., ichi Katagiri, K., and Kawato, M. (2002). Multiple model-based reinforce-
ment learning. Neural Computation, 14(6):1347–1369.
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical
Analysis and Applications, 5(1):30–45.

Dreyfus, S. E. (1973). The computational solution of optimal control problems with time lag. IEEE
Transactions on Automatic Control, 18(4):383–385.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. The Journal of Machine Learning, 12:2121–2159.

Egorova, A., Gloye, A., Göktekin, C., Liers, A., Luft, M., Rojas, R., Simon, M., Tenchio, O., and
Wiesel, F. (2004). FU-Fighters Small Size 2004, Team Description. RoboCup 2004 Symposium:
Papers and Team Description Papers. CD edition.
Elfwing, S., Otsuka, M., Uchibe, E., and Doya, K. (2010). Free-energy based reinforcement learning
for vision-based navigation with high-dimensional sensory inputs. In Neural Information Process-
ing. Theory and Algorithms (ICONIP), volume 1, pages 215–222. Springer.
Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition. Oxford
University Press, New York, NY.
Eliasmith, C., Stewart, T. C., Choo, X., Bekolay, T., DeWolf, T., Tang, Y., and Rasmussen, D. (2012).
A large-scale model of the functioning brain. Science, 338(6111):1202–1205.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why does
unsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625–660.
Escalante-B., A. N. and Wiskott, L. (2013). How to solve classification and regression problems on
high-dimensional data with a supervised extension of slow feature analysis. Journal of Machine
Learning Research, 14:3683–3719.
Eubank, R. L. (1988). Spline smoothing and nonparametric regression. In Farlow, S., editor, Self-
Organizing Methods in Modeling. Marcel Dekker, New York.
Euler, L. (1744). Methodus inveniendi.
Eyben, F., Weninger, F., Squartini, S., and Schuller, B. (2013). Real-life voice activity detection with
LSTM recurrent neural networks and an application to Hollywood movies. In Proc. 38th IEEE
International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver,
Canada, pages 483–487.
Faggin, F. (1992). Neural network hardware. In International Joint Conference on Neural Networks
(IJCNN), volume 1, page 153.
Fahlman, S. E. (1988). An empirical study of learning speed in back-propagation networks. Technical
Report CMU-CS-88-162, Carnegie-Mellon Univ.
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In Lippmann, R. P.,
Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems
(NIPS) 3, pages 190–196. Morgan Kaufmann.
Falconbridge, M. S., Stamps, R. L., and Badcock, D. R. (2006). A simple Hebbian/anti-Hebbian
network learns the sparse, independent components of natural images. Neural Computation,
18(2):415–429.
Fan, Y., Qian, Y., Xie, F., and Soong, F. K. (2014). TTS synthesis with bidirectional LSTM based
recurrent neural networks. In Proc. Interspeech.
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features for scene
labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929.
Farlow, S. J. (1984). Self-organizing methods in modeling: GMDH type algorithms, volume 54. CRC
Press.

Feldkamp, L. A., Prokhorov, D. V., Eagen, C. F., and Yuan, F. (1998). Enhanced multi-stream Kalman
filter training for recurrent networks. In Nonlinear Modeling, pages 29–53. Springer.
Feldkamp, L. A., Prokhorov, D. V., and Feldkamp, T. M. (2003). Simple and conditioned adaptive
behavior from Kalman filter trained recurrent networks. Neural Networks, 16(5):683–689.
Feldkamp, L. A. and Puskorius, G. V. (1998). A signal processing framework based on dynamic neural
networks with application to problems in adaptation, filtering, and classification. Proceedings of
the IEEE, 86(11):2259–2277.
Felleman, D. J. and Van Essen, D. C. (1991). Distributed hierarchical processing in the primate
cerebral cortex. Cerebral Cortex, 1(1):1–47.
Fernandez, R., Rendel, A., Ramabhadran, B., and Hoory, R. (2014). Prosody contour prediction with
Long Short-Term Memory, bi-directional, deep recurrent neural networks. In Proc. Interspeech.
Fernández, S., Graves, A., and Schmidhuber, J. (2007). An application of recurrent neural networks
to discriminative keyword spotting. In Proc. ICANN (2), pages 220–229.
Fernandez, S., Graves, A., and Schmidhuber, J. (2007). Sequence labelling in structured domains with
hierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference
on Artificial Intelligence (IJCAI).
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of
cortical cells. Journal of the Optical Society of America, 4:2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6:559–601.
Fieres, J., Schemmel, J., and Meier, K. (2008). Realizing biological spiking network models in
a configurable wafer-scale hardware system. In IEEE International Joint Conference on Neural
Networks, pages 969–976.
Fine, S., Singer, Y., and Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and
applications. Machine Learning, 32(1):41–62.
Fischer, A. and Igel, C. (2014). Training restricted Boltzmann machines: An introduction. Pattern
Recognition, 47:25–39.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane.
Biophysical Journal, 1(6):445–466.
Fletcher, R. and Powell, M. J. (1963). A rapidly convergent descent method for minimization. The
Computer Journal, 6(2):163–168.
Floreano, D. and Mattiussi, C. (2001). Evolution of spiking neural controllers for autonomous vision-
based robots. In Evolutionary Robotics. From Intelligent Robotics to Artificial Life, pages 38–61.
Springer.
Fogel, D. B., Fogel, L. J., and Porto, V. (1990). Evolving neural networks. Biological Cybernetics,
63(6):487–493.
Fogel, L., Owens, A., and Walsh, M. (1966). Artificial Intelligence through Simulated Evolution.
Wiley, New York.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cyber-
netics, 64:165–170.
Földiák, P. and Young, M. P. (1995). Sparse coding in the primate cortex. In Arbib, M. A., editor, The
Handbook of Brain Theory and Neural Networks, pages 895–898. The MIT Press.
Förster, A., Graves, A., and Schmidhuber, J. (2007). RNN-based Learning of Compact Maps for
Efficient Robot Localization. In 15th European Symposium on Artificial Neural Networks, ESANN,
pages 537–542, Bruges, Belgium.
Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place, head-
direction, and spatial-view cells. PLoS Computational Biology, 3(8):166.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning, volume 1.
Springer Series in Statistics, New York.
Frinken, V., Zamora-Martinez, F., Espana-Boquera, S., Castro-Bleda, M. J., Fischer, A., and Bunke,
H. (2012). Long-short term memory neural networks language modeling for handwriting recog-
nition. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 701–704.
IEEE.
Fritzke, B. (1994). A growing neural gas network learns topologies. In Tesauro, G., Touretzky, D. S.,
and Leen, T. K., editors, NIPS, pages 625–632. MIT Press.
Fu, K. S. (1977). Syntactic Pattern Recognition and Applications. Berlin, Springer.
Fukada, T., Schuster, M., and Sagisaka, Y. (1999). Phoneme boundary estimation using bidirectional
recurrent neural networks and its applications. Systems and Computers in Japan, 30(4):20–30.
Fukushima, K. (1979). Neural network model for a mechanism of pattern recognition unaffected by
shift in position - Neocognitron. Trans. IECE, J62-A(10):658–665.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network for a mechanism of pattern
recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202.
Fukushima, K. (2011). Increasing robustness against background noise: visual pattern recognition by
a Neocognitron. Neural Networks, 24(7):767–778.
Fukushima, K. (2013a). Artificial vision by multi-layered neural networks: Neocognitron and its
advances. Neural Networks, 37:103–119.
Fukushima, K. (2013b). Training multi-layered neural network Neocognitron. Neural Networks,
40:18–31.
Gabor, D. (1946). Theory of communication. Part 1: The analysis of information. Journal of the
Institution of Electrical Engineers - Part III: Radio and Communication Engineering, 93(26):429–441.
Gallant, S. I. (1988). Connectionist expert systems. Communications of the ACM, 31(2):152–169.
Gauss, C. F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambientium.
Gauss, C. F. (1821). Theoria combinationis observationum erroribus minimis obnoxiae (Theory of
the combination of observations least subject to error).
Ge, S., Hang, C. C., Lee, T. H., and Zhang, T. (2010). Stable adaptive neural network control.
Springer.
Geiger, J. T., Zhang, Z., Weninger, F., Schuller, B., and Rigoll, G. (2014). Robust speech recognition
using long short-term memory recurrent neural networks for hybrid acoustic modelling. In Proc.
Interspeech.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma.
Neural Computation, 4:1–58.
Gers, F. A. and Schmidhuber, J. (2000). Recurrent nets that time and count. In Neural Networks, 2000.
IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, volume 3,
pages 189–194. IEEE.
Gers, F. A. and Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free and
context sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with
LSTM. Neural Computation, 12(10):2451–2471.
Gers, F. A., Schraudolph, N., and Schmidhuber, J. (2002). Learning precise timing with LSTM
recurrent networks. Journal of Machine Learning Research, 3:115–143.
Gerstner, W. and Kistler, W. K. (2002). Spiking Neuron Models. Cambridge University Press.
Gerstner, W. and van Hemmen, J. L. (1992). Associative memory in a network of spiking neurons.
Network: Computation in Neural Systems, 3(2):139–164.
Ghavamzadeh, M. and Mahadevan, S. (2003). Hierarchical policy gradient algorithms. In Proceedings
of the Twentieth Conference on Machine Learning (ICML-2003), pages 226–233.
Gherrity, M. (1989). A learning algorithm for analog fully recurrent neural networks. In IEEE/INNS
International Joint Conference on Neural Networks, San Diego, volume 1, pages 643–644.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchies for accurate object
detection and semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley and
ICSI.
Gisslen, L., Luciw, M., Graziano, V., and Schmidhuber, J. (2011). Sequential constant size compressor
for reinforcement learning. In Proc. Fourth Conference on Artificial General Intelligence (AGI),
Google, Mountain View, CA, pages 31–40. Springer.
Giusti, A., Ciresan, D. C., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2013). Fast image
scanning with deep max-pooling convolutional neural networks. In Proc. ICIP.
Glackin, B., McGinnity, T. M., Maguire, L. P., Wu, Q., and Belatreche, A. (2005). A novel approach
for the implementation of large scale spiking neural networks on FPGA hardware. In Computa-
tional Intelligence and Bioinspired Systems, pages 552–563. Springer.
Glasmachers, T., Schaul, T., Sun, Y., Wierstra, D., and Schmidhuber, J. (2010). Exponential natural
evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO), pages 393–400. ACM.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier networks. In AISTATS, volume 15,
pages 315–323.
Gloye, A., Wiesel, F., Tenchio, O., and Simon, M. (2005). Reinforcing the driving quality of soccer
playing robots by anticipation. IT - Information Technology, 47(5).
Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter
Systeme I. Monatshefte für Mathematik und Physik, 38:173–198.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-
Wesley, Reading, MA.
Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics
of computation, 24(109):23–26.
Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing
a good ridge parameter. Technometrics, 21:215–224.
Gomez, F. J. (2003). Robust Nonlinear Control through Neuroevolution. PhD thesis, Department of
Computer Sciences, University of Texas at Austin.
Gomez, F. J. and Miikkulainen, R. (2003). Active guidance for a finless rocket using neuroevolution.
In Proc. GECCO 2003, Chicago.
Gomez, F. J. and Schmidhuber, J. (2005). Co-evolving recurrent neurons learn deep memory
POMDPs. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO),
Washington, D. C. ACM Press, New York, NY, USA.
Gomez, F. J., Schmidhuber, J., and Miikkulainen, R. (2008). Accelerated neural evolution through
cooperatively coevolved synapses. Journal of Machine Learning Research, 9(May):937–965.
Gomi, H. and Kawato, M. (1993). Neural network control for a closed-loop system using feedback-
error-learning. Neural Networks, 6(7):933–946.
Gonzalez-Dominguez, J., Lopez-Moreno, I., Sak, H., Gonzalez-Rodriguez, J., and Moreno, P. J.
(2014). Automatic language identification using Long Short-Term Memory recurrent neural net-
works. In Proc. Interspeech.
Goodfellow, I., Mirza, M., Da, X., Courville, A., and Bengio, Y. (2014a). An Empirical Investigation
of Catastrophic Forgetting in Gradient-Based Neural Networks. TR arXiv:1312.6211v2.
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014b). Multi-digit number
recognition from street view imagery using deep convolutional neural networks. arXiv preprint
arXiv:1312.6082 v4.
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised
feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
Goodfellow, I. J., Courville, A. C., and Bengio, Y. (2012). Large-scale feature learning with spike-
and-slab sparse coding. In Proceedings of the 29th International Conference on Machine Learning
(ICML).
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout net-
works. In International Conference on Machine Learning (ICML).
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Infor-
mation Processing Systems (NIPS), pages 2348–2356.
Graves, A., Eck, D., Beringer, N., and Schmidhuber, J. (2003). Isolated digit recognition with LSTM
recurrent networks. In First International Workshop on Biologically Inspired Approaches to Ad-
vanced Information Technology, Lausanne.
Graves, A., Fernandez, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classifi-
cation: Labelling unsegmented sequence data with recurrent neural nets. In ICML’06: Proceedings
of the 23rd International Conference on Machine Learning, pages 369–376.
Graves, A., Fernandez, S., Liwicki, M., Bunke, H., and Schmidhuber, J. (2008). Unconstrained on-
line handwriting recognition with recurrent neural networks. In Platt, J., Koller, D., Singer, Y.,
and Roweis, S., editors, Advances in Neural Information Processing Systems (NIPS) 20, pages
577–584. MIT Press, Cambridge, MA.
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural net-
works. In Proc. 31st International Conference on Machine Learning (ICML), pages 1764–1772.
Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A
novel connectionist system for improved unconstrained handwriting recognition. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 31(5).
Graves, A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep recurrent
neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 6645–6649. IEEE.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM
and other neural network architectures. Neural Networks, 18(5-6):602–610.
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional re-
current neural networks. In Advances in Neural Information Processing Systems (NIPS) 21, pages
545–552. MIT Press, Cambridge, MA.
Graziano, M. (2009). The Intelligent Movement Machine: An Ethological Perspective on the Primate
Motor System. Oxford University Press, USA.
Griewank, A. (2012). Who invented the reverse mode of differentiation? Documenta Mathematica -
Extra Volume ISMP, pages 389–400.
Grondman, I., Busoniu, L., Lopes, G. A. D., and Babuska, R. (2012). A survey of actor-critic rein-
forcement learning: Standard and natural policy gradients. Systems, Man, and Cybernetics, Part C:
Applications and Reviews, IEEE Transactions on, 42(6):1291–1307.
Grossberg, S. (1969). Some networks that can learn, remember, and reproduce any number of com-
plicated space-time patterns, I. Journal of Mathematics and Mechanics, 19:53–91.
Grossberg, S. (1976a). Adaptive pattern classification and universal recoding, 1: Parallel development
and coding of neural feature detectors. Biological Cybernetics, 23:187–202.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding, 2: Feedback, expecta-
tion, olfaction, and illusions. Biological Cybernetics, 23.
Gruau, F., Whitley, D., and Pyeatt, L. (1996). A comparison between cellular encoding and direct
encoding for genetic neural networks. NeuroCOLT Technical Report NC-TR-96-048, ESPRIT
Working Group in Neural and Computational Learning, NeuroCOLT 8556.
Grünwald, P. D., Myung, I. J., and Pitt, M. A. (2005). Advances in minimum description length:
Theory and applications. MIT Press.
Grüttner, M., Sehnke, F., Schaul, T., and Schmidhuber, J. (2010). Multi-Dimensional Deep Memory
Atari-Go Players for Parameter Exploring Policy Gradients. In Proceedings of the International
Conference on Artificial Neural Networks ICANN, pages 114–123. Springer.
Guo, X., Singh, S., Lee, H., Lewis, R., and Wang, X. (2014). Deep learning for real-time Atari game
play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing
Systems 27 (NIPS).
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. (1992). Structural risk minimization for
character recognition. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in
Neural Information Processing Systems (NIPS) 4, pages 471–479. Morgan Kaufmann.
Hadamard, J. (1908). Mémoire sur le problème d’analyse relatif à l’équilibre des plaques élastiques
encastrées. Mémoires présentés par divers savants à l’Académie des sciences de l’Institut de
France: Extrait. Imprimerie nationale.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant
mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’06). IEEE Press.
Hagras, H., Pounds-Cornish, A., Colley, M., Callaghan, V., and Clarke, G. (2004). Evolving spiking
neural network controllers for autonomous robots. In IEEE International Conference on Robotics
and Automation (ICRA), volume 5, pages 4620–4626.
Hansen, N., Müller, S. D., and Koumoutsakos, P. (2003). Reducing the time complexity of the deran-
domized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computa-
tion, 11(1):1–18.
Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strate-
gies. Evolutionary Computation, 9(2):159–195.
Hanson, S. J. (1990). A stochastic version of the delta rule. Physica D: Nonlinear Phenomena,
42(1):265–272.
Hanson, S. J. and Pratt, L. Y. (1989). Comparing biases for minimal network construction with
back-propagation. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems
(NIPS) 1, pages 177–185. San Mateo, CA: Morgan Kaufmann.
Happel, B. L. and Murre, J. M. (1994). Design and evolution of modular neural network architectures.
Neural Networks, 7(6):985–1004.
Hashem, S. and Schmeiser, B. (1992). Improving model accuracy using optimal linear combinations
of trained neural networks. IEEE Transactions on Neural Networks, 6:792–794.
Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain
surgeon. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
Information Processing Systems 5, pages 164–171. Morgan Kaufmann.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning. Springer
Series in Statistics.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models. Monographs on Statistics and
Applied Probability, 43.
Hawkins, J. and George, D. (2006). Hierarchical Temporal Memory - Concepts, Theory, and Termi-
nology. Numenta Inc.
Haykin, S. S. (2001). Kalman filtering and neural networks. Wiley Online Library.
Hebb, D. O. (1949). The Organization of Behavior. Wiley, New York.
Hecht-Nielsen, R. (1989). Theory of the backpropagation neural network. In International Joint
Conference on Neural Networks (IJCNN), pages 593–605. IEEE.
Heemskerk, J. N. (1995). Overview of neural hardware. Neurocomputers for Brain-Style Processing.
Design, Implementation and Application.
Heess, N., Silver, D., and Teh, Y. W. (2012). Actor-critic reinforcement learning with energy-based
policies. In Proc. European Workshop on Reinforcement Learning, pages 43–57.
Heidrich-Meisner, V. and Igel, C. (2009). Neuroevolution strategies for episodic reinforcement learn-
ing. Journal of Algorithms, 64(4):152–168.
Herrero, J., Valencia, A., and Dopazo, J. (2001). A hierarchical unsupervised growing neural network
for clustering gene expression patterns. Bioinformatics, 17(2):126–136.
Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation.
Addison-Wesley, Redwood City.
Hestenes, M. R. and Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Journal of research of the National Bureau of Standards, 49:409–436.
Hihi, S. E. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies.
In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information
Processing Systems 8, pages 493–499. MIT Press.
Hinton, G. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial intelligence, 40(1):185–234.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural
Comp., 14(8):1771–1800.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsuper-
vised neural networks. Science, 268:1158–1160.
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling
in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag.,
29(6):82–97.
Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed
representations. Philosophical Transactions of the Royal Society B, 352:1177–1190.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets.
Neural Computation, 18(7):1527–1554.
Hinton, G. E. and Sejnowski, T. E. (1986). Learning and relearning in Boltzmann machines. In
Parallel Distributed Processing, volume 1, pages 282–317. MIT Press.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012b).
Improving neural networks by preventing co-adaptation of feature detectors. Technical Report
arXiv:1207.0580.
Hinton, G. E. and van Camp, D. (1993). Keeping neural networks simple. In Proceedings of the
International Conference on Artificial Neural Networks, Amsterdam, pages 11–18. Springer.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut
für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. Advisor: J. Schmidhuber.
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001a). Gradient flow in recurrent nets:
the difficulty of learning long-term dependencies. In Kremer, S. C. and Kolen, J. F., editors, A Field
Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter, S. and Obermayer, K. (2005). Sequence classification for protein analysis. In Snowbird
Workshop, Snowbird, Utah. Computational and Biological Learning Society.
Hochreiter, S. and Schmidhuber, J. (1996). Bridging long time lags by weight guessing and “Long
Short-Term Memory”. In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal
models in biological and artificial systems, pages 65–72. IOS Press, Amsterdam, Netherlands.
Serie: Frontiers in Artificial Intelligence and Applications, Volume 37.
Hochreiter, S. and Schmidhuber, J. (1997a). Flat minima. Neural Computation, 9(1):1–42.
Hochreiter, S. and Schmidhuber, J. (1997b). Long Short-Term Memory. Neural Computation,
9(8):1735–1780. Based on TR FKI-207-95, TUM (1995).
Hochreiter, S. and Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Compu-
tation, 11(3):679–714.
Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001b). Learning to learn using gradient descent.
In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-
2001), pages 87–94. Springer: Berlin, Heidelberg.
Hodgkin, A. L. and Huxley, A. F. (1952). A quantitative description of membrane current and its
application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500.
Hoerzer, G. M., Legenstein, R., and Maass, W. (2014). Emergence of complex computational struc-
tures from chaotic neural networks through reward-modulated Hebbian learning. Cerebral Cortex,
24:677–690.
Holden, S. B. (1994). On the Theory of Generalization and Self-Structuring in Linearly Weighted
Connectionist Networks. PhD thesis, Cambridge University, Engineering Department.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press,
Ann Arbor.
Honavar, V. and Uhr, L. (1993). Generative learning structures and processes for generalized connec-
tionist networks. Information Sciences, 70(1):75–108.
Honavar, V. and Uhr, L. M. (1988). A network of neuron-like units that learns to perceive by gen-
eration as well as reweighting of its links. In Touretzky, D., Hinton, G. E., and Sejnowski, T.,
editors, Proc. of the 1988 Connectionist Models Summer School, pages 472–484, San Mateo. Mor-
gan Kaufman.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational
abilities. Proc. of the National Academy of Sciences, 79:2554–2558.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366.
Hubel, D. H. and Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture
in the cat’s visual cortex. Journal of Physiology (London), 160:106–154.
Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate
cortex. The Journal of Physiology, 195(1):215–243.
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings
IRE, 40:1098–1101.
Hung, C. P., Kreiman, G., Poggio, T., and DiCarlo, J. J. (2005). Fast readout of object identity from
macaque inferior temporal cortex. Science, 310(5749):863–866.
Hutter, M. (2002). The fastest and shortest algorithm for all well-defined problems. International
Journal of Foundations of Computer Science, 13(3):431–443. (On J. Schmidhuber’s SNF grant
20-61847).
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, Berlin. (On J. Schmidhuber’s SNF grant 20-61847).
Hyvärinen, A., Hoyer, P., and Oja, E. (1999). Sparse code shrinkage: Denoising by maximum likeli-
hood estimation. In Kearns, M., Solla, S. A., and Cohn, D., editors, Advances in Neural Information
Processing Systems (NIPS) 12. MIT Press.
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent component analysis. John Wiley &
Sons.
ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images (2012). IPAL Lab-
oratory and TRIBVN Company and Pitie-Salpetriere Hospital and CIALAB of Ohio State Univ.,
http://ipal.cnrs.fr/ICPR2012/.
Igel, C. (2003). Neuroevolution for reinforcement learning using evolution strategies. In Reynolds, R.,
Abbass, H., Tan, K. C., Mckay, B., Essam, D., and Gedeon, T., editors, Congress on Evolutionary
Computation (CEC 2003), volume 4, pages 2588–2595. IEEE.
Igel, C. and Hüsken, M. (2003). Empirical evaluation of the improved Rprop learning algorithm.
Neurocomputing, 50(C):105–123.
Ikeda, S., Ochiai, M., and Sawaragi, Y. (1976). Sequential GMDH algorithm and its application to
river flow prediction. IEEE Transactions on Systems, Man and Cybernetics, (7):473–479.
Indermuhle, E., Frinken, V., and Bunke, H. (2012). Mode detection in online handwritten documents
using BLSTM neural networks. In Frontiers in Handwriting Recognition (ICFHR), 2012 Interna-
tional Conference on, pages 302–307. IEEE.
Indermuhle, E., Frinken, V., Fischer, A., and Bunke, H. (2011). Keyword spotting in online handwrit-
ten documents containing text and non-text using BLSTM neural networks. In Document Analysis
and Recognition (ICDAR), 2011 International Conference on, pages 73–77. IEEE.
Indiveri, G., Linares-Barranco, B., Hamilton, T. J., Van Schaik, A., Etienne-Cummings, R., Delbruck,
T., Liu, S.-C., Dudek, P., Häfliger, P., Renaud, S., et al. (2011). Neuromorphic silicon neuron
circuits. Frontiers in Neuroscience, 5(73).
Ivakhnenko, A. G. (1968). The group method of data handling – a rival of the method of stochastic
approximation. Soviet Automatic Control, 13(3):43–55.
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems,
Man and Cybernetics, (4):364–378.
Ivakhnenko, A. G. (1995). The review of problems solvable by algorithms of the group method of data
handling (GMDH). Pattern Recognition and Image Analysis / Raspoznavaniye Obrazov I Analiz
Izobrazhenii, 5:527–535.
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corpo-
ration.
Ivakhnenko, A. G., Lapa, V. G., and McDonough, R. N. (1967). Cybernetics and forecasting tech-
niques. American Elsevier, NY.
Izhikevich, E. M. et al. (2003). Simple model of spiking neurons. IEEE Transactions on Neural
Networks, 14(6):1569–1572.
Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995). Reinforcement learning algorithm for partially
observable Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors,
Advances in Neural Information Processing Systems (NIPS) 7, pages 345–352. MIT Press.
Jackel, L., Boser, B., Graf, H.-P., Denker, J., LeCun, Y., Henderson, D., Matan, O., Howard, R., and
Baird, H. (1990). VLSI implementation of electronic neural networks: an example in character
recognition. In IEEE, editor, IEEE International Conference on Systems, Man, and Cybernetics,
pages 320–322, Los Angeles, CA.
Jacob, C., Lindenmayer, A., and Rozenberg, G. (1994). Genetic L-System Programming. In Parallel
Problem Solving from Nature III, Lecture Notes in Computer Science.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Net-
works, 1(4):295–307.
Jaeger, H. (2001). The ”echo state” approach to analysing and training recurrent neural networks.
Technical Report GMD Report 148, German National Research Center for Information Technology.
Jaeger, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless
communication. Science, 304:78–80.
Jain, V. and Seung, S. (2009). Natural image denoising with convolutional networks. In Koller, D.,
Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing
Systems (NIPS) 21, pages 769–776. Curran Associates, Inc.
Jameson, J. (1991). Delayed reinforcement learning with multiple time scale hierarchical backpropa-
gated adaptive critics. In Neural Networks for Control.
Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3D convolutional neural networks for human action
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231.
Jim, K., Giles, C. L., and Horne, B. G. (1995). Effects of noise on convergence and generalization
in recurrent networks. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural
Information Processing Systems (NIPS) 7, page 649. San Mateo, CA: Morgan Kaufmann.
Jin, X., Lujan, M., Plana, L. A., Davies, S., Temple, S., and Furber, S. B. (2010). Modeling spiking
neural networks on SpiNNaker. Computing in Science & Engineering, 12(5):91–97.
Jodogne, S. R. and Piater, J. H. (2007). Closed-loop learning of visual control policies. J. Artificial
Intelligence Research, 28:349–391.
Jones, J. P. and Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of
simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258.
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach. Technical Report ICS
Report 8604, Institute for Cognitive Science, University of California, San Diego.
Jordan, M. I. (1988). Supervised learning and systems with excess degrees of freedom. Technical
Report COINS TR 88-27, Massachusetts Institute of Technology.
Jordan, M. I. (1997). Serial order: A parallel distributed processing approach. Advances in Psychol-
ogy, 121:471–495.
Jordan, M. I. and Rumelhart, D. E. (1990). Supervised learning with a distal teacher. Technical Report
Occasional Paper #40, Center for Cog. Sci., Massachusetts Institute of Technology.
Jordan, M. I. and Sejnowski, T. J. (2001). Graphical models: Foundations of neural computation.
MIT Press.
Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.
Juang, C.-F. (2004). A hybrid of genetic algorithm and particle swarm optimization for recurrent
network design. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on,
34(2):997–1006.
Judd, J. S. (1990). Neural network design and the complexity of learning. Neural network modeling
and connectionism. MIT Press.
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on
neuromimetic architecture. Signal Processing, 24(1):1–10.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1995). Planning and acting in partially
observable stochastic domains. Technical report, Brown University, Providence RI.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: a survey. Journal
of AI research, 4:237–285.
Kak, S., Chen, Y., and Wang, L. (2010). Data mining using surface and deep agents based on neural
networks. AMCIS 2010 Proceedings.
Kalinke, Y. and Lehmann, H. (1998). Computation in recurrent neural networks: From counters to
iterated function systems. In Antoniou, G. and Slaney, J., editors, Advanced Topics in Artificial
Intelligence, Proceedings of the 11th Australian Joint Conference on Artificial Intelligence, volume
1502 of LNAI, Berlin, Heidelberg. Springer.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic
Engineering, 82(1):35–45.
Karhunen, J. and Joutsensalo, J. (1995). Generalizations of principal component analysis, optimiza-
tion problems, and neural networks. Neural Networks, 8(4):549–562.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale
video classification with convolutional neural networks. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Kasabov, N. K. (2014). NeuCube: A spiking neural network architecture for mapping, learning and
understanding of spatio-temporal brain data. Neural Networks.
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954.
Kempter, R., Gerstner, W., and Van Hemmen, J. L. (1999). Hebbian learning and spiking neurons.
Physical Review E, 59(4):4498.
Kerlirzin, P. and Vallet, F. (1993). Robustness in multilayer perceptrons. Neural Computation,
5(1):473–482.
Khan, M. M., Khan, G. M., and Miller, J. F. (2010). Evolution of neural networks using Cartesian
Genetic Programming. In IEEE Congress on Evolutionary Computation (CEC), pages 1–8.
Khan, M. M., Lester, D. R., Plana, L. A., Rast, A., Jin, X., Painkras, E., and Furber, S. B. (2008).
SpiNNaker: mapping neural networks onto a massively-parallel chip multiprocessor. In Interna-
tional Joint Conference on Neural Networks (IJCNN), pages 2849–2856. IEEE.
Khan, S. H., Bennamoun, M., Sohel, F., and Togneri, R. (2014). Automatic feature learning for robust
shadow detection. In IEEE Conference on Computer Vision and Pattern Recognition CVPR.
Kimura, H., Miyazaki, K., and Kobayashi, S. (1997). Reinforcement learning in POMDPs with
function approximation. In ICML, volume 97, pages 152–160.
Kistler, W. M., Gerstner, W., and van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley
equations to a single-variable threshold model. Neural Computation, 9(5):1015–1045.
Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation system.
Complex Systems, 4:461–476.
Klampfl, S. and Maass, W. (2013). Emergence of dynamic memory traces in cortical microcircuit
models through STDP. The Journal of Neuroscience, 33(28):11515–11529.
Klapper-Rybicka, M., Schraudolph, N. N., and Schmidhuber, J. (2001). Unsupervised learning in
LSTM recurrent neural networks. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artifi-
cial Neural Networks (ICANN-2001), pages 684–691. Springer: Berlin, Heidelberg.
Kobatake, E. and Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral
visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71:856–867.
Kohl, N. and Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion.
In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on,
volume 3, pages 2619–2624. IEEE.
Kohonen, T. (1972). Correlation matrix memories. Computers, IEEE Transactions on, 100(4):353–
359.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological
Cybernetics, 43(1):59–69.
Kohonen, T. (1988). Self-Organization and Associative Memory. Springer, second edition.
Koikkalainen, P. and Oja, E. (1990). Self-organizing hierarchical feature maps. In International Joint
Conference on Neural Networks (IJCNN), pages 279–284. IEEE.
Kolmogorov, A. N. (1965a). On the representation of continuous functions of several variables by su-
perposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR,
114:679–681.
Kolmogorov, A. N. (1965b). Three approaches to the quantitative definition of information. Problems
of Information Transmission, 1:1–11.
Kompella, V. R., Luciw, M. D., and Schmidhuber, J. (2012). Incremental slow feature analysis: Adap-
tive low-complexity slow feature updating from high-dimensional input streams. Neural Computa-
tion, 24(11):2994–3024.
Kondo, T. (1998). GMDH neural network algorithm using the heuristic self-organization method
and its application to the pattern identification problem. In Proceedings of the 37th SICE Annual
Conference SICE’98, pages 1143–1148. IEEE.
Kondo, T. and Ueno, J. (2008). Multi-layered GMDH-type neural network self-selecting optimum
neural network architecture and its application to 3-dimensional medical image recognition of
blood vessels. International Journal of Innovative Computing, Information and Control, 4(1):175–
187.
Kordík, P., Náplava, P., Snorek, M., and Genyk-Berezovskyj, M. (2003). Modified GMDH method
and models quality evaluation by visualization. Control Systems and Computers, 2:68–75.
Korkin, M., de Garis, H., Gers, F., and Hemmi, H. (1997). CBM (CAM-Brain Machine) - a hardware
tool which evolves a neural net module in a fraction of a second and runs a million neuron artificial
brain in real time.
Kosko, B. (1990). Unsupervised learning in noise. IEEE Transactions on Neural Networks, 1(1):44–
57.
Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. (July 2013). Evolving large-scale neural
networks for vision-based reinforcement learning. In Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO), pages 1061–1068, Amsterdam. ACM.
Koutník, J., Gomez, F., and Schmidhuber, J. (2010). Evolving neural networks in compressed weight
space. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation,
pages 619–626.
Koutník, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A Clockwork RNN. In Proceedings
of the 31st International Conference on Machine Learning (ICML), volume 32, pages 1845–1853.
arXiv:1402.3511 [cs.NE].
Koza, J. R. (1992). Genetic Programming – On the Programming of Computers by Means of Natural
Selection. MIT Press.
Kramer, M. (1991). Nonlinear principal component analysis using autoassociative neural networks.
AIChE Journal, 37:233–243.
Kremer, S. C. and Kolen, J. F. (2001). Field guide to dynamical recurrent networks. Wiley-IEEE
Press.
Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., and Bandettini,
P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and
monkey. Neuron, 60(6):1126–1141.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convo-
lutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012),
page 4.
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Lippman,
D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing
Systems 4, pages 950–957. Morgan Kaufmann.
Kruger, N., Janssen, P., Kalkan, S., Lappe, M., Leonardis, A., Piater, J., Rodriguez-Sanchez, A., and
Wiskott, L. (2013). Deep hierarchies in the primate visual cortex: What can we learn for computer
vision? IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1847–1871.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical
Statistics, pages 79–86.
Kurzweil, R. (2012). How to Create a Mind: The Secret of Human Thought Revealed.
Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. JMLR, 4:1107–1149.
Lampinen, J. and Oja, E. (1992). Clustering properties of hierarchical self-organizing maps. Journal
of Mathematical Imaging and Vision, 2(2-3):261–272.
Lang, K., Waibel, A., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated
word recognition. Neural Networks, 3:23–43.
Lange, S. and Riedmiller, M. (2010). Deep auto-encoder neural networks in reinforcement learning.
In Neural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–8.
Lapedes, A. and Farber, R. (1986). A self-optimizing, nonsymmetrical neural net for content address-
able memory and pattern recognition. Physica D, 22:247–259.
Laplace, P. (1774). Mémoire sur la probabilité des causes par les évènements. Mémoires de
l’Academie Royale des Sciences Presentés par Divers Savans, 6:621–656.
Larrañaga, P. and Lozano, J. A. (2001). Estimation of Distribution Algorithms: A New Tool for
Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA.
Le, Q. V., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng, A. Y. (2012).
Building high-level features using large scale unsupervised learning. In Proc. ICML’12.
LeCun, Y. (1985). Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of
Cognitiva 85, Paris, pages 599–604.
LeCun, Y. (1988). A theoretical framework for back-propagation. In Touretzky, D., Hinton, G.,
and Sejnowski, T., editors, Proceedings of the 1988 Connectionist Models Summer School, pages
21–28, CMU, Pittsburgh, Pa. Morgan Kaufmann.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D.
(1989). Back-propagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D.
(1990a). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S.,
editor, Advances in Neural Information Processing Systems 2, pages 396–404. Morgan Kaufmann.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to docu-
ment recognition. Proceedings of the IEEE, 86(11):2278–2324.
LeCun, Y., Denker, J. S., and Solla, S. A. (1990b). Optimal brain damage. In Touretzky, D. S., editor,
Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann.
LeCun, Y., Muller, U., Cosatto, E., and Flepp, B. (2006). Off-road obstacle avoidance through end-
to-end learning. In Advances in Neural Information Processing Systems (NIPS 2005).
LeCun, Y., Simard, P., and Pearlmutter, B. (1993). Automatic learning rate maximization by on-line
estimation of the Hessian’s eigenvectors. In Hanson, S., Cowan, J., and Giles, L., editors, Advances
in Neural Information Processing Systems (NIPS 1992), volume 5. Morgan Kaufmann Publishers,
San Mateo, CA.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. (2007a). Efficient sparse coding algorithms. In Advances
in Neural Information Processing Systems (NIPS) 19, pages 801–808.
Lee, H., Ekanadham, C., and Ng, A. Y. (2007b). Sparse deep belief net model for visual area V2. In
Advances in Neural Information Processing Systems (NIPS), volume 7, pages 873–880.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Interna-
tional Conference on Machine Learning (ICML), pages 609–616.
Lee, H., Pham, P. T., Largman, Y., and Ng, A. Y. (2009b). Unsupervised feature learning for audio
classification using convolutional deep belief networks. In Proc. NIPS, volume 9, pages 1096–1104.
Lee, L. (1996). Learning of context-free languages: A survey of the literature. Technical Report
TR-12-96, Center for Research in Computing Technology, Harvard University, Cambridge, Mas-
sachusetts.
Lee, S. and Kil, R. M. (1991). A Gaussian potential function network with hierarchically self-
organizing learning. Neural Networks, 4(2):207–224.
Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des cometes. F. Didot.
Legenstein, R., Wilbert, N., and Wiskott, L. (2010). Reinforcement learning on slow features of
high-dimensional input streams. PLoS Computational Biology, 6(8).
Legenstein, R. A. and Maass, W. (2002). Neural circuits for pattern recognition with small total wire
length. Theor. Comput. Sci., 287(1):239–249.
Leibniz, G. W. (1676). Memoir using the chain rule (cited in TMME 7:2&3 p 321-332, 2010).
Leibniz, G. W. (1684). Nova methodus pro maximis et minimis, itemque tangentibus, quae nec
fractas, nec irrationales quantitates moratur, et singulare pro illis calculi genus. Acta Eruditorum,
pages 467–473.
Lenat, D. B. (1983). Theory formation by heuristic search. Machine Learning, 21.
Lenat, D. B. and Brown, J. S. (1984). Why AM and EURISKO appear to work. Artificial Intelligence,
23(3):269–294.
Lennie, P. and Movshon, J. A. (2005). Coding of color and form in the geniculostriate visual pathway.
Journal of the Optical Society of America A, 22(10):2013–2033.
Levenberg, K. (1944). A method for the solution of certain problems in least squares. Quarterly of
applied mathematics, 2:164–168.
Levin, A. U., Leen, T. K., and Moody, J. E. (1994). Fast pruning using principal components. In
Advances in Neural Information Processing Systems 6, page 35. Morgan Kaufmann.
Levin, A. U. and Narendra, K. S. (1995). Control of nonlinear dynamical systems using neural
networks. II: Observability, identification, and control. IEEE Transactions on Neural Networks,
7(1):30–42.
Levin, L. A. (1973a). On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416.
Levin, L. A. (1973b). Universal sequential search problems. Problems of Information Transmission,
9(3):265–266.
Lewicki, M. S. and Olshausen, B. A. (1998). Inferring sparse, overcomplete image codes using an
efficient coding framework. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in
Neural Information Processing Systems (NIPS) 10, pages 815–821.
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris:
L’Imprimerie Royale.
Li, M. and Vitányi, P. M. B. (1997). An Introduction to Kolmogorov Complexity and its Applications
(2nd edition). Springer.
Li, R., Zhang, W., Suk, H.-I., Wang, L., Li, J., Shen, D., and Ji, S. (2014). Deep learning based
imaging data completion for improved brain disease diagnosis. In Proc. MICCAI. Springer.
Lin, L. (1993). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie
Mellon University, Pittsburgh.
Lin, T., Horne, B., Tino, P., and Giles, C. (1996). Learning long-term dependencies in NARX recur-
rent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338.
Lindenmayer, A. (1968). Mathematical models for cellular interaction in development. J. Theoret.
Biology, 18:280–315.
Lindstädt, S. (1993). Comparison of two unsupervised neural network models for redundancy reduc-
tion. In Mozer, M. C., Smolensky, P., Touretzky, D. S., Elman, J. L., and Weigend, A. S., editors,
Proc. of the 1993 Connectionist Models Summer School, pages 308–315. Hillsdale, NJ: Erlbaum
Associates.
Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a
Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki.
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathe-
matics, 16(2):146–160.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21:105–117.
Littman, M. L., Cassandra, A. R., and Kaelbling, L. P. (1995). Learning policies for partially ob-
servable environments: Scaling up. In Prieditis, A. and Russell, S., editors, Machine Learning:
Proceedings of the Twelfth International Conference, pages 362–370. Morgan Kaufmann Publish-
ers, San Francisco, CA.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., Burg, T., Douglas, R., et al. (2001). Orientation-
selective aVLSI spiking neurons. Neural Networks, 14(6-7):629–643.
Ljung, L. (1998). System identification. Springer.
Logothetis, N. K., Pauls, J., and Poggio, T. (1995). Shape representation in the inferior temporal
cortex of monkeys. Current Biology, 5(5):552–563.
Loiacono, D., Cardamone, L., and Lanzi, P. L. (2011). Simulated car racing championship competi-
tion software manual. Technical report, Dipartimento di Elettronica e Informazione, Politecnico di
Milano, Italy.
Loiacono, D., Lanzi, P. L., Togelius, J., Onieva, E., Pelta, D. A., Butz, M. V., Lönneker, T. D.,
Cardamone, L., Perez, D., Sáez, Y., Preuss, M., and Quadflieg, J. (2009). The 2009 simulated car
racing championship.
Lowe, D. (1999). Object recognition from local scale-invariant features. In The Proceedings of the
Seventh IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157.
Lowe, D. (2004). Distinctive image features from scale-invariant key-points. Intl. Journal of Com-
puter Vision, 60:91–110.
Luciw, M., Kompella, V. R., Kazerounian, S., and Schmidhuber, J. (2013). An intrinsic value system
for developing multiple invariant representations with incremental slowness learning. Frontiers in
Neurorobotics, 7(9).
Lusci, A., Pollastri, G., and Baldi, P. (2013). Deep architectures and deep learning in chemoinformat-
ics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information
and Modeling, 53(7):1563–1575.
Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network
acoustic models. In International Conference on Machine Learning (ICML).
Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural
Computation, 8(1):1–40.
Maass, W. (1997). Networks of spiking neurons: the third generation of neural network models.
Neural Networks, 10(9):1659–1671.
Maass, W. (2000). On the computational power of winner-take-all. Neural Computation, 12:2519–
2535.
Maass, W., Natschläger, T., and Markram, H. (2002). Real-time computing without stable states: A
new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–
2560.
MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computa-
tion, 4:448–472.
MacKay, D. J. C. and Miller, K. D. (1990). Analysis of Linsker’s simulation of Hebbian rules. Neural
Computation, 2:173–187.
Maclin, R. and Shavlik, J. W. (1993). Using knowledge-based neural networks to improve algorithms:
Refining the Chou-Fasman algorithm for protein folding. Machine Learning, 11(2-3):195–215.
Maclin, R. and Shavlik, J. W. (1995). Combining the predictions of multiple classifiers: Using com-
petitive learning to initialize neural networks. In Proc. IJCAI, pages 524–531.
Madala, H. R. and Ivakhnenko, A. G. (1994). Inductive learning algorithms for complex systems
modeling. CRC Press, Boca Raton.
Madani, O., Hanks, S., and Condon, A. (2003). On the undecidability of probabilistic planning and
related stochastic optimization problems. Artificial Intelligence, 147(1):5–34.
Maei, H. R. and Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference
prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial
General Intelligence, volume 1, pages 91–96.
Maex, R. and Orban, G. (1996). Model circuit of spiking neurons generating directional selectivity in
simple cells. Journal of Neurophysiology, 75(4):1515–1545.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empir-
ical results. Machine Learning, 22:159.
Malik, J. and Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms.
Journal of the Optical Society of America A, 7(5):923–932.
Maniezzo, V. (1994). Genetic evolution of the topology and weight distribution of neural networks.
IEEE Transactions on Neural Networks, 5(1):39–53.
Manolios, P. and Fanelli, R. (1994). First-order recurrent neural networks and deterministic finite
state automata. Neural Computation, 6:1155–1173.
Marchi, E., Ferroni, G., Eyben, F., Gabrielli, L., Squartini, S., and Schuller, B. (2014). Multi-
resolution linear prediction based features for audio onset detection with bidirectional LSTM neural
networks. In Proc. 39th IEEE International Conference on Acoustics, Speech, and Signal Process-
ing, ICASSP 2014, Florence, Italy, pages 2183–2187.
Markram, H. (2012). The human brain project. Scientific American, 306(6):50–55.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal
of the Society for Industrial & Applied Mathematics, 11(2):431–441.
Martens, J. (2010). Deep learning via Hessian-free optimization. In Fürnkranz, J. and Joachims, T.,
editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages
735–742, Haifa, Israel. Omnipress.
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimiza-
tion. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages
1033–1040.
Martinetz, T. M., Ritter, H. J., and Schulten, K. J. (1990). Three-dimensional neural net for learning
visuomotor coordination of a robot arm. IEEE Transactions on Neural Networks, 1(1):131–136.
Masci, J., Giusti, A., Ciresan, D. C., Fricout, G., and Schmidhuber, J. (2013). A fast learning algorithm
for image segmentation with max-pooling convolutional networks. In International Conference on
Image Processing (ICIP13), pages 2713–2717.
Matsuoka, K. (1992). Noise injection into inputs in back-propagation learning. IEEE Transactions
on Systems, Man, and Cybernetics, 22(3):436–440.
Mayer, H., Gomez, F., Wierstra, D., Nagy, I., Knoll, A., and Schmidhuber, J. (2008). A system for
robotic heart surgery that learns to tie knots using recurrent neural networks. Advanced Robotics,
22(13-14):1521–1537.
McCallum, R. A. (1996). Learning to use selective attention and short-term memory in sequential
tasks. In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson, S. W., editors, From Ani-
mals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive
Behavior, Cambridge, MA, pages 315–324. MIT Press, Bradford Books.
McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 5:115–133.
Melnik, O., Levy, S. D., and Pollack, J. B. (2000). RAAM for infinite context-free languages. In
Proc. IJCNN (5), pages 585–590.
Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations with factored
higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492.
Menache, I., Mannor, S., and Shimkin, N. (2002). Q-cut – dynamic discovery of sub-goals in rein-
forcement learning. In Proc. ECML’02, pages 295–306.
Merolla, P. A., Arthur, J. V., Alvarez-Icaza, R., Cassidy, A. S., Sawada, J., Akopyan, F., Jackson,
B. L., Imam, N., Guo, C., Nakamura, Y., Brezzo, B., Vo, I., Esser, S. K., Appuswamy, R., Taba,
B., Amir, A., Flickner, M. D., Risk, W. P., Manohar, R., and Modha, D. S. (2014). A million
spiking-neuron integrated circuit with a scalable communication network and interface. Science,
345(6197):668–673.
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X.,
Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised
and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised
and Transfer Learning, volume 7.
Meuleau, N., Peshkin, L., Kim, K. E., and Kaelbling, L. P. (1999). Learning finite state controllers for
partially observable environments. In 15th International Conference of Uncertainty in AI, pages
427–436.
Miglino, O., Lund, H., and Nolfi, S. (1995). Evolving mobile robots in simulated and real environ-
ments. Artificial Life, 2(4):417–434.
Miller, G., Todd, P., and Hedge, S. (1989). Designing neural networks using genetic algorithms. In
Proceedings of the 3rd International Conference on Genetic Algorithms, pages 379–384. Morgan
Kauffman.
Miller, J. F. and Harding, S. L. (2009). Cartesian genetic programming. In Proceedings of the 11th An-
nual Conference Companion on Genetic and Evolutionary Computation Conference: Late Break-
ing Papers, pages 3489–3512. ACM.
Miller, J. F. and Thomson, P. (2000). Cartesian genetic programming. In Genetic Programming, pages
121–132. Springer.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered
arrangement of orientation columns through activity-dependent competition between on- and off-
center inputs. Journal of Neuroscience, 14(1):409–441.
Miller, W. T., Werbos, P. J., and Sutton, R. S. (1995). Neural networks for control. MIT Press.
Minai, A. A. and Williams, R. D. (1994). Perturbation response in feedforward networks. Neural
Networks, 7(5):783–796.
Minsky, M. (1963). Steps toward artificial intelligence. In Feigenbaum, E. and Feldman, J., editors,
Computers and Thought, pages 406–450. McGraw-Hill, New York.
Minsky, M. and Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Minton, S., Carbonell, J. G., Knoblock, C. A., Kuokka, D. R., Etzioni, O., and Gil, Y. (1989).
Explanation-based learning: A problem solving perspective. Artificial Intelligence, 40(1):63–118.
Mitchell, T. (1997). Machine Learning. McGraw Hill.
Mitchell, T. M., Keller, R. M., and Kedar-Cabelli, S. T. (1986). Explanation-based generalization: A
unifying view. Machine Learning, 1(1):47–80.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M.
(Dec 2013). Playing Atari with deep reinforcement learning. Technical Report arXiv:1312.5602
[cs.LG], Deepmind Technologies.
Mohamed, A. and Hinton, G. E. (2010). Phone recognition using restricted Boltzmann machines.
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages
4354–4357.

Molgedey, L. and Schuster, H. G. (1994). Separation of independent signals using time-delayed correlations. Physical Review Letters, 72(23):3634–3637.
Møller, M. F. (1993). Exact calculation of the product of the Hessian matrix of feed-forward network
error functions and a vector in O(N) time. Technical Report PB-432, Computer Science Depart-
ment, Aarhus University, Denmark.

Montana, D. J. and Davis, L. (1989). Training feedforward neural networks using genetic algorithms.
In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI) - Vol-
ume 1, IJCAI’89, pages 762–767, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Montavon, G., Orr, G., and Müller, K. (2012). Neural Networks: Tricks of the Trade. Number LNCS
7700 in Lecture Notes in Computer Science Series. Springer Verlag.

Moody, J. E. (1989). Fast learning in multi-resolution hierarchies. In Touretzky, D. S., editor, Ad-
vances in Neural Information Processing Systems (NIPS) 1, pages 29–39. Morgan Kaufmann.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regu-
larization in nonlinear learning systems. In Lippman, D. S., Moody, J. E., and Touretzky, D. S.,
editors, Advances in Neural Information Processing Systems (NIPS) 4, pages 847–854. Morgan
Kaufmann.

Moody, J. E. and Utans, J. (1994). Architecture selection strategies for neural networks: Application
to corporate bond rating prediction. In Refenes, A. N., editor, Neural Networks in the Capital
Markets. John Wiley & Sons.
Moore, A. and Atkeson, C. (1995). The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. Machine Learning, 21(3):199–233.

Moore, A. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data
and less time. Machine Learning, 13:103–130.
Moriarty, D. E. (1997). Symbiotic Evolution of Neural Networks in Sequential Decision Tasks. PhD
thesis, Department of Computer Sciences, The University of Texas at Austin.

Moriarty, D. E. and Miikkulainen, R. (1996). Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22:11–32.
Morimoto, J. and Doya, K. (2000). Robust reinforcement learning. In Leen, T. K., Dietterich, T. G.,
and Tresp, V., editors, Advances in Neural Information Processing Systems (NIPS) 13, pages 1061–
1067. MIT Press.

Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Lindzey, G. and Aronson,
E., editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley.
Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition.
Complex Systems, 3:349–381.

Mozer, M. C. (1991). Discovering discrete distributed representations with iterative competitive learn-
ing. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information
Processing Systems 3, pages 627–634. Morgan Kaufmann.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In Lippman, D. S., Moody, J. E.,
and Touretzky, D. S., editors, Advances in Neural Information Processing Systems (NIPS) 4, pages
275–282. Morgan Kaufmann.
Mozer, M. C. and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a
network via relevance assessment. In Touretzky, D. S., editor, Advances in Neural Information
Processing Systems (NIPS) 1, pages 107–115. Morgan Kaufmann.

Muller, U. A., Gunzinger, A., and Guggenbühl, W. (1995). Fast neural net simulation with a DSP
processor array. IEEE Transactions on Neural Networks, 6(1):203–213.
Munro, P. W. (1987). A dual back-propagation scheme for scalar reinforcement learning. Proceedings
of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176.

Murray, A. F. and Edwards, P. J. (1993). Synaptic weight noise during MLP learning enhances fault-tolerance, generalisation and learning trajectory. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages 491–498. San Mateo, CA: Morgan Kaufmann.
Nadal, J.-P. and Parga, N. (1994). Non-linear neurons in the low noise limit: a factorial code max-
imises information transfer. Network, 5:565–581.
Nagumo, J., Arimoto, S., and Yoshizawa, S. (1962). An active pulse transmission line simulating
nerve axon. Proceedings of the IRE, 50(10):2061–2070.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In
International Conference on Machine Learning (ICML).
Narendra, K. S. and Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4–27.
Narendra, K. S. and Thathachar, M. A. L. (1974). Learning automata – a survey. IEEE Transactions on Systems, Man, and Cybernetics, 4:323–334.
Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.
Neal, R. M. (2006). Classification with Bayesian neural networks. In Quinonero-Candela, J., Magnini,
B., Dagan, I., and D’Alche-Buc, F., editors, Machine Learning Challenges. Evaluating Predictive
Uncertainty, Visual Object Classification, and Recognising Textual Entailment, volume 3944 of
Lecture Notes in Computer Science, pages 28–32. Springer.
Neal, R. M. and Zhang, J. (2006). High dimensional classification with Bayesian neural networks and
Dirichlet diffusion trees. In Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. A., editors, Feature
Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, pages 265–
295. Springer.
Neftci, E., Das, S., Pedroni, B., Kreutz-Delgado, K., and Cauwenberghs, G. (2014). Event-driven
contrastive divergence for spiking neuromorphic systems. Frontiers in Neuroscience, 7(272).
Neil, D. and Liu, S.-C. (2014). Minitaur, an event-driven FPGA-based spiking network accelerator.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1–8.
Nessler, B., Pfeiffer, M., Buesing, L., and Maass, W. (2013). Bayesian computation emerges in
generic cortical microcircuits through spike-timing-dependent plasticity. PLoS Computational Bi-
ology, 9(4):e1003037.
Neti, C., Schneider, M. H., and Young, E. D. (1992). Maximally fault tolerant neural networks. IEEE Transactions on Neural Networks, 3:14–23.
Neuneier, R. and Zimmermann, H.-G. (1996). How to train neural networks. In Orr, G. B. and Müller,
K.-R., editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer
Science, pages 373–423. Springer.
Newton, I. (1687). Philosophiae naturalis principia mathematica. William Dawson & Sons Ltd.,
London.
Nguyen, N. and Widrow, B. (1989). The truck backer-upper: An example of self learning in neural
networks. In Proceedings of the International Joint Conference on Neural Networks, pages 357–
363. IEEE Press.

Nilsson, N. J. (1980). Principles of artificial intelligence. Morgan Kaufmann, San Francisco, CA,
USA.
Nolfi, S., Floreano, D., Miglino, O., and Mondada, F. (1994a). How to evolve autonomous robots:
Different approaches in evolutionary robotics. In Brooks, R. A. and Maes, P., editors, Fourth
International Workshop on the Synthesis and Simulation of Living Systems (Artificial Life IV), pages
190–197. MIT.

Nolfi, S., Parisi, D., and Elman, J. L. (1994b). Learning and evolution in neural networks. Adaptive
Behavior, 3(1):5–28.
Nowak, E., Jurie, F., and Triggs, B. (2006). Sampling strategies for bag-of-features image classifica-
tion. In Proc. ECCV 2006, pages 490–503. Springer.

Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight sharing. Neural
Computation, 4:173–193.
O’Connor, P., Neil, D., Liu, S.-C., Delbruck, T., and Pfeiffer, M. (2013). Real-time classification and
sensor fusion with a spiking deep belief network. Frontiers in Neuroscience, 7(178).

Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition,
37(6):1311–1314.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of
Neural Systems, 1(1):61–68.
Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural
networks. In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural
Networks, volume 1, pages 737–745. Elsevier Science Publishers B.V., North-Holland.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381(6583):607–609.

Omlin, C. and Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks.
Neural Networks, 9(1):41–52.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2013). Learning and transferring mid-level image
representations using convolutional neural networks. Technical Report hal-00911179.
O’Reilly, R. (2003). Making working memory work: A computational model of learning in the
prefrontal cortex and basal ganglia. Technical Report ICS-03-03, ICS.
O’Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences:
The generalized recirculation algorithm. Neural Computation, 8(5):895–938.
Orr, G. and Müller, K. (1998). Neural Networks: Tricks of the Trade. Number LNCS 1524 in Lecture
Notes in Computer Science Series. Springer Verlag.
Ostrovskii, G. M., Volin, Y. M., and Borisov, W. W. (1971). Über die Berechnung von Ableitungen. (On the computation of derivatives.) Wiss. Z. Tech. Hochschule für Chemie, 13:382–384.
Otsuka, M. (2010). Goal-Oriented Representation of the External World: A Free-Energy-Based Ap-
proach. PhD thesis, Nara Institute of Science and Technology.

Otsuka, M., Yoshimoto, J., and Doya, K. (2010). Free-energy-based reinforcement learning in a
partially observable environment. In Proc. ESANN.
Otte, S., Krechel, D., Liwicki, M., and Dengel, A. (2012). Local feature based online mode detection
with recurrent neural networks. In Proceedings of the 2012 International Conference on Frontiers
in Handwriting Recognition, pages 533–537. IEEE Computer Society.

Oudeyer, P.-Y., Baranes, A., and Kaplan, F. (2013). Intrinsically motivated learning of real world
sensorimotor skills with developmental constraints. In Baldassarre, G. and Mirolli, M., editors,
Intrinsically Motivated Learning in Natural and Artificial Systems. Springer.
O'Reilly, R. C., Wyatte, D., Herd, S., Mingus, B., and Jilk, D. J. (2013). Recurrent processing during object recognition. Frontiers in Psychology, 4:124.

Pachitariu, M. and Sahani, M. (2013). Regularization and nonlinearities for neural language models:
when are they needed? arXiv preprint arXiv:1301.5650.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36.
Palm, G. (1992). On the information storage capacity of local learning rules. Neural Computation,
4(2):703–711.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359.
Parekh, R., Yang, J., and Honavar, V. (2000). Constructive neural network learning algorithms for
multi-category pattern classification. IEEE Transactions on Neural Networks, 11(2):436–451.

Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research in Eco-
nomics and Management Sci., MIT.
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2013a). How to construct deep recurrent neural
networks. arXiv preprint arXiv:1312.6026.

Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural
networks. In ICML’13: JMLR: W&CP volume 28.
Pasemann, F., Steinmetz, U., and Dieckman, U. (1999). Evolving structure and function of neurocon-
trollers. In Angeline, P. J., Michalewicz, Z., Schoenauer, M., Yao, X., and Zalzala, A., editors, Pro-
ceedings of the Congress on Evolutionary Computation, volume 3, pages 1973–1978, Mayflower
Hotel, Washington D.C., USA. IEEE Press.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural
Computation, 1(2):263–269.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–
160.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey.
IEEE Transactions on Neural Networks, 6(5):1212–1228.
Pearlmutter, B. A. and Hinton, G. E. (1986). G-maximization: An unsupervised learning procedure
for discovering regularities. In Denker, J. S., editor, Neural Networks for Computing: American
Institute of Physics Conference Proceedings 151, volume 2, pages 333–338.

Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22:283–
290.
Pérez-Ortiz, J. A., Gers, F. A., Eck, D., and Schmidhuber, J. (2003). Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks, 16:241–250.
Perrett, D., Hietanen, J., Oram, M., Benson, P., and Rolls, E. (1992). Organization and functions of
cells responsive to faces in the temporal cortex [and discussion]. Philosophical Transactions of the
Royal Society of London. Series B: Biological Sciences, 335(1273):23–30.
Perrett, D., Rolls, E., and Caan, W. (1982). Visual neurones responsive to faces in the monkey
temporal cortex. Experimental Brain Research, 47(3):329–342.
Peters, J. (2010). Policy gradient methods. Scholarpedia, 5(11):3698.
Peters, J. and Schaal, S. (2008a). Natural actor-critic. Neurocomputing, 71:1180–1190.
Peters, J. and Schaal, S. (2008b). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697.
Pham, V., Kermorvant, C., and Louradour, J. (2013). Dropout Improves Recurrent Neural Networks
for Handwriting Recognition. arXiv preprint arXiv:1312.4569.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19):2229–2232.
Plate, T. A. (1993). Holographic recurrent networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages 34–41. Morgan Kaufmann.
Plumbley, M. D. (1991). On information theory and unsupervised neural networks. Dissertation,
published as technical report CUED/F-INFENG/TR.78, Engineering Department, Cambridge Uni-
versity.
Pollack, J. B. (1988). Implications of recursive distributed representations. In Proc. NIPS, pages
527–536.
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46:77–105.
Pontryagin, L. S., Boltyanskii, V. G., Gamrelidze, R. V., and Mishchenko, E. F. (1961). The Mathe-
matical Theory of Optimal Processes.
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In IEEE
International Conference on Computer Vision (ICCV) Workshops, pages 689–690. IEEE.
Post, E. L. (1936). Finite combinatory processes-formulation 1. The Journal of Symbolic Logic,
1(3):103–105.
Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., and Nielsen, M. (2013). Voxel classification
based on triplanar convolutional neural networks applied to cartilage segmentation in knee MRI. In
Medical Image Computing and Computer Assisted Intervention (MICCAI), volume 8150 of LNCS,
pages 246–253. Springer.
Precup, D., Sutton, R. S., and Singh, S. (1998). Multi-time models for temporally abstract planning. In
Advances in Neural Information Processing Systems (NIPS), pages 1050–1056. Morgan Kaufmann.

Prokhorov, D. (2010). A convolutional learning system for object classification in 3-D LIDAR data.
IEEE Transactions on Neural Networks, 21(5):858–863.
Prokhorov, D., Puskorius, G., and Feldkamp, L. (2001). Dynamical neural networks for control. In
Kolen, J. and Kremer, S., editors, A field guide to dynamical recurrent networks, pages 23–78.
IEEE Press.
Prokhorov, D. and Wunsch, D. (1997). Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997–1007.
Prokhorov, D. V., Feldkamp, L. A., and Tyukin, I. Y. (2002). Adaptive behavior with fixed weights in
RNN: an overview. In Proceedings of the IEEE International Joint Conference on Neural Networks
(IJCNN), pages 2018–2023.
Puskorius, G. V. and Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with
Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2):279–297.
Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in
perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932.
Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics
processors. In Proceedings of the 26th Annual International Conference on Machine Learning
(ICML), pages 873–880. ACM.
Ramacher, U., Raab, W., Anlauf, J., Hachmann, U., Beichter, J., Bruels, N., Wesseling, M., Sich-
eneder, E., Maenner, R., Glaess, J., and Wurz, A. (1993). Multiprocessor and memory architecture
of the neurocomputer SYNAPSE-1. International Journal of Neural Systems, 4(4):333–336.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. In Platt, J. C. et al., editors, Advances in Neural Information Processing Systems (NIPS 2006). MIT Press.
Ranzato, M. A., Huang, F., Boureau, Y., and LeCun, Y. (2007). Unsupervised learning of invariant
feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern
Recognition Conference (CVPR’07), pages 1–8. IEEE Press.
Rauber, A., Merkl, D., and Dittenbach, M. (2002). The growing hierarchical self-organizing map: ex-
ploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks, 13(6):1331–
1341.
Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN features off-the-shelf: an
astounding baseline for recognition. arXiv preprint arXiv:1403.6382.
Rechenberg, I. (1971). Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. (Evolution strategy: optimization of technical systems according to principles of biological evolution.) Dissertation. Published 1973 by Fromman-Holzboog.
Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Com-
putation, 5:289–304.
Refenes, N. A., Zapranis, A., and Francis, G. (1994). Stock performance modeling using neural
networks: a comparative study with regression models. Neural Networks, 7(2):375–388.
Rezende, D. J. and Gerstner, W. (2014). Stochastic variational learning in recurrent spiking networks.
Frontiers in Computational Neuroscience, 8:38.

Riedmiller, M. (2005). Neural fitted Q iteration—first experiences with a data efficient neural rein-
forcement learning method. In Proc. ECML-2005, pages 317–328. Springer-Verlag Berlin Heidel-
berg.
Riedmiller, M. and Braun, H. (1993). A direct adaptive method for faster backpropagation learning:
The Rprop algorithm. In Proc. IJCNN, pages 586–591. IEEE Press.
Riedmiller, M., Lange, S., and Voigtlaender, A. (2012). Autonomous reinforcement learning on raw
visual input data in a real world application. In International Joint Conference on Neural Networks
(IJCNN), pages 1–8, Brisbane, Australia.
Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat.
Neurosci., 2(11):1019–1025.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders:
Explicit invariance during feature extraction. In Proceedings of the 28th International Conference
on Machine Learning (ICML-11), pages 833–840.
Ring, M., Schaul, T., and Schmidhuber, J. (2011). The two-dimensional organization of behavior. In Proceedings of the First Joint Conference on Development and Learning and on Epigenetic Robotics (ICDL-EPIROB), Frankfurt.
Ring, M. B. (1991). Incremental development of complex behaviors through automatic construc-
tion of sensory-motor hierarchies. In Birnbaum, L. and Collins, G., editors, Machine Learning:
Proceedings of the Eighth International Workshop, pages 343–347. Morgan Kaufmann.
Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 115–122. Morgan Kaufmann.
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. PhD thesis, University of
Texas at Austin, Austin, Texas 78712.
Risi, S. and Stanley, K. O. (2012). A unified approach to evolving plasticity and neural geometry. In
International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100.
Ritter, H. and Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics,
61(4):241–254.
Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Tech-
nical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Robinson, T. and Fallside, F. (1989). Dynamic reinforcement driven error propagation networks
with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science
Society, Ann Arbor, pages 836–843.
Rodriguez, P. and Wiles, J. (1998). Recurrent neural networks can learn to implement symbol-
sensitive counting. In Advances in Neural Information Processing Systems (NIPS), volume 10,
pages 87–93. The MIT Press.
Rodriguez, P., Wiles, J., and Elman, J. (1999). A recurrent neural network that learns to count.
Connection Science, 11(1):5–40.

Roggen, D., Hofmann, S., Thoma, Y., and Floreano, D. (2003). Hardware spiking neural network with
run-time reconfigurable connectivity in an autonomous robot. In Proc. NASA/DoD Conference on
Evolvable Hardware, 2003, pages 189–198. IEEE.
Rohwer, R. (1989). The ‘moving targets’ training method. In Kindermann, J. and Linden, A., editors, Proceedings of ‘Distributed Adaptive Neural Information Processing’, St. Augustin, 24.–25.5. Oldenbourg.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.
Roux, L., Racoceanu, D., Lomenie, N., Kulikova, M., Irshad, H., Klossa, J., Capron, F., Genestie, C.,
Naour, G. L., and Gurcan, M. N. (2013). Mitosis detection in breast cancer histological images -
an ICPR 2012 contest. J. Pathol. Inform., 4:8.
Rubner, J. and Schulten, K. (1990). Development of feature detectors by self-organization: A network
model. Biological Cybernetics, 62:193–199.

Rückstieß, T., Felder, M., and Schmidhuber, J. (2008). State-Dependent Exploration for policy gradient methods. In Daelemans, W. et al., editors, European Conference on Machine Learning (ECML) and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI 5212, pages 234–249.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error
propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing,
volume 1, pages 318–362. MIT Press.
Rumelhart, D. E. and Zipser, D. (1986). Feature discovery by competitive learning. In Parallel
Distributed Processing, pages 151–193. MIT Press.
Rummery, G. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG-TR 166, Cambridge University, UK.
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and Edwards, D. D. (1995). Artificial Intelligence:
a Modern Approach, volume 2. Englewood Cliffs: Prentice Hall.
Saito, K. and Nakano, R. (1997). Partial BFGS update and efficient step-length calculation for three-
layer neural networks. Neural Computation, 9(1):123–141.
Sak, H., Senior, A., and Beaufays, F. (2014a). Long Short-Term Memory recurrent neural network
architectures for large scale acoustic modeling. In Proc. Interspeech.
Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., and Mao, M. (2014b). Se-
quence discriminative distributed training of Long Short-Term Memory recurrent neural networks.
In Proc. Interspeech.
Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–
978.
Sallans, B. and Hinton, G. (2004). Reinforcement learning with factored states and actions. Journal
of Machine Learning Research, 5:1063–1088.

Sałustowicz, R. P. and Schmidhuber, J. (1997). Probabilistic incremental program evolution. Evolu-
tionary Computation, 5(2):123–141.
Samejima, K., Doya, K., and Kawato, M. (2003). Inter-module credit assignment in modular rein-
forcement learning. Neural Networks, 16(7):985–994.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3:210–229.
Sanger, T. D. (1989). An optimality principle for unsupervised learning. In Touretzky, D. S., editor,
Advances in Neural Information Processing Systems (NIPS) 1, pages 11–19. Morgan Kaufmann.
Santamaría, J. C., Sutton, R. S., and Ram, A. (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217.

Saravanan, N. and Fogel, D. B. (1995). Evolving neural control systems. IEEE Expert, pages 23–27.
Saund, E. (1994). Unsupervised learning of mixtures of multiple causes in binary data. In Cowan,
J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems
(NIPS) 6, pages 27–34. Morgan Kaufmann.

Schaback, R. and Werner, H. (1992). Numerische Mathematik. (Numerical Mathematics.) 4th edition. Springer.


Schäfer, A. M., Udluft, S., and Zimmermann, H.-G. (2006). Learning long term dependencies with
recurrent neural networks. In Kollias, S. D., Stafylopatis, A., Duch, W., and Oja, E., editors, ICANN
(1), volume 4131 of Lecture Notes in Computer Science, pages 71–80. Springer.

Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5:197–227.


Schaul, T. and Schmidhuber, J. (2010). Metalearning. Scholarpedia, 5(6):4650.
Schaul, T., Zhang, S., and LeCun, Y. (2013). No more pesky learning rates. In Proc. 30th International
Conference on Machine Learning (ICML).

Schemmel, J., Grubl, A., Meier, K., and Mueller, E. (2006). Implementing synaptic plasticity in
a VLSI spiking neural network model. In International Joint Conference on Neural Networks
(IJCNN), pages 1–6. IEEE.
Scherer, D., Müller, A., and Behnke, S. (2010). Evaluation of pooling operations in convolutional ar-
chitectures for object recognition. In Proc. International Conference on Artificial Neural Networks
(ICANN), pages 92–101.
Schmidhuber, J. (1987). Evolutionary principles in self-referential learning, or on learning
how to learn: the meta-meta-... hook. Diploma thesis, Inst. f. Inf., Tech. Univ. Munich.
http://www.idsia.ch/˜juergen/diploma.html.
Schmidhuber, J. (1989a). Accelerated learning in back-propagation nets. In Pfeifer, R., Schreter, Z., Fogelman, F., and Steels, L., editors, Connectionism in Perspective, pages 429–438. Amsterdam: Elsevier, North-Holland.
Schmidhuber, J. (1989b). A local learning algorithm for dynamic feedforward and recurrent networks.
Connection Science, 1(4):403–412.

Schmidhuber, J. (1990a). Dynamische neuronale Netze und das fundamentale raumzeitliche Lern-
problem. (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem.)
Dissertation, Inst. f. Inf., Tech. Univ. Munich.
Schmidhuber, J. (1990b). Learning algorithms for networks with internal and external feedback. In
Touretzky, D. S., Elman, J. L., Sejnowski, T. J., and Hinton, G. E., editors, Proc. of the 1990
Connectionist Models Summer School, pages 52–61. Morgan Kaufmann.

Schmidhuber, J. (1990c). The Neural Heat Exchanger. Talks at TU Munich (1990), University
of Colorado at Boulder (1992), and Z. Li’s NIPS*94 workshop on unsupervised learning. Also
published at the Intl. Conference on Neural Information Processing (ICONIP’96), vol. 1, pages
194-197, 1996.
Schmidhuber, J. (1990d). An on-line algorithm for dynamic reinforcement learning and planning in
reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks,
San Diego, volume 2, pages 253–258.
Schmidhuber, J. (1991a). Curious model-building control systems. In Proceedings of the International
Joint Conference on Neural Networks, Singapore, volume 2, pages 1458–1463. IEEE press.

Schmidhuber, J. (1991b). Learning to generate sub-goals for action sequences. In Kohonen, T.,
Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural Networks, pages 967–972.
Elsevier Science Publishers B.V., North-Holland.
Schmidhuber, J. (1991c). Reinforcement learning in Markovian and non-Markovian environments.
In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information
Processing Systems 3 (NIPS 3), pages 500–506. Morgan Kaufmann.
Schmidhuber, J. (1992a). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248.
Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history com-
pression. Neural Computation, 4(2):234–242. (Based on TR FKI-148-91, TUM, 1991).

Schmidhuber, J. (1992c). Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879.
Schmidhuber, J. (1993a). An introspective network that can learn to run its own weight change
algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pages 191–195. IEE.

Schmidhuber, J. (1993b). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. (Network architectures, objective functions, and chain rule.) Habilitation Thesis, Inst. f. Inf., Tech. Univ. Munich.
Schmidhuber, J. (1997). Discovering neural nets with low Kolmogorov complexity and high general-
ization capability. Neural Networks, 10(5):857–873.

Schmidhuber, J. (2002). The Speed Prior: a new simplicity measure yielding near-optimal computable
predictions. In Kivinen, J. and Sloan, R. H., editors, Proceedings of the 15th Annual Conference
on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages
216–228. Springer, Sydney, Australia.
Schmidhuber, J. (2004). Optimal ordered problem solver. Machine Learning, 54:211–254.

Schmidhuber, J. (2006a). Developmental robotics, optimal artificial curiosity, creativity, music, and
the fine arts. Connection Science, 18(2):173–187.
Schmidhuber, J. (2006b). Gödel machines: Fully self-referential optimal universal self-improvers. In
Goertzel, B. and Pennachin, C., editors, Artificial General Intelligence, pages 199–226. Springer
Verlag. Variant available as arXiv:cs.LO/0309048.

Schmidhuber, J. (2007). Prototype resilient, self-modeling robots. Science, 316(5825):688.


Schmidhuber, J. (2012). Self-delimiting neural networks. Technical Report IDSIA-08-12,
arXiv:1210.0118v1 [cs.NE], The Swiss AI Lab IDSIA.
Schmidhuber, J. (2013a). My first Deep Learning system of 1991 + Deep Learning timeline 1962-
2013. Technical Report arXiv:1312.5548v1 [cs.NE], The Swiss AI Lab IDSIA.

Schmidhuber, J. (2013b). PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Psychology.
Schmidhuber, J., Ciresan, D., Meier, U., Masci, J., and Graves, A. (2011). On fast deep nets for AGI
vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain
View, CA, pages 243–246.
Schmidhuber, J., Eldracher, M., and Foltin, B. (1996). Semilinear predictability minimization pro-
duces well-known feature detectors. Neural Computation, 8(4):773–786.
Schmidhuber, J. and Huber, R. (1991). Learning to generate artificial fovea trajectories for target
detection. International Journal of Neural Systems, 2(1 & 2):135–141.

Schmidhuber, J., Mozer, M. C., and Prelinger, D. (1993). Continuous history compression. In Hüning,
H., Neuhauser, S., Raus, M., and Ritschel, W., editors, Proc. of Intl. Workshop on Neural Networks,
RWTH Aachen, pages 87–95. Augustinus.
Schmidhuber, J. and Prelinger, D. (1992). Discovering predictable classifications. Technical Report
CU-CS-626-92, Dept. of Comp. Sci., University of Colorado at Boulder. Published in Neural
Computation 5(4):625-635 (1993).
Schmidhuber, J. and Wahnsiedler, R. (1992). Planning simple trajectories using neural subgoal gen-
erators. In Meyer, J. A., Roitblat, H. L., and Wilson, S. W., editors, Proc. of the 2nd International
Conference on Simulation of Adaptive Behavior, pages 196–202. MIT Press.

Schmidhuber, J., Wierstra, D., Gagliolo, M., and Gomez, F. J. (2007). Training recurrent networks by
Evolino. Neural Computation, 19(3):757–779.
Schmidhuber, J., Zhao, J., and Schraudolph, N. (1997a). Reinforcement learning with self-modifying
policies. In Thrun, S. and Pratt, L., editors, Learning to learn, pages 293–309. Kluwer.

Schmidhuber, J., Zhao, J., and Wiering, M. (1997b). Shifting inductive bias with success-story algo-
rithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28:105–130.
Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors (1998). Advances in Kernel Methods -
Support Vector Learning. MIT Press, Cambridge, MA.

Schraudolph, N. and Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via op-
timization of binary information gain. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors,
Advances in Neural Information Processing Systems, volume 5, pages 499–506. Morgan Kauf-
mann, San Mateo.
Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent.
Neural Computation, 14(7):1723–1738.

Schraudolph, N. N. and Sejnowski, T. J. (1996). Tempering backpropagation networks: Not all weights are created equal. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems (NIPS), volume 8, pages 563–569. The MIT Press, Cambridge, MA.

Schrauwen, B., Verstraeten, D., and Van Campenhout, J. (2007). An overview of reservoir computing: theory, applications and implementations. In Proceedings of the 15th European Symposium on Artificial Neural Networks (ESANN 2007), pages 471–482.
Schuster, H. G. (1992). Learning by maximizing the information transfer through nonlinear noisy neurons and “noise breakdown”. Phys. Rev. A, 46(4):2131–2138.

Schuster, M. (1999). On supervised learning from sequential data with applications for speech recognition. PhD thesis, Nara Institute of Science and Technology, Kyoto, Japan.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45:2673–2681.

Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proc. ICML, pages 298–305.
Schwefel, H. P. (1974). Numerische Optimierung von Computer-Modellen. (Numerical optimization of computer models.) Dissertation. Published 1977 by Birkhäuser, Basel.
Segmentation of Neuronal Structures in EM Stacks Challenge (2012). IEEE International Symposium
on Biomedical Imaging (ISBI), http://tinyurl.com/d2fgh7g.
Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. (2010).
Parameter-exploring policy gradients. Neural Networks, 23(4):551–559.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2013). OverFeat:
Integrated recognition, localization and detection using convolutional networks. arXiv preprint
arXiv:1312.6229.
Sermanet, P. and LeCun, Y. (2011). Traffic sign recognition with multi-scale convolutional networks.
In Proceedings of International Joint Conference on Neural Networks (IJCNN’11), pages 2809–
2813.

Serrano-Gotarredona, R., Oster, M., Lichtsteiner, P., Linares-Barranco, A., Paz-Vicente, R., Gómez-Rodríguez, F., Camuñas-Mesa, L., Berner, R., Rivas-Pérez, M., Delbruck, T., et al. (2009). CAVIAR: A 45k neuron, 5M synapse, 12G connects/s AER hardware sensory–processing–learning–actuating system for high-speed visual object recognition and tracking. IEEE Transactions on Neural Networks, 20(9):1417–1438.

Serre, T., Riesenhuber, M., Louie, J., and Poggio, T. (2002). On the role of object-specific features
for real world object recognition in biological vision. In Biologically Motivated Computer Vision,
pages 387–397.
Seung, H. S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic
transmission. Neuron, 40(6):1063–1073.

Shan, H. and Cottrell, G. (2014). Efficient visual coding: From retina to V2. In Proc. International
Conference on Learning Representations (ICLR). arXiv preprint arXiv:1312.6077.
Shan, H., Zhang, L., and Cottrell, G. W. (2007). Recursive ICA. Advances in Neural Information
Processing Systems (NIPS), 19:1273.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathemat-
ics of computation, 24(111):647–656.
Shannon, C. E. (1948). A mathematical theory of communication (parts I and II). Bell System
Technical Journal, XXVII:379–423.
Shao, L., Wu, D., and Li, X. (2014). Learning deep and wide: A spectral method for learning deep
networks. IEEE Transactions on Neural Networks and Learning Systems.
Shavlik, J. W. (1994). Combining symbolic and neural learning. Machine Learning, 14(3):321–331.
Shavlik, J. W. and Towell, G. G. (1989). Combining explanation-based and neural learning: An
algorithm and empirical results. Connection Science, 1(3):233–255.

Siegelmann, H. (1992). Theoretical Foundations of Recurrent Neural Networks. PhD thesis, Rutgers, The State University of New Jersey, New Brunswick, NJ.
Siegelmann, H. T. and Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathe-
matics Letters, 4(6):77–80.
Silva, F. M. and Almeida, L. B. (1990). Speeding up back-propagation. In Eckmiller, R., editor,
Advanced Neural Computers, pages 151–158, Amsterdam. Elsevier.
Šíma, J. (1994). Loading deep networks is hard. Neural Computation, 6(5):842–850.
Šíma, J. (2002). Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728.
Simard, P., Steinkraus, D., and Platt, J. (2003). Best practices for convolutional neural networks
applied to visual document analysis. In Seventh International Conference on Document Analysis
and Recognition, pages 958–963.
Sims, K. (1994). Evolving virtual creatures. In Glassner, A., editor, Proceedings of SIGGRAPH ’94
(Orlando, Florida, July 1994), Computer Graphics Proceedings, Annual Conference, pages 15–22.
ACM SIGGRAPH, ACM Press. ISBN 0-89791-667-0.

Şimşek, Ö. and Barto, A. G. (2008). Skill characterization based on betweenness. In NIPS'08, pages 1497–1504.
Singh, S., Barto, A. G., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In
Advances in Neural Information Processing Systems 17 (NIPS). MIT Press, Cambridge, MA.

Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision pro-
cesses. In National Conference on Artificial Intelligence, pages 700–705.
Smith, S. F. (1980). A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, Univ. Pittsburgh.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pages 194–281. MIT Press, Cambridge, MA, USA.
Solla, S. A. (1988). Accelerated learning in layered neural networks. Complex Systems, 2:625–640.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Part I. Information and Control,
7:1–22.

Solomonoff, R. J. (1978). Complexity-based induction systems. IEEE Transactions on Information Theory, IT-24(5):422–432.
Soloway, E. (1986). Learning to program = learning to construct mechanisms and explanations.
Communications of the ACM, 29(9):850–858.

Song, S., Miller, K. D., and Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-
dependent synaptic plasticity. Nature Neuroscience, 3(9):919–926.
Speelpenning, B. (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD
thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.

Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., and Schmidhuber, J. (2013). Compete to
compute. In Advances in Neural Information Processing Systems (NIPS), pages 2310–2318.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. (2011). The German traffic sign recognition
benchmark: A multi-class classification competition. In International Joint Conference on Neural
Networks (IJCNN 2011), pages 1453–1460. IEEE Press.

Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. (2012). Man vs. computer: Benchmarking
machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332.
Stanley, K. O., D’Ambrosio, D. B., and Gauci, J. (2009). A hypercube-based encoding for evolving
large-scale neural networks. Artificial Life, 15(2):185–212.

Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topolo-
gies. Evolutionary Computation, 10:99–127.
Steijvers, M. and Grunwald, P. (1996). A recurrent network that performs a context-sensitive prediction task. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Erlbaum.
Steil, J. J. (2007). Online reservoir adaptation by intrinsic plasticity for backpropagation–
decorrelation and echo state learning. Neural Networks, 20(3):353–364.
Stemmler, M. (1996). A single spike suffices: the simplest form of stochastic resonance in model
neurons. Network: Computation in Neural Systems, 7(4):687–716.
Stoianov, I. and Zorzi, M. (2012). Emergence of a ’visual number sense’ in hierarchical generative
models. Nature Neuroscience, 15(2):194–6.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147.
Stoop, R., Schindler, K., and Bunimovich, L. (2000). When pyramidal neurons lock, when they
respond chaotically, and when they like to synchronize. Neuroscience research, 36(1):81–91.
Stratonovich, R. (1960). Conditional Markov processes. Theory of Probability And Its Applications,
5(2):156–178.
Sun, G., Chen, H., and Lee, Y. (1993a). Time warping invariant neural networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages 180–187. Morgan Kaufmann.
Sun, G. Z., Giles, C. L., Chen, H. H., and Lee, Y. C. (1993b). The neural network pushdown au-
tomaton: Model, stack and learning simulations. Technical Report CS-TR-3118, University of
Maryland, College Park.
Sun, Y., Gomez, F., Schaul, T., and Schmidhuber, J. (2013). A Linear Time Natural Evolution Strat-
egy for Non-Separable Functions. In Proceedings of the Genetic and Evolutionary Computation
Conference, page 61, Amsterdam, NL. ACM.
Sun, Y., Wierstra, D., Schaul, T., and Schmidhuber, J. (2009). Efficient natural evolution strategies.
In Proc. 11th Genetic and Evolutionary Computation Conference (GECCO), pages 539–546.
Sutskever, I., Hinton, G. E., and Taylor, G. W. (2008). The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems (NIPS), volume 21.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Technical Report arXiv:1409.3215 [cs.CL], Google. NIPS’2014.
Sutton, R. and Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA, MIT
Press.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999a). Policy gradient methods for
reinforcement learning with function approximation. In Advances in Neural Information Process-
ing Systems (NIPS) 12, pages 1057–1063.
Sutton, R. S., Precup, D., and Singh, S. P. (1999b). Between MDPs and semi-MDPs: A framework
for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181–211.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2008). A convergent O(n) algorithm for off-policy
temporal-difference learning with linear function approximation. In Advances in Neural Informa-
tion Processing Systems (NIPS’08), volume 21, pages 1609–1616.
Szabó, Z., Póczos, B., and Lőrincz, A. (2006). Cross-entropy optimization for independent pro-
cess analysis. In Independent Component Analysis and Blind Signal Separation, pages 909–916.
Springer.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Ra-
binovich, A. (2014). Going deeper with convolutions. Technical Report arXiv:1409.4842 [cs.CV],
Google.
Szegedy, C., Toshev, A., and Erhan, D. (2013). Deep neural networks for object detection. In Advances in Neural Information Processing Systems (NIPS), pages 2553–2561.

Taylor, G. W., Spiro, I., Bregler, C., and Fergus, R. (2011). Learning invariance through imitation. In
Conference on Computer Vision and Pattern Recognition (CVPR), pages 2729–2736. IEEE.
Tegge, A. N., Wang, Z., Eickholt, J., and Cheng, J. (2009). NNcon: improved protein contact map
prediction using 2D-recursive neural networks. Nucleic Acids Research, 37(Suppl 2):W515–W518.
Teichmann, M., Wiltschut, J., and Hamker, F. (2012). Learning invariance from natural images in-
spired by observations in the primary visual cortex. Neural Computation, 24(5):1271–1296.
Teller, A. (1994). The evolution of mental models. In Kenneth E. Kinnear, J., editor, Advances in
Genetic Programming, pages 199–219. MIT Press.
Tenenberg, J., Karlsson, J., and Whitehead, S. (1993). Learning via task decomposition. In Meyer,
J. A., Roitblat, H., and Wilson, S., editors, From Animals to Animats 2: Proceedings of the Second
International Conference on Simulation of Adaptive Behavior, pages 337–343. MIT Press.
Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play.
Neural Computation, 6(2):215–219.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average
of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tikhonov, A. N., Arsenin, V. I., and John, F. (1977). Solutions of ill-posed problems. Winston.
Ting, K. M. and Witten, I. H. (1997). Stacked generalization: when does it work? In Proc. International Joint Conference on Artificial Intelligence (IJCAI).

Tiňo, P. and Hammer, B. (2004). Architectural bias in recurrent neural networks: Fractal analysis.
Neural Computation, 15(8):1931–1957.
Tonkes, B. and Wiles, J. (1997). Learning a context-free task with a recurrent neural network: An
analysis of stability. In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive
Science Society.

Towell, G. G. and Shavlik, J. W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70(1):119–165.
Tsitsiklis, J. N. and van Roy, B. (1996). Feature-based methods for large scale dynamic programming.
Machine Learning, 22(1-3):59–94.

Tsodyks, M., Pawelzik, K., and Markram, H. (1998). Neural networks with dynamic synapses. Neural
Computation, 10(4):821–835.
Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., and McNaughton, B. L. (1996). Population dynam-
ics and theta rhythm phase precession of hippocampal place cell firing: a spiking neuron model.
Hippocampus, 6(3):271–280.

Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk, W., and Seung,
H. S. (2010). Convolutional networks can learn to generate affinity graphs for image segmentation.
Neural Computation, 22(2):511–538.
Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem.
Proceedings of the London Mathematical Society, Series 2, 41:230–267.

Turner, A. J. and Miller, J. F. (2013). Cartesian Genetic Programming encoded artificial neural net-
works: A comparison using three benchmarks. In Proceedings of the Conference on Genetic and
Evolutionary Computation (GECCO), pages 1005–1012.
Ueda, N. (2000). Optimal linear combination of neural networks for improving classification perfor-
mance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):207–215.

Urlbe, A. P. (1999). Structure-adaptable digital neural networks. PhD thesis, Universidad del Valle.
Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14(10):2497–
2529.
Vahed, A. and Omlin, C. W. (2004). A machine learning method for extracting symbolic knowledge
from recurrent neural networks. Neural Computation, 16(1):59–71.

Vaillant, R., Monrocq, C., and LeCun, Y. (1994). Original approach for the localisation of objects in
images. IEE Proc on Vision, Image, and Signal Processing, 141(4):245–250.
van den Berg, T. and Whiteson, S. (2013). Critical factors in the performance of HyperNEAT. In
GECCO 2013: Proceedings of the Genetic and Evolutionary Computation Conference, pages 759–
766.
van Hasselt, H. (2012). Reinforcement learning in continuous state and action spaces. In Wiering, M.
and van Otterlo, M., editors, Reinforcement Learning, pages 207–251. Springer.
Vapnik, V. (1992). Principles of risk minimization for learning theory. In Lippman, D. S., Moody,
J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems (NIPS) 4,
pages 831–838. Morgan Kaufmann.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Versino, C. and Gambardella, L. M. (1996). Learning fine motion by using the hierarchical ex-
tended Kohonen map. In Proc. Intl. Conf. on Artificial Neural Networks (ICANN), pages 221–226.
Springer.
Veta, M., Viergever, M., Pluim, J., Stathonikos, N., and van Diest, P. J. (2013). MICCAI 2013 Grand
Challenge on Mitosis Detection.
Vieira, A. and Barradas, N. (2003). A training algorithm for classification of high-dimensional data.
Neurocomputing, 50:461–472.

Viglione, S. (1970). Applications of pattern recognition technology. In Mendel, J. M. and Fu, K. S.,
editors, Adaptive, Learning, and Pattern Recognition Systems. Academic Press.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, ICML '08, pages 1096–1103, New York, NY, USA. ACM.
Vlassis, N., Littman, M. L., and Barber, D. (2012). On the computational complexity of stochastic
controller optimization in POMDPs. ACM Transactions on Computation Theory, 4(4):12.
Vogl, T., Mangis, J., Rigler, A., Zink, W., and Alkon, D. (1988). Accelerating the convergence of the
back-propagation method. Biological Cybernetics, 59:257–263.

von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex.
Kybernetik, 14(2):85–100.
Waldinger, R. J. and Lee, R. C. T. (1969). PROW: a step toward automatic program writing. In
Walker, D. E. and Norton, L. M., editors, Proceedings of the 1st International Joint Conference on
Artificial Intelligence (IJCAI), pages 241–252. Morgan Kaufmann.
Wallace, C. S. and Boulton, D. M. (1968). An information theoretic measure for classification. Com-
puter Journal, 11(2):185–194.
Wan, E. A. (1994). Time series prediction by using a connectionist network with internal delay lines.
In Weigend, A. S. and Gershenfeld, N. A., editors, Time series prediction: Forecasting the future
and understanding the past, pages 265–295. Addison-Wesley.
Wang, C., Venkatesh, S. S., and Judd, J. S. (1994). Optimal stopping and effective machine complexity
in learning. In Advances in Neural Information Processing Systems (NIPS’6), pages 303–310.
Morgan Kaufmann.
Wang, S. and Manning, C. (2013). Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning (ICML-13), pages 118–126.
Watanabe, O. (1992). Kolmogorov complexity and computational complexity. EATCS Monographs
on Theoretical Computer Science, Springer.
Watanabe, S. (1985). Pattern Recognition: Human and Mechanical. Wiley, New York.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Watrous, R. L. and Kuhn, G. M. (1992). Induction of finite-state automata using second-order recur-
rent networks. In Moody, J. E., Hanson, S. J., and Lippman, R. P., editors, Advances in Neural
Information Processing Systems 4, pages 309–316. Morgan Kaufmann.
Waydo, S. and Koch, C. (2008). Unsupervised learning of individuals and categories from images.
Neural Computation, 20(5):1165–1178.
Weigend, A. S. and Gershenfeld, N. A. (1993). Results of the time series prediction competition at the Santa Fe Institute. In IEEE International Conference on Neural Networks 1993, pages 1786–1793. IEEE.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1991). Generalization by weight-elimination
with application to forecasting. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors,
Advances in Neural Information Processing Systems (NIPS) 3, pages 875–882. San Mateo, CA:
Morgan Kaufmann.
Weiss, G. (1994). Hierarchical chunking in classifier systems. In Proceedings of the 12th National
Conference on Artificial Intelligence, volume 2, pages 1335–1340. AAAI Press/The MIT Press.
Weng, J., Ahuja, N., and Huang, T. S. (1992). Cresceptron: a self-organizing neural network which
grows adaptively. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages
576–581. IEEE.
Weng, J. J., Ahuja, N., and Huang, T. S. (1997). Learning recognition and segmentation using the
cresceptron. International Journal of Computer Vision, 25(2):109–143.

Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. PhD thesis, Harvard University.
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the
10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770.
Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach
to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics,
17.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market
model. Neural Networks, 1.
Werbos, P. J. (1989a). Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS
International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209–216.
Werbos, P. J. (1989b). Neural networks for control and system identification. In Proceedings of
IEEE/CDC Tampa, Florida.
Werbos, P. J. (1992). Neural networks, system identification, and control in the chemical industries.
In D. A. White, D. A. S., editor, Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive
Approaches, pages 283–356. Thomson Learning.
Werbos, P. J. (2006). Backwards differentiation in AD and neural nets: Past links and new oppor-
tunities. In Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34.
Springer.

West, A. H. L. and Saad, D. (1995). Adaptive back-propagation in on-line learning of multilayer networks. In Touretzky, D. S., Mozer, M., and Hasselmo, M. E., editors, NIPS, pages 323–329. MIT Press.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computa-
tion, 1(4):425–464.

Whitehead, S. (1992). Reinforcement Learning for the adaptive control of perception and action. PhD
thesis, University of Rochester.
Whiteson, S. (2012). Evolutionary computation for reinforcement learning. In Wiering, M. and van
Otterlo, M., editors, Reinforcement Learning, pages 325–355. Springer, Berlin, Germany.

Whiteson, S., Kohl, N., Miikkulainen, R., and Stone, P. (2005). Evolving keepaway soccer players
through task decomposition. Machine Learning, 59(1):5–30.
Whiteson, S. and Stone, P. (2006). Evolutionary function approximation for reinforcement learning.
Journal of Machine Learning Research, 7:877–917.

Widrow, B. and Hoff, M. (1962). Associative storage and retrieval of digital information in networks
of adaptive neurons. Biological Prototypes and Synthetic Systems, 1:160.
Widrow, B., Rumelhart, D. E., and Lehr, M. A. (1994). Neural networks: Applications in industry,
business and science. Commun. ACM, 37(3):93–105.
Wieland, A. P. (1991). Evolving neural network controllers for unstable systems. In International
Joint Conference on Neural Networks (IJCNN), volume 2, pages 667–673. IEEE.

Wiering, M. and Schmidhuber, J. (1996). Solving POMDPs with Levin search and EIRA. In Saitta,
L., editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 534–
542. Morgan Kaufmann Publishers, San Francisco, CA.
Wiering, M. and Schmidhuber, J. (1998a). HQ-learning. Adaptive Behavior, 6(2):219–246.
Wiering, M. and van Otterlo, M. (2012). Reinforcement Learning. Springer.

Wiering, M. A. and Schmidhuber, J. (1998b). Fast online Q(λ). Machine Learning, 33(1):105–116.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2010). Recurrent policy gradients. Logic
Journal of IGPL, 18(2):620–634.
Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In
Congress of Evolutionary Computation (CEC 2008).
Wiesel, D. H. and Hubel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex.
J. Physiol., 148:574–591.
Wiles, J. and Elman, J. (1995). Learning to count without a counter: A case study of dynamics
and activation landscapes in recurrent networks. In In Proceedings of the Seventeenth Annual
Conference of the Cognitive Science Society, pages pages 482 – 487, Cambridge, MA. MIT Press.
Wilkinson, J. H., editor (1965). The Algebraic Eigenvalue Problem. Oxford University Press, Inc.,
New York, NY, USA.
Williams, R. J. (1986). Reinforcement-learning in connectionist networks: A mathematical analysis.
Technical Report 8605, Institute for Cognitive Science, University of California, San Diego.
Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical
Report NU-CCS-88-3, College of Comp. Sci., Northeastern University, Boston, MA.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural
networks. Technical Report Technical Report NU-CCS-89-27, Boston: Northeastern University,
College of Computer Science.
Williams, R. J. (1992a). Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning, 8:229–256.
Williams, R. J. (1992b). Training recurrent networks using the extended Kalman filter. In Interna-
tional Joint Conference on Neural Networks (IJCNN), volume 4, pages 241–246. IEEE.
Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of
recurrent network trajectories. Neural Computation, 4:491–501.
Williams, R. J. and Zipser, D. (1988). A learning algorithm for continually running fully recurrent
networks. Technical Report ICS Report 8805, Univ. of California, San Diego, La Jolla.

Williams, R. J. and Zipser, D. (1989a). Experimental analysis of the real-time recurrent learning
algorithm. Connection Science, 1(1):87–111.
Williams, R. J. and Zipser, D. (1989b). A learning algorithm for continually running fully recurrent
networks. Neural Computation, 1(2):270–280.

86
Willshaw, D. J. and von der Malsburg, C. (1976). How patterned neural connections can be set up by
self-organization. Proc. R. Soc. London B, 194:431–445.
Windisch, D. (2005). Loading deep networks is hard: The pyramidal case. Neural Computation,
17(2):487–502.
Wiskott, L. and Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances.
Neural Computation, 14(4):715–770.
Witczak, M., Korbicz, J., Mrugalski, M., and Patton, R. J. (2006). A GMDH neural network-based
approach to robust fault diagnosis: Application to the DAMADICS benchmark problem. Control
Engineering Practice, 14(6):671–683.
Wöllmer, M., Blaschke, C., Schindl, T., Schuller, B., Färber, B., Mayer, S., and Trefflich, B. (2011).
On-line driver distraction detection using Long Short-Term Memory. IEEE Transactions on Intel-
ligent Transportation Systems (TITS), 12(2):574–582.
Wöllmer, M., Schuller, B., and Rigoll, G. (2013). Keyword spotting exploiting Long Short-Term
Memory. Speech Communication, 55(2):252–265.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2):241–259.


Wolpert, D. H. (1994). Bayesian backpropagation over i-o functions rather than weights. In Cowan,
J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems
(NIPS) 6, pages 200–207. Morgan Kaufmann.
Wu, D. and Shao, L. (2014). Leveraging hierarchical parametric networks for skeletal joints based
action segmentation and recognition. In Proc. Conference on Computer Vision and Pattern Recog-
nition (CVPR).
Wu, L. and Baldi, P. (2008). Learning to play Go using recursive neural networks. Neural Networks,
21(9):1392–1400.

Wyatte, D., Curran, T., and O’Reilly, R. (2012). The limits of feedforward vision: Recurrent process-
ing promotes robust object recognition when objects are degraded. Journal of Cognitive Neuro-
science, 24(11):2248–2261.
Wysoski, S. G., Benuskova, L., and Kasabov, N. (2010). Evolving spiking neural networks for audio-
visual information processing. Neural Networks, 23(7):819–835.

Yamauchi, B. M. and Beer, R. D. (1994). Sequential behavior and learning in evolved dynamical
neural networks. Adaptive Behavior, 2(3):219–246.
Yamins, D., Hong, H., Cadieu, C., and DiCarlo, J. J. (2013). Hierarchical modular optimization of
convolutional networks achieves representations similar to macaque IT and human ventral stream.
Advances in Neural Information Processing Systems (NIPS), pages 1–9.

Yang, M., Ji, S., Xu, W., Wang, J., Lv, F., Yu, K., Gong, Y., Dikmen, M., Lin, D. J., and Huang,
T. S. (2009). Detecting human actions in surveillance videos. In TREC Video Retrieval Evaluation
Workshop.
Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelli-
gent Systems, 4:203–222.

87
Yin, F., Wang, Q.-F., Zhang, X.-Y., and Liu, C.-L. (2013). ICDAR 2013 Chinese handwriting recog-
nition competition. In 12th International Conference on Document Analysis and Recognition (IC-
DAR), pages 1464–1470.
Yin, J., Meng, Y., and Jin, Y. (2012). A developmental approach to structural self-organization in
reservoir computing. IEEE Transactions on Autonomous Mental Development, 4(4):273–289.

Young, S., Davis, A., Mishtal, A., and Arel, I. (2014). Hierarchical spatiotemporal feature extraction
using recurrent online clustering. Pattern Recognition Letters, 37:115–123.
Yu, X.-H., Chen, G.-A., and Cheng, S.-X. (1995). Dynamic learning rate optimization of the back-
propagation algorithm. IEEE Transactions on Neural Networks, 6(3):669–677.
Zamora-Martnez, F., Frinken, V., Espaa-Boquera, S., Castro-Bleda, M., Fischer, A., and Bunke, H.
(2014). Neural network language models for off-line handwriting recognition. Pattern Recognition,
47(4):1642–1652.
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.
Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutional networks. Technical
Report arXiv:1311.2901 [cs.CV], NYU.
Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. PhD thesis,
University of Toronto.
Zemel, R. S. and Hinton, G. E. (1994). Developing population codes by minimizing description
length. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information
Processing Systems 6, pages 11–18. Morgan Kaufmann.
Zeng, Z., Goodman, R., and Smyth, P. (1994). Discrete recurrent neural networks for grammatical
inference. IEEE Transactions on Neural Networks, 5(2).
Zimmermann, H.-G., Tietz, C., and Grothmann, R. (2012). Forecasting with recurrent neural net-
works: 12 tricks. In Montavon, G., Orr, G. B., and Müller, K.-R., editors, Neural Networks:
Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 687–707.
Springer.
Zipser, D., Kehoe, B., Littlewort, G., and Fuster, J. (1993). A spiking network model of short-term
active memory. The Journal of Neuroscience, 13(8):3406–3420.

88
1

Salient Object Detection: A Discriminative Regional Feature Integration Approach

Huaizu Jiang, Zejian Yuan, Ming-Ming Cheng, Yihong Gong, Nanning Zheng, and Jingdong Wang

arXiv:1410.5926v1 [cs.CV] 22 Oct 2014

Abstract—Salient object detection has been attracting a lot of interest, and recently various heuristic computational models have been designed. In this paper, we formulate saliency map computation as a regression problem. Our method, which is based on multi-level image segmentation, utilizes a supervised learning approach to map the regional feature vector to a saliency score. Saliency scores across multiple levels are finally fused to produce the saliency map. The contributions are twofold. One is that we propose a discriminative regional feature integration approach for salient object detection. Compared with existing heuristic models, our proposed method is able to automatically integrate high-dimensional regional saliency features and choose discriminative ones. The other is that, by investigating standard generic region properties as well as two widely studied concepts for salient object detection, i.e., regional contrast and backgroundness, our approach significantly outperforms state-of-the-art methods on six benchmark datasets. Meanwhile, we demonstrate that our method runs as fast as most existing algorithms.

1 INTRODUCTION

Visual saliency has been a fundamental problem in neuroscience, psychology, neural systems, and computer vision for a long time. It was originally defined as a task of predicting eye fixations on images [2]. Recently it has been extended to identifying a region [3], [4] containing the salient object, known as salient object detection or salient region detection. Applications of salient object detection include object detection and recognition [5], [6], image compression [7], image cropping [8], photo collage [9], [10], dominant color detection [11], [12], and so on.

The study of human visual systems suggests that saliency is related to the uniqueness, rarity, and surprise of a scene, characterized by primitive features like color, texture, shape, etc. Recently a lot of effort has been made to design various heuristic algorithms to compute saliency [13]–[21]. Built upon the feature integration theory [2], [22], almost all of these approaches compute conspicuity (feature) maps from different saliency cues and then combine them together to form the final saliency map. Hand-crafted integration rules, however, are fragile and generalize poorly. For instance, in a recent survey [23], none of the algorithms consistently outperforms the others on the benchmark data sets. Though some learning-based salient object detection algorithms have been proposed [19], [24], [25], the potential of supervised learning has not been deeply investigated.

In this paper, we formulate salient object detection as a regression problem, learning a regressor that directly maps the regional feature vector to a saliency score. Our approach consists of three main steps. The first is multi-level segmentation, which decomposes the image into multiple segmentations. Second, we conduct a region saliency computation step with a Random Forest regressor that maps the regional features to a saliency score. Last, a saliency map is computed by fusing the saliency maps across the multiple levels of segmentation.

The key contributions lie in the second step, region saliency computation. Firstly, unlike most existing algorithms that compute saliency maps heuristically from various features and combine them to get the saliency map, which we call saliency integration, we learn a Random Forest regressor that directly maps the feature vector of each region to a saliency score, which we call discriminative regional feature integration (DRFI). This is a principled way in image classification [26], but rarely studied in salient object detection. Secondly, by investigating standard generic region properties and two widely studied concepts in salient object detection, i.e., regional contrast and backgroundness, our proposed approach consistently outperforms state-of-the-art algorithms on all six benchmark data sets with large margins. Rather than heuristically hand-crafting special features, it turns out that the learned regressor is able to automatically integrate features and pick up discriminative ones for saliency. Even though the regressor is trained on a small set of images, it demonstrates good generalization ability to other data sets.

• H. Jiang, Z. Yuan, Y. Gong, and N. Zheng are with Xi'an Jiaotong University. M.-M. Cheng is with Oxford University. J. Wang is with Microsoft Research Asia.
• A preliminary version of this work appeared at CVPR [1].
• Project website: jianghz.com/drfi.

The rest of this paper is organized as follows. Sec. 2 introduces related work and discusses the differences from our proposed method. The saliency computation framework is presented in Sec. 3. Sec. 4 describes the regional saliency features adopted in this paper. Sec. 5 presents the learning framework of our approach. Empirical analysis of our proposed method and comparisons with other algorithms are demonstrated in Sec. 6. Finally, Sec. 7 discusses and concludes this paper.

2 RELATED WORK

Salient object detection, stemming from eye fixation prediction, aims to separate the entire salient object from the background. Since the pioneering work of Itti et al. [2], it has attracted more and more research interest in computer vision, driven by applications such as content-aware image resizing [8], picture collage [10], etc. In the following, we focus on salient object detection (segmentation) and briefly review existing algorithms. A comprehensive survey can be found in a recent work [23]. A literature review of eye fixation prediction can be seen in [27], which also includes some analysis of salient object detection. We simply divide existing algorithms into two categories, unsupervised and supervised, according to whether the ground-truth annotations of salient objects are adopted.

Unsupervised approaches. Most salient object detection algorithms characterize the uniqueness of a scene as salient regions following the center-surround contrast framework [2], where different kinds of features are combined according to the feature integration theory [22]. The multi-scale pixel contrast is studied in [19], [28]. The discriminant center-surround hypothesis is analyzed in [16], [29]. Color histograms, computed to represent the center and the surround, are used to evaluate the center-surround dissimilarity [19]. An information-theoretic perspective is introduced to yield a sound mathematical formulation, computing the center-surround divergence based on feature statistics [18]. A cost-sensitive SVM is trained to measure the separability of a center region w.r.t. its surroundings [30]. The uniqueness can also be captured in a global scope by comparing a patch to its k nearest neighbors [17] or as its distance to the average patch over the image along the principal component axis coordinates [31].

The center-surround difference framework is also investigated to compute the saliency from region-based image representations. Multi-level image segmentation is adopted for salient object detection based on local regional contrast [32]. The global regional contrast is studied in [15], [21] as well. To further enhance the performance, saliency maps on hierarchical segmentations are computed and finally combined through a tree model via dynamic programming [33]. Both the color and textural global uniqueness are investigated in [34], [35].

Recently, Cheng et al. [36] propose a soft image abstraction using a Gaussian Mixture Model (GMM), where each pixel maintains a probability of belonging to all the regions instead of a single hard region label, to better compute the saliency. The global uniqueness can also be captured with the low-rank matrix recovery framework [37]–[39], where the low-rank matrix corresponds to the background regions while sparse noises are indications of salient regions. A submodular salient object detection algorithm is presented in [40], where superpixels are gradually grouped to form potential salient regions by iteratively optimizing a submodular facility location problem. The Bayesian framework is introduced for salient object detection in [41], [42]. A partial differential equation (PDE) is also introduced for salient object detection in a recent work [43].

In addition to capturing the uniqueness, many other priors have been proposed for saliency computation. The central prior, i.e., that the salient object usually lies in the center of an image, is investigated in [32], [44]. Object priors, such as the connectivity prior [45], concavity context [20], auto-context cue [46], backgroundness prior [47]–[50], generic objectness prior [51]–[53], and background connectivity prior [38], [54], [55], are also studied for saliency computation. Example-based approaches, searching for similar images to the input, are developed for salient object detection [8], [56]. The depth cue is leveraged for saliency analysis derived from stereoscopic image pairs in [57] and from a depth camera (e.g., Kinect) in [58]. Li et al. [59] adopt the light field camera for salient object detection. Besides, spectral analysis in the frequency domain is used to detect salient regions [13].

Supervised approaches. Inspired by the feature integration theory, some approaches focus on learning the linear fusion weights of saliency features. Liu et al. [19] propose to learn the linear fusion weights of saliency features in a Conditional Random Field (CRF) framework. Recently, the large-margin framework was adopted to learn the weights in [60]. Due to the highly non-linear essence of the saliency mechanism, a linear mapping might not perfectly capture the characteristics of saliency. In [24], a mixture of linear Support Vector Machines (SVMs) is adopted to partition the feature space into a set of sub-regions that are linearly separable, using a divide-and-conquer strategy. Alternatively, a Boosted Decision Tree (BDT) is learned to get an initial saliency map, which is further refined using a high-dimensional color transform [61]. In [25], generic regional properties are investigated for salient object detection. Li et al. [62] propose to generate a saliency map by adaptively averaging the object proposals [63] with their foreground probabilities, which are learned based on eye fixation features using a Random Forest regressor. Additionally, Wang et al. [64] learn
a Random Forest to directly localize the salient object on thumbnail images. In [65], a saliency map is used to guide the sampling of sliding windows for object category recognition and is learned online during the classification process.

Our proposed discriminative regional feature integration (DRFI) approach is a supervised salient object detection algorithm. Compared with unsupervised methods, our approach extends the contrast value used in existing algorithms to a contrast vector to represent a region. More importantly, instead of designing heuristic integration rules, our approach is able to automatically combine the high-dimensional saliency features in a data-driven fashion and pick up the discriminative ones. Compared with existing supervised methods, our method learns a highly non-linear combination of saliency features and does not require any assumption about the feature space. The most similar approaches to ours might be [25], [61]. [25] is a light touch on discriminative feature integration without presenting a deep investigation, considering only the regional property descriptor. In [61], the learned saliency map is only used as a pre-processing step to provide a coarse estimation of salient and background regions, while our approach directly outputs the saliency map.

It is noted that some supervised learning approaches exist to predict eye fixations [66], [67]. Their features, e.g., the local energy of the steerable pyramid filters [68] in [66] and the perceptual Gestalt grouping cues in [67], seem to be more suitable for eye fixation prediction, while our approach is specifically designed for salient object detection. We also note that discriminative feature fusion has been studied in image classification [69], which learns adaptive weights of features according to the classification task to better distinguish one class from others. Instead, our approach integrates three types of regional features in a discriminative strategy for saliency regression on multiple segmentations.
3 IMAGE SALIENCY COMPUTATION

The pipeline of our approach consists of three main steps: multi-level segmentation, which decomposes an image into regions; regional saliency computation, which maps the features extracted from each region to a saliency score; and multi-level saliency fusion, which combines the saliency maps over all the levels of segmentation to get the final saliency map. The whole process is illustrated in Fig. 1.

Fig. 1. The framework of our proposed discriminative regional feature integration (DRFI) approach. (The figure depicts annotated training images feeding the learning stage, and an input image passing through multi-level image segmentation, multi-level saliency computation, and multi-level saliency fusion.)

Multi-level segmentation. Given an image I, we represent it by a set of M-level segmentations S = {S_1, S_2, ..., S_M}, where each segmentation S_m is a decomposition of the image I. We apply the graph-based image segmentation approach [70] to generate multiple segmentations using M groups of different parameters. Due to the limitation of low-level cues, none of the current segmentation algorithms can reliably segment the salient object. Therefore, we resort to multi-level segmentation for robustness. In Sec. 5.1, we will further demonstrate how to utilize multi-level segmentation to generate a large amount of training samples.

Regional saliency computation. In our approach, we predict a saliency score for each region, which is jointly represented by three types of features: regional contrast, regional property, and regional backgroundness, described in Sec. 4. For now, we denote the feature as a vector x. The feature x is passed into a Random Forest regressor f, yielding a saliency score. The Random Forest regressor is learnt from the regions of the training images and integrates the features together in a discriminative strategy. The learning procedure is given in Sec. 5.

Multi-level saliency fusion. After regional saliency computation, each region has a saliency value. For each level, we assign the saliency value of each region to its contained pixels. As a result, we generate M saliency maps {A_1, A_2, ..., A_M} and then fuse them together, A = g(A_1, ..., A_M), to get the final saliency map A, where g is a combinator function introduced in Sec. 5.3.
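To make the three-step pipeline concrete, here is a minimal Python sketch (not the authors' released implementation). The helpers `segment_multilevel` and `extract_region_features`, and the `pixel_mask` attribute, are hypothetical stand-ins for the graph-based segmentation [70] and the 93-dimensional regional descriptor of Sec. 4; `regressor` can be any trained model with a scikit-learn-style `predict`.

```python
import numpy as np

def compute_saliency_map(image, regressor, segment_multilevel,
                         extract_region_features, weights=None):
    """Three-step DRFI pipeline sketch: multi-level segmentation,
    per-region saliency regression, and linear multi-level fusion."""
    level_maps = []
    for regions in segment_multilevel(image):            # S_1, ..., S_M
        X = np.stack([extract_region_features(image, r, regions)
                      for r in regions])                 # one 93-d row per region
        scores = regressor.predict(X)                    # regional saliency
        a = np.zeros(image.shape[:2])
        for r, s in zip(regions, scores):
            a[r.pixel_mask] = s                          # spread score to pixels
        level_maps.append(a)
    if weights is None:                                  # plain averaging is
        weights = np.full(len(level_maps),               # reported to work
                          1.0 / len(level_maps))         # nearly as well
    return sum(w * a for w, a in zip(weights, level_maps))
```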
4 REGIONAL SALIENCY FEATURES

In this section, we present three types of regional saliency features, leading to a 93-dimensional feature vector for each region.

4.1 Regional contrast descriptor

A region is likely to be thought salient if it is different from the others. Unlike most existing approaches, which compute contrast values, e.g., the distances of region features like color and texture, and then combine them directly to form a saliency score, our approach computes a contrast descriptor: a vector representing the differences of the feature vectors of regions.

Fig. 2. Color and texture features describing the visual characteristics of a region, used to compute the regional feature vectors:

  feature                                         dim   definition              dim   contrast    backgroundness
  a1  average RGB values                          3     d(a1^{R_i}, a1^S)       3     c1 - c3     b1 - b3
  h1  RGB histogram                               256   chi2(h1^{R_i}, h1^S)    1     c4          b4
  a2  average HSV values                          3     d(a2^{R_i}, a2^S)       3     c5 - c7     b5 - b7
  h2  HSV histogram                               256   chi2(h2^{R_i}, h2^S)    1     c8          b8
  a3  average L*a*b* values                       3     d(a3^{R_i}, a3^S)       3     c9 - c11    b9 - b11
  h3  L*a*b* histogram                            256   chi2(h3^{R_i}, h3^S)    1     c12         b12
  r   absolute responses of the LM filters        15    d(r^{R_i}, r^S)         15    c13 - c27   b13 - b27
  h4  max response histogram of the LM filters    15    chi2(h4^{R_i}, h4^S)    1     c28         b28
  h5  histogram of the LBP feature                256   chi2(h5^{R_i}, h5^S)    1     c29         b29

Here $d(x^1, x^2) = (|x^1_1 - x^2_1|, \cdots, |x^1_d - x^2_d|)$, where $d$ is the number of elements in the vectors $x^1$ and $x^2$, and $\chi^2(h^1, h^2) = \sum_{i=1}^{b} \frac{2(h^1_i - h^2_i)^2}{h^1_i + h^2_i}$, with $b$ being the number of histogram bins. The last two columns give the symbols of the regional contrast and backgroundness descriptors. (In the definition of a feature, S corresponds to R_j for the regional contrast descriptor and to B for the regional backgroundness descriptor.)

To compute the contrast descriptor, we describe each region R_i ∈ S_m by a feature vector v_{R_i} including color and texture features; the detailed description is given in Fig. 2. For color features, we consider the RGB, HSV, and L*a*b* color spaces. For texture features, we adopt the LBP feature [71] and the responses of the LM filter bank [72].

As suggested in previous works [15], [21], the regional contrast value $x^c_k$ derived from the $k$-th feature channel is computed by checking $R_i$ against all other regions,

$$x^c_k(R_i) = \sum_{j=1}^{N_m} \alpha_j \, w_{ij} \, D_k(\mathbf{v}_{R_i}, \mathbf{v}_{R_j}), \qquad (1)$$

where $D_k(\mathbf{v}_{R_i}, \mathbf{v}_{R_j})$ captures the difference of the $k$-th channel of the feature vectors $\mathbf{v}_{R_i}$ and $\mathbf{v}_{R_j}$: the difference of a histogram feature is computed as the $\chi^2$ distance, and as the absolute difference for the other features. $w_{ij} = \exp\!\left(-\frac{\|\mathbf{p}_i - \mathbf{p}_j\|^2}{2\sigma_s^2}\right)$ is a spatial weighting term, where $\mathbf{p}_i$ and $\mathbf{p}_j$ are the mean positions of $R_i$ and $R_j$, respectively, and $\sigma_s$ controls the strength of the spatial weighting effect; we empirically set it to 1.0 in our implementation. $\alpha_j$, defined as the normalized area of the region $R_j$, is introduced to account for the irregular shapes of regions. $N_m$ is the number of regions in $S_m$. As a result, we get a 29-dimensional feature vector. The details of the regional contrast descriptor are given in Fig. 2.
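As a sanity check of Eq. (1), the following NumPy sketch computes the contrast value of every region for one scalar feature channel at once; histogram channels would use a χ² difference in place of the absolute difference assumed here.

```python
import numpy as np

def regional_contrast(values, positions, areas, sigma_s=1.0):
    """Eq. (1) for one scalar channel: x_k^c(R_i) =
    sum_j alpha_j * w_ij * |v_i - v_j|, for all regions i at once.

    values    : (N,) per-region feature values for channel k
    positions : (N, 2) mean region positions
    areas     : (N,) normalized region areas (the alpha_j weights)
    """
    diff = np.abs(values[:, None] - values[None, :])              # D_k
    sq = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    w = np.exp(-sq / (2.0 * sigma_s ** 2))                        # w_ij
    return (areas[None, :] * w * diff).sum(axis=1)                # sum over j
```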
4.2 Regional backgroundness descriptor

There exist a few algorithms attempting to make use of the characteristics of the background (e.g., homogeneous color or texture) to heuristically determine whether a region is background, e.g., [47]. In contrast, our algorithm extracts a set of features and adopts a supervised learning approach to determine the background degree (and accordingly the saliency degree) of a region.

It has been observed that background identification depends on the whole image context. Image regions with similar appearances might belong to the background in one image but to the salient object in other images, so it is not enough to merely use the property features to check whether one region is in the background or the salient object.

Therefore, we extract a pseudo-background region and compute the backgroundness descriptor for each region with the pseudo-background region as a reference. The pseudo-background region B is defined as the 15-pixel-wide narrow border region of the image. To verify this definition, we made a simple survey on the MSRA-B data set with 5000 images and found that 98% of the pixels in the border area belong to the background. The backgroundness value $x^b_k$ of the region $R_i$ on the $k$-th feature is then defined as

$$x^b_k(R_i) = D_k(\mathbf{v}_{R_i}, \mathbf{v}_B). \qquad (2)$$

We get a 29-dimensional feature vector; see the details in Fig. 2.
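A minimal sketch of Eq. (2) for the mean-value channels, assuming per-pixel feature maps and boolean region masks; the histogram channels of Fig. 2 would be handled analogously with the χ² distance.

```python
import numpy as np

def pseudo_background_descriptor(image, region_masks, border=15):
    """Compare each region's mean feature vector against the
    15-pixel-wide pseudo-background border region B (Eq. 2).

    image        : (H, W, C) float array of per-pixel feature channels
    region_masks : list of (H, W) boolean masks, one per region
    Returns one |v_Ri - v_B| difference vector per region.
    """
    h, w = image.shape[:2]
    bg = np.zeros((h, w), dtype=bool)
    bg[:border, :] = bg[-border:, :] = True    # top and bottom strips
    bg[:, :border] = bg[:, -border:] = True    # left and right strips
    v_b = image[bg].mean(axis=0)               # mean feature of B
    return np.stack([np.abs(image[m].mean(axis=0) - v_b)
                     for m in region_masks])
```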
4.3 Regional property descriptor

Additionally, we consider the generic properties of a region, including appearance and geometric features. These features are extracted independently from each region, like the feature extraction algorithm in image labeling [73]. The appearance features attempt to describe the distribution of colors and textures in a region, which can characterize the common properties of the salient object and the background. For example, the background usually has a homogeneous color distribution or a similar texture pattern. The geometric features include the size and position of a region, which may be useful to describe the spatial distribution of the salient object and the background. For example, the salient object tends to be placed near the center of the image, while the background usually scatters over the entire image. Finally, we obtain a 35-dimensional regional property descriptor, detailed in Fig. 3.

Fig. 3. The regional property descriptor (coord. abbreviates coordinates):

  description                                      notation    dim
  average normalized x coordinates                 p1          1
  average normalized y coordinates                 p2          1
  10th percentile of the normalized x coord.       p3          1
  10th percentile of the normalized y coord.       p4          1
  90th percentile of the normalized x coord.       p5          1
  90th percentile of the normalized y coord.       p6          1
  normalized perimeter                             p7          1
  aspect ratio of the bounding box                 p8          1
  variances of the RGB values                      p9 - p11    3
  variances of the L*a*b* values                   p12 - p14   3
  variances of the HSV values                      p15 - p17   3
  variances of the responses of the LM filters     p18 - p32   15
  variance of the LBP feature                      p33         1
  normalized area                                  p34         1
  normalized area of the neighbor regions          p35         1

In summary, we obtain a 93-dimensional (2 × 29 + 35) feature vector for each region. Fig. 4 demonstrates visualizations of the most important features of each kind of regional feature descriptor.

5 LEARNING

In this section, we introduce how to learn a Random Forest that maps the feature vector of each region to a saliency score. Learning the multi-level saliency fusion weights is also presented.

Fig. 4. Illustration of the most important features. From top to bottom: input images, the most important contrast feature (c12), the most important backgroundness feature (b12), the most important property feature (p5), and the saliency map of our approach (DRFIs) produced on a single-level segmentation. A brighter area indicates a larger feature value (and thus a larger saliency value according to c12, b12, and the saliency map).

5.1 Generating training samples

We use supervised multi-level segmentation to generate training samples. We first learn a similarity score for each pair of adjacent regions, giving the probability that the adjacent regions both belong to the salient region or to the background. Similar regions are grouped together in a hierarchical way, and the training samples of the saliency regressor are the confident regions in the grouping hierarchy.

By learning the similarity score, we hope that regions from the object (or background) are more likely to be grouped together. Specifically, given an over-segmentation of an image, we connect each region and its spatially neighboring regions, forming a set of pairs P = {(R_i, R_j)}, and learn the probability p(a_i = a_j), where a_i is the saliency label of the region R_i. Such a set of pairs is divided into two parts: a positive part P+ = {(R_i, R_j) | a_i = a_j} and a negative part P− = {(R_i, R_j) | a_i ≠ a_j}. Following [74], each region pair is described by a set of features including the regional saliency features of the two regions (2 × 93 features), the feature contrast of the two regions (similar to the regional contrast descriptor, 29 features), and the geometry features of the superpixel boundary of the two regions (similar to p1 ∼ p7 in Fig. 3, 7 features). Given these 222-dimensional features, we learn a boosted decision tree classifier to estimate the similarity score of each adjacent region pair.

Based on the learned similarity of two adjacent regions, we produce multi-level segmentations {S_1^t, S_2^t, ..., S_M^t} to gather a large amount of training samples. Specifically, denote by S_0^t the over-segmentation of the image generated using the graph-based image segmentation algorithm [70]. The regions in S_0^t are represented by a weighted graph connecting the spatially neighboring regions, where the weight of each edge is the learned similarity of the two adjacent superpixels. Similar to the pixel-wise grouping in [70], pairs of regions are sequentially merged in order of decreasing edge weights. We change the tolerance degree of small regions, i.e., the parameter k of the approach [70] (see the details in [70]), to generate the segmentations from S_1^t to S_M^t. To avoid too-fine groupings, we discard $S_i^t$ if $\frac{|S_i^t|}{|S_0^t|} > 0.6$, where $|\cdot|$ denotes the number of superpixels.

Given a set of training images with ground-truth annotations and their multi-level segmentations, we can collect many confident regions R = {R^(1), R^(2), ..., R^(Q)} and the corresponding saliency scores A = {a^(1), a^(2), ..., a^(Q)} to learn a Random Forest saliency regressor. Only confident regions are kept for training, since some regions may contain pixels from both the salient object and the background. A region is considered confident if the number of pixels belonging to the salient object or to the background exceeds 80% of the total number of pixels in the region; its saliency score is set to 1 or 0 accordingly. In experiments we find that few regions of all the training examples, around 6%, are unconfident, and we discard them from training.

One benefit of generating multi-level segmentations is that a large amount of training samples can be gathered. In Sec. 6.3, we empirically analyze different settings of Mt and validate our motivation to generate training samples based on multi-level image segmentation. Additionally, with the guidance of the learned similarity metric, there might be only a few large regions left in the high-level segmentations, which are helpful for the Random Forest regressor to learn object-level properties. However, the learned similarity is hard to generalize across datasets; this is why our approach did not perform the best on the SED2 dataset in our previous version [1]. To this end, we adopt unsupervised multi-level segmentation in the testing phase, which is also more efficient since no similarity score needs to be learned.
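The 80% confidence rule can be sketched as follows, assuming boolean region and ground-truth masks; regions falling between the two thresholds are the roughly 6% of unconfident samples that are discarded.

```python
import numpy as np

def label_confident_regions(region_masks, gt_mask, thresh=0.8):
    """Keep a region only if at least `thresh` of its pixels are
    salient (label 1) or background (label 0), per Sec. 5.1.

    region_masks : list of (H, W) boolean region masks from one level
    gt_mask      : (H, W) boolean ground-truth salient-object mask
    Returns (kept_indices, labels) for the confident regions.
    """
    kept, labels = [], []
    for i, m in enumerate(region_masks):
        frac_salient = gt_mask[m].mean()      # fraction of salient pixels
        if frac_salient >= thresh:
            kept.append(i); labels.append(1.0)
        elif frac_salient <= 1.0 - thresh:
            kept.append(i); labels.append(0.0)
        # otherwise the region is unconfident and is discarded
    return np.array(kept), np.array(labels)
```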
5.2 Learning the regional saliency regressor

Our aim is to learn the regional saliency estimator from a set of training examples. As aforementioned, each region is described by a feature vector x ∈ R^d composed of the regional contrast, regional property, and regional backgroundness descriptors (i.e., d = 93). From the training data X = {x^(1), x^(2), ..., x^(Q)} and the saliency scores A = {a^(1), a^(2), ..., a^(Q)}, we learn a Random Forest regressor f_r: R^d → R which maps the feature vector of each region to a saliency score.

A Random Forest saliency regressor is an ensemble of T decision trees, where each tree consists of split and leaf nodes. Each split node stores a feature index f and a threshold τ. Given a feature vector x, each split node in the tree makes a decision based on the feature index and threshold pair (f, τ): if x(f) < τ it traverses to the left child, otherwise to the right child. When a leaf node is reached, its stored prediction value is returned. The final prediction of the forest is the average of the predictions over all the decision trees.

Training a Random Forest regressor amounts to independently building each decision tree. For each tree, the training samples are randomly drawn with replacement, X_t = {x^(t_1), x^(t_2), ..., x^(t_Q)}, A_t = {a^(t_1), a^(t_2), ..., a^(t_Q)}, where t_i ∈ [1, Q], i ∈ [1, Q]. Constructing a tree means finding the pair (f, τ) for each split node and the prediction value for each leaf node. Starting from the root node, m features F_m are randomly chosen without replacement from the full feature vector, and the best split is found among these features F_m by maximizing the splitting criterion

$$(f^*, \tau^*) = \arg\max_{f \in F_m,\, \tau} \left( \frac{\left(\sum_{t_i \in D_l} a^{(t_i)}\right)^2}{|D_l|} + \frac{\left(\sum_{t_i \in D_r} a^{(t_i)}\right)^2}{|D_r|} - \frac{\left(\sum_{t_i \in D} a^{(t_i)}\right)^2}{|D|} \right), \qquad (3)$$

where $D_l = \{(x^{(t_i)}, a^{(t_i)}) \mid x^{(t_i)}(f) < \tau\}$, $D_r = \{(x^{(t_i)}, a^{(t_i)}) \mid x^{(t_i)}(f) \geq \tau\}$, and $D = D_l \cup D_r$. Such a splitting procedure is repeated until |D| < 5, at which point a leaf node is created. The prediction value of the leaf node is the average saliency score of the training samples falling into it. We empirically examine the settings of the parameters T and m in Sec. 6.3.

Learning a saliency regressor can automatically integrate the features and discover the most discriminative ones. Additionally, in the training procedure of the Random Forest, the feature importance can be estimated simultaneously (refer to the supplementary material for more details). Fig. 6 presents the most important 60 features.
the similarity score. features.

5.3 Learning the multi-level saliency fusor

Given the multi-level saliency maps {A_1, A_2, ..., A_M} of an image, our aim is to learn a combinator g(A_1, A_2, ..., A_M) that fuses them together to form the final saliency map A. Such a problem has already been addressed in existing methods, such as the conditional random field solution [19]. In our implementation, we find that a linear combinator, $A = \sum_{m=1}^{M} w_m A_m$, performs well, with the weights learned by a least-squares estimator, i.e., by minimizing the sum of the losses $\|A - \sum_{m=1}^{M} w_m A_m\|_F^2$ over all the training images. In practice, we found that the plain average of the multi-level saliency maps performs nearly as well as the weighted average.
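The least-squares fusion reduces to one linear system over all training pixels; a sketch:

```python
import numpy as np

def fit_fusion_weights(level_maps_per_image, gt_maps):
    """Solve min_w sum_images ||A - sum_m w_m A_m||_F^2 by stacking
    every pixel of every training image into an (n_pixels, M) system.

    level_maps_per_image : list of (M, H, W) arrays, one per image
    gt_maps              : list of (H, W) ground-truth maps
    """
    P = np.concatenate([np.asarray(maps).reshape(len(maps), -1).T
                        for maps in level_maps_per_image])
    y = np.concatenate([np.asarray(g).ravel() for g in gt_maps])
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w
```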
6 EXPERIMENTS

In this section, we empirically analyze our proposed approach. Comparisons with state-of-the-art methods on benchmark data sets are also demonstrated.

6.1 Data sets

We evaluate the performance over five data sets that are widely used in salient object detection and segmentation.
MSRA-B.¹ This data set [19] includes 5000 images, originally annotated with labeled rectangles from nine users, each drawing a bounding box around what they consider the most salient object. There is a large variation among the images, including natural scenes, animals, indoor scenes, outdoor scenes, etc. We manually segment the salient object (contour) within the user-drawn rectangle to obtain binary masks. The ASD data set [13] is a subset (with binary masks) of MSRA-B, and thus we no longer make evaluations on it.

iCoSeg.² This is a publicly available co-segmentation data set [75], including 38 groups with 643 images in total. Each image comes with a pixel-wise ground-truth annotation, which may contain one or multiple salient objects. In this paper, we use it to evaluate the performance of salient object detection.

SED.³ This data set [76] contains two subsets: SED1, which has 100 images containing only one salient object, and SED2, which has 100 images containing exactly two salient objects. Pixel-wise ground-truth annotations for the salient objects in both SED1 and SED2 are provided. We only make evaluations on SED2. Similar to the larger MSRA-B dataset, only a single salient object exists in each image of SED1, where state-of-the-art performance was reported in our previous version [1]. Additionally, evaluations on SED2 may help us check the adaptability of salient object detection algorithms to multiple-object cases.

ECSSD.⁴ To overcome the weakness of existing data sets such as ASD, in which background structures are primarily simple and smooth, a new data set denoted the Extended Complex Scene Saliency Dataset (ECSSD) was proposed recently in [77]. It contains 1000 images with diversified patterns in both foreground and background, where many semantically meaningful but structurally complex images are available. Binary masks for the salient objects are produced by 5 subjects.

DUT-OMRON.⁵ Similarly, this dataset is also introduced to evaluate salient object detection algorithms on images with more than a single salient object and relatively complex backgrounds. It contains 5,168 high-quality natural images, where each image is resized to have a maximum side length of 400 pixels. Annotations are available in the form of both bounding boxes and pixel-wise binary object masks. Furthermore, eye fixation annotations are also provided, making this dataset suitable for simultaneously evaluating salient object localization and detection models as well as fixation prediction models.

We randomly sample 3000 images from the MSRA-B data set to train our model. Five-fold cross validation is run to select the parameters. The remaining 2000 images are used for testing. Rather than training a model for each data set, we use the model trained on the MSRA-B data set and test it on the others. This helps test the adaptability of a model trained on one data set to other, different data sets, and avoids overfitting the model to a specific one.

1. http://research.microsoft.com/en-us/um/people/jiansun/
2. http://chenlab.ece.cornell.edu/projects/touch-coseg
3. http://www.wisdom.weizmann.ac.il/~vision/Seg Evaluation DB/
4. http://www.cse.cuhk.edu.hk/leojia/projects/hsaliency
5. http://ice.dlut.edu.cn/lu/dut-omron/homepage.htm

6.2 Evaluation Metrics

We evaluate the performance using the measures used in [23], based on the overlapping area between the ground-truth annotation and the saliency prediction: the PR (precision-recall) curve, the ROC (receiver operating characteristic) curve, and the AUC (Area Under ROC Curve) score. Precision corresponds to the percentage of salient pixels correctly assigned, and recall is the fraction of detected salient pixels belonging to the salient object in the ground truth.

For a grayscale saliency map, whose pixel values are in the range [0, 255], we vary the threshold from 0 to 255 to obtain a series of salient object segmentations. The PR curve is created by computing the precision and recall values at each threshold. The ROC curve can also be generated, based on the true positive rates and false positive rates obtained during the calculation of the PR curve.
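This fixed-threshold evaluation can be sketched as follows (a simple reference implementation, not the benchmark code of [23]):

```python
import numpy as np

def pr_and_roc(saliency, gt):
    """Binarize a [0, 255] saliency map at every threshold and
    accumulate precision, recall, TPR, and FPR; the AUC is the area
    under the (FPR, TPR) curve."""
    sal = saliency.ravel().astype(np.uint8)
    pos = gt.ravel().astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    prec, rec, tpr, fpr = [], [], [], []
    for t in range(256):
        pred = sal >= t
        tp = (pred & pos).sum()
        fp = (pred & ~pos).sum()
        prec.append(tp / max(pred.sum(), 1))
        rec.append(tp / max(n_pos, 1))
        tpr.append(tp / max(n_pos, 1))
        fpr.append(fp / max(n_neg, 1))
    auc = -np.trapz(tpr, fpr)   # fpr decreases with t, so flip the sign
    return prec, rec, fpr, tpr, auc
```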

6.3 Parameters Analysis

In this section, we empirically analyze the performance of salient object detection against the settings of the parameters during both the training and testing phases. Since we want to test the cross-data generalization ability of our approach, we run five-fold cross-validation on the training set. The settings of the parameters are thus blind to the other testing data, and fair comparisons with other approaches can be conducted. Average AUC scores resulting from cross-validation under different parameter settings are plotted in Fig. 5.

Fig. 5. Empirical analysis of parameters in terms of AUC scores, based on five-fold cross-validation of the training set. From left to right: (a) AUC scores versus the number of segmentations used to generate training samples, (b)(c) AUC scores versus the number of decision trees and the number of randomly chosen features at each node in the Random Forest saliency regressor, and (d) the number of segmentations used in generating saliency maps. (The plots themselves are omitted here.)

Training parameters analysis. There are three parameters during training: the number of segmentations Mt used to generate training samples, and the number of trees T and the number of randomly chosen features m used when training the Random Forest regressor.

A larger number of segmentations leads to a larger amount of training data. As a classifier usually benefits from a greater quantity of training samples, we can observe from Fig. 5(a) that the performance steadily increases as Mt becomes larger. We finally set Mt = 48 to generate around 1.7 million samples to train our regional Random Forest saliency regressor.

As shown in Fig. 5(b), the performance of our approach is higher with more trees in the Random Forest saliency regressor. The more trees there are, the smaller the variance among the decision trees, and thus the better the achievable performance. Though the performance keeps increasing as more trees are adopted, we choose T = 200 trees to train the regressor to balance efficiency and effectiveness.

When splitting each node during the construction of a decision tree, only m randomly chosen features can be observed. Intuitively, on one hand, increasing m gives the node a greater chance to select more discriminative features. On the other hand, a larger m brings smaller variance between decision trees. For instance, suppose m is set to the dimension of the feature vector, i.e., all of the features can be seen during the splitting; then most likely the same most discriminative feature will be chosen at each split node. Consequently, nearly identical decision trees are built that cannot complement each other well, which may result in inferior performance. According to Fig. 5(c), we empirically set m = 15, where the performance is best.
Testing parameters analysis. One can see in Fig. 5(d) that the AUC scores of the saliency maps increase when more levels of segmentation are adopted. The reason is that, in more levels of segmentation, there may exist confident regions that cover most of (or even the entire) object. However, a larger number of segmentations introduces more computational burden. Therefore, to balance efficiency and effectiveness, we set M to 15 segmentations in our experiments.

Fig. 6. The most important 60 regional saliency features given by the Random Forest regressor, occupying around 90% of the energy of all features. There are 5 contrast features, 20 backgroundness features, and 35 property features. See Fig. 2 and Fig. 3 for the descriptions of the features. (The bar plot is omitted here; the ranking begins with b12, b4, b8, p5, p3, p12, ...)

6.4 Feature Importance

Our approach uses a wide variety of features. In this section, we empirically analyze the usefulness of these regional saliency features. Fig. 6 shows the rank of the most important 60 regional features produced during the training of the Random Forest regressor, which occupy around 90% of the energy of all features.

The feature rank indicates that the property descriptor is the most critical one in our feature set (it occupies 35 of the top 60 features). The reason might be that salient objects share some common properties, validating our motivation to exploit the generic regional properties for salient object detection that are widely studied in other tasks such as image classification [26]. For example, the high rank of the geometric features p5, p3, and p6 might correspond to the compositional bias of salient objects. The importance of the variance features p12 and p26, on the other hand, might be related to the background properties. Among the contrast-based descriptors, the regional contrast descriptor is the least important, since it might be affected by cluttered scenes and is less important compared with the regional backgroundness descriptor, which is in some sense more robust. Moreover, we also observe that color features are much more discriminative than texture features.

To further validate the importance of the features across different data sets, we train classifiers with each kind of feature descriptor removed and evaluate on each benchmark data set (the testing set of MSRA-B). AUC scores of the resulting saliency maps are demonstrated in Fig. 7. As can be seen, removing some feature descriptors does not necessarily lead to a performance decrease. Consistent with the feature rank given by the Random Forest regressor, the regional contrast descriptor is the least important one on most of the benchmark data sets, as the smallest performance drop is observed upon its removal on most of the data sets. The regional property descriptor still plays the most important role on MSRA-B, ECSSD, and DUT-OMRON. Since there are multiple salient objects per image in SED2 and iCoSeg, the common properties learned from the training data, where only a single salient object exists in most of the images, may not perform the best. The backgroundness descriptor performs well on MSRA-B, SED2, and iCoSeg. However, it plays the least important role on the ECSSD data set; its removal even leads to better performance, which indicates the pseudo-background assumption might not always hold well. Finally, instead of considering all 93 features, we also adopt only the top 60 features for training. Surprisingly, this feature vector performs as well as the entire feature descriptor, and even slightly better on DUT-OMRON, implying that some features contribute little.

We visualize the most important features of each descriptor in Fig. 4. As we can see, even the most powerful backgroundness feature provides far less accurate information about salient objects; by integrating all of this weak information, much better saliency maps can be achieved (note that the multi-level fusion enhancement is not adopted there). Another advantage of our approach is the automatic fusion of features. For example, the rules for employing geometric features are discovered from the training data instead of being heuristically defined as in previous approaches [32], which might generalize poorly.
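The paper estimates feature importance inside its own Random Forest training (details are in its supplementary material). As a rough stand-in, scikit-learn's impurity-based importances with the paper's reported settings produce the same kind of ranking:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Not the authors' exact estimator: a substitute using scikit-learn
# with the reported settings (T = 200 trees, m = 15 features per
# split, leaves created once fewer than 5 samples remain).
def rank_features(X, a, names):
    """X: (Q, 93) regional descriptors; a: (Q,) saliency labels."""
    rf = RandomForestRegressor(n_estimators=200, max_features=15,
                               min_samples_split=5).fit(X, a)
    order = np.argsort(rf.feature_importances_)[::-1]
    return [(names[i], rf.feature_importances_[i]) for i in order[:60]]
```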
Fig. 7. Feature importance across different data sets. For each data set, we report the AUC scores of saliency maps obtained by removing each kind of descriptor to see the performance drop. Additionally, we also demonstrate the performance exploiting only the top 60 features shown in Fig. 6. (The bar plot, with groups No Backgroundness / No Contrast / No Property / Top 60 / Full, is omitted here.)

Fig. 8. AUC: area under the ROC curve (larger is better). (In the original, the best three results per column are highlighted with red, green, and blue fonts, respectively.)

  Method   MSRA-B   iCoSeg   ECSSD   DUT-OMRON   SED2    DUT-OMRON*
  SVO      0.899    0.861    0.799   0.866       0.834   0.793
  CA       0.860    0.837    0.738   0.815       0.854   0.760
  CB       0.930    0.852    0.819   0.831       0.825   0.624
  RC       0.937    0.880    0.833   0.859       0.840   0.679
  SF       0.917    0.911    0.777   0.803       0.872   0.715
  LRK      0.925    0.908    0.810   0.859       0.881   0.758
  HS       0.930    0.882    0.829   0.860       0.820   0.735
  GMR      0.942    0.902    0.834   0.853       0.831   0.646
  PCA      0.938    0.895    0.817   0.887       0.903   0.776
  MC       0.951    0.898    0.849   0.887       0.863   0.715
  DSR      0.956    0.921    0.856   0.899       0.895   0.776
  RBD      0.945    0.941    0.840   0.894       0.873   0.779
  DRFIs    0.954    0.944    0.858   0.910       0.902   0.804
  DRFI     0.971    0.968    0.875   0.931       0.933   0.822
the performance of our DRFI approach with a single
layer (DRFIs).
may not perform the best. The backgroundness de-
scriptor performs well on MSRA-B, SED2, and iCoSeg. Quantitative comparison.. Quantitative comparisons
However, it plays the least important role on ECSSD are shown in Fig. 8, Fig. 9 and Fig. 10. As can be seen,
data set. Its removal even leads to better performance, our approach (DRFI) consistently outperforms others
which indicates the pseudo-background assumption on all benchmark data sets with large margins in
might not always hold well. Finally, instead of con- terms of AUC scores, PR and ROC curves. In specific,
sidering all of the 93 features, we also adopt only the it improves by 1.57%, 2.66%, 2.34%, 3.45% and 3.21%
top 60 features for training. Surprisingly, this feature over the best-performing state-of-the-art algorithm
vector performs as well as the entire feature descrip- according to the AUC scores on MSRA-B, iCoSeg,
tor, even slightly better on DUT-OMRON, implying ECSSD, DUT-OMRON, and SED2, respectively.
some features contribute little. Our single-level version (DRFIs) performs best on
We visualize the most important features of each iCoSeg, ECSSD, and DUT-OMRON as well. It im-
descriptor in Fig. 4. As we can see, even the most proves by 0.32%, 0.23%, and 1.22% over the best-
powerful backgroundness feature provide far less ac- performing state-of-the-art method in terms of AUC
curate information of salient objects. By integrating scores on these three data sets, respectively. It is
all of the weak information, much better saliency slightly worse (but still one of the top 3 best models)
maps can be achieved. Note that we do not adopt on MSRA-B and SED2 data set. Such improvement
the multi-level fusion enhancement. Another advan- is substantial by considering the already high per-
tages of our approach is the automatic fusion of formance of state-of-the-art algorithms. More impor-
features. For example, the rules to employ geometric tantly, though the Random Forest regressor is trained
features are discovered from the training data instead on MSRA-B, it performs best on other challenging
of heuristically defined as previous approaches [32], data sets like ECSSD and DUT-OMRON.
which might be poor to generalize. With the multi-level enhancement, performance of
our approach can be further improved. For instance,
6.5 Performance Comparison it improves by 1.22% on MSRA-B and 1.78% on DUT-
We report both quantitative and qualitative compar- OMRON.
isons of our approach with state-of-the-art methods. Qualitative comparison. We also provide the qualita-
To save the space, we only consider the top four tive comparisons of different methods in Fig. 11. As
models ranked in the survey [23]: SVO [51], CA [17], can be seen, our approach (shown in Fig. 11 (n)(o))
CB [32], and RC [15] and recently-developed methods: can deal well with the challenging cases where the
SF [21], LRK [78], HS [33], GMR [48], PCA [31], background is cluttered. For example, in the first
MC [50], DSR [49], RBD [55] that are not covered two rows, other approaches may be distracted by the
in [23]. Note that we compare our approach with the textures on the background while our method almost
extended version of RC. In total, we make compar- successfully highlights the entire salient object. It is
isons with 12 approaches. Additionally, we also report also worth pointing out that our approach performs
Fig. 9. Quantitative comparisons of the saliency maps produced by different approaches on the different data sets in terms of PR curves (plots for MSRA-B, iCoSeg, ECSSD, DUT-OMRON, SED2, and DUT-OMRON* are omitted here). See the supplemental materials for more evaluations.

Fig. 10. Quantitative comparisons of the saliency maps produced by different approaches on the different data sets in terms of ROC curves (plots omitted here). See the supplemental materials for more evaluations.

6.6 Robustness Analysis

As suggested by Fig. 6 and Fig. 7, the regional backgroundness and property descriptors, especially the geometric properties, play important roles in our approach. In natural images, the pseudo-background assumption may not hold well. Additionally, the distribution of salient objects may differ from that of our training set. It is therefore natural to ask whether our approach can still perform well in such challenging cases.
11

(a) input (b) SVO (c) CA (d) CB (e) RC (f) SF (g) LRK (h) HS (i) GMR (j) PCA (k) MC (l) DSR (m) RBD (n) DRFIs (o) DRFI
Fig. 11. Visual comparison of the saliency maps. Our method (DRFI) consistently generates better saliency
maps.

our approach can still perform well on these chal- OMRON data set) according to the AUC scores.
lenging cases. To this end, we select 635 images from
DUT-OMRON dataset (we call it DUT-OMRON*), 6.7 Efficiency
where salient objects touch the image border and
are far from the image center. Quantitative compar- Since the computation on each level of the multiple
isons with state-of-the-art approaches are presented segmentations is independent, we utilize the multi-
in Fig. 8, Fig. 9, and Fig. 10. Check our project website thread technique to accelerate our C++ code. Fig. 12
jianghz.com/drfi and the supplementary material for summarizes the running time of different approaches,
more details. tested on the MSRA-B data set with a typical 400×300
image using a PC with an Intel i5 CPU of 2.50GHz and
Not surprisingly, performances of all approaches 8GB memory. 8 threads are utilized for acceleration.
decline. But our approach DRFI still significantly out- As we can see, our approach can run as fast as most
performs other methods in terms of PR curve, ROC existing approaches. If equipped as a pre-processing
curve and AUC scores. Even with a single level, step for an application, e.g., picture collage, our ap-
our approach DRFIs performs slightly better than proach will not harm the user experiences.
others (ranked as the second best in terms of AUC For training, it takes around 24h with around 1.7
scores). In specific, DRFIs and DRFI are better than the million training samples. As training each decision
top-performing method by around 2.22% and 1.36% tree is also independent to each other, parallel com-
(compared with 2.20% and 1.26% on the whole DUT- puting techniques can also be utilized for acceleration.
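Because the per-level computation is independent, the scheme parallelizes naturally. Below is a minimal Python sketch of this idea (the paper's implementation is multi-threaded C++); compute_level_saliency, the per-level fusion weights, and the linear fusion itself are hypothetical stand-ins for the paper's components:

    from concurrent.futures import ThreadPoolExecutor

    def multi_level_saliency(image, segmentations, compute_level_saliency,
                             weights, n_threads=8):
        # Score each segmentation level independently in a thread pool,
        # then fuse the per-level maps linearly.
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            maps = list(pool.map(lambda s: compute_level_saliency(image, s),
                                 segmentations))
        return sum(w * m for w, m in zip(weights, maps))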

Method    SVO   CA    CB    RC     SF     LRK   HS
Time (s)  56.5  52.3  1.40  0.138  0.210  11.5  0.365
Code      M+C   M+C   M+C   C      C      M+C   EXE

Method    GMR   PCA   MC     DSR   RBD    DRFIs  DRFI*
Time (s)  1.16  2.07  0.129  4.19  0.267  0.183  0.418
Code      M+C   M+C   M+C    M+C   M      C      C

Fig. 12. Comparison of running time. M indicates the code is written in MATLAB and EXE corresponds to an executable. (*8 threads are used for acceleration.)

7 DISCUSSIONS AND FUTURE WORK

7.1 Unsupervised vs Supervised

As data-driven approaches, especially supervised learning methods, dominate other fields of computer vision, it is somewhat surprising that the potential of supervised salient object detection is relatively underexploited. The main research efforts in salient object detection still concentrate on developing heuristic rules to combine saliency features. Note that we are not saying that heuristic models are useless; we instead favor supervised approaches for the following two advantages. On one hand, supervised learning approaches can automatically fuse different kinds of saliency features, which is valuable especially when facing high-dimensional feature vectors. It is nearly infeasible for humans, even domain experts, to design rules to integrate the 93-dimensional feature vector of this paper. For example, the sixth most important feature, p12 (the variance of the L* values of a region), seems rather obscure for salient object detection. Yet the integration rules discovered from the training samples indicate that it is highly discriminative, more so than the traditional regional contrast features.

On the other hand, data-driven approaches generally have much better generalization ability than heuristic methods. In a recent survey of salient object detection [23], none of the existing unsupervised algorithms consistently outperforms the others on all benchmark data sets, since different pre-defined heuristic rules favor different settings of the data (e.g., the number of objects, center bias, etc.). Leveraging a large number of training samples (nearly two million), our learned regional saliency regressor performs nearly the best over all six benchmark data sets. Even though it is trained on a single data set, it performs better than the others on challenging cases which are significantly different from the training set.

One potential reason that learning-based approaches are not favored for salient object detection might be efficiency. Though a lot of time is required to train a classifier, testing time is more likely to be the major concern of a system than the offline training time. Once trained, the classifier can be used as an off-the-shelf tool. In this paper, we demonstrate that a learning-based approach can run as fast as some heuristic methods.

7.2 Limitations of Our Approach

Since our approach mainly considers the regional contrast and backgroundness features, it may fail on cluttered scenes; see Fig. 13 for an illustration. In the first column, high saliency values are assigned to the textured areas in the background, as they are distinct in terms of either contrast or background features. The salient object in the second column has a similar color to the background and occupies a large portion of the image, making it challenging to generate a good detection result. For the third column, it is not fair to say that our approach completely fails: the flag is indeed salient. However, as the statue violates the pseudo-background assumption and also occupies a large portion of the image, it is difficult to generate an appealing saliency map with our approach.

Fig. 13. Failure cases of our approach.

7.3 Conclusion and Future Work

In this paper, we address the salient object detection problem using a discriminative regional feature integration approach. The success of our approach stems from the utilization of a supervised learning algorithm: we learn a Random Forest regressor that automatically integrates a high-dimensional regional feature descriptor to predict the saliency score, and that automatically discovers the most discriminative features. Experimental results validate that our approach outperforms traditional approaches, which heuristically compute saliency maps from different types of features.
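As a rough guide to reproducing the integration step, here is a minimal sketch with scikit-learn's RandomForestRegressor standing in for the paper's own Random Forest implementation; the 93-dimensional regional descriptors X, the saliency targets y, and the tree count are assumptions for illustration, not the paper's settings:

    from sklearn.ensemble import RandomForestRegressor

    def train_saliency_regressor(X, y, n_trees=200):
        # X: (num_regions, 93) regional feature descriptors;
        # y: ground-truth saliency scores per region.
        # Trees train independently, so n_jobs=-1 parallelizes the fit.
        model = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1)
        model.fit(X, y)
        return model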
Our approach is closely related to the image labeling method [26]: the goal in both is to assign predefined labels (geometric categories in [26]; object or background in salient object detection) to the pixels. Further study is needed to investigate the connection between image labeling and salient object detection, and whether the two problems are essentially equivalent. Utilizing data-driven image labeling approaches for salient object detection is also worth exploring in the future. Additionally, there exist some obvious directions in which to further improve our approach.
• Incorporating more saliency features. In this paper, we consider only contrast, backgroundness, and generic property features of a region. By considering more saliency features, better detection results can be expected. For example, the background connectivity prior [55] can be incorporated to relax the pseudo-background assumption. Additionally, the spatial distribution prior [19], [21], the focusness prior [52], the diverse density score [53] based on generic objectness, and the graph-based manifold ranking score [48] can also be integrated.
• Better fusion strategy. We simply investigate the linear fusion of saliency maps without any post-optimization step. As future work, we can utilize the optimization steps of other approaches to enhance the performance. For example, we can run saliency detection on hierarchical detections and fuse them as suggested in [77]. The optimization method proposed in [55] is also applicable.
• Integrating more cues. A recent trend in salient object detection is to integrate more cues in addition to traditional RGB data. Our approach naturally extends to cues such as depth for RGB-D input, temporal consistency for video sequences, and saliency co-occurrence for co-salient object detection.

ACKNOWLEDGEMENTS

This work was supported in part by the National Basic Research Program of China under Grant No. 2015CB351703 and 2012CB316400, and the National Natural Science Foundation of China under Grant No. 91120006.

REFERENCES

[1] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in IEEE CVPR, 2013, pp. 2083–2090.
[2] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE TPAMI, 1998.
[3] Y.-F. Ma and H.-J. Zhang, "Contrast-based image attention analysis by using fuzzy growing," in ACM Multimedia, 2003.
[4] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," in CVPR, 2007, pp. 1–8.
[5] C. Kanan and G. W. Cottrell, "Robust classification of objects, faces, and flowers using natural image statistics," in CVPR, 2010, pp. 2472–2479.
[6] D. Walther and C. Koch, "Modeling attention to salient proto-objects," Neural Networks, vol. 19, no. 9, pp. 1395–1407, 2006.
[7] L. Itti, "Automatic foveation for video compression using a neurobiological model of visual attention," IEEE TIP, 2004.
[8] L. Marchesotti, C. Cifarelli, and G. Csurka, "A framework for visual saliency detection with applications to image thumbnailing," in ICCV, 2009, pp. 2232–2239.
[9] S. Goferman, A. Tal, and L. Zelnik-Manor, "Puzzle-like collage," Comput. Graph. Forum, vol. 29, no. 2, pp. 459–468, 2010.
[10] J. Wang, L. Quan, J. Sun, X. Tang, and H.-Y. Shum, "Picture collage," in CVPR (1), 2006, pp. 347–354.
[11] P. Wang, D. Zhang, J. Wang, Z. Wu, X.-S. Hua, and S. Li, "Color filter for image search," in ACM Multimedia, 2012.
[12] P. Wang, D. Zhang, G. Zeng, and J. Wang, "Contextual dominant color name extraction for web image search," in ICME Workshops, 2012, pp. 319–324.
[13] R. Achanta, S. S. Hemami, F. J. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in CVPR, 2009.
[14] A. Borji and L. Itti, "Exploiting local and global patch rarities for saliency detection," in CVPR, 2012, pp. 478–485.
[15] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE TPAMI, 2014.
[16] D. Gao, V. Mahadevan, and N. Vasconcelos, "The discriminant center-surround hypothesis for bottom-up saliency," in NIPS, 2007.
[17] S. Goferman, L. Zelnik-Manor, and A. Tal, "Context-aware saliency detection," in CVPR, 2010, pp. 2376–2383.
[18] D. A. Klein and S. Frintrop, "Center-surround divergence of feature statistics for salient object detection," in ICCV, 2011.
[19] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, "Learning to detect a salient object," IEEE TPAMI, vol. 33, no. 2, pp. 353–367, 2011.
[20] Y. Lu, W. Zhang, H. Lu, and X. Xue, "Salient object detection using concavity context," in ICCV, 2011, pp. 233–240.
[21] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in CVPR, 2012, pp. 733–740.
[22] A. Treisman and G. Gelade, "A feature-integration theory of attention," Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[23] A. Borji, D. N. Sihite, and L. Itti, "Salient object detection: A benchmark," in ECCV (2), 2012, pp. 414–429.
[24] P. Khuwuthyakorn, A. Robles-Kelly, and J. Zhou, "Object of interest detection by saliency learning," in ECCV, 2010.
[25] P. Mehrani and O. Veksler, "Saliency segmentation based on learning and graph cut refinement," in BMVC, 2010.
[26] D. Hoiem, A. A. Efros, and M. Hebert, "Recovering surface layout from an image," IJCV, vol. 75, no. 1, pp. 151–172, 2007.
[27] A. Borji and L. Itti, "State-of-the-art in visual attention modeling," IEEE TPAMI, 2013.
[28] F. Liu and M. Gleicher, "Region enhanced scale-invariant saliency detection," in ICME, 2006, pp. 1477–1480.
[29] D. Gao and N. Vasconcelos, "Bottom-up saliency is a discriminant process," in ICCV, 2007, pp. 1–6.
[30] X. Li, Y. Li, C. Shen, A. R. Dick, and A. van den Hengel, "Contextual hypergraph modeling for salient object detection," in ICCV, 2013, pp. 3328–3335.
[31] R. Margolin, A. Tal, and L. Zelnik-Manor, "What makes a patch distinct?" in CVPR, 2013.
[32] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, "Automatic salient object segmentation based on context and shape prior," in BMVC, 2011.
[33] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in CVPR, 2013.
[34] C. Scharfenberger, A. Wong, K. Fergani, J. S. Zelek, and D. A. Clausi, "Statistical textural distinctiveness for salient region detection in natural images," in CVPR, 2013, pp. 979–986.
[35] K. Shi, K. Wang, J. Lu, and L. Lin, "PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with spatial priors," in CVPR, 2013, pp. 2115–2122.
[36] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, "Efficient salient region detection with soft image abstraction," in ICCV, 2013, pp. 1529–1536.
[37] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in CVPR, 2012.
[38] W. Zou, K. Kpalma, Z. Liu, and J. Ronsin, "Segmentation driven low-rank matrix recovery for saliency detection," in BMVC, 2013, pp. 1–13.
[39] H. Peng, B. Li, R. Ji, W. Hu, W. Xiong, and C. Lang, "Salient object detection via low-rank and structured sparse matrix decomposition," in AAAI, 2013.
[40] Z. Jiang and L. S. Davis, "Submodular salient region detection," in CVPR, 2013, pp. 2043–2050.
[41] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, "Segmenting salient objects from images and videos," in ECCV (5), 2010, pp. 366–379.
[42] Y. Xie, H. Lu, and M.-H. Yang, "Bayesian saliency via low and mid level cues," IEEE TIP, vol. 22, no. 5, pp. 1689–1698, 2013.
[43] R. Liu, J. Cao, G. Zhong, Z. Lin, S. Shan, and Z. Su, "Adaptive partial differential equation learning for visual saliency detection," in CVPR, 2014.
[44] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, "Salient object detection for searched web images via global saliency," in CVPR, 2012, pp. 3194–3201.
[45] S. Vicente, V. Kolmogorov, and C. Rother, "Graph cut based image segmentation with connectivity priors," in CVPR, 2008.
[46] L. Wang, J. Xue, N. Zheng, and G. Hua, "Automatic salient object extraction with contextual cue," in ICCV, 2011.
[47] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in ECCV (3), 2012, pp. 29–42.
[48] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in CVPR, 2013.
[49] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency detection via dense and sparse reconstruction," in ICCV, 2013.
[50] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, "Saliency detection via absorbing Markov chain," in ICCV, 2013.
[51] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, "Fusing generic objectness and visual saliency for salient object detection," in ICCV, 2011, pp. 914–921.
[52] P. Jiang, H. Ling, J. Yu, and J. Peng, "Salient region detection by UFO: Uniqueness, focusness and objectness," in ICCV, 2013.
[53] Y. Jia and M. Han, "Category-independent object-level saliency detection," in ICCV, 2013.
[54] J. Zhang and S. Sclaroff, "Saliency detection: A Boolean map approach," in ICCV, 2013, pp. 153–160.
[55] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in CVPR, 2014.
[56] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. A. Rowley, "Image saliency: From intrinsic to extrinsic context," in CVPR, 2011, pp. 417–424.
[57] Y. Niu, Y. Geng, X. Li, and F. Liu, "Leveraging stereopsis for saliency analysis," in CVPR, 2012, pp. 454–461.
[58] K. Desingh, K. M. Krishna, D. Rajan, and C. Jawahar, "Depth really matters: Improving visual salient region detection with depth," in BMVC, 2013.
[59] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, "Saliency detection on light fields," in CVPR, 2014.
[60] S. Lu, V. Mahadevan, and N. Vasconcelos, "Learning optimal seeds for diffusion-based salient object detection," in CVPR, 2014.
[61] J. Kim, D. Han, Y.-W. Tai, and J. Kim, "Salient region detection via high-dimensional color transform," in CVPR, 2014.
[62] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, "The secrets of salient object segmentation," in CVPR, 2014.
[63] J. Carreira and C. Sminchisescu, "Constrained parametric min-cuts for automatic object segmentation," in CVPR, 2010, pp. 3241–3248.
[64] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, "Salient object detection for searched web images via global saliency," in CVPR, 2012, pp. 3194–3201.
[65] F. Moosmann, D. Larlus, and F. Jurie, "Learning saliency maps for object categorization," in ECCV Workshops, 2006.
[66] T. Judd, K. A. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in ICCV, 2009, pp. 2106–2113.
[67] Y. Lu, W. Zhang, C. Jin, and X. Xue, "Learning attention map from images," in CVPR, 2012, pp. 1067–1074.
[68] E. P. Simoncelli and W. T. Freeman, "The steerable pyramid: A flexible architecture for multi-scale derivative computation," in ICIP (3), 1995, pp. 444–447.
[69] B. Fernando, É. Fromont, D. Muselet, and M. Sebban, "Discriminative feature fusion for image classification," in CVPR, 2012, pp. 3434–3441.
[70] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," IJCV, vol. 59, no. 2, 2004.
[71] M. Heikkilä, M. Pietikäinen, and C. Schmid, "Description of interest regions with local binary patterns," Pattern Recognition, vol. 42, no. 3, pp. 425–436, 2009.
[72] T. K. Leung and J. Malik, "Representing and recognizing the visual appearance of materials using three-dimensional textons," IJCV, vol. 43, no. 1, pp. 29–44, 2001.
[73] D. Hoiem, A. A. Efros, and M. Hebert, "Geometric context from a single image," in ICCV, 2005, pp. 654–661.
[74] H. Jiang, Y. Wu, and Z. Yuan, "Probabilistic salient object contour detection based on superpixels," in ICIP, 2013, pp. 3069–3072.
[75] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, "iCoseg: Interactive co-segmentation with intelligent scribble guidance," in IEEE CVPR, 2010, pp. 3169–3176.
[76] S. Alpert, M. Galun, R. Basri, and A. Brandt, "Image segmentation by probabilistic bottom-up aggregation and cue integration," in CVPR, 2007.
[77] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in CVPR, 2013, pp. 1155–1162.
[78] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in CVPR, 2012.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description
Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama,
Kate Saenko, Trevor Darrell

Abstract—
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which
are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional
architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these
models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual
representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in
that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are
incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length
inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be
optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network
models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such
models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.

• J. Donahue, L. A. Hendricks, M. Rohrbach, S. Guadarrama, and T. Darrell are with the Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA.
• M. Rohrbach and T. Darrell are additionally affiliated with the International Computer Science Institute, Berkeley, CA.
• S. Venugopalan is with the Department of Computer Science, UT Austin, Austin, TX.
• K. Saenko is with the Department of Computer Science, UMass Lowell, Lowell, MA.
Manuscript received November 30, 2015.

1 INTRODUCTION

Recognition and description of images and videos is a fundamental challenge of computer vision. Dramatic progress has been achieved by supervised convolutional neural network (CNN) models on image recognition tasks, and a number of extensions to process video have been recently proposed. Ideally, a video model should allow processing of variable length input sequences, and also provide for variable length outputs, including generation of full-length sentence descriptions that go beyond conventional one-versus-all prediction tasks. In this paper we propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures for visual recognition and description which combines convolutional layers and long-range temporal recursion and is end-to-end trainable (Figure 1). We instantiate our architecture for specific video activity recognition, image caption generation, and video description tasks as described below.

Fig. 1. We propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems, and the growing desire to apply such models to time-varying inputs and outputs. LRCN processes the (possibly) variable-length visual input (left) with a CNN (middle-left), whose outputs are fed into a stack of recurrent sequence models (LSTMs, middle-right), which finally produce a variable-length prediction (right). Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.

Research on CNN models for video processing has considered learning 3D spatio-temporal filters over raw sequence data [1], [2], and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments [3], [4]. Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling. Following the same inspiration that motivates current deep convolutional models, we advocate for video recognition and description models which are also deep over temporal dimensions; i.e., have temporal recurrence of latent variables. Recurrent Neural Network (RNN) models are "deep in time" – explicitly so when unrolled – and form implicit compositional representations
in the time domain. Such "deep" models predated deep spatial convolution models in the literature [5], [6].

The use of RNNs in perceptual applications has been explored for many decades, with varying results. A significant limitation of simple RNN models which strictly integrate state information over time is known as the "vanishing gradient" effect: the ability to backpropagate an error signal through a long-range temporal interval becomes increasingly difficult in practice. Long Short-Term Memory (LSTM) units, first proposed in [7], are recurrent modules which enable long-range learning. LSTM units have hidden state augmented with nonlinear mechanisms to allow state to propagate without modification, be updated, or be reset, using simple learned gating functions. LSTMs have recently been demonstrated to be capable of large-scale learning of speech recognition [8] and language translation models [9], [10].

Fig. 2. A diagram of a basic RNN cell (left) and an LSTM memory cell (right) used in this paper (from [13], a slight simplification of the architecture described in [14], which was derived from the LSTM initially proposed in [7]).

We show here that convolutional networks with recurrent units are generally applicable to visual time-series modeling, and argue that in visual tasks where static or flat temporal models have previously been employed, LSTM-style RNNs can provide significant improvement when ample training data are available to learn or refine the representation. Specifically, we show that LSTM type models provide for improved recognition on conventional video activity challenges and enable a novel end-to-end optimizable mapping from image pixels to sentence-level natural language descriptions. We also show that these models improve generation of descriptions from intermediate visual representations derived from conventional visual models.

We instantiate our proposed architecture in three experimental settings (Figure 3). First, we show that by directly connecting a visual convolutional model to deep LSTM networks, we are able to train video recognition models that capture temporal state dependencies (Figure 3 left; Section 4). While existing labeled video activity datasets may not have actions or activities with particularly complex temporal dynamics, we nonetheless observe significant improvements on conventional benchmarks.

Second, we explore end-to-end trainable image to sentence mappings. Strong results for machine translation tasks have recently been reported [9], [10]; such models are encoder-decoder pairs based on LSTM networks. We propose a multimodal analog of this model, and describe an architecture which uses a visual convnet to encode a deep state vector, and an LSTM to decode the vector into a natural language string (Figure 3 middle; Section 5). The resulting model can be trained end-to-end on large-scale image and text datasets, and even with modest training provides competitive generation results compared to existing methods.

Finally, we show that LSTM decoders can be driven directly from conventional computer vision methods which predict higher-level discriminative labels, such as the semantic video role tuple predictors in [11] (Figure 3, right; Section 6). While not end-to-end trainable, such models offer architectural and performance advantages over previous statistical machine translation-based approaches.

We have realized a generic framework for recurrent models in the widely adopted deep learning framework Caffe [12], including ready-to-use implementations of RNN and LSTM units. (See http://jeffdonahue.com/lrcn/.)

2 BACKGROUND: RECURRENT NETWORKS

Traditional recurrent neural networks (RNNs, Figure 2, left) model temporal dynamics by mapping input sequences to hidden states, and hidden states to outputs, via the following recurrence equations:

    h_t = g(W_xh x_t + W_hh h_{t-1} + b_h)
    z_t = g(W_hz h_t + b_z)

where g is an element-wise non-linearity, such as a sigmoid or hyperbolic tangent, x_t is the input, h_t ∈ R^N is the hidden state with N hidden units, and z_t is the output at time t. For a length T input sequence ⟨x_1, x_2, ..., x_T⟩, the updates above are computed sequentially as h_1 (letting h_0 = 0), z_1, h_2, z_2, ..., h_T, z_T.

Though RNNs have proven successful on tasks such as speech recognition [15] and text generation [16], it can be difficult to train them to learn long-term dynamics, likely due in part to the vanishing and exploding gradients problem [7] that can result from propagating the gradients down through the many layers of the recurrent network, each corresponding to a particular time step. LSTMs provide a solution by incorporating memory units that explicitly allow the network to learn when to "forget" previous hidden states and when to update hidden states given new information. As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in [13] (Figure 2, right), a slight simplification of the one described in [8], which was derived from the original LSTM unit proposed in [7]. Letting σ(x) = (1 + e^{-x})^{-1} be the sigmoid non-linearity which squashes real-valued inputs to a [0, 1] range, and letting tanh(x) = (e^x − e^{-x}) / (e^x + e^{-x}) = 2σ(2x) − 1 be the hyperbolic tangent non-linearity, similarly squashing its inputs to a [−1, 1] range, the LSTM updates for time step t given inputs x_t, h_{t-1}, and c_{t-1} are:

    i_t = σ(W_xi x_t + W_hi h_{t-1} + b_i)
    f_t = σ(W_xf x_t + W_hf h_{t-1} + b_f)
    o_t = σ(W_xo x_t + W_ho h_{t-1} + b_o)
    g_t = tanh(W_xc x_t + W_hc h_{t-1} + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
    h_t = o_t ⊙ tanh(c_t)

Here x ⊙ y denotes the element-wise product of vectors x and y. In addition to a hidden unit h_t ∈ R^N, the LSTM includes an input gate i_t ∈ R^N, forget gate f_t ∈ R^N, output gate o_t ∈ R^N, input modulation gate g_t ∈ R^N, and memory cell c_t ∈ R^N. The memory cell unit c_t is a sum of two terms: the previous memory cell unit c_{t-1}, which is modulated by f_t, and g_t, a function of the current input and previous hidden state, modulated by the input gate i_t. Because i_t and f_t are sigmoidal, their values lie within the range [0, 1], and i_t and f_t can be thought of as knobs that the LSTM learns to selectively forget its previous memory or consider its current input. Likewise, the output gate o_t learns how much of the memory cell to transfer to the hidden state. These additional cells seem to enable the LSTM to learn complex and long-term temporal dynamics for a wide variety of sequence learning and prediction tasks. Additional depth can be added to LSTMs by stacking them on top of each other, using the hidden state h_t^(ℓ-1) of the LSTM in layer ℓ − 1 as the input to the LSTM in layer ℓ.
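To make the LSTM updates above concrete, here is a minimal NumPy sketch of a single step; the parameter names mirror the equations (Wxi, Whi, bi, ...), the parameters are stored in a plain dict p for brevity, and this is an illustrative re-implementation of the equations, not the paper's Caffe code:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, p):
        # Input, forget, and output gates, plus the input modulation gate.
        i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["bi"])
        f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["bf"])
        o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["bo"])
        g_t = np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
        c_t = f_t * c_prev + i_t * g_t   # gated sum of old memory and new input
        h_t = o_t * np.tanh(c_t)         # expose a gated view of the cell
        return h_t, c_t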
Recently, LSTMs have achieved impressive results on language tasks such as speech recognition [8] and machine translation [9], [10]. Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, [8] eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text translations. [9] and [10] translate sentences from English to French with a multi-layer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows an input sequence of arbitrary length to be mapped to an output sequence of different length. The sequence-to-sequence architecture for machine translation circumvents the need for language models.

The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed length inputs or outputs, allowing simple modeling of sequential data of varying lengths, such as text or video. We next describe a unified framework to combine recurrent models such as LSTMs with deep convolutional networks to form end-to-end trainable networks capable of complex visual and sequence prediction tasks.

3 LONG-TERM RECURRENT CONVOLUTIONAL NETWORK (LRCN) MODEL

This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. LRCN works by passing each visual input x_t (an image in isolation, or a frame from a video) through a feature transformation φV(.) with parameters V, usually a CNN, to produce a fixed-length vector representation φV(x_t). The outputs of φV are then passed into a recurrent sequence learning module.

In its most general form, a recurrent model has parameters W, and maps an input x_t and a previous time step hidden state h_{t-1} to an output z_t and updated hidden state h_t. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: h_1 = f_W(x_1, h_0) = f_W(x_1, 0), then h_2 = f_W(x_2, h_1), etc., up to h_T. Some of our models stack multiple LSTMs atop one another as described in Section 2.

To predict a distribution P(y_t) over outcomes y_t ∈ C (where C is a discrete, finite set of outcomes) at time step t, the outputs z_t ∈ R^{d_z} of the sequential model are passed through a linear prediction layer ŷ_t = W_z z_t + b_z, where W_z ∈ R^{|C|×d_z} and b_z ∈ R^{|C|} are learned parameters. Finally, the predicted distribution P(y_t) is computed by taking the softmax of ŷ_t: P(y_t = c) = softmax(ŷ_t) = exp(ŷ_{t,c}) / Σ_{c′∈C} exp(ŷ_{t,c′}). (A small sketch of this prediction layer follows the task list below.)

The success of recent deep models for object recognition [17], [18], [19] suggests that strategically composing many "layers" of non-linear functions can result in powerful models for perceptual problems. For large T, the above recurrence indicates that the last few predictions from a recurrent network with T time steps are computed by a very "deep" (T layer) non-linear function, suggesting that the resulting recurrent model may have similar representational power to a T layer deep network. Critically, however, the sequence model's weights W are reused at every time step, forcing the model to learn generic time step-to-time step dynamics (as opposed to dynamics conditioned on t, the sequence index) and preventing the parameter size from growing in proportion to the maximum sequence length.

In most of our experiments, the visual feature transformation φ corresponds to the activations in some layer of a deep CNN. Using a visual transformation φV(.) which is time-invariant and independent at each time step has the important advantage of making the expensive convolutional inference and training parallelizable over all time steps of the input, facilitating the use of fast contemporary CNN implementations whose efficiency relies on independent batch processing, and end-to-end optimization of the visual and sequential model parameters V and W.

We consider three vision problems (activity recognition, image description, and video description), each of which instantiates one of the following broad classes of sequential learning tasks:

1) Sequential input, static output (Figure 3, left): ⟨x_1, x_2, ..., x_T⟩ ↦ y. The visual activity recognition problem can fall under this umbrella, with videos of arbitrary length T as input, but with the goal of predicting a single label like running or jumping drawn from a fixed vocabulary.
2) Static input, sequential output (Figure 3, middle): x ↦ ⟨y_1, y_2, ..., y_T⟩. The image captioning problem fits in this category, with a static (non-time-varying) image as input, but a much larger and richer label space consisting of sentences of any length.
3) Sequential input and output (Figure 3, right): ⟨x_1, x_2, ..., x_T⟩ ↦ ⟨y_1, y_2, ..., y_T′⟩. In tasks such as video description, both the visual input and output are time-varying, and in general the number of input and output time steps may differ (i.e., we may have T ≠ T′). In video description, for example, the number of frames in the video should not constrain the length of (number of words in) the natural language description.
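As promised above, a minimal NumPy sketch of the linear prediction layer and softmax that turn the sequence model's output z_t into a distribution over outcomes; the names mirror the text (W_z, b_z), and the max-subtraction is a standard numerical-stability detail of ours, not the paper's:

    import numpy as np

    def predict_distribution(z_t, W_z, b_z):
        # Linear prediction layer y_hat = W_z z_t + b_z, then softmax over C.
        y_hat = W_z @ z_t + b_z
        y_hat = y_hat - y_hat.max()   # stabilize the exponentials (our detail)
        e = np.exp(y_hat)
        return e / e.sum()            # P(y_t = c) for every c in C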

[Figure: three LRCN instantiations. Activity Recognition (sequences in the input): per-frame CNN features feed a stack of LSTMs whose predictions are averaged into one label, e.g., HighJump. Image Captioning (sequences in the output): a single CNN encoding is decoded by LSTMs into a sentence, e.g., <BOS> A man runs <EOS>. Video Description (sequences in the input and output): CRF outputs are decoded by LSTMs, e.g., <BOS> A man jumps high <EOS>.]
Fig. 3. Task-specific instantiations of our LRCN model for activity recognition, image description, and video description.

In the previously described generic formulation of recurrent models, each instance has T inputs ⟨x_1, x_2, ..., x_T⟩ and T outputs ⟨y_1, y_2, ..., y_T⟩. Note that this formulation does not align cleanly with any of the three problem classes described above – in the first two classes, either the input or output is static, and in the third class, the input length T need not match the output length T′. Hence, we describe how we adapt this formulation in our hybrid model to each of the above three problem settings.

With sequential inputs and static outputs (class 1), we take a late-fusion approach to merging the per-time step predictions ⟨y_1, y_2, ..., y_T⟩ into a single prediction y for the full sequence. With static inputs x and sequential outputs (class 2), we simply duplicate the input x at all T time steps: ∀t ∈ {1, 2, ..., T}: x_t := x. Finally, for a sequence-to-sequence problem with (in general) different input and output lengths (class 3), we take an "encoder-decoder" approach, as proposed for machine translation by [9], [20]. In this approach, one sequence model, the encoder, maps the input sequence to a fixed-length vector, and another sequence model, the decoder, unrolls this vector to a sequential output of arbitrary length. Under this type of model, a run of the full system on one instance occurs over T + T′ − 1 time steps. For the first T time steps, the encoder processes the input x_1, x_2, ..., x_T, and the decoder is inactive until time step T, when the encoder's output is passed to the decoder, which in turn predicts the first output y_1. For the latter T′ − 1 time steps, the decoder predicts the remainder of the output y_2, y_3, ..., y_T′ with the encoder inactive. This encoder-decoder approach, as applied to the video description task, is depicted in Section 6, Figure 5 (left).
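The T + T′ − 1 step schedule of the encoder-decoder approach can be sketched as follows; encoder_step and decoder_step are hypothetical stand-ins for the two sequence models, and the <EOS> stopping convention follows Section 5:

    def encode_decode(inputs, encoder_step, decoder_step, h0, max_out_len):
        # Steps 1..T: the encoder consumes the input; the decoder is inactive.
        h = h0
        for x_t in inputs:
            h = encoder_step(x_t, h)
        # Steps T..T+T'-1: the decoder unrolls the encoder state into outputs.
        outputs = []
        for _ in range(max_out_len):
            y_t, h = decoder_step(h)
            outputs.append(y_t)
            if y_t == "<EOS>":
                break
        return outputs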
Under the proposed system, the parameters (V, W) of the model's visual and sequential components can be jointly optimized by maximizing the likelihood of the ground truth outputs y_t at each time step t, conditioned on the input data and labels up to that point (x_{1:t}, y_{1:t−1}). In particular, for a training set D of labeled sequences (x_t, y_t)_{t=1..T} ∈ D, we optimize the parameters (V, W) to minimize the expected negative log likelihood of a sequence sampled from the training set:

    L(V, W, D) = −(1/|D|) Σ_{(x_t, y_t)_{t=1..T} ∈ D} Σ_{t=1..T} log P(y_t | x_{1:t}, y_{1:t−1}, V, W).

One of the most appealing aspects of the described system is the ability to learn the parameters "end-to-end," such that the parameters V of the visual feature extractor learn to pick out the aspects of the visual input that are relevant to the sequential classification problem. We train our LRCN models using stochastic gradient descent, with backpropagation used to compute the gradient ∇_{V,W} L(V, W, D̃) of the objective L with respect to all parameters (V, W) over minibatches D̃ ⊂ D sampled from the training dataset D.
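A small sketch of the objective for a single sequence, assuming the per-step distributions have already been computed by the network; averaging over a minibatch and differentiating with backpropagation are as described above:

    import numpy as np

    def sequence_nll(step_dists, labels):
        # -sum_t log P(y_t | x_{1:t}, y_{1:t-1}); step_dists[t] is the
        # softmax output at step t, labels[t] the ground-truth index.
        return -sum(np.log(step_dists[t][labels[t]])
                    for t in range(len(labels)))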
We next demonstrate the power of end-to-end trainable hybrid convolutional and recurrent networks by exploring three applications: activity recognition, image captioning, and video description.

4 ACTIVITY RECOGNITION

Activity recognition is an instance of the first class of sequential learning tasks described above: each frame in a length T sequence is the input to a single convolutional network (i.e., the convnet weights are tied across time). We consider both RGB and flow as inputs to our recognition system. Flow is computed with [21] and transformed into a "flow image" by scaling and shifting the x and y flow values to a range of [−128, +128]. A third channel for the flow image is created by calculating the flow magnitude.

During training, videos are resized to 240 × 320 and we augment our data by using 227 × 227 crops and mirroring. Additionally, we train the LRCN networks with video clips of 16 frames, even though the UCF101 videos are generally much longer (on the order of 100 frames when extracting frames at 30 FPS). Training on shorter video clips can be seen as analogous to training on image crops and is a useful method of data augmentation. LRCN is trained to predict the video's activity class at each time step. To produce a single label prediction for an entire video clip, we average the label probabilities – the outputs of the network's softmax layer – across all frames and choose the most probable label. At test time, we extract 16 frame clips with a stride of 8 frames from each video and average across all clips from a single video.
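A minimal sketch of the two input-side recipes just described: the flow-image conversion and the test-time clip protocol. frame_softmax is a hypothetical stand-in for a forward pass through the network, and the exact scaling/clipping of the flow is one plausible reading of the description above:

    import numpy as np

    def flow_to_image(flow_x, flow_y, scale=1.0):
        # Map x/y flow into [-128, 128]; third channel is the flow magnitude.
        fx = np.clip(flow_x * scale, -128, 128)
        fy = np.clip(flow_y * scale, -128, 128)
        mag = np.clip(np.hypot(flow_x, flow_y) * scale, -128, 128)
        return np.stack([fx, fy, mag], axis=-1)

    def predict_video_label(frames, frame_softmax, clip_len=16, stride=8):
        # 16-frame clips with stride 8; average softmax scores over all
        # frames of all clips and take the most probable label.
        scores = []
        for start in range(0, max(len(frames) - clip_len, 0) + 1, stride):
            for frame in frames[start:start + clip_len]:
                scores.append(frame_softmax(frame))
        return int(np.argmax(np.mean(scores, axis=0)))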
The CNN base of LRCN in our activity recognition experiments is a hybrid of the CaffeNet [12] reference model (a minor variant of AlexNet [17]) and the network used by Zeiler & Fergus [22]. The network is pre-trained on the 1.2M image ILSVRC-2012 [23] classification training subset of the ImageNet [24] dataset, giving the network a strong initialization to facilitate faster training and avoid overfitting to the relatively small video activity recognition datasets. When classifying center crops, the top-1 classification accuracy is 60.2% and 57.4% for the hybrid and CaffeNet reference models, respectively.

We compare LRCN to a single frame baseline model. In our baseline model, T video frames are individually classified by a CNN. As in the LSTM model, whole video classification is done by averaging scores across all video frames.

4.1 Evaluation

We evaluate our architecture on the UCF101 dataset [25], which consists of over 12,000 videos categorized into 101 human action classes. The dataset is divided into three splits, with just under 8,000 videos in the training set for each split.

We explore various hyperparameters for the LRCN activity recognition architecture. To explore different variants, we divide the first training split of UCF101 into a smaller training set (≈6,000 videos) and a validation set (≈3,000 videos). We find that the most influential hyperparameters include the number of hidden units in the LSTM and whether fc6 or fc7 features are used as input to the LSTM. We compare networks with 256, 512, and 1024 LSTM hidden units. When using flow as an input, more hidden units leads to better performance, with 1024 hidden units yielding a 1.7% boost in accuracy in comparison to a network with 256 hidden units on our validation set. In contrast, for networks with RGB input, the number of hidden units has little impact on the performance of the model. We thus use 1024 hidden units for flow inputs, and 256 for RGB inputs. We find that using fc6 as opposed to fc7 features improves accuracy when using flow as input on our validation set by 1%. When using RGB images as input, the difference between using fc6 or fc7 features is quite small; using fc6 features only increases accuracy by 0.2%. Because both models perform better with fc6 features, we train our final models using fc6 features (denoted by LRCN-fc6). We also considered subsampling the frames input to the LSTM, but found that this hurts performance compared with using all frames. Additionally, when training the LRCN network end-to-end, we found that aggressive dropout (0.9) was needed to avoid overfitting.

Table 1 reports the average accuracy across the three standard test splits of UCF101. Columns 2-3 compare video classification of LRCN against the baseline single frame architecture for both RGB and flow inputs. LRCN yields the best results for both RGB and flow and improves upon the baseline network by 0.83% and 2.91%, respectively. RGB and flow networks can be combined by computing a weighted average of network scores as proposed in [4]. Like [4], we report two weighted averages of the predictions from the RGB and flow networks in Table 1 (right). Since the flow network outperforms the RGB network, weighting the flow network higher unsurprisingly leads to better accuracy. In this case, LRCN outperforms the baseline single-frame model by 3.40%.

                Single Input Type      Weighted Average
Model           RGB      Flow          1/2, 1/2   1/3, 2/3
Single frame    67.37    74.37         75.46      78.94
LRCN-fc6        68.20    77.28         80.90      82.34

TABLE 1
Activity recognition: comparing single frame models to LRCN networks for activity recognition on the UCF101 [25] dataset, with RGB and flow inputs. Average values across all three splits are shown. LRCN consistently and strongly outperforms a model based on predictions from the underlying convolutional network architecture alone.

Table 2 compares LRCN's accuracy with the single frame baseline model for individual classes on Split 1 of UCF101.

Label               ∆       Label            ∆
BoxingPunchingBag   40.82   BoxingSpeedBag   -16.22
HighJump            29.73   Mixing           -15.56
JumpRope            28.95   Knitting         -14.71
CricketShot         28.57   Typing           -13.95
Basketball          28.57   Skiing           -12.50
WallPushups         25.71   BaseballPitch    -11.63
Nunchucks           22.86   BrushingTeeth    -11.11
ApplyEyeMakeup      22.73   Skijet           -10.71
HeadMassage         21.95   Haircut          -9.10
Drumming            17.78   TennisSwing      -8.16

TABLE 2
Activity recognition: comparison of the improvement ∆ in LRCN's per-class recognition accuracy versus the single-frame baseline. Here we report results on all three splits of UCF101 (only results on the first split were presented in the paper). ∆ is the difference between LRCN's accuracy and the single-frame model's accuracy.

For the majority of classes, LRCN improves performance over the single frame model. Though LRCN performs worse on some classes, including Knitting and Mixing, in general when LRCN performs worse, the loss in accuracy is not as substantial as the gain in accuracy for classes like BoxingPunchingBag and HighJump. Consequently, accuracy is higher overall.

Table 3 compares accuracies of the LRCN flow and LRCN RGB models for individual classes on Split 1 of UCF101. Note that for some classes the LRCN flow model outperforms the LRCN RGB model and vice versa. One explanation is that activities which are better classified by the LRCN RGB model are best determined by which objects are present in the scene, while activities which are better classified by the LRCN flow model are best identified by the kind of motion in the scene. For example, activity classes like Typing are highly correlated with the presence of certain objects, such as a keyboard, and are thus best learned by the LRCN RGB model. Other activities such as SoccerJuggling include more generic objects which are frequently seen in other activities (soccer balls, people) and are thus best identified from class-specific motion cues. Because RGB and flow signals are complementary, the best models take both into account.

Label               ∆       Label                ∆
BoxingPunchingBag   57.14   Typing               -44.19
PushUps             53.33   TennisSwing          -42.86
JumpRope            50.00   FieldHockeyPenalty   -32.50
SoccerJuggling      48.72   BrushingTeeth        -30.56
HandstandWalking    44.12   CuttingInKitchen     -30.30
Basketball          40.00   Skijet               -28.57
BodyWeightSquats    38.46   Mixing               -26.67
Lunges              37.84   Skiing               -25.00
Nunchucks           34.29   Knitting             -20.59
WallPushups         34.29   FloorGymnastics      -19.44

TABLE 3
Activity recognition: comparison of per-class recognition accuracy between the flow and RGB LRCN models. ∆ is the difference between LRCN flow accuracy and LRCN RGB accuracy.

LRCN shows clear improvement over the baseline single-frame system and is comparable to the accuracy achieved by other deep models. [4] report results on UCF101 by computing a weighted average between flow and RGB networks and achieve 87.6%. [3] reports 65.4% accuracy on UCF101, which is substantially lower than LRCN.

5 IMAGE CAPTIONING

In contrast to activity recognition, the static image captioning task requires only a single invocation of a convolutional network since the input consists of a single image. At each time step, both the image features and the previous word are provided as inputs to the sequence model, in this case a stack of LSTMs (each with 1000 hidden units), which is used to learn the dynamics of the time-varying output sequence, natural language.

At time step t, the input to the bottom-most LSTM is the embedded word from the previous time step, y_{t-1}. Input words are encoded as "one-hot" vectors: vectors y ∈ R^K with a single non-zero component y_i = 1 denoting the ith word in the vocabulary, where K is the number of words in the vocabulary, plus one additional entry for the <BOS> (beginning of sequence) token, which is always taken as y_0, the "previous word" at the first time step (t = 1). These one-hot vectors are then projected into an embedding space of dimension d_e by the multiplication W_e y_t with a learned parameter matrix W_e ∈ R^{d_e×K}. The result of a matrix-vector multiplication with a one-hot vector is the column of the matrix corresponding to the index of the single non-zero component of the one-hot vector. W_e can therefore be thought of as a "lookup table," mapping each of the K words in the vocabulary to a d_e-dimensional vector.
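Since multiplying W_e by a one-hot vector just selects a column, the embedding reduces to an index lookup, as the following one-line sketch (our illustration, assuming W_e is a NumPy array) makes explicit:

    def embed_word(W_e, word_index):
        # Column word_index of W_e is the d_e-dimensional word embedding.
        return W_e[:, word_index]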
[Figure: three LRCN captioning variants – Single Layer (L = 1, LRCN1u), Two Layers (L = 2) Unfactored (LRCN2u), and Two Layers (L = 2) Factored (LRCN2f) – each decoding from <BOS> to a caption such as "A man ...".]
Fig. 4. Three variants of the LRCN image captioning architecture that we experimentally evaluate. We explore the effect of depth in the LSTM stack, and the effect of the "factorization" of the modalities.

The visual feature representation φV(x) of the image x may be input to the sequence model – a stack of L LSTMs – by concatenating it at each time step either with (1) the embedded previous word W_e y_{t-1}, fed into the first LSTM of the stack, or (2) the hidden state h_t^(ℓ-1) output from LSTM ℓ − 1, fed into LSTM ℓ, for some ℓ ∈ 2, ..., L. These choices are depicted in Figure 4. We refer to the latter choice as "factored," as it forces a sort of separation of responsibilities by "blinding" the first ℓ − 1 LSTMs and forcing all of the capacity of their hidden states at time step t to represent only the partial caption y_{1:t−1}, independent of the visual input, while the LSTMs starting from ℓ are responsible for fusing the lower layer's hidden state given by the partial caption with the visual feature representation φV(x) to produce a joint hidden state representation h_t^(ℓ) of the visual and language inputs up to time step t, from which the next word y_t can be predicted. In the factored case, the hidden state h_t for the lower layers is conditionally independent of the image x given the partial caption y_{1:t−1}.

The outputs of the final LSTM in the stack are the inputs to a learned linear prediction layer with a softmax producing a distribution P(y_t | y_{1:t−1}, φV(x)) over words y_t in the model's vocabulary, including the <EOS> token denoting the end of the caption, allowing the model to predict captions of varying length. The visual model φV used for our image captioning experiments is either the CaffeNet [12] reference model, a variant of AlexNet [17], or the more modern and computationally expensive VGGNet [18] model pre-trained for ILSVRC-2012 [23] classification.

Without any explicit language modeling or impositions on the structure of the generated captions, the described LRCN system learns mappings from images input as pixel intensity values to natural language descriptions that are often semantically descriptive and grammatically correct. At training time, the previous word inputs y_{1:t−1} at time step t are from the ground truth caption. For inference of captions on a novel image x, the input is a sample ỹ_t ∼ P(y_t | ỹ_{1:t−1}, φV(x)) from the model's predicted distribution at the previous time step, and generation continues until an <EOS> (end of sequence) token is generated.

5.1 Evaluation

We evaluate our image description model on retrieval and generation tasks. We first demonstrate the effectiveness of our model by quantitatively evaluating it on the image and caption retrieval tasks proposed by [26] and seen in [27], [28], [29], [30], [31]. We report results on the Flickr30k [32] and COCO 2014 [33] datasets, both with five captions annotated per image.

5.1.1 Retrieval

Retrieval results on the Flickr30k [32] dataset are recorded in Table 4. We report the median rank, Medr, of the first retrieved ground truth image or caption, and Recall@K, the number of images or captions for which a correct caption or image is retrieved within the top K results. Our model consistently outperforms the strong baselines from recent work [27], [28], [29], [30], [31], as can be seen in Table 4. Here, we note that the VGGNet model in [31] (called OxfordNet in their work) outperforms our model on the retrieval task. However, VGGNet is a stronger convolutional network [18] than that used for our results on this task. The strength of our sequence model (and integration of the sequence and visual models) can be more directly measured against the ConvNet [31] result, which uses a very similar base CNN architecture (AlexNet [17], where we use CaffeNet) pretrained on the same data.

                     R@1    R@5    R@10   Medr
Caption to Image (Flickr30k)
DeViSE [30]          6.7    21.9   32.7   25
SDT-RNN [29]         8.9    29.8   41.1   16
DeFrag [28]          10.3   31.4   44.5   13
m-RNN [27]           12.6   31.2   41.5   16
ConvNet [31]         11.8   34.0   46.3   13
LRCN2f (ours)        17.5   40.3   50.8   9
Image to Caption (Flickr30k)
DeViSE [30]          4.5    18.1   29.2   26
SDT-RNN [29]         9.6    29.8   41.1   16
DeFrag [28]          16.4   40.2   54.7   8
m-RNN [27]           18.4   40.2   50.9   10
ConvNet [31]         14.8   39.2   50.9   10
LRCN2f (ours)        23.6   46.6   58.3   7

TABLE 4
Image description: retrieval results for the Flickr30k [32] dataset. R@K is the average recall at rank K (high is good). Medr is the median rank (low is good).

We also ablate the model's retrieval performance on a randomly chosen subset of 1000 images (and 5000 captions) from the COCO 2014 [33] validation set. Results are recorded in Table 5. The first group of results for each task examines the effectiveness of an LSTM compared with a "vanilla" RNN as described in Section 2. These results demonstrate that the use of the LSTM unit compared to the simpler RNN architecture is an important element of our model's performance on this task, justifying the additional complexity and suggesting that the LSTM's gating mechanisms allowing for "long-term" memory may be quite useful, even for relatively simple sequences.

Vision Model         Sequence Model       Retrieval Performance
CNN       FT?        Unit   L   Factor?   R@1    R@5    R@10   Medr
Caption to Image
CaffeNet  -          RNN    2   X         21.3   51.7   67.2   5
CaffeNet  -          LSTM   2   X         25.0   56.2   70.6   4
CaffeNet  -          LSTM   1   -         25.2   56.2   70.8   4
CaffeNet  -          LSTM   2   -         23.4   54.8   69.3   5
CaffeNet  -          LSTM   2   X         25.0   56.2   70.6   4
CaffeNet  X          LSTM   1   -         28.5   60.0   74.5   4
CaffeNet  X          LSTM   2   -         25.6   57.2   72.2   4
CaffeNet  X          LSTM   2   X         27.2   59.6   74.7   4
VGGNet    -          LSTM   2   X         33.5   68.1   80.8   3
VGGNet    X          LSTM   2   X         39.3   74.7   85.9   2
Image to Caption
CaffeNet  -          RNN    2   X         30.2   61.0   72.6   4
CaffeNet  -          LSTM   2   X         33.8   65.3   75.3   3
CaffeNet  -          LSTM   1   -         32.3   64.5   75.6   3
CaffeNet  -          LSTM   2   -         29.9   60.8   72.7   3
CaffeNet  -          LSTM   2   X         33.8   65.3   75.3   3
CaffeNet  X          LSTM   1   -         36.1   68.4   79.5   3
CaffeNet  X          LSTM   2   -         33.1   63.7   76.9   3
CaffeNet  X          LSTM   2   X         36.3   67.3   80.6   2
VGGNet    -          LSTM   2   X         46.0   77.4   88.3   2
VGGNet    X          LSTM   2   X         53.3   84.3   91.9   1

TABLE 5
Retrieval results (image to caption and caption to image) for a randomly chosen subset (1000 images) of the COCO 2014 [33] validation set. R@K is the average recall at rank K (high is good). Medr is the median rank (low is good).

Within the second and third result groups, we compare performance among the three sequence model architectural variants depicted in Figure 4. For both tasks and under all metrics, the two layer, unfactored variant (LRCN2u) performs worse than the other two. The fact that LRCN1u outperforms LRCN2u indicates that stacking additional LSTM layers alone is not beneficial for this task. The other two variants (LRCN2f and LRCN1u) perform similarly across the board, with LRCN2f appearing to have a slight edge in the image to caption task under most metrics, but the reverse for caption to image retrieval.

Unsurprisingly, finetuning the CNN (indicated by the "FT?" column of Table 5) and using a more powerful CNN (VGGNet [18] rather than CaffeNet) each improve results substantially across the board. Finetuning boosts the R@K metrics by 3-5% for CaffeNet, and 5-8% for VGGNet. Switching from CaffeNet to VGGNet improves results by around 8-12% for the caption to image task, and by roughly 11-17% for the image to caption task.

5.1.2 Generation

We evaluate LRCN's caption generation performance on the COCO 2014 [33] dataset using the official metrics on which COCO image captioning submissions are evaluated. The BLEU [34] and METEOR [36] metrics were designed for automatic evaluation of machine translation methods. ROUGE-L [37] was designed for evaluating summarization performance. CIDEr-D [35] was designed specifically to evaluate the image captioning task.

In Table 6 we evaluate variants of our model along the same axes as done for the retrieval tasks in Table 5. In the last of the three groups of results, we additionally explore and evaluate various caption generation strategies that can be employed for a given network. The simplest strategy, and the one employed for most of the generation results in our prior work [43], is to generate captions greedily; i.e., by simply choosing the most probable word at each time step. This is equivalent to (and denoted in Table 6 by) beam search with beam width 1. In general, beam search with beam width N approximates the most likely caption by retaining and expanding only the N currently most likely partial captions, according to the model. We find that of the beam search strategies, a beam width of 3-5 gives the best generation numbers – performance saturates quickly and even degrades for larger beam widths (e.g., 10).
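A minimal sketch of the beam search procedure just described; step_log_probs is a hypothetical stand-in returning (word, log-probability) pairs from the model's predicted next-word distribution given a partial caption:

    import heapq

    def beam_search(step_log_probs, beam_width=3, max_len=20, eos="<EOS>"):
        beams = [(0.0, ["<BOS>"])]          # (log-likelihood, partial caption)
        for _ in range(max_len):
            candidates = []
            for score, prefix in beams:
                if prefix[-1] == eos:        # finished captions stay as-is
                    candidates.append((score, prefix))
                    continue
                for word, logp in step_log_probs(prefix):
                    candidates.append((score + logp, prefix + [word]))
            # Retain only the beam_width most likely partial captions.
            beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
            if all(p[-1] == eos for _, p in beams):
                break
        return max(beams, key=lambda b: b[0])[1]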
An alternative, non-deterministic generation strategy is to randomly sample N captions from the model's distribution and choose the most probable among these. Under
Generation Strategy | Vision Model | Sequence Model | Generation Performance (COCO 2014 [33] Validation Set)
Beam Width | Sample N | Sample T | CNN | FT? | Unit | L | Factor? | B1 | B2 | B3 | B4 | C | M | R
1 - - CaffeNet - RNN 2 X 0.638 0.454 0.315 0.220 0.660 0.209 0.473
1 - - CaffeNet - LSTM 2 X 0.646 0.462 0.321 0.224 0.674 0.210 0.477
1 - - CaffeNet - LSTM 1 - 0.654 0.475 0.333 0.231 0.661 0.209 0.480
1 - - CaffeNet - LSTM 2 - 0.653 0.470 0.328 0.230 0.682 0.212 0.480
1 - - CaffeNet - LSTM 2 X 0.646 0.462 0.321 0.224 0.674 0.210 0.477
1 - - CaffeNet X LSTM 1 - 0.661 0.485 0.344 0.241 0.702 0.216 0.489
1 - - CaffeNet X LSTM 2 - 0.659 0.478 0.338 0.238 0.716 0.217 0.486
1 - - CaffeNet X LSTM 2 X 0.659 0.478 0.336 0.237 0.717 0.218 0.486
1 - - VGGNet - LSTM 2 X 0.674 0.494 0.351 0.248 0.773 0.227 0.497
1 - - VGGNet X LSTM 2 X 0.695 0.519 0.374 0.268 0.839 0.237 0.512
- 100 1.5 CaffeNet - RNN 2 X 0.647 0.466 0.334 0.244 0.703 0.212 0.479
- 100 1.5 CaffeNet - LSTM 2 X 0.657 0.478 0.344 0.251 0.720 0.215 0.485
- 100 1.5 CaffeNet - LSTM 1 - 0.664 0.490 0.354 0.254 0.704 0.211 0.488
- 100 1.5 CaffeNet - LSTM 2 - 0.664 0.486 0.352 0.257 0.732 0.216 0.489
- 100 1.5 CaffeNet - LSTM 2 X 0.657 0.478 0.344 0.251 0.720 0.215 0.485
- 100 1.5 CaffeNet X LSTM 1 - 0.679 0.507 0.370 0.268 0.753 0.219 0.499
- 100 1.5 CaffeNet X LSTM 2 - 0.672 0.495 0.361 0.265 0.762 0.222 0.495
- 100 1.5 CaffeNet X LSTM 2 X 0.670 0.493 0.358 0.264 0.764 0.222 0.495
- 100 1.5 VGGNet - LSTM 2 X 0.690 0.514 0.377 0.278 0.828 0.231 0.508
- 100 1.5 VGGNet X LSTM 2 X 0.711 0.541 0.402 0.300 0.896 0.242 0.524
1 - - VGGNet X LSTM 2 X 0.695 0.519 0.374 0.268 0.839 0.237 0.512
2 - - VGGNet X LSTM 2 X 0.707 0.533 0.394 0.291 0.879 0.242 0.520
3 - - VGGNet X LSTM 2 X 0.708 0.536 0.399 0.298 0.888 0.243 0.521
4 - - VGGNet X LSTM 2 X 0.706 0.534 0.398 0.299 0.888 0.243 0.521
5 - - VGGNet X LSTM 2 X 0.704 0.533 0.398 0.300 0.888 0.242 0.520
10 - - VGGNet X LSTM 2 X 0.699 0.528 0.395 0.298 0.886 0.241 0.518
- 1 2.0 VGGNet X LSTM 2 X 0.658 0.472 0.327 0.224 0.733 0.222 0.483
- 10 2.0 VGGNet X LSTM 2 X 0.708 0.534 0.391 0.286 0.868 0.239 0.519
- 25 2.0 VGGNet X LSTM 2 X 0.712 0.540 0.398 0.294 0.885 0.241 0.523
- 100 2.0 VGGNet X LSTM 2 X 0.714 0.543 0.402 0.297 0.889 0.242 0.524
- 100 1.0 VGGNet X LSTM 2 X 0.674 0.494 0.357 0.261 0.805 0.228 0.494
- 100 1.5 VGGNet X LSTM 2 X 0.711 0.541 0.402 0.300 0.896 0.242 0.524
- 100 2.0 VGGNet X LSTM 2 X 0.714 0.543 0.402 0.297 0.889 0.242 0.524

TABLE 6
Image caption generation performance (under the BLEU 1-4 [34] (B1-B4), CIDEr-D [35] (C), METEOR [36] (M), and ROUGE-L [37] (R) metrics)
across various network architectures and generation strategies. In the topmost set of results, we show performance across various CNN and
recurrent architectures for a simple generation strategy – beam search with beam width 1 (i.e., simply choosing the most probable word at each
time step). In the middle set of results, we show performance across the same set of architectures for a more sophisticated and computationally
intensive generation strategy found to be the best performing (in terms of performance under the CIDEr-D metric) among those explored in the
bottom-most set of results, which explores various generation strategies while fixing the choice of network. In the first two sets of results, we vary
the visual input CNN architecture (either CaffeNet [12], an architecture similar to AlexNet [17], or the more modern VGGNet [18]) and whether its
weights are finetuned (FT?). Keeping the visual input CNN fixed with CaffeNet, we also vary the choice of recurrent architecture, comparing a
stack of “vanilla” RNNs with LSTMs [7], as well as the number of layers in the stack L, and (for L = 2) whether the layers are “factored” (i.e.,
whether the visual input is passed into the second layer). In the last set of results, we explore two generation strategies – beam search, and
choosing the best (highest log-likelihood) among N samples from the model’s predicted distribution. For beam search we vary the beam width
from 1-10. For the sampling strategy we explore the effect of sample size N as well as the effect of applying various choices of scalar factor T
(inverse of the “temperature”) to the logits input to the softmax producing the distribution.

Under this strategy we also examine the effect of applying various choices of the scalar factor (inverse of the "temperature") T to the real-valued predictions input to the softmax producing the distribution. For larger values of T the samples are greedier and less diverse, with T = ∞ being equivalent to beam search with beam width 1. Larger values of N suggest using smaller values of T, and vice versa; for example, with large N and large T, most of the O(N) computation is wasted, as many of the samples will be redundant. We assess saturation as the number of samples N grows, and find that N = 100 samples with T = 2 improves little over N = 25. We also varied the temperature T among the values 1, 1.5, and 2 (all with N = 100) and found T = 1.5 to perform best.

We adopt the best-performing generation strategy from the bottom-most set of results in Table 6 (sampling with T = 1.5, N = 100) as the strategy for the middle set of results in the table, which ablates LRCN architectures. We also record generation performance for all architectures (Table 6, top set of results) with the simpler generation strategy used in our earlier work [43], for ease of comparison with this work and for future researchers. For the remainder of this discussion, we will focus on the middle set of results, and particularly on the CIDEr-D [35] (C) metric, as it was designed specifically for automatic evaluation of image captioning systems. We see again that the LSTM unit outperforms an RNN unit for generation, though not as significantly as for retrieval. Between the sequence model architecture choices (depicted in Figure 4) of the number of layers L and whether to factor, we see that in this case the two-layer models (LRCN2f and LRCN2u) perform similarly, outperforming the single-layer model (LRCN1u). Interestingly, of the three variants, LRCN2f is the only one to perform best for both retrieval and generation.
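A minimal MATLAB sketch of this best-of-N sampling strategy (not the authors' code; sample_caption is a hypothetical stand-in that draws one caption using the supplied per-step word sampler and returns it with its total log-probability):

% The scalar factor T (inverse temperature) multiplies the real-valued
% predictions before the softmax; larger T gives greedier, less diverse
% samples.
T = 1.5 ;  N = 100 ;                  % the best setting reported above
tempered = @(z) exp(T*(z - max(z))) / sum(exp(T*(z - max(z)))) ;
draw_word = @(logits) find(rand <= cumsum(tempered(logits)), 1) ;
best_lp = -inf ;  best = [] ;
for n = 1:N
  [caption, lp] = sample_caption(draw_word) ;  % hypothetical model call
  if lp > best_lp, best = caption ;  best_lp = lp ;  end
end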
Generation Performance (COCO 2014 [33] Test Set)


Method B1 B2 B3 B4 C M R
[38] NIC 0.895 0.802 0.694 0.587 0.946 0.346 0.682
[39] MSR Captivator 0.907 0.819 0.710 0.601 0.937 0.339 0.680
[40] m-RNN (2015) 0.890 0.798 0.687 0.575 0.935 0.325 0.666
(Ours) * LRCN, this work (sample) 0.895 0.804 0.695 0.585 0.934 0.335 0.678
[41] MSR 0.880 0.789 0.678 0.567 0.925 0.331 0.662
[42] Nearest Neighbor 0.872 0.770 0.655 0.542 0.916 0.318 0.648
[33] Human 0.880 0.744 0.603 0.471 0.910 0.335 0.626
[27] m-RNN (2014) 0.890 0.801 0.690 0.578 0.896 0.320 0.668
(Ours) [43] LRCN (greedy) 0.871 0.772 0.653 0.534 0.891 0.322 0.656
[44] Show, Attend, and Tell 0.872 0.768 0.644 0.523 0.878 0.323 0.651
[31] MLBL 0.848 0.747 0.633 0.517 0.752 0.294 0.635
[45] NeuralTalk 0.828 0.701 0.566 0.446 0.692 0.280 0.603

TABLE 7
Image caption generation results from top-performing methods in the 2015 COCO caption challenge competition, sorted by performance under the
CIDEr-D metric. (We omit submissions that did not provide a reference to a report describing their method; see full results at
http://mscoco.org/dataset/#captions-leaderboard.) All results except for our updated result (denoted by LRCN, this work) were competition entries
(submitted by May 2015). Our updated result differs from our original competition entry only by generation strategy (sampling with N = 100,
T = 1.5, rather than beam search with width 1; i.e., greedy search); the visual and recurrent architectures (and trained weights) are the same.

We see again that fine-tuning (FT) the visual representation and using a stronger vision model (VGGNet [18]) improve results significantly. Fine-tuning improves CIDEr-D by roughly 0.04 points for CaffeNet, and by roughly 0.07 points for VGGNet. Switching from finetuned CaffeNet to VGGNet improves CIDEr-D by 0.13 points.

In Table 7 we compare generation performance with contemporaneous and recent work submitted to the 2015 COCO caption challenge, using our best-performing method (under the CIDEr-D metric) from the results on the validation set described above: generating a caption for a single image by taking the best of N = 100 samples with a scalar factor of T = 1.5 applied to the softmax inputs, using an LRCN model which pairs a fine-tuned VGGNet with our LRCN2f (two-layer, factored) sequence model architecture. Our results are competitive with the contemporary work, performing 4th best in CIDEr-D (0.934, compared with the best result of 0.946 from [38]) and 3rd best in METEOR (0.335, compared with 0.346 from [38]).

In addition to standard quantitative evaluations, we also employ Amazon Mechanical Turk workers ("Turkers") to evaluate the generated sentences. Given an image and a set of descriptions from different models, we ask Turkers to rank the sentences based on correctness, grammar, and relevance. We compared sentences from our model to the ones made publicly available by [31]. As seen in Table 8, our fine-tuned (FT) LRCN model performs on par with the Nearest Neighbour (NN) approach on correctness and relevance, and better on grammar.

Method            Correctness   Grammar   Relevance
TreeTalk [46]     4.08          4.35      3.98
VGGNet [31]       3.71          3.46      3.70
NN [31]           3.44          3.20      3.49
LRCN fc8 (ours)   3.74          3.19      3.72
LRCN FT (ours)    3.47          3.01      3.50
Captions          2.55          3.72      2.59

TABLE 8
Image description: Human evaluator rankings from 1-6 (low is good) averaged for each method and criterion. We evaluated on 785 Flickr images selected by the authors of [31] for the purposes of comparison against this similar contemporary approach.

We show sample captions in Figure 6. We additionally note some properties of the captions our model generates. When using the VGG model to generate sentences on the validation set, we find that 33.7% of our generated sentences exactly match a sentence in the training set. Furthermore, we find that when using a beam size of one, our model generates 42% of the vocabulary words used by human annotators when describing images in the validation set. Some words, such as "lady" and "guy", are not generated by our model but are commonly used by human annotators; however, synonyms such as "woman" and "man" are two of the most common words generated by our model.

6 VIDEO DESCRIPTION

In video description, the LSTM framework allows us to model the video as a variable-length input stream. However, due to the limitations of available video description datasets, we rely on more "traditional" activity and video recognition processing for the input and use LSTMs for generating a sentence. We first distinguish the following architectures for video description (see Figure 5). For each architecture, we assume we have predictions of the activity, tool, object, and locations present in the video from a CRF based on the full video input. In this way, we observe the video as a whole at each time step, not incrementally frame by frame.

(a) LSTM encoder & decoder with CRF max. (Figure 5(a)) This architecture is motivated by the video description approach presented in [11]. They first recognize a semantic representation of the video using the maximum a posteriori (MAP) estimate of a CRF with video features as unaries. This representation, e.g., ⟨knife, cut, carrot, cutting board⟩, is concatenated into an input sequence (knife cut carrot cutting board) which is translated to a natural language sentence (a person cuts a carrot on the board) using statistical machine translation (SMT) [47]. We replace SMT with an encoder-decoder LSTM, which encodes the input sequence as a fixed-length vector before decoding to a sentence.

(b) LSTM decoder with CRF max. (Figure 5(b)) In this variant we provide the full visual input representation at each time step to the LSTM, analogous to how an image is provided as an input to the LSTM in image captioning.

(c) LSTM decoder with CRF probabilities. (Figure 5(c)) A benefit of using LSTMs for machine translation compared to phrase-based SMT [47] is that they can naturally incorporate probability vectors during training and test time, which allows the LSTM to learn uncertainties in visual generation rather than relying on MAP estimates. The architecture is the same as in (b), but we replace max predictions with probability distributions.
[Figure 5 diagrams the three architectures: an LSTM Encoder-Decoder, an LSTM Decoder (CRF-max), and an LSTM Decoder (CRF-prob). The CRF output for the visual input (e.g., knife, cut, carrot, cutting board) is fed to the LSTM stack either as one-hot vectors such as [0, 1, 0, 0, ...] or as probability vectors such as [0, 0.8, 0.2, 0, ...], and the decoder emits the sentence word by word ("man", "cuts", ..., <EOS>).]
Fig. 5. Our approaches to video description. (a) LSTM encoder & decoder with CRF max (b) LSTM decoder with CRF max (c) LSTM decoder with CRF probabilities.
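The difference between the (b) and (c) inputs can be made concrete with a minimal MATLAB sketch (not the authors' code; crf_marginals is a hypothetical cell array holding one CRF marginal distribution per semantic slot, with toy values echoing the vectors shown in Figure 5):

% (b) CRF max feeds one-hot MAP labels; (c) CRF prob feeds the marginal
% probability vectors themselves, preserving the CRF's uncertainty.
crf_marginals = {[0 0.8 0.2 0], [0.3 0 0.7 0], [0 0.1 0.2 0.7]} ;
x_max = [] ;  x_prob = [] ;
for s = 1:numel(crf_marginals)
  p = crf_marginals{s} ;
  onehot = zeros(1, numel(p)) ;
  [~, map_label] = max(p) ;          % MAP estimate of this slot's label
  onehot(map_label) = 1 ;
  x_max  = [x_max onehot] ;          % input used by variant (b)
  x_prob = [x_prob p] ;              % input used by variant (c)
end
% Either concatenated vector is presented to the decoder LSTM at every
% time step while the sentence is emitted.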

6.1 Evaluation
We evaluate our approach on the TACoS multilevel [48] dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation). We compare to [11], who use the max prediction, as well as a variant presented in [48] which takes CRF probabilities at test time and uses a word lattice to find an optimal sentence prediction. Since we use the max prediction as well as the probability scores provided by [48], we have an identical visual representation. [48] uses dense trajectories [49] and SIFT features as well as temporal context reasoning modeled in a CRF. In this set of experiments we use the two-layered, unfactored version of LRCN, as described for image description.

Table 9 shows the BLEU-4 scores. The results show that (1) the LSTM outperforms an SMT-based approach to video description; (2) the simpler decoder architectures (b) and (c) achieve better performance than (a), likely because the input does not need to be memorized; and (3) our approach achieves 28.8%, clearly outperforming the best reported number of 26.9% on TACoS multilevel by [48]. More broadly, these results show that our architecture is not restricted only to input from deep networks, but can be cleanly integrated with fixed or variable length inputs from other vision systems.

Architecture                      Input      BLEU
SMT [11]                          CRF max    24.9
SMT [48]                          CRF prob   26.9
(a) LSTM Encoder-Decoder (ours)   CRF max    25.3
(b) LSTM Decoder (ours)           CRF max    27.4
(c) LSTM Decoder (ours)           CRF prob   28.8

TABLE 9
Video description: results on detailed description of TACoS multilevel [48], in %; see Section 6 for details.

7 RELATED WORK

We present previous literature pertaining to the three tasks discussed in this work. Additionally, we discuss subsequent extensions which combine convolutional and recurrent networks to achieve improved results on activity recognition, image captioning, and video description, as well as on related new tasks such as visual question answering.

7.1 Prior Work
Activity Recognition. State-of-the-art shallow models combine spatio-temporal features along dense trajectories [50] and encode features as bags of words or Fisher vectors for classification. Such shallow features track how low-level features change through time but cannot track higher-level features. Furthermore, by encoding features as bags of words or Fisher vectors, temporal relationships are lost.

Many deep architectures proposed for activity recognition stack a fixed number of video frames for input to a deep network. [3] propose a fusion convolutional network which fuses layers corresponding to different input frames at various levels of a deep network. [4] propose a two-stream CNN which combines one CNN trained on RGB frames and one CNN trained on a stack of 10 flow frames. When combining RGB and flow by averaging softmax scores, results are comparable to state-of-the-art shallow models on UCF101 [25] and HMDB51 [51]. Results are further improved by using an SVM to fuse RGB and flow, as opposed to simply averaging scores. Alternatively, [1] and [2] propose learning deep spatio-temporal features with 3D convolutional neural networks. [2], [52] propose extracting visual and motion features and modeling temporal dependencies with recurrent networks. This architecture most closely resembles our proposed architecture for activity classification, though it differs in two key ways. First, we integrate 2D CNNs that can be pre-trained on large image datasets. Second, we combine the CNN and LSTM into a single model to enable end-to-end fine-tuning.

Image Captioning. Several early works [53], [54], [55], [56] on image captioning combine object and scene recognition with template- or tree-based approaches to generate captions. Such sentences are typically simple and are easily distinguished from more fluent human-generated descriptions. [46], [57] address this by composing new sentences from existing caption fragments which, though more human-like, are not necessarily accurate or correct.

More recently, a variety of deep and multi-modal models [27], [29], [30], [58] have been proposed for image and caption retrieval, as well as caption generation. Though some of these models rely on deep convolutional nets for image feature extraction [30], [58], recently researchers have realized the importance of also including temporally deep networks
to model text. [29] propose an RNN to map sentences into a multi-modal embedding space. By mapping images and language into the same embedding space, they are able to compare images and descriptions for image and annotation retrieval tasks. [27] propose a model for caption generation that is more similar to the model proposed in this work: predictions for the next word are based on previous words in a sentence and image features. [58] propose an encoder-decoder model for image caption retrieval which relies on both a CNN and an LSTM encoder to learn an embedding of image-caption pairs. Their model uses a neural language decoder to enable sentence generation. As evidenced by the rapid growth of image captioning, visual sequence models like LRCN are increasingly important for describing the visual world using natural language.

Video Description. Recent approaches to describing video with natural language have made use of templates, retrieval, or language models [11], [59], [60], [61], [62], [63], [64]. To our knowledge, we present the first application of deep models to the video description task. Most similar to our work is [11], which uses phrase-based SMT [47] to generate a sentence. In Section 6 we show that phrase-based SMT can be replaced with LSTMs for video description, as has been shown previously for language translation [9], [65].

7.2 Contemporaneous and Subsequent Work
Similar work in activity recognition and visual description was conducted contemporaneously with our work, and a variety of subsequent work has combined convolutional and recurrent networks to both improve upon our results and achieve exciting results on other sequential visual tasks.

Activity Recognition. Contemporaneously with our work, [66] train a network which combines CNNs and LSTMs for activity recognition. Because activity recognition datasets like UCF101 are relatively small in comparison to image recognition datasets, [66] pretrain their network using the Sports-1M [3] dataset, which includes over a million videos mined from YouTube. By training a much larger network (four stacked LSTMs) and pretraining on a large video dataset, [66] achieve 88.6% on the UCF101 dataset.

[67] also combine a convolutional network with an LSTM to predict multiple activities per frame. Unlike LRCN, [67] focus on frame-level (rather than video-level) predictions, which allows their system to label multiple activities that occur in different temporal locations of a video clip. As we show for activity recognition, [67] demonstrate that including temporal information improves upon a single-frame baseline. Additionally, [67] employ an attention mechanism to further improve results.

Image Captioning. [45] and [38] also propose models which combine a CNN with a recurrent network for image captioning. Though similar to LRCN, the architectures proposed in [45] and [38] differ in how image features are input into the sequence model. In contrast to our system, in which image features are input at each time step, [45] and [38] only input image features at the first time step. Furthermore, they do not explore a "factored" representation (Figure 4). Subsequent work [44] has proposed attention to focus on which portion of the image is observed during sequence generation. By including attention, [44] aim to visually focus on the current word generated by the model. Other works aim to address specific limitations of captioning models based on combining convolutional and recurrent architectures. For example, methods have been proposed to integrate new vocabulary with limited [40] or no [68] examples of images and corresponding captions.

Video Description. In this work, we rely on intermediate features for video description, but end-to-end trainable models for visual captioning have since been proposed. [69] propose creating a video feature by pooling high-level CNN features across frames. The video feature is then used to generate descriptions in the same way an image is used to generate a description in LRCN. Though this achieves good results, pooling CNN features discards temporal information from the video. Consequently, [70] propose an LSTM to encode video frames into a fixed-length vector before sentence generation with an LSTM. Using an end-to-end trainable "sequence-to-sequence" model which can exploit temporal structure in video, [70] improve upon results for video description. [71] propose a similar model, adding a temporal attention mechanism which weights video frames differently when generating each word in a sentence.

Visual Grounding. [72] combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase. In order to reconstruct the phrase, the model must learn to visually ground the input phrase to the appropriate location in the image.

Natural Language Object Retrieval. In this work, we present methods for image retrieval based on a natural language description. In contrast, [73] use a model based on LRCN for object retrieval, which returns the bounding box around a given object as opposed to an entire image. In order to adapt LRCN to the task of object retrieval, [73] include local convolutional features which are extracted from object proposals, and the spatial configuration of object proposals, in addition to a global image feature. By including local features, [73] effectively adapt LRCN for object retrieval.

8 CONCLUSION

We have presented LRCN, a class of models that is both spatially and temporally deep and flexible enough to be applied to a variety of vision tasks involving sequential inputs and outputs. Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve upon previous methods which learn a deep hierarchy of parameters only in the visual domain, and upon methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.

As the field of computer vision matures beyond tasks with static input and predictions, deep sequence modeling tools like LRCN are increasingly central to vision systems for problems with sequential structure. The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to handle with little input preprocessing and no hand-designed features.
[Figure 6 shows twelve validation images; the caption generated for each is reproduced below.]
"A female tennis player in action on the court."
"A group of young men playing a game of soccer."
"A man riding a wave on top of a surfboard."
"A baseball game in progress with the batter up to plate."
"A brown bear standing on top of a lush green field."
"A person holding a cell phone in their hand."
"A close up of a person brushing his teeth."
"A woman laying on a bed in a bedroom."
"A black and white cat is sitting on a chair."
"A large clock mounted to the side of a building."
"A bunch of fruit that are sitting on a table."
"A toothbrush holder sitting on top of a white sink."
Fig. 6. Image description: images with corresponding captions generated by our finetuned LRCN model. These are images 1-12 of our randomly chosen validation set from COCO 2014 [33]. We used beam search with a beam size of 5 to generate the sentences, and display the top (highest likelihood) result above.
ACKNOWLEDGMENTS

The authors thank Oriol Vinyals for valuable advice and helpful discussion throughout this work. This work was supported in part by DARPA's MSEE and SMISC programs, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center. The GPUs used for this research were donated by NVIDIA. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD). Lisa Anne Hendricks was supported by the NDSEG.

REFERENCES

[1] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," in IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[2] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in Human Behavior Understanding, 2011.
[3] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in CVPR, 2014.
[4] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," DTIC Document, Tech. Rep., 1985.
[6] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," in Neural Computation, 1989.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," in Neural Computation. MIT Press, 1997.
[8] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in ICML, 2014.
[9] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.
[10] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in SSST Workshop, 2014.
[11] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, "Translating video content to natural language descriptions," in ICCV, 2013.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM MM, 2014.
[13] W. Zaremba and I. Sutskever, "Learning to execute," arXiv preprint arXiv:1410.4615, 2014.
[14] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
[15] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in ICASSP, 2012.
[16] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in ICML, 2011.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[20] K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014.
[21] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in ECCV, 2004.
[22] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in ECCV, 2014.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," in IJCV, vol. 115, no. 3, 2015.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[25] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," CRCV-TR-12-01, Tech. Rep., 2012.
[26] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," in JAIR, vol. 47, 2013.
[27] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in ICLR, 2015.
[28] A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in NIPS, 2014.
[29] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," in TACL, vol. 2, 2014.
[30] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., "DeViSE: A deep visual-semantic embedding model," in NIPS, 2013.
[31] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," in TACL, 2015.
[32] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," in TACL, vol. 2, 2014.
[33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," arXiv preprint arXiv:1405.0312, Tech. Rep., 2014.
[34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in ACL, 2002.
[35] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in CVPR, 2015.
[36] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[37] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004.
[38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[39] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, "Language models for image captioning: The quirks and what works," in ACL, 2015.
[40] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Learning like a child: Fast novel visual concept learning from sentence descriptions of images," in ICCV, 2015.
[41] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt et al., "From captions to visual concepts and back," in CVPR, 2015.
[42] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, "Exploring nearest neighbor approaches for image captioning," arXiv preprint arXiv:1505.04467, Tech. Rep., 2015.
[43] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015.
[44] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in CVPR, 2015.
[46] P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi, "TreeTalk: Composition and compression of trees for image descriptions," in TACL, vol. 2, no. 10, 2014.
[47] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in ACL, 2007.
[48] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, "Coherent multi-sentence video description with variable level of detail," in German Conference on Pattern Recognition (GCPR). Springer, 2014.
[49] H. Wang, A. Kläser, C. Schmid, and C. Liu, "Dense trajectories and motion boundary descriptors for action recognition," in IJCV, 2013.

[50] H. Wang and C. Schmid, "Action recognition with improved trajectories," in ICCV, 2013.
[51] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: a large video database for human motion recognition," in ICCV, 2011.
[52] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Action classification in soccer videos with long short-term memory recurrent neural networks," in International Conference on Artificial Neural Networks (ICANN), 2010.
[53] A. Farhadi, M. Hejrati, M. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in ECCV, 2010.
[54] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Baby talk: Understanding and generating simple image descriptions," in CVPR, 2011.
[55] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, "Corpus-guided sentence generation of natural images," in EMNLP, 2011.
[56] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III, "Midge: Generating image descriptions from computer vision detections," in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012.
[57] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, "Collective generation of natural image descriptions," in ACL, 2012.
[58] R. Kiros, R. Salakhutdinov, and R. Zemel, "Multimodal neural language models," in ICML, 2014.
[59] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in ICCV, 2013.
[60] M. U. G. Khan, L. Zhang, and Y. Gotoh, "Human focused video description," in ICCV Workshops, 2011.
[61] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, L. Schmidt, J. Shangguan, J. M. Siskind, J. Waggoner, S. Wang, J. Wei, Y. Yin, and Z. Zhang, "Video in sentences out," in The Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
[62] P. Das, C. Xu, R. Doell, and J. Corso, "Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching," in CVPR, 2013.
[63] C. C. Tan, Y.-G. Jiang, and C.-W. Ngo, "Towards textually describing complex video contents with audio-visual concept classifiers," in ACM MM, 2011.
[64] J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. J. Mooney, "Integrating language and vision to generate natural language descriptions of videos in the wild," in International Conference on Computational Linguistics (COLING), 2014.
[65] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence discriminative distributed training of long short-term memory recurrent neural networks," in Interspeech, 2014.
[66] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in CVPR, 2015.
[67] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei, "Every moment counts: Dense detailed labeling of actions in complex videos," arXiv preprint arXiv:1507.05738, Tech. Rep., 2015.
[68] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, "Deep compositional captioning: Describing novel object categories without paired training data," in CVPR, 2016.
[69] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," in NAACL, 2015.
[70] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence - video to text," in ICCV, 2015.
[71] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, "Describing videos by exploiting temporal structure," in CVPR, vol. 1050, 2015.
[72] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele, "Grounding of textual phrases in images by reconstruction," arXiv preprint arXiv:1511.03745, Tech. Rep., 2015.
[73] R. Hu, H. Xu, M. Rohrbach, J. Feng, K. Saenko, and T. Darrell, "Natural language object retrieval," in CVPR, 2016.

Jeff Donahue is a PhD student at the University of California, Berkeley, advised by Prof. Trevor Darrell. His research focuses on the use of deep learning for computer vision applications. He graduated with a BS in computer science from the University of Texas at Austin, where he was advised by Prof. Kristen Grauman.

Lisa Anne Hendricks is a PhD student at the University of California, Berkeley. Her research focuses on deep learning for sequential models as well as applications at the intersection of language and vision. She is advised by Prof. Trevor Darrell. Lisa Anne holds a Bachelor of Science in Electrical Engineering (B.S.E.E.) from Rice University.

Marcus Rohrbach's research focuses on visual recognition, language understanding, and machine learning. He received his BSc and MSc degrees in Computer Science from the University of Technology Darmstadt, Germany, in 2006 and 2009, respectively. From 2006-2007, he spent one year at the University of British Columbia as a graduate visiting student. During his PhD he worked at the Max Planck Institute for Informatics, Saarbrücken, Germany with Bernt Schiele and Manfred Pinkal. He completed it in 2014 with summa cum laude at Saarland University and received the DAGM MVTec Dissertation Award 2015 for it. He currently works as a post-doc with Trevor Darrell at UC Berkeley.

Subhashini Venugopalan is a PhD student at the University of Texas at Austin. Her research focuses on deep learning techniques to generate descriptions for events in videos. She is advised by Prof. Raymond Mooney. Subhashini holds a master's degree in Computer Science from IIT Madras and a bachelor's degree from NIT Karnataka, India.

Sergio Guadarrama is a Software Engineer at Google Research, where he works in Machine Perception as a member of the Vale team. He received his PhD from the Technical University of Madrid, followed by postdoctoral work at the European Center for Soft Computing. After that, he was first a Visiting Scholar and then a Research Scientist at UC Berkeley EECS. His research spans the areas of computer vision, language and deep learning. Dr. Guadarrama's current research focus is on new network architectures for multi-task dense predictions, such as object detection, instance segmentation, depth prediction and visual question-answering. He has received research grants from the Government of Spain, such as the Juan de la Cierva Award (Early Career Award in Computer Science), and the Mobility Grant for Postdoctoral Research.

Kate Saenko is an Assistant Professor of Computer Science at the University of Massachusetts Lowell, where she leads the Computer Vision and Learning Group. She received her PhD from MIT, followed by postdoctoral work at UC Berkeley EECS and Harvard SEAS. Her research spans the areas of computer vision, machine learning, and human-robot interfaces. Dr. Saenko's current research interests include domain adaptation of machine learning models and joint modeling of language and vision. She is the recipient of research grant awards from the National Science Foundation, DARPA, and other government and industry agencies.

Trevor Darrell is on the faculty of the CS Division of the EECS Department at UC Berkeley and is also appointed at the UCB-affiliated International Computer Science Institute (ICSI). He is the director of the Berkeley Vision and Learning Center (BVLC) and is the faculty director of the PATH center in the UCB Institute of Transportation Studies. His interests include computer vision, machine learning, computer graphics, and perception-based human computer interfaces. Prof. Darrell received the SM and PhD degrees from MIT in 1992 and 1996, respectively. He was previously on the faculty of the MIT EECS department from 1999-2008, where he directed the Vision Interface Group. He was a member of the research staff at Interval Research Corporation from 1996-1999. He obtained the BSE degree from the University of Pennsylvania in 1988, having started his career in computer vision as an undergraduate researcher in Ruzena Bajcsy's GRASP lab.
MatConvNet
Convolutional Neural Networks for MATLAB

Andrea Vedaldi Karel Lenc


arXiv:1412.4564v3 [cs.CV] 5 May 2016


Abstract

MatConvNet is an implementation of Convolutional Neural Networks (CNNs)


for MATLAB. The toolbox is designed with an emphasis on simplicity and flexibility.
It exposes the building blocks of CNNs as easy-to-use MATLAB functions, providing
routines for computing linear convolutions with filter banks, feature pooling, and many
more. In this manner, MatConvNet allows fast prototyping of new CNN architec-
tures; at the same time, it supports efficient computation on CPU and GPU, allowing
complex models to be trained on large datasets such as ImageNet ILSVRC. This document
provides an overview of CNNs and how they are implemented in MatConvNet and
gives the technical details of each computational block in the toolbox.
Contents

1 Introduction to MatConvNet
  1.1 Getting started
  1.2 MatConvNet at a glance
  1.3 Documentation and examples
  1.4 Speed
  1.5 Acknowledgments

2 Neural Network Computations
  2.1 Overview
  2.2 Network structures
    2.2.1 Sequences
    2.2.2 Directed acyclic graphs
  2.3 Computing derivatives with backpropagation
    2.3.1 Derivatives of tensor functions
    2.3.2 Derivatives of function compositions
    2.3.3 Backpropagation networks
    2.3.4 Backpropagation in DAGs
    2.3.5 DAG backpropagation networks

3 Wrappers and pre-trained models
  3.1 Wrappers
    3.1.1 SimpleNN
    3.1.2 DagNN
  3.2 Pre-trained models
  3.3 Learning models
  3.4 Running large scale experiments

4 Computational blocks
  4.1 Convolution
  4.2 Convolution transpose (deconvolution)
  4.3 Spatial pooling
  4.4 Activation functions
  4.5 Spatial bilinear resampling
  4.6 Normalization
    4.6.1 Local response normalization (LRN)
    4.6.2 Batch normalization
    4.6.3 Spatial normalization
    4.6.4 Softmax
  4.7 Categorical losses
    4.7.1 Classification losses
    4.7.2 Attribute losses
  4.8 Comparisons
    4.8.1 p-distance

5 Geometry
  5.1 Preliminaries
  5.2 Simple filters
    5.2.1 Pooling in Caffe
  5.3 Convolution transpose
  5.4 Transposing receptive fields
  5.5 Composing receptive fields
  5.6 Overlaying receptive fields

6 Implementation details
  6.1 Convolution
  6.2 Convolution transpose
  6.3 Spatial pooling
  6.4 Activation functions
    6.4.1 ReLU
    6.4.2 Sigmoid
  6.5 Spatial bilinear resampling
  6.6 Normalization
    6.6.1 Local response normalization (LRN)
    6.6.2 Batch normalization
    6.6.3 Spatial normalization
    6.6.4 Softmax
  6.7 Categorical losses
    6.7.1 Classification losses
    6.7.2 Attribute losses
  6.8 Comparisons
    6.8.1 p-distance

Bibliography
Chapter 1

Introduction to MatConvNet

MatConvNet is a MATLAB toolbox implementing Convolutional Neural Networks (CNN)


for computer vision applications. Since the breakthrough work of [7], CNNs have had a
major impact in computer vision, and image understanding in particular, essentially replacing
traditional image representations such as the ones implemented in our own VLFeat [11] open
source library.
While most CNNs are obtained by composing simple linear and non-linear filtering op-
erations such as convolution and rectification, their implementation is far from trivial. The
reason is that CNNs need to be learned from vast amounts of data, often millions of images,
requiring very efficient implementations. Like most CNN libraries, MatConvNet achieves
this by using a variety of optimizations and, chiefly, by supporting computations on GPUs.
Numerous other machine learning, deep learning, and CNN open source libraries exist.
To cite some of the most popular ones: CudaConvNet,1 Torch,2 Theano,3 and Caffe.4 Many
of these libraries are well supported, with dozens of active contributors and large user bases.
Why, then, create yet another library?
The key motivation for developing MatConvNet was to provide an environment par-
ticularly friendly and efficient for researchers to use in their investigations.5 MatConvNet
achieves this by its deep integration in the MATLAB environment, which is one of the most
popular development environments in computer vision research as well as in many other areas.
In particular, MatConvNet exposes as simple MATLAB commands CNN building blocks
such as convolution, normalisation and pooling (chapter 4); these can then be combined and
extended with ease to create CNN architectures. While many of these blocks use optimised
CPU and GPU implementations written in C++ and CUDA (section 1.4), MATLAB
native support for GPU computation means that it is often possible to write new blocks
in MATLAB directly while maintaining computational efficiency. Compared to writing new
CNN components using lower level languages, this is an important simplification that can
significantly accelerate testing new ideas. Using MATLAB also provides a bridge towards
other areas; for instance, MatConvNet was recently used by the University of Arizona in planetary science, as summarised in this NVIDIA blogpost.6

1 https://code.google.com/p/cuda-convnet/
2 http://cilvr.nyu.edu/doku.php?id=code:start
3 http://deeplearning.net/software/theano/
4 http://caffe.berkeleyvision.org
5 While from a user perspective MatConvNet currently relies on MATLAB, the library is being developed with a clean separation between MATLAB code and the C++ and CUDA core; therefore, in the future the library may be extended to allow processing convolutional networks independently of MATLAB.
MatConvNet can learn large CNN models such as AlexNet [7] and the very deep networks of [9] from millions of images. Pre-trained versions of several of these powerful models can be downloaded from the MatConvNet home page.7 While powerful, MatConvNet
remains simple to use and install. The implementation is fully self-contained, requiring only
MATLAB and a compatible C++ compiler (using the GPU code requires the freely-available
CUDA DevKit and a suitable NVIDIA GPU). As demonstrated in fig. 1.1 and section 1.1,
it is possible to download, compile, and install MatConvNet using three MATLAB com-
mands. Several fully-functional examples demonstrating how small and large networks can
be learned are included. Importantly, several standard pre-trained networks can be immediately downloaded and used in applications. A manual with a complete technical description
of the toolbox is maintained along with the toolbox.8 These features make MatConvNet
useful in an educational context too.9
MatConvNet is open source, released under a BSD-like license. It can be downloaded from http://www.vlfeat.org/matconvnet as well as from GitHub.10

1.1 Getting started


MatConvNet is simple to install and use. fig. 1.1 provides a complete example that clas-
sifies an image using a latest-generation deep convolutional neural network. The example
includes downloading MatConvNet, compiling the package, downloading a pre-trained CNN
model, and evaluating the latter on one of MATLAB’s stock images.
The key command in this example is vl_simplenn, a wrapper that takes as input the
CNN net and the pre-processed image im_ and produces as output a structure res of results.
This particular wrapper can be used to model networks that have a simple structure, namely
a chain of operations. Examining the code of vl_simplenn (edit vl_simplenn in MatCon-
vNet) we note that the wrapper transforms the data sequentially, applying a number of
MATLAB functions as specified by the network configuration. These functions, discussed in
detail in chapter 4, are called “building blocks” and constitute the backbone of MatCon-
vNet.
While most blocks implement simple operations, what makes them non-trivial is their efficiency (section 1.4) as well as support for backpropagation (section 2.3) to allow learning CNNs. Next, we demonstrate how to use one such building block directly. For the sake of the example, consider convolving an image with a bank of linear filters. Start by reading an image in MATLAB, say using im = single(imread('peppers.png')), obtaining a H × W × D array im, where D = 3 is the number of colour channels in the image. Then create a bank of K = 16 random filters of size 3 × 3 using f = randn(3,3,3,16,'single'). Finally, convolve the image with the filters by using the command y = vl_nnconv(im,f,[]). This results in an array y with K channels, one for each of the K filters in the bank.
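Putting these steps together (a sketch, assuming MatConvNet has been set up as in fig. 1.1):

% Convolve an image with a bank of K = 16 random linear filters using
% the vl_nnconv building block, and visualise the K response maps.
im = single(imread('peppers.png')) ;
f = randn(3,3,3,16,'single') ;
y = vl_nnconv(im, f, []) ;            % one output channel per filter
figure(1) ; clf ;
for k = 1:16
  subplot(4,4,k) ; imagesc(y(:,:,k)) ; axis off ;
end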
6 http://devblogs.nvidia.com/parallelforall/deep-learning-image-understanding-planetary-scien
7 http://www.vlfeat.org/matconvnet/
8 http://www.vlfeat.org/matconvnet/matconvnet-manual.pdf
9 An example laboratory experience based on MatConvNet can be downloaded from http://www.robots.ox.ac.uk/~vgg/practicals/cnn/index.html.
10 http://www.github.com/matconvnet

% install and compile MatConvNet (run once)
untar(['http://www.vlfeat.org/matconvnet/download/' ...
  'matconvnet-1.0-beta12.tar.gz']) ;
cd matconvnet-1.0-beta12
run matlab/vl_compilenn

% download a pre-trained CNN from the web (run once)
urlwrite(...
  'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-f.mat', ...
  'imagenet-vgg-f.mat') ;

% setup MatConvNet
run matlab/vl_setupnn

% load the pre-trained CNN
net = load('imagenet-vgg-f.mat') ;

% load and preprocess an image
im = imread('peppers.png') ;
im_ = imresize(single(im), net.meta.normalization.imageSize(1:2)) ;
im_ = im_ - net.meta.normalization.averageImage ;

% run the CNN
res = vl_simplenn(net, im_) ;

% show the classification result, e.g. "bell pepper (946), score 0.704"
scores = squeeze(gather(res(end).x)) ;
[bestScore, best] = max(scores) ;
figure(1) ; clf ; imagesc(im) ;
title(sprintf('%s (%d), score %.3f', ...
  net.classes.description{best}, best, bestScore)) ;

Figure 1.1: A complete example including download, installing, compiling and running MatConvNet to classify one of MATLAB's stock images using a large CNN pre-trained on ImageNet.

While users are encouraged to make use of the blocks directly to create new architectures, MatConvNet provides wrappers such as vl_simplenn for standard CNN architectures such as AlexNet [7] or Network-in-Network [8]. Furthermore, the library provides numerous examples (in the examples/ subdirectory), including code to learn a variety of models on the MNIST, CIFAR, and ImageNet datasets. All these examples use the examples/cnn_train training code, which is an implementation of stochastic gradient descent (section 3.3). While this training code is perfectly serviceable and quite flexible, it remains in the examples/ subdirectory as it is somewhat problem-specific. Users are welcome to implement their own optimisers.

1.2 MatConvNet at a glance


MatConvNet has a simple design philosophy. Rather than wrapping CNNs around complex
layers of software, it exposes simple functions to compute CNN building blocks, such as linear
convolution and ReLU operators, directly as MATLAB commands. These building blocks are
easy to combine into complete CNNs and can be used to implement sophisticated learning
algorithms. While several real-world examples of small and large CNN architectures and
training routines are provided, it is always possible to go back to the basics and build your
own, using the efficiency of MATLAB in prototyping. Often no C coding is required at all
to try new architectures. As such, MatConvNet is an ideal playground for research in
computer vision and CNNs.
MatConvNet contains the following elements:

• CNN computational blocks. A set of optimized routines computing fundamental building blocks of a CNN. For example, a convolution block is implemented by y=vl_nnconv(x,f,b) where x is an image, f a filter bank, and b a vector of biases (section 4.1). The derivatives are computed as [dzdx,dzdf,dzdb] = vl_nnconv(x,f,b,dzdy) where dzdy is the derivative of the CNN output w.r.t. y (section 4.1); see the sketch after this list. chapter 4 describes all the blocks in detail.

• CNN wrappers. MatConvNet provides a simple wrapper, suitably invoked by


vl_simplenn, that implements a CNN with a linear topology (a chain of blocks). It also
provides a much more flexible wrapper supporting networks with arbitrary topologies,
encapsulated in the dagnn.DagNN MATLAB class.

• Example applications. MatConvNet provides several examples of learning CNNs with
stochastic gradient descent, on CPU or GPU, on the MNIST, CIFAR10, and ImageNet
data.

• Pre-trained models. MatConvNet provides several state-of-the-art pre-trained CNN


models that can be used off-the-shelf, either to classify images or to produce image
encodings in the spirit of Caffe or DeCAF.
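
As a minimal sketch of the block interface mentioned in the first item above (the tensor
sizes are arbitrary example values), a convolution block can be run forward and backward
as follows:

% forward and backward (projected derivative) passes through one block
x = randn(16,16,3,'single') ;       % input image (H x W x D)
f = randn(5,5,3,8,'single') ;       % bank of 8 filters
b = zeros(8,1,'single') ;           % one bias per filter
y = vl_nnconv(x, f, b) ;            % forward mode
dzdy = randn(size(y),'single') ;    % derivative of the output w.r.t. y
[dzdx, dzdf, dzdb] = vl_nnconv(x, f, b, dzdy) ;  % backward mode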

[Plot: top-1 and top-5 validation error (0.2–0.9) vs. training epoch (0–60), comparing dropout and batch normalisation.]
Figure 1.2: Training AlexNet on ImageNet ILSVRC: dropout vs batch normalisation.

1.3 Documentation and examples


There are three main sources of information about MatConvNet. First, the website con-
tains descriptions of all the functions and several examples and tutorials.11 Second, there
is a PDF manual containing a great deal of technical details about the toolbox, including
detailed mathematical descriptions of the building blocks. Third, MatConvNet ships with
several examples (section 1.1).
Most examples are fully self-contained. For example, in order to run the MNIST example,
it suffices to point MATLAB to the MatConvNet root directory and type addpath
examples followed by cnn_mnist. Due to the problem size, the ImageNet ILSVRC example
requires some more preparation, including downloading and preprocessing the images (using
the bundled script utils/preprocess-imagenet.sh). Several advanced examples are included
as well. For example, fig. 1.2 illustrates the top-1 and top-5 validation errors as a model
similar to AlexNet [7] is trained using either standard dropout regularisation or the recent
batch normalisation technique of [3]. The latter is shown to converge in about one third of
the epochs (passes through the training data) required by the former.
The MatConvNet website also contains numerous pre-trained models, i.e. large CNNs
trained on ImageNet ILSVRC that can be downloaded and used as a starting point for many
other problems [1]. These include: AlexNet [7], VGG-F, VGG-M, VGG-S [1], VGG-VD-16,
and VGG-VD-19 [10]. The example code of fig. 1.1 shows how one such model can be
used in a few lines of MATLAB code.

11. See also http://www.robots.ox.ac.uk/~vgg/practicals/cnn/index.html.

model        batch sz.   CPU    GPU     CuDNN

AlexNet      256         22.1   192.4   264.1
VGG-F        256         21.4   211.4   289.7
VGG-M        128          7.8   116.5   136.6
VGG-S        128          7.4    96.2   110.1
VGG-VD-16     24          1.7    18.4    20.0
VGG-VD-19     24          1.5    15.7    16.5

Table 1.1: ImageNet training speed (images/s).

1.4 Speed
Efficiency is very important for working with CNNs. MatConvNet supports using NVIDIA
GPUs as it includes CUDA implementations of all algorithms (or relies on MATLAB CUDA
support).
To use the GPU (provided that suitable hardware is available and the toolbox has been
compiled with GPU support), one simply converts the arguments to gpuArrays in MATLAB,
as in y = vl_nnconv(gpuArray(x), gpuArray(w), []). In this manner, switching between CPU
and GPU is fully transparent. Note that MatConvNet can also make use of the NVIDIA
CuDNN library with significant speed and space benefits.
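For example, the classification code of fig. 1.1 can be moved to the GPU with minimal
changes (a sketch, assuming a GPU-enabled build; vl_simplenn_move is the support function
discussed in section 3.1.1):

% move the model to the GPU and run the CNN there
net = vl_simplenn_move(net, 'gpu') ;
res = vl_simplenn(net, gpuArray(im_)) ;
scores = squeeze(gather(res(end).x)) ;   % gather copies the result back to the CPU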
Next we evaluate the performance of MatConvNet when training large architectures
on the ImageNet ILSVRC 2012 challenge data [2]. The test machine is a Dell server with
two Intel Xeon CPU E5-2667 v2 clocked at 3.30 GHz (each CPU has eight cores), 256 GB
of RAM, and four NVIDIA Titan Black GPUs (only one of which is used unless otherwise
noted). Experiments use MatConvNet beta12, CuDNN v2, and MATLAB R2015a. The
data is preprocessed to avoid rescaling images on the fly in MATLAB and stored in a RAM
disk for faster access. The code uses the vl_imreadjpeg command to read large batches of
JPEG images from disk in a number of separate threads. The driver examples/cnn_imagenet.m
is used in all experiments.
We train the models discussed in section 1.3 on ImageNet ILSVRC. Table 1.1 reports
the training speed as the number of images per second processed by stochastic gradient descent.
AlexNet trains at about 264 images/s with CuDNN, which is about 40% faster than the
vanilla GPU implementation (using CuBLAS) and more than 10 times faster than using the
CPUs. Furthermore, we note that, despite MATLAB overhead, the implementation speed is
comparable to Caffe (they report 253 images/s with CuDNN and a Titan – a slightly slower
GPU than the Titan Black used here). Note also that, as the model grows in size, the size of
a SGD batch must be decreased (to fit in the GPU memory), increasing the overhead impact
somewhat.
Table 1.2 reports the speed on VGG-VD-16, a very large model, using multiple GPUs. In
this case, the batch size is set to 264 images. These are further divided into sub-batches of 22
images each to fit in the GPU memory; the latter are then distributed among one to four
GPUs on the same machine. While there is a substantial communication overhead, training
speed increases from 20 images/s to 45 images/s. Addressing this overhead is one of the medium-term
goals of the library.

num. GPUs        1      2      3      4
VGG-VD-16 speed  20.0   22.20  38.18  44.8

Table 1.2: Multiple GPU speed (images/s).

1.5 Acknowledgments
MatConvNet is a community project, and as such acknowledgements go to all contributors.
We kindly thank NVIDIA for supporting this project by providing us with top-of-the-line GPUs
and MathWorks for ongoing discussion on how to improve the library.
The implementation of several CNN computations in this library is inspired by the Caffe
library [5] (however, Caffe is not a dependency). Several of the example networks have been
trained by Karen Simonyan as part of [1] and [10].
Chapter 2

Neural Network Computations

This chapter provides a brief introduction to the computational aspects of neural networks,
and convolutional neural networks in particular, emphasizing the concepts required to un-
derstand and use MatConvNet.

2.1 Overview
A Neural Network (NN) is a function g mapping data x, for example an image, to an output
vector y, for example an image label. The function g = fL ◦ · · · ◦ f1 is the composition
of a sequence of simpler functions fl , which are called computational blocks or layers. Let
x1 , x2 , . . . , xL be the outputs of each layer in the network, and let x0 = x denote the network
input. Each intermediate output xl = fl (xl−1 ; wl ) is computed from the previous output xl−1
by applying the function fl with parameters wl .
In a Convolutional Neural Network (CNN), the data has a spatial structure: each x_l ∈
R^{H_l×W_l×C_l} is a 3D array or tensor, where the first two dimensions H_l (height) and W_l (width)
are interpreted as spatial dimensions. The third dimension C_l is instead interpreted as
the number of feature channels. Hence, the tensor x_l represents an H_l × W_l field of C_l-
dimensional feature vectors, one for each spatial location. A fourth dimension N_l in the
tensor spans multiple data samples packed in a single batch for efficient parallel processing.
The number of data samples N_l in a batch is called the batch cardinality. The network is
called convolutional because the functions f_l are local and translation invariant operators
(i.e. non-linear filters) like linear convolution.
It is also possible to conceive CNNs with more than two spatial dimensions, where the
additional dimensions may represent volume or time. In fact, there are few a-priori re-
strictions on the format of data in neural networks in general. Many useful NNs contain a
mixture of convolutional layers together with layers that process other data types such as text
strings, or perform other operations that do not strictly conform to the CNN assumptions.
MatConvNet includes a variety of layers, contained in the matlab/ directory, such
as vl_nnconv (convolution), vl_nnconvt (convolution transpose or deconvolution), vl_nnpool
(max and average pooling), vl_nnrelu (ReLU activation), vl_nnsigmoid (sigmoid activation),
vl_nnsoftmax (softmax operator), vl_nnloss (classification log-loss), vl_nnbnorm (batch nor-
malization), vl_nnspnorm (spatial normalization), vl_nnnormalize (local response normal-
ization – LRN), or vl_nnpdist (p-distance). There are enough layers to implement many
interesting state-of-the-art networks out of the box, or even import them from other tool-
boxes such as Caffe.
NNs are often used as classifiers or regressors. In the example of fig. 1.1, the output
ŷ = f(x) is a vector of probabilities, one for each of 1,000 possible image labels (dog, cat,
trilobite, ...). If y is the true label of image x, we can measure the CNN performance by a
loss function ℓ_y(ŷ) ∈ R which assigns a penalty to classification errors. The CNN parameters
can then be tuned or learned to minimize this loss averaged over a large dataset of labelled
example images.
Learning generally uses a variant of stochastic gradient descent (SGD). While this is an
efficient method (for this type of problem), networks may contain several million parameters
and need to be trained on millions of images; thus, efficiency is paramount in the design of
MatConvNet, as further discussed in section 1.4. SGD also requires computing the CNN
derivatives, as explained in the next section.

2.2 Network structures


In the simplest case, layers in a NN are arranged in a sequence; however, more complex
interconnections are possible as well, and in fact very useful in many cases. This section
discusses such configurations and introduces a graphical notation to visualize them.

2.2.1 Sequences
Start by considering a computational block f in the network. This can be represented
schematically as a box receiving data x and parameters w as inputs and producing data y
as output:

[diagram: data x and parameter w enter the block f, which outputs data y]

As seen above, in the simplest case blocks are chained in a sequence f1 → f2 → · · · → fL,
yielding the structure:

[diagram: x0 → f1 → x1 → f2 → x2 → ... → fL → xL, with parameters w1, ..., wL entering the corresponding blocks]

Given an input x0 , evaluating the network is a simple matter of evaluating all the blocks
from left to right, which defines a composite function xL = f (x0 ; w1 , . . . , wL ).

[diagram: an example DAG with functions f1, ..., f5, variables x0, ..., x7, and parameters w1, w2, w4, w5; x0 and x4 are inputs, x6 and x7 outputs]
Figure 2.1: Example DAG.

2.2.2 Directed acyclic graphs


One is not limited to chaining layers one after another. In fact, the only requirement for
evaluating a NN is that, when a layer has to be evaluated, all its inputs have been evaluated
prior to it. This is possible exactly when the interconnections between layers form a directed
acyclic graph, or DAG for short.
In order to visualize DAGs, it is useful to introduce additional nodes for the network
variables, as in the example of Fig. 2.1. Here boxes denote functions and circles denote
variables (parameters are treated as a special kind of variable). In the example, x0 and x4
are the inputs of the CNN and x6 and x7 the outputs. Functions can take any number of
inputs (e.g. f3 and f5 take two) and have any number of outputs (e.g. f4 has two). There
are a few noteworthy properties of this graph:

1. The graph is bipartite, in the sense that arrows always go from boxes to circles and
from circles to boxes.

2. Functions can have any number of inputs or outputs; variables and parameters can
have an arbitrary number of outputs (a parameter with more than one output is shared
between different layers); variables have at most one input and parameters none.

3. Variables with no incoming arrows and parameters are not computed by the network,
but must be set prior to evaluation, i.e. they are inputs. Any variable (or even param-
eter) may be used as output, although these are usually the variables with no outgoing
arrows.

4. Since the graph is acyclic, the CNN can be evaluated by sorting the functions and
computing them one after another (in the example, evaluating the functions in the
order f1 , f2 , f3 , f4 , f5 would work).

2.3 Computing derivatives with backpropagation


Learning a NN requires computing the derivative of the loss with respect to the network
parameters. Derivatives are computed using an algorithm called backpropagation, which is
a memory-efficient implementation of the chain rule for derivatives. First, we discuss the
derivatives of a single layer, and then of a whole network.

2.3.1 Derivatives of tensor functions


In a CNN, a layer is a function y = f(x) where both input x ∈ R^{H×W×C} and output
y ∈ R^{H'×W'×C'} are tensors. The derivative of the function f contains the derivative of
each output component y_{i'j'k'} with respect to each input component x_{ijk}, for a total of
H' × W' × C' × H × W × C elements, naturally arranged in a 6D tensor. Instead of expressing
derivatives as tensors, it is often useful to switch to a matrix notation by stacking the input
and output tensors into vectors. This is done by the vec operator, which visits each element
of a tensor in lexicographical order and produces a vector:

vec x = (x_{111}, x_{211}, ..., x_{H11}, x_{121}, ..., x_{HWC})^⊤.
By stacking both input and output, each layer f can be reinterpreted as a vector function
vec f, whose derivative is the conventional Jacobian matrix:

d vec f / d(vec x)^⊤ =
⎡ ∂y_{111}/∂x_{111}      ∂y_{111}/∂x_{211}      ⋯  ∂y_{111}/∂x_{HWC}    ⎤
⎢ ∂y_{211}/∂x_{111}      ∂y_{211}/∂x_{211}      ⋯  ∂y_{211}/∂x_{HWC}    ⎥
⎢         ⋮                       ⋮              ⋱          ⋮           ⎥
⎣ ∂y_{H'W'C'}/∂x_{111}   ∂y_{H'W'C'}/∂x_{211}   ⋯  ∂y_{H'W'C'}/∂x_{HWC} ⎦
This notation for the derivatives of tensor functions is taken from [6] and is used throughout
this document.

While it is easy to express the derivatives of tensor functions as matrices, these matrices
are in general extremely large. Even for moderate data sizes (e.g. H = H' = W = W' =
32 and C = C' = 128), there are H'W'C' × HWC ≈ 17 × 10^9 elements in the Jacobian.
Storing it requires 68 GB of space in single precision. The purpose of the backpropagation
algorithm is to compute the derivatives required for learning without incurring this huge
memory cost.

2.3.2 Derivatives of function compositions


In order to understand backpropagation, consider first a simple CNN terminating in a loss
function f_L = ℓ_y:

[diagram: x0 → f1 → x1 → f2 → ... → fL → xL ∈ R, with parameters w1, ..., wL entering the corresponding blocks]

The goal is to compute the gradient of the loss value xL (output) with respect to each network
parameter w_l:

df / d(vec w_l)^⊤ = d/d(vec w_l)^⊤ [ f_L(·; w_L) ∘ ... ∘ f_2(·; w_2) ∘ f_1(x_0; w_1) ].

By applying the chain rule and by using the matrix notation introduced above, the derivative
can be written as

df / d(vec w_l)^⊤ = [d vec f_L(x_{L−1}; w_L) / d(vec x_{L−1})^⊤] × ⋯ × [d vec f_{l+1}(x_l; w_{l+1}) / d(vec x_l)^⊤] × [d vec f_l(x_{l−1}; w_l) / d(vec w_l)^⊤]    (2.1)

where the derivatives are computed at the working point determined by the input x0 and the
current value of the parameters.
Note that, since the network output xL is a scalar quantity, the target derivative
df / d(vec w_l)^⊤ has the same number of elements as the parameter vector w_l, which is moder-
ate. However, the intermediate Jacobian factors have, as seen above, an unmanageable size.
In order to avoid computing these factors explicitly, we can proceed as follows.
Start by multiplying the output of the last layer by a tensor pL = 1 (note that this tensor
is a scalar just like the variable xL):

pL × df / d(vec w_l)^⊤ = pL × [d vec f_L(x_{L−1}; w_L) / d(vec x_{L−1})^⊤] × ⋯ × [d vec f_{l+1}(x_l; w_{l+1}) / d(vec x_l)^⊤] × [d vec f_l(x_{l−1}; w_l) / d(vec w_l)^⊤]
                       = (vec p_{L−1})^⊤ × ⋯ × [d vec f_{l+1}(x_l; w_{l+1}) / d(vec x_l)^⊤] × [d vec f_l(x_{l−1}; w_l) / d(vec w_l)^⊤]
In the second line the two leftmost factors have been multiplied, obtaining a new
tensor p_{L−1} that has the same size as the variable x_{L−1}. The factor p_{L−1} can therefore be

explicitly stored. The construction is then repeated by multiplying pairs of factors from left
to right, obtaining a sequence of tensors pL−2 , . . . , pl until the desired derivative is obtained.
Note that, in doing so, no large tensor is ever stored in memory. This process is known as
backpropagation.
In general, the tensor p_l is obtained from p_{l+1} as the product:

(vec p_l)^⊤ = (vec p_{l+1})^⊤ × d vec f_{l+1}(x_l; w_{l+1}) / d(vec x_l)^⊤.

The key to implementing backpropagation is to be able to compute these products without
explicitly computing and storing in memory the second factor, which is a large Jacobian
matrix. Since computing the derivative is a linear operation, this product can be interpreted
as the derivative of the layer projected along direction p_{l+1}:

p_l = d⟨p_{l+1}, f_{l+1}(x_l; w_{l+1})⟩ / dx_l.    (2.2)

Here ⟨·,·⟩ denotes the inner product between tensors, which results in a scalar quantity.
Hence the derivative (2.2) need not use the vec notation, and yields a tensor p_l that has
the same size as x_l, as expected.
In order to implement backpropagation, a CNN toolbox provides implementations of each
layer f that provide:

• A forward mode, computing the output y = f (x; w) of the layer given its input x
and parameters w.

• A backward mode, computing the projected derivatives

d⟨p, f(x; w)⟩/dx   and   d⟨p, f(x; w)⟩/dw,

given, in addition to the input x and parameters w, a tensor p that has the same size as y.

This is best illustrated with an example. Consider a layer f such as the convolution operator
implemented by the MatConvNet vl_nnconv command. In the “forward” mode, one calls
the function as y = vl_nnconv(x,w,[]) to apply the filters w to the input x and obtain the
output y. In the “backward mode”, one calls [dx, dw] = vl_nnconv(x,w,[],p). As explained
above, dx, dw, and p have the same size as x, w, and y, respectively. The computation of the large
Jacobian is encapsulated in the function call and never carried out explicitly.
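
Since convolution is linear in x, the projected derivative returned by the backward mode
can be checked, up to floating point rounding, against a finite difference of the forward
mode. A minimal sketch, with arbitrary example sizes:

% check d<p, f(x;w)>/dx from the backward mode against finite differences
x = randn(8,8,2,'single') ;
w = randn(3,3,2,4,'single') ;
y = vl_nnconv(x, w, []) ;
p = randn(size(y),'single') ;          % projection tensor, same size as y
dx = vl_nnconv(x, w, [], p) ;          % backward mode: projected derivative
eta = 1e-2 ;
dir = randn(size(x),'single') ;        % random perturbation direction
y1 = vl_nnconv(x + eta*dir, w, []) ;
lhs = p(:)' * (y1(:) - y(:)) / eta ;   % finite-difference directional derivative
rhs = dx(:)' * dir(:) ;                % analytic directional derivative <dx, dir>
assert(abs(lhs - rhs) < 1e-2 * max(1, abs(rhs)))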

2.3.3 Backpropagation networks


In this section, we provide a schematic interpretation of backpropagation and show how it
can be implemented by “reversing” the NN computational graph.
The projected derivative of eq. (2.2) can be seen as the derivative of the following mini-
network:

[diagram: x and w feed into f, producing y; y and p feed into an inner-product block ⟨·,·⟩, producing z ∈ R]

In the context of back-propagation, it can be useful to think of the projection p as the
“linearization” of the rest of the network from variable y down to the loss. The projected
derivative can also be thought of as a new layer (dx, dw) = df(x, w, p) that, by computing
the derivative of the mini-network, operates in the reverse direction:

[diagram: x, w, and p feed into the block df, producing dx and dw]

By construction (see eq. (2.2)), the function df is linear in the argument p.


Using this notation, the forward and backward passes through the original network can
be rewritten as evaluating an extended network which contains a BP-reverse of the original
one:

[diagram: forward chain x0 → f1 → x1 → ... → fL → xL with parameters w1, ..., wL, and below it the reversed chain dpL → dfL → dx_{L−1} → ... → df1 → dx0, where each dfl also outputs dwl]

2.3.4 Backpropagation in DAGs


Assume that the DAG has a single output variable xL and assume, without loss of generality,
that all variables are sorted in order of computation (x0 , x1 , . . . , xL−1 , xL ) according to the

DAG structure. Furthermore, in order to simplify the notation, assume that this list contains
both data and parameter variables, as the distinction is moot for the discussion in this section.
We can cut the DAG at any point in the sequence by fixing x0 , . . . , xl−1 to some arbitrary
value and dropping all the DAG layers that feed into them, effectively transforming the first l
variables into inputs. Then, the rest of the DAG defines a function hl that maps these input
variables to the output xL :
xL = hl (x0 , x1 , . . . , xl−1 ).
Next, we show that backpropagation in a DAG iteratively computes the projected derivatives
of all functions h1 , . . . , hL with respect to all their parameters.
Backpropagation starts by initializing variables (dx0, ..., dx_{L−1}) to null tensors of the
same size as (x0, ..., x_{L−1}). Next, it computes the projected derivatives of

xL = hL(x0, x1, ..., x_{L−1}) = f_{πL}(x0, x1, ..., x_{L−1}).

Here π_l denotes the index of the layer f_{π_l} that computes the value of the variable x_l. There
is at most one such layer, or none if x_l is an input or parameter of the original NN. In the
first case, the layer may depend on any of the variables prior to x_l in the sequence, so that
in general one has:

x_l = f_{π_l}(x0, ..., x_{l−1}).

At the beginning of backpropagation, since there are no intermediate variables between x_{L−1}
and xL, the function hL is the same as the last layer f_{πL}. Thus the projected derivatives of
hL are the same as the projected derivatives of f_{πL}, resulting in the equation

∀t = 0, ..., L−1:   dx_t ← dx_t + d⟨pL, f_{πL}(x0, ..., x_{L−1})⟩ / dx_t.
Here, for uniformity with the other iterations, we use the fact that the dx_l are initialized to zero
and accumulate the values instead of storing them. In practice, the update operation needs to
be carried out only for the variables x_l that are actual inputs to f_{πL}, which is often a tiny
fraction of all the variables in the DAG.
After the update, each dx_t contains the projected derivative of function hL with respect
to the corresponding variable:

∀t = 0, ..., L−1:   dx_t = d⟨pL, hL(x0, ..., x_{L−1})⟩ / dx_t.
Given this information, the next iteration of backpropagation updates the variables to con-
tain the projected derivatives of h_{L−1} instead. In general, given the derivatives of h_{l+1},
backpropagation computes the derivatives of h_l by using the relation

xL = h_l(x0, x1, ..., x_{l−1}) = h_{l+1}(x0, x1, ..., x_{l−1}, f_{π_l}(x0, ..., x_{l−1})).

Applying the chain rule to this expression, for all 0 ≤ t ≤ l − 1:

d⟨pL, h_l⟩ / d(vec x_t)^⊤ = d⟨pL, h_{l+1}⟩ / d(vec x_t)^⊤ + [d⟨pL, h_{l+1}⟩ / d(vec x_l)^⊤] × [d vec f_{π_l} / d(vec x_t)^⊤],

where the factor in square brackets on the left of the product equals (vec dx_l)^⊤.

This yields the update equation

∀t = 0, ..., l−1:   dx_t ← dx_t + d⟨p_l, f_{π_l}(x0, ..., x_{l−1})⟩ / dx_t,   where p_l = dx_l.    (2.3)

Once more, the update needs to be explicitly carried out only for the variables x_t that are
actual inputs of f_{π_l}. In particular, if x_l is a data input or a parameter of the original neural
network, then x_l does not depend on any other variables or parameters and f_{π_l} is a nullary
function (i.e. a function with no arguments). In this case, the update does not do anything.
After iteration L − l + 1 completes, backpropagation is left with:

∀t = 0, ..., l−1:   dx_t = d⟨pL, h_l(x0, ..., x_{l−1})⟩ / dx_t.

Note that the derivatives for variables x_t, l ≤ t ≤ L − 1 are not updated, since h_l does not
depend on any of those. Thus, after all L iterations are complete, backpropagation terminates
with

∀l = 1, ..., L:   dx_{l−1} = d⟨pL, h_l(x0, ..., x_{l−1})⟩ / dx_{l−1}.

As seen above, the functions h_l are obtained from the original network f by transforming the variables
x0, ..., x_{l−1} into inputs. If x_{l−1} was already an input (data or parameter) of f, then the
derivative dx_{l−1} is applicable to f as well.
Backpropagation can be summarized as follows:

Given: a DAG neural network f with a single output xL, the values of all input variables
(including the parameters), and the value of the projection pL (usually xL is a scalar
and pL = 1):

1. Sort all variables by computation order (x0, x1, ..., xL) according to the DAG.

2. Perform a forward pass through the network to compute all the intermediate vari-
able values.

3. Initialize (dx0, ..., dx_{L−1}) to null tensors with the same size as the corresponding
variables.

4. For l = L, L−1, ..., 2, 1:

   a) Find the index π_l of the layer x_l = f_{π_l}(x0, ..., x_{l−1}) that evaluates variable x_l.
   If there is no such layer (because x_l is an input or parameter of the network),
   go to the next iteration.

   b) Update the variables using the formula:

      ∀t = 0, ..., l−1:   dx_t ← dx_t + d⟨dx_l, f_{π_l}(x0, ..., x_{l−1})⟩ / dx_t.

   To do so efficiently, use the “backward mode” of the layer f_{π_l} to compute its
   derivative projected onto dx_l as needed.

2.3.5 DAG backpropagation networks


Just like for sequences, backpropagation in DAGs can be implemented as a corresponding
BP-reversed DAG. To construct the reversed DAG:

1. For each layer fl , and variable/parameter xt and wl , create a corresponding layer dfl
and variable/parameter dxt and dwl .

2. If a variable xt (or parameter wl ) is an input of fl , then it is an input of dfl as well.

3. If a variable xt (or parameter wl ) is an input of fl , then the variable dxt (or the
parameter dwl ) is an output of dfl .

4. In the previous step, if a variable xt (or parameter wl ) is input to two or more layers in
f , then dxt would be the output of two or more layers in the reversed network, which
creates a conflict. Resolve these conflicts by inserting a summation layer that adds
these contributions (this corresponds to the summation in the BP update equation
(2.3)).

The BP network corresponding to the DAG of Fig. 2.1 is given in Fig. 2.2.

[diagram: the DAG of Fig. 2.1 (functions f1, ..., f5, variables x0, ..., x7, parameters w1, w2, w4, w5) together with its BP-reverse (blocks df1, ..., df5, derivative variables dx0, ..., dx5, dw1, dw2, dw4, dw5, and projections p6, p7), with summation nodes Σ resolving variables that feed multiple layers]
Figure 2.2: Backpropagation network for a DAG.


Chapter 3

Wrappers and pre-trained models

It is easy enough to combine the computational blocks of chapter 4 “manually”. However, it


is usually much more convenient to use them through a wrapper that can implement CNN
architectures given a model specification. The available wrappers are briefly summarised in
section 3.1.
MatConvNet also comes with many pre-trained models for image classification (most
of which are trained on the ImageNet ILSVRC challenge), image segmentation, text spotting,
and face recognition. These are very simple to use, as illustrated in section 3.2.

3.1 Wrappers
MatConvNet provides two wrappers: SimpleNN for basic chains of blocks (section 3.1.1)
and DagNN for blocks organized in more complex directed acyclic graphs (section 3.1.2).

3.1.1 SimpleNN
The SimpleNN wrapper is suitable for networks consisting of linear chains of computational
blocks. It is largely implemented by the vl_simplenn function (evaluation of the CNN and of
its derivatives), with a few other support functions such as vl_simplenn_move (moving the
CNN between CPU and GPU) and vl_simplenn_display (obtain and/or print information
about the CNN).
vl_simplenn takes as input a structure net representing the CNN as well as input x and
potentially output derivatives dzdy, depending on the mode of operation. Please refer to the
inline help of the vl_simplenn function for details on the input and output formats. In fact,
the implementation of vl_simplenn is a good example of how the basic neural net building
blocks can be used together and can serve as a basis for more complex implementations.

3.1.2 DagNN
The DagNN wrapper is more complex than SimpleNN as it has to support arbitrary graph
topologies. Its design is object oriented, with one class implementing each layer type. While
this adds complexity, and makes the wrapper slightly slower for tiny CNN architectures (e.g.
MNIST), it is in practice much more flexible and easier to extend.


DagNN is implemented by the dagnn.DagNN class (under the dagnn namespace).

3.2 Pre-trained models


vl_simplenn is easy to use with pre-trained models (see the homepage to download some).
For example, the following code downloads a model pre-trained on the ImageNet data and
applies it to one of MATLAB's stock images:

% setup MatConvNet in MATLAB
run matlab/vl_setupnn

% download a pre-trained CNN from the web
urlwrite(...
  'http://www.vlfeat.org/matconvnet/models/imagenet-vgg-f.mat', ...
  'imagenet-vgg-f.mat') ;
net = load('imagenet-vgg-f.mat') ;

% obtain and preprocess an image
im = imread('peppers.png') ;
im_ = single(im) ; % note: 255 range
im_ = imresize(im_, net.meta.normalization.imageSize(1:2)) ;
im_ = im_ - net.meta.normalization.averageImage ;

Note that the image should be preprocessed before running the network. While the preprocessing
specifics depend on the model, the pre-trained model contains a net.meta.normalization
field that describes the type of preprocessing that is expected. Note in particular that this
network takes images of a fixed size as input and requires removing the mean; also, image
intensities are normalized in the range [0,255].
The next step is running the CNN. This will return a res structure with the output of
the network layers:
% run the CNN
res = vl_simplenn(net, im_) ;

The output of the last layer can be used to classify the image. The class names are
contained in the net structure for convenience:
% show the classification result
scores = squeeze(gather(res(end).x)) ;
[bestScore, best] = max(scores) ;
figure(1) ; clf ; imagesc(im) ;
title(sprintf('%s (%d), score %.3f',...
net.meta.classes.description{best}, best, bestScore)) ;

Note that several extensions are possible. First, images can be cropped rather than
rescaled. Second, multiple crops can be fed to the network and results averaged, usually for
improved results. Third, the output of the network can be used as generic features for image
encoding.

3.3 Learning models


As MatConvNet can compute derivatives of the CNN using backpropagation, it is simple
to implement learning algorithms with it. A basic implementation of stochastic gradient
descent is therefore straightforward. Example code is provided in examples/cnn_train.
This code is flexible enough to allow training on MNIST, CIFAR, ImageNet, and probably
many other datasets. Corresponding examples are provided in the examples/ directory.
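
To make this concrete, here is a minimal SGD sketch for a SimpleNN network whose last
layer is a loss. This is illustrative only: getBatch, numEpochs, numBatches, and learningRate
are hypothetical placeholders, the field and indexing conventions may differ between versions,
and examples/cnn_train adds momentum, weight decay, GPU support, and much more.

for epoch = 1:numEpochs
  for t = 1:numBatches
    [im, labels] = getBatch(t) ;        % hypothetical data loader
    net.layers{end}.class = labels ;    % pass ground truth to the loss layer
    res = vl_simplenn(net, im, 1) ;     % forward + backward pass (dzdy = 1)
    % gradient step on every parameter of every layer
    for l = 1:numel(net.layers)
      for j = 1:numel(res(l).dzdw)
        net.layers{l}.weights{j} = net.layers{l}.weights{j} ...
            - learningRate * res(l).dzdw{j} ;
      end
    end
  end
end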

3.4 Running large scale experiments


For large scale experiments, such as learning a network for ImageNet, an NVIDIA GPU (with at
least 6GB of memory) and adequate CPU and disk speeds are highly recommended. For
example, to train on ImageNet, we suggest the following:

• Download the ImageNet data from http://www.image-net.org/challenges/LSVRC. In-
stall it somewhere and link to it from data/imagenet12.

• Consider preprocessing the data to convert all images to have a height of 256 pixels.
This can be done with the supplied utils/preprocess-imagenet.sh script. In this
manner, training will not have to resize the images every time. Do not forget to point
the training code to the pre-processed data.

• Consider copying the dataset into a RAM disk (provided that you have enough memory)
for faster access. Do not forget to point the training code to this copy.

• Compile MatConvNet with GPU support. See the homepage for instructions.

Once your setup is ready, you should be able to run examples/cnn_imagenet (edit the
file and change any flag as needed to enable GPU support and image pre-fetching on multiple
threads).
If all goes well, you should expect to be able to train with 200-300 images/sec.
Chapter 4

Computational blocks

This chapter describes the individual computational blocks supported by MatConvNet.

The interface of a CNN computational block <block> is designed after the discussion in
chapter 2. The block is implemented as a MATLAB function y = vl_nn<block>(x,w) that
takes as input MATLAB arrays x and w representing the input data and parameters and
returns an array y as output. In general, x and y are 4D real arrays packing N maps or
images, as discussed above, whereas w may have an arbitrary shape.
The function implementing each block is capable of working in the backward direction
as well, in order to compute derivatives. This is done by passing a third optional argument
dzdy representing the derivative of the output of the network with respect to y; in this case,
the function returns the derivatives [dzdx,dzdw] = vl_nn<block>(x,w,dzdy) with respect to
the input data and parameters. The arrays dzdx, dzdy and dzdw have the same dimensions
as x, y and w respectively (see section 2.3).
Different functions may use a slightly different syntax, as needed: many functions can
take additional optional arguments, specified as property-value pairs; some do not have
parameters w (e.g. a rectified linear unit); others can take multiple inputs and parameters, in
which case there may be more than one x, w, dzdx, dzdy or dzdw. See the rest of the chapter
and MATLAB inline help for details on the syntax.1
The rest of the chapter describes the blocks implemented in MatConvNet, with a
particular focus on their analytical definition. Refer instead to MATLAB inline help for
further details on the syntax.

4.1 Convolution
The convolutional block is implemented by the function vl_nnconv. y=vl_nnconv(x,f,b) com-
putes the convolution of the input map x with a bank of K multi-dimensional filters f and
biases b. Here

x ∈ R^{H×W×D},   f ∈ R^{H'×W'×D×D''},   y ∈ R^{H''×W''×D''}.
1. Other parts of the library will wrap these functions into objects with a perfectly uniform interface; however, the low-level functions aim at providing a straightforward and obvious interface even if this means differing slightly from block to block.


[diagram: 1D convolution of a signal x by a 4-element filter applied with stride 2; padded regions P− = 2 and P+ = 3 at the two ends]

Figure 4.1: Convolution. The figure illustrates the process of filtering a 1D signal x by a
filter f to obtain a signal y. The filter has H' = 4 elements and is applied with a stride of
S_h = 2 samples. The purple areas represent the padding P− = 2 and P+ = 3, which is zero-
filled. Filters are applied in a sliding-window manner across the input signal. The samples of
x involved in the calculation of a sample of y are shown with arrows. Note that the rightmost
sample of x is never processed by any filter application due to the sampling step. While in
this case the sample is in the padded region, this can happen also without padding.

The process of convolving a signal is illustrated in fig. 4.1 for a 1D slice. Formally, the output
is given by

y_{i''j''d''} = b_{d''} + Σ_{i'=1}^{H'} Σ_{j'=1}^{W'} Σ_{d'=1}^{D} f_{i'j'd'd''} × x_{i''+i'−1, j''+j'−1, d'}.

The call vl_nnconv(x,f,[]) does not use the biases. Note that the function works with arbi-
trarily sized inputs and filters (as opposed to, for example, square images). See section 6.1
for technical details.

Padding and stride. vl_nnconv allows specifying top-bottom-left-right paddings
(P_h^−, P_h^+, P_w^−, P_w^+) of the input array and subsampling strides (S_h, S_w) of the output array:

y_{i''j''d''} = b_{d''} + Σ_{i'=1}^{H'} Σ_{j'=1}^{W'} Σ_{d'=1}^{D} f_{i'j'd'd''} × x_{S_h(i''−1)+i'−P_h^−, S_w(j''−1)+j'−P_w^−, d'}.

In this expression, the array x is implicitly extended with zeros as needed.

Output size. vl_nnconv computes only the “valid” part of the convolution; i.e. it requires
each application of a filter to be fully contained in the input support. The size of the output
is computed in section 5.2 and is given by:

H'' = 1 + ⌊(H − H' + P_h^− + P_h^+) / S_h⌋.

Note that the padded input must be at least as large as the filters: H + P_h^− + P_h^+ ≥ H',
otherwise an error is thrown.
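
In MATLAB terms (a sketch; Hin, Hf, padTop, padBottom, and strideH are illustrative
variable names, not vl_nnconv parameters):

% output height of the "valid" convolution with padding and stride
Hout = 1 + floor((Hin - Hf + padTop + padBottom) / strideH) ;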

Receptive field size and geometric transformations. Very often it is useful to geo-
metrically relate the indexes of the various arrays to the input data (usually images) in terms
of coordinate transformations and the size of the receptive field (i.e. of the image region that
affects an output). This is derived in section 5.2.

Fully connected layers. In other libraries, fully connected blocks or layers are linear
functions where each output dimension depends on all the input dimensions. MatConvNet
does not distinguish between fully connected layers and convolutional blocks. Instead, the
former is a special case of the latter, obtained when the output map y has dimensions W'' =
H'' = 1. Internally, vl_nnconv handles this case more efficiently when possible.
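
For instance (a sketch with illustrative sizes), a fully connected layer mapping a 7 × 7 × 512
input to 4096 outputs is just a convolution whose filters span the whole spatial extent of
the input:

x = randn(7,7,512,'single') ;
f = randn(7,7,512,4096,'single') ;   % filters as large as the input
b = zeros(4096,1,'single') ;
y = vl_nnconv(x, f, b) ;             % y has size 1 x 1 x 4096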

Filter groups. For additional flexibility, vl_nnconv allows grouping the channels of the input
array x and applying different subsets of filters to each group. To use this feature, specify
as input a bank of D'' filters f ∈ R^{H'×W'×D'×D''} such that D' divides the number of input
dimensions D. These are treated as g = D/D' filter groups; the first group is applied to
dimensions d = 1, ..., D' of the input x; the second group to dimensions d = D'+1, ..., 2D',
and so on. Note that the output is still an array y ∈ R^{H''×W''×D''}.
An application of grouping is implementing the network of Krizhevsky et al. [7], which
uses two such streams. Another application is sum pooling; in the latter case, one can specify
D groups of D' = 1 dimensional identical filters of value 1 (however, this is considerably
slower than calling the dedicated pooling function as given in section 4.3).

4.2 Convolution transpose (deconvolution)


The convolution transpose block (sometimes referred to as “deconvolution”) is the transpose
of the convolution block described in section 4.1. In MatConvNet, convolution transpose
is implemented by the function vl_nnconvt.
In order to understand convolution transpose, let:

x ∈ R^{H×W×D},   f ∈ R^{H'×W'×D×D''},   y ∈ R^{H''×W''×D''},

be the input tensor, filters, and output tensors. Imagine operating in the reverse direction
by using the filter bank f to convolve the output y to obtain the input x, using the defini-
tions given in section 4.1 for the convolution operator; since convolution is linear, it can be
expressed as a matrix M such that vec x = M vec y; convolution transpose computes instead
vec y = M > vec x. This process is illustrated for a 1D slice in fig. 4.2.
There are two important applications of convolution transpose. The first is the
so-called deconvolutional networks [12] and other networks, such as convolutional decoders,
that use the transpose of a convolution. The second is implementing data interpolation.
In fact, as the convolution block supports input padding and output downsampling, the
convolution transpose block supports input upsampling and output cropping.
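
As a usage sketch (the 'upsample' and 'crop' option names and their argument formats
are assumptions to be checked against the inline help of vl_nnconvt), doubling the resolution
of a feature map might look like:

% upsample a feature map by a factor of two with convolution transpose
y = vl_nnconvt(x, f, [], 'upsample', [2 2], 'crop', [1 1 1 1]) ;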

[diagram: 1D convolution transpose of a signal x producing a signal y; cropped regions C− = 2 and C+ = 3 at the two ends]

Figure 4.2: Convolution transpose. The figure illustrates the process of filtering a 1D
signal x by a filter f to obtain a signal y. The filter is applied in a sliding-window manner, in a
pattern that is the transpose of fig. 4.1. The filter has H' = 4 samples in total, although each
filter application uses two of them (blue squares) in a circulant manner. The purple areas
represent crops with C− = 2 and C+ = 3 which are discarded. The samples of x involved in
the calculation of a sample of y are shown with arrows. Note that, differently from fig. 4.1,
there are no samples to the right of y which are involved in a convolution operation. This is
because the width H'' of the output y, which given H' can be determined up to U_h samples,
is selected to be the smallest possible.

Convolution transpose can be expressed in closed form in the following rather unwieldy
expression (derived in section 6.2):

y_{i''j''d''} = Σ_{d'=1}^{D} Σ_{i'=0}^{q(H',S_h)} Σ_{j'=0}^{q(W',S_w)} f_{1+S_h i'+m(i''+P_h^−, S_h), 1+S_w j'+m(j''+P_w^−, S_w), d'', d'} × x_{1−i'+q(i''+P_h^−, S_h), 1−j'+q(j''+P_w^−, S_w), d'}    (4.1)

where

m(k, S) = (k − 1) mod S,   q(k, n) = ⌊(k − 1)/n⌋,

(S_h, S_w) are the vertical and horizontal input upsampling factors, (P_h^−, P_h^+, P_w^−, P_w^+) the output
crops, and x and f are zero-padded as needed in the calculation. Note also that filter k is
stored as a slice f_{:,:,k,:} of the 4D tensor f.
The height of the output array y is given by

H'' = S_h(H − 1) + H' − P_h^− − P_h^+.

A similar formula holds true for the width. These formulas are derived in section 5.3 along
with an expression for the receptive field of the operator.
We now illustrate the action of convolution transpose in an example (see also fig. 4.2).
Consider a 1D slice in the vertical direction, assume that the crop parameters are zero,
and that S_h > 1. Consider the output sample y_{i''} where the index i'' is chosen such that
S_h divides i'' − 1; according to (4.1), this sample is obtained as a weighted summation of
x_{i''/S_h}, x_{i''/S_h − 1}, ... (note that the order is reversed). The weights are the filter elements f_1,
f_{S_h}, f_{2S_h}, ... subsampled with a step of S_h. Now consider computing the element y_{i''+1}; due to
the rounding in the quotient operation q(i'', S_h), this output sample is obtained as a weighted
combination of the same elements of the input x that were used to compute y_{i''}; however,
the filter weights are now shifted by one place to the right: f_2, f_{S_h+1}, f_{2S_h+1}, ... The same
is true for i''+2, i''+3, ... until we hit i''+S_h. Here the cycle restarts after shifting x to
the right by one place. Effectively, convolution transpose works as an interpolating filter.

4.3 Spatial pooling


vl_nnpool implements max and sum pooling. The max pooling operator computes the max-
imum response of each feature channel in an H' × W' patch

y_{i''j''d} = max_{1≤i'≤H', 1≤j'≤W'} x_{i''+i'−1, j''+j'−1, d},

resulting in an output of size y ∈ R^{H''×W''×D}, similar to the convolution operator of sec-
tion 4.1. Sum-pooling computes the average of the values instead:

y_{i''j''d} = (1 / W'H') Σ_{1≤i'≤H', 1≤j'≤W'} x_{i''+i'−1, j''+j'−1, d}.

Detailed calculation of the derivatives is provided in section 6.3.

Padding and stride. Similar to the convolution operator of section 4.1, vl_nnpool sup-
ports padding the input; however, the effect is different from padding in the convolutional
block as pooling regions straddling the image boundaries are cropped. For max pooling,
this is equivalent to extending the input data with −∞; for sum pooling, this is similar to
padding with zeros, but the normalization factor at the boundaries is smaller to account for
the smaller integration area.

4.4 Activation functions


MatConvNet supports the following activation functions:
• ReLU. vl_nnrelu computes the Rectified Linear Unit (ReLU):

y_{ijd} = max{0, x_{ijd}}.

• Sigmoid. vl_nnsigmoid computes the sigmoid:

y_{ijd} = σ(x_{ijd}) = 1 / (1 + e^{−x_{ijd}}).

See section 6.4 for implementation details.



4.5 Spatial bilinear resampling


vl_nnbilinearsampler uses bilinear interpolation to spatially warp the image according to
an input transformation grid. This operator works with an input image x, a grid g, and an
output image y as follows:

x ∈ R^{H×W×C},   g ∈ [−1, 1]^{2×H'×W'},   y ∈ R^{H'×W'×C}.

The same transformation is applied to all the feature channels in the input, as follows:

y_{i''j''c} = Σ_{i=1}^{H} Σ_{j=1}^{W} x_{ijc} max{0, 1 − |α_v g_{1i''j''} + β_v − i|} max{0, 1 − |α_u g_{2i''j''} + β_u − j|},    (4.2)

where, for each feature channel c, the output y_{i''j''c} at the location (i'', j'') is a weighted sum
of the input values x_{ijc} in the neighborhood of location (g_{1i''j''}, g_{2i''j''}). The weights, as given
in (4.2), correspond to performing bilinear interpolation. Furthermore, the grid coordinates
are expressed not in pixels, but relative to a reference frame that extends from −1 to 1 for
all spatial dimensions of the input image; this is given by choosing the coefficients as:

α_v = (H − 1)/2,   β_v = −(H + 1)/2,   α_u = (W − 1)/2,   β_u = −(W + 1)/2.
See section 6.5 for implementation details.

4.6 Normalization
4.6.1 Local response normalization (LRN)
vl_nnnormalize implements the Local Response Normalization (LRN) operator. This oper-
ator is applied independently at each spatial location and to groups of feature channels as
follows:

y_{ijk} = x_{ijk} (κ + α Σ_{t∈G(k)} x_{ijt}^2)^{−β},

where, for each output channel k, G(k) ⊂ {1, 2, ..., D} is a corresponding subset of input
channels. Note that input x and output y have the same dimensions. Note also that the
operator is applied uniformly at all spatial locations.
See section 6.6.1 for implementation details.

4.6.2 Batch normalization


vl_nnbnorm implements batch normalization [4]. Batch normalization is somewhat different
from other neural network blocks in that it performs computation across images/feature
maps in a batch (whereas most blocks process different images/feature maps individually).

y = vl_nnbnorm(x, w, b) normalizes each channel of the feature map x, averaging over spatial
locations and batch instances. Let T be the batch size; then

x, y ∈ R^{H×W×K×T},   w ∈ R^K,   b ∈ R^K.

Note that in this case the input and output arrays are explicitly treated as 4D tensors in
order to work with a batch of feature maps. The tensors w and b define component-wise
multiplicative and additive constants. The output feature map is given by

y_{ijkt} = w_k (x_{ijkt} − μ_k) / sqrt(σ_k^2 + ε) + b_k,

where

μ_k = (1/HWT) Σ_{i=1}^{H} Σ_{j=1}^{W} Σ_{t=1}^{T} x_{ijkt},   σ_k^2 = (1/HWT) Σ_{i=1}^{H} Σ_{j=1}^{W} Σ_{t=1}^{T} (x_{ijkt} − μ_k)^2.

See section 6.6.2 for implementation details.

4.6.3 Spatial normalization


vl_nnspnorm implements spatial normalization. The spatial normalization operator acts on
different feature channels independently and rescales each input feature by the energy of the
features in a local neighbourhood. First, the energy of the features in a W' × H' neighbourhood
is evaluated:

n_{i''j''d}^2 = (1 / W'H') Σ_{1≤i'≤H', 1≤j'≤W'} x^2_{i''+i'−1−⌊(H'−1)/2⌋, j''+j'−1−⌊(W'−1)/2⌋, d}.

In practice, the factor 1/W'H' is adjusted at the boundaries to account for the fact that
neighbors must be cropped. Then this is used to normalize the input:

y_{i''j''d} = x_{i''j''d} / (1 + α n_{i''j''d}^2)^β.

See section 6.6.3 for implementation details.

4.6.4 Softmax
vl_nnsoftmax computes the softmax operator:

y_{ijk} = e^{x_{ijk}} / Σ_{t=1}^{D} e^{x_{ijt}}.

Note that the operator is applied across feature channels and in a convolutional manner
at all spatial locations. Softmax can be seen as the combination of an activation function
(exponential) and a normalization operator. See section 6.6.4 for implementation details.

4.7 Categorical losses


The purpose of a categorical loss function ℓ(x, c) is to compare a prediction x to a ground
truth class label c. As in the rest of MatConvNet, the loss is treated as a convolutional
operator, in the sense that the loss is evaluated independently at each spatial location. How-
ever, the contributions of different samples are summed together (possibly after weighting)
and the output of the loss is a scalar. Section 4.7.1 describes losses useful for multi-class classification
and section 4.7.2 losses useful for binary attribute prediction. Further technical details
are in section 6.7. vl_nnloss implements all of these.

4.7.1 Classification losses


Classification losses decompose additively as follows:

ℓ(x, c) = Σ_{ijn} w_{ij1n} ℓ(x_{ij:n}, c_{ij1n}).    (4.3)

Here x ∈ R^{H×W×C×N} and c ∈ {1, ..., C}^{H×W×1×N}, such that the slice x_{ij:n} represents a vector
of C class scores and c_{ij1n} is the ground truth class label. The instanceWeights option
can be used to specify the tensor w of weights, which are otherwise set to all ones; w has
the same dimension as c.
Unless otherwise noted, we drop the other indices and denote by x and c the slice x_{ij:n}
and the scalar c_{ij1n}. vl_nnloss automatically skips all samples such that c = 0, which can
be used as an “ignore” label.

Classification error. The classification error is zero if class c is assigned the largest score
and one otherwise:

ℓ(x, c) = 1[c ≠ argmax_k x_k].    (4.4)

Ties are broken randomly.

Top-K classification error. The top-K classification error is zero if class c is within the
top K ranked scores:

ℓ(x, c) = 1[|{k : x_k ≥ x_c}| > K].    (4.5)

The classification error is the same as the top-1 classification error.

Log loss or negative posterior log-probability. In this case, x is interpreted as a vector
of posterior probabilities p(k) = x_k, k = 1, ..., C over the C classes. The loss is the negative
log-probability of the ground truth class:

ℓ(x, c) = − log x_c.    (4.6)

Note that this makes the implicit assumption x ≥ 0, Σ_k x_k = 1. Note also that, unless
x_c > 0, the loss is undefined. For these reasons, x is usually the output of a block such as
softmax that can guarantee these conditions. However, the composition of the naive log loss
and softmax is numerically unstable. Thus this is implemented as a special case below.
Generally, for such a loss to make sense, the score x_c should be somehow in competition
with the other scores x_k, k ≠ c. If this is not the case, minimizing (4.6) can trivially be
achieved by making all x_k large, whereas the intended effect is that x_c should be large com-
pared to the x_k, k ≠ c. The softmax block makes the scores compete through the normalization
factor.

Softmax log-loss or multinomial logistic loss. This loss combines the softmax block
and the log-loss block into a single block:

ℓ(x, c) = − log (e^{x_c} / Σ_{k=1}^{C} e^{x_k}) = −x_c + log Σ_{k=1}^{C} e^{x_k}.    (4.7)

Combining the two blocks explicitly is required for numerical stability. Note that, by combin-
ing the log-loss with softmax, this loss automatically makes the scores compete: ℓ(x, c) ≈ 0
when x_c ≫ Σ_{k≠c} x_k.
This loss is also implemented in the deprecated function vl_softmaxloss.
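
In MATLAB, the numerically stable evaluation subtracts the maximum score before
exponentiating (a sketch for a single score vector x and ground-truth label c):

% stable softmax log-loss: -x(c) + log(sum(exp(x)))
xmax = max(x) ;
loss = -(x(c) - xmax) + log(sum(exp(x - xmax))) ;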

Multi-class hinge loss. The multi-class hinge loss is given by

ℓ(x, c) = max{0, 1 − x_c}.    (4.8)

Note that ℓ(x, c) = 0 ⇔ x_c ≥ 1. Just as for the log-loss above, this loss does not
automatically make the scores compete. In order to do that, the loss is usually preceded by
the block:

y_c = x_c − max_{k≠c} x_k.

Hence y_c represents the confidence margin between class c and the other classes k ≠ c. Just
like the softmax log-loss combines softmax and loss, the next loss combines margin computation
and hinge loss.

Structured multi-class hinge loss. The structured multi-class hinge loss, also known as
the Crammer-Singer loss, combines the multi-class hinge loss with a block computing the score
margin:

ℓ(x, c) = max{0, 1 − x_c + max_{k≠c} x_k}.    (4.9)

4.7.2 Attribute losses


Attribute losses are similar to classification losses, but in this case classes are not mutually
exclusive; they are, instead, binary attributes. Attribute losses decompose additively as
follows:

ℓ(x, c) = Σ_{ijkn} w_{ijkn} ℓ(x_{ijkn}, c_{ijkn}).    (4.10)

Here x ∈ R^{H×W×C×N} and c ∈ {−1, +1}^{H×W×C×N}, such that the scalar x_{ijkn} represents
a confidence that attribute k is on and c_{ijkn} is the ground truth attribute label. The
instanceWeights option can be used to specify the tensor w of weights, which are oth-
erwise set to all ones; w has the same dimension as c.
Unless otherwise noted, we drop the other indices and denote by x and c the scalars x_{ijkn}
and c_{ijkn}. As before, samples with c = 0 are skipped.

Binary error. This loss is zero only if the sign of x − τ agrees with the ground truth label
c:

ℓ(x, c|τ) = 1[sign(x − τ) ≠ c].    (4.11)

Here τ is a configurable threshold, often set to zero.

Binary log-loss. This is the same as the multi-class log-loss but for binary attributes.
Namely, this time x ∈ [0, 1] is interpreted as the probability that attribute k is on:

ℓ(x, c) = − log x        if c = +1,
ℓ(x, c) = − log(1 − x)   if c = −1,    (4.12)

or, equivalently,

ℓ(x, c) = − log[c(x − 1/2) + 1/2].    (4.13)

Similarly to the multi-class log-loss, the assumption x ∈ [0, 1] must be enforced by the block
computing x.

Binary logistic loss. This is the same as the multi-class logistic loss, but this time x/2
represents the confidence that the attribute is on and −x/2 that it is off. This is obtained
by using the logistic function σ(x):

ℓ(x, c) = − log σ(cx) = − log [1 / (1 + e^{−cx})] = − log [e^{cx/2} / (e^{cx/2} + e^{−cx/2})].    (4.14)

Binary hinge loss. This is the same as the structured multi-class hinge loss but for binary
attributes:

ℓ(x, c) = max{0, 1 − cx}.    (4.15)

There is a relationship between the hinge loss and the structured multi-class hinge loss which
is analogous to the relationship between the binary logistic loss and the multi-class logistic loss.
Namely, the hinge loss can be rewritten as:

ℓ(x, c) = max{0, 1 − cx/2 + max_{k≠c} kx/2}.

Hence the hinge loss is the same as the structured multi-class hinge loss for C = 2 classes,
where x/2 is the score associated to class c = 1 and −x/2 the score associated to class c = −1.

4.8 Comparisons
4.8.1 p-distance
The vl_nnpdist function computes the p-distance between the vectors in the input data x
and a target x̄:

y_{ij} = (Σ_d |x_{ijd} − x̄_{ijd}|^p)^{1/p}.

Note that this operator is applied convolutionally, i.e. at each spatial location ij one extracts
and compares vectors x_{ij:}. By specifying the option 'noRoot', true it is possible to compute
a variant omitting the root:

y_{ij} = Σ_d |x_{ijd} − x̄_{ijd}|^p,   p > 0.

See section 6.8.1 for implementation details.


Chapter 5

Geometry

This chapter looks at the geometry of the CNN input-output mapping.

5.1 Preliminaries
In this section we are interested in understanding how components in a CNN depend on
components in the layers before it, and in particular on components of the input. Since
CNNs can incorporate blocks that perform complex operations, such as for example cropping
their inputs based on data-dependent terms (e.g. Fast R-CNN), this information is generally
available only at “run time” and cannot be uniquely determined given only the structure
of the network. Furthermore, blocks can implement complex operations that are difficult to
characterise in simple terms. Therefore, the analysis will be necessarily limited in scope.
We consider blocks such as convolutions for which one can deterministically establish
dependency chains between network components. We also assume that all the inputs x and
outputs y are in the usual form of spatial maps, and therefore indexed as xi,j,d,k where i, j
are spatial coordinates.
Consider a layer y = f(x). We are interested in establishing which components of x
influence which components of y. We also assume that this relation can be expressed in
terms of a sliding rectangular window field, called receptive field. This means that the output
component y_{i'',j''} depends only on the input components x_{i,j} with (i, j) ∈ Ω(i'', j'') (note that
feature channels are implicitly coalesced in this discussion). The set Ω(i'', j'') is a rectangle
defined as follows:

i ∈ α_h(i'' − 1) + β_h + [−(Δ_h − 1)/2, (Δ_h − 1)/2],    (5.1)
j ∈ α_v(j'' − 1) + β_v + [−(Δ_v − 1)/2, (Δ_v − 1)/2],    (5.2)

where (α_h, α_v) is the stride, (β_h, β_v) the offset, and (Δ_h, Δ_v) the receptive field size.


5.2 Simple filters


We now compute the receptive field geometry (αh , αv , βh , βv , ∆h , ∆v ) for the most common
operators, namely filters. We consider in particular simple filters that are characterised by
an integer size, stride, and padding.
It suffices to reason in 1D. Let H' be the vertical filter dimension, S_h the subsampling
stride, and P_h^− and P_h^+ the amount of zero padding applied to the top and the bottom of the
input x. The value y_{i''} then depends on the samples:

x_i : i ∈ [1, H'] + S_h(i'' − 1) − P_h^− = [−(H' − 1)/2, (H' − 1)/2] + S_h(i'' − 1) − P_h^− + (H' + 1)/2.

Hence

α_h = S_h,   β_h = (H' + 1)/2 − P_h^−,   Δ_h = H'.
A similar relation holds for the horizontal direction.
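
In MATLAB terms (a sketch; Hf, strideH, and padTop are illustrative variable names):

% receptive field geometry of a simple filter, per the formulas above
alpha = strideH ;              % stride
beta  = (Hf + 1)/2 - padTop ;  % offset
delta = Hf ;                   % receptive field size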
Note that many blocks (e.g. max pooling, LRN, ReLU, most loss functions, etc.) have a
filter-like receptive field geometry. For example, ReLU can be considered a 1 × 1 filter, such
that H' = S_h = 1 and P_h^− = P_h^+ = 0. Note that in this case α_h = 1, β_h = 1 and Δ_h = 1.
In addition to computing the receptive field geometry, we are often interested in determin-
ing the sizes of the arrays x and y throughout the architecture. In the case of filters, and once
more reasoning for a 1D slice, we notice that y_{i''} can be obtained for i'' = 1, 2, ..., H'', where
H'' is the largest value of i'' before the receptive field falls outside x (including padding). If
H is the height of the input array x, we get the condition

H' + S_h(H'' − 1) − P_h^− ≤ H + P_h^+.

Hence

H'' = ⌊(H − H' + P_h^− + P_h^+) / S_h⌋ + 1.    (5.3)

5.2.1 Pooling in Caffe


MatConvNet treats pooling operators like filters, using the rules above. In the library Caffe,
this is done slightly differently, creating some incompatibilities. In their case, the pooling
window is allowed to shift enough such that the last application always includes the last pixel
of the input. If the stride is greater than one, this means that the last application of the
pooling window can be partially outside the input boundaries even if padding is “officially”
zero.
More formally, if $H'$ is the pool size and H the size of the signal, the last application of
the pooling window has index $i'' = H''$ such that
$$\left. S_h(i''-1) + H' \right|_{i''=H''} \ge H \quad\Leftrightarrow\quad H'' = \left\lceil \frac{H - H'}{S_h} \right\rceil + 1.$$
If there is padding, the same logic applies after padding the input image, such that the output
has height:
$$H'' = \left\lceil \frac{H - H' + P_h^- + P_h^+}{S_h} \right\rceil + 1.$$

This is the same formula as for the filters above, but with the ceil instead of the floor operator. Note
that in practice $P_h^- = P_h^+ = P_h$ since Caffe does not support asymmetric padding.
Unfortunately, it gets more complicated. Using the formula above, it can happen that
the last pooling window application is completely outside the input image and Caffe tries to avoid
it. This requires
$$\left. S_h(i''-1) - P_h^- + 1 \right|_{i''=H''} \le H \quad\Leftrightarrow\quad H'' \le \left\lfloor \frac{H - 1 + P_h^-}{S_h} \right\rfloor + 1. \qquad (5.4)$$

Using the fact that for integers a, b, one has da/be = b(a + b − 1)/bc, we can rewrite the
expression for H 00 as follows

H − H 0 + Ph− + Ph+ H − 1 + Ph− Ph+ + Sh − H 0


   
00
H = +1= + + 1.
Sh Sh Sh

Hence if $P_h^+ + S_h \le H'$ then the second term is non-positive and (5.4) is satisfied. In
practice, Caffe assumes that $P_h^+, P_h^- \le H' - 1$, as otherwise the first filter application falls
entirely in the padded region. Hence, we can upper bound the second term:
$$\frac{P_h^+ + S_h - H'}{S_h} \le \frac{S_h - 1}{S_h} \le 1.$$

We conclude that, for any choices of $P_h^+$ and $S_h$ allowed by Caffe, the formula above may
violate constraint (5.4) by at most one unit. Caffe has a special provision for that and lowers
$H''$ by one when needed. Furthermore, we see that if $P_h^+ = 0$ and $S_h \le H'$ (which is often
the case and may be assumed by Caffe), then the equation is also satisfied and Caffe skips
the check.
Next, we find MatConvNet equivalents for these parameters. Assume that Caffe applies
a symmetric padding $P_h$. Then in MatConvNet $P_h^- = P_h$ to align the top part of the output
signal. To match Caffe, the last sample of the last filter application has to be on or to the
right of the last Caffe-padded pixel:
$$S_h \left( \underbrace{\left\lfloor \frac{H - H' + P_h^- + P_h^+}{S_h} \right\rfloor + 1}_{\text{MatConvNet rightmost pooling index}} - 1 \right) + H' \ \ge\ \underbrace{H + 2P_h^-}_{\text{Caffe rightmost input sample with padding}},$$
where the left-hand side is the rightmost input sample covered by the last MatConvNet pooling window.

Rearranging:
$$\left\lfloor \frac{H - H' + P_h^- + P_h^+}{S_h} \right\rfloor \ \ge\ \frac{H - H' + 2P_h^-}{S_h}.$$
Using $\lfloor a/b \rfloor = \lceil (a - b + 1)/b \rceil$ we get the equivalent condition:
$$\left\lceil \frac{H - H' + 2P_h^-}{S_h} + \frac{P_h^+ - P_h^- - S_h + 1}{S_h} \right\rceil \ \ge\ \frac{H - H' + 2P_h^-}{S_h}.$$

Removing the ceil operator lower bounds the left-hand side of the equation and produces the
sufficient condition
$$P_h^+ \ \ge\ P_h^- + S_h - 1.$$
As before, this may still be too much padding, causing the last pool window application to
be entirely in the rightmost padded area. MatConvNet places the restriction $P_h^+ \le H' - 1$,
so that
$$P_h^+ = \min\{P_h^- + S_h - 1,\ H' - 1\}.$$
For example, a pooling region of width $H' = 3$ samples with a stride of $S_h = 2$ samples and
null Caffe padding $P_h^- = 0$ would result in a right MatConvNet padding of $P_h^+ = 1$.
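The following Python sketch (hypothetical helper names, not Caffe or MatConvNet code) works through both conventions: the Caffe output size with its special provision, and the right MatConvNet padding that reproduces it.

```python
# Sketch of the two pooling conventions discussed above, assuming the
# formulas in this section; the helper names are ours.
import math

def caffe_pool_output(H, Hp, S, P):
    """Caffe pooling output size with symmetric padding P (ceil formula)."""
    Hpp = math.ceil((H - Hp + 2 * P) / S) + 1
    # special provision: drop the last application if it starts past the input,
    # i.e. if constraint (5.4) is violated
    if S * (Hpp - 1) - P + 1 > H:
        Hpp -= 1
    return Hpp

def matconvnet_equivalent_padding(Hp, S, P):
    """Right padding that makes MatConvNet match Caffe (with P_h^- = P)."""
    return min(P + S - 1, Hp - 1)

# Example from the text: Hp = 3, S = 2, P = 0 gives a right padding of 1.
print(matconvnet_equivalent_padding(3, 2, 0))  # 1
print(caffe_pool_output(10, 3, 2, 0))          # ceil(7/2) + 1 = 5
```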

5.3 Convolution transpose


The convolution transpose block is similar to a simple filter, but somewhat more complex.
Recall that convolution transpose (section 6.2) is the transpose of the convolution operator,
which in turn is a filter. Reasoning for a 1D slice, let $x_i$ be the input to the convolution
transpose block and $y_{i''}$ its output. Furthermore let $U_h$, $C_h^-$, $C_h^+$ and $H'$ be the upsampling
factor, top and bottom crops, and filter height, respectively.

If we look at the convolution transpose backward, from the output to the input (see also
fig. 4.2), the data dependencies are the same as for the convolution operator, studied in
section 5.2. Hence there is an interaction between $x_i$ and $y_{i''}$ only if
$$1 + U_h(i-1) - C_h^- \ \le\ i'' \ \le\ H' + U_h(i-1) - C_h^-, \qquad (5.5)$$

where cropping becomes padding and upsampling becomes downsampling. Turning this
relation around, we find that
$$\left\lceil \frac{i'' + C_h^- - H'}{U_h} \right\rceil + 1 \ \le\ i \ \le\ \left\lfloor \frac{i'' + C_h^- - 1}{U_h} \right\rfloor + 1.$$
Note that, due to rounding, it is not possible to express this set tightly in the form outlined
above. We can however relax these two relations (hence obtaining a slightly larger receptive
field) and conclude that
$$\alpha_h = \frac{1}{U_h}, \qquad \beta_h = \frac{2C_h^- - H' + 1}{2U_h} + 1, \qquad \Delta_h = \frac{H'-1}{U_h} + 1.$$
Next, we want to determine the height $H''$ of the output y of convolution transpose as
a function of the height H of the input x and the other parameters. Swapping input and
output in (5.3) results in the constraint:
$$H = 1 + \left\lfloor \frac{H'' - H' + C_h^- + C_h^+}{U_h} \right\rfloor.$$
If H is now given as input, it is not possible to recover $H''$ uniquely from this expression;
instead, all the following values are possible:
$$U_h(H-1) + H' - C_h^- - C_h^+ \ \le\ H'' \ <\ U_h H + H' - C_h^- - C_h^+.$$



This is due to the fact that $U_h$ acts as a downsampling factor in the standard convolution
direction and some of the samples to the right of the convolution input y may be ignored by
the filter (see also fig. 4.1 and fig. 4.2).

Since the height of y is then determined up to $U_h$ samples, and since the extra samples
would be ignored by the computation and stay zero, we choose the tighter definition and set
$$H'' = U_h(H-1) + H' - C_h^- - C_h^+.$$
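A small numeric check (illustrative Python, not MatConvNet code) of this choice, showing that the chosen output height round-trips through the convolution output-size formula (5.3):

```python
# Sketch: convolution transpose output size and its round-trip through (5.3),
# assuming the formulas above.
def convt_output_size(H, Hp, U, C_minus, C_plus):
    return U * (H - 1) + Hp - C_minus - C_plus

def conv_output_size(H, Hp, S, P_minus, P_plus):
    return (H - Hp + P_minus + P_plus) // S + 1

Hpp = convt_output_size(5, 4, 2, 1, 1)     # 2*(5-1) + 4 - 1 - 1 = 10
print(Hpp)                                  # 10
print(conv_output_size(Hpp, 4, 2, 1, 1))    # back to 5
```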

5.4 Transposing receptive fields


Suppose we have determined that a layer y = f(x) has a receptive field transformation
$(\alpha_h, \beta_h, \Delta_h)$ (along one spatial slice). Now suppose we are given a block x = g(y) which
is the "transpose" of f, just like the convolution transpose layer is the transpose of the
convolution layer. By this, we mean that, if $y_{i''}$ depends on $x_i$ due to f, then $x_i$ depends on
$y_{i''}$ due to g.
Note that, by definition of receptive fields, f relates the input and output index pairs
$(i, i'')$ given by (5.1), which can be rewritten as
$$-\frac{\Delta_h - 1}{2} \ \le\ i - \alpha_h(i''-1) - \beta_h \ \le\ \frac{\Delta_h - 1}{2}.$$
A simple manipulation of this expression results in the equivalent expression:
$$-\frac{(\Delta_h + \alpha_h - 1)/\alpha_h - 1}{2} \ \le\ i'' - \frac{1}{\alpha_h}(i-1) - \frac{1 + \alpha_h - \beta_h}{\alpha_h} \ \le\ \frac{(\Delta_h + \alpha_h - 1)/\alpha_h - 1}{2}.$$
Hence, in the reverse direction, this corresponds to a RF transformation
$$\hat{\alpha}_h = \frac{1}{\alpha_h}, \qquad \hat{\beta}_h = \frac{1 + \alpha_h - \beta_h}{\alpha_h}, \qquad \hat{\Delta}_h = \frac{\Delta_h + \alpha_h - 1}{\alpha_h}.$$

Example 1. For convolution, we have found the parameters:
$$\alpha_h = S_h, \qquad \beta_h = \frac{H'+1}{2} - P_h^-, \qquad \Delta_h = H'.$$
Using the formulas just found, we can obtain the RF transformation for convolution transpose:
$$\hat{\alpha}_h = \frac{1}{\alpha_h} = \frac{1}{S_h},$$
$$\hat{\beta}_h = \frac{1 + S_h - (H'+1)/2 + P_h^-}{S_h} = \frac{P_h^- - H'/2 + 1/2}{S_h} + 1 = \frac{2P_h^- - H' + 1}{2S_h} + 1,$$
$$\hat{\Delta}_h = \frac{H' + S_h - 1}{S_h} = \frac{H'-1}{S_h} + 1.$$

Hence we find again the formulas obtained in section 5.3.



5.5 Composing receptive fields


Consider now the composition of two layers $h = g \circ f$ with receptive fields $(\alpha_f, \beta_f, \Delta_f)$ and
$(\alpha_g, \beta_g, \Delta_g)$ (once again we consider only a 1D slice in the vertical direction, the horizontal
one being the same). The goal is to compute the receptive field of h.

To do so, pick a sample $i_g$ in the domain of g. The first and last sample $i_f$ in the domain
of f to affect $i_g$ are given by:
$$i_f = \alpha_f(i_g - 1) + \beta_f \pm \frac{\Delta_f - 1}{2}.$$
Likewise, the first and last sample $i_g$ to affect a given output sample $i_h$ are given by
$$i_g = \alpha_g(i_h - 1) + \beta_g \pm \frac{\Delta_g - 1}{2}.$$
Substituting one relation into the other, we see that the first and last sample $i_f$ in the domain
of $g \circ f$ to affect $i_h$ are:
$$i_f = \alpha_f\left(\alpha_g(i_h - 1) + \beta_g \pm \frac{\Delta_g - 1}{2} - 1\right) + \beta_f \pm \frac{\Delta_f - 1}{2} = \alpha_f\alpha_g(i_h - 1) + \alpha_f(\beta_g - 1) + \beta_f \pm \frac{\alpha_f(\Delta_g - 1) + \Delta_f - 1}{2}.$$
We conclude that
$$\alpha_h = \alpha_f\alpha_g, \qquad \beta_h = \alpha_f(\beta_g - 1) + \beta_f, \qquad \Delta_h = \alpha_f(\Delta_g - 1) + \Delta_f.$$
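The composition rule chains easily through a whole network. As a sketch in Python (illustrative helper names; f is applied first, then g), composing two 3 × 3 stride-1 convolutions with padding 1 recovers the well-known effective 5 × 5 receptive field:

```python
# Sketch of receptive-field composition using the rule just derived.
from functools import reduce

def compose(rf_f, rf_g):
    """Receptive field of h = g o f, where f acts first."""
    af, bf, df = rf_f
    ag, bg, dg = rf_g
    return (af * ag, af * (bg - 1) + bf, af * (dg - 1) + df)

def filter_rf(Hp, S, P_minus):
    """(alpha, beta, delta) of a simple filter, from section 5.2."""
    return (S, (Hp + 1) / 2 - P_minus, Hp)

layers = [filter_rf(3, 1, 1), filter_rf(3, 1, 1)]
print(reduce(compose, layers))  # (1, 1.0, 5): acts like a single 5x5 filter
```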

5.6 Overlaying receptive fields


Consider now the combination $h(f(x_1), g(x_2))$ where the domains of f and g are the same.
Given the rule above, it is possible to compute how each output sample $i_h$ depends on each
input sample $i_f$ through f and on each input sample $i_g$ through g. Suppose that this gives
receptive fields $(\alpha_{hf}, \beta_{hf}, \Delta_{hf})$ and $(\alpha_{hg}, \beta_{hg}, \Delta_{hg})$ respectively. Now assume that the domains
of f and g coincide, i.e. $x = x_1 = x_2$. The goal is to determine the combined receptive field.

This is possible if, and only if, $\alpha = \alpha_{hg} = \alpha_{hf}$. Only in this case, in fact, it is
possible to find a sliding window receptive field that tightly encloses the receptive field due
to g and f at all points according to formulas (5.1). We say that these two receptive fields
are compatible. The range of input samples $i = i_f = i_g$ that affect any output sample $i_h$ is
then given by
$$i_{\min} = \alpha(i_h - 1) + a, \qquad a = \min\left\{\beta_{hf} - \frac{\Delta_{hf} - 1}{2},\ \beta_{hg} - \frac{\Delta_{hg} - 1}{2}\right\},$$
$$i_{\max} = \alpha(i_h - 1) + b, \qquad b = \max\left\{\beta_{hf} + \frac{\Delta_{hf} - 1}{2},\ \beta_{hg} + \frac{\Delta_{hg} - 1}{2}\right\}.$$
We conclude that the combined receptive field is
$$\alpha = \alpha_{hg} = \alpha_{hf}, \qquad \beta = \frac{a+b}{2}, \qquad \Delta = b - a + 1.$$
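A compact Python sketch of the overlay rule (illustrative names; the assertion encodes the compatibility condition above):

```python
# Sketch: combine two receptive fields that share an input, per the rule above.
def overlay(rf_hf, rf_hg):
    (a1, b1, d1), (a2, b2, d2) = rf_hf, rf_hg
    assert a1 == a2, "incompatible receptive fields (different strides)"
    lo = min(b1 - (d1 - 1) / 2, b2 - (d2 - 1) / 2)  # first input sample offset
    hi = max(b1 + (d1 - 1) / 2, b2 + (d2 - 1) / 2)  # last input sample offset
    return (a1, (lo + hi) / 2, hi - lo + 1)

# A 3-sample field centred at beta=1 overlaid with a 7-sample field at beta=2:
print(overlay((1, 1, 3), (1, 2, 7)))  # (1, 2.0, 7)
```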
Chapter 6

Implementation details

This chapter contains calculations and details.

6.1 Convolution
It is often convenient to express the convolution operation in matrix form. To this end, let
$\phi(x)$ be the im2row operator, extracting all $W' \times H'$ patches from the map x and storing
them as rows of a $(H''W'') \times (H'W'D)$ matrix. Formally, this operator is given by:
$$[\phi(x)]_{pq} = x_{ijd}, \qquad (i,j,d) = t(p,q),$$
where the index mapping $(i,j,d) = t(p,q)$ is
$$i = i'' + i' - 1, \quad j = j'' + j' - 1, \quad p = i'' + H''(j'' - 1), \quad q = i' + H'(j' - 1) + H'W'(d - 1).$$
It is also useful to define the "transposed" operator row2im:
$$[\phi^*(M)]_{ijd} = \sum_{(p,q) \in t^{-1}(i,j,d)} M_{pq}.$$
Note that $\phi$ and $\phi^*$ are linear operators. Both can be expressed by a matrix
$H \in \mathbb{R}^{(H''W''H'W'D) \times (HWD)}$ such that
$$\operatorname{vec}(\phi(x)) = H \operatorname{vec}(x), \qquad \operatorname{vec}(\phi^*(M)) = H^\top \operatorname{vec}(M).$$

Hence we obtain the following expression for the vectorized output (see [6]):
$$\operatorname{vec} y = \operatorname{vec}(\phi(x)F) = \begin{cases} (I \otimes \phi(x)) \operatorname{vec} F, & \text{or, equivalently,} \\ (F^\top \otimes I) \operatorname{vec} \phi(x), \end{cases}$$
where $F \in \mathbb{R}^{(H'W'D) \times K}$ is the matrix obtained by reshaping the array f and I is an identity
matrix of suitable dimensions. This allows obtaining the following formulas for the derivatives:
$$\frac{dz}{d(\operatorname{vec} F)^\top} = \frac{dz}{d(\operatorname{vec} y)^\top}(I \otimes \phi(x)) = \operatorname{vec}\left(\phi(x)^\top \frac{dz}{dY}\right)^{\!\top}$$


where $Y \in \mathbb{R}^{(H''W'') \times K}$ is the matrix obtained by reshaping the array y. Likewise:
$$\frac{dz}{d(\operatorname{vec} x)^\top} = \frac{dz}{d(\operatorname{vec} y)^\top}(F^\top \otimes I)\frac{d \operatorname{vec}\phi(x)}{d(\operatorname{vec} x)^\top} = \operatorname{vec}\left(\frac{dz}{dY}F^\top\right)^{\!\top} H.$$
In summary, after reshaping these terms we obtain the formulas:
$$\operatorname{vec} y = \operatorname{vec}(\phi(x)F), \qquad \frac{dz}{dF} = \phi(x)^\top \frac{dz}{dY}, \qquad \frac{dz}{dX} = \phi^*\left(\frac{dz}{dY}F^\top\right),$$
where $X \in \mathbb{R}^{(HW) \times D}$ is the matrix obtained by reshaping x. Notably, these expressions are
used to implement the convolutional operator; while this may seem inefficient, it is instead
a fast approach when the number of filters is large and it allows leveraging fast BLAS and
GPU BLAS implementations.
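A numpy sketch of this im2row scheme (single image, no padding or stride, correlation convention; the helper names are ours, not MatConvNet's API) makes the "one large matrix product" structure explicit:

```python
# Sketch of im2row-based convolution, assuming the index mapping above
# (1-based in the text, 0-based here; Fortran order matches the q formula).
import numpy as np

def im2row(x, Hp, Wp):
    """Extract all Hp x Wp patches of x (H x W x D) as rows of phi(x)."""
    H, W, D = x.shape
    Hpp, Wpp = H - Hp + 1, W - Wp + 1
    rows = np.empty((Hpp * Wpp, Hp * Wp * D))
    for jpp in range(Wpp):
        for ipp in range(Hpp):
            patch = x[ipp:ipp + Hp, jpp:jpp + Wp, :]
            # row index p = i'' + H''(j'' - 1) in the 1-based convention
            rows[ipp + Hpp * jpp] = patch.reshape(-1, order='F')
    return rows

def conv(x, f):
    """f has shape (Hp, Wp, D, K); returns y of shape (H'', W'', K)."""
    Hp, Wp, D, K = f.shape
    H, W, _ = x.shape
    M = im2row(x, Hp, Wp)                     # (H''W'') x (HpWpD)
    F = f.reshape(Hp * Wp * D, K, order='F')  # (HpWpD) x K
    y = M @ F                                 # one large BLAS call
    return y.reshape(H - Hp + 1, W - Wp + 1, K, order='F')

x = np.random.randn(8, 8, 3)
f = np.random.randn(3, 3, 3, 4)
print(conv(x, f).shape)  # (6, 6, 4)
```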

6.2 Convolution transpose


In order to understand the definition of convolution transpose, let y be obtained from x
by the convolution operator as defined in section 4.1 (including padding and downsampling).
Since this is a linear operation, it can be rewritten as $\operatorname{vec} y = M \operatorname{vec} x$ for a suitable matrix M;
convolution transpose computes instead $\operatorname{vec} x = M^\top \operatorname{vec} y$. While this is simple to describe
in terms of matrices, what happens in terms of indexes is tricky. In order to derive a formula
for the convolution transpose, start from standard convolution (for a 1D signal):
$$y_{i''} = \sum_{i'=1}^{H'} f_{i'}\, x_{S(i''-1)+i'-P_h^-}, \qquad 1 \le i'' \le 1 + \left\lfloor \frac{H - H' + P_h^- + P_h^+}{S} \right\rfloor,$$
where S is the downsampling factor, $P_h^-$ and $P_h^+$ the padding, H the length of the input
signal x, and $H'$ the length of the filter f. Due to padding, the index of the input data x
may exceed the range [1, H]; we implicitly assume that the signal is zero padded outside this
range.
In order to derive an expression of the convolution transpose, we make use of the identity
$\operatorname{vec} y^\top (M \operatorname{vec} x) = (\operatorname{vec} y^\top M)\operatorname{vec} x = \operatorname{vec} x^\top (M^\top \operatorname{vec} y)$. Expanding this in formulas:
$$\sum_{i''=1}^{H''} y_{i''} \sum_{i'=1}^{H'} f_{i'}\, x_{S(i''-1)+i'-P_h^-} = \sum_{i''=-\infty}^{+\infty} \sum_{i'=-\infty}^{+\infty} y_{i''} f_{i'}\, x_{S(i''-1)+i'-P_h^-}$$
$$= \sum_{i''=-\infty}^{+\infty} \sum_{k=-\infty}^{+\infty} y_{i''} f_{k-S(i''-1)+P_h^-}\, x_k$$
$$= \sum_{i''=-\infty}^{+\infty} \sum_{k=-\infty}^{+\infty} y_{i''} f_{(k-1+P_h^-) \bmod S + S\left(1 - i'' + \left\lfloor \frac{k-1+P_h^-}{S} \right\rfloor\right) + 1}\, x_k$$
$$= \sum_{k=-\infty}^{+\infty} x_k \sum_{q=-\infty}^{+\infty} y_{\left\lfloor \frac{k-1+P_h^-}{S} \right\rfloor + 2 - q}\, f_{(k-1+P_h^-) \bmod S + S(q-1)+1}.$$

Summation ranges have been extended to infinity by assuming that all signals are zero padded
as needed. In order to recover such ranges, note that $k \in [1, H]$ (since this is the range of
elements of x involved in the original convolution). Furthermore, $q \ge 1$ is the minimum value
of q for which the filter f is non zero; likewise, $q \le \lfloor (H'-1)/S \rfloor + 1$ is a fairly tight upper
bound on the maximum value (although, depending on k, there could be an element less).
Hence
$$x_k = \sum_{q=1}^{1+\left\lfloor \frac{H'-1}{S} \right\rfloor} y_{\left\lfloor \frac{k-1+P_h^-}{S} \right\rfloor + 2 - q}\, f_{(k-1+P_h^-) \bmod S + S(q-1)+1}, \qquad k = 1, \dots, H. \qquad (6.1)$$

Note that the summation extrema in (6.1) can be refined slightly to account for the finite
size of y and f:
$$\max\left\{1,\ \left\lfloor \frac{k-1+P_h^-}{S} \right\rfloor + 2 - H''\right\} \ \le\ q \ \le\ 1 + \min\left\{\left\lfloor \frac{H' - 1 - (k-1+P_h^-) \bmod S}{S} \right\rfloor,\ \left\lfloor \frac{k-1+P_h^-}{S} \right\rfloor\right\}.$$

The size $H''$ of the output of convolution transpose is obtained in section 5.3.

6.3 Spatial pooling


Since max pooling simply selects for each output element an input element, the relation can
be expressed in matrix form as $\operatorname{vec} y = S(x) \operatorname{vec} x$ for a suitable selector matrix
$S(x) \in \{0,1\}^{(H''W''D) \times (HWD)}$. The derivatives can then be written as
$\frac{dz}{d(\operatorname{vec} x)^\top} = \frac{dz}{d(\operatorname{vec} y)^\top} S(x)$, for all
but a null set of points, where the operator is not differentiable (this usually does not pose
problems in optimization by stochastic gradient). For average pooling, similar relations exist
with two differences: S does not depend on the input x and it is not binary, in order to
account for the normalization factors. In summary, we have the expressions:
$$\operatorname{vec} y = S(x) \operatorname{vec} x, \qquad \frac{dz}{d \operatorname{vec} x} = S(x)^\top \frac{dz}{d \operatorname{vec} y}. \qquad (6.2)$$

6.4 Activation functions


6.4.1 ReLU
The ReLU operator can be expressed in matrix notation as
$$\operatorname{vec} y = \operatorname{diag}(s) \operatorname{vec} x, \qquad \frac{dz}{d \operatorname{vec} x} = \operatorname{diag}(s) \frac{dz}{d \operatorname{vec} y},$$
where $s = [\operatorname{vec} x > 0] \in \{0,1\}^{HWD}$ is an indicator vector.



6.4.2 Sigmoid
The derivative of the sigmoid function is given by
$$\frac{dz}{dx_{ijd}} = \frac{dz}{dy_{ijd}} \frac{dy_{ijd}}{dx_{ijd}} = \frac{dz}{dy_{ijd}} \frac{e^{-x_{ijd}}}{(1+e^{-x_{ijd}})^2} = \frac{dz}{dy_{ijd}}\, y_{ijd}(1 - y_{ijd}).$$
In matrix notation:
$$\frac{dz}{dx} = \frac{dz}{dy} \odot y \odot (\mathbf{1}\mathbf{1}^\top - y).$$

6.5 Spatial bilinear resampling

The projected derivative $d\langle p, \phi(x,g)\rangle / dx$ of the spatial bilinear resampler operator with
respect to the input image x can be found as follows:
$$\frac{\partial}{\partial x_{ijc}} \sum_{i''j''c''} p_{i''j''c''} \left[\sum_{i'=1}^{H}\sum_{j'=1}^{W} x_{i'j'c''} \max\{0, 1-|\alpha_v g_{1i''j''} + \beta_v - i'|\} \max\{0, 1-|\alpha_u g_{2i''j''} + \beta_u - j'|\}\right]$$
$$= \sum_{i''j''} p_{i''j''c} \max\{0, 1-|\alpha_v g_{1i''j''} + \beta_v - i|\} \max\{0, 1-|\alpha_u g_{2i''j''} + \beta_u - j|\}. \qquad (6.3)$$
Note that the formula is similar to Eq. 4.2, with the difference that summation is on $i''$ rather
than i.

The projected derivative $d\langle p, \phi(x,g)\rangle / dg$ with respect to the grid is similar:
$$\frac{\partial}{\partial g_{1i'j'}} \sum_{i''j''c} p_{i''j''c} \left[\sum_{i=1}^{H}\sum_{j=1}^{W} x_{ijc} \max\{0, 1-|\alpha_v g_{1i''j''} + \beta_v - i|\} \max\{0, 1-|\alpha_u g_{2i''j''} + \beta_u - j|\}\right]$$
$$= -\sum_{c}\sum_{i=1}^{H}\sum_{j=1}^{W} p_{i'j'c}\, \alpha_v\, x_{ijc} \max\{0, 1-|\alpha_u g_{2i'j'} + \beta_u - j|\}\, \operatorname{sign}(\alpha_v g_{1i'j'} + \beta_v - i)\, \mathbf{1}_{\{-1 < \alpha_v g_{1i'j'} + \beta_v - i < 1\}}. \qquad (6.4)$$
A similar expression holds for $\partial / \partial g_{2i'j'}$.

6.6 Normalization
6.6.1 Local response normalization (LRN)
The derivative is easily computed as:
$$\frac{dz}{dx_{ijd}} = \frac{dz}{dy_{ijd}} L(i,j,d|x)^{-\beta} - 2\alpha\beta x_{ijd} \sum_{k:\, d \in G(k)} \frac{dz}{dy_{ijk}} L(i,j,k|x)^{-\beta-1} x_{ijk},$$
where
$$L(i,j,k|x) = \kappa + \alpha \sum_{t \in G(k)} x_{ijt}^2.$$

6.6.2 Batch normalization


The derivative of the network output z with respect to the multipliers $w_k$ and biases $b_k$ is
given by
$$\frac{dz}{dw_k} = \sum_{i''j''k''t''} \frac{dz}{dy_{i''j''k''t''}} \frac{dy_{i''j''k''t''}}{dw_k} = \sum_{i''j''t''} \frac{dz}{dy_{i''j''kt''}} \frac{x_{i''j''kt''} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}},$$
$$\frac{dz}{db_k} = \sum_{i''j''k''t''} \frac{dz}{dy_{i''j''k''t''}} \frac{dy_{i''j''k''t''}}{db_k} = \sum_{i''j''t''} \frac{dz}{dy_{i''j''kt''}}.$$

The derivative of the network output z with respect to the block input x is computed as
follows:
$$\frac{dz}{dx_{ijkt}} = \sum_{i''j''k''t''} \frac{dz}{dy_{i''j''k''t''}} \frac{dy_{i''j''k''t''}}{dx_{ijkt}}.$$
Since feature channels are processed independently, all terms with $k'' \ne k$ are zero. Hence
$$\frac{dz}{dx_{ijkt}} = \sum_{i''j''t''} \frac{dz}{dy_{i''j''kt''}} \frac{dy_{i''j''kt''}}{dx_{ijkt}},$$

where
$$\frac{dy_{i''j''kt''}}{dx_{ijkt}} = w_k\left(\delta_{i=i'',\,j=j'',\,t=t''} - \frac{d\mu_k}{dx_{ijkt}}\right)\frac{1}{\sqrt{\sigma_k^2+\epsilon}} - \frac{w_k}{2}\,(x_{i''j''kt''} - \mu_k)\left(\sigma_k^2+\epsilon\right)^{-\frac{3}{2}} \frac{d\sigma_k^2}{dx_{ijkt}},$$
the derivatives with respect to the mean and variance are computed as follows:
$$\frac{d\mu_k}{dx_{ijkt}} = \frac{1}{HWT}, \qquad \frac{d\sigma_k^2}{dx_{i'j'kt'}} = \frac{2}{HWT}\sum_{ijt}(x_{ijkt} - \mu_k)\left(\delta_{i=i',\,j=j',\,t=t'} - \frac{1}{HWT}\right) = \frac{2}{HWT}(x_{i'j'kt'} - \mu_k),$$

and $\delta_E$ is the indicator function of the event E. Hence
$$\frac{dz}{dx_{ijkt}} = \frac{w_k}{\sqrt{\sigma_k^2+\epsilon}}\left(\frac{dz}{dy_{ijkt}} - \frac{1}{HWT}\sum_{i''j''t''}\frac{dz}{dy_{i''j''kt''}}\right) - \frac{w_k}{2(\sigma_k^2+\epsilon)^{\frac{3}{2}}}\,\frac{2}{HWT}\,(x_{ijkt}-\mu_k)\sum_{i''j''t''}\frac{dz}{dy_{i''j''kt''}}(x_{i''j''kt''}-\mu_k),$$
i.e.
$$\frac{dz}{dx_{ijkt}} = \frac{w_k}{\sqrt{\sigma_k^2+\epsilon}}\left(\frac{dz}{dy_{ijkt}} - \frac{1}{HWT}\sum_{i''j''t''}\frac{dz}{dy_{i''j''kt''}}\right) - \frac{w_k}{\sqrt{\sigma_k^2+\epsilon}}\,\frac{x_{ijkt}-\mu_k}{\sqrt{\sigma_k^2+\epsilon}}\,\frac{1}{HWT}\sum_{i''j''t''}\frac{dz}{dy_{i''j''kt''}}\frac{x_{i''j''kt''}-\mu_k}{\sqrt{\sigma_k^2+\epsilon}}.$$

We can identify some of these terms with the ones computed as derivatives of bnorm with
respect to $w_k$ and $b_k$:
$$\frac{dz}{dx_{ijkt}} = \frac{w_k}{\sqrt{\sigma_k^2+\epsilon}}\left(\frac{dz}{dy_{ijkt}} - \frac{1}{HWT}\frac{dz}{db_k} - \frac{x_{ijkt}-\mu_k}{\sqrt{\sigma_k^2+\epsilon}}\,\frac{1}{HWT}\frac{dz}{dw_k}\right).$$
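A numpy sketch of the batch-normalization backward pass written in exactly this form (illustrative code, not MatConvNet's implementation): the input derivative reuses the already-computed $dz/dw_k$ and $dz/db_k$.

```python
# Sketch of bnorm backward, assuming (H, W, K, T) arrays as in the text.
import numpy as np

def bnorm_backward(x, dzdy, w, eps=1e-5):
    """x, dzdy: (H, W, K, T); w: (K,) multipliers. Returns dz/dx, dz/dw, dz/db."""
    m = x.shape[0] * x.shape[1] * x.shape[3]     # HWT
    mu = x.mean(axis=(0, 1, 3), keepdims=True)
    var = x.var(axis=(0, 1, 3), keepdims=True)   # biased variance, as above
    xhat = (x - mu) / np.sqrt(var + eps)
    dzdw = (dzdy * xhat).sum(axis=(0, 1, 3))     # dz/dw_k
    dzdb = dzdy.sum(axis=(0, 1, 3))              # dz/db_k
    wk = w.reshape(1, 1, -1, 1)
    dzdx = wk / np.sqrt(var + eps) * (
        dzdy
        - dzdb.reshape(1, 1, -1, 1) / m
        - xhat * dzdw.reshape(1, 1, -1, 1) / m)
    return dzdx, dzdw, dzdb

x = np.random.randn(4, 4, 3, 2)
dzdy = np.random.randn(4, 4, 3, 2)
dzdx, dzdw, dzdb = bnorm_backward(x, dzdy, np.ones(3))
print(dzdx.shape, dzdw.shape, dzdb.shape)  # (4, 4, 3, 2) (3,) (3,)
```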

6.6.3 Spatial normalization


The neighbourhood norm $n^2_{i''j''d}$ can be computed by applying average pooling to $x^2_{ijd}$ using
vl_nnpool with a $W' \times H'$ pooling region, top padding $\lfloor \frac{H'-1}{2} \rfloor$, bottom padding
$H' - \lfloor \frac{H'-1}{2} \rfloor - 1$, and similarly for the horizontal padding.
The derivative of spatial normalization can be obtained as follows:
$$\frac{dz}{dx_{ijd}} = \sum_{i''j''} \frac{dz}{dy_{i''j''d}} \frac{dy_{i''j''d}}{dx_{ijd}}$$
$$= \sum_{i''j''} \frac{dz}{dy_{i''j''d}}(1+\alpha n^2_{i''j''d})^{-\beta} \frac{dx_{i''j''d}}{dx_{ijd}} - \alpha\beta \sum_{i''j''} \frac{dz}{dy_{i''j''d}}(1+\alpha n^2_{i''j''d})^{-\beta-1} x_{i''j''d} \frac{dn^2_{i''j''d}}{d(x^2_{ijd})} \frac{dx^2_{ijd}}{dx_{ijd}}$$
$$= \frac{dz}{dy_{ijd}}(1+\alpha n^2_{ijd})^{-\beta} - 2\alpha\beta x_{ijd} \sum_{i''j''} \eta_{i''j''d} \frac{dn^2_{i''j''d}}{d(x^2_{ijd})}, \qquad \eta_{i''j''d} = \frac{dz}{dy_{i''j''d}}(1+\alpha n^2_{i''j''d})^{-\beta-1} x_{i''j''d}.$$

Note that the summation can be computed as the derivative of the vl_nnpool block.

6.6.4 Softmax
Care must be taken in evaluating the exponential in order to avoid underflow or overflow.
The simplest way to do so is to divide the numerator and denominator by the exponential of
the maximum value:
$$y_{ijk} = \frac{e^{x_{ijk} - \max_d x_{ijd}}}{\sum_{t=1}^{D} e^{x_{ijt} - \max_d x_{ijd}}}.$$
The derivative is given by:
$$\frac{dz}{dx_{ijd}} = \sum_{k} \frac{dz}{dy_{ijk}}\left(e^{x_{ijd}} L(x)^{-1}\,\delta_{k=d} - e^{x_{ijd}}\, e^{x_{ijk}}\, L(x)^{-2}\right), \qquad L(x) = \sum_{t=1}^{D} e^{x_{ijt}}.$$
Simplifying:
$$\frac{dz}{dx_{ijd}} = y_{ijd}\left(\frac{dz}{dy_{ijd}} - \sum_{k=1}^{D}\frac{dz}{dy_{ijk}}\, y_{ijk}\right).$$
In matrix form:
$$\frac{dz}{dX} = Y \odot \left(\frac{dz}{dY} - \left(\frac{dz}{dY} \odot Y\right)\mathbf{1}\mathbf{1}^\top\right),$$

where $X, Y \in \mathbb{R}^{HW \times D}$ are the matrices obtained by reshaping the arrays x and y. Note that
the numerical implementation of this expression is straightforward once the output Y has
been computed with the caveats above.
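A numpy sketch of the stable forward pass and of the backward pass in the matrix form above (illustrative code, not library API):

```python
# Sketch: numerically stable softmax and its backward pass.
import numpy as np

def softmax_forward(X):
    """X: (HW, D); softmax along the class dimension D."""
    E = np.exp(X - X.max(axis=1, keepdims=True))  # subtract the row maximum
    return E / E.sum(axis=1, keepdims=True)

def softmax_backward(Y, dzdY):
    """dz/dX = Y * (dz/dY - row_sum(dz/dY * Y)), broadcast over columns."""
    inner = (dzdY * Y).sum(axis=1, keepdims=True)
    return Y * (dzdY - inner)

X = np.random.randn(5, 4) * 1000.0  # logits this large overflow a naive exp
Y = softmax_forward(X)
print(np.allclose(Y.sum(axis=1), 1.0))  # True
```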

6.7 Categorical losses


This section obtains the projected derivatives of the categorical losses in section 4.7. Recall
that all losses give a scalar output, so the projection tensor p is trivial (a scalar).

6.7.1 Classification losses


Top-K classification error. The derivative is zero a.e.

Log-loss. The projected derivative is:
$$\frac{\partial p\,\ell(x,c)}{\partial x_k} = -p\,\frac{\partial \log x_c}{\partial x_k} = -p\, x_c^{-1}\, \delta_{k=c}.$$

Softmax log-loss. The projected derivative is given by:
$$\frac{\partial p\,\ell(x,c)}{\partial x_k} = -p\,\frac{\partial}{\partial x_k}\left(x_c - \log \sum_{t=1}^{C} e^{x_t}\right) = -p\left(\delta_{k=c} - \frac{e^{x_k}}{\sum_{t=1}^{C} e^{x_t}}\right).$$
In brackets, we can recognize the output of the softmax operator:
$$y_k = \frac{e^{x_k}}{\sum_{t=1}^{C} e^{x_t}}.$$
Hence the loss derivative rewrites:
$$\frac{\partial p\,\ell(x,c)}{\partial x_k} = -p\,(\delta_{k=c} - y_k).$$

Multi-class hinge loss. The projected derivative is:
$$\frac{\partial p\,\ell(x,c)}{\partial x_k} = -p\,\mathbf{1}[x_c < 1]\,\delta_{k=c}.$$

Structured multi-class hinge loss. The projected derivative is:
$$\frac{\partial p\,\ell(x,c)}{\partial x_k} = -p\,\mathbf{1}\!\left[x_c < 1 + \max_{t \ne c} x_t\right](\delta_{k=c} - \delta_{k=t^*}), \qquad t^* = \operatorname*{argmax}_{t \ne c}\ x_t.$$

6.7.2 Attribute losses


Binary error. The derivative of the binary error is 0 a.e.

Binary log-loss. The projected derivative is:
$$\frac{\partial p\,\ell(x,c)}{\partial x} = -p\,\frac{c}{c\left(x - \frac{1}{2}\right) + \frac{1}{2}}.$$
∂x

Binary logistic loss. The projected derivative is:
$$\frac{\partial p\,\ell(x,c)}{\partial x} = -p\,\frac{\partial}{\partial x}\log\frac{1}{1+e^{-cx}} = -p\,\frac{c\,e^{-cx}}{1+e^{-cx}} = -\frac{p\,c}{e^{cx}+1} = -p\,c\,\sigma(-cx).$$

Binary hinge loss. The projected derivative is:
$$\frac{\partial p\,\ell(x,c)}{\partial x} = -p\,c\,\mathbf{1}[cx < 1].$$

6.8 Comparisons
6.8.1 p-distance
The derivative of the operator without root is given by:
$$\frac{dz}{dx_{ijd}} = \frac{dz}{dy_{ij}}\, p\,|x_{ijd} - \bar{x}_{ijd}|^{p-1}\operatorname{sign}(x_{ijd} - \bar{x}_{ijd}).$$
The derivative of the operator with root is given by:
$$\frac{dz}{dx_{ijd}} = \frac{dz}{dy_{ij}}\, \frac{1}{p}\left(\sum_{d'} |x_{ijd'} - \bar{x}_{ijd'}|^p\right)^{\frac{1}{p}-1} p\,|x_{ijd} - \bar{x}_{ijd}|^{p-1}\operatorname{sign}(x_{ijd} - \bar{x}_{ijd}) = \frac{dz}{dy_{ij}}\, \frac{|x_{ijd} - \bar{x}_{ijd}|^{p-1}\operatorname{sign}(x_{ijd} - \bar{x}_{ijd})}{y_{ij}^{p-1}};$$
$$\frac{dz}{d\bar{x}_{ijd}} = -\frac{dz}{dx_{ijd}}.$$
The formulas simplify a little for p = 1, 2 which are therefore implemented as special cases.
Bibliography

[1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In Proc. BMVC, 2014.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, 2015.

[4] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ArXiv e-prints, 2015.

[5] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.

[6] D. B. Kinghorn. Integrals and derivatives for correlated Gaussian functions using matrix differential calculus. International Journal of Quantum Chemistry, 57:141-155, 1996.

[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.

[8] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.

[9] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proc. ICLR, 2014.

[10] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015.

[11] A. Vedaldi and B. Fulkerson. VLFeat – an open and portable library of computer vision algorithms. In Proc. ACM Int. Conf. on Multimedia, 2010.

[12] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, 2014.


Image Super-Resolution Using Deep


Convolutional Networks
Chao Dong, Chen Change Loy, Member, IEEE, Kaiming He, Member, IEEE,
and Xiaoou Tang, Fellow, IEEE

arXiv:1501.00092v3 [cs.CV] 31 Jul 2015

Abstract—We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.

Index Terms—Super-resolution, deep convolutional neural networks, sparse coding

1 INTRODUCTION

Single image super-resolution (SR) [20], which aims at recovering a high-resolution image from a single low-resolution image, is a classical problem in computer vision. This problem is inherently ill-posed since a multiplicity of solutions exist for any given low-resolution pixel. In other words, it is an underdetermined inverse problem, of which the solution is not unique. Such a problem is typically mitigated by constraining the solution space by strong prior information. To learn the prior, recent state-of-the-art methods mostly adopt the example-based [46] strategy. These methods either exploit internal similarities of the same image [5], [13], [16], [19], [47], or learn mapping functions from external low- and high-resolution exemplar pairs [2], [4], [6], [15], [23], [25], [37], [41], [42], [47], [48], [50], [51]. The external example-based methods can be formulated for generic image super-resolution, or can be designed to suit domain specific tasks, i.e., face hallucination [30], [50], according to the training samples provided.

The sparse-coding-based method [49], [50] is one of the representative external example-based SR methods. This method involves several steps in its solution pipeline. First, overlapping patches are densely cropped from the input image and pre-processed (e.g., subtracting mean and normalization). These patches are then encoded by a low-resolution dictionary. The sparse coefficients are passed into a high-resolution dictionary for reconstructing high-resolution patches. The overlapping reconstructed patches are aggregated (e.g., by weighted averaging) to produce the final output. This pipeline is shared by most external example-based methods, which pay particular attention to learning and optimizing the dictionaries [2], [49], [50] or building efficient mapping functions [25], [41], [42], [47]. However, the rest of the steps in the pipeline have been rarely optimized or considered in a unified optimization framework.

In this paper, we show that the aforementioned pipeline is equivalent to a deep convolutional neural network [27] (more details in Section 3.2). Motivated by this fact, we consider a convolutional neural network that directly learns an end-to-end mapping between low- and high-resolution images. Our method differs fundamentally from existing external example-based approaches, in that ours does not explicitly learn the dictionaries [41], [49], [50] or manifolds [2], [4] for modeling the patch space. These are implicitly achieved via hidden layers. Furthermore, the patch extraction and aggregation are also formulated as convolutional layers, so are involved in the optimization. In our method, the entire SR pipeline is fully obtained through learning, with little pre/post-processing.

We name the proposed model Super-Resolution Convolutional Neural Network (SRCNN)1. The proposed SRCNN has several appealing properties. First, its structure is intentionally designed with simplicity in mind, and yet provides superior accuracy2 compared with state-of-the-art example-based methods. Figure 1 shows a comparison on an example.

• C. Dong, C. C. Loy and X. Tang are with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. E-mail: {dc012,ccloy,xtang}@ie.cuhk.edu.hk
• K. He is with the Visual Computing Group, Microsoft Research Asia, Beijing 100080, China. Email: kahe@microsoft.com

1. The implementation is available at http://mmlab.ie.cuhk.edu.hk/projects/SRCNN.html.
2. Numerical evaluations by using different metrics such as the Peak Signal-to-Noise Ratio (PSNR), structure similarity index (SSIM) [43], multi-scale SSIM [44], information fidelity criterion [38], when the ground truth images are available.
Fig. 1. The proposed Super-Resolution Convolutional Neural Network (SRCNN) surpasses the bicubic baseline with just a few training iterations, and outperforms the sparse-coding-based method (SC) [50] with moderate training. The performance may be further improved with more training iterations. More details are provided in Section 4.4.1 (the Set5 dataset with an upscaling factor 3). The proposed method provides visually appealing reconstructed image. (Panels: Original / PSNR; Bicubic / 24.04 dB; SC / 25.58 dB; SRCNN / 27.95 dB; plot: average test PSNR (dB) against number of backprops ×10^8.)

Second, with moderate numbers of filters and layers, our method achieves fast speed for practical on-line usage even on a CPU. Our method is faster than a number of example-based methods, because it is fully feed-forward and does not need to solve any optimization problem on usage. Third, experiments show that the restoration quality of the network can be further improved when (i) larger and more diverse datasets are available, and/or (ii) a larger and deeper model is used. On the contrary, larger datasets/models can present challenges for existing example-based methods. Furthermore, the proposed network can cope with three channels of color images simultaneously to achieve improved super-resolution performance.

Overall, the contributions of this study are mainly in three aspects:
1) We present a fully convolutional neural network for image super-resolution. The network directly learns an end-to-end mapping between low- and high-resolution images, with little pre/post-processing beyond the optimization.
2) We establish a relationship between our deep-learning-based SR method and the traditional sparse-coding-based SR methods. This relationship provides a guidance for the design of the network structure.
3) We demonstrate that deep learning is useful in the classical computer vision problem of super-resolution, and can achieve good quality and speed.

A preliminary version of this work was presented earlier [11]. The present work adds to the initial version in significant ways. Firstly, we improve the SRCNN by introducing larger filter size in the non-linear mapping layer, and explore deeper structures by adding non-linear mapping layers. Secondly, we extend the SRCNN to process three color channels (either in YCbCr or RGB color space) simultaneously. Experimentally, we demonstrate that performance can be improved in comparison to the single-channel network. Thirdly, considerable new analyses and intuitive explanations are added to the initial results. We also extend the original experiments from Set5 [2] and Set14 [51] test images to BSD200 [32] (200 test images). In addition, we compare with a number of recently published methods and confirm that our model still outperforms existing approaches using different evaluation metrics.

2 RELATED WORK

2.1 Image Super-Resolution

According to the image priors, single-image super resolution algorithms can be categorized into four types – prediction models, edge based methods, image statistical methods and patch based (or example-based) methods. These methods have been thoroughly investigated and evaluated in Yang et al.'s work [46]. Among them, the example-based methods [16], [25], [41], [47] achieve the state-of-the-art performance.

The internal example-based methods exploit the self-similarity property and generate exemplar patches from the input image. It is first proposed in Glasner's work [16], and several improved variants [13], [45] are proposed to accelerate the implementation. The external example-based methods [2], [4], [6], [15], [37], [41], [48], [49], [50], [51] learn a mapping between low/high-resolution patches from external datasets. These studies vary on how to learn a compact dictionary or manifold space to relate low/high-resolution patches, and on how representation schemes can be conducted in such spaces. In the pioneer work of Freeman et al. [14], the dictionaries are directly presented as low/high-resolution patch pairs, and the nearest neighbour (NN) of the input patch is found in the low-resolution space, with its corresponding high-resolution patch used for reconstruction. Chang et al. [4] introduce a manifold embedding technique as an alternative to the NN strategy. In Yang et al.'s work [49], [50], the above NN correspondence advances to a more sophisticated sparse coding formulation. Other
mapping functions such as kernel regression [25], simple function [47], random forest [37] and anchored neighborhood regression [41], [42] are proposed to further improve the mapping accuracy and speed. The sparse-coding-based method and its several improvements [41], [42], [48] are among the state-of-the-art SR methods nowadays. In these methods, the patches are the focus of the optimization; the patch extraction and aggregation steps are considered as pre/post-processing and handled separately.

The majority of SR algorithms [2], [4], [15], [41], [48], [49], [50], [51] focus on gray-scale or single-channel image super-resolution. For color images, the aforementioned methods first transform the problem to a different color space (YCbCr or YUV), and SR is applied only on the luminance channel. There are also works attempting to super-resolve all channels simultaneously. For example, Kim and Kwon [25] and Dai et al. [7] apply their model to each RGB channel and combined them to produce the final results. However, none of them has analyzed the SR performance of different channels, and the necessity of recovering all three channels.

2.2 Convolutional Neural Networks

Convolutional neural networks (CNN) date back decades [27] and deep CNNs have recently shown an explosive popularity partially due to its success in image classification [18], [26]. They have also been successfully applied to other computer vision fields, such as object detection [34], [40], [52], face recognition [39], and pedestrian detection [35]. Several factors are of central importance in this progress: (i) the efficient training implementation on modern powerful GPUs [26], (ii) the proposal of the Rectified Linear Unit (ReLU) [33] which makes convergence much faster while still presents good quality [26], and (iii) the easy access to an abundance of data (like ImageNet [9]) for training larger models. Our method also benefits from these progresses.

2.3 Deep Learning for Image Restoration

There have been a few studies of using deep learning techniques for image restoration. The multi-layer perceptron (MLP), whose all layers are fully-connected (in contrast to convolutional), is applied for natural image denoising [3] and post-deblurring denoising [36]. More closely related to our work, the convolutional neural network is applied for natural image denoising [22] and removing noisy patterns (dirt/rain) [12]. These restoration problems are more or less denoising-driven. Cui et al. [5] propose to embed auto-encoder networks in their super-resolution pipeline under the notion internal example-based approach [16]. The deep model is not specifically designed to be an end-to-end solution, since each layer of the cascade requires independent optimization of the self-similarity search process and the auto-encoder. On the contrary, the proposed SRCNN optimizes an end-to-end mapping. Further, the SRCNN is faster at speed. It is not only a quantitatively superior method, but also a practically useful one.

3 CONVOLUTIONAL NEURAL NETWORKS FOR SUPER-RESOLUTION

3.1 Formulation

Consider a single low-resolution image, we first upscale it to the desired size using bicubic interpolation, which is the only pre-processing we perform3. Let us denote the interpolated image as Y. Our goal is to recover from Y an image F(Y) that is as similar as possible to the ground truth high-resolution image X. For the ease of presentation, we still call Y a "low-resolution" image, although it has the same size as X. We wish to learn a mapping F, which conceptually consists of three operations:
1) Patch extraction and representation: this operation extracts (overlapping) patches from the low-resolution image Y and represents each patch as a high-dimensional vector. These vectors comprise a set of feature maps, of which the number equals to the dimensionality of the vectors.
2) Non-linear mapping: this operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector. Each mapped vector is conceptually the representation of a high-resolution patch. These vectors comprise another set of feature maps.
3) Reconstruction: this operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image. This image is expected to be similar to the ground truth X.
We will show that all these operations form a convolutional neural network. An overview of the network is depicted in Figure 2. Next we detail our definition of each operation.

3. Bicubic interpolation is also a convolutional operation, so it can be formulated as a convolutional layer. However, the output size of this layer is larger than the input size, so there is a fractional stride. To take advantage of the popular well-optimized implementations such as cuda-convnet [26], we exclude this "layer" from learning.

3.1.1 Patch extraction and representation
A popular strategy in image restoration (e.g., [1]) is to densely extract patches and then represent them by a set of pre-trained bases such as PCA, DCT, Haar, etc. This is equivalent to convolving the image by a set of filters, each of which is a basis. In our formulation, we involve the optimization of these bases into the optimization of the network. Formally, our first layer is expressed as an operation F1:
$$F_1(\mathbf{Y}) = \max(0, W_1 * \mathbf{Y} + B_1), \qquad (1)$$
where W1 and B1 represent the filters and biases respectively, and '∗' denotes the convolution operation. Here, W1 corresponds to n1 filters of support c × f1 × f1, where c is the number of channels in the input image, f1 is the spatial size of a filter. Intuitively, W1 applies n1 convolutions on the image, and each convolution has
Fig. 2. Given a low-resolution image Y, the first convolutional layer of the SRCNN extracts a set of feature maps. The second layer maps these feature maps nonlinearly to high-resolution patch representations. The last layer combines the predictions within a spatial neighbourhood to produce the final high-resolution image F(Y). (Diagram labels: low-resolution image (input); feature maps of low-resolution image; feature maps of high-resolution image; high-resolution image (output); patch extraction and representation; non-linear mapping; reconstruction.)

a kernel size c × f1 × f1. The output is composed of n1 feature maps. B1 is an n1-dimensional vector, whose each element is associated with a filter. We apply the Rectified Linear Unit (ReLU, max(0, x)) [33] on the filter responses4.

3.1.2 Non-linear mapping
The first layer extracts an n1-dimensional feature for each patch. In the second operation, we map each of these n1-dimensional vectors into an n2-dimensional one. This is equivalent to applying n2 filters which have a trivial spatial support 1 × 1. This interpretation is only valid for 1 × 1 filters. But it is easy to generalize to larger filters like 3 × 3 or 5 × 5. In that case, the non-linear mapping is not on a patch of the input image; instead, it is on a 3 × 3 or 5 × 5 "patch" of the feature map. The operation of the second layer is:
$$F_2(\mathbf{Y}) = \max(0, W_2 * F_1(\mathbf{Y}) + B_2). \qquad (2)$$
Here W2 contains n2 filters of size n1 × f2 × f2, and B2 is n2-dimensional. Each of the output n2-dimensional vectors is conceptually a representation of a high-resolution patch that will be used for reconstruction.

It is possible to add more convolutional layers to increase the non-linearity. But this can increase the complexity of the model (n2 × f2 × f2 × n2 parameters for one layer), and thus demands more training time. We will explore deeper structures by introducing additional non-linear mapping layers in Section 4.3.3.

3.1.3 Reconstruction
In the traditional methods, the predicted overlapping high-resolution patches are often averaged to produce the final full image. The averaging can be considered as a pre-defined filter on a set of feature maps (where each position is the "flattened" vector form of a high-resolution patch). Motivated by this, we define a convolutional layer to produce the final high-resolution image:
$$F(\mathbf{Y}) = W_3 * F_2(\mathbf{Y}) + B_3. \qquad (3)$$
Here W3 corresponds to c filters of a size n2 × f3 × f3, and B3 is a c-dimensional vector.

If the representations of the high-resolution patches are in the image domain (i.e., we can simply reshape each representation to form the patch), we expect that the filters act like an averaging filter; if the representations of the high-resolution patches are in some other domains (e.g., coefficients in terms of some bases), we expect that W3 behaves like first projecting the coefficients onto the image domain and then averaging. In either way, W3 is a set of linear filters.

Interestingly, although the above three operations are motivated by different intuitions, they all lead to the same form as a convolutional layer. We put all three operations together and form a convolutional neural network (Figure 2). In this model, all the filtering weights and biases are to be optimized. Despite the succinctness of the overall structure, our SRCNN model is carefully developed by drawing extensive experience resulted from significant progresses in super-resolution [49], [50]. We detail the relationship in the next section.

3.2 Relationship to Sparse-Coding-Based Methods

We show that the sparse-coding-based SR methods [49], [50] can be viewed as a convolutional neural network. Figure 3 shows an illustration.

In the sparse-coding-based methods, let us consider that an f1 × f1 low-resolution patch is extracted from the input image. Then the sparse coding solver, like Feature-Sign [29], will first project the patch onto a (low-resolution) dictionary. If the dictionary size is n1, this is equivalent to applying n1 linear filters (f1 × f1) on the input image (the mean subtraction is also a linear operation so can be absorbed). This is illustrated as the left part of Figure 3.

The sparse coding solver will then iteratively process the n1 coefficients. The outputs of this solver are n2 coefficients, and usually n2 = n1 in the case of sparse coding. These n2 coefficients are the representation of the high-resolution patch.

4. The ReLU can be equivalently considered as a part of the second operation (Non-linear mapping), and the first operation (Patch extraction and representation) becomes purely linear convolution.
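A minimal PyTorch sketch of the three-layer network defined by eqs. (1)-(3), with the basic setting f1 = 9, f2 = 1, f3 = 5, n1 = 64, n2 = 32 discussed later in the paper (PyTorch stands in here for the authors' cuda-convnet implementation; the class and parameter names are ours):

```python
# Sketch of the SRCNN architecture of eqs. (1)-(3), assuming the basic setting.
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, c=1, n1=64, n2=32, f1=9, f2=1, f3=5):
        super().__init__()
        self.patch_extraction = nn.Conv2d(c, n1, f1)    # eq. (1), no padding
        self.nonlinear_mapping = nn.Conv2d(n1, n2, f2)  # eq. (2)
        self.reconstruction = nn.Conv2d(n2, c, f3)      # eq. (3), linear layer
        self.relu = nn.ReLU()

    def forward(self, y):
        h = self.relu(self.patch_extraction(y))
        h = self.relu(self.nonlinear_mapping(h))
        return self.reconstruction(h)

# Y is the bicubic-upscaled low-resolution image. With no padding the output
# shrinks by f1 + f2 + f3 - 3 pixels per side, matching section 3.3.
y = torch.randn(1, 1, 33, 33)
print(SRCNN()(y).shape)  # torch.Size([1, 1, 21, 21])
```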
Fig. 3. An illustration of sparse-coding-based methods in the view of a convolutional neural network. (Diagram labels: responses of patch; neighbouring patches; patch extraction and representation; non-linear mapping; reconstruction.)

In this sense, the sparse coding solver behaves as a special case of a non-linear mapping operator, whose spatial support is 1 × 1. See the middle part of Figure 3. However, the sparse coding solver is not feed-forward, i.e., it is an iterative algorithm. On the contrary, our non-linear operator is fully feed-forward and can be computed efficiently. If we set f2 = 1, then our non-linear operator can be considered as a pixel-wise fully-connected layer. It is worth noting that "the sparse coding solver" in SRCNN refers to the first two layers, but not just the second layer or the activation function (ReLU). Thus the nonlinear operation in SRCNN is also well optimized through the learning process.

The above n2 coefficients (after sparse coding) are then projected onto another (high-resolution) dictionary to produce a high-resolution patch. The overlapping high-resolution patches are then averaged. As discussed above, this is equivalent to linear convolutions on the n2 feature maps. If the high-resolution patches used for reconstruction are of size f3 × f3, then the linear filters have an equivalent spatial support of size f3 × f3. See the right part of Figure 3.

The above discussion shows that the sparse-coding-based SR method can be viewed as a kind of convolutional neural network (with a different non-linear mapping). But not all operations have been considered in the optimization in the sparse-coding-based SR methods. On the contrary, in our convolutional neural network, the low-resolution dictionary, high-resolution dictionary, non-linear mapping, together with mean subtraction and averaging, are all involved in the filters to be optimized. So our method optimizes an end-to-end mapping that consists of all operations.

The above analogy can also help us to design hyper-parameters. For example, we can set the filter size of the last layer to be smaller than that of the first layer, and thus we rely more on the central part of the high-resolution patch (to the extreme, if f3 = 1, we are using the center pixel with no averaging). We can also set n2 < n1 because it is expected to be sparser. A typical and basic setting is f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32 (we evaluate more settings in the experiment section). On the whole, the estimation of a high resolution pixel utilizes the information of (9 + 5 − 1)² = 169 pixels. Clearly, the information exploited for reconstruction is comparatively larger than that used in existing external example-based approaches, e.g., using (5 + 5 − 1)² = 81 pixels5 [15], [50]. This is one of the reasons why the SRCNN gives superior performance.

3.3 Training

Learning the end-to-end mapping function F requires the estimation of network parameters Θ = {W1, W2, W3, B1, B2, B3}. This is achieved through minimizing the loss between the reconstructed images F(Y; Θ) and the corresponding ground truth high-resolution images X. Given a set of high-resolution images {Xi} and their corresponding low-resolution images {Yi}, we use Mean Squared Error (MSE) as the loss function:
$$L(\Theta) = \frac{1}{n}\sum_{i=1}^{n} \| F(\mathbf{Y}_i; \Theta) - \mathbf{X}_i \|^2, \qquad (4)$$
where n is the number of training samples. Using MSE as the loss function favors a high PSNR. The PSNR is a widely-used metric for quantitatively evaluating image restoration quality, and is at least partially related to the perceptual quality. It is worth noticing that the convolutional neural networks do not preclude the usage of other kinds of loss functions, if only the loss functions are derivable. If a better perceptually motivated metric is given during training, it is flexible for the network to adapt to that metric. On the contrary, such a flexibility is in general difficult to achieve for traditional "hand-crafted" methods. Despite that the proposed model is trained favoring a high PSNR, we still observe satisfactory performance when the model is evaluated using alternative evaluation metrics, e.g., SSIM, MSSIM (see Section 4.4.1).

The loss is minimized using stochastic gradient descent with the standard backpropagation [28]. In particular, the weight matrices are updated as
$$\Delta_{i+1} = 0.9 \cdot \Delta_i - \eta \cdot \frac{\partial L}{\partial W_i^{\ell}}, \qquad W_{i+1}^{\ell} = W_i^{\ell} + \Delta_{i+1}, \qquad (5)$$
where ℓ ∈ {1, 2, 3} and i are the indices of layers and iterations, η is the learning rate, and ∂L/∂W_i^ℓ is the derivative.

5. The patches are overlapped with 4 pixels at each direction.
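A short Python sketch of the update rule in eq. (5), i.e. SGD with momentum 0.9 and a per-layer learning rate (the helper name and the toy values are ours):

```python
# Sketch of eq. (5): Delta <- 0.9 * Delta - eta * dL/dW; W <- W + Delta.
def sgd_momentum_step(W, dLdW, velocity, lr, momentum=0.9):
    for l in range(len(W)):
        velocity[l] = momentum * velocity[l] - lr[l] * dLdW[l]
        W[l] = W[l] + velocity[l]
    return W, velocity

# Toy example with three scalar "layers"; per the text, lr is 1e-4 for the
# first two layers and 1e-5 for the last.
W, v = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
W, v = sgd_momentum_step(W, [0.5, -0.2, 0.1], v, lr=[1e-4, 1e-4, 1e-5])
print(W)
```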
The filter weights of each layer are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.001 (and 0 for biases). The learning rate is $10^{-4}$ for the first two layers, and $10^{-5}$ for the last layer. We empirically find that a smaller learning rate in the last layer is important for the network to converge (similar to the denoising case [22]).

In the training phase, the ground truth images {Xi} are prepared as fsub × fsub × c-pixel sub-images randomly cropped from the training images. By "sub-images" we mean these samples are treated as small "images" rather than "patches", in the sense that "patches" are overlapping and require some averaging as post-processing but "sub-images" need not. To synthesize the low-resolution samples {Yi}, we blur a sub-image by a Gaussian kernel, sub-sample it by the upscaling factor, and upscale it by the same factor via bicubic interpolation.

To avoid border effects during training, all the convolutional layers have no padding, and the network produces a smaller output ((fsub − f1 − f2 − f3 + 3)² × c). The MSE loss function is evaluated only by the difference between the central pixels of Xi and the network output. Although we use a fixed image size in training, the convolutional neural network can be applied on images of arbitrary sizes during testing.

We implement our model using the cuda-convnet package [26]. We have also tried the Caffe package [24] and observed similar performance.

4 EXPERIMENTS

We first investigate the impact of using different datasets on the model performance. Next, we examine the filters learned by our approach. We then explore different architecture designs of the network, and study the relations between super-resolution performance and factors like depth, number of filters, and filter sizes. Subsequently, we compare our method with recent state-of-the-arts both quantitatively and qualitatively. Following [42], super-resolution is only applied on the luminance channel (Y channel in YCbCr color space) in Sections 4.1-4.4, so c = 1 in the first/last layer, and performance (e.g., PSNR and SSIM) is evaluated on the Y channel. At last, we extend the network to cope with color images and evaluate the performance on different channels.

4.1 Training Data

As shown in the literature, deep learning generally benefits from big data training. For comparison, we use a relatively small training set [41], [50] that consists of 91 images, and a large training set that consists of 395,909 images from the ILSVRC 2013 ImageNet detection training partition. The size of training sub-images is fsub = 33. Thus the 91-image dataset can be decomposed into 24,800 sub-images, which are extracted from original images with a stride of 14. Whereas the ImageNet provides over 5 million sub-images even using a stride of 33. We use the basic network settings, i.e., f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32. We use the Set5 [2] as the validation set. We observe a similar trend even if we use the larger Set14 set [51]. The upscaling factor is 3. We use the sparse-coding-based method [50] as our baseline, which achieves an average PSNR value of 31.42 dB.

The test convergence curves of using different training sets are shown in Figure 4. The training time on ImageNet is about the same as on the 91-image dataset since the number of backpropagations is the same. As can be observed, with the same number of backpropagations (i.e., 8 × 10^8), the SRCNN+ImageNet achieves 32.52 dB, higher than 32.39 dB yielded by that trained on 91 images. The results positively indicate that SRCNN performance may be further boosted using a larger training set, but the effect of big data is not as impressive as that shown in high-level vision problems [26]. This is mainly because that the 91 images have already captured sufficient variability of natural images. On the other hand, our SRCNN is a relatively small network (8,032 parameters), which could not overfit the 91 images (24,800 samples). Nevertheless, we adopt the ImageNet, which contains more diverse data, as the default training set in the following experiments.

Fig. 4. Training with the much larger ImageNet dataset improves the performance over the use of 91 images. (Plot: average test PSNR (dB) against number of backprops ×10^8; curves: SRCNN (trained on ImageNet), SRCNN (trained on 91 images), SC (31.42 dB).)

4.2 Learned Filters for Super-Resolution

Figure 5 shows examples of learned first-layer filters trained on the ImageNet by an upscaling factor 3. Please refer to our published implementation for upscaling factors 2 and 4. Interestingly, each learned filter has its specific functionality. For instance, the filters g and h are like Laplacian/Gaussian filters, the filters a - e are like edge detectors at different directions, and the filter f is like a texture extractor. Example feature maps of different layers are shown in figure 6. Obviously, feature maps of the first layer contain different structures (e.g., edges at different directions), while that of the second layer are mainly different on intensities.

4.3 Model and Performance Trade-offs

Based on the basic network settings (i.e., f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32), we will progressively modify some of these parameters to investigate the best trade-off between performance and speed, and study the relations between performance and parameters.
Fig. 5. The figure shows the first-layer filters trained on ImageNet with an upscaling factor 3. The filters are organized based on their respective variances. (Grid of filters labelled a-h.)

Fig. 6. Example feature maps of different layers. (Panels: input; feature maps of the first layer; output; feature maps of the second layer.)

4.3.1 Filter number
In general, the performance would improve if we increase the network width6, i.e., adding more filters, at the cost of running time. Specifically, based on our network default settings of n1 = 64 and n2 = 32, we conduct two experiments: (i) one is with a larger network with n1 = 128 and n2 = 64, and (ii) the other is with a smaller network with n1 = 32 and n2 = 16. Similar to Section 4.1, we also train the two models on ImageNet and test on Set5 with an upscaling factor 3. The results observed at 8 × 10^8 backpropagations are shown in Table 1. It is clear that superior performance could be achieved by increasing the width. However, if a fast restoration speed is desired, a small network width is preferred, which could still achieve better performance than the sparse-coding-based method (31.42 dB).

6. We use 'width' to term the number of filters in a layer, following [17]. The term 'width' may have other meanings in the literature.

TABLE 1
The results of using different filter numbers in SRCNN. Training is performed on ImageNet whilst the evaluation is conducted on the Set5 dataset.

              n1 = 128, n2 = 64   n1 = 64, n2 = 32   n1 = 32, n2 = 16
  PSNR (dB)   32.60               32.52              32.26
  Time (sec)  0.60                0.18               0.05

4.3.2 Filter size
In this section, we examine the network sensitivity to different filter sizes. In previous experiments, we set the filter size f1 = 9, f2 = 1 and f3 = 5, and the network could be denoted as 9-1-5. First, to be consistent with sparse-coding-based methods, we fix the filter size of the second layer to be f2 = 1, and enlarge the filter size of other layers to f1 = 11 and f3 = 7 (11-1-7). All the other settings remain the same with Section 4.1. The results with an upscaling factor 3 on Set5 are 32.57 dB, which is slightly higher than the 32.52 dB reported in Section 4.1. This indicates that a reasonably larger filter size could grasp richer structural information, which in turn lead to better results.

Then we further examine networks with a larger filter size of the second layer. Specifically, we fix the filter size f1 = 9, f3 = 5, and enlarge the filter size of the second layer to be (i) f2 = 3 (9-3-5) and (ii) f2 = 5 (9-5-5). Convergence curves in Figure 7 show that using a larger filter size could significantly improve the performance. Specifically, the average PSNR values achieved by 9-3-5 and 9-5-5 on Set5 with 8 × 10^8 backpropagations are 32.66 dB and 32.75 dB, respectively. The results suggest that utilizing neighborhood information in the mapping stage is beneficial.

Fig. 7. A larger filter size leads to better results. (Plot: average test PSNR (dB) against number of backprops; curves: SRCNN (9-5-5), SRCNN (9-3-5), SRCNN (9-1-5), SC (31.42 dB).)

However, the deployment speed will also decrease with a larger filter size. For example, the number of parameters of 9-1-5, 9-3-5, and 9-5-5 is 8,032, 24,416, and 57,184 respectively. The complexity of 9-5-5 is almost twice of 9-3-5, but the performance improvement is marginal. Therefore, the choice of the network scale should always be a trade-off between performance and speed.

4.3.3 Number of layers
Recent study by He and Sun [17] suggests that CNN could benefit from increasing the depth of network moderately. Here, we try deeper structures by adding another non-linear mapping layer, which has n22 = 16 filters with size f22 = 1. We conduct three controlled experiments, i.e., 9-1-1-5, 9-3-1-5, 9-5-1-5, which add an additional layer on 9-1-5, 9-3-5, and 9-5-5, respectively. The initialization scheme and learning rate of the additional layer are the same as the second layer. From Figures 8(a), 8(b) and 8(c), we can observe that the four-layer networks converge slower than the three-layer network. Nevertheless, given enough training time, the deeper networks will finally catch up and converge to the three-layer ones.

The effectiveness of deeper structures for super resolution is found not as apparent as that shown in image classification [17]. Furthermore, we find that deeper networks do not always result in better performance. Specifically, if we add an additional layer with n22 = 32 filters on 9-1-5 network, then the performance degrades and fails to surpass the three-layer network (see Figure 9(a)). If we go deeper by adding two non-linear
4.3.3 Number of layers

A recent study by He and Sun [17] suggests that a CNN could benefit from moderately increasing the depth of the network. Here, we try deeper structures by adding another non-linear mapping layer, which has n22 = 16 filters with size f22 = 1. We conduct three controlled experiments, i.e., 9-1-1-5, 9-3-1-5, and 9-5-1-5, which add an additional layer on 9-1-5, 9-3-5, and 9-5-5, respectively. The initialization scheme and learning rate of the additional layer are the same as the second layer. From Figures 8(a), 8(b) and 8(c), we can observe that the four-layer networks converge slower than the three-layer networks. Nevertheless, given enough training time, the deeper networks will finally catch up and converge to the three-layer ones.

The effectiveness of deeper structures for super-resolution is not as apparent as that shown in image classification [17]. Furthermore, we find that deeper networks do not always result in better performance. Specifically, if we add an additional layer with n22 = 32 filters on the 9-1-5 network, then the performance degrades and fails to surpass the three-layer network (see Figure 9(a)). If we go deeper by adding two non-linear mapping layers with n22 = 32 and n23 = 16 filters on 9-1-5, then we have to set a smaller learning rate to ensure convergence, but we still do not observe superior performance after a week of training (see Figure 9(a)). We also tried to enlarge the filter size of the additional layer to f22 = 3, and explored two deep structures, 9-3-3-5 and 9-3-3-3. However, from the convergence curves shown in Figure 9(b), these two networks do not show better results than the 9-3-1-5 network.

Fig. 8. Comparisons between three-layer and four-layer networks: (a) 9-1-5 vs. 9-1-1-5; (b) 9-3-5 vs. 9-3-1-5; (c) 9-5-5 vs. 9-5-1-5.

Fig. 9. Deeper structure does not always lead to better results: (a) 9-1-1-5 (n22 = 32) and 9-1-1-1-5 (n22 = 32, n23 = 16); (b) 9-3-3-5 and 9-3-3-3.

All these experiments indicate that it is not “the deeper the better” in this deep model for super-resolution. It may be caused by the difficulty of training. Our CNN network contains no pooling layer or fully-connected layer, thus it is sensitive to the initialization parameters and learning rate. When we go deeper (e.g., 4 or 5 layers), we find it hard to set appropriate learning rates that guarantee convergence. Even if it converges, the network may fall into a bad local minimum, and the learned filters are of less diversity even given enough training time. This phenomenon is also observed in [17], where an improper increase of depth leads to accuracy saturation or degradation for image classification. Why “deeper is not better” is still an open question, which requires investigations to better understand gradients and training dynamics in deep architectures. Therefore, we still adopt three-layer networks in the following experiments.

4.4 Comparisons to State-of-the-Arts

In this section, we show the quantitative and qualitative results of our method in comparison to state-of-the-art methods. We adopt the model with a good performance-speed trade-off: a three-layer network with f1 = 9, f2 = 5, f3 = 5, n1 = 64, and n2 = 32 trained on ImageNet. For each upscaling factor ∈ {2, 3, 4}, we train a specific network for that factor⁷.

7. In the area of denoising [3], for each noise level a specific network is trained.

Comparisons. We compare our SRCNN with the state-of-the-art SR methods:
• SC - the sparse coding-based method of Yang et al. [50]
• NE+LLE - the neighbour embedding + locally linear embedding method [4]
• ANR - the Anchored Neighbourhood Regression method [41]
• A+ - the Adjusted Anchored Neighbourhood Regression method [42], and
• KK - the method described in [25], which achieves the best performance among external example-based methods, according to the comprehensive evaluation conducted in Yang et al.’s work [46]
The implementations are all from the publicly available codes provided by the authors, and all images are down-sampled using the same bicubic kernel.

Test set. The Set5 [2] (5 images), Set14 [51] (14 images) and BSD200 [32] (200 images)⁸ are used to evaluate the performance of upscaling factors 2, 3, and 4.

8. We use the same 200 images as in [46].

Evaluation metrics. Apart from the widely used PSNR and SSIM [43] indices, we also adopt another four evaluation metrics, namely the information fidelity criterion (IFC) [38], noise quality measure (NQM) [8], weighted peak signal-to-noise ratio (WPSNR) and multi-scale structure similarity index (MSSSIM) [44], which obtain high correlation with human perceptual scores as reported in [46].
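PSNR itself is standard and simple to compute; a minimal numpy sketch for 8-bit images (the border-cropping and color-conversion details of the full evaluation protocol are omitted here, so values may differ slightly from the tables):

```python
import numpy as np

def psnr(reference, estimate):
    """PSNR in dB between two 8-bit images of the same shape (peak 255)."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```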
4.4.1 Quantitative and qualitative evaluation

As shown in Tables 2, 3 and 4, the proposed SRCNN yields the highest scores in most evaluation metrics in all experiments⁹. Note that our SRCNN results are based on the checkpoint of 8 × 10⁸ backpropagations. Specifically, for the upscaling factor 3, the average gains on PSNR achieved by SRCNN are 0.15 dB, 0.17 dB, and 0.13 dB higher than the next best approach, A+ [42], on the three datasets. When we take a look at other evaluation metrics, we observe that SC, to our surprise, gets even lower scores than bicubic interpolation on IFC and NQM. It is clear that the results of SC are more visually pleasing than those of bicubic interpolation. This indicates that these two metrics may not truthfully reveal the image quality. Thus, regardless of these two metrics, SRCNN achieves the best performance among all methods and scaling factors.

9. The PSNR value of each image can be found in the supplementary file.

It is worth pointing out that SRCNN surpasses the bicubic baseline at the very beginning of the learning stage (see Figure 1), and with moderate training, SRCNN outperforms existing state-of-the-art methods (see Figure 4). Yet, the performance is far from convergence. We conjecture that better results can be obtained given longer training time (see Figure 10).

Fig. 10. The test convergence curve of SRCNN and the results of other methods on the Set5 dataset (A+ 32.59 dB, KK 32.28 dB, ANR 31.92 dB, NE+LLE 31.84 dB, SC 31.42 dB, Bicubic 30.39 dB).

Figures 14, 15 and 16 show the super-resolution results of different approaches by an upscaling factor 3. As can be observed, the SRCNN produces much sharper edges than other approaches without any obvious artifacts across the image.

In addition, we compare to another recent deep learning method for image super-resolution (DNC) of Cui et al. [5]. As they employ a different blur kernel (a Gaussian filter with a standard deviation of 0.55), we train a specific network (9-5-5) using the same blur kernel as DNC for a fair quantitative comparison. The upscaling factor is 3 and the training set is the 91-image dataset. From the convergence curve shown in Figure 11, we observe that our SRCNN surpasses DNC with just 2.7 × 10⁷ backprops, and a larger margin can be obtained given longer training time. This also demonstrates that end-to-end learning is superior to DNC, even if that model is already “deep”.

Fig. 11. The test convergence curve of SRCNN (9-5-5 trained on 91 images) and the result of DNC (32.08 dB) on the Set5 dataset (Bicubic 30.29 dB).

4.4.2 Running time

Figure 12 shows the running time comparisons of several state-of-the-art methods, along with their restoration performance on Set14. All baseline methods are obtained from the corresponding authors’ MATLAB+MEX implementations, whereas ours is in pure C++. We profile the running time of all the algorithms using the same machine (Intel CPU 3.10 GHz and 16 GB memory). Note that the processing time of our approach scales almost linearly with the test image resolution, since all images go through the same number of convolutions. Our method is always a trade-off between performance and speed. To show this, we train three networks for comparison: 9-1-5, 9-3-5, and 9-5-5. It is clear that the 9-1-5 network is the fastest, while it still achieves better performance than the next state-of-the-art method, A+. Other methods are several times or even orders of magnitude slower in comparison to the 9-1-5 network. Note that the speed gap is not mainly caused by the different MATLAB/C++ implementations; rather, the other methods need to solve complex optimization problems at test time (e.g., sparse coding or embedding), whereas our method is completely feed-forward. The 9-5-5 network achieves the best performance but at the cost of running time. The test-time speed of our CNN can be further accelerated in many ways, e.g., approximating or simplifying the trained networks [10], [21], [31], with possible slight degradation in performance.

4.5 Experiments on Color Channels

In previous experiments, we follow the conventional approach to super-resolve color images. Specifically, we first transform the color images into the YCbCr space. The SR algorithms are only applied on the Y channel, while the Cb, Cr channels are upscaled by bicubic interpolation. It is interesting to find out if super-resolution performance can be improved if we jointly consider all three channels in the process.

Our method is flexible enough to accept more channels without altering the learning mechanism and network design. In particular, it can readily deal with three channels simultaneously by setting the number of input channels to c = 3. In the following experiments, we explore different training strategies for color image super-resolution, and subsequently evaluate their performance on different channels.

Implementation details. Training is performed on the 91-image dataset, and testing is conducted on Set5 [2]. The network settings are: c = 3, f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32.
TABLE 2
The average results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the Set5 dataset.

Eval. Metric  Scale  Bicubic  SC [50]  NE+LLE [4]  KK [25]  ANR [41]  A+ [42]  SRCNN
PSNR          2      33.66    -        35.77       36.20    35.83     36.54    36.66
PSNR          3      30.39    31.42    31.84       32.28    31.92     32.59    32.75
PSNR          4      28.42    -        29.61       30.03    29.69     30.28    30.49
SSIM          2      0.9299   -        0.9490      0.9511   0.9499    0.9544   0.9542
SSIM          3      0.8682   0.8821   0.8956      0.9033   0.8968    0.9088   0.9090
SSIM          4      0.8104   -        0.8402      0.8541   0.8419    0.8603   0.8628
IFC           2      6.10     -        7.84        6.87     8.09      8.48     8.05
IFC           3      3.52     3.16     4.40        4.14     4.52      4.84     4.58
IFC           4      2.35     -        2.94        2.81     3.02      3.26     3.01
NQM           2      36.73    -        42.90       39.49    43.28     44.58    41.13
NQM           3      27.54    27.29    32.77       32.10    33.10     34.48    33.21
NQM           4      21.42    -        25.56       24.99    25.72     26.97    25.96
WPSNR         2      50.06    -        58.45       57.15    58.61     60.06    59.49
WPSNR         3      41.65    43.64    45.81       46.22    46.02     47.17    47.10
WPSNR         4      37.21    -        39.85       40.40    40.01     41.03    41.13
MSSSIM        2      0.9915   -        0.9953      0.9953   0.9954    0.9960   0.9959
MSSSIM        3      0.9754   0.9797   0.9841      0.9853   0.9844    0.9867   0.9866
MSSSIM        4      0.9516   -        0.9666      0.9695   0.9672    0.9720   0.9725

TABLE 3
The average results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the Set14 dataset.

Eval. Metric  Scale  Bicubic  SC [50]  NE+LLE [4]  KK [25]  ANR [41]  A+ [42]  SRCNN
PSNR          2      30.23    -        31.76       32.11    31.80     32.28    32.45
PSNR          3      27.54    28.31    28.60       28.94    28.65     29.13    29.30
PSNR          4      26.00    -        26.81       27.14    26.85     27.32    27.50
SSIM          2      0.8687   -        0.8993      0.9026   0.9004    0.9056   0.9067
SSIM          3      0.7736   0.7954   0.8076      0.8132   0.8093    0.8188   0.8215
SSIM          4      0.7019   -        0.7331      0.7419   0.7352    0.7491   0.7513
IFC           2      6.09     -        7.59        6.83     7.81      8.11     7.76
IFC           3      3.41     2.98     4.14        3.83     4.23      4.45     4.26
IFC           4      2.23     -        2.71        2.57     2.78      2.94     2.74
NQM           2      40.98    -        41.34       38.86    41.79     42.61    38.95
NQM           3      33.15    29.06    37.12       35.23    37.22     38.24    35.25
NQM           4      26.15    -        31.17       29.18    31.27     32.31    30.46
WPSNR         2      47.64    -        54.47       53.85    54.57     55.62    55.39
WPSNR         3      39.72    41.66    43.22       43.56    43.36     44.25    44.32
WPSNR         4      35.71    -        37.75       38.26    37.85     38.72    38.87
MSSSIM        2      0.9813   -        0.9886      0.9890   0.9888    0.9896   0.9897
MSSSIM        3      0.9512   0.9595   0.9643      0.9653   0.9647    0.9669   0.9675
MSSSIM        4      0.9134   -        0.9317      0.9338   0.9326    0.9371   0.9376

TABLE 4
The average results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the BSD200 dataset.

Eval. Metric  Scale  Bicubic  SC [50]  NE+LLE [4]  KK [25]  ANR [41]  A+ [42]  SRCNN
PSNR          2      28.38    -        29.67       30.02    29.72     30.14    30.29
PSNR          3      25.94    26.54    26.67       26.89    26.72     27.05    27.18
PSNR          4      24.65    -        25.21       25.38    25.25     25.51    25.60
SSIM          2      0.8524   -        0.8886      0.8935   0.8900    0.8966   0.8977
SSIM          3      0.7469   0.7729   0.7823      0.7881   0.7843    0.7945   0.7971
SSIM          4      0.6727   -        0.7037      0.7093   0.7060    0.7171   0.7184
IFC           2      5.30     -        7.10        6.33     7.28      7.51     7.21
IFC           3      3.05     2.77     3.82        3.52     3.91      4.07     3.91
IFC           4      1.95     -        2.45        2.24     2.51      2.62     2.45
NQM           2      36.84    -        41.52       38.54    41.72     42.37    39.66
NQM           3      28.45    28.22    34.65       33.45    34.81     35.58    34.72
NQM           4      21.72    -        25.15       24.87    25.27     26.01    25.65
WPSNR         2      46.15    -        52.56       52.21    52.69     53.56    53.58
WPSNR         3      38.60    40.48    41.39       41.62    41.53     42.19    42.29
WPSNR         4      34.86    -        36.52       36.80    36.64     37.18    37.24
MSSSIM        2      0.9780   -        0.9869      0.9876   0.9872    0.9883   0.9883
MSSSIM        3      0.9426   0.9533   0.9575      0.9588   0.9581    0.9609   0.9614
MSSSIM        4      0.9005   -        0.9203      0.9215   0.9214    0.9256   0.9261
As we have already shown the effectiveness of SRCNN on different scales, here we only evaluate the performance of upscaling factor 3.

Fig. 12. The proposed SRCNN achieves state-of-the-art super-resolution quality, whilst maintaining high and competitive speed in comparison to existing external example-based methods. The chart is based on Set14 results summarized in Table 3. The implementations of all three SRCNN networks are available on our project page.

Comparisons. We compare our method with the state-of-the-art color SR method KK [25]. We also try different learning strategies for comparison:
• Y only: this is our baseline method, which is a single-channel (c = 1) network trained only on the luminance channel. The Cb, Cr channels are upscaled using bicubic interpolation.
• YCbCr: training is performed on the three channels of the YCbCr space.
• Y pre-train: first, to guarantee the performance on the Y channel, we only use the MSE of the Y channel as the loss to pre-train the network. Then we employ the MSE of all channels to fine-tune the parameters.
• CbCr pre-train: we use the MSE of the Cb, Cr channels as the loss to pre-train the network, then fine-tune the parameters on all channels.
• RGB: training is performed on the three channels of the RGB space.

TABLE 5
Average PSNR (dB) of different channels and training strategies on the Set5 dataset.

Training Strategy   Y       Cb      Cr      RGB color image
Bicubic             30.39   45.44   45.42   34.57
Y only              32.39   45.44   45.42   36.37
YCbCr               29.25   43.30   43.49   33.47
Y pre-train         32.19   46.49   46.45   36.32
CbCr pre-train      32.14   46.38   45.84   36.25
RGB                 32.33   46.18   46.20   36.44
KK                  32.37   44.35   44.22   36.32

The results are shown in Table 5, where we have the following observations. (i) If we directly train on the YCbCr channels, the results are even worse than those of bicubic interpolation. The training falls into a bad local minimum, due to the inherently different characteristics of the Y and Cb, Cr channels. (ii) If we pre-train on the Y or Cb, Cr channels, the performance finally improves, but is still not better than “Y only” on the color image (see the last column of Table 5, where PSNR is computed in RGB color space). This suggests that the Cb, Cr channels could decrease the performance of the Y channel when training is performed in a unified network. (iii) We observe that the Cb, Cr channels have higher PSNR values for “Y pre-train” than for “CbCr pre-train”. The reason lies in the differences between the Cb, Cr channels and the Y channel. Visually, the Cb, Cr channels are more blurry than the Y channel, and thus are less affected by the downsampling process. When we pre-train on the Cb, Cr channels, only a few filters are activated, and the training then soon falls into a bad local minimum during fine-tuning. On the other hand, if we pre-train on the Y channel, more filters will be activated, and the performance on the Cb, Cr channels will be pushed much higher. Figure 13 shows the Cb, Cr channels of the first-layer filters with “Y pre-train”, whose patterns largely differ from those shown in Figure 5. (iv) Training on the RGB channels achieves the best result on the color image. Different from the YCbCr channels, the RGB channels exhibit high cross-correlation among each other. The proposed SRCNN is capable of leveraging such natural correspondences between the channels for reconstruction. Therefore, the model achieves a comparable result on the Y channel as “Y only”, and better results on the Cb, Cr channels than bicubic interpolation. (v) In KK [25], super-resolution is applied on each RGB channel separately. When we transform its results to YCbCr space, the PSNR value of the Y channel is similar to “Y only”, but those of the Cb, Cr channels are poorer than bicubic interpolation. The result suggests that the algorithm is biased to the Y channel. On the whole, our method trained on RGB channels achieves better performance than KK and the single-channel network (“Y only”). It is also worth noting that the improvement compared with the single-channel network is not that significant (i.e., 0.07 dB). This indicates that the Cb, Cr channels barely help in improving the performance.

Fig. 13. Chrominance channels of the first-layer filters using the “Y pre-train” strategy: (a) first-layer filters – Cb channel; (b) first-layer filters – Cr channel.
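A minimal sketch of the conventional “Y only” pipeline described above (not the authors' implementation): convert to YCbCr, super-resolve only the luminance, and upscale the chrominance channels bicubically. The srcnn_y() hook is a hypothetical stand-in for a trained model; here it returns the bicubic estimate unchanged so the script runs end to end.

```python
import numpy as np
from PIL import Image

def srcnn_y(y):
    """Hypothetical stand-in for a trained SRCNN applied to the Y channel;
    a real model would refine this bicubic estimate."""
    return y

def super_resolve_color(img, scale=3):
    w, h = img.size
    # Bicubic-upscale in YCbCr; the SR model only touches the Y channel.
    ycbcr = img.convert("YCbCr").resize((w * scale, h * scale), Image.BICUBIC)
    y, cb, cr = ycbcr.split()
    y = np.clip(srcnn_y(np.asarray(y, dtype=np.float32)), 0, 255)
    out = Image.merge("YCbCr", (Image.fromarray(y.astype(np.uint8)), cb, cr))
    return out.convert("RGB")
```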
5 CONCLUSION

We have presented a novel deep learning approach for single image super-resolution (SR). We show that conventional sparse-coding-based SR methods can be reformulated into a deep convolutional neural network. The proposed approach, SRCNN, learns an end-to-end mapping between low- and high-resolution images, with little extra pre/post-processing beyond the optimization. With a lightweight structure, the SRCNN has achieved performance superior to the state-of-the-art methods. We conjecture that additional performance can be further gained by exploring more filters and different training strategies. Besides, the proposed structure, with its advantages of simplicity and robustness, could be applied to other low-level vision problems, such as image deblurring or simultaneous SR+denoising. One could also investigate a network to cope with different upscaling factors.

REFERENCES

[1] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54(11), 4311–4322 (2006)
[2] Bevilacqua, M., Roumy, A., Guillemot, C., Morel, M.L.A.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: British Machine Vision Conference (2012)
[3] Burger, H.C., Schuler, C.J., Harmeling, S.: Image denoising: Can plain neural networks compete with BM3D? In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2392–2399 (2012)
[4] Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: IEEE Conference on Computer Vision and Pattern Recognition (2004)
[5] Cui, Z., Chang, H., Shan, S., Zhong, B., Chen, X.: Deep network cascade for image super-resolution. In: European Conference on Computer Vision, pp. 49–64 (2014)
[6] Dai, D., Timofte, R., Van Gool, L.: Jointly optimized regressors for image super-resolution. In: Eurographics. vol. 7, p. 8 (2015)
[7] Dai, S., Han, M., Xu, W., Wu, Y., Gong, Y., Katsaggelos, A.K.: Softcuts: a soft edge smoothness prior for color image super-resolution. IEEE Transactions on Image Processing 18(5), 969–981 (2009)
[8] Damera-Venkata, N., Kite, T.D., Geisler, W.S., Evans, B.L., Bovik, A.C.: Image quality assessment based on a degradation model. IEEE Transactions on Image Processing 9(4), 636–650 (2000)
[9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
[10] Denton, E., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in Neural Information Processing Systems (2014)
[11] Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European Conference on Computer Vision, pp. 184–199 (2014)
[12] Eigen, D., Krishnan, D., Fergus, R.: Restoring an image taken through a window covered with dirt or rain. In: IEEE International Conference on Computer Vision. pp. 633–640 (2013)
[13] Freedman, G., Fattal, R.: Image and video upscaling from local self-examples. ACM Transactions on Graphics 30(2), 12 (2011)
[14] Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. Computer Graphics and Applications 22(2), 56–65 (2002)
[15] Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low-level vision. International Journal of Computer Vision 40(1), 25–47 (2000)
[16] Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image. In: IEEE International Conference on Computer Vision. pp. 349–356 (2009)
[17] He, K., Sun, J.: Convolutional neural networks at constrained time cost. arXiv preprint arXiv:1412.1710 (2014)
[18] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European Conference on Computer Vision, pp. 346–361 (2014)
[19] Huang, J.B., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 5197–5206 (2015)
[20] Irani, M., Peleg, S.: Improving resolution by image registration. Graphical Models and Image Processing 53(3), 231–239 (1991)
[21] Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: British Machine Vision Conference (2014)
[22] Jain, V., Seung, S.: Natural image denoising with convolutional networks. In: Advances in Neural Information Processing Systems. pp. 769–776 (2008)
[23] Jia, K., Wang, X., Tang, X.: Image transformation based on learning dictionaries across image spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(2), 367–380 (2013)
[24] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: ACM Multimedia. pp. 675–678 (2014)
[25] Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(6), 1127–1133 (2010)
[26] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
[27] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation pp. 541–551 (1989)
[28] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
[29] Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: Advances in Neural Information Processing Systems. pp. 801–808 (2006)
[30] Liu, C., Shum, H.Y., Freeman, W.T.: Face hallucination: Theory and practice. International Journal of Computer Vision 75(1), 115–134 (2007)
[31] Mamalet, F., Garcia, C.: Simplifying convnets for fast learning. In: International Conference on Artificial Neural Networks, pp. 58–65. Springer (2012)
[32] Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE International Conference on Computer Vision. vol. 2, pp. 416–423 (2001)
[33] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning. pp. 807–814 (2010)
[34] Ouyang, W., Luo, P., Zeng, X., Qiu, S., Tian, Y., Li, H., Yang, S., Wang, Z., Xiong, Y., Qian, C., et al.: DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection. arXiv preprint arXiv:1409.3505 (2014)
[35] Ouyang, W., Wang, X.: Joint deep learning for pedestrian detection. In: IEEE International Conference on Computer Vision. pp. 2056–2063 (2013)
[36] Schuler, C.J., Burger, H.C., Harmeling, S., Scholkopf, B.: A machine learning approach for non-blind image deconvolution. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1067–1074 (2013)
[37] Schulter, S., Leistner, C., Bischof, H.: Fast and accurate image upscaling with super-resolution forests. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 3791–3799 (2015)
[38] Sheikh, H.R., Bovik, A.C., De Veciana, G.: An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on Image Processing 14(12), 2117–2128 (2005)
[39] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems. pp. 1988–1996 (2014)
[40] Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441 (2014)
[41] Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast example-based super-resolution. In: IEEE International Conference on Computer Vision. pp. 1920–1927 (2013)
[42] Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution. In: IEEE Asian Conference on Computer Vision (2014)
[43] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
[44] Wang, Z., Simoncelli, E.P., Bovik, A.C.: Multiscale structural similarity for image quality assessment. In: IEEE Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers. vol. 2, pp. 1398–1402 (2003)
[45] Yang, C.Y., Huang, J.B., Yang, M.H.: Exploiting self-similarities for single frame super-resolution. In: IEEE Asian Conference on Computer Vision, pp. 497–510 (2010)
[46] Yang, C.Y., Ma, C., Yang, M.H.: Single-image super-resolution: A benchmark. In: European Conference on Computer Vision, pp. 372–386 (2014)
[47] Yang, J., Lin, Z., Cohen, S.: Fast image super-resolution based on in-place example regression. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1059–1066 (2013)
[48] Yang, J., Wang, Z., Lin, Z., Cohen, S., Huang, T.: Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing 21(8), 3467–3478 (2012)
[49] Yang, J., Wright, J., Huang, T., Ma, Y.: Image super-resolution as sparse representation of raw image patches. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8 (2008)
[50] Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19(11), 2861–2873 (2010)
[51] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations. In: Curves and Surfaces, pp. 711–730 (2012)
[52] Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: European Conference on Computer Vision. pp. 834–849 (2014)

Chao Dong received the BS degree in Information Engineering from Beijing Institute of Technology, China, in 2011. He is currently working toward the PhD degree in the Department of Information Engineering at the Chinese University of Hong Kong. His research interests include image super-resolution and denoising.

Chen Change Loy received the PhD degree in Computer Science from the Queen Mary University of London in 2010. He is currently a Research Assistant Professor in the Department of Information Engineering, Chinese University of Hong Kong. Previously he was a postdoctoral researcher at Vision Semantics Ltd. His research interests include computer vision and pattern recognition, with focus on face analysis, deep learning, and visual surveillance.

Kaiming He received the BS degree from Tsinghua University in 2007, and the PhD degree from the Chinese University of Hong Kong in 2011. He joined Microsoft Research Asia (MSRA) in 2011, where he is a researcher. His research interests include computer vision and computer graphics. He has won the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He is a member of the IEEE.

Xiaoou Tang (S'93-M'96-SM'02-F'09) received the BS degree from the University of Science and Technology of China, Hefei, in 1990, the MS degree from the University of Rochester, New York, in 1991, and the PhD degree from the Massachusetts Institute of Technology, Cambridge, in 1996. He is a professor in the Department of Information Engineering and an associate dean (Research) of the Faculty of Engineering of the Chinese University of Hong Kong. He worked as the group manager of the Visual Computing Group at Microsoft Research Asia from 2005 to 2008. His research interests include computer vision, pattern recognition, and video processing. He received the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He was a program chair of the IEEE International Conference on Computer Vision (ICCV) 2009 and he is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is a fellow of the IEEE.

Fig. 14. The “butterfly” image from Set5 with an upscaling factor 3. Original / PSNR; Bicubic / 24.04 dB; SC / 25.58 dB; NE+LLE / 25.75 dB; KK / 27.31 dB; ANR / 25.90 dB; A+ / 27.24 dB; SRCNN / 27.95 dB.

Fig. 15. The “ppt3” image from Set14 with an upscaling factor 3. Original / PSNR; Bicubic / 23.71 dB; SC / 24.98 dB; NE+LLE / 24.94 dB; KK / 25.60 dB; ANR / 25.03 dB; A+ / 26.09 dB; SRCNN / 27.04 dB.

Fig. 16. The “zebra” image from Set14 with an upscaling factor 3. Original / PSNR; Bicubic / 26.63 dB; SC / 27.95 dB; NE+LLE / 28.31 dB; KK / 28.85 dB; ANR / 28.43 dB; A+ / 28.98 dB; SRCNN / 29.29 dB.
Beyond Short Snippets: Deep Networks for Video Classification

Joe Yue-Hei Ng1   Matthew Hausknecht2   Sudheendra Vijayanarasimhan3
yhng@umiacs.umd.edu   mhauskn@cs.utexas.edu   svnaras@google.com

Oriol Vinyals3   Rajat Monga3   George Toderici3
vinyals@google.com   rajatmonga@google.com   gtoderici@google.com

1 University of Maryland, College Park   2 University of Texas at Austin   3 Google, Inc.

arXiv:1503.08909v2 [cs.CV] 13 Apr 2015
Abstract

Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).

Figure 1: Overview of our approach.

1. Introduction

Convolutional Neural Networks have proven highly successful at static image recognition problems such as the MNIST, CIFAR, and ImageNet Large-Scale Visual Recognition Challenge [15, 21, 28]. By using a hierarchy of trainable filters and feature pooling operations, CNNs are capable of automatically learning complex features required for visual object recognition tasks, achieving performance superior to hand-crafted features. Encouraged by these positive results, several approaches have been proposed recently to apply CNNs to video and action classification tasks [2, 13, 14, 19].

Video analysis provides more information to the recognition task by adding a temporal component through which motion and other information can be additionally used. At the same time, the task is much more computationally demanding even for processing short video clips, since each video might contain hundreds to thousands of frames, not all of which are useful. A naïve approach would be to treat video frames as still images and apply CNNs to recognize each frame and average the predictions at the video level. However, since each individual video frame forms only a small part of the video’s story, such an approach would be using incomplete information and could therefore easily confuse classes, especially if there are fine-grained distinctions or portions of the video irrelevant to the action of interest.

Therefore, we hypothesize that learning a global description of the video’s temporal evolution is important for accurate video classification. This is challenging from a modeling perspective as we have to model variable length videos with a fixed number of parameters. We evaluate two approaches capable of meeting this requirement: feature-pooling and recurrent neural networks.
The feature pooling networks independently process each frame using a CNN and then combine frame-level information using various pooling layers. The recurrent neural network architecture we employ is derived from Long Short Term Memory (LSTM) [11] units, and uses memory cells to store, modify, and access internal state, allowing it to discover long-range temporal relationships. Like feature-pooling, LSTM networks operate on frame-level CNN activations, and can learn how to integrate information over time. By sharing parameters through time, both architectures are able to maintain a constant number of parameters while capturing a global description of the video’s temporal evolution.

Since we are addressing the problem of video classification, it is natural to attempt to take advantage of motion information in order to have a better performing network. Previous work [14] has attempted to address this issue by using frame stacks as input. However, this type of approach is computationally intensive since it involves thousands of 3D convolutional filters applied over the input volumes. The performance gained by applying such a method is below 2% on the Sports-1M benchmarks [14]. As a result, in this work, we avoid implicit motion feature computation.

In order to learn a global description of the video while maintaining a low computational footprint, we propose processing only one frame per second. At this frame rate, implicit motion information is lost. To compensate, following [19] we incorporate explicit motion information in the form of optical flow images computed over adjacent frames. Thus optical flow allows us to retain the benefits of motion information (typically achieved through high-fps sampling) while still capturing global video information. Our contributions can be summarized as follows:
1. We propose CNN architectures for obtaining global video-level descriptors and demonstrate that using increasing numbers of frames significantly improves classification performance.
2. By sharing parameters through time, the number of parameters remains constant as a function of video length in both the feature pooling and LSTM architectures.
3. We confirm that optical flow images can greatly benefit video classification and present results showing that even if the optical flow images themselves are very noisy (as is the case with the Sports-1M dataset), they can still provide a benefit when coupled with LSTMs.
Leveraging these three principles, we achieve state-of-the-art performance on two different video classification tasks: Sports-1M (Section 4.1) and UCF-101 (Section 4.2).

2. Related Work

Traditional video recognition research has been extremely successful at obtaining global video descriptors that encode both appearance and motion information in order to provide state-of-the-art results on a large number of video datasets. These approaches are able to aggregate local appearance and motion information using hand-crafted features such as Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF), and Motion Boundary Histograms (MBH) around spatio-temporal interest points [17], in a dense grid [24] or around dense point trajectories [12, 16, 22, 23] obtained through optical flow based tracking. These features are then encoded in order to produce a global video-level descriptor through bag of words (BoW) [17] or Fisher vector based encodings [23].

However, no previous attempts at CNN-based video recognition use both motion information and a global description of the video: Several approaches [2, 13, 14] employ 3D-convolution over short video clips - typically just a few seconds - to learn motion features from raw frames implicitly and then aggregate predictions at the video level. Karpathy et al. [14] demonstrate that their network is just marginally better than a single-frame baseline, which indicates that learning motion features is difficult. In view of this, Simonyan et al. [19] directly incorporate motion information from optical flow, but only sample up to 10 consecutive frames at inference time. The disadvantage of such local approaches is that each frame/clip may contain only a small part of the full video’s information, resulting in a network that performs no better than the naïve approach of classifying individual frames.

Instead of trying to learn spatio-temporal features over small time periods, we consider several different ways to aggregate strong CNN image features over long periods of a video (tens of seconds), including feature pooling and recurrent neural networks. Standard recurrent networks have trouble learning over long sequences due to the problem of vanishing and exploding gradients [3]. In contrast, the Long Short Term Memory (LSTM) [11] uses memory cells to store, modify, and access internal state, allowing it to better discover long-range temporal relationships. For this reason, LSTMs yield state-of-the-art results in handwriting recognition [8, 10], speech recognition [9, 7], phoneme detection [5], emotion detection [25], segmentation of meetings and events [18], and evaluating programs [27]. While LSTMs have been applied to action classification in [1], the model is learned on top of SIFT features and a BoW representation. In addition, our proposed models allow joint fine tuning of the convolutional and recurrent parts of the network, which is not possible to do when using hand-crafted features, as proposed in prior work. Baccouche et al. [1] learn globally using Long Short-Term Memory (LSTM) networks on the output of 3D-convolution applied to 9-frame video clips, but incorporate no explicit motion information.

3. Approach
Two CNN architectures are used to process individual video frames: AlexNet and GoogLeNet. AlexNet is a Krizhevsky-style CNN [15] which takes a 220 × 220 sized frame as input. This frame is then processed by square convolutional layers of size 11, 9, and 5, each followed by max-pooling and local contrast normalization. Finally, outputs are fed to two fully-connected layers, each with 4096 rectified linear units (ReLU). Dropout is applied to each fully-connected layer with a ratio of 0.6 (keeping and scaling 40% of the original outputs).

GoogLeNet [21] uses a network-in-network approach, stacking Inception modules to form a network 22 layers deep that is substantially different from previous CNNs [15, 28]. Like AlexNet, GoogLeNet takes a single image of size 220 × 220 as input. This image is then passed through multiple Inception modules, each of which applies, in parallel, 1×1, 3×3, 5×5 convolution, and max-pooling operations and concatenates the resulting filters. Finally, the activations are average-pooled and output as a 1000-dimensional vector.

In the following sections, we investigate two classes of CNN architectures capable of aggregating video-level information. In the first section, we investigate various feature pooling architectures that are agnostic to temporal order, and in the following section we investigate LSTM networks, which are capable of learning from temporally ordered sequences. In order to make learning computationally feasible, in all methods the CNNs share parameters across frames.

3.1. Feature Pooling Architectures

Temporal feature pooling has been extensively used for video classification [17, 24, 12], and has usually been applied to bag-of-words representations. Typically, image-based or motion features are computed at every frame, quantized, then pooled across time. The resulting vector can be used for making video-level predictions. We follow a similar line of reasoning, except that because we work with neural networks, the pooling operation can be incorporated directly as a layer. This allows us to experiment with the location of the temporal pooling layer with respect to the network architecture.

We analyze several variations depending on the specific pooling method and the particular layer whose features are aggregated. The pooling operation need not be limited to max-pooling. We considered using both average pooling and max-pooling, which have several desirable properties as shown in [4]. In addition, we attempted to employ a fully connected layer as a “pooling layer”. However, we found that both average pooling and a fully connected layer for pooling failed to learn effectively due to the large number of gradients that they generate. Max-pooling generates much sparser updates, and as a result tends to yield networks that learn faster, since the gradient update is generated by a sparse set of features from each frame. Therefore, in the rest of the paper we use max-pooling as the main feature aggregation technique.

Unlike traditional bag of words approaches, gradients coming from the top layers help learn useful features from image pixels, while allowing the network to choose which of the input frames are affected by these updates. When used with max-pooling, this is reminiscent of multiple instance learning, where the learner knows that at least one of the inputs is relevant to the target class.

We experimented with several variations of the basic max-pooling architecture as shown in Figure 2:

Figure 2: Different Feature-Pooling Architectures: The stacked convolutional layers are denoted by “C”. Blue, green, yellow and orange rectangles represent max-pooling, time-domain convolutional, fully-connected and softmax layers respectively. (a) Conv Pooling; (b) Late Pooling; (c) Slow Pooling; (d) Local Pooling; (e) Time-Domain Convolution.

Conv Pooling: The Conv Pooling model performs max-pooling over the final convolutional layer across the video’s frames. A key advantage of this network is that the spatial information in the output of the convolutional layer is preserved through a max operation over the time domain.

Late Pooling: The Late Pooling model first passes convolutional features through two fully connected layers before applying the max-pooling layer. The weights of all convolutional layers and fully connected layers are shared. Compared to Conv Pooling, Late Pooling directly combines high-level information across frames.
Slow Pooling: Slow Pooling hierarchically combines frame-level information from smaller temporal windows. Slow Pooling uses a two-stage pooling strategy: max-pooling is first applied over 10 frames of convolutional features with stride 5 (e.g., max-pooling may be thought of as a size-10 filter being convolved over a 1-D input with stride 5). Each max-pooling layer is then followed by a fully-connected layer with shared weights. In the second stage, a single max-pooling layer combines the outputs of all fully-connected layers. In this manner, the Slow Pooling network groups temporally local features before combining high-level information from many frames.

Local Pooling: Similar to Slow Pooling, the Local Pooling model combines frame-level features locally after the last convolutional layer. Unlike Slow Pooling, Local Pooling only contains a single stage of max-pooling after the convolutional layers. This is followed by two fully connected layers with shared parameters. Finally, a larger softmax layer is connected to all towers. By eliminating the second max-pooling layer, the Local Pooling network avoids a potential loss of temporal information.

Time-Domain Convolution: The Time-Domain Convolution model contains an extra time-domain convolutional layer before feature pooling across frames. Max-pooling is performed on the temporal domain after the time-domain convolutional layer. The convolutional layer consists of 256 kernels of size 3 × 3 across 10 frames with frame stride 5. This model aims at capturing local relationships between frames within a small temporal window.

GoogLeNet Conv Pooling: We experimented with an architecture based on GoogLeNet [21], in which the max-pooling operation is performed after the dimensionality reduction (average pooling) layer in GoogLeNet. This is the layer which in the original architecture was directly connected to the softmax layer. We enhanced this architecture by adding two fully connected layers of size 4096 with ReLU activations on top of the 1000D output but before the softmax. Similar to the AlexNet-based models, the weights of convolutional layers and Inception modules are shared across time.
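All of these variants share one core operation: pooling frame-level CNN activations over the time axis. A minimal numpy sketch of the Conv Pooling case (the tensor shapes are illustrative assumptions, not the paper's exact layout):

```python
import numpy as np

def conv_pooling(frame_features):
    """Max-pool last-layer conv activations over time.

    frame_features: array of shape (T, H, W, C) holding the final
    convolutional feature maps of T frames. Taking the max over the
    time axis keeps the spatial layout intact, the property that makes
    Conv Pooling work well.
    """
    return frame_features.max(axis=0)

# e.g., 120 frames of 14x14x256 activations -> one 14x14x256 descriptor
video_descriptor = conv_pooling(np.random.rand(120, 14, 14, 256))
```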
3.2. LSTM Architecture

In contrast to max-pooling, which produces representations that are order invariant, we propose using a recurrent neural network to explicitly consider sequences of CNN activations. Since videos contain dynamic content, the variations between frames may encode additional information which could be useful in making more accurate predictions.

Given an input sequence x = (x_1, ..., x_T), a standard recurrent neural network computes the hidden vector sequence h = (h_1, ..., h_T) and output vector sequence y = (y_1, ..., y_T) by iterating the following equations from t = 1 to T:

h_t = H(W_ih x_t + W_hh h_{t−1} + b_h)    (1)
y_t = W_ho h_t + b_o    (2)

where the W terms denote weight matrices (e.g. W_ih is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector) and H is the hidden layer activation function, typically the logistic sigmoid function.

Unlike standard RNNs, the Long Short Term Memory (LSTM) architecture [6] uses memory cells (Figure 3) to store and output information, allowing it to better discover long-range temporal relationships. The hidden layer H of the LSTM is computed as follows:

i_t = σ(W_xi x_t + W_hi h_{t−1} + W_ci c_{t−1} + b_i)    (3)
f_t = σ(W_xf x_t + W_hf h_{t−1} + W_cf c_{t−1} + b_f)    (4)
c_t = f_t c_{t−1} + i_t tanh(W_xc x_t + W_hc h_{t−1} + b_c)    (5)
o_t = σ(W_xo x_t + W_ho h_{t−1} + W_co c_t + b_o)    (6)
h_t = o_t tanh(c_t)    (7)

where σ is the logistic sigmoid function, and i, f, o, and c are respectively the input gate, forget gate, output gate, and cell activation vectors. By default, the value stored in the LSTM cell c is maintained unless it is added to by the input gate i or diminished by the forget gate f. The output gate o controls the emission of the memory value from the LSTM cell.

Figure 3: Each LSTM cell remembers a single floating point value c_t (Eq. 5). This value may be diminished or erased through a multiplicative interaction with the forget gate f_t (Eq. 4) or additively modified by the current input x_t multiplied by the activation of the input gate i_t (Eq. 3). The output gate o_t controls the emission of h_t, the stored memory c_t transformed by the hyperbolic tangent nonlinearity (Eqs. 6, 7). Image duplicated with permission from Alex Graves.
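A minimal numpy sketch of one LSTM step implementing Eqs. (3)-(7). One assumption worth flagging: the peephole terms (W_ci, W_cf, W_co) are applied elementwise, i.e., treated as diagonal matrices, which is the usual convention for this formulation even though the equations write them as generic matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step following Eqs. (3)-(7).

    W maps names like 'xi' (input -> input gate) and 'hi' (hidden ->
    input gate) to matrices; the peephole entries 'ci', 'cf', 'co' are
    vectors applied elementwise (diagonal-matrix assumption).
    """
    i = sigmoid(W["xi"] @ x + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])  # Eq. (3)
    f = sigmoid(W["xf"] @ x + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])  # Eq. (4)
    c = f * c_prev + i * np.tanh(W["xc"] @ x + W["hc"] @ h_prev + b["c"])    # Eq. (5)
    o = sigmoid(W["xo"] @ x + W["ho"] @ h_prev + W["co"] * c + b["o"])       # Eq. (6)
    h = o * np.tanh(c)                                                       # Eq. (7)
    return h, c

# Tiny usage example with random parameters (sizes are illustrative).
d, m = 8, 16
rng = np.random.default_rng(0)
W = {name: rng.normal(size=(m, d)) if name[0] == "x"
           else rng.normal(size=(m, m)) if name[0] == "h"
           else rng.normal(size=m)
     for name in ("xi", "hi", "ci", "xf", "hf", "cf", "xc", "hc", "xo", "ho", "co")}
b = {gate: np.zeros(m) for gate in "ifco"}
h, c = lstm_step(rng.normal(size=d), np.zeros(m), np.zeros(m), W, b)
```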
We use a deep LSTM architecture [9] (Figure 4), in which the output from one LSTM layer is the input to the next layer. We experimented with various numbers of layers and memory cells, and chose to use five stacked LSTM layers, each with 512 memory cells. Following the LSTM layers, a Softmax classifier makes a prediction at every frame.

Figure 4: Deep Video LSTM takes as input the output from the final CNN layer at each consecutive video frame. CNN outputs are processed forward through time and upwards through five layers of stacked LSTMs. A softmax layer predicts the class at each time step. The parameters of the convolutional networks (pink) and softmax classifier (orange) are shared across time steps.

3.3. Training and Inference

The max-pooling models were optimized on a cluster using Downpour Stochastic Gradient Descent, starting with a learning rate of 10⁻⁵ in conjunction with a momentum of 0.9 and weight decay of 0.0005. For LSTM, we used the same optimization method with a learning rate of N × 10⁻⁵, where N is the number of frames. The learning rate was exponentially decayed over time. Each model had between ten and fifty replicas split across four partitions. To reduce CNN training time, the parameters of AlexNet and GoogLeNet were initialized from a pre-trained ImageNet model and then fine-tuned on Sports-1M videos.

Network Expansion for Max-Pooling Networks: Multi-frame models achieve higher accuracy at the cost of longer training times than single-frame models. Since pooling is performed after CNN towers that share weights, the parameters of a single-frame and a multi-frame max-pooling network are very similar. This makes it possible to expand a single-frame model to a multi-frame model. Max-pooling models are first initialized as single-frame networks, then expanded to 30 frames and again to 120 frames. While the feature distribution of the max-pooling layer could change dramatically as a result of expanding to a larger number of frames (particularly in the single-frame to 30-frame case), experiments show that transferring the parameters is nonetheless beneficial. By expanding small networks into larger ones and then fine-tuning, we achieve a significant speedup compared to training a large network from scratch.

LSTM Training: We followed the same procedure as for training the max-pooled network, with two modifications: First, the video’s label was backpropagated at each frame rather than once per clip. Second, a gain g was applied to the gradients backpropagated at each frame. g was linearly interpolated from 0...1 over frames t = 0...T. g had the desired effect of emphasizing the importance of correct prediction at later frames, in which the LSTM’s internal state captured more information. Compared empirically against setting g = 1 over all time steps or setting g = 1 only at the last time step T (g = 0 elsewhere), linearly interpolating g resulted in faster learning and higher accuracy. For the final results, during training the gradients are backpropagated through the convolutional layers for fine tuning.

LSTM Inference: In order to combine LSTM frame-level predictions into a single video-level prediction, we tried several approaches: 1) returning the prediction at the last time step T, 2) max-pooling the predictions over time, 3) summing the predictions over time and returning the max, and 4) linearly weighting the predictions over time by g, then summing and returning the max. The accuracy for all four approaches was less than 1% different, but weighted predictions usually resulted in the best performance, supporting the idea that the LSTM’s hidden state becomes progressively more informed as a function of the number of frames it has seen.
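A minimal sketch of the fourth (best-performing) inference rule: weight the per-frame softmax outputs by the linearly increasing gain g, sum over time, and return the arg-max class.

```python
import numpy as np

def weighted_video_prediction(frame_probs):
    """frame_probs: (T, num_classes) per-frame softmax outputs.

    Later frames, where the LSTM state is better informed, receive
    larger weights, mirroring the training-time gain g.
    """
    T = frame_probs.shape[0]
    g = np.linspace(0.0, 1.0, T)[:, None]
    return int(np.argmax((g * frame_probs).sum(axis=0)))
```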
3.4. Optical Flow

Optical flow is a crucial component of any video classification approach because it encodes the pattern of apparent motion of objects in a visual scene. Since our networks process video frames at 1 fps, they do not use any apparent motion information. Therefore, we additionally train both our temporal models on optical flow images and perform late fusion akin to the two-stream hypothesis proposed by [19].

Interestingly, we found that initializing from a model trained on raw image frames can help classify optical flow images by allowing faster convergence than when training from scratch. This is likely due to the fact that features that describe raw frames, like edges, also help in classifying optical flow images. This is related to the effectiveness of the Motion Boundary Histogram (MBH), which is analogous to computing Histograms of Oriented Gradients (HOG) on optical flow images, in action recognition [23].

Optical flow is computed from two adjacent frames sampled at 15 fps using the approach of [26]. To utilize existing implementations and networks trained on raw frames, we store optical flow as images by thresholding at −40, 40 and rescaling the horizontal and vertical components of the flow to the [0, 255] range. The third dimension is set to zero when feeding to the network, so that it has no effect on learning and inference.
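A minimal sketch of this flow-to-image encoding; the ±40 threshold comes from the text, while the exact rescaling arithmetic is an assumption:

```python
import numpy as np

def flow_to_image(flow, clip=40.0):
    """Encode an optical flow field (H, W, 2) as an 8-bit 3-channel image.

    Horizontal/vertical components are thresholded at [-clip, clip] and
    rescaled to [0, 255]; the third channel is left at zero so it has no
    effect on learning and inference.
    """
    h, w, _ = flow.shape
    img = np.zeros((h, w, 3), dtype=np.uint8)
    scaled = (np.clip(flow, -clip, clip) + clip) * (255.0 / (2.0 * clip))
    img[..., :2] = scaled.astype(np.uint8)
    return img
```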
In our investigation, we treat optical flow in the same fashion as image frames to learn a global description of videos using both feature pooling and LSTM networks.

4. Results

We empirically evaluate the proposed architectures on the Sports-1M and UCF-101 datasets with the goals of investigating the performance of the proposed architectures, quantifying the effect of the number of frames and frame rates on classification performance, and understanding the importance of motion information through optical flow models.

4.1. Sports-1M dataset

The Sports-1M dataset [14] consists of roughly 1.2 million YouTube sports videos annotated with 487 classes, and it is representative of videos in the wild. There are 1000-3000 videos per class, and approximately 5% of the videos are annotated with more than one class. Unfortunately, since the creation of the dataset, about 7% of the videos have been removed by users. We use the remaining 1.1 million videos for the experiments below.

Although Sports-1M is the largest publicly available video dataset, the annotations that it provides are at the video level. No information is given about the location of the class of interest. Moreover, the videos in this dataset are unconstrained, so the camera movements are not guaranteed to be well-behaved; unlike UCF-101, where camera motion is constrained, the optical flow quality varies wildly between videos.

Data Extraction: The first 5 minutes of each video are sampled at a frame rate of 1 fps to obtain 300 frames per video. Frames are repeated from the start for videos that are shorter than 5 minutes. We learn feature pooling models that process up to 120 frames (2 minutes of video) in a single example.

Data Augmentation: Multiple examples per video are obtained by randomly selecting the position of the first frame and consistent random crops of each frame during both training and testing. It is necessary to ensure that the same transforms are applied to all frames for a given start/end point. We process all images in the chosen interval by first resizing them to 256 × 256 pixels, then randomly sampling a 220 × 220 region and randomly flipping the image horizontally with 50% probability. To obtain predictions for a video we randomly sample 240 examples as described above and average all predictions, unless noted otherwise. Since LSTM models trained on a fixed number of frames can generalize to any number of frames, we also report results of using LSTMs without data augmentation.
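A minimal sketch of the consistent-transform requirement: one crop offset and one flip decision are drawn per clip and applied identically to every frame (the initial resize to 256 × 256 is assumed already done).

```python
import numpy as np

def augment_clip(frames, crop=220, rng=None):
    """frames: (T, 256, 256, 3) clip; returns a (T, crop, crop, 3) clip.

    The same random crop and the same 50%-probability horizontal flip
    are applied to all frames, as required for a given start/end point.
    """
    rng = rng or np.random.default_rng()
    slack = frames.shape[1] - crop
    y = rng.integers(0, slack + 1)
    x = rng.integers(0, slack + 1)
    out = frames[:, y:y + crop, x:x + crop, :]
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]  # flip every frame identically
    return out
```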
Video-Level Prediction: Given the nature of the methods presented in this paper, it is possible to make predictions for the entire video without needing to sample or aggregate (the networks are designed to work on an unbounded number of frames for prediction). However, for obtaining the highest possible classification rates, we observed that it is best to only do this if resource-constrained (i.e., when it is only possible to do a single pass over the video for prediction). Otherwise the data augmentation method proposed above yields between 3-5% improvements in Hit@1 on the Sports-1M dataset.

Evaluation: Following [14], we use Hit@k values, which indicate the fraction of test samples that contain at least one of the ground truth labels in the top k predictions. We provide both video-level and clip-level Hit@k values in order to compare with previous results, where the clip hit is the hit on a single video clip (30-120 frames) and the video hit is obtained by averaging over multiple clips.
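Hit@k is simple to state in code; a minimal sketch for one test sample (the label-index representation is an assumption of this example):

```python
import numpy as np

def hit_at_k(scores, true_labels, k):
    """True if any ground-truth label index is among the top-k scores."""
    top_k = np.argsort(scores)[::-1][:k]
    return bool(set(true_labels) & set(top_k.tolist()))
```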
terval by first resizing them to 256 × 256 pixels, then ran-
domly sampling a 220 × 220 region and randomly flipping Comparison of CNN Architectures: AlexNet and
the image horizontally with 50% probability. To obtain pre- GoogLeNet single-frame CNNs (Section 3) were trained
dictions for a video we randomly sample 240 examples as from scratch on single-frames selected at random from
described above and average all predictions, unless noted Sports-1M videos. Results (Table 2) show that both CNNs
otherwise. Since LSTM models trained on a fixed number outperform Karpathy et al.’s prior single-frame models [14]
of frames can generalize to any number of frames, we also by a margin of 4.3-5.6%. The increased accuracy is likely
report results of using LSTMs without data augmentation. due to advances in CNN architectures and sampling more
Video-Level Prediction: Given the nature of the meth- frames per video when training (300 instead of 50).
ods presented in this paper, it is possible to make predictions Comparing AlexNet to the more recent GoogLeNet
for the entire video without needing to sample, or aggregate yields a 1.9% increase in Hit@5 for the max-pooling ar-
( the networks are designed to work on an unbounded num- chitecture, and an increase of 4.8% for the LSTM. This is
ber of frames for prediction). However, for obtaining the roughly comparable to a 4.5% decrease in top-5 error mov-
highest possible classification rates, we observed that it is ing from the Krizhevsky-style CNNs that won ILSVRC-13
Method Hit@1 Hit@5 Pooling and LSTM models as a function of the number of
AlexNet single frame 63.6 84.7 frames aggregated. In terms of clip hit, the 120 frame model
GoogLeNet single frame 64.9 86.6 performs significantly better than the 30 frame model. Also
LSTM + AlexNet (fc) 62.7 83.6 our best clip hit of 70.8 represents a 70% improvement over
LSTM + GoogLeNet (fc) 67.5 87.1 the Slow Fusion approach of [14] which uses clips of few
Conv pooling + AlexNet 70.4 89.0 seconds length. This confirms our initial hypothesis that we
Conv pooling + GoogLeNet 71.7 90.4 need to consider the entire video in order to benefit more
thoroughly from its content.
Table 2: GoogLeNet outperforms AlexNet alone and when Optical Flow: Table 4 shows the results of fusion with
paired with both Conv-Pooling and LSTM. Experiments the optical flow model. The optical flow model on its own
performed on Sports-1M using 30-frame Conv-Pooling and has a much lower accuracy (59.7%) than the image-based
LSTM models. Note that the (fc) models updated only the model (72.1%) which is to be expected given that the Sports
final layers while training and did not use data augmenta- dataset consists of YouTube videos which are usually of
tion. lower quality and more natural than hand-crafted datasets
such as UCF-101. In the case of Conv Pooling networks the
Method Frames Clip Hit@1 Hit@1 Hit@5 fusion with optical flow has no significant improvement in
LSTM 30 N/A 72.1 90.4 the accuracy. However, for LSTMs the optical flow model
30 66.0 71.7 90.4 is able to improve the overall accuracy to 73.1%.
Conv pooling
120 70.8 72.3 90.8 Overall Performance: Finally, we compare the results
of our best models against the previous state-of-art on the
Table 3: Effect of the number of frames in the model. Both Sports-1M dataset at the time of submission. Table 5 reports
LSTM and Conv-Pooling models use GoogLeNet CNN. the results of the best model from [14] which performs sev-
eral layers of 3D convolutions on short video clips against
Method Hit@1 Hit@5 ours. The max-pool method shows an increase of 18.7%
LSTM on Optical Flow 59.7 81.4 in video Hit@1, whereas the LSTM approach yields a rela-
LSTM on Raw Frames 72.1 90.6 tive increase of 20%. The difference between the max-pool
LSTM on Raw Frames + and LSTM method is explained by the fact that the LSTM
73.1 90.5
LSTM on Optical Flow model can use optical flow in a manner which lends itself to
30 frame Optical Flow 44.5 70.4 late model fusion, which was not possible for the max-pool
Conv Pooling on Raw Frames 71.7 90.4 model.
Conv Pooling on Raw Frames +
71.8 90.4 4.2. UCF-101 Dataset
Conv Pooling on Optical Flow
The UCF-101 [20] contains 13,320 videos with 101 ac-
Table 4: Optical flow is noisy on Sports-1M and if used tion classes covering a broad set of activities such as sports,
alone, results in lower performance than equivalent image- musical instruments, and human-object interaction. We fol-
models. However, if used in conjunction with raw im- low the suggested evaluation protocol and report the aver-
age features, optical flow benefits LSTM. Experiments per- age accuracy over the given three training and testing parti-
formed on 30-frame models using GoogLeNet CNNs. tions. It is difficult to train a deep network with such a small
amount of data. Therefore, we test how well our models that
are trained in Sports-1M dataset perform in UCF-101.
to GoogLeNet in ILSVRC-14. For the max-pool architec- Comparison of Frame Rates: Since UCF-101 contains
ture, this smaller gap between architectures is likely caused short videos, 10-15 seconds on average, it is possible to ex-
by the increased number of noisy images in Sports-1M com- tract frames at higher frame rates such as 6f ps while still
pared to ImageNet. capturing context from the full video. We compare 30-
Fine Tuning: When initializing from a pre-trained net- frame models trained at three different frame-rates: 30f ps
work, it is not always clear whether fine-tuning should be (1 second of video) and 6f ps (5 seconds). Table 6 shows
performed. In our experiments, fine tuning was crucial in that lowering the frame rate from 30f ps to 6f ps yields
achieving high performance. For example, in Table 2 we slightly better performance since the model obtains more
show that a LSTM network paired with GoogLeNet, run- context from longer input clips. We observed no further im-
ning on 30 frames of the video achieves a Hit@1 rate of provements when decreasing the frame rate to 1f ps. Thus,
67.5. However, the same network with fine tuning achieves as long as the network sees enough context from each video,
69.5 Hit@1. Note that these results do not use data aug- the effects of lower frames rate are marginal. The LSTM
mentation and classify the entire 300 seconds of a video. model, on the other hand can take full advantage of the fact
Effect of Number of Frames: Table 3 compares Conv- that the videos can be processed at 30 frames per second.
Category             Method                   Frames   Clip Hit@1   Hit@1   Hit@5
Prior Results [14]   Single Frame             1        41.1         59.3    77.7
Prior Results [14]   Slow Fusion              15       41.9         60.9    80.2
Conv Pooling         Image and Optical Flow   120      70.8         72.4    90.8
LSTM                 Image and Optical Flow   30       N/A          73.1    90.5

Table 5: Leveraging global video-level descriptors, LSTM and Conv-Pooling achieve a 20% increase in Hit@1 compared to prior work on the Sports-1M dataset. Hit@1 and Hit@5 are computed at the video level.
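For reference, the Hit@k metric reported in Tables 1-5 can be computed with a short helper like the following (a sketch of our own, not code from the paper):

```python
import numpy as np

def hit_at_k(scores, labels, k=5):
    """Fraction of samples whose top-k predictions contain at least one
    true label. `scores` is (num_samples, num_classes); `labels` is a
    list of sets of ground-truth class indices (videos may carry more
    than one label)."""
    topk = np.argsort(-scores, axis=1)[:, :k]   # indices of the k highest scores
    hits = [len(labels[i] & set(topk[i])) > 0 for i in range(len(labels))]
    return float(np.mean(hits))
```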

Method                      Frame Rate   3-fold Accuracy (%)
Single Frame Model          N/A          73.3
Conv Pooling (30 frames)    30 fps       80.8
Conv Pooling (30 frames)    6 fps        82.0
Conv Pooling (120 frames)   30 fps       82.6
Conv Pooling (120 frames)   6 fps        82.6

Table 6: Lower frame rates produce higher UCF-101 accuracy for 30-frame Conv-Pooling models.

Overall Performance: Our models achieve state-of-the-art performance on UCF-101 (Table 7), slightly outperforming approaches that use hand-crafted features and CNN-based approaches that use optical flow. As before, the performance edge of our method results from using increased numbers of frames to capture more of the video. Our 120-frame model improves upon previous work [19] (82.6% vs. 73.0%) when considering models that learn directly from raw frames without optical flow information. This is a direct result of considering larger context within a video, even when the frames within a short clip are highly similar to each other. Compared to Sports-1M, optical flow on UCF-101 provides a much larger improvement in accuracy (82.6% vs. 88.2% for max-pool). This results from UCF-101 videos being better centered, less shaky, and better trimmed to the action in question than the average YouTube video.

Method                                                          3-fold Accuracy (%)
Improved Dense Trajectories (IDTFs) [23]                        87.9
Slow Fusion CNN [14]                                            65.4
Single Frame CNN Model (Images) [19]                            73.0
Single Frame CNN Model (Optical Flow) [19]                      73.9
Two-Stream CNN (Optical Flow + Image Frames, Averaging) [19]    86.9
Two-Stream CNN (Optical Flow + Image Frames, SVM Fusion) [19]   88.0
Our Single Frame Model                                          73.3
Conv Pooling of Image Frames + Optical Flow (30 Frames)         87.6
Conv Pooling of Image Frames + Optical Flow (120 Frames)        88.2
LSTM with 30 Frame Unroll (Optical Flow + Image Frames)         88.6

Table 7: UCF-101 results. The bold-face numbers represent results that are higher than previously reported results.

High Quality Data: The UCF-101 dataset contains short, well-segmented videos of concepts that can typically be identified in a single frame. This is evidenced by the high performance of single-frame networks (see Table 7). In contrast, videos in the wild often feature spurious frames containing text or shot transitions, hand-held video shot in either first person or third person, and non-topical segments such as commentators talking about a game.

5. Conclusion

We presented two video-classification methods capable of aggregating frame-level CNN outputs into video-level predictions: Feature Pooling methods, which max-pool local information through time, and LSTM, whose hidden state evolves with each subsequent frame. Both methods are motivated by the idea that incorporating information across longer video sequences will enable better video classification. Unlike previous work, which trained on seconds of video, our networks utilize up to two minutes of video (120 frames) for optimal classification performance. If speed is of concern, our methods can process an entire video in one shot. Training is possible by expanding smaller networks into progressively larger ones and fine-tuning. The resulting networks achieve state-of-the-art performance on both the Sports-1M and UCF-101 benchmarks, supporting the idea that learning should take place over the entire video rather than over short clips.

Additionally, we explore the necessity of motion information, and confirm that for the UCF-101 benchmark it is necessary to use optical flow to obtain state-of-the-art results. However, we also show that using optical flow is not always helpful, especially if the videos are taken from the wild, as is the case in the Sports-1M dataset. In order to take advantage of optical flow in this case, it is necessary to employ a more sophisticated sequence-processing architecture such as LSTM. Moreover, using LSTMs on both image frames and optical flow yields the highest published performance measure for the Sports-1M benchmark.

In the current models, backpropagation of gradients proceeds down all layers and backwards through time in the top layers, but not backwards through time in the lower (CNN) layers. In the future, it would be interesting to consider a deeper integration of the temporal sequence information into the CNNs themselves. For instance, a Recurrent Convolutional Neural Network may be able to generate better features by utilizing its own activations from the last frame in conjunction with the image from the current frame.
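As a sketch of the LSTM aggregation summarized above (our own illustration; the layer sizes and the simple averaging of per-frame predictions are placeholders, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class FrameLSTM(nn.Module):
    """Per-frame CNN features are fed to a stacked LSTM; a prediction is
    made at every time step and predictions are aggregated over time.
    Feature extraction is assumed to happen upstream."""

    def __init__(self, feat_dim=1024, hidden=512, num_classes=487, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        out, _ = self.lstm(feats)          # hidden state evolves frame by frame
        logits = self.classifier(out)      # per-frame class scores
        return logits.mean(dim=1)          # aggregate predictions over time
```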
References

[1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Action classification in soccer videos with long short-term memory recurrent neural networks. In Proc. ICANN, pages 154–159, Thessaloniki, Greece, 2010. 2
[2] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt. Sequential Deep Learning for Human Action Recognition. In 2nd International Workshop on Human Behavior Understanding (HBU), pages 29–39, Nov. 2011. 1, 2
[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks, 5(2):157–166, 1994. 2
[4] Y.-L. Boureau, J. Ponce, and Y. Lecun. A theoretical analysis of feature pooling in visual recognition. In Proc. ICML, pages 111–118, Haifa, Israel, 2010. 3
[5] S. Fernández, A. Graves, and J. Schmidhuber. Phoneme recognition in TIMIT with BLSTM-CTC. CoRR, abs/0804.3269, 2008. 2
[6] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. JMLR, 3:115–143, 2002. 4
[7] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML, pages 1764–1772, Beijing, China, 2014. 2
[8] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. PAMI, 31(5):855–868, 2009. 2
[9] A. Graves, A.-R. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013. 2, 4
[10] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Proc. NIPS, pages 545–552, Vancouver, B.C., Canada, 2008. 2
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, Nov. 1997. 2
[12] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In Proc. CVPR, pages 2555–2562, Portland, Oregon, USA, 2013. 2, 3
[13] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Trans. PAMI, 35(1):221–231, Jan. 2013. 1, 2
[14] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proc. CVPR, pages 1725–1732, Columbus, Ohio, USA, 2014. 1, 2, 6, 7, 8
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105, Lake Tahoe, Nevada, USA, 2012. 1, 2, 3
[16] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proc. ICCV, pages 2556–2563, Barcelona, Spain, 2011. 2
[17] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Proc. CVPR, pages 1–8, Anchorage, Alaska, USA, 2008. 2, 3
[18] S. Reiter, B. Schuller, and G. Rigoll. A combined LSTM-RNN-HMM approach for meeting event segmentation and recognition. In Proc. ICASSP, pages 393–396, Toulouse, France, 2006. 2
[19] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Proc. NIPS, pages 568–576, Montreal, Canada, 2014. 1, 2, 5, 8
[20] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. In CRCV-TR-12-01, 2012. 7
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. 1, 3, 4
[22] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In Proc. CVPR, pages 3169–3176, Washington, DC, USA, 2011. 2
[23] H. Wang and C. Schmid. Action Recognition with Improved Trajectories. In Proc. ICCV, pages 3551–3558, Sydney, Australia, 2013. 2, 5, 8
[24] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In Proc. BMVC, pages 1–11, 2009. 2, 3
[25] M. Wöllmer, M. Kaiser, F. Eyben, B. Schuller, and G. Rigoll. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vision Computing, 31(2):153–163, 2013. 2
[26] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Proceedings of the 29th DAGM Conference on Pattern Recognition, pages 214–223, Berlin, Heidelberg, 2007. Springer-Verlag. 5
[27] W. Zaremba and I. Sutskever. Learning to execute. CoRR, abs/1410.4615, 2014. 2
[28] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proc. ECCV, pages 818–833, Zurich, Switzerland, 2014. 1, 3
U-Net: Convolutional Networks for Biomedical
Image Segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox

Computer Science Department and BIOSS Centre for Biological Signalling Studies,
University of Freiburg, Germany
ronneber@informatik.uni-freiburg.de,
WWW home page: http://lmb.informatik.uni-freiburg.de/

arXiv:1505.04597v1 [cs.CV] 18 May 2015

Abstract. There is broad consensus that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong
per, we present a network and training strategy that relies on the strong
use of data augmentation to use the available annotated samples more
efficiently. The architecture consists of a contracting path to capture
context and a symmetric expanding path that enables precise localiza-
tion. We show that such a network can be trained end-to-end from very
few images and outperforms the prior best method (a sliding-window
convolutional network) on the ISBI challenge for segmentation of neu-
ronal structures in electron microscopic stacks. Using the same net-
work trained on transmitted light microscopy images (phase contrast
and DIC) we won the ISBI cell tracking challenge 2015 in these cate-
gories by a large margin. Moreover, the network is fast. Segmentation
of a 512x512 image takes less than a second on a recent GPU. The full
implementation (based on Caffe) and the trained networks are available
at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.

1 Introduction

In the last two years, deep convolutional networks have outperformed the state of
the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks
have already existed for a long time [8], their success was limited due to the
size of the available training sets and the size of the considered networks. The
breakthrough by Krizhevsky et al. [7] was due to supervised training of a large
network with 8 layers and millions of parameters on the ImageNet dataset with
1 million training images. Since then, even larger and deeper networks have been
trained [12].
The typical use of convolutional networks is on classification tasks, where
the output to an image is a single class label. However, in many visual tasks,
especially in biomedical image processing, the desired output should include
localization, i.e., a class label is supposed to be assigned to each pixel. More-
over, thousands of training images are usually beyond reach in biomedical tasks.
Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict
the class label of each pixel by providing a local region (patch) around that pixel
Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations (conv 3x3 + ReLU, copy and crop, max pool 2x2, up-conv 2x2, conv 1x1).

as input. First, this network can localize. Secondly, the training data in terms
of patches is much larger than the number of training images. The resulting
network won the EM segmentation challenge at ISBI 2012 by a large margin.
Obviously, the strategy in Ciresan et al. [1] has two drawbacks. First, it
is quite slow because the network must be run separately for each patch, and
there is a lot of redundancy due to overlapping patches. Secondly, there is a
trade-off between localization accuracy and the use of context. Larger patches
require more max-pooling layers that reduce the localization accuracy, while
small patches allow the network to see only little context. More recent approaches
[11,4] proposed a classifier output that takes into account the features from
multiple layers. Good localization and the use of context are possible at the
same time.
In this paper, we build upon a more elegant architecture, the so-called “fully
convolutional network” [9]. We modify and extend this architecture such that it
works with very few training images and yields more precise segmentations; see
Figure 1. The main idea in [9] is to supplement a usual contracting network by
successive layers, where pooling operators are replaced by upsampling operators.
Hence, these layers increase the resolution of the output. In order to localize, high
resolution features from the contracting path are combined with the upsampled

Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrarily large images (here segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area requires image data within the blue area as input. Missing input data is extrapolated by mirroring.

output. A successive convolution layer can then learn to assemble a more precise
output based on this information.
One important modification in our architecture is that in the upsampling
part we have also a large number of feature channels, which allow the network
to propagate context information to higher resolution layers. As a consequence,
the expansive path is more or less symmetric to the contracting path, and yields
a u-shaped architecture. The network does not have any fully connected layers
and only uses the valid part of each convolution, i.e., the segmentation map only
contains the pixels, for which the full context is available in the input image.
This strategy allows the seamless segmentation of arbitrarily large images by an
overlap-tile strategy (see Figure 2). To predict the pixels in the border region
of the image, the missing context is extrapolated by mirroring the input image.
This tiling strategy is important to apply the network to large images, since
otherwise the resolution would be limited by the GPU memory.
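A minimal sketch of this overlap-tile strategy, assuming a `net` callable that maps an in_tile × in_tile input to an out_tile × out_tile valid prediction (the tile sizes follow Figure 1 but are parameters here; this is our own illustration, not the released implementation):

```python
import numpy as np

def predict_tiled(image, net, in_tile=572, out_tile=388):
    """Run `net` over overlapping input tiles so that the valid output
    tiles cover the whole image; missing border context is mirrored in."""
    margin = (in_tile - out_tile) // 2              # context lost to unpadded convs
    h, w = image.shape
    ny, nx = -(-h // out_tile), -(-w // out_tile)   # ceil division: number of tiles
    padded = np.pad(image,
                    ((margin, margin + ny * out_tile - h),
                     (margin, margin + nx * out_tile - w)),
                    mode="reflect")                 # extrapolate by mirroring
    out = np.zeros((h, w), dtype=np.float32)
    for ty in range(ny):
        for tx in range(nx):
            y, x = ty * out_tile, tx * out_tile
            pred = net(padded[y:y + in_tile, x:x + in_tile])
            out[y:y + out_tile, x:x + out_tile] = pred[:h - y, :w - x]
    return out
```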
As for our tasks there is very little training data available, we use excessive
data augmentation by applying elastic deformations to the available training im-
ages. This allows the network to learn invariance to such deformations, without
the need to see these transformations in the annotated image corpus. This is
particularly important in biomedical segmentation, since deformation used to
be the most common variation in tissue and realistic deformations can be simu-
lated efficiently. The value of data augmentation for learning invariance has been
shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.
Another challenge in many cell segmentation tasks is the separation of touch-
ing objects of the same class; see Figure 3. To this end, we propose the use of
a weighted loss, where the separating background labels between touching cells
obtain a large weight in the loss function.
The resulting network is applicable to various biomedical segmentation prob-
lems. In this paper, we show results on the segmentation of neuronal structures
in EM stacks (an ongoing competition started at ISBI 2012), where we out-

performed the network of Ciresan et al. [1]. Furthermore, we show results for
cell segmentation in light microscopy images from the ISBI cell tracking chal-
lenge 2015. Here we won with a large margin on the two most challenging 2D
transmitted light datasets.

2 Network Architecture
The network architecture is illustrated in Figure 1. It consists of a contracting
path (left side) and an expansive path (right side). The contracting path follows
the typical architecture of a convolutional network. It consists of the repeated
application of two 3x3 convolutions (unpadded convolutions), each followed by
a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2
for downsampling. At each downsampling step we double the number of feature
channels. Every step in the expansive path consists of an upsampling of the
feature map followed by a 2x2 convolution (“up-convolution”) that halves the
number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3x3 convolutions, each fol-
lowed by a ReLU. The cropping is necessary due to the loss of border pixels in
every convolution. At the final layer a 1x1 convolution is used to map each 64-
component feature vector to the desired number of classes. In total the network
has 23 convolutional layers.
To allow a seamless tiling of the output segmentation map (see Figure 2), it
is important to select the input tile size such that all 2x2 max-pooling operations
are applied to a layer with an even x- and y-size.
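A minimal PyTorch sketch of one matching contracting/expansive pair (our own illustration of the scheme above, not the released Caffe implementation; a full U-net stacks several such steps for 23 convolutional layers in total):

```python
import torch
import torch.nn as nn

def double_conv(c_in, c_out):
    """Two unpadded 3x3 convolutions, each followed by a ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3), nn.ReLU(inplace=True),
    )

def center_crop(feat, target):
    """Crop a contracting-path feature map to the expansive map's size."""
    dh = (feat.shape[2] - target.shape[2]) // 2
    dw = (feat.shape[3] - target.shape[3]) // 2
    return feat[:, :, dh:dh + target.shape[2], dw:dw + target.shape[3]]

class UNetStep(nn.Module):
    """One contracting step, the level below it, and the matching
    expansive step with skip concatenation."""

    def __init__(self, c):
        super().__init__()
        self.down = double_conv(c, 2 * c)
        self.pool = nn.MaxPool2d(2)                               # 2x2 max pool, stride 2
        self.bottom = double_conv(2 * c, 4 * c)                   # channels double per level
        self.up = nn.ConvTranspose2d(4 * c, 2 * c, 2, stride=2)   # 2x2 "up-convolution"
        self.fuse = double_conv(4 * c, 2 * c)

    def forward(self, x):
        skip = self.down(x)
        y = self.bottom(self.pool(skip))
        y = self.up(y)
        # crop is necessary due to the loss of border pixels in every conv
        return self.fuse(torch.cat([center_crop(skip, y), y], dim=1))
```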

3 Training
The input images and their corresponding segmentation maps are used to train
the network with the stochastic gradient descent implementation of Caffe [6].
Due to the unpadded convolutions, the output image is smaller than the input
by a constant border width. To minimize the overhead and make maximum use
of the GPU memory, we favor large input tiles over a large batch size and hence
reduce the batch to a single image. Accordingly we use a high momentum (0.99)
such that a large number of the previously seen training samples determine the
update in the current optimization step.
The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as $p_k(\mathbf{x}) = \exp(a_k(\mathbf{x})) / \sum_{k'=1}^{K} \exp(a_{k'}(\mathbf{x}))$, where $a_k(\mathbf{x})$ denotes the activation in feature channel $k$ at the pixel position $\mathbf{x} \in \Omega$ with $\Omega \subset \mathbb{Z}^2$, and $K$ is the number of classes. $p_k(\mathbf{x})$ is the approximated maximum-function, i.e. $p_k(\mathbf{x}) \approx 1$ for the $k$ that has the maximum activation $a_k(\mathbf{x})$ and $p_k(\mathbf{x}) \approx 0$ for all other $k$. The cross entropy then penalizes at each position the deviation of $p_{\ell(\mathbf{x})}(\mathbf{x})$ from 1 using

$$E = \sum_{\mathbf{x} \in \Omega} w(\mathbf{x}) \log(p_{\ell(\mathbf{x})}(\mathbf{x})) \qquad (1)$$

Fig. 3. HeLa cells on glass recorded with DIC (differential interference contrast) mi-
croscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors
indicate different instances of the HeLa cells. (c) generated segmentation mask (white:
foreground, black: background). (d) map with a pixel-wise loss weight to force the
network to learn the border pixels.

where $\ell : \Omega \to \{1, \dots, K\}$ is the true label of each pixel and $w : \Omega \to \mathbb{R}$ is a weight map that we introduced to give some pixels more importance in the training.
We pre-compute the weight map for each ground truth segmentation to com-
pensate the different frequency of pixels from a certain class in the training
data set, and to force the network to learn the small separation borders that we
introduce between touching cells (See Figure 3c and d).
The separation border is computed using morphological operations. The
weight map is then computed as

$$w(\mathbf{x}) = w_c(\mathbf{x}) + w_0 \cdot \exp\left(-\frac{(d_1(\mathbf{x}) + d_2(\mathbf{x}))^2}{2\sigma^2}\right) \qquad (2)$$

where $w_c : \Omega \to \mathbb{R}$ is the weight map to balance the class frequencies, $d_1 : \Omega \to \mathbb{R}$ denotes the distance to the border of the nearest cell and $d_2 : \Omega \to \mathbb{R}$ the distance to the border of the second nearest cell. In our experiments we set $w_0 = 10$ and $\sigma \approx 5$ pixels.
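A sketch of how such a weight map can be computed with distance transforms (our own helper; the instance-mask encoding and function name are assumptions, not part of the paper):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(instance_labels, w_c, w0=10.0, sigma=5.0):
    """Approximate the weight map of Eq. (2). `instance_labels` is an
    integer mask (0 = background, 1..N = cell instances); `w_c` is a
    precomputed class-balancing map of the same shape."""
    ids = np.unique(instance_labels)
    ids = ids[ids > 0]
    if len(ids) < 2:                      # d2 needs at least two cells
        return w_c.copy()
    # per pixel: distance to (the region of) each cell instance
    dists = np.stack([distance_transform_edt(instance_labels != i) for i in ids])
    dists.sort(axis=0)                    # ascending distances per pixel
    d1, d2 = dists[0], dists[1]           # nearest and second-nearest cell
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
```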
In deep networks with many convolutional layers and different paths through
the network, a good initialization of the weights is extremely important. Oth-
erwise, parts of the network might give excessive activations, while other parts
never contribute. Ideally the initial weights should be adapted such that each
feature map in the network has approximately unit variance. For a network with
our architecture (alternating convolution and ReLU layers) this can be achieved
by drawing the initial weights from a Gaussian distribution with a standard deviation of $\sqrt{2/N}$, where $N$ denotes the number of incoming nodes of one neuron [5]. E.g. for a 3x3 convolution and 64 feature channels in the previous layer, $N = 9 \cdot 64 = 576$.
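As a worked instance of this rule (a sketch, not the released code):

```python
import numpy as np

# For a 3x3 convolution with 64 input channels, N = 9 * 64 = 576, so
# weights are drawn from a zero-mean Gaussian with std sqrt(2/N).
fan_in = 3 * 3 * 64                  # N: incoming nodes of one neuron
std = np.sqrt(2.0 / fan_in)          # ~0.059
weights = np.random.normal(0.0, std, size=(64, 64, 3, 3))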

3.1 Data Augmentation

Data augmentation is essential to teach the network the desired invariance and
robustness properties, when only few training samples are available. In case of

microscopical images we primarily need shift and rotation invariance as well as


robustness to deformations and gray value variations. Especially random elas-
tic deformations of the training samples seem to be the key concept to train
a segmentation network with very few annotated images. We generate smooth
deformations using random displacement vectors on a coarse 3 by 3 grid. The
displacements are sampled from a Gaussian distribution with 10 pixels standard
deviation. Per-pixel displacements are then computed using bicubic interpola-
tion. Drop-out layers at the end of the contracting path perform further implicit
data augmentation.
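A sketch of this deformation scheme (our own illustration; scipy's `zoom` stands in for the bicubic upsampling of the coarse displacement grid):

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_deform(image, grid=3, std=10.0, seed=None):
    """Draw random displacement vectors on a coarse grid (3x3 by default)
    from a Gaussian with 10-pixel standard deviation, upsample them to
    per-pixel displacements, and warp the image accordingly."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    coarse = rng.normal(0.0, std, size=(2, grid, grid))
    # upsample coarse displacements to one displacement per pixel (bicubic)
    disp = np.stack([zoom(c, (h / grid, w / grid), order=3) for c in coarse])
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + disp[0], xs + disp[1]])
    return map_coordinates(image, coords, order=3, mode="reflect")
```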

4 Experiments
We demonstrate the application of the u-net to three different segmentation
tasks. The first task is the segmentation of neuronal structures in electron mi-
croscopic recordings. An example of the data set and our obtained segmentation
is displayed in Figure 2. We provide the full result as Supplementary Material.
The data set is provided by the EM segmentation challenge [14] that was started
at ISBI 2012 and is still open for new contributions. The training data is a set of
30 images (512x512 pixels) from serial section transmission electron microscopy
of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes
with a corresponding fully annotated ground truth segmentation map for cells
(white) and membranes (black). The test set is publicly available, but its seg-
mentation maps are kept secret. An evaluation can be obtained by sending the
predicted membrane probability map to the organizers. The evaluation is done
by thresholding the map at 10 different levels and computation of the “warping
error”, the “Rand error” and the “pixel error” [14].
The u-net (averaged over 7 rotated versions of the input data) achieves with-
out any further pre- or postprocessing a warping error of 0.0003529 (the new
best score, see Table 1) and a rand-error of 0.0382.
This is significantly better than the sliding-window convolutional network
result by Ciresan et al. [1], whose best submission had a warping error of 0.000420
and a rand error of 0.0504. In terms of rand error the only better performing

Table 1. Ranking on the EM segmentation challenge [14] (March 6th, 2015), sorted by warping error.

Rank Group name Warping Error Rand Error Pixel Error


** human values ** 0.000005 0.0021 0.0010
1. u-net 0.000353 0.0382 0.0611
2. DIVE-SCI 0.000355 0.0305 0.0584
3. IDSIA [1] 0.000420 0.0504 0.0613
4. DIVE 0.000430 0.0545 0.0582
...
10. IDSIA-SCI 0.000653 0.0189 0.1027

Fig. 4. Result on the ISBI cell tracking challenge. (a) part of an input image of the
“PhC-U373” data set. (b) Segmentation result (cyan mask) with manual ground truth
(yellow border) (c) input image of the “DIC-HeLa” data set. (d) Segmentation result
(random colored masks) with manual ground truth (yellow border).

Table 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.

Name PhC-U373 DIC-HeLa


IMCB-SG (2014) 0.2669 0.2935
KTH-SE (2014) 0.7953 0.4607
HOUS-US (2014) 0.5323 -
second-best 2015 0.83 0.46
u-net (2015) 0.9203 0.7756

algorithms on this data set use highly data set specific post-processing methods1
applied to the probability map of Ciresan et al. [1].
We also applied the u-net to a cell segmentation task in light microscopic im-
ages. This segmentation task is part of the ISBI cell tracking challenge 2014 and
2015 [10,13]. The first data set “PhC-U373”2 contains Glioblastoma-astrocytoma
U373 cells on a polyacrylimide substrate recorded by phase contrast microscopy
(see Figure 4a,b and Supp. Material). It contains 35 partially annotated train-
ing images. Here we achieve an average IOU (“intersection over union”) of 92%,
which is significantly better than the second best algorithm with 83% (see Ta-
ble 2). The second data set “DIC-HeLa”3 are HeLa cells on a flat glass recorded
by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d
and Supp. Material). It contains 20 partially annotated training images. Here we
achieve an average IOU of 77.5% which is significantly better than the second
best algorithm with 46%.

5 Conclusion
The u-net architecture achieves very good performance on very different biomed-
ical segmentation applications. Thanks to data augmentation with elastic defor-
1. The authors of this algorithm have submitted 78 different solutions to achieve this result.
2. Data set provided by Dr. Sanjay Kumar, Department of Bioengineering, University of California at Berkeley, Berkeley, CA (USA).
3. Data set provided by Dr. Gert van Cappellen, Erasmus Medical Center, Rotterdam, The Netherlands.

mations, it only needs very few annotated images and has a very reasonable
training time of only 10 hours on a NVidia Titan GPU (6 GB). We provide the
full Caffe[6]-based implementation and the trained networks4. We are sure that
the u-net architecture can be applied easily to many more tasks.

Acknowledgements
This study was supported by the Excellence Initiative of the German Federal
and State governments (EXC 294) and by the BMBF (Fkz 0316185B).

References
1. Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural net-
works segment neuronal membranes in electron microscopy images. In: NIPS. pp.
2852–2860 (2012)
2. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative un-
supervised feature learning with convolutional neural networks. In: NIPS (2014)
3. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for ac-
curate object detection and semantic segmentation. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
4. Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object seg-
mentation and fine-grained localization (2014), arXiv:1411.5752 [cs.CV]
5. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification (2015), arXiv:1502.01852 [cs.CV]
6. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadar-
rama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding
(2014), arXiv:1408.5093 [cs.CV]
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: NIPS. pp. 1106–1114 (2012)
8. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
Computation 1(4), 541–551 (1989)
9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation (2014), arXiv:1411.4038 [cs.CV]
10. Maška, M., (...), de Solorzano, C.O.: A benchmark for comparison of cell tracking
algorithms. Bioinformatics 30, 1609–1617 (2014)
11. Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded
hierarchical models and logistic disjunctive normal networks. In: Computer Vision
(ICCV), 2013 IEEE International Conference on. pp. 2168–2175 (2013)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition (2014), arXiv:1409.1556 [cs.CV]
13. WWW: Web page of the cell tracking challenge, http://www.codesolorzano.com/
celltrackingchallenge/Cell_Tracking_Challenge/Welcome.html
14. WWW: Web page of the em segmentation challenge, http://brainiac2.mit.edu/
isbi_challenge/

4. U-net implementation, trained networks and supplementary material available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

arXiv:1506.01497v3 [cs.CV] 6 Jan 2016

Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image
convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional
network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to
generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN
into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with
“attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3],
our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection
accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO
2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been
made publicly available.

Index Terms—Object Detection, Region Proposal, Convolutional Neural Network.

1 INTRODUCTION

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.

• S. Ren is with University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research. Email: sqren@mail.ustc.edu.cn
• K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com
• R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research. E-mail: rbg@fb.com

Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use
pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.1

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time—the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built into commercial systems such as at Pinterest [17], with user engagement improvements reported.

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions2. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

2 RELATED WORK

Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).

1. Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to less training time.
2. http://image-net.org/challenges/LSVRC/2015/results

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned
into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

Figure 2: Faster R-CNN is a single, unified network for object detection (image → conv layers → feature maps → Region Proposal Network → proposals → RoI pooling → classifier). The RPN module serves as the ‘attention’ of this unified network.

3 FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with ‘attention’ [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.3 We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers—a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum possible number of proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal4. The k proposals are parameterized relative to k reference boxes, which we call anchors.

3. “Region” is a generic term and in this paper we only consider rectangular regions, as is common for many methods (e.g., [27], [4], [6]). “Objectness” measures membership to a set of object classes vs. background.
4. For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce k scores.
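To make the sliding mini-network of Section 3.1 concrete, here is a minimal PyTorch sketch (our own illustration, not the released Caffe/MATLAB code; the padding choice is an assumption):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """An n x n conv (n = 3) over the shared feature map, followed by
    sibling 1 x 1 convs for objectness scores (cls) and box regression
    (reg). `mid` is 256 for ZF or 512 for VGG-16; k anchors per location."""

    def __init__(self, c_in=512, mid=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(c_in, mid, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(mid, 2 * k, kernel_size=1)   # object / not object
        self.reg = nn.Conv2d(mid, 4 * k, kernel_size=1)   # box coordinates

    def forward(self, feat):
        h = self.relu(self.conv(feat))
        return self.cls(h), self.reg(h)
```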
Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.

An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W × H (typically ∼2,400), there are WHk anchors in total.

Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate, and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method5. As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.

The translation-invariant property also reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer, whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors. As a result, our output layer has 2.8 × 10^4 parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox’s output layer, which has 6.1 × 10^6 parameters (1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox6. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.

Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a “pyramid of filters” (Figure 1(b)). The second way is usually adopted jointly with the first way [8].

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).

Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.

5. As is the case of FCNs [7], our network is translation invariant up to the network’s total stride.
6. Considering the feature projection layers, our proposal layers’ parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10^6; MultiBox’s proposal layers’ parameter count is 7 × 7 × (64 + 96 + 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10^6.
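A sketch of anchor generation (our own illustration; the feature stride, the sub-stride centering, and the exact scale/ratio parameterization are assumptions, while the 3 scales × 3 aspect ratios follow the text):

```python
import numpy as np

def make_anchors(fm_h, fm_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2)
    centered at every feature-map location; W * H * k anchors total."""
    base = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(r), s / np.sqrt(r)   # box of area s^2, aspect ratio r
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                            # (k, 4)
    ys, xs = np.meshgrid(np.arange(fm_h), np.arange(fm_w), indexing="ij")
    centers = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4) * stride
    return (centers + base).reshape(-1, 4)           # (fm_h * fm_w * k, 4)
```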
5

any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.

With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*).   (1)

Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i^* is 1 if the anchor is positive, and is 0 if the anchor is negative. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i^* is that of the ground-truth box associated with a positive anchor. The classification loss L_cls is log loss over two classes (object vs. not object). For the regression loss, we use L_reg(t_i, t_i^*) = R(t_i − t_i^*) where R is the robust loss function (smooth L1) defined in [2]. The term p_i^* L_reg means the regression loss is activated only for positive anchors (p_i^* = 1) and is disabled otherwise (p_i^* = 0). The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.

The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn. (1) is normalized by the mini-batch size (i.e., N_cls = 256) and the reg term is normalized by the number of anchor locations (i.e., N_reg ∼ 2,400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.

For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:

t_x = (x − x_a)/w_a,  t_y = (y − y_a)/h_a,  t_w = \log(w/w_a),  t_h = \log(h/h_a),
t_x^* = (x^* − x_a)/w_a,  t_y^* = (y^* − y_a)/h_a,  t_w^* = \log(w^*/w_a),  t_h^* = \log(h^*/h_a),   (2)

where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, x_a, and x^* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

Nevertheless, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

3.1.3 Training RPNs

The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the "image-centric" sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].
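The anchor labeling and sampling rules above reduce to a few lines of array code. The following is a minimal NumPy sketch (an illustration, not the authors' released code; the function name and array shapes are assumptions):

```python
import numpy as np

def label_and_sample_anchors(iou, batch_size=256, pos_frac=0.5,
                             hi_thresh=0.7, lo_thresh=0.3):
    """Assign RPN labels given an IoU matrix of shape (num_anchors, num_gt).

    Labels: 1 = positive, 0 = negative, -1 = ignored (no loss contribution).
    """
    labels = -np.ones(iou.shape[0], dtype=np.int64)

    max_iou_per_anchor = iou.max(axis=1)
    # Condition (ii): IoU > 0.7 with any ground-truth box -> positive.
    labels[max_iou_per_anchor > hi_thresh] = 1
    # Condition (i): the anchor(s) with the highest IoU for each ground-truth
    # box are positive, so every ground truth gets at least one positive.
    best_per_gt = iou.max(axis=0)
    for g in range(iou.shape[1]):
        labels[iou[:, g] == best_per_gt[g]] = 1
    # Negatives: IoU < 0.3 for all ground-truth boxes (unless already positive).
    labels[(max_iou_per_anchor < lo_thresh) & (labels != 1)] = 0

    # Subsample up to 128 positives; pad with negatives to reach 256 total.
    pos = np.flatnonzero(labels == 1)
    if len(pos) > int(batch_size * pos_frac):
        disable = np.random.choice(pos, len(pos) - int(batch_size * pos_frac),
                                   replace=False)
        labels[disable] = -1
    neg = np.flatnonzero(labels == 0)
    num_neg = batch_size - np.sum(labels == 1)
    if len(neg) > num_neg:
        disable = np.random.choice(neg, len(neg) - num_neg, replace=False)
        labels[disable] = -1
    return labels
```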
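The coordinate parameterization in Eqn. (2) and the smooth L1 loss R from [2] are also compact enough to state directly in code. A minimal NumPy sketch (our own illustration; function names are not from the released code):

```python
import numpy as np

def encode_boxes(boxes, anchors):
    """Compute regression targets t = (tx, ty, tw, th) of Eqn. (2).

    Both inputs are (N, 4) arrays of (x_center, y_center, width, height).
    """
    tx = (boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = np.log(boxes[:, 2] / anchors[:, 2])
    th = np.log(boxes[:, 3] / anchors[:, 3])
    return np.stack([tx, ty, tw, th], axis=1)

def smooth_l1(x):
    """Elementwise robust loss R from [2]: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

# The regression term of Eqn. (1) for one image, given predicted targets t,
# ground-truth targets t_star, and the labels from the previous sketch:
#   l_reg = smooth_l1(t - t_star).sum(axis=1)[labels == 1].sum() / n_reg
```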
Table 1: the learned average proposal size for each anchor using the ZF net (numbers for s = 600).

anchor   | 128², 2:1 | 128², 1:1 | 128², 1:2 | 256², 2:1 | 256², 1:1 | 256², 1:2 | 512², 2:1 | 512², 1:1 | 512², 1:2
proposal | 188×111   | 113×114   | 70×92     | 416×229   | 261×284   | 174×332   | 768×437   | 499×501   | 355×715
two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:

(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes' coordinates that are also network responses, so it is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.

(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an "RoI warping" layer as developed in [15], which is beyond the scope of this paper.

4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible; one may still roughly infer the extent of an object if only the middle of the object is visible.

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
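To make the anchor count concrete: with a total stride of 16, a 1000 × 600 image gives a feature map of roughly 63 × 38 cells, hence ≈ 60 × 40 × 9 ≈ 20000 anchors. A small sketch that enumerates the 9 anchors at every feature-map position (our own illustration, not the released code):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     areas=(128**2, 256**2, 512**2),
                     ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (xc, yc, w, h) in pixels."""
    # The 9 reference shapes: width and height from area and ratio r = h/w
    # (r in {0.5, 1, 2} covers both the 2:1 and 1:2 orientations).
    shapes = []
    for area in areas:
        for r in ratios:
            w = np.sqrt(area / r)
            shapes.append((w, w * r))
    shapes = np.array(shapes)  # (9, 2)

    # One anchor center per feature-map cell, spaced by the total stride.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (H*W, 2)

    # Cartesian product of centers and shapes.
    centers = np.repeat(centers, len(shapes), axis=0)
    wh = np.tile(shapes, (feat_h * feat_w, 1))
    return np.concatenate([centers, wh], axis=1)

anchors = generate_anchors(38, 63)   # ~1000x600 input at stride 16
print(len(anchors))                  # 21546, i.e. roughly 20000 anchors
```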
Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are
Fast R-CNN with ZF, but using various proposal methods for training and testing.
train-time method | train-time # boxes | test-time method | test-time # proposals | mAP (%)
SS 2000 SS 2000 58.7
EB 2000 EB 2000 58.6
RPN+ZF, shared 2000 RPN+ZF, shared 300 59.9
ablation experiments follow below
RPN+ZF, unshared 2000 RPN+ZF, unshared 300 58.7
SS 2000 RPN+ZF 100 55.1
SS 2000 RPN+ZF 300 56.8
SS 2000 RPN+ZF 1000 56.3
SS 2000 RPN+ZF (no NMS) 6000 55.2
SS 2000 RPN+ZF (no cls) 100 44.6
SS 2000 RPN+ZF (no cls) 300 51.4
SS 2000 RPN+ZF (no cls) 1000 55.8
SS 2000 RPN+ZF (no reg) 300 52.1
SS 2000 RPN+ZF (no reg) 1000 51.3
SS 2000 RPN+VGG 300 59.2

Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
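The pruning step above reduces roughly 6000 scored regions to about 2000. A standard greedy NMS sketch (our own illustration of the generic algorithm, not the released implementation), with boxes as (x1, y1, x2, y2):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # highest cls score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with all remaining boxes.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop boxes overlapping the kept box by more than the threshold.
        order = order[1:][iou <= iou_thresh]
    return keep
```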
4 EXPERIMENTS

4.1 Experiments on PASCAL VOC

We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the "fast" version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model⁷ [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).

7. www.robots.ox.ac.uk/~vgg/research/very_deep/

Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the "fast" mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals⁸. Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers' cost (Table 5).

8. For RPN, the number of proposals (e.g., 300) is the maximum number for an image. RPN may produce fewer proposals after NMS, and thus the average number of proposals is smaller.

Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.

Next, we disentangle the RPN's influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.

Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked
Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07+12”: union set of VOC 2007 trainval and VOC 2012 trainval. For RPN,
the train-time proposals for Fast R-CNN are 2000. † : this number was reported in [2]; using the repository
provided by this paper, this result is higher (68.1).
method # proposals data mAP (%)
SS 2000 07 66.9†
SS 2000 07+12 70.0
RPN+VGG, unshared 300 07 68.5
RPN+VGG, shared 300 07 69.9
RPN+VGG, shared 300 07+12 73.2
RPN+VGG, shared 300 COCO+07+12 78.8

Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07++12”: union set of VOC 2007 trainval+test and VOC 2012 trainval. For
RPN, the train-time proposals for Fast R-CNN are 2000. † : http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ‡ :
http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. § : http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html.
method # proposals data mAP (%)
SS 2000 12 65.7
SS 2000 07++12 68.4
RPN+VGG, shared† 300 12 67.0
RPN+VGG, shared‡ 300 07++12 70.4
RPN+VGG, shared§ 300 COCO+07++12 75.9

Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated on a CPU. "Region-wise" includes NMS,
pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.
model system conv proposal region-wise total rate
VGG SS + Fast R-CNN 146 1510 174 1830 0.5 fps
VGG RPN + Fast R-CNN 141 10 47 198 5 fps
ZF RPN + Fast R-CNN 31 3 25 59 17 fps

100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.

Next, we separately investigate the roles of RPN's cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1000 (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.

On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.

We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%, better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.
Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. RPN∗ denotes the unshared-feature version.
method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

SS 2000 07 66.9 74.5 78.3 69.2 53.2 36.6 77.3 78.2 82.0 40.7 72.7 67.9 79.6 79.2 73.0 69.0 30.1 65.4 70.2 75.8 65.8
SS 2000 07+12 70.0 77.0 78.1 69.3 59.4 38.3 81.6 78.6 86.7 42.8 78.8 68.9 84.7 82.0 76.6 69.9 31.8 70.1 74.8 80.4 70.4
RPN∗ 300 07 68.5 74.1 77.2 67.7 53.9 51.0 75.1 79.2 78.9 50.7 78.0 61.1 79.1 81.9 72.2 75.9 37.2 71.4 62.5 77.4 66.4
RPN 300 07 69.9 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6
RPN 300 07+12 73.2 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
RPN 300 COCO+07+12 78.8 84.3 82.0 77.7 68.9 65.7 88.1 88.4 88.9 63.6 86.3 70.8 85.9 87.6 80.1 82.3 53.6 80.4 75.8 86.6 78.9

Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000.
method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

SS 2000 12 65.7 80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8 79.8 69.8 32.1 65.5 63.8 76.4 61.7
SS 2000 07++12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
RPN 300 12 67.0 82.3 76.4 71.0 48.4 45.2 72.1 72.3 87.3 42.2 73.7 50.0 86.8 78.7 78.4 77.4 34.5 70.1 57.1 77.1 58.9
RPN 300 07++12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
RPN 300 COCO+07++12 75.9 87.4 83.6 76.8 62.9 59.6 81.9 82.0 91.3 54.9 82.6 59.0 89.0 85.5 84.7 84.1 52.2 78.9 65.5 85.4 70.2

Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios (69.9%) is the same as that in Table 3.

settings          | anchor scales       | aspect ratios     | mAP (%)
1 scale, 1 ratio  | 128²                | 1:1               | 65.8
1 scale, 1 ratio  | 256²                | 1:1               | 66.7
1 scale, 3 ratios | 128²                | {2:1, 1:1, 1:2}   | 68.8
1 scale, 3 ratios | 256²                | {2:1, 1:1, 1:2}   | 67.9
3 scales, 1 ratio | {128², 256², 512²}  | 1:1               | 69.8
3 scales, 3 ratios| {128², 256², 512²}  | {2:1, 1:1, 1:2}   | 69.9

Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of λ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using λ = 10 (69.9%) is the same as that in Table 3.

λ       | 0.1  | 1    | 10   | 100
mAP (%) | 67.2 | 68.9 | 69.9 | 69.1

In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.

Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios (69.9% mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.

In Table 9 we compare different values of λ in Equation (1). By default we use λ = 10, which makes the two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by ∼1%) when λ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to λ in a wide range.

Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.

In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.
[Plot: three panels for 300, 1000, and 2000 proposals; each shows Recall (y-axis) versus IoU (x-axis, 0.5 to 1.0) for SS, EB, RPN ZF, and RPN VGG.]
Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.

Table 10: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL
VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.
system    | proposals                        | # proposals | detector                  | mAP (%)
Two-Stage | RPN + ZF, unshared               | 300         | Fast R-CNN + ZF, 1 scale  | 58.7
One-Stage | dense, 3 scales, 3 aspect ratios | 20000       | Fast R-CNN + ZF, 1 scale  | 53.8
One-Stage | dense, 3 scales, 3 aspect ratios | 20000       | Fast R-CNN + ZF, 5 scales | 53.9
One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN; the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.

To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the "proposals" are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from 5 scales. We use those 5 scales as in [1], [2].

Table 10 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [2], [39], where replacing SS region proposals with sliding windows leads to ∼6% degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.

4.2 Experiments on MS COCO

We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO's standard metric, simply denoted as mAP@[.5, .95]) and mAP@0.5 (PASCAL VOC's metric).

There are a few minor changes of our system made for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding 64²), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0, 0.5), instead of [0.1, 0.5) used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. But the Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0, 0.1) are never visited. Including these [0, 0.1) samples improves mAP@0.5 on the COCO dataset for both Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).
11

Table 11: Object detection results (%) on the MS COCO dataset. The model is VGG-16.
COCO val COCO test-dev
method proposals training data mAP@.5 mAP@[.5, .95] mAP@.5 mAP@[.5, .95]
Fast R-CNN [2] SS, 2000 COCO train - - 35.9 19.7
Fast R-CNN [impl. in this paper] SS, 2000 COCO train 38.6 18.9 39.3 19.3
Faster R-CNN RPN, 300 COCO train 41.5 21.2 42.1 21.5
Faster R-CNN RPN, 300 COCO trainval - - 42.7 21.9

The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale (s = 600) testing. The testing time is still about 200ms per image on the COCO dataset.

In Table 11 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% mAP@0.5 on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.

Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% mAP@0.5 and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for mAP@0.5 and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 11). This indicates that RPN performs excellently at improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% mAP@0.5 and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.

Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. Only by replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from 41.5%/21.2% (VGG-16) to 48.4%/27.2% (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of 55.7%/34.9% and an ensemble result of 59.0%/37.4% on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by an absolute 8.5%. RPN is also a building block of the 1st-place winning entries in the ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.

Table 12: Detection mAP (%) of Faster R-CNN on PASCAL VOC 2007 test set and 2012 test set using different training data. The model is VGG-16. "COCO" denotes that the COCO trainval set is used for training. See also Table 6 and Table 7.

training data    | 2007 test | 2012 test
VOC07            | 69.9      | 67.0
VOC07+12         | 73.2      | -
VOC07++12        | -         | 70.4
COCO (no VOC)    | 76.1      | 73.0
COCO+VOC07+12    | 78.8      | -
COCO+VOC07++12   | -         | 75.9

4.3 From MS COCO to PASCAL VOC

Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.

As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive to COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is 76.1% on the PASCAL VOC 2007 test set (Table 12). This result is better than that trained on VOC07+12 (73.2%) by a good margin, even though the PASCAL VOC data are not exploited.

Then we fine-tune the COCO detection model on the VOC dataset. In this experiment, the COCO model is in place of the ImageNet-pre-trained model (that is used to initialize the network weights), and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to 78.8% mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by 5.6%. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. Similar improvements are observed on the PASCAL VOC 2012 test set (Table 12 and Table 7). We note that the test-time speed of obtaining these strong results is still about 200ms per image.
Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster R-CNN system. The model is VGG-16 and the training data is 07+12 trainval (73.2% mAP on the 2007 test set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images. The running time for obtaining these results is 198ms per image, including all steps.

5 CONCLUSION

We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at near real-time frame rates. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in European Conference on Computer Vision (ECCV), 2014.
[2] R. Girshick, "Fast R-CNN," in IEEE International Conference on Computer Vision (ICCV), 2015.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN system. The model is VGG-16 and the training data is COCO trainval (42.7% mAP@0.5 on the test-dev set). Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images. For each image, one color represents one object category in that image.

[4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision (IJCV), 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision (ECCV), 2014.
[7] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," in International Conference on Learning Representations (ICLR), 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Neural Information Processing Systems (NIPS), 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," 2007.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision (ECCV), 2014.
[13] S. Song and J. Xiao, "Deep sliding shapes for amodal 3d object detection in rgb-d images," arXiv:1511.02300, 2015.
[14] J. Zhu, X. Chen, and A. L. Yuille, "DeePM: A deep part-based model for object detection and semantic part localization," arXiv:1511.07131, 2015.
[15] J. Dai, K. He, and J. Sun, "Instance-aware semantic segmentation via multi-task network cascades," arXiv:1512.04412, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, "Densecap: Fully convolutional localization networks for dense captioning," arXiv:1511.07571, 2015.
[17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, "Human curation and convnets: Powering item-to-item recommendations on pinterest," arXiv:1511.04003, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv:1512.03385, 2015.
[19] J. Hosang, R. Benenson, and B. Schiele, "How good are detection proposals, really?" in British Machine Vision Conference (BMVC), 2014.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?" IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, "Object-Proposal Evaluation Protocol is 'Gameable'," arXiv:1505.05836, 2015.
[22] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[23] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Neural Information Processing Systems (NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, "Scalable, high-quality object detection," arXiv:1412.1441 (v1), 2015.
[28] P. O. Pinheiro, R. Collobert, and P. Dollar, "Learning to segment object candidates," in Neural Information Processing Systems (NIPS), 2015.
[29] J. Dai, K. He, and J. Sun, "Convolutional feature masking for joint object and stuff segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, "Object detection networks on convolutional feature maps," arXiv:1504.06066, 2015.
[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Neural Information Processing Systems (NIPS), 2015.
[32] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional neural networks," in European Conference on Computer Vision (ECCV), 2014.
[33] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," in International Conference on Machine Learning (ICML), 2010.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, 1989.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), 2015.
[37] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," in Neural Information Processing Systems (NIPS), 2012.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv:1408.5093, 2014.
[39] K. Lenc and A. Vedaldi, "R-CNN minus R," in British Machine Vision Conference (BMVC), 2015.

UNSUPERVISED REPRESENTATION LEARNING
WITH DEEP CONVOLUTIONAL
GENERATIVE ADVERSARIAL NETWORKS
Alec Radford & Luke Metz
indico Research
Boston, MA
{alec,luke}@indico.io

Soumith Chintala
Facebook AI Research
New York, NY
soumith@fb.com

arXiv:1511.06434v2 [cs.LG] 7 Jan 2016

ABSTRACT
In recent years, supervised learning with convolutional networks (CNNs) has
seen huge adoption in computer vision applications. Comparatively, unsupervised
learning with CNNs has received less attention. In this work we hope to help
bridge the gap between the success of CNNs for supervised learning and unsuper-
vised learning. We introduce a class of CNNs called deep convolutional generative
adversarial networks (DCGANs), that have certain architectural constraints, and
demonstrate that they are a strong candidate for unsupervised learning. Training
on various image datasets, we show convincing evidence that our deep convolu-
tional adversarial pair learns a hierarchy of representations from object parts to
scenes in both the generator and discriminator. Additionally, we use the learned
features for novel tasks - demonstrating their applicability as general image repre-
sentations.

1 INTRODUCTION
Learning reusable feature representations from large unlabeled datasets has been an area of active
research. In the context of computer vision, one can leverage the practically unlimited amount of
unlabeled images and videos to learn good intermediate representations, which can then be used on
a variety of supervised learning tasks such as image classification. We propose that one way to build
good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow
et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors
for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques.
One can additionally argue that their learning process and the lack of a heuristic cost function (such
as pixel-wise independent mean-square error) are attractive to representation learning. GANs have
been known to be unstable to train, often resulting in generators that produce nonsensical outputs.
There has been very limited published research in trying to understand and visualize what GANs
learn, and the intermediate representations of multi-layer GANs.
In this paper, we make the following contributions:

• We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN).
• We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.
• We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.
• We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.

2 RELATED WORK

2.1 REPRESENTATION LEARNING FROM UNLABELED DATA

Unsupervised representation learning is a fairly well studied problem in general computer vision
research, as well as in the context of images. A classic approach to unsupervised representation
learning is to do clustering on the data (for example using K-means), and leverage the clusters for
improved classification scores. In the context of images, one can do hierarchical clustering of image
patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method
is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and
where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that
encode an image into a compact code, and decode the code to reconstruct the image as accurately
as possible. These methods have also been shown to learn good feature representations from image
pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning
hierarchical representations.

2.2 GENERATING NATURAL IMAGES

Generative image models are well studied and fall into two categories: parametric and non-
parametric.
The non-parametric models often do matching from a database of existing images, often matching
patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution
(Freeman et al., 2002) and in-painting (Hays & Efros, 2007).
Parametric models for generating images have been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images of the real world had little success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer
from being blurry.
(Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated
images suffering from being noisy and incomprehensible. A laplacian pyramid extension to this
approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects
looking wobbly because of noise introduced in chaining multiple models. A recurrent network
approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have
also recently had some success with generating natural images. However, they have not leveraged
the generators for supervised tasks.

2.3 VISUALIZING THE INTERNALS OF CNNS

One constant criticism of using neural networks has been that they are black-box methods, with little
understanding of what the networks do in the form of a simple human-consumable algorithm. In the
context of CNNs, Zeiler et. al. (Zeiler & Fergus, 2014) showed that by using deconvolutions and
filtering the maximal activations, one can find the approximate purpose of each convolution filter in
the network. Similarly, using a gradient descent on the inputs lets us inspect the ideal image that
activates certain subsets of filters (Mordvintsev et al.).

3 APPROACH AND MODEL ARCHITECTURE

Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This
motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to it-
eratively upscale low resolution generated images which can be modeled more reliably. We also
encountered difficulties attempting to scale GANs using CNN architectures commonly used in the
supervised literature. However, after extensive model exploration we identified a family of archi-


tectures that resulted in stable training across a range of datasets and allowed for training higher
resolution and deeper generative models.
Core to our approach is adopting and modifying three recently demonstrated changes to CNN archi-
tectures.
The first is the all convolutional net (Springenberg et al., 2014), which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in our generator, allowing it to learn its own spatial upsampling, and in our discriminator.
Second is the trend towards eliminating fully connected layers on top of convolutional features.
The strongest example of this is global average pooling which has been utilized in state of the
art image classification models (Mordvintsev et al.). We found global average pooling increased
model stability but hurt convergence speed. A middle ground of directly connecting the highest
convolutional features to the input and output respectively of the generator and discriminator worked
well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called
fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional
tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer
is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example
model architecture.
Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the
input to each unit to have zero mean and unit variance. This helps deal with training problems that
arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get
deep generators to begin learning, preventing the generator from collapsing all samples to a single
point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers
however, resulted in sample oscillation and model instability. This was avoided by not applying
batchnorm to the generator output layer and the discriminator input layer.
The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output
layer which uses the Tanh function. We observed that using a bounded activation allowed the model
to learn more quickly to saturate and cover the color space of the training distribution. Within the
discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work
well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which
used the maxout activation (Goodfellow et al., 2013).
Architecture guidelines for stable Deep Convolutional GANs
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided
convolutions (generator).
• Use batchnorm in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in generator for all layers except for the output, which uses Tanh.
• Use LeakyReLU activation in the discriminator for all layers.
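To make these guidelines concrete, here is a minimal PyTorch sketch of a generator and discriminator for 64 × 64 images following the list above (our own illustration; the paper did not use PyTorch, and the layer widths ngf = ndf = 64 are assumptions, not values taken verbatim from the paper):

```python
import torch.nn as nn

class Generator(nn.Module):
    """Z (100-dim) -> 64x64 RGB image, per the guidelines above."""
    def __init__(self, nz=100, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            # Project-and-reshape start: 1x1 -> 4x4 fractionally-strided conv.
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            # No batchnorm on the output layer; Tanh bounds pixels to [-1, 1].
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):          # z: (batch, 100, 1, 1)
        return self.net(z)

class Discriminator(nn.Module):
    """64x64 RGB image -> scalar probability; strided convs, no pooling."""
    def __init__(self, ndf=64):
        super().__init__()
        self.net = nn.Sequential(
            # No batchnorm on the input layer; LeakyReLU throughout.
            nn.Conv2d(3, ndf, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):          # x: (batch, 3, 64, 64)
        return self.net(x).view(-1)
```

The project-and-reshape input and the four fractionally-strided convolutions mirror the generator shown in Figure 1 below.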

4 DETAILS OF ADVERSARIAL TRAINING

We trained DCGANs on three datasets, Large-scale Scene Understanding (LSUN) (Yu et al., 2015),
Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets
are given below.
No pre-processing was applied to training images besides scaling to the range of the tanh activation
function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with
a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution
with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models.
While previous GAN work has used momentum to accelerate training, we used the Adam optimizer
(Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001 to be too high, and used 0.0002 instead. Additionally, we found that leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training.

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.
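These optimization settings translate directly into code. A sketch of the setup (ours; it continues the hypothetical Generator and Discriminator classes from the previous sketch):

```python
import torch
import torch.nn as nn

def weights_init(m):
    # Conv weights drawn from a zero-centered Normal with std 0.02.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

netG, netD = Generator(), Discriminator()
netG.apply(weights_init)
netD.apply(weights_init)

# Adam with lr = 0.0002 and beta1 reduced from 0.9 to 0.5, as in the text;
# mini-batch size 128 and inputs scaled to [-1, 1] per Section 4.
optG = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
optD = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))
```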

4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting
and memorization of training samples have risen. To demonstrate how our model scales with more
data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing
a little over 3 million training examples. Recent analysis has shown that there is a direct link be-
tween how fast models learn and their generalization performance (Hardt et al., 2015). We show
samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after
convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality
samples via simply overfitting/memorizing training examples. No data augmentation was applied to
the images.

4.1.1 DEDUPLICATION

To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a
simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized RELU
autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer
activations are then binarized via thresholding the ReLU activation which has been shown to be an
effective information preserving technique (Srivastava et al., 2014) and provides a convenient form
of semantic hashing, allowing for linear-time de-duplication. Visual inspection of hash collisions
showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the
technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.
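
A minimal sketch of such a de-duplicator is given below; the 3072-128-3072 layer sizes follow
the text, while the dropout rate and the rest of the wiring are assumptions for illustration.

import torch
import torch.nn as nn

class DedupAutoencoder(nn.Module):
    # 3072-128-3072 denoising, dropout-regularized ReLU autoencoder on flattened
    # 32x32x3 center crops; the binarized 128-unit code acts as a semantic hash.
    def __init__(self, dropout=0.5):  # dropout rate is an assumed value
        super().__init__()
        self.encode = nn.Sequential(nn.Dropout(dropout), nn.Linear(3072, 128), nn.ReLU())
        self.decode = nn.Linear(128, 3072)

    def forward(self, x):
        return self.decode(self.encode(x))

    def hash_code(self, x):
        # Thresholding the ReLU code at zero yields a binary hash; exact hash
        # collisions then give linear-time de-duplication.
        with torch.no_grad():
            return (self.encode(x) > 0).to(torch.uint8)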

4.2 FACES

We scraped images containing human faces from random web image queries of people's names. The
names were acquired from DBpedia, with the criterion that the people were born in the modern era.
This dataset has 3M images from 10K people. We run an OpenCV face detector on these images,
keeping the detections that are sufficiently high resolution, which gives us approximately 350,000
face boxes. We use these face boxes for training. No data augmentation was applied to the images.


Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model
could learn to memorize training examples, but this is experimentally unlikely as we train with a
small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating
memorization with SGD and a small learning rate.

Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual
under-fitting via repeated noise textures across multiple samples such as the base boards of some of
the beds.

4.3 IMAGENET-1K

We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We
train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.


5 EMPIRICAL VALIDATION OF DCGANS CAPABILITIES

5.1 CLASSIFYING CIFAR-10 USING GANS AS A FEATURE EXTRACTOR

One common technique for evaluating the quality of unsupervised representation learning algo-
rithms is to apply them as a feature extractor on supervised datasets and evaluate the performance
of linear models fitted on top of these features.
On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well
tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm.
When using a very large amount of feature maps (4800) this technique achieves 80.6% accuracy.
An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates &
Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks,
we train on Imagenet-1k and then use the discriminator’s convolutional features from all layers,
max-pooling each layer's representation to produce a 4 × 4 spatial grid. These features are then
flattened and concatenated to form a 28672-dimensional vector, and a regularized linear L2-SVM
classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means
based approaches. Notably, the discriminator has far fewer feature maps (512 in the highest layer)
compared to K-means based techniques, but does result in a larger total feature vector size due to
the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of
Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs
in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented,
exemplar samples from the source dataset. Further improvements could be made by finetuning the
discriminator’s representations, but we leave this for future work. Additionally, since our DCGAN
was never trained on CIFAR-10 this experiment also demonstrates the domain robustness of the
learned features.
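
The feature-extraction pipeline described here can be sketched as follows; disc_layers is an
assumed list of the discriminator's convolutional activation tensors for a batch of images,
and the SVM stage uses scikit-learn for illustration.

import torch
import torch.nn.functional as F
from sklearn.svm import LinearSVC

def dcgan_features(disc_layers):
    # Max-pool each layer's activation tensor (N, C, H, W) down to a 4x4 spatial
    # grid, then flatten and concatenate into one long feature vector per image.
    pooled = [F.adaptive_max_pool2d(h, output_size=4).flatten(start_dim=1)
              for h in disc_layers]
    return torch.cat(pooled, dim=1)  # e.g. 28672-dimensional in the paper

# A regularized linear L2-SVM is then trained on the frozen features, e.g.:
# svm = LinearSVC().fit(dcgan_features(train_acts).numpy(), train_labels)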

Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pre-
trained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.
Model | Accuracy | Accuracy (400 per class) | max # of feature units
1 Layer K-means | 80.6% | 63.7% (±0.7%) | 4800
3 Layer K-means Learned RF | 82.0% | 70.7% (±0.7%) | 3200
View Invariant K-means | 81.9% | 72.6% (±0.7%) | 6400
Exemplar CNN | 84.3% | 77.4% (±0.2%) | 1024
DCGAN (ours) + L2-SVM | 82.8% | 73.8% (±0.4%) | 512

5.2 CLASSIFYING SVHN DIGITS USING GANS AS A FEATURE EXTRACTOR

On the StreetView House Numbers dataset (SVHN)(Netzer et al., 2011), we use the features of
the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following
similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of
10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000
uniformly class distributed training examples are randomly selected and used to train a regularized
linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This
achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon
another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally,
we validate that the CNN architecture used in DCGAN is not the key contributing factor of the
model’s performance by training a purely supervised CNN with the same architecture on the same
data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio,
2012). It achieves a significantly higher 28.87% validation error.

6 INVESTIGATING AND VISUALIZING THE INTERNALS OF THE NETWORKS

We investigate the trained generators and discriminators in a variety of ways. We do not do any
kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are


Table 2: SVHN classification with 1000 labels


Model | Error rate
KNN | 77.93%
TSVM | 66.55%
M1+KNN | 65.63%
M1+TSVM | 54.33%
M1+M2 | 36.02%
SWWAE without dropout | 27.83%
SWWAE with dropout | 23.56%
DCGAN (ours) + L2-SVM | 22.48%
Supervised CNN with the same architecture | 28.87% (validation)

trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood
metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.

6.1 WALKING IN THE LATENT SPACE

The first experiment we did was to understand the landscape of the latent space. Walking on the
manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions)
and about the way in which the space is hierarchically collapsed. If walking in this latent space
results in semantic changes to the image generations (such as objects being added and removed), we
can reason that the model has learned relevant and interesting representations. The results are shown
in Fig.4.
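
Such a walk amounts to decoding points along a straight line between two latent vectors, as in
this small sketch (netG is the trained generator; the linear interpolation scheme is an
assumption for illustration):

import torch

def latent_walk(netG, z0, z1, steps=10):
    # Linearly interpolate between two points in Z and decode every step;
    # smooth semantic transitions (rather than sharp jumps) are the signal
    # that the generator is not simply memorizing training examples.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)
    z = (1 - alphas) * z0 + alphas * z1  # z0, z1 have shape (1, nz, 1, 1)
    with torch.no_grad():
        return netG(z)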

6.2 VISUALIZING THE DISCRIMINATOR FEATURES

Previous work has demonstrated that supervised training of CNNs on large image datasets results in
very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on
scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised
DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting.
Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig.5 that the
features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows.
For comparison, in the same figure, we give a baseline for randomly initialized features that are not
activated on anything that is semantically relevant or interesting.

6.3 MANIPULATING THE GENERATOR REPRESENTATION

6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS

In addition to the representations learnt by a discriminator, there is the question of what representa-
tions the generator learns. The quality of samples suggest that the generator learns specific object
representations for major scene components such as beds, windows, lamps, doors, and miscellaneous
furniture. In order to explore the form that these representations take, we conducted an experiment
to attempt to remove windows from the generator completely.
On 150 samples, 52 window bounding boxes were drawn manually. On the second highest con-
volution layer features, logistic regression was fit to predict whether a feature activation was on a
window (or not), by using the criterion that activations inside the drawn bounding boxes are posi-
tives and random samples from the same images are negatives. Using this simple model, all feature
maps with weights greater than zero (200 in total) were dropped from all spatial locations. Then,
random new samples were generated with and without the feature map removal.
The generated images with and without the window dropout are shown in Fig.6, and interestingly,
the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.
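
In outline, the procedure can be sketched as follows; acts and window_mask are assumed arrays
holding the chosen layer's activations and the manually drawn window boxes, and the
negative-sampling details are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def window_filter_ids(acts, window_mask):
    # acts: (N, C, H, W) activations of the chosen layer for the 150 samples;
    # window_mask: (N, H, W) booleans marking the drawn window bounding boxes.
    per_loc = acts.transpose(0, 2, 3, 1)          # (N, H, W, C)
    X_pos = per_loc[window_mask]                  # activations inside boxes
    rand = np.random.rand(*window_mask.shape) < window_mask.mean()
    X_neg = per_loc[rand & ~window_mask]          # random non-window locations
    X = np.concatenate([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # Feature maps with positive weights are treated as "window" filters and
    # are zeroed at all spatial locations before re-sampling from the model.
    return np.where(clf.coef_[0] > 0)[0]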


Figure 4: Top rows: Interpolation between a series of 9 random points in Z shows that the space
learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In
the 6th row, you see a room without a window slowly transforming into a room with a giant window.
In the 10th row, you see what appears to be a TV slowly being transformed into a window.

6.3.2 VECTOR ARITHMETIC ON FACE SAMPLES

In the context of evaluating learned representations of words, Mikolov et al. (2013) demonstrated
that simple arithmetic operations revealed rich linear structure in representation space. One canoni-
cal example demonstrated that the vector(”King”) - vector(”Man”) + vector(”Woman”) resulted in a
vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure
emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors
of sets of exemplar samples for visual concepts. Experiments working on only single samples per
concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable
generations that semantically obeyed the arithmetic. In addition to the object manipulation shown
in (Fig. 7), we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).
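
In code, the procedure reduces to averaging and adding latent vectors, as in this sketch
(z_a, z_b, z_c are assumed stacks of three exemplar Z vectors per visual concept; netG is the
trained generator):

import torch

def concept_vector(z_exemplars):
    # Average the Z vectors of three exemplars; single samples were unstable.
    return z_exemplars.mean(dim=0, keepdim=True)

def arithmetic_samples(netG, z_a, z_b, z_c, n=8, scale=0.25):
    # e.g. vector("smiling woman") - vector("neutral woman") + vector("neutral man")
    y = concept_vector(z_a) - concept_vector(z_b) + concept_vector(z_c)
    # Uniform noise of scale +/-0.25 added to Y produces the surrounding samples.
    jitter = (torch.rand(n, *y.shape[1:]) * 2 - 1) * scale
    with torch.no_grad():
        return netG(torch.cat([y, y + jitter]))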
These demonstrations suggest interesting applications can be developed using Z representations
learned by our models. It has been previously demonstrated that conditional generative models can
learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al.,
2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised


Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses


for the first 6 learned convolutional features from the last convolution layer in the discriminator.
Notice a significant minority of features respond to beds - the central object in the LSUN bedrooms
dataset. On the left is a random filter baseline. Comparing to the previous responses there is little to
no discrimination and random structure.

Figure 6: Top row: un-modified samples from model. Bottom row: the same samples generated
with dropping out ”window” filters. Some windows are removed, others are transformed into objects
with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall
scene composition stayed similar, suggesting the generator has done a good job disentangling scene
representation from object representation. Extended experiments could be done to remove other
objects from the image and modify the objects the generator draws.

models. Further exploring and developing the above mentioned vector arithmetic could dramat-
ically reduce the amount of data needed for conditional generative modeling of complex image
distributions.

7 CONCLUSION AND FUTURE WORK

We propose a more stable set of architectures for training generative adversarial networks and we
give evidence that adversarial networks learn good representations of images for supervised learning
and generative modeling. There are still some forms of model instability remaining - we noticed as
models are trained longer they sometimes collapse a subset of filters to a single oscillating mode.


Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are
averaged. Arithmetic was then performed on the mean vectors creating a new vector Y . The center
sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate
the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added
to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples)
results in noisy overlap due to misalignment.

Further work is needed to tackle this form of instability. We think that extending this framework


Figure 8: A ”turn” vector was created from four averaged samples of faces looking left vs looking
right. By adding interpolations along this axis to random samples we were able to reliably transform
their pose.

to other domains such as video (for frame prediction) and audio (pre-trained features for speech
synthesis) should be very interesting. Further investigations into the properties of the learnt latent
space would be interesting as well.

ACKNOWLEDGMENTS

We are fortunate and thankful for all the advice and guidance we have received during this work,
especially that of Ian Goodfellow, Tobias Springenberg, Arthur Szlam and Durk Kingma. Addition-
ally we’d like to thank all of the folks at indico for providing support, resources, and conversations,
especially the two other members of the indico research team, Dan Kuster and Nathan Lintz. Finally,
we’d like to thank Nvidia for donating a Titan-X GPU used in this work.

REFERENCES
Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. JMLR,
2012.

Coates, Adam and Ng, Andrew. Selecting receptive fields in deep networks. NIPS, 2011.

Coates, Adam and Ng, Andrew Y. Learning feature representations with k-means. In Neural Net-
works: Tricks of the Trade, pp. 561–580. Springer, 2012.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale
hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.
IEEE Conference on, pp. 248–255. IEEE, 2009.

Denton, Emily, Chintala, Soumith, Szlam, Arthur, and Fergus, Rob. Deep generative image models
using a laplacian pyramid of adversarial networks. arXiv preprint arXiv:1506.05751, 2015.

Dosovitskiy, Alexey, Springenberg, Jost Tobias, and Brox, Thomas. Learning to generate chairs
with convolutional neural networks. arXiv preprint arXiv:1411.5928, 2014.


Dosovitskiy, Alexey, Fischer, Philipp, Springenberg, Jost Tobias, Riedmiller, Martin, and Brox,
Thomas. Discriminative unsupervised feature learning with exemplar convolutional neural net-
works. In Pattern Analysis and Machine Intelligence, IEEE Transactions on, volume 99. IEEE,
2015.
Efros, Alexei, Leung, Thomas K, et al. Texture synthesis by non-parametric sampling. In Computer
Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pp.
1033–1038. IEEE, 1999.
Freeman, William T, Jones, Thouis R, and Pasztor, Egon C. Example-based super-resolution. Com-
puter Graphics and Applications, IEEE, 22(2):56–65, 2002.
Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua.
Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair,
Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. NIPS, 2014.
Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra, Daan. Draw: A recurrent neural network
for image generation. arXiv preprint arXiv:1502.04623, 2015.
Hardt, Moritz, Recht, Benjamin, and Singer, Yoram. Train faster, generalize better: Stability of
stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
Hauberg, Søren, Freifeld, Oren, Larsen, Anders Boesen Lindbo, Fisher III, John W., and Hansen,
Lars Kai. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned
data augmentation. arXiv preprint arXiv:1510.02795, 2015.
Hays, James and Efros, Alexei A. Scene completion using millions of photographs. ACM Transac-
tions on Graphics (TOG), 26(3):4, 2007.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Kingma, Diederik P and Ba, Jimmy Lei. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the
26th Annual International Conference on Machine Learning, pp. 609–616. ACM, 2009.
Loosli, Gaëlle, Canu, Stéphane, and Bottou, Léon. Training invariant support vector machines using
selective sampling. In Bottou, Léon, Chapelle, Olivier, DeCoste, Dennis, and Weston, Jason
(eds.), Large Scale Kernel Machines, pp. 301–320. MIT Press, Cambridge, MA., 2007. URL
http://leon.bottou.org/papers/loosli-canu-bottou-2006.
Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. Rectifier nonlinearities improve neural
network acoustic models. In Proc. ICML, volume 30, 2013.
Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural information
processing systems, pp. 3111–3119, 2013.
Mordvintsev, Alexander, Olah, Christopher, and Tyka, Mike. Inceptionism: Going
deeper into neural networks. http://googleresearch.blogspot.com/2015/06/
inceptionism-going-deeper-into-neural.html. Accessed: 2015-06-17.
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–
814, 2010.


Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Read-
ing digits in natural images with unsupervised feature learning. In NIPS workshop on deep learn-
ing and unsupervised feature learning, volume 2011, pp. 5. Granada, Spain, 2011.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image represen-
tations using convolutional neural networks. In CVPR, 2014.
Portilla, Javier and Simoncelli, Eero P. A parametric texture model based on joint statistics of
complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, 2000.
Rasmus, Antti, Valpola, Harri, Honkala, Mikko, Berglund, Mathias, and Raiko, Tapani. Semi-
supervised learning with ladder network. arXiv preprint arXiv:1507.02672, 2015.
Sohl-Dickstein, Jascha, Weiss, Eric A, Maheswaranathan, Niru, and Ganguli, Surya. Deep unsuper-
vised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for
simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
Srivastava, Rupesh Kumar, Masci, Jonathan, Gomez, Faustino, and Schmidhuber, Jürgen. Under-
standing locally competitive networks. arXiv preprint arXiv:1410.1165, 2014.
Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models.
arXiv:1511.01844, Nov 2015. URL http://arxiv.org/abs/1511.01844.
Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
Xu, Bing, Wang, Naiyan, Chen, Tianqi, and Li, Mu. Empirical evaluation of rectified activations in
convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Yu, Fisher, Zhang, Yinda, Song, Shuran, Seff, Ari, and Xiao, Jianxiong. Construction of a large-scale
image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365,
2015.
Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In
Computer Vision–ECCV 2014, pp. 818–833. Springer, 2014.
Zhao, Junbo, Mathieu, Michael, Goroshin, Ross, and Lecun, Yann. Stacked what-where auto-
encoders. arXiv preprint arXiv:1506.02351, 2015.


8 SUPPLEMENTARY MATERIAL
8.1 EVALUATING DCGANS CAPABILITY TO CAPTURE DATA DISTRIBUTIONS

We propose to apply standard classification metrics to a conditional version of our model, evaluating
the conditional distributions learned. We trained a DCGAN on MNIST (splitting off a 10K validation
set) as well as a permutation invariant GAN baseline and evaluated the models using a nearest
neighbor classifier comparing real data to a set of generated conditional samples. We found that
removing the scale and bias parameters from batchnorm produced better results for both models. We
speculate that the noise introduced by batchnorm helps the generative models to better explore and
generate from the underlying data distribution. The results are shown in Table 3 which compares
our models with other techniques. The DCGAN model achieves the same test error as a nearest
neighbor classifier fitted on the training dataset - suggesting the DCGAN model has done a superb
job at modeling the conditional distributions of this dataset. At one million samples per class, the
DCGAN model outperforms InfiMNIST (Loosli et al., 2007), a hand developed data augmentation
pipeline which uses translations and elastic deformations of training examples. The DCGAN is
competitive with a probabilistic generative data augmentation technique utilizing learned per class
transformations (Hauberg et al., 2015) while being more general as it directly models the data instead
of transformations of the data.
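
The evaluation protocol can be sketched as follows (a hypothetical helper; gen_images and
gen_labels are conditional samples drawn from the model, compared against real test data):

from sklearn.neighbors import KNeighborsClassifier

def nn_eval(gen_images, gen_labels, test_images, test_labels):
    # Fit a 1-nearest-neighbor classifier on generated conditional samples and
    # measure its error on real test data: a model that captures the class-
    # conditional distributions well yields a low nearest-neighbor test error.
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(gen_images.reshape(len(gen_images), -1), gen_labels)
    return 1.0 - clf.score(test_images.reshape(len(test_images), -1), test_labels)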

Table 3: Nearest neighbor classification results.


Model | Test Error @50K samples | Test Error @10M samples
AlignMNIST | – | 1.4%
InfiMNIST | – | 2.6%
Real Data | 3.1% | –
GAN | 6.28% | 5.65%
DCGAN (ours) | 2.98% | 1.48%

Figure 9: Side-by-side illustration of (from left-to-right) the MNIST dataset, generations from a
baseline GAN, and generations from our DCGAN.


Figure 10: More face generations from our Face DCGAN.


Figure 11: Generations of a DCGAN that was trained on the Imagenet-1k dataset.

Inception-v4, Inception-ResNet and
the Impact of Residual Connections on Learning

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi
Google Inc., 1600 Amphitheatre Pkwy, Mountain View, CA
szegedy@google.com, sioffe@google.com, vanhoucke@google.com, alemi@google.com

arXiv:1602.07261v2 [cs.CV] 23 Aug 2016

Abstract

Very deep convolutional networks have been central to the largest advances in image recognition
performance in recent years. One example is the Inception architecture that has been shown to
achieve very good performance at relatively low computational cost. Recently, the introduction
of residual connections in conjunction with a more traditional architecture has yielded
state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the
latest generation Inception-v3 network. This raises the question of whether there is any benefit
in combining the Inception architecture with residual connections. Here we give clear empirical
evidence that training with residual connections accelerates the training of Inception networks
significantly. There is also some evidence of residual Inception networks outperforming
similarly expensive Inception networks without residual connections by a thin margin. We also
present several new streamlined architectures for both residual and non-residual Inception
networks. These variations improve the single-frame recognition performance on the ILSVRC 2012
classification task significantly. We further demonstrate how proper activation scaling
stabilizes the training of very wide residual Inception networks. With an ensemble of three
residual and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet
classification (CLS) challenge.

1. Introduction

Since the 2012 ImageNet competition [11] winning entry by Krizhevsky et al. [8], their network
"AlexNet" has been successfully applied to a larger variety of computer vision tasks, for
example to object detection [4], segmentation [10], human pose estimation [17], video
classification [7], object tracking [18], and super-resolution [3]. These examples are but a
few of all the applications to which deep convolutional networks have been very successfully
applied ever since.

In this work we study the combination of the two most recent ideas: residual connections
introduced by He et al. in [5] and the latest revised version of the Inception architecture
[15]. In [5], it is argued that residual connections are of inherent importance for training
very deep architectures. Since Inception networks tend to be very deep, it is natural to
replace the filter concatenation stage of the Inception architecture with residual connections.
This would allow Inception to reap all the benefits of the residual approach while retaining
its computational efficiency.

Besides a straightforward integration, we have also studied whether Inception itself can be
made more efficient by making it deeper and wider. For that purpose, we designed a new version
named Inception-v4 which has a more uniform, simplified architecture and more Inception modules
than Inception-v3. Historically, Inception-v3 had inherited a lot of the baggage of the earlier
incarnations. The technical constraints chiefly came from the need for partitioning the model
for distributed training using DistBelief [2]. Now, after migrating our training setup to
TensorFlow [1], these constraints have been lifted, which allowed us to simplify the
architecture significantly. The details of that simplified architecture are described in
Section 3.

In this report, we will compare the two pure Inception variants, Inception-v3 and v4, with
similarly expensive hybrid Inception-ResNet versions. Admittedly, those models were picked in a
somewhat ad hoc manner, with the main constraint being that the parameters and computational
complexity of the models should be somewhat similar to the cost of the non-residual models. In
fact, we have tested bigger and wider Inception-ResNet variants and they performed very
similarly on the ImageNet classification challenge [11] dataset.

The last experiment reported here is an evaluation of an ensemble of all the best performing
models presented here. As it was apparent that both Inception-v4 and Inception-ResNet-v2
performed similarly well, exceeding state-of-the-art single frame performance on the ImageNet
validation dataset, we wanted to see how a combination of those pushes the state of the art on
this well studied dataset. Surprisingly, we found that gains on the single-frame performance do
not translate into similarly large gains on ensembled performance. Nonetheless, it still allows
us to report 3.1% top-5 error on the validation set with four models ensembled, setting a new
state of the art, to our best knowledge.

In the last section, we study some of the classification failures and conclude that the
ensemble still has not reached the label noise of the annotations on this dataset and there is
still room for improvement for the predictions.

Figure 1. Residual connections as introduced in He et al. [5].

Figure 2. Optimized version of ResNet connections by [5] to shield computation.

2. Related Work

Convolutional networks have become popular in large scale image recognition tasks after
Krizhevsky et al. [8]. Some of the next important milestones were Network-in-network [9] by Lin
et al., VGGNet [12] by Simonyan et al. and GoogLeNet (Inception-v1) [14] by Szegedy et al.

Residual connections were introduced by He et al. in [5], in which they give convincing
theoretical and practical evidence for the advantages of utilizing additive merging of signals
both for image recognition, and especially for object detection. The authors argue that
residual connections are inherently necessary for training very deep convolutional models. Our
findings do not seem to support this view, at least for image recognition. However, it might
require more measurement points with deeper architectures to understand the true extent of
beneficial aspects offered by residual connections. In the experimental section we demonstrate
that it is not very difficult to train competitive very deep networks without utilizing
residual connections. However, the use of residual connections seems to improve the training
speed greatly, which is alone a great argument for their use.

The Inception deep convolutional architecture was introduced in [14] and was called GoogLeNet
or Inception-v1 in our exposition. Later the Inception architecture was refined in various
ways, first by the introduction of batch normalization [6] (Inception-v2) by Ioffe et al. Later
the architecture was improved by additional factorization ideas in the third iteration [15],
which will be referred to as Inception-v3 in this report.

3. Architectural Choices

3.1. Pure Inception blocks

Our older Inception models used to be trained in a partitioned manner, where each replica was
partitioned into multiple sub-networks in order to be able to fit the whole model in memory.
However, the Inception architecture is highly tunable, meaning that there are a lot of possible
changes to the number of filters in the various layers that do not affect the quality of the
fully trained network. In order to optimize the training speed, we used to tune the layer sizes
carefully in order to balance the computation between the various model sub-networks. In
contrast, with the introduction of TensorFlow our most recent models can be trained without
partitioning the replicas. This is enabled in part by recent optimizations of memory used by
backpropagation, achieved by carefully considering what tensors are needed for gradient
computation and structuring the compu-
tation to reduce the number of such tensors. Historically, we
have been relatively conservative about changing the archi-
tectural choices and restricted our experiments to varying
isolated network components while keeping the rest of the
network stable. Not simplifying earlier choices resulted in
networks that looked more complicated than they needed to
be. In our newer experiments, for Inception-v4 we decided
to shed this unnecessary baggage and made uniform choices
for the Inception blocks for each grid size. Please refer to Figure 9 for the large scale
structure of the Inception-v4 network and Figures 3, 4, 5, 6, 7 and 8 for the detailed
structure of its components. All the convolutions not marked with "V" in the figures are
same-padded, meaning that their output grid matches the size of their input. Convolutions
marked with "V" are valid padded, meaning that the input patch of each unit is fully contained
in the previous layer and the grid size of the output activation map is reduced accordingly.

3.2. Residual Inception Blocks

For the residual versions of the Inception networks, we use cheaper Inception blocks than the
original Inception. Each Inception block is followed by a filter-expansion layer (1 × 1
convolution without activation) which is used for scaling up the dimensionality of the filter
bank before the addition, to match the depth of the input. This is needed to compensate for the
dimensionality reduction induced by the Inception block.

We tried several versions of the residual version of Inception. Only two of them are detailed
here. The first one, "Inception-ResNet-v1", roughly matches the computational cost of
Inception-v3, while "Inception-ResNet-v2" matches the raw cost of the newly introduced
Inception-v4 network. See Figure 15 for the large scale structure of both variants. (However,
the step time of Inception-v4 proved to be significantly slower in practice, probably due to
the larger number of layers.)

Another small technical difference between our residual and non-residual Inception variants is
that in the case of Inception-ResNet, we used batch-normalization only on top of the
traditional layers, but not on top of the summations. It is reasonable to expect that a
thorough use of batch-normalization should be advantageous, but we wanted to keep each model
replica trainable on a single GPU. It turned out that the memory footprint of layers with large
activation size was consuming a disproportionate amount of GPU memory. By omitting the
batch-normalization on top of those layers, we were able to increase the overall number of
Inception blocks substantially. We hope that with better utilization of computing resources,
making this trade-off will become unnecessary.

Figure 3. The schema for the stem of the pure Inception-v4 and Inception-ResNet-v2 networks.
This is the input part of those networks. Cf. Figures 9 and 15.
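
The padding convention can be made concrete with a small helper that computes output grid
sizes (a sketch of the arithmetic only, matching the stem sizes of Figure 3):

import math

def conv_output_size(n, k, stride, valid):
    # Output grid size of a k x k convolution on an n x n input. 'V' (valid)
    # padding requires each patch to lie fully inside the input; same padding
    # keeps the output grid at ceil(n / stride).
    if valid:
        return (n - k) // stride + 1
    return math.ceil(n / stride)

# Stem of Figure 3: a 3x3 stride-2 'V' convolution shrinks 299x299 to 149x149,
# and a 3x3 stride-1 'V' convolution shrinks 149x149 to 147x147.
assert conv_output_size(299, 3, 2, valid=True) == 149
assert conv_output_size(149, 3, 1, valid=True) == 147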
Figure 4. The schema for 35 × 35 grid modules of the pure Inception-v4 network. This is the
Inception-A block of Figure 9.

Figure 5. The schema for 17 × 17 grid modules of the pure Inception-v4 network. This is the
Inception-B block of Figure 9.

Figure 6. The schema for 8 × 8 grid modules of the pure Inception-v4 network. This is the
Inception-C block of Figure 9.

Figure 7. The schema for the 35 × 35 to 17 × 17 reduction module. Different variants of this
block (with various numbers of filters) are used in Figures 9 and 15, in each of the new
Inception(-v4, -ResNet-v1, -ResNet-v2) variants presented in this paper. The k, l, m, n numbers
represent filter bank sizes which can be looked up in Table 1.

Figure 8. The schema for the 17 × 17 to 8 × 8 grid-reduction module. This is the reduction
module used by the pure Inception-v4 network in Figure 9.
Figure 9. The overall schema of the Inception-v4 network (input 299 × 299 × 3; Stem, 4 ×
Inception-A, Reduction-A, 7 × Inception-B, Reduction-B, 3 × Inception-C, average pooling,
dropout (keep 0.8), and a softmax over 1000 classes). For the detailed modules, please refer to
Figures 3, 4, 5, 6, 7 and 8.

Figure 10. The schema for the 35 × 35 grid (Inception-ResNet-A) module of the
Inception-ResNet-v1 network.

Figure 11. The schema for the 17 × 17 grid (Inception-ResNet-B) module of the
Inception-ResNet-v1 network.

Figure 12. "Reduction-B" 17 × 17 to 8 × 8 grid-reduction module. This module is used by the
smaller Inception-ResNet-v1 network in Figure 15.

Figure 13. The schema for the 8 × 8 grid (Inception-ResNet-C) module of the
Inception-ResNet-v1 network.

Figure 14. The stem of the Inception-ResNet-v1 network.
Figure 15. Schema for the Inception-ResNet-v1 and Inception-ResNet-v2 networks (input 299 ×
299 × 3; Stem, 5 × Inception-ResNet-A, Reduction-A, 10 × Inception-ResNet-B, Reduction-B, 5 ×
Inception-ResNet-C, average pooling, dropout (keep 0.8), and a softmax over 1000 classes). This
schema applies to both networks but the underlying components differ. Inception-ResNet-v1 uses
the blocks as described in Figures 14, 10, 7, 11, 12 and 13. Inception-ResNet-v2 uses the
blocks as described in Figures 3, 16, 7, 17, 18 and 19. The output sizes in the diagram refer
to the activation vector tensor shapes of Inception-ResNet-v1.
Figure 16. The schema for the 35 × 35 grid (Inception-ResNet-A) module of the
Inception-ResNet-v2 network.

Figure 17. The schema for the 17 × 17 grid (Inception-ResNet-B) module of the
Inception-ResNet-v2 network.

Figure 18. The schema for the 17 × 17 to 8 × 8 grid-reduction module. Reduction-B module used
by the wider Inception-ResNet-v1 network in Figure 15.

Figure 19. The schema for the 8 × 8 grid (Inception-ResNet-C) module of the
Inception-ResNet-v2 network.

Network | k | l | m | n
Inception-v4 | 192 | 224 | 256 | 384
Inception-ResNet-v1 | 192 | 192 | 256 | 384
Inception-ResNet-v2 | 256 | 256 | 384 | 384

Table 1. The number of filters of the Reduction-A module for the three Inception variants
presented in this paper. The four numbers in the columns parametrize the four convolutions of
Figure 7.
Figure 20. The general schema for scaling combined Inception-ResNet modules. We expect that the
same idea is useful in the general ResNet case, where instead of the Inception block an
arbitrary subnetwork is used. The scaling block just scales the last linear activations by a
suitable constant, typically around 0.1.

Figure 21. Top-1 error evolution during training of pure Inception-v3 vs a residual network of
similar computational cost. The evaluation is measured on a single crop on the non-blacklist
images of the ILSVRC-2012 validation set. The residual model trained much faster, but reached
slightly worse final accuracy than the traditional Inception-v3.

3.3. Scaling of the Residuals


Also, we found that if the number of filters exceeded 1000, the residual variants started to
exhibit instabilities and the network simply "died" early in training, meaning that the last
layer before the average pooling started to produce only zeros after a few tens of thousands of
iterations. This could not be prevented, neither by lowering the learning rate, nor by adding
an extra batch-normalization to this layer.

We found that scaling down the residuals before adding them to the previous layer activation
seemed to stabilize the training. In general, we picked scaling factors between 0.1 and 0.3 to
scale the residuals before adding them to the accumulated layer activations (cf. Figure 20).

A similar instability was observed by He et al. in [5] in the case of very deep residual
networks; they suggested a two-phase training where the first "warm-up" phase is done with a
very low learning rate, followed by a second phase with a high learning rate. We found that if
the number of filters is very high, then even a very low (0.00001) learning rate is not
sufficient to cope with the instabilities, and training with a high learning rate had a chance
to destroy its effects. We found it much more reliable to just scale the residuals.

Even where the scaling was not strictly necessary, it never seemed to harm the final accuracy,
but it helped to stabilize the training.
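
The scaling scheme of Figure 20 can be sketched in a few lines (illustrative PyTorch; branch
stands in for an arbitrary Inception block, and the scale constant is the 0.1–0.3 range picked
above):

import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    # Scale the residual branch by a small constant before adding it to the
    # shortcut, then apply the nonlinearity (cf. Figure 20).
    def __init__(self, branch, scale=0.1):  # scale picked in the 0.1-0.3 range
        super().__init__()
        self.branch = branch
        self.scale = scale

    def forward(self, x):
        return torch.relu(x + self.scale * self.branch(x))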
4. Training Methodology

We have trained our networks with stochastic gradient descent, utilizing the TensorFlow [1]
distributed machine learning system with 20 replicas, each running on an NVidia Kepler GPU. Our
earlier experiments used momentum [13] with a decay of 0.9, while our best models were achieved
using RMSProp [16] with a decay of 0.9 and ε = 1.0. We used a learning rate of 0.045, decayed
every two epochs using an exponential rate of 0.94. Model evaluations are performed using a
running average of the parameters computed over time.

5. Experimental Results

First we observe the top-1 and top-5 validation-error evolution of the four variants during
training. After the experiment was conducted, we found that our continuous evaluation had been
conducted on a subset of the validation set which omitted about 1700 blacklisted entities due
to poor bounding boxes. It turned out that the omission should have been performed only for the
CLSLOC benchmark, but it yields somewhat incomparable (more optimistic) numbers when compared
to other reports, including some earlier reports by our team. The difference is about 0.3% for
top-1 error and about 0.15% for top-5 error. However, since the differences are consistent, we
think the comparison between the curves is a fair one.

On the other hand, we have rerun our multi-crop and ensemble results on the complete validation
set consisting of 50000 images. The final ensemble result was also evaluated on the test set
and sent to the ILSVRC test server for validation, to verify that our tuning did not result in
over-fitting. We would like to stress that this final validation was done only once, and we
have submitted our results only twice in the last year: once for the BN-Inception paper and
later during the ILSVRC-2015 CLSLOC competition, so we believe that the test set numbers
constitute a true estimate of the generalization capabilities of our model.

Finally, we present some comparisons between various versions of Inception and
Inception-ResNet. The models Inception-v3 and Inception-v4 are deep convolutional networks not
utilizing residual connections, while Inception-ResNet-v1 and Inception-ResNet-v2 are
Inception-style networks that utilize residual connections instead of filter concatenation.
Figure 22. Top-5 error evolution during training of pure Inception-v3 vs a residual Inception
of similar computational cost. The evaluation is measured on a single crop on the non-blacklist
images of the ILSVRC-2012 validation set. The residual version trained much faster and reached
slightly better final recall on the validation set.

Figure 23. Top-1 error evolution during training of pure Inception-v3 vs a residual Inception
of similar computational cost. The evaluation is measured on a single crop on the non-blacklist
images of the ILSVRC-2012 validation set. The residual version trained much faster and reached
slightly better final accuracy than the traditional Inception-v4.

Figure 24. Top-5 error evolution during training of pure Inception-v4 vs a residual Inception
of similar computational cost. The evaluation is measured on a single crop on the non-blacklist
images of the ILSVRC-2012 validation set. The residual version trained faster and reached
slightly better final recall on the validation set.

Figure 25. Top-5 error evolution of all four models (single model, single crop), showing the
improvement due to larger model size. Although the residual version converges faster, the final
accuracy seems to mainly depend on the model size.

Figure 26. Top-1 error evolution of all four models (single model, single crop). This paints a
similar picture as the top-5 evaluation.

Network | Top-1 Error | Top-5 Error
BN-Inception [6] | 25.2% | 7.8%
Inception-v3 [15] | 21.2% | 5.6%
Inception-ResNet-v1 | 21.3% | 5.5%
Inception-v4 | 20.0% | 5.0%
Inception-ResNet-v2 | 19.9% | 4.9%

Table 2. Single crop, single model experimental results. Reported on the non-blacklisted subset
of the validation set of ILSVRC 2012.
Table 2 shows the single-model, single-crop top-1 and top-5 error of the various architectures
on the validation set.

Table 3 shows the performance of the various models with a small number of crops: 10 crops for
ResNet, as reported in [5]; for the Inception variants, we have used the 12-crop evaluation as
described in [14].

Network | Crops | Top-1 Error | Top-5 Error
ResNet-151 [5] | 10 | 21.4% | 5.7%
Inception-v3 [15] | 12 | 19.8% | 4.6%
Inception-ResNet-v1 | 12 | 19.8% | 4.6%
Inception-v4 | 12 | 18.7% | 4.2%
Inception-ResNet-v2 | 12 | 18.7% | 4.1%

Table 3. 10/12 crops evaluations, single model experimental results. Reported on all 50000
images of the validation set of ILSVRC 2012.

Table 4 shows the single-model performance of the various models using multi-crop evaluation.
For the residual network, the dense evaluation result is reported from [5]. For the Inception
networks, the 144-crop strategy was used as described in [14].

Network | Crops | Top-1 Error | Top-5 Error
ResNet-151 [5] | dense | 19.4% | 4.5%
Inception-v3 [15] | 144 | 18.9% | 4.3%
Inception-ResNet-v1 | 144 | 18.8% | 4.3%
Inception-v4 | 144 | 17.7% | 3.8%
Inception-ResNet-v2 | 144 | 17.8% | 3.7%

Table 4. 144 crops evaluations, single model experimental results. Reported on all 50000 images
of the validation set of ILSVRC 2012.

Table 5 compares ensemble results. For the pure residual network, the 6-model dense evaluation
result is reported from [5]. For the Inception networks, 4 models were ensembled using the
144-crop strategy as described in [14].

Network | Models | Top-1 Error | Top-5 Error
ResNet-151 [5] | 6 | – | 3.6%
Inception-v3 [15] | 4 | 17.3% | 3.6%
Inception-v4 + 3× Inception-ResNet-v2 | 4 | 16.5% | 3.1%

Table 5. Ensemble results with 144 crops/dense evaluation. Reported on all 50000 images of the
validation set of ILSVRC 2012. For Inception-v4(+Residual), the ensemble consists of one pure
Inception-v4 and three Inception-ResNet-v2 models and was evaluated both on the validation set
and on the test set. The test-set performance was 3.08% top-5 error, verifying that we do not
over-fit on the validation set.

6. Conclusions

We have presented three new network architectures in detail:

• Inception-ResNet-v1: a hybrid Inception version that has a similar computational cost to
Inception-v3 from [15].

• Inception-ResNet-v2: a costlier hybrid Inception version with significantly improved
recognition performance.

• Inception-v4: a pure Inception variant without residual connections with roughly the same
recognition performance as Inception-ResNet-v2.

We studied how the introduction of residual connections leads to dramatically improved training
speed for the Inception architecture. Also, our latest models (with and without residual
connections) outperform all our previous networks, just by virtue of the increased model size.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia,
R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke,
V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available
from tensorflow.org.

[2] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang,
Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information
Processing Systems, pages 1223–1231, 2012.

[3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image
super-resolution. In Computer Vision–ECCV 2014, pages 184–199. Springer, 2014.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2014.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv
preprint arXiv:1512.03385, 2015.

[6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In Proceedings of The 32nd International Conference on
Machine Learning, pages 448–456, 2015.

[7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale
video classification with convolutional neural networks. In Computer Vision and Pattern
Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems, pages
1097–1105, 2012.

[9] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[10] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic
segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 3431–3440, 2015.

[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.

[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556, 2014.

[13] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and
momentum in deep learning. In Proceedings of the 30th International Conference on Machine
Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May
2013.

[14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–9, 2015.

[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception
architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.

[16] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent
magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.

[17] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In
Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653–1660. IEEE,
2014.

[18] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking.
In Advances in Neural Information Processing Systems, pages 809–817, 2013.
Generative Adversarial Nets

Ian J. Goodfellow∗, Jean Pouget-Abadie†, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair‡, Aaron Courville, Yoshua Bengio§
Département d’informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7

Abstract

We propose a new framework for estimating generative models via an adversar-


ial process, in which we simultaneously train two models: a generative model G
that captures the data distribution, and a discriminative model D that estimates
the probability that a sample came from the training data rather than G. The train-
ing procedure for G is to maximize the probability of D making a mistake. This
framework corresponds to a minimax two-player game. In the space of arbitrary
functions G and D, a unique solution exists, with G recovering the training data
distribution and D equal to 1/2 everywhere. In the case where G and D are defined
by multilayer perceptrons, the entire system can be trained with backpropagation.
There is no need for any Markov chains or unrolled approximate inference net-
works during either training or generation of samples. Experiments demonstrate
the potential of the framework through qualitative and quantitative evaluation of
the generated samples.

1 Introduction

The promise of deep learning is to discover rich, hierarchical models [2] that represent probability
distributions over the kinds of data encountered in artificial intelligence applications, such as natural
images, audio waveforms containing speech, and symbols in natural language corpora. So far, the
most striking successes in deep learning have involved discriminative models, usually those that
map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have
primarily been based on the backpropagation and dropout algorithms, using piecewise linear units
[17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less
of an impact, due to the difficulty of approximating many intractable probabilistic computations that
arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging
the benefits of piecewise linear units in the generative context. We propose a new generative model
estimation procedure that sidesteps these difficulties. 1
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a
discriminative model that learns to determine whether a sample is from the model distribution or the
data distribution. The generative model can be thought of as analogous to a team of counterfeiters,
trying to produce fake currency and use it without detection, while the discriminative model is
analogous to the police, trying to detect the counterfeit currency. Competition in this game drives
both teams to improve their methods until the counterfeits are indistinguishable from the genuine
articles.

∗ Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student.
† Jean Pouget-Abadie did this work while visiting Université de Montréal from Ecole Polytechnique.
‡ Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.
§ Yoshua Bengio is a CIFAR Senior Fellow.
1 All code and hyperparameters available at http://www.github.com/goodfeli/adversarial

This framework can yield specific training algorithms for many kinds of model and optimization
algorithm. In this article, we explore the special case when the generative model generates samples
by passing random noise through a multilayer perceptron, and the discriminative model is also a
multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train
both models using only the highly successful backpropagation and dropout algorithms [16] and
sample from the generative model using only forward propagation. No approximate inference or
Markov chains are necessary.

2 Related work

Until recently, most work on deep generative models focused on models that provided a parametric
specification of a probability distribution function. The model can then be trained by maximiz-
ing the log likelihood. In this family of models, perhaps the most successful is the deep Boltzmann
machine [25]. Such models generally have intractable likelihood functions and therefore require
numerous approximations to the likelihood gradient. These difficulties motivated the development
of “generative machines”–models that do not explicitly represent the likelihood, yet are able to gen-
erate samples from the desired distribution. Generative stochastic networks [4] are an example of
a generative machine that can be trained with exact backpropagation rather than the numerous ap-
proximations required for Boltzmann machines. This work extends the idea of a generative machine
by eliminating the Markov chains used in generative stochastic networks.
Our work backpropagates derivatives through generative processes by using the observation that
$$\lim_{\sigma \to 0} \nabla_x \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} f(x + \epsilon) = \nabla_x f(x).$$
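As a quick illustration (not from the paper; the toy function and Monte Carlo estimator below are our own assumptions), one can check this identity numerically:

```python
# Illustrative numeric check of lim_{sigma -> 0} grad_x E[f(x + eps)] = grad_x f(x).
# For smooth f, grad_x E[f(x + eps)] = E[f'(x + eps)], estimated here by sampling eps.
import numpy as np

rng = np.random.default_rng(0)
f_prime = np.cos  # derivative of the toy choice f = sin
x = 1.3
for sigma in (1.0, 0.1, 0.01):
    eps = sigma * rng.standard_normal(200_000)
    estimate = f_prime(x + eps).mean()
    print(sigma, estimate, f_prime(x))  # estimate approaches cos(1.3) as sigma -> 0
```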

We were unaware at the time we developed this work that Kingma and Welling [18] and Rezende
et al. [23] had developed more general stochastic backpropagation rules, allowing one to backprop-
agate through Gaussian distributions with finite variance, and to backpropagate to the covariance
parameter as well as the mean. These backpropagation rules could allow one to learn the condi-
tional variance of the generator, which we treated as a hyperparameter in this work. Kingma and
Welling [18] and Rezende et al. [23] use stochastic backpropagation to train variational autoen-
coders (VAEs). Like generative adversarial networks, variational autoencoders pair a differentiable
generator network with a second neural network. Unlike generative adversarial networks, the sec-
ond network in a VAE is a recognition model that performs approximate inference. GANs require
differentiation through the visible units, and thus cannot model discrete data, while VAEs require
differentiation through the hidden units, and thus cannot have discrete latent variables. Other VAE-
like approaches exist [12, 22] but are less closely related to our method.
Previous work has also taken the approach of using a discriminative criterion to train a generative
model [29, 13]. These approaches use criteria that are intractable for deep generative models. These
methods are difficult even to approximate for deep models because they involve ratios of probabili-
ties which cannot be approximated using variational approximations that lower bound the probabil-
ity. Noise-contrastive estimation (NCE) [13] involves training a generative model by learning the
weights that make the model useful for discriminating data from a fixed noise distribution. Using a
previously trained model as the noise distribution allows training a sequence of models of increasing
quality. This can be seen as an informal competition mechanism similar in spirit to the formal com-
petition used in the adversarial networks game. The key limitation of NCE is that its “discriminator”
is defined by the ratio of the probability densities of the noise distribution and the model distribution,
and thus requires the ability to evaluate and backpropagate through both densities.
Some previous work has used the general concept of having two neural networks compete. The most
relevant work is predictability minimization [26]. In predictability minimization, each hidden unit
in a neural network is trained to be different from the output of a second network, which predicts
the value of that hidden unit given the value of all of the other hidden units. This work differs from
predictability minimization in three important ways: 1) in this work, the competition between the
networks is the sole training criterion, and is sufficient on its own to train the network. Predictability
minimization is only a regularizer that encourages the hidden units of a neural network to be sta-
tistically independent while they accomplish some other task; it is not a primary training criterion.
2) The nature of the competition is different. In predictability minimization, two networks’ outputs
are compared, with one network trying to make the outputs similar and the other trying to make the

outputs different. The output in question is a single scalar. In GANs, one network produces a rich,
high dimensional vector that is used as the input to another network, and attempts to choose an input
that the other network does not know how to process. 3) The specification of the learning process
is different. Predictability minimization is described as an optimization problem with an objective
function to be minimized, and learning approaches the minimum of the objective function. GANs
are based on a minimax game rather than an optimization problem, and have a value function that
one agent seeks to maximize and the other seeks to minimize. The game terminates at a saddle point
that is a minimum with respect to one player’s strategy and a maximum with respect to the other
player’s strategy.
Generative adversarial networks have sometimes been confused with the related concept of “adversar-
ial examples” [28]. Adversarial examples are examples found by using gradient-based optimization
directly on the input to a classification network, in order to find examples that are similar to the
data yet misclassified. This is different from the present work because adversarial examples are
not a mechanism for training a generative model. Instead, adversarial examples are primarily an
analysis tool for showing that neural networks behave in intriguing ways, often classifying two
images differently with high confidence even though the difference between them is
imperceptible to a human observer. The existence of such adversarial examples does suggest that
generative adversarial network training could be inefficient, because they show that it is possible to
make modern discriminative networks confidently recognize a class without emulating any of the
human-perceptible attributes of that class.

3 Adversarial nets
The adversarial modeling framework is most straightforward to apply when the models are both
multilayer perceptrons. To learn the generator’s distribution pg over data x, we define a prior on
input noise variables pz (z), then represent a mapping to data space as G(z; θg ), where G is a
differentiable function represented by a multilayer perceptron with parameters θg . We also define a
second multilayer perceptron D(x; θd ) that outputs a single scalar. D(x) represents the probability
that x came from the data rather than pg . We train D to maximize the probability of assigning the
correct label to both training examples and samples from G. We simultaneously train G to minimize
log(1 − D(G(z))). In other words, D and G play the following two-player minimax game with
value function V (G, D):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \qquad (1)$$
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that
the training criterion allows one to recover the data generating distribution as G and D are given
enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical
explanation of the approach. In practice, we must implement the game using an iterative, numerical
approach. Optimizing D to completion in the inner loop of training is computationally prohibitive,
and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing
D and one step of optimizing G. This results in D being maintained near its optimal solution, so
long as G changes slowly enough. The procedure is formally presented in Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning,
when G is poor, D can reject samples with high confidence because they are clearly different from
the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize
log(1 − D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the
same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.

4 Theoretical Results
The generator G implicitly defines a probability distribution pg as the distribution of the samples
G(z) obtained when z ∼ pz . Therefore, we would like Algorithm 1 to converge to a good estimator
of pdata , if given enough capacity and training time. The results of this section are done in a non-
parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the
space of probability density functions.
We will show in section 4.1 that this minimax game has a global optimum for pg = pdata . We will
then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.

[Figure 1 graphic: four panels (a)–(d) showing pdata (black, dotted), pg (green, solid), the discriminator D
(blue, dashed), and the mapping x = G(z) from the z domain; see caption below.]

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution
(D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black,
dotted line) px and those of the generative distribution pg (G) (green, solid line). The lower horizontal line is
the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain
of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on
transformed samples. G contracts in regions of high density and expands in regions of low density of pg. (a)
Consider an adversarial pair near convergence: pg is similar to pdata and D is a partially accurate classifier.
(b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to
$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$. (c) After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely
to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a
point at which both cannot improve because pg = pdata. The discriminator is unable to differentiate between
the two distributions, i.e. $D(x) = \frac{1}{2}$.

Algorithm 1 Minibatch stochastic gradient descent training of generative adversarial nets. The number of
steps to apply to the discriminator, k, is a hyperparameter. We used k = 1, the least expensive option, in our
experiments.
for number of training iterations do
    for k steps do
        • Sample minibatch of m noise samples {z(1), . . . , z(m)} from noise prior pz(z).
        • Sample minibatch of m examples {x(1), . . . , x(m)} from data generating distribution pdata(x).
        • Update the discriminator by ascending its stochastic gradient:
          $$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right].$$
    end for
    • Sample minibatch of m noise samples {z(1), . . . , z(m)} from noise prior pz(z).
    • Update the generator by descending its stochastic gradient:
      $$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right).$$
end for
The gradient-based updates can use any standard gradient-based learning rule. We used momen-
tum in our experiments.
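To make Algorithm 1 concrete, here is a minimal sketch in PyTorch (an illustrative re-implementation, not the authors' released Theano/Pylearn2 code; the network sizes and the toy data source are assumptions):

```python
# Sketch of Algorithm 1 with k = 1 and momentum SGD, as in the paper's experiments.
import torch
import torch.nn as nn

m, noise_dim, data_dim, k = 128, 100, 784, 1  # minibatch size, z dim, x dim, D steps

G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_d = torch.optim.SGD(D.parameters(), lr=0.01, momentum=0.9)
opt_g = torch.optim.SGD(G.parameters(), lr=0.01, momentum=0.9)

def sample_data(m):
    # Stand-in for minibatches from p_data(x); substitute a real data loader.
    return torch.rand(m, data_dim)

for iteration in range(10000):
    for _ in range(k):
        # Ascend the discriminator's objective: mean of log D(x) + log(1 - D(G(z))).
        x, z = sample_data(m), torch.randn(m, noise_dim)
        loss_d = -(torch.log(D(x)) + torch.log(1 - D(G(z).detach()))).mean()
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Descend the generator's objective: mean of log(1 - D(G(z))). As noted in
    # Section 3, maximizing log D(G(z)) instead gives stronger early gradients.
    z = torch.randn(m, noise_dim)
    loss_g = torch.log(1 - D(G(z))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```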

4.1 Global Optimality of pg = pdata

We first consider the optimal discriminator D for any given generator G.


Proposition 1. For G fixed, the optimal discriminator D is

∗ pdata (x)
DG (x) = (2)
pdata (x) + pg (x)

Proof. The training criterion for the discriminator D, given any generator G, is to maximize the
quantity V(G, D):
$$V(G, D) = \int_x p_{\text{data}}(x) \log(D(x))\,dx + \int_z p_z(z) \log(1 - D(g(z)))\,dz$$
$$= \int_x p_{\text{data}}(x) \log(D(x)) + p_g(x) \log(1 - D(x))\,dx \qquad (3)$$
For any $(a, b) \in \mathbb{R}^2 \setminus \{(0, 0)\}$, the function $y \mapsto a \log(y) + b \log(1 - y)$ achieves its maximum in
$[0, 1]$ at $\frac{a}{a+b}$ (setting the derivative $\frac{a}{y} - \frac{b}{1-y}$ to zero gives $y = \frac{a}{a+b}$). The discriminator does not need to be defined outside of $\mathrm{Supp}(p_{\text{data}}) \cup \mathrm{Supp}(p_g)$,
concluding the proof.

Note that the training objective for D can be interpreted as maximizing the log-likelihood for es-
timating the conditional probability P (Y = y|x), where Y indicates whether x comes from pdata
(with y = 1) or from pg (with y = 0). The minimax game in Eq. 1 can now be reformulated as:
$$\begin{aligned}
C(G) &= \max_D V(G, D) \\
&= \mathbb{E}_{x \sim p_{\text{data}}}[\log D_G^*(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D_G^*(G(z)))] \qquad (4)\\
&= \mathbb{E}_{x \sim p_{\text{data}}}[\log D_G^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D_G^*(x))] \\
&= \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]
\end{aligned}$$
Theorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if
pg = pdata . At that point, C(G) achieves the value − log 4.

Proof. For pg = pdata, $D_G^*(x) = \frac{1}{2}$ (consider Eq. 2). Hence, by inspecting Eq. 4 at $D_G^*(x) = \frac{1}{2}$, we
find $C(G) = \log \frac{1}{2} + \log \frac{1}{2} = -\log 4$. To see that this is the best possible value of C(G), reached
only for pg = pdata, observe that
$$\mathbb{E}_{x \sim p_{\text{data}}}[-\log 2] + \mathbb{E}_{x \sim p_g}[-\log 2] = -\log 4$$

and that by subtracting this expression from $C(G) = V(D_G^*, G)$, we obtain:
$$C(G) = -\log(4) + KL\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{\text{data}} + p_g}{2}\right) \qquad (5)$$
where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–
Shannon divergence between the model's distribution and the data generating process:
$$C(G) = -\log(4) + 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_g) \qquad (6)$$
Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero
iff they are equal, we have shown that C ∗ = − log(4) is the global minimum of C(G) and that the
only solution is pg = pdata , i.e., the generative model perfectly replicating the data distribution.
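As a sanity check of Theorem 1 (illustrative, using discrete toy distributions that are our own assumption), one can evaluate C(G) = −log 4 + 2 · JSD(pdata ‖ pg) directly:

```python
# C(G) attains its global minimum -log 4 exactly when p_g = p_data.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p_data = np.array([0.2, 0.3, 0.5])
for p_g in (np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])):
    print(p_g, -np.log(4) + 2 * jsd(p_data, p_g))  # -1.386... only when p_g = p_data
```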

4.2 Convergence of Algorithm 1

Proposition 2. If G and D have enough capacity, and at each step of Algorithm 1 the discriminator
is allowed to reach its optimum given G, and pg is updated so as to improve the criterion
$$\mathbb{E}_{x \sim p_{\text{data}}}[\log D_G^*(x)] + \mathbb{E}_{x \sim p_g}[\log(1 - D_G^*(x))],$$
then pg converges to pdata.

Proof. Consider V (G, D) = U (pg , D) as a function of pg as done in the above criterion. Note
that U (pg , D) is convex in pg . The subderivatives of a supremum of convex functions include the
derivative of the function at the point where the maximum is attained. In other words, if f (x) =
supα∈A fα (x) and fα (x) is convex in x for every α, then ∂fβ (x) ∈ ∂f if β = arg supα∈A fα (x).
This is equivalent to computing a gradient descent update for pg at the optimal D given the corresponding G. $\sup_D U(p_g, D)$ is convex in pg with a unique global optimum, as proven in Thm 1;
therefore, with sufficiently small updates of pg, pg converges to pdata, concluding the proof.
In practice, adversarial nets represent a limited family of pg distributions via the function G(z; θg ),
and we optimize θg rather than pg itself, so the proofs do not apply. However, the excellent perfor-
mance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite
their lack of theoretical guarantees.

Model MNIST TFD
DBN [3] 138 ± 2 1909 ± 66
Stacked CAE [3] 121 ± 1.6 2110 ± 50
Deep GSN [5] 214 ± 1.1 1890 ± 29
Adversarial nets 225 ± 2 2057 ± 26

Table 1: Parzen window-based log-likelihood estimates. The numbers reported on MNIST are the mean log-
likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD,
σ was cross-validated using the validation set of each fold, and we report the mean log-likelihood across folds
with the standard error computed across folds. For MNIST we compare against other models of the real-valued
(rather than binary) version of the dataset.

5 Experiments

We trained adversarial nets on a range of datasets including MNIST [21], the Toronto Face Database
(TFD) [27], and CIFAR-10 [19]. The generator nets used a mixture of rectified linear activations [17,
8] and sigmoid activations, while the discriminator net used maxout [9] activations. Dropout [16]
was applied in training the discriminator net. While our theoretical framework permits the use of
dropout and other noise at intermediate layers of the generator, we used noise as the input to only
the bottommost layer of the generator network.
We estimate the probability of the test set data under pg by fitting a Gaussian Parzen window to the
samples generated with G and reporting the log-likelihood under this distribution. The σ parameter
of the Gaussians was obtained by cross validation on the validation set. This procedure was intro-
duced in Breuleux et al. [7] and used for various generative models for which the exact likelihood
is not tractable [24, 3, 4]. Results are reported in Table 1. This method of estimating the likelihood
has somewhat high variance and does not perform well in high dimensional spaces but it is the best
method available to our knowledge. Advances in generative models that can sample but not estimate
likelihood directly motivate further research into how to evaluate such models. In Figures 2 and 3
we show samples drawn from the generator net after training. While we make no claim that these
samples are better than samples generated by existing methods, we believe that these samples are at
least competitive with the better generative models in the literature and highlight the potential of the
adversarial framework.
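For concreteness, the Gaussian Parzen window estimate described above can be sketched as follows (an assumed NumPy/SciPy re-implementation; the authors used Yann Dauphin's evaluation code, per the acknowledgments):

```python
# Mean log-likelihood of test points under an isotropic Gaussian Parzen window
# fitted to generator samples; sigma would be chosen on a validation set.
import numpy as np
from scipy.special import logsumexp

def parzen_mean_log_likelihood(samples, test, sigma):
    # samples: (n, d) draws from G; test: (m, d) test points.
    n, d = samples.shape
    sq_dists = ((test[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
    log_p = logsumexp(-sq_dists / (2 * sigma ** 2), axis=1)
    log_p -= np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    return log_p.mean()
```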

6 Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frame-
works. The disadvantages are primarily that there is no explicit representation of pg (x), and that D
must be synchronized well with G during training (in particular, G must not be trained too much
without updating D, in order to avoid “the Helvetica scenario” in which G collapses too many values
of z to the same value of x to have enough diversity to model pdata ), much as the negative chains of a
Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov
chains are never needed, only backprop is used to obtain gradients, no inference is needed during
learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes
the comparison of generative adversarial nets with other generative modeling approaches.
The aforementioned advantages are primarily computational. Adversarial models may also gain
some statistical advantage from the generator network not being updated directly with data exam-
ples, but only with gradients flowing through the discriminator. This means that components of the
input are not copied directly into the generator’s parameters. Another advantage of adversarial net-
works is that they can represent very sharp, even degenerate distributions, while methods based on
Markov chains require that the distribution be somewhat blurry in order for the chains to be able to
mix between modes.

7 Conclusions and future work

This framework admits many straightforward extensions:

[Figure 2 graphic: sample grids for panels a)–d); see caption below.]
Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of
the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples
are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these
images show actual samples from the model distributions, not conditional means given samples of hidden units.
Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain
mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator
and “deconvolutional” generator)

Figure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.

1. A conditional generative model p(x | c) can be obtained by adding c as input to both G and D (see the sketch after this list).
2. Learned approximate inference can be performed by training an auxiliary network to predict z
given x. This is similar to the inference net trained by the wake-sleep algorithm [15] but with
the advantage that the inference net may be trained for a fixed generator net after the generator
net has finished training.
3. One can approximately model all conditionals $p(x_S \mid x_{\not S})$ where S is a subset of the indices
of x by training a family of conditional models that share parameters. Essentially, one can use
adversarial nets to implement a stochastic extension of the deterministic MP-DBM [10].
4. Semi-supervised learning: features from the discriminator or inference net could improve perfor-
mance of classifiers when limited labeled data is available.
5. Efficiency improvements: training could be accelerated greatly by devising better methods for
coordinating G and D or determining better distributions to sample z from during training.
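A minimal sketch of extension 1, assuming PyTorch and a one-hot condition vector (all names and sizes below are illustrative, not from the paper):

```python
# Conditional GAN sketch: the condition c is concatenated to the inputs of both
# G and D, yielding a conditional generative model p(x | c).
import torch
import torch.nn as nn

noise_dim, cond_dim, data_dim = 100, 10, 784

G = nn.Sequential(nn.Linear(noise_dim + cond_dim, 256), nn.ReLU(),
                  nn.Linear(256, data_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(data_dim + cond_dim, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, noise_dim)
c = torch.eye(cond_dim)[torch.randint(cond_dim, (16,))]  # one-hot conditions
x_fake = G(torch.cat([z, c], dim=1))                     # sample from p(x | c)
score = D(torch.cat([x_fake, c], dim=1))                 # D also sees c
```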

This paper has demonstrated the viability of the adversarial modeling framework, suggesting that
these research directions could prove useful.

Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches
to deep generative modeling for each of the major operations involving a model.

Training:
  Deep directed graphical models: inference needed during training.
  Deep undirected graphical models: inference needed during training; MCMC needed to approximate the partition function gradient.
  Generative autoencoders: enforced tradeoff between mixing and power of reconstruction generation.
  Adversarial models: synchronizing the discriminator with the generator; Helvetica.
Inference:
  Deep directed: learned approximate inference. Deep undirected: variational inference. Generative autoencoders: MCMC-based inference. Adversarial: learned approximate inference.
Sampling:
  Deep directed: no difficulties. Deep undirected: requires Markov chain. Generative autoencoders: requires Markov chain. Adversarial: no difficulties.
Evaluating p(x):
  Deep directed and deep undirected: intractable, may be approximated with AIS. Generative autoencoders and adversarial models: not explicitly represented, may be approximated with Parzen density estimation.
Model design:
  Deep directed: models need to be designed to work with the desired inference scheme (some inference schemes support similar model families as GANs). Deep undirected: careful design needed to ensure multiple properties. Generative autoencoders and adversarial models: any differentiable function is theoretically permitted.

Acknowledgments
We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume
Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window eval-
uation code with us. We would like to thank the developers of Pylearn2 [11] and Theano [6, 1],
particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Ar-
naud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank
CIFAR and Canada Research Chairs for funding, and Compute Canada and Calcul Québec for
providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in
Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.

References
[1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and
Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised
Feature Learning NIPS 2012 Workshop.
[2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
[3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In
ICML’13.
[4] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable
by backprop. In ICML’14.
[5] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic net-
works trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning
(ICML’14).
[6] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley,
D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the
Python for Scientific Computing Conference (SciPy). Oral Presentation.
[7] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an
RBM-derived process. Neural Computation, 23(8), 2053–2073.
[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.

[9] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks.
In ICML’2013.
[10] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann
machines. In NIPS’2013.
[11] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra,
J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint
arXiv:1308.4214.
[12] Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks.
In ICML’2014.
[13] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS’10).
[14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition.
IEEE Signal Processing Magazine, 29(6), 82–97.
[15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised
neural networks. Science, 268, 1158–1161.
[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving
neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[17] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture
for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153.
IEEE.
[18] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR).
[19] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical
report, University of Toronto.
[20] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional
neural networks. In NIPS’2012.
[21] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[22] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. Technical
report, arXiv preprint arXiv:1402.0030.
[23] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. Technical report, arXiv:1401.4082.
[24] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive
auto-encoders. In ICML’12.
[25] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–
455.
[26] Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation,
4(6), 863–879.
[27] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML
TR 2010-001, U. Toronto.
[28] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014).
Intriguing properties of neural networks. ICLR, abs/1312.6199.
[29] Tu, Z. (2007). Learning generative models via discriminative approaches. In Computer Vision and Pattern
Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE.

Character-level Convolutional Networks for Text
Classification∗

Xiang Zhang Junbo Zhao Yann LeCun


Courant Institute of Mathematical Sciences, New York University
719 Broadway, 12th Floor, New York, NY 10003
{xiang, junbo.zhao, yann}@cs.nyu.edu

Abstract

This article offers an empirical exploration on the use of character-level convolutional
networks (ConvNets) for text classification. We constructed several large-
scale datasets to show that character-level convolutional networks could achieve
state-of-the-art or competitive results. Comparisons are offered against traditional
models such as bag of words, n-grams and their TFIDF variants, and deep learning
models such as word-based ConvNets and recurrent neural networks.

1 Introduction

Text classification is a classic topic for natural language processing, in which one needs to assign
predefined categories to free-text documents. The range of text classification research goes from
designing the best features to choosing the best possible machine learning classifiers. To date,
almost all techniques of text classification are based on words, in which simple statistics of some
ordered word combinations (such as n-grams) usually perform the best [12].
On the other hand, many researchers have found convolutional networks (ConvNets) [17] [18] are
useful in extracting information from raw signals, ranging from computer vision applications to
speech recognition and others. In particular, time-delay networks used in the early days of deep
learning research are essentially convolutional networks that model sequential data [1] [31].
In this article we explore treating text as a kind of raw signal at character level, and applying tem-
poral (one-dimensional) ConvNets to it. For this article we only used a classification task as a way
to exemplify ConvNets’ ability to understand texts. Historically we know that ConvNets usually
require large-scale datasets to work, therefore we also build several of them. An extensive set of
comparisons is offered with traditional models and other deep learning models.
Applying convolutional networks to text classification or natural language processing at large was
explored in literature. It has been shown that ConvNets can be directly applied to distributed [6] [16]
or discrete [13] embedding of words, without any knowledge on the syntactic or semantic structures
of a language. These approaches have been proven to be competitive to traditional models.
There are also related works that use character-level features for language processing. These in-
clude using character-level n-grams with linear classifiers [15], and incorporating character-level
features to ConvNets [28] [29]. In particular, these ConvNet approaches use words as a basis, in
which character-level features extracted at word [28] or word n-gram [29] level form a distributed
representation. Improvements for part-of-speech tagging and information retrieval were observed.
This article is the first to apply ConvNets only on characters. We show that when trained on large-
scale datasets, deep ConvNets do not require the knowledge of words, in addition to the conclusion

∗ An early version of this work entitled “Text Understanding from Scratch” was posted in Feb 2015 as
arXiv:1502.01710. The present paper has considerably more experimental results and a rewritten introduction.

from previous research that ConvNets do not require the knowledge about the syntactic or semantic
structure of a language. This simplification of engineering could be crucial for a single system that
can work for different languages, since characters always constitute a necessary construct regardless
of whether segmentation into words is possible. Working on only characters also has the advantage
that abnormal character combinations such as misspellings and emoticons may be naturally learnt.

2 Character-level Convolutional Networks

In this section, we introduce the design of character-level ConvNets for text classification. The de-
sign is modular, where the gradients are obtained by back-propagation [27] to perform optimization.

2.1 Key Modules

The main component is the temporal convolutional module, which simply computes a 1-D convolution.
Suppose we have a discrete input function $g(x) \in [1, l] \to \mathbb{R}$ and a discrete kernel function
$f(x) \in [1, k] \to \mathbb{R}$. The convolution $h(y) \in [1, \lfloor (l - k + 1)/d \rfloor] \to \mathbb{R}$ between f(x) and g(x) with
stride d is defined as
$$h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c),$$
where c = k − d + 1 is an offset constant. Just as in traditional convolutional networks in vision,
the module is parameterized by a set of such kernel functions $f_{ij}(x)$ (i = 1, 2, . . . , m and j =
1, 2, . . . , n) which we call weights, on a set of inputs $g_i(x)$ and outputs $h_j(y)$. We call each $g_i$ (or
$h_j$) an input (or output) feature, and m (or n) the input (or output) feature size. The output $h_j(y)$ is
obtained by a sum over i of the convolutions between $g_i(x)$ and $f_{ij}(x)$.
One key module that helped us to train deeper models is temporal max-pooling. It is the 1-D version
of the max-pooling module used in computer vision [2]. Given a discrete input function
$g(x) \in [1, l] \to \mathbb{R}$, the max-pooling function $h(y) \in [1, \lfloor (l - k + 1)/d \rfloor] \to \mathbb{R}$ of g(x) is defined as
$$h(y) = \max_{x=1}^{k} g(y \cdot d - x + c),$$
where c = k − d + 1 is an offset constant. This very pooling module enabled us to train ConvNets
deeper than 6 layers, where all others fail. The analysis by [3] might shed some light on this.
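Read literally, these two definitions translate into the following sketch (illustrative NumPy only; the actual models use Torch 7's optimized modules):

```python
# Direct implementation of the paper's temporal convolution and temporal
# max-pooling with stride d and offset constant c = k - d + 1 (1-indexed).
import numpy as np

def temporal_conv(g, f, d=1):
    l, k = len(g), len(f)
    c = k - d + 1
    out = np.empty((l - k + 1) // d)
    for y in range(1, len(out) + 1):
        out[y - 1] = sum(f[x - 1] * g[y * d - x + c - 1] for x in range(1, k + 1))
    return out

def temporal_max_pool(g, k, d):
    l = len(g)
    c = k - d + 1
    out = np.empty((l - k + 1) // d)
    for y in range(1, len(out) + 1):
        out[y - 1] = max(g[y * d - x + c - 1] for x in range(1, k + 1))
    return out
```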
The non-linearity used in our model is the rectifier or thresholding function h(x) = max{0, x},
which makes our convolutional layers similar to rectified linear units (ReLUs) [24]. The algorithm
used is stochastic gradient descent (SGD) with a minibatch of size 128, using momentum [26] [30]
of 0.9 and an initial step size of 0.01, which is halved every 3 epochs, 10 times. Each epoch takes a fixed
number of random training samples uniformly sampled across classes. This number will later be
detailed for each dataset separately. The implementation is done using Torch 7 [4].

2.2 Character quantization

Our models accept a sequence of encoded characters as input. The encoding is done by prescribing
an alphabet of size m for the input language, and then quantizing each character using 1-of-m
(or “one-hot”) encoding. The sequence of characters is then transformed to a sequence of such
m-sized vectors with fixed length l0. Any character exceeding length l0 is ignored, and any characters
that are not in the alphabet, including blank characters, are quantized as all-zero vectors. The character
quantization order is backward, so that the latest reading on characters is always placed near the
beginning of the output, making it easy for fully connected layers to associate weights with the latest reading.
The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10
digits, 33 other characters and the new line character. The non-space characters are:
abcdefghijklmnopqrstuvwxyz0123456789
-,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}
Later we also compare with models that use a different alphabet in which we distinguish between
upper-case and lower-case letters.
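The quantization scheme can be sketched as follows (illustrative; the alphabet string is our approximate transcription of the listing above, and character positions are reversed as described):

```python
# 1-of-m ("one-hot") character quantization with fixed length l0 = 1014,
# backward order, and all-zero vectors for out-of-alphabet characters.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n"
CHAR_TO_IDX = {ch: i for i, ch in enumerate(ALPHABET)}  # m = len(ALPHABET) = 70

def quantize(text, l0=1014):
    out = np.zeros((l0, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text[:l0][::-1]):  # backward quantization order
        idx = CHAR_TO_IDX.get(ch.lower())
        if idx is not None:  # unknown (and blank) characters stay all-zero
            out[pos, idx] = 1.0
    return out
```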

2.3 Model Design

We designed 2 ConvNets – one large and one small. They are both 9 layers deep with 6 convolutional
layers and 3 fully-connected layers. Figure 1 gives an illustration.
[Figure 1 graphic: input text of some length is quantized into character features and passed through
alternating convolution and max-pooling layers, then fully-connected layers.]

Figure 1: Illustration of our model

The input has 70 features due to our character quantization method, and the input feature length is
1014. It seems that 1014 characters could already capture most of the texts of interest. We also insert
2 dropout [10] modules in between the 3 fully-connected layers to regularize; they have dropout
probability 0.5. Table 1 lists the configurations for convolutional layers, and
table 2 lists the configurations for fully-connected (linear) layers.

Table 1: Convolutional layers used in our experiments. The convolutional layers have stride 1 and
pooling layers are all non-overlapping ones, so we omit the description of their strides.

Layer  Feature size (Large)  Feature size (Small)  Kernel  Pool


1 1024 256 7 3
2 1024 256 7 3
3 1024 256 3 N/A
4 1024 256 3 N/A
5 1024 256 3 N/A
6 1024 256 3 3

We initialize the weights using a Gaussian distribution. The mean and standard deviation used for
initializing the large model are (0, 0.02) and for the small model (0, 0.05).

Table 2: Fully-connected layers used in our experiments. The number of output units for the last
layer is determined by the problem. For example, for a 10-class classification problem it will be 10.

Layer  Output units (Large)  Output units (Small)


7 2048 1024
8 2048 1024
9 Depends on the problem

For different problems the input lengths may be different (for example in our case l0 = 1014), and
so are the frame lengths. From our model design, it is easy to see that given input length l0, the
output frame length after the last convolutional layer (but before any of the fully-connected layers)
is l6 = (l0 − 96)/27; for l0 = 1014 this gives l6 = 34. This number multiplied with the frame size at
layer 6 gives the input dimension the first fully-connected layer accepts.
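The arithmetic can be verified layer by layer (an illustrative check; the layer parameters are those of Table 1):

```python
# Frame length through the conv/pool stack: six stride-1 convolutions with
# kernel sizes 7,7,3,3,3,3 and size-3 non-overlapping pooling after layers 1, 2, 6.
def out_len(l0=1014, layers=((7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3))):
    l = l0
    for kernel, pool in layers:
        l = l - kernel + 1  # stride-1 convolution
        if pool:
            l //= pool      # non-overlapping max-pooling
    return l

print(out_len())  # 34 == (1014 - 96) / 27
```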

2.4 Data Augmentation using Thesaurus

Many researchers have found that appropriate data augmentation techniques are useful for control-
ling generalization error for deep learning models. These techniques usually work well when we
could find appropriate invariance properties that the model should possess. In terms of texts, it is not
reasonable to augment the data using signal transformations as done in image or speech recognition,
because the exact order of characters may form rigorous syntactic and semantic meaning. Therefore,

the best way to do data augmentation would have been using human rephrases of sentences, but this
is unrealistic and expensive due to the large volume of samples in our datasets. As a result, the most
natural choice in data augmentation for us is to replace words or phrases with their synonyms.
We experimented with data augmentation by using an English thesaurus, obtained from the
mytheas component used in the LibreOffice1 project. That thesaurus in turn was obtained from Word-
Net [7], where every synonym to a word or phrase is ranked by the semantic closeness to the most
frequently seen meaning. To decide how many words to replace, we extract all replaceable words
from the given text and randomly choose r of them to be replaced. The probability of number r
is determined by a geometric distribution with parameter p in which $P[r] \sim p^r$. The index s of
the synonym chosen given a word is also determined by another geometric distribution in which
$P[s] \sim q^s$. This way, the probability of a synonym being chosen becomes smaller when it moves
distant from the most frequently seen meaning. We will report the results using this new data
augmentation technique with p = 0.5 and q = 0.5.
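A sketch of this augmentation procedure (illustrative; the `thesaurus` mapping of a word to its ranked synonyms is assumed):

```python
# Synonym replacement where the number of replaced words r and the synonym
# rank s follow geometric distributions with P[r] ~ p^r and P[s] ~ q^s.
import random

def sample_geometric(p):
    # Returns r with P[r] = (1 - p) * p^r for r = 0, 1, 2, ...
    r = 0
    while random.random() < p:
        r += 1
    return r

def augment(words, thesaurus, p=0.5, q=0.5):
    out = list(words)
    replaceable = [i for i, w in enumerate(out) if w in thesaurus]
    random.shuffle(replaceable)
    r = min(sample_geometric(p), len(replaceable))
    for i in replaceable[:r]:
        synonyms = thesaurus[out[i]]  # ranked by semantic closeness
        s = min(sample_geometric(q), len(synonyms) - 1)
        out[i] = synonyms[s]  # closer synonyms are chosen with higher probability
    return out
```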

3 Comparison Models

To offer fair comparisons to competitive models, we conducted a series of experiments with both tra-
ditional and deep learning methods. We tried our best to choose models that can provide comparable
and competitive results, and the results are reported faithfully without any model selection.

3.1 Traditional Methods

We refer to traditional methods as those that use a hand-crafted feature extractor and a linear
classifier. The classifier used is multinomial logistic regression in all these models.
Bag-of-words and its TFIDF. For each dataset, the bag-of-words model is constructed by selecting
50,000 most frequent words from the training subset. For the normal bag-of-words, we use the
counts of each word as the features. For the TFIDF (term-frequency inverse-document-frequency)
[14] version, we use the counts as the term-frequency. The inverse document frequency is the
logarithm of the division between total number of samples and number of samples with the word in
the training subset. The features are normalized by dividing the largest feature value.
Bag-of-ngrams and its TFIDF. The bag-of-ngrams models are constructed by selecting the 500,000
most frequent n-grams (up to 5-grams) from the training subset for each dataset. The feature values
are computed the same way as in the bag-of-words model.
Bag-of-means on word embedding. We also have an experimental model that uses k-means on
word2vec [23] learnt from the training subset of each dataset, and then uses these learnt means as
representatives of the clustered words. We take into consideration all the words that appeared more
than 5 times in the training subset. The dimension of the embedding is 300. The bag-of-means
features are computed the same way as in the bag-of-words model. The number of means is 5000.

3.2 Deep Learning Methods

Recently deep learning methods have started to be applied to text classification. We choose two
simple and representative models for comparison, in which one is word-based ConvNet and the
other a simple long-short term memory (LSTM) [11] recurrent neural network model.
Word-based ConvNets. Among the large number of recent works on word-based ConvNets for
text classification, one of the differences is the choice of using pretrained or end-to-end learned word
representations. We offer comparisons with both using the pretrained word2vec [23] embedding [16]
and using lookup tables [5]. The embedding size is 300 in both cases, in the same way as our bag-
of-means model. To ensure fair comparison, the models for each case are of the same size as
our character-level ConvNets, in terms of both the number of layers and each layer’s output size.
Experiments using a thesaurus for data augmentation are also conducted.

1
http://www.libreoffice.org/

Long-short term memory. We also offer a comparison with a recurrent neural network model,
namely long-short term memory (LSTM) [11]. The LSTM model used in our case is word-based,
using pretrained word2vec embedding of size 300 as in previous models. The model is formed by
taking the mean of the outputs of all LSTM cells to form a feature vector, and then using multinomial
logistic regression on this feature vector. The output dimension is 512. The variant of LSTM we
used is the common “vanilla” architecture [8] [9]. We also used gradient clipping [25] in which the
gradient norm is limited to 5. Figure 2 gives an illustration.

[Figure 2 graphic: LSTM cells applied across the word sequence, with their outputs averaged into a
mean feature vector.]
Figure 2: long-short term memory
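A minimal sketch of this comparison model (assuming PyTorch and precomputed word2vec inputs; not the authors' Torch 7 code):

```python
# Mean-pooled "vanilla" LSTM classifier: average the outputs of all LSTM cells,
# then apply multinomial logistic regression. Gradient norms would be clipped to
# 5 during training, e.g. torch.nn.utils.clip_grad_norm_(model.parameters(), 5).
import torch
import torch.nn as nn

class MeanLSTM(nn.Module):
    def __init__(self, n_classes, emb_dim=300, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)  # multinomial logistic regression

    def forward(self, x):                # x: (batch, seq_len, 300) word2vec vectors
        out, _ = self.lstm(x)            # outputs of all LSTM cells
        return self.fc(out.mean(dim=1))  # mean feature vector -> class logits
```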

3.3 Choice of Alphabet

For the alphabet of English, one apparent choice is whether to distinguish between upper-case and
lower-case letters. We report experiments on this choice and observed that it usually (but not always)
gives worse results when such a distinction is made. One possible explanation is that semantics do
not change with letter case, so ignoring case provides a regularization benefit.

4 Large-scale Datasets and Results

Previous research on ConvNets in different areas has shown that they usually work well with large-
scale datasets, especially when the model takes in low-level raw features like characters in our
case. However, most open datasets for text classification are quite small, and large-scale datasets are
split with a significantly smaller training set than testing set [21]. Therefore, instead of confusing our
community more by using them, we built several large-scale datasets for our experiments, ranging
from hundreds of thousands to several millions of samples. Table 3 is a summary.

Table 3: Statistics of our large-scale datasets. Epoch size is the number of minibatches in one epoch

Dataset Classes Train Samples Test Samples Epoch Size


AG’s News 4 120,000 7,600 5,000
Sogou News 5 450,000 60,000 5,000
DBPedia 14 560,000 70,000 5,000
Yelp Review Polarity 2 560,000 38,000 5,000
Yelp Review Full 5 650,000 50,000 5,000
Yahoo! Answers 10 1,400,000 60,000 10,000
Amazon Review Full 5 3,000,000 650,000 30,000
Amazon Review Polarity 2 3,600,000 400,000 30,000

AG’s news corpus. We obtained AG’s corpus of news articles on the web2. It contains 496,835
categorized news articles from more than 2000 news sources. We chose the 4 largest classes from
this corpus to construct our dataset, using only the title and description fields. The number of training
samples for each class is 30,000 and testing 1,900.
Sogou news corpus. This dataset is a combination of the SogouCA and SogouCS news corpora [32],
containing in total 2,909,551 news articles in various topic channels. We then labeled each piece
of news using its URL, by manually classifying their domain names. This gives us a large
corpus of news articles labeled with their categories. There are a large number of categories, but most
of them contain only a few articles. We chose 5 categories – “sports”, “finance”, “entertainment”,
“automobile” and “technology”. The number of training samples selected for each class is 90,000
and testing 12,000. Although this is a dataset in Chinese, we used the pypinyin package combined
with the jieba Chinese segmentation system to produce Pinyin – a phonetic romanization of Chinese.
The models for English can then be applied to this dataset without change. The fields used are title
and content.
2 http://www.di.unipi.it/˜gulli/AG_corpus_of_news_articles.html

Table 4: Testing errors of all the models. Numbers are in percentage. “Lg” stands for “large” and
“Sm” stands for “small”. “w2v” is an abbreviation for “word2vec”, and “Lk” for “lookup table”.
“Th” stands for thesaurus. ConvNets labeled “Full” are those that distinguish between lower and
upper letters

Model AG Sogou DBP. Yelp P. Yelp F. Yah. A. Amz. F. Amz. P.


BoW 11.19 7.15 3.39 7.76 42.01 31.11 45.36 9.60
BoW TFIDF 10.36 6.55 2.63 6.34 40.14 28.96 44.74 9.00
ngrams 7.96 2.92 1.37 4.36 43.74 31.53 45.73 7.98
ngrams TFIDF 7.64 2.81 1.31 4.56 45.20 31.49 47.56 8.46
Bag-of-means 16.91 10.79 9.55 12.67 47.46 39.45 55.87 18.39
LSTM 13.94 4.82 1.45 5.26 41.83 29.16 40.57 6.10
Lg. w2v Conv. 9.92 4.39 1.42 4.60 40.16 31.97 44.40 5.88
Sm. w2v Conv. 11.35 4.54 1.71 5.56 42.13 31.50 42.59 6.00
Lg. w2v Conv. Th. 9.91 - 1.37 4.63 39.58 31.23 43.75 5.80
Sm. w2v Conv. Th. 10.88 - 1.53 5.36 41.09 29.86 42.50 5.63
Lg. Lk. Conv. 8.55 4.95 1.72 4.89 40.52 29.06 45.95 5.84
Sm. Lk. Conv. 10.87 4.93 1.85 5.54 41.41 30.02 43.66 5.85
Lg. Lk. Conv. Th. 8.93 - 1.58 5.03 40.52 28.84 42.39 5.52
Sm. Lk. Conv. Th. 9.12 - 1.77 5.37 41.17 28.92 43.19 5.51
Lg. Full Conv. 9.85 8.80 1.66 5.25 38.40 29.90 40.89 5.78
Sm. Full Conv. 11.59 8.95 1.89 5.67 38.82 30.01 40.88 5.78
Lg. Full Conv. Th. 9.51 - 1.55 4.88 38.04 29.58 40.54 5.51
Sm. Full Conv. Th. 10.89 - 1.69 5.42 37.95 29.90 40.53 5.66
Lg. Conv. 12.82 4.88 1.73 5.89 39.62 29.55 41.31 5.51
Sm. Conv. 15.65 8.65 1.98 6.53 40.84 29.84 40.53 5.50
Lg. Conv. Th. 13.39 - 1.60 5.82 39.30 28.80 40.45 4.93
Sm. Conv. Th. 14.80 - 1.85 6.49 40.16 29.84 40.43 5.67

DBPedia ontology dataset. DBpedia is a crowd-sourced community effort to extract structured
information from Wikipedia [19]. The DBpedia ontology dataset is constructed by picking 14 non-
overlapping classes from DBpedia 2014. From each of these 14 ontology classes, we randomly
choose 40,000 training samples and 5,000 testing samples. The fields we used for this dataset
contain title and abstract of each Wikipedia article.
Yelp reviews. The Yelp reviews dataset is obtained from the Yelp Dataset Challenge in 2015. This
dataset contains 1,569,264 samples that have review texts. Two classification tasks are constructed
from this dataset – one predicting full number of stars the user has given, and the other predict-
ing a polarity label by considering stars 1 and 2 negative, and 3 and 4 positive. The full dataset
has 130,000 training samples and 10,000 testing samples in each star, and the polarity dataset has
280,000 training samples and 19,000 test samples in each polarity.
Yahoo! Answers dataset. We obtained Yahoo! Answers Comprehensive Questions and Answers
version 1.0 dataset through the Yahoo! Webscope program. The corpus contains 4,483,032 questions
and their answers. We constructed a topic classification dataset from this corpus using 10 largest
main categories. Each class contains 140,000 training samples and 5,000 testing samples. The fields
we used include question title, question content and best answer.
Amazon reviews. We obtained an Amazon review dataset from the Stanford Network Analysis
Project (SNAP), which spans 18 years with 34,686,770 reviews from 6,643,669 users on 2,441,053
products [22]. Similarly to the Yelp review dataset, we also constructed 2 datasets – one full score
prediction and another polarity prediction. The full dataset contains 600,000 training samples and
130,000 testing samples in each class, whereas the polarity dataset contains 1,800,000 training sam-
ples and 200,000 testing samples in each polarity sentiment. The fields used are review title and
review content.
Table 4 lists all the testing errors we obtained from these datasets for all the applicable models. Note
that since we do not have a Chinese thesaurus, the Sogou News dataset does not have any results
using thesaurus augmentation. We labeled the best result in blue and the worst result in red.

5 Discussion

[Figure 3 graphic: six bar charts of relative errors, one per comparison model, with one bar per
dataset: AG News, DBPedia, Yelp P., Yelp F., Yahoo A., Amazon F., Amazon P. Panels: (a) Bag-of-means,
(b) n-grams TFIDF, (c) LSTM, (d) word2vec ConvNet, (e) Lookup table ConvNet, (f) Full alphabet ConvNet.]

Figure 3: Relative errors with comparison models

To understand the results in table 4 further, we offer some empirical analysis in this section. To
facilitate our analysis, we present the relative errors in figure 3 with respect to comparison models.
Each of these plots is computed by taking the difference between the error of the comparison model
and that of our character-level ConvNet model, then dividing by the comparison model error. All
ConvNets in the figure are the large models with thesaurus augmentation.
Character-level ConvNet is an effective method. The most important conclusion from our experi-
ments is that character-level ConvNets could work for text classification without the need for words.
This is a strong indication that language could also be thought of as a signal no different from
any other kind. Figure 4 shows 12 random first-layer patches learnt by one of our character-level
ConvNets for the DBPedia dataset.

Figure 4: First layer weights. For each patch, height is the kernel size and width the alphabet size

Dataset size forms a dichotomy between traditional and ConvNets models. The most obvious
trend coming from all the plots in figure 3 is that the larger the dataset, the better our character-level
ConvNets perform relative to the comparison models. Traditional methods like n-grams TFIDF
remain strong candidates for datasets of size up to several hundred thousand samples, and only when
the dataset goes to the scale of several millions do we observe that character-level ConvNets start to do better.
ConvNets may work well for user-generated data. User-generated data vary in the degree of how
well the texts are curated. For example, in our million scale datasets, Amazon reviews tend to be
raw user-inputs, whereas users might be extra careful in their writings on Yahoo! Answers. Plots
comparing word-based deep models (figures 3c, 3d and 3e) show that character-level ConvNets work
better for less curated user-generated texts. This property suggests that ConvNets may have better
applicability to real-world scenarios. However, further analysis is needed to validate the hypothesis
that ConvNets are truly good at identifying exotic character combinations such as misspellings and
emoticons, as our experiments alone do not show any explicit evidence.
Choice of alphabet makes a difference. Figure 3f shows that changing the alphabet by distinguish-
ing between uppercase and lowercase letters could make a difference. For million-scale datasets, it
seems that not making such distinction usually works better. One possible explanation is that there
is a regularization effect, but this is to be validated.

Semantics of tasks may not matter. Our datasets consist of two kinds of tasks: sentiment analysis
(Yelp and Amazon reviews) and topic classification (all others). This dichotomy in task semantics
does not seem to play a role in deciding which method is better.
Bag-of-means is a misuse of word2vec [20]. One of the most obvious facts one could observe
from table 4 and figure 3a is that the bag-of-means model performs worse in every case. Compared
with traditional models, this suggests such a simple use of a distributed word representation may not
give us an advantage for text classification. However, our experiments do not speak for any other
language processing tasks or other uses of word2vec.
There is no free lunch. Our experiments once again verify that there is not a single machine
learning model that can work for all kinds of datasets. The factors discussed in this section could all
play a role in deciding which method is best for a specific application.

6 Conclusion and Outlook

This article offers an empirical study on character-level convolutional networks for text classifica-
tion. We compared with a large number of traditional and deep learning models using several large-
scale datasets. On one hand, analysis shows that character-level ConvNet is an effective method.
On the other hand, how well our model performs in comparisons depends on many factors, such as
dataset size, whether the texts are curated and choice of alphabet.
In the future, we hope to apply character-level ConvNets for a broader range of language processing
tasks especially when structured outputs are needed.

Acknowledgement

We gratefully acknowledge the support of NVIDIA Corporation with the donation of 2 Tesla K40
GPUs used for this research. We gratefully acknowledge the support of Amazon.com Inc for an
AWS in Education Research grant used for this research.

References
[1] L. Bottou, F. Fogelman Soulié, P. Blanchet, and J. Lienard. Experiments with time delay networks and
dynamic time warping for speaker independent isolated digit recognition. In Proceedings of EuroSpeech
89, volume 2, pages 537–540, Paris, France, 1989.
[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559–2566. IEEE, 2010.
[3] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 111–118,
2010.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning.
In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language process-
ing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
[6] C. dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tech-
nical Papers, pages 69–78, Dublin, Ireland, August 2014. Dublin City University and Association for
Computational Linguistics.
[7] C. Fellbaum. Wordnet and wordnets. In K. Brown, editor, Encyclopedia of Language and Linguistics,
pages 665–670, Oxford, 2005. Elsevier.
[8] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural
network architectures. Neural Networks, 18(5):602–610, 2005.
[9] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space
odyssey. CoRR, abs/1503.04069, 2015.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov.
1997.
[12] T. Joachims. Text categorization with suport vector machines: Learning with many relevant features. In
Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer-Verlag,
1998.
[13] R. Johnson and T. Zhang. Effective use of word order for text categorization with convolutional neural
networks. CoRR, abs/1412.1058, 2014.
[14] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of
Documentation, 28(1):11–21, 1972.
[15] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos. Words versus character n-grams for anti-spam
filtering. International Journal on Artificial Intelligence Tools, 16(06):1047–1067, 2007.
[16] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Confer-
ence on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar,
October 2014. Association for Computational Linguistics.
[17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Back-
propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Winter
1989.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[19] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey,
P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from
wikipedia. Semantic Web Journal, 2014.
[20] G. Lev, B. Klein, and L. Wolf. In defense of word embedding for generic text representation. In C. Biemann, S. Handschuh, A. Freitas, F. Meziane, and E. Métais, editors, Natural Language Processing and Information Systems, volume 9103 of Lecture Notes in Computer Science, pages 35–50. Springer International Publishing, 2015.
[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization
research. The Journal of Machine Learning Research, 5:361–397, 2004.
[22] J. McAuley and J. Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with
review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, pages
165–172, New York, NY, USA, 2013. ACM.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and
phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Wein-
berger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013.
[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings
of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML
2013, volume 28 of JMLR Proceedings, pages 1310–1318. JMLR.org, 2013.
[26] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[27] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[28] C. D. Santos and B. Zadrozny. Learning character-level representations for part-of-speech tagging. In
Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826,
2014.
[29] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling
structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Confer-
ence on Information and Knowledge Management, pages 101–110. ACM, 2014.
[30] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum
in deep learning. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Confer-
ence on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference
Proceedings, May 2013.
[31] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay
neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(3):328–339, 1989.
[32] C. Wang, M. Zhang, S. Ma, and L. Ru. Automatic online news issue construction in web environment. In
Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 457–466, New
York, NY, USA, 2008. ACM.

review articles

doi:10.1145/2347736.2347755

Tapping into the "folk knowledge" needed to advance machine learning applications.

by Pedro Domingos

A Few Useful Things to Know About Machine Learning

Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation.15 Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell16 and Witten et al.24). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.

key insights

• Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled.

• Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of "black art" that is difficult to find in textbooks.

• This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.


Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into "spam" or "not spam," and its input may be a Boolean vector x = (x_1, …, x_j, …, x_d), where x_j = 1 if the jth word in the dictionary appears in the email and x_j = 0 otherwise. A learner inputs a training set of examples (x_i, y_i), where x_i = (x_{i,1}, …, x_{i,d}) is an observed input and y_i is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output y_t for future examples x_t (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam).

Learning = Representation + Evaluation + Optimization

Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components are:

• Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, which I address later, is how to represent the input, in other words, what features to use.

• Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization and due to the issues I will discuss.

• Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.
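To make the three components concrete, here is a small, self-contained Python sketch of our own (none of it is code from the article): the representation is a hyperplane (a weight vector), the evaluation function is training accuracy, and the optimizer is the perceptron's mistake-driven update.

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def accuracy(w, data):                      # evaluation component
    return sum((dot(w, x) > 0) == y for x, y in data) / len(data)

def perceptron(data, epochs=20):            # optimization component
    w = [0.0] * len(data[0][0])             # representation: a hyperplane
    for _ in range(epochs):
        for x, y in data:
            if (dot(w, x) > 0) != y:        # update only on mistakes
                sign = 1.0 if y else -1.0
                w = [wi + sign * xi for wi, xi in zip(w, x)]
    return w

data = [((1.0, 0.0), True), ((0.0, 1.0), False)]   # tiny illustrative set
w = perceptron(data)
print(accuracy(w, data))                    # 1.0 on this separable toy set

Swapping any one component, say the hyperplane for a decision tree, or the perceptron rule for gradient descent on a smooth loss, yields a different learner while the other two components stay fixed.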
The accompanying table shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves.

Table 1. The three components of learning algorithms.

Representation                Evaluation              Optimization
Instances                     Accuracy/Error rate     Combinatorial optimization
  K-nearest neighbor          Precision and recall      Greedy search
  Support vector machines     Squared error             Beam search
Hyperplanes                   Likelihood                Branch-and-bound
  Naive Bayes                 Posterior probability   Continuous optimization
  Logistic regression         Information gain          Unconstrained
Decision trees                K-L divergence              Gradient descent
Sets of rules                 Cost/Utility                Conjugate gradient
  Propositional rules         Margin                      Quasi-Newton methods
  Logic programs                                        Constrained
Neural networks                                           Linear programming
Graphical models                                          Quadratic programming
  Bayesian networks
  Conditional random fields

Algorithm 1 (below) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search.20 InfoGain(xj, y) is the mutual information between feature xj and the class y. MakeNode(x, c0, c1) returns a node that tests feature x and has c0 as the child for x = 0 and c1 as the child for x = 1.

Algorithm 1. Decision tree induction.

LearnDT(TrainSet)
    if all examples in TrainSet have the same class y* then
        return MakeLeaf(y*)
    if no feature xj has InfoGain(xj, y) > 0 then
        y* ← Most frequent class in TrainSet
        return MakeLeaf(y*)
    x* ← argmax over xj of InfoGain(xj, y)
    TS0 ← Examples in TrainSet with x* = 0
    TS1 ← Examples in TrainSet with x* = 1
    return MakeNode(x*, LearnDT(TS0), LearnDT(TS1))
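For readers who prefer running code, the following is one possible Python transcription of Algorithm 1; the InfoGain estimate and all naming choices are ours, not the article's.

import math
from collections import Counter

def info_gain(examples, j):
    """Mutual information between Boolean feature j and the class,
    estimated from `examples`, a list of (x, y) with x a tuple of 0/1."""
    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    split = {0: [y for x, y in examples if x[j] == 0],
             1: [y for x, y in examples if x[j] == 1]}
    cond = sum(len(part) / len(examples) * entropy(part)
               for part in split.values() if part)
    return entropy([y for _, y in examples]) - cond

def learn_dt(examples):
    """Algorithm 1: a leaf is a class label, an internal node is a
    (feature, subtree_for_0, subtree_for_1) triple."""
    classes = [y for _, y in examples]
    if len(set(classes)) == 1:
        return classes[0]                       # MakeLeaf(y*)
    gains = {j: info_gain(examples, j) for j in range(len(examples[0][0]))}
    if max(gains.values()) <= 0:
        return Counter(classes).most_common(1)[0][0]
    j = max(gains, key=gains.get)               # x* <- argmax InfoGain
    ts0 = [(x, y) for x, y in examples if x[j] == 0]
    ts1 = [(x, y) for x, y in examples if x[j] == 1]
    return (j, learn_dt(ts0), learn_dt(ts1))    # MakeNode(x*, ., .)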
Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner.

It's Generalization that Counts

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has 2^100,000 possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into (say) 10 subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.
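A minimal sketch of the cross-validation procedure just described, assuming a generic `learn` function that maps a list of (x, y) pairs to a predictor and at least k examples (the interface and names are ours):

import random

def cross_val_accuracy(data, learn, k=10):
    """Shuffle, split into k subsets, train on k-1 of them, test on the
    held-out one, and average the k test accuracies."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = learn(train)
        scores.append(sum(model(x) == y for x, y in held_out) / len(held_out))
    return sum(scores) / k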
In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (for example, hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (for example, decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.

Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we do not have access to the function we want to optimize! We have to use training error as a surrogate for test error, and this is fraught with danger. (How to deal with it is addressed later.) On the positive side, since the objective function is only a proxy for the true goal, we may not need to fully optimize it; in fact, a local optimum returned by simple greedy search may be better than the global optimum.


Data Alone Is Not Enough

Generalization being the goal has another major consequence: Data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 − 10^6 examples whose classes you do not know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This notion was formalized by Wolpert in his famous "no free lunch" theorems, according to which no learner can beat random guessing over all possible functions to be learned.25

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions—like smoothness, similar examples having similar classes, limited dependences, or limited complexity—are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, "IF . . . THEN . . ." rules may be the best option. The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (for example, using first-order logic21 or grammars6).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it cannot get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

Overfitting Has Many Faces

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance.9 Bias is a learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same.

Figure 1. Bias and variance in dart-throwing. (Dartboard panels pair low/high bias with low/high variance.)

Figure 2. Naïve Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules. (Axes: test-set accuracy (%), 50 to 80, versus number of examples, 10 to 10,000; curves: Bayes and C4.5.)
Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this.a Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes's false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

(a) Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of "IF . . . THEN . . ." rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani10 for details.

Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit.17

Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique "solves" the overfitting problem. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled "true" in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.

The problem of multiple testing13 is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be. For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck. This problem can be combatted by correcting the significance tests to take the number of hypotheses into account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate.3

Intuition Fails in High Dimensions

After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about 10^−18 of the input space. This is what makes machine learning both necessary and hard.

More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just x_1 ∧ x_2. If there are no other features, this is an easy problem. But if there are 98 irrelevant features x_3, …, x_100, the noise from them completely swamps the signal in x_1 and x_2, and nearest neighbor effectively makes random predictions.

Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x_t. If the grid is d-dimensional, x_t's 2d nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x_t, until the choice of nearest neighbor (and therefore of class) is effectively random.
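This concentration of distances is easy to observe numerically. The following sketch (our own illustration, with arbitrary sample sizes) measures how the ratio of the farthest to the nearest distance from a random query to uniform random points shrinks toward 1 as the dimension grows:

import math, random

def nn_distance_spread(d, n=1000):
    """Ratio of farthest to nearest Euclidean distance from a random
    query to n uniform points in [0,1]^d; it approaches 1 as d grows,
    which is the 'all examples look alike' effect described above."""
    q = [random.random() for _ in range(d)]
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    ds = sorted(dist([random.random() for _ in range(d)]) for _ in range(n))
    return ds[-1] / ds[0]

for d in (2, 10, 100, 1000):
    print(d, round(nn_distance_spread(d), 2))  # spread shrinks toward 1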
This is only one instance of a more general problem with high dimensions: our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant "shell" around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection.


(It has even been said that if people could see in high dimensions machine learning would not be necessary.) But in high dimensions it is difficult to understand what is happening. This in turn makes it difficult to design a good classifier. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact their benefits may be outweighed by the curse of dimensionality.

Fortunately, there is an effect that partly counteracts the curse, which might be called the "blessing of non-uniformity." In most applications examples are not spread uniformly throughout the instance space, but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms for explicitly reducing the dimensionality can be used (for example, Tenenbaum22).

Theoretical Guarantees Are Not What They Seem

Machine learning papers are full of theoretical guarantees. The most common type is a bound on the number of examples needed to ensure good generalization. What should you make of these guarantees? First of all, it is remarkable that they are even possible. Induction is traditionally contrasted with deduction: in deduction you can guarantee that the conclusions are correct; in induction all bets are off. Or such was the conventional wisdom for many centuries. One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.

The basic argument is remarkably simple.5 Let's say a classifier is bad if its true error rate is greater than ε. Then the probability that a bad classifier is consistent with n random, independent training examples is less than (1 − ε)^n. Let b be the number of bad classifiers in the learner's hypothesis space H. The probability that at least one of them is consistent is less than b(1 − ε)^n, by the union bound. Assuming the learner always returns a consistent classifier, the probability that this classifier is bad is then less than |H|(1 − ε)^n, where we have used the fact that b ≤ |H|. So if we want this probability to be less than δ, it suffices to make n > ln(δ/|H|)/ln(1 − ε) ≥ (1/ε)(ln |H| + ln(1/δ)).

Unfortunately, guarantees of this type have to be taken with a large grain of salt. This is because the bounds obtained in this way are usually extremely loose. The wonderful feature of the bound above is that the required number of examples only grows logarithmically with |H| and 1/δ. Unfortunately, most interesting hypothesis spaces are doubly exponential in the number of features d, which still leaves us needing a number of examples exponential in d. For example, consider the space of Boolean functions of d Boolean variables. If there are e possible different examples, there are 2^e possible different functions, so since there are 2^d possible examples, the total number of functions is 2^(2^d). And even for hypothesis spaces that are "merely" exponential, the bound is still very loose, because the union bound is very pessimistic. For example, if there are 100 Boolean features and the hypothesis space is decision trees with up to 10 levels, to guarantee δ = ε = 1% in the bound above we need half a million examples. But in practice a small fraction of this suffices for accurate learning.
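Plugging numbers into the bound makes this tangible. A back-of-the-envelope calculator (our own code; the d = 20 example is hypothetical):

import math

def pac_sample_bound(ln_h, eps, delta):
    """Examples needed by the bound above: n > (1/eps)(ln|H| + ln(1/delta)).
    ln|H| is passed directly so that astronomically large hypothesis
    spaces stay representable as ordinary floats."""
    return (ln_h + math.log(1.0 / delta)) / eps

# Hypothetical example: all Boolean functions of d = 20 variables, so
# |H| = 2^(2^20) and ln|H| = 2^20 * ln 2.
d = 20
ln_h = (2 ** d) * math.log(2)
print(f"{pac_sample_bound(ln_h, eps=0.01, delta=0.01):.3g}")  # ~7.27e+07

Even at a modest 20 features, the bound already demands tens of millions of examples, and the requirement doubles with every additional feature.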
Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner returned a hypothesis consistent with a particular training set, then this hypothesis probably generalizes well. What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis. The bound also says nothing about how to select a good hypothesis space. It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size.
If we shrink the hypothesis space, the bound improves, but the chances that it contains the true classifier shrink also. (There are bounds for the case where the true classifier is not in the hypothesis space, but similar considerations apply to them.)

Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier. This is reassuring, but it would be rash to choose one learner over another because of its asymptotic guarantees. In practice, we are seldom in the asymptotic regime (also known as "asymptopia"). And, because of the bias-variance trade-off I discussed earlier, if learner A is better than learner B given infinite data, B is often better than A given finite data.

The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design. In this capacity, they are quite useful; indeed, the close interplay of theory and practice is one of the main reasons machine learning has made so much progress over the years. But caveat emptor: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter.

Feature Engineering Is The Key

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and "black art" are as important as the technical stuff.

First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and preprocess it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that is because we have already mastered it pretty well! Feature engineering is more difficult because it is domain-specific, while learners can be largely general purpose. However, there is no sharp frontier between the two, and this is another reason the most useful learners are those that facilitate incorporating knowledge.

Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process. One way this is often done today is by automatically generating large numbers of candidate features and selecting the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering.
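The XOR point can be checked directly. In this sketch (our own code), each feature has zero mutual information with the class on its own, while the feature pair carries a full bit:

import math
from collections import Counter
from itertools import product

def mutual_info(pairs):
    """I(X;Y) in bits, estimated from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# The class is the XOR of two features; enumerate the uniform distribution.
data = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
print(mutual_info([(x1, y) for x1, _, y in data]))         # 0.0: useless alone
print(mutual_info([((x1, x2), y) for x1, x2, y in data]))  # 1.0: 1 bit jointly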
More Data Beats a Cleverer Algorithm

Suppose you have constructed the best set of features you can, but the classifiers you receive are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is


often to just get more data. As a rule of thumb, a dumb algorithm with lots and lots of data beats a clever one with modest amounts of it. (After all, machine learning is all about letting data do the heavy lifting.)

This does bring up another problem, however: scalability. In most of computer science, the two main limited resources are time and memory. In machine learning, there is a third one: training data. Which one is the bottleneck has changed from decade to decade. In the 1980s it tended to be data. Today it is often time. Enormous mountains of data are available, but there is not enough time to process it, so it goes unused. This leads to a paradox: even though in principle more data means that more complex classifiers can be learned, in practice simpler classifiers wind up being used, because complex ones take too long to learn. Part of the answer is to come up with fast ways to learn complex classifiers, and indeed there has been remarkable progress in this direction (for example, Hulten and Domingos11).

Part of the reason using cleverer algorithms has a smaller payoff than you might expect is that, to a first approximation, they all do the same. This is surprising when you consider representations as different as, say, sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of "nearby." With nonuniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful learners can be unstable but still accurate. Figure 3 illustrates this in 2D; the effect is much stronger in high dimensions.

Figure 3. Very different frontiers can yield similar predictions. (+ and – are training examples of two classes; the panel overlays the frontiers of N. Bayes, kNN, SVM, and D. Tree.)

As a rule, it pays to try the simplest learners first (for example, naïve Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learners are seductive, but they are usually harder to use, because they have more knobs you need to turn to get good results, and because their internals are more opaque.

Learners can be divided into two major types: those whose representation has a fixed size, like linear classifiers, and those whose representation can grow with the data, like decision trees. (The latter are sometimes called nonparametric learners, but this is somewhat unfortunate, since they usually wind up learning many more parameters than parametric ones.) Fixed-size learners can only take advantage of so much data. (Notice how the accuracy of naive Bayes asymptotes at around 70% in Figure 2.) Variable-size learners can in principle learn any function given sufficient data, but in practice they may not, because of limitations of the algorithm (for example, greedy search falls into local optima) or computational cost. Also, because of the curse of dimensionality, no existing amount of data may be enough. For these reasons, clever algorithms—those that make the most of the data and computing resources available—often pay off in the end, provided you are willing to put in the effort. There is no sharp frontier between designing learners and learning classifiers; rather, any given piece of knowledge could be encoded in the learner or learned from data. So machine learning projects often wind up having a significant component of learner design, and practitioners need to have some expertise in it.12

In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. In research papers, learners are typically compared on measures of accuracy and computational cost. But human effort saved and insight gained, although harder to measure, are often more important. This favors learners that produce human-understandable output (for example, rule sets). And the organizations that make the most of machine learning are those that have in place an infrastructure that makes experimenting with many different learners, data sources, and learning problems easy and efficient, and where there is a close collaboration between machine learning experts and application domain ones.

Learn Many Models, Not Just One

In the early days of machine learning, everyone had a favorite learner, together with some a priori reasons to believe in its superiority. Most effort went into trying many variations of it and selecting the best one. Then systematic empirical comparisons showed that the best learner varies from application to application, and systems containing many different learners started to appear. Effort now went into trying many variations of many learners, and still selecting just the best one. But then researchers noticed that, if instead of selecting the best variation found, we combine many variations, the results are better—often much better—and at little extra effort for the user.

Creating such model ensembles is now standard.1 In the simplest technique, called bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of individual classifiers become the inputs of a "higher-level" learner that figures out how best to combine them.
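A minimal sketch of bagging as just described, assuming a generic base learner (the interface and names are ours):

import random
from collections import Counter

def bag(train, learn, n_models=25):
    """Learn each base model on a bootstrap resample of the training set;
    predict by majority vote. `learn` maps a list of (x, y) pairs to a
    predictor x -> y, so any base learner plugs in unchanged."""
    models = [learn([random.choice(train) for _ in range(len(train))])
              for _ in range(n_models)]
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict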
Many other techniques exist, and the trend is toward larger and larger ensembles. In the Netflix prize, teams from all over the world competed to build the best video recommender


system (http://netflixprize.com). As the competition progressed, teams found they obtained the best results by combining their learners with other teams', and merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over 100 learners, and combining the two ensembles further improved the results. Doubtless we will see even larger ones in the future.

Model ensembles should not be confused with Bayesian model averaging (BMA)—the theoretically optimal approach to learning.4 In BMA, predictions on new examples are made by averaging the individual predictions of all classifiers in the hypothesis space, weighted by how well the classifiers explain the training data and how much we believe in them a priori. Despite their superficial similarities, ensembles and BMA are very different. Ensembles change the hypothesis space (for example, from single decision trees to linear combinations of them), and can take a wide variety of forms. BMA assigns weights to the hypotheses in the original space according to a fixed formula. BMA weights are extremely different from those produced by (say) bagging or boosting: the latter are fairly even, while the former are extremely skewed, to the point where the single highest-weight classifier usually dominates, making BMA effectively equivalent to just selecting it.8 A practical consequence of this is that, while model ensembles are a key part of the machine learning toolkit, BMA is seldom worth the trouble.

Simplicity Does Not Imply Accuracy

Occam's razor famously states that entities should not be multiplied beyond necessity. In machine learning, this is often taken to mean that, given two classifiers with the same training error, the simpler of the two will likely have the lowest test error. Purported proofs of this claim appear regularly in the literature, but in fact there are many counterexamples to it, and the "no free lunch" theorems imply it cannot be true.

We saw one counterexample previously: model ensembles. The generalization error of a boosted ensemble continues to improve by adding classifiers even after the training error has reached zero. Another counterexample is support vector machines, which can effectively have an infinite number of parameters without overfitting. Conversely, the function sign(sin(ax)) can discriminate an arbitrarily large, arbitrarily labeled set of points on the x axis, even though it has only one parameter.23 Thus, contrary to intuition, there is no necessary connection between the number of parameters of a model and its tendency to overfit.
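The sign(sin(ax)) counterexample can be made concrete with a standard construction (our own code; the specific points and labels are arbitrary): for points x_i = 10^−i, a single parameter a reproduces any labeling.

import math

def fit_a(labels):  # labels[i] in {+1, -1} labels the point x = 10^-(i+1)
    """a = pi * (1 + sum of 10^i over the negatively labeled i) makes
    sign(sin(a * x_i)) match the requested labeling for x_i = 10^-i."""
    return math.pi * (1 + sum(10 ** (i + 1)
                              for i, y in enumerate(labels) if y == -1))

labels = [1, -1, -1, 1, -1]
a = fit_a(labels)
points = [10 ** -(i + 1) for i in range(len(labels))]
print([1 if math.sin(a * x) > 0 else -1 for x in points])  # matches labels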
A more sophisticated view instead equates complexity with the size of the hypothesis space, on the basis that smaller spaces allow hypotheses to be represented by shorter codes. Bounds like the one in the section on theoretical guarantees might then be viewed as implying that shorter hypotheses generalize better. This can be further refined by assigning shorter codes to the hypotheses in the space we have some a priori preference for. But viewing this as "proof" of a trade-off between accuracy and simplicity is circular reasoning: we made the hypotheses we prefer simpler by design, and if they are accurate it is because our preferences are accurate, not because the hypotheses are "simple" in the representation we chose.

A further complication arises from the fact that few learners search their hypothesis space exhaustively. A learner with a larger hypothesis space that tries fewer hypotheses from it is less likely to overfit than one that tries more hypotheses from a smaller space. As Pearl18 points out, the size of the hypothesis space is only a rough guide to what really matters for relating training and test error: the procedure by which a hypothesis is chosen. Domingos7 surveys the main arguments and evidence on the issue of Occam's razor in machine learning. The conclusion is that simpler hypotheses should be preferred because simplicity is a virtue in its own right, not because of a hypothetical connection with accuracy. This is probably what Occam meant in the first place.

Representable Does Not Imply Learnable

Essentially all representations used in variable-size learners have associated


theorems of the form "Every function can be represented, or approximated arbitrarily closely, using this representation." Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not "Can it be represented?" to which the answer is often trivial, but "Can it be learned?" And it pays to try different learners (and possibly combine them).

Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning.2

Correlation Does Not Imply Causation

The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn't this wrong? If so, then why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted.19 On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).

Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them "causal," we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so.14

Conclusion

Like any discipline, machine learning has a lot of "folk wisdom" that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka.24 Happy learning!

References
1. Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36 (1999), 105–142.
2. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
3. Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57 (1995), 289–300.
4. Bernardo, J.M. and Smith, A.F.M. Bayesian Theory. Wiley, NY, 1994.
5. Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K. Occam's razor. Information Processing Letters 24 (1987), 377–380.
6. Cohen, W.W. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence 68 (1994), 303–366.
7. Domingos, P. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 (1999), 409–425.
8. Domingos, P. Bayesian averaging of classifiers and the overfitting problem. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 223–230.
9. Domingos, P. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 231–238.
10. Domingos, P. and Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (1997), 103–130.
11. Hulten, G. and Domingos, P. Mining complex models from arbitrarily large databases in constant time. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Canada, 2002). ACM Press, NY, 525–531.
12. Kibler, D. and Langley, P. Machine learning as an experimental science. In Proceedings of the 3rd European Working Session on Learning (London, UK, 1988). Pitman.
13. Klockars, A.J. and Sax, G. Multiple Comparisons. Sage, Beverly Hills, CA, 1986.
14. Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
15. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A. Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011.
16. Mitchell, T.M. Machine Learning. McGraw-Hill, NY, 1997.
17. Ng, A.Y. Preventing "overfitting" of cross-validation data. In Proceedings of the 14th International Conference on Machine Learning (Nashville, TN, 1997). Morgan Kaufmann, San Mateo, CA, 245–253.
18. Pearl, J. On the connection between the complexity and credibility of inferred models. International Journal of General Systems 4 (1978), 255–264.
19. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
20. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
21. Richardson, M. and Domingos, P. Markov logic networks. Machine Learning 62 (2006), 107–136.
22. Tenenbaum, J., Silva, V. and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.
23. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, NY, 1995.
24. Witten, I., Frank, E. and Hall, M. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann, San Mateo, CA, 2011.
25. Wolpert, D. The lack of a priori distinctions between learning algorithms. Neural Computation 8 (1996), 1341–1390.

Pedro Domingos (pedrod@cs.washington.edu) is a professor in the Department of Computer Science and Engineering at the University of Washington, Seattle.

© 2012 ACM 0001-0782/12/10 $15.00


Conditional Random Fields as Recurrent Neural Networks

Shuai Zheng∗1, Sadeep Jayasumana∗1, Bernardino Romera-Paredes1, Vibhav Vineet†1,2, Zhizhong Su3, Dalong Du3, Chang Huang3, and Philip H. S. Torr1

1University of Oxford   2Stanford University   3Baidu Institute of Deep Learning

Abstract

Pixel-level labelling tasks, such as semantic segmentation, play a central role in image understanding. Recent approaches have attempted to harness the capabilities of deep learning techniques for image recognition to tackle pixel-level labelling tasks. One central issue in this methodology is the limited capacity of deep learning techniques to delineate visual objects. To solve this problem, we introduce a new form of convolutional neural network that combines the strengths of Convolutional Neural Networks (CNNs) and Conditional Random Fields (CRFs)-based probabilistic graphical modelling. To this end, we formulate mean-field approximate inference for the Conditional Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks. This network, called CRF-RNN, is then plugged in as a part of a CNN to obtain a deep network that has desirable properties of both CNNs and CRFs. Importantly, our system fully integrates CRF modelling with CNNs, making it possible to train the whole deep network end-to-end with the usual back-propagation algorithm, avoiding offline post-processing methods for object delineation.

We apply the proposed method to the problem of semantic image segmentation, obtaining top results on the challenging Pascal VOC 2012 segmentation benchmark.

1. Introduction

Low-level computer vision problems such as semantic image segmentation or depth estimation often involve assigning a label to each pixel in an image. While the feature representation used to classify individual pixels plays an important role in this task, it is similarly important to consider factors such as image edges, appearance consistency and spatial consistency while assigning labels in order to obtain accurate and precise results.

Designing a strong feature representation is a key challenge in pixel-level labelling problems. Work on this topic includes: TextonBoost [52], TextonForest [51], and Random Forest-based classifiers [50]. Recently, supervised deep learning approaches such as large-scale deep Convolutional Neural Networks (CNNs) have been immensely successful in many high-level computer vision tasks such as image recognition [31] and object detection [20]. This motivates exploring the use of CNNs for pixel-level labelling problems. The key insight is to learn a strong feature representation end-to-end for the pixel-level labelling task instead of hand-crafting features with heuristic parameter tuning. In fact, a number of recent approaches including the particularly interesting works FCN [37] and DeepLab [10] have shown a significant accuracy boost by adapting state-of-the-art CNN based image classifiers to the semantic segmentation problem.

However, there are significant challenges in adapting CNNs designed for high-level computer vision tasks such as object recognition to pixel-level labelling tasks. Firstly, traditional CNNs have convolutional filters with large receptive fields and hence produce coarse outputs when restructured to produce pixel-level labels [37]. Presence of max-pooling layers in CNNs further reduces the chance of getting a fine segmentation output [10]. This, for instance, can result in non-sharp boundaries and blob-like shapes in semantic segmentation tasks. Secondly, CNNs lack smoothness constraints that encourage label agreement between similar pixels, and spatial and appearance consistency of the labelling output. Lack of such smoothness constraints can result in poor object delineation and small spurious regions in the segmentation output [59, 58, 32, 39].

On a separate track to the progress of deep learning techniques, probabilistic graphical models have been developed as effective methods to enhance the accuracy of pixel-level labelling tasks. In particular, Markov Random Fields (MRFs) and its variant Conditional Random Fields (CRFs) have observed widespread success in this area [32, 29] and have become one of the most successful graphical models used in computer vision. The key idea of CRF inference for semantic labelling is to formulate the label assignment problem as a probabilistic inference problem that incorporates assumptions such as the label agreement between similar pixels. CRF inference is able to refine weak and coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations. Therefore, intuitively, CRFs can be used to overcome the drawbacks in utilizing CNNs for pixel-level labelling tasks.

One way to utilize CRFs to improve the semantic labelling results produced by a CNN is to apply CRF inference as a post-processing step disconnected from the training of the CNN [10]. Arguably, this does not fully harness the strength of CRFs since it is not integrated with the deep network. In this setup, the deep network is unaware of the CRF during the training phase.

∗ Authors contributed equally.
† Work conducted while the authors were at the University of Oxford.

CRF inference is able to refine weak and coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations. Therefore, intuitively, CRFs can be used to overcome the drawbacks of utilizing CNNs for pixel-level labelling tasks.

One way to utilize CRFs to improve the semantic labelling results produced by a CNN is to apply CRF inference as a post-processing step disconnected from the training of the CNN [10]. Arguably, this does not fully harness the strength of CRFs since it is not integrated with the deep network: in this setup, the deep network is unaware of the CRF during the training phase.

In this paper, we propose an end-to-end deep learning solution for the pixel-level semantic image segmentation problem. Our formulation combines the strengths of both CNNs and CRF-based graphical models in one unified framework. More specifically, we formulate mean-field approximate inference for the dense CRF with Gaussian pairwise potentials as a Recurrent Neural Network (RNN), which can refine coarse outputs from a traditional CNN in the forward pass while passing error differentials back to the CNN during training. Importantly, with our formulation, the whole deep network, which comprises a traditional CNN and an RNN for CRF inference, can be trained end-to-end utilizing the usual back-propagation algorithm.

Arguably, when properly trained, the proposed network should outperform a system where CRF inference is applied as a post-processing method on independent pixel-level predictions produced by a pre-trained CNN. Our experimental evaluation confirms that this is indeed the case. We evaluate the performance of our network on the popular Pascal VOC 2012 benchmark, achieving a new state-of-the-art accuracy of 74.7%.

2. Related Work

In this section we review approaches that make use of deep learning and CNNs for low-level computer vision tasks, with a focus on semantic image segmentation. A wide variety of approaches have been proposed to tackle the semantic image segmentation task using deep learning. These approaches can be categorized into two main strategies.

The first strategy is based on utilizing separate mechanisms for feature extraction and for image segmentation exploiting the edges of the image [2, 38]. One representative instance of this scheme is the application of a CNN for the extraction of meaningful features, using superpixels to account for the structural pattern of the image. Two representative examples are [19, 38], where the authors first obtained superpixels from the image and then used a feature extraction process on each of them. The main disadvantage of this strategy is that errors in the initial proposals (e.g. superpixels) may lead to poor predictions, no matter how good the feature extraction process is. Pinheiro and Collobert [46] employed an RNN to model the spatial dependencies during scene parsing. In contrast to their approach, we show that a typical graphical model such as a CRF can be formulated as an RNN to form a part of a deep network, and trained end-to-end in combination with a CNN.

The second strategy is to directly learn a nonlinear model from the images to the label map. This, for example, was shown in [17], where the authors replaced the last fully connected layers of a CNN by convolutional layers to keep spatial information. An important contribution in this direction is [37], where Long et al. used the concept of fully convolutional networks, and the notion that top layers obtain meaningful features for object recognition whereas low layers keep information about the structure of the image, such as edges. In their work, connections from early layers to later layers were used to combine these cues. Bell et al. [5] and Chen et al. [10, 41] used a CRF to refine segmentation results obtained from a CNN. Bell et al. focused on material recognition and segmentation, whereas Chen et al. reported very significant improvements on semantic image segmentation. In contrast to these works, which employed CRF inference as a standalone post-processing step disconnected from the CNN training, our approach is an end-to-end trainable network that jointly learns the parameters of the CNN and the CRF in one unified deep network.

Works that use neural networks to predict structured output are found in different domains. For example, Do et al. [14] proposed an approach to combine deep neural networks and Markov networks for sequence labeling tasks. Jain et al. [26] showed that Convolutional Neural Networks can perform as well as MRF/CRF approaches in image restoration applications. Another domain which benefits from the combination of CNNs and structured loss is handwriting recognition: in [6], the authors combined a CNN with Hidden Markov Models for that purpose, whereas more recently, Peng et al. [45] used a modified version of CRFs. In natural language processing, Yao et al. [60] showed that the performance of an RNN-based word tagger can be significantly improved by incorporating elements of the CRF model. Related to this line of work, in [25] a joint CNN and CRF model was used for text recognition on natural images. Tompson et al. [57] showed the use of joint training of a CNN and an MRF for human pose estimation, while Chen et al. [11] focused on the image classification problem with a similar approach. Another prominent work is [21], in which the authors express deformable part models, a kind of MRF, as a layer in a neural network. In our approach, we cast a different graphical model as a neural network layer.

A number of approaches have been proposed for automatic learning of graphical model parameters and joint training of classifiers and graphical models.

Barbu et al. [4] proposed joint training of an MRF/CRF model together with an inference algorithm in their Active Random Field approach. Domke [15] advocated back-propagation based parameter optimization in graphical models when approximate inference methods such as mean-field and belief propagation are used. This idea was utilized in [28], where a binary dense CRF was used for human pose estimation. Similarly, Ross et al. [47] and Stoyanov et al. [54] showed how back-propagation through belief propagation can be used to optimize model parameters; [21], in particular, proposes an approach based on learning messages. Many of these ideas can be traced back to [55], which proposes unrolling message passing algorithms as simpler operations that could be performed within a CNN. In a different setup, Krähenbühl and Koltun [30] demonstrated automatic parameter tuning of the dense CRF when a modified mean-field algorithm is used for inference. An alternative inference approach for the dense CRF, not based on mean-field, is proposed in [61].

In contrast to the works described above, our approach shows that it is possible to formulate the dense CRF as an RNN, so that one can form an end-to-end trainable system for semantic image segmentation which combines the strengths of deep learning and graphical modelling.

After our initial publication of the technical report of this work on arXiv.org, a number of independent works [49, 35] appeared on arXiv.org presenting similar joint training approaches for semantic image segmentation.

Figure 1. A mean-field iteration as a CNN. A single iteration of the mean-field algorithm can be modelled as a stack of common CNN layers.

3. Conditional Random Fields

In this section we provide a brief overview of Conditional Random Fields (CRF) for pixel-wise labelling and introduce the notation used in the paper. A CRF, used in the context of pixel-wise label prediction, models pixel labels as random variables that form a Markov Random Field (MRF) when conditioned upon a global observation. The global observation is usually taken to be the image.

Let X_i be the random variable associated with pixel i, which represents the label assigned to pixel i and can take any value from a pre-defined set of labels L = {l_1, l_2, ..., l_L}. Let X be the vector formed by the random variables X_1, X_2, ..., X_N, where N is the number of pixels in the image. Given a graph G = (V, E), where V = {X_1, X_2, ..., X_N}, and a global observation (image) I, the pair (I, X) can be modelled as a CRF characterized by a Gibbs distribution of the form P(X = x | I) = (1/Z(I)) exp(−E(x | I)). Here E(x) is called the energy of the configuration x ∈ L^N and Z(I) is the partition function [33]. From now on, we drop the conditioning on I in the notation for convenience.

In the fully connected pairwise CRF model of [29], the energy of a label assignment x is given by

    E(x) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j),    (1)

where the unary energy components ψ_u(x_i) measure the inverse likelihood (and therefore, the cost) of the pixel i taking the label x_i, and the pairwise energy components ψ_p(x_i, x_j) measure the cost of assigning labels x_i, x_j to pixels i, j simultaneously. In our model, unary energies are obtained from a CNN, which, roughly speaking, predicts labels for pixels without considering the smoothness and consistency of the label assignments. The pairwise energies provide an image data-dependent smoothing term that encourages assigning similar labels to pixels with similar properties. As was done in [29], we model pairwise potentials as weighted Gaussians:

    ψ_p(x_i, x_j) = μ(x_i, x_j) Σ_{m=1}^{M} w^(m) k_G^(m)(f_i, f_j),    (2)

where each k_G^(m), for m = 1, ..., M, is a Gaussian kernel applied on feature vectors. The feature vector of pixel i, denoted by f_i, is derived from image features such as spatial location and RGB values [29]. We use the same features as in [29]. The function μ(., .), called the label compatibility function, captures the compatibility between different pairs of labels, as the name implies.

Minimizing the above CRF energy E(x) yields the most probable label assignment x for the given image. Since this exact minimization is intractable, a mean-field approximation to the CRF distribution is used for approximate maximum posterior marginal inference. It consists in approximating the CRF distribution P(X) by a simpler distribution Q(X), which can be written as the product of independent marginal distributions, i.e., Q(X) = Π_i Q_i(X_i). The steps of the iterative algorithm for approximate mean-field inference and its reformulation as an RNN are discussed next.
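To make the energy definition concrete, the following is a minimal NumPy sketch (ours, not the authors' code) that evaluates Eqs. (1) and (2) by brute force on a toy image; the diagonal-bandwidth Gaussian kernels and all function names are illustrative assumptions.

```python
import numpy as np

def pairwise_potential(xi, xj, fi, fj, weights, bandwidths, mu):
    """Weighted-Gaussian pairwise term of Eq. (2).

    fi, fj     : feature vectors, e.g. [row, col, R, G, B]
    weights    : kernel weights w^(m), shape (M,)
    bandwidths : one diagonal bandwidth vector per kernel, shape (M, D)
    mu         : label compatibility matrix mu[l, l'], shape (L, L)
    """
    kernels = np.array([np.exp(-0.5 * np.sum(((fi - fj) / bw) ** 2))
                        for bw in bandwidths])
    return mu[xi, xj] * (weights @ kernels)

def crf_energy(x, unary, feats, weights, bandwidths, mu):
    """Gibbs energy of Eq. (1): unary costs plus all pairwise terms.

    x     : label assignment, shape (N,)
    unary : psi_u, shape (N, L); unary[i, l] is the cost of label l at pixel i
    feats : per-pixel feature vectors, shape (N, D)
    """
    n = len(x)
    energy = sum(unary[i, x[i]] for i in range(n))
    for i in range(n):              # brute-force sum over all pixel pairs;
        for j in range(i + 1, n):   # fine for a toy image, not a real one
            energy += pairwise_potential(x[i], x[j], feats[i], feats[j],
                                         weights, bandwidths, mu)
    return energy
```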
4. A Mean-field Iteration as a Stack of CNN Layers

A key contribution of this paper is to show that the mean-field CRF inference can be reformulated as a Recurrent Neural Network (RNN).

To this end, we first consider individual steps of the mean-field algorithm summarized in Algorithm 1 [29], and describe them as CNN layers. Our contribution is based on the observation that the filter-based approximate mean-field inference approach for dense CRFs relies on applying Gaussian spatial and bilateral filters on the mean-field approximates in each iteration. Unlike the standard convolutional layer in a CNN, in which filters are fixed after the training stage, we use edge-preserving Gaussian filters [56, 42], whose coefficients depend on the original spatial and appearance information of the image. These filters have the additional advantage of requiring a smaller set of parameters, despite the filter size being potentially as big as the image.

While reformulating the steps of the inference algorithm as CNN layers, it is essential to be able to calculate error differentials in each layer w.r.t. its inputs in order to be able to back-propagate the error differentials to previous layers during training. We also discuss how to calculate error differentials with respect to the parameters in each layer, enabling their optimization through the back-propagation algorithm. Therefore, in our formulation, CRF parameters such as the weights of the Gaussian kernels and the label compatibility function can also be optimized automatically during the training of the full network.

Once the individual steps of the algorithm are broken down as CNN layers, the full algorithm can then be formulated as an RNN. We explain this in Section 5 after discussing the steps of Algorithm 1 in detail below. In Algorithm 1 and the remainder of this paper, we use U_i(l) to denote the negative of the unary energy introduced in the previous section, i.e., U_i(l) = −ψ_u(X_i = l). In the conventional CRF setting, this input U_i(l) to the mean-field algorithm is obtained from an independent classifier.

Algorithm 1: Mean-field in dense CRFs [29], broken down to common CNN operations.

    Q_i(l) ← (1/Z_i) exp(U_i(l)) for all i                          ▷ Initialization
    while not converged do
        Q̃_i^(m)(l) ← Σ_{j≠i} k^(m)(f_i, f_j) Q_j(l) for all m       ▷ Message Passing
        Q̌_i(l) ← Σ_m w^(m) Q̃_i^(m)(l)                              ▷ Weighting Filter Outputs
        Q̂_i(l) ← Σ_{l′∈L} μ(l, l′) Q̌_i(l′)                         ▷ Compatibility Transform
        Q̆_i(l) ← U_i(l) − Q̂_i(l)                                   ▷ Adding Unary Potentials
        Q_i(l) ← (1/Z_i) exp(Q̆_i(l))                                ▷ Normalizing
    end while
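For readers who prefer code, one iteration of Algorithm 1 might look as follows in NumPy (a sketch of ours; the `gaussian_filters` argument stands in for the edge-preserving filters and is not the paper's permutohedral-lattice implementation):

```python
import numpy as np

def mean_field_iteration(Q, U, gaussian_filters, w, mu):
    """One mean-field update mirroring the five steps of Algorithm 1.

    Q                : current marginals, shape (N, L)
    U                : negative unary energies U_i(l), shape (N, L)
    gaussian_filters : list of M callables, each applying one Gaussian
                       kernel k^(m) over all pixels, (N, L) -> (N, L)
    w                : kernel weights w^(m), shape (M,)
    mu               : label compatibility matrix, shape (L, L)
    """
    # Message passing: M Gaussian filterings of Q.
    Q_tilde = [g(Q) for g in gaussian_filters]
    # Weighting filter outputs: a 1x1 convolution with M input channels.
    Q_check = sum(w_m * q for w_m, q in zip(w, Q_tilde))
    # Compatibility transform: a 1x1 convolution across the L labels,
    # Q_hat[i, l] = sum_{l'} mu[l, l'] * Q_check[i, l'].
    Q_hat = Q_check @ mu.T
    # Adding unary potentials.
    Q_breve = U - Q_hat
    # Normalizing: a softmax over the labels at each pixel.
    e = np.exp(Q_breve - Q_breve.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```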
4.1. Initialization

In the initialization step of the algorithm, the operation Q_i(l) ← (1/Z_i) exp(U_i(l)), where Z_i = Σ_l exp(U_i(l)), is performed. Note that this is equivalent to applying a softmax function over the unary potentials U across all the labels at each pixel. The softmax function has been used extensively in CNN architectures before and is therefore well known in the deep learning community. This operation does not include any parameters, and the error differentials received at the output of the step during back-propagation can be passed down to the unary potential inputs after performing the usual backward-pass calculations of the softmax transformation.

4.2. Message Passing

In the dense CRF formulation, message passing is implemented by applying M Gaussian filters on the Q values. Gaussian filter coefficients are derived from image features such as pixel locations and RGB values, which reflect how strongly a pixel is related to other pixels. Since the CRF is potentially fully connected, each filter's receptive field spans the whole image, making it infeasible to use a brute-force implementation of the filters. Fortunately, several approximation techniques exist to make computation of high-dimensional Gaussian filtering significantly faster. Following [29], we use the permutohedral lattice implementation [1], which can compute the filter response in O(N) time, where N is the number of pixels of the image.

During back-propagation, error derivatives w.r.t. the filter inputs are calculated by sending the error derivatives w.r.t. the filter outputs through the same M Gaussian filters in the reverse direction. In terms of permutohedral lattice operations, this can be accomplished by only reversing the order of the separable filters in the blur stage, while building the permutohedral lattice, splatting, and slicing in the same way as in the forward pass. Therefore, back-propagation through this filtering stage can also be performed in O(N) time. Following [29], we use two Gaussian kernels, a spatial kernel and a bilateral kernel. In this work, for simplicity, we keep the bandwidth values of the filters fixed. It is also possible to use multiple spatial and bilateral kernels with different bandwidth values and learn their optimal linear combination.
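For intuition about why the O(N) lattice matters, a dense Gaussian filtering step can also be written directly from k(f_i, f_j); the toy sketch below (ours) is quadratic in the number of pixels and therefore usable only on very small images:

```python
import numpy as np

def dense_gaussian_filter(Q, feats, bandwidth):
    """Brute-force message passing: (filtered Q)_i = sum_{j != i} k(f_i, f_j) Q_j.

    Q         : marginals, shape (N, L)
    feats     : per-pixel features, shape (N, D); spatial coordinates for a
                spatial kernel, coordinates plus RGB for a bilateral kernel
    bandwidth : per-dimension Gaussian bandwidths, shape (D,)
    """
    diff = feats[:, None, :] - feats[None, :, :]            # (N, N, D)
    K = np.exp(-0.5 * np.sum((diff / bandwidth) ** 2, -1))  # (N, N) kernel
    np.fill_diagonal(K, 0.0)                                # exclude j == i
    return K @ Q                                            # O(N^2) cost
```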

4.3. Weighting Filter Outputs

The next step of the mean-field iteration is taking a weighted sum of the M filter outputs from the previous step, for each class label l. When each class label is considered individually, this can be viewed as a usual convolution with a 1 × 1 filter with M input channels and one output channel. Since both the inputs and the outputs of this step are known during back-propagation, the error derivative w.r.t. the filter weights can be computed, making it possible to automatically learn the filter weights (the relative contributions from each Gaussian filter output of the previous stage). The error derivative w.r.t. the inputs can also be computed in the usual manner to pass the error derivatives down to the previous stage. To obtain a higher number of tunable parameters, in contrast to [29], we use independent kernel weights for each class label. The intuition is that the relative importance of the spatial kernel vs. the bilateral kernel depends on the visual class. For example, bilateral kernels may have high importance in bicycle detection, because similarity of colours is determinant; on the other hand, they may have low importance for TV detection, given that whatever is inside the TV screen may have many different colours.

4.4. Compatibility Transform

In the compatibility transform step, outputs from the previous step (denoted by Q̌ in Algorithm 1) are shared between the labels to a varied extent, depending on the compatibility between these labels. Compatibility between two labels l and l′ is parameterized by the label compatibility function μ(l, l′). The Potts model, given by μ(l, l′) = [l ≠ l′], where [.] is the Iverson bracket, assigns a fixed penalty if different labels are assigned to pixels with similar properties. A limitation of this model is that it assigns the same penalty for all pairs of different labels. Intuitively, better results can be obtained by taking the compatibility between different label pairs into account and penalizing the assignments accordingly. For example, assigning the labels "person" and "bicycle" to nearby pixels should carry a lesser penalty than assigning the labels "sky" and "bicycle". Therefore, learning the function μ from data is preferable to fixing it in advance with the Potts model. We also relax our compatibility transform model by assuming that μ(l, l′) ≠ μ(l′, l) in general.

The compatibility transform step can be viewed as another convolution layer where the spatial receptive field of the filter is 1 × 1, and the numbers of input and output channels are both L. Learning the weights of this filter is equivalent to learning the label compatibility function μ. Transferring error differentials from the output of this step to the input can be done since this step is a usual convolution operation.
the same penalty for all different pairs of labels. Intuitively, iteration takes Q value estimates from the previous iteration
better results can be obtained by taking the compatibility and the unary values in their original form. This is equiv-
between different label pairs into account and penalizing alent to treating the iterative mean-field inference as a Re-
the assignments accordingly. For example, assigning labels current Neural Network (RNN) as shown in Fig. 2. Using
“person” and “bicycle” to nearby pixels should have a lesser the notation in the figure, the behaviour of the network is
penalty than assigning labels “sky” and “bicycle”. There- given by the following equations where T is the number of
fore, learning the function µ from data is preferred to fixing mean-field iterations:
it in advance with Potts model. We also relax our compat-
ibility transform model by assuming that µ(l, l0 ) 6= µ(l0 , l)
(
softmax(U ), t = 0
in general. H1 (t) = (3)
H2 (t − 1), 0 < t ≤ T,
Compatibility transform step can be viewed as another
convolution layer where the spatial receptive field of the fil- H2 (t) = fθ (U, H1 (t), I), 0 ≤ t ≤ T, (4)
ter is 1 × 1, and the number of input and output channels
(
0, 0≤t<T
are both L. Learning the weights of this filter is equivalent Y (t) = (5)
H2 (t), t = T.
to learning the label compatibility function µ. Transferring
error differentials from the output of this step to the input We name this RNN structure CRF-RNN. Parameters of
can be done since this step is a usual convolution operation. the CRF-RNN are the same as the mean-field parameters
described in Section 4 and denoted by θ here. Since the cal-
4.5. Adding Unary Potentials culation of error differentials w.r.t. these parameters in a sin-
gle iteration was described in Section 4, they can be learnt
In this step, the output from the compatibility transform in the RNN setting using the standard back-propagation
stage is subtracted element-wise from the unary inputs U . through time algorithm [48, 40]. It was shown in [29] that
While no parameters are involved in this step, transferring the mean-field iterative algorithm for dense CRF converges
error differentials can be done trivially by copying the dif- in less than 10 iterations. Furthermore, in practice, after
ferentials at the output of this step to both inputs with the about 5 iterations, increasing the number of iterations usu-
appropriate sign. ally does not significantly improve results [29]. Therefore,
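In code, the recurrence of Eqs. (3)–(5) is simply an unrolled loop; a minimal sketch (ours), with `f_theta` denoting one mean-field iteration as in the text:

```python
import numpy as np

def softmax(U):
    e = np.exp(U - U.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def crf_rnn_forward(U, image, f_theta, T=5):
    """Forward pass of the CRF-RNN, following Eqs. (3)-(5).

    U       : unary potentials from the CNN, shape (N, L)
    f_theta : one mean-field iteration, (U, H1, image) -> H2
    T       : number of mean-field iterations (the paper uses 5 during
              training and 10 at test time)
    """
    H1 = softmax(U)                 # Eq. (3), t = 0: gate G1 picks softmax(U)
    for t in range(T):
        H2 = f_theta(U, H1, image)  # Eq. (4)
        H1 = H2                     # Eq. (3), t > 0: gate G1 feeds H2 back
    return H2                       # Eq. (5): Y is emitted only at t = T
```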

Figure 2. The CRF-RNN Network. We formulate the iterative mean-field algorithm as a Recurrent Neural Network (RNN). Gating functions G1 and G2 are fixed as described in the text.

5.2. Completing the Picture

Our approach comprises a fully convolutional network stage, which predicts pixel-level labels without considering structure, followed by a CRF-RNN stage, which performs CRF-based probabilistic graphical modelling for structured prediction. The complete system, therefore, unifies the strengths of both CNNs and CRFs and is trainable end-to-end using the back-propagation algorithm [34] and the Stochastic Gradient Descent (SGD) procedure. During training, a whole image (or many of them) can be used as the mini-batch, and the error at each pixel output of the network can be computed using an appropriate loss function, such as the softmax loss, with respect to the ground truth segmentation of the image. We used the FCN-8s architecture of [37] as the first part of our network, which provides unary potentials to the CRF. This network is based on the VGG-16 network [53] but has been restructured to perform pixel-wise prediction instead of image classification.

In the forward pass through the network, once the computation enters the CRF-RNN after passing through the CNN stage, it takes T iterations for the data to leave the loop created by the RNN. Neither the CNN that provides the unary values nor the layers after the CRF-RNN (i.e., the loss layers) need to perform any computations during this time, since the refinement happens only inside the RNN's loop. Once the output Y leaves the loop, the next stages of the deep network after the CRF-RNN can continue the forward pass. In our setup, a softmax loss layer directly follows the CRF-RNN and terminates the network.

During the backward pass, once the error differentials reach the CRF-RNN's output Y, they similarly spend T iterations within the loop before reaching the RNN input U in order to propagate to the CNN which provides the unary input. In each iteration inside the loop, error differentials are computed inside each component of the mean-field iteration as described in Section 4. We note that unnecessarily increasing the number of mean-field iterations T could potentially result in vanishing and exploding gradient problems in the CRF-RNN. We, however, did not experience this problem during our experiments.

Figure 3. The End-to-end Trainable Network. Schematic visualization of our full network, which consists of a CNN and the CRF-RNN network. Best viewed in colour.

6. Implementation Details

In this section we describe the implementation details of the proposed network, as well as its training process. The high-level architecture of our system, which was implemented using the popular Caffe [27] deep learning library, is shown in Fig. 3. The full source code and the trained models of our approach will be made publicly available.¹

We initialized the first part of the network using the publicly available weights of the FCN-8s network [37]. The compatibility transform parameters of the CRF-RNN were initialized using the Potts model, and the kernel width and weight parameters were obtained from a cross-validation process. We found that such initialization results in faster convergence of training. During the training phase, parameters of the whole network were optimized end-to-end using the back-propagation algorithm. In particular, we used the full image training described in [37], with the learning rate fixed at 10^−13 and momentum set to 0.99. These extreme values of the parameters were used since we employed only one image per batch to avoid reaching the memory limits of the GPU.

In all our experiments, during training, we set the number of mean-field iterations T in the CRF-RNN to 5 to avoid vanishing/exploding gradient problems and to reduce the training time. During test time, the iteration count was increased to 10. The effect of this parameter value on the accuracy is discussed in Section 7.1.

¹ https://github.com/torrvision/crfasrnn/
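Concretely, the parameter update reduces to plain momentum SGD with the values reported above; a NumPy-style sketch (ours; the paper's actual implementation uses the Caffe solver):

```python
def sgd_momentum_update(param, grad, velocity, lr=1e-13, momentum=0.99):
    """Momentum SGD step with the hyperparameters reported in the paper.

    The extreme learning-rate/momentum pair goes together with the batch
    size of a single full image, chosen to stay within GPU memory.

    param, grad, velocity: NumPy arrays of identical shape; param and
    velocity are updated in place.
    """
    velocity *= momentum
    velocity -= lr * grad
    param += velocity
```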

Loss function. During the training of the models that achieved the best results reported in this paper, we used the standard softmax loss function, that is, the log-likelihood error function described in [30]. The standard metric used in the Pascal VOC challenge is the average intersection over union (IU), which we also use here to report the results. In our experiments we found that, to a large extent, high values of IU on the validation set were associated with low values of the averaged softmax loss. We also tried the robust log-likelihood in [30] as a loss function for CRF-RNN training. However, this resulted in neither increased accuracy nor faster convergence.

Normalization techniques. As described in Section 4, we use the exponential function followed by pixel-wise normalization across channels in several stages of the CRF-RNN. Since this operation has a tendency to result in small gradients with respect to the input when the input value is large, we conducted several experiments where we replaced it by a rectified linear unit (ReLU) operation followed by a normalization across the channels. Our hypothesis was that this approach may approximate the original operation adequately while speeding up training due to improved gradients. Furthermore, ReLU would induce sparsity on the probability of labels assigned to pixels, implicitly pruning low-likelihood configurations, which could have a positive effect. However, this approach did not lead to better results, obtaining 1% IU lower than the performance of the original setting.

Figure 4. Qualitative results on the validation set of Pascal VOC 2012. FCN [37] is a CNN-based model that does not employ CRF. DeepLab [10] is a two-stage approach, where the CNN is trained first, and then CRF is applied on top of the CNN output. Our approach is an end-to-end trained system that integrates both CNN and CRF-RNN in one deep network. Best viewed in colour.
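The IU metric referred to above is straightforward to compute; a per-image NumPy sketch (ours; the official VOC evaluation accumulates a confusion matrix over the whole test set before averaging):

```python
import numpy as np

def mean_iu(pred, gt, num_classes):
    """Average intersection-over-union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:               # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```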

7. Experiments

We present experimental results with the proposed CRF-RNN framework. We use two datasets: the Pascal VOC 2012 dataset and the Pascal Context dataset. We use the Pascal VOC 2012 dataset as it has become the gold standard for comprehensively evaluating any new semantic segmentation approach against existing methods. We also use the Pascal Context dataset to assess how well our approach performs on a dataset with different characteristics.

Pascal VOC Datasets

In order to evaluate our approach against existing methods under the same circumstances, we conducted two main experiments with the Pascal VOC 2012 dataset, followed by a qualitative experiment.

In the first experiment, following [37, 38, 41], we used a training set consisting of the VOC 2012 training data (1,464 images) and the training and validation data of [23], which amounts to a total of 11,685 images. After removing the images overlapping between the VOC 2012 validation data and this training dataset, we were left with 346 images from the original VOC 2012 validation set to validate our models on. We call this set the reduced validation set in the sequel. Annotations of the VOC 2012 test set, which consists of 1,456 images, are not publicly available, and hence the final results on the test set were obtained by submitting the results to the Pascal VOC challenge evaluation server [18]. Regardless of the smaller number of images, we found that the relative improvements of the accuracy on our validation set were in good agreement with the test set.

As a first step, we directly compared the potential advantage of learning the model end-to-end with respect to alternative learning strategies. These are plain FCN-8s without applying CRF, and CRF applied as a post-processing method disconnected from the training of FCN, which is comparable to the approach described in [10] and [41]. The results are reported in Table 1 and show a clear advantage of the end-to-end strategy over the offline application of CRF as a post-processing method.
This can be attributed to the fact that, during the SGD training of the CRF-RNN, the CNN component and the CRF component learn how to co-operate with each other to produce the optimum output of the whole network.

Method | Without COCO | With COCO
Plain FCN-8s | 61.3 | 68.3
FCN-8s and CRF disconnected | 63.7 | 69.5
End-to-end training of CRF-RNN | 69.6 | 72.9

Table 1. Mean IU accuracy of our approach, CRF-RNN, compared with similar methods, evaluated on the reduced VOC 2012 validation set.

We then proceeded to compare our approach with state-of-the-art methods that used training data from the standard VOC 2012 training and validation sets, and from the dataset published with [22]. The results are shown in Table 2, above the bar, and we can see that our approach outperforms all competitors.

In the second experiment, in addition to the above training set, we used data from the Microsoft COCO dataset [36], as was done in [41] and [12]. We selected images from the MS COCO 2014 training set where the ground truth segmentation has at least 200 pixels marked with class labels present in the VOC 2012 dataset. With this selection, we ended up using 66,099 images from the COCO dataset, and therefore a total of 66,099 + 11,685 = 77,784 training images were used in the second experiment. The same reduced validation set was used in this second experiment as well. In this case, we first fine-tuned the plain FCN-32s network (without the CRF-RNN part) on COCO data, then we built an FCN-8s network with the learnt weights, and finally trained the CRF-RNN network end-to-end using VOC 2012 training data only. Since the MS COCO ground truth segmentation data contain somewhat coarse segmentation masks where objects are not delineated properly, we found that fine-tuning our model with COCO did not yield significant improvements. This can be understood because the primary advantage of our model comes from delineating the objects and improving fine segmentation boundaries. The VOC 2012 training dataset therefore helps our model learn this task effectively. The results of this experiment are shown in Table 2, below the bar, and we see that our approach sets a new state-of-the-art on the VOC 2012 dataset.

Note that in both setups, our approach outperforms competing methods due to the end-to-end training of the CNN and CRF in the unified CRF-RNN framework. We also evaluated our models on the VOC 2010 and VOC 2011 test sets (see Table 2). In all cases our method achieves state-of-the-art performance.

Method | VOC 2010 test | VOC 2011 test | VOC 2012 test
BerkeleyRC [3] | n/a | 39.1 | n/a
O2PCPMC [8] | 49.6 | 48.8 | 47.8
Divmbest [44] | n/a | n/a | 48.1
NUS-UDS [16] | n/a | n/a | 50.0
SDS [23] | n/a | n/a | 51.6
MSRA-CFM [13] | n/a | n/a | 61.8
FCN-8s [37] | n/a | 62.7 | 62.2
Hypercolumn [24] | n/a | n/a | 62.6
Zoomout [38] | 64.4 | 64.1 | 64.4
Context-Deep-CNN-CRF [35] | n/a | n/a | 70.7
DeepLab-MSc [10] | n/a | n/a | 71.6
Our method w/o COCO | 73.6 | 72.4 | 72.0
BoxSup [12] | n/a | n/a | 71.0
DeepLab [10, 41] | n/a | n/a | 72.7
Our method with COCO | 75.7 | 75.0 | 74.7

Table 2. Mean IU accuracy of our approach, CRF-RNN, compared to the other approaches on the Pascal VOC 2010–2012 test datasets. Methods in the first group do not use MS COCO data for training; methods in the second group use both the COCO and VOC datasets for training.

In order to have qualitative evidence about how the CRF-RNN learns, we visualize the compatibility function learned after the training stage of the CRF-RNN as a matrix representation in Fig. 5. Element (i, j) of this matrix corresponds to μ(i, j) defined earlier: a high value at (i, j) implies a high penalty for assigning label i to a pixel when a similar pixel (spatially or appearance-wise) is assigned label j. For example, we can appreciate that the learned compatibility matrix assigns a low penalty to pairs of labels that tend to appear together, such as [Motorbike, Person] and [Dining table, Chair].

Pascal Context Dataset

We conducted an experiment on the Pascal Context dataset [39], which differs from the previous one in the larger number of classes considered: 59. We used the provided partitions of the training and validation sets, and the obtained results are reported in Table 3.

Method | O2P [8] | CFM [13] | FCN-8s [37] | CRF-RNN
Mean IU | 18.1 | 34.4 | 37.78 | 39.28

Table 3. Mean IU accuracy of our approach, CRF-RNN, evaluated on the Pascal Context validation set.
Figure 5. Visualization of the learnt label compatibility matrix. In the standard Potts model, diagonal entries are equal to −1, while off-diagonal entries are zero. These values have changed after the end-to-end training of our network. Best viewed in colour.

7.1. Effect of Design Choices

We performed a number of additional experiments on the Pascal VOC 2012 validation set described above to study the effect of some of the design choices we made.

We first studied the performance gains attained by our modifications to the CRF over the CRF approach proposed by [29]. We found that using different filter weights for different classes improved the performance by 1.8 percentage points, and that introducing the asymmetric compatibility transform further boosted the performance by 0.9 percentage points.

Regarding the RNN iteration count T, incrementing it to T = 10 at test time, from T = 5 during training, produced an accuracy improvement of 0.2 percentage points. Setting T = 10 during training as well reduced the accuracy by 0.7 percentage points. We believe that this might be due to a vanishing gradient effect caused by using too many iterations. In practice, this leads to the first part of the network (the one producing the unary potentials) receiving a very weak error gradient signal during training, thus hampering its learning capacity.

End-to-end training after the initialization of the CRF parameters improved performance by 3.4 percentage points. We also conducted an experiment where we froze the FCN-8s part and fine-tuned only the RNN part (i.e., the CRF parameters). This improved the performance over the initialization by only 1 percentage point. We therefore conclude that end-to-end training contributed significantly to boosting the accuracy of the system.

Treating each iteration of mean-field inference as an independent step with its own parameters, and training end-to-end with 5 such iterations, yielded a final mean IU score of only 70.9, supporting the hypothesis that the recurrent structure of our approach is important for its success.

8. Conclusion

We presented CRF-RNN, an interpretation of dense CRFs as Recurrent Neural Networks. Our formulation fully integrates CRF-based probabilistic graphical modelling with emerging deep learning techniques. In particular, the proposed CRF-RNN can be plugged in as a part of a traditional deep neural network: it is capable of passing on error differentials from its outputs to its inputs during back-propagation based training of the deep network while learning CRF parameters. We demonstrated the use of this approach by utilizing it for the semantic segmentation task: we formed an end-to-end trainable deep network by combining a fully convolutional neural network with the CRF-RNN. Our system achieves a new state-of-the-art on the popular Pascal VOC segmentation benchmark. This improvement can be attributed to uniting the strengths of CNNs and CRFs in a single deep network.

In the future, we plan to investigate the advantages/disadvantages of restricting the capabilities of the RNN part of our network to mean-field inference of the dense CRF. A sensible baseline to the work presented here would be to use more standard RNNs (e.g. LSTMs) that learn to iteratively improve the input unary potentials to make them closer to the ground truth.

Acknowledgement. This work was supported by grants EP/M013774/1 and ERC 321162-HELIOS. We thank the Caffe team, Baidu IDL, and the Oxford ARC team for their support. We gratefully acknowledge GPU donations from NVIDIA.

References

[1] A. Adams, J. Baek, and M. A. Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
[2] P. Arbeláez, B. Hariharan, C. Gu, S. Gupta, L. Bourdev, and J. Malik. Semantic segmentation using regions and parts. In IEEE CVPR, 2012.
[3] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 33(5):898–916, 2011.
[4] A. Barbu. Training an active random field for real-time image denoising. IEEE TIP, 18(11):2451–2462, 2009.
[5] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In IEEE CVPR, 2015.

[6] Y. Bengio, Y. LeCun, and D. Henderson. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden markov models. In NIPS, 1994.
[7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
[8] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Free-form region description with second-order pooling. IEEE TPAMI, 2014.
[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[11] L.-C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning deep structured models. In ICLRW, 2015.
[12] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In arXiv:1503.01640, 2015.
[13] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In IEEE CVPR, 2015.
[14] T.-M.-T. Do and T. Artieres. Neural conditional random fields. In NIPS, 2010.
[15] J. Domke. Learning graphical model parameters with approximate marginal inference. IEEE TPAMI, 35(10):2454–2467, 2013.
[16] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and semantic segmentation. In ECCV, 2014.
[17] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[18] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
[19] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE TPAMI, 2013.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, 2014.
[21] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
[22] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE ICCV, 2011.
[23] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[24] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In IEEE CVPR, 2015.
[25] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In ICLR, 2015.
[26] V. Jain, J. F. Murray, F. Roth, S. C. Turaga, V. P. Zhigulin, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. In IEEE ICCV, 2007.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
[28] M. Kiefel and P. V. Gehler. Human pose estimation with fields of parts. In ECCV, 2014.
[29] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[30] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In ICML, 2013.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[32] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In IEEE ICCV, 2009.
[33] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[35] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. In arXiv:1504.01013, 2015.
[36] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollar. Microsoft coco: Common objects in context. In arXiv:1405.0312, 2014.

[37] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE CVPR, 2015.
[38] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In IEEE CVPR, 2015.
[39] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE CVPR, 2014.
[40] M. C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation, pages 137–169. L. Erlbaum Associates Inc., 1995.
[41] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In arXiv:1502.02734, 2015.
[42] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. IJCV, 81(1):24–52, 2013.
[43] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[44] P. Yadollahpour, D. Batra, and G. Shakhnarovich. Discriminative re-ranking of diverse segmentations. In IEEE CVPR, 2013.
[45] J. Peng, L. Bo, and J. Xu. Conditional neural fields. In NIPS, 2009.
[46] P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[47] S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell. Learning message-passing inference machines for structured prediction. In IEEE CVPR, 2011.
[48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In J. A. Anderson and E. Rosenfeld, editors, Parallel distributed processing: explorations in the microstructure of cognition, pages 318–362. MIT Press, 1986.
[49] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. In arXiv:1503.02351, 2015.
[50] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE CVPR, 2011.
[51] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In IEEE CVPR, 2008.
[52] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1):2–23, 2009.
[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556, 2014.
[54] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, 2011.
[55] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and gibbs measures. In UAI, 2002.
[56] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In IEEE CVPR, 1998.
[57] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
[58] Z. Tu. Auto-context and its application to high-level vision tasks. In IEEE CVPR, 2008.
[59] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[60] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao. Recurrent conditional random field for language understanding. In ICASSP, 2014.
[61] Y. Zhang and T. Chen. Efficient inference for fully-connected crfs with stationarity. In CVPR, 2012.

Method | Mean IU | Aero | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair
Methods trained with COCO:
Our method | 74.7 | 90.4 | 55.3 | 88.7 | 68.4 | 69.8 | 88.3 | 82.4 | 85.1 | 32.6
DeepLab [10, 41] | 72.7 | 89.1 | 38.3 | 88.1 | 63.3 | 69.7 | 87.1 | 83.1 | 85.0 | 29.3
BoxSup [12] | 71.0 | 86.4 | 35.5 | 79.7 | 65.2 | 65.2 | 84.3 | 78.5 | 83.7 | 30.5
Methods trained w/o COCO:
Our method trained w/o COCO | 72.0 | 87.5 | 39.0 | 79.7 | 64.2 | 68.3 | 87.6 | 80.8 | 84.4 | 30.4
DeepLab-MSc-CRF-LargeFOV [10] | 71.6 | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7
Context Deep CNN CRF [35] | 70.7 | 87.5 | 37.7 | 75.8 | 57.4 | 72.3 | 88.4 | 82.6 | 80.0 | 33.4
Zoomout [38] | 64.4 | 81.9 | 35.1 | 78.2 | 57.4 | 56.5 | 80.5 | 74.0 | 79.8 | 22.4
Hypercolumn [24] | 62.6 | 68.7 | 33.5 | 69.8 | 51.3 | 70.2 | 81.1 | 71.9 | 74.9 | 23.9
FCN-8s [37] | 62.2 | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4
MSRA CFM [13] | 61.8 | 75.7 | 26.7 | 69.5 | 48.8 | 65.6 | 81.0 | 69.2 | 73.3 | 30.0
SDS [23] | 51.6 | 63.3 | 25.7 | 63.0 | 39.8 | 59.2 | 70.9 | 61.4 | 54.9 | 16.8
NUS UDS [16] | 50.0 | 67.0 | 24.5 | 47.2 | 45.0 | 47.9 | 65.3 | 60.6 | 58.5 | 15.5
TTIC-divmbest-rerank [44] | 48.1 | 62.7 | 25.6 | 46.9 | 43.0 | 54.8 | 58.4 | 58.6 | 55.6 | 14.6
BONN O2PCPMC FGT SEGM [8] | 47.8 | 64.0 | 27.3 | 54.1 | 39.2 | 48.7 | 56.6 | 57.7 | 52.5 | 14.2

Method | Cow | Dining-Table | Dog | Horse | Motorbike | Person | Potted-Plant | Sheep | Sofa | Train | TV/Monitor
Methods trained with COCO:
Our method | 78.5 | 64.4 | 79.6 | 81.9 | 86.4 | 81.8 | 58.6 | 82.4 | 53.5 | 77.4 | 70.1
DeepLab [10, 41] | 76.5 | 56.5 | 79.8 | 77.9 | 85.8 | 82.4 | 57.4 | 84.3 | 54.9 | 80.5 | 64.1
BoxSup [12] | 76.2 | 62.6 | 79.3 | 76.1 | 82.1 | 81.3 | 57.0 | 78.2 | 55.0 | 72.5 | 68.1
Methods trained w/o COCO:
Our method trained w/o COCO | 78.2 | 60.4 | 80.5 | 77.8 | 83.1 | 80.6 | 59.5 | 82.8 | 47.8 | 78.3 | 67.1
DeepLab-MSc-CRF-LargeFOV [10] | 74.1 | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7
Context Deep CNN CRF [35] | 71.5 | 55.0 | 79.3 | 78.4 | 81.3 | 82.7 | 56.1 | 79.8 | 48.6 | 77.1 | 66.3
TTI zoomout 16 [38] | 69.6 | 53.7 | 74.0 | 76.0 | 76.6 | 68.8 | 44.3 | 70.2 | 40.2 | 68.9 | 55.3
Hypercolumn [24] | 60.6 | 46.9 | 72.1 | 68.3 | 74.5 | 72.9 | 52.6 | 64.4 | 45.4 | 64.9 | 57.4
FCN-8s [37] | 62.5 | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1
MSRA CFM [13] | 68.7 | 51.5 | 69.1 | 68.1 | 71.7 | 67.5 | 50.4 | 66.5 | 44.4 | 58.9 | 53.5
SDS [23] | 45.0 | 48.2 | 50.5 | 51.0 | 57.7 | 63.3 | 31.8 | 58.7 | 31.2 | 55.7 | 48.5
NUS UDS [16] | 50.8 | 37.4 | 45.8 | 59.9 | 62.0 | 52.7 | 40.8 | 48.2 | 36.8 | 53.1 | 45.6
TTIC-divmbest-rerank [44] | 47.5 | 31.2 | 44.7 | 51.0 | 60.9 | 53.5 | 36.6 | 50.9 | 30.1 | 50.2 | 46.8
BONN O2PCPMC FGT SEGM [8] | 54.8 | 29.6 | 42.2 | 58.0 | 54.8 | 50.2 | 36.6 | 58.6 | 31.6 | 48.4 | 38.6

Table 4. Intersection over Union (IU) accuracy of our approach, CRF-RNN, compared to the other state-of-the-art approaches on the Pascal VOC 2012 test set (mean IU, then per-class IU; class columns follow the standard VOC class order). Scores for other methods were taken from the results published by the original authors. The symbols are from Chatfield et al. [9].

Figure 6. Typical good quality segmentation results I. Illustration of sample results on the validation set of the Pascal VOC 2012 dataset. Note that in some cases our method is able to pick correct segmentations that are not marked correctly in the ground truth. Best viewed in colour.

Figure 7. Typical good quality segmentation results II. Illustration of sample results on the validation set of the Pascal VOC 2012 dataset. Note that in some cases our method is able to pick correct segmentations that are not marked correctly in the ground truth. Best viewed in colour.

Figure 8. Failure cases I. Illustration of sample failure cases on the validation set of the Pascal VOC 2012 dataset. Best viewed in colour.

Figure 9. Failure cases II. Illustration of sample failure cases on the validation set of the Pascal VOC 2012 dataset. Best viewed in colour.

Figure 10. Qualitative comparison with the other approaches. Sample results with our method on the validation set of the Pascal VOC 2012 dataset, compared with previous state-of-the-art methods. Segmentation results with the DeepLab approach were reproduced from the original publication. Best viewed in colour.

Deep Learning Face Attributes in the Wild∗

Ziwei Liu1,3 Ping Luo3,1 Xiaogang Wang2,3 Xiaoou Tang1,3


1 Department of Information Engineering, The Chinese University of Hong Kong
2 Department of Electronic Engineering, The Chinese University of Hong Kong
3 Shenzhen Key Lab of Comp. Vis. & Pat. Rec., Shenzhen Institutes of Advanced Technology, CAS, China
{lz013,pluo,xtang}@ie.cuhk.edu.hk, xgwang@ee.cuhk.edu.hk

Abstract

Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently. LNet is pre-trained by massive general object categories for face localization, while ANet is pre-trained by massive face identities for attribute prediction. This framework not only outperforms the state-of-the-art with a large margin, but also reveals valuable facts on learning face representation. (1) It shows how the performances of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images have strong indication of face locations. This fact enables training LNet for face localization with only image-level annotations, but without the face bounding boxes or landmarks required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and such concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained with a sparse linear combination of these concepts.

Figure 1. (a) Inaccurate localization and alignment lead to prediction errors on attributes by existing methods (HOG(landmarks)+SVM). (b) LNet localizes face regions by averaging the response maps of attribute filters; ANet predicts attributes without alignment. (c) Face localization with the averaged response map when LNet is trained with different numbers of attributes (5, 10, 20, 40). (Best viewed in color)

1. Introduction

Face attributes are beneficial for multiple applications such as face verification [15, 2, 24], identification [20], and retrieval. Predicting face attributes from images in the wild is challenging, because of complex face variations such as poses, lightings, and occlusions, as shown in Fig. 1.

Attribute recognition methods are generally categorized into two groups: global and local methods. Global methods extract features from the entire object, where accurate locations of object parts or landmarks are not required; they are not robust to deformations of objects [23]. Recent local models [15, 4, 5, 2, 19, 32] first detect object parts and extract features from each part. These local features are concatenated to train classifiers. For example, Kumar et al. [15] predicted face attributes by extracting hand-crafted features from ten face parts. Zhang et al. [32] recognized human attributes by employing hundreds of poselets [4] to align human body parts. These local methods may fail when unconstrained face images with complex variations are present, which makes face localization and alignment difficult. As shown in Fig. 1 (a), HOG+SVM fails because the faces or landmarks are wrongly localized or misaligned, and thus the features are extracted at wrong positions [25]. Recent research shows that face localization and alignment are still not well-solved problems, especially in the wild condition, although much progress has been achieved in the past decade. This is also borne out by our experimental results.

∗ Project page: http://personal.ie.cuhk.edu.hk/~lz013/projects/FaceAttributes.html
3730
This work revisits global methods by proposing a novel deep learning framework, which integrates two CNNs, LNet and ANet, where LNet locates the entire face region and ANet extracts a high-level face representation from the located region. The novelties are in three aspects. Firstly, LNet is trained in a weakly supervised manner, i.e. only image-level attribute tags of training images are provided, making data preparation much easier. This is different from training face and landmark detectors, where face bounding boxes and landmark positions are required. LNet is pre-trained by classifying massive general object categories, such that its pre-trained features have good generalization capability for handling large background clutters. LNet is then fine-tuned by attribute tags. We demonstrate that features learned in this way are effective for face localization and also can distinguish subtle differences between human faces and analogous patterns, such as a cat face.

Secondly, ANet extracts a discriminative face representation, making attribute recognition from the entire face region possible. ANet is pre-trained by classifying massive face identities and is fine-tuned by attributes. We show that the pre-training step enables ANet to account for complex variations in the unconstrained face images.

Thirdly, within the rough locations of face regions provided by LNet, averaging the predictions of multiple patches can improve the performance. A simple way is to evaluate the feed-forward pass for each single patch. However, this is slow and has a lot of redundant computation. A novel fast feed-forward scheme is proposed to replace patch-by-patch evaluation. It evaluates images of arbitrary sizes with only a one-pass feed-forward operation. This becomes non-trivial if the filters are locally shared, while studies [27, 26] showed that locally shared filters perform better in face related tasks. This is solved by proposing an interweaved operation.

Besides proposing new methods, our framework also reveals valuable facts on learning face representation. They not only motivate this work but also benefit future research on faces and deep learning. (1) It shows how pre-training with massive object categories and massive identities can improve feature learning for face localization and attribute recognition, respectively. (2) It demonstrates that although filters of LNet are fine-tuned by attribute tags, their response maps over the entire image give a strong indication of face location. Good features for face localization should be able to capture rich face variations, and more supervised information on these variations improves the learning process. The examples in Fig. 1 (c) show that as the number of attributes decreases, the localization capability of the learned neurons gets reduced dramatically. (3) ANet is pre-trained with massive face identities. It discloses that the pre-trained high-level hidden neurons of ANet implicitly learn and discover semantic concepts that are related to identity, such as race, gender, and age. It indicates that when a deep model is pre-trained for face recognition, it implicitly learns attributes. The performance of attribute prediction drops without this pre-training stage.

The main contributions are summarized as follows. (1) We propose a novel deep learning framework, which combines massive objects and massive identities to pre-train two CNNs for face localization and attribute prediction, respectively. It achieves state-of-the-art attribute classification results on both the challenging CelebFaces [26] and LFW [12] datasets, improving existing methods by 8 and 13 percent, respectively. (2) A novel fast feed-forward algorithm for CNNs with locally shared filters is devised. (3) Our study reveals multiple valuable facts on learning face representation by deep models. (4) We also contribute a large facial attribute database with more than eight million attribute labels; it is 20 times larger than the largest publicly available dataset.

1.1. Related Work

Extracting hand-crafted features at pre-defined landmarks has become a standard step in attribute recognition [9, 15, 4, 2]. Kumar et al. [15] extracted HOG-like features on various face regions to tackle attribute classification and face verification. To improve the discriminativeness of hand-crafted features given a specific task, Bourdev et al. [4] built a three-level SVM system to extract higher-level information. Deep learning [18, 34, 23, 7, 19, 32, 31, 13, 33, 22, 3, 28] recently achieved great success in attribute prediction, due to the ability of deep models to learn compact and discriminative features. Razavian et al. [23] and Donahue et al. [7] demonstrated that off-the-shelf features learned by the CNN of ImageNet [13] can be effectively adapted to attribute classification. Zhang et al. [32] showed that better performance can be achieved by ensembling learned features of multiple pose-normalized CNNs. The main drawback of these methods is that they rely on accurate landmark detection and pose estimation in both training and testing steps. Even though a recent work [31] can perform automatic part localization during test, it still requires landmark annotations of the training data.

2. Our Approach

Framework Overview Fig.2 illustrates our pipeline, where LNet locates the entire face region in a coarse-to-fine manner as shown in (a) and (b), while ANet extracts features for attribute recognition as shown in (c). Different from existing works that rely on accurate face and landmark annotations, LNet is trained in a weakly supervised manner with only image-level annotations. Specifically, it is pre-trained with one thousand object categories of ImageNet [6] and fine-tuned by image-level attribute tags. The former step accounts for background clutters, while the latter step learns features robust to complex face variations.
Figure 2. The proposed pipeline of attribute prediction: (a) LNeto takes the full m × n image xo; (b) LNets takes the head-shoulder region xs; (c) ANet extracts the high-level feature hf (FC) from the face region xf; (d) feature vectors (FCs) are extracted from overlapping patches of xf and each is fed to a linear SVM to predict attributes such as Wavy Hair, No Beard, High Cheekbones, and Smiling. (Best viewed in color)

clutters, while the latter step learns features robust to trained with 1, 000 general object categories from the Im-
complex face variations. Learning LNet in this way not only ageNet Large Scale Visual Recognition Challenge (ILSVR-
significantly reduces data labeling, but also improves the C) 2012 [6], containing 1.2 million training images and
accuracy of face localization. Both LNeto and LNets have 50 thousands validation images. All the data is employed
network structures similar to AlexNet [13], whose hyper for pre-training except one third of the validation data
parameters are specified in Fig.2 (a) and (b) respectively. for choosing hyper-parameters [13]. We augment data by
The fifth convolutional layer (C5) of LNeto indicates head- cropping ten patches from each image, including one patch
shoulders while C5 of LNets indicates faces, with their at the center and four at the corners, and their horizontal
highly responsed regions in their averaged response maps. flips. We adopt softmax for object classification, which
Moreover, the input xo of LNeto is a m × n image, while is optimized by stochastic gradient descent (SGD) with
the input xs of LNets is the head-shoulder region, which is back-propagation (BP) [16]. As shown in Fig.3 (a.2), the
localized by LNeto and resized to 227 × 227. averaged response map in C5 of LNeto already indicates lo-
As illustrated in Fig.2 (c), ANet is learned to predict cations of objects including human faces after pre-training.
attributes y by providing the input face region xf , which is Fine-tuning LNet Both LNeto and LNets are fine-tuned
detected by LNets and properly resized. Specifically, multi- with attribute tags. Additional output layers are added to
view versions [13] of xf are utilized to train ANet. Further- the LNets individually for fine-tuning and then removed for
more, ANet contains four convolutional layers, where the evaluation. LNeto adopts the full image xo as input while
filters of C1 and C2 are globally shared and the filters of C3 LNets uses the highly responsed region xs in the averaged
and C4 are locally shared. The effectiveness of local filters response map in C5 of LNeto as input, which roughly re-
have been demonstrated in many face related tasks [25, 27]. spond to head-shoulders. The cross-entropy loss is used for
To handle complex face variations, ANet is pre-trained by attribute classification, i.e. L =

y log p(yi |x) + (1 −
i=1 i
distinguishing massive face identities, which facilitates the 
yi ) log 1 − p(yi |x) , where p(yi = 1|x) = 1+exp(−f 1
(x))
learning of discriminative features. is the probability of the i-th attribute given image x. As
Fig.2 (d) outlines the procedure of attribute recognition. shown in Fig.3 (a.3), the response maps after fine-tuning
ANet extracts a set of feature vectors (FCs) by cropping become much more clean and smooth, indicating that the
overlapping patches on xf . An efficient feed-forward filters learned by attribute tags can detect face patterns with
algorithm is developed to reduce redundant computation complex variations. To appreciate the effectiveness of pre-
in the feature extraction stage. SVMs [8] are trained to training, we also include the averaged response map in C5
predict attribute values given each FC. The final prediction of being directly trained from scratch with attribute tags but
is obtained by averaging all these values, to cope with small without pre-training in Fig.3 (a.4). It cannot separate face
misalignment of face localization. regions from background and other body parts well.

2.1. Face Localization Thresholding and Proposing Windows We show that


the responses of C5 in LNet are discriminative enough
The cascade of LNeto and LNets accurately localizes to separate faces and background by simply searching a
face regions by being trained on image-level attribute tags. threshold, such that a window with response larger than
Pre-training LNet Both LNeto and LNets are pre- this threshold corresponding to face and otherwise is back-

3732
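A minimal sketch of this per-attribute cross-entropy, written in its standard negated (minimized) form; the function and argument names are illustrative rather than from the paper:

import numpy as np

def attribute_loss(logits, labels):
    """Per-attribute sigmoid cross-entropy, summed over attributes.

    `logits` stand in for the raw network outputs f(x), one per attribute,
    and `labels` are the binary attribute tags y_i.
    """
    p = 1.0 / (1.0 + np.exp(-logits))      # p(y_i = 1 | x)
    eps = 1e-12                            # numerical safety, not in the paper
    return -np.sum(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))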
Thresholding and Proposing Windows We show that the responses of C5 in LNet are discriminative enough to separate faces and background by simply searching for a threshold, such that a window with response larger than this threshold corresponds to a face and is otherwise background. To determine the threshold, we select 2000 images, each of which contains a single face, and 2000 background images from the SUN dataset [29]. For each image, EdgeBox [35] is adopted to propose 500 candidate windows, each of which is measured by a score that sums over its response values normalized by its window size. A larger score indicates that the localized pattern is more likely to be a face. Each image is then represented by the maximum score over all its windows. In Fig.3 (b), the histogram of the maximum scores shows that these scores clearly separate face images from background images. The threshold is chosen as the decision boundary as shown in Fig.3 (b). More results are given in Fig.6 (a), showing that the above strategy can precisely localize a face within a single test image. Since each training image only contains one single face, we localize a face region using the window with the largest score during training.

Figure 3. (a.1) Original image. (a.2)-(a.4) are averaged response maps in C5 of LNeto after pre-training (a.2), fine-tuning (a.3), and directly training from scratch with attribute tags but without pre-training (a.4). (b) Determining the threshold from the histograms of maximum scores (percentage of images vs. maximum score) on face and background images.

To understand why rich attribute information enables accurate face localization, one could consider the examples in Fig.4. If only a single detector [17, 21] is used to classify all the positive and negative samples in Fig.4 (a), it is difficult to handle complex face variations. Therefore, multi-view face detectors [30] were developed, as in Fig.4 (b), i.e. face images in different views are handled by different detectors. View labels were used in training the detectors and the whole training set is divided into subsets according to views. If views are treated as one type of face attribute, learning face representation by predicting attributes with deep models actually extends this idea to the extreme. As shown in Fig.4 (c), a filter (or a group of filters) functions as a detector of an attribute. When a subset of neurons are activated, they indicate the existence of face images with a particular attribute configuration. The neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations, and each activation pattern corresponds to one subset (e.g. 'pointy nose', 'rosy cheek', and 'smiling'). Therefore, it is not surprising that filters learned by attributes lead to effective representations for face localization.

Figure 4. Face localization by attributes: (a) single detector, (b) multi-view detector (View 1 ... View N), (c) face localization by attributes (Attr Config 1 ... Attr Config N).

2.2. Attribute Prediction

As shown in Fig.2 (c) and (d), ANet is learned to extract features and SVM classifiers are used to predict attributes. Specifically, in the pre-training stage, ANet is trained by classifying massive face identities. In the fine-tuning stage, we first extend the localized face region, which is properly resized, by a small factor to incorporate more context information. Then, multiple patches are cropped from the enlarged face region and utilized as inputs of ANet. ANet is fine-tuned by attributes to learn the high-level feature FC. Furthermore, as shown in Fig.2 (d), each feature vector is adopted to train an SVM classifier for attribute prediction. The above strategy is similar to multi-view data augmentation [13], increasing the robustness of attribute recognition. In the testing stage, attributes are predicted by averaging the SVM scores over all the patches.

Pre-training of ANet We introduce how to learn discriminative features by pre-training ANet with a large number of identities. We select eight thousand face identities from the CelebFaces [26] dataset, where each identity has around twenty images; there are over 160 thousand training images in total. A simple way to train ANet is to classify eight thousand categories with the softmax loss. However, this is challenging because the number of samples of each identity is limited, making it hard to maintain intra-class invariance. To improve intra-class invariance, we employ a similarity loss similar to [26, 10], which decreases the distances between samples of the same identity: L = Σ_{i,j=1, y_i=y_j}^{|D|} ||FC_i − FC_j||₂², where FC_i and FC_j denote the feature vectors of the i-th and j-th face images respectively, and y_i = y_j indicates that the identities of these samples are the same. In summary, ANet is pre-trained by combining the softmax loss and the similarity loss.

Efficient Feature Extractions In test, ANet is evaluated on multiple patches of the face region as shown in Fig.2 (d), leading to redundant convolutional computations because of the large overlaps between these patches. When all the filters are globally shared, the computational cost can be reduced by applying [11], which convolves the filters with the input image and then obtains a feature vector for each patch by pooling over the last convolutional layer. Given a simple example with one convolutional layer as shown in Fig.5 (a), the feature vector FC for each patch (e.g. the rectangle in red) can be extracted by pooling in the corresponding region of the response map h^(1), without evaluating convolutions in the input image patch-by-patch. Therefore, it shares the convolutions for every patch.
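The globally shared case can be sketched in a few lines. Assuming SciPy is available, and with invented interfaces (`image`, `kernel`, and `patches` given as coordinates on the response map), one pass of shared convolution followed by per-patch pooling might look like:

import numpy as np
from scipy.signal import correlate2d

def patch_features(image, kernel, patches, pool=np.max):
    """Sketch of the globally shared case in the spirit of [11]: convolve the
    filter over the whole image once, then pool each patch's feature from the
    corresponding region of the response map h^(1)."""
    h1 = correlate2d(image, kernel, mode="valid")   # one shared convolution
    feats = []
    for (y0, x0, y1, x1) in patches:                # patch coords on h1
        feats.append(pool(h1[y0:y1, x0:x1]))
    return np.array(feats)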
Figure 5. Detailed pipeline of efficient feature extractions in ANet: (a) global convolution, (b) local convolution, (c) feature extraction with the interweaved operation, (d) the interweaved operation.
However, this scheme is not applicable when we have more than two convolutional layers whose filters are locally shared. An example is illustrated in Fig.5 (b), where each patch is equally divided into 3 × 3 = 9 cells and we learn different filters for different cells. To reduce computations in the first convolutional layer, each local filter can be applied on the entire image, resulting in a response map with nine channels, i.e. h_i^(1), i = 1...9. The final response map h^(1) is obtained by cropping and padding the regions (i.e. the rectangles in black) in these 9 channels. As a result, each feature vector FC can be pooled from h^(1), without convolving the input image patch-by-patch. Nevertheless, since h^(1) corresponds to a patch of the input image, the succeeding local convolutions have to be handled patch-by-patch, leading to redundant computations.

To this end, we propose an interweaved operation, which is a fast feed-forward method for CNNs with locally shared filters. Suppose we have four local filters in the next locally convolutional layer and each filter is applied on 2 × 2 cells of h^(1) as shown in (b). These cells are the receptive fields of the filters, including {1, 2, 4, 5}, {2, 3, 5, 6}, {4, 5, 7, 8}, and {5, 6, 8, 9}. Instead of directly applying the local filters on h^(1), the interweaved operation generates an interweaved map I_i^(1) for each filter, where i = 1...4. Each local filter is then applied on its corresponding interweaved map. Since the interweaved map captures the entire image, each local filter is turned into a global filter such that its computation can be shared across different patches.

Specifically, each interweaved map, e.g. I_1^(1), is obtained by padding the cells of the corresponding channels, e.g. h_i^(1) for i ∈ {1, 2, 4, 5}, in an interweaved manner, as shown in Fig.5 (d). All of the interweaved maps are illustrated in Fig.5 (c). After that, each of the four local filters is applied on its corresponding interweaved map, leading to four response maps h_i^(2), where i = 1...4. As a result, the feature vector FC is pooled and concatenated from the receptive fields of the filters, which are the rectangles in black as shown in (c).

Intuitively, instead of padding cells according to the receptive fields of all the local filters (e.g. h^(1) in (b)), which has to be performed in a patch-by-patch way, the interweaved operation pads the cells with respect to the receptive field of each local filter over the entire image. It enables extracting multiple feature vectors with only one pass of feed-forward evaluation. This operation can be repeated when more locally convolutional layers are added. The proposed feature extraction scheme has achieved a 6× speedup empirically when compared with patch-by-patch scanning. It is applicable to CNNs with local filters and compatible with all existing CNN operations.
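A 1-D toy sketch (not the authors' implementation) of the core idea behind sharing locally shared filters: every cell filter is run once over the entire signal, turning each local filter into a global one, and each patch's local-convolution responses are then gathered from the resulting maps instead of being recomputed patch-by-patch:

import numpy as np

def patch_local_conv(x, w, p):
    """Patch-by-patch evaluation: the length-6 patch starting at p is split
    into three length-2 cells, each with its own filter w[c]."""
    return np.array([w[c] @ x[p + 2 * c : p + 2 * c + 2] for c in range(3)])

def shared_local_conv(x, w, p):
    """Shared evaluation in the spirit of the interweaved operation: every
    cell filter is first correlated with the whole signal, and the patch's
    responses are then gathered from these global maps."""
    g = [np.correlate(x, w[c], mode="valid") for c in range(3)]
    return np.array([g[c][p + 2 * c] for c in range(3)])

x, w = np.random.randn(32), np.random.randn(3, 2)
for p in range(8):   # identical responses at every patch position
    assert np.allclose(patch_local_conv(x, w, p), shared_local_conv(x, w, p))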
attribute labels, respectively.
and {5, 6, 8, 9}. Instead of directly applying the local filters
on h1 , the interweaved operation generates an interweaved CelebA is partitioned into three parts. Images of the
(1) first eight thousand identities (with 160 thousand images)
map Ii for each filter, where i = 1...4. Each local filter
are used to pre-train and fine-tune ANet and LNet, and
is then apply on its corresponding interweaved map. Since
the images of another one thousand identities (with twenty
the interweaved map capturing the entire image, each local
thousand images) are employed to train SVM. The images
filter is turned into a global filter such that its computation
of the remaining one thousand identities (with twenty
can be shared across different patches.
(1) thousand images) are used for testing. LFWA is partitioned
Specifically, each interweaved map, e.g. I1 , is achieved into half for training and half for testing. Specifically, 6, 263
by padding the cells of the corresponding channels in an images are adopted to train SVM and the remaining images
(1)
interweaved manner, e.g. hi={1,2,4,5} , as shown in Fig.5 for test. When being evaluated on LFWA, LNet and ANet
(d). All of the interweaved maps are illustrated in Fig.5 are trained on CelebA.
(c). After that, each of the four local filters is applied on its Methods for Comparisons The proposed method is
corresponding interweaved map, leading to four response compared with three competitive approaches, i.e. FaceTrac-
(2)
maps hi , where i = 1...4. As a result, the feature vector er [14], PANDA-w [32], and PANDA-l [32]. FaceTracer
FC is pooled and concatenated from the receptive fields of extracts HOG and color histograms in several important
the filters, which are the rectangles in black as shown in (c). functional face regions and then trains SVM for attribute
Intuitively, instead of padding cells according to the classification. We extract these functional regions referring

3734
PANDA-w and PANDA-l are based on PANDA [32], which was proposed recently for human attribute recognition by ensembling multiple CNNs, each of which extracts features from a well-aligned human part. These features are concatenated to train SVMs for attribute recognition. It is straightforward to adapt this method to face attributes, since face parts can be well-aligned by landmark points. Here, we consider two settings: PANDA-w obtains the face parts by applying state-of-the-art face detection [17] and alignment [25] on wild images, while PANDA-l attains the face parts by using ground truth landmark points. For fair comparison, all the above methods are trained with the same data as ours.

Figure 6. Averaged response maps of LNet on (a) CelebA, (b) MobileFaces, and (c) some failure cases.

3.1. Effectiveness of the Framework

This section demonstrates the effectiveness of the framework. All experiments in this section are done on CelebA.

• LNet

Performance Comparison We compare LNet with four state-of-the-art face detectors, including DPM [21], ACF Multi-view [30], SURF Cascade [17], and Face++ [1]. We evaluate them by using ROC curves when IoU (Intersection over Union) ≥ 0.5. As plotted in Fig.7(a), when FPPI = 0.01, the true positive rates of Face++ and LNet are 85% and 93%; when FPPI = 0.1, our method outperforms the other three methods by 11, 9, and 22 percent respectively. We also investigate how these methods perform with respect to the overlap ratio (IoU), following [35, 21]. Fig.7(c) shows that LNet generally provides more accurate face localization, leading to good performance in the subsequent attribute prediction.

Figure 7. ROC curves (true positive rates vs. false positives per image) on (a) CelebA and (b) MobileFaces, comparing LNet with DPM [21], ACF Multi-view [30], SURF Cascade [17], and Face++ [1]. (c) Recall rates w.r.t. overlap ratio (FPPI = 0.1), including LNet (w/o pre-training). (d) Recall rates w.r.t. number of attributes (FPPI = 0.1).

Further Analysis LNet significantly outperforms LNet (without pre-training) by 74 percent when the overlap ratio equals 0.5, which validates the effectiveness of pre-training, as shown in Fig.7(c). We then explore the influence of the number of attributes on localization. Fig.7(d) illustrates that rich attribute information facilitates face localization.

To examine the generalization ability of LNet, we collect another 3,876 face images for testing, namely MobileFaces, which comes from a different source and has a different distribution from CelebA (MobileFaces was collected by users with mobile phones, while CelebA and LFWA collected face images of celebrities taken by professional photographers). Several examples of MobileFaces are shown in Fig.6(b) and the corresponding ROC curves are plotted in Fig.7(b). We observe that LNet consistently performs better and still gains a 7 percent improvement (FPPI = 0.1) compared with the other face detectors. Despite some failure cases due to extreme poses and large occlusions, LNet accurately localizes faces in the wild, as demonstrated in Fig.6.

• ANet

Pre-training Discovers Semantic Concepts We show that pre-training of ANet can implicitly discover semantic concepts related to face identity. Given a hidden neuron at the FC layer of ANet as shown in Fig.2(c), we partition the face images into three groups, containing the face images with high, medium, and low responses at this neuron. The face images of each group are then averaged to obtain the mean face. We visualize these mean faces for several neurons in Fig.8(a). Interestingly, these mean faces change smoothly from high response to low response, following a high-level concept. Humans can easily assign each neuron a semantic concept it measures (i.e. the text in yellow).
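A hedged sketch of this visualization procedure, assuming aligned face crops and pre-computed FC activations; a tercile split is assumed here, since the paper does not specify how the three response groups are thresholded:

import numpy as np

def mean_faces_for_neuron(images, activations, neuron):
    """Partition face images into high / medium / low response groups at one
    FC neuron and average each group into a mean face. `images` is an
    (N, H, W) array of aligned crops and `activations` is (N, num_neurons);
    both names are illustrative."""
    a = activations[:, neuron]
    lo, hi = np.percentile(a, [33, 66])            # tercile split (assumed)
    groups = [a >= hi, (a > lo) & (a < hi), a <= lo]
    return [images[g].mean(axis=0) for g in groups]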
Figure 8. Visualization of neurons in ANet (a) after pre-training and (b) after fine-tuning. (a.1)-(a.6) show mean faces ordered from high to low response for neurons measuring Gender, Hair Color, Age, Race, Face Shape, and Eye Shape. (b.1)-(b.3) show test images (Bangs, Eyeglasses, Wearing Hat) together with their activations and top-ranked neurons, e.g. Brown Hair, Pale Skin, Narrow Eyes, High Cheekbones; Mustache, Black Hair, Smiling, Big Nose; Blond Hair, Wearing Lipstick, Asian, Big Eyes. (Best viewed in color)

For example, the neurons in (a.1) and (a.4) correspond to 'gender' and 'race', respectively. This reveals that the high-level hidden neurons of ANet can implicitly learn to discover semantic concepts, even though they are only optimized for face recognition using identity information, and attribute labels are not used in pre-training. We also observe that most of these concepts are intrinsic to face identity, such as the shape of facial components, gender, and race.

To better explain this phenomenon, we compare the accuracy of attribute prediction using features at different layers of ANet right after pre-training: FC, C4, and C3. The forty attributes are roughly separated into two groups: identity-related attributes, such as gender and race, and identity-non-related attributes, e.g. attributes of expressions, wearing hat, and sunglasses. We select some representative attributes for each group and plot the results in Fig.9(a), which shows that FC outperforms C4 and C3 on the identity-related attributes, but is relatively weaker on the identity-non-related attributes. This is because the top layer FC learns identity features, which are insensitive to intra-personal face variations.

Figure 9. (a) Layer-wise comparison of ANet (FC, C4, C3) after pre-training, on identity-related attributes (Male, White, Black, Asian) and identity-non-related attributes (Smiling, Wearing Hat, Rosy Cheeks, 5 o'Clock Shadow). (b) Best performing neurons analysis of ANet after fine-tuning, comparing the average accuracy of ANet (after fine-tuning) and HOG (after PCA) as the percentage of best performing neurons used decreases; the single best performing neuron is also marked. Best performing neurons are different for different attributes; the reported accuracies are averaged over attributes, each of which selects its own subset of best performing neurons.

Fine-tuning Expands Semantic Concepts Fig.8 shows that after fine-tuning, ANet can expand these concepts to more attribute types. Fig.8(b) visualizes the neurons in the FC layer, which are ranked by their responses in descending order with respect to several test images. Humans can assign a semantic meaning to each of these neurons, and a large number of new concepts can be observed. Remarkably, these neurons express diverse high-level meanings and cooperate to explain the test images. The activations of all the neurons are visualized in Fig.8(b), and they are sparse. In some sense, the attributes presented in each test image are explained by a sparse linear combination of these concepts. For instance, the first image is described as "a lady with bangs, brown hair, pale skin, narrow eyes and high cheekbones", which well matches human perception.

To validate this, we explore how the number of neurons influences attribute prediction accuracies. The best performing neurons for each attribute are identified by sorting the corresponding SVM weights. Fig.9(b) illustrates that only 10% of ANet's best performing neurons are needed to achieve 90% of the original performance on a particular attribute (the best performing neurons are different for different attributes). In contrast, HOG+PCA does not have this sparse nature and needs more than 95% of the features. Besides, the best single performing neuron of ANet outperforms that of HOG+PCA by 25 percent in average prediction accuracy.

3.2. Attribute Prediction

Performance Comparison The attribute prediction performance is reported in Table 1. On CelebA, the prediction accuracies of FaceTracer [14], PANDA-w [32], PANDA-l [32], and our LNets+ANet are 81, 79, 85, and 87 percent respectively, while the corresponding accuracies on LFWA are 74, 71, 81, and 84 percent. Our method outperforms PANDA-w by nearly 10 percent. Remarkably, even when PANDA-l is equipped with ground truth bounding boxes and landmark positions, our method still achieves a 3 percent gain. The strength of our method is illustrated not only on global attributes, e.g. "Chubby" and "Young", but also on fine-grained facial traits, e.g. "Mustache" and "Pointy Nose". We also report performance on 19 extended attributes and compare our results with [14] and [2].
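The Fig.9(b) analysis can be sketched as follows, assuming a linear SVM without a bias term for one attribute; all names are illustrative:

import numpy as np

def top_neuron_accuracy(svm_weights, features, labels, fraction):
    """Keep only the best performing fraction of neurons for one attribute,
    ranked by |SVM weight|, and re-score with the truncated linear
    classifier; returns the resulting prediction accuracy."""
    k = max(1, int(fraction * len(svm_weights)))
    keep = np.argsort(-np.abs(svm_weights))[:k]    # top-k neurons
    scores = features[:, keep] @ svm_weights[keep]
    preds = (scores > 0).astype(int)
    return (preds == labels).mean()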
Attributes (1/2): 5 Shadow | Arch. Eyebrows | Attractive | Bags Un. Eyes | Bald | Bangs | Big Lips | Big Nose | Black Hair | Blond Hair | Blurry | Brown Hair | Bushy Eyebrows | Chubby | Double Chin | Eyeglasses | Goatee | Gray Hair | Heavy Makeup | H. Cheekbones | Male

CelebA
  FaceTracer [14]   85 76 78 76 89 88 64 74 70 80 81 60 80 86 88 98 93 90 85 84 91
  PANDA-w [32]      82 73 77 71 92 89 61 70 74 81 77 69 76 82 85 94 86 88 84 80 93
  PANDA-l [32]      88 78 81 79 96 92 67 75 85 93 86 77 86 86 88 98 93 94 90 86 97
  [17]+ANet         86 75 79 77 92 94 63 74 77 86 83 74 80 86 90 96 92 93 87 85 95
  LNets+ANet(w/o)   88 74 77 73 95 92 66 75 84 91 80 78 85 86 88 96 92 93 85 84 94
  LNets+ANet        91 79 81 79 98 95 68 78 88 95 84 80 90 91 92 99 95 97 90 87 98
LFWA
  FaceTracer [14]   70 67 71 65 77 72 68 73 76 88 73 62 67 67 70 90 69 78 88 77 84
  PANDA-w [32]      64 63 70 63 82 79 64 71 78 87 70 65 63 65 64 84 65 77 86 75 86
  PANDA-l [32]      84 79 81 80 84 84 73 79 87 94 74 74 79 69 75 89 75 81 93 86 92
  [17]+ANet         78 66 75 72 86 84 70 73 82 90 75 71 69 68 70 88 68 82 89 79 91
  LNets+ANet(w/o)   81 78 80 79 83 84 72 76 86 94 70 73 79 70 74 92 75 81 91 83 91
  LNets+ANet        84 82 83 83 88 88 75 81 90 97 74 77 82 73 78 95 78 84 95 88 94

Attributes (2/2): Mouth S. O. | Mustache | Narrow Eyes | No Beard | Oval Face | Pale Skin | Pointy Nose | Reced. Hairline | Rosy Cheeks | Sideburns | Smiling | Straight Hair | Wavy Hair | Wear. Earrings | Wear. Hat | Wear. Lipstick | Wear. Necklace | Wear. Necktie | Young | Average

CelebA
  FaceTracer [14]   87 91 82 90 64 83 68 76 84 94 89 63 73 73 89 89 68 86 80 81
  PANDA-w [32]      82 83 79 87 62 84 65 82 81 90 89 67 76 72 91 88 67 88 77 79
  PANDA-l [32]      93 93 84 93 65 91 71 85 87 93 92 69 77 78 96 93 67 91 84 85
  [17]+ANet         85 87 83 91 65 89 67 84 85 94 92 70 79 77 93 91 70 90 81 83
  LNets+ANet(w/o)   86 91 77 92 63 87 70 85 87 91 88 69 75 78 96 90 68 86 83 83
  LNets+ANet        92 95 81 95 66 91 72 89 90 96 92 73 80 82 99 93 71 93 87 87
LFWA
  FaceTracer [14]   77 83 73 69 66 70 74 63 70 71 78 67 62 88 75 87 81 71 80 74
  PANDA-w [32]      74 77 68 63 64 64 68 61 64 68 77 68 63 85 78 83 79 70 76 71
  PANDA-l [32]      78 87 73 75 72 84 76 84 73 76 89 73 75 92 82 93 86 79 82 81
  [17]+ANet         76 79 74 69 66 68 72 70 71 72 82 72 65 87 82 86 81 72 79 76
  LNets+ANet(w/o)   78 87 77 75 71 81 76 81 72 72 88 71 73 90 84 92 83 76 82 79
  LNets+ANet        82 92 81 79 74 84 80 85 78 77 91 76 76 94 88 95 88 79 86 84

Table 1. Performance comparison of attribute prediction (accuracies in percent; attribute columns listed in alphabetical order). Note that FaceTracer and PANDA-l attain the face parts by using ground truth landmark points.
Attributes: A. Eye. | Asian | Bald | B. Nose | Black | Black H. | Blond H. | B. Eye. | Eye. | Gender | M. Aged | Mustache | No Beard | No Eye. | R. Hair. | R. Jaw | Senior | White | Youth | Average

  FaceTracer [14]   91 87 86 75 66 54 70 66 68 72 84 86 83 76 72 66 65 81 51 73
  POOF [2]          92 90 81 90 71 60 80 67 75 67 87 90 86 72 74 71 68 77 55 76
  LNets+ANet        94 85 83 87 80 77 81 86 89 84 85 84 86 83 82 75 79 78 81 83

Table 2. Performance comparison on the 19 extended attributes (columns listed in alphabetical order; performance is measured by the average of true positive rates and true negative rates).
The evaluation protocol is the same as [2]. In Table 2, LNets+ANet outperforms them by 10 and 7 percent respectively.

Further Analysis When compared with [17]+ANet, LNets accounts for a nearly 6 percent improvement over using an off-the-shelf face detector [17]. We also experiment with the case of providing ANet with the face region localized by LNets, but without pre-training, denoted as LNets+ANet(w/o). The average accuracies drop by 4 and 5 percent on CelebA and LFWA respectively, which indicates that pre-training with massive facial identities helps discover semantic concepts. To further examine whether the proposed approach can be generalized to unseen attributes, we manually label 30 more attributes for the testing images of LFWA. To test on these 30 attributes, we directly transfer the weights learned by the deep models to extract features, and only re-train the SVMs using one third of the images. LNets+ANet leads to 8, 10, and 3 percent average gains over the other three approaches (FaceTracer, PANDA-w, and PANDA-l).

Time Complexity For a 300 × 300 image, LNets takes 35 ms to localize the face region while ANet takes 14 ms to output the extracted features on a GPU. In contrast, naïve patch-by-patch scanning needs nearly 80 ms to extract features. Our framework has large potential in real-world applications.

4. Conclusion

This paper has proposed a novel deep learning framework for face attribute prediction in the wild. With carefully designed pre-training strategies, our method is robust to background clutters and face variations. We devise a new fast feed-forward algorithm for locally shared filters to save redundant computation, which enables evaluating images of arbitrary size in real time without normalization. We have also revealed multiple important facts about learning face representation, which shed light on new directions for face localization and representation learning.

Acknowledgement This work∗ was partially supported by the National Natural Science Foundation of China (91320101, 61472410, 61503366) and the Research Grants Council of Hong Kong (No. CUHK14207814).

∗ For more technical details, please contact the corresponding author Ping Luo via pluo.lhi@gmail.com.
References

[1] Face++. http://www.faceplusplus.com/.
[2] T. Berg and P. N. Belhumeur. Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, pages 955–962, 2013.
[3] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani. Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964, 2014.
[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV, pages 1543–1550, 2011.
[5] J. Chung, D. Lee, Y. Seo, and C. D. Yoo. Deep attribute networks. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 3, 2012.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[9] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778–1785, 2009.
[10] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735–1742, 2006.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[14] N. Kumar, P. Belhumeur, and S. Nayar. Facetracer: A search engine for large collections of images with faces. In ECCV, pages 340–353, 2008.
[15] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, pages 365–372, 2009.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
[17] J. Li and Y. Zhang. Learning surf cascade for fast and accurate object detection. In CVPR, pages 3468–3475, 2013.
[18] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, 2012.
[19] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In ICCV, pages 2864–2871, 2013.
[20] O. K. Manyam, N. Kumar, P. Belhumeur, and D. Kriegman. Two faces are better than one: Face recognition in group photographs. In IJCB, pages 1–8, 2011.
[21] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, pages 720–735, 2014.
[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685–694, 2015.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
[24] F. Song, X. Tan, and S. Chen. Exploiting relationship between attributes for improved face verification. CVIU, 122:143–154, 2014.
[25] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476–3483, 2013.
[26] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701–1708, 2014.
[28] Y. Tian, P. Luo, X. Wang, and X. Tang. Pedestrian detection aided by deep learning semantic tasks. In CVPR, 2015.
[29] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492, 2010.
[30] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In IJCB, pages 1–8, 2014.
[31] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV, pages 834–849, 2014.
[32] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
[33] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene CNNs. In ICLR, 2015.
[34] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.
[35] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.
Asynchronous Methods for Deep Reinforcement Learning

Volodymyr Mnih1 (vmnih@google.com)
Adrià Puigdomènech Badia1 (adriap@google.com)
Mehdi Mirza1,2 (mirzamom@iro.umontreal.ca)
Alex Graves1 (gravesa@google.com)
Tim Harley1 (tharley@google.com)
Timothy P. Lillicrap1 (countzero@google.com)
David Silver1 (davidsilver@google.com)
Koray Kavukcuoglu1 (korayk@google.com)

1 Google DeepMind
2 Montreal Institute for Learning Algorithms (MILA), University of Montreal

Abstract

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

1. Introduction

Deep neural networks provide rich representations that can enable reinforcement learning (RL) algorithms to perform effectively. However, it was previously thought that the combination of simple online RL algorithms with deep neural networks was fundamentally unstable. Instead, a variety of solutions have been proposed to stabilize the algorithm (Riedmiller, 2005; Mnih et al., 2013; 2015; Van Hasselt et al., 2015; Schulman et al., 2015a). These approaches share a common idea: the sequence of observed data encountered by an online RL agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay memory, the data can be batched (Riedmiller, 2005; Schulman et al., 2015a) or randomly sampled (Mnih et al., 2013; 2015; Van Hasselt et al., 2015) from different time-steps. Aggregating over memory in this way reduces non-stationarity and decorrelates updates, but at the same time limits the methods to off-policy reinforcement learning algorithms.

Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction, and it requires off-policy learning algorithms that can update from data generated by an older policy.

In this paper we provide a very different paradigm for deep reinforcement learning. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks.

Our parallel reinforcement learning paradigm also offers practical benefits. Whereas previous approaches to deep reinforcement learning rely heavily on specialized hardware such as GPUs (Mnih et al., 2015; Van Hasselt et al., 2015; Schaul et al., 2015) or massively distributed architectures (Nair et al., 2015), our experiments run on a single machine with a standard multi-core CPU. When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far less resource than massively distributed approaches.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.

2. Related Work

The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or to stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

3. Reinforcement Learning Background

We consider the standard reinforcement learning setting where an agent interacts with an environment E over a number of discrete time steps. At each time step t, the agent receives a state s_t and selects an action a_t from some set of possible actions A according to its policy π, where π is a mapping from states s_t to actions a_t. In return, the agent receives the next state s_{t+1} and a scalar reward r_t. The process continues until the agent reaches a terminal state, after which the process restarts. The return R_t = Σ_{k=0}^{∞} γ^k r_{t+k} is the total accumulated return from time step t with discount factor γ ∈ (0, 1]. The goal of the agent is to maximize the expected return from each state s_t.

The action value Q^π(s, a) = E[R_t | s_t = s, a] is the expected return for selecting action a in state s and following policy π. The optimal value function Q*(s, a) = max_π Q^π(s, a) gives the maximum action value for state s and action a achievable by any policy. Similarly, the value of state s under policy π is defined as V^π(s) = E[R_t | s_t = s] and is simply the expected return for following policy π from state s.

In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let Q(s, a; θ) be an approximate action-value function with parameters θ. The updates to θ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: Q*(s, a) ≈ Q(s, a; θ). In one-step Q-learning, the parameters θ of the action value function Q(s, a; θ) are learned by iteratively minimizing a sequence of loss functions, where the i-th loss function is defined as

L_i(θ_i) = E[ (r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i))² ],

where s' is the state encountered after state s.

We refer to the above method as one-step Q-learning because it updates the action value Q(s, a) toward the one-step return r + γ max_{a'} Q(s', a'; θ). One drawback of using one-step methods is that obtaining a reward r only directly affects the value of the state-action pair s, a that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value Q(s, a). This can make the learning process slow, since many updates are required to propagate a reward to the relevant preceding states and actions.

One way of propagating rewards faster is by using n-step returns (Watkins, 1989; Peng & Williams, 1996). In n-step Q-learning, Q(s, a) is updated toward the n-step return defined as r_t + γ r_{t+1} + ··· + γ^{n−1} r_{t+n−1} + γ^n max_a Q(s_{t+n}, a). This results in a single reward r directly affecting the values of n preceding state-action pairs, which makes the process of propagating rewards to relevant state-action pairs potentially much more efficient.

In contrast to value-based methods, policy-based model-free methods directly parameterize the policy π(a|s; θ) and update the parameters θ by performing, typically approximate, gradient ascent on E[R_t]. One example of such a method is the REINFORCE family of algorithms due to Williams (1992). Standard REINFORCE updates the policy parameters θ in the direction ∇_θ log π(a_t|s_t; θ) R_t, which is an unbiased estimate of ∇_θ E[R_t]. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state b_t(s_t), known as a baseline (Williams, 1992), from the return. The resulting gradient is ∇_θ log π(a_t|s_t; θ) (R_t − b_t(s_t)).

A learned estimate of the value function is commonly used as the baseline b_t(s_t) ≈ V^π(s_t), leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity R_t − b_t used to scale the policy gradient can be seen as an estimate of the advantage of action a_t in state s_t, or A(a_t, s_t) = Q(a_t, s_t) − V(s_t), because R_t is an estimate of Q^π(a_t, s_t) and b_t is an estimate of V^π(s_t). This approach can be viewed as an actor-critic architecture where the policy π is the actor and the baseline b_t is the critic (Sutton & Barto, 1998; Degris et al., 2012).

4. Asynchronous RL Framework

We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.

First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.

Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.

In addition to stabilizing learning, using multiple parallel actor-learners has multiple practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning, we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning, and advantage actor-critic.

Asynchronous one-step Q-learning: Pseudocode for our variant of Q-learning, which we call Asynchronous one-step Q-learning, is shown in Algorithm 1. Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. We use a shared and slowly changing target network in computing the Q-learning loss, as was proposed in the DQN training method. We also accumulate gradients over multiple timesteps before they are applied, which is similar to using minibatches. This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency.

Algorithm 1  Asynchronous one-step Q-learning - pseudocode for each actor-learner thread.

    // Assume global shared θ, θ−, and counter T = 0.
    Initialize thread step counter t ← 0
    Initialize target network weights θ− ← θ
    Initialize network gradients dθ ← 0
    Get initial state s
    repeat
        Take action a with ε-greedy policy based on Q(s, a; θ)
        Receive new state s' and reward r
        y = r                                  for terminal s'
            r + γ max_{a'} Q(s', a'; θ−)       for non-terminal s'
        Accumulate gradients wrt θ: dθ ← dθ + ∂(y − Q(s, a; θ))²/∂θ
        s = s'
        T ← T + 1 and t ← t + 1
        if T mod I_target == 0 then
            Update the target network θ− ← θ
        end if
        if t mod I_AsyncUpdate == 0 or s is terminal then
            Perform asynchronous update of θ using dθ
            Clear gradients dθ ← 0
        end if
    until T > T_max
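A toy sketch of Algorithm 1, with a tabular Q function standing in for the deep network, an invented 5-state chain environment, and updates applied immediately rather than accumulated for I_AsyncUpdate steps; the shared arrays are updated lock-free in the Hogwild! style described above:

import threading
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.99
Q = np.zeros((N_STATES, N_ACTIONS))    # shared parameters (theta)
Q_target = Q.copy()                    # shared target copy (theta^-)
T = [0]                                # shared global counter
I_TARGET, T_MAX, ALPHA = 100, 5000, 0.1

def step(s, a):
    """Toy chain: action 1 moves right, action 0 moves left;
    reward 1 and termination on reaching the last state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def actor_learner(eps):
    global Q_target
    s = 0
    while T[0] < T_MAX:
        a = np.random.randint(N_ACTIONS) if np.random.rand() < eps \
            else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        y = r if done else r + GAMMA * Q_target[s2].max()
        Q[s, a] += ALPHA * (y - Q[s, a])   # lock-free asynchronous update
        T[0] += 1
        if T[0] % I_TARGET == 0:
            Q_target = Q.copy()            # refresh the target network
        s = 0 if done else s2

threads = [threading.Thread(target=actor_learner, args=(e,))
           for e in (0.1, 0.3, 0.5)]       # a different epsilon per thread
for t in threads: t.start()
for t in threads: t.join()
print("Learned Q values:\n", Q)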
Asynchronous Methods for Deep Reinforcement Learning

ing minibatches. This reduces the chances of multiple ac- by tmax . The pseudocode for the algorithm is presented in
tor learners overwriting each other’s updates. Accumulat- Supplementary Algorithm S2.
ing updates over several steps also provides some ability to
As with the value-based methods we rely on parallel actor-
trade off computational efficiency for data efficiency.
Finally, we found that giving each thread a different exploration policy helps improve robustness. Adding diversity to exploration in this manner also generally improves performance through better exploration. While there are many possible ways of making the exploration policies differ, we experiment with using ε-greedy exploration with ε periodically sampled from some distribution by each thread.

Asynchronous one-step Sarsa: The asynchronous one-step Sarsa algorithm is the same as asynchronous one-step Q-learning as given in Algorithm 1 except that it uses a different target value for Q(s, a). The target value used by one-step Sarsa is r + γQ(s′, a′; θ−) where a′ is the action taken in state s′ (Rummery & Niranjan, 1994; Sutton & Barto, 1998). We again use a target network and updates accumulated over multiple timesteps to stabilize learning.
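The distinction between the two one-step targets, and the per-thread ε sampling, can be stated in a few lines. This is a sketch; the candidate ε values in the sampling example are illustrative only, as the paper defers the actual distribution to its supplementary material:

    import numpy as np

    def q_learning_target(r, q_next, gamma=0.99):
        return r + gamma * np.max(q_next)      # off-policy: max over Q(s', .; theta^-)

    def sarsa_target(r, q_next, a_next, gamma=0.99):
        return r + gamma * q_next[a_next]      # on-policy: action a' actually taken in s'

    # Each thread periodically draws its own exploration rate, e.g.:
    epsilon = np.random.choice([0.5, 0.1, 0.01])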
Asynchronous n-step Q-learning: Pseudocode for our variant of multi-step Q-learning is shown in Supplementary Algorithm S1. The algorithm is somewhat unusual because it operates in the forward view by explicitly computing n-step returns, as opposed to the more common backward view used by techniques like eligibility traces (Sutton & Barto, 1998). We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to tmax steps or until a terminal state is reached. This process results in the agent receiving up to tmax rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return, resulting in a one-step update for the last state, a two-step update for the second last state, and so on for a total of up to tmax updates. The accumulated updates are applied in a single gradient step.
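This forward-view mix of returns reduces to a single backward pass over the rollout; a minimal sketch, assuming a list of up to tmax rewards and a bootstrap value (zero at a terminal state, the value estimate of the last state otherwise):

    def n_step_targets(rewards, bootstrap, gamma=0.99):
        # Iterating backwards yields the longest possible n-step return for
        # each visited state: a one-step return for the last state, a
        # two-step return for the second last, and so on.
        R, targets = bootstrap, []
        for r in reversed(rewards):
            R = r + gamma * R
            targets.append(R)
        return list(reversed(targets))         # targets[i] pairs with (s_i, a_i)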
Asynchronous advantage actor-critic: The algorithm, which we call asynchronous advantage actor-critic (A3C), maintains a policy π(at|st; θ) and an estimate of the value function V(st; θv). Like our variant of n-step Q-learning, our variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every tmax actions or when a terminal state is reached. The update performed by the algorithm can be seen as ∇θ′ log π(at|st; θ′)A(st, at; θ, θv), where A(st, at; θ, θv) is an estimate of the advantage function given by Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θv) − V(st; θv), where k can vary from state to state and is upper-bounded by tmax. The pseudocode for the algorithm is presented in Supplementary Algorithm S2.

As with the value-based methods we rely on parallel actor-learners and accumulated updates for improving training stability. Note that while the parameters θ of the policy and θv of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy π(at|st; θ) and one linear output for the value function V(st; θv), with all non-output layers shared.

We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991), who found that it was particularly helpful on tasks requiring hierarchical behavior. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form ∇θ′ log π(at|st; θ′)(Rt − V(st; θv)) + β∇θ′H(π(st; θ′)), where H is the entropy. The hyperparameter β controls the strength of the entropy regularization term.
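For a softmax policy head, the per-step gradients just described have a simple closed form; the sketch below is an illustration of the stated update, not the paper's implementation, computed with respect to the policy logits:

    import numpy as np

    def a3c_step_grads(logits, value, action, R, beta=0.01):
        p = np.exp(logits - logits.max()); p /= p.sum()      # pi(.|s_t; theta')
        advantage = R - value                                # n-step estimate of A(s_t, a_t)
        dlogpi = -p.copy(); dlogpi[action] += 1.0            # d log pi(a_t|s_t) / d logits
        entropy = -np.sum(p * np.log(p + 1e-8))              # H(pi(s_t; theta'))
        dentropy = -p * (np.log(p + 1e-8) + entropy)         # d H / d logits for a softmax
        policy_grad = dlogpi * advantage + beta * dentropy   # ascent direction for theta'
        dvalue = -2.0 * advantage                            # d (R - V)^2 / d V, for descent
        return policy_grad, dvalue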
Optimization: We investigated three different optimization algorithms in our asynchronous framework – SGD with momentum, RMSProp (Tieleman & Hinton, 2012) without shared statistics, and RMSProp with shared statistics. We used the standard non-centered RMSProp update given by

g = αg + (1 − α)Δθ² and θ ← θ − ηΔθ/√(g + ε), (1)

where all operations are performed elementwise. A comparison on a subset of Atari 2600 games showed that a variant of RMSProp where statistics g are shared across threads is considerably more robust than the other two methods. Full details of the methods and comparisons are included in Supplementary Section 1.
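A direct rendering of equation (1) follows; with shared statistics, all threads read and write the same g array without locking. The learning rate shown is an arbitrary placeholder, not a value reported here:

    import numpy as np

    class SharedRMSProp:
        def __init__(self, shape, lr=1e-3, alpha=0.99, eps=1e-8):
            self.g = np.zeros(shape)           # shared across threads when desired
            self.lr, self.alpha, self.eps = lr, alpha, eps

        def update(self, theta, dtheta):
            # g = alpha * g + (1 - alpha) * dtheta^2, elementwise
            self.g = self.alpha * self.g + (1.0 - self.alpha) * dtheta ** 2
            # theta <- theta - eta * dtheta / sqrt(g + eps)
            theta -= self.lr * dtheta / np.sqrt(self.g + self.eps)
            return theta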
5. Experiments

We use four different platforms for assessing the properties of the proposed framework. We perform most of our experiments using the Arcade Learning Environment (Bellemare et al., 2012), which provides a simulator for Atari 2600 games. This is one of the most commonly used benchmark environments for RL algorithms. We use the Atari domain to compare against state of the art results (Van Hasselt et al., 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al., 2015; Mnih et al., 2015), as well as to carry out a detailed stability and scalability analysis of the proposed methods. We performed further comparisons using the TORCS 3D car racing simulator (Wymann et al., 2013). We also use two additional domains to evaluate only the A3C algorithm – MuJoCo and Labyrinth.
Figure 1. Learning speed comparison for DQN and the new asynchronous algorithms on five Atari 2600 games. DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores. The plots are averaged over 5 runs. In the case of DQN the runs were for different seeds with fixed hyperparameters. For asynchronous methods we average over the best 5 models from 50 experiments with learning rates sampled from LogUniform(10⁻⁴, 10⁻²) and all other hyperparameters fixed.
MuJoCo (Todorov, 2015) is a physics simulator for evaluating agents on continuous motor control tasks with contact dynamics. Labyrinth is a new 3D environment where the agent must learn to find rewards in randomly generated mazes from a visual input. The precise details of our experimental setup can be found in Supplementary Section 2.

5.1. Atari 2600 Games

We first present results on a subset of Atari 2600 games to demonstrate the training speed of the new methods. Figure 1 compares the learning speed of the DQN algorithm trained on an Nvidia K40 GPU with the asynchronous methods trained using 16 CPU cores on five Atari 2600 games. The results show that all four asynchronous methods we presented can successfully train neural network controllers on the Atari domain. The asynchronous methods tend to learn faster than DQN, with significantly faster learning on some games, while training on only 16 CPU cores. Additionally, the results suggest that n-step methods learn faster than one-step methods on some games. Overall, the policy-based advantage actor-critic method significantly outperforms all three value-based methods.

We then evaluated asynchronous advantage actor-critic on 57 Atari games. In order to compare with the state of the art in Atari game playing, we largely followed the training and evaluation protocol of (Van Hasselt et al., 2015). Specifically, we tuned hyperparameters (learning rate and amount of gradient norm clipping) using a search on six Atari games (Beamrider, Breakout, Pong, Q*bert, Seaquest and Space Invaders) and then fixed all hyperparameters for all 57 games. We trained both a feedforward agent with the same architecture as (Mnih et al., 2015; Nair et al., 2015; Van Hasselt et al., 2015) as well as a recurrent agent with an additional 256 LSTM cells after the final hidden layer. We additionally used the final network weights for evaluation to make the results more comparable to the original results from (Bellemare et al., 2012). We trained our agents for four days using 16 CPU cores, while the other agents were trained for 8 to 10 days on Nvidia K40 GPUs. Table 1 shows the average and median human-normalized scores obtained by our agents trained by asynchronous advantage actor-critic (A3C) as well as the current state-of-the-art. Supplementary Table S1 shows the scores on all games. A3C significantly improves on the state-of-the-art average score over 57 games in half the training time of the other methods while using only 16 CPU cores and no GPU. Furthermore, after just one day of training, A3C matches the average human-normalized score of Dueling Double DQN and almost reaches the median human-normalized score of Gorila. We note that many of the improvements presented in Double DQN (Van Hasselt et al., 2015) and Dueling Double DQN (Wang et al., 2015) could be incorporated into the 1-step Q and n-step Q methods presented in this work, with similar potential improvements.

Method            Training Time           Mean      Median
DQN               8 days on GPU           121.9%    47.5%
Gorila            4 days, 100 machines    215.2%    71.3%
D-DQN             8 days on GPU           332.9%    110.9%
Dueling D-DQN     8 days on GPU           343.8%    117.1%
Prioritized DQN   8 days on GPU           463.6%    127.6%
A3C, FF           1 day on CPU            344.1%    68.2%
A3C, FF           4 days on CPU           496.8%    116.6%
A3C, LSTM         4 days on CPU           623.0%    112.6%

Table 1. Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric. Supplementary Table S1 shows the raw scores for all games.

5.2. TORCS Car Racing Simulator

We also compared the four asynchronous methods on the TORCS 3D car racing game (Wymann et al., 2013). TORCS not only has more realistic graphics than Atari 2600 games, but also requires the agent to learn the dynamics of the car it is controlling. At each step, an agent received only a visual input in the form of an RGB image
of the current frame as well as a reward proportional to the agent's velocity along the center of the track at the agent's current position. We used the same neural network architecture as the one used in the Atari experiments specified in Supplementary Section 2. We performed experiments using four different settings – the agent controlling a slow car with and without opponent bots, and the agent controlling a fast car with and without opponent bots. Full results can be found in Supplementary Figure S2. A3C was the best performing agent, reaching between roughly 75% and 90% of the score obtained by a human tester on all four game configurations in about 12 hours of training. A video showing the learned driving behavior of the A3C agent can be found at https://youtu.be/0xo1Ldx3L5Q.

5.3. Continuous Action Control Using the MuJoCo Physics Simulator

We also examined a set of tasks where the action space is continuous. In particular, we looked at a set of rigid body physics domains with contact dynamics where the tasks include many examples of manipulation and locomotion. These tasks were simulated using the MuJoCo physics engine. We evaluated only the asynchronous advantage actor-critic algorithm since, unlike the value-based methods, it is easily extended to continuous actions. In all problems, using either the physical state or pixels as input, asynchronous advantage actor-critic found good solutions in less than 24 hours of training and typically in under a few hours. Some successful policies learned by our agent can be seen in the following video: https://youtu.be/Ajjc08-iPx8. Further details about this experiment can be found in Supplementary Section 3.

5.4. Labyrinth

We performed an additional set of experiments with A3C on a new 3D environment called Labyrinth. The specific task we considered involved the agent learning to find rewards in randomly generated mazes. At the beginning of each episode the agent was placed in a new randomly generated maze consisting of rooms and corridors. Each maze contained two types of objects that the agent was rewarded for finding – apples and portals. Picking up an apple led to a reward of 1. Entering a portal led to a reward of 10, after which the agent was respawned in a new random location in the maze and all previously collected apples were regenerated. An episode terminated after 60 seconds, after which a new episode would begin. The aim of the agent is to collect as many points as possible in the time limit; the optimal strategy involves first finding the portal and then repeatedly going back to it after each respawn. This task is much more challenging than the TORCS driving domain because the agent is faced with a new maze in each episode and must learn a general strategy for exploring random mazes.

We trained an A3C LSTM agent on this task using only 84 × 84 RGB images as input. The final average score of around 50 indicates that the agent learned a reasonable strategy for exploring random 3D mazes using only a visual input. A video showing one of the agents exploring previously unseen mazes is included at https://youtu.be/nMR5mjCFZCw.

5.5. Scalability and Data Efficiency

We analyzed the effectiveness of our proposed framework by looking at how the training time and data efficiency change with the number of parallel actor-learners. When using multiple workers in parallel and updating a shared model, one would expect that in an ideal case, for a given task and algorithm, the number of training steps to achieve a certain score would remain the same with varying numbers of workers. Therefore, the advantage would be solely due to the ability of the system to consume more data in the same amount of wall clock time and possibly improved exploration. Table 2 shows the training speed-up achieved by using increasing numbers of parallel actor-learners averaged over seven Atari games. These results show that all four methods achieve substantial speedups from using multiple worker threads, with 16 threads leading to at least an order of magnitude speedup. This confirms that our proposed framework scales well with the number of parallel workers, making efficient use of resources.

                 Number of threads
Method           1     2     4     8     16
1-step Q         1.0   3.0   6.3   13.3  24.1
1-step SARSA     1.0   2.8   5.9   13.1  22.1
n-step Q         1.0   2.7   5.9   10.7  17.2
A3C              1.0   2.1   3.7   6.9   12.5

Table 2. The average training speedup for each method and number of threads, averaged over seven Atari games. To compute the training speed-up on a single game we measured the time required to reach a fixed reference score using each method and number of threads. The speedup from using n threads on a game was defined as the time required to reach a fixed reference score using one thread divided by the time required to reach the reference score using n threads. The table shows the speedups averaged over seven Atari games (Beamrider, Breakout, Enduro, Pong, Q*bert, Seaquest, and Space Invaders).

Somewhat surprisingly, asynchronous one-step Q-learning and Sarsa algorithms exhibit superlinear speedups that cannot be explained by purely computational gains. We observe that one-step methods (one-step Q and one-step Sarsa) often require less data to achieve a particular score when using more parallel actor-learners. We believe this is due to the positive effect of multiple threads in reducing the bias in one-step methods. These effects are shown more clearly in Figure 3, which shows plots of the average score against the total number of training frames for different
Figure 2. Scatter plots of scores obtained by asynchronous advantage actor-critic on five games (Beamrider, Breakout, Pong, Q*bert, Space Invaders) for 50 different learning rates and random initializations. On each game, there is a wide range of learning rates for which all random initializations achieve good scores. This shows that A3C is quite robust to learning rates and initial random weights.
numbers of actor-learners and training methods on five Atari games, and Figure 4, which shows plots of the average score against wall-clock time.

5.6. Robustness and Stability

Finally, we analyzed the stability and robustness of the four proposed asynchronous algorithms. For each of the four algorithms we trained models on five games (Breakout, Beamrider, Pong, Q*bert, Space Invaders) using 50 different learning rates and random initializations. Figure 2 shows scatter plots of the resulting scores for A3C, while Supplementary Figure S7 shows plots for the other three methods. There is usually a range of learning rates for each method and game combination that leads to good scores, indicating that all methods are quite robust to the choice of learning rate and random initialization. The fact that there are virtually no points with scores of 0 in regions with good learning rates indicates that the methods are stable and do not collapse or diverge once they are learning.

6. Conclusions and Discussion

We have presented asynchronous versions of four standard reinforcement learning algorithms and showed that they are able to train neural network controllers on a variety of domains in a stable manner. Our results show that in our proposed framework stable training of neural networks through reinforcement learning is possible with both value-based and policy-based methods, off-policy as well as on-policy methods, and in discrete as well as continuous domains. When trained on the Atari domain using 16 CPU cores, the proposed asynchronous algorithms train faster than DQN trained on an Nvidia K40 GPU, with A3C surpassing the current state-of-the-art in half the training time.

One of our main findings is that using parallel actor-learners to update a shared model had a stabilizing effect on the learning process of the three value-based methods we considered. While this shows that stable online Q-learning is possible without experience replay, which was used for this purpose in DQN, it does not mean that experience replay is not useful. Incorporating experience replay into the asynchronous reinforcement learning framework could substantially improve the data efficiency of these methods by reusing old data. This could in turn lead to much faster training times in domains like TORCS where interacting with the environment is more expensive than updating the model for the architecture we used.

Combining other existing reinforcement learning methods or recent advances in deep reinforcement learning with our asynchronous framework presents many possibilities for immediate improvements to the methods we presented. While our n-step methods operate in the forward view (Sutton & Barto, 1998) by using corrected n-step returns directly as targets, it has been more common to use the backward view to implicitly combine different returns through eligibility traces (Watkins, 1989; Sutton & Barto, 1998; Peng & Williams, 1996). The asynchronous advantage actor-critic method could potentially be improved by using other ways of estimating the advantage function, such as the generalized advantage estimation of (Schulman et al., 2015b). All of the value-based methods we investigated could benefit from different ways of reducing over-estimation bias of Q-values (Van Hasselt et al., 2015; Bellemare et al., 2016). Yet another, more speculative, direction is to try and combine the recent work on true online temporal difference methods (van Seijen et al., 2015) with nonlinear function approximation.

In addition to these algorithmic improvements, a number of complementary improvements to the neural network architecture are possible. The dueling architecture of (Wang et al., 2015) has been shown to produce more accurate estimates of Q-values by including separate streams for the state value and advantage in the network. The spatial softmax proposed by (Levine et al., 2015) could improve both value-based and policy-based methods by making it easier for the network to represent feature coordinates.

ACKNOWLEDGMENTS

We thank Thomas Degris, Remi Munos, Marc Lanctot, Sasha Vezhnevets and Joseph Modayil for many helpful discussions, suggestions and comments on the paper. We also thank the DeepMind evaluation team for setting up the environments used to evaluate the agents in the paper.
Figure 3. Data efficiency comparison of different numbers of actor-learners for three asynchronous methods on five Atari games. The
x-axis shows the total number of training epochs where an epoch corresponds to four million frames (across all threads). The y-axis
shows the average score. Each curve shows the average over the three best learning rates. Single step methods show increased data
efficiency from more parallel workers. Results for Sarsa are shown in Supplementary Figure S5.
[Plot panels for Figure 4: rows for 1-step Q, n-step Q, and A3C across Beamrider, Breakout, Pong, Q*bert, and Space Invaders; y-axis: score; x-axis: training time (hours); one curve per thread count (1, 2, 4, 8, 16).]
Figure 4. Training speed comparison of different numbers of actor-learners on five Atari games. The x-axis shows training time in
hours while the y-axis shows the average score. Each curve shows the average over the three best learning rates. All asynchronous
methods show significant speedups from using greater numbers of parallel actor-learners. Results for Sarsa are shown in Supplementary
Figure S6.
References

Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 2012.

Bellemare, Marc G., Ostrovski, Georg, Guez, Arthur, Thomas, Philip S., and Munos, Rémi. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

Bertsekas, Dimitri P. Distributed dynamic programming. Automatic Control, IEEE Transactions on, 27(3):610–616, 1982.

Chavez, Kevin, Ong, Hao Yi, and Hong, Augustus. Distributed deep q-learning. Technical report, Stanford University, June 2015.

Degris, Thomas, Pilarski, Patrick M, and Sutton, Richard S. Model-free reinforcement learning with continuous action in practice. In American Control Conference (ACC), 2012, pp. 2177–2182. IEEE, 2012.

Grounds, Matthew and Kudenko, Daniel. Parallel reinforcement learning with linear function approximation. In Proceedings of the 5th, 6th and 7th European Conference on Adaptive and Learning Agents and Multi-agent Systems: Adaptation and Multi-agent Learning, pp. 60–74. Springer-Verlag, 2008.

Koutník, Jan, Schmidhuber, Jürgen, and Gomez, Faustino. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 conference on Genetic and evolutionary computation, pp. 541–548. ACM, 2014.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

Li, Yuxi and Schuurmans, Dale. Mapreduce for parallel reinforcement learning. In Recent Advances in Reinforcement Learning - 9th European Workshop, EWRL 2011, Athens, Greece, September 9-11, 2011, Revised Selected Papers, pp. 309–320, 2011.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop. 2013.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015. URL http://dx.doi.org/10.1038/nature14236.

Nair, Arun, Srinivasan, Praveen, Blackwell, Sam, Alcicek, Cagdas, Fearon, Rory, Maria, Alessandro De, Panneershelvam, Vedavyas, Suleyman, Mustafa, Beattie, Charles, Petersen, Stig, Legg, Shane, Mnih, Volodymyr, Kavukcuoglu, Koray, and Silver, David. Massively parallel methods for deep reinforcement learning. In ICML Deep Learning Workshop. 2015.

Peng, Jing and Williams, Ronald J. Incremental multi-step q-learning. Machine Learning, 22(1-3):283–290, 1996.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Riedmiller, Martin. Neural fitted q iteration – first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pp. 317–328. Springer Berlin Heidelberg, 2005.

Rummery, Gavin A and Niranjan, Mahesan. On-line q-learning using connectionist systems. 1994.

Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015a.

Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.

Sutton, R. and Barto, A. Reinforcement Learning: an Introduction. MIT Press, 1998.

Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.

Todorov, E. MuJoCo: Modeling, Simulation and Visualization of Multi-Joint Dynamics with Contact (ed 1.0). Roboti Publishing, 2015.

Tomassini, Marco. Parallel and distributed evolutionary algorithms: A review. Technical report, 1999.
Tsitsiklis, John N. Asynchronous stochastic approximation and q-learning. Machine Learning, 16(3):185–202, 1994.

Van Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461, 2015.

van Seijen, H., Rupam Mahmood, A., Pilarski, P. M., Machado, M. C., and Sutton, R. S. True Online Temporal-Difference Learning. ArXiv e-prints, December 2015.

Wang, Z., de Freitas, N., and Lanctot, M. Dueling Network Architectures for Deep Reinforcement Learning. ArXiv e-prints, November 2015.

Watkins, Christopher John Cornish Hellaby. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.

Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

Williams, Ronald J and Peng, Jing. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.

Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. Torcs: The open racing car simulator, v1.3.5, 2013.
LETTER doi:10.1038/nature14236

Human-level control through deep reinforcement learning

Volodymyr Mnih1*, Koray Kavukcuoglu1*, David Silver1*, Andrei A. Rusu1, Joel Veness1, Marc G. Bellemare1, Alex Graves1, Martin Riedmiller1, Andreas K. Fidjeland1, Georg Ostrovski1, Stig Petersen1, Charles Beattie1, Amir Sadik1, Ioannis Antonoglou1, Helen King1, Dharshan Kumaran1, Daan Wierstra1, Shane Legg1 & Demis Hassabis1
The theory of reinforcement learning provides a normative account1, deeply rooted in psychological2 and neuroscientific3 perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems4,5, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms3. While reinforcement learning agents have achieved some successes in a variety of domains6–8, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks9–11 to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games12. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks—a central goal of general artificial intelligence13 that has eluded previous efforts8,14,15. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network16 known as deep neural networks. Notably, recent advances in deep neural networks9–11, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We use one particularly successful architecture, the deep convolutional network17, which uses hierarchical layers of tiled convolutional filters to mimic the effects of receptive fields—inspired by Hubel and Wiesel's seminal work on feedforward processing in early visual cortex18—thereby exploiting the local spatial correlations present in images, and building in robustness to natural transformations such as changes of viewpoint or scale.

We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

Q*(s, a) = max_π E[rt + γr_{t+1} + γ²r_{t+2} + … | st = s, at = a, π],

which is the maximum sum of rewards rt discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s), after making an observation (s) and taking an action (a) (see Methods)19.

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function20. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + γ max_{a′} Q(s′, a′). We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay21–23 that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.

While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration24, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; θi) using the deep convolutional neural network shown in Fig. 1, in which θi are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences et = (st, at, rt, s_{t+1}) at each time-step t in a data set Dt = {e1, …, et}. During learning, we apply Q-learning updates, on samples (or minibatches) of experience (s, a, r, s′) ∼ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function:

Li(θi) = E_{(s,a,r,s′)∼U(D)}[(r + γ max_{a′} Q(s′, a′; θi⁻) − Q(s, a; θi))²]

in which γ is the discount factor determining the agent's horizon, θi are the parameters of the Q-network at iteration i and θi⁻ are the network parameters used to compute the target at iteration i. The target network parameters θi⁻ are only updated with the Q-network parameters (θi) every C steps and are held fixed between individual updates (see Methods).
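A minimal sketch of this loss on a sampled minibatch follows; q_net and target_net are assumed callables mapping a batch of states to per-action Q-values, and terminal transitions bootstrap with zero:

    import numpy as np

    def dqn_loss(q_net, target_net, s, a, r, s2, terminal, gamma=0.99):
        # L_i(theta_i) estimated on a uniform minibatch drawn from replay memory D
        q = q_net(s)[np.arange(len(a)), a]                    # Q(s, a; theta_i)
        target = r + gamma * (1.0 - terminal) * target_net(s2).max(axis=1)
        return np.mean((target - q) ** 2)                     # squared TD error

    # Every C gradient steps, theta^- is overwritten with theta and then held fixed.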
To evaluate our DQN agent, we took advantage of the Atari 2600 platform, which offers a diverse array of tasks (n = 49) designed to be
1Google DeepMind, 5 New Street Square, London EC4A 3TW, UK.
*These authors contributed equally to this work.
[Figure 1 schematic: three convolutional layers followed by two fully connected layers.]

Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).
difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout—taking high-dimensional data (210 × 160 colour video at 60 Hz) as input—to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner—illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available12,15. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games;
[Figure 2 panels a–d: average score per episode (a, b) and average action value (Q) (c, d) plotted against training epochs (0–200) for Space Invaders and Seaquest.]
Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.
[Figure 3 bar chart: the 49 games ordered by normalized DQN performance, from Video Pinball (best) down to Montezuma's Revenge (worst), each bar marked 'At human-level or above' or 'Below human-level', with best linear learner scores shown for comparison; x-axis: 0–4,500%.]
Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods15 in the literature. The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score − random play score)/(human score − random play score). It can be seen that DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.
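The normalization in the caption is a one-liner; a restatement of the formula with hypothetical argument names:

    def normalized_score(agent_score, human_score, random_score):
        # 100 x (DQN score - random play score) / (human score - random play score)
        return 100.0 * (agent_score - random_score) / (human_score - random_score)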
see Fig. 3, Supplementary Discussion and Extended Data Table 2). In additional simulations (see Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent—the replay memory, separate target Q-network and deep convolutional network architecture—by disabling them and demonstrating the detrimental effects on performance.

We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders (see Supplementary Video 1 for a demonstration of the performance of DQN), by using a technique developed for the visualization of high-dimensional data called 't-SNE'25 (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own—in simulations where we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.

It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro).
Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders. The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm25 on the last hidden layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right and top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.
Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for illustration of the development of DQN's performance over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN (for example, Montezuma's Revenge).

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work24,26, our approach incorporates 'end-to-end' reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex27,28. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm21–23 involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods21,22 (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia22. In the future, it will be important to explore the potential use of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay29, and relates to the notion of 'prioritized sweeping'30 in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.

Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

Received 10 July 2014; accepted 16 January 2015.

1. Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998).
2. Thorndike, E. L. Animal Intelligence: Experimental studies (Macmillan, 1911).
3. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
4. Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE. Comput. Soc. Conf. Comput. Vis. Pattern. Recognit. 994–1000 (2005).
5. Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
6. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).
7. Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009).
8. Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008).
9. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009).
10. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012).
11. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
12. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
13. Legg, S. & Hutter, M. Universal Intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007).
14. Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005).
15. Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI. Artif. Intell. 864–871 (2012).
16. McClelland, J. L., Rumelhart, D. E. & Group, T. P. R. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986).
17. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
18. Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat's striate cortex. J. Physiol. 165, 559–568 (1963).
19. Watkins, C. J. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
20. Tsitsiklis, J. & Roy, B. V. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997).
21. McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995).
22. O'Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010).
23. Lin, L.-J. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document (1993).
24. Riedmiller, M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML, 3720, 317–328 (Springer, 2005).
25. Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
26. Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural. Netw. 1–8 (2010).
27. Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009).
28. Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002).
29. Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012).
30. Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993).

Supplementary Information is available in the online version of the paper.

Acknowledgements We thank G. Hinton, P. Dayan and M. Bowling for discussions, A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of the DeepMind team for their support, ideas and encouragement.

Author Contributions V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper. Correspondence and requests for materials should be addressed to K.K. (korayk@google.com) or D.H. (demishassabis@google.com).
METHODS

Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
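A sketch of the per-frame part of φ; the luminance weights and the nearest-neighbour rescale are standard choices assumed here, as the paper does not specify either detail:

    import numpy as np

    def preprocess(frame, prev_frame):
        # Max over two consecutive raw 210 x 160 x 3 frames removes sprite flicker.
        f = np.maximum(frame, prev_frame).astype(np.float32)
        # Extract the Y (luminance) channel from RGB (ITU-R BT.601 weights assumed).
        y = 0.299 * f[..., 0] + 0.587 * f[..., 1] + 0.114 * f[..., 2]
        # Rescale to 84 x 84 by sampling a uniform grid (nearest neighbour assumed).
        rows = np.linspace(0, y.shape[0] - 1, 84).astype(int)
        cols = np.linspace(0, y.shape[1] - 1, 84).astype(int)
        return y[np.ix_(rows, cols)]

    def phi(last_m_frames):
        # Stack the m = 4 most recent preprocessed frames into the network input.
        return np.stack(last_m_frames, axis=-1)               # 84 x 84 x 4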
Code availability. The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.
Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history–action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches24,26. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.

The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity31,32. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
understand the current situation from only the current screen xt ). Therefore,
Training details. We performed experiments on 49 Atari 2600 games where results
sequences of actions and observations, st ~x1 ,a1 ,x2 ,:::,at{1 ,xt , are input to the
were available for all other comparable methods12,15. A different network was trained
algorithm, which then learns game strategies depending upon these sequences. All
on each game: the same network architecture, learning algorithm and hyperpara-
sequences in the emulator are assumed to terminate in a finite number of time-
meter settings (see Extended Data Table 1) were used across all games, showing that steps. This formalism gives rise to a large but finite Markov decision process (MDP)
our approach is robust enough to work on a variety of games while incorporating in which each sequence is a distinct state. As a result, we can apply standard rein-
only minimal prior knowledge (see below). While we evaluated our agents on unmodi- forcement learning methods for MDPs, simply by using the complete sequence st
fied games, we made one change to the reward structure of the games during training as the state representation at time t.
only. As the scale of scores varies greatly from game to game, we clipped all posi-
The goal of the agent is to interact with the emulator by selecting actions in a way
tive rewards at 1 and all negative rewards at 21, leaving 0 rewards unchanged.
that maximizes future rewards. We make the standard assumption that future rewards
Clipping the rewards in this manner limits the scale of the error derivatives and
are discounted by a factor of c per time-step (c was set to 0.99 throughout), and
makes it easier to use the same learning rate across multiple games. At the same time, X T
0
it could affect the performance of our agent since it cannot differentiate between define the future discounted return at time t as Rt ~ ct {t rt 0 , in which T is the
rewards of different magnitude. For games where there is a life counter, the Atari t 0 ~t
time-step at which the game terminates. We define the optimal action-value
2600 emulator also sends the number of lives left in the game, which is then used to
function Q! ðs,aÞ as the maximum expected return achievable by following any
mark the end of an episode during training.
policy, after seeing some sequence s and then taking some action a, Q! ðs,aÞ~
In these experiments, we used the RMSProp (see http://www.cs.toronto.edu/
maxp ½Rt D st ~s,at ~a,p in which p is a policy mapping sequences to actions (or
,tijmen/csc321/slides/lecture_slides_lec6.pdf ) algorithm with minibatches of size
distributions over actions).
32. The behaviour policy during training was e-greedy with e annealed linearly
The optimal action-value function obeys an important identity known as the
from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained
Bellman equation. This is based on the following intuition: if the optimal value
for a total of 50 million frames (that is, around 38 days of game experience in total)
Q! ðs0 ,a0 Þ of the sequence s9 at the next time-step was known for all possible actions
and used a replay memory of 1 million most recent frames.
a9, then the optimal strategy is to select the action a9 maximizing the expected value
Following previous approaches to playing Atari 2600 games, we also use a simple of rzcQ! ðs0 ,a0 Þ:
frame-skipping technique15. More precisely, the agent sees and selects actions on  
every kth frame instead of every frame, and its last action is repeated on skipped
Q! ðs,aÞ ~ s0 rzc max Q ! 0 0
ðs ,a s,a
ÞD
frames. Because running the emulator forward for one step requires much less a 0

computation than having the agent select an action, this technique allows the agent
to play roughly k times more games without significantly increasing the runtime. The basic idea behind many reinforcement learning algorithms is to estimate
We use k 5 4 for all games. the action-value function by using the Bellman equation as an iterative update,
The values of all the hyperparameters and optimization parameters were selected Qiz1 ðs,aÞ~ s0 ½rzc maxa0 Qi ðs0 ,a0 ÞD
s,a. Such value iteration algorithms converge
by performing an informal search on the games Pong, Breakout, Seaquest, Space to the optimal action-value function, Qi ? Q! as i? ?. In practice, this basic approach
Invaders and Beam Rider. We did not perform a systematic grid search owing to is impractical, because the action-value function is estimated separately for each
the high computational cost. These parameters were then held fixed across all other sequence, without any generalization. Instead, it is common to use a function approx-
games. The values and descriptions of all hyperparameters are provided in Extended imator to estimate the action-value function, Qðs,a; hÞ<Q! ðs,aÞ. In the reinforce-
Data Table 1. ment learning community this is typically a linear function approximator, but
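To make the preprocessing map φ and the network just described concrete, here is a minimal sketch in PyTorch. It is an illustration only, not the authors' released implementation (linked above under 'Code availability'); the framework choice, the nearest-neighbour resize and all variable names are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def preprocess(frame_rgb, prev_frame_rgb):
    """Encode one frame: per-pixel max over two consecutive frames (removes
    sprite flicker), extract the luminance channel, rescale to 84 x 84."""
    frame = np.maximum(frame_rgb, prev_frame_rgb).astype(np.float32)
    # ITU-R BT.601 luminance weights for the Y channel.
    y = 0.299 * frame[..., 0] + 0.587 * frame[..., 1] + 0.114 * frame[..., 2]
    # Nearest-neighbour resize (an assumption; any standard resize would do).
    rows = np.linspace(0, y.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, y.shape[1] - 1, 84).astype(int)
    return y[np.ix_(rows, cols)] / 255.0

class QNetwork(nn.Module):
    """One output unit per action: all Q-values in a single forward pass."""
    def __init__(self, num_actions, m=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(m, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84 -> 20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20 -> 9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9 -> 7
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # fully connected
            nn.Linear(512, num_actions),                            # linear output layer
        )

    def forward(self, x):  # x: (batch, m, 84, 84)
        return self.net(x)
```

A single forward pass of QNetwork yields the predicted Q-value of every action at once, which is exactly the architectural advantage argued for above.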


In the reinforcement learning community this is typically a linear function approximator, but sometimes a nonlinear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights $\theta$ as a Q-network. A Q-network can be trained by adjusting the parameters $\theta_i$ at iteration $i$ to reduce the mean-squared error in the Bellman equation, where the optimal target values $r + \gamma \max_{a'} Q^*(s',a')$ are substituted with approximate target values $y = r + \gamma \max_{a'} Q(s',a';\theta_i^-)$, using parameters $\theta_i^-$ from some previous iteration. This leads to a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\!\left[\left(\mathbb{E}_{s'}[y \mid s,a] - Q(s,a;\theta_i)\right)^2\right] = \mathbb{E}_{s,a,r,s'}\!\left[\left(y - Q(s,a;\theta_i)\right)^2\right] + \mathbb{E}_{s,a,r}\!\left[\mathrm{Var}_{s'}[y]\right].$$

Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. At each stage of optimization, we hold the parameters from the previous iteration $\theta_i^-$ fixed when optimizing the ith loss function $L_i(\theta_i)$, resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters $\theta_i$ that we are currently optimizing, and may therefore be ignored. Differentiating the loss function with respect to the weights we arrive at the following gradient:

$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\!\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)\right) \nabla_{\theta_i} Q(s,a;\theta_i)\right].$$

Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimize the loss function by stochastic gradient descent. The familiar Q-learning algorithm19 can be recovered in this framework by updating the weights after every time step, replacing the expectations using single samples, and setting $\theta_i^- = \theta_{i-1}$.

Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator, without explicitly estimating the reward and transition dynamics $P(r,s' \mid s,a)$. It is also off-policy: it learns about the greedy policy $a = \mathrm{argmax}_{a'} Q(s,a';\theta)$, while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability 1 − ε and selects a random action with probability ε.

Training algorithm for deep Q-networks. The full algorithm for training deep Q-networks is presented in Algorithm 1. The agent selects and executes actions according to an ε-greedy policy based on Q. Because using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed-length representation of histories produced by the function φ described above. The algorithm modifies standard online Q-learning in two ways to make it suitable for training large neural networks without diverging.

First, we use a technique known as experience replay23 in which we store the agent's experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set $D_t = \{e_1, \ldots, e_t\}$, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, $(s, a, r, s') \sim U(D)$, drawn at random from the pool of stored samples. This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically20. By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.

In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates. This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping30.

The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets $y_j$ in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network $\hat{Q}$ and use $\hat{Q}$ for generating the Q-learning targets $y_j$ for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases $Q(s_t,a_t)$ often also increases $Q(s_{t+1},a)$ for all $a$ and hence also increases the target $y_j$, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets $y_j$, making divergence or oscillations much more unlikely.

We also found it helpful to clip the error term from the update, $r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)$, to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1) interval. This form of error clipping further improved the stability of the algorithm.

Algorithm 1: deep Q-learning with experience replay
  Initialize replay memory D to capacity N
  Initialize action-value function Q with random weights θ
  Initialize target action-value function Q̂ with weights θ⁻ = θ
  For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
      With probability ε select a random action a_t
      otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
      Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
      Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
      Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
      Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
      Set y_j = r_j if the episode terminates at step j + 1,
        otherwise y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)
      Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
      Every C steps reset Q̂ = Q
    End For
  End For

31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009).
32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010).
33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1998).
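Algorithm 1 maps naturally onto a short training loop. The sketch below reuses the QNetwork class from the earlier snippet and assumes a hypothetical `env` object exposing `reset()` and `step()`; the Huber (smooth L1) loss stands in for the error clipping described above, and none of this is the authors' released code.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, num_actions, episodes=100, N=100_000, C=10_000,
              gamma=0.99, batch_size=32, epsilon=0.1):
    q = QNetwork(num_actions)                      # online network (weights theta)
    q_hat = QNetwork(num_actions)                  # target network (weights theta^-)
    q_hat.load_state_dict(q.state_dict())
    opt = torch.optim.RMSprop(q.parameters(), lr=2.5e-4)
    D = deque(maxlen=N)                            # replay memory keeps the last N transitions
    step = 0
    for _ in range(episodes):
        phi, done = env.reset(), False             # stacked frames, shape (4, 84, 84)
        while not done:
            if random.random() < epsilon:          # epsilon-greedy behaviour policy
                a = random.randrange(num_actions)
            else:
                with torch.no_grad():
                    a = q(torch.as_tensor(phi, dtype=torch.float32)[None]).argmax(1).item()
            phi_next, r, done = env.step(a)
            D.append((phi, a, float(np.clip(r, -1, 1)), phi_next, float(done)))  # clipped reward
            phi = phi_next
            if len(D) >= batch_size:
                s, a_b, r_b, s2, d = zip(*random.sample(D, batch_size))  # uniform sample from D
                s = torch.as_tensor(np.stack(s), dtype=torch.float32)
                s2 = torch.as_tensor(np.stack(s2), dtype=torch.float32)
                r_b, d = torch.tensor(r_b), torch.tensor(d)
                with torch.no_grad():              # targets y_j come from the frozen network
                    y = r_b + gamma * (1.0 - d) * q_hat(s2).max(1).values
                q_sa = q(s).gather(1, torch.tensor(a_b).unsqueeze(1)).squeeze(1)
                loss = F.smooth_l1_loss(q_sa, y)   # Huber loss, akin to the error clipping above
                opt.zero_grad(); loss.backward(); opt.step()
            step += 1
            if step % C == 0:                      # every C steps reset Q^ = Q
                q_hat.load_state_dict(q.state_dict())
```

The two stabilizing modifications are visible directly: transitions are sampled uniformly from the replay deque rather than consumed in order, and the bootstrap targets are computed by a cloned network that is only refreshed every C steps.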


Extended Data Figure 1 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders. The plot was generated by running the t-SNE algorithm25 on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).
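For readers who want to reproduce this kind of plot, a hedged sketch using scikit-learn's TSNE as a stand-in for the t-SNE implementation cited as ref. 25; the activations array below is only a placeholder for DQN's 512-unit last hidden layer.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder: in practice, collect last-hidden-layer activations for many game states.
activations = np.random.rand(2000, 512).astype(np.float32)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(activations)
print(embedding.shape)  # (2000, 2): one (x, y) point per game state
```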


Extended Data Figure 2 | Visualization of learned value functions on two games, Breakout and Pong. a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ~21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks, clearing many of them by itself. b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the 'up' action stays high while the value of the 'down' action falls to −0.9. This reflects the fact that pressing 'down' would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing 'up' and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1. Note, the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, not shown during the game). With permission from Atari Interactive, Inc.


Extended Data Table 1 | List of hyperparameters and their values

The values of all the hyperparameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing
to the high computational cost, although it is conceivable that even better results could be obtained by systematically tuning the hyperparameter values.


Extended Data Table 2 | Comparison of game scores obtained by DQN agents with methods from the literature12,15 and a professional human games tester

Best Linear Learner is the best result obtained by a linear function approximator on different types of hand-designed features12. Contingency (SARSA) agent figures are the results obtained in ref. 15. Note that the figures in the last column indicate the performance of DQN relative to the human games tester, expressed as a percentage: 100 × (DQN score − random play score)/(human score − random play score).
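The normalization in the last column can be written as a one-line helper (names illustrative):

```python
def normalized_score(agent_score, random_score, human_score):
    """100 x (DQN score - random play score) / (human score - random play score)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)
```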


Extended Data Table 3 | The effects of replay and separating the target Q-network

DQN agents were trained for 10 million frames using standard hyperparameters for all possible combinations of turning replay on or off, using or not using a separate target Q-network, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2 (50 million frames).


Extended Data Table 4 | Comparison of DQN performance with linear function approximator
The performance of the DQN agent is compared with the performance of a linear function approximator on the 5 validation games (that is, where a single linear layer was used instead of the convolutional network, in combination with replay and a separate target network). Agents were trained for 10 million frames using standard hyperparameters, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2 (50 million frames).



doi:10.1038/nature14539

Deep learning
Yann LeCun1,2, Yoshua Bengio3 & Geoffrey Hinton4,5
1Facebook AI Research, 770 Broadway, New York, New York 10003, USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128, Centre-Ville STN, Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 5Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada.

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as 'knobs' that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

The objective function, averaged over all the training examples, can


be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other.

a b

z z
Δz = y Δy
z
y Δy = xy Δx
y
y Δz = yz xy Δ x
x
z z y
x x =  y x
Input Hidden Output
(2) (2 sigmoid) (1 sigmoid)

c d Compare outputs with correct


answer to get error derivatives
yl = f (zl ) E
Output units = yl tl
l yl
zl = wkl yk l
E E yl
wkl k  H2 =
w zl yl zl
kl
E E
= wkl
yk = f (zk ) yk zl
Hidden units H2 k I  out
zk = w jk y j k
wjk E E yk wjk
j  H1 =
zk yk zk E E
= w jk
y j = f (zj ) yj zk
Hidden units H1 j j k  H2
E E yj
zj = wij xi wij =
wij zj y j zj
i  Input

Input units i i

Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives — how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0,z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives y_l − t_l if the cost function for unit l is 0.5(y_l − t_l)², where t_l is the target value. Once ∂E/∂z_k is known, the error derivative for the weight w_jk on the connection from unit j in the layer below is just y_j ∂E/∂z_k.


[Figure 2 graphic: the red, green and blue channels of a Samoyed photograph pass through alternating 'Convolutions and ReLU' and 'Max pooling' stages; each rectangle is a feature map. The final output scores read: Samoyed (16); Papillon (5.7); Pomeranian (2.7); Arctic fox (1.0); Eskimo dog (0.6); white wolf (0.4); Siberian husky (0.4).]
Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

A linear classifier, or any other 'shallow' classifier operating on raw pixels, could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples21. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures
From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder29,30.


The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By 'pre-training' several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited36.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary38 and was quickly developed to give record-breaking results on a large vocabulary task39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some 'source' tasks but very few for some 'target' tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet)41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks
ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text, from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex45. ConvNets have their roots in the neocognitron46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words47,48.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition47 and document reading42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands50,51, and for face recognition52.

Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition53, the segmentation of biological images54, particularly for connectomics55, and the detection of faces, text, pedestrians and human bodies in natural images36,50,51,56–58. A major recent practical success of ConvNets is face recognition59.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars60,61.
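A minimal NumPy sketch of one such stage: a shared filter bank (weight sharing), a ReLU, then non-overlapping max pooling. The loops are deliberately naive to make the local weighted sums explicit, and all shapes are illustrative assumptions.

```python
import numpy as np

def conv_stage(image, filters, pool=2):
    """image: (H, W); filters: (F, k, k). Returns pooled feature maps."""
    F_, k, _ = filters.shape
    H, W = image.shape
    out = np.zeros((F_, H - k + 1, W - k + 1))
    for f in range(F_):                    # every unit in a feature map shares one filter
        for i in range(H - k + 1):
            for j in range(W - k + 1):     # local weighted sum over a k x k patch
                out[f, i, j] = np.sum(image[i:i+k, j:j+k] * filters[f])
    out = np.maximum(out, 0.0)             # ReLU non-linearity
    # Max pooling over non-overlapping pool x pool patches (coarse-grains positions).
    Hp, Wp = out.shape[1] // pool, out.shape[2] // pool
    return out[:, :Hp*pool, :Wp*pool].reshape(F_, Hp, pool, Wp, pool).max(axis=(2, 4))
```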


[Figure 3 graphic: a 'Vision' deep CNN feeding a 'Language' generating RNN. Example generated captions: 'A group of people shopping at an outdoor market. There are many vegetables at the fruit stand.'; 'A woman is throwing a frisbee in a park.'; 'A dog is standing on a hardwood floor.'; 'A stop sign is on a road with a mountain in the background.'; 'A little girl sitting on a bed with a teddy bear.'; 'A group of people sitting on a boat in the water.'; 'A giraffe standing in a forest with trees in the background.']

Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to 'translate' high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better 'translation' of images into captions.
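A toy sketch of the pipeline in Fig. 3: CNN features seed the state of an RNN that emits a caption one word at a time. Every weight matrix, the token conventions and the greedy decoding are illustrative assumptions, not the architecture of ref. 102.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def generate_caption(cnn_features, E, U, W, V, vocab, max_len=15):
    """E: word-embedding matrix; U, W, V: input, recurrent and output weights."""
    s = np.tanh(U @ cnn_features)             # image representation initializes the state
    word_id, caption = 0, []                  # index 0 assumed to be a <start> token
    for _ in range(max_len):
        s = np.tanh(W @ s + E[word_id])       # fold the previous word into the state
        word_id = int(np.argmax(softmax(V @ s)))  # greedy choice of the next word
        if vocab[word_id] == "<end>":
            break
        caption.append(vocab[word_id])
    return " ".join(caption)
```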

Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Distributed representations and language processing
Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs.


context of earlier words71. Each word in the context is presented to handful of words would require very large training corpora. N-grams
the network as a one-of-N vector, that is, one component has a value treat each word as an atomic unit, so they cannot generalize across
of 1 and the rest are 0. In the first layer, each word creates a different semantically related sequences of words, whereas neural language
pattern of activations, or word vectors (Fig. 4). In a language model, models can because they associate each word with a vector of real
the other layers of the network learn to convert the input word vec- valued features, and semantically related words end up close to each
tors into an output word vector for the predicted next word, which other in that vector space (Fig. 4).
can be used to predict the probability for any word in the vocabulary
to appear as the next word. The network learns word vectors that Recurrent neural networks
contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications14,17,72–76.

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora.

Recurrent neural networks
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish77,78.

Thanks to advances in their architecture79,80 and ways of training them81,82, RNNs have been found to be very good at predicting the next character in the text83 or the next word in a sequence75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state of the art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion84,85.
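To make the decoding loop described above concrete, here is a toy numpy sketch of greedy decoding from a ‘thought vector’. Everything in it is an illustrative stand-in: the vocabulary, the parameter names and the random weights are hypothetical, and a trained system would sample from the learned distribution (or use a beam search) rather than always taking the most probable word.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["<full_stop>", "le", "chat", "est", "noir"]   # toy target vocabulary
V, H = len(vocab), 8                                   # vocabulary size, hidden size

# Stand-ins for the jointly trained decoder parameters.
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden weights
W_xh = rng.normal(scale=0.1, size=(V, H))   # previous-word-to-hidden weights
W_hy = rng.normal(scale=0.1, size=(H, V))   # hidden-to-vocabulary weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

thought_vector = rng.normal(size=H)   # final encoder state summarizing the source sentence

state, prev_word, output = thought_vector, np.zeros(V), []
for _ in range(10):
    state = np.tanh(state @ W_hh + prev_word @ W_xh)   # one decoder step
    probs = softmax(state @ W_hy)                      # distribution over the next word
    idx = int(np.argmax(probs))                        # greedy choice
    output.append(vocab[idx])
    if vocab[idx] == "<full_stop>":                    # stop when a full stop is chosen
        break
    prev_word = np.eye(V)[idx]                         # feed the choice back as input
print(" ".join(output))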

[Figure 4 scatter plots omitted; axis values and word/phrase labels not reproduced.]

Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm103. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network75. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation)18,75.
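The similarity structure described above (Tuesday close to Wednesday, Sweden close to Norway) can be checked directly with cosine similarity. A minimal sketch, assuming a tiny hand-filled embedding table in place of vectors actually learned by backpropagation:

import numpy as np

embeddings = {
    "tuesday":   np.array([0.90, 0.80, 0.10]),
    "wednesday": np.array([0.88, 0.79, 0.12]),
    "sweden":    np.array([0.10, 0.20, 0.95]),
    "norway":    np.array([0.12, 0.18, 0.93]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(word):
    # Rank every other word by cosine similarity to the query word.
    return max((w for w in embeddings if w != word),
               key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("tuesday"))   # -> wednesday
print(nearest("sweden"))    # -> norway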


[Figure 5 diagrams omitted.]

Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_t′ (for t′ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.
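A minimal numpy sketch of the unfolded forward computation in the caption, with the shared parameters U, V and W reused at every time step (the tanh non-linearity and all of the sizes are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
D, H, O, T = 4, 5, 3, 6                  # input, hidden, output sizes; sequence length

# The same parameters are used at each time step, as the caption notes.
U = rng.normal(scale=0.1, size=(H, D))   # input-to-hidden
W = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden (the recurrence)
V = rng.normal(scale=0.1, size=(O, H))   # hidden-to-output

xs = rng.normal(size=(T, D))             # input sequence x_1 ... x_T
s = np.zeros(H)                          # initial state vector
outputs = []
for x_t in xs:
    s = np.tanh(U @ x_t + W @ s)         # s_t depends on x_t and on s_{t-1}
    outputs.append(V @ s)                # o_t depends on all previous inputs via s_t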
Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long78.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time79. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
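A schematic rendering of that gated accumulator in numpy. The weight names are hypothetical, and a full LSTM unit adds separate input and output gates that are omitted here:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell_step(c_prev, x, h_prev, W_g, W_i):
    z = np.concatenate([x, h_prev])
    gate = sigmoid(W_g @ z)        # learns when to keep or clear the memory
    signal = np.tanh(W_i @ z)      # the external signal to accumulate
    # Weight-one self-connection: copy the previous state (multiplicatively
    # gated) and accumulate the external signal.
    return gate * c_prev + signal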
LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step87, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation17,72,76.

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine, in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to88, and memory networks, in which a regular network is augmented by a kind of associative memory89. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list88. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game, and after reading a story they can answer questions that require complex inference90. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?”89.

The future of deep learning
Unsupervised learning91–98 had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way, using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems99 at classification tasks and produce impressive results in learning to play many different video games100.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time76,86.

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors101. ■

Received 25 February; accepted 1 May 2015.

1. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems 25 1090–1098 (2012).
This report was a breakthrough that used convolutional nets to almost halve the error rate for object recognition, and precipitated the rapid adoption of deep learning by the computer vision community.
2. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1915–1929 (2013).
3. Tompson, J., Jain, A., LeCun, Y. & Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Proc. Advances in Neural Information Processing Systems 27 1799–1807 (2014).
4. Szegedy, C. et al. Going deeper with convolutions. Preprint at http://arxiv.org/abs/1409.4842 (2014).
5. Mikolov, T., Deoras, A., Povey, D., Burget, L. & Cernocky, J. Strategies for training large scale neural network language models. In Proc. Automatic Speech Recognition and Understanding 196–201 (2011).
6. Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine 29, 82–97 (2012).
This joint paper from the major speech recognition laboratories, summarizing the breakthrough achieved with deep learning on the task of phonetic classification for automatic speech recognition, was the first major industrial application of deep learning.
7. Sainath, T., Mohamed, A.-R., Kingsbury, B. & Ramabhadran, B. Deep convolutional neural networks for LVCSR. In Proc. Acoustics, Speech and Signal Processing 8614–8618 (2013).
8. Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure-activity relationships. J. Chem. Inf. Model. 55, 263–274 (2015).
9. Ciodaro, T., Deva, D., de Seixas, J. & Damazio, D. Online particle detection with neural networks based on topological calorimetry information. J. Phys. Conf. Series 368, 012030 (2012).
10. Kaggle. Higgs boson machine learning challenge. Kaggle https://www.kaggle.com/c/higgs-boson (2014).
11. Helmstaedter, M. et al. Connectomic reconstruction of the inner plexiform layer in the mouse retina. Nature 500, 168–174 (2013).


12. Leung, M. K., Xiong, H. Y., Lee, L. J. & Frey, B. J. Deep learning of the tissue-regulated splicing code. Bioinformatics 30, i121–i129 (2014).
13. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 6218 (2015).
14. Collobert, R. et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011).
15. Bordes, A., Chopra, S. & Weston, J. Question answering with subgraph embeddings. In Proc. Empirical Methods in Natural Language Processing http://arxiv.org/abs/1406.3676v3 (2014).
16. Jean, S., Cho, K., Memisevic, R. & Bengio, Y. On using very large target vocabulary for neural machine translation. In Proc. ACL-IJCNLP http://arxiv.org/abs/1412.2007 (2015).
17. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014).
This paper showed state-of-the-art machine translation results with the architecture introduced in ref. 72, with a recurrent network trained to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language.
18. Bottou, L. & Bousquet, O. The tradeoffs of large scale learning. In Proc. Advances in Neural Information Processing Systems 20 161–168 (2007).
19. Duda, R. O. & Hart, P. E. Pattern Classification and Scene Analysis (Wiley, 1973).
20. Schölkopf, B. & Smola, A. Learning with Kernels (MIT Press, 2002).
21. Bengio, Y., Delalleau, O. & Le Roux, N. The curse of highly variable functions for local kernel machines. In Proc. Advances in Neural Information Processing Systems 18 107–114 (2005).
22. Selfridge, O. G. Pandemonium: a paradigm for learning in mechanisation of thought processes. In Proc. Symposium on Mechanisation of Thought Processes 513–526 (1958).
23. Rosenblatt, F. The Perceptron — A Perceiving and Recognizing Automaton. Tech. Rep. 85-460-1 (Cornell Aeronautical Laboratory, 1957).
24. Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard Univ. (1974).
25. Parker, D. B. Learning Logic Report TR–47 (MIT Press, 1985).
26. LeCun, Y. Une procédure d’apprentissage pour Réseau à seuil assymétrique in Cognitiva 85: a la Frontière de l’Intelligence Artificielle, des Sciences de la Connaissance et des Neurosciences [in French] 599–604 (1985).
27. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
28. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proc. 14th International Conference on Artificial Intelligence and Statistics 315–323 (2011).
This paper showed that supervised training of very deep neural networks is much faster if the hidden layers are composed of ReLU.
29. Dauphin, Y. et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proc. Advances in Neural Information Processing Systems 27 2933–2941 (2014).
30. Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. & LeCun, Y. The loss surface of multilayer networks. In Proc. Conference on AI and Statistics http://arxiv.org/abs/1412.0233 (2014).
31. Hinton, G. E. What kind of graphical model is the brain? In Proc. 19th International Joint Conference on Artificial Intelligence 1765–1775 (2005).
32. Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comp. 18, 1527–1554 (2006).
This paper introduced a novel and effective way of training very deep neural networks by pre-training one hidden layer at a time using the unsupervised learning procedure for restricted Boltzmann machines.
33. Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy layer-wise training of deep networks. In Proc. Advances in Neural Information Processing Systems 19 153–160 (2006).
This report demonstrated that the unsupervised pre-training method introduced in ref. 32 significantly improves performance on test data and generalizes the method to other unsupervised representation-learning techniques, such as auto-encoders.
34. Ranzato, M., Poultney, C., Chopra, S. & LeCun, Y. Efficient learning of sparse representations with an energy-based model. In Proc. Advances in Neural Information Processing Systems 19 1137–1144 (2006).
35. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
36. Sermanet, P., Kavukcuoglu, K., Chintala, S. & LeCun, Y. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition http://arxiv.org/abs/1212.0142 (2013).
37. Raina, R., Madhavan, A. & Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In Proc. 26th Annual International Conference on Machine Learning 873–880 (2009).
38. Mohamed, A.-R., Dahl, G. E. & Hinton, G. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22 (2012).
39. Dahl, G. E., Yu, D., Deng, L. & Acero, A. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 33–42 (2012).
40. Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Machine Intell. 35, 1798–1828 (2013).
41. LeCun, Y. et al. Handwritten digit recognition with a back-propagation network. In Proc. Advances in Neural Information Processing Systems 396–404 (1990).
This is the first paper on convolutional networks trained by backpropagation for the task of classifying low-resolution images of handwritten digits.
42. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
This overview paper on the principles of end-to-end training of modular systems such as deep neural networks using gradient-based optimization showed how neural networks (and in particular convolutional nets) can be combined with search or inference mechanisms to model complex outputs that are interdependent, such as sequences of characters associated with the content of a document.
43. Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. J. Physiol. 160, 106–154 (1962).
44. Felleman, D. J. & Essen, D. C. V. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47 (1991).
45. Cadieu, C. F. et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comp. Biol. 10, e1003963 (2014).
46. Fukushima, K. & Miyake, S. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15, 455–469 (1982).
47. Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K. & Lang, K. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process. 37, 328–339 (1989).
48. Bottou, L., Fogelman-Soulié, F., Blanchet, P. & Lienard, J. Experiments with time delay networks and dynamic time warping for speaker independent isolated digit recognition. In Proc. EuroSpeech 89 537–540 (1989).
49. Simard, D., Steinkraus, P. Y. & Platt, J. C. Best practices for convolutional neural networks. In Proc. Document Analysis and Recognition 958–963 (2003).
50. Vaillant, R., Monrocq, C. & LeCun, Y. Original approach for the localisation of objects in images. In Proc. Vision, Image, and Signal Processing 141, 245–250 (1994).
51. Nowlan, S. & Platt, J. in Neural Information Processing Systems 901–908 (1995).
52. Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Networks 8, 98–113 (1997).
53. Ciresan, D., Meier, U., Masci, J. & Schmidhuber, J. Multi-column deep neural network for traffic sign classification. Neural Networks 32, 333–338 (2012).
54. Ning, F. et al. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. 14, 1360–1371 (2005).
55. Turaga, S. C. et al. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comput. 22, 511–538 (2010).
56. Garcia, C. & Delakis, M. Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Machine Intell. 26, 1408–1423 (2004).
57. Osadchy, M., LeCun, Y. & Miller, M. Synergistic face detection and pose estimation with energy-based models. J. Mach. Learn. Res. 8, 1197–1215 (2007).
58. Tompson, J., Goroshin, R. R., Jain, A., LeCun, Y. Y. & Bregler, C. C. Efficient object localization using convolutional networks. In Proc. Conference on Computer Vision and Pattern Recognition http://arxiv.org/abs/1411.4280 (2014).
59. Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. Deepface: closing the gap to human-level performance in face verification. In Proc. Conference on Computer Vision and Pattern Recognition 1701–1708 (2014).
60. Hadsell, R. et al. Learning long-range vision for autonomous off-road driving. J. Field Robot. 26, 120–144 (2009).
61. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In Proc. International Conference on Machine Learning http://arxiv.org/abs/1202.2160 (2012).
62. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Machine Learning Res. 15, 1929–1958 (2014).
63. Sermanet, P. et al. Overfeat: integrated recognition, localization and detection using convolutional networks. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1312.6229 (2014).
64. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. Conference on Computer Vision and Pattern Recognition 580–587 (2014).
65. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.1556 (2014).
66. Boser, B., Sackinger, E., Bromley, J., LeCun, Y. & Jackel, L. An analog neural network processor with programmable topology. J. Solid State Circuits 26, 2017–2025 (1991).
67. Farabet, C. et al. Large-scale FPGA-based convolutional networks. In Scaling up Machine Learning: Parallel and Distributed Approaches (eds Bekkerman, R., Bilenko, M. & Langford, J.) 399–419 (Cambridge Univ. Press, 2011).
68. Bengio, Y. Learning Deep Architectures for AI (Now, 2009).
69. Montufar, G. & Morton, J. When does a mixture of products contain a product of mixtures? J. Discrete Math. 29, 321–347 (2014).
70. Montufar, G. F., Pascanu, R., Cho, K. & Bengio, Y. On the number of linear regions of deep neural networks. In Proc. Advances in Neural Information Processing Systems 27 2924–2932 (2014).
71. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. In Proc. Advances in Neural Information Processing Systems 13 932–938 (2001).
This paper introduced neural language models, which learn to convert a word symbol into a word vector or word embedding composed of learned semantic features in order to predict the next word in a sequence.


72. Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014).
73. Schwenk, H. Continuous space language models. Computer Speech Lang. 21, 492–518 (2007).
74. Socher, R., Lin, C. C-Y., Manning, C. & Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Proc. International Conference on Machine Learning 129–136 (2011).
75. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).
76. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).
77. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen [in German]. Diploma thesis, T.U. Munich (1991).
78. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994).
79. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
This paper introduced LSTM recurrent networks, which have become a crucial ingredient in recent advances with recurrent networks because they are good at learning long-range dependencies.
80. ElHihi, S. & Bengio, Y. Hierarchical recurrent neural networks for long-term dependencies. In Proc. Advances in Neural Information Processing Systems 8 http://papers.nips.cc/paper/1102-hierarchical-recurrent-neural-networks-for-long-term-dependencies (1995).
81. Sutskever, I. Training Recurrent Neural Networks. PhD thesis, Univ. Toronto (2012).
82. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning 1310–1318 (2013).
83. Sutskever, I., Martens, J. & Hinton, G. E. Generating text with recurrent neural networks. In Proc. 28th International Conference on Machine Learning 1017–1024 (2011).
84. Lakoff, G. & Johnson, M. Metaphors We Live By (Univ. Chicago Press, 2008).
85. Rogers, T. T. & McClelland, J. L. Semantic Cognition: A Parallel Distributed Processing Approach (MIT Press, 2004).
86. Xu, K. et al. Show, attend and tell: neural image caption generation with visual attention. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1502.03044 (2015).
87. Graves, A., Mohamed, A.-R. & Hinton, G. Speech recognition with deep recurrent neural networks. In Proc. International Conference on Acoustics, Speech and Signal Processing 6645–6649 (2013).
88. Graves, A., Wayne, G. & Danihelka, I. Neural Turing machines. http://arxiv.org/abs/1410.5401 (2014).
89. Weston, J., Chopra, S. & Bordes, A. Memory networks. http://arxiv.org/abs/1410.3916 (2014).
90. Weston, J., Bordes, A., Chopra, S. & Mikolov, T. Towards AI-complete question answering: a set of prerequisite toy tasks. http://arxiv.org/abs/1502.05698 (2015).
91. Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158–1161 (1995).
92. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. In Proc. International Conference on Artificial Intelligence and Statistics 448–455 (2009).
93. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proc. 25th International Conference on Machine Learning 1096–1103 (2008).
94. Kavukcuoglu, K. et al. Learning convolutional feature hierarchies for visual recognition. In Proc. Advances in Neural Information Processing Systems 23 1090–1098 (2010).
95. Gregor, K. & LeCun, Y. Learning fast approximations of sparse coding. In Proc. International Conference on Machine Learning 399–406 (2010).
96. Ranzato, M., Mnih, V., Susskind, J. M. & Hinton, G. E. Modeling natural images using gated MRFs. IEEE Trans. Pattern Anal. Machine Intell. 35, 2206–2222 (2013).
97. Bengio, Y., Thibodeau-Laufer, E., Alain, G. & Yosinski, J. Deep generative stochastic networks trainable by backprop. In Proc. 31st International Conference on Machine Learning 226–234 (2014).
98. Kingma, D., Rezende, D., Mohamed, S. & Welling, M. Semi-supervised learning with deep generative models. In Proc. Advances in Neural Information Processing Systems 27 3581–3589 (2014).
99. Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple object recognition with visual attention. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1412.7755 (2014).
100. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
101. Bottou, L. From machine learning to machine reasoning. Mach. Learn. 94, 133–149 (2014).
102. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image caption generator. In Proc. International Conference on Machine Learning http://arxiv.org/abs/1502.03044 (2014).
103. van der Maaten, L. & Hinton, G. E. Visualizing data using t-SNE. J. Mach. Learn. Research 9, 2579–2605 (2008).

Acknowledgements The authors would like to thank the Natural Sciences and Engineering Research Council of Canada, the Canadian Institute For Advanced Research (CIFAR), the National Science Foundation and Office of Naval Research for support. Y.L. and Y.B. are CIFAR fellows.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of this paper at go.nature.com/7cjbaa. Correspondence should be addressed to Y.L. (yann@cs.nyu.edu).

TensorFlow: A system for large-scale machine learning
Martı́n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,
Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng

Google Brain

Abstract

TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.

1 Introduction

In recent years, machine learning has driven advances in many different fields [3, 5, 24, 25, 29, 31, 42, 47, 50, 52, 57, 67, 68, 72, 76]. We attribute this success to the invention of more sophisticated machine learning models [44, 54], the availability of large datasets for tackling problems in these fields [9, 64], and the development of software platforms that enable the easy use of large amounts of computational resources for training such models on these large datasets [14, 20].

We have developed the TensorFlow system for experimenting with new models, training them on large datasets, and moving them into production. We have based TensorFlow on many years of experience with our first-generation system, DistBelief [20], both simplifying and generalizing it to enable researchers to explore a wider variety of ideas with relative ease. TensorFlow supports both large-scale training and inference: it efficiently uses hundreds of powerful (GPU-enabled) servers for fast training, and it runs trained models for inference in production on various platforms, ranging from large distributed clusters in a datacenter, down to running locally on mobile devices. At the same time, it is flexible enough to support experimentation and research into new machine learning models and system-level optimizations.

TensorFlow uses a unified dataflow graph to represent both the computation in an algorithm and the state on which the algorithm operates. We draw inspiration from the high-level programming models of dataflow systems [2, 21, 34] and the low-level efficiency of parameter servers [14, 20, 49]. Unlike traditional dataflow systems, in which graph vertices represent functional computation on immutable data, TensorFlow allows vertices to represent computations that own or update mutable state. Edges carry tensors (multi-dimensional arrays) between nodes, and TensorFlow transparently inserts the appropriate communication between distributed subcomputations. By unifying the computation and state management in a single programming model, TensorFlow allows programmers to experiment with different parallelization schemes that, for example, offload computation onto the servers that hold the shared state to reduce the amount of network traffic. We have also built various coordination protocols, and achieved encouraging results with synchronous replication, echoing recent results [10, 18] that contradict the commonly held belief that asynchronous replication is required for scalable learning [14, 20, 49].

Over the past year, more than 150 teams at Google have used TensorFlow, and we have released the system as an open-source project (software available from https://tensorflow.org). Thanks to our large community of users we have gained experience with many different machine learning applications. In this paper, we focus on neural network training as a challenging systems problem, and select two representative applications from this space: image classification and language modeling. These applications stress computational throughput and aggregate model size respectively, and we use them both to demonstrate the extensibility of TensorFlow, and to evaluate the efficiency and scalability of our present implementation.

2 Background & motivation

We begin by describing the limitations of our previous system (§2.1) and outlining the design principles that we used in the development of TensorFlow (§2.2).

2.1 Previous system: DistBelief

TensorFlow is the successor to DistBelief, which is the distributed system for training neural networks that Google has used since 2011 [20]. DistBelief uses the parameter server architecture, and here we criticize its limitations, but other systems based on this architecture have addressed these limitations in other ways [11, 14, 49]; we discuss those systems in Subsection 2.3.

In the parameter server architecture, a job comprises two disjoint sets of processes: stateless worker processes that perform the bulk of the computation when training a model, and stateful parameter server processes that maintain the current version of the model parameters. DistBelief’s programming model is similar to Caffe’s [38]: the user defines a neural network as a directed acyclic graph of layers that terminates with a loss function. A layer is a composition of mathematical operators: for example, a fully connected layer multiplies its input by a weight matrix, adds a bias vector, and applies a non-linear function (such as a sigmoid) to the result. A loss function is a scalar function that quantifies the difference between the predicted value (for a given input data point) and the ground truth. In a fully connected layer, the weight matrix and bias vector are parameters, which a learning algorithm will update in order to minimize the value of the loss function. DistBelief uses the DAG structure and knowledge of the layers’ semantics to compute gradients for each of the model parameters, via backpropagation [63]. Because the parameter updates in many algorithms are commutative and have weak consistency requirements [61], the worker processes can compute updates independently and write back “delta” updates to each parameter server, which combines the updates with its current state.
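As a sketch of that layer vocabulary (illustrative numpy code, not DistBelief's C++ implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fully_connected(x, W, b):
    # Multiply the input by a weight matrix, add a bias vector,
    # and apply a non-linear function to the result.
    return sigmoid(x @ W + b)

def squared_error_loss(prediction, target):
    # A loss function: a scalar quantifying the difference between
    # the predicted value and the ground truth.
    return float(np.sum((prediction - target) ** 2))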
Although DistBelief has enabled many Google products to use deep neural networks and formed the basis of many machine learning research projects, we soon began to feel its limitations. Its Python-based scripting interface for composing pre-defined layers was adequate for users with simple requirements, but our more advanced users sought three further kinds of flexibility:

Defining new layers
For efficiency, we implemented DistBelief layers as C++ classes. Using a separate, less familiar programming language for implementing layers is a barrier for machine learning researchers who seek to experiment with new layer architectures, such as sampled softmax classifiers [37] and attention modules [53].

Refining the training algorithms
Many neural networks are trained using stochastic gradient descent (SGD), which iteratively refines the parameters of the network by moving them in the direction that maximally decreases the value of the loss function. Several refinements to SGD accelerate convergence by changing the update rule [23, 66]. Researchers often want to experiment with new optimization methods, but doing that in DistBelief involves modifying the parameter server implementation. Moreover, the get() and put() interface for the parameter server is not ideal for all optimization methods: sometimes a set of related parameters must be updated atomically, and in many cases it would be more efficient to offload computation onto the parameter server, and thereby reduce the amount of network traffic.
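For illustration, two update rules of the kind discussed above, written as plain Python; the momentum variant stands in for the many published refinements to SGD [23, 66]:

def sgd_step(w, grad, lr=0.01):
    # Move the parameters in the direction that decreases the loss.
    return w - lr * grad

def momentum_step(w, velocity, grad, lr=0.01, mu=0.9):
    # A refinement that changes the update rule to accelerate convergence.
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity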
Defining new training algorithms
DistBelief workers follow a fixed execution pattern: read a batch of input data and the current parameter values, compute the loss function (a forward pass through the network), compute gradients for each of the parameters (a backward pass), and write the gradients back to the parameter server. This pattern works for training simple feed-forward neural networks, but fails for more advanced models, such as recurrent neural networks, which contain loops [39]; adversarial networks, in which two related networks are trained alternately [26]; and reinforcement learning models, where the loss function is computed by some agent in a separate system, such as a video game emulator [54]. Moreover, there are many other machine learning algorithms—such as expectation maximization, decision forest training, and latent Dirichlet allocation—that do not fit the same mold as neural network training, but could also benefit from a common, well-optimized distributed runtime.

In addition, we designed DistBelief with a single platform in mind: a large distributed cluster of multicore servers [20].
# 1. Construct a graph representing the model.
x = tf.placeholder(tf.float32, [BATCH_SIZE, 784]) # Placeholder for input.
y = tf.placeholder(tf.float32, [BATCH_SIZE, 10]) # Placeholder for labels.

W_1 = tf.Variable(tf.random_uniform([784, 100])) # 784x100 weight matrix.


b_1 = tf.Variable(tf.zeros([100])) # 100-element bias vector.
layer_1 = tf.nn.relu(tf.matmul(x, W_1) + b_1) # Output of hidden layer.

W_2 = tf.Variable(tf.random_uniform([100, 10])) # 100x10 weight matrix.


b_2 = tf.Variable(tf.zeros([10])) # 10-element bias vector.
layer_2 = tf.matmul(layer_1, W_2) + b_2 # Output of linear layer.

# 2. Add nodes that represent the optimization algorithm.


loss = tf.nn.softmax_cross_entropy_with_logits(layer_2, y)
train_op = tf.train.AdagradOptimizer(0.01).minimize(loss)

# 3. Execute the graph on batches of input data.


with tf.Session() as sess: # Connect to the TF runtime.
sess.run(tf.initialize_all_variables()) # Randomly initialize weights.
for step in range(NUM_STEPS): # Train iteratively for NUM_STEPS.
x_data, y_data = ... # Load one batch of input data.
sess.run(train_op, {x: x_data, y: y_data}) # Perform one training step.
Figure 1: An image classifier written using TensorFlow’s Python API. This program is a simple solution to the MNIST
digit classification problem [48], with 784-pixel images and 10 output classes.

We were able to add support for GPU acceleration, when it became clear that this acceleration would be crucial for executing convolutional kernels efficiently [44], but DistBelief remains a heavyweight system that is geared for training deep neural networks on huge datasets, and is difficult to scale down to other environments. In particular, many users want to hone their model locally on a GPU-powered workstation, before scaling the same code to train on a much larger dataset. After training a model on a cluster, the next step is to push the model into production, which might involve integrating the model into an online service, or deploying it onto a mobile device for offline execution. Each of these tasks has some common computational structure, but our colleagues found it necessary to use or create separate systems that satisfy the different performance and resource requirements of each platform. TensorFlow provides a single programming model and runtime system for all of these environments.

2.2 Design principles

We designed TensorFlow to be much more flexible than DistBelief, while retaining its ability to satisfy the demands of Google’s production machine learning workloads. TensorFlow provides a simple dataflow-based programming abstraction that allows users to deploy applications on distributed clusters, local workstations, mobile devices, and custom-designed accelerators. A high-level scripting interface (Figure 1) wraps the construction of dataflow graphs and enables users to experiment with different model architectures and optimization algorithms without modifying the core system. In this subsection, we briefly highlight TensorFlow’s core design principles:

Dataflow graphs of primitive operators
Both TensorFlow and DistBelief use a dataflow representation for their models, but the most striking difference is that a DistBelief model comprises relatively few complex “layers”, whereas the corresponding TensorFlow model represents individual mathematical operators (such as matrix multiplication, convolution, etc.) as nodes in the dataflow graph. This approach makes it easier for users to compose novel layers using a high-level scripting interface. Many optimization algorithms require each layer to have defined gradients, and building layers out of simple operators makes it easy to differentiate these models automatically (§4.1). In addition to the functional operators, we represent mutable state, and the operations that update it, as nodes in the dataflow graph, thus enabling experimentation with different update rules.

Deferred execution

A typical TensorFlow application has two distinct phases: the first phase defines the program (e.g., a neural network to be trained and the update rules) as a symbolic dataflow graph with placeholders for the input data and variables that represent the state; and the second phase executes an optimized version of the program on the set of available devices. By deferring the execution until the entire program is available, TensorFlow can optimize the execution phase by using global information about the computation. For example, TensorFlow achieves high GPU utilization by using the graph’s dependency structure to issue a sequence of kernels to the GPU without waiting for intermediate results. While this design choice makes execution more efficient, we have had to push more complex features—such as dynamic control flow (§3.4)—into the dataflow graph, so that models using these features enjoy the same optimizations.

Common abstraction for heterogeneous accelerators
In addition to general-purpose devices such as multicore CPUs and GPUs, special-purpose accelerators for deep learning can achieve significant performance improvements and power savings. At Google, our colleagues have built the Tensor Processing Unit (TPU) specifically for machine learning; TPUs yield an order of magnitude improvement in performance-per-watt compared to alternative state-of-the-art technology [40]. To support these accelerators in TensorFlow, we define a common abstraction for devices. At a minimum, a device must implement methods for (i) issuing a kernel for execution, (ii) allocating memory for inputs and outputs, and (iii) transferring buffers to and from host memory. Each operator (e.g., matrix multiplication) can have multiple specialized implementations for different devices. As a result, the same program can easily target GPUs, TPUs, or mobile CPUs as required for training, serving, and offline inference.
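A hypothetical Python rendering of that minimal device interface; the real abstraction is a C++ class inside the runtime, and these method names are invented for illustration:

from abc import ABC, abstractmethod

class Device(ABC):
    @abstractmethod
    def run_kernel(self, kernel, inputs):
        """(i) Issue a kernel for execution on this device."""

    @abstractmethod
    def allocate(self, num_bytes):
        """(ii) Allocate memory for operation inputs and outputs."""

    @abstractmethod
    def copy_to_host(self, device_buffer):
        """(iii) Transfer a buffer from device memory to host memory."""

    @abstractmethod
    def copy_from_host(self, host_buffer):
        """(iii) Transfer a buffer from host memory to device memory."""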
TensorFlow uses tensors of primitive values as a common interchange format that all devices understand. At the lowest level, all tensors in TensorFlow are dense; sparse tensors can be represented in terms of dense ones (§3.1). This decision ensures that the lowest levels of the system have simple implementations for memory allocation and serialization, thus reducing the framework overhead. Tensors also enable other optimizations for memory management and communication, such as RDMA and direct GPU-to-GPU transfer.

The main consequence of these principles is that in TensorFlow there is no such thing as a parameter server. On a cluster, we deploy TensorFlow as a set of tasks (named processes that can communicate over a network) that each export the same graph execution API and contain one or more devices. Typically a subset of those tasks assumes the role that a parameter server plays in other systems [11, 14, 20, 49], and we therefore call them PS tasks; the others are worker tasks. However, since a PS task is capable of running arbitrary TensorFlow graphs, it is more flexible than a conventional parameter server: users can program it with the same scripting interface that they use to define models. This flexibility is the key difference between TensorFlow and contemporary systems, and in the rest of the paper we will discuss some of the applications that this flexibility enables.

2.3 Related work

Single-machine frameworks
Many machine learning researchers carry out their work on a single—often GPU-equipped—computer [43, 44], and several single-machine frameworks support this scenario. Caffe [38] is a high-performance framework for training declaratively specified neural networks on multicore CPUs and GPUs. As discussed above, its programming model is similar to DistBelief (§2.1), so it is easy to compose models from existing layers, but relatively difficult to add new layers or optimizers. Theano [2] allows programmers to express a model as a dataflow graph of primitive operators, and generates efficient compiled code for training that model. Its programming model is closest to TensorFlow, and it provides much of the same flexibility in a single machine. Unlike Caffe, Theano, and TensorFlow, Torch [17] offers a powerful imperative programming model for scientific computation and machine learning. It allows fine-grained control over the execution order and memory utilization, which enables power users to optimize the performance of their programs. While this flexibility is useful for research, Torch lacks the advantages of a dataflow graph as a portable representation across small-scale experimentation, production training, and deployment.

Batch dataflow systems
Starting with MapReduce [21], batch dataflow systems have been applied to a large number of machine learning algorithms [70], and more recent systems have focused on increasing expressivity and performance. DryadLINQ [74] adds a high-level query language that supports more sophisticated algorithms than MapReduce. Spark [75] extends DryadLINQ with the ability to cache previously computed datasets in memory, and is therefore better suited to iterative machine learning algorithms (such as k-means clustering and logistic regression) when the input data fit in memory. Dandelion extends DryadLINQ with code generation for GPUs [62] and FPGAs [16].

The principal limitation of a batch dataflow system is that it requires the input data to be immutable, and all of the subcomputations to be deterministic, so that the system can re-execute subcomputations when machines in the cluster fail. This feature—which is beneficial for many conventional workloads—makes updating a machine learning model an expensive operation.

For example, the SparkNet system for training deep neural networks on Spark takes 20 seconds to broadcast weights and collect updates from five workers [55]. As a result, in these systems, each model update step must process larger batches, slowing convergence [8]. We show in Subsection 6.3 that TensorFlow can train larger models on larger clusters with step times as short as 2 seconds.

[Figure 2 diagram omitted.]

Figure 2: A schematic TensorFlow dataflow graph for a training pipeline, containing subgraphs for reading input data, preprocessing, training, and checkpointing state.

Parameter servers
As we discuss in Subsection 2.1, a parameter server architecture uses a set of servers to manage shared state that is updated by a set of parallel workers. This architecture emerged in work on scalable topic modeling [65], and DistBelief showed how it can apply to deep neural network training. Project Adam [14] further applied this architecture for the efficient training of convolutional neural networks; and Li et al.’s “Parameter Server” [49] added innovations in consistency models, fault tolerance, and elastic rescaling. Despite earlier skepticism that parameter servers would be compatible with GPU acceleration [14], Cui et al. recently showed that a parameter server specialized for use with GPUs can achieve speedups on small clusters [18].

[Diagram omitted: the layered TensorFlow architecture, with training libraries and inference libs above Python and C++ clients, a C API, a distributed master and dataflow executor, kernel implementations (Const, Var, MatMul, Conv2D, ReLU, Queue, ...), and networking (RPC, RDMA, ...) and device (CPU, GPU, ...) layers.]

MXNet [11] is perhaps the closest system in design to TensorFlow. It uses a dataflow graph to represent the computation at each worker, and uses a parameter server to scale training across multiple machines. The MXNet parameter server exports a key-value store interface that supports aggregating updates sent from multiple devices in each worker, and using an arbitrary user-provided function to combine incoming updates with the current value. The MXNet key-value store interface [22] does not currently allow sparse gradient updates within a single value, which are crucial for the distributed training of large models (§4.2), and adding this feature would require modifications to the core system.

The parameter server architecture meets many of our requirements, and with sufficient engineering effort it would be possible to build most of the features that we describe in this paper into a parameter server. For TensorFlow we sought a high-level programming model that allows users to customize the code that runs in all parts of the system, so that the cost of experimentation with new optimization algorithms and model architectures is lower. In the next section, we describe the building blocks of a TensorFlow program in more detail.

3 TensorFlow execution model

TensorFlow uses a single dataflow graph to represent all computation and state in a machine learning algorithm, including the individual mathematical operations, the parameters and their update rules, and the input preprocessing (Figure 2). The dataflow graph expresses the communication between subcomputations explicitly, thus making it easy to execute independent computations in parallel and to partition computations across multiple devices. TensorFlow differs from batch dataflow systems (§2.3) in two respects:

• The model supports multiple concurrent executions on overlapping subgraphs of the overall graph.

• Individual vertices may have mutable state that can be shared between different executions of the graph.

The key observation in the parameter server architecture [14, 20, 49] is that mutable state is crucial when training very large models, because it becomes possible to make in-place updates to very large parameters, and propagate those updates to parallel training steps as quickly as possible. Dataflow with mutable state enables TensorFlow to mimic the functionality of a parameter server, but with additional flexibility, because it becomes possible to execute arbitrary dataflow subgraphs on the machines that host the shared model parameters. As a result, our users have been able to experiment with different optimization algorithms, consistency schemes, and parallelization strategies.
3.1 Dataflow graph elements

In a TensorFlow graph, each vertex represents a unit of local computation, and each edge represents the output from, or input to, a vertex. We refer to the computation at vertices as operations, and the values that flow along edges as tensors. In this subsection, we describe the common types of operations and tensors.

Tensors
In TensorFlow, we model all data as tensors (n-dimensional arrays) with the elements having one of a small number of primitive types, such as int32, float32, or string (where string can represent arbitrary binary data). Tensors naturally represent the inputs to and results of the common mathematical operations in many machine learning algorithms: for example, a matrix multiplication takes two 2-D tensors and produces a 2-D tensor; and a batch 2-D convolution takes two 4-D tensors and produces another 4-D tensor.

At the lowest level, all TensorFlow tensors are dense, for the reasons we discuss in Subsection 2.2. TensorFlow offers two alternatives for representing sparse data: either encode the data into variable-length string elements of a dense tensor, or use a tuple of dense tensors (e.g., an n-D sparse tensor with m non-zero elements can be represented in coordinate-list format as an m × n matrix of coordinates and a length-m vector of values). The shape of a tensor can vary in one or more of its dimensions, which makes it possible to represent sparse tensors with differing numbers of elements.
Operations
An operation takes m ≥ 0 tensors as input and produces n ≥ 0 tensors as output. An operation has a named “type” (such as Const, MatMul, or Assign) and may have zero or more compile-time attributes that determine its behavior. An operation can be polymorphic and variadic at compile-time: its attributes determine both the expected types and arity of its inputs and outputs.

For example, the simplest operation, Const, has no inputs and a single output; its value is a compile-time attribute. As another example, AddN sums multiple tensors of the same element type, and it has a type attribute T and an integer attribute N that define its type signature.
Stateful operations: variables An operation can contain mutable state that is read and/or written each time it executes. A Variable operation owns a mutable buffer that may be used to store the shared parameters of a model as it is trained. A Variable has no inputs, and produces a reference handle, which acts as a typed capability for reading and writing the buffer. A Read operation takes a reference handle r as input, and outputs the value of the variable (State[r]) as a dense tensor. Other operations modify the underlying buffer: for example, AssignAdd takes a reference handle r and a tensor value x, and when executed performs the update State′[r] ← State[r] + x. Subsequent Read(r) operations produce the value State′[r].
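To make the reference-handle semantics concrete, here is a minimal sketch of our own (assuming the open-source TensorFlow 1.x Python API, in which tf.assign_add corresponds to the AssignAdd operation described above):

    import tensorflow as tf

    w = tf.Variable(tf.zeros([2, 2]))           # owns a mutable buffer
    update = tf.assign_add(w, tf.ones([2, 2]))  # State'[r] <- State[r] + x

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(update)    # executes AssignAdd once
        print(sess.run(w))  # a subsequent read sees State'[r]: all ones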

270 12th USENIX Symposium on Operating Systems Design and Implementation USENIX Association
3.2 Partial and concurrent execution

TensorFlow uses a dataflow graph to represent all possible computations in a particular application. The API for executing a graph allows the client to specify declaratively the subgraph that should be executed. The client selects zero or more edges to feed input tensors into the dataflow, and one or more edges to fetch output tensors from the dataflow; the runtime then prunes the graph to contain the necessary set of operations. Each invocation of the API is called a step, and TensorFlow supports multiple concurrent steps on the same graph. Stateful operations allow steps to share data and synchronize when necessary.

Figure 2 shows a typical training application, with multiple subgraphs that execute concurrently and interact through shared variables and queues. The core training subgraph depends on a set of model parameters and on input batches from a queue. Many concurrent steps of the training subgraph update the model based on different input batches, to implement data-parallel training. To fill the input queue, concurrent preprocessing steps transform individual input records (e.g., decoding images and applying random distortions), and a separate I/O subgraph reads records from a distributed file system. A checkpointing subgraph runs periodically for fault tolerance (§4.3).

Partial and concurrent execution is responsible for much of TensorFlow’s flexibility. Adding mutable state and coordination via queues makes it possible to specify a wide variety of model architectures in user-level code, which enables advanced users to experiment without modifying the internals of the TensorFlow runtime.

By default, concurrent executions of a TensorFlow subgraph run asynchronously with respect to one another. This asynchrony makes it straightforward to implement machine learning algorithms with weak consistency requirements [61], which include many neural network training algorithms [20]. As we discuss later, TensorFlow also provides the primitives needed to synchronize workers during training (§4.4), which has led to promising results on some learning tasks (§6.3).
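The feed/fetch interface is visible in the open-source Python client; a sketch of our own, in which each Session.run call is one step and only the pruned subgraph needed for the fetched output executes:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[2])  # an edge to feed
    y = x * 2.0                                # fetched below
    z = y + 1.0                                # pruned away: not needed for y

    with tf.Session() as sess:
        print(sess.run(y, feed_dict={x: [1.0, 2.0]}))  # [2.0, 4.0]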
3.3 Distributed execution

Dataflow simplifies distributed execution, because it makes communication between subcomputations explicit. It enables the same TensorFlow program to be deployed to a cluster of GPUs for training, a cluster of TPUs for serving, and a cellphone for mobile inference.

Each operation resides on a particular device, such as a CPU or GPU in a particular task. A device is responsible for executing a kernel for each operation assigned to it. TensorFlow allows multiple kernels to be registered for a single operation, with specialized implementations for a particular device or data type (see §5 for details). For many operations, such as element-wise operators (Add, Sub, etc.), we can compile a single kernel implementation for CPU and GPU using different compilers.

The TensorFlow runtime places operations on devices, subject to implicit or explicit constraints in the graph. The placement algorithm computes a feasible set of devices for each operation, calculates the sets of operations that must be colocated, and selects a satisfying device for each colocation group. It respects implicit colocation constraints that arise because each stateful operation and its state must be placed on the same device. In addition, the user may specify partial device preferences such as “any device in a particular task”, or “a GPU in any task”, and the runtime will respect these constraints. A typical training application will use client-side programming constructs to add constraints such that, for example, parameters are distributed among a set of “PS” tasks (§4.2).

TensorFlow thus permits great flexibility in how operations in the dataflow graph are mapped to devices. While simple heuristics yield adequate performance for novice users, expert users can optimize performance by manually placing operations to balance the computation, memory, and network requirements across multiple tasks and multiple devices within those tasks. An open question is how TensorFlow can automatically determine placements that achieve close to optimal performance on a given set of devices, thus freeing users from this concern. Even without such automation, it may be worthwhile to separate placement directives from other aspects of model definitions, so that, for example, it would be trivial to modify placements after a model has been trained.
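In the Python client these preferences are expressed as device annotations; a hypothetical sketch (ours), assuming the TF 1.x API and a cluster whose jobs are named "ps" and "worker":

    import tensorflow as tf

    with tf.device("/job:ps/task:0"):  # parameters live on a PS task
        w = tf.Variable(tf.random_normal([784, 10]))

    with tf.device("/job:worker/task:0/gpu:0"):  # compute on a worker GPU
        x = tf.placeholder(tf.float32, [None, 784])
        logits = tf.matmul(x, w)  # runtime adds Send/Recv across devices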
Once the operations in a graph have been placed, and the partial subgraph has been computed for a step (§3.2), TensorFlow partitions the operations into per-device subgraphs. A per-device subgraph for device d contains all of the operations that were assigned to d, with additional Send and Recv operations that replace edges across device boundaries. Send transmits its single input to a specified device as soon as the tensor is available, using a rendezvous key to name the value. Recv has a single output, and blocks until the value for a specified rendezvous key is available locally, before producing that value. Send and Recv have specialized implementations for several device-type pairs; we describe some of these in Section 5.

We optimized TensorFlow for executing large subgraphs repeatedly with low latency. Once the graph for a step has been pruned, placed, and partitioned, its subgraphs are cached in their respective devices. A client session maintains the mapping from step definitions to cached subgraphs, so that a distributed step on a large graph can be initiated with one small message to each participating task. This model favors static, reusable graphs, but it can support dynamic computations using dynamic control flow, as the next subsection describes.

    input = ...  # A sequence of tensors
    state = 0    # Initial state
    w = ...      # Trainable weights

    for i in range(len(input)):
        state, out[i] = f(state, w, input[i])

Figure 3: Pseudocode for an abstract RNN (§3.4). The function f typically comprises differentiable operations such as matrix multiplications and convolutions [32]. TensorFlow implements the loop in its dataflow graph.

3.4 Dynamic control flow
TensorFlow supports advanced machine learning algorithms that contain conditional and iterative control flow. For example, a recurrent neural network (RNN) [39] such as an LSTM [32] can generate predictions from sequential data. Google’s Neural Machine Translation system uses TensorFlow to train a deep LSTM that achieves state-of-the-art performance on many translation tasks [73]. The core of an RNN is a recurrence relation, where the output for sequence element i is a function of some state that accumulates across the sequence (Figure 3). In this case, dynamic control flow enables iteration over sequences that have variable lengths, without unrolling the computation to the length of the longest sequence.

As we discussed in Subsection 2.2, TensorFlow uses deferred execution via the dataflow graph to offload larger chunks of work to accelerators. Therefore, to implement RNNs and other advanced algorithms, we add conditional (if statement) and iterative (while loop) programming constructs in the dataflow graph itself. We use these primitives to build higher-order constructs, such as map(), fold(), and scan() [2].

For this purpose, we borrow the Switch and Merge primitives from classic dynamic dataflow architectures [4]. Switch is a demultiplexer: it takes a data input and a control input, and uses the control input to select which of its two outputs should produce a value. The Switch output not taken receives a special dead value, which propagates recursively through the rest of the graph until it reaches a Merge operation. Merge is a multiplexer: it forwards at most one non-dead input to its output, or produces a dead output if both of its inputs are dead. The conditional operator uses Switch to execute one of two branches based on the runtime value of a boolean tensor, and Merge to combine the outputs of the branches. The while loop is more complicated, and uses Enter, Exit, and NextIteration operators to ensure that the loop is well-formed [56].
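In the Python client these constructs surface as tf.cond and tf.while_loop (a sketch of ours, assuming the TF 1.x API); the runtime lowers them to the Switch, Merge, Enter, Exit, and NextIteration operators described above:

    import tensorflow as tf

    pred = tf.placeholder(tf.bool)
    x = tf.constant(3.0)

    # Conditional: only one branch executes per step.
    y = tf.cond(pred, lambda: x * 2.0, lambda: x - 1.0)

    # Iteration: loop until i >= 10, without unrolling the graph.
    i = tf.constant(0)
    loop = tf.while_loop(lambda i: i < 10, lambda i: i + 1, [i])

    with tf.Session() as sess:
        print(sess.run(y, feed_dict={pred: True}))  # 6.0
        print(sess.run(loop))                       # 10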
The execution of iterations can overlap, and TensorFlow can also partition conditional branches and loop bodies across multiple devices and processes. The partitioning step adds logic to coordinate the start and termination of each iteration on each device, and to decide the termination of the loop. As we will see in Subsection 4.1, TensorFlow also supports automatic differentiation of control flow constructs. Automatic differentiation adds the subgraphs for computing gradients to the dataflow graph, which TensorFlow partitions across potentially distributed devices to compute the gradients in parallel.

4 Extensibility case studies

By choosing a unified representation for all computation in TensorFlow, we enable users to experiment with features that were hard-coded into the DistBelief runtime. In this section, we discuss four extensions that we have built using dataflow primitives and “user-level” code.

4.1 Differentiation and optimization

Many learning algorithms train a set of parameters using some variant of SGD, which entails computing the gradients of a loss function with respect to those parameters, then updating the parameters based on those gradients. TensorFlow includes a user-level library that differentiates a symbolic expression for a loss function and produces a new symbolic expression representing the gradients. For example, given a neural network as a composition of layers and a loss function, the library will automatically derive the backpropagation code.

The differentiation algorithm performs breadth-first search to identify all of the backwards paths from the target operation (e.g., a loss function) to a set of parameters, and sums the partial gradients that each path contributes. Our users frequently specialize the gradients for some operations, and they have implemented optimizations like batch normalization [33] and gradient clipping [60] to accelerate training and make it more robust. We have extended the algorithm to differentiate conditional and iterative subcomputations (§3.4) by adding nodes to the graph that record the control flow decisions in the forward pass, and replaying those decisions in reverse during the backward pass. Differentiating iterative computations over long sequences can lead to a large amount of intermediate state being accumulated in memory, and we have developed techniques for managing limited GPU memory on these computations.

TensorFlow users can also experiment with a wide range of optimization algorithms, which compute new values for the parameters in each training step. SGD is easy to implement in a parameter server: for each parameter W, gradient ∂L/∂W, and learning rate α, the update rule is W′ ← W − α × ∂L/∂W. A parameter server can implement SGD by using -= as the write operation, and writing α × ∂L/∂W to each W after a training step.

However, there are many more advanced optimization schemes that are difficult to express as a single write operation. For example, the Momentum algorithm accumulates a “velocity” for each parameter based on its gradient over multiple iterations, then computes the parameter update from that accumulation; many refinements to this algorithm have been proposed [66]. Implementing Momentum in DistBelief [20] required modifications to the parameter server implementation to change the representation of parameter data and execute complex logic in the write operation; such modifications are challenging for many users. Optimization algorithms are the topic of active research, and researchers have implemented several on top of TensorFlow, including Momentum, AdaGrad, AdaDelta, RMSProp, Adam, and L-BFGS. These can be built in TensorFlow using Variable operations and primitive mathematical operations without modifying the underlying system, so it is easy to experiment with new algorithms as they emerge.
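As an illustration (ours, not code from the paper): in the open-source TF 1.x API the user-level differentiation library is exposed as tf.gradients, and a Momentum-style update can be written entirely from Variable and primitive math operations:

    import tensorflow as tf

    w = tf.Variable([1.0, 2.0])
    loss = tf.reduce_sum(tf.square(w))   # L(w) = sum of w^2
    grad = tf.gradients(loss, [w])[0]    # symbolic dL/dw

    # Momentum from primitives: v' = mu*v + g; w' = w - lr*v'.
    v = tf.Variable(tf.zeros([2]))
    v_new = tf.assign(v, 0.9 * v + grad)
    train_op = tf.assign_sub(w, 0.01 * v_new)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(5):
            sess.run(train_op)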
[Figure 4 schematic; node labels: Shard 0 / Shard 1, Gather, Mod, Part, Stitch, Sum, Switch, Merge, True branch / False branch.]

Figure 4: Schematic dataflow for an embedding layer (§4.2) with a two-way sharded embedding matrix.

[Figure 2 schematic (referenced in §3.2 and §4.3): a training pipeline in which a Reader loads input data from a distributed file system, Preprocessing steps fill a shuffle queue, concurrent Training steps read params, run Fwd and Back passes, and apply grads against the shared Parameters, and a periodic checkpoint subgraph saves them.]

4.2 Training very large models

To train a model on high-dimensional data, such as words in a corpus of text [7], it is common to use a distributed representation, which embeds a training example as a pattern of activity across several neurons, and which can be learned by backpropagation [30]. For example, in a language model, a training example might be a sparse vector with non-zero entries corresponding to the IDs of words in a vocabulary, and the distributed representation for each word will be a lower-dimensional vector [6]. “Wide and deep learning” creates distributed representations from cross-product transformations on categorical features, and the implementation on TensorFlow is used to power the Google Play app store recommender system [12].

Inference begins by multiplying a batch of b sparse vectors against an n × d embedding matrix, where n is the number of words in the vocabulary, and d is the desired dimensionality, to produce a much smaller b × d dense matrix representation; for training, most optimization algorithms modify only the rows of the embedding matrix that were read by the sparse multiplication. In TensorFlow models that process sparse data, n × d can amount to gigabytes of parameters: e.g., a large language model may use over 10^9 parameters with a vocabulary of 800,000 words [41], and we have experience with document models [19] where the parameters occupy several terabytes. Such models are too large to copy to a worker on every use, or even to store in RAM on a single host.

We implement sparse embedding layers in the TensorFlow graph as a composition of primitive operations. Figure 4 shows a simplified graph for an embedding layer that is split across two parameter server tasks. The core operation of this subgraph is Gather, which extracts a sparse set of rows from a tensor, and TensorFlow colocates this operation with the variable on which it operates. The dynamic partition (Part) operation divides the incoming indices into variable-sized tensors that contain the indices destined for each shard, and the dynamic stitching (Stitch) operation reassembles the partial results from each shard into a single result tensor. Each of these operations has a corresponding gradient, so it supports automatic differentiation (§4.1), and the result is a set of sparse update operations that act on just the values that were originally gathered from each of the shards.

Users writing a TensorFlow model typically do not construct graphs like Figure 4 manually. Instead TensorFlow includes libraries that expose the abstraction of a sharded parameter, and build appropriate graphs of primitive operations based on the desired degree of distribution.

While sparse reads and updates are possible in a parameter server [49], TensorFlow adds the flexibility to offload arbitrary computation onto the devices that host the shared parameters. For example, classification models typically use a softmax classifier that multiplies the final output by a weight matrix with c columns, where c is the number of possible classes; for a language model, c is the size of the vocabulary, which can be large. Our users have experimented with several schemes to accelerate the softmax calculation. The first is similar to an optimization in Project Adam [14], whereby the weights are sharded across several tasks, and the multiplication and gradient calculation are colocated with the shards. More efficient training is possible using a sampled softmax [37], which performs a sparse multiplication based on the true class for an example and a set of randomly sampled false classes. We compare the performance of these two schemes in §6.4.
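A sketch (ours) of a sharded embedding lookup in the TF 1.x Python API; given a list of sharded variables, tf.nn.embedding_lookup builds a Gather/Part/Stitch graph like the one in Figure 4:

    import tensorflow as tf

    # Embedding matrix sharded across two variables (e.g., two PS tasks).
    shards = [tf.Variable(tf.random_normal([400000, 128]))
              for _ in range(2)]

    ids = tf.placeholder(tf.int64, shape=[None])  # sparse word IDs
    # partition_strategy="mod" matches the Mod/Part/Stitch scheme above.
    vectors = tf.nn.embedding_lookup(shards, ids,
                                     partition_strategy="mod")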
4.3 Fault tolerance

Training a model can take several hours or days, even using a large number of machines [14, 20]. We often need to train a model using non-dedicated resources, for example using the Borg cluster manager [71], which does not guarantee availability of the same resources for the duration of the training process. Therefore, a long-running TensorFlow job is likely to experience failure or pre-emption, and we require some form of fault tolerance. It is unlikely that tasks will fail so often that individual operations need fault tolerance, so a mechanism like Spark’s RDDs [75] would impose significant overhead for little benefit. There is no need to make every write to the parameter state durable, because we can recompute any update from the input data, and many learning algorithms do not require strong consistency [61].
Figure 5: Three synchronization schemes for parallel SGD, shown for a PS task and Workers 1–3: (a) asynchronous replication, (b) synchronous replication, and (c) synchronous replication with a backup worker. Each color represents a different starting parameter value; a white square is a parameter update. In (c), a dashed rectangle represents a backup worker whose result is discarded.

We implement user-level checkpointing for fault tolerance, using two operations in the graph (Figure 2): Save writes one or more tensors to a checkpoint file, and Restore reads one or more tensors from a checkpoint file. Our typical configuration connects each Variable in a task to the same Save operation, with one Save per task, to maximize the I/O bandwidth to a distributed file system. The Restore operations read named tensors from a file, and a standard Assign stores the restored value in its respective variable. During training, a typical client runs all of the Save operations periodically to produce a new checkpoint; when the client starts up, it attempts to Restore the latest checkpoint.

TensorFlow includes a client library for constructing the appropriate graph structure and for invoking Save and Restore as necessary. This behavior is customizable: the user can apply different policies to subsets of the variables in a model, or customize the checkpoint retention scheme. For example, many users retain checkpoints with the highest score in a custom evaluation metric. The implementation is also reusable: it may be used for model fine-tuning and unsupervised pre-training [45, 47], which are forms of transfer learning, in which the parameters of a model trained on one task (e.g., recognizing general images) are used as the starting point for another task (e.g., recognizing breeds of dog). Having checkpoint and parameter management as programmable operations in the graph gives users the flexibility to implement schemes like these and others that we have not anticipated.
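In the open-source release this client library is tf.train.Saver; a minimal sketch of ours:

    import tensorflow as tf

    w = tf.Variable(tf.zeros([10]), name="w")
    saver = tf.train.Saver()  # builds Save/Restore ops for all variables

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, "/tmp/model.ckpt")     # periodic checkpoint
        saver.restore(sess, "/tmp/model.ckpt")  # on startup, restore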
The checkpointing library does not attempt to produce consistent checkpoints: if training and checkpointing execute concurrently, the checkpoint may include none, all, or some of the updates from the training step. This behavior is compatible with the relaxed guarantees of asynchronous SGD [20]. Consistent checkpoints require additional synchronization to ensure that update operations do not interfere with checkpointing; if desired, one can use the scheme in the next subsection to take a checkpoint after the synchronous update step.

4.4 Synchronous replica coordination

SGD is robust to asynchrony [61], and many systems train deep neural networks using asynchronous parameter updates [14, 20], which are believed scalable because they maintain high throughput in the presence of stragglers. The increased throughput comes at the cost of using stale parameter values in training steps. Some have recently revisited the assumption that synchronous training does not scale [10, 18]. Since GPUs enable training with hundreds—rather than thousands [47]—of machines, synchronous training may be faster (in terms of time to quality) than asynchronous training on the same platform.

Though we originally designed TensorFlow for asynchronous training, we have begun experimenting with synchronous methods. The TensorFlow graph enables users to change how parameters are read and written when training a model, and we implement three alternatives. In the asynchronous case (Figure 5(a)), each worker reads the current values of parameters when each step begins, and applies its gradient to the (possibly different) current values at the end: this approach ensures high utilization, but the individual steps use stale parameter values, making each step less effective. We implement the synchronous version using queues (§3.1) to coordinate execution: a blocking queue acts as a barrier to ensure that all workers read the same parameter values, and a per-variable queue accumulates gradient updates from all workers in order to apply them atomically. The simple synchronous version (Figure 5(b)) accumulates updates from all workers before applying them, but slow workers limit overall throughput.

To mitigate stragglers, we implement backup workers (Figure 5(c), [10]), which are similar to MapReduce backup tasks [21]. Whereas MapReduce starts backup tasks reactively—after detecting a straggler—our backup workers run proactively, and the aggregation takes the first m of n updates produced. We exploit the fact that SGD samples training data randomly at each step, so each worker processes a different random batch, and it is not a problem if a particular batch is ignored. In §6.3 we show how backup workers improve throughput by up to 10%.
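The open-source release wraps this queue-based coordination in tf.train.SyncReplicasOptimizer; a hedged sketch (ours) for a 50-worker job that aggregates the first 46 gradients per step, so the 4 slowest workers act as backups:

    import tensorflow as tf

    opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    sync_opt = tf.train.SyncReplicasOptimizer(opt,
                                              replicas_to_aggregate=46,
                                              total_num_replicas=50)
    # train_op = sync_opt.minimize(loss, global_step=step), where `loss`
    # and `step` come from the surrounding (hypothetical) training program.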
Figure 6: The layered TensorFlow architecture. [Schematic, top to bottom: Training libraries and Inference libs; Python client, C++ client, …; C API; Distributed master and Dataflow executor; Kernel implementations (Const, Var, MatMul, Conv2D, ReLU, Queue, …); Networking layer (RPC, RDMA, …) and Device layer (CPU, GPU, …).]

5 Implementation

The TensorFlow runtime is a cross-platform library. Figure 6 illustrates its architecture: a C API separates user-level code in different languages from the core runtime. The core TensorFlow library is implemented in C++ for portability and performance: it runs on several operating systems including Linux, Mac OS X, Windows, Android, and iOS; the x86 and various ARM-based CPU architectures; and NVIDIA’s Kepler, Maxwell, and Pascal GPU microarchitectures. The implementation is open-source, and we have accepted several external contributions that enable TensorFlow to run on other architectures.

The distributed master translates user requests into execution across a set of tasks. Given a graph and a step definition, it prunes (§3.2) and partitions (§3.3) the graph to obtain subgraphs for each participating device, and caches these subgraphs so that they may be re-used in subsequent steps. Since the master sees the overall computation for a step, it applies standard optimizations such as common subexpression elimination and constant folding; pruning is a form of dead code elimination. It then coordinates execution of the optimized subgraphs across a set of tasks.

The dataflow executor in each task handles requests from the master, and schedules the execution of the kernels that comprise a local subgraph. We optimize the dataflow executor for running large graphs with low overhead. Our current implementation can execute 10,000 subgraphs per second (§6.2), which enables a large number of replicas to make rapid, fine-grained training steps. The dataflow executor dispatches kernels to local devices and runs kernels in parallel when possible, for example by using multiple CPU cores or GPU streams.

The runtime contains over 200 standard operations, including mathematical, array manipulation, control flow, and state management operations. Many of the operation kernels are implemented using Eigen::Tensor [36], which uses C++ templates to generate efficient parallel code for multicore CPUs and GPUs; however, we liberally use libraries like cuDNN [13] where a more efficient kernel implementation is possible. We have also implemented quantization, which enables faster inference in environments such as mobile devices and high-throughput datacenter applications, and use the gemmlowp low-precision matrix library [35] to accelerate quantized computation.

We specialize Send and Recv operations for each pair of source and destination device types. Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and data transfer; transfers between two local GPUs use DMA to relieve pressure on the host. For transfers between tasks, TensorFlow uses multiple protocols, including gRPC over TCP, and RDMA over Converged Ethernet. We are also investigating optimizations for GPU-to-GPU communication that use collective operations [59].

Section 4 describes features that we implement completely above the C API, in user-level code. Typically, users compose standard operations to build higher-level abstractions, such as neural network layers, optimization algorithms (§4.1), and sharded embedding computations (§4.2). TensorFlow supports multiple client languages, and we have prioritized Python and C++, because our internal users are most familiar with these languages. As features become more established, we typically port them to C++, so that users can access an optimized implementation from all client languages.

If it is difficult or inefficient to represent a subcomputation as a composition of operations, users can register additional kernels that provide an efficient implementation written in C++. We have found it profitable to hand-implement fused kernels for some performance critical operations, such as the ReLU and Sigmoid activation functions and their corresponding gradients. We are currently investigating automatic kernel fusion using a compilation-based approach.

In addition to the core runtime, our colleagues have built several tools that aid users of TensorFlow. These include serving infrastructure for inference in production [27], a visualization dashboard that enables users to follow the progress of a training run, a graph visualizer that helps users to understand the connections in a model, and a distributed profiler that traces the execution of a computation across multiple devices and tasks. We describe these tools in an extended whitepaper [1].
6 Evaluation

In this section, we evaluate the performance of TensorFlow on several synthetic and realistic workloads. Unless otherwise stated, we run all experiments on a shared production cluster, and all figures plot median values with error bars showing the 10th and 90th percentiles.

In this paper we focus on system performance metrics, rather than learning objectives like time to accuracy. TensorFlow is a system that allows machine learning practitioners and researchers to experiment with new techniques, and this evaluation demonstrates that the system (i) has little overhead, and (ii) can employ large amounts of computation to accelerate real-world applications. While techniques like synchronous replication can enable some models to converge in fewer steps overall, we defer the analysis of such improvements to other papers.

Figure 7: Baseline throughput for synchronous replication with a null model. Sparse accesses enable TensorFlow to handle larger models, such as embedding matrices (§4.2). [Plot: batches/second (log scale, 1–10,000) versus number of workers (1–100), with curves for Scalar, Sparse 1GB, Sparse 16GB, Dense 100M, and Dense 1GB.]

6.1 Single-machine benchmarks

Although TensorFlow is a system for “large-scale” machine learning, it is imperative that scalability does not mask poor performance at small scales [51]. Table 1 contains results from Chintala’s benchmark of convolutional models on TensorFlow and three single-machine frameworks [15]. All frameworks use a six-core Intel Core i7-5930K CPU at 3.5 GHz and an NVIDIA Titan X GPU.

                      Training step time (ms)
    Library       AlexNet  Overfeat  OxfordNet  GoogleNet
    Caffe [38]        324       823       1068       1935
    Neon [58]          87       211        320        270
    Torch [17]         81       268        529        470
    TensorFlow         81       279        540        445

Table 1: Step times for training four convolutional models with different libraries, using one GPU. All results are for training with 32-bit floats. The fastest time for each model is shown in bold.

Table 1 shows that TensorFlow achieves shorter step times than Caffe [38], and performance within 6% of the latest version of Torch [17]. We attribute the similar performance of TensorFlow and Torch to the fact that both use the same version of the cuDNN library [13], which implements the convolution and pooling operations on the critical path for training; Caffe uses open-source implementations for these operations that are simpler but less efficient than cuDNN. The Neon library [58] outperforms TensorFlow on three of the models, by using hand-optimized convolutional kernels [46] implemented in assembly language; in principle, we could follow the same approach in TensorFlow, but we have not yet done so.

6.2 Synchronous replica microbenchmark

The performance of our coordination implementation (§4.4) is the main limiting factor for scaling with additional machines. Figure 7 shows the number of null training steps that TensorFlow performs per second for varying model sizes, and increasing numbers of synchronous workers. In a null training step, a worker fetches the shared model parameters from 16 PS tasks, performs a trivial computation, and sends updates to the parameters.

The Scalar curve in Figure 7 shows the best performance that we could expect for a synchronous training step, because only a single 4-byte value is fetched from each PS task. The median step time is 1.8 ms using a single worker, growing to 8.8 ms with 100 workers. These times measure the overhead of the synchronization mechanism, and capture some of the noise that we expect when running on a shared cluster.

The Dense curves show the performance of a null step when the worker fetches the entire model. We repeat the experiment with models of size 100 MB and 1 GB, with the parameters sharded equally over 16 PS tasks. The median step time for 100 MB increases from 147 ms with one worker to 613 ms with 100 workers. For 1 GB, it increases from 1.01 s with one worker to 7.16 s with 100 workers.

For large models, a typical training step accesses only a subset of the parameters, and the Sparse curves show the throughput of the embedding lookup operation from Subsection 4.2. Each worker reads 32 randomly selected entries from a large embedding matrix containing 1 GB or 16 GB of data. As expected, the step times do not vary with the size of the embedding, and TensorFlow achieves step times ranging from 5 to 20 ms.
Figure 8: Results of the performance evaluation for Inception-v3 training (§6.3). (a) TensorFlow achieves slightly better throughput than MXNet for asynchronous training. (b) Asynchronous and synchronous training throughput increases with up to 200 workers. (c) Adding backup workers to a 50-worker training job can reduce the overall step time, and improve performance even when normalized for resource consumption. [Panels: (a) baseline performance vs. MXNet, images/second/worker versus number of workers; (b) coordination scalability, images/second versus number of workers for asynchronous and synchronous training; (c) backup worker effectiveness, step time (seconds) and normalized speedup versus number of backup workers.]

6.3 Image classification

Deep neural networks have achieved breakthrough performance on computer vision tasks such as recognizing objects in photographs [44], and these tasks are a key application for TensorFlow at Google. Training a network to high accuracy requires a large amount of computation, and we use TensorFlow to scale out this computation across a cluster of GPU-enabled servers. In these experiments, we focus on Google’s Inception-v3 model, which achieves 78.8% accuracy in the ILSVRC 2012 image classification challenge [69]; the same techniques apply to other deep convolutional models—such as ResNet [28]—implemented on TensorFlow. We investigate the scalability of training Inception-v3 using multiple replicas. We configure TensorFlow with 7 PS tasks, and vary the number of worker tasks using two different clusters.

For the first experiment, we compare the performance of training Inception using asynchronous SGD on TensorFlow and MXNet, a contemporary system using a parameter server architecture. For this experiment we use Google Compute Engine virtual machines running on Intel Xeon E5 servers with NVIDIA K80 GPUs, configured with 8 vCPUs, 16Gbps of network bandwidth, and one GPU per VM. Both systems use 7 PS tasks running on separate VMs with no GPU. Figure 8(a) shows that TensorFlow achieves performance that is marginally better than MXNet. As expected, the results are largely determined by single-GPU performance, and both systems use cuDNN version 5.1, so they have access to the same optimized GPU kernels.

Using a larger internal cluster (with NVIDIA K40 GPUs, and a shared datacenter network), we investigate the effect of coordination (§4.4) on training performance. Ideally, with efficient synchronous training, a model such as Inception-v3 will train in fewer steps, and converge to a higher accuracy than with asynchronous training [10].

Training throughput improves to 2,300 images per second as we increase the number of workers to 200, but with diminishing returns (Figure 8(b)). As we add more workers, the step time increases, because there is more contention on the PS tasks, both at the network interface and in the aggregation of updates. As expected, for all configurations, synchronous steps are longer than asynchronous steps, because all workers must wait for the slowest worker to catch up before starting the next step. While the median synchronous step is approximately 10% longer than an asynchronous step with the same workers, above the 90th percentile the synchronous performance degrades sharply, because stragglers disproportionately impact tail latency.

To mitigate tail latency, we add backup workers so that a step completes when the first m of n tasks produce gradients. Figure 8(c) shows the effect of adding backup workers to a 50-worker Inception training job. Each additional backup worker up to and including the fourth reduces the median step time, because the probability of a straggler affecting the step decreases. Adding a fifth backup worker slightly degrades performance, because the 51st worker (i.e., the first whose result is discarded) is more likely to be a non-straggler that generates more incoming traffic for the PS tasks. Figure 8(c) also plots the normalized speedup for each configuration, defined as t(0)/t(b) × 50/(50 + b) (where t(b) is the median step time with b backup workers), which discounts the speedup by the fraction of additional resources consumed. Although adding 4 backup workers achieves the shortest overall step time (1.93 s), adding 3 achieves the highest normalized speedup (9.5%), and hence uses less aggregate GPU-time to reach the same quality.
Figure 9: Increasing the number of PS tasks leads to increased throughput for language model training, by parallelizing the softmax computation. Sampled softmax increases throughput by performing less computation. [Panels: (a) full softmax and (b) sampled softmax; words processed/second (log scale, 10^1–10^5) versus number of PS tasks (1–32), with curves for 256, 32, and 4 workers.]
6.4 Language modeling

Given a sequence of words, a language model predicts the most probable next word [6]. Therefore, language models are integral to predictive text, speech recognition, and translation applications. In this experiment, we investigate how TensorFlow can train a recurrent neural network (viz. LSTM-512-512 [41]) to model the text in the One Billion Word Benchmark [9]. The vocabulary size |V| limits the performance of training, because the final layer must decode the output state into probabilities for each of |V| classes [37]. The resulting parameters can be large (|V| × d for output state dimension d) so we use the techniques for handling large models from Subsection 4.2. We use a restricted vocabulary of the most common 40,000 words—instead of the full 800,000 words [9]—in order to experiment with smaller configurations.

Figure 9 shows the training throughput, measured in words per second, for varying numbers of PS and worker tasks, and two softmax implementations. The full softmax (Figure 9(a)) multiplies each output by a 512 × 40,000 weight matrix sharded across the PS tasks. Adding more PS tasks increases the throughput, because TensorFlow can exploit distributed model parallelism [20, 43] and perform the multiplication and gradient calculation on the PS tasks, as in Project Adam [14]. Adding a second PS task is more effective than increasing from 4 to 32, or 32 to 256 workers. Eventually the throughput saturates, as the LSTM calculations dominate the training step.

The sampled softmax (Figure 9(b)) reduces the data transferred and the computation performed on the PS tasks [37]. Instead of a dense weight matrix, it multiplies the output by a random sparse matrix containing weights for the true class and a random sample of false classes. We sample 512 classes for each batch, thus reducing the softmax data transfer and computation by a factor of 78.
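The open-source API exposes this technique as tf.nn.sampled_softmax_loss; a sketch of ours with sizes matching the experiment (40,000-word vocabulary, 512-dimensional output state, 512 sampled classes):

    import tensorflow as tf

    num_classes, dim, num_sampled = 40000, 512, 512
    weights = tf.Variable(tf.random_normal([num_classes, dim]))
    biases = tf.Variable(tf.zeros([num_classes]))
    inputs = tf.placeholder(tf.float32, [None, dim])  # LSTM output state
    labels = tf.placeholder(tf.int64, [None, 1])      # true next-word IDs

    # Multiplies by the weights for the true class plus 512 sampled
    # false classes, instead of the full 512 x 40,000 matrix.
    loss = tf.nn.sampled_softmax_loss(weights=weights, biases=biases,
                                      labels=labels, inputs=inputs,
                                      num_sampled=num_sampled,
                                      num_classes=num_classes)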

7 Conclusions

We have described the TensorFlow system and its programming model. TensorFlow’s dataflow representation subsumes existing work on parameter server systems, and offers a set of uniform abstractions that allow users to harness large-scale heterogeneous systems, both for production tasks and for experimenting with new approaches. We have shown several examples of how the TensorFlow programming model facilitates experimentation (§4) and demonstrated that the resulting implementations are performant and scalable (§6).

Our initial experience with TensorFlow is encouraging. A large number of groups at Google have deployed TensorFlow in production, and TensorFlow is helping our research colleagues to make new advances in machine learning. Since we released TensorFlow as open-source software, more than 14,000 people have forked the source code repository, the binary distribution has been downloaded over one million times, and dozens of machine learning models that use TensorFlow have been published.

TensorFlow is a work in progress. Its flexible dataflow representation enables power users to achieve excellent performance, but we have not yet determined default policies that work well for all users. Further research on automatic optimization should bridge this gap. On the system level, we are actively developing algorithms for automatic placement, kernel fusion, memory management, and scheduling. While the current implementations of mutable state and fault tolerance suffice for applications with weak consistency requirements, we expect that some TensorFlow applications will require stronger consistency, and we are investigating how to build such policies at user-level. Finally, some users have begun to chafe at the limitations of a static dataflow graph, especially for algorithms like deep reinforcement learning [54]. Therefore, we face the intriguing problem of providing a system that transparently and efficiently uses distributed resources, even when the structure of the computation unfolds dynamically.

Acknowledgments

We gratefully acknowledge contributions from our colleagues within Google, and from members of the wider machine learning community. In particular, we appreciate the feedback we have received from the rest of the Google Brain team and the many users of DistBelief and TensorFlow. We thank the anonymous OSDI reviewers and our shepherd KyoungSoo Park for their suggestions, which greatly improved the presentation of this paper.
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint, 1603.04467, 2016. arxiv.org/abs/1603.04467. Software available from tensorflow.org.

[2] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Brébisson, O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Côté, M. Côté, A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merriënboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schlüter, J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint, 1605.02688, 2016. arxiv.org/abs/1605.02688.

[3] A. Angelova, A. Krizhevsky, and V. Vanhoucke. Pedestrian detection with a large-field-of-view deep network. In Proceedings of ICRA, pages 704–711. IEEE, 2015. www.vision.caltech.edu/anelia/publications/Angelova15LFOV.pdf.

[4] Arvind and D. E. Culler. Dataflow architectures. In Annual Review of Computer Science Vol. 1, 1986, pages 225–253. Annual Reviews Inc., 1986. www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235.

[5] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint, 1412.7755, 2014. arxiv.org/abs/1412.7755.

[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.

[7] T. Brants and A. Franz. Web 1T 5-gram version 1, 2006. catalog.ldc.upenn.edu/LDC2006T13.

[8] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012. dx.doi.org/10.1007/s10107-012-0572-5.

[9] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint, 1312.3005, 2013. arxiv.org/abs/1312.3005.

[10] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. In Proceedings of ICLR Workshop Track, 2016. arxiv.org/abs/1604.00981.

[11] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of LearningSys, 2015. www.cs.cmu.edu/~muli/file/mxnet-learning-sys.pdf.

[12] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & deep learning for recommender systems. arXiv preprint, 1606.07792, 2016. arxiv.org/abs/1606.07792.
[13] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint, 1410.0759, 2014. arxiv.org/abs/1410.0759.

[14] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of OSDI, pages 571–582, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf.

[15] S. Chintala. convnet-benchmarks, 2016. github.com/soumith/convnet-benchmarks.

[16] E. S. Chung, J. D. Davis, and J. Lee. LINQits: Big data on little clients. In Proceedings of ISCA, pages 261–272, 2013. www.microsoft.com/en-us/research/wp-content/uploads/2013/06/ISCA13_linqits.pdf.

[17] R. Collobert, S. Bengio, and J. Mariéthoz. Torch: A modular machine learning software library. Technical report, IDIAP, 2002. infoscience.epfl.ch/record/82802/files/rr02-46.pdf.

[18] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of EuroSys, 2016. www.pdl.cmu.edu/PDL-FTP/CloudComputing/GeePS-cui-eurosys16.pdf.

[19] A. Dai, C. Olah, and Q. V. Le. Document embedding with paragraph vectors. arXiv preprint, 1507.07998, 2015. arxiv.org/abs/1507.07998.

[20] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Proceedings of NIPS, pages 1232–1240, 2012. research.google.com/archive/large_deep_networks_nips2012.pdf.

[21] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137–149, 2004. research.google.com/archive/mapreduce-osdi04.pdf.

[22] DMLC. MXNet for deep learning, 2016. github.com/dmlc/mxnet.

[23] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

[24] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeVISE: A deep visual-semantic embedding model. In Proceedings of NIPS, pages 2121–2129, 2013. research.google.com/pubs/archive/41473.pdf.

[25] J. Gonzalez-Dominguez, I. Lopez-Moreno, P. J. Moreno, and J. Gonzalez-Rodriguez. Frame-by-frame language identification in short utterances using deep neural networks. Neural Networks, 64:49–58, 2015. research.google.com/pubs/archive/42929.pdf.

[26] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of NIPS, pages 2672–2680, 2014. papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

[27] Google Research. TensorFlow Serving, 2016. tensorflow.github.io/serving/.

[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016. arxiv.org/abs/1512.03385.

[29] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean. Multilingual acoustic models using distributed deep neural networks. In Proceedings of ICASSP, pages 8619–8623, 2013. research.google.com/pubs/archive/40807.pdf.

[30] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1–12, 1986. www.cogsci.ucsd.edu/~ajyu/Teaching/Cogs202_sp13/Readings/hinton86.pdf.
[31] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012. www.cs.toronto.edu/~gdahl/papers/deepSpeechReviewSPM2012.pdf.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf.

[33] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, pages 448–456, 2015. jmlr.org/proceedings/papers/v37/ioffe15.pdf.

[34] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of EuroSys, pages 59–72, 2007. www.microsoft.com/en-us/research/wp-content/uploads/2007/03/eurosys07.pdf.

[35] B. Jacob et al. gemmlowp: a small self-contained low-precision GEMM library, 2015. github.com/google/gemmlowp.

[36] B. Jacob, G. Guennebaud, et al. Eigen library for linear algebra. eigen.tuxfamily.org.

[37] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of ACL-ICJNLP, pages 1–10, July 2015. www.aclweb.org/anthology/P15-1001.

[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of ACM Multimedia, pages 675–678, 2014. arxiv.org/abs/1408.5093.

[39] M. I. Jordan. Serial order: A parallel distributed processing approach. ICS report 8608, Institute for Cognitive Science, UCSD, La Jolla, 1986. cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604.pdf.

[40] N. Jouppi. Google supercharges machine learning tasks with TPU custom chip, 2016. cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.

[41] R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint, 1602.02410, 2016. arxiv.org/abs/1602.02410.

[42] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of CVPR, pages 1725–1732, 2014. research.google.com/pubs/archive/42455.pdf.

[43] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint, 1404.5997, 2014. arxiv.org/abs/1404.5997.

[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1106–1114, 2012. papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[45] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009. jmlr.org/papers/volume10/larochelle09a/larochelle09a.pdf.

[46] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. In Proceedings of CVPR, pages 4013–4021, 2016. arxiv.org/abs/1509.09308.

[47] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In Proceedings of ICML, pages 81–88, 2012. research.google.com/archive/unsupervised_icml2012.pdf.

[48] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998. yann.lecun.com/exdb/mnist/.

[49] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the Parameter Server. In Proceedings of OSDI, pages 583–598, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf.

[50] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. arXiv preprint, 1412.6564, 2014. arxiv.org/abs/1412.6564.
[51] F. McSherry, M. Isard, and D. G. Murray. Scalability! But at what COST? In Proceedings of HotOS, HOTOS'15, 2015. www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf.

[52] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR Workshops Track, 2013. arxiv.org/abs/1301.3781.

[53] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Proceedings of NIPS, pages 2204–2212, 2014. papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf.

[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015. dx.doi.org/10.1038/nature14236.

[55] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. SparkNet: Training deep networks in Spark. In Proceedings of ICLR, 2016. arxiv.org/abs/1511.06051.

[56] D. G. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, and M. Abadi. Incremental, iterative data processing with timely dataflow. Commun. ACM, 59(10):75–83, Sept. 2016. dl.acm.org/citation.cfm?id=2983551.

[57] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint, 1507.04296, 2015. arxiv.org/abs/1507.04296.

[58] Nervana Systems. Neon deep learning framework, 2016. github.com/NervanaSystems/neon.

[59] NVIDIA Corporation. NCCL: Optimized primitives for collective multi-GPU communication, 2016. github.com/NVIDIA/nccl.

[60] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of ICML, pages 1310–1318, 2013. jmlr.org/proceedings/papers/v28/pascanu13.pdf.

[61] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of NIPS, pages 693–701, 2011. papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf.

[62] C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of SOSP, pages 49–68, 2013. sigops.org/sosp/sosp13/papers/p49-rossbach.pdf.

[63] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Cognitive modeling, volume 5, pages 213–220. MIT Press, 1988. www.cs.toronto.edu/~hinton/absps/naturebp.pdf.

[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015. arxiv.org/abs/1409.0575.

[65] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proc. VLDB Endow., 3(1–2):703–710, Sept. 2010. vldb.org/pvldb/vldb2010/papers/R63.pdf.

[66] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of ICML, pages 1139–1147, 2013. jmlr.org/proceedings/papers/v28/sutskever13.pdf.

[67] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104–3112, 2014. papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.pdf.

[68] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of CVPR, pages 1–9, 2015. arxiv.org/abs/1409.4842.

[69] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint, 1512.00567, 2015. arxiv.org/abs/1512.00567.
[70] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. Map-reduce for machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2007. papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf.

[71] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of EuroSys, 2015. research.google.com/pubs/archive/43438.pdf.

[72] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. arXiv preprint, 1412.7449, 2014. arxiv.org/abs/1412.7449.

[73] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google's Neural Machine Translation system: Bridging the gap between human and machine translation. arXiv preprint, 1609.08144, 2016. arxiv.org/abs/1609.08144.

[74] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of OSDI, pages 1–14, 2008. www.usenix.org/legacy/event/osdi08/tech/full_papers/yu_y/yu_y.pdf.

[75] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of NSDI, pages 15–28, 2012. www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.

[76] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In Proceedings of ICASSP, pages 3517–3521, 2013. research.google.com/pubs/archive/40811.pdf.
TensorFlow:
Large-Scale Machine Learning on Heterogeneous Distributed Systems
(Preliminary White Paper, November 9, 2015)
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser,
Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar,
Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng
Google Research∗
Abstract

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

1 Introduction

The Google Brain project started in 2011 to explore the use of very-large-scale deep neural networks, both for research and for use in Google's products. As part of the early work in this project, we built DistBelief, our first-generation scalable distributed training and inference system [14], and this system has served us well. We and others at Google have performed a wide variety of research using DistBelief including work on unsupervised learning [31], language representation [35, 52], models for image classification and object detection [16, 48], video classification [27], speech recognition [56, 21, 20], sequence prediction [47], move selection for Go [34], pedestrian detection [2], reinforcement learning [38], and other areas [17, 5]. In addition, often in close collaboration with the Google Brain team, more than 50 teams at Google and other Alphabet companies have deployed deep neural networks using DistBelief in a wide variety of products, including Google Search [11], our advertising products, our speech recognition systems [50, 6, 46], Google Photos [43], Google Maps and StreetView [19], Google Translate [18], YouTube, and many others.

Based on our experience with DistBelief and a more complete understanding of the desirable system properties and requirements for training and using neural networks, we have built TensorFlow, our second-generation system for the implementation and deployment of large-scale machine learning models. TensorFlow takes computations described using a dataflow-like model and maps them onto a wide variety of different hardware platforms, ranging from running inference on mobile device platforms such as Android and iOS to modest-sized training and inference systems using single machines containing one or many GPU cards to large-scale training systems running on hundreds of specialized machines with thousands of GPUs. Having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning systems, as we have found that having separate systems for large-scale training and small-scale deployment leads to significant maintenance burdens and leaky abstractions. TensorFlow computations are expressed as stateful dataflow graphs (described in more detail in Section 2), and we have focused on making the system both flexible enough for quickly experimenting with new models for research purposes and sufficiently high performance and robust for production training and deployment of machine learning models. For scaling neural network training to larger deployments, TensorFlow allows clients to easily express various kinds of parallelism through replication and parallel execution of a core model dataflow graph, with many different computational devices all collaborating to update a set of shared parameters or other state.

∗ Corresponding authors: Jeffrey Dean and Rajat Monga: {jeff,rajatmonga}@google.com
Modest changes in the description of the computation allow a wide variety of different approaches to parallelism to be achieved and tried with low effort [14, 29, 42]. Some TensorFlow uses allow some flexibility in terms of the consistency of parameter updates, and we can easily express and take advantage of these relaxed synchronization requirements in some of our larger deployments. Compared to DistBelief, TensorFlow's programming model is more flexible, its performance is significantly better, and it supports training and using a broader range of models on a wider variety of heterogeneous hardware platforms.

Dozens of our internal clients of DistBelief have already switched to TensorFlow. These clients rely on TensorFlow for research and production, with tasks as diverse as running inference for computer vision models on mobile phones to large-scale training of deep neural networks with hundreds of billions of parameters on hundreds of billions of example records using many hundreds of machines [11, 47, 48, 18, 53, 41]. Although these applications have concentrated on machine learning and deep neural networks in particular, we expect that TensorFlow's abstractions will be useful in a variety of other domains, including other kinds of machine learning algorithms, and possibly other kinds of numerical computations. We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license in November, 2015, available at www.tensorflow.org.

The rest of this paper describes TensorFlow in more detail. Section 2 describes the programming model and basic concepts of the TensorFlow interface, and Section 3 describes both our single machine and distributed implementations. Section 4 describes several extensions to the basic programming model, and Section 5 describes several optimizations to the basic implementations. Section 6 describes some of our experiences in using TensorFlow, Section 7 describes several programming idioms we have found helpful when using TensorFlow, and Section 9 describes several auxiliary tools we have built around the core TensorFlow system. Sections 10 and 11 discuss future and related work, respectively, and Section 12 offers concluding thoughts.

2 Programming Model and Basic Concepts

A TensorFlow computation is described by a directed graph, which is composed of a set of nodes. The graph represents a dataflow computation, with extensions for allowing some kinds of nodes to maintain and update persistent state and for branching and looping control structures within the graph in a manner similar to Naiad [36]. Clients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.

In a TensorFlow graph, each node has zero or more inputs and zero or more outputs, and represents the instantiation of an operation. Values that flow along normal edges in the graph (from outputs to inputs) are tensors, arbitrary dimensionality arrays where the underlying element type is specified or inferred at graph-construction time. Special edges, called control dependencies, can also exist in the graph: no data flows along such edges, but they indicate that the source node for the control dependence must finish executing before the destination node for the control dependence starts executing. Since our model includes mutable state, control dependencies can be used directly by clients to enforce happens-before relationships. Our implementation also sometimes inserts control dependencies to enforce orderings between otherwise independent operations as a way of, for example, controlling the peak memory usage.

Operations and Kernels

An operation has a name and represents an abstract computation (e.g., "matrix multiply", or "add"). An operation can have attributes, and all attributes must be provided or inferred at graph-construction time in order to instantiate a node to perform the operation. One common use of attributes is to make operations polymorphic over different tensor element types (e.g., add of two tensors of type float versus add of two tensors of type int32). A kernel is a particular implementation of an operation that can be run on a particular type of device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations and kernels available via a registration mechanism, and this set can be extended by linking in additional operation and/or kernel definitions/registrations. Table 1 shows some of the kinds of operations built into the core TensorFlow library.

Sessions

Client programs interact with the TensorFlow system by creating a Session. To create a computation graph, the Session interface supports an Extend method to augment the current graph managed by the session with additional nodes and edges (the initial graph when a session is created is empty).
import tensorflow as tf

b = tf.Variable(tf.zeros([100]))                    # 100-d vector, init to zeroes
W = tf.Variable(tf.random_uniform([784,100],-1,1))  # 784x100 matrix w/rnd vals
x = tf.placeholder(name="x")                        # Placeholder for input
relu = tf.nn.relu(tf.matmul(W, x) + b)              # Relu(Wx+b)
C = [...]                                           # Cost computed as a function of Relu

s = tf.Session()
for step in xrange(0, 10):
    input = ...construct 100-D input array ...      # Create 100-d vector for input
    result = s.run(C, feed_dict={x: input})         # Fetch cost, feeding x=input
    print step, result

Figure 1: Example TensorFlow code fragment

Figure 2: Corresponding computation graph for Figure 1 (W and x feed a MatMul node, which feeds an Add node together with b, which feeds a ReLU node and further computation)

Category                               Examples
Element-wise mathematical operations   Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal, ...
Array operations                       Concat, Slice, Split, Constant, Rank, Shape, Shuffle, ...
Matrix operations                      MatMul, MatrixInverse, MatrixDeterminant, ...
Stateful operations                    Variable, Assign, AssignAdd, ...
Neural-net building blocks             SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool, ...
Checkpointing operations               Save, Restore
Queue and synchronization operations   Enqueue, Dequeue, MutexAcquire, MutexRelease, ...
Control flow operations                Merge, Switch, Enter, Leave, NextIteration

Table 1: Example TensorFlow operation types

The other primary operation supported by the session interface is Run, which takes a set of output names that need to be computed, as well as an optional set of tensors to be fed into the graph in place of certain outputs of nodes. Using the arguments to Run, the TensorFlow implementation can compute the transitive closure of all nodes that must be executed in order to compute the outputs that were requested, and can then arrange to execute the appropriate nodes in an order that respects their dependencies (as described in more detail in Section 3.1). Most of our uses of TensorFlow set up a Session with a graph once, and then execute the full graph or a few distinct subgraphs thousands or millions of times via Run calls.
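As a concrete illustration of these Run semantics, the following fragment (a minimal sketch in the spirit of Figure 1; the node names and values are ours, not from the paper) requests one output by name while feeding two placeholders. Only the nodes needed to produce the requested output are executed:

import tensorflow as tf

x = tf.placeholder(tf.float32, name="x")
y = tf.placeholder(tf.float32, name="y")
total = tf.add(x, y, name="total")            # output tensor "total:0"
scaled = tf.add(total, total, name="scaled")  # output tensor "scaled:0"

s = tf.Session()
# Fetch "scaled:0" while feeding x and y; Run computes the transitive
# closure of nodes needed for the requested output, and nothing else.
result = s.run("scaled:0", feed_dict={"x:0": 3.0, "y:0": 4.0})
print(result)  # 14.0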
Variables

In most computations a graph is executed multiple times. Most tensors do not survive past a single execution of the graph. However, a Variable is a special kind of operation that returns a handle to a persistent mutable tensor that survives across executions of a graph. Handles to these persistent mutable tensors can be passed to a handful of special operations, such as Assign and AssignAdd (equivalent to +=) that mutate the referenced tensor. For machine learning applications of TensorFlow, the parameters of the model are typically stored in tensors held in variables, and are updated as part of the Run of the training graph for the model.

3 Implementation

The main components in a TensorFlow system are the client, which uses the Session interface to communicate with the master, and one or more worker processes, with each worker process responsible for arbitrating access to one or more computational devices (such as CPU cores or GPU cards) and for executing graph nodes on those devices as instructed by the master. We have both local and distributed implementations of the TensorFlow interface. The local implementation is used when the client, the master, and the worker all run on a single machine in the context of a single operating system process (possibly with multiple devices, if, for example, the machine has many GPU cards installed). The distributed implementation shares most of the code with the local implementation, but extends it with support for an environment where the client, the master, and the workers can all be in different processes on different machines. In our distributed environment, these different tasks are containers in jobs managed by a cluster scheduling system [51]. These two different modes are illustrated in Figure 3. Most of the rest of this section discusses issues that are common to both implementations, while Section 3.3 discusses some issues that are particular to the distributed implementation.

Devices

Devices are the computational heart of TensorFlow. Each worker is responsible for one or more devices, and each device has a device type and a name. Device names are composed of pieces that identify the device's type, the device's index within the worker, and, in our distributed setting, an identification of the job and task of the worker (or localhost for the case where the devices are local to the process). Example device names are "/job:localhost/device:cpu:0" or "/job:worker/task:17/device:gpu:3". We have implementations of our Device interface for CPUs and GPUs, and new device implementations for other device types can be provided via a registration mechanism. Each device object is responsible for managing allocation and deallocation of device memory, and for arranging for the execution of any kernels that are requested by higher levels in the TensorFlow implementation.

Tensors

A tensor in our implementation is a typed, multi-dimensional array. We support a variety of tensor element types, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a complex number type, and a string type (an arbitrary byte array). Backing store of the appropriate size is managed by an allocator that is specific to the device on which the tensor resides. Tensor backing store buffers are reference counted and are deallocated when no references remain.

3.1 Single-Device Execution

Let's first consider the simplest execution scenario: a single worker process with a single device. The nodes of the graph are executed in an order that respects the dependencies between nodes. In particular, we keep track of a count per node of the number of dependencies of that node that have not yet been executed. Once this count drops to zero, the node is eligible for execution and is added to a ready queue. The ready queue is processed in some unspecified order, delegating execution of the kernel for a node to the device object. When a node has finished executing, the counts of all nodes that depend on the completed node are decremented.
Figure 3: Single machine and distributed system structure (left: client, master, and worker in a single process; right: separate client, master, and worker processes, each worker managing its GPU and CPU devices, with the master directing subgraph execution)

3.2 Multi-Device Execution

Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues.

3.2.1 Node Placement

Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the computation onto the set of available devices. A simplified version of this algorithm is presented here. See Section 4.3 for extensions supported by this algorithm.

One input to the placement algorithm is a cost model, which contains estimates of the sizes (in bytes) of the input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors. This cost model is either statically estimated based on heuristics associated with different operation types, or is measured based on an actual set of placement decisions for earlier executions of the graph.

The placement algorithm first runs a simulated execution of the graph. The simulation is described below and ends up picking a device for each node in the graph using greedy heuristics. The node-to-device placement generated by this simulation is also used as the placement for the real execution.

The placement algorithm starts with the sources of the computation graph, and simulates the activity on each device in the system as it progresses. For each node that is reached in this traversal, the set of feasible devices is considered (a device may not be feasible if the device does not provide a kernel that implements the particular operation). For nodes with multiple feasible devices, the placement algorithm uses a greedy heuristic that examines the effects on the completion time of the node of placing the node on each possible device. This heuristic takes into account the estimated or measured execution time of the operation on that kind of device from the cost model, and also includes the costs of any communication that would be introduced in order to transmit inputs to this node from other devices to the considered device. The device where the node's operation would finish the soonest is selected as the device for that operation, and the placement process then continues onwards to make placement decisions for other nodes in the graph, including downstream nodes that are now ready for their own simulated execution. Section 4.3 describes some extensions that allow users to provide hints and partial constraints to guide the placement algorithm. The placement algorithm is an area of ongoing development within the system.

3.2.2 Cross-Device Communication

Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device. Any cross-device edge from x to y is removed and replaced by an edge from x to a new Send node in x's subgraph and an edge from a corresponding Receive node to y in y's subgraph. See Figure 4 for an example of this graph transformation.

Figure 4: Before & after insertion of Send/Receive nodes

At runtime, the implementations of the Send and Receive nodes coordinate to transfer data across devices. This allows us to isolate all communication inside Send and Receive implementations, which simplifies the rest of the runtime.

When we insert Send and Receive nodes, we canonicalize all users of a particular tensor on a particular device to use a single Receive node, rather than one Receive node per downstream user on a particular device. This ensures that the data for the needed tensor is only transmitted once between a source device → destination device pair, and that memory for the tensor on the destination device is only allocated once, rather than multiple times (e.g., see nodes b and c in Figure 4).

By handling communication in this manner, we also allow the scheduling of individual nodes of the graph on different devices to be decentralized into the workers: the Send and Receive nodes impart the necessary synchronization between different workers and devices, and the master only needs to issue a single Run request per graph execution to each worker that has any nodes for the graph, rather than being involved in the scheduling of every node or every cross-device communication. This makes the system much more scalable and allows much finer-granularity node executions than if the scheduling were forced to be done by the master.
3.3 Distributed Execution

Distributed execution of a graph is very similar to multi-device execution. After device placement, a subgraph is created per device. Send/Receive node pairs that communicate across worker processes use remote communication mechanisms such as TCP or RDMA to move data across machine boundaries.

Fault Tolerance

Failures in a distributed execution can be detected in a variety of places. The main ones we rely on are (a) an error in a communication between a Send and Receive node pair, and (b) periodic health-checks from the master process to every worker process.

When a failure is detected, the entire graph execution is aborted and restarted from scratch. Recall however that Variable nodes refer to tensors that persist across executions of the graph. We support consistent checkpointing and recovery of this state on a restart. In particular, each Variable node is connected to a Save node. These Save nodes are executed periodically, say once every N iterations, or once every N seconds. When they execute, the contents of the variables are written to persistent storage, e.g., a distributed file system. Similarly, each Variable is connected to a Restore node that is only enabled in the first iteration after a restart. See Section 4.2 for details on how some nodes can only be enabled on some executions of the graph.

4 Extensions

In this section we describe several more advanced features of the basic programming model that was introduced in Section 2.

4.1 Gradient Computation

Many optimization algorithms, including common machine learning training algorithms like stochastic gradient descent [45], compute the gradient of a cost function with respect to a set of inputs. Because this is such a common need, TensorFlow has built-in support for automatic gradient computation. If a tensor C in a TensorFlow graph depends, perhaps through a complex subgraph of operations, on some set of tensors {Xk}, then there is a built-in function that will return the tensors {dC/dXk}. Gradient tensors are computed, like other tensors, by extending the TensorFlow graph, using the following procedure.

Figure 5: Gradients computed for graph in Figure 2

When TensorFlow needs to compute the gradient of a tensor C with respect to some tensor I on which C depends, it first finds the path in the computation graph from I to C. Then it backtracks from C to I, and for each operation on the backward path it adds a node to the TensorFlow graph, composing the partial gradients along the backwards path using the chain rule. The newly added node computes the "gradient function" for the corresponding operation in the forward path. A gradient function may be registered by any operation. This function takes as input not only the partial gradients computed already along the backward path, but also, optionally, the inputs and outputs of the forward operation. Figure 5 shows gradients for a cost computed from the example of Figure 2. Grey arrows show potential inputs to gradient functions that are not used for the particular operations shown. The addition needed to Figure 1 to compute these gradients is:

[db,dW,dx] = tf.gradients(C, [b,W,x])

In general an operation may have multiple outputs, and C may only depend on some of them. If, for example, operation O has two outputs y1 and y2, and C only depends on y2, then the first input to O's gradient function is set to 0 since dC/dy1 = 0.
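Building on the Figure 1 fragment, the returned gradient tensors can drive a plain SGD update by pairing tf.gradients with Assign operations. This is a minimal sketch; the learning rate, and the assumption that C is a scalar cost built on relu, are ours:

lr = 0.01                             # assumed learning rate
db, dW = tf.gradients(C, [b, W])      # gradient nodes added to the graph
update_b = tf.assign(b, b - lr * db)  # Assign mutates the variable's tensor
update_W = tf.assign(W, W - lr * dW)

# One training step: running the update nodes also runs the forward
# and gradient subgraphs they depend on.
s.run([update_b, update_W], feed_dict={x: input})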
Automatic gradient computation complicates optimization, particularly of memory usage. When executing "forward" computation subgraphs, i.e., those that are explicitly constructed by the user, a sensible heuristic breaks ties when deciding which node to execute next by observing the order in which the graph was constructed. This generally means that temporary outputs are consumed soon after being constructed, so their memory can be reused quickly. When the heuristic is ineffective, the user can change the order of graph construction, or add control dependencies as described in Section 5. When gradient nodes are automatically added to the graph, the user has less control, and the heuristics may break down. In particular, because gradients reverse the forward computation order, tensors that are used early in a graph's execution are frequently needed again near the end of a gradient computation. Such tensors can hold on to a lot of scarce GPU memory and unnecessarily limit the size of computations. We are actively working on improvements to memory management to deal better with such cases. Options include using more sophisticated heuristics to determine the order of graph execution, recomputing tensors instead of retaining them in memory, and swapping out long-lived tensors from GPU memory to more plentiful host CPU memory.

4.2 Partial Execution

Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.

Each node in the graph has a name, and each output of a node is identified by the source node name and the output port from the node, numbered from 0 (e.g., "bar:0" refers to the 1st output of the "bar" node, while "bar:1" refers to the 2nd output).

Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to "fed" tensor values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully.

The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete. Finally, once the graph has been rewritten with the insertion of these special feed and fetch nodes, the set of nodes to execute can be determined by starting at each of the nodes named by any output and working backwards in the graph using the graph dependencies to determine the full set of nodes that must be executed in the rewritten graph in order to compute the outputs.

Figure 6: Before and after graph transformation for partial execution

Figure 6 shows an original graph on the left, and the transformed graph that results when Run is invoked with inputs=={b} and outputs=={f:0}. Since we only need to compute the output of node f, we will not execute nodes d and e, since they have no contribution to the output of f.

4.3 Device Constraints

TensorFlow clients can control the placement of nodes on devices by providing partial constraints for a node about which devices it can execute on. For example, "only place this node on a device of type GPU", or "this node can be placed on any device in /job:worker/task:17", or "Colocate this node with the node named variable13". Within the confines of these constraints, the placement algorithm is responsible for choosing an assignment of nodes to devices that provides fast execution of the computation and also satisfies various constraints imposed by the devices themselves, such as limiting the total amount of memory needed on a device in order to execute its subset of graph nodes.

Supporting such constraints requires changes to the placement algorithm described in Section 3.2.1. We first compute the feasible set of devices for each node, and then use union-find on the graph of colocation constraints to compute the graph components that must be placed together. For each such component, we compute the intersection of the feasible device sets. The computed feasible device set per node fits easily into the placement algorithm's simulator.
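In the open-source Python front end, such constraints are commonly attached with the tf.device scope. This is a hedged sketch; the device strings follow the naming convention from Section 3, and the variable names are ours:

import tensorflow as tf

with tf.device("/job:worker/task:17"):    # any device in this task
    weights = tf.Variable(tf.zeros([784, 10]))

with tf.device("/gpu:0"):                 # constrain to a GPU device
    logits = tf.matmul(tf.ones([1, 784]), weights)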
4.4 Control Flow

Although dataflow graphs without any explicit control flow are quite expressive, we have observed a number of cases where supporting conditionals and loops can lead to more concise and efficient representations of machine learning algorithms.

Much as in the dataflow-machine approach described by Arvind [3], we introduce a small set of primitive control flow operators into TensorFlow and generalize TensorFlow to handle cyclic dataflow graphs. The Switch and Merge operators allow us to skip the execution of an entire subgraph based on the value of a boolean tensor. The Enter, Leave, and NextIteration operators allow us to express iteration. High-level programming constructs such as if-conditionals and while-loops can be easily compiled into dataflow graphs with these control flow operators.

The TensorFlow runtime implements a notion of tags and frames conceptually similar to the MIT Tagged-Token machine [4]. Each iteration of a loop is uniquely identified by a tag, and its execution state is represented by a frame. An input can enter an iteration whenever it becomes available; thus, multiple iterations can be executed concurrently.

TensorFlow uses a distributed coordination mechanism to execute graphs with control flow. In general, a loop can contain nodes that are assigned to many different devices. Therefore, managing the state of a loop becomes a problem of distributed termination detection. TensorFlow's solution is based on graph rewriting. During the graph partitioning, we automatically add control nodes to each partition. These nodes implement a small state machine that orchestrates the start and termination of each iteration, and decides the termination of the loop. For each iteration, the device that owns the loop termination predicate sends a tiny control message to every participating device.

As explained above, we often train machine learning models by gradient descent, and represent gradient computations as part of dataflow graphs. When a model includes control-flow operations, we must account for them in the corresponding gradient computation. For example, the gradient computation for a model with an if-conditional will need to know which branch of the conditional was taken, then apply the gradient logic to this branch. Similarly, the gradient computation for a model with a while-loop will need to know how many iterations were taken, and will also rely on the intermediate values computed during those iterations. The basic technique is to rewrite the graph so as to memorize the values needed for the gradient computation. We omit the somewhat intricate details of this encoding.

4.5 Input Operations

Although input data can be provided to a computation via feed nodes, another common mechanism used for training large-scale machine learning models is to have special input operation nodes in the graph, which are typically configured with a set of filenames and which yield a tensor containing one or more examples from the data stored in that set of files each time they are executed. This allows data to be read directly from the underlying storage system into the memory of the machine that will perform subsequent processing on the data. In configurations where the client process is separate from the worker process, if the data were fed, it typically would require an extra network hop (from the storage system to the client and then from the client to the worker, vs. directly from the storage system to the worker when using an input node).

4.6 Queues

Queues are a useful feature that we have added to TensorFlow. They allow different portions of the graph to execute asynchronously, possibly at different cadences, and to hand off data through Enqueue and Dequeue operations. Enqueue operations can block until space becomes available in the queue, and Dequeue operations can block until a desired minimum number of elements are available in the queue. One use of queues is to allow input data to be prefetched from disk files while a previous batch of data is still being processed by the computational portion of a machine learning model (a sketch of this pattern appears at the end of Section 4). They can also be used for other kinds of grouping, including accumulating many gradients in order to compute some more complex combination of gradients over a larger batch, or to group different input sentences for recurrent language models into bins of sentences that are approximately the same length, which can then be processed more efficiently.

In addition to normal FIFO queues, we have also implemented a shuffling queue, which randomly shuffles its elements within a large in-memory buffer. This shuffling functionality is useful for machine learning algorithms that want to randomize the order in which they process examples, for example.

4.7 Containers

A Container is the mechanism within TensorFlow for managing longer-lived mutable state. The backing store for a Variable lives in a container. The default container is one that persists until the process terminates, but we also allow other named containers. A container can be reset by clearing it of its contents entirely. Using containers, it is possible to share state even across completely disjoint computation graphs associated with different Sessions.
5 Optimizations

In this section, we describe some of the optimizations in the TensorFlow implementation that improve performance or resource usage of the system.

5.1 Common Subexpression Elimination

Since the construction of computation graphs is often done by many different layers of abstractions in the client code, computation graphs can easily end up with redundant copies of the same computation. To handle this, we have implemented a common subexpression pass similar to the algorithm described by Click [12] that runs over the computation graph and canonicalizes multiple copies of operations with identical inputs and operation types to just a single one of these nodes, and redirects graph edges appropriately to reflect this canonicalization.

5.2 Controlling Data Communication and Memory Usage

Careful scheduling of TensorFlow operations can result in better performance of the system, in particular with respect to data transfers and memory usage. Specifically, scheduling can reduce the time window during which intermediate results need to be kept in memory in between operations and hence the peak memory consumption. This reduction is particularly important for GPU devices where memory is scarce. Furthermore, orchestrating the communication of data across devices can reduce contention for network resources.

While there are many opportunities for scheduling optimizations, here we focus on one that we found particularly necessary and effective. It concerns the scheduling of Receive nodes for reading remote values. If no precautions are taken, these nodes may start much earlier than necessary, possibly all at once when execution starts. By performing an as-soon-as-possible/as-late-as-possible (ASAP/ALAP) calculation, of the kind common in operations research, we analyze the critical paths of graphs, in order to estimate when to start the Receive nodes. We then insert control edges with the aim of delaying the start of these nodes until just before their results are needed.

5.3 Asynchronous Kernels

In addition to normal synchronous kernels that complete their execution at the end of the Compute method, our framework also supports non-blocking kernels. Such non-blocking kernels use a slightly different interface whereby the Compute method is passed a continuation that should be invoked when the kernel's execution is complete. This is an optimization for environments where having many active threads is relatively expensive in terms of memory usage or other resources, and it allows us to avoid tying up an execution thread for unbounded periods of time while waiting for I/O or other events to occur. Examples of asynchronous kernels include the Receive kernel, and the Enqueue and Dequeue kernels (which might need to block if queue space is not available or if no data is available to be read, respectively).

5.4 Optimized Libraries for Kernel Implementations

We often make use of pre-existing highly-optimized numerical libraries to implement kernels for some operations. For example, there are a number of optimized libraries for performing matrix multiplies on different devices, including BLAS [15] and cuBLAS [39], and GPU libraries for convolutional kernels for deep neural nets such as cuda-convnet [28] and cuDNN [9]. Many of our kernel implementations are relatively thin wrappers around such optimized libraries.

We make fairly extensive use of the open-source Eigen linear algebra library [25] for many of the kernel implementations in the system. As one part of the development of TensorFlow, our team (primarily Benoit Steiner) has extended the open source Eigen library with support for arbitrary dimensionality tensor operations.

5.5 Lossy Compression

Some machine learning algorithms, including those typically used for training neural networks, are tolerant of noise and reduced precision arithmetic. In a manner similar to the DistBelief system [14], we often use lossy compression of higher precision internal representations when sending data between devices (sometimes within the same machine but especially across machine boundaries). For example, we often insert special conversion nodes that convert 32-bit floating point representations into a 16-bit floating point representation (not the proposed IEEE 16-bit floating point standard, but rather just a 32-bit IEEE 754 float format with 16 bits less precision in the mantissa), and then convert back to a 32-bit representation on the other side of the communication channel (by just filling in zeroes for the lost portion of the mantissa, since that's less computationally expensive than doing the mathematically correct probabilistic rounding when doing this 32 → 16 → 32-bit conversion).
6 Status and Experience

The TensorFlow interface and a reference implementation have been open sourced under an Apache 2.0 license, and the system is available for download at www.tensorflow.org. The system includes detailed documentation, a number of tutorials, and a number of examples demonstrating how to use the system for a variety of different machine learning tasks. The examples include models for classifying hand-written digits from the MNIST dataset (the "hello world" of machine learning algorithms) [32], classifying images from the CIFAR-10 dataset [30], doing language modeling using a recurrent LSTM [22] network, training word embedding vectors [35], and more.

The system includes front-ends for specifying TensorFlow computations in Python and C++, and we expect other front-ends to be added over time in response to the desires of both internal Google users and the broader open-source community.

We have quite a few machine learning models in our previous DistBelief system [14] that we have migrated over to TensorFlow. The rest of this section discusses some lessons we have learned that are generalizable for any such migration of machine learning models from one system to another, and therefore may be valuable to others.

In particular, we focus on our lessons from porting a state-of-the-art convolutional neural network for image recognition termed Inception [23]. This image recognition system classifies 224 × 224 pixel images into one of 1000 labels (e.g., "cheetah", "garbage truck", etc.). Such a model comprises 13.6 million learnable parameters and 36,000 operations when expressed as a TensorFlow graph. Running inference on a single image requires 2 billion multiply-add operations.

After building all necessary mathematical operations in TensorFlow, assembling and debugging all 36,000 operations into the correct graph structure proved challenging. Validating correctness is a difficult enterprise because the system is inherently stochastic and only intended to behave in a certain way in expectation, potentially after hours of computation. Given these circumstances, we found the following strategies critical for porting the Inception model to TensorFlow:

1. Build tools to gain insight into the exact number of parameters in a given model. Such tools demonstrated subtle flaws in a complex network architecture specification. In particular, we were able to identify operations and variables instantiated incorrectly due to automatic broadcasting in a mathematical operation across a dimension.

2. Start small and scale up. The first convolutional neural network that we ported from our previous system was a small network employed on the CIFAR-10 data set [30]. Debugging such a network elucidated subtle edge cases in individual operations (e.g., max-pooling) within the machine learning system that would have been practically indecipherable in more complex models.

3. Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off. Setting the learning rate to be zero helped us identify unexpected behavior in how we had randomly initialized variables in a model. Such an error would have been difficult to identify in a dynamic, training network.

4. Make a single machine implementation match before debugging a distributed implementation. This strategy helped us delineate and debug discrepancies in training performance between machine learning systems. In particular, we identified bugs due to race conditions and non-atomic operations incorrectly assumed to be atomic.

5. Guard against numerical errors. Numerical libraries are inconsistent in how they handle non-finite floating point values. Convolutional neural networks are particularly susceptible to numerical instability and will tend to diverge quite regularly during experimentation and debugging phases. Guarding against this behavior by checking for non-finite floating point values allows one to detect errors in real time as opposed to identifying divergent behavior post-hoc (see the sketch after this list).

6. Analyze pieces of a network and understand the magnitude of numerical error. Running subsections of a neural network in parallel on two machine learning systems provides a precise method to ensure that a numerical algorithm is identical across two systems. Given that such algorithms run with floating point precision, it is important to predict and understand the magnitude of expected numerical error in order to judge whether a given component is correctly implemented (e.g., distinguishing between "within 1e-2, great!" and "within 1e-2: why is it so incorrect?!").
Figure 7: Synchronous and asynchronous data parallel training (top: synchronous data parallelism, with one client driving model replicas on Devices A–C and a parameter device applying the combined update ΔP to the parameters P; bottom: asynchronous data parallelism, with one client per replica applying updates ΔP independently)

Validating complex mathematical operations in the presence of an inherently stochastic system is quite challenging. The strategies outlined above proved invaluable in gaining confidence in the system and ultimately in instantiating the Inception model in TensorFlow. These efforts resulted in a 6-fold speed improvement in training time versus our existing DistBelief implementation of the model, and such speed gains proved indispensable in training a new class of larger-scale image recognition models.

7 Common Programming Idioms

TensorFlow's basic dataflow graph model can be used in a variety of ways for machine learning applications. One domain we care about is speeding up training of computationally intensive neural network models on large datasets. This section describes several techniques that we and others have developed in order to accomplish this, and illustrates how to use TensorFlow to realize these various approaches.

The approaches in this subsection assume that the model is being trained using stochastic gradient descent (SGD) with relatively modest-sized mini-batches of 100 to 1000 examples.

Data Parallel Training

One simple technique for speeding up SGD is to parallelize the computation of the gradient for a mini-batch across mini-batch elements. For example, if we are using a mini-batch size of 1000 elements, we can use 10 replicas of the model to each compute the gradient for 100 elements, and then combine the gradients and apply updates to the parameters synchronously, in order to behave exactly as if we were running the sequential SGD algorithm with a batch size of 1000 elements. In this case, the TensorFlow graph simply has many replicas of the portion of the graph that does the bulk of the model computation, and a single client thread drives the entire training loop for this large graph. This is illustrated in the top portion of Figure 7.

This approach can also be made asynchronous, where the TensorFlow graph has many replicas of the portion of the graph that does the bulk of the model computation, and each one of these replicas also applies the parameter updates to the model parameters asynchronously. In this configuration, there is one client thread for each of the graph replicas. This is illustrated in the bottom portion of Figure 7. This asynchronous approach was also described in [14].
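A minimal sketch of the synchronous variant (our own construction, in the style of Figure 1; the toy model, device strings, and learning rate are assumptions, not from the paper):

import tensorflow as tf

params = tf.Variable(tf.zeros([784, 10]))      # shared parameters

def replica_loss(x):                           # one replica's computation
    return tf.reduce_sum(tf.nn.relu(tf.matmul(x, params)))

inputs, grads = [], []
for i in range(4):                             # 4 model replicas
    with tf.device("/gpu:%d" % i):
        x = tf.placeholder(tf.float32, shape=[None, 784])
        inputs.append(x)
        grads.append(tf.gradients(replica_loss(x), params)[0])

avg_grad = tf.add_n(grads) / len(grads)        # combine replica gradients
train_step = tf.assign(params, params - 0.01 * avg_grad)
# A single client thread feeds each replica a shard of the mini-batch
# and runs `train_step` once per training step.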
Figure 8: Model parallel training (a model partitioned across Devices 1–3, with layers A, B, and C and their parameters P1–P3 each assigned to a different device, driven by a single client)

Model Parallel Training

Model parallel training, where different portions of the model computation are done on different computational devices simultaneously for the same batch of examples, is also easy to express in TensorFlow. Figure 8 shows an example of a recurrent, deep LSTM model used for sequence to sequence learning (see [47]), parallelized across three different devices.

Figure 9: Concurrent steps (several concurrent update/model/input step replicas running within the same set of devices, sharing the parameters P)

Concurrent Steps for Model Computation Pipelining

Another common way to get better utilization for training deep neural networks is to pipeline the computation of the model within the same devices, by running a small number of concurrent steps within the same set of devices. This is shown in Figure 9. It is somewhat similar to asynchronous data parallelism, except that the parallelism occurs within the same device(s), rather than replicating the computation graph on different devices. This allows "filling in the gaps" where computation of a single batch of examples might not be able to fully utilize the full parallelism on all devices at all times during a single step.

8 Performance

A future version of this white paper will have a comprehensive performance evaluation section of both the single machine and distributed implementations.

9 Tools

This section describes some tools we have developed that sit alongside the core TensorFlow graph execution engine.

9.1 TensorBoard: Visualization of graph structures and summary statistics

In order to help users understand the structure of their computation graphs and also to understand the overall behavior of machine learning models, we have built TensorBoard, a companion visualization tool for TensorFlow that is included in the open source release.

Visualization of Computation Graphs

Many of the computation graphs for deep neural networks can be quite complex. For example, the computation graph for training a model similar to Google's Inception model [48], a deep convolutional neural net that had the best classification performance in the ImageNet 2014 contest, has over 36,000 nodes in its TensorFlow computation graph, and some deep recurrent LSTM models for language modeling have more than 15,000 nodes.

Due to the size and topology of these graphs, naive visualization techniques often produce cluttered and overwhelming diagrams. To help users see the underlying organization of the graphs, the algorithms in TensorBoard collapse nodes into high-level blocks, highlighting groups with identical structures. The system also separates out high-degree nodes, which often serve bookkeeping functions, into a separate area of the screen. Doing so reduces visual clutter and focuses attention on the core sections of the computation graph.

The entire visualization is interactive: users can pan, zoom, and expand grouped nodes to drill down for details. An example of the visualization for the graph of a deep convolutional image model is shown in Figure 10.

Visualization of Summary Data

When training machine learning models, users often want to be able to examine the state of various aspects of the model, and how this state changes over time. To this end, TensorFlow supports a collection of different Summary operations that can be inserted into the graph,
including scalar summaries (e.g., for examining overall properties of the model, such as the value of the loss function averaged across a collection of examples, or the time taken to execute the computation graph), histogram-based summaries (e.g., the distribution of weight values in a neural network layer), or image-based summaries (e.g., a visualization of the filter weights learned in a convolutional neural network). Typically computation graphs are set up so that Summary nodes are included to monitor various interesting values, and every so often during execution of the training graph, the set of summary nodes are also executed, in addition to the normal set of nodes that are executed, and the client driver program writes the summary data to a log file associated with the model training. The TensorBoard program is then configured to watch this log file for new summary records, and can display this summary information and how it changes over time (with the ability to select the measurement of "time" to be relative wall time since the beginning of the execution of the TensorFlow program, absolute time, or "steps", a numeric measure of the number of graph executions that have occurred since the beginning of execution of the TensorFlow program). A screen shot of the visualization of summary values in TensorBoard is shown in Figure 11.

Figure 10: TensorBoard graph visualization of a convolutional neural network model

Figure 11: TensorBoard graphical display of model summary statistics time series data
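A hedged sketch of this summary pipeline, again extending Figure 1 (the op names below are those of the initial open-source release, e.g. tf.scalar_summary, which were later renamed under tf.summary; the log directory is ours):

loss_summary = tf.scalar_summary("loss", C)          # scalar summary of the cost
weight_summary = tf.histogram_summary("W", W)        # distribution of weights
all_summaries = tf.merge_all_summaries()
writer = tf.train.SummaryWriter("/tmp/train_logs")   # TensorBoard watches this

summary_bytes = s.run(all_summaries, feed_dict={x: input})
writer.add_summary(summary_bytes, global_step=step)  # append a record to the log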
9.2 Performance Tracing

We also have an internal tool called EEG (not included in the initial open source release in November, 2015) that we use to collect and visualize very fine-grained information about the exact ordering and performance characteristics of the execution of TensorFlow graphs. This tool works in both our single machine and distributed implementations, and is very useful for understanding the bottlenecks in the computation and communication patterns of a TensorFlow program.

Traces are collected simultaneously on each machine in the system from a variety of sources including Linux kernel ftrace, our own lightweight thread tracing tools, and the CUDA Profiling Tools Interface (CUPTI). With these logs we can reconstruct the execution of a distributed training step with microsecond-level details of every thread-switch, CUDA kernel launch, and DMA operation.

Traces are combined in a visualization server which is designed to rapidly extract events in a specified time range and summarize at an appropriate detail level for the user-interface resolution. Any significant delays due to communication, synchronization or DMA-related stalls are identified and highlighted using arrows in the visualization. Initially the UI provides an overview of the entire trace, with only the most significant performance artifacts highlighted. As the user progressively zooms in, increasingly fine resolution details are rendered.

Figure 12 shows an example EEG visualization of a model being trained on a multi-core CPU platform. The top third of the screenshot shows TensorFlow operations being dispatched in parallel, according to the dataflow constraints. The bottom section of the trace shows how most operations are decomposed into multiple work-items which are executed concurrently in a thread pool. The diagonal arrows on the right hand side show where queueing delay is building up in the thread pool. Figure 13 shows another EEG visualization with computation mainly happening on the GPU. Host threads can be seen enqueuing TensorFlow GPU operations as they become runnable (the light blue thread pool), and background housekeeping threads can be seen in other colors being migrated across processor cores. Once again, arrows show where threads are stalled on GPU to CPU transfers, or where ops experience significant queueing delay. Finally, Figure 14 shows a more detailed view which

10 Future Work

We have several different directions for future work. We will continue to use TensorFlow to develop new and interesting machine learning models for artificial intelligence, and in the course of doing this, we may discover ways in which we will need to extend the basic TensorFlow system. The open source community may also come up with new and interesting directions for the TensorFlow implementation.

One extension to the basic programming model that we are considering is a function mechanism, whereby a user can specify an entire subgraph of a TensorFlow computation to be a reusable component. In the implementation we have designed, these functions can become reusable components even across different front-end languages for TensorFlow, so that a user could define a function using the Python front end, but then use that function as a basic building block from within the C++ front-end. We are hopeful that this cross-language reusability will bootstrap a vibrant community of machine learning researchers publishing not just whole examples of their research, but also small reusable components from their work that can be reused in other contexts.

We also have a number of concrete directions to improve the performance of TensorFlow. One such direction is our initial work on a just-in-time compiler that can take a subgraph of a TensorFlow execution, perhaps with some runtime profiling information about the typical sizes and shapes of tensors, and can generate an optimized routine for this subgraph. This compiler will understand the semantics of the graph and perform a number of optimizations such as loop fusion, blocking and tiling for locality, specialization for particular shapes and sizes, etc.

We also imagine that a significant area for future work will be in improving the placement and node scheduling algorithms used to decide where different nodes will execute, and when they should start executing. We have currently implemented a number of heuristics in these subsystems, and we'd like to have the system instead learn to make good placement decisions (perhaps using a deep neural network, combined with a reinforcement learning objective function).

11 Related Work
allows us to examine how Tensorflow GPU operators
are assigned to multiple GPU streams. Whenever the There are many other systems that are comparable in
dataflow graph allows parallel execution or data trans- various ways with TensorFlow. Theano [7], Torch [13],
fer we endeavour to expose the ordering constraints to Caffe [26], Chainer [49] and the Computational Network
the GPU device using streams and stream dependency Toolkit [54] are a few systems designed primarily for the
primitives. training of neural networks. Each of these systems maps
the computation onto a single machine, unlike the dis-
tributed TensorFlow implementation. Like Theano and
10 Future Work Chainer, TensorFlow supports symbolic differentiation,
thus making it easier to define and work with gradient-
We have several different directions for future work. We based optimization algorithms. Like Caffe, TensorFlow
will continue to use TensorFlow to develop new and in- has a core written in C++, simplifying the deployment

14
Figure 12: EEG visualization of multi-threaded CPU operations (x-axis is time in µs).

Figure 13: EEG visualization of Inception training showing CPU and GPU activity.

of trained models in a wide variety of production settings, including memory- and computation-constrained environments such as mobile devices.

The TensorFlow system shares some design characteristics with its predecessor system, DistBelief [14], and with later systems with similar designs like Project Adam [10] and the Parameter Server project [33]. Like DistBelief and Project Adam, TensorFlow allows computations to be spread out across many computational devices across many machines, and allows users to specify machine learning models using relatively high-level descriptions. Unlike DistBelief and Project Adam, though, the general-purpose dataflow graph model in TensorFlow is more flexible and more amenable to expressing a wider variety of machine learning models and optimization algorithms. It also permits a significant simplification by allowing the expression of stateful parameter nodes as variables, and variable update operations that are just additional nodes in the graph; in contrast, DistBelief, Project Adam and the Parameter Server systems all have whole separate parameter server subsystems devoted to communicating and updating parameter values.
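The following small sketch illustrates that design point: the parameter is a stateful Variable node and its update is just another op in the same dataflow graph, with no separate parameter-server subsystem involved. The model and values are illustrative (TF 0.x-era API), not an excerpt from any of the systems above.

```python
# Sketch: stateful parameter nodes and in-graph update ops (TF 0.x-era API).
import tensorflow as tf

w = tf.Variable(tf.zeros([4]), name="w")         # stateful parameter node
grad = tf.placeholder(tf.float32, shape=[4])     # stand-in for a computed gradient
lr = tf.constant(0.1)

# The update is an ordinary node in the graph; running it mutates w in place.
update_w = tf.assign_sub(w, lr * grad)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    new_w = sess.run(update_w, feed_dict={grad: [1.0, 2.0, 3.0, 4.0]})
    print(new_w)  # value of w after one in-graph update step
```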
The Halide system [40] for expressing image processing pipelines uses a similar intermediate representation to the TensorFlow dataflow graph. Unlike TensorFlow, though, the Halide system actually has higher-level knowledge of the semantics of its operations and uses this knowledge to generate highly optimized pieces of code that combine multiple operations, taking into account parallelism and locality. Halide runs the resulting computations only on a single machine, and not in a distributed setting. In future work we are hoping to extend TensorFlow with a similar cross-operation dynamic compilation framework.

Like TensorFlow, several other distributed systems have been developed for executing dataflow graphs across a cluster. Dryad [24] and Flume [8] demonstrate how a complex workflow can be represented as a dataflow graph. CIEL [37] and Naiad [36] introduce generic support for data-dependent control flow: CIEL represents iteration as a DAG that dynamically unfolds, whereas Naiad uses a static graph with cycles to support lower-latency iteration. Spark [55] is optimized for computations that access the same data repeatedly, using "resilient distributed datasets" (RDDs), which are soft-state cached outputs of earlier computations. Dandelion [44] executes dataflow graphs across a cluster of heterogeneous devices, including GPUs. TensorFlow uses a hybrid dataflow model that borrows elements from each of these systems. Its dataflow scheduler, which is the component that chooses the next node to execute, uses the same basic algorithm as Dryad, Flume, CIEL, and Spark. Its distributed architecture is closest to Naiad, in that the system uses a single, optimized dataflow graph to represent the entire computation, and caches information about that graph on each device to minimize coordination overhead. Like Spark and Naiad, TensorFlow works best when there is sufficient RAM in the cluster to hold the working set of the computation. Iteration in TensorFlow uses a hybrid approach: multiple replicas of the same dataflow graph may be executing at once, while sharing the same set of variables. Replicas can share data asynchronously through the variables, or use synchronization mechanisms in the graph, such as queues, to operate synchronously. TensorFlow also supports iteration within a graph, which is a hybrid of CIEL and Naiad: for simplicity, each node fires only when all of its inputs are ready (like CIEL); but for efficiency the graph is represented as a static, cyclic dataflow (like Naiad).

12 Conclusions

We have described TensorFlow, a flexible dataflow-based programming model, as well as single machine and distributed implementations of this programming model. The system is borne from real-world experience in conducting research and deploying more than one hundred machine learning projects throughout a wide range of Google products and services. We have open sourced a version of TensorFlow, and hope that a vibrant shared community develops around the use of TensorFlow. We are excited to see how others outside of Google make use of TensorFlow in their own work.
Acknowledgements

The development of TensorFlow has benefitted enormously from the large and broad machine learning community at Google, and in particular from the suggestions and contributions from the rest of the Google Brain team and also from the hundreds of DistBelief and TensorFlow users within Google. Without a doubt, the usability and functionality of TensorFlow has been greatly expanded by listening to their feedback.

Many individuals have contributed to TensorFlow and to its open source release, including John Giannandrea (for creating a supportive research environment), Irina Kofman and Phing Turner (project management), Bill Gruber and David Westbrook (technical writing), Dave Andersen, Anelia Angelova, Yaroslav Bulatov, Jianmin Chen, Jerjou Cheng, George Dahl, Andrew Dai, Lucy Gao, mig Gerard, Stephan Gouws, Naveen Kumar, Geoffrey Hinton, Mrinal Kalarishnan, Anjuli Kannan, Yutaka Leon-Suematsu, Frank Li, Peter Liu, Xiaobing Liu, Nishant Patil, Pierre Sermanet, Noam Shazeer, Jascha Sohl-dickstein, Philip Tucker, Yonghui Wu, Ke Yang, and Cliff Young (general contributions), Doug Fritz, Patrick Hurst, Dilip Krishnan, Daniel Smilkov, James Wexler, Jimbo Wilson, Kanit Ham Wongsuphasawat, Cassandra Xia, and the Big Picture team (graph visualization), Chris Leary, Robert Springer and the Stream Executor team, Kayur Patel, Michael Piatek, and the coLab team, and the many others who have contributed to the TensorFlow design and code base.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Anelia Angelova, Alex Krizhevsky, and Vincent Vanhoucke. Pedestrian detection with a large-field-of-view deep network. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 704–711. IEEE, 2015. CalTech PDF.

[3] Arvind and David E. Culler. Annual Review of Computer Science Vol. 1, chapter Dataflow Architectures, pages 225–253. 1986. www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235.

[4] Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput., 39(3):300–318, 1990. dl.acm.org/citation.cfm?id=78583.

[5] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014. arxiv.org/abs/1412.7755.

[6] Françoise Beaufays. The neural networks behind Google Voice transcription, 2015. googleresearch.blogspot.com/2015/08/the-neural-networks-behind-google-voice.html.

[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), volume 4, page 3. Austin, TX, 2010. UMontreal PDF.

[8] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R Henry, Robert Bradshaw, and Nathan Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In ACM Sigplan Notices, volume 45, pages 363–375. ACM, 2010. research.google.com/pubs/archive/35650.pdf.

[9] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014. arxiv.org/abs/1410.0759.

[10] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf.

[11] Jack Clark. Google turning its lucrative web search over to AI machines, 2015. www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines.

[12] Cliff Click. Global code motion/global value numbering. In ACM SIGPLAN Notices, volume 30, pages 246–257. ACM, 1995. courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf.

[13] Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: A modular machine learning software library. Technical report, IDIAP, 2002. infoscience.epfl.ch/record/82802/files/rr02-46.pdf.

[14] Jeffrey Dean, Gregory S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012. Google Research PDF.
[15] Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990. www.maths.manchester.ac.uk/~sven/pubs/Level3BLAS-1-TOMS16-90.pdf.

[16] Andrea Frome, Greg S Corrado, Jonathon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013. research.google.com/pubs/archive/41473.pdf.

[17] Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J Moreno, and Joaquin Gonzalez-Rodriguez. Frame-by-frame language identification in short utterances using deep neural networks. Neural Networks, 64:49–58, 2015.

[18] Otavio Good. How Google Translate squeezes deep learning onto a phone, 2015. googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html.

[19] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations, 2014. arxiv.org/pdf/1312.6082.

[20] Georg Heigold, Vincent Vanhoucke, Alan Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeffrey Dean. Multilingual acoustic models using distributed deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8619–8623. IEEE, 2013. research.google.com/pubs/archive/40807.pdf.

[21] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012. www.cs.toronto.edu/~gdahl/papers/deepSpeechReviewSPM2012.pdf.

[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. ftp.idsia.ch/pub/juergen/lstm.pdf.

[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. arxiv.org/abs/1502.03167.

[24] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, volume 41, pages 59–72. ACM, 2007. www.michaelisard.com/pubs/eurosys07.pdf.

[25] Benoît Jacob, Gaël Guennebaud, et al. Eigen library for linear algebra. eigen.tuxfamily.org.

[26] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014. arxiv.org/pdf/1408.5093.

[27] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014. research.google.com/pubs/archive/42455.pdf.

[28] A Krizhevsky. Cuda-convnet, 2014. code.google.com/p/cuda-convnet/.

[29] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014. arxiv.org/abs/1404.5997.

[30] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. www.cs.toronto.edu/~kriz/cifar.html.

[31] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In ICML'2012, 2012. Google Research PDF.

[32] Yann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits, 1998. yann.lecun.com/exdb/mnist/.

[33] Mu Li, Dave Andersen, and Alex Smola. Parameter server. parameterserver.org.

[34] Chris J Maddison, Aja Huang, Ilya Sutskever, and David Silver. Move evaluation in Go using deep convolutional neural networks. arXiv preprint arXiv:1412.6564, 2014. arxiv.org/abs/1412.6564.

[35] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track, 2013. arxiv.org/abs/1301.3781.

[36] Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013. Microsoft Research PDF.

[37] Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smit, Anil Madhavapeddy, and Steven Hand. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the Ninth USENIX Symposium on Networked Systems Design and Implementation, 2011. Usenix PDF.
[38] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015. arxiv.org/abs/1507.04296.

[39] CUDA Nvidia. CUBLAS library. NVIDIA Corporation, Santa Clara, California, 15, 2008. developer.nvidia.com/cublas.

[40] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013. people.csail.mit.edu/fredo/tmp/Halide-5min.pdf.

[41] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015. arxiv.org/abs/1502.02072.

[42] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011. papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.

[43] Chuck Rosenberg. Improving Photo Search: A step across the semantic gap, 2013. googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html.

[44] Christopher J Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 49–68. ACM, 2013. research-srv.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf.

[45] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Cognitive modeling, 5:3, 1988. www.cs.toronto.edu/~hinton/absps/naturebp.pdf.

[46] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google Voice Search: faster and more accurate, 2015. googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html.

[47] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.

[48] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR'2015, 2015. arxiv.org/abs/1409.4842.

[49] Seiya Tokui. Chainer: A powerful, flexible and intuitive framework of neural networks. chainer.org.

[50] Vincent Vanhoucke. Speech recognition and deep learning, 2015. googleresearch.blogspot.com/2012/08/speech-recognition-and-deep-learning.html.

[51] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, page 18. ACM, 2015. research.google.com/pubs/archive/43438.pdf.

[52] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. Technical report, arXiv:1412.7449, 2014. arxiv.org/abs/1412.7449.

[53] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015. arxiv.org/abs/1506.03134.

[54] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. An introduction to computational networks and the computational network toolkit. Technical report, Microsoft Research, 2014. research.microsoft.com/apps/pubs/?id=226641.

[55] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012. www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.

[56] Matthew D. Zeiler, Marc'Aurelio Ranzato, Rajat Monga, Mark Mao, Ke Yang, Quoc Le, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke, Jeff Dean, and Geoffrey E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013. research.google.com/pubs/archive/40811.pdf.
Visual Madlibs: Fill in the blank Description Generation and Question Answering

Licheng Yu, Eunbyung Park, Alexander C. Berg, Tamara L. Berg


Department of Computer Science, University of North Carolina, Chapel Hill
{licheng, eunbyung, aberg, tlberg}@cs.unc.edu

Abstract

In this paper, we introduce a new dataset consisting of 360,001 focused natural language descriptions for 10,738 images. This dataset, the Visual Madlibs dataset, is collected using automatically produced fill-in-the-blank templates designed to gather targeted descriptions about: people and objects, their appearances, activities, and interactions, as well as inferences about the general scene or its broader context. We provide several analyses of the Visual Madlibs dataset and demonstrate its applicability to two new description generation tasks: focused description generation, and multiple-choice question-answering for images. Experiments using joint-embedding and deep learning methods show promising results on these tasks.

1. Introduction
Much of everyday language and discourse concerns the visual world around us, making understanding the relationship between the physical world and language describing that world an important challenge problem for AI. Understanding this complex and subtle relationship will have broad applicability toward inferring human-like understanding for images, producing natural human-robot interactions, and for tasks like natural language grounding in NLP. In computer vision, along with improvements in deep learning based visual recognition, there has been an explosion of recent interest in methods to automatically generate natural language descriptions for images [6, 10, 16, 35, 17, 22] or videos [34, 9]. However, most of these methods and existing datasets have focused on only one type of description: a generic description for the entire image.

In this paper, we collect a new dataset of focused, targeted descriptions, the Visual Madlibs dataset¹, as illustrated in Figure 1. To collect this dataset, we introduce automatically produced fill-in-the-blank templates designed to collect a range of different descriptions for visual content in an image. This is inspired by Madlibs, a children's word game where one player prompts another for a list of words to substitute for blanks in a story. In our case, a user might be presented with an image and a fill-in-the-blank template such as "The frisbee is [blank]" and asked to fill in the [blank] with a description of the appearance of the frisbee. Alternatively, they could be asked to fill in the [blank] with a description of what the person is doing with the frisbee. Fill-in-the-blank questions can be targeted to collect descriptions about people and objects, their appearances, activities, and interactions, as well as descriptions of the general scene or the broader emotional, spatial, or temporal context of an image (examples in Fig 2). Using these templates, we collect 360,001 targeted descriptions for 10,738 images from the MS COCO collection [23].

Figure 1. An example from the Visual Madlibs Dataset, including a variety of targeted descriptions for people and objects.

¹http://tamaraberg.com/visualmadlibs/

With this new dataset, we can develop methods to generate more focused descriptions.
Figure 2. Example Visual Madlibs fill-in-the-blank descriptions.

Instead of asking an algorithm to "describe the image" we can now ask for more focused descriptions such as "describe the person", "describe what the person is doing," or "describe the relationship between the person and the frisbee." We can also ask questions about aspects of an image that are somewhat beyond the scope of the directly depicted content. For example, "describe what might have happened just before this picture was taken," or "describe how this image makes you feel." These types of descriptions reach toward high-level goals of producing human-like visual interpretations for images.

In addition to focused description generation, we also introduce a multiple-choice question-answering task for images. In this task, the computer is provided with an image and a partial description such as "The person is [blank]". A set of possible answers is also provided: one answer that was written about the image in question, and several additional answers written about other images. The computer is evaluated on how well it can select the correct choice. In this way, we can evaluate performance of description generation on a concrete task, making evaluation more straightforward. Varying the difficulty of the negative answers – adjusting how similar they are to the correct answer – provides a nuanced measurement of performance.

For both the generation and question-answering tasks, we study and evaluate a recent state-of-the-art approach for image description generation [35], as well as a simple joint-embedding method learned on deep representations. The evaluation also includes extensive analysis of the Visual Madlibs dataset and comparisons to the existing MS COCO dataset of natural language descriptions for images.

In summary, our contributions are:
1) A new description collection strategy, Visual Madlibs, for constructing fill-in-the-blank templates to collect targeted natural language descriptions.
2) A new Visual Madlibs Dataset consisting of 360,001 targeted descriptions, spanning 12 different types of templates, for 10,738 images, as well as analysis of the dataset and comparisons to existing MS COCO descriptions.
3) Evaluation of a generation method and a simple joint embedding method for targeted description generation.
4) Definition and evaluation of generation and joint-embedding methods on a new task, multiple-choice fill-in-the-blank question answering for images.

The rest of our paper is organized as follows. First, we review related work (Sec 2). Then, we describe our strategy for automatically generating fill-in-the-blank templates and introduce our Visual Madlibs dataset (Sec 3). Next we outline the multiple-choice question answering and targeted generation tasks (Sec 4) and provide several analyses of our dataset (Sec 5). Finally, we provide experiments evaluating description generation and joint-embedding methods on the proposed tasks (Sec 6) and conclude (Sec 7).

2. Related work

Description Generation: Recently, there has been an explosion of interest in methods for producing natural language descriptions for images or video. Early work in this area focused on detecting content elements and then composing captions [20, 36, 28, 11, 18] or made use of existing text either directly associated with an image [12, 1] or retrieved from visually similar images [29, 21, 26]. With the advancement of deep learning for content estimation, there have been many exciting recent attempts to generate image descriptions using neural network based approaches. Some methods first detect words or phrases using Convolutional Neural Network (CNN) features, then generate and re-rank candidate sentences [10, 22]. Other approaches take a more end-to-end approach to generate output descriptions directly from images [17, 35, 16, 6]. These new methods have shown great promise for image description generation, under some measures (e.g. BLEU-1) achieving near-human performance levels.

Description Datasets: Along with the development of image captioning algorithms there have been a number of datasets collected for this task. One of the first datasets collected for this problem was the UIUC Pascal Sentence data set [11], which contains 1,000 images with 5 sentences per image written by workers on Amazon Mechanical Turk. Based on this, PASCAL-50s [33] further collected 50 sentences per image. As the description problem gained popularity, larger and richer datasets were collected, including
the Flickr8K [30] and Flickr30K [37] datasets. In an alternative approach, the SBU Captioned photo dataset [29] contains 1 million images with existing captions collected from Flickr, but the text tends to contain more contextual information since captions were written by the photo owners. Most recently, Microsoft released the MS COCO [23] dataset, containing 120,000 images depicting 80 common object classes, with object segmentations and 5 turker-written descriptions per image. We make use of MS COCO, extending the types of descriptions associated with images.

Question-answering: Natural language question-answering has been a long-standing goal of NLP, with commercial companies like Ask-Jeeves or Google playing a significant role in developing effective methods. Recently, embedding and deep learning methods have shown great promise [32, 3, 4]. Lin et al. [24] take an interesting multi-modal approach to question-answering. A multiple-choice text-based question is first constructed from 3 sentences written about an image; 2 of the sentences are used as the question, and 1 is used as the positive answer, mixed with several negative answers from sentences written about other images. The authors develop ranking methods to answer these questions and show that generating abstract images for each potential answer can improve results. Note, here the algorithms are not provided with an image as part of the question. Some recent work has started to look at the problem of question-answering for images. Malinowski et al. [25] introduced two scene-based QA datasets and combine computer vision and NLP in a Bayesian framework. DAQUAR is made by collecting human questions and answers, and SynthQA is automatically generated based on object segmentation and question templates. Geman et al. [13] design a visual Turing test to evaluate image understanding using a series of binary questions about image content. We design question-answering tasks that are somewhat broader in scope than the previous works, allowing us to ask a variety of different types of natural language questions about images.

3. Designing and collecting Visual Madlibs

The goal of Visual Madlibs is to study targeted natural language descriptions of image content that go beyond generic descriptions of the whole image. The experiments in this paper begin with a dataset of images where the presence of some objects has already been labeled. The prompts for the questions are automatically generated based on image content, in a manner designed to elicit more detailed descriptions of the objects, their interactions, and the broader context of the scene shown in each image.

Visual Madlibs: Image+Instruction+Prompts+Blank. A single fill-in-the-blank question consists of a prompt and a blank, e.g., Person A is [blank] the car. The implicit question is, "What goes in the blank?" This is presented to a person along with an image and instructions, e.g., Describe the relationship between the indicated person and object. The same image and prompt may be used with different instructions to collect a variety of description types.

Instantiating Questions. While the general form of the questions for the Visual Madlibs was chosen by hand (see Table 1), most of the questions are instantiated depending on a subset of the objects present in an image. For instance, if an image contained two people and a dog, questions about each person (question types 9-11 in Table 1), the dog (types 6-8), and relationships between the two people and the dog (type 12) could be instantiated. For each possible instantiation, the wording of the questions will be automatically altered slightly to maintain grammatical consistency. In addition to these types of questions, other questions (types 1-5) can be instantiated for an image regardless of the objects present.

Notice in particular the questions about the temporal context – what might have happened before or what might happen after the image was taken. People can make inferences beyond the specific content depicted in an image. Sometimes these inferences will be consistent between people (e.g., when what will happen next is obvious), and other times these descriptions may be less consistent. We can use the variability of returned responses to select images for which these inferences are reliable.

Asking questions about every object and all pairs of objects quickly becomes unwieldy as the number of objects increases. To combat this, we choose a subset of objects present to use in instantiating questions. Such selection could be driven by a number of factors. The experiments in this paper consider comparisons to existing, general descriptions of images, so we instantiate questions about the objects mentioned in those existing natural language descriptions, an indication of the object's importance [2].

3.1. Data Collection

To collect the Visual Madlibs Dataset we use a subset of 10,738 human-centric images from MS COCO, which make up about a quarter of the validation data [23], and instantiate fill-in-the-blank templates as described above. The MS COCO images are annotated with a list of objects present in the images, segmentations for the locations of those objects, and 5 general natural language descriptions of the image. To select the subset of images for collecting Madlibs, we start with the 19,338 images with a person labeled. We then look at the five descriptions for each and perform a dependency parse [8], only keeping those images where a word referring to person is the head noun of the parse. This leaves 14,150 images. We then filter out the images whose descriptions do not include a synonym for any of the 79 non-person object categories labeled in the MS COCO dataset. This leaves 10,738 human-centric images with at least one other object from the MS COCO data set mentioned in the general image descriptions.
Type | Instruction | Prompt | #words
1. image's scene | Describe the type of scene/place shown in this picture. | The place is a(n) ___. | 4+1.45
2. image's emotion | Describe the emotional content of this picture. | When I look at this picture, I feel ___. | 8+1.14
3. image's interesting | Describe the most interesting or unusual aspect of this picture. | The most interesting aspect of this picture is ___. | 8+3.14
4. image's past | Describe what happened immediately before this picture was taken. | One or two seconds before this picture was taken, ___. | 9+5.45
5. image's future | Describe what happened immediately after this picture was taken. | One or two seconds after this picture was taken, ___. | 9+5.04
6. object's attribute | Describe the appearance of the indicated object. | The object(s) is/are ___. | 3.20+1.62
7. object's affordance | Describe the function of the indicated object. | People could ___ the object(s). | 4.20+1.74
8. object's position | Describe the position of the indicated object. | The object(s) is/are ___. | 3.20+3.35
9. person's attribute | Describe the appearance of the indicated person/people. | The person/people is/are ___. | 3+2.52
10. person's activity | Describe the activity of the indicated person/people. | The person/people is/are ___. | 3+2.47
11. person's location | Describe the location of the indicated person/people. | The person/people is/are ___. | 3.20+3.04
12. pair's relationship | Describe the relationship between the indicated person and object. | The person/people is/are ___ the object(s). | 5.20+1.65

Table 1. All 12 types of Madlibs instructions and prompts. Right-most column shows the average number of words for each description (#words for prompt + #words for answer).

Before final instantiation of the fill-in-the-blank templates, we need to resolve a potential ambiguity regarding which objects are referred to in the descriptions. We would like to collect Madlibs for objects described in the MS COCO captions, but since correspondences between the segmented objects and description mentions are not available, we first try to automatically estimate this assignment by parsing the descriptions. We consider two possible cases: 1) there are fewer annotated instances than the sentences describe, 2) there are more annotated instances than the sentences describe. It is easy to address the first case, just construct templates for all of the labeled instances. For the second case, we sort the area of each segmented instance, and pick the largest ones up to the parsed number for instantiation. Using this procedure, we obtain 26,148 labeled object or person instances in 10,738 images.

Each Visual Madlib is answered by 3 workers on Amazon Mechanical Turk. To date, we have collected 360,001 answers to Madlib questions and are continuing collection to include the training portion of the MS COCO dataset.
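The following is a hedged sketch of this selection-and-instantiation step; the data structures and helper names (select_instances, instantiate_prompt) are hypothetical, and only the rule itself (use all annotated instances when mentions are at least as many, otherwise keep the largest by segment area) follows the procedure described above.

```python
# Hypothetical sketch of instance selection and prompt instantiation.

def select_instances(annotated_instances, num_mentioned):
    """annotated_instances: list of dicts, each with an 'area' field."""
    if len(annotated_instances) <= num_mentioned:
        # Case 1: fewer annotations than mentions -> template every instance.
        return annotated_instances
    # Case 2: more annotations than mentions -> keep the largest by area.
    by_area = sorted(annotated_instances, key=lambda inst: inst["area"],
                     reverse=True)
    return by_area[:num_mentioned]

def instantiate_prompt(category, count):
    # Slight automatic rewording to keep the prompt grammatical.
    noun = category if count == 1 else category + "s"
    verb = "is" if count == 1 else "are"
    return "The %s %s ___ ." % (noun, verb)

# Example: two annotated dogs, but the captions mention only one dog.
dogs = [{"area": 1200.0}, {"area": 300.0}]
chosen = select_instances(dogs, num_mentioned=1)
print(instantiate_prompt("dog", len(chosen)))  # -> "The dog is ___ ."
```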
4. Tasks: Multiple-choice question answering and targeted generation

We design two tasks to evaluate targeted natural language description for images. The first task is to automatically generate natural language descriptions of images to fill in the blank for one of the Madlibs questions. The input to this task is an image, instructions, and a Madlibs prompt. As has been discussed in the community working on description generation for images, it can be difficult to evaluate free-form generation [33]. Our second task tries to address this issue by developing a new targeted multiple-choice question answering task for images. Here the input is again an image, instruction, and a prompt, but instead of a free-form text answer, there is a fixed set of multiple-choice answers to fill in the blank. The possible multiple-choice answers are sampled from the Madlibs responses, one that was written for the particular image/instruction/prompt as the correct answer, and distractors chosen from either similar images or random images depending on the level of difficulty desired. This ability to choose distractors to adjust the difficulty of the question, as well as the relative ease of evaluating multiple-choice answers, are attractive aspects of this new task.

In our experiments we randomly select 20% of the 10,738 images to use as our test set for evaluating these tasks. For the multiple-choice questions we form two sets of answers for each, with one set designed to be more difficult than the other. We first establish the easy task distractor answers by randomly choosing three descriptions (of the same question type) from other images [24]. The hard task is designed more delicately. Instead of randomly choosing from the other images, we now only look for those containing the same objects as our question image, and then arbitrarily pick three of their descriptions (a sketch of this sampling appears below). Sometimes, the descriptions sampled from "similar" images could also be good answers for our questions (later we experiment with using Turkers to select less ambiguous multiple-choice questions from this set). For the targeted generation task, for question types 1-5, algorithms generate descriptions given the image, instructions, and prompt. For the other question types whose prompts are related to some specific person or object, we additionally provide the algorithm with the location of each person/object mentioned in the prompt. We also experiment with estimating these locations using object detectors.
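A sketch of the easy/hard distractor sampling described above might look as follows. The item fields (image_id, objects, answers) are hypothetical stand-ins for the dataset's actual annotations, and "containing the same objects" is approximated here by a non-empty object overlap.

```python
# Hypothetical sketch of easy vs. hard distractor sampling.
import random

def sample_distractors(question, dataset, hard=False, k=3):
    """dataset: items with 'image_id', 'objects', and 'answers'[qtype] fields."""
    candidates = [item for item in dataset
                  if item["image_id"] != question["image_id"]]
    if hard:
        # Hard task: only draw from images sharing the question's objects.
        wanted = set(question["objects"])
        candidates = [item for item in candidates
                      if wanted & set(item["objects"])]
    # Pool all answers of the same question type, then pick k at random.
    pool = [ans for item in candidates
            for ans in item["answers"].get(question["qtype"], [])]
    return random.sample(pool, min(k, len(pool)))
```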
5. Analyzing the Visual Madlibs Dataset

We begin by conducting quantitative analyses of the responses collected in the Visual Madlibs Dataset in Sec. 5.1. A main goal is understanding what additional information is provided by the targeted descriptions in the Visual Madlibs Dataset vs general image descriptions. Therefore, we also provide analyses comparing Visual Madlibs to MS COCO descriptions collected for the same images in Sec. 5.2.

5.1. Quantifying Visual Madlibs responses

We analyze the length, structure, and consistency of the Visual Madlibs responses. First, the average length of each type of description is shown in the far right column of Table 1. Note that descriptions of people tend to be longer than descriptions of other objects in the dataset.
Second, we use phrase chunking [7] to analyze which phrasal structures are commonly used to fill in the blanks for different questions. Fig. 3, top row, shows relative frequencies for the top-5 most frequent templates used for several question types. Object attributes are usually described briefly with a simple adjectival phrase. On the other hand, people use more words and a wider variety of structures to describe possible future events. Except for future and past descriptions, the distribution of structures is generally concentrated on a few likely choices for each question type.

Third, we analyze how consistent the Mechanical Turk workers' answers are for each type of question. To compute a measure of similarity between a pair of responses we use the cosine similarity between representations of each response. A response is represented by the mean of the Word2Vec [27] vectors for each word in the response, following [24, 22]. Word2Vec is a 300-dimensional embedding representation for words that encodes the distributional context of words learned over very large word corpora. This measure takes into account the actual words used in a response, as opposed to the previous analyses of parse structure. Each Visual Madlibs question is answered by three workers, providing 3 pairs for which similarity is computed. Fig. 3, bottom row, shows a histogram of all pairwise similarities for several question types. Generally the similarities have a normal-like distribution with an extra peak around 1 indicating the fraction of responses that agree almost perfectly. Once again, descriptions of the future and past are least likely to be (near) identical, while object attributes and affordances are often very consistent.

Figure 3. First row shows top-5 most frequent phrase templates for image's future, object's attribute, object's affordance and person's activity. Second row shows the histograms of similarity between answers. (We put the plots for all 12 types in the supplementary file.)
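As a concrete illustration, the pairwise response similarity could be computed as below. Loading pretrained 300-dimensional Word2Vec vectors through gensim's KeyedVectors is an assumption made for this sketch; the text above does not prescribe an implementation.

```python
# Sketch: mean-Word2Vec representations compared by cosine similarity.
import numpy as np
from gensim.models import KeyedVectors

# Assumed pretrained 300-d embeddings (file name is illustrative).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def mean_vector(response):
    vecs = [w2v[w] for w in response.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Pairwise similarity of the three Turker answers to one Madlibs question.
answers = ["wearing a red coat", "wearing a bright red jacket", "riding a bike"]
pairs = [(0, 1), (0, 2), (1, 2)]
print([cosine(mean_vector(answers[i]), mean_vector(answers[j]))
       for i, j in pairs])
```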
5.2. Visual Madlibs vs general descriptions

We compare the targeted descriptions in the Visual Madlibs Dataset to the general image descriptions in MS COCO. First, we analyze the words used in Visual Madlibs compared to MS COCO descriptions of the same images. For each image, we extract the unique set of words from all descriptions of that image from both datasets, and compute the coverage of each set with respect to the other. We find that on average (across images) 22.45% of the Madlibs words are also present in MS COCO descriptions, while 52.38% of the MS COCO words are also present in Madlibs. We also compute the vocabulary size of Madlibs, which is 12,329, compared with MS COCO's 9,683 on the same image set.

Second, we compare how Madlibs and MS COCO answers describe the people and objects in images. We observe that the Madlibs question types, Table 1, cover much of the information in MS COCO descriptions [22]. As one way to see this, we run the StanfordNLP parser on both datasets [5]. For attributes of people, we use the parsing template shown in Fig. 4(a) to analyze the structures being used. The refer name indicates whether the person was mentioned in the description. Note that the Madlibs descriptions always have one reference to a person in the prompt (The person is [blank].). Therefore, for Madlibs, we report the presence of additional references to the person (e.g., the person is a man). The general attribute directly describes the appearance of the person or object (e.g., old or small); the affiliate object indicates whether additional objects are used to describe the targeted person (e.g. with a bag, coat, or glasses) and the affiliate attribute captures appearance characteristics of those secondary objects (e.g., red coat). The templates for object's attribute and verbs are more straightforward, as shown in Fig. 4(b)(c). The table in Fig. 4 shows the frequency of each parse component. Overall, more of the potential descriptive elements in these constructions are used in response to the Madlibs prompts than in the general descriptions found in MS COCO.

Figure 4. Template used for parsing person's attributes, activity and interaction with object, and object's attribute. The percentages below compare Madlibs and MS COCO on how frequently these templates are used for description.

We also break down the overlap between Visual Madlibs and MS COCO descriptions over different parsing templates for descriptions about people and objects (Fig. 5). Yellow bars show how often words for each parse type in MS COCO descriptions were also found in the same parse type in the Visual Madlibs answers, and green bars measure the reverse direction. Observations indicate that Madlibs provides more coverage in its descriptions than MS COCO for all templates except for person's refer name. One possible reason is that the prompts already indicate "the person" or "people" explicitly, so workers need not add an additional reference to the person in their descriptions.

Figure 5. Frequency that a word in a position in the people and object parsing template in one dataset is in the same position for the other dataset.

Extrinsic comparison of Visual Madlibs Data and general descriptions: We perform an extrinsic analysis by using either: a) the MS COCO descriptions for an image, or b) Visual Madlibs responses from other Turkers for an image, to select answers for our multiple-choice evaluation task.

Specifically, we use one of the human-provided descriptions, and select the multiple-choice answer that is most similar to that description. Similarity is measured as cosine similarity between the mean Word2Vec vectors for the words of a description compared to the multiple-choice answers. In addition to comparing how well the Madlibs or MS COCO descriptions can select the correct multiple-choice answer, we also use the descriptions automatically produced by a recent CNN+LSTM description generation system [35]², trained on the MS COCO dataset. This allows us to measure how close current automatically generated image descriptions are to our Madlibs descriptions. Fig. 6 shows the accuracies resulting from using Madlibs, MS COCO, or CNN+LSTM [35] to select the correct multiple-choice answer.

Figure 6. The accuracy of Madlibs, MS COCO and CNN+LSTM [35] (trained on MS COCO) used as references to answer the Madlibs hard multiple-choice questions.

Although this approach is quite simple, it allows us to make two interesting observations. First, Madlibs outperforms MS COCO on all types of multiple-choice questions. If Madlibs and MS COCO descriptions provided the same information, we would expect their performance to be comparable. Second, the automatically generated descriptions from the pre-trained CNN+LSTM perform much worse than the actual MS COCO descriptions, despite doing well on general image description generation.

²In this paper, we use Karpathy's implementation: https://github.com/karpathy/neuraltalk

6. Experiments

In this section we evaluate a series of methods on the targeted natural language generation and multiple-choice question answering tasks. As methods, we evaluate a language-only baseline, which computes the 4-gram perplexity for each sentence using Google-1T statistics (frequencies of all n-grams on the web). We also try simple joint-embedding methods – canonical correlation analysis (CCA) and normalized CCA (nCCA) [15] – as well as a recent deep-learning based method for image description generation, CNN+LSTM [35]. We train these models on 80% of the images in the Madlibs collection and evaluate their performance on the remaining 20%.

In our experiments we extract image features using the VGG Convolutional Neural Network (VGGNet) [31], trained on the ILSVRC-2012 dataset to recognize 1000 object classes. For comparison, we also extract image features using the Places-CNN, which is trained on 205 scene categories of the Places Database [38] using AlexNet [19]. On the sentence side, we average the Word2Vec vectors of all words in a sentence to obtain a representation.
Easy Task

#Q | n-gram | CCA | nCCA | nCCA(place) | nCCA(bbox) | nCCA(all) | CNN+LSTM(madlibs) | CNN+LSTM(r)(madlibs) | Human
1. scene | 6277 | 24.8% | 75.7% | 86.8% | 85.4% | − | 87.6% | 74.2% | 77.6% | 93.2%
2. emotion | 5138 | 26.7% | 41.3% | 49.2% | 50.4% | − | 42.4% | 37.0% | 44.5% | 48.3%
3. past | 4903 | 24.3% | 61.8% | 77.5% | 72.6% | − | 80.3% | 50.1% | 47.3% | 93.5%
4. future | 4658 | 27.7% | 61.2% | 78.0% | 72.1% | − | 80.2% | 50.6% | 50.6% | 94.5%
5. interesting | 5095 | 24.2% | 66.8% | 76.5% | 72.0% | − | 78.9% | 55.4% | 49.9% | 94.7%
6. obj attr | 7194 | 30.6% | 44.1% | 47.5% | 44.7% | 54.7% | 50.9% | 46.9% | 59.0% | 88.9%
7. obj aff | 7326 | 30.1% | 59.8% | 73.0% | 69.6% | 72.2% | 76.7% | − | 88.9% | 93.1%
8. obj pos | 7290 | 28.0% | 53.0% | 65.9% | 64.2% | 58.9% | 69.7% | 53.9% | 69.6% | 91.4%
9. per attr | 6651 | 27.2% | 40.4% | 48.0% | 44.5% | 53.1% | 44.5% | 36.5% | 46.0% | 83.8%
10. per act | 6501 | 27.3% | 70.0% | 80.7% | 76.9% | 75.6% | 82.8% | 64.7% | 68.9% | 96.7%
11. per loc | 6580 | 24.4% | 69.8% | 82.7% | 82.6% | 73.8% | 82.7% | 60.8% | 71.6% | 92.2%
12. pair rel | 7595 | 29.2% | 54.3% | 63.0% | 61.3% | 64.2% | 67.2% | − | 72.3% | 91.7%

Hard Task

#Q | n-gram | CCA | nCCA | nCCA(place) | nCCA(bbox) | nCCA(all) | CNN+LSTM(madlibs) | CNN+LSTM(r)(madlibs) | Human
1. scene | 6277 | 22.8% | 63.8% | 70.1% | 70.7% | − | 68.2% | 63.6% | 64.2% | 75.6%
2. emotion | 5138 | 25.1% | 33.9% | 37.2% | 38.3% | − | 33.2% | 34.6% | 37.6% | 38.4%
3. past | 4903 | 22.4% | 47.9% | 52.8% | 49.5% | − | 54.0% | 42.2% | 39.5% | 73.9%
4. future | 4658 | 24.4% | 47.5% | 54.3% | 50.5% | − | 53.3% | 41.1% | 39.5% | 75.1%
5. interesting | 5095 | 27.6% | 51.4% | 53.7% | 50.5% | − | 55.1% | 44.0% | 37.1% | 76.7%
6. obj attr | 7194 | 29.5% | 42.2% | 43.6% | 41.5% | 49.8% | 39.3% | 41.6% | 42.3% | 70.5%
7. obj aff | 7326 | 32.2% | 54.5% | 63.5% | 60.9% | 63.0% | 48.5% | − | 69.4% | 52.7%
8. obj pos | 7290 | 29.2% | 49.0% | 55.7% | 53.3% | 50.7% | 53.4% | 46.7% | 50.2% | 70.8%
9. per attr | 6651 | 23.3% | 33.9% | 38.6% | 35.5% | 46.1% | 31.6% | 35.5% | 42.4% | 70.5%
10. per act | 6501 | 24.0% | 59.7% | 65.4% | 62.6% | 65.1% | 66.6% | 57.3% | 53.7% | 85.1%
11. per loc | 6580 | 22.3% | 56.8% | 63.3% | 65.5% | 57.8% | 62.6% | 50.4% | 56.8% | 72.9%
12. pair rel | 7595 | 30.1% | 49.4% | 54.3% | 52.2% | 56.5% | 52.0% | − | 54.6% | 74.7%

Table 2. Accuracies computed for different approaches on the easy and hard multiple-choice answering task. CCA, nCCA, and CNN+LSTM are trained on the whole image representation for each type of question. nCCA(place) uses the Places-CNN feature. nCCA(bbox) is trained and evaluated on ground-truth bounding-boxes from MS COCO segmentations. nCCA(all) trains a single embedding using all question types. CNN+LSTM(r) ranks the perplexity of {prompt+choice}.

Figure 7. Some hard multiple-choice question examples. The results are made by nCCA. First row shows correct choices. Second row shows incorrect choices. Corresponding human accuracies are provided as reference.

CCA finds a joint embedding between two multi-dimensional variables, in our case image and text vector representations. To increase the flexibility of the feature selection and to improve computational efficiency, Gong et al. [15] proposed nCCA, a scalable approximation scheme of explicit kernel mapping followed by dimension reduction and linear CCA. In the projected latent space, the similarity is measured by the eigenvalue-weighted normalized correlation. We train CCA and nCCA models for each question type separately using the training portion of the Visual Madlibs Dataset. These models allow us to map from an image representation, to the joint-embedding space, to vectors in the Word2Vec space, and vice versa. For targeted generation, we map an image to the joint-embedding space and then choose the answer from the training set text that is closest to this embedded point. To answer multiple-choice questions, we embed each multiple-choice answer and then select the answer whose embedding is closest.
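Below is a hedged sketch of this retrieval pipeline using scikit-learn's plain CCA as a stand-in: the nCCA approximation of Gong et al. and the eigenvalue-weighted correlation are not reproduced, and plain cosine similarity in the latent space is used instead. The random features stand in for precomputed image (e.g., VGGNet) and mean-Word2Vec text representations.

```python
# Sketch: CCA joint embedding for multiple-choice answering (simplified).
import numpy as np
from sklearn.cross_decomposition import CCA

img_train = np.random.randn(500, 256)   # stand-in image features
txt_train = np.random.randn(500, 300)   # stand-in mean-Word2Vec features

cca = CCA(n_components=32)
cca.fit(img_train, txt_train)           # learn the joint embedding

def answer_multiple_choice(img_feat, choice_feats):
    """Pick the choice whose latent projection best matches the image's."""
    k = len(choice_feats)
    X = np.tile(img_feat, (k, 1))       # repeat the image feature k times
    x_lat, y_lat = cca.transform(X, np.vstack(choice_feats))
    sims = [np.dot(x_lat[i], y_lat[i]) /
            (np.linalg.norm(x_lat[i]) * np.linalg.norm(y_lat[i]) + 1e-8)
            for i in range(k)]
    return int(np.argmax(sims))

img = np.random.randn(256)
choices = [np.random.randn(300) for _ in range(4)]
print(answer_multiple_choice(img, choices))
```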
To answer multiple-choice questions, we embed each multiple-choice answer and then select the answer whose embedding is closest.

Following recent description generation techniques [35, 16], we train a CNN+LSTM model for each question type. These models learn a mapping from an image and prompt to a sequence of words, e.g., "The chair is", and then let the CNN+LSTM system generate the remaining words of the description. For the multiple-choice task, we evaluate two ways to select an answer. The first method selects the answer with the largest cosine Word2Vec similarity to the generated description. The second method ranks the prompt+choices by perplexity and selects the best one.

Filtered Questions from Hard Task

                 #Q    nCCA   nCCA(place)  nCCA(bbox)  nCCA(all)  CNN+LSTM(r)(madlibs)
1. scene        4940   77.6%  77.8%        −           76.3%      69.7%
2. emotion      2052   49.0%  49.5%        −           43.8%      43.0%
3. past         3976   57.4%  53.8%        −           59.4%      41.3%
4. future       3820   59.2%  54.2%        −           58.3%      41.7%
5. interesting  4159   59.5%  55.1%        −           61.3%      40.3%
6. obj attr     5436   47.2%  44.7%        54.6%       42.8%      46.3%
7. obj aff      4581   71.0%  67.6%        70.5%       57.6%      79.0%
8. obj pos      5721   60.2%  57.7%        54.6%       57.7%      54.3%
9. per attr     4893   42.4%  38.8%        52.1%       34.4%      46.4%
10. per act     5813   68.3%  65.3%        67.9%       69.6%      55.3%
11. per loc     5096   69.9%  71.7%        62.6%       70.0%      60.6%
12. pair rel    5981   57.6%  55.4%        60.0%       56.5%      57.4%

Table 3. Accuracies for different approaches on the filtered questions from the hard task. The filtered questions are those with human accuracies higher than 0.6. Full tables for the filtered easy and hard tasks are in the supplementary file.
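The embedding-based selection described above is just a nearest-neighbour search in the joint space. Below is a minimal illustrative sketch of that step (not the authors' code): it assumes the image and each candidate answer have already been projected into the joint space by a trained nCCA model, and the function names and random stand-in vectors are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def answer_multiple_choice(image_vec, choice_vecs):
    # Select the candidate whose joint-space embedding is closest
    # to the embedded image (by cosine similarity).
    sims = [cosine_sim(image_vec, c) for c in choice_vecs]
    return int(np.argmax(sims))

# Toy usage with random stand-ins for real joint-space embeddings:
rng = np.random.default_rng(0)
image_vec = rng.normal(size=128)          # image projected into the joint space
choice_vecs = rng.normal(size=(4, 128))   # four candidate answers, same space
print(answer_multiple_choice(image_vec, choice_vecs))
```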
6.1. Discussion of results

Table 2 shows accuracies of each algorithm on the easy and hard versions of the multiple-choice task³ and Fig. 7 shows example correct and incorrect answer choices. There are several interesting observations we can make. First, from the results of the language-only n-gram baseline, we conclude that answering Madlibs questions strongly requires visual information. Second, training nCCA on all types of questions together, nCCA(all), is helpful for the easy variant of the task, but less useful on the more fine-grained hard version of the task. Third, extracting visual features from the bounding box of the relevant person/object yields higher accuracy for predicting attributes, but not for other questions. Based on this finding, we evaluate answering the attribute questions using automatic detection methods. The detectors are trained on ImageNet using R-CNN [14], covering 42 MS COCO categories. We observe similar performance between ground-truth and detected bounding boxes in Table 4. Fourth, we observe that the Places-CNN helps answer questions related to the image's scene, the person's location, and the image's emotion.

³ The missing entries for questions 7 and 12 are due to priming not being valid for questions with blanks in the middle of the sentence.

                     Easy Task                           Hard Task
              #Q    nCCA   nCCA(bbox)  nCCA(dbox)   nCCA   nCCA(bbox)  nCCA(dbox)
6. obj attr  2021   47.6%  53.6%       51.4%        43.9%  47.9%       45.2%
9. per attr  4206   50.2%  55.4%       51.2%        40.0%  47.0%       43.3%

Table 4. Multiple-choice answering using automatic detection for 42 object/person categories. "bbox" denotes ground-truth bounding box and "dbox" denotes detected bounding box.

                     BLEU-1                              BLEU-2
                nCCA   nCCA(bbox)  CNN+LSTM(madlibs)  nCCA   nCCA(bbox)  CNN+LSTM(madlibs)
1. scene        0.52   −           0.62               0.17   −           0.19
2. emotion      0.17   −           0.38               0      −           0
3. future       0.38   −           0.39               0.12   −           0.13
4. past         0.39   −           0.42               0.12   −           0.12
5. interesting  0.49   −           0.65               0.14   −           0.22
6. obj attr     0.28   0.36        0.48               0.02   0.02        0.01
7. obj aff      0.56   0.60        −                  0.10   0.11        −
8. obj pos      0.53   0.55        0.71               0.24   0.25        0.49
9. per attr     0.26   0.29        0.57               0.06   0.07        0.25
10. per act     0.47   0.41        0.53               0.14   0.11        0.20
11. per loc     0.52   0.46        0.63               0.22   0.19        0.39
12. pair rel    0.46   0.48        −                  0.07   0.08        −

Table 5. BLEU-1 and BLEU-2 computed on the Madlibs testing dataset for different approaches.
As an additional experiment, we ask 5 people to answer each multiple-choice question. The last column of Table 2 shows human accuracy as a reference. We further use human agreement to select a subset of the multiple-choice questions where at least 3 Turkers choose the correct answer. Results of the methods on this question subset are shown in Table 3, displaying similar patterns as the unfiltered set, with slightly higher accuracy.

Finally, Table 5 shows BLEU-1 and BLEU-2 scores for targeted generation. Although the CNN+LSTM models we trained on Madlibs were not quite as accurate as nCCA for selecting the correct multiple-choice answer, they did result in better, sometimes much better, accuracy (as measured by BLEU scores) for targeted generation.
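For reference, the quantity behind these scores is a clipped n-gram precision. The sketch below is illustrative only (full BLEU also combines orders geometrically and applies a brevity penalty, and the example strings are made up); it shows how BLEU-1- and BLEU-2-style precisions are computed for a generated description against reference descriptions.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, references, n):
    # Count candidate n-grams, clipping each count by the maximum number
    # of times that n-gram appears in any single reference.
    cand_counts = Counter(ngrams(candidate.split(), n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref.split(), n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

generated = "the chair is red"
refs = ["the chair is bright red", "a red chair"]
print(clipped_precision(generated, refs, 1))  # BLEU-1-style unigram precision
print(clipped_precision(generated, refs, 2))  # BLEU-2-style bigram precision
```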
7. Conclusions

We have introduced a new fill-in-the-blank strategy for collecting targeted natural language descriptions. Our analyses show that these descriptions are usually more detailed than generic whole-image descriptions. We also introduce a targeted natural language description generation task and a multiple-choice question answering task, then train and evaluate joint-embedding and generation models. Data produced by this paper will be publicly released.

Acknowledgements: We thank the vision and language communities for feedback, especially J. Hockenmaier, K. Saenko, and J. Corso. This research is supported by NSF Awards #1417991, 1405822, 144234, 1452851, and Microsoft Research.

References

[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.

[2] A. C. Berg, T. L. Berg, H. D. III, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos, and K. Yamaguchi. Understanding and predicting importance in images. In CVPR, 2012.
[3] A. Bordes, J. Weston, and S. Chopra. Question answering with subgraph embeddings. In EMNLP, 2014.
[4] A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML PKDD, 2014.
[5] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, 2014.
[6] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
[7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
[8] M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, 2006.
[9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. Lawrence Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[11] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[12] Y. Feng and M. Lapata. Topic models for image annotation and text illustration. In ACL, 2010.
[13] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 2015.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[15] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 2014.
[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[17] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In TACL, 2015.
[18] N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. NAACL HLT 2013, page 10, 2013.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[20] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
[21] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[22] R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based image captioning. In ICML, 2015.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
[24] X. Lin and D. Parikh. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2015.
[25] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[26] R. Mason. Domain-independent captioning of domain-specific images. In NAACL-HLT, 2013.
[27] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[28] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[29] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[30] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using amazon's mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Weakly supervised memory networks. arXiv preprint arXiv:1503.08895, 2015.
[33] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.
[34] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL-HLT, 2015.
[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[36] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
[37] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014.
[38] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.

Visualizing and Understanding Convolutional Networks

Matthew D. Zeiler zeiler@cs.nyu.edu


Dept. of Computer Science, Courant Institute, New York University
Rob Fergus fergus@cs.nyu.edu
Dept. of Computer Science, Courant Institute, New York University
arXiv:1311.2901v3 [cs.CV] 28 Nov 2013

Abstract

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on the Caltech-101 and Caltech-256 datasets.

1. Introduction

Since their introduction by (LeCun et al., 1989) in the early 1990's, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on the NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record-beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical; and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.

Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

1.1. Related Work

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit's activation. This requires a careful initialization and does not give any information about the unit's invariances. Motivated by the latter's short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex, so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

2. Approach

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image x_i, via a series of layers, to a probability vector y^_i over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of N labeled images {x, y}, where label y_i is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare y^_i and y_i. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.
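To make the per-layer computation concrete, here is a minimal numpy sketch of one such layer: a valid convolution of the input with a bank of learned filters, followed by the rectified linear function. It is illustrative only; the pooling and contrast-normalization steps are omitted, and all shapes are arbitrary stand-ins.

```python
import numpy as np

def conv_relu_layer(x, filters, stride=1):
    # x: (H, W, C) input; filters: (K, fh, fw, C) learned filter bank.
    # Returns the (oh, ow, K) rectified feature maps.
    K, fh, fw, _ = filters.shape
    H, W, _ = x.shape
    oh, ow = (H - fh) // stride + 1, (W - fw) // stride + 1
    out = np.zeros((oh, ow, K))
    for i in range(oh):
        for j in range(ow):
            patch = x[i * stride:i * stride + fh, j * stride:j * stride + fw, :]
            out[i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(out, 0)  # relu(x) = max(x, 0)

# Toy usage: a 7x7, stride-2 first layer applied to a random "image".
rng = np.random.default_rng(0)
maps = conv_relu_layer(rng.normal(size=(32, 32, 3)),
                       rng.normal(size=(96, 7, 7, 3)), stride=2)
```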
2.1. Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1 (top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features are computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

Unpooling: In the convnet, the max pooling operation is non-invertible; however, we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1 (bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
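The unpool and rectify steps are simple to state in code. The sketch below (an illustration under assumed 2-D, single-map shapes, not the authors' implementation) records the switch variables during forward max pooling and uses them to unpool; the remaining filtering step would convolve the result with vertically and horizontally flipped copies of the learned filters.

```python
import numpy as np

def maxpool_with_switches(fmap, size=2):
    # Forward max pooling over non-overlapping windows, recording the
    # flat within-window index of each maximum (the "switches").
    H, W = fmap.shape
    pooled = np.zeros((H // size, W // size))
    switches = np.zeros((H // size, W // size), dtype=int)
    for i in range(H // size):
        for j in range(W // size):
            window = fmap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            switches[i, j] = int(np.argmax(window))
            pooled[i, j] = window.flat[switches[i, j]]
    return pooled, switches

def unpool(pooled, switches, size=2):
    # Deconvnet unpooling: place each value back at its recorded
    # max location, preserving the structure of the stimulus.
    H, W = pooled.shape
    out = np.zeros((H * size, W * size))
    for i in range(H):
        for j in range(W):
            di, dj = divmod(int(switches[i, j]), size)
            out[i * size + di, j * size + dj] = pooled[i, j]
    return out

# One unpool + rectify step on a toy feature map:
fmap = np.random.default_rng(1).normal(size=(4, 4))
pooled, sw = maxpool_with_switches(np.maximum(fmap, 0))
reconstruction = np.maximum(unpool(pooled, sw), 0)
```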
3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center, with and without horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
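The renormalization itself is a one-line rescaling per filter. A minimal sketch, with the 10^-1 radius taken from the text and the filter-bank shape as an arbitrary stand-in:

```python
import numpy as np

def renormalize_filters(W, radius=1e-1):
    # Rescale any filter whose RMS value exceeds `radius` back to that
    # fixed radius; filters below the radius are left unchanged.
    W = W.copy()
    for k in range(W.shape[0]):
        rms = np.sqrt(np.mean(W[k] ** 2))
        if rms > radius:
            W[k] *= radius / rms
    return W

# Toy check: after renormalization, no filter's RMS exceeds the radius.
W = np.random.default_rng(0).normal(scale=0.5, size=(96, 7, 7, 3))
W = renormalize_filters(W)
assert np.sqrt((W ** 2).reshape(96, -1).mean(axis=1)).max() <= 1e-1 + 1e-12
```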
4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

[Figure 2: grids of feature visualizations and corresponding image patches for Layers 1-5.]

Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, col 1). Best viewed in electronic form.

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird's legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).

4.1. Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.

4.2. Occlusion Sensitivity

With image classification approaches, a natural question is whether the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.
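The occlusion experiment reduces to a loop over occluder positions. Below is an illustrative sketch (not the paper's code): the trained convnet is abstracted as a function returning class probabilities, the dummy classifier only makes the example runnable, and the square size, stride and fill value are assumptions.

```python
import numpy as np

def occlusion_map(image, predict_prob, true_class, size=50, stride=8, fill=0.5):
    # Slide a grey square over the image and record the classifier's
    # probability of the true class at each occluder position.
    # Low values mark regions the classifier depends on.
    H, W = image.shape[:2]
    rows = (H - size) // stride + 1
    cols = (W - size) // stride + 1
    heat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            occluded[r * stride:r * stride + size,
                     c * stride:c * stride + size] = fill
            heat[r, c] = predict_prob(occluded)[true_class]
    return heat

# Toy usage with a dummy stand-in for the trained convnet:
dummy = lambda img: np.full(1000, 1e-3) + img.mean() * np.eye(1000)[42]
heat = occlusion_map(np.random.rand(224, 224, 3), dummy, true_class=42)
```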
4.3. Correspondence Analysis

Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image $i$, we then compute: $\epsilon_i^l = x_i^l - \tilde{x}_i^l$, where $x_i^l$ and $\tilde{x}_i^l$ are the feature vectors at layer $l$ for the original and occluded images respectively. We then measure the consistency of this difference vector $\epsilon$ between all related image pairs $(i,j)$: $\Delta_l = \sum_{i,j=1,\,i \neq j}^{5} H(\mathrm{sign}(\epsilon_i^l), \mathrm{sign}(\epsilon_j^l))$, where $H$ is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the $\Delta$ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layers $l = 5$ and $l = 7$. The lower score for these parts, relative to random object regions, for the layer 5 features shows the model does establish some degree of correspondence.
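The $\Delta_l$ measure transcribes directly into numpy. The sketch below assumes the layer-$l$ feature vectors for the original and occluded images have already been extracted; the random arrays are stand-ins for real features.

```python
import numpy as np

def delta_consistency(feats_orig, feats_occl):
    # eps_i = x_i - x_tilde_i for each image i; then sum the Hamming
    # distances between the sign patterns of eps_i and eps_j over all
    # ordered pairs i != j. Lower = more consistent feature change.
    eps_sign = np.sign(feats_orig - feats_occl)
    n = eps_sign.shape[0]
    total = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += int(np.sum(eps_sign[i] != eps_sign[j]))
    return total

# Toy usage: 5 images, 4096-d layer features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4096))
x_occluded = x + rng.normal(scale=0.1, size=x.shape)
print(delta_consistency(x, x_occluded))
```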

[Figure 3: architecture diagram, input image through layers 1-7 to the softmax output.]

Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
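The caption can be summarized as a layer schedule. Note this is an assumption-laden sketch: only layer 1's filter size and stride are fully specified in the text, and the remaining filter sizes and strides are read off the figure, so they should be treated as indicative rather than as the paper's exact specification.

```python
# Indicative layer schedule for the Fig. 3 model. Layer 1 follows the
# caption exactly; values for layers 2-5 are inferred from the figure.
layers = [
    ("conv1", {"maps": 96, "filter": 7, "stride": 2}),   # + relu, 3x3/2 pool, contrast norm
    ("conv2", {"maps": 256, "filter": 5, "stride": 2}),  # + relu, 3x3/2 pool, contrast norm
    ("conv3", {"maps": 384, "filter": 3, "stride": 1}),  # + relu
    ("conv4", {"maps": 384, "filter": 3, "stride": 1}),  # + relu
    ("conv5", {"maps": 256, "filter": 3, "stride": 1}),  # + relu, 3x3/2 pool -> 6x6x256
    ("fc6", {"units": 4096}),                            # input is 6*6*256 = 9216-dim
    ("fc7", {"units": 4096}),
    ("softmax", {"units": "C"}),                         # C-way softmax over classes
]
```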

Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer's features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.
[Figure 5 panels a1-c4: plots of layer 1 and layer 7 canonical distance and P(true class) against vertical translation (pixels), scale (ratio), and rotation (degrees), for five images: Lawn Mower, Shih-Tzu, African Crocodile, African Grey, Entertainment Center.]

Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5 example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the image is transformed.

Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer "dead" features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, without the aliasing artifacts that are visible in (d).
[Figure 7: three examples (true labels: Pomeranian, Car Wheel, Afghan Hound) with panels (a) Input Image, (b) Layer 5 strongest feature map, (c) feature map projections, (d) classifier probability of correct class, (e) most probable class.]

Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) change. (b): for each position of the gray square, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog's face. When this is covered up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog's face is obscured, the probability for "pomeranian" drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is "pomeranian", but if the dog's face is obscured but not the ball, then it predicts "tennis ball". In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.

Figure 8. Images used for correspondence experiments. Col 1: Original image. Col 2,3,4: Occlusion of the right eye, left eye, and nose respectively. Other columns show examples of random occlusions.

                      Mean Feature Sign Change
Occlusion Location    Layer 5           Layer 7
Right Eye             0.067 ± 0.007     0.069 ± 0.015
Left Eye              0.069 ± 0.007     0.068 ± 0.013
Nose                  0.079 ± 0.017     0.069 ± 0.011
Random                0.107 ± 0.017     0.073 ± 0.014

Table 1. Measure of correspondence for different object parts in 5 different dog images. The lower scores for the eyes and nose (compared to random object parts) show the model implicitly establishing some form of correspondence of parts at layer 5 in the model. At layer 7, the scores are more similar, perhaps due to upper layers trying to discriminate between the different breeds of dog.

5. Experiments

5.1. ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.

Using the exact architecture specified in (Krizhevsky et al., 2012), we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7x7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset¹ (despite only using the 2012 training set). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

¹ This performance has been surpassed in the recent Imagenet 2013 competition (http://www.image-net.org/challenges/LSVRC/2013/results.php).

Error %                                    Val Top-1   Val Top-5   Test Top-5
(Gunji et al., 2012)                       -           -           26.2
(Krizhevsky et al., 2012), 1 convnet       40.7        18.2        −
(Krizhevsky et al., 2012), 5 convnets      38.1        16.4        16.4
(Krizhevsky et al., 2012)*, 1 convnet      39.0        16.6        −
(Krizhevsky et al., 2012)*, 7 convnets     36.7        15.4        15.3
Our replication of
(Krizhevsky et al., 2012), 1 convnet       40.5        18.1        −
1 convnet as per Fig. 3                    38.4        16.5        −
5 convnets as per Fig. 3 – (a)             36.7        15.3        15.3
1 convnet as per Fig. 3 but with
layers 3,4,5: 512,1024,512 maps – (b)      37.5        16.0        16.1
6 convnets, (a) & (b) combined             36.0        14.7        14.8

Table 2. ImageNet 2012 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets.

Varying ImageNet Model Sizes: In Table 3, we first explore the architecture of (Krizhevsky et al., 2012) by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6,7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3, we also modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for the model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.

5.2. Feature Generalization

The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 (Fei-fei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.

Error %                                    Train Top-1   Val Top-1   Val Top-5
Our replication of
(Krizhevsky et al., 2012), 1 convnet       35.1          40.5        18.1
Removed layers 3,4                         41.8          45.4        22.1
Removed layer 7                            27.4          40.0        18.4
Removed layers 6,7                         27.4          44.8        22.4
Removed layers 3,4,6,7                     71.1          71.3        50.1
Adjust layers 6,7: 2048 units              40.3          41.7        18.8
Adjust layers 6,7: 8192 units              26.8          40.0        18.1
Our Model (as per Fig. 3)                  33.1          38.4        16.5
Adjust layers 6,7: 2048 units              38.2          40.2        17.6
Adjust layers 6,7: 8192 units              22.0          38.8        17.0
Adjust layers 3,4,5: 512,1024,512 maps     18.8          37.5        16.0
Adjust layers 6,7: 8192 units and
layers 3,4,5: 512,1024,512 maps            10.0          38.3        16.9

Table 3. ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our model (see Fig. 3).

[Figure 9: plot of Caltech-256 accuracy (%) against training images per class (0-60) for Our Model, Bo et al., and Sohn et al.]

Figure 9. Caltech-256 classification performance as the number of training images per class is varied. Using only 6 training examples per class with our pre-trained feature extractor, we surpass the best reported result by (Bo et al., 2013).
The classifiers used by our model (a softmax) and other approaches (typically a linear SVM) are of similar complexity, thus the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and training them, as well as the softmax, on the training images of the dataset.

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few "overlap" images² and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.

² For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.

Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training, test on up to 50 images per class, and report the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.

                              Acc % 15/class   Acc % 30/class
(Bo et al., 2013)             −                81.4 ± 0.33
(Jianchao et al., 2009)       73.2             84.3
Non-pretrained convnet        22.8 ± 1.5       46.5 ± 1.7
ImageNet-pretrained convnet   83.8 ± 0.5       86.5 ± 0.5

Table 4. Caltech-101 classification accuracy for our convnet models, against two leading alternate approaches.

Caltech-256: We follow the procedure of (Griffin et al., 2006), selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 5. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. (Bo et al., 2013) by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 9, we explore the "one-shot learning" (Fei-fei et al., 2006) regime. With our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

                      Acc % 15/class   Acc % 30/class   Acc % 45/class   Acc % 60/class
(Sohn et al., 2011)   35.1             42.1             45.7             47.9
(Bo et al., 2013)     40.5 ± 0.4       48.0 ± 0.2       51.9 ± 0.2       55.2 ± 0.3
Non-pretr.            9.0 ± 1.4        22.5 ± 0.7       31.2 ± 0.5       38.8 ± 1.4
ImageNet-pretr.       65.7 ± 0.2       70.6 ± 0.2       72.7 ± 0.4       74.2 ± 0.3

Table 5. Caltech-256 classification accuracies.
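The transfer recipe itself is small: extract features with the frozen layers 1-7, then fit only a new softmax (multinomial logistic regression) by gradient descent on the cross-entropy loss. A minimal sketch, with feature arrays and hyper-parameters as stand-ins:

```python
import numpy as np

def train_softmax(features, labels, num_classes, lr=0.1, epochs=100):
    # features: (n, d) fixed, pre-extracted convnet features; labels: (n,) ints.
    n, d = features.shape
    W, b = np.zeros((d, num_classes)), np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(cross-entropy)/d(logits)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Toy usage with random stand-ins for layer-7 features:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 4096)), rng.integers(0, 3, size=60)
W, b = train_softmax(X, y, num_classes=3)
```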

PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 6 shows the results on the test set. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading (Yan et al., 2012) result; however, we do beat them on 5 classes, sometimes by large margins.

Acc %      [A]    [B]    Ours        Acc %       [A]    [B]    Ours
Airplane   92.0   97.3   96.0        Dining tab  63.2   77.8   67.7
Bicycle    74.2   84.2   77.1        Dog         68.9   83.0   87.8
Bird       73.0   80.8   88.4        Horse       78.2   87.5   86.0
Boat       77.5   85.3   85.5        Motorbike   81.0   90.1   85.1
Bottle     54.3   60.8   55.8        Person      91.6   95.0   90.9
Bus        85.2   89.9   85.8        Potted pl   55.9   57.8   52.2
Car        81.9   86.8   78.6        Sheep       69.4   79.2   83.6
Cat        76.4   89.3   91.2        Sofa        65.4   73.4   61.1
Chair      65.2   75.4   65.0        Train       86.7   94.5   91.8
Cow        63.2   77.8   74.4        Tv          77.4   80.7   76.1
Mean       74.3   82.2   79.0        # won       0      15     5

Table 6. PASCAL 2012 classification results, comparing our Imagenet-pretrained convnet against the leading two methods ([A] = (Sande et al., 2012) and [B] = (Yan et al., 2012)).

5.3. Feature Analysis

We explore how discriminative the features in each layer of our Imagenet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and placing either a linear SVM or softmax classifier on top. Table 7 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

              Cal-101 (30/class)   Cal-256 (60/class)
SVM (1)       44.8 ± 0.7           24.6 ± 0.4
SVM (2)       66.2 ± 0.5           39.6 ± 0.3
SVM (3)       72.3 ± 0.4           46.0 ± 0.3
SVM (4)       76.6 ± 0.4           51.3 ± 0.1
SVM (5)       86.2 ± 0.8           65.6 ± 0.3
SVM (7)       85.5 ± 0.4           71.7 ± 0.2
Softmax (5)   82.9 ± 0.4           65.7 ± 0.5
Softmax (7)   85.4 ± 0.4           72.6 ± 0.1

Table 7. Analysis of the discriminative information contained in each layer of feature maps within our ImageNet-pretrained convnet. We train either a linear SVM or softmax on features from different layers (as indicated in brackets) from the convnet. Higher layers generally produce more discriminative features.

6. Discussion

We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualizations can be used to debug problems with the model to obtain better results, for example improving on Krizhevsky et al.'s (Krizhevsky et al., 2012) impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model's performance.

Finally, we showed how the ImageNet-trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. This result brings into question the utility of benchmarks with small (i.e. < 10^4) training sets. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias (Torralba & Efros, 2011), although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle object detection as well.

Acknowledgments

The authors are very grateful for support by NSF grant IIS-1116923, Microsoft Research and a Sloan Fellowship.

References

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, pp. 153–160, 2007.

Berkes, P. and Wiskott, L. On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation, 2006.

Bo, L., Ren, X., and Fox, D. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
more discriminative features.

Ciresan, D. C., Meier, J., and Schmidhuber, J. Multi-column deep neural networks for image classification. In CVPR, 2012.

Dalal, N. and Triggs, B. Histograms of oriented gradients for pedestrian detection. In CVPR, 2005.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. In arXiv:1310.1531, 2013.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. In Technical report, University of Montreal, 2009.

Fei-fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. PAMI, 2006.

Griffin, G., Holub, A., and Perona, P. The caltech 256. In Caltech Technical Report, 2006.

Gunji, N., Higuchi, T., Yasumoto, K., Muraoka, H., Ushiku, Y., Harada, T., and Kuniyoshi, Y. Classification entry. In Imagenet Competition, 2012.

Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.

Jianchao, Y., Kai, Y., Yihong, G., and Thomas, H. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, 1989.

Sande, K., Uijlings, J., Snoek, C., and Smeulders, A. Hybrid coding for selective search. In PASCAL VOC Classification Challenge 2012, 2012.

Sohn, K., Jung, D., Lee, H., and Hero III, A. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.

Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.

Yan, S., Dong, J., Chen, Q., Song, Z., Pan, Y., Xia, W., Huang, Z., Hua, Y., and Shen, S. Generalized hierarchical matching for sub-category aware object classification. In PASCAL VOC Classification Challenge 2012, 2012.

Zeiler, M., Taylor, G., and Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
