Deep Learning in Neural Networks: An Overview
Jürgen Schmidhuber
The Swiss AI Lab IDSIA
arXiv:1404.7828v4 [cs.NE] 8 Oct 2014
Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous
contests in pattern recognition and machine learning. This historical survey compactly summarises
relevant work, much of it from the previous millennium. Shallow and deep learners are distin-
guished by the depth of their credit assignment paths, which are chains of possibly learnable, causal
links between actions and effects. I review deep supervised learning (also recapitulating the history
of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation,
and indirect search for short programs encoding deep and large networks.
Preface
This is the preprint of an invited Deep Learning (DL) overview. One of its goals is to assign credit
to those who contributed to the present state of the art. I acknowledge the limitations of attempting
to achieve this goal. The DL research community itself may be viewed as a continually evolving,
deep network of scientists who have influenced each other in complex ways. Starting from recent DL
results, I tried to trace back the origins of relevant ideas through the past half century and beyond,
sometimes using “local search” to follow citations of citations backwards in time. Since not all DL
publications properly acknowledge earlier relevant work, additional global search strategies were em-
ployed, aided by consulting numerous neural network experts. As a result, the present preprint mostly
consists of references. Nevertheless, through an expert selection bias I may have missed important
work. A related bias was surely introduced by my special familiarity with the work of my own DL
research group in the past quarter-century. For these reasons, this work should be viewed as merely a
snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send
corrections and suggestions to juergen@idsia.ch.
Contents
1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
6.1 RL Through NN World Models Yields RNNs With Deep CAPs
6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
6.4 RL Facilitated by Deep UL in FNNs and RNNs
6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
6.7 Deep RL by Indirect Policy Search / Compressed NN Search
6.8 Universal RL
8 Acknowledgments
1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What
changes to them improve performance? This has been called the fundamental credit assignment prob-
lem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that
are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on
the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural
Networks (NNs).
A standard neural network (NN) consists of many simple, connected processors called neurons,
each producing a sequence of real-valued activations. Input neurons get activated through sensors per-
ceiving the environment, other neurons get activated through weighted connections from previously
active neurons (details in Sec. 2). Some neurons may influence the environment by triggering actions.
Learning or credit assignment is about finding weights that make the NN exhibit desired behavior,
such as driving a car. Depending on the problem and how the neurons are connected, such behavior
may require long causal chains of computational stages (Sec. 3), where each stage transforms (of-
ten in a non-linear way) the aggregate activation of the network. Deep Learning is about accurately
assigning credit across many such stages.
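The weighted-activation scheme just described can be sketched in a few lines. This is a hypothetical illustration only: the layer shapes, the weight values, and the tanh nonlinearity are assumptions for the example, not anything specified in this survey.

```python
import numpy as np

# Illustrative sketch of a standard NN: input neurons are set from a sensor
# reading; each later stage transforms the aggregate activation through
# weighted connections followed by a nonlinearity.

def forward(x, stages):
    """Propagate an input through a chain of computational stages."""
    a = x
    for W in stages:        # each stage: weighted sum, then nonlinear transform
        a = np.tanh(W @ a)
    return a

stages = [np.array([[0.5, -1.0], [1.5, 0.3]]),  # two hidden neurons
          np.array([[1.0, 1.0]])]               # one output neuron
out = forward(np.array([1.0, -0.5]), stages)
print(out.shape)    # prints (1,)
```

A deep problem in the sense above is one where many such stages lie between input and the credit-receiving weights.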
Shallow NN-like models with few such stages have been around for many decades if not centuries
(Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s
(Sec. 5.3) and 1970s (Sec. 5.5). An efficient gradient descent method for teacher-based Supervised
Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was
developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep
NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6),
and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically fea-
sible to some extent through the help of Unsupervised Learning (UL), e.g., Sec. 5.10 (1991), Sec. 5.15
(2006). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the
new millennium, deep NNs have finally attracted widespread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998)
in numerous important applications. In fact, since 2009, supervised deep NNs have won many official
international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first
superhuman visual pattern recognition results in limited domains (Sec. 5.19, 2011). Deep NNs also
have become relevant for the more general field of Reinforcement Learning (RL) where there is no
supervising teacher (Sec. 6).
Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests
(Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3)—they
are general computers more powerful than FNNs, and can in principle create and process memories
of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a).
Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969;
Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential
and parallel information processing in a natural and efficient way, exploiting the massive parallelism
viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.
The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation
that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the
concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is
of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses
on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent
competitions (Sec. 5.17–5.23). Sec. 5 is arranged in a historical timeline format with subsections on
important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic
Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep
NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and
RNNs, including successful policy gradient and evolutionary methods.
goal is to find weights that yield episodes with a high sum of reward signals, through sequences of
appropriate output actions.
Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely,
backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as
RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.
to a precise answer, let me just define for the purposes of this overview: problems of depth > 10
require Very Deep Learning.
The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn
to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of
direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first
on shallow problems whose solutions may then generalize to deep problems, or through collapsing
sequences of (non)linear operations into a single (non)linear operation (but see an analysis of non-
trivial aspects of deep linear networks, Baldi and Hornik, 1994, Section B). In general, however,
finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990;
Blum and Rivest, 1992), also in the case of deep NNs (Šíma, 1994; de Souto et al., 1999; Windisch,
2005); compare a survey of negative results (Šíma, 2002, Section 1).
Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q)
is also true if $x_p$ is an output event and $x_q$ any later input event—any action may affect the environment
and thus any later perception. (In the real world, the environment may even influence non-input events
computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is
possible to model and replace such unmodifiable environmental PCCs through a part of the NN that
has already learned to predict (through some of its units) input events (including reward signals) from
former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to
other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very
deep CAPs though.
Some DL research is about automatically rephrasing problems such that their depth is reduced
(Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often
Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2.
Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.
4.3 Learning Hierarchical Representations Through Deep SL, UL, RL
Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as
more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn
hierarchies of more and more abstract data representations. For example, certain methods of syn-
tactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules
to model observations. The partially (un)supervised Automated Mathematician / EURISKO (Lenat,
1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts.
Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also
a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL
(Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression
(Sec. 4.4), e.g., Sec. 5.10.
UL in the same section: often gradient-based methods, such as BP (Sec. 5.5.1), are used to optimize
objective functions of both UL and SL, and the boundary between SL and UL may blur, for example,
when it comes to time series prediction and sequence classification, e.g., Sec. 5.10, 5.12.
A historical timeline format will help to arrange subsections on important inspirations and techni-
cal contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly
mentions early, shallow NN models since the 1940s (and 1800s), Sec. 5.2 additional early neurobio-
logical inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since
1965), to my knowledge the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep
Neocognitron NN (1979) which is very similar to certain modern deep FNN architectures, as it com-
bines convolutional NNs (CNNs), weight pattern replication, and subsampling mechanisms. Sec. 5.5
uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation
(BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981
and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and
mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hier-
archical stack (1987) of coupled UL-based Autoencoders (AEs)—this concept resurfaced in the new
millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs (1989), which is important for today’s
DL applications. Sec. 5.9 explains BP’s Fundamental DL Problem (of vanishing/exploding gradients)
discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-
trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment
Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular winner-take-all (WTA)
method called Max-Pooling (MP, 1992) widely used in today’s deep FNNs. Sec. 5.12 mentions a
first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN
(Long Short-Term Memory, LSTM, 1995) for problems of depth 1000 and more. Sec. 5.14 mentions
an early contest of 2003 won by an ensemble of shallow FNNs, as well as good pattern recognition
results with CNNs and deep FNNs and LSTM RNNs (2003). Sec. 5.15 is mostly about Deep Belief
Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7), both pre-trained by UL to
facilitate subsequent BP-based SL (compare Sec. 5.6.1, 5.10). Sec. 5.16 mentions the first SL-based
GPU-CNNs (2006), BP-trained MPCNNs (2007), and LSTM stacks (2007). Sec. 5.17–5.22 focus on
official competitions with secret test sets won by (mostly purely supervised) deep NNs since 2009,
in sequence recognition, image classification, image segmentation, and object detection. Many RNN
results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code de-
veloped since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19). Sec. 5.24
mentions recent tricks for improving DL in NNs, many of them closely related to earlier tricks from
the previous millennium (e.g., Sec. 5.6.2, 5.6.3). Sec. 5.25 discusses how artificial NNs can help to
understand biological NNs; Sec. 5.26 addresses the possibility of DL in NNs with spiking neurons.
5.2 Around 1960: Visual Cortex Provides Inspiration for DL (Sec. 5.4, 5.11)
Simple cells and complex cells were found in the cat’s visual cortex (e.g., Hubel and Wiesel, 1962;
Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs,
such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This
inspired later deep NN architectures (Sec. 5.4, 5.11) used in certain modern award-winning Deep
Learners (Sec. 5.19–5.22).
5.3 1965: Deep Networks Based on the Group Method of Data Handling
Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965;
Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feed-
forward Multilayer Perceptron type, although there was earlier work on NNs with a single hidden
layer (e.g., Joseph, 1961; Viglione, 1970). The units of GMDH nets may have polynomial activation
functions implementing Kolmogorov-Gabor polynomials (more general than other widely used NN
activation functions, Sec. 2). Given a training set, layers are incrementally grown and trained by re-
gression analysis (e.g., Legendre, 1805; Gauss, 1809, 1821) (Sec. 5.1), then pruned with the help of
a separate validation set (using today’s terminology), where Decision Regularisation is used to weed
out superfluous units (compare Sec. 5.6.3). The numbers of layers and units per layer can be learned
in problem-dependent fashion. To my knowledge, this was the first example of open-ended, hierar-
chical representation learning in NNs (Sec. 4.3). A paper of 1971 already described a deep GMDH
network with 8 layers (Ivakhnenko, 1971). There have been numerous applications of GMDH-style
nets, e.g. (Ikeda et al., 1976; Farlow, 1984; Madala and Ivakhnenko, 1994; Ivakhnenko, 1995; Kondo,
1998; Kordík et al., 2003; Witczak et al., 2006; Kondo and Ueno, 2008).
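The incremental, regression-based layer growing described above might be sketched as follows. This is an illustrative simplification, not Ivakhnenko's exact procedure: candidate units are assumed to be second-order (Kolmogorov-Gabor-style) polynomials of input pairs fit by least squares, with a separate validation set selecting the survivors.

```python
import numpy as np

# GMDH-style sketch: grow layers of polynomial units trained by regression,
# pruned with the help of a separate validation set.

def quad_features(a, b):
    # Second-order polynomial terms of two inputs.
    return np.stack([np.ones_like(a), a, b, a * a, b * b, a * b], axis=1)

def grow_layer(X_tr, y_tr, X_va, y_va, keep=3):
    candidates = []
    for i in range(X_tr.shape[1]):
        for j in range(i + 1, X_tr.shape[1]):
            coef, *_ = np.linalg.lstsq(quad_features(X_tr[:, i], X_tr[:, j]),
                                       y_tr, rcond=None)
            err = np.mean((quad_features(X_va[:, i], X_va[:, j]) @ coef - y_va) ** 2)
            candidates.append((err, i, j, coef))
    candidates.sort(key=lambda c: c[0])        # validation error prunes units
    best = candidates[:keep]
    new_tr = np.stack([quad_features(X_tr[:, i], X_tr[:, j]) @ c
                       for _, i, j, c in best], axis=1)
    new_va = np.stack([quad_features(X_va[:, i], X_va[:, j]) @ c
                       for _, i, j, c in best], axis=1)
    return new_tr, new_va, best[0][0]

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (400, 4))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]          # needs a pairwise interaction
X_tr, X_va, y_tr, y_va = X[:200], X[200:], y[:200], y[200:]
err = None
for _ in range(2):                             # incrementally grow two layers
    X_tr, X_va, err = grow_layer(X_tr, y_tr, X_va, y_va)
print(err)
```

As in the description above, the number of layers is determined in problem-dependent fashion: growing stops when validation error no longer improves (here simply fixed at two layers for brevity).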
5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of
complex, nonlinear, differentiable (Leibniz, 1684), multi-stage, NN-related systems has been dis-
cussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961;
Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969), initially within the framework of Euler-Lagrange equations in the Calculus of
Variations (e.g., Euler, 1744).
Steepest descent in the weight space of such systems can be performed (Bryson, 1961; Kelley,
1960; Bryson and Ho, 1969) by iterating the chain rule (Leibniz, 1676; L’Hôpital, 1696) à la Dynamic
Programming (DP) (Bellman, 1957). A simplified derivation of this backpropagation method uses the
chain rule only (Dreyfus, 1962).
The systems of the 1960s were already efficient in the DP sense. However, they backpropagated
derivative information through standard Jacobian matrix calculations from one “layer” to the previous
one, without explicitly addressing either direct links across several layers or potential additional effi-
ciency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
Given all the prior work on learning in multilayer NN-like systems (see also Sec. 5.3 on deep non-
linear nets since 1965), it seems surprising in hindsight that a book (Minsky and Papert, 1969) on the
limitations of simple linear perceptrons with a single layer (Sec. 5.1) discouraged some researchers
from further studying NNs.
Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected,
NN-like networks apparently was first described in a 1970 master’s thesis (Linnainmaa, 1970, 1976),
albeit without reference to NNs. BP is also known as the reverse mode of automatic differentia-
tion (Griewank, 2012), where the costs of forward activation spreading essentially equal the costs of
backward derivative calculation. See early FORTRAN code (Linnainmaa, 1970) and closely related
work (Ostrovskii et al., 1971).
Efficient BP was soon explicitly used to minimize cost functions by adapting control parameters
(weights) (Dreyfus, 1973). Compare some preliminary, NN-specific discussion (Werbos, 1974, sec-
tion 5.5.1), a method for multilayer threshold NNs (Bobrowski, 1978), and a computer program for
automatically deriving and implementing BP for given differentiable systems (Speelpenning, 1980).
To my knowledge, the first NN-specific application of efficient BP as above was described in
1981 (Werbos, 1981, 2006). Related work was published several years later (Parker, 1985; LeCun,
1985, 1988). A paper of 1986 significantly contributed to the popularisation of BP for NNs (Rumelhart
et al., 1986), experimentally demonstrating the emergence of useful internal representations in hidden
layers. See generalisations for sequence-processing recurrent NNs (e.g., Williams, 1989; Robinson
and Fallside, 1987; Werbos, 1988; Williams and Zipser, 1988, 1989b,a; Rohwer, 1989; Pearlmutter,
1989; Gherrity, 1989; Williams and Peng, 1990; Schmidhuber, 1992a; Pearlmutter, 1995; Baldi, 1995;
Kremer and Kolen, 2001; Atiya and Parlos, 2000), also for equilibrium RNNs (Almeida, 1987; Pineda,
1987) with stationary inputs.
5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
Using the notation of Sec. 2 for weight-sharing FNNs or RNNs, after an episode of activation spreading
through differentiable $f_t$, a single iteration of gradient descent through BP computes changes of
all $w_i$ in proportion to $\frac{\partial E}{\partial w_i} = \sum_t \frac{\partial E}{\partial net_t} \frac{\partial net_t}{\partial w_i}$ as in Algorithm 5.5.1 (for the additive case), where each
weight $w_i$ is associated with a real-valued variable $\Delta_i$ initialized by 0.
The computational costs of the backward (BP) pass are essentially those of the forward pass
(Sec. 2). Forward and backward passes are re-iterated until sufficient performance is reached.
Alg. 5.5.1: One iteration of BP for weight-sharing FNNs or RNNs
for t = T, . . . , 1 do
    to compute $\frac{\partial E}{\partial net_t}$, initialize real-valued error signal variable $\delta_t$ by 0;
    if $x_t$ is an input event then continue with next iteration;
    if there is an error $e_t$ then $\delta_t := x_t - d_t$;
    add to $\delta_t$ the value $\sum_{k \in out_t} w_{v(t,k)} \delta_k$; (this is the elegant and efficient recursive chain rule
    application collecting impacts of $net_t$ on future events)
    multiply $\delta_t$ by $f_t'(net_t)$;
    for all $k \in in_t$ add to $\Delta w_{v(k,t)}$ the value $x_k \delta_t$
end for
change each $w_i$ in proportion to $\Delta_i$ and a small real-valued learning rate
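For concreteness, the forward and backward passes above can be sketched numerically for a tiny FNN with one hidden layer. The shapes, tanh units, and learning rate eta are illustrative assumptions; the survey's general weight-sharing notation is not reproduced here.

```python
import numpy as np

# One BP iteration in the spirit of Alg. 5.5.1: forward pass, then error
# signals delta propagated backward via the chain rule, then weight changes
# in proportion to the incoming activation times delta.

def bp_iteration(W1, W2, x, d, eta=0.1):
    # Forward pass: activations, stage by stage.
    h = np.tanh(W1 @ x)
    y = np.tanh(W2 @ h)
    # Backward pass: delta_t = dE/dnet_t, output layer first.
    delta2 = (y - d) * (1.0 - y ** 2)           # error times f'(net)
    delta1 = (W2.T @ delta2) * (1.0 - h ** 2)   # recursive chain rule step
    # Change each weight in proportion to x_k * delta_t (gradient step).
    return W1 - eta * np.outer(delta1, x), W2 - eta * np.outer(delta2, h)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0.0, 0.5, (3, 2)), rng.normal(0.0, 0.5, (1, 3))
x, d = np.array([0.5, -0.3]), np.array([0.8])
for _ in range(500):        # re-iterate forward and backward passes
    W1, W2 = bp_iteration(W1, W2, x, d)
y = np.tanh(W2 @ np.tanh(W1 @ x))
print(abs(y[0] - d[0]))     # small after sufficient iterations
```

As stated above, the backward pass costs essentially as much as the forward pass: each matrix is touched once in each direction.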
As of 2014, this simple BP method is still the central learning algorithm for FNNs and RNNs. No-
tably, most contest-winning NNs up to 2014 (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22) did not augment
supervised BP by some sort of unsupervised learning as discussed in Sec. 5.7, 5.10, 5.15.
5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
To deal with long time lags between relevant events, several sequence processing methods were pro-
posed, including Focused BP based on decay factors for activations of units in RNNs (Mozer, 1989,
1992), Time-Delay Neural Networks (TDNNs) (Lang et al., 1990) and their adaptive extension (Bo-
denhausen and Waibel, 1991), Nonlinear AutoRegressive with eXogenous inputs (NARX) RNNs (Lin
et al., 1996), certain hierarchical RNNs (Hihi and Bengio, 1996) (compare Sec. 5.10, 1991), RL
economies in RNNs with WTA units and local learning rules (Schmidhuber, 1989b), and other meth-
ods (e.g., Ring, 1993, 1994; Plate, 1993; de Vries and Principe, 1991; Sun et al., 1993a; Bengio
et al., 1994). However, these algorithms either worked for shallow CAPs only, could not generalize
to unseen CAP depths, had problems with greatly varying time lags between relevant events, needed
external fine tuning of delay constants, or suffered from other problems. In fact, it turned out that
certain simple but deep benchmark problems used to evaluate such methods are more quickly solved
by randomly guessing RNN weights until a solution is found (Hochreiter and Schmidhuber, 1996).
While the RNN methods above were designed for DL of temporal sequences, the Neural Heat
Exchanger (Schmidhuber, 1990c) consists of two parallel deep FNNs with opposite flow directions.
Input patterns enter the first FNN and are propagated “up”. Desired outputs (targets) enter the “oppo-
site” FNN and are propagated “down”. Using a local learning rule, each layer in each net tries to be
similar (in information content) to the preceding layer and to the adjacent layer of the other net. The
input entering the first net slowly “heats up” to become the target. The target entering the opposite net
slowly “cools down” to become the input. The Helmholtz Machine (Dayan et al., 1995; Dayan and
Hinton, 1996) may be viewed as an unsupervised (Sec. 5.6.4) variant thereof (Peter Dayan, personal
communication, 1994).
A hybrid approach (Shavlik and Towell, 1989; Towell and Shavlik, 1994) initializes a poten-
tially deep FNN through a domain theory in propositional logic, which may be acquired through
explanation-based learning (Mitchell et al., 1986; DeJong and Mooney, 1986; Minton et al., 1989).
The NN is then fine-tuned through BP (Sec. 5.5). The NN’s depth reflects the longest chain of
reasoning in the original set of logical rules. An extension of this approach (Maclin and Shavlik,
1993; Shavlik, 1994) initializes an RNN by domain knowledge expressed as a Finite State Automa-
ton (FSA). BP-based fine-tuning has become important for later DL systems pre-trained by UL, e.g.,
Sec. 5.10, 5.15.
5.6.3 Searching For Simple, Low-Complexity, Problem-Solving NNs (Sec. 5.24)
Many researchers used BP-like methods to search for “simple,” low-complexity NNs (Sec. 4.4)
with high generalization capability. Most approaches address the bias/variance dilemma (Geman
et al., 1992) through strong prior assumptions. For example, weight decay (Hanson and Pratt, 1989;
Weigend et al., 1991; Krogh and Hertz, 1992) encourages near-zero weights, by penalizing large
weights. In a Bayesian framework (Bayes, 1763), weight decay can be derived (Hinton and van
Camp, 1993) from Gaussian or Laplacian weight priors (Gauss, 1809; Laplace, 1774); see also (Mur-
ray and Edwards, 1993). An extension of this approach postulates that a distribution of networks with
many similar weights generated by Gaussian mixtures is “better” a priori (Nowlan and Hinton, 1992).
Often weight priors are implicit in additional penalty terms (MacKay, 1992) or in methods based
on validation sets (Mosteller and Tukey, 1968; Stone, 1974; Eubank, 1988; Hastie and Tibshirani,
1990; Craven and Wahba, 1979; Golub et al., 1979), Akaike’s information criterion and final pre-
diction error (Akaike, 1970, 1973, 1974), or generalized prediction error (Moody and Utans, 1994;
Moody, 1992). See also (Holden, 1994; Wang et al., 1994; Amari and Murata, 1993; Wang et al.,
1994; Guyon et al., 1992; Vapnik, 1992; Wolpert, 1994). Similar priors (or biases towards simplicity)
are implicit in constructive and pruning algorithms, e.g., layer-by-layer sequential network construc-
tion (e.g., Ivakhnenko, 1968, 1971; Ash, 1989; Moody, 1989; Gallant, 1988; Honavar and Uhr, 1988;
Ring, 1991; Fahlman, 1991; Weng et al., 1992; Honavar and Uhr, 1993; Burgess, 1994; Fritzke, 1994;
Parekh et al., 2000; Utgoff and Stracuzzi, 2002) (see also Sec. 5.3, 5.11), input pruning (Moody, 1992;
Refenes et al., 1994), unit pruning (e.g., Ivakhnenko, 1968, 1971; White, 1989; Mozer and Smolen-
sky, 1989; Levin et al., 1994), weight pruning, e.g., optimal brain damage (LeCun et al., 1990b), and
optimal brain surgeon (Hassibi and Stork, 1993).
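The weight decay idea mentioned above admits a one-line illustration, assuming the common quadratic penalty lam * ||w||^2 (only one of the cited variants): its gradient 2 * lam * w shrinks every weight toward zero at each step.

```python
import numpy as np

# Weight decay sketch: the data term of the loss is omitted here to isolate
# the effect of the penalty term lam * ||w||^2 on the weights.

def decay_step(w, lam=0.1, eta=0.5):
    grad_penalty = 2.0 * lam * w        # gradient of lam * ||w||^2
    return w - eta * grad_penalty       # multiplies w by (1 - 2 * eta * lam)

w = np.array([2.0, -3.0])
for _ in range(100):
    w = decay_step(w)
print(np.max(np.abs(w)))                # near zero after repeated decay
```

In practice the decay gradient is added to the error gradient, so weights survive only where the data provide enough counter-pressure; this is the bias toward "simple" networks discussed above.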
A very general but not always practical approach for discovering low-complexity SL NNs or
RL NNs searches among weight matrix-computing programs written in a universal programming
language, with a bias towards fast and short programs (Schmidhuber, 1997) (Sec. 6.7).
Flat Minimum Search (FMS) (Hochreiter and Schmidhuber, 1997a, 1999) searches for a “flat”
minimum of the error function: a large connected region in weight space where error is low and re-
mains approximately constant, that is, few bits of information are required to describe low-precision
weights with high variance. Compare perturbation tolerance conditions (Minai and Williams, 1994;
Murray and Edwards, 1993; Hanson, 1990; Neti et al., 1992; Matsuoka, 1992; Bishop, 1993; Ker-
lirzin and Vallet, 1993; Carter et al., 1990). An MDL-based, Bayesian argument suggests that flat
minima correspond to “simple” NNs and low expected overfitting. Compare Sec. 5.6.4 and more
recent developments mentioned in Sec. 5.24.
1993; Redlich, 1993; Zemel, 1993; Zemel and Hinton, 1994; Field, 1994; Hinton et al., 1995; Dayan
and Zemel, 1995; Amari et al., 1996; Deco and Parra, 1997).
Many do this to uncover and disentangle hidden underlying sources of signals (e.g., Jutten and
Herault, 1991; Schuster, 1992; Andrade et al., 1993; Molgedey and Schuster, 1994; Comon, 1994;
Cardoso, 1994; Bell and Sejnowski, 1995; Karhunen and Joutsensalo, 1995; Belouchrani et al., 1997;
Hyvärinen et al., 2001; Szabó et al., 2006; Shan et al., 2007; Shan and Cottrell, 2014).
Many UL methods automatically and robustly generate distributed, sparse representations of in-
put patterns (Földiák, 1990; Hinton and Ghahramani, 1997; Lewicki and Olshausen, 1998; Hyvärinen
et al., 1999; Hochreiter and Schmidhuber, 1999; Falconbridge et al., 2006) through well-known fea-
ture detectors (e.g., Olshausen and Field, 1996; Schmidhuber et al., 1996), such as off-center-on-
surround-like structures, as well as orientation sensitive edge detectors and Gabor filters (Gabor,
1946). They extract simple features related to those observed in early visual pre-processing stages
of biological systems (e.g., De Valois et al., 1982; Jones and Palmer, 1987).
UL can also serve to extract invariant features from different data items (e.g., Becker, 1991)
through coupled NNs observing two different inputs (Schmidhuber and Prelinger, 1992), also called
Siamese NNs (e.g., Bromley et al., 1993; Hadsell et al., 2006; Taylor et al., 2011; Chen and Salman,
2011).
UL can help to encode input data in a form advantageous for further processing. In the context
of DL, one important goal of UL is redundancy reduction. Ideally, given an ensemble of input pat-
terns, redundancy reduction through a deep NN will create a factorial code (a code with statistically
independent components) of the ensemble (Barlow et al., 1989; Barlow, 1989), to disentangle the
unknown factors of variation (compare Bengio et al., 2013). Such codes may be sparse and can be
advantageous for (1) data compression, (2) speeding up subsequent BP (Becker, 1991), (3) trivialising
the task of subsequent naive yet optimal Bayes classifiers (Schmidhuber et al., 1996).
Most early UL FNNs had a single layer. Methods for deeper UL FNNs include hierarchical
(Sec. 4.3) self-organizing Kohonen maps (e.g., Koikkalainen and Oja, 1990; Lampinen and Oja, 1992;
Versino and Gambardella, 1996; Dittenbach et al., 2000; Rauber et al., 2002), hierarchical Gaussian
potential function networks (Lee and Kil, 1991), layer-wise UL of feature hierarchies fed into SL
classifiers (Behnke, 1999, 2003a), the Self-Organising Tree Algorithm (SOTA) (Herrero et al., 2001),
and nonlinear Autoencoders (AEs) with more than 3 (e.g., 5) layers (Kramer, 1991; Oja, 1991; DeMers
and Cottrell, 1993). Such AE NNs (Rumelhart et al., 1986) can be trained to map input patterns
to themselves, for example, by compactly encoding them through activations of units of a narrow
bottleneck hidden layer. Certain nonlinear AEs suffer from certain limitations (Baldi, 2012).
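The bottleneck idea can be sketched minimally as follows, assuming a linear AE trained by gradient descent on inputs that lie on a line (and are thus compactly encodable by one hidden unit); all shapes, rates, and iteration counts are illustrative assumptions.

```python
import numpy as np

# Linear autoencoder sketch: a single bottleneck unit is trained to map
# 2-D input patterns to themselves through a narrow hidden layer.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[2.0, 1.0]])  # rank-1 input ensemble
W_enc = rng.normal(0.0, 0.5, (1, 2))   # encoder: 2 inputs -> 1 bottleneck unit
W_dec = rng.normal(0.0, 0.5, (2, 1))   # decoder: bottleneck -> 2 outputs
eta = 0.05
for _ in range(1000):
    H = X @ W_enc.T                    # bottleneck codes
    X_hat = H @ W_dec.T                # reconstructions of the inputs
    E = X_hat - X
    grad_dec = (E.T @ H) / len(X)      # gradient of mean squared error
    grad_enc = ((E @ W_dec).T @ X) / len(X)
    W_dec -= eta * grad_dec
    W_enc -= eta * grad_enc
err = np.mean((X @ W_enc.T @ W_dec.T - X) ** 2)
print(err)
```

The single bottleneck activation is the compact code of each input pattern; nonlinear AEs with deeper stacks generalize this scheme.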
LOCOCODE (Hochreiter and Schmidhuber, 1999) uses FMS (Sec. 5.6.3) to find low-complexity
AEs with low-precision weights describable by few bits of information, often producing sparse or
factorial codes. Predictability Minimization (PM) (Schmidhuber, 1992c) searches for factorial codes
through nonlinear feature detectors that fight nonlinear predictors, trying to become both as infor-
mative and as unpredictable as possible. PM-based UL was applied not only to FNNs but also to
RNNs (e.g., Schmidhuber, 1993b; Lindstädt, 1993). Compare Sec. 5.10 on UL-based RNN stacks
(1991), as well as later UL RNNs (e.g., Klapper-Rybicka et al., 2001; Steil, 2007).
facilitate subsequent learning. In one experiment, a particular AE-specific learning algorithm (dif-
ferent from traditional BP of Sec. 5.5.1) was used to learn a mapping in an AE stack pre-trained by
this type of UL (Ballard, 1987). This was faster than learning an equivalent mapping by BP through
a single deeper AE without pre-training. On the other hand, the task did not really require a deep
AE, that is, the benefits of UL were not that obvious from this experiment. Compare an early sur-
vey (Hinton, 1989) and the somewhat related Recursive Auto-Associative Memory (RAAM) (Pollack,
1988, 1990; Melnik et al., 2000), originally used to encode sequential linguistic structures of arbitrary
size through a fixed number of hidden units. More recently, RAAMs were also used as unsupervised
pre-processors to facilitate deep credit assignment for RL (Gisslen et al., 2011) (Sec. 6.4).
In principle, many UL methods (Sec. 5.6.4) could be stacked like the AEs above, the history-
compressing RNNs of Sec. 5.10, the Restricted Boltzmann Machines (RBMs) of Sec. 5.15, or hi-
erarchical Kohonen nets (Sec. 5.6.4), to facilitate subsequent SL. Compare Stacked Generaliza-
tion (Wolpert, 1992; Ting and Witten, 1997), and FNNs that profit from pre-training by competitive
UL (e.g., Rumelhart and Zipser, 1986) prior to BP-based fine-tuning (Maclin and Shavlik, 1995). See
also more recent methods using UL to improve subsequent SL (e.g., Behnke, 1999, 2003a; Escalante-
B. and Wiskott, 2013).
I A Very Deep Learner of 1991 (the History Compressor, Sec. 5.10) alleviates the problem
through unsupervised pre-training for a hierarchy of RNNs. This greatly facilitates subsequent
supervised credit assignment through BP (Sec. 5.5). In the FNN case, similar effects can be
achieved through conceptually related AE stacks (Sec. 5.7, 5.15) and Deep Belief Networks
(DBNs, Sec. 5.15).
II LSTM-like networks (Sec. 5.13, 5.16, 5.17, 5.21–5.23) alleviate the problem through a special
architecture unaffected by it.
III Today’s GPU-based computers have a million times the computational power of desktop ma-
chines of the early 1990s. This allows for propagating errors a few layers further down within
reasonable time, even in traditional NNs (Sec. 5.18). That is basically what is winning many of
the image recognition competitions now (Sec. 5.19, 5.21, 5.22). (Although this does not really
overcome the problem in a fundamental way.)
IV Hessian-free optimization (Sec. 5.6.2) can alleviate the problem for FNNs (Møller, 1993;
Pearlmutter, 1994; Schraudolph, 2002; Martens, 2010) and RNNs (Martens and Sutskever,
2011) (Sec. 5.20).
V The space of NN weight matrices can also be searched without relying on error gradients,
thus avoiding the Fundamental Deep Learning Problem altogether. Random weight guessing
sometimes works better than more sophisticated methods (Hochreiter and Schmidhuber, 1996).
Certain more complex problems are better solved by using Universal Search (Levin, 1973b)
for weight matrix-computing programs written in a universal programming language (Schmid-
huber, 1997). Some are better solved by using linear methods to obtain optimal weights for
connections to output events (Sec. 2), and evolving weights of connections to other events—
this is called Evolino (Schmidhuber et al., 2007). Compare also related RNNs pre-trained by
certain UL rules (Steil, 2007), also in the case of spiking neurons (Yin et al., 2012; Klampfl and
Maass, 2013) (Sec. 5.26). Direct search methods are relevant not only for SL but also for more
general RL, and are discussed in more detail in Sec. 6.6.
greatly reduce problem depth. Compare earlier BP-based fine-tuning of NNs initialized by rules of
propositional logic (Shavlik and Towell, 1989) (Sec. 5.6.1).
There is a way of compressing higher levels down into lower levels, thus fully or partially col-
lapsing the RNN stack. The trick is to retrain a lower-level RNN to continually imitate (predict) the
hidden units of an already trained, slower, higher-level RNN (the “conscious” chunker), through ad-
ditional predictive output neurons (Schmidhuber, 1992b). This helps the lower RNN (the automatizer)
to develop appropriate, rarely changing memories that may bridge very long time lags. Again, this
procedure can greatly reduce the required depth of the BP process.
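The imitation trick can be sketched as follows, with the chunker and automatizer reduced to simple feedforward maps and a linear student for brevity; all sizes, rates, and the linear student are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of the collapsing idea: a lower-level net gets extra predictive
# output units trained to imitate the hidden activations of an already
# trained, frozen higher-level net (the "chunker").
n_in, n_teach = 6, 4
W_teacher = rng.normal(0, 0.5, (n_teach, n_in))      # frozen "chunker" weights
W_student = np.zeros((n_teach, n_in))                # trainable predictive outputs

X = rng.normal(size=(50, n_in))

def imitation_error(W):
    return float(np.mean([(W @ x - np.tanh(W_teacher @ x))**2 for x in X]))

err_before = imitation_error(W_student)
lr = 0.02
for _ in range(300):
    for x in X:
        target = np.tanh(W_teacher @ x)              # teacher's hidden code
        pred = W_student @ x                         # extra predictive outputs
        W_student -= lr * np.outer(pred - target, x) # train to imitate the teacher
err_after = imitation_error(W_student)
print(err_after < err_before)  # the student now approximates the teacher's code
```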
The 1991 system was a working Deep Learner in the modern post-2000 sense, and also a first
Neural Hierarchical Temporal Memory (HTM). It is conceptually similar to earlier AE hierarchies
(1987, Sec. 5.7) and later Deep Belief Networks (2006, Sec. 5.15), but more general in the sense
that it uses sequence-processing RNNs instead of FNNs with unchanging inputs. More recently,
well-known entrepreneurs (Hawkins and George, 2006; Kurzweil, 2012) also got interested in HTMs;
compare also hierarchical HMMs (e.g., Fine et al., 1998), as well as later UL-based recurrent sys-
tems (Klapper-Rybicka et al., 2001; Steil, 2007; Klampfl and Maass, 2013; Young et al., 2014).
Clockwork RNNs (Koutník et al., 2014) also consist of interacting RNN modules with different clock
rates, but do not use UL to set those rates. Stacks of RNNs were used in later work on SL with great
success, e.g., Sec. 5.13, 5.16, 5.17, 5.22.
5.11 1992: Max-Pooling (MP): Towards MPCNNs (Compare Sec. 5.16, 5.19)
The Neocognitron (Sec. 5.4) inspired the Cresceptron (Weng et al., 1992), which adapts its topol-
ogy during training (Sec. 5.6.3); compare the incrementally growing and shrinking GMDH networks
(1965, Sec. 5.3).
Instead of using alternative local subsampling or WTA methods (e.g., Fukushima, 1980; Schmid-
huber, 1989b; Maass, 2000; Fukushima, 2013a), the Cresceptron uses Max-Pooling (MP) layers. Here
a 2-dimensional layer or array of unit activations is partitioned into smaller rectangular arrays. Each
is replaced in a downsampling layer by the activation of its maximally active unit. A later, more com-
plex version of the Cresceptron (Weng et al., 1997) also included “blurring” layers to improve object
location tolerance.
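Max-pooling as described above amounts to a partition-and-maximize operation; a minimal sketch (assuming, for simplicity, array dimensions divisible by the pool size):

```python
import numpy as np

def max_pool(a, k=2):
    """Partition a 2-D activation array into k x k blocks and replace each
    block by the activation of its maximally active unit, as in the
    Cresceptron's MP layers."""
    h, w = a.shape
    return a.reshape(h // k, k, w // k, k).max(axis=(1, 3))

acts = np.array([[1, 2, 0, 1],
                 [3, 4, 1, 0],
                 [0, 1, 5, 6],
                 [2, 0, 7, 8]], dtype=float)
print(max_pool(acts).tolist())  # [[4.0, 1.0], [2.0, 8.0]]
```

Each 2x2 block of the 4x4 array is reduced to its maximum, halving both spatial dimensions.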
The neurophysiologically plausible topology of the feedforward HMAX model (Riesenhuber and
Poggio, 1999) is very similar to the one of the 1992 Cresceptron (and thus to the 1979 Neocognitron).
HMAX does not learn though. Its units have hand-crafted weights; biologically plausible learning
rules were later proposed for similar models (e.g., Serre et al., 2002; Teichmann et al., 2012).
When CNNs or convnets (Sec. 5.4, 5.8) are combined with MP, they become Cresceptron-like
or HMAX-like MPCNNs with alternating convolutional and max-pooling layers. Unlike Cresceptron
and HMAX, however, MPCNNs are trained by BP (Sec. 5.5, 5.16) (Ranzato et al., 2007). Advantages
of doing this were pointed out subsequently (Scherer et al., 2010). BP-trained MPCNNs have become
central to many modern, competition-winning, feedforward, visual Deep Learners (Sec. 5.17, 5.19–
5.23).
5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
Supervised Long Short-Term Memory (LSTM) RNN (Hochreiter and Schmidhuber, 1997b; Gers et al.,
2000; Pérez-Ortiz et al., 2003) could eventually perform feats similar to those of the deep RNN hierarchy
of 1991 (Sec. 5.10), overcoming the Fundamental Deep Learning Problem (Sec. 5.9) without any
unsupervised pre-training. LSTM could also learn DL tasks that lack local sequence predictability
(and are thus unlearnable by the partially unsupervised 1991 History Compressor, Sec. 5.10), dealing
with very deep problems (Sec. 3) (e.g., Gers et al., 2002).
The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels
(CECs). Each CEC uses the identity function as its activation function f and has a connection to itself
with a fixed weight of 1.0. Due to f’s constant derivative of 1.0, errors backpropagated through a CEC
cannot vanish or explode (Sec. 5.9) but stay as they are (unless they “flow out” of the CEC to other,
typically adaptive parts of the NN). CECs are connected to several nonlinear adaptive units (some
with multiplicative activation functions) needed for learning nonlinear behavior. Weight changes of
these units often profit from error signals propagated far back in time through CECs. CECs are
the main reason why LSTM nets can learn to discover the importance of (and memorize) events that
happened thousands of discrete time steps ago, while previous RNNs already failed at minimal
time lags of only 10 steps.
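Why a fixed self-weight of 1.0 with an identity activation preserves backpropagated errors can be seen in a few lines; the alternative weights and the time-lag length below are illustrative:

```python
def backprop_magnitude(w, deriv, T=500):
    """Error-signal magnitude after backpropagation through T time steps of a
    single self-connected unit: each step multiplies the signal by the
    recurrent weight times the activation function's derivative."""
    g = 1.0
    for _ in range(T):
        g *= w * deriv
    return g

# Self-weight 0.9 (identity activation): the error vanishes.
# Self-weight 1.1: the error explodes.
# CEC: identity activation (derivative 1.0) and fixed self-weight 1.0.
vanish = backprop_magnitude(0.9, 1.0)
explode = backprop_magnitude(1.1, 1.0)
cec = backprop_magnitude(1.0, 1.0)
print(vanish < 1e-20, explode > 1e20, cec == 1.0)  # True True True
```

Only the product w * f'(x) = 1.0 keeps the error signal constant over arbitrarily many steps.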
Many different LSTM variants and topologies are allowed. It is possible to evolve good problem-
specific topologies (Bayer et al., 2009). Some LSTM variants also use modifiable self-connections of
CECs (Gers and Schmidhuber, 2001).
To a certain extent, LSTM is biologically plausible (O’Reilly, 2003). LSTM learned to solve
many previously unlearnable DL tasks involving: Recognition of the temporal order of widely sep-
arated events in noisy input streams; Robust storage of high-precision real numbers across extended
time intervals; Arithmetic operations on continuous input streams; Extraction of information con-
veyed by the temporal distance between events; Recognition of temporally extended patterns in noisy
input sequences (Hochreiter and Schmidhuber, 1997b; Gers et al., 2000); Stable generation of pre-
cisely timed rhythms, as well as smooth and non-smooth periodic trajectories (Gers and Schmidhuber,
2000). LSTM clearly outperformed previous RNNs on tasks that require learning the rules of regu-
lar languages describable by deterministic Finite State Automata (FSAs) (Watrous and Kuhn, 1992;
Casey, 1996; Siegelmann, 1992; Blair and Pollack, 1997; Kalinke and Lehmann, 1998; Zeng et al.,
1994; Manolios and Fanelli, 1994; Omlin and Giles, 1996; Vahed and Omlin, 2004), both in terms of
reliability and speed.
LSTM also worked on tasks involving context free languages (CFLs) that cannot be represented
by HMMs or similar FSAs discussed in the RNN literature (Sun et al., 1993b; Wiles and Elman, 1995;
Andrews et al., 1995; Steijvers and Grunwald, 1996; Tonkes and Wiles, 1997; Rodriguez et al., 1999;
Rodriguez and Wiles, 1998). CFL recognition (Lee, 1996) requires the functional equivalent of a run-
time stack. Some previous RNNs failed to learn small CFL training sets (Rodriguez and Wiles, 1998).
Those that did not fail (Rodriguez et al., 1999; Bodén and Wiles, 2000) failed to extract the general rules,
and did not generalize well on substantially larger test sets. The same holds for context-sensitive languages
(CSLs) (e.g., Chalup and Blair, 2003). LSTM generalized well though, requiring only the 30 shortest
exemplars (n ≤ 10) of the CSL a^n b^n c^n to correctly predict the possible continuations of sequence
prefixes for n up to 1000 and more. A combination of a decoupled extended Kalman filter (Kalman,
1960; Williams, 1992b; Puskorius and Feldkamp, 1994; Feldkamp et al., 1998; Haykin, 2001; Feld-
kamp et al., 2003) and an LSTM RNN (Pérez-Ortiz et al., 2003) learned to deal correctly with values
of n up to 10 million and more. That is, after training the network was able to read sequences of
30,000,000 symbols and more, one symbol at a time, and finally detect the subtle differences be-
tween legal strings such as a^{10,000,000} b^{10,000,000} c^{10,000,000} and very similar but illegal strings such
as a^{10,000,000} b^{9,999,999} c^{10,000,000}. Compare also more recent RNN algorithms able to deal with long
time lags (Schäfer et al., 2006; Martens and Sutskever, 2011; Zimmermann et al., 2012; Koutník et al.,
2014).
Bi-directional RNNs (BRNNs) (Schuster and Paliwal, 1997; Schuster, 1999) are designed for in-
put sequences whose starts and ends are known in advance, such as spoken sentences to be labeled by
their phonemes; compare (Fukada et al., 1999). To take both past and future context of each sequence
element into account, one RNN processes the sequence from start to end, the other backwards from
end to start. At each time step their combined outputs predict the corresponding label (if there is
any). BRNNs were successfully applied to secondary protein structure prediction (Baldi et al., 1999).
DAG-RNNs (Baldi and Pollastri, 2003; Wu and Baldi, 2008) generalize BRNNs to multiple dimen-
sions. They learned to predict properties of small organic molecules (Lusci et al., 2013) as well as
protein contact maps (Tegge et al., 2009), also in conjunction with a growing deep FNN (Di Lena
et al., 2012) (Sec. 5.21). BRNNs and DAG-RNNs unfold their full potential when combined with the
LSTM concept (Graves and Schmidhuber, 2005, 2009; Graves et al., 2009).
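The two-pass scheme of BRNNs can be sketched as follows; the sizes and the plain tanh units are illustrative assumptions, not the architectures of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal BRNN sketch: one RNN scans the input sequence from start to end,
# a second scans it from end to start, and their hidden states are
# concatenated at every time step, so each per-step prediction can use
# both past and future context.
n_in, n_hid = 4, 5
Wf, Uf = rng.normal(0, 0.3, (n_hid, n_in)), rng.normal(0, 0.3, (n_hid, n_hid))
Wb, Ub = rng.normal(0, 0.3, (n_hid, n_in)), rng.normal(0, 0.3, (n_hid, n_hid))

def scan(xs, W, U):
    """Run a simple RNN over the sequence xs and return all hidden states."""
    h, states = np.zeros(n_hid), []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def brnn_features(xs):
    fwd = scan(xs, Wf, Uf)              # summarizes the past at each step
    bwd = scan(xs[::-1], Wb, Ub)[::-1]  # summarizes the future at each step
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

seq = [rng.normal(size=n_in) for _ in range(6)]
feats = brnn_features(seq)
print(len(feats), feats[0].shape)  # one 2 * n_hid feature vector per step
```

This is why BRNNs require the whole sequence in advance: the backward scan cannot start before the last element is known.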
Particularly successful in recent competitions are stacks (Sec. 5.10) of LSTM RNNs (Fernandez
et al., 2007; Graves and Schmidhuber, 2009) trained by Connectionist Temporal Classification
(CTC) (Graves et al., 2006), a gradient-based method for finding RNN weights that maximize
the probability of teacher-given label sequences, given (typically much longer and higher-
dimensional) streams of real-valued input vectors. CTC-LSTM performs simultaneous segmentation
(alignment) and recognition (Sec. 5.22).
In the early 2000s, speech recognition was dominated by HMMs combined with FNNs (e.g.,
Bourlard and Morgan, 1994). Nevertheless, when trained from scratch on utterances from the TIDIG-
ITS speech database, in 2003 LSTM already obtained results comparable to those of HMM-based
systems (Graves et al., 2003; Beringer et al., 2005; Graves et al., 2006). In 2007, LSTM outperformed
HMMs in keyword spotting tasks (Fernández et al., 2007); compare recent improvements (Indermuhle
et al., 2011; Wöllmer et al., 2013). By 2013, LSTM also achieved best known results on the famous
TIMIT phoneme recognition benchmark (Graves et al., 2013) (Sec. 5.22). Recently, LSTM RNN /
HMM hybrids obtained best known performance on medium-vocabulary (Geiger et al., 2014) and
large-vocabulary speech recognition (Sak et al., 2014a).
LSTM is also applicable to robot localization (Förster et al., 2007), robot control (Mayer et al.,
2008), online driver distraction detection (Wöllmer et al., 2011), and many other tasks. For example,
it helped to improve the state of the art in diverse applications such as protein analysis (Hochreiter
and Obermayer, 2005), handwriting recognition (Graves et al., 2008, 2009; Graves and Schmidhuber,
2009; Bluche et al., 2014), voice activity detection (Eyben et al., 2013), optical character recogni-
tion (Breuel et al., 2013), language identification (Gonzalez-Dominguez et al., 2014), prosody contour
prediction (Fernandez et al., 2014), audio onset detection (Marchi et al., 2014), text-to-speech syn-
thesis (Fan et al., 2014), social signal classification (Brueckner and Schulter, 2014), machine transla-
tion (Sutskever et al., 2014), and others.
RNNs can also be used for metalearning (Schmidhuber, 1987; Schaul and Schmidhuber, 2010;
Prokhorov et al., 2002), because they can in principle learn to run their own weight change algo-
rithm (Schmidhuber, 1993a). A successful metalearner (Hochreiter et al., 2001b) used an LSTM
RNN to quickly learn a learning algorithm for quadratic functions (compare Sec. 6.8).
Recently, LSTM RNNs won several international pattern recognition competitions and set nu-
merous benchmark records on large and complex data sets, e.g., Sec. 5.17, 5.21, 5.22. Gradient-
based LSTM is no panacea though—other methods sometimes outperformed it at least on certain
tasks (Jaeger, 2004; Schmidhuber et al., 2007; Martens and Sutskever, 2011; Pascanu et al., 2013b;
Koutník et al., 2014); compare Sec. 5.20.
5.14 2003: More Contest-Winning/Record-Setting NNs; Successful Deep NNs
In the decade around 2000, many practical and commercial pattern recognition applications were
dominated by non-neural machine learning methods such as Support Vector Machines (SVMs) (Vap-
nik, 1995; Schölkopf et al., 1998). Nevertheless, at least in certain domains, NNs outperformed other
techniques.
A Bayes NN (Neal, 2006) based on an ensemble (Breiman, 1996; Schapire, 1990; Wolpert, 1992;
Hashem and Schmeiser, 1992; Ueda, 2000; Dietterich, 2000a) of NNs won the NIPS 2003 Feature
Selection Challenge with secret test set (Neal and Zhang, 2006). The NN was not very deep though—
it had two hidden layers and thus rather shallow CAPs (Sec. 3) of depth 3.
Important for many present competition-winning pattern recognisers (Sec. 5.19, 5.21, 5.22) were
developments in the CNN department. A BP-trained (LeCun et al., 1989) CNN (Sec. 5.4, Sec. 5.8) set
a new MNIST record of 0.4% (Simard et al., 2003), using training pattern deformations (Baird, 1990)
but no unsupervised pre-training (Sec. 5.7, 5.10, 5.15). A standard BP net achieved 0.7% (Simard
et al., 2003). Again, the corresponding CAP depth was low. Compare further improvements in
Sec. 5.16, 5.18, 5.19.
Good image interpretation results (Behnke, 2003b) were achieved with rather deep NNs trained
by the BP variant R-prop (Riedmiller and Braun, 1993) (Sec. 5.6.2); here feedback through recurrent
connections helped to improve image interpretation. FNNs with CAP depth up to 6 were used to
successfully classify high-dimensional data (Vieira and Barradas, 2003).
Deep LSTM RNNs started to obtain their first speech recognition results comparable to those
of HMM-based systems (Graves et al., 2003); compare Sec. 5.13, 5.16, 5.21, 5.22.
Autoencoder (AE) stacks (Ballard, 1987) (Sec. 5.7) became a popular alternative way of pre-
training deep FNNs in unsupervised fashion, before fine-tuning (Sec. 5.6.1) them through BP
(Sec. 5.5) (Bengio et al., 2007; Vincent et al., 2008; Erhan et al., 2010). Sparse coding (Sec. 5.6.4)
was formulated as a combination of convex optimization problems (Lee et al., 2007a). Recent surveys
of stacked RBM and AE methods focus on post-2006 developments (Bengio, 2009; Arel et al., 2010).
Unsupervised DBNs and AE stacks are conceptually similar to, but in a certain sense less general
than, the unsupervised RNN stack-based History Compressor of 1991 (Sec. 5.10), which can process
and re-encode not only stationary input patterns, but entire pattern sequences.
5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
Stacks of LSTM RNNs trained by CTC (Sec. 5.13, 5.16) became the first RNNs to win official interna-
tional pattern recognition contests (with secret test sets known only to the organisers). More precisely,
three connected handwriting competitions at ICDAR 2009 in three different languages (French, Arabic,
Farsi) were won by deep LSTM RNNs without any a priori linguistic knowledge, performing simul-
taneous segmentation and recognition. Compare (Graves and Schmidhuber, 2005; Graves et al., 2009;
Schmidhuber et al., 2011; Graves et al., 2013; Graves and Jaitly, 2014) (Sec. 5.22).
To detect human actions in surveillance videos, a 3-dimensional CNN (e.g., Jain and Seung, 2009;
Prokhorov, 2010), combined with SVMs, was part of a larger system (Yang et al., 2009) using a bag
of features approach (Nowak et al., 2006) to extract regions of interest. The system won three 2009
TRECVID competitions. These were possibly the first official international contests won with the
help of (MP)CNNs (Sec. 5.16). An improved version of the method was published later (Ji et al.,
2013).
2009 also saw a GPU-DBN implementation (Raina et al., 2009) orders of magnitude faster than
previous CPU-DBNs (see Sec. 5.15); see also (Coates et al., 2013). The Convolutional DBN (Lee
et al., 2009a) (with a probabilistic variant of MP, Sec. 5.11) combines ideas from CNNs and DBNs,
and was successfully applied to audio classification (Lee et al., 2009b).
This illustrates a general problem with benchmarks whose test sets are public, or at least can be
probed to some extent: competing teams tend to overfit on the test set even when it cannot be directly
used for training, only for evaluation.
In 1997 many thought it a big deal that human chess world champion Kasparov was beaten by
an IBM computer. But back then computers could not at all compete with little kids in visual pat-
tern recognition, which seems much harder than chess from a computational perspective. Of course,
the traffic sign domain is highly restricted, and kids are still much better general pattern recognis-
ers. Nevertheless, by 2011, deep NNs could already learn to rival them in important limited visual
domains.
An ensemble of GPU-MPCNNs was also the first method to achieve human-competitive perfor-
mance (around 0.2%) on MNIST (Ciresan et al., 2012c). This represented a dramatic improvement,
since by then the MNIST record had hovered around 0.4% for almost a decade (Sec. 5.14, 5.16, 5.18).
Given all the prior work on (MP)CNNs (Sec. 5.4, 5.8, 5.11, 5.16) and GPU-CNNs (Sec. 5.16),
GPU-MPCNNs are not a breakthrough in the scientific sense. But they are a commercially relevant
breakthrough in efficient coding that has made a difference in several contests since 2011. Today, most
feedforward competition-winning deep NNs are (ensembles of) GPU-MPCNNs (Sec. 5.21–5.23).
spends over 10% of GDP on healthcare (> 6 trillion USD per year), much of it on medical diagnosis
through expensive experts. Partial automation of this could not only save lots of money, but also make
expert diagnostics accessible to many who currently cannot afford it. It is gratifying to observe that
today deep NNs may actually help to improve healthcare and perhaps save human lives.
2012 also saw the first pure image segmentation contest won by DL (Ciresan et al., 2012a), again
through a GPU-MPCNN ensemble (Segmentation of Neuronal Structures in EM Stacks Challenge,
2012). EM stacks are relevant for the recently approved huge brain projects in Europe and the
US (e.g., Markram, 2012). Given electron microscopy images of stacks of thin slices of animal
brains, the goal is to build a detailed 3D model of the brain’s neurons and dendrites. But human
experts need many hours and days and weeks to annotate the images: Which parts depict neuronal
membranes? Which parts are irrelevant background? This needs to be automated (e.g., Turaga et al.,
2010). Deep Multi-Column GPU-MPCNNs learned to solve this task through experience with many
training images, and won the contest on all three evaluation metrics by a large margin, with superhu-
man performance in terms of pixel error.
Both object detection (Ciresan et al., 2013) and image segmentation (Ciresan et al., 2012a) profit
from fast MPCNN-based image scans that avoid redundant computations. Recent MPCNN scanners
speed up naive implementations by up to three orders of magnitude (Masci et al., 2013; Giusti et al.,
2013); compare earlier efficient methods for CNNs without MP (Vaillant et al., 1994).
Also in 2012, a system consisting of growing deep FNNs and 2D-BRNNs (Di Lena et al., 2012)
won the CASP 2012 contest on protein contact map prediction. On the IAM-OnDoDB benchmark,
LSTM RNNs (Sec. 5.13) outperformed all other methods (HMMs, SVMs) on online mode detec-
tion (Otte et al., 2012; Indermuhle et al., 2012) and keyword spotting (Indermuhle et al., 2011). On the
long time lag problem of language modelling, LSTM RNNs outperformed all statistical approaches
on the IAM-DB benchmark (Frinken et al., 2012); improved results were later obtained through a
combination of NNs and HMMs (Zamora-Martínez et al., 2014). Compare earlier RNNs for object
recognition through iterative image interpretation (Behnke and Rojas, 1998; Behnke, 2002, 2003b);
see also more recent publications (Wyatte et al., 2012; O'Reilly et al., 2013) extending work on bio-
logically plausible learning rules for RNNs (O’Reilly, 1996).
system; this system beat the previous state of the art in English to French translation (Sutskever et al.,
2014).
A new record on the ICDAR Chinese handwriting recognition benchmark (over 3700 classes)
was set on a desktop machine by an ensemble of GPU-MPCNNs (Sec. 5.19) with almost human
performance (Ciresan and Schmidhuber, 2013); compare (Yin et al., 2013).
The MICCAI 2013 Grand Challenge on Mitosis Detection (Veta et al., 2013) also was won by an
object-detecting GPU-MPCNN ensemble (Ciresan et al., 2013). Its data set was even larger and more
challenging than the one of ICPR 2012 (Sec. 5.21): a real-world dataset including many ambiguous
cases and frequently encountered problems such as imperfect slide staining.
Three 2D-CNNs (with mean-pooling instead of MP, Sec. 5.11) observing three orthogonal projec-
tions of 3D images outperformed traditional full 3D methods on the task of segmenting tibial cartilage
in low field knee MRI scans (Prasoon et al., 2013).
Deep GPU-MPCNNs (Sec. 5.19) also helped to achieve new best results on important bench-
marks of the computer vision community: ImageNet classification (Zeiler and Fergus, 2013; Szegedy
et al., 2014) and—in conjunction with traditional approaches—PASCAL object detection (Girshick
et al., 2013). They also learned to predict bounding box coordinates of objects in the Imagenet
2013 database, and obtained state-of-the-art results on tasks of localization and detection (Sermanet
et al., 2013). GPU-MPCNNs also helped to recognise multi-digit numbers in Google Street View
images (Goodfellow et al., 2014b), where part of the NN was trained to count visible digits; compare
earlier work on detecting “numerosity” through DBNs (Stoianov and Zorzi, 2012). This system also
excelled at recognising distorted synthetic text in reCAPTCHA puzzles. Other successful CNN appli-
cations include scene parsing (Farabet et al., 2013), object detection (Szegedy et al., 2013), shadow
detection (Khan et al., 2014), video classification (Karpathy et al., 2014), and Alzheimer's disease
neuroimaging (Li et al., 2014).
Additional contests are mentioned in the web pages of the Swiss AI Lab IDSIA, the University of
Toronto, New York University, and the University of Montreal.
5.24 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
DBN training (Sec. 5.15) can be improved through gradient enhancements and automatic learning rate
adjustments during stochastic gradient descent (Cho et al., 2013; Cho, 2014), and through Tikhonov-
type (Tikhonov et al., 1977) regularization of RBMs (Cho et al., 2012). Contractive AEs (Rifai et al.,
2011) discourage hidden unit perturbations in response to input perturbations, similar to how FMS
(Sec. 5.6.3) for LOCOCODE AEs (Sec. 5.6.4) discourages output perturbations in response to weight
perturbations.
Hierarchical CNNs in a Neural Abstraction Pyramid (e.g., Behnke, 2003b, 2005) were trained to
reconstruct images corrupted by structured noise (Behnke, 2001), thus enforcing increasingly abstract
image representations in deeper and deeper layers. Denoising AEs later used a similar procedure (Vin-
cent et al., 2008).
Dropout (Hinton et al., 2012b; Ba and Frey, 2013) removes units from NNs during training to
improve generalisation. Some view it as an ensemble method that trains multiple data models simul-
taneously (Baldi and Sadowski, 2014). Under certain circumstances, it could also be viewed as a form
of training set augmentation: effectively, more and more informative complex features are removed
from the training data. Compare dropout for RNNs (Pham et al., 2013; Pachitariu and Sahani, 2013;
Pascanu et al., 2013a). A deterministic approximation coined fast dropout (Wang and Manning, 2013)
can lead to faster learning and evaluation and was adapted for RNNs (Bayer et al., 2013). Dropout is
closely related to older, biologically plausible techniques for adding noise to neurons or synapses dur-
ing training (e.g., Hanson, 1990; Murray and Edwards, 1993; Schuster, 1992; Nadal and Parga, 1994;
Jim et al., 1995; An, 1996), which in turn are closely related to finding perturbation-resistant low-
complexity NNs, e.g., through FMS (Sec. 5.6.3). MDL-based stochastic variational methods (Graves,
2011) are also related to FMS. They are useful for RNNs, where classic regularizers such as weight
decay (Sec. 5.6.3) represent a bias towards limited memory capacity (e.g., Pascanu et al., 2013b).
Compare recent work on variational recurrent AEs (Bayer and Osendorfer, 2014).
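The basic dropout operation described above can be sketched as follows; the "inverted" rescaling of the surviving units is one common implementation choice, not prescribed by the cited papers:

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout(acts, p_drop=0.5, train=True):
    """Inverted-dropout sketch: during training each unit is removed with
    probability p_drop and the survivors are rescaled by 1/(1 - p_drop),
    so activations need no rescaling at test time."""
    if not train:
        return acts
    mask = rng.random(acts.shape) >= p_drop
    return acts * mask / (1.0 - p_drop)

h = np.ones(10)
print(dropout(h))               # roughly half the units zeroed, survivors = 2.0
print(dropout(h, train=False))  # unchanged at test time
```

A fresh random mask per training pattern means each pattern is effectively processed by a different thinned network, in line with the ensemble view mentioned above.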
The activation function f of Rectified Linear Units (ReLUs) is f(x) = x for x > 0, f(x) = 0
otherwise—compare the old concept of half-wave rectified units (Malik and Perona, 1990). ReLU
NNs are useful for RBMs (Nair and Hinton, 2010; Maas et al., 2013), outperformed sigmoidal ac-
tivation functions in deep NNs (Glorot et al., 2011), and helped to obtain best results on several
benchmark problems across multiple domains (e.g., Krizhevsky et al., 2012; Dahl et al., 2013).
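The ReLU definition above in code:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: f(x) = x for x > 0, f(x) = 0 otherwise."""
    return np.maximum(x, 0.0)

out = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
print(out.tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
```

Its derivative is exactly 1 for positive inputs and 0 otherwise, so backpropagated errors through active units are neither squashed nor amplified.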
NNs with competing linear units tend to outperform those with non-competing nonlinear units,
and avoid catastrophic forgetting through BP when training sets change over time (Srivastava et al.,
2013). In this context, choosing a learning algorithm may be more important than choosing activation
functions (Goodfellow et al., 2014a). Maxout NNs (Goodfellow et al., 2013) combine competitive
interactions and dropout (see above) to achieve excellent results on certain benchmarks. Compare
early RNNs with competing units for SL and RL (Schmidhuber, 1989b). To address overfitting,
instead of depending on pre-wired regularizers and hyper-parameters (Hertz et al., 1991; Bishop,
2006), self-delimiting RNNs (SLIM NNs) with competing units (Schmidhuber, 2012) can in principle
learn to select their own runtime and their own numbers of effective free parameters, thus learning
their own computable regularisers (Sec. 4.4, 5.6.3), becoming fast and slim when necessary. One may
penalize the task-specific total length of connections (e.g., Legenstein and Maass, 2002; Schmidhuber,
2012, 2013b; Clune et al., 2013) and communication costs of SLIM NNs implemented on the 3-
dimensional brain-like multi-processor hardware to be expected in the future.
RmsProp (Tieleman and Hinton, 2012; Schaul et al., 2013) can speed up first order gradient de-
scent methods (Sec. 5.5, 5.6.2); compare vario-η (Neuneier and Zimmermann, 1996), Adagrad (Duchi
et al., 2011) and Adadelta (Zeiler, 2012). DL in NNs can also be improved by transforming hidden
unit activations such that they have zero output and slope on average (Raiko et al., 2012). Many ad-
ditional, older tricks (Sec. 5.6.2, 5.6.3) should also be applicable to today’s deep NNs; compare (Orr
and Müller, 1998; Montavon et al., 2012).
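The RmsProp idea (divide each raw gradient by a running root-mean-square of its recent magnitudes, so the effective step size adapts per parameter) can be sketched as follows; the hyper-parameter values are common defaults, not taken from the survey:

```python
def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RmsProp update for a single parameter: keep an exponential
    moving average of squared gradients (cache) and divide the gradient
    by its square root before taking the step."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache

# Toy use: descend f(w) = w^2 from w = 5.0 using the gradient f'(w) = 2w.
w, cache = 5.0, 0.0
for _ in range(2000):
    w, cache = rmsprop_step(w, 2 * w, cache)
print(abs(w) < 0.5)  # w has been driven close to the minimum at 0
```

Because the update is normalized by recent gradient magnitudes, steps stay of roughly size lr regardless of the raw gradient scale.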
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
So far we have focused on Deep Learning (DL) in supervised or unsupervised NNs. Such NNs learn
to perceive / encode / predict / classify patterns or pattern sequences, but they do not learn to act
in the more general sense of Reinforcement Learning (RL) in unknown environments (see surveys,
e.g., Kaelbling et al., 1996; Sutton and Barto, 1998; Wiering and van Otterlo, 2012). Here we add a
discussion of DL FNNs and RNNs for RL. It will be shorter than the discussion of FNNs and RNNs
for SL and UL (Sec. 5), reflecting the current size of the various fields.
Without a teacher, solely from occasional real-valued pain and pleasure signals, RL agents must
discover how to interact with a dynamic, initially unknown environment to maximize their expected
cumulative reward signals (Sec. 2). There may be arbitrary, a priori unknown delays between actions
and perceivable consequences. The problem is as hard as any problem of computer science, since any
task with a computable description can be formulated in the RL framework (e.g., Hutter, 2005). For
example, an answer to the famous question of whether P = NP (Levin, 1973b; Cook, 1971) would
also set limits for what is achievable by general RL. Compare more specific limitations, e.g., (Blondel
and Tsitsiklis, 2000; Madani et al., 2003; Vlassis et al., 2012). The following subsections mostly focus
on certain obvious intersections between DL and RL—they cannot serve as a general RL survey.
Typically M is not given in advance. Then an essential question is: which experiments should C
conduct to quickly improve M ? The Formal Theory of Fun and Creativity (e.g., Schmidhuber, 2006a,
2013b) formalizes driving forces and value functions behind such curious and exploratory behavior:
A measure of the learning progress of M becomes the intrinsic reward of C (Schmidhuber, 1991a);
compare (Singh et al., 2005; Oudeyer et al., 2013). This motivates C to create action sequences
(experiments) such that M makes quick progress.
6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
The classical approach to RL (Samuel, 1959; Bertsekas and Tsitsiklis, 1996) makes the simplifying
assumption of Markov Decision Processes (MDPs): the current input of the RL agent conveys all
information necessary to compute an optimal next output event or decision. This allows for greatly
reducing CAP depth in RL NNs (Sec. 3, 6.1) by using the Dynamic Programming (DP) trick (Bellman,
1957). The latter is often explained in a probabilistic framework (e.g., Sutton and Barto, 1998), but
its basic idea can already be conveyed in a deterministic setting. For simplicity, using the notation
of Sec. 2, let input events xt encode the entire current state of the environment, including a real-
valued reward rt (no need to introduce additional vector-valued notation, since real values can encode
arbitrary vectors of real values). The original RL goal (find weights that maximize the sum of all
rewards of an episode) is replaced by an equivalent set of alternative goals set by a real-valued value
function V defined on input events. Consider any two subsequent input events xt, xk. Recursively
define V(xt) = rt + V(xk), where V(xk) = rk if xk is the last input event. Now search for weights
that maximize the V of all input events, by causing appropriate output events or actions.
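The recursion above can be written out directly; the following sketch (with an episode's rewards as a plain Python list, an illustrative simplification) computes V backwards over one episode:

```python
# Minimal sketch of the deterministic value recursion from the text:
# V(xt) = rt + V(xk), with V(xk) = rk for the last input event.

def values(rewards):
    """Return V for every input event, computed backwards (the DP trick)."""
    V = [0.0] * len(rewards)
    V[-1] = rewards[-1]                # V of the last event is its reward
    for t in range(len(rewards) - 2, -1, -1):
        V[t] = rewards[t] + V[t + 1]   # V(xt) = rt + V(x_{t+1})
    return V

# Each V(xt) equals the sum of all rewards from t to the end, so maximizing
# V at every event is equivalent to the original episodic RL goal.
assert values([0.0, 0.0, 1.0]) == [1.0, 1.0, 1.0]
assert values([1.0, -1.0, 2.0]) == [2.0, 1.0, 2.0]
```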
Due to the Markov assumption, an FNN suffices to implement the policy that maps input to out-
put events. Relevant CAPs are not deeper than this FNN. V itself is often modeled by a separate
FNN (also yielding typically short CAPs) learning to approximate V(xt) only from local information
rt, V(xk).
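The local learning scheme just described can be sketched as a bootstrapped update; here a lookup table stands in for the separate FNN (an illustrative assumption):

```python
# Sketch of learning V from local information only: the estimate for
# state t is nudged toward the bootstrap target rt + V(xk).

def td_update(V, t, k, r_t, lr=0.5):
    """One bootstrapped update of the value estimate for state t."""
    target = r_t + V[k]        # local information: rt and the estimate V(xk)
    V[t] += lr * (target - V[t])

V = {0: 0.0, 1: 0.0, 2: 0.0}
V[2] = 1.0                     # suppose the terminal value is known
for _ in range(50):            # repeated local updates propagate value backwards
    td_update(V, 1, 2, r_t=0.0)
    td_update(V, 0, 1, r_t=0.0)

assert abs(V[1] - 1.0) < 1e-6 and abs(V[0] - 1.0) < 1e-6
```

No update ever looks at the whole episode; value information reaches early states only through the chain of local bootstrap targets.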
Many variants of traditional RL exist (e.g., Barto et al., 1983; Watkins, 1989; Watkins and Dayan,
1992; Moore and Atkeson, 1993; Schwartz, 1993; Rummery and Niranjan, 1994; Singh, 1994; Baird,
1995; Kaelbling et al., 1995; Peng and Williams, 1996; Mahadevan, 1996; Tsitsiklis and van Roy,
1996; Bradtke et al., 1996; Santamaría et al., 1997; Prokhorov and Wunsch, 1997; Sutton and Barto,
1998; Wiering and Schmidhuber, 1998b; Baird and Moore, 1999; Meuleau et al., 1999; Morimoto and
Doya, 2000; Bertsekas, 2001; Brafman and Tennenholtz, 2002; Abounadi et al., 2002; Lagoudakis and
Parr, 2003; Sutton et al., 2008; Maei and Sutton, 2010; van Hasselt, 2012). Most are formulated in
a probabilistic framework, and evaluate pairs of input and output (action) events (instead of input
events only). To facilitate certain mathematical derivations, some discount delayed rewards, but such
distortions of the original RL problem are problematic.
Perhaps the best-known RL NN is the world-class RL backgammon player (Tesauro, 1994),
which achieved the level of human world champions by playing against itself. Its nonlinear, rather
shallow FNN maps a large but finite number of discrete board states to values. More recently, a
rather deep GPU-CNN was used in a traditional RL framework to play several Atari 2600 computer
games directly from 84x84 pixel 60 Hz video input (Mnih et al., 2013), using experience replay (Lin,
1993), extending previous work on Neural Fitted Q-Learning (NFQ) (Riedmiller, 2005). Even bet-
ter results are achieved by using (slow) Monte Carlo tree planning to train comparatively fast deep
NNs (Guo et al., 2014). Compare RBM-based RL (Sallans and Hinton, 2004) with high-dimensional
inputs (Elfwing et al., 2010), earlier RL Atari players (Grüttner et al., 2010), and an earlier, raw video-
based RL NN for computer games (Koutník et al., 2013) trained by Indirect Policy Search (Sec. 6.7).
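Experience replay, as mentioned above, can be illustrated with a tabular stand-in for the deep CNN (a simplified sketch, not the cited system): transitions are stored in a buffer and replayed in random minibatches, breaking the strong correlation between consecutive frames.

```python
import random

def q_replay_step(Q, buffer, batch_size=4, lr=0.5, gamma=0.9):
    """Sample past transitions and apply a Q-learning update to each."""
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    for (s, a, r, s_next, done) in batch:
        target = r if done else r + gamma * max(Q[s_next].values())
        Q[s][a] += lr * (target - Q[s][a])

# Two-state toy problem: action 1 in state 0 ends the episode with reward 1,
# action 0 loops back to state 0 with no reward.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
buffer = [(0, 1, 1.0, 1, True), (0, 0, 0.0, 0, False)]
random.seed(0)
for _ in range(100):
    q_replay_step(Q, buffer)

assert Q[0][1] > Q[0][0]   # replay has learned the rewarding action
```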
6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
The Markov assumption (Sec. 6.2) is often unrealistic. We cannot directly perceive what is behind our
back, let alone the current state of the entire universe. However, memories of previous events can help
to deal with partially observable Markov decision problems (POMDPs) (e.g., Schmidhuber, 1990d,
1991c; Ring, 1991, 1993, 1994; Williams, 1992a; Lin, 1993; Teller, 1994; Kaelbling et al., 1995;
Littman et al., 1995; Boutilier and Poole, 1996; Jaakkola et al., 1995; McCallum, 1996; Kimura et al.,
1997; Wiering and Schmidhuber, 1996, 1998a; Otsuka et al., 2010). A naive way of implementing
memories without leaving the MDP framework (Sec. 6.2) would be to simply consider a possibly huge
state space, namely, the set of all possible observation histories and their prefixes. A more realistic
way is to use function approximators such as RNNs that produce compact state features as a function
of the entire history seen so far. Generally speaking, POMDP RL often uses DL RNNs to learn which
events to memorize and which to ignore. Three basic alternatives are:
1. Use an RNN as a value function mapping arbitrary event histories to values (e.g., Schmidhuber,
1990b, 1991c; Lin, 1993; Bakker, 2002). For example, deep LSTM RNNs were used in this
way for RL robots (Bakker et al., 2003).
2. Use an RNN controller in conjunction with a second RNN as predictive world model, to obtain
a combined RNN with deep CAPs—see Sec. 6.1.
3. Use an RNN for RL by Direct Search (Sec. 6.6) or Indirect Search (Sec. 6.7) in weight space.
In general, however, POMDPs may imply greatly increased CAP depth.
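Alternative 1 can be illustrated in miniature. The sketch below uses a tiny hand-rolled RNN with fixed, illustrative weights (no learning), just to show that a recurrent state can make the value depend on events that no FNN of the current observation alone could distinguish:

```python
import math

def rnn_state(history, w_in=1.0, w_rec=0.5):
    """Fold an arbitrary observation history into one recurrent state."""
    h = 0.0
    for x in history:
        h = math.tanh(w_in * x + w_rec * h)
    return h

def value(history):
    return 2.0 * rnn_state(history)   # linear readout as the value head

# Same current observation (0.0), different pasts: an FNN of the last
# observation alone would be forced to output the same value for both.
v_after_cue = value([1.0, 0.0, 0.0])
v_no_cue    = value([0.0, 0.0, 0.0])
assert v_after_cue != v_no_cue
```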
6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
Multiple learnable levels of abstraction (Fu, 1977; Lenat and Brown, 1984; Ring, 1994; Bengio
et al., 2013; Deng and Yu, 2014) seem as important for RL as for SL. Work on NN-based Hierar-
chical RL (HRL) has been published since the early 1990s. In particular, gradient-based subgoal
discovery with FNNs or RNNs decomposes RL tasks into subtasks for RL submodules (Schmid-
huber, 1991b; Schmidhuber and Wahnsiedler, 1992). Numerous alternative HRL techniques have
been proposed (e.g., Ring, 1991, 1994; Jameson, 1991; Tenenberg et al., 1993; Weiss, 1994; Moore
and Atkeson, 1995; Precup et al., 1998; Dietterich, 2000b; Menache et al., 2002; Doya et al., 2002;
Ghavamzadeh and Mahadevan, 2003; Barto and Mahadevan, 2003; Samejima et al., 2003; Bakker and
Schmidhuber, 2004; Whiteson et al., 2005; Simsek and Barto, 2008). While HRL frameworks such as
Feudal RL (Dayan and Hinton, 1993) and options (Sutton et al., 1999b; Barto et al., 2004; Singh et al.,
2005) do not directly address the problem of automatic subgoal discovery, HQ-Learning (Wiering and
Schmidhuber, 1998a) automatically decomposes POMDPs (Sec. 6.3) into sequences of simpler sub-
tasks that can be solved by memoryless policies learnable by reactive sub-agents. Recent HRL orga-
nizes potentially deep NN-based RL sub-modules into self-organizing, 2-dimensional motor control
maps (Ring et al., 2011) inspired by neurophysiological findings (Graziano, 2009).
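The subgoal-based decomposition idea behind HQ-Learning can be caricatured as follows; this is a hypothetical toy (hand-coded sub-policies and subgoals, no learning), not the cited algorithm:

```python
# A POMDP that no single memoryless policy can solve is decomposed into a
# sequence of subtasks, each solved by a purely reactive sub-agent.

def observe(pos, n):
    return "left" if pos == 0 else "right" if pos == n - 1 else "mid"

def run(subpolicies, subgoals, n=5, max_steps=50):
    """Execute reactive sub-policies in order; advance on each subgoal."""
    pos, stage, trace = 0, 0, []
    for _ in range(max_steps):
        if stage < len(subgoals) and observe(pos, n) == subgoals[stage]:
            stage += 1                  # subgoal reached: hand over control
        if stage == len(subpolicies):
            return trace                # all subtasks done
        pos += subpolicies[stage][observe(pos, n)]   # memoryless action
        trace.append(pos)
    return trace

# Task: walk to the right end of the corridor, then back to the left end.
# In "mid" the two phases look identical, so one memoryless policy must
# fail; two sub-policies with a subgoal in between succeed.
go_right = {"left": +1, "mid": +1, "right": 0}
go_left  = {"left": 0,  "mid": -1, "right": -1}
trace = run([go_right, go_left], subgoals=["right", "left"])
assert max(trace) == 4 and trace[-1] == 0   # reached far end, then returned
```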
synapses or weights (Gomez et al., 2008); benefits of this were shown on difficult nonlinear POMDP
benchmarks.
Natural Evolution Strategies (NES) (Wierstra et al., 2008; Glasmachers et al., 2010; Sun et al.,
2009, 2013) link policy gradient methods and evolutionary approaches through the concept of Natural
Gradients (Amari, 1998). RNN evolution may also help to improve SL for deep RNNs through
Evolino (Schmidhuber et al., 2007) (Sec. 5.9).
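The search-gradient idea underlying NES can be sketched as follows. This is a simplified version: a plain gradient on a Gaussian search distribution via the log-likelihood trick, omitting the Fisher-matrix correction that makes the gradient "natural":

```python
import random, statistics

def f(x):
    return -(x - 3.0) ** 2          # toy fitness, maximum at x = 3

def search_gradient_step(mu, sigma, pop=50, lr=0.1):
    """Move the mean of the Gaussian search distribution uphill in fitness."""
    samples = [random.gauss(mu, sigma) for _ in range(pop)]
    fits = [f(x) for x in samples]
    base = statistics.mean(fits)    # baseline reduces gradient variance
    # log-likelihood trick: d/dmu log N(x; mu, sigma) = (x - mu) / sigma^2
    grad = statistics.mean((fi - base) * (x - mu) / sigma ** 2
                           for fi, x in zip(fits, samples))
    return mu + lr * grad

random.seed(1)
mu = -5.0
for _ in range(200):
    mu = search_gradient_step(mu, sigma=1.0)
assert abs(mu - 3.0) < 0.5          # search distribution found the optimum
```

Only fitness evaluations are needed, never a gradient of f itself, which is what links this family of methods to policy gradients on one side and evolution strategies on the other.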
6.8 Universal RL
General purpose learning algorithms may improve themselves in open-ended fashion and
environment-specific ways in a lifelong learning context (Schmidhuber, 1987; Schmidhuber et al.,
1997b,a; Schaul and Schmidhuber, 2010). The most general type of RL is constrained only by the
fundamental limitations of computability identified by the founders of theoretical computer science
(Gödel, 1931; Church, 1936; Turing, 1936; Post, 1936). Remarkably, there exist blueprints of univer-
sal problem solvers or universal RL machines for unlimited problem depth that are time-optimal in
various theoretical senses (Hutter, 2005, 2002; Schmidhuber, 2002, 2006b). In particular, the Gödel
Machine can be implemented on general computers such as RNNs and may improve any part of its
software (including the learning algorithm itself) in a way that is provably time-optimal in a certain
sense (Schmidhuber, 2006b). It can be initialized by an asymptotically optimal meta-method (Hut-
ter, 2002) (also applicable to RNNs) which will solve any well-defined problem as quickly as the
unknown fastest way of solving it, save for an additive constant overhead that becomes negligible as
problem size grows. Note that most problems are large; only few are small. AI and DL researchers are
still in business because many are interested in problems so small that it is worth trying to reduce the
overhead through less general methods, including heuristics. Here I won’t further discuss universal
RL methods, which go beyond what is usually called DL.
imization, by finding simple (highly generalizing) problem solutions that require few active neurons
and few, mostly short connections.
The more distant future may belong to general purpose learning algorithms that improve them-
selves in provably optimal ways (Sec. 6.8), but these are not yet practical or commercially relevant.
8 Acknowledgments
Since 16 April 2014, drafts of this paper have undergone massive open online peer review through
public mailing lists including connectionists@cs.cmu.edu, ml-news@googlegroups.com, comp-neuro@neuroinf.org,
genetic programming@yahoogroups.com, rl-list@googlegroups.com, imageworld@diku.dk,
and the Google+ machine learning forum. Thanks to numerous NN / DL experts for valuable
comments. Thanks to SNF, DFG, and the European Commission for partially funding my DL re-
search group in the past quarter-century. The contents of this paper may be used for educational and
non-commercial purposes, including articles for Wikipedia and similar sites.
References
Aberdeen, D. (2003). Policy-Gradient Algorithms for Partially Observable Markov Decision Pro-
cesses. PhD thesis, Australian National University.
Abounadi, J., Bertsekas, D., and Borkar, V. S. (2002). Learning algorithms for Markov decision
processes with average cost. SIAM Journal on Control and Optimization, 40(3):681–698.
Akaike, H. (1970). Statistical predictor identification. Ann. Inst. Statist. Math., 22:203–217.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In
Second Intl. Symposium on Information Theory, pages 267–281. Akademiai Kiado.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19(6):716–723.
Amari, S., Cichocki, A., and Yang, H. (1996). A new learning algorithm for blind signal separation.
In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information
Processing Systems (NIPS), volume 8. The MIT Press.
Amari, S. and Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion.
Neural Computation, 5(1):140–153.
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–
276.
Amit, D. J. and Brunel, N. (1997). Dynamics of a recurrent network of spiking neurons before and
following learning. Network: Computation in Neural Systems, 8(4):373–404.
An, G. (1996). The effects of adding noise during backpropagation training on a generalization
performance. Neural Computation, 8(3):643–674.
Andrade, M. A., Chacon, P., Merelo, J. J., and Moran, F. (1993). Evaluation of secondary structure
of proteins from UV circular dichroism spectra using an unsupervised learning neural network.
Protein Engineering, 6(4):383–390.
Andrews, R., Diederich, J., and Tickle, A. B. (1995). Survey and critique of techniques for extracting
rules from trained artificial neural networks. Knowledge-Based Systems, 8(6):373–389.
Anguita, D. and Gomes, B. A. (1996). Mixing floating- and fixed-point formats for neural network
learning on neuroprocessors. Microprocessing and Microprogramming, 41(10):757–769.
Anguita, D., Parodi, G., and Zunino, R. (1994). An efficient implementation of BP on RISC-based
workstations. Neurocomputing, 6(1):57–65.
Arel, I., Rose, D. C., and Karnowski, T. P. (2010). Deep machine learning – a new frontier in artificial
intelligence research. Computational Intelligence Magazine, IEEE, 5(4):13–18.
Ash, T. (1989). Dynamic node creation in backpropagation neural networks. Connection Science,
1(4):365–375.
Atick, J. J., Li, Z., and Redlich, A. N. (1992). Understanding retinal color coding from first principles.
Neural Computation, 4:559–572.
Atiya, A. F. and Parlos, A. G. (2000). New results on recurrent network training: unifying the algo-
rithms and accelerating convergence. IEEE Transactions on Neural Networks, 11(3):697–709.
Ba, J. and Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural
Information Processing Systems (NIPS), pages 3084–3092.
Baird, H. (1990). Document image defect models. In Proceedings, IAPR Workshop on Syntactic and
Structural Pattern Recognition, Murray Hill, NJ.
Baird, L. and Moore, A. W. (1999). Gradient descent for general reinforcement learning. In Advances
in neural information processing systems 12 (NIPS), pages 968–974. MIT Press.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In
International Conference on Machine Learning, pages 30–37.
Bakker, B. (2002). Reinforcement learning with Long Short-Term Memory. In Dietterich, T. G.,
Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14,
pages 1475–1482. MIT Press, Cambridge, MA.
Bakker, B. and Schmidhuber, J. (2004). Hierarchical reinforcement learning based on subgoal dis-
covery and subpolicy specialization. In Proc. 8th Conference on Intelligent
Autonomous Systems IAS-8, pages 438–445, Amsterdam, NL. IOS Press.
Bakker, B., Zhumatiy, V., Gruener, G., and Schmidhuber, J. (2003). A robot that reinforcement-learns
to identify and memorize important previous observations. In Proceedings of the 2003 IEEE/RSJ
International Conference on Intelligent Robots and Systems, IROS 2003, pages 430–435.
Baldi, P. (1995). Gradient descent learning algorithms overview: A general dynamical systems per-
spective. IEEE Transactions on Neural Networks, 6(1):182–195.
Baldi, P. (2012). Autoencoders, unsupervised learning, and deep architectures. Journal of Machine
Learning Research (Proc. 2011 ICML Workshop on Unsupervised and Transfer Learning), 27:37–
50.
Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., and Soda, G. (1999). Exploiting the past and the
future in protein secondary structure prediction. Bioinformatics, 15:937–946.
Baldi, P. and Chauvin, Y. (1993). Neural networks for fingerprint recognition. Neural Computation,
5(3):402–418.
Baldi, P. and Chauvin, Y. (1996). Hybrid modeling, HMM/NN architectures, and protein applications.
Neural Computation, 8(7):1541–1565.
Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from
examples without local minima. Neural Networks, 2:53–58.
Baldi, P. and Hornik, K. (1994). Learning in linear networks: a survey. IEEE Transactions on Neural
Networks, 6(4):837–858. 1995.
Baldi, P. and Pollastri, G. (2003). The principled design of large-scale recursive neural network
architectures – DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res.,
4:575–602.
Baldi, P. and Sadowski, P. (2014). The dropout learning algorithm. Artificial Intelligence, 210C:78–
122.
Ballard, D. H. (1987). Modular learning in neural networks. In Proc. AAAI, pages 279–284.
Baluja, S. (1994). Population-based incremental learning: A method for integrating genetic
search based function optimization and competitive learning. Technical Report CMU-CS-94-163,
Carnegie Mellon University.
Balzer, R. (1985). A 15 year perspective on automatic programming. IEEE Transactions on Software
Engineering, 11(11):1257–1268.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1(3):295–311.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. (1989). Finding minimum entropy codes. Neural
Computation, 1(3):412–423.
Barrow, H. G. (1987). Learning receptive fields. In Proceedings of the IEEE 1st Annual Conference
on Neural Networks, volume IV, pages 115–121. IEEE.
Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning.
Discrete Event Dynamic Systems, 13(4):341–379.
Barto, A. G., Singh, S., and Chentanez, N. (2004). Intrinsically motivated learning of hierarchical col-
lections of skills. In Proceedings of International Conference on Developmental Learning (ICDL),
pages 112–119. MIT Press, Cambridge, MA.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can
solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics,
SMC-13:834–846.
Battiti, R. (1989). Accelerated backpropagation learning: two optimization methods. Complex Sys-
tems, 3(4):331–342.
Battiti, T. (1992). First- and second-order methods for learning: Between steepest descent and New-
ton’s method. Neural Computation, 4(2):141–166.
Baum, E. B. and Haussler, D. (1989). What size net gives valid generalization? Neural Computation,
1(1):151–160.
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state
Markov chains. The Annals of Mathematical Statistics, pages 1554–1563.
Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. J. Artif. Int. Res.,
15(1):319–350.
Bayer, J. and Osendorfer, C. (2014). Variational inference of latent state sequences using recurrent
networks. arXiv preprint arXiv:1406.1655.
Bayer, J., Osendorfer, C., Chen, N., Urban, S., and van der Smagt, P. (2013). On fast dropout and its
applicability to recurrent networks. arXiv preprint arXiv:1311.0701.
Bayer, J., Wierstra, D., Togelius, J., and Schmidhuber, J. (2009). Evolving memory cell structures for
sequence learning. In Proc. ICANN (2), pages 755–764.
Bayes, T. (1763). An essay toward solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society of London, 53:370–418. Communicated by R. Price, in a letter
to J. Canton.
Becker, S. (1991). Unsupervised learning procedures for neural networks. International Journal of
Neural Systems, 2(1 & 2):17–33.
Becker, S. and Le Cun, Y. (1989). Improving the convergence of back-propagation learning with
second order methods. In Touretzky, D., Hinton, G., and Sejnowski, T., editors, Proc. 1988 Con-
nectionist Models Summer School, pages 29–37, Pittsburg 1988. Morgan Kaufmann, San Mateo.
Behnke, S. (1999). Hebbian learning and competition in the neural abstraction pyramid. In Pro-
ceedings of the International Joint Conference on Neural Networks (IJCNN), volume 2, pages
1356–1361.
Behnke, S. (2001). Learning iterative image reconstruction in the neural abstraction pyramid. Inter-
national Journal of Computational Intelligence and Applications, 1(4):427–438.
Behnke, S. (2002). Learning face localization using hierarchical recurrent networks. In Proceedings
of the 12th International Conference on Artificial Neural Networks (ICANN), Madrid, Spain, pages
1319–1324.
Behnke, S. (2003a). Discovering hierarchical speech features using convolutional non-negative matrix
factorization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN),
volume 4, pages 2758–2763.
Behnke, S. (2003b). Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of
Lecture Notes in Computer Science. Springer.
Behnke, S. (2005). Face localization and tracking in the Neural Abstraction Pyramid. Neural Com-
puting and Applications, 14(2):97–103.
Behnke, S. and Rojas, R. (1998). Neural abstraction pyramid: A hierarchical image understand-
ing architecture. In Proceedings of International Joint Conference on Neural Networks (IJCNN),
volume 2, pages 820–825.
Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., and Moulines, E. (1997). A blind source separation
technique using second-order statistics. IEEE Transactions on Signal Processing, 45(2):434–444.
Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition. PhD
thesis, McGill University, (Computer Science), Montreal, Qc., Canada.
Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning,
2(1). Now Publishers.
Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new per-
spectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep
networks. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information
Processing Systems 19 (NIPS), pages 153–160. MIT Press.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Beringer, N., Graves, A., Schiel, F., and Schmidhuber, J. (2005). Classifying unprompted speech by
retraining LSTM nets. In Duch, W., Kacprzyk, J., Oja, E., and Zadrozny, S., editors, Artificial
Neural Networks: Biological Inspirations - ICANN 2005, LNCS 3696, pages 575–581. Springer-
Verlag Berlin Heidelberg.
Bertsekas, D. P. (2001). Dynamic Programming and Optimal Control. Athena Scientific.
Bishop, C. M. (1993). Curvature-driven smoothing: A learning algorithm for feed-forward networks.
IEEE Transactions on Neural Networks, 4(5):882–884.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Blair, A. D. and Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation,
9(5):1127–1142.
Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural
Networks, 5(1):117–127.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1987). Occam’s razor. Information
Processing Letters, 24:377–380.
Bobrowski, L. (1978). Learning processes in multilayer threshold nets. Biological Cybernetics, 31:1–
6.
Bodén, M. and Wiles, J. (2000). Context-free and context-sensitive dynamics in recurrent neural
networks. Connection Science, 12(3-4):197–210.
Bodenhausen, U. and Waibel, A. (1991). The Tempo 2 algorithm: Adjusting time-delays by super-
vised learning. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
Information Processing Systems 3, pages 155–161. Morgan Kaufmann.
Bohte, S. M., Kok, J. N., and La Poutre, H. (2002). Error-backpropagation in temporally encoded
networks of spiking neurons. Neurocomputing, 48(1):17–37.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140.
Brette, R., Rudolph, M., Carnevale, T., Hines, M., Beeman, D., Bower, J. M., Diesmann, M., Morri-
son, A., Goodman, P. H., Harris Jr, F. C., et al. (2007). Simulation of networks of spiking neurons:
a review of tools and strategies. Journal of Computational Neuroscience, 23(3):349–398.
Breuel, T. M., Ul-Hasan, A., Al-Azawi, M. A., and Shafait, F. (2013). High-performance OCR for
printed English and Fraktur using LSTM networks. In 12th International Conference on Document
Analysis and Recognition (ICDAR), pages 683–687. IEEE.
Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Sackinger, E., and Shah, R.
(1993). Signature verification using a Siamese time delay neural network. International Journal of
Pattern Recognition and Artificial Intelligence, 7(4):669–688.
Broyden, C. G. et al. (1965). A class of methods for solving nonlinear simultaneous equations. Math.
Comp, 19(92):577–593.
Brueckner, R. and Schuller, B. (2014). Social signal classification using deep BLSTM recurrent neural
networks. In Proceedings 39th IEEE International Conference on Acoustics, Speech, and Signal
Processing, ICASSP 2014, Florence, Italy, pages 4856–4860.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking
neurons. Journal of Computational Neuroscience, 8(3):183–208.
Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and control. Blais-
dell Pub. Co.
Bryson, A. E. (1961). A gradient method for optimizing multi-stage allocation processes. In Proc.
Harvard Univ. Symposium on digital computers and their applications.
Bryson, Jr., A. E. and Denham, W. F. (1961). A steepest-ascent method for solving optimum pro-
gramming problems. Technical Report BR-1303, Raytheon Company, Missile and Space Division.
Buhler, J. (2001). Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinfor-
matics, 17(5):419–428.
Buntine, W. L. and Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5:603–643.
Burgess, N. (1994). A constructive algorithm that converges for real-valued input patterns. Interna-
tional Journal of Neural Systems, 5(1):59–66.
Cardoso, J.-F. (1994). On the performance of orthogonal source separation algorithms. In Proc.
EUSIPCO, pages 776–779.
Carreira-Perpinan, M. A. (2001). Continuous latent variable models for dimensionality reduction and
sequential data reconstruction. PhD thesis, University of Sheffield UK.
Carter, M. J., Rudolph, F. J., and Nucci, A. J. (1990). Operational fault tolerance of CMAC networks.
In Touretzky, D. S., editor, Advances in Neural Information Processing Systems (NIPS) 2, pages
340–347. San Mateo, CA: Morgan Kaufmann.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41–75.
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural
networks and finite state machine extraction. Neural Computation, 8(6):1135–1178.
Cauwenberghs, G. (1993). A fast stochastic error-descent algorithm for supervised learning and op-
timization. In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural
Information Processing Systems 5, pages 244–244. Morgan Kaufmann.
Chaitin, G. J. (1966). On the length of programs for computing finite binary sequences. Journal of
the ACM, 13:547–569.
Chalup, S. K. and Blair, A. D. (2003). Incremental training of first order recurrent neural networks to
predict a context-sensitive language. Neural Networks, 16(7):955–972.
Chellapilla, K., Puri, S., and Simard, P. (2006). High performance convolutional neural networks for
document processing. In International Workshop on Frontiers in Handwriting Recognition.
Chen, K. and Salman, A. (2011). Learning speaker-specific characteristics with a deep neural archi-
tecture. IEEE Transactions on Neural Networks, 22(11):1744–1756.
Cho, K. (2014). Foundations and Advances in Deep Learning. PhD thesis, Aalto University School
of Science.
Cho, K., Ilin, A., and Raiko, T. (2012). Tikhonov-type regularization for restricted Boltzmann ma-
chines. In Intl. Conf. on Artificial Neural Networks (ICANN) 2012, pages 81–88. Springer.
Cho, K., Raiko, T., and Ilin, A. (2013). Enhanced gradient for training restricted Boltzmann machines.
Neural Computation, 25(3):805–831.
Church, A. (1936). An unsolvable problem of elementary number theory. American Journal of
Mathematics, 58:345–363.
Ciresan, D. C., Giusti, A., Gambardella, L. M., and Schmidhuber, J. (2012a). Deep neural networks
segment neuronal membranes in electron microscopy images. In Advances in Neural Information
Processing Systems (NIPS), pages 2852–2860.
Ciresan, D. C., Giusti, A., Gambardella, L. M., and Schmidhuber, J. (2013). Mitosis detection in
breast cancer histology images with deep neural networks. In Proc. MICCAI, volume 2, pages
411–418.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural
nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011a). Flexible, high
performance convolutional neural networks for image classification. In Intl. Joint Conference on
Artificial Intelligence IJCAI, pages 1237–1242.
Ciresan, D. C., Meier, U., Masci, J., and Schmidhuber, J. (2011b). A committee of neural networks
for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages
1918–1921.
Ciresan, D. C., Meier, U., Masci, J., and Schmidhuber, J. (2012b). Multi-column deep neural network
for traffic sign classification. Neural Networks, 32:333–338.
Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012c). Multi-column deep neural networks for image
classification. In IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012. Long
preprint arXiv:1202.2745v1 [cs.CV].
Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012d). Transfer learning for Latin and Chinese char-
acters with deep neural networks. In International Joint Conference on Neural Networks (IJCNN),
pages 1301–1306.
Ciresan, D. C. and Schmidhuber, J. (2013). Multi-column deep neural networks for offline handwrit-
ten Chinese character classification. Technical report, IDSIA. arXiv:1309.0261.
Cliff, D. T., Husbands, P., and Harvey, I. (1993). Evolving recurrent dynamical networks for robot
control. In Artificial Neural Nets and Genetic Algorithms, pages 428–435. Springer.
Clune, J., Mouret, J.-B., and Lipson, H. (2013). The evolutionary origins of modularity. Proceedings
of the Royal Society B: Biological Sciences, 280(1755):20122863.
Clune, J., Stanley, K. O., Pennock, R. T., and Ofria, C. (2011). On the performance of indirect
encoding across the continuum of regularity. Trans. Evol. Comp, 15(3):346–367.
Coates, A., Huval, B., Wang, T., Wu, D. J., Ng, A. Y., and Catanzaro, B. (2013). Deep learning with
COTS HPC systems. In Proc. International Conference on Machine learning (ICML’13).
Cochocki, A. and Unbehauen, R. (1993). Neural networks for optimization and signal processing.
John Wiley & Sons, Inc.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep
neural networks with multitask learning. In Proceedings of the 25th International Conference on
Machine Learning (ICML), pages 160–167. ACM.
Comon, P. (1994). Independent component analysis – a new concept? Signal Processing, 36(3):287–
314.
Connor, C. E., Brincat, S. L., and Pasupathy, A. (2007). Transformation of shape information in the
ventral pathway. Current Opinion in Neurobiology, 17(2):140–147.
Connor, J., Martin, D. R., and Atlas, L. E. (1994). Recurrent neural networks and robust time series
prediction. IEEE Transactions on Neural Networks, 5(2):240–254.
Cook, S. A. (1971). The complexity of theorem-proving procedures. In Proceedings of the 3rd Annual
ACM Symposium on the Theory of Computing (STOC’71), pages 151–158. ACM, New York.
Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs.
In Grefenstette, J., editor, Proceedings of an International Conference on Genetic Algorithms and
Their Applications, Carnegie-Mellon University, July 24-26, 1985, Hillsdale NJ. Lawrence Erl-
baum Associates.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct
degree of smoothing by the method of generalized cross-validation. Numer. Math., 31:377–403.
Cuccu, G., Luciw, M., Schmidhuber, J., and Gomez, F. (2011). Intrinsically motivated evolutionary
search for vision-based reinforcement learning. In Proceedings of the 2011 IEEE Conference on
Development and Learning and Epigenetic Robotics IEEE-ICDL-EPIROB, volume 2, pages 1–7.
IEEE.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural net-
works for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE
Transactions on, 20(1):30–42.
Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR
using rectified linear units and dropout. In IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 8609–8613. IEEE.
D’Ambrosio, D. B. and Stanley, K. O. (2007). A novel generative encoding for exploiting neural net-
work sensor and output geometry. In Proceedings of the Conference on Genetic and Evolutionary
Computation (GECCO), pages 974–981.
Datar, M., Immorlica, N., Indyk, P., and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme
based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational
Geometry, pages 253–262. ACM.
Dayan, P. and Hinton, G. (1993). Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D.,
and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages
271–278. Morgan Kaufmann.
Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8):1385–
1403.
Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural
Computation, 7:889–904.
Dayan, P. and Zemel, R. (1995). Competition and multiple cause models. Neural Computation,
7:565–579.
De Freitas, J. F. G. (2003). Bayesian methods for neural networks. PhD thesis, University of Cam-
bridge.
de Souto, M. C. P. and Oliveira, W. R. D. (1999). The loading problem for pyramidal neural networks.
Electronic Journal on Mathematics of Computation.
De Valois, R. L., Albrecht, D. G., and Thorell, L. G. (1982). Spatial frequency selectivity of cells in
macaque visual cortex. Vision Research, 22(5):545–559.
de Vries, B. and Principe, J. C. (1991). A theory for neural networks with time delays. In Lippmann,
R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing
Systems (NIPS) 3, pages 162–168. Morgan Kaufmann.
Deco, G. and Parra, L. (1997). Non-linear feature extraction by redundancy reduction in an unsuper-
vised stochastic neural network. Neural Networks, 10(4):683–691.
Deco, G. and Rolls, E. T. (2005). Neurodynamics of biased competition and cooperation for attention:
a model with spiking neurons. Journal of Neurophysiology, 94(1):295–313.
DeJong, G. and Mooney, R. (1986). Explanation-based learning: An alternative view. Machine
Learning, 1(2):145–176.
DeMers, D. and Cottrell, G. (1993). Non-linear dimensionality reduction. In Hanson, S. J., Cowan,
J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5, pages
580–587. Morgan Kaufmann.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
Deng, L. and Yu, D. (2014). Deep Learning: Methods and Applications. NOW Publishers.
Desimone, R., Albright, T. D., Gross, C. G., and Bruce, C. (1984). Stimulus-selective properties of
inferior temporal neurons in the macaque. The Journal of Neuroscience, 4(8):2051–2062.
Deville, Y. and Lau, K. K. (1994). Logic program synthesis. Journal of Logic Programming,
19–20:321–350.
Di Lena, P., Nagata, K., and Baldi, P. (2012). Deep architectures for protein contact map prediction.
Bioinformatics, 28:2449–2457.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. (2012). How does the brain solve visual object recogni-
tion? Neuron, 73(3):415–434.
Dickmanns, D., Schmidhuber, J., and Winklhofer, A. (1987). Der genetische Algorithmus:
Eine Implementierung in Prolog. Technical Report, Inst. of Informatics, Tech. Univ. Munich.
http://www.idsia.ch/~juergen/geneticprogramming.html.
Dickmanns, E. D., Behringer, R., Dickmanns, D., Hildebrandt, T., Maurer, M., Thomanek, F., and
Schiehlen, J. (1994). The seeing passenger car ’VaMoRs-P’. In Proc. Int. Symp. on Intelligent
Vehicles ’94, Paris, pages 68–73.
Dietterich, T. G. (2000a). Ensemble methods in machine learning. In Multiple classifier systems,
pages 1–15. Springer.
Dietterich, T. G. (2000b). Hierarchical reinforcement learning with the MAXQ value function de-
composition. J. Artif. Intell. Res. (JAIR), 13:227–303.
Director, S. W. and Rohrer, R. A. (1969). Automated network design - the frequency-domain case.
IEEE Trans. Circuit Theory, CT-16:330–337.
Dittenbach, M., Merkl, D., and Rauber, A. (2000). The growing hierarchical self-organizing map.
In IEEE-INNS-ENNS International Joint Conference on Neural Networks, volume 6, pages 6015–
6015. IEEE Computer Society.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2013). De-
CAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint
arXiv:1310.1531.
Dorffner, G. (1996). Neural networks for time series processing. Neural Network World.
Doya, K., Samejima, K., Katagiri, K., and Kawato, M. (2002). Multiple model-based reinforce-
ment learning. Neural Computation, 14(6):1347–1369.
Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical
Analysis and Applications, 5(1):30–45.
Dreyfus, S. E. (1973). The computational solution of optimal control problems with time lag. IEEE
Transactions on Automatic Control, 18(4):383–385.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.
Egorova, A., Gloye, A., Göktekin, C., Liers, A., Luft, M., Rojas, R., Simon, M., Tenchio, O., and
Wiesel, F. (2004). FU-Fighters Small Size 2004, Team Description. RoboCup 2004 Symposium:
Papers and Team Description Papers. CD edition.
Elfwing, S., Otsuka, M., Uchibe, E., and Doya, K. (2010). Free-energy based reinforcement learning
for vision-based navigation with high-dimensional sensory inputs. In Neural Information Process-
ing. Theory and Algorithms (ICONIP), volume 1, pages 215–222. Springer.
Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition. Oxford
University Press, New York, NY.
Eliasmith, C., Stewart, T. C., Choo, X., Bekolay, T., DeWolf, T., Tang, Y., and Rasmussen, D. (2012).
A large-scale model of the functioning brain. Science, 338(6111):1202–1205.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2):179–211.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. (2010). Why does
unsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625–660.
Escalante-B., A. N. and Wiskott, L. (2013). How to solve classification and regression problems on
high-dimensional data with a supervised extension of slow feature analysis. Journal of Machine
Learning Research, 14:3683–3719.
Eubank, R. L. (1988). Spline smoothing and nonparametric regression. In Farlow, S., editor, Self-
Organizing Methods in Modeling. Marcel Dekker, New York.
Euler, L. (1744). Methodus inveniendi.
Eyben, F., Weninger, F., Squartini, S., and Schuller, B. (2013). Real-life voice activity detection with
LSTM recurrent neural networks and an application to Hollywood movies. In Proc. 38th IEEE
International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013, Vancouver,
Canada, pages 483–487.
Faggin, F. (1992). Neural network hardware. In International Joint Conference on Neural Networks
(IJCNN), volume 1, page 153.
Fahlman, S. E. (1988). An empirical study of learning speed in back-propagation networks. Technical
Report CMU-CS-88-162, Carnegie-Mellon Univ.
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In Lippmann, R. P.,
Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems
(NIPS) 3, pages 190–196. Morgan Kaufmann.
Falconbridge, M. S., Stamps, R. L., and Badcock, D. R. (2006). A simple Hebbian/anti-Hebbian
network learns the sparse, independent components of natural images. Neural Computation,
18(2):415–429.
Fan, Y., Qian, Y., Xie, F., and Soong, F. K. (2014). TTS synthesis with bidirectional LSTM based
recurrent neural networks. In Proc. Interspeech.
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features for scene
labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929.
Farlow, S. J. (1984). Self-organizing methods in modeling: GMDH type algorithms, volume 54. CRC
Press.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C. F., and Yuan, F. (1998). Enhanced multi-stream Kalman
filter training for recurrent networks. In Nonlinear Modeling, pages 29–53. Springer.
Feldkamp, L. A., Prokhorov, D. V., and Feldkamp, T. M. (2003). Simple and conditioned adaptive
behavior from Kalman filter trained recurrent networks. Neural Networks, 16(5):683–689.
Feldkamp, L. A. and Puskorius, G. V. (1998). A signal processing framework based on dynamic neural
networks with application to problems in adaptation, filtering, and classification. Proceedings of
the IEEE, 86(11):2259–2277.
Felleman, D. J. and Van Essen, D. C. (1991). Distributed hierarchical processing in the primate
cerebral cortex. Cerebral Cortex, 1(1):1–47.
Fernandez, R., Rendel, A., Ramabhadran, B., and Hoory, R. (2014). Prosody contour prediction with
Long Short-Term Memory, bi-directional, deep recurrent neural networks. In Proc. Interspeech.
Fernández, S., Graves, A., and Schmidhuber, J. (2007). An application of recurrent neural networks
to discriminative keyword spotting. In Proc. ICANN (2), pages 220–229.
Fernández, S., Graves, A., and Schmidhuber, J. (2007). Sequence labelling in structured domains with
hierarchical recurrent neural networks. In Proceedings of the 20th International Joint Conference
on Artificial Intelligence (IJCAI).
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of
cortical cells. Journal of the Optical Society of America A, 4:2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6:559–601.
Fieres, J., Schemmel, J., and Meier, K. (2008). Realizing biological spiking network models in
a configurable wafer-scale hardware system. In IEEE International Joint Conference on Neural
Networks, pages 969–976.
Fine, S., Singer, Y., and Tishby, N. (1998). The hierarchical hidden Markov model: Analysis and
applications. Machine Learning, 32(1):41–62.
Fischer, A. and Igel, C. (2014). Training restricted Boltzmann machines: An introduction. Pattern
Recognition, 47:25–39.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane.
Biophysical Journal, 1(6):445–466.
Fletcher, R. and Powell, M. J. (1963). A rapidly convergent descent method for minimization. The
Computer Journal, 6(2):163–168.
Floreano, D. and Mattiussi, C. (2001). Evolution of spiking neural controllers for autonomous vision-
based robots. In Evolutionary Robotics. From Intelligent Robotics to Artificial Life, pages 38–61.
Springer.
Fogel, D. B., Fogel, L. J., and Porto, V. (1990). Evolving neural networks. Biological Cybernetics,
63(6):487–493.
Fogel, L., Owens, A., and Walsh, M. (1966). Artificial Intelligence through Simulated Evolution.
Wiley, New York.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cyber-
netics, 64:165–170.
Földiák, P. and Young, M. P. (1995). Sparse coding in the primate cortex. In Arbib, M. A., editor, The
Handbook of Brain Theory and Neural Networks, pages 895–898. The MIT Press.
Förster, A., Graves, A., and Schmidhuber, J. (2007). RNN-based Learning of Compact Maps for
Efficient Robot Localization. In 15th European Symposium on Artificial Neural Networks, ESANN,
pages 537–542, Bruges, Belgium.
Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place, head-
direction, and spatial-view cells. PLoS Computational Biology, 3(8):166.
Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning, volume 1.
Springer Series in Statistics, New York.
Frinken, V., Zamora-Martinez, F., Espana-Boquera, S., Castro-Bleda, M. J., Fischer, A., and Bunke,
H. (2012). Long-short term memory neural networks language modeling for handwriting recog-
nition. In 21st International Conference on Pattern Recognition (ICPR), pages 701–704. IEEE.
Fritzke, B. (1994). A growing neural gas network learns topologies. In Tesauro, G., Touretzky, D. S.,
and Leen, T. K., editors, Advances in Neural Information Processing Systems (NIPS) 7, pages
625–632. MIT Press.
Fu, K. S. (1977). Syntactic Pattern Recognition and Applications. Berlin, Springer.
Fukada, T., Schuster, M., and Sagisaka, Y. (1999). Phoneme boundary estimation using bidirectional
recurrent neural networks and its applications. Systems and Computers in Japan, 30(4):20–30.
Fukushima, K. (1979). Neural network model for a mechanism of pattern recognition unaffected by
shift in position - Neocognitron. Trans. IECE, J62-A(10):658–665.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network for a mechanism of pattern
recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202.
Fukushima, K. (2011). Increasing robustness against background noise: visual pattern recognition by
a Neocognitron. Neural Networks, 24(7):767–778.
Fukushima, K. (2013a). Artificial vision by multi-layered neural networks: Neocognitron and its
advances. Neural Networks, 37:103–119.
Ge, S., Hang, C. C., Lee, T. H., and Zhang, T. (2010). Stable adaptive neural network control.
Springer.
Geiger, J. T., Zhang, Z., Weninger, F., Schuller, B., and Rigoll, G. (2014). Robust speech recognition
using long short-term memory recurrent neural networks for hybrid acoustic modelling. In Proc.
Interspeech.
Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance dilemma.
Neural Computation, 4:1–58.
Gers, F. A. and Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the
IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), volume 3,
pages 189–194. IEEE.
Gers, F. A. and Schmidhuber, J. (2001). LSTM recurrent networks learn simple context free and
context sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.
Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with
LSTM. Neural Computation, 12(10):2451–2471.
Gers, F. A., Schraudolph, N., and Schmidhuber, J. (2002). Learning precise timing with LSTM
recurrent networks. Journal of Machine Learning Research, 3:115–143.
Gerstner, W. and Kistler, W. M. (2002). Spiking Neuron Models. Cambridge University Press.
Gerstner, W. and van Hemmen, J. L. (1992). Associative memory in a network of spiking neurons.
Network: Computation in Neural Systems, 3(2):139–164.
Ghavamzadeh, M. and Mahadevan, S. (2003). Hierarchical policy gradient algorithms. In Proceedings
of the Twentieth Conference on Machine Learning (ICML-2003), pages 226–233.
Gherrity, M. (1989). A learning algorithm for analog fully recurrent neural networks. In IEEE/INNS
International Joint Conference on Neural Networks, San Diego, volume 1, pages 643–644.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchies for accurate object
detection and semantic segmentation. Technical Report arxiv.org/abs/1311.2524, UC Berkeley and
ICSI.
Gisslen, L., Luciw, M., Graziano, V., and Schmidhuber, J. (2011). Sequential constant size compressor
for reinforcement learning. In Proc. Fourth Conference on Artificial General Intelligence (AGI),
Google, Mountain View, CA, pages 31–40. Springer.
Giusti, A., Ciresan, D. C., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2013). Fast image
scanning with deep max-pooling convolutional neural networks. In Proc. ICIP.
Glackin, B., McGinnity, T. M., Maguire, L. P., Wu, Q., and Belatreche, A. (2005). A novel approach
for the implementation of large scale spiking neural networks on FPGA hardware. In Computa-
tional Intelligence and Bioinspired Systems, pages 552–563. Springer.
Glasmachers, T., Schaul, T., Sun, Y., Wierstra, D., and Schmidhuber, J. (2010). Exponential natural
evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO), pages 393–400. ACM.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS, volume 15,
pages 315–323.
Gloye, A., Wiesel, F., Tenchio, O., and Simon, M. (2005). Reinforcing the driving quality of soccer
playing robots by anticipation. IT - Information Technology, 47(5).
Gödel, K. (1931). Über formal unentscheidbare Sätze der Principia Mathematica und verwandter
Systeme I. Monatshefte für Mathematik und Physik, 38:173–198.
Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-
Wesley, Reading, MA.
Goldfarb, D. (1970). A family of variable-metric methods derived by variational means. Mathematics
of computation, 24(109):23–26.
Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing
a good ridge parameter. Technometrics, 21:215–224.
Gomez, F. J. (2003). Robust Nonlinear Control through Neuroevolution. PhD thesis, Department of
Computer Sciences, University of Texas at Austin.
Gomez, F. J. and Miikkulainen, R. (2003). Active guidance for a finless rocket using neuroevolution.
In Proc. GECCO 2003, Chicago.
Gomez, F. J. and Schmidhuber, J. (2005). Co-evolving recurrent neurons learn deep memory
POMDPs. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO),
Washington, D. C. ACM Press, New York, NY, USA.
Gomez, F. J., Schmidhuber, J., and Miikkulainen, R. (2008). Accelerated neural evolution through
cooperatively coevolved synapses. Journal of Machine Learning Research, 9(May):937–965.
Gomi, H. and Kawato, M. (1993). Neural network control for a closed-loop system using feedback-
error-learning. Neural Networks, 6(7):933–946.
Gonzalez-Dominguez, J., Lopez-Moreno, I., Sak, H., Gonzalez-Rodriguez, J., and Moreno, P. J.
(2014). Automatic language identification using Long Short-Term Memory recurrent neural net-
works. In Proc. Interspeech.
Goodfellow, I., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An Empirical Investigation
of Catastrophic Forgetting in Gradient-Based Neural Networks. TR arXiv:1312.6211v2.
Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014b). Multi-digit number
recognition from street view imagery using deep convolutional neural networks. arXiv preprint
arXiv:1312.6082v4.
Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised
feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models.
Goodfellow, I. J., Courville, A. C., and Bengio, Y. (2012). Large-scale feature learning with spike-
and-slab sparse coding. In Proceedings of the 29th International Conference on Machine Learning
(ICML).
Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout net-
works. In International Conference on Machine Learning (ICML).
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Infor-
mation Processing Systems (NIPS), pages 2348–2356.
Graves, A., Eck, D., Beringer, N., and Schmidhuber, J. (2003). Isolated digit recognition with LSTM
recurrent networks. In First International Workshop on Biologically Inspired Approaches to Ad-
vanced Information Technology, Lausanne.
Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classifi-
cation: Labelling unsegmented sequence data with recurrent neural networks. In ICML’06: Proceedings
of the 23rd International Conference on Machine Learning, pages 369–376.
Graves, A., Fernández, S., Liwicki, M., Bunke, H., and Schmidhuber, J. (2008). Unconstrained on-
line handwriting recognition with recurrent neural networks. In Platt, J., Koller, D., Singer, Y.,
and Roweis, S., editors, Advances in Neural Information Processing Systems (NIPS) 20, pages
577–584. MIT Press, Cambridge, MA.
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural net-
works. In Proc. 31st International Conference on Machine Learning (ICML), pages 1764–1772.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A
novel connectionist system for improved unconstrained handwriting recognition. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 31(5).
Graves, A., Mohamed, A.-R., and Hinton, G. E. (2013). Speech recognition with deep recurrent
neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 6645–6649. IEEE.
Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM
and other neural network architectures. Neural Networks, 18(5-6):602–610.
Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional re-
current neural networks. In Advances in Neural Information Processing Systems (NIPS) 21, pages
545–552. MIT Press, Cambridge, MA.
Graziano, M. (2009). The Intelligent Movement Machine: An Ethological Perspective on the Primate
Motor System. Oxford University Press, USA.
Grossberg, S. (1969). Some networks that can learn, remember, and reproduce any number of com-
plicated space-time patterns, I. Journal of Mathematics and Mechanics, 19:53–91.
Grossberg, S. (1976a). Adaptive pattern classification and universal recoding, 1: Parallel development
and coding of neural feature detectors. Biological Cybernetics, 23:187–202.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding, 2: Feedback, expecta-
tion, olfaction, and illusions. Biological Cybernetics, 23.
Gruau, F., Whitley, D., and Pyeatt, L. (1996). A comparison between cellular encoding and direct
encoding for genetic neural networks. NeuroCOLT Technical Report NC-TR-96-048, ESPRIT
Working Group in Neural and Computational Learning, NeuroCOLT 8556.
Grünwald, P. D., Myung, I. J., and Pitt, M. A. (2005). Advances in minimum description length:
Theory and applications. MIT Press.
Grüttner, M., Sehnke, F., Schaul, T., and Schmidhuber, J. (2010). Multi-Dimensional Deep Memory
Atari-Go Players for Parameter Exploring Policy Gradients. In Proceedings of the International
Conference on Artificial Neural Networks ICANN, pages 114–123. Springer.
Guo, X., Singh, S., Lee, H., Lewis, R., and Wang, X. (2014). Deep learning for real-time Atari game
play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing
Systems 27 (NIPS).
Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. (1992). Structural risk minimization for
character recognition. In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in
Neural Information Processing Systems (NIPS) 4, pages 471–479. Morgan Kaufmann.
Hadamard, J. (1908). Mémoire sur le problème d’analyse relatif à l’équilibre des plaques élastiques
encastrées. Mémoires présentés par divers savants à l’Académie des sciences de l’Institut de
France: Éxtrait. Imprimerie nationale.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant
mapping. In Proc. Computer Vision and Pattern Recognition Conference (CVPR’06). IEEE Press.
Hagras, H., Pounds-Cornish, A., Colley, M., Callaghan, V., and Clarke, G. (2004). Evolving spiking
neural network controllers for autonomous robots. In IEEE International Conference on Robotics
and Automation (ICRA), volume 5, pages 4620–4626.
Hansen, N., Müller, S. D., and Koumoutsakos, P. (2003). Reducing the time complexity of the deran-
domized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computa-
tion, 11(1):1–18.
Hansen, N. and Ostermeier, A. (2001). Completely derandomized self-adaptation in evolution strate-
gies. Evolutionary Computation, 9(2):159–195.
Hanson, S. J. (1990). A stochastic version of the delta rule. Physica D: Nonlinear Phenomena,
42(1):265–272.
Hanson, S. J. and Pratt, L. Y. (1989). Comparing biases for minimal network construction with
back-propagation. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems
(NIPS) 1, pages 177–185. San Mateo, CA: Morgan Kaufmann.
Happel, B. L. and Murre, J. M. (1994). Design and evolution of modular neural network architectures.
Neural Networks, 7(6):985–1004.
Hashem, S. and Schmeiser, B. (1992). Improving model accuracy using optimal linear combinations
of trained neural networks. IEEE Transactions on Neural Networks, 6:792–794.
Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain
surgeon. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural
Information Processing Systems 5, pages 164–171. Morgan Kaufmann.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning. Springer
Series in Statistics.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models. Monographs on Statistics and
Applied Probability, 43.
Hawkins, J. and George, D. (2006). Hierarchical Temporal Memory - Concepts, Theory, and Termi-
nology. Numenta Inc.
Haykin, S. S. (2001). Kalman filtering and neural networks. Wiley Online Library.
Herrero, J., Valencia, A., and Dopazo, J. (2001). A hierarchical unsupervised growing neural network
for clustering gene expression patterns. Bioinformatics, 17(2):126–136.
Hertz, J., Krogh, A., and Palmer, R. (1991). Introduction to the Theory of Neural Computation.
Addison-Wesley, Redwood City.
Hestenes, M. R. and Stiefel, E. (1952). Methods of conjugate gradients for solving linear systems.
Journal of Research of the National Bureau of Standards, 49:409–436.
Hihi, S. E. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies.
In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information
Processing Systems 8, pages 493–499. MIT Press.
Hinton, G. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks.
Science, 313(5786):504–507.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(1):185–234.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural
Comp., 14(8):1771–1800.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsuper-
vised neural networks. Science, 268:1158–1160.
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling
in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag.,
29(6):82–97.
Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed
representations. Philosophical Transactions of the Royal Society B, 352:1177–1190.
Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets.
Neural Computation, 18(7):1527–1554.
Hinton, G. E. and Sejnowski, T. E. (1986). Learning and relearning in Boltzmann machines. In
Parallel Distributed Processing, volume 1, pages 282–317. MIT Press.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012b).
Improving neural networks by preventing co-adaptation of feature detectors. Technical Report
arXiv:1207.0580.
Hinton, G. E. and van Camp, D. (1993). Keeping neural networks simple. In Proceedings of the
International Conference on Artificial Neural Networks, Amsterdam, pages 11–18. Springer.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut
für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. Advisor: J. Schmidhuber.
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001a). Gradient flow in recurrent nets:
the difficulty of learning long-term dependencies. In Kremer, S. C. and Kolen, J. F., editors, A Field
Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter, S. and Obermayer, K. (2005). Sequence classification for protein analysis. In Snowbird
Workshop, Snowbird, Utah. Computational and Biological Learning Society.
Hochreiter, S. and Schmidhuber, J. (1996). Bridging long time lags by weight guessing and “Long
Short-Term Memory”. In Silva, F. L., Principe, J. C., and Almeida, L. B., editors, Spatiotemporal
models in biological and artificial systems, pages 65–72. IOS Press, Amsterdam, Netherlands.
Serie: Frontiers in Artificial Intelligence and Applications, Volume 37.
Hochreiter, S. and Schmidhuber, J. (1997a). Flat minima. Neural Computation, 9(1):1–42.
Hochreiter, S. and Schmidhuber, J. (1997b). Long Short-Term Memory. Neural Computation,
9(8):1735–1780. Based on TR FKI-207-95, TUM (1995).
Hochreiter, S. and Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Compu-
tation, 11(3):679–714.
Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001b). Learning to learn using gradient descent.
In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-
2001), pages 87–94. Springer: Berlin, Heidelberg.
Hodgkin, A. L. and Huxley, A. F. (1952). A quantitative description of membrane current and its
application to conduction and excitation in nerve. The Journal of Physiology, 117(4):500–544.
Hoerzer, G. M., Legenstein, R., and Maass, W. (2014). Emergence of complex computational struc-
tures from chaotic neural networks through reward-modulated Hebbian learning. Cerebral Cortex,
24:677–690.
Holden, S. B. (1994). On the Theory of Generalization and Self-Structuring in Linearly Weighted
Connectionist Networks. PhD thesis, Cambridge University, Engineering Department.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press,
Ann Arbor.
Honavar, V. and Uhr, L. (1993). Generative learning structures and processes for generalized connec-
tionist networks. Information Sciences, 70(1):75–108.
Honavar, V. and Uhr, L. M. (1988). A network of neuron-like units that learns to perceive by gen-
eration as well as reweighting of its links. In Touretzky, D., Hinton, G. E., and Sejnowski, T.,
editors, Proc. of the 1988 Connectionist Models Summer School, pages 472–484, San Mateo. Mor-
gan Kaufman.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational
abilities. Proc. of the National Academy of Sciences, 79:2554–2558.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal
approximators. Neural Networks, 2(5):359–366.
Hubel, D. H. and Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture
in the cat’s visual cortex. Journal of Physiology (London), 160:106–154.
Hubel, D. H. and Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate
cortex. The Journal of Physiology, 195(1):215–243.
Huffman, D. A. (1952). A method for the construction of minimum-redundancy codes. Proceedings IRE,
40:1098–1101.
Hung, C. P., Kreiman, G., Poggio, T., and DiCarlo, J. J. (2005). Fast readout of object identity from
macaque inferior temporal cortex. Science, 310(5749):863–866.
Hutter, M. (2002). The fastest and shortest algorithm for all well-defined problems. International
Journal of Foundations of Computer Science, 13(3):431–443. (On J. Schmidhuber’s SNF grant
20-61847).
Hutter, M. (2005). Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Prob-
ability. Springer, Berlin. (On J. Schmidhuber’s SNF grant 20-61847).
Hyvärinen, A., Hoyer, P., and Oja, E. (1999). Sparse code shrinkage: Denoising by maximum likeli-
hood estimation. In Kearns, M., Solla, S. A., and Cohn, D., editors, Advances in Neural Information
Processing Systems (NIPS) 12. MIT Press.
Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent component analysis. John Wiley &
Sons.
ICPR 2012 Contest on Mitosis Detection in Breast Cancer Histological Images (2012). IPAL Lab-
oratory and TRIBVN Company and Pitie-Salpetriere Hospital and CIALAB of Ohio State Univ.,
http://ipal.cnrs.fr/ICPR2012/.
Igel, C. (2003). Neuroevolution for reinforcement learning using evolution strategies. In Reynolds, R.,
Abbass, H., Tan, K. C., Mckay, B., Essam, D., and Gedeon, T., editors, Congress on Evolutionary
Computation (CEC 2003), volume 4, pages 2588–2595. IEEE.
Igel, C. and Hüsken, M. (2003). Empirical evaluation of the improved Rprop learning algorithm.
Neurocomputing, 50(C):105–123.
Ikeda, S., Ochiai, M., and Sawaragi, Y. (1976). Sequential GMDH algorithm and its application to
river flow prediction. IEEE Transactions on Systems, Man and Cybernetics, (7):473–479.
Indermuhle, E., Frinken, V., and Bunke, H. (2012). Mode detection in online handwritten documents
using BLSTM neural networks. In International Conference on Frontiers in Handwriting Recognition
(ICFHR), pages 302–307. IEEE.
Indermuhle, E., Frinken, V., Fischer, A., and Bunke, H. (2011). Keyword spotting in online handwritten
documents containing text and non-text using BLSTM neural networks. In 2011 International
Conference on Document Analysis and Recognition (ICDAR), pages 73–77. IEEE.
Indiveri, G., Linares-Barranco, B., Hamilton, T. J., Van Schaik, A., Etienne-Cummings, R., Delbruck,
T., Liu, S.-C., Dudek, P., Häfliger, P., Renaud, S., et al. (2011). Neuromorphic silicon neuron
circuits. Frontiers in Neuroscience, 5(73).
Ivakhnenko, A. G. (1968). The group method of data handling – a rival of the method of stochastic
approximation. Soviet Automatic Control, 13(3):43–55.
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems,
Man and Cybernetics, (4):364–378.
Ivakhnenko, A. G. (1995). The review of problems solvable by algorithms of the group method of data
handling (GMDH). Pattern Recognition and Image Analysis / Raspoznavaniye Obrazov I Analiz
Izobrazhenii, 5:527–535.
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corpo-
ration.
Ivakhnenko, A. G., Lapa, V. G., and McDonough, R. N. (1967). Cybernetics and forecasting tech-
niques. American Elsevier, NY.
Izhikevich, E. M. (2003). Simple model of spiking neurons. IEEE Transactions on Neural
Networks, 14(6):1569–1572.
Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995). Reinforcement learning algorithm for partially
observable Markov decision problems. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors,
Advances in Neural Information Processing Systems (NIPS) 7, pages 345–352. MIT Press.
Jackel, L., Boser, B., Graf, H.-P., Denker, J., LeCun, Y., Henderson, D., Matan, O., Howard, R., and
Baird, H. (1990). VLSI implementation of electronic neural networks: an example in character
recognition. In IEEE, editor, IEEE International Conference on Systems, Man, and Cybernetics,
pages 320–322, Los Angeles, CA.
Jacob, C., Lindenmayer, A., and Rozenberg, G. (1994). Genetic L-System Programming. In Parallel
Problem Solving from Nature III, Lecture Notes in Computer Science.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Net-
works, 1(4):295–307.
Jaeger, H. (2001). The "echo state" approach to analysing and training recurrent neural networks.
Technical Report GMD Report 148, German National Research Center for Information Technology.
Jaeger, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless
communication. Science, 304:78–80.
Jain, V. and Seung, S. (2009). Natural image denoising with convolutional networks. In Koller, D.,
Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing
Systems (NIPS) 21, pages 769–776. Curran Associates, Inc.
Jameson, J. (1991). Delayed reinforcement learning with multiple time scale hierarchical backpropa-
gated adaptive critics. In Neural Networks for Control.
Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3D convolutional neural networks for human action
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231.
Jim, K., Giles, C. L., and Horne, B. G. (1995). Effects of noise on convergence and generalization
in recurrent networks. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural
Information Processing Systems (NIPS) 7, page 649. San Mateo, CA: Morgan Kaufmann.
Jin, X., Lujan, M., Plana, L. A., Davies, S., Temple, S., and Furber, S. B. (2010). Modeling spiking
neural networks on SpiNNaker. Computing in Science & Engineering, 12(5):91–97.
Jodogne, S. R. and Piater, J. H. (2007). Closed-loop learning of visual control policies. J. Artificial
Intelligence Research, 28:349–391.
Jones, J. P. and Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of
simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258.
Jordan, M. I. (1986). Serial order: A parallel distributed processing approach. Technical Report ICS
Report 8604, Institute for Cognitive Science, University of California, San Diego.
Jordan, M. I. (1988). Supervised learning and systems with excess degrees of freedom. Technical
Report COINS TR 88-27, Massachusetts Institute of Technology.
Jordan, M. I. (1997). Serial order: A parallel distributed processing approach. Advances in Psychol-
ogy, 121:471–495.
Jordan, M. I. and Rumelhart, D. E. (1990). Supervised learning with a distal teacher. Technical Report
Occasional Paper #40, Center for Cog. Sci., Massachusetts Institute of Technology.
Jordan, M. I. and Sejnowski, T. J. (2001). Graphical models: Foundations of neural computation.
MIT Press.
Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.
Juang, C.-F. (2004). A hybrid of genetic algorithm and particle swarm optimization for recurrent
network design. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,
34(2):997–1006.
Judd, J. S. (1990). Neural network design and the complexity of learning. Neural network modeling
and connectionism. MIT Press.
Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on
neuromimetic architecture. Signal Processing, 24(1):1–10.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. (1995). Planning and acting in partially
observable stochastic domains. Technical report, Brown University, Providence RI.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: a survey. Journal
of AI research, 4:237–285.
Kak, S., Chen, Y., and Wang, L. (2010). Data mining using surface and deep agents based on neural
networks. AMCIS 2010 Proceedings.
Kalinke, Y. and Lehmann, H. (1998). Computation in recurrent neural networks: From counters to
iterated function systems. In Antoniou, G. and Slaney, J., editors, Advanced Topics in Artificial
Intelligence, Proceedings of the 11th Australian Joint Conference on Artificial Intelligence, volume
1502 of LNAI, Berlin, Heidelberg. Springer.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic
Engineering, 82(1):35–45.
Karhunen, J. and Joutsensalo, J. (1995). Generalizations of principal component analysis, optimiza-
tion problems, and neural networks. Neural Networks, 8(4):549–562.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale
video classification with convolutional neural networks. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Kasabov, N. K. (2014). Neucube: A spiking neural network architecture for mapping, learning and
understanding of spatio-temporal brain data. Neural Networks.
Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10):947–954.
Kempter, R., Gerstner, W., and Van Hemmen, J. L. (1999). Hebbian learning and spiking neurons.
Physical Review E, 59(4):4498.
Kerlirzin, P. and Vallet, F. (1993). Robustness in multilayer perceptrons. Neural Computation,
5(1):473–482.
Khan, M. M., Khan, G. M., and Miller, J. F. (2010). Evolution of neural networks using Cartesian
Genetic Programming. In IEEE Congress on Evolutionary Computation (CEC), pages 1–8.
Khan, M. M., Lester, D. R., Plana, L. A., Rast, A., Jin, X., Painkras, E., and Furber, S. B. (2008).
SpiNNaker: mapping neural networks onto a massively-parallel chip multiprocessor. In Interna-
tional Joint Conference on Neural Networks (IJCNN), pages 2849–2856. IEEE.
Khan, S. H., Bennamoun, M., Sohel, F., and Togneri, R. (2014). Automatic feature learning for robust
shadow detection. In IEEE Conference on Computer Vision and Pattern Recognition CVPR.
Kimura, H., Miyazaki, K., and Kobayashi, S. (1997). Reinforcement learning in POMDPs with
function approximation. In ICML, volume 97, pages 152–160.
Kistler, W. M., Gerstner, W., and van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley
equations to a single-variable threshold model. Neural Computation, 9(5):1015–1045.
Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation system.
Complex Systems, 4:461–476.
Klampfl, S. and Maass, W. (2013). Emergence of dynamic memory traces in cortical microcircuit
models through STDP. The Journal of Neuroscience, 33(28):11515–11529.
Klapper-Rybicka, M., Schraudolph, N. N., and Schmidhuber, J. (2001). Unsupervised learning in
LSTM recurrent neural networks. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artifi-
cial Neural Networks (ICANN-2001), pages 684–691. Springer: Berlin, Heidelberg.
Kobatake, E. and Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral
visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71:856–867.
Kohl, N. and Stone, P. (2004). Policy gradient reinforcement learning for fast quadrupedal locomotion.
In Proceedings of the 2004 IEEE International Conference on Robotics and Automation (ICRA'04),
volume 3, pages 2619–2624. IEEE.
Kohonen, T. (1972). Correlation matrix memories. IEEE Transactions on Computers, C-21(4):353–
359.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological
Cybernetics, 43(1):59–69.
Kohonen, T. (1988). Self-Organization and Associative Memory. Springer, second edition.
Koikkalainen, P. and Oja, E. (1990). Self-organizing hierarchical feature maps. In International Joint
Conference on Neural Networks (IJCNN), pages 279–284. IEEE.
Kolmogorov, A. N. (1965a). On the representation of continuous functions of several variables by
superposition of continuous functions of one variable and addition. Doklady Akademii Nauk USSR,
114:679–681.
Korkin, M., de Garis, H., Gers, F., and Hemmi, H. (1997). CBM (CAM-Brain Machine) - a hardware
tool which evolves a neural net module in a fraction of a second and runs a million neuron artificial
brain in real time.
Kosko, B. (1990). Unsupervised learning in noise. IEEE Transactions on Neural Networks, 1(1):44–
57.
Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. (2013). Evolving large-scale neural
networks for vision-based reinforcement learning. In Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO), pages 1061–1068, Amsterdam. ACM.
Koutník, J., Gomez, F., and Schmidhuber, J. (2010). Evolving neural networks in compressed weight
space. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation,
pages 619–626.
Koutník, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A Clockwork RNN. In Proceedings
of the 31st International Conference on Machine Learning (ICML), volume 32, pages 1845–1853.
arXiv:1402.3511 [cs.NE].
Koza, J. R. (1992). Genetic Programming – On the Programming of Computers by Means of Natural
Selection. MIT Press.
Kramer, M. (1991). Nonlinear principal component analysis using autoassociative neural networks.
AIChE Journal, 37:233–243.
Kremer, S. C. and Kolen, J. F. (2001). Field guide to dynamical recurrent networks. Wiley-IEEE
Press.
Kriegeskorte, N., Mur, M., Ruff, D. A., Kiani, R., Bodurka, J., Esteky, H., Tanaka, K., and Bandettini,
P. A. (2008). Matching categorical object representations in inferior temporal cortex of man and
monkey. Neuron, 60(6):1126–1141.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep
convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS 2012),
page 4.
Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Lippman,
D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing
Systems 4, pages 950–957. Morgan Kaufmann.
Kruger, N., Janssen, P., Kalkan, S., Lappe, M., Leonardis, A., Piater, J., Rodriguez-Sanchez, A., and
Wiskott, L. (2013). Deep hierarchies in the primate visual cortex: What can we learn for computer
vision? IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1847–1871.
Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical
Statistics, pages 79–86.
Kurzweil, R. (2012). How to Create a Mind: The Secret of Human Thought Revealed.
Lagoudakis, M. G. and Parr, R. (2003). Least-squares policy iteration. JMLR, 4:1107–1149.
Lampinen, J. and Oja, E. (1992). Clustering properties of hierarchical self-organizing maps. Journal
of Mathematical Imaging and Vision, 2(2-3):261–272.
Lang, K., Waibel, A., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated
word recognition. Neural Networks, 3:23–43.
Lange, S. and Riedmiller, M. (2010). Deep auto-encoder neural networks in reinforcement learning.
In The 2010 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
Lapedes, A. and Farber, R. (1986). A self-optimizing, nonsymmetrical neural net for content address-
able memory and pattern recognition. Physica D, 22:247–259.
Laplace, P. (1774). Mémoire sur la probabilité des causes par les évènements. Mémoires de
l’Academie Royale des Sciences Presentés par Divers Savan, 6:621–656.
Larrañaga, P. and Lozano, J. A. (2001). Estimation of Distribution Algorithms: A New Tool for
Evolutionary Computation. Kluwer Academic Publishers, Norwell, MA, USA.
Le, Q. V., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng, A. Y. (2012).
Building high-level features using large scale unsupervised learning. In Proc. ICML’12.
LeCun, Y. (1985). Une procédure d’apprentissage pour réseau à seuil asymétrique. Proceedings of
Cognitiva 85, Paris, pages 599–604.
LeCun, Y. (1988). A theoretical framework for back-propagation. In Touretzky, D., Hinton, G.,
and Sejnowski, T., editors, Proceedings of the 1988 Connectionist Models Summer School, pages
21–28, CMU, Pittsburgh, Pa. Morgan Kaufmann.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D.
(1989). Back-propagation applied to handwritten zip code recognition. Neural Computation,
1(4):541–551.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D.
(1990a). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S.,
editor, Advances in Neural Information Processing Systems 2, pages 396–404. Morgan Kaufmann.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to docu-
ment recognition. Proceedings of the IEEE, 86(11):2278–2324.
LeCun, Y., Denker, J. S., and Solla, S. A. (1990b). Optimal brain damage. In Touretzky, D. S., editor,
Advances in Neural Information Processing Systems 2, pages 598–605. Morgan Kaufmann.
LeCun, Y., Muller, U., Cosatto, E., and Flepp, B. (2006). Off-road obstacle avoidance through end-
to-end learning. In Advances in Neural Information Processing Systems (NIPS 2005).
LeCun, Y., Simard, P., and Pearlmutter, B. (1993). Automatic learning rate maximization by on-line
estimation of the Hessian’s eigenvectors. In Hanson, S., Cowan, J., and Giles, L., editors, Advances
in Neural Information Processing Systems (NIPS 1992), volume 5. Morgan Kaufmann Publishers,
San Mateo, CA.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. (2007a). Efficient sparse coding algorithms. In Advances
in Neural Information Processing Systems (NIPS) 19, pages 801–808.
Lee, H., Ekanadham, C., and Ng, A. Y. (2007b). Sparse deep belief net model for visual area V2. In
Advances in Neural Information Processing Systems (NIPS), volume 7, pages 873–880.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for
scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Interna-
tional Conference on Machine Learning (ICML), pages 609–616.
Lee, H., Pham, P. T., Largman, Y., and Ng, A. Y. (2009b). Unsupervised feature learning for audio
classification using convolutional deep belief networks. In Proc. NIPS, volume 9, pages 1096–1104.
Lee, L. (1996). Learning of context-free languages: A survey of the literature. Technical Report
TR-12-96, Center for Research in Computing Technology, Harvard University, Cambridge, Mas-
sachusetts.
Lee, S. and Kil, R. M. (1991). A Gaussian potential function network with hierarchically self-
organizing learning. Neural Networks, 4(2):207–224.
Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des cometes. F. Didot.
Legenstein, R., Wilbert, N., and Wiskott, L. (2010). Reinforcement learning on slow features of
high-dimensional input streams. PLoS Computational Biology, 6(8).
Legenstein, R. A. and Maass, W. (2002). Neural circuits for pattern recognition with small total wire
length. Theor. Comput. Sci., 287(1):239–249.
Leibniz, G. W. (1676). Memoir using the chain rule (cited in TMME 7:2&3 p 321-332, 2010).
Leibniz, G. W. (1684). Nova methodus pro maximis et minimis, itemque tangentibus, quae nec
fractas, nec irrationales quantitates moratur, et singulare pro illis calculi genus. Acta Eruditorum,
pages 467–473.
Lenat, D. B. (1983). Theory formation by heuristic search. Machine Learning, 21.
Lenat, D. B. and Brown, J. S. (1984). Why AM and EURISKO appear to work. Artificial Intelligence,
23(3):269–294.
Lennie, P. and Movshon, J. A. (2005). Coding of color and form in the geniculostriate visual pathway.
Journal of the Optical Society of America A, 22(10):2013–2033.
Levenberg, K. (1944). A method for the solution of certain problems in least squares. Quarterly of
applied mathematics, 2:164–168.
Levin, A. U., Leen, T. K., and Moody, J. E. (1994). Fast pruning using principal components. In
Advances in Neural Information Processing Systems 6, page 35. Morgan Kaufmann.
Levin, A. U. and Narendra, K. S. (1995). Control of nonlinear dynamical systems using neural
networks. II. Observability, identification, and control. IEEE Transactions on Neural Networks,
7(1):30–42.
Levin, L. A. (1973a). On the notion of a random sequence. Soviet Math. Dokl., 14(5):1413–1416.
Levin, L. A. (1973b). Universal sequential search problems. Problems of Information Transmission,
9(3):265–266.
Lewicki, M. S. and Olshausen, B. A. (1998). Inferring sparse, overcomplete image codes using an
efficient coding framework. In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in
Neural Information Processing Systems (NIPS) 10, pages 815–821.
L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris:
L’Imprimerie Royale.
Li, M. and Vitányi, P. M. B. (1997). An Introduction to Kolmogorov Complexity and its Applications
(2nd edition). Springer.
Li, R., Zhang, W., Suk, H.-I., Wang, L., Li, J., Shen, D., and Ji, S. (2014). Deep learning based
imaging data completion for improved brain disease diagnosis. In Proc. MICCAI. Springer.
Lin, L. (1993). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie
Mellon University, Pittsburgh.
Lin, T., Horne, B., Tino, P., and Giles, C. (1996). Learning long-term dependencies in NARX recur-
rent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338.
Lindenmayer, A. (1968). Mathematical models for cellular interaction in development. J. Theoret.
Biology, 18:280–315.
Lindstädt, S. (1993). Comparison of two unsupervised neural network models for redundancy reduc-
tion. In Mozer, M. C., Smolensky, P., Touretzky, D. S., Elman, J. L., and Weigend, A. S., editors,
Proc. of the 1993 Connectionist Models Summer School, pages 308–315. Hillsdale, NJ: Erlbaum
Associates.
Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a
Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki.
Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathe-
matics, 16(2):146–160.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21:105–117.
Littman, M. L., Cassandra, A. R., and Kaelbling, L. P. (1995). Learning policies for partially ob-
servable environments: Scaling up. In Prieditis, A. and Russell, S., editors, Machine Learning:
Proceedings of the Twelfth International Conference, pages 362–370. Morgan Kaufmann Publish-
ers, San Francisco, CA.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., Burg, T., Douglas, R., et al. (2001). Orientation-
selective aVLSI spiking neurons. Neural Networks, 14(6-7):629–643.
Ljung, L. (1998). System identification. Springer.
Logothetis, N. K., Pauls, J., and Poggio, T. (1995). Shape representation in the inferior temporal
cortex of monkeys. Current Biology, 5(5):552–563.
Loiacono, D., Cardamone, L., and Lanzi, P. L. (2011). Simulated car racing championship competi-
tion software manual. Technical report, Dipartimento di Elettronica e Informazione, Politecnico di
Milano, Italy.
Loiacono, D., Lanzi, P. L., Togelius, J., Onieva, E., Pelta, D. A., Butz, M. V., Lönneker, T. D.,
Cardamone, L., Perez, D., Sáez, Y., Preuss, M., and Quadflieg, J. (2009). The 2009 simulated car
racing championship.
Lowe, D. (1999). Object recognition from local scale-invariant features. In The Proceedings of the
Seventh IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157.
Lowe, D. (2004). Distinctive image features from scale-invariant key-points. Intl. Journal of Com-
puter Vision, 60:91–110.
Luciw, M., Kompella, V. R., Kazerounian, S., and Schmidhuber, J. (2013). An intrinsic value system
for developing multiple invariant representations with incremental slowness learning. Frontiers in
Neurorobotics, 7(9).
Lusci, A., Pollastri, G., and Baldi, P. (2013). Deep architectures and deep learning in chemoinformat-
ics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information
and Modeling, 53(7):1563–1575.
Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network
acoustic models. In International Conference on Machine Learning (ICML).
Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural
Computation, 8(1):1–40.
Maass, W. (1997). Networks of spiking neurons: the third generation of neural network models.
Neural Networks, 10(9):1659–1671.
Maass, W. (2000). On the computational power of winner-take-all. Neural Computation, 12:2519–
2535.
Maass, W., Natschläger, T., and Markram, H. (2002). Real-time computing without stable states: A
new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–
2560.
MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computa-
tion, 4:448–472.
MacKay, D. J. C. and Miller, K. D. (1990). Analysis of Linsker’s simulation of Hebbian rules. Neural
Computation, 2:173–187.
Maclin, R. and Shavlik, J. W. (1993). Using knowledge-based neural networks to improve algorithms:
Refining the Chou-Fasman algorithm for protein folding. Machine Learning, 11(2-3):195–215.
Maclin, R. and Shavlik, J. W. (1995). Combining the predictions of multiple classifiers: Using com-
petitive learning to initialize neural networks. In Proc. IJCAI, pages 524–531.
Madala, H. R. and Ivakhnenko, A. G. (1994). Inductive learning algorithms for complex systems
modeling. CRC Press, Boca Raton.
Madani, O., Hanks, S., and Condon, A. (2003). On the undecidability of probabilistic planning and
related stochastic optimization problems. Artificial Intelligence, 147(1):5–34.
Maei, H. R. and Sutton, R. S. (2010). GQ(λ): A general gradient algorithm for temporal-difference
prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial
General Intelligence, volume 1, pages 91–96.
Maex, R. and Orban, G. (1996). Model circuit of spiking neurons generating directional selectivity in
simple cells. Journal of Neurophysiology, 75(4):1515–1545.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empir-
ical results. Machine Learning, 22:159.
Malik, J. and Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms.
Journal of the Optical Society of America A, 7(5):923–932.
Maniezzo, V. (1994). Genetic evolution of the topology and weight distribution of neural networks.
IEEE Transactions on Neural Networks, 5(1):39–53.
Manolios, P. and Fanelli, R. (1994). First-order recurrent neural networks and deterministic finite
state automata. Neural Computation, 6:1155–1173.
Marchi, E., Ferroni, G., Eyben, F., Gabrielli, L., Squartini, S., and Schuller, B. (2014). Multi-
resolution linear prediction based features for audio onset detection with bidirectional LSTM neural
networks. In Proc. 39th IEEE International Conference on Acoustics, Speech, and Signal Process-
ing, ICASSP 2014, Florence, Italy, pages 2183–2187.
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimiza-
tion. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages
1033–1040.
Martinetz, T. M., Ritter, H. J., and Schulten, K. J. (1990). Three-dimensional neural net for learning
visuomotor coordination of a robot arm. IEEE Transactions on Neural Networks, 1(1):131–136.
Masci, J., Giusti, A., Ciresan, D. C., Fricout, G., and Schmidhuber, J. (2013). A fast learning algorithm
for image segmentation with max-pooling convolutional networks. In International Conference on
Image Processing (ICIP13), pages 2713–2717.
Matsuoka, K. (1992). Noise injection into inputs in back-propagation learning. IEEE Transactions
on Systems, Man, and Cybernetics, 22(3):436–440.
Mayer, H., Gomez, F., Wierstra, D., Nagy, I., Knoll, A., and Schmidhuber, J. (2008). A system for
robotic heart surgery that learns to tie knots using recurrent neural networks. Advanced Robotics,
22(13-14):1521–1537.
McCallum, R. A. (1996). Learning to use selective attention and short-term memory in sequential
tasks. In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson, S. W., editors, From Ani-
mals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive
Behavior, Cambridge, MA, pages 315–324. MIT Press, Bradford Books.
McCulloch, W. and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 7:115–133.
Melnik, O., Levy, S. D., and Pollack, J. B. (2000). RAAM for infinite context-free languages. In
Proc. IJCNN (5), pages 585–590.
Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations with factored
higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492.
Menache, I., Mannor, S., and Shimkin, N. (2002). Q-cut – dynamic discovery of sub-goals in rein-
forcement learning. In Proc. ECML’02, pages 295–306.
Merolla, P. A., Arthur, J. V., Alvarez-Icaza, R., Cassidy, A. S., Sawada, J., Akopyan, F., Jackson,
B. L., Imam, N., Guo, C., Nakamura, Y., Brezzo, B., Vo, I., Esser, S. K., Appuswamy, R., Taba,
B., Amir, A., Flickner, M. D., Risk, W. P., Manohar, R., and Modha, D. S. (2014). A million
spiking-neuron integrated circuit with a scalable communication network and interface. Science,
345(6197):668–673.
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X.,
Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised
and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised
and Transfer Learning, volume 7.
Meuleau, N., Peshkin, L., Kim, K. E., and Kaelbling, L. P. (1999). Learning finite state controllers for
partially observable environments. In 15th International Conference on Uncertainty in AI, pages
427–436.
Miglino, O., Lund, H., and Nolfi, S. (1995). Evolving mobile robots in simulated and real environ-
ments. Artificial Life, 2(4):417–434.
Miller, G., Todd, P., and Hedge, S. (1989). Designing neural networks using genetic algorithms. In
Proceedings of the 3rd International Conference on Genetic Algorithms, pages 379–384. Morgan
Kauffman.
Miller, J. F. and Harding, S. L. (2009). Cartesian genetic programming. In Proceedings of the 11th An-
nual Conference Companion on Genetic and Evolutionary Computation Conference: Late Break-
ing Papers, pages 3489–3512. ACM.
Miller, J. F. and Thomson, P. (2000). Cartesian genetic programming. In Genetic Programming, pages
121–132. Springer.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered
arrangement of orientation columns through activity-dependent competition between on- and off-
center inputs. Journal of Neuroscience, 14(1):409–441.
Miller, W. T., Werbos, P. J., and Sutton, R. S. (1995). Neural networks for control. MIT Press.
Minai, A. A. and Williams, R. D. (1994). Perturbation response in feedforward networks. Neural
Networks, 7(5):783–796.
Minsky, M. (1963). Steps toward artificial intelligence. In Feigenbaum, E. and Feldman, J., editors,
Computers and Thought, pages 406–450. McGraw-Hill, New York.
Minsky, M. and Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Minton, S., Carbonell, J. G., Knoblock, C. A., Kuokka, D. R., Etzioni, O., and Gil, Y. (1989).
Explanation-based learning: A problem solving perspective. Artificial Intelligence, 40(1):63–118.
Montana, D. J. and Davis, L. (1989). Training feedforward neural networks using genetic algorithms.
In Proceedings of the 11th International Joint Conference on Artificial Intelligence (IJCAI) - Vol-
ume 1, IJCAI’89, pages 762–767, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Montavon, G., Orr, G., and Müller, K. (2012). Neural Networks: Tricks of the Trade. Number LNCS
7700 in Lecture Notes in Computer Science Series. Springer Verlag.
Moody, J. E. (1989). Fast learning in multi-resolution hierarchies. In Touretzky, D. S., editor, Ad-
vances in Neural Information Processing Systems (NIPS) 1, pages 29–39. Morgan Kaufmann.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regu-
larization in nonlinear learning systems. In Lippman, D. S., Moody, J. E., and Touretzky, D. S.,
editors, Advances in Neural Information Processing Systems (NIPS) 4, pages 847–854. Morgan
Kaufmann.
Moody, J. E. and Utans, J. (1994). Architecture selection strategies for neural networks: Application
to corporate bond rating prediction. In Refenes, A. N., editor, Neural Networks in the Capital
Markets. John Wiley & Sons.
Moore, A. and Atkeson, C. (1995). The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. Machine Learning, 21(3):199–233.
Moore, A. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data
and less time. Machine Learning, 13:103–130.
Moriarty, D. E. (1997). Symbiotic Evolution of Neural Networks in Sequential Decision Tasks. PhD
thesis, Department of Computer Sciences, The University of Texas at Austin.
Mosteller, F. and Tukey, J. W. (1968). Data analysis, including statistics. In Lindzey, G. and Aronson,
E., editors, Handbook of Social Psychology, Vol. 2. Addison-Wesley.
Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition.
Complex Systems, 3:349–381.
Mozer, M. C. (1991). Discovering discrete distributed representations with iterative competitive learn-
ing. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information
Processing Systems 3, pages 627–634. Morgan Kaufmann.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In Lippman, D. S., Moody, J. E.,
and Touretzky, D. S., editors, Advances in Neural Information Processing Systems (NIPS) 4, pages
275–282. Morgan Kaufmann.
Mozer, M. C. and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a
network via relevance assessment. In Touretzky, D. S., editor, Advances in Neural Information
Processing Systems (NIPS) 1, pages 107–115. Morgan Kaufmann.
Muller, U. A., Gunzinger, A., and Guggenbühl, W. (1995). Fast neural net simulation with a DSP
processor array. IEEE Transactions on Neural Networks, 6(1):203–213.
Munro, P. W. (1987). A dual back-propagation scheme for scalar reinforcement learning. Proceedings
of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165–176.
Murray, A. F. and Edwards, P. J. (1993). Synaptic weight noise during MLP learning enhances fault-
tolerance, generalisation and learning trajectory. In S. J. Hanson, J. D. C. and Giles, C. L., editors,
Advances in Neural Information Processing Systems (NIPS) 5, pages 491–498. San Mateo, CA:
Morgan Kaufmann.
Nadal, J.-P. and Parga, N. (1994). Non-linear neurons in the low noise limit: a factorial code max-
imises information transfer. Network, 5:565–581.
Nagumo, J., Arimoto, S., and Yoshizawa, S. (1962). An active pulse transmission line simulating
nerve axon. Proceedings of the IRE, 50(10):2061–2070.
Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In
International Conference on Machine Learning (ICML).
Narendra, K. S. and Parthasarathy, K. (1990). Identification and control of dynamical systems using
neural networks. IEEE Transactions on Neural Networks, 1(1):4–27.
Narendra, K. S. and Thathachar, M. A. L. (1974). Learning automata – a survey. IEEE Transactions
on Systems, Man, and Cybernetics, 4:323–334.
Neal, R. M. (1995). Bayesian learning for neural networks. PhD thesis, University of Toronto.
Neal, R. M. (2006). Classification with Bayesian neural networks. In Quinonero-Candela, J., Magnini,
B., Dagan, I., and D’Alche-Buc, F., editors, Machine Learning Challenges. Evaluating Predictive
Uncertainty, Visual Object Classification, and Recognising Textual Entailment, volume 3944 of
Lecture Notes in Computer Science, pages 28–32. Springer.
Neal, R. M. and Zhang, J. (2006). High dimensional classification with Bayesian neural networks and
Dirichlet diffusion trees. In Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. A., editors, Feature
Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, pages 265–
295. Springer.
Neftci, E., Das, S., Pedroni, B., Kreutz-Delgado, K., and Cauwenberghs, G. (2014). Event-driven
contrastive divergence for spiking neuromorphic systems. Frontiers in Neuroscience, 7(272).
Neil, D. and Liu, S.-C. (2014). Minitaur, an event-driven FPGA-based spiking network accelerator.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, PP(99):1–8.
Nessler, B., Pfeiffer, M., Buesing, L., and Maass, W. (2013). Bayesian computation emerges in
generic cortical microcircuits through spike-timing-dependent plasticity. PLoS Computational Bi-
ology, 9(4):e1003037.
Neti, C., Schneider, M. H., and Young, E. D. (1992). Maximally fault tolerant neural networks.
IEEE Transactions on Neural Networks, 3:14–23.
Neuneier, R. and Zimmermann, H.-G. (1996). How to train neural networks. In Orr, G. B. and Müller,
K.-R., editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer
Science, pages 373–423. Springer.
Newton, I. (1687). Philosophiae naturalis principia mathematica. William Dawson & Sons Ltd.,
London.
Nguyen, N. and Widrow, B. (1989). The truck backer-upper: An example of self learning in neural
networks. In Proceedings of the International Joint Conference on Neural Networks, pages 357–
363. IEEE Press.
Nilsson, N. J. (1980). Principles of artificial intelligence. Morgan Kaufmann, San Francisco, CA,
USA.
Nolfi, S., Floreano, D., Miglino, O., and Mondada, F. (1994a). How to evolve autonomous robots:
Different approaches in evolutionary robotics. In Brooks, R. A. and Maes, P., editors, Fourth
International Workshop on the Synthesis and Simulation of Living Systems (Artificial Life IV), pages
190–197. MIT.
Nolfi, S., Parisi, D., and Elman, J. L. (1994b). Learning and evolution in neural networks. Adaptive
Behavior, 3(1):5–28.
Nowak, E., Jurie, F., and Triggs, B. (2006). Sampling strategies for bag-of-features image classifica-
tion. In Proc. ECCV 2006, pages 490–503. Springer.
Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight sharing. Neural
Computation, 4:173–193.
O’Connor, P., Neil, D., Liu, S.-C., Delbruck, T., and Pfeiffer, M. (2013). Real-time classification and
sensor fusion with a spiking deep belief network. Frontiers in Neuroscience, 7(178).
Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition,
37(6):1311–1314.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of
Neural Systems, 1(1):61–68.
Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural
networks. In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural
Networks, volume 1, pages 737–745. Elsevier Science Publishers B.V., North-Holland.
Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by
learning a sparse code for natural images. Nature, 381(6583):607–609.
Omlin, C. and Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks.
Neural Networks, 9(1):41–52.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2013). Learning and transferring mid-level image
representations using convolutional neural networks. Technical Report hal-00911179.
O’Reilly, R. (2003). Making working memory work: A computational model of learning in the
prefrontal cortex and basal ganglia. Technical Report ICS-03-03, ICS.
O’Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences:
The generalized recirculation algorithm. Neural Computation, 8(5):895–938.
Orr, G. and Müller, K. (1998). Neural Networks: Tricks of the Trade. Number LNCS 1524 in Lecture
Notes in Computer Science Series. Springer Verlag.
Ostrovskii, G. M., Volin, Y. M., and Borisov, W. W. (1971). Über die Berechnung von Ableitungen.
(On the computation of derivatives.) Wiss. Z. Tech. Hochschule für Chemie, 13:382–384.
Otsuka, M. (2010). Goal-Oriented Representation of the External World: A Free-Energy-Based Ap-
proach. PhD thesis, Nara Institute of Science and Technology.
Otsuka, M., Yoshimoto, J., and Doya, K. (2010). Free-energy-based reinforcement learning in a
partially observable environment. In Proc. ESANN.
Otte, S., Krechel, D., Liwicki, M., and Dengel, A. (2012). Local feature based online mode detection
with recurrent neural networks. In Proceedings of the 2012 International Conference on Frontiers
in Handwriting Recognition, pages 533–537. IEEE Computer Society.
Oudeyer, P.-Y., Baranes, A., and Kaplan, F. (2013). Intrinsically motivated learning of real world
sensorimotor skills with developmental constraints. In Baldassarre, G. and Mirolli, M., editors,
Intrinsically Motivated Learning in Natural and Artificial Systems. Springer.
O'Reilly, R. C., Wyatte, D., Herd, S., Mingus, B., and Jilk, D. J. (2013). Recurrent processing during
object recognition. Frontiers in Psychology, 4:124.
Pachitariu, M. and Sahani, M. (2013). Regularization and nonlinearities for neural language models:
when are they needed? arXiv preprint arXiv:1301.5650.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36.
Palm, G. (1992). On the information storage capacity of local learning rules. Neural Computation,
4(2):703–711.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and
Data Engineering, 22(10):1345–1359.
Parekh, R., Yang, J., and Honavar, V. (2000). Constructive neural network learning algorithms for
multi-category pattern classification. IEEE Transactions on Neural Networks, 11(2):436–451.
Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research in Eco-
nomics and Management Sci., MIT.
Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2013a). How to construct deep recurrent neural
networks. arXiv preprint arXiv:1312.6026.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013b). On the difficulty of training recurrent neural
networks. In Proc. ICML’13, JMLR W&CP volume 28.
Pasemann, F., Steinmetz, U., and Dieckman, U. (1999). Evolving structure and function of neurocon-
trollers. In Angeline, P. J., Michalewicz, Z., Schoenauer, M., Yao, X., and Zalzala, A., editors, Pro-
ceedings of the Congress on Evolutionary Computation, volume 3, pages 1973–1978, Mayflower
Hotel, Washington D.C., USA. IEEE Press.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural
Computation, 1(2):263–269.
Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–
160.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey.
IEEE Transactions on Neural Networks, 6(5):1212–1228.
Pearlmutter, B. A. and Hinton, G. E. (1986). G-maximization: An unsupervised learning procedure
for discovering regularities. In Denker, J. S., editor, Neural Networks for Computing: American
Institute of Physics Conference Proceedings 151, volume 2, pages 333–338.
Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-learning. Machine Learning, 22:283–
290.
Pérez-Ortiz, J. A., Gers, F. A., Eck, D., and Schmidhuber, J. (2003). Kalman filters improve LSTM
network performance in problems unsolvable by traditional recurrent nets. Neural Networks,
16:241–250.
Perrett, D., Hietanen, J., Oram, M., Benson, P., and Rolls, E. (1992). Organization and functions of
cells responsive to faces in the temporal cortex [and discussion]. Philosophical Transactions of the
Royal Society of London. Series B: Biological Sciences, 335(1273):23–30.
Perrett, D., Rolls, E., and Caan, W. (1982). Visual neurones responsive to faces in the monkey
temporal cortex. Experimental Brain Research, 47(3):329–342.
Peters, J. (2010). Policy gradient methods. Scholarpedia, 5(11):3698.
Peters, J. and Schaal, S. (2008a). Natural actor-critic. Neurocomputing, 71:1180–1190.
Peters, J. and Schaal, S. (2008b). Reinforcement learning of motor skills with policy gradients. Neural
Networks, 21(4):682–697.
Pham, V., Kermorvant, C., and Louradour, J. (2013). Dropout Improves Recurrent Neural Networks
for Handwriting Recognition. arXiv preprint arXiv:1312.4569.
Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical
Review Letters, 59(19):2229–2232.
Plate, T. A. (1993). Holographic recurrent networks. In S. J. Hanson, J. D. C. and Giles, C. L., editors,
Advances in Neural Information Processing Systems (NIPS) 5, pages 34–41. Morgan Kaufmann.
Plumbley, M. D. (1991). On information theory and unsupervised neural networks. Dissertation,
published as technical report CUED/F-INFENG/TR.78, Engineering Department, Cambridge Uni-
versity.
Pollack, J. B. (1988). Implications of recursive distributed representations. In Proc. NIPS, pages
527–536.
Pollack, J. B. (1990). Recursive distributed representation. Artificial Intelligence, 46:77–105.
Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., and Mishchenko, E. F. (1961). The
Mathematical Theory of Optimal Processes.
Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In IEEE
International Conference on Computer Vision (ICCV) Workshops, pages 689–690. IEEE.
Post, E. L. (1936). Finite combinatory processes-formulation 1. The Journal of Symbolic Logic,
1(3):103–105.
Prasoon, A., Petersen, K., Igel, C., Lauze, F., Dam, E., and Nielsen, M. (2013). Voxel classification
based on triplanar convolutional neural networks applied to cartilage segmentation in knee MRI. In
Medical Image Computing and Computer Assisted Intervention (MICCAI), volume 8150 of LNCS,
pages 246–253. Springer.
Precup, D., Sutton, R. S., and Singh, S. (1998). Multi-time models for temporally abstract planning. In
Advances in Neural Information Processing Systems (NIPS), pages 1050–1056. Morgan Kaufmann.
Prokhorov, D. (2010). A convolutional learning system for object classification in 3-D LIDAR data.
IEEE Transactions on Neural Networks, 21(5):858–863.
Prokhorov, D., Puskorius, G., and Feldkamp, L. (2001). Dynamical neural networks for control. In
Kolen, J. and Kremer, S., editors, A field guide to dynamical recurrent networks, pages 23–78.
IEEE Press.
Prokhorov, D. and Wunsch, D. (1997). Adaptive critic design. IEEE Transactions on Neural Net-
works, 8(5):997–1007.
Prokhorov, D. V., Feldkamp, L. A., and Tyukin, I. Y. (2002). Adaptive behavior with fixed weights in
RNN: an overview. In Proceedings of the IEEE International Joint Conference on Neural Networks
(IJCNN), pages 2018–2023.
Puskorius, G. V. and Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with
Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2):279–297.
Raiko, T., Valpola, H., and LeCun, Y. (2012). Deep learning made easier by linear transformations in
perceptrons. In International Conference on Artificial Intelligence and Statistics, pages 924–932.
Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics
processors. In Proceedings of the 26th Annual International Conference on Machine Learning
(ICML), pages 873–880. ACM.
Ramacher, U., Raab, W., Anlauf, J., Hachmann, U., Beichter, J., Bruels, N., Wesseling, M., Sich-
eneder, E., Maenner, R., Glaess, J., and Wurz, A. (1993). Multiprocessor and memory architecture
of the neurocomputer SYNAPSE-1. International Journal of Neural Systems, 4(4):333–336.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2006). Efficient learning of sparse
representations with an energy-based model. In Platt, J. et al., editors, Advances in Neural
Information Processing Systems (NIPS 2006). MIT Press.
Ranzato, M. A., Huang, F., Boureau, Y., and LeCun, Y. (2007). Unsupervised learning of invariant
feature hierarchies with applications to object recognition. In Proc. Computer Vision and Pattern
Recognition Conference (CVPR’07), pages 1–8. IEEE Press.
Rauber, A., Merkl, D., and Dittenbach, M. (2002). The growing hierarchical self-organizing map: ex-
ploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks, 13(6):1331–
1341.
Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S. (2014). CNN features off-the-shelf: an
astounding baseline for recognition. arXiv preprint arXiv:1403.6382.
Rechenberg, I. (1971). Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. (Evolution strategy: optimization of technical systems according to
principles of biological evolution.) Dissertation. Published 1973 by Fromman-Holzboog.
Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Com-
putation, 5:289–304.
Refenes, N. A., Zapranis, A., and Francis, G. (1994). Stock performance modeling using neural
networks: a comparative study with regression models. Neural Networks, 7(2):375–388.
Rezende, D. J. and Gerstner, W. (2014). Stochastic variational learning in recurrent spiking networks.
Frontiers in Computational Neuroscience, 8:38.
Riedmiller, M. (2005). Neural fitted Q iteration—first experiences with a data efficient neural rein-
forcement learning method. In Proc. ECML-2005, pages 317–328. Springer-Verlag Berlin Heidel-
berg.
Riedmiller, M. and Braun, H. (1993). A direct adaptive method for faster backpropagation learning:
The Rprop algorithm. In Proc. IJCNN, pages 586–591. IEEE Press.
Riedmiller, M., Lange, S., and Voigtlaender, A. (2012). Autonomous reinforcement learning on raw
visual input data in a real world application. In International Joint Conference on Neural Networks
(IJCNN), pages 1–8, Brisbane, Australia.
Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat.
Neurosci., 2(11):1019–1025.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders:
Explicit invariance during feature extraction. In Proceedings of the 28th International Conference
on Machine Learning (ICML-11), pages 833–840.
Ring, M., Schaul, T., and Schmidhuber, J. (2011). The two-dimensional organization of behavior. In
Proceedings of the First Joint Conference on Development Learning and on Epigenetic Robotics
ICDL-EPIROB, Frankfurt.
Ring, M. B. (1991). Incremental development of complex behaviors through automatic construc-
tion of sensory-motor hierarchies. In Birnbaum, L. and Collins, G., editors, Machine Learning:
Proceedings of the Eighth International Workshop, pages 343–347. Morgan Kaufmann.
Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson,
J. D. C. and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 115–
122. Morgan Kaufmann.
Ring, M. B. (1994). Continual Learning in Reinforcement Environments. PhD thesis, University of
Texas at Austin, Austin, Texas 78712.
Risi, S. and Stanley, K. O. (2012). A unified approach to evolving plasticity and neural geometry. In
International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.
Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14(3):1080–1100.
Ritter, H. and Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics,
61(4):241–254.
Robinson, A. J. and Fallside, F. (1987). The utility driven dynamic error propagation network. Tech-
nical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department.
Robinson, T. and Fallside, F. (1989). Dynamic reinforcement driven error propagation networks
with application to game playing. In Proceedings of the 11th Conference of the Cognitive Science
Society, Ann Arbor, pages 836–843.
Rodriguez, P. and Wiles, J. (1998). Recurrent neural networks can learn to implement symbol-
sensitive counting. In Advances in Neural Information Processing Systems (NIPS), volume 10,
pages 87–93. The MIT Press.
Rodriguez, P., Wiles, J., and Elman, J. (1999). A recurrent neural network that learns to count.
Connection Science, 11(1):5–40.
Roggen, D., Hofmann, S., Thoma, Y., and Floreano, D. (2003). Hardware spiking neural network with
run-time reconfigurable connectivity in an autonomous robot. In Proc. NASA/DoD Conference on
Evolvable Hardware, 2003, pages 189–198. IEEE.
Rohwer, R. (1989). The ‘moving targets’ training method. In Kindermann, J. and Linden, A., editors,
Proceedings of ‘Distributed Adaptive Neural Information Processing’, St. Augustin, 24–25 May.
Oldenbourg.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization
in the brain. Psychological review, 65(6):386.
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.
Roux, L., Racoceanu, D., Lomenie, N., Kulikova, M., Irshad, H., Klossa, J., Capron, F., Genestie, C.,
Naour, G. L., and Gurcan, M. N. (2013). Mitosis detection in breast cancer histological images -
an ICPR 2012 contest. J. Pathol. Inform., 4:8.
Rubner, J. and Schulten, K. (1990). Development of feature detectors by self-organization: A network
model. Biological Cybernetics, 62:193–199.
Rückstieß, T., Felder, M., and Schmidhuber, J. (2008). State-Dependent Exploration for policy
gradient methods. In Daelemans, W. et al., editors, European Conference on Machine Learning
(ECML) and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI
5212, pages 234–249.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error
propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing,
volume 1, pages 318–362. MIT Press.
Rumelhart, D. E. and Zipser, D. (1986). Feature discovery by competitive learning. In Parallel
Distributed Processing, pages 151–193. MIT Press.
Rummery, G. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical
Report CUED/F-INFENG-TR 166, Cambridge University, UK.
Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., and Edwards, D. D. (1995). Artificial Intelligence:
a Modern Approach, volume 2. Englewood Cliffs: Prentice Hall.
Saito, K. and Nakano, R. (1997). Partial BFGS update and efficient step-length calculation for three-
layer neural networks. Neural Computation, 9(1):123–141.
Sak, H., Senior, A., and Beaufays, F. (2014a). Long Short-Term Memory recurrent neural network
architectures for large scale acoustic modeling. In Proc. Interspeech.
Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., and Mao, M. (2014b). Se-
quence discriminative distributed training of Long Short-Term Memory recurrent neural networks.
In Proc. Interspeech.
Salakhutdinov, R. and Hinton, G. (2009). Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–
978.
Sallans, B. and Hinton, G. (2004). Reinforcement learning with factored states and actions. Journal
of Machine Learning Research, 5:1063–1088.
Sałustowicz, R. P. and Schmidhuber, J. (1997). Probabilistic incremental program evolution. Evolu-
tionary Computation, 5(2):123–141.
Samejima, K., Doya, K., and Kawato, M. (2003). Inter-module credit assignment in modular rein-
forcement learning. Neural Networks, 16(7):985–994.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal on
Research and Development, 3:210–229.
Sanger, T. D. (1989). An optimality principle for unsupervised learning. In Touretzky, D. S., editor,
Advances in Neural Information Processing Systems (NIPS) 1, pages 11–19. Morgan Kaufmann.
Santamaría, J. C., Sutton, R. S., and Ram, A. (1997). Experiments with reinforcement learning in
problems with continuous state and action spaces. Adaptive Behavior, 6(2):163–217.
Saravanan, N. and Fogel, D. B. (1995). Evolving neural control systems. IEEE Expert, pages 23–27.
Saund, E. (1994). Unsupervised learning of mixtures of multiple causes in binary data. In Cowan,
J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems
(NIPS) 6, pages 27–34. Morgan Kaufmann.
Schemmel, J., Grubl, A., Meier, K., and Mueller, E. (2006). Implementing synaptic plasticity in
a VLSI spiking neural network model. In International Joint Conference on Neural Networks
(IJCNN), pages 1–6. IEEE.
Scherer, D., Müller, A., and Behnke, S. (2010). Evaluation of pooling operations in convolutional ar-
chitectures for object recognition. In Proc. International Conference on Artificial Neural Networks
(ICANN), pages 92–101.
Schmidhuber, J. (1987). Evolutionary principles in self-referential learning, or on learning
how to learn: the meta-meta-... hook. Diploma thesis, Inst. f. Inf., Tech. Univ. Munich.
http://www.idsia.ch/~juergen/diploma.html.
Schmidhuber, J. (1989a). Accelerated learning in back-propagation nets. In Pfeifer, R., Schreter, Z.,
Fogelman, Z., and Steels, L., editors, Connectionism in Perspective, pages 429 – 438. Amsterdam:
Elsevier, North-Holland.
Schmidhuber, J. (1989b). A local learning algorithm for dynamic feedforward and recurrent networks.
Connection Science, 1(4):403–412.
Schmidhuber, J. (1990a). Dynamische neuronale Netze und das fundamentale raumzeitliche Lern-
problem. (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem.)
Dissertation, Inst. f. Inf., Tech. Univ. Munich.
Schmidhuber, J. (1990b). Learning algorithms for networks with internal and external feedback. In
Touretzky, D. S., Elman, J. L., Sejnowski, T. J., and Hinton, G. E., editors, Proc. of the 1990
Connectionist Models Summer School, pages 52–61. Morgan Kaufmann.
Schmidhuber, J. (1990c). The Neural Heat Exchanger. Talks at TU Munich (1990), University
of Colorado at Boulder (1992), and Z. Li’s NIPS*94 workshop on unsupervised learning. Also
published at the Intl. Conference on Neural Information Processing (ICONIP’96), vol. 1, pages
194-197, 1996.
Schmidhuber, J. (1990d). An on-line algorithm for dynamic reinforcement learning and planning in
reactive environments. In Proc. IEEE/INNS International Joint Conference on Neural Networks,
San Diego, volume 2, pages 253–258.
Schmidhuber, J. (1991a). Curious model-building control systems. In Proceedings of the International
Joint Conference on Neural Networks, Singapore, volume 2, pages 1458–1463. IEEE press.
Schmidhuber, J. (1991b). Learning to generate sub-goals for action sequences. In Kohonen, T.,
Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural Networks, pages 967–972.
Elsevier Science Publishers B.V., North-Holland.
Schmidhuber, J. (1991c). Reinforcement learning in Markovian and non-Markovian environments.
In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information
Processing Systems 3 (NIPS 3), pages 500–506. Morgan Kaufmann.
Schmidhuber, J. (1992a). A fixed size storage O(n^3) time complexity learning algorithm for fully
recurrent continually running networks. Neural Computation, 4(2):243–248.
Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history com-
pression. Neural Computation, 4(2):234–242. (Based on TR FKI-148-91, TUM, 1991).
Schmidhuber, J. (2002). The Speed Prior: a new simplicity measure yielding near-optimal computable
predictions. In Kivinen, J. and Sloan, R. H., editors, Proceedings of the 15th Annual Conference
on Computational Learning Theory (COLT 2002), Lecture Notes in Artificial Intelligence, pages
216–228. Springer, Sydney, Australia.
Schmidhuber, J. (2004). Optimal ordered problem solver. Machine Learning, 54:211–254.
Schmidhuber, J. (2006a). Developmental robotics, optimal artificial curiosity, creativity, music, and
the fine arts. Connection Science, 18(2):173–187.
Schmidhuber, J. (2006b). Gödel machines: Fully self-referential optimal universal self-improvers. In
Goertzel, B. and Pennachin, C., editors, Artificial General Intelligence, pages 199–226. Springer
Verlag. Variant available as arXiv:cs.LO/0309048.
Schmidhuber, J. (2013b). PowerPlay: Training an Increasingly General Problem Solver by
Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Psychology.
Schmidhuber, J., Ciresan, D., Meier, U., Masci, J., and Graves, A. (2011). On fast deep nets for AGI
vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain
View, CA, pages 243–246.
Schmidhuber, J., Eldracher, M., and Foltin, B. (1996). Semilinear predictability minimization pro-
duces well-known feature detectors. Neural Computation, 8(4):773–786.
Schmidhuber, J. and Huber, R. (1991). Learning to generate artificial fovea trajectories for target
detection. International Journal of Neural Systems, 2(1 & 2):135–141.
Schmidhuber, J., Mozer, M. C., and Prelinger, D. (1993). Continuous history compression. In Hüning,
H., Neuhauser, S., Raus, M., and Ritschel, W., editors, Proc. of Intl. Workshop on Neural Networks,
RWTH Aachen, pages 87–95. Augustinus.
Schmidhuber, J. and Prelinger, D. (1992). Discovering predictable classifications. Technical Report
CU-CS-626-92, Dept. of Comp. Sci., University of Colorado at Boulder. Published in Neural
Computation 5(4):625-635 (1993).
Schmidhuber, J. and Wahnsiedler, R. (1992). Planning simple trajectories using neural subgoal gen-
erators. In Meyer, J. A., Roitblat, H. L., and Wilson, S. W., editors, Proc. of the 2nd International
Conference on Simulation of Adaptive Behavior, pages 196–202. MIT Press.
Schmidhuber, J., Wierstra, D., Gagliolo, M., and Gomez, F. J. (2007). Training recurrent networks by
Evolino. Neural Computation, 19(3):757–779.
Schmidhuber, J., Zhao, J., and Schraudolph, N. (1997a). Reinforcement learning with self-modifying
policies. In Thrun, S. and Pratt, L., editors, Learning to learn, pages 293–309. Kluwer.
Schmidhuber, J., Zhao, J., and Wiering, M. (1997b). Shifting inductive bias with success-story algo-
rithm, adaptive Levin search, and incremental self-improvement. Machine Learning, 28:105–130.
Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors (1998). Advances in Kernel Methods -
Support Vector Learning. MIT Press, Cambridge, MA.
Schraudolph, N. and Sejnowski, T. J. (1993). Unsupervised discrimination of clustered data via op-
timization of binary information gain. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors,
Advances in Neural Information Processing Systems, volume 5, pages 499–506. Morgan Kauf-
mann, San Mateo.
Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent.
Neural Computation, 14(7):1723–1738.
Schrauwen, B., Verstraeten, D., and Van Campenhout, J. (2007). An overview of reservoir computing:
theory, applications and implementations. In Proceedings of the 15th European Symposium on
Artificial Neural Networks (ESANN 2007), pages 471–482.
Schuster, H. G. (1992). Learning by maximizing the information transfer through nonlinear noisy
neurons and “noise breakdown”. Phys. Rev. A, 46(4):2131–2138.
Schuster, M. (1999). On supervised learning from sequential data with applications for speech
recognition. PhD thesis, Nara Institute of Science and Technology, Kyoto, Japan.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions
on Signal Processing, 45:2673–2681.
Serrano-Gotarredona, R., Oster, M., Lichtsteiner, P., Linares-Barranco, A., Paz-Vicente, R., Gómez-
Rodríguez, F., Camuñas-Mesa, L., Berner, R., Rivas-Pérez, M., Delbruck, T., et al. (2009). CAVIAR:
A 45k neuron, 5M synapse, 12G connects/s AER hardware sensory–processing–learning–actuating
system for high-speed visual object recognition and tracking. IEEE Transactions on Neural Net-
works, 20(9):1417–1438.
Serre, T., Riesenhuber, M., Louie, J., and Poggio, T. (2002). On the role of object-specific features
for real world object recognition in biological vision. In Biologically Motivated Computer Vision,
pages 387–397.
Seung, H. S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic
transmission. Neuron, 40(6):1063–1073.
Shan, H. and Cottrell, G. (2014). Efficient visual coding: From retina to V2. In Proc. International
Conference on Learning Representations (ICLR). arXiv preprint arXiv:1312.6077.
Shan, H., Zhang, L., and Cottrell, G. W. (2007). Recursive ICA. Advances in Neural Information
Processing Systems (NIPS), 19:1273.
Shanno, D. F. (1970). Conditioning of quasi-Newton methods for function minimization. Mathemat-
ics of computation, 24(111):647–656.
Shannon, C. E. (1948). A mathematical theory of communication (parts I and II). Bell System
Technical Journal, XXVII:379–423.
Shao, L., Wu, D., and Li, X. (2014). Learning deep and wide: A spectral method for learning deep
networks. IEEE Transactions on Neural Networks and Learning Systems.
Shavlik, J. W. (1994). Combining symbolic and neural learning. Machine Learning, 14(3):321–331.
Shavlik, J. W. and Towell, G. G. (1989). Combining explanation-based and neural learning: An
algorithm and empirical results. Connection Science, 1(3):233–255.
Siegelmann, H. (1992). Theoretical Foundations of Recurrent Neural Networks. PhD thesis, Rutgers,
The State University of New Jersey, New Brunswick, NJ.
Siegelmann, H. T. and Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathe-
matics Letters, 4(6):77–80.
Silva, F. M. and Almeida, L. B. (1990). Speeding up back-propagation. In Eckmiller, R., editor,
Advanced Neural Computers, pages 151–158, Amsterdam. Elsevier.
Šíma, J. (1994). Loading deep networks is hard. Neural Computation, 6(5):842–850.
Šíma, J. (2002). Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728.
Simard, P., Steinkraus, D., and Platt, J. (2003). Best practices for convolutional neural networks
applied to visual document analysis. In Seventh International Conference on Document Analysis
and Recognition, pages 958–963.
Sims, K. (1994). Evolving virtual creatures. In Glassner, A., editor, Proceedings of SIGGRAPH ’94
(Orlando, Florida, July 1994), Computer Graphics Proceedings, Annual Conference, pages 15–22.
ACM SIGGRAPH, ACM Press. ISBN 0-89791-667-0.
Şimşek, Ö. and Barto, A. G. (2008). Skill characterization based on betweenness. In NIPS’08, pages
1497–1504.
Singh, S., Barto, A. G., and Chentanez, N. (2005). Intrinsically motivated reinforcement learning. In
Advances in Neural Information Processing Systems 17 (NIPS). MIT Press, Cambridge, MA.
Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision pro-
cesses. In National Conference on Artificial Intelligence, pages 700–705.
Smith, S. F. (1980). A Learning System Based on Genetic Adaptive Algorithms. PhD thesis, Univ.
Pittsburgh.
Smolensky, P. (1986). Parallel distributed processing: Explorations in the microstructure of cognition,
vol. 1. chapter Information Processing in Dynamical Systems: Foundations of Harmony Theory,
pages 194–281. MIT Press, Cambridge, MA, USA.
Solla, S. A. (1988). Accelerated learning in layered neural networks. Complex Systems, 2:625–640.
Solomonoff, R. J. (1964). A formal theory of inductive inference. Part I. Information and Control,
7:1–22.
Song, S., Miller, K. D., and Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-
dependent synaptic plasticity. Nature Neuroscience, 3(9):919–926.
Speelpenning, B. (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD
thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.
Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., and Schmidhuber, J. (2013). Compete to
compute. In Advances in Neural Information Processing Systems (NIPS), pages 2310–2318.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. (2011). The German traffic sign recognition
benchmark: A multi-class classification competition. In International Joint Conference on Neural
Networks (IJCNN 2011), pages 1453–1460. IEEE Press.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. (2012). Man vs. computer: Benchmarking
machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332.
Stanley, K. O., D’Ambrosio, D. B., and Gauci, J. (2009). A hypercube-based encoding for evolving
large-scale neural networks. Artificial Life, 15(2):185–212.
Stanley, K. O. and Miikkulainen, R. (2002). Evolving neural networks through augmenting topolo-
gies. Evolutionary Computation, 10:99–127.
Steijvers, M. and Grunwald, P. (1996). A recurrent network that performs a contextsensitive prediction
task. In Proceedings of the 18th Annual Conference of the Cognitive Science Society. Erlbaum.
Steil, J. J. (2007). Online reservoir adaptation by intrinsic plasticity for backpropagation–
decorrelation and echo state learning. Neural Networks, 20(3):353–364.
Stemmler, M. (1996). A single spike suffices: the simplest form of stochastic resonance in model
neurons. Network: Computation in Neural Systems, 7(4):687–716.
Stoianov, I. and Zorzi, M. (2012). Emergence of a ’visual number sense’ in hierarchical generative
models. Nature Neuroscience, 15(2):194–6.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the
Royal Statistical Society, Series B, 36:111–147.
Stoop, R., Schindler, K., and Bunimovich, L. (2000). When pyramidal neurons lock, when they
respond chaotically, and when they like to synchronize. Neuroscience research, 36(1):81–91.
Stratonovich, R. (1960). Conditional Markov processes. Theory of Probability And Its Applications,
5(2):156–178.
Sun, G., Chen, H., and Lee, Y. (1993a). Time warping invariant neural networks. In Hanson, S. J.,
Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems (NIPS) 5,
pages 180–187. Morgan Kaufmann.
Sun, G. Z., Giles, C. L., Chen, H. H., and Lee, Y. C. (1993b). The neural network pushdown au-
tomaton: Model, stack and learning simulations. Technical Report CS-TR-3118, University of
Maryland, College Park.
Sun, Y., Gomez, F., Schaul, T., and Schmidhuber, J. (2013). A Linear Time Natural Evolution Strat-
egy for Non-Separable Functions. In Proceedings of the Genetic and Evolutionary Computation
Conference, page 61, Amsterdam, NL. ACM.
Sun, Y., Wierstra, D., Schaul, T., and Schmidhuber, J. (2009). Efficient natural evolution strategies.
In Proc. 11th Genetic and Evolutionary Computation Conference (GECCO), pages 539–546.
Sutskever, I., Hinton, G. E., and Taylor, G. W. (2008). The recurrent temporal restricted Boltzmann
machine. In Advances in Neural Information Processing Systems (NIPS), volume 21.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks.
Technical Report arXiv:1409.3215 [cs.CL], Google. NIPS’2014.
Sutton, R. and Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA, MIT
Press.
Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (1999a). Policy gradient methods for
reinforcement learning with function approximation. In Advances in Neural Information Process-
ing Systems (NIPS) 12, pages 1057–1063.
Sutton, R. S., Precup, D., and Singh, S. P. (1999b). Between MDPs and semi-MDPs: A framework
for temporal abstraction in reinforcement learning. Artif. Intell., 112(1-2):181–211.
Sutton, R. S., Szepesvári, C., and Maei, H. R. (2008). A convergent O(n) algorithm for off-policy
temporal-difference learning with linear function approximation. In Advances in Neural Informa-
tion Processing Systems (NIPS’08), volume 21, pages 1609–1616.
Szabó, Z., Póczos, B., and Lőrincz, A. (2006). Cross-entropy optimization for independent pro-
cess analysis. In Independent Component Analysis and Blind Signal Separation, pages 909–916.
Springer.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Ra-
binovich, A. (2014). Going deeper with convolutions. Technical Report arXiv:1409.4842 [cs.CV],
Google.
Szegedy, C., Toshev, A., and Erhan, D. (2013). Deep neural networks for object detection. In
Advances in Neural Information Processing Systems (NIPS), pages 2553–2561.
Taylor, G. W., Spiro, I., Bregler, C., and Fergus, R. (2011). Learning invariance through imitation. In
Conference on Computer Vision and Pattern Recognition (CVPR), pages 2729–2736. IEEE.
Tegge, A. N., Wang, Z., Eickholt, J., and Cheng, J. (2009). NNcon: improved protein contact map
prediction using 2D-recursive neural networks. Nucleic Acids Research, 37(Suppl 2):W515–W518.
Teichmann, M., Wiltschut, J., and Hamker, F. (2012). Learning invariance from natural images in-
spired by observations in the primary visual cortex. Neural Computation, 24(5):1271–1296.
Teller, A. (1994). The evolution of mental models. In Kenneth E. Kinnear, J., editor, Advances in
Genetic Programming, pages 199–219. MIT Press.
Tenenberg, J., Karlsson, J., and Whitehead, S. (1993). Learning via task decomposition. In Meyer,
J. A., Roitblat, H., and Wilson, S., editors, From Animals to Animats 2: Proceedings of the Second
International Conference on Simulation of Adaptive Behavior, pages 337–343. MIT Press.
Tesauro, G. (1994). TD-gammon, a self-teaching backgammon program, achieves master-level play.
Neural Computation, 6(2):215–219.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average
of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tikhonov, A. N., Arsenin, V. I., and John, F. (1977). Solutions of ill-posed problems. Winston.
Ting, K. M. and Witten, I. H. (1997). Stacked generalization: when does it work? In Proc.
International Joint Conference on Artificial Intelligence (IJCAI).
Tiňo, P. and Hammer, B. (2004). Architectural bias in recurrent neural networks: Fractal analysis.
Neural Computation, 15(8):1931–1957.
Tonkes, B. and Wiles, J. (1997). Learning a context-free task with a recurrent neural network: An
analysis of stability. In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive
Science Society.
Tsodyks, M., Pawelzik, K., and Markram, H. (1998). Neural networks with dynamic synapses. Neural
Computation, 10(4):821–835.
Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., and McNaughton, B. L. (1996). Population dynam-
ics and theta rhythm phase precession of hippocampal place cell firing: a spiking neuron model.
Hippocampus, 6(3):271–280.
Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk, W., and Seung,
H. S. (2010). Convolutional networks can learn to generate affinity graphs for image segmentation.
Neural Computation, 22(2):511–538.
Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem.
Proceedings of the London Mathematical Society, Series 2, 41:230–267.
Turner, A. J. and Miller, J. F. (2013). Cartesian Genetic Programming encoded artificial neural net-
works: A comparison using three benchmarks. In Proceedings of the Conference on Genetic and
Evolutionary Computation (GECCO), pages 1005–1012.
Ueda, N. (2000). Optimal linear combination of neural networks for improving classification perfor-
mance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2):207–215.
Urlbe, A. P. (1999). Structure-adaptable digital neural networks. PhD thesis, Universidad del Valle.
Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14(10):2497–
2529.
Vahed, A. and Omlin, C. W. (2004). A machine learning method for extracting symbolic knowledge
from recurrent neural networks. Neural Computation, 16(1):59–71.
Vaillant, R., Monrocq, C., and LeCun, Y. (1994). Original approach for the localisation of objects in
images. IEE Proc on Vision, Image, and Signal Processing, 141(4):245–250.
van den Berg, T. and Whiteson, S. (2013). Critical factors in the performance of HyperNEAT. In
GECCO 2013: Proceedings of the Genetic and Evolutionary Computation Conference, pages 759–
766.
van Hasselt, H. (2012). Reinforcement learning in continuous state and action spaces. In Wiering, M.
and van Otterlo, M., editors, Reinforcement Learning, pages 207–251. Springer.
Vapnik, V. (1992). Principles of risk minimization for learning theory. In Moody, J. E., Hanson,
S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems (NIPS) 4,
pages 831–838. Morgan Kaufmann.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Versino, C. and Gambardella, L. M. (1996). Learning fine motion by using the hierarchical ex-
tended Kohonen map. In Proc. Intl. Conf. on Artificial Neural Networks (ICANN), pages 221–226.
Springer.
Veta, M., Viergever, M., Pluim, J., Stathonikos, N., and van Diest, P. J. (2013). MICCAI 2013 Grand
Challenge on Mitosis Detection.
Vieira, A. and Barradas, N. (2003). A training algorithm for classification of high-dimensional data.
Neurocomputing, 50:461–472.
Viglione, S. (1970). Applications of pattern recognition technology. In Mendel, J. M. and Fu, K. S.,
editors, Adaptive, Learning, and Pattern Recognition Systems. Academic Press.
Vincent, P., Hugo, L., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust fea-
tures with denoising autoencoders. In Proceedings of the 25th international conference on Machine
learning, ICML ’08, pages 1096–1103, New York, NY, USA. ACM.
Vlassis, N., Littman, M. L., and Barber, D. (2012). On the computational complexity of stochastic
controller optimization in POMDPs. ACM Transactions on Computation Theory, 4(4):12.
Vogl, T., Mangis, J., Rigler, A., Zink, W., and Alkon, D. (1988). Accelerating the convergence of the
back-propagation method. Biological Cybernetics, 59:257–263.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex.
Kybernetik, 14(2):85–100.
Waldinger, R. J. and Lee, R. C. T. (1969). PROW: a step toward automatic program writing. In
Walker, D. E. and Norton, L. M., editors, Proceedings of the 1st International Joint Conference on
Artificial Intelligence (IJCAI), pages 241–252. Morgan Kaufmann.
Wallace, C. S. and Boulton, D. M. (1968). An information theoretic measure for classification. Com-
puter Journal, 11(2):185–194.
Wan, E. A. (1994). Time series prediction by using a connectionist network with internal delay lines.
In Weigend, A. S. and Gershenfeld, N. A., editors, Time series prediction: Forecasting the future
and understanding the past, pages 265–295. Addison-Wesley.
Wang, C., Venkatesh, S. S., and Judd, J. S. (1994). Optimal stopping and effective machine complexity
in learning. In Advances in Neural Information Processing Systems (NIPS’6), pages 303–310.
Morgan Kaufmann.
Wang, S. and Manning, C. (2013). Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning (ICML-13), pages 118–126.
Watanabe, O. (1992). Kolmogorov complexity and computational complexity. EATCS Monographs
on Theoretical Computer Science, Springer.
Watanabe, S. (1985). Pattern Recognition: Human and Mechanical. Wiley, New York.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge.
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8:279–292.
Watrous, R. L. and Kuhn, G. M. (1992). Induction of finite-state automata using second-order recur-
rent networks. In Moody, J. E., Hanson, S. J., and Lippman, R. P., editors, Advances in Neural
Information Processing Systems 4, pages 309–316. Morgan Kaufmann.
Waydo, S. and Koch, C. (2008). Unsupervised learning of individuals and categories from images.
Neural Computation, 20(5):1165–1178.
Weigend, A. S. and Gershenfeld, N. A. (1993). Results of the time series prediction competition
at the Santa Fe Institute. In Neural Networks, 1993., IEEE International Conference on, pages
1786–1793. IEEE.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1991). Generalization by weight-elimination
with application to forecasting. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors,
Advances in Neural Information Processing Systems (NIPS) 3, pages 875–882. San Mateo, CA:
Morgan Kaufmann.
Weiss, G. (1994). Hierarchical chunking in classifier systems. In Proceedings of the 12th National
Conference on Artificial Intelligence, volume 2, pages 1335–1340. AAAI Press/The MIT Press.
Weng, J., Ahuja, N., and Huang, T. S. (1992). Cresceptron: a self-organizing neural network which
grows adaptively. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages
576–581. IEEE.
Weng, J. J., Ahuja, N., and Huang, T. S. (1997). Learning recognition and segmentation using the
cresceptron. International Journal of Computer Vision, 25(2):109–143.
Werbos, P. J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral
Sciences. PhD thesis, Harvard University.
Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the
10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770.
Werbos, P. J. (1987). Building and understanding adaptive systems: A statistical/numerical approach
to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics,
17.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market
model. Neural Networks, 1.
Werbos, P. J. (1989a). Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS
International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209–216.
Werbos, P. J. (1989b). Neural networks for control and system identification. In Proceedings of
IEEE/CDC Tampa, Florida.
Werbos, P. J. (1992). Neural networks, system identification, and control in the chemical industries.
In White, D. A. and Sofge, D. A., editors, Handbook of Intelligent Control: Neural, Fuzzy, and
Adaptive Approaches, pages 283–356. Thomson Learning.
Werbos, P. J. (2006). Backwards differentiation in AD and neural nets: Past links and new oppor-
tunities. In Automatic Differentiation: Applications, Theory, and Implementations, pages 15–34.
Springer.
Whitehead, S. (1992). Reinforcement Learning for the adaptive control of perception and action. PhD
thesis, University of Rochester.
Whiteson, S. (2012). Evolutionary computation for reinforcement learning. In Wiering, M. and van
Otterlo, M., editors, Reinforcement Learning, pages 325–355. Springer, Berlin, Germany.
Whiteson, S., Kohl, N., Miikkulainen, R., and Stone, P. (2005). Evolving keepaway soccer players
through task decomposition. Machine Learning, 59(1):5–30.
Whiteson, S. and Stone, P. (2006). Evolutionary function approximation for reinforcement learning.
Journal of Machine Learning Research, 7:877–917.
Widrow, B. and Hoff, M. (1962). Associative storage and retrieval of digital information in networks
of adaptive neurons. Biological Prototypes and Synthetic Systems, 1:160.
Widrow, B., Rumelhart, D. E., and Lehr, M. A. (1994). Neural networks: Applications in industry,
business and science. Commun. ACM, 37(3):93–105.
Wieland, A. P. (1991). Evolving neural network controllers for unstable systems. In International
Joint Conference on Neural Networks (IJCNN), volume 2, pages 667–673. IEEE.
Wiering, M. and Schmidhuber, J. (1996). Solving POMDPs with Levin search and EIRA. In Saitta,
L., editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 534–
542. Morgan Kaufmann Publishers, San Francisco, CA.
Wiering, M. and Schmidhuber, J. (1998a). HQ-learning. Adaptive Behavior, 6(2):219–246.
Wiering, M. and van Otterlo, M. (2012). Reinforcement Learning. Springer.
Wiering, M. A. and Schmidhuber, J. (1998b). Fast online Q(λ). Machine Learning, 33(1):105–116.
Wierstra, D., Foerster, A., Peters, J., and Schmidhuber, J. (2010). Recurrent policy gradients. Logic
Journal of IGPL, 18(2):620–634.
Wierstra, D., Schaul, T., Peters, J., and Schmidhuber, J. (2008). Natural evolution strategies. In
Congress of Evolutionary Computation (CEC 2008).
Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex.
J. Physiol., 148:574–591.
Wiles, J. and Elman, J. (1995). Learning to count without a counter: A case study of dynamics
and activation landscapes in recurrent networks. In Proceedings of the Seventeenth Annual
Conference of the Cognitive Science Society, pages 482–487, Cambridge, MA. MIT Press.
Wilkinson, J. H., editor (1965). The Algebraic Eigenvalue Problem. Oxford University Press, Inc.,
New York, NY, USA.
Williams, R. J. (1986). Reinforcement-learning in connectionist networks: A mathematical analysis.
Technical Report 8605, Institute for Cognitive Science, University of California, San Diego.
Williams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical
Report NU-CCS-88-3, College of Comp. Sci., Northeastern University, Boston, MA.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural
networks. Technical Report NU-CCS-89-27, College of Computer Science, Northeastern University,
Boston, MA.
Williams, R. J. (1992a). Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning, 8:229–256.
Williams, R. J. (1992b). Training recurrent networks using the extended Kalman filter. In Interna-
tional Joint Conference on Neural Networks (IJCNN), volume 4, pages 241–246. IEEE.
Williams, R. J. and Peng, J. (1990). An efficient gradient-based algorithm for on-line training of
recurrent network trajectories. Neural Computation, 4:491–501.
Williams, R. J. and Zipser, D. (1988). A learning algorithm for continually running fully recurrent
networks. Technical Report ICS Report 8805, Univ. of California, San Diego, La Jolla.
Williams, R. J. and Zipser, D. (1989a). Experimental analysis of the real-time recurrent learning
algorithm. Connection Science, 1(1):87–111.
Williams, R. J. and Zipser, D. (1989b). A learning algorithm for continually running fully recurrent
networks. Neural Computation, 1(2):270–280.
Willshaw, D. J. and von der Malsburg, C. (1976). How patterned neural connections can be set up by
self-organization. Proc. R. Soc. London B, 194:431–445.
Windisch, D. (2005). Loading deep networks is hard: The pyramidal case. Neural Computation,
17(2):487–502.
Wiskott, L. and Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances.
Neural Computation, 14(4):715–770.
Witczak, M., Korbicz, J., Mrugalski, M., and Patton, R. J. (2006). A GMDH neural network-based
approach to robust fault diagnosis: Application to the DAMADICS benchmark problem. Control
Engineering Practice, 14(6):671–683.
Wöllmer, M., Blaschke, C., Schindl, T., Schuller, B., Färber, B., Mayer, S., and Trefflich, B. (2011).
On-line driver distraction detection using Long Short-Term Memory. IEEE Transactions on Intel-
ligent Transportation Systems (TITS), 12(2):574–582.
Wöllmer, M., Schuller, B., and Rigoll, G. (2013). Keyword spotting exploiting Long Short-Term
Memory. Speech Communication, 55(2):252–265.
Wyatte, D., Curran, T., and O’Reilly, R. (2012). The limits of feedforward vision: Recurrent process-
ing promotes robust object recognition when objects are degraded. Journal of Cognitive Neuro-
science, 24(11):2248–2261.
Wysoski, S. G., Benuskova, L., and Kasabov, N. (2010). Evolving spiking neural networks for audio-
visual information processing. Neural Networks, 23(7):819–835.
Yamauchi, B. M. and Beer, R. D. (1994). Sequential behavior and learning in evolved dynamical
neural networks. Adaptive Behavior, 2(3):219–246.
Yamins, D., Hong, H., Cadieu, C., and DiCarlo, J. J. (2013). Hierarchical modular optimization of
convolutional networks achieves representations similar to macaque IT and human ventral stream.
Advances in Neural Information Processing Systems (NIPS), pages 1–9.
Yang, M., Ji, S., Xu, W., Wang, J., Lv, F., Yu, K., Gong, Y., Dikmen, M., Lin, D. J., and Huang,
T. S. (2009). Detecting human actions in surveillance videos. In TREC Video Retrieval Evaluation
Workshop.
Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelli-
gent Systems, 4:203–222.
Yin, F., Wang, Q.-F., Zhang, X.-Y., and Liu, C.-L. (2013). ICDAR 2013 Chinese handwriting recog-
nition competition. In 12th International Conference on Document Analysis and Recognition (IC-
DAR), pages 1464–1470.
Yin, J., Meng, Y., and Jin, Y. (2012). A developmental approach to structural self-organization in
reservoir computing. IEEE Transactions on Autonomous Mental Development, 4(4):273–289.
Young, S., Davis, A., Mishtal, A., and Arel, I. (2014). Hierarchical spatiotemporal feature extraction
using recurrent online clustering. Pattern Recognition Letters, 37:115–123.
Yu, X.-H., Chen, G.-A., and Cheng, S.-X. (1995). Dynamic learning rate optimization of the back-
propagation algorithm. IEEE Transactions on Neural Networks, 6(3):669–677.
Zamora-Martínez, F., Frinken, V., España-Boquera, S., Castro-Bleda, M., Fischer, A., and Bunke, H.
(2014). Neural network language models for off-line handwriting recognition. Pattern Recognition,
47(4):1642–1652.
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. CoRR, abs/1212.5701.
Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutional networks. Technical
Report arXiv:1311.2901 [cs.CV], NYU.
Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. PhD thesis,
University of Toronto.
Zemel, R. S. and Hinton, G. E. (1994). Developing population codes by minimizing description
length. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information
Processing Systems 6, pages 11–18. Morgan Kaufmann.
Zeng, Z., Goodman, R., and Smyth, P. (1994). Discrete recurrent neural networks for grammatical
inference. IEEE Transactions on Neural Networks, 5(2).
Zimmermann, H.-G., Tietz, C., and Grothmann, R. (2012). Forecasting with recurrent neural net-
works: 12 tricks. In Montavon, G., Orr, G. B., and Müller, K.-R., editors, Neural Networks:
Tricks of the Trade (2nd ed.), volume 7700 of Lecture Notes in Computer Science, pages 687–707.
Springer.
Zipser, D., Kehoe, B., Littlewort, G., and Fuster, J. (1993). A spiking network model of short-term
active memory. The Journal of Neuroscience, 13(8):3406–3420.
arXiv:1410.5926v1 [cs.CV] 22 Oct 2014

Abstract—Salient object detection has been attracting a lot of interest, and recently various heuristic computational models have been designed. In this paper, we formulate saliency map computation as a regression problem. Our method, which is based on multi-level image segmentation, uses a supervised learning approach to map the regional feature vector to a saliency score. Saliency scores across multiple levels are finally fused to produce the saliency map. The contributions are two-fold. One is that we propose a discriminative regional feature integration approach for salient object detection. Compared with existing heuristic models, our proposed method is able to automatically integrate high-dimensional regional saliency features and choose discriminative ones. The other is that, by investigating standard generic region properties as well as two widely studied concepts for salient object detection, i.e., regional contrast and backgroundness, our approach significantly outperforms state-of-the-art methods on six benchmark datasets. Meanwhile, we demonstrate that our method runs as fast as most existing algorithms.
putation framework is presented in Sec. 3. Sec. 4 describes the regional saliency features adopted in this paper. Sec. 5 presents the learning framework of our approach. Empirical analysis of our proposed method and comparisons with other algorithms are demonstrated in Sec. 6. Finally, Sec. 7 discusses and concludes this paper.

2 RELATED WORK

Salient object detection, stemming from eye fixation prediction, aims to separate the entire salient object from the background. Since the pioneering work of Itti et al. [2], it has attracted more and more research interest in computer vision, driven by applications such as content-aware image resizing [8], picture collage [10], etc. In the following, we focus on salient object detection (segmentation) and briefly review existing algorithms. A comprehensive survey can be found in a recent work [23]. A literature review of eye fixation prediction can be seen in [27], which also includes some analysis of salient object detection. We divide existing algorithms into two categories, unsupervised and supervised, according to whether the ground-truth annotations of salient objects are adopted.

Unsupervised approaches. Most salient object detection algorithms characterize the uniqueness of a scene as salient regions following the center-surround contrast framework [2], where different kinds of features are combined according to the feature integration theory [22]. The multi-scale pixel contrast is studied in [19], [28]. The discriminant center-surround hypothesis is analyzed in [16], [29]. Color histograms, computed to represent the center and the surround, are used to evaluate the center-surround dissimilarity [19]. An information-theoretic perspective is introduced to yield a sound mathematical formulation, computing the center-surround divergence based on feature statistics [18]. A cost-sensitive SVM is trained to measure the separability of a center region w.r.t. its surroundings [30]. The uniqueness can also be captured in a global scope by comparing a patch to its k nearest neighbors [17] or by its distance to the average patch over the image along the principal component axis coordinates [31].

The center-surround difference framework is also investigated to compute the saliency from a region-based image representation. Multi-level image segmentation is adopted for salient object detection based on local regional contrast [32]. The global regional contrast is studied in [15], [21] as well. To further enhance the performance, saliency maps on hierarchical segmentations are computed and finally combined through a tree model via dynamic programming [33]. Both the color and textural global uniqueness are investigated in [34], [35]. Recently, Cheng et al. [36] propose a soft image abstraction using a Gaussian Mixture Model (GMM), where each pixel maintains a probability of belonging to all the regions instead of a single hard region label, to better compute the saliency. The global uniqueness can also be captured with the low-rank matrix recovery framework [37]–[39]: the low-rank matrix corresponds to the background regions, while sparse noise indicates salient regions. A submodular salient object detection algorithm is presented in [40], where superpixels are gradually grouped to form potential salient regions by iteratively optimizing a submodular facility location problem. The Bayesian framework is introduced for salient object detection in [41], [42]. A partial differential equation (PDE) is also introduced for salient object detection in a recent work [43].

In addition to capturing the uniqueness, many other priors are also proposed for saliency computation. The central prior, i.e., that the salient object usually lies in the center of an image, is investigated in [32], [44]. Object priors, such as the connectivity prior [45], concavity context [20], auto-context cue [46], backgroundness prior [47]–[50], generic objectness prior [51]–[53], and background connectivity prior [38], [54], [55], are also studied for saliency computation. Example-based approaches, which search for images similar to the input, are developed for salient object detection [8], [56]. The depth cue is leveraged for saliency analysis derived from stereoscopic image pairs in [57] and from a depth camera (e.g., Kinect) in [58]. Li et al. [59] adopt the light field camera for salient object detection. Besides, spectral analysis in the frequency domain is used to detect salient regions [13].

Supervised approaches. Inspired by the feature integration theory, some approaches focus on learning the linear fusion weights of saliency features. Liu et al. [19] propose to learn the linear fusion weights of saliency features in a Conditional Random Field (CRF) framework. Recently, the large-margin framework was adopted to learn the weights in [60]. Due to the highly non-linear nature of the saliency mechanism, a linear mapping might not perfectly capture the characteristics of saliency. In [24], a mixture of linear Support Vector Machines (SVMs) is adopted to partition the feature space into a set of sub-regions that are linearly separable, using a divide-and-conquer strategy. Alternatively, a Boosted Decision Tree (BDT) is learned to get an initial saliency map, which is further refined using a high-dimensional color transform [61]. In [25], generic regional properties are investigated for salient object detection. Li et al. [62] propose to generate a saliency map by adaptively averaging the object proposals [63] with their foreground probabilities, which are learned from eye fixation features using a Random Forest regressor. Additionally, Wang et al. [64] learn
…
methods, our approach extends the contrast value used in existing algorithms to the contrast vector to represent a region. More importantly, instead of …

[Figure: pipeline overview. Stages: Input Image, Multi-Level Image Segmentation, Multi-Level Saliency Computation, Saliency Fusion.]
To compute the contrast descriptor, we describe each region R_i ∈ S_m by a feature vector, including color and texture features, denoted by v_{R_i}. The detailed description is given in Fig. 2. For color features, we consider the RGB, HSV, and L*a*b* color spaces. For texture features, we adopt the LBP feature [71] and the responses of the LM filter bank [72].

As suggested in previous works [15], [21], the regional contrast value x^c_k derived from the k-th feature channel is computed by checking R_i against all other regions,

    x^c_k(R_i) = Σ_{j=1}^{N_m} α_j w_{ij} D_k(v_{R_i}, v_{R_j}),    (1)

where D_k(v_{R_i}, v_{R_j}) captures the difference of the k-th channel of the feature vectors v_{R_i} and v_{R_j}. Specifically, the difference is computed as the χ² distance for histogram features and as the absolute difference for the other features. w_{ij} = exp(−‖p_i − p_j‖² / (2σ_s²)) is a spatial weighting term, where p_i and p_j are the mean positions of R_i and R_j, respectively, and σ_s controls the strength of the spatial weighting effect; we empirically set it to 1.0 in our implementation. α_j, defined as the normalized area of the region R_j, is introduced to account for the irregular shapes of regions. N_m is the number of regions in S_m. As a result, we get a 29-dimensional feature vector. The details of the regional contrast descriptor are given in Fig. 2.

4.2 Regional backgroundness descriptor

There exist a few algorithms attempting to make use of the characteristics of the background (e.g., homogeneous color or texture) to heuristically determine whether a region is background, e.g., [47]. In contrast, our algorithm extracts a set of features and adopts a supervised learning approach to determine the background degree (and accordingly the saliency degree) of a region.

It has been observed that background identification depends on the whole image context. Image regions with similar appearances might belong to the background in one image but to the salient object in another. It is not enough to merely use the property features to check whether a region belongs to the background or the salient object.

Therefore, we extract the pseudo-background region and compute the backgroundness descriptor for each region with the pseudo-background region as a reference. The pseudo-background region B is defined as the 15-pixel wide narrow border region of the image. To verify this definition, we made a simple survey on the MSRA-B data set with 5000 images and found that 98% of the pixels in the border area belong to the background. The backgroundness value x^b_k of the region R_i on the k-th feature is then defined as

    x^b_k(R_i) = D_k(v_{R_i}, v_B).    (2)

We again get a 29-dimensional feature vector. See details in Fig. 2.

4.3 Regional property descriptor

Additionally, we consider the generic properties of a region, including appearance and geometric features. These two kinds of features are extracted independently from each region, as in the feature extraction algorithm for image labeling [73]. The appearance features attempt to describe the distribution of colors and textures in a region, which can characterize the common properties of the salient object and the background. For example, the background usually has a homogeneous color distribution or a similar texture pattern. The geometric features include the size and position of a region, which may be useful to describe the spatial distribution of the salient object and the background. For example, the salient object tends to be placed near the center of the image, while the background usually scatters over the entire image. Finally, we obtain a 35-dimensional regional property descriptor. The details are given in Fig. 3.

In summary, we obtain a 93-dimensional (2 × 29 + 35) feature vector for each region. Fig. 4 demonstrates …
5 LEARNING

In this section, we introduce how to learn a Random Forest that maps the feature vector of each region to a saliency score. Learning the multi-level saliency fusion weights is also presented.

Fig. 4. Illustration of the most important features. From top to bottom: input images, the most important contrast feature (c12), the most important backgroundness feature (b12), the most important property feature (p5), and the saliency map of our approach (DRFIs) produced on a single-level segmentation. Brighter areas indicate larger feature values (and thus larger saliency values according to c12, b12, and the saliency map).

5.1 Generating training samples

We use supervised multi-level segmentation to generate training samples. We first learn a similarity score for each pair of adjacent regions, indicating the probability that the two regions both belong to the salient object or both to the background. Similar regions are then grouped together in a hierarchical way. Training samples for the saliency regressor are the confident regions in this grouping hierarchy.

By learning the similarity score, we hope that regions from the same object (or from the background) are more likely to be grouped together. Specifically, given an over-segmentation of an image, we connect each region to its spatially neighboring regions, forming a set of pairs P = {(R_i, R_j)}, and learn the probability p(a_i = a_j), where a_i is the saliency label of the region R_i. Such a set of pairs is divided into two parts: a positive part P+ = {(R_i, R_j) | a_i = a_j} and a negative part P− = {(R_i, R_j) | a_i ≠ a_j}. Following [74], each region pair is described by a set of features including the regional saliency features of the two regions (2 × 93 features), the feature contrast of the two regions (similar to the regional contrast descriptor, 29 features), and the geometry features of the superpixel boundary between the two regions (similar to p1–p7 in Fig. 3, 7 features). Given these 222-dimensional features, we learn a boosted decision tree classifier to estimate the similarity score of each adjacent region pair.

Based on the learned similarity of adjacent regions, we produce the multi-level segmentation {S_1^t, S_2^t, …, S_M^t} to gather a large amount of training samples. Specifically, denote by S_0^t the over-segmentation of the image generated using the graph-based image segmentation algorithm [70]. The regions in S_0^t are represented by a weighted graph, which connects the spatially neighboring regions. The weight of each edge is the learned similarity of the two adjacent superpixels. Similar to the pixel-wise grouping in [70], pairs of regions are sequentially merged in order of decreasing edge weight. We change the tolerance degree of small regions, i.e., the parameter k of the approach [70] (see the details in [70]), to generate the segmentations from S_1^t to S_M^t. To avoid too fine groupings, we discard S_i^t if |S_i^t| / |S_0^t| > 0.6, where |·| denotes the number of superpixels.

Given a set of training images with ground truth annotations and their multi-level segmentations, we can collect a large number of confident regions R = {R^(1), R^(2), …, R^(Q)} and the corresponding saliency scores A = {a^(1), a^(2), …, a^(Q)} to learn a Random Forest saliency regressor. Only confident regions are kept for training, since some regions may contain pixels from both the salient object and the background. A region is considered confident if the number of pixels belonging to the salient object or the
background exceeds 80% of the total number of pixels in the region; its saliency score is set to 1 or 0 accordingly. In our experiments we find that few regions, around 6% of all the training examples, are unconfident, and we discard them from training.

One benefit of generating multi-level segmentations is that a large amount of training samples can be gathered. In Sec. 6.3, we empirically analyze different settings of M_t and validate our motivation to generate training samples based on multi-level image segmentation. Additionally, with the guidance of the learned similarity metric, there might be only a few large regions left in the high-level segmentation, which are helpful for the Random Forest regressor to learn object-level properties. However, the learned similarity is hard to generalize across datasets. This is why our approach did not perform the best on the SED2 dataset in our previous version [1]. To this end, we adopt unsupervised multi-level segmentation in the testing phase, which is also more efficient as it requires no learned similarity score.

…these features F_m to maximize the splitting criterion

    (f*, τ*) = max_{f ∈ F_m, τ} [ (Σ_{t_i ∈ D_l} a^(t_i))² / |D_l| + (Σ_{t_i ∈ D_r} a^(t_i))² / |D_r| − (Σ_{t_i ∈ D} a^(t_i))² / |D| ],    (3)

where D_l = {(x^(t_i), a^(t_i)) | x^(t_i)(f) < τ}, D_r = {(x^(t_i), a^(t_i)) | x^(t_i)(f) ≥ τ}, and D = D_l ∪ D_r. Such a splitting procedure is repeated until |D| < 5, at which point a leaf node is created. The prediction value of a leaf node is the average saliency score of the training samples falling in it. We empirically examine the settings of the parameters T and m in Sec. 6.3.

Learning a saliency regressor can automatically integrate the features and discover the most discriminative ones. Additionally, during the training of the random forest, the feature importance can be estimated simultaneously; refer to the supplementary material for more details. Fig. 6 presents the 60 most important features.
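The split selection of Eq. (3) amounts to an exhaustive search over candidate features and thresholds for the split that maximizes the criterion. A toy NumPy sketch follows; the function and parameter names are ours, and thresholds are taken at sample quantiles rather than at every data point.

```python
import numpy as np

def best_split(x, a, candidate_features, num_thresholds=10):
    """Toy version of Eq. (3): pick the feature f and threshold tau maximizing
    (sum_{Dl} a)^2/|Dl| + (sum_{Dr} a)^2/|Dr| - (sum_D a)^2/|D|.

    x: (n, d) feature matrix; a: (n,) saliency scores.
    Returns (feature, threshold, gain)."""
    n = len(a)
    total = a.sum() ** 2 / n  # constant term, subtracted for readability
    best = (None, None, -np.inf)
    for f in candidate_features:
        # candidate thresholds at quantiles of the feature values
        for tau in np.quantile(x[:, f], np.linspace(0.1, 0.9, num_thresholds)):
            left = x[:, f] < tau          # D_l: samples with x(f) < tau
            nl = int(left.sum())
            nr = n - nl                   # D_r: samples with x(f) >= tau
            if nl == 0 or nr == 0:
                continue
            gain = a[left].sum() ** 2 / nl + a[~left].sum() ** 2 / nr - total
            if gain > best[2]:
                best = (f, tau, gain)
    return best
```

On data where one feature perfectly separates low-score from high-score samples, the search recovers that feature with a positive gain.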
Fig. 6. The 60 most important regional saliency features given by the Random Forest regressor, occupying around 90% of the energy of all features. There are 5 contrast features, 20 backgroundness features, and 35 property features. From left to right: the first and second 30 most important features, respectively. See Fig. 2 and Fig. 3 for the description of the features.
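The impurity-based feature ranking reported in Fig. 6 can be illustrated with a toy stand-in: fit many depth-1 regression stumps, each on a random subset of features (the random-forest ingredient), and credit each feature with the variance reduction it achieves at its splits. Everything below (names, synthetic data, the median-split simplification) is our own illustration, not the paper's implementation.

```python
import numpy as np

def stump_forest_importance(X, y, n_stumps=200, m=3, seed=0):
    """Accumulate, per feature, the Eq.-(3)-style gain achieved by the best
    median split among m randomly sampled features, over many bootstrap
    stumps; return the importances normalized to sum to 1."""
    rng = np.random.RandomState(seed)
    n, d = X.shape
    importance = np.zeros(d)
    for _ in range(n_stumps):
        feats = rng.choice(d, size=m, replace=False)  # random feature subset
        boot = rng.randint(0, n, size=n)              # bootstrap sample
        Xb, yb = X[boot], y[boot]
        best_gain, best_f = 0.0, None
        for f in feats:
            tau = np.median(Xb[:, f])
            left = Xb[:, f] < tau
            nl = int(left.sum())
            if 0 < nl < n:
                # decrease of squared error when splitting at the median
                gain = (yb[left].sum() ** 2 / nl
                        + yb[~left].sum() ** 2 / (n - nl)
                        - yb.sum() ** 2 / n)
                if gain > best_gain:
                    best_gain, best_f = gain, f
        if best_f is not None:
            importance[best_f] += best_gain
    return importance / importance.sum()
```

On synthetic data where the target depends almost entirely on one feature, that feature dominates the normalized importance, which is the same "energy" reading used for Fig. 6.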
…dimension of the feature vector, i.e., all of the features can be seen during the splitting, most likely the same most discriminative feature will be chosen at each split node. Consequently, nearly identical decision trees are built that cannot complement each other well, which may result in inferior performance. According to Fig. 5(c), we empirically set m = 15, for which the performance is best.

Testing parameters analysis. One can see in Fig. 5(d) that the AUC scores of the saliency maps increase when more layers of segmentation are adopted. The reason is that with more layers there may exist confident regions that cover most of (or even the entire) object. However, a larger number of segmentations introduces more computational burden. Therefore, to balance efficiency and effectiveness, we set M to 15 segmentations in our experiments.

6.4 Feature Importance

Our approach uses a wide variety of features. In this section, we empirically analyze the usefulness of these regional saliency features. Fig. 6 shows the rank of the 60 most important regional features produced during the training of the Random Forest regressor, which occupy around 90% of the energy of all features.

The feature rank indicates that the property descriptor is the most critical one in our feature set (it occupies 35 of the top 60 features). The reason might be that salient objects share some common properties, validating our motivation to exploit the generic regional properties for salient object detection that are widely studied in other tasks such as image classification [26]. For example, the high rank of the geometric features p5, p3, and p6 might correspond to the compositional bias of salient objects. The importance of the variance features p12 and p26, on the other hand, might be related to the background properties. Among the contrast-based descriptors, the regional contrast descriptor is the least important: it might be affected by cluttered scenes and is less discriminative than the regional backgroundness descriptor, which is in some sense more robust. Moreover, we also observe that color features are much more discriminative than texture features.

To further validate the importance of features across different data sets, we train classifiers by removing each kind of feature descriptor on each benchmark data set (testing set of MSRA-B). AUC scores of the resulting saliency maps are shown in Fig. 7. As can be seen, removing some feature descriptors does not necessarily lead to a performance decrease. Consistent with the feature rank given by the Random Forest regressor, the regional contrast descriptor is the least important one on most of the benchmark data sets, as the smallest performance drop is observed with its removal on most of the data sets. The regional property descriptor still plays the most important role on MSRA-B, ECSSD, and DUT-OMRON. Since there are multiple salient objects per image in SED2 and iCoSeg, the common properties learned from the training data, where only a single salient object exists in most of the images,
(Fig. 9 panels: precision-recall curves on MSRA, iCoSeg, and ECSSD for SVO, CA, CB, RC, SF, LRK, HS, GMR, PCA, MC, DSR, RBD, DRFIs, and DRFI.)
Fig. 9. Quantitative comparisons of saliency maps produced by different approaches on different data sets in
terms of PR curves. See supplemental materials for more evaluations.
(Fig. 10 panels: ROC curves, plotted against the false positive rate, on MSRA, iCoSeg, ECSSD, DUT-OMRON, SED2, and DUT-OMRON* for SVO, CA, CB, RC, SF, LRK, HS, GMR, PCA, MC, DSR, RBD, DRFIs, and DRFI.)
Fig. 10. Quantitative comparison of saliency maps produced by different approaches on different data sets in
terms of ROC curves. See supplemental materials for more evaluations.
…well when the object touches the image border, e.g., the first and last third rows in Fig. 11, even though this violates the pseudo-background assumption. With the multi-level enhancement, more appealing results can be achieved.

6.6 Robustness Analysis

As suggested by Fig. 6 and Fig. 7, the regional backgroundness and property descriptors, especially the geometric properties, play important roles in our approach. In natural images, the pseudo-background assumption may not hold well. Additionally, the distribution of salient objects may differ from that of our training set. It is natural to doubt whether
(a) input (b) SVO (c) CA (d) CB (e) RC (f) SF (g) LRK (h) HS (i) GMR (j) PCA (k) MC (l) DSR (m) RBD (n) DRFIs (o) DRFI
Fig. 11. Visual comparison of the saliency maps. Our method (DRFI) consistently generates better saliency
maps.
our approach can still perform well on these challenging cases. To this end, we select 635 images from the DUT-OMRON dataset (we call this subset DUT-OMRON*), in which salient objects touch the image border and are far from the image center. Quantitative comparisons with state-of-the-art approaches are presented in Fig. 8, Fig. 9, and Fig. 10. See our project website jianghz.com/drfi and the supplementary material for more details.

Not surprisingly, the performance of all approaches declines. But our approach DRFI still significantly outperforms the other methods in terms of PR curves, ROC curves, and AUC scores. Even with a single level, our approach DRFIs performs slightly better than the others (ranked second best in terms of AUC scores). Specifically, DRFIs and DRFI are better than the top-performing competing method by around 2.22% and 1.36% (compared with 2.20% and 1.26% on the whole DUT-OMRON data set) according to the AUC scores.

6.7 Efficiency

Since the computation on each level of the multiple segmentations is independent, we use multi-threading to accelerate our C++ code. Fig. 12 summarizes the running time of different approaches, tested on the MSRA-B data set with a typical 400×300 image using a PC with an Intel i5 CPU at 2.50GHz and 8GB of memory; 8 threads are used for acceleration. As we can see, our approach runs as fast as most existing approaches. If used as a pre-processing step for an application, e.g., picture collage, our approach will not harm the user experience.

For training, it takes around 24h with around 1.7 million training samples. As the training of each decision tree is independent of the others, parallel computing techniques can also be utilized for acceleration.
7 DISCUSSIONS AND FUTURE WORK

Fig. 13. Failure cases of our approach.

…results can be expected. For example, the background connectivity prior [55] can be incorporated to relax the pseudo-background assumption. Additionally, the spatial distribution prior [19], [21], the focusness prior [52], the diverse density score [53] based on generic objectness, and the graph-based manifold ranking score [48] can also be integrated.
• Better fusion strategy. We simply investigate the linear fusion of saliency maps without any post-optimization step. As future work, we can utilize the optimization steps of other approaches to enhance the performance. For example, we can run saliency detection on hierarchical detections and fuse them as suggested in [77]. The optimization method proposed in [55] is also applicable.
• Integrating more cues. A recent trend in salient object detection is to integrate more cues in addition to traditional RGB data. Our approach naturally extends to cues such as depth for RGB-D input, temporal consistency for video sequences, and saliency co-occurrence for co-salient object detection.

ACKNOWLEDGEMENTS

This work was supported in part by the National Basic Research Program of China under Grant No. 2015CB351703 and 2012CB316400, and the National Natural Science Foundation of China under Grant No. 91120006.

REFERENCES

[1] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in CVPR, 2013, pp. 2083–2090.
[2] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE TPAMI, 1998.
[3] Y.-F. Ma and H.-J. Zhang, “Contrast-based image attention analysis by using fuzzy growing,” in ACM Multimedia, 2003.
[4] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in CVPR, 2007, pp. 1–8.
[5] C. Kanan and G. W. Cottrell, “Robust classification of objects, faces, and flowers using natural image statistics,” in CVPR, 2010, pp. 2472–2479.
[6] D. Walther and C. Koch, “Modeling attention to salient proto-objects,” Neural Networks, vol. 19, no. 9, pp. 1395–1407, 2006.
[7] L. Itti, “Automatic foveation for video compression using a neurobiological model of visual attention,” IEEE TIP, 2004.
[8] L. Marchesotti, C. Cifarelli, and G. Csurka, “A framework for visual saliency detection with applications to image thumbnailing,” in ICCV, 2009, pp. 2232–2239.
[9] S. Goferman, A. Tal, and L. Zelnik-Manor, “Puzzle-like collage,” Comput. Graph. Forum, vol. 29, no. 2, pp. 459–468, 2010.
[10] J. Wang, L. Quan, J. Sun, X. Tang, and H.-Y. Shum, “Picture collage,” in CVPR (1), 2006, pp. 347–354.
[11] P. Wang, D. Zhang, J. Wang, Z. Wu, X.-S. Hua, and S. Li, “Color filter for image search,” in ACM Multimedia, 2012.
[12] P. Wang, D. Zhang, G. Zeng, and J. Wang, “Contextual dominant color name extraction for web image search,” in ICME Workshops, 2012, pp. 319–324.
[13] R. Achanta, S. S. Hemami, F. J. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in CVPR, 2009.
[14] A. Borji and L. Itti, “Exploiting local and global patch rarities for saliency detection,” in CVPR, 2012, pp. 478–485.
[15] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE TPAMI, 2014.
[16] D. Gao, V. Mahadevan, and N. Vasconcelos, “The discriminant center-surround hypothesis for bottom-up saliency,” in NIPS, 2007.
[17] S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” in CVPR, 2010, pp. 2376–2383.
[18] D. A. Klein and S. Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in ICCV, 2011.
[19] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE TPAMI, vol. 33, no. 2, pp. 353–367, 2011.
[20] Y. Lu, W. Zhang, H. Lu, and X. Xue, “Salient object detection using concavity context,” in ICCV, 2011, pp. 233–240.
[21] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in CVPR, 2012, pp. 733–740.
[22] A. Treisman and G. Gelade, “A feature-integration theory of attention,” Cognitive Psychology, vol. 12, no. 1, pp. 97–136, 1980.
[23] A. Borji, D. N. Sihite, and L. Itti, “Salient object detection: A benchmark,” in ECCV (2), 2012, pp. 414–429.
[24] P. Khuwuthyakorn, A. Robles-Kelly, and J. Zhou, “Object of interest detection by saliency learning,” in ECCV, 2010.
[25] P. Mehrani and O. Veksler, “Saliency segmentation based on learning and graph cut refinement,” in BMVC, 2010.
[26] D. Hoiem, A. A. Efros, and M. Hebert, “Recovering surface layout from an image,” IJCV, vol. 75, no. 1, pp. 151–172, 2007.
[27] A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE TPAMI, 2013.
[28] F. Liu and M. Gleicher, “Region enhanced scale-invariant saliency detection,” in ICME, 2006, pp. 1477–1480.
[29] D. Gao and N. Vasconcelos, “Bottom-up saliency is a discriminant process,” in ICCV, 2007, pp. 1–6.
[30] X. Li, Y. Li, C. Shen, A. R. Dick, and A. van den Hengel, “Contextual hypergraph modeling for salient object detection,” in ICCV, 2013, pp. 3328–3335.
[31] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?” in CVPR, 2013.
[32] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, “Automatic salient object segmentation based on context and shape prior,” in BMVC, 2011.
[33] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in CVPR, 2013.
[34] C. Scharfenberger, A. Wong, K. Fergani, J. S. Zelek, and D. A. Clausi, “Statistical textural distinctiveness for salient region detection in natural images,” in CVPR, 2013, pp. 979–986.
[35] K. Shi, K. Wang, J. Lu, and L. Lin, “Pisa: Pixelwise image saliency by aggregating complementary appearance contrast measures with spatial priors,” in CVPR, 2013, pp. 2115–2122.
[36] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, “Efficient salient region detection with soft image abstraction,” in ICCV, 2013, pp. 1529–1536.
[37] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in CVPR, 2012.
[38] W. Zou, K. Kpalma, Z. Liu, J. Ronsin et al., “Segmentation driven low-rank matrix recovery for saliency detection,” in BMVC, 2013, pp. 1–13.
[39] H. Peng, B. Li, R. Ji, W. Hu, W. Xiong, and C. Lang, “Salient object detection via low-rank and structured sparse matrix decomposition,” in AAAI, 2013.
[40] Z. Jiang and L. S. Davis, “Submodular salient region detection,” in CVPR, 2013, pp. 2043–2050.
[41] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in ECCV (5), 2010, pp. 366–379.
[42] Y. Xie, H. Lu, and M.-H. Yang, “Bayesian saliency via low and mid level cues,” IEEE TIP, vol. 22, no. 5, pp. 1689–1698, 2013.
[43] R. Liu, J. Cao, G. Zhong, Z. Lin, S. Shan, and Z. Su, “Adaptive partial differential equation learning for visual saliency detection,” in CVPR, 2014.
[44] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, “Salient object detection for searched web images via global saliency,” in CVPR, 2012, pp. 3194–3201.
[45] S. Vicente, V. Kolmogorov, and C. Rother, “Graph cut based image segmentation with connectivity priors,” in CVPR, 2008.
[46] L. Wang, J. Xue, N. Zheng, and G. Hua, “Automatic salient object extraction with contextual cue,” in ICCV, 2011.
[47] Y. Wei, F. Wen, W. Zhu, and J. Sun, “Geodesic saliency using background priors,” in ECCV (3), 2012, pp. 29–42.
[77] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in CVPR, 2013, pp. 1155–1162.
[78] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in CVPR, 2012.
[48] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency
detection via graph-based manifold ranking,” in CVPR, 2013.
[49] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency
detection via dense and sparse reconstruction,” in ICCV, 2013.
[50] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, “Saliency
detection via absorbing markov chain,” in ICCV, 2013.
[51] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, “Fusing
generic objectness and visual saliency for salient object de-
tection,” in ICCV, 2011, pp. 914–921.
[52] P. Jiang, H. Ling, J. Yu, and J. Peng, “Salient region detection
by ufo: Uniqueness, focusness and objectness,” in ICCV, 2013.
[53] Y. Jia and M. Han, “Category-independent object-level saliency
detection,” in ICCV, 2013.
[54] J. Zhang and S. Sclaroff, “Saliency detection: A boolean map
approach,” in ICCV, 2013, pp. 153–160.
[55] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization
from robust background detection,” in CVPR, 2014.
[56] M. Wang, J. Konrad, P. Ishwar, K. Jing, and H. A. Rowley,
“Image saliency: From intrinsic to extrinsic context,” in CVPR,
2011, pp. 417–424.
[57] Y. Niu, Y. Geng, X. Li, and F. Liu, “Leveraging stereopsis for
saliency analysis,” in CVPR, 2012, pp. 454–461.
[58] K. Desingh, K. M. Krishna, D. Rajan, and C. Jawahar, “Depth
really matters: Improving visual salient region detection with
depth,” in BMVC, 2013.
[59] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, “Saliency detection on
light fields,” in CVPR, 2014.
[60] S. Lu, V. Mahadevan, and N. Vasconcelos, “Learning optimal
seeds for diffusion-based salient object detection,” in CVPR,
2014.
[61] J. Kim, D. Han, Y.-W. Tai, and J. Kim, “Salient region detection
via high-dimensional color transform,” in CVPR, 2014.
[62] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The
secrets of salient object segmentation,” in CVPR, 2014.
[63] J. Carreira and C. Sminchisescu, “Constrained parametric min-
cuts for automatic object segmentation,” in CVPR, 2010, pp.
3241–3248.
[64] P. Wang, J. Wang, G. Zeng, J. Feng, H. Zha, and S. Li, “Salient
object detection for searched web images via global saliency,”
in CVPR, 2012, pp. 3194–3201.
[65] F. Moosmann, D. Larlus, and F. Jurie, “Learning saliency maps for object categorization,” in ECCV Workshops, 2006.
[66] T. Judd, K. A. Ehinger, F. Durand, and A. Torralba, “Learning
to predict where humans look,” in ICCV, 2009, pp. 2106–2113.
[67] Y. Lu, W. Zhang, C. Jin, and X. Xue, “Learning attention map
from images,” in CVPR, 2012, pp. 1067–1074.
[68] E. P. Simoncelli and W. T. Freeman, “The steerable pyramid:
a flexible architecture for multi-scale derivative computation,”
in ICIP (3), 1995, pp. 444–447.
[69] B. Fernando, É. Fromont, D. Muselet, and M. Sebban, “Dis-
criminative feature fusion for image classification,” in CVPR,
2012, pp. 3434–3441.
[70] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-
based image segmentation,” IJCV, vol. 59, no. 2, 2004.
[71] M. Heikkilä, M. Pietikäinen, and C. Schmid, “Description of
interest regions with local binary patterns,” Pattern Recognition,
vol. 42, no. 3, pp. 425–436, 2009.
[72] T. K. Leung and J. Malik, “Representing and recognizing
the visual appearance of materials using three-dimensional
textons,” IJCV, vol. 43, no. 1, pp. 29–44, 2001.
[73] D. Hoiem, A. A. Efros, and M. Hebert, “Geometric context
from a single image,” in ICCV, 2005, pp. 654–661.
[74] H. Jiang, Y. Wu, and Z. Yuan, “Probabilistic salient object
contour detection based on superpixels,” in ICIP, 2013, pp.
3069–3072.
[75] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen, “iCoseg:
Interactive co-segmentation with intelligent scribble guid-
ance.” in IEEE CVPR, 2010, pp. 3169–3176.
[76] S. Alpert, M. Galun, R. Basri, and A. Brandt, “Image seg-
mentation by probabilistic bottom-up aggregation and cue
integration,” in CVPR, 2007.
arXiv:1411.4389v4 [cs.CV] 31 May 2016

Abstract—Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.
• J. Donahue, L. A. Hendricks, M. Rohrbach, S. Guadarrama, and T. Darrell are with the Department of Electrical Engineering and Computer Science, UC Berkeley, Berkeley, CA.
• M. Rohrbach and T. Darrell are additionally affiliated with the International Computer Science Institute, Berkeley, CA.
• S. Venugopalan is with the Department of Computer Science, UT Austin, Austin, TX.
• K. Saenko is with the Department of Computer Science, UMass Lowell, Lowell, MA.

Manuscript received November 30, 2015.

1 INTRODUCTION

Recognition and description of images and videos is a fundamental challenge of computer vision. Dramatic progress has been achieved by supervised convolutional neural network (CNN) models on image recognition tasks, and a number of extensions to process video have been recently proposed. Ideally, a video model should allow processing of variable length input sequences, and also provide for variable length outputs, including generation of full-length sentence descriptions that go beyond conventional one-versus-all prediction tasks. In this paper we propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures for visual recognition and description which combines convolutional layers and long-range temporal recursion and is end-to-end trainable (Figure 1). We instantiate our architecture for specific video activity recognition, image caption generation, and video description tasks as described below.

Fig. 1. We propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures leveraging the strengths of rapid progress in CNNs for visual recognition problems, and the growing desire to apply such models to time-varying inputs and outputs. LRCN processes the (possibly) variable-length visual input (left) with a CNN (middle-left), whose outputs are fed into a stack of recurrent sequence models (LSTMs, middle-right), which finally produce a variable-length prediction (right). Both the CNN and LSTM weights are shared across time, resulting in a representation that scales to arbitrarily long sequences.

Research on CNN models for video processing has considered learning 3D spatio-temporal filters over raw sequence data [1], [2], and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments [3], [4]. Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling. Following the same inspiration that motivates current deep convolutional models, we advocate for video recognition and description models which are also deep over temporal dimensions; i.e., have temporal recurrence of latent variables. Recurrent Neural Network (RNN) models are “deep in time” – explicitly so when unrolled – and form implicit compositional representations
2
in the time domain. Such “deep” models predated deep RNN Unit LSTM Unit
spatial convolution models in the literature [5], [6]. xt
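As a concrete illustration of the data flow sketched in the Figure 1 caption – per-frame CNN features feeding a recurrent model, with both sets of weights shared across time – consider the following minimal numpy sketch. This is not the authors' implementation; the toy "CNN" is a single shared projection, and the recurrent cell is a plain RNN step, both with made-up dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared weights: one CNN stand-in and one recurrent cell, reused at
# every time step (the property the Figure 1 caption highlights).
Wc = rng.standard_normal((16, 16)) * 0.1   # toy "CNN" (frames are 4x4 = 16 pixels)
Wxh = rng.standard_normal((8, 16)) * 0.1   # input-to-hidden weights
Whh = rng.standard_normal((8, 8)) * 0.1    # hidden-to-hidden weights
bh = np.zeros(8)

def cnn_features(frame):
    # Stand-in for the visual transformation phi_V(x_t).
    return np.tanh(Wc @ frame.ravel())

def rnn_step(x, h):
    # h_t = tanh(Wxh x_t + Whh h_{t-1} + bh)
    return np.tanh(Wxh @ x + Whh @ h + bh)

frames = rng.standard_normal((5, 4, 4))    # a toy 5-frame "video"
h = np.zeros(8)                            # h_0 = 0
for frame in frames:                       # same weights at every step,
    h = rnn_step(cnn_features(frame), h)   # so any sequence length works

print(h.shape)
```

Because the weights are time-invariant, the same loop runs unchanged on a video of any length, which is exactly what lets the representation scale to arbitrarily long sequences.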
The use of RNNs in perceptual applications has been explored for many decades, with varying results. A significant limitation of simple RNN models which strictly integrate state information over time is known as the “vanishing gradient” effect: the ability to backpropagate an error signal through a long-range temporal interval becomes increasingly difficult in practice. Long Short-Term Memory (LSTM) units, first proposed in [7], are recurrent modules which enable long-range learning. LSTM units have hidden state augmented with nonlinear mechanisms to allow state to propagate without modification, be updated, or be reset, using simple learned gating functions. LSTMs have recently been demonstrated to be capable of large-scale learning of speech recognition [8] and language translation models [9], [10].

We show here that convolutional networks with recurrent units are generally applicable to visual time-series modeling, and argue that in visual tasks where static or flat temporal models have previously been employed, LSTM-style RNNs can provide significant improvement when ample training data are available to learn or refine the representation. Specifically, we show that LSTM-type models provide improved recognition on conventional video activity challenges and enable a novel end-to-end optimizable mapping from image pixels to sentence-level natural language descriptions. We also show that these models improve generation of descriptions from intermediate visual representations derived from conventional visual models.

We instantiate our proposed architecture in three experimental settings (Figure 3). First, we show that by directly connecting a visual convolutional model to deep LSTM networks, we are able to train video recognition models that capture temporal state dependencies (Figure 3, left; Section 4). While existing labeled video activity datasets may not have actions or activities with particularly complex temporal dynamics, we nonetheless observe significant improvements on conventional benchmarks.

Second, we explore end-to-end trainable image-to-sentence mappings. Strong results for machine translation tasks have recently been reported [9], [10]; such models are encoder-decoder pairs based on LSTM networks. We propose a multimodal analog of this model, and describe an architecture which uses a visual convnet to encode a deep state vector, and an LSTM to decode the vector into a natural language string (Figure 3, middle; Section 5). The resulting model can be trained end-to-end on large-scale image and text datasets, and even with modest training provides competitive generation results compared to existing methods.

Finally, we show that LSTM decoders can be driven directly from conventional computer vision methods which predict higher-level discriminative labels, such as the semantic video role tuple predictors in [11] (Figure 3, right; Section 6). While not end-to-end trainable, such models offer architectural and performance advantages over previous statistical machine translation-based approaches.

We have realized a generic framework for recurrent models in the widely adopted deep learning framework Caffe [12], including ready-to-use implementations of RNN and LSTM units. (See http://jeffdonahue.com/lrcn/.)

Fig. 2. A diagram of a basic RNN cell (left) and an LSTM memory cell (right) used in this paper (from [13], a slight simplification of the architecture described in [14], which was derived from the LSTM initially proposed in [7]).

2 BACKGROUND: RECURRENT NETWORKS

Traditional recurrent neural networks (RNNs, Figure 2, left) model temporal dynamics by mapping input sequences to hidden states, and hidden states to outputs, via the following recurrence equations (Figure 2, left):

ht = g(Wxh xt + Whh ht−1 + bh)
zt = g(Whz ht + bz)

where g is an element-wise non-linearity, such as a sigmoid or hyperbolic tangent, xt is the input, ht ∈ RN is the hidden state with N hidden units, and zt is the output at time t. For a length T input sequence ⟨x1, x2, ..., xT⟩, the updates above are computed sequentially as h1 (letting h0 = 0), z1, h2, z2, ..., hT, zT.

Though RNNs have proven successful on tasks such as speech recognition [15] and text generation [16], it can be difficult to train them to learn long-term dynamics, likely due in part to the vanishing and exploding gradients problem [7] that can result from propagating the gradients down through the many layers of the recurrent network, each corresponding to a particular time step. LSTMs provide a solution by incorporating memory units that explicitly allow the network to learn when to “forget” previous hidden states and when to update hidden states given new information. As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in [13] (Figure 2, right), a slight simplification of the one described in [8], which was derived from the original LSTM unit proposed in [7]. Letting σ(x) = (1 + e−x)−1 be the sigmoid non-linearity which squashes real-valued inputs to a [0, 1] range, and letting tanh(x) = (ex − e−x)/(ex + e−x) = 2σ(2x) − 1 be the hyperbolic tangent non-linearity, similarly squashing its inputs to a [−1, 1] range, the LSTM updates for time step t given inputs xt, ht−1, and ct−1 are:

it = σ(Wxi xt + Whi ht−1 + bi)
ft = σ(Wxf xt + Whf ht−1 + bf)
ot = σ(Wxo xt + Who ht−1 + bo)
gt = tanh(Wxc xt + Whc ht−1 + bc)
ct = ft ⊙ ct−1 + it ⊙ gt
ht = ot ⊙ tanh(ct)
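For concreteness, the update equations above can be transcribed directly into code. The sketch below is ours, not the paper's implementation: it stacks the four weight pairs (Wx·, Wh·) into a single matrix applied to the concatenation of xt and ht−1, which is algebraically equivalent to the separate per-gate products, and uses arbitrary small dimensions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM update following the equations above. W stacks the
    (i, f, o, g) pre-activation weights for the concatenated [x; h_prev]."""
    N = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:N])        # input gate i_t
    f = sigmoid(z[N:2*N])      # forget gate f_t
    o = sigmoid(z[2*N:3*N])    # output gate o_t
    g = np.tanh(z[3*N:4*N])    # input modulation gate g_t
    c = f * c_prev + i * g     # c_t = f_t*c_{t-1} + i_t*g_t (element-wise)
    h = o * np.tanh(c)         # h_t = o_t*tanh(c_t)
    return h, c

rng = np.random.default_rng(0)
D, N = 3, 4                                 # input and hidden sizes
W = rng.standard_normal((4 * N, D + N)) * 0.1
b = np.zeros(4 * N)

h, c = np.zeros(N), np.zeros(N)
for _ in range(5):                          # run a short input sequence
    h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
print(h.shape, c.shape)
```

Note that only c is carried forward without a squashing non-linearity, which is the mechanism that lets state propagate unmodified across many time steps when f stays near 1.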
x ⊙ y denotes the element-wise product of vectors x and y. In addition to a hidden unit ht ∈ RN, the LSTM includes an input gate it ∈ RN, forget gate ft ∈ RN, output gate ot ∈ RN, input modulation gate gt ∈ RN, and memory cell ct ∈ RN. The memory cell unit ct is a sum of two terms: the previous memory cell unit ct−1, which is modulated by ft, and gt, a function of the current input and previous hidden state, modulated by the input gate it. Because it and ft are sigmoidal, their values lie within the range [0, 1], and it and ft can be thought of as knobs that the LSTM learns to selectively forget its previous memory or consider its current input. Likewise, the output gate ot learns how much of the memory cell to transfer to the hidden state. These additional cells seem to enable the LSTM to learn complex and long-term temporal dynamics for a wide variety of sequence learning and prediction tasks. Additional depth can be added to LSTMs by stacking them on top of each other, using the hidden state ht(ℓ−1) of the LSTM in layer ℓ − 1 as the input to the LSTM in layer ℓ.

Recently, LSTMs have achieved impressive results on language tasks such as speech recognition [8] and machine translation [9], [10]. Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, [8] eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text translations. [9] and [10] translate sentences from English to French with a multi-layer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows an input sequence of arbitrary length to be mapped to an output sequence of different length. The sequence-to-sequence architecture for machine translation circumvents the need for language models.

The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed-length inputs or outputs, allowing simple modeling of sequential data of varying lengths, such as text or video. We next describe a unified framework to combine recurrent models such as LSTMs with deep convolutional networks to form end-to-end trainable networks capable of complex visual and sequence prediction tasks.

3 LONG-TERM RECURRENT CONVOLUTIONAL NETWORK (LRCN) MODEL

This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. LRCN works by passing each visual input xt (an image in isolation, or a frame from a video) through a feature transformation φV(.) with parameters V, usually a CNN, to produce a fixed-length vector representation φV(xt). The outputs of φV are then passed into a recurrent sequence learning module.

In its most general form, a recurrent model has parameters W, and maps an input xt and a previous time step hidden state ht−1 to an output zt and updated hidden state ht. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: h1 = fW(x1, h0) = fW(x1, 0), then h2 = fW(x2, h1), etc., up to hT. Some of our models stack multiple LSTMs atop one another as described in Section 2.

To predict a distribution P(yt) over outcomes yt ∈ C (where C is a discrete, finite set of outcomes) at time step t, the outputs zt ∈ Rdz of the sequential model are passed through a linear prediction layer ŷt = Wz zt + bz, where Wz ∈ R|C|×dz and bz ∈ R|C| are learned parameters. Finally, the predicted distribution P(yt) is computed by taking the softmax of ŷt: P(yt = c) = softmax(ŷt)c = exp(ŷt,c) / Σc′∈C exp(ŷt,c′).

The success of recent deep models for object recognition [17], [18], [19] suggests that strategically composing many “layers” of non-linear functions can result in powerful models for perceptual problems. For large T, the above recurrence indicates that the last few predictions from a recurrent network with T time steps are computed by a very “deep” (T-layer) non-linear function, suggesting that the resulting recurrent model may have similar representational power to a T-layer deep network. Critically, however, the sequence model’s weights W are reused at every time step, forcing the model to learn generic time step-to-time step dynamics (as opposed to dynamics conditioned on t, the sequence index) and preventing the parameter size from growing in proportion to the maximum sequence length.

In most of our experiments, the visual feature transformation φ corresponds to the activations in some layer of a deep CNN. Using a visual transformation φV(.) which is time-invariant and independent at each time step has the important advantage of making the expensive convolutional inference and training parallelizable over all time steps of the input, facilitating the use of fast contemporary CNN implementations whose efficiency relies on independent batch processing, and end-to-end optimization of the visual and sequential model parameters V and W.

We consider three vision problems (activity recognition, image description and video description), each of which instantiates one of the following broad classes of sequential learning tasks:

1) Sequential input, static output (Figure 3, left): ⟨x1, x2, ..., xT⟩ ↦ y. The visual activity recognition problem can fall under this umbrella, with videos of arbitrary length T as input, but with the goal of predicting a single label like running or jumping drawn from a fixed vocabulary.

2) Static input, sequential output (Figure 3, middle): x ↦ ⟨y1, y2, ..., yT⟩. The image captioning problem fits in this category, with a static (non-time-varying) image as input, but a much larger and richer label space consisting of sentences of any length.

3) Sequential input and output (Figure 3, right): ⟨x1, x2, ..., xT⟩ ↦ ⟨y1, y2, ..., yT′⟩. In tasks such as video description, both the visual input and output are time-varying, and in general the number of input and output time steps may differ (i.e., we may have T ≠ T′). In video description, for example, the number of frames in the video should not constrain the length of (number of words in) the natural language description.

Fig. 3. Task-specific instantiations of our LRCN model for activity recognition, image description, and video description.

In the previously described generic formulation of recurrent models, each instance has T inputs ⟨x1, x2, ..., xT⟩ and T outputs ⟨y1, y2, ..., yT⟩. Note that this formulation does not align cleanly with any of the three problem classes described above – in the first two classes, either the input or output is static, and in the third class, the input length T need not match the output length T′. Hence, we describe how we adapt this formulation in our hybrid model to each of the above three problem settings.

With sequential inputs and static outputs (class 1), we take a late-fusion approach to merging the per-time step predictions ⟨y1, y2, ..., yT⟩ into a single prediction y for the full sequence. With static inputs x and sequential outputs (class 2), we simply duplicate the input x at all T time steps: ∀t ∈ {1, 2, ..., T} : xt := x. Finally, for a sequence-to-sequence problem with (in general) different input and output lengths (class 3), we take an “encoder-decoder” approach, as proposed for machine translation by [9], [20]. In this approach, one sequence model, the encoder, maps the input sequence to a fixed-length vector, and another sequence model, the decoder, unrolls this vector to a sequential output of arbitrary length. Under this type of model, a run of the full system on one instance occurs over T + T′ − 1 time steps. For the first T time steps, the encoder processes the input x1, x2, ..., xT, and the decoder is inactive until time step T, when the encoder’s output is passed to the decoder, which in turn predicts the first output y1. For the latter T′ − 1 time steps, the decoder predicts the remainder of the output y2, y3, ..., yT′ with the encoder inactive. This encoder-decoder approach, as applied to the video description task, is depicted in Section 6, Figure 5 (left).

Under the proposed system, the parameters (V, W) of the model’s visual and sequential components can be jointly optimized by maximizing the likelihood of the ground truth outputs yt at each time step t, conditioned on the input data and labels up to that point (x1:t, y1:t−1). In particular, for a training set D of labeled sequences (xt, yt)t=1..T ∈ D, we optimize parameters (V, W) to minimize the expected negative log likelihood of a sequence sampled from the training set:

L(V, W, D) = −(1/|D|) Σ(xt,yt)t=1..T ∈ D Σt=1..T log P(yt | x1:t, y1:t−1, V, W).

One of the most appealing aspects of the described system is the ability to learn the parameters “end-to-end,” such that the parameters V of the visual feature extractor learn to pick out the aspects of the visual input that are relevant to the sequential classification problem. We train our LRCN models using stochastic gradient descent, with backpropagation used to compute the gradient ∇V,W L(V, W, D̃) of the objective L with respect to all parameters (V, W) over minibatches D̃ ⊂ D sampled from the training dataset D.

We next demonstrate the power of end-to-end trainable hybrid convolutional and recurrent networks by exploring three applications: activity recognition, image captioning, and video description.

4 ACTIVITY RECOGNITION

Activity recognition is an instance of the first class of sequential learning tasks described above: each frame in a length T sequence is the input to a single convolutional network (i.e., the convnet weights are tied across time). We consider both RGB and flow as inputs to our recognition system. Flow is computed with [21] and transformed into a “flow image” by scaling and shifting x and y flow values to a range of [−128, +128]. A third channel for the flow image is created by calculating the flow magnitude.

During training, videos are resized to 240 × 320 and we augment our data by using 227 × 227 crops and mirroring. Additionally, we train the LRCN networks with video clips of 16 frames, even though the UCF101 videos are generally much longer (on the order of 100 frames when extracting frames at 30 FPS). Training on shorter video clips can be seen as analogous to training on image crops and is a useful method of data augmentation. LRCN is trained to predict the video’s activity class at each time step. To produce a single label prediction for an entire video clip, we average the label probabilities – the outputs of the network’s softmax layer – across all frames and choose the most probable label. At test time, we extract 16 frame clips with a stride of 8 frames from each video and average across all clips from a single video.

The CNN base of LRCN in our activity recognition experiments is a hybrid of the CaffeNet [12] reference model (a minor variant of AlexNet [17]) and the network used
by Zeiler & Fergus [22]. The network is pre-trained on the 1.2M image ILSVRC-2012 [23] classification training subset of the ImageNet [24] dataset, giving the network a strong initialization to facilitate faster training and avoid overfitting to the relatively small video activity recognition datasets. When classifying center crops, the top-1 classification accuracy is 60.2% and 57.4% for the hybrid and CaffeNet reference models, respectively.

We compare LRCN to a single frame baseline model. In our baseline model, T video frames are individually classified by a CNN. As in the LSTM model, whole video classification is done by averaging scores across all video frames.

4.1 Evaluation

We evaluate our architecture on the UCF101 dataset [25] which consists of over 12,000 videos categorized into 101 human action classes. The dataset is split into three splits, with just under 8,000 videos in the training set for each split.

We explore various hyperparameters for the LRCN activity recognition architecture. To explore different variants, we divide the first training split of UCF101 into a smaller training set (≈6,000 videos) and a validation set (≈3,000 videos). We find that the most influential hyperparameters include the number of hidden units in the LSTM and whether fc6 or fc7 features are used as input to the LSTM. We compare networks with 256, 512, and 1024 LSTM hidden units. When using flow as an input, more hidden units leads to better performance, with 1024 hidden units yielding a 1.7% boost in accuracy in comparison to a network with 256 hidden units on our validation set. In contrast, for networks with RGB input, the number of hidden units has little impact on the performance of the model. We thus use 1024 hidden units for flow inputs, and 256 for RGB inputs. We find that using fc6 as opposed to fc7 features improves accuracy when using flow as input on our validation set by 1%. When using RGB images as input, the difference between using fc6 or fc7 features is quite small; using fc6 features only increases accuracy by 0.2%. Because both models perform better with fc6 features, we train our final models using fc6 features (denoted by LRCN-fc6). We also considered subsampling the frames input to the LSTM, but found that this hurts performance compared with using all frames. Additionally, when training the LRCN network end-to-end, we found that aggressive dropout (0.9) was needed to avoid overfitting.

Table 1 reports the average accuracy across the three standard test splits of UCF101. Columns 2-3 compare video classification of LRCN against the baseline single frame architecture for both RGB and flow inputs. LRCN yields the best results for both RGB and flow and improves upon the baseline network by 0.83% and 2.91%, respectively. RGB and flow networks can be combined by computing a weighted average of network scores as proposed in [4]. Like [4], we report two weighted averages of the predictions from the RGB and flow networks in Table 1 (right). Since the flow network outperforms the RGB network, weighting the flow network higher unsurprisingly leads to better accuracy. In this case, LRCN outperforms the baseline single-frame model by 3.40%.

Table 2 compares LRCN’s accuracy with the single frame baseline model for individual classes on Split 1 of UCF101.

TABLE 1
Activity recognition: Comparing single frame models to LRCN networks for activity recognition on the UCF101 [25] dataset, with RGB and flow inputs. Average values across all three splits are shown. LRCN consistently and strongly outperforms a model based on predictions from the underlying convolutional network architecture alone.

                    Single Input Type      Weighted Average
Model               RGB       Flow         1/2, 1/2     1/3, 2/3
Single frame        67.37     74.37        75.46        78.94
LRCN-fc6            68.20     77.28        80.90        82.34

TABLE 2
Activity recognition: comparison of improvement ∆ in LRCN’s per-class recognition accuracy versus the single-frame baseline. Here we report results on all three splits of UCF101 (only results on the first split were presented in the paper). ∆ is the difference between LRCN’s accuracy and the single-frame model’s accuracy.

Label                   ∆        Label                ∆
BoxingPunchingBag       40.82    BoxingSpeedBag      -16.22
HighJump                29.73    Mixing              -15.56
JumpRope                28.95    Knitting            -14.71
CricketShot             28.57    Typing              -13.95
Basketball              28.57    Skiing              -12.50
WallPushups             25.71    BaseballPitch       -11.63
Nunchucks               22.86    BrushingTeeth       -11.11
ApplyEyeMakeup          22.73    Skijet              -10.71
HeadMassage             21.95    Haircut              -9.10
Drumming                17.78    TennisSwing          -8.16

For the majority of classes, LRCN improves performance over the single frame model. Though LRCN performs worse on some classes including Knitting and Mixing, in general when LRCN performs worse, the loss in accuracy is not as substantial as the gain in accuracy for classes like BoxingPunchingBag and HighJump. Consequently, accuracy is higher overall.

Table 3 compares accuracies for the LRCN flow and LRCN RGB models for individual classes on Split 1 of UCF101. Note that for some classes the LRCN flow model outperforms the LRCN RGB model and vice versa. One explanation is that activities which are better classified by the LRCN RGB model are best determined by which objects are present in the scene, while activities which are better classified by the LRCN flow model are best classified by the kind of motion in the scene. For example, activity classes like Typing are highly correlated with the presence of certain objects, such as a keyboard, and are thus best learned by the LRCN RGB model. Other activities such as SoccerJuggling include more generic objects which are frequently seen in other activities (soccer balls, people) and are thus best identified from class-specific motion cues. Because RGB and flow signals are complementary, the best models take both into account.

LRCN shows clear improvement over the baseline single-frame system and is comparable to accuracy achieved by other deep models. [4] report results on UCF101 by computing a weighted average between flow and RGB networks and achieve 87.6%. [3] reports 65.4% accuracy on UCF101, which is substantially lower than LRCN.

5 IMAGE CAPTIONING

In contrast to activity recognition, the static image captioning task requires only a single invocation of a convolutional network since the input consists of a single image. At each time step, both the image features and the previous word
are provided as inputs to the sequence model, in this case a stack of LSTMs (each with 1000 hidden units), which is used to learn the dynamics of the time-varying output sequence, natural language.

At time step t, the input to the bottom-most LSTM is the embedded word from the previous time step yt−1. Input words are encoded as “one-hot” vectors: vectors y ∈ RK with a single non-zero component yi = 1 denoting the ith word in the vocabulary, where K is the number of words in the vocabulary, plus one additional entry for the <BOS> (beginning of sequence) token which is always taken as y0, the “previous word” at the first time step (t = 1). These one-hot vectors are then projected into an embedding space with dimension de by multiplication We yt with a learned parameter matrix We ∈ Rde×K. The result of a matrix-vector multiplication with a one-hot vector is the column of the matrix corresponding to the index of the single non-zero component of the one-hot vector. The matrix We can therefore be thought of as a “lookup table,” mapping each of the K words in the vocabulary to a de-dimensional vector.

The visual feature representation φV(x) of the image x may be input to the sequence model – a stack of L LSTMs – by concatenating it at each time step either with (1) the embedded previous word We yt−1 and fed into the first LSTM of the stack, or (2) the hidden state ht(ℓ−1) output from LSTM ℓ − 1 and fed into LSTM ℓ, for some ℓ ∈ 2, ..., L. These choices are depicted in Figure 4. We refer to the latter choice as “factored,” as it forces a sort of separation of responsibilities by “blinding” the first ℓ − 1 LSTMs and forcing all of the capacity of their hidden states at time step t to represent only the partial caption y1:t−1 independent of the visual input, while the LSTMs starting from ℓ are responsible for fusing the lower layer’s hidden state given by the partial caption with the visual feature representation φV(x) to produce a joint hidden state representation ht(ℓ) of the visual and language inputs up to time step t from which the next word yt can be predicted. In the factored case, the hidden state ht for the lower layers is conditionally independent of the image x given the partial caption y1:t−1.

The outputs of the final LSTM in the stack are the inputs to a learned linear prediction layer with a softmax producing a distribution P(yt | y1:t−1, φV(x)) over words yt in the model’s vocabulary, including the <EOS> token denoting the end of the caption, allowing the model to predict captions of varying length. The visual model φV used for our image captioning experiments is either the CaffeNet [12] reference model, a variant of AlexNet [17], or the more modern and computationally expensive VGGNet [18] model pre-trained for ILSVRC-2012 [23] classification.

Without any explicit language modeling or impositions on the structure of the generated captions, the described LRCN system learns mappings from images input as pixel intensity values to natural language descriptions that are often semantically descriptive and grammatically correct. At training time, the previous word inputs y1:t−1 at time step t are from the ground truth caption. For inference of captions on a novel image x, the input is a sample ỹt ∼ P(yt | ỹ1:t−1, φV(x)) from the model’s predicted distribution at the previous time step, and generation continues until an <EOS> (end of sequence) token is generated.

TABLE 4
Image description: retrieval results for the Flickr30k [32] dataset. R@K is the average recall at rank K (high is good). Medr is the median rank (low is good).

5.1 Evaluation

We evaluate our image description model for retrieval and generation tasks. We first demonstrate the effectiveness of our model by quantitatively evaluating it on the image and caption retrieval tasks proposed by [26] and seen in [27], [28], [29], [30], [31]. We report results on the Flickr30k [32] and COCO 2014 [33] datasets, both with five captions annotated per image.

5.1.1 Retrieval

Retrieval results on the Flickr30k [32] dataset are recorded in Table 4. We report the median rank, Medr, of the first retrieved ground truth image or caption, and Recall@K, the number of images or captions for which a correct caption or image is retrieved within the top K results. Our model consistently outperforms the strong baselines from recent work [27], [28], [29], [30], [31], as can be seen in Table 4. Here, we note that the VGGNet model in [31] (called OxfordNet in their work) outperforms our model on the retrieval task. However, VGGNet is a stronger convolutional network [18] than that used for our results on this task. The strength of our sequence model (and integration of the sequence and visual models) can be more directly measured against the ConvNet [31] result, which uses a very similar base CNN architecture (AlexNet [17], where we use CaffeNet) pretrained on the same data.

We also ablate the model’s retrieval performance on a randomly chosen subset of 1000 images (and 5000 captions) from the COCO 2014 [33] validation set. Results are
Fig. 4. Three variants of the LRCN image captioning architecture that we experimentally evaluate: Single Layer (L = 1), LRCN1u; Two Layers (L = 2), Unfactored, LRCN2u; and Two Layers (L = 2), Factored, LRCN2f. We explore the effect of depth in the LSTM stack, and the effect of the “factorization” of the modalities.

recorded in Table 5. The first group of results for each task examines the effectiveness of an LSTM compared with a “vanilla” RNN as described in Section 2. These results demonstrate that the use of the LSTM unit compared to the simpler RNN architecture is an important element of our model’s performance on this task, justifying the additional complexity and suggesting that the LSTM’s gating mechanisms allowing for “long-term” memory may be quite useful, even for relatively simple sequences.

Within the second and third result groups, we compare performance among the three sequence model architectural variants depicted in Figure 4. For both tasks and under all metrics, the two layer, unfactored variant (LRCN2u) performs worse than the other two. The fact that LRCN1u outperforms LRCN2u indicates that stacking additional LSTM layers alone is not beneficial for this task. The other two variants (LRCN2f and LRCN1u) perform similarly across the board, with LRCN2f appearing to have a slight edge in the image to caption task under most metrics, but the reverse for caption to image retrieval.

Unsurprisingly, finetuning the CNN (indicated by the “FT?” column of Table 5) and using a more powerful CNN

TABLE 5
Image description: retrieval performance on a randomly chosen subset of 1000 images (5000 captions) from the COCO 2014 [33] validation set.

Vision Model         Sequence Model           Retrieval Performance
CNN        FT?       Unit    L   Factor?      R@1    R@5    R@10   Medr

Caption to Image
CaffeNet   -         RNN     2   X            21.3   51.7   67.2   5
CaffeNet   -         LSTM    2   X            25.0   56.2   70.6   4
CaffeNet   -         LSTM    1   -            25.2   56.2   70.8   4
CaffeNet   -         LSTM    2   -            23.4   54.8   69.3   5
CaffeNet   -         LSTM    2   X            25.0   56.2   70.6   4
CaffeNet   X         LSTM    1   -            28.5   60.0   74.5   4
CaffeNet   X         LSTM    2   -            25.6   57.2   72.2   4
CaffeNet   X         LSTM    2   X            27.2   59.6   74.7   4
VGGNet     -         LSTM    2   X            33.5   68.1   80.8   3
VGGNet     X         LSTM    2   X            39.3   74.7   85.9   2

Image to Caption
CaffeNet   -         RNN     2   X            30.2   61.0   72.6   4
CaffeNet   -         LSTM    2   X            33.8   65.3   75.3   3
CaffeNet   -         LSTM    1   -            32.3   64.5   75.6   3
CaffeNet   -         LSTM    2   -            29.9   60.8   72.7   3
CaffeNet   -         LSTM    2   X            33.8   65.3   75.3   3
CaffeNet   X         LSTM    1   -            36.1   68.4   79.5   3
CaffeNet   X         LSTM    2   -            33.1   63.7   76.9   3
CaffeNet   X         LSTM    2   X            36.3   67.3   80.6   2
VGGNet     -         LSTM    2   X            46.0   77.4   88.3   2
VGGNet     X         LSTM    2   X            53.3   84.3   91.9   1
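As a quick reference for reading Table 5, both retrieval metrics can be computed from the rank of the first correct item returned for each query, as defined in Section 5.1.1. The sketch below uses made-up ranks for illustration, not numbers from the paper.

```python
import numpy as np

def recall_at_k(ranks, k):
    # Fraction of queries whose first correct item appears within the
    # top k results (ranks are 1-indexed).
    return float(np.mean(np.asarray(ranks) <= k))

def median_rank(ranks):
    # Medr: median rank of the first retrieved ground truth item.
    return float(np.median(ranks))

# Hypothetical ranks of the first correct caption for 8 image queries.
ranks = [1, 3, 2, 15, 1, 7, 4, 30]
print(recall_at_k(ranks, 1))   # 0.25
print(recall_at_k(ranks, 5))   # 0.625
print(median_rank(ranks))      # 3.5
```

Higher R@K and lower Medr are better, which is why the strongest rows of Table 5 combine high recall values with a median rank of 1 or 2.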
Generation Strategy            Vision Model      Sequence Model          Generation Performance (COCO 2014 [33] Validation Set)
Beam Width   Sample N, T       CNN     FT?       Unit   L   Factor?      B1   B2   B3   B4   C   M   R
1 - - CaffeNet - RNN 2 X 0.638 0.454 0.315 0.220 0.660 0.209 0.473
1 - - CaffeNet - LSTM 2 X 0.646 0.462 0.321 0.224 0.674 0.210 0.477
1 - - CaffeNet - LSTM 1 - 0.654 0.475 0.333 0.231 0.661 0.209 0.480
1 - - CaffeNet - LSTM 2 - 0.653 0.470 0.328 0.230 0.682 0.212 0.480
1 - - CaffeNet - LSTM 2 X 0.646 0.462 0.321 0.224 0.674 0.210 0.477
1 - - CaffeNet X LSTM 1 - 0.661 0.485 0.344 0.241 0.702 0.216 0.489
1 - - CaffeNet X LSTM 2 - 0.659 0.478 0.338 0.238 0.716 0.217 0.486
1 - - CaffeNet X LSTM 2 X 0.659 0.478 0.336 0.237 0.717 0.218 0.486
1 - - VGGNet - LSTM 2 X 0.674 0.494 0.351 0.248 0.773 0.227 0.497
1 - - VGGNet X LSTM 2 X 0.695 0.519 0.374 0.268 0.839 0.237 0.512
- 100 1.5 CaffeNet - RNN 2 X 0.647 0.466 0.334 0.244 0.703 0.212 0.479
- 100 1.5 CaffeNet - LSTM 2 X 0.657 0.478 0.344 0.251 0.720 0.215 0.485
- 100 1.5 CaffeNet - LSTM 1 - 0.664 0.490 0.354 0.254 0.704 0.211 0.488
- 100 1.5 CaffeNet - LSTM 2 - 0.664 0.486 0.352 0.257 0.732 0.216 0.489
- 100 1.5 CaffeNet - LSTM 2 X 0.657 0.478 0.344 0.251 0.720 0.215 0.485
- 100 1.5 CaffeNet X LSTM 1 - 0.679 0.507 0.370 0.268 0.753 0.219 0.499
- 100 1.5 CaffeNet X LSTM 2 - 0.672 0.495 0.361 0.265 0.762 0.222 0.495
- 100 1.5 CaffeNet X LSTM 2 X 0.670 0.493 0.358 0.264 0.764 0.222 0.495
- 100 1.5 VGGNet - LSTM 2 X 0.690 0.514 0.377 0.278 0.828 0.231 0.508
- 100 1.5 VGGNet X LSTM 2 X 0.711 0.541 0.402 0.300 0.896 0.242 0.524
1 - - VGGNet X LSTM 2 X 0.695 0.519 0.374 0.268 0.839 0.237 0.512
2 - - VGGNet X LSTM 2 X 0.707 0.533 0.394 0.291 0.879 0.242 0.520
3 - - VGGNet X LSTM 2 X 0.708 0.536 0.399 0.298 0.888 0.243 0.521
4 - - VGGNet X LSTM 2 X 0.706 0.534 0.398 0.299 0.888 0.243 0.521
5 - - VGGNet X LSTM 2 X 0.704 0.533 0.398 0.300 0.888 0.242 0.520
10 - - VGGNet X LSTM 2 X 0.699 0.528 0.395 0.298 0.886 0.241 0.518
- 1 2.0 VGGNet X LSTM 2 X 0.658 0.472 0.327 0.224 0.733 0.222 0.483
- 10 2.0 VGGNet X LSTM 2 X 0.708 0.534 0.391 0.286 0.868 0.239 0.519
- 25 2.0 VGGNet X LSTM 2 X 0.712 0.540 0.398 0.294 0.885 0.241 0.523
- 100 2.0 VGGNet X LSTM 2 X 0.714 0.543 0.402 0.297 0.889 0.242 0.524
- 100 1.0 VGGNet X LSTM 2 X 0.674 0.494 0.357 0.261 0.805 0.228 0.494
- 100 1.5 VGGNet X LSTM 2 X 0.711 0.541 0.402 0.300 0.896 0.242 0.524
- 100 2.0 VGGNet X LSTM 2 X 0.714 0.543 0.402 0.297 0.889 0.242 0.524
TABLE 6
Image caption generation performance (under the BLEU 1-4 [34] (B1-B4), CIDEr-D [35] (C), METEOR [36] (M), and ROUGE-L [37] (R) metrics)
across various network architectures and generation strategies. In the topmost set of results, we show performance across various CNN and
recurrent architectures for a simple generation strategy – beam search with beam width 1 (i.e., simply choosing the most probable word at each
time step). In the middle set of results, we show performance across the same set of architectures for a more sophisticated and computationally
intensive generation strategy found to be the best performing (in terms of performance under the CIDEr-D metric) among those explored in the
bottom-most set of results, which explores various generation strategies while fixing the choice of network. In the first two sets of results, we vary
the visual input CNN architecture (either CaffeNet [12], an architecture similar to AlexNet [17], or the more modern VGGNet [18]) and whether its
weights are finetuned (FT?). Keeping the visual input CNN fixed with CaffeNet, we also vary the choice of recurrent architecture, comparing a
stack of “vanilla” RNNs with LSTMs [7], as well as the number of layers in the stack L, and (for L = 2) whether the layers are “factored” (i.e.,
whether the visual input is passed into the second layer). In the last set of results, we explore two generation strategies – beam search, and
choosing the best (highest log-likelihood) among N samples from the model’s predicted distribution. For beam search we vary the beam width
from 1-10. For the sampling strategy we explore the effect of sample size N as well as the effect of applying various choices of scalar factor T
(inverse of the “temperature”) to the logits input to the softmax producing the distribution.
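Beam search with beam width 1 reduces to greedy decoding: at each step, emit the single most probable next word and feed it back in until an end token appears. A toy sketch of this loop (the vocabulary and logit values here are invented for illustration, not taken from the paper's model):

```python
import numpy as np

# Hypothetical "model": next-word logits conditioned on the previous word.
VOCAB = ["<s>", "a", "dog", "runs", "</s>"]
LOGITS = {
    "<s>":  [-9.0,  2.0,  0.1,  0.0, -9.0],
    "a":    [-9.0, -9.0,  3.0,  0.5, -9.0],
    "dog":  [-9.0, -9.0, -9.0,  2.5,  0.1],
    "runs": [-9.0, -9.0, -9.0, -9.0,  4.0],
    "</s>": [-9.0, -9.0, -9.0, -9.0, -9.0],
}

def greedy_decode(max_len=10):
    # Beam width 1: always take the argmax word at each time step.
    word, out = "<s>", []
    for _ in range(max_len):
        word = VOCAB[int(np.argmax(LOGITS[word]))]
        if word == "</s>":
            break
        out.append(word)
    return out
```

With these toy logits, `greedy_decode()` returns `["a", "dog", "runs"]`. Because each step keeps only one hypothesis, greedy decoding can miss a globally higher-likelihood sentence that begins with a locally suboptimal word, which is what wider beams and sampling address.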
this strategy we also examine the effect of applying various choices of scalar factors (inverse of the "temperature") T to the real-valued predictions input to the softmax producing the distribution. For larger values of T the samples are greedier and less diverse, with T = ∞ being equivalent to beam search with beam width 1. Larger values of N suggest using smaller values of T, and vice versa: for example, with large N and large T, most of the O(N) computation is wasted as many of the samples will be redundant. We assess saturation as the number of samples N grows, and find that N = 100 samples with T = 2 improves little over N = 25. We also varied the temperature T among values 1, 1.5, and 2 (all with N = 100) and found T = 1.5 to perform the best.
We adopt the best-performing generation strategy from the bottom-most set of results in Table 6 (sampling with T = 1.5, N = 100) as the strategy for the middle set of results in the table, which ablates LRCN architectures. We also record generation performance for all architectures (Table 6, top set of results) with the simpler generation strategy used in our earlier work [43], for ease of comparison with this work and for future researchers. For the remainder of this discussion, we will focus on the middle set of results, and particularly on the CIDEr-D [35] (C) metric, as it was designed specifically for automatic evaluation of image captioning systems. We see again that the LSTM unit outperforms an RNN unit for generation, though not as significantly as for retrieval. Between the sequence model architecture choices (depicted in Figure 4) of the number of layers L and whether to factor, we see that in this case the two-layer models (LRCN2f and LRCN2u) perform similarly, outperforming the single layer model (LRCN1u). Interestingly, of the three variants, LRCN2f is the only one
TABLE 7
Image caption generation results from top-performing methods in the 2015 COCO caption challenge competition, sorted by performance under the
CIDEr-D metric. (We omit submissions that did not provide a reference to a report describing their method; see full results at
http://mscoco.org/dataset/#captions-leaderboard.) All results except for our updated result (denoted by LRCN, this work) were competition entries
(submitted by May 2015). Our updated result differs from our original competition entry only by generation strategy (sampling with N = 100,
T = 1.5, rather than beam search with width 1; i.e., greedy search); the visual and recurrent architectures (and trained weights) are the same.
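The sampling strategy referenced above (best of N = 100 samples, with scalar factor T = 1.5 applied to the softmax inputs) can be sketched for a single prediction step as follows. This is an illustrative sketch only: the logits are hypothetical, and the actual model samples whole sentences and keeps the one with the highest total log-likelihood rather than a single word.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Multiplying logits by the scalar factor T (inverse temperature)
    # sharpens the distribution; T -> infinity approaches greedy argmax.
    z = np.asarray(logits, dtype=float) * T
    z = z - z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def best_of_n(logits, n=100, T=1.5, seed=0):
    # Draw n samples from the scaled distribution and keep the most
    # probable one: a one-step analogue of best-of-N sentence sampling.
    rng = np.random.default_rng(seed)
    p = softmax(logits, T)
    samples = rng.choice(len(p), size=n, p=p)
    return int(samples[np.argmax(p[samples])])
```

With a clearly dominant logit, best-of-100 is almost certain to return the argmax word, mirroring the observation that large N with large T wastes computation on redundant samples.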
to perform best for both retrieval and generation.
We see again that fine-tuning (FT) the visual representation and using a stronger vision model (VGGNet [18]) improves results significantly. Fine-tuning improves CIDEr-D by roughly 0.04 points for CaffeNet, and by roughly 0.07 points for VGGNet. Switching from finetuned CaffeNet to VGGNet improves CIDEr-D by 0.13 points.
In Table 7 we compare generation performance with contemporaneous and recent work submitted to the 2015 COCO caption challenge, using our best-performing method (under the CIDEr-D metric) from the results on the validation set described above: generating a caption for a single image by taking the best of N = 100 samples with a scalar factor of T = 1.5 applied to the softmax inputs, using an LRCN model which pairs a fine-tuned VGGNet with our LRCN2f (two-layer, factored) sequence model architecture. Our results are competitive with the contemporary work, performing 4th best in CIDEr-D (0.934, compared with the best result of 0.946 from [38]), and 3rd best in METEOR (0.335, compared with 0.346 from [38]).
In addition to standard quantitative evaluations, we also employ Amazon Mechanical Turk workers ("Turkers") to evaluate the generated sentences. Given an image and a set of descriptions from different models, we ask Turkers to rank the sentences based on correctness, grammar, and relevance. We compared sentences from our model to the ones made publicly available by [31]. As seen in Table 8, our fine-tuned (FT) LRCN model performs on par with the Nearest Neighbour (NN) approach on correctness and relevance, and better on grammar.

Method | Correctness | Grammar | Relevance
TreeTalk [46] | 4.08 | 4.35 | 3.98
VGGNet [31] | 3.71 | 3.46 | 3.70
NN [31] | 3.44 | 3.20 | 3.49
LRCN fc8 (ours) | 3.74 | 3.19 | 3.72
LRCN FT (ours) | 3.47 | 3.01 | 3.50
Captions | 2.55 | 3.72 | 2.59

TABLE 8
Image description: Human evaluator rankings from 1-6 (low is good) averaged for each method and criterion. We evaluated on 785 Flickr images selected by the authors of [31] for the purposes of comparison against this similar contemporary approach.

We show sample captions in Figure 6. We additionally note some properties of the captions our model generates. When using the VGG model to generate sentences in the validation set, we find that 33.7% of our generated sentences exactly match a sentence in the training set. Furthermore, we find that when using a beam size of one, our model generates 42% of the vocabulary words used by human annotators when describing images in the validation set. Some words, such as "lady" and "guy", are not generated by our model but are commonly used by human annotators; however, synonyms such as "woman" and "man" are two of the most common words generated by our model.

6 VIDEO DESCRIPTION

In video description the LSTM framework allows us to model the video as a variable-length input stream. However, due to the limitations of available video description datasets, we rely on more "traditional" activity and video recognition processing for the input and use LSTMs for generating a sentence. We first distinguish the following architectures for video description (see Figure 5). For each architecture, we assume we have predictions of the activity, tool, object, and locations present in the video from a CRF based on the full video input. In this way, we observe the video as a whole at each time step, not incrementally frame by frame.

(a) LSTM encoder & decoder with CRF max. (Figure 5(a)) This architecture is motivated by the video description approach presented in [11]. They first recognize a semantic representation of the video using the maximum a posteriori (MAP) estimate of a CRF with video features as unaries. This representation, e.g., ⟨knife, cut, carrot, cutting board⟩, is concatenated into an input sequence (knife cut carrot cutting board) which is translated to a natural language sentence (a person cuts a carrot on the board) using statistical machine translation (SMT) [47]. We replace SMT with an encoder-decoder LSTM, which encodes the input sequence as a fixed-length vector before decoding to a sentence.

(b) LSTM decoder with CRF max. (Figure 5(b)) In this variant we provide the full visual input representation at each time step to the LSTM, analogous to how an image is provided as an input to the LSTM in image captioning.

(c) LSTM decoder with CRF probabilities. (Figure 5(c)) A benefit of using LSTMs for machine translation compared
[Figure 5: Video description architectures (diagram residue removed). The recoverable labels show visual input feeding a CRF; the CRF-max output encoded as one-hot vectors over semantic words (e.g., "cutting board") versus the probability variant's distributions (e.g., [0, 0.8, 0.2, 0, ...]); and LSTM encoder/decoder stacks emitting the output sentence (e.g., "man cuts ...").]
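The distinction between the "CRF max" and "CRF prob" inputs in Figure 5 can be sketched for a single semantic slot. The vocabulary and CRF marginals below are invented for illustration; the point is only the encoding difference fed to the LSTM:

```python
import numpy as np

# Hypothetical CRF marginal distribution over one semantic slot (the tool).
VOCAB = ["knife", "spoon", "carrot", "board"]
crf_probs = np.array([0.8, 0.1, 0.05, 0.05])

def crf_max_input(probs):
    # (a)/(b) "CRF max": one-hot encoding of the MAP label.
    # The CRF's uncertainty is discarded.
    one_hot = np.zeros_like(probs)
    one_hot[np.argmax(probs)] = 1.0
    return one_hot

def crf_prob_input(probs):
    # (c) "CRF prob": the full probability vector, so the LSTM can
    # learn from the CRF's uncertainty rather than a hard MAP estimate.
    return probs
```

Here `crf_max_input(crf_probs)` yields `[1, 0, 0, 0]` (pure "knife"), while `crf_prob_input(crf_probs)` passes the soft evidence through unchanged.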
to phrase-based SMT [47] is that it can naturally incorporate probability vectors during training and test time, which allows the LSTM to learn uncertainties in visual generation rather than relying on MAP estimates. The architecture is the same as in (b), but we replace max predictions with probability distributions.

6.1 Evaluation

We evaluate our approach on the TACoS multilevel [48] dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation). We compare to [11], who use max prediction, as well as a variant presented in [48] which takes CRF probabilities at test time and uses a word lattice to find an optimal sentence prediction. Since we use the max prediction as well as the probability scores provided by [48], we have an identical visual representation. [48] uses dense trajectories [49] and SIFT features as well as temporal context reasoning modeled in a CRF. In this set of experiments we use the two-layered, unfactored version of LRCN, as described for image description.

Architecture | Input | BLEU
SMT [11] | CRF max | 24.9
SMT [48] | CRF prob | 26.9
(a) LSTM Encoder-Decoder (ours) | CRF max | 25.3
(b) LSTM Decoder (ours) | CRF max | 27.4
(c) LSTM Decoder (ours) | CRF prob | 28.8

TABLE 9
Video description: Results on detailed description of TACoS multilevel [48], in %; see Section 6 for details.

Table 9 shows the BLEU-4 score. The results show that (1) the LSTM outperforms an SMT-based approach to video description; (2) the simpler decoder architectures (b) and (c) achieve better performance than (a), likely because the input does not need to be memorized; and (3) our approach achieves 28.8%, clearly outperforming the best reported number of 26.9% on TACoS multilevel by [48].
More broadly, these results show that our architecture is not restricted only to input from deep networks, but can be cleanly integrated with fixed or variable length inputs from other vision systems.

7 RELATED WORK

We present previous literature pertaining to the three tasks discussed in this work. Additionally, we discuss subsequent extensions which combine convolutional and recurrent networks to achieve improved results on activity recognition, image captioning, and video description, as well as related new tasks such as visual question answering.

7.1 Prior Work

Activity Recognition. State-of-the-art shallow models combine spatio-temporal features along dense trajectories [50] and encode features as bags of words or Fisher vectors for classification. Such shallow features track how low-level features change through time but cannot track higher-level features. Furthermore, by encoding features as bags of words or Fisher vectors, temporal relationships are lost.
Many deep architectures proposed for activity recognition stack a fixed number of video frames for input to a deep network. [3] propose a fusion convolutional network which fuses layers corresponding to different input frames at various levels of a deep network. [4] propose a two-stream CNN which combines one CNN trained on RGB frames and one CNN trained on a stack of 10 flow frames. When combining RGB and flow by averaging softmax scores, results are comparable to state-of-the-art shallow models on UCF101 [25] and HMDB51 [51]. Results are further improved by using an SVM to fuse RGB and flow, as opposed to simply averaging scores. Alternatively, [1] and [2] propose learning deep spatio-temporal features with 3D convolutional neural networks. [2], [52] propose extracting visual and motion features and modeling temporal dependencies with recurrent networks. This architecture most closely resembles our proposed architecture for activity classification, though it differs in two key ways. First, we integrate 2D CNNs that can be pre-trained on large image datasets. Second, we combine the CNN and LSTM into a single model to enable end-to-end fine-tuning.
Image Captioning. Several early works [53], [54], [55], [56] on image captioning combine object and scene recognition with template or tree based approaches to generate captions. Such sentences are typically simple and are easily distinguished from more fluent human-generated descriptions. [46], [57] address this by composing new sentences from existing caption fragments which, though more human-like, are not necessarily accurate or correct.
More recently, a variety of deep and multi-modal models [27], [29], [30], [58] have been proposed for image and caption retrieval, as well as caption generation. Though some of these models rely on deep convolutional nets for image feature extraction [30], [58], recently researchers have realized the importance of also including temporally deep networks
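The two-stream fusion by "averaging softmax scores" described in the prior-work discussion above can be sketched as late fusion of each stream's class posteriors. The logits below are hypothetical, standing in for the outputs of an RGB CNN and a flow CNN:

```python
import numpy as np

def softmax(z):
    # Standard softmax with max-subtraction for numerical stability.
    z = np.asarray(z, dtype=float) - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def fuse_streams(rgb_logits, flow_logits):
    # Late fusion: average the class posteriors of the RGB and flow CNNs.
    # (The stronger alternative noted in the text trains an SVM on the
    # two streams' scores instead of averaging them.)
    return 0.5 * (softmax(rgb_logits) + softmax(flow_logits))
```

Because each stream's posterior sums to 1, the average is itself a valid distribution, and a confident flow stream can override a weakly confident RGB stream (or vice versa).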
to model text. [29] propose an RNN to map sentences into a multi-modal embedding space. By mapping images and language into the same embedding space, they are able to compare images and descriptions for image and annotation retrieval tasks. [27] propose a model for caption generation that is more similar to the model proposed in this work: predictions for the next word are based on previous words in a sentence and image features. [58] propose an encoder-decoder model for image caption retrieval which relies on both a CNN and an LSTM encoder to learn an embedding of image-caption pairs. Their model uses a neural language decoder to enable sentence generation. As evidenced by the rapid growth of image captioning, visual sequence models like LRCN are increasingly important for describing the visual world using natural language.
Video Description. Recent approaches to describing video with natural language have made use of templates, retrieval, or language models [11], [59], [60], [61], [62], [63], [64]. To our knowledge, we present the first application of deep models to the video description task. Most similar to our work is [11], which uses phrase-based SMT [47] to generate a sentence. In Section 6 we show that phrase-based SMT can be replaced with LSTMs for video description, as has been shown previously for language translation [9], [65].

7.2 Contemporaneous and Subsequent Work

Similar work in activity recognition and visual description was conducted contemporaneously with our work, and a variety of subsequent work has combined convolutional and recurrent networks to both improve upon our results and achieve exciting results on other sequential visual tasks.
Activity Recognition. Contemporaneous with our work, [66] train a network which combines CNNs and LSTMs for activity recognition. Because activity recognition datasets like UCF101 are relatively small in comparison to image recognition datasets, [66] pretrain their network using the Sports-1M [3] dataset, which includes over a million videos mined from YouTube. By training a much larger network (four stacked LSTMs) and pretraining on a large video dataset, [66] achieve 88.6% on the UCF101 dataset.
[67] also combine a convolutional network with an LSTM to predict multiple activities per frame. Unlike LRCN, [67] focus on frame-level (rather than video-level) predictions, which allows their system to label multiple activities that occur in different temporal locations of a video clip. As we show for activity recognition, [67] demonstrate that including temporal information improves upon a single-frame baseline. Additionally, [67] employ an attention mechanism to further improve results.
Image Captioning. [45] and [38] also propose models which combine a CNN with a recurrent network for image captioning. Though similar to LRCN, the architectures proposed in [45] and [38] differ in how image features are input into the sequence model. In contrast to our system, in which image features are input at each time step, [45] and [38] only input image features at the first time step. Furthermore, they do not explore a "factored" representation (Figure 4). Subsequent work [44] has proposed attention to focus on which portion of the image is observed during sequence generation. By including attention, [44] aim to visually focus on the current word generated by the model. Other works aim to address specific limitations of captioning models based on combining convolutional and recurrent architectures. For example, methods have been proposed to integrate new vocabulary with limited [40] or no [68] examples of images and corresponding captions.
Video Description. In this work, we rely on intermediate features for video description, but end-to-end trainable models for visual captioning have since been proposed. [69] propose creating a video feature by pooling high-level CNN features across frames. The video feature is then used to generate descriptions in the same way an image is used to generate a description in LRCN. Though achieving good results, by pooling CNN features, temporal information from the video is lost. Consequently, [70] propose an LSTM to encode video frames into a fixed-length vector before sentence generation with an LSTM. Using an end-to-end trainable "sequence-to-sequence" model which can exploit temporal structure in video, [70] improve upon results for video description. [71] propose a similar model, adding a temporal attention mechanism which weights video frames differently when generating each word in a sentence.
Visual Grounding. [72] combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase. In order to reconstruct the phrase, the model must learn to visually ground the input phrase to the appropriate location in the image.
Natural Language Object Retrieval. In this work, we present methods for image retrieval based on a natural language description. In contrast, [73] use a model based on LRCN for object retrieval, which returns the bounding box around a given object as opposed to an entire image. In order to adapt LRCN to the task of object retrieval, [73] include local convolutional features, which are extracted from object proposals, and the spatial configuration of object proposals, in addition to a global image feature. By including local features, [73] effectively adapt LRCN for object retrieval.

8 CONCLUSION

We've presented LRCN, a class of models that is both spatially and temporally deep, and flexible enough to be applied to a variety of vision tasks involving sequential inputs and outputs. Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve upon previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.
As the field of computer vision matures beyond tasks with static input and predictions, deep sequence modeling tools like LRCN are increasingly central to vision systems for problems with sequential structure. The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to handle with little input preprocessing and no hand-designed features.
Fig. 6. Image description: images with corresponding captions generated by our finetuned LRCN model. These are images 1-12 of our randomly chosen validation set from COCO 2014 [33]. We used beam search with a beam size of 5 to generate the sentences, and display the top (highest likelihood) result above. [Images omitted; the twelve generated captions are:]
- A female tennis player in action on the court.
- A group of young men playing a game of soccer.
- A man riding a wave on top of a surfboard.
- A baseball game in progress with the batter up to plate.
- A brown bear standing on top of a lush green field.
- A person holding a cell phone in their hand.
- A close up of a person brushing his teeth.
- A woman laying on a bed in a bedroom.
- A black and white cat is sitting on a chair.
- A large clock mounted to the side of a building.
- A bunch of fruit that are sitting on a table.
- A toothbrush holder sitting on top of a white sink.
ACKNOWLEDGMENTS

The authors thank Oriol Vinyals for valuable advice and helpful discussion throughout this work. This work was supported in part by DARPA's MSEE and SMISC programs, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center. The GPUs used for this research were donated by NVIDIA. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD). Lisa Anne Hendricks was supported by the NDSEG.

REFERENCES

[1] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[2] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, "Sequential deep learning for human action recognition," in Human Behavior Understanding, 2011.
[3] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in CVPR, 2014.
[4] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," DTIC Document, Tech. Rep., 1985.
[6] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, 1989.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, MIT Press, 1997.
[8] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in ICML, 2014.
[9] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in NIPS, 2014.
[10] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," in SSST Workshop, 2014.
[11] M. Rohrbach, W. Qiu, I. Titov, S. Thater, M. Pinkal, and B. Schiele, "Translating video content to natural language descriptions," in ICCV, 2013.
[12] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM MM, 2014.
[13] W. Zaremba and I. Sutskever, "Learning to execute," arXiv preprint arXiv:1410.4615, 2014.
[14] A. Graves, "Generating sequences with recurrent neural networks," arXiv preprint arXiv:1308.0850, 2013.
[15] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in ICASSP, 2012.
[16] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in ICML, 2011.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[20] K. Cho, B. van Merriënboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014.
[21] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High accuracy optical flow estimation based on a theory for warping," in ECCV, 2004.
[22] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in ECCV, 2014.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," IJCV, vol. 115, no. 3, 2015.
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[25] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," CRCV-TR-12-01, Tech. Rep., 2012.
[26] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," JAIR, vol. 47, 2013.
[27] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in ICLR, 2015.
[28] A. Karpathy, A. Joulin, and L. Fei-Fei, "Deep fragment embeddings for bidirectional image sentence mapping," in NIPS, 2014.
[29] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," TACL, vol. 2, 2014.
[30] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov et al., "DeViSE: A deep visual-semantic embedding model," in NIPS, 2013.
[31] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Unifying visual-semantic embeddings with multimodal neural language models," TACL, 2015.
[32] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," TACL, vol. 2, 2014.
[33] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," arXiv preprint arXiv:1405.0312, 2014.
[34] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in ACL, 2002.
[35] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in CVPR, 2015.
[36] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[37] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, 2004.
[38] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in CVPR, 2015.
[39] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, "Language models for image captioning: The quirks and what works," in ACL, 2015.
[40] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Learning like a child: Fast novel visual concept learning from sentence descriptions of images," in ICCV, 2015.
[41] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. Platt et al., "From captions to visual concepts and back," in CVPR, 2015.
[42] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick, "Exploring nearest neighbor approaches for image captioning," arXiv preprint arXiv:1505.04467, 2015.
[43] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in CVPR, 2015.
[44] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in ICML, 2015.
[45] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in CVPR, 2015.
[46] P. Kuznetsova, V. Ordonez, T. L. Berg, U. C. Hill, and Y. Choi, "TreeTalk: Composition and compression of trees for image descriptions," TACL, vol. 2, no. 10, 2014.
[47] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, "Moses: Open source toolkit for statistical machine translation," in ACL, 2007.
[48] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, "Coherent multi-sentence video description with variable level of detail," in German Conference on Pattern Recognition (GCPR), Springer, 2014.
[49] H. Wang, A. Kläser, C. Schmid, and C. Liu, "Dense trajectories and motion boundary descriptors for action recognition," IJCV, 2013.
Jeff Donahue is a PhD student at the University of California, Berkeley, advised by Prof. Trevor Darrell. His research focuses on the use of deep learning for computer vision applications. He graduated with a BS in computer science from the University of Texas at Austin, where he was advised by Prof. Kristen Grauman.

Lisa Anne Hendricks is a PhD student at the University of California, Berkeley. Her research focuses on deep learning for sequential models as well as applications at the intersection of language and vision. She is advised by Prof. Trevor Darrell. Lisa Anne holds a Bachelor of Science in Electrical Engineering (B.S.E.E.) from Rice University.

Marcus Rohrbach's research focuses on visual recognition, language understanding, and machine learning. He received his BSc and MSc degrees in Computer Science from the University of Technology Darmstadt, Germany, in 2006 and 2009, respectively. From 2006 to 2007, he spent one year at the University of British Columbia as a graduate visiting student. During his PhD he worked at the Max Planck Institute for Informatics, Saarbrücken, Germany, with Bernt Schiele and Manfred Pinkal. He completed it in 2014 with summa cum laude at Saarland University and received the DAGM MVTec Dissertation Award 2015 for it. He currently works as a post-doc with Trevor Darrell at UC Berkeley.

Subhashini Venugopalan is a PhD student at the University of Texas at Austin. Her research focuses on deep learning techniques to generate descriptions for events in videos. She is advised by Prof. Raymond Mooney. Subhashini holds a master's degree in Computer Science from IIT Madras and a bachelor's degree from NIT Karnataka, India.

Sergio Guadarrama is a Software Engineer at Google Research, where he works in Machine Perception as a member of the Vale team. He received his PhD from the Technical University of Madrid, followed by postdoctoral work at the European Center for Soft Computing. After that, he was first a Visiting Scholar and then a Research Scientist at UC Berkeley EECS. His research spans the areas of computer vision, language, and deep learning. Dr. Guadarrama's current research focus is on new network architectures for multi-task dense predictions, such as object detection, instance segmentation, depth prediction, and visual question answering. He has received research grants from the Government of Spain, such as the Juan de la Cierva Award (Early Career Award in Computer Science) and the Mobility Grant for Postdoctoral Research.

Kate Saenko is an Assistant Professor of Computer Science at the University of Massachusetts Lowell, where she leads the Computer Vision and Learning Group. She received her PhD from MIT, followed by postdoctoral work at UC Berkeley EECS and Harvard SEAS. Her research spans the areas of computer vision, machine learning, and human-robot interfaces. Dr. Saenko's current research interests include domain adaptation of machine learning models and joint modeling of language and vision. She is the recipient of research grant awards from the National Science Foundation, DARPA, and other government and industry agencies.

Trevor Darrell is on the faculty of the CS Division of the EECS Department at UC Berkeley and is also appointed at the UCB-affiliated International Computer Science Institute (ICSI). He is the director of the Berkeley Vision and Learning Center (BVLC) and is the faculty director of the PATH center in the UCB Institute of Transportation Studies. His interests include computer vision, machine learning, computer graphics, and perception-based human-computer interfaces. Prof. Darrell received the SM and PhD degrees from MIT in 1992 and 1996, respectively. He was previously on the faculty of the MIT EECS department from 1999 to 2008, where he directed the Vision Interface Group. He was a member of the research staff at Interval Research Corporation from 1996 to 1999. He obtained the BSE degree from the University of Pennsylvania in 1988, having started his career in computer vision as an undergraduate researcher in Ruzena Bajcsy's GRASP lab.
MatConvNet
Convolutional Neural Networks for MATLAB
Abstract

Contents

1 Introduction to MatConvNet
1.1 Getting started
1.2 MatConvNet at a glance
1.3 Documentation and examples
1.4 Speed
1.5 Acknowledgments
4 Computational blocks
4.1 Convolution
4.2 Convolution transpose (deconvolution)
4.3 Spatial pooling
4.4 Activation functions
4.5 Spatial bilinear resampling
4.6 Normalization
4.6.1 Local response normalization (LRN)
5 Geometry
5.1 Preliminaries
5.2 Simple filters
5.2.1 Pooling in Caffe
5.3 Convolution transpose
5.4 Transposing receptive fields
5.5 Composing receptive fields
5.6 Overlaying receptive fields
6 Implementation details
6.1 Convolution
6.2 Convolution transpose
6.3 Spatial pooling
6.4 Activation functions
6.4.1 ReLU
6.4.2 Sigmoid
6.5 Spatial bilinear resampling
6.6 Normalization
6.6.1 Local response normalization (LRN)
6.6.2 Batch normalization
6.6.3 Spatial normalization
6.6.4 Softmax
6.7 Categorical losses
6.7.1 Classification losses
6.7.2 Attribute losses
6.8 Comparisons
6.8.1 p-distance
Bibliography
Chapter 1
Introduction to MatConvNet
other areas; for instance, MatConvNet was recently used by the University of Arizona in planetary science, as summarised in an NVIDIA blog post.

MatConvNet can learn large CNN models such as AlexNet [7] and the very deep networks of [9] from millions of images. Pre-trained versions of several of these powerful models can be downloaded from the MatConvNet home page. While powerful, MatConvNet remains simple to use and install. The implementation is fully self-contained, requiring only MATLAB and a compatible C++ compiler (using the GPU code requires the freely-available CUDA DevKit and a suitable NVIDIA GPU). As demonstrated in fig. 1.1 and section 1.1, it is possible to download, compile, and install MatConvNet using three MATLAB commands. Several fully-functional examples demonstrating how small and large networks can be learned are included. Importantly, several standard pre-trained networks can be immediately downloaded and used in applications. A manual with a complete technical description of the toolbox is maintained along with the toolbox. These features make MatConvNet useful in an educational context too.

MatConvNet is open-source, released under a BSD-like license. It can be downloaded from http://www.vlfeat.org/matconvnet as well as from GitHub.
% setup MatConvNet
run matlab/vl_setupnn
Figure 1.1: A complete example including download, installing, compiling and running Mat-
ConvNet to classify one of MATLAB stock images using a large CNN pre-trained on
ImageNet.
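Only the setup command of fig. 1.1 survives above; for context, a fuller quick-start along these lines might look as follows. This is a sketch: the archive URL, version number, and model file name are illustrative assumptions, not guaranteed current values.

```matlab
% install and compile MatConvNet (run once; version is illustrative)
untar('http://www.vlfeat.org/matconvnet/download/matconvnet-1.0-beta12.tar.gz') ;
cd matconvnet-1.0-beta12
run matlab/vl_compilenn

% download a pre-trained CNN from the web (run once; model name is illustrative)
urlwrite('http://www.vlfeat.org/matconvnet/models/imagenet-vgg-f.mat', ...
         'imagenet-vgg-f.mat') ;

% setup MatConvNet
run matlab/vl_setupnn

% load the pre-trained CNN and classify a MATLAB stock image
net = load('imagenet-vgg-f.mat') ;
im  = imread('peppers.png') ;
im_ = single(im) ;                     % note: intensities in the 0-255 range
im_ = imresize(im_, net.meta.normalization.imageSize(1:2)) ;
im_ = im_ - net.meta.normalization.averageImage ;
res = vl_simplenn(net, im_) ;
```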
image with the filters by using the command y = vl_nnconv(x,f,[]). This results in an array
y with K channels, one for each of the K filters in the bank.
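As a minimal sketch of this usage (all array sizes here are arbitrary):

```matlab
x = randn(128, 128, 3, 'single') ;   % input map: H x W x D
f = randn(5, 5, 3, 16, 'single') ;   % bank of K = 16 filters: H' x W' x D x K
y = vl_nnconv(x, f, []) ;            % convolve without biases
% y is 124 x 124 x 16: one channel per filter
% (valid convolution with the default stride of 1 and no padding)
```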
While users are encouraged to make use of the blocks directly to create new architectures, MatConvNet provides wrappers such as vl_simplenn for standard CNN architectures such as AlexNet [7] or Network-in-Network [8]. Furthermore, the library provides numerous examples (in the examples/ subdirectory), including code to learn a variety of models on the MNIST, CIFAR, and ImageNet datasets. All these examples use the examples/cnn_train training code, which is an implementation of stochastic gradient descent (section 3.3). While this training code is perfectly serviceable and quite flexible, it remains in the examples/ subdirectory as it is somewhat problem-specific. Users are welcome to implement their own optimisers.
[Figure 1.2: Training AlexNet on ImageNet ILSVRC: dropout vs batch normalisation. The plot compares top-1 and top-5 validation error over 60 training epochs for the dropout and batch-normalisation (bnorm) variants.]
See also http://www.robots.ox.ac.uk/~vgg/practicals/cnn/index.html.
1.4 Speed
Efficiency is very important for working with CNNs. MatConvNet supports using NVIDIA
GPUs as it includes CUDA implementations of all algorithms (or relies on MATLAB CUDA
support).
To use the GPU (provided that suitable hardware is available and the toolbox has been
compiled with GPU support), one simply converts the arguments to gpuArrays in MATLAB,
as in y = vl_nnconv(gpuArray(x), gpuArray(w), []). In this manner, switching between CPU
and GPU is fully transparent. Note that MatConvNet can also make use of the NVIDIA
CuDNN library with significant speed and space benefits.
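The same computation can thus be moved between devices simply by changing the array type; gather brings the result back. A sketch, assuming the toolbox was compiled with GPU support:

```matlab
% CPU evaluation
y = vl_nnconv(x, w, []) ;

% GPU evaluation: identical call, gpuArray arguments
xg = gpuArray(x) ;
wg = gpuArray(w) ;
yg = vl_nnconv(xg, wg, []) ;

% bring the result back to CPU memory
y_cpu = gather(yg) ;
```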
Next we evaluate the performance of MatConvNet when training large architectures
on the ImageNet ILSVRC 2012 challenge data [2]. The test machine is a Dell server with
two Intel Xeon CPU E5-2667 v2 clocked at 3.30 GHz (each CPU has eight cores), 256 GB
of RAM, and four NVIDIA Titan Black GPUs (only one of which is used unless otherwise
noted). Experiments use MatConvNet beta12, CuDNN v2, and MATLAB R2015a. The
data is preprocessed to avoid rescaling images on the fly in MATLAB and stored in a RAM
disk for faster access. The code uses the vl_imreadjpeg command to read large batches of
JPEG images from disk in a number of separate threads. The driver examples/cnn_imagenet.m
is used in all experiments.
We train the models discussed in section 1.3 on ImageNet ILSVRC. Table 1.1 reports the training speed as the number of images per second processed by stochastic gradient descent.
AlexNet trains at about 264 images/s with CuDNN, which is about 40% faster than the
vanilla GPU implementation (using CuBLAS) and more than 10 times faster than using the
CPUs. Furthermore, we note that, despite MATLAB overhead, the implementation speed is
comparable to Caffe (they report 253 images/s with CuDNN and a Titan – a slightly slower
GPU than the Titan Black used here). Note also that, as the model grows in size, the size of an SGD batch must be decreased (to fit in the GPU memory), increasing the overhead impact somewhat.
Table 1.2 reports the speed on VGG-VD-16, a very large model, using multiple GPUs. In this case, the batch size is set to 264 images. These are further divided into sub-batches of 22 images each to fit in the GPU memory; the latter are then distributed among one to four
GPUs on the same machine. While there is a substantial communication overhead, training
speed increases from 20 images/s to 45. Addressing this overhead is one of the medium term
goals of the library.
num GPUs          1      2      3      4
VGG-VD-16 speed   20.0   22.20  38.18  44.8

Table 1.2: Multiple GPU speed (images/s).
1.5 Acknowledgments
MatConvNet is a community project, and as such acknowledgements go to all contributors.
We kindly thank NVIDIA for supporting this project by providing us with top-of-the-line GPUs, and MathWorks for ongoing discussion on how to improve the library.

The implementations of several CNN computations in this library are inspired by the Caffe library [5] (however, Caffe is not a dependency). Several of the example networks have been
trained by Karen Simonyan as part of [1] and [10].
Chapter 2

Neural Network Computations
This chapter provides a brief introduction to the computational aspects of neural networks,
and convolutional neural networks in particular, emphasizing the concepts required to un-
derstand and use MatConvNet.
2.1 Overview
A Neural Network (NN) is a function g mapping data x, for example an image, to an output
vector y, for example an image label. The function g = fL ◦ · · · ◦ f1 is the composition
of a sequence of simpler functions fl , which are called computational blocks or layers. Let
x1 , x2 , . . . , xL be the outputs of each layer in the network, and let x0 = x denote the network
input. Each intermediate output xl = fl (xl−1 ; wl ) is computed from the previous output xl−1
by applying the function fl with parameters wl .
In a Convolutional Neural Network (CNN), the data has a spatial structure: each xl ∈ R^(Hl×Wl×Cl) is a 3D array or tensor where the first two dimensions Hl (height) and Wl (width) are interpreted as spatial dimensions. The third dimension Cl is instead interpreted as the number of feature channels. Hence, the tensor xl represents an Hl × Wl field of Cl-dimensional feature vectors, one for each spatial location. A fourth dimension Nl in the tensor spans multiple data samples packed in a single batch for efficient parallel processing. The number of data samples Nl in a batch is called the batch cardinality. The network is called convolutional because the functions fl are local and translation invariant operators (i.e. non-linear filters) like linear convolution.
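In MATLAB terms, such a batch is simply a 4D array; the sizes below are arbitrary:

```matlab
Hl = 224 ; Wl = 224 ; Cl = 3 ; Nl = 16 ;  % height, width, channels, batch cardinality
x = randn(Hl, Wl, Cl, Nl, 'single') ;     % Nl samples packed in one tensor
size(x)                                   % -> 224 224 3 16
```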
It is also possible to conceive CNNs with more than two spatial dimensions, where the additional dimensions may represent volume or time. In fact, there are few a priori restrictions on the format of data in neural networks in general. Many useful NNs contain a mixture of convolutional layers together with layers that process other data types such as text strings, or perform other operations that do not strictly conform to the CNN assumptions.
MatConvNet includes a variety of layers, contained in the matlab/ directory, such
as vl_nnconv (convolution), vl_nnconvt (convolution transpose or deconvolution), vl_nnpool
(max and average pooling), vl_nnrelu (ReLU activation), vl_nnsigmoid (sigmoid activation),
vl_nnsoftmax (softmax operator), vl_nnloss (classification log-loss), vl_nnbnorm (batch normalization), vl_nnspnorm (spatial normalization), vl_nnnormalize (local response normalization – LRN), or vl_nnpdist (p-distance). There are enough layers to implement many
interesting state-of-the-art networks out of the box, or even import them from other tool-
boxes such as Caffe.
NNs are often used as classifiers or regressors. In the example of fig. 1.1, the output ŷ = f(x) is a vector of probabilities, one for each of 1,000 possible image labels (dog, cat,
trilobite, ...). If y is the true label of image x, we can measure the CNN performance by a
loss function `y (ŷ) ∈ R which assigns a penalty to classification errors. The CNN parameters
can then be tuned or learned to minimize this loss averaged over a large dataset of labelled
example images.
Learning generally uses a variant of stochastic gradient descent (SGD). While this is an efficient method (for this type of problem), networks may contain several million parameters and need to be trained on millions of images; thus, efficiency is paramount in the design of MatConvNet, as further discussed in section 1.4. SGD also requires computing the CNN derivatives, as explained in the next section.
2.2.1 Sequences
Start by considering a computational block f in the network. This can be represented
schematically as a box receiving data x and parameters w as inputs and producing data y
as output:
[Diagram: a computational block f receives data x and parameters w as inputs and produces data y as output.]

[Diagram: a chain x0 → f1 → x1 → f2 → ⋯ → fL → xL, with parameters w1, w2, …, wL feeding into the respective blocks.]
Given an input x0 , evaluating the network is a simple matter of evaluating all the blocks
from left to right, which defines a composite function xL = f (x0 ; w1 , . . . , wL ).
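In code, this left-to-right evaluation is a simple loop; the cell arrays f and w below are hypothetical containers for the layer functions and their parameters:

```matlab
% evaluate xL = f(x0; w1, ..., wL) by composing the blocks in order
x = x0 ;
for l = 1:L
  x = f{l}(x, w{l}) ;   % x_l = f_l(x_{l-1}; w_l)
end
xL = x ;
```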
2.2. NETWORK STRUCTURES 11
[Figure 2.1: An example DAG, with functions f1, …, f5 shown as boxes, variables x0, …, x7 as circles, and parameters w1, w2, w4, w5 feeding into the functions.]
1. The graph is bipartite, in the sense that arrows always go from boxes to circles and
from circles to boxes.
2. Functions can have any number of inputs or outputs; variables and parameters can have an arbitrary number of outputs (a parameter with more than one output is shared between different layers); variables have at most one input and parameters none.
3. Variables with no incoming arrows and parameters are not computed by the network,
but must be set prior to evaluation, i.e. they are inputs. Any variable (or even param-
eter) may be used as output, although these are usually the variables with no outgoing
arrows.
4. Since the graph is acyclic, the CNN can be evaluated by sorting the functions and
computing them one after another (in the example, evaluating the functions in the
order f1 , f2 , f3 , f4 , f5 would work).
This notation for the derivatives of tensor functions is taken from [6] and is used throughout
this document.
2.3. COMPUTING DERIVATIVES WITH BACKPROPAGATION 13
While it is easy to express the derivatives of tensor functions as matrices, these matrices are in general extremely large. Even for moderate data sizes (e.g. H = H′ = W = W′ = 32 and C = C′ = 128), there are H′W′C′HWC ≈ 17 × 10⁹ elements in the Jacobian.
Storing that requires 68 GB of space in single precision. The purpose of the backpropagation
algorithm is to compute the derivatives required for learning without incurring this huge
memory cost.
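The 68 GB figure can be checked directly:

```matlab
H  = 32 ; W  = 32 ; C  = 128 ;   % input size
Hp = 32 ; Wp = 32 ; Cp = 128 ;   % output size (H', W', C')
n = Hp*Wp*Cp * H*W*C ;           % number of Jacobian elements
n                                % -> about 17e9
n * 4 / 1e9                      % -> about 68.7 GB in single precision
```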
[Diagram: a chain x0 → f1 → x1 → f2 → ⋯ → fL → xL ∈ R with parameters w1, …, wL; the output xL is a scalar.]
The goal is to compute the gradient of the loss value xL (output) with respect to each network
parameter wl :
\[
\frac{df}{d(\operatorname{vec} w_l)^\top}
= \frac{d}{d(\operatorname{vec} w_l)^\top}
\left[ f_L(\cdot; w_L) \circ \cdots \circ f_2(\cdot; w_2) \circ f_1(x_0; w_1) \right].
\]
By applying the chain rule and by using the matrix notation introduced above, the derivative
can be written as
\[
\frac{df}{d(\operatorname{vec} w_l)^\top}
= \frac{d\operatorname{vec} f_L(x_{L-1}; w_L)}{d(\operatorname{vec} x_{L-1})^\top}
\times \cdots \times
\frac{d\operatorname{vec} f_{l+1}(x_l; w_{l+1})}{d(\operatorname{vec} x_l)^\top}
\times
\frac{d\operatorname{vec} f_l(x_{l-1}; w_l)}{d(\operatorname{vec} w_l)^\top}
\tag{2.1}
\]
where the derivatives are computed at the working point determined by the input x0 and the
current value of the parameters.
Note that, since the network output xL is a scalar quantity, the target derivative df/d(vec wl)⊤ has the same number of elements as the parameter vector wl, which is moderate. However, the intermediate Jacobian factors have, as seen above, an unmanageable size. In order to avoid computing these factors explicitly, we can proceed as follows.
Start by multiplying the output of the last layer by a tensor pL = 1 (note that this tensor
is a scalar just like the variable xL ):
\[
p_L \times \frac{df}{d(\operatorname{vec} w_l)^\top}
= \underbrace{p_L \times \frac{d\operatorname{vec} f_L(x_{L-1}; w_L)}{d(\operatorname{vec} x_{L-1})^\top}}_{(\operatorname{vec} p_{L-1})^\top}
\times \cdots \times
\frac{d\operatorname{vec} f_{l+1}(x_l; w_{l+1})}{d(\operatorname{vec} x_l)^\top}
\times
\frac{d\operatorname{vec} f_l(x_{l-1}; w_l)}{d(\operatorname{vec} w_l)^\top}
\]
The underbraced product yields a tensor pL−1 with the same size as xL−1, which is small enough to be explicitly stored. The construction is then repeated by multiplying pairs of factors from left to right, obtaining a sequence of tensors pL−2, . . . , pl until the desired derivative is obtained. Note that, in doing so, no large tensor is ever stored in memory. This process is known as backpropagation.
In general, tensor pl is obtained from pl+1 as the product:
\[
p_l = \frac{d\langle p_{l+1}, f(x_l; w_l) \rangle}{dx_l}. \tag{2.2}
\]
Here ⟨·, ·⟩ denotes the inner product between tensors, which results in a scalar quantity. Hence the derivative (2.2) need not use the vec notation, and yields a tensor pl that has the same size as xl, as expected.
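To see concretely why no Jacobian is needed, consider a hypothetical linear layer f(x) = Wx (not a MatConvNet block): projecting its Jacobian onto p gives Wᵀp, which can be computed directly:

```matlab
W = randn(3, 5) ;     % hypothetical linear layer, y = W*x
x = randn(5, 1) ;
p = randn(3, 1) ;     % projection, same size as y

% explicit route: form the Jacobian dy/dx = W, then project
px_explicit = (p' * W)' ;

% backpropagation route: d<p, W*x>/dx = W'*p, no Jacobian stored
px_backprop = W' * p ;

max(abs(px_explicit - px_backprop))   % -> 0 up to round-off
```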
In order to implement backpropagation, a CNN toolbox provides implementations of each
layer f that provide:
• A forward mode, computing the output y = f (x; w) of the layer given its input x and parameters w.

• A backward mode, computing the projected derivatives dx and dw of the layer given, in addition to the input x and the parameters w, the projection p.
This is best illustrated with an example. Consider a layer f such as the convolution operator
implemented by the MatConvNet vl_nnconv command. In the “forward” mode, one calls
the function as y = vl_nnconv(x,w,[]) to apply the filters w to the input x and obtain the
output y. In the “backward mode”, one calls [dx, dw] = vl_nnconv(x,w,[],p). As explained
above, dx, dw, and p have the same size as x, w, and y, respectively. The computation of the large Jacobians is encapsulated in the function call and never carried out explicitly.
[Diagram: the layer f is followed by an inner product ⟨·, ·⟩ with the projection p, producing a scalar z ∈ R; in backward mode the block df maps x, w, and p to the derivatives dx and dw.]

[Diagram: a chain x0 → f1 → x1 → f2 → ⋯ → fL → xL with parameters w1, …, wL.]
DAG structure. Furthermore, in order to simplify the notation, assume that this list contains
both data and parameter variables, as the distinction is moot for the discussion in this section.
We can cut the DAG at any point in the sequence by fixing x0 , . . . , xl−1 to some arbitrary
value and dropping all the DAG layers that feed into them, effectively transforming the first l
variables into inputs. Then, the rest of the DAG defines a function hl that maps these input
variables to the output xL :
xL = hl (x0 , x1 , . . . , xl−1 ).
Next, we show that backpropagation in a DAG iteratively computes the projected derivatives
of all functions h1 , . . . , hL with respect to all their parameters.
Backpropagation starts by initializing variables (dx0, . . . , dxL−1) to null tensors of the same size as (x0, . . . , xL−1). Next, it computes the projected derivatives of
Here πl denotes the index of the layer fπl that computes the value of the variable xl . There
is at most one such layer, or none if xl is an input or parameter of the original NN. In the
first case, the layer may depend on any of the variables prior to xl in the sequence, so that in general one has:
xl = fπl (x0 , . . . , xl−1 ).
At the beginning of backpropagation, since there are no intermediate variables between xL−1
and xL , the function hL is the same as the last layer fπL . Thus the projected derivatives of
hL are the same as the projected derivatives of fπL , resulting in the equation
Given: a DAG neural network f with a single output xL, the values of all input variables (including the parameters), and the value of the projection pL (usually xL is a scalar and pL = 1):
2. Perform a forward pass through the network to compute all the intermediate vari-
able values.
3. Initialize (dx0 , . . . , dxL−1 ) to null tensors with the same size as the corresponding
variables.
4. For l = L, L − 1, . . . , 2, 1:
a) Find the index πl of the layer xl = fπl (x0 , . . . , xl−1 ) that evaluates variable xl .
If there is no such layer (because xl is an input or parameter of the network),
go to the next iteration.
b) Update the variables using the formula:
1. For each layer fl , and variable/parameter xt and wl , create a corresponding layer dfl
and variable/parameter dxt and dwl .
3. If a variable xt (or parameter wl ) is an input of fl , then the variable dxt (or the
parameter dwl ) is an output dfl .
4. In the previous step, if a variable xt (or parameter wl ) is input to two or more layers in
f , then dxt would be the output of two or more layers in the reversed network, which
creates a conflict. Resolve these conflicts by inserting a summation layer that adds
these contributions (this corresponds to the summation in the BP update equation
(2.3)).
The BP network corresponding to the DAG of Fig. 2.1 is given in Fig. 2.2.
[Figure 2.2: The BP network corresponding to the DAG of Fig. 2.1. Each layer fl is mirrored by a backward layer dfl, and each variable/parameter by its derivative (e.g. dx1, dx4, dw4), with a summation layer Σ merging the contributions to variables shared between layers.]
3.1 Wrappers
MatConvNet provides two wrappers: SimpleNN for basic chains of blocks (section 3.1.1) and DagNN for blocks organized in more complex directed acyclic graphs (section 3.1.2).
3.1.1 SimpleNN
The SimpleNN wrapper is suitable for networks consisting of linear chains of computational
blocks. It is largely implemented by the vl_simplenn function (evaluation of the CNN and of
its derivatives), with a few other support functions such as vl_simplenn_move (moving the
CNN between CPU and GPU) and vl_simplenn_display (obtain and/or print information
about the CNN).
vl_simplenn takes as input a structure net representing the CNN as well as input x and
potentially output derivatives dzdy, depending on the mode of operation. Please refer to the
inline help of the vl_simplenn function for details on the input and output formats. In fact,
the implementation of vl_simplenn is a good example of how the basic neural net building
blocks can be used together and can serve as a basis for more complex implementations.
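As a rough sketch of the expected format (field names follow the conventions used by the bundled examples; the exact options vary per layer type):

```matlab
% a minimal two-layer SimpleNN chain: convolution followed by ReLU
net.layers = {} ;
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{randn(5,5,3,16,'single'), zeros(1,16,'single')}}, ...
    'stride', 1, 'pad', 0) ;
net.layers{end+1} = struct('type', 'relu') ;

% forward evaluation; res(end).x holds the network output
res = vl_simplenn(net, randn(32,32,3,'single')) ;
```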
3.1.2 DagNN
The DagNN wrapper is more complex than SimpleNN as it has to support arbitrary graph
topologies. Its design is object oriented, with one class implementing each layer type. While
this adds complexity, and makes the wrapper slightly slower for tiny CNN architectures (e.g.
MNIST), it is in practice much more flexible and easier to extend.
22 CHAPTER 3. WRAPPERS AND PRE-TRAINED MODELS
Note that the image should be preprocessed before running the network. While preprocessing
specifics depend on the model, the pre-trained model contains a net.meta.normalization
field that describes the type of preprocessing that is expected. Note in particular that this
network takes images of a fixed size as input and requires removing the mean; also, image
intensities are normalized in the range [0,255].
The next step is running the CNN. This will return a res structure with the output of
the network layers:
% run the CNN
res = vl_simplenn(net, im_) ;
The output of the last layer can be used to classify the image. The class names are
contained in the net structure for convenience:
% show the classification result
scores = squeeze(gather(res(end).x)) ;
[bestScore, best] = max(scores) ;
figure(1) ; clf ; imagesc(im) ;
title(sprintf('%s (%d), score %.3f',...
net.meta.classes.description{best}, best, bestScore)) ;
Note that several extensions are possible. First, images can be cropped rather than
rescaled. Second, multiple crops can be fed to the network and results averaged, usually for
improved results. Third, the output of the network can be used as generic features for image
encoding.
3.3 Learning models
• Consider preprocessing the data to convert all images to have a height of 256 pixels.
This can be done with the supplied utils/preprocess-imagenet.sh script. In this
manner, training will not have to resize the images every time. Do not forget to point
the training code to the pre-processed data.
• Consider copying the dataset into a RAM disk (provided that you have enough memory)
for faster access. Do not forget to point the training code to this copy.
• Compile MatConvNet with GPU support. See the homepage for instructions.
Once your setup is ready, you should be able to run examples/cnn_imagenet (edit the
file and change any flag as needed to enable GPU support and image pre-fetching on multiple
threads).
If all goes well, you should expect to be able to train with 200-300 images/sec.
Chapter 4
Computational blocks
4.1 Convolution
The convolutional block is implemented by the function vl_nnconv. y=vl_nnconv(x,f,b) com-
putes the convolution of the input map x with a bank of K multi-dimensional filters f and
biases b. Here
x ∈ R^{H×W×D},  f ∈ R^{H′×W′×D×D″},  y ∈ R^{H″×W″×D″}.
Other parts of the library will wrap these functions into objects with a perfectly uniform interface;
however, the low-level functions aim at providing a straightforward and obvious interface even if this means
differing slightly from block to block.
Figure 4.1: Convolution. The figure illustrates the process of filtering a 1D signal x by a
filter f to obtain a signal y. The filter has H′ = 4 elements and is applied with a stride of
S_h = 2 samples. The purple areas represent the padding P− = 2 and P+ = 3, which is zero-filled.
Filters are applied in a sliding-window manner across the input signal. The samples of
x involved in the calculation of a sample of y are shown with arrows. Note that the rightmost
sample of x is never processed by any filter application due to the sampling step. While in
this case the sample is in the padded region, this can happen also without padding.
The process of convolving a signal is illustrated in fig. 4.1 for a 1D slice. Formally, the output
is given by
y_{i″j″d″} = b_{d″} + Σ_{i′=1}^{H′} Σ_{j′=1}^{W′} Σ_{d′=1}^{D} f_{i′j′d′d″} × x_{i″+i′−1, j″+j′−1, d′}.
The call vl_nnconv(x,f,[]) does not use the biases. Note that the function works with arbi-
trarily sized inputs and filters (as opposed to, for example, square images). See section 6.1
for technical details.
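The formula above can be transcribed naively in Python/NumPy (an illustrative sketch only; the actual vl_nnconv implementation is compiled MEX/GPU code):

```python
import numpy as np

def conv_forward(x, f, b=None):
    """Naive 'valid' convolution (no padding, stride 1), following
    y[i,j,k] = b[k] + sum_{i',j',d} f[i',j',d,k] * x[i+i', j+j', d]
    in 0-based indexing. x: (H,W,D), f: (Hp,Wp,D,K) -> y: (H-Hp+1, W-Wp+1, K)."""
    H, W, D = x.shape
    Hp, Wp, D2, K = f.shape
    assert D == D2
    Ho, Wo = H - Hp + 1, W - Wp + 1
    y = np.zeros((Ho, Wo, K))
    for i in range(Ho):
        for j in range(Wo):
            patch = x[i:i + Hp, j:j + Wp, :]        # H' x W' x D window
            for k in range(K):
                y[i, j, k] = np.sum(patch * f[:, :, :, k])
    if b is not None:
        y += b.reshape(1, 1, K)
    return y
```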
Output size. vl_nnconv computes only the “valid” part of the convolution; i.e. it requires
each application of a filter to be fully contained in the input support. The size of the output
is computed in section 5.2 and is given by:
H″ = 1 + ⌊ (H − H′ + P_h^− + P_h^+) / S_h ⌋.
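A small Python helper (an illustration, not part of MatConvNet) makes the output-size rule concrete:

```python
def conv_output_size(H, Hp, P_minus, P_plus, S):
    """H'' = 1 + floor((H - H' + P^- + P^+) / S); the padded input
    must be at least as large as the filter."""
    assert H + P_minus + P_plus >= Hp, "padded input smaller than filter"
    return 1 + (H - Hp + P_minus + P_plus) // S
```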
Note that the padded input must be at least as large as the filters: H + P_h^− + P_h^+ ≥ H′,
otherwise an error is thrown.
Receptive field size and geometric transformations. Very often it is useful to geometrically
relate the indices of the various arrays to the input data (usually images) in terms
of coordinate transformations and the size of the receptive field (i.e. of the image region that
affects an output). This is derived in section 5.2.
Fully connected layers. In other libraries, fully connected blocks or layers are linear
functions where each output dimension depends on all the input dimensions. MatConvNet
does not distinguish between fully connected layers and convolutional blocks. Instead, the
former is a special case of the latter, obtained when the output map y has dimensions W″ =
H″ = 1. Internally, vl_nnconv handles this case more efficiently when possible.
Filter groups. For additional flexibility, vl_nnconv allows grouping the channels of the input
array x and applying different subsets of filters to each group. To use this feature, specify
as input a bank of D″ filters f ∈ R^{H′×W′×D′×D″} such that D′ divides the number of input
dimensions D. These are treated as g = D/D′ filter groups; the first group is applied to
dimensions d = 1, …, D′ of the input x; the second group to dimensions d = D′ + 1, …, 2D′;
and so on. Note that the output is still an array y ∈ R^{H″×W″×D″}.
An application of grouping is implementing the network of Krizhevsky et al. [7], which
uses two such streams. Another application is sum pooling; in the latter case, one can specify
D groups of D′ = 1 dimensional filters, all identical and equal to one (however, this is considerably
slower than calling the dedicated pooling function given in section 4.3).
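The grouping semantics can be illustrated with 1×1 filters in Python (a sketch; for illustration it assumes that the D″ filters are split evenly across the g groups):

```python
import numpy as np

def grouped_conv_1x1(x, f):
    """Grouped 1x1 convolution sketch. x: (H,W,D), f: (1,1,Dp,K) with
    g = D // Dp groups; the K filters are split evenly across groups
    (an assumption made here for simplicity)."""
    H, W, D = x.shape
    _, _, Dp, K = f.shape
    g = D // Dp
    kpg = K // g                                   # filters per group
    y = np.zeros((H, W, K))
    for gi in range(g):
        xs = x[:, :, gi * Dp:(gi + 1) * Dp]        # channels of this group
        fs = f[:, :, :, gi * kpg:(gi + 1) * kpg]   # filters of this group
        y[:, :, gi * kpg:(gi + 1) * kpg] = np.einsum('hwd,dk->hwk', xs, fs[0, 0])
    return y
```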
4.2 Convolution transpose (deconvolution)
The convolution transpose block is implemented by the function vl_nnconvt. Let x, f, and y
be the input tensor, filters, and output tensor, dimensioned as in section 4.1. Imagine operating in the reverse direction
by using the filter bank f to convolve the output y to obtain the input x, using the defini-
tions given in section 4.1 for the convolution operator; since convolution is linear, it can be
expressed as a matrix M such that vec x = M vec y; convolution transpose computes instead
vec y = M > vec x. This process is illustrated for a 1D slice in fig. 4.2.
There are two important applications of convolution transpose. The first one are the
so called deconvolutional networks [12] and other networks such as convolutional decoders
that use the transpose of a convolution. The second one is implementing data interpolation.
In fact, as the convolution block supports input padding and output downsampling, the
convolution transpose block supports input upsampling and output cropping.
Figure 4.2: Convolution transpose. The figure illustrates the process of filtering a 1D
signal x by a filter f to obtain a signal y. The filter is applied in a sliding-window manner, in a
pattern that is the transpose of fig. 4.1. The filter has H′ = 4 samples in total, although each
filter application uses two of them (blue squares) in a circulant manner. The purple areas
represent crops with C− = 2 and C+ = 3 which are discarded. The samples of x involved in
the calculation of a sample of y are shown with arrows. Note that, differently from fig. 4.1,
there are no samples to the right of y which are involved in a convolution operation. This is
because the width H″ of the output y, which given H′ can be determined only up to U_h samples,
is selected to be the smallest possible.
Convolution transpose can be expressed in closed form in the following rather unwieldy
expression (derived in section 6.2):
y_{i″j″d″} = Σ_{d′=1}^{D} Σ_{i′=0}^{q(H′,S_h)} Σ_{j′=0}^{q(W′,S_w)} f_{1+S_h i′+m(i″+P_h^−, S_h), 1+S_w j′+m(j″+P_w^−, S_w), d″, d′} × x_{1−i′+q(i″+P_h^−, S_h), 1−j′+q(j″+P_w^−, S_w), d′},   (4.1)
where

m(k, S) = (k − 1) mod S,   q(k, S) = ⌊ (k − 1)/S ⌋,
(S_h, S_w) are the vertical and horizontal input upsampling factors, (P_h^−, P_h^+, P_w^−, P_w^+) the output
crops, and x and f are zero-padded as needed in the calculation. Note also that filter k is
stored as the slice f_{:,:,k,:} of the 4D tensor f.
The height of the output array y is given by

H″ = S_h (H − 1) + H′ − P_h^− − P_h^+.
A similar formula holds true for the width. These formulas are derived in section 5.3 along
with an expression for the receptive field of the operator.
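Both size formulas can be cross-checked in Python (an illustration; note how convolution transpose exactly inverts the convolution output size when crops play the role of padding):

```python
def convt_output_size(H, Hp, C_minus, C_plus, S):
    # H'' = S (H - 1) + H' - C^- - C^+
    return S * (H - 1) + Hp - C_minus - C_plus

def conv_output_size(H, Hp, P_minus, P_plus, S):
    # forward formula from section 4.1, for cross-checking
    return 1 + (H - Hp + P_minus + P_plus) // S
```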
We now illustrate the action of convolution transpose in an example (see also fig. 4.2).
Consider a 1D slice in the vertical direction, assume that the crop parameters are zero,
and that S_h > 1. Consider the output sample y_{i″} where the index i″ is chosen such that
S_h divides i″ − 1; according to (4.1), this sample is obtained as a weighted summation of
x_{i″/S_h}, x_{i″/S_h − 1}, … (note that the order is reversed). The weights are the filter elements
f_1, f_{S_h+1}, f_{2S_h+1}, …, i.e. the filter subsampled with a step of S_h. Now consider computing
the element y_{i″+1}; due to the rounding in the quotient operation q(i″, S_h), this output sample
is obtained as a weighted combination of the same elements of the input x that were used to
compute y_{i″}; however, the filter weights are now shifted by one place to the right: f_2,
f_{S_h+2}, f_{2S_h+2}, …. The same is true for i″ + 2, i″ + 3, … until we hit i″ + S_h. Here the cycle
restarts after shifting x to the right by one place. Effectively, convolution transpose works as
an interpolating filter.
4.3 Spatial pooling
vl_nnpool implements max and sum pooling. The max pooling operator computes the maximum response of each feature channel in a H′ × W′ patch,

y_{i″j″d} = max_{1≤i′≤H′, 1≤j′≤W′} x_{i″+i′−1, j″+j′−1, d},

resulting in an output of size y ∈ R^{H″×W″×D}, similar to the convolution operator of section 4.1. Sum-pooling computes the average of the values instead:

y_{i″j″d} = (1 / (W′H′)) Σ_{1≤i′≤H′, 1≤j′≤W′} x_{i″+i′−1, j″+j′−1, d}.
Padding and stride. Similar to the convolution operator of section 4.1, vl_nnpool sup-
ports padding the input; however, the effect is different from padding in the convolutional
block as pooling regions straddling the image boundaries are cropped. For max pooling,
this is equivalent to extending the input data with −∞; for sum pooling, this is similar to
padding with zeros, but the normalization factor at the boundaries is smaller to account for
the smaller integration area.
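These boundary semantics can be sketched for a 1-D slice in Python (an illustration, not the vl_nnpool implementation):

```python
import numpy as np

def pool1d(x, k, pad, mode='max'):
    """1-D pooling sketch (stride 1). Max pooling behaves as if the input were
    padded with -inf; average pooling divides by the number of samples actually
    inside the input, so the normalization factor shrinks at the boundaries."""
    H = len(x)
    out = []
    for i in range(-pad, H + pad - k + 1):
        lo, hi = max(i, 0), min(i + k, H)    # crop the window to the input
        w = x[lo:hi]
        out.append(w.max() if mode == 'max' else w.sum() / len(w))
    return np.array(out)
```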
4.5 Spatial bilinear resampling
vl_nnbilinearsampler resamples the input x at the locations specified by a grid g. The same
transformation is applied to all the feature channels in the input, as follows:

y_{i″j″c} = Σ_{i=1}^{H} Σ_{j=1}^{W} x_{ijc} max{0, 1 − |α_v g_{1i″j″} + β_v − i|} max{0, 1 − |α_u g_{2i″j″} + β_u − j|},   (4.2)
where, for each feature channel c, the output y_{i″j″c} at the location (i″, j″) is a weighted sum
of the input values x_{ijc} in the neighborhood of location (g_{1i″j″}, g_{2i″j″}). The weights, as given
in (4.2), correspond to performing bilinear interpolation. Furthermore, the grid coordinates
are expressed not in pixels, but relative to a reference frame that extends from −1 to 1 for
all spatial dimensions of the input image; this is given by choosing the coefficients as:
α_v = (H − 1)/2,  β_v = −(H + 1)/2,  α_u = (W − 1)/2,  β_u = −(W + 1)/2.
See section 6.5 for implementation details.
4.6 Normalization
4.6.1 Local response normalization (LRN)
vl_nnnormalize implements the Local Response Normalization (LRN) operator. This oper-
ator is applied independently at each spatial location and to groups of feature channels as
follows:

y_{ijk} = x_{ijk} ( κ + α Σ_{t∈G(k)} x_{ijt}² )^{−β},
where, for each output channel k, G(k) ⊂ {1, 2, . . . , D} is a corresponding subset of input
channels. Note that input x and output y have the same dimensions. Note also that the
operator is applied uniformly at all spatial locations.
See section 6.6.1 for implementation details.
4.6.2 Batch normalization
y = vl_nnbnorm(x, w, b) normalizes each channel of the feature map x, averaging over spatial
locations and batch instances. Let T be the batch size; then

x, y ∈ R^{H×W×K×T},  w ∈ R^K,  b ∈ R^K.
Note that in this case the input and output arrays are explicitly treated as 4D tensors in
order to work with a batch of feature maps. The tensors w and b define component-wise
multiplicative and additive constants. The output feature map is given by
y_{ijkt} = w_k (x_{ijkt} − μ_k) / √(σ_k² + ε) + b_k,

μ_k = (1/(HWT)) Σ_{i=1}^H Σ_{j=1}^W Σ_{t=1}^T x_{ijkt},  σ_k² = (1/(HWT)) Σ_{i=1}^H Σ_{j=1}^W Σ_{t=1}^T (x_{ijkt} − μ_k)²,

where ε is a small constant added for numerical stability.
4.6.3 Spatial normalization
In practice, the factor 1/(W′H′) is adjusted at the boundaries to account for the fact that
neighbors must be cropped. This is then used to normalize the input:

y_{i″j″d} = x_{i″j″d} / (1 + α n²_{i″j″d})^β.
4.6.4 Softmax
vl_nnsoftmax computes the softmax operator:
y_{ijk} = e^{x_{ijk}} / Σ_{t=1}^{D} e^{x_{ijt}}.
Note that the operator is applied across feature channels and in a convolutional manner
at all spatial locations. Softmax can be seen as the combination of an activation function
(exponential) and a normalization operator. See section 6.6.4 for implementation details.
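A direct NumPy transcription of the operator (naive; see section 6.6.4 for the numerically stable variant):

```python
import numpy as np

def softmax_channels(x):
    """Softmax across the channel dimension at every spatial location; x: (H, W, D)."""
    e = np.exp(x)
    return e / e.sum(axis=2, keepdims=True)
```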
4.7 Categorical losses
The purpose of a categorical loss ℓ(x, c) is to compare a prediction x to a ground truth class
label c. As in the rest of MatConvNet, the loss is treated as a convolutional operator, in the
sense that the loss is evaluated independently at each spatial location. However, the
contributions of different samples are summed together (possibly after weighting) and the
output of the loss is a scalar. Section 4.7.1 describes losses useful for multi-class classification
and section 4.7.2 losses useful for binary attribute prediction. Further technical details
are in section 6.7. vl_nnloss implements all of these.
4.7.1 Classification losses
Here x ∈ R^{H×W×C×N} and c ∈ {1, …, C}^{H×W×1×N}, such that the slice x_{ij:n} represents a vector
of C class scores and c_{ij1n} is the ground truth class label. The `instanceWeights` option
can be used to specify a tensor w of weights, which are otherwise set to all ones; w has
the same dimension as c.
Unless otherwise noted, we drop the other indices and denote by x and c the slice xij:n
and the scalar cij1n . vl_nnloss automatically skips all samples such that c = 0, which can
be used as an “ignore” label.
Classification error. The classification error is zero if class c is assigned the largest score
and one otherwise:

ℓ(x, c) = 1[ c ≠ argmax_k x_k ].   (4.4)
Ties are broken randomly.
Top-K classification error. The top-K classification error is zero if class c is within the
top K ranked scores:

ℓ(x, c) = 1[ |{k : x_k ≥ x_c}| > K ].   (4.5)

The classification error is the same as the top-1 classification error.
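A Python sketch of the top-K error for one sample (here ties are counted pessimistically, rather than broken randomly as in (4.4)):

```python
import numpy as np

def topk_error(x, c, K=1):
    """Top-K classification error: 0 if class c is among the K highest scores,
    1 otherwise. x: score vector, c: 0-based class index."""
    rank = np.sum(x >= x[c])     # 1 means x[c] is the single highest score
    return int(rank > K)
```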
Softmax log-loss or multinomial logistic loss. This loss combines the softmax block
and the log-loss block into a single block:
ℓ(x, c) = − log ( e^{x_c} / Σ_{k=1}^{C} e^{x_k} ) = −x_c + log Σ_{k=1}^{C} e^{x_k}.   (4.7)

Combining the two blocks explicitly is required for numerical stability. Note that, by
combining the log-loss with softmax, this loss automatically makes the scores compete:
ℓ(x, c) ≈ 0 when x_c ≫ Σ_{k≠c} x_k.
This loss is implemented also in the deprecated function vl_softmaxloss.
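The second form of (4.7) can be computed stably in a few lines of Python (a sketch of the standard log-sum-exp trick, not MatConvNet code):

```python
import numpy as np

def softmax_log_loss(x, c):
    """l(x, c) = -x_c + log sum_k exp(x_k), evaluated stably by first
    subtracting max(x), which cancels between the two terms."""
    m = x.max()
    return -x[c] + m + np.log(np.sum(np.exp(x - m)))
```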
Multi-class hinge loss. The hinge loss is

ℓ(x, c) = max{0, 1 − x_c}.   (4.8)

Note that ℓ(x, c) = 0 ⇔ x_c ≥ 1. Just as for the log-loss above, this loss does not
automatically make the scores compete. In order to do that, the loss is usually preceded by
the block

y_c = x_c − max_{k≠c} x_k.

Hence y_c represents the confidence margin between class c and the other classes k ≠ c. Just
like the softmax log-loss combines softmax and log-loss, the next loss combines margin computation
and hinge loss.
Structured multi-class hinge loss. The structured multi-class hinge loss, also known as
Crammer-Singer loss, combines the multi-class hinge loss with a block computing the score
margin:

ℓ(x, c) = max{0, 1 − x_c + max_{k≠c} x_k}.   (4.9)
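Equation (4.9) can be sketched directly in Python:

```python
import numpy as np

def structured_hinge(x, c):
    """Crammer-Singer loss: max(0, 1 - x_c + max_{k != c} x_k); c is 0-based."""
    others = np.delete(x, c)
    return max(0.0, 1.0 - x[c] + others.max())
```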
4.7.2 Attribute losses
Here x ∈ R^{H×W×C×N} and c ∈ {−1, +1}^{H×W×C×N}, such that the scalar x_{ijkn} represents
a confidence that attribute k is on and c_{ijkn} is the ground truth attribute label. The
`instanceWeights` option can be used to specify a tensor w of weights, which are otherwise
set to all ones; w has the same dimension as c.
Unless otherwise noted, we drop the other indices and denote by x and c the scalars x_{ijkn}
and c_{ijkn}. As before, samples with c = 0 are skipped.
Binary error. This loss is zero only if the sign of x − τ agrees with the ground truth label c:

ℓ(x, c | τ) = 1[ sign(x − τ) ≠ c ].   (4.11)

Here τ is a configurable threshold, often set to zero.
Binary log-loss. This is the same as the multi-class log-loss but for binary attributes.
Namely, this time xk ∈ [0, 1] is interpreted as the probability that attribute k is on:
ℓ(x, c) = { −log x,  c = +1;  −log(1 − x),  c = −1 }   (4.12)
     = −log [ c (x − 1/2) + 1/2 ].   (4.13)
Similarly to the multi-class log loss, the assumption x ∈ [0, 1] must be enforced by the block
computing x.
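The equivalence of (4.12) and (4.13) is easy to check numerically (a Python sketch):

```python
import numpy as np

def binary_log_loss_cases(x, c):
    # (4.12): -log x for c = +1, -log(1 - x) for c = -1, with x in (0, 1)
    return -np.log(x) if c == 1 else -np.log(1.0 - x)

def binary_log_loss_compact(x, c):
    # (4.13): the equivalent compact form -log(c (x - 1/2) + 1/2)
    return -np.log(c * (x - 0.5) + 0.5)
```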
Binary logistic loss. This is the same as the multi-class logistic loss, but this time x/2
represents the confidence that the attribute is on and −x/2 that it is off. This is obtained
by using the logistic function σ(x):

ℓ(x, c) = −log σ(cx) = −log ( 1 / (1 + e^{−cx}) ) = −log ( e^{cx/2} / (e^{cx/2} + e^{−cx/2}) ).   (4.14)
Binary hinge loss. This is the same as the structured multi-class hinge loss but for binary
attributes:
`(x, c) = max{0, 1 − cx}. (4.15)
There is a relationship between the hinge loss and the structured multi-class hinge loss which
is analogous to the relationship between binary logistic loss and multi-class logistic loss.
Namely, the hinge loss can be rewritten as:

ℓ(x, c) = max{0, 1 − cx/2 + max_{k≠c} kx/2}.
Hence the hinge loss is the same as the structured multi-class hinge loss for C = 2 classes,
where x/2 is the score associated to class c = 1 and −x/2 the score associated to class c = −1.
4.8 Comparisons
4.8.1 p-distance
The vl_nnpdist function computes the p-distance between the vectors in the input data x
and a target x̄:
y_{ij} = ( Σ_d |x_{ijd} − x̄_{ijd}|^p )^{1/p}.
Note that this operator is applied convolutionally, i.e. at each spatial location ij one extracts
and compares vectors xij: . By specifying the option 'noRoot', true it is possible to compute
a variant omitting the root:
y_{ij} = Σ_d |x_{ijd} − x̄_{ijd}|^p,  p > 0.
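Both variants can be sketched together in NumPy (an illustration of the formulas, mirroring the 'noRoot' option of vl_nnpdist):

```python
import numpy as np

def pdist(x, xbar, p=2, no_root=False):
    """y_ij = (sum_d |x_ijd - xbar_ijd|^p)^(1/p) for x, xbar of shape (H, W, D);
    no_root omits the outer root."""
    s = np.sum(np.abs(x - xbar) ** p, axis=2)
    return s if no_root else s ** (1.0 / p)
```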
Chapter 5
Geometry
5.1 Preliminaries
In this section we are interested in understanding how components in a CNN depend on
components in the layers before it, and in particular on components of the input. Since
CNNs can incorporate blocks that perform complex operations, such as for example cropping
their inputs based on data-dependent terms (e.g. Fast R-CNN), this information is generally
available only at “run time” and cannot be uniquely determined given only the structure
of the network. Furthermore, blocks can implement complex operations that are difficult to
characterise in simple terms. Therefore, the analysis will be necessarily limited in scope.
We consider blocks such as convolutions for which one can deterministically establish
dependency chains between network components. We also assume that all the inputs x and
outputs y are in the usual form of spatial maps, and therefore indexed as x_{i,j,d,k} where i, j
are spatial coordinates.
Consider a layer y = f (x). We are interested in establishing which components of x
influence which components of y. We also assume that this relation can be expressed in
terms of a sliding rectangular window field, called receptive field. This means that the output
component y_{i″,j″} depends only on the input components x_{i,j} where (i, j) ∈ Ω(i″, j″) (note that
feature channels are implicitly coalesced in this discussion). The set Ω(i″, j″) is a rectangle
defined as follows:
i ∈ α_h (i″ − 1) + β_h + [ −(Δ_h − 1)/2, (Δ_h − 1)/2 ],   (5.1)
j ∈ α_v (j″ − 1) + β_v + [ −(Δ_v − 1)/2, (Δ_v − 1)/2 ],   (5.2)
where (αh , αv ) is the stride, (βh , βv ) the offset, and (∆h , ∆v ) the receptive field size.
This is the same formula as for the filters above, but with the ceil instead of the floor operator.
Note that in practice P_h^− = P_h^+ = P_h since Caffe does not support asymmetric padding.
Unfortunately, it gets more complicated. Using the formula above, it can happen that
the last padding application is completely outside the input image and Caffe tries to avoid
it. This requires

[ S_h (i″ − 1) − P_h^− + 1 ]_{i″ = H″} ≤ H  ⇔  H″ ≤ ⌊ (H − 1 + P_h^−) / S_h ⌋ + 1.   (5.4)
Using the fact that for integers a, b one has ⌈a/b⌉ = ⌊(a + b − 1)/b⌋, we can rewrite the
expression for H″ as follows
Hence if P_h^+ + S_h ≤ H′ then the second term is less than zero and (5.4) is satisfied. In
practice, Caffe assumes that P_h^+, P_h^− ≤ H′ − 1, as otherwise the first filter application falls
entirely in the padded region. Hence, we can upper bound the second term:

(P_h^+ + S_h − H′) / S_h ≤ (S_h − 1) / S_h ≤ 1.
We conclude that, for any choices of Ph+ and Sh allowed by Caffe, the formula above may
violate constraint (5.4) by at most one unit. Caffe has a special provision for that and lowers
H 00 by one when needed. Furthermore, we see that if Ph+ = 0 and Sh ≤ H 0 (which is often
the case and may be assumed by Caffe), then the equation is also satisfied and Caffe skips
the check.
Next, we find MatConvNet equivalents for these parameters. Assume that Caffe applies
a symmetric padding Ph . Then in MatConvNet Ph− = Ph to align the top part of the output
signal. To match Caffe, the last sample of the last filter application has to be on or to the
right of the last Caffe-padded pixel:
S_h ( ⌊ (H − H′ + P_h^− + P_h^+) / S_h ⌋ + 1 − 1 ) + H′ ≥ H + 2P_h^−,

where the left-hand side is the rightmost input sample pooled by MatConvNet (the rightmost
pooling index times the stride, plus the filter extent) and the right-hand side is the rightmost
Caffe input sample including padding.
Rearranging,

⌊ (H − H′ + P_h^− + P_h^+) / S_h ⌋ ≥ (H − H′ + 2P_h^−) / S_h.
Using ⌊a/b⌋ = ⌈(a − b + 1)/b⌉ we get the equivalent condition:
Removing the ceil operator lower bounds the left-hand side of the equation and produces the
sufficient condition

P_h^+ ≥ P_h^− + S_h − 1.

As before, this may still be too much padding, causing the last pool window application to
be entirely in the rightmost padded area. MatConvNet places the restriction P_h^+ ≤ H′ − 1,
so that

P_h^+ = min{ P_h^− + S_h − 1, H′ − 1 }.

For example, a pooling region of width H′ = 3 samples with a stride of S_h = 2 samples and
null Caffe padding P_h^− = 0 would result in a right MatConvNet padding of P_h^+ = 1.
where cropping becomes padding and upsampling becomes downsampling. Turning this
relation around, we find that
(i″ + C_h^− − H′) / U_h + 1 ≤ i ≤ (i″ + C_h^− − 1) / U_h + 1.
Note that, due to rounding, it is not possible to express this set tightly in the form outlined
above. We can however relax these two relations (hence obtaining a slightly larger receptive
field) and conclude that
α_h = 1/U_h,  β_h = (2C_h^− − H′ + 1) / (2U_h) + 1,  Δ_h = (H′ − 1)/U_h + 1.
Next, we want to determine the height H″ of the output y of convolution transpose as
a function of the height H of the input x and the other parameters. Swapping input and
output in (5.3) results in the constraint:
H = 1 + ⌊ (H″ − H′ + C_h^− + C_h^+) / U_h ⌋.
If H is now given as input, it is not possible to recover H″ uniquely from this expression;
instead, all the following values are possible:

H″ = U_h (H − 1) + H′ − C_h^− − C_h^+ + q,  q = 0, 1, …, U_h − 1.
This is due to the fact that U_h acts as a downsampling factor in the standard convolution
direction and some of the samples to the right of the convolution input y may be ignored by
the filter (see also fig. 4.1 and fig. 4.2).
Since the height of y is then determined only up to U_h samples, and since the extra samples
would be ignored by the computation and stay zero, we choose the tighter definition and set

H″ = U_h (H − 1) + H′ − C_h^− − C_h^+.
−(Δ_h − 1)/2 ≤ i − α_h (i″ − 1) − β_h ≤ (Δ_h − 1)/2.
A simple manipulation of this expression results in the equivalent expression:

α̂_h = 1/α_h,  β̂_h = (1 + α_h − β_h)/α_h,  Δ̂_h = (Δ_h + α_h − 1)/α_h.
For convolution, the receptive field parameters are:

α_h = S_h,  β_h = (H′ + 1)/2 − P_h^−,  Δ_h = H′.
Using the formulas just found, we can obtain the RF transformation for convolution transpose:
α̂_h = 1/α_h = 1/S_h,

β̂_h = (1 + S_h − (H′ + 1)/2 + P_h^−) / S_h = (P_h^− − H′/2 + 1/2) / S_h + 1 = (2P_h^− − H′ + 1) / (2S_h) + 1,

Δ̂_h = (H′ + S_h − 1)/S_h = (H′ − 1)/S_h + 1.
Chapter 6
Implementation details
6.1 Convolution
It is often convenient to express the convolution operation in matrix form. To this end, let
φ(x) be the im2row operator, extracting all H′ × W′ patches from the map x and storing
them as rows of a (H″W″) × (H′W′D) matrix. Formally, this operator is given by:
[φ(x)]_{pq} = x_{ijd},  (i, j, d) = t(p, q),
Note that φ and φ* are linear operators. Both can be expressed by a matrix
H ∈ R^{(H″W″H′W′D)×(HWD)} such that

vec φ(x) = H vec x.
Hence we obtain the following expression for the vectorized output (see [6]):
vec y = vec (φ(x) F) = (I ⊗ φ(x)) vec F, or, equivalently, = (F^⊤ ⊗ I) vec φ(x),
where F ∈ R^{(H′W′D)×K} is the matrix obtained by reshaping the array f and I is an identity
matrix of suitable dimensions. This allows obtaining the following formulas for the derivatives:

dz/d(vec F)^⊤ = dz/d(vec y)^⊤ (I ⊗ φ(x)) = vec ( φ(x)^⊤ (dz/dY) )^⊤,
where Y ∈ R^{(H″W″)×K} is the matrix obtained by reshaping the array y. Likewise:
dz/d(vec x)^⊤ = dz/d(vec y)^⊤ (F^⊤ ⊗ I) (d vec φ(x) / d(vec x)^⊤) = vec ( (dz/dY) F^⊤ )^⊤ H,
In summary, after reshaping these terms we obtain the formulas:

vec y = vec (φ(x) F),  dz/dF = φ(x)^⊤ (dz/dY),  dz/dX = φ* ( (dz/dY) F^⊤ ),

where X ∈ R^{(HW)×D} is the matrix obtained by reshaping x. Notably, these expressions are
used to implement the convolutional operator; while this may seem inefficient, it is instead
a fast approach when the number of filters is large and it allows leveraging fast BLAS and
GPU BLAS implementations.
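The im2row construction can be sketched in NumPy and checked against a direct loop (an illustration of the matrix form, not the MatConvNet implementation):

```python
import numpy as np

def im2row(x, Hp, Wp):
    """phi(x): extract every Hp x Wp patch of x (shape (H, W, D)) and store it,
    flattened, as one row of an (H''W'') x (Hp*Wp*D) matrix."""
    H, W, D = x.shape
    Ho, Wo = H - Hp + 1, W - Wp + 1
    rows = [x[i:i + Hp, j:j + Wp, :].reshape(-1)
            for i in range(Ho) for j in range(Wo)]
    return np.array(rows)
```

Convolution then reduces to a matrix product: with F = f.reshape(H′W′D, K), the output is (im2row(x, H′, W′) @ F).reshape(H″, W″, K); both flattenings use the same C-order, so patch and filter entries line up.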
6.2 Convolution transpose
Consider a 1-D slice of the convolution operator of section 4.1:

y_{i″} = Σ_{i′=1}^{H′} f_{i′} x_{S(i″−1)+i′−P_h^−},

where S is the downsampling factor, P_h^− and P_h^+ the padding, H the length of the input
signal x, and H′ the length of the filter f. Due to padding, the index of the input data x
may exceed the range [1, H]; we implicitly assume that the signal is zero-padded outside this
range.
In order to derive an expression for the convolution transpose, we make use of the identity
(vec y)^⊤ (M vec x) = ((vec y)^⊤ M) vec x = (vec x)^⊤ (M^⊤ vec y). Expanding this in formulas:

Σ_{i″=1}^{H″} y_{i″} Σ_{i′=1}^{H′} f_{i′} x_{S(i″−1)+i′−P_h^−}
 = Σ_{i″=−∞}^{+∞} Σ_{i′=−∞}^{+∞} y_{i″} f_{i′} x_{S(i″−1)+i′−P_h^−}
 = Σ_{i″=−∞}^{+∞} Σ_{k=−∞}^{+∞} y_{i″} f_{k−S(i″−1)+P_h^−} x_k
 = Σ_{i″=−∞}^{+∞} Σ_{k=−∞}^{+∞} y_{i″} f_{(k−1+P_h^−) mod S + S(1 − i″ + ⌊(k−1+P_h^−)/S⌋) + 1} x_k
 = Σ_{k=−∞}^{+∞} Σ_{q=−∞}^{+∞} x_k y_{⌊(k−1+P_h^−)/S⌋ + 2 − q} f_{(k−1+P_h^−) mod S + S(q−1) + 1}.
Summation ranges have been extended to infinity by assuming that all signals are zero-padded
as needed. In order to recover such ranges, note that k ∈ [1, H] (since this is the range of
elements of x involved in the original convolution). Furthermore, q ≥ 1 is the minimum value
of q for which the filter f is non-zero; likewise, q ≤ ⌊(H′ − 1)/S⌋ + 1 is a fairly tight upper
bound on the maximum value (although, depending on k, there could be one element less).
Hence
x_k = Σ_{q=1}^{1+⌊(H′−1)/S⌋} y_{⌊(k−1+P_h^−)/S⌋ + 2 − q} f_{(k−1+P_h^−) mod S + S(q−1) + 1},  k = 1, …, H.   (6.1)
Note that the summation extrema in (6.1) can be refined slightly to account for the finite
size of y and f:

max{ 1, ⌊(k−1+P_h^−)/S⌋ + 2 − H″ } ≤ q ≤ 1 + min{ ⌊(H′ − 1 − (k−1+P_h^−) mod S)/S⌋, ⌊(k−1+P_h^−)/S⌋ }.
6.3 Spatial pooling
Since max pooling simply selects, for each output element, one input element, the operator
can be expressed through a selector matrix S(x) that depends on the input x:

vec y = S(x) vec x,  dz/d(vec x) = S(x)^⊤ (dz/d(vec y)).   (6.2)
6.4 Activation functions
6.4.1 ReLU
The ReLU operator can be expressed in matrix notation as

vec y = diag(s) vec x,  dz/d(vec x) = diag(s) (dz/d(vec y)),

where s = [vec x > 0] is an indicator vector.
6.4.2 Sigmoid
The derivative of the sigmoid function is given by

dz/dx_{ijd} = (dz/dy_{ijd}) (dy_{ijd}/dx_{ijd}) = (dz/dy_{ijd}) ( −1/(1 + e^{−x_{ijd}})² ) ( −e^{−x_{ijd}} )
 = (dz/dy_{ijd}) y_{ijd} (1 − y_{ijd}).
In matrix notation:
dz/dx = (dz/dy) ⊙ y ⊙ (11^⊤ − y).
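The identity dy/dx = y(1 − y) is easy to verify against finite differences (a Python sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(x, dzdy):
    """dz/dx = dz/dy * y * (1 - y), with y = sigmoid(x)."""
    y = sigmoid(x)
    return dzdy * y * (1.0 - y)
```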
6.5 Spatial bilinear resampling
Note that the formula is similar to Eq. (4.2), with the difference that the summation is over
i″ rather than i.
The projected derivative d⟨p, φ(x, g)⟩/dg with respect to the grid is similar:

(∂/∂g_{1i′j′}) Σ_{i″j″c} p_{i″j″c} Σ_{i=1}^{H} Σ_{j=1}^{W} x_{ijc} max{0, 1 − |α_v g_{1i″j″} + β_v − i|} max{0, 1 − |α_u g_{2i″j″} + β_u − j|}
 = − Σ_c p_{i′j′c} α_v Σ_{i=1}^{H} Σ_{j=1}^{W} x_{ijc} max{0, 1 − |α_u g_{2i′j′} + β_u − j|} sign(α_v g_{1i′j′} + β_v − i) 1_{−1 < α_v g_{1i′j′} + β_v − i < 1}.   (6.4)

A similar expression holds for the derivative with respect to g_{2i′j′}.
6.6 Normalization
6.6.1 Local response normalization (LRN)
The derivative is easily computed as:

dz/dx_{ijd} = (dz/dy_{ijd}) L(i,j,d|x)^{−β} − 2αβ x_{ijd} Σ_{k : d∈G(k)} (dz/dy_{ijk}) L(i,j,k|x)^{−β−1} x_{ijk},

where

L(i,j,k|x) = κ + α Σ_{t∈G(k)} x_{ijt}².
6.6.2 Batch normalization
The derivative of the network output z with respect to the block input x is computed as
follows:
dz/dx_{ijkt} = Σ_{i″j″k″t″} (dz/dy_{i″j″k″t″}) (dy_{i″j″k″t″}/dx_{ijkt}).

Since feature channels are processed independently, all terms with k″ ≠ k are zero. Hence

dz/dx_{ijkt} = Σ_{i″j″t″} (dz/dy_{i″j″kt″}) (dy_{i″j″kt″}/dx_{ijkt}),
where

dy_{i″j″kt″}/dx_{ijkt} = w_k ( δ_{i=i″, j=j″, t=t″} − dμ_k/dx_{ijkt} ) (σ_k² + ε)^{−1/2} − (w_k/2) (x_{i″j″kt″} − μ_k) (σ_k² + ε)^{−3/2} (dσ_k²/dx_{ijkt});

the derivatives with respect to the mean and variance are computed as follows:

dμ_k/dx_{ijkt} = 1/(HWT),

dσ_k²/dx_{i′j′kt′} = (2/(HWT)) Σ_{ijt} (x_{ijkt} − μ_k) ( δ_{i=i′, j=j′, t=t′} − 1/(HWT) ) = (2/(HWT)) (x_{i′j′kt′} − μ_k),
i.e.

dz/dx_{ijkt} = w_k (σ_k² + ε)^{−1/2} ( dz/dy_{ijkt} − (1/(HWT)) Σ_{i″j″t″} dz/dy_{i″j″kt″} )
 − w_k (σ_k² + ε)^{−1/2} ((x_{ijkt} − μ_k)/√(σ_k² + ε)) (1/(HWT)) Σ_{i″j″t″} (dz/dy_{i″j″kt″}) ((x_{i″j″kt″} − μ_k)/√(σ_k² + ε)).
We can identify some of these terms with the derivatives of bnorm with respect to b_k and w_k:

dz/dx_{ijkt} = w_k (σ_k² + ε)^{−1/2} ( dz/dy_{ijkt} − (1/(HWT)) (dz/db_k) − ((x_{ijkt} − μ_k)/√(σ_k² + ε)) (1/(HWT)) (dz/dw_k) ).
Note that the summation can be computed as the derivative of the vl_nnpool block.
6.6.4 Softmax
Care must be taken in evaluating the exponential in order to avoid underflow or overflow.
The simplest way to do so is to divide the numerator and denominator by the exponential of
the maximum value:
y_{ijk} = e^{x_{ijk} − max_d x_{ijd}} / Σ_{t=1}^{D} e^{x_{ijt} − max_d x_{ijd}}.
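The trick is easy to demonstrate in Python: shifting by the maximum leaves the result unchanged but keeps exp from overflowing (a sketch, not MatConvNet code):

```python
import numpy as np

def softmax_stable(x):
    """Divide numerator and denominator by exp(max(x)): the result is
    unchanged, but exp never receives a large positive argument."""
    e = np.exp(x - x.max())
    return e / e.sum()
```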
The derivative is given by:

dz/dx_{ijd} = Σ_k (dz/dy_{ijk}) ( e^{x_{ijd}} L(x)^{−1} δ_{k=d} − e^{x_{ijd}} e^{x_{ijk}} L(x)^{−2} ),  L(x) = Σ_{t=1}^{D} e^{x_{ijt}}.

Simplifying:

dz/dx_{ijd} = y_{ijd} ( dz/dy_{ijd} − Σ_{k=1}^{D} (dz/dy_{ijk}) y_{ijk} ).
In matrix form:
dz/dX = Y ⊙ ( dz/dY − ((dz/dY ⊙ Y) 11^⊤) ),
where X, Y ∈ R^{(HW)×D} are the matrices obtained by reshaping the arrays x and y. Note that
the numerical implementation of this expression is straightforward once the output Y has
been computed with the caveats above.
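The simplified derivative can be validated numerically against finite differences (a Python sketch for a single score vector):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_backward(x, dzdy):
    """Backward pass per the simplified expression:
    dz/dx_d = y_d (dz/dy_d - sum_k (dz/dy_k) y_k)."""
    y = softmax(x)
    return y * (dzdy - np.dot(dzdy, y))
```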
6.7 Categorical losses
This section obtains the projected derivatives of the categorical losses of section 4.7, where
p denotes the weight applied to the loss. For the log-loss:

∂pℓ(x,c)/∂x_k = −p (∂ log x_c / ∂x_k) = −(p/x_c) δ_{k=c}.

For the multi-class hinge loss:

∂pℓ(x,c)/∂x_k = −p 1[x_c < 1] δ_{k=c}.

For the structured multi-class hinge loss:

∂pℓ(x,c)/∂x_k = −p 1[x_c < 1 + max_{t≠c} x_t] (δ_{k=c} − δ_{k=t*}),  t* = argmax_{t≠c} x_t.

For the binary log-loss:

∂pℓ(x,c)/∂x = −p c / ( c(x − 1/2) + 1/2 ).

For the binary logistic loss:

∂pℓ(x,c)/∂x = −p (∂/∂x) log ( 1/(1 + e^{−cx}) ) = −p c e^{−cx}/(1 + e^{−cx}) = −p c/(e^{cx} + 1) = −p c σ(−cx).

For the binary hinge loss:

∂pℓ(x,c)/∂x = −p c 1[cx < 1].
6.8 Comparisons
6.8.1 p-distance
The derivative of the operator without root is given by:
dz/dx_{ijd} = (dz/dy_{ij}) p |x_{ijd} − x̄_{ijd}|^{p−1} sign(x_{ijd} − x̄_{ijd}).
The formulas simplify a little for p = 1, 2 which are therefore implemented as special cases.
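The derivative of the no-root variant can be checked with finite differences (a Python sketch for one location, with p = 3 so the function is smooth away from zero):

```python
import numpy as np

def pdist_noroot(x, xbar, p):
    # y = sum_d |x_d - xbar_d|^p (the 'noRoot' variant, at one location)
    return np.sum(np.abs(x - xbar) ** p)

def pdist_noroot_grad(x, xbar, p):
    # dy/dx_d = p |x_d - xbar_d|^(p-1) sign(x_d - xbar_d)
    return p * np.abs(x - xbar) ** (p - 1) * np.sign(x - xbar)
```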
Bibliography
[1] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the
details: Delving deep into convolutional nets. In Proc. BMVC, 2014.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale
Hierarchical Image Database. In Proc. CVPR, 2009.
[3] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. CoRR, 2015.
[4] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift. ArXiv e-prints, 2015.
[5] Yangqing Jia. Caffe: An open source convolutional architecture for fast feature embed-
ding. http://caffe.berkeleyvision.org/, 2013.
[6] D. B. Kinghorn. Integrals and derivatives for correlated gaussian functions using matrix
differential calculus. International Journal of Quantum Chemistry, 57:141–155, 1996.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convo-
lutional neural networks. In Proc. NIPS, 2012.
[8] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400,
2013.
[9] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visu-
alising image classification models and saliency maps. In Proc. ICLR, 2014.
[10] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In Proc. ICLR, 2015.
[11] A. Vedaldi and B. Fulkerson. VLFeat – an open and portable library of computer vision
algorithms. In Proc. ACM Int. Conf. on Multimedia, 2010.
[12] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In
Proc. CVPR, 2010.
arXiv:1501.00092v3 [cs.CV] 31 Jul 2015
Abstract—We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end
mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes
the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR
methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately,
our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality,
and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs
between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show
better overall reconstruction quality.
function [47], random forest [37] and anchored neighborhood regression [41], [42] are proposed to further improve the mapping accuracy and speed. The sparse-coding-based method and its several improvements [41], [42], [48] are among the state-of-the-art SR methods nowadays. In these methods, the patches are the focus of the optimization; the patch extraction and aggregation steps are considered as pre/post-processing and handled separately.

The majority of SR algorithms [2], [4], [15], [41], [48], [49], [50], [51] focus on gray-scale or single-channel image super-resolution. For color images, the aforementioned methods first transform the problem to a different color space (YCbCr or YUV), and SR is applied only on the luminance channel. There are also works attempting to super-resolve all channels simultaneously. For example, Kim and Kwon [25] and Dai et al. [7] apply their model to each RGB channel and combine them to produce the final results. However, none of them has analyzed the SR performance of different channels, or the necessity of recovering all three channels.

2.2 Convolutional Neural Networks

Convolutional neural networks (CNNs) date back decades [27], and deep CNNs have recently shown explosive popularity, partially due to their success in image classification [18], [26]. They have also been successfully applied to other computer vision fields, such as object detection [34], [40], [52], face recognition [39], and pedestrian detection [35]. Several factors are of central importance in this progress: (i) the efficient training implementation on modern powerful GPUs [26], (ii) the proposal of the Rectified Linear Unit (ReLU) [33], which makes convergence much faster while still presenting good quality [26], and (iii) the easy access to an abundance of data (like ImageNet [9]) for training larger models. Our method also benefits from this progress.

2.3 Deep Learning for Image Restoration

There have been a few studies of using deep learning techniques for image restoration. The multi-layer perceptron (MLP), in which all layers are fully-connected (in contrast to convolutional), is applied for natural image denoising [3] and post-deblurring denoising [36]. More closely related to our work, the convolutional neural network is applied for natural image denoising [22] and removing noisy patterns (dirt/rain) [12]. These restoration problems are more or less denoising-driven. Cui et al. [5] propose to embed auto-encoder networks in their super-resolution pipeline under the notion of the internal example-based approach [16]. The deep model is not specifically designed to be an end-to-end solution, since each layer of the cascade requires independent optimization of the self-similarity search process and the auto-encoder. On the contrary, the proposed SRCNN optimizes an end-to-end mapping. Further, the SRCNN is faster. It is not only a quantitatively superior method, but also a practically useful one.

3 CONVOLUTIONAL NEURAL NETWORKS FOR SUPER-RESOLUTION

3.1 Formulation

Given a single low-resolution image, we first upscale it to the desired size using bicubic interpolation, which is the only pre-processing we perform³. Let us denote the interpolated image as Y. Our goal is to recover from Y an image F(Y) that is as similar as possible to the ground truth high-resolution image X. For the ease of presentation, we still call Y a "low-resolution" image, although it has the same size as X. We wish to learn a mapping F, which conceptually consists of three operations:

1) Patch extraction and representation: this operation extracts (overlapping) patches from the low-resolution image Y and represents each patch as a high-dimensional vector. These vectors comprise a set of feature maps, whose number equals the dimensionality of the vectors.
2) Non-linear mapping: this operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector. Each mapped vector is conceptually the representation of a high-resolution patch. These vectors comprise another set of feature maps.
3) Reconstruction: this operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image. This image is expected to be similar to the ground truth X.

We will show that all these operations form a convolutional neural network. An overview of the network is depicted in Figure 2. Next we detail our definition of each operation.

3.1.1 Patch extraction and representation

A popular strategy in image restoration (e.g., [1]) is to densely extract patches and then represent them by a set of pre-trained bases such as PCA, DCT, Haar, etc. This is equivalent to convolving the image by a set of filters, each of which is a basis. In our formulation, we involve the optimization of these bases into the optimization of the network. Formally, our first layer is expressed as an operation F1:

    F1(Y) = max(0, W1 * Y + B1),    (1)

where W1 and B1 represent the filters and biases respectively, and '*' denotes the convolution operation. Here, W1 corresponds to n1 filters of support c × f1 × f1, where c is the number of channels in the input image and f1 is the spatial size of a filter. Intuitively, W1 applies n1 convolutions on the image, and each convolution has

3. Bicubic interpolation is also a convolutional operation, so it can be formulated as a convolutional layer. However, the output size of this layer is larger than the input size, so there is a fractional stride. To take advantage of popular well-optimized implementations such as cuda-convnet [26], we exclude this "layer" from learning.
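Operation (1) above is just a valid (no-padding) convolution followed by a ReLU. A minimal pure-Python sketch of such a first layer, with a toy 4×4 image and a single made-up 3×3 filter standing in for the paper's n1 = 64 filters of size 9×9 (all names and values here are illustrative, not taken from the paper):

```python
# Sketch of Eq. (1): F1(Y) = max(0, W1 * Y + B1) as a valid convolution
# followed by a ReLU, on a tiny grayscale example.

def conv2d_valid(img, kernel):
    """2-D valid correlation of an HxW image with a kxk kernel."""
    h, w = len(img), len(img[0])
    k = len(kernel)
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            s = sum(img[i + u][j + v] * kernel[u][v]
                    for u in range(k) for v in range(k))
            row.append(s)
        out.append(row)
    return out

def first_layer(img, filters, biases):
    """One feature map per filter: max(0, conv(img, W) + b)."""
    maps = []
    for kern, b in zip(filters, biases):
        resp = conv2d_valid(img, kern)
        maps.append([[max(0.0, x + b) for x in row] for row in resp])
    return maps

# toy 4x4 "interpolated low-resolution image" Y (a linear ramp)
Y = [[1, 2, 3, 4],
     [2, 3, 4, 5],
     [3, 4, 5, 6],
     [4, 5, 6, 7]]
W1 = [[[0, -1, 0], [-1, 4, -1], [0, -1, 0]]]  # one Laplacian-like 3x3 filter
B1 = [0.5]
F1 = first_layer(Y, W1, B1)  # a 3x3 filter on a 4x4 image gives a 2x2 map
```

On a linear ramp the Laplacian-like filter responds with zero everywhere, so after the bias and the ReLU every entry of the 2×2 feature map equals 0.5; with no padding, each filter of size k shrinks the image by k − 1 pixels per dimension, which is exactly the output-size behavior described in the training section.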
(Figure: low-resolution image (input) → high-resolution image (output); responses of patch; neighbouring patches.)
solver behaves as a special case of a non-linear mapping operator, whose spatial support is 1 × 1. See the middle part of Figure 3. However, the sparse coding solver is not feed-forward, i.e., it is an iterative algorithm. On the contrary, our non-linear operator is fully feed-forward and can be computed efficiently. If we set f2 = 1, then our non-linear operator can be considered as a pixel-wise fully-connected layer. It is worth noting that "the sparse coding solver" in SRCNN refers to the first two layers, not just the second layer or the activation function (ReLU). Thus the nonlinear operation in SRCNN is also well optimized through the learning process.

The above n2 coefficients (after sparse coding) are then projected onto another (high-resolution) dictionary to produce a high-resolution patch. The overlapping high-resolution patches are then averaged. As discussed above, this is equivalent to linear convolutions on the n2 feature maps. If the high-resolution patches used for reconstruction are of size f3 × f3, then the linear filters have an equivalent spatial support of size f3 × f3. See the right part of Figure 3.

The above discussion shows that the sparse-coding-based SR method can be viewed as a kind of convolutional neural network (with a different non-linear mapping). But not all operations have been considered in the optimization in the sparse-coding-based SR methods. On the contrary, in our convolutional neural network, the low-resolution dictionary, high-resolution dictionary, non-linear mapping, together with mean subtraction and averaging, are all involved in the filters to be optimized. So our method optimizes an end-to-end mapping that consists of all operations.

The above analogy can also help us to design hyper-parameters. For example, we can set the filter size of the last layer to be smaller than that of the first layer, so that we rely more on the central part of the high-resolution patch (in the extreme, if f3 = 1, we are using the center pixel with no averaging). We can also set n2 < n1, because the representation is expected to be sparser. A typical and basic setting is f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32 (we evaluate more settings in the experiment section). On the whole, the estimation of a high-resolution pixel utilizes the information of (9 + 5 − 1)² = 169 pixels. Clearly, the information exploited for reconstruction is comparatively larger than that used in existing external example-based approaches, e.g., using (5 + 5 − 1)² = 81 pixels⁵ [15], [50]. This is one of the reasons why the SRCNN gives superior performance.

3.3 Training

Learning the end-to-end mapping function F requires the estimation of network parameters Θ = {W1, W2, W3, B1, B2, B3}. This is achieved through minimizing the loss between the reconstructed images F(Y; Θ) and the corresponding ground truth high-resolution images X. Given a set of high-resolution images {Xi} and their corresponding low-resolution images {Yi}, we use Mean Squared Error (MSE) as the loss function:

    L(Θ) = (1/n) Σ_{i=1}^{n} ||F(Yi; Θ) − Xi||²,    (4)

where n is the number of training samples. Using MSE as the loss function favors a high PSNR. The PSNR is a widely-used metric for quantitatively evaluating image restoration quality, and is at least partially related to the perceptual quality. It is worth noticing that convolutional neural networks do not preclude the usage of other kinds of loss functions, as long as the loss functions are differentiable. If a better perceptually motivated metric is given during training, the network can flexibly adapt to that metric. On the contrary, such flexibility is in general difficult to achieve for traditional "hand-crafted" methods. Although the proposed model is trained favoring a high PSNR, we still observe satisfactory performance when the model is evaluated using alternative evaluation metrics, e.g., SSIM, MSSSIM (see Section 4.4.1).

The loss is minimized using stochastic gradient descent with the standard backpropagation [28]. In particular, the weight matrices are updated as

    Δ_{i+1} = 0.9 · Δ_i − η · ∂L/∂W_i^ℓ,    W_{i+1}^ℓ = W_i^ℓ + Δ_{i+1},    (5)

5. The patches are overlapped with 4 pixels in each direction.
where ℓ ∈ {1, 2, 3} and i are the indices of layers and iterations, η is the learning rate, and ∂L/∂W_i^ℓ is the derivative. The filter weights of each layer are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.001 (and 0 for biases). The learning rate is 10^-4 for the first two layers, and 10^-5 for the last layer. We empirically find that a smaller learning rate in the last layer is important for the network to converge (similar to the denoising case [22]).

In the training phase, the ground truth images {Xi} are prepared as fsub × fsub × c-pixel sub-images randomly cropped from the training images. By "sub-images" we mean these samples are treated as small "images" rather than "patches", in the sense that "patches" are overlapping and require some averaging as post-processing but "sub-images" need not. To synthesize the low-resolution samples {Yi}, we blur a sub-image by a Gaussian kernel, sub-sample it by the upscaling factor, and upscale it by the same factor via bicubic interpolation.

To avoid border effects during training, all the convolutional layers have no padding, and the network produces a smaller output ((fsub − f1 − f2 − f3 + 3)² × c). The MSE loss function is evaluated only on the difference between the central pixels of Xi and the network output. Although we use a fixed image size in training, the convolutional neural network can be applied to images of arbitrary sizes during testing.

We implement our model using the cuda-convnet package [26]. We have also tried the Caffe package [24] and observed similar performance.

4 EXPERIMENTS

We first investigate the impact of using different datasets on the model performance. Next, we examine the filters learned by our approach. We then explore different architecture designs of the network, and study the relations between super-resolution performance and factors like depth, number of filters, and filter sizes. Subsequently, we compare our method with recent state-of-the-arts both quantitatively and qualitatively. Following [42], super-resolution is only applied on the luminance channel (Y channel in YCbCr color space) in Sections 4.1-4.4, so c = 1 in the first/last layer, and performance (e.g., PSNR and SSIM) is evaluated on the Y channel. At last, we extend the network to cope with color images and evaluate the performance on different channels.

4.1 Training Data

As shown in the literature, deep learning generally benefits from big data training. For comparison, we use a relatively small training set [41], [50] that consists of 91 images, and a large training set that consists of 395,909 images from the ILSVRC 2013 ImageNet detection training partition. The size of training sub-images is fsub = 33. Thus the 91-image dataset can be decomposed into 24,800 sub-images, which are extracted from the original images with a stride of 14, whereas the ImageNet provides over 5 million sub-images even using a stride of 33. We use the basic network settings, i.e., f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32. We use the Set5 [2] as the validation set. We observe a similar trend even if we use the larger Set14 set [51]. The upscaling factor is 3. We use the sparse-coding-based method [50] as our baseline, which achieves an average PSNR value of 31.42 dB.

The test convergence curves of using different training sets are shown in Figure 4. The training time on ImageNet is about the same as on the 91-image dataset, since the number of backpropagations is the same. As can be observed, with the same number of backpropagations (i.e., 8 × 10^8), the SRCNN trained on ImageNet achieves 32.52 dB, higher than the 32.39 dB yielded by that trained on the 91 images. The results positively indicate that SRCNN performance may be further boosted using a larger training set, but the effect of big data is not as impressive as that shown in high-level vision problems [26]. This is mainly because the 91 images have already captured sufficient variability of natural images. On the other hand, our SRCNN is a relatively small network (8,032 parameters), which could not overfit the 91 images (24,800 samples). Nevertheless, we adopt the ImageNet, which contains more diverse data, as the default training set in the following experiments.

(Figure 4: average test PSNR (dB) vs. number of backprops (× 10^8) for SRCNN trained on ImageNet, SRCNN trained on 91 images, and SC (31.42 dB).) Fig. 4. Training with the much larger ImageNet dataset improves the performance over the use of 91 images.

4.2 Learned Filters for Super-Resolution

Figure 5 shows examples of learned first-layer filters trained on the ImageNet by an upscaling factor 3. Please refer to our published implementation for upscaling factors 2 and 4. Interestingly, each learned filter has its specific functionality. For instance, the filters g and h are like Laplacian/Gaussian filters, the filters a-e are like edge detectors at different directions, and the filter f is like a texture extractor. Example feature maps of different layers are shown in Figure 6. Obviously, feature maps of the first layer contain different structures (e.g., edges at different directions), while those of the second layer mainly differ in intensities.

4.3 Model and Performance Trade-offs

Based on the basic network settings (i.e., f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32), we will progressively modify some of these parameters to investigate the best trade-off between performance and speed, and study the relations between performance and parameters.
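The numbers quoted above for the basic 9-1-5 setting can be checked with a few lines of arithmetic: the valid-convolution output size (fsub − f1 − f2 − f3 + 3)², the 8,032 parameters (which matches the weight count with biases excluded), and the (9 + 5 − 1)² = 169-pixel receptive field. A sanity-check sketch:

```python
# Arithmetic checks for the basic SRCNN setting (9-1-5): f1, f2, f3 = 9, 1, 5;
# n1, n2 = 64, 32; c = 1; training sub-image size f_sub = 33.

f1, f2, f3 = 9, 1, 5
n1, n2, c = 64, 32, 1
f_sub = 33

# each valid convolution with filter size f removes (f - 1) pixels per side
out = f_sub
for f in (f1, f2, f3):
    out = out - f + 1
# matches the closed form (f_sub - f1 - f2 - f3 + 3) from the text

# weights per layer: (output maps) x (input maps) x (spatial support);
# biases are excluded, which reproduces the 8,032 figure
params = n1 * c * f1 * f1 + n2 * n1 * f2 * f2 + c * n2 * f3 * f3

# receptive field of one output pixel (the f2 = 1 layer adds nothing spatially)
receptive = (f1 + f3 - 1) ** 2
```

So a 33 × 33 sub-image yields a 21 × 21 output, and each output pixel draws on a 13 × 13 = 169-pixel input neighborhood.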
(Figure 5: first-layer filters a-h.) Fig. 5. The figure shows the first-layer filters trained on ImageNet with an upscaling factor 3.

(Figure 8: average test PSNR (dB) vs. number of backprops (× 10^8), with SC (31.42 dB) as reference; panels (a) 9-1-5 vs. 9-1-1-5, (b) 9-3-5 vs. 9-3-1-5, (c) 9-5-5 vs. 9-5-1-5.) Fig. 8. Comparisons between three-layer and four-layer networks.

(Figure 9: average test PSNR (dB) vs. number of backprops (× 10^8), with SC (31.42 dB) as reference; panels (a) 9-1-1-5 (n22 = 32) and 9-1-1-1-5 (n22 = 32, n23 = 16), (b) 9-3-3-5 and 9-3-3-3.) Fig. 9. Deeper structure does not always lead to better results.

mapping layers with n22 = 32 and n23 = 16 filters on 9-1-5, then we have to set a smaller learning rate to ensure convergence, but we still do not observe superior performance after a week of training (see Figure 9(a)). We also tried to enlarge the filter size of the additional layer to f22 = 3, and explored two deeper structures, 9-3-3-5 and 9-3-3-3. However, from the convergence curves shown in Figure 9(b), these two networks do not show better results than the 9-3-1-5 network.

All these experiments indicate that it is not "the deeper the better" in this deep model for super-resolution. This may be caused by the difficulty of training. Our CNN network contains no pooling layer or fully-connected layer, thus it is sensitive to the initialization parameters and learning rate. When we go deeper (e.g., 4 or 5 layers), we find it hard to set appropriate learning rates that guarantee convergence. Even if it converges, the network may fall into a bad local minimum, and the learned filters are of less diversity even given enough training time. This phenomenon is also observed in [16], where improper increase of depth leads to accuracy saturation or degradation for image classification. Why "deeper is not better" is still an open question, which requires investigations to better understand gradients and training dynamics in deep architectures. Therefore, we still adopt three-layer networks in the following experiments.

4.4 Comparisons to State-of-the-Arts

In this section, we show the quantitative and qualitative results of our method in comparison to state-of-the-art methods. We adopt the model with a good performance-speed trade-off: a three-layer network with f1 = 9, f2 = 5, f3 = 5, n1 = 64, and n2 = 32 trained on ImageNet. For each upscaling factor ∈ {2, 3, 4}, we train a specific network for that factor⁷.

Comparisons. We compare our SRCNN with the state-of-the-art SR methods:
• SC - sparse coding-based method of Yang et al. [50]
• NE+LLE - neighbour embedding + locally linear embedding method [4]
• ANR - Anchored Neighbourhood Regression method [41]
• A+ - Adjusted Anchored Neighbourhood Regression method [42], and
• KK - the method described in [25], which achieves the best performance among external example-based methods, according to the comprehensive evaluation conducted in Yang et al.'s work [46].

The implementations are all from the publicly available codes provided by the authors, and all images are down-sampled using the same bicubic kernel.

Test set. The Set5 [2] (5 images), Set14 [51] (14 images) and BSD200 [32] (200 images)⁸ are used to evaluate the performance of upscaling factors 2, 3, and 4.

Evaluation metrics. Apart from the widely used PSNR and SSIM [43] indices, we also adopt another four evaluation metrics, namely the information fidelity criterion (IFC) [38], noise quality measure (NQM) [8], weighted peak signal-to-noise ratio (WPSNR) and multi-scale structure similarity index (MSSSIM) [44], which obtain high correlation with human perceptual scores as reported in [46].

4.4.1 Quantitative and qualitative evaluation

As shown in Tables 2, 3 and 4, the proposed SRCNN yields the highest scores in most evaluation metrics

7. In the area of denoising [3], for each noise level a specific network is trained.
8. We use the same 200 images as in [46].
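PSNR, the primary quantitative metric in these comparisons, is defined from the mean squared error as PSNR = 10 · log10(MAX² / MSE), with MAX = 255 for 8-bit images. A minimal sketch (the toy pixel values are made up; only the formula itself is the standard definition):

```python
# Standard PSNR between two equal-sized images, in dB.
import math

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio: 10 * log10(peak^2 / MSE)."""
    flat_ref = [p for row in ref for p in row]
    flat_test = [p for row in test for p in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_test)) / len(flat_ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

ref = [[100, 110], [120, 130]]
approx = [[101, 109], [121, 131]]  # every pixel off by 1, so MSE = 1
# psnr(ref, approx) = 10 * log10(255^2), roughly 48 dB; higher is better
```

Since MSE sits in the denominator, minimizing the MSE loss of Eq. (4) directly maximizes PSNR, which is why the text says MSE training "favors a high PSNR".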
(Figure 10: average test PSNR (dB) vs. number of backprops (× 10^8) on Set5: SRCNN; A+ - 32.59 dB; KK - 32.28 dB; ANR - 31.92 dB; NE+LLE - 31.84 dB; SC - 31.42 dB; Bicubic - 30.39 dB.) Fig. 10. The test convergence curve of SRCNN and results of other methods on the Set5 dataset.

in all experiments⁹. Note that our SRCNN results are based on the checkpoint of 8 × 10^8 backpropagations. Specifically, for the upscaling factor 3, the average gains on PSNR achieved by SRCNN are 0.15 dB, 0.17 dB, and 0.13 dB higher than the next best approach, A+ [42], on the three datasets. When we take a look at the other evaluation metrics, we observe that SC, to our surprise, gets even lower scores than bicubic interpolation on IFC and NQM. It is clear that the results of SC are more visually pleasing than those of bicubic interpolation. This indicates that these two metrics may not truthfully reveal the image quality. Thus, regardless of these two metrics, SRCNN achieves the best performance among all methods and scaling factors.

It is worth pointing out that SRCNN surpasses the bicubic baseline at the very beginning of the learning stage (see Figure 1), and with moderate training, SRCNN outperforms existing state-of-the-art methods (see Figure 4). Yet, the performance is far from convergence. We conjecture that better results can be obtained given longer training time (see Figure 10).

Figures 14, 15 and 16 show the super-resolution results of different approaches by an upscaling factor 3. As can be observed, the SRCNN produces much sharper edges than the other approaches, without any obvious artifacts across the image.

In addition, we compare with another recent deep learning method for image super-resolution (DNC) of Cui et al. [5]. As they employ a different blur kernel (a Gaussian filter with a standard deviation of 0.55), we train a specific network (9-5-5) using the same blur kernel as DNC for a fair quantitative comparison. The upscaling factor is 3 and the training set is the 91-image dataset. From the convergence curve shown in Figure 11, we observe that our SRCNN surpasses DNC with just 2.7 × 10^7 backprops, and a larger margin can be obtained given longer training time. This also demonstrates that the end-to-end learning is superior to DNC, even if that model is already "deep".

(Figure 11: average test PSNR (dB) vs. number of backprops (× 10^7): SRCNN (9-5-5, trained on 91 images); DNC (32.08 dB); Bicubic (30.29 dB).) Fig. 11. The test convergence curve of SRCNN and the result of DNC on the Set5 dataset.

4.4.2 Running time

Figure 12 shows the running time comparisons of several state-of-the-art methods, along with their restoration performance on Set14. All baseline methods are obtained from the corresponding authors' MATLAB+MEX implementations, whereas ours is in pure C++. We profile the running time of all the algorithms using the same machine (Intel CPU 3.10 GHz and 16 GB memory). Note that the processing time of our approach is highly linear in the test image resolution, since all images go through the same number of convolutions. Our method is always a trade-off between performance and speed. To show this, we train three networks for comparison: 9-1-5, 9-3-5, and 9-5-5. It is clear that the 9-1-5 network is the fastest, while it still achieves better performance than the next state-of-the-art, A+. Other methods are several times or even orders of magnitude slower in comparison to the 9-1-5 network. Note that the speed gap is not mainly caused by the different MATLAB/C++ implementations; rather, the other methods need to solve complex optimization problems at usage (e.g., sparse coding or embedding), whereas our method is completely feed-forward. The 9-5-5 network achieves the best performance, but at the cost of running time. The test-time speed of our CNN can be further accelerated in many ways, e.g., by approximating or simplifying the trained networks [10], [21], [31], with possible slight degradation in performance.

4.5 Experiments on Color Channels

In previous experiments, we follow the conventional approach to super-resolve color images. Specifically, we first transform the color images into the YCbCr space. The SR algorithms are only applied on the Y channel, while the Cb, Cr channels are upscaled by bicubic interpolation. It is interesting to find out if super-resolution performance can be improved if we jointly consider all three channels in the process.

Our method is flexible enough to accept more channels without altering the learning mechanism and network design. In particular, it can readily deal with three channels simultaneously by setting the input channels to c = 3. In the following experiments, we explore different training strategies for color image super-resolution, and subsequently evaluate their performance on different channels.

Implementation details. Training is performed on the 91-image dataset, and testing is conducted on Set5 [2]. The network settings are: c = 3, f1 = 9, f2 = 1, f3 = 5, n1 = 64, and n2 = 32. As we have proved the

9. The PSNR value of each image can be found in the supplementary file.
TABLE 2
The average results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the Set5 dataset.

Eval. Metric | Scale | Bicubic | SC [50] | NE+LLE [4] | KK [25] | ANR [41] | A+ [42] | SRCNN
PSNR   | 2 | 33.66  | -      | 35.77  | 36.20  | 35.83  | 36.54  | 36.66
PSNR   | 3 | 30.39  | 31.42  | 31.84  | 32.28  | 31.92  | 32.59  | 32.75
PSNR   | 4 | 28.42  | -      | 29.61  | 30.03  | 29.69  | 30.28  | 30.49
SSIM   | 2 | 0.9299 | -      | 0.9490 | 0.9511 | 0.9499 | 0.9544 | 0.9542
SSIM   | 3 | 0.8682 | 0.8821 | 0.8956 | 0.9033 | 0.8968 | 0.9088 | 0.9090
SSIM   | 4 | 0.8104 | -      | 0.8402 | 0.8541 | 0.8419 | 0.8603 | 0.8628
IFC    | 2 | 6.10   | -      | 7.84   | 6.87   | 8.09   | 8.48   | 8.05
IFC    | 3 | 3.52   | 3.16   | 4.40   | 4.14   | 4.52   | 4.84   | 4.58
IFC    | 4 | 2.35   | -      | 2.94   | 2.81   | 3.02   | 3.26   | 3.01
NQM    | 2 | 36.73  | -      | 42.90  | 39.49  | 43.28  | 44.58  | 41.13
NQM    | 3 | 27.54  | 27.29  | 32.77  | 32.10  | 33.10  | 34.48  | 33.21
NQM    | 4 | 21.42  | -      | 25.56  | 24.99  | 25.72  | 26.97  | 25.96
WPSNR  | 2 | 50.06  | -      | 58.45  | 57.15  | 58.61  | 60.06  | 59.49
WPSNR  | 3 | 41.65  | 43.64  | 45.81  | 46.22  | 46.02  | 47.17  | 47.10
WPSNR  | 4 | 37.21  | -      | 39.85  | 40.40  | 40.01  | 41.03  | 41.13
MSSSIM | 2 | 0.9915 | -      | 0.9953 | 0.9953 | 0.9954 | 0.9960 | 0.9959
MSSSIM | 3 | 0.9754 | 0.9797 | 0.9841 | 0.9853 | 0.9844 | 0.9867 | 0.9866
MSSSIM | 4 | 0.9516 | -      | 0.9666 | 0.9695 | 0.9672 | 0.9720 | 0.9725
TABLE 3
The average results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the Set14 dataset.

Eval. Metric | Scale | Bicubic | SC [50] | NE+LLE [4] | KK [25] | ANR [41] | A+ [42] | SRCNN
PSNR   | 2 | 30.23  | -      | 31.76  | 32.11  | 31.80  | 32.28  | 32.45
PSNR   | 3 | 27.54  | 28.31  | 28.60  | 28.94  | 28.65  | 29.13  | 29.30
PSNR   | 4 | 26.00  | -      | 26.81  | 27.14  | 26.85  | 27.32  | 27.50
SSIM   | 2 | 0.8687 | -      | 0.8993 | 0.9026 | 0.9004 | 0.9056 | 0.9067
SSIM   | 3 | 0.7736 | 0.7954 | 0.8076 | 0.8132 | 0.8093 | 0.8188 | 0.8215
SSIM   | 4 | 0.7019 | -      | 0.7331 | 0.7419 | 0.7352 | 0.7491 | 0.7513
IFC    | 2 | 6.09   | -      | 7.59   | 6.83   | 7.81   | 8.11   | 7.76
IFC    | 3 | 3.41   | 2.98   | 4.14   | 3.83   | 4.23   | 4.45   | 4.26
IFC    | 4 | 2.23   | -      | 2.71   | 2.57   | 2.78   | 2.94   | 2.74
NQM    | 2 | 40.98  | -      | 41.34  | 38.86  | 41.79  | 42.61  | 38.95
NQM    | 3 | 33.15  | 29.06  | 37.12  | 35.23  | 37.22  | 38.24  | 35.25
NQM    | 4 | 26.15  | -      | 31.17  | 29.18  | 31.27  | 32.31  | 30.46
WPSNR  | 2 | 47.64  | -      | 54.47  | 53.85  | 54.57  | 55.62  | 55.39
WPSNR  | 3 | 39.72  | 41.66  | 43.22  | 43.56  | 43.36  | 44.25  | 44.32
WPSNR  | 4 | 35.71  | -      | 37.75  | 38.26  | 37.85  | 38.72  | 38.87
MSSSIM | 2 | 0.9813 | -      | 0.9886 | 0.9890 | 0.9888 | 0.9896 | 0.9897
MSSSIM | 3 | 0.9512 | 0.9595 | 0.9643 | 0.9653 | 0.9647 | 0.9669 | 0.9675
MSSSIM | 4 | 0.9134 | -      | 0.9317 | 0.9338 | 0.9326 | 0.9371 | 0.9376
TABLE 4
The average results of PSNR (dB), SSIM, IFC, NQM, WPSNR (dB) and MSSSIM on the BSD200 dataset.

Eval. Metric | Scale | Bicubic | SC [50] | NE+LLE [4] | KK [25] | ANR [41] | A+ [42] | SRCNN
PSNR   | 2 | 28.38  | -      | 29.67  | 30.02  | 29.72  | 30.14  | 30.29
PSNR   | 3 | 25.94  | 26.54  | 26.67  | 26.89  | 26.72  | 27.05  | 27.18
PSNR   | 4 | 24.65  | -      | 25.21  | 25.38  | 25.25  | 25.51  | 25.60
SSIM   | 2 | 0.8524 | -      | 0.8886 | 0.8935 | 0.8900 | 0.8966 | 0.8977
SSIM   | 3 | 0.7469 | 0.7729 | 0.7823 | 0.7881 | 0.7843 | 0.7945 | 0.7971
SSIM   | 4 | 0.6727 | -      | 0.7037 | 0.7093 | 0.7060 | 0.7171 | 0.7184
IFC    | 2 | 5.30   | -      | 7.10   | 6.33   | 7.28   | 7.51   | 7.21
IFC    | 3 | 3.05   | 2.77   | 3.82   | 3.52   | 3.91   | 4.07   | 3.91
IFC    | 4 | 1.95   | -      | 2.45   | 2.24   | 2.51   | 2.62   | 2.45
NQM    | 2 | 36.84  | -      | 41.52  | 38.54  | 41.72  | 42.37  | 39.66
NQM    | 3 | 28.45  | 28.22  | 34.65  | 33.45  | 34.81  | 35.58  | 34.72
NQM    | 4 | 21.72  | -      | 25.15  | 24.87  | 25.27  | 26.01  | 25.65
WPSNR  | 2 | 46.15  | -      | 52.56  | 52.21  | 52.69  | 53.56  | 53.58
WPSNR  | 3 | 38.60  | 40.48  | 41.39  | 41.62  | 41.53  | 42.19  | 42.29
WPSNR  | 4 | 34.86  | -      | 36.52  | 36.80  | 36.64  | 37.18  | 37.24
MSSSIM | 2 | 0.9780 | -      | 0.9869 | 0.9876 | 0.9872 | 0.9883 | 0.9883
MSSSIM | 3 | 0.9426 | 0.9533 | 0.9575 | 0.9588 | 0.9581 | 0.9609 | 0.9614
MSSSIM | 4 | 0.9005 | -      | 0.9203 | 0.9215 | 0.9214 | 0.9256 | 0.9261
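The tables above report PSNR on the luminance (Y) channel of the YCbCr space, as stated at the start of the experiments. For reference, a sketch of the standard full-range BT.601 RGB-to-YCbCr conversion commonly used for this purpose; the coefficients are the usual JPEG/JFIF ones, not something specified in this paper:

```python
# Standard full-range BT.601 RGB <-> YCbCr conversion for 8-bit values.
# Only the Y (luminance) channel is super-resolved in Sections 4.1-4.4;
# Cb and Cr are handled by bicubic interpolation.

def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 RGB -> YCbCr."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse of the transform above."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return r, g, b

# a pure gray pixel maps to (Y, 128, 128): its chrominance is neutral
y, cb, cr = rgb_to_ycbcr(64, 128, 192)
```

The round trip RGB to YCbCr and back is an identity up to small floating-point error, so measuring on Y alone discards only chrominance information, which (per Section 4.5) is visually blurrier and contributes little to the sharpness being evaluated.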
(Figure 12: PSNR (dB) vs. running time (sec, log scale; slower on the left, faster on the right) on Set14: SRCNN (9-5-5), SRCNN (9-3-5), SRCNN (9-1-5), A+, KK, ANR, NE+LLE, SC.) Fig. 12. The proposed SRCNN achieves state-of-the-art super-resolution quality, whilst maintaining high and competitive speed in comparison to existing external example-based methods. The chart is based on the Set14 results summarized in Table 3. The implementations of all three SRCNN networks are available on our project page.

TABLE 5
Average PSNR (dB) of different channels and training strategies on the Set5 dataset.

Training Strategy | Y     | Cb    | Cr    | RGB color image
Bicubic           | 30.39 | 45.44 | 45.42 | 34.57
Y only            | 32.39 | 45.44 | 45.42 | 36.37
YCbCr             | 29.25 | 43.30 | 43.49 | 33.47
Y pre-train       | 32.19 | 46.49 | 46.45 | 36.32
CbCr pre-train    | 32.14 | 46.38 | 45.84 | 36.25
RGB               | 32.33 | 46.18 | 46.20 | 36.44
KK                | 32.37 | 44.35 | 44.22 | 36.32

effectiveness of SRCNN on different scales, here we only evaluate the performance of upscaling factor 3.

Comparisons. We compare our method with the state-of-the-art color SR method KK [25]. We also try different learning strategies for comparison:
• Y only: this is our baseline method, which is a single-channel (c = 1) network trained only on the luminance channel. The Cb, Cr channels are upscaled using bicubic interpolation.
• YCbCr: training is performed on the three channels of the YCbCr space.
• Y pre-train: first, to guarantee the performance on the Y channel, we only use the MSE of the Y channel as the loss to pre-train the network. Then we employ the MSE of all channels to fine-tune the parameters.
• CbCr pre-train: we use the MSE of the Cb, Cr channels as the loss to pre-train the network, then fine-tune the parameters on all channels.
• RGB: training is performed on the three channels of the RGB space.

The results are shown in Table 5, where we have the following observations. (i) If we directly train on the YCbCr channels, the results are even worse than those of bicubic interpolation. The training falls into a bad local minimum, due to the inherently different characteristics of the Y and Cb, Cr channels. (ii) If we pre-train on the Y or Cb, Cr channels, the performance finally improves, but is still not better than "Y only" on the color image (see the last column of Table 5, where PSNR is computed in RGB color space). This suggests that the Cb, Cr channels could decrease the performance of the Y channel when training is performed in a unified network. (iii) We observe that the Cb, Cr channels have higher PSNR values for "Y pre-train" than for "CbCr pre-train". The reason lies in the differences between the Cb, Cr channels and the Y channel. Visually, the Cb, Cr channels are more blurry than the Y channel, and thus are less affected by the downsampling process. When we pre-train on the Cb, Cr channels, only a few filters are activated, and the training then soon falls into a bad local minimum during fine-tuning. On the other hand, if we pre-train on the Y channel, more filters will be activated, and the performance on the Cb, Cr channels will be pushed much higher. Figure 13 shows the Cb, Cr channels of the first-layer filters with "Y pre-train", whose patterns largely differ from those shown in Figure 5. (iv) Training on the RGB channels achieves the best result on the color image. Different from the YCbCr channels, the RGB channels exhibit high cross-correlation among each other. The proposed SRCNN is capable of leveraging such natural correspondences between the channels for reconstruction. Therefore, the model achieves a comparable result on the Y channel as "Y only", and better results on the Cb, Cr channels than bicubic interpolation. (v) In KK [25], super-resolution is applied on each RGB channel separately. When we transform its results to the YCbCr space, the PSNR value of the Y channel is similar to "Y only", but those of the Cb, Cr channels are poorer than bicubic interpolation. The result suggests that the algorithm is biased to the Y channel. On the whole, our method trained on RGB channels achieves better performance than KK and the single-channel network ("Y only"). It is also worth noting that the improvement compared with the single-channel network is not that significant (i.e., 0.07 dB). This indicates that the Cb, Cr channels barely help in improving the performance.

(Figure 13: (a) first-layer filters, Cb channel; (b) first-layer filters, Cr channel.) Fig. 13. Chrominance channels of the first-layer filters using the "Y pre-train" strategy.

5 CONCLUSION

We have presented a novel deep learning approach for single image super-resolution (SR). We show that conventional sparse-coding-based SR methods can be
reformulated into a deep convolutional neural network. The proposed approach, SRCNN, learns an end-to-end mapping between low- and high-resolution images, with little extra pre/post-processing beyond the optimization. With a lightweight structure, the SRCNN has achieved performance superior to the state-of-the-art methods. We conjecture that additional performance can be gained by exploring more filters and different training strategies. Besides, the proposed structure, with its advantages of simplicity and robustness, could be applied to other low-level vision problems, such as image deblurring or simultaneous SR+denoising. One could also investigate a network to cope with different upscaling factors.
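The end-to-end mapping described above is compact enough to sketch directly. Below is a minimal NumPy forward pass through SRCNN's three stages (patch extraction, non-linear mapping, reconstruction) in the common 9-1-5 setting with 64 and 32 filters; the helper names and the random initialization are illustrative only, since in practice the filters are learned by minimizing MSE against high-resolution ground truth:

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2-D convolution (CNN sense) of an (H, W, C_in) image
    with (k, k, C_in, C_out) filters and a (C_out,) bias."""
    k = w.shape[0]
    H, W, _ = x.shape
    out = np.empty((H - k + 1, W - k + 1, w.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Contract the k x k x C_in patch against every output filter.
            out[i, j, :] = np.tensordot(x[i:i + k, j:j + k, :], w, axes=3) + b
    return out

def srcnn_forward(y, params):
    """Three-stage SRCNN mapping applied to an upscaled low-resolution image y."""
    w1, b1, w2, b2, w3, b3 = params
    f1 = np.maximum(conv2d(y, w1, b1), 0)   # 9x9 conv + ReLU: patch extraction
    f2 = np.maximum(conv2d(f1, w2, b2), 0)  # 1x1 conv + ReLU: non-linear mapping
    return conv2d(f2, w3, b3)               # 5x5 conv: reconstruction

def init_params(rng, c=1, n1=64, n2=32):
    """Small random filters in the 9-1-5 configuration (illustrative, untrained)."""
    return (rng.normal(0, 1e-3, (9, 9, c, n1)), np.zeros(n1),
            rng.normal(0, 1e-3, (1, 1, n1, n2)), np.zeros(n2),
            rng.normal(0, 1e-3, (5, 5, n2, c)), np.zeros(c))
```

With trained filters, `srcnn_forward` would map an interpolated low-resolution image to its restored version; note that the valid convolutions shrink each spatial dimension by 12 pixels.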
Xiaoou Tang (S'93-M'96-SM'02-F'09) received the BS degree from the University of Science and Technology of China, Hefei, in 1990, the MS degree from the University of Rochester, New York, in 1991, and the PhD degree from the Massachusetts Institute of Technology, Cambridge, in 1996. He is a professor in the Department of Information Engineering and an associate dean (Research) of the Faculty of Engineering of the Chinese University of Hong Kong. He worked as the group manager of the Visual Computing Group at Microsoft Research Asia from 2005 to 2008. His research interests include computer vision, pattern recognition, and video processing. He received the Best Paper Award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He was a program chair of the IEEE International Conference on Computer Vision (ICCV) 2009 and he is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is a fellow of the IEEE.
¹University of Maryland, College Park    ²University of Texas at Austin    ³Google, Inc.
Abstract

Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.6% vs. 88.0%) and without additional optical flow information (82.6% vs. 73.0%).

Figure 1: Overview of our approach.

1. Introduction

Convolutional Neural Networks have proven highly successful at static image recognition problems such as the MNIST, CIFAR, and ImageNet Large-Scale Visual Recognition Challenge [15, 21, 28]. By using a hierarchy of trainable filters and feature pooling operations, CNNs are capable of automatically learning complex features required for visual object recognition tasks, achieving performance superior to hand-crafted features. Encouraged by these positive results, several approaches have been proposed recently to apply CNNs to video and action classification tasks [2, 13, 14, 19].

Video analysis provides more information to the recognition task by adding a temporal component through which motion and other information can be additionally used. At the same time, the task is much more computationally demanding even for processing short video clips, since each video might contain hundreds to thousands of frames, not all of which are useful. A naïve approach would be to treat video frames as still images, apply CNNs to recognize each frame, and average the predictions at the video level. However, since each individual video frame forms only a small part of the video's story, such an approach would be using incomplete information and could therefore easily confuse classes, especially if there are fine-grained distinctions or portions of the video irrelevant to the action of interest.

Therefore, we hypothesize that learning a global description of the video's temporal evolution is important for accurate video classification. This is challenging from a modeling perspective as we have to model variable length videos with a fixed number of parameters. We evaluate two approaches capable of meeting this requirement: feature-pooling and recurrent neural networks. The feature pooling networks independently process each frame using a
CNN and then combine frame-level information using various pooling layers. The recurrent neural network architecture we employ is derived from Long Short Term Memory (LSTM) [11] units, and uses memory cells to store, modify, and access internal state, allowing it to discover long-range temporal relationships. Like feature-pooling, LSTM networks operate on frame-level CNN activations, and can learn how to integrate information over time. By sharing parameters through time, both architectures are able to maintain a constant number of parameters while capturing a global description of the video's temporal evolution.

Since we are addressing the problem of video classification, it is natural to attempt to take advantage of motion information in order to have a better performing network. Previous work [14] has attempted to address this issue by using frame stacks as input. However, this type of approach is computationally intensive since it involves thousands of 3D convolutional filters applied over the input volumes. The performance gained by applying such a method is below 2% on the Sports-1M benchmarks [14]. As a result, in this work, we avoid implicit motion feature computation.

In order to learn a global description of the video while maintaining a low computational footprint, we propose processing only one frame per second. At this frame rate, implicit motion information is lost. To compensate, following [19] we incorporate explicit motion information in the form of optical flow images computed over adjacent frames. Thus optical flow allows us to retain the benefits of motion information (typically achieved through high-fps sampling) while still capturing global video information. Our contributions can be summarized as follows:

1. We propose CNN architectures for obtaining global video-level descriptors and demonstrate that using increasing numbers of frames significantly improves classification performance.
2. By sharing parameters through time, the number of parameters remains constant as a function of video length in both the feature pooling and LSTM architectures.
3. We confirm that optical flow images can greatly benefit video classification and present results showing that even if the optical flow images themselves are very noisy (as is the case with the Sports-1M dataset), they can still provide a benefit when coupled with LSTMs.

Leveraging these three principles, we achieve state-of-the-art performance on two different video classification tasks: Sports-1M (Section 4.1) and UCF-101 (Section 4.2).

2. Related Work

Traditional video recognition research has been extremely successful at obtaining global video descriptors that encode both appearance and motion information in order to provide state-of-the-art results on a large number of video datasets. These approaches aggregate local appearance and motion information using hand-crafted features such as Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), or Motion Boundary Histogram (MBH) around spatio-temporal interest points [17], in a dense grid [24], or around dense point trajectories [12, 16, 22, 23] obtained through optical flow based tracking. These features are then encoded in order to produce a global video-level descriptor through bag of words (BoW) [17] or Fisher vector based encodings [23].

However, no previous attempts at CNN-based video recognition use both motion information and a global description of the video. Several approaches [2, 13, 14] employ 3D-convolution over short video clips (typically just a few seconds) to learn motion features from raw frames implicitly and then aggregate predictions at the video level. Karpathy et al. [14] demonstrate that their network is just marginally better than a single-frame baseline, which indicates that learning motion features is difficult. In view of this, Simonyan et al. [19] directly incorporate motion information from optical flow, but only sample up to 10 consecutive frames at inference time. The disadvantage of such local approaches is that each frame/clip may contain only a small part of the full video's information, resulting in a network that performs no better than the naïve approach of classifying individual frames.

Instead of trying to learn spatio-temporal features over small time periods, we consider several different ways to aggregate strong CNN image features over long periods of a video (tens of seconds), including feature pooling and recurrent neural networks. Standard recurrent networks have trouble learning over long sequences due to the problem of vanishing and exploding gradients [3]. In contrast, the Long Short Term Memory (LSTM) [11] uses memory cells to store, modify, and access internal state, allowing it to better discover long-range temporal relationships. For this reason, LSTMs yield state-of-the-art results in handwriting recognition [8, 10], speech recognition [9, 7], phoneme detection [5], emotion detection [25], segmentation of meetings and events [18], and evaluating programs [27]. While LSTMs have been applied to action classification in [1], the model is learned on top of SIFT features and a BoW representation. In addition, our proposed models allow joint fine-tuning of the convolutional and recurrent parts of the network, which is not possible when using hand-crafted features, as proposed in prior work. Baccouche et al. [1] learn globally using Long Short-Term Memory (LSTM) networks on the output of 3D-convolution applied to 9-frame video clips, but incorporate no explicit motion information.

3. Approach

Two CNN architectures are used to process individual video frames: AlexNet and GoogLeNet. AlexNet is a Krizhevsky-style CNN [15] which takes a 220 × 220 frame as input. This frame is then processed by square convolutional layers of size 11, 9, and 5, each followed by max-pooling and local contrast normalization. Finally, outputs are fed to two fully-connected layers, each with 4096 rectified linear units (ReLU). Dropout is applied to each fully-connected layer with a ratio of 0.6 (keeping and scaling 40% of the original outputs).

GoogLeNet [21] uses a network-in-network approach, stacking Inception modules to form a network 22 layers deep that is substantially different from previous CNNs [15, 28]. Like AlexNet, GoogLeNet takes a single image of size 220 × 220 as input. This image is then passed through multiple Inception modules, each of which applies, in parallel, 1×1, 3×3, and 5×5 convolution and max-pooling operations and concatenates the resulting filters. Finally, the activations are average-pooled and output as a 1000-dimensional vector.

In the following sections, we investigate two classes of CNN architectures capable of aggregating video-level information. In the first section, we investigate various feature pooling architectures that are agnostic to temporal order, and in the following section we investigate LSTM networks, which are capable of learning from temporally ordered sequences. In order to make learning computationally feasible, in all methods the CNNs share parameters across frames.

3.1. Feature Pooling Architectures

Temporal feature pooling has been extensively used for video classification [17, 24, 12], and has usually been applied to bag-of-words representations. Typically, image-based or motion features are computed at every frame, quantized, then pooled across time. The resulting vector can be used for making video-level predictions. We follow a similar line of reasoning, except that because we work with neural networks, the pooling operation can be incorporated directly as a layer. This allows us to experiment with the location of the temporal pooling layer with respect to the network architecture.

We analyze several variations depending on the specific pooling method and the particular layer whose features are aggregated. The pooling operation need not be limited to max-pooling. We considered using both average pooling and max-pooling, which have several desirable properties as shown in [4]. In addition, we attempted to employ a fully connected layer as a "pooling layer". However, we found that both average pooling and a fully connected layer for pooling failed to learn effectively due to the large number of gradients that they generate. Max-pooling generates much sparser updates, and as a result tends to yield networks that learn faster, since the gradient update is generated by a sparse set of features from each frame. Therefore, in the rest of the paper we use max-pooling as the main feature aggregation technique.

Unlike traditional bag of words approaches, gradients coming from the top layers help learn useful features from image pixels, while allowing the network to choose which of the input frames are affected by these updates. When used with max-pooling, this is reminiscent of multiple instance learning, where the learner knows that at least one of the inputs is relevant to the target class.

We experimented with several variations of the basic max-pooling architecture, as shown in Figure 2: (a) Conv Pooling, (b) Late Pooling, (c) Slow Pooling, (d) Local Pooling, and (e) Time-Domain Convolution.

Figure 2: Different Feature-Pooling Architectures. The stacked convolutional layers are denoted by "C". Blue, green, yellow and orange rectangles represent max-pooling, time-domain convolutional, fully-connected and softmax layers respectively.

Conv Pooling: The Conv Pooling model performs max-pooling over the final convolutional layer across the video's frames. A key advantage of this network is that the spatial information in the output of the convolutional layer is preserved through a max operation over the time domain.

Late Pooling: The Late Pooling model first passes convolutional features through two fully connected layers before applying the max-pooling layer. The weights of all convolutional layers and fully connected layers are shared. Compared to Conv Pooling, Late Pooling directly combines high-level information across frames.

Slow Pooling: Slow Pooling hierarchically combines frame-level information from smaller temporal windows. Slow Pooling uses a two-stage pooling strategy: max-pooling is first applied over 10 frames of convolutional features with stride 5 (e.g. max-pooling may be thought of as a size-10 filter being convolved over a 1-D input with stride 5). Each max-pooling layer is then followed by a fully-connected layer with shared weights. In the second stage, a single max-pooling layer combines the outputs of all fully-connected layers. In this manner, the Slow Pooling network groups temporally local features before combining high-level information from many frames.
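The pooling variants above differ mainly in where and how the max over time is taken. Assuming the per-frame convolutional activations are already available as an array, Conv Pooling and the first stage of Slow Pooling can be sketched as follows (the function names are ours, not the paper's):

```python
import numpy as np

def conv_pooling(frame_features):
    """Conv Pooling: element-wise max over the time axis of per-frame
    convolutional feature maps, preserving their spatial layout.

    frame_features: array of shape (T, H, W, C) holding the final
    convolutional layer's activations for T frames.
    Returns a single (H, W, C) video-level feature map.
    """
    return frame_features.max(axis=0)

def slow_pooling_stage(frame_features, window=10, stride=5):
    """First stage of Slow Pooling: max over sliding 10-frame windows
    with stride 5, like a size-10 filter convolved over a 1-D input
    with stride 5. Returns one pooled map per window."""
    T = frame_features.shape[0]
    starts = range(0, T - window + 1, stride)
    return np.stack([frame_features[s:s + window].max(axis=0) for s in starts])
```

For a long clip, `conv_pooling` collapses the time axis in one step, while `slow_pooling_stage` first produces one pooled map per temporal window before a second max (after the shared fully-connected layers) combines them.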
Local Pooling: Similar to Slow Pooling, the Local Pooling model combines frame-level features locally after the last convolutional layer. Unlike Slow Pooling, Local Pooling contains only a single stage of max-pooling after the convolutional layers. This is followed by two fully connected layers with shared parameters. Finally, a larger softmax layer is connected to all towers. By eliminating the second max-pooling layer, the Local Pooling network avoids a potential loss of temporal information.

Time-Domain Convolution: The Time-Domain Convolution model contains an extra time-domain convolutional layer before feature pooling across frames. Max-pooling is performed on the temporal domain after the time-domain convolutional layer. The convolutional layer consists of 256 kernels of size 3 × 3 across 10 frames with frame stride 5. This model aims at capturing local relationships between frames within a small temporal window.

GoogLeNet Conv Pooling: We experimented with an architecture based on GoogLeNet [21], in which the max-pooling operation is performed after the dimensionality reduction (average pooling) layer in GoogLeNet. This is the layer which in the original architecture was directly connected to the softmax layer. We enhanced this architecture by adding two fully connected layers of size 4096 with ReLU activations on top of the 1000D output but before the softmax. Similar to the AlexNet-based models, the weights of the convolutional layers and Inception modules are shared across time.

3.2. LSTM Architecture

In contrast to max-pooling, which produces representations that are order invariant, we propose using a recurrent neural network to explicitly consider sequences of CNN activations. Since videos contain dynamic content, the variations between frames may encode additional information which could be useful in making more accurate predictions.

Given an input sequence x = (x_1, ..., x_T), a standard recurrent neural network computes the hidden vector sequence h = (h_1, ..., h_T) and output vector sequence y = (y_1, ..., y_T) by iterating the following equations from t = 1 to T:

h_t = H(W_ih x_t + W_hh h_{t-1} + b_h)    (1)
y_t = W_ho h_t + b_o    (2)

where the W terms denote weight matrices (e.g. W_ih is the input-hidden weight matrix), the b terms denote bias vectors (e.g. b_h is the hidden bias vector), and H is the hidden layer activation function, typically the logistic sigmoid function.

Unlike standard RNNs, the Long Short Term Memory (LSTM) architecture [6] uses memory cells (Figure 3) to store and output information, allowing it to better discover long-range temporal relationships. The hidden layer function H of the LSTM is computed as follows:

i_t = σ(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)    (3)
f_t = σ(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)    (4)
c_t = f_t c_{t-1} + i_t tanh(W_xc x_t + W_hc h_{t-1} + b_c)    (5)
o_t = σ(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)    (6)
h_t = o_t tanh(c_t)    (7)

where σ is the logistic sigmoid function, and i, f, o, and c are respectively the input gate, forget gate, output gate, and cell activation vectors. By default, the value stored in the LSTM cell c is maintained unless it is added to by the input gate i or diminished by the forget gate f. The output gate o controls the emission of the memory value from the LSTM cell.

Figure 3: Each LSTM cell remembers a single floating point value c_t (Eq. 5). This value may be diminished or erased through a multiplicative interaction with the forget gate f_t (Eq. 4) or additively modified by the current input x_t multiplied by the activation of the input gate i_t (Eq. 3). The output gate o_t controls the emission of h_t, the stored memory c_t transformed by the hyperbolic tangent nonlinearity (Eq. 6, 7). Image duplicated with permission from Alex Graves.

We use a deep LSTM architecture [9] (Figure 4) in which the output from one LSTM layer is input for the next layer. We experimented with various numbers of layers and memory cells, and chose to use five stacked LSTM layers, each
with 512 memory cells. Following the LSTM layers, a Softmax classifier makes a prediction at every frame.

Figure 4: Deep Video LSTM takes as input the output from the final CNN layer at each consecutive video frame. CNN outputs are processed forward through time and upwards through five layers of stacked LSTMs. A softmax layer predicts the class at each time step. The parameters of the convolutional networks (pink) and softmax classifier (orange) are shared across time steps.

3.3. Training and Inference

The max-pooling models were optimized on a cluster using Downpour Stochastic Gradient Descent, starting with a learning rate of 10^-5 in conjunction with a momentum of 0.9 and weight decay of 0.0005. For LSTM, we used the same optimization method with a learning rate of N × 10^-5, where N is the number of frames. The learning rate was exponentially decayed over time. Each model had between ten and fifty replicas split across four partitions. To reduce CNN training time, the parameters of AlexNet and GoogLeNet were initialized from a pre-trained ImageNet model and then fine-tuned on Sports-1M videos.

Network Expansion for Max-Pooling Networks: Multi-frame models achieve higher accuracy at the cost of longer training times than single-frame models. Since pooling is performed after CNN towers that share weights, the parameters for a single-frame and a multi-frame max-pooling network are very similar. This makes it possible to expand a single-frame model to a multi-frame model. Max-pooling models are first initialized as single-frame networks, then expanded to 30 frames and again to 120 frames. While the feature distribution of the max-pooling layer could change dramatically as a result of expanding to a larger number of frames (particularly in the single-frame to 30-frame case), experiments show that transferring the parameters is nonetheless beneficial. By expanding small networks into larger ones and then fine-tuning, we achieve a significant speedup compared to training a large network from scratch.

LSTM Training: We followed the same procedure as for training the max-pooled networks, with two modifications. First, the video's label was backpropagated at each frame rather than once per clip. Second, a gain g was applied to the gradients backpropagated at each frame; g was linearly interpolated from 0 to 1 over frames t = 0...T. The gain had the desired effect of emphasizing the importance of correct prediction at later frames, in which the LSTM's internal state has captured more information. Compared empirically against setting g = 1 over all time steps or setting g = 1 only at the last time step T (g = 0 elsewhere), linearly interpolating g resulted in faster learning and higher accuracy. For the final results, during training the gradients are backpropagated through the convolutional layers for fine-tuning.

LSTM Inference: In order to combine LSTM frame-level predictions into a single video-level prediction, we tried several approaches: 1) returning the prediction at the last time step T, 2) max-pooling the predictions over time, 3) summing the predictions over time and returning the max, and 4) linearly weighting the predictions over time by g, then summing and returning the max. The accuracy of all four approaches differed by less than 1%, but weighted predictions usually resulted in the best performance, supporting the idea that the LSTM's hidden state becomes progressively more informed as a function of the number of frames it has seen.

3.4. Optical Flow

Optical flow is a crucial component of any video classification approach because it encodes the pattern of apparent motion of objects in a visual scene. Since our networks process video frames at 1 fps, they do not use any apparent motion information. Therefore, we additionally train both our temporal models on optical flow images and perform late fusion akin to the two-stream hypothesis proposed by [19].

Interestingly, we found that initializing from a model trained on raw image frames can help classify optical flow images by allowing faster convergence than when training from scratch. This is likely because features that describe raw frames, such as edges, also help in classifying optical flow images. This is related to the effectiveness of Motion Boundary Histogram (MBH), which is analogous to computing Histogram of Oriented Gradients (HOG) on optical flow images, in action recognition [23].

Optical flow is computed from two adjacent frames sampled at 15 fps using the approach of [26]. To utilize existing implementations and networks trained on raw frames, we store optical flow as images by thresholding the flow at [-40, 40] and rescaling its horizontal and vertical components to the [0, 255] range. The third dimension is set to zero when feeding the images to the network, so that it has no effect on learning and inference.

In our investigation, we treat optical flow in the same
fashion as image frames to learn global description of Method Clip Hit@1 Hit@1 Hit@5
videos using both feature pooling and LSTM networks. Conv Pooling 68.7 71.1 89.3
4. Results Late Pooling 65.1 67.5 87.2
We empirically evaluate the proposed architectures on Slow Pooling 67.1 69.7 88.4
the Sports-1M and UCF-101 datasets with the goals of Local Pooling 68.1 70.4 88.9
investigating the performance of the proposed architec- Time-Domain
64.2 67.2 87.2
tures, quantifying the effect of the number of frames and Convolution
frame rates on classification performance, and understand- Table 1: Conv-Pooling outperforms all other feature-
ing the importance of motion information through optical pooling architectures (Figure 2) on Sports-1M using a 120-
flow models. frame AlexNet model.
4.1. Sports-1M dataset

The Sports-1M dataset [14] consists of roughly 1.2 million YouTube sports videos annotated with 487 classes, and it is representative of videos in the wild. There are 1000-3000 videos per class, and approximately 5% of the videos are annotated with more than one class. Unfortunately, since the creation of the dataset, about 7% of the videos have been removed by users. We use the remaining 1.1 million videos for the experiments below.

Although Sports-1M is the largest publicly available video dataset, the annotations that it provides are at video level. No information is given about the location of the class of interest. Moreover, the videos in this dataset are unconstrained: the camera movements are not guaranteed to be well-behaved, so unlike UCF-101, where camera motion is constrained, the optical flow quality varies wildly between videos.

Data Extraction: The first 5 minutes of each video are sampled at a frame rate of 1 fps to obtain 300 frames per video. Frames are repeated from the start for videos that are shorter than 5 minutes. We learn feature pooling models that process up to 120 frames (2 minutes of video) in a single example.

Data Augmentation: Multiple examples per video are obtained by randomly selecting the position of the first frame and consistent random crops of each frame during both training and testing. It is necessary to ensure that the same transforms are applied to all frames for a given start/end point. We process all images in the chosen interval by first resizing them to 256 x 256 pixels, then randomly sampling a 220 x 220 region and randomly flipping the image horizontally with 50% probability. To obtain predictions for a video we randomly sample 240 examples as described above and average all predictions, unless noted otherwise. Since LSTM models trained on a fixed number of frames can generalize to any number of frames, we also report results of using LSTMs without data augmentation.

Video-Level Prediction: Given the nature of the methods presented in this paper, it is possible to make predictions for the entire video without needing to sample or aggregate (the networks are designed to work on an unbounded number of frames for prediction). However, for obtaining the highest possible classification rates, we observed that it is best to only do this if resource constrained (i.e., when it is only possible to do a single pass over the video for prediction). Otherwise, the data augmentation method proposed above yields between 3-5% improvements in Hit@1 on the Sports-1M dataset.

Evaluation: Following [14], we use Hit@k values, which indicate the fraction of test samples that contain at least one of the ground truth labels in the top k predictions. We provide both video-level and clip-level Hit@k values in order to compare with previous results, where clip hit is the hit on a single video clip (30-120 frames) and video hit is obtained by averaging over multiple clips.

Comparison of Feature-Pooling Architectures: Table 1 shows the results obtained using the different feature pooling architectures on the Sports-1M dataset when using a 120-frame AlexNet model. We find that max-pooling over the outputs of the last convolutional layer provides the best clip-level and video-level hit rates. Late Pooling, which max-pools after the fully connected layers, performs worse than all other methods, indicating that preserving the spatial information while performing the pooling operation across the time domain is important. Time-Domain Convolution gives inferior results compared to max-pooling models. This suggests that a single time-domain convolutional layer is not effective in learning temporal relations on high-level features, which motivates us to explore more sophisticated network architectures like the LSTM, which learns from temporal sequences.

Comparison of CNN Architectures: AlexNet and GoogLeNet single-frame CNNs (Section 3) were trained from scratch on single frames selected at random from Sports-1M videos. Results (Table 2) show that both CNNs outperform Karpathy et al.'s prior single-frame models [14] by a margin of 4.3-5.6%. The increased accuracy is likely due to advances in CNN architectures and sampling more frames per video when training (300 instead of 50). Comparing AlexNet to the more recent GoogLeNet yields a 1.9% increase in Hit@5 for the max-pooling architecture, and an increase of 4.8% for the LSTM. This is roughly comparable to a 4.5% decrease in top-5 error moving from the Krizhevsky-style CNNs that won ILSVRC-13
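The Hit@k metric defined in the Evaluation paragraph can be sketched directly (an illustrative implementation; the function name is ours):

```python
import numpy as np

def hit_at_k(scores, ground_truth, k=5):
    """Fraction of samples whose top-k predictions contain at least one
    ground-truth label.

    scores: (num_samples, num_classes) array of class scores.
    ground_truth: list of label sets, one per sample (Sports-1M videos
    may carry more than one label).
    """
    hits = 0
    for row, labels in zip(scores, ground_truth):
        topk = np.argsort(row)[::-1][:k]   # indices of the k highest scores
        if labels & {int(i) for i in topk}:
            hits += 1
    return hits / len(ground_truth)
```

Clip-level Hit@k scores each clip independently, while video-level Hit@k first averages the clip predictions of a video into a single score row.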
to GoogLeNet in ILSVRC-14. For the max-pool architecture, this smaller gap between architectures is likely caused by the increased number of noisy images in Sports-1M compared to ImageNet.

Fine Tuning: When initializing from a pre-trained network, it is not always clear whether fine-tuning should be performed. In our experiments, fine tuning was crucial in achieving high performance. For example, in Table 2 we show that an LSTM network paired with GoogLeNet, running on 30 frames of the video, achieves a Hit@1 rate of 67.5. However, the same network with fine tuning achieves 69.5 Hit@1. Note that these results do not use data augmentation and classify the entire 300 seconds of a video.

Effect of Number of Frames: Table 3 compares Conv-Pooling and LSTM models as a function of the number of frames aggregated. In terms of clip hit, the 120-frame model performs significantly better than the 30-frame model. Also, our best clip hit of 70.8 represents a 70% improvement over the Slow Fusion approach of [14], which uses clips of a few seconds in length. This confirms our initial hypothesis that we need to consider the entire video in order to benefit more thoroughly from its content.

Optical Flow: Table 4 shows the results of fusion with the optical flow model. The optical flow model on its own has a much lower accuracy (59.7%) than the image-based model (72.1%), which is to be expected given that the Sports dataset consists of YouTube videos which are usually of lower quality and more natural than hand-crafted datasets such as UCF-101. In the case of Conv Pooling networks the fusion with optical flow yields no significant improvement in accuracy. However, for LSTMs the optical flow model is able to improve the overall accuracy to 73.1%.

Overall Performance: Finally, we compare the results of our best models against the previous state of the art on the Sports-1M dataset at the time of submission. Table 5 reports the results of the best model from [14], which performs several layers of 3D convolutions on short video clips, against ours. The max-pool method shows an increase of 18.7% in video Hit@1, whereas the LSTM approach yields a relative increase of 20%. The difference between the max-pool and LSTM methods is explained by the fact that the LSTM model can use optical flow in a manner which lends itself to late model fusion, which was not possible for the max-pool model.

Method                      Hit@1  Hit@5
AlexNet single frame         63.6   84.7
GoogLeNet single frame       64.9   86.6
LSTM + AlexNet (fc)          62.7   83.6
LSTM + GoogLeNet (fc)        67.5   87.1
Conv pooling + AlexNet       70.4   89.0
Conv pooling + GoogLeNet     71.7   90.4

Table 2: GoogLeNet outperforms AlexNet alone and when paired with both Conv-Pooling and LSTM. Experiments performed on Sports-1M using 30-frame Conv-Pooling and LSTM models. Note that the (fc) models updated only the final layers while training and did not use data augmentation.

Method        Frames  Clip Hit@1  Hit@1  Hit@5
LSTM              30         N/A   72.1   90.4
Conv pooling      30        66.0   71.7   90.4
Conv pooling     120        70.8   72.3   90.8

Table 3: Effect of the number of frames in the model. Both LSTM and Conv-Pooling models use the GoogLeNet CNN.

Method                              Hit@1  Hit@5
LSTM on Optical Flow                 59.7   81.4
LSTM on Raw Frames                   72.1   90.6
LSTM on Raw Frames +
  LSTM on Optical Flow               73.1   90.5
30 frame Optical Flow                44.5   70.4
Conv Pooling on Raw Frames           71.7   90.4
Conv Pooling on Raw Frames +
  Conv Pooling on Optical Flow       71.8   90.4

Table 4: Optical flow is noisy on Sports-1M and, if used alone, results in lower performance than equivalent image models. However, if used in conjunction with raw image features, optical flow benefits the LSTM. Experiments performed on 30-frame models using GoogLeNet CNNs.

4.2. UCF-101 Dataset

UCF-101 [20] contains 13,320 videos with 101 action classes covering a broad set of activities such as sports, musical instruments, and human-object interaction. We follow the suggested evaluation protocol and report the average accuracy over the given three training and testing partitions. It is difficult to train a deep network with such a small amount of data. Therefore, we test how well our models that are trained on the Sports-1M dataset perform on UCF-101.

Comparison of Frame Rates: Since UCF-101 contains short videos, 10-15 seconds on average, it is possible to extract frames at higher frame rates such as 6 fps while still capturing context from the full video. We compare 30-frame models trained at different frame rates: 30 fps (1 second of video) and 6 fps (5 seconds). Table 6 shows that lowering the frame rate from 30 fps to 6 fps yields slightly better performance, since the model obtains more context from longer input clips. We observed no further improvements when decreasing the frame rate to 1 fps. Thus, as long as the network sees enough context from each video, the effects of lower frame rates are marginal. The LSTM model, on the other hand, can take full advantage of the fact that the videos can be processed at 30 frames per second.
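The video-level aggregation and late fusion used in the experiments can be sketched as score averaging. This is an illustrative reading of the text only: equal fusion weights are our assumption, as the paper does not state its exact fusion rule here.

```python
import numpy as np

def video_score(clip_scores):
    """Video-level score: average over the clip-level prediction rows."""
    return np.mean(clip_scores, axis=0)

def late_fusion(raw_frame_scores, flow_scores, w=0.5):
    """Late fusion of a raw-frame model and an optical-flow model by a
    weighted average of their video-level scores.

    w is a hypothetical fusion weight; the paper does not specify one.
    """
    return w * raw_frame_scores + (1 - w) * flow_scores
```

Because fusion happens on the final score vectors, the two streams can be trained and evaluated completely independently, which is what makes this scheme available to the LSTM models.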
Category            Method                                 Frames  Clip Hit@1  Hit@1  Hit@5
Prior Results [14]  Single Frame                                1        41.1   59.3   77.7
Prior Results [14]  Slow Fusion                                15        41.9   60.9   80.2
                    Conv Pooling (Image and Optical Flow)     120        70.8   72.4   90.8
                    LSTM (Image and Optical Flow)              30         N/A   73.1   90.5

Table 5: Leveraging global video-level descriptors, LSTM and Conv-Pooling achieve a 20% increase in Hit@1 compared to prior work on the Sports-1M dataset. Hit@1 and Hit@5 are computed at video level.
U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox
Computer Science Department and BIOSS Centre for Biological Signalling Studies,
University of Freiburg, Germany
ronneber@informatik.uni-freiburg.de
WWW home page: http://lmb.informatik.uni-freiburg.de/

arXiv:1505.04597v1 [cs.CV] 18 May 2015
1 Introduction
In the last two years, deep convolutional networks have outperformed the state of
the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks
have already existed for a long time [8], their success was limited due to the
size of the available training sets and the size of the considered networks. The
breakthrough by Krizhevsky et al. [7] was due to supervised training of a large
network with 8 layers and millions of parameters on the ImageNet dataset with
1 million training images. Since then, even larger and deeper networks have been
trained [12].
The typical use of convolutional networks is on classification tasks, where
the output to an image is a single class label. However, in many visual tasks,
especially in biomedical image processing, the desired output should include
localization, i.e., a class label is supposed to be assigned to each pixel. Moreover, thousands of training images are usually beyond reach in biomedical tasks.
Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict
the class label of each pixel by providing a local region (patch) around that pixel
[Figure 1 diagram: input image tile (572 x 572) to output segmentation map (388 x 388); operation legend: copy and crop, max pool 2x2, up-conv 2x2, conv 1x1.]
Fig. 1. U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue
box corresponds to a multi-channel feature map. The number of channels is denoted
on top of the box. The x-y-size is provided at the lower left edge of the box. White
boxes represent copied feature maps. The arrows denote the different operations.
as input. First, this network can localize. Second, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.
Obviously, the strategy in Ciresan et al. [1] has two drawbacks. First, it
is quite slow because the network must be run separately for each patch, and
there is a lot of redundancy due to overlapping patches. Secondly, there is a
trade-off between localization accuracy and the use of context. Larger patches
require more max-pooling layers that reduce the localization accuracy, while
small patches allow the network to see only little context. More recent approaches
[11,4] proposed a classifier output that takes into account the features from
multiple layers. Good localization and the use of context are possible at the
same time.
In this paper, we build upon a more elegant architecture, the so-called “fully
convolutional network” [9]. We modify and extend this architecture such that it
works with very few training images and yields more precise segmentations; see
Figure 1. The main idea in [9] is to supplement a usual contracting network by
successive layers, where pooling operators are replaced by upsampling operators.
Hence, these layers increase the resolution of the output. In order to localize, high
resolution features from the contracting path are combined with the upsampled
Fig. 2. Overlap-tile strategy for seamless segmentation of arbitrarily large images (here segmentation of neuronal structures in EM stacks). Prediction of the segmentation in the yellow area requires image data within the blue area as input. Missing input data is extrapolated by mirroring.
output. A successive convolution layer can then learn to assemble a more precise
output based on this information.
One important modification in our architecture is that in the upsampling
part we have also a large number of feature channels, which allow the network
to propagate context information to higher resolution layers. As a consequence,
the expansive path is more or less symmetric to the contracting path, and yields
a u-shaped architecture. The network does not have any fully connected layers
and only uses the valid part of each convolution, i.e., the segmentation map only
contains the pixels, for which the full context is available in the input image.
This strategy allows the seamless segmentation of arbitrarily large images by an
overlap-tile strategy (see Figure 2). To predict the pixels in the border region
of the image, the missing context is extrapolated by mirroring the input image.
This tiling strategy is important to apply the network to large images, since
otherwise the resolution would be limited by the GPU memory.
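The overlap-tile strategy can be sketched with reflect-padding: each output tile needs a larger input window, and missing context at the image border is extrapolated by mirroring. A minimal sketch with illustrative tile sizes (not the 388/572 values of Figure 1); the function name is ours:

```python
import numpy as np

def tiles_with_context(image, out, margin):
    """Yield (row, col, input_tile) for non-overlapping output tiles of
    size `out`, where each input tile additionally contains `margin`
    pixels of context on every side, mirrored at the image border."""
    padded = np.pad(image, margin, mode="reflect")  # mirror extrapolation
    h, w = image.shape
    for r in range(0, h, out):
        for c in range(0, w, out):
            yield r, c, padded[r:r + out + 2 * margin,
                               c:c + out + 2 * margin]
```

Each input tile is then pushed through the network separately, so arbitrarily large images can be segmented with a fixed GPU memory budget.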
As for our tasks there is very little training data available, we use excessive
data augmentation by applying elastic deformations to the available training im-
ages. This allows the network to learn invariance to such deformations, without
the need to see these transformations in the annotated image corpus. This is
particularly important in biomedical segmentation, since deformation used to
be the most common variation in tissue and realistic deformations can be simu-
lated efficiently. The value of data augmentation for learning invariance has been
shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.
Another challenge in many cell segmentation tasks is the separation of touch-
ing objects of the same class; see Figure 3. To this end, we propose the use of
a weighted loss, where the separating background labels between touching cells
obtain a large weight in the loss function.
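The weighted loss can be sketched as pixel-wise cross-entropy with a per-pixel weight map in which the background ridges between touching cells carry large weights. This is a sketch of the idea only; the paper's actual weight-map formula (based on distances to the nearest cells) is not reproduced here, and the function name is ours:

```python
import numpy as np

def weighted_cross_entropy(prob, labels, weight_map, eps=1e-12):
    """Pixel-wise weighted cross-entropy.

    prob: (H, W, K) per-pixel class probabilities (soft-max output).
    labels: (H, W) integer ground-truth class per pixel.
    weight_map: (H, W) loss weight per pixel; background pixels on thin
    ridges between touching cells would receive large values here.
    """
    h, w, _ = prob.shape
    rows, cols = np.indices((h, w))
    p_true = prob[rows, cols, labels]       # probability of the true class
    return float(np.sum(weight_map * -np.log(p_true + eps)))
```

Errors on the heavily weighted separation pixels dominate the loss, forcing the network to learn the thin borders between instances.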
The resulting network is applicable to various biomedical segmentation prob-
lems. In this paper, we show results on the segmentation of neuronal structures
in EM stacks (an ongoing competition started at ISBI 2012), where we out-
performed the network of Ciresan et al. [1]. Furthermore, we show results for
cell segmentation in light microscopy images from the ISBI cell tracking chal-
lenge 2015. Here we won with a large margin on the two most challenging 2D
transmitted light datasets.
2 Network Architecture
The network architecture is illustrated in Figure 1. It consists of a contracting
path (left side) and an expansive path (right side). The contracting path follows
the typical architecture of a convolutional network. It consists of the repeated
application of two 3x3 convolutions (unpadded convolutions), each followed by
a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2
for downsampling. At each downsampling step we double the number of feature
channels. Every step in the expansive path consists of an upsampling of the
feature map followed by a 2x2 convolution (“up-convolution”) that halves the
number of feature channels, a concatenation with the correspondingly cropped
feature map from the contracting path, and two 3x3 convolutions, each fol-
lowed by a ReLU. The cropping is necessary due to the loss of border pixels in
every convolution. At the final layer a 1x1 convolution is used to map each 64-
component feature vector to the desired number of classes. In total the network
has 23 convolutional layers.
To allow a seamless tiling of the output segmentation map (see Figure 2), it
is important to select the input tile size such that all 2x2 max-pooling operations
are applied to a layer with an even x- and y-size.
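The tile-size constraint can be checked with simple size arithmetic: each unpadded 3x3 convolution removes 2 pixels per side pair, each 2x2 max pool halves the size (and must see an even size), and each up-convolution doubles it. A small sketch, assuming the 4-pool depth of Figure 1 (the function name is ours):

```python
def unet_sizes(tile, depth=4):
    """Trace the x/y size of a U-net input tile through the contracting
    and expansive paths; raises if a max pool would see an odd size."""
    s = tile
    for _ in range(depth):      # contracting path
        s -= 4                  # two unpadded 3x3 convs: -2 each
        if s % 2 != 0:
            raise ValueError("max-pool input size must be even")
        s //= 2                 # 2x2 max pool, stride 2
    s -= 4                      # two 3x3 convs at the bottom
    for _ in range(depth):      # expansive path
        s *= 2                  # 2x2 up-convolution
        s -= 4                  # two 3x3 convs after concatenation
    return s

# For the example in Figure 1, a 572-pixel input tile yields a 388-pixel
# output map: unet_sizes(572) == 388.
```

Tile sizes that violate the even-size condition (e.g. 570) are rejected, which is exactly the constraint stated above.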
3 Training
The input images and their corresponding segmentation maps are used to train
the network with the stochastic gradient descent implementation of Caffe [6].
Due to the unpadded convolutions, the output image is smaller than the input
by a constant border width. To minimize the overhead and make maximum use
of the GPU memory, we favor large input tiles over a large batch size and hence
reduce the batch to a single image. Accordingly we use a high momentum (0.99)
such that a large number of the previously seen training samples determine the
update in the current optimization step.
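The effect of the high momentum can be seen in the standard momentum update, where the velocity is a decaying sum of all past gradients. A generic SGD-with-momentum sketch (not Caffe's exact update rule; the function name is ours):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, momentum=0.99):
    """One SGD-with-momentum update.

    With momentum 0.99, the velocity v is an exponentially decaying sum
    in which a large number of previously seen gradients still contribute
    to the current step, compensating for the batch size of one.
    """
    v = momentum * v + grad     # accumulate past gradients
    w = w - lr * v
    return w, v
```

With momentum m, a gradient seen t steps ago still contributes with weight m^t, so at m = 0.99 hundreds of past samples shape each update.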
The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as $p_k(x) = \exp(a_k(x)) / \sum_{k'=1}^{K} \exp(a_{k'}(x))$, where $a_k(x)$ denotes the activation in feature channel $k$ at the pixel position $x$.
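The pixel-wise soft-max can be written directly from its definition (a numerically stabilized numpy sketch; the function name is ours):

```python
import numpy as np

def pixelwise_softmax(a):
    """Soft-max over the channel axis of an activation map.

    a: (H, W, K) activations a_k(x); returns p_k(x) of the same shape,
    with the K channel probabilities at each pixel summing to 1.
    """
    a = a - a.max(axis=-1, keepdims=True)   # shift by the max for stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)
```

Subtracting the per-pixel maximum before exponentiating leaves the result unchanged but avoids overflow for large activations.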
Fig. 3. HeLa cells on glass recorded with DIC (differential interference contrast) mi-
croscopy. (a) raw image. (b) overlay with ground truth segmentation. Different colors
indicate different instances of the HeLa cells. (c) generated segmentation mask (white:
foreground, black: background). (d) map with a pixel-wise loss weight to force the
network to learn the border pixels.
3.1 Data Augmentation

Data augmentation is essential to teach the network the desired invariance and robustness properties when only a few training samples are available. In case of microscopical images we primarily need shift and rotation invariance as well as robustness to deformations and gray value variations.
4 Experiments
We demonstrate the application of the u-net to three different segmentation
tasks. The first task is the segmentation of neuronal structures in electron mi-
croscopic recordings. An example of the data set and our obtained segmentation
is displayed in Figure 2. We provide the full result as Supplementary Material.
The data set is provided by the EM segmentation challenge [14] that was started
at ISBI 2012 and is still open for new contributions. The training data is a set of
30 images (512x512 pixels) from serial section transmission electron microscopy
of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes
with a corresponding fully annotated ground truth segmentation map for cells
(white) and membranes (black). The test set is publicly available, but its seg-
mentation maps are kept secret. An evaluation can be obtained by sending the
predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computing the "warping error", the "Rand error" and the "pixel error" [14].
The u-net (averaged over 7 rotated versions of the input data) achieves, without any further pre- or postprocessing, a warping error of 0.0003529 (the new best score, see Table 1) and a Rand error of 0.0382.
This is significantly better than the sliding-window convolutional network result by Ciresan et al. [1], whose best submission had a warping error of 0.000420 and a Rand error of 0.0504. In terms of Rand error, the only better performing
Table 1. Ranking on the EM segmentation challenge [14] (March 6th, 2015), sorted by warping error.
Fig. 4. Result on the ISBI cell tracking challenge. (a) part of an input image of the
“PhC-U373” data set. (b) Segmentation result (cyan mask) with manual ground truth
(yellow border) (c) input image of the “DIC-HeLa” data set. (d) Segmentation result
(random colored masks) with manual ground truth (yellow border).
Table 2. Segmentation results (IOU) on the ISBI cell tracking challenge 2015.
algorithms on this data set use highly data set specific post-processing methods1
applied to the probability map of Ciresan et al. [1].
We also applied the u-net to a cell segmentation task in light microscopic images. This segmentation task is part of the ISBI cell tracking challenge 2014 and 2015 [10,13]. The first data set "PhC-U373"2 contains Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate recorded by phase contrast microscopy (see Figure 4a,b and Supp. Material). It contains 35 partially annotated training images. Here we achieve an average IOU ("intersection over union") of 92%, which is significantly better than the second best algorithm with 83% (see Table 2). The second data set "DIC-HeLa"3 consists of HeLa cells on a flat glass recorded by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d and Supp. Material). It contains 20 partially annotated training images. Here we achieve an average IOU of 77.5%, which is significantly better than the second best algorithm with 46%.
5 Conclusion
The u-net architecture achieves very good performance on very different biomedical segmentation applications. Thanks to data augmentation with elastic deformations, it only needs very few annotated images and has a very reasonable training time of only 10 hours on a NVidia Titan GPU (6 GB). We provide the full Caffe [6]-based implementation and the trained networks4. We are sure that the u-net architecture can be applied easily to many more tasks.

1 The authors of this algorithm have submitted 78 different solutions to achieve this result.
2 Data set provided by Dr. Sanjay Kumar, Department of Bioengineering, University of California at Berkeley, Berkeley, CA (USA).
3 Data set provided by Dr. Gert van Cappellen, Erasmus Medical Center, Rotterdam, The Netherlands.

Acknowledgements

This study was supported by the Excellence Initiative of the German Federal and State governments (EXC 294) and by the BMBF (Fkz 0316185B).
References

1. Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS, pp. 2852-2860 (2012)
2. Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: NIPS (2014)
3. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
4. Hariharan, B., Arbelez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization (2014), arXiv:1411.5752 [cs.CV]
5. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification (2015), arXiv:1502.01852 [cs.CV]
6. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding (2014), arXiv:1408.5093 [cs.CV]
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106-1114 (2012)
8. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541-551 (1989)
9. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation (2014), arXiv:1411.4038 [cs.CV]
10. Maska, M., (...), de Solorzano, C.O.: A benchmark for comparison of cell tracking algorithms. Bioinformatics 30, 1609-1617 (2014)
11. Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks. In: Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 2168-2175 (2013)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014), arXiv:1409.1556 [cs.CV]
13. WWW: Web page of the cell tracking challenge, http://www.codesolorzano.com/celltrackingchallenge/Cell_Tracking_Challenge/Welcome.html
14. WWW: Web page of the EM segmentation challenge, http://brainiac2.mit.edu/isbi_challenge/

4 U-net implementation, trained networks and supplementary material available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
arXiv:1506.01497v3 [cs.CV] 6 Jan 2016

Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with "attention" mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps
are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on
the feature map. (c) We use pyramids of reference boxes in the regression functions.
pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel "anchor" boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.1

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time—the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions2. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

2 RELATED WORK

Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned

1. Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to less training time.
2. http://image-net.org/challenges/LSVRC/2015/results
3
into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the "single-box" fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in the context of our method. Concurrent with our work, the DeepMask method [28] was developed for learning segmentation proposals.

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

3 FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with "attention" [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the "attention" of this unified network. (The figure shows an image passed through conv layers to feature maps; a Region Proposal Network produces proposals that are fed, together with the feature maps, through RoI pooling to the classifier.)

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.3 We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers—a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the maximum number of possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate the probability of object or not object for each proposal.4 The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of size W × H (typically ∼2,400), there are W Hk anchors in total.

Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate, and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method.5 As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.

The translation-invariant property also reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer, whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors. As a result, our output layer has 2.8 × 10^4 parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox's output layer, which has 6.1 × 10^6 parameters (1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox.6 We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.

Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized to multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a "pyramid of filters" (Figure 1(b)). The second way is usually adopted jointly with the first way [8].

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).

Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.

3. "Region" is a generic term, and in this paper we only consider rectangular regions, as is common for many methods (e.g., [27], [4], [6]). "Objectness" measures membership to a set of object classes vs. background.
4. For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce k scores.
5. As is the case for FCNs [7], our network is translation invariant up to the network's total stride.
6. Considering the feature projection layers, our proposal layers' parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10^6; MultiBox's proposal layers' parameter count is 7 × 7 × (64 + 96 + 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10^6.
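The default anchor set described above can be sketched as follows. This is an illustrative numpy reconstruction, not the released code; the function name and the (x1, y1, x2, y2) box layout are ours.

```python
# Illustrative sketch (not the released code): the k = 9 anchors of
# Section 3.1.1, 3 scales x 3 aspect ratios, centred at one sliding position.
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (k, 4) anchors as (x1, y1, x2, y2) centred at the origin.
    Each anchor keeps the area scale**2 while its height/width ratio varies."""
    out = []
    for s in scales:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)  # w*h == s**2, h/w == r
            out.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(out)

anchors = base_anchors()
print(anchors.shape)  # (9, 4): k = 9 anchors per position, W*H*k per feature map
```

Sliding this fixed set over every feature-map position is what makes the scheme translation invariant: the anchors depend only on the position's offset, never on image content.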
3.1.2 Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition because in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.

With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:

    L({pi}, {ti}) = (1/Ncls) Σi Lcls(pi, pi*) + λ (1/Nreg) Σi pi* Lreg(ti, ti*).    (1)

Here, i is the index of an anchor in a mini-batch and pi is the predicted probability of anchor i being an object. The ground-truth label pi* is 1 if the anchor is positive, and is 0 if the anchor is negative. ti is a vector representing the 4 parameterized coordinates of the predicted bounding box, and ti* is that of the ground-truth box associated with a positive anchor. The classification loss Lcls is log loss over two classes (object vs. not object). For the regression loss, we use Lreg(ti, ti*) = R(ti − ti*), where R is the robust loss function (smooth L1) defined in [2]. The term pi* Lreg means the regression loss is activated only for positive anchors (pi* = 1) and is disabled otherwise (pi* = 0). The outputs of the cls and reg layers consist of {pi} and {ti} respectively.

The two terms are normalized by Ncls and Nreg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn. (1) is normalized by the mini-batch size (i.e., Ncls = 256) and the reg term is normalized by the number of anchor locations (i.e., Nreg ∼ 2,400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.

For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:

    tx  = (x − xa)/wa,      ty  = (y − ya)/ha,
    tw  = log(w/wa),        th  = log(h/ha),
    tx* = (x* − xa)/wa,     ty* = (y* − ya)/ha,                                    (2)
    tw* = log(w*/wa),       th* = log(h*/ha),

where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, xa, and x* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

Nevertheless, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

3.1.3 Training RPNs

The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the "image-centric" sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they are dominant. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net, to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].

Table 1: The learned average proposal size for each anchor using the ZF net (numbers for s = 600).

anchor    128², 2:1  128², 1:1  128², 1:2  256², 2:1  256², 1:1  256², 1:2  512², 2:1  512², 1:1  512², 1:2
proposal  188×111    113×114    70×92      416×229    261×284    174×332    768×437    499×501    355×715

3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).

Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:
(i) Alternating training. In this solution, we first train the RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize the RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training, as in Figure 2. In each SGD iteration, the forward pass generates region proposals, which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But it ignores the derivative w.r.t. the proposal boxes' coordinates, which are also network responses, so it is approximate. In our experiments, we have empirically found that this solver produces close results, yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.

(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by the RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem, and a solution can be given by an "RoI warping" layer as developed in [15], which is beyond the scope of this paper.

4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to the RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so that they do not contribute to the loss. For a typical 1000×600 image, there will be roughly 20000 (≈ 60×40×9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult-to-correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
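The anchor bookkeeping just described (tile ~20000 anchors over a 1000×600 image at stride 16, then ignore those crossing the boundary) can be sketched as follows. This is an illustrative numpy reconstruction, not the released code; the names and the (x1, y1, x2, y2) layout are ours, and the base-anchor construction is repeated so the sketch is self-contained.

```python
# Illustrative sketch of the anchor handling in Section 3.3: tile k = 9
# anchors over a stride-16 grid of a 1000x600 image, then mask out the
# anchors that cross the image boundary (the paper reports ~6000 survivors).
import numpy as np

def tiled_anchors(im_w, im_h, stride=16, scales=(128, 256, 512),
                  ratios=(0.5, 1.0, 2.0)):
    base = []
    for s in scales:
        for r in ratios:
            w, h = s / np.sqrt(r), s * np.sqrt(r)  # area s**2, ratio h/w = r
            base.append((-w / 2, -h / 2, w / 2, h / 2))
    base = np.array(base)
    # One (x, y) shift per feature-map position, applied to all k base anchors.
    shifts = np.array([(x, y, x, y)
                       for y in range(0, im_h - im_h % stride, stride)
                       for x in range(0, im_w - im_w % stride, stride)])
    return (shifts[:, None, :] + base[None, :, :]).reshape(-1, 4)

def inside_image(anchors, im_w, im_h):
    """Mask of anchors that lie entirely inside the image."""
    return ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0)
            & (anchors[:, 2] <= im_w) & (anchors[:, 3] <= im_h))

a = tiled_anchors(1000, 600)
print(len(a))  # 62 * 37 * 9 = 20646, i.e. roughly the 20000 quoted in the text
print(int(inside_image(a, 1000, 600).sum()))  # only a fraction survive
```

At test-time the same mask is not applied; instead, proposals that cross the boundary would simply be clipped to the image, as the text describes.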
Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are
Fast R-CNN with ZF, but using various proposal methods for training and testing.
train-time region proposals test-time region proposals
method # boxes method # proposals mAP (%)
SS 2000 SS 2000 58.7
EB 2000 EB 2000 58.6
RPN+ZF, shared 2000 RPN+ZF, shared 300 59.9
ablation experiments follow below
RPN+ZF, unshared 2000 RPN+ZF, unshared 300 58.7
SS 2000 RPN+ZF 100 55.1
SS 2000 RPN+ZF 300 56.8
SS 2000 RPN+ZF 1000 56.3
SS 2000 RPN+ZF (no NMS) 6000 55.2
SS 2000 RPN+ZF (no cls) 100 44.6
SS 2000 RPN+ZF (no cls) 300 51.4
SS 2000 RPN+ZF (no cls) 1000 55.8
SS 2000 RPN+ZF (no reg) 300 52.1
SS 2000 RPN+ZF (no reg) 1000 51.3
SS 2000 RPN+VGG 300 59.2
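The training objective of Eq. (1) and the box parameterization of Eq. (2) in Section 3.1.2 can be sketched as follows. This is an illustrative numpy reconstruction for exposition, not the authors' Caffe implementation; all function names are ours.

```python
# Illustrative numpy sketch of the RPN multi-task loss (Eq. (1)) and the
# anchor-relative box parameterization (Eq. (2)); not the released code.
import numpy as np

def encode(box, anchor):
    """Eq. (2): parameterize (x, y, w, h) relative to an anchor (xa, ya, wa, ha)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def smooth_l1(d):
    """The robust loss R of [2]: 0.5*d**2 if |d| < 1, else |d| - 0.5."""
    d = np.abs(d)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """Eq. (1): log loss over labeled anchors plus a lambda-weighted smooth-L1
    regression term that is active only for positive anchors. p_star is
    1 (positive), 0 (negative), or -1 (ignored); ignored anchors contribute
    to neither term."""
    labeled = p_star >= 0
    cls = -np.sum(p_star[labeled] * np.log(p[labeled])
                  + (1 - p_star[labeled]) * np.log(1 - p[labeled])) / n_cls
    pos = p_star == 1
    reg = np.sum(smooth_l1(t[pos] - t_star[pos])) / n_reg
    return cls + lam * reg
```

With Ncls = 256, Nreg ≈ 2400, and λ = 10 as in the text, the two normalized terms end up on roughly the same scale.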
Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.

4 EXPERIMENTS

4.1 Experiments on PASCAL VOC

We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the "fast" version of ZF net [32], which has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model7 [3], which has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).

Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the "fast" mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals8. Using RPN yields a much faster detection system than using either SS or EB because of the shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers' cost (Table 5).

Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and the Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step, when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.

Next, we disentangle the RPN's influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and the ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.

Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked

7. www.robots.ox.ac.uk/~vgg/research/very_deep/
8. For RPN, the number of proposals (e.g., 300) is the maximum number for an image. RPN may produce fewer proposals after NMS, and thus the average number of proposals is smaller.
100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting that NMS does not harm the detection mAP and may reduce false alarms.

Next, we separately investigate the roles of RPN's cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1000 (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.

On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.

We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because the proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than those of SS. Unlike SS, which is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%—better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and the detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.

Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN with VGG-16. Training data: "07": VOC 2007 trainval; "07+12": union set of VOC 2007 trainval and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. †: this number was reported in [2]; using the repository provided by this paper, this result is higher (68.1).

method               # proposals   data          mAP (%)
SS                   2000          07            66.9†
SS                   2000          07+12         70.0
RPN+VGG, unshared    300           07            68.5
RPN+VGG, shared      300           07            69.9
RPN+VGG, shared      300           07+12         73.2
RPN+VGG, shared      300           COCO+07+12    78.8

Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN with VGG-16. Training data: "12": VOC 2012 trainval; "07++12": union set of VOC 2007 trainval+test and VOC 2012 trainval. For RPN, the train-time proposals for Fast R-CNN are 2000. †: http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ‡: http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. §: http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html.

method              # proposals   data           mAP (%)
SS                  2000          12             65.7
SS                  2000          07++12         68.4
RPN+VGG, shared†    300           12             67.0
RPN+VGG, shared‡    300           07++12         70.4
RPN+VGG, shared§    300           COCO+07++12    75.9

Table 5: Timing (ms) on a K40 GPU, except that the SS proposal is evaluated on a CPU. "Region-wise" includes NMS, pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.

model   system             conv   proposal   region-wise   total   rate
VGG     SS + Fast R-CNN    146    1510       174           1830    0.5 fps
VGG     RPN + Fast R-CNN   141    10         47            198     5 fps
ZF      RPN + Fast R-CNN   31     3          25            59      17 fps
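The greedy NMS step described in Section 3.2's implementation notes (IoU threshold 0.7, then top-N selection) can be sketched as follows. This is an illustrative numpy version, not the released code; the IoU helper and all names are ours.

```python
# Illustrative greedy NMS over scored boxes (x1, y1, x2, y2): repeatedly keep
# the highest-scoring box and drop the rest whose IoU with it exceeds thresh.
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.7, top_n=300):
    """Return indices of the surviving boxes, highest score first, at most
    top_n of them (cf. the 300 test-time proposals used in the experiments)."""
    order = np.argsort(-scores)
    keep = []
    while order.size and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return np.array(keep)
```

Because suppression is driven by the cls scores, removing cls (as in the ablation above) removes the ranking that makes the top proposals accurate.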
Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time proposals for Fast R-CNN are 2000. RPN∗ denotes the unshared-feature version.
method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
SS 2000 07 66.9 74.5 78.3 69.2 53.2 36.6 77.3 78.2 82.0 40.7 72.7 67.9 79.6 79.2 73.0 69.0 30.1 65.4 70.2 75.8 65.8
SS 2000 07+12 70.0 77.0 78.1 69.3 59.4 38.3 81.6 78.6 86.7 42.8 78.8 68.9 84.7 82.0 76.6 69.9 31.8 70.1 74.8 80.4 70.4
RPN∗ 300 07 68.5 74.1 77.2 67.7 53.9 51.0 75.1 79.2 78.9 50.7 78.0 61.1 79.1 81.9 72.2 75.9 37.2 71.4 62.5 77.4 66.4
RPN 300 07 69.9 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6
RPN 300 07+12 73.2 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
RPN 300 COCO+07+12 78.8 84.3 82.0 77.7 68.9 65.7 88.1 88.4 88.9 63.6 86.3 70.8 85.9 87.6 80.1 82.3 53.6 80.4 75.8 86.6 78.9
Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000.
method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
SS 2000 12 65.7 80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8 79.8 69.8 32.1 65.5 63.8 76.4 61.7
SS 2000 07++12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
RPN 300 12 67.0 82.3 76.4 71.0 48.4 45.2 72.1 72.3 87.3 42.2 73.7 50.0 86.8 78.7 78.4 77.4 34.5 70.1 57.1 77.1 58.9
RPN 300 07++12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
RPN 300 COCO+07++12 75.9 87.4 83.6 76.8 62.9 59.6 81.9 82.0 91.3 54.9 82.6 59.0 89.0 85.5 84.7 84.1 52.2 78.9 65.5 85.4 70.2
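The mAP column in Tables 6 and 7 is simply the mean of the 20 per-class APs. As a check, averaging the per-class numbers from the first row of Table 6 (SS, 2000 boxes, "07" data) recovers the reported 66.9:

```python
# mAP is the mean of the per-class average precisions. The 20 APs below are
# taken from the first row of Table 6 (SS, 2000 boxes, trained on "07").
aps = [74.5, 78.3, 69.2, 53.2, 36.6, 77.3, 78.2, 82.0, 40.7, 72.7,
       67.9, 79.6, 79.2, 73.0, 69.0, 30.1, 65.4, 70.2, 75.8, 65.8]
mAP = sum(aps) / len(aps)
print(round(mAP, 1))  # 66.9, matching the mAP column of Table 6
```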
In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.

Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios (69.9% mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.

Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios (69.9%) is the same as that in Table 3.

settings            anchor scales         aspect ratios     mAP (%)
1 scale, 1 ratio    128²                  1:1               65.8
1 scale, 1 ratio    256²                  1:1               66.7
1 scale, 3 ratios   128²                  {2:1, 1:1, 1:2}   68.8
1 scale, 3 ratios   256²                  {2:1, 1:1, 1:2}   67.9
3 scales, 1 ratio   {128², 256², 512²}    1:1               69.8
3 scales, 3 ratios  {128², 256², 512²}    {2:1, 1:1, 1:2}   69.9

In Table 9 we compare different values of λ in Equation (1). By default we use λ = 10 which makes the two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by ∼1%) when λ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to λ in a wide range.

Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of λ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using λ = 10 (69.9%) is the same as that in Table 3.

λ        0.1    1      10     100
mAP (%)  67.2   68.9   69.9   69.1

Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [19], [20], [21] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.

In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.
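The Recall-to-IoU analysis rests on two small computations: box IoU and recall at a given threshold. The following is a hedged sketch (the (x1, y1, x2, y2) box format and the any-match rule are assumptions of this illustration, not the authors' evaluation script):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def recall_at(gt_boxes, proposals, thresh):
    """Fraction of ground-truth boxes covered by at least one proposal."""
    hit = sum(1 for g in gt_boxes
              if any(iou(g, p) >= thresh for p in proposals))
    return hit / len(gt_boxes)

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 10), (100, 100, 110, 110)]
print(recall_at(gt, props, 0.5))  # 0.5: only the first ground-truth box is recalled
```

Sweeping `thresh` from 0.5 to 1.0 over a proposal set yields one curve of the kind plotted in Figure 4.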
Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set. (Three panels, for 300, 1000, and 2000 proposals; each plots recall against IoU from 0.5 to 1 for the compared methods, including SS, RPN ZF, and RPN VGG.)
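Both the RPN's default anchors and the dense sliding windows of the one-stage variant below enumerate the same 3 scales × 3 aspect ratios at each position. A minimal sketch of that enumeration (illustrative only; the centering and ratio conventions of the released implementation may differ):

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate (x1, y1, x2, y2) anchor boxes centered at (cx, cy).

    Each scale s targets a box area of s*s; a ratio r = height/width
    reshapes that area, so w = s/sqrt(r) and h = s*sqrt(r).
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w, h = s / math.sqrt(r), s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

anchors = anchors_at(0, 0)
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
```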
Table 10: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.

           proposals                         # proposals   detector                    mAP (%)
Two-Stage  RPN + ZF, unshared                300           Fast R-CNN + ZF, 1 scale    58.7
One-Stage  dense, 3 scales, 3 aspect ratios  20000         Fast R-CNN + ZF, 1 scale    53.8
One-Stage  dense, 3 scales, 3 aspect ratios  20000         Fast R-CNN + ZF, 5 scales   53.9
One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN—the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.

To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the “proposals” are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from 5 scales. We use those 5 scales as in [1], [2].

Table 10 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [2], [39], where replacing SS region proposals with sliding windows leads to ∼6% degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.

4.2 Experiments on MS COCO

We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO's standard metric, simply denoted as mAP@[.5, .95]) and mAP@0.5 (PASCAL VOC's metric).

There are a few minor changes of our system made for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding 64²), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0, 0.5), instead of [0.1, 0.5) used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. But the Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0, 0.1) are never visited. Including these [0, 0.1) samples improves mAP@0.5 on the COCO dataset for both Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).
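COCO's mAP@[.5, .95] metric averages AP over ten IoU thresholds from 0.5 to 0.95 in steps of 0.05. A minimal sketch of that averaging (`ap_at_iou` is a hypothetical stand-in for a full per-threshold AP computation):

```python
def coco_map(ap_at_iou):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95 (COCO's mAP@[.5, .95])."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(ap_at_iou(t) for t in thresholds) / len(thresholds)

# Toy AP function that decays linearly as the IoU requirement tightens.
toy_ap = lambda t: max(0.0, 1.0 - t)
print(round(coco_map(toy_ap), 3))  # ≈ 0.275
```

Because the stricter thresholds contribute equally, methods with better localization (like RPN proposals, per the discussion above) gain more under this metric than under mAP@0.5 alone.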
Table 11: Object detection results (%) on the MS COCO dataset. The model is VGG-16.

                                                                    COCO val              COCO test-dev
method                            proposals  training data   mAP@.5  mAP@[.5, .95]  mAP@.5  mAP@[.5, .95]
Fast R-CNN [2]                    SS, 2000   COCO train      -       -              35.9    19.7
Fast R-CNN [impl. in this paper]  SS, 2000   COCO train      38.6    18.9           39.3    19.3
Faster R-CNN                      RPN, 300   COCO train      41.5    21.2           42.1    21.5
Faster R-CNN                      RPN, 300   COCO trainval   -       -              42.7    21.9
The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale (s = 600) testing. The testing time is still about 200ms per image on the COCO dataset.

In Table 11 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% mAP@0.5 on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.

Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% mAP@0.5 and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for mAP@0.5 and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 11). This indicates that RPN performs excellently for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% mAP@0.5 and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.

Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. Only by replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from 41.5%/21.2% (VGG-16) to 48.4%/27.2% (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of 55.7%/34.9% and an ensemble result of 59.0%/37.4% on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by absolute 8.5%. RPN is also a building block of the 1st-place winning entries in the ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.

4.3 From MS COCO to PASCAL VOC

Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.

Table 12: Detection mAP (%) of Faster R-CNN on PASCAL VOC 2007 test set and 2012 test set using different training data. The model is VGG-16. “COCO” denotes that the COCO trainval set is used for training. See also Table 6 and Table 7.

training data     2007 test   2012 test
VOC07             69.9        67.0
VOC07+12          73.2        -
VOC07++12         -           70.4
COCO (no VOC)     76.1        73.0
COCO+VOC07+12     78.8        -
COCO+VOC07++12    -           75.9

As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive on COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is 76.1% on the PASCAL VOC 2007 test set (Table 12). This result is better than that trained on VOC07+12 (73.2%) by a good margin, even though the PASCAL VOC data are not exploited.

Then we fine-tune the COCO detection model on the VOC dataset. In this experiment, the COCO model is in place of the ImageNet-pre-trained model (that is used to initialize the network weights), and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to 78.8% mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by 5.6%. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. Similar improvements are observed on the PASCAL VOC 2012 test set (Table 12 and Table 7). We note that the test-time speed of obtaining these strong results is still about 200ms per image.

5 CONCLUSION

We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional
Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster
R-CNN system. The model is VGG-16 and the training data is 07+12 trainval (73.2% mAP on the 2007 test
set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated
with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images.
The running time for obtaining these results is 198ms per image, including all steps.
Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN
system. The model is VGG-16 and the training data is COCO trainval (42.7% mAP@0.5 on the test-dev set).
Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is
used to display these images. For each image, one color represents one object category in that image.
networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
[4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Neural Information Processing Systems (NIPS), 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), 2014.
[13] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” arXiv:1511.02300, 2015.
[14] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based model for object detection and semantic part localization,” arXiv:1511.07131, 2015.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), 2015.
[37] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[39] K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine Vision Conference (BMVC), 2015.
[15] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmenta-
tion via multi-task network cascades,” arXiv:1512.04412, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully
convolutional localization networks for dense captioning,”
arXiv:1511.07571, 2015.
[17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human cu-
ration and convnets: Powering item-to-item recommendations
on pinterest,” arXiv:1511.04003, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” arXiv:1512.03385, 2015.
[19] J. Hosang, R. Benenson, and B. Schiele, “How good are de-
tection proposals, really?” in British Machine Vision Conference
(BMVC), 2014.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes
for effective detection proposals?” IEEE Transactions on Pattern
Analysis and Machine Intelligence (TPAMI), 2015.
[21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra,
“Object-Proposal Evaluation Protocol is ’Gameable’,” arXiv:
1505.05836, 2015.
[22] J. Carreira and C. Sminchisescu, “CPMC: Automatic ob-
ject segmentation using constrained parametric min-cuts,”
IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI), 2012.
[23] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik,
“Multiscale combinatorial grouping,” in IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the object-
ness of image windows,” IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks
for object detection,” in Neural Information Processing Systems
(NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable
object detection using deep neural networks,” in IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable,
high-quality object detection,” arXiv:1412.1441 (v1), 2015.
[28] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to
segment object candidates,” in Neural Information Processing
Systems (NIPS), 2015.
[29] J. Dai, K. He, and J. Sun, “Convolutional feature masking
for joint object and stuff segmentation,” in IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Ob-
ject detection networks on convolutional feature maps,”
arXiv:1504.06066, 2015.
[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and
Y. Bengio, “Attention-based models for speech recognition,”
in Neural Information Processing Systems (NIPS), 2015.
[32] M. D. Zeiler and R. Fergus, “Visualizing and understanding
convolutional neural networks,” in European Conference on
Computer Vision (ECCV), 2014.
[33] V. Nair and G. E. Hinton, “Rectified linear units improve
restricted boltzmann machines,” in International Conference on
Machine Learning (ICML), 2010.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, and A. Rabinovich, “Going deeper with convo-
lutions,” in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2015.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to
handwritten zip code recognition,” Neural computation, 1989.
Under review as a conference paper at ICLR 2016
Soumith Chintala
Facebook AI Research
New York, NY
soumith@fb.com

arXiv:1511.06434v2 [cs.LG] 7 Jan 2016
ABSTRACT
In recent years, supervised learning with convolutional networks (CNNs) has
seen huge adoption in computer vision applications. Comparatively, unsupervised
learning with CNNs has received less attention. In this work we hope to help
bridge the gap between the success of CNNs for supervised learning and unsuper-
vised learning. We introduce a class of CNNs called deep convolutional generative
adversarial networks (DCGANs) that have certain architectural constraints, and
demonstrate that they are a strong candidate for unsupervised learning. Training
on various image datasets, we show convincing evidence that our deep convolu-
tional adversarial pair learns a hierarchy of representations from object parts to
scenes in both the generator and discriminator. Additionally, we use the learned
features for novel tasks - demonstrating their applicability as general image repre-
sentations.
1 INTRODUCTION
Learning reusable feature representations from large unlabeled datasets has been an area of active
research. In the context of computer vision, one can leverage the practically unlimited amount of
unlabeled images and videos to learn good intermediate representations, which can then be used on
a variety of supervised learning tasks such as image classification. We propose that one way to build
good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow
et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors
for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques.
One can additionally argue that their learning process and the lack of a heuristic cost function (such
as pixel-wise independent mean-square error) are attractive to representation learning. GANs have
been known to be unstable to train, often resulting in generators that produce nonsensical outputs.
There has been very limited published research in trying to understand and visualize what GANs
learn, and the intermediate representations of multi-layer GANs.
In this paper, we make the following contributions:
• We show that the generators have interesting vector arithmetic properties allowing for easy
manipulation of many semantic qualities of generated samples.
2 RELATED WORK
Unsupervised representation learning is a fairly well studied problem in general computer vision
research, as well as in the context of images. A classic approach to unsupervised representation
learning is to do clustering on the data (for example using K-means), and leverage the clusters for
improved classification scores. In the context of images, one can do hierarchical clustering of image
patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method
is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and
where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that
encode an image into a compact code, and decode the code to reconstruct the image as accurately
as possible. These methods have also been shown to learn good feature representations from image
pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning
hierarchical representations.
Generative image models are well studied and fall into two categories: parametric and non-
parametric.
The non-parametric models often do matching from a database of existing images, often matching
patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution
(Freeman et al., 2002) and in-painting (Hays & Efros, 2007).
Parametric models for generating images have been explored extensively (for example on MNIST
digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images
of the real world had seen little success until recently. A variational sampling approach to
generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer
from being blurry. Another approach generates images using an iterative forward diffusion process
(Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated
images suffering from being noisy and incomprehensible. A laplacian pyramid extension to this
approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects
looking wobbly because of noise introduced in chaining multiple models. A recurrent network
approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have
also recently had some success with generating natural images. However, they have not leveraged
the generators for supervised tasks.
One constant criticism of using neural networks has been that they are black-box methods, with little
understanding of what the networks do in the form of a simple human-consumable algorithm. In the
context of CNNs, Zeiler et al. (Zeiler & Fergus, 2014) showed that by using deconvolutions and
filtering the maximal activations, one can find the approximate purpose of each convolution filter in
the network. Similarly, using a gradient descent on the inputs lets us inspect the ideal image that
activates certain subsets of filters (Mordvintsev et al.).
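The gradient-based input inspection mentioned above amounts to gradient ascent on the input with respect to a unit's activation. A dependency-free toy sketch (the scalar "activation" stands in for a real CNN unit; this is an illustration of the technique, not any paper's code):

```python
def activation(x):
    # Toy "filter activation" peaking at x = 3 (stand-in for a CNN unit).
    return -(x - 3.0) ** 2

def grad(f, x, h=1e-5):
    # Central-difference numerical gradient, to keep the sketch self-contained.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.0
for _ in range(200):               # gradient *ascent* on the input
    x += 0.1 * grad(activation, x)
print(round(x, 3))  # ≈ 3.0: the input that maximally activates the unit
```

In practice the same loop runs over an image tensor with backpropagated (rather than numerical) gradients.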
Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This
motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to it-
eratively upscale low resolution generated images which can be modeled more reliably. We also
encountered difficulties attempting to scale GANs using CNN architectures commonly used in the
supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher
resolution and deeper generative models.
Core to our approach is adopting and modifying three recently demonstrated changes to CNN archi-
tectures.
The first is the all convolutional net (Springenberg et al., 2014) which replaces deterministic spatial
pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn
its own spatial downsampling. We use this approach in our generator, allowing it to learn its own
spatial upsampling, and in our discriminator.
Second is the trend towards eliminating fully connected layers on top of convolutional features.
The strongest example of this is global average pooling which has been utilized in state of the
art image classification models (Mordvintsev et al.). We found global average pooling increased
model stability but hurt convergence speed. A middle ground of directly connecting the highest
convolutional features to the input and output respectively of the generator and discriminator worked
well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called
fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional
tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer
is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example
model architecture.
Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the
input to each unit to have zero mean and unit variance. This helps deal with training problems that
arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get
deep generators to begin learning, preventing the generator from collapsing all samples to a single
point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers
however, resulted in sample oscillation and model instability. This was avoided by not applying
batchnorm to the generator output layer and the discriminator input layer.
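The normalization described above can be sketched per unit as follows (training-mode batch statistics only; the learned scale/shift parameters and running averages of a full Batch Normalization layer are omitted for brevity):

```python
import math

def batchnorm_1d(batch, eps=1e-5):
    """Normalize one unit's activations over a batch to zero mean, unit variance."""
    m = sum(batch) / len(batch)
    var = sum((x - m) ** 2 for x in batch) / len(batch)
    # eps guards against division by zero for near-constant batches
    return [(x - m) / math.sqrt(var + eps) for x in batch]

out = batchnorm_1d([1.0, 2.0, 3.0, 4.0])
print(round(sum(out), 6))  # output batch has ~zero mean
```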
The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output
layer which uses the Tanh function. We observed that using a bounded activation allowed the model
to learn more quickly to saturate and cover the color space of the training distribution. Within the
discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work
well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which
used the maxout activation (Goodfellow et al., 2013).
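The two activation choices are one-liners; the leak slope of 0.2 below matches the value reported in the training details (a sketch, not the paper's code):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.2):
    # Pass positives through; scale negatives by a small slope instead of zeroing them.
    return x if x > 0 else slope * x

print(leaky_relu(-1.0))        # -0.2
print(math.tanh(10.0) <= 1.0)  # True: the generator's tanh output is bounded in [-1, 1]
```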
Architecture guidelines for stable Deep Convolutional GANs
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided
convolutions (generator).
• Use batchnorm in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in generator for all layers except for the output, which uses Tanh.
• Use LeakyReLU activation in the discriminator for all layers.
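The guideline of fractional-strided (transposed) convolutions fixes the generator's spatial progression. Assuming the common kernel-4, stride-2, padding-1 configuration (an assumption of this sketch; Figure 1 shows the 4 → 64 progression but not these exact hyperparameters), four such layers take a 4×4 projection of Z up to 64×64:

```python
def tconv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of a 2-D transposed convolution."""
    return (size - 1) * stride - 2 * pad + kernel

sizes = [4]  # Z is projected and reshaped to a 4x4 spatial map
for _ in range(4):  # four fractionally-strided convolutions
    sizes.append(tconv_out(sizes[-1]))
print(sizes)  # [4, 8, 16, 32, 64]
```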
We trained DCGANs on three datasets: Large-scale Scene Understanding (LSUN) (Yu et al., 2015),
Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets
are given below.
No pre-processing was applied to training images besides scaling to the range of the tanh activation
function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with
a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution
with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models.
While previous GAN work has used momentum to accelerate training, we used the Adam optimizer
(Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001
to be too high, and used 0.0002 instead. Additionally, we found leaving the momentum term β1 at the
Under review as a conference paper at ICLR 2016
Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribu-
tion Z is projected to a small spatial extent convolutional representation with many feature maps.
A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called
deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no
fully connected or pooling layers are used.
suggested value of 0.9 resulted in training oscillation and instability while reducing it to 0.5 helped
stabilize training.
4.1 LSUN
As visual quality of samples from generative image models has improved, concerns of over-fitting
and memorization of training samples have risen. To demonstrate how our model scales with more
data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing
a little over 3 million training examples. Recent analysis has shown that there is a direct link be-
tween how fast models learn and their generalization performance (Hardt et al., 2015). We show
samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after
convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality
samples via simply overfitting/memorizing training examples. No data augmentation was applied to
the images.
4.1.1 DEDUPLICATION
To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a
simple image de-duplication process. We fit a 3072-128-3072 de-noising, dropout-regularized ReLU
autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer
activations are then binarized via thresholding the ReLU activation which has been shown to be an
effective information preserving technique (Srivastava et al., 2014) and provides a convenient form
of semantic hashing, allowing for linear-time de-duplication. Visual inspection of hash collisions
showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the
technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.
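The hashing step can be sketched as follows; the random projection is a stand-in for the trained autoencoder's 128-unit code layer, and only the binarize-and-hash logic mirrors the procedure above:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3072, 128))           # stand-in for trained encoder weights

def semantic_hash(image_flat):
    """Encode a flattened 32x32x3 crop, then binarize the ReLU code."""
    code = np.maximum(image_flat @ W, 0.0)     # ReLU code-layer activations
    bits = code > 0.0                          # threshold -> 128-bit binary hash
    return bits.tobytes()                      # hashable key => linear-time dedup

images = rng.standard_normal((100, 3072))      # stand-ins for downsampled crops
images[1] = images[0]                          # plant an exact duplicate
seen, kept = set(), []
for img in images:
    h = semantic_hash(img)
    if h not in seen:                          # hash collision => near-duplicate
        seen.add(h)
        kept.append(img)
print(len(kept))                               # 99: the planted duplicate is dropped
```

Because the binary code is a fixed-size key, de-duplication is a single pass with set membership tests rather than a quadratic pairwise comparison.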
4.2 FACES
We scraped images containing human faces from random web image queries of people's names. The
names were acquired from dbpedia, with the criterion that the people were born in the modern era.
This dataset has 3M images from 10K people. We run an OpenCV face detector on these images,
keeping the detections that are sufficiently high resolution, which gives us approximately 350,000
face boxes. We use these face boxes for training. No data augmentation was applied to the images.
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model
could learn to memorize training examples, but this is experimentally unlikely as we train with a
small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating
memorization with SGD and a small learning rate.
Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual
under-fitting via repeated noise textures across multiple samples such as the base boards of some of
the beds.
4.3 IMAGENET-1K
We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We
train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.
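The min-resize-and-center-crop can be sketched as follows (nearest-neighbour resizing for brevity; any image library's resize would normally be used):

```python
import numpy as np

def min_resize_center_crop(img, size=32):
    """Resize so the shorter side equals `size` (nearest neighbour),
    then take the central size x size crop."""
    h, w, _ = img.shape
    scale = size / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]               # nearest-neighbour resize
    top, left = (nh - size) // 2, (nw - size) // 2
    return resized[top:top + size, left:left + size]

img = np.zeros((480, 640, 3))                  # a landscape-format input
print(min_resize_center_crop(img).shape)       # (32, 32, 3)
```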
One common technique for evaluating the quality of unsupervised representation learning algo-
rithms is to apply them as a feature extractor on supervised datasets and evaluate the performance
of linear models fitted on top of these features.
On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well
tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm.
When using a very large amount of feature maps (4800) this technique achieves 80.6% accuracy.
An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates &
Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks,
we train on Imagenet-1k and then use the discriminator’s convolutional features from all layers,
maxpooling each layer's representation to produce a 4 × 4 spatial grid. These features are then
flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM
classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means
based approaches. Notably, the discriminator has far fewer feature maps (512 in the highest layer)
compared to K-means based techniques, but does result in a larger total feature vector size due to
the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of
Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs
in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented,
exemplar samples from the source dataset. Further improvements could be made by finetuning the
discriminator’s representations, but we leave this for future work. Additionally, since our DCGAN
was never trained on CIFAR-10 this experiment also demonstrates the domain robustness of the
learned features.
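The pooling-and-concatenation step can be sketched as follows (the per-layer shapes are illustrative placeholders; the text states only that the top layer has 512 maps and that the real pipeline yields a 28672-dimensional vector):

```python
import numpy as np

def maxpool_to_grid(fmap, grid=4):
    """Max-pool a (C, H, W) feature map down to a (C, grid, grid) spatial grid."""
    c, h, w = fmap.shape
    out = np.empty((c, grid, grid))
    for i in range(grid):
        for j in range(grid):
            out[:, i, j] = fmap[:,
                                i * h // grid:(i + 1) * h // grid,
                                j * w // grid:(j + 1) * w // grid].max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
# Illustrative discriminator activations, one (channels, height, width) per layer.
layers = [rng.standard_normal((c, s, s))
          for c, s in [(64, 32), (128, 16), (256, 8), (512, 4)]]
features = np.concatenate([maxpool_to_grid(f).ravel() for f in layers])
print(features.shape)   # (15360,) with these toy sizes; the real layers give 28672
# A regularized linear L2-SVM is then trained on these flattened features.
```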
Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pre-
trained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.
Model                         Accuracy   Accuracy (400 per class)   max # of feature units
1 Layer K-means               80.6%      63.7% (±0.7%)              4800
3 Layer K-means Learned RF    82.0%      70.7% (±0.7%)              3200
View Invariant K-means        81.9%      72.6% (±0.7%)              6400
Exemplar CNN                  84.3%      77.4% (±0.2%)              1024
DCGAN (ours) + L2-SVM         82.8%      73.8% (±0.4%)              512
On the StreetView House Numbers dataset (SVHN)(Netzer et al., 2011), we use the features of
the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following
similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of
10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000
uniformly class distributed training examples are randomly selected and used to train a regularized
linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This
achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon
another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally,
we validate that the CNN architecture used in DCGAN is not the key contributing factor of the
model’s performance by training a purely supervised CNN with the same architecture on the same
data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio,
2012). It achieves a significantly higher 28.87% validation error.
We investigate the trained generators and discriminators in a variety of ways. We do not do any
kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are
trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood
metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.
The first experiment we did was to understand the landscape of the latent space. Walking on the
manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions)
and about the way in which the space is hierarchically collapsed. If walking in this latent space
results in semantic changes to the image generations (such as objects being added and removed), we
can reason that the model has learned relevant and interesting representations. The results are shown
in Fig.4.
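Such a walk is plain linear interpolation between points in Z; a sketch (decoding with the trained generator is indicated only as a comment, since the network itself is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
z_start = rng.uniform(-1, 1, 100)   # two random points in the 100-d Z space
z_end   = rng.uniform(-1, 1, 100)

# Nine evenly spaced interpolation steps between the two points.
steps = [(1 - t) * z_start + t * z_end for t in np.linspace(0.0, 1.0, 9)]

# Each step would then be decoded with the trained generator:
#   images = [generator(z) for z in steps]
# Smooth semantic changes between adjacent decodings (no sharp transitions)
# are the sign that the model has not merely memorized training examples.
print(len(steps), steps[0].shape)
```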
Previous work has demonstrated that supervised training of CNNs on large image datasets results in
very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on
scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised
DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting.
Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig.5 that the
features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows.
For comparison, in the same figure, we give a baseline for randomly initialized features that are not
activated on anything that is semantically relevant or interesting.
In addition to the representations learnt by a discriminator, there is the question of what representa-
tions the generator learns. The quality of samples suggest that the generator learns specific object
representations for major scene components such as beds, windows, lamps, doors, and miscellaneous
furniture. In order to explore the form that these representations take, we conducted an experiment
to attempt to remove windows from the generator completely.
On 150 samples, 52 window bounding boxes were drawn manually. On the second highest con-
volution layer features, logistic regression was fit to predict whether a feature activation was on a
window (or not), by using the criterion that activations inside the drawn bounding boxes are posi-
tives and random samples from the same images are negatives. Using this simple model, all feature
maps with weights greater than zero ( 200 in total) were dropped from all spatial locations. Then,
random new samples were generated with and without the feature map removal.
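The masking step can be sketched as follows; the activations and the fitted logistic-regression weights are random stand-ins, and only the drop-positive-weight-maps logic follows the procedure above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_maps = 512                                   # feature maps in the chosen layer
acts = rng.standard_normal((n_maps, 8, 8))     # illustrative layer activations

# Stand-in for the fitted logistic-regression weights: one weight per map;
# a positive weight means the map fires on windows.
w = rng.standard_normal(n_maps)

ablated = acts.copy()
ablated[w > 0] = 0.0     # drop "window" maps at ALL spatial locations

# Maps with positive weight are silenced; all other maps are untouched.
print((ablated[w > 0] == 0).all(), (ablated[w <= 0] == acts[w <= 0]).all())
```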
The generated images with and without the window dropout are shown in Fig.6, and interestingly,
the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.
Figure 4: Top rows: Interpolation between a series of 9 random points in Z show that the space
learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In
the 6th row, you see a room without a window slowly transforming into a room with a giant window.
In the 10th row, you see what appears to be a TV slowly being transformed into a window.
In the context of evaluating learned representations of words (Mikolov et al., 2013) demonstrated
that simple arithmetic operations revealed rich linear structure in representation space. One canoni-
cal example demonstrated that the vector(”King”) - vector(”Man”) + vector(”Woman”) resulted in a
vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure
emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors
of sets of exemplar samples for visual concepts. Experiments working on only single samples per
concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable
generations that semantically obeyed the arithmetic. In addition to the object manipulation shown
in (Fig. 7), we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).
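The arithmetic itself is elementary; averaging three exemplar Z vectors per concept before combining is what stabilized the results. A sketch with random stand-ins for the exemplar vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Three exemplar z vectors (random stand-ins here) averaged per visual concept.
z_smiling_woman = rng.uniform(-1, 1, (3, 100)).mean(axis=0)
z_neutral_woman = rng.uniform(-1, 1, (3, 100)).mean(axis=0)
z_neutral_man   = rng.uniform(-1, 1, (3, 100)).mean(axis=0)

# "smiling woman" - "neutral woman" + "neutral man"  ->  Y ("smiling man")
y = z_smiling_woman - z_neutral_woman + z_neutral_man

# Uniform noise of scale 0.25 around Y yields the 8 neighbouring samples
# that are decoded alongside the center sample (as in Fig. 7).
neighbours = y + rng.uniform(-0.25, 0.25, (8, 100))
print(y.shape, neighbours.shape)
```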
These demonstrations suggest interesting applications can be developed using Z representations
learned by our models. It has been previously demonstrated that conditional generative models can
learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al.,
2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised
Figure 6: Top row: un-modified samples from model. Bottom row: the same samples generated
with dropping out ”window” filters. Some windows are removed, others are transformed into objects
with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall
scene composition stayed similar, suggesting the generator has done a good job disentangling scene
representation from object representation. Extended experiments could be done to remove other
objects from the image and modify the objects the generator draws.
models. Further exploring and developing the above mentioned vector arithmetic could dramat-
ically reduce the amount of data needed for conditional generative modeling of complex image
distributions.
We propose a more stable set of architectures for training generative adversarial networks and we
give evidence that adversarial networks learn good representations of images for supervised learning
and generative modeling. There are still some forms of model instability remaining - we noticed as
models are trained longer they sometimes collapse a subset of filters to a single oscillating mode.
Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are
averaged. Arithmetic was then performed on the mean vectors, creating a new vector Y. The center
sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate
the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added
to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples)
results in noisy overlap due to misalignment.
Further work is needed to tackle this form of instability. We think that extending this framework
Figure 8: A ”turn” vector was created from four averaged samples of faces looking left vs looking
right. By adding interpolations along this axis to random samples we were able to reliably transform
their pose.
to other domains such as video (for frame prediction) and audio (pre-trained features for speech
synthesis) should be very interesting. Further investigations into the properties of the learnt latent
space would be interesting as well.
ACKNOWLEDGMENTS
We are fortunate and thankful for all the advice and guidance we have received during this work,
especially that of Ian Goodfellow, Tobias Springenberg, Arthur Szlam and Durk Kingma. Addition-
ally we’d like to thank all of the folks at indico for providing support, resources, and conversations,
especially the two other members of the indico research team, Dan Kuster and Nathan Lintz. Finally,
we’d like to thank Nvidia for donating a Titan-X GPU used in this work.
REFERENCES
Bergstra, James and Bengio, Yoshua. Random search for hyper-parameter optimization. JMLR,
2012.
Coates, Adam and Ng, Andrew. Selecting receptive fields in deep networks. NIPS, 2011.
Coates, Adam and Ng, Andrew Y. Learning feature representations with k-means. In Neural Net-
works: Tricks of the Trade, pp. 561–580. Springer, 2012.
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale
hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.
IEEE Conference on, pp. 248–255. IEEE, 2009.
Denton, Emily, Chintala, Soumith, Szlam, Arthur, and Fergus, Rob. Deep generative image models
using a laplacian pyramid of adversarial networks. arXiv preprint arXiv:1506.05751, 2015.
Dosovitskiy, Alexey, Springenberg, Jost Tobias, and Brox, Thomas. Learning to generate chairs
with convolutional neural networks. arXiv preprint arXiv:1411.5928, 2014.
Dosovitskiy, Alexey, Fischer, Philipp, Springenberg, Jost Tobias, Riedmiller, Martin, and Brox,
Thomas. Discriminative unsupervised feature learning with exemplar convolutional neural net-
works. In Pattern Analysis and Machine Intelligence, IEEE Transactions on, volume 99. IEEE,
2015.
Efros, Alexei, Leung, Thomas K, et al. Texture synthesis by non-parametric sampling. In Computer
Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pp.
1033–1038. IEEE, 1999.
Freeman, William T, Jones, Thouis R, and Pasztor, Egon C. Example-based super-resolution. Com-
puter Graphics and Applications, IEEE, 22(2):56–65, 2002.
Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua.
Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair,
Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. NIPS, 2014.
Gregor, Karol, Danihelka, Ivo, Graves, Alex, and Wierstra, Daan. Draw: A recurrent neural network
for image generation. arXiv preprint arXiv:1502.04623, 2015.
Hardt, Moritz, Recht, Benjamin, and Singer, Yoram. Train faster, generalize better: Stability of
stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
Hauberg, Søren, Freifeld, Oren, Larsen, Anders Boesen Lindbo, Fisher III, John W., and Hansen,
Lars Kai. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned
data augmentation. arXiv preprint arXiv:1510.02795, 2015.
Hays, James and Efros, Alexei A. Scene completion using millions of photographs. ACM Transac-
tions on Graphics (TOG), 26(3):4, 2007.
Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Kingma, Diederik P and Ba, Jimmy Lei. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the
26th Annual International Conference on Machine Learning, pp. 609–616. ACM, 2009.
Loosli, Gaëlle, Canu, Stéphane, and Bottou, Léon. Training invariant support vector machines using
selective sampling. In Bottou, Léon, Chapelle, Olivier, DeCoste, Dennis, and Weston, Jason
(eds.), Large Scale Kernel Machines, pp. 301–320. MIT Press, Cambridge, MA., 2007. URL
http://leon.bottou.org/papers/loosli-canu-bottou-2006.
Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. Rectifier nonlinearities improve neural
network acoustic models. In Proc. ICML, volume 30, 2013.
Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural information
processing systems, pp. 3111–3119, 2013.
Mordvintsev, Alexander, Olah, Christopher, and Tyka, Mike. Inceptionism : Going
deeper into neural networks. http://googleresearch.blogspot.com/2015/06/
inceptionism-going-deeper-into-neural.html. Accessed: 2015-06-17.
Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–
814, 2010.
Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Read-
ing digits in natural images with unsupervised feature learning. In NIPS workshop on deep learn-
ing and unsupervised feature learning, volume 2011, pp. 5. Granada, Spain, 2011.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. Learning and transferring mid-level image represen-
tations using convolutional neural networks. In CVPR, 2014.
Portilla, Javier and Simoncelli, Eero P. A parametric texture model based on joint statistics of
complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70, 2000.
Rasmus, Antti, Valpola, Harri, Honkala, Mikko, Berglund, Mathias, and Raiko, Tapani. Semi-
supervised learning with ladder network. arXiv preprint arXiv:1507.02672, 2015.
Sohl-Dickstein, Jascha, Weiss, Eric A, Maheswaranathan, Niru, and Ganguli, Surya. Deep unsuper-
vised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, 2015.
Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for
simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
Srivastava, Rupesh Kumar, Masci, Jonathan, Gomez, Faustino, and Schmidhuber, Jürgen. Under-
standing locally competitive networks. arXiv preprint arXiv:1410.1165, 2014.
Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models.
arXiv:1511.01844, Nov 2015. URL http://arxiv.org/abs/1511.01844.
Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, and Manzagol, Pierre-Antoine.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.
Xu, Bing, Wang, Naiyan, Chen, Tianqi, and Li, Mu. Empirical evaluation of rectified activations in
convolutional network. arXiv preprint arXiv:1505.00853, 2015.
Yu, Fisher, Zhang, Yinda, Song, Shuran, Seff, Ari, and Xiao, Jianxiong. Construction of a large-scale
image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365,
2015.
Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In
Computer Vision–ECCV 2014, pp. 818–833. Springer, 2014.
Zhao, Junbo, Mathieu, Michael, Goroshin, Ross, and Lecun, Yann. Stacked what-where auto-
encoders. arXiv preprint arXiv:1506.02351, 2015.
8 SUPPLEMENTARY MATERIAL
8.1 EVALUATING DCGANS CAPABILITY TO CAPTURE DATA DISTRIBUTIONS
We propose to apply standard classification metrics to a conditional version of our model, evaluating
the conditional distributions learned. We trained a DCGAN on MNIST (splitting off a 10K validation
set) as well as a permutation invariant GAN baseline and evaluated the models using a nearest
neighbor classifier comparing real data to a set of generated conditional samples. We found that
removing the scale and bias parameters from batchnorm produced better results for both models. We
speculate that the noise introduced by batchnorm helps the generative models to better explore and
generate from the underlying data distribution. The results are shown in Table 3 which compares
our models with other techniques. The DCGAN model achieves the same test error as a nearest
neighbor classifier fitted on the training dataset - suggesting the DCGAN model has done a superb
job at modeling the conditional distributions of this dataset. At one million samples per class, the
DCGAN model outperforms InfiMNIST (Loosli et al., 2007), a hand developed data augmentation
pipeline which uses translations and elastic deformations of training examples. The DCGAN is
competitive with a probabilistic generative data augmentation technique utilizing learned per class
transformations (Hauberg et al., 2015) while being more general as it directly models the data instead
of transformations of the data.
Figure 9: Side-by-side illustration of (from left-to-right) the MNIST dataset, generations from a
baseline GAN, and generations from our DCGAN.
Figure 11: Generations of a DCGAN that was trained on the Imagenet-1k dataset.
Inception-v4, Inception-ResNet and
the Impact of Residual Connections on Learning
szegedy@google.com
Alex Alemi
alemi@google.com
Abstract

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.

1. Introduction

Since the 2012 ImageNet competition [11] winning entry by Krizhevsky et al. [8], their network "AlexNet" has been successfully applied to a larger variety of computer vision tasks, for example to object detection [4], segmentation [10], human pose estimation [17], video classification [7], object tracking [18], and superresolution [3]. These examples are but a few of all the applications to which deep convolutional networks have been very successfully applied ever since.

In this work we study the combination of the two most recent ideas: residual connections introduced by He et al. in [5] and the latest revised version of the Inception architecture [15]. In [5], it is argued that residual connections are of inherent importance for training very deep architectures. Since Inception networks tend to be very deep, it is natural to replace the filter concatenation stage of the Inception architecture with residual connections. This would allow Inception to reap all the benefits of the residual approach while retaining its computational efficiency.

Besides a straightforward integration, we have also studied whether Inception itself can be made more efficient by making it deeper and wider. For that purpose, we designed a new version named Inception-v4 which has a more uniform, simplified architecture and more Inception modules than Inception-v3. Historically, Inception-v3 had inherited a lot of the baggage of the earlier incarnations. The technical constraints chiefly came from the need for partitioning the model for distributed training using DistBelief [2]. Now, after migrating our training setup to TensorFlow [1], these constraints have been lifted, which allowed us to simplify the architecture significantly. The details of that simplified architecture are described in Section 3.

In this report, we will compare the two pure Inception variants, Inception-v3 and v4, with similarly expensive hybrid Inception-ResNet versions. Admittedly, those models were picked in a somewhat ad hoc manner, with the main constraint being that the parameters and computational complexity of the models should be somewhat similar to the cost of the non-residual models. In fact, we have tested bigger and wider Inception-ResNet variants and they performed very similarly on the ImageNet classification challenge [11] dataset.

2. Related Work

Convolutional networks have become popular in large scale image recognition tasks after Krizhevsky et al. [8]. Some of the next important milestones were Network-in-network [9] by Lin et al., VGGNet [12] by Simonyan et al. and GoogLeNet (Inception-v1) [14] by Szegedy et al.

Residual connections were introduced by He et al. in [5], in which they give convincing theoretical and practical evidence for the advantages of utilizing additive merging of signals both for image recognition, and especially for object detection. The authors argue that residual connections are inherently necessary for training very deep convolutional models. Our findings do not seem to support this view, at least for image recognition. However, it might require more measurement points with deeper architectures to understand the true extent of beneficial aspects offered by residual connections. In the experimental section we demonstrate that it is not very difficult to train competitive very deep networks without utilizing residual connections. However, the use of residual connections seems to improve the training speed greatly, which is alone a great argument for their use.

The Inception deep convolutional architecture was introduced in [14] and was called GoogLeNet or Inception-v1 in our exposition. Later the Inception architecture was refined in various ways, first by the introduction of batch normalization [6] (Inception-v2) by Ioffe et al. Later the architecture was improved by additional factorization ideas in the third iteration [15], which will be referred to as Inception-v3 in this report.

Figure 2. Optimized version of ResNet connections by [5] to shield computation.

3. Architectural Choices

3.1. Pure Inception blocks

Our older Inception models used to be trained in a partitioned manner, where each replica was partitioned into multiple sub-networks in order to be able to fit the whole model in memory. However, the Inception architecture is highly tunable, meaning that there are a lot of possible changes to the number of filters in the various layers that do not affect the quality of the fully trained network. In order to optimize the training speed, we used to tune the layer sizes carefully in order to balance the computation between the various model sub-networks. In contrast, with the introduction of TensorFlow, our most recent models can be trained without partitioning the replicas. This is enabled in part by recent optimizations of memory used by backpropagation, achieved by carefully considering what tensors are needed for gradient computation and structuring the computation to reduce the number of such tensors. Historically, we have been relatively conservative about changing the architectural choices and restricted our experiments to varying isolated network components while keeping the rest of the network stable. Not simplifying earlier choices resulted in networks that looked more complicated than they needed to be. In our newer experiments, for Inception-v4 we decided to shed this unnecessary baggage and made uniform choices for the Inception blocks for each grid size. Please refer to
Filter concat
1x1 Conv
(k)
Figure 4. The schema for 35 × 35 grid modules of the pure
Inception-v4 network. This is the Inception-A block of Figure 9.
Filter concat
Filter concat
Figure 7. The schema for 35 × 35 to 17 × 17 reduction module.
Different variants of this blocks (with various number of filters)
7x1 Conv are used in Figure 9, and 15 in each of the new Inception(-v4, -
(256)
ResNet-v1, -ResNet-v2) variants presented in this paper. The k, l,
1x7 Conv m, n numbers represent filter bank sizes which can be looked up
1x7 Conv
1x1 Conv
(256)
(224) in Table 1.
(128)
7x1 Conv
1x1 Conv 1x7 Conv (224)
(384) (224)
1x7 Conv
1x1 Conv (192)
(192)
Avg Pooling
1x1 Conv
(192) Filter concat
Filter concat
3x3 Conv
Figure 5. The schema for 17 × 17 grid modules of the pure (320 stride 2 V)
Inception-v4 network. This is the Inception-B block of Figure 9. 3x3 Conv
(192 stride 2 V)
7x1 Conv
3x3 MaxPool (320)
Filter concat
(stride 2 V)
1x7 Conv
3x1 Conv 1x3 Conv
1x1 Conv (256)
1x1 Conv (256) (256) (192)
(256)
1x3 Conv 3x1 Conv
(256) (256) 3x1 Conv 1x1 Conv
(512)
1x1 Conv
(256)
(256)
1x3 Conv
1x1 Conv (448)
(384)
Avg Pooling
1x1 Conv Filter concat
(384)
Relu activation
Dropout (keep 0.8) Output: 1536
1x1 Conv
(896 Linear)
Reduction-B Output: 8x8x1536
7x1 Conv
7 x Inception-B Output: 17x17x1024
(128)
1x1 Conv
(128)
Stem
Output: 35x35x384
Relu activation
+
Figure 13. The schema for the 8 × 8 grid (Inception-ResNet-C) module of the Inception-ResNet-v1 network.

Figure 14. The stem of the Inception-ResNet-v1 network: Input (299×299×3) → 3×3 Conv (32, stride 2, V) → 149×149×32 → 3×3 Conv (32, V) → 147×147×32 → 3×3 Conv (64) → 147×147×64 → 3×3 MaxPool (stride 2, V) → 73×73×64 → 1×1 Conv (80) → 73×73×80 → 3×3 Conv (192, V) → 71×71×192 → 3×3 Conv (256, stride 2, V).
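The spatial sizes annotated along the stem follow from standard convolution arithmetic: a "V" (valid, unpadded) layer with kernel size k and stride s maps an input of width n to floor((n − k)/s) + 1, while "same"-padded layers preserve width. A quick sketch (the `valid_out` helper is ours, not from the paper) reproduces the stem's shape sequence:

```python
def valid_out(n, k, s=1):
    """Output width of an unpadded ('V') convolution or pooling layer."""
    return (n - k) // s + 1

# Stem of Inception-ResNet-v1, spatial dimension only (input 299x299).
n = 299
n = valid_out(n, 3, 2)   # 3x3 conv, stride 2, V
assert n == 149
n = valid_out(n, 3, 1)   # 3x3 conv, V
assert n == 147
# the 3x3 conv (64) uses 'same' padding and keeps 147; then:
n = valid_out(n, 3, 2)   # 3x3 maxpool, stride 2, V
assert n == 73
# the 1x1 conv (80) keeps 73; then:
n = valid_out(n, 3, 1)   # 3x3 conv, V
assert n == 71
n = valid_out(n, 3, 2)   # 3x3 conv, stride 2, V
assert n == 35
```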
[Figure residue: overall schema of the Inception-ResNet-v1 network: Input (299×299×3) → 5 × Inception-ResNet-A (Output: 35×35×256) → Reduction-A → 10 × Inception-ResNet-B (Output: 17×17×896) → Reduction-B (Output: 8×8×1792) → 5 × Inception-ResNet-C (Output: 8×8×1792) → Average Pooling (Output: 1792) → Dropout (keep 0.8) → Softmax (Output: 1000).]
Figure 16. The schema for the 35 × 35 grid (Inception-ResNet-A) module of the Inception-ResNet-v2 network.
Figure 19. The schema for the 8 × 8 grid (Inception-ResNet-C) module of the Inception-ResNet-v2 network.
Figure 17. The schema for the 17 × 17 grid (Inception-ResNet-B) module of the Inception-ResNet-v2 network.

Table 1. The number of filters of the Reduction-A module for the three Inception variants presented in this paper. The four numbers in the columns parametrize the four convolutions of Figure 7.

Network               k    l    m    n
Inception-v4          192  224  256  384
Inception-ResNet-v1   192  192  256  384
Inception-ResNet-v2   256  256  384  384
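Since the Reduction-A block of Figure 7 concatenates three branches — the max-pooling path (which preserves the input depth), the single 3 × 3 convolution with n filters, and the 1 × 1 → 3 × 3 → 3 × 3 path ending in m filters — its output depth is simply the input depth plus n plus m. A small sketch (the helper name is ours) checks this against the 17 × 17 grid depths quoted in the overall network schemas (1024 for Inception-v4, 896 for Inception-ResNet-v1):

```python
def reduction_a_depth(c_in, k, l, m, n):
    """Output channels of Reduction-A: maxpool branch (c_in) + 3x3 conv
    branch (n) + 1x1 -> 3x3 -> 3x3 branch (m). The k and l values only
    shape the intermediate layers and do not affect the output depth."""
    return c_in + n + m

# Inception-v4: 35x35x384 input -> 17x17x1024.
assert reduction_a_depth(384, 192, 224, 256, 384) == 1024
# Inception-ResNet-v1: 35x35x256 input -> 17x17x896.
assert reduction_a_depth(256, 192, 192, 256, 384) == 896
```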
Figure 20. The general schema for scaling combined Inception-ResNet modules. We expect that the same idea is useful in the general ResNet case, where instead of the Inception block an arbitrary subnetwork is used. The scaling block just scales the last linear activations by a suitable constant, typically around 0.1.

Figure 21. Top-1 error evolution during training of pure Inception-v3 vs. a residual network of similar computational cost. The evaluation is measured on a single crop on the non-blacklisted images of the ILSVRC-2012 validation set. The residual model trained much faster, but reached slightly worse final accuracy than the traditional Inception-v3.
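The scaling of Figure 20 can be sketched in a few lines of NumPy (a toy stand-in: `branch` here is an arbitrary residual subnetwork, not an actual Inception block):

```python
import numpy as np

def scaled_residual_block(x, branch, scale=0.1):
    """Residual unit with scaled branch activations, as in Figure 20:
    out = relu(x + scale * branch(x)). Scaling by ~0.1 keeps the residual
    summand small, which stabilizes training of very deep residual
    networks with wide filter banks."""
    return np.maximum(0.0, x + scale * branch(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))
branch = lambda t: rng.standard_normal(t.shape)  # hypothetical subnetwork
y = scaled_residual_block(x, branch)
assert y.shape == x.shape and (y >= 0).all()
```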
Figure 22. Top-5 error evolution during training of pure Inception-v3 vs. a residual Inception of similar computational cost. The evaluation is measured on a single crop on the non-blacklisted images of the ILSVRC-2012 validation set. The residual version trained much faster and reached slightly better final recall on the validation set.

Figure 24. Top-5 error evolution during training of pure Inception-v4 vs. a residual Inception of similar computational cost. The evaluation is measured on a single crop on the non-blacklisted images of the ILSVRC-2012 validation set. The residual version trained faster and reached slightly better final recall on the validation set.
Figure 23. Top-1 error evolution during training of pure Inception-v4 vs. a residual Inception of similar computational cost. The evaluation is measured on a single crop on the non-blacklisted images of the ILSVRC-2012 validation set. The residual version trained much faster and reached slightly better final accuracy than the traditional Inception-v4.

Figure 25. Top-5 error evolution of all four models (single model, single crop), showing the improvement due to larger model size. Although the residual version converges faster, the final accuracy seems to depend mainly on the model size.
Network             Top-1 Error   Top-5 Error
BN-Inception [6]    25.2%         7.8%
Inception-v3 [15]   21.2%         5.6%
Generative Adversarial Nets

Ian J. Goodfellow∗, Jean Pouget-Abadie†, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair‡, Aaron Courville, Yoshua Bengio§
Département d’informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7
Abstract
1 Introduction
The promise of deep learning is to discover rich, hierarchical models [2] that represent probability
distributions over the kinds of data encountered in artificial intelligence applications, such as natural
images, audio waveforms containing speech, and symbols in natural language corpora. So far, the
most striking successes in deep learning have involved discriminative models, usually those that
map a high-dimensional, rich sensory input to a class label [14, 20]. These striking successes have
primarily been based on the backpropagation and dropout algorithms, using piecewise linear units
[17, 8, 9] which have a particularly well-behaved gradient. Deep generative models have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. We propose a new generative model estimation procedure that sidesteps these difficulties.¹
In the proposed adversarial nets framework, the generative model is pitted against an adversary: a
discriminative model that learns to determine whether a sample is from the model distribution or the
data distribution. The generative model can be thought of as analogous to a team of counterfeiters,
trying to produce fake currency and use it without detection, while the discriminative model is
analogous to the police, trying to detect the counterfeit currency. Competition in this game drives
both teams to improve their methods until the counterfeits are indistinguishable from the genuine
articles.
∗ Ian Goodfellow is now a research scientist at Google, but did this work earlier as a UdeM student.
† Jean Pouget-Abadie did this work while visiting Université de Montréal from Ecole Polytechnique.
‡ Sherjil Ozair is visiting Université de Montréal from Indian Institute of Technology Delhi.
§ Yoshua Bengio is a CIFAR Senior Fellow.
¹ All code and hyperparameters available at http://www.github.com/goodfeli/adversarial
This framework can yield specific training algorithms for many kinds of model and optimization
algorithm. In this article, we explore the special case when the generative model generates samples
by passing random noise through a multilayer perceptron, and the discriminative model is also a
multilayer perceptron. We refer to this special case as adversarial nets. In this case, we can train
both models using only the highly successful backpropagation and dropout algorithms [16] and
sample from the generative model using only forward propagation. No approximate inference or
Markov chains are necessary.
2 Related work
Until recently, most work on deep generative models focused on models that provided a parametric
specification of a probability distribution function. The model can then be trained by maximizing the log likelihood. In this family of models, perhaps the most successful is the deep Boltzmann machine [25]. Such models generally have intractable likelihood functions and therefore require numerous approximations to the likelihood gradient. These difficulties motivated the development of “generative machines”: models that do not explicitly represent the likelihood, yet are able to generate samples from the desired distribution. Generative stochastic networks [4] are an example of
a generative machine that can be trained with exact backpropagation rather than the numerous ap-
proximations required for Boltzmann machines. This work extends the idea of a generative machine
by eliminating the Markov chains used in generative stochastic networks.
Our work backpropagates derivatives through generative processes by using the observation that
lim_{σ→0} ∇_x E_{ε∼N(0, σ²I)} f(x + ε) = ∇_x f(x).
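This identity can be checked numerically: by exchanging gradient and expectation, ∇_x E[f(x + ε)] equals E[f′(x + ε)], whose Monte-Carlo estimate approaches f′(x) as σ shrinks. A sketch for f = sin (our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
f_prime = np.cos  # derivative of f = sin
x = 0.7

def smoothed_grad(x, sigma, n=200_000):
    """Monte-Carlo estimate of grad_x E_{eps ~ N(0, sigma^2)} sin(x + eps),
    using grad E[f] = E[grad f] for Gaussian perturbations."""
    eps = rng.normal(0.0, sigma, size=n)
    return np.mean(f_prime(x + eps))

# As sigma -> 0 the smoothed gradient converges to the true gradient.
assert abs(smoothed_grad(x, sigma=0.01) - f_prime(x)) < 1e-2
# A large sigma leaves a visible smoothing bias
# (for f = sin, E[cos(x + eps)] = cos(x) * exp(-sigma^2 / 2)).
assert abs(smoothed_grad(x, sigma=1.0) - f_prime(x)) > 0.1
```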
We were unaware at the time we developed this work that Kingma and Welling [18] and Rezende
et al. [23] had developed more general stochastic backpropagation rules, allowing one to backprop-
agate through Gaussian distributions with finite variance, and to backpropagate to the covariance
parameter as well as the mean. These backpropagation rules could allow one to learn the condi-
tional variance of the generator, which we treated as a hyperparameter in this work. Kingma and
Welling [18] and Rezende et al. [23] use stochastic backpropagation to train variational autoen-
coders (VAEs). Like generative adversarial networks, variational autoencoders pair a differentiable
generator network with a second neural network. Unlike generative adversarial networks, the sec-
ond network in a VAE is a recognition model that performs approximate inference. GANs require
differentiation through the visible units, and thus cannot model discrete data, while VAEs require
differentiation through the hidden units, and thus cannot have discrete latent variables. Other VAE-
like approaches exist [12, 22] but are less closely related to our method.
Previous work has also taken the approach of using a discriminative criterion to train a generative
model [29, 13]. These approaches use criteria that are intractable for deep generative models. These
methods are difficult even to approximate for deep models because they involve ratios of probabili-
ties which cannot be approximated using variational approximations that lower bound the probabil-
ity. Noise-contrastive estimation (NCE) [13] involves training a generative model by learning the
weights that make the model useful for discriminating data from a fixed noise distribution. Using a
previously trained model as the noise distribution allows training a sequence of models of increasing
quality. This can be seen as an informal competition mechanism similar in spirit to the formal com-
petition used in the adversarial networks game. The key limitation of NCE is that its “discriminator”
is defined by the ratio of the probability densities of the noise distribution and the model distribution,
and thus requires the ability to evaluate and backpropagate through both densities.
Some previous work has used the general concept of having two neural networks compete. The most
relevant work is predictability minimization [26]. In predictability minimization, each hidden unit
in a neural network is trained to be different from the output of a second network, which predicts
the value of that hidden unit given the value of all of the other hidden units. This work differs from
predictability minimization in three important ways: 1) in this work, the competition between the
networks is the sole training criterion, and is sufficient on its own to train the network. Predictability
minimization is only a regularizer that encourages the hidden units of a neural network to be sta-
tistically independent while they accomplish some other task; it is not a primary training criterion.
2) The nature of the competition is different. In predictability minimization, two networks’ outputs
are compared, with one network trying to make the outputs similar and the other trying to make the
outputs different. The output in question is a single scalar. In GANs, one network produces a rich,
high dimensional vector that is used as the input to another network, and attempts to choose an input
that the other network does not know how to process. 3) The specification of the learning process
is different. Predictability minimization is described as an optimization problem with an objective
function to be minimized, and learning approaches the minimum of the objective function. GANs
are based on a minimax game rather than an optimization problem, and have a value function that
one agent seeks to maximize and the other seeks to minimize. The game terminates at a saddle point
that is a minimum with respect to one player’s strategy and a maximum with respect to the other
player’s strategy.
Generative adversarial networks have sometimes been confused with the related concept of “adversarial examples” [28]. Adversarial examples are examples found by using gradient-based optimization directly on the input to a classification network, in order to find examples that are similar to the data yet misclassified. This is different from the present work because adversarial examples are not a mechanism for training a generative model. Instead, adversarial examples are primarily an analysis tool for showing that neural networks behave in intriguing ways, often classifying two images differently with high confidence even though the difference between them is
imperceptible to a human observer. The existence of such adversarial examples does suggest that
generative adversarial network training could be inefficient, because they show that it is possible to
make modern discriminative networks confidently recognize a class without emulating any of the
human-perceptible attributes of that class.
3 Adversarial nets
The adversarial modeling framework is most straightforward to apply when the models are both
multilayer perceptrons. To learn the generator’s distribution pg over data x, we define a prior on
input noise variables pz (z), then represent a mapping to data space as G(z; θg ), where G is a
differentiable function represented by a multilayer perceptron with parameters θg . We also define a
second multilayer perceptron D(x; θd ) that outputs a single scalar. D(x) represents the probability
that x came from the data rather than pg . We train D to maximize the probability of assigning the
correct label to both training examples and samples from G. We simultaneously train G to minimize
log(1 − D(G(z))). In other words, D and G play the following two-player minimax game with
value function V (G, D):
min_G max_D V(D, G) = E_{x∼pdata(x)} [log D(x)] + E_{z∼pz(z)} [log(1 − D(G(z)))].   (1)
In the next section, we present a theoretical analysis of adversarial nets, essentially showing that
the training criterion allows one to recover the data generating distribution as G and D are given
enough capacity, i.e., in the non-parametric limit. See Figure 1 for a less formal, more pedagogical
explanation of the approach. In practice, we must implement the game using an iterative, numerical
approach. Optimizing D to completion in the inner loop of training is computationally prohibitive,
and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing
D and one step of optimizing G. This results in D being maintained near its optimal solution, so
long as G changes slowly enough. The procedure is formally presented in Algorithm 1.
In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning,
when G is poor, D can reject samples with high confidence because they are clearly different from
the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize
log(1 − D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the
same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
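The gradient advantage is easy to quantify. Writing a = D(G(z)), the saturating objective gives |d/da log(1 − a)| = 1/(1 − a) ≈ 1 when a is near zero, while the alternative gives |d/da log a| = 1/a, which is large exactly where the generator is doing badly. A short sketch:

```python
# Per-sample gradient magnitudes with respect to a = D(G(z)).
def saturating_grad(a):
    return 1.0 / (1.0 - a)   # |d/da log(1 - a)|

def non_saturating_grad(a):
    return 1.0 / a           # |d/da log a|

a = 0.01  # the discriminator confidently rejects the sample
assert abs(saturating_grad(a) - 1.0101) < 1e-3   # tiny learning signal
assert non_saturating_grad(a) == 100.0           # strong learning signal
# Both objectives share the fixed point a = 1/2, where the
# gradient magnitudes coincide.
assert saturating_grad(0.5) == non_saturating_grad(0.5) == 2.0
```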
4 Theoretical Results
The generator G implicitly defines a probability distribution pg as the distribution of the samples
G(z) obtained when z ∼ pz . Therefore, we would like Algorithm 1 to converge to a good estimator
of pdata, if given enough capacity and training time. The results of this section are derived in a non-parametric setting, e.g. we represent a model with infinite capacity by studying convergence in the space of probability density functions.
We will show in section 4.1 that this minimax game has a global optimum for pg = pdata . We will
then show in section 4.2 that Algorithm 1 optimizes Eq 1, thus obtaining the desired result.
Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) px and those of the generative distribution pg (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on transformed samples. G contracts in regions of high density and expands in regions of low density of pg. (a) Consider an adversarial pair near convergence: pg is similar to pdata and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to D∗(x) = pdata(x) / (pdata(x) + pg(x)). (c) After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they will reach a point at which both cannot improve because pg = pdata. The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.
Algorithm 1 Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps to apply to the discriminator, k, is a hyperparameter. We used k = 1, the least expensive option, in our experiments.

for number of training iterations do
    for k steps do
        • Sample minibatch of m noise samples {z^(1), . . . , z^(m)} from noise prior pg(z).
        • Sample minibatch of m examples {x^(1), . . . , x^(m)} from data generating distribution pdata(x).
        • Update the discriminator by ascending its stochastic gradient:
              ∇_θd (1/m) Σ_{i=1}^m [ log D(x^(i)) + log(1 − D(G(z^(i)))) ].
    end for
    • Sample minibatch of m noise samples {z^(1), . . . , z^(m)} from noise prior pg(z).
    • Update the generator by descending its stochastic gradient:
              ∇_θg (1/m) Σ_{i=1}^m log(1 − D(G(z^(i)))).
end for

The gradient-based updates can use any standard gradient-based learning rule. We used momentum in our experiments.
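Algorithm 1 can be sketched end-to-end on a toy problem. Every modeling choice below is our own illustrative assumption, not the paper's experimental setup: a one-dimensional "dataset" drawn from N(2, 0.5²), a location-scale generator G(z) = a·z + b, a logistic discriminator on the features (x, x², 1) with manually derived gradients, k = 1, and the alternative maximize-log D(G(z)) generator update from Section 3 in place of the descending update shown in Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-np.clip(s, -60, 60)))
feats = lambda x: np.stack([x, x * x, np.ones_like(x)], axis=1)

w = np.zeros(3)       # discriminator: D(x) = sigmoid(w . feats(x))
a, b = 1.0, 0.0       # generator: G(z) = a*z + b, z ~ N(0, 1)
lr_d, lr_g, m = 0.05, 0.02, 64

for _ in range(2000):
    # --- k = 1 discriminator step: ascend log D(x) + log(1 - D(G(z)))
    x_real = rng.normal(2.0, 0.5, m)
    x_fake = a * rng.standard_normal(m) + b
    d_real = sigmoid(feats(x_real) @ w)
    d_fake = sigmoid(feats(x_fake) @ w)
    grad_w = (feats(x_real) * (1 - d_real)[:, None]).mean(0) \
             - (feats(x_fake) * d_fake[:, None]).mean(0)
    w += lr_d * grad_w
    # --- generator step: ascend log D(G(z)) through x = a*z + b
    z = rng.standard_normal(m)
    x_fake = a * z + b
    d = sigmoid(feats(x_fake) @ w)
    ds_dx = w[0] + 2 * w[1] * x_fake       # d(w . feats(x))/dx
    a += lr_g * np.mean((1 - d) * ds_dx * z)
    b += lr_g * np.mean((1 - d) * ds_dx)

# The generator mean should end up in the neighborhood of the data
# mean (2.0); we only assert a loose bound since GAN dynamics oscillate.
assert np.isfinite([a, b]).all() and abs(b - 2.0) < 2.0
```

Note that nothing here requires evaluating pg: the generator is updated only through gradients flowing back from the discriminator, exactly the property the algorithm is designed to exploit.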
4.1 Global Optimality of pg = pdata

Proposition 1. For G fixed, the optimal discriminator D is

    D∗_G(x) = pdata(x) / (pdata(x) + pg(x))   (2)
Proof. The training criterion for the discriminator D, given any generator G, is to maximize the quantity V(G, D):

    V(G, D) = ∫_x pdata(x) log(D(x)) dx + ∫_z pz(z) log(1 − D(g(z))) dz
            = ∫_x [ pdata(x) log(D(x)) + pg(x) log(1 − D(x)) ] dx.

For any (a, b) ∈ ℝ² \ {0, 0}, the function y ↦ a log(y) + b log(1 − y) achieves its maximum in [0, 1] at a/(a + b), which gives Eq. 2.
Note that the training objective for D can be interpreted as maximizing the log-likelihood for es-
timating the conditional probability P (Y = y|x), where Y indicates whether x comes from pdata
(with y = 1) or from pg (with y = 0). The minimax game in Eq. 1 can now be reformulated as:
C(G) = max_D V(G, D)
     = E_{x∼pdata} [log D∗_G(x)] + E_{z∼pz} [log(1 − D∗_G(G(z)))]   (4)
     = E_{x∼pdata} [log D∗_G(x)] + E_{x∼pg} [log(1 − D∗_G(x))]
     = E_{x∼pdata} [ log ( pdata(x) / (pdata(x) + pg(x)) ) ] + E_{x∼pg} [ log ( pg(x) / (pdata(x) + pg(x)) ) ]
Theorem 1. The global minimum of the virtual training criterion C(G) is achieved if and only if pg = pdata. At that point, C(G) achieves the value − log 4.

Proof. For pg = pdata, D∗_G(x) = 1/2 (consider Eq. 2). Hence, by inspecting Eq. 4 at D∗_G(x) = 1/2, we find C(G) = log(1/2) + log(1/2) = − log 4. To see that this is the best possible value of C(G), reached only for pg = pdata, observe that

    E_{x∼pdata} [− log 2] + E_{x∼pg} [− log 2] = − log 4
and that by subtracting this expression from C(G) = V(D∗_G, G), we obtain:

    C(G) = − log(4) + KL( pdata ‖ (pdata + pg)/2 ) + KL( pg ‖ (pdata + pg)/2 )   (5)
where KL is the Kullback–Leibler divergence. We recognize in the previous expression the Jensen–
Shannon divergence between the model’s distribution and the data generating process:
C(G) = − log(4) + 2 · JSD(pdata ‖ pg)   (6)
Since the Jensen–Shannon divergence between two distributions is always non-negative, and zero
iff they are equal, we have shown that C ∗ = − log(4) is the global minimum of C(G) and that the
only solution is pg = pdata , i.e., the generative model perfectly replicating the data distribution.
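Equations 5 and 6 can be verified numerically for discrete distributions: plugging the optimal discriminator of Eq. 2 into C(G) reproduces − log 4 + 2 · JSD(pdata ‖ pg) exactly, and the value − log 4 is attained only when the two distributions coincide. A sketch (the two example distributions are arbitrary):

```python
import numpy as np

def C(p_data, p_g):
    """Inner-maximized game value C(G) = max_D V(G, D) for discrete
    distributions, using the optimal discriminator of Eq. 2."""
    d_star = p_data / (p_data + p_g)
    return np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

def kl(p, q):
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    mix = (p + q) / 2
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

p = np.array([0.1, 0.6, 0.3])   # stand-in for p_data
q = np.array([0.3, 0.3, 0.4])   # stand-in for p_g
# Eq. 6: C(G) = -log 4 + 2 * JSD(p_data || p_g), an exact identity.
assert np.isclose(C(p, q), -np.log(4) + 2 * jsd(p, q))
# Theorem 1: the global minimum -log 4 is reached exactly at p_g = p_data.
assert np.isclose(C(p, p), -np.log(4)) and C(p, q) > -np.log(4)
```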
Proposition 2. If G and D have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given G, and pg is updated so as to improve the criterion

    E_{x∼pdata} [log D∗_G(x)] + E_{x∼pg} [log(1 − D∗_G(x))]

then pg converges to pdata.

Proof. Consider V(G, D) = U(pg, D) as a function of pg as done in the above criterion. Note that U(pg, D) is convex in pg. The subderivatives of a supremum of convex functions include the derivative of the function at the point where the maximum is attained. In other words, if f(x) = sup_{α∈A} f_α(x) and f_α(x) is convex in x for every α, then ∂f_β(x) ∈ ∂f if β = arg sup_{α∈A} f_α(x). This is equivalent to computing a gradient descent update for pg at the optimal D given the corresponding G. sup_D U(pg, D) is convex in pg with a unique global optimum, as proven in Thm 1; therefore, with sufficiently small updates of pg, pg converges to pdata, concluding the proof.
In practice, adversarial nets represent a limited family of pg distributions via the function G(z; θg ),
and we optimize θg rather than pg itself, so the proofs do not apply. However, the excellent perfor-
mance of multilayer perceptrons in practice suggests that they are a reasonable model to use despite
their lack of theoretical guarantees.
Model MNIST TFD
DBN [3] 138 ± 2 1909 ± 66
Stacked CAE [3] 121 ± 1.6 2110 ± 50
Deep GSN [5] 214 ± 1.1 1890 ± 29
Adversarial nets 225 ± 2 2057 ± 26
Table 1: Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen by cross-validation on the validation set of each fold; the mean log-likelihood on each fold was then computed. For MNIST we compare against other models of the real-valued (rather than binary) version of the dataset.
5 Experiments
We trained adversarial nets on a range of datasets including MNIST [21], the Toronto Face Database
(TFD) [27], and CIFAR-10 [19]. The generator nets used a mixture of rectifier linear activations [17,
8] and sigmoid activations, while the discriminator net used maxout [9] activations. Dropout [16]
was applied in training the discriminator net. While our theoretical framework permits the use of
dropout and other noise at intermediate layers of the generator, we used noise as the input to only
the bottommost layer of the generator network.
We estimate the probability of the test set data under pg by fitting a Gaussian Parzen window to the
samples generated with G and reporting the log-likelihood under this distribution. The σ parameter
of the Gaussians was obtained by cross validation on the validation set. This procedure was intro-
duced in Breuleux et al. [7] and used for various generative models for which the exact likelihood
is not tractable [24, 3, 4]. Results are reported in Table 1. This method of estimating the likelihood
has somewhat high variance and does not perform well in high dimensional spaces but it is the best
method available to our knowledge. Advances in generative models that can sample but not estimate
likelihood directly motivate further research into how to evaluate such models. In Figures 2 and 3
we show samples drawn from the generator net after training. While we make no claim that these
samples are better than samples generated by existing methods, we believe that these samples are at
least competitive with the better generative models in the literature and highlight the potential of the
adversarial framework.
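The Parzen-window evaluation described above can be sketched in a few lines for one-dimensional data (the helper names and toy data are ours; the actual evaluation used Yann Dauphin's code on high-dimensional images, where this estimator behaves much worse):

```python
import numpy as np

def parzen_mean_loglik(test, samples, sigma):
    """Mean log-likelihood of `test` points under an isotropic Gaussian
    Parzen window fit to generator `samples` (1-D sketch):
    log p(x) = logsumexp_i[-(x - s_i)^2 / (2 sigma^2)]
               - log n - 0.5 * log(2 pi sigma^2)."""
    d2 = (test[:, None] - samples[None, :]) ** 2 / (2 * sigma ** 2)
    mx = (-d2).max(axis=1, keepdims=True)          # stabilized logsumexp
    log_kernel = mx[:, 0] + np.log(np.exp(-d2 - mx).sum(axis=1))
    return (np.mean(log_kernel) - np.log(len(samples))
            - 0.5 * np.log(2 * np.pi * sigma ** 2))

rng = np.random.default_rng(0)
samples = rng.standard_normal(5000)                # stand-in for draws from G
valid, test = rng.standard_normal(500), rng.standard_normal(500)
# sigma is chosen by cross-validation on a held-out validation set.
sigmas = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
sigma = max(sigmas, key=lambda s: parzen_mean_loglik(valid, samples, s))
est = parzen_mean_loglik(test, samples, sigma)
# For N(0, 1) data the true mean log-likelihood is about -1.42.
assert abs(est - (-0.5 * np.log(2 * np.pi) - 0.5)) < 0.2
```

The final assertion illustrates the method's sanity on a case where the true likelihood is known; in high dimensions the variance of the estimate grows sharply, which is exactly the limitation noted above.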
6 Advantages and disadvantages

This new framework comes with advantages and disadvantages relative to previous modeling frameworks. The disadvantages are primarily that there is no explicit representation of pg(x), and that D
must be synchronized well with G during training (in particular, G must not be trained too much
without updating D, in order to avoid “the Helvetica scenario” in which G collapses too many values
of z to the same value of x to have enough diversity to model pdata ), much as the negative chains of a
Boltzmann machine must be kept up to date between learning steps. The advantages are that Markov
chains are never needed, only backprop is used to obtain gradients, no inference is needed during
learning, and a wide variety of functions can be incorporated into the model. Table 2 summarizes
the comparison of generative adversarial nets with other generative modeling approaches.
The aforementioned advantages are primarily computational. Adversarial models may also gain
some statistical advantage from the generator network not being updated directly with data exam-
ples, but only with gradients flowing through the discriminator. This means that components of the
input are not copied directly into the generator’s parameters. Another advantage of adversarial net-
works is that they can represent very sharp, even degenerate distributions, while methods based on
Markov chains require that the distribution be somewhat blurry in order for the chains to be able to
mix between modes.
Figure 2: Visualization of samples from the model. Rightmost column shows the nearest training example of
the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples
are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these
images show actual samples from the model distributions, not conditional means given samples of hidden units.
Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain
mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator
and “deconvolutional” generator)
Figure 3: Digits obtained by linearly interpolating between coordinates in z space of the full model.
7 Conclusions and future work

1. A conditional generative model p(x | c) can be obtained by adding c as input to both G and D.
2. Learned approximate inference can be performed by training an auxiliary network to predict z
given x. This is similar to the inference net trained by the wake-sleep algorithm [15] but with
the advantage that the inference net may be trained for a fixed generator net after the generator
net has finished training.
3. One can approximately model all conditionals p(x_S | x_{¬S}), where S is a subset of the indices of x, by training a family of conditional models that share parameters. Essentially, one can use adversarial nets to implement a stochastic extension of the deterministic MP-DBM [10].
4. Semi-supervised learning: features from the discriminator or inference net could improve perfor-
mance of classifiers when limited labeled data is available.
5. Efficiency improvements: training could be accelerated greatly by devising better methods for
coordinating G and D or determining better distributions to sample z from during training.
This paper has demonstrated the viability of the adversarial modeling framework, suggesting that
these research directions could prove useful.
Table 2: Challenges in generative modeling: a summary of the difficulties encountered by different approaches to deep generative modeling for each of the major operations involving a model.

Training:
• Deep directed graphical models: inference needed during training.
• Deep undirected graphical models: inference needed during training; MCMC needed to approximate the partition function gradient.
• Generative autoencoders: enforced tradeoff between mixing and power of reconstruction generation.
• Adversarial models: synchronizing the discriminator with the generator; Helvetica.

Inference:
• Deep directed graphical models: learned approximate inference.
• Deep undirected graphical models: variational inference.
• Generative autoencoders: MCMC-based inference.
• Adversarial models: learned approximate inference.

Sampling:
• Deep directed graphical models: no difficulties.
• Deep undirected graphical models: requires Markov chain.
• Generative autoencoders: requires Markov chain.
• Adversarial models: no difficulties.

Evaluating p(x):
• Deep directed and undirected graphical models: intractable, may be approximated with AIS.
• Generative autoencoders and adversarial models: not explicitly represented, may be approximated with Parzen density estimation.

Model design:
• Deep directed graphical models: models need to be designed to work with the desired inference scheme; some inference schemes support similar model families as GANs.
• Deep undirected graphical models: careful design needed to ensure multiple properties.
• Generative autoencoders and adversarial models: any differentiable function is theoretically permitted.
Acknowledgments
We would like to acknowledge Patrice Marcotte, Olivier Delalleau, Kyunghyun Cho, Guillaume
Alain and Jason Yosinski for helpful discussions. Yann Dauphin shared his Parzen window eval-
uation code with us. We would like to thank the developers of Pylearn2 [11] and Theano [6, 1],
particularly Frédéric Bastien who rushed a Theano feature specifically to benefit this project. Ar-
naud Bergeron provided much-needed support with LaTeX typesetting. We would also like to thank
CIFAR, and Canada Research Chairs for funding, and Compute Canada, and Calcul Québec for
providing computational resources. Ian Goodfellow is supported by the 2013 Google Fellowship in
Deep Learning. Finally, we would like to thank Les Trois Brasseurs for stimulating our creativity.
References
[1] Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and
Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised
Feature Learning NIPS 2012 Workshop.
[2] Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers.
[3] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In
ICML’13.
[4] Bengio, Y., Thibodeau-Laufer, E., and Yosinski, J. (2014a). Deep generative stochastic networks trainable
by backprop. In ICML’14.
[5] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014b). Deep generative stochastic net-
works trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning
(ICML’14).
[6] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley,
D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the
Python for Scientific Computing Conference (SciPy). Oral Presentation.
[7] Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an
RBM-derived process. Neural Computation, 23(8), 2053–2073.
[8] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In AISTATS’2011.
[9] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks.
In ICML’2013.
[10] Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann
machines. In NIPS’2013.
[11] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra,
J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint
arXiv:1308.4214.
[12] Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks.
In ICML’2014.
[13] Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial
Intelligence and Statistics (AISTATS’10).
[14] Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P.,
Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition.
IEEE Signal Processing Magazine, 29(6), 82–97.
[15] Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised
neural networks. Science, 268, 1558–1161.
[16] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012b). Improving
neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.
[17] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture
for object recognition? In Proc. International Conference on Computer Vision (ICCV’09), pages 2146–2153.
IEEE.
[18] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR).
[19] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical
report, University of Toronto.
[20] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional
neural networks. In NIPS’2012.
[21] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[22] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. Technical
report, arXiv preprint arXiv:1402.0030.
[23] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate
inference in deep generative models. Technical report, arXiv:1401.4082.
[24] Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive
auto-encoders. In ICML’12.
[25] Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS’2009, pages 448–
455.
[26] Schmidhuber, J. (1992). Learning factorial codes by predictability minimization. Neural Computation,
4(6), 863–879.
[27] Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML
TR 2010-001, U. Toronto.
[28] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014).
Intriguing properties of neural networks. ICLR, abs/1312.6199.
[29] Tu, Z. (2007). Learning generative models via discriminative approaches. In Computer Vision and Pattern
Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE.
9
Character-level Convolutional Networks for Text
Classification∗
Abstract
1 Introduction
Text classification is a classic topic for natural language processing, in which one needs to assign
predefined categories to free-text documents. The range of text classification research goes from
designing the best features to choosing the best possible machine learning classifiers. To date,
almost all techniques of text classification are based on words, in which simple statistics of some
ordered word combinations (such as n-grams) usually perform the best [12].
On the other hand, many researchers have found convolutional networks (ConvNets) [17] [18] are
useful in extracting information from raw signals, ranging from computer vision applications to
speech recognition and others. In particular, time-delay networks used in the early days of deep
learning research are essentially convolutional networks that model sequential data [1] [31].
In this article we explore treating text as a kind of raw signal at the character level, and applying temporal (one-dimensional) ConvNets to it. For this article we only used a classification task as a way to exemplify ConvNets’ ability to understand texts. Historically we know that ConvNets usually require large-scale datasets to work, therefore we also build several large-scale datasets. An extensive set of comparisons is offered with traditional models and other deep learning models.
Applying convolutional networks to text classification, or natural language processing at large, has been explored in the literature. It has been shown that ConvNets can be directly applied to distributed [6] [16] or discrete [13] embeddings of words, without any knowledge of the syntactic or semantic structure of a language. These approaches have been proven competitive with traditional models.
There are also related works that use character-level features for language processing. These in-
clude using character-level n-grams with linear classifiers [15], and incorporating character-level
features to ConvNets [28] [29]. In particular, these ConvNet approaches use words as a basis, in
which character-level features extracted at word [28] or word n-gram [29] level form a distributed
representation. Improvements for part-of-speech tagging and information retrieval were observed.
This article is the first to apply ConvNets only on characters. We show that when trained on large-scale datasets, deep ConvNets do not require the knowledge of words, in addition to the conclusion from previous research that ConvNets do not require knowledge about the syntactic or semantic structure of a language. This simplification of engineering could be crucial for a single system that can work for different languages, since characters always constitute a necessary construct regardless of whether segmentation into words is possible. Working on only characters also has the advantage that abnormal character combinations such as misspellings and emoticons may be naturally learnt.

∗ An early version of this work, entitled “Text Understanding from Scratch”, was posted in February 2015 as arXiv:1502.01710. The present paper has considerably more experimental results and a rewritten introduction.
2 Character-level Convolutional Networks

In this section, we introduce the design of character-level ConvNets for text classification. The design is modular, where the gradients are obtained by back-propagation [27] to perform optimization.
The main component is the temporal convolutional module, which simply computes a 1-D convolution. Suppose we have a discrete input function g(x) ∈ [1, l] → R and a discrete kernel function f(x) ∈ [1, k] → R. The convolution h(y) ∈ [1, ⌊(l − k + 1)/d⌋] → R between f(x) and g(x) with stride d is defined as

    h(y) = Σ_{x=1}^{k} f(x) · g(y · d − x + c),
where c = k − d + 1 is an offset constant. Just as in traditional convolutional networks in vision, the module is parameterized by a set of such kernel functions fij(x) (i = 1, 2, . . . , m and j = 1, 2, . . . , n) which we call weights, on a set of inputs gi(x) and outputs hj(y). We call each gi (or hj) an input (or output) feature, and m (or n) the input (or output) feature size. The output hj(y) is obtained by a sum over i of the convolutions between gi(x) and fij(x).
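As a concrete illustration, the convolution above can be written directly in code. This is a sketch with function names of our own, not the paper's Torch implementation; the formula is 1-based, so list indices are shifted internally.

```python
def temporal_conv(g, f, d):
    """1-D convolution h(y) = sum_{x=1..k} f(x) * g(y*d - x + c), c = k - d + 1.

    g: input sequence of length l, f: kernel of length k, d: stride.
    Output length is floor((l - k + 1) / d), as in the text.
    """
    l, k = len(g), len(f)
    c = k - d + 1
    out_len = (l - k + 1) // d
    h = []
    for y in range(1, out_len + 1):          # 1-based output position
        total = 0.0
        for x in range(1, k + 1):            # 1-based kernel position
            total += f[x - 1] * g[y * d - x + c - 1]
        h.append(total)
    return h
```

In a full model, each output feature hj would be the sum of such convolutions over the input features gi, as described above.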
One key module that helped us to train deeper models is temporal max-pooling. It is the 1-D version of the max-pooling module used in computer vision [2]. Given a discrete input function g(x) ∈ [1, l] → R, the max-pooling function h(y) ∈ [1, ⌊(l − k + 1)/d⌋] → R of g(x) is defined as

    h(y) = max_{x=1}^{k} g(y · d − x + c),

where c = k − d + 1 is an offset constant. This very pooling module enabled us to train ConvNets
deeper than 6 layers, where all others fail. The analysis by [3] might shed some light on this.
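The max-pooling formula can be sketched the same way (function name is our own):

```python
def temporal_max_pool(g, k, d):
    """Temporal max-pooling h(y) = max_{x=1..k} g(y*d - x + c), c = k - d + 1."""
    l = len(g)
    c = k - d + 1
    out_len = (l - k + 1) // d               # floor((l - k + 1) / d)
    return [max(g[y * d - x + c - 1] for x in range(1, k + 1))
            for y in range(1, out_len + 1)]
```

With k = d the pooling windows are non-overlapping, which is the setting used in the experiments below.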
The non-linearity used in our model is the rectifier or thresholding function h(x) = max{0, x}, which makes our convolutional layers similar to rectified linear units (ReLUs) [24]. The algorithm used is stochastic gradient descent (SGD) with a minibatch size of 128, using momentum [26] [30] of 0.9 and an initial step size of 0.01 which is halved every 3 epochs for 10 times. Each epoch takes a fixed number of random training samples uniformly sampled across classes. This number will later be detailed for each dataset separately. The implementation is done using Torch 7 [4].
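The step-size schedule described here can be sketched as follows. The function name and the exact halving boundaries are our own reading of "halved every 3 epochs for 10 times":

```python
def step_size(epoch, initial=0.01, halve_every=3, max_halvings=10):
    """SGD step size: start at 0.01, halve every 3 epochs, at most 10 times."""
    halvings = min(epoch // halve_every, max_halvings)
    return initial * 0.5 ** halvings
```

After the tenth halving the step size stays constant for the remaining epochs.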
Our models accept a sequence of encoded characters as input. The encoding is done by prescribing an alphabet of size m for the input language, and then quantizing each character using 1-of-m encoding (or “one-hot” encoding). Then, the sequence of characters is transformed to a sequence of such m-sized vectors with fixed length l0. Any character exceeding length l0 is ignored, and any characters that are not in the alphabet, including blank characters, are quantized as all-zero vectors. The character quantization order is backward so that the latest reading on characters is always placed near the beginning of the output, making it easy for fully connected layers to associate weights with the latest reading.
The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, 33 other characters and the new line character. The non-space characters are:
abcdefghijklmnopqrstuvwxyz0123456789
-,;.!?:’’’/\|_@#$%ˆ&*˜‘+-=<>()[]{}
Later we also compare with models that use a different alphabet in which we distinguish between
upper-case and lower-case letters.
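A minimal sketch of this quantization follows. The punctuation set below is transcribed from the listing above and may differ slightly from the paper's exact 70-character alphabet; the variable names are our own.

```python
ALPHABET = ("abcdefghijklmnopqrstuvwxyz0123456789"
            "-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n")
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def quantize(text, l0=1014):
    """One-hot encode text into an l0 x m matrix (m = alphabet size).

    Characters beyond position l0 are ignored; characters outside the
    alphabet (including space) become all-zero rows. The reading order is
    backward, so the latest characters land at the beginning of the output.
    """
    m = len(ALPHABET)
    matrix = [[0] * m for _ in range(l0)]
    for pos, ch in enumerate(reversed(text[:l0])):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix
```

Uppercase input would map to all-zero rows under this base alphabet, so text is assumed to be lowercased beforehand.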
2.3 Model Design
We designed 2 ConvNets – one large and one small. They are both 9 layers deep with 6 convolutional
layers and 3 fully-connected layers. Figure 1 gives an illustration.
[Figure 1: illustration of the model. An input text of fixed length is quantized character by character into feature vectors and fed to the convolutional layers.]
The input has 70 features due to our character quantization method, and the input feature length is 1014. It seems that 1014 characters could already capture most of the texts of interest. We also insert 2 dropout [10] modules in between the 3 fully-connected layers to regularize, each with dropout probability 0.5. Table 1 lists the configurations for convolutional layers, and Table 2 lists the configurations for fully-connected (linear) layers.
Table 1: Convolutional layers used in our experiments. The convolutional layers have stride 1 and
pooling layers are all non-overlapping ones, so we omit the description of their strides.
We initialize the weights using a Gaussian distribution. The mean and standard deviation used for initializing the large model are (0, 0.02), and for the small model (0, 0.05).
Table 2: Fully-connected layers used in our experiments. The number of output units for the last
layer is determined by the problem. For example, for a 10-class classification problem it will be 10.
For different problems the input lengths may be different (for example, in our case l0 = 1014), and so are the frame lengths. From our model design, it follows that given input length l0, the output frame length after the last convolutional layer (but before any of the fully-connected layers) is l6 = (l0 − 96)/27. This number multiplied by the frame size at layer 6 gives the input dimension the first fully-connected layer accepts.
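For instance, with the default input length the computation works out as follows (a trivial helper with a name of our own; l6 must come out an integer for the architecture to be valid):

```python
def last_frame_length(l0):
    """Frame length after the last convolutional layer: l6 = (l0 - 96) / 27."""
    assert (l0 - 96) % 27 == 0, "input length must make l6 an integer"
    return (l0 - 96) // 27
```

With l0 = 1014 this gives l6 = (1014 − 96)/27 = 34.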
Many researchers have found that appropriate data augmentation techniques are useful for controlling generalization error in deep learning models. These techniques usually work well when we can find appropriate invariance properties that the model should possess. In terms of texts, it is not reasonable to augment the data using signal transformations as done in image or speech recognition, because the exact order of characters may form rigorous syntactic and semantic meaning. Therefore, the best way to do data augmentation would have been using human rephrases of sentences, but this is unrealistic and expensive due to the large volume of samples in our datasets. As a result, the most natural choice in data augmentation for us is to replace words or phrases with their synonyms.
We experimented with data augmentation by using an English thesaurus, obtained from the mytheas component of the LibreOffice1 project. That thesaurus in turn was obtained from WordNet [7], where every synonym to a word or phrase is ranked by the semantic closeness to the most frequently seen meaning. To decide how many words to replace, we extract all replaceable words from the given text and randomly choose r of them to be replaced. The probability of the number r is determined by a geometric distribution with parameter p in which P[r] ∼ p^r. The index s of the synonym chosen for a given word is also determined by another geometric distribution in which P[s] ∼ q^s. This way, the probability of a synonym being chosen becomes smaller when it moves distant from the most frequently seen meaning. We will report the results using this new data augmentation technique with p = 0.5 and q = 0.5.
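The sampling procedure above can be sketched as follows. All names are our own, and `synonyms` is a hypothetical mapping from a word to its ranked synonym list (index 0 = closest to the most frequent meaning):

```python
import random

def augment(words, synonyms, p=0.5, q=0.5, rng=random):
    """Thesaurus augmentation sketch.

    The number r of replaced words follows P[r] ~ p**r; the synonym index s
    for each replaced word follows P[s] ~ q**s.
    """
    def geometric(param, upper):
        # sample n with P[n] proportional to param**n, truncated at `upper`
        n = 0
        while n < upper and rng.random() < param:
            n += 1
        return n

    replaceable = [i for i, w in enumerate(words) if synonyms.get(w)]
    r = geometric(p, len(replaceable))
    out = list(words)
    for i in rng.sample(replaceable, r):
        ranked = synonyms[words[i]]
        out[i] = ranked[geometric(q, len(ranked) - 1)]
    return out
```

With p = q = 0.5, roughly half the replaceable words keep their original form, and chosen synonyms concentrate near the top of the ranked list.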
3 Comparison Models
To offer fair comparisons to competitive models, we conducted a series of experiments with both tra-
ditional and deep learning methods. We tried our best to choose models that can provide comparable
and competitive results, and the results are reported faithfully without any model selection.
We refer to traditional methods as those that use a hand-crafted feature extractor and a linear classifier. The classifier used is a multinomial logistic regression in all of these models.
Bag-of-words and its TFIDF. For each dataset, the bag-of-words model is constructed by selecting the 50,000 most frequent words from the training subset. For the normal bag-of-words, we use the counts of each word as the features. For the TFIDF (term-frequency inverse-document-frequency) [14] version, we use the counts as the term frequency. The inverse document frequency is the logarithm of the division between the total number of samples and the number of samples containing the word in the training subset. The features are normalized by dividing by the largest feature value.
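The TF-IDF feature described here can be sketched as below. This is a simplified illustration with whitespace tokenization (the paper's exact tokenizer is not specified), and the function names are our own:

```python
import math
from collections import Counter

def make_tfidf(train_docs, vocab_size=50000):
    """Bag-of-words TF-IDF: counts as term frequency, IDF = log(N / doc freq),
    features normalized by dividing by the largest feature value."""
    tokenized = [doc.split() for doc in train_docs]
    counts = Counter(w for toks in tokenized for w in toks)
    vocab = [w for w, _ in counts.most_common(vocab_size)]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) for w in vocab}

    def features(doc):
        tf = Counter(doc.split())
        vec = [tf[w] * idf[w] for w in vocab]
        peak = max(vec) if vec and max(vec) > 0 else 1.0
        return [v / peak for v in vec]
    return features
```

A word that appears in every training document gets IDF = log(1) = 0 and therefore contributes nothing to the feature vector.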
Bag-of-ngrams and its TFIDF. The bag-of-ngrams models are constructed by selecting the 500,000
most frequent n-grams (up to 5-grams) from the training subset for each dataset. The feature values
are computed the same way as in the bag-of-words model.
Bag-of-means on word embedding. We also have an experimental model that applies k-means to the word2vec [23] embeddings learnt from the training subset of each dataset, and then uses these learnt means as representatives of the clustered words. We take into consideration all the words that appeared more than 5 times in the training subset. The dimension of the embedding is 300. The bag-of-means features are computed the same way as in the bag-of-words model. The number of means is 5000.
Recently deep learning methods have started to be applied to text classification. We choose two simple and representative models for comparison, in which one is a word-based ConvNet and the other a simple long short-term memory (LSTM) [11] recurrent neural network model.
Word-based ConvNets. Among the large number of recent works on word-based ConvNets for
text classification, one of the differences is the choice of using pretrained or end-to-end learned word
representations. We offer comparisons with both using the pretrained word2vec [23] embedding [16]
and using lookup tables [5]. The embedding size is 300 in both cases, in the same way as our bag-
of-means model. To ensure fair comparison, the models for each case are of the same size as
our character-level ConvNets, in terms of both the number of layers and each layer’s output size.
Experiments using a thesaurus for data augmentation are also conducted.
1 http://www.libreoffice.org/
Long short-term memory. We also offer a comparison with a recurrent neural network model, namely long short-term memory (LSTM) [11]. The LSTM model used in our case is word-based, using pretrained word2vec embedding of size 300 as in previous models. The model is formed by taking the mean of the outputs of all LSTM cells to form a feature vector, and then using multinomial logistic regression on this feature vector. The output dimension is 512. The variant of LSTM we used is the common “vanilla” architecture [8] [9]. We also used gradient clipping [25] in which the gradient norm is limited to 5. Figure 2 gives an illustration.

[Figure 2: long short-term memory. The outputs of a sequence of LSTM cells are averaged (“Mean”) to form the feature vector.]
For the alphabet of English, one apparent choice is whether to distinguish between upper-case and lower-case letters. We report experiments on this choice and observe that it usually (but not always) gives worse results when such a distinction is made. One possible explanation is that semantics do not change with letter case, so there is a benefit of regularization.
4 Large-scale Datasets and Results

Previous research on ConvNets in different areas has shown that they usually work well with large-scale datasets, especially when the model takes in low-level raw features like characters in our case. However, most open datasets for text classification are quite small, and large-scale datasets are split with a significantly smaller training set than testing set [21]. Therefore, instead of confusing our community more by using them, we built several large-scale datasets for our experiments, ranging from hundreds of thousands to several millions of samples. Table 3 is a summary.
Table 3: Statistics of our large-scale datasets. Epoch size is the number of minibatches in one epoch
AG’s news corpus. We obtained AG’s corpus of news articles on the web2. It contains 496,835 categorized news articles from more than 2,000 news sources. We choose the 4 largest classes from this corpus to construct our dataset, using only the title and description fields. The number of training samples for each class is 30,000 and testing 1,900.
Sogou news corpus. This dataset is a combination of the SogouCA and SogouCS news corpora [32], containing in total 2,909,551 news articles in various topic channels. We then labeled each piece of news using its URL, by manually classifying their domain names. This gives us a large corpus of news articles labeled with their categories. There are a large number of categories, but most of them contain only a few articles. We choose 5 categories – “sports”, “finance”, “entertainment”, “automobile” and “technology”. The number of training samples selected for each class is 90,000 and testing 12,000. Although this is a dataset in Chinese, we used the pypinyin package combined with the jieba Chinese segmentation system to produce Pinyin – a phonetic romanization of Chinese. The models for English can then be applied to this dataset without change. The fields used are title and content.
2 http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
Table 4: Testing errors of all the models. Numbers are in percentage. “Lg” stands for “large” and “Sm” stands for “small”. “w2v” is an abbreviation for “word2vec”, and “Lk” for “lookup table”. “Th” stands for thesaurus. ConvNets labeled “Full” are those that distinguish between lower- and upper-case letters.
5 Discussion
[Figure 3: relative errors with respect to comparison models, across the datasets AG News, DBPedia, Yelp P., Yelp F., Yahoo A., Amazon F. and Amazon P. Panels shown include (d) word2vec ConvNet, (e) lookup table ConvNet and (f) full alphabet ConvNet.]
To understand the results in Table 4 further, we offer some empirical analysis in this section. To facilitate our analysis, we present the relative errors in Figure 3 with respect to comparison models. Each of these plots is computed by taking the difference between the error of the comparison model and that of our character-level ConvNet model, then dividing by the comparison model's error. All ConvNets in the figure are the large models with thesaurus augmentation.
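Each bar in these plots is therefore computed as (a trivial helper; names are our own):

```python
def relative_error(comparison_error, convnet_error):
    """Relative error as in Figure 3: positive when our model does better."""
    return (comparison_error - convnet_error) / comparison_error
```

For example, if a comparison model errs 10% of the time and our ConvNet 8%, the relative error is 0.2, i.e. a 20% improvement over the comparison model.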
Character-level ConvNet is an effective method. The most important conclusion from our experi-
ments is that character-level ConvNets could work for text classification without the need for words.
This is a strong indication that language could also be thought of as a signal no different from
any other kind. Figure 4 shows 12 random first-layer patches learnt by one of our character-level ConvNets on the DBPedia dataset.
Figure 4: First-layer weights. For each patch, height is the kernel size and width the alphabet size.
Dataset size forms a dichotomy between traditional and ConvNets models. The most obvious trend coming from all the plots in Figure 3 is that larger datasets tend to perform better. Traditional methods like n-grams TFIDF remain strong candidates for datasets of size up to several hundreds of thousands, and only when the dataset goes to the scale of several millions do we observe that character-level ConvNets start to do better.
ConvNets may work well for user-generated data. User-generated data vary in how well the texts are curated. For example, in our million-scale datasets, Amazon reviews tend to be raw user inputs, whereas users might be extra careful with their writing on Yahoo! Answers. Plots comparing word-based deep models (Figures 3c, 3d and 3e) show that character-level ConvNets work better for less curated user-generated texts. This property suggests that ConvNets may have better applicability to real-world scenarios. However, further analysis is needed to validate the hypothesis that ConvNets are truly good at identifying exotic character combinations such as misspellings and emoticons, as our experiments alone do not show any explicit evidence.
Choice of alphabet makes a difference. Figure 3f shows that changing the alphabet by distinguish-
ing between uppercase and lowercase letters could make a difference. For million-scale datasets, it
seems that not making such distinction usually works better. One possible explanation is that there
is a regularization effect, but this is to be validated.
Semantics of tasks may not matter. Our datasets consist of two kinds of tasks: sentiment analysis
(Yelp and Amazon reviews) and topic classification (all others). This dichotomy in task semantics
does not seem to play a role in deciding which method is better.
Bag-of-means is a misuse of word2vec [20]. One of the most obvious facts one could observe from Table 4 and Figure 3a is that the bag-of-means model performs worse in every case. Compared with traditional models, this suggests that such a simple use of a distributed word representation may not give us an advantage for text classification. However, our experiments do not speak to any other language processing tasks or other uses of word2vec.
There is no free lunch. Our experiments once again verify that there is no single machine learning model that can work for all kinds of datasets. The factors discussed in this section could all play a role in deciding which method is best for a specific application.
6 Conclusion and Outlook

This article offers an empirical study on character-level convolutional networks for text classification. We compared with a large number of traditional and deep learning models using several large-scale datasets. On one hand, analysis shows that character-level ConvNet is an effective method. On the other hand, how well our model performs in comparisons depends on many factors, such as dataset size, whether the texts are curated, and the choice of alphabet.
In the future, we hope to apply character-level ConvNets to a broader range of language processing tasks, especially when structured outputs are needed.
Acknowledgement
We gratefully acknowledge the support of NVIDIA Corporation with the donation of 2 Tesla K40
GPUs used for this research. We gratefully acknowledge the support of Amazon.com Inc for an
AWS in Education Research grant used for this research.
References
[1] L. Bottou, F. Fogelman Soulié, P. Blanchet, and J. Lienard. Experiments with time delay networks and
dynamic time warping for speaker independent isolated digit recognition. In Proceedings of EuroSpeech
89, volume 2, pages 537–540, Paris, France, 1989.
[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Computer
Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559–2566. IEEE, 2010.
[3] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition.
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 111–118,
2010.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning.
In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language process-
ing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
[6] C. dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Tech-
nical Papers, pages 69–78, Dublin, Ireland, August 2014. Dublin City University and Association for
Computational Linguistics.
[7] C. Fellbaum. Wordnet and wordnets. In K. Brown, editor, Encyclopedia of Language and Linguistics,
pages 665–670, Oxford, 2005. Elsevier.
[8] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
[9] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov.
1997.
[12] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer-Verlag, 1998.
[13] R. Johnson and T. Zhang. Effective use of word order for text categorization with convolutional neural
networks. CoRR, abs/1412.1058, 2014.
[14] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of
Documentation, 28(1):11–21, 1972.
[15] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos. Words versus character n-grams for anti-spam
filtering. International Journal on Artificial Intelligence Tools, 16(06):1047–1067, 2007.
[16] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Confer-
ence on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar,
October 2014. Association for Computational Linguistics.
[17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Back-
propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Winter
1989.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[19] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey,
P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from
wikipedia. Semantic Web Journal, 2014.
[20] G. Lev, B. Klein, and L. Wolf. In defense of word embedding for generic text representation. In C. Bie-
mann, S. Handschuh, A. Freitas, F. Meziane, and E. Mtais, editors, Natural Language Processing and
Information Systems, volume 9103 of Lecture Notes in Computer Science, pages 35–50. Springer Inter-
national Publishing, 2015.
[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization
research. The Journal of Machine Learning Research, 5:361–397, 2004.
[22] J. McAuley and J. Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with
review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, pages
165–172, New York, NY, USA, 2013. ACM.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and
phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Wein-
berger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013.
[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings
of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML
2013, volume 28 of JMLR Proceedings, pages 1310–1318. JMLR.org, 2013.
[26] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[27] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[28] C. D. Santos and B. Zadrozny. Learning character-level representations for part-of-speech tagging. In
Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826,
2014.
[29] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling
structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Confer-
ence on Information and Knowledge Management, pages 101–110. ACM, 2014.
[30] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum
in deep learning. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Confer-
ence on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference
Proceedings, May 2013.
[31] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay
neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(3):328–339, 1989.
[32] C. Wang, M. Zhang, S. Ma, and L. Ru. Automatic online news issue construction in web environment. In
Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 457–466, New
York, NY, USA, 2008. ACM.
review articles
doi:10.1145/2347736.2347755
A Few Useful Things to Know About Machine Learning

Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation.15 Several fine textbooks are available to interested practitioners and researchers (for example, Mitchell16 and Witten et al.24). However, much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.

Key insights:
- Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled.
- Machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is difficult to find in textbooks.
- This article summarizes 12 key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.
…corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output y_t for future examples x_t (for example, whether the spam filter correctly classifies previously unseen email messages as spam or not spam).

…the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, that I address later, is how to represent the input, in other words, what features to use.

• Evaluation. An evaluation function (also called objective function…

The accompanying table shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear
Table 1. The three components of learning algorithms.

Representation:
- Instances (k-nearest neighbor, support vector machines)
- Hyperplanes (naive Bayes, logistic regression)
- Decision trees
- Sets of rules (propositional rules, logic programs)
- Neural networks
- Graphical models (Bayesian networks, conditional random fields)

Evaluation:
- Accuracy/Error rate
- Precision and recall
- Squared error
- Likelihood
- Posterior probability
- Information gain
- K-L divergence
- Cost/Utility
- Margin

Optimization:
- Combinatorial optimization (greedy search, beam search, branch-and-bound)
- Continuous optimization: unconstrained (gradient descent, conjugate gradient, quasi-Newton methods) and constrained (linear programming, quadratic programming)

Algorithm 1. Decision tree induction.

LearnDT(TrainSet):
  if all examples in TrainSet have the same class y* then
    return MakeLeaf(y*)
  if no feature x_j has InfoGain(x_j, y) > 0 then
    y* ← most frequent class in TrainSet
    return MakeLeaf(y*)
  x* ← argmax_{x_j} InfoGain(x_j, y)
  TS0 ← examples in TrainSet with x* = 0
  TS1 ← examples in TrainSet with x* = 1
  return MakeNode(x*, LearnDT(TS0), LearnDT(TS1))

combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves. Algorithm 1 (above) shows a bare-bones decision tree learner for Boolean domains, using information gain and greedy search [20]. InfoGain(x_j, y) is the mutual information between feature x_j and the class y. MakeNode(x, c0, c1) returns a node that tests feature x and has c0 as the child for x = 0 and c1 as the child for x = 1.

Of course, not all combinations of one component from each column of the table make equal sense. For example, discrete representations naturally go with combinatorial optimization, and continuous ones with continuous optimization. Nevertheless, many learners have both discrete and continuous components, and in fact the day may not be far when every single possible combination has appeared in some learner!

Most textbooks are organized by representation, and it is easy to overlook the fact that the other components are equally important. There is no simple recipe for choosing each component, but I will touch on some of the key issues here. As we will see, some choices in a machine learning project may be even more important than the choice of learner.

It's Generalization that Counts

The fundamental goal of machine learning is to generalize beyond the examples in the training set. This is because, no matter how much data we have, it is very unlikely that we will see those exact examples again at test time. (Notice that, if there are 100,000 words in the dictionary, the spam filter described above has 2^100,000 possible different inputs.) Doing well on the training set is easy (just memorize the examples). The most common mistake among machine learning beginners is to test on the training data and have the illusion of success. If the chosen classifier is then tested on new data, it is often no better than random guessing. So, if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it. Conversely, if you have been hired to build a classifier, set some of the data aside from the beginning, and only use it to test your chosen classifier at the very end, followed by learning your final classifier on the whole data.

Contamination of your classifier by test data can occur in insidious ways, for example, if you use test data to tune parameters and do a lot of tuning. (Machine learning algorithms have lots of knobs, and success often comes from twiddling them a lot, so this is a real concern.) Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into (say) 10 subsets, holding out each one while training on the rest, testing each learned classifier on the examples it did not see, and averaging the results to see how well the particular parameter setting does.

In the early days of machine learning, the need to keep training and test data separate was not widely appreciated. This was partly because, if the learner has a very limited representation (for example, hyperplanes), the difference between training and test error may not be large. But with very flexible classifiers (for example, decision trees), or even with linear classifiers with a lot of features, strict separation is mandatory.

Notice that generalization being the goal has an interesting consequence for machine learning. Unlike in most other optimization problems, we do not have access to the function we want to optimize! We have to use training error as a surrogate for test error, and this is fraught with danger. (How to deal with it is addressed later.) On the positive side, since the objective function is only a proxy for the true goal, we may not need to fully optimize it; in fact, a local optimum returned by simple greedy search may be better than the global optimum.

Data Alone Is Not Enough

Generalization being the goal has another major consequence: Data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 − 10^6 examples whose classes you do not know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it. This notion was formalized by Wolpert in his famous "no free lunch" theorems, according to which no learner can beat random guessing over all possible functions to be learned [25].

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful…

…main, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, "IF . . . THEN . . ." rules may be the best option. The most useful learners in this regard are those that do not just have assumptions hardwired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (for example, using first-order logic [21] or grammars [6]).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it cannot get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

Overfitting Has Many Faces

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Figure 1. Bias and variance in dart-throwing. (The four panels show the combinations of low/high bias with low/high variance.)

Figure 2. Naïve Bayes can outperform a state-of-the-art rule learner (C4.5rules) even when the true classifier is a set of rules.

[Communications of the ACM, October 2012, Vol. 55, No. 10]

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner's tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees do not have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be
the same. Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this. (Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of "IF . . . THEN . . ." rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani [10] for details.) Even though the true classifier is a set of rules, with up to 1,000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes's false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

Cross-validation can help to combat overfitting, for example by using it to choose the best size of decision tree to learn. But it is no panacea, since if we use it to make too many parameter choices it can itself start to overfit [17].

Besides cross-validation, there are many methods to combat overfitting. The most popular one is adding a regularization term to the evaluation function. This can, for example, penalize classifiers with more structure, thereby favoring smaller ones with less room to overfit. Another option is to perform a statistical significance test like chi-square before adding new structure, to decide whether the distribution of the class really is different with and without this structure. These techniques are particularly useful when data is very scarce. Nevertheless, you should be skeptical of claims that a particular technique "solves" the overfitting problem. It is easy to avoid overfitting (variance) by falling into the opposite error of underfitting (bias). Simultaneously avoiding both requires learning a perfect classifier, and short of knowing it in advance there is no single technique that will always do best (no free lunch).

A common misconception about overfitting is that it is caused by noise, like training examples labeled with the wrong class. This can indeed aggravate overfitting, by making the learner draw a capricious frontier to keep those examples on what it thinks is the right side. But severe overfitting can occur even in the absence of noise. For instance, suppose we learn a Boolean classifier that is just the disjunction of the examples labeled "true" in the training set. (In other words, the classifier is a Boolean formula in disjunctive normal form, where each term is the conjunction of the feature values of one specific training example.) This classifier gets all the training examples right and every positive test example wrong, regardless of whether the training data is noisy or not.

The problem of multiple testing [13] is closely related to overfitting. Standard statistical tests assume that only one hypothesis is being tested, but modern learners can easily test millions before they are done. As a result what looks significant may in fact not be. For example, a mutual fund that beats the market 10 years in a row looks very impressive, until you realize that, if there are 1,000 funds and each has a 50% chance of beating the market on any given year, it is quite likely that one will succeed all 10 times just by luck. This problem can be combatted by correcting the significance tests to take the number of hypotheses into account, but this can also lead to underfitting. A better approach is to control the fraction of falsely accepted non-null hypotheses, known as the false discovery rate [3].

Intuition Fails in High Dimensions

After overfitting, the biggest problem in machine learning is the curse of dimensionality. This expression was coined by Bellman in 1961 to refer to the fact that many algorithms that work fine in low dimensions become intractable when the input is high-dimensional. But in machine learning it refers to much more. Generalizing correctly becomes exponentially harder as the dimensionality (number of features) of the examples grows, because a fixed-size training set covers a dwindling fraction of the input space. Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter covers only a fraction of about 10^−18 of the input space. This is what makes machine learning both necessary and hard.

More seriously, the similarity-based reasoning that machine learning algorithms depend on (explicitly or implicitly) breaks down in high dimensions. Consider a nearest neighbor classifier with Hamming distance as the similarity measure, and suppose the class is just x_1 ∧ x_2. If there are no other features, this is an easy problem. But if there are 98 irrelevant features x_3, ..., x_100, the noise from them completely swamps the signal in x_1 and x_2, and nearest neighbor effectively makes random predictions.

Even more disturbing is that nearest neighbor still has a problem even if all 100 features are relevant! This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example x_t. If the grid is d-dimensional, x_t's 2d nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of x_t, until the choice of nearest neighbor (and therefore of class) is effectively random.

This is only one instance of a more general problem with high dimensions: our intuitions, which come from a three-dimensional world, often do not apply in high-dimensional ones. In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean, but in an increasingly distant "shell" around it; and most of the volume of a high-dimensional orange is in the skin, not the pulp. If a constant number of examples is distributed uniformly in a high-dimensional hypercube, beyond some dimensionality most examples are closer to a face of the hypercube than to their nearest neighbor. And if we approximate a hypersphere by inscribing it in a hypercube, in high dimensions almost all the volume of the hypercube is outside the hypersphere. This is bad news for machine learning, where shapes of one type are often approximated by shapes of another.

Building a classifier in two or three dimensions is easy; we can find a reasonable frontier between examples of different classes just by visual inspection. (It has even been said that if people could see in high dimensions machine learning would not be necessary.) But in high dimensions it is difficult to understand what is happening. This in turn makes it difficult to design a good classifier. Naively, one might think that gathering more features never hurts, since at worst they provide no new information about the class. But in fact their benefits may be outweighed by the curse of dimensionality.

Fortunately, there is an effect that partly counteracts the curse, which might be called the "blessing of nonuniformity." In most applications examples are not spread uniformly throughout the instance space, but are concentrated on or near a lower-dimensional manifold. For example, k-nearest neighbor works quite well for handwritten digit recognition even though images of digits have one dimension per pixel, because the space of digit images is much smaller than the space of all possible images. Learners can implicitly take advantage of this lower effective dimension, or algorithms for explicitly reducing the dimensionality can be used (for example, Tenenbaum [22]).

Theoretical Guarantees Are Not What They Seem

Machine learning papers are full of theoretical guarantees. The most common type is a bound on the number of examples needed to ensure good generalization. What should you make of these guarantees? First of all, it is remarkable that they are even possible. Induction is traditionally contrasted with deduction: in deduction you can guarantee that the conclusions are correct; in induction all bets are off. Or such was the conventional wisdom for many centuries. One of the major developments of recent decades has been the realization that in fact we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees.

The basic argument is remarkably simple [5]. Let's say a classifier is bad if its true error rate is greater than ε. Then the probability that a bad classifier is consistent with n random, independent training examples is less than (1 − ε)^n. Let b be the number of bad classifiers in the learner's hypothesis space H. The probability that at least one of them is consistent is less than b(1 − ε)^n, by the union bound. Assuming the learner always returns a consistent classifier, the probability that this classifier is bad is then less than |H|(1 − ε)^n, where we have used the fact that b ≤ |H|. So if we want this probability to be less than δ, it suffices to make n > ln(δ/|H|)/ln(1 − ε) ≥ (1/ε)(ln |H| + ln(1/δ)).

Unfortunately, guarantees of this type have to be taken with a large grain of salt. This is because the bounds obtained in this way are usually extremely loose. The wonderful feature of the bound above is that the required number of examples only grows logarithmically with |H| and 1/δ. Unfortunately, most interesting hypothesis spaces are doubly exponential in the number of features d, which still leaves us needing a number of examples exponential in d. For example, consider the space of Boolean functions of d Boolean variables. If there are e possible different examples, there are 2^e possible different functions, so since there are 2^d possible examples, the total number of functions is 2^(2^d). And even for hypothesis spaces that are "merely" exponential, the bound is still very loose, because the union bound is very pessimistic. For example, if there are 100 Boolean features and the hypothesis space is decision trees with up to 10 levels, to guarantee δ = ε = 1% in the bound above we need half a million examples. But in practice a small fraction of this suffices for accurate learning.

Further, we have to be careful about what a bound like this means. For instance, it does not say that, if your learner returned a hypothesis consistent with a particular training set, then this hypothesis probably generalizes well. What it says is that, given a large enough training set, with high probability your learner will either return a hypothesis that generalizes well or be unable to find a consistent hypothesis. The bound also says nothing about how to select a good hypothesis space. It only tells us that, if the hypothesis space contains the true classifier, then the probability that the learner outputs a bad classifier decreases with training set size.
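Plugging the article's numbers into the bound n ≥ (1/ε)(ln |H| + ln(1/δ)) reproduces its "half a million examples" figure. The hypothesis count below (complete depth-10 binary trees over 100 Boolean features) is my own back-of-the-envelope assumption for illustration; the article does not spell out how |H| is counted.

```python
import math

# Sample-complexity bound from the union-bound argument:
#   n >= (1/eps) * (ln|H| + ln(1/delta))
def examples_needed(ln_H, eps, delta):
    return (ln_H + math.log(1.0 / delta)) / eps

# Assumed crude count of decision trees with up to 10 levels over
# 100 Boolean features: a complete tree has 2^10 - 1 internal nodes,
# each testing one of 100 features, and 2^10 leaves, each labeled 0 or 1,
# so ln|H| ~= 1023*ln(100) + 1024*ln(2).
ln_H = (2**10 - 1) * math.log(100) + 2**10 * math.log(2)

n = examples_needed(ln_H, eps=0.01, delta=0.01)
print(round(n))  # roughly 540,000, i.e. the "half a million examples"
```

Note how the ln(1/δ) term is negligible next to ln |H|: the hypothesis space, not the confidence level, dominates the bound.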
If we shrink the hypothesis space, the bound improves, but the chances that it contains the true classifier shrink also. (There are bounds for the case where the true classifier is not in the hypothesis space, but similar considerations apply to them.)

Another common type of theoretical guarantee is asymptotic: given infinite data, the learner is guaranteed to output the correct classifier. This is reassuring, but it would be rash to choose one learner over another because of its asymptotic guarantees. In practice, we are seldom in the asymptotic regime (also known as "asymptopia"). And, because of the bias-variance trade-off I discussed earlier, if learner A is better than learner B given infinite data, B is often better than A given finite data.

The main role of theoretical guarantees in machine learning is not as a criterion for practical decisions, but as a source of understanding and driving force for algorithm design. In this capacity, they are quite useful; indeed, the close interplay of theory and practice is one of the main reasons machine learning has made so much progress over the years. But caveat emptor: learning is a complex phenomenon, and just because a learner has a theoretical justification and works in practice does not mean the former is the reason for the latter.

Feature Engineering Is The Key

At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used. Learning is easy if you have many independent features that each correlate well with the class. On the other hand, if the class is a very complex function of the features, you may not be able to learn it. Often, the raw data is not in a form that is amenable to learning, but you can construct features from it that are. This is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and "black art" are as important as the technical stuff.

First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. But it makes sense if you consider how time-consuming it is to gather data, integrate it, clean it and preprocess it, and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather an iterative process of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that is because we have already mastered it pretty well! Feature engineering is more difficult because it is domain-specific, while learners can be largely general purpose. However, there is no sharp frontier between the two, and this is another reason the most useful learners are those that facilitate incorporating knowledge.

Of course, one of the holy grails of machine learning is to automate more and more of the feature engineering process. One way this is often done today is by automatically generating large numbers of candidate features and selecting the best by (say) their information gain with respect to the class. But bear in mind that features that look irrelevant in isolation may be relevant in combination. For example, if the class is an XOR of k input features, each of them by itself carries no information about the class. (If you want to annoy machine learners, bring up XOR.) On the other hand, running a learner with a very large number of features to find out which ones are useful in combination may be too time-consuming, or cause overfitting. So there is ultimately no replacement for the smarts you put into feature engineering.

More Data Beats a Cleverer Algorithm

Suppose you have constructed the best set of features you can, but the classifiers you receive are still not accurate enough. What can you do now? There are two main choices: design a better learning algorithm, or gather more data (more examples, and possibly more raw features, subject to the curse of dimensionality). Machine learning researchers are mainly concerned with the former, but pragmatically the quickest path to success is
often to just get more data. As a rule ers are seductive, but they are usually cycles. In research papers, learners
of thumb, a dumb algorithm with lots harder to use, because they have more are typically compared on measures
and lots of data beats a clever one with knobs you need to turn to get good re- of accuracy and computational cost.
modest amounts of it. (After all, ma- sults, and because their internals are But human effort saved and insight
chine learning is all about letting data more opaque. gained, although harder to measure,
do the heavy lifting.) Learners can be divided into two are often more important. This favors
This does bring up another prob- major types: those whose representa- learners that produce human-under-
lem, however: scalability. In most of tion has a fixed size, like linear classi- standable output (for example, rule
computer science, the two main lim- fiers, and those whose representation sets). And the organizations that make
ited resources are time and memory. can grow with the data, like decision the most of machine learning are
In machine learning, there is a third trees. (The latter are sometimes called those that have in place an infrastruc-
one: training data. Which one is the nonparametric learners, but this is ture that makes experimenting with
bottleneck has changed from decade somewhat unfortunate, since they many different learners, data sources,
to decade. In the 1980s it tended to usually wind up learning many more and learning problems easy and effi-
be data. Today it is often time. Enor- parameters than parametric ones.) cient, and where there is a close col-
mous mountains of data are avail- Fixed-size learners can only take ad- laboration between machine learning
able, but there is not enough time vantage of so much data. (Notice how experts and application domain ones.
to process it, so it goes unused. This the accuracy of naive Bayes asymptotes
leads to a paradox: even though in at around 70% in Figure 2.) Variable- Learn Many Models, Not Just One
principle more data means that more size learners can in principle learn any In the early days of machine learn-
complex classifiers can be learned, in function given sufficient data, but in ing, everyone had a favorite learner,
practice simpler classifiers wind up practice they may not, because of limi- together with some a priori reasons
being used, because complex ones tations of the algorithm (for example, to believe in its superiority. Most ef-
take too long to learn. Part of the an- greedy search falls into local optima) fort went into trying many variations
swer is to come up with fast ways to or computational cost. Also, because of it and selecting the best one. Then
learn complex classifiers, and indeed of the curse of dimensionality, no ex- systematic empirical comparisons
there has been remarkable progress isting amount of data may be enough. showed that the best learner varies
in this direction (for example, Hulten For these reasons, clever algorithms— from application to application, and
and Domingos11). those that make the most of the data systems containing many different
Part of the reason using cleverer and computing resources available— learners started to appear. Effort now
algorithms has a smaller payoff than often pay off in the end, provided you went into trying many variations of
you might expect is that, to a first ap- are willing to put in the effort. There many learners, and still selecting just
proximation, they all do the same. is no sharp frontier between design- the best one. But then researchers
This is surprising when you consider ing learners and learning classifiers; noticed that, if instead of selecting
representations as different as, say, rather, any given piece of knowledge the best variation found, we combine
…sets of rules and neural networks. But in fact propositional rules are readily encoded as neural networks, and similar relationships hold between other representations. All learners essentially work by grouping nearby examples into the same class; the key difference is in the meaning of "nearby." With nonuniformly distributed data, learners can produce widely different frontiers while still making the same predictions in the regions that matter (those with a substantial number of training examples, and therefore also where most test examples are likely to appear). This also helps explain why powerful learners can be unstable but still accurate. Figure 3 illustrates this in 2D; the effect is much stronger in high dimensions.

As a rule, it pays to try the simplest learners first (for example, naïve Bayes before logistic regression, k-nearest neighbor before support vector machines). More sophisticated learn…

…could be encoded in the learner or learned from data. So machine learning projects often wind up having a significant component of learner design, and practitioners need to have some expertise in it.12 In the end, the biggest bottleneck is not data or CPU cycles, but human cycles. …

(Figure 3. Very different frontiers can yield similar predictions. (+ and – are training examples of two classes.) Frontiers shown for: N. Bayes, kNN, SVM, D. Tree.)

…many variations, the results are better—often much better—and at little extra effort for the user. Creating such model ensembles is now standard.1 In the simplest technique, called bagging, we simply generate random variations of the training set by resampling, learn a classifier on each, and combine the results by voting. This works because it greatly reduces variance while only slightly increasing bias. In boosting, training examples have weights, and these are varied so that each new classifier focuses on the examples the previous ones tended to get wrong. In stacking, the outputs of individual classifiers become the inputs of a "higher-level" learner that figures out how best to combine them.

Many other techniques exist, and the trend is toward larger and larger ensembles. In the Netflix prize, teams from all over the world competed to build the best video recommender …
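The bagging recipe just described can be sketched in a few lines. This is an illustrative sketch, not code from the article; the function names and the toy 1-nearest-neighbour base learner are our own:

```python
# Bagging: bootstrap-resample the training set, fit one classifier per
# replicate, and combine their predictions by majority vote.
import numpy as np

def bagging_predict(fit, X, y, X_test, n_models=11, seed=0):
    """fit(X, y) must return a predict(Xq) function; majority-vote labels."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        predict = fit(X[idx], y[idx])
        votes.append(predict(X_test))
    votes = np.array(votes)                          # (n_models, n_test)
    # majority vote per test point
    return np.array([np.bincount(v).argmax() for v in votes.T])

# Toy base learner: 1-nearest neighbour (high variance, so bagging helps).
def fit_1nn(X, y):
    def predict(Xq):
        d = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return y[d.argmin(axis=1)]
    return predict
```

Averaging many bootstrap replicates is exactly the variance-reduction argument made above: each 1-NN model is unstable, but the vote is not.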
…theorems of the form "Every function can be represented, or approximated arbitrarily closely, using this representation." Reassured by this, fans of the representation often proceed to ignore all others. However, just because a function can be represented does not mean it can be learned. For example, standard decision tree learners cannot learn trees with more leaves than there are training examples. In continuous spaces, representing even simple functions using a fixed set of primitives often requires an infinite number of components. Further, if the hypothesis space has many local optima of the evaluation function, as is often the case, the learner may not find the true function even if it is representable. Given finite data, time and memory, standard learners can learn only a tiny subset of all possible functions, and these subsets are different for learners with different representations. Therefore the key question is not "Can it be represented?", to which the answer is often trivial, but "Can it be learned?" And it pays to try different learners (and possibly combine them).

Some representations are exponentially more compact than others for some functions. As a result, they may also require exponentially less data to learn those functions. Many learners work by forming linear combinations of simple basis functions. For example, support vector machines form combinations of kernels centered at some of the training examples (the support vectors). Representing the parity of n bits in this way requires 2^n basis functions. But using a representation with more layers (that is, more steps between input and output), parity can be encoded in a linear-size classifier. Finding methods to learn these deeper representations is one of the major research frontiers in machine learning.2

Correlation Does Not Imply Causation

The point that correlation does not imply causation is made so often that it is perhaps not worth belaboring. But, even though learners of the kind we have been discussing can only learn correlations, their results are often treated as representing causal relations. Isn't this wrong? If so, then why do people do it?

More often than not, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, then perhaps putting beer next to the diaper section will increase sales. (This is a famous example in the world of data mining.) But short of actually doing the experiment it is difficult to tell. Machine learning is usually applied to observational data, where the predictive variables are not under the control of the learner, as opposed to experimental data, where they are. Some learning algorithms can potentially extract causal information from observational data, but their applicability is rather restricted.19 On the other hand, correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation (for example, trying to understand what the causal chain might be).

Many researchers believe that causality is only a convenient fiction. For example, there is no notion of causality in physical laws. Whether or not causality really exists is a deep philosophical question with no definitive answer in sight, but there are two practical points for machine learners. First, whether or not we call them "causal," we would like to predict the effects of our actions, not just correlations between observable variables. Second, if you can obtain experimental data (for example by randomly assigning visitors to different versions of a Web site), then by all means do so.14

Conclusion

Like any discipline, machine learning has a lot of "folk wisdom" that can be difficult to come by, but is crucial for success. This article summarized some of the most salient items. Of course, it is only a complement to the more conventional study of machine learning. Check out http://www.cs.washington.edu/homes/pedrod/class for a complete online machine learning course that combines formal and informal aspects. There is also a treasure trove of machine learning lectures at http://www.videolectures.net. A good open source machine learning toolkit is Weka.24 Happy learning!

References
1. Bauer, E. and Kohavi, R. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning 36 (1999), 105–142.
2. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1 (2009), 1–127.
3. Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57 (1995), 289–300.
4. Bernardo, J.M. and Smith, A.F.M. Bayesian Theory. Wiley, NY, 1994.
5. Blumer, A., Ehrenfeucht, A., Haussler, D. and Warmuth, M.K. Occam's razor. Information Processing Letters 24 (1987), 377–380.
6. Cohen, W.W. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence 68 (1994), 303–366.
7. Domingos, P. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 (1999), 409–425.
8. Domingos, P. Bayesian averaging of classifiers and the overfitting problem. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 223–230.
9. Domingos, P. A unified bias-variance decomposition and its applications. In Proceedings of the 17th International Conference on Machine Learning (Stanford, CA, 2000), Morgan Kaufmann, San Mateo, CA, 231–238.
10. Domingos, P. and Pazzani, M. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29 (1997), 103–130.
11. Hulten, G. and Domingos, P. Mining complex models from arbitrarily large databases in constant time. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Canada, 2002). ACM Press, NY, 525–531.
12. Kibler, D. and Langley, P. Machine learning as an experimental science. In Proceedings of the 3rd European Working Session on Learning (London, UK, 1988). Pitman.
13. Klockars, A.J. and Sax, G. Multiple Comparisons. Sage, Beverly Hills, CA, 1986.
14. Kohavi, R., Longbotham, R., Sommerfield, D. and Henne, R. Controlled experiments on the Web: Survey and practical guide. Data Mining and Knowledge Discovery 18 (2009), 140–181.
15. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Byers, A. Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute, 2011.
16. Mitchell, T.M. Machine Learning. McGraw-Hill, NY, 1997.
17. Ng, A.Y. Preventing "overfitting" of cross-validation data. In Proceedings of the 14th International Conference on Machine Learning (Nashville, TN, 1997). Morgan Kaufmann, San Mateo, CA, 245–253.
18. Pearl, J. On the connection between the complexity and credibility of inferred models. International Journal of General Systems 4 (1978), 255–264.
19. Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK, 2000.
20. Quinlan, J.R. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.
21. Richardson, M. and Domingos, P. Markov logic networks. Machine Learning 62 (2006), 107–136.
22. Tenenbaum, J., Silva, V. and Langford, J. A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.
23. Vapnik, V.N. The Nature of Statistical Learning Theory. Springer, NY, 1995.
24. Witten, I., Frank, E. and Hall, M. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition. Morgan Kaufmann, San Mateo, CA, 2011.
25. Wolpert, D. The lack of a priori distinctions between learning algorithms. Neural Computation 8 (1996), 1341–1390.

Pedro Domingos (pedrod@cs.washington.edu) is a professor in the Department of Computer Science and Engineering at the University of Washington, Seattle.

© 2012 ACM 0001-0782/12/10 $15.00
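The article's parity example, that depth buys exponential compactness, can be made concrete. The sketch below (ours, not the article's) contrasts a flat table over all 2^n input patterns with a chain of two-input XOR steps whose size grows only linearly with n:

```python
# Parity of n bits: a "deep" representation is linear in n.
# A flat representation enumerates all 2^n input patterns, while
# chaining two-input XOR units (one extra "layer" per bit) uses
# only n - 1 units.
from functools import reduce
from itertools import product

def parity_deep(bits):
    """n - 1 two-input XOR steps, i.e. a linear-size layered circuit."""
    return reduce(lambda a, b: a ^ b, bits, 0)

def parity_flat(bits):
    """Lookup over all 2^n patterns: an exponential-size flat representation."""
    table = {p: sum(p) % 2 for p in product((0, 1), repeat=len(bits))}
    return table[tuple(bits)]

# The two representations agree on every input.
assert all(parity_deep(list(p)) == parity_flat(list(p))
           for p in product((0, 1), repeat=6))
```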
Shuai Zheng*1, Sadeep Jayasumana*1, Bernardino Romera-Paredes1, Vibhav Vineet†1,2, Zhizhong Su3, Dalong Du3, Chang Huang3, and Philip H. S. Torr1

1 University of Oxford   2 Stanford University   3 Baidu Institute of Deep Learning
problem as a probabilistic inference problem that incorporates assumptions such as the label agreement between similar pixels. CRF inference is able to refine weak and coarse pixel-level label predictions to produce sharp boundaries and fine-grained segmentations. Therefore, intuitively, CRFs can be used to overcome the drawbacks in utilizing CNNs for pixel-level labelling tasks.

One way to utilize CRFs to improve the semantic labelling results produced by a CNN is to apply CRF inference as a post-processing step disconnected from the training of the CNN [10]. Arguably, this does not fully harness the strength of CRFs, since it is not integrated with the deep network. In this setup, the deep network is unaware of the CRF during the training phase.

In this paper, we propose an end-to-end deep learning solution for the pixel-level semantic image segmentation problem. Our formulation combines the strengths of both CNNs and CRF-based graphical models in one unified framework. More specifically, we formulate mean-field approximate inference for the dense CRF with Gaussian pairwise potentials as a Recurrent Neural Network (RNN) which can refine coarse outputs from a traditional CNN in the forward pass, while passing error differentials back to the CNN during training. Importantly, with our formulation, the whole deep network, which comprises a traditional CNN and an RNN for CRF inference, can be trained end-to-end utilizing the usual back-propagation algorithm.

Arguably, when properly trained, the proposed network should outperform a system where CRF inference is applied as a post-processing method on independent pixel-level predictions produced by a pre-trained CNN. Our experimental evaluation confirms that this is indeed the case. We evaluate the performance of our network on the popular Pascal VOC 2012 benchmark, achieving a new state-of-the-art accuracy of 74.7%.

2. Related Work

In this section we review approaches that make use of deep learning and CNNs for low-level computer vision tasks, with a focus on semantic image segmentation. A wide variety of approaches have been proposed to tackle the semantic image segmentation task using deep learning. These approaches can be categorized into two main strategies.

The first strategy is based on utilizing separate mechanisms for feature extraction, and image segmentation exploiting the edges of the image [2, 38]. One representative instance of this scheme is the application of a CNN for the extraction of meaningful features, and the use of superpixels to account for the structural pattern of the image. Two representative examples are [19, 38], where the authors first obtained superpixels from the image and then used a feature extraction process on each of them. The main disadvantage of this strategy is that errors in the initial proposals (e.g., superpixels) may lead to poor predictions, no matter how good the feature extraction process is. Pinheiro and Collobert [46] employed an RNN to model the spatial dependencies during scene parsing. In contrast to their approach, we show that a typical graphical model such as a CRF can be formulated as an RNN to form a part of a deep network, to perform end-to-end training combined with a CNN.

The second strategy is to directly learn a nonlinear model from the images to the label map. This, for example, was shown in [17], where the authors replaced the last fully connected layers of a CNN by convolutional layers to keep spatial information. An important contribution in this direction is [37], where Long et al. used the concept of fully convolutional networks, and the notion that top layers obtain meaningful features for object recognition whereas low layers keep information about the structure of the image, such as edges. In their work, connections from early layers to later layers were used to combine these cues. Bell et al. [5] and Chen et al. [10, 41] used a CRF to refine segmentation results obtained from a CNN. Bell et al. focused on material recognition and segmentation, whereas Chen et al. reported very significant improvements on semantic image segmentation. In contrast to these works, which employed CRF inference as a standalone post-processing step disconnected from the CNN training, our approach is an end-to-end trainable network that jointly learns the parameters of the CNN and the CRF in one unified deep network.

Works that use neural networks to predict structured output are found in different domains. For example, Do et al. [14] proposed an approach to combine deep neural networks and Markov networks for sequence labeling tasks. Jain et al. [26] showed that Convolutional Neural Networks can perform comparably to MRF/CRF approaches in image restoration applications. Another domain which benefits from the combination of CNNs and structured loss is handwriting recognition. In natural language processing, Yao et al. [60] show that the performance of an RNN-based word tagger can be significantly improved by incorporating elements of the CRF model. In [6], the authors combined a CNN with Hidden Markov Models for that purpose, whereas more recently, Peng et al. [45] used a modified version of CRFs. Related to this line of works, in [25] a joint CNN and CRF model was used for text recognition on natural images. Tompson et al. [57] showed the use of joint training of a CNN and an MRF for human pose estimation, while Chen et al. [11] focused on the image classification problem with a similar approach. Another prominent work is [21], in which the authors express deformable part models, a kind of MRF, as a layer in a neural network. In our approach, we cast a different graphical model as a neural network layer.

A number of approaches have been proposed for automatic learning of graphical model parameters and joint
training of classifiers and graphical models. Barbu et al. [4] proposed a joint training of a MRF/CRF model together with an inference algorithm in their Active Random Field approach. Domke [15] advocated back-propagation based parameter optimization in graphical models when approximate inference methods such as mean-field and belief propagation are used. This idea was utilized in [28], where a binary dense CRF was used for human pose estimation. Similarly, Ross et al. [47] and Stoyanov et al. [54] showed how back-propagation through belief propagation can be used to optimize model parameters. Ross et al. [21], in particular, propose an approach based on learning messages. Many of these ideas can be traced back to [55], which proposes unrolling message passing algorithms as simpler operations that could be performed within a CNN. In a different setup, Krähenbühl and Koltun [30] demonstrated automatic parameter tuning of the dense CRF when a modified mean-field algorithm is used for inference. An alternative inference approach for the dense CRF, not based on mean-field, is proposed in [61].

In contrast to the works described above, our approach shows that it is possible to formulate the dense CRF as an RNN so that one can form an end-to-end trainable system for semantic image segmentation which combines the strengths of deep learning and graphical modelling.

After our initial publication of the technical report of this work on arXiv.org, a number of independent works [49, 35] appeared on arXiv.org presenting similar joint training approaches for semantic image segmentation.

3. Conditional Random Fields

In this section we provide a brief overview of Conditional Random Fields (CRF) for pixel-wise labelling and introduce the notation used in the paper. A CRF, used in the context of pixel-wise label prediction, models pixel labels as random variables that form a Markov Random Field (MRF) when conditioned upon a global observation. The global observation is usually taken to be the image.

Let X_i be the random variable associated to pixel i, which represents the label assigned to pixel i and can take any value from a pre-defined set of labels L = {l_1, l_2, ..., l_L}. Let X be the vector formed by the random variables X_1, X_2, ..., X_N, where N is the number of pixels in the image. Given a graph G = (V, E), where V = {X_1, X_2, ..., X_N}, and a global observation (image) I, the pair (I, X) can be modelled as a CRF characterized by a Gibbs distribution of the form P(X = x | I) = (1/Z(I)) exp(−E(x | I)). Here E(x) is called the energy of the configuration x ∈ L^N and Z(I) is the partition function [33]. From now on, we drop the conditioning on I in the notation for convenience.

In the fully connected pairwise CRF model of [29], the energy of a label assignment x is given by:

E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),   (1)

where the unary energy components \psi_u(x_i) measure the inverse likelihood (and therefore, the cost) of the pixel i taking the label x_i, and the pairwise energy components \psi_p(x_i, x_j) measure the cost of assigning labels x_i, x_j to pixels i, j simultaneously. In our model, unary energies are obtained from a CNN, which, roughly speaking, predicts labels for pixels without considering the smoothness and the consistency of the label assignments. The pairwise energies provide an image data-dependent smoothing term that encourages assigning similar labels to pixels with similar properties. As was done in [29], we model pairwise potentials as weighted Gaussians:

\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k_G^{(m)}(f_i, f_j),   (2)

where each k_G^{(m)}, for m = 1, ..., M, is a Gaussian kernel applied on feature vectors. The feature vector of pixel i, denoted by f_i, is derived from image features such as spatial location and RGB values [29]. We use the same features as in [29]. The function \mu(., .), called the label compatibility function, captures the compatibility between different pairs of labels, as the name implies.

Minimizing the above CRF energy E(x) yields the most probable label assignment x for the given image. Since this exact minimization is intractable, a mean-field approximation to the CRF distribution is used for approximate maximum posterior marginal inference. It consists in approximating the CRF distribution P(X) by a simpler distribution Q(X), which can be written as the product of independent marginal distributions, i.e., Q(X) = \prod_i Q_i(X_i). The steps of the iterative algorithm for approximate mean-field inference and its reformulation as an RNN are discussed next.

(Figure 1. A mean-field iteration as a CNN. A single iteration of the mean-field algorithm can be modelled as a stack of common CNN layers.)

4. A Mean-field Iteration as a Stack of CNN Layers

A key contribution of this paper is to show that the mean-field CRF inference can be reformulated as a Recurrent
Neural Network (RNN). To this end, we first consider the individual steps of the mean-field algorithm summarized in Algorithm 1 [29], and describe them as CNN layers. Our contribution is based on the observation that the filter-based approximate mean-field inference approach for dense CRFs relies on applying Gaussian spatial and bilateral filters on the mean-field approximates in each iteration. Unlike the standard convolutional layer in a CNN, in which filters are fixed after the training stage, we use edge-preserving Gaussian filters [56, 42], the coefficients of which depend on the original spatial and appearance information of the image. These filters have the additional advantage of requiring a smaller set of parameters, despite the filter size being potentially as big as the image.

Algorithm 1 Mean-field in dense CRFs [29], broken down to common CNN operations.
  Q_i(l) ← (1/Z_i) exp(U_i(l)) for all i                            ▷ Initialization
  while not converged do
    Q̃_i^{(m)}(l) ← \sum_{j≠i} k^{(m)}(f_i, f_j) Q_j(l) for all m    ▷ Message Passing
    Q̌_i(l) ← \sum_m w^{(m)} Q̃_i^{(m)}(l)                           ▷ Weighting Filter Outputs
    Q̂_i(l) ← \sum_{l'∈L} \mu(l, l') Q̌_i(l')                        ▷ Compatibility Transform
    Q̆_i(l) ← U_i(l) − Q̂_i(l)                                       ▷ Adding Unary Potentials
    Q_i(l) ← (1/Z_i) exp(Q̆_i(l))                                   ▷ Normalizing
  end while

While reformulating the steps of the inference algorithm as CNN layers, it is essential to be able to calculate error differentials in each layer w.r.t. its inputs, in order to be able to back-propagate the error differentials to previous layers during training. We also discuss how to calculate error differentials with respect to the parameters in each layer, enabling their optimization through the back-propagation algorithm. Therefore, in our formulation, CRF parameters such as the weights of the Gaussian kernels and the label compatibility function can also be optimized automatically during the training of the full network.

Once the individual steps of the algorithm are broken down as CNN layers, the full algorithm can then be formulated as an RNN. We explain this in Section 5 after discussing the steps of Algorithm 1 in detail below. In Algorithm 1 and the remainder of this paper, we use U_i(l) to denote the negative of the unary energy introduced in the previous section, i.e., U_i(l) = −\psi_u(X_i = l). In the conventional CRF setting, this input U_i(l) to the mean-field algorithm is obtained from an independent classifier.

4.1. Initialization

In the initialization step of the algorithm, the operation Q_i(l) ← (1/Z_i) exp(U_i(l)), where Z_i = \sum_l exp(U_i(l)), is performed. Note that this is equivalent to applying a softmax function over the unary potentials U across all the labels at each pixel. The softmax function has been extensively used in CNN architectures before and is therefore well known in the deep learning community. This operation does not include any parameters, and the error differentials received at the output of the step during back-propagation can be passed down to the unary potential inputs after performing the usual backward pass calculations of the softmax transformation.

4.2. Message Passing

In the dense CRF formulation, message passing is implemented by applying M Gaussian filters on the Q values. The Gaussian filter coefficients are derived from image features such as the pixel locations and RGB values, which reflect how strongly a pixel is related to other pixels. Since the CRF is potentially fully connected, each filter's receptive field spans the whole image, making it infeasible to use a brute-force implementation of the filters. Fortunately, several approximation techniques exist to make the computation of high dimensional Gaussian filtering significantly faster. Following [29], we use the permutohedral lattice implementation [1], which can compute the filter response in O(N) time, where N is the number of pixels of the image [1].

During back-propagation, error derivatives w.r.t. the filter inputs are calculated by sending the error derivatives w.r.t. the filter outputs through the same M Gaussian filters in the reverse direction. In terms of permutohedral lattice operations, this can be accomplished by only reversing the order of the separable filters in the blur stage, while building the permutohedral lattice, splatting, and slicing in the same way as in the forward pass. Therefore, back-propagation through this filtering stage can also be performed in O(N) time. Following [29], we use two Gaussian kernels: a spatial kernel and a bilateral kernel. In this work, for simplicity, we keep the bandwidth values of the filters fixed. It is also possible to use multiple spatial and bilateral kernels with different bandwidth values and learn their optimal linear combination.
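Putting the steps of Algorithm 1 together, one full mean-field iteration can be sketched in numpy. This is an illustrative stand-in, not the authors' implementation: the M Gaussian filters are represented by dense precomputed kernel matrices with zero diagonals, whereas the paper computes them with the O(N) permutohedral lattice, and the variable names are ours:

```python
# One mean-field iteration of Algorithm 1 (dense CRF), step by step.
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def meanfield_iteration(U, Q, K, w, mu):
    """U: (N, L) negative unary energies; Q: (N, L) current marginals;
    K: (M, N, N) precomputed kernel matrices with zero diagonal (j != i);
    w: (M,) kernel weights; mu: (L, L) label compatibility matrix."""
    Qtilde = np.einsum('mij,jl->mil', K, Q)     # message passing
    Qcheck = np.einsum('m,mil->il', w, Qtilde)  # weighting filter outputs
    Qhat = Qcheck @ mu.T                        # compatibility transform
    Qbreve = U - Qhat                           # adding unary potentials
    return softmax(Qbreve, axis=1)              # normalizing
```

Note that, as in Algorithm 1, a single weight per kernel is used here; the paper's extension to class-specific kernel weights would make `w` an (M, L) array instead.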
4.3. Weighting Filter Outputs

The next step of the mean-field iteration is taking a weighted sum of the M filter outputs from the previous step, for each class label l. When each class label is considered individually, this can be viewed as a usual convolution with a 1 × 1 filter with M input channels and one output channel. Since both the inputs and the outputs to this step are known during back-propagation, the error derivative w.r.t. the filter weights can be computed, making it possible to automatically learn the filter weights (the relative contributions from each Gaussian filter output from the previous stage). The error derivative w.r.t. the inputs can also be computed in the usual manner to pass the error derivatives down to the previous stage. To obtain a higher number of tunable parameters, in contrast to [29], we use independent kernel weights for each class label. The intuition is that the relative importance of the spatial kernel vs the bilateral kernel depends on the visual class. For example, bilateral kernels may have a high importance in bicycle detection, because similarity of colours is determinant; on the other hand, they may have low importance for TV detection, given that whatever is inside the TV screen may have many different colours.

4.4. Compatibility Transform

In the compatibility transform step, outputs from the previous step (denoted by Q̌ in Algorithm 1) are shared between the labels to a varied extent, depending on the compatibility between these labels. Compatibility between two labels l and l' is parameterized by the label compatibility function \mu(l, l'). The Potts model, given by \mu(l, l') = [l ≠ l'], where [.] is the Iverson bracket, assigns a fixed penalty if different labels are assigned to pixels with similar properties. A limitation of this model is that it assigns the same penalty for all pairs of different labels. Intuitively, better results can be obtained by taking the compatibility between different label pairs into account and penalizing the assignments accordingly. For example, assigning the labels "person" and "bicycle" to nearby pixels should carry a lesser penalty than assigning the labels "sky" and "bicycle". Therefore, learning the function \mu from data is preferred to fixing it in advance with the Potts model. We also relax our compatibility transform model by assuming that \mu(l, l') ≠ \mu(l', l) in general.

The compatibility transform step can be viewed as another convolution layer where the spatial receptive field of the filter is 1 × 1, and the numbers of input and output channels are both L. Learning the weights of this filter is equivalent to learning the label compatibility function \mu. Transferring error differentials from the output of this step to the input can be done, since this step is a usual convolution operation.

4.5. Adding Unary Potentials

In this step, the output from the compatibility transform stage is subtracted element-wise from the unary inputs U. While no parameters are involved in this step, transferring error differentials can be done trivially by copying the differentials at the output of this step to both inputs with the appropriate sign.

4.6. Normalization

Finally, the normalization step of the iteration can be considered as another softmax operation with no parameters. Differentials at the output of this step can be passed on to the input using the softmax operation's backward pass.

5. The End-to-end Trainable Network

We now describe our end-to-end deep learning system for semantic image segmentation. To pave the way for this, we first explain how repeated mean-field iterations can be organized as an RNN.

5.1. CRF as RNN

In the previous section, it was shown that one iteration of the mean-field algorithm can be formulated as a stack of common CNN layers (see Fig. 1). We use the function f_θ to denote the transformation done by one mean-field iteration: given an image I, pixel-wise unary potential values U and an estimation of the marginal probabilities Q_in from the previous iteration, the next estimation of the marginal distributions after one mean-field iteration is given by f_θ(U, Q_in, I). The vector θ = {w^{(m)}, \mu(l, l')}, m ∈ {1, ..., M}, l, l' ∈ {l_1, ..., l_L}, represents the CRF parameters described in Section 4.

Multiple mean-field iterations can be implemented by repeating the above stack of layers in such a way that each iteration takes the Q value estimates from the previous iteration and the unary values in their original form. This is equivalent to treating the iterative mean-field inference as a Recurrent Neural Network (RNN), as shown in Fig. 2. Using the notation in the figure, the behaviour of the network is given by the following equations, where T is the number of mean-field iterations:

H_1(t) = softmax(U) if t = 0;  H_1(t) = H_2(t − 1) if 0 < t ≤ T,   (3)
H_2(t) = f_θ(U, H_1(t), I),  0 ≤ t ≤ T,   (4)
Y(t) = 0 if 0 ≤ t < T;  Y(t) = H_2(t) if t = T.   (5)

We name this RNN structure CRF-RNN. The parameters of the CRF-RNN are the same as the mean-field parameters described in Section 4 and denoted by θ here. Since the calculation of error differentials w.r.t. these parameters in a single iteration was described in Section 4, they can be learnt in the RNN setting using the standard back-propagation through time algorithm [48, 40]. It was shown in [29] that the mean-field iterative algorithm for the dense CRF converges in less than 10 iterations. Furthermore, in practice, after about 5 iterations, increasing the number of iterations usually does not significantly improve results [29].
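The recurrence in Eqs. (3)-(5) can be sketched with a pluggable f_theta. This is a hedged illustration, not the released code; following the equations literally, f_theta is applied for every t from 0 to T, and the gating simply feeds softmax(U) in at t = 0, feeds H2(t-1) back afterwards, and exposes Y only at the final step:

```python
# CRF-RNN forward pass: unrolled mean-field iterations as an RNN.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def crf_rnn_forward(U, I, f_theta, T=5):
    """U: unary potentials; I: image; f_theta(U, H1, I): one mean-field step."""
    H2 = None
    for t in range(T + 1):
        H1 = softmax(U) if t == 0 else H2   # Eq. (3): gating G1
        H2 = f_theta(U, H1, I)              # Eq. (4): one mean-field iteration
    return H2                               # Eq. (5): Y exposed only at t = T
```

Because every unrolled step shares the same theta, gradients accumulated over the T steps during back-propagation through time all update the same CRF parameters.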
Therefore, it does not suffer from the vanishing and exploding gradient problem inherent to deep RNNs [7, 43]. This allows us to use a plain RNN architecture instead of more sophisticated architectures such as LSTMs in our network.

(Figure 2. The CRF-RNN Network. We formulate the iterative mean-field algorithm as a Recurrent Neural Network (RNN). Gating functions G1 and G2 are fixed as described in the text.)

5.2. Completing the Picture

Our approach comprises a fully convolutional network stage, which predicts pixel-level labels without considering structure, followed by a CRF-RNN stage, which performs CRF-based probabilistic graphical modelling for structured prediction. The complete system, therefore, unifies the strengths of both CNNs and CRFs and is trainable end-to-end using the back-propagation algorithm [34] and the Stochastic Gradient Descent (SGD) procedure. During training, a whole image (or many of them) can be used as the mini-batch, and the error at each pixel output of the network can be computed using an appropriate loss function, such as the softmax loss, with respect to the ground truth segmentation of the image. We used the FCN-8s architecture of [37] as the first part of our network, which provides unary potentials to the CRF. This network is based on the VGG-16 network [53] but has been restructured to perform pixel-wise prediction instead of image classification.

In the forward pass through the network, once the computation enters the CRF-RNN after passing through the CNN stage, it takes T iterations for the data to leave the loop created by the RNN. Neither the CNN that provides unary values nor the layers after the CRF-RNN (i.e., the loss layers) need to perform any computations during this time, since the refinement happens only inside the RNN's loop. Once the output Y leaves the loop, the next stages of the deep network after the CRF-RNN can continue the forward pass. In our setup, a softmax loss layer directly follows the CRF-RNN and terminates the network.

During the backward pass, once the error differentials reach the CRF-RNN's output Y, they similarly spend T iterations within the loop before reaching the RNN input U in order to propagate to the CNN which provides the unary input. In each iteration inside the loop, error differentials are computed inside each component of the mean-field iteration as described in Section 4. We note that unnecessarily increasing the number of mean-field iterations T could potentially result in vanishing and exploding gradient problems in the CRF-RNN. We, however, did not experience this problem during our experiments.

6. Implementation Details

In this section we describe the implementation details of the proposed network, as well as its training process. The high-level architecture of our system, which was implemented using the popular Caffe [27] deep learning library, is shown in Fig. 3. The full source code and the trained models of our approach will be made publicly available at https://github.com/torrvision/crfasrnn/.

(Figure 3. The End-to-end Trainable Network. Schematic visualization of our full network, which consists of a CNN and the CRF-RNN. Best viewed in colour.)

We initialized the first part of the network using the publicly available weights of the FCN-8s network [37]. The compatibility transform parameters of the CRF-RNN were initialized using the Potts model, and the kernel width and weight parameters were obtained from a cross-validation process. We found that such initialization results in faster convergence of training. During the training phase, the parameters of the whole network were optimized end-to-end using the back-propagation algorithm. In particular, we used the full image training described in [37], with the learning rate fixed at 10^−13 and momentum set to 0.99. These extreme values of the parameters were used since we employed only one image per batch, to avoid reaching the memory limits of the GPU.
vanishing/exploding gradient problems and to reduce the training time. During test time, the iteration count was increased to 10. The effect of this parameter value on the accuracy is discussed in Section 7.1.
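The recurrent behaviour described above is easy to state concretely. The following toy sketch (NumPy, not the authors' Caffe implementation) unrolls T mean-field iterations exactly as the CRF-RNN loop does; the Gaussian message passing of the real model is replaced here by a simple neighbour average for brevity, and all shapes and values are illustrative:

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crf_rnn_forward(unary, compat, T=5):
    """unary: (L, N) scores for L labels at N pixels of a 1-D 'image'.
    compat: (L, L) label compatibility matrix (mu).
    Toy stand-in for message passing: average of each pixel's two
    spatial neighbours (edge-padded)."""
    q = softmax(unary)                                  # initialisation
    for _ in range(T):                                  # the RNN loop
        padded = np.pad(q, ((0, 0), (1, 1)), mode='edge')
        msg = 0.5 * (padded[:, :-2] + padded[:, 2:])    # message passing
        pairwise = compat @ msg                         # compatibility transform
        q = softmax(unary - pairwise)                   # add unaries, normalise
    return q

L, N = 3, 6
rng = np.random.default_rng(0)
unary = rng.normal(size=(L, N))
compat = 1.0 - np.eye(L)          # Potts-style penalty for differing labels
out = crf_rnn_forward(unary, compat, T=5)
print(out.shape)                  # (3, 6); each column is a valid distribution
```

Because the loop body is differentiable, error differentials would traverse the same T steps in reverse during the backward pass, as the text describes.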
Loss function During the training of the models that
achieved the best results reported in this paper, we used the
standard softmax loss function, that is, the log-likelihood
error function described in [30]. The standard metric used
in the Pascal VOC challenge is the average intersection over
union (IU), which we also use here to report the results. In
our experiments we found that high values of IU on the validation set were, to a large extent, associated with low values of the averaged softmax loss. We also tried the robust log-likelihood in [30] as a loss function for CRF-RNN training. However, this did not result in increased accuracy or faster convergence.
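For reference, the intersection over union metric mentioned above can be computed directly from a predicted label map and a ground-truth label map. A minimal pure-Python sketch (the label maps here are flattened toy inputs, not VOC data):

```python
def mean_iu(pred, gt, num_classes):
    """Mean intersection over union across classes.
    pred, gt: flat sequences of per-pixel integer labels."""
    ius = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:                       # skip classes absent from both maps
            ius.append(inter / union)
    return sum(ius) / len(ius)

pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 0, 1, 2, 2, 2]
print(round(mean_iu(pred, gt, 3), 3))   # 0.722
```

The Pascal VOC evaluation averages this quantity over the 21 classes of the benchmark.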
Normalization techniques As described in Section 4, we use the exponential function followed by pixel-wise normalization across channels in several stages of the CRF-RNN. Since this operation has a tendency to result in small gradients with respect to the input when the input value is large, we conducted several experiments where we replaced this by a rectified linear unit (ReLU) operation followed by a normalization across the channels. Our hypothesis was

[Figure legend: B-ground, Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Dining-Table, Dog, Horse, Motorbike, Person, Potted-Plant, Sheep, Sofa, Train, TV/Monitor.]
post-processing method. This can be attributed to the fact that during the SGD training of the CRF-RNN, the CNN component and the CRF component learn how to co-operate with each other to produce the optimum output of the whole network.

Method                           Without COCO    With COCO
Plain FCN-8s                     61.3            68.3
FCN-8s and CRF disconnected      63.7            69.5
End-to-end training of CRF-RNN   69.6            72.9

Table 1. Mean IU accuracy of our approach, CRF-RNN, compared with similar methods, evaluated on the reduced VOC 2012 validation set.

We then proceeded to compare our approach with all state-of-the-art methods that used training data from the standard VOC 2012 training and validation sets, and from the dataset published with [22]. The results are shown in Table 2, above the bar, and we can see that our approach outperforms all competitors.
In the second experiment, in addition to the above training set, we used data from the Microsoft COCO dataset [36], as was done in [41] and [12]. We selected images from the MS COCO 2014 training set where the ground truth segmentation has at least 200 pixels marked with class labels present in the VOC 2012 dataset. With this selection, we ended up using 66,099 images from the COCO dataset, and therefore a total of 66,099 + 11,685 = 77,784 training images were used in the second experiment. The same reduced validation set was used in this second experiment as well. In this case, we first fine-tuned the plain FCN-32s network (without the CRF-RNN part) on COCO data, then we built an FCN-8s network with the learnt weights and finally trained the CRF-RNN network end-to-end using VOC 2012 training data only. Since the MS COCO ground truth segmentation data contains somewhat coarse segmentation masks where objects are not delineated properly, we found that fine-tuning our model with COCO did not yield significant improvements. This can be understood because the primary advantage of our model comes from delineating the objects and improving fine segmentation boundaries. The VOC 2012 training dataset therefore helps our model learn this task effectively. The results of this experiment are shown in Table 2, below the bar, and we see that our approach sets a new state-of-the-art on the VOC 2012 dataset.

Note that in both setups, our approach outperforms competing methods due to the end-to-end training of the CNN and CRF in the unified CRF-RNN framework. We also evaluated our models on the VOC 2010 and VOC 2011 test sets (see Table 2). In all cases our method achieves the state-of-the-art performance.

Method                       VOC 2010 test   VOC 2011 test   VOC 2012 test
BerkeleyRC [3]               n/a             39.1            n/a
O2PCPMC [8]                  49.6            48.8            47.8
Divmbest [44]                n/a             n/a             48.1
NUS-UDS [16]                 n/a             n/a             50.0
SDS [23]                     n/a             n/a             51.6
MSRA-CFM [13]                n/a             n/a             61.8
FCN-8s [37]                  n/a             62.7            62.2
Hypercolumn [24]             n/a             n/a             62.6
Zoomout [38]                 64.4            64.1            64.4
Context-Deep-CNN-CRF [35]    n/a             n/a             70.7
DeepLab-MSc [10]             n/a             n/a             71.6
Our method w/o COCO          73.6            72.4            72.0

BoxSup [12]                  n/a             n/a             71.0
DeepLab [10, 41]             n/a             n/a             72.7
Our method with COCO         75.7            75.0            74.7

Table 2. Mean IU accuracy of our approach, CRF-RNN, compared to the other approaches on the Pascal VOC 2010-2012 test datasets. Methods from the first group do not use MS COCO data for training. The methods from the second group use both COCO and VOC datasets for training.

In order to have qualitative evidence about how CRF-RNN learns, we visualize the compatibility function learned after the training stage of the CRF-RNN as a matrix representation in Fig. 5. Element (i, j) of this matrix corresponds to mu(i, j) defined earlier: a high value at (i, j) implies a high penalty for assigning label i to a pixel when a similar pixel (spatially or appearance-wise) is assigned label j. For example, we can appreciate that the learned compatibility matrix assigns a low penalty to pairs of labels that tend to appear together, such as [Motorbike, Person] and [Dining table, Chair].

Pascal Context Dataset

We conducted an experiment on the Pascal Context dataset [39], which differs from the previous one in the larger number of classes considered, 59. We used the provided partitions of training and validation sets, and the obtained results are reported in Table 3.

Method    O2P [8]   CFM [13]   FCN-8s [37]   CRF-RNN
Mean IU   18.1      34.4       37.78         39.28

Table 3. Mean IU accuracy of our approach, CRF-RNN, evaluated on the Pascal Context validation set.
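The Potts-model initialisation of the compatibility transform mentioned in Section 6, and the learned deviations from it that Fig. 5 visualises, can be stated concretely. A hedged NumPy sketch (the class subset and the adjusted penalty value are made up for illustration; they are not the learned values):

```python
import numpy as np

labels = ["B-Ground", "Person", "Motorbike", "Dining-Table", "Chair"]
L = len(labels)

# Potts initialisation: unit penalty for differing labels, none otherwise.
mu = 1.0 - np.eye(L)

# After training, co-occurring pairs such as [Motorbike, Person] end up
# with lower penalties (the value 0.2 here is purely illustrative).
i, j = labels.index("Motorbike"), labels.index("Person")
mu[i, j] = mu[j, i] = 0.2

# mu[i, j] is the cost of assigning label i to a pixel when a similar
# (spatially or appearance-wise) pixel is assigned label j.
print(mu[i, j], mu[0, 1])   # 0.2 1.0
```

During mean-field inference, this matrix multiplies the message-passing output, so lowering mu[i, j] makes the pair of labels cheaper to assign to similar pixels.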
to-end with 5 such iterations yielded a final mean IU score of only 70.9, supporting the hypothesis that the recurrent structure of our approach is important for its success.

[Figure 5: the learned compatibility matrix. Rows and columns are the Pascal VOC class labels (B-Ground, Aeroplane, Bicycle, ..., TV/Monitor); the colour scale runs from 0.0 to -1.0.]

8. Conclusion

We presented CRF-RNN, an interpretation of dense CRFs as Recurrent Neural Networks. Our formulation fully integrates CRF-based probabilistic graphical modelling with emerging deep learning techniques. In particular, the proposed CRF-RNN can be plugged in as a part of a traditional deep neural network: it is capable of passing on error differentials from its outputs to inputs during back-propagation based training of the deep network
[6] Y. Bengio, Y. LeCun, and D. Henderson. Globally trained handwritten word recognizer using spatial representation, convolutional neural networks, and hidden markov models. In NIPS, pages 937–937, 1994.
[7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
[8] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Free-form region description with second-order pooling. IEEE TPAMI, 2014.
[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[11] L.-C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning deep structured models. In ICLRW, 2015.
[12] J. Dai, K. He, and J. Sun. Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In arXiv:1503.01640, 2015.
[13] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In IEEE CVPR, 2015.
[14] T.-M.-T. Do and T. Artieres. Neural conditional random fields. In NIPS, 2010.
[15] J. Domke. Learning graphical model parameters with approximate marginal inference. IEEE TPAMI, 35(10):2454–2467, 2013.
[16] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and semantic segmentation. In ECCV, 2014.
[17] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NIPS, 2014.
[18] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
[19] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE TPAMI, 2013.
[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, 2014.
[21] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.
[22] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In IEEE ICCV, 2011.
[23] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[24] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In IEEE CVPR, 2015.
[25] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Deep structured output learning for unconstrained text recognition. In ICLR, 2015.
[26] V. Jain, J. F. Murray, F. Roth, S. C. Turaga, V. P. Zhigulin, K. L. Briggman, M. Helmstaedter, W. Denk, and H. S. Seung. Supervised learning of image restoration with convolutional networks. In IEEE ICCV, 2007.
[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, pages 675–678, 2014.
[28] M. Kiefel and P. V. Gehler. Human pose estimation with fields of parts. In ECCV, 2014.
[29] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[30] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In ICML, 2013.
[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[32] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In IEEE ICCV, 2009.
[33] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[35] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. In arXiv:1504.01013, 2015.
[36] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollar. Microsoft coco: Common objects in context. In arXiv:1405.0312, 2014.
[37] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE CVPR, 2015.
[38] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In IEEE CVPR, 2015.
[39] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE CVPR, 2014.
[40] M. C. Mozer. A focused backpropagation algorithm for temporal pattern recognition. In Y. Chauvin and D. E. Rumelhart, editors, Backpropagation, pages 137–169. L. Erlbaum Associates Inc., 1995.
[41] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a dcnn for semantic image segmentation. In arXiv:1502.02734, 2015.
[42] S. Paris and F. Durand. A fast approximation of the bilateral filter using a signal processing approach. IJCV, 81(1):24–52, 2013.
[43] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[44] P. Yadollahpour, D. Batra, and G. Shakhnarovich. Discriminative re-ranking of diverse segmentations. In IEEE CVPR, 2013.
[45] J. Peng, L. Bo, and J. Xu. Conditional neural fields. In NIPS, 2009.
[46] P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[47] S. Ross, D. Munoz, M. Hebert, and J. A. Bagnell. Learning message-passing inference machines for structured prediction. In IEEE CVPR, 2011.
[48] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In J. A. Anderson and E. Rosenfeld, editors, Parallel distributed processing: explorations in the microstructure of cognition, pages 318–362. MIT Press, 1986.
[49] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. In arXiv:1503.02351, 2015.
[50] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In IEEE CVPR, 2011.
[51] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In IEEE CVPR, 2008.
[52] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1):2–23, 2009.
[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In arXiv:1409.1556, 2014.
[54] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, 2011.
[55] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and gibbs measures. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 2002.
[56] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In IEEE CVPR, 1998.
[57] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.
[58] Z. Tu. Auto-context and its application to high-level vision tasks. In IEEE CVPR, 2008.
[59] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2):113–140, 2005.
[60] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao. Recurrent conditional random field for language understanding. In ICASSP, 2014.
[61] Y. Zhang and T. Chen. Efficient inference for fully-connected crfs with stationarity. In CVPR, 2012.
Methods trained with COCO             Mean IU
Our method                            74.7    90.4 55.3 88.7 68.4 69.8 88.3 82.4 85.1 32.6
DeepLab [10, 41]                      72.7    89.1 38.3 88.1 63.3 69.7 87.1 83.1 85.0 29.3
BoxSup [12]                           71.0    86.4 35.5 79.7 65.2 65.2 84.3 78.5 83.7 30.5

Methods trained w/o COCO              Mean IU
Our method trained w/o COCO           72.0    87.5 39.0 79.7 64.2 68.3 87.6 80.8 84.4 30.4
DeepLab-MSc-CRF-LargeFOV [10]         71.6    84.4 54.5 81.5 63.6 65.9 85.1 79.1 83.4 30.7
Context Deep CNN CRF [35]             70.7    87.5 37.7 75.8 57.4 72.3 88.4 82.6 80.0 33.4
Zoomout [38]                          64.4    81.9 35.1 78.2 57.4 56.5 80.5 74.0 79.8 22.4
Hypercolumn [24]                      62.6    68.7 33.5 69.8 51.3 70.2 81.1 71.9 74.9 23.9
FCN-8s [37]                           62.2    76.8 34.2 68.9 49.4 60.3 75.3 74.7 77.6 21.4
MSRA CFM [13]                         61.8    75.7 26.7 69.5 48.8 65.6 81.0 69.2 73.3 30.0
SDS [23]                              51.6    63.3 25.7 63.0 39.8 59.2 70.9 61.4 54.9 16.8
NUS UDS [16]                          50.0    67.0 24.5 47.2 45.0 47.9 65.3 60.6 58.5 15.5
TTIC-divmbest-rerank [44]             48.1    62.7 25.6 46.9 43.0 54.8 58.4 58.6 55.6 14.6
BONN O2PCPMC FGT SEGM [8]             47.8    64.0 27.3 54.1 39.2 48.7 56.6 57.7 52.5 14.2

Table 4. Intersection over Union (IU) accuracy of our approach, CRF-RNN, compared to the other state-of-the-art approaches on the Pascal VOC 2012 test set. Scores for other methods were taken from the results published by the original authors. The symbols are from Chatfield et al. [9].
Figure 6. Typical good quality segmentation results I. Illustration of sample results (columns: Input Image, CRF-RNN, Ground Truth) on the validation set of the Pascal VOC 2012 dataset. Note that in some cases our method is able to pick correct segmentations that are not marked correctly in the ground truth. Best viewed in colour.
Figure 7. Typical good quality segmentation results II. Illustration of sample results (columns: Input Image, CRF-RNN, Ground Truth) on the validation set of the Pascal VOC 2012 dataset. Note that in some cases our method is able to pick correct segmentations that are not marked correctly in the ground truth. Best viewed in colour.
Figure 8. Failure cases I. Illustration of sample failure cases (columns: Input Image, CRF-RNN, Ground Truth) on the validation set of the Pascal VOC 2012 dataset. Best viewed in colour.
Figure 9. Failure cases II. Illustration of sample failure cases (columns: Input Image, CRF-RNN, Ground Truth) on the validation set of the Pascal VOC 2012 dataset. Best viewed in colour.
Figure 10. Qualitative comparison with the other approaches. Sample results with our method on the validation set of the Pascal VOC 2012 dataset (columns: Input Image, FCN-8s, DeepLab, CRF-RNN, Ground Truth), compared with previous state-of-the-art methods. Segmentation results with the DeepLab approach were reproduced from the original publication. Best viewed in colour.
Deep Learning Face Attributes in the Wild

Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently.

[Figure 1: example true/false attribute predictions for (a) HOG(landmarks)+SVM and (b) Our Method.]
This work revisits global methods by proposing a novel deep learning framework, which integrates two CNNs, LNet and ANet, where LNet locates the entire face region and ANet extracts high-level face representation from the located region. The novelties are in three aspects. Firstly, LNet is trained in a weakly supervised manner, i.e. only image-level attribute tags of training images are provided, making data preparation much easier. This is different from training face and landmark detectors, where face bounding boxes and landmark positions are required. LNet is pre-trained by classifying massive general object categories, such that its pre-trained features have good generalization capability on handling large background clutters. LNet is then fine-tuned by attribute tags. We demonstrate that features learned in this way are effective for face localization and also can distinguish subtle differences between human faces and analogous patterns, such as a cat face.

Secondly, ANet extracts discriminative face representation, making attribute recognition from the entire face region possible. ANet is pre-trained by classifying massive face identities and is fine-tuned by attributes. We show that the pre-training step enables ANet to account for complex variations in the unconstrained face images.

Thirdly, within the rough locations of face regions provided by LNet, averaging the predictions of multiple patches can improve the performance. A simple way is to evaluate the feed-forward pass for each single patch. However, it is slow and has a lot of redundant computation. A novel fast feed-forward scheme is proposed to replace patch-by-patch evaluation. It evaluates images with arbitrary sizes with only a one-pass feed-forward operation. It becomes non-trivial if the filters are locally shared, while studies [27, 26] showed that locally shared filters perform better in face-related tasks. This is solved by proposing an interweaved operation.

Besides proposing new methods, our framework also reveals valuable facts on learning face representation. They not only motivate this work but also benefit future research on face and deep learning. (1) It shows how pre-training with massive object categories and massive identities can improve feature learning for face localization and attribute recognition, respectively. (2) It demonstrates that although filters of LNet are fine-tuned by attribute tags, their response maps over the entire image have a strong indication of face location. Good features for face localization should be able to capture rich face variations, and more supervised information on these variations improves the learning process. The examples in Fig. 1 (a) show that as the number of attributes decreases, the localization capability of learned neurons gets reduced dramatically. (3) ANet is pre-trained with massive face identities. It discloses that the pre-trained high-level hidden neurons of ANet implicitly learn and discover semantic concepts that are related to identity, such as race, gender, and age. It indicates that when a deep model is pre-trained for face recognition, it implicitly learns attributes. The performance of attribute prediction drops without this pre-training stage.

The main contributions are summarized as follows. (1) We propose a novel deep learning framework, which combines massive objects and massive identities to pre-train two CNNs for face localization and attribute prediction, respectively. It achieves state-of-the-art attribute classification results on both the challenging CelebFaces [26] and LFW [12] datasets, improving existing methods by 8 and 13 percent, respectively. (2) A novel fast feed-forward algorithm for CNN with locally shared filters is devised. (3) Our study reveals multiple valuable facts on learning face representation by deep models. (4) We also contribute a large facial attribute database with more than eight million attribute labels; it is 20 times larger than the largest publicly available dataset.

1.1. Related Work

Extracting hand-crafted features at pre-defined landmarks has become a standard step in attribute recognition [9, 15, 4, 2]. Kumar et al. [15] extracted HOG-like features on various face regions to tackle attribute classification and face verification. To improve the discriminativeness of hand-crafted features given a specific task, Bourdev et al. [4] built a three-level SVM system to extract higher-level information. Deep learning [18, 34, 23, 7, 19, 32, 31, 13, 33, 22, 3, 28] recently achieved great success in attribute prediction, due to its ability to learn compact and discriminative features. Razavian et al. [23] and Donahue et al. [7] demonstrated that off-the-shelf features learned by a CNN on ImageNet [13] can be effectively adapted to attribute classification. Zhang et al. [32] showed that better performance can be achieved by ensembling learned features of multiple pose-normalized CNNs. The main drawback of these methods is that they rely on accurate landmark detection and pose estimation in both training and testing steps. Even though a recent work [31] can perform automatic part localization during test, it still requires landmark annotations of the training data.

2. Our Approach

Framework Overview Fig. 2 illustrates our pipeline, where LNet locates the entire face region in a coarse-to-fine manner as shown in (a) and (b), while ANet extracts features for attribute recognition as shown in (c).

Different from existing works that rely on accurate face and landmark annotations, LNet is trained in a weakly supervised manner with only image-level annotations. Specifically, it is pre-trained with one thousand object categories of ImageNet [6] and fine-tuned by image-level attribute tags. The former step accounts for background
[Figure 2: the pipeline — (a) LNet_o and (b) LNet_s locate the face; (c) ANet extracts the face representation; (d) extracting features to predict attributes, where FC features feed per-attribute linear SVMs for attributes such as Wavy Hair, No Beard, High Cheekbones, and Smiling.]
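The final stage of the pipeline, predicting each attribute with an independent linear classifier on top of the extracted face feature, can be sketched in a few lines. This NumPy toy is illustrative only: the attribute list, feature dimension, and random weights are made up, and in the real system one SVM per attribute is trained on ANet features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: 'face_feature' would come from ANet, and W would hold the
# trained per-attribute linear SVM weights. All values are illustrative.
attributes = ["Smiling", "Male", "Eyeglasses"]
feat_dim = 8
W = rng.normal(size=(len(attributes), feat_dim))  # one linear classifier per attribute
b = np.zeros(len(attributes))

def predict_attributes(feature):
    """One binary decision per attribute from a single face feature vector."""
    scores = W @ feature + b
    return {a: bool(s > 0) for a, s in zip(attributes, scores)}

face_feature = rng.normal(size=feat_dim)
preds = predict_attributes(face_feature)
print(sorted(preds))   # the attribute names, each mapped to a boolean decision
```

Averaging such scores over multiple patches of the located face region, as the text describes, would simply replace `face_feature` with the mean of several patch features before thresholding.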
[Figure: response maps on face vs. background images; (a) single detector, (b) multi-view detector, (c) face localization by attributes, with example attributes including Brown Hair, Male, Big Eyes, Black Hair, Smiling, and Sunglasses.]
Figure 5. Detailed pipeline of efficient feature extractions in ANet: (a) global convolution, (b) local convolution, (c) feature extraction with interweaved operation, (d) interweaved operation.
the input image patch-by-patch. Therefore, it shares the convolutions for every patch.

However, this scheme is not applicable when we have more than two convolutional layers whose filters are locally-shared. An example is illustrated in Fig. 5 (b), where each patch is equally divided into 3 × 3 = 9 cells and we learn different filters for different cells. To reduce computations in the first convolutional layer, each local filter can be applied on the entire image, resulting in a response map with nine channels, i.e. h_i^(1) with i = 1...9. The final response map h^(1) is obtained by cropping and padding the regions (i.e. rectangles in black) in these 9 channels. As a result, each feature vector FC can be pooled from h^(1), without convolving the input image patch-by-patch. Nevertheless, since h^(1) corresponds to a patch of the input image, the succeeding local convolutions have to be handled patch-by-patch, leading to redundant computations.

To this end, we propose an interweaved operation, which is a fast feed-forward method for CNN with locally-shared filters. Suppose we have four local filters in the next locally convolutional layer and each filter is applied on 2 × 2 cells of h^(1) as shown in (b). These cells are the receptive fields of the filters, including {1, 2, 4, 5}, {2, 3, 5, 6}, {4, 5, 7, 8}, and {5, 6, 8, 9}. Instead of directly applying the local filters on h^(1), the interweaved operation generates an interweaved map I_i^(1) for each filter, where i = 1...4. Each local filter is then applied on its corresponding interweaved map. Since the interweaved map captures the entire image, each local filter is turned into a global filter such that its computation can be shared across different patches.

Specifically, each interweaved map, e.g. I_1^(1), is obtained by padding the cells of the corresponding channels in an interweaved manner, e.g. h_i^(1) for i = {1, 2, 4, 5}, as shown in Fig. 5 (d). All of the interweaved maps are illustrated in Fig. 5 (c). After that, each of the four local filters is applied on its corresponding interweaved map, leading to four response maps h_i^(2), where i = 1...4. As a result, the feature vector FC is pooled and concatenated from the receptive fields of the filters, which are the rectangles in black as shown in (c).

Intuitively, instead of padding cells according to the receptive fields of all the local filters (e.g. h^(1) in (b)), which has to be performed in a patch-by-patch way, the interweaved operation pads the cells with respect to the receptive field of each local filter over the entire image. It enables extracting multiple feature vectors with only one pass of feed-forward evaluation. This operation can be repeated when more locally convolutional layers are added. The proposed feature extraction scheme has achieved a 6× speedup empirically when compared with patch-by-patch scanning. It is applicable to CNNs with local filters and compatible with all existing CNN operations.

3. Experiments

Large-scale Data Collection We construct two face attribute datasets, namely CelebA and LFWA, by labeling images selected from two challenging face datasets, CelebFaces [26] and LFW [12]. CelebA contains ten thousand identities, each of which has twenty images. There are two hundred thousand images in total. LFWA has 13,233 images of 5,749 identities. Each image in CelebA and LFWA is annotated with forty face attributes and five key points by a professional labeling company. CelebA and LFWA have over eight million and five hundred thousand attribute labels, respectively.

CelebA is partitioned into three parts. Images of the first eight thousand identities (with 160 thousand images) are used to pre-train and fine-tune ANet and LNet, and the images of another one thousand identities (with twenty thousand images) are employed to train SVM. The images of the remaining one thousand identities (with twenty thousand images) are used for testing. LFWA is partitioned into half for training and half for testing. Specifically, 6,263 images are adopted to train SVM and the remaining images for test. When being evaluated on LFWA, LNet and ANet are trained on CelebA.

Methods for Comparisons The proposed method is compared with three competitive approaches, i.e. FaceTracer [14], PANDA-w [32], and PANDA-l [32]. FaceTracer extracts HOG and color histograms in several important functional face regions and then trains SVM for attribute classification. We extract these functional regions referring
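The first step of this scheme, applying every cell's local filter over the entire image and then cropping each channel back to its cell, can be checked with a toy example. The sketch below (NumPy, not the authors' implementation) uses 1 × 1 scalar filters on a 2 × 2 cell grid so the crop-and-assemble step is exact without any padding, and verifies it matches naive patch-by-patch local convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(size=(6, 6))

# 2 x 2 grid of cells, one (here 1x1, i.e. scalar) filter per cell;
# scalar filters keep the sketch free of boundary handling.
filters = rng.normal(size=(2, 2))

# (a) naive locally-shared convolution: apply each filter only to its cell.
naive = np.empty_like(img)
for ci in range(2):
    for cj in range(2):
        cell = np.s_[3 * ci:3 * ci + 3, 3 * cj:3 * cj + 3]
        naive[cell] = img[cell] * filters[ci, cj]

# (b) the shared scheme from the text: run every filter over the whole
# image once (four 'channels'), then crop each channel to its own cell.
channels = img[None, None] * filters[:, :, None, None]   # shape (2, 2, 6, 6)
shared = np.empty_like(img)
for ci in range(2):
    for cj in range(2):
        cell = np.s_[3 * ci:3 * ci + 3, 3 * cj:3 * cj + 3]
        shared[cell] = channels[ci, cj][cell]

print(np.allclose(naive, shared))   # True: same result, one pass per filter
```

The interweaved operation extends this idea to the next locally convolutional layer, where each filter's receptive field straddles several cells and the padding must therefore be done per filter.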
Figure 6. Averaged response maps of LNet, including (a) CelebA, (b) MobileFaces, (c) some failure cases.

Figure 7. ROC curves on (a) CelebA and (b) MobileFaces. (c) Recall rates w.r.t. overlap ratio (FPPI = 0.1). (d) Recall rates w.r.t. number of attributes (FPPI = 0.1). Curves compare LNet with DPM [21], ACF Multi-view [30], SURF Cascade [17], and Face++ [1]; panel (c) also includes LNet (w/o pre-training).

to the ground truth landmark points. PANDA-w and PANDA-l are based on PANDA [32], which was proposed recently for human attribute recognition by ensembling multiple CNNs, each of which extracts features from a well-aligned human part. These features are concatenated to train an SVM for attribute recognition. It is straightforward to adapt this method to face attributes, since face parts can be well aligned by landmark points. Here, we consider two settings. PANDA-w obtains the face parts by applying state-of-the-art face detection [17] and alignment [25] on wild images, while PANDA-l attains the face parts by using ground truth landmark points. For fair comparison, all the above methods are trained with the same data as ours.

3.1. Effectiveness of the Framework

This section demonstrates the effectiveness of the framework. All experiments in this section are done on CelebA.

• LNet

Performance Comparison. We compare LNet with four state-of-the-art face detectors, including DPM [21], ACF Multi-view [30], SURF Cascade [17], and Face++ [1]. We evaluate them by using ROC curves when IoU ≥ 0.5. As plotted in Fig.7(a), when FPPI = 0.01, the true positive rates of Face++ and LNet are 85% and 93%; when FPPI = 0.1, our method outperforms the other detectors. We also investigate how these methods perform with respect to the overlap ratio (IoU), following [35, 21]. Fig.7(c) shows that LNet generally provides more accurate face localization, leading to good performance in the subsequent attribute prediction.

Further Analysis. LNet significantly outperforms LNet (without pre-training) by 74 percent when the overlap ratio equals 0.5, which validates the effectiveness of pre-training, as shown in Fig.7(c). We then explore the influence of the number of attributes on localization. Fig.7(d) illustrates that rich attribute information facilitates face localization.

To examine the generalization ability of LNet, we collect another 3,876 face images for testing, namely MobileFaces, which comes from a different source² and has a different distribution from CelebA. Several examples of MobileFaces are shown in Fig.6(b) and the corresponding ROC curves are plotted in Fig.7(b). We observe that LNet consistently performs better and still gains a 7 percent improvement (FPPI = 0.1) compared with other face detectors. Despite some failure cases due to extreme poses and large occlusions, LNet accurately localizes faces in the wild, as demonstrated in Fig.6.

• ANet

Pre-training Discovers Semantic Concepts. We show that pre-training of ANet can implicitly discover semantic concepts related to face identity. Given a hidden neuron at the FC layer of ANet as shown in Fig.2(c), we partition the face images into three groups, including the face images with high, medium, and low responses at this neuron. The face images of each group are then averaged to obtain the mean face. We visualize these mean faces for several neurons in Fig.8(a). Interestingly, these mean faces change smoothly from high response to low response, following a high-level concept. Humans can easily assign each neuron the semantic concept it measures (i.e. the text in yellow).

² MobileFaces was collected by users with mobile phones, while Cele-
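The mean-face visualization described above can be sketched as follows. This is a minimal numpy sketch with random stand-ins for the face crops and FC-neuron activations; the function name and the three-way split are illustrative:

```python
import numpy as np

def mean_faces_by_response(images, responses, n_groups=3):
    """Partition images into groups by one neuron's response and average each group.

    images:    (N, H, W) array of face crops
    responses: (N,) activations of a single FC-layer neuron
    Returns a list of mean faces, ordered from high to low response."""
    order = np.argsort(responses)[::-1]       # indices sorted high -> low response
    groups = np.array_split(order, n_groups)  # high / medium / low groups
    return [images[idx].mean(axis=0) for idx in groups]

# toy usage with random data in place of real face crops and activations
rng = np.random.default_rng(0)
faces = rng.random((90, 8, 8))
acts = rng.random(90)
means = mean_faces_by_response(faces, acts)
```

Averaging within each group smooths out intra-group variation, so only the factor correlated with the neuron's response (e.g. gender or race) survives in the mean face.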
Figure 8. Visualization of neurons in ANet (a) after pre-training and (b) after fine-tuning (best viewed in color). Panels (a.1)-(a.6) show mean faces from high to low response for neurons corresponding to gender, hair color, age, race, face shape, and eye shape. Panels (b.1)-(b.3) show test images (bangs, eyeglasses, wearing hat) together with the activations of neurons measuring concepts such as brown hair, pale skin, narrow eyes, high cheekbones, mustache, black hair, smiling, big nose, blond hair, wearing lipstick, Asian, and big eyes.

Figure 9. (a) Layer-wise comparison of ANet after pre-training: accuracies of the FC, C4, and C3 features on identity-related attributes (Male, White, Black, Asian) and identity-non-related attributes (Smiling, Wearing Hat, Rosy Cheeks, 5 o'Clock Shadow). (b) Best performing neurons analysis of ANet after fine-tuning: average accuracy versus the percentage of best performing neurons used, for ANet (after fine-tuning) and HOG (after PCA), with the single best performing neuron marked. Best performing neurons are different for different attributes; the accuracies are averaged over attributes, each of which selects its own subset of best performing neurons.

For example, the neurons in (a.1) and (a.4) correspond to 'gender' and 'race', respectively. This reveals that the high-level hidden neurons of ANet can implicitly learn to discover semantic concepts, even though they are only optimized for face recognition using identity information, and attribute labels are not used in pre-training. We also observe that most of these concepts are intrinsic to face identity, such as the shape of facial components, gender, and race.

To better explain this phenomenon, we compare the accuracy of attribute prediction using features at different layers of ANet right after pre-training, namely FC, C4, and C3. The forty attributes are roughly separated into two groups: identity-related attributes, such as gender and race, and identity-non-related attributes, e.g. attributes of expressions, wearing hat and sunglasses. We select some representative attributes for each group and plot the results in Fig.9(a), which shows that FC outperforms C4 and C3 on the identity-related attributes, but is relatively weaker on the identity-non-related attributes. This is because the top layer FC learns identity features, which are insensitive to intra-personal face variations.

Fine-tuning Expands Semantic Concepts. Fig.8 shows that after fine-tuning, ANet can expand these concepts to more attribute types. Fig.8(b) visualizes the neurons in the FC layer, which are ranked by their responses in descending order with respect to several test images. Humans can assign semantic meaning to each of these neurons. We found that a large number of new concepts can be observed. Remarkably, these neurons express diverse high-level meanings and cooperate to explain the test images. The activations of all the neurons are visualized in Fig.8(b), and they are sparse. In some sense, the attributes presented in each test image are explained by a sparse linear combination of these concepts. For instance, the first image is described as "a lady with bangs, brown hair, pale skin, narrow eyes and high cheekbones", which well matches human perception.

To validate this, we explore how the number of neurons influences attribute prediction accuracies. Best performing neurons for each attribute are identified by sorting the corresponding SVM weights. Fig.9(b) illustrates that only 10% of ANet's best performing neurons are needed to achieve 90% of the original performance for a particular attribute³. In contrast, HOG+PCA does not have this sparse nature and needs more than 95% of the features. Besides, the best single performing neuron of ANet outperforms that of HOG+PCA by 25 percent in average prediction accuracy.

3.2. Attribute Prediction

Performance Comparison. The attribute prediction performance is reported in Table 1. On CelebA, the prediction accuracies of FaceTracer [14], PANDA-w [32], PANDA-l [32], and our LNets+ANet are 81, 79, 85, and 87 percent respectively, while the corresponding accuracies on LFWA are 74, 71, 81, and 84 percent. Our method outperforms PANDA-w by nearly 10 percent. Remarkably, even when PANDA-l is equipped with ground truth bounding boxes and landmark positions, our method still achieves a 3 percent gain. The strength of our method is illustrated not only on global attributes, e.g. "Chubby" and "Young", but also on fine-grained facial traits, e.g. "Mustache" and "Pointy Nose". We also report performance on 19 extended attributes and compare our results with [14] and [2].

³ Best performing neurons are different for different attributes.
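The best-performing-neuron selection described above can be sketched as follows. This is a hypothetical helper; in the paper the weights would come from the linear SVMs (LIBLINEAR [8]) trained on ANet features, one per attribute:

```python
import numpy as np

def top_neuron_fraction(svm_weights, fraction=0.1):
    """Indices of the best performing neurons for one attribute,
    ranked by the absolute weight of its linear SVM."""
    k = max(1, int(round(fraction * svm_weights.size)))
    return np.argsort(np.abs(svm_weights))[::-1][:k]

# toy weight vector standing in for one attribute's SVM weights
w = np.array([0.05, -1.2, 0.3, 0.9, -0.1])
idx = top_neuron_fraction(w, fraction=0.4)  # top 40% = 2 of 5 neurons -> [1, 3]
```

Re-evaluating each attribute's SVM on only its own top fraction of neurons is how the sparsity claim in Fig.9(b) would be measured.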
Columns (in order): 5 Shadow, Arch. Eyebrows, Attractive, Bags Un. Eyes, Bald, Bangs, Big Lips, Big Nose, Black Hair, Blond Hair, Blurry, Brown Hair, Bushy Eyebrows, Chubby, Double Chin, Eyeglasses, Goatee, Gray Hair, Heavy Makeup, H. Cheekbones, Male

CelebA
  FaceTracer [14]    85 76 78 76 89 88 64 74 70 80 81 60 80 86 88 98 93 90 85 84 91
  PANDA-w [32]       82 73 77 71 92 89 61 70 74 81 77 69 76 82 85 94 86 88 84 80 93
  PANDA-l [32]       88 78 81 79 96 92 67 75 85 93 86 77 86 86 88 98 93 94 90 86 97
  [17]+ANet          86 75 79 77 92 94 63 74 77 86 83 74 80 86 90 96 92 93 87 85 95
  LNets+ANet(w/o)    88 74 77 73 95 92 66 75 84 91 80 78 85 86 88 96 92 93 85 84 94
  LNets+ANet         91 79 81 79 98 95 68 78 88 95 84 80 90 91 92 99 95 97 90 87 98

LFWA
  FaceTracer [14]    70 67 71 65 77 72 68 73 76 88 73 62 67 67 70 90 69 78 88 77 84
  PANDA-w [32]       64 63 70 63 82 79 64 71 78 87 70 65 63 65 64 84 65 77 86 75 86
  PANDA-l [32]       84 79 81 80 84 84 73 79 87 94 74 74 79 69 75 89 75 81 93 86 92
  [17]+ANet          78 66 75 72 86 84 70 73 82 90 75 71 69 68 70 88 68 82 89 79 91
  LNets+ANet(w/o)    81 78 80 79 83 84 72 76 86 94 70 73 79 70 74 92 75 81 91 83 91
  LNets+ANet         84 82 83 83 88 88 75 81 90 97 74 77 82 73 78 95 78 84 95 88 94

Columns (in order): Mouth S. O., Mustache, Narrow Eyes, No Beard, Oval Face, Pale Skin, Pointy Nose, Reced. Hairline, Rosy Cheeks, Sideburns, Smiling, Straight Hair, Wavy Hair, Wear. Earrings, Wear. Hat, Wear. Lipstick, Wear. Necklace, Wear. Necktie, Young, Average

CelebA
  FaceTracer [14]    87 91 82 90 64 83 68 76 84 94 89 63 73 73 89 89 68 86 80 81
  PANDA-w [32]       82 83 79 87 62 84 65 82 81 90 89 67 76 72 91 88 67 88 77 79
  PANDA-l [32]       93 93 84 93 65 91 71 85 87 93 92 69 77 78 96 93 67 91 84 85
  [17]+ANet          85 87 83 91 65 89 67 84 85 94 92 70 79 77 93 91 70 90 81 83
  LNets+ANet(w/o)    86 91 77 92 63 87 70 85 87 91 88 69 75 78 96 90 68 86 83 83
  LNets+ANet         92 95 81 95 66 91 72 89 90 96 92 73 80 82 99 93 71 93 87 87

LFWA
  FaceTracer [14]    77 83 73 69 66 70 74 63 70 71 78 67 62 88 75 87 81 71 80 74
  PANDA-w [32]       74 77 68 63 64 64 68 61 64 68 77 68 63 85 78 83 79 70 76 71
  PANDA-l [32]       78 87 73 75 72 84 76 84 73 76 89 73 75 92 82 93 86 79 82 81
  [17]+ANet          76 79 74 69 66 68 72 70 71 72 82 72 65 87 82 86 81 72 79 76
  LNets+ANet(w/o)    78 87 77 75 71 81 76 81 72 72 88 71 73 90 84 92 83 76 82 79
  LNets+ANet         82 92 81 79 74 84 80 85 78 77 91 76 76 94 88 95 88 79 86 84

Table 1. Performance comparison of attribute prediction. (Note that FaceTracer and PANDA-l attain the face parts by using ground truth landmark points.)
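As a quick arithmetic check on Table 1, the Average column is the mean of the 40 per-attribute accuracies, rounded to an integer. For example, for LNets+ANet on CelebA:

```python
# Per-attribute accuracies of LNets+ANet on CelebA from Table 1 (21 + 19 values).
scores = [91, 79, 81, 79, 98, 95, 68, 78, 88, 95, 84, 80, 90, 91, 92, 99, 95, 97, 90, 87, 98,
          92, 95, 81, 95, 66, 91, 72, 89, 90, 96, 92, 73, 80, 82, 99, 93, 71, 93, 87]
average = sum(scores) / len(scores)  # 87.3, reported in the table as 87
```

The same check reproduces the Average entries quoted in the text (81, 79, 85, 87 on CelebA; 74, 71, 81, 84 on LFWA).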
Columns (in order): A. Eye., Asian, B. Eye., B. Nose, Bald, Black, Black H., Blond H., Eye., Gender, M. Aged, Mustache, No Beard, No Eye., R. Hair., R. Jaw, Senior, White, Youth, Average

  FaceTracer [14]    91 87 86 75 66 54 70 66 68 72 84 86 83 76 72 66 65 81 51 73
  POOF [2]           92 90 81 90 71 60 80 67 75 67 87 90 86 72 74 71 68 77 55 76
  LNets+ANet         94 85 83 87 80 77 81 86 89 84 85 84 86 83 82 75 79 78 81 83

Table 2. Performance comparison on extended attributes. (Performance is measured by the average of true positive rates and true negative rates.)
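The metric used in Table 2, the average of the true positive rate and the true negative rate, can be sketched as follows (a minimal numpy sketch; the toy labels are illustrative):

```python
import numpy as np

def balanced_rate(y_true, y_pred):
    """Average of true positive rate and true negative rate,
    the metric used for the extended-attribute comparison in Table 2."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)  # recall on positives
    tnr = np.mean(y_pred[y_true == 0] == 0)  # recall on negatives
    return (tpr + tnr) / 2

# toy example: tpr = 1/2, tnr = 2/3, metric = 0.5833...
score = balanced_rate([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

Unlike raw accuracy, this metric is insensitive to class imbalance, which matters because many face attributes are rare.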
The evaluation protocol is the same as [2]. In Table 2, LNets+ANet outperforms them by 10 and 7 percent respectively.

Further Analysis. When compared with [17]+ANet, LNets accounts for a nearly 6 percent improvement over using an off-the-shelf face detector [17]. We also experiment with providing ANet with the face region localized by LNets, but without pre-training, denoted as LNets+ANet(w/o). The average accuracies drop by 4 and 5 percent on CelebA and LFWA, which indicates that pre-training with massive facial identities helps discover semantic concepts. To further examine whether the proposed approach can generalize to unseen attributes, we manually label 30 more attributes for the testing images of LFWA. To test on these 30 attributes, we directly transfer the weights learned by the deep models to extract features, and only re-train the SVMs using one third of the images. LNets+ANet leads to 8, 10, and 3 percent average gains over the other three approaches (FaceTracer, PANDA-w, and PANDA-l).

Time Complexity. For a 300×300 image, LNets takes 35 ms to localize the face region while ANet takes 14 ms to output the extracted features on a GPU. In contrast, naive patch-by-patch scanning needs nearly 80 ms to extract features. Our framework thus has large potential in real-world applications.

4. Conclusion

This paper has proposed a novel deep learning framework for face attribute prediction in the wild. With carefully designed pre-training strategies, our method is robust to background clutter and face variations. We devise a new fast feed-forward algorithm for locally shared filters to save redundant computation, which enables evaluating images of arbitrary size in real time, without normalization. We have also revealed multiple important facts about learning face representations, which shed light on new directions for face localization and representation learning.

Acknowledgement. This work* was partially supported by the National Natural Science Foundation of China (91320101, 61472410, 61503366) and the Research Grants Council of Hong Kong (No. CUHK14207814).

* For more technical details, please contact the corresponding author Ping Luo via pluo.lhi@gmail.com.
References

[1] Face++. http://www.faceplusplus.com/.
[2] T. Berg and P. N. Belhumeur. Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR, pages 955-962, 2013.
[3] A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani. Self-taught object localization with deep networks. arXiv preprint arXiv:1409.3964, 2014.
[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV, pages 1543-1550, 2011.
[5] J. Chung, D. Lee, Y. Seo, and C. D. Yoo. Deep attribute networks. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 3, 2012.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. JMLR, 9:1871-1874, 2008.
[9] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, pages 1778-1785, 2009.
[10] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735-1742, 2006.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346-361, 2014.
[12] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[14] N. Kumar, P. Belhumeur, and S. Nayar. Facetracer: A search engine for large collections of images with faces. In ECCV, pages 340-353, 2008.
[15] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In ICCV, pages 365-372, 2009.
[16] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.
[17] J. Li and Y. Zhang. Learning surf cascade for fast and accurate object detection. In CVPR, pages 3468-3475, 2013.
[18] P. Luo, X. Wang, and X. Tang. Hierarchical face parsing via deep learning. In CVPR, 2012.
[19] P. Luo, X. Wang, and X. Tang. A deep sum-product architecture for robust facial attributes analysis. In ICCV, pages 2864-2871, 2013.
[20] O. K. Manyam, N. Kumar, P. Belhumeur, and D. Kriegman. Two faces are better than one: Face recognition in group photographs. In IJCB, pages 1-8, 2011.
[21] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, pages 720-735, 2014.
[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685-694, 2015.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
[24] F. Song, X. Tan, and S. Chen. Exploiting relationship between attributes for improved face verification. CVIU, 122:143-154, 2014.
[25] Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In CVPR, pages 3476-3483, 2013.
[26] Y. Sun, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Deepface: Closing the gap to human-level performance in face verification. In CVPR, pages 1701-1708, 2014.
[28] Y. Tian, P. Luo, X. Wang, and X. Tang. Pedestrian detection aided by deep learning semantic tasks. In CVPR, 2015.
[29] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485-3492, 2010.
[30] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In IJCB, pages 1-8, 2014.
[31] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In ECCV, pages 834-849, 2014.
[32] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In CVPR, 2014.
[33] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. In ICLR, 2015.
[34] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.
[35] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391-405, 2014.
Asynchronous Methods for Deep Reinforcement Learning
time than previous GPU-based algorithms, using far less resources than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.

2. Related Work

The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the Map Reduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

3. Reinforcement Learning Background

We consider the standard reinforcement learning setting where an agent interacts with an environment E over a number of discrete time steps. At each time step t, the agent receives a state s_t and selects an action a_t from some set of possible actions A according to its policy π, where π is a mapping from states s_t to actions a_t. In return, the agent receives the next state s_{t+1} and a scalar reward r_t. The process continues until the agent reaches a terminal state, after which the process restarts. The return

    R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

is the total accumulated return from time step t with discount factor γ ∈ (0, 1]. The goal of the agent is to maximize the expected return from each state s_t.

The action value Q^π(s, a) = E[R_t | s_t = s, a] is the expected return for selecting action a in state s and following policy π. The optimal value function Q*(s, a) = max_π Q^π(s, a) gives the maximum action value for state s and action a achievable by any policy. Similarly, the value of state s under policy π is defined as V^π(s) = E[R_t | s_t = s] and is simply the expected return for following policy π from state s.

In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let Q(s, a; θ) be an approximate action-value function with parameters θ. The updates to θ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: Q*(s, a) ≈ Q(s, a; θ). In one-step Q-learning, the parameters θ of the action value function Q(s, a; θ) are learned by iteratively minimizing a sequence of loss functions, where the i-th loss function is defined as

    L_i(\theta_i) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2 \right]

where s' is the state encountered after state s.

We refer to the above method as one-step Q-learning because it updates the action value Q(s, a) toward the one-step return r + γ max_{a'} Q(s', a'; θ). One drawback of using one-step methods is that obtaining a reward r only directly affects the value of the state-action pair s, a that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value Q(s, a). This can make the learning process slow, since many updates are required to propagate a reward to the relevant preceding states and actions.
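The one-step Q-learning loss above can be sketched with a tabular stand-in for the function approximator (the function name and the toy numbers are illustrative):

```python
import numpy as np

def q_learning_loss(Q, Q_target, s, a, r, s_next, terminal, gamma=0.99):
    """Squared one-step Q-learning error with an older, separate target table,
    mirroring L_i = (r + gamma * max_a' Q(s', a'; theta_{i-1}) - Q(s, a; theta_i))^2."""
    y = r if terminal else r + gamma * Q_target[s_next].max()
    return (y - Q[s, a]) ** 2

# toy tables: 4 states, 2 actions
Q = np.zeros((4, 2))
Q[1] = [0.5, 1.0]
Q_target = Q.copy()
loss = q_learning_loss(Q, Q_target, s=0, a=1, r=1.0, s_next=1, terminal=False, gamma=0.5)
# y = 1.0 + 0.5 * 1.0 = 1.5, so loss = (1.5 - 0)^2 = 2.25
```

In the paper the target comes from a slowly changing copy of the network parameters θ_{i−1}; here `Q_target` plays that role.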
One way of propagating rewards faster is by using n-step returns (Watkins, 1989; Peng & Williams, 1996). In n-step Q-learning, Q(s, a) is updated toward the n-step return defined as r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + max_a γ^n Q(s_{t+n}, a). This results in a single reward r directly affecting the values of n preceding state-action pairs. This makes the process of propagating rewards to relevant state-action pairs potentially much more efficient.

In contrast to value-based methods, policy-based model-free methods directly parameterize the policy π(a|s; θ) and update the parameters θ by performing, typically approximate, gradient ascent on E[R_t]. One example of such a method is the REINFORCE family of algorithms due to Williams (1992). Standard REINFORCE updates the policy parameters θ in the direction ∇_θ log π(a_t|s_t; θ) R_t, which is an unbiased estimate of ∇_θ E[R_t]. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state b_t(s_t), known as a baseline (Williams, 1992), from the return. The resulting gradient is ∇_θ log π(a_t|s_t; θ)(R_t − b_t(s_t)).

A learned estimate of the value function is commonly used as the baseline, b_t(s_t) ≈ V^π(s_t), leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity R_t − b_t used to scale the policy gradient can be seen as an estimate of the advantage of action a_t in state s_t, or A(a_t, s_t) = Q(a_t, s_t) − V(s_t), because R_t is an estimate of Q^π(a_t, s_t) and b_t is an estimate of V^π(s_t). This approach can be viewed as an actor-critic architecture where the policy π is the actor and the baseline b_t is the critic (Sutton & Barto, 1998; Degris et al., 2012).

4. Asynchronous RL Framework

We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.

First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.

Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.

In addition to stabilizing learning, using multiple parallel actor-learners has multiple practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning, we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning and advantage actor-critic.

Algorithm 1 Asynchronous one-step Q-learning - pseudocode for each actor-learner thread.

    // Assume global shared θ, θ−, and counter T = 0.
    Initialize thread step counter t ← 0
    Initialize target network weights θ− ← θ
    Initialize network gradients dθ ← 0
    Get initial state s
    repeat
        Take action a with ε-greedy policy based on Q(s, a; θ)
        Receive new state s′ and reward r
        y = r                                  for terminal s′
        y = r + γ max_{a′} Q(s′, a′; θ−)       for non-terminal s′
        Accumulate gradients wrt θ: dθ ← dθ + ∂(y − Q(s, a; θ))² / ∂θ
        s = s′
        T ← T + 1 and t ← t + 1
        if T mod I_target == 0 then
            Update the target network θ− ← θ
        end if
        if t mod I_AsyncUpdate == 0 or s is terminal then
            Perform asynchronous update of θ using dθ.
            Clear gradients dθ ← 0.
        end if
    until T > T_max

Asynchronous one-step Q-learning: Pseudocode for our variant of Q-learning, which we call Asynchronous one-step Q-learning, is shown in Algorithm 1. Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. We use a shared and slowly changing target network in computing the Q-learning loss, as was proposed in the DQN training method. We also accumulate gradients over multiple timesteps before they are applied, which is similar to using minibatches.
This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency.

Finally, we found that giving each thread a different exploration policy helps improve robustness. Adding diversity to exploration in this manner also generally improves performance through better exploration. While there are many possible ways of making the exploration policies differ, we experiment with using ε-greedy exploration with ε periodically sampled from some distribution by each thread.

Asynchronous one-step Sarsa: The asynchronous one-step Sarsa algorithm is the same as asynchronous one-step Q-learning as given in Algorithm 1, except that it uses a different target value for Q(s, a). The target value used by one-step Sarsa is r + γQ(s′, a′; θ−), where a′ is the action taken in state s′ (Rummery & Niranjan, 1994; Sutton & Barto, 1998). We again use a target network and updates accumulated over multiple timesteps to stabilize learning.

Asynchronous n-step Q-learning: Pseudocode for our variant of multi-step Q-learning is shown in Supplementary Algorithm S1. The algorithm is somewhat unusual because it operates in the forward view by explicitly computing n-step returns, as opposed to the more common backward view used by techniques like eligibility traces (Sutton & Barto, 1998). We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to t_max steps or until a terminal state is reached. This process results in the agent receiving up to t_max rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return, resulting in a one-step update for the last state, a two-step update for the second last state, and so on, for a total of up to t_max updates. The accumulated updates are applied in a single gradient step.

Asynchronous advantage actor-critic: The algorithm, which we call asynchronous advantage actor-critic (A3C),

by t_max. The pseudocode for the algorithm is presented in Supplementary Algorithm S2.

As with the value-based methods, we rely on parallel actor-learners and accumulated updates for improving training stability. Note that while the parameters θ of the policy and θ_v of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy π(a_t|s_t; θ) and one linear output for the value function V(s_t; θ_v), with all non-output layers shared.

We also found that adding the entropy of the policy π to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991), who found that it was particularly helpful on tasks requiring hierarchical behavior. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form ∇_{θ′} log π(a_t|s_t; θ′)(R_t − V(s_t; θ_v)) + β∇_{θ′} H(π(s_t; θ′)), where H is the entropy. The hyperparameter β controls the strength of the entropy regularization term.

Optimization: We investigated three different optimization algorithms in our asynchronous framework - SGD with momentum, RMSProp (Tieleman & Hinton, 2012) without shared statistics, and RMSProp with shared statistics. We used the standard non-centered RMSProp update given by

    g = \alpha g + (1 - \alpha)\Delta\theta^2 \quad \text{and} \quad \theta \leftarrow \theta - \eta \frac{\Delta\theta}{\sqrt{g + \epsilon}},    (1)

where all operations are performed elementwise. A comparison on a subset of Atari 2600 games showed that a variant of RMSProp where the statistics g are shared across threads is considerably more robust than the other two methods. Full details of the methods and comparisons are included in Supplementary Section 1.

5. Experiments

We use four different platforms for assessing the properties
maintains a policy π(at |st ; θ) and an estimate of the value of the proposed framework. We perform most of our exper-
function V (st ; θv ). Like our variant of n-step Q-learning, iments using the Arcade Learning Environment (Bellemare
our variant of actor-critic also operates in the forward view et al., 2012), which provides a simulator for Atari 2600
and uses the same mix of n-step returns to update both the games. This is one of the most commonly used benchmark
policy and the value-function. The policy and the value environments for RL algorithms. We use the Atari domain
function are updated after every tmax actions or when a to compare against state of the art results (Van Hasselt et al.,
terminal state is reached. The update performed by the al- 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al.,
gorithm can be seen as ∇θ0 log π(at |st ; θ0 )A(st , at ; θ, θv ) 2015; Mnih et al., 2015), as well as to carry out a detailed
where A(st , at ; θ, θv ) is an estimate of the advantage func- stability and scalability analysis of the proposed methods.
Pk−1
tion given by i=0 γ i rt+i + γ k V (st+k ; θv ) − V (st ; θv ), We performed further comparisons using the TORCS 3D
where k can vary from state to state and is upper-bounded car racing simulator (Wymann et al., 2013). We also use
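To make the forward-view computation concrete, here is a minimal NumPy sketch (an illustration, not the authors' code) of the mix of n-step returns described above: a single backward pass over the rewards collected since the last update yields a one-step return for the last state, a two-step return for the second-last state, and so on.

```python
import numpy as np

def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Forward-view n-step returns: R_i = r_i + gamma * R_{i+1}, seeded with
    V(s_{t+k}) for a non-terminal final state (bootstrap_value = 0 at terminals)."""
    R = bootstrap_value
    returns = np.empty(len(rewards))
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        returns[i] = R
    return returns

# Example with t_max = 3 rewards collected by one actor-learner thread:
rets = n_step_returns([1.0, 0.0, 1.0], bootstrap_value=0.5, gamma=0.9)
# rets[2] = 1.0 + 0.9 * 0.5 = 1.45 (the one-step return for the last state)
```

In A3C, `rets[i] - V(s_i; θ_v)` would then serve as the advantage estimate multiplying the policy gradient term for step i.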
Asynchronous Methods for Deep Reinforcement Learning
Figure 1. Learning speed comparison for DQN and the new asynchronous algorithms on five Atari 2600 games. DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores. The plots are averaged over 5 runs. In the case of DQN the runs were for different seeds with fixed hyperparameters. For asynchronous methods we average over the best 5 models from 50 experiments with learning rates sampled from LogUniform(10⁻⁴, 10⁻²) and all other hyperparameters fixed.
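The non-centered RMSProp update of equation (1) is short to state in code. The sketch below (illustrative values and names, not the released implementation) shows the variant in which the running statistics g are shared across actor-learner threads.

```python
import numpy as np

class SharedRMSProp:
    """Non-centered RMSProp: g <- alpha*g + (1-alpha)*dtheta^2,
    theta <- theta - eta * dtheta / sqrt(g + eps), all elementwise.
    In the shared-statistics variant a single `g` is updated, without
    locking, by every actor-learner thread; the hyperparameter values
    below are illustrative defaults, not the paper's exact settings."""

    def __init__(self, shape, eta=7e-4, alpha=0.99, eps=0.1):
        self.g = np.zeros(shape)  # moving average of squared gradients
        self.eta, self.alpha, self.eps = eta, alpha, eps

    def step(self, theta, dtheta):
        self.g = self.alpha * self.g + (1.0 - self.alpha) * dtheta ** 2
        return theta - self.eta * dtheta / np.sqrt(self.g + self.eps)

opt = SharedRMSProp(shape=2)
theta = opt.step(np.zeros(2), np.array([1.0, -1.0]))  # symmetric first step
```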
two additional domains to evaluate only the A3C algorithm: MuJoCo and Labyrinth. MuJoCo (Todorov, 2015) is a physics simulator for evaluating agents on continuous motor control tasks with contact dynamics. Labyrinth is a new 3D environment where the agent must learn to find rewards in randomly generated mazes from a visual input. The precise details of our experimental setup can be found in Supplementary Section 2.

    Method             Training Time           Mean     Median
    DQN                8 days on GPU           121.9%   47.5%
    Gorila             4 days, 100 machines    215.2%   71.3%
    D-DQN              8 days on GPU           332.9%   110.9%
    Dueling D-DQN      8 days on GPU           343.8%   117.1%
    Prioritized DQN    8 days on GPU           463.6%   127.6%
    A3C, FF            1 day on CPU            344.1%   68.2%
    A3C, FF            4 days on CPU           496.8%   116.6%
    A3C, LSTM          4 days on CPU           623.0%   112.6%
Figure 2. Scatter plots of scores obtained by asynchronous advantage actor-critic on five games (Beamrider, Breakout, Pong, Q*bert,
Space Invaders) for 50 different learning rates and random initializations. On each game, there is a wide range of learning rates for
which all random initializations achieve good scores. This shows that A3C is quite robust to learning rates and initial random weights.
…numbers of actor-learners and training methods on five Atari games, and Figure 4, which shows plots of the average score against wall-clock time.

5.6. Robustness and Stability

Finally, we analyzed the stability and robustness of the four proposed asynchronous algorithms. For each of the four algorithms we trained models on five games (Breakout, Beamrider, Pong, Q*bert, Space Invaders) using 50 different learning rates and random initializations. Figure 2 shows scatter plots of the resulting scores for A3C, while Supplementary Figure S7 shows plots for the other three methods. There is usually a range of learning rates for each method and game combination that leads to good scores, indicating that all methods are quite robust to the choice of learning rate and random initialization. The fact that there are virtually no points with scores of 0 in regions with good learning rates indicates that the methods are stable and do not collapse or diverge once they are learning.

6. Conclusions and Discussion

We have presented asynchronous versions of four standard reinforcement learning algorithms and showed that they are able to train neural network controllers on a variety of domains in a stable manner. Our results show that in our proposed framework stable training of neural networks through reinforcement learning is possible with both value-based and policy-based methods, off-policy as well as on-policy methods, and in discrete as well as continuous domains. When trained on the Atari domain using 16 CPU cores, the proposed asynchronous algorithms train faster than DQN trained on an Nvidia K40 GPU, with A3C surpassing the current state of the art in half the training time.

One of our main findings is that using parallel actor-learners to update a shared model had a stabilizing effect on the learning process of the three value-based methods we considered. While this shows that stable online Q-learning is possible without experience replay, which was used for this purpose in DQN, it does not mean that experience replay is not useful. Incorporating experience replay into the asynchronous reinforcement learning framework could substantially improve the data efficiency of these methods by reusing old data. This could in turn lead to much faster training times in domains like TORCS, where interacting with the environment is more expensive than updating the model for the architecture we used.

Combining other existing reinforcement learning methods or recent advances in deep reinforcement learning with our asynchronous framework presents many possibilities for immediate improvements to the methods we presented. While our n-step methods operate in the forward view (Sutton & Barto, 1998) by using corrected n-step returns directly as targets, it has been more common to use the backward view to implicitly combine different returns through eligibility traces (Watkins, 1989; Sutton & Barto, 1998; Peng & Williams, 1996). The asynchronous advantage actor-critic method could potentially be improved by using other ways of estimating the advantage function, such as the generalized advantage estimation of Schulman et al. (2015b). All of the value-based methods we investigated could benefit from different ways of reducing the overestimation bias of Q-values (Van Hasselt et al., 2015; Bellemare et al., 2016). Yet another, more speculative, direction is to try to combine the recent work on true online temporal difference methods (van Seijen et al., 2015) with nonlinear function approximation.

In addition to these algorithmic improvements, a number of complementary improvements to the neural network architecture are possible. The dueling architecture of Wang et al. (2015) has been shown to produce more accurate estimates of Q-values by including separate streams for the state value and advantage in the network. The spatial softmax proposed by Levine et al. (2015) could improve both value-based and policy-based methods by making it easier for the network to represent feature coordinates.

ACKNOWLEDGMENTS

We thank Thomas Degris, Remi Munos, Marc Lanctot, Sasha Vezhnevets and Joseph Modayil for many helpful discussions, suggestions and comments on the paper. We also thank the DeepMind evaluation team for setting up the environments used to evaluate the agents in the paper.
Figure 3. Data efficiency comparison of different numbers of actor-learners for three asynchronous methods on five Atari games. The
x-axis shows the total number of training epochs where an epoch corresponds to four million frames (across all threads). The y-axis
shows the average score. Each curve shows the average over the three best learning rates. Single step methods show increased data
efficiency from more parallel workers. Results for Sarsa are shown in Supplementary Figure S5.
[Figure panels: average score curves for 1-step Q, n-step Q and A3C with 1, 2, 4, 8 and 16 threads on Beamrider, Breakout, Pong, Q*bert and Space Invaders, plotted against training time in hours.]
Figure 4. Training speed comparison of different numbers of actor-learners on five Atari games. The x-axis shows training time in
hours while the y-axis shows the average score. Each curve shows the average over the three best learning rates. All asynchronous
methods show significant speedups from using greater numbers of parallel actor-learners. Results for Sarsa are shown in Supplementary
Figure S6.
The theory of reinforcement learning provides a normative account¹, deeply rooted in psychological² and neuroscientific³ perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems⁴,⁵, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms³. While reinforcement learning agents have achieved some successes in a variety of domains⁶⁻⁸, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks⁹⁻¹¹ to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games¹². We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

We set out to create a single algorithm that would be able to develop a wide range of competencies on a varied range of challenging tasks, a central goal of general artificial intelligence¹³ that has eluded previous efforts⁸,¹⁴,¹⁵. To achieve this, we developed a novel agent, a deep Q-network (DQN), which is able to combine reinforcement learning with a class of artificial neural network¹⁶ known as deep neural networks. Notably, recent advances in deep neural networks⁹⁻¹¹, in which several layers of nodes are used to build up progressively more abstract representations of the data, have made it possible for artificial neural networks to learn concepts such as object categories directly from raw sensory data. We …

… agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

    Q*(s, a) = max_π E[ r_t + γr_{t+1} + γ²r_{t+2} + … | s_t = s, a_t = a, π ],

which is the maximum sum of rewards r_t discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s), after making an observation (s) and taking an action (a) (see Methods)¹⁹.

Reinforcement learning is known to be unstable or even to diverge when a nonlinear function approximator such as a neural network is used to represent the action-value (also known as Q) function²⁰. This instability has several causes: the correlations present in the sequence of observations, the fact that small updates to Q may significantly change the policy and therefore change the data distribution, and the correlations between the action-values (Q) and the target values r + γ max_{a′} Q(s′, a′). We address these instabilities with a novel variant of Q-learning, which uses two key ideas. First, we used a biologically inspired mechanism termed experience replay²¹⁻²³ that randomizes over the data, thereby removing correlations in the observation sequence and smoothing over changes in the data distribution (see below for details). Second, we used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.

While other stable methods exist for training neural networks in the reinforcement learning setting, such as neural fitted Q-iteration²⁴, these methods involve the repeated training of networks de novo on hundreds of iterations. Consequently, these methods, unlike our algorithm, are too inefficient to be used successfully with large neural networks. We parameterize an approximate value function Q(s, a; θ_i) using the deep convolutional neural network shown in Fig. 1, in which θ_i are the parameters (that is, weights) of the Q-network at iteration i. To perform experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time-step t in a data set D_t = {e_1, …, e_t}. During learning, we apply Q-learning updates on samples (or minibatches) of experience (s, a, r, s′) ~ U(D), drawn uniformly at random from the pool of stored samples. The Q-learning update at iteration i uses the following loss function:

    L_i(θ_i) = E_{(s,a,r,s′)~U(D)} [ ( r + γ max_{a′} Q(s′, a′; θ_i⁻) − Q(s, a; θ_i) )² ]
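The two stabilizing ideas above, a uniformly sampled replay memory and a periodically updated target network, can be sketched in a few lines (a simplified illustration, not the authors' implementation; the `q_target` function below is a hypothetical stand-in for the target network with parameters θ⁻):

```python
import random
from collections import deque
import numpy as np

class ReplayMemory:
    """Stores transitions e_t = (s, a, r, s', terminal) and samples
    minibatches uniformly at random, as in the loss L_i above."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded
    def add(self, transition):
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def td_targets(batch, q_target, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), with y = r at terminals."""
    return np.array([r if terminal else r + gamma * np.max(q_target(s_next))
                     for (s, a, r, s_next, terminal) in batch])

# Hypothetical target network returning action-values for two actions:
q_target = lambda s: np.array([0.0, 1.0])
memory = ReplayMemory(capacity=1000)
memory.add(("s0", 0, 1.0, "s1", False))
y = td_targets(memory.sample(1), q_target)  # 1.0 + 0.99 * max(0.0, 1.0)
```

In the full algorithm, y would be regressed against Q(s, a; θ_i) by gradient descent, and θ⁻ would be copied from θ only every fixed number of updates.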
26 FEBRUARY 2015 | VOL 518 | NATURE | 529
©2015 Macmillan Publishers Limited. All rights reserved
RESEARCH LETTER
Figure 1 | Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).
…difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available¹²,¹⁵. In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games;
[Figure 2 panels a–d: average score per episode (a, b) and average predicted action-value Q (c, d), plotted against training epochs 0–200.]
Figure 2 | Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with ε-greedy policy (ε = 0.05) for 520 k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders. Each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.
[Figure 3 bar chart: human-normalized DQN scores for all 49 games, ordered from Video Pinball (highest) down to Montezuma's Revenge (lowest); games are marked as at human level or above versus below human level, with the best linear learner shown for comparison.]
Figure 3 | Comparison of the DQN agent with the best reinforcement learning methods¹⁵ in the literature. The performance of DQN is normalized with respect to a professional human games tester (that is, 100% level) and random play (that is, 0% level). Note that the normalized performance of DQN, expressed as a percentage, is calculated as: 100 × (DQN score − random play score)/(human score − random play score). It can be seen that DQN outperforms competing methods (also see Extended Data Table 2) in almost all the games, and performs at a level that is broadly comparable with or superior to a professional human games tester (that is, operationalized as a level of 75% or above) in the majority of games. Audio output was disabled for both human players and agents. Error bars indicate s.d. across the 30 evaluation episodes, starting with different initial conditions.
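The normalization in the caption amounts to a one-line function; the sketch below uses hypothetical scores for illustration.

```python
def human_normalized(agent_score, human_score, random_score):
    """100 * (agent - random) / (human - random), so that random play maps
    to 0% and the professional human tester maps to 100%."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical scores: an agent scoring midway between random and human
# play is at 50%.
midway = human_normalized(agent_score=550.0, human_score=1000.0, random_score=100.0)
```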
see Fig. 3, Supplementary Discussion and Extended Data Table 2). In additional simulations (see Supplementary Discussion and Extended Data Tables 3 and 4), we demonstrate the importance of the individual core components of the DQN agent (the replay memory, separate target Q-network and deep convolutional network architecture) by disabling them and demonstrating the detrimental effects on performance.

We next examined the representations learned by DQN that underpinned the successful performance of the agent in the context of the game Space Invaders (see Supplementary Video 1 for a demonstration of the performance of DQN), by using a technique developed for the visualization of high-dimensional data called 't-SNE'²⁵ (Fig. 4). As expected, the t-SNE algorithm tends to map the DQN representation of perceptually similar states to nearby points. Interestingly, we also found instances in which the t-SNE algorithm generated similar embeddings for DQN representations of states that are close in terms of expected reward but perceptually dissimilar (Fig. 4, bottom right, top left and middle), consistent with the notion that the network is able to learn representations that support adaptive behaviour from high-dimensional sensory inputs. Furthermore, we also show that the representations learned by DQN are able to generalize to data generated from policies other than its own: in simulations we presented as input to the network game states experienced during human and agent play, recorded the representations of the last hidden layer, and visualized the embeddings generated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary Discussion). Extended Data Fig. 2 provides an additional illustration of how the representations learned by DQN allow it to accurately predict state and action values.

It is worth noting that the games in which DQN excels are extremely varied in their nature, from side-scrolling shooters (River Raid) to boxing games (Boxing) and three-dimensional car-racing games (Enduro).
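The visualization procedure just described can be approximated in a few lines (a sketch assuming scikit-learn's `TSNE` is available; the activations below are random stand-ins for the recorded last-hidden-layer representations, with arbitrary dimensions):

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Stand-in for last-hidden-layer activations recorded over experienced
# game states (one row per state; the sizes here are arbitrary).
hidden = rng.normal(size=(200, 64))

embedding = TSNE(n_components=2, perplexity=30.0,
                 init="pca", random_state=0).fit_transform(hidden)
# Each row of `embedding` is a 2-D point that can then be plotted and
# coloured by the state value V predicted for the corresponding state.
```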
Figure 4 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced while playing Space Invaders. The plot was generated by letting the DQN agent play for 2 h of real game time and running the t-SNE algorithm²⁵ on the last-hidden-layer representations assigned by DQN to each experienced game state. The points are coloured according to the state values (V, maximum expected reward of a state) predicted by DQN for the corresponding game states (ranging from dark red (highest V) to dark blue (lowest V)). The screenshots corresponding to a selected number of points are shown. The DQN agent predicts high state values for both full (top right screenshots) and nearly complete screens (bottom left screenshots) because it has learned that completing a screen leads to a new screen full of enemy ships. Partially completed screens (bottom screenshots) are assigned lower state values because less immediate reward is available. The screens shown on the bottom right and top left and middle are less perceptually similar than the other examples but are still mapped to nearby representations and similar values because the orange bunkers do not carry great significance near the end of a level. With permission from Square Enix Limited.
Indeed, in certain games DQN is able to discover a relatively long-term strategy (for example, Breakout: the agent learns the optimal strategy, which is to first dig a tunnel around the side of the wall allowing the ball to be sent around the back to destroy a large number of blocks; see Supplementary Video 2 for an illustration of the development of DQN's performance over the course of training). Nevertheless, games demanding more temporally extended planning strategies still constitute a major challenge for all existing agents including DQN (for example, Montezuma's Revenge).

In this work, we demonstrate that a single architecture can successfully learn control policies in a range of different environments with only very minimal prior knowledge, receiving only the pixels and the game score as inputs, and using the same algorithm, network architecture and hyperparameters on each game, privy only to the inputs a human player would have. In contrast to previous work²⁴,²⁶, our approach incorporates 'end-to-end' reinforcement learning that uses reward to continuously shape representations within the convolutional network towards salient features of the environment that facilitate value estimation. This principle draws on neurobiological evidence that reward signals during perceptual learning may influence the characteristics of representations within primate visual cortex²⁷,²⁸. Notably, the successful integration of reinforcement learning with deep network architectures was critically dependent on our incorporation of a replay algorithm²¹⁻²³ involving the storage and representation of recently experienced transitions. Convergent evidence suggests that the hippocampus may support the physical realization of such a process in the mammalian brain, with the time-compressed reactivation of recently experienced trajectories during offline periods²¹,²² (for example, waking rest) providing a putative mechanism by which value functions may be efficiently updated through interactions with the basal ganglia²². In the future, it will be important to explore the potential use of biasing the content of experience replay towards salient events, a phenomenon that characterizes empirically observed hippocampal replay²⁹, and relates to the notion of 'prioritized sweeping'³⁰ in reinforcement learning. Taken together, our work illustrates the power of harnessing state-of-the-art machine learning techniques with biologically inspired mechanisms to create agents that are capable of learning to master a diverse array of challenging tasks.

Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

Received 10 July 2014; accepted 16 January 2015.

1. Sutton, R. & Barto, A. Reinforcement Learning: An Introduction (MIT Press, 1998).
2. Thorndike, E. L. Animal Intelligence: Experimental Studies (Macmillan, 1911).
3. Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275, 1593–1599 (1997).
4. Serre, T., Wolf, L. & Poggio, T. Object recognition with features inspired by visual cortex. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 994–1000 (2005).
5. Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980).
6. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).
7. Riedmiller, M., Gabel, T., Hafner, R. & Lange, S. Reinforcement learning for robot soccer. Auton. Robots 27, 55–73 (2009).
8. Diuk, C., Cohen, A. & Littman, M. L. An object-oriented representation for efficient reinforcement learning. Proc. Int. Conf. Mach. Learn. 240–247 (2008).
9. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 1–127 (2009).
10. Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012).
11. Hinton, G. E. & Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
12. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
13. Legg, S. & Hutter, M. Universal intelligence: a definition of machine intelligence. Minds Mach. 17, 391–444 (2007).
14. Genesereth, M., Love, N. & Pell, B. General game playing: overview of the AAAI competition. AI Mag. 26, 62–72 (2005).
15. Bellemare, M. G., Veness, J. & Bowling, M. Investigating contingency awareness using Atari 2600 games. Proc. Conf. AAAI Artif. Intell. 864–871 (2012).
16. McClelland, J. L., Rumelhart, D. E. & the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition (MIT Press, 1986).
17. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
18. Hubel, D. H. & Wiesel, T. N. Shape and arrangement of columns in cat's striate cortex. J. Physiol. 165, 559–568 (1963).
19. Watkins, C. J. & Dayan, P. Q-learning. Mach. Learn. 8, 279–292 (1992).
20. Tsitsiklis, J. & Roy, B. V. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr. 42, 674–690 (1997).
21. McClelland, J. L., McNaughton, B. L. & O'Reilly, R. C. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457 (1995).
22. O'Neill, J., Pleydell-Bouverie, B., Dupret, D. & Csicsvari, J. Play it again: reactivation of waking experience and memory. Trends Neurosci. 33, 220–229 (2010).
23. Lin, L.-J. Reinforcement learning for robots using neural networks. Technical Report, DTIC Document (1993).
24. Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. Mach. Learn.: ECML 3720, 317–328 (Springer, 2005).
25. Van der Maaten, L. J. P. & Hinton, G. E. Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
26. Lange, S. & Riedmiller, M. Deep auto-encoder neural networks in reinforcement learning. Proc. Int. Jt. Conf. Neural Netw. 1–8 (2010).
27. Law, C.-T. & Gold, J. I. Reinforcement learning can account for associative and perceptual learning on a visual decision task. Nature Neurosci. 12, 655 (2009).
28. Sigala, N. & Logothetis, N. K. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature 415, 318–320 (2002).
29. Bendor, D. & Wilson, M. A. Biasing the content of hippocampal replay during sleep. Nature Neurosci. 15, 1439–1444 (2012).
30. Moore, A. & Atkeson, C. Prioritized sweeping: reinforcement learning with less data and less real time. Mach. Learn. 13, 103–130 (1993).

Supplementary Information is available in the online version of the paper.

Acknowledgements We thank G. Hinton, P. Dayan and M. Bowling for discussions, A. Cain and J. Keene for work on the visuals, K. Keller and P. Rogers for help with the visuals, G. Wayne for comments on an earlier version of the manuscript, and the rest of the DeepMind team for their support, ideas and encouragement.

Author Contributions V.M., K.K., D.S., J.V., M.G.B., M.R., A.G., D.W., S.L. and D.H. conceptualized the problem and the technical framework. V.M., K.K., A.A.R. and D.S. developed and tested the algorithms. J.V., S.P., C.B., A.A.R., M.G.B., I.A., A.K.F., G.O. and A.S. created the testing platform. K.K., H.K., S.L. and D.H. managed the project. K.K., D.K., D.H., V.M., D.S., A.G., A.A.R., J.V. and M.G.B. wrote the paper.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper. Correspondence and requests for materials should be addressed to K.K. (korayk@google.com) or D.H. (demishassabis@google.com).
METHODS
Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160 pixel images with a 128-colour palette, can be demanding in terms of computation and memory requirements. We apply a basic preprocessing step aimed at reducing the input dimensionality and dealing with some artefacts of the Atari 2600 emulator. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites Atari 2600 can display at once. Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84 × 84. The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4, although the algorithm is robust to different values of m (for example, 3 or 5).
Code availability. The source code can be accessed at https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.
Model architecture. There are several possible ways of parameterizing Q using a neural network. Because Q maps history–action pairs to scalar estimates of their Q-value, the history and the action have been used as inputs to the neural network by some previous approaches24,26. The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action, resulting in a cost that scales linearly with the number of actions. We instead use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. The outputs correspond to the predicted Q-values of the individual actions for the input state. The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.
The exact architecture, shown schematically in Fig. 1, is as follows. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the input image and applies a rectifier nonlinearity31,32. The second hidden layer convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action. The number of valid actions varied between 4 and 18 on the games we considered.
Training details. We performed experiments on 49 Atari 2600 games where results were available for all other comparable methods12,15. A different network was trained on each game: the same network architecture, learning algorithm and hyperparameter settings (see Extended Data Table 1) were used across all games, showing that our approach is robust enough to work on a variety of games while incorporating only minimal prior knowledge (see below). While we evaluated our agents on unmodified games, we made one change to the reward structure of the games during training only. As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude. For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.
In these experiments, we used the RMSProp (see http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) algorithm with minibatches of size 32. The behaviour policy during training was ε-greedy with ε annealed linearly from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained for a total of 50 million frames (that is, around 38 days of game experience in total) and used a replay memory of 1 million most recent frames.
Following previous approaches to playing Atari 2600 games, we also use a simple frame-skipping technique15. More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games.
The values of all the hyperparameters and optimization parameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing to the high computational cost. These parameters were then held fixed across all other games. The values and descriptions of all hyperparameters are provided in Extended Data Table 1.
Our experimental setup amounts to using the following minimal prior knowledge: that the input data consisted of visual images (motivating our use of a convolutional deep network), the game-specific score (with no modification), number of actions, although not their correspondences (for example, specification of the up 'button') and the life count.
Evaluation procedure. The trained agents were evaluated by playing each game 30 times for up to 5 min each time with different initial random conditions ('no-op'; see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure is adopted to minimize the possibility of overfitting during evaluation. The random agent served as a baseline comparison and chose a random action at 10 Hz which is every sixth frame, repeating its last action on intervening frames. 10 Hz is about the fastest that a human player can select the 'fire' button, and setting the random agent to this frequency avoids spurious baseline scores in a handful of the games. We did also assess the performance of a random agent that selected an action at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy Climber, Demon Attack, Krull and Robotank), and in all these games DQN outperformed the expert human by a considerable margin.
The professional human tester used the same emulator engine as the agents, and played under controlled conditions. The human tester was not allowed to pause, save or reload games. As in the original Atari 2600 environment, the emulator was run at 60 Hz and the audio output was disabled: as such, the sensory input was equated between human player and agents. The human performance is the average reward achieved from around 20 episodes of each game lasting a maximum of 5 min each, following around 2 h of practice playing each game.
Algorithm. We consider tasks in which an agent interacts with an environment, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action a_t from the set of legal game actions, A = {1, …, K}. The action is passed to the emulator and modifies its internal state and the game score. In general the environment may be stochastic. The emulator's internal state is not observed by the agent; instead the agent observes an image x_t ∈ R^d from the emulator, which is a vector of pixel values representing the current screen. In addition it receives a reward r_t representing the change in game score. Note that in general the game score may depend on the whole previous sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.
Because the agent only observes the current screen, the task is partially observed33 and many emulator states are perceptually aliased (that is, it is impossible to fully understand the current situation from only the current screen x_t). Therefore, sequences of actions and observations, s_t = x_1, a_1, x_2, …, a_{t−1}, x_t, are input to the algorithm, which then learns game strategies depending upon these sequences. All sequences in the emulator are assumed to terminate in a finite number of time-steps. This formalism gives rise to a large but finite Markov decision process (MDP) in which each sequence is a distinct state. As a result, we can apply standard reinforcement learning methods for MDPs, simply by using the complete sequence s_t as the state representation at time t.
The goal of the agent is to interact with the emulator by selecting actions in a way that maximizes future rewards. We make the standard assumption that future rewards are discounted by a factor of γ per time-step (γ was set to 0.99 throughout), and define the future discounted return at time t as R_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}, in which T is the time-step at which the game terminates. We define the optimal action-value function Q*(s, a) as the maximum expected return achievable by following any policy, after seeing some sequence s and then taking some action a, Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π], in which π is a policy mapping sequences to actions (or distributions over actions).
The optimal action-value function obeys an important identity known as the Bellman equation. This is based on the following intuition: if the optimal value Q*(s′, a′) of the sequence s′ at the next time-step was known for all possible actions a′, then the optimal strategy is to select the action a′ maximizing the expected value of r + γQ*(s′, a′):

Q*(s, a) = E_{s′}[ r + γ max_{a′} Q*(s′, a′) | s, a ]

The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update, Q_{i+1}(s, a) = E_{s′}[ r + γ max_{a′} Q_i(s′, a′) | s, a ]. Such value iteration algorithms converge to the optimal action-value function, Q_i → Q* as i → ∞. In practice, this basic approach is impractical, because the action-value function is estimated separately for each sequence, without any generalization. Instead, it is common to use a function approximator to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a). In the reinforcement learning community this is typically a linear function approximator, but
sometimes a nonlinear function approximator is used instead, such as a neural network. We refer to a neural network function approximator with weights θ as a Q-network. A Q-network can be trained by adjusting the parameters θ_i at iteration i to reduce the mean-squared error in the Bellman equation, where the optimal target values r + γ max_{a′} Q*(s′, a′) are substituted with approximate target values y = r + γ max_{a′} Q(s′, a′; θ_i⁻), using parameters θ_i⁻ from some previous iteration. This leads to a sequence of loss functions L_i(θ_i) that changes at each iteration i,

L_i(θ_i) = E_{s,a,r}[ (E_{s′}[y | s, a] − Q(s, a; θ_i))² ]
         = E_{s,a,r,s′}[ (y − Q(s, a; θ_i))² ] + E_{s,a,r}[ Var_{s′}[y] ].

Note that the targets depend on the network weights; this is in contrast with the targets used for supervised learning, which are fixed before learning begins. At each stage of optimization, we hold the parameters from the previous iteration θ_i⁻ fixed when optimizing the ith loss function L_i(θ_i), resulting in a sequence of well-defined optimization problems. The final term is the variance of the targets, which does not depend on the parameters θ_i that we are currently optimizing, and may therefore be ignored. Differentiating the loss function with respect to the weights we arrive at the following gradient:

∇_{θ_i} L_i(θ_i) = E_{s,a,r,s′}[ (r + γ max_{a′} Q(s′, a′; θ_i⁻) − Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ].

Rather than computing the full expectations in the above gradient, it is often computationally expedient to optimize the loss function by stochastic gradient descent. The familiar Q-learning algorithm19 can be recovered in this framework by updating the weights after every time step, replacing the expectations using single samples, and setting θ_i⁻ = θ_{i−1}.
Note that this algorithm is model-free: it solves the reinforcement learning task directly using samples from the emulator, without explicitly estimating the reward and transition dynamics P(r, s′ | s, a). It is also off-policy: it learns about the greedy policy a = argmax_{a′} Q(s, a′; θ), while following a behaviour distribution that ensures adequate exploration of the state space. In practice, the behaviour distribution is often selected by an ε-greedy policy that follows the greedy policy with probability 1 − ε and selects a random action with probability ε.
Training algorithm for deep Q-networks. The full algorithm for training deep Q-networks is presented in Algorithm 1. The agent selects and executes actions according to an ε-greedy policy based on Q. Because using histories of arbitrary length as inputs to a neural network can be difficult, our Q-function instead works on a fixed length representation of histories produced by the function φ described above. The algorithm modifies standard online Q-learning in two ways to make it suitable for training large neural networks without diverging.
First, we use a technique known as experience replay23 in which we store the agent's experiences at each time-step, e_t = (s_t, a_t, r_t, s_{t+1}), in a data set D_t = {e_1, …, e_t}, pooled over many episodes (where the end of an episode occurs when a terminal state is reached) into a replay memory. During the inner loop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience, (s, a, r, s′) ~ U(D), drawn at random from the pool of stored samples. This approach has several advantages over standard online Q-learning. First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency. Second, learning directly from consecutive samples is inefficient, owing to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates. Third, when learning on-policy the current parameters determine the next data sample that the parameters are trained on. For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically20. By using experience replay the behaviour distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples uniformly at random from D when performing updates. This approach is in some respects limited because the memory buffer does not differentiate important transitions and always overwrites with recent transitions owing to the finite memory size N. Similarly, the uniform sampling gives equal importance to all transitions in the replay memory. A more sophisticated sampling strategy might emphasize transitions from which we can learn the most, similar to prioritized sweeping30.
The second modification to online Q-learning aimed at further improving the stability of our method with neural networks is to use a separate network for generating the targets y_j in the Q-learning update. More precisely, every C updates we clone the network Q to obtain a target network Q̂ and use Q̂ for generating the Q-learning targets y_j for the following C updates to Q. This modification makes the algorithm more stable compared to standard online Q-learning, where an update that increases Q(s_t, a_t) often also increases Q(s_{t+1}, a) for all a and hence also increases the target y_j, possibly leading to oscillations or divergence of the policy. Generating the targets using an older set of parameters adds a delay between the time an update to Q is made and the time the update affects the targets y_j, making divergence or oscillations much more unlikely.
We also found it helpful to clip the error term from the update, r + γ max_{a′} Q(s′, a′; θ_i⁻) − Q(s, a; θ_i), to be between −1 and 1. Because the absolute value loss function |x| has a derivative of −1 for all negative values of x and a derivative of 1 for all positive values of x, clipping the squared error to be between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1, 1) interval. This form of error clipping further improved the stability of the algorithm.

Algorithm 1: deep Q-learning with experience replay.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
For episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1, T do
        With probability ε select a random action a_t
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j if episode terminates at step j + 1,
        otherwise set y_j = r_j + γ max_{a′} Q̂(φ_{j+1}, a′; θ⁻)
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End For
End For

31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146–2153 (2009).
32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. Proc. Int. Conf. Mach. Learn. 807–814 (2010).
33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101, 99–134 (1994).
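Taken together, the pieces above (ε-greedy exploration, experience replay, a periodically cloned target network and error clipping) can be condensed into a short runnable sketch of Algorithm 1. The toy random environment and the linear Q-function below are hypothetical stand-ins for the Atari emulator and the convolutional network; none of the names or values come from the paper's released code:

```python
import random
from collections import deque

import numpy as np


class ToyEnv:
    """Hypothetical stand-in for the Atari emulator: random 4-D observations,
    K = 3 actions, rewards already clipped to {-1, 0, 1}, 50-step episodes."""
    def __init__(self, dim=4, n_actions=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dim, self.n_actions = dim, n_actions
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.standard_normal(self.dim)

    def step(self, action):
        self.t += 1
        obs = self.rng.standard_normal(self.dim)
        reward = float(self.rng.integers(-1, 2))
        return obs, reward, self.t >= 50


def q_values(theta, state):
    # Linear "Q-network": one output per action (theta has shape [n_actions, dim]).
    return theta @ state


def train(episodes=20, gamma=0.99, lr=0.01, eps=0.1,
          batch_size=32, capacity=1000, target_update_c=100):
    env = ToyEnv()
    theta = np.zeros((env.n_actions, env.dim))   # action-value function Q
    theta_target = theta.copy()                  # target network Q-hat
    memory = deque(maxlen=capacity)              # replay memory D, capacity N
    sampler = random.Random(0)
    steps = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if sampler.random() < eps:
                a = sampler.randrange(env.n_actions)
            else:
                a = int(np.argmax(q_values(theta, s)))
            s2, r, done = env.step(a)
            memory.append((s, a, r, s2, done))   # store transition in D
            s = s2
            if len(memory) >= batch_size:
                # minibatch sampled uniformly at random from D
                for sj, aj, rj, sj2, dj in sampler.sample(list(memory), batch_size):
                    # target y_j uses the older parameters theta^-
                    yj = rj if dj else rj + gamma * np.max(q_values(theta_target, sj2))
                    err = np.clip(yj - q_values(theta, sj)[aj], -1.0, 1.0)  # error clipping
                    theta[aj] += lr * err * sj   # gradient step on (y_j - Q)^2
            steps += 1
            if steps % target_update_c == 0:
                theta_target = theta.copy()      # every C steps reset Q-hat = Q
    return theta


theta = train()
print(theta.shape)   # -> (3, 4)
```

The gradient step uses the linear-model identity ∇_θ Q(s, a; θ) = s; with a deep network the same update would be applied through backpropagation, with RMSProp in place of plain stochastic gradient descent.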
Extended Data Figure 1 | Two-dimensional t-SNE embedding of the representations in the last hidden layer assigned by DQN to game states experienced during a combination of human and agent play in Space Invaders. The plot was generated by running the t-SNE algorithm25 on the last hidden layer representation assigned by DQN to game states experienced during a combination of human (30 min) and agent (2 h) play. The fact that there is similar structure in the two-dimensional embeddings corresponding to the DQN representation of states experienced during human play (orange points) and DQN play (blue points) suggests that the representations learned by DQN do indeed generalize to data generated from policies other than its own. The presence in the t-SNE embedding of overlapping clusters of points corresponding to the network representation of states experienced during human and agent play shows that the DQN agent also follows sequences of states similar to those found in human play. Screenshots corresponding to selected states are shown (human: orange border; DQN: blue border).
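An embedding of this kind can be reproduced in outline with scikit-learn's t-SNE implementation. The random arrays below are a hypothetical stand-in for the real 512-unit last-hidden-layer activations, which would come from forward passes through the trained network:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the last-hidden-layer activations that DQN
# assigns to game states from human play and from agent play.
rng = np.random.default_rng(0)
human_feats = rng.standard_normal((50, 512))   # states from human play
agent_feats = rng.standard_normal((50, 512))   # states from DQN play
features = np.vstack([human_feats, agent_feats])

# Project the 512-dimensional representations to 2-D, as in the figure.
embedding = TSNE(n_components=2, perplexity=10.0, init="random",
                 random_state=0).fit_transform(features)
print(embedding.shape)   # -> (100, 2)
```

In the figure, the two 2-D point clouds would then be scatter-plotted in orange (human) and blue (DQN) to inspect their overlap.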
Extended Data Figure 2 | Visualization of learned value functions on two games, Breakout and Pong. a, A visualization of the learned value function on the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 and the agent is clearing the bricks at the lowest level. Each of the peaks in the value function curve corresponds to a reward obtained by clearing a brick. At time point 3, the agent is about to break through to the top level of bricks and the value increases to ~21 in anticipation of breaking out and clearing a large set of bricks. At point 4, the value is above 23 and the agent has broken through. After this point, the ball will bounce at the upper part of the bricks clearing many of them by itself. b, A visualization of the learned action-value function on the game Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7, reflecting the expected value of this state based on previous experience. At time point 2, the agent starts moving the paddle towards the ball and the value of the 'up' action stays high while the value of the 'down' action falls to −0.9. This reflects the fact that pressing 'down' would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing 'up' and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the value of all actions reflects that the agent is about to receive a reward of 1. Note, the dashed line shows the past trajectory of the ball purely for illustrative purposes (that is, not shown during the game). With permission from Atari Interactive, Inc.
The values of all the hyperparameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing
to the high computational cost, although it is conceivable that even better results could be obtained by systematically tuning the hyperparameter values.
Extended Data Table 2 | Comparison of games scores obtained by DQN agents with methods from the literature12,15 and a professional
human games tester
Best Linear Learner is the best result obtained by a linear function approximator on different types of hand-designed features12. Contingency (SARSA) agent figures are the results obtained in ref. 15. Note the figures in the last column indicate the performance of DQN relative to the human games tester, expressed as a percentage, that is, 100 × (DQN score − random play score)/(human score − random play score).
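The normalization used in the last column can be expressed directly; the scores passed in below are made-up numbers for illustration, not results from the table:

```python
def normalized_score(agent_score, human_score, random_score):
    """Normalized performance as defined in the table legend:
    100 * (DQN score - random play score) / (human score - random play score)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


# Illustrative numbers only: an agent scoring 400 where a human scores 300
# and random play scores 100 is at 150% of human-level performance.
print(normalized_score(400.0, 300.0, 100.0))   # -> 150.0
```

By this definition, 0% corresponds to random play and 100% to the professional human tester.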
Extended Data Table 3 | The effects of replay and separating the target Q-network
DQN agents were trained for 10 million frames using standard hyperparameters for all possible combinations of turning replay on or off, using or not using a separate target Q-network, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2 (50 million frames).
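The ablation grid described in this legend can be sketched as a Cartesian product of the three factors; the learning-rate values below are placeholders, since the legend does not state the actual rates:

```python
from itertools import product

# 2 x 2 x 3 ablation grid: replay on/off, separate target Q-network on/off,
# three learning rates (placeholder values, not taken from the paper).
replay_options = (True, False)
target_network_options = (True, False)
learning_rates = (0.00025, 0.0005, 0.001)

configs = list(product(replay_options, target_network_options, learning_rates))
print(len(configs))   # -> 12 trained agent configurations
```

Each of the twelve configurations would then be trained for 10 million frames and scored as the legend describes.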
The performance of the DQN agent is compared with the performance of a linear function approximator on the 5 validation games (that is, where a single linear layer was used instead of the convolutional network, in combination with replay and separate target network). Agents were trained for 10 million frames using standard hyperparameters, and three different learning rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was shorter (10 million frames) as compared to the main results presented in Extended Data Table 2 (50 million frames).
Deep learning
Yann LeCun1,2, Yoshua Bengio3 & Geoffrey Hinton4,5
Deep learning allows computational models that are composed of multiple processing layers to learn representations of
data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech rec-
ognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep
learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine
should change its internal parameters that are used to compute the representation in each layer from the representation in
the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and
audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.
Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17.
We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as 'knobs' that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can
1Facebook AI Research, 770 Broadway, New York, New York 10003, USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128 Centre-Ville STN, Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 5Department of Computer Science, University of Toronto, 6 King’s College Road, Toronto, Ontario M5S 3G4, Canada.
4 3 6 | NAT U R E | VO L 5 2 1 | 2 8 M AY 2 0 1 5
© 2015 Macmillan Publishers Limited. All rights reserved
REVIEW INSIGHT
be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on
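The SGD procedure described above can be sketched in a few lines of Python. This is an illustrative toy, not code from the paper: a one-weight model y = w*x trained on a squared-error objective, with the data, learning rate, batch size and step count all invented for the example.

```python
import random

# Minimal sketch of stochastic gradient descent as described in the text:
# show a few examples, compute the average gradient over them, and adjust
# the weight a small step in the opposite direction. All values here are
# illustrative assumptions.

def sgd(examples, w=0.0, lr=0.05, steps=200, batch=2, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        batch_examples = rng.sample(examples, batch)  # small, noisy sample
        # Average gradient of E = 0.5*(w*x - t)^2 over the batch: (w*x - t)*x.
        grad = sum((w * x - t) * x for x, t in batch_examples) / batch
        w -= lr * grad  # step opposite to the gradient
    return w

# Targets generated by t = 2*x, so the error-minimising weight is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (-1.0, -2.0)]
w = sgd(data)
```

Each small batch gives only a noisy estimate of the full-data gradient, yet the weight still settles near the minimum, which is the point the text makes about SGD.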
[Figure 1, panels a–d (graphics omitted). Panel b illustrates the chain rule: Δy = (∂y/∂x)Δx and Δz = (∂z/∂y)Δy, so Δz = (∂z/∂y)(∂y/∂x)Δx and ∂z/∂x = (∂z/∂y)(∂y/∂x). Panel c shows a network with two input units, two sigmoid hidden units and one sigmoid output unit.]
Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives — how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0,z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives yl − tl if the cost function for unit l is 0.5(yl − tl)², where tl is the target value. Once the ∂E/∂zk is known, the error derivative for the weight wjk on the connection from unit j in the layer below is just yj ∂E/∂zk.
[Figure 2 (graphics omitted). Output class scores shown: Samoyed (16); Papillon (5.7); Pomeranian (2.7); Arctic fox (1.0); Eskimo dog (0.6); white wolf (0.4); Siberian husky (0.4). The architecture alternates convolution and max-pooling stages.]

Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.
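The two operations that alternate inside a network like the one in Figure 2 can be sketched in plain Python: a feature map computed by sliding one shared filter over the input (a discrete convolution, written here as cross-correlation, as is conventional in ConvNet implementations), followed by max pooling. The image and filter values are illustrative.

```python
# Sketch of a single feature map (shared-weight filtering) followed by
# max pooling over non-overlapping 2x2 patches. Illustrative values only.

def convolve2d(image, kernel):
    """Valid 2D cross-correlation: every output unit applies the same
    weights (the shared filter bank) to a different local input patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size=2):
    """Max over size x size patches shifted by `size`, coarse-graining
    feature positions to gain invariance to small shifts."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# A vertical-edge detector applied to an image with an edge down the middle.
image = [[0, 0, 1, 1] for _ in range(5)]
kernel = [[-1, 1]]               # responds where intensity jumps left-to-right
fmap = convolve2d(image, kernel)  # strong response at the edge column
pooled = max_pool(fmap)
```

Because the same kernel is applied at every position, a motif is detected wherever it appears, which is the weight-sharing idea discussed later in the text.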
raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples21. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the
remainder29,30. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited36.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary38 and was quickly developed to give record-breaking results on a large vocabulary task39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet)41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explains half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex45. ConvNets have their roots in the neocognitron46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words47,48.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition47 and document reading42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands50,51, and for face recognition52.

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition53, the segmentation of biological images54 particularly for connectomics55, and the detection of faces, text, pedestrians and human bodies in natural images36,50,51,56–58. A major recent practical success of ConvNets is face recognition59.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and
[Figure 3 (graphics omitted). Pipeline: Vision (Deep CNN) to Language (Generating RNN). Sample generated captions: “A group of people shopping at an outdoor market.” “A woman is throwing a frisbee in a park.” “A dog is standing on a hardwood floor.” “A stop sign is on a road with a mountain in the background.” “A little girl sitting on a bed with a teddy bear.” “A group of people sitting on a boat in the water.” “A giraffe standing in a forest with trees in the background.”]
Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better ‘translation’ of images into captions.
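The distributed-representation argument made in this section rests on a counting fact: a one-of-N code with n units can represent only n distinct items, while n binary features admit 2^n combinations, most of which are never seen during training. A toy enumeration (illustrative only):

```python
from itertools import product

# Contrast a one-of-N code with a distributed (binary-feature) code:
# with the same number of units, the distributed code can represent
# exponentially more distinct patterns.

def one_of_n_patterns(n):
    """All patterns with exactly one active unit: just n of them."""
    return [tuple(1 if i == j else 0 for i in range(n)) for j in range(n)]

def distributed_patterns(n):
    """All combinations of n binary features: 2**n of them."""
    return list(product([0, 1], repeat=n))

n = 8
local = one_of_n_patterns(n)    # 8 representable items
dist = distributed_patterns(n)  # 256 representable items
```

The gap widens exponentially with n, which is the first of the two exponential advantages the text attributes to distributed representations.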
self-driving cars60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local
context of earlier words71. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications14,17,72–76.

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish77,78.

Thanks to advances in their architecture79,80 and ways of training them81,82, RNNs have been found to be very good at predicting the next character in the text83 or the next word in a sequence75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies
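The recurrent update and the vanishing-gradient problem described in this section can be illustrated with a scalar RNN. Everything here is an invented toy: a single hidden unit h_t = tanh(w_x * x_t + w_h * h_{t-1}), and the factor by which a gradient shrinks as it is backpropagated through time (one w_h * tanh' term per step).

```python
import math

# A scalar sketch of a recurrent network: the hidden state h is re-used at
# every step, so it accumulates information about the whole input history.
# The weights and the input sequence are illustrative assumptions.

def run_rnn(xs, w_x, w_h):
    """h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns every hidden state."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def gradient_through_time(states, w_h, steps):
    """Factor multiplying a gradient backpropagated over `steps` time steps:
    a product of w_h * tanh'(z) terms, one per step, with tanh' = 1 - h**2.
    With |w_h| small the product shrinks (vanishes); with |w_h| large it
    can grow (explode)."""
    g = 1.0
    for h in states[-steps:]:
        g *= w_h * (1.0 - h * h)
    return g

xs = [1.0, -0.5, 0.25, 1.0, -1.0, 0.5, 0.75, -0.25]
states = run_rnn(xs, w_x=1.0, w_h=0.5)
shrunk = gradient_through_time(states, w_h=0.5, steps=len(states))
```

With w_h = 0.5 every per-step factor is below 0.5, so the gradient carried back over eight steps is already tiny, which is the vanishing-gradient behaviour the text attributes to repeated multiplication over time steps.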
Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm103. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network75. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation)18,75.
28 MAY 2015 | VOL 521 | NATURE | 441
© 2015 Macmillan Publishers Limited. All rights reserved
12. Leung, M. K., Xiong, H. Y., Lee, L. J. & Frey, B. J. Deep learning of the tissue-regulated splicing code. Bioinformatics 30, i121–i129 (2014).
13. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 6218 (2015).
14. Collobert, R. et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011).
15. Bordes, A., Chopra, S. & Weston, J. Question answering with subgraph embeddings. In Proc. Empirical Methods in Natural Language Processing http://arxiv.org/abs/1406.3676v3 (2014).
16. Jean, S., Cho, K., Memisevic, R. & Bengio, Y. On using very large target vocabulary for neural machine translation. In Proc. ACL-IJCNLP http://arxiv.org/abs/1412.2007 (2015).
17. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014).
  This paper showed state-of-the-art machine translation results with the architecture introduced in ref. 72, with a recurrent network trained to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language.
18. Bottou, L. & Bousquet, O. The tradeoffs of large scale learning. In Proc. Advances in Neural Information Processing Systems 20 161–168 (2007).
19. Duda, R. O. & Hart, P. E. Pattern Classification and Scene Analysis (Wiley, 1973).
20. Schölkopf, B. & Smola, A. Learning with Kernels (MIT Press, 2002).
21. Bengio, Y., Delalleau, O. & Le Roux, N. The curse of highly variable functions for local kernel machines. In Proc. Advances in Neural Information Processing Systems 18 107–114 (2005).
22. Selfridge, O. G. Pandemonium: a paradigm for learning in mechanisation of thought processes. In Proc. Symposium on Mechanisation of Thought Processes 513–526 (1958).
23. Rosenblatt, F. The Perceptron — A Perceiving and Recognizing Automaton. Tech. Rep. 85-460-1 (Cornell Aeronautical Laboratory, 1957).
24. Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard Univ. (1974).
25. Parker, D. B. Learning Logic Report TR–47 (MIT Press, 1985).
26. LeCun, Y. Une procédure d'apprentissage pour réseau à seuil asymétrique. In Cognitiva 85: à la Frontière de l'Intelligence Artificielle, des Sciences de la Connaissance et des Neurosciences [in French] 599–604 (1985).
27. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
28. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proc. 14th International Conference on Artificial Intelligence and Statistics 315–323 (2011).
  This paper showed that supervised training of very deep neural networks is much faster if the hidden layers are composed of ReLU.
29. Dauphin, Y. et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proc. Advances in Neural Information Processing Systems 27 2933–2941 (2014).
30. Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. & LeCun, Y. The loss surface of multilayer networks. In Proc. Conference on AI and Statistics http://arxiv.org/abs/1412.0233 (2014).
31. Hinton, G. E. What kind of graphical model is the brain? In Proc. 19th International Joint Conference on Artificial Intelligence 1765–1775 (2005).
32. Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comp. 18, 1527–1554 (2006).
  This paper introduced a novel and effective way of training very deep neural networks by pre-training one hidden layer at a time using the unsupervised learning procedure for restricted Boltzmann machines.
33. Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy layer-wise training of deep networks. In Proc. Advances in Neural Information Processing Systems 19 153–160 (2006).
  This report demonstrated that the unsupervised pre-training method introduced in ref. 32 significantly improves performance on test data and generalizes the method to other unsupervised representation-learning techniques, such as auto-encoders.
34. Ranzato, M., Poultney, C., Chopra, S. & LeCun, Y. Efficient learning of sparse representations with an energy-based model. In Proc. Advances in Neural Information Processing Systems 19 1137–1144 (2006).
35. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
36. Sermanet, P., Kavukcuoglu, K., Chintala, S. & LeCun, Y. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition http://arxiv.org/abs/1212.0142 (2013).
37. Raina, R., Madhavan, A. & Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In Proc. 26th Annual International Conference on Machine Learning 873–880 (2009).
38. Mohamed, A.-R., Dahl, G. E. & Hinton, G. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22 (2012).
39. Dahl, G. E., Yu, D., Deng, L. & Acero, A. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 33–42 (2012).
40. Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Machine Intell. 35, 1798–1828 (2013).
41. LeCun, Y. et al. Handwritten digit recognition with a back-propagation network. In Proc. Advances in Neural Information Processing Systems 396–404 (1990).
  This is the first paper on convolutional networks trained by backpropagation for the task of classifying low-resolution images of handwritten digits.
42. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
  This overview paper on the principles of end-to-end training of modular systems such as deep neural networks using gradient-based optimization showed how neural networks (and in particular convolutional nets) can be combined with search or inference mechanisms to model complex outputs that are interdependent, such as sequences of characters associated with the content of a document.
43. Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol. 160, 106–154 (1962).
44. Felleman, D. J. & Essen, D. C. V. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47 (1991).
45. Cadieu, C. F. et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comp. Biol. 10, e1003963 (2014).
46. Fukushima, K. & Miyake, S. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15, 455–469 (1982).
47. Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K. & Lang, K. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process. 37, 328–339 (1989).
48. Bottou, L., Fogelman-Soulié, F., Blanchet, P. & Lienard, J. Experiments with time delay networks and dynamic time warping for speaker independent isolated digit recognition. In Proc. EuroSpeech 89 537–540 (1989).
49. Simard, D., Steinkraus, P. Y. & Platt, J. C. Best practices for convolutional neural networks. In Proc. Document Analysis and Recognition 958–963 (2003).
50. Vaillant, R., Monrocq, C. & LeCun, Y. Original approach for the localisation of objects in images. In Proc. Vision, Image, and Signal Processing 141, 245–250 (1994).
51. Nowlan, S. & Platt, J. in Neural Information Processing Systems 901–908 (1995).
52. Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Networks 8, 98–113 (1997).
53. Ciresan, D., Meier, U., Masci, J. & Schmidhuber, J. Multi-column deep neural network for traffic sign classification. Neural Networks 32, 333–338 (2012).
54. Ning, F. et al. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. 14, 1360–1371 (2005).
55. Turaga, S. C. et al. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comput. 22, 511–538 (2010).
56. Garcia, C. & Delakis, M. Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Machine Intell. 26, 1408–1423 (2004).
57. Osadchy, M., LeCun, Y. & Miller, M. Synergistic face detection and pose estimation with energy-based models. J. Mach. Learn. Res. 8, 1197–1215 (2007).
58. Tompson, J., Goroshin, R., Jain, A., LeCun, Y. & Bregler, C. Efficient object localization using convolutional networks. In Proc. Conference on Computer Vision and Pattern Recognition http://arxiv.org/abs/1411.4280 (2014).
59. Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. Deepface: closing the gap to human-level performance in face verification. In Proc. Conference on Computer Vision and Pattern Recognition 1701–1708 (2014).
60. Hadsell, R. et al. Learning long-range vision for autonomous off-road driving. J. Field Robot. 26, 120–144 (2009).
61. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In Proc. International Conference on Machine Learning http://arxiv.org/abs/1202.2160 (2012).
62. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
63. Sermanet, P. et al. Overfeat: integrated recognition, localization and detection using convolutional networks. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1312.6229 (2014).
64. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. Conference on Computer Vision and Pattern Recognition 580–587 (2014).
65. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.1556 (2014).
66. Boser, B., Sackinger, E., Bromley, J., LeCun, Y. & Jackel, L. An analog neural network processor with programmable topology. J. Solid State Circuits 26, 2017–2025 (1991).
67. Farabet, C. et al. Large-scale FPGA-based convolutional networks. In Scaling up Machine Learning: Parallel and Distributed Approaches (eds Bekkerman, R., Bilenko, M. & Langford, J.) 399–419 (Cambridge Univ. Press, 2011).
68. Bengio, Y. Learning Deep Architectures for AI (Now, 2009).
69. Montufar, G. & Morton, J. When does a mixture of products contain a product of mixtures? J. Discrete Math. 29, 321–347 (2014).
70. Montufar, G. F., Pascanu, R., Cho, K. & Bengio, Y. On the number of linear regions of deep neural networks. In Proc. Advances in Neural Information Processing Systems 27 2924–2932 (2014).
71. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. In Proc. Advances in Neural Information Processing Systems 13 932–938 (2001).
  This paper introduced neural language models, which learn to convert a word symbol into a word vector or word embedding composed of learned semantic features in order to predict the next word in a sequence.
72. Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014).
73. Schwenk, H. Continuous space language models. Computer Speech Lang. 21, 492–518 (2007).
74. Socher, R., Lin, C. C-Y., Manning, C. & Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Proc. International Conference on Machine Learning 129–136 (2011).
75. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).
76. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).
77. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen [in German]. Diploma thesis, T.U. München (1991).
78. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994).
79. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
  This paper introduced LSTM recurrent networks, which have become a crucial ingredient in recent advances with recurrent networks because they are good at learning long-range dependencies.
80. El Hihi, S. & Bengio, Y. Hierarchical recurrent neural networks for long-term dependencies. In Proc. Advances in Neural Information Processing Systems 8 http://papers.nips.cc/paper/1102-hierarchical-recurrent-neural-networks-for-long-term-dependencies (1995).
81. Sutskever, I. Training Recurrent Neural Networks. PhD thesis, Univ. Toronto (2012).
82. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning 1310–1318 (2013).
83. Sutskever, I., Martens, J. & Hinton, G. E. Generating text with recurrent neural networks. In Proc. 28th International Conference on Machine Learning 1017–1024 (2011).
84. Lakoff, G. & Johnson, M. Metaphors We Live By (Univ. Chicago Press, 2008).
85. Rogers, T. T. & McClelland, J. L. Semantic Cognition: A Parallel Distributed Processing Approach (MIT Press, 2004).
86. Xu, K. et al. Show, attend and tell: neural image caption generation with visual attention. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1502.03044 (2015).
87. Graves, A., Mohamed, A.-R. & Hinton, G. Speech recognition with deep recurrent neural networks. In Proc. International Conference on Acoustics, Speech and Signal Processing 6645–6649 (2013).
88. Graves, A., Wayne, G. & Danihelka, I. Neural Turing machines. http://arxiv.org/abs/1410.5401 (2014).
89. Weston, J., Chopra, S. & Bordes, A. Memory networks. http://arxiv.org/abs/1410.3916 (2014).
90. Weston, J., Bordes, A., Chopra, S. & Mikolov, T. Towards AI-complete question answering: a set of prerequisite toy tasks. http://arxiv.org/abs/1502.05698 (2015).
91. Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158–1161 (1995).
92. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. In Proc. International Conference on Artificial Intelligence and Statistics 448–455 (2009).
93. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proc. 25th International Conference on Machine Learning 1096–1103 (2008).
94. Kavukcuoglu, K. et al. Learning convolutional feature hierarchies for visual recognition. In Proc. Advances in Neural Information Processing Systems 23 1090–1098 (2010).
95. Gregor, K. & LeCun, Y. Learning fast approximations of sparse coding. In Proc. International Conference on Machine Learning 399–406 (2010).
96. Ranzato, M., Mnih, V., Susskind, J. M. & Hinton, G. E. Modeling natural images using gated MRFs. IEEE Trans. Pattern Anal. Machine Intell. 35, 2206–2222 (2013).
97. Bengio, Y., Thibodeau-Laufer, E., Alain, G. & Yosinski, J. Deep generative stochastic networks trainable by backprop. In Proc. 31st International Conference on Machine Learning 226–234 (2014).
98. Kingma, D., Rezende, D., Mohamed, S. & Welling, M. Semi-supervised learning with deep generative models. In Proc. Advances in Neural Information Processing Systems 27 3581–3589 (2014).
99. Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple object recognition with visual attention. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1412.7755 (2014).
100. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
101. Bottou, L. From machine learning to machine reasoning. Mach. Learn. 94, 133–149 (2014).
102. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image caption generator. In Proc. International Conference on Machine Learning http://arxiv.org/abs/1502.03044 (2014).
103. van der Maaten, L. & Hinton, G. E. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

Acknowledgements The authors would like to thank the Natural Sciences and Engineering Research Council of Canada, the Canadian Institute For Advanced Research (CIFAR), the National Science Foundation and Office of Naval Research for support. Y.L. and Y.B. are CIFAR fellows.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of this paper at go.nature.com/7cjbaa. Correspondence should be addressed to Y.L. (yann@cs.nyu.edu).
TensorFlow: A System for Large-Scale
Machine Learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean,
Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur,
Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker,
Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, Google Brain
https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 265
open-source project.1 Thanks to our large community of users we have gained experience with many different machine learning applications. In this paper, we focus on neural network training as a challenging systems problem, and select two representative applications from this space: image classification and language modeling. These applications stress computational throughput and aggregate model size respectively, and we use them both to demonstrate the extensibility of TensorFlow, and to evaluate the efficiency and scalability of our present implementation.

2 Background & motivation

We begin by describing the limitations of our previous system (§2.1) and outlining the design principles that we used in the development of TensorFlow (§2.2).

2.1 Previous system: DistBelief

TensorFlow is the successor to DistBelief, which is the distributed system for training neural networks that Google has used since 2011 [20]. DistBelief uses the parameter server architecture, and here we criticize its limitations, but other systems based on this architecture have addressed these limitations in other ways [11, 14, 49]; we discuss those systems in Subsection 2.3.

In the parameter server architecture, a job comprises two disjoint sets of processes: stateless worker processes that perform the bulk of the computation when training a model, and stateful parameter server processes that maintain the current version of the model parameters. DistBelief's programming model is similar to Caffe's [38]: the user defines a neural network as a directed acyclic graph of layers that terminates with a loss function. A layer is a composition of mathematical operators: for example, a fully connected layer multiplies its input by a weight matrix, adds a bias vector, and applies a non-linear function (such as a sigmoid) to the result. A loss function is a scalar function that quantifies the difference between the predicted value (for a given input data point) and the ground truth. In a fully connected layer, the weight matrix and bias vector are parameters, which a learning algorithm will update in order to minimize the value of the loss function. DistBelief uses the DAG structure and knowledge of the layers' semantics to compute gradients for each of the model parameters, via backpropagation [63]. Because the parameter updates in many algorithms are commutative and have weak consistency requirements [61], the worker processes can compute updates independently and write back "delta" updates to each parameter server, which combines the updates with its current state.

Although DistBelief has enabled many Google products to use deep neural networks and formed the basis of many machine learning research projects, we soon began to feel its limitations. Its Python-based scripting interface for composing pre-defined layers was adequate for users with simple requirements, but our more advanced users sought three further kinds of flexibility:

Defining new layers For efficiency, we implemented DistBelief layers as C++ classes. Using a separate, less familiar programming language for implementing layers is a barrier for machine learning researchers who seek to experiment with new layer architectures, such as sampled softmax classifiers [37] and attention modules [53].

Refining the training algorithms Many neural networks are trained using stochastic gradient descent (SGD), which iteratively refines the parameters of the network by moving them in the direction that maximally decreases the value of the loss function. Several refinements to SGD accelerate convergence by changing the update rule [23, 66]. Researchers often want to experiment with new optimization methods, but doing that in DistBelief involves modifying the parameter server implementation. Moreover, the get() and put() interface for the parameter server is not ideal for all optimization methods: sometimes a set of related parameters must be updated atomically, and in many cases it would be more efficient to offload computation onto the parameter server, and thereby reduce the amount of network traffic.

Defining new training algorithms DistBelief workers follow a fixed execution pattern: read a batch of input data and the current parameter values, compute the loss function (a forward pass through the network), compute gradients for each of the parameters (a backward pass), and write the gradients back to the parameter server. This pattern works for training simple feed-forward neural networks, but fails for more advanced models, such as recurrent neural networks, which contain loops [39]; adversarial networks, in which two related networks are trained alternately [26]; and reinforcement learning models, where the loss function is computed by some agent in a separate system, such as a video game emulator [54]. Moreover, there are many other machine learning algorithms—such as expectation maximization, decision forest training, and latent Dirichlet allocation—that do not fit the same mold as neural network training, but could also benefit from a common, well-optimized distributed runtime.
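The fixed worker/parameter-server execution pattern described above can be sketched in plain Python. The get()/put() names come from the text, but the ParameterServer class and the one-dimensional linear model below are invented simplifications for illustration, not DistBelief's actual implementation.

```python
# Minimal sketch of the parameter-server pattern: stateless workers
# get() parameters, run a forward and backward pass on a batch, and
# put() "delta" updates back; the (toy, single-process) server combines
# the deltas with its current state.

class ParameterServer:
    def __init__(self, params):
        self.params = dict(params)      # current version of the model

    def get(self):
        return dict(self.params)

    def put(self, deltas):
        for name, delta in deltas.items():
            self.params[name] += delta  # combine update with current state

def worker_step(ps, batch, lr=0.05):
    """One fixed-pattern step: read params, forward pass, backward pass,
    write back delta updates (here, plain SGD on squared error)."""
    p = ps.get()
    grad_w = grad_b = 0.0
    for x, y in batch:
        err = (p["w"] * x + p["b"]) - y          # forward pass
        grad_w += 2.0 * err * x / len(batch)     # backward pass
        grad_b += 2.0 * err / len(batch)
    ps.put({"w": -lr * grad_w, "b": -lr * grad_b})

ps = ParameterServer({"w": 0.0, "b": 0.0})
batch = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]  # target: w=2, b=1
for _ in range(2000):
    worker_step(ps, batch)
print(ps.get())  # w close to 2, b close to 1
```

The rigidity the text criticizes is visible here: the loop hard-codes one forward pass, one backward pass, and one put() per step, so models with loops, alternating objectives, or externally computed losses do not fit.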
1 Software available from https://tensorflow.org.

In addition, we designed DistBelief with a single platform in mind: a large distributed cluster of multicore
servers [20]. We were able to add support for GPU acceleration, when it became clear that this acceleration would be crucial for executing convolutional kernels efficiently [44], but DistBelief remains a heavyweight system that is geared for training deep neural networks on huge datasets, and is difficult to scale down to other environments. In particular, many users want to hone their model locally on a GPU-powered workstation, before scaling the same code to train on a much larger dataset. After training a model on a cluster, the next step is to push the model into production, which might involve integrating the model into an online service, or deploying it onto a mobile device for offline execution. Each of these tasks has some common computational structure, but our colleagues found it necessary to use or create separate systems that satisfy the different performance and resource requirements of each platform. TensorFlow provides a single programming model and runtime system for all of these environments.

Figure 1 (fragment): the beginning of the scripting-interface example referenced below.
    # 1. Construct a graph representing the model.
    x = tf.placeholder(tf.float32, [BATCH_SIZE, 784])  # Placeholder for input.
    y = tf.placeholder(tf.float32, [BATCH_SIZE, 10])   # Placeholder for labels.

2.2 Design principles

We designed TensorFlow to be much more flexible than DistBelief, while retaining its ability to satisfy the demands of Google's production machine learning workloads. TensorFlow provides a simple dataflow-based programming abstraction that allows users to deploy applications on distributed clusters, local workstations, mobile devices, and custom-designed accelerators. A high-level scripting interface (Figure 1) wraps the construction of dataflow graphs and enables users to experiment with different model architectures and optimization algorithms without modifying the core system. In this subsection, we briefly highlight TensorFlow's core design principles:

Dataflow graphs of primitive operators Both TensorFlow and DistBelief use a dataflow representation for their models, but the most striking difference is that a DistBelief model comprises relatively few complex "layers", whereas the corresponding TensorFlow model represents individual mathematical operators (such as matrix multiplication, convolution, etc.) as nodes in the dataflow graph. This approach makes it easier for users to compose novel layers using a high-level scripting interface. Many optimization algorithms require each layer to have defined gradients, and building layers out of simple operators makes it easy to differentiate these models automatically (§4.1). In addition to the functional operators, we represent mutable state, and the operations that update it, as nodes in the dataflow graph, thus enabling experimentation with different update rules.

Deferred execution A typical TensorFlow application has two distinct phases: the first phase defines the program (e.g., a neural network to be trained and the update rules) as a symbolic dataflow graph with placeholders for
the input data and variables that represent the state; and the second phase executes an optimized version of the program on the set of available devices. By deferring the execution until the entire program is available, TensorFlow can optimize the execution phase by using global information about the computation. For example, TensorFlow achieves high GPU utilization by using the graph's dependency structure to issue a sequence of kernels to the GPU without waiting for intermediate results. While this design choice makes execution more efficient, we have had to push more complex features—such as dynamic control flow (§3.4)—into the dataflow graph, so that models using these features enjoy the same optimizations.

Common abstraction for heterogeneous accelerators In addition to general-purpose devices such as multicore CPUs and GPUs, special-purpose accelerators for deep learning can achieve significant performance improvements and power savings. At Google, our colleagues have built the Tensor Processing Unit (TPU) specifically for machine learning; TPUs yield an order of magnitude improvement in performance-per-watt compared to alternative state-of-the-art technology [40]. To support these accelerators in TensorFlow, we define a common abstraction for devices. At a minimum, a device must implement methods for (i) issuing a kernel for execution, (ii) allocating memory for inputs and outputs, and (iii) transferring buffers to and from host memory. Each operator (e.g., matrix multiplication) can have multiple specialized implementations for different devices. As a result, the same program can easily target GPUs, TPUs, or mobile CPUs as required for training, serving, and offline inference.

TensorFlow uses tensors of primitive values as a common interchange format that all devices understand. At the lowest level, all tensors in TensorFlow are dense; sparse tensors can be represented in terms of dense ones (§3.1). This decision ensures that the lowest levels of the system have simple implementations for memory allocation and serialization, thus reducing the framework overhead. Tensors also enable other optimizations for memory management and communication, such as RDMA and direct GPU-to-GPU transfer.

The main consequence of these principles is that in TensorFlow there is no such thing as a parameter server. On a cluster, we deploy TensorFlow as a set of tasks (named processes that can communicate over a network) that each export the same graph execution API and contain one or more devices. Typically a subset of those tasks assumes the role that a parameter server plays in other systems [11, 14, 20, 49], and we therefore call them PS tasks; the others are worker tasks. However, since a PS task is capable of running arbitrary TensorFlow graphs, it is more flexible than a conventional parameter server: users can program it with the same scripting interface that they use to define models. This flexibility is the key difference between TensorFlow and contemporary systems, and in the rest of the paper we will discuss some of the applications that this flexibility enables.

2.3 Related work

Single-machine frameworks Many machine learning researchers carry out their work on a single—often GPU-equipped—computer [43, 44], and several single-machine frameworks support this scenario. Caffe [38] is a high-performance framework for training declaratively specified neural networks on multicore CPUs and GPUs. As discussed above, its programming model is similar to DistBelief (§2.1), so it is easy to compose models from existing layers, but relatively difficult to add new layers or optimizers. Theano [2] allows programmers to express a model as a dataflow graph of primitive operators, and generates efficient compiled code for training that model. Its programming model is closest to TensorFlow, and it provides much of the same flexibility in a single machine. Unlike Caffe, Theano, and TensorFlow, Torch [17] offers a powerful imperative programming model for scientific computation and machine learning. It allows fine-grained control over the execution order and memory utilization, which enables power users to optimize the performance of their programs. While this flexibility is useful for research, Torch lacks the advantages of a dataflow graph as a portable representation across small-scale experimentation, production training, and deployment.

Batch dataflow systems Starting with MapReduce [21], batch dataflow systems have been applied to a large number of machine learning algorithms [70], and more recent systems have focused on increasing expressivity and performance. DryadLINQ [74] adds a high-level query language that supports more sophisticated algorithms than MapReduce. Spark [75] extends DryadLINQ with the ability to cache previously computed datasets in memory, and is therefore better suited to iterative machine learning algorithms (such as k-means clustering and logistic regression) when the input data fit in memory. Dandelion extends DryadLINQ with code generation for GPUs [62] and FPGAs [16].

The principal limitation of a batch dataflow system is that it requires the input data to be immutable, and all of the subcomputations to be deterministic, so that the system can re-execute subcomputations when machines in the cluster fail. This feature—which is beneficial for many conventional workloads—makes updating a machine learning model an expensive operation. For example, the SparkNet system for training deep neural networks on Spark takes 20 seconds to broadcast weights and collect updates from five workers [55]. As a result, in these systems, each model update step must process larger batches, slowing convergence [8]. We show in Subsection 6.3 that TensorFlow can train larger models on larger

Figure 2: A schematic TensorFlow dataflow graph for a training pipeline, containing subgraphs for reading input data, preprocessing, training, and checkpointing state.

Flow we sought a high-level programming model that allows users to customize the code that runs in all parts of the system, so that the cost of experimentation with new optimization algorithms and model architectures is lower. In the next section, we describe the building blocks of a TensorFlow program in more detail.
clusters with step times as short as 2 seconds.
3 TensorFlow execution model
Parameter servers As we discuss in Subsection 2.1, a
parameter server architecture uses a set of servers to man- TensorFlow uses a single dataflow graph to represent
age shared state that is updated by a set of parallel work- all computation and state in a machine learning algo-
ers. This architecture emerged in work on scalable topic rithm, including the individual mathematical operations,
Training
modeling [65], andlibraries Inference
DistBelief showed howlibs it can apply
the parameters and their update rules, and the input pre-
to deep neural network training.C++ Project
client Adam
... [14] fur- processing (Figure 2). The dataflow graph expresses the
Python client
ther applied this architecture for the efficient training of communication between subcomputations explicitly, thus
convolutional neural networks; C API and Li et al.’s “Parame-
making it easy to execute independent computations in
ter Server” [49] added innovations in consistency mod- parallel and to partition computations across multiple de-
Distributed
els, fault tolerance, master
and elastic Dataflow
rescaling.executor
Despite earlier vices. TensorFlow differs from batch dataflow systems
skepticism that parameter servers would be compatible (§2.3) in two respects:
Const Var MatMul Conv2D ReLU Queue ...
with GPU acceleration Kernel[14], Cui et al. recently showed
implementations
that a parameter server specialized for use with GPUs can • The model supports multiple concurrent executions
achieve speedups
RPC on RDMAsmall...clusters
CPU[18].GPU ... on overlapping subgraphs of the overall graph.
MXNet [11] is perhaps
Networking layer the closest
Devicesystem
layer in design
to TensorFlow. It uses a dataflow graph to represent the • Individual vertices may have mutable state that can
computation at each worker, and uses a parameter server be shared between different executions of the graph.
to scale training across multiple machines. The MXNet
parameter server exports a key-value store interface that The key observation in the parameter server architec-
supports aggregating updates sent from multiple devices ture [14, 20, 49] is that mutable state is crucial when
in each worker, and using an arbitrary user-provided func- training very large models, because it becomes possible to
tion to combine incoming updates with the current value. make in-place updates to very large parameters, and prop-
The MXNet key-value store interface [22] does not cur- agate those updates to parallel training steps as quickly
rently allow sparse gradient updates within a single value, as possible. Dataflow with mutable state enables Tensor-
which are crucial for the distributed training of large mod- Flow to mimic the functionality of a parameter server,
els (§4.2), and adding this feature would require modifi- but with additional flexibility, because it becomes pos-
cations to the core system. sible to execute arbitrary dataflow subgraphs on the ma-
The parameter server architecture meets many of our chines that host the shared model parameters. As a re-
requirements, and with sufficient engineering effort it sult, our users have been able to experiment with different
would be possible to build most of the features that we optimization algorithms, consistency schemes, and paral-
describe in this paper into a parameter server. For Tensor- lelization strategies.
3.1 Dataflow graph elements

In a TensorFlow graph, each vertex represents a unit of local computation, and each edge represents the output from, or input to, a vertex. We refer to the computation at vertices as operations, and the values that flow along edges as tensors. In this subsection, we describe the common types of operations and tensors.

Tensors In TensorFlow, we model all data as tensors (n-dimensional arrays) with the elements having one of a small number of primitive types, such as int32, float32, or string (where string can represent arbitrary binary data). Tensors naturally represent the inputs to and results of the common mathematical operations in many machine learning algorithms: for example, a matrix multiplication takes two 2-D tensors and produces a 2-D tensor; and a batch 2-D convolution takes two 4-D tensors and produces another 4-D tensor.

At the lowest level, all TensorFlow tensors are dense, for the reasons we discuss in Subsection 2.2. TensorFlow offers two alternatives for representing sparse data: either encode the data into variable-length string elements of a dense tensor, or use a tuple of dense tensors (e.g., an n-D sparse tensor with m non-zero elements can be represented in coordinate-list format as an m × n matrix of coordinates and a length-m vector of values). The shape of a tensor can vary in one or more of its dimensions, which makes it possible to represent sparse tensors with differing numbers of elements.

Operations An operation takes m ≥ 0 tensors as input and produces n ≥ 0 tensors as output. An operation has a named "type" (such as Const, MatMul, or Assign) and may have zero or more compile-time attributes that determine its behavior. An operation can be polymorphic and variadic at compile-time: its attributes determine both the expected types and arity of its inputs and outputs.

For example, the simplest operation, Const, has no inputs and a single output; its value is a compile-time attribute. As another example, AddN sums multiple tensors of the same element type; it has a type attribute T and an integer attribute N that define its type signature.

Stateful operations: variables An operation can contain mutable state that is read and/or written each time it executes. A Variable operation owns a mutable buffer that may be used to store the shared parameters of a model as it is trained. A Variable has no inputs, and produces a reference handle, which acts as a typed capability for reading and writing the buffer. A Read operation takes a reference handle r as input, and outputs the value of the variable (State[r]) as a dense tensor. Other operations modify the underlying buffer: for example, AssignAdd takes a reference handle r and a tensor value x, and when executed performs the update State′[r] ← State[r] + x. Subsequent Read(r) operations produce the value State′[r].

Stateful operations: queues TensorFlow includes several queue implementations, which support more advanced forms of coordination. The simplest queue is FIFOQueue, which owns an internal queue of tensors, and allows concurrent access in first-in-first-out order. Other types of queues dequeue tensors in random and priority orders, which ensure that input data are sampled appropriately. Like a Variable, the FIFOQueue operation produces a reference handle that can be consumed by one of the standard queue operations, such as Enqueue and Dequeue. These operations push their input onto the tail of the queue and, respectively, pop the head element and output it. Enqueue will block if its given queue is full, and Dequeue will block if its given queue is empty. When queues are used in an input preprocessing pipeline, this blocking provides backpressure; it also supports synchronization (§4.4). The combination of queues and dynamic control flow (§3.4) can also implement a form of streaming computation between subgraphs.

3.2 Partial and concurrent execution

TensorFlow uses a dataflow graph to represent all possible computations in a particular application. The API for executing a graph allows the client to specify declaratively the subgraph that should be executed. The client selects zero or more edges to feed input tensors into the dataflow, and one or more edges to fetch output tensors from the dataflow; the runtime then prunes the graph to contain the necessary set of operations. Each invocation of the API is called a step, and TensorFlow supports multiple concurrent steps on the same graph. Stateful operations allow steps to share data and synchronize when necessary.

Figure 2 shows a typical training application, with multiple subgraphs that execute concurrently and interact through shared variables and queues. The core training subgraph depends on a set of model parameters and on input batches from a queue. Many concurrent steps of the training subgraph update the model based on different input batches, to implement data-parallel training. To fill the input queue, concurrent preprocessing steps transform individual input records (e.g., decoding images and applying random distortions), and a separate I/O subgraph reads records from a distributed file system. A checkpointing subgraph runs periodically for fault tolerance (§4.3).

Partial and concurrent execution is responsible for much of TensorFlow's flexibility. Adding mutable state and coordination via queues makes it possible to specify a wide variety of model architectures in user-level code, which enables advanced users to experiment without modifying the internals of the TensorFlow runtime.

By default, concurrent executions of a TensorFlow subgraph run asynchronously with respect to one another. This asynchrony makes it straightforward to implement machine learning algorithms with weak consistency requirements [61], which include many neural network training algorithms [20]. As we discuss later, TensorFlow also provides the primitives needed to synchronize workers during training (§4.4), which has led to promising results on some learning tasks (§6.3).

    input = ...  # A sequence of tensors
    state = 0    # Initial state
    w = ...      # Trainable weights

    for i in range(len(input)):
        state, out[i] = f(state, w, input[i])

Figure 3: Pseudocode for an abstract RNN (§3.4). The function f typically comprises differentiable operations such as matrix multiplications and convolutions [32]. TensorFlow implements the loop in its dataflow graph.
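To make the coordinate-list option of §3.1 concrete, here is a plain-Python sketch (illustrative only, not TensorFlow's internal representation): a 2-D sparse tensor with m non-zero elements becomes an m × 2 list of coordinates plus a length-m list of values, both of which are themselves dense.

```python
# Sketch of the coordinate-list (COO) encoding from Subsection 3.1:
# a sparse tensor is a pair of dense tensors (coordinates, values).
# The helper name to_dense is hypothetical.

def to_dense(shape, coords, values):
    """Expand an m x 2 coordinate list and length-m value list
    into a dense nested-list tensor of the given 2-D shape."""
    rows, cols = shape
    dense = [[0] * cols for _ in range(rows)]
    for (r, c), v in zip(coords, values):
        dense[r][c] = v
    return dense

# A 3x4 sparse tensor with m = 2 non-zero elements:
coords = [(0, 1), (2, 3)]   # m x n coordinate matrix (n = 2 dimensions)
values = [5, 7]             # length-m vector of values

print(to_dense((3, 4), coords, values))
# [[0, 5, 0, 0], [0, 0, 0, 0], [0, 0, 0, 7]]
```

Because m can differ between examples, the first dimension of both dense component tensors varies, which is exactly the variable-shape property the text describes.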
3.3 Distributed execution

Dataflow simplifies distributed execution, because it makes communication between subcomputations explicit. It enables the same TensorFlow program to be deployed to a cluster of GPUs for training, a cluster of TPUs for serving, and a cellphone for mobile inference.

Each operation resides on a particular device, such as a CPU or GPU in a particular task. A device is responsible for executing a kernel for each operation assigned to it. TensorFlow allows multiple kernels to be registered for a single operation, with specialized implementations for a particular device or data type (see §5 for details). For many operations, such as element-wise operators (Add, Sub, etc.), we can compile a single kernel implementation for CPU and GPU using different compilers.

The TensorFlow runtime places operations on devices, subject to implicit or explicit constraints in the graph. The placement algorithm computes a feasible set of devices for each operation, calculates the sets of operations that must be colocated, and selects a satisfying device for each colocation group. It respects implicit colocation constraints that arise because each stateful operation and its state must be placed on the same device. In addition, the user may specify partial device preferences such as "any device in a particular task", or "a GPU in any task", and the runtime will respect these constraints. A typical training application will use client-side programming constructs to add constraints such that, for example, parameters are distributed among a set of "PS" tasks (§4.2).

TensorFlow thus permits great flexibility in how operations in the dataflow graph are mapped to devices. While simple heuristics yield adequate performance for novice users, expert users can optimize performance by manually placing operations to balance the computation, memory, and network requirements across multiple tasks and multiple devices within those tasks. An open question is how TensorFlow can automatically determine placements that achieve close to optimal performance on a given set of devices, thus freeing users from this concern. Even without such automation, it may be worthwhile to separate placement directives from other aspects of model definitions, so that, for example, it would be trivial to modify placements after a model has been trained.

Once the operations in a graph have been placed, and the partial subgraph has been computed for a step (§3.2), TensorFlow partitions the operations into per-device subgraphs. A per-device subgraph for device d contains all of the operations that were assigned to d, with additional Send and Recv operations that replace edges across device boundaries. Send transmits its single input to a specified device as soon as the tensor is available, using a rendezvous key to name the value. Recv has a single output, and blocks until the value for a specified rendezvous key is available locally, before producing that value. Send and Recv have specialized implementations for several device-type pairs; we describe some of these in Section 5.

We optimized TensorFlow for executing large subgraphs repeatedly with low latency. Once the graph for a step has been pruned, placed, and partitioned, its subgraphs are cached in their respective devices. A client session maintains the mapping from step definitions to cached subgraphs, so that a distributed step on a large graph can be initiated with one small message to each participating task. This model favors static, reusable graphs, but it can support dynamic computations using dynamic control flow, as the next subsection describes.

3.4 Dynamic control flow

TensorFlow supports advanced machine learning algorithms that contain conditional and iterative control flow. For example, a recurrent neural network (RNN) [39] such as an LSTM [32] can generate predictions from sequential data. Google's Neural Machine Translation system uses TensorFlow to train a deep LSTM that achieves state-of-the-art performance on many translation tasks [73]. The core of an RNN is a recurrence relation, where the output for sequence element i is a function of some state that accumulates across the sequence (Figure 3). In this case, dynamic control flow enables iteration over sequences that have variable lengths, without unrolling the computation to the length of the longest sequence.

As we discussed in Subsection 2.2, TensorFlow uses deferred execution via the dataflow graph to offload larger chunks of work to accelerators. Therefore, to implement RNNs and other advanced algorithms, we add conditional (if statement) and iterative (while loop) programming constructs in the dataflow graph itself. We use these primitives to build higher-order constructs, such as map(), fold(), and scan() [2].

For this purpose, we borrow the Switch and Merge primitives from classic dynamic dataflow architectures [4]. Switch is a demultiplexer: it takes a data input and a control input, and uses the control input to select which of its two outputs should produce a value. The Switch output not taken receives a special dead value, which propagates recursively through the rest of the graph until it reaches a Merge operation. Merge is a multiplexer: it forwards at most one non-dead input to its output, or produces a dead output if both of its inputs are dead. The conditional operator uses Switch to execute one of two branches based on the runtime value of a boolean tensor, and Merge to combine the outputs of the branches. The while loop is more complicated, and uses Enter, Exit, and NextIteration operators to ensure that the loop is well-formed [56].

The execution of iterations can overlap, and TensorFlow can also partition conditional branches and loop bodies across multiple devices and processes. The partitioning step adds logic to coordinate the start and termination of each iteration on each device, and to decide the termination of the loop. As we will see in Subsection 4.1, TensorFlow also supports automatic differentiation of control flow constructs. Automatic differentiation adds the subgraphs for computing gradients to the dataflow graph, which TensorFlow partitions across potentially distributed devices to compute the gradients in parallel.

4 Extensibility case studies

By choosing a unified representation for all computation in TensorFlow, we enable users to experiment with features that were hard-coded into the DistBelief runtime. In this section, we discuss four extensions that we have built using dataflow primitives and "user-level" code.

4.1 Differentiation and optimization

Many learning algorithms train a set of parameters using some variant of SGD, which entails computing the gradients of a loss function with respect to those parameters, then updating the parameters based on those gradients. TensorFlow includes a user-level library that differentiates a symbolic expression for a loss function and produces a new symbolic expression representing the gradients. For example, given a neural network as a composition of layers and a loss function, the library will automatically derive the backpropagation code.

The differentiation algorithm performs breadth-first search to identify all of the backwards paths from the target operation (e.g., a loss function) to a set of parameters, and sums the partial gradients that each path contributes. Our users frequently specialize the gradients for some operations, and they have implemented optimizations like batch normalization [33] and gradient clipping [60] to accelerate training and make it more robust. We have extended the algorithm to differentiate conditional and iterative subcomputations (§3.4) by adding nodes to the graph that record the control-flow decisions in the forward pass, and replaying those decisions in reverse during the backward pass. Differentiating iterative computations over long sequences can lead to a large amount of intermediate state being accumulated in memory, and we have developed techniques for managing limited GPU memory on these computations.

TensorFlow users can also experiment with a wide range of optimization algorithms, which compute new values for the parameters in each training step. SGD is easy to implement in a parameter server: for each parameter W, gradient ∂L/∂W, and learning rate α, the update rule is W′ ← W − α × ∂L/∂W. A parameter server can implement SGD by using -= as the write operation, and writing α × ∂L/∂W to each W after a training step.

However, there are many more advanced optimization schemes that are difficult to express as a single write operation. For example, the Momentum algorithm accumulates a "velocity" for each parameter based on its gradient over multiple iterations, then computes the parameter update from that accumulation; and many refinements to this algorithm have been proposed [66]. Implementing Momentum in DistBelief [20] required modifications to the parameter server implementation to change the representation of parameter data, and to execute complex logic in the write operation; such modifications are challenging for many users. Optimization algorithms are the topic of active research, and researchers have implemented several on top of TensorFlow, including Momentum, AdaGrad, AdaDelta, RMSProp, Adam, and L-BFGS. These
can be built in TensorFlow using Variable operations and primitive mathematical operations without modifying the underlying system, so it is easy to experiment with new algorithms as they emerge.

Figure 4: Schematic dataflow for an embedding layer (§4.2) with a two-way sharded embedding matrix.

4.2 Training very large models

To train a model on high-dimensional data, such as words in a corpus of text [7], it is common to use a distributed representation, which embeds a training example as a pattern of activity across several neurons, and which can be learned by backpropagation [30]. For example, in a language model, a training example might be a sparse vector with non-zero entries corresponding to the IDs of words in a vocabulary, and the distributed representation for each word will be a lower-dimensional vector [6]. "Wide and deep learning" creates distributed representations from cross-product transformations on categorical features, and the implementation on TensorFlow is used to power the Google Play app store recommender system [12].

Inference begins by multiplying a batch of b sparse vectors against an n × d embedding matrix, where n is the number of words in the vocabulary, and d is the desired dimensionality, to produce a much smaller b × d dense matrix representation; for training, most optimization algorithms modify only the rows of the embedding matrix that were read by the sparse multiplication. In TensorFlow models that process sparse data, n × d can amount to gigabytes of parameters: e.g., a large language model may use over 10^9 parameters with a vocabulary of 800,000 words [41], and we have experience with document models [19] where the parameters occupy several terabytes. Such models are too large to copy to a worker on every use, or even to store in RAM on a single host.

We implement sparse embedding layers in the TensorFlow graph as a composition of primitive operations. Figure 4 shows a simplified graph for an embedding layer that is split across two parameter server tasks. The core operation of this subgraph is Gather, which extracts a sparse set of rows from a tensor, and TensorFlow colocates this operation with the variable on which it operates. The dynamic partition (Part) operation divides the incoming indices into variable-sized tensors that contain the indices destined for each shard, and the dynamic stitching (Stitch) operation reassembles the partial results from each shard into a single result tensor. Each of these operations has a corresponding gradient, so it supports automatic differentiation (§4.1), and the result is a set of sparse update operations that act on just the values that were originally gathered from each of the shards.

Users writing a TensorFlow model typically do not construct graphs like Figure 4 manually. Instead, TensorFlow includes libraries that expose the abstraction of a sharded parameter, and build appropriate graphs of primitive operations based on the desired degree of distribution.

While sparse reads and updates are possible in a parameter server [49], TensorFlow adds the flexibility to offload arbitrary computation onto the devices that host the shared parameters. For example, classification models typically use a softmax classifier that multiplies the final output by a weight matrix with c columns, where c is the number of possible classes; for a language model, c is the size of the vocabulary, which can be large. Our users have experimented with several schemes to accelerate the softmax calculation. The first is similar to an optimization in Project Adam [14], whereby the weights are sharded across several tasks, and the multiplication and gradient calculation are colocated with the shards. More efficient training is possible using a sampled softmax [37], which performs a sparse multiplication based on the true class for an example and a set of randomly sampled false classes. We compare the performance of these two schemes in §6.4.

4.3 Fault tolerance

Training a model can take several hours or days, even using a large number of machines [14, 20]. We often need to train a model using non-dedicated resources, for example using the Borg cluster manager [71], which does not guarantee availability of the same resources for the duration of the training process. Therefore, a long-running TensorFlow job is likely to experience failure or pre-emption, and we require some form of fault tolerance. It is unlikely that tasks will fail so often that individual operations need fault tolerance, so a mechanism like Spark's RDDs [75] would impose significant overhead for little benefit. There is no need to make every write to the parameter state durable, because we can recompute any update from the input data, and many learning algorithms do not require strong consistency [61].
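The update rules discussed in §4.1 can be spelled out in a few lines of plain Python (a minimal sketch, not TensorFlow's optimizer API; the momentum formulation shown is one common variant among the refinements the paper cites).

```python
# Minimal sketch of the update rules from Subsection 4.1.
# Plain Python for a single scalar parameter, not TensorFlow code.

def sgd_step(w, grad, lr):
    # SGD: W' <- W - alpha * dL/dW
    return w - lr * grad

def momentum_step(w, velocity, grad, lr, mu):
    # Momentum: fold the gradient into an accumulated "velocity",
    # then update the parameter from that accumulation.
    velocity = mu * velocity + grad
    return w - lr * velocity, velocity

w = sgd_step(1.0, grad=0.5, lr=0.1)        # 1.0 - 0.1 * 0.5 = 0.95

w2, v = 1.0, 0.0
w2, v = momentum_step(w2, v, grad=0.5, lr=0.1, mu=0.9)
w2, v = momentum_step(w2, v, grad=0.5, lr=0.1, mu=0.9)
```

SGD is a single write per parameter, which is why a parameter server can express it as a `-=` write operation; Momentum needs the extra `velocity` state and logic, which is the kind of change that required modifying DistBelief's parameter server but is ordinary user-level graph code in TensorFlow.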
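The sharded lookup of Figure 4 (§4.2) can likewise be sketched in plain Python (hypothetical data structures, not the real Gather, Part, and Stitch operations): indices are routed to shards by Mod, each shard gathers its own rows, and dynamic stitching restores the original order.

```python
# Sketch of the Figure 4 embedding lookup (Subsection 4.2):
# Mod/Part split the indices across shards, Gather runs per shard,
# and Stitch reassembles the partial results in the original order.
# The shard layout and function names are illustrative only.

NUM_SHARDS = 2

# Embedding matrix sharded row-wise: shard s holds the rows whose
# row id satisfies row_id % NUM_SHARDS == s, keyed by original id.
shards = [
    {0: [0.0, 0.1], 2: [2.0, 2.1]},   # shard 0: even row ids
    {1: [1.0, 1.1], 3: [3.0, 3.1]},   # shard 1: odd row ids
]

def sharded_gather(indices):
    parts = [[] for _ in range(NUM_SHARDS)]      # Part: indices per shard
    positions = [[] for _ in range(NUM_SHARDS)]  # remember original slots
    for pos, idx in enumerate(indices):
        s = idx % NUM_SHARDS                     # Mod: choose the shard
        parts[s].append(idx)
        positions[s].append(pos)

    result = [None] * len(indices)
    for s in range(NUM_SHARDS):
        gathered = [shards[s][i] for i in parts[s]]   # Gather on shard s
        for pos, row in zip(positions[s], gathered):  # Stitch results
            result[pos] = row
    return result

print(sharded_gather([3, 0, 2]))
# [[3.0, 3.1], [0.0, 0.1], [2.0, 2.1]]
```

Because each step of this pipeline has a well-defined gradient, the corresponding backward pass touches only the gathered rows, which is what makes sparse updates to a huge embedding matrix cheap.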
(a) Asynchronous replication  (b) Synchronous replication  (c) Synchronous w/ backup worker

Figure 5: Three synchronization schemes for parallel SGD. Each color represents a different starting parameter value; a white square is a parameter update. In (c), a dashed rectangle represents a backup worker whose result is discarded.

We implement user-level checkpointing for fault tolerance, using two operations in the graph (Figure 2): Save writes one or more tensors to a checkpoint file, and Restore reads one or more tensors from a checkpoint file. Our typical configuration connects each Variable in a task to the same Save operation, with one Save per task, to maximize the I/O bandwidth to a distributed file system. The Restore operations read named tensors from a file, and a standard Assign stores the restored value in its respective variable. During training, a typical client runs all of the Save operations periodically to produce a new checkpoint; when the client starts up, it attempts to Restore the latest checkpoint.

TensorFlow includes a client library for constructing the appropriate graph structure and for invoking Save and Restore as necessary. This behavior is customizable: the user can apply different policies to subsets of the variables in a model, or customize the checkpoint retention scheme. For example, many users retain checkpoints with the highest score in a custom evaluation metric. The implementation is also reusable: it may be used for model fine-tuning and unsupervised pre-training [45, 47], which are forms of transfer learning, in which the parameters of a model trained on one task (e.g., recognizing general images) are used as the starting point for another task (e.g., recognizing breeds of dog). Having checkpoint and parameter management as programmable operations in the graph gives users the flexibility to implement schemes like these and others that we have not anticipated.

The checkpointing library does not attempt to produce consistent checkpoints: if training and checkpointing execute concurrently, the checkpoint may include none, all, or some of the updates from the training step. This behavior is compatible with the relaxed guarantees of asynchronous SGD [20]. Consistent checkpoints require additional synchronization to ensure that update operations do not interfere with checkpointing; if desired, one can use the scheme in the next subsection to take a checkpoint after the synchronous update step.

4.4 Synchronous replica coordination

SGD is robust to asynchrony [61], and many systems train deep neural networks using asynchronous parameter updates [14, 20], which are believed to be scalable because they maintain high throughput in the presence of stragglers. The increased throughput comes at the cost of using stale parameter values in training steps. Some have recently revisited the assumption that synchronous training does not scale [10, 18]. Since GPUs enable training with hundreds, rather than thousands [47], of machines, synchronous training may be faster (in terms of time to quality) than asynchronous training on the same platform.

Though we originally designed TensorFlow for asynchronous training, we have begun experimenting with synchronous methods. The TensorFlow graph enables users to change how parameters are read and written when training a model, and we implement three alternatives. In the asynchronous case (Figure 5(a)), each worker reads the current values of parameters when each step begins, and applies its gradient to the (possibly different) current values at the end: this approach ensures high utilization, but the individual steps use stale parameter values, making each step less effective. We implement the synchronous version using queues (§3.1) to coordinate execution: a blocking queue acts as a barrier to ensure that all workers read the same parameter values, and a per-variable queue accumulates gradient updates from all workers in order to apply them atomically. The simple synchronous version (Figure 5(b)) accumulates updates from all workers before applying them, but slow workers limit overall throughput.

To mitigate stragglers, we implement backup workers (Figure 5(c), [10]), which are similar to MapReduce backup tasks [21]. Whereas MapReduce starts backup tasks reactively, after detecting a straggler, our backup workers run proactively, and the aggregation takes the first m of n updates produced. We exploit the fact that SGD samples training data randomly at each step, so each worker processes a different random batch, and it is not a
problem if a particular batch is ignored. In §6.3 we show how backup workers improve throughput by up to 10%.

5 Implementation

The TensorFlow runtime is a cross-platform library. Figure 6 illustrates its architecture: a C API separates user-level code in different languages from the core runtime. The core TensorFlow library is implemented in C++ for portability and performance: it runs on several operating systems including Linux, Mac OS X, Windows, Android, and iOS; the x86 and various ARM-based CPU architectures; and NVIDIA's Kepler, Maxwell, and Pascal GPU microarchitectures. The implementation is open-source, and we have accepted several external contributions that enable TensorFlow to run on other architectures.

Figure 6: The layered TensorFlow architecture. Training and inference libraries and client-language bindings (Python client, C++ client, ...) sit above the C API; below it are the distributed master and dataflow executor, the kernel implementations (Const, Var, MatMul, Conv2D, ReLU, Queue, ...), the networking layer (RPC, RDMA, ...), and the device layer (CPU, GPU, ...).

The distributed master translates user requests into execution across a set of tasks. Given a graph and a step definition, it prunes (§3.2) and partitions (§3.3) the graph to obtain subgraphs for each participating device, and caches these subgraphs so that they may be re-used in subsequent steps. Since the master sees the overall computation for a step, it applies standard optimizations such as common subexpression elimination and constant folding; pruning is a form of dead code elimination. It then coordinates execution of the optimized subgraphs across a set of tasks.

The dataflow executor in each task handles requests from the master, and schedules the execution of the kernels that comprise a local subgraph. We optimize the dataflow executor for running large graphs with low overhead. Our current implementation can execute 10,000 subgraphs per second (§6.2), which enables a large number of replicas to make rapid, fine-grained training steps. The dataflow executor dispatches kernels to local devices and runs kernels in parallel when possible, for example by using multiple CPU cores or GPU streams.

The runtime contains over 200 standard operations, including mathematical, array manipulation, control flow, and state management operations. Many of the operation kernels are implemented using Eigen::Tensor [36], which uses C++ templates to generate efficient parallel code for multicore CPUs and GPUs; however, we liberally use libraries like cuDNN [13] where a more efficient kernel implementation is possible. We have also implemented quantization, which enables faster inference in environments such as mobile devices and high-throughput datacenter applications, and use the gemmlowp low-precision matrix library [35] to accelerate quantized computation.

We specialize Send and Recv operations for each pair of source and destination device types. Transfers between local CPU and GPU devices use the cudaMemcpyAsync() API to overlap computation and data transfer; transfers between two local GPUs use DMA to relieve pressure on the host. For transfers between tasks, TensorFlow uses multiple protocols, including gRPC over TCP, and RDMA over Converged Ethernet. We are also investigating optimizations for GPU-to-GPU communication that use collective operations [59].

Section 4 describes features that we implement completely above the C API, in user-level code. Typically, users compose standard operations to build higher-level abstractions, such as neural network layers, optimization algorithms (§4.1), and sharded embedding computations (§4.2). TensorFlow supports multiple client languages, and we have prioritized Python and C++, because our internal users are most familiar with these languages. As features become more established, we typically port them to C++, so that users can access an optimized implementation from all client languages.

If it is difficult or inefficient to represent a subcomputation as a composition of operations, users can register additional kernels that provide an efficient implementation written in C++. We have found it profitable to hand-implement fused kernels for some performance-critical operations, such as the ReLU and Sigmoid activation functions and their corresponding gradients. We are currently investigating automatic kernel fusion using a compilation-based approach.

In addition to the core runtime, our colleagues have built several tools that aid users of TensorFlow. These include serving infrastructure for inference in production [27], a visualization dashboard that enables users to follow the progress of a training run, a graph visualizer that helps users to understand the connections in a model, and a distributed profiler that traces the execution of a computation across multiple devices and tasks. We describe these tools in an extended whitepaper [1].
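The pruning step described above can be sketched in a few lines of Python (a schematic of the idea, not TensorFlow's implementation; the graph and op names are made up): starting from the fetched outputs, a backward reachability pass keeps only the operations the step actually needs, which is exactly dead code elimination on the dataflow graph.

```python
def prune(deps, fetches):
    """Keep only the ops reachable backwards from the fetched outputs.

    deps maps each op to the ops whose outputs it consumes; anything
    that is not an ancestor of a fetch is dead code for this step.
    """
    live, stack = set(), list(fetches)
    while stack:
        op = stack.pop()
        if op not in live:
            live.add(op)
            stack.extend(deps.get(op, []))
    return live

# Toy graph: 'loss' depends on a matmul plus bias; 'summary' is unfetched.
deps = {
    "matmul": ["weights", "input"],
    "add": ["matmul", "bias"],
    "loss": ["add", "labels"],
    "summary": ["add"],
}
print(sorted(prune(deps, ["loss"])))
```

Everything that is not an ancestor of a fetched node, such as the unfetched "summary" op here, is dropped before the graph is partitioned across devices.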
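The Send/Recv specialization amounts to dispatching on the (source, destination) device-type pair. A minimal Python sketch of that idea follows; the keys and mechanism names are illustrative, not TensorFlow's actual registry or API.

```python
def make_transfer_registry():
    """Map (src, dst) device-type pairs to transfer mechanisms,
    mirroring how Send/Recv kernels are specialized per pair.
    All names here are illustrative placeholders."""
    registry = {
        ("cpu", "gpu"): "cudaMemcpyAsync",  # overlap compute and copy
        ("gpu", "cpu"): "cudaMemcpyAsync",
        ("gpu", "gpu"): "peer_dma",         # relieve pressure on the host
        ("task", "task"): "grpc_or_rdma",   # cross-task protocols
    }
    def transfer(src, dst):
        # Fall back to a plain host copy for unspecialized pairs.
        return registry.get((src, dst), "host_copy")
    return transfer

transfer = make_transfer_registry()
print(transfer("gpu", "gpu"))
```

The design point is that each pair can get the most efficient mechanism available for it, with a generic fallback for everything else.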
6 Evaluation

Unless otherwise stated, we run all experiments on a shared production cluster, and all figures plot median values with error bars showing the 10th and 90th percentiles. In this paper we focus on system performance metrics.

Figure 8: Results of the performance evaluation for Inception-v3 training (§6.3). (a) Baseline performance vs. MXNet: TensorFlow achieves slightly better throughput than MXNet for asynchronous training. (b) Coordination scalability: asynchronous and synchronous training throughput increases with up to 200 workers. (c) Backup worker effectiveness: adding backup workers to a 50-worker training job can reduce the overall step time, and improve performance even when normalized for resource consumption.
6.3 Image classification

Deep neural networks have achieved breakthrough performance on computer vision tasks such as recognizing objects in photographs [44], and these tasks are a key application for TensorFlow at Google. Training a network to high accuracy requires a large amount of computation, and we use TensorFlow to scale out this computation across a cluster of GPU-enabled servers. In these experiments, we focus on Google's Inception-v3 model, which achieves 78.8% accuracy in the ILSVRC 2012 image classification challenge [69]; the same techniques apply to other deep convolutional models—such as ResNet [28]—implemented on TensorFlow. We investigate the scalability of training Inception-v3 using multiple replicas. We configure TensorFlow with 7 PS tasks, and vary the number of worker tasks using two different clusters.

For the first experiment, we compare the performance of training Inception using asynchronous SGD on TensorFlow and MXNet, a contemporary system using a parameter server architecture. For this experiment we use Google Compute Engine virtual machines running on Intel Xeon E5 servers with NVIDIA K80 GPUs, configured with 8 vCPUs, 16 Gbps of network bandwidth, and one GPU per VM. Both systems use 7 PS tasks running on separate VMs with no GPU. Figure 8(a) shows that TensorFlow achieves performance that is marginally better than MXNet. As expected, the results are largely determined by single-GPU performance, and both systems use cuDNN version 5.1, so they have access to the same optimized GPU kernels.

Using a larger internal cluster (with NVIDIA K40 GPUs, and a shared datacenter network), we investigate the effect of coordination (§4.4) on training performance. Ideally, with efficient synchronous training, a model such as Inception-v3 will train in fewer steps, and converge to a higher accuracy than with asynchronous training [10]. Training throughput improves to 2,300 images per second as we increase the number of workers to 200, but with diminishing returns (Figure 8(b)). As we add more workers, the step time increases, because there is more contention on the PS tasks, both at the network interface and in the aggregation of updates. As expected, for all configurations, synchronous steps are longer than asynchronous steps, because all workers must wait for the slowest worker to catch up before starting the next step. While the median synchronous step is approximately 10% longer than an asynchronous step with the same workers, above the 90th percentile the synchronous performance degrades sharply, because stragglers disproportionately impact tail latency.

To mitigate tail latency, we add backup workers so that a step completes when the first m of n tasks produce gradients. Figure 8(c) shows the effect of adding backup workers to a 50-worker Inception training job. Each additional backup worker up to and including the fourth reduces the median step time, because the probability of a straggler affecting the step decreases. Adding a fifth backup worker slightly degrades performance, because the 51st worker (i.e., the first whose result is discarded) is more likely to be a non-straggler that generates more incoming traffic for the PS tasks. Figure 8(c) also plots the normalized speedup for each configuration, defined as t(0)/t(b) × 50/(50 + b) (where t(b) is the median step time with b backup workers), which discounts the speedup by the fraction of additional resources consumed. Although adding 4 backup workers achieves the shortest overall step time (1.93 s), adding 3 achieves the highest normalized speedup (9.5%), and hence uses less aggregate GPU-time to reach the same quality.
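The backup-worker behaviour and the normalized-speedup metric are easy to reproduce in a few lines of Python. The per-worker times below are hypothetical, with one deliberately exaggerated straggler, and are chosen only to exercise the arithmetic:

```python
def step_time(worker_times, backups=0):
    """Synchronous step time with backup workers: the step completes
    when the first m of n = len(worker_times) workers finish,
    where m = n - backups."""
    n = len(worker_times)
    m = n - backups
    return sorted(worker_times)[m - 1]

def normalized_speedup(t0, tb, b, workers=50):
    """Speedup from b backup workers, discounted by the fraction of
    additional resources consumed: (t0 / tb) * workers / (workers + b)."""
    return (t0 / tb) * workers / (workers + b)

# Hypothetical per-worker compute times (seconds) with one straggler.
times = [1.0] * 49 + [3.0]                 # 50 workers, one straggler
t0 = step_time(times)                      # no backups: gated by straggler
t1 = step_time(times + [1.0], backups=1)   # one backup worker: m = 50 of 51
print(t0, t1, round(normalized_speedup(t0, t1, b=1), 3))
```

With these assumed times, a single backup worker cuts the step from 3.0 s to 1.0 s, and the metric discounts that gain by the 51/50 resource overhead, mirroring how the paper trades raw step time against aggregate GPU-time.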
Figure 9: Increasing the number of PS tasks leads to increased throughput for language model training, by parallelizing the softmax computation ((a) full softmax; (b) sampled softmax). Sampled softmax increases throughput by performing less computation.

6.4 Language modeling

Given a sequence of words, a language model predicts the most probable next word [6]. Therefore, language models are integral to predictive text, speech recognition, and translation applications. In this experiment, we investigate how TensorFlow can train a recurrent neural network (viz. LSTM-512-512 [41]) to model the text in the One Billion Word Benchmark [9]. The vocabulary size |V| limits the performance of training, because the final layer must decode the output state into probabilities for each of |V| classes [37]. The resulting parameters can be large (|V| × d for output state dimension d) so we use the techniques for handling large models from Subsection 4.2. We use a restricted vocabulary of the most common 40,000 words—instead of the full 800,000 words [9]—in order to experiment with smaller configurations.

Figure 9 shows the training throughput, measured in words per second, for varying numbers of PS and worker tasks, and two softmax implementations. The full softmax (Figure 9(a)) multiplies each output by a 512 × 40,000 weight matrix sharded across the PS tasks. Adding more PS tasks increases the throughput, because TensorFlow can exploit distributed model parallelism [20, 43] and perform the multiplication and gradient calculation on the PS tasks, as in Project Adam [14]. Adding a second PS task is more effective than increasing from 4 to 32, or 32 to 256 workers. Eventually the throughput saturates, as the LSTM calculations dominate the training step.

The sampled softmax (Figure 9(b)) reduces the data transferred and the computation performed on the PS tasks [37]. Instead of a dense weight matrix, it multiplies the output by a random sparse matrix containing weights for the true class and a random sample of false classes. We sample 512 classes for each batch, thus reducing the softmax data transfer and computation by a factor of 78.

7 Conclusions

We have described the TensorFlow system and its programming model. TensorFlow's dataflow representation subsumes existing work on parameter server systems, and offers a set of uniform abstractions that allow users to harness large-scale heterogeneous systems, both for production tasks and for experimenting with new approaches. We have shown several examples of how the TensorFlow programming model facilitates experimentation (§4) and demonstrated that the resulting implementations are performant and scalable (§6).

Our initial experience with TensorFlow is encouraging. A large number of groups at Google have deployed TensorFlow in production, and TensorFlow is helping our research colleagues to make new advances in machine learning. Since we released TensorFlow as open-source software, more than 14,000 people have forked the source code repository, the binary distribution has been downloaded over one million times, and dozens of machine learning models that use TensorFlow have been published.

TensorFlow is a work in progress. Its flexible dataflow representation enables power users to achieve excellent performance, but we have not yet determined default policies that work well for all users. Further research on automatic optimization should bridge this gap. On the system level, we are actively developing algorithms for automatic placement, kernel fusion, memory management, and scheduling. While the current implementations of mutable state and fault tolerance suffice for applications with weak consistency requirements, we expect that some TensorFlow applications will require stronger consistency, and we are investigating how to build such policies at user-level. Finally, some users have begun to chafe at the limitations of a static dataflow graph, especially for algorithms like deep reinforcement learning [54]. Therefore, we face the intriguing problem of providing a system that transparently and efficiently uses distributed resources, even when the structure of the computation unfolds dynamically.

Acknowledgments

We gratefully acknowledge contributions from our colleagues within Google, and from members of the wider machine learning community. In particular, we appreciate the feedback we have received from the rest of the Google Brain team and the many users of DistBelief and TensorFlow. We thank the anonymous OSDI reviewers and our shepherd KyoungSoo Park for their suggestions, which greatly improved the presentation of this paper.
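Returning to the sampled softmax of §6.4: the factor-of-78 reduction follows directly from the sizes involved, since only the columns of the 512 × 40,000 weight matrix for the sampled classes are touched. A toy Python sketch (illustrative only, not TensorFlow's sampled-softmax op):

```python
import random

VOCAB = 40_000    # restricted vocabulary size
HIDDEN = 512      # LSTM output state dimension
SAMPLES = 512     # classes sampled per batch

def sampled_classes(true_class, num_samples, vocab=VOCAB):
    """True class plus a random sample of false classes: only these
    columns of the [HIDDEN x VOCAB] weight matrix join the step."""
    negatives = random.sample(
        [c for c in range(vocab) if c != true_class], num_samples - 1)
    return [true_class] + negatives

full_cost = HIDDEN * VOCAB        # full softmax: entire weight matrix
sampled_cost = HIDDEN * SAMPLES   # sampled softmax: sampled columns only
print(full_cost // sampled_cost)  # → 78
```

The ratio 40,000 / 512 ≈ 78 is independent of the hidden dimension, so the same factor applies to both the data transferred to the PS tasks and the multiply-accumulate work performed there.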
References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint, 1603.04467, 2016. arxiv.org/abs/1603.04467. Software available from tensorflow.org.

[2] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Bengio, A. Bergeron, J. Bergstra, V. Bisson, J. Bleecher Snyder, N. Bouchard, N. Boulanger-Lewandowski, X. Bouthillier, A. de Brébisson, O. Breuleux, P.-L. Carrier, K. Cho, J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Côté, M. Côté, A. Courville, Y. N. Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe, V. Dumoulin, S. Ebrahimi Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot, I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi, S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin, E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Léonard, Z. Lin, J. A. Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGibbon, R. Memisevic, B. van Merriënboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal, R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth, P. Sadowski, J. Salvatier, F. Savard, J. Schlüter, J. Schulman, G. Schwartz, I. V. Serban, D. Serdyuk, S. Shabanian, E. Simon, S. Spieckermann, S. R. Subramanyam, J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin, H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang, and Y. Zhang. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint, 1605.02688, 2016. arxiv.org/abs/1605.02688.

[3] A. Angelova, A. Krizhevsky, and V. Vanhoucke. Pedestrian detection with a large-field-of-view deep network. In Proceedings of ICRA, pages 704–711. IEEE, 2015. www.vision.caltech.edu/anelia/publications/Angelova15LFOV.pdf.

[4] Arvind and D. E. Culler. Dataflow architectures. In Annual Review of Computer Science Vol. 1, 1986, pages 225–253. Annual Reviews Inc., 1986. www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235.

[5] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint, 1412.7755, 2014. arxiv.org/abs/1412.7755.

[6] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003. jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.

[7] T. Brants and A. Franz. Web 1T 5-gram version 1, 2006. catalog.ldc.upenn.edu/LDC2006T13.

[8] R. H. Byrd, G. M. Chin, J. Nocedal, and Y. Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012. dx.doi.org/10.1007/s10107-012-0572-5.

[9] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint, 1312.3005, 2013. arxiv.org/abs/1312.3005.

[10] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. In Proceedings of ICLR Workshop Track, 2016. arxiv.org/abs/1604.00981.

[11] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of LearningSys, 2015. www.cs.cmu.edu/~muli/file/mxnet-learning-sys.pdf.

[12] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah. Wide & deep learning for recommender systems. arXiv preprint, 1606.07792, 2016. arxiv.org/abs/1606.07792.
[13] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint, 1410.0759, 2014. arxiv.org/abs/1410.0759.

[14] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of OSDI, pages 571–582, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf.

[18] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of EuroSys, 2016. www.pdl.cmu.edu/PDL-FTP/CloudComputing/GeePS-cui-eurosys16.pdf.

[19] A. Dai, C. Olah, and Q. V. Le. Document embedding with paragraph vectors. arXiv preprint, 1507.07998, 2015. arxiv.org/abs/1507.07998.

[20] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large scale distributed deep networks. In Proceedings of NIPS, pages 1232–1240, 2012. research.google.com/archive/large_deep_networks_nips2012.pdf.

[21] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI, pages 137–149, 2004. research.google.com/archive/mapreduce-osdi04.pdf.

[22] DMLC. MXNet for deep learning, 2016. github.com/dmlc/mxnet.

[23] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011. jmlr.org/papers/volume12/duchi11a/duchi11a.pdf.

[24] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Proceedings of NIPS, pages 2121–2129, 2013. research.google.com/pubs/archive/41473.pdf.

[27] Google Research. TensorFlow Serving, 2016. tensorflow.github.io/serving/.

[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016. arxiv.org/abs/1512.03385.

[29] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean. Multilingual acoustic models using distributed deep neural networks. In Proceedings of ICASSP, pages 8619–8623, 2013. research.google.com/pubs/archive/40807.pdf.

[30] G. E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1–12, 1986. www.cogsci.ucsd.edu/~ajyu/Teaching/Cogs202_sp13/Readings/hinton86.pdf.

[31] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012. www.cs.toronto.edu/~gdahl/papers/deepSpeechReviewSPM2012.pdf.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf.

[33] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, pages 448–456, 2015. jmlr.org/proceedings/papers/v37/ioffe15.pdf.

[34] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of EuroSys, pages 59–72, 2007. www.microsoft.com/en-us/research/wp-content/uploads/2007/03/eurosys07.pdf.

[35] B. Jacob et al. gemmlowp: a small self-contained low-precision GEMM library, 2015. github.com/google/gemmlowp.

[36] B. Jacob, G. Guennebaud, et al. Eigen library for linear algebra. eigen.tuxfamily.org.

[37] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of ACL-IJCNLP, pages 1–10, July 2015. www.aclweb.org/anthology/P15-1001.

[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of ACM Multimedia, pages 675–678, 2014. arxiv.org/abs/1408.5093.

[39] M. I. Jordan. Serial order: A parallel distributed processing approach. ICS Report 8604, Institute for Cognitive Science, UCSD, La Jolla, 1986. cseweb.ucsd.edu/~gary/PAPER-SUGGESTIONS/Jordan-TR-8604.pdf.

[40] N. Jouppi. Google supercharges machine learning tasks with TPU custom chip, 2016. cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.

[41] R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint, 1602.02410, 2016. arxiv.org/abs/1602.02410.

[42] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of CVPR, pages 1725–1732, 2014. research.google.com/pubs/archive/42455.pdf.

[43] A. Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint, 1404.5997, 2014. arxiv.org/abs/1404.5997.

[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, pages 1106–1114, 2012. papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[45] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009. jmlr.org/papers/volume10/larochelle09a/larochelle09a.pdf.

[46] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. In Proceedings of CVPR, pages 4013–4021, 2016. arxiv.org/abs/1509.09308.

[47] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In Proceedings of ICML, pages 81–88, 2012. research.google.com/archive/unsupervised_icml2012.pdf.

[48] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998. yann.lecun.com/exdb/mnist/.

[49] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In Proceedings of OSDI, pages 583–598, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf.

[50] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver. Move evaluation in Go using deep convolutional neural networks. arXiv preprint, 1412.6564, 2014. arxiv.org/abs/1412.6564.
[51] F. McSherry, M. Isard, and D. G. Murray. Scalability! But at what COST? In Proceedings of HotOS, 2015. www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf.

[52] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of ICLR Workshops Track, 2013. arxiv.org/abs/1301.3781.

[53] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In Proceedings of NIPS, pages 2204–2212, 2014. papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf.

[54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. dx.doi.org/10.1038/nature14236.

[55] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan. SparkNet: Training deep networks in Spark. In Proceedings of ICLR, 2016. arxiv.org/abs/1511.06051.

[56] D. G. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, and M. Abadi. Incremental, iterative data processing with timely dataflow. Commun. ACM, 59(10):75–83, Sept. 2016. dl.acm.org/citation.cfm?id=2983551.

[57] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint, 1507.04296, 2015. arxiv.org/abs/1507.04296.

[58] Nervana Systems. Neon deep learning framework, 2016. github.com/NervanaSystems/neon.

[59] NVIDIA Corporation. NCCL: Optimized primitives for collective multi-GPU communication, 2016. github.com/NVIDIA/nccl.

[60] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of ICML, pages 1310–1318, 2013. jmlr.org/proceedings/papers/v28/pascanu13.pdf.

[61] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Proceedings of NIPS, pages 693–701, 2011. papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf.

[62] C. J. Rossbach, Y. Yu, J. Currey, J.-P. Martin, and D. Fetterly. Dandelion: A compiler and runtime for heterogeneous systems. In Proceedings of SOSP, pages 49–68, 2013. sigops.org/sosp/sosp13/papers/p49-rossbach.pdf.

[63] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Cognitive Modeling, volume 5, pages 213–220. MIT Press, 1988. www.cs.toronto.edu/~hinton/absps/naturebp.pdf.

[64] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211–252, 2015. arxiv.org/abs/1409.0575.

[65] A. Smola and S. Narayanamurthy. An architecture for parallel topic models. Proc. VLDB Endow., 3(1–2):703–710, Sept. 2010. vldb.org/pvldb/vldb2010/papers/R63.pdf.

[66] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of ICML, pages 1139–1147, 2013. jmlr.org/proceedings/papers/v28/sutskever13.pdf.

[67] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104–3112, 2014. papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.pdf.

[68] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of CVPR, pages 1–9, 2015. arxiv.org/abs/1409.4842.

[69] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint, 1512.00567, 2015. arxiv.org/abs/1512.00567.

[70] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng. Map-Reduce for machine learning on multicore. In Proceedings of NIPS, pages 281–288, 2007. papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf.

[71] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of EuroSys, 2015. research.google.com/pubs/archive/43438.pdf.

[72] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. arXiv preprint, 1412.7449, 2014. arxiv.org/abs/1412.7449.

[73] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google's Neural Machine Translation system: Bridging the gap between human and machine translation. arXiv preprint, 1609.08144, 2016. arxiv.org/abs/1609.08144.

[74] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of OSDI, pages 1–14, 2008. www.usenix.org/legacy/event/osdi08/tech/full_papers/yu_y/yu_y.pdf.

[75] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of NSDI, pages 15–28, 2012. www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.

[76] M. D. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. E. Hinton. On rectified linear units for speech processing. In Proceedings of ICASSP, pages 3517–3521, 2013. research.google.com/pubs/archive/40811.pdf.
USENIX Association 12th USENIX Symposium on Operating Systems Design and Implementation 283
TensorFlow:
Large-Scale Machine Learning on Heterogeneous Distributed Systems
(Preliminary White Paper, November 9, 2015)
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser,
Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar,
Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals,
Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng
Google Research∗
∗ Corresponding authors: Jeffrey Dean and Rajat Monga: {jeff,rajatmonga}@google.com

Abstract

TensorFlow [1] is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November 2015 and are available at www.tensorflow.org.

1 Introduction

The Google Brain project started in 2011 to explore the use of very-large-scale deep neural networks, both for research and for use in Google's products. As part of the early work in this project, we built DistBelief, our first-generation scalable distributed training and inference system [14], and this system has served us well. We and others at Google have performed a wide variety of research using DistBelief, including work on unsupervised learning [31], language representation [35, 52], models for image classification and object detection [16, 48], video classification [27], speech recognition [56, 21, 20], sequence prediction [47], move selection for Go [34], pedestrian detection [2], reinforcement learning [38], and other areas [17, 5]. In addition, often in close collaboration with the Google Brain team, more than 50 teams at Google and other Alphabet companies have deployed deep neural networks using DistBelief in a wide variety of products, including Google Search [11], our advertising products, our speech recognition systems [50, 6, 46], Google Photos [43], Google Maps and StreetView [19], Google Translate [18], YouTube, and many others.

Based on our experience with DistBelief and a more complete understanding of the desirable system properties and requirements for training and using neural networks, we have built TensorFlow, our second-generation system for the implementation and deployment of large-scale machine learning models. TensorFlow takes computations described using a dataflow-like model and maps them onto a wide variety of different hardware platforms, ranging from running inference on mobile device platforms such as Android and iOS, to modest-sized training and inference systems using single machines containing one or many GPU cards, to large-scale training systems running on hundreds of specialized machines with thousands of GPUs. Having a single system that can span such a broad range of platforms significantly simplifies the real-world use of machine learning systems, as we have found that having separate systems for large-scale training and small-scale deployment leads to significant maintenance burdens and leaky abstractions. TensorFlow computations are expressed as stateful dataflow graphs (described in more detail in Section 2), and we have focused on making the system both flexible enough for quickly experimenting with new models for research purposes and sufficiently high performance and robust for production training and deployment of machine learning models. For scaling neural network training to larger deployments, TensorFlow allows clients to easily express various kinds of parallelism through replication and parallel execution of a core model dataflow
graph, with many different computational devices all collaborating to update a set of shared parameters or other state. Modest changes in the description of the computation allow a wide variety of different approaches to parallelism to be achieved and tried with low effort [14, 29, 42]. Some TensorFlow uses allow some flexibility in terms of the consistency of parameter updates, and we can easily express and take advantage of these relaxed synchronization requirements in some of our larger deployments. Compared to DistBelief, TensorFlow's programming model is more flexible, its performance is significantly better, and it supports training and using a broader range of models on a wider variety of heterogeneous hardware platforms.

Dozens of our internal clients of DistBelief have already switched to TensorFlow. These clients rely on TensorFlow for research and production, with tasks as diverse as running inference for computer vision models on mobile phones to large-scale training of deep neural networks with hundreds of billions of parameters on hundreds of billions of example records using many hundreds of machines [11, 47, 48, 18, 53, 41]. Although these applications have concentrated on machine learning and deep neural networks in particular, we expect that TensorFlow's abstractions will be useful in a variety of other domains, including other kinds of machine learning algorithms, and possibly other kinds of numerical computations. We have open-sourced the TensorFlow API and a reference implementation under the Apache 2.0 license in November 2015, available at www.tensorflow.org.

The rest of this paper describes TensorFlow in more detail. Section 2 describes the programming model and basic concepts of the TensorFlow interface, and Section 3 describes both our single machine and distributed implementations. Section 4 describes several extensions to the basic programming model, and Section 5 describes several optimizations to the basic implementations. Section 6 describes some of our experiences in using TensorFlow, Section 7 describes several programming idioms we have found helpful when using TensorFlow, and Section 9 describes several auxiliary tools we have built around the core TensorFlow system. Sections 10 and 11 discuss future and related work, respectively, and Section 12 offers concluding thoughts.

2 Programming Model and Basic Concepts

A TensorFlow computation is described by a directed graph, which is composed of a set of nodes. The graph represents a dataflow computation, with extensions for allowing some kinds of nodes to maintain and update persistent state and for branching and looping control structures within the graph, in a manner similar to Naiad [36]. Clients typically construct a computational graph using one of the supported frontend languages (C++ or Python). An example fragment to construct and then execute a TensorFlow graph using the Python front end is shown in Figure 1, and the resulting computation graph in Figure 2.

In a TensorFlow graph, each node has zero or more inputs and zero or more outputs, and represents the instantiation of an operation. Values that flow along normal edges in the graph (from outputs to inputs) are tensors, arbitrary dimensionality arrays where the underlying element type is specified or inferred at graph-construction time. Special edges, called control dependencies, can also exist in the graph: no data flows along such edges, but they indicate that the source node for the control dependence must finish executing before the destination node for the control dependence starts executing. Since our model includes mutable state, control dependencies can be used directly by clients to enforce happens-before relationships. Our implementation also sometimes inserts control dependencies to enforce orderings between otherwise independent operations as a way of, for example, controlling the peak memory usage.

Operations and Kernels

An operation has a name and represents an abstract computation (e.g., "matrix multiply", or "add"). An operation can have attributes, and all attributes must be provided or inferred at graph-construction time in order to instantiate a node to perform the operation. One common use of attributes is to make operations polymorphic over different tensor element types (e.g., add of two tensors of type float versus add of two tensors of type int32). A kernel is a particular implementation of an operation that can be run on a particular type of device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations and kernels available via a registration mechanism, and this set can be extended by linking in additional operation and/or kernel definitions/registrations. Table 1 shows some of the kinds of operations built into the core TensorFlow library.

Sessions

Client programs interact with the TensorFlow system by creating a Session. To create a computation graph, the Session interface supports an Extend method to augment the current graph managed by the session with additional nodes and edges (the initial graph when a session is created is empty). The other primary operation supported
by the session interface is Run, which takes a set of output names that need to be computed, as well as an optional set of tensors to be fed into the graph in place of certain outputs of nodes. Using the arguments to Run, the TensorFlow implementation can compute the transitive closure of all nodes that must be executed in order to compute the outputs that were requested, and can then arrange to execute the appropriate nodes in an order that respects their dependencies (as described in more detail in 3.1). Most of our uses of TensorFlow set up a Session with a graph once, and then execute the full graph or a few distinct subgraphs thousands or millions of times via Run calls.

    import tensorflow as tf

    b = tf.Variable(tf.zeros([100]))                    # 100-d vector, init to zeroes
    W = tf.Variable(tf.random_uniform([784,100],-1,1))  # 784x100 matrix w/rnd vals
    x = tf.placeholder(name="x")                        # Placeholder for input
    relu = tf.nn.relu(tf.matmul(W, x) + b)              # Relu(Wx+b)
    C = [...]                                           # Cost computed as a function of Relu

    s = tf.Session()
    for step in xrange(0, 10):
        input = ...construct 100-D input array ...      # Create 100-d vector for input
        result = s.run(C, feed_dict={x: input})         # Fetch cost, feeding x=input
        print step, result

Figure 1: Example TensorFlow code fragment

Figure 2: Corresponding computation graph for Figure 1 (W and x feed a MatMul node, whose result is combined with b by an Add node and passed through a ReLU to compute C)

Table 1: Example TensorFlow operation types

Category                                Examples
Element-wise mathematical operations    Add, Sub, Mul, Div, Exp, Log, Greater, Less, Equal, ...
Array operations                        Concat, Slice, Split, Constant, Rank, Shape, Shuffle, ...
Matrix operations                       MatMul, MatrixInverse, MatrixDeterminant, ...
Stateful operations                     Variable, Assign, AssignAdd, ...
Neural-net building blocks              SoftMax, Sigmoid, ReLU, Convolution2D, MaxPool, ...
Checkpointing operations                Save, Restore
Queue and synchronization operations    Enqueue, Dequeue, MutexAcquire, MutexRelease, ...
Control flow operations                 Merge, Switch, Enter, Leave, NextIteration
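The transitive-closure step just described can be sketched in a few lines of Python. This is an illustrative model only: the dictionary-based graph encoding and the node names (taken from the graph of Figure 2) are invented for the example, not TensorFlow's actual data structures.

```python
# Hypothetical sketch: given a graph encoded as {node: [input nodes]},
# compute the transitive closure of nodes that must run to produce the
# requested outputs, as the implementation does for a Run call.

def nodes_to_run(deps, outputs):
    """Return the set of nodes needed (transitively) for `outputs`."""
    needed, stack = set(), list(outputs)
    while stack:
        node = stack.pop()
        if node in needed:
            continue
        needed.add(node)
        stack.extend(deps.get(node, []))
    return needed

# Graph for relu = ReLU(MatMul(W, x) + b), as in Figure 2:
deps = {"W": [], "x": [], "b": [],
        "matmul": ["W", "x"], "add": ["matmul", "b"], "relu": ["add"]}

subset = nodes_to_run(deps, ["add"])   # requesting "add" does not pull in "relu"
```

Requesting only "add" yields the five nodes below it; the ReLU node stays outside the closure and is never executed.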
Variables

In most computations a graph is executed multiple times. Most tensors do not survive past a single execution of the graph. However, a Variable is a special kind of operation that returns a handle to a persistent mutable tensor that survives across executions of a graph. Handles to these persistent mutable tensors can be passed to a handful of special operations, such as Assign and AssignAdd (equivalent to +=), that mutate the referenced tensor. For machine learning applications of TensorFlow, the parameters of the model are typically stored in tensors held in variables, and are updated as part of the Run of the training graph for the model.

3 Implementation

The main components in a TensorFlow system are the client, which uses the Session interface to communicate with the master, and one or more worker processes, with each worker process responsible for arbitrating access to one or more computational devices (such as CPU cores or GPU cards) and for executing graph nodes on those devices as instructed by the master. We have both local and distributed implementations of the TensorFlow interface. The local implementation is used when the client, the master, and the worker all run on a single machine in the context of a single operating system process (possibly with multiple devices, if, for example, the machine has many GPU cards installed). The distributed implementation shares most of the code with the local implementation, but extends it with support for an environment where the client, the master, and the workers can all be in different processes on different machines. In our distributed environment, these different tasks are containers in jobs managed by a cluster scheduling system [51]. These two different modes are illustrated in Figure 3. Most of the rest of this section discusses issues that are common to both implementations, while Section 3.3 discusses some issues that are particular to the distributed implementation.

Figure 3: Single-process and distributed structure: a client session issues run requests to a master, which directs one or more worker processes to execute subgraphs on their devices (GPU0, GPU1, ..., CPU0)

Devices

Devices are the computational heart of TensorFlow. Each worker is responsible for one or more devices, and each device has a device type and a name. Device names are composed of pieces that identify the device's type, the device's index within the worker, and, in our distributed setting, an identification of the job and task of the worker (or localhost for the case where the devices are local to the process). Example device names are "/job:localhost/device:cpu:0" or "/job:worker/task:17/device:gpu:3". We have implementations of our Device interface for CPUs and GPUs, and new device implementations for other device types can be provided via a registration mechanism. Each device object is responsible for managing allocation and deallocation of device memory, and for arranging for the execution of any kernels that are requested by higher levels in the TensorFlow implementation.

Tensors

A tensor in our implementation is a typed, multi-dimensional array. We support a variety of tensor element types, including signed and unsigned integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a complex number type, and a string type (an arbitrary byte array). Backing store of the appropriate size is managed by an allocator that is specific to the device on which the tensor resides. Tensor backing store buffers are reference counted and are deallocated when no references remain.

3.1 Single-Device Execution

Let's first consider the simplest execution scenario: a single worker process with a single device. The nodes of the graph are executed in an order that respects the dependencies between nodes. In particular, we keep track of a count per node of the number of dependencies of that node that have not yet been executed. Once this count drops to zero, the node is eligible for execution and is added to a ready queue. The ready queue is processed in some unspecified order, delegating execution of the kernel for a node to the device object. When a node has finished executing, the counts of all nodes that depend on the completed node are decremented.

3.2 Multi-Device Execution

Once a system has multiple devices, there are two main complications: deciding which device to place the computation for each node in the graph, and then managing the required communication of data across device boundaries implied by these placement decisions. This subsection discusses these two issues.

3.2.1 Node Placement

Given a computation graph, one of the main responsibilities of the TensorFlow implementation is to map the computation onto the set of available devices. A simplified version of this algorithm is presented here. See Section 4.3 for extensions supported by this algorithm.

One input to the placement algorithm is a cost model, which contains estimates of the sizes (in bytes) of the input and output tensors for each graph node, along with estimates of the computation time required for each node when presented with its input tensors. This cost model is either statically estimated based on heuristics associated with different operation types, or is measured based on an actual set of placement decisions for earlier executions of the graph.

The placement algorithm first runs a simulated execution of the graph. The simulation is described below and ends up picking a device for each node in the graph using greedy heuristics. The node to device placement generated by this simulation is also used as the placement for the real execution.

The placement algorithm starts with the sources of the computation graph, and simulates the activity on each device in the system as it progresses.

3.2.2 Cross-Device Communication

Once the node placement has been computed, the graph is partitioned into a set of subgraphs, one per device. Any cross-device edge from x to y is removed and replaced by an edge from x to a new Send node in x's subgraph and an edge from a corresponding Receive node to y in y's subgraph. See Figure 4 for an example of this graph transformation.

Figure 4: Before and after insertion of Send/Receive nodes (cross-device edges from W into nodes b and c on Device B are replaced by a send node and corresponding recv nodes)
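The Send/Receive rewrite above can be sketched as follows. This is an illustrative Python model only: the edge-list encoding, the send/recv naming, and the device names are invented for the example, and the real implementation additionally canonicalizes duplicate transfers and performs the actual data movement.

```python
# Hypothetical sketch of the Section 3.2.2 rewrite: partition a placed graph
# into per-device subgraphs, replacing each cross-device edge x -> y with
# x -> Send (on x's device) and Receive -> y (on y's device).

def partition(edges, placement):
    """edges: list of (src, dst) pairs; placement: node -> device name."""
    subgraphs = {dev: [] for dev in set(placement.values())}
    for src, dst in edges:
        d_src, d_dst = placement[src], placement[dst]
        if d_src == d_dst:
            subgraphs[d_src].append((src, dst))       # edge stays intact
        else:
            # One Send/Receive pair per cross-device edge.
            subgraphs[d_src].append((src, f"send:{src}->{dst}"))
            subgraphs[d_dst].append((f"recv:{src}->{dst}", dst))
    return subgraphs

placement = {"W": "gpu:0", "x": "cpu:0", "matmul": "gpu:0"}
subs = partition([("W", "matmul"), ("x", "matmul")], placement)
# subs["cpu:0"] now holds only the send half of the x -> matmul transfer
```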
synchronization between different workers and devices, and the master only needs to issue a single Run request per graph execution to each worker that has any nodes for the graph, rather than being involved in the scheduling of every node or every cross-device communication. This makes the system much more scalable and allows much finer-granularity node executions than if the scheduling were forced to be done by the master.

Figure 5: Gradients computed for the graph in Figure 2 (dReLU, dAdd, and dMatMul nodes are added alongside ReLU, Add, and MatMul to compute dC/db)
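The dependency-count scheme of Section 3.1, which each worker applies locally to the nodes it owns, can be sketched as below. This is a simplified Python model: real kernels are dispatched to device objects and may run concurrently, whereas here execution is modeled by appending node names to a list.

```python
from collections import deque

# Sketch of ready-queue execution: track, per node, how many of its inputs
# have not yet executed; when the count reaches zero, the node joins a
# ready queue and its "kernel" is run.

def execute(deps):
    """deps: node -> list of input nodes. Returns one dependency-respecting order."""
    pending = {n: len(ins) for n, ins in deps.items()}
    consumers = {}
    for n, ins in deps.items():
        for i in ins:
            consumers.setdefault(i, []).append(n)
    ready = deque(n for n, count in pending.items() if count == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)            # "run the kernel" for node n
        for c in consumers.get(n, []):
            pending[c] -= 1        # one more input of c is now available
            if pending[c] == 0:
                ready.append(c)
    return order

deps = {"W": [], "x": [], "b": [], "matmul": ["W", "x"],
        "add": ["matmul", "b"], "relu": ["add"]}
order = execute(deps)
```

Any order the sketch produces respects the graph's edges; the paper deliberately leaves the processing order of the ready queue unspecified.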
This generally means that temporary outputs are consumed soon after being constructed, so their memory can be reused quickly. When the heuristic is ineffective, the user can change the order of graph construction, or add control dependencies as described in Section 5. When gradient nodes are automatically added to the graph, the user has less control, and the heuristics may break down. In particular, because gradients reverse the forward computation order, tensors that are used early in a graph's execution are frequently needed again near the end of a gradient computation. Such tensors can hold on to a lot of scarce GPU memory and unnecessarily limit the size of computations. We are actively working on improvements to memory management to deal better with such cases. Options include using more sophisticated heuristics to determine the order of graph execution, recomputing tensors instead of retaining them in memory, and swapping out long-lived tensors from GPU memory to more plentiful host CPU memory.

4.2 Partial Execution

Often a client wants to execute just a subgraph of the entire execution graph. To support this, once the client has set up a computation graph in a Session, our Run method allows them to execute an arbitrary subgraph of the whole graph, to inject arbitrary data along any edge in the graph, and to retrieve data flowing along any edge in the graph.

Each node in the graph has a name, and each output of a node is identified by the source node name and the output port from the node, numbered from 0 (e.g., "bar:0" refers to the 1st output of the "bar" node, while "bar:1" refers to the 2nd output).

Two arguments to the Run call help define the exact subgraph of the computation graph that will be executed. First, the Run call accepts inputs, an optional mapping of name:port names to "fed" tensor values. Second, the Run call accepts output names, a list of output name[:port] specifications indicating which nodes should be executed, and, if the port portion is present in a name, that that particular output tensor value for the node should be returned to the client if the Run call completes successfully.

The graph is transformed based on the values of inputs and outputs. Each node:port specified in inputs is replaced with a feed node, which will pick up the provided input tensor from specially-initialized entries in a Rendezvous object used for the Run call. Similarly, each output name with a port is connected to a special fetch node that arranges to save the output tensor and return it to the client when the Run call is complete. Finally, once the graph has been rewritten with the insertion of these special feed and fetch nodes, the set of nodes to execute can be determined by starting at each of the nodes named by any output and working backwards in the graph using the graph dependencies to determine the full set of nodes that must be executed in the rewritten graph in order to compute the outputs. Figure 6 shows an original graph on the left, and the transformed graph that results when Run is invoked with inputs=={b} and outputs=={f:0}. Since we only need to compute the output of node f, we will not execute nodes d and e, since they have no contribution to the output of f.

Figure 6: Before and after graph transformation for partial execution (a feed node replaces the fed input b, and a fetch node is attached to output f:0)

4.3 Device Constraints

TensorFlow clients can control the placement of nodes on devices by providing partial constraints for a node about which devices it can execute on. For example, "only place this node on a device of type GPU", or "this node can be placed on any device in /job:worker/task:17", or "Colocate this node with the node named variable13". Within the confines of these constraints, the placement algorithm is responsible for choosing an assignment of nodes to devices that provides fast execution of the computation and also satisfies various constraints imposed by the devices themselves, such as limiting the total amount of memory needed on a device in order to execute its subset of graph nodes.

Supporting such constraints requires changes to the placement algorithm described in Section 3.2.1. We first compute the feasible set of devices for each node, and then use union-find on the graph of colocation constraints to compute the graph components that must be placed together. For each such component, we compute the intersection of the feasible device sets. The computed feasible device set per node fits easily into the placement algorithm's simulator.
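The feed/fetch pruning of Section 4.2 can be sketched as below. This is an illustrative Python model using a made-up graph shaped like the Figure 6 example: feeding b and fetching f leaves d and e unexecuted. The edge structure (c depends on a and b, f on c, d on a, e on d) is invented for the illustration, and the feed node is modeled simply by cutting the fed node's incoming edges.

```python
# Hypothetical sketch of partial execution: feeds cut a node's incoming
# edges (a feed node supplies its value), and the nodes to execute are
# found by walking backwards from the fetched outputs.

def run_subgraph(deps, feeds, fetches):
    """deps: node -> list of input nodes; feeds: set of fed node names;
    fetches: list of fetched node names. Returns the nodes to execute."""
    pruned = {n: ([] if n in feeds else ins) for n, ins in deps.items()}
    needed, stack = set(), list(fetches)
    while stack:
        n = stack.pop()
        if n not in needed:
            needed.add(n)
            stack.extend(pruned[n])     # walk backwards through dependencies
    return needed

deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"], "e": ["d"], "f": ["c"]}
needed = run_subgraph(deps, feeds={"b"}, fetches=["f"])
# d and e are absent from `needed`: they contribute nothing to f
```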
4.4 Control Flow 4.5 Input Operations
Although dataflow graphs without any explicit control Although input data can be provided to a computation via
flow are quite expressive, we have observed a number of feed nodes, another common mechanism used for train-
cases where supporting conditionals and loops can lead ing large-scale machine learning models is to have spe-
to more concise and efficient representations of machine cial input operation nodes in the graph, which are typi-
learning algorithms. cally configured with a set of filenames and which yield
a tensor containing one or more examples from the data
Much as in the dataflow-machine approach described
stored in that set of files each time they are executed.
by Arvind [3], we introduce a small set of primitive con-
This allows data to be read directly from the underlying
trol flow operators into TensorFlow and generalize Ten-
storage system into the memory of the machine that will
sorFlow to handle cyclic dataflow graphs. The Switch
perform subsequent processing on the data. In configura-
and Merge operators allow us to skip the execution of
tions where the client process is separate from the worker
an entire subgraph based on the value of a boolean ten-
process, if the data were fed, it typically would require an
sor. The Enter, Leave, and NextIteration operators allow
extra network hop (from the storage system to the client
us to express iteration. High-level programming con-
and then from the client to the worker vs. directly from
structs such as if-conditionals and while-loops can be
the storage system to ther worker when using an input
easily compiled into dataflow graphs with these control
node).
flow operators.
The TensorFlow runtime implements a notion of tags
and frames conceptually similar to the MIT Tagged- 4.6 Queues
Token machine [4]. Each iteration of a loop is uniquely
identified by a tag, and its execution state is represented Queues are a useful feature that we have added to Ten-
by a frame. An input can enter an iteration whenever it sorFlow. They allow different portions of the graph to
becomes available; thus, multiple iterations can be exe- execute asynchronously, possibly at different candences,
cuted concurrently. and to hand off data through Enqueue and Dequeue op-
erations. Enqueue operations can block until space be-
TensorFlow uses a distributed coordination mecha-
comes available in the queue, and Dequeue operations
nism to execute graphs with control flow. In general, a
can block until a desired minimum number of elements
loop can contain nodes that are assigned to many dif-
are available in the queue. One use of queues is to allow
ferent devices. Therefore, managing the state of a loop
input data to be prefetched from disk files while a previ-
becomes a problem of distributed termination detection.
ous batch of data is still being processed by the compu-
TensorFlow’s solution is based on graph rewriting. Dur-
tational portion of a machine learning model. They can
ing the graph partitioning, we automatically add control
also be used for other kinds of grouping, including accu-
nodes to each partition. These nodes implement a small
mulating many gradients in order to compute some more
state machine that orchestrates the start and termination
complex combination of gradients over a larger batch,
of each iteration, and decides the termination of the loop.
or to group different input sentences for recurrent lan-
For each iteration, the device that owns the loop termi-
guage models into bins of sentences that are approxi-
nation predicate sends a tiny control message to every
mately the same length, which can then be processed
participating device.
more efficiently.
As explained above, we often train machine learning In addition to normal FIFO queues, we have also im-
models by gradient descent, and represent gradient com- plemented a shuffling queue, which randomly shuffles its
putations as part of dataflow graphs. When a model elements within a large in-memory buffer. This shuffling
includes control-flow operations, we must account for functionality is useful for machine learning algorithms
them in the corresponding gradient computation. For ex- that want to randomize the order in which they process
ample, the gradient computation for a model with an if- examples, for example.
conditional will need to know which branch of the con-
ditional was taken, then apply the gradient logic to this
branch. Similarly, the gradient computation for a model 4.7 Containers
with a while-loop will need to know how many iterations
were taken, and will also rely on the intermediate values A Container is the mechanism within TensorFlow for
computed during those iterations. The basic technique is managing longer-lived mutable state. The backing store
to rewrite the graph so to memorize the values needed for for a Variable lives in a container. The default con-
the gradient computation. We omit the somewhat intri- tainer is one that persists until the process terminates,
cate details of this encoding. but we also allow other named containers. A container
8
can be reset by clearing it of its contents entirely. Us- 5.3 Asynchronous Kernels
ing containers, it is possible to share state even across
completely disjoint computation graphs associated with In addition to normal synchronous kernels that complete
different Sessions. their execution at the end of the Compute method, our
framework also supports non-blocking kernels. Such
non-blocking kernels use a slightly different interface
whereby the Compute method is passed a continuation
5 Optimizations
that should be invoked when the kernel’s execution is
complete. This is an optimization for environments
In this section, we describe some of the optimizations where having many active threads is relatively expensive
in the TensorFlow implementation that improve perfor- in terms of memory usage or other resources, and allows
mance or resource usage of the system. us to avoid tying up an execution thread for unbounded
periods of time while waiting for I/O or other events to
occur. Examples of asynchronous kernels include the
5.1 Common Subexpression Elimination Receive kernel, and the Enqueue and Dequeue kernels
(which might need to block if queue space is not avail-
Since the construction of computation graphs is often able or if no data is available to be read, respectively).
done by many different layers of abstractions in the client
code, computation graphs can easily end up with redun-
dant copies of the same computation. To handle this, we
5.4 Optimized Libraries for Kernel Imple-
have implemented a common subexpression pass similar mentations
to the algorithm described by Click [12] that runs over We often make use of pre-existing highly-optimized nu-
the computation graph and canonicalizes multiple copies merical libraries to implement kernels for some opera-
of operations with identical inputs and operation types tions. For example, there are a number of optimized li-
to just a single one of these nodes, and redirects graph edges appropriately to reflect this canonicalization.

5.2 Controlling Data Communication and Memory Usage

Careful scheduling of TensorFlow operations can result in better performance of the system, in particular with respect to data transfers and memory usage. Specifically, scheduling can reduce the time window during which intermediate results need to be kept in memory in between operations, and hence the peak memory consumption. This reduction is particularly important for GPU devices, where memory is scarce. Furthermore, orchestrating the communication of data across devices can reduce contention for network resources.

While there are many opportunities for scheduling optimizations, here we focus on one that we found particularly necessary and effective. It concerns the scheduling of Receive nodes for reading remote values. If no precautions are taken, these nodes may start much earlier than necessary, possibly all at once when execution starts. By performing an as-soon-as-possible/as-late-as-possible (ASAP/ALAP) calculation of the kind common in operations research, we analyze the critical paths of graphs in order to estimate when to start the Receive nodes. We then insert control edges with the aim of delaying the start of these nodes until just before their results are needed.

libraries for performing matrix multiplies on different devices, including BLAS [15] and cuBLAS [39], or GPU libraries for convolutional kernels for deep neural nets such as cuda-convnet [28] and cuDNN [9]. Many of our kernel implementations are relatively thin wrappers around such optimized libraries.

We make fairly extensive use of the open-source Eigen linear algebra library [25] for many of the kernel implementations in the system. As one part of the development of TensorFlow, our team (primarily Benoit Steiner) has extended the open-source Eigen library with support for arbitrary-dimensionality tensor operations.

5.5 Lossy Compression

Some machine learning algorithms, including those typically used for training neural networks, are tolerant of noise and reduced-precision arithmetic. In a manner similar to the DistBelief system [14], we often use lossy compression of higher-precision internal representations when sending data between devices (sometimes within the same machine, but especially across machine boundaries). For example, we often insert special conversion nodes that convert 32-bit floating-point representations into a 16-bit floating-point representation (not the proposed IEEE 16-bit floating-point standard, but rather just a 32-bit IEEE 754 float format with 16 bits less precision in the mantissa), and then convert back to a 32-bit representation on the other side of the communication channel (by just filling in zeroes for the lost portion
of the mantissa, since that's less computationally expensive than doing the mathematically correct probabilistic rounding when doing this 32 → 16 → 32-bit conversion).

6 Status and Experience

The TensorFlow interface and a reference implementation have been open sourced under an Apache 2.0 license, and the system is available for download at www.tensorflow.org. The system includes detailed documentation, a number of tutorials, and a number of examples demonstrating how to use the system for a variety of different machine learning tasks. The examples include models for classifying hand-written digits from the MNIST dataset (the "hello world" of machine learning algorithms) [32], classifying images from the CIFAR-10 dataset [30], doing language modeling using a recurrent LSTM [22] network, training word embedding vectors [35], and more.

The system includes front-ends for specifying TensorFlow computations in Python and C++, and we expect other front-ends to be added over time in response to the desires of both internal Google users and the broader open-source community.

We have quite a few machine learning models in our previous DistBelief system [14] that we have migrated over to TensorFlow. The rest of this section discusses some lessons we have learned that are generalizable for any such migration of machine learning models from one system to another, and therefore may be valuable to others.

In particular, we focus on our lessons from porting a state-of-the-art convolutional neural network for image recognition termed Inception [23]. This image recognition system classifies 224 × 224 pixel images into one of 1000 labels (e.g., "cheetah", "garbage truck", etc.). Such a model comprises 13.6 million learnable parameters and 36,000 operations when expressed as a TensorFlow graph. Running inference on a single image requires 2 billion multiply-add operations.

After building all necessary mathematical operations in TensorFlow, assembling and debugging all 36,000 operations into the correct graph structure proved challenging. Validating correctness is a difficult enterprise because the system is inherently stochastic and only intended to behave in a certain way in expectation, potentially after hours of computation. Given these circumstances, we found the following strategies critical for porting the Inception model to TensorFlow:

1. Build tools to gain insight into the exact number of parameters in a given model. Such tools demonstrated subtle flaws in a complex network architecture specification. In particular, we were able to identify operations and variables instantiated incorrectly due to automatic broadcasting in a mathematical operation across a dimension.

2. Start small and scale up. The first convolutional neural network that we ported from our previous system was a small network employed on the CIFAR-10 data set [30]. Debugging such a network elucidated subtle edge cases in individual operations (e.g., max-pooling) within the machine learning system that would have been practically indecipherable in more complex models.

3. Always ensure that the objective (loss function) matches between machine learning systems when learning is turned off. Setting the learning rate to zero helped us identify unexpected behavior in how we had randomly initialized variables in a model. Such an error would have been difficult to identify in a dynamic, training network.

4. Make a single machine implementation match before debugging a distributed implementation. This strategy helped us delineate and debug discrepancies in training performance between machine learning systems. In particular, we identified bugs due to race conditions and non-atomic operations incorrectly assumed to be atomic.

5. Guard against numerical errors. Numerical libraries are inconsistent in how they handle non-finite floating-point values. Convolutional neural networks are particularly susceptible to numerical instability and will tend to diverge quite regularly during experimentation and debugging phases. Guarding against this behavior by checking for non-finite floating-point values allows one to detect errors in real time, as opposed to identifying divergent behavior post hoc.

6. Analyze pieces of a network and understand the magnitude of numerical error. Running subsections of a neural network in parallel on two machine learning systems provides a precise method to ensure that a numerical algorithm is identical across two systems. Given that such algorithms run with floating-point precision, it is important to predict and understand the magnitude of expected numerical error in order to judge whether a given component is correctly implemented (e.g., distinguishing between "within 1e-2, great!" and "within 1e-2: why is it so incorrect?!").
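The error-magnitude reasoning in point 6 pairs naturally with the lossy 32 → 16 → 32-bit conversion described in Section 5.5. The following plain-Python sketch (the function names here are invented; TensorFlow's real conversion is a C++ graph kernel) truncates an IEEE 754 single-precision value to its high 16 bits and fills in zeroes for the lost mantissa bits on the receiving side:

```python
import struct

def f32_bits(x: float) -> int:
    # Reinterpret a value as its IEEE 754 single-precision bit pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    # Reinterpret a 32-bit pattern as an IEEE 754 single-precision value.
    return struct.unpack("<f", struct.pack("<I", b))[0]

def compress_16(x: float) -> int:
    # Keep the high 16 bits: sign, 8 exponent bits, top 7 mantissa bits.
    return f32_bits(x) >> 16

def decompress_16(h: int) -> float:
    # Fill in zeroes for the lost low 16 mantissa bits (truncation,
    # not the mathematically correct probabilistic rounding).
    return bits_f32(h << 16)

x = 3.14159265
y = decompress_16(compress_16(x))
# With only 7 mantissa bits surviving, the relative round-trip error
# is bounded by 2**-7.
assert abs(x - y) / abs(x) < 2 ** -7
```

Since only 7 mantissa bits survive the round trip, the expected relative error is on the order of 1%, exactly the kind of bound one needs in hand when deciding whether "within 1e-2" is great or alarming.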
[Figure: synchronous vs. asynchronous data parallelism. Synchronous: client updates ΔP from Devices A, B, and C are combined by an Add node and applied to the parameters P on the parameter device(s). Asynchronous: Clients 1–3 each apply their own ΔP update to P independently.]
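The two data-parallel update disciplines sketched above can be caricatured in a few lines. This is an illustrative sketch, not TensorFlow code: parameters are a plain list, each client contributes a precomputed delta ΔP, and the only difference is whether deltas are combined before a single update (synchronous) or applied one by one as they arrive (asynchronous).

```python
def apply_update(params, delta, lr=0.1):
    # Gradient-descent style update on the parameter device.
    for i, d in enumerate(delta):
        params[i] -= lr * d

def synchronous_step(params, client_deltas):
    # Wait for all clients, combine their deltas (the "Add" node),
    # then apply one combined update.
    combined = [sum(ds) / len(client_deltas) for ds in zip(*client_deltas)]
    apply_update(params, combined)

def asynchronous_step(params, client_deltas):
    # Apply each client's delta as soon as it arrives. In a real system
    # later clients would recompute their deltas against the already
    # updated parameters; here the deltas are fixed for illustration.
    for delta in client_deltas:
        apply_update(params, delta)

params_sync = [1.0, 2.0]
params_async = [1.0, 2.0]
deltas = [[0.3, 0.0], [0.1, 0.2], [0.2, 0.4]]
synchronous_step(params_sync, deltas)    # one averaged step
asynchronous_step(params_async, deltas)  # three independent steps
```

The trade-off mirrors the figure: the synchronous form gives consistent gradients at the cost of waiting for the slowest client, while the asynchronous form keeps every device busy at the cost of updates computed against stale parameters.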
8 Performance

[Figure residue: clients driving ops A on Device 1 (parameters P1) and ops B on Device 2 (parameters P2), with repeated Update steps.]

9 Tools

This section describes some tools we have developed that sit alongside the core TensorFlow graph execution engine.

In order to help users understand the structure of their computation graphs, and also to understand the overall behavior of machine learning models, we have built TensorBoard, a companion visualization tool for TensorFlow that is included in the open source release.
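The logging pattern behind TensorBoard can be sketched as follows: the client driver periodically appends timestamped summary records to a log file, and the viewer watches that file for new records. The sketch uses an invented JSON-lines format, not TensorBoard's actual record protocol:

```python
import json, time

def write_summary(log, step, tag, value):
    # One summary record: wall time, step count, and a tagged scalar.
    record = {"wall_time": time.time(), "step": step, "tag": tag,
              "value": value}
    log.write(json.dumps(record) + "\n")

# The training driver periodically executes the summary nodes and
# appends the resulting values to the log file.
with open("train_summaries.log", "w") as log:
    for step in range(3):
        loss = 1.0 / (step + 1)  # stand-in for an actual loss value
        write_summary(log, step, "loss", loss)

# A viewer tails the same file and can plot each tag against relative
# wall time, absolute time, or steps.
with open("train_summaries.log") as log:
    records = [json.loads(line) for line in log]
```

Keeping the log append-only is what lets the training driver and the visualization tool stay decoupled: the viewer only ever needs to read new lines past its last offset.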
Figure 10: TensorBoard graph visualization of a convolutional neural network model
Figure 11: TensorBoard graphical display of model summary statistics time series data
including scalar summaries (e.g., for examining overall properties of the model, such as the value of the loss function averaged across a collection of examples, or the time taken to execute the computation graph), histogram-based summaries (e.g., the distribution of weight values in a neural network layer), or image-based summaries (e.g., a visualization of the filter weights learned in a convolutional neural network). Typically computation graphs are set up so that Summary nodes are included to monitor various interesting values, and every so often during execution of the training graph, the set of summary nodes is also executed, in addition to the normal set of nodes, and the client driver program writes the summary data to a log file associated with the model training. The TensorBoard program is then configured to watch this log file for new summary records, and can display this summary information and how it changes over time (with the ability to select the measurement of "time" to be relative wall time since the beginning of the execution of the TensorFlow program, absolute time, or "steps", a numeric measure of the number of graph executions that have occurred since the beginning of execution of the TensorFlow program). A screenshot of the visualization of summary values in TensorBoard is shown in Figure 11.

9.2 Performance Tracing

We also have an internal tool called EEG (not included in the initial open source release in November 2015) that we use to collect and visualize very fine-grained information about the exact ordering and performance characteristics of the execution of TensorFlow graphs. This tool works in both our single machine and distributed implementations, and is very useful for understanding the bottlenecks in the computation and communication patterns of a TensorFlow program.

Traces are collected simultaneously on each machine in the system from a variety of sources, including Linux kernel ftrace, our own lightweight thread tracing tools, and the CUDA Profiling Tools Interface (CUPTI). With these logs we can reconstruct the execution of a distributed training step with microsecond-level details of every thread switch, CUDA kernel launch, and DMA operation.

Traces are combined in a visualization server which is designed to rapidly extract events in a specified time range and summarize them at the appropriate detail level for the user-interface resolution. Any significant delays due to communication, synchronization, or DMA-related stalls are identified and highlighted using arrows in the visualization. Initially the UI provides an overview of the entire trace, with only the most significant performance artifacts highlighted. As the user progressively zooms in, increasingly fine resolution details are rendered.

Figure 12 shows an example EEG visualization of a model being trained on a multi-core CPU platform. The top third of the screenshot shows TensorFlow operations being dispatched in parallel, according to the dataflow constraints. The bottom section of the trace shows how most operations are decomposed into multiple work-items which are executed concurrently in a thread pool. The diagonal arrows on the right-hand side show where queueing delay is building up in the thread pool. Figure 13 shows another EEG visualization with computation mainly happening on the GPU. Host threads can be seen enqueuing TensorFlow GPU operations as they become runnable (the light blue thread pool), and background housekeeping threads can be seen in other colors being migrated across processor cores. Once again, arrows show where threads are stalled on GPU-to-CPU transfers, or where ops experience significant queueing delay.

Finally, Figure 14 shows a more detailed view which allows us to examine how TensorFlow GPU operators are assigned to multiple GPU streams. Whenever the dataflow graph allows parallel execution or data transfer, we endeavour to expose the ordering constraints to the GPU device using streams and stream dependency primitives.

10 Future Work

We have several different directions for future work. We will continue to use TensorFlow to develop new and interesting machine learning models for artificial intelligence, and in the course of doing this, we may discover ways in which we will need to extend the basic TensorFlow system. The open source community may also come up with new and interesting directions for the TensorFlow implementation.

One extension to the basic programming model that we are considering is a function mechanism, whereby a user can specify an entire subgraph of a TensorFlow computation to be a reusable component. In the implementation we have designed, these functions can become reusable components even across different front-end languages for TensorFlow, so that a user could define a function using the Python front end, but then use that function as a basic building block from within the C++ front end. We are hopeful that this cross-language reusability will bootstrap a vibrant community of machine learning researchers publishing not just whole examples of their research, but also small reusable components from their work that can be reused in other contexts.

We also have a number of concrete directions to improve the performance of TensorFlow. One such direction is our initial work on a just-in-time compiler that can take a subgraph of a TensorFlow execution, perhaps with some runtime profiling information about the typical sizes and shapes of tensors, and can generate an optimized routine for this subgraph. This compiler will understand the semantics of the subgraph and perform a number of optimizations such as loop fusion, blocking and tiling for locality, specialization for particular shapes and sizes, etc.

We also imagine that a significant area for future work will be in improving the placement and node scheduling algorithms used to decide where different nodes will execute, and when they should start executing. We have currently implemented a number of heuristics in these subsystems, and we would like to have the system instead learn to make good placement decisions (perhaps using a deep neural network, combined with a reinforcement learning objective function).

11 Related Work

There are many other systems that are comparable in various ways with TensorFlow. Theano [7], Torch [13], Caffe [26], Chainer [49] and the Computational Network Toolkit [54] are a few systems designed primarily for the training of neural networks. Each of these systems maps the computation onto a single machine, unlike the distributed TensorFlow implementation. Like Theano and Chainer, TensorFlow supports symbolic differentiation, thus making it easier to define and work with gradient-based optimization algorithms. Like Caffe, TensorFlow has a core written in C++, simplifying the deployment of trained models in a wide variety of production settings, including memory- and computation-constrained environments such as mobile devices.

Figure 12: EEG visualization of multi-threaded CPU operations (x-axis is time in µs).

Figure 13: EEG visualization of Inception training showing CPU and GPU activity.

The TensorFlow system shares some design characteristics with its predecessor system, DistBelief [14], and with later systems with similar designs like Project Adam [10] and the Parameter Server project [33]. Like DistBelief and Project Adam, TensorFlow allows computations to be spread out across many computational devices across many machines, and allows users to specify machine learning models using relatively high-level descriptions. Unlike DistBelief and Project Adam, though, the general-purpose dataflow graph model in TensorFlow is more flexible and more amenable to expressing a wider variety of machine learning models and optimization algorithms. It also permits a significant simplification by allowing the expression of stateful parameter nodes as variables, and variable update operations that are just additional nodes in the graph; in contrast, DistBelief, Project Adam, and the Parameter Server systems all have
Figure 14: Timeline of multi-stream GPU execution.
whole separate parameter server subsystems devoted to communicating and updating parameter values.

The Halide system [40] for expressing image processing pipelines uses a similar intermediate representation to the TensorFlow dataflow graph. Unlike TensorFlow, though, the Halide system actually has higher-level knowledge of the semantics of its operations, and uses this knowledge to generate highly optimized pieces of code that combine multiple operations, taking into account parallelism and locality. Halide runs the resulting computations only on a single machine, and not in a distributed setting. In future work we are hoping to extend TensorFlow with a similar cross-operation dynamic compilation framework.

Like TensorFlow, several other distributed systems have been developed for executing dataflow graphs across a cluster. Dryad [24] and Flume [8] demonstrate how a complex workflow can be represented as a dataflow graph. CIEL [37] and Naiad [36] introduce generic support for data-dependent control flow: CIEL represents iteration as a DAG that dynamically unfolds, whereas Naiad uses a static graph with cycles to support lower-latency iteration. Spark [55] is optimized for computations that access the same data repeatedly, using "resilient distributed datasets" (RDDs), which are soft-state cached outputs of earlier computations. Dandelion [44] executes dataflow graphs across a cluster of heterogeneous devices, including GPUs. TensorFlow uses a hybrid dataflow model that borrows elements from each of these systems. Its dataflow scheduler, which is the component that chooses the next node to execute, uses the same basic algorithm as Dryad, Flume, CIEL, and Spark. Its distributed architecture is closest to Naiad, in that the system uses a single, optimized dataflow graph to represent the entire computation, and caches information about that graph on each device to minimize coordination overhead. Like Spark and Naiad, TensorFlow works best when there is sufficient RAM in the cluster to hold the working set of the computation. Iteration in TensorFlow uses a hybrid approach: multiple replicas of the same dataflow graph may be executing at once, while sharing the same set of variables. Replicas can share data asynchronously through the variables, or use synchronization mechanisms in the graph, such as queues, to operate synchronously. TensorFlow also supports iteration within a graph, which is a hybrid of CIEL and Naiad: for simplicity, each node fires only when all of its inputs are ready (like CIEL); but for efficiency the graph is represented as a static, cyclic dataflow (like Naiad).

12 Conclusions

We have described TensorFlow, a flexible dataflow-based programming model, as well as single machine and distributed implementations of this programming model. The system is born from real-world experience in conducting research and deploying more than one hundred machine learning projects throughout a wide range of Google products and services. We have open sourced a version of TensorFlow, and hope that a vibrant shared community develops around the use of TensorFlow. We are excited to see how others outside of Google make use of TensorFlow in their own work.
Acknowledgements

The development of TensorFlow has benefitted enormously from the large and broad machine learning community at Google, and in particular from the suggestions and contributions from the rest of the Google Brain team, and also from the hundreds of DistBelief and TensorFlow users within Google. Without a doubt, the usability and functionality of TensorFlow has been greatly expanded by listening to their feedback.

Many individuals have contributed to TensorFlow and to its open source release, including John Giannandrea (for creating a supportive research environment), Irina Kofman and Phing Turner (project management), Bill Gruber and David Westbrook (technical writing), Dave Andersen, Anelia Angelova, Yaroslav Bulatov, Jianmin Chen, Jerjou Cheng, George Dahl, Andrew Dai, Lucy Gao, mig Gerard, Stephan Gouws, Naveen Kumar, Geoffrey Hinton, Mrinal Kalarishnan, Anjuli Kannan, Yutaka Leon-Suematsu, Frank Li, Peter Liu, Xiaobing Liu, Nishant Patil, Pierre Sermanet, Noam Shazeer, Jascha Sohl-dickstein, Philip Tucker, Yonghui Wu, Ke Yang, and Cliff Young (general contributions), Doug Fritz, Patrick Hurst, Dilip Krishnan, Daniel Smilkov, James Wexler, Jimbo Wilson, Kanit Ham Wongsuphasawat, Cassandra Xia, and the Big Picture team (graph visualization), Chris Leary, Robert Springer and the Stream Executor team, Kayur Patel, Michael Piatek, and the coLab team, and the many others who have contributed to the TensorFlow design and code base.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Anelia Angelova, Alex Krizhevsky, and Vincent Vanhoucke. Pedestrian detection with a large-field-of-view deep network. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 704–711. IEEE, 2015. CalTech PDF.

[3] Arvind and David E. Culler. Annual Review of Computer Science Vol. 1, chapter Dataflow Architectures, pages 225–253. 1986. www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA166235.

[4] Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput., 39(3):300–318, 1990. dl.acm.org/citation.cfm?id=78583.

[5] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014. arxiv.org/abs/1412.7755.

[6] Françoise Beaufays. The neural networks behind Google Voice transcription, 2015. googleresearch.blogspot.com/2015/08/the-neural-networks-behind-google-voice.html.

[7] James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: A CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX, 2010. UMontreal PDF.

[8] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. FlumeJava: easy, efficient data-parallel pipelines. In ACM SIGPLAN Notices, volume 45, pages 363–375. ACM, 2010. research.google.com/pubs/archive/35650.pdf.

[9] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014. arxiv.org/abs/1410.0759.

[10] Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, 2014. www.usenix.org/system/files/conference/osdi14/osdi14-paper-chilimbi.pdf.

[11] Jack Clark. Google turning its lucrative web search over to AI machines, 2015. www.bloomberg.com/news/articles/2015-10-26/google-turning-its-lucrative-web-search-over-to-ai-machines.

[12] Cliff Click. Global code motion/global value numbering. In ACM SIGPLAN Notices, volume 30, pages 246–257. ACM, 1995. courses.cs.washington.edu/courses/cse501/06wi/reading/click-pldi95.pdf.

[13] Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. Torch: A modular machine learning software library. Technical report, IDIAP, 2002. infoscience.epfl.ch/record/82802/files/rr02-46.pdf.

[14] Jeffrey Dean, Gregory S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker,
Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012. Google Research PDF.

[15] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990. www.maths.manchester.ac.uk/~sven/pubs/Level3BLAS-1-TOMS16-90.pdf.

[16] Andrea Frome, Greg S. Corrado, Jonathon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013. research.google.com/pubs/archive/41473.pdf.

[17] Javier Gonzalez-Dominguez, Ignacio Lopez-Moreno, Pedro J. Moreno, and Joaquin Gonzalez-Rodriguez. Frame-by-frame language identification in short utterances using deep neural networks. Neural Networks, 64:49–58, 2015.

[18] Otavio Good. How Google Translate squeezes deep learning onto a phone, 2015. googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html.

[19] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations, 2014. arxiv.org/pdf/1312.6082.

[20] Georg Heigold, Vincent Vanhoucke, Alan Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeffrey Dean. Multilingual acoustic models using distributed deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8619–8623. IEEE, 2013. research.google.com/pubs/archive/40807.pdf.

[21] Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82–97, 2012. www.cs.toronto.edu/~gdahl/papers/deepSpeechReviewSPM2012.pdf.

[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997. ftp.idsia.ch/pub/juergen/lstm.pdf.

[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. arxiv.org/abs/1502.03167.

[24] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, volume 41, pages 59–72. ACM, 2007. www.michaelisard.com/pubs/eurosys07.pdf.

[25] Benoît Jacob, Gaël Guennebaud, et al. Eigen library for linear algebra. eigen.tuxfamily.org.

[26] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014. arxiv.org/pdf/1408.5093.

[27] Andrej Karpathy, George Toderici, Sachin Shetty, Tommy Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014. research.google.com/pubs/archive/42455.pdf.

[28] A. Krizhevsky. cuda-convnet, 2014. code.google.com/p/cuda-convnet/.

[29] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014. arxiv.org/abs/1404.5997.

[30] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. www.cs.toronto.edu/~kriz/cifar.html.

[31] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In ICML'2012, 2012. Google Research PDF.

[32] Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. The MNIST database of handwritten digits, 1998. yann.lecun.com/exdb/mnist/.

[33] Mu Li, Dave Andersen, and Alex Smola. Parameter server. parameterserver.org.

[34] Chris J. Maddison, Aja Huang, Ilya Sutskever, and David Silver. Move evaluation in Go using deep convolutional neural networks. arXiv preprint arXiv:1412.6564, 2014. arxiv.org/abs/1412.6564.

[35] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track, 2013. arxiv.org/abs/1301.3781.

[36] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013. Microsoft Research PDF.

[37] Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smit, Anil Madhavapeddy, and Steven Hand. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the Ninth USENIX Symposium on Networked Systems Design and Implementation, 2011. Usenix PDF.
[38] Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Vedavyas Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015. arxiv.org/abs/1507.04296.

[39] NVIDIA. cuBLAS library. NVIDIA Corporation, Santa Clara, California, 2008. developer.nvidia.com/cublas.

[40] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices, 48(6):519–530, 2013. people.csail.mit.edu/fredo/tmp/Halide-5min.pdf.

[41] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015. arxiv.org/abs/1502.02072.

[42] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011. papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.

[43] Chuck Rosenberg. Improving Photo Search: A step across the semantic gap, 2013. googleresearch.blogspot.com/2013/06/improving-photo-search-step-across.html.

[44] Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. Dandelion: a compiler and runtime for heterogeneous systems. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 49–68. ACM, 2013. research-srv.microsoft.com/pubs/201110/sosp13-dandelion-final.pdf.

[45] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 5:3, 1988. www.cs.toronto.edu/~hinton/absps/naturebp.pdf.

[46] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google Voice Search: faster and more accurate, 2015. googleresearch.blogspot.com/2015/09/google-voice-search-faster-and-more.html.

[47] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural.

[48] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR'2015, 2015. arxiv.org/abs/1409.4842.

[49] Seiya Tokui. Chainer: A powerful, flexible and intuitive framework of neural networks. chainer.org.

[50] Vincent Vanhoucke. Speech recognition and deep learning, 2015. googleresearch.blogspot.com/2012/08/speech-recognition-and-deep-learning.html.

[51] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, page 18. ACM, 2015. research.google.com/pubs/archive/43438.pdf.

[52] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. Grammar as a foreign language. Technical report, arXiv:1412.7449, 2014. arxiv.org/abs/1412.7449.

[53] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In NIPS, 2015. arxiv.org/abs/1506.03134.

[54] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. An introduction to computational networks and the computational network toolkit. Technical report, Microsoft Research, 2014. research.microsoft.com/apps/pubs/?id=226641.

[55] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012. www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf.

[56] Matthew D. Zeiler, Marc'Aurelio Ranzato, Rajat Monga, Mark Mao, Ke Yang, Quoc Le, Patrick Nguyen, Andrew Senior, Vincent Vanhoucke, Jeff Dean, and Geoffrey E. Hinton. On rectified linear units for speech processing. In ICASSP, 2013. research.google.com/pubs/archive/40811.pdf.
Visual Madlibs: Fill in the Blank Description Generation and Question Answering
Abstract
1. Introduction
Much of everyday language and discourse concerns the visual world around us, making understanding the relationship between the physical world and the language describing that world an important challenge problem for AI. Understanding this complex and subtle relationship will have broad applicability toward inferring human-like understanding for images, producing natural human-robot interactions, and for tasks like natural language grounding in NLP. In computer vision, along with improvements in deep learning based visual recognition, there has been an explosion of recent interest in methods to automatically generate natural language descriptions for images [6, 10, 16, 35, 17, 22] or videos [34, 9]. However, most of these methods and existing datasets have focused on only one type of description: a generic description for the entire image.

In this paper, we collect a new dataset of focused, targeted descriptions, the Visual Madlibs dataset (http://tamaraberg.com/visualmadlibs/), as illustrated in Figure 1. To collect this dataset, we introduce automatically produced fill-in-the-blank templates designed to collect a range of different descriptions for visual content in an image. This is inspired by Madlibs, a children's word game where one player prompts another for a list of words to substitute for blanks in a story. In our case, a user might be presented with an image and a fill-in-the-blank template such as "The frisbee is [blank]" and asked to fill in the [blank] with a description of the appearance of the frisbee. Alternatively, they could be asked to fill in the [blank] with a description of what the person is doing with the frisbee. Fill-in-the-blank questions can be targeted to collect descriptions about people and objects, their appearances, activities, and interactions, as well as descriptions of the general scene or the broader emotional, spatial, or temporal context of an image (examples in Fig 2). Using these templates, we collect 360,001 targeted descriptions for 10,738 images from the MS COCO collection [23].

Figure 1. An example from the Visual Madlibs Dataset, including a variety of targeted descriptions for people and objects.

With this new dataset, we can develop methods to generate more focused descriptions. Instead of asking an algorithm to "describe the image" we can now ask for more focused descriptions such as "describe the person", "describe what the person is doing," or "describe the relationship between the person and the frisbee." We can also ask questions about aspects of an image that are somewhat beyond the scope of the directly depicted content. For example, "describe what might have happened just before this picture was taken" or "describe how this image makes you feel." These types of descriptions reach toward the high-level goal of producing human-like visual interpretations for images.

In addition to focused description generation, we also introduce a multiple-choice question-answering task for images. In this task, the computer is provided with an image and a partial description such as "The person is [blank]". A set of possible answers is also provided: one answer that was written about the image in question, and several additional answers written about other images. The computer is evaluated on how well it can select the correct choice. In this way, we can evaluate the performance of description generation on a concrete task, making evaluation more straightforward. Varying the difficulty of the negative answers, by adjusting how similar they are to the correct answer, provides a nuanced measurement of performance.

For both the generation and question-answering tasks, we study and evaluate a recent state-of-the-art approach for image description generation [35], as well as a simple joint-embedding method learned on deep representations. The evaluation also includes extensive analysis of the Visual Madlibs dataset and comparisons to the existing MS COCO dataset of natural language descriptions for images.

In summary, our contributions are:
1) A new description collection strategy, Visual Madlibs, for constructing fill-in-the-blank templates to collect targeted natural language descriptions.
2) A new Visual Madlibs Dataset consisting of 360,001 targeted descriptions, spanning 12 different types of templates, for 10,738 images, as well as analysis of the dataset and comparisons to existing MS COCO descriptions.
3) Evaluation of a generation method and a simple joint-embedding method for targeted description generation.
4) Definition and evaluation of generation and joint-embedding methods on a new task: multiple-choice fill-in-the-blank question answering for images.

The rest of our paper is organized as follows. First, we review related work (Sec 2). Then, we describe our strategy for automatically generating fill-in-the-blank templates and introduce our Visual Madlibs dataset (Sec 3). Next we outline the multiple-choice question answering and targeted generation tasks (Sec 4) and provide several analyses of our dataset (Sec 5). Finally, we provide experiments evaluating description generation and joint-embedding methods on the proposed tasks (Sec 6) and conclude (Sec 7).

Figure 2. Example Visual Madlibs fill-in-the-blank descriptions.

2. Related work

Description Generation: Recently, there has been an explosion of interest in methods for producing natural language descriptions for images or video. Early work in this area focused on detecting content elements and then composing captions [20, 36, 28, 11, 18], or made use of existing text either directly associated with an image [12, 1] or retrieved from visually similar images [29, 21, 26]. With the advancement of deep learning for content estimation, there have been many exciting recent attempts to generate image descriptions using neural network based approaches. Some methods first detect words or phrases using Convolutional Neural Network (CNN) features, then generate and re-rank candidate sentences [10, 22]. Other approaches take a more end-to-end approach to generate output descriptions directly from images [17, 35, 16, 6]. These new methods have shown great promise for image description generation, under some measures (e.g. BLEU-1) achieving near-human performance levels.

Description Datasets: Along with the development of image captioning algorithms there have been a number of datasets collected for this task. One of the first datasets collected for this problem was the UIUC Pascal Sentence data set [11], which contains 1,000 images with 5 sentences per image written by workers on Amazon Mechanical Turk. Based on this, PASCAL-50s [33] further collected 50 sentences per image. As the description problem gained popularity, larger and richer datasets were collected, including
the Flickr8K [30] and Flickr30K [37] datasets. In an alternative approach, the SBU Captioned Photo dataset [29] contains 1 million images with existing captions collected from Flickr; the text tends to contain more contextual information, since the captions were written by the photo owners. Most recently, Microsoft released the MS COCO [23] dataset, containing 120,000 images depicting 80 common object classes, with object segmentations and 5 turker-written descriptions per image. We make use of MS COCO, extending the types of descriptions associated with images.

Question-answering: Natural language question-answering has been a long-standing goal of NLP, with commercial companies like Ask Jeeves or Google playing a significant role in developing effective methods. Recently, embedding and deep learning methods have shown great promise [32, 3, 4]. Lin et al. [24] take an interesting multi-modal approach to question-answering. A multiple-choice text-based question is first constructed from 3 sentences written about an image; 2 of the sentences are used as the question, and 1 is used as the positive answer, mixed with several negative answers from sentences written about other images. The authors develop ranking methods to answer these questions and show that generating abstract images for each potential answer can improve results. Note that here the algorithms are not provided with an image as part of the question. Some recent work has started to look at the problem of question-answering for images. Malinowski et al. [25] introduced two scene-based QA datasets and combine computer vision and NLP in a Bayesian framework. DAQUAR is made by collecting human questions and answers, and SynthQA is automatically generated based on object segmentations and question templates. Geman et al. [13] design a visual Turing test to evaluate image understanding using a series of binary questions about image content. We design question-answering tasks that are somewhat broader in scope than the previous works, allowing us to ask a variety of different types of natural language questions about images.

3. Designing and collecting Visual Madlibs

The goal of Visual Madlibs is to study targeted natural language descriptions of image content that go beyond generic descriptions of the whole image. The experiments in this paper begin with a dataset of images where the presence of some objects has already been labeled. The prompts for the questions are automatically generated based on image content, in a manner designed to elicit more detailed descriptions of the objects, their interactions, and the broader context of the scene shown in each image.

Visual Madlibs: Image+Instruction+Prompts+Blank
A single fill-in-the-blank question consists of a prompt and a blank, e.g., Person A is [blank] the car. The implicit question is, "What goes in the blank?" This is presented to a person along with an image and instructions, e.g., Describe the relationship between the indicated person and object. The same image and prompt may be used with different instructions to collect a variety of description types.

Instantiating Questions
While the general forms of the questions for the Visual Madlibs were chosen by hand (see Table 1), most of the questions are instantiated depending on a subset of the objects present in an image. For instance, if an image contained two people and a dog, questions about each person (question types 9-11 in Table 1), the dog (types 6-8), and relationships between the two people and the dog (type 12) could be instantiated. For each possible instantiation, the wording of the questions is automatically altered slightly to maintain grammatical consistency. In addition to these types of questions, other questions (types 1-5) can be instantiated for an image regardless of the objects present.

Notice in particular the questions about the temporal context – what might have happened before or what might happen after the image was taken. People can make inferences beyond the specific content depicted in an image. Sometimes these inferences will be consistent between people (e.g., when what will happen next is obvious), and other times these descriptions may be less consistent. We can use the variability of the returned responses to select images for which these inferences are reliable.

Asking questions about every object and all pairs of objects quickly becomes unwieldy as the number of objects increases. To combat this, we choose a subset of the objects present to use in instantiating questions. Such selection could be driven by a number of factors. The experiments in this paper consider comparisons to existing, general descriptions of images, so we instantiate questions about the objects mentioned in those existing natural language descriptions, an indication of the objects' importance [2].

3.1. Data Collection

To collect the Visual Madlibs Dataset we use a subset of 10,738 human-centric images from MS COCO that make up about a quarter of the validation data [23], and instantiate fill-in-the-blank templates as described above. The MS COCO images are annotated with a list of objects present in the images, segmentations for the locations of those objects, and 5 general natural language descriptions of the image. To select the subset of images for collecting Madlibs, we start with the 19,338 images with a person labeled. We then look at the five descriptions for each and perform a dependency parse [8], only keeping those images where a word referring to person is the head noun of the parse. This leaves 14,150 images. We then filter out the images whose descriptions do not include a synonym for any of the 79 non-person object categories labeled in the MS COCO dataset. This leaves 10,738 human-centric images with at least one other object from the MS COCO data set mentioned in the
Type | Instruction | Prompt | #words
1. image's scene | Describe the type of scene/place shown in this picture. | The place is a(n) ___. | 4+1.45
2. image's emotion | Describe the emotional content of this picture. | When I look at this picture, I feel ___. | 8+1.14
3. image's interesting | Describe the most interesting or unusual aspect of this picture. | The most interesting aspect of this picture is ___. | 8+3.14
4. image's past | Describe what happened immediately before this picture was taken. | One or two seconds before this picture was taken, ___. | 9+5.45
5. image's future | Describe what happened immediately after this picture was taken. | One or two seconds after this picture was taken, ___. | 9+5.04
6. object's attribute | Describe the appearance of the indicated object. | The object(s) is/are ___. | 3.20+1.62
7. object's affordance | Describe the function of the indicated object. | People could ___ the object(s). | 4.20+1.74
8. object's position | Describe the position of the indicated object. | The object(s) is/are ___. | 3.20+3.35
9. person's attribute | Describe the appearance of the indicated person/people. | The person/people is/are ___. | 3+2.52
10. person's activity | Describe the activity of the indicated person/people. | The person/people is/are ___. | 3+2.47
11. person's location | Describe the location of the indicated person/people. | The person/people is/are ___. | 3.20+3.04
12. pair's relationship | Describe the relationship between the indicated person and object. | The person/people is/are ___ the object(s). | 5.20+1.65
Table 1. All 12 types of Madlibs instructions and prompts. The right-most column shows the average number of words for each description (#words for prompt + #words for answer).
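As a concrete illustration, the instantiation step for the object-centric templates (types 6-8 in Table 1) might look like the following sketch. The function names and the is/are agreement logic are our illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of instantiating the object-centric fill-in-the-blank
# prompts (Table 1, types 6-8) for the objects labeled in an image.

def fix_agreement(template, objects):
    """Substitute object names and adjust is/are for one or many objects."""
    many = len(objects) > 1
    noun = " and ".join(objects)
    return (template
            .replace("object(s)", noun)
            .replace("is/are", "are" if many else "is"))

def instantiate_object_prompts(objects):
    """Yield (instruction, prompt) pairs for each labeled object."""
    templates = [
        ("Describe the appearance of the indicated object.",
         "The object(s) is/are ___."),        # type 6: attribute
        ("Describe the function of the indicated object.",
         "People could ___ the object(s)."),  # type 7: affordance
        ("Describe the position of the indicated object.",
         "The object(s) is/are ___."),        # type 8: position
    ]
    for obj in objects:
        for instruction, prompt in templates:
            yield instruction, fix_agreement(prompt, [obj])

for instruction, prompt in instantiate_object_prompts(["dog"]):
    print(prompt)
```

A real instantiation would also have to handle person/people prompts (types 9-12) and richer grammatical rewording, as the paper notes.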
general image descriptions. Before final instantiation of the fill-in-the-blank templates, we need to resolve a potential ambiguity regarding which objects are referred to in the descriptions. We would like to collect Madlibs for objects described in the MS COCO captions, but since correspondences between the segmented objects and the description mentions are not available, we first try to automatically estimate this assignment by parsing the descriptions. We consider two possible cases: 1) there are fewer annotated instances than the sentences describe, and 2) there are more annotated instances than the sentences describe. The first case is easy to address: we construct templates for all of the labeled instances. For the second case, we sort the segmented instances by area and pick the largest ones, up to the parsed number, for instantiation. Using this procedure, we obtain 26,148 labeled object or person instances in 10,738 images.

Each Visual Madlib is answered by 3 workers on Amazon Mechanical Turk. To date, we have collected 360,001 answers to Madlibs questions and are continuing collection to include the training portion of the MS COCO dataset.

4. Tasks: Multiple-choice question answering and targeted generation

We design two tasks to evaluate targeted natural language description for images. The first task is to automatically generate natural language descriptions of images to fill in the blank for one of the Madlibs questions. The input to this task is an image, instructions, and a Madlibs prompt. As has been discussed in the community working on description generation for images, it can be difficult to evaluate free-form generation [33]. Our second task tries to address this issue by developing a new targeted multiple-choice question answering task for images. Here the input is again an image, instructions, and a prompt, but instead of a free-form text answer, there is a fixed set of multiple-choice answers to fill in the blank. The possible multiple-choice answers are sampled from the Madlibs responses: one that was written for the particular image/instruction/prompt as the correct answer, and distractors chosen from either similar images or random images, depending on the level of difficulty desired. This ability to choose distractors to adjust the difficulty of the question, as well as the relative ease of evaluating multiple-choice answers, are attractive aspects of this new task.

In our experiments we randomly select 20% of the 10,738 images to use as our test set for evaluating these tasks. For the multiple-choice questions we form two sets of answers for each, with one set designed to be more difficult than the other. We first establish the easy-task distractor answers by randomly choosing three descriptions (of the same question type) from other images [24]. The hard task is designed more delicately. Instead of randomly choosing from the other images, we now only look for those containing the same objects as our question image, and then arbitrarily pick three of their descriptions. Sometimes the descriptions sampled from "similar" images could also be good answers for our questions (later we experiment with using Turkers to select less ambiguous multiple-choice questions from this set). For the targeted generation task, for question types 1-5, algorithms generate descriptions given the image, instructions, and prompt. For the other question types, whose prompts are related to some specific person or object, we additionally provide the algorithm with the location of each person/object mentioned in the prompt. We also experiment with estimating these locations using object detectors.

5. Analyzing the Visual Madlibs Dataset

We begin by conducting quantitative analyses of the responses collected in the Visual Madlibs Dataset in Sec. 5.1. A main goal is understanding what additional information is provided by the targeted descriptions in the Visual Madlibs Dataset vs. general image descriptions. Therefore, we also provide analyses comparing Visual Madlibs to MS COCO descriptions collected for the same images in Sec. 5.2.

5.1. Quantifying Visual Madlibs responses

We analyze the length, structure, and consistency of the Visual Madlibs responses. First, the average length of each type of description is shown in the far right column of Table 1. Note that descriptions of people tend to be longer than descriptions of other objects in the dataset.
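The per-type length statistic in the right-most column of Table 1 reduces to a simple word-count average over the collected answers; a minimal sketch, with invented sample answers:

```python
# Toy sketch of the per-question-type length statistic from Table 1:
# the average word count of the collected fill-in-the-blank answers.
# The sample answers below are invented for illustration.

answers_by_type = {
    "person's activity": ["riding a bike", "walking", "throwing a frisbee"],
    "object's attribute": ["red", "small and round"],
}

def mean_answer_length(answers):
    """Average number of whitespace-separated words per answer."""
    return sum(len(a.split()) for a in answers) / len(answers)

for qtype, answers in answers_by_type.items():
    print(qtype, round(mean_answer_length(answers), 2))
```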
Second, we use phrase chunking [7] to analyze which phrasal structures are commonly used to fill in the blanks for different questions. Fig. 3, top row, shows relative frequencies for the top-5 most frequent templates used for several question types. Object attributes are usually described briefly with a simple adjectival phrase. On the other hand, people use more words and a wider variety of structures to describe possible future events. Except for future and past descriptions, the distribution of structures is generally concentrated on a few likely choices for each question type.

Third, we analyze how consistent the Mechanical Turk workers' answers are for each type of question. To compute a measure of similarity between a pair of responses we use the cosine similarity between representations of each response. A response is represented by the mean of the Word2Vec [27] vectors for each word in the response, following [24, 22]. Word2Vec is a 300-dimensional embedding representation for words that encodes the distributional context of words learned over very large word corpora. This measure takes into account the actual words used in a response, as opposed to the previous analyses of parse structure. Each Visual Madlibs question is answered by three workers, providing 3 pairs for which similarity is computed. Fig. 3, bottom row, shows a histogram of all pairwise similarities for several question types. Generally the similarities have a normal-like distribution with an extra peak around 1 indicating the fraction of responses that agree almost perfectly. Once again, descriptions of the future and past are least likely to be (near) identical, while object attributes and affordances are often very consistent.

[Figure 3 here: top-5 phrase-template frequencies and answer-similarity histograms for image's future, object's attribute, object's affordance, and person's activity.]
Figure 3. First row shows the top-5 most frequent phrase templates for image's future, object's attribute, object's affordance and person's activity. Second row shows the histograms of similarity between answers. (We put the plots for all 12 types in the supplementary file.)

5.2. Visual Madlibs vs. general descriptions

We compare the targeted descriptions in the Visual Madlibs Dataset to the general image descriptions in MS COCO. First, we analyze the words used in Visual Madlibs compared to MS COCO descriptions of the same images. For each image, we extract the unique set of words from all descriptions of that image from both datasets, and compute the coverage of each set with respect to the other. We find that on average (across images) 22.45% of the Madlibs words are also present in the MS COCO descriptions, while 52.38% of the MS COCO words are also present in Madlibs. We also compute the vocabulary size of Madlibs, which is 12,329, compared with MS COCO's 9,683 on the same image set.

Second, we compare how Madlibs and MS COCO answers describe the people and objects in images. We observe that the Madlibs question types (Table 1) cover much of the information in MS COCO descriptions [22]. As one way to see this, we run the StanfordNLP parser on both datasets [5]. For attributes of people, we use the parsing template shown in Fig. 4(a) to analyze the structures being used. The refer name indicates whether the person was mentioned in the description. Note that the Madlibs descriptions always have one reference to a person in the prompt (The person is [blank].). Therefore, for Madlibs, we report the presence of additional references to the person (e.g., the person is a man). The general attribute directly describes the appearance of the person or object (e.g., old or small); the affiliate object indicates whether additional objects are used to describe the targeted person (e.g., with a bag, coat, or glasses); and the affiliate attributes are appearance characteristics of those secondary objects (e.g., red coat). The templates for object's attribute and verbs are more straightforward, as shown in Fig. 4(b)(c). The table in Fig. 4 shows the frequency of each parse component. Overall, more of the potential descriptive elements in these constructions are used in response to the Madlibs prompts than in the general descriptions found in MS COCO.

Figure 4. Template used for parsing person's attributes, activity and interaction with object, and object's attribute. The percentages below compare how frequently these templates are used for description in Madlibs and MS COCO.

We also break down the overlap between Visual Madlibs and MS COCO descriptions over the different parsing templates for descriptions about people and objects (Fig. 5). Yellow bars show how often words for each parse type in MS COCO descriptions were also found in the same parse type in the Visual Madlibs answers, and green bars measure the reverse direction. The results indicate that Madlibs provides more coverage in its descriptions than MS COCO for all templates except for the person's refer name. One possible reason is that the prompts already indicate "the person" or "people" explicitly, so workers need not add an additional reference to the person in their descriptions.

Figure 5. Frequency with which a word in a position in the people and object parsing template in one dataset is in the same position for the other dataset.

Extrinsic comparison of Visual Madlibs Data and general descriptions: We perform an extrinsic analysis by using either: a) the MS COCO descriptions for an image, or b) Visual Madlibs responses from other Turkers for an image, to select answers for our multiple-choice evaluation task. Specifically, we use one of the human-provided descriptions, and select the multiple-choice answer that is most similar to that description. Similarity is measured as the cosine similarity between the mean Word2Vec vectors of a description's words and those of the multiple-choice answers. In addition to comparing how well the Madlibs or MS COCO descriptions can select the correct multiple-choice answer, we also use the descriptions automatically produced by a recent CNN+LSTM description generation system [35] (we use Karpathy's implementation: https://github.com/karpathy/neuraltalk) trained on the MS COCO dataset. This allows us to make one possible measurement of how close current automatically generated image descriptions are to our Madlibs descriptions. Fig. 6 shows the accuracies resulting from using Madlibs, MS COCO, or CNN+LSTM [35] to select the correct multiple-choice answer.

Figure 6. The accuracy of Madlibs, MS COCO and CNN+LSTM [35] (trained on MS COCO) used as references to answer the Madlibs hard multiple-choice questions.

Although this approach is quite simple, it allows us to make two interesting observations. First, Madlibs outperforms MS COCO on all types of multiple-choice questions. If Madlibs and MS COCO descriptions provided the same information, we would expect their performance to be comparable. Second, the automatically generated descriptions from the pre-trained CNN+LSTM perform much worse than the actual MS COCO descriptions, despite doing well on general image description generation.

6. Experiments

In this section we evaluate a series of methods on the targeted natural language generation and multiple-choice question answering tasks. As a language-only baseline, we compute the 4-gram perplexity of each sentence using Google-1T statistics (frequencies of all n-grams on the web). We also try simple joint-embedding methods – canonical correlation analysis (CCA) and normalized CCA (nCCA) [15] – as well as a recent deep-learning based method for image description generation, CNN+LSTM [35]. We train these models on 80% of the images in the Madlibs collection and evaluate their performance on the remaining 20%.

In our experiments we extract image features using the VGG Convolutional Neural Network (VGGNet) [31], trained on the ILSVRC-2012 dataset to recognize 1000 object classes. For comparison, we also extract image features using the Places-CNN, which is trained on the 205 scene categories of the Places Database [38] using AlexNet [19]. On the sentence side, we average the Word2Vec vectors of all words in a sentence to obtain a representation.

CCA finds a joint embedding between two multi-
Easy Task
Type | #Q | n-gram | CCA | nCCA | nCCA(place) | nCCA(bbox) | nCCA(all) | CNN+LSTM(madlibs) | CNN+LSTM(r)(madlibs) | Human
1. scene | 6277 | 24.8% | 75.7% | 86.8% | 85.4% | − | 87.6% | 74.2% | 77.6% | 93.2%
2. emotion | 5138 | 26.7% | 41.3% | 49.2% | 50.4% | − | 42.4% | 37.0% | 44.5% | 48.3%
3. past | 4903 | 24.3% | 61.8% | 77.5% | 72.6% | − | 80.3% | 50.1% | 47.3% | 93.5%
4. future | 4658 | 27.7% | 61.2% | 78.0% | 72.1% | − | 80.2% | 50.6% | 50.6% | 94.5%
5. interesting | 5095 | 24.2% | 66.8% | 76.5% | 72.0% | − | 78.9% | 55.4% | 49.9% | 94.7%
6. obj attr | 7194 | 30.6% | 44.1% | 47.5% | 44.7% | 54.7% | 50.9% | 46.9% | 59.0% | 88.9%
7. obj aff | 7326 | 30.1% | 59.8% | 73.0% | 69.6% | 72.2% | 76.7% | − | 88.9% | 93.1%
8. obj pos | 7290 | 28.0% | 53.0% | 65.9% | 64.2% | 58.9% | 69.7% | 53.9% | 69.6% | 91.4%
9. per attr | 6651 | 27.2% | 40.4% | 48.0% | 44.5% | 53.1% | 44.5% | 36.5% | 46.0% | 83.8%
10. per act | 6501 | 27.3% | 70.0% | 80.7% | 76.9% | 75.6% | 82.8% | 64.7% | 68.9% | 96.7%
11. per loc | 6580 | 24.4% | 69.8% | 82.7% | 82.6% | 73.8% | 82.7% | 60.8% | 71.6% | 92.2%
12. pair rel | 7595 | 29.2% | 54.3% | 63.0% | 61.3% | 64.2% | 67.2% | − | 72.3% | 91.7%

Hard Task
Type | #Q | n-gram | CCA | nCCA | nCCA(place) | nCCA(bbox) | nCCA(all) | CNN+LSTM(madlibs) | CNN+LSTM(r)(madlibs) | Human
1. scene | 6277 | 22.8% | 63.8% | 70.1% | 70.7% | − | 68.2% | 63.6% | 64.2% | 75.6%
2. emotion | 5138 | 25.1% | 33.9% | 37.2% | 38.3% | − | 33.2% | 34.6% | 37.6% | 38.4%
3. past | 4903 | 22.4% | 47.9% | 52.8% | 49.5% | − | 54.0% | 42.2% | 39.5% | 73.9%
4. future | 4658 | 24.4% | 47.5% | 54.3% | 50.5% | − | 53.3% | 41.1% | 39.5% | 75.1%
5. interesting | 5095 | 27.6% | 51.4% | 53.7% | 50.5% | − | 55.1% | 44.0% | 37.1% | 76.7%
6. obj attr | 7194 | 29.5% | 42.2% | 43.6% | 41.5% | 49.8% | 39.3% | 41.6% | 42.3% | 70.5%
7. obj aff | 7326 | 32.2% | 54.5% | 63.5% | 60.9% | 63.0% | 48.5% | − | 69.4% | 52.7%
8. obj pos | 7290 | 29.2% | 49.0% | 55.7% | 53.3% | 50.7% | 53.4% | 46.7% | 50.2% | 70.8%
9. per attr | 6651 | 23.3% | 33.9% | 38.6% | 35.5% | 46.1% | 31.6% | 35.5% | 42.4% | 70.5%
10. per act | 6501 | 24.0% | 59.7% | 65.4% | 62.6% | 65.1% | 66.6% | 57.3% | 53.7% | 85.1%
11. per loc | 6580 | 22.3% | 56.8% | 63.3% | 65.5% | 57.8% | 62.6% | 50.4% | 56.8% | 72.9%
12. pair rel | 7595 | 30.1% | 49.4% | 54.3% | 52.2% | 56.5% | 52.0% | − | 54.6% | 74.7%
Table 2. Accuracies computed for different approaches on the easy and hard multiple-choice answering tasks. CCA, nCCA, and CNN+LSTM are trained on the whole-image representation for each type of question. nCCA(place) uses the Places-CNN feature. nCCA(bbox) is trained and evaluated on ground-truth bounding boxes from MS COCO segmentations. nCCA(all) trains a single embedding using all question types. CNN+LSTM(r) ranks the perplexity of {prompt+choice}.
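The easy and hard answer sets evaluated in Table 2 are built as described in Sec. 4: easy distractors are sampled from random other images, hard distractors only from images that contain the same objects as the question image. A toy sketch of that construction (all data invented; treating "same objects" as object-set overlap is our assumption):

```python
# Toy sketch of the easy/hard distractor construction from Sec. 4.
# Easy: three answers of the same question type from random other images.
# Hard: sample only from other images sharing objects with the question image.
import random

def sample_distractors(question_img, pool, hard=False, k=3, seed=0):
    rng = random.Random(seed)
    candidates = [d for d in pool if d["img"] != question_img["img"]]
    if hard:
        candidates = [d for d in candidates
                      if set(d["objects"]) & set(question_img["objects"])]
    return [d["answer"] for d in rng.sample(candidates, k)]

pool = [
    {"img": 1, "objects": ["person", "frisbee"], "answer": "catching a frisbee"},
    {"img": 2, "objects": ["person", "frisbee"], "answer": "throwing a frisbee"},
    {"img": 3, "objects": ["person", "bike"],    "answer": "riding a bike"},
    {"img": 4, "objects": ["person", "frisbee"], "answer": "jumping for a frisbee"},
    {"img": 5, "objects": ["dog"],               "answer": "sleeping"},
]
question = {"img": 1, "objects": ["person", "frisbee"]}
print(sample_distractors(question, pool, hard=True))
```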
Figure 7. Some hard multiple-choice question examples, with results from nCCA. The first row shows correct choices; the second row shows incorrect choices. Corresponding human accuracies are provided for reference.
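The embedding-based scoring used for multiple choice — project the image into the joint space, embed each candidate answer, and pick the closest — can be sketched as follows, with toy 3-d vectors standing in for the learned CCA/nCCA projections:

```python
# Sketch of nearest-answer selection in a joint embedding space.
# Toy 3-d vectors stand in for the learned image/text projections.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def answer_multiple_choice(image_vec, choice_vecs):
    """Index of the candidate answer closest to the image in the joint space."""
    sims = [cosine(image_vec, c) for c in choice_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy example: projected image and four projected candidate answers.
image_vec = [0.7, 0.2, 0.1]
choices = [[0.1, 0.9, 0.0],    # distractor
           [0.68, 0.22, 0.1],  # correct answer (nearest)
           [0.0, 0.1, 0.9],    # distractor
           [0.3, 0.3, 0.3]]    # distractor
print(answer_multiple_choice(image_vec, choices))  # 1
```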
dimensional variables, in our case image and text vector representations. To increase the flexibility of feature selection and to improve computational efficiency, Gong et al. [15] proposed nCCA, a scalable approximation scheme of explicit kernel mapping followed by dimensionality reduction and linear CCA. In the projected latent space, similarity is measured by the eigenvalue-weighted normalized correlation. We train CCA and nCCA models for each question type separately using the training portion of the Visual Madlibs Dataset. These models allow us to map from an image representation, to the joint-embedding space, to vectors in the Word2Vec space, and vice versa. For targeted generation, we map an image to the joint-embedding space and then choose the answer from the training-set text that is
closest to this embedded point. To answer multiple-choice questions, we embed each multiple-choice answer and then select the answer whose embedding is closest.

Following recent description generation techniques [35, 16], we train a CNN+LSTM model for each question type. These models learn a mapping from an image and prompt to a sequence of words, e.g., "The chair is", and then let the CNN+LSTM system generate the remaining words of the description. For the multiple-choice task, we evaluate two ways to select an answer. The first method selects the answer with the largest cosine Word2Vec similarity to the generated description. The second method ranks the prompt+choices by perplexity and selects the best one.

Filtered Questions from Hard Task
Type             #Q    nCCA   nCCA(place)  nCCA(bbox)  nCCA(all)  CNN+LSTM(r) (madlibs)
1. scene         4940  77.6%  77.8%        −           76.3%      69.7%
2. emotion       2052  49.0%  49.5%        −           43.8%      43.0%
3. past          3976  57.4%  53.8%        −           59.4%      41.3%
4. future        3820  59.2%  54.2%        −           58.3%      41.7%
5. interesting   4159  59.5%  55.1%        −           61.3%      40.3%
6. obj attr      5436  47.2%  44.7%        54.6%       42.8%      46.3%
7. obj aff       4581  71.0%  67.6%        70.5%       57.6%      79.0%
8. obj pos       5721  60.2%  57.7%        54.6%       57.7%      54.3%
9. per attr      4893  42.4%  38.8%        52.1%       34.4%      46.4%
10. per act      5813  68.3%  65.3%        67.9%       69.6%      55.3%
11. per loc      5096  69.9%  71.7%        62.6%       70.0%      60.6%
12. pair rel     5981  57.6%  55.4%        60.0%       56.5%      57.4%
Table 3. Accuracies for different approaches on the filtered questions from the hard task. The filtered questions are those with human accuracies higher than 0.6. Full tables for the filtered easy and hard tasks are in the supplementary file.
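As an illustration of the nearest-embedding selection step, the following is a minimal NumPy sketch, not the authors' code: `image_embedding` and `choice_embeddings` are hypothetical names standing in for vectors already projected into the joint embedding space.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_multiple_choice(image_embedding, choice_embeddings):
    """Pick the choice whose embedding is most similar to the
    image's point in the joint embedding space."""
    scores = [cosine_similarity(image_embedding, c) for c in choice_embeddings]
    return int(np.argmax(scores))

# Toy example with made-up 4-d embeddings.
img = np.array([1.0, 0.0, 0.5, 0.0])
choices = [np.array([0.0, 1.0, 0.0, 1.0]),
           np.array([1.0, 0.1, 0.4, 0.0]),   # nearly the same direction as img
           np.array([-1.0, 0.0, -0.5, 0.0])]
print(answer_multiple_choice(img, choices))  # prints 1
```

The same selection rule applies whether the candidate vectors come from the joint embedding directly (nCCA) or from Word2Vec similarity against a generated description (CNN+LSTM).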
6.1. Discussion of results

Table 2 shows accuracies of each algorithm on the easy and hard versions of the multiple-choice task³ and Fig. 7 shows example correct and incorrect answer choices. There are several interesting observations we can make. First, from the results of the language-only n-gram baseline, we conclude that answering Madlibs questions strongly requires visual information. Second, training nCCA on all types of questions together, nCCA(all), is helpful for the easy variant of the task, but less useful on the more fine-grained hard version of the task. Third, extracting visual features from the bounding box of the relevant person/object yields higher accuracy for predicting attributes, but not for other questions. Based on this finding, we evaluate answering the attribute questions using automatic detection methods. The detectors are trained on ImageNet using R-CNN [14], covering 42 MS COCO categories. We observe similar performance between ground-truth and detected bounding boxes in Table 4. Fourth, we observe that the Places-CNN helps answer questions related to the image's scene, a person's location, and the image's emotion.

                         Easy Task                        Hard Task
Type         #Q    nCCA   nCCA(bbox)  nCCA(dbox)   nCCA   nCCA(bbox)  nCCA(dbox)
6. obj attr  2021  47.6%  53.6%       51.4%        43.9%  47.9%       45.2%
9. per attr  4206  50.2%  55.4%       51.2%        40.0%  47.0%       43.3%
Table 4. Multiple-choice answering using automatic detection for 42 object/person categories. "bbox" denotes ground-truth bounding box and "dbox" denotes detected bounding box.

As an additional experiment, we ask 5 people to answer each multiple-choice question. The last column of Table 2 shows human accuracy as a reference. We further use human agreement to select a subset of the multiple-choice questions where at least 3 Turkers choose the correct answer. Results of the methods on this question subset are shown in Table 3, displaying similar patterns as the unfiltered set, with slightly higher accuracy.

Finally, Table 5 shows BLEU-1 and BLEU-2 scores for targeted generation. Although the CNN+LSTM models we trained on Madlibs were not quite as accurate as nCCA for selecting the correct multiple-choice answer, they did result in better, sometimes much better, accuracy (as measured by BLEU scores) for targeted generation.

                          BLEU-1                            BLEU-2
Type            nCCA  nCCA(bbox)  CNN+LSTM (madlibs)  nCCA  nCCA(bbox)  CNN+LSTM (madlibs)
1. scene        0.52  −           0.62                0.17  −           0.19
2. emotion      0.17  −           0.38                0.00  −           0.00
3. future       0.38  −           0.39                0.12  −           0.13
4. past         0.39  −           0.42                0.12  −           0.12
5. interesting  0.49  −           0.65                0.14  −           0.22
6. obj attr     0.28  0.36        0.48                0.02  0.02        0.01
7. obj aff      0.56  0.60        −                   0.10  0.11        −
8. obj pos      0.53  0.55        0.71                0.24  0.25        0.49
9. per attr     0.26  0.29        0.57                0.06  0.07        0.25
10. per act     0.47  0.41        0.53                0.14  0.11        0.20
11. per loc     0.52  0.46        0.63                0.22  0.19        0.39
12. pair rel    0.46  0.48        −                   0.07  0.08        −
Table 5. BLEU-1 and BLEU-2 computed on the Madlibs testing dataset for different approaches.

7. Conclusions

We have introduced a new fill-in-the-blank strategy for collecting targeted natural language descriptions. Our analyses show that these descriptions are usually more detailed than generic whole-image descriptions. We also introduce a targeted natural language description generation task and a multiple-choice question answering task, then train and evaluate joint-embedding and generation models. Data produced by this paper will be publicly released.

Acknowledgements: We thank the vision and language communities for feedback, especially J. Hockenmaier, K. Saenko, and J. Corso. This research is supported by NSF Awards #1417991, 1405822, 144234, 1452851, and Microsoft Research.
³The missing entries for questions 7 and 12 are due to priming not being valid for questions with blanks in the middle of the sentence.

References
[1] A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
[2] A. C. Berg, T. L. Berg, H. Daumé III, J. Dodge, A. Goyal, X. Han, A. Mensch, M. Mitchell, A. Sood, K. Stratos, and K. Yamaguchi. Understanding and predicting importance in images. In CVPR, 2012.
[3] A. Bordes, J. Weston, and S. Chopra. Question answering with subgraph embeddings. In EMNLP, 2014.
[4] A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In ECML PKDD, 2014.
[5] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, 2014.
[6] X. Chen and C. L. Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, 2015.
[7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 2011.
[8] M.-C. De Marneffe, B. MacCartney, C. D. Manning, et al. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, 2006.
[9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
[10] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, C. Lawrence Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[11] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[12] Y. Feng and M. Lapata. Topic models for image annotation and text illustration. In ACL, 2010.
[13] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 2015.
[14] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[15] Y. Gong, Q. Ke, M. Isard, and S. Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV, 2014.
[16] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
[17] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. In TACL, 2015.
[18] N. Krishnamoorthy, G. Malkarnenkar, R. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. NAACL HLT 2013, page 10, 2013.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[20] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.
[21] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[22] R. Lebret, P. O. Pinheiro, and R. Collobert. Phrase-based captioning. In ICML, 2015.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[24] X. Lin and D. Parikh. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In CVPR, 2015.
[25] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.
[26] R. Mason. Domain-independent captioning of domain-specific images. In NAACL-HLT, 2013.
[27] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[28] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[29] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[30] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier. Collecting image annotations using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[32] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. Weakly supervised memory networks. arXiv preprint arXiv:1503.08895, 2015.
[33] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
[34] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In NAACL-HLT, 2015.
[35] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[36] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
[37] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014.
[38] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.
Visualizing and Understanding Convolutional Networks
1.1. Related Work

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit's activation. This requires a careful initialization and does not give any information about the unit's invariances. Motivated by the latter's shortcoming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers the invariances are extremely complex, so they are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

2. Approach

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image x_i, via a series of layers, to a probability vector ŷ_i over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of N labeled images {x, y}, where label y_i is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ŷ_i and y_i. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.

2.1. Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps, thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To
invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.

3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky's layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^−2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^−2 and biases are set to 0.

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^−1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).

4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than the visualizations, as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.
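To make the unpool-rectify-filter procedure of Section 2.1 concrete, here is a minimal single-channel NumPy sketch. It is an illustrative toy (non-overlapping pooling, a naive zero-padded "same" filtering step), not the authors' implementation:

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Non-overlapping k x k max pooling that also records the
    location of each maximum (the 'switch' variables)."""
    ph, pw = x.shape[0] // k, x.shape[1] // k
    pooled = np.zeros((ph, pw))
    switches = np.zeros((ph, pw, 2), dtype=int)
    for i in range(ph):
        for j in range(pw):
            block = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i, j] = block[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Approximate inverse of max pooling: place each pooled value
    back at its recorded max location; all other entries stay zero."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out

def conv2d_same(x, f):
    """Naive zero-padded 'same' cross-correlation, enough for the toy."""
    fh, fw = f.shape
    xp = np.pad(x, ((fh // 2, fh // 2), (fw // 2, fw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i+fh, j:j+fw] * f)
    return out

def deconv_step(activation, switches, shape, filt):
    """One deconvnet stage: (i) unpool with switches, (ii) rectify,
    (iii) filter with the vertically and horizontally flipped filter."""
    u = unpool(activation, switches, shape)
    r = np.maximum(u, 0)                 # relu rectification
    return conv2d_same(r, filt[::-1, ::-1])
```

Because the switches come from a specific forward pass, the reconstruction produced by `deconv_step` is tied to that particular input image, which is exactly why the projections resemble pieces of the original input.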
Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, col 1). Best viewed in electronic form.
The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1,C1); bird's legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).

4.1. Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e). More importantly, it also improves the classification performance, as shown in Section 5.1.

4.2. Occlusion Sensitivity

With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

4.3. Correspondence Analysis

Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image i, we then compute: ε_i^l = x_i^l − x̃_i^l, where x_i^l and x̃_i^l are the feature vectors at layer l for the original and occluded images respectively. We then measure the consistency of this difference vector between all related image pairs (i, j): Δ_l = Σ_{i,j=1, i≠j}^{5} H(sign(ε_i^l), sign(ε_j^l)), where H is the Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the Δ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer l = 5 and l = 7. The lower score for these parts, relative to random object regions, for the layer 5 features shows the model does establish some degree of correspondence.
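The consistency measure Δ_l of Section 4.3 is straightforward to compute. The following NumPy sketch uses made-up feature vectors for illustration; `feats_orig` and `feats_occluded` are hypothetical names for per-image feature vectors at some layer:

```python
import numpy as np

def hamming(a, b):
    # Hamming distance: number of positions where two sign patterns differ.
    return int(np.sum(a != b))

def delta_consistency(feats_orig, feats_occluded):
    """Delta_l = sum over image pairs (i, j), i != j, of
    H(sign(eps_i), sign(eps_j)), with eps_i = x_i - x_tilde_i."""
    eps = [np.sign(x - xt) for x, xt in zip(feats_orig, feats_occluded)]
    total = 0
    for i in range(len(eps)):
        for j in range(len(eps)):
            if i != j:
                total += hamming(eps[i], eps[j])
    return total

# Two images whose occlusion shifts features in the same directions
# (perfect consistency -> Delta = 0), versus opposite directions.
orig = [np.array([1.0, 2.0]), np.array([3.0, 1.0])]
consistent = [np.array([0.5, 2.5]), np.array([2.0, 1.5])]
inconsistent = [np.array([0.5, 2.5]), np.array([4.0, 0.5])]
print(delta_consistency(orig, consistent))    # prints 0
print(delta_consistency(orig, inconsistent))  # prints 4
```

A low Δ means the masking perturbs every image's features in the same signed directions, i.e. the representation responds consistently to the same object part across images.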
Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as
the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y.
The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within
3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature
maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from
the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax
function, C being the number of classes. All filters and feature maps are square in shape.
Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed
in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64].
The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to
pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic
form.
[Figure 5 plots: x-axes Vertical Translation (Pixels), Scale (Ratio), Rotation (Degrees); y-axes Canonical Distance and P(true class); legend classes: Lawn Mower, Shih-Tzu, African Crocodile, African Grey, Entertainment Center.]
Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5 example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the image is transformed.
Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer "dead" features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, without the aliasing artifacts that are visible in (d).
(c) Layer 5, strongest (d) Classifier, probability (e) Classifier, most
(a) Input Image (b) Layer 5, strongest feature map feature map projections of correct class probable class
Pomeranian
0.9 Tennis ball
Keeshond
0.8 Pekinese
0.7
0.6
0.5
0.4
0.3
0.2
Car wheel
Racer
0.25 Cab
Police van
0.2
0.15
0.1
0.05
0.2
0.1
True Label: Afghan Hound
Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st
column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) change. (b): for each
position of the gray square, we record the total activation in one layer 5 feature map (the one with the strongest response
in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square),
along with visualizations of this map from other images. The first row example shows the strongest feature to be the
dog's face. When this is covered up, the activity in the feature map decreases (blue area in (b)). (d): a map of correct
class probability, as a function of the position of the gray square. E.g. when the dog's face is obscured, the probability
for "pomeranian" drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row,
for most locations it is "pomeranian", but if the dog's face is obscured but not the ball, then it predicts "tennis ball". In
the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The
3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive
to the dog (blue region in (d)), since it uses multiple feature maps.
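The occlusion experiment of Figure 7 can be sketched as a simple sliding-window loop. In this sketch, `model` is a hypothetical stand-in for the trained network's correct-class probability, not the paper's actual classifier:

```python
def occlusion_map(image, model, square=3, gray=0.5, stride=1):
    """Slide a gray square over the image and record the model's score at
    each occluder position, producing a sensitivity map like panel (d)."""
    h, w = len(image), len(image[0])
    heat = []
    for top in range(0, h - square + 1, stride):
        row = []
        for left in range(0, w - square + 1, stride):
            occluded = [r[:] for r in image]       # copy the image
            for i in range(top, top + square):     # paint the gray square
                for j in range(left, left + square):
                    occluded[i][j] = gray
            row.append(model(occluded))            # score with square in place
        heat.append(row)
    return heat
```

Positions where the score drops sharply (e.g. over the dog's face) mark the evidence the classifier actually relies on.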
Error %                           Train Top-1   Val Top-1   Val Top-5
Removed layers 6,7                       27.4        44.8        22.4
Removed layers 3,4,6,7                   71.1        71.3        50.1
Adjust layers 6,7: 2048 units            40.3        41.7        18.8
Adjust layers 6,7: 8192 units            26.8        40.0        18.1
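The layer-removal experiments above amount to slicing stages off the network and re-evaluating. A toy sketch of the mechanics; the lambda "layers" are stand-ins for real network stages, not the paper's model:

```python
def forward(x, layers):
    """Run input x through a list of layer functions in order."""
    for layer in layers:
        x = layer(x)
    return x

# Toy stand-ins for network stages; ablation is just slicing the list.
layers = [lambda x: x + 1,   # stand-in for an early conv stage
          lambda x: x * 2,   # stand-in for a later conv stage
          lambda x: x - 3,   # stand-in for fc layer 6
          lambda x: x * x]   # stand-in for fc layer 7

full = forward(1, layers)          # ((1 + 1) * 2 - 3) ** 2 = 1
ablated = forward(1, layers[:-2])  # "layers 6,7 removed": (1 + 1) * 2 = 4
```

The table shows that removing the two fully connected layers alone hurts only modestly, while also removing two middle conv layers is catastrophic.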
Ciresan, D. C., Meier, J., and Schmidhuber, J. Multi-column deep neural networks for image classification. In CVPR, 2012.

Dalal, N. and Triggs, B. Histograms of oriented gradients for pedestrian detection. In CVPR, 2005.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531, 2013.

Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of a deep network. Technical report, University of Montreal, 2009.

Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. PAMI, 2006.

Griffin, G., Holub, A., and Perona, P. The Caltech 256. Caltech Technical Report, 2006.

Gunji, N., Higuchi, T., Yasumoto, K., Muraoka, H., Ushiku, Y., Harada, T., and Kuniyoshi, Y. Classification entry. ImageNet Competition, 2012.

Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.

Sohn, K., Jung, D., Lee, H., and Hero III, A. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.

Torralba, A. and Efros, A. A. Unbiased look at dataset bias. In CVPR, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, pp. 1096–1103, 2008.

Yan, S., Dong, J., Chen, Q., Song, Z., Pan, Y., Xia, W., Huang, Z., Hua, Y., and Shen, S. Generalized hierarchical matching for sub-category aware object classification. In PASCAL VOC Classification Challenge 2012, 2012.

Zeiler, M., Taylor, G., and Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.