
Chapter 6

Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners.
Feedforward networks are typically represented by composing together many different functions, forming a chain such as f(x) = f^(3)(f^(2)(f^(1)(x))). The length of the chain gives the depth of the model. It is from this terminology that the name "deep learning" arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. However, modern neural network research is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain.
One way to understand feedforward networks is to begin with linear models and consider how to extend them to represent nonlinear functions of x. We can apply the linear model not to x itself, but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick, described in section 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.

1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation.
The feedforward networks described in this chapter define deterministic mappings from x to y that lack feedback connections. Other models presented later will apply these principles to learning stochastic mappings, to learning functions with feedback, and to learning probability distributions over a single vector.
We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.

6.1 Example: Learning XOR


To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1; otherwise, it returns 0.
We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, [1, 1]⊤}, and we will train the network on all four of them. We can treat this problem as regression and use a mean squared error (MSE) loss function, chosen to simplify the math for this example; in application settings, MSE is usually not an appropriate cost function for modeling binary data. More appropriate approaches are described in section 6.2.2.2.
Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))².

Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = x⊤w + b.

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Figure 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. See figure 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together, h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).
[Figure 6.1: two panels, the original x space (axes x1 and x2) and the learned h space (axes h1 and h2), with the training points labeled 0 and 1.]
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2, and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space.
[Figure 6.2: the XOR network drawn in two styles; left, one node per unit (x1, x2 → h1, h2 → y); right, one node per layer vector (x → h → y).]

Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous, but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.

In the linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}.
[Figure 6.3: plot of the rectified linear activation function g(z) = max{0, z} against z.]

Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components.
We can now specify a solution to the XOR problem. Let

W =
[1 1]
[1 1],

c = [0, −1]⊤, w = [1, −2]⊤, and b = 0.
We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

X =
[0 0]
[0 1]
[1 0]
[1 1].

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

XW =
[0 0]
[1 1]
[1 1]
[2 2].

Next, we add the bias vector c, to obtain

[0 −1]
[1  0]
[1  0]
[2  1].

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

[0 0]
[1 0]
[1 0]
[2 1].

Finally, we multiply by the weight vector w to obtain the outputs [0, 1, 1, 0]⊤.

The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
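As a concrete check, here is a minimal NumPy sketch (ours, not the book's) that implements the forward pass just described and verifies that it reproduces XOR on all four inputs:

import numpy as np

# Design matrix: all four points in binary input space, one example per row.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# Parameters of the solution described in the text.
W = np.array([[1., 1.], [1., 1.]])   # first-layer weights
c = np.array([0., -1.])              # first-layer biases
w = np.array([1., -2.])              # output-layer weights
b = 0.                               # output-layer bias

h = np.maximum(0, X @ W + c)         # hidden layer: ReLU of an affine map
y = h @ w + b                        # output layer: linear regression on h

print(y)  # [0. 1. 1. 0.] -- the XOR function on each row of X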

6.2 Gradient-Based Learning


Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In section 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.
The iterative gradient-based optimizers used for neural networks, more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in section 5.9.
We can, of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Section 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.

6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.
Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

J(θ) = (1/2) E_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const,
up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.
An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.
A model that can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) may be able to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x. For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y.

If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. For the moment, it is only necessary to understand that calculus of variations may be used to derive results such as the following.
Different cost functions give different statistics. A second result derived using calculus of variations is that

f* = arg min_f E_{x,y∼p_data} ‖y − f(x)‖₁

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).

6.2.2 Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in section 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

P(y = 1 | x) = max{0, min{1, w⊤h + b}}.
Instead, it is better to use a sigmoid output unit combined with maximum likelihood. A sigmoid output unit is defined by ŷ = σ(w⊤h + b), where σ is the logistic sigmoid function described in section 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = w⊤h + b. Next, it uses the sigmoid activation function to convert z into a probability.

We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

log P̃(y) = yz,
P̃(y) = exp(yz),
P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z),
P(y) = σ((2y − 1)z).

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.

This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress.
Written in terms of the softplus function ζ, the negative log-likelihood is ζ((1 − 2y)z); it saturates only when (1 − 2y)z is very negative, that is, only when the model already has the right answer. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|, and its derivative with respect to z asymptotes to sign(z). So, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act quickly to correct a mistaken z.
When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.

Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.
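A minimal NumPy sketch (ours, not the book's) of this advice: compute the Bernoulli negative log-likelihood directly from the logit z via the softplus identity −log σ((2y − 1)z) = ζ((1 − 2y)z), which stays finite even where σ(z) would underflow:

import numpy as np

def softplus(a):
    # Numerically stable softplus: log(1 + exp(a)) = max(a, 0) + log1p(exp(-|a|)).
    return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

def bernoulli_nll_from_logit(y, z):
    # Negative log-likelihood -log P(y | z) = softplus((1 - 2y) * z),
    # written as a function of the logit z rather than of sigmoid(z).
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-40.0, 0.0, 40.0])
y = np.array([1.0, 1.0, 1.0])
print(bernoulli_nll_from_logit(y, z))   # [40.  0.693  0.] -- finite everywhere
# Naive -log(sigmoid(z)) would return inf at z = -40, because sigmoid underflows.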

6.2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ be between 0 and 1, but also that the entire vector sums to 1, so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

z = W⊤h + b,

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

softmax(z)_i = exp(z_i) / Σ_j exp(z_j).

As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

log softmax(z)_i = z_i − log Σ_j exp(z_j).

The first term of equation 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of equation 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down.
Unregularized maximum likelihood will drive the softmax to predict the fraction of counts of each outcome observed in the training set:

softmax(z(x; θ))_i ≈ Σ_{j=1}^m 1_{y^(j)=i, x^(j)=x} / Σ_{j=1}^m 1_{x^(j)=x}.

Because maximum likelihood is a consistent estimator, this is guaranteed to happen so long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.
Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid has a single output that saturates when its input is extremely negative or extremely positive. In the case of the softmax, there are multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.

To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all of its inputs:

softmax(z) = softmax(z + c).

Using this property, we can derive a numerically stable variant of the softmax:

softmax(z) = softmax(z − max_i z_i).
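A short NumPy sketch (ours) of this stabilization trick: subtracting max_i z_i leaves the output unchanged but keeps every exponent at or below zero, so exp never overflows:

import numpy as np

def softmax(z):
    # Shift by the max: softmax(z) == softmax(z - max(z)), but the shifted
    # version never evaluates exp on a large positive argument.
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))                 # [0.090 0.245 0.665] -- well-defined
# np.exp(z) / np.exp(z).sum() would overflow to inf/inf = nan here.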
Saturation of the softmax can cause similar difficulties for learning if the loss function is not designed to compensate for it.

The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = W⊤h + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_2 = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1, so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the extreme (when the difference between the maximal a_i and the others is large in magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1, and the others are nearly 0).
The name "softmax" can be somewhat confusing. The function is more closely related to the arg max function than to the max function. The term "soft" derives from the fact that the softmax function is continuous and differentiable, while the arg max function (with its result represented as a one-hot vector) is not.
The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).
For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed-form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself, or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance. In the multivariate case it is most common to use a diagonal precision matrix diag(β).
Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the identity matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: β = ζ(a). This same strategy applies equally if using variance or standard deviation rather than precision, or if using a scalar times identity rather than a diagonal matrix.
It is rare to learn a covariance or precision matrix with richer structure than diagonal. If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ(x) = B(x)B(x)⊤, where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d × d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).
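A small NumPy sketch (ours) of both constraints just discussed: a softplus map for a positive diagonal precision, and the B(x)B(x)⊤ construction for a full positive semi-definite covariance:

import numpy as np

def softplus(a):
    # Stable softplus; maps any real activation to a strictly positive value.
    return np.maximum(a, 0) + np.log1p(np.exp(-np.abs(a)))

rng = np.random.default_rng(0)

# Diagonal case: raw activations -> positive precision vector beta = softplus(a).
a = rng.normal(size=4)
beta = softplus(a)
assert np.all(beta > 0)

# Full case: an unconstrained square matrix B gives Sigma = B @ B.T,
# which is symmetric positive semi-definite by construction.
B = rng.normal(size=(4, 4))
Sigma = B @ B.T
eigvals = np.linalg.eigvalsh(Sigma)
assert np.all(eigvals >= -1e-10)   # non-negative up to floating-point error
print(beta, eigvals)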
We often want to perform multimodal regression, that is, to predict real values that come from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution

p(y | x) = Σ_{i=1}^n p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).
The network must satisfy different constraints for each of these outputs:

1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable c, and can typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.

2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component, and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n × d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example's contribution to the loss for each component by the probability that the component produced the example.

3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model (a minimal sketch of this negative log-likelihood appears after this list).
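A minimal NumPy sketch (ours) of that negative log-likelihood for a diagonal-covariance Gaussian mixture, written with log-sum-exp so that the per-component responsibilities are weighted implicitly and stably:

import numpy as np

def logsumexp(a):
    # Stable log-sum-exp: log sum_i exp(a_i).
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def mixture_nll(y, log_pi, mu, sigma2):
    # -log p(y | x) for a diagonal Gaussian mixture:
    # p(y | x) = sum_i pi_i N(y; mu_i, diag(sigma2_i)).
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * sigma2)
                             + (y - mu) ** 2 / sigma2, axis=1)
    # Each component is weighted by its (log) mixture probability.
    return -logsumexp(log_pi + log_norm)

y = np.array([0.5])
log_pi = np.log(np.array([0.3, 0.7]))      # mixture weights p(c = i | x)
mu = np.array([[0.0], [1.0]])              # component means
sigma2 = np.array([[1.0], [1.0]])          # component variances
print(mixture_nll(y, log_pi, mu, sigma2))  # scalar negative log-likelihood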

It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see section 10.11.1), while another is to scale the gradients heuristically (Murray and Larochelle, 2014).
Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output, as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All of these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.

In many cases, the distribution needed to describe y becomes complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and part III describes advanced techniques for modeling arbitrary probability distributions.

6.3 Hidden Units


We describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can help decide when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in figure 4.3. These ideas will be described further in chapter 8. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.

Hidden units that are not differentiable are usually non-differentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0 and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway.
6.3.1 Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = max{0, z}.


Rectified linear units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

h = g(W⊤x + b).

When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.

Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better.

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere. Three such generalizations are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. A leaky ReLU fixes α_i to a small value like 0.01, while a parametric ReLU, or PReLU, treats α_i as a learnable parameter.
Maxout units generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:

g(z)_i = max_{j∈G^(i)} z_j,

where G^(i) is the set of indices into the inputs for group i, {(i − 1)k + 1, . . . , ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.
A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself, rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, the absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.

Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.
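A small NumPy sketch (ours) of the units discussed in this section: the ReLU-family slope generalization and a maxout unit that takes a max over groups of k pre-activations:

import numpy as np

def relu_family(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    # alpha = 0: ReLU; alpha = -1: absolute value; alpha = 0.01: leaky ReLU.
    return np.maximum(0, z) + alpha * np.minimum(0, z)

def maxout(z, k):
    # Divide the pre-activations into groups of k and take each group's max,
    # so the unit learns its own piecewise linear, convex activation.
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_family(z, alpha=0.0))    # [0.   0.   0.5  2. ]
print(relu_family(z, alpha=-1.0))   # [2.   0.5  0.5  2. ] -- absolute value
print(maxout(z, k=2))               # [-0.5  2. ] -- max over each pair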
Rectified linear units and all of their generalizations are based on the principle that models are easier to optimize if their behavior is closer to linear. One of the most popular recurrent network architectures, the LSTM, propagates information through time via summation, a particular straightforward kind of such linear activation. This is discussed further in section 10.10.

6.3.2 Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

g(z) = σ(z)

or the hyperbolic tangent activation function

g(z) = tanh(z).

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.

We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain: they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.

When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid.
6.3.3 Other Hidden Units

Many other types of hidden units are possible, but are used less frequently.

In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.

It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W⊤x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer. The factored approach is to compute h = g(V⊤U⊤x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters.
A few other reasonably common hidden unit types include:

• Radial basis function or RBF unit: h_i = exp(−(1/σ_i²)‖W_{:,i} − x‖²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

• Softplus: g(a) = ζ(a) = log(1 + eᵃ). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

• Hard tanh: this is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).

Hidden unit design remains an active area of research, and many useful hidden unit types remain to be discovered.

6.4 Architecture Design


Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.
Most neural networks are organized into groups of units called layers, arranged in a chain structure with each layer being a function of the layer that preceded it. In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks often are able to use far fewer units per layer and far fewer parameters and often generalize to the test set, but are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

6.4.1 Universal Approximation Properties and Depth

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train, because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want to learn nonlinear functions.

At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of Rⁿ is Borel measurable.
Even if a large network is able to represent the desired function, learning can fail for two different reasons. First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting. Recall from section 5.2.1 that the "no free lunch" theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions, in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}ⁿ is 2^(2ⁿ), and selecting one such function requires 2ⁿ bits, which will in general require on the order of 2ⁿ degrees of freedom.

In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.

There exist families of functions which can be approximated efficiently by an architecture with depth greater than some value d, but which require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n.
Montufar et al. (2014) showed that functions representable with a deep rectifier network can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions that can capture all kinds of regular (e.g., repeating) patterns.

Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers). Figure reproduced with permission from Montufar et al. (2014).
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.
We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation, but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See figure 6.6 and figure 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

6.4.2 Other Architectural Considerations

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.
[Figure 6.6: plot of test accuracy (percent, from 92.0 to 96.5) against number of hidden layers (from 3 to 11).]
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.

Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent.
[Figure 6.7: plot of test accuracy (percent, from 91 to 97) against number of parameters, with curves for 3-layer and 11-layer networks, varying either the convolutional or the fully connected layers.]

Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn.
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The input x provides the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.
Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.

The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives as well.
6.5.1 Computational Graphs

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in figure 6.8.

6.5.2 Chain Rule of Calculus


[Figure 6.8: examples of computational graphs. (a) z = x × y. (b) the logistic regression prediction ŷ = σ(x⊤w + b), built from dot and + operations producing u^(1) and u^(2). (c) H = relu(XW + b), built from matmul and + operations. (d) a graph in which a variable feeds multiple operations: ŷ = x⊤w together with a penalty term computed via sqr and sum operations on w.]

Let x ∈ Rᵐ and y ∈ Rⁿ. Suppose g maps from Rᵐ to Rⁿ, and f maps from Rⁿ to R. If y = g(x) and z = f(y), then

∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).

In vector notation, this may be equivalently written as

∇_x z = (∂y/∂x)⊤ ∇_y z,

where ∂y/∂x is the n × m Jacobian matrix of g.

From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.

Usually we do not apply the back-propagation algorithm merely to vectors, but rather to tensors of arbitrary dimensionality. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is that the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.

To denote the gradient of a value z with respect to a tensor X, we write ∇_X z, just as if X were a vector. The indices into X now have multiple coordinates; for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how, for all possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.
When computing the gradient, many subexpressions may be repeated several times, and we will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in figure 6.9. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible. In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

We first begin by a version of the back-propagation algorithm that specifies the actual gradient computation directly (algorithm 6.2, along with algorithm 6.1 for the associated forward computation), in the order it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and the construction of the symbolic graph that performs the gradient computation. Such a formulation is presented below in section 6.5.6, with algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.

First consider a computational graph describing how to compute a single scalar u^(n) (say the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, . . . , n_i}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.

We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i + 1) and going up to u^(n). As defined in algorithm 6.1, each node u^(i) is associated with an operation f^(i).
Algorithm 6.1 A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, . . . , n_i do
  u^(i) ← x_i
end for
for i = n_i + 1, . . . , n do
  A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
  u^(i) ← f^(i)(A^(i))
end for
return u^(n)
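A compact Python rendering (ours) of algorithm 6.1: nodes are stored in topological order, each with its local function and parent indices, and values are filled in one after the other:

import numpy as np

def forward(x, nodes, n_inputs):
    """Evaluate a computational graph in topological order.

    x:      sequence of n_inputs input values
    nodes:  list of (f, parents) pairs for the non-input nodes, where
            `parents` are indices of earlier nodes and f maps their
            values to this node's value
    Returns the list u of all node values; u[-1] is the graph's output.
    """
    u = list(x)                          # u[i] <- x_i for the input nodes
    for f, parents in nodes:
        args = [u[j] for j in parents]   # A^(i): values of parent nodes
        u.append(f(args))                # u^(i) <- f^(i)(A^(i))
    return u

# Example: the chain z = f(f(f(w))) from figure 6.9, with f = sin.
u = forward([2.0], [(lambda a: np.sin(a[0]), [0]),
                    (lambda a: np.sin(a[0]), [1]),
                    (lambda a: np.sin(a[0]), [2])], 1)
print(u[-1])   # sin(sin(sin(2.0)))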

using the chain rule with respect to scalar output u^(n):

∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)),

as specified by algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i).
Algorithm 6.2 Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), . . . , u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.

Run forward propagation (algorithm 6.1 for this example) to obtain the activations of the network.
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
  The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
  grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] · ∂u^(i)/∂u^(j)
end for
return {grad_table[u^(i)] | i = 1, . . . , n_i}
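Continuing the sketch (ours), the scalar backward pass of algorithm 6.2 over the same node representation used by the `forward` helper above, now with each node carrying its local partial derivatives:

import numpy as np

def backward(u, nodes, n_inputs):
    """Scalar back-propagation over the graph evaluated by `forward`.

    nodes: list of (f, parents, dfs) where dfs[j] gives the partial
           derivative of the node with respect to its j-th parent,
           as a function of the parent values.
    Returns grad_table with grad_table[i] = d u^(n) / d u^(i).
    """
    n = len(u)
    grad_table = [0.0] * n
    grad_table[n - 1] = 1.0                  # d u^(n) / d u^(n) = 1
    for i in reversed(range(n_inputs, n)):   # visit nodes in reverse order
        f, parents, dfs = nodes[i - n_inputs]
        args = [u[j] for j in parents]
        for j, df in zip(parents, dfs):
            # Accumulate (d u^(n)/d u^(i)) * (d u^(i)/d u^(j)) into node j.
            grad_table[j] += grad_table[i] * df(args)
    return grad_table

f = lambda a: np.sin(a[0]); df = lambda a: np.cos(a[0])
nodes = [(f, [0], [df]), (f, [1], [df]), (f, [2], [df])]
u = forward([2.0], [(g, p) for g, p, _ in nodes], 1)
print(backward(u, nodes, 1)[0])   # f'(f(f(w))) * f'(f(w)) * f'(w) at w = 2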

Back-propagation thus avoids the exponential explosion in repeated subexpressions. However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions.
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ R be the input to the graph. We use the same function f: R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:

∂z/∂w = (∂z/∂y)(∂y/∂x)(∂x/∂w) = f′(y) f′(x) f′(w) = f′(f(f(w))) f′(f(w)) f′(w).
The gradient of such a graph can be computed by applying the back-propagation algorithm to this graph.

Algorithms 6.3 and 6.4 are demonstrations that are chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of back-propagation described in section 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.

Algorithm 6.3 Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see section 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See section 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, . . . , l}, the weight matrices of the model
Require: b^(i), i ∈ {1, . . . , l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, . . . , l do
  a^(k) = b^(k) + W^(k) h^(k−1)
  h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)
Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
for k = l, l − 1, . . . , 1 do
  Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
  g ← ∇_{a^(k)} J = g ⊙ f′(a^(k))
  Compute gradients on weights and biases (including the regularization term, where needed):
  ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)
  ∇_{W^(k)} J = g h^(k−1)⊤ + λ∇_{W^(k)} Ω(θ)
  Propagate the gradients with respect to the next lower-level hidden layer's activations:
  g ← ∇_{h^(k−1)} J = W^(k)⊤ g
end for
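A runnable NumPy sketch (ours) of algorithms 6.3 and 6.4 together, for a ReLU network with squared-error loss and no regularizer (λ = 0); the layer sizes and data are arbitrary illustrations:

import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                       # input, hidden, output widths
W = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=3), rng.normal(size=2)

# Forward propagation (algorithm 6.3), saving a^(k) and h^(k) for later.
h, a = [x], []
for Wk, bk in zip(W, b):
    a.append(bk + Wk @ h[-1])
    h.append(np.maximum(0, a[-1]))      # f = ReLU at every layer
y_hat = h[-1]
J = 0.5 * np.sum((y_hat - y) ** 2)      # L(y_hat, y), no regularizer

# Backward computation (algorithm 6.4).
g = y_hat - y                           # gradient of L on the output
grads_W, grads_b = [], []
for k in reversed(range(len(W))):
    g = g * (a[k] > 0)                  # g <- g (*) f'(a^(k)) for ReLU
    grads_b.insert(0, g.copy())         # gradient on b^(k)
    grads_W.insert(0, np.outer(g, h[k]))# gradient on W^(k): g h^(k-1)^T
    g = W[k].T @ g                      # propagate to h^(k-1)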
[Figure 6.10: two computational graphs. On the left, the forward chain w → x → y → z, with each edge an application of f. On the right, the same graph after back-propagation has appended nodes computing f′(w), f′(x), f′(y) and the products dz/dy, dz/dx, and dz/dw.]

Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach symbol-to-number differentiation. Another approach is to add additional nodes to the graph that provide a symbolic description of the desired derivatives, as illustrated in figure 6.10. We can think of this symbol-to-symbol approach in terms of constructing a computational graph for the derivatives. This graph may then be evaluated using specific numerical values. This approach allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.

The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number approach does not expose the graph.

6.5.6 General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians, traveling backwards through the graph in this way, until we reach x. For any node that may be reached by going backwards from z through two or more paths, we simply sum the gradients arriving from the different paths at that node.

More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors can in general have any number of dimensions. They subsume scalars, vectors, and matrices.

We assume that each variable V is associated with subroutines that return the operation that produced it, its consumers in the graph, and the inputs to those operations.
We assume that each variable V is associated with the
CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by equation 6.47. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB^⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing its bprop method and specifying that the desired gradient is given by A^⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return

\sum_i \left( \nabla_X \, \text{op.f(inputs)}_i \right) G_i,

which is just an implementation of the chain rule as expressed in equation 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation. The op.bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x², the op.bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.
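
A sketch of what a bprop rule can look like in Python, using the matrix multiplication example from above (the class and method names are illustrative only, not any particular library's API):

    import numpy as np

    class MatMulOp:
        """An operation with per-argument back-propagation rules."""
        def f(self, A, B):
            return A @ B                 # forward: C = AB

        def bprop(self, inputs, X, G):
            """Return the gradient on the requested input X, given the
            gradient G on the output."""
            A, B = inputs
            if X is A:
                return G @ B.T           # dz/dA = G B^T
            if X is B:
                return A.T @ G           # dz/dB = A^T G

With C = AB and ∇_C z = G, calling bprop([A, B], A, G) yields GB^⊤, matching the rule stated above.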

Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed
Require: 𝒢, the computational graph
Require: z, the variable to be differentiated
Let 𝒢′ be 𝒢 pruned to contain only nodes that are ancestors of z and descendants of nodes in T.
Initialize grad_table, a data structure associating tensors to their gradients.
grad_table[z] ← 1
for V in T do
  build_grad(V, 𝒢, 𝒢′, grad_table)
end for
Return grad_table restricted to T

In section 6.5.2, we explained that back-propagation was developed in order to avoid computing the same subexpression in the chain rule multiple times; the naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix multiplication as a single operation). Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations.

Algorithm 6.6 The inner loop subroutine build_grad(V, 𝒢, 𝒢′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in algorithm 6.5.

Require: V, the variable whose gradient should be added to 𝒢 and grad_table
Require: 𝒢, the graph to modify
Require: 𝒢′, the restriction of 𝒢 to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
if V is in grad_table then
  Return grad_table[V]
end if
i ← 1
for C in get_consumers(V, 𝒢′) do
  op ← get_operation(C)
  D ← build_grad(C, 𝒢, 𝒢′, grad_table)
  G^{(i)} ← op.bprop(get_inputs(C, 𝒢′), V, D)
  i ← i + 1
end for
G ← Σ_i G^{(i)}
grad_table[V] = G
Insert G and the operations creating it into 𝒢
Return G
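
The heart of algorithm 6.6 is a short memoized recursion. The sketch below is a symbol-to-number rendering in Python: the graph-access helpers are passed in as callables standing in for get_consumers, get_operation and get_inputs, and grad_table[z] is assumed to have been seeded with 1 by the outer routine of algorithm 6.5:

    def build_grad(V, consumers, get_operation, get_inputs, grad_table):
        if V in grad_table:                 # memoization: each node once
            return grad_table[V]
        total = None
        for C in consumers(V):              # children of V in pruned graph
            op = get_operation(C)
            D = build_grad(C, consumers, get_operation, get_inputs,
                           grad_table)      # gradient on the consumer
            g = op.bprop(get_inputs(C), V, D)   # chain rule, one consumer
            total = g if total is None else total + g  # sum over paths
        grad_table[V] = total
        return total

The grad_table memoization is exactly the dynamic-programming table discussed below: each node's gradient is computed once, then reused by every path that reaches it.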

Most computational graphs used in practice are roughly chain-structured, causing back-propagation to have only O(n) cost. This is far better than the naive approach, which might need to execute exponentially many operations. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (equation 6.49) non-recursively:

\frac{\partial u^{(n)}}{\partial u^{(j)}} = \sum_{\substack{\text{paths } (u^{(\pi_1)}, \ldots, u^{(\pi_t)}), \\ \pi_1 = j, \ \pi_t = n}} \ \prod_{k=2}^{t} \frac{\partial u^{(\pi_k)}}{\partial u^{(\pi_{k-1})}}.

Since the number of paths from node j to node n can grow exponentially in the length of these paths, the number of terms in the above sum can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for \partial u^{(i)} / \partial u^{(j)} would be redone many times. By storing intermediate results in grad_table and filling in the table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.

6.5.7 Example: Back-Propagation for MLP Training

As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent. The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW^{(1)}}. To simplify the presentation, we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW^{(2)}. We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_{MLE}. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost

J = J_{\text{MLE}} + \lambda \Big( \sum_{i,j} \big( W^{(1)}_{i,j} \big)^2 + \sum_{i,j} \big( W^{(2)}_{i,j} \big)^2 \Big)

consists of the cross-entropy and a weight decay term with coefficient λ. Its computational graph is given in figure 6.11.



[Figure 6.11 diagram: X and W^{(1)} feed a matmul producing U^{(1)}; relu yields H; H and W^{(2)} feed a second matmul producing U^{(2)}, which is combined with y by cross_entropy to yield J_{MLE}; sqr and sum operations applied to W^{(1)} give U^{(3)} and u^{(4)}, and applied to W^{(2)} give U^{(5)} and u^{(6)}; + operations combine J_{MLE} with these weight decay terms to yield the total cost J.]

Figure 6.11: The computational graph used to compute the cost to train our example of a single-layer MLP using the cross-entropy loss and weight decay.

There are two different paths backwards from J to the weights: one through the cross-entropy cost and one through the weight decay cost. The weight decay path is relatively simple; it always contributes 2λW^{(i)} to the gradient on W^{(i)}.

The other path, through the cross-entropy cost, is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U^{(2)} provided by the cross_entropy operation. The back-propagation algorithm now needs to explore two different branches. On the shorter branch, it adds H^⊤G to the gradient on W^{(2)}, using the back-propagation rule for the second argument of the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇_H J = GW^{(2)⊤}. Next, the relu operation uses its back-propagation rule to zero out components of the gradient corresponding to entries of U^{(1)} that were less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X^⊤G′ to the gradient on W^{(1)}.

For this kind of model, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer. This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(mn_h), where m is the number of examples in the minibatch and n_h is the number of hidden units.
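
The whole worked example fits in a short NumPy function. This sketch computes J and the gradients on W^{(1)} and W^{(2)} for the model above, following both branches just described; y is assumed to hold integer class labels:

    import numpy as np

    def mlp_loss_and_grads(X, y, W1, W2, lam):
        U1 = X @ W1                      # pre-activation
        H = np.maximum(0, U1)            # relu
        U2 = H @ W2                      # unnormalized log probabilities
        P = np.exp(U2 - U2.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)            # softmax
        m = X.shape[0]
        J_mle = -np.log(P[np.arange(m), y]).mean()   # cross-entropy
        J = J_mle + lam * ((W1**2).sum() + (W2**2).sum())

        G = P.copy()                     # dJ_mle/dU2 = (P - onehot(y)) / m
        G[np.arange(m), y] -= 1
        G /= m
        gW2 = H.T @ G + 2 * lam * W2     # shorter branch plus weight decay
        Gh = G @ W2.T                    # longer branch: back through matmul
        Gh[U1 <= 0] = 0                  # relu bprop zeroes inactive units
        gW1 = X.T @ Gh + 2 * lam * W1
        return J, gW1, gW2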

6.5.8 Complications

Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together. In the naive approach, each of these tensors would be computed separately, then all of them would be added in a second step. The naive approach has an overly high memory bottleneck that can be avoided by maintaining a single buffer and adding each value to that buffer as it is computed, as the sketch below illustrates.
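
A minimal sketch of the two memory strategies; both compute the same sum, and terms is a hypothetical list of zero-argument callables, each producing one gradient contribution:

    import numpy as np

    def naive_sum(terms):
        tensors = [t() for t in terms]   # all k tensors alive at once
        return sum(tensors)

    def buffered_sum(terms, shape):
        buf = np.zeros(shape)            # single accumulation buffer
        for t in terms:
            buf += t()                   # each term can be freed after adding
        return buf
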
Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for handling each of these types takes special care to design.

6.5.9 Differentiation outside the Deep Learning Community

The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically. The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders. In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete (Naumann, 2008), in the sense that it may require simplifying algebraic expressions into their least expensive form.
For example, suppose we have variables p_1, p_2, \ldots, p_n representing probabilities and variables z_1, z_2, \ldots, z_n representing unnormalized log probabilities. Suppose we define

q_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)},

where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = -\sum_i p_i \log q_i. A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i. The back-propagation algorithm is not capable of simplifying the gradient this way, and will instead explicitly propagate gradients through all of the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.
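
The simplification is easy to verify numerically. The check below compares the mathematician's form q − p against a centered finite difference of J; the printed discrepancy is on the order of roundoff:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.normal(size=5)
    p = rng.dirichlet(np.ones(5))        # an arbitrary target distribution

    def J(z):
        q = np.exp(z - z.max()); q /= q.sum()
        return -(p * np.log(q)).sum()

    q = np.exp(z - z.max()); q /= q.sum()
    analytic = q - p                     # the simplified derivative
    eps = 1e-6
    numeric = np.array([
        (J(z + eps * np.eye(5)[i]) - J(z - eps * np.eye(5)[i])) / (2 * eps)
        for i in range(5)])
    print(np.max(np.abs(analytic - numeric)))   # tiny: the two forms agree
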
When the forward graph 𝒢 has a single output node and each partial derivative \partial u^{(i)} / \partial u^{(j)} can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient is of the same order as the number of computations for the forward pass. Back-propagation can also be extended to compute a Jacobian of k different scalar nodes in the graph, but a naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient. When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode computation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). This approach also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory. The relationship between forward mode and backward mode is analogous to the relationship between left-multiplying and right-multiplying a sequence of matrices, such as

ABCD,

where the matrices can be thought of as Jacobian matrices. For example, if D is a column vector while A has many rows, this corresponds to a graph with a single output and many inputs; starting the multiplication from the end and going backwards requires only matrix-vector products. This corresponds to the backward mode. Starting to multiply from the left instead would involve a series of matrix-matrix products, making the whole computation much more expensive. However, if A has fewer rows than D has columns, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
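
The cost difference is easy to see numerically; both groupings below compute the same vector, but the parenthesization changes the intermediate products:

    import numpy as np

    n = 500
    A = np.random.randn(n, n); B = np.random.randn(n, n)
    C = np.random.randn(n, n); d = np.random.randn(n, 1)

    right_to_left = A @ (B @ (C @ d))    # three O(n^2) matrix-vector products
    left_to_right = ((A @ B) @ C) @ d    # two O(n^3) matrix-matrix products
    assert np.allclose(right_to_left, left_to_right)
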
In many communities outside of machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the bprop methods for every operation and limiting the user of the library to only those operations that have been defined.

6.5.10 Higher-Order Derivatives

Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian matrix. If we have a function f : ℝⁿ → ℝ, then the Hessian matrix is of size n × n. In typical deep learning applications, n will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.
Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations, such as approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.
In order to use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

Hv = \nabla_x \left[ \left( \nabla_x f(x) \right)^\top v \right].

Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression. If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.
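
This identity is a one-liner in an autodiff library; here is a sketch using JAX as one concrete choice of library (assuming jax is installed):

    import jax
    import jax.numpy as jnp

    def hvp(f, x, v):
        # outer gradient of <grad f(x), v>; the inner jax.grad(f) builds
        # the gradient, and the outer jax.grad differentiates through it
        return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

    f = lambda x: jnp.sum(x**4)          # toy function; Hessian = diag(12 x^2)
    x = jnp.array([1.0, 2.0]); v = jnp.array([1.0, 0.0])
    print(hvp(f, x, v))                  # [12., 0.] = diag(12 x^2) @ v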

6.6 Historical Notes

Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.
The chain rule that underlies the back-propagation algorithm was invented in the 17th century (Leibniz, 1676; L'Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the 19th century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.

Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. However, the ideas put forward by the authors of that book, and in particular by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name "connectionism."

The core ideas behind modern feedforward networks have not changed substantially since the 1980s: the same back-propagation algorithm and the same approaches to gradient descent are still in use. Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, due to more powerful computers and better software infrastructure. To a lesser extent, a small number of algorithmic changes have also improved the performance of neural networks noticeably.
One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s but was gradually replaced by cross-entropy losses and the principle of maximum likelihood as ideas spread between the statistics community and the machine learning community. The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.
The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. Rectification using the max{0, z} function was introduced in early neural network models and dates back at least as far as the Cognitron and Neocognitron (Fukushima, 1975, 1980). These early models did not use rectified linear units, but instead applied rectification to nonlinear functions. Despite the early popularity of rectification, it was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better when neural networks are very small. As of the early 2000s, rectified linear units were avoided due to a somewhat superstitious belief that activation functions with non-differentiable points must be avoided. This began to change in about 2009, when Jarrett et al. (2009) observed that using a rectifying nonlinearity is the single most important factor in improving the performance of a recognition system, among several different factors of neural network architecture design.

Rectified linear units are also of historical interest because they show that neuroscience has continued to have an influence on the development of deep learning algorithms. Glorot et al. (2011a) motivate rectified linear units from biological considerations. The half-rectifying nonlinearity was intended to capture these properties of biological neurons: 1) For some inputs, biological neurons are completely inactive. 2) For some inputs, a biological neuron's output is proportional to its input. 3) Most of the time, biological neurons operate in the regime where they are inactive (i.e., they should have sparse activations).
When the modern resurgence of deep learning began in 2006, feedforward networks continued to have a bad reputation. From about 2006 to 2012, it was widely believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today, it is known that with the right resources and engineering practices, feedforward networks perform very well. Gradient-based learning in feedforward networks is now used as a tool to develop probabilistic models, such as the variational autoencoder and generative adversarial networks, described in chapter 20. Rather than being viewed as an unreliable technology that must be supported by other techniques, gradient-based learning in feedforward networks has been viewed since 2012 as a powerful technology that may be applied to many other machine learning tasks. In 2006, the community used unsupervised learning to support supervised learning; now, ironically, it is more common to use supervised learning to support unsupervised learning.
Feedforward networks continue to have unfulfilled potential. In the future, we expect they will be applied to many more tasks, and that advances in optimization algorithms and model design will improve their performance even further. This chapter has primarily described the neural network family of models. In the subsequent chapters, we turn to how to use these models: how to regularize and train them.
