Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
[…] than those found with symmetric activation functions, as can be seen in Figure 11. […] where the gradients would flow well.
Figure 5: Cross entropy (black, top surface) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, $W_1$ on the first layer and $W_2$ on the second, output layer.

For a dense artificial neural network using a symmetric activation function $f$ with unit derivative at 0 (i.e. $f'(0) = 1$), if we write $\mathbf{z}^i$ for the activation vector of layer $i$, and $\mathbf{s}^i$ for the argument vector of the activation function at layer $i$, we have $\mathbf{s}^i = \mathbf{z}^i W^i + \mathbf{b}^i$ and $\mathbf{z}^{i+1} = f(\mathbf{s}^i)$. From these definitions we obtain the following:

$$\frac{\partial Cost}{\partial s^i_k} = f'(s^i_k)\, W^{i+1}_{k,\bullet} \frac{\partial Cost}{\partial \mathbf{s}^{i+1}} \tag{2}$$

$$\frac{\partial Cost}{\partial w^i_{l,k}} = z^i_l \frac{\partial Cost}{\partial s^i_k} \tag{3}$$

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently, and that the input features' variances are the same ($= Var[x]$). Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$$f'(s^i_k) \approx 1, \tag{4}$$

$$Var[z^i] = Var[x] \prod_{i'=0}^{i-1} n_{i'}\, Var[W^{i'}], \tag{5}$$

where we write $Var[W^{i'}]$ for the shared scalar variance of all weights at layer $i'$. Then for a network with $d$ layers,

$$Var\left[\frac{\partial Cost}{\partial s^i}\right] = Var\left[\frac{\partial Cost}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1}\, Var[W^{i'}], \tag{6}$$

$$Var\left[\frac{\partial Cost}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'}\, Var[W^{i'}] \prod_{i'=i}^{d-1} n_{i'+1}\, Var[W^{i'}] \times Var[x]\, Var\left[\frac{\partial Cost}{\partial s^d}\right]. \tag{7}$$
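The multiplicative structure of Eqs. (5)-(7) is easy to check numerically. Below is a minimal sketch that simulates the linear regime and compares the empirical activation variance at each layer with the prediction of Eq. (5); the layer widths, weight variance, and sample count are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: forward-propagate random inputs through random linear
# layers and compare the empirical activation variance with Eq. (5).
rng = np.random.default_rng(0)
sizes = [200, 300, 250, 400, 350]              # n_0 .. n_4 (arbitrary)
var_w = 0.005                                  # shared scalar Var[W^i]
x = rng.normal(0.0, 1.0, size=(10_000, sizes[0]))   # so Var[x] ~= 1

z, predicted = x, np.var(x)
for i in range(len(sizes) - 1):
    W = rng.normal(0.0, np.sqrt(var_w), size=(sizes[i], sizes[i + 1]))
    z = z @ W                        # linear regime: f(s) ~= s, Eq. (4)
    predicted *= sizes[i] * var_w    # Eq. (5): one factor n_i Var[W^i]
    print(f"z^{i+1}: empirical Var = {np.var(z):.4f}, "
          f"Eq. (5) predicts {predicted:.4f}")
```

With these widths the per-layer factors $n_{i'} Var[W^{i'}]$ differ from 1, so the empirical variance drifts across layers exactly as the product formula predicts.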
From a forward-propagation point of view, to keep information flowing we would like that

$$\forall (i, i'),\quad Var[z^i] = Var[z^{i'}], \tag{8}$$

and from a back-propagation point of view we would similarly like to have

$$\forall (i, i'),\quad Var\left[\frac{\partial Cost}{\partial s^i}\right] = Var\left[\frac{\partial Cost}{\partial s^{i'}}\right]. \tag{9}$$

These two conditions transform into

$$\forall i,\quad n_i\, Var[W^i] = 1, \tag{10}$$

$$\forall i,\quad n_{i+1}\, Var[W^i] = 1. \tag{11}$$

As a compromise between these two constraints, we might want to have

$$\forall i,\quad Var[W^i] = \frac{2}{n_i + n_{i+1}}. \tag{12}$$

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights, we could get the following interesting properties:

$$\forall i,\quad Var\left[\frac{\partial Cost}{\partial s^i}\right] = \Big[n\, Var[W]\Big]^{d-i} Var[x], \tag{13}$$

$$\forall i,\quad Var\left[\frac{\partial Cost}{\partial w^i}\right] = \Big[n\, Var[W]\Big]^{d} Var[x]\, Var\left[\frac{\partial Cost}{\partial s^d}\right]. \tag{14}$$

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

$$n\, Var[W] = \frac{1}{3}, \tag{15}$$

where $n$ is the layer size (assuming all layers have the same size). This will cause the variance of the back-propagated gradient to depend on the layer (and to decrease).

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variances as one moves up or down the network. We call it the normalized initialization:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]. \tag{16}$$
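The contrast between the two schemes can be made concrete with a minimal sketch comparing back-propagated variances in the linear regime: under the standard initialization of eq. 1 we have $n\, Var[W] = 1/3$ (eq. 15), while under the normalized initialization of eq. 16 with equal widths we have $n\, Var[W] = 1$. The function names below are illustrative choices, not names from the text.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    # Eq. (16): U[-sqrt(6)/sqrt(n_in + n_out), sqrt(6)/sqrt(n_in + n_out)],
    # giving Var[W] = 2 / (n_in + n_out), the compromise of Eq. (12).
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def standard_init(n_in, n_out, rng):
    # Eq. (1) heuristic: U[-1/sqrt(n_in), 1/sqrt(n_in)], so that
    # n_in * Var[W] = 1/3 (Eq. (15)).
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

rng = np.random.default_rng(0)
n, depth = 400, 6                   # equal layer widths, as in the analysis
g0 = rng.normal(size=(10_000, n))   # gradient arriving at the top layer

for name, init in [("standard", standard_init), ("normalized", normalized_init)]:
    g = g0
    print(name)
    for i in reversed(range(depth)):
        # Back-propagate one layer in the linear regime (f' ~= 1, Eq. (2)):
        # each step multiplies the variance by n * Var[W].
        g = g @ init(n, n, rng).T
        print(f"  layer {i}: Var[dCost/ds] = {np.var(g):.4f}")
```

With the standard scheme the printed variances shrink by roughly a factor of 3 per layer, while the normalized scheme keeps them approximately constant, matching properties (13) and (15).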
Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangent, with standard initialization (top) and normalized initialization (bottom), during training. We see that the normalization keeps the variance of the weight gradients the same across layers during training (top: smaller variance for higher layers).
Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

TYPE         Shapeset   MNIST   CIFAR-10   ImageNet
Softsign     16.27      1.64    55.78      69.14
Softsign N   16.06      1.72    53.8       68.13
Tanh         27.15      1.76    55.9       70.58
Tanh N       15.60      1.64    52.92      68.57
Sigmoid      82.61      2.21    57.28      70.66

Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.
Figure 11 shows the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3 × 2, because of the task difficulty, we observe important saturations during learning; this might explain why the effects of the normalized initialization or the softsign are more visible there.

Several conclusions can be drawn from these error curves:

• The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

• The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

• For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate, as sketched below. Both of these methods were applied to Shapeset-3 × 2 with hyperbolic tangent and standard initialization. We observed a gain in performance, but not one reaching the result obtained with normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian can then focus on discrepancies between units, without having to correct important initial discrepancies between layers.
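As one illustration of this idea, here is a minimal sketch of a per-parameter learning rate driven by a running gradient second-moment estimate; the class name, decay constant, and epsilon are illustrative choices, not the exact rule used in the experiments.

```python
import numpy as np

class GradVarianceLR:
    """Minimal sketch of a per-parameter learning rate: each parameter's
    step is scaled by a running estimate of its gradient's second moment,
    so parameters with consistently large gradients take smaller steps.
    The decay constant and epsilon are illustrative choices."""

    def __init__(self, shape, lr=0.01, decay=0.99, eps=1e-8):
        self.lr, self.decay, self.eps = lr, decay, eps
        self.second_moment = np.zeros(shape)

    def step(self, params, grad):
        # Exponential moving estimate of E[g^2] for every parameter.
        self.second_moment = (self.decay * self.second_moment
                              + (1.0 - self.decay) * grad ** 2)
        # Effective learning rate lr / sqrt(E[g^2] + eps), per parameter.
        return params - self.lr * grad / np.sqrt(self.second_moment + self.eps)

# Usage: one update on hypothetical weights and gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=(5, 3))
opt = GradVarianceLR(w.shape)
w = opt.step(w, rng.normal(size=w.shape))
```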
In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

The other conclusions from this study are the following:

• Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep networks.