Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
[…] than those found with symmetric activation functions, as can be seen in Figure 11. […] where the gradients would flow well.
Figure 5: Cross entropy (black, top surface) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, $W_1$ on the first layer and $W_2$ on the second, output layer.

For a dense artificial neural network using a symmetric activation function $f$ with unit derivative at 0 (i.e. $f'(0) = 1$), if we write $\mathbf{z}^i$ for the activation vector of layer $i$, and $\mathbf{s}^i$ for the argument vector of the activation function at layer $i$, we have $\mathbf{s}^i = \mathbf{z}^i W^i + \mathbf{b}^i$ and $\mathbf{z}^{i+1} = f(\mathbf{s}^i)$. From these definitions we obtain the following:

$$\frac{\partial Cost}{\partial s^i_k} = f'(s^i_k)\, W^{i+1}_{k,\bullet} \frac{\partial Cost}{\partial \mathbf{s}^{i+1}} \tag{2}$$

$$\frac{\partial Cost}{\partial w^i_{l,k}} = z^i_l \frac{\partial Cost}{\partial s^i_k} \tag{3}$$

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently, and that the input features' variances are the same ($= Var[x]$). Then we can say that, with $n_i$ the size of layer $i$ and $x$ the network input,

$$f'(s^i_k) \approx 1, \tag{4}$$

$$Var[z^i] = Var[x] \prod_{i'=0}^{i-1} n_{i'}\, Var[W^{i'}], \tag{5}$$

where we write $Var[W^{i'}]$ for the shared scalar variance of all weights at layer $i'$. Then for a network with $d$ layers,

$$Var\left[\frac{\partial Cost}{\partial s^i}\right] = Var\left[\frac{\partial Cost}{\partial s^d}\right] \prod_{i'=i}^{d} n_{i'+1}\, Var[W^{i'}], \tag{6}$$

$$Var\left[\frac{\partial Cost}{\partial w^i}\right] = \prod_{i'=0}^{i-1} n_{i'}\, Var[W^{i'}] \prod_{i'=i}^{d-1} n_{i'+1}\, Var[W^{i'}] \times Var[x]\, Var\left[\frac{\partial Cost}{\partial s^d}\right]. \tag{7}$$
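The multiplicative structure of Eqs. (5)-(7) is easy to check numerically. Below is a minimal sketch that simulates the linear regime and compares the empirical activation variance at each layer with the prediction of Eq. (5); the layer widths, weight variance, and sample count are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch: forward-propagate random inputs through random linear
# layers and compare the empirical activation variance with Eq. (5).
rng = np.random.default_rng(0)
sizes = [200, 300, 250, 400, 350]              # n_0 .. n_4 (arbitrary)
var_w = 0.005                                  # shared scalar Var[W^i]
x = rng.normal(0.0, 1.0, size=(10_000, sizes[0]))   # so Var[x] ~= 1

z, predicted = x, np.var(x)
for i in range(len(sizes) - 1):
    W = rng.normal(0.0, np.sqrt(var_w), size=(sizes[i], sizes[i + 1]))
    z = z @ W                        # linear regime: f(s) ~= s, Eq. (4)
    predicted *= sizes[i] * var_w    # Eq. (5): one factor n_i Var[W^i]
    print(f"z^{i+1}: empirical Var = {np.var(z):.4f}, "
          f"Eq. (5) predicts {predicted:.4f}")
```

With these widths the per-layer factors $n_{i'} Var[W^{i'}]$ differ from 1, so the empirical variance drifts across layers exactly as the product formula predicts.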
From a forward-propagation point of view, to keep information flowing we would like that

$$\forall (i, i'),\quad Var[z^i] = Var[z^{i'}], \tag{8}$$

and from a back-propagation point of view we would similarly like to have

$$\forall (i, i'),\quad Var\left[\frac{\partial Cost}{\partial s^i}\right] = Var\left[\frac{\partial Cost}{\partial s^{i'}}\right]. \tag{9}$$

These two conditions transform into

$$\forall i,\quad n_i\, Var[W^i] = 1, \tag{10}$$

$$\forall i,\quad n_{i+1}\, Var[W^i] = 1. \tag{11}$$

As a compromise between these two constraints, we might want to have

$$\forall i,\quad Var[W^i] = \frac{2}{n_i + n_{i+1}}. \tag{12}$$

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights, we could get the following interesting properties:

$$\forall i,\quad Var\left[\frac{\partial Cost}{\partial s^i}\right] = \Big[n\, Var[W]\Big]^{d-i} Var[x], \tag{13}$$

$$\forall i,\quad Var\left[\frac{\partial Cost}{\partial w^i}\right] = \Big[n\, Var[W]\Big]^{d} Var[x]\, Var\left[\frac{\partial Cost}{\partial s^d}\right]. \tag{14}$$

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

$$n\, Var[W] = \frac{1}{3}, \tag{15}$$

where $n$ is the layer size (assuming all layers have the same size). This will cause the variance of the back-propagated gradient to depend on the layer (and to decrease).

The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variances as one moves up or down the network. We call it the normalized initialization:

$$W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\; \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]. \tag{16}$$
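The contrast between the two schemes can be made concrete with a minimal sketch comparing back-propagated variances in the linear regime: under the standard initialization of eq. 1 we have $n\, Var[W] = 1/3$ (eq. 15), while under the normalized initialization of eq. 16 with equal widths we have $n\, Var[W] = 1$. The function names below are illustrative choices, not names from the text.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    # Eq. (16): U[-sqrt(6)/sqrt(n_in + n_out), sqrt(6)/sqrt(n_in + n_out)],
    # giving Var[W] = 2 / (n_in + n_out), the compromise of Eq. (12).
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def standard_init(n_in, n_out, rng):
    # Eq. (1) heuristic: U[-1/sqrt(n_in), 1/sqrt(n_in)], so that
    # n_in * Var[W] = 1/3 (Eq. (15)).
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

rng = np.random.default_rng(0)
n, depth = 400, 6                   # equal layer widths, as in the analysis
g0 = rng.normal(size=(10_000, n))   # gradient arriving at the top layer

for name, init in [("standard", standard_init), ("normalized", normalized_init)]:
    g = g0
    print(name)
    for i in reversed(range(depth)):
        # Back-propagate one layer in the linear regime (f' ~= 1, Eq. (2)):
        # each step multiplies the variance by n * Var[W].
        g = g @ init(n, n, rng).T
        print(f"  layer {i}: Var[dCost/ds] = {np.var(g):.4f}")
```

With the standard scheme the printed variances shrink by roughly a factor of 3 per layer, while the normalized scheme keeps them approximately constant, matching properties (13) and (15).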
Figure 8: Weight gradient normalized histograms with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangent, with standard initialization (top) and normalized initialization (bottom), during training. We see that the normalization keeps the variance of the weight gradients the same across layers during training (top: smaller variance for higher layers).
Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

TYPE         Shapeset   MNIST   CIFAR-10   ImageNet
Softsign     16.27      1.64    55.78      69.14
Softsign N   16.06      1.72    53.8       68.13
Tanh         27.15      1.76    55.9       70.58
Tanh N       15.60      1.64    52.92      68.57
Sigmoid      82.61      2.21    57.28      70.66

Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.
Figure 11 shows the error curve for the supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3 × 2, because of the task difficulty, we observe important saturations during learning; this might explain why the effects of the normalized initialization or the softsign are more visible there.

Several conclusions can be drawn from these error curves:

• The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.

• The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.

• For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, we can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate, as sketched below. Both of these methods were applied to Shapeset-3 × 2 with hyperbolic tangent and standard initialization. We observed a gain in performance, but not one reaching the result obtained with normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian can then focus on discrepancies between units, without having to correct important initial discrepancies between layers.
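As one illustration of this idea, here is a minimal sketch of a per-parameter learning rate driven by a running gradient second-moment estimate; the class name, decay constant, and epsilon are illustrative choices, not the exact rule used in the experiments.

```python
import numpy as np

class GradVarianceLR:
    """Minimal sketch of a per-parameter learning rate: each parameter's
    step is scaled by a running estimate of its gradient's second moment,
    so parameters with consistently large gradients take smaller steps.
    The decay constant and epsilon are illustrative choices."""

    def __init__(self, shape, lr=0.01, decay=0.99, eps=1e-8):
        self.lr, self.decay, self.eps = lr, decay, eps
        self.second_moment = np.zeros(shape)

    def step(self, params, grad):
        # Exponential moving estimate of E[g^2] for every parameter.
        self.second_moment = (self.decay * self.second_moment
                              + (1.0 - self.decay) * grad ** 2)
        # Effective learning rate lr / sqrt(E[g^2] + eps), per parameter.
        return params - self.lr * grad / np.sqrt(self.second_moment + self.eps)

# Usage: one update on hypothetical weights and gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=(5, 3))
opt = GradVarianceLR(w.shape)
w = opt.step(w, rng.normal(size=w.shape))
```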
In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

The other conclusions from this study are the following:

• Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep networks.