Training Neural Networks - Practical Issues
Practical Issues
M. Soleymani
Sharif University of Technology
Fall 2017
Most slides have been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017,
and some from Andrew Ng's lectures, “Deep Learning Specialization”, Coursera, 2017.
Outline
• Activation Functions
• Data Preprocessing
• Weight Initialization
• Checking gradients
Neuron
Activation functions
Activation functions: sigmoid
• Squashes numbers to range [0,1]
• Historically popular since they have nice interpretation as a
saturating “firing rate” of a neuron
σ(x) = 1 / (1 + e^(−x))
Problems:
1. Saturated neurons “kill” the gradients
Activation functions: sigmoid
f(∑ᵢ wᵢxᵢ + b)
• What can we say about the gradients on w?
Activation functions: sigmoid
• Consider what happens when the input to
a neuron is always positive...
f(∑ᵢ wᵢxᵢ + b)
Activation functions: sigmoid
• Squashes numbers to range [0,1]
• Historically popular since they have nice interpretation as a saturating
“firing rate” of a neuron
σ(x) = 1 / (1 + e^(−x))
3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit computationally expensive
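A minimal NumPy sketch (not from the slides) of problem 1: the sigmoid's local gradient σ(x)(1 − σ(x)) peaks at 0.25 and vanishes in the tails, so a saturated neuron passes almost no gradient backward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); maximal value 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25 -- the best case
print(sigmoid_grad(10.0))  # ~4.5e-05 -- a saturated neuron "kills" the gradient
```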
Activation functions: tanh
[LeCun et al., 1991]
• Squashes numbers to range [-1,1]
tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x)) = 2σ(2x) − 1
• zero-centered (nice)
• still kills gradients when saturated :(
Activation functions: ReLU (Rectified Linear Unit)
[Krizhevsky et al., 2012]
• Does not saturate (in the positive region)
• Computationally efficient
• Converges much faster than
sigmoid/tanh in practice! (e.g. 6x)
• Problem: ReLU units can “die” — a unit whose input is always negative outputs 0 and gets zero gradient, so its weights never update
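A hypothetical NumPy sketch of the ReLU and its (sub)gradient: inputs ≤ 0 produce zero output and zero gradient, which is the mechanism behind “dead” ReLU units.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 1. 1.] -- inputs <= 0 contribute no gradient
```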
Data Preprocessing
• Remove mean: subtract from each input dimension the mean of that dimension over the training set
– xᵢ ← xᵢ − μᵢ
• Normalize variance:
– xᵢ ← xᵢ / σᵢ
• If we normalize the training set, we use the same μ and σ² to normalize the test
data too
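A short sketch of the preprocessing step above, on hypothetical toy data: the statistics μ and σ come from the training set only, and the same values are reused on the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 20))  # toy training data
X_test = rng.normal(loc=5.0, scale=3.0, size=(200, 20))    # toy test data

# Statistics are computed on the training set only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_n = (X_train - mu) / sigma
# ...and the SAME mu and sigma are reused to normalize the test set
X_test_n = (X_test - mu) / sigma
```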
Why zero-mean input?
• Reminder: with the sigmoid, all-positive inputs force all gradients on w to share the same sign (slow, zig-zagging updates)
Speed up training
• Zero-mean, unit-variance inputs speed up training
• Small-scale random weight initialization works ~okay for small networks, but causes problems with deeper networks.
Multi-layer network
a^[0] = x,  z^[l] = W^[l] a^[l−1],  a^[l] = f^[l](z^[l])  for l = 1, …, L
Output = a^[L] = f^[L](W^[L] f^[L−1](W^[L−1] ⋯ f^[1](W^[1] x)))
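The forward recurrence above can be sketched in a few lines of NumPy; the layer widths here are arbitrary placeholders.

```python
import numpy as np

def forward(x, weights, f):
    """Forward pass: a[0] = x, then z[l] = W[l] @ a[l-1], a[l] = f(z[l])."""
    a = x
    for W in weights:
        a = f(W @ a)
    return a

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 3]  # hypothetical layer widths: input 4, two hidden, output 3
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
out = forward(rng.standard_normal(4), weights, np.tanh)
print(out.shape)  # (3,)
```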
Vanishing/exploding gradients
∂Eₙ/∂W^[l] = ∂Eₙ/∂a^[l] × ∂a^[l]/∂W^[l] = δ^[l] × a^[l−1] × f′(z^[l])
Example:
W^[1] = ⋯ = W^[L] = ωI, f(z) = z ⇒
δ^[1] = (ωI)^(L−1) δ^[L] = ω^(L−1) δ^[L]
so the backpropagated signal shrinks exponentially with depth when ω < 1 (vanishing) and grows exponentially when ω > 1 (exploding).
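A quick numerical illustration of the ω^(L−1) factor in the example above, for an arbitrary depth of 50:

```python
# With W^[1] = ... = W^[L] = w*I and linear activations, delta^[1] = w**(L-1) * delta^[L]:
# the backpropagated signal shrinks or grows exponentially in depth.
L = 50
vanish = 0.9 ** (L - 1)   # w < 1: gradient vanishes
explode = 1.1 ** (L - 1)  # w > 1: gradient explodes
print(vanish, explode)  # ~0.0057 vs ~106.7
```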
Let's look at some activation statistics
• E.g. 10-layer net with 500 neurons on each layer
– using tanh non-linearities
– Initialization: gaussian with zero mean and 0.01 standard deviation
Let's look at some activation statistics
• Initialization: gaussian with zero mean and 0.01 standard deviation
– activations in deeper layers collapse toward zero, so the backpropagated gradients vanish
Xavier initialization
• Initialization: gaussian with zero mean and 1/√fan_in standard deviation
• Reasonable initialization.
(Mathematical derivation
assumes linear activations)
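The slides' activation-statistics experiment can be reproduced with a short simulation (a sketch, assuming the 10-layer, 500-unit tanh setup described above): the 0.01-std initialization collapses activations toward zero, while Xavier keeps their spread roughly stable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 500))  # a batch of unit-gaussian inputs
results = {}

for scheme in ("small", "xavier"):
    a = X
    stds = []
    for _ in range(10):  # 10 layers, 500 units each, tanh non-linearity
        fan_in = a.shape[1]
        if scheme == "small":
            W = 0.01 * rng.standard_normal((fan_in, 500))
        else:  # Xavier: zero mean, 1/sqrt(fan_in) standard deviation
            W = rng.standard_normal((fan_in, 500)) / np.sqrt(fan_in)
        a = np.tanh(a @ W)
        stds.append(float(a.std()))
    results[scheme] = stds

print(results["small"][-1])   # activations collapse toward 0
print(results["xavier"][-1])  # activations keep a healthy spread
```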
Check implementation of Backpropagation
• Numerically approximate each derivative with a centered difference:
(f(x + h) − f(x − h)) / (2h),  error O(h²) when h → 0
• Example: f(x) = x⁴
Check gradients and debug backpropagation:
• θ: vector containing all parameters of the network (W^[1], …, W^[L] are vectorized
and concatenated)
• J′[i] = ∂J/∂θᵢ (i = 1, …, p) is found via backprop
• For each i, compute the numerical gradient J′ₙ:
– J′ₙ[i] = (J(θ₁, …, θᵢ₋₁, θᵢ + h, θᵢ₊₁, …, θₚ) − J(θ₁, …, θᵢ₋₁, θᵢ − h, θᵢ₊₁, …, θₚ)) / (2h)
• Compute the relative error ε = ‖J′ − J′ₙ‖₂ / (‖J′‖₂ + ‖J′ₙ‖₂)
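The procedure above can be sketched as follows, using the toy objective J(θ) = Σ θᵢ⁴ (whose analytic gradient 4θ³ plays the role of the backprop result):

```python
import numpy as np

def numerical_grad(J, theta, h=1e-7):
    """Centered-difference approximation of dJ/dtheta, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * h)
    return grad

def rel_error(g_analytic, g_numeric):
    # epsilon = ||J' - J'_n||_2 / (||J'||_2 + ||J'_n||_2)
    return (np.linalg.norm(g_analytic - g_numeric)
            / (np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric)))

theta = np.array([0.5, -1.0, 2.0])
g_analytic = 4 * theta ** 3  # stands in for the backprop gradient
g_numeric = numerical_grad(lambda t: np.sum(t ** 4), theta)
print(rel_error(g_analytic, g_numeric))  # well below 1e-7 for a correct gradient
```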
CheckGrad
• with e.g. h = 10⁻⁷:
– ε < 10⁻⁷ (great)
– ε ≈ 10⁻⁵ (may need a check)
• okay for objectives with kinks (e.g. ReLU); but if there are no kinks, then 10⁻⁴ is too high
– ε ≈ 10⁻³ (concern)
– ε > 10⁻² (probably wrong)
• Use double precision and stay within the active range of floating point
• Be careful with the step size h (not so small that it causes underflow)
• Gradient check after a burn-in of training: after running gradcheck at random
initialization, run it again once you've trained for some number of iterations
Before learning: sanity checks (Tips/Tricks)
• Look for the correct loss at chance performance (e.g. a 10-class softmax classifier should start near loss = ln 10 ≈ 2.3).
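This check can be sketched for a hypothetical 10-class softmax: with near-zero initial scores, every class gets probability 1/10, so the cross-entropy loss should be close to −ln(1/10) = ln 10 ≈ 2.3026.

```python
import numpy as np

num_classes = 10
scores = np.zeros(num_classes)  # tiny initial weights -> near-zero scores
probs = np.exp(scores) / np.sum(np.exp(scores))  # uniform: 1/10 each
loss = -np.log(probs[3])  # cross-entropy loss for an arbitrary true class
print(loss)  # ~2.3026 -- anything far from this suggests a bug
```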