Optimization
Some slides have been adapted from Fei-Fei Li and colleagues' lectures, CS231n, Stanford, 2017;
some from Andrew Ng's lectures, "Deep Learning Specialization", Coursera, 2017;
and a few from Geoffrey Hinton's lectures, "Neural Networks for Machine Learning", Coursera, 2015.
Outline
• Mini-batch gradient descent
• Gradient descent with momentum
• RMSprop
• Adam
• Learning rate and learning rate decay
• Batch normalization
Reminder: The error surface
• For a linear neuron with a squared error,
it is a quadratic bowl.
– Vertical cross-sections are parabolas.
– Horizontal cross-sections are ellipses.
• Batch size = 1: stochastic gradient descent
• Batch size = 32, 64, 128, 256, …: mini-batch gradient descent
• Batch size = n (the size of the training set): full-batch gradient descent
• Make sure that one mini-batch of training data, together with the forward and
backward activations that need to be cached, fits in CPU/GPU memory
Mini-batch gradient descent: loss-#epoch curve
Mini-batch gradient descent vs. full-batch
Poor conditioning
• A plateau is a region where the derivative stays close to zero for a long time
– Progress there is very slow
– Eventually, the algorithm can find its way off the plateau
Exponentially weighted moving average (a code sketch follows below):
v_0 = 0
v_1 = 0.9 v_0 + 0.1 θ_1
…
v_t = 0.9 v_{t−1} + 0.1 θ_t
• Since (1 − ε)^(1/ε) ≈ 1/e, v_t is roughly an average over the last 1/(1 − β) values:
– β = 0.9: 0.9^10 ≈ 1/e, an average over about 10 values
– β = 0.98: 0.98^50 ≈ 1/e, an average over about 50 values
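A minimal NumPy sketch of this exponentially weighted average; the noisy sequence θ and the β values are illustrative assumptions, not from the slides:

import numpy as np

def ewma(theta, beta):
    # v_t = beta * v_{t-1} + (1 - beta) * theta_t, starting from v_0 = 0
    v, out = 0.0, []
    for x in theta:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return np.array(out)

theta = 10.0 + np.random.randn(200)      # a noisy sequence to smooth (made up for illustration)
smooth_10 = ewma(theta, beta=0.9)        # roughly an average over 1/(1-0.9)  = 10 values
smooth_50 = ewma(theta, beta=0.98)       # roughly an average over 1/(1-0.98) = 50 values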
Gradient descent with momentum
• v_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– v_dW = β v_dW + (1 − β) dW
– W = W − α v_dW
• v_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– v_dW = β v_dW + (1 − β) dW    (alternative version: v_dW = β v_dW + dW)
– W = W − α v_dW
• The learning rate α would need to be tuned differently for these two versions (see the sketch below)
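A minimal sketch of both momentum variants (NumPy; the toy loss ‖W‖² and the hyperparameter values are assumptions for illustration):

import numpy as np

def momentum_step(W, dW, v, alpha=0.01, beta=0.9, scaled=True):
    # scaled=True : v = beta*v + (1 - beta)*dW   (version used above)
    # scaled=False: v = beta*v + dW              (unscaled version; alpha must be retuned)
    v = beta * v + ((1 - beta) * dW if scaled else dW)
    W = W - alpha * v
    return W, v

# usage: keep v across iterations, starting from zeros
W = np.random.randn(10, 5)
v = np.zeros_like(W)
for _ in range(100):
    dW = 2 * W                  # stand-in gradient of the toy loss ||W||^2 on a "mini-batch"
    W, v = momentum_step(W, dW, v)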
Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004
Sutskever et al., “On the importance of initialization and momentum in deep learning”, ICML 2013
Nesterov Momentum
Simple update rule
Momentum
Nesterov
Nesterov Accelerated Gradient (NAG)
• Nesterov accelerated gradient is slightly annoying: it requires the gradient at x_t + ρ v_t instead of at x_t
• We would like to evaluate the gradient and the loss at the same point (see the sketch below)
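A sketch of the Nesterov update in its "lookahead" form, where the gradient is taken at x_t + ρ v_t; the quadratic toy objective is an assumption:

import numpy as np

def nesterov_step(x, v, grad_f, alpha=0.01, rho=0.9):
    # look ahead to x + rho*v before evaluating the gradient
    g = grad_f(x + rho * v)
    v = rho * v - alpha * g
    x = x + v
    return x, v

# toy usage on f(x) = ||x||^2, whose gradient is 2x
x, v = np.ones(3), np.zeros(3)
for _ in range(50):
    x, v = nesterov_step(x, v, grad_f=lambda z: 2 * z)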
Adagrad
• S_dW = S_dW + dW²    (S_dW: running sum of the squares of the derivatives)
• W = W − α dW / (√S_dW + ε)
• Because S_dW only grows, the step size becomes very small over the course of this algorithm (a sketch follows below)
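A minimal Adagrad sketch matching the accumulation rule above (the ε value is the usual small constant, an assumption here):

import numpy as np

def adagrad_step(W, dW, S, alpha=0.01, eps=1e-8):
    # S starts at np.zeros_like(W) and is carried across iterations
    S = S + dW ** 2                          # S_dW = S_dW + dW^2 (only grows)
    W = W - alpha * dW / (np.sqrt(S) + eps)  # step shrinks as S accumulates
    return W, S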
RMSprop
• Updates in the vertical direction are divided by a much larger number, while updates
in the horizontal direction are divided by a smaller number (see the sketch below).
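A sketch of the RMSprop update this describes: a decaying average of squared gradients rescales each coordinate, so directions with large gradients take smaller steps (the β and ε defaults are assumptions):

import numpy as np

def rmsprop_step(W, dW, S, alpha=0.001, beta=0.9, eps=1e-8):
    S = beta * S + (1 - beta) * dW ** 2        # running mean of squared gradients
    W = W - alpha * dW / (np.sqrt(S) + eps)    # large-gradient directions get divided by more
    return W, S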
• V_dW = 0, S_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– V_dW = β1 V_dW + (1 − β1) dW
– S_dW = β2 S_dW + (1 − β2) dW²
– W = W − α V_dW / (√S_dW + ε)
• Problem: this may cause very large steps at the beginning
• Solution: bias correction, to compensate for the first and second moment estimates starting at zero
Adam (full form)
• Combines the RMSprop and momentum ideas
• V_dW = 0, S_dW = 0
• On each iteration:
– Compute dW on the current mini-batch
– V_dW = β1 V_dW + (1 − β1) dW
– S_dW = β2 S_dW + (1 − β2) dW²
– V_dW^corrected = V_dW / (1 − β1^t)
– S_dW^corrected = S_dW / (1 − β2^t)
– W = W − α V_dW^corrected / (√(S_dW^corrected) + ε)
(a code sketch follows below)
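A sketch of the full Adam step above, including bias correction (t is the 1-based iteration counter):

import numpy as np

def adam_step(W, dW, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    V = beta1 * V + (1 - beta1) * dW            # first moment (momentum)
    S = beta2 * S + (1 - beta2) * dW ** 2       # second moment (RMSprop)
    V_c = V / (1 - beta1 ** t)                  # bias-corrected first moment
    S_c = S / (1 - beta2 ** t)                  # bias-corrected second moment
    W = W - alpha * V_c / (np.sqrt(S_c) + eps)
    return W, V, S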
Adam: hyperparameter choice
• α: needs to be tuned
• β1 = 0.9 (momentum term)
• β2 = 0.999 (RMSprop / second-moment term)
• ε = 10^−8
• Bias correction as above
Adam
Example
RMSprop, AdaGrad, and Adam adaptively tune the size of the update vector per parameter;
they are also known as methods with adaptive learning rates
Source: http://cs231n.github.io/neural-networks-3/
Example
Source: http://cs231n.github.io/neural-networks-3/
Learning rate
• SGD, SGD+Momentum, Adagrad, RMSprop, and Adam all have the learning
rate as a hyperparameter.
• The learning rate is an important hyperparameter and is usually the first one to adjust
Which one of these learning rates is best to use?
Choosing the learning rate
• If the loss is not going down, the learning rate is too low
• Track the ratio between the size of the updates and the size of the values, e.g., 0.0002 / 0.02 = 0.01 (about okay);
you want this to be somewhere around 0.001 or so (a sketch for monitoring this follows below)
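A rough way to monitor that update-to-value ratio during training; this is a sketch, with W and dW standing in for whichever parameter tensor and gradient you track:

import numpy as np

def update_to_value_ratio(W, dW, lr):
    # norm of the SGD update divided by the norm of the parameter values; ~1e-3 is the target
    update = -lr * dW
    return np.linalg.norm(update) / np.linalg.norm(W)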
First-order optimization
Second-order optimization
• What is nice about this update? No hyperparameters! No learning rate!
Second-order optimization
• If you can afford to do full-batch updates, then try out L-BFGS (and
don't forget to disable all sources of noise); a toy example follows below
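As an illustration, a full-batch, noise-free toy problem solved with SciPy's L-BFGS; the quadratic objective is an assumption for the example:

import numpy as np
from scipy.optimize import minimize

# toy full-batch objective: f(w) = ||A w - b||^2, with its exact gradient (no sampling noise)
A = np.random.randn(100, 10)
b = np.random.randn(100)
f    = lambda w: np.sum((A @ w - b) ** 2)
grad = lambda w: 2 * A.T @ (A @ w - b)

res = minimize(f, x0=np.zeros(10), jac=grad, method="L-BFGS-B")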
Before moving on to generalization
in the next lecture
We now look at an important technique that helps speed up training
while also acting as a form of regularization
Recall: Normalizing inputs
• On the training set, compute the mean and variance of each input feature:
– μ_i = (1/N) Σ_{n=1}^{N} x_i^(n)
– σ_i² = (1/N) Σ_{n=1}^{N} (x_i^(n) − μ_i)²
• Remove the mean: subtract from each input the mean of the corresponding feature
– x_i ← x_i − μ_i
• Normalize the variance:
– x_i ← x_i / σ_i
(a code sketch follows below)
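A sketch of the per-feature normalization above; X is an N×D training matrix, and the same μ and σ must be reused on validation/test data:

import numpy as np

def fit_normalizer(X):
    # per-feature mean and standard deviation, computed on the training set only
    return X.mean(axis=0), X.std(axis=0)

def normalize(X, mu, sigma, eps=1e-8):
    # remove the mean, then scale to unit variance, feature by feature
    return (X - mu) / (sigma + eps)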
Batch normalization
• “You want unit Gaussian activations? Just make them so.”
• Can we normalize the hidden activations a[l] too, not just the inputs?
(figure: a feed-forward network, z[1] = W[1] a[0], a[1] = f[1](z[1]), …, z[L] = W[L] a[L−1], a[L] = f[L](z[L]) = output)
• γ[l] and β[l] are learnable parameters of the model
Batch normalization
[Ioffe and Szegedy, 2015]
• μ[l] and σ²[l] are the mean and variance of z[l] over the current mini-batch
• z_norm^(i)[l] = (z^(i)[l] − μ[l]) / √(σ²[l] + ε)    (keeps the activations in the linear region of tanh)
• Allow the network to squash the range if it wants to:
– z̃^(i)[l] = γ[l] z_norm^(i)[l] + β[l]
• Keep an exponentially weighted average of the mini-batch means for use at test time:
– μ[l] = 0
– for i = 1…m (mini-batches): μ[l] = β μ[l] + (1 − β) · (1/b) Σ_{j=1}^{b} X_j
(a code sketch follows below)
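A training-time batch-norm forward pass following these formulas, with running averages of the mini-batch statistics kept for test time; the shapes and the averaging constant are illustrative assumptions:

import numpy as np

def batchnorm_forward(Z, gamma, beta, run_mu, run_var, momentum=0.9, eps=1e-8):
    # Z has shape (batch size b, hidden units); gamma, beta are learnable per-unit parameters
    mu = Z.mean(axis=0)                        # mini-batch mean      mu[l]
    var = Z.var(axis=0)                        # mini-batch variance  sigma^2[l]
    Z_norm = (Z - mu) / np.sqrt(var + eps)     # z_norm
    Z_out = gamma * Z_norm + beta              # z~ = gamma * z_norm + beta
    # exponentially weighted averages of the batch statistics, used at test time
    run_mu = momentum * run_mu + (1 - momentum) * mu
    run_var = momentum * run_var + (1 - momentum) * var
    return Z_out, run_mu, run_var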
How does Batch Norm work?
• Learning on shifting input distributions: as the earlier layers' weights change during training,
the distribution of the inputs seen by a later layer (e.g., the layer with W[5]) keeps shifting