Optimization Algorithms - Part 2 - Parveen Khurana - Medium
In this article, we will try to answer the question below, and we will also look at some algorithms other than Gradient Descent for the update rule.
https://medium.com/@prvnk10/optimization-algorithms-part-2-7b21da62f5c7 1/42
5/2/2020 Optimization Algorithms — Part 2 - Parveen Khurana - Medium
As is clear from the red-underlined part in the above image, we are making multiple passes over the data. In each iteration/pass, we look at all the data points (blue-underlined part) and accumulate the gradients for all of them; once we have completed one pass over the data (also termed an epoch), we update the weights.
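In code, the loop just described might look like this (a minimal sketch for a single sigmoid neuron with squared-error loss; the toy setup and names are ours, not taken from the article's screenshots):

```python
import math

def sigmoid(x, w, b):
    # Prediction of a single sigmoid neuron for a scalar input x
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def grad_w(x, y, w, b):
    # d/dw of the squared-error loss (1/2)(f(x) - y)^2 for one point
    fx = sigmoid(x, w, b)
    return (fx - y) * fx * (1.0 - fx) * x

def grad_b(x, y, w, b):
    fx = sigmoid(x, w, b)
    return (fx - y) * fx * (1.0 - fx)

def batch_gradient_descent(X, Y, epochs=300, eta=1.0):
    w, b = 0.0, 0.0
    for _ in range(epochs):            # one epoch = one pass over all the data
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):         # accumulate gradients over EVERY point
            dw += grad_w(x, y, w, b)
            db += grad_b(x, y, w, b)
        w -= eta * dw                  # a single update per epoch
        b -= eta * db
    return w, b
```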
Our overall loss value is going to be the sum of the loss value over all the
data points and hence the derivative of the overall loss can also be written
as the sum of the derivative of the loss over all the data points:
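In symbols, writing the loss on the i-th of the N data points as L_i:

```latex
L(w, b) = \sum_{i=1}^{N} L_i(w, b)
\quad\Longrightarrow\quad
\frac{\partial L}{\partial w} = \sum_{i=1}^{N} \frac{\partial L_i}{\partial w}
```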
We update the parameters once per pass over the entire data. Now, say we have one million data points: first we compute the partial derivatives of the loss with respect to the parameters for each of the one million points, accumulate all of those into a gradient, and only then make one update of the parameters. This is computationally expensive.
So, we replace the total derivative with a sum over fewer points, i.e. with the sum of the derivatives of the 'B' points that we are considering at a time:
So, in one pass over the data, we would see many such batches (each of size 'B'; the last one could have fewer than 'B' points). For example, for a total of 'M' data points and a batch size of 'B', we would have nearly M/B batches, and we update the weights that many times in just one epoch (one pass over the entire dataset). This is termed mini-batch Gradient Descent, as we are looking at mini-batches of the data.
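As a sketch (our own toy version for a single sigmoid neuron with squared-error loss; the names are illustrative), mini-batch GD looks like:

```python
import numpy as np

def minibatch_gradient_descent(X, Y, batch_size=10, epochs=50, eta=0.1):
    # One update per mini-batch, so roughly M/B updates per epoch
    # (the last batch may be smaller than B).
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    w, b = 0.0, 0.0
    M = len(X)
    for _ in range(epochs):
        for start in range(0, M, batch_size):
            xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
            fx = 1.0 / (1.0 + np.exp(-(w * xb + b)))
            err = (fx - yb) * fx * (1.0 - fx)     # per-point derivative factor
            w -= eta * float(np.sum(err * xb))    # gradient summed over this batch only
            b -= eta * float(np.sum(err))
    return w, b
```

For M = 1005 and B = 100 this gives 11 batches per epoch (ten of size 100 and one of size 5), so 11 updates per epoch instead of 1.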
We could take this idea to the extreme: compute the derivative for each data point and update the parameters after every single data point. In other words, our batch size is 1. This is termed Stochastic Gradient Descent.
The advantage of this approach is that we make many updates in one epoch, so we can move/converge faster than with vanilla Gradient Descent. The downside is that, whether we're using mini-batch or stochastic GD, we are doing an approximation: we are not computing the true derivative, because the true loss is the total loss over all the data points, and the true gradient would be the sum of the partial derivatives with respect to all the data points.
As is clear from the above image, the error curve is smooth, without any oscillations.
We have set the mini-batch size to 10. We go over the data, computing and accumulating the partial derivatives while keeping a count of the number of data points seen; as soon as that count reaches the mini-batch size, we make an update.
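That counting logic can be sketched as follows (illustrative names; the per-point derivative functions grad_w and grad_b are assumed to be supplied by the caller, as in the earlier articles):

```python
def do_minibatch_gd(X, Y, grad_w, grad_b, batch_size=10, epochs=500, eta=0.01):
    # grad_w / grad_b are per-point partial-derivative functions.
    w, b = 0.0, 0.0
    for _ in range(epochs):
        dw, db, points_seen = 0.0, 0.0, 0
        for x, y in zip(X, Y):
            dw += grad_w(x, y, w, b)             # accumulate the partial derivatives
            db += grad_b(x, y, w, b)
            points_seen += 1                     # count the points seen so far
            if points_seen % batch_size == 0:    # a full mini-batch: make an update
                w -= eta * dw
                b -= eta * db
                dw, db = 0.0, 0.0                # reset the accumulators
    return w, b
```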
The red curve is for mini-batch and the blue curve is for stochastic GD.
Once again we can see oscillations that never occurred in vanilla (full-batch) GD but occur in both stochastic and mini-batch GD.
In case of full batch GD, we look at all the data, we compute the partial
derivative of the loss function for every data point, then we take the sum
and make the updates.
In the case of Stochastic GD, we look at one data point, compute the derivative, and update the parameters; in the case of mini-batch GD, we look at the data in batches of size 'B'.
Terminology:
We could also have stochastic and mini-batch versions of Momentum-based GD and NAG.
Everything in the code remains the same except we are making the updates
for every data point that we see.
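The per-point version of the momentum update can be sketched as follows (a minimal version with illustrative names; grad_w and grad_b are per-point derivative functions supplied by the caller):

```python
def do_stochastic_momentum_gd(X, Y, grad_w, grad_b, eta=0.005, gamma=0.9, epochs=200):
    w, b = 0.0, 0.0
    v_w, v_b = 0.0, 0.0                      # momentum "history" terms
    for _ in range(epochs):
        for x, y in zip(X, Y):
            # identical to batch momentum, except we update after EVERY point
            v_w = gamma * v_w + eta * grad_w(x, y, w, b)
            v_b = gamma * v_b + eta * grad_b(x, y, w, b)
            w -= v_w
            b -= v_b
    return w, b
```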
Error surface for the stochastic version of momentum and NAG looks like:
The red curve is for NAG and the black one represents momentum-based GD.
So, we have the inputs and their corresponding weights and the sigmoid
equation looks like:
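Writing the weighted sum of the inputs as the pre-activation, the sigmoid output is:

```latex
f(x) = \frac{1}{1 + e^{-\left(\sum_{i} w_i x_i + b\right)}}
```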
We discussed in an earlier article that the derivative of the loss function with respect to a weight is proportional to the input:
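Concretely, assuming the squared-error loss on a single point, L = (1/2)(f(x) - y)^2, the chain rule gives:

```latex
\frac{\partial L}{\partial w} = (f(x) - y)\, f(x)\,(1 - f(x))\; x
```

so the derivative carries a factor of the input x.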
In real-world data, many features are sparse, i.e. they take on the value 0 for most of the training inputs. For example, suppose we have a feature 'isActorMattDamon' for a movie classification task (whether we like the movie or not), and we have data for about 1000 movies; out of those 1000, Matt Damon would have done 30–40 movies at most. That means that for about 960 of the 1000 entries, the value of the feature 'isActorMattDamon' is going to be 0. The implication of this is brought out in the below image:
The derivative depends on the input 'x', and the input would be 0 for most of the data points in the scenario discussed above. That means the corresponding derivative is also going to be 0, and if the derivative is 0 then the weight update is 0. If the weights are not changing, we have a problem: the parameters, and consequently the loss value, remain the same.
So, in the case where we have sparse features, just like the above scenario where the input is 'on' (has the value 1) only 30–40 times out of 1000, we need to make sure that for those 30–40 data points the derivative gets boosted with a larger learning rate.
In other words, we want the learning rate to be adaptive to the features we are dealing with: when we are dealing with sparse features the learning rate should be high, and on the other hand, if we are dealing with dense features (the feature is 'on' most of the time) we want the learning rate to be small.
Adagrad:
Adagrad provides this adaptive learning rate by dividing the learning rate by a quantity built from the history of the gradients for that feature's parameter.
If we focus on the yellow-highlighted terms in the below image, it is just the same update as in vanilla GD; we are dividing the learning rate eta by some quantity (in red in the below image).
By doing this, we want to ensure two things: for the parameters which have received a lot of updates, the denominator term (in red in the above image) should be high, so that the effective learning rate becomes small; and the denominator should be small if we have received very few updates in the past, so that the overall learning rate is boosted.
The denominator term ‘vt’ is the running sum of the squares of the
derivatives:
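In its standard form (with the derivative at step t written as the gradient of w, and epsilon a small constant to avoid division by zero):

```latex
v_t = v_{t-1} + (\nabla w_t)^2, \qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, \nabla w_t
```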
Say we are going over the data points one at a time, and for the first 10 data points the feature value was 1. In that case, the derivative would not be zero for any of those data points, and consequently we have some high value for 'vt', as we are taking the square of the derivative in the equation for 'vt'. So, for the 11th data point, as the denominator is now large, the effective learning rate (in green in the below image) becomes small, which is exactly what we want for a dense feature.
In the case of a sparse feature, most of the derivatives would be 0. Say the derivative is non-zero for the k-th data point, and 0 for all the (k-1) points before it; then the accumulated 'vt' would still be close to 0 when we reach the k-th point, and the learning rate, divided by this small history, becomes very large compared to what it was for the dense feature.
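A small sketch (our own illustration, with an assumed eta of 0.1) of how the effective learning rate differs for a dense and a sparse feature:

```python
import math

def adagrad_effective_lr(grads, eta=0.1, eps=1e-8):
    # Effective learning rate eta / (sqrt(v_t) + eps) after each gradient
    v, lrs = 0.0, []
    for g in grads:
        v += g ** 2                          # running sum of squared derivatives
        lrs.append(eta / (math.sqrt(v) + eps))
    return lrs

# Dense feature: a non-zero gradient at every one of 11 steps
dense = adagrad_effective_lr([0.5] * 11)
# Sparse feature: gradient is 0 for 10 steps, non-zero only at the 11th
# (while the gradient is 0 the update is 0 anyway, so only the 11th step matters)
sparse = adagrad_effective_lr([0.0] * 10 + [0.5])
```

At the 11th step the sparse feature still has a tiny history, so its effective learning rate is much larger than the dense feature's.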
Visualizing Adagrad:
The black curve in the above image corresponds to GD, the red one
corresponds to Momentum based GD and the blue one corresponds to NAG
GD.
We started with the point marked in the below image, and we should have reached the minima, which is at the center of the valley.
Our 'b' value must increase up to the black-circled point in the below image, and 'w' must increase up to the blue-circled point (together they are represented by the red-circled point in the below image).
Ideally, both the parameters should have changed parallelly and followed
the path in yellow in the below image to reach the minima:
Instead, what we see is that initially 'w' is not changing at all; only 'b' is changing, and it has changed from -2 to 0 whereas 'w' is still almost the same.
This is the case where 'w' is associated with a sparse feature, while 'b' is always going to have some value, so we can consider it a dense feature; that is why it was able to move wherever it had to. Now, at this point, we have reached a good value for 'b': there is nothing more we can do by changing 'b', and the only way to reach the minima is to change the value of 'w'.
So, for the initial 30–40 epochs, only 'b' is changing here, with almost no change in the value of 'w', and thereafter a large number of epochs is still required to achieve the optimum value of 'w'. This is the problem with the other algorithms that we discussed in previous articles.
Now let's see what the movement looks like with Adagrad:
The green curve represents the movement for Adagrad. Now we can see that both parameters are changing simultaneously, which is the result of the adaptive learning rate.
Limitation of Adagrad:
In the case of Momentum-based GD and NAG, after some 100 iterations we were able to reach the minima, even though for the first 30–40 iterations they were just moving in the direction of 'b' and not making any changes to 'w'. What happens in the case of Adagrad is that the learning rate decays aggressively for the frequently updated parameter: as 'b' gets updated frequently, its history component grows a lot, because of which its effective learning rate can go close to 0.
So, after a while, when we have made a lot of updates to 'b', we are not able to make any more updates to it. Even if we run this algorithm for 100–200 iterations, it just gets near the minima but cannot reach the exact minima: to do so it has to move in the direction of 'b' (it has to move up), and it cannot, as the effective learning rate for 'b' has become very close to 0, so no updates are happening to 'b'.
We wanted to increase, even blow up, the learning rate for 'w', which has happened; but in the process we have also decreased the learning rate for 'b', and after a while it does not converge.
RMSProp:
RMSProp is another algorithm which operates on the below intuition:
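The intuition, written in the standard form (with a decay parameter beta, typically around 0.9): instead of Adagrad's running sum, keep an exponentially decaying average of the squared gradients:

```latex
v_t = \beta\, v_{t-1} + (1 - \beta)\,(\nabla w_t)^2, \qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, \nabla w_t
```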
We can see that 'v4' would be much smaller in the case of RMSProp: even though we are accumulating a history of gradients, we are not letting it grow unboundedly, since the exponential decay shrinks the contribution of older gradients. There would still be a relative difference between the value of 'vt' for dense features and for sparse features.
The brown curve represents the movement for RMSProp, and we can see that it eventually reaches the minima, unlike Adagrad (green curve).
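A quick numerical sketch of the difference (our own illustration): with a constant non-zero gradient, Adagrad's history grows without bound while RMSProp's saturates, so only Adagrad's effective learning rate collapses:

```python
def history_adagrad(grads):
    v, out = 0.0, []
    for g in grads:
        v = v + g ** 2                       # the sum keeps growing forever
        out.append(v)
    return out

def history_rmsprop(grads, beta=0.9):
    v, out = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g ** 2   # exponentially decayed average
        out.append(v)
    return out

g = [1.0] * 100                              # a dense, frequently-updated parameter
v_ada = history_adagrad(g)[-1]               # 100.0: effective lr has collapsed
v_rms = history_rmsprop(g)[-1]               # stays near 1.0: lr remains usable
```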
Adam Algorithm:
Adam combines the ideas of the two algorithms (Momentum-based GD and RMSProp): it ensures that the learning rate is adaptive, and also that the gradient term in the update is not the current derivative alone but a history of derivatives, so that we can move fast out of flat regions.
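A single Adam update can be sketched as follows (a minimal single-parameter version with the usual default hyperparameters; adam_step is our name, not from the article):

```python
import math

def adam_step(w, g, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for parameter w with gradient g at step t (t starts at 1)
    m = beta1 * m + (1 - beta1) * g            # history of gradients (momentum-like)
    v = beta2 * v + (1 - beta2) * g ** 2       # history of squared gradients (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```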
In practice, Adam also does bias correction (to ensure that the estimates are more stable at the beginning): instead of using 'mt' and 'vt' directly, it first computes them and then divides each by a correction factor.
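In standard notation (with typical defaults of 0.9 and 0.999 for the two decay parameters), the updates and bias corrections are:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,\nabla w_t, \qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,(\nabla w_t)^2
```

```latex
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad
w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
```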
Summary:
We have looked at several update rules (vanilla GD, Momentum-based GD, NAG, Adagrad, RMSProp, and Adam), with the last three having an adaptive learning rate. And we have seen three different strategies (batch, mini-batch, and stochastic) which can be used in combination with any of these algorithms.