
Optimization Algorithms — Part 2


Parveen Khurana
Feb 4 · 15 min read

This article covers the content discussed in the Optimization Algorithms (Part 2) module of the Deep Learning course, and all the images are taken from the same module.

In this article, we will try to answer the question below, and we will also look at some algorithms other than Gradient Descent for the update rule.


The idea of stochastic and mini-batch gradient descent:


Python code for the Gradient Descent algorithm that we have been using so
far in the previous articles is as follows:
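The snippet is shown as an image in the original post. As a stand-in, here is a minimal runnable sketch of the same idea, assuming the single sigmoid neuron with parameters 'w' and 'b' and squared-error loss used in the earlier articles (the toy data and the helper names f, grad_w, grad_b are illustrative, not the author's exact code):

```python
import numpy as np

# Toy data (illustrative): inputs X and targets Y
X = np.asarray([0.5, 2.5])
Y = np.asarray([0.2, 0.9])

def f(w, b, x):
    # Sigmoid neuron output
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def grad_w(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx) * x

def grad_b(w, b, x, y):
    fx = f(w, b, x)
    return (fx - y) * fx * (1 - fx)

def do_gradient_descent(max_epochs=1000, eta=1.0):
    w, b = -2.0, -2.0
    for epoch in range(max_epochs):        # multiple passes over the data
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):             # look at ALL the data points
            dw += grad_w(w, b, x, y)       # accumulate the gradients
            db += grad_b(w, b, x, y)
        w = w - eta * dw                   # update once per pass (epoch)
        b = b - eta * db
    return w, b
```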


As is clear from the red-underlined part in the above image, we are making multiple passes over the data. In each pass, we look at all the data points (blue-underlined part) and accumulate their gradients, and once we have completed one pass over the data (which is also termed one epoch), we update the weights.

Our overall loss is the sum of the loss values over all the data points, and hence the derivative of the overall loss can also be written as the sum of the derivatives of the loss over all the data points:
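In symbols (a reconstruction consistent with the text, where $\mathcal{L}_i$ is the loss on the $i$-th of $N$ data points):

$$\mathcal{L}(\theta) = \sum_{i=1}^{N} \mathcal{L}_i(\theta) \quad\Rightarrow\quad \nabla_{\theta}\,\mathcal{L}(\theta) = \sum_{i=1}^{N} \nabla_{\theta}\,\mathcal{L}_i(\theta)$$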


We update the parameters once per pass over the entire data. Now, if we have one million data points, we first compute the partial derivatives of the parameters for each of the one million points, accumulate all of those into a gradient, and only then make one update of the parameters. This is computationally expensive.

We could choose to do an approximation here. Instead of looking at all the data points, we look at a few of them: we shuffle the data randomly, take, say, the first 'B' data points, compute the partial derivatives for each of those 'B' points, accumulate them into a gradient, and make an update to the parameters.

So, wherever we had the total derivative, we have replaced it with fewer points, i.e. with the sum of the derivatives of the 'B' points that we are considering at a time:


So, in one pass over the data, we would see many such batches (each of size 'B'; the last one could have fewer than 'B' points). For example, for a total of 'M' data points and a batch size of 'B', we would have roughly M/B batches, and we update the weights that many times in just one epoch (one pass over the entire dataset). This is termed mini-batch Gradient Descent, as we are looking at mini-batches of data.

We could take this idea to the extreme and compute the derivative for each data point and update the parameters after each data point. In other words, our batch size is 1. This is termed Stochastic Gradient Descent.

The advantage of this approach is that we make many updates in one epoch, so we are able to move/converge faster than with vanilla Gradient Descent. The downside is that, whether we are using mini-batch or stochastic GD, we are doing an approximation: we are not computing the true derivative, because the true loss is the total loss over all the data points and the true gradient is the sum of the partial derivatives with respect to all the data points.

Error Surface for Stochastic GD:

Code for GD:

Code for Stochastic GD:
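Both snippets appear as images in the original post. As a stand-in, here is a minimal sketch of the stochastic version, reusing the toy data and the grad_w/grad_b helpers from the GD sketch above; the only change is that the update happens after every single data point (batch size 1):

```python
def do_stochastic_gradient_descent(max_epochs=1000, eta=1.0):
    w, b = -2.0, -2.0
    for epoch in range(max_epochs):
        for x, y in zip(X, Y):
            dw = grad_w(w, b, x, y)    # gradient of a SINGLE point
            db = grad_b(w, b, x, y)
            w = w - eta * dw           # update after every data point
            b = b - eta * db
    return w, b
```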




For Vanilla GD, we have the error surface as:

As is clear from the above image, the error curve is smooth, without any oscillations.

For stochastic GD, we have the error surface as:



Now we see a lot of oscillations here (and these oscillations are different from the oscillations in the case of Momentum and NAG GD).

Error Surface for Mini-batch GD:


We have set the mini-batch size to 10. We go over the data, computing and accumulating the partial derivatives while keeping a count of the number of data points seen, and as soon as we have seen a number of data points equal to the mini-batch size, we make an update.
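A minimal sketch of this mini-batch variant, again reusing the toy data and the grad_w/grad_b helpers from the GD sketch above (the counter logic mirrors the description; names and values are illustrative):

```python
def do_mini_batch_gradient_descent(max_epochs=1000, eta=1.0, mini_batch_size=10):
    w, b = -2.0, -2.0
    for epoch in range(max_epochs):
        dw, db, num_points_seen = 0.0, 0.0, 0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)            # accumulate gradients
            db += grad_b(w, b, x, y)
            num_points_seen += 1
            if num_points_seen % mini_batch_size == 0:
                w = w - eta * dw                # update after every B points
                b = b - eta * db
                dw, db = 0.0, 0.0               # reset the accumulators
    return w, b
```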

Here, the approximation would be better than with stochastic gradient descent, as we are looking at a batch of data points.


The red curve is for mini-batch and the blue curve is for stochastic GD.

Once again we can see that there are these oscillations that never occurred
in Vanilla GD or in full Batch GD but are occurring both in stochastic and
mini-batch GD.

As we increase the batch size, the oscillations become smaller and the curve more stable.

In practice, we choose a batch size such as 32, 64, or 128 depending on the amount of data that we have. That also ensures the approximation is better than with plain stochastic GD.


Epochs and Steps:


The difference between batch, mini-batch, and stochastic GD is the amount of data that is taken into account when computing the derivatives.

In case of full batch GD, we look at all the data, we compute the partial
derivative of the loss function for every data point, then we take the sum
and make the updates.

In the case of stochastic GD, we look at one data point, compute the derivative and update the parameters; in the case of mini-batch GD, we look at the data in batches of size 'B'.

Terminology: one pass over the entire dataset is called an epoch, and every update of the parameters is called a step. So, for 'M' data points and a batch size of 'B', batch GD makes 1 step per epoch, mini-batch GD makes about M/B steps per epoch, and stochastic GD makes M steps per epoch.


We could also have stochastic and mini-batch versions of Momentum-based and NAG GD.


Everything in the code remains the same except we are making the updates
for every data point that we see.
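For instance, a minimal sketch of the stochastic version of Momentum-based GD, reusing the toy data and the grad_w/grad_b helpers from the first sketch (gamma is the usual momentum hyper-parameter; values are illustrative):

```python
def do_stochastic_momentum_gradient_descent(max_epochs=1000, eta=1.0, gamma=0.9):
    w, b = -2.0, -2.0
    v_w, v_b = 0.0, 0.0
    for epoch in range(max_epochs):
        for x, y in zip(X, Y):
            # history term plus the current gradient, applied after EVERY point
            v_w = gamma * v_w + eta * grad_w(w, b, x, y)
            v_b = gamma * v_b + eta * grad_b(w, b, x, y)
            w = w - v_w
            b = b - v_b
    return w, b
```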


Error surface for the stochastic version of momentum and NAG looks like:

The red curve is for NAG and the black one represents momentum-based GD.

Oscillations are smaller in the stochastic versions of Momentum and NAG GD compared to stochastic vanilla GD, as the history component in this case helps to stay on track to an extent.


Batch size is a hyper-parameter, and in practice, we use mini-batch GD with a batch size of 32, 64, or 128.

Why do we need an adaptive learning rate?


Let’s say we have a sigmoid neuron that takes in many inputs.

So, we have the inputs and their corresponding weights and the sigmoid
equation looks like:
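The equation in the image is the standard sigmoid neuron used throughout this series; reconstructing it here for reference:

$$y = f(x_1, \ldots, x_n) = \frac{1}{1 + e^{-\left(\sum_{i} w_i x_i + b\right)}}$$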


The update rule for each of the weights is going to be:
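Reconstructing the usual gradient descent update for a single weight (eta is the learning rate):

$$w_{t+1} = w_t - \eta \, \frac{\partial \mathcal{L}}{\partial w_t}$$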

The update is proportional to the derivative, and we have already discussed that if the derivative is small then the weight is not going to change much.

We discussed in an earlier article that the derivative of the loss function with respect to a weight is proportional to the input:
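For example, for a single sigmoid neuron with a squared-error loss (the setup from the earlier articles), the derivative with respect to a weight $w_i$ works out to the expression below, which carries a factor of the corresponding input $x_i$:

$$\frac{\partial \mathcal{L}}{\partial w_i} = (f(x) - y)\, f(x)\, \left(1 - f(x)\right)\, x_i$$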


The updates are proportional to the derivatives, and the derivative depends on the input.

In real-world data, many features are sparse, i.e. they take on the value 0 for most of the training inputs. For example, suppose we have a feature 'isActorMattDamon' for a movie classification task (whether we like the movie or not), and say we have data for 1000 movies or so. Out of those 1000, Matt Damon would have done 30–40 movies at most, which means that for about 960 data points the value of the feature 'isActorMattDamon' is going to be 0. The implication of this is brought out in the below image:


The derivative depends on the input 'x', and the input would be 0 for most of the data points in the scenario discussed above. That means the corresponding derivative is also going to be 0, and if the derivative is 0 then the weight update is 0, and if the weights are not changing then we have a problem, as the parameter, and consequently the loss value, would remain the same.

So, in the case where we have sparse features, just like in the above scenario where the input is 'on' (has the value 1) only 30–40 times out of 1000, we want to make sure that for those 30–40 data points the update gets boosted with a larger learning rate so that the weights change accordingly. We want a larger learning rate because the weights are not getting updated for most of the data points, so when they do get updated, we ensure they change by a large amount.

In other words, we want the learning rate to be adaptive to the features that we are dealing with: when we are dealing with sparse features the learning rate should be high, and on the other hand, if we are dealing with dense features (features that are 'on' most of the time) we want the learning rate to be small.

Adagrad:


The intuition behind Adagrad is that we should decay the learning rate in proportion to the update history of the parameter. So, if a parameter is getting updated very frequently (because its input is 'on' most of the time, hence the derivative is non-zero most of the time, hence the weight gets updated most of the time), we want a smaller learning rate for it. In contrast, for a parameter whose input is 'off' most of the time, the derivative is 0 most of the time, and whenever the input is 'on' we want the update to be larger, so we want a higher learning rate for such features.

Adagrad achieves this adaptive learning rate by dividing the learning rate by a history of the updates of that parameter.


If we focus on the yellow-highlighted terms in the below image, it is just the same update as for vanilla GD, except that we are dividing the learning rate eta by some quantity (in red in the below image).


By doing this, we are hoping to ensure two things: for the parameters which have received a lot of updates, the denominator term (in red in the above image) should be high so that the effective learning rate becomes small, and the denominator should be small if we have received very few updates in the past so that the effective learning rate is boosted.

The denominator term ‘vt’ is the running sum of the squares of the
derivatives:
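Reconstructing the Adagrad update (epsilon is a small constant added to avoid division by zero):

$$v_t = v_{t-1} + \left(\nabla w_t\right)^2, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, \nabla w_t$$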


Say we are going over the data points one at a time and for the first 10 data points the feature value was 1. In that case, the derivative would be non-zero for each of those data points, and consequently 'vt' would have a fairly high value, as we are summing the squares of the derivatives in the equation for 'vt'. So, for the 11th data point, as the denominator is now large, the effective learning rate (in green in the below image) becomes small, which is exactly what we want for a dense feature.


In the case of a sparse feature, most of the derivatives would be 0. Say the derivative is non-zero for the k-th data point and 0 for all the (k-1) points before it; then the accumulated 'vt' would be 0 up to this k-th data point, and the learning rate, when divided by this small history, would be very large compared to what it was for the dense features.
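Putting this together, a minimal sketch of Adagrad for the same two-parameter toy setup, reusing the toy data and the grad_w/grad_b helpers from the GD sketch (eta and eps are illustrative values):

```python
def do_adagrad(max_epochs=1000, eta=0.1, eps=1e-8):
    w, b = -2.0, -2.0
    v_w, v_b = 0.0, 0.0
    for epoch in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        v_w += dw ** 2                               # accumulate squared gradients
        v_b += db ** 2
        w = w - (eta / (np.sqrt(v_w) + eps)) * dw    # effective learning rate shrinks
        b = b - (eta / (np.sqrt(v_b) + eps)) * db    # in proportion to the history
    return w, b
```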

Visualizing Adagrad:


The black curve in the above image corresponds to GD, the red one
corresponds to Momentum based GD and the blue one corresponds to NAG
GD.

We started from the point marked in the below image, and we should have reached the minima, which is at the center of the valley.


Our 'b' value must increase up to the black-circled point in the below image and 'w' must increase up to the blue-circled point (overall, they are both represented by the red-circled point in the below image).

Ideally, both parameters should have changed in parallel and followed the path in yellow in the below image to reach the minima:


Instead, what we see is that initially 'w' is not changing at all and only 'b' is changing: it has moved from -2 to 0, whereas 'w' is still almost the same.


This is the case where 'w' is associated with a sparse feature, whereas 'b' is always going to have some value, so we can consider it a dense feature, which is why it was able to move wherever it had to move. Now, at this point, we have reached a good value for 'b'; there is nothing more we can do by changing 'b', and the only way to reach the minima is to change the value of 'w'.

And whenever we see a data point where the feature associated with 'w' is 'on', there would be a change, and if we train for a large number of epochs, eventually 'w' would move from a value of -2 to a value of 0, where the overall loss is minimal.


So, for the initial 30–40 epochs, only 'b' is changing here, with almost no change in the value of 'w', and thereafter a large number of epochs are still required to reach the optimum value of 'w'. This is the problem with the other algorithms that we discussed in previous articles.

Now let's see what the movement looks like with Adagrad:


The green curve represents the movement for Adagrad. Now we can see that both the parameters are changing simultaneously, which is the result of the adaptive learning rate.

Limitation of Adagrad:

In the case of Momentum-based GD and NAG, after some 100 iterations we were able to reach the minima, even though for the first 30–40 iterations they were just moving in the direction of 'b' and not making any changes to 'w'. What happens in the case of Adagrad is that, since the learning rate decays aggressively for a frequently updated parameter, and 'b' gets updated frequently, its history component grows a lot, because of which its effective learning rate could go close to 0.


So, after a while, once we have made a lot of updates to 'b', we are not able to make any more updates to it. Even if we run this algorithm for 100-200 iterations, it just gets near the minima but is not able to reach the exact minima: to reach the minima it has to move in the direction of 'b', i.e. move up, and it cannot, because the effective learning rate for 'b' has become very close to 0 and hence no updates are happening to 'b'.

We wanted to blow up the learning rate for 'w', which has happened, but in the process we have also decreased the learning rate for 'b', and after a while the algorithm stops converging.

RMSProp:
RMSProp is another algorithm which operates on the below intuition:


So, the history component is different in RMSProp compared to Adagrad.


In Adagrad, we add the square of every derivative, whereas in RMSProp, we take an exponentially decaying sum of the squares of the derivatives.

‘v4’ in case of Adagrad would look like:

whereas in case of RMSProp, it would look like:
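Expanding the two recurrences for the first four updates (a reconstruction; $\nabla w_t$ is the gradient at step $t$ and $\beta$ is the decay parameter), for Adagrad:

$$v_4 = (\nabla w_1)^2 + (\nabla w_2)^2 + (\nabla w_3)^2 + (\nabla w_4)^2$$

and for RMSProp, where $v_t = \beta\, v_{t-1} + (1-\beta)\,(\nabla w_t)^2$:

$$v_4 = \beta^3(1-\beta)(\nabla w_1)^2 + \beta^2(1-\beta)(\nabla w_2)^2 + \beta(1-\beta)(\nabla w_3)^2 + (1-\beta)(\nabla w_4)^2$$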

We can see that 'v4' would be much smaller in the case of RMSProp: even though we are accumulating the history of gradients, we are not allowing it to blow up, as we keep multiplying it by the decay ratio.

There would still be a relative difference between the value of ‘vt’ for dense
features and for sparse features.
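A minimal sketch of RMSProp for the same toy setup, reusing the toy data and the grad_w/grad_b helpers (beta is the decay hyper-parameter; values are illustrative):

```python
def do_rmsprop(max_epochs=1000, eta=0.1, beta=0.9, eps=1e-8):
    w, b = -2.0, -2.0
    v_w, v_b = 0.0, 0.0
    for epoch in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        # exponentially decaying sum of squared gradients: history cannot blow up
        v_w = beta * v_w + (1 - beta) * dw ** 2
        v_b = beta * v_b + (1 - beta) * db ** 2
        w = w - (eta / (np.sqrt(v_w) + eps)) * dw
        b = b - (eta / (np.sqrt(v_b) + eps)) * db
    return w, b
```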

The brown curve represents the movement for RMSProp, and we can see that it eventually reaches the minima, as opposed to Adagrad (green curve).


Adam Algorithm:

We have seen Momentum-based GD and RMSProp, and both of these use a history component: Momentum-based GD uses a history of the gradients, whereas RMSProp uses a history of the squares of the gradients, and we have seen in Momentum-based GD that having this cumulative history actually helps in moving fast out of flat surfaces.


In Momentum-based GD, the history component is used to compute the current update, whereas in RMSProp, the history is used to adjust the learning rate. The Adam algorithm combines these two ideas:
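Reconstructing the combined update, with $m_t$ playing the momentum role (history of gradients) and $v_t$ the RMSProp role (history of squared gradients); $\beta_1$ and $\beta_2$ are the two decay hyper-parameters:

$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, \nabla w_t$$

$$v_t = \beta_2\, v_{t-1} + (1-\beta_2)\, (\nabla w_t)^2$$

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t + \epsilon}}\, m_t$$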


Adam combines the ideas of the two algorithms: it ensures that the learning rate is adaptive, and it also ensures that the term used in the update is not the current derivative alone but a history of derivatives, so that we can move fast out of flat regions.

In practice, Adam also does bias correction (to ensure that the gradients are more stable at the beginning), where instead of using 'mt' and 'vt', we have:
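Reconstructing the bias-corrected estimates, where $t$ is the number of updates made so far:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}, \qquad w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t + \epsilon}}\, \hat{m}_t$$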

So, first, it computes 'mt' and 'vt' using the below equations, and then divides by these quantities (doing the bias correction) as in the above image:
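As a stand-in for the snippet in the image, a minimal sketch of the whole procedure for the same toy setup, reusing the toy data and the grad_w/grad_b helpers (hyper-parameter values are illustrative):

```python
def do_adam(max_epochs=1000, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    w, b = -2.0, -2.0
    m_w, m_b, v_w, v_b = 0.0, 0.0, 0.0, 0.0
    num_updates = 0
    for epoch in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        num_updates += 1
        # momentum-style history of gradients
        m_w = beta1 * m_w + (1 - beta1) * dw
        m_b = beta1 * m_b + (1 - beta1) * db
        # RMSProp-style history of squared gradients
        v_w = beta2 * v_w + (1 - beta2) * dw ** 2
        v_b = beta2 * v_b + (1 - beta2) * db ** 2
        # bias correction keeps the estimates from being too small early on
        m_w_hat = m_w / (1 - beta1 ** num_updates)
        m_b_hat = m_b / (1 - beta1 ** num_updates)
        v_w_hat = v_w / (1 - beta2 ** num_updates)
        v_b_hat = v_b / (1 - beta2 ** num_updates)
        w = w - (eta / (np.sqrt(v_w_hat) + eps)) * m_w_hat
        b = b - (eta / (np.sqrt(v_b_hat) + eps)) * m_b_hat
    return w, b
```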


The cyan-colored curve corresponds to Adam.

As this algorithm has a momentum component, it behaves just like Momentum-based GD in the sense that it overshoots the minima, makes some oscillations, and then reaches the valley.

Summary:

We have seen six different algorithms: vanilla (batch) Gradient Descent, Momentum-based GD, NAG, Adagrad, RMSProp and Adam, with the last three having an adaptive learning rate. And we have seen three different strategies, batch, mini-batch and stochastic, which can be used in combination with any of these algorithms.


In practice, Adam with mini-batches is the most popular setup. The advantage of this combination is that it does not depend too much on the initial learning rate, because the learning rate is going to adapt anyway, and because of the momentum property it combines the advantages of Momentum-based GD and RMSProp.
