
Deep Learning Assignment

Abhishek Singh

DM22204

Activation Functions:
An activation function of a neuron is crucial because it decides whether a given feature is important for the output prediction. There are two types of activation functions: linear and nonlinear.

Non-Linear Activation Functions:

ReLU – Pros – a) Strongly preferred for multilayer networks because its computational demands are lower than those of sigmoid and tanh, so it runs faster.

Cons – a) There is no learning on the negative axis because the output (and therefore the gradient) is zero there.

b) Dying neuron problem: neurons that get stuck at zero stop updating.
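
As a rough illustration of both points, a minimal NumPy sketch of the ReLU rule (the function name and test values are just for demonstration):

import numpy as np

def relu(x):
    # ReLU passes positive values through unchanged and clips negatives to 0,
    # so the gradient is 1 for x > 0 and exactly 0 for x < 0 (no learning there).
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # -> 0, 0, 0, 1.5, 3.0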

Leaky ReLU – Pros – a) A small modification of ReLU: instead of outputting zero on the negative axis, it outputs a small fraction of the input (typically 0.01x), which keeps the gradient non-zero and enables backpropagation there.

Cons – a) Learning on the negative axis is still inefficient, because the slope there is very small.
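
A minimal sketch of the leaky variant, assuming the commonly used slope of 0.01 on the negative side (names and test values are illustrative):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Identical to ReLU for x > 0, but negative inputs are scaled by a small
    # slope alpha instead of being zeroed, so a gradient can still flow back.
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 0.0, 3.0])))  # -> -0.02, 0, 3.0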

ReLU-6 – Pros – a) ReLU-6 is the Rectified Linear Unit clipped at a maximum output of 6. Like plain ReLU, its key advantage over other activation functions is that it does not activate all neurons simultaneously: neurons with negative input output zero, so their weights and biases are not updated during that backpropagation pass.

Cons – a) One of ReLU's key drawbacks is the situation where large weight updates leave the summed input to the activation function permanently negative, regardless of the input to the network. A node with this issue will always produce an activation value of 0.0, which is known as a "dying ReLU."
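
A small sketch of the capped version (test values are made up):

import numpy as np

def relu6(x):
    # ReLU clipped at 6: negatives become 0 and anything above 6 is capped,
    # keeping activations inside a small, fixed range.
    return np.minimum(np.maximum(0.0, x), 6.0)

print(relu6(np.array([-3.0, 2.0, 7.5])))  # -> 0, 2, 6
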
Sigmoid – Pros – a) It behaves like a good classifier in its steep middle region, where a minor change in x results in a substantial change in y.

b) Activation values stay bounded because the output always lies in (0, 1).

Cons – a) In the saturated tails, a substantial change in x results in only a small change in y, so gradients are small and learning is slow.

b) Vanishing Gradient Problem.
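
A short NumPy sketch showing both behaviours, the steep middle and the flat tails (test values are made up):

import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1); the curve is steep near x = 0 and
    # nearly flat at the tails, which is where gradients vanish.
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # -> about 0.00005, 0.5, 0.99995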

Softmax – Pros – a) Capable of handling multiple classes (most other activation functions handle only a single class or a binary decision): it squashes the output for each class into the range (0, 1) and divides by their sum, so each value indicates the probability of the input belonging to that class.

b) Effective for output neurons: softmax is typically employed only in the output layer of neural networks that must classify inputs into several categories.

Cons – a) Like sigmoid, it saturates: in the flat regions a substantial change in x results in only a small change in y, which slows learning.
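
A minimal sketch of the normalisation described above (subtracting the maximum is a common numerical-stability trick, not part of the definition; the input scores are made up):

import numpy as np

def softmax(z):
    # Exponentiate, then divide by the sum so the outputs lie in (0, 1)
    # and add up to 1, i.e. they can be read as class probabilities.
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> about 0.66, 0.24, 0.10 (sums to 1)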

How do optimizers and momentum work?


Optimizers are techniques or approaches that adjust the attributes of a neural network, such as its weights and the learning rate, in order to reduce the loss.

Stochastic Gradient Descent


It is a variant of Gradient Descent that updates the model's parameters more frequently: the parameters are changed after the loss is computed for each individual training example. As a result, if the dataset contains 1000 rows, SGD updates the model parameters 1000 times in one pass over the dataset, rather than once as in batch Gradient Descent.
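
A toy sketch of this per-example update for a one-parameter linear model y = w * x (the data, learning rate and epoch count are made up):

import numpy as np

# Tiny dataset generated from y = 2x, so the optimal weight is 2.0.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, lr = 0.0, 0.01
for epoch in range(50):
    for xi, yi in zip(X, y):            # one parameter update per training example
        grad = 2 * (w * xi - yi) * xi   # derivative of the squared error for this example
        w -= lr * grad
print(round(w, 3))  # -> 2.0
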
Adam
Adam (Adaptive Moment Estimation) works with first- and second-order momentum estimates. The idea behind Adam is that we don't want to roll so fast that we jump over the minimum; instead, we want to slow down a little for a more attentive search. Like AdaDelta, Adam keeps an exponentially decaying average of past squared gradients, V(t), as well as an exponentially decaying average of past gradients, M(t).

M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
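
A minimal single-parameter sketch of these two decaying averages and the resulting update (β1 = 0.9, β2 = 0.999 and ε = 1e-8 are the commonly quoted defaults; the toy objective and learning rate are made up):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # decaying average of past gradients, M(t)
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of past squared gradients, V(t)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimise f(theta) = theta^2 (gradient 2*theta), with a larger lr for this toy problem.
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
print(round(theta, 3))  # ends near the minimum at 0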

AdaDelta
It is an AdaGrad plugin that attempts to solve the fading learning Rate problem. Adadelta limits
the window of accumulated prior gradients to some specified size w rather than accumulating all
previously squared gradients. Rather than the sum of all gradients, an exponentially moving
average is employed in this case.

E[g²](t) = γ * E[g²](t−1) + (1 − γ) * g²(t)

We set γ to a similar value as the momentum term, around 0.9.
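
A rough sketch of how this decaying average drives the update. Full Adadelta (following Zeiler's original formulation) also keeps a second decaying average of squared parameter updates, which is what removes the need for a global learning rate; the variable names and toy objective here are my own:

import numpy as np

def adadelta_step(theta, grad, E_g2, E_dx2, rho=0.9, eps=1e-6):
    E_g2 = rho * E_g2 + (1 - rho) * grad ** 2                    # E[g^2](t), as in the equation above
    delta = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * grad   # step scaled by the ratio of the two RMS terms
    E_dx2 = rho * E_dx2 + (1 - rho) * delta ** 2                 # decaying average of squared updates
    return theta + delta, E_g2, E_dx2

theta, E_g2, E_dx2 = 5.0, 0.0, 0.0
for _ in range(100):
    theta, E_g2, E_dx2 = adadelta_step(theta, 2 * theta, E_g2, E_dx2)  # f = theta^2
print(round(theta, 3))  # decreases slowly from 5.0 towards the minimum at 0 (Adadelta starts cautiously)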

Adagrad
One of the drawbacks of all the optimizers discussed is that the learning rate is constant for all
parameters and each cycle. The learning rate is altered by this optimizer. It modifies the learning
rate ‘η’ for each parameter and at each time step ‘t’. It is a second order optimization algorithm
of the type. It is based on an error function's derivative.

‘η’ is the learning rate, adapted for a particular parameter θ(i) at a given time step based on the past gradients calculated for that parameter θ(i).

We store the sum of the squares of the gradients with respect to θ(i) up to time step t, and a smoothing term ε (typically on the order of 1e-8) is added to avoid division by zero. Interestingly, without the square root operation the algorithm performs much worse.

It makes large updates for infrequently updated parameters and small updates for frequently updated parameters.
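
A single-parameter sketch of the accumulation and scaling described above (symbols follow the text; the toy objective, η and iteration count are made up):

import numpy as np

def adagrad_step(theta, grad, G, eta=0.1, eps=1e-8):
    G = G + grad ** 2                              # running sum of squared gradients for theta
    theta = theta - eta / np.sqrt(G + eps) * grad  # effective learning rate shrinks as G grows
    return theta, G

theta, G = 5.0, 0.0
for _ in range(100):
    theta, G = adagrad_step(theta, 2 * theta, G)   # gradient of f = theta^2 is 2*theta
print(round(theta, 3))  # moves towards the minimum at 0, with ever-smaller steps
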
RMS Prop:
RMSprop's main goals are to:

Maintain a moving (discounted) average of the squared gradients.

Divide the gradient by the root of this average.

Common implementations of RMSprop employ simple momentum rather than Nesterov momentum. In addition, the centred variant keeps a moving average of the gradients as well and uses it to estimate the variance.
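
A minimal single-parameter sketch of exactly these two operations (momentum is omitted for brevity; the hyperparameters and toy objective are made up):

import numpy as np

def rmsprop_step(theta, grad, avg_g2, lr=0.05, rho=0.9, eps=1e-8):
    avg_g2 = rho * avg_g2 + (1 - rho) * grad ** 2         # moving (discounted) average of squared gradients
    theta = theta - lr * grad / (np.sqrt(avg_g2) + eps)   # divide the gradient by the root of that average
    return theta, avg_g2

theta, avg_g2 = 5.0, 0.0
for _ in range(300):
    theta, avg_g2 = rmsprop_step(theta, 2 * theta, avg_g2)  # f = theta^2
print(round(theta, 3))  # hovers near the minimum at 0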

Nesterov Momentum
Nesterov Momentum is an extension of the gradient descent optimization algorithm. In standard momentum, the last change made to the variable is added into the next update, scaled by a “momentum” hyperparameter that controls how much of the last change to add, e.g. 0.9 for 90%.

It is easier to think about this update in terms of two steps: first calculate the change in the variable using the partial derivative, then calculate the new value for the variable.

change(t+1) = (momentum * change(t)) – (step_size * f'(x(t)))

x(t+1) = x(t) + change(t+1)

We can think of momentum in terms of a ball rolling downhill that will accelerate and continue
to go in the same direction even in the presence of small hills.

A problem with momentum is that acceleration can sometimes cause the search to overshoot
the minima at the bottom of a basin or valley floor.

Nesterov Momentum can be thought of as a modification to momentum to overcome this problem of overshooting the minima.

It involves first calculating the projected position of the variable using the change from the last
iteration and using the derivative of the projected position in the calculation of the new
position for the variable.

Calculating the gradient of the projected position acts like a correction factor for the
acceleration that has been accumulated.

It is easy to think about Nesterov Momentum in terms of four steps:

1. Project the position of the solution.

2. Calculate the gradient of the projection.

3. Calculate the change in the variable using the partial derivative.

4. Update the variable.

Let’s go through these steps in more detail.

First, the projected position of the entire solution is calculated using the change calculated in
the last iteration of the algorithm.

projection(t+1) = x(t) + (momentum * change(t))

We can then calculate the gradient for this new position.


gradient(t+1) = f'(projection(t+1))

Now we can calculate the new position of each variable using the gradient of the projection,
first by calculating the change in each variable.

change(t+1) = (momentum * change(t)) – (step_size * gradient(t+1))

And finally, calculating the new value for each variable using the calculated change.

x(t+1) = x(t) + change(t+1)

In the field of convex optimization more generally, Nesterov Momentum is known to improve
the rate of convergence of the optimization algorithm (e.g. reduce the number of iterations
required to find the solution).
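
Putting the four steps together for the toy objective f(x) = x², whose derivative is 2x (the momentum, step size, starting point and iteration count below are made up):

def derivative(x):
    return 2 * x      # f'(x) for f(x) = x^2, which has its minimum at 0

x, change = 5.0, 0.0
momentum, step_size = 0.8, 0.1
for t in range(50):
    projection = x + momentum * change                   # 1. project the position of the solution
    gradient = derivative(projection)                    # 2. calculate the gradient of the projection
    change = momentum * change - step_size * gradient    # 3. calculate the change in the variable
    x = x + change                                       # 4. update the variable
print(round(x, 4))  # very close to the minimum at 0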

