Backpropagation, ReLU and Gradient Descent
Gradient descent is an optimization algorithm used when training a machine learning
model. It assumes a convex (or locally convex) loss function and tweaks the model's
parameters iteratively to drive that function down to its local minimum.
For gradient descent to reach the local minimum, we must set the learning rate to an
appropriate value, neither too low nor too high. This matters because if the steps it takes
are too big, it may never reach the local minimum: it overshoots and bounces back and
forth across the valley of the convex function. If we set the learning rate to a very small
value, gradient descent will eventually reach the local minimum, but it may take a long time.
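The effect of the learning rate described above can be sketched with a few lines of Python. This is a minimal illustration on the convex function f(x) = x², whose derivative is 2x; the starting point, step count, and learning-rate values are illustrative assumptions, not values from the text.

```python
def gradient_descent(lr, x0=5.0, steps=50):
    """Run `steps` updates x <- x - lr * f'(x) for f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # f'(x) = 2x
    return x

# An appropriate learning rate converges close to the minimum at x = 0.
print(gradient_descent(lr=0.1))
# A very small learning rate also heads toward the minimum, but much more slowly:
# after 50 steps it is still far from 0.
print(gradient_descent(lr=0.001))
# Too large a step overshoots the minimum and bounces back and forth with
# growing magnitude, so the iterates diverge.
print(gradient_descent(lr=1.1))
```

Each update multiplies x by (1 - 2·lr), so convergence requires that factor to have magnitude below 1; lr = 1.1 gives a factor of -1.2, which is exactly the back-and-forth divergence the text describes.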
ReLU is a non-linear activation function used in multi-layer and deep neural
networks. Traditionally, non-linear activation functions such as the sigmoid (logistic)
function and the hyperbolic tangent were used in neural networks to compute the activation
value of each neuron. More recently, the ReLU function has been used instead to
calculate the activation values in both traditional and deep neural network architectures.
The reasons for replacing sigmoid and hyperbolic tangent with ReLU include:
1. Computational savings - the ReLU function accelerates the training of deep
neural networks compared to traditional activation functions, since the derivative of ReLU
is simply 1 for any positive input. Because this derivative is a constant, deep neural
networks do not need additional time to compute the activation gradients in the error
terms during the training phase.
2. Solving the vanishing gradient problem - the ReLU function does not trigger the
vanishing gradient problem as the number of layers grows. This is because the
function does not saturate: its output is unbounded above, and its gradient does not
shrink toward zero for positive inputs. Thus, the earliest layer (the first hidden layer)
can still receive meaningful error signals from the last layers and adjust all the weights
between layers. By contrast, a traditional activation function like sigmoid is
bounded between 0 and 1, and its gradient is at most 0.25, so the backpropagated errors
become vanishingly small by the time they reach the first hidden layer. This
scenario leads to a poorly trained neural network.
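The two points above can be made concrete with a short sketch. Backpropagation multiplies one activation derivative per layer (the chain rule), so the size of those derivatives determines how much error signal reaches the first hidden layer. The 10-layer depth and the input values below are illustrative assumptions.

```python
import math

def relu_grad(x):
    """Derivative of ReLU: 1 for positive input, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    """Derivative of sigmoid s(x) = 1/(1+e^-x), namely s(x)*(1-s(x)); at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

layers = 10
sigmoid_signal = 1.0
relu_signal = 1.0
for _ in range(layers):
    # Even at its maximum (x = 0), the sigmoid derivative is only 0.25,
    # so the product shrinks by at least 4x per layer.
    sigmoid_signal *= sigmoid_grad(0.0)
    # The ReLU derivative is exactly 1 for any positive input,
    # so the error passes through undiminished.
    relu_signal *= relu_grad(1.0)

print(sigmoid_signal)  # 0.25**10: the error reaching the first layer has vanished
print(relu_signal)     # 1.0: the error arrives intact
```

This is the best case for sigmoid; with typical pre-activations away from 0, its derivative is even smaller and the signal vanishes faster.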
-Nishita Verma
BTBM/18/120
Section D
Sem 5