3 Training Network


Neural Network and Deep Learning
Samatrix Consulting Pvt Ltd
Training Neural Network
How to train Neural Network
• So far, we have studied how we can solve real-life problems using deep
learning.
• But one of the most challenging questions remains: how to set the
parameter vector (the weights of the neurons).
• The training process helps us set the weights.
• During the training of the neural networks, we input a large number
of training examples.
• We modify the weights iteratively to minimize the errors.
• After seeing enough training examples and training cycles, our neural
network will be ready to solve our problems.
How to train Neural Network
• The first question that arises is: how do we train the neurons?
• Suppose we have a large training data set.
• We can calculate the output for the $i$-th training example using a simple
formula.
• We can train the neurons by picking the optimal weights so that we
can minimize the errors for the training examples.
How to train Neural Network
• For example, suppose we want to minimize the squared error over all of the
training examples.
• Suppose, for the $i$-th training example, the true answer is $y_i$ and the
value computed by the neural network is $\hat{y}_i$.
• We can minimize the value of the error function $E$ as follows:

$$E = \frac{1}{2} \sum_i \left( y_i - \hat{y}_i \right)^2$$
How to train Neural Network
• The error function, 𝐸, is zero if the model makes perfect predictions
for every training example.
• Hence, we should select our parameter value 𝜃, the value of all the
weights, such that the error function is as close to 0 as possible.
• The challenge remains to identify the value of all the weights and
biases especially when we are using nonlinear neurons such as
sigmoid, tanh, or ReLU neurons.
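To make this concrete, here is a minimal sketch (assuming NumPy and hypothetical target/prediction arrays, which are not from the slides) of the error function defined above:

```python
import numpy as np

def squared_error(y_true, y_pred):
    """E = 1/2 * sum over training examples of (y_i - y_hat_i)^2."""
    return 0.5 * np.sum((y_true - y_pred) ** 2)

# Hypothetical targets and network outputs for three training examples
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])
print(squared_error(y_true, y_pred))  # 0.07
```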
Derivatives
• We will start this topic by understanding what a derivative is. Given a function

$$y = f(x)$$

• The derivative of $y$ with respect to $x$ represents the change in the value of
$y$ for a small change in $x$. Some of its notations are:

$$y', \quad f'(x), \quad \frac{dy}{dx}$$
• The Figure below plots the value of an arbitrary function $y = f(x)$.
• The plot demonstrates the derivative $f'(x)$ by plotting the tangent line at
three different points.
• The tangent to a curve is a straight line with the same slope (derivative) as
the curve at the location where the line touches the curve.
Derivatives
• There are some observations. The derivative at the minimum point of the
curve is 0 (the tangent is horizontal).
• If we move away from the minimum, the derivative increases (or decreases if
we move in the other direction).
• We can use these observations to solve an optimization problem in which
the objective is to find the value of variable 𝑥 that minimizes the value of
function 𝑦.
• For an initial value of 𝑥 and its corresponding value of 𝑦, we can find the
direction (by using the sign of the derivative) in which we can adjust 𝑥 to
reduce the value of 𝑦.
• If we can solve for the $x$ at which the derivative is 0, we can find an extreme
point (minimum, maximum, or saddle point) of $y$.
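As an illustrative sketch (the function f and the step size h are assumptions, not from the slides), we can estimate the derivative numerically and use its sign to decide which way to adjust x:

```python
def derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: (x - 3) ** 2    # an arbitrary function with its minimum at x = 3
x = 5.0
slope = derivative(f, x)      # ~4.0: positive, so decrease x to reduce y
x_next = x - 0.1 * slope      # a small step against the sign of the derivative
print(slope, x_next)          # ~4.0 4.6
```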
Optimization Problem
• We assume that the optimization problem is a minimization problem.
• On some occasions, we need to solve the maximization problem also.
• In that case, we convert a maximization problem into a minimization
problem by negating the function that we want to maximize.
• In this course, we will focus on minimization problems.
• However, these extreme points could also be local extrema.
• There is no guarantee that we will find the global minimum.
Optimization Problem
• In the example given above, we have worked with
one variable only.
• However, in practice, we work with many variables.
• Suppose we are working with a function of two variables, that is, $y = f(x_0, x_1)$,
or we can say $y = f(\mathbf{x})$ where $\mathbf{x}$ is a 2D vector.
• This function can be thought of as a landscape that has hills and valleys.
• Now we can compute two derivatives:

$$\frac{\partial y}{\partial x_0} \quad \text{and} \quad \frac{\partial y}{\partial x_1}$$

• A partial derivative is similar to a normal derivative, but we assume that all
the variables except one are held constant.
Optimization Problem
• If we arrange the partial derivatives in a vector, we get:

$$\nabla y = \begin{bmatrix} \dfrac{\partial y}{\partial x_0} \\[4pt] \dfrac{\partial y}{\partial x_1} \end{bmatrix}$$
• This is known as the gradient of the function.
• The gradient is a derivative but generalized to a function with multiple
variables.
• The symbol $\nabla$ is known as “nabla”.
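A small sketch of the same idea (the two-variable "landscape" function here is an illustrative assumption): the gradient is just the vector of the two partial derivatives, each computed while holding the other variable constant:

```python
import numpy as np

def gradient(f, x, h=1e-6):
    """Finite-difference gradient: one partial derivative per variable."""
    grad = np.zeros_like(x)
    for k in range(len(x)):
        step = np.zeros_like(x)
        step[k] = h
        grad[k] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2   # illustrative two-variable landscape
print(gradient(f, np.array([1.0, 2.0])))  # approximately [2., 12.]
```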
Optimization Problem
• The gradient is a vector.
• It consists of a direction and a magnitude.
• The direction indicates where to move from a given point (𝑥0 , 𝑥1 ) so
that the resulting function value (𝑦) increases the most.
• This is the direction of the steepest ascent.
• The magnitude of the gradient indicates the slope of the hill in that
direction.
• The three arrows in the Figure above illustrate both the direction and
the slope of the steepest ascent at three points.
• Each arrow is defined by the gradient at its point.
Gradient Descent
Gradient Descent
• We can simplify the problem to minimizing the squared
error over all of the training examples.
• Suppose our linear neuron has two inputs or weights,
𝑤1 and 𝑤2 .
• We can plot the weights and error function, E, on a
three-dimensional space where the horizontal
dimensions represent weights 𝑤1 and 𝑤2 and the
vertical dimension represents the error function 𝐸.
• Using this plot, we can represent the error function
corresponding to different settings of the weights.
• If we plot this graph for all the possible weights in a
three-dimensional space, we get a quadratic bowl as
shown in Figure 3.3.
Gradient Descent
• This surface can also be visualized as a set of
elliptical contours in a two-dimensional plane.
• The center of the ellipses represents the
minimum error.
• In this two-dimensional space, each contour
connects the settings of $w_1$ and $w_2$ that
produce the same value of $E$.
• The closer the contours to each other, the
steeper the slope.
• The direction of the steepest descent is
always perpendicular to the contours.
• This direction is represented as a vector and is
known as the gradient.
Gradient Descent
• We can now develop a strategy to find the values
of the weights so that the error function can be
minimized.
• If we initialize the weights randomly, we would
find ourselves on the horizontal plane.
• We can evaluate the gradient at our current
position to find the direction of the steepest
descent and take a step in that direction.
• We will find ourselves in a new position that is
closer to the minimum than the earlier position.
• We will then reevaluate the direction of the
steepest descent from the new position and take
another step in that direction.
Gradient Descent
• By following this strategy recursively, we
will reach the point where the error is
minimum as shown in Figure 3-4.
• We call this algorithm gradient descent.
• We use this algorithm to train individual
neurons as well as neural networks.
Solve Learning Problem with Gradient Descent
• We can solve a learning problem by identifying the weights such that, for the input
values of a training example, the output from the network matches the desired output
for that example.
• Mathematically, we can state this as follows:

$$y - \hat{y} = 0$$
• In this case, $y$ is the desired output value and $\hat{y}$ (pronounced “y hat”) is the predicted
value.
• However, we do not have only one single training example, rather we have a set of
training examples that our function should satisfy.
• Hence, we can combine all the training examples into a single error metric using the
mean squared error (MSE):

$$MSE = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2$$
Solve Learning Problem with Gradient Descent
• The subscript $i$ differentiates among the different training examples.
• For most problems, the MSE is strictly greater than 0; hence, solving
it for 0 is impossible.
• We can try to solve the problem by identifying the weights that minimize
the value of the error function.
• For most of the deep learning problems, we use the numerical method,
gradient descent, to find an approximate solution to the minimization
problem.
• In an iterative method, we can start with an initial guess of the solution
and then refine it gradually.
Solve Learning Problem with Gradient Descent
• Figure 3.5 demonstrates gradient descent.
We start the process with an initial guess, $x_0$.
• We insert this value in the function 𝑓(𝑥) to compute the
corresponding 𝑦 and its derivative.
• We can obtain an improved guess, $x_1$, by either increasing or
decreasing $x_0$ slightly.
• Using the sign of the derivative we can decide whether we
should increase or decrease 𝑥0 .
• A positive slope indicates that $y$ will decrease if we decrease $x$ (and
increase if we increase $x$).
• Therefore, we can refine the solution by making small
adjustments to 𝑥 iteratively.
• Gradient descent is one of the most commonly used learning
algorithms in deep learning.
Learning Rate
• The derivative provides not only the direction to adjust 𝑥 but also an
indication of whether the current value of 𝑥 is close to or far away
from the values that will minimize 𝑦.
• The gradient descent uses the value of the derivative to decide how
much to adjust 𝑥.
• We can use the following update formula for gradient descent:

$$x_{n+1} = x_n - \eta f'(x_n)$$
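A minimal sketch of this update rule (the function and its derivative are illustrative assumptions; η = 0.3 matches the learning rate used with Figure 3-5):

```python
def gradient_descent(f_prime, x0, eta=0.3, n_steps=25):
    """Iterate x_{n+1} = x_n - eta * f'(x_n)."""
    x = x0
    for _ in range(n_steps):
        x = x - eta * f_prime(x)
    return x

f_prime = lambda x: 2 * (x - 3)           # derivative of (x - 3)^2
print(gradient_descent(f_prime, x0=0.0))  # converges toward the minimum at 3.0
```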
Learning Rate
• The Greek letter eta, 𝜂, is known as the learning rate.
• From the relationship above, we can infer that the step size depends on
both the learning rate and the derivative.
• The step size shrinks as the magnitude of the derivative shrinks.
• The Figure 3-5 illustrates the behavior of gradient descent with a learning
rate 𝜂 of 0.3.
• We can notice that the step size decreases as the derivative gets closer to
0.
• As the algorithm converges at the minimum point, the derivative
approaches 0.
• This means that the step size also approaches 0.
Learning Rate
• We have to select the learning rate judiciously.
• If the learning rate is too high, the gradient descent can overshoot the
solution and will fail to converge.
• On the other hand, if the learning rate, and hence the step size, is too
small, training becomes very slow, and the algorithm may get stuck in a
local minimum, never finding the global minimum.
• The Newton-Raphson method, or Newton’s method, is another commonly
used iterative algorithm for numerical optimization problems.
Constants and Variables
• While applying gradient descent to the neural network, we consider the input
values, $x$, to be constant, and we adjust the weights, $w$, which include the bias
weight ($b$).
• However, in the previous example, we tried to find the input value to minimize a
function.
• For the two-input perceptron, it might therefore look as though $x_1$ and $x_2$ are
the values to adjust.
• It would have been true if our weights were fixed.
• However, the purpose of our learning algorithms is to adjust weights (𝑏, 𝑤1 , 𝑤2 )
for fixed inputs (𝑥1 , 𝑥2 ).
• So, we treat $x_1$ and $x_2$ as constants ($x_0$ as well, but that is always the constant 1),
while we treat $b$, $w_1$, and $w_2$ as variables that can be adjusted.
Constants and Variables
• Therefore, during learning, we consider the weights ($w$), not the inputs ($x$),
to be the variables in our functions.
• For example, suppose we are training a network to classify between a dog and a cat.
• We would use the pixel values as the input ($x$) to the network.
• If the network incorrectly classifies the picture of a dog as a cat, we would not
adjust the picture to look more like a cat; instead, we would adjust the weights of the
network so that the network correctly classifies the dog as a dog.
Delta Rule and Learning Rate
• In addition to optimizing the weight parameters, we often need to optimize
additional parameters for the training process. One such hyperparameter is the
learning rate.
• As discussed in the previous section, for gradient descent, we move
perpendicular to the contour.
• The next question is how big the step should be.
• We can decide on the size of the step or distance based on the steepness of the
surface.
• If we are far from the minimum, the steepness is high whereas if we are close to
the minimum, the surface is quite flat.
• So, if we are close to the minimum, we should take a smaller step.
• We can use the steepness of the surface to measure the distance from the
minimum.
Delta Rule and Learning Rate
• If the error surface is shallow, training can take a large amount of
time.
• So, we often multiply the gradient by a factor 𝜂, the learning rate.
• If the learning rate is too small, the training process will be very long.
• On the other hand, if the learning rate is too big, we may diverge away
from the minimum.
Delta Rule and Learning Rate
• Now we can derive the delta rule for training our linear neuron.
• We can calculate how to change each weight by evaluating the gradient, which is the partial derivative of the
error function with respect to each of the weights:

$$\Delta w_k = -\eta \frac{\partial E}{\partial w_k} = -\eta \frac{\partial}{\partial w_k} \left( \frac{1}{2} \sum_i \left( y_i - \hat{y}_i \right)^2 \right) = \eta \sum_i \left( y_i - \hat{y}_i \right) \frac{\partial \hat{y}_i}{\partial w_k}$$

• Since $\hat{y}_i = \sum_k w_k x_k^{(i)}$ for a linear neuron, we have $\frac{\partial \hat{y}_i}{\partial w_k} = x_k^{(i)}$, so

$$\Delta w_k = \eta \sum_i x_k^{(i)} \left( y_i - \hat{y}_i \right)$$
• Applying this update to the weights at every iteration, we are able to carry out
gradient descent.
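A sketch of the delta rule as a batch update for a linear neuron (NumPy; the dataset X, y is hypothetical, with one row per training example):

```python
import numpy as np

def delta_rule_update(w, X, y, eta=0.1):
    """Delta rule: w_k <- w_k + eta * sum_i x_k(i) * (y_i - y_hat_i)."""
    y_hat = X @ w                      # linear neuron output for every example
    return w + eta * X.T @ (y - y_hat)

# Hypothetical data: six examples, two weights
X = np.array([[0.5, 1.0], [1.5, 0.2], [0.3, 0.8],
              [0.9, 0.4], [1.1, 1.2], [0.7, 0.6]])
y = X @ np.array([2.0, -1.0])          # targets generated from "true" weights
w = np.zeros(2)
for _ in range(500):
    w = delta_rule_update(w, X, y)
print(w)                               # approaches [2., -1.]
```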
Backpropagation
Backpropagation
• Backpropagation is the basic learning algorithm.
• Most neural-network learning algorithms are variations of
backpropagation.
• The backpropagation algorithm consists of three simple steps.
1. Present one or more training examples to the neural network
2. Compare the output of the neural network to the desired value
3. Adjust the weights to make the output get closer to the desired value
Backpropagation
The backpropagation algorithm consists of the following two passes
1. The forward pass: We present a learning example to the network and
compare the network output to the desired value
2. The backward pass: We compute the partial derivatives with respect to the
weights. We use these derivatives to adjust the weights so that the network
output is closer to the desired value
In summary, the backpropagation algorithm consists of a forward pass
in which the training examples are presented to the network.
It is followed by a backward pass in which the weights are adjusted
using gradient descent.
The gradient is computed using the backpropagation algorithm.
Backpropagation
• In the previous section, while applying the gradient descent to the
perceptron, we ignored the activation function, which is the sign
function that is applied to the z-value to arrive at the y-value.
• We used the gradient descent to drive z in the desired direction that
would implicitly affect y.
• While working with a multilayer network, the output from the
activation function in one layer is used as the input to the next layer.
• Hence, when computing gradients for multilayer networks, we need to
take the activation function into account, and the activation function
must be differentiable because we need to compute the gradient.
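Since the example that follows uses the sigmoid activation, here is a short sketch of the function and its derivative (the identity σ′(z) = σ(z)(1 − σ(z)) is what makes the backward pass below so compact):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)) -- differentiable, as required."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.3775))  # 0.593269992..., the out_h1 value in the example below
```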
Backpropagation Example
• For this tutorial, we will use a neural network with two inputs,
two hidden neurons, and two output neurons.
• The hidden and output
neurons will include a bias.
Backpropagation Example
• We have initial weights and
biases with training inputs and
outputs
• The objective of the
backpropagation is to optimize
the weights so that neural
networks can learn to correctly
map arbitrary inputs to outputs.
• For the single training set: Input
0.05 and 0.10. Output 0.01 and
0.99
Forward Pass
• As mentioned above, the input values are 0.05 and 0.1. We will feed
these input values through the network.
• We will calculate the total net input for each hidden layer neuron,
squash the total net input using an activation function (we will use
sigmoid function), then we will repeat the process using the output
layer neurons.
• We can calculate the total net input for $h_1$:

$$net_{h1} = w_1 \times i_1 + w_2 \times i_2 + b_1 \times 1$$

$$net_{h1} = 0.15 \times 0.05 + 0.2 \times 0.1 + 0.35 \times 1 = 0.3775$$
Forward Pass
• Now we will use the sigmoid function to get the output of $h_1$:

$$out_{h1} = \frac{1}{1 + e^{-net_{h1}}} = \frac{1}{1 + e^{-0.3775}} = 0.593269992$$

• Carrying out the same calculation for $h_2$, we get:

$$out_{h2} = 0.596884378$$

• Now we repeat the process for the output layer neurons by using the
output from the hidden layer neurons as inputs.
Forward Pass
• The output for $o_1$:

$$net_{o1} = w_5 \times out_{h1} + w_6 \times out_{h2} + b_2 \times 1$$

$$net_{o1} = 0.4 \times 0.593269992 + 0.45 \times 0.596884378 + 0.6 \times 1 = 1.105905967$$

$$out_{o1} = \frac{1}{1 + e^{-net_{o1}}} = \frac{1}{1 + e^{-1.105905967}} = 0.75136507$$

• Carrying out the same steps for $o_2$:

$$out_{o2} = 0.772928465$$
Forward Pass
• Now we can calculate the total error:

$$E_{total} = \sum \frac{1}{2} \left( target - output \right)^2$$

• The target output for $o_1$ is 0.01, but the neural network output is 0.75136507; hence the
error is:

$$E_{o1} = \frac{1}{2} \left( target_{o1} - out_{o1} \right)^2 = \frac{1}{2} (0.01 - 0.75136507)^2 = 0.274811083$$

• We can repeat the process for $o_2$, for which the target is 0.99, and we get:

$$E_{o2} = 0.023560026$$
Forward Pass
• The total error for the network is the sum of these errors:

$$E_{total} = E_{o1} + E_{o2} = 0.274811083 + 0.023560026 = 0.298371109$$
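The whole forward pass can be reproduced in a few lines. This sketch assumes the initial weights from the example's figure (w3 = 0.25, w4 = 0.30, w7 = 0.50, w8 = 0.55, which reproduce the slide's numbers):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

i1, i2 = 0.05, 0.10                                   # inputs
t1, t2 = 0.01, 0.99                                   # targets
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35     # w3, w4 from the figure
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60     # w7, w8 from the figure

out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1 * 1)          # 0.593269992
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1 * 1)          # 0.596884378
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2 * 1)  # 0.75136507
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2 * 1)  # 0.772928465

E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2
print(E_total)                                        # 0.298371109
```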


Backward Pass
• Using the backpropagation, we can update the weights in the
network so that the actual output is closer to the target output.
Output Layer
• Let’s consider $w_5$. We want to know how much a change in $w_5$
affects the total error, that is, $\frac{\partial E_{total}}{\partial w_5}$.
• We can use the chain rule:

$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \times \frac{\partial out_{o1}}{\partial net_{o1}} \times \frac{\partial net_{o1}}{\partial w_5}$$

• Visually, we can mark these pieces on the network diagram as follows:


Output Layer
• Let’s consider each piece of the equation:

$$E_{total} = \frac{1}{2} \left( target_{o1} - out_{o1} \right)^2 + \frac{1}{2} \left( target_{o2} - out_{o2} \right)^2$$

$$\frac{\partial E_{total}}{\partial out_{o1}} = 2 \times \frac{1}{2} \left( target_{o1} - out_{o1} \right)^{2-1} \times (-1) + 0$$

$$\frac{\partial E_{total}}{\partial out_{o1}} = -\left( target_{o1} - out_{o1} \right) = -(0.01 - 0.75136507) = 0.74136507$$
Output Layer
• Now, we can calculate the change in the output of $o_1$ with respect to its
total net input.
• Note that the derivative of the sigmoid (or logistic) function is the
output multiplied by one minus the output:

$$out_{o1} = \frac{1}{1 + e^{-net_{o1}}}$$

$$\frac{\partial out_{o1}}{\partial net_{o1}} = out_{o1} \left( 1 - out_{o1} \right) = 0.75136507 \left( 1 - 0.75136507 \right) = 0.186815602$$
Output Layer
• Finally, we can calculate the change in the total net input of $o_1$ with respect to $w_5$:

$$net_{o1} = w_5 \times out_{h1} + w_6 \times out_{h2} + b_2 \times 1$$

$$\frac{\partial net_{o1}}{\partial w_5} = 1 \times out_{h1} \times w_5^{1-1} + 0 + 0 = out_{h1} = 0.593269992$$

• We can put all three pieces together:

$$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \times \frac{\partial out_{o1}}{\partial net_{o1}} \times \frac{\partial net_{o1}}{\partial w_5}$$

$$\frac{\partial E_{total}}{\partial w_5} = 0.74136507 \times 0.186815602 \times 0.593269992 = 0.082167041$$
Output Layer
In order to decrease the error, we subtract this value from the current
weight, scaled by the learning rate, $\eta$, which we have set to 0.5 for this
example:

$$w_5^+ = w_5 - \eta \times \frac{\partial E_{total}}{\partial w_5} = 0.4 - 0.5 \times 0.082167041 = 0.35891648$$

Similarly, we can calculate the following values:

$$w_6^+ = 0.408666186$$
$$w_7^+ = 0.511301270$$
$$w_8^+ = 0.561370121$$
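Continuing the forward-pass sketch above (the variables t1, out_o1, out_h1, and w5 carry over), the same chain-rule pieces and the w5 update look like this in code:

```python
# Chain rule pieces for w5 (variables carried over from the forward-pass sketch)
dE_dout_o1 = -(t1 - out_o1)              # 0.74136507
dout_o1_dnet_o1 = out_o1 * (1 - out_o1)  # 0.186815602
dnet_o1_dw5 = out_h1                     # 0.593269992

dE_dw5 = dE_dout_o1 * dout_o1_dnet_o1 * dnet_o1_dw5  # 0.082167041
eta = 0.5
w5_new = w5 - eta * dE_dw5               # 0.35891648
print(w5_new)
```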
Output Layer
• We will update the weights in the neural network only after calculating the
new weights for the hidden layer.
• Hence, we will use the original weights, not the updated weights, while
working through the rest of the backpropagation algorithm.
Hidden Layer
• Following the backward pass process, we will calculate new values for
$w_1$, $w_2$, $w_3$, and $w_4$.
• We need to find:

$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \times \frac{\partial out_{h1}}{\partial net_{h1}} \times \frac{\partial net_{h1}}{\partial w_1}$$

• We will use a similar process to the one we used for the output layer, but in
this case, the output of each hidden neuron contributes to the outputs of
multiple output neurons.
• Hence, we can say that $out_{h1}$ affects both $out_{o1}$ and $out_{o2}$.
• Hence, $\frac{\partial E_{total}}{\partial out_{h1}}$ must take into account its effect on both of the output neurons.
Hidden Layer
$$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}$$

• where

$$\frac{\partial E_{o1}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \times \frac{\partial net_{o1}}{\partial out_{h1}}$$

• Both factors of $\frac{\partial E_{o1}}{\partial net_{o1}}$ have already been calculated:

$$\frac{\partial E_{o1}}{\partial net_{o1}} = \frac{\partial E_{o1}}{\partial out_{o1}} \times \frac{\partial out_{o1}}{\partial net_{o1}} = 0.74136507 \times 0.186815602 = 0.138498562$$
Hidden Layer
Now we calculate $\frac{\partial net_{o1}}{\partial out_{h1}}$:

$$net_{o1} = w_5 \times out_{h1} + w_6 \times out_{h2} + b_2 \times 1$$

$$\frac{\partial net_{o1}}{\partial out_{h1}} = w_5 = 0.40$$

So,

$$\frac{\partial E_{o1}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \times \frac{\partial net_{o1}}{\partial out_{h1}} = 0.138498562 \times 0.40 = 0.055399425$$
Hidden Layer
• Similarly, we can find $\frac{\partial E_{o2}}{\partial out_{h1}}$:

$$\frac{\partial E_{o2}}{\partial out_{h1}} = -0.019049119$$

• Hence:

$$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}} = 0.055399425 + (-0.019049119) = 0.036350306$$
Hidden Layer
Now we can calculate $\frac{\partial out_{h1}}{\partial net_{h1}}$:

$$out_{h1} = \frac{1}{1 + e^{-net_{h1}}}$$

$$\frac{\partial out_{h1}}{\partial net_{h1}} = out_{h1} \left( 1 - out_{h1} \right) = 0.59326999 \left( 1 - 0.59326999 \right) = 0.241300709$$
Hidden Layer
• Next, we calculate $\frac{\partial net_{h1}}{\partial w_1}$:

$$net_{h1} = w_1 \times i_1 + w_2 \times i_2 + b_1 \times 1$$

$$\frac{\partial net_{h1}}{\partial w_1} = i_1 = 0.05$$

• Putting the three pieces together:

$$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \times \frac{\partial out_{h1}}{\partial net_{h1}} \times \frac{\partial net_{h1}}{\partial w_1}$$

$$\frac{\partial E_{total}}{\partial w_1} = 0.036350306 \times 0.241300709 \times 0.05 = 0.000438568$$
Hidden Layer
Now we can update $w_1$:

$$w_1^+ = w_1 - \eta \times \frac{\partial E_{total}}{\partial w_1} = 0.15 - 0.5 \times 0.000438568 = 0.149780716$$

We can repeat the calculation for $w_2$, $w_3$, and $w_4$:

$$w_2^+ = 0.19956143$$
$$w_3^+ = 0.24975114$$
$$w_4^+ = 0.29950229$$
Hidden Layer
• At this step, we update all of our weights.
• Initially, when we started the feed-forward process with the input values
0.05 and 0.1, the error for the network was 0.298371109.
• After the first round of backpropagation, the error is reduced to
0.291027924.
• If we repeat the whole process 10,000 times, the error falls to
0.0000351085, and the two output neurons generate 0.015912196 (vs. target
0.01) and 0.984065734 (vs. target 0.99).
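Putting both passes together, here is a compact sketch of the whole example in matrix form (weights w3, w4, w7, w8 again assumed from the figure; biases held fixed, as in the worked example). Repeating the update loop drives the error down roughly as described above:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W_h = np.array([[0.15, 0.20],    # row k holds the input weights of hidden neuron k
                [0.25, 0.30]])
W_o = np.array([[0.40, 0.45],    # row k holds the hidden-layer weights of output k
                [0.50, 0.55]])
b1, b2 = 0.35, 0.60
x = np.array([0.05, 0.10])
t = np.array([0.01, 0.99])
eta = 0.5

for _ in range(10000):
    # Forward pass
    out_h = sigmoid(W_h @ x + b1)
    out_o = sigmoid(W_o @ out_h + b2)
    # Backward pass: error signal at each layer (computed before any update,
    # so the original weights are used, as the slides require)
    delta_o = -(t - out_o) * out_o * (1 - out_o)
    delta_h = (W_o.T @ delta_o) * out_h * (1 - out_h)
    # Gradient descent updates (biases left unchanged, as in the example)
    W_o -= eta * np.outer(delta_o, out_h)
    W_h -= eta * np.outer(delta_h, x)

print(out_o)  # approaches the targets [0.01, 0.99]
```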
Approaches to Gradient Descent
Approaches
• Different approaches to gradient descent are
• Batch Gradient Descent
• Stochastic Gradient Descent
• Minibatch Gradient Descent
• Each approach handles the optimization problem differently.
• You have to decide which approach works best for your problem
Batch Gradient Descent
• Suppose your dataset has four features and six records.
• During the training process, the neural network will predict the response variable
and compare it with the actual value of the response variable for each record.
• Then it will calculate the gradient of the loss function for that record.
• At this stage, we can update the weights by using the calculated gradient to
minimize the loss function.
• But in the case of the batch gradient descent, we do not update the weights right
away.
• We update the weights after all records of the dataset have been processed.
• We calculate the gradients for each record, add all the gradients, and find the
average (by dividing by six in this case).
• The weights are updated using the averaged gradient.
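A sketch of a single batch update (illustrated with a hypothetical linear model and the six-record, four-feature dataset from the description):

```python
import numpy as np

def batch_gd_step(w, X, y, eta=0.1):
    """Average the per-record gradients over ALL records, then update once."""
    y_hat = X @ w                           # predictions for every record
    grad = -(X.T @ (y - y_hat)) / len(X)    # mean gradient of the squared error
    return w - eta * grad

rng = np.random.default_rng(0)
X = rng.random((6, 4))                      # six records, four features
y = rng.random(6)
w = np.zeros(4)
w = batch_gd_step(w, X, y)                  # one weight update per full pass
```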
Advantages and Disadvantages
Advantages of Batch Gradient Descent
Computationally Efficient: Since no update is required after each sample, this
technique is computationally efficient
Stable Convergence: Since we calculate the average of all the gradients, we get a good
estimate of the true gradient. Hence the convergence of the weights is stable
Disadvantages of Batch Gradient Descent
Slower Learning: In the case of the batch gradient descent, the learning process is very
slow because we perform only one update after processing 𝑁 samples.
Local Minima and Saddle Points: The chances of getting stuck in a local minimum or
at a saddle point are high in this approach because the averaged gradient is smooth
and changes little from step to step.
We need some noisy gradient so that the gradient can jump out of a local minimum.
Stochastic Gradient Descent (SGD)
• SGD addresses the problem with batch gradient descent, where we
calculate the gradients over the whole training dataset before updating the
weights.
• SGD, in contrast, is stochastic in nature.
• It picks a random instance of the training data at each step, computes
the gradient, and updates the weights after each record has been
processed.
• In other words, we can say that it updates the weights after
processing one observation per iteration.
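A sketch of one SGD step under the same hypothetical linear-model setup as the batch sketch above:

```python
import numpy as np

def sgd_step(w, X, y, eta=0.1, rng=np.random.default_rng()):
    """Pick ONE random record, compute its gradient, update immediately."""
    i = rng.integers(len(X))          # random training instance
    error = y[i] - X[i] @ w
    return w + eta * error * X[i]     # noisy single-record update
```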
Stochastic Gradient Descent (SGD)
• In the 2D projection below, we compare full batch gradient
descent and SGD with a batch size of 1.
• We can see that the full batch updates are smoother, whereas SGD has
wiggly convergence characteristics.
Advantages and Disadvantages - SGD
Advantages of SGD
Faster Learning: The learning is faster because we update the weights after each
record is processed
Disadvantages of SGD
Noisy Gradient: Since we use every single record’s gradient to update the weights, the
gradients are very unstable, noisy, and have high variance with respect to their
direction and values. Each one is only a rough estimate of the true gradient. On the
other hand, the noisy gradient helps in jumping out of local minima during training.
Computationally Intensive: The SGD is more computationally intensive than batch
gradient descent.
Inability to settle on a global minimum: Due to the noise, SGD finds it more difficult
to find and stay at the global minimum.
Mini-Batch Gradient Descent
• The third weight update technique is called mini-batch gradient descent.
• This method combines the best of the above-mentioned two methods.
• For mini-batch gradient descent, we divide the training set into mini-batches
of size $n$.
• If the dataset contains 10,000 records, we can select a batch size of
8, 16, 32, 64, or 128.
• Similar to the batch gradient descent, we compute and average the
gradients across the records in a mini-batch.
• We perform the gradient descent step after processing each mini-batch
of records.
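A sketch of one mini-batch step under the same hypothetical setup as the earlier sketches; the only new ingredient is the batch-size hyperparameter:

```python
import numpy as np

def minibatch_gd_step(w, X, y, batch_size=32, eta=0.1,
                      rng=np.random.default_rng()):
    """Average gradients over a small random batch, then update once."""
    idx = rng.choice(len(X), size=min(batch_size, len(X)), replace=False)
    Xb, yb = X[idx], y[idx]
    grad = -(Xb.T @ (yb - Xb @ w)) / len(idx)
    return w - eta * grad
```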
Advantages and Disadvantages
Advantages of Mini-Batch Gradient Descent
Computational Efficiency: The computational cost of this technique lies between
that of full batch gradient descent and SGD
Stable Convergence: Since we calculate the mean gradient over 𝑛 samples, the
gradient is less noisy and converges more stably towards the global minimum
Faster Learning: Learning is faster than with full batch gradient descent.
Disadvantages of Mini-Batch Gradient Descent
New Hyperparameter: For this technique, we need a new hyperparameter,
mini-batch size, which is the second most important hyperparameter after
learning rate. So, we need to try different batch sizes in combination with other
parameters such as learning rate.
Thanks
Samatrix Consulting Pvt Ltd
