DL Unit 1


Introduction to Deep Learning

By
Dr Nisarg Gandhewar
Overview of Syllabus

Unit 1: Introduction to Neural Networks

Feedforward Neural Networks, Backpropagation,
Gradient Descent (GD),
Momentum-Based GD,
Nesterov Accelerated GD,

Stochastic GD,
AdaGrad,
RMSProp,
Adam
Neural Networks

Artificial Neural Network (ANN)
•It is an information processing system inspired by biological neural networks.

•An artificial neural network is a computational model that mimics the way nerve cells work in the human brain.

Biological Neuron vs Artificial Neuron
Mathematical Intuition
Artificial Neural Network (ANN)

Shallow neural networks have only one or two hidden layers; deep networks stack many more.


Applications of Artificial Neural Network (ANN)
Perceptron

•It is the simplest type of feedforward neural network, consisting of a single layer of input
nodes that are fully connected to a layer of output nodes.

• Perceptron: Building Block of Artificial Neural Network


Types of Perceptron:

•Single layer: The single-layer perceptron (SLP) is the simplest type of artificial neural network and can only classify linearly separable cases with a binary target.

•Multilayer: The multilayer perceptron (MLP) is a fully connected feed-forward ANN with at least three layers (input, output, and at least one hidden layer).
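As a hedged illustration (not taken from the slides), the classic perceptron learning rule can train a single-layer perceptron on a linearly separable problem; the sketch below learns the AND function, with all values chosen purely for illustration:

```python
import numpy as np

# Single-layer perceptron learning the AND function (illustrative sketch).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                 # AND targets: 1 only for (1, 1)
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(20):                    # a few passes are enough here
    for xi, ti in zip(X, y):
        pred = int(w @ xi + b > 0)         # step activation
        w += lr * (ti - pred) * xi         # perceptron update rule
        b += lr * (ti - pred)

print(w, b)                                # weights of a separating line for AND
```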
Types of ANN

•Feedforward Network

•Feedback network (Recurrent Neural network)

•Feedforward Network

•Information moves in only one direction: from the input layer, through the hidden layers, to the output layer.

•It is represented by a directed acyclic graph (it contains no cycles).

•There is no feedback loop.

•It has no memory of past inputs; it considers only the current input.

•It cannot remember anything about what happened in the past, beyond what was captured in its weights during training.
•Feedback Network (Recurrent Neural Network)

•Here, information flow can be bidirectional.

•The information cycles through a loop.

•When it makes a decision, it takes into consideration the current input as well as what it has learned from the inputs it received previously.

•It is represented by a cyclic directed graph (it contains cycles).

•It produces output, copies that output, and loops it back into the network.
Forward propagation and Backward propagation

Forward Propagation Example

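The worked example on the slides is an image that is not reproduced here. As a stand-in, below is a minimal forward pass through a 2-2-2 sigmoid network; all weights, biases, and inputs are assumed illustrative values, not read from the slide figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed illustrative values (the slide's figure is not shown here).
x  = np.array([0.05, 0.10])                              # inputs i1, i2
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = 0.35   # input -> hidden
W2 = np.array([[0.40, 0.45], [0.50, 0.55]]); b2 = 0.60   # hidden -> output

h = sigmoid(W1 @ x + b1)    # hidden layer: weighted sum, then activation
o = sigmoid(W2 @ h + b2)    # output layer
print(o)                    # ≈ [0.7514, 0.7729] for these values
```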
Back Propagation Example

Compute the new value of weight w5 using backpropagation after one iteration in the ANN shown on the slide. Use the sigmoid activation function throughout the network and a learning rate of 0.5.

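The worked solution lives in the slide images, which are not reproduced here. Below is a hedged, self-contained sketch of the w5 update; every weight, input, and target is an assumed illustrative value (the widely used 2-2-2 worked example), not read from the slide figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.05, 0.10])                              # assumed inputs
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = 0.35   # input -> hidden
W2 = np.array([[0.40, 0.45], [0.50, 0.55]]); b2 = 0.60   # hidden -> output; W2[0, 0] is w5
t  = np.array([0.01, 0.99])                              # assumed targets
lr = 0.5                                                 # learning rate from the problem

h = sigmoid(W1 @ x + b1)                  # forward pass: hidden activations
o = sigmoid(W2 @ h + b2)                  # forward pass: outputs

# Chain rule for w5 (the weight from h1 to o1), with E = sum(0.5 * (t - o)**2):
#   dE/dw5 = dE/do1 * do1/dnet_o1 * dnet_o1/dw5 = (o1 - t1) * o1*(1 - o1) * h1
grad_w5 = (o[0] - t[0]) * o[0] * (1 - o[0]) * h[0]
w5_new = W2[0, 0] - lr * grad_w5          # gradient descent step
print(w5_new)                             # ≈ 0.3589 with these assumed values
```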
Activation Function

•It is a function that introduces non-linearity into the neural network.

•It converts linear input signals into non-linear output signals.

•It also transforms the output to a specific range, depending on the type of function used.

•It is also known as a transfer function.
Activation Function

•Sigmoid
•Relu
•Leaky ReLU
•Tanh
•Softmax
Sigmoid

The sigmoid (logistic) activation function, sigmoid(x) = 1 / (1 + e^(-x)), squashes its input into the range (0, 1).

It is used in binary classification problems, typically in the output layer.

Relu (Rectified Linear Unit)

•ReLU, defined as ReLU(x) = max(0, x), is the most widely used activation function.

•It is used in almost all neural networks and deep learning models.

•It is generally used in hidden layers; in regression problems it is also used in the output layer.
Leaky ReLU (Rectified Linear Unit)

•The Leaky ReLU function is an improved version of the ReLU activation function: instead of outputting zero for negative inputs, it outputs a small negative slope.
Tanh

•The tanh function became preferred over the sigmoid function because it gave better performance for multi-layer neural networks; its output is zero-centered in the range (-1, 1).
Softmax

•The softmax activation function transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the input classes.

•Consider a multiclass classification problem with N classes: softmax produces N probabilities that sum to 1.
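The five activation functions listed above can be sketched in a few lines of NumPy (an illustrative implementation, not code from the slides):

```python
import numpy as np

def sigmoid(x):                  # squashes input to (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):                     # max(0, x)
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):   # small slope alpha for x < 0 (common default)
    return np.where(x > 0, x, alpha * x)

def tanh(x):                     # squashes input to (-1, 1)
    return np.tanh(x)

def softmax(z):                  # probability distribution over N classes
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099]
```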
Loss function

•The loss function is a method of evaluating how well your algorithm models your dataset.

•loss = Actual value − Predicted value
Regression Loss function

•Used when the target is a continuous numeric value.

•MSE (Mean Squared Error) = mean of (Actual value − Predicted value)²

•MAE (Mean Absolute Error) = mean of |Actual value − Predicted value|
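A quick illustrative check of both formulas (values assumed):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0])      # illustrative actual values
y_pred = np.array([2.5,  0.0, 2.0])      # illustrative predictions

mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error
print(mse, mae)                          # ≈ 0.1667, ≈ 0.3333
```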


Classification Loss function

•Classification losses compare the predicted class probabilities with the true labels.

•The most common choice is cross-entropy: binary cross-entropy for two classes, categorical cross-entropy for multiclass problems.
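A minimal sketch of categorical cross-entropy for one sample (values assumed):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])     # predicted probabilities (e.g. from softmax)
y = np.array([1, 0, 0])           # one-hot true label: class 0

loss = -np.sum(y * np.log(p))     # cross-entropy = -log(probability of true class)
print(loss)                       # = -log(0.7) ≈ 0.357
```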
Optimization

Optimizers

•Optimizers are algorithms or methods used to minimize an error (loss) function or to maximize the efficiency of production.

•Optimizers are mathematical functions that depend on the model's learnable parameters, i.e., weights and biases.

•Optimizers determine how to change the weights and the learning rate of a neural network in order to reduce the losses.


Optimizers

2D view of optimizers (figure)

3D view of optimizers (figure)
Types of Optimizers

•Gradient Descent (GD)
•Stochastic GD
•Mini-Batch Gradient Descent
•Momentum-Based GD
•Nesterov Accelerated GD
•AdaGrad
•RMSProp
•Adam
Gradient Descent (GD)

•Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks.

•It relies on the derivatives of the loss function to find a minimum.

•It uses the entire training set to calculate the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process.
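A minimal batch gradient descent sketch on a toy one-parameter regression (all data and hyperparameters are illustrative assumptions); note that every update touches the entire training set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])            # true relation: y = 2x
w, lr = 0.0, 0.05

for _ in range(200):
    y_hat = w * x                        # forward pass on ALL examples
    grad = np.mean(2 * (y_hat - y) * x)  # dMSE/dw averaged over the full batch
    w -= lr * grad                       # w := w - learning_rate * gradient
print(w)                                 # converges to w ≈ 2
```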
Stochastic Gradient Descent (GD)
•To overcome some of the disadvantages of the GD algorithm, the SGD algorithm comes into the picture as an extension of gradient descent.

•One disadvantage of the gradient descent algorithm is that it requires a lot of memory to load the entire dataset at a time to compute the derivative of the loss function.

•So, in the SGD algorithm, we compute the derivative by taking one data point at a time, i.e., it updates the model's parameters more frequently.

•It updates the model parameters one by one.

•If the dataset has 10K records, SGD will update the model parameters 10K times per epoch.
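The same toy problem as the batch GD sketch above, but updated one data point at a time (illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, lr = 0.0, 0.05

for epoch in range(50):
    for i in np.random.permutation(len(x)):   # shuffle, then one point at a time
        grad = 2 * (w * x[i] - y[i]) * x[i]   # gradient from a SINGLE example
        w -= lr * grad                        # 3 records => 3 updates per epoch
print(w)                                      # ≈ 2, via noisier updates than GD
```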
Mini-Batch Gradient Descent (GD)

•It splits the training dataset into small batches and performs an update for each of those batches.

•This strikes a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

•It can reduce the variance of the parameter updates, making convergence more stable.
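The mini-batch variant of the same sketch, updating once per small batch (a batch size of 2 is assumed for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x
w, lr, batch_size = 0.0, 0.02, 2

for epoch in range(200):
    idx = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]             # one mini-batch of indices
        grad = np.mean(2 * (w * x[b] - y[b]) * x[b])  # gradient over the batch
        w -= lr * grad
print(w)                                              # ≈ 2
```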
SGD with Momentum/ Momentum Based GD

•A few problems can occur when using gradient descent:

•Local minima, plateaus, oscillations, slow convergence

•SGD with momentum is a stochastic optimization method that adds a momentum term to regular stochastic gradient descent.

•Momentum simulates the inertia of an object when it is moving.

•The momentum term helps to accelerate the optimization process by allowing the updates to build up in the direction of the steepest descent.

•Analogy: like grace marks added on top of a base score, each update adds a carried-over fraction of the previous updates to the raw gradient.
SGD with Momentum/ Momentum Based GD

Momentum helps to:

•Escape local minima

•Achieve faster convergence by reducing oscillations

•Smooth out weight updates for stability

•Dampen noisy updates, which can indirectly help generalization
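A hedged sketch of the momentum update on a toy quadratic loss (beta = 0.9 is a common default, not a value from the slides):

```python
def grad(w):
    return 2 * (w - 3)            # derivative of toy loss f(w) = (w - 3)**2

w, v, lr, beta = 0.0, 0.0, 0.1, 0.9
for _ in range(100):
    v = beta * v + grad(w)        # velocity: decaying sum of past gradients
    w -= lr * v                   # step along the accumulated velocity
print(w)                          # ≈ 3 (the minimum)
```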
Nesterov Accelerated Gradient (NAG)

•The Nesterov accelerated gradient optimizer is an upgraded version of the momentum optimizer, and it generally performs better than plain momentum.

•Nesterov Accelerated Gradient (NAG) is a powerful optimization technique that enhances the performance of gradient descent-based algorithms in training deep neural networks.

•"Look before you leap": it evaluates the gradient at the position where momentum is about to carry the parameters, not at the current position.

•It provides momentum on top of momentum.

•It converges faster than plain momentum.
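A sketch of one common NAG formulation on the same toy loss, showing the look-ahead gradient:

```python
def grad(w):
    return 2 * (w - 3)                # derivative of toy loss f(w) = (w - 3)**2

w, v, lr, beta = 0.0, 0.0, 0.1, 0.9
for _ in range(100):
    lookahead = w - lr * beta * v     # where momentum is about to carry us
    v = beta * v + grad(lookahead)    # gradient at the look-ahead, not at w
    w -= lr * v
print(w)                              # ≈ 3, typically with less overshoot
```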


AdaGrad (Adaptive Gradient Descent)

•Gradient descent and other conventional optimization techniques use a fixed learning rate throughout training. However, a uniform learning rate might not be the best option for all parameters.

•AdaGrad enables the algorithm to traverse the optimization landscape efficiently by adjusting the learning rate for each parameter based on its previous gradients.

•The intuition behind AdaGrad: can we use a different learning rate for each neuron in each hidden layer, adapted across iterations?

•It is an algorithm that adjusts the learning rate for each parameter based on its prior gradients.
AdaGrad (Adaptive Gradient Descent)

Advantages of Using AdaGrad

•It eliminates the need to manually tune the learning rate.

•It handles sparse data well.

•Convergence is faster and more reliable than simple SGD when the scaling of the weights is
unequal.
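A one-parameter AdaGrad sketch (illustrative; note that the accumulator G only ever grows, which is the limitation RMSProp later fixes):

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)                 # derivative of toy loss f(w) = (w - 3)**2

w, G, lr, eps = 0.0, 0.0, 1.0, 1e-8
for _ in range(500):
    g = grad(w)
    G += g ** 2                        # accumulate ALL past squared gradients
    w -= lr * g / (np.sqrt(G) + eps)   # per-parameter, ever-shrinking step
print(w)                               # ≈ 3
```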
RMS-Prop (Root Mean Square Propagation)
•Root Mean Squared Propagation (RMSProp) is an extension of gradient descent, and of AdaGrad in particular, that uses a decaying average of partial gradients to adapt the step size for each parameter.

•The decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients seen during the progress of the search, overcoming AdaGrad's limitation of an ever-shrinking learning rate.

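The corresponding RMSProp sketch; the only change from the AdaGrad sketch above is the decaying average (rho = 0.9 is a common default):

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)

w, s, lr, rho, eps = 0.0, 0.0, 0.1, 0.9, 1e-8
for _ in range(500):
    g = grad(w)
    s = rho * s + (1 - rho) * g ** 2   # decaying average: old gradients fade out
    w -= lr * g / (np.sqrt(s) + eps)
print(w)                               # hovers near the minimum at w = 3
```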
Adam (Adaptive Moment Estimation)

•The Adam optimizer is one of the most popular gradient descent optimization algorithms.

•It is a method that computes adaptive learning rates for each parameter.

•It is a combination of momentum and RMSProp.

•It typically runs faster than the alternatives above.

•Adam's primary goal is to stabilize the training process and help neural networks converge to optimal solutions.
Adam: Mathematical Formulation

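The formulation slide is an image that did not survive extraction. As a reference, the standard Adam update from Kingma & Ba (2015) is sketched below with the paper's default hyperparameters, on the same toy loss as the earlier sketches:

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)                 # toy loss f(w) = (w - 3)**2

w, m, v = 0.0, 0.0, 0.0                # parameter, first and second moments
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # m_t = b1*m_{t-1} + (1-b1)*g_t   (momentum)
    v = b2 * v + (1 - b2) * g ** 2     # v_t = b2*v_{t-1} + (1-b2)*g_t^2 (RMSProp)
    m_hat = m / (1 - b1 ** t)          # bias correction for the first moment
    v_hat = v / (1 - b2 ** t)          # bias correction for the second moment
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)                               # settles near the minimum at w = 3
```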
Adam (Adaptive Moment Estimation)

• Adaptive Learning Rates: Adam adjusts the learning rates for each parameter individually
based on historical gradients.

• Efficient Memory Usage: Adam maintains a moving average of past gradients and squared
gradients for each parameter. This allows the algorithm to effectively use memory resources and
results in efficient parameter updates.

• Suitability for Large Datasets and High-Dimensional Spaces: Adam performs well on large
datasets and high-dimensional parameter spaces.
Selection of optimizer
•Selecting an appropriate optimizer in deep learning involves considering various factors, such as the characteristics of your data, the architecture of your neural network, and the specific problem you are trying to solve.

Understand the Problem and Data:


•If your dataset is large and high-dimensional, optimizers like Adam or RMSProp may be effective due to their
adaptability.

•For smaller datasets or more straightforward problems, traditional stochastic gradient descent (SGD) might be sufficient.

Experiment with Different Optimizers:


It's often a good practice to experiment with multiple optimizers to see how they perform on your specific task.
Some popular optimizers include SGD, Adam, RMSProp, Adagrad, and others.

Consider the Learning Rate Schedule:


The learning rate is a crucial hyperparameter for optimizers. Some optimizers, like Adam, have adaptive learning
rates, while others, like SGD, require manual tuning.
Selection of optimizer

Memory Constraints:
Optimizers differ in how much per-parameter state they keep: adaptive methods such as Adam and AdaGrad maintain extra moving-average buffers for every parameter, while plain SGD keeps none. If memory is tight, prefer lighter optimizers such as plain SGD.

Consult Literature and Community Practices:


Review recent literature and community practices for the specific task or type of model you are working on.

Text Books

•Neural Networks and Deep Learning: A Textbook, Charu C. Aggarwal, Springer

•Deep Learning from Scratch: Building with Python from First Principles, Seth Weidman, O'Reilly
Reference Books

•Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press
