DL Unit 1


Introduction to Deep Learning

By
Dr Nisarg Gandhewar
Overview of Syllabus

Unit 1: Introduction to Neural Networks

Feedforward Neural Networks, Backpropagation,
Gradient Descent (GD),
Momentum-Based GD,
Nesterov Accelerated GD,

Stochastic GD,
AdaGrad,
RMSProp,
Adam
Neural Networks

Artificial Neural Network (ANN)
•It is an information processing system inspired by biological neural networks.

•An artificial neural network is a computational model that mimics the way nerve cells work in the human brain.

Biological Neuron vs Artificial Neuron
Mathematical Intuition
Artificial Neural Network (ANN)

Shallow neural networks have only one or two hidden layers; deep networks stack many more.


Applications of Artificial Neural Network (ANN)
Perceptron

•It is the simplest type of feedforward neural network, consisting of a single layer of input
nodes that are fully connected to a layer of output nodes.

• Perceptron: Building Block of Artificial Neural Network


Types of Perceptron:

•Single layer: The single-layer perceptron (SLP) is the simplest type of artificial neural network and can only classify linearly separable cases with a binary target.

•Multilayer: The multilayer perceptron (MLP) is a fully connected feed-forward ANN with at least three layers (input, output, and at least one hidden layer).
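As a hedged illustration (not taken from the slides), the classic perceptron learning rule can train a single-layer perceptron on a linearly separable problem; the sketch below learns the AND function, with all values chosen purely for illustration:

```python
import numpy as np

# Single-layer perceptron learning the AND function (illustrative sketch).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                 # AND targets: 1 only for (1, 1)
w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(20):                    # a few passes are enough here
    for xi, ti in zip(X, y):
        pred = int(w @ xi + b > 0)         # step activation
        w += lr * (ti - pred) * xi         # perceptron update rule
        b += lr * (ti - pred)

print(w, b)                                # weights of a separating line for AND
```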
Types of ANN

•Feedforward Network

•Feedback network (Recurrent Neural network)

•Feedforward Network

•Information moves in only one direction: from the input layer, through the hidden layers, to the output layer.

•It is represented by a directed acyclic graph (it contains no cycles).

•There is no feedback loop.

•It has no memory of past inputs; it considers only the current input.

•It cannot remember anything about what happened in the past, beyond what was captured in its weights during training.
•Feedback Network (Recurrent Neural Network)

•Here, information flow can be bidirectional.

•The information cycles through a loop.

•When it makes a decision, it takes into consideration the current input as well as what it has learned from the inputs it received previously.

•It is represented by a cyclic directed graph (it contains cycles).

•It produces output, copies that output, and loops it back into the network.
Forward propagation and Backward propagation

Forward Propagation Example

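The worked example on the slides is an image that is not reproduced here. As a stand-in, below is a minimal forward pass through a 2-2-2 sigmoid network; all weights, biases, and inputs are assumed illustrative values, not read from the slide figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed illustrative values (the slide's figure is not shown here).
x  = np.array([0.05, 0.10])                              # inputs i1, i2
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = 0.35   # input -> hidden
W2 = np.array([[0.40, 0.45], [0.50, 0.55]]); b2 = 0.60   # hidden -> output

h = sigmoid(W1 @ x + b1)    # hidden layer: weighted sum, then activation
o = sigmoid(W2 @ h + b2)    # output layer
print(o)                    # ≈ [0.7514, 0.7729] for these values
```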
Back Propagation Example

Compute the new value of weight w5 using backpropagation after one iteration in the ANN shown on the slide. Use the sigmoid activation function throughout the network and a learning rate of 0.5.

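The worked solution lives in the slide images, which are not reproduced here. Below is a hedged, self-contained sketch of the w5 update; every weight, input, and target is an assumed illustrative value (the widely used 2-2-2 worked example), not read from the slide figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([0.05, 0.10])                              # assumed inputs
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = 0.35   # input -> hidden
W2 = np.array([[0.40, 0.45], [0.50, 0.55]]); b2 = 0.60   # hidden -> output; W2[0, 0] is w5
t  = np.array([0.01, 0.99])                              # assumed targets
lr = 0.5                                                 # learning rate from the problem

h = sigmoid(W1 @ x + b1)                  # forward pass: hidden activations
o = sigmoid(W2 @ h + b2)                  # forward pass: outputs

# Chain rule for w5 (the weight from h1 to o1), with E = sum(0.5 * (t - o)**2):
#   dE/dw5 = dE/do1 * do1/dnet_o1 * dnet_o1/dw5 = (o1 - t1) * o1*(1 - o1) * h1
grad_w5 = (o[0] - t[0]) * o[0] * (1 - o[0]) * h[0]
w5_new = W2[0, 0] - lr * grad_w5          # gradient descent step
print(w5_new)                             # ≈ 0.3589 with these assumed values
```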
Activation Function

•It is a function that introduces non-linearity into the neural network.

•It converts linear input signals into non-linear output signals.

•It also transforms the output to a specific range, depending on the type of function used.

•It is also known as a transfer function.
Activation Function

•Sigmoid
•Relu
•Leaky ReLU
•Tanh
•Softmax
Sigmoid

The sigmoid (logistic) activation function, sigmoid(x) = 1 / (1 + e^(-x)), squashes its input into the range (0, 1).

It is used in binary classification problems, typically in the output layer.

Relu (Rectified Linear Unit)

•ReLU, defined as ReLU(x) = max(0, x), is the most widely used activation function.

•It is used in almost all neural networks and deep learning models.

•It is generally used in hidden layers; in regression problems it is also used in the output layer.
Leaky ReLU (Rectified Linear Unit)

•The Leaky ReLU function is an improved version of the ReLU activation function: instead of outputting zero for negative inputs, it outputs a small negative slope.
Tanh

•The tanh function became preferred over the sigmoid function because it gave better performance for multi-layer neural networks; its output is zero-centered in the range (-1, 1).
Softmax

•The softmax activation function transforms the raw outputs of the neural network into a vector of probabilities, essentially a probability distribution over the input classes.

•Consider a multiclass classification problem with N classes: softmax produces N probabilities that sum to 1.
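The five activation functions listed above can be sketched in a few lines of NumPy (an illustrative implementation, not code from the slides):

```python
import numpy as np

def sigmoid(x):                  # squashes input to (0, 1)
    return 1 / (1 + np.exp(-x))

def relu(x):                     # max(0, x)
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):   # small slope alpha for x < 0 (common default)
    return np.where(x > 0, x, alpha * x)

def tanh(x):                     # squashes input to (-1, 1)
    return np.tanh(x)

def softmax(z):                  # probability distribution over N classes
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099]
```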
Loss function

•The loss function is a method of evaluating how well your algorithm models your dataset.

•loss = Actual value − Predicted value
Regression Loss function

•Used when the target is a continuous numeric value.

•MSE (Mean Squared Error) = mean of (Actual value − Predicted value)²

•MAE (Mean Absolute Error) = mean of |Actual value − Predicted value|
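A quick illustrative check of both formulas (values assumed):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0])      # illustrative actual values
y_pred = np.array([2.5,  0.0, 2.0])      # illustrative predictions

mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error
print(mse, mae)                          # ≈ 0.1667, ≈ 0.3333
```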


Classification Loss function

•Classification losses compare the predicted class probabilities with the true labels.

•The most common choice is cross-entropy: binary cross-entropy for two classes, categorical cross-entropy for multiclass problems.
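A minimal sketch of categorical cross-entropy for one sample (values assumed):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])     # predicted probabilities (e.g. from softmax)
y = np.array([1, 0, 0])           # one-hot true label: class 0

loss = -np.sum(y * np.log(p))     # cross-entropy = -log(probability of true class)
print(loss)                       # = -log(0.7) ≈ 0.357
```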
Optimization

Optimizers

•Optimizers are algorithms or methods used to minimize an error (loss) function or to maximize the efficiency of production.

•Optimizers are mathematical functions that depend on the model's learnable parameters, i.e., weights and biases.

•Optimizers determine how to change the weights and the learning rate of a neural network in order to reduce the losses.


Optimizers

2D view of optimizers (figure)

3D view of optimizers (figure)
Types of Optimizers

•Gradient Descent (GD)
•Stochastic GD
•Mini-Batch Gradient Descent
•Momentum-Based GD
•Nesterov Accelerated GD
•AdaGrad
•RMSProp
•Adam
Gradient Descent (GD)

•Gradient descent is an optimization algorithm commonly used to train machine learning models and neural networks.

•It relies on the derivatives of the loss function to find a minimum.

•It uses the entire training set to calculate the gradient of the cost function with respect to the parameters, which requires a large amount of memory and slows down the process.
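A minimal batch gradient descent sketch on a toy one-parameter regression (all data and hyperparameters are illustrative assumptions); note that every update touches the entire training set:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])            # true relation: y = 2x
w, lr = 0.0, 0.05

for _ in range(200):
    y_hat = w * x                        # forward pass on ALL examples
    grad = np.mean(2 * (y_hat - y) * x)  # dMSE/dw averaged over the full batch
    w -= lr * grad                       # w := w - learning_rate * gradient
print(w)                                 # converges to w ≈ 2
```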
Stochastic Gradient Descent (GD)
•To overcome some of the disadvantages of the GD algorithm, the SGD algorithm comes into the picture as an extension of gradient descent.

•One disadvantage of the gradient descent algorithm is that it requires a lot of memory to load the entire dataset at a time to compute the derivative of the loss function.

•So, in the SGD algorithm, we compute the derivative by taking one data point at a time, i.e., it updates the model's parameters more frequently.

•It updates the model parameters one by one.

•If the dataset has 10K records, SGD will update the model parameters 10K times per epoch.
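The same toy problem as the batch GD sketch above, but updated one data point at a time (illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, lr = 0.0, 0.05

for epoch in range(50):
    for i in np.random.permutation(len(x)):   # shuffle, then one point at a time
        grad = 2 * (w * x[i] - y[i]) * x[i]   # gradient from a SINGLE example
        w -= lr * grad                        # 3 records => 3 updates per epoch
print(w)                                      # ≈ 2, via noisier updates than GD
```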
Mini-Batch Gradient Descent (GD)

•It splits the training dataset into small batches and performs an update for each of those batches.

•This strikes a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

•It can reduce the variance of the parameter updates, making convergence more stable.
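The mini-batch variant of the same sketch, updating once per small batch (a batch size of 2 is assumed for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x
w, lr, batch_size = 0.0, 0.02, 2

for epoch in range(200):
    idx = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        b = idx[start:start + batch_size]             # one mini-batch of indices
        grad = np.mean(2 * (w * x[b] - y[b]) * x[b])  # gradient over the batch
        w -= lr * grad
print(w)                                              # ≈ 2
```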
SGD with Momentum/ Momentum Based GD

•A few problems can occur when using gradient descent:

•Local minima, plateaus, oscillations, slow convergence

•SGD with momentum is a stochastic optimization method that adds a momentum term to regular stochastic gradient descent.

•Momentum simulates the inertia of an object when it is moving.

•The momentum term helps to accelerate the optimization process by allowing the updates to build up in the direction of the steepest descent.

•Analogy: like grace marks added on top of a base score, each update adds a carried-over fraction of the previous updates to the raw gradient.
SGD with Momentum/ Momentum Based GD

Momentum helps to:

•Escape local minima

•Achieve faster convergence by reducing oscillations

•Smooth out weight updates for stability

•Dampen noisy updates, which can indirectly help generalization
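A hedged sketch of the momentum update on a toy quadratic loss (beta = 0.9 is a common default, not a value from the slides):

```python
def grad(w):
    return 2 * (w - 3)            # derivative of toy loss f(w) = (w - 3)**2

w, v, lr, beta = 0.0, 0.0, 0.1, 0.9
for _ in range(100):
    v = beta * v + grad(w)        # velocity: decaying sum of past gradients
    w -= lr * v                   # step along the accumulated velocity
print(w)                          # ≈ 3 (the minimum)
```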
Nesterov Accelerated Gradient (NAG)

•The Nesterov accelerated gradient optimizer is an upgraded version of the momentum optimizer, and it generally performs better than plain momentum.

•Nesterov Accelerated Gradient (NAG) is a powerful optimization technique that enhances the performance of gradient descent-based algorithms in training deep neural networks.

•"Look before you leap": it evaluates the gradient at the position where momentum is about to carry the parameters, not at the current position.

•It provides momentum on top of momentum.

•It converges faster than plain momentum.
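A sketch of one common NAG formulation on the same toy loss, showing the look-ahead gradient:

```python
def grad(w):
    return 2 * (w - 3)                # derivative of toy loss f(w) = (w - 3)**2

w, v, lr, beta = 0.0, 0.0, 0.1, 0.9
for _ in range(100):
    lookahead = w - lr * beta * v     # where momentum is about to carry us
    v = beta * v + grad(lookahead)    # gradient at the look-ahead, not at w
    w -= lr * v
print(w)                              # ≈ 3, typically with less overshoot
```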


AdaGrad (Adaptive Gradient Descent)

•Gradient descent and other conventional optimization techniques use a fixed learning rate throughout training. However, a uniform learning rate might not be the best option for all parameters.

•AdaGrad enables the algorithm to traverse the optimization landscape efficiently by adjusting the learning rate for each parameter based on its previous gradients.

•The intuition behind AdaGrad: can we use a different learning rate for each neuron in each hidden layer, adapted across iterations?

•It is an algorithm that adjusts the learning rate for each parameter based on its prior gradients.
AdaGrad (Adaptive Gradient Descent)

Advantages of Using AdaGrad

•It eliminates the need to manually tune the learning rate.

•It handles sparse data well.

•Convergence is faster and more reliable than simple SGD when the scaling of the weights is
unequal.
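A one-parameter AdaGrad sketch (illustrative; note that the accumulator G only ever grows, which is the limitation RMSProp later fixes):

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)                 # derivative of toy loss f(w) = (w - 3)**2

w, G, lr, eps = 0.0, 0.0, 1.0, 1e-8
for _ in range(500):
    g = grad(w)
    G += g ** 2                        # accumulate ALL past squared gradients
    w -= lr * g / (np.sqrt(G) + eps)   # per-parameter, ever-shrinking step
print(w)                               # ≈ 3
```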
RMS-Prop (Root Mean Square Propagation)
•Root Mean Squared Propagation (RMSProp) is an extension of gradient descent, and of AdaGrad in particular, that uses a decaying average of partial gradients to adapt the step size for each parameter.

•The decaying moving average allows the algorithm to forget early gradients and focus on the most recently observed partial gradients seen during the progress of the search, overcoming AdaGrad's limitation of an ever-shrinking learning rate.

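The corresponding RMSProp sketch; the only change from the AdaGrad sketch above is the decaying average (rho = 0.9 is a common default):

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)

w, s, lr, rho, eps = 0.0, 0.0, 0.1, 0.9, 1e-8
for _ in range(500):
    g = grad(w)
    s = rho * s + (1 - rho) * g ** 2   # decaying average: old gradients fade out
    w -= lr * g / (np.sqrt(s) + eps)
print(w)                               # hovers near the minimum at w = 3
```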
Adam (Adaptive Moment Estimation)

•The Adam optimizer is one of the most popular gradient descent optimization algorithms.

•It is a method that computes adaptive learning rates for each parameter.

•It is a combination of momentum and RMSProp.

•It typically runs faster than the alternatives above.

•Adam's primary goal is to stabilize the training process and help neural networks converge to optimal solutions.
Adam: Mathematical Formulation

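The formulation slide is an image that did not survive extraction. As a reference, the standard Adam update from Kingma & Ba (2015) is sketched below with the paper's default hyperparameters, on the same toy loss as the earlier sketches:

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)                 # toy loss f(w) = (w - 3)**2

w, m, v = 0.0, 0.0, 0.0                # parameter, first and second moments
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # m_t = b1*m_{t-1} + (1-b1)*g_t   (momentum)
    v = b2 * v + (1 - b2) * g ** 2     # v_t = b2*v_{t-1} + (1-b2)*g_t^2 (RMSProp)
    m_hat = m / (1 - b1 ** t)          # bias correction for the first moment
    v_hat = v / (1 - b2 ** t)          # bias correction for the second moment
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(w)                               # settles near the minimum at w = 3
```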
Adam (Adaptive Moment Estimation)

• Adaptive Learning Rates: Adam adjusts the learning rates for each parameter individually
based on historical gradients.

• Efficient Memory Usage: Adam maintains a moving average of past gradients and squared
gradients for each parameter. This allows the algorithm to effectively use memory resources and
results in efficient parameter updates.

• Suitability for Large Datasets and High-Dimensional Spaces: Adam performs well on large
datasets and high-dimensional parameter spaces.
Selection of optimizer
•Selecting an appropriate optimizer in deep learning involves considering various factors, such as the characteristics of your data, the architecture of your neural network, and the specific problem you are trying to solve.

Understand the Problem and Data:


•If your dataset is large and high-dimensional, optimizers like Adam or RMSProp may be effective due to their
adaptability.

•For smaller datasets or more straightforward problems, traditional stochastic gradient descent (SGD) might be sufficient.

Experiment with Different Optimizers:


It's often a good practice to experiment with multiple optimizers to see how they perform on your specific task.
Some popular optimizers include SGD, Adam, RMSProp, Adagrad, and others.

Consider the Learning Rate Schedule:


The learning rate is a crucial hyperparameter for optimizers. Some optimizers, like Adam, have adaptive learning
rates, while others, like SGD, require manual tuning.
Selection of optimizer

Memory Constraints:
Optimizers differ in how much per-parameter state they keep: adaptive methods such as Adam and AdaGrad maintain extra moving-average buffers for every parameter, while plain SGD keeps none. If memory is tight, prefer lighter optimizers such as plain SGD.

Consult Literature and Community Practices:


Review recent literature and community practices for the specific task or type of model you are working on.

Text Books

•Neural Networks and Deep Learning: A Textbook, Charu C. Aggarwal, Springer

•Deep Learning from Scratch: Building with Python from First Principles, Seth Weidman, O'Reilly
Reference Books

•Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, MIT Press
