
School of Computing
Department of Computer Science and Engineering

Course Name : Deep Learning


Course Code : 1152CS172
Category : Program Elective
Faculty Name : Dr. S. Priya
Designation : Assistant Professor
Department : CSE

School of Computing
Vel Tech Rangarajan Dr. Sagunthala R&D Institute of
Science and Technology
UNIT II - FUNDAMENTALS OF NEURAL
NETWORKS
Basics of Neural Networks- Neural network
representation-History and cognitive
basis of neural computation- Perceptrons-
Perceptron Learning Algorithm- Multilayer
Perceptrons (MLPs)- Representation Power of
MLPs- Back Propagation.
History of Artificial Neural Network
 The history of ANNs stems from the 1940s, the decade of the first electronic computers.
 However, the first important step took place in 1957, when Rosenblatt introduced the first concrete neural model, the perceptron. Rosenblatt also took part in constructing the first successful neurocomputer, the Mark I Perceptron. After this, the development of ANNs proceeded as described in the figure.



Contd..
 Rosenblatt's original perceptron model contained only one layer. From this, a multi-layered model was derived in 1960. At first, the use of the multi-layer perceptron (MLP) was complicated by the lack of an appropriate learning algorithm.
 In 1974, Werbos came to introduce a so-called backpropagation algorithm for the
three-layered perceptron network.
 In 1982, Hopfield brought out his idea of a neural network. Unlike the neurons in
MLP, the Hopfield network consists of only one layer whose neurons are fully
connected with each other.
 Also in 1982, a totally unique kind of network model, the Self-Organizing Map (SOM), was introduced by Kohonen. The SOM is a kind of topological map which organizes itself based on the input patterns that it is trained with. The SOM originated from the LVQ (Learning Vector Quantization) network, the underlying idea of which was also Kohonen's, in 1972.



Contd..
 In 1986, the breakthrough came when a general backpropagation algorithm for the multi-layered perceptron was introduced by Rumelhart and McClelland; until then, the application area of MLP networks had remained rather limited.
 In 1988, Radial Basis Function (RBF) networks were first introduced by Broomhead & Lowe. They opened a new frontier in the neural network community.
 Since then, new versions of the Hopfield network have been developed.
The Boltzmann machine has been influenced by both the Hopfield
network and the MLP.
 Then, research on artificial neural networks has remained active, leading
to many new network types, as well as hybrid algorithms and
hardware for neural information processing.



Why Artificial Neural Networks?
There are two basic reasons why we are interested in
building artificial neural networks (ANNs):

• Technical viewpoint: Some problems, such as character recognition or the prediction of future states of a system, require massively parallel and adaptive processing.

• Biological viewpoint: ANNs can be used to replicate and simulate components of the human (or animal) brain, thereby giving us insight into natural information processing.
Basics of Neural Networks



Contd..
• Neurons (also called nerve cells) are the fundamental units of the brain
and nervous system.
• The cells are responsible for receiving sensory input from the external
world, for sending motor commands to our muscles, and for
transforming and relaying the electrical signals at every step in
between.
• A neuron has three main parts: dendrites, an axon, and a cell body or soma.
• A dendrite is where a neuron receives input from other cells. Dendrites
branch as they move towards their tips, just like tree branches do, and
they even have leaf-like structures on them called spines.
• The axon is the output structure of the neuron; when a neuron wants to
talk to another neuron, it sends an electrical message called an action
potential throughout the entire axon.
• The soma (tree trunk) is where the nucleus lies, where the neuron’s
DNA is housed, and where proteins are made to be transported
throughout the axon and dendrites.
Basics of Neural Networks
Biologically motivated approach to machine learning
Similarity with biological network
1. The fundamental processing element of a neural network is the neuron; each processing element receives many signals.
2. Signals may be modified by a weight at the receiving synapse.
3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a
single output.
5. The output from a particular neuron may go to many other neurons.



Structure of Neural Network



Advantages of Neural Networks
• Fault tolerance
In a neural network, even if a few neurons are not working properly,
that would not prevent the neural networks from generating
outputs.

• Real-time Operations
Neural networks can learn in real time and easily adapt to their changing environments.

• Adaptive Learning
Neural networks can learn how to work on different tasks, based on the data given, to produce the right output.

• Parallel processing capacity
Neural networks have the strength and ability to perform multiple jobs simultaneously.
Disadvantages of Neural Networks

Unexplained behavior of the network

Neural networks provide a solution to a problem, but due to the complexity of the network they do not provide the reasoning behind "why and how" the decisions were made. Therefore, trust in the network may be reduced.

Determination of appropriate network structure


There is no specified rule (or rule of thumb) for choosing a neural network structure. A proper network structure is achieved by trial and error; it is a process that involves refinement.

Hardware dependence
The components of a neural network are dependent on one another; that is, neural networks require (or are highly dependent on) processors with adequate processing capacity.



Perceptron Model - Neural Networks

• The Perceptron model (single-layer neural network) was proposed by Frank Rosenblatt and later analyzed in depth by Minsky and Papert. It was modeled after how the human brain functions.
• It is one of the simplest models that can learn and solve complex data
problems using neural networks. Perceptron is also called an artificial
neuron.
• A perceptron network is comprised of two layers:

• Input Layer
• Output Layer

• The input layer computes the weighted input for every node. The activation function is then applied to obtain the output.



Perceptron Example
The Perceptron Algorithm: Frank Rosenblatt suggested this algorithm.
Steps:
1. Set a threshold value
2. Multiply all inputs with their weights
3. Sum all the results
4. Activate the output

Example: the perceptron tries to decide if you should go to a concert.
Is the artist good? Is the weather good? What weights should these facts have?

Criteria            Input          Weight
Artist is Good      x1 = 0 or 1    w1 = 0.7
Weather is Good     x2 = 0 or 1    w2 = 0.6
Friend will Come    x3 = 0 or 1    w3 = 0.5
Food is Served      x4 = 0 or 1    w4 = 0.3
Alcohol is Served   x5 = 0 or 1    w5 = 0.4

1. Set a threshold value: Threshold = 1.5
2. Multiply all inputs with their weights:
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
3. Sum all the results:

0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)


4. Activate the Output:

Return true if the sum > 1.5 ("Yes I will go to the Concert")
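A minimal Python sketch of this decision, using the inputs, weights, and threshold from the example above (the function name is illustrative):

```python
def perceptron_decision(inputs, weights, threshold=1.5):
    # Steps 2 and 3: multiply each input by its weight and sum the results
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: activate the output by comparing the sum to the threshold
    return weighted_sum > threshold

# Inputs from the example: artist good, bad weather, friend coming, no food, alcohol served
inputs = [1, 0, 1, 0, 1]
weights = [0.7, 0.6, 0.5, 0.3, 0.4]

print(perceptron_decision(inputs, weights))  # True -> "Yes, I will go to the concert"
```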



Activation Functions

• Binary Step Function


• Linear or Identity Activation Function.
• Non-linear Activation Function.
• Sigmoid or Logistic Activation Function.
• Tanh or hyperbolic tangent Activation Function.
• ReLU (Rectified Linear Unit) Activation Function.
• Leaky ReLU.
• Softmax



Binary Step Function

• Binary step function depends on a threshold value that decides


whether a neuron should be activated or not.

• The input fed to the activation function is compared to a certain


threshold; if the input is greater than it, then the neuron is activated,
else it is deactivated, meaning that its output is not passed on to the
next hidden layer.
• Mathematically it can be represented as: f(x) = 1 if x ≥ threshold, else f(x) = 0.
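A small sketch of the binary step function in Python, assuming a threshold of 0:

```python
def binary_step(x, threshold=0.0):
    # Neuron is activated (output 1) only if the input reaches the threshold
    return 1 if x >= threshold else 0

print(binary_step(2.3))   # 1 (activated)
print(binary_step(-0.7))  # 0 (deactivated)
```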



Limitations of Binary Step Function

• It will not be useful when there are multiple classes in the target
variable.

• The gradient of the step function is zero, which causes a hindrance in the backpropagation process.

• Gradients are calculated to update the weights and biases during the
backprop process. Since the gradient of the function is zero, the
weights and biases don’t update.



Linear or Identity Activation Function

• The problem with the step function is that its gradient is zero.
• This is because there is no component of x in the binary step function.
• Instead of a binary function, use a linear function. It can be defined as:
• f(x) = a * x

• Here the activation is proportional to the input. The variable 'a' in this case can be any constant value.
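A one-line sketch of the linear (identity) activation, with the constant a as an illustrative parameter:

```python
def linear_activation(x, a=1.0):
    # Output is directly proportional to the input; a is any constant
    return a * x

print(linear_activation(3.0, a=2.0))  # 6.0
```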



Limitations
• The gradient here does not become zero, but it is a constant which does
not depend upon the input value x at all.
• This implies that the weights and biases will be updated during the
backpropagation process but the updating factor would be the same.
• In this scenario, the neural network will not really improve the error
since the gradient is the same for every iteration.
• The network will not be able to train well and capture the complex
patterns from the data.
• Hence, a linear function might be ideal for simple tasks where interpretability is highly desired. It is not possible to use backpropagation effectively, as the derivative of the function is a constant and has no relation to the input x.
• All layers of the neural network will collapse into one if a linear
activation function is used.
• No matter the number of layers in the neural network, the last layer will
still be a linear function of the first layer.
• A linear activation function turns the neural network into just one layer.
• The Nonlinear Activation Functions are the most used activation
functions.
• It makes it easy for the model to generalize or adapt with variety of data
and to differentiate between the output.
• The Nonlinear Activation Functions are mainly divided on the basis of
their range or curves-
1. Sigmoid or Logistic Activation Function

Mathematical expression for the sigmoid function: f(z) = 1 / (1 + e^(-z))

where, z =∑ (xi * wi ) +b
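A short sketch of the sigmoid applied to a weighted sum, following the expression above (the variable names are illustrative):

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b):
    # z = sum(xi * wi) + b, then apply the sigmoid
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

print(neuron_output([1.0, 0.5], [0.4, -0.2], b=0.1))  # value strictly between 0 and 1
```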



Sigmoid Function

• Use of the sigmoid function: its output exists between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as output. Since the probability of anything exists only in the range of 0 to 1, sigmoid is the right choice.

• The function is differentiable. That means we can find the slope of the sigmoid curve at any point.

• The logistic sigmoid function can cause a neural network to get stuck at
the training time.

• Sigmoid function used for binary classification.

• The softmax function is a more generalized logistic activation function


which is used for multiclass classification.



Pros and Cons

• It is non-linear in nature. Combinations of this function are also non-linear, and it gives an analogue activation, unlike the binary step activation function. It has a smooth gradient too, and it is good for classifier-type problems.
• The output of the activation function is always going to be in the range
(0,1) compared to (-∞, ∞) of linear activation function.
• Sigmoid function gives rise to a problem of “Vanishing gradients” and
Sigmoids saturate and kill gradients.
• Its output isn’t zero centred, and it makes the gradient updates go too far
in different directions. The output value is between zero and one, so it
makes optimization harder.
• The network either refuses to learn more or is extremely slow.



Tanh or hyperbolic tangent Activation Function

• tanh is also like logistic sigmoid but better.


• The range of the tanh function is from (-1 to 1).
• The advantage is that the negative inputs will be mapped strongly
negative and the zero inputs will be mapped near zero in the tanh
graph.
• The function is differentiable.
• The function is monotonic while its derivative is not monotonic.
• The tanh function is mainly used for classification between two classes.
• Both tanh and logistic sigmoid activation functions are used in
feed-forward nets



Contd..

Mathematical function for the tanh function: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
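A minimal sketch of tanh computed directly from this definition, checked against the library function:

```python
import math

def tanh(x):
    # (e^x - e^-x) / (e^x + e^-x), output in the range (-1, 1)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

print(tanh(1.0), math.tanh(1.0))  # both approximately 0.7616
print(tanh(-2.0))                 # strongly negative input maps near -1
```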

Pros and Cons

• TanH also has the vanishing gradient problem, but the gradient is
stronger for TanH than sigmoid (derivatives are steeper).
• TanH is zero-centered, and gradients do not have to move in a
specific direction.



ReLU (Rectified Linear Unit)

• ReLU stands for Rectified Linear Unit and is one of the most commonly used activation functions in applications.
• It solves the vanishing gradient problem, because the maximum value of the gradient of the ReLU function is one.
• It also solves the problem of saturating neurons, since the function does not saturate for positive inputs. The range of ReLU is from 0 to infinity.

Mathematical Equation for ReLU activation Function f(x)= max(0, x)
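A minimal ReLU sketch following f(x) = max(0, x):

```python
def relu(x):
    # Passes positive inputs through unchanged, clips negatives to 0
    return max(0.0, x)

print(relu(2.5))   # 2.5
print(relu(-1.3))  # 0.0
```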



Pros and Cons

• Since only a certain number of neurons are activated, the ReLU function is
far more computationally efficient when compared to the sigmoid and TanH
functions.
• ReLU accelerates the convergence of gradient descent towards the global
minimum of the loss function due to its linear, non-saturating property.
• One of its limitations is that it should only be used within hidden layers of
an artificial neural network model.
• Some gradients can be fragile during training.
• For activations in the region (x<0) of ReLu, the gradient will be 0 because
of which the weights will not get adjusted during descent.
• The dying ReLU problem can occur: neurons that go into that state stop responding to variations in input (simply because the gradient is 0, nothing changes).



Leaky ReLU
• Leaky ReLU is an upgraded version of the ReLU activation function
to solve the dying ReLU problem, as it has a small positive slope in
the negative area.
• But, the consistency of the benefit across tasks is presently
ambiguous.

Mathematical equation for the Leaky ReLU activation function: f(x) = max(0.1 * x, x)
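A sketch of Leaky ReLU using the slope of 0.1 from the equation above (the slope is a tunable choice):

```python
def leaky_relu(x, negative_slope=0.1):
    # Small positive slope in the negative region avoids "dead" neurons
    return x if x > 0 else negative_slope * x

print(leaky_relu(3.0))   # 3.0
print(leaky_relu(-2.0))  # -0.2 (non-zero gradient, unlike plain ReLU)
```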



Pros and Cons

• The advantages of Leaky ReLU are the same as that of ReLU, in


addition to the fact that it does enable back propagation, even for
negative input values.

• By making a minor modification for negative input values, the gradient on the left side of the graph comes out to be a real (non-zero) value. As a result, there would be no more dead neurons in that area.

• The predictions may not be steady for negative input values.



Exponential Linear Unit (ELU)
• ELU is also one of the variations of ReLU which also solves the dead
ReLU problem.

• ELU, just like leaky ReLU, also considers negative values, by introducing a new alpha parameter that multiplies an exponential term for negative inputs.

• ELU is slightly more computationally expensive than leaky ReLU,


and it’s very similar to ReLU except negative inputs. They are both in
identity function shape for positive inputs.

• Mathematical equation for this activation function: f(x) = x for x > 0, and f(x) = α(e^x − 1) for x ≤ 0.
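A sketch of ELU with alpha = 1.0, matching the equation above (alpha is an illustrative choice):

```python
import math

def elu(x, alpha=1.0):
    # Identity for positive inputs, smooth exponential curve for negatives
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))   # 2.0
print(elu(-2.0))  # about -0.8647 (negative outputs are possible, unlike ReLU)
```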



Pros and Cons

• ELU is a strong alternative to ReLU. Different from the ReLU, ELU


can produce negative outputs.

• Exponential operations are there in ELU, So it increases the


computational time.

• No learning of the α value takes place, and the exploding gradient problem can still occur.



SoftMax

• A combination of many sigmoids is referred to


as the Softmax function.
• It determines relative probability. Similar to the
sigmoid activation function, the Softmax
function returns the probability of each
class/labels.
• In multi-class classification, softmax activation
function is most commonly used for the last
layer of the neural network.
• The softmax function gives the probability of
the current class with respect to others. It also
considers the possibility of other classes too.

Mathematical expression for the softmax function: softmax(z_i) = e^(z_i) / ∑_j e^(z_j)
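A short softmax sketch matching the expression above, applied to three illustrative class scores:

```python
import math

def softmax(z):
    # Exponentiate each score and normalize so the outputs sum to 1
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 3.0]   # raw outputs (logits) for three classes
print(softmax(scores))      # about [0.090, 0.245, 0.665], sums to 1
```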



Pros and Cons

• It mimics the one-hot encoded label better than the absolute values.

• Loss of information can happen when absolute (modulus) values are used, but the exponential takes care of this on its own.

• The softmax function should be used for multi-class classification tasks.



Important Considerations
While choosing the proper activation function, the following problems and issues
must be considered:

Vanishing gradient is a common problem encountered during neural network training.


Like a sigmoid activation function, some activation functions have a small output
range (0 to 1). So a huge change in the input of the sigmoid activation function will
create a small modification in the output. Therefore, the derivative also becomes
small. These activation functions are only used for shallow networks with only a
few layers. When these activation functions are applied to a multi-layer network, the
gradient may become too small for expected training.

Exploding gradients are situations in which massive incorrect gradients build


during training, resulting in huge updates to neural network model weights. When
there are exploding gradients, an unstable network might form, and training cannot
be completed. Due to exploding gradients, the weights’ values can potentially grow to
the point where they overflow, resulting in NaN loss values.



Important points for Activation Functions

• All hidden layers generally use the same activation functions.


• ReLU activation function should only be used in the hidden layer for
better results.
• Sigmoid and TanH activation functions should not be utilized in
hidden layers due to the vanishing gradient, since they make the
model more susceptible to problems during training.
• Swish function is used in artificial neural networks having a depth more
than 40 layers.
• Regression problems should use linear activation functions
• Binary classification problems should use the sigmoid activation
function
• Multiclass classification problems should use the softmax activation function

Neural network architecture and their usable activation functions,


• Convolutional Neural Network (CNN): ReLU activation function
• Recurrent Neural Network (RNN): TanH or sigmoid activation functions
Design of perceptron



• Initialize weight values and bias
• Forward Propagate
• Check the error
• Back propagate and Adjust weights and bias
• Repeat for all training examples

• Perceptron for AND, OR, NOT gates (see the training sketch below)
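A minimal sketch of the perceptron learning algorithm for the AND gate, following the steps above (the learning rate and epoch count are illustrative choices):

```python
def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(data, lr=0.1, epochs=20):
    w = [0.0, 0.0]   # initialize weight values
    b = 0.0          # initialize bias
    for _ in range(epochs):
        for x, target in data:
            # Forward propagate
            y = step(w[0] * x[0] + w[1] * x[1] + b)
            # Check the error, then adjust weights and bias
            error = target - y
            w[0] += lr * error * x[0]
            w[1] += lr * error * x[1]
            b += lr * error
    return w, b

# AND gate truth table
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train_perceptron(and_data)
print([step(w[0] * x[0] + w[1] * x[1] + b) for x, _ in and_data])  # [0, 0, 0, 1]
```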



Feedforward Network



• A feedforward neural network is used for pattern recognition and
classification, non-linear regression, and function approximation.

• A feedforward neural network is a type of artificial neural network


in which nodes’ connections do not form a loop.

• Often referred to as a multi-layered network of neurons,


feedforward neural networks are so named because all
information flows in a forward manner only.

• The data enters the input nodes, travels through the hidden
layers, and eventually exits the output nodes.



A Feedforward Neural Network’s Layers- Components

Input layer
It contains the neurons that receive input. The data is subsequently passed on to the next layer. The input layer's total number of neurons is equal to the number of variables in the dataset.

Hidden layer

This is the intermediate layer, which is concealed between the input and output layers. This layer
has a large number of neurons that perform alterations on the inputs. They then communicate
with the output layer.

Output layer

It is the last layer and depends on the model's construction. The output layer produces the predicted output, corresponding to the desired outcome.

Neuron weights

Weights are used to describe the strength of a connection between neurons. Weight values are typically small, for example in the range 0 to 1, and are updated during training.



Cost Function in Feedforward Neural Network
• The cost function is an important factor of a feedforward neural network. Generally,
minor adjustments to weights and biases have little effect on the categorized data points.
• It determines a method for improving performance by making minor adjustments to
weights and biases using a smooth cost function.

The mean squared error cost function is defined as follows.

For a single training example:
C = (1/n) ∑_{k=1}^{n} (y_k − ŷ_k)²
where n is the number of output nodes, y_k is the target value, and ŷ_k is the predicted value.

The error of the network across all training examples is the average of the per-example costs.



Types of the cost function

1. Regression cost Function


2. Binary Classification cost Functions
3. Multi-class Classification cost Functions
1. Regression cost Function:
Regression models deal with predicting a continuous value for example salary of
an employee, price of a car, loan prediction, etc. A cost function used in the
regression problem is called “Regression Cost Function”. They are calculated
on the distance-based error as follows:

Error = y - y'
where
y – actual value, y' – predicted value

The most used Regression cost functions are :


• Mean Error (ME)
• Mean Squared Error (MSE)
• Mean Absolute Error (MAE)
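A sketch of these three regression cost functions, computed over a few illustrative predictions:

```python
y_true = [3.0, 5.0, 2.5]   # actual values
y_pred = [2.5, 5.0, 3.0]   # predicted values
n = len(y_true)

errors = [yt - yp for yt, yp in zip(y_true, y_pred)]

mean_error = sum(errors) / n                 # ME: positive and negative errors can cancel
mse = sum(e ** 2 for e in errors) / n        # MSE: squaring penalizes large errors
mae = sum(abs(e) for e in errors) / n        # MAE: average absolute distance

print(mean_error, mse, mae)  # 0.0, ~0.1667, ~0.3333
```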



Cost function for Classification

Binary classification (binary cross-entropy): L = −[y · log(p) + (1 − y) · log(1 − p)]

Multi-class classification (categorical cross-entropy): L = −∑_i y_i · log(p_i)

Loss function: used to refer to the error for a single training example.
Cost function: used to refer to an average of the loss functions over an entire training dataset.



Cost function for Classification- example

A commonly used loss function for classification is the cross-entropy loss.

Consider the classification problem of 3 classes as follows.

Class(Orange,Apple,Tomato)
The machine learning model will give a probability distribution of these 3
classes as output for a given input data. The class with the highest probability
is considered as a winner class for prediction.

Output = [P(Orange),P(Apple),P(Tomato)]
The actual probability distribution for each class is shown below.
Orange = [1,0,0]
Apple = [0,1,0]
Tomato = [0,0,1]



Let us now define the cost function using the above example
p(Tomato) = [0.1, 0.3, 0.6]

y(Tomato) = [0, 0, 1]

Cross-Entropy(y,P) = – (0*Log(0.1) + 0*Log(0.3)+1*Log(0.6)) = 0.51

The above formula just measures the cross-entropy for a single observation or input data. The
error in classification for the complete model is given by categorical cross-entropy which is nothing
but the mean of cross-entropy for all N training data.

Categorical Cross-Entropy = (Sum of Cross-Entropy for N data)/N
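A sketch reproducing the Tomato calculation above and then averaging over a small batch to obtain the categorical cross-entropy (the second batch item is illustrative):

```python
import math

def cross_entropy(y_true, y_pred):
    # -sum(y_i * log(p_i)); only the true class term contributes for one-hot labels
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_tomato = [0, 0, 1]
p_tomato = [0.1, 0.3, 0.6]
print(cross_entropy(y_tomato, p_tomato))  # about 0.51, as in the example

# Categorical cross-entropy: mean of the per-example cross-entropies
batch = [([0, 0, 1], [0.1, 0.3, 0.6]), ([1, 0, 0], [0.7, 0.2, 0.1])]
print(sum(cross_entropy(y, p) for y, p in batch) / len(batch))
```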


Multilayer Perceptron

• Deep Learning deals with training multi-layer


artificial neural networks, also called Deep Neural
Networks.
• After Rosenblatt's perceptron was developed in the 1950s, there was a lack of interest in neural networks until 1986, when Dr. Hinton and his colleagues developed the backpropagation algorithm to train a multilayer neural network.

MLP Learning Procedure
The MLP learning procedure is as follows:
• Starting with the input layer, propagate data forward to the
output layer. This step is the forward propagation.
• Based on the output, calculate the error (the difference
between the predicted and known outcome). The error needs
to be minimized.
• Backpropagate the error. Find its derivative with respect to
each weight in the network, and update the model.
• Repeat the three steps given above over multiple epochs to
learn ideal weights.
• Finally, the output is taken via a threshold function to obtain
the predicted class labels.
• Back propagation example
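A compact numpy sketch of these steps for a tiny 2-4-1 network trained on XOR (the layer sizes, learning rate, and epoch count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # hidden layer parameters
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # output layer parameters
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(10000):
    # 1. Forward propagation from input layer to output layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # 2. Error between predicted and known outcome (squared-error loss)
    error = out - y
    # 3. Backpropagate: derivative with respect to each weight, then update
    d_out = error * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

# 4. Threshold the output to obtain predicted class labels
print((out > 0.5).astype(int).ravel())  # ideally [0, 1, 1, 0]
```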
Multilayer Perceptron

https://www.simplilearn.com/tutorials/deep-learning-tutorial/
perceptron#:~:text=Perceptron%20is%20an%20algorithm%20for%20Supervised
%20Learning%20of%20single%20layer,neuron%20is%20fired%20or%20not.

https://machinelearningmastery.com/perceptron-algorithm-for-classification-in-
python/

Simulation Link for Multi Layer Perceptron: https://bit.ly/3zd2t1P

https://learnopencv.com/understanding-feedforward-neural-networks/

