Unit IV Artificial Neural Networks
The Perceptron model is regarded as one of the simplest types of artificial neural
networks. It is a supervised learning algorithm for binary classifiers. It can be
considered a single-layer neural network with four main components: input values,
weights and bias, net sum, and an activation function.
o Input Nodes:
These are the primary components of the Perceptron, which accept the initial data
into the system for further processing. Each input node holds a real-valued number.
o Weight and Bias:
The weight parameter represents the strength of the connection between units and is
another of the most important Perceptron components: the greater the weight of an
input neuron, the stronger its influence in deciding the output. Bias can be thought of
as the intercept term in a linear equation.
o Activation Function:
This is the final and essential component, which determines whether the neuron
will fire or not. The activation function can be considered primarily as a step
function.
Types of Activation functions:
o Sign function
o Step function, and
o Sigmoid function
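As a sketch, the three activation functions listed above can be written as follows (the exact conventions, e.g. sign returning +1 at zero and the step function firing at non-negative input, are assumptions):

```python
import math

def sign(x):
    # Sign function: -1 for negative input, +1 otherwise (convention assumed)
    return -1.0 if x < 0 else 1.0

def step(x):
    # Binary step function: fires (1) only for non-negative input
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    # Sigmoid: smooth, differentiable squashing of the input into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))
```

Of the three, only the sigmoid has a useful derivative everywhere, which matters once gradients are needed for training.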
The data scientist chooses the activation function based on the problem statement
and the desired outputs. The activation function used in a perceptron model (e.g.,
sign, step, or sigmoid) may differ depending on considerations such as whether
learning is slow or gradients vanish or explode.
Step 1
First, multiply each input value by its corresponding weight and add the products to
obtain the weighted sum. A special term called the bias 'b' is added to this weighted
sum to improve the model's performance. Mathematically, the weighted sum is:
∑ wi*xi + b
Step 2
Next, an activation function f is applied to the weighted sum to produce the output:
Y = f(∑ wi*xi + b)
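The two steps can be sketched in a few lines; the function name and the strict '> 0' threshold are illustrative assumptions:

```python
def perceptron_output(inputs, weights, bias):
    # Step 1: weighted sum, sum(wi * xi) + b
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: step activation f applied to the weighted sum
    return 1 if weighted_sum > 0 else 0

# Hypothetical values for illustration: weighted sum = 0.4 - 0.1 + 0.1 = 0.4
y = perceptron_output([1.0, 0.5], [0.4, -0.2], bias=0.1)
```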
In a single-layer perceptron model, the algorithm does not rely on recorded data, so
it begins with randomly assigned values for the weight parameters. It then sums the
weighted inputs. If the total exceeds a predetermined threshold, the model is
activated and outputs the value +1.
o Forward Stage: In the forward stage, activations propagate from the input layer
to the output layer.
o Backward Stage: In the backward stage, the weight and bias values are adjusted
to meet the model's requirements. The error between the actual and desired
output is propagated backward, starting at the output layer and ending at the
input layer.
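A minimal sketch of these two stages for a single-layer perceptron follows; the data format (a list of (inputs, target) pairs) and the hyperparameter values are assumptions:

```python
def train_perceptron(samples, lr=0.1, epochs=20):
    """Sketch: a forward pass computes the output, then a backward pass
    adjusts the weights and bias by the error."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, t in samples:
            # forward stage: input layer -> output layer
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            # backward stage: propagate the error back into weights and bias
            err = t - y
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b
```

Here the backward stage is simply the perceptron learning rule: each weight moves by the learning rate times the error times its input.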
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input x by the learned
weight coefficient w and applying a threshold:
f(x) = 1 if w.x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
The perceptron model has the following characteristics.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps interpret
data by building intuitive patterns and applying them to future inputs. Machine
learning is a rapidly growing and continuously evolving branch of artificial
intelligence, so perceptron technology will continue to support and facilitate
analytical behavior in machines, which in turn adds to the efficiency of computers.
Gradient descent can be used to find a local minimum or maximum of a function.
The forward computation it operates on is as follows:
Suppose we have n inputs (x1, x2, …, xn) and a bias unit, and let the applied
weights be w1, w2, …, wn. The summation with the bias unit is obtained by taking
the dot product of the inputs and weights:
r = Σi=1..n wi xi + bias
Feeding r into the activation function F(r) gives the output of a hidden-layer
neuron. For the first neuron h1 of the first hidden layer:
h1 = F(r)
Repeat the same procedure for all the other hidden layers, continuing until the last
weight set is reached.
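The layer-by-layer procedure above can be sketched as follows, assuming a sigmoid activation F and a list-of-lists weight layout (both assumptions, not fixed by the text):

```python
import math

def sigmoid(r):
    return 1.0 / (1.0 + math.exp(-r))

def layer_forward(inputs, weight_rows, biases, F=sigmoid):
    """One layer: for each neuron j, r = sum(wi * xi) + bias, output F(r).
    weight_rows[j] holds the weights into neuron j."""
    outputs = []
    for w_row, b in zip(weight_rows, biases):
        r = sum(wi * xi for wi, xi in zip(w_row, inputs)) + b
        outputs.append(F(r))
    return outputs

def mlp_forward(x, layers):
    # Repeat the same procedure layer by layer until the last weight set
    for weight_rows, biases in layers:
        x = layer_forward(x, weight_rows, biases)
    return x
```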
Generalization
Generalization usually refers to a machine learning model's ability to perform
well on new, unseen data. After being trained on a training set, a model should
be able to digest new data and make accurate predictions. A model's success
rests largely on its ability to generalize well. If the model has been fitted too
closely to the training data, it will struggle to generalize; if it has not been
trained enough, it suffers from underfitting, which makes the model useless
and incapable of accurate predictions even on the training data.
If the model is overtrained on the data, it will discover all the relevant
information in the training data but fail badly when new data is introduced.
Such a model cannot generalize, which is to say it has overfitted the
training data.
Algorithm
Training:
Step 1: Initialize the weights wij; random values may be assumed. Initialize
the learning rate α.
Step 2: Calculate the squared Euclidean distance for each output unit j:
D(j) = Σ (wij – xi)^2 where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(j) is minimum; J is the winning index.
Step 4: For each unit j within a specific neighborhood of J, and for all i,
calculate the new weight:
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate using:
α(t+1) = 0.5 * α(t)
Step 6: Test the stopping condition.
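Steps 1–6 can be sketched as below, treating the neighborhood as radius 0 (winner only) and the initial α and epoch count as assumptions:

```python
def train_som(data, weights, alpha=0.5, radius=0, epochs=10):
    """Sketch of the steps above. weights[j] is the vector of the j-th
    map unit; radius 0 updates only the winner (a simplification)."""
    for _ in range(epochs):
        for x in data:
            # Step 2: squared Euclidean distance D(j) for every unit j
            D = [sum((wji - xi) ** 2 for wji, xi in zip(wj, x))
                 for wj in weights]
            # Step 3: winning index J minimizes D(j)
            J = D.index(min(D))
            # Step 4: update units within the neighborhood of J
            for j in range(max(0, J - radius), min(len(weights), J + radius + 1)):
                weights[j] = [wji + alpha * (xi - wji)
                              for wji, xi in zip(weights[j], x)]
        # Step 5: decay the learning rate, alpha(t+1) = 0.5 * alpha(t)
        alpha *= 0.5
    return weights
```

With the decaying learning rate of Step 5, updates shrink by half each pass, so the map units settle near the data points they win.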
A set of data points is said to be linearly separable if the data can be divided into two
classes using a straight line. If the data cannot be divided into two classes by a
straight line, the data points are said to be non-linearly separable.
Although the perceptron rule finds a successful weight vector when the training
examples are linearly separable, it can fail to converge if the examples are not linearly
separable.
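A small sketch illustrates the contrast: the perceptron rule drives the error to zero on the linearly separable AND function, but some XOR examples remain misclassified no matter how long it trains (the helper name and data format are illustrative):

```python
def predict(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def perceptron_errors(samples, lr=0.1, epochs=100):
    # Train with the perceptron rule, then count remaining misclassifications
    w, b = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, t in samples:
            err = t - predict(x, w, b)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return sum(t != predict(x, w, b) for x, t in samples)

# AND is linearly separable: the rule converges to zero errors.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
# XOR is not: no line separates the classes, so errors never reach zero.
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```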
A second training rule, called the delta rule, is designed to overcome this difficulty.
If the training examples are not linearly separable, the delta rule converges toward a
best-fit approximation to the target concept.
The key idea behind the delta rule is to use gradient descent to search the hypothesis
space of possible weight vectors to find the weights that best fit the training examples.
This rule is important because gradient descent provides the basis for the
BACKPROPAGATION algorithm, which can learn networks with many
interconnected units.
The delta training rule is best understood by considering the task of training an
unthresholded perceptron; that is, a linear unit for which the output o is given by
o = w . x
Thus, a linear unit corresponds to the first stage of a perceptron, without the threshold.
In order to derive a weight learning rule for linear units, let us begin by specifying a
measure for the training error of a hypothesis (weight vector), relative to the training
examples.
Although there are many ways to define this error, one common measure is
E(w) = ½ Σd∈D (td – od)²
where D is the set of training examples, td is the target output for training example
d, and od is the output of the linear unit for training example d.
How do we calculate the direction of steepest descent along the error surface?
The direction of steepest descent can be found by computing the derivative of E
with respect to each component of the vector w. This vector derivative is called the
gradient of E with respect to w, written as
∇E(w) = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Since the gradient specifies the direction of steepest increase of E, the training rule
for gradient descent is
w ← w + Δw, where Δw = –η ∇E(w)
Here η is a positive constant called the learning rate, which determines the step size in
the gradient descent search.
The negative sign is present because we want to move the weight vector in the
direction that decreases E.
Finally, expanding the gradient component by component gives the weight-update rule
Δwi = η Σd∈D (td – od) xid
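Under these definitions, batch gradient descent for the linear unit can be sketched as follows (the bias term is omitted for brevity, and the example data is hypothetical):

```python
def delta_rule(examples, lr=0.05, epochs=200):
    """Batch gradient descent for a linear unit o = w . x.
    examples is a list of (x, t) pairs (format assumed)."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        # gradient of E(w) = 1/2 * sum_d (t_d - o_d)^2
        grad = [0.0] * n
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                grad[i] += -(t - o) * x[i]  # dE/dw_i = -sum_d (t_d - o_d) x_id
        # move opposite the gradient: w_i <- w_i - lr * dE/dw_i
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w
```

Each epoch accumulates the gradient over all training examples before updating, which is what makes this the batch form of the delta rule.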
CNN architecture
A Convolutional Neural Network consists of multiple layers: the input layer,
convolutional layers, pooling layers, and fully connected layers.
Imagine taking a small patch of an image and running a small neural network,
called a filter or kernel, on it, with say K outputs, represented vertically. Now
slide that small network across the whole image; as a result, we get another
image with a different width, height, and depth. Instead of just the R, G, and B
channels, we now have more channels but less width and height. This operation
is called convolution. If the patch size were the same as that of the image, it
would be a regular neural network. Because the patch is small, we have fewer
weights.
Now let’s talk about a bit of mathematics that is involved in the whole
convolution process.
Convolution layers consist of a set of learnable filters (or kernels)
having small widths and heights and the same depth as that of input
volume (3 if the input layer is image input).
For example, if we run convolution on an image with dimensions 34x34x3,
the possible filter sizes are axax3, where 'a' can be 3, 5, or 7, but small
compared to the image dimension.
During the forward pass, we slide each filter across the whole input volume
step by step; each step is called a stride (which can have a value of 2, 3, or
even 4 for high-dimensional images), and we compute the dot product
between the kernel weights and the patch of the input volume. As we slide
each filter we get a 2-D output per filter; stacking these together gives an
output volume with a depth equal to the number of filters. The network
learns all the filters.
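The sliding dot product described above can be sketched for a single-channel input and one filter (no padding; the list-of-lists layout is an assumption):

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image, computing one dot product per step."""
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(0, H - kH + 1, stride):
        row = []
        for j in range(0, W - kW + 1, stride):
            # dot product between the kernel and the current image patch
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kH) for b in range(kW))
            row.append(s)
        out.append(row)
    return out
```

Note how the output shrinks: a 3x3 input with a 2x2 kernel and stride 1 yields a 2x2 output, matching the "lesser width and height" described above.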
Output Layer: The output of the fully connected layers is fed into a
classification function such as sigmoid or softmax, which converts the
output for each class into a probability score.
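As a sketch, softmax converts a vector of class scores into probabilities (the max-subtraction is a standard numerical-stability detail, not something the text specifies):

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating to avoid overflow
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    # Normalize so the scores form a probability distribution
    return [e / total for e in exps]
```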