
Supervised Learning Networks

Prof. J. Ujwala Rekha


What is Perceptron?
• The perceptron is one of the simplest ANN
architectures.
• It was introduced by Frank Rosenblatt in 1957.
• It is a feed-forward neural network consisting of
a single layer of input nodes that are fully
connected to a layer of output nodes.
• It can learn linearly separable patterns.
• It uses a slightly different type of artificial
neuron known as a threshold logic unit (TLU).
Types of Perceptrons
• Single-Layer Perceptron: This type of
perceptron is limited to learning linearly
separable patterns.
– Effective for tasks where the data can be divided
into distinct categories through a straight line.
• Multilayer Perceptron: The multilayer perceptron
possesses enhanced processing capabilities
and is adept at handling more complex patterns
and relationships within the data.
Basic Components of a Perceptron
• Input Nodes or Input Layer: This is the primary
component of Perceptron which accepts the initial data
into the system for further processing.
• Weight: The weight parameter represents the strength of
the connection between units. A weight is directly proportional
to the strength of the associated input neuron in deciding
the output.
• Bias: A constant added to the product of features and
weights. It is used to offset the result and helps the
model shift the activation function towards the positive
or negative side.
Basic Components of a Perceptron
• Activation Function: It determines whether
the neuron will fire or not.
Types of Activation Functions:
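Some commonly used activation functions can be sketched in Python (an illustrative selection of my own; the sign function is the threshold used by the perceptron's TLU):

```python
import math

def step(z):
    """Binary step: fires (1) when z is non-negative, else 0."""
    return 1 if z >= 0 else 0

def sign(z):
    """Bipolar step, the threshold used by the perceptron's TLU."""
    return 1 if z > 0 else -1

def sigmoid(z):
    """Smooth squashing of z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: passes positive z, clips negatives to 0."""
    return max(0.0, z)
```

For example, sigmoid(0.0) returns 0.5, while sign(0.3) returns 1 and sign(-0.3) returns -1.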
How Does a Perceptron Work?
• The perceptron model begins with the multiplication of all
input values and their weights, then adds these values
together to create the weighted sum.
Σ wᵢ∗xᵢ = x₁∗w₁ + x₂∗w₂ + ⋯ + xₙ∗wₙ
• Add bias ‘b’ to this weighted sum to improve the model
performance
z = Σ wᵢ∗xᵢ + b = x₁∗w₁ + x₂∗w₂ + ⋯ + xₙ∗wₙ + b

• The output z is used as input to the threshold
function f(z).
• The constant b added at the beginning, the bias term, is a
way to simplify learning a good threshold value for the
network.
How Does a Perceptron Work?
 Consider the original threshold function, which
compares the weighted sum of inputs to a
threshold θ:
z = Σ wᵢ∗xᵢ ≥ θ
 Now, if we subtract θ from both sides, we obtain:
Σ wᵢ∗xᵢ − θ ≥ θ − θ
Σ wᵢ∗xᵢ − θ ≥ 0
 Finally, we can replace −θ with b to indicate
“bias”, move the b to the front, and obtain:
z = b + Σ wᵢ∗xᵢ ≥ 0
How Does a Perceptron Work?
 Now, the weight for b can be learned along with
the weights for the input values.
 If you omit the bias term, the perceptron won’t
be able to learn solutions that do not pass
through the origin of the input space.
 The threshold function for the perceptron is
defined as:
ŷ = f(z) = { +1, if z > 0; −1, otherwise }
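The weighted sum, bias, and threshold above can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are my own):

```python
def perceptron_predict(x, w, b):
    """Weighted sum plus bias, passed through the sign threshold f(z)."""
    z = b + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if z > 0 else -1

# z = -0.2 + 0.5*1.0 + (-0.5)*0.0 = 0.3 > 0, so the output is +1
print(perceptron_predict([1.0, 0.0], w=[0.5, -0.5], b=-0.2))
```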
Perceptron Learning Procedure
• Here we describe what Rosenblatt defined as an error-
corrective reinforcement learning procedure:
– Compute the mismatch between the obtained value and the
expected value for the training example
– If the obtained and expected values match, do nothing
– If the obtained and expected values do not match, compute
the difference or delta between those values
– Use the delta value to update the weights of the network
Perceptron Learning Procedure
 Formally,
wₖ₊₁ = wₖ + Δwₖ
 The Δwₖ is computed as:
Δwₖ = η(y − ŷ)xₖ
where:
wₖ is the weight vector for case k
η is the learning rate
y is the actual value (true class label)
ŷ is the predicted value (predicted class label)
xₖ is the vector of inputs for case k
 The learning rate has the role of facilitating the
training process by weighting the delta used to update
the weights.
 This basically means that instead of adding the
full delta to the previous weights, we incorporate
only a proportion of the error into the update,
which makes the learning process more stable.
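The error-corrective procedure can be sketched as a small training loop (an illustrative Python sketch, trained here on the AND function with bipolar labels; names are my own):

```python
def train_perceptron(data, eta=0.1, epochs=20):
    """Rosenblatt's error-corrective rule: update only on misclassification.

    data: list of (inputs, label) pairs, labels in {+1, -1}.
    """
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            y_hat = 1 if z > 0 else -1
            if y_hat != y:                    # do nothing when prediction matches
                delta = eta * (y - y_hat)     # delta rule: eta * (y - y_hat)
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta                    # bias learned as a weight on a constant input
    return w, b

# AND with bipolar labels is linearly separable, so the loop converges
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, b = train_perceptron(and_data)
```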
Schematic Representation of the Perceptron
with the Learning Procedure
Multilayer Perceptron
Limitations of Perceptron Model
• The output of a perceptron can only be a
binary number due to the hard limit transfer
function.
• Perceptron can only be used to classify the
linearly separable sets of input vectors.
ADALINE (Adaptive Linear Neuron)
• The ADALINE was introduced shortly after Rosenblatt’s
perceptron by Bernard Widrow and Ted Hoff.
• The main difference between the perceptron and the
ADALINE is that the latter works by minimizing the
mean-squared error of the predictions of a linear
function.
• This means that the learning procedure is based on
the outcome of a linear function rather than on the
outcome of a threshold function as in the perceptron.
ADALINE (Adaptive Linear Neuron)
• Mathematically, learning from the output of a
linear function enables the minimization of a
continuous cost or loss function.
• The cost function is a measure of the overall
badness (or goodness) of the network
prediction.
ADALINE-Mathematical Formalization
• Mathematically, the ADALINE is described
by:
– A linear function that aggregates the input signal
– A learning procedure to adjust connection weights
• The linear aggregation function is the same as
in the perceptron:
ADALINE-Threshold Decision Function
When dealing with a binary classification
problem, we still use a threshold function, as in the
perceptron, by taking the sign of the linear
function as:
ŷ′ = f(ŷ) = { +1, if ŷ > 0; −1, otherwise }
where ŷ is the output of the linear function.
Perceptron Vs. ADALINE
• The perceptron updates the weights by computing
the difference between the expected and predicted
class labels.
• In other words, the perceptron always compares
+1 or -1 (predicted values) to +1 or -1 (expected
values).
• An important consequence of this is that
perceptron only learns when errors are made.
• In contrast, the ADALINE computes the
difference between the expected class value y (+1
or −1) and the continuous output value ŷ from the
linear function, which can be any real number.
Perceptron Vs. ADALINE
• It means that the ADALINE can learn even
when no classification mistake has been made.
• Since the ADALINE learns all the time and the
perceptron only after errors, the ADALINE
will find a solution faster than the perceptron
for the same problem.
The ADALINE Error Computation
• In a single iteration, the error in the ADALINE
is calculated as (y − ŷ)², in words, by squaring
the difference between the expected value and
the predicted value.
• This process of comparing the expected and
predicted values is repeated for all cases, j=1 to
j=n, in a given dataset.
• Once we add the squared differences for the
entire dataset and divide by the number of cases,
we obtain the mean of squared errors (MSE):
MSE = (1/n) Σⱼ₌₁ⁿ (yⱼ − ŷⱼ)²
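The MSE computation described above, as a minimal Python sketch (function name and sample values are my own):

```python
def mse(y_true, y_pred):
    """Mean of squared errors: add (y_j - y_hat_j)^2 over all cases, divide by n."""
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

error = mse([1, -1, 1], [0.8, -0.5, 0.1])  # averages 0.04, 0.25, and 0.81
```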
The ADALINE Error Computation
The ADALINE Error Surface

Fig. Visual Example of Least Squares Method with One Predictor


The ADALINE Error Surface
• The horizontal axis represents the predictor x₁;
the vertical axis represents the predicted value ŷ.
• The pinkish dots represent the expected values
(real data points).
• ŷ = w₁b + w₂x₁ defines a line in a Cartesian
plane. The intercept and the slope of the line are
determined by the weights w₁ and w₂.
• The 𝑏 and 𝑥1 values are given and do not
change, therefore, they can’t influence the shape
of the line.
• The goal of the least-squares algorithm is to
generate as little cumulative error as possible.
The ADALINE Error Surface
• Since the weights are the only values we can
adjust to change the shape of the line, different
pairs of weights will generate different means of
squared errors.
• This is our gateway to the idea of finding a
minimum in an error surface.
• Imagine the following: you are trying to find the
set of weights 𝑤1 and 𝑤2 that would generate the
smallest mean of squared error.
• Your weights can take values ranging from 0 to
1, and your error can go from 0 to 1.
• Now you decide to plot the mean of squared
errors against all possible combinations of
w₁ and w₂, resulting in an error surface.
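Evaluating the MSE over a grid of weight pairs approximates this error surface numerically (an illustrative sketch; the data and grid are made up and chosen so the points lie exactly on a line):

```python
# Made-up data lying exactly on the line y = 0.2 + 0.8*x, so the best
# weight pair on the grid should be (w1, w2) = (0.2, 0.8) with MSE near 0.
xs = [0.0, 0.5, 1.0]
ys = [0.2, 0.6, 1.0]

def grid_mse(w1, w2):
    """MSE of the line y_hat = w1*b + w2*x1 with the bias input b fixed at 1."""
    preds = [w1 * 1.0 + w2 * x for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(xs)

steps = [i / 10 for i in range(11)]  # candidate weights from 0 to 1
surface = {(w1, w2): grid_mse(w1, w2) for w1 in steps for w2 in steps}
best = min(surface, key=surface.get)  # lowest point on the sampled surface
```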
The ADALINE Error Surface
The ADALINE Error Surface

• Instead of having a unique point where the error is at
its minimum, we have multiple low points or valleys at
different sections of the surface.
• Those valleys are called local minima.
• Ideally, we want to find the global minimum.
ADALINE Learning Procedure
• We want to find a set of parameters that minimize the mean of squared
errors.
• The ADALINE approaches this by utilizing the gradient descent
algorithm.
• Imagine that you are a hiker at the top of a mountain on the side of a
valley, similar to the figure in the previous slide.
• Your goal is to reach the base of the valley. Logically, you would want
to walk downhill.
• In the context of training neural networks, this is what we call
“descending a gradient”.
• We want to follow the path that will get us to the base of the valley
fastest.
• In gradient descent terms, this amounts to moving along the error surface in
the direction where the gradient (degree of inclination) is steepest.
• We can use the chain-rule of calculus to estimate the gradient and
adjust the weights.
ADALINE Learning Procedure

• For conciseness, let’s define the error of the
network as the function E:
E(ŷ) = (1/n) Σⱼ₌₁ⁿ (yⱼ − ŷⱼ)²
• The only values we can adjust to change ŷ
are the weights wᵢ.
• In differential calculus, taking derivatives
means calculating the rate of change of a
function with respect to an infinitely
small change in an input argument.
• In our case, it means computing the rate of
change of the function E in response to a
very small change in w.
• That is what we call computing a gradient,
which we will call Δ, at a point on the error
surface.
ADALINE Learning Procedure
• Widrow and Hoff had the idea that instead
of computing the gradient for the total mean
squared error E, they could approximate the
gradient’s value by computing the partial
derivative of the error with respect to the
weights on each iteration.
• Since we are dealing with a single case, let’s
drop the summation symbols and indices for
clarity. The function to differentiate becomes:
e(w, x) = (y − (b + wx))²
 We can calculate the gradient of e by
applying the chain-rule of calculus
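Carrying out that chain-rule computation on e(w, x) = (y − (b + wx))² gives (a standard derivation, shown here for reference):

```latex
\Delta \;=\; \frac{\partial e}{\partial w}
\;=\; \frac{\partial}{\partial w}\bigl(y - (b + wx)\bigr)^{2}
\;=\; 2\bigl(y - (b + wx)\bigr)\cdot(-x)
\;=\; -2\,(y - \hat{y})\,x
```

The inner derivative of y − (b + wx) with respect to w is −x, which is where the −x factor comes from; the constant 2 is usually absorbed into the learning rate.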
ADALINE Learning Procedure
ADALINE Learning Procedure
• Finally, the rule to update the weights says the
following: “change the weight wⱼ by a portion, η,
of the calculated negative gradient, Δⱼ.”
• We use the negative of the gradient because we
want to go “downhill”; otherwise, we would be
climbing the surface in the wrong direction.
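Putting the pieces together, the ADALINE learning procedure can be sketched as follows (illustrative Python, trained on the AND function; the factor of 2 from the gradient is absorbed into the learning rate η, a common convention):

```python
def train_adaline(data, eta=0.05, epochs=200):
    """Gradient descent on the squared error, one case at a time (Widrow-Hoff).

    data: list of (inputs, label) pairs, labels in {+1, -1}.
    """
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            y_hat = b + sum(wi * xi for wi, xi in zip(w, x))  # linear output
            error = y - y_hat            # continuous, so learning never stops
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]  # step against the gradient
            b += eta * error
    return w, b

and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w, b = train_adaline(and_data)

def classify(x):
    """Threshold the learned linear output to get a class label."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
```

Unlike the perceptron loop, this update fires on every case, because the continuous error y − ŷ is almost never exactly zero.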
Schematic Representation of the ADALINE with
the Learning Procedure
Thank You
