Machine Learning Unit 4


Machine Learning

Unit IV
Support Vector Machines (SVM)
Introduction:

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector
Machine. Consider the below diagram, in which two different categories are
classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We first train the model with many images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
Because SVM creates a decision boundary between these two classes (cat and dog) and
chooses the extreme cases (support vectors), it will look at the extreme cases of cats and dogs.
On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, such data is termed
linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by using a straight line, such data is termed non-linear data,
and the classifier used is called a Non-linear SVM classifier; a brief code sketch of
both cases follows this list.
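To make the distinction concrete, here is a minimal sketch (not from the notes) using scikit-learn, assuming it is installed: a linear-kernel SVM on blob-shaped data and an RBF-kernel SVM on moon-shaped data that no single straight line can separate. The dataset generators and parameter values are illustrative choices only.

```python
# Minimal sketch: linear vs. non-linear SVM with scikit-learn (illustrative only).
from sklearn.datasets import make_blobs, make_moons
from sklearn.svm import SVC

# Linearly separable data -> Linear SVM classifier
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
linear_svm = SVC(kernel="linear").fit(X_lin, y_lin)
print("Linear SVM accuracy:", linear_svm.score(X_lin, y_lin))

# Non-linearly separable data -> Non-linear SVM with an RBF kernel
X_nonlin, y_nonlin = make_moons(n_samples=200, noise=0.1, random_state=0)
rbf_svm = SVC(kernel="rbf").fit(X_nonlin, y_nonlin)
print("Non-linear (RBF) SVM accuracy:", rbf_svm.score(X_nonlin, y_nonlin))
```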

Linear Discriminant Functions for Binary Classification:


Linear Discriminant Analysis (LDA), also known as Normal Discriminant Analysis or
Discriminant Function Analysis, is a dimensionality reduction technique primarily utilized
in supervised classification problems. It facilitates the modeling of distinctions between
groups, effectively separating two or more classes. LDA operates by projecting features
from a higher-dimensional space into a lower-dimensional one. In machine learning, LDA
serves as a supervised learning algorithm specifically designed for classification tasks,
aiming to identify a linear combination of features that optimally segregates classes within
a dataset.
For example, we have two classes and we need to separate them efficiently. Classes can
have multiple features. Using only a single feature to classify them may result in some
overlapping as shown in the below figure. So, we will keep on increasing the number of
features for proper classification.

Assumptions of LDA
LDA assumes that the data has a Gaussian distribution and that the covariance matrices of
the different classes are equal. It also assumes that the data is linearly separable, meaning
that a linear decision boundary can accurately classify the different classes.
Suppose we have two sets of data points belonging to two different classes that we want
to classify. As shown in the given 2D graph, when the data points are plotted on the 2D
plane, there’s no straight line that can separate the two classes of data points completely.
Hence, in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph
into a 1D graph in order to maximize the separability between the two classes.

Fig: Linearly separable dataset

Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories
and hence, reduces the 2D graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
Fig: The perpendicular distance between the line and the points

In the above graph, it can be seen that a new axis (in red) is generated and plotted in the
2D graph such that it maximizes the distance between the means of the two classes and
minimizes the variation within each class. In simple terms, this newly generated axis
increases the separation between the data points of the two classes. After generating this
new axis using the above-mentioned criteria, all the data points of the classes are plotted
on this new axis and are shown in the figure given below.

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both classes linearly separable.
In such cases, we use non-linear discriminant analysis.

How does LDA work?


LDA works by projecting the data onto a lower-dimensional space that maximizes the
separation between the classes. It does this by finding a set of linear discriminants that
maximize the ratio of between-class variance to within-class variance. In other words, it
finds the directions in the feature space that best separates the different classes of data.
Mathematical Intuition Behind LDA
Let's suppose we have two classes and d-dimensional samples x1, x2, …, xn, where:
 n1 samples come from the class (c1) and n2 samples come from the class (c2).
If xi is a data point, then its projection on the line represented by the unit vector v can be
written as vᵀxi.
Let u1 and u2 be the means of the samples of classes c1 and c2 respectively before
projection, and let u1hat denote the mean of the samples of class c1 after projection. It can
be calculated by:

u1hat = (1/n1) ∑ (over xi ∈ c1) vᵀxi = vᵀu1

Similarly, for class c2:

u2hat = (1/n2) ∑ (over xi ∈ c2) vᵀxi = vᵀu2

Now, we need to project our data on the line having direction v which maximizes

J(v) = (u1hat − u2hat)² / (s1² + s2²)

For maximizing the above equation we need to find a projection vector that maximizes the
difference of means while reducing the scatters of both classes. Now, the scatters s1 and
s2 of classes c1 and c2 are:

s1² = ∑ (over xi ∈ c1) (vᵀxi − u1hat)²

and

s2² = ∑ (over xi ∈ c2) (vᵀxi − u2hat)²

After simplifying the above equations, we get the scatter within the classes (sw) and the scatter
between the classes (sb):

sw = S1 + S2, where S1 = ∑ (over xi ∈ c1) (xi − u1)(xi − u1)ᵀ and S2 = ∑ (over xi ∈ c2) (xi − u2)(xi − u2)ᵀ

sb = (u1 − u2)(u1 − u2)ᵀ

Now, we simplify the numerator and denominator of J(v):

(u1hat − u2hat)² = vᵀ sb v and s1² + s2² = vᵀ sw v, so J(v) = (vᵀ sb v) / (vᵀ sw v)

Now, to maximize the above equation we differentiate with respect to v and set the result to
zero, which gives the generalized eigenvalue problem sb v = λ sw v, i.e. sw⁻¹ sb v = λ v.
Here, for the maximum value of J(v), we use the eigenvector corresponding to the
highest eigenvalue; for two classes this gives v ∝ sw⁻¹(u1 − u2). This will provide us with the best solution for LDA.
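As a rough illustration of this result, the following NumPy sketch (an assumption of this write-up, not part of the notes) computes the two-class LDA direction v ∝ sw⁻¹(u1 − u2) and projects 2D data onto it; the sample data is made up for demonstration.

```python
# Minimal sketch of the Fisher LDA projection direction (two classes), using NumPy.
import numpy as np

def lda_direction(X1, X2):
    """Return the direction v that maximizes J(v) = (v^T sb v) / (v^T sw v)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sw = S1 + S2
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # For two classes the top eigenvector of sw^-1 sb is proportional to sw^-1 (mu1 - mu2).
    v = np.linalg.solve(Sw, mu1 - mu2)
    return v / np.linalg.norm(v)

# Illustrative data (assumed, not from the notes)
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([3, 3], 1.0, size=(50, 2))
v = lda_direction(X1, X2)
proj1, proj2 = X1 @ v, X2 @ v   # 2D data reduced to 1D
print("Projected class means:", proj1.mean(), proj2.mean())
```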
Extensions to LDA
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs
are used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of
different variables on LDA.

Perceptron Algorithm:

Perceptron is a Machine Learning algorithm for the supervised learning of various binary
classification tasks. Further, a Perceptron is also understood as an Artificial Neuron or neural
network unit that helps to detect certain input data computations in business intelligence.

The Perceptron model is also treated as one of the best and simplest types of Artificial Neural
Networks. However, it is a supervised learning algorithm of binary classifiers. Hence, we can
consider it as a single-layer neural network with four main parameters, i.e., input values,
weights and bias, net sum, and an activation function.

Basic Components of Perceptron

Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains
three main components. These are as follows:

o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the system
for further processing. Each input node contains a real numerical value.

o Weight and Bias:

The weight parameter represents the strength of the connection between units. This is another
most important parameter of the Perceptron components. Weight is directly proportional to the
strength of the associated input neuron in deciding the output. Further, bias can be
considered as the intercept term in a linear equation.

o Activation Function:

These are the final and important components that help to determine whether the neuron
will fire or not. Activation Function can be considered primarily as a step function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the
desired form of the outputs. The activation function may differ (e.g., Sign, Step, or Sigmoid)
between perceptron models, for example depending on whether the learning process is slow or
suffers from vanishing or exploding gradients.

How does Perceptron work?

In Machine Learning, the Perceptron is considered a single-layer neural network that consists
of four main parameters named input values (input nodes), weights and bias, net sum, and
an activation function. The perceptron model begins by multiplying all input values by
their weights and adding these products together to create the weighted sum. This
weighted sum is then applied to the activation function 'f' to obtain the desired output. This
activation function is also known as the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is
indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift
the activation function curve up or down.

Perceptron model works in two important steps as follows:


Step-1

In the first step, multiply all input values with the corresponding weight values and then add
them to determine the weighted sum. Mathematically, we can calculate the weighted sum as
follows:

∑wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:

Y = f(∑wi*xi + b)
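A minimal NumPy sketch of these two steps (Step-1 weighted sum, Step-2 step activation); the input, weight, and bias values are illustrative only.

```python
# Minimal sketch of the two perceptron steps described above, using NumPy.
import numpy as np

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b      # Step-1: sum(wi * xi) + b
    return 1 if weighted_sum > 0 else 0  # Step-2: step activation f

# Illustrative values (assumed for demonstration)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.7, -0.2])
b = 0.1
print(perceptron_output(x, w, b))  # prints 0 or 1
```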

Types of Perceptron Models

Based on the layers, Perceptron models are divided into two types. These are as follows:

1. Single-layer Perceptron Model


2. Multi-layer Perceptron model

Single Layer Perceptron Model:

This is one of the simplest types of Artificial Neural Network (ANN). A single-layer perceptron
model consists of a feed-forward network and also includes a threshold transfer function inside
the model. The main objective of the single-layer perceptron model is to analyze linearly
separable objects with binary outcomes.

A single-layer perceptron model does not start from recorded weights, so it begins
with randomly allocated values for the weight parameters. Further, it sums up all the
weighted inputs. After adding all inputs, if the total sum is more than a pre-determined
value, the model gets activated and shows the output value as +1.

If the outcome matches the pre-determined threshold value, the performance of this
model is considered satisfactory, and the weights are not changed. However, this model
shows some discrepancies when multiple weighted input values are fed into the
model. Hence, to obtain the desired output and minimize errors, some changes to the
input weights are necessary.

"Single-layer perceptron can learn only linearly separable patterns."


Multi-Layered Perceptron Model:

Like a single-layer perceptron model, a multi-layer perceptron model also has the same model
structure but has a greater number of hidden layers.

The multi-layer perceptron model is typically trained with the Backpropagation algorithm, which
executes in two stages as follows:

o Forward Stage: Activation functions start from the input layer in the forward stage
and terminate at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per
the model's requirement. In this stage, the error between the actual and desired
output is propagated backward from the output layer to the input layer.

Hence, a multi-layered perceptron model can be considered as an artificial neural network
with multiple layers, in which the activation function does not remain linear as in a single-
layer perceptron model. Instead of a linear function, activation functions such as sigmoid,
TanH, ReLU, etc. can be used for deployment.

A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT,
XNOR, NOR.

Perceptron function ''f(x)'' can be achieved as output by multiplying the input 'x' with the
learned weight coefficient 'w'.

Mathematically, we can express it as follows:


f(x)=1; if w.x+b>0

otherwise, f(x)=0

o 'w' represents real-valued weights vector


o 'b' represents the bias
o 'x' represents a vector of input x values.

Characteristics of Perceptron

The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
2. In Perceptron, the weight coefficient is automatically learned; a minimal sketch of this learning rule follows the list.
3. Initially, weights are multiplied with input features, and the decision is made whether
the neuron is fired or not.
4. The activation function applies a step rule to check whether the weight function is
greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the two
linearly separable classes +1 and -1.
6. If the weighted sum of all input values exceeds the threshold value, the neuron produces
an output signal; otherwise, no output will be shown.
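Below is a minimal sketch of the classic perceptron learning rule on a tiny linearly separable example (an AND gate). The learning rate, epoch count, and data are illustrative assumptions, not part of the notes.

```python
# Minimal sketch of the classic perceptron learning rule (weights learned automatically).
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # toy inputs (AND gate)
y = np.array([0, 0, 0, 1])                      # linearly separable targets

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for xi, target in zip(X, y):
        output = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        w += lr * (target - output) * xi             # update weights only on errors
        b += lr * (target - output)

print("learned weights:", w, "bias:", b)
print("predictions:", [(1 if np.dot(w, xi) + b > 0 else 0) for xi in X])
```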

Limitations of Perceptron Model

A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If
input vectors are non-linear, it is not easy to classify them properly.

Future of Perceptron

The future of the Perceptron model is bright and significant, as it helps to interpret data
by building intuitive patterns and applying them in the future. Machine learning is a rapidly
growing and continuously evolving area of Artificial Intelligence; hence, perceptron
technology will continue to support and facilitate analytical behavior in machines, which
will, in turn, add to the efficiency of computers.

The perceptron model is continuously becoming more advanced and working efficiently on
complex problems with the help of artificial neurons.

Large margin classifier for linearly separable data using SVM:


A key concept in Support Vector Machines (SVM) is the idea of a large margin classifier for
linearly separable data. The goal is to find a hyperplane that maximizes the margin between
two classes of data points. Here's an explanation with simple equations and diagrams:

 Consider a binary classification problem where you have two classes: Class A and Class
B. The objective of SVM is to find a hyperplane that best separates these two classes
while maximizing the margin. The equation for this hyperplane is

w · x + b = 0

Where:
w: The weight vector that defines the orientation of the hyperplane.
x: The feature vector representing an input data point.
b: The bias or threshold value that shifts the hyperplane.

Maximizing the Margin:

 The margin is the distance between the hyperplane and the nearest data points from
each class. To maximize this margin, we want to find w and b such that the distance
from the hyperplane to the closest point in Class A and the closest point in Class B is
maximized.
Mathematically, this can be represented as:

Margin = 2 / ||w||

Where ||w|| is the Euclidean norm (magnitude) of the weight vector w. The objective of
SVM is to maximize this margin, which is equivalent to minimizing ||w||.
Here is a diagram illustrating this concept:

In the diagram, the decision hyperplane (the straight line) separates Class A from Class B.
The margin is the distance from the hyperplane to the closest data points from each class.

• SVM's objective is to find the optimal w and b that maximize this margin while
ensuring that data points are correctly classified.
• In this ideal scenario of linearly separable data, the support vectors are the data points
closest to the hyperplane, and they are used to define the margin.
• SVM finds these support vectors and optimizes the margin by solving a constrained
optimization problem.
The large margin classifier provides a robust solution for linearly separable data, ensuring a
wider separation between classes and making it less sensitive to noise in the data.

Fig: Representing small and large margin at SVM classifier model

Fig: Representing good and bad SVM classifier models in small and large margin cases

A large margin classifier in SVM for linearly separable data aims to find an optimal
hyperplane that maximizes the margin between two classes, ensuring a robust separation.
Support vectors define this margin, and SVM finds the best hyperplane by minimizing
classification errors while maximizing the margin, enhancing classification accuracy and
robustness.
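As a rough illustration of the margin 2/||w||, the sketch below (assuming scikit-learn is available and that the generated blob data is reasonably separable) fits a linear SVM with a very large C to approximate a hard margin and recovers the margin width from the learned weight vector.

```python
# Sketch: fit a linear SVM and recover the margin 2 / ||w|| (illustrative data).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=42)
clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

w = clf.coef_[0]
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)
print("w =", w, "b =", b)
print("margin width:", margin)
print("support vectors:\n", clf.support_vectors_)
```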

Linear soft margin classifier for overlapping classes


A linear soft margin classifier in SVM addresses overlapping classes by allowing for some
classification errors. It introduces a penalty for misclassifications, striking a balance between
maximizing the margin and tolerating some errors.
Linear Soft Margin Classifier:

 The linear soft margin classifier in SVM aims to find a hyperplane that best separates
overlapping classes, even when perfect separation isn't possible. It introduces a
"slack variable" (ξ) to account for classification errors. The objective function is
modified as follows:

Minimize: (1/2)||w||² + C ∑ ξi
Subject to: yi (w · xi + b) ≥ 1 − ξi and ξi ≥ 0, for i = 1, …, n

Where:
w: The weight vector.
b: The bias or threshold.
C: A hyperparameter that controls the trade-off between maximizing the margin and
minimizing the misclassification error.
ξi: Slack variable for the ith data point.
n: The number of data points.
yi: The class label of the ith data point.

A simple diagram illustrating the concept of a linear soft margin classifier:


Class 1 is represented by points labeled as -1.
Class 2 is represented by points labeled as +1.

• The decision hyperplane (a straight line) attempts to separate the classes, but due to
overlapping, some data points may lie on the wrong side.
• The slack variables (ξ) allow for some misclassifications while still trying to maximize the
margin. The parameter C controls the balance: a small C tolerates more errors in favor of
a wider margin, while a large C penalizes misclassifications more heavily.
• This helps SVM adapt to overlapping classes and create a margin that balances the trade-
off between classification accuracy and margin size.
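A small sketch of this trade-off (scikit-learn assumed; the data and C values are illustrative) shows how training accuracy and the number of support vectors change with C on overlapping classes:

```python
# Sketch: effect of the soft-margin parameter C on overlapping classes (illustrative).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two blob clusters with a large spread so the classes may overlap (illustrative data)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: training accuracy={clf.score(X, y):.3f}, "
          f"support vectors={len(clf.support_)}")
```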



Kernel-induced feature spaces (Nonlinear classifier)
Kernel-induced feature spaces are a critical concept in Support Vector Machines (SVM) that
allows SVM to handle non-linearly separable data by implicitly transforming it into a
higher-dimensional space where it might become linearly separable.

• In SVM, the kernel function plays a central role. The kernel function, denoted as K(x,
y), takes two input data points x and y and returns a measure of similarity between
them.
• It implicitly maps the data into a higher-dimensional feature space where linear
separation might be possible.

The equation for SVM's decision boundary in the feature space is:

f(x) = ∑ αi yi K(xi, x) + b   (sum over the n training points)

Where:
f(x): The decision function.
αi: Lagrange multipliers determined during the SVM optimization.
yi: Class labels of the data points.
K(xi, x): The kernel function that maps xi and x into the feature space.
b: The bias or threshold.

• Consider a simple 2D dataset where Class A (Green points) and Class B (blue points)
are not linearly separable in the original feature space:

Fig: Non linear separable data using 2D Space

• In this diagram, it's evident that a straight-line decision boundary cannot separate the
classes effectively in the original 2D space.
• Now, by using a kernel function, we implicitly map this data to a higher-dimensional
feature space, often referred to as a "kernel-induced feature space." Let's say we use
a radial basis function (RBF) kernel, K(x, y) = exp(−γ ||x − y||²).
• This RBF kernel implicitly maps the data to a higher-dimensional space where the
classes might become linearly separable.

Fig: Non linear separable data using 3D Kernel Space and 2D Space

• In this new feature space, the data points might be linearly separable with the right
choice of kernel and kernel parameters, enabling SVM to find an optimal decision
boundary that maximizes the margin between classes.

The transformation into the kernel-induced feature space is implicit and doesn't require
explicit calculation of the transformed feature vectors. It allows SVM to handle non-linearly
separable data effectively.
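As an illustration, the following sketch (make_circles is an assumed stand-in for the figure's data; scikit-learn assumed) compares a linear kernel with an RBF kernel on data that is not linearly separable in the original 2D space:

```python
# Sketch: an RBF kernel lets SVM separate data that is not linearly separable
# in the original 2D space (illustrative only).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("Linear kernel accuracy:", linear_clf.score(X, y))  # poor: no separating line exists
print("RBF kernel accuracy:   ", rbf_clf.score(X, y))     # near perfect
```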

Regression by SVM (Support Vector Machine):

Support Vector Machines (SVM) are not just limited to classification tasks; they can also be
used for regression. In regression tasks, the goal is to predict a continuous target variable
rather than class labels. SVM for regression is known as Support Vector Regression (SVR). It is
classified into two models:

1. Linear regression by SVM


2. Non-Linear Regression by SVM

Linear regression by SVM:

Linear regression using Support Vector Machines (SVM) is a variation of SVM designed for
regression tasks. It aims to find a linear relationship between input features and a
continuous target variable.

• In linear regression using SVM, the goal is to find a linear function that best
approximates the relationship between input features and the target variable. This
linear function is represented as:
f(x) = w ⋅ x + b

Where:
f(x): The predicted target variable.
w: The weight vector.
x: The feature vector representing the input data point.
b: The bias or intercept term.

• The linear regression objective is to minimize the mean squared error (MSE) between
the predictions and the true target values, together with a regularization term:

Minimize: (1/2)||w||² + C ∑ (yi − f(xi))²

Where:
yi: The true target variable of the ith data point.
C: A regularization parameter controlling the trade-off between fitting the data and
keeping the model simple.

• The target variable (y) is represented on the vertical axis, and the input features (x)
are on the horizontal axis.
• The linear function f(x) = w.x + b is the best-fitting line that minimizes the mean
squared error by adjusting the weight vector (w) and the bias term (b).

• This linear model can be used for regression tasks to predict continuous target
variables based on input features.
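A minimal sketch of linear SVR with scikit-learn on made-up, roughly linear data (the target function, noise level, and C value are illustrative assumptions):

```python
# Sketch: linear Support Vector Regression with scikit-learn (illustrative data).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X.ravel() + 1.0 + rng.normal(0, 1.0, size=100)   # roughly linear target

svr = SVR(kernel="linear", C=1.0).fit(X, y)
print("learned w:", svr.coef_[0], "b:", svr.intercept_[0])  # approximates f(x) = w*x + b
print("prediction at x=5:", svr.predict([[5.0]])[0])
```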

Non-Linear Regression by SVM:


Non-linear regression by Support Vector Machines (SVM) uses the principles of SVM to
model non-linear relationships between input features and a continuous target variable. The
key idea is to use kernel functions to implicitly map the data into a higher-dimensional
space, where a linear regression model can be applied effectively

• In non-linear regression using SVM, the goal is to find a non-linear function that best
fits the relationship between input features and the target variable.
• Unlike linear regression, which assumes a linear relationship, non-linear regression
allows for more complex, non-linear patterns.
• The non-linear regression objective is to minimize the mean squared error (MSE)
between the predictions and the true target values, together with a regularization term:

Minimize: (1/2)||w||² + C ∑ (yi − f(xi))²

Where:
yi: The true target variable of the ith data point.
C: A regularization parameter controlling the trade-off between fitting the
data and keeping the model simple.
To account for non-linearity, the non-linear function is represented as:

f(x) = ∑ αi K(xi, x) + b   (sum over the n training points)

Where:
f(x): The predicted target variable.
αi: Lagrange multipliers determined during the SVM optimization.
K(xi, x): The kernel function that implicitly maps xi and x into a higher-dimensional
feature space.
b: The bias or intercept term.

• The target variable (y) is represented on the vertical axis, and the input features (x)
are on the horizontal axis.
• The non-linear function f(x) = ∑ αi K(xi, x) + b captures non-linear relationships
between input features and the target variable by implicitly mapping the data into a
higher-dimensional feature space using the kernel function.
• The model can then make non-linear predictions based on the input features.
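A minimal sketch of non-linear SVR with an RBF kernel on a made-up sine-shaped target (the data, C, and gamma values are illustrative assumptions):

```python
# Sketch: non-linear SVR with an RBF kernel fitting a sine-shaped target (illustrative).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, size=(120, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=120)        # non-linear relationship

svr_rbf = SVR(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print("prediction at x=pi/2:", svr_rbf.predict([[np.pi / 2]])[0])  # close to 1.0
```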

Learning with neural networks toward cognitive machines:

Learning with neural networks toward cognitive machines represents an approach to
developing intelligent systems that can reason, learn, and adapt like human cognition.

• Cognitive machines are designed to emulate human-like thinking and problem-solving
processes. Here's an overview of how learning with neural networks contributes to
the development of cognitive machines:

1. Neural Networks as a Foundation

Neural networks, particularly deep learning models like convolutional neural networks
(CNNs) and recurrent neural networks (RNNs), serve as the foundation for cognitive
machine learning. These models are capable of handling complex data, learning patterns,
and making predictions.
2. Supervised and Unsupervised Learning:

Cognitive machines employ both supervised and unsupervised learning techniques. In
supervised learning, neural networks are trained on labeled data, while unsupervised
learning allows them to discover hidden patterns and structures in data.
3. Reinforcement Learning:
Cognitive machines also incorporate reinforcement learning, enabling them to learn
through interactions with their environment. Agents learn by receiving rewards and
penalties based on their actions, enabling them to make decisions and adapt over time.
4. Transfer Learning:

To mimic cognitive abilities, neural networks use transfer learning. Pre-trained models are
fine-tuned for specific tasks, which is akin to humans applying knowledge learned in one
context to solve related problems.
5. Multimodal Data Processing:

Cognitive machines process data from various sources (text, images, audio)
simultaneously, fostering a more comprehensive understanding of the environment. They
can analyze multiple data modalities to make informed decisions.
6. Memory and Reasoning:

Cognitive machines integrate memory networks and reasoning modules, enabling them to
store and retrieve information and perform logical reasoning. This allows them to solve
problems by considering context and past experiences.

7. Natural Language Understanding and Generation:

Cognitive machines excel in natural language processing tasks. They can understand and
generate human-like text and engage in meaningful conversations, making them highly
interactive and adaptive.
8. Contextual Awareness:

These machines have contextual awareness, recognizing the importance of the context in
which they operate. They can adapt their behavior, decisions, and responses based on the
current situation.
9. Continuous Learning:

Cognitive machines don't stop learning after initial training. They engage in continuous
learning and self-improvement, allowing them to adapt to changing conditions and acquire
new knowledge over time.
10. Emulating Human Cognition:

The ultimate goal of learning with neural networks toward cognitive machines is to create
systems that replicate and augment human-like cognition. They mimic human problem-
solving, decision-making, creativity, and adaptability.
In summary, learning with neural networks toward cognitive machines involves a holistic
approach to developing intelligent systems. By combining various learning techniques, these
machines can process complex data, reason, understand language, adapt to changing
situations, and replicate cognitive functions, bringing us closer to creating intelligent
systems that emulate human cognition and understanding.

Neuron Models:
Let us discuss two neuron models:

1. Biological neuron
2. Artificial neuron

1. Biological neuron:

Fig: Biological Neuron Structure

• Neuron Structure:

A typical human neuron consists of three main parts:


Cell Body (Soma): The cell body contains the nucleus and other organelles.
Dendrites: These are the branched extensions that receive signals from other neurons.
Axon: The axon is a long, slender extension that transmits signals to other neurons
or cells.

• Synapses:

Neurons communicate with each other through synapses, which are small gaps
between the axon of one neuron and the dendrites of another. Neurotransmitters
are released at the synapse to transmit signals.

• Action Potential:
Neurons transmit electrical signals in the form of action potentials. An action
potential is a brief change in the neuron's electrical charge, leading to the
propagation of a signal along the axon.

• Resting Potential:

Neurons maintain a resting potential, which is a difference in electrical charge
across the cell membrane. It is around −70 millivolts and is essential for neural
signaling.

• Threshold and Firing:

When the electrical charge inside the neuron reaches a certain threshold, an action
potential is initiated. This action potential travels down the axon and signals the
release of neurotransmitters at the synapse.

• Excitatory and Inhibitory Neurons:

Neurons can be classified as either excitatory or inhibitory. Excitatory neurons
promote action potentials, while inhibitory neurons reduce the likelihood of an
action potential.

• Neural Networks:

Neurons are interconnected in complex networks. These networks allow for
information processing, learning, and memory formation.
2. Artificial Neuron:

Fig: Artificial Neuron

• Inputs (x1, x2... xn):


Artificial neurons receive multiple input signals, each associated with a weight (w1, w2, ..., wn).
These inputs represent the features of the data being processed.

• Weights (w1, w2... wn):


Each input signal is multiplied by a weight. The weights determine the importance of
each input in the neuron's computation.

• Summation (Σ):
The weighted inputs are summed together, typically with a bias term (b), to compute
the net input:
Net Input= (w1∗x1) + (w2∗x2) +...+ (wn∗xn) +b

 Activation Function (f):

The net input is passed through an activation function, such as the sigmoid, ReLU, or
tanh function. The activation function introduces non-linearity to the model.
Common choices include the sigmoid function:

Output = f(Net Input) = 1 / (1 + e^(−Net Input))

 Output (y):
The result of the activation function is the output of the artificial neuron. It
represents the neuron's response to the input signals.
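A short NumPy sketch of this neuron model with a sigmoid activation (the weight, bias, and input values are illustrative assumptions):

```python
# Sketch of the artificial neuron above: weighted sum, bias, and sigmoid activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def artificial_neuron(x, w, b):
    net_input = np.dot(w, x) + b        # (w1*x1) + (w2*x2) + ... + (wn*xn) + b
    return sigmoid(net_input)           # output y = f(net input)

# Illustrative weights and inputs (assumed values)
x = np.array([0.8, 0.2, -0.5])
w = np.array([0.5, -0.3, 0.9])
b = 0.1
print("neuron output:", artificial_neuron(x, w, b))
```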

Neural Network Architectures:


In the biological brain, a huge number of neurons are interconnected to form the network
and perform advanced intelligent activities. The artificial neural network is built by neuron
models. Many different types of artificial neural networks have been proposed, just as there
are many theories on how biological neural processing works. We may classify organization
of the neural networks into two types. They are

1. Single layer neural networks


2. Multilayer neural networks

Single layer neural networks:

A single-layer neural network, also known as a single-layer Perceptron, is the simplest neural
network architecture. It consists of an input layer, which directly connects to an output
layer, without any hidden layers. Single-layer networks are mainly used for binary
classification problems or linearly separable tasks.
Fig: Single layer Artificial Neural Network

• Weighted Sum (z):

The weighted sum of input features is computed as follows

z = w1·x1 + w2·x2 + … + wn·xn + b

Where:
z is the weighted sum.

x1, x2, …., xn are the input features.

w1, w2… wn are the corresponding weights.

b is the bias.

• Activation Function (f(z)):

A step function, also known as the Heaviside step function, is often used as the
activation function. It outputs 1 if the weighted sum z is greater than or equal to 0,
and 0 otherwise

• In the diagram, input features (x1, x2, ..., xn) are connected to the weighted sum
calculation, followed by the activation function (step function), which produces a
binary output (0 or 1).
• This single-layer neural network can make binary decisions based on the weighted
sum of its input features, which is often used for linearly separable classification
problems.
• Single-layer networks are limited in their capability compared to more complex neural
architectures like multi-layer perceptrons (MLPs) or deep neural networks.
• They can only solve problems that are linearly separable and cannot capture complex
non-linear relationships in data. While simple, they are foundational in understanding
neural networks and are a starting point for more sophisticated architectures. To
handle more complex tasks, deeper neural networks with hidden layers are employed.

Multilayer neural networks:


Multi-layer neural networks, often referred to as multi-layer perceptrons (MLPs), are a type
of artificial neural network with multiple layers of interconnected neurons. These networks
are designed to handle more complex tasks by introducing hidden layers between the input
and output layers.

Fig: Multilayer artificial neural network

• Weighted Sum (z) in a Hidden Layer:

The weighted sum for each neuron in a hidden layer is calculated as follows:

zj = ∑ wij ⋅ xi + bj   (sum over the inputs i)

Where:
zj is the weighted sum for neuron j in the hidden layer.
wij is the weight connecting input i to neuron j.
xi is the input from the previous layer.
bj is the bias for neuron j.

• Activation Function (f(z)) for Hidden Layers:


Common activation functions for hidden layers include the sigmoid, ReLU, or tanh
functions
f(z) = Activation function (z)

• Weighted Sum (z) in the Output Layer:

The weighted sum for each neuron in the output layer is calculated similarly to the
hidden layer:

zk = ∑ w'kj ⋅ f(zj) + b'k   (sum over the m hidden neurons j)

Where:
zk is the weighted sum for neuron k in the output layer.
w'kj is the weight connecting neuron j in the hidden layer to neuron k in the output layer.
f(zj) is the output of neuron j in the hidden layer.
b'k is the bias for neuron k in the output layer.

• Activation Function (f(z)) for Output Layer:

The activation function in the output layer depends on the type of problem. For
binary classification, you might use a sigmoid function. For multiclass classification, a
softmax function is common.
In this diagram, input features (x1, x2... xn) are connected to the weighted sum
calculations in the hidden layer, followed by the activation function for the hidden layer. The
output of the hidden layer is then connected to the weighted sum calculations in the output
layer, followed by the activation function for the output layer. This network structure allows
multilayer neural networks to capture complex relationships and solve a wide range of tasks,
including classification, regression, and more.
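A short NumPy sketch of one forward pass through such a network, with one hidden layer and sigmoid activations (the layer sizes and weight values are illustrative assumptions, not from the notes):

```python
# Sketch of one forward pass through a multilayer network: a hidden layer with a
# sigmoid activation followed by an output layer (shapes and values are illustrative).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 0.25])        # input features x1..xn (n = 3)

W_hidden = rng.normal(size=(4, 3))     # weights w_ij for 4 hidden neurons
b_hidden = np.zeros(4)                 # biases b_j
z_hidden = W_hidden @ x + b_hidden     # z_j = sum_i w_ij * x_i + b_j
h = sigmoid(z_hidden)                  # f(z_j)

W_out = rng.normal(size=(2, 4))        # weights w'_kj for 2 output neurons
b_out = np.zeros(2)                    # biases b'_k
z_out = W_out @ h + b_out              # z_k = sum_j w'_kj * f(z_j) + b'_k
y = sigmoid(z_out)                     # output activations

print("hidden activations:", h)
print("network output:", y)
```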

Linear neuron and the widrow-Hoff Learning Rule:


A linear neuron, also known as a McCulloch-Pitts neuron or a threshold neuron, is a
simplified model of a biological neuron that computes a weighted sum of its inputs and
compares it to a threshold to produce an output.
Linear Neuron:

• Inputs (x1, x2... xn): A linear neuron takes multiple input values (x1, x2... xn). Each input
is associated with a weight (w1, w2... wn), which represents the importance of that
input.
• Weighted Sum (z): The weighted sum of inputs is computed as
Z = w1 x1+w2 x2+...+wn xn

• Threshold (θ): The weighted sum is compared to a threshold (θ) to produce the
output.
• Output (y): If the weighted sum z is greater than or equal to the threshold θ, the
neuron's output is 1. Otherwise, the output is 0.

A linear neuron can be used for binary classification, where it acts as a simple decision-
maker, and the weights and threshold are adjusted to make correct classifications.

Widrow-Hoff Learning Rule:


The Widrow-Hoff learning rule, also known as the delta rule or the LMS (Least Mean
Squares) algorithm, is a supervised learning algorithm used to adjust the weights of a linear
neuron to minimize the error in classification or regression tasks. It updates the weights
based on the prediction error and the input values. The update rule for the ith weight is as
follows

winew = wiold + α (target−output) xi

Where:
winew is the new weight.
wiold is the old weight.
α is the learning rate, controlling the step size for weight updates.
target is the desired output or target value.
output is the actual output of the neuron.
xi is the input associated with weight wi.

• The learning rule adjusts the weights in the direction that reduces the error. It
continues to update the weights in an iterative process until the error is minimized or
converges to a satisfactory level.
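A minimal NumPy sketch of this iterative update, applied to the neuron's linear weighted-sum output as in the ADALINE/LMS formulation; the data, learning rate, and epoch count are illustrative assumptions.

```python
# Sketch of the Widrow-Hoff (LMS) update for a single linear neuron,
# fitting y ~ w1*x1 + w2*x2 on illustrative data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
true_w = np.array([2.0, -3.0])
targets = X @ true_w                      # desired outputs

w = np.zeros(2)
alpha = 0.1                               # learning rate
for epoch in range(100):
    for xi, target in zip(X, targets):
        output = np.dot(w, xi)            # linear neuron output
        w = w + alpha * (target - output) * xi   # winew = wiold + a*(target - output)*xi
print("learned weights:", w)              # should approach [2.0, -3.0]
```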

The Widrow-Hoff learning rule is a foundational concept in machine learning and neural
networks, providing a mechanism for training linear neurons to make accurate binary
classifications or predictions in a supervised learning context.
Error Correction Delta Rule:

The Error Correction Delta Rule, often referred to simply as the Delta Rule or the Delta
Learning Rule, is a supervised learning algorithm used to adjust the weights of artificial
neurons in a neural network, specifically in the context of supervised learning tasks. The
primary goal of this rule is to minimize the error between the actual output of the neuron
and the desired target output.
Components of the Error Correction Delta Rule:

• Actual Output (Y): This is the output produced by the artificial neuron or network
based on the current set of weights and inputs.

• Desired Target Output (D): This is the expected or correct output for the given input.
It's provided during the training phase.
• Error (E): The error is the difference between the actual output and the desired target
output:
E = D − Y

The Weight Update Rule:

The goal of the Error Correction Delta Rule is to adjust the weights to minimize the error (E).
The update for the ith weight wi is given by:

winew = wiold + α ⋅ E ⋅ xi

Where:
winew is the new weight.
wiold is the old weight.
α is the learning rate, controlling the step size for weight updates.
E is the error as calculated above.
xi is the input associated with weight wi.

Weight Adjustment Process:

• Calculate the error (E) by taking the difference between the desired target output (D)
and the actual output (Y).
• Adjust each weight (wi) based on the weight update rule, considering the learning rate
(α).
This weight adjustment process is repeated iteratively for multiple data points during the
training process until the error converges to a satisfactory level, meaning that the difference
between the desired and actual outputs is minimized.
The Error Correction Delta Rule is a foundational concept in supervised learning for
neural networks. It's used to train the network by iteratively adjusting the weights to make
the network's predictions more accurate and aligned with the desired target outputs. The
choice of the learning rate is crucial, as it affects the speed and stability of the learning
process.
