
UNIT-II

THE PERCEPTRON

Bio-inspired Learning

The term "Artificial Neural Network" is derived from Biological neural


networks that develop the structure of a human brain, Similar to the human brain
that has neurons interconnected to one another, artificial neural networks also have
neurons that are interconnected to one another in various layers of the networks.
These neurons are known as nodes

The given figure illustrates a typical biological neural network; a typical artificial neural network looks something like the figure shown.

Input is accepted through the dendrites, and output is transmitted through the axon as electrical pulses.

Through the synapse, the signal is transmitted to other neurons; the synapse connects to the dendrites of the next neuron.

Dendrites in a biological neural network correspond to inputs in an artificial neural network, the cell nucleus corresponds to nodes, the synapse corresponds to weights, and the axon corresponds to the output.
Biology tells us that our brains are made up of a bunch of little units, called neurons, that send electrical signals to one another.

The rate of firing tells us how "activated" a neuron is. A single neuron might have, say, three incoming neurons.

These incoming neurons are firing at different rates (i.e., have different
activations). Based on how much these incoming neurons are firing, and how
“strong” the neural connections are, our main neuron will “decide” how strongly it
wants to fire. And so on through the whole brain.

Learning in the brain happens by neurons becoming connected to other neurons, and by the strengths of those connections adapting over time.

There are around 1000 billion neurons in the human brain. Each neuron has an association point somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly powerful parallel processors.

Perceptron:-

A single artificial neuron has a tree-like structure (single-layer perceptron). It contains input nodes and a single output node (a single processing layer with one node).

A neuron connected to "n" inputs is called a perceptron; it consists of weights, a summation processor and an activation function.

The perceptron is a building block of an artificial neural network. It was invented by Frank Rosenblatt in the mid-20th century (1957) for performing certain calculations to detect capabilities in input data, or for business intelligence. The perceptron is a linear machine learning algorithm used for the supervised learning of binary classifiers.

Basic Components of Perceptron:-

=>Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the
system for further processing.

Each input node contains a real numerical value.

=>Weight and Bias: The weight parameter represents the strength of the connection between units.

This is another important parameter of the perceptron's components.

A weight is directly proportional to the strength of the associated input neuron in deciding the output.

Further, the bias can be considered as the intercept in a linear equation.

=>Activation Function:

This is the final and most important component, which helps to determine whether the neuron will fire or not.

The activation function can be considered primarily as a step function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function

Step function:- negative input values are mapped to 0 (a < 0 => 0), and non-negative input values are mapped to 1 (a >= 0 => 1).

Summation = 30 => step(30) => 1

Summation = -10 => step(-10) => 0

Sigmoid function:- The curve of the Sigmoid function called “S Curve” is shown
here.

This is called a logistic sigmoid and leads to a probability of the value between 0
and 1.

θ(z) = 1 / (1 + e^(-z))

This is useful as an activation function when one is interested in a probability mapping rather than precise values of the input.

The sigmoid output is close to zero for highly negative input. This can be a
problem in neural network training and can lead to slow learning and the model
getting trapped in local minima during training. Hence, hyperbolic tangent is more
preferable as an activation function in hidden layers of a neural network.

Hyperbolic Activation Functions

The graph below shows the curve of these activation functions:


Apart from these, tanh, sinh, and cosh can also be used as activation functions.

Based on the desired output, a data scientist can decide which of these activation
functions need to be used in the Perceptron logic.

Sign function:- generates an output of +1 for positive inputs and -1 for negative inputs, i.e., output values lie in {-1, +1}.
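The following is a minimal Python sketch of the activation functions discussed above (step, sign, sigmoid and tanh); the function names are illustrative.

import math

def step(a):
    # Step function: 1 for non-negative activations, 0 otherwise
    return 1 if a >= 0 else 0

def sign(a):
    # Sign function: +1 for positive activations, -1 otherwise
    return 1 if a > 0 else -1

def sigmoid(a):
    # Logistic sigmoid: squashes the activation into (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def tanh(a):
    # Hyperbolic tangent: squashes the activation into (-1, 1)
    return math.tanh(a)

print(step(30), step(-10))   # 1 0  (matches the step-function example above)
print(sigmoid(0))            # 0.5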
Characteristics of Perceptron

The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary


classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is made
whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weighted sum is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the
two linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must
have an output signal; otherwise, no output will be shown.

Perceptron model works in two important steps as follows:

Step1:- computing summation

Mathematically, an input vector x = <x1, x2, ..., xD> arrives. The neuron stores D-many weights w1, w2, ..., wD and computes the sum:

a = Σ_{d=1..D} wd*xd = w1*x1 + w2*x2 + w3*x3 + ... + wD*xD

If a is greater than zero we treat the example as positive, otherwise we treat it as negative:

a > 0 => positive => +1

a < 0 => negative => -1

Eg:-
Let us suppose the inputs are x1=1, x2=1, x3=0 and the weights are w1=3, w2=3, w3=1. The summation of the products of inputs and weights is

a = 3*1 + 3*1 + 1*0 = 6 > 0 => the example is positive.

It is often convenient to have a non-zero threshold. In other words, we might want to predict positive if a >= θ for some value θ. The most convenient way to achieve this is to introduce a bias term into the neuron, so that the summation is always increased by some fixed value b.

Step2:-

In the second step, an activation function is applied to the weighted sum (a), which gives us an output either in binary form or as a continuous value, as follows:

Y = f(∑wd*xd + b)
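As a small illustration, the two steps (weighted summation plus bias, followed by a step activation) can be sketched in Python as follows, using the example values from above.

def perceptron_output(x, w, b):
    # Step 1: weighted sum of the inputs plus the bias
    a = sum(wd * xd for wd, xd in zip(w, x)) + b
    # Step 2: apply a step activation to the summation
    return 1 if a >= 0 else 0

# Example from above: x = (1, 1, 0), w = (3, 3, 1), with bias b = 0
print(perceptron_output([1, 1, 0], [3, 3, 1], 0))   # a = 6 > 0, so the output is 1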

Error-Driven Updating: The Perceptron Algorithm:-

The perceptron is a classic learning algorithm for the neural model of learning. First, it is online: it processes one example (input) at a time and then goes on to the next one. Second, it is error driven: so long as it is doing well, it doesn't bother updating its parameters.

The algorithm maintains a “guess” at good parameters (weights and bias) as it


runs. It processes one example at a time. For a given example, it makes a
prediction. It checks to see if this prediction is correct (recall that this is training
data, so we have access to true labels). If the prediction is correct, it does nothing.
Only when the prediction is incorrect does it change its parameters, and it changes
them in such a way that it would do better on this example next time around. It
then goes on to the next example. Once it hits the last example in the training set, it
loops back around for a specified number of iterations.

Let the true label be y, which is either +1 or -1. If ya > 0 our prediction is correct; otherwise (ya <= 0) our prediction is wrong, and in that case we update each weight wd by adding y*xd and update the bias by adding y.
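A minimal sketch of this error-driven training loop is given below; the data format (lists of feature vectors with labels in {+1, -1}), zero initialization and the MaxIter default are assumptions made for illustration.

def perceptron_train(X, Y, max_iter=10):
    # X: list of feature vectors, Y: list of labels in {+1, -1}
    D = len(X[0])
    w = [0.0] * D                # initialize all weights to zero
    b = 0.0                      # initialize the bias to zero
    for _ in range(max_iter):
        for x, y in zip(X, Y):
            a = sum(wd * xd for wd, xd in zip(w, x)) + b    # activation
            if y * a <= 0:                                  # mistake: update the parameters
                w = [wd + y * xd for wd, xd in zip(w, x)]   # wd <- wd + y*xd
                b = b + y                                   # b <- b + y
    return w, b

def perceptron_predict(x, w, b):
    a = sum(wd * xd for wd, xd in zip(w, x)) + b
    return 1 if a > 0 else -1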

We now update our weights and bias. Call the new weights w'1, ..., w'D and the new bias b'. Suppose we observe the same example again and need to compute a new activation a'. The difference between the old activation a and the new activation a' is

a' − a = Σ_d xd² + 1

But each xd² ≥ 0, since it is squared, so this value is always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have successfully moved the activation in the proper direction.

The only hyperparameter of the perceptron algorithm is MaxIter, the number of


passes to make over the training data. If we make many many passes over the
training data, then the algorithm is likely to overfit. On the other hand, going over
the data only one time might lead to underfitting.

Consider the following graph. The x-axis shows the number of passes over the data and the y-axis shows the training error and the test error. Due to overfitting, a higher error rate is observed on the test data.
Geometric Interpretation:-

Since the single-layer perceptron is a binary classifier, its decision boundary is determined by where the activation a changes sign from -1 to +1.

The decision boundary is the set of points that are neither positive nor negative, i.e., where the activation is exactly zero (taking the bias to be zero for now). Formally, the decision boundary is computed as follows:

The activation Σ_d wd*xd is just the dot product between the vector w = <w1, w2, ..., wD> and the vector x = <x1, x2, ..., xD>. We will write this as w · x. Two vectors have a zero dot product if and only if they are perpendicular. So the decision boundary is simply the plane perpendicular to w.

The weight vector is shown, together with its perpendicular plane. This plane forms the decision boundary between positive points and negative points.

One thing to notice is that the scale of the weight vector is irrelevant from the
perspective of classification. Suppose you take a weight vector w and replace it
with 2w.

For example, a = 10 becomes 20 and sign(a) stays +1; a = -10 becomes -20 and sign(a) stays -1.

All activations are now doubled, but their sign does not change. This makes complete sense geometrically, since all that matters is which side of the plane a test point falls on, not how far it is from that plane. For this reason, it is common to work with normalized weight vectors w that have length one, i.e., ||w|| = 1.

The value w · x is just the distance of x from the origin when projected onto the
vector w.
We can think of this as a one-dimensional version of the data, where each data point is placed according to its projection along w. This distance along w is exactly the activation of that example, with no bias (i.e., zero threshold).
Any example with a negative projection onto w would be classified negative; any
example with a positive projection, positive.
The bias simply moves this threshold. Now, after the projection is computed, b is
added to get the overall activation. The projection plus b is then compared against
zero.
from a geometric perspective, the role of the bias is to shift the decision boundary
away from the origin, in the direction of w.
So if b is positive, the boundary is shifted away from w; if b is negative, the boundary is shifted toward w.
Hyperplane:-
The decision boundary which separates the positive and negative examples is
called hyper plane.
In two dimensions, a 1-d hyperplane is simply a line; in three dimensions, a 2-d hyperplane is like a sheet of paper.
Consider the situation in the figure: we have a current guess as to the hyperplane, and a positive training example comes in that is currently misclassified. The weights are updated:
w ← w + yx.
This yields the new weight vector, also shown in the figure. In this case, the weight vector changed enough that this training example is now correctly classified.
Interpreting Perceptron Weights:-

Suppose the perceptron has learned a really good classifier, and we wonder whether it is learning "the right thing." We may also want to remove a bunch of features that aren't very useful because they're expensive to compute or take a lot of storage.

The perceptron learns a classifier of the form sign(∑wdxd + b).

A natural question is: how sensitive is the final classification to small changes in some particular feature?

Consider, say, the 7th feature. The rate at which the activation (a) changes as a function of the 7th feature is exactly w7.

This gives rise to a useful heuristic for interpreting perceptron weights:

o Sort all the weights from largest (most positive) to smallest (most negative).


o Take the top ten and bottom ten. The top ten are the features that the
perceptron is most sensitive to for making positive predictions. The bottom
ten are the features that the perceptron is most sensitive to for making
negative predictions
o This heuristic is useful, especially when the inputs x consist entirely of
binary values.
o The heuristic is less useful when the range of the individual features varies
significantly. The issue is that if you have one feature x5 that’s either 0 or 1,
and another feature x7 that’s either 0 or 100, but w5 = w7, it’s reasonable to
say that w7 is more important because it is likely to have a much larger
influence on the final prediction. The easiest way to compensate for this is
simply to scale your features ahead of time: this is another reason why
feature scaling is a useful preprocessing step.
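A short sketch of this heuristic: sort the learned weights and look at the most positive and most negative ones (the feature names here are hypothetical).

def most_sensitive_features(weights, feature_names, k=10):
    # Pair each feature with its learned weight and sort from most positive to most negative
    ranked = sorted(zip(feature_names, weights), key=lambda p: p[1], reverse=True)
    top_positive = ranked[:k]    # features the perceptron is most sensitive to for positive predictions
    top_negative = ranked[-k:]   # features the perceptron is most sensitive to for negative predictions
    return top_positive, top_negative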

Perceptron Convergence and Linear Separability:-

A key property of the perceptron algorithm is that, if the data is linearly separable, it will converge to a weight vector that separates the data.

Let w = [w0, w1, ..., wd]ᵀ and x = [1, x1, ..., xd]ᵀ (the bias w0 pairs with the constant input 1), so that wᵀ = [w0 w1 ... wd].

Linear separability:- A dataset is said to be linearly separable if it can be broken into two partitions, positive (P) and negative (N), such that there is some weight vector w* satisfying

w*ᵀ xj > 0 for every xj ∈ P

w*ᵀ xi < 0 for every xi ∈ N

If the training data is linearly separable, this means that there exists some hyperplane that puts all the positive examples on one side and all the negative examples on the other side.

If P and N are finite sets that are linearly separable, then the perceptron learning algorithm updates the weights only a finite number of times; eventually there is a point where w correctly classifies both classes. The perceptron will converge more quickly for easy learning problems than for hard learning problems. We define "easy" and "hard" in a meaningful way through the notion of margin.

Margin:-

If a hyperplane separates the dataset, then the margin is the distance between the hyperplane and the nearest point.

Problems with large margins should be easy, and problems with small margins should be hard.

Formally, given a dataset D, a weight vector w and bias b, the margin of (w, b) on D is defined as:

margin(D, w, b) = min_{(x,y)∈D} y(w · x + b) if (w, b) separates D, and −∞ otherwise.

In words, the margin is only defined if (w, b) actually separates the data (otherwise it is just −∞). In the case that it separates the data, we find the point with the minimum activation, after the activation is multiplied by the label. The margin is denoted by the Greek letter γ (gamma).

The margin of a data set is the largest attainable margin on this data. Formally,

margin(D) = sup_{w,b} margin(D, w, b).

That is, we "try" every possible (w, b) pair; for each pair, we compute its margin, and we then take the largest of these as the overall margin of the data. If the data is not linearly separable, then the value of the sup, and therefore the value of the margin, is −∞.
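A small sketch of the margin of (w, b) on a dataset D, following the definition above (−∞ if (w, b) fails to separate the data, otherwise the minimum label-scaled activation):

def margin(D, w, b):
    # D is a list of (x, y) pairs with y in {+1, -1}
    activations = [y * (sum(wd * xd for wd, xd in zip(w, x)) + b) for x, y in D]
    if any(a <= 0 for a in activations):
        return float("-inf")     # (w, b) does not separate the data
    return min(activations)      # smallest activation after multiplying by the label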

The number of errors that the perceptron algorithm makes is bounded by γ⁻².

(Perceptron Convergence Theorem). Suppose the perceptron algorithm is run on a linearly separable data set D with margin γ > 0. Assume that ||x|| ≤ 1 for all x ∈ D. Then the algorithm will converge after at most 1/γ² updates.

If the data is linearly separable with margin γ, then there exists some weight
vector w∗ that achieves this margin. Obviously we don’t know what w∗ is, but
we know it exists.

Every time the perceptron makes an update, the angle between w and w∗
changes. What we prove is that the angle actually decreases.

Proof: The margin γ > 0 must be realized by some set of parameters, say w ∗ .
Suppose we train a perceptron on this data. Denote by w (0) the initial weight
vector, w(1) the weight vector after the first update, and so on w(k) the weight
vector after the kth update.

First, we will show that w* · w(k) grows quickly as a function of k. Second, we will show that ||w(k)|| does not grow quickly.

First, suppose that the kth update happens on example (x, y).

We are trying to show that w(k) is becoming aligned with w∗ .

Because we updated, we know that this example was misclassified: y w(k-1) · x ≤ 0. After the update, we get w(k) = w(k-1) + yx (i.e., wd ← wd + y·xd for each d).

We do a little computation (writing w* · x = w*0 + w*1 x1 + w*2 x2 + ... + w*D xD, with w*0 playing the role of the bias b):

w* · w(k) = w* · (w(k-1) + yx)
          = w* · w(k-1) + y(w* · x)
          ≥ w* · w(k-1) + γ        (since y(w* · x) ≥ γ by the definition of the margin)

Applying the same argument to w(k-1) = w(k-2) + yx gives w* · w(k) ≥ w* · w(k-2) + 2γ, and going back k such steps we get

w* · w(k) ≥ w* · w(0) + kγ = kγ.

Let β be the angle between w* and w(k). Then

w* · w(k) = ||w*|| ||w(k)|| cos β,  so  cos β = (w* · w(k)) / (||w*|| ||w(k)||).

Every time we make a correction, γ is added to the numerator, and since we have made k such updates the numerator is at least kγ.

For the denominator, consider ||w(k)||²:

||w(k)||² = ||w(k-1) + yx||² = ||w(k-1)||² + 2y(w(k-1) · x) + ||x||² ≤ ||w(k-1)||² + 1,

because y(w(k-1) · x) ≤ 0 at an update and ||x|| ≤ 1. Going back k such steps, ||w(k)||² ≤ ||w(0)||² + k = k, so ||w(k)|| ≤ √k.

Therefore cos β ≥ kγ / √k = γ√k, i.e., cos β grows in proportion to √k.

As k increases, this lower bound on cos β becomes arbitrarily large. But cos β ≤ 1, so k must be bounded by some maximum number (in fact k ≤ 1/γ²).

That means there are only a finite number of corrections to w, and the algorithm will converge.

(The bound ||w(k)|| ≥ w* · w(k) used above follows from the Cauchy–Schwarz inequality |u · v| ≤ |u||v|. To see this, note that for any scalar t,

0 ≤ (tu + v) · (tu + v) = t²|u|² + 2t(u · v) + |v|²,

so this quadratic in t has discriminant b² − 4ac ≤ 0, which implies 4(u · v)² ≤ 4|u|²|v|², i.e., (u · v)² ≤ |u|²|v|². Taking u = w* and v = w(k) gives (w* · w(k))² ≤ |w*|²|w(k)|², and since ||w*|| = 1 this yields ||w(k)|| ≥ w* · w(k).)
Improved Generalization: Voting and Averaging

In order to make it more competitive with other learning algorithms, you need to
modify it a bit to get better generalization. consider a data set with 10,000
examples. Suppose that after the first 100 examples, the perceptron has learned a
really good classifier. It’s so good that it goes over the next 9899 examples without
making any updates. It reaches the 10,000th example and makes an error. It
updates. For all we know, the update on this 10, 000th example completely ruins
the weight vector that has done so well on 99.99% of the data!

We would like weight vectors that "survive" a long time to get more say than weight vectors that are overthrown quickly. One way to achieve this is by voting. As the perceptron learns, it remembers how long each hyperplane survives. At test time, each hyperplane encountered during training "votes" on the class of a test example. If a particular hyperplane survived for 20 examples, then it gets a vote of 20. If it only survived for one example, it only gets a vote of 1. In particular, let (w, b)^(1), ..., (w, b)^(K) be the K + 1 weight vectors encountered during training, and c^(1), ..., c^(K) be the survival times for each of these weight vectors. (A weight vector that gets immediately updated gets c = 1; one that survives another round gets c = 2, and so on.) The prediction is based on

ŷ = sign( Σ_k c^(k) sign(w^(k) · x̂ + b^(k)) )

A much more practical alternative is the averaged perceptron. The idea is similar:
you maintain a collection of weight vectors and survival times. However, at test
time, you predict according to the average weight vector, rather than the voting

Initially: m = 1, w1 = y1·x1, c1 = 1
1. For t = 2, 3, ...
2.   If yt (wm · xt) ≤ 0 then:
        a. wm+1 = wm + yt·xt
           m = m + 1
           cm = 1
     else:
        cm = cm + 1
3. Output (w1, c1), (w2, c2), ..., (wm, cm)
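A small Python sketch of this procedure for scalar (one-dimensional) inputs, also showing the voted and averaged predictions; it reproduces the worked problem that follows.

def train_counted_perceptron(X, Y):
    # X: list of scalar inputs, Y: labels in {+1, -1}; follows the steps above
    ws = [Y[0] * X[0]]                     # w1 = y1 * x1
    cs = [1]                               # c1 = 1
    for x, y in zip(X[1:], Y[1:]):
        if y * (ws[-1] * x) <= 0:          # mistake: create a new weight vector
            ws.append(ws[-1] + y * x)
            cs.append(1)
        else:                              # survived: increment its count
            cs[-1] += 1
    return ws, cs

def voted_predict(ws, cs, x):
    s = sum(c * (1 if w * x > 0 else -1) for w, c in zip(ws, cs))
    return 1 if s > 0 else -1

def averaged_predict(ws, cs, x):
    w_avg = sum(c * w for w, c in zip(ws, cs))   # count-weighted sum of the weights
    return 1 if w_avg * x > 0 else -1

ws, cs = train_counted_perceptron([4, 1, -3, -2], [+1, -1, -1, +1])
print(ws, cs)                                                    # [4, 3, 1] [1, 2, 1]
print(voted_predict(ws, cs, -3), averaged_predict(ws, cs, -3))   # -1 -1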

Problem

X Y
4 +1
1 -1
-3 -1
-2 +1

m=1,c1=1 w1=y1x1=4*1=4,

for t=2 to 4

check y2(w1*x2)<=0

-1(4*1)<=0 yes

Then w2=w1+y2x2=4+(-1*1)=3

m=m+1=1+1=2

c2=1

next t=3

now check y3(w2*x3)<=0

-1(3*-3)=+9<=0(NO)

C2=c2+1=1+1=2

Next t=4

Now check y4(w2*x4)<=0


+1(3*-2)=-6<=0Yes

Then w3=w2+y4*x4=3+(-2*+1)=3-2=1

m=m+1=2+1=3

c3=1

c1=1,c2=2,c3=1 w1=4,w2=3,w3=1

For the test point x = -3:

Voting output = sign( c1·sign(w1·x) + c2·sign(w2·x) + c3·sign(w3·x) )
             = sign( sign(4·-3) + 2·sign(3·-3) + sign(1·-3) ) = sign(-1 + -2 + -1) => -1

Averaged output = sign( (c1·w1 + c2·w2 + c3·w3)·x )
               = sign( (1·4 + 2·3 + 1·1)·(-3) ) = sign(-33) => -1

Limitations of Perceptron Model

A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due to the


hard limit transfer function.
o Perceptron can only be used to classify linearly separable sets of input vectors.
o If the input vectors are not linearly separable, it is not easy to classify them properly.
o Perceptron XOR problem

We can't draw a linear decision boundary for the XOR problem using a single-layer perceptron.

Practical Issues

Good Features:-

Features are individual measurable properties and the basis of a model.

How well a machine learning algorithm works depends on how good the features are: if the features are good then it works properly, otherwise it can't.

So coming up with good features is one of the most important jobs in machine learning.

If a feature is binary, it is easy to use it to decide between two different things.

Object recognition

An alternative representation of images is the patch representation, where the unit


of interest is a small rectangular block of an image, rather than a single pixel.

Again, permuting the patches has no effect on the classifier.

Figure shows the same images in patch representation. Can you identify them?

A final representation is a shape representation. Here, we throw out all color and
pixel information and simply provide a bounding polygon. Figure shows the same
images in this representation.

Image classification:-

Consider the following classification


We have two classes of dogs: greyhounds and Labradors, two kinds of breeds.

Greyhounds are usually taller than Labradors.

Here, to classify the dogs we have two features: height and eye colour.

Generally, greyhounds are taller than Labradors.

If the height is small (15, 20), the dog is categorized as a Labrador; if the height is large (30, 35, 40), the dog is categorized as a greyhound.

If the height is 25, the example is about equally close to greyhounds and Labradors, so height alone is not a decisive feature.

The same applies to the eye feature: eye colour is a feature that is not useful for prediction.

Avoid useless features.

Some features are redundant: height in inches and height in centimetres both indicate the same property. Such features are called redundant features, and we should avoid them.

Text Categorization:-

In the context of text categorization , one standard representation is the bag of


words representation. Here, we have one feature for each unique word that appears
in a document. For the feature happy, the feature value is the number of times that
the word “happy” appears in the document. The bag of words (BOW)
representation throws away all position information. shows a BOW representation
for two chapters of this book. Can you tell which is which?

Irrelevant and Redundant Features:

An irrelevant feature is one completely uncorrelated with the prediction task


Eg:The presence of ‘the’ might be completely irrelevant for the prediction of
“course review is positive or negative”

Two features are redundant if they are highly correlated with each other, regardless of whether they are correlated with the task or not. For example, having a bright red pixel in an image at position (20, 93) is probably highly redundant with having a bright red pixel at position (21, 93). Both might be useful (e.g., for identifying fire hydrants),
but because of how images are structured, these two features are likely to co-occur
frequently

Feature Pruning and Normalization:-

Removing unnecessary features is called feature pruning. For example, suppose a vocabulary word appears in exactly one training document, and that document happens to be positive. It is hard to tell, with just one training example, whether the word is really correlated with the positive class or is just noise.

Pruning reduces the size of a decision tree and the complexity of the final classifier.

In the beginning, pruning does not hurt (and sometimes helps!) but eventually we
prune away all the interesting words and performance suffers

Normalization:-

It is often useful to normalize the data so that it is consistent in some way.


There are two basic types of normalization: feature normalization and example
normalization.

In feature normalization, you go through each feature and adjust it the same way
across all examples.

In example normalization, each example is adjusted individually

For example, when normalizing an image of the letter 'A', the height might be reduced from 8 to 6 pixels and the width from 7 to 5 pixels.

The goal of both types of normalization is to make it easier for your learning
algorithm to learn. In feature normalization, there are two standard things to do:

1. Centering: moving the entire data set so that it is centered around the origin.
2. Scaling: rescaling each feature so that one of the following holds:
(a) Each feature has variance 1 across the training data.
(b) Each feature has maximum absolute value 1 across the training data

The goal of centering is to make sure that no features are arbitrarily large. The goal
of scaling is to make sure that all features have roughly the same scale (to avoid
the issue of centimeters versus millimeters).

Here, x_{n,d} refers to the dth feature of example n.

Eg: suppose a feature takes the values 1, 2, 3 across three examples.

x    Centered (x − μd)    Variance scaled (x / σd)    Absolute scaled (x / rd)
1    -1                   1                           0.33
2     0                   2                           0.66
3     1                   3                           1

μd = (1 + 2 + 3)/3 = 2

σd = sqrt( (1/2)·((1−2)² + (2−2)² + (3−2)²) ) = sqrt(1) = 1

rd = max(|1|, |2|, |3|) = 3
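A small numpy sketch of the three feature normalizations in this example (centering, variance scaling and absolute scaling):

import numpy as np

x = np.array([1.0, 2.0, 3.0])            # values of one feature across three examples

centered = x - x.mean()                  # centering: subtract the feature mean
variance_scaled = x / x.std(ddof=1)      # scale so the feature has (sample) variance 1
absolute_scaled = x / np.abs(x).max()    # scale so the maximum absolute value is 1

print(centered)            # [-1.  0.  1.]
print(variance_scaled)     # [1. 2. 3.]  (the sample standard deviation here is 1)
print(absolute_scaled)     # [0.333... 0.666... 1.]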

Combinatorial Feature Explosion

The feature network represents an organization of the enumerated combinations

Consider a sentiment classification problem that has three features that simply say
whether a given word is contained in a review of a course.

These features are: excellent, terrible and not.

The excellent feature is indicative of positive reviews and

the terrible feature is indicative of negative reviews.

But in the presence of the not feature, this categorization flips. One way to
address this problem is by adding feature combinations.
We could add two additional features: excellent-and-not and terrible-and-not
that indicate a conjunction of these base features. By assigning weights as follows,
you can achieve the desired effect:

Wexecellent = +1

Wterrible = −1

Wnot = 0

Wexecllent-and-not = −2

Wterrible-and-not = +2

In this particular case, we have addressed the problem.

What we observe here is feature combination. Combinatorial feature explosion is the exponential rate at which such combined features grow into gigantic feature vectors; because of this, the computational expense can grow quickly.

Eg: if we keep increasing the number of base features and their combinations, we get more and more features, with exponential growth.

Combinatorial Transformation :-

Suppose we have a perceptron with a fixed number of nodes in the hidden layer, and we want it to predict using features that are combinations of the base features.

Since the number of nodes is a hyperparameter, we cannot simply change it now; a solution is the following.

We can perform a combinatorial transformation of the data by training a decision tree algorithm on the data. Once the tree is constructed, we can explore the tree for different combinations of features.

Consider the following decision tree


The nodes are transformed into meta-features and the paths along the branches into combinations of features; this kind of transformation is called a combinatorial transformation. For bigger trees, or if you have more data, you might benefit from longer paths.

Following a single path gives a feature combination that is used as the input to the perceptron.

Note, however, that the depth of the tree is itself a hyperparameter, just like the number of nodes.

Logarithmic transformation:-

The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2
of x, is a strong transformation with a major effect on distribution shape. It is
commonly used for reducing right skewness and is often appropriate for measured
variables

Exponential growth or decline,

y = a·e^(bx),

is made linear by

ln y = ln a + bx.
log-transform is an important transformation in text data, where the presence of
the word “excellent” once is a good indicator of a positive review; seeing
“excellent” twice is a better indicator;but the difference between seeing “excellent”
10 times and seeing it 11 times really isn’t a big deal any more. A log-transform
achieves this

the transformation is actually xd → log2 (xd + 1) to ensure that zeros remain zero
and sparsity is retained.

In the case that feature values can also be negative, the slightly more complex mapping

xd → log2(|xd| + 1)·sign(xd)

is used, where sign(xd) denotes the sign of xd.
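A short sketch of the two log-transforms above, one for non-negative counts and one for feature values that may be negative:

import numpy as np

def log_transform(x):
    # For non-negative counts: zeros stay zero, so sparsity is retained
    return np.log2(x + 1)

def signed_log_transform(x):
    # For feature values that can also be negative
    return np.log2(np.abs(x) + 1) * np.sign(x)

print(log_transform(np.array([0, 1, 10, 11])))        # 0 stays 0; 10 and 11 barely differ
print(signed_log_transform(np.array([-3.0, 0.0, 3.0])))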

Evaluating Model Performance

Evaluating model performance is one of the most important steps in developing a machine learning pipeline. Just imagine designing a model and then straight away deploying it to production; if your model is being used in the medical domain, it may lead to the deaths of multiple people.

Classification Problems

As we know, classification problems are those problems in which the output is a


discrete value.For e.g. spam detection,cancer detection etc.

Evaluation Metrics

We’ll divide the evaluation metrics into two categories based on when to use them.

 Evaluation metrics when the dataset is balanced.


 Evaluation metrics when the dataset is imbalanced

Evaluation metrics when the dataset is imbalanced

Any dataset with an unequal class distribution is technically imbalanced. However,


a dataset is said to be imbalanced when there is a significant, or in some cases
extreme, disproportion among the number of examples of each class of the problem.

Eg:- Suppose, we have a dataset in which we have 900 examples of benign tumour
and 100 examples of malignant tumour.

Evaluation metrics for balanced Dataset:-

Accuracy: The accuracy of a model is calculated by the following formula:

Accuracy = (number of correct predictions) / (total number of predictions) = (TP + TN) / (TP + TN + FP + FN)

Now suppose our model predicts(imbalanced dataset) all the outputs as a benign
tumour, now by the above formula, our accuracy is 90%(Think about it!). We can
clearly see that our model failed to predict malignant tumour. Now imagine, how
big of a disaster it would have been if you would have used the model in a real-
world scenario.

Evaluation metrics for imbalanced Dataset

 Confusion matrix: As the name suggests, a confusion matrix is a matrix that tells us for which values our model is getting confused between the different classes.
Definitions related to confusion matrix

 TP: True Positive are those examples that were actually positives and were
predicted as positives.

 Eg. The actual output was a benign tumour(positive tumour) and the model
also predicted benign tumour.

 FP: False Positives also known as Type 1 error are those examples that were
actually negatives but our model predicted them as positive.

 Eg. The actual output was a malignant tumour(negative class) but our model
predicted benign tumour.
 FN: False Negatives also known as Type 2 error are those examples that were
actually positives but our model predicted them as negative.

 Eg. The actual output was a benign tumour(positive class) but our model
predicted it as a malignant tumour.

 TN: True Negative are those examples that were actually negative and our
model predicted them as negative.

 Eg. The actual output was a malignant tumour(negative class) and our model
also predicted it as a malignant tumour.

Whenever we are training our model we should try and reduce False positives and
False negatives, such that our model makes as many correct predictions as possible.

By looking at the confusion matrix, we can get an idea of how our model is
performing on particular classes. Unlike accuracy, which gave an estimate of the
overall model.

Precision and Recall:-

These are two very important metrics to evaluate the performance of the model. No
one is better than the other, it just depends upon the use case and business
requirement. Let’s first have a look at their definition and then we’ll develop an
intuition of which to use when

Precision: The precision of a model is given by the following formula:

Precision = TP / (TP + FP)

Precision can be defined as "out of the total predicted positive values, how many are actually positive."

Recall: The recall of a model is given by the following formula:

Recall = TP / (TP + FN)

Recall can be defined as "out of the total actual positive values, how many were predicted positive."

Eg: TP = 100, FP = 10, FN = 5, TN = 50

Confusion matrix:

100 (TP)    10 (FP)    110
5 (FN)      50 (TN)    55
105         60         165

Precision = 100 / (100 + 10) = 0.9090
Recall = 100 / (100 + 5) = 0.9524
Accuracy = (100 + 50) / 165 = 0.9090
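A small sketch that reproduces these computations from the confusion-matrix counts (TP = 100, FP = 10, FN = 5, TN = 50):

def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                   # of predicted positives, how many are truly positive
    recall = tp / (tp + fn)                      # of actual positives, how many were predicted positive
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # overall fraction of correct predictions
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

print(classification_metrics(100, 10, 5, 50))
# (0.909..., 0.952..., 0.909..., 0.930...)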
F1- Score

Although precision and recall are good they don’t give us the power to compare two
models. If one model has a good recall and the other one has good precision, it
becomes really confusing which one to use for our task(until we are completely
sure that we need to focus on only one metric).

F1-Score to rescue

F1-score takes the harmonic mean of precision and recall and gives a single value to evaluate our model. It is given by the following formula:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

In some cases, you might believe that precision is more important than recall. This idea leads to the weighted f-measure, which is parameterized by a weight β ∈ [0, ∞) (beta):

F_β = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
Sensitivity:-

Sensitivity is a measure of how well a machine learning model can detect positive
instances.

It is also known as the true positive rate (TPR) or recall.

Sensitivity is used to evaluate model performance because it allows us to see how


many positive instances the model was able to correctly identify.

A model with high sensitivity will have few false negatives, which means that it misses few of the positive instances.

Sensitivity = (True Positive)/(True Positive + False Negative)

A high sensitivity means that the model is correctly identifying most of the positive
results, while a low sensitivity means that the model is missing a lot of positive
results.

sensitivity is exactly the same as recall

Specificity:-

Specificity measures the proportion of actual negatives that are correctly identified by the model; this proportion is also called the true negative rate (TNR). The remaining proportion of actual negatives are predicted as positive and are termed false positives.

Specificity = (True Negative)/(True Negative + False Positive)

A specific classifier is one which does a good job not finding the things that it
doesn’t want to find

Sensitivity and specificity measures are used to plot the ROC curve, and the area under the ROC curve (AUC) is used to summarize model performance. The following represents different ROC curves and their related AUC values.
The diagram below represents a scenario of high sensitivity (low false negatives)
and low specificity (high false positives).

The typical plot, referred to as the receiver operating characteristic (or ROC curve)
plots the sensitivity against 1 − specificity. Given an ROC curve,

We can compute the area under the curve (or AUC) metric, which also provides a
meaningful single number for a system’s performance.

AUC scores tend to be very high, even for not great systems. This is because
random chance will give you an AUC of 0.5 and the best possible AUC is 1.0.
In general, there is a trade-off: increasing sensitivity tends to increase the false-positive rate, while increasing specificity tends to increase the false-negative rate. The trade-off between sensitivity and specificity can be tuned by changing the classification threshold: a lower threshold results in higher sensitivity and lower specificity, while a higher threshold results in higher specificity and lower sensitivity.

Cross-Validation in Machine Learning

Cross-validation is a technique for validating the model efficiency by training it on


the subset of input data and testing on previously unseen subset of the input data.

 We can also say that it is a technique to check how a statistical model
generalizes to an independent dataset.

In machine learning, there is always a need to test the stability of the model, and this cannot be judged from the training dataset alone. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is somewhat different from the general train-test split.

Hence the basic steps of cross-validations are:

o Reserve a subset of the dataset as a validation set.


o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model
performs well with the validation set, perform the further step, else check for
the issues.

K-Fold Cross-Validation

K-fold cross-validation approach divides the input dataset into K groups of


samples of equal size. These samples are called folds. For each learning set, the prediction function uses k-1 folds, and the remaining fold is used as the test set.
This approach is a very popular CV approach because it is easy to understand, and
the output is less biased than other methods.

The steps for k-fold cross-validation are:

o Split the input dataset into K groups


o For each group:
o Take one group as the reserve or test data set.
o Use remaining groups as the training dataset
o Fit the model on the training set and evaluate the performance of the
model using the test set.

Let's take an example of 5-folds cross-validation. So, the dataset is grouped into 5
folds. On 1st iteration, the first fold is reserved for test the model, and rest are used
to train the model. On 2nd iteration, the second fold is used to test the model, and
rest are used to train the model. This process will continue until each fold is not
used for the test fold.
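A small sketch of 5-fold cross-validation, assuming scikit-learn is available (KFold for the splits and Perceptron as an example classifier; any model with fit/score would do):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Perceptron

X = np.random.randn(100, 3)                 # toy data: 100 examples, 3 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy labels

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    model = Perceptron()
    model.fit(X[train_idx], y[train_idx])                   # train on the other 4 folds
    scores.append(model.score(X[test_idx], y[test_idx]))    # test on the held-out fold

print(np.mean(scores))                      # average accuracy across the 5 folds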
LOO cross validation:-

In leave-one-out cross-validation, we leave one data point out of training. That is, for each learning set, only one data point is reserved for testing and the remaining dataset is used to train the model. This process repeats for each data point; hence for n samples we get n different training sets and n test sets. It has the following features:

o In this approach, the bias is minimum as all the data points are used.
o The process is executed for n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of the
model as we iteratively check against one data point.
Applications of Cross-Validation

o This technique can be used to compare the performance of different


predictive modeling methods.
o It has great scope in the medical research field.
o It can also be used for the meta-analysis, as it is already being used by the
data scientists in the field of medical statistics.

Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given
below:

o For the ideal conditions, it provides the optimum output. But for the
inconsistent data, it may produce a drastic result. So, it is one of the big
disadvantages of cross-validation, as there is no certainty of the type of data
in machine learning.
o In predictive modeling, the data evolves over time, which may create differences between the training set and the validation set. For example, if we create a model for predicting stock market values and the model is trained on the previous 5 years of stock values, the realistic values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.

Hypothesis testing:-
A hypothesis is an assumption about a population which may or may not be true. Hypothesis testing is a set of formal procedures used to either accept or reject hypotheses. Hypotheses are of two types:
Null hypothesis, H0 - represents a hypothesis of chance basis.
null hypothesis states there is no statistical relationship between the two
variables.
 Alternative hypothesis, Ha - represents a hypothesis of observations which
are influenced by some non-random cause.
 Alternative hypothesis defines there is a statistically important relationship
between two variables.
Example
Suppose we wanted to check whether a coin was fair and balanced. A null hypothesis might say that half of the flips will be heads and half will be tails, whereas the alternative hypothesis might say that the proportions of heads and tails may be very different.
H0: P = 0.5
Ha: P ≠ 0.5
For example, suppose we flipped the coin 50 times and obtained 40 heads and 10 tails. Using this result, we would reject the null hypothesis and conclude, based on the evidence, that the coin was probably not fair and balanced.
Hypothesis Tests
Following formal process is used by us to determine whether to reject a null
hypothesis, based on sample data. This process is called hypothesis testing and is
consists of following four steps:
1. State the hypotheses - This step involves stating both null and alternative
hypotheses. The hypotheses should be stated in such a way that they are
mutually exclusive. If one is true then other must be false.
2. Formulate an analysis plan - The analysis plan is to describe how to use
the sample data to evaluate the null hypothesis. The evaluation process
focuses around a single test statistic.
3. Analyze sample data - Find the value of the test statistic (using properties
like mean score, proportion, t statistic, z-score, etc.) stated in the analysis
plan.
4. Interpret results - Apply the decisions stated in the analysis plan. If the
value of the test statistic is very unlikely based on the null hypothesis, then
reject the null hypothesis.

T-test follows t-distribution, which is appropriate when the sample size is small,
and the population standard deviation is not known. The shape of a t-distribution is
highly affected by the degree of freedom. The degree of freedom implies the
number of independent observations in a given set of observations.

Assumptions of T-test:

 All data points are independent.


 The sample size is small. Generally, a sample size exceeding 30 units is regarded as large, otherwise small; to apply the t-test it should not be less than 5.
 Sample values are to be taken and recorded accurately.

The test statistic is:

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean,
s is the sample standard deviation,
n is the sample size, and
μ is the population mean.

Paired t-test: A statistical test applied when the two samples are dependent and
paired observations are taken.
Z-test

Z-test refers to a univariate statistical analysis used to test the hypothesis that
proportions from two independent samples differ greatly.

It determines to what extent a data point is away from its mean of the data set, in
standard deviation.

The researcher adopts z-test, when the population variance is known, in essence,
when there is a large sample size, sample variance is deemed to be approximately
equal to the population variance. In this way, it is assumed to be known, despite
the fact that only sample data is available and so normal test can be applied.

Assumptions of Z-test:

 All sample observations are independent


 Sample size should be more than 30.
 Distribution of Z is normal, with a mean zero and variance 1.

The test statistic is:

z = (x̄ − μ) / (σ / √n)

where x̄ is the sample mean,
σ is the population standard deviation,
n is the sample size, and
μ is the population mean.

P-value Definition
The P-value is known as the probability value.

It is defined as the probability of getting a result that is either the same or more
extreme than the actual observations.
The P-value is known as the level of marginal significance within the hypothesis
testing that represents the probability of occurrence of the given event.

The P-value is used as an alternative to the rejection point to provide the least
significance at which the null hypothesis would be rejected.
If the P-value is small, then there is stronger evidence in favour of the alternative
hypothesis.

P-value Table
The P-value table shows the hypothesis interpretations:

P-value            Decision

P-value > 0.05     The result is not statistically significant; hence do not reject the null hypothesis.

P-value < 0.05     The result is statistically significant; generally, reject the null hypothesis in favour of the alternative hypothesis.

P-value < 0.01     The result is highly statistically significant; reject the null hypothesis in favour of the alternative hypothesis.

Bootstrap Evaluation:- The bootstrap resampling procedure is sketched in the algorithm below.

It takes three arguments: the true labels y, the predicted labels ŷ, and the number of folds (resamples) to run. It returns the mean and standard deviation, from which you can compute a confidence interval.
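A minimal sketch of such a bootstrap procedure, using accuracy as the metric (the function name and the default number of resamples are illustrative):

import numpy as np

def bootstrap_evaluate(y_true, y_pred, num_folds=1000):
    # Resample (true, predicted) pairs with replacement and recompute accuracy each time
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(num_folds):
        idx = np.random.randint(0, n, size=n)           # sample n indices with replacement
        scores.append(np.mean(y_true[idx] == y_pred[idx]))
    return np.mean(scores), np.std(scores)              # mean and std give a confidence interval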
Bias-Variance Trade off – Machine Learning
It is important to understand prediction errors (bias and variance) when it comes to accuracy in any machine learning algorithm. There is a trade-off between a model's ability to minimize bias and variance; finding the right balance (for example, when selecting the value of a regularization constant) gives the best solution. A proper understanding of these errors helps to avoid overfitting and underfitting of a data set while training the algorithm.
Bias
The bias is known as the difference between the prediction of the values by
the ML model and the correct value.
Being high in biasing gives a large error in training as well as testing data. Its
recommended that an algorithm should always be low biased to avoid the
problem of underfitting.

With high bias, the predictions lie along a straight line that does not fit the data in the data set accurately. Such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature. Refer to the graph given below for an example of such a situation.

Variance
The variability of model predictions for a given data point, which tells us the spread of our predictions, is called the variance of the model.
A model with high variance has a very complex fit to the training data and thus is not able to fit accurately on data which it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. When a model has high variance, it is said to overfit the data.

Overfitting is fitting the training set accurately via complex curve and high
order hypothesis but is not the solution as the error with unseen data is high.

While training a data model variance should be kept low


The high variance data looks like follows.

Bias Variance Tradeoff


If the algorithm is too simple (hypothesis with linear eq.) then it may be on high
bias and low variance condition and thus is error-prone.
If algorithms fit too complex ( hypothesis with high degree eq.) then it may be on
high variance and low bias.
In the latter condition, the new entries will not perform well. Well, there is
something between both of these conditions, known as Trade-off or Bias
Variance Trade-off.
This tradeoff in complexity is why there is a tradeoff between bias and variance.
An algorithm can’t be more complex and less complex at the same time. For the
graph, the perfect tradeoff will be like.

The best fit will be given by hypothesis on the tradeoff point.

LINEAR MODELS

The term linear model implies that the model is specified as a linear combination
of features. Based on training data, the learning process computes one weight for
each feature to form a model that can predict or estimate the target value

The perceptron is a linear classifier: it runs a particular algorithm until a linear separator is found.

The perceptron algorithm will find a separating w if the data is separable ‣ its efficiency depends on the margin and the norm of the data. However, if the data is not separable, optimizing zero/one loss is NP-hard ‣ i.e., there is no efficient way to minimize it.

Optimization provides a way to minimize the loss function. In addition to


minimizing training error, we want a simpler model ‣
Remember our goal is to minimize generalization error ‣

Recall the bias and variance tradeoff for learners

We can add a regularization term R(w,b) that prefers simpler models ‣

For example we may prefer decision trees of shallow depth

Here λ is a hyperparameter of optimization problem

Convex Surrogate Loss Functions:-

Zero/one loss is hard to optimize

Small changes in w can cause large changes in the loss

Surrogate loss: replace Zero/one loss by a smooth function ‣


Smooth version of the threshold function Known as a sigmoid/logistic function –
Smooth transition between 0-1
h_w(x) = g(wᵀx), where z = wᵀx and

g(z) = 1 / (1 + e^(-z))

g is the sigmoid function.

Decision boundary –
y = 1: h(x) ≥ 0.5, i.e., wᵀx ≥ 0
y = 0: h(x) < 0.5, i.e., wᵀx < 0

The benefit of using such an S-function is that it is smooth, and potentially easier
to optimize. The difficulty is that it is not convex

Error surface:- The linear classifier hypothesis space is parameterized by w.

It is easier to optimize if the surrogate loss is convex.

A convex function is one that looks like a happy face; on the other hand, a concave function is one that looks like a sad face (an easy mnemonic). There are two equivalent definitions of a convex function:

1. Second derivative is always non-negative.

2. Any chord of the function lies above it.

The convex function minimizes the loss This leads to the idea of convex surrogate
loss functions. Since zero/one loss is hard to optimize, you want to optimize
something else, instead. Since convex functions are easy to optimize, we want to
approximate zero/one loss with a convex function. This approximating function
will be called a surrogate loss. The surrogate losses we construct will always be
upper bounds on the true loss function: this guarantees that if you minimize the
surrogate loss, you are also pushing down the real loss.
There are four common surrogate loss functions, each with their own properties:
hinge loss, logistic loss, exponential loss and squared loss.

Zero/one:    ℓ^(0/1)(y, ŷ) = 1[y·ŷ ≤ 0]

Hinge:       ℓ^(hin)(y, ŷ) = max{0, 1 − y·ŷ}

Logistic:    ℓ^(log)(y, ŷ) = (1 / log 2) · log(1 + exp[−y·ŷ])

Exponential: ℓ^(exp)(y, ŷ) = exp[−y·ŷ]

Squared:     ℓ^(sqr)(y, ŷ) = (y − ŷ)²
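These loss functions can be sketched in Python as follows, with y the true label in {+1, -1} and yhat the real-valued predicted score:

import math

def zero_one_loss(y, yhat):
    return 1.0 if y * yhat <= 0 else 0.0

def hinge_loss(y, yhat):
    return max(0.0, 1.0 - y * yhat)

def logistic_loss(y, yhat):
    return math.log(1.0 + math.exp(-y * yhat)) / math.log(2.0)

def exponential_loss(y, yhat):
    return math.exp(-y * yhat)

def squared_loss(y, yhat):
    return (y - yhat) ** 2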

Weight regularization:-

Regularization is the technique in which slight modifications are made to learning


algorithm such that the model generalizes better. This in turn results in the
improvement of the model’s performance on the test data or unseen data. In weight
regularization, It penalizes the weight matrices of nodes. Weight regularization
results in simpler linear network and slight underfitting of training data

We add a regularization term R(w,b) to the objective.

We can say R(w,b) is a good regularization function when:

 It keeps the weights small —

➡ Change in the features cause small change to the score

➡ Robustness to noise

‣ To be sparse —

➡ Use as few features as possible

➡ Similar to controlling the depth of a decision tree

In L1 weight regularization, the sum of the absolute values of the weights is used to measure the size of the weights.

In L2 weight regularization, the sum of the squared values of the weights is used to measure the size of the weights. The use of L2 regularization in linear regression and logistic regression is often referred to as Ridge Regression or Tikhonov regularization.

For example, for w = (w1, w2) the p = 2 norm is ||w||2 = (|w1|² + |w2|²)^(1/2).

General optimization framework:-

Select a suitable:

‣ convex surrogate loss

‣ convex regularization

Select the hyperparameter λ, and minimize the regularized objective with respect to w.

This framework for optimization is called Tikhonov regularization or generally


Structural Risk Minimization (SRM)
Optimization with Gradient Descent:-

Gradient Descent is known as one of the most commonly used optimization


algorithms to train machine learning models by means of minimizing errors
between actual and expected results. Further, gradient descent is also used to train
Neural Networks. The main objective of gradient descent is to minimize the
convex function using iteration of parameter updates

Gradient descent was initially proposed by Augustin-Louis Cauchy in the mid-19th century. Gradient descent is one of the most commonly used iterative optimization algorithms in machine learning, used to train machine learning and deep learning models. It helps in finding a local minimum of a function.

The gradient is the slope; gradient descent means moving downhill, i.e., opposite to the direction of maximum slope.

The gradient tells how much the output of a function (L) changes with respect to a change in the input (W).

Mathematically, the gradient is the partial derivative of the function with respect to the weights, i.e., the change in the loss with respect to a change in the weights.

In machine learning, the gradient of a function with more than one input variable is the vector of its partial derivatives, and it plays the role of the slope of the function.

Gradient descent is one of the most commonly used algorithms in machine learning and deep learning. It is used to train models, it is based on a convex function, and it is an iterative optimization algorithm used to minimize the value of that function.

Learning rate:- η is called the learning rate.

The learning rate is a hyperparameter of the optimization algorithm that determines the step size of each iterative update when searching for the minimum of the loss.

The learning rate is nothing but the step size.

A higher learning rate allows the algorithm to learn faster, i.e., to update the weights and biases faster, at the cost of possibly arriving at a sub-optimal solution.

A smaller learning rate tends to yield a more optimal solution, but it may take significantly longer to reach it.

We can also use an adaptive learning rate: the algorithm starts with a larger step size and reduces it over time, which reduces training time compared to using a small fixed learning rate.
Choose the exponential loss as the loss function and the 2-norm as the regularizer. The objective is

L(w, b) = Σ_n exp[−y_n(w · x_n + b)] + (λ/2)||w||²

The only "strange" thing in this objective is that we have replaced λ with λ/2; the reason for this change is just to make the gradients cleaner. We can first compute the derivative with respect to b:

∂L/∂b = −Σ_n y_n exp[−y_n(w · x_n + b)]

The update is of the form w ← w − η∇wL.

For poorly classified points, the gradient points in the direction −ynxn, so the
update is of the form

w ← w + cynxn, where c is some constant

Note that c is large for very poorly classified points and small for relatively well
classified points. By looking at the part of the gradient related to the regularization,

the update says: w ← w − λw = (1 − λ)w. This has the effect of shrinking the
weights toward zero
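A minimal numpy sketch of gradient descent on the exponential loss with a 2-norm regularizer, following the updates above; the learning rate η, λ and the number of steps are illustrative choices.

import numpy as np

def gd_exp_loss(X, y, lam=0.1, eta=0.01, num_steps=500):
    # X: (n, d) array of features, y: array of labels in {+1, -1}
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(num_steps):
        a = X @ w + b                        # activations
        g = np.exp(-y * a)                   # large for poorly classified points, small otherwise
        grad_w = -(y * g) @ X + lam * w      # gradient of the loss plus (lambda/2)||w||^2
        grad_b = -np.sum(y * g)
        w -= eta * grad_w                    # w <- w - eta * grad_w
        b -= eta * grad_b
    return w, b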

Types of Gradient Descent:-

Based on the error in various training models, the Gradient Descent learning
algorithm can be divided into

Batch gradient descent

Stochastic gradient descent

Mini-batch gradient descent

1. Batch Gradient Descent: Batch gradient descent (BGD) is used to find the
error for each point in the training set and update the model after evaluating
all training examples.
Let’s say there are a total of ‘m’ observations in a data set and we use all these
observations to calculate the loss function, then this is known as Batch Gradient
Descent.

 Forward propagation and backward propagation are performed and the parameters
are updated. In batch Gradient Descent since we are using the entire training set,
the parameters will be updated only once per epoch.

This procedure is known as the training epoch. In simple words, it is a greedy


approach where we have to sum over all examples for each update.

Advantages of Batch gradient descent:

It produces less noise in comparison to other gradient descent.


It produces stable gradient descent convergence.
It is Computationally efficient as all resources are used for all training
samples.

2. Stochastic gradient descent Stochastic gradient descent (SGD) is a type of


gradient descent that runs one training example per iteration. Or in other words,
it processes a training epoch for each example within a dataset and updates
each training example's parameters one at a time.
As it requires only one training example at a time, hence it is easier to store in
allocated memory. However, it shows some computational efficiency losses in
comparison to batch gradient systems as it shows frequent updates that require
more detail and speed.
Further, due to frequent updates, it is also treated as a noisy gradient. However,
sometimes it can be helpful in finding the global minimum and also escaping
the local minimum.

Let’s say we have 5 observations and each observation has three features and the
values that I’ve taken are completely random.

Now if we use SGD, we will take the first observation, pass it through the neural network, calculate the error and then update the parameters.
Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example,
and it consists of a few advantages over other gradient descent.
It is easier to allocate in desired memory.
It is relatively fast to compute than batch gradient descent.
It is more efficient for large datasets.

3. Mini-Batch Gradient Descent: Mini-batch gradient descent is a combination of batch gradient descent and stochastic gradient descent. It divides the training dataset into small batches and then performs an update on each batch separately. Splitting the training dataset into smaller batches strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent. Hence, we achieve a form of gradient descent with high computational efficiency and a less noisy gradient.

Again let’s take the same example and assume a batch size of 2. We take the first two observations, pass them through the neural network, calculate the error, and update the parameters.
Then we take the next two observations and perform the same steps: pass them through the network, calculate the error, and update the parameters.
Since only a single observation is left for the final iteration, the last batch contains just that one observation, and we update the parameters using it.

Advantages of Mini-Batch gradient descent:

It is easier to fit in allocated memory.
It is computationally efficient.
It produces stable gradient descent convergence.
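A minimal sketch of the mini-batch loop with batch size 2, matching the walk-through above; the synthetic data and hyperparameters are assumed for illustration.

import numpy as np

X = np.random.randn(5, 3)              # the same 5 observations with 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # synthetic targets
w = np.zeros(3)
eta, batch_size = 0.01, 2

for epoch in range(10):
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]       # batches of 2, 2 and then 1 observation
        yb = y[start:start + batch_size]
        error = Xb @ w - yb
        grad = (Xb.T @ error) / len(Xb)        # average gradient over the mini-batch
        w -= eta * grad                        # one update per mini-batch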

Challenges with the Gradient Descent

Although we know Gradient Descent is one of the most popular methods for
optimization problems, it still also has some challenges. There are a few challenges
as follows:

1. Local Minima and Saddle Points: For convex problems, gradient descent can easily find the global minimum, while for non-convex problems it is sometimes difficult to find the global minimum, which is where the machine learning model achieves its best results.
Whenever the slope of the cost function is at or very close to zero, the model stops learning further. Apart from the global minimum, there are other points where this near-zero slope occurs: saddle points and local minima.
A local minimum has a shape similar to the global minimum, in that the slope of the cost function increases on both sides of the current point.

2. Vanishing and Exploding Gradients
In a deep neural network trained with gradient descent and backpropagation, two more issues can occur besides local minima and saddle points.
Vanishing Gradients:
A vanishing gradient occurs when the gradient is smaller than expected. During backpropagation, the gradient becomes progressively smaller, so the earlier layers of the network learn more slowly than the later layers. Once this happens, the weight updates in those early layers become so small that they are insignificant.
Exploding Gradients:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the gradient is too large, which creates an unstable model. In this scenario the model weights grow very large and may eventually be represented as NaN.

This problem can be mitigated using dimensionality reduction techniques, which help to minimize complexity within the model.
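To see numerically why this happens, here is a small sketch (not from the notes) in which backpropagation is modelled as repeated multiplication by a per-layer derivative factor; the factors 0.8 and 1.2 are assumed values.

# Model backpropagation through 50 layers as repeated multiplication
# by a per-layer derivative factor (0.8 and 1.2 are assumed values).
layers = 50
grad_vanish, grad_explode = 1.0, 1.0
for _ in range(layers):
    grad_vanish *= 0.8     # factor slightly below 1: gradient shrinks layer by layer
    grad_explode *= 1.2    # factor slightly above 1: gradient grows layer by layer

print(grad_vanish)    # about 1.4e-05: the earliest layers receive almost no signal
print(grad_explode)   # about 9.1e+03: updates blow up and weights can become NaN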

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised
Learning algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N = the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that
could be chosen.
Our objective is to find a plane that has the maximum margin, i.e the maximum
distance between data points of both classes.
Maximizing the margin distance provides some reinforcement so that future data points
can be classified with more confidence.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data
point in the correct category in the future. This best decision boundary is called a
hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which there are two different categories that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the
KNN classifier.
Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether the animal is a cat or a dog, such a model can be created using the SVM algorithm.
We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature.
The SVM creates a decision boundary between these two classes (cat and dog) by choosing the extreme cases (support vectors), so it will look at the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the creature as a cat.
Consider the below diagram:

The SVM algorithm can be used for face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes
in n-dimensional space, but we need to find out the best decision boundary that
helps to classify the data points. This best boundary is known as the hyperplane
of SVM.
The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance between the hyperplane and the nearest data points.

Linear SVM:
o The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and blue),
and the dataset has two features x1 and x2. We want a classifier that can
classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:

Since this is a 2-D space, we can easily separate these two classes by just using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane.
The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors.
The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with the maximum margin is called the optimal hyperplane.
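As a hedged illustration of a linear SVM in code, the sketch below uses scikit-learn's SVC with a linear kernel on a tiny made-up dataset of two tags; the data points and the choice of C are assumptions, not values from these notes.

import numpy as np
from sklearn.svm import SVC

# Made-up 2-D data with two tags that are linearly separable
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])        # 0 = "blue", 1 = "green"

clf = SVC(kernel="linear", C=1.0)       # linear SVM classifier
clf.fit(X, y)

print(clf.support_vectors_)             # the extreme points that define the margin
print(clf.predict([[3, 3], [7, 6]]))    # classify new (x1, x2) pairs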

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line,
but for non-linear data, we cannot draw a single straight line. Consider the
below image:
So to separate these data points, we need to add one more dimension.

Consider the following sample of 2-D data, which cannot be separated by a straight line.

For linear data we have used the two dimensions x and y, so for non-linear data we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below
image:
So now, SVM will divide the dataset into classes in such a way that all data points are classified properly.

Since we are now in a 3-D space, the separating hyperplane looks like a plane parallel to the x-axis. If we map this boundary back into the original 2-D space (where z = x² + y²), it becomes a circle; hence, for this non-linear data we get a circumference of radius 1 (corresponding to the plane at z = 1).
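A minimal sketch of this idea: the code below generates a made-up inner/outer ring dataset, adds the third dimension z = x² + y², and fits a linear SVM in the lifted 3-D space (assuming scikit-learn is available; the data and random seed are purely illustrative).

import numpy as np
from sklearn.svm import SVC

# Made-up non-linear data: one class near the origin, the other on an outer ring
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 40)
inner = np.column_stack([0.5 * np.cos(angles[:20]), 0.5 * np.sin(angles[:20])])
outer = np.column_stack([2.0 * np.cos(angles[20:]), 2.0 * np.sin(angles[20:])])
X = np.vstack([inner, outer])
y = np.array([0] * 20 + [1] * 20)

# Add the third dimension z = x^2 + y^2 so the two classes become linearly separable
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3 = np.hstack([X, z])

clf = SVC(kernel="linear")
clf.fit(X3, y)
print(clf.score(X3, y))   # 1.0: a plane in the lifted 3-D space separates the two rings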
