UNIT II
THE PERCEPTRON
Bio-inspired Learning
The given figure illustrates a typical biological neural network; a typical artificial neural network looks similar.
Input is accepted through the dendrites, and output is transmitted through the axon in the form of electrical pulses.
The rate of firing tells us how “activated” a neuron is. A single neuron might have, for example, three incoming neurons.
These incoming neurons are firing at different rates (i.e., have different
activations). Based on how much these incoming neurons are firing, and how
“strong” the neural connections are, our main neuron will “decide” how strongly it
wants to fire. And so on through the whole brain.
There are on the order of 100 billion neurons in the human brain, and each neuron has somewhere in the range of 1,000 to 100,000 connection points. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data from memory in parallel when necessary. We can say that the human brain is made up of incredibly powerful parallel processors.
Perceptron:-
A perceptron consists of input nodes, a weighted summation unit, and a single output node.
The input nodes are the primary components of the perceptron; they accept the initial data into the system for further processing.
=>Activation Function:
This is the final and important component that helps determine whether the neuron will fire or not. Common choices are:
o Sign function
o Step function, and
o Sigmoid function
For example, with a step function: if the summation is 30, step(30) = 1; if the summation is -10, step(-10) = 0.
Sigmoid function:- The curve of the Sigmoid function called “S Curve” is shown
here.
This is called a logistic sigmoid and leads to a probability of the value between 0
and 1.
θ(z) = 1 / (1 + e^(-z))
The sigmoid output is close to zero for highly negative input. This can be a
problem in neural network training and can lead to slow learning and the model
getting trapped in local minima during training. Hence, the hyperbolic tangent is often preferred as an activation function in the hidden layers of a neural network.
Based on the desired output, a data scientist can decide which of these activation functions needs to be used in the perceptron logic.
Sign function:- generates output values of -1 or +1 for the given input values.
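These activation functions can be written out in a few lines of Python (a small illustrative sketch, not from the notes), reproducing the step(30) = 1 and step(-10) = 0 values above:

import math

def step(a):
    # Step function: fires (1) if the summation is positive, else 0.
    return 1 if a > 0 else 0

def sign(a):
    # Sign function: maps the activation to +1 or -1.
    return 1 if a > 0 else -1

def sigmoid(a):
    # Logistic sigmoid: squashes the activation into (0, 1).
    return 1.0 / (1.0 + math.exp(-a))

print(step(30), step(-10))    # 1 0
print(sign(6), sign(-6))      # 1 -1
print(round(sigmoid(0), 2))   # 0.5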
Characteristics of Perceptron
a = ∑ wi*xi (i = 1 to d) = w1*x1 + w2*x2 + w3*x3 + ... + wd*xd
If a > 0, the activation is positive and the output is +1; if a < 0, the activation is negative and the output is -1.
Eg:-
Suppose the inputs are x1=1, x2=1, x3=0 and the weights are w1=3, w2=3, w3=1.
The summation of the products of inputs and weights is
a = 3*1 + 3*1 + 1*0 = 6 > 0, so the output is positive (+1).
Step 2:-
y = f(∑ wd*xd + b)
The perceptron is a classic learning algorithm for the neural model of learning. First, it is online: it processes one example at a time and then moves on to the next one. Second, it is error driven: as long as it is classifying correctly, it does not bother updating its parameters.
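A minimal sketch of this error-driven training loop (the function name, the toy AND-like data and the epoch count are illustrative assumptions, not from the notes):

def perceptron_train(data, epochs=10):
    # data is a list of (x, y) pairs, x a list of features, y in {+1, -1}.
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in data:
            a = sum(wi * xi for wi, xi in zip(w, x)) + b   # activation
            if y * a <= 0:                                 # error driven: update only on mistakes
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b = b + y
    return w, b

# toy usage on a hypothetical AND-like problem
w, b = perceptron_train([([1, 1], +1), ([1, 0], -1), ([0, 1], -1), ([0, 0], -1)])
print(w, b)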
We now update our weights and bias. Let's call the new weights w'1, . . . , w'D and the new bias b'. Suppose we observe the same example again and need to compute a new activation a'.
The difference between the old activation a and the new activation a' is ∑ xd² + 1.
But xd² ≥ 0, since it is squared, so this value is always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have successfully moved the activation in the proper direction.
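To see where the quantity ∑ xd² + 1 comes from, write the update out explicitly for a positive example (y = +1), where w'd = wd + xd and b' = b + 1:

a' = ∑ w'd*xd + b'
   = ∑ (wd + xd)*xd + (b + 1)
   = (∑ wd*xd + b) + ∑ xd² + 1
   = a + ∑ xd² + 1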
Consider the following graph. The x-axis shows the number of passes over the data and the y-axis shows the training error and the test error. Due to overfitting, a higher error rate is observed on the test data.
Geometric Interpretation:-
The decision boundary is the set of points that are neither positive nor negative, i.e., points where the activation is zero (taking the bias to be zero). Formally, the decision boundary is the set of points x such that ∑ wd*xd = 0.
This quantity is just the dot product between the vector w = <w1, w2, . . . , wD> and the vector x = <x1, x2, . . . , xD>. We will write this as w · x. Two vectors have a zero dot product if and only if they are perpendicular. The decision boundary is therefore simply the plane perpendicular to w.
The weight vector is shown, together with its perpendicular plane. This plane forms the decision boundary between positive points and negative points.
One thing to notice is that the scale of the weight vector is irrelevant from the perspective of classification. Suppose you take a weight vector w and replace it with 2w.
For example, an activation a = 10 becomes 20 and the prediction y(a) stays +1; an activation a = -10 becomes -20 and the prediction stays -1.
All activations are now doubled, but their sign does not change. This makes complete sense geometrically, since all that matters is which side of the plane a test point falls on, not how far it is from that plane. For this reason, it is common to work with normalized weight vectors, w, that have length one; i.e., ||w|| = 1.
The value w · x is just the distance of x from the origin when projected onto the
vector w.
We can think of this as a one-dimensional version of the data, where each data point is placed according to its projection along w. This distance along w is exactly the activation of that example (with no bias, the threshold is simply zero).
Any example with a negative projection onto w would be classified negative; any
example with a positive projection, positive.
The bias simply moves this threshold. Now, after the projection is computed, b is
added to get the overall activation. The projection plus b is then compared against
zero.
From a geometric perspective, the role of the bias is to shift the decision boundary away from the origin, along the direction of w: if b is positive, the boundary is shifted away from w; if b is negative, the boundary is shifted toward w.
Hyperplane:-
The decision boundary that separates the positive and negative examples is called a hyperplane.
In two dimensions, a hyperplane is a 1-d object, simply a line; in three dimensions, a hyperplane is a 2-d object, like a sheet of paper.
Consider the situation in the figure: we have a current guess as to the hyperplane, and a positive training example comes in that is currently mis-classified. The weights are updated:
w ← w + yx.
This yields the new weight vector, also shown in the figure. In this case, the weight vector changed enough that this training example is now correctly classified.
Interpreting Perceptron Weights:-
The perceptron is learning “the right thing,” but we may want to remove a bunch of features that aren't very useful because they are expensive to compute or take a lot of storage.
The rate at which the activation a changes as a function of the 7th feature is exactly w7.
A key property of the perceptron algorithm is that if the data is linearly separable, then it will converge to a weight vector that separates the data.
Let w = <w0, w1, . . . , wd> and x = <1, x1, . . . , xd> (the leading 1 absorbs the bias), so that
wT = (w0 w1 . . . wd) and the activation can be written as wT x.
If the training data is linearly separable, this means that there exists some hyperplane that puts all the positive examples on one side and all the negative examples on the other side.
If P and N are finite sets that are linearly separable, then the perceptron learning algorithm updates the weights only a finite number of times; that is, there will be a point after which w correctly classifies both classes. The perceptron will converge more quickly for easy learning problems than for hard learning problems. We define “easy” and “hard” in a meaningful way through the notion of margin.
Margin:-
If a hyperplane separates the dataset, then the margin is the distance between the hyperplane and the nearest point.
Problems with large margins should be easy; problems with small margins should be hard.
Formally, given a data set D, a weight vector w and bias b, the margin of w, b on D is defined as:
margin(D, w, b) = min over (x, y) in D of y(w · x + b), if w, b separates D; -∞ otherwise.
The margin of a data set is the largest attainable margin on this data. Formally:
margin(D) = sup over w, b of margin(D, w, b).
We “try” every possible w, b pair. For each pair, we compute its margin. We
then take the largest of these as the overall margin of the data
If the data is not linearly separable, then the value of the sup, and therefore the
value of the margin, is −∞.
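A minimal sketch of this margin definition in Python (the helper name and the tiny one-dimensional dataset are illustrative assumptions):

def margin(data, w, b):
    # margin(D, w, b): smallest value of y*(w.x + b); -inf if any example is misclassified
    vals = [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in data]
    return min(vals) if all(v > 0 for v in vals) else float('-inf')

D = [([4], +1), ([1], -1), ([-3], -1)]
print(margin(D, [1.0], -2.0))   # 1.0: all points on the correct side, nearest point at distance 1
print(margin(D, [-1.0], 0.0))   # -inf: the first example is misclassified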
If the data is linearly separable with margin γ, then there exists some weight
vector w∗ that achieves this margin. Obviously we don’t know what w∗ is, but
we know it exists.
Every time the perceptron makes an update, the angle between w and w∗
changes. What we prove is that the angle actually decreases.
Proof: The margin γ > 0 must be realized by some set of parameters, say w*.
Suppose we train a perceptron on this data. Denote by w(0) the initial weight vector, w(1) the weight vector after the first update, and so on; w(k) is the weight vector after the kth update.
First, suppose that the kth update happens on example (x, y).
Because we updated, we know that this example was misclassified: y w(k-1) · x < 0.
After the update, we get w(k) = w(k-1) + yx.
We do a little computation (assuming ||w*|| = 1, ||x|| ≤ 1 and margin γ):
w* · w(k) = w* · (w(k-1) + yx) = w* · w(k-1) + y(w* · x) ≥ w* · w(k-1) + γ.
Every time we make a correction, γ is added to this dot product; since we have made k such updates, kγ is added in total, so w* · w(k) ≥ kγ.
For the norm, ||w(k)||² = ||w(k-1) + yx||² = ||w(k-1)||² + 2y(w(k-1) · x) + ||x||² ≤ ||w(k-1)||² + 1, because y w(k-1) · x < 0 and ||x||² ≤ 1. Hence ||w(k)||² ≤ k.
Recall that u · v is related to the projection of u onto v, and that |u · v| ≤ ||u|| ||v|| (Cauchy-Schwarz). Taking u = w* and v = w(k), and using ||w*|| = 1, we get ||w(k)|| ≥ w* · w(k) ≥ kγ. Combining this with ||w(k)||² ≤ k gives kγ ≤ sqrt(k), i.e., k ≤ 1/γ². So there are only a finite number of corrections to w, and the algorithm converges.
Improved Generalization: Voting and Averaging
In order to make the perceptron more competitive with other learning algorithms, you need to modify it a bit to get better generalization. Consider a data set with 10,000 examples. Suppose that after the first 100 examples, the perceptron has learned a really good classifier. It's so good that it goes over the next 9,899 examples without making any updates. It reaches the 10,000th example and makes an error. It updates. For all we know, the update on this 10,000th example completely ruins the weight vector that has done so well on 99.99% of the data!
The idea is for weight vectors that “survive” a long time to get more say than weight vectors that are overthrown quickly. One way to achieve this is by voting. As the perceptron learns, it remembers how long each hyperplane survives. At test time, each hyperplane encountered during training “votes” on the class of a test example. If a particular hyperplane survived for 20 examples, then it gets a vote of 20. If it only survived for one example, it only gets a vote of 1. In particular, let (w, b)(1), . . . , (w, b)(K) be the K weight vectors encountered during training, and c(1), . . . , c(K) be the survival times for each of these weight vectors. (A weight vector that gets immediately updated gets c = 1; one that survives another round gets c = 2, and so on.) The prediction is based on:
ŷ = sign( ∑ over k of c(k) · sign(w(k) · x̂ + b(k)) )
A much more practical alternative is the averaged perceptron. The idea is similar:
you maintain a collection of weight vectors and survival times. However, at test
time, you predict according to the average weight vector, rather than the voting
Initially: m = 1, w1 = y1x1, c1 = 1
1. For t = 2, 3, ....
2. If yt(wm · xt) ≤ 0 then:
a. wm+1 = wm + yt xt
m = m + 1
cm = 1
else:
cm = cm + 1
Problem
X Y
4 +1
1 -1
-3 -1
-2 +1
m = 1, c1 = 1, w1 = y1*x1 = (+1)*4 = 4
For t = 2 to 4:
t = 2: check y2(w1*x2) ≤ 0: -1*(4*1) = -4 ≤ 0, yes
Then w2 = w1 + y2*x2 = 4 + (-1*1) = 3
m = m + 1 = 2
c2 = 1
t = 3: check y3(w2*x3) ≤ 0: -1*(3*-3) = +9 ≤ 0? No
c2 = c2 + 1 = 2
t = 4: check y4(w2*x4) ≤ 0: +1*(3*-2) = -6 ≤ 0, yes
Then w3 = w2 + y4*x4 = 3 + (+1*-2) = 1
m = m + 1 = 3
c3 = 1
c1=1,c2=2,c3=1 w1=4,w2=3,w3=1
For a test point x = -3:
Voting output = sign( c1*sign(w1*x) + c2*sign(w2*x) + c3*sign(w3*x) )
= sign( 1*sign(-12) + 2*sign(-9) + 1*sign(-3) ) = sign(-1 - 2 - 1) = sign(-4) = -1
Averaging output = sign( (c1*w1 + c2*w2 + c3*w3) * x )
= sign( (1*4 + 2*3 + 1*1) * (-3) ) = sign(-33) = -1
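The same calculation can be scripted. This is a minimal sketch of the bias-free, one-dimensional version used in the worked example above, not a full implementation:

def train_counts(xs, ys):
    # Returns the list of weight values and their survival counts (voted-perceptron bookkeeping).
    ws = [ys[0] * xs[0]]   # w1 = y1*x1
    cs = [1]
    for x, y in zip(xs[1:], ys[1:]):
        if y * (ws[-1] * x) <= 0:       # mistake: start a new weight vector
            ws.append(ws[-1] + y * x)
            cs.append(1)
        else:                           # correct: current weight vector survives another round
            cs[-1] += 1
    return ws, cs

def sign(a):
    return 1 if a > 0 else -1

xs, ys = [4, 1, -3, -2], [+1, -1, -1, +1]
ws, cs = train_counts(xs, ys)            # ws = [4, 3, 1], cs = [1, 2, 1]

x_test = -3
voted = sign(sum(c * sign(w * x_test) for w, c in zip(ws, cs)))   # -1
averaged = sign(sum(c * w for w, c in zip(ws, cs)) * x_test)      # -1
print(ws, cs, voted, averaged)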
o Perceptron can only be used to classify linearly separable sets of input vectors.
o If the input vectors are not linearly separable, it is not easy to classify them properly.
o Perceptron XOR problem: we cannot draw a decision boundary for the XOR problem using a single-layer perceptron.
Practical Issues
Good Features:-
How well a machine learning algorithm works depends on how good the features are. If the features are binary, then it is easy to decide between two different things.
Object recognition
The figure shows the same images in a patch representation. Can you identify them?
A final representation is a shape representation. Here, we throw out all color and
pixel information and simply provide a bounding polygon. Figure shows the same
images in this representation.
Image classification:-
Here, to classify dogs we use two features: height and eye colour.
If the height is around 25 inches, Greyhounds and Labradors are about equally likely, so height alone is not a valuable feature.
Some features are redundant: height in inches and height in centimetres indicate the same property. Such features are called redundant features, and we should avoid them.
Text Categorization:-
Two features are redundant if they are highly correlated, regardless of whether they are correlated with the task or not. For example, having a bright red pixel in an image at position (20, 93) is probably highly redundant with having a bright red pixel at position (21, 93). Both might be useful (e.g., for identifying fire hydrants), but because of how images are structured, these two features are likely to co-occur frequently.
Pruning reduces the size of the decision tree and the complexity of the final classifier. In the beginning, pruning does not hurt (and sometimes helps!), but eventually we prune away all the interesting words and performance suffers.
Normalization:-
In feature normalization, you go through each feature and adjust it the same way
across all examples.
The goal of both types of normalization is to make it easier for your learning
algorithm to learn. In feature normalization, there are two standard things to do:
1. Centering: moving the entire data set so that it is centered around the origin.
2. Scaling: rescaling each feature so that one of the following holds:
(a) Each feature has variance 1 across the training data.
(b) Each feature has maximum absolute value 1 across the training data
The goal of centering is to make sure that no features are arbitrarily large. The goal
of scaling is to make sure that all features have roughly the same scale (to avoid
the issue of centimeters versus millimeters).
x | Centered (x - μ) | Variance-scaled (x/σ) | Absolute-scaled (x/max)
1 | -1 | 1 | 0.33
2 | 0 | 2 | 0.66
3 | 1 | 3 | 1
μd = (1 + 2 + 3)/3 = 2
σd = sqrt((1/2)*((1-2)² + (2-2)² + (3-2)²)) = sqrt(2/2) = 1
rd = max(|1|, |2|, |3|) = 3
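A short Python sketch of these three operations, using the values from the table (here scaling is applied to the raw values to match the table; in practice centering and scaling are often combined):

import math

x = [1.0, 2.0, 3.0]

mu = sum(x) / len(x)                                              # mean: 2.0
sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / (len(x) - 1))   # sample standard deviation: 1.0
r = max(abs(v) for v in x)                                        # maximum absolute value: 3.0

centered = [v - mu for v in x]            # [-1.0, 0.0, 1.0]
variance_scaled = [v / sigma for v in x]  # [1.0, 2.0, 3.0]
absolute_scaled = [v / r for v in x]      # approximately [0.33, 0.67, 1.0]
print(centered, variance_scaled, absolute_scaled)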
Consider a sentiment classification problem that has three features that simply say whether a given word is contained in a review of a course: “excellent” indicates a positive review and “terrible” a negative one. But in the presence of the “not” feature, this categorization flips. One way to address this problem is by adding feature combinations.
We could add two additional features: excellent-and-not and terrible-and-not
that indicate a conjunction of these base features. By assigning weights as follows,
you can achieve the desired effect:
w_excellent = +1
w_terrible = -1
w_not = 0
w_excellent-and-not = -2
w_terrible-and-not = +2
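A tiny sketch showing how these weights flip the prediction in the presence of “not” (the scoring function and the word lists below are made up for illustration):

weights = {
    "excellent": +1, "terrible": -1, "not": 0,
    "excellent-and-not": -2, "terrible-and-not": +2,
}

def score(words):
    # Base features are word presence; combination features fire when both words are present.
    feats = {w: 1 for w in words if w in ("excellent", "terrible", "not")}
    if "excellent" in feats and "not" in feats:
        feats["excellent-and-not"] = 1
    if "terrible" in feats and "not" in feats:
        feats["terrible-and-not"] = 1
    return sum(weights[f] * v for f, v in feats.items())

print(score(["excellent"]))          # +1 -> positive
print(score(["not", "excellent"]))   # -1 -> flipped to negative
print(score(["not", "terrible"]))    # +1 -> flipped to positive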
However, if we keep adding such feature combinations, the number of features grows exponentially.
Combinatorial Transformation:-
Using a perceptron with a fixed number of nodes in a hidden layer, we want to predict the target from combinations of the base features, instead of enumerating all combinations by hand.
Logarithmic transformation:-
The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2
of x, is a strong transformation with a major effect on distribution shape. It is
commonly used for reducing right skewness and is often appropriate for measured
variables
y = a·e^(bx) is made linear by
ln y = ln a + b·x
log-transform is an important transformation in text data, where the presence of
the word “excellent” once is a good indicator of a positive review; seeing
“excellent” twice is a better indicator;but the difference between seeing “excellent”
10 times and seeing it 11 times really isn’t a big deal any more. A log-transform
achieves this
the transformation is actually xd → log2 (xd + 1) to ensure that zeros remain zero
and sparsity is retained.
In the case that feature values can also be negative, the slightly more complex mapping xd → sign(xd)·log2(|xd| + 1) can be used.
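A quick sketch of the two variants (function names are illustrative):

import math

def log_transform(xd):
    # For non-negative counts: zeros stay zero and sparsity is retained.
    return math.log2(xd + 1)

def signed_log_transform(xd):
    # Variant for features that can also be negative.
    s = 1 if xd >= 0 else -1
    return s * math.log2(abs(xd) + 1)

print(log_transform(0), log_transform(1), log_transform(10), log_transform(11))
# 0.0  1.0  ~3.46  ~3.58  -> the 10th vs 11th occurrence barely changes the feature
print(signed_log_transform(-3))   # -2.0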
Classification Problems
Evaluation Metrics
We’ll divide the evaluation metrics into two categories based on when to use them.
Eg:- Suppose, we have a dataset in which we have 900 examples of benign tumour
and 100 examples of malignant tumour.
Now suppose our model (trained on this imbalanced dataset) predicts every example as a benign tumour; by the accuracy formula, our accuracy is 90% (think about it!). We can clearly see that our model failed to predict malignant tumours. Now imagine how big a disaster it would have been if you had used this model in a real-world scenario.
TP: True Positive are those examples that were actually positives and were
predicted as positives.
Eg. The actual output was a benign tumour(positive tumour) and the model
also predicted benign tumour.
FP: False Positives also known as Type 1 error are those examples that were
actually negatives but our model predicted them as positive.
Eg. The actual output was a malignant tumour(negative class) but our model
predicted benign tumour.
FN: False Negatives also known as Type 2 error are those examples that were
actually positives but our model predicted them as negative.
Eg. The actual output was a benign tumour(positive class) but our model
predicted it as a malignant tumour.
TN: True Negative are those examples that were actually negative and our
model predicted them as negative.
Eg. The actual output was a malignant tumour(negative class) and our model
also predicted it as a malignant tumour.
Whenever we are training our model we should try and reduce False positives and
False negatives, such that our model makes as many correct predictions as possible.
By looking at the confusion matrix, we can get an idea of how our model is performing on particular classes, unlike accuracy, which only gives an estimate of overall model performance.
Precision and recall are two very important metrics for evaluating the performance of a model. Neither one is better than the other; it just depends upon the use case and business requirement. Let's first have a look at their definitions and then develop an intuition of which to use when.
Precision can be defined as “out of the total predicted positive values, how many were actually positive,” and recall as “out of the total actual positive values, how many were predicted positive.”
Confusion matrix (predicted class in rows, actual class in columns):
100 (TP)  10 (FP)  | 110
5 (FN)    50 (TN)  | 55
105       60       | 165
Precision = TP/(TP + FP) = 100/(100 + 10) = 0.909
Recall = TP/(TP + FN) = 100/(100 + 5) = 0.952
Accuracy = (TP + TN)/Total = (100 + 50)/165 = 0.909
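The numbers above can be checked with a few lines of Python (the values are taken from the confusion matrix in the example):

TP, FP, FN, TN = 100, 10, 5, 50

precision = TP / (TP + FP)                            # 0.909
recall    = TP / (TP + FN)                            # 0.952 (also called sensitivity)
accuracy  = (TP + TN) / (TP + FP + FN + TN)           # 0.909
specificity = TN / (TN + FP)                          # 0.833
f1 = 2 * precision * recall / (precision + recall)    # 0.930
print(precision, recall, accuracy, specificity, f1)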
F1- Score
Although precision and recall are good they don’t give us the power to compare two
models. If one model has a good recall and the other one has good precision, it
becomes really confusing which one to use for our task(until we are completely
sure that we need to focus on only one metric).
F1-score to the rescue
F1-score takes the harmonic mean of precision and recall and gives a single value to evaluate our model. It is given by the following formula:
F1 = 2 · (Precision · Recall) / (Precision + Recall)
In some cases, you might believe that precision is more important than recall (or vice versa). This idea leads to the weighted f-measure, which is parameterized by a weight β ∈ [0, ∞) (beta):
Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
Sensitivity:-
Sensitivity is a measure of how well a machine learning model can detect positive
instances.
A model with high sensitivity will have few false negatives, which means that it is not missing many of the positive instances.
A high sensitivity means that the model is correctly identifying most of the positive
results, while a low sensitivity means that the model is missing a lot of positive
results.
Specificity:-
Specificity measures the proportion of true negatives that are correctly identified
by the model. This implies that there will be another proportion of actual negative
which got predicted as positive and could be termed as false positives. This
proportion could also be called a True Negative Rate (TNR).
A specific classifier is one which does a good job not finding the things that it
doesn’t want to find
The typical plot, referred to as the receiver operating characteristic (or ROC curve), plots the sensitivity against 1 - specificity. Given an ROC curve, we can compute the area under the curve (or AUC) metric, which also provides a meaningful single number for a system's performance.
AUC scores tend to be very high, even for not great systems. This is because
random chance will give you an AUC of 0.5 and the best possible AUC is 1.0.
In general, a model tuned for high sensitivity will tend to have a higher false-positive rate, while a model tuned for high specificity will tend to have a higher false-negative rate. The trade-off between sensitivity and specificity can be tuned by changing the threshold for classification. A lower threshold will result in a model with high sensitivity and low specificity, while a higher threshold will result in a model with low sensitivity and high specificity.
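A short sketch of computing the ROC curve and AUC with scikit-learn (the library usage and the toy scores below are assumptions for illustration, not part of the notes):

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 1, 1, 0, 0, 1, 0, 0]                       # actual labels
y_scores = [0.9, 0.8, 0.65, 0.6, 0.4, 0.35, 0.2, 0.1]     # model scores (probabilities)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # tpr = sensitivity, fpr = 1 - specificity
auc = roc_auc_score(y_true, y_scores)
print(auc)   # 0.5 would be random chance, 1.0 is perfect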
We can also say that cross-validation is a technique to check how a statistical model generalizes to an independent dataset.
In machine learning, there is always a need to test the stability of the model: we cannot judge a model based only on the training dataset it was fit on. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset.
After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is somewhat different from the general train-test split.
K-Fold Cross-Validation
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5 folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train the model. On the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
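A minimal 5-fold loop with scikit-learn (the library, the LogisticRegression model and the synthetic data are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])               # train on 4 folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))     # test on the held-out fold
print(scores, sum(scores) / len(scores))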
LOO cross validation:-
o In this approach, the bias is minimum as all the data points are used.
o The process is executed for n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of the
model as we iteratively check against one data point.
Applications of Cross-Validation
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given
below:
o Under ideal conditions, it provides optimal output, but for inconsistent data it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over time, which may cause differences between the training set and the validation set. For example, if we create a model for predicting stock market values and the model is trained on the previous 5 years of stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.
Hypothesis testing:-
A hypothesis is an assumption about a population which may or may not be true.
Hypothesis testing is a set of formal procedures used to either accept or reject hypotheses. Hypotheses are of two types:
Null hypothesis, H0 - represents a hypothesis of chance basis.
The null hypothesis states that there is no statistical relationship between the two variables.
Alternative hypothesis, Ha - represents a hypothesis of observations which are influenced by some non-random cause.
The alternative hypothesis states that there is a statistically significant relationship between the two variables.
Example
Suppose we wanted to check whether a coin is fair and balanced. A null hypothesis might say that half of the flips will be heads and half will be tails, whereas the alternative hypothesis might say that the proportions of heads and tails may be very different.
H0: P = 0.5
Ha: P ≠ 0.5
For example, suppose we flipped the coin 50 times, resulting in 40 heads and 10 tails. Using this result, we would reject the null hypothesis and conclude, based on the evidence, that the coin was probably not fair and balanced.
Hypothesis Tests
The following formal process is used to determine whether to reject a null hypothesis based on sample data. This process is called hypothesis testing and consists of the following four steps:
1. State the hypotheses - This step involves stating both null and alternative
hypotheses. The hypotheses should be stated in such a way that they are
mutually exclusive. If one is true then other must be false.
2. Formulate an analysis plan - The analysis plan describes how to use the sample data to evaluate the null hypothesis. The evaluation process focuses on a single test statistic.
3. Analyze sample data - Find the value of the test statistic (using properties
like mean score, proportion, t statistic, z-score, etc.) stated in the analysis
plan.
4. Interpret results - Apply the decisions stated in the analysis plan. If the
value of the test statistic is very unlikely based on the null hypothesis, then
reject the null hypothesis.
T-test follows t-distribution, which is appropriate when the sample size is small,
and the population standard deviation is not known. The shape of a t-distribution is
highly affected by the degree of freedom. The degree of freedom implies the
number of independent observations in a given set of observations.
Assumptions of T-test:
Paired t-test: A statistical test applied when the two samples are dependent and
paired observations are taken.
Z-test
Z-test refers to a univariate statistical analysis used to test the hypothesis that
proportions from two independent samples differ greatly.
It determines to what extent a data point is away from its mean of the data set, in
standard deviation.
The researcher adopts the z-test when the population variance is known. In essence, when there is a large sample size, the sample variance is deemed approximately equal to the population variance; in this way it is assumed to be known, despite the fact that only sample data is available, and so the normal test can be applied.
Assumptions of Z-test:
P-value Definition
The P-value is known as the probability value.
It is defined as the probability of getting a result that is either the same or more
extreme than the actual observations.
The P-value is known as the level of marginal significance within the hypothesis
testing that represents the probability of occurrence of the given event.
The P-value is used as an alternative to the rejection point to provide the least
significance at which the null hypothesis would be rejected.
If the P-value is small, then there is stronger evidence in favour of the alternative
hypothesis.
P-value Table
The P-value table shows the hypothesis interpretations:
P-value          Decision
P-value ≤ 0.05   The result is statistically significant; reject the null hypothesis.
P-value > 0.05   The result is not statistically significant; fail to reject the null hypothesis.
In practice, a cross-validation routine can be used to attach a confidence interval to a performance estimate. Such a routine takes three arguments: the true labels y, the predicted labels ŷ and the number of folds to run. It returns the mean and standard deviation, from which you can compute a confidence interval.
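A rough sketch of such a routine (the function name cv_metric and the use of accuracy as the per-fold metric are assumptions for illustration):

import statistics

def cv_metric(y, y_hat, folds=5):
    # Split the paired (true, predicted) labels into folds, score each fold with accuracy,
    # and return the mean and standard deviation of the per-fold scores.
    scores = []
    for k in range(folds):
        idx = list(range(k, len(y), folds))       # every folds-th example belongs to fold k
        correct = sum(1 for i in idx if y[i] == y_hat[i])
        scores.append(correct / len(idx))
    return statistics.mean(scores), statistics.stdev(scores)

mean, std = cv_metric([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                      [1, 0, 0, 1, 0, 1, 1, 0, 1, 1], folds=5)
print(mean, std)   # e.g. report mean plus or minus 2*std as a rough confidence interval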
Bias-Variance Trade off – Machine Learning
It is important to understand prediction errors (bias and variance) when it comes to accuracy in any machine learning algorithm. There is a trade-off between a model's ability to minimize bias and its ability to minimize variance, and navigating this trade-off (for example, when selecting the value of a regularization constant) is key to finding a good solution. A proper understanding of these errors helps avoid overfitting and underfitting of a data set while training the algorithm.
Bias
The bias is known as the difference between the prediction of the values by
the ML model and the correct value.
High bias gives a large error on training as well as testing data. It is recommended that an algorithm be low biased to avoid the problem of underfitting.
With high bias, the predictions follow an overly simple (e.g., straight-line) form and thus do not fit the data in the data set accurately. Such fitting is known as underfitting of data.
This happens when the hypothesis is too simple or linear in nature. Refer to
the graph given below for an example of such a situation.
Variance
The variability of model predictions for a given data point, which tells us the spread of the predictions, is called the variance of the model.
A model with high variance has a very complex fit to the training data and thus is not able to fit accurately on data it hasn't seen before.
As a result, such models perform very well on training data but have high error rates on test data.
When a model is high on variance, it is said to overfit the data.
Overfitting means fitting the training set accurately via a complex curve and high-order hypothesis, but this is not the solution, as the error on unseen data is high.
LINEAR MODELS
The term linear model implies that the model is specified as a linear combination
of features. Based on training data, the learning process computes one weight for
each feature to form a model that can predict or estimate the target value
Decision Boundary –
y = 1 when h(x) ≥ 0.5, i.e., wT x ≥ 0
y = 0 when h(x) < 0.5, i.e., wT x < 0
The benefit of using such an S-shaped function is that it is smooth, and potentially easier to optimize. The difficulty is that it is not convex.
Convex functions are easy to minimize. This leads to the idea of convex surrogate
loss functions. Since zero/one loss is hard to optimize, you want to optimize
something else, instead. Since convex functions are easy to optimize, we want to
approximate zero/one loss with a convex function. This approximating function
will be called a surrogate loss. The surrogate losses we construct will always be
upper bounds on the true loss function: this guarantees that if you minimize the
surrogate loss, you are also pushing down the real loss.
There are four common surrogate loss functions, each with their own properties:
hinge loss, logistic loss, exponential loss and squared loss.
Logistic loss: ℓ^(log)(y, ŷ) = (1 / log 2) · log(1 + exp[-y·ŷ])
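A compact sketch of the four surrogate losses (plus zero/one loss for comparison), written as functions of the true label y in {-1, +1} and the raw prediction ŷ (names and the sample evaluation point are illustrative):

import math

def zero_one(y, yhat):      return 1.0 if y * yhat <= 0 else 0.0
def hinge(y, yhat):         return max(0.0, 1.0 - y * yhat)
def logistic(y, yhat):      return (1.0 / math.log(2)) * math.log(1.0 + math.exp(-y * yhat))
def exponential(y, yhat):   return math.exp(-y * yhat)
def squared(y, yhat):       return (y - yhat) ** 2

# Each surrogate is a convex upper bound on the zero/one loss.
for loss in (zero_one, hinge, logistic, exponential, squared):
    print(loss.__name__, loss(+1, -0.5))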
Weight regularization:-
Regularization improves robustness to noise and can encourage the weight vector to be sparse.
In L2 weight regularization, the sum of squared weight values is used to measure the size of the weights; in L1 regularization, the sum of absolute values is used, which tends to produce sparse weights.
We should select a suitable convex regularizer.
A higher learning rate allows the algorithm to learn faster, i.e., update the weights and biases more rapidly, at the cost of possibly arriving at a sub-optimal solution.
A smaller learning rate tends to give a more optimal solution, but it may take significantly longer to reach it.
We can also use an adaptive learning rate: the algorithm starts with a larger learning rate and reduces it over time, which reduces the training time compared with using a small fixed learning rate throughout.
Choose the exponential loss exp[-y·ŷ] as the loss function and the 2-norm as the regularizer.
The only “strange” thing in this objective is that we have replaced λ with λ/2. The reason for this change is just to make the gradients cleaner. We can first compute derivatives with respect to b:
The update is of the form w ← w − η∇wL.
For poorly classified points, the gradient points in the direction -yn xn, so the update is of the form w ← w + c·yn·xn, where c = exp[-yn(w · xn + b)].
Note that c is large for very poorly classified points and small for relatively well classified points. By looking at the part of the gradient related to the regularization, the update says: w ← w - λw = (1 - λ)w. This has the effect of shrinking the weights toward zero.
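A rough sketch of one gradient-descent step for this objective (exponential loss plus (λ/2)·||w||²); the NumPy usage, variable names and toy data are my own assumptions, not from the notes:

import numpy as np

def grad_step(w, b, X, y, lam=0.1, eta=0.01):
    # X: (n, d) feature matrix, y: (n,) labels in {-1, +1}
    yhat = X @ w + b
    c = np.exp(-y * yhat)                 # large for poorly classified points, small otherwise
    grad_w = -(c * y) @ X + lam * w       # data term pushes w toward yn*xn; regularizer shrinks w
    grad_b = -np.sum(c * y)
    return w - eta * grad_w, b - eta * grad_b

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=20))
w, b = np.zeros(3), 0.0
for _ in range(100):
    w, b = grad_step(w, b, X, y)
print(w, b)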
Based on the error in various training models, the Gradient Descent learning
algorithm can be divided into
1. Batch Gradient Descent: Batch gradient descent (BGD) is used to find the
error for each point in the training set and update the model after evaluating
all training examples.
Let’s say there are a total of ‘m’ observations in a data set and we use all these
observations to calculate the loss function, then this is known as Batch Gradient
Descent.
Forward propagation and backward propagation are performed and the parameters
are updated. In batch Gradient Descent since we are using the entire training set,
the parameters will be updated only once per epoch.
Let's say we have 5 observations, each with three features, and the values taken are completely random.
Now if we use SGD, we will take the first observation, pass it through the neural network, calculate the error and then update the parameters.
Advantages of Stochastic gradient descent:
In stochastic gradient descent (SGD), learning happens on every example, and it has a few advantages over other forms of gradient descent.
It is easier to fit into the available memory.
It is relatively fast to compute than batch gradient descent.
It is more efficient for large datasets.
Again, let's take the same example and assume that the batch size is 2. We'll take the first two observations, pass them through the neural network, calculate the error and then update the parameters.
Then we will take the next two observations and perform similar steps, i.e., pass them through the network, calculate the error and update the parameters.
Since we are left with a single observation in the final iteration, that mini-batch will contain only one observation, and we will update the parameters using it.
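A sketch of how the three variants differ only in how the data is sliced into batches (the helper name and the toy data are stand-ins, not from the notes):

def minibatches(data, batch_size):
    # Yield consecutive slices of the data; the last batch may be smaller (as in the example above).
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(5))                      # 5 observations
print(list(minibatches(data, len(data))))  # batch GD: one update per epoch -> [[0, 1, 2, 3, 4]]
print(list(minibatches(data, 1)))          # SGD: one update per example
print(list(minibatches(data, 2)))          # mini-batch: [[0, 1], [2, 3], [4]]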
Although we know Gradient Descent is one of the most popular methods for
optimization problems, it still also has some challenges. There are a few challenges
as follows:
Example: SVM can be understood with the example that we have used in the
KNN classifier.
Suppose we see a strange cat that also has some features of dogs, so if we want a
model that can accurately identify whether it is a cat or dog, so such a model can
be created by using the SVM algorithm.
We will first train our model with lots of images of cats and dogs so that it can
learn about different features of cats and dogs, and then we test it with this
strange creature.
So, as the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses extreme cases (support vectors), it will see the extreme cases of cats and dogs. On the basis of the support vectors, it will classify the new example as a cat.
Consider the below diagram:
Types of SVM
Linear SVM:
o The working of the SVM algorithm can be understood by using an
example. Suppose we have a dataset that has two tags (green and blue),
and the dataset has two features x1 and x2. We want a classifier that can
classify the pair(x1, x2) of coordinates in either green or blue. Consider
the below image:
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line,
but for non-linear data, we cannot draw a single straight line. Consider the
below image:
So to separate these data points, we need to add one more dimension.
For linear data, we have used two dimensions x and y, so for non-linear data,
we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below
image:
So now, SVM will divide the datasets into classes in such a way that all data points are classified properly.
Since we are in 3-d space, the separating hyperplane looks like a plane parallel to the x-axis.
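A brief scikit-learn sketch contrasting a linear SVM with the z = x² + y² trick (the library, the make_circles dataset and the parameter choices are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linearly separable data: one class inside a circle, the other outside.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
print("linear SVM accuracy:", linear.score(X, y))          # poor: no straight line separates the circles

# Add the third dimension z = x^2 + y^2; the classes become separable by a plane.
Z = np.c_[X, X[:, 0] ** 2 + X[:, 1] ** 2]
lifted = SVC(kernel="linear").fit(Z, y)
print("lifted linear SVM accuracy:", lifted.score(Z, y))   # close to 1.0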