
LEAST SQUARES REGRESSION METHOD

Linear Regression
Linear regression is one of the basic statistical techniques in regression
analysis. It is used for investigating and modelling the relationship between
variables (i.e., a dependent variable and one or more independent variables).
Before being adopted into machine learning and data science, linear models were
used as basic tools in statistics to assist prediction analysis and data mining. If the
model involves only one regressor (independent) variable, it is called simple linear
regression; if the model has more than one regressor variable, the process is called
multiple linear regression.
Equation of Straight Line
Let’s consider a simple example of an engineer wanting to analyze the product
delivery and service operations for vending machines. He/she wants to determine
the relationship between the time required by a deliveryman to load a machine
and the volume of the products delivered. The engineer collected the delivery time
(in minutes) and the volume of products (in number of cases) for 25 randomly
selected retail outlets with vending machines. Plotting these observations on a
graph gives a scatter diagram.

Now, consider Y as the delivery time (dependent variable) and X as the product
volume delivered (independent variable). Then we can represent the linear
relationship between these two variables as

Y = mX + c
Okay! Now that looks familiar. It is the equation of a straight line, where m is the
slope and c is the y-intercept. Our objective is to estimate these unknown
parameters in the regression model so that they give the minimal error for the
given dataset. This is commonly referred to as parameter estimation or model
fitting. In machine learning, the most common method of estimation is the Least
Squares method.
What is the Least Square Regression Method?
Least squares is a commonly used method in regression analysis for estimating the
unknown parameters by creating a model which will minimize the sum of squared
errors between the observed data and the predicted data.
Basically, it is one of the most widely used curve-fitting methods; it works by
making the sum of squared errors as small as possible. It helps you draw a line
of best fit based on your data points.
Finding the Line of Best Fit Using Least Square Regression
Given any collection of pairs of numbers and the corresponding scatter graph, the
line of best fit is the straight line drawn through the scatter points that best
represents the relationship between them. So, back to our equation of the
straight line, we have:

Y = mX + c

Where,
Y: Dependent Variable
m: Slope
X: Independent Variable
c: y-intercept
Our aim here is to calculate the values of slope, y-intercept, and substitute them in
the equation along with the values of independent variable X, to determine the
values of dependent variable Y. Let’s assume that we have ‘n’ data points, then we
can calculate the slope using the scary-looking formula below:

m = (n Σxy − Σx Σy) / (n Σx² − (Σx)²)
Then, the y-intercept is calculated using the formula:

c = ymean − m * xmean
Lastly, we substitute these values in the final equation Y = mX + c. Simple enough,
right? Now let's take a real-life example and implement these formulas to find the
line of best fit.
Least Squares Regression Example
Let us take a simple dataset to demonstrate the least squares regression method.

Step 1: The first step is to calculate the slope 'm' using the formula above.

After substituting the respective values in the formula, we get m ≈ 4.70.


Step 2: The next step is to calculate the y-intercept 'c' using the formula
c = ymean − m * xmean. Doing so gives approximately c = 6.67.

Step 3: Now we have all the information needed for the equation and by
substituting the respective values in Y = mX + c, we get the following table. Using
this info you can now plot the graph.

In this way, the least squares regression method provides the closest relationship
between the dependent and independent variables by minimizing the distances
(the residuals, or errors) between the observed data points and the trend line (or
line of best fit). Therefore, the sum of squared residuals is minimal under this
approach.
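To make the procedure concrete, here is a minimal Python/NumPy sketch of these steps; the small x and y arrays are made-up illustrative values, not the vending-machine dataset:

import numpy as np

# Hypothetical data points (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([8.0, 17.0, 21.0, 26.0, 30.0])
n = len(x)

# Step 1: slope m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)

# Step 2: intercept c = ymean - m * xmean
c = y.mean() - m * x.mean()

# Step 3: predicted values from Y = mX + c
y_pred = m * x + c
print(f"m = {m:.2f}, c = {c:.2f}")
print("predictions:", y_pred)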

Multiple Linear Regression

In the previous topic, we have learned about Simple Linear Regression, where a
single Independent/Predictor(X) variable is used to model the response variable (Y).
But there may be various cases in which the response variable is affected by more
than one predictor variable; for such cases, the Multiple Linear Regression
algorithm is used.

Moreover, Multiple Linear Regression is an extension of Simple Linear Regression,
as it takes more than one predictor variable to predict the response variable. We
can define it as:

Multiple Linear Regression is one of the important regression algorithms which
models the linear relationship between a single dependent continuous variable and
more than one independent variable.

o For MLR, the dependent or target variable (Y) must be continuous/real, but the
predictor or independent variables may be continuous or categorical.
o Each feature variable must model a linear relationship with the dependent
variable.
o MLR tries to fit a regression line through a multidimensional space of data
points.

MLR equation:

In Multiple Linear Regression, the target variable (Y) is a linear combination of
multiple predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple
Linear Regression, the same idea applies, and the equation becomes:

Y = b0 + b1x1 + b2x2 + b3x3 + ...... + bnxn ............... (a)

Where,

Y = Output/Response variable

b0, b1, b2, b3, ..., bn = Coefficients of the model

x1, x2, x3, x4, ... = Independent/feature variables
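As a sketch of how these coefficients can be estimated, the following Python/NumPy snippet fits equation (a) by ordinary least squares using np.linalg.lstsq; the small X and y arrays are made-up illustrative values:

import numpy as np

# Hypothetical data: 5 observations, 2 predictor variables (illustrative values only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([7.0, 8.0, 15.0, 16.0, 21.0])

# Add a column of ones so the first coefficient plays the role of the intercept b0
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# Solve for [b0, b1, b2] that minimize the sum of squared errors
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs
print(f"Y = {b0:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")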

Assumptions for Multiple Linear Regression:

o A linear relationship should exist between the target and predictor variables.
o The regression residuals must be normally distributed.
o MLR assumes little or no multicollinearity (correlation between the independent
variables) in the data.

Applications of Multiple Linear Regression:

There are mainly two applications of Multiple Linear Regression:

o Effectiveness of Independent variable on prediction:

o Predicting the impact of changes:

Perceptron in Machine Learning

In Machine Learning and Artificial Intelligence, the Perceptron is one of the most
commonly encountered terms. It is a primary step in learning Machine Learning and
Deep Learning technologies, and it consists of a set of weights, input values or
scores, and a threshold. The Perceptron is a building block of an Artificial Neural
Network. Frank Rosenblatt invented the Perceptron in the late 1950s for performing
certain calculations to detect input data capabilities or business intelligence. The
Perceptron is a linear Machine Learning algorithm used for supervised learning of
various binary classifiers. This algorithm enables a neuron to learn and process
training examples one at a time. In this tutorial, "Perceptron in Machine Learning,"
we will discuss Perceptron and its basic functions in brief. Let's start with the basic
introduction of the Perceptron.

What is the Perceptron model in Machine Learning?

Perceptron is a Machine Learning algorithm for supervised learning of various binary
classification tasks. Further, the Perceptron can also be understood as an Artificial
Neuron or neural network unit that helps to detect certain input data computations
in business intelligence.

The Perceptron model is also treated as one of the best and simplest types of
Artificial Neural Networks. It is a supervised learning algorithm for binary classifiers.
Hence, we can consider it as a single-layer neural network with four main
parameters, i.e., input values, weights and bias, net sum, and an activation
function.

Basic Components of Perceptron

Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which
contains three main components. These are as follows:

o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.

o Weight and Bias:

The weight parameter represents the strength of the connection between units. This
is another important parameter of the Perceptron components. Weight is directly
proportional to the strength of the associated input neuron in deciding the output.
Further, bias can be considered as the intercept term in a linear equation.

o Activation Function:

This is the final and important component that helps to determine whether the
neuron will fire or not. The activation function can be considered primarily as a
step function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function

The data scientist chooses the activation function based on the problem statement
and the desired form of output. The activation function used (e.g., Sign, Step, or
Sigmoid) may differ between perceptron models, for example depending on
whether the learning process is slow or suffers from vanishing or exploding
gradients.
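A minimal sketch of these three activation functions in Python, assuming scalar inputs, might look like this:

import math

def sign(z):
    # Sign function: maps the net input to -1 or +1
    return 1 if z >= 0 else -1

def step(z):
    # Step (Heaviside) function: maps the net input to 0 or 1
    return 1 if z >= 0 else 0

def sigmoid(z):
    # Sigmoid function: squashes the net input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sign(0.7), step(-0.2), round(sigmoid(0.0), 2))  # 1 0 0.5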

How does Perceptron work?

In Machine Learning, the Perceptron is considered a single-layer neural network that
consists of four main parameters: input values (input nodes), weights and bias, net
sum, and an activation function. The perceptron model begins with the
multiplication of all input values and their weights, then adds these values together
to create the weighted sum. Then this weighted sum is applied to the activation
function 'f' to obtain the desired output. This activation function is also known as
the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the
weight of input is indicative of the strength of a node. Similarly, an input's bias value
gives the ability to shift the activation function curve up or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values with their corresponding weight values and
then add them to determine the weighted sum. Mathematically, we can calculate
the weighted sum as follows:

∑wi*xi = w1*x1 + w2*x2 + … + wn*xn

Add a special term called bias 'b' to this weighted sum to improve the model's
performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives us an output either in binary form or as a continuous
value, as follows:

Y = f(∑wi*xi + b)
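A minimal Python sketch of this forward pass; the weights, bias, and inputs below are made-up illustrative values:

def step(z):
    # Step activation: fire (1) if the net input is positive, otherwise 0
    return 1 if z > 0 else 0

def perceptron_output(inputs, weights, bias):
    # Step 1: weighted sum of inputs plus bias
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function
    return step(weighted_sum)

# Illustrative values only
x = [1.0, 0.5]
w = [0.4, -0.2]
b = 0.1
print(perceptron_output(x, w, b))  # weighted sum = 0.4 - 0.1 + 0.1 = 0.4 -> output 1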

Types of Perceptron Models

Based on the layers, Perceptron models are divided into two types. These are as
follows:

1. Single-layer Perceptron Model
2. Multi-layer Perceptron Model

Single Layer Perceptron Model:

This is one of the simplest types of Artificial Neural Networks (ANNs). A single-layered
perceptron model consists of a feed-forward network and also includes a threshold
transfer function inside the model. The main objective of the single-layer
perceptron model is to analyze linearly separable objects with binary outcomes.

In a single-layer perceptron model, the algorithm does not start from any recorded
data, so it begins with randomly allocated values for the weight parameters.
Further, it sums up all the weighted inputs. If this total sum is more than a
pre-determined value (the threshold), the model gets activated and shows the
output value as +1.

If the outcome matches the pre-determined or threshold value, the performance
of the model is considered satisfactory, and the weights are left unchanged.
However, this model can produce discrepancies when multiple weighted input
values are fed into it. Hence, to find the desired output and minimize errors, some
changes to the input weights become necessary.

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:

Like a single-layer perceptron model, a multi-layer perceptron model also has the
same model structure but has a greater number of hidden layers.

The multi-layer perceptron model is trained using the Backpropagation algorithm,
which executes in two stages as follows:

o Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified
as per the model's requirement. In this stage, the error between the actual
and the desired output is propagated backward, starting at the output layer
and ending at the input layer.

Hence, a multi-layered perceptron model can be considered as multiple artificial
neurons arranged in various layers, in which the activation function does not remain
linear as in a single-layer perceptron model. Instead of a linear function, activation
functions such as sigmoid, TanH, ReLU, etc., can be used.

A multi-layer perceptron model has greater processing power and can process
linear and non-linear patterns. Further, it can also implement logic gates such as
AND, OR, XOR, NAND, NOT, XNOR, NOR.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear
problems.
o It works well with both small and large input data.
o It helps us to obtain quick predictions after training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In a multi-layer perceptron, computations are difficult and time-consuming.
o In a multi-layer perceptron, it is difficult to know how much each independent
variable affects the dependent variable.
o The model's functioning depends on the quality of the training.

Perceptron Function

The perceptron function 'f(x)' is obtained by multiplying the input 'x' with the
learned weight coefficient 'w' and comparing the result against a threshold.

Mathematically, we can express it as follows:

f(x)=1; if w.x+b>0

otherwise, f(x)=0

o 'w' represents the real-valued weights vector
o 'b' represents the bias
o 'x' represents the vector of input values.

Characteristics of Perceptron

The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary
classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is made
whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weighted sum
is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the
two linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must
have an output signal; otherwise, no output will be shown.

Limitations of Perceptron Model

A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due to the
hard-limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input
vectors. If input vectors are non-linear, it is not easy to classify them
properly.

Future of Perceptron

The future of the Perceptron model is bright and significant, as it helps to
interpret data by building intuitive patterns and applying them in the future.
Machine learning is a rapidly growing technology of Artificial Intelligence that is
continuously evolving and in the developing phase; hence the future of perceptron
technology will continue to support and facilitate analytical behavior in machines
that will, in turn, add to the efficiency of computers.

The perceptron model is continuously becoming more advanced and working
efficiently on complex problems with the help of artificial neurons.

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed
a Support Vector Machine. Consider the below diagram, in which two different
categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used for the KNN
classifier. Suppose we see a strange cat that also has some features of dogs. If we
want a model that can accurately identify whether it is a cat or a dog, such a model
can be created using the SVM algorithm. We will first train our model with lots of
images of cats and dogs so that it can learn about the different features of cats and
dogs, and then we test it with this strange creature. Since the support vector
machine creates a decision boundary between these two classes (cat and dog) and
chooses extreme cases (support vectors), it will see the extreme cases of cat and
dog. On the basis of the support vectors, it will classify it as a cat. Consider the
below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is called a
Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such
data is termed non-linear data, and the classifier used is called a Non-linear
SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the
dataset, which means that if there are 2 features (as shown in the image), then the
hyperplane will be a straight line. And if there are 3 features, then the hyperplane
will be a two-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each
class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect
the position of the hyperplane are termed Support Vectors. Since these vectors
support the hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has
two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of
coordinates as either green or blue. Consider the below image:

Since it is a 2-D space, we can easily separate these two classes by just using a
straight line. But there can be multiple lines that can separate these classes.
Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest
points of both classes to the line. These points are called support vectors. The
distance between the vectors and the hyperplane is called the margin, and the goal
of SVM is to maximize this margin. The hyperplane with maximum margin is called
the optimal hyperplane.
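As a sketch of this idea in code, the following snippet fits a linear SVM with scikit-learn on a small made-up two-feature dataset; the points and labels are illustrative, not the green/blue example above:

import numpy as np
from sklearn.svm import SVC

# Illustrative 2-D points with two classes (0 and 1)
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel: the decision boundary is a straight line with maximum margin
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("prediction for (3, 2):", clf.predict([[3.0, 2.0]]))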

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for
non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as shown in the below image:

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it
back to 2-D space with z = 1, it becomes:

Hence, we get a circle of radius 1 in the case of non-linear data.
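A minimal sketch of both approaches, the explicit z = x² + y² feature and the equivalent RBF-kernel SVM, on made-up circular data; all values are illustrative:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative non-linear data: class 0 inside a circle, class 1 outside it
angles = rng.uniform(0, 2 * np.pi, 40)
inner = np.c_[0.5 * np.cos(angles), 0.5 * np.sin(angles)]
outer = np.c_[2.0 * np.cos(angles), 2.0 * np.sin(angles)]
X = np.vstack([inner, outer])
y = np.array([0] * 40 + [1] * 40)

# Option 1: add the explicit third dimension z = x^2 + y^2, then a linear SVM suffices
z = (X ** 2).sum(axis=1, keepdims=True)
clf_lifted = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# Option 2: let the RBF kernel do the lifting implicitly
clf_rbf = SVC(kernel="rbf", gamma="scale").fit(X, y)

print(clf_lifted.predict([[0.1, 0.1, 0.02]]), clf_rbf.predict([[0.1, 0.1]]))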

Naïve Bayes Classifier Algorithm


o Naïve Bayes algorithm is a supervised learning algorithm, which is based
on Bayes theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional
training dataset.

o Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms, and it helps in building fast machine learning models that can
make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the
probability of an object.
o Some popular applications of the Naïve Bayes algorithm are spam filtering,
sentiment analysis, and classifying articles.

Why is it called Naïve Bayes?

The Naïve Bayes algorithm consists of two words, Naïve and Bayes, which can
be described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a
fruit is identified on the basis of color, shape, and taste, then a red, spherical,
and sweet fruit is recognized as an apple. Hence each feature individually
contributes to identifying it as an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends
on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = [P(B|A) * P(A)] / P(B)

Where,

P(A|B) is the Posterior probability: the probability of hypothesis A given the
observed event B.

P(B|A) is the Likelihood: the probability of the evidence given that hypothesis A is
true.

P(A) is the Prior probability: the probability of the hypothesis before observing the
evidence.

P(B) is the Marginal probability: the probability of the evidence.

Working of Naïve Bayes' Classifier:

Working of Naïve Bayes' Classifier can be understood with the help of the below
example:

Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether we should play on a
particular day according to the weather conditions. To solve this problem, we need
to follow the below steps:

1. Convert the given dataset into frequency tables.
2. Generate a Likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No

5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather Yes No

Overcast 5 0

Rainy 2 2

Sunny 3 2

Total 10 4

Likelihood table for the weather conditions:

Weather     No             Yes            P(Weather)

Overcast    0              5              5/14 = 0.35

Rainy       2              2              4/14 = 0.29

Sunny       2              3              5/14 = 0.35

All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3

P(Sunny)= 0.35

P(Yes)=0.71

So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No)= 0.29

P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
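The same calculation can be sketched in a few lines of Python, counting directly from the Outlook/Play table above; exact fractions are used, so the values come out as 0.60 and 0.40 rather than the rounded intermediate figures:

from collections import Counter

# The 14 observations from the dataset above (outlook, play)
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

play_counts = Counter(play for _, play in data)                # Yes: 10, No: 4
sunny_counts = Counter(play for outlook, play in data if outlook == "Sunny")

p_sunny = sum(sunny_counts.values()) / len(data)               # P(Sunny) = 5/14
for label in ("Yes", "No"):
    likelihood = sunny_counts[label] / play_counts[label]      # P(Sunny|label)
    prior = play_counts[label] / len(data)                     # P(label)
    posterior = likelihood * prior / p_sunny                   # Bayes' theorem
    print(f"P({label}|Sunny) = {posterior:.2f}")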

Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the
class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

o Naïve Bayes assumes that all features are independent or unrelated, so it
cannot learn relationships between features.

Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an
eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Types of Naïve Bayes Model:

There are three types of Naive Bayes Model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values instead of
discrete ones, the model assumes these values are sampled from a Gaussian
distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is
multinomially distributed. It is primarily used for document classification
problems, i.e., determining which category a particular document belongs to,
such as Sports, Politics, Education, etc. The classifier uses the frequency of
words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier,
but the predictor variables are independent Boolean variables, such as
whether a particular word is present or not in a document. This model is also
well known for document classification tasks.
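A minimal sketch of these three variants using scikit-learn; the tiny arrays are made-up illustrative values, chosen only to match each model's expected input type:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

# Gaussian NB: continuous features (illustrative values)
X_cont = np.array([[1.2, 3.1], [0.9, 2.8], [4.5, 6.0], [5.1, 6.3]])
print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))

# Multinomial NB: word-count features, e.g. from a bag-of-words document representation
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))

# Bernoulli NB: binary presence/absence features
X_bin = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 1, 0]])
print(BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]]))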

What is Kernel Method?

A set of techniques known as kernel methods is used in machine learning to
address classification, regression, and other prediction problems. They are built
around the idea of kernels, which are functions that gauge how similar two data
points are to one another in a high-dimensional feature space.

The fundamental premise of kernel methods is to convert the input data into a high-
dimensional feature space, which makes it simpler to distinguish between classes
or generate predictions. Kernel methods employ a kernel function to implicitly map
the data into the feature space, as opposed to manually computing the feature
space.

The most popular kind of kernel approach is the Support Vector Machine (SVM), a
binary classifier that determines the best hyperplane that most effectively divides
the two groups. In order to efficiently locate the ideal hyperplane, SVMs map the
input into a higher-dimensional space using a kernel function.

Other examples of kernel methods include kernel ridge regression, kernel PCA, and
Gaussian processes. Since they are strong, adaptable, and computationally
efficient, kernel approaches are frequently employed in machine learning. They are
resilient to noise and outliers and can handle sophisticated data structures like
strings and graphs.

Because the linear classifier can solve a very limited class of problems, the kernel
trick is employed to empower the linear classifier, enabling the SVM to solve a
larger class of problems.

Kernel Method in SVMs

Support Vector Machines (SVMs) use kernel methods to transform the input data
into a higher-dimensional feature space, which makes it simpler to distinguish
between classes or generate predictions. Kernel approaches in SVMs work on the
fundamental principle of implicitly mapping input data into a higher-dimensional
feature space without directly computing the coordinates of the data points in that
space.

The kernel function in SVMs is essential in determining the decision boundary that
divides the various classes. In order to calculate the degree of similarity between
any two points in the feature space, the kernel function computes their dot
product.

The most commonly used kernel function in SVMs is the Gaussian or radial basis
function (RBF) kernel. The RBF kernel maps the input data into an infinite-
dimensional feature space using a Gaussian function. This kernel function is popular
because it can capture complex nonlinear relationships in the data.

Other types of kernel functions that can be used in SVMs include the polynomial
kernel, the sigmoid kernel, and the Laplacian kernel. The choice of kernel function
depends on the specific problem and the characteristics of the data.

Basically, kernel methods in SVMs are a powerful technique for solving
classification and regression problems, and they are widely used in machine
learning because they can handle complex data structures and are robust to noise
and outliers.

Characteristics of Kernel Function

Kernel functions used in machine learning, including in SVMs (Support Vector
Machines), have several important characteristics, including:

o Mercer's condition: A kernel function must satisfy Mercer's condition to be
valid. This condition ensures that the kernel function is positive semi-definite,
which means the kernel (Gram) matrix it produces for any set of inputs has no
negative eigenvalues.
o Positive definiteness: A kernel function is positive definite if it is always
greater than zero except for when the inputs are equal to each other.
o Non-negativity: A kernel function is non-negative, meaning that it produces
non-negative values for all inputs.
o Symmetry: A kernel function is symmetric, meaning that it produces the
same value regardless of the order in which the inputs are given.
o Reproducing property: A kernel function satisfies the reproducing property
if it can be used to reconstruct the input data in the feature space.
o Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.
o Complexity: The complexity of a kernel function is an important
consideration, as more complex kernel functions may lead to overfitting and
reduced generalization performance.

Basically, the choice of kernel function depends on the specific problem and the
characteristics of the data, and selecting an appropriate kernel function can
significantly impact the performance of machine learning algorithms.

Major Kernel Function in Support Vector Machine

In Support Vector Machines (SVMs), there are several types of kernel functions that
can be used to map the input data into a higher-dimensional feature space. The
choice of kernel function depends on the specific problem and the characteristics
of the data.

Here are some most commonly used kernel functions in SVMs:

Linear Kernel

A linear kernel is a type of kernel function used in machine learning, including in
SVMs (Support Vector Machines). It is the simplest and most commonly used kernel
function, and it is defined as the dot product between the input vectors in the
original feature space.

The linear kernel can be defined as:

K(x, y) = x · y

Where x and y are the input feature vectors. The dot product of the input vectors
is a measure of their similarity or distance in the original feature space.

When using a linear kernel in an SVM, the decision boundary is a linear hyperplane
that separates the different classes in the feature space. This linear boundary can
be useful when the data is already separable by a linear decision boundary or when
dealing with high-dimensional data, where the use of more complex kernel
functions may lead to overfitting.

Polynomial Kernel

The polynomial kernel is another kind of kernel function used in machine learning,
such as in SVMs (Support Vector Machines). It is a nonlinear kernel function that
employs polynomial functions to map the input data into a higher-dimensional
feature space.

One definition of the polynomial kernel is:

K(x, y) = (x · y + c)^d

where x and y are the input feature vectors, c is a constant term, and d is the degree
of the polynomial. The constant term is added to the dot product of the input
vectors, and the result is raised to the degree of the polynomial.

The decision boundary of an SVM with a polynomial kernel might capture more
intricate correlations between the input characteristics because it is a nonlinear
hyperplane.

The degree of nonlinearity in the decision boundary is determined by the degree
of the polynomial.

The polynomial kernel has the benefit of being able to detect both linear and
nonlinear correlations in the data. It can be difficult to select the proper degree of
the polynomial, though, as a larger degree can result in overfitting, while a lower
degree may not adequately represent the underlying relationships in the data.

In general, the polynomial kernel is an effective tool for converting the input data
into a higher-dimensional feature space in order to capture nonlinear correlations
between the input characteristics.

Gaussian (RBF) Kernel

The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a
popular kernel function used in machine learning, particularly in SVMs (Support

Vector Machines). It is a nonlinear kernel function that maps the input data into a
higher-dimensional feature space using a Gaussian function.

The Gaussian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||^2)

Where x and y are the input feature vectors, gamma is a parameter that controls
the width of the Gaussian function, and ||x - y||^2 is the squared Euclidean
distance between the input vectors.

When using a Gaussian kernel in an SVM, the decision boundary is a nonlinear
hyperplane that can capture complex nonlinear relationships between the input
features. The width of the Gaussian function, controlled by the gamma parameter,
determines the degree of nonlinearity in the decision boundary.

One advantage of the Gaussian kernel is its ability to capture complex relationships
in the data without the need for explicit feature engineering. However, the choice
of the gamma parameter can be challenging, as a smaller value may result in
underfitting, while a larger value may result in overfitting.

Laplace Kernel

The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is
a type of kernel function used in machine learning, including in SVMs (Support
Vector Machines). It is a non-parametric kernel that can be used to measure the
similarity or distance between two input feature vectors.

The Laplacian kernel can be defined as:

K(x, y) = exp(-gamma * ||x - y||)

Where x and y are the input feature vectors, gamma is a parameter that controls
the width of the Laplacian function, and ||x - y|| is the L1 norm or Manhattan
distance between the input vectors.

When using a Laplacian kernel in an SVM, the decision boundary is a nonlinear
hyperplane that can capture complex relationships between the input features.

The width of the Laplacian function, controlled by the gamma parameter,
determines the degree of nonlinearity in the decision boundary.

One advantage of the Laplacian kernel is its robustness to outliers, as it places less
weight on large distances between the input vectors than the Gaussian kernel.
However, like the Gaussian kernel, choosing the correct value of the gamma
parameter can be challenging.

Hyperbolic or the Sigmoid Kernel


It is used for non-linear classification problems. It transforms the input data into a
higher-dimensional space using the sigmoid function.

K(x, y) = tanh(x^T y + c)
ANOVA radial basis kernel
It is a multiple-input kernel function that can be used for feature selection.
K(x, y) = Σ (k = 1 to n) exp(-(x_k - y_k)^2)^d
Radial-basis function kernel
It maps the input data to an infinite-dimensional space.
K(x, y) = exp(-γ ||x - y||^2)
Wavelet kernel
It is a non-stationary kernel function that can be used for time-series analysis.
K(x, y) = ∑φ(i,j) Ψ(x(i),y(j))
Spectral kernel
This function is based on the eigenvalues & eigenvectors of a similarity matrix.
K(x, y) = ∑λi φi(x) φi(y)
Mahalanobis kernel
This function takes into account the covariance structure of the data.
K(x, y) = exp(-1/2 (x - y)^T S^-1 (x - y))
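To make a few of these formulas concrete, here is a small NumPy sketch of the linear, polynomial, RBF, and sigmoid kernels applied to two made-up vectors; the c, d, and gamma values are arbitrary illustrative choices:

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                              # K(x, y) = x . y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d                   # K(x, y) = (x . y + c)^d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))     # K(x, y) = exp(-gamma ||x - y||^2)

def sigmoid_kernel(x, y, c=0.0):
    return np.tanh(np.dot(x, y) + c)                 # K(x, y) = tanh(x^T y + c)

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.5])
for name, k in [("linear", linear_kernel), ("poly", polynomial_kernel),
                ("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    print(name, round(float(k(x, y)), 4))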

How to Choose the Right Kernel Function in SVM?
The choice of kernel in SVM depends on the type of data being analyzed & the
problem being solved. Here's how you can go about choosing the right kernel
function:
1. Understand the problem - Understand the type of data, the features, and the
complexity of the relationship between the features.
2. Choose a simple kernel function - Start with the linear kernel function, as it serves
as a baseline for comparison with more complex kernel functions.
3. Test different kernel functions - Test the polynomial, RBF, and other kernels and
compare their performance.
4. Tune the parameters - Experiment with different parameter values & choose the
values that deliver the best performance.
5. Use domain knowledge - Based on the type of data, use domain knowledge to
choose the right type of kernel for your data set.
6. Consider computational complexity - Estimate the computation time & resources
that would be required for larger data sets.
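As a sketch of steps 3 and 4, scikit-learn's GridSearchCV can compare kernels and tune their parameters with cross-validation; the parameter grid and synthetic dataset below are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative synthetic dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try several kernels and parameter values, scored by cross-validation
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "degree": [2, 3], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1], "C": [0.1, 1, 10]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best kernel/params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))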

What are the applications of kernel method in SVM?

1. Classification
2. Regression
3. Clustering
4. Anomaly detection
5. Feature extraction
6. Natural language processing
7. Computer vision
8. Time series analysis
9. Recommender systems
10.Bio-informatics
11.Signal processing
12.Robotics

Probabilistic Model in Machine Learning

A probabilistic model in machine learning is a mathematical representation of a
real-world process that incorporates uncertain or random variables. The goal of
probabilistic modeling is to estimate the probabilities of the possible outcomes of
a system based on data or prior knowledge.

Machine learning algorithms today rely heavily on probabilistic models, which
take into consideration the uncertainty inherent in real-world data. These
models make predictions based on probability distributions rather than
absolute values, allowing for a more nuanced and accurate understanding of
complex systems. One common approach is Bayesian inference, where prior
knowledge is combined with observed data to make predictions. Another
approach is maximum likelihood estimation, which seeks to find the model that
best fits the observed data.
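As a small illustration of maximum likelihood estimation, the following sketch estimates the mean and standard deviation of a Gaussian that best fits some made-up observations; for a Gaussian, the maximum likelihood solutions are simply the sample mean and the population (1/n) standard deviation:

import numpy as np

# Made-up observations assumed to come from a Gaussian distribution
data = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.4, 4.7, 5.2])

# Maximum likelihood estimates for a Gaussian
mu_mle = data.mean()
sigma_mle = data.std()    # ddof=0 by default, which is the MLE of the standard deviation

print(f"MLE estimates: mu = {mu_mle:.2f}, sigma = {sigma_mle:.2f}")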
What are Probabilistic Models?
Probabilistic models are an essential component of machine learning, which
aims to learn patterns from data and make predictions on new, unseen data.
They are statistical models that capture the inherent uncertainty in data and
incorporate it into their predictions. Probabilistic models are used in various
applications such as image and speech recognition, natural language processing,
and recommendation systems. In recent years, significant progress has been
made in developing probabilistic models that can handle large datasets
efficiently.
Probabilistic modeling is not a single algorithm but a family of approaches widely
used in machine learning; familiar examples include discriminant analysis and the
multinomial Naïve Bayes classifier. Probabilistic models can often learn from data
more efficiently than traditional non-probabilistic statistical techniques.
What Is Probabilistic Modeling?
A probabilistic model is a machine learning method in which decision-making is
done using the probabilities of the possible outcomes of the independent variable,
under an assumption that the likelihood of certain events is constant. It may be
used, for example, to make the best choice among several alternatives. The main
advantage of this model lies in its reliance on an underlying learning algorithm,
which uses simple rules such as taking an action if its expected value is positive, or
taking an action if its expected value exceeds some threshold.

In Machine Learning, a probability model is used when we want to predict a new
variable value based on previous variables or events. For example, in machine
learning, we can use a Bayesian inference algorithm to find the best possible value
for our prediction based on past data.

Categories Of Probabilistic Models


These models can be classified into the following categories:
 Generative models
 Discriminative models
 Graphical models
Generative models:
Generative models aim to model the joint distribution of the input and output
variables. These models generate new data based on the probability distribution
of the original dataset. Generative models are powerful because they can
generate new data that resembles the training data. They can be used for tasks
such as image and speech synthesis, language translation, and text generation.
Discriminative models
The discriminative model aims to model the conditional distribution of the
output variable given the input variable. They learn a decision boundary that
separates the different classes of the output variable. Discriminative models are
useful when the focus is on making accurate predictions rather than generating
new data. They can be used for tasks such as image recognition, speech
recognition, and sentiment analysis.
Graphical models
These models use graphical representations to show the conditional
dependence between variables. They are commonly used for tasks such as
image recognition, natural language processing, and causal inference.
Importance of Probabilistic Models
Probabilistic models are fundamental in machine learning. They are used to
represent the relationship between variables, and they help us make predictions
about future data. Probabilistic models also help us understand the uncertainty in
our data, which is essential because it helps us make more informed decisions. For
example, suppose we know there is some probability of a problem occurring in our
system. In that case, we can take steps to reduce the risk of failure.

The importance of probabilistic models has led to the creation of new fields, such
as Bayesian Statistics, which helps us make better predictions about how certain
events will affect our systems.

 Probabilistic models play a crucial role in the field of machine learning,
providing a framework for understanding the underlying patterns and
complexities in massive datasets.
 Probabilistic models provide a natural way to reason about the
likelihood of different outcomes and can help us understand the
underlying structure of the data.
 Probabilistic models help enable researchers and practitioners to make
informed decisions when faced with uncertainty.
 Probabilistic models allow us to perform Bayesian inference, which is a
powerful method for updating our beliefs about a hypothesis based on
new data. This can be particularly useful in situations where we need to
make decisions under uncertainty.
Advantages Of Probabilistic Models
 Probabilistic models are an increasingly popular method in many fields,
including artificial intelligence, finance, and healthcare.
 The main advantage of these models is their ability to take into account
uncertainty and variability in data. This allows for more accurate
predictions and decision-making, particularly in complex and
unpredictable situations.
 Probabilistic models can also provide insights into how different factors
influence outcomes and can help identify patterns and relationships
within data.
Disadvantages Of Probabilistic Models
There are also some disadvantages to using probabilistic models.
 One of the disadvantages is the potential for overfitting, where the
model is too specific to the training data and doesn’t perform well on
new data.
 Not all data fits well into a probabilistic framework, which can limit the
usefulness of these models in certain applications.
 Another challenge is that probabilistic models can be computationally
intensive and require significant resources to develop and implement.

Naive Bayes Algorithm
Naive Bayes is a popular probabilistic classification algorithm in machine learning.
It is used to make predictions about new observations, given past observations and
a set of parameters that describe the data distribution. The Naive Bayes algorithm
works well for non-binary (multi-class) classifiers and can be easily understood with
an example.
