
INTRODUCTION TO

MACHINE LEARNING
Module 5.2
INTRODUCTION
• Regression analysis is a statistical method used to model the relationship between a dependent
variable and one or more independent variables.
• More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held fixed.
• It predicts continuous/real values such as temperature, age, salary, price, etc.

• In simple linear regression, there is only one independent variable, and the goal
is to model the relationship between that variable and the dependent variable
using a straight line.
• The equation of the line is determined using the method of least squares, which
finds the line that minimizes the sum of the squared differences between the
observed values of the dependent variable and the predicted values based on the
line.
• Regression is a supervised learning technique which helps in finding the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables.
• It is mainly used for prediction, forecasting, time-series modeling, and determining the cause-and-effect relationship between variables.

• "Regression shows a line or curve that passes through all the datapoints on
target-predictor graph in such a way that the vertical distance between the
datapoints and the regression line is minimum."
• There are various real-world scenarios where we need future predictions, such as weather conditions, sales figures, marketing trends, etc., and for such cases we need a technique that can make predictions accurately.
• Regression analysis is such a technique: a statistical method used in machine learning and data science.

TERMINOLOGIES RELATED TO REGRESSION ANALYSIS
• Dependent Variable: The main factor in regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
• Independent Variable: The factors that affect the dependent variable, or that are used to predict its values, are called independent variables, also known as predictors.
• Outliers: An outlier is an observation with a very low or very high value compared to the other observed values. Outliers can distort the results, so they should be handled carefully.
• Multicollinearity: If the independent variables are highly correlated with each other, the condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
• Underfitting and Overfitting: If our algorithm works well on the training dataset but not on the test dataset, the problem is called overfitting. If our algorithm does not perform well even on the training dataset, the problem is called underfitting.
REASONS FOR REGRESSION ANALYSIS
• Regression estimates the relationship between the target and the independent variable.
• It is used to find the trends in data.
• It helps to predict real/continuous values.
• By performing regression, we can determine the most important factor, the least important factor, and how each factor affects the others.

LINEAR REGRESSION MODELS

LINEAR REGRESSION MODELS
• Linear regression models are a type of regression analysis that models the relationship between a
dependent variable and one or more independent variables using a linear equation.
• Simple linear regression models have only one independent variable and are often represented by the equation:

y = β₀ + β₁x + ɛ

• where y is the dependent variable, x is the independent variable, β₀ and β₁ are coefficients that represent the intercept and slope of the line, respectively, and ɛ is the error term.

• Multiple linear regression models extend the approach to cases where there are multiple independent variables, and the equation becomes:

y = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ + ɛ

• where x₁, x₂, ..., xₖ are the independent variables, and β₁, β₂, ..., βₖ are the corresponding coefficients that represent the impact of each independent variable on the dependent variable.
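A minimal sketch of fitting such a model, assuming scikit-learn and NumPy are available; the synthetic dataset and the new observation are illustrative only:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data with three independent variables (illustrative only)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

model = LinearRegression().fit(X, y)          # estimates beta_0 ... beta_k by least squares
print("intercept (beta_0):", model.intercept_)
print("coefficients (beta_1..beta_k):", model.coef_)
print("prediction for a new observation:", model.predict([[0.5, -1.2, 0.3]]))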

LEAST SQUARES
• Linear regression models are typically estimated using the method of least squares.
• The goal of the least squares method is to find the line or plane that minimizes the sum of the
squared differences between the predicted values of the dependent variable and the observed
values in the data set.
• In the case of simple linear regression, the line that minimizes the sum of squared differences between the predicted and observed values is given by:

ŷ = β₀ + β₁x

• where β₀ and β₁ are the coefficients that minimize the sum of squared differences. This line is commonly referred to as the "line of best fit" and is used to model the relationship between the independent and dependent variables.

• The equation for the line of best fit is:

β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,   β₀ = ȳ − β₁x̄

• Here x̄ is the mean of all the values in the input X and ȳ is the mean of all the values in the desired output Y. This is the least squares method.
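A small sketch that applies these formulas directly with NumPy; the data values are made up for illustration:

import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
beta_0 = y_bar - beta_1 * x_bar                                        # intercept

y_hat = beta_0 + beta_1 * x                 # predictions on the line of best fit
sse = np.sum((y - y_hat) ** 2)              # the quantity least squares minimizes
print("beta_0:", beta_0, "beta_1:", beta_1, "sum of squared errors:", sse)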

SUBSET SELECTION

SUBSET SELECTION
• Subset selection is a technique used in machine learning to select a subset of relevant features
or variables from a larger set of available features or variables.
• The goal of subset selection is to improve the accuracy and interpretability of a machine
learning model by reducing the dimensionality of the input data.
• There are different approaches to subset selection, including forward selection and backward elimination.

• The choice of subset selection method depends on the nature of the problem and the available data. Subset selection can improve the accuracy of a model by reducing the variance caused by irrelevant features or variables. However, it can also lead to a loss of information and decreased model interpretability if relevant features or variables are removed.
FORWARD SELECTION
• Forward selection starts with an empty set of features and adds one feature at a time based on
the feature that provides the best improvement in the model performance. This process
continues until a stopping criterion is met.
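One possible implementation, sketched with scikit-learn's SequentialFeatureSelector; the estimator, synthetic dataset, and stopping criterion (a fixed number of features) are illustrative assumptions:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate features, only 4 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

forward = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,      # stopping criterion: keep 4 features (illustrative)
    direction="forward",         # start empty, add the best feature at each step
    cv=5,
)
forward.fit(X, y)
print("selected feature indices:", forward.get_support(indices=True))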

BACKWARD ELIMINATION
• Backward elimination, on the other hand, starts with all the features and eliminates one
feature at a time based on the feature that provides the least contribution to the model
performance. This process also continues until a stopping criterion is met.
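A corresponding sketch for backward elimination, again assuming scikit-learn and an illustrative stopping criterion:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

backward = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,      # stopping criterion (illustrative)
    direction="backward",        # start with all features, drop the weakest at each step
    cv=5,
)
backward.fit(X, y)
print("remaining feature indices:", backward.get_support(indices=True))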

SHRINKAGE METHODS

SHRINKAGE METHODS
• Shrinkage methods are a class of techniques used in machine learning to reduce the impact of
irrelevant or noisy features in a model.
• The goal of shrinkage methods is to improve the accuracy and interpretability of the
model by reducing the variance and avoiding overfitting.
• Two common types of shrinkage methods are ridge regression and Lasso
regression.

RIDGE REGRESSION
• Ridge regression adds a penalty term to the least-squares objective function, which constrains
the magnitude of the coefficients. The penalty term is proportional to the square of the
coefficients, and a regularization parameter is used to control the strength of the penalty.

• As a result, ridge regression shrinks the coefficients towards zero, but does
not necessarily set them to zero. This method is particularly useful when
dealing with collinear features, as it can prevent overfitting and improve the
stability of the model.
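A brief sketch comparing ordinary least squares with ridge regression in scikit-learn; the dataset and the value of the regularization parameter alpha are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=15.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)     # alpha controls the strength of the L2 penalty

# Ridge shrinks the coefficients toward zero but rarely sets them exactly to zero
print("largest OLS coefficient:  ", np.abs(ols.coef_).max())
print("largest ridge coefficient:", np.abs(ridge.coef_).max())
print("ridge coefficients set to zero:", np.sum(ridge.coef_ == 0))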

LASSO REGRESSION
• Lasso regression, on the other hand, adds a penalty term that is proportional to the
absolute value of the coefficients. This method can set the coefficients to zero and
thus perform feature selection, meaning it can identify and exclude irrelevant
features from the model.

• Lasso regression is particularly useful when dealing with high-dimensional data, as it can reduce the number of features and improve the interpretability of the model.
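A brief sketch of the lasso's feature-selection effect, assuming scikit-learn; the synthetic dataset and the alpha value are illustrative:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Many candidate features, few of them actually informative (illustrative)
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)       # alpha controls the strength of the L1 penalty
kept = np.flatnonzero(lasso.coef_)       # features whose coefficients are non-zero
print("features kept by the lasso:", kept)
print("features dropped:", X.shape[1] - kept.size)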

LOGISTIC REGRESSION

LOGISTIC REGRESSION IN ML
• Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised
Learning technique.
• It is used for predicting the categorical dependent variable using a given set of independent variables.
• Logistic regression predicts the output of a categorical dependent variable.
• Therefore, the outcome must be a categorical or discrete value.
• The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving exact values of 0 and 1, logistic regression gives probabilistic values which lie between 0 and 1.
• Logistic regression is very similar to linear regression except in how it is used.
• Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
• In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped logistic function, whose output is bounded between the two extreme values 0 and 1.
• The curve from the logistic function indicates the likelihood of an event, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different types of data and can easily
determine the most effective variables used for the classification.
• Logistic regression uses the same concept of predictive modeling as regression, which is why it is called logistic regression; however, it is used to classify samples, and therefore it falls under the classification algorithms.
• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within the range of 0 and 1.
• The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms a curve like the "S" form. This S-shaped curve is called the sigmoid function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the boundary between predicting 0 or 1: probabilities above the threshold are mapped to 1, and probabilities below the threshold are mapped to 0.
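A minimal sketch of the sigmoid function and a 0.5 threshold, using NumPy; the input scores are made-up values:

import numpy as np

def sigmoid(z):
    # maps any real value into the range (0, 1), producing the S-shaped curve
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # linear scores beta_0 + beta_1*x1 + ...
probs = sigmoid(scores)                           # probabilities between 0 and 1
labels = (probs >= 0.5).astype(int)               # above the threshold -> 1, below -> 0
print(probs)
print(labels)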

ASSUMPTIONS FOR LOGISTIC REGRESSION


• The dependent variable must be categorical in nature.
• The independent variables should not have multicollinearity.

EQUATION FOR LOGISTIC REGRESSION

log(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + ... + βₖxₖ, where p is the predicted probability of the positive class.
TYPES OF LOGISTIC REGRESSION


• Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial logistic regression, there can be three or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal logistic regression, there can be three or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
FITTING LOGISTIC REGRESSION MODELS
• Logistic regression is a statistical method used to model the relationship between a binary dependent variable and one or more independent variables.
• The output of a logistic regression model is a probability estimate that a given observation belongs to the positive class (e.g., success, yes) or the negative class (e.g., failure, no).
• In this way, logistic regression can be used to classify observations into one of two classes.
• To fit a logistic regression model, the following steps are typically taken:
1. Collect data: Collect data on the dependent variable and independent variables. The dependent variable should
be binary (i.e., only take two values).
2. Pre-process the data: Pre-process the data by checking for missing values, outliers, and other issues that may
impact the quality of the analysis.
3. Build the model: Specify the logistic regression model by selecting the appropriate independent variables and
fitting the model to the data. There are various ways to do this, but one common method is to use maximum
likelihood estimation.
4. Evaluate the model: Evaluate the model by examining its performance on a held-out test set or by using other
metrics such as accuracy, precision, recall, or the area under the receiver operating characteristic curve (AUC-
ROC).
5. Interpret the results: Interpret the results of the model by examining the coefficients of the independent
variables and their associated p-values or confidence intervals. These coefficients represent the change in the log
odds of the positive outcome for a one-unit increase in the corresponding independent variable.
6. Use the model: Use the model to make predictions on new data by inputting the values of the independent
variables and calculating the corresponding probability estimate using the logistic regression equation. The decision
threshold for classifying observations into one of the two classes can be set by selecting an appropriate cutoff value for the probability estimate.
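A compact sketch of steps 1-6, assuming scikit-learn and a synthetic binary dataset; the metrics and the 0.5 cutoff are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)   # steps 1-2: data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)        # step 3: fit by maximum likelihood

probs = model.predict_proba(X_test)[:, 1]                 # step 4: evaluate on held-out data
print("accuracy:", accuracy_score(y_test, (probs >= 0.5).astype(int)))
print("AUC-ROC :", roc_auc_score(y_test, probs))

print("coefficients (change in log-odds per unit):", model.coef_)   # step 5: interpret

new_obs = X_test[:1]                                       # step 6: predict for new data
print("P(positive class):", model.predict_proba(new_obs)[:, 1])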
QUADRATIC APPROXIMATION AND INFERENCE

QUADRATIC APPROXIMATION
• Quadratic approximation is a mathematical technique used to approximate complex functions by a
quadratic function, which is much simpler to work with.
• In statistics, this technique is often used to approximate the likelihood function of a statistical model,
which is a complex function that describes the probability of the observed data given the model
parameters.
• Quadratic approximation is based on the Taylor series expansion of a function around a specific point. The Taylor series expansion of a function f(x) around a point a is given by:

f(x) = f(a) + f'(a)(x − a) + (f''(a)/2!)(x − a)² + (f'''(a)/3!)(x − a)³ + ...

• where f'(a) and f''(a) are the first and second derivatives of f(x) evaluated at x = a, respectively. The quadratic approximation of f(x) around a is obtained by keeping only the terms up to second order, which gives:

f(x) ≈ f(a) + f'(a)(x − a) + ½ f''(a)(x − a)²
• The quadratic approximation of a function can be used to approximate the likelihood function of a
statistical model by approximating the log-likelihood function, which is often a complex function.
• The log-likelihood function is approximated by a quadratic function, which is much simpler to work
with, and its maximum can be found using standard optimization techniques.
• Quadratic approximation is a powerful technique for approximating complex functions and for making
statistical inference about the parameters of a statistical model.
• It is widely used in statistics and related fields to estimate parameters and to make predictions about
future observations.
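A small numerical sketch of the idea, using a Bernoulli log-likelihood with made-up counts; the expansion point and the evaluation point are illustrative:

import numpy as np

k, n = 30, 100                 # 30 successes in 100 trials (made-up data)

def loglik(p):                 # Bernoulli log-likelihood
    return k * np.log(p) + (n - k) * np.log(1 - p)

def d1(p):                     # first derivative
    return k / p - (n - k) / (1 - p)

def d2(p):                     # second derivative
    return -k / p**2 - (n - k) / (1 - p)**2

a = k / n                      # expansion point (here the maximum likelihood estimate)
p = 0.35                       # point at which to compare exact vs. approximate values
quad = loglik(a) + d1(a) * (p - a) + 0.5 * d2(a) * (p - a) ** 2
print("exact log-likelihood:   ", loglik(p))
print("quadratic approximation:", quad)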

INFERENCE IN QUADRATIC APPROXIMATION
• Inference in quadratic approximation involves calculating the standard errors and confidence intervals of the
estimated parameters using the Hessian matrix, which is the matrix of second partial derivatives of the log-
likelihood function evaluated at the estimated parameters.
• The Hessian matrix is a square matrix that measures the curvature of the log-likelihood function around the
estimated parameters.
• It is defined as follows:
H = − ∂² log L(θ) / ∂θ²
where log L(θ) is the log-likelihood function and θ is the vector of estimated parameters.
• The negative sign ensures that, at a maximum of the log-likelihood function, H is positive definite, confirming that the stationary point is a maximum and not a minimum or a saddle point.
• The standard errors of the estimated parameters are calculated as the square roots of the diagonal elements of
the inverse of the Hessian matrix.
• The inverse of the Hessian matrix is known as the covariance matrix, and it measures the uncertainty in the
estimated parameters.
• The standard errors can be used to test hypotheses about the parameters and to assess the precision of the
estimates.
• The confidence intervals of the estimated parameters are calculated using the standard errors and the t-
distribution.
• The t-distribution is used because the distribution of the estimated parameters is often not normal.
• The confidence interval for a parameter is given by:
θ ± tα/2 * SE(θ)
where θ is the estimated parameter, tα/2 is the critical value of the t-distribution with α/2 probability and
n-p-1 degrees of freedom, where n is the sample size and p is the number of parameters, and SE(θ) is the
standard error of the estimated parameter.
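A numerical sketch of these calculations for the same one-parameter Bernoulli example, assuming NumPy and SciPy; the significance level and degrees of freedom follow the formula above and are illustrative:

import numpy as np
from scipy import stats

# One parameter, so the Hessian reduces to a single number
k, n = 30, 100
theta_hat = k / n                                        # maximum likelihood estimate
H = k / theta_hat**2 + (n - k) / (1 - theta_hat)**2      # minus the second derivative
cov = 1.0 / H                                            # inverse Hessian = variance estimate
se = np.sqrt(cov)                                        # standard error

alpha, p = 0.05, 1                                       # significance level, number of parameters
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p - 1)
ci = (theta_hat - t_crit * se, theta_hat + t_crit * se)
print("estimate:", theta_hat, "SE:", se, "95% CI:", ci)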

• Overall, inference in quadratic approximation involves calculating the standard errors and confidence
intervals of the estimated parameters using the Hessian matrix.
• This technique is widely used in statistics and related fields to test hypotheses about the parameters and to
assess the precision of the estimates.

L1 REGULARIZED LOGISTIC REGRESSION

L1 REGULARIZED LOGISTIC REGRESSION
• L1 regularized logistic regression, also known as Lasso logistic regression, is a statistical method that combines logistic regression with a penalty on the absolute values of the regression coefficients. The penalty encourages the coefficients to be sparse and helps to identify the most important predictors.
• The objective function of L1 regularized logistic regression is given by:
minimize -log likelihood + λ ||β||1
where β is the vector of regression coefficients, λ is a tuning parameter that controls the strength of the
penalty, and ||β||1 is the L1 norm of the coefficients, which is the sum of the absolute values of the
coefficients.

• The L1 norm penalty encourages the coefficients to be exactly zero, leading to sparse models that only
include the most important predictors.
• The optimization problem can be solved using various algorithms, such as coordinate descent, proximal
gradient descent, and stochastic gradient descent. These algorithms iteratively update the coefficients and
the tuning parameter until convergence is achieved.
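A minimal sketch using scikit-learn, where the L1 penalty strength is set through C = 1/λ; the synthetic dataset and the C value are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Many candidate predictors, few of them informative (illustrative)
X, y = make_classification(n_samples=200, n_features=40, n_informative=5, random_state=0)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)   # small C = strong penalty
model.fit(X, y)

kept = np.flatnonzero(model.coef_[0])        # predictors with non-zero coefficients
print("predictors kept:", kept.size, "out of", X.shape[1])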

• L1 regularized logistic regression has several advantages over traditional logistic regression.
• It can handle high-dimensional data, where the number of predictors is much larger than the sample size, and
it can identify the most important predictors, which can lead to more interpretable models.
• Additionally, L1 regularization can improve the prediction accuracy of logistic regression models by reducing
overfitting.
• However, L1 regularized logistic regression has some limitations as well. It can be computationally expensive
to optimize the objective function, especially for large datasets, and the choice of the tuning parameter λ can
be challenging.
• Additionally, the L1 penalty may not be appropriate for all situations, and other penalties, such as the L2
penalty, may be more suitable in some cases.

SUPPORT VECTOR MACHINES

INTRODUCTION
• Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression.
• The objective of SVM is to find a hyperplane in an N-dimensional space that distinctly classifies the data points.
• The dimension of the hyperplane depends upon the number of features.
• If the number of input features is two, then the hyperplane is just a line.
• If the number of input features is three, then the hyperplane becomes a 2-D plane.
• It becomes difficult to imagine when the number of features exceeds three.
• Basically, the idea behind SVM is to find a plane or boundary that separates the data between two classes.
• Support vectors are the data points that lie closest to the decision boundary; they are the most difficult points to classify, and they hold the key to the optimal decision surface.
• The optimal hyperplane comes from the function class with the lowest capacity, i.e., the minimum number of independent features/parameters.

WHAT IS THE USE OF A KERNEL IN THE SVM ALGORITHM?
• The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts non-separable problems into separable problems.
• It is mostly useful in non-linear separation problems.
• Simply put, the kernel performs some extremely complex data transformations and then finds the process to separate the data based on the labels or outputs defined.

TYPES OF SVM
• Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, such data is termed linearly separable data, and the classifier used is called a linear SVM classifier.
• Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, such data is termed non-linear data, and the classifier used is called a non-linear SVM classifier.

ADVANTAGES OF SVM
• The main strength of SVMs is that they work well even when the number of features is much larger than the number of instances.
• They can work on datasets with a huge feature space, as is the case in spam filtering, where a large number of words are potential signifiers of a message being spam.
• Even when the optimal decision boundary is a non-linear curve, the SVM transforms the variables to create new dimensions such that the representation of the classifier is a linear function of those transformed dimensions of the data.
• SVMs are conceptually easy to understand: they create an easy-to-understand linear classifier.
• SVMs are now available in almost all data analytics toolsets.
• SVMs are effective in high-dimensional cases.
• They are memory efficient, as they use only a subset of the training points, called support vectors, in the decision function.
• Different kernel functions can be specified for the decision function, and it is possible to specify custom kernels.

DISADVANTAGES OF SVM
• The SVM technique has two major constraints:
⮚ It works well only with real numbers, i.e., all the data points in all the dimensions must be defined by numeric values only.
⮚ It works only with binary classification problems. The solution to get around this constraint is to build a series of cascaded SVMs.
• Training SVMs is an inefficient and time-consuming process when the data is large.
• SVMs do not work well when there is much noise in the data, and soft margins must then be computed.
• SVMs also do not provide a probability estimate of classification, i.e., the confidence level for classifying an instance.

APPLICATIONS OF SVM
1. Classification
2. Regression analysis
3. Pattern recognition
4. Outlier detection
5. Relevance-based applications

SUPPORT VECTOR CLASSIFIER
• Support Vector classifiers map data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable.
• They are of two types:
1. Simple or linear SVM
2. Kernel or non-linear SVM
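A short sketch contrasting the two, assuming scikit-learn; the synthetic dataset (concentric circles) and the hyperparameters are illustrative:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by a straight line (illustrative dataset)
X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear SVM accuracy:", linear_svm.score(X, y))   # poor: no separating straight line exists
print("kernel SVM accuracy:", rbf_svm.score(X, y))      # good: the RBF kernel handles the non-linearity
print("support vectors per class:", rbf_svm.n_support_)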

LINEAR SVM
[Figure: linear SVM illustration]

NON-LINEAR SVM
[Figure: non-linear SVM illustration]
KERNELS IN SVM
• The function of a kernel is to take data as input and transform it into the required form.
• Different SVM algorithms use different types of kernel functions.
• The term "kernel" refers to the set of mathematical functions used in the Support Vector Machine that provide a window to manipulate the data.
• A kernel function generally transforms the training data so that a non-linear decision surface becomes a linear equation in a higher-dimensional space.
• Basically, it returns the inner product between two points in a suitable feature space.
• The standard kernel function equation is:

K(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩, where φ is the mapping into the higher-dimensional feature space.
[Figures: Gaussian kernel graph, sigmoid kernel graph, and polynomial kernel graph]
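A plain-NumPy sketch of some common kernel functions; the parameter values (gamma, r, d) are illustrative assumptions, not fixed definitions:

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=3):
    return (gamma * np.dot(x, y) + r) ** d

def gaussian_rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=0.5, r=0.0):
    return np.tanh(gamma * np.dot(x, y) + r)

a, b = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(a, b), polynomial_kernel(a, b),
      gaussian_rbf_kernel(a, b), sigmoid_kernel(a, b))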

THANK YOU
