
Regression:

What is Regression?

Regression is a predictive modelling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It is used for forecasting, time series modelling, and finding the causal effect between variables.
What are the types of Regression Techniques? (The question sounds odd, especially when it comes from people who are new to ML)

These techniques are mostly driven by three factors: the number of independent variables, the type of the dependent variable, and the shape of the regression line.

Techniques to cover:

1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Ridge Regression
5. Lasso Regression
6. ElasticNet Regression
7. SVM (Support Vector Regression)
8. Decision Tree Regression
9. Random Forest Regression
10. Naïve Bayes Regression

Linear Regression: The dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the nature of the regression line is linear (compare with diagram). It is represented by the equation Y = m*X + b + e, where b is the intercept, m is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable based on the given predictor variable(s).

Follow-up question: multiple linear regression

Multiple linear regression: two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables:

y = b + m1*X1 + m2*X2 + m3*X3

Multiple linear regression has more than one independent variable, whereas simple linear regression has only one independent variable.
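A rough sketch with made-up data, assuming scikit-learn is available – simple and multiple linear regression use the same estimator, just with a different number of columns in X:

# Simple vs. multiple linear regression (illustrative data only).
import numpy as np
from sklearn.linear_model import LinearRegression

X_simple = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one predictor
y = np.array([2.1, 4.2, 6.1, 8.3, 9.9])                    # target
simple_model = LinearRegression().fit(X_simple, y)
print("slope m:", simple_model.coef_, "intercept b:", simple_model.intercept_)

# Multiple linear regression: same estimator, more columns in X.
X_multi = np.array([
    [1.0, 0.5],
    [2.0, 1.5],
    [3.0, 2.0],
    [4.0, 3.5],
    [5.0, 4.0],
])
multi_model = LinearRegression().fit(X_multi, y)
print("coefficients m1, m2:", multi_model.coef_, "intercept b:", multi_model.intercept_)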

Note: (Follow up question) Some interviewers expect this when we talk about Lin Reg

How to obtain best fit line (Value of m and b)?

This task can be easily accomplished by the Least Squares Method. It is the most common method used for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are squared before being added, there is no cancelling out between positive and negative values.
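To see what the least squares method actually computes, here is a rough sketch with made-up numbers that derives m and b from the closed-form formulas and cross-checks them with numpy:

# Closed-form least squares for one predictor (illustrative data only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 6.1, 8.3, 9.9])

# m = sum((x - x_bar)*(y - y_bar)) / sum((x - x_bar)^2),  b = y_bar - m * x_bar
x_bar, y_bar = x.mean(), y.mean()
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - m * x_bar
print("m:", m, "b:", b)

# Cross-check with numpy's polyfit (a degree-1 polynomial is a line).
m_check, b_check = np.polyfit(x, y, deg=1)
print("polyfit m:", m_check, "b:", b_check)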
I think this is enough – no need for us to talk about other metrics (RMSE, R², Adjusted R²) when we talk about Linear Regression (those will be covered when we talk about Model Performance).

A few important things to know

Assumptions of Linear Regression

Linear regression has five key assumptions:

 Linear relationship
 Multivariate normality
 No or little multicollinearity
 No auto-correlation
 Homoscedasticity

Look at the points below and answer accordingly if the interviewer asks something about the assumptions (not required, but if still needed, wait for Part 2 😊)

Important Points:

 There must be a linear relationship between the independent and dependent variables.
 Multiple regression can suffer from multicollinearity, autocorrelation, and heteroskedasticity.
 Linear regression is very sensitive to outliers. They can terribly affect the regression line and, eventually, the forecasted values.
 Multicollinearity can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable (a quick way to check for this is sketched after this list).
 In the case of multiple independent variables, we can go with forward selection, backward elimination, or a stepwise approach for selecting the most significant independent variables.
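If the interviewer pushes on how to check the multicollinearity assumption in practice, one common diagnostic is the variance inflation factor. A rough sketch with made-up data, computing VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features:

# Checking multicollinearity with variance inflation factors (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)   # deliberately correlated with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF for feature {j}: {1.0 / (1.0 - r2):.2f}")   # values well above ~5-10 flag multicollinearity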

Polynomial Regression :

A regression equation is a polynomial regression equation if the power of the independent variable is more than 1. The equation below represents a polynomial equation:

y = b + m*x^2

(Some idiot 😊 interviewer asked whether it is squaring the coefficients (higher degrees) or squaring the independent variables that keeps the model linear – confused soul)

Basically the question is: why is polynomial regression considered a special case of multiple linear regression? Because the model is still linear in the coefficients – the higher powers only appear in the features (x, x², ...), so it can be fitted exactly like a multiple linear regression.

https://stats.stackexchange.com/questions/92065/why-is-polynomial-regression-considered-a-special-case-of-multiple-linear-regres

In this regression technique, the best-fit line is not a straight line. It is rather a curve that fits the data points.
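A rough sketch with made-up data, assuming scikit-learn: PolynomialFeatures expands x into [1, x, x²] and LinearRegression then fits a model that is linear in those expanded features, which is exactly why polynomial regression is a special case of multiple linear regression:

# Polynomial regression as a linear model on expanded features (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(scale=0.3, size=30)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))   # prediction from the fitted curve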

Note: The comparison with the SVM regressor for polynomial regression is a very important concept to understand – I will try to put some notes on it in the notebook I create.
Important Points:

 While there might be a temptation to fit a higher-degree polynomial to get a lower error, this can result in over-fitting. Always plot the relationships to see the fit and focus on making sure that the curve fits the nature of the problem.

Especially look out for the curve towards the ends and see whether those shapes and trends make sense. Higher-degree polynomials can end up producing weird results on extrapolation. (I will add more points on this while comparing with the SVM regressor in the notebook)

Ridge Regression: Ridge regression is a technique used when the data suffers from multicollinearity (the independent variables are highly correlated). In multicollinearity, even though the least squares (OLS) estimates are unbiased, their variances are large, which deviates the observed value far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.

y = a + b*x

This equation also has an error term. The complete equation becomes:

y = a + b*x + e   [the error term is the value needed to correct for the prediction error between the observed and the predicted value]

=> y = a + b1*x1 + b2*x2 + ... + e, for multiple independent variables.

In a linear equation, prediction errors can be decomposed into two sub-components: the first is due to bias and the second is due to variance. Prediction error can occur due to either of these two components, or both. Here, we'll discuss the error caused due to variance.
Ridge regression solves the multicollinearity problem through the shrinkage parameter λ (lambda). Look at the equation below:

minimize: Σ (y_i - ŷ_i)² + λ Σ βj²

In this equation, we have two components. The first one is the least squares term and the other one is lambda times the summation of β² (beta squared), where β is the coefficient. This is added to the least squares term in order to shrink the parameters so that they have very low variance.

Intuitively explaining Ridge regression in an interview:

y = w0*X0 + w1*X1 + w2*X2 + b

Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same as the one used for OLS. In ridge regression the coefficients (w) are chosen so that they not only predict well on the training set but also satisfy an additional constraint: we want the magnitude of the coefficients to be as small as possible, in other words all entries of w should be close to zero. Intuitively, this means each feature should have as little effect on the outcome as possible (this means having a small slope). This constraint that we put on the model is what we call regularization.

Regularization means explicitly restricting a model to avoid overfitting; the kind used by ridge regression is called L2 regularization. Below is the objective sklearn uses:

||y - Xw||^2_2 + alpha * ||w||^2_2

Here, alpha is a complexity parameter that controls the amount of shrinkage: the larger the value of alpha, the greater the amount of shrinkage and thus the more robust the coefficients become to collinearity.
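A rough sketch with made-up, deliberately collinear data, assuming scikit-learn; alpha here is the shrinkage/complexity parameter discussed above and its value is arbitrary:

# Ridge vs. OLS on collinear features (illustrative data only).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.95 * X[:, 0] + rng.normal(scale=0.05, size=100)   # correlated columns
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The ridge coefficients are shrunk towards zero and tend to be more stable
# than the OLS ones when features are collinear.
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)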
Lasso Regression
Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Look at the equation:

minimize: Σ (y_i - ŷ_i)² + λ Σ |βj|

Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. This amounts to constraining the sum of the absolute values of the estimates, which causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates get shrunk towards absolute zero. This results in variable selection out of the given n variables.

Intuitively

An alternative to Ridge for regularizing linear regression is Lasso. As with ridge regression, using the lasso also restricts the coefficients to be close to zero, but in a slightly different way, called L1 regularization. The consequence of L1 regularization is that when using the lasso, some coefficients are exactly zero. This means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection. Having some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model.

This is how sklearn defines the objective above; when building the model we need to tune alpha (this will be discussed in the notebook):

(1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1

Note: The notebook will talk about tuning the alpha parameter and I will try to plot how it works.
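A rough sketch with made-up data, assuming scikit-learn; alpha is arbitrary here, but with a large enough value some coefficients become exactly zero (the automatic feature selection mentioned above):

# Lasso zeroing out irrelevant coefficients (illustrative data only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter in this made-up data.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", lasso.coef_)                 # expect (near-)zero entries for features 2-4
print("non-zero features:", int(np.sum(lasso.coef_ != 0)))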
Elastic Net :

Regularization is used to prevent overfitting the model to the training data. This is achieved by slightly perturbing (adding noise to) the objective function of the model before optimizing it (optimizing a model means finding the model parameters w* such that the argmin/argmax of the objective function is reached – in other words, finding the global optimum of the objective function). In L1 regularization, a noise term of magnitude lambda * |w*| is added, while in L2 regularization a noise term of magnitude lambda * |w*|² is added, where |w*| is the magnitude of the optimal parameter vector.

In Elastic Net regularization, a linear sum of both noise terms is added. Hence, the objective function would then be:

least squares term + lambda1 * |w*| + lambda2 * |w*|²

Note that L1 and L2 regularizations are special cases of Elastic Net regularization (set lambda2 = 0 or lambda1 = 0 respectively).
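A rough sketch, assuming scikit-learn: l1_ratio mixes the two penalties (l1_ratio=1 is pure lasso, l1_ratio=0 is pure ridge), which is the "linear sum of both" idea above; the data and parameter values are made up:

# Elastic Net mixing L1 and L2 penalties (illustrative data only).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("coefficients:", enet.coef_)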

Decision Tree Regression:


CART: Classification and Regression Trees. Here the focus is on regression trees.

DT for regression: it breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing a value of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Step 1: Standard Deviation
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
a) For the target variable, we take the standard deviation without considering any features (SD of the target variable).
b) Then we take the standard deviation for each (target, predictor) pair – you can think of it as doing a group-by on the predictor and then calculating the standard deviation of the target within each group.
Step 2: Standard Deviation Reduction
The standard deviation reduction (SDR) is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.
Step 4: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed.
In practice, we need some termination criteria, for example when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch – you can compare this with the DT classification problem and when to stop there: we normally stop when there is no further gain.
Repeat step 4 until you cannot split further or it is not worth splitting further.
When we reach a leaf node, we take the average of the target values as the final value for that leaf node.
Remember this is regression, so we need to calculate errors like MSE, MAE, etc. to figure out how well the model is performing.
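A rough sketch of the group-by idea with a tiny made-up table (pandas assumed; the column names are just an example):

# Standard deviation reduction (SDR) for one candidate attribute (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy", "Overcast"],
    "HoursPlayed": [25, 30, 46, 45, 52, 43],
})

# Step 1: standard deviation of the target on its own.
sd_total = df["HoursPlayed"].std(ddof=0)

# Step 2: weighted standard deviation after grouping by the candidate attribute.
grouped = df.groupby("Outlook")["HoursPlayed"]
weights = grouped.size() / len(df)
sd_split = (weights * grouped.std(ddof=0)).sum()

# SDR = SD(target) - SD(target | attribute); pick the attribute with the largest SDR.
print("SDR for Outlook:", sd_total - sd_split)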
(http://www.saedsayad.com/decision_tree_reg.htm) – refer to this link and the videos below for a better understanding:
https://www.youtube.com/watch?v=nWuUahhK3Oc
https://www.youtube.com/watch?v=IQe2Icb1WKE

Important note: Standard deviation is the square root of the variance, so if someone talks about variance they are talking about the same thing.
sklearn uses variance in its implementation:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx

cdef class RegressionCriterion(Criterion):
    r"""Abstract regression criterion.

    This handles cases where the target is a continuous value, and is
    evaluated by computing the variance of the target values left and right
    of the split point. The computation takes linear time with `n_samples`
    by using ::

        var = \sum_i^n (y_i - y_bar) ** 2
            = (\sum_i^n y_i ** 2) - n_samples * y_bar ** 2
    """

Random Forest Regression: Bagging, or bootstrap aggregation, is a technique for reducing the variance of an estimated prediction function. Bagging seems to work especially well for high-variance, low-bias procedures, such as trees. For regression, we simply fit the same regression tree many times to bootstrap-sampled versions of the training data and average the result.
1. For b = 1 to B:
(a) Draw a bootstrap sample Z* of size N from the training data.
(b) Grow a random-forest tree Tb on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size nmin is reached:
i. Select m variables at random from the p variables.
ii. Pick the best variable/split-point among the m.
iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {Tb}, b = 1, ..., B.
To make a prediction at a new point x (regression): f_rf(x) = (1/B) * Σ Tb(x), summing over b = 1, ..., B.
Since trees are notoriously noisy, they benefit greatly from the averaging. Moreover, since each tree generated in bagging is identically distributed (i.d.), the expectation of an average of B such trees is the same as the expectation of any one of them. This means the bias of bagged trees is the same as that of the individual (bootstrap) trees, and the only hope of improvement is through variance reduction. This contrasts with boosting, where the trees are grown in an adaptive way to remove bias.

Note: Explained variance and the other model performance metrics will be covered in the notebook and a separate prep guide.
Note: Here we are not using a majority vote; instead we take the average.
Note: Random Forest is a bagging technique, so it only reduces variance.
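A rough sketch, assuming scikit-learn; n_estimators plays the role of B and max_features the role of m from the algorithm above, and the data is made up:

# Bagged regression trees; the prediction is the average of the trees, not a vote.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(
    n_estimators=100,   # B bootstrap trees
    max_features=1,     # m variables tried at each split (m < p)
    random_state=0,
).fit(X, y)
print(forest.predict([[1.0, 2.0]]))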

SVM Regression:
Popular question if you are attending companies like Amazon, MSFT, GOOG etc 😊
Intuitively, like all regressors, it tries to fit a line to the data by minimizing a cost function. However, the interesting part about SVR is that you can deploy a non-linear kernel. In that case you end up doing non-linear regression, i.e. fitting a curve rather than a line.

This process is based on the kernel trick and on representing the solution/model in the dual rather than in the primal. That is, the model is represented as a combination of the training points rather than as a function of the features and some weights. At the same time the basic algorithm remains the same: the only real change in the process of going non-linear is the kernel function, which changes from a simple inner product to some non-linear function.

Support Vector Regression (SVR) uses the same principles as SVM for classification, with only a few minor differences. First of all, because the output is a real number, it becomes very difficult to predict the information at hand, which has infinite possibilities. In the case of regression, a margin of tolerance (epsilon) is set as an approximation to the margin the SVM would otherwise have requested from the problem. Besides this, there is also a more complicated reason: the algorithm itself is more involved and has to be taken into consideration. However, the main idea is always the same: minimize the error while individualizing the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
Linear SVR

Non-linear SVR

The kernel functions transform the data into a higher dimensional feature space to make it possible to perform
linear separation.
Kernel functions

Note: The SVM regressor can be both linear and non-linear – when the interviewer asks about Linear Regression, be careful.

SVM is a type of linear classifier when you use a linear kernel, that is, as long as you are not playing with non-linear kernels (as long as you are using liblinear).
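A rough sketch, assuming scikit-learn; epsilon is the tolerance margin discussed above, and C, the kernel choice and the data are purely illustrative:

# Linear vs. kernelised support vector regression (illustrative data only).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(60, 1)), axis=0)
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=60)

linear_svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)   # fits a line
rbf_svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)         # fits a curve

print("linear SVR:", linear_svr.predict([[2.5]]))
print("RBF SVR:   ", rbf_svr.predict([[2.5]]))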

Naïve Bayes Regression :

First, laugh in the interviewer's face if they ask this question 😊

https://link.springer.com/content/pdf/10.1023%2FA%3A1007670802811.pdf
Even if we force naive Bayes and tweak it a little bit for regression, the result is disappointing; a team experimented with this and achieved not-so-good results.

Wikipedia also notes that naive Bayes is closely related to logistic regression.

Relation to logistic regression: a naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood p(C, x), while logistic regression fits the same probability model to optimize the conditional p(C | x).

So now you have two choices: tweak the naive Bayes formula or use logistic regression.

Let's use logistic regression instead of reinventing the wheel.

 https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Relation_to_logistic_regression

So you will end up classifying, not predicting, with Naïve Bayes.

Don't get confused with Bayesian Ridge Regression. Please update the sheet if someone asks about it in an interview.

Classification Techniques – I will cover logistic regression there, but interviewers tend to ask about Logistic Regression as a regression technique to confuse you and then talk about linear models, so use caution while answering.

The model evaluation prep guide will cover all the error metrics and the notebook will compare them.

Happy Regression
