87_Suryaprakash Dubey ass-13 - Jupyter Notebook
1. What is Linear regression?
1. Linear Regression is a supervised machine learning algorithm.
2. It helps us solve regression problems.
3. It is a predictive model used for finding a linear relationship between one
dependent variable and one or more independent variables.
2. How do you represent a simple linear regression?
y = mx + c
m >> Slope
c >> Intercept
Dependent Variable >> Continuous Data
Independent Variable >> Continuous / Discrete
3. What is multiple linear regression?
In multiple regression there is one dependent/target variable but more than one
independent variable. The equation of multiple regression is:
y = m1x1 + m2x2 + m3x3 + ... + mnxn + c
Dependent Variable >> Continuous Data
Independent Variable >> Continuous / Discrete
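A minimal sketch of fitting the multiple regression equation above with scikit-learn; the data is synthetic and the coefficient values (3, 2, 5) are made up for illustration:

```python
# Sketch: fitting y = m1*x1 + m2*x2 + c with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 2))                  # two independent variables
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0  # exact linear relationship

model = LinearRegression().fit(X, y)
print(model.coef_)       # slopes m1, m2 -> approximately [3.0, 2.0]
print(model.intercept_)  # intercept c -> approximately 5.0
```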
4. What are the assumptions made in the Linear regression model?
Linearity
Independence
No Multicollinearity
Normality
Homoscedasticity
5. What if these assumptions get violated?
1. If linearity is violated, a transformation like log, square root, or cube root
is applied.
2. If independence or no-multicollinearity is violated, the correlated attributes
are combined together, or only one attribute is kept and the other similar
attributes are dropped.
6. What is the assumption of homoscedasticity?
Homoscedasticity means that the residuals (errors) have a constant variance
around the regression line.
7. What is the assumption of normality?
Normality >> Normality is the normal distribution of the errors or the data
points.
1. We always try to keep our data points near to the mean.
2. Or else, we try to gather the data points within the first, second, and third
standard deviations.
8. How to prevent heteroscedasticity?
- Possible reasons for heteroscedasticity arising:
- It often occurs in data sets which have a large range between the largest
and the smallest observed values, i.e. there are outliers.
- When the model is not correctly specified.
- If observations are mixed with different measures of scale.
- When an incorrect transformation of data is used to perform the regression.
- Skewness in the distribution of a regressor, and possibly some other sources.
1. We can use a different specification for the model.
2. Weighted Least Squares is one of the common statistical methods. It
is the generalization of ordinary least squares and linear regression in which
the errors' covariance matrix is allowed to be different from an identity
matrix.
3. Use MINQUE: The theory of Minimum Norm Quadratic Unbiased Estimation
(MINQUE) involves three stages. First, defining a general class of potential
estimators as quadratic functions of the observed data, where the estimators
relate to a vector of model parameters. Second, specifying certain
constraints on the desired properties of the estimators, such as unbiasedness;
and third, choosing the optimal estimator by minimizing a "norm" which
measures the size of the covariance matrix of the estimators.
9. What does multicollinearity mean?
Input variables are highly correlated with each other. If they are kept as they
are, they increase the dimensionality of the model and make it complex.
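One simple way to spot multicollinearity is to inspect the pairwise correlation matrix of the inputs; the sketch below uses synthetic data where one feature is nearly a copy of another:

```python
# Sketch: detecting multicollinearity via pairwise correlations of the inputs.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)                          # independent of x1

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)
print(corr.round(2))
# |corr[0, 1]| is close to 1 -> x1 and x2 are highly correlated,
# so one of them could be dropped before fitting.
```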
10. What are feature selection and feature scaling?
In machine learning and statistics, feature selection, also known as variable
selection, attribute selection, or variable subset selection, is the process of
selecting a subset of relevant features (variables, predictors) for use in
model construction. It is used for:
- simplification of models to make them easier to interpret by
researchers/users;
- shorter training times;
- avoiding the curse of dimensionality;
- improving the data's compatibility with a learning model class;
- encoding inherent symmetries present in the input space.
Feature scaling algorithms will scale attributes into a fixed range, say [-1, 1]
or [0, 1], so that no feature can dominate others.
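The scaling step above can be sketched with scikit-learn's MinMaxScaler on a small made-up matrix:

```python
# Sketch: scaling features to [0, 1] with MinMaxScaler so no single
# feature dominates by magnitude alone. Data is illustrative.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(scaled)
# Each column now spans [0, 1]:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```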
11. How to find the best fit line in a linear regression model?
1. The line which passes closest to the maximum number of data points is
called the Best Fit Line.
2. It is the line with the lowest mean squared error (MSE) or sum of
squared errors (SSE).
3. The Gradient Descent algorithm helps to find the best fit line.
4. The G.D. algorithm finds a single line (the Best Fit Line) from an infinite
number of possible regression lines.
12. Why do we square the error instead of using modulus?
1. The absolute error is often closer to what we want when making predictions
from our model. But we may want to penalize those predictions that are
contributing the maximum value of error.
2. The squared function is differentiable everywhere, while the absolute error
is not differentiable at all points in its domain (its derivative is
undefined at 0). This makes the squared error more amenable to the
techniques of mathematical optimization. To optimize the squared error, we can
compute the derivative, set its expression equal to 0, and solve. But to
optimize the absolute error, we require more complex techniques involving more
computations.
13. What are the techniques adopted to find the slope and the intercept of
the linear regression line which best fits the model?
1. The Gradient Descent algorithm works on partial derivatives (PD).
2. It helps to reduce the Cost Function or Loss Function.
3. It helps to get the best M and C values.
4. We follow baby steps while working with the GD algorithm.
5. The baby steps totally depend on the learning rate that we keep in the
model.
6. A default or commonly good learning rate value is 0.001 (L = 0.001).
7. If we change the learning rate to L = 1, then we might overshoot the
Global Minimum.
14. What is cost Function in Linear Regression?
It is the metric to measure the accuracy of the best fit line.
MSE = Mean Squared Error
MSE = sum((Ya - Yp)**2)/N
N = Number of Samples
MSE = Cost Function = Loss Function
15. Briefly explain gradient descent algorithm
1. The Gradient Descent algorithm works on partial derivatives (PD).
2. It helps to reduce the Cost Function or Loss Function.
3. It helps to get the best M and C values.
4. We follow baby steps while working with the GD algorithm.
5. The baby steps totally depend on the learning rate that we keep in the
model.
6. A default or commonly good learning rate value is 0.001 (L = 0.001).
7. If we change the learning rate to L = 1, then we might overshoot the
Global Minimum.
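The steps above can be sketched in a few lines; the learning rate and iteration count here are illustrative choices, and the data is synthetic:

```python
# Sketch of gradient descent for simple linear regression (y = m*x + c),
# using the partial derivatives of the MSE with respect to m and c.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0          # true slope m = 2, intercept c = 1
m, c, lr = 0.0, 0.0, 0.01  # start from zero with a small learning rate

for _ in range(10000):
    y_pred = m * x + c
    error = y - y_pred
    dm = -2.0 * np.mean(error * x)  # dMSE/dm
    dc = -2.0 * np.mean(error)      # dMSE/dc
    m -= lr * dm                    # baby step for the slope
    c -= lr * dc                    # baby step for the intercept

print(round(m, 3), round(c, 3))  # -> close to 2.0 and 1.0
```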
16. How to evaluate regression models?
1. Mean Absolute Error (scale variant):
MAE = sum(|Ya - Yp|)/N
2. Mean Squared Error (scale variant):
MSE = sum((Ya - Yp)**2)/N
3. R2Score = Coefficient of Determination (scale invariant)
R2Score = 1 - SSE/SST
        = (SST - SSE)/SST
0 is the worst score in terms of the Coefficient of Determination
1 is the best score in terms of the Coefficient of Determination
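These three metrics can be computed directly with scikit-learn; the actual/predicted values below are made up for illustration:

```python
# Sketch: evaluating a regression fit with MAE, MSE, and R2.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

mae = mean_absolute_error(y_actual, y_pred)  # sum(|Ya - Yp|)/N = 0.25
mse = mean_squared_error(y_actual, y_pred)   # sum((Ya - Yp)**2)/N = 0.125
r2 = r2_score(y_actual, y_pred)              # 1 - SSE/SST = 0.975
print(mae, mse, r2)
```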
17. Which evaluation technique should you prefer to use for data
having a lot of outliers in it?
Mean Absolute Error (MAE) is preferable for data having too many
outliers in it, because MAE is robust to outliers, whereas MSE and RMSE are very
susceptible to outliers and start penalizing the outliers by squaring the
residuals.
18. What is residual? How is it computed?
Residual is equal to the difference between the observed value and the
predicted value. For data points above the line, the residual is positive, and
for data points below the line, the residual is negative.
Error or residual = (Ya - Yp)
19. What are SSE, SSR, and SST? And what is the relationship between
them?
SSE = sum((Ya - Yp)**2)
SSR = sum((Yp - Ym)**2)
SST = sum((Ya - Ym)**2)
SST = SSE + SSR
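The identity SST = SSE + SSR can be checked numerically; it holds when Yp comes from an ordinary least-squares fit with an intercept (the data below is synthetic):

```python
# Sketch: verifying SST = SSE + SSR for a least-squares fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, 1)   # ordinary least-squares line
yp = m * x + c
ym = y.mean()

sse = np.sum((y - yp) ** 2)   # error around the fitted line
ssr = np.sum((yp - ym) ** 2)  # spread explained by the regression
sst = np.sum((y - ym) ** 2)   # total spread around the mean

print(np.isclose(sst, sse + ssr))  # -> True
```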
20, What's the intuition behind R-Squared?
We use linear regression to predict y given some value of x. But suppose that
we had to predict a y value without a corresponding x value.
Without using regression on the x variable, our most reasonable estimate would
be to simply predict the average of the y values.
However, this line will not fit the data very well. One way to measure the fit
of the line is to calculate the sum of the squared residuals - this gives us
an overall sense of how much prediction error a given model has.
Now, if we predict the same data with regression, we will see that the least-
squares regression line fits the data much better.
We will find that using least-squares regression, the sum of the squared
residuals has been considerably reduced.
So using least-squares regression eliminated a considerable amount of
prediction error. R-squared tells us what percent of the prediction error in
the y variable is eliminated when we use least-squares regression on the x
variable.
As a result, R² is also called the coefficient of determination. Many formal
definitions say that R² tells us what percent of the variability in the y
variable is accounted for by the regression on the x variable. The value of R²
varies from 0 to 1.
21. What does the coefficient of determination explain?
The coefficient of determination tells us what percent of the variability in
the y variable is accounted for by the regression on the x variable.
22. Can R² be negative?
Yes, R² can be negative. The formula of R² is given by:
R2 = 1 - SSE/SST
If the sum of squared errors of the model (SSE) is greater than the total sum
of squares around the mean (SST), R squared will be negative.
23. What are the flaws in R-squared?
There are two major flaws:
Problem 1: R² increases with every predictor added to a model. As R² always
increases and never decreases, the model can appear to be a better fit as more
terms are added. This can be completely misleading.
Problem 2: Similarly, if our model has too many terms and too many high-order
polynomials, we can run into the problem of over-fitting the data. When we
over-fit data, a misleadingly high R² value can lead to misleading
predictions.
24. What is adjusted R²?
Adjusted R-squared is used to determine how reliable the correlation is
between the independent variables and the dependent variable.
On addition of highly correlated variables the adjusted R-squared will
increase, and for variables with no correlation with the dependent variable the
adjusted R-squared will decrease.
Adjusted R² will always be less than or equal to R².
25. What is the Coefficient of Correlation: Definition, Formula
The correlation coefficient is a statistical concept which helps in
establishing a relation between predicted and actual values obtained in a
statistical experiment. The calculated value of the correlation coefficient
explains the closeness between the predicted and actual values.
R(x, y) = cov(x, y) / (STDx * STDy)
R(x, y) = sum((Xi - Xm)(Yi - Ym)) / (sum((Xi - Xm)**2) * sum((Yi - Ym)**2))**0.5
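The formula above can be computed directly and checked against numpy's built-in `corrcoef` (the data points are made up):

```python
# Sketch: correlation coefficient from its definition vs np.corrcoef.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

xm, ym = x.mean(), y.mean()
r = np.sum((x - xm) * (y - ym)) / np.sqrt(
    np.sum((x - xm) ** 2) * np.sum((y - ym) ** 2)
)

print(round(r, 4))
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # -> True
```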
26. What is the difference between Correlation and covariance?
Covariance is a statistical term that refers to a systematic relationship
between two random variables, in which a change in one variable is reflected by
a change in the other.
The covariance value can range from -∞ to +∞, with a negative value indicating
a negative relationship and a positive value indicating a positive
relationship.
Correlation is limited to values between the range -1 and +1.
A change in scale affects covariance but does not affect the correlation.
27. What is the relationship between R-Squared and Adjusted R-
Squared?
R2Score = 1 - SSE/SST
Adjusted R-squared is used to determine how reliable the correlation is
between the independent variables and the dependent variable.
On addition of highly correlated variables the adjusted R-squared will
increase, and for variables with no correlation with the dependent variable the
adjusted R-squared will decrease.
Adjusted R² will always be less than or equal to R².
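The relationship can be made concrete with the standard formula Adj R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors; the values below are illustrative:

```python
# Sketch: adjusted R-squared from R-squared, n samples, and p predictors.
def adjusted_r2(r2: float, n: int, p: int) -> float:
    # Standard adjustment: penalizes adding predictors that don't help.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

r2 = 0.90
print(adjusted_r2(r2, n=100, p=2))   # barely below R2 for few predictors
print(adjusted_r2(r2, n=100, p=50))  # penalized heavily for many predictors
```

Note how the adjusted value never exceeds R² and falls as p grows with no gain in fit.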
28. What is the difference between overfitting and underfitting?
In overfitting, a statistical model describes random error or noise instead of
the underlying relationship. Overfitting occurs when a model is excessively
complex, such as having too many parameters relative to the number of
observations. A model that has been overfit has poor predictive performance,
as it overreacts to minor fluctuations in the training data
Underfitting occurs when a statistical model or machine learning algorithm
cannot capture the underlying trend of the data. Underfitting would occur, for
example, when fitting a linear model to non-linear data. Such a model too
would have poor predictive performance.
29. How to identify if the model is overfitting or underfitting?
Overfitting is when the model fits the training dataset
perfectly. While this may sound like a good fit, it is the opposite: in
overfitting, the model performs far worse on unseen data.
A model can be considered an overfit when it fits the training dataset
perfectly but does poorly on new test datasets.
On the other hand, underfitting takes place when a model has been trained for
an insufficient period of time to determine meaningful patterns in the
training data.
30. How to interpret a Q-Q plot in a Linear regression model?
Points on the normal Q-Q plot provide an indication of normality of the
dataset. If the data is normally distributed, the points will fall on the 45-
degree reference line. If the data is not normally distributed, the points
will deviate from the reference line.
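The numbers behind a Q-Q plot can be inspected with `scipy.stats.probplot`; for normally distributed residuals the points hug the reference line, so the fit's correlation is close to 1 (the residuals here are simulated):

```python
# Sketch: Q-Q plot data for simulated normal residuals via probplot.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
residuals = rng.normal(loc=0.0, scale=1.0, size=500)

# osm: theoretical quantiles, osr: ordered residuals,
# r: correlation of points with the straight reference line.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 4))  # close to 1.0 for normal residuals
```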
31. What are the advantages and disadvantages of Linear
Regression?
Advantages
1. The Linear Regression model performs well on linearly separable data.
2. The Linear Regression model is easy to implement and easy to interpret.
3. When the LR model gets overfitted, we can reduce the overfitting by
using L1 and L2 Regularization.
Disadvantages
We have assumptions on the data in Linear Regression:
Linearity
Independence
Linear Regression is sensitive to outliers.
32. What is the use of regularisation? Explain L1 and L2
regularisations.
L1 and L2 regularization are the best ways to manage overfitting and perform
feature selection when there is a large set of features.
L1 Regularization, also called lasso regression, adds the "absolute value of
magnitude" of the coefficient as a penalty term to the loss function.
L2 Regularization, also called ridge regression, adds the "squared
magnitude" of the coefficient as the penalty term to the loss function.
When the predictor variables are highly correlated, multicollinearity can
become a problem. This can cause the coefficient estimates of the model to be
unreliable and have high variance. That is, when the model is applied to a new
set of data it hasn't seen before, it's likely to perform poorly.
One way to get around this issue is to use a method known as lasso regression,
which instead seeks to minimize
RSS + λ * Σ|βj|
where j ranges from 1 to p and λ ≥ 0.
This second term in the equation is known as a shrinkage penalty.
When λ = 0, this penalty term has no effect and lasso regression produces the
same coefficient estimates as least squares.
However, as λ approaches infinity the shrinkage penalty becomes more
influential, and the predictor variables that aren't important in the model
get shrunk towards zero, with some even dropped from the model.
The advantage of lasso regression compared to least squares regression lies in
the bias-variance tradeoff
MSE = Variance + Bias**2 + Irreducible error
The basic idea of lasso regression is to introduce a little bias so that the
variance can be substantially reduced, which leads to a lower overall MSE.
[Figure: Test MSE vs λ — least squares coefficient estimates (λ = 0) compared
with lasso coefficient estimates (λ = some value > 0), with the test MSE
decomposed into variance and squared bias.]
When we use ridge regression, the coefficients of each predictor are shrunk
towards zero, but none of them can go completely to zero. In cases where only a
small number of predictor variables are significant, lasso regression tends to
perform better because it's able to shrink insignificant variables completely
to zero and remove them from the model.
However, when many predictor variables are significant in the model and their
coefficients are roughly equal, then ridge regression tends to perform better
because it keeps all of the predictors in the model.
Whichever model produces the lowest test mean squared error (MSE) is the
preferred model to use.
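The lasso-vs-ridge contrast above can be sketched with scikit-learn on synthetic data where only the first of five features matters; the alpha value is an illustrative choice:

```python
# Sketch: lasso zeroes out unhelpful coefficients; ridge only shrinks them.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only x0 matters

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(lasso.coef_.round(2))  # irrelevant coefficients driven exactly to 0
print(ridge.coef_.round(2))  # all coefficients kept, just shrunk
```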