
 3.1 REGRESSION : INTRODUCTION

UQ.When is it suitable to use linear regression over classification?


(SPPU - Q. 3(a), Nov./Dec. 16, 5 Marks)

Linear regression is the most general form of predictive analysis and is among the most broadly used statistical techniques. It measures the relationship between one or more predictor variables and one outcome variable. Regression analysis examines the relationship between a dependent variable and one or more independent variables.

 3.1.1 Linear Models

UQ.What do you mean by a linear regression? Which applications are best modeled by
linear regression?
(SPPU - Q. 5(a), March 19, Q. 5(b), March 19,
Q. 1(b), Nov./Dec. 17, 4 Marks)

 A linear regression model illustrates the relationship between two variables or factors. Regression analysis is generally used to show the correlation between two variables.

 The variable being predicted in the linear regression equation is known as the dependent variable; let us call it Y. The variables that predict the dependent variable are known as independent variables; let us call them X.

 Y is called the dependent variable because the prediction (Y) depends on the other variables (X).

 In simple linear regression analysis, each observation has two variables: the independent variable and the dependent variable. Multiple regression analysis involves two or more independent variables and describes how they relate to the dependent variable. The equation that defines how Y is related to X is called the regression model.

 Here’s a simple linear regression formula:

y = β0 + β1 x1 + ε

Fig. 3.1.1 : Linear Decision Boundary


 In the equation shown above, y is the dependent variable, the quantity to be described, and x1 is the independent variable, that is, the variable associated with the change in the predicted values.

 The coefficient β1 indicates that a one-unit change in the independent variable is not necessarily matched by an equal change in y.

 Now let us look at how the line is found: we want to put a line through our data that best fits it. A regression line can show a positive linear relationship (the line slopes up), a negative linear relationship (the line slopes down), or a total absence of relationship (a flat line).

Fig. 3.1.2 : Positive Relationship

Fig. 3.1.3 : Negative Relationship


Fig. 3.1.4 : Absent Relationship

 The point where the line crosses the vertical axis is called the constant.

 For instance, if 0 years of experience (X axis) is mapped against salary (Y axis) and the salary there is $30,000, then the constant in the graph below will be about $30,000.

 The steeper the slope, the more the salary for years of experience.

 For example, for 1 more year of experience the salary (y) might be expected to increase by $10,000, but with a steeper slope it may increase by, say, $15,000.

 When we look at a graph, vertical lines can be drawn from the line to our actual
observations. The actual observations can be clearly seen as the dots, while the line
displays the model observations (the predictions).

Fig. 3.1.5
Fig. 3.1.6

 The vertical lines show the difference between what an employee actually earns and what the model predicts they earn. To find the best line, the sum of all the squared differences is computed and the line that minimizes this sum is chosen. This is known as the Ordinary Least Squares (OLS) method.

 Regression is one of the parametric techniques that make assumptions. Let's have a
glance at the assumptions it makes :
1. A linear and additive relationship is present between dependent variable (DV) and
independent variable (IV). Linear relationship means, that the change in DV by 1 unit
change in IV is constant. By additive it means, the effect of X on Y is independent of
other variables.
2. No correlation should be present between the independent variables. The presence of correlation among independent variables causes multicollinearity; that is, it becomes difficult for the model to determine the actual effect of each IV on the DV.
3. The error terms should have constant variance. In its absence, heteroskedasticity arises.
4. The error at time t (εt) should not determine the error at time t+1 (εt+1), i.e. the error terms must be uncorrelated.
5. Correlation in error terms is called autocorrelation. Its presence strongly affects the regression coefficients and standard error values, as these are based on the assumption of uncorrelated error terms.
6. The error terms (and hence the dependent variable) should be normally distributed.
 The presence of these assumptions makes regression relatively restrictive. By restrictive it means that the performance of a regression model depends on the fulfilment of these assumptions.

 Following are the types of models in Linear Regression

1. Univariate Linear Regression


2. Multiple Linear Regression

 3.1.2 Univariate Linear Regression : Model Representation

 Simple linear regression uses a single input variable, and statistical methods can be used to estimate the coefficients.

 Statistical properties of the data, such as means, standard deviations, correlations and covariance, need to be calculated. All of the data should be available to traverse and calculate these statistics. The hypothesis function is given by

y = θ1 + θ2 · x

x : input training data (univariate – one input variable(parameter))


y : labels to data (supervised learning)
While training the model, to predict the value of y it is required to fit the best line for a given value of x. The best-fit regression line is obtained by the model finding the best θ1 and θ2 values.
θ1 : intercept ; θ2 : coefficient of x

 In simple linear regression, the topic of this section, the predictions of Y when plotted as
a function of X form a straight line.

 The example data in Table 3.1.1 is plotted in Fig. 3.1.7. There is a positive relationship between X and Y: if Y is predicted from X, the higher the value of X, the higher the prediction of Y will be.

Table 3.1.1 : X with Y sample data

X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25

Fig. 3.1.7 : A scatter plot of the sample data

 Linear regression finds the best-fit straight line through the points. That best-fitting line is known as the regression line. In Fig. 3.1.8, the black diagonal line is the regression line; it consists of the predicted score on Y for each possible value of X. The errors of prediction are represented by the vertical lines from each point to the regression line. As shown in Fig. 3.1.8, the red point is near the regression line, so it has a small error of prediction, whereas the yellow point is much higher above the line, so it has a larger error of prediction.

 The black line comprises of the predictions, the points depict the actual data, and the
vertical lines between the points and the black line denote prediction errors.

Fig. 3.1.8 : Diagonal Regression line


 The error of prediction for a point is the value of the point minus the predicted value (the value on the line).
 Table 3.1.2 shows the predicted values (Y') and the errors of prediction (Y – Y'). For example, the first point has a Y of 1.00 and a predicted Y (called Y') of 1.21, so its error of prediction is – 0.21.

Table 3.1.2 : Example data

X Y Y' Y – Y (Y – Y)2

1.00 1.00 1.210 – 0.044


0.210

2.00 2.00 1.635 0.365 0.133

3.00 1.30 2.060 – 0.578


0.760

4.00 3.75 2.485 1.265 1.600

5.00 2.25 2.910 – 0.436


0.660

 We have not yet defined the term "best-fitting line." The most widely used criterion is the line that has the minimum sum of the squared errors of prediction. This is the criterion that was used to find the line in Fig. 3.1.8.

 The squared errors of prediction are given in the last column of Table 3.1.2. Compared with any other regression line, the sum of the squared errors of prediction for the line in Fig. 3.1.8 is the lowest.

The formula for a regression line is

Y = bX + A

where Y' is the predicted value, b is the slope of the line, and A is the Y intercept.
The equation for the line in Fig. 3.1.8 is

Y = 0.425X + 0.785

For X = 1,

Y = (0.425)(1) + 0.785 = 1.21.

For X = 2,

Y = (0.425)(2) + 0.785 = 1.64.


 3.1.3 Multiple Linear Regression

 Multiple linear regression models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data.

 Each value of the independent variables x is associated with a value of the dependent variable y. The population regression line for p explanatory variables x1, x2, ..., xp is defined to be

μy = β0 + β1x1 + β2x2 + ... + βpxp

 This line describes how the mean response μy changes with the explanatory variables. The observed values of y vary about their mean μy and are assumed to have the same standard deviation σ. The parameters β0, β1, ..., βp are estimated by the fitted values b0, b1, ..., bp of the sample regression line.

 The multiple regression model includes a term for this variation, since the observed values of y vary about their mean μy.

 That means, the model is expressed as

DATA = FIT + RESIDUAL, where the "FIT" term represents the expression β0 + β1x1 + β2x2 + ... + βpxp.

 The "RESIDUAL" term signifies the deviations of the observed values y from their means μy, which are normally distributed with mean 0 and variance σ². The notation for the model deviations is ε.

 Formally, the model for multiple linear regression, given n observations, is

yi = β0 + β1 xi1 + β2 xi2 + ... + βp xip + εi   for i = 1, 2, ..., n

 So far we have studied simple linear regression, where a single predictor variable X is used to model the response variable Y. In many applications, however, more than one factor influences the response.

 Multiple regression models define how a single response variable Y is dependent linearly
on a number of predictor variables.

 Examples

 The selling price of a house can depend on various factors like the popularity of the
location, the number of bedrooms, the number of bathrooms, the year the house was
built, the square footage of the plot etc.

 A child's height can depend on the height of the parents, the nutrition they receive, and other environmental factors.
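As an illustration of the idea, here is a minimal sketch that fits a multiple linear regression with NumPy's least-squares solver; the house-price numbers are hypothetical, made up purely for demonstration.

import numpy as np

# Hypothetical toy data: [bedrooms, bathrooms, plot square footage] -> price.
X = np.array([[3, 2, 1500],
              [4, 3, 2200],
              [2, 1,  900],
              [5, 3, 2800],
              [3, 2, 1700]], dtype=float)
y = np.array([300_000, 420_000, 180_000, 520_000, 330_000], dtype=float)

# Add a column of ones so b0 (the intercept) is estimated as well.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary least squares: find [b0, b1, b2, b3] minimizing ||X1 @ b - y||^2.
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print("b0..b3:", coef)

# Predict the price of a new (hypothetical) house with 4 bedrooms, 2 baths, 2000 sq ft.
new_house = np.array([1, 4, 2, 2000], dtype=float)
print("predicted price:", new_house @ coef)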

 3.2 LEAST-SQUARE METHOD MODEL REPRESENTATION

 The “least squares” method is a type of mathematical regression analysis that determines the best-fit line for a collection of data, displaying the relationship between the points visually.

 Each data point represents the relationship between a known independent variable and an unknown dependent variable.

 3.3 UNIVARIATE REGRESSION : LEAST SQUARE METHOD

UQ.What do you mean by least square method? Explain least square method in the context
of linear regression. (SPPU - Q. 2(b), Dec. 19, 5 Marks,
Q. 1(a), May/June 2016, 5 Marks)
UQ.What do you mean by coefficient of regression? Explain SST, SSE, SSR, MSE in the context of regression.
(SPPU - Q. 4(b), Dec. 19, 5 Marks)

 Univariate regression, also called simple linear regression, is regression in which a single independent variable X has a linear relationship with a single dependent variable Y.

 Regression analysis is used to identify the linear relationship between a single dependent variable and an independent variable.

 The equation of linear regression is given by

Y = β0 + β1 X + ε

where,
Y = value of the dependent variable
X = independent variable ;

β0 and β1 are constants

ε = random error

This equation of univariate regression is similar to

Y = b + mx
where,

b = Y-axis intercept = β0

m = slope of the line = β1

Fig. 3.3.1

With a univariate linear regression model we get a line in the X–Y plane, as shown in Fig. 3.3.2.

Fig. 3.3.2 : Simple linear regression

The actual value of Y is y = β0 + β1 X + ε

and the predicted value is ŷ = β0 + β1 X

 For an increase in the value of X by 1 unit, the value of Y is expected to increase by β1 units.

 Even if X = 0, i.e. the value of the independent variable is zero, the expected value of Y is β0.

 Features that the best-fit regression line should satisfy:

1. The regression line results in the minimum sum of squared errors.

2. It must pass through the centroid (X̄, Ȳ) of the sample data, where X̄ = (1/n) Σ xi and Ȳ = (1/n) Σ yi.

 It does not need to go through all or most of the sample data points.

 It does not need to have the same number of sample points above and below it.

 The values of β0 and β1 are given by

β0 = Ȳ – β1 X̄

β1 = Σ(xi – X̄)(yi – Ȳ) / Σ(xi – X̄)²

 Least square method

 It is used to find the model parameters in linear regression.

 Consider the input feature vectors

(x1, y1), (x2, y2), ..., (xn, yn)

X = x1, x2, ..., xn = independent variable

Y = y1, y2, ..., yn = dependent variable

 The least square method is discussed here with respect to univariate linear regression,

Y = β0 + β1 X + ε

 Here, the target is to find the values of β0 and β1 that best fit the given sample data.

 The β0 and β1 values can be found by using the least square method.

 Linear regression predicts the value of yi for a given input feature xi as ŷi = β0 + β1 xi

 y1 is predicted as ŷ1 = β0 + β1 x1

 y2 is predicted as ŷ2 = β0 + β1 x2

 For each point, we can calculate the difference between the actual value y and the value predicted by the regression line:

for point x1, e1 = y1 – ŷ1

for point x2, e2 = y2 – ŷ2

The difference between the actual value and the predicted value is called the residual or error.
 The least square method is used to find the values of β0 and β1 that define the regression line for which the sum of squares of all errors (SSE) is minimum:
 SSE = Σi=1..n ei²

 The objective of the least square method is to find the values of β0 and β1 for which Σi=1..n ei² is minimum, i.e. the sum of squares of errors is minimum:
SSE = Σi=1..n ei² = Σi=1..n (yi – ŷi)²
SSE = Σi=1..n (yi – β0 – β1 xi)² ...(1)
 To get the minimum value of SSE, the partial derivatives of SSE with respect to β0 and β1 must be equal to 0:

∂SSE/∂β0 = 0 ; ∂SSE/∂β1 = 0

∂SSE/∂β0 = ∂/∂β0 Σi=1..n (yi – β0 – β1 xi)² = 0

 The partial derivative and the summation are interchangeable, so this

can be written as,
Σi=1..n ∂/∂β0 (yi – β0 – β1 xi)² = 0

Σi=1..n –2(yi – β0 – β1 xi) = 0
–2 Σi=1..n (yi – β0 – β1 xi) = 0
Σi=1..n (yi – β0 – β1 xi) = 0 ...(2)
Similarly, for β1, take the partial derivative to get the value of β1:

∂SSE/∂β1 = 0

∂/∂β1 Σi=1..n (yi – β0 – β1 xi)² = 0

Σi=1..n ∂/∂β1 (yi – β0 – β1 xi)² = 0

Σi=1..n –2 xi(yi – β0 – β1 xi) = 0

–2 Σi=1..n xi(yi – β0 – β1 xi) = 0

Σi=1..n xi(yi – β0 – β1 xi) = 0 ...(3)

Consider Equation (2):

Σ(yi – β0 – β1 xi) = 0
Σ yi – Σ β0 – β1 Σ xi = 0

Σ yi – n β0 – β1 Σ xi = 0

n β0 = Σ yi – β1 Σ xi

β0 = (1/n) Σ yi – β1 (1/n) Σ xi
But X̄ = (1/n) Σ xi ; Ȳ = (1/n) Σ yi
∴ β0 = Ȳ – β1 X̄ ...(4)

Now consider Equation (3):

Σ xi(yi – β0 – β1 xi) = 0

Substituting β0 = Ȳ – β1 X̄ from Equation (4),

Σ xi(yi – Ȳ + β1 X̄ – β1 xi) = 0

Σ xi(yi – Ȳ) – β1 Σ xi(xi – X̄) = 0

β1 Σ xi(xi – X̄) = Σ xi(yi – Ȳ)

β1 = Σ xi(yi – Ȳ) / Σ xi(xi – X̄)

 β1 = Σ(xi – X̄)(yi – Ȳ) / Σ(xi – X̄)² ...(5)

But cov(x, y) = (1/n) Σ(xi – X̄)(yi – Ȳ) = σxy

and var(x) = (1/n) Σ(xi – X̄)² = σxx

 β1 = σxy / σxx = cov(x, y) / var(x)

Y = + X

 For increase in value of X by 1 unit there is increase in value of Y is 1 units.

 Even if x = 0 i.e. value of independent variable is zero then also it is expected that value
of Y is 0.

–– = and ––=
 = ––– –– ; =

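The closed-form estimates derived above translate directly into code. The following is a minimal Python sketch (assuming plain Python lists as input) of β1 = cov(x, y)/var(x) and β0 = Ȳ – β1 X̄.

# Sketch of the closed-form least-squares estimates derived above:
# beta1 = cov(x, y) / var(x),  beta0 = y_bar - beta1 * x_bar.
def fit_univariate(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Example with the sample data of Table 3.1.1.
b0, b1 = fit_univariate([1, 2, 3, 4, 5], [1.00, 2.00, 1.30, 3.75, 2.25])
print(b0, b1)   # roughly 0.785 and 0.425, matching the line in Fig. 3.1.8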
 3.4 COST FUNCTIONS : MSE, MAE, R-SQUARED

UQ.What do you mean by coefficient of regression? Explain SST, SSE, SSR, MSE in the context of regression.
(SPPU - Q. 4(b), Dec. 19, 5 Marks)
UQ.What are the ingredients of machine learning?
(SPPU - Q. 4(b), May/June 2016, 5 Marks)
UQ.Enlist ingredients of ML. Explain each ingredient in two or three sentences.
(SPPU - Q. 2(b), Nov./Dec. 16, 5 Marks)
UQ.How the performance of a regression function is measured ? (SPPU - Q. 2(b),
Nov./Dec. 17, 4 Marks)
UQ.Define and explain Squared Error (SE) and Mean Squared Error (MSE) w.r.t.
Regression.
(SPPU - Q. 1(b), May/June 2018, 5 Marks)
UQ.How the performance of regression is assessed? Write various performance metrics
used for it.
(SPPU - Q. 4(b), May/June 2019, 4 Marks)
UQ.Suppose you have been given a set of training examples {(x1, y1), (x2, y2), ..., (xn, yn)}. Find the equation of the line that best fits the data, i.e. that minimizes the squared error.
(SPPU - Q. 6(b), Oct.19, 5 Marks)

 A cost function is a mechanism utilized in supervised machine learning: the cost function returns the error between the predicted outcomes and the actual outcomes.

 A cost function is a measure of how wrong the model is in terms of its ability to estimate
the relationship between X and y. This is typically expressed as a difference or distance
between the predicted value and the actual value.

 Regression cost Function

 Regression models deal with predicting a continuous value for example salary of an
employee, price of a car, loan prediction, etc.

 A cost function used in the regression problem is called “Regression Cost Function”.

 Mean Error (ME)

 In this cost function, the error for each training example is calculated and then the mean value of all these errors is derived.

 Calculating the mean of the errors is the simplest and most intuitive way possible.
 The errors can be both negative and positive. So they can cancel each other out during
summation giving zero mean error for the model.

 Thus this is not a recommended cost function but it does lay the foundation for other cost
functions of regression models.

 Mean Squared Error (MSE)

 This addresses the drawback we encountered in the Mean Error above. Here the square of the difference between the actual and predicted value is calculated, to avoid any possibility of negative error.

 It is measured as the average of the sum of squared differences between predictions and
actual observations. It is also known as L2 loss.

 In MSE, since each error is squared, it helps to penalize even small deviations in
prediction when compared to MAE. But if our dataset has outliers that contribute to
larger prediction errors, then squaring this error further will magnify the error many times
more and also lead to higher MSE error.

 Hence we can say that it is less robust to outliers

 Mean Absolute Error (MAE)

 This cost function also addresses the shortcoming of mean error differently. Here an
absolute difference between the actual and predicted value is calculated to avoid any
possibility of negative error.

 So in this cost function, MAE is measured as the average of the sum of absolute
differences between predictions and actual observations.

 It is also known as L1 Loss.

 It is robust to outliers thus it will give better results even when our dataset has noise or
outliers.

 R – Squared

 R-squared (R²) is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by an independent variable or variables in a regression model. It is also known as the coefficient of determination.

 The linear regression equation is given by

yi = β0 + β1 Xi + εi
 The values of β0 and β1 are estimated by various methods, e.g. the least square method or the maximum likelihood method.

 The estimated values of β0 and β1 are used to predict the values of Yi as Ŷi, where

ŷ = β0 + β1 x

 Analysis of performance of linear regression can be done by using various measures as

(a) SSE – Sum of Squared Error


(b) MSE – Mean of Squared Error
(c) RMSE – Root Mean Squared Error
(d) NMSE – Normalised Mean Squared Error
(e) R-Squared
 (a) Sum of Squared Error (SSE)

SSE = Σi=1..n (yi – ŷi)²

 (b) Mean of Squared Error (MSE)

MSE = (1/n) (SSE) = (1/n) Σi=1..n (yi – ŷi)²
 (c) Root Mean Squared Error (RMSE)

RMSE = √MSE = √[ (1/n) Σi=1..n (yi – ŷi)² ]
 (d) Normalised Mean Squared Error (NMSE)

NMSE = SSE / Σi=1..n (yi – ȳ)²
 (e) R-Squared

R² = 1 – SSE / Σi=1..n (yi – ȳ)²
 (f) Mean Absolute Error (MAE)

MAE = (1/n) Σi=1..n |yi – ŷi|
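A small Python sketch of these measures is given below, assuming y_true holds the actual values and y_pred the model's predictions.

import math

# Sketch: the regression performance measures listed above.
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    y_bar = sum(y_true) / n
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))   # (a) SSE
    sst = sum((yt - y_bar) ** 2 for yt in y_true)                 # total sum of squares
    mse = sse / n                                                  # (b) MSE
    rmse = math.sqrt(mse)                                          # (c) RMSE
    nmse = sse / sst                                               # (d) NMSE
    r2 = 1 - sse / sst                                             # (e) R-squared
    mae = sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n  # (f) MAE
    return {"SSE": sse, "MSE": mse, "RMSE": rmse,
            "NMSE": nmse, "R2": r2, "MAE": mae}

# Example with the actual and predicted values of Table 3.1.2.
print(regression_metrics([1.0, 2.0, 1.3, 3.75, 2.25],
                         [1.21, 1.635, 2.06, 2.485, 2.91]))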

 Solved Examples

Ex. 3.4.1 :
Consider the following data for 5 students. Each Xi (i = 1 to 5) represents the score of the ith student in standard X, and the corresponding Yi represents the score of the ith student in standard XII.

(a) What linear regression equation best predicts the XIIth standard score?
OR
Find the regression line that best fits the given sample data.
(b) How is the regression equation interpreted?
(c) If a student's score is 90 in Xth standard, what is his/her expected score in XIIth standard?

Sample data

Student Score in Xth std Score in XIIth std


(Xi) (Yi)

1 95 85

2 85 95

3 80 70

4 70 65

5 60 70

 Soln. :
Given Xi = score of the ith student in Xth std.

Yi = score of the ith student in XIIth std.

Let us assume that X is the independent variable and Y is the dependent variable.

Let the equation of the regression line be

Ŷ = β0 + β1 X

where the values of β0 and β1 given by the least square method are
Fig. 3.4.1 : Linear Regression for Students example
β0 = Ȳ – β1 X̄

 β1 = Σ(xi – X̄)(yi – Ȳ) / Σ(xi – X̄)²

In this example, n = 5

xi    yi    (xi – X̄)    (xi – X̄)²    (yi – Ȳ)    (xi – X̄)(yi – Ȳ)

95    85    17          289           +8          136

85    95    7           49            +18         126

80    70    2           4             –7          –14

70    65    –8          64            –12         96

60    70    –18         324           –7          126

Sum                     730                       470

Σ xi = 390 ; X̄ = Σ xi / n = 390/5 = 78

Σ yi = 385 ; Ȳ = Σ yi / n = 385/5 = 77

 Σ(xi – X̄)² = 289 + 49 + 4 + 64 + 324 = 730

and
Σ(xi – X̄)(yi – Ȳ) = 470

 β1 = 470 / 730 = 0.644

and β0 = Ȳ – β1 X̄ = 77 – (0.644)(78) = 26.768

 The equation of the regression line that best fits the sample data is

ŷ = 26.768 + 0.644 x
(b) Interpretation of the linear regression equation
ŷ = 26.768 + 0.644 x

 Interpretation 1

For every 1-mark increase in a student's score in Xth standard, the expected increase in the same student's score in XIIth standard is 0.644 marks.

 Interpretation 2

Even if X = 0 (practically it is not possible to apply this interpretation in the real world), i.e. a student's score in Xth standard is 0, the expected score of the student in XIIth standard is still 26.768.

(c) If a student's score in Xth std is 90, then the student's expected score in XIIth std is calculated as :

 ŷ = 26.768 + 0.644 (90) ; ŷ ≈ 84.73

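The computation of Ex. 3.4.1 can be checked with a short Python sketch that follows the same formulas; the numbers are the ones from the table above.

# Sketch: redo the computations of Ex. 3.4.1 numerically.
x = [95, 85, 80, 70, 60]   # Xth standard scores
y = [85, 95, 70, 65, 70]   # XIIth standard scores
n = len(x)

x_bar = sum(x) / n                      # 78
y_bar = sum(y) / n                      # 77
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 470
sxx = sum((xi - x_bar) ** 2 for xi in x)                         # 730

beta1 = sxy / sxx                # about 0.644
beta0 = y_bar - beta1 * x_bar    # about 26.77
print(beta0, beta1)
print("Expected XIIth score for 90 in Xth:", beta0 + beta1 * 90)  # about 84.7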
 Problem for Practice

UQ.For a given data set having 100 examples, if the squared errors SE1, SE2 and SE3 are 13.33, 3.33 and 4.00 respectively, calculate the Mean Squared Error (MSE). State the formula for MSE.
(SPPU - Q. 1(b), May/June 16, 5 Marks)
UQ.Consider the following data points :
Calculate the cost function for θ0 = 0.5 and θ1 = 1 using linear regression.

(SPPU - Q. 4(b), Nov./Dec. 17, 6 Marks)

X Y
1 1.5
2 2.75
3 4
4 4.5
5 5.5
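As a hint for the second practice problem, the sketch below shows one way to evaluate the cost for fixed θ0 and θ1. It assumes the common convention J(θ) = (1/2n) Σ (θ0 + θ1x – y)²; drop the factor 1/2 if plain MSE is expected instead.

# Sketch: cost J(theta0, theta1) for the practice data, using the common
# convention J = (1/2n) * sum((theta0 + theta1*x - y)^2).
data = [(1, 1.5), (2, 2.75), (3, 4.0), (4, 4.5), (5, 5.5)]
theta0, theta1 = 0.5, 1.0

n = len(data)
squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in data]
cost = sum(squared_errors) / (2 * n)
print(cost)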

 3.8 MULTIVARIATE REGRESSION : MODEL REPRESENTATION

GQ.Explain higher dimensional linear regression with suitable example.


UQ. Define and explain : Multivariate normal distribution.
(SPPU - Q. 8(b), Dec. 19, 4 Marks)
UQ.What is multivariate regression? How will it be different from univariate regression?
(SPPU - Q. 2(a), May/June 2016,
Q. 5(a), Nov./Dec. 16, 5 Marks)
UQ.What do you mean by zero centered and
un-correlated features? What is the use of it in the solution of multivariate linear
regression?
(SPPU - Q. 4(b), May/June 18, 6 Marks)

 Linear regression is a statistical model that describes the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and independent variable(s). A linear relationship mainly means that the dependent variable increases (or decreases) when one (or more) of the independent variables increases (or decreases).

 As can be seen, a linear relationship may be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down).

 Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to observed data. The steps needed to perform multiple linear regression are similar to those of simple linear regression.

 The difference is in the evaluation: it can be used to find out which factor has the highest impact on the predicted output and how the different variables relate to each other.

Here : Y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn

Y = dependent variable and x1, x2, x3, ..., xn = multiple independent variables

Fig. 3.8.1 : Positive and Negative Regression

 There are two important disadvantages of linear regression. Let's consider that the model is actually close to (or exactly) linear, i.e.

yi = r(xi) + εi,  i = 1, ..., n,

for some underlying regression function r(x) that is approximately (or exactly) linear in x.

The two shortcomings :

1. Predictive ability : The linear regression fit has low bias but high variance. The expected test error is a combination of these two quantities. Prediction accuracy can be enhanced by accepting a small amount of bias in order to decrease the variance.
2. Interpretative ability : Linear regression assigns a coefficient to every predictor variable. When the number of variables p is large, a smaller set of important variables is sought for the sake of interpretation. So there is a need to encourage the fit to make a subset of the coefficients large, and the others small or even zero.

 In a high dimensional regression setting, these limitations become major problems,


where the number of predictors p rivals or even exceeds the number of observations n. In
fact, where p > n, the linear regression approximation is not well defined.
 How can we do better?

 For a linear model, linear regression has expected test error σ² + p·σ²/n. The first term is the irreducible error; the second term comes entirely from the variance of the linear regression estimate (averaged over the input points). Its bias is exactly zero.

 What can be understood from this? If another predictor variable is added into the mix, the same amount of variance, σ²/n, will get added, irrespective of whether its true coefficient is large or small (or zero).

 Hence, in such a setting, variance is "spent" trying to fit truly small coefficients, for example 20 small coefficients out of 30.

 One may find that it can be done better by shrinking small coefficients towards zero,
which possibly introduces some bias, but also reduces the variance. In other words,
“small details” were ignored in order to get a more stable “big picture”. If it is properly
done, this way can actually work.
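A minimal sketch of this shrinkage idea, using ridge regression from scikit-learn, is shown below; the data is synthetic, generated only to contrast ordinary least squares with a shrunken fit, and the penalty strength alpha is an arbitrary illustrative choice.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Hypothetical high-dimensional setting: 30 predictors, only 10 truly large.
n, p = 60, 30
beta_true = np.concatenate([rng.normal(0, 3, 10), rng.normal(0, 0.1, 20)])
X = rng.normal(size=(n, p))
y = X @ beta_true + rng.normal(0, 1, n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # penalty shrinks coefficients toward zero

print("OLS   mean coefficient magnitude:", np.abs(ols.coef_).mean())
print("Ridge mean coefficient magnitude:", np.abs(ridge.coef_).mean())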

 3.9 INTRODUCTION TO POLYNOMIAL REGRESSION

UQ. Write short notes on : Linearly and non- linearly separable data (SPPU - Q.
6(a), March 19, 5 Marks)
UQ.What is a polynomial regression? How it can be represented in a form of a matrix?
(SPPU - Q. 2(b), May/June 18, 5 Marks)
UQ.What do you mean by zero centered and
un-correlated features? What is the use of it in the solution of multivariate linear
regression?
(SPPU - Q. 4(b), May/June 18, 6 Marks)

 In statistics, polynomial regression is a kind of regression analysis in which the


relationship between the independent variable x and the dependent variable y is modeled
as an nth degree polynomial in x.

 Polynomial regression fits the non-linear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).

 Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters estimated from the data.
 Because of this, polynomial regression is considered a special case of multiple linear regression.

Fig. 3.9.1 : Non linear Regression Model

 In the case above, the model remains linear externally, but it can hold internal non-
linearity. Let’s consider the above Fig. 3.9.1, which shows how scikit-learn implements
this technique. This is obviously a non-linear dataset, and any linear regression based
only on the original two-dimensional points cannot capture the dynamics.

 The need of Polynomial Regression in ML can be understood in the below points:

 If a linear model is applied to a linear dataset, it gives a good result, as seen in simple linear regression; but if the same model is applied without any alteration to a non-linear dataset, the result produced may be drastically worse.

 As a consequence, the loss function will increase, the error rate will become high, and the accuracy will ultimately decrease.

 Thus in such cases, a Polynomial Regression model is needed, where data points are
arranged in a non-linear fashion. This can be understood in a better way using the
comparison shown in below Fig. 3.9.2 and
Fig. 3.9.3 of the linear dataset and non-linear dataset.

Fig. 3.9.2 : Simple Linear Model Fig. 3.9.3 : Polynomial Model


 In the image above, a non-linear dataset is arranged. So if it is looked from the view of
linear model, then clearly it is seen that it hardly covers any data point. In another way, a
curve is appropriate which covers most of the data points, which is of the Polynomial
model.

 Henceforth, if the datasets are organized in a non-linear way, then the Polynomial
Regression model is used instead of Simple Linear Regression.

 Equation of the Polynomial Regression Model

 Polynomial Regression equation :

y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ

 Polynomial regression builds directly on linear regression; its main idea is how to select the features. Consider multivariate regression with 2 variables, x1 and x2. Linear regression will look like this:

y = a1 * x1 + a2 * x2.

 To obtain a polynomial regression (let's make a 2-degree polynomial), a few additional features are created: x1*x2, x1² and x2². So we get our 'linear regression':

y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1² + a5 * x2²

 A polynomial term: a linear model can be turned into a curve by quadratic (squared) or cubic (cubed) terms. But here it is the data X that is squared or cubed, not the beta coefficients, so it is still a linear model. This lets us model curves easily without explicitly building a complicated nonlinear model.

 One frequent pattern in machine learning is to use linear models trained on nonlinear functions of the data. This approach keeps the fast performance of linear methods while allowing them to fit a much wider range of data.
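A short sketch of this pattern with scikit-learn is given below: polynomial features are generated from the original input and an ordinary linear model is then fitted on them. The data is synthetic and only for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows 1 + 2x + 3x^2 plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(0, 1, 50)

# degree=2 creates the extra features x and x^2; the model itself stays linear.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(x, y)
print(model.named_steps["linearregression"].coef_,
      model.named_steps["linearregression"].intercept_)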

 Advantages of using Polynomial Regression

1. A polynomial provides the best approximation of the relationship between the dependent and independent variable.
2. A wide range of functions can be fit under it.
3. A polynomial can fit a wide range of curvature.

 Disadvantages of using Polynomial Regression


1. Even one or two outliers can affect the result to a great extent.
2. Polynomial regression is very sensitive to outliers.
3. Furthermore, nonlinear regression has fewer model validation tools than linear regression.

 3.12 UNDERFITTING AND OVERFITTING

UQ. Explain the Fig. 3.12.1 (a), (b) and (c).


(SPPU - Q. 5(b), Oct.19, 5 Marks)

Fig. 3.12.1

GQ. What is meant by overfitting? Discuss different methodologies for the avoidance of overfitting.
UQ.Which one of these is underfit and overfit?
Why? Comment with respect to bias and
variance.
(SPPU - Q. 4(a), Nov./Dec. 16, 5 Marks)

(a) Degree 1 (b) Degree 4


(c) Degree 15
Fig. Q. 4(a)
UQ. What is overfitting and underfitting? What are the
catalysts of overfitting?
(SPPU - Q. 3(b), May/June 19, 4 Marks)

 A model can be called a good machine learning model if it can generalize appropriately to new input data from the problem domain. This helps it make predictions on future data that the model has not seen.

 To check how well a machine learning model learns and generalizes to new data, we use the concepts of overfitting and underfitting. Both are responsible for poor performance of machine learning algorithms.

 3.12.1 Underfitting

 Underfitting is said to occur when a statistical model or machine learning algorithm cannot capture the trend of the data. It destroys the accuracy of the model.

 Underfitting means that the model does not fit the data well. This usually happens when limited data is available to build an accurate model, and also possibly when a linear model is built for data that is non-linear.

 In such situations the machine learning model is too simple and its rules too limited for the available data, with the result that the model makes wrong predictions. To avoid underfitting, more data is required and fewer features should be removed.

 3.12.2 Overfitting

 A model is said to be overfitted when it is trained with too much data and detail; when such a situation occurs, it starts learning from the noise and inaccurate data entries in the data set.

 Thus the model cannot categorize the data properly, due to too much detail and noise. Non-parametric and non-linear methods are common causes of overfitting, as these types of machine learning algorithms have more freedom in building the model from the dataset and hence can build unrealistic models.

 To overcome the problem of overfitting, a linear algorithm can be used if we have linear data, or parameters such as the maximal depth can be restricted when using decision trees.

Fig. 3.12.1 : Overfitting


 Overfitting is more probable with nonparametric and nonlinear models that have more
flexibility while learning a target function.

 Also, many nonparametric machine learning algorithms also contain parameters or


techniques that can limit and restrain how much detail the model learns.

 For instance, decision trees are a nonparametric machine learning algorithm; because they are very flexible, the problem of overfitting can arise.

 This problem can be overcome by reducing a tree after learning so that some details can
be removed.

 The generally used methodologies for avoidance of overfitting are :

1. Cross-Validation : A standard way of estimating the out-of-sample prediction error, for example using 5-fold cross-validation (a minimal sketch is given after this list).
2. Early Stopping : This rule guides us in how many iterations can be run before the learner begins to over-fit.
3. Pruning : Pruning is widely used while building tree-related models. Here, nodes that have little predictive power for the problem in hand are removed.
4. Regularization : Regularization introduces a penalty (cost) term into the objective function. It therefore pushes the coefficients of many variables towards zero, which reduces model complexity.
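The cross-validation sketch referred to in item 1 might look as follows with scikit-learn; the data is synthetic, and the 5-fold split and scoring choice are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: one informative feature plus noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 1))
y = 4 * X[:, 0] + rng.normal(0, 1, 100)

# 5-fold cross-validation: an out-of-sample error estimate used to spot overfitting.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("mean out-of-sample MSE:", -scores.mean())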

 3.13 BIAS VS. VARIANCE

UQ.Explain the term bias-Variance dilemma.


(SPPU - Q. 3(b), Nov./Dec. 18, Q. 4(a),
May/June 19 Q. 5(a), Oct.19)

 3.13.1 Bias

 Let’s consider we have two values, one is predicted by our model and other is actual
value of data (target value).

 Bias refers to the gap between these two values (predicted value by our model and actual
value of data).

 Bias helps us to generalize better and make our model less sensitive to some single data
point.
 Model with high bias pays very little attention to the training data and oversimplifies the
model.

 It always leads to high error on training and test data.

 High Bias

Our estimated data value is a long way from the actual data value, resulting in a large gap
between the two.

 Low Bias

Our estimated data value is close to the actual data value, i.e. there is a smaller gap
between expected and actual data value.

 3.13.2 Variance

 Variance refers to the spread of the predicted values in relation to one another.

 A high variance model pays close attention to training data and does not generalise to
data it hasn't seen before.

 On training data, such models work well, but on test data, they have a high error rate.

 Variance comes from highly complex models with a large number of features.

 Low Variance

All predicted values lie in a group (closely together).

 High Variance

Predicted values are scattered relative to each other.

Case 1 : Low Bias and Low Variance

The difference between actual and predicted values is small, and it belongs to the same
group as the low biassed and low variance rule (refer Fig. 3.13.1)

Case 2 : Low Bias and High Variance

Data is scattered due to high variance, but due to the rule of low bias, it is not far from
the actual data (target value) as seen Fig. 3.13.1.

Case 3 : High Bias and Low Variance


By the rule of high bias it’s a huge gap and by the rule of low variance it’s in group refer
Fig. 3.13.1.

Fig. 3.13.1 : Bias and Variance graphical visualization

Case 4 : High Bias and High Variance

 By the rule of high bias it’s a huge gap and by the rule of high variance data is scattered
refer Fig. 3.13.1.

 The predicted values are almost identical to the data's actual value. So the ideal option is
Low Bias and Low Variance.

 Underfitting

The data predicted with high bias lies in a straight-line format, which does not fit the data in the data set adequately.

h(x) = g(θ0 + θ1x1 + θ2x2)


 Overfitting

When a model's variance is excessive, it's referred to as Overfitting of Data. It involves


precisely fitting the training set using a complicated curve and a high order.

h(x) = 0 + 1x + 2x2 + 3x3 + 4x4

Fig. 3.13.2 : Model representation for bias and variance
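To make the underfit/overfit contrast concrete, the following sketch compares polynomial models of degree 1, 4 and 15 (echoing the figure in the exam question above) on synthetic noisy data; the degree-1 model underfits (high bias) and the degree-15 model overfits (high variance).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy samples of a smooth non-linear function.
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.cos(1.5 * np.pi * x[:, 0]) + rng.normal(0, 0.1, 30)

for degree in (1, 4, 15):   # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree {degree:2d}: cross-validated MSE = {mse:.3f}")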

 Difference between Bias and Variance

Sr. No. 1
Bias : The bias is the difference between the prediction of the values by the ML model and the correct value.
Variance : The variability of model prediction for a given data point, which tells us the spread of our data, is called the variance of the model.

Sr. No. 2
Bias : A model with high bias pays very little attention to the training data and oversimplifies the model.
Variance : High variance models pay close attention to the training data and do not generalize to new input.

Sr. No. 3
Bias : It always leads to high error on training and test data.
Variance : Such models perform very well on training data but have high error rates on test data.
