Session 3 - Linear Regression

REGRESSION ANALYSIS
REGRESSION
 Regression is a supervised learning algorithm under Machine Learning.
 An important tool in Predictive Analytics.
 A regression model establishes the existence of an association between two variables, but not causation.
REGRESSION VS CORRELATION
 Regression is the study of the “existence of a relationship” between two variables.
 Regression describes how to represent a relationship between two variables and numerically relate an independent variable to the dependent variable.
 Correlation is the study of the “strength of relationship” between two variables.
WHERE IS IT USED?
 Finance: CAPM, non-performing assets, probability of default, chance of bankruptcy, credit risk.
 Marketing: Sales, market share, customer satisfaction, customer churn, customer retention, customer life-time value.
 Operations: Inventory, productivity, efficiency, price.
 HR: Job satisfaction, attrition.
INTRODUCTION TO RESIDUALS
 Trying to fit a line to data points.
 It's hard to say for sure which line fits the data best.
 How do they decide which line is best?
 A residual is a measure of how well a line fits an individual data point.
Consider this simple data set with a line of fit drawn through it.
Point (2,8) is 4 units above the line. This vertical distance is known as a residual.
For data points above the line, the residual is positive, and for data points below the line, the residual is negative.
Which is the better fit?
SUM OF SQUARES ERROR (SSE)
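One way to decide which of two lines is the better fit is to add up the squared residuals of each line and prefer the smaller total. The sketch below is only an illustration: it uses the house age and price values that appear in the example tables of this deck, and the two candidate lines (`line A` and `line B`) are hypothetical, chosen purely for comparison.

```python
# House age (years) and price (10000 $) from the example tables in this deck
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]


def sse(intercept, slope, xs, ys):
    """Sum of squared residuals (actual minus predicted) for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Two hypothetical candidate lines, chosen only for illustration
sse_line_a = sse(40.0, -1.0, ages, prices)  # price = 40 - 1 * age
sse_line_b = sse(45.0, -2.0, ages, prices)  # price = 45 - 2 * age

print(f"SSE of line A: {sse_line_a:.2f}")
print(f"SSE of line B: {sse_line_b:.2f}")
print("Better fit:", "A" if sse_line_a < sse_line_b else "B")
```

Squaring keeps positive and negative residuals from cancelling each other and penalises large misses more heavily.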
WHAT IS REGRESSION?
 Regression is a tool for finding the existence of an association between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xn) in a study.
 The relationship can be linear or non-linear.
 A dependent variable (response variable) “measures an outcome of a study” (also called the outcome variable).
 An independent variable (explanatory variable) “explains changes in a response variable”.
REGRESSION NOMENCLATURE
Dependent Variable | Independent Variable
Explained Variable | Explanatory Variable
Regressand | Regressor
Predictand | Predictor
Endogenous Variable | Exogenous Variable
Controlled Variable | Control Variable
Target Variable | Stimulus Variable
Response Variable | Feature
Outcome Variable |
TYPES OF REGRESSION
Regression models
 One independent variable: Simple Regression (linear or non-linear)
 More than one independent variable: Multiple Regression (linear or non-linear)
EXAMPLE
 Boston Housing dataset
 Goal is to predict the price value of other houses

House no. | House Price (10000 $)
1 | 32
2 | 28
3 | 30
4 | 40
5 | 25
6 | 35
7 | 27
8 | 39
9 | 31
10 | 29
EXAMPLE
House no. | House Age (years) | House Price (10000 $)
1 | 5 | 32
2 | 10 | 28
3 | 8 | 30
4 | 2 | 40
5 | 15 | 25
6 | 7 | 35
7 | 12 | 27
8 | 3 | 39
9 | 6 | 31
10 | 9 | 29
Dependent or Independent variable?
SIMPLE LINEAR REGRESSION: EXAMPLE

 Which is the best fit?
 The regression line represents the "best-fit" line that minimizes the overall distance between the line and the actual data points.
 But how close is close enough?
SIMPLE LINEAR REGRESSION
 Managerial decisions often are based on the relationship between two variables.
 Regression analysis can be used to develop an equation showing how the variables are related.
 The variable being predicted is called the dependent variable and is denoted by y.
 The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x.
SIMPLE LINEAR REGRESSION
 Simple linear regression involves one independent variable and one dependent variable.
 The relationship between the two variables is approximated by a straight line.
 Regression analysis involving two or more independent variables is called multiple regression.
SIMPLE LINEAR REGRESSION MODEL
 The equation that describes how y is related to x and an error term is called the regression model.
 The simple linear regression model is
$y = \beta_0 + \beta_1 x + \epsilon$
where
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope
$\epsilon$ = error term, the unexplained variation in y
SIMPLE LINEAR REGRESSION EQUATION
 The simple linear regression equation is
$E(y) = \beta_0 + \beta_1 x$
[Figures: regression lines illustrating a positive linear relationship, a negative linear relationship, and no relationship]

ESTIMATION PROCESS
 Regression model: $y = \beta_0 + \beta_1 x + \epsilon$
 Regression equation: $E(y) = \beta_0 + \beta_1 x$
 Unknown parameters: $\beta_0$, $\beta_1$
 Sample data: $(x_1, y_1), \dots, (x_n, y_n)$
 Estimated regression equation: $\hat{y} = b_0 + b_1 x$
 Sample statistics: $b_0$, $b_1$
 $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$.
LEAST SQUARES METHOD
 Least Squares Criterion
Minimize the sum of the squares of the deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$:
$\min \sum (y_i - \hat{y}_i)^2$
 Slope and $y$-intercept for the estimated regression equation $\hat{y} = b_0 + b_1 x$:
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
For the house data:
Slope = $b_1$ = -1.146
Intercept = $b_0$ = 40.427
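The slope and intercept quoted for the house data can be reproduced directly from the least squares formulas. The following is a minimal sketch in plain Python; the variable names (`s_xy`, `s_xx`) are illustrative and not part of the original slides.

```python
# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n

# Sum of cross-deviations and sum of squared deviations of x
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
s_xx = sum((x - x_bar) ** 2 for x in ages)

b1 = s_xy / s_xx          # slope
b0 = y_bar - b1 * x_bar   # y-intercept

print(f"slope b1     = {b1:.3f}")
print(f"intercept b0 = {b0:.3f}")
```

The negative slope says that, in this sample, each additional year of age is associated with a lower price.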
House no. | House Age | Actual Price (10000 $) | Predicted Price | Error | Squared Error
1 | 5 | 32 | 34.695350 | -2.695350 | 7.264914
2 | 10 | 28 | 28.963220 | -0.963220 | 0.927793
3 | 8 | 30 | 31.256072 | -1.256072 | 1.577717
4 | 2 | 40 | 38.134629 | 1.865371 | 3.479610
5 | 15 | 25 | 23.231090 | 1.768910 | 3.129044
6 | 7 | 35 | 32.402498 | 2.597502 | 6.747015
7 | 12 | 27 | 26.670368 | 0.329632 | 0.108657
8 | 3 | 39 | 36.988203 | 2.011797 | 4.047329
9 | 6 | 31 | 33.548924 | -2.548924 | 6.497015
10 | 9 | 29 | 30.109646 | -1.109646 | 1.231314
SSE = 35.01
EXAMPLE
 Reed Auto periodically has a special week-long sale. As part of the advertising
campaign Reed runs one or more television commercials during the weekend
preceding the sale. Here are the data from a sample of 5 previous sales:
ESTIMATED REGRESSION EQUATION
 Slope for the estimated regression equation
 $y$-intercept for the estimated regression equation
 Estimated Regression Equation:
SCATTER DIAGRAM & ESTIMATED REGRESSION EQUATION
 Total Sum of Squares (SST) = 224.4
 Sum of Squares due to Regression (SSR) = 189.38
 Sum of Squared Errors (SSE) = 35.01
COEFFICIENT OF DETERMINATION
$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$
Total variation in Y = Variation in Y explained by the model + Variation in Y not explained by the model
 Relationship among SST, SSR, SSE: $SST = SSR + SSE$
where:
SST = Total Sum of Squares
SSR = Sum of Squares due to Regression
SSE = Sum of Squares due to Error
 The coefficient of determination is:
$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$
where:
SSR = sum of squares due to regression
SST = total sum of squares
 The coefficient of determination is:
$R^2 = SSR / SST = 189.38 / 224.4 = 0.844$
 For a perfect fit, SSE = ?
 For a perfect fit, SSR / SST = ?
The regression relationship is very strong; 84.4% of the variability in the price of houses can be explained by the linear relationship between the house age and price.
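The three sums of squares and the coefficient of determination for the house data can be checked with a few lines of plain Python. This is a sketch under the assumption that the line is fitted by ordinary least squares as described earlier; the variable names are illustrative.

```python
# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n

# Least squares estimates
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / sum(
    (x - x_bar) ** 2 for x in ages
)
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in ages]

sst = sum((y - y_bar) ** 2 for y in prices)                    # total variation
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)         # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, predicted))  # unexplained variation
r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R-squared = {r_squared:.3f}")
```

For a least squares line with an intercept, SSR and SSE add up to SST, which is why the ratio SSR / SST always lies between 0 and 1.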
 R-squared is a statistical measure that indicates how much of the variation of a dependent variable is explained by an independent variable in a regression model.
 R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%.
 An R-squared value of 0.9 would indicate that 90% of the variance of the dependent variable being studied is explained by the independent variable.
 What qualifies as a “good” R-squared value will depend on the context. In some fields, such as the social sciences, even a relatively low R-squared value, such as 0.5, could be considered relatively strong. In other fields, the standards for a good R-squared reading can be much higher, such as 0.9 or above.
SPURIOUS REGRESSION
 A higher value of $R^2$ implies a better fit, but one should be aware of spurious regression.
Number of Facebook users and the number of people who died of helium poisoning in the UK
Year | Number of Facebook users in millions (X) | Number of people who died of helium poisoning in UK (Y)
2004 | 1 | 2
2005 | 6 | 2
2006 | 12 | 2
2007 | 58 | 2
2008 | 145 | 11
2009 | 360 | 21
2010 | 608 | 31
2011 | 845 | 40
2012 | 1056 | 51
SPURIOUS REGRESSION
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.996442
R Square | 0.992896
Standard Error | 1.69286
Observations | 9

ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 2803.94 | 2803.94 | 978.4229 | 8.82E-09
Residual | 7 | 20.06042 | 2.865775 | |
Total | 8 | 2824 | | |

 | Coefficients | Standard Error | t-stat | P-value | Lower 95% | Upper 95%
Intercept | 1.9967 | 0.76169 | 2.62143 | 0.034338 | 0.195607 | 3.79783
FB | 0.0465 | 0.00149 | 31.27975 | 8.82E-09 | 0.043074 | 0.050119

The R-square value for the regression model between the number of deaths due to helium poisoning in the UK and the number of Facebook users is 0.9928. That is, 99.28% of the variation in the number of deaths due to helium poisoning in the UK is explained by the number of Facebook users. The regression model is given as Y = 1.9967 + 0.0465 X.
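The spurious fit can be reproduced from the nine rows of the Facebook and helium table. The sketch below recomputes the slope, intercept and R-squared in plain Python; it is meant only to show that a very high R-squared falls out of the arithmetic even when no causal link is plausible.

```python
# Facebook users in millions (X) and UK helium-poisoning deaths (Y), 2004-2012
fb_users = [1, 6, 12, 58, 145, 360, 608, 845, 1056]
deaths = [2, 2, 2, 2, 11, 21, 31, 40, 51]

n = len(fb_users)
x_bar = sum(fb_users) / n
y_bar = sum(deaths) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fb_users, deaths))
s_xx = sum((x - x_bar) ** 2 for x in fb_users)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

sst = sum((y - y_bar) ** 2 for y in deaths)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(fb_users, deaths))
r_squared = 1 - sse / sst

print(f"Y = {intercept:.4f} + {slope:.4f} X")
print(f"R-squared = {r_squared:.4f}")
```

Both series simply grow over the same years, so the model captures a shared time trend rather than any effect of one variable on the other.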
EXAMPLE
 Total Sum of Squares (SST) = ?
 Sum of Squares due to Regression (SSR) = ?
 Sum of Squared Errors (SSE) = ?
ASSUMPTIONS
 The conditional expected value of the residuals, $E(\epsilon_i)$, is zero.
 The variance of the residuals, $\sigma^2$, is constant for all values of $X_i$ (example: income vs. savings).
 When the variance of the residuals is constant for different values of $X_i$, it is called homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.
 Residuals are uncorrelated, that is, $Cov(\epsilon_i, \epsilon_j) = 0$ for all $i \ne j$.
 The residuals, $\epsilon_i$, follow a normal distribution.
 The regression model is linear in the regression parameters.
 The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
TESTING FOR SIGNIFICANCE
Estimate of $\sigma^2$
 From the regression model and its assumptions, we can conclude that $\sigma^2$, the variance of $\epsilon_i$, also represents the variance of the y values about the regression line.
 SSE, the sum of squared residuals, is a measure of the variability of the actual observations about the estimated regression line.
 The mean square error (MSE) provides the estimate of $\sigma^2$: $s^2 = MSE = SSE / (n - 2)$
 Every sum of squares has associated with it a number called its degrees of freedom. Statisticians have shown that SSE has n-2 degrees of freedom because two parameters ($\beta_0$ and $\beta_1$) must be estimated to compute SSE. Thus, the mean square error (MSE) is computed by dividing SSE by n-2.
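As a small worked instance of the n - 2 divisor, the sketch below computes MSE and its square root (the standard error of the estimate) for the house data. The variable names are illustrative, and the fit is assumed to be the ordinary least squares line from earlier in the session.

```python
import math

# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / sum(
    (x - x_bar) ** 2 for x in ages
)
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))

# Two parameters (b0 and b1) were estimated, so SSE has n - 2 degrees of freedom
mse = sse / (n - 2)
s = math.sqrt(mse)  # standard error of the estimate

print(f"SSE = {sse:.2f}, MSE = {mse:.3f}, s = {s:.3f}")
```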
 In a simple linear regression equation, the mean or expected value of y is a linear function of x:
$E(y) = \beta_0 + \beta_1 x$
 The regression coefficient ($\beta_1$) captures the existence of a linear relationship between the response variable and the explanatory variable.
 If $\beta_1 = 0$, $E(y)$ does not depend on x, so there is no linear relationship between the two variables.
 To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of $\beta_1$ is zero.
$H_0: \beta_1 = 0$
$H_a: \beta_1 \ne 0$
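 The properties of the sampling distribution of $b_1$, the least squares estimator of $\beta_1$, provide the basis for the hypothesis test.
 We do not know the value of the population standard deviation $\sigma$, and therefore the standard deviation of $b_1$.
 We develop an estimate of it using the sample standard deviation $s = \sqrt{MSE}$: $s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$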
 The standard deviation of $b_1$ is also referred to as the standard error of $b_1$. Thus, $s_{b_1}$ provides an estimate of the standard error of $b_1$.
The t test for a significant relationship is based on the fact that the test statistic
$\frac{b_1 - E(b_1)}{s_{b_1}} = \frac{b_1 - \beta_1}{s_{b_1}}$
follows a t distribution with $n - 2$ degrees of freedom. If the null hypothesis is true, then $\beta_1 = 0$ and
$t = b_1 / s_{b_1}$
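To make the test statistic concrete, the sketch below computes the standard error of the slope and the t statistic for the house data. The critical value 2.306 is the standard t-table entry for a two-tailed test at the 5% level with 8 degrees of freedom; it is hard-coded here as an assumption so that the example needs no statistics library.

```python
import math

# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
s = math.sqrt(sse / (n - 2))      # estimate of sigma
s_b1 = s / math.sqrt(s_xx)        # standard error of the slope
t_stat = b1 / s_b1                # test statistic under H0: beta1 = 0

T_CRITICAL = 2.306  # t table value for alpha/2 = 0.025, df = n - 2 = 8 (assumed from tables)
reject_h0 = abs(t_stat) >= T_CRITICAL

print(f"s_b1 = {s_b1:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Because the absolute value of t is well beyond the critical value, the sample gives evidence of a significant linear relationship between house age and price.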
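t Test
Hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$
Test statistic: $t = b_1 / s_{b_1}$
Rejection rule: reject $H_0$ if p-value $\le \alpha$, or if $t \le -t_{\alpha/2}$ or $t \ge t_{\alpha/2}$
where:
$t_{\alpha/2}$ is based on a t distribution with $n - 2$ degrees of freedom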
EXAMPLE
TESTING FOR SIGNIFICANCE: EXAMPLE
Two-Tail Hypothesis Testing
[Figure: t distribution with rejection regions below $-t_{\alpha/2}$ and above $+t_{\alpha/2}$; $1 - \alpha$ of all t values lie between them]
P-value
$p = 2 \cdot P(t \ge 4.63) = 2 \times 0.01 = 0.02$
Since p is less than 0.05, we reject the null hypothesis.
Critical Value
$t = 4.63 > 3.18$, so we reject the null hypothesis.
CONFIDENCE INTERVAL FOR 𝛽𝛽1
 We can use a 95% confidence interval for $\beta_1$ to test the hypotheses just used in the t test.
 $H_0$ is rejected if the hypothesized value of $\beta_1$ is not included in the confidence interval for $\beta_1$.
 Rejection rule: reject $H_0$ if 0 is not included in the confidence interval for $\beta_1$.
 95% confidence interval for $\beta_1$: $b_1 \pm t_{\alpha/2}\, s_{b_1}$
 Conclusion: 0 is not included in the confidence interval. Reject $H_0$.
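The confidence-interval version of the test can be sketched for the house data as follows. As an assumption, the critical value 2.306 is taken from a standard t table (two-tailed 5% level, 8 degrees of freedom) rather than computed.

```python
import math

# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)

T_CRITICAL = 2.306  # t table value for alpha/2 = 0.025, df = 8 (assumed from tables)
margin = T_CRITICAL * s_b1
lower, upper = b1 - margin, b1 + margin

zero_in_interval = lower <= 0 <= upper
print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")
print("Reject H0" if not zero_in_interval else "Do not reject H0")
```

The interval lies entirely below zero, which leads to the same conclusion as the t test on these data.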
TESTING FOR SIGNIFICANCE: F TEST
The null and alternative hypotheses for the F test are given by
$H_0$: There is no statistically significant relationship between Y and any of the explanatory variables (i.e., all regression coefficients are zero).
$H_A$: Not all regression coefficients are zero.
 Alternatively:
$H_0$: All regression coefficients are equal to zero
$H_A$: Not all regression coefficients are equal to zero
 The F-statistic is given by
$F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$
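 Hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$
 Test statistic: $F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$
 Rejection rule: reject $H_0$ if p-value $\le \alpha$ or if $F \ge F_\alpha$, where $F_\alpha$ is based on an F distribution with 1 degree of freedom in the numerator and $n - 2$ degrees of freedom in the denominator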
 Compute the value of the test statistic: $F = 21.43$
 Determine whether to reject $H_0$: $F = 17.44$ provides an area of 0.025 in the upper tail. Thus, the p-value corresponding to $F = 21.43$ is less than 0.025. Hence, we reject $H_0$.
The statistical evidence is sufficient to conclude that we have a significant relationship between the number of TV ads aired and the number of cars sold.
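The same F computation can be illustrated on the house age and price data from earlier in the session (the TV ads sample itself is not reproduced in this text). In this sketch the critical value 5.32 is the standard F-table entry for the 5% level with 1 and 8 degrees of freedom, hard-coded as an assumption.

```python
# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / sum(
    (x - x_bar) ** 2 for x in ages
)
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in ages]

ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, predicted))

msr = ssr / 1          # one independent variable
mse = sse / (n - 2)    # n - 2 degrees of freedom for error
f_stat = msr / mse

F_CRITICAL = 5.32  # F table value for alpha = 0.05 with (1, 8) df (assumed from tables)
print(f"F = {f_stat:.2f}, reject H0: {f_stat >= F_CRITICAL}")
```

With a single independent variable the F statistic equals the square of the t statistic for the slope, so the two tests agree.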
RESIDUAL ANALYSIS
Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following:
 The residuals $(Y_i - \hat{Y}_i)$ are normally distributed.
 The variance of the residuals is constant (homoscedasticity).
 The functional form of the regression is correctly specified.
 Whether there are any outliers.
CHECKING FOR NORMAL DISTRIBUTION OF RESIDUALS
 The easiest technique to check whether the residuals follow a normal distribution is to use the P-P plot (Probability-Probability plot).
 The P-P plot compares the cumulative distribution functions of two probability distributions against each other.
 The diagonal line is the cumulative distribution of a normal distribution, whereas the dots represent the cumulative distribution of the residuals.
TEST OF HOMOSCEDASTICITY
 An important assumption of the regression model is that the residuals have constant variance (homoscedasticity) across different values of the explanatory variable (X).
 That is, the variance of the residuals is assumed to be independent of variable X. Failure to meet this assumption will result in unreliability of the hypothesis tests.
TESTING THE FUNCTIONAL FORM OF REGRESSION MODEL
 Any pattern in the residual plot would indicate incorrect specification (misspecification) of the model.
OUTLIER ANALYSIS
 Outliers are observations whose values show a large deviation from the mean value, that is, $(Y_i - \bar{Y})$ is large.
 The presence of an outlier can have a significant influence on the values of the regression coefficients. Thus, it is important to identify the existence of outliers in the data.
VALIDATION OF THE SIMPLE LINEAR REGRESSION MODEL
 The following measures are used to validate the simple linear regression models:
1. Coefficient of determination (R-square).
2. Hypothesis test for the regression coefficient.
3. Analysis of Variance for overall model validity (relevant more for multiple linear regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis.
USING THE ESTIMATED REGRESSION EQUATION FOR ESTIMATION AND PREDICTION
 A confidence interval is an interval estimate of the mean value of $y$ for a given value of $x$.
 A prediction interval is used whenever we want to predict an individual value of $y$ for a new observation corresponding to a given value of $x$.
 The margin of error is larger for a prediction interval.
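 Confidence interval estimate of $E(y^*)$: $\hat{y}^* \pm t_{\alpha/2}\, s_{\hat{y}^*}$
 Prediction interval estimate of $y^*$: $\hat{y}^* \pm t_{\alpha/2}\, s_{\text{pred}}$
where the confidence coefficient is $1 - \alpha$ and $t_{\alpha/2}$ is based on a t distribution with $n - 2$ degrees of freedom.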
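The difference between the two intervals can be seen numerically on the house data. The sketch below evaluates both at a hypothetical new value of 10 years of age, using the usual textbook expressions for the two standard deviations; these expressions and the hard-coded t-table value 2.306 (5% two-tailed, 8 degrees of freedom) are assumptions not spelled out in the slides.

```python
import math

# House age (years) and price (10000 $) from the example table
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices)) / (n - 2))

x_star = 10                      # hypothetical house age chosen for illustration
y_hat = b0 + b1 * x_star         # point estimate

# Standard deviation of the estimated mean, and of an individual prediction
leverage = 1 / n + (x_star - x_bar) ** 2 / s_xx
s_mean = s * math.sqrt(leverage)
s_pred = s * math.sqrt(1 + leverage)

T_CRITICAL = 2.306  # t table value for alpha/2 = 0.025, df = 8 (assumed from tables)
ci_margin = T_CRITICAL * s_mean
pi_margin = T_CRITICAL * s_pred

print(f"Point estimate: {y_hat:.3f}")
print(f"95% confidence interval: {y_hat - ci_margin:.2f} to {y_hat + ci_margin:.2f}")
print(f"95% prediction interval: {y_hat - pi_margin:.2f} to {y_hat + pi_margin:.2f}")
```

The extra 1 under the square root for an individual value is what makes the prediction interval wider than the confidence interval.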
EXAMPLE
POINT ESTIMATION
If 3 TV ads are run prior to a sale, we expect the mean number of cars sold to be:
CONFIDENCE INTERVAL
Estimate of the standard deviation of $\hat{y}^*$:
The 95% confidence interval estimate of the mean number of cars sold when 3 TV ads are run is:
PREDICTION INTERVAL
Estimate of the standard deviation of an individual value of $y^*$:
The 95% prediction interval estimate of the number of cars sold in one particular week when 3 TV ads are run is:
FRAMEWORK FOR SLR MODEL DEVELOPMENT
PYTHON CODE
# Import necessary libraries
import numpy as np               # working with arrays
import pandas as pd              # data analysis library; primary data structures are Series and DataFrame
import matplotlib.pyplot as plt  # plotting
PYTHON CODE
ad_data = pd.read_csv('Advertising.csv', index_col='Unnamed: 0')
ad_data.info()
ad_data.head()

OUTPUT
  | TV | Radio | Newspaper | Sales
1 | 230.1 | 37.8 | 69.2 | 22.1
2 | 44.5 | 39.3 | 45.1 | 10.4
3 | 17.2 | 45.9 | 69.3 | 9.3
4 | 151.5 | 41.3 | 58.5 | 18.5
5 | 180.8 | 10.8 | 58.4 | 12.9
The sales are in thousands of units and the budget is in thousands of dollars.
LINEARITY CHECK
# seaborn: statistical data visualization
# Visualize the relationship between the features and the response using scatter plots
import seaborn as sns
# Note: recent seaborn versions use height= in place of the older size= argument
p = sns.pairplot(ad_data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7)
SCATTER PLOT
# Keep TV as the single feature; Sales is the response
X = ad_data.drop(["Sales", "Radio", "Newspaper"], axis=1)
Y = ad_data.Sales
# Scatter plot
plt.scatter(X, Y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
PREDICTION
# scikit-learn (sklearn): Python library to implement machine learning models and statistical modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Initialize the linear regression model
model = LinearRegression()

# Train the model using the training set
model.fit(X_train, Y_train)

# Make predictions using the testing set
y_pred = model.predict(X_test)
MODEL PERFORMANCE
# Calculate and print the model performance metrics
from sklearn.metrics import r2_score, mean_squared_error
mse = mean_squared_error(Y_test, y_pred)
r2 = r2_score(Y_test, y_pred)

# Best fit line
plt.scatter(X_train, Y_train)
plt.plot(X_test, y_pred, color='blue', marker='o')

# Results
print("Mean Squared Error : ", mse)
print("R-Squared :", r2)
print("Y-intercept :", model.intercept_)
print("Slope :", model.coef_)

OUTPUT
Mean Squared Error : 10.18618193453022
R-Squared : 0.6763151577939721
Y-intercept : 7.292493773559364
Slope : [0.04600779]
RESIDUAL
residuals = Y_test - y_pred
mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))

OUTPUT
Mean of Residuals -0.1754708813050691
RESIDUAL PLOT
CHECK FOR HOMOSCEDASTICITY
p = plt.scatter(y_pred, residuals)
plt.xlabel('y_pred / predicted values')
plt.ylabel('Residuals')
plt.ylim(-10, 10)
plt.xlim(0, 30)
p = plt.plot([0, 30], [0, 0], color='blue')
p = plt.title('Residuals vs fitted values plot for homoscedasticity check')
HOMOSCEDASTICITY CHECK - GOLDFELD QUANDT TEST
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(residuals, X_test)
lzip(name, test)

OUTPUT
[('F statistic', 1.4054961974172953), ('p-value', 0.23256111674013064)]
Since the p-value is more than 0.05 in the Goldfeld-Quandt test, we can't reject its null hypothesis that the error terms are homoscedastic.
WHAT TO DO IF THERE IS HETEROSCEDASTICITY?
 Outlier removal
 Change the functional form of the model
 Use log-transformation or polynomials
CHECK FOR NORMALITY OF ERROR TERMS / RESIDUALS
p = sns.displot(data=None, x=residuals, kde=True)
p = plt.title('Normality of error terms / residuals')
CHECK FOR NORMALITY OF ERROR TERMS / RESIDUALS
# Import library
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("MODEL Residuals P-P Plot")
plt.legend(['Actual', 'Theoretical'])
OLS
import pandas as pd
import statsmodels.api as sm

ad_data = pd.read_csv('Advertising.csv', index_col='Unnamed: 0')
X = ad_data.drop(["Sales", "Radio", "Newspaper"], axis=1)
Y = ad_data.Sales

# Add a constant term to the features (required for OLS regression)
X = sm.add_constant(X)
# Fit OLS regression model
model = sm.OLS(Y, X).fit()
# Print model summary
print(model.summary())
RESULTS
Assignment
Package Pricing at the Die Another Day (DAD) Hospital
