Session 3 - Linear Regression
REGRESSION
Regression model establishes the existence of an association between two variables, but not causation.
REGRESSION VS CORRELATION
WHERE IS IT USED?
Finance: CAPM, Non-performing assets, probability of default, Chance of
bankruptcy, credit risk.
INTRODUCTION TO RESIDUALS
SUM OF SQUARES ERROR (SSE)
WHAT IS REGRESSION?
REGRESSION NOMENCLATURE
Dependent variable (y): Regressand, Predictand, Response Variable
Independent variable (x): Regressor, Predictor
TYPES OF REGRESSION
Regression Models
One independent variable: Simple Regression
More than one independent variable: Multiple Regression
EXAMPLE
Boston Housing dataset
Goal is to predict the price value of other houses.

House no.   House Price (10000 $)
1           32
2           28
3           30
4           40
5           25
6           35
7           27
8           39
9           31
10          29
EXAMPLE
House no.   House Age (years)   House Price (10000 $)
1           5                   32
2           10                  28
3           8                   30
4           2                   40
5           15                  25
6           7                   35
7           12                  27
8           3                   39
9           6                   31
10          9                   29
SIMPLE LINEAR REGRESSION: EXAMPLE
SIMPLE LINEAR REGRESSION
Managerial decisions often are based on the relationship between two variables.
Regression analysis can be used to develop an equation showing how the variables are related.
The variable being predicted is called the dependent variable and is denoted by y.
The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x.
SIMPLE LINEAR REGRESSION MODEL
The equation that describes how y is related to x and an error term is called the
regression model.
The simple linear regression model is
$y = \beta_0 + \beta_1 x + \varepsilon$
y = Dependent variable
x = Independent variable
$\beta_0$: y-intercept
$\beta_1$: slope
$\varepsilon$: error term, unexplained variation in y
SIMPLE LINEAR REGRESSION EQUATION
[Figure: No Relationship]
ESTIMATION PROCESS
Sample statistics $b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$.
Estimated regression equation: $\hat{y} = b_0 + b_1 x$
LEAST SQUARES METHOD
Minimize the sum of the squares of the deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$.
Slope = $b_1$ = -1.146
Intercept = $b_0$ = 40.427
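The slope and intercept above can be reproduced with the closed-form least squares formulas. The following is a minimal sketch in plain Python using the house age and price table from the example; the helper name `least_squares_fit` is hypothetical and not taken from the slides.

```python
# House data from the example table: age in years, price in units of 10000 $
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]


def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared deviations y_i - (b0 + b1 * x_i)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


b0, b1 = least_squares_fit(ages, prices)
print(f"Slope b1 = {b1:.3f}")
print(f"Intercept b0 = {b0:.3f}")
```

The negative slope says that, in this sample, each additional year of age goes with a lower fitted price.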
House no.   House Age   Actual Price (10000 $)   Predicted Price   Error       Squared Error
1           5           32                       34.695350         -2.695350   7.264914
2           10          28                       28.963220         -0.963220   0.927793
3           8           30                       31.256072         -1.256072   1.577717
4           2           40                       38.134629         1.865371    3.479610
5           15          25                       23.231090         1.768910    3.129044
6           7           35                       32.402498         2.597502    6.747015
7           12          27                       26.670368         0.329632    0.108657
8           3           39                       36.988203         2.011797    4.047329
9           6           31                       33.548924         -2.548924   6.497015
10          9           29                       30.109646         -1.109646   1.231314
SSE = 35.01
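As a sketch of how the table above is built, the snippet below refits the line on the same house data, then computes each predicted price, error, and squared error and sums the last column. Variable names are illustrative assumptions.

```python
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
      / sum((x - x_bar) ** 2 for x in ages))
b0 = y_bar - b1 * x_bar

# One row per house: (actual, predicted, error, squared error)
rows = []
for x, y in zip(ages, prices):
    predicted = b0 + b1 * x
    error = y - predicted
    rows.append((y, predicted, error, error ** 2))

sse = sum(row[3] for row in rows)  # sum of squares due to error

for house_no, (y, predicted, error, sq_error) in enumerate(rows, start=1):
    print(f"{house_no:2d} {y:3d} {predicted:10.6f} {error:10.6f} {sq_error:9.6f}")
print(f"SSE = {sse:.2f}")
```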
EXAMPLE
Reed Auto periodically has a special week-long sale. As part of the advertising
campaign Reed runs one or more television commercials during the weekend
preceding the sale. Here are the data from a sample of 5 previous sales:
ESTIMATED REGRESSION EQUATION
Slope for the estimated regression equation
SCATTER DIAGRAM & ESTIMATED REGRESSION EQUATION
LEAST SQUARES METHOD
Sum of Squares due to Regression (SSR) = 189.38
COEFFICIENT OF DETERMINATION
$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$
Total variation in Y = Variation in Y explained by the model + Variation in Y not explained by the model
SST = SSR + SSE
where:
SST = Total Sum of Squares
SSR = Sum of Squares due to Regression
SSE = Sum of Squares due to Error
Coefficient of determination = $R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = \frac{SSR}{SST} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2}$
where:
SSR = sum of squares due to regression
SST = total sum of squares
The regression relationship is very strong; 84.4% of the variability in the price of houses can
be explained by the linear relationship between the house age and price.
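The 84.4% figure can be checked directly from the sums of squares. A minimal sketch, assuming the house age and price data from the earlier example; the variable names are illustrative.

```python
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
      / sum((x - x_bar) ** 2 for x in ages))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in ages]

sst = sum((y - y_bar) ** 2 for y in prices)               # total variation
ssr = sum((y_hat - y_bar) ** 2 for y_hat in fitted)       # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, fitted))  # unexplained variation

r_squared = ssr / sst
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}")
print(f"R-squared = {r_squared:.3f}")
```

The printed sums also illustrate the decomposition SST = SSR + SSE, up to floating-point rounding.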
R-squared is a statistical measure that indicates how much of the variation of a dependent variable is explained by an independent variable in a regression model.
R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%.
An R-squared value of 0.9 would indicate that 90% of the variance of the dependent variable being studied is explained by the variance of the independent variable.
What qualifies as a "good" R-squared value will depend on the context. In some fields, such as the social sciences, even a relatively low R-squared value, such as 0.5, could be considered relatively strong. In other fields, the standards for a good R-squared reading can be much higher, such as 0.9 or above.
SPURIOUS REGRESSION
A higher value of $R^2$ implies a better fit, but one should be aware of spurious regression.
Example: number of Facebook users and the number of people who died of helium poisoning in the UK.

             df   SS         MS         F          Significance F
Regression   1    2803.94    2803.94    978.4229   8.82E-09
Residual     7    20.06042   2.865775
Total        8    2824

             Coefficients   Standard Error   t-stat     P-value    Lower 95%   Upper 95%
Intercept    1.9967         0.76169          2.62143    0.034338   0.195607    3.79783
FB           0.0465         0.00149          31.27975   8.82E-09   0.043074    0.050119

The R-square value for the regression model between the number of deaths due to helium poisoning in the UK and the number of Facebook users is 0.9928. That is, 99.28% of the variation in the number of deaths due to helium poisoning in the UK is explained by the number of Facebook users. The regression model is given as Y = 1.9967 + 0.0465 X.
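The output above is internally consistent even though the relationship is meaningless, which is the point of the example. As a hedged sketch, the arithmetic linking the ANOVA rows can be replayed from the reported numbers alone; the variable names are illustrative, and the computed R-squared differs from the quoted 0.9928 only in the last rounded digit.

```python
# Values copied from the regression output for the Facebook / helium example
ss_regression, df_regression = 2803.94, 1
ss_residual, df_residual = 20.06042, 7
ss_total = 2824.0
t_stat_slope = 31.27975

r_squared = ss_regression / ss_total           # explained share of total variation
ms_regression = ss_regression / df_regression  # mean square due to regression
ms_residual = ss_residual / df_residual        # mean square error
f_stat = ms_regression / ms_residual

print(f"R-squared = {r_squared:.4f}")
print(f"MS residual = {ms_residual:.6f}")
print(f"F = {f_stat:.2f}")
# With a single explanatory variable, F equals the square of the slope's t statistic
print(f"t^2 = {t_stat_slope ** 2:.2f}")
```

A near-perfect fit and a tiny p-value say nothing about whether the two series are causally linked; here both simply trend upward over the same years.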
ASSUMPTIONS
The conditional expected value of the residuals, $E(\varepsilon_i \mid X_i)$, is zero.
The variance of the residuals, $\sigma^2$, is constant for all values of $X_i$ (e.g., income vs savings).
When the variance of the residuals is constant for different values of $X_i$, it is called homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.
Residuals are uncorrelated, that is, $Cov(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$.
The residuals, $\varepsilon_i$, follow a normal distribution.
The regression model is linear in the regression parameters.
The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
TESTING FOR SIGNIFICANCE
Estimate of $\sigma^2$
From the regression model and its assumptions, we can conclude that $\sigma^2$, the variance of $\varepsilon_i$, also represents the variance of the y values about the regression line.
SSE, the sum of squared residuals, is a measure of the variability of the actual observations about the estimated regression line.
Every sum of squares has associated with it a number called its degrees of freedom. Statisticians have shown that SSE has n - 2 degrees of freedom because two parameters ($\beta_0$ and $\beta_1$) must be estimated to compute SSE. Thus, the mean square error (MSE) is computed by dividing SSE by n - 2: $s^2 = \text{MSE} = \text{SSE} / (n - 2)$.
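For a concrete number, the mean square error can be computed for the house age and price data from the earlier example. This is a minimal sketch with illustrative variable names; `s` here denotes the standard error of the estimate, the square root of MSE.

```python
import math

ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
      / sum((x - x_bar) ** 2 for x in ages))
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
degrees_of_freedom = n - 2          # two estimated parameters: b0 and b1
mse = sse / degrees_of_freedom      # estimate of sigma squared
s = math.sqrt(mse)                  # standard error of the estimate

print(f"SSE = {sse:.2f}, df = {degrees_of_freedom}")
print(f"MSE = {mse:.2f}, s = {s:.2f}")
```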
The properties of the sampling distribution of $b_1$, the least squares estimator of $\beta_1$, provide the basis for the hypothesis test.
We develop an estimate of $\sigma_{b_1}$, the standard deviation of $b_1$, from the sample standard deviation $s$: $s_{b_1} = s / \sqrt{\sum_i (x_i - \bar{x})^2}$.
The standard deviation of $b_1$ is also referred to as the standard error of $b_1$. Thus, $s_{b_1}$ provides an estimate of the standard error of $b_1$.
The t test for a significant relationship is based on the fact that the test statistic
$\frac{b_1 - E(b_1)}{s_{b_1}} = \frac{b_1 - \beta_1}{s_{b_1}}$
follows a t distribution with n - 2 degrees of freedom. If the null hypothesis is true, then $\beta_1 = 0$ and $t = b_1 / s_{b_1}$.
t Test
Hypotheses: $H_0: \beta_1 = 0$, $H_a: \beta_1 \neq 0$
Test Statistic: $t = b_1 / s_{b_1}$
Rejection Rule: Reject $H_0$ if p-value $\le \alpha$, or if $t \le -t_{\alpha/2}$ or $t \ge t_{\alpha/2}$
where:
$t_{\alpha/2}$ is based on a t distribution with n - 2 degrees of freedom
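To make the t test concrete, here is a sketch that applies it to the house age and price data from the earlier example. The critical value 2.306 (two-sided, alpha = 0.05, 8 degrees of freedom) is taken from a standard t table rather than computed, and the variable names are assumptions for illustration.

```python
import math

ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
s = math.sqrt(sse / (n - 2))     # standard error of the estimate
s_b1 = s / math.sqrt(s_xx)       # estimated standard error of b1
t_stat = b1 / s_b1               # test statistic under H0: beta1 = 0

t_critical = 2.306               # t table value for alpha/2 = 0.025, df = 8
reject_h0 = abs(t_stat) >= t_critical

print(f"s_b1 = {s_b1:.4f}, t = {t_stat:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```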
TESTING FOR SIGNIFICANCE: EXAMPLE
[Figure: t distribution with $1 - \alpha$ of all t values lying between $-t_{\alpha/2}$ and $+t_{\alpha/2}$]
P-value
$p = 2 \cdot P(t \ge 4.63) = 2 \times 0.01 = 0.02$
Since p is less than 0.05, we reject the null hypothesis.
Critical Value
$t = 4.63 > 3.18$, we reject the null hypothesis.
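The p-value above is rounded from a t table. As a rough check, the two-sided tail probability for t = 4.63 with n - 2 = 3 degrees of freedom can be obtained by numerically integrating the t density. The sketch below uses only the standard library; the function names are hypothetical, and in practice a library routine such as `scipy.stats.t.sf` would usually be used instead.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=10_000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_half = total * h / 3   # P(0 <= T <= |t|)
    return 1 - 2 * central_half    # probability in both tails


p_value = two_sided_p_value(4.63, df=3)
print(f"p-value = {p_value:.4f}")
print("Reject H0" if p_value < 0.05 else "Do not reject H0")
```

The result is a little under 0.02, consistent with the table-based approximation.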
CONFIDENCE INTERVAL FOR $\beta_1$
We can use a 95% confidence interval for $\beta_1$ to test the hypotheses just used in the t test.
$H_0$ is rejected if the hypothesized value of $\beta_1$ is not included in the confidence interval for $\beta_1$.
Rejection Rule:
Reject $H_0$ if 0 is not included in the confidence interval for $\beta_1$.
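As an illustration of this rule, the sketch below builds a 95% confidence interval for the slope using the house age and price data from the earlier example, as $b_1 \pm t_{\alpha/2} s_{b_1}$. The critical value 2.306 (8 degrees of freedom) comes from a standard t table, and the variable names are assumptions.

```python
import math

ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)  # standard error of b1

t_critical = 2.306                    # t table value for alpha/2 = 0.025, df = 8
margin = t_critical * s_b1
lower, upper = b1 - margin, b1 + margin

zero_in_interval = lower <= 0 <= upper
print(f"95% CI for beta1: ({lower:.3f}, {upper:.3f})")
print("Do not reject H0" if zero_in_interval else "Reject H0")
```

Because the whole interval lies below zero, the interval-based rule reaches the same conclusion as the t test on this data.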
TESTING FOR SIGNIFICANCE: F TEST
The null and alternative hypotheses for the F test are given by
$H_0$: There is no statistically significant relationship between Y and any of the explanatory variables (i.e., all regression coefficients are zero).
$H_A$: Not all regression coefficients are zero.
Alternatively:
$H_0$: All regression coefficients are equal to zero.
$H_A$: Not all regression coefficients are equal to zero.
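A minimal sketch of the F test on the house age and price data from the earlier example, assuming the usual statistic F = MSR / MSE with 1 and n - 2 degrees of freedom. The critical value 5.32 (alpha = 0.05, degrees of freedom 1 and 8) is taken from a standard F table, and the variable names are illustrative.

```python
ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
      / sum((x - x_bar) ** 2 for x in ages))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in ages]

ssr = sum((y_hat - y_bar) ** 2 for y_hat in fitted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, fitted))

msr = ssr / 1            # one explanatory variable
mse = sse / (n - 2)
f_stat = msr / mse

f_critical = 5.32        # F table value for alpha = 0.05, df = (1, 8)
reject_h0 = f_stat >= f_critical
print(f"F = {f_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

With a single explanatory variable, this F value is the square of the t statistic for the slope, so both tests lead to the same decision.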
RESIDUAL ANALYSIS
CHECKING FOR NORMAL DISTRIBUTION OF RESIDUALS
The easiest technique to check whether the residuals follow a normal distribution is to use the P-P plot (Probability-Probability plot).
The P-P plot compares the cumulative distribution functions of two probability distributions against each other.
The diagonal line is the cumulative distribution of a normal distribution, whereas the dots represent the cumulative distribution of the residuals.
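The points of a P-P plot can be computed without any plotting library. The sketch below uses the residuals (the Error column) from the house example and pairs each residual's empirical cumulative probability with the cumulative probability under a fitted normal distribution. The plotting position (rank - 0.5) / n is one common convention and is an assumption here, as are the variable names.

```python
from statistics import NormalDist

# Residuals from the house age / price example (Error column)
residuals = [-2.695350, -0.963220, -1.256072, 1.865371, 1.768910,
             2.597502, 0.329632, 2.011797, -2.548924, -1.109646]

n = len(residuals)
mean = sum(residuals) / n
std = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
normal = NormalDist(mean, std)

pp_points = []
for rank, r in enumerate(sorted(residuals), start=1):
    empirical = (rank - 0.5) / n     # empirical cumulative probability
    theoretical = normal.cdf(r)      # cumulative probability under the normal model
    pp_points.append((theoretical, empirical))

for theoretical, empirical in pp_points:
    print(f"{theoretical:.3f}  {empirical:.3f}")
```

If the residuals are roughly normal, the two columns stay close to each other, which is what the dots hugging the diagonal line show in the plot.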
TEST OF HOMOSCEDASTICITY
An important assumption of the regression model is that the residuals have constant variance (homoscedasticity) across different values of the explanatory variable (X).
That is, the variance of the residuals is assumed to be independent of variable X. Failure to meet this assumption makes the hypothesis tests unreliable.
TESTING THE FUNCTIONAL FORM OF REGRESSION MODEL
Any pattern in the residual plot would indicate incorrect specification
(misspecification) of the model.
OUTLIER ANALYSIS
Outliers are observations whose values show a large deviation from the mean value, that is, $(Y_i - \bar{Y})$ is large.
The presence of an outlier can have a significant influence on the values of the regression coefficients. Thus, it is important to identify the existence of outliers in the data.
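One simple way to screen for such observations is to standardize the deviations from the mean. The sketch below does this for the house prices from the earlier example; the cutoff of 3 standard deviations is a common rule of thumb and an assumption here, not a threshold given in the slides.

```python
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(prices)
y_bar = sum(prices) / n
std = (sum((y - y_bar) ** 2 for y in prices) / (n - 1)) ** 0.5  # sample standard deviation

# Standardized deviation of each observation from the mean
z_scores = [(y - y_bar) / std for y in prices]

threshold = 3.0  # assumed rule-of-thumb cutoff
outliers = [(i + 1, y) for i, (y, z) in enumerate(zip(prices, z_scores))
            if abs(z) > threshold]

print("Largest |z| =", round(max(abs(z) for z in z_scores), 2))
print("Flagged house numbers and prices:", outliers)
```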
VALIDATION OF THE SIMPLE LINEAR REGRESSION MODEL
The following measures are used to validate the simple linear regression models:
1. Coefficient of determination (R-square).
2. Hypothesis test for the regression coefficient.
3. Analysis of Variance for overall model validity (relevant more for multiple linear
regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis.
USING THE ESTIMATED REGRESSION EQUATION FOR ESTIMATION AND PREDICTION
A confidence interval is an interval estimate of the mean value of $y$ for a given value of $x$.
A prediction interval is used whenever we want to predict an individual value of $y$ for a new observation corresponding to a given value of $x$.
The margin of error is larger for a prediction interval.
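The difference between the two intervals can be seen numerically. The sketch below uses the house age and price data from the earlier example and a hypothetical house age of x* = 10 years; it assumes the standard formulas $s\sqrt{1/n + (x^* - \bar{x})^2 / \sum_i (x_i - \bar{x})^2}$ for the mean response and $s\sqrt{1 + 1/n + (x^* - \bar{x})^2 / \sum_i (x_i - \bar{x})^2}$ for an individual value. The critical value 2.306 (8 degrees of freedom) is from a standard t table.

```python
import math

ages = [5, 10, 8, 2, 15, 7, 12, 3, 6, 9]
prices = [32, 28, 30, 40, 25, 35, 27, 39, 31, 29]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices)) / s_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices)) / (n - 2))

x_star = 10                           # hypothetical house age, chosen for illustration
y_hat_star = b0 + b1 * x_star         # point estimate
spread = 1 / n + (x_star - x_bar) ** 2 / s_xx

t_critical = 2.306                    # t table value for alpha/2 = 0.025, df = 8
ci_margin = t_critical * s * math.sqrt(spread)       # mean price at x_star
pi_margin = t_critical * s * math.sqrt(1 + spread)   # one individual house at x_star

print(f"Point estimate: {y_hat_star:.3f}")
print(f"95% confidence interval: {y_hat_star - ci_margin:.2f} to {y_hat_star + ci_margin:.2f}")
print(f"95% prediction interval: {y_hat_star - pi_margin:.2f} to {y_hat_star + pi_margin:.2f}")
```

The extra 1 under the square root accounts for the variability of a single observation around its mean, which is why the prediction interval is wider.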
POINT ESTIMATION
CONFIDENCE INTERVAL
Estimate of the Standard Deviation of $\hat{y}^*$: $s_{\hat{y}^*} = s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}}$
The 95% confidence interval estimate of the mean number of cars sold when 3 TV
ads are run is:
PREDICTION INTERVAL
The 95% prediction interval estimate of the number of cars sold in one particular
week when 3 TV ads are run is:
FRAMEWORK FOR SLR MODEL DEVELOPMENT
PYTHON CODE
import pandas as pd

# Load the advertising data; the first CSV column holds the row index
ad_data = pd.read_csv('Advertising.csv', index_col='Unnamed: 0')
ad_data.info()
ad_data.head()
The sales are in thousands of units and the budget is in thousands of dollars.
LINEARITY CHECK
# Visualize the relationship between the features and the response using scatter plots
import seaborn as sns  # statistical data visualization
# Recent seaborn versions use `height` in place of the older `size` argument
p = sns.pairplot(ad_data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7)
SCATTER PLOT
# Keep TV as the single predictor and Sales as the response
X = ad_data.drop(["Sales", "Radio", "Newspaper"], axis=1)
Y = ad_data.Sales
PREDICTION
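# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# Assumed reconstruction: the fitting step is not shown in the slide text
model = LinearRegression().fit(X_train, Y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(Y_test, y_pred)
r2 = r2_score(Y_test, y_pred)
# Results
print("Mean Squared Error :", mse)
print("R-Squared :", r2)
print("Y-intercept :", model.intercept_)
print("Slope :", model.coef_)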
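RESIDUAL
# Assumed reconstruction: the slide shows only the output, not the code
residuals = Y_test - y_pred
print("Mean of Residuals", residuals.mean())
OUTPUT
Mean of Residuals -0.1754708813050691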
RESIDUAL PLOT
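CHECK FOR HOMOSCEDASTICITY
import matplotlib.pyplot as plt
# Plot residuals against fitted values; an even band around zero suggests constant variance
p = plt.scatter(y_pred, residuals)
plt.xlabel('y_pred/ predicted values')
plt.ylabel('Residuals')
plt.ylim(-10, 10)
plt.xlim(0, 30)
p = plt.plot([0, 30], [0, 0], color='blue')  # zero-residual reference line
p = plt.title('Residuals vs fitted values plot for homoscedasticity check')
OUTPUT
[Figure: residuals vs fitted values scatter plot]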
HOMOSCEDASTICITY CHECK - GOLDFELD QUANDT TEST
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(residuals, X_test)
lzip(name, test)
OUTPUT
[('F statistic', 1.4054961974172953), ('p-value', 0.23256111674013064)]
Since the p-value is more than 0.05 in the Goldfeld-Quandt test, we can't reject its null hypothesis that the error terms are homoscedastic.
WHAT TO DO IF THERE IS HETEROSCEDASTICITY?
Outlier removal
Change the functional form of the model
Use log-transformation or polynomials
CHECK FOR NORMALITY OF ERROR TERMS/ RESIDUALS
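# Histogram of the residuals with a kernel density estimate
p = sns.displot(data=None, x=residuals, kde=True)
p = plt.title('Normality of error terms/ residuals')
OUTPUT
[Figure: histogram of residuals with density curve]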
# Import library
from scipy import stats
# Note: scipy.stats.probplot draws a quantile-based probability plot of the residuals against a normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("MODEL Residuals P-P Plot")
plt.legend(['Actual', 'Theoretical'])
OUTPUT
[Figure: probability plot of residuals]
OLS
import pandas as pd
import statsmodels.api as sm
# Note: load_boston was removed in scikit-learn 1.2, so this import requires an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
Assignment
Package Pricing at the Die Another Day (DAD) Hospital