Session 3 - Linear Regression


REGRESSION ANALYSIS

REGRESSION

 Regression is a supervised learning algorithm in Machine Learning.

 An important tool in Predictive Analytics.

2
A regression model establishes the existence of an association between two variables, but not causation.

3
REGRESSION VS CORRELATION

 Regression is the study of the "existence of a relationship" between two variables.
 Regression describes how to represent a relationship between two variables and numerically relate an independent variable to the dependent variable.
 Correlation is the study of the "strength of the relationship" between two variables.

9
WHERE IS IT USED?
 Finance: CAPM, non-performing assets, probability of default, chance of bankruptcy, credit risk.

 Marketing: Sales, market share, customer satisfaction, customer churn, customer retention, customer lifetime value.

 Operations: Inventory, productivity, efficiency, price.

 HR: Job satisfaction, attrition.

10
INTRODUCTION TO RESIDUALS

 Trying to fit a line to data points.

 It's hard to say for sure which line fits the data best.

 How do we decide which line is best?

11
INTRODUCTION TO RESIDUALS

 A residual is a measure of how well a line fits an individual data point.

Consider this simple data set with a line of best fit drawn through it.

12
INTRODUCTION TO RESIDUALS

 A residual is a measure of how well a line fits an individual data point.

Point (2, 8) is 4 units above the line.

This vertical distance is known as a residual.
For data points above the line, the residual is positive, and for data points below the line, the residual is negative.

Which is the better fit?

13
SUM OF SQUARES ERROR (SSE)
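In standard notation, SSE is the sum of the squared residuals ei = yi − ŷi between the observed and predicted values of the dependent variable:

SSE = Σ (yi − ŷi)² = Σ ei²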

14
WHAT IS REGRESSION?

 Regression is a tool for finding the existence of an association between a dependent variable (Y) and one or more independent variables (X1, X2, …, Xn) in a study.
 The relationship can be linear or non-linear.
 A dependent variable (response variable) "measures an outcome of a study (also called outcome variable)".
 An independent variable (explanatory variable) "explains changes in a response variable".

15
REGRESSION NOMENCLATURE

Dependent Variable        Independent Variable
------------------        --------------------
Explained Variable        Explanatory Variable
Regressand                Regressor
Predictand                Predictor
Endogenous Variable       Exogenous Variable
Controlled Variable       Control Variable
Target Variable           Stimulus Variable
Response Variable         Feature
Outcome Variable
16
TYPES OF REGRESSION

Regression Models
 Simple Regression: one independent variable (Linear or Non-linear)
 Multiple Regression: more than one independent variable (Linear or Non-linear)
17
EXAMPLE
 Boston Housing dataset
 Goal is to predict the price value of other houses

House no.    House Price (10000 $)
1            32
2            28
3            30
4            40
5            25
6            35
7            27
8            39
9            31
10           29

18
EXAMPLE
House no.    House Age (years)    House Price (10000 $)
1            5                    32
2            10                   28
3            8                    30
4            2                    40
5            15                   25
6            7                    35
7            12                   27
8            3                    39
9            6                    31
10           9                    29

Dependent or Independent variable?


19
SIMPLE LINEAR REGRESSION: EXAMPLE

House no.    House Age (years)    House Price (10000 $)
1            5                    32
2            10                   28
3            8                    30
4            2                    40
5            15                   25
6            7                    35
7            12                   27
8            3                    39
9            6                    31
10           9                    29

20
SIMPLE LINEAR REGRESSION: EXAMPLE

 Which is the best fit?

 The regression line represents the "best-fit" line that minimizes the overall distance between the line and the actual data points.

 But how close is close enough?

21
SIMPLE LINEAR REGRESSION

 Managerial decisions often are based on the relationship between two variables.
 Regression analysis can be used to develop an equation showing how the variables are related.
 The variable being predicted is called the dependent variable and is denoted by y.
 The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x.

22
SIMPLE LINEAR REGRESSION

 Simple linear regression involves one independent variable and one dependent variable.
 The relationship between the two variables is approximated by a straight line.
 Regression analysis involving two or more independent variables is called multiple regression.

23
SIMPLE LINEAR REGRESSION MODEL

 The equation that describes how y is related to x and an error term is called the
regression model.
 The simple linear regression model is

y = β0 + β1x + ε

where:
y = dependent variable
x = independent variable
β0 = y-intercept
β1 = slope
ε = error term (unexplained variation in y)
24
SIMPLE LINEAR REGRESSION EQUATION

 The simple linear regression equation is

E(y) = β0 + β1x

where E(y) is the expected value of y for a given value of x.
25
SIMPLE LINEAR REGRESSION EQUATION

Positive Linear Relationship


26
SIMPLE LINEAR REGRESSION EQUATION

Negative Linear Relationship

27
SIMPLE LINEAR REGRESSION EQUATION

No Relationship

28
SIMPLE LINEAR REGRESSION: EXAMPLE

House no.    House Age (years)    House Price (10000 $)
1            5                    32
2            10                   28
3            8                    30
4            2                    40
5            15                   25
6            7                    35
7            12                   27
8            3                    39
9            6                    31
10           9                    29

29
SIMPLE LINEAR REGRESSION: EXAMPLE

 Which is the best fit?

 The regression line represents the "best-fit" line that minimizes the overall distance between the line and the actual data points.

 But how close is close enough?

30
ESTIMATION PROCESS

Regression Model
y = β0 + β1x + ε
Unknown Parameters: β0, β1

Sample Data:
x     y
x1    y1
.     .
.     .
xn    yn

Regression Equation
E(y) = β0 + β1x

b0 and b1 provide estimates of β0 and β1

Estimated Regression Equation
ŷ = b0 + b1x
Sample Statistics: b0, b1
31
LEAST SQUARES METHOD

 Least Squares Criterion

Minimize the sum of the squares of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi.
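Formally, b0 and b1 are chosen to minimize

Σ (yi − ŷi)²,  where ŷi = b0 + b1xi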

32
LEAST SQUARES METHOD

 Slope and y-intercept for the estimated regression equation ŷ = b0 + b1x
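In standard notation, the least squares estimates are

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

b0 = ȳ − b1x̄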

33
LEAST SQUARES METHOD

Slope = b1 = −1.146
Intercept = b0 = 40.427

36
LEAST SQUARES METHOD
House no.    House Age    Actual Price (10000 $)    Predicted Price    Error        Squared Error
1            5            32                        34.695350          -2.695350    7.264914
2            10           28                        28.963220          -0.963220    0.927793
3            8            30                        31.256072          -1.256072    1.577717
4            2            40                        38.134629          1.865371     3.479610
5            15           25                        23.231090          1.768910     3.129044
6            7            35                        32.402498          2.597502     6.747015
7            12           27                        26.670368          0.329632     0.108657
8            3            39                        36.988203          2.011797     4.047329
9            6            31                        33.548924          -2.548924    6.497015
10           9            29                        30.109646          -1.109646    1.231314

SSE = 35.01
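A minimal NumPy sketch (assuming the ten-house table above) that reproduces the slope, intercept, and SSE reported on this slide:

import numpy as np

# House age (years) and house price (10000 $) from the table above
x = np.array([5, 10, 8, 2, 15, 7, 12, 3, 6, 9], dtype=float)
y = np.array([32, 28, 30, 40, 25, 35, 27, 39, 31, 29], dtype=float)

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Sum of squared errors of the fitted line
sse = np.sum((y - (b0 + b1 * x)) ** 2)

print(b1, b0, sse)  # approx. -1.146, 40.427, 35.01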
37
EXAMPLE
 Reed Auto periodically has a special week-long sale. As part of the advertising
campaign Reed runs one or more television commercials during the weekend
preceding the sale. Here are the data from a sample of 5 previous sales:
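The following data set (the standard Reed Auto example, consistent with the statistics b1 = 5, t = 4.63, and F = 21.43 computed on later slides) is assumed throughout:

Week    Number of TV Ads (x)    Number of Cars Sold (y)
1       1                       14
2       3                       24
3       2                       18
4       1                       17
5       3                       27

Σx = 10, x̄ = 2;  Σy = 100, ȳ = 20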

38
ESTIMATED REGRESSION EQUATION
 Slope for the estimated regression equation

 y-intercept for the estimated regression equation

 Estimated Regression Equation:
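Assuming the Reed Auto data above, the computations work out as:

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = 20 / 4 = 5

b0 = ȳ − b1x̄ = 20 − 5(2) = 10

ŷ = 10 + 5x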

39
SCATTER DIAGRAM & ESTIMATED REGRESSION EQUATION

40
LEAST SQUARES METHOD

41
LEAST SQUARES METHOD

 Total Sum of Squares (SST) = 224.4

 Sum of Squares due to Regression (SSR) = 189.38

 Sum of Squared Errors (SSE) = 35.01

42
COEFFICIENT OF DETERMINATION

(Yi − Ȳ)  =  (Ŷi − Ȳ)  +  (Yi − Ŷi)
Total variation in Y = Variation in Y explained by the model + Variation in Y not explained by the model

 Relationship among SST, SSR, SSE:

SST = SSR + SSE

where:
SST = Total Sum of Squares
SSR = Sum of Squares due to Regression
SSE = Sum of Squares due to Error
43
COEFFICIENT OF DETERMINATION

 The coefficient of determination is:

Coefficient of determination = R² = Explained variation / Total variation = SSR / SST = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²

where:
SSR = sum of squares due to regression
SST = total sum of squares
44
COEFFICIENT OF DETERMINATION

 The coefficient of determination is:

R² = SSR / SST = 189.38 / 224.4 = 0.844

 For a perfect fit, SSE = ?

 For a perfect fit, SSR / SST = ?

The regression relationship is very strong; 84.4% of the variability in the price of houses can be explained by the linear relationship between house age and price.

45
COEFFICIENT OF DETERMINATION
 R-squared is a statistical measure that indicates how much of the variation of a dependent variable is explained by an independent variable in a regression model.
 R-squared values range from 0 to 1 and are commonly stated as percentages from 0% to 100%.
 An R-squared value of 0.9 would indicate that 90% of the variance of the dependent variable being studied is explained by the variance of the independent variable.
 What qualifies as a "good" R-squared value depends on the context. In some fields, such as the social sciences, even a relatively low R-squared value, such as 0.5, could be considered relatively strong. In other fields, the standards for a good R-squared reading can be much higher, such as 0.9 or above.

46
SPURIOUS REGRESSION
 A higher value of R² implies a better fit, but one should be aware of spurious regression.
Example: the number of Facebook users and the number of people who died of helium poisoning in the UK.

Year    Number of Facebook users in millions (X)    Number of people who died of helium poisoning in UK (Y)
2004    1       2
2005    6       2
2006    12      2
2007    58      2
2008    145     11
2009    360     21
2010    608     31
2011    845     40
2012    1056    51

47
SPURIOUS REGRESSION
SUMMARY OUTPUT

Regression Statistics
Multiple R        0.996442
R Square          0.992896
Standard Error    1.69286
Observations      9

ANOVA
            df    SS          MS          F           Significance F
Regression  1     2803.94     2803.94     978.4229    8.82E-09
Residual    7     20.06042    2.865775
Total       8     2824

            Coefficients    Standard Error    t-stat      P-value     Lower 95%    Upper 95%
Intercept   1.9967          0.76169           2.62143     0.034338    0.195607     3.79783
FB          0.0465          0.00149           31.27975    8.82E-09    0.043074     0.050119

The R-square value for the regression model between the number of deaths due to helium poisoning in the UK and the number of Facebook users is 0.9928. That is, 99.28% of the variation in the number of deaths due to helium poisoning in the UK is explained by the number of Facebook users. The regression model is given as Y = 1.9967 + 0.0465 X.
48
EXAMPLE
 Reed Auto periodically has a special week-long sale. As part of the advertising
campaign Reed runs one or more television commercials during the weekend
preceding the sale. Here are the data from a sample of 5 previous sales:

 Total Sum of Squares (SST) = ?

 Sum of Squares due to Regression (SSR) = ?

 Sum of Squared Errors (SSE) = ?
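(With the reconstructed Reed Auto data above, these work out to SST = 114, SSR = 100, and SSE = 14, so that R² = SSR/SST ≈ 0.88.)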

49
ASSUMPTIONS
 The conditional expected value of the residuals, E(εi), is zero.
 The variance of the residuals, σ², is constant for all values of Xi (e.g., income vs savings).
 When the variance of the residuals is constant for different values of Xi, it is called homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.
 Residuals are uncorrelated; that is, Cov(εi, εj) = 0 for all i ≠ j.
 The residuals, εi, follow a normal distribution.
 The regression model is linear in the regression parameters.
 The explanatory variable, X, is assumed to be non-stochastic (i.e., X is deterministic).
50
TESTING FOR SIGNIFICANCE
Estimate of σ²
 From the regression model and its assumptions, we can conclude that σ², the variance of εi, also represents the variance of the y values about the regression line.
 SSE, the sum of squared residuals, is a measure of the variability of the actual observations about the estimated regression line: SSE = Σ(yi − ŷi)².
 The mean square error (MSE) provides the estimate of σ²: s² = MSE = SSE / (n − 2).
 Every sum of squares has associated with it a number called its degrees of freedom. Statisticians have shown that SSE has n − 2 degrees of freedom because two parameters (β0 and β1) must be estimated to compute SSE. Thus, the mean square error (MSE) is computed by dividing SSE by n − 2.

51
TESTING FOR SIGNIFICANCE

 In a simple linear regression equation, the mean or expected value of y is a linear function of x:

E(y) = β0 + β1x

 The regression coefficient (β1) captures the existence of a linear relationship between the response variable and the explanatory variable.
 If β1 = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
 To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero:

H0: β1 = 0
Ha: β1 ≠ 0
52
TESTING FOR SIGNIFICANCE

 The properties of the sampling distribution of b1, the least squares estimator of β1, provide the basis for the hypothesis test.

 We do not know the value of the population standard deviation σ, so the standard deviation of b1, σb1 = σ / √Σ(xi − x̄)², cannot be computed directly.

 We develop an estimate from the sample standard deviation s: sb1 = s / √Σ(xi − x̄)².

53
TESTING FOR SIGNIFICANCE

 The standard deviation of b1 is also referred to as the standard error of b1. Thus, sb1 provides an estimate of the standard error of b1.

The t test for a significant relationship is based on the fact that the test statistic

t = (b1 − E(b1)) / sb1 = (b1 − β1) / sb1

follows a t distribution with n − 2 degrees of freedom. If the null hypothesis is true, then β1 = 0 and t = b1 / sb1.

54
TESTING FOR SIGNIFICANCE

t Test
Hypotheses:  H0: β1 = 0    Ha: β1 ≠ 0

Test Statistic:  t = b1 / sb1

55
TESTING FOR SIGNIFICANCE

t Test
Rejection Rule: Reject H0 if the p-value ≤ α, or if t ≤ −tα/2 or t ≥ +tα/2

where:
tα/2 is based on a t distribution with n − 2 degrees of freedom

56
EXAMPLE
 Reed Auto periodically has a special week-long sale. As part of the advertising
campaign Reed runs one or more television commercials during the weekend
preceding the sale. Here are the data from a sample of 5 previous sales:

57
ESTIMATED REGRESSION EQUATION
 Slope for the estimated regression equation

 y-intercept for the estimated regression equation

 Estimated Regression Equation:

58
SCATTER DIAGRAM & ESTIMATED REGRESSION EQUATION

59
TESTING FOR SIGNIFICANCE: EXAMPLE
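Assuming the Reed Auto figures above (SSE = 14, n = 5, b1 = 5, Σ(xi − x̄)² = 4), the worked computation is:

s² = MSE = SSE / (n − 2) = 14 / 3 = 4.667,  s = 2.16

sb1 = s / √Σ(xi − x̄)² = 2.16 / √4 = 1.08

t = b1 / sb1 = 5 / 1.08 = 4.63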

60
TESTING FOR SIGNIFICANCE: EXAMPLE

Two-Tail Hypothesis Testing

[Figure: t distribution with the middle 1 − α of all t values between −tα/2 and +tα/2; rejection regions in both tails]
61
TESTING FOR SIGNIFICANCE: EXAMPLE

P-value
p = 2 × P(t ≥ 4.63) ≈ 2 × 0.01 = 0.02
Since p is less than 0.05, we reject the null hypothesis.

Critical Value
t = 4.63 > 3.18, so we reject the null hypothesis.

62
CONFIDENCE INTERVAL FOR β1

 We can use a 95% confidence interval for β1 to test the hypotheses just used in the t test.
 H0 is rejected if the hypothesized value of β1 is not included in the confidence interval for β1.
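The interval takes the standard form b1 ± tα/2 · sb1. With the reconstructed Reed Auto figures (b1 = 5, sb1 = 1.08, t0.025 with 3 degrees of freedom = 3.182):

5 ± 3.182(1.08) = 5 ± 3.44, i.e., (1.56, 8.44)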

63
CONFIDENCE INTERVAL FOR β1

64
CONFIDENCE INTERVAL FOR β1

 Rejection Rule:
Reject H0 if 0 is not included in the confidence interval for β1.

 95% Confidence Interval for β1

 Conclusion:
0 is not included in the confidence interval. Reject H0.

65
TESTING FOR SIGNIFICANCE: F TEST
The null and alternative hypotheses for the F-test are given by
H0: There is no statistically significant relationship between Y and any of the explanatory variables (i.e., all regression coefficients are zero).
HA: Not all regression coefficients are zero.
 Alternatively:
H0: All regression coefficients are equal to zero
HA: Not all regression coefficients are equal to zero

 The F-statistic is given by

F = MSR / MSE = (SSR / 1) / (SSE / (n − 2))
66
TESTING FOR SIGNIFICANCE: F TEST

 Hypotheses:  H0: β1 = 0    Ha: β1 ≠ 0

 Test Statistic:  F = MSR / MSE = (SSR / 1) / (SSE / (n − 2))

 Rejection Rule: Reject H0 if the p-value ≤ α or if F ≥ Fα, where Fα is based on an F distribution with 1 numerator degree of freedom and n − 2 denominator degrees of freedom

67
TESTING FOR SIGNIFICANCE: F TEST

68
TESTING FOR SIGNIFICANCE: F TEST

 Compute the value of the test statistic:

F = MSR / MSE = (100 / 1) / (14 / 3) = 21.43 (using SSR = 100 and SSE = 14 from the reconstructed Reed Auto data)

 Determine whether to reject H0:

The critical value F = 17.44 cuts off an area of 0.025 in the upper tail of the F distribution with 1 and 3 degrees of freedom. Thus, the p-value corresponding to F = 21.43 is less than 0.025. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant relationship between the number of TV ads aired and the number of cars sold.

69
RESIDUAL ANALYSIS

Residual (error) analysis is important to check whether the assumptions of regression models have been satisfied. It is performed to check the following:

 The residuals (Yi − Ŷi) are normally distributed.
 The variance of the residuals is constant (homoscedasticity).
 The functional form of the regression is correctly specified.
 Whether there are any outliers.

70
CHECKING FOR NORMAL DISTRIBUTION OF RESIDUALS
 The easiest technique to check whether the residuals follow a normal distribution is the P-P plot (Probability-Probability plot).
 The P-P plot compares the cumulative distribution functions of two probability distributions against each other.
 The diagonal line is the cumulative distribution of a normal distribution, whereas the dots represent the cumulative distribution of the residuals.

71
TEST OF HOMOSCEDASTICITY
 An important assumption of the regression model is that the residuals have constant variance (homoscedasticity) across different values of the explanatory variable (X).
 That is, the variance of the residuals is assumed to be independent of X. Failure to meet this assumption will make the hypothesis tests unreliable.

72
TESTING THE FUNCTIONAL FORM OF REGRESSION MODEL
 Any pattern in the residual plot would indicate incorrect specification
(misspecification) of the model.

73
OUTLIER ANALYSIS
 Outliers are observations whose values show a large deviation from the mean value, that is, (Yi − Ȳ) is large.
 The presence of an outlier can have a significant influence on the values of the regression coefficients. Thus, it is important to identify the existence of outliers in the data.

74
VALIDATION OF THE SIMPLE LINEAR REGRESSION MODEL

 The following measures are used to validate the simple linear regression models:
1. Coefficient of determination (R-square).
2. Hypothesis test for the regression coefficient.
3. Analysis of Variance for overall model validity (relevant more for multiple linear
regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis.

75
USING THE ESTIMATED REGRESSION EQUATION FOR ESTIMATION
AND PREDICTION

 A confidence interval is an interval estimate of the mean value of y for a given value of x.
 A prediction interval is used whenever we want to predict an individual value of y for a new observation corresponding to a given value of x.
 The margin of error is larger for a prediction interval.

76
USING THE ESTIMATED REGRESSION EQUATION FOR ESTIMATION
AND PREDICTION

 Confidence Interval Estimate of E(y*)

 Prediction Interval Estimate of y*
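In standard form, with s the residual standard error and x* the given value of x:

Confidence interval:  ŷ* ± tα/2 · s · √( 1/n + (x* − x̄)² / Σ(xi − x̄)² )

Prediction interval:  ŷ* ± tα/2 · s · √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² )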

77
EXAMPLE
 Reed Auto periodically has a special week-long sale. As part of the advertising
campaign Reed runs one or more television commercials during the weekend
preceding the sale. Here are the data from a sample of 5 previous sales:

78
ESTIMATED REGRESSION EQUATION
 Slope for the estimated regression equation

 y-intercept for the estimated regression equation

 Estimated Regression Equation:

79
POINT ESTIMATION

If 3 TV ads are run prior to a sale, we expect the mean number of cars sold to be:
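Using the estimated equation reconstructed earlier, ŷ = 10 + 5x:

ŷ = 10 + 5(3) = 25 cars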

80
CONFIDENCE INTERVAL
Estimate of the Standard Deviation of ŷ*:
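In standard form, evaluated with the reconstructed Reed Auto figures (s = 2.16, n = 5, x* = 3, x̄ = 2, Σ(xi − x̄)² = 4):

sŷ* = s · √( 1/n + (x* − x̄)² / Σ(xi − x̄)² ) = 2.16 · √(1/5 + 1/4) ≈ 1.4491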

81
CONFIDENCE INTERVAL

The 95% confidence interval estimate of the mean number of cars sold when 3 TV
ads are run is:
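With ŷ* = 25, sŷ* = 1.4491, and t0.025 (3 degrees of freedom) = 3.182, the reconstructed interval is:

25 ± 3.182(1.4491) = 25 ± 4.61, i.e., roughly 20.39 to 29.61 cars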

82
PREDICTION INTERVAL

Estimate of the Standard Deviation of an Individual Value of y*
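In standard form, evaluated with the same reconstructed figures:

sind = s · √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² ) = 2.16 · √(1 + 1/5 + 1/4) ≈ 2.6013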

83
PREDICTION INTERVAL
The 95% prediction interval estimate of the number of cars sold in one particular
week when 3 TV ads are run is:
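With the same reconstructed figures:

25 ± 3.182(2.6013) = 25 ± 8.28, i.e., roughly 16.72 to 33.28 cars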

84
FRAMEWORK FOR SLR MODEL DEVELOPMENT

86
PYTHON CODE

# Import necessary libraries
import numpy as np                # NumPy: working with arrays
import pandas as pd               # Pandas: data analysis library providing two primary data structures for storing and manipulating data, Series and DataFrame
import matplotlib.pyplot as plt   # Matplotlib: plotting

87
PYTHON CODE
ad_data = pd.read_csv('Advertising.csv', index_col='Unnamed: 0')
ad_data.info()
ad_data.head()

     TV      Radio    Newspaper    Sales
1    230.1   37.8     69.2         22.1
2    44.5    39.3     45.1         10.4
3    17.2    45.9     69.3         9.3
4    151.5   41.3     58.5         18.5
5    180.8   10.8     58.4         12.9

The sales are in thousands of units and the budget is in thousands of dollars.

88
LINEARITY CHECK

# visualize the relationship between the features and the response using scatter plots
import seaborn as sns   # Seaborn: statistical data visualization
p = sns.pairplot(ad_data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7)

89
SCATTER PLOT

X = ad_data.drop(["Sales","Radio","Newspaper"],axis= 1)
Y = ad_data.Sales

# scatter plot
plt.scatter(X, Y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()

90
PREDICTION
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# sklearn: Python library to implement machine learning models and statistical modelling
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
model = LinearRegression()

# Train the model using the training sets
model.fit(X_train, Y_train)

# Make predictions using the testing set
y_pred = model.predict(X_test)
91
MODEL PERFORMANCE
# Calculate and print the model performance metrics
from sklearn.metrics import r2_score, mean_squared_error

mse = mean_squared_error(Y_test, y_pred)
r2 = r2_score(Y_test, y_pred)

# Best fit line
plt.scatter(X_train, Y_train)
plt.plot(X_test, y_pred, color='Blue', marker='o')

# Results
print("Mean Squared Error : ", mse)
print("R-Squared :", r2)
print("Y-intercept :", model.intercept_)
print("Slope :", model.coef_)

OUTPUT
Mean Squared Error : 10.18618193453022
R-Squared : 0.6763151577939721
Y-intercept : 7.292493773559364
Slope : [0.04600779]

92
RESIDUAL

residuals = Y_test - y_pred


mean_residuals = np.mean(residuals)
print("Mean of Residuals {}".format(mean_residuals))

OUTPUT
Mean of Residuals -0.1754708813050691

93
RESIDUAL PLOT
CHECK FOR HOMOSCEDASTICITY

p = plt.scatter(y_pred, residuals)
plt.xlabel('y_pred/predicted values')
plt.ylabel('Residuals')
plt.ylim(-10, 10)
plt.xlim(0, 30)
p = plt.plot([0, 30], [0, 0], color='blue')
p = plt.title('Residuals vs fitted values plot for homoscedasticity check')

OUTPUT

94
HOMOSCEDASTICITY CHECK - GOLDFELD QUANDT TEST

import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(residuals, X_test)
lzip(name, test)

OUTPUT
[('F statistic', 1.4054961974172953), ('p-value', 0.23256111674013064)]

Since the p-value is more than 0.05 in the Goldfeld-Quandt test, we can't reject its null hypothesis that the error terms are homoscedastic.

95
WHAT TO DO IF THERE IS HETEROSCEDASTICITY?

 Outlier removal
 Change the functional form of the model
 Use log-transformation or polynomials
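For example, a minimal sketch of the log-transformation option (assuming the X_train/Y_train/X_test/Y_test split from the earlier slides):

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on a log-transformed response to stabilize the residual variance
log_model = LinearRegression()
log_model.fit(X_train, np.log(Y_train))

# Back-transform predictions to the original scale before computing residuals
y_pred_log = np.exp(log_model.predict(X_test))
residuals_log = Y_test - y_pred_log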

96
CHECK FOR NORMALITY OF ERROR TERMS/ RESIDUALS
p = sns.displot(data=None, x=residuals, kde=True)
p = plt.title('Normality of error terms/residuals')

OUTPUT

97
CHECK FOR NORMALITY OF ERROR TERMS/ RESIDUALS

# Import library
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("MODEL Residuals P-P Plot")
plt.legend(['Actual', 'Theoretical'])

OUTPUT

98
OLS

import pandas as pd
import statsmodels.api as sm

ad_data = pd.read_csv('Advertising.csv', index_col='Unnamed: 0')

X = ad_data.drop(["Sales", "Radio", "Newspaper"], axis=1)
Y = ad_data.Sales

# Add a constant term to the features (required for OLS regression)
X = sm.add_constant(X)

# Fit OLS regression model
model = sm.OLS(Y, X).fit()

# Print model summary
print(model.summary())
99
RESULTS

100
Assignment
Package Pricing at the Die Another Day (DAD) Hospital

101
