Simple Regression

QM II

Look at the Data

[Scatterplots examining SalePrice together with LotArea, Age, and LivingArea]
Covariance
Cov(X, Y) = S_XY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

             SalePrice      LivingArea     LotArea        Age
SalePrice    6306788585
LivingArea   29561605.19    275940.5035
LotArea      209067774.7    1379088.261    99557412.9
Age          -1256826.986   -3181.799972   -4475.096209   914.4449615

Covariance
• The sign of covariance can be used to understand the direction of relationship
between two variables
• Random variables with zero covariance are uncorrelated
• Independent random variables are always uncorrelated
• Uncorrelated variables are not necessarily independent
• It is difficult to establish the strength of the relationship using covariance
because it depends on the unit of measurement
Covariance application
• Construct W = aX + bY; a, b are any constants; X, Y are two random variables
• Mean: E[W] = aE[X] + bE[Y]
• Variance: Var[W] = a²Var[X] + b²Var[Y] + 2ab·Cov[X, Y]
• The variance of the combination increases or decreases depending on the sign of
the covariance term (+ or −)
• Can be used to optimize portfolio
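As an illustration of these formulas, here is a minimal Python sketch (with made-up return data for two hypothetical assets) that checks the mean and variance of W = aX + bY against direct computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.05, 0.10, size=10_000)            # made-up returns for asset X
Y = 0.5 * X + rng.normal(0.03, 0.08, size=10_000)  # made-up returns for asset Y
a, b = 0.6, 0.4                                    # portfolio weights (any constants)

W = a * X + b * Y

# Moments via the formulas above
mean_formula = a * X.mean() + b * Y.mean()
var_formula = (a**2 * X.var(ddof=1) + b**2 * Y.var(ddof=1)
               + 2 * a * b * np.cov(X, Y)[0, 1])

print(mean_formula, W.mean())      # identical
print(var_formula, W.var(ddof=1))  # identical
```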
Correlation
Correl(X, Y) = r_XY = Cov(X, Y) / (SD(X)·SD(Y)) = S_XY / (S_X · S_Y)

             SalePrice      LivingArea     LotArea        Age
SalePrice    1
LivingArea   0.708624478    1
LotArea      0.263843354    0.263116167    1
Age          -0.523350418   -0.200302496   -0.014831559   1
Probable Error

• P.E. = 0.6745 (1 − r²) / √N
• If r < P.E., then the correlation is not significant
• If r > 6 P.E., then the correlation is certain
Correlation
• Correlation is dimensionless - it is standardized using standard deviations
• It always takes a value between -1 and 1
• Close to +1 implies a linear relationship with a positive slope
• Close to -1 implies a linear relationship with a negative slope
• Close to 0 implies that there is no linear relationship
• Correlation describes the direction and strength of the relationship but cannot
readily be used for “predictive” purposes
• e.g. If annual salary goes up by $1,000, how much do we expect entertainment
spending to change?
Correlation
• Correlation is not the same as causation; can result from
“lurking” variables
• e.g. Fire damage and number of fire engines
• Correlation captures the association between two
variables at a time
• Cause-effect relationship is captured by Regression
Regression Analysis
• How to estimate a linear fit?
• How to interpret the slope and the intercept of the fitted
line?
• How to quantify the goodness of fit?
Learning Objectives

• What is a simple regression model (SRM)?


• How to draw statistical inference about the model parameters?
• How to construct prediction intervals for the response variable?
• What are the key assumptions required on the population for inference
and prediction?
• What important diagnostic checks should be run before interpreting
regression output?
Linear Fit: Beyond Correlation
• The fitted line is denoted by
  Ŷ = b₀ + b₁X
  – b₀ is the intercept
  – b₁ is the slope
  – Ŷ is a (point) estimate or fitted value of Y for a given X value

[Scatterplot of Profit vs. Ad Exp with the fitted line]
Interpretation of b₀ and b₁

[Scatterplot of PBT vs. AdEx with the fitted regression line]
Yᵢ = b₀ + b₁Xᵢ + eᵢ

[Scatterplot of PBT vs. AdEx marking an observed value Yᵢ, the estimated value Ŷᵢ = b₀ + b₁Xᵢ, and the deviation between them]
Choosing the Right Line

• The error in estimation is given by:
  eᵢ = Yᵢ − Ŷᵢ = Yᵢ − (b₀ + b₁Xᵢ)
• The eᵢ are called the residuals

• Choose b₀ and b₁ such that they minimize the sum of squared residuals:
  min over b₀, b₁ of Σᵢ₌₁ⁿ eᵢ²

• Why square the residuals? Squaring keeps positive and negative residuals from
cancelling out and penalizes large errors more heavily.

Normal Equations
• Equation 1: n·b₀ + b₁ Σᵢ₌₁ⁿ Xᵢ = Σᵢ₌₁ⁿ Yᵢ
• Equation 2: b₀ Σᵢ₌₁ⁿ Xᵢ + b₁ Σᵢ₌₁ⁿ Xᵢ² = Σᵢ₌₁ⁿ XᵢYᵢ
For the case data
• Equation 1: 10 b₀ + 680 b₁ = 1700
• Equation 2: 680 b₀ + 46402 b₁ = 116745

• b₀ = −310.62; b₁ = 7.0679
• PBT = −310.62 + 7.0679 AdEx
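As a quick check, the two normal equations for the case data form a 2×2 linear system that can be solved directly; a minimal Python sketch using the numbers from the slide above:

```python
import numpy as np

# Normal equations for the case data, written as A @ [b0, b1] = c
A = np.array([[10.0, 680.0],
              [680.0, 46402.0]])
c = np.array([1700.0, 116745.0])

b0, b1 = np.linalg.solve(A, c)
print(b0, b1)  # approximately -310.62 and 7.0679
```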

Ordinary Least Squares

• The resulting estimators b₀ and b₁ are called the
Ordinary Least Squares (OLS) estimates
• b₁ = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)(Xᵢ − X̄) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = Cov(X, Y) / Var(X) = r_xy (σ_Y / σ_X) = SS_XY / SS_X
• b₀ = Ȳ − b₁X̄
Properties of OLS line
• Useful properties of the OLS linear fit
• Intercept: Regression line passes through (X̄, Ȳ)
• Slope: Cov(X, Y) determines the direction of the line (+, −)
• Sum of the residuals around the best fitted line is zero: Σ eᵢ = 0
How Good is the Model Fit?
Ŷᵢ = −310.62 + 7.0679 Xᵢ

[Scatterplot of PBT vs. AdEx showing, for a point (Xᵢ, Yᵢ), the residual eᵢ = Yᵢ − Ŷᵢ, the mean Ȳ, and the deviation (Ŷᵢ − Ȳ) around the fitted line Ŷᵢ = b₀ + b₁Xᵢ]
Sum of Squares Total (SST) = Sum of Squares Regression (SSR) + Sum of Squares Error (SSE)

Σ (Yᵢ − Ȳ)² = Σ (Ŷᵢ − Ȳ)² + Σ (Yᵢ − Ŷᵢ)²

df: (n − 1) = 1 + (n − 2)

R² = SSR / SST = [Correl(Y, Ŷ)]² = [Correl(Y, X)]²

RMSE = √(SSE / (n − 2)) = √((e₁² + e₂² + … + eₙ²) / (n − 2))
                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -310.6173      62.9636          -4.9333   0.0011    -455.8115   -165.4230
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001    4.9364      9.1994

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

ANOVA
             df   SS        MS        F       Significance F
Regression   1    8092.75   8092.75   58.47   6.04E-05
Residual     8    1107.25   138.41
Total        9    9200.00
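Using the ANOVA numbers above, R² and RMSE (the "Standard Error" in the Excel output) can be recovered directly; a small Python check:

```python
import math

SSR, SSE = 8092.75, 1107.25      # from the ANOVA table
SST = SSR + SSE                  # 9200.00
n = 10                           # observations

r_squared = SSR / SST            # 0.8796 -> "R Square"
rmse = math.sqrt(SSE / (n - 2))  # 11.7646 -> "Standard Error"
print(r_squared, rmse)
```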

Obtaining Model Fit from Excel Output

RMSE = √(SSE / (n − 2)) = √(7515863.32 / 18) = 646.1795

R² = SSR / SST = 48674530.5 / 56190393.8 = 0.8662

Regression Statistics
Multiple R          0.9307
R Square            0.8662
Adjusted R Square   0.8588
Standard Error      646.1795
Observations        20

ANOVA
             df   SS            MS         F       Significance F
Regression   1    48674530.5    48674530   116.6   2.7102E-09
Residual     18   7515863.32    417548
Total        19   56190393.8

Approximately 86% of the variation in PBT is explained by variation in Sales

Simple Linear Regression Model

• Use a linear equation to model the population relationship between the variables
  Y = β₀ + β₁X + ε

• Y → Response variable – the variable we are interested in explaining
  • Also referred to as the target, dependent or outcome variable
• X → Predictor variable – the variable that is useful in explaining Y
  • Also referred to as the explanatory or independent variable

• β₀ and β₁ → parameters of the model

• ε → error term (disturbance or noise)

Sampling Distributions for Intercept and Slope

• If certain assumptions hold, we can find the distributions of b₀ and b₁:

  (b₀ − β₀) / SE(b₀) ~ T(n−2)   and   (b₁ − β₁) / SE(b₁) ~ T(n−2)

• We can use these distributions for making inferences about the relationship between
X and Y in the population
  • Confidence intervals
  • Hypothesis tests
• We can also use these distributions to construct prediction intervals for values of the
response variable (Y) for a given value of the predictor variable (X)
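For reference, the whole coefficient table (estimates, standard errors, t statistics, p-values, confidence intervals) can be reproduced with statsmodels; a minimal sketch using made-up AdEx/PBT numbers standing in for the case data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical AdEx (lakhs) and PBT values standing in for the case dataset
adex = np.array([58, 60, 62, 64, 66, 68, 70, 72, 74, 76], dtype=float)
pbt = np.array([100, 115, 125, 140, 150, 165, 180, 190, 210, 225], dtype=float)

X = sm.add_constant(adex)  # adds the intercept column
fit = sm.OLS(pbt, X).fit()
print(fit.summary())       # coefficients, SEs, t stats, p-values, 95% CIs, ANOVA
```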
Variances and Standard Errors

• Var(eᵢ) = MSE = 138.41
• SE(eᵢ) = Sₑ = RMSE = √MSE = √138.41 = 11.7646

• Var(b₀) = Sₑ² · (Σ Xᵢ²) / (n · Σ(Xᵢ − X̄)²) = 138.4066 × 46402 / (10 × 162) = 3964.41
• SE(b₀) = √3964.41 = 62.9636

• Var(b₁) = Sₑ² / Σ(Xᵢ − X̄)² = 138.4066 × (1 / 162) = 0.8544
• SE(b₁) = √0.8544 = 0.9243

ANOVA
             df   SS        MS
Regression   1    8092.75   8092.75
Residual     8    1107.25   138.41
Total        9    9200.00

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

                          Coefficients   Standard Error
Intercept                 -310.6173      62.9636
Advertising Expenditure   7.0679         0.9243
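These hand computations are easy to verify in code; a short Python sketch using the quantities from the tables above:

```python
import math

mse = 138.4066    # Var(e) from the ANOVA table
n = 10
sum_x_sq = 46402  # Σ Xᵢ²
ssx = 162         # Σ (Xᵢ − X̄)²

se_b0 = math.sqrt(mse * sum_x_sq / (n * ssx))  # ≈ 62.9636
se_b1 = math.sqrt(mse / ssx)                   # ≈ 0.9243
print(se_b0, se_b1)
```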
Inference (II): Confidence Intervals

• We can use the point estimate b₁ and se(b₁) to construct a confidence interval
for the slope parameter β₁
• Given a confidence level (1 − α), the corresponding interval is given by
  b₁ ± t_{α/2, n−2} se(b₁)
• Available from Excel output

• What is the 95% confidence interval for the slope parameter?
  t₀.₀₅,₈ = 2.306
  7.0679 ± 2.306 (0.9243) = {4.9364, 9.1993}

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -310.6173      62.9636          -4.9333   0.0011    -455.8115   -165.4230
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001    4.9364      9.1994
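The same interval can be computed with scipy; a sketch using the slide's numbers:

```python
from scipy import stats

b1, se_b1, df = 7.0679, 0.9243, 8
t_crit = stats.t.ppf(0.975, df)  # 2.306 for a two-sided 95% CI
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(lower, upper)              # ≈ 4.9364, 9.1993
```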
Inference (I): Hypothesis Tests

• We can use the point estimate b₁ and se(b₁) to test hypotheses about a specific
value of the slope parameter
  H₀: β₁ = θ    Hₐ: β₁ ≠ θ
• For the slope parameter, we can test the hypothesis using
  t(n−2) = (b₁ − θ) / SE(b₁)
• If ε is normally distributed, this hypothesis can be tested using a t-statistic
with df = n − 2
• By default, most tools test for θ = 0

                          Coefficients   Standard Error   t Stat    P-value
Intercept                 -310.6173      62.9636          -4.9333   0.0011
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001

• Is there strong enough evidence to conclude that AdEx has an impact on PBT?
• Is there strong enough evidence to conclude that a unit increase in AdEx is
associated with less than 5 unit increase in PBT?
  t = (7.0679 − 5) / 0.9243 = 2.2373,  P-value = 0.0278
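A quick check of the second test's t statistic in scipy; note that the slide's p-value of 0.0278 corresponds to a one-sided (upper-tail) probability, which is an assumption on my part since Hₐ is stated two-sided above:

```python
from scipy import stats

b1, se_b1, theta, df = 7.0679, 0.9243, 5, 8
t = (b1 - theta) / se_b1     # ≈ 2.2373
p_upper = stats.t.sf(t, df)  # ≈ 0.0278, one-sided upper-tail p-value
print(t, p_upper)
```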
Prediction Using the OLS Equation

Interval Estimate: Point Estimate ± Margin of Error

• Confidence interval: What is our estimate for the mean value of Y, given X?
  e.g. Average price of 1800 sqft houses
  – Point estimate: Ŷ from the regression line
  – Margin of error: Uncertainty in the estimate of the regression line
• Prediction interval: What is our estimate for an individual value of Y, given X?
  e.g. Price of a specific 1800 sqft house
  – Point estimate: Ŷ from the regression line
  – Margin of error: Uncertainty in the regression line plus additional uncertainty
from the idiosyncratic errors
Confidence Intervals
• CI for Ŷ as an individual prediction (Prediction Interval)
  Ŷ ± t_{α/2, df} · RMSE · √(1 + 1/n + (X₀ − X̄)² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²)

• CI for the average Ŷ, i.e., μ_{Y|X₀}
  Ŷ ± t_{α/2, df} · RMSE · √(1/n + (X₀ − X̄)² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²)

• Approximation (only as an exception)
  Ŷ ± 2 (RMSE)
Confidence Intervals

• Confidence intervals (mean values) are narrower near the mean of the predictor
and when the sample size is very large, because the sampling variation in b₀ and b₁
is expected to be low

• However, no matter how large the sample size is, there is always an error in
prediction intervals (individual values)
Confidence (mean) and Prediction (individual) Intervals

[Regression line with the narrower 95% confidence intervals (mean values) and the wider prediction intervals (individual values) around it]

“Approximate” Prediction Interval (Individual Values)

• What is the 95% prediction interval for PBT when AdEx = 75 lakhs?
• Ŷ = −310.6173 + 7.0679 (75) = 219.48
• Then, we can approximate the prediction interval by
  Ŷ ± (2.306) 11.7646 √(1 + 1/10 + (75 − 68)²/162)
  = 219.48 ± 2.306 (11.7646)(1.1843)
  = [187.35, 251.60]
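The same interval in code (a sketch; the sample quantities n = 10, X̄ = 68, and Σ(Xᵢ − X̄)² = 162 come from the earlier slides):

```python
import math
from scipy import stats

b0, b1, rmse = -310.6173, 7.0679, 11.7646
n, x_bar, ssx, x0 = 10, 68, 162, 75

y_hat = b0 + b1 * x0                                       # ≈ 219.48
se_pred = rmse * math.sqrt(1 + 1/n + (x0 - x_bar)**2 / ssx)
t_crit = stats.t.ppf(0.975, n - 2)                         # 2.306
print(y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)  # ≈ 187.35, 251.60
```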
“Certain” Assumptions for the Simple Regression Model
• Assuming that the true relationship between Y and X is indeed given by
  Y = β₀ + β₁X + ε

Assumption 1: The error term ε is a random variable with an expected value of zero
for a given value of X, i.e. E[ε|X] = 0
Implication: Since β₀ and β₁ are constants, for a given value of X, the expected
value of Y is E(Y|X) = β₀ + β₁X. This also implies that the errors are not correlated
(systematically related) with the value of X, i.e. Corr(X, ε) = 0

Assumption 2: The variance of ε is a constant for all values of X, i.e. Var[ε|X] = σε²
Implication: The variance of Y about the regression line is the same for all values
of X and equals σε² (Homoskedasticity)

Assumption 3: The values of εᵢ are independent, i.e. Corr[εᵢ, εⱼ] = 0
Implication: The value of Y for a particular value of X is not related to the value
of Y for another value of X. This condition will generally be satisfied for a SRS

Assumption 4: The error term is normally distributed, i.e. (ε|X) ~ N(0, σε²)
Implication: The dependent variable Y is normally distributed for a given value of X,
i.e., (Y|X) ~ N(β₀ + β₁X, σε²)

Visual Interpretation of Assumptions



Diagnostic checks: Using OLS residuals

• We need to check the appropriateness of the following main assumptions
  1. E[ε|X] = 0
  2. Homoskedasticity: Var[ε|X] = σε²
  3. Correlation[εᵢ, εⱼ] = 0 for all i ≠ j
  4. Normality of errors: ε|X ~ N(0, σε²)
• Other key diagnostic checks include
  – Impact of outliers
  – Linear relationship between Y and X
• Violations of these assumptions cause problems, e.g. bias, inefficiency,
incorrect inference
• We can use plots of residuals to get an idea of whether the assumptions are satisfied
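As an illustration, a minimal residual-plot sketch in Python (the data are simulated to stand in for the case dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated stand-in data: a PBT-like response against an AdEx-like predictor
rng = np.random.default_rng(1)
x = np.linspace(55, 75, 40)
y = -310.62 + 7.07 * x + rng.normal(0, 12, size=40)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(x, fit.resid)      # residuals vs. predictor values
plt.axhline(0, linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()                     # a good pattern is "no pattern"
```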

Residuals vs. Predictor Values

[Residual plot: residuals scattered around 0]

A good pattern of residuals is “no pattern”


Problem 1: Heteroskedasticity
• Variance of the residuals increases/decreases with the value of the predictor variable

[Scatterplots: Price ($000) vs. Square Feet, and the corresponding residuals vs. Square Feet with non-constant spread]

• Standard errors reported by statistical packages are lower than the actual ones
Problem 2: Dependence and Autocorrelation

• The errors may be correlated with each other if data were collected over time
  – e.g. returns of a stock over time
• Often shows up as a pattern in the residuals, if plotted in chronological order
• Errors can also be correlated when the data structure is hierarchical or nested
  – e.g. Salary of MBA students across different b-schools and GMAT scores
• Standard errors reported by statistical packages are lower than the actual ones
Problem 3: Departures from Normality

• Construct a quantile plot of the residuals instead of the original variables
• Inferences (hypothesis tests and confidence intervals) work pretty well even when
residuals are not strictly normal

[Normal quantile plot of the residuals]
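A normal quantile (Q-Q) plot of residuals takes one call in scipy; a sketch with simulated residuals:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated residuals standing in for the OLS residuals
resid = np.random.default_rng(2).normal(0, 11.76, size=50)

stats.probplot(resid, dist="norm", plot=plt)  # points near the line suggest normality
plt.show()
```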


Problem 4: Outliers, Leverage Points, Influential Observations

• In the case of regression, outliers (unusual observations) can occur in the y or x
variables
• Unusual observations in the x variable are called leverage points
• Influential observations are those that substantially change the OLS fit depending
on whether they are included or not
• Typically, leverage points are suspect for being influential observations, as OLS
penalizes large errors more (due to squaring)
• Not all leverage points are necessarily influential observations

[Scatterplot with a leverage point; green line – best fit without the leverage point;
red line – best fit with the leverage point]
Problem 5: Linearity Assumption

• In many applications of interest, the relationship between the dependent and the
independent variable might be nonlinear
• Look for a transformation of the data (either X or Y or both variables) such that
the relationship between the transformed variables is approximately linear
• The transformed data should satisfy the OLS assumptions
Example: Pet Food Demand Curve Estimation

• The manager would like to estimate a demand curve, i.e., the relationship between
demand and price
• Data on the prices and weekly sales for a pet food brand over a period of 2 years
• Note: Sales is a censored signal of demand, but the best piece of information that
we have now

OLS estimation (assuming linear relationship)

Estimated Sales = 190480 − 125190 Price

[Scatterplot of Sales Volume vs. Avg Price ($) with the fitted line, and the
corresponding residual plot]

Regression Statistics
Multiple R          0.9102
R Square            0.8285
Adjusted R Square   0.8268
Standard Error      6991.41
Observations        104

                Coefficients   Standard Error   t Stat    P-value
Intercept       190483.4       6226.106         30.5943   3.4E-53
Avg Price ($)   -125188.7      5640.396         -22.195   7.74E-41
Log-Log Transformation: Is there a pattern now?

• Obtain the logarithm transform for both sales and average price
• Calculate OLS estimates for the transformed data
• Examine the residual plot for any nonlinear patterns

Estimated (Ln_SalesVolume) = 11.05 − 2.442 Ln_Price
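A sketch of the log-log fit in Python (the data here are simulated around the slide's estimates, since the original pet-food dataset is not included):

```python
import numpy as np
import statsmodels.api as sm

# Simulated weekly price/sales data standing in for the pet-food dataset
rng = np.random.default_rng(3)
price = rng.uniform(0.6, 1.4, size=104)
sales = np.exp(11.05 - 2.442 * np.log(price) + rng.normal(0, 0.06, size=104))

ln_price, ln_sales = np.log(price), np.log(sales)
fit = sm.OLS(ln_sales, sm.add_constant(ln_price)).fit()
print(fit.params)  # ≈ [11.05, -2.442]; the slope is the price elasticity of demand
```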

Interpreting the Estimates in Log Models

Model        Specification                Interpretation of β₁
Log-Log      ln(Y) = β₀ + β₁ ln(X) + ε    Elasticity: A 1% change in X is associated with a β₁% change in Y
Log-Linear   ln(Y) = β₀ + β₁ X + ε        A one unit change in X is associated with a 100·β₁% change in Y
Linear-Log   Y = β₀ + β₁ ln(X) + ε        A 1% change in X is associated with a 0.01·β₁ change in Y

Regression Statistics
Multiple R          0.9770
R Square            0.9546
Adjusted R Square   0.9541
Standard Error      0.0605
Observations        104

             Coefficients   Standard Error   t Stat     P-value
Intercept    11.0506        0.0075           1477.35    1.1E-222
Ln(Price)    -2.4420        0.0528           -46.2865   2.75E-70
