Simple Regression
[Scatter plots of SalePrice against LotArea, LivingArea, and Age]
Covariance

Cov(X, Y) = S_XY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)
              SalePrice      LivingArea     LotArea        Age
SalePrice     6306788585
LivingArea    0.708624478    1
LotArea       0.263843354    0.263116167    1
Age           -0.523350418   -0.200302496   -0.014831559   1
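The covariance and correlation definitions above can be checked with a minimal Python sketch. The two lists here are small illustrative values, not the housing data from the slides:

```python
# Sample covariance and correlation, computed from the definitions.
# x and y are small illustrative lists (assumed data, not the slide's).
from math import sqrt

def covariance(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    # Sum of cross-deviations, divided by n - 1 (sample covariance)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    # Standardize by the two standard deviations, so r is dimensionless
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(covariance(x, y))   # 1.5
print(correlation(x, y))
```

Note that Cov(X, X) is simply the sample variance, which is why the diagonal of a correlation matrix is always 1.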
Probable Error

• P.E. = 0.6745 (1 − r²) / √N
• If r < P.E., then the correlation is not significant
• If r > 6 P.E., then the correlation is certain
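The rule of thumb above is easy to apply in code. This sketch uses r = 0.7086 (the SalePrice vs. LivingArea correlation from the table) but a hypothetical sample size N = 100, since the slides do not state N:

```python
# Probable error of a correlation coefficient and the 6*P.E. rule.
# r is taken from the correlation table above; N = 100 is assumed.
from math import sqrt

def probable_error(r, n):
    return 0.6745 * (1 - r**2) / sqrt(n)

r, n = 0.7086, 100
pe = probable_error(r, n)
print(pe)
print(r > 6 * pe)  # correlation "certain" under the rule of thumb
```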
Correlation
• Correlation is dimensionless - it is standardized using standard deviations
• It always takes a value between -1 and 1
• Close to +1 implies a linear relationship with a positive slope
• Close to -1 implies a linear relationship with a negative slope
• Close to 0 implies that there is no linear relationship
• Correlation describes the direction and strength of the relationship but cannot
readily be used for “predictive” purposes
• e.g. If the annual salary goes up by a $1000, how much do we expect the
entertainment spending to change?
Correlation
• Correlation is not the same as causation; an observed correlation can result from
“lurking” variables
• e.g. Fire damage and number of fire engines
• Correlation captures the association between two
variables at a time
• Cause-effect relationship is captured by Regression
Regression Analysis
• How to estimate a linear fit?
• How to interpret the slope and the intercept of the fitted
line?
• How to quantify the goodness of fit?
Learning Objectives

[Scatter plot: PBT vs. Ad Exp]
Interpretation of b0 and b1
[Scatter plot: PBT vs. AdEx]
𝑌𝑖 = 𝑏0 + 𝑏1 𝑋𝑖 + 𝑒𝑖
[Scatter plot: PBT vs. AdEx, showing the observed points, the estimated line Ŷᵢ = b₀ + b₁Xᵢ, and the deviations eᵢ between observed and estimated values]
Choosing the Right Line
• Choose b₀ and b₁ such that they minimize the sum of squared residuals

  min over b₀, b₁ of  Σᵢ₌₁ⁿ eᵢ²
• b0 = - 310.62; b1 = 7.0679
• PBT = -310.62 + 7.0679 AdEx
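The least-squares estimates have a closed form: b₁ = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)² and b₀ = Ȳ − b₁X̄. A minimal sketch on a tiny illustrative data set (the raw AdEx/PBT observations are not given in the slides):

```python
# Closed-form ordinary least squares for simple regression.
# x and y are assumed illustrative values chosen to fit exactly.
def ols(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope minimizing the sum of squared residuals
    b0 = ybar - b1 * xbar    # intercept: the fitted line passes through (x̄, ȳ)
    return b0, b1

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
b0, b1 = ols(x, y)
print(b0, b1)  # exact fit: intercept 1.0, slope 2.0
```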
[Plot: PBT vs. AdEx showing the mean Ȳ, the deviations (Yᵢ − Ȳ), and the fitted line Ŷᵢ = b₀ + b₁Xᵢ]
RMSE

RMSE = √( SSE / (n − 2) ) = √( (e₁² + e₂² + … + eₙ²) / (n − 2) )
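The formula can be verified against the ANOVA figures from this example (SSE = 1107.25, n = 10):

```python
# RMSE with n - 2 degrees of freedom, using SSE and n from the slides.
from math import sqrt

sse, n = 1107.25, 10
mse = sse / (n - 2)   # 138.41 to two decimals, matching the ANOVA table
rmse = sqrt(mse)      # 11.7646, matching the reported Standard Error
print(mse, rmse)
```

Two degrees of freedom are lost because two parameters (b₀ and b₁) were estimated from the data.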
                          Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept                 -310.6173     62.9636         -4.9333  0.0011   -455.8115  -165.4230
Advertising Expenditure   7.0679        0.9243          7.6466   0.0001   4.9364     9.1994
Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

ANOVA
            df   SS        MS        F       Significance F
Regression  1    8092.75   8092.75   58.47   6.04E-05
Residual    8    1107.25   138.41
Total       9    9200.00
Regression Statistics
R Square       0.8662
Observations   20

ANOVA
            df   SS           MS         F       Significance F
Regression  1    48674530.5   48674530   116.6   2.7102E-09
Residual    18   7515863.32   417548
Total       19   56190393.8

R² = SSR / SST = 48674530.5 / 56190393.8 = 0.8662
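The R² computation is just the ratio of the ANOVA sums of squares, as a quick sketch confirms using the numbers reported above:

```python
# R^2 from the ANOVA decomposition: SST = SSR + SSE.
ssr = 48674530.5      # regression sum of squares (from the ANOVA table)
sse = 7515863.32      # residual sum of squares
sst = 56190393.8      # total sum of squares
r_squared = ssr / sst
print(r_squared)                       # about 0.8662
print(abs(sst - (ssr + sse)) < 1.0)    # decomposition holds (to rounding)
```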
• Use a linear equation to model the population relationship between the variables
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜀
(b₀ − β₀) / SE(b₀) ~ T₍ₙ₋₂₎   and   (b₁ − β₁) / SE(b₁) ~ T₍ₙ₋₂₎
• We can use these distributions for making inferences about the relationship between
X and Y in the population
• Confidence intervals
• Hypothesis tests
• We can also use these distributions to construct prediction intervals for values of the
response variable (Y) for a given value of the predictor variable (X)
Variances and Standard Errors

• Var(eᵢ) = MSE = 138.41
• SE(eᵢ) = Sₑ = RMSE = √MSE = √138.41 = 11.7646
• Var(b₀) = Sₑ² · ΣXᵢ² / (n · Σ(Xᵢ − X̄)²) = 138.4066 × 46402 / (10 × 162) = 3964.41
• SE(b₀) = √3964.41 = 62.9636
• Var(b₁) = Sₑ² / Σ(Xᵢ − X̄)² = 138.4066 / 162 = 0.8544
• SE(b₁) = √0.8544 = 0.9243
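These standard-error formulas reproduce the values in the Excel output, using the quantities given on this slide (Sₑ² = 138.4066, ΣXᵢ² = 46402, Σ(Xᵢ − X̄)² = 162, n = 10):

```python
# Standard errors of the OLS intercept and slope, from the slide's inputs.
from math import sqrt

se2, sum_x2, sxx, n = 138.4066, 46402, 162, 10
var_b0 = se2 * sum_x2 / (n * sxx)   # variance of the intercept estimate
var_b1 = se2 / sxx                  # variance of the slope estimate
se_b0, se_b1 = sqrt(var_b0), sqrt(var_b1)
print(se_b0, se_b1)  # about 62.9636 and 0.9243, matching the output table
```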
Inference (II): Confidence Intervals
• We can use the point estimate b₁ and SE(b₁) to construct a confidence interval
for the slope parameter β₁
• What is the 95% confidence interval for the slope parameter?
• Confidence interval for the mean response at a given X₀:

  Ŷ ± t₍α/2, df₎ · RMSE · √( 1/n + (X₀ − X̄)² / Σᵢ₌₁ⁿ (Xᵢ − X̄)² )
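For the slope itself, the 95% interval is b₁ ± t₍0.025, n−2₎ · SE(b₁). A sketch using the estimates above and the tabulated critical value t₍0.025, 8₎ = 2.306 (from t-tables, since the slides do not print it):

```python
# 95% confidence interval for the slope, AdEx example (df = 10 - 2 = 8).
b1, se_b1 = 7.0679, 0.9243   # point estimate and standard error from the output
t_crit = 2.306               # t critical value for alpha/2 = 0.025, df = 8
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(lower, upper)  # about 4.94 and 9.20, matching the Lower/Upper 95% columns
```

Because the interval excludes zero, the slope is statistically significant at the 5% level, consistent with the reported p-value of 0.0001.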
[Figure: regression line with confidence interval and prediction interval bands]
95% Confidence Intervals (Mean Values) and Prediction Intervals (Individual Values)
1. The error term ε is a random variable with an expected value of zero for a given
value of X: E[ε|X] = 0. Since β₀ and β₁ are constants, the expected value of Y for a
given X is E(Y|X) = β₀ + β₁X. This also implies that the errors are not correlated
(systematically related) with the value of X, i.e., Corr(X, ε) = 0.

2. The variance of ε is constant for all values of X: Var[ε|X] = σε². The variance of
Y about the regression line is the same for all values of X and equals σε²
(homoskedasticity).

3. The values of εᵢ are independent: Corr[εᵢ, εⱼ] = 0. The value of Y for a particular
value of X is not related to the value of Y for another value of X. This condition
will generally be satisfied for a simple random sample (SRS).

4. The error term is normally distributed: (ε|X) ~ N(0, σε²). The dependent variable Y
is then normally distributed for a given value of X, i.e., (Y|X) ~ N(β₀ + β₁X, σε²).
Problem 1: Heteroskedasticity
• Variance of the residuals increases/decreases with the value of the predictor variable
[Scatter plot: Price ($000) vs. Square Feet, and residual plot vs. Square Feet showing increasing spread]
• Standard errors reported by statistical packages are lower than the actual ones
Problem 2: Correlated Errors

• Errors can also be correlated when the data structure is hierarchical or
nested
– e.g., salaries of MBA students across different b-schools and GMAT scores
• Standard errors reported by statistical packages are lower than the actual ones
Problem 3: Departures from Normality
• Check normality using the residuals instead of the original variables

[Histogram of residuals]
Estimated Sales = 190480 − 125190 × Price

[Scatter plots: Sales Volume vs. Avg Price ($), and residuals vs. Avg Price ($)]

Regression Statistics
Multiple R          0.9102
R Square            0.8285
Adjusted R Square   0.8268
Standard Error      6991.41
Observations        104

                Coefficients  Standard Error  t Stat    P-value
Intercept       190483.4      6226.106        30.5943   3.4E-53
Avg Price ($)   -125188.7     5640.396        -22.195   7.74E-41
Log-Log Transformation: Is there a pattern now?
• Obtain the logarithm transform for both sales and average price
• Log-Linear model ln(Y) = β₀ + β₁X + ε: a one-unit change in X is associated with
approximately a 100·β₁% change in Y
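The "100·β₁%" reading is an approximation; the exact percentage change in Y for a one-unit change in X is 100·(e^β₁ − 1), and the two agree closely only when β₁ is small. A sketch with an assumed β₁ = 0.05:

```python
# Approximate vs. exact percentage-change interpretation of a
# log-linear slope. beta1 = 0.05 is an assumed illustrative value.
from math import exp

beta1 = 0.05
approx = 100 * beta1             # rule-of-thumb reading: 5.0%
exact = 100 * (exp(beta1) - 1)   # exact change: about 5.13%
print(approx, exact)
```

For large slopes (say β₁ = 0.5) the gap widens, so the exact form is preferred when precision matters.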