Simple Regression

QM II

Look at the Data

[Scatterplots examining SalePrice together with LotArea, Age, and LivingArea]
Covariance
Cov(X, Y) = S_XY = Σᵢ₌₁ⁿ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

             SalePrice      LivingArea     LotArea        Age
SalePrice    6306788585
LivingArea   29561605.19    275940.5035
LotArea      209067774.7    1379088.261    99557412.9
Age          -1256826.986   -3181.799972   -4475.096209   914.4449615

Covariance
• The sign of covariance can be used to understand the direction of relationship
between two variables
• Random variables with zero covariance are uncorrelated
• Independent random variables are always uncorrelated
• Uncorrelated variables are not necessarily independent
• It is difficult to establish the strength of the relationship using covariance
because it depends on the unit of measurement
Covariance application
• Construct W = aX + bY; a, b are any constants; X, Y are two random variables
• Mean: E[W] = aE[X] + bE[Y]
• Variance: Var[W] = a²Var[X] + b²Var[Y] + 2ab·Cov[X, Y]
• The variance of the combination increases or decreases depending on the sign of
the covariance term (+ or −)
• Can be used to optimize portfolio
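As an illustration of these formulas, here is a minimal Python sketch (with made-up return data for two hypothetical assets) that checks the mean and variance of W = aX + bY against direct computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.05, 0.10, size=10_000)            # made-up returns for asset X
Y = 0.5 * X + rng.normal(0.03, 0.08, size=10_000)  # made-up returns for asset Y
a, b = 0.6, 0.4                                    # portfolio weights (any constants)

W = a * X + b * Y

# Moments via the formulas above
mean_formula = a * X.mean() + b * Y.mean()
var_formula = (a**2 * X.var(ddof=1) + b**2 * Y.var(ddof=1)
               + 2 * a * b * np.cov(X, Y)[0, 1])

print(mean_formula, W.mean())      # identical
print(var_formula, W.var(ddof=1))  # identical
```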
Correlation
Correl(X, Y) = r_XY = Cov(X, Y) / (SD(X)·SD(Y)) = S_XY / (S_X · S_Y)

             SalePrice      LivingArea     LotArea        Age
SalePrice    1
LivingArea   0.708624478    1
LotArea      0.263843354    0.263116167    1
Age          -0.523350418   -0.200302496   -0.014831559   1
Probable Error

• P.E. = 0.6745 (1 − r²) / √N
• If r < P.E., then the correlation is not significant
• If r > 6 P.E., then the correlation is certain
Correlation
• Correlation is dimensionless - it is standardized using standard deviations
• It always takes a value between -1 and 1
• Close to +1 implies a linear relationship with a positive slope
• Close to -1 implies a linear relationship with a negative slope
• Close to 0 implies that there is no linear relationship
• Correlation describes the direction and strength of the relationship but cannot
readily be used for “predictive” purposes
• e.g. If annual salary goes up by $1,000, how much do we expect entertainment
spending to change?
Correlation
• Correlation is not the same as causation; can result from
“lurking” variables
• e.g. Fire damage and number of fire engines
• Correlation captures the association between two
variables at a time
• Cause-effect relationship is captured by Regression
Regression Analysis
• How to estimate a linear fit?
• How to interpret the slope and the intercept of the fitted
line?
• How to quantify the goodness of fit?
Learning Objectives

• What is a simple regression model (SRM)?


• How to draw statistical inference about the model parameters?
• How to construct prediction intervals for the response variable?
• What are the key assumptions required on the population for inference
and prediction?
• What important diagnostic checks should be run before interpreting
regression output?
Linear Fit: Beyond Correlation
• The fitted line is denoted by
  Ŷ = b₀ + b₁X
  – b₀ is the intercept
  – b₁ is the slope
  – Ŷ is a (point) estimate or fitted value of Y for a given X value

[Scatterplot of Profit vs. Ad Exp with the fitted line]
Interpretation of b₀ and b₁

[Scatterplot of PBT vs. AdEx with the fitted regression line]
Yᵢ = b₀ + b₁Xᵢ + eᵢ

[Scatterplot of PBT vs. AdEx marking an observed value Yᵢ, the estimated value Ŷᵢ = b₀ + b₁Xᵢ, and the deviation between them]
Choosing the Right Line

• The error in estimation is given by:
  eᵢ = Yᵢ − Ŷᵢ = Yᵢ − (b₀ + b₁Xᵢ)
• The eᵢ are called the residuals

• Choose b₀ and b₁ such that they minimize the sum of squared residuals:
  min over b₀, b₁ of Σᵢ₌₁ⁿ eᵢ²

• Why square the residuals? Squaring keeps positive and negative residuals from
cancelling out and penalizes large errors more heavily.

Normal Equations
• Equation 1: n·b₀ + b₁ Σᵢ₌₁ⁿ Xᵢ = Σᵢ₌₁ⁿ Yᵢ
• Equation 2: b₀ Σᵢ₌₁ⁿ Xᵢ + b₁ Σᵢ₌₁ⁿ Xᵢ² = Σᵢ₌₁ⁿ XᵢYᵢ
For the case data
• Equation 1: 10 b₀ + 680 b₁ = 1700
• Equation 2: 680 b₀ + 46402 b₁ = 116745

• b₀ = −310.62; b₁ = 7.0679
• PBT = −310.62 + 7.0679 AdEx
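As a quick check, the two normal equations for the case data form a 2×2 linear system that can be solved directly; a minimal Python sketch using the numbers from the slide above:

```python
import numpy as np

# Normal equations for the case data, written as A @ [b0, b1] = c
A = np.array([[10.0, 680.0],
              [680.0, 46402.0]])
c = np.array([1700.0, 116745.0])

b0, b1 = np.linalg.solve(A, c)
print(b0, b1)  # approximately -310.62 and 7.0679
```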

Ordinary Least Squares

• The resulting estimators b₀ and b₁ are called the
Ordinary Least Squares (OLS) estimates
• b₁ = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)(Xᵢ − X̄) / Σᵢ₌₁ⁿ (Xᵢ − X̄)² = Cov(X, Y) / Var(X) = r_xy (σ_Y / σ_X) = SS_XY / SS_X
• b₀ = Ȳ − b₁X̄
Properties of OLS line
• Useful properties of the OLS linear fit
• Intercept: Regression line passes through (X̄, Ȳ)
• Slope: Cov(X, Y) determines the direction of the line (+, −)
• Sum of the residuals around the best fitted line is zero: Σ eᵢ = 0
How Good is the Model Fit?
Ŷᵢ = −310.62 + 7.0679 Xᵢ

[Scatterplot of PBT vs. AdEx showing, for a point (Xᵢ, Yᵢ), the residual eᵢ = Yᵢ − Ŷᵢ, the mean Ȳ, and the deviation (Ŷᵢ − Ȳ) around the fitted line Ŷᵢ = b₀ + b₁Xᵢ]
Sum of Squares Total (SST) = Sum of Squares Regression (SSR) + Sum of Squares Error (SSE)

Σ (Yᵢ − Ȳ)² = Σ (Ŷᵢ − Ȳ)² + Σ (Yᵢ − Ŷᵢ)²

df: (n − 1) = 1 + (n − 2)

R² = SSR / SST = [Correl(Y, Ŷ)]² = [Correl(Y, X)]²

RMSE = √(SSE / (n − 2)) = √((e₁² + e₂² + … + eₙ²) / (n − 2))
                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -310.6173      62.9636          -4.9333   0.0011    -455.8115   -165.4230
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001    4.9364      9.1994

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

ANOVA
             df   SS        MS        F       Significance F
Regression   1    8092.75   8092.75   58.47   6.04E-05
Residual     8    1107.25   138.41
Total        9    9200.00
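Using the ANOVA numbers above, R² and RMSE (the "Standard Error" in the Excel output) can be recovered directly; a small Python check:

```python
import math

SSR, SSE = 8092.75, 1107.25      # from the ANOVA table
SST = SSR + SSE                  # 9200.00
n = 10                           # observations

r_squared = SSR / SST            # 0.8796 -> "R Square"
rmse = math.sqrt(SSE / (n - 2))  # 11.7646 -> "Standard Error"
print(r_squared, rmse)
```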

Obtaining Model Fit from Excel Output

RMSE = √(SSE / (n − 2)) = √(7515863.32 / 18) = 646.1795

R² = SSR / SST = 48674530.5 / 56190393.8 = 0.8662

Regression Statistics
Multiple R          0.9307
R Square            0.8662
Adjusted R Square   0.8588
Standard Error      646.1795
Observations        20

ANOVA
             df   SS            MS         F       Significance F
Regression   1    48674530.5    48674530   116.6   2.7102E-09
Residual     18   7515863.32    417548
Total        19   56190393.8

Approximately 86% of the variation in PBT is explained by variation in Sales

Simple Linear Regression Model

• Use a linear equation to model the population relationship between the variables
  Y = β₀ + β₁X + ε

• Y → Response variable – the variable we are interested in explaining
  • Also referred to as the target, dependent or outcome variable
• X → Predictor variable – the variable that is useful in explaining Y
  • Also referred to as the explanatory or independent variable

• β₀ and β₁ → parameters of the model

• ε → error term (disturbance or noise)

Sampling Distributions for Intercept and Slope

• If certain assumptions hold, we can find the distributions of b₀ and b₁:

  (b₀ − β₀) / SE(b₀) ~ T(n−2)   and   (b₁ − β₁) / SE(b₁) ~ T(n−2)

• We can use these distributions for making inferences about the relationship between
X and Y in the population
  • Confidence intervals
  • Hypothesis tests
• We can also use these distributions to construct prediction intervals for values of the
response variable (Y) for a given value of the predictor variable (X)
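For reference, the whole coefficient table (estimates, standard errors, t statistics, p-values, confidence intervals) can be reproduced with statsmodels; a minimal sketch using made-up AdEx/PBT numbers standing in for the case data:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical AdEx (lakhs) and PBT values standing in for the case dataset
adex = np.array([58, 60, 62, 64, 66, 68, 70, 72, 74, 76], dtype=float)
pbt = np.array([100, 115, 125, 140, 150, 165, 180, 190, 210, 225], dtype=float)

X = sm.add_constant(adex)  # adds the intercept column
fit = sm.OLS(pbt, X).fit()
print(fit.summary())       # coefficients, SEs, t stats, p-values, 95% CIs, ANOVA
```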
Variances and Standard Errors

• Var(eᵢ) = MSE = 138.41
• SE(eᵢ) = Sₑ = RMSE = √MSE = √138.41 = 11.7646

• Var(b₀) = Sₑ² · (Σ Xᵢ²) / (n · Σ(Xᵢ − X̄)²) = 138.4066 × 46402 / (10 × 162) = 3964.41
• SE(b₀) = √3964.41 = 62.9636

• Var(b₁) = Sₑ² / Σ(Xᵢ − X̄)² = 138.4066 × (1 / 162) = 0.8544
• SE(b₁) = √0.8544 = 0.9243

ANOVA
             df   SS        MS
Regression   1    8092.75   8092.75
Residual     8    1107.25   138.41
Total        9    9200.00

Regression Statistics
Multiple R          0.9379
R Square            0.8796
Adjusted R Square   0.8646
Standard Error      11.7646
Observations        10

                          Coefficients   Standard Error
Intercept                 -310.6173      62.9636
Advertising Expenditure   7.0679         0.9243
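These hand computations are easy to verify in code; a short Python sketch using the quantities from the tables above:

```python
import math

mse = 138.4066    # Var(e) from the ANOVA table
n = 10
sum_x_sq = 46402  # Σ Xᵢ²
ssx = 162         # Σ (Xᵢ − X̄)²

se_b0 = math.sqrt(mse * sum_x_sq / (n * ssx))  # ≈ 62.9636
se_b1 = math.sqrt(mse / ssx)                   # ≈ 0.9243
print(se_b0, se_b1)
```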
Inference (II): Confidence Intervals

• We can use the point estimate b₁ and se(b₁) to construct a confidence interval
for the slope parameter β₁
• Given a confidence level (1 − α), the corresponding interval is given by
  b₁ ± t_{α/2, n−2} se(b₁)
• Available from Excel output

• What is the 95% confidence interval for the slope parameter?
  t₀.₀₅,₈ = 2.306
  7.0679 ± 2.306 (0.9243) = {4.9364, 9.1993}

                          Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept                 -310.6173      62.9636          -4.9333   0.0011    -455.8115   -165.4230
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001    4.9364      9.1994
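The same interval can be computed with scipy; a sketch using the slide's numbers:

```python
from scipy import stats

b1, se_b1, df = 7.0679, 0.9243, 8
t_crit = stats.t.ppf(0.975, df)  # 2.306 for a two-sided 95% CI
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(lower, upper)              # ≈ 4.9364, 9.1993
```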
Inference (I): Hypothesis Tests

• We can use the point estimate b₁ and se(b₁) to test hypotheses about a specific
value of the slope parameter
  H₀: β₁ = θ    Hₐ: β₁ ≠ θ
• For the slope parameter, we can test the hypothesis using
  t(n−2) = (b₁ − θ) / SE(b₁)
• If ε is normally distributed, this hypothesis can be tested using a t-statistic
with df = n − 2
• By default, most tools test for θ = 0

                          Coefficients   Standard Error   t Stat    P-value
Intercept                 -310.6173      62.9636          -4.9333   0.0011
Advertising Expenditure   7.0679         0.9243           7.6466    0.0001

• Is there strong enough evidence to conclude that AdEx has an impact on PBT?
• Is there strong enough evidence to conclude that a unit increase in AdEx is
associated with less than 5 unit increase in PBT?
  t = (7.0679 − 5) / 0.9243 = 2.2373,  P-value = 0.0278
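A quick check of the second test's t statistic in scipy; note that the slide's p-value of 0.0278 corresponds to a one-sided (upper-tail) probability, which is an assumption on my part since Hₐ is stated two-sided above:

```python
from scipy import stats

b1, se_b1, theta, df = 7.0679, 0.9243, 5, 8
t = (b1 - theta) / se_b1     # ≈ 2.2373
p_upper = stats.t.sf(t, df)  # ≈ 0.0278, one-sided upper-tail p-value
print(t, p_upper)
```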
Prediction Using the OLS Equation

Interval Estimate: Point Estimate ± Margin of Error

• Confidence interval: What is our estimate for the mean value of Y, given X?
  e.g. Average price of 1800 sqft houses
  – Point estimate: Ŷ from the regression line
  – Margin of error: Uncertainty in the estimate of the regression line
• Prediction interval: What is our estimate for an individual value of Y, given X?
  e.g. Price of a specific 1800 sqft house
  – Point estimate: Ŷ from the regression line
  – Margin of error: Uncertainty in the regression line plus additional uncertainty
from the idiosyncratic errors
Confidence Intervals
• CI for Ŷ as an individual prediction (Prediction Interval)
  Ŷ ± t_{α/2, df} · RMSE · √(1 + 1/n + (X₀ − X̄)² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²)

• CI for the average Ŷ, i.e., μ_{Y|X₀}
  Ŷ ± t_{α/2, df} · RMSE · √(1/n + (X₀ − X̄)² / Σᵢ₌₁ⁿ (Xᵢ − X̄)²)

• Approximation (only as an exception)
  Ŷ ± 2 (RMSE)
Confidence Intervals

• Confidence intervals (mean values) are narrower near the mean of the predictor
and when the sample size is very large, because the sampling variation in b₀ and b₁
is expected to be low

• However, no matter how large the sample size is, there is always an error in
prediction intervals (individual values)
Confidence (mean) and Prediction (individual) Intervals

[Regression line with the narrower 95% confidence intervals (mean values) and the wider prediction intervals (individual values) around it]

“Approximate” Prediction Interval (Individual Values)

• What is the 95% prediction interval for PBT when AdEx = 75 lakhs?
• Ŷ = −310.6173 + 7.0679 (75) = 219.48
• Then, we can approximate the prediction interval by
  Ŷ ± (2.306) 11.7646 √(1 + 1/10 + (75 − 68)²/162)
  = 219.48 ± 2.306 (11.7646)(1.1843)
  = [187.35, 251.60]
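The same interval in code (a sketch; the sample quantities n = 10, X̄ = 68, and Σ(Xᵢ − X̄)² = 162 come from the earlier slides):

```python
import math
from scipy import stats

b0, b1, rmse = -310.6173, 7.0679, 11.7646
n, x_bar, ssx, x0 = 10, 68, 162, 75

y_hat = b0 + b1 * x0                                       # ≈ 219.48
se_pred = rmse * math.sqrt(1 + 1/n + (x0 - x_bar)**2 / ssx)
t_crit = stats.t.ppf(0.975, n - 2)                         # 2.306
print(y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)  # ≈ 187.35, 251.60
```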
“Certain” Assumptions for the Simple Regression Model
• Assuming that the true relationship between Y and X is indeed given by
  Y = β₀ + β₁X + ε

Assumption 1: The error term ε is a random variable with an expected value of zero
for a given value of X, i.e. E[ε|X] = 0
Implication: Since β₀ and β₁ are constants, for a given value of X, the expected
value of Y is E(Y|X) = β₀ + β₁X. This also implies that the errors are not correlated
(systematically related) with the value of X, i.e. Corr(X, ε) = 0

Assumption 2: The variance of ε is a constant for all values of X, i.e. Var[ε|X] = σε²
Implication: The variance of Y about the regression line is the same for all values
of X and equals σε² (Homoskedasticity)

Assumption 3: The values of εᵢ are independent, i.e. Corr[εᵢ, εⱼ] = 0
Implication: The value of Y for a particular value of X is not related to the value
of Y for another value of X. This condition will generally be satisfied for a SRS

Assumption 4: The error term is normally distributed, i.e. (ε|X) ~ N(0, σε²)
Implication: The dependent variable Y is normally distributed for a given value of X,
i.e., (Y|X) ~ N(β₀ + β₁X, σε²)

Visual Interpretation of Assumptions



Diagnostic checks: Using OLS residuals

• We need to check the appropriateness of the following main assumptions
  1. E[ε|X] = 0
  2. Homoskedasticity: Var[ε|X] = σε²
  3. Correlation[εᵢ, εⱼ] = 0 for all i ≠ j
  4. Normality of errors: ε|X ~ N(0, σε²)
• Other key diagnostic checks include
  – Impact of outliers
  – Linear relationship between Y and X
• Violations of these assumptions cause problems, e.g. bias, inefficiency,
incorrect inference
• We can use plots of residuals to get an idea of whether the assumptions are satisfied
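As an illustration, a minimal residual-plot sketch in Python (the data are simulated to stand in for the case dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated stand-in data: a PBT-like response against an AdEx-like predictor
rng = np.random.default_rng(1)
x = np.linspace(55, 75, 40)
y = -310.62 + 7.07 * x + rng.normal(0, 12, size=40)

fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(x, fit.resid)      # residuals vs. predictor values
plt.axhline(0, linestyle="--")
plt.xlabel("X")
plt.ylabel("Residual")
plt.show()                     # a good pattern is "no pattern"
```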

Residuals vs. Predictor Values

[Residual plot: residuals scattered around 0]

A good pattern of residuals is “no pattern”


Problem 1: Heteroskedasticity
• Variance of the residuals increases/decreases with the value of the predictor variable

[Scatterplots: Price ($000) vs. Square Feet, and the corresponding residuals vs. Square Feet with non-constant spread]

• Standard errors reported by statistical packages are lower than the actual ones
Problem 2: Dependence and Autocorrelation

• The errors may be correlated with each other if data were collected over time
  – e.g. returns of a stock over time
• Often shows up as a pattern in the residuals, if plotted in chronological order
• Errors can also be correlated when the data structure is hierarchical or nested
  – e.g. Salary of MBA students across different b-schools and GMAT scores
• Standard errors reported by statistical packages are lower than the actual ones
Problem 3: Departures from Normality

• Construct a quantile plot of the residuals instead of the original variables
• Inferences (hypothesis tests and confidence intervals) work pretty well even when
residuals are not strictly normal

[Normal quantile plot of the residuals]
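A normal quantile (Q-Q) plot of residuals takes one call in scipy; a sketch with simulated residuals:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated residuals standing in for the OLS residuals
resid = np.random.default_rng(2).normal(0, 11.76, size=50)

stats.probplot(resid, dist="norm", plot=plt)  # points near the line suggest normality
plt.show()
```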


Problem 4: Outliers, Leverage Points, Influential Observations

• In the case of regression, outliers (unusual observations) can occur in the y or x
variables
• Unusual observations in the x variable are called leverage points
• Influential observations are those that substantially change the OLS fit depending
on whether they are included or not
• Typically, leverage points are suspect for being influential observations, as OLS
penalizes large errors more (due to squaring)
• Not all leverage points are necessarily influential observations

[Scatterplot with a leverage point; green line – best fit without the leverage point;
red line – best fit with the leverage point]
Problem 5: Linearity Assumption

• In many applications of interest, the relationship between the dependent and the
independent variable might be nonlinear
• Look for a transformation of the data (either X or Y or both variables) such that
the relationship between the transformed variables is approximately linear
• The transformed data should satisfy the OLS assumptions
Example: Pet Food Demand Curve Estimation

• The manager would like to estimate a demand curve, i.e., the relationship between
demand and price
• Data on the prices and weekly sales for a pet food brand over a period of 2 years
• Note: Sales is a censored signal of demand, but the best piece of information that
we have now

OLS estimation (assuming linear relationship)

Estimated Sales = 190480 − 125190 Price

[Scatterplot of Sales Volume vs. Avg Price ($) with the fitted line, and the
corresponding residual plot]

Regression Statistics
Multiple R          0.9102
R Square            0.8285
Adjusted R Square   0.8268
Standard Error      6991.41
Observations        104

                Coefficients   Standard Error   t Stat    P-value
Intercept       190483.4       6226.106         30.5943   3.4E-53
Avg Price ($)   -125188.7      5640.396         -22.195   7.74E-41
Log-Log Transformation: Is there a pattern now?

• Obtain the logarithm transform for both sales and average price
• Calculate OLS estimates for the transformed data
• Examine the residual plot for any nonlinear patterns

Estimated (Ln_SalesVolume) = 11.05 − 2.442 Ln_Price
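A sketch of the log-log fit in Python (the data here are simulated around the slide's estimates, since the original pet-food dataset is not included):

```python
import numpy as np
import statsmodels.api as sm

# Simulated weekly price/sales data standing in for the pet-food dataset
rng = np.random.default_rng(3)
price = rng.uniform(0.6, 1.4, size=104)
sales = np.exp(11.05 - 2.442 * np.log(price) + rng.normal(0, 0.06, size=104))

ln_price, ln_sales = np.log(price), np.log(sales)
fit = sm.OLS(ln_sales, sm.add_constant(ln_price)).fit()
print(fit.params)  # ≈ [11.05, -2.442]; the slope is the price elasticity of demand
```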

Interpreting the Estimates in Log Models

Model        Specification                Interpretation of β₁
Log-Log      ln(Y) = β₀ + β₁ ln(X) + ε    Elasticity: A 1% change in X is associated with a β₁% change in Y
Log-Linear   ln(Y) = β₀ + β₁ X + ε        A one unit change in X is associated with a 100·β₁% change in Y
Linear-Log   Y = β₀ + β₁ ln(X) + ε        A 1% change in X is associated with a 0.01·β₁ change in Y

Regression Statistics
Multiple R          0.9770
R Square            0.9546
Adjusted R Square   0.9541
Standard Error      0.0605
Observations        104

             Coefficients   Standard Error   t Stat     P-value
Intercept    11.0506        0.0075           1477.35    1.1E-222
Ln(Price)    -2.4420        0.0528           -46.2865   2.75E-70
