Sessions 18 19 - Regression - SLR MLR
Types of Relationships: Linear Relationships
(continued)
Strong relationships vs. weak relationships
[Scatter plots of Y against X illustrating strong and weak linear relationships]
Types of Relationships
(continued)
No relationship
[Scatter plots of Y against X showing no linear relationship]
Simple Linear Regression Model
The population regression model:

    Yi = β0 + β1Xi + εi

where:
  Yi = dependent variable
  β0 = population Y intercept
  β1 = population slope coefficient
  Xi = independent variable
  εi = random error term

β0 + β1Xi is the linear component; εi is the random error component.
Simple Linear Regression Model
(continued)
    Yi = β0 + β1Xi + εi

[Graph: the population regression line with intercept β0 and slope β1. The observed value of Y for Xi differs from the predicted value of Y for Xi by the random error εi.]
Simple Linear Regression Equation
(Prediction Line)
The simple linear regression equation provides an estimate of the population regression line:

    Ŷi = b0 + b1Xi

where:
  Ŷi = estimated (or predicted) Y value for observation i
  b0 = estimate of the regression intercept
  b1 = estimate of the regression slope
  Xi = value of X for observation i
The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Y and Ŷ:

    min Σ(Yi − Ŷi)² = min Σ(Yi − (b0 + b1Xi))²

These differences are called residuals. This line minimizes the sum of the squared differences between the points and the line.
The Mathematics Behind…
Least Squares Method
• Slope for the Estimated Regression Equation
• Slope for the Estimated Regression Equation

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²

where:
  xi = value of independent variable for ith observation
  yi = value of dependent variable for ith observation
  x̄ = mean value for independent variable
  ȳ = mean value for dependent variable
Least Squares Method
• Intercept for the Estimated Regression Equation

    b0 = ȳ − b1x̄
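The two least-squares formulas above can be sketched in a few lines of Python. The data here are hypothetical illustration values, not the house-price data used later in the slides:

```python
import numpy as np

# Hypothetical illustration data (not the slides' house-price data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])

x_bar, y_bar = x.mean(), y.mean()

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(b0, b1)
```

As a cross-check, `np.polyfit(x, y, 1)` performs the same degree-1 least-squares fit and returns the same slope and intercept.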
Interpretation of the
Slope and the Intercept
[Scatter plot: house price ($1000s) vs. square feet with the fitted regression line, used to interpret b0 and b1]
ANOVA

             df   SS          MS          F        Significance F
Regression    1   18934.9348  18934.9348  11.0848  0.01039
Residual      8   13665.5652   1708.1957
Total         9   32600.5000

Regression Statistics

Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error     41.33032
Observations       10

    r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet.
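As a quick check, r² can be recomputed directly from the sums of squares in the ANOVA table above:

```python
# Sums of squares from the slide's ANOVA table
SSR = 18934.9348   # regression sum of squares
SSE = 13665.5652   # residual (error) sum of squares
SST = SSR + SSE    # total sum of squares = 32600.5000

r_squared = SSR / SST
print(round(r_squared, 5))   # 0.58082, matching the R Square in the output
```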
    Ŷ = 98.25 + 0.1098(2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
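The point prediction is just the fitted equation evaluated at Xi = 2000; using the rounded coefficients from the slide:

```python
# Rounded estimates from the slide: price (in $1,000s) vs. square feet
b0, b1 = 98.25, 0.1098
sqft = 2000

predicted_price = b0 + b1 * sqft   # in $1,000s
print(round(predicted_price, 2))   # 317.85, i.e. $317,850
```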
Simple Linear Regression
Example: Making Predictions
• When using a regression model for prediction, only predict within the relevant range of data

[Scatter plot: House Price ($1000s), 0 to 450, vs. Square Feet, 0 to 3000. The observed X values define the relevant range for interpolation; do not try to extrapolate beyond the range of observed X's]
Comparing Standard Errors

SYX is a measure of the variation of observed Y values from the regression line.

[Two scatter plots contrasting a small SYX (points close to the line) with a large SYX (points widely scattered around the line)]
Hypothesis Test for Correlation Coefficient

• Test statistic

    tSTAT = (r − ρ) / √((1 − r²) / (n − 2)),  d.f. = n − 2

  where r = +√r² if b1 > 0, and r = −√r² if b1 < 0
Hypothesis Test for Correlation Coefficient
(continued)
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05, d.f. = 10 − 2 = 8

    tSTAT = (r − ρ) / √((1 − r²) / (n − 2)) = (.762 − 0) / √((1 − .762²) / (10 − 2)) = 3.329
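The test statistic can be reproduced in a couple of lines; using the unrounded r = 0.76211 (the Multiple R reported earlier) gives the slide's value of 3.329:

```python
import math

r, n = 0.76211, 10   # Multiple R and sample size from the slides
rho0 = 0             # hypothesized correlation under H0

t_stat = (r - rho0) / math.sqrt((1 - r**2) / (n - 2))
print(round(t_stat, 3))   # 3.329
```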
F Test for Significance

• Test statistic

    FSTAT = MSR / MSE

  where MSR = SSR / k and MSE = SSE / (n − k − 1)

FSTAT follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom.
Measures of Variation

[Diagram: for each observation (Xi, Yi), the total deviation from Ȳ splits into an explained part and an unexplained part]

    SST = Σ(Yi − Ȳ)²   total sum of squares
    SSR = Σ(Ŷi − Ȳ)²   regression (explained) sum of squares
    SSE = Σ(Yi − Ŷi)²   error sum of squares
Coefficient of Determination, r2
• The coefficient of determination is the portion of
the total variation in the dependent variable that
is explained by variation in the independent
variable
• The coefficient of determination is also called r-squared and is denoted as r²

    note: 0 ≤ r² ≤ 1
Inferences About the Slope: t Test

    tSTAT = (b1 − β1) / Sb1

where Sb1 = standard error of the slope

d.f. = n − k − 1, where k = # of independent variables; for simple linear regression, d.f. = n − 2
Inferences About the Slope
    Sb1 = SYX / √SSX = SYX / √(Σ(Xi − X̄)²)

where:
  Sb1 = estimate of the standard error of the slope
  SYX = √(SSE / (n − 2)) = standard error of the estimate
Standard Error of Estimate
• The standard deviation of the variation of
observations around the regression line is estimated
by
    SYX = √(SSE / (n − 2)) = √( Σ(Yi − Ŷi)² / (n − 2) )
Where
SSE = error sum of squares
n = sample size
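Using the SSE from the earlier ANOVA table, the standard error of the estimate for the house-price example is:

```python
import math

SSE, n = 13665.5652, 10   # from the slides' ANOVA output

S_YX = math.sqrt(SSE / (n - 2))
print(round(S_YX, 5))   # 41.33032, the "Standard Error" in the output
```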
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:

             Coefficients  Standard Error  t Stat   P-value
Intercept        98.24833        58.03348  1.69296  0.12892
Square Feet       0.10977         0.03297  3.32938  0.01039

    tSTAT = (b1 − β1) / Sb1 = (0.10977 − 0) / 0.03297 = 3.32938
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0

Test statistic: tSTAT = 3.329
d.f. = 10 − 2 = 8, α/2 = .025

Decision: Reject H0, since p-value < α

There is sufficient evidence that square footage affects house price.
Confidence Interval Estimate for the
Slope
Confidence Interval Estimate of the Slope:
    b1 ± tα/2 Sb1,  d.f. = n − 2
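For the house-price example (b1 = 0.10977, Sb1 = 0.03297, d.f. = 8), the 95% interval works out as follows; the critical value 2.306 is the standard t-table value for α/2 = .025 and 8 degrees of freedom:

```python
b1, S_b1 = 0.10977, 0.03297   # from the slides' Excel output
t_crit = 2.306                # t table value, alpha/2 = .025, d.f. = 8

lower = b1 - t_crit * S_b1
upper = b1 + t_crit * S_b1
print(round(lower, 5), round(upper, 5))   # interval for the slope
```

The interval does not include 0, consistent with rejecting H0: β1 = 0 at α = .05.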
Assumptions of Regression

• Linearity
  • The relationship between X and Y is linear
• Independence of Errors
  • Error values are statistically independent
• Normality of Error
  • Error values are normally distributed for any given value of X
• Equal Variance (also called homoscedasticity)
  • The probability distribution of the errors has constant variance
Residual Analysis
    ei = Yi − Ŷi
• The residual for observation i, ei, is the difference between
its observed and predicted value
• Check the assumptions of regression by examining the
residuals
• Examine for linearity assumption
• Evaluate independence assumption
• Evaluate normal distribution assumption
• Examine for constant variance for all levels of X (homoscedasticity)
Residual Analysis for Linearity

[Residual plots vs. X: a random scatter of residuals indicates a linear relationship; a curved pattern indicates the relationship is not linear]
Residual Analysis for
Independence
[Residual plots vs. X: a random scatter of residuals indicates independent errors; a cyclical or trending pattern indicates the errors are not independent]
Checking for Normality
• Examine the Stem-and-Leaf Display of the Residuals
• Examine the Boxplot of the Residuals
• Examine the Histogram of the Residuals
• Construct a Normal Probability Plot of the Residuals
Residual Analysis for
Normality
When using a normal probability plot, normal errors will plot approximately as a straight line.

[Normal probability plot: Percent (0 to 100) vs. Residual (−3 to 3); the plotted points fall close to a straight line]
Residual Analysis for
Equal Variance
[Residual plots vs. X: a funnel-shaped spread of residuals indicates non-constant variance; a uniform spread indicates constant variance]
Simple Linear Regression Example: Excel Residual Output
RESIDUAL OUTPUT

  Observation  Predicted House Price  Residuals
   1           251.92316              -6.92316
   2           273.87671              38.12329
   3           284.85348              -5.85348
   4           304.06284               3.93716
   5           218.99284             -19.99284
   6           268.38832             -49.38832
   7           356.20251              48.79749
   8           367.17929             -43.17929
   9           254.66740              64.33264
  10           284.85348             -29.85348

[House Price Model Residual Plot: Residuals (−60 to 80) vs. Square Feet (0 to 3000)]

Does not appear to violate any regression assumptions.
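A quick sanity check on the residual output above: when the model includes an intercept, least-squares residuals always sum to (approximately) zero:

```python
# Residuals from the slides' RESIDUAL OUTPUT table
residuals = [-6.92316, 38.12329, -5.85348, 3.93716, -19.99284,
             -49.38832, 48.79749, -43.17929, 64.33264, -29.85348]

# Sum is ~0, up to the rounding in the printed table
print(round(sum(residuals), 3))
```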
Estimating Mean Values and
Predicting Individual Values
Goal: Form intervals around Ŷ to express uncertainty about the value of Y for a given Xi

[Graph: the prediction line Ŷ = b0 + b1Xi, with a confidence interval for the mean of Y given Xi, and a wider prediction interval for an individual Y given Xi]
Confidence Interval for the Average Y, Given X
Confidence interval for μY|X=Xi:

    Ŷ ± tα/2 SYX √hi

where hi = 1/n + (Xi − X̄)²/SSX = 1/n + (Xi − X̄)²/Σ(Xi − X̄)²
Prediction Interval for an Individual Y, Given X
Prediction interval for Y|X=Xi:

    Ŷ ± tα/2 SYX √(1 + hi)

Example (house price, Xi = 2000 square feet, 95% confidence):

    Confidence interval for the mean:     Ŷ ± t0.025 SYX √hi = 317.85 ± 37.12
    Prediction interval for an individual: Ŷ ± t0.025 SYX √(1 + hi) = 317.85 ± 102.28
The confidence interval estimate of the expected value of y will be narrower than
the prediction interval for the same given value of x and confidence level. This is
because there is less error in estimating a mean value as opposed to predicting an
individual value.
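The two half-widths on the slide are internally consistent: backing hi out of the confidence-interval half-width (with t0.025 = 2.306 and SYX = 41.33032) reproduces the prediction-interval half-width:

```python
import math

t_crit, S_YX = 2.306, 41.33032   # d.f. = 8, from the slides
ci_half = 37.12                  # CI half-width at Xi = 2000

h_i = (ci_half / (t_crit * S_YX)) ** 2        # back out h_i from the CI
pi_half = t_crit * S_YX * math.sqrt(1 + h_i)  # implied PI half-width
print(round(pi_half, 2))   # 102.28, matching the slide
```

Because of the extra "1" under the square root, the prediction interval is always wider than the confidence interval at the same Xi.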
Multiple Linear Regression
• Concept and analogy to Simple Linear Regression
• Equation type and interpretation
• Model fit (d.f. = n − k − 1) and explanatory power of each independent variable
• Comparing different models: Adjusted R squared
Adjusted r2
• r2 never decreases when a new X variable is added
to the model
• This can be a disadvantage when comparing models
• What is the net effect of adding a new variable?
• We lose a degree of freedom when a new X variable is
added
• Did the new X variable add enough explanatory power to
offset the loss of one degree of freedom?
Adjusted r2
(continued)
• Shows the proportion of variation in Y explained by all
X variables adjusted for the number of X variables used
    r²adj = 1 − (1 − r²)(n − 1)/(n − k − 1)

(where n = sample size, k = number of independent variables)
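Applying the formula to the house-price example (n = 10, k = 1) reproduces the Adjusted R Square from the earlier regression output:

```python
r2, n, k = 0.58082, 10, 1   # r-squared, sample size, # of X variables

r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 5))   # 0.52842
```

Note that r²adj is always below r², and the gap grows as more X variables are added for a fixed sample size.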