Simple Linear Regression
Types of Relationships

[Scatter plots of Y versus X showing linear and curvilinear relationships]
Copyright ©2011 Pearson Education 13-5
Types of Relationships
(continued)
Strong relationships vs. weak relationships

[Scatter plots of Y versus X showing strong and weak linear relationships]
Types of Relationships
(continued)
No relationship

[Scatter plot of Y versus X showing no relationship]
Simple Linear Regression Model

The population regression model:

Yi = β0 + β1 Xi + εi

where
  Yi = dependent variable (value of Y for observation i)
  Xi = independent variable
  β0 = population Y intercept
  β1 = population slope coefficient
  εi = random error term

β0 + β1 Xi is the linear component; εi is the random error component.

[Graph: the observed value of Y for Xi lies a vertical distance εi (the random error) from the population line, which has intercept β0 and slope β1; the predicted value of Y for Xi lies on the line]
Simple Linear Regression Equation (Prediction Line)

The simple linear regression equation provides an estimate of the population regression line:

Ŷi = b0 + b1 Xi

where
  Ŷi = estimated (or predicted) Y value for observation i
  b0 = estimate of the regression intercept
  b1 = estimate of the regression slope
  Xi = value of X for observation i
[Scatter plot: house price ($1000s) versus square feet]
ANOVA
             df   SS          MS          F        Significance F
 Regression   1   18934.9348  18934.9348  11.0848  0.01039
 Residual     8   13665.5652  1708.1957
 Total        9   32600.5000
[Scatter plot with fitted line: house price ($1000s) versus square feet; intercept = 98.248, slope = 0.10977]

house price = 98.25 + 0.1098(2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850
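As a check on the slide's arithmetic, a short least-squares sketch in Python (numpy assumed available) reproduces the intercept, slope, and prediction from the ten-house data set given later in the deck:

```python
import numpy as np

# House price data from the slides: Y = price ($1000s), X = square feet
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

# Least-squares estimates: b1 = SSXY / SSX, b0 = Ybar - b1 * Xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(round(b0, 3), round(b1, 5))  # 98.248 0.10977

# Predicted price (in $1000s) for a 2000 square-foot house
y_hat = b0 + b1 * 2000
print(round(y_hat, 2))  # 317.78 (the slide's 317.85 uses the rounded coefficients)
```

The tiny difference from the slide's $317,850 comes only from rounding b0 and b1 before multiplying.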
Simple Linear Regression Example: Making Predictions

When using a regression model for prediction, predict only within the relevant range of the data.

[Scatter plot: house price ($1000s) versus square feet, with the relevant range for interpolation marked. Do not try to extrapolate beyond the range of observed X values.]
Measures of Variation

SST = total sum of squares = Σ(Yi − Ȳ)²        (total variation in Y)
SSR = regression sum of squares = Σ(Ŷi − Ȳ)²   (variation explained by the regression)
SSE = error sum of squares = Σ(Yi − Ŷi)²       (unexplained variation)

SST = SSR + SSE
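The decomposition SST = SSR + SSE can be verified numerically on the house-price data from this chapter (a sketch, assuming numpy):

```python
import numpy as np

# Verify SST = SSR + SSE on the house price data from the slides
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
print(round(sst, 1), round(ssr, 4), round(sse, 4))  # 32600.5 18934.9348 13665.5652
```

These are exactly the SS column entries in the ANOVA output shown elsewhere in the deck.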
Coefficient of Determination, r²

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called r-squared and is denoted r².

r² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ r² ≤ 1

[Scatter plots: r² = 1 when X explains all of the variation in Y (perfect linear relationship)]
Examples of Approximate r² Values

r² = 0: no linear relationship between X and Y; none of the variation in Y is explained by variation in X.
r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet.

Regression Statistics
 Multiple R          0.76211
 R Square            0.58082
 Adjusted R Square   0.52842
 Standard Error      41.33032
 Observations        10

ANOVA
             df   SS          MS          F        Significance F
 Regression   1   18934.9348  18934.9348  11.0848  0.01039
 Residual     8   13665.5652  1708.1957
 Total        9   32600.5000
Standard Error of Estimate

S_YX = sqrt( SSE / (n − 2) ) = sqrt( Σ(Yi − Ŷi)² / (n − 2) )

where
  SSE = error sum of squares
  n = sample size

For the house price model, S_YX = sqrt(13665.5652 / 8) = 41.33032.
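A one-line check of that arithmetic, using the SSE and n from the ANOVA output:

```python
import math

# Standard error of the estimate for the house-price model,
# using SSE and n from the ANOVA output above
sse, n = 13665.5652, 10
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))  # 41.33032
```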
Assumptions of Regression

Linearity: the relationship between X and Y is linear
Independence of Errors: error values are statistically independent
Normality of Error: error values are normally distributed for any given value of X
Equal Variance (also called homoscedasticity): the probability distribution of the errors has constant variance
Residual Analysis for Linearity

[Residual plots versus X: a curved pattern in the residuals indicates the relationship is not linear; a random scatter indicates linearity]
Residual Analysis for Independence

[Residual plots versus X: a pattern in the residuals over observations indicates the errors are not independent; a random scatter indicates independence]

Residual Analysis for Normality

[Normal probability plot of the residuals: percent versus residual; an approximately straight line supports the normality assumption]
Residual Analysis for Equal Variance

[Residual plots versus X: a funnel-shaped spread indicates non-constant variance; a uniform band around zero indicates constant variance]
Standard Error of the Slope

S_b1 = S_YX / sqrt(SSX) = S_YX / sqrt( Σ(Xi − X̄)² )

where:
  S_b1 = estimate of the standard error of the slope
  S_YX = sqrt( SSE / (n − 2) ) = standard error of the estimate
Inferences About the Slope: t Test

Estimated regression equation: house price = 98.25 + 0.1098 (sq. ft.)

 House Price in $1000s (Y)   Square Feet (X)
 245                         1400
 312                         1600
 279                         1700
 308                         1875
 199                         1100
 219                         1550
 405                         2350
 324                         2450
 319                         1425
 255                         1700

The slope of this model is 0.1098. Is there a relationship between the square footage of the house and its sales price?
t_STAT = (b1 − β1) / S_b1 = (0.10977 − 0) / 0.03297 = 3.32938

H0: β1 = 0    H1: β1 ≠ 0
d.f. = 10 − 2 = 8,  α/2 = 0.025

Decision: reject H0, since the p-value < α. There is sufficient evidence that square footage affects house price.
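The slope t statistic can be recomputed from the raw data (a sketch, assuming numpy):

```python
import math
import numpy as np

# Slope t test from the raw data: t_STAT = (b1 - 0) / S_b1, with S_b1 = S_YX / sqrt(SSX)
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700], dtype=float)
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255], dtype=float)

ssx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / ssx
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x)) ** 2)
s_yx = math.sqrt(sse / (len(x) - 2))  # standard error of the estimate
s_b1 = s_yx / math.sqrt(ssx)          # standard error of the slope
t_stat = (b1 - 0) / s_b1
print(round(s_b1, 5), round(t_stat, 4))  # ≈ 0.03297 3.3294, matching the slide
```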
F Test for Significance

F_STAT = MSR / MSE

where
  MSR = SSR / k
  MSE = SSE / (n − k − 1)

F_STAT follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom.

F_STAT = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848, with 1 and 8 degrees of freedom; the p-value for the F test is 0.01039.

Regression Statistics
 Multiple R          0.76211
 R Square            0.58082
 Adjusted R Square   0.52842
 Standard Error      41.33032
 Observations        10

ANOVA
             df   SS          MS          F        Significance F
 Regression   1   18934.9348  18934.9348  11.0848  0.01039
 Residual     8   13665.5652  1708.1957
 Total        9   32600.5000
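A direct check of the F statistic from the ANOVA entries:

```python
# Overall F test: F_STAT = MSR / MSE, using SSR and SSE from the ANOVA table above
ssr, sse = 18934.9348, 13665.5652
k, n = 1, 10                 # one predictor, ten observations
msr = ssr / k                # MSR = SSR / k
mse = sse / (n - k - 1)      # MSE = SSE / (n - k - 1)
f_stat = msr / mse
print(round(f_stat, 4))  # 11.0848
# With a single predictor, F_STAT equals the squared slope t statistic: 3.32938**2 ≈ 11.0848
```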
Multiple Regression Model

Yi = β0 + β1 X1i + β2 X2i + … + βk Xki + εi

Estimated multiple regression equation:

Ŷi = b0 + b1 X1i + b2 X2i + … + bk Xki

In this chapter we will use Excel or Minitab to obtain the regression slope coefficients and other regression summary measures.
Multiple Regression Equation (continued)

Two-variable model:

Ŷ = b0 + b1 X1 + b2 X2

[3-D plot: the fitted regression plane over the (X1, X2) plane, with slope b1 for variable X1 and slope b2 for variable X2]
Example: 2 Independent Variables

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.

Dependent variable: pie sales (units per week)
Independent variables: price (in $), advertising (in $100s)

Estimated equation: Sales = 306.526 − 24.975(Price) + 74.131(Advertising)

Regression Statistics
 R Square            0.52148
 Adjusted R Square   0.44172
 Standard Error      47.46341
 Observations        15

ANOVA
             df   SS         MS         F        Significance F
 Regression   2   29460.027  14730.013  6.53861  0.01201
 Residual    12   27033.306  2252.776
 Total       14   56493.333
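The fitted equation can be used directly as a prediction function. The coefficients below are the ones reported in the slide output; the price/advertising inputs are illustrative, not from the slides:

```python
# Prediction from the fitted pie-sales equation reported above:
# Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
def predict_sales(price_dollars, advertising_100s):
    return 306.526 - 24.975 * price_dollars + 74.131 * advertising_100s

# Illustrative inputs: price $5.50, advertising $350 (i.e. 3.5 in $100s)
print(round(predict_sales(5.50, 3.5), 2))  # 428.62 pies per week
```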
What is the net effect of adding a new variable to the model? We lose a degree of freedom when a new X variable is added. Did the new X variable add enough explanatory power to offset the loss of that degree of freedom?
[3-D plot: the residual ei = (Yi − Ŷi) is the vertical distance from the observed point to the fitted plane at (x1i, x2i)]

The best-fit equation is found by minimizing the sum of squared errors, Σe².
Multiple Regression Assumptions

The errors, ei = (Yi − Ŷi), are assumed to be normally distributed.

Use the residual plots to check for violations of the regression assumptions:
  Residuals vs. Ŷi
  Residuals vs. X1i
  Residuals vs. X2i
  Residuals vs. time (if time-series data)
Are Individual Variables Significant?

Use t tests of the individual variable slopes. This shows whether there is a linear relationship between the variable Xj and Y, holding constant the effects of the other X variables.

Hypotheses:
  H0: βj = 0 (no linear relationship)
  H1: βj ≠ 0 (a linear relationship does exist between Xj and Y)

Test statistic:

t_STAT = (bj − 0) / S_bj   (d.f. = n − k − 1)

Confidence interval for a slope:

bj ± t_{α/2} S_bj,  where t has (n − k − 1) d.f.
Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales:

−24.975 ± (2.1788)(10.832)

So the interval is (−48.576, −1.374). This interval does not contain zero, so price has a significant effect on sales, holding constant the effect of advertising.
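The interval can be reproduced with the t critical value from scipy (a sketch; b1 and S_b1 are taken from the slide output):

```python
from scipy import stats

# 95% CI for the price slope: b1 ± t(0.025, n-k-1) * S_b1, values from the output above
b1, s_b1, df = -24.975, 10.832, 12
t_crit = stats.t.ppf(0.975, df)            # ≈ 2.1788
lo, hi = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(lo, 3), round(hi, 3))  # -48.576 -1.374
```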
Testing a Portion of the Model: Partial F Test

SSR(X1 | X2) = SSR(all variables) − SSR(X2)

α = .05, df = 1 and 12, F0.05 = 4.75

ANOVA (for X1 and X2)
             df   SS           MS
 Regression   2   29460.02687  14730.01343
 Residual    12   27033.30647  2252.775539
 Total       14   56493.33333

ANOVA (for X2 only)
             df   SS
 Regression   1   17484.22249
 Residual    13   39009.11085
 Total       14   56493.33333
Relationship between a t statistic and an F statistic:

(t_a)² = F_{1,a}   where a = degrees of freedom

Intermediate Calculations
 SSR(X1, X2)   29460.02687
 SST           56493.33333
 SSR(X2)       17484.22249     SSR(X1 | X2)   11975.80438
 SSR(X1)       11100.43803     SSR(X2 | X1)   18359.58884
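The partial F statistic for price follows directly from these tables (a sketch using the slide's sums of squares):

```python
# Partial F test for the contribution of X1 (price) given X2 (advertising):
# F_STAT = SSR(X1 | X2) / MSE(all), values from the ANOVA tables above
ssr_all = 29460.02687
ssr_x2 = 17484.22249
mse_all = 2252.775539           # MSE from the two-variable model, df = 12

ssr_x1_given_x2 = ssr_all - ssr_x2
f_stat = ssr_x1_given_x2 / mse_all
print(round(ssr_x1_given_x2, 5), round(f_stat, 3))  # 11975.80438 5.316
# 5.316 > F(0.05; 1, 12) = 4.75, so X1 significantly improves the model
```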
Coefficients of Partial Determination
 r²_{Y1.2} = 0.307000188
 r²_{Y2.1} = 0.404459524
Dummy-Variable Example

Ŷ = b0 + b1 X1 + b2 X2

Let:
  Y = pie sales
  X1 = price
  X2 = holiday (X2 = 1 if a holiday occurred during the week; X2 = 0 if there was no holiday that week)

Holiday:     Ŷ = b0 + b1 X1 + b2(1) = (b0 + b2) + b1 X1
No holiday:  Ŷ = b0 + b1 X1 + b2(0) = b0 + b1 X1

The two lines have different intercepts but the same slope. If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales.

[Plot of Y (sales) versus X1: two parallel lines, with intercept b0 + b2 for holiday weeks (X2 = 1) and intercept b0 for non-holiday weeks (X2 = 0)]
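A minimal sketch of the dummy-variable model. The coefficient values below are hypothetical placeholders, not fitted values from the slides; the point is only that the dummy shifts the intercept while leaving the slope unchanged:

```python
# Dummy-variable sketch. b0, b1, b2 are hypothetical, NOT the slide's fitted values.
b0, b1, b2 = 300.0, -30.0, 15.0

def sales(price, holiday):
    # holiday: 1 if a holiday occurred during the week, else 0
    return b0 + b1 * price + b2 * holiday

# Same slope in price; the intercept shifts by b2 in holiday weeks
print(sales(5.0, 1) - sales(5.0, 0))  # 15.0, i.e. b2
```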
Dummy variables can also encode more than two categories. Let:
  Y = house price
  X1 = square feet
  X2 = 1 if ranch, 0 otherwise
  X3 = 1 if split level, 0 otherwise

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3

Adding an interaction term between X1 and X2:

Ŷ = b0 + b1 X1 + b2 X2 + b3 (X1 X2)
Effect of Interaction

Suppose Y = 1 + 2X1 + 3X2 + 4X1X2:

X2 = 0:  Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
X2 = 1:  Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1

Slopes are different if the effect of X1 on Y depends on the X2 value.
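The slide's numeric example can be evaluated directly to see the slope change:

```python
# Interaction example from the slide: Y = 1 + 2*X1 + 3*X2 + 4*X1*X2
def y(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# The slope in X1 depends on the value of X2:
print(y(1, 0) - y(0, 0))  # 2  (X2 = 0: Y = 1 + 2*X1)
print(y(1, 1) - y(0, 1))  # 6  (X2 = 1: Y = 4 + 6*X1)
```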
Significance of Interaction Term

Quadratic Regression Model

Yi = β0 + β1 X1i + β2 X1i² + εi

The second independent variable is the square of the first variable.

where:
  β0 = Y intercept
  β1 = regression coefficient for the linear effect of X on Y
  β2 = regression coefficient for the quadratic effect on Y
  εi = random error in Y for observation i
[Four plots of Y versus X1 showing the possible quadratic shapes:
 β1 < 0, β2 > 0   β1 > 0, β2 > 0   β1 < 0, β2 < 0   β1 > 0, β2 < 0]

β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
Testing the Overall Quadratic Model

Estimate the quadratic model to obtain the regression equation:

Ŷi = b0 + b1 X1i + b2 X1i²
Purity example: purity increases as filter time increases.

 Purity   Filter Time
   3        1
   7        2
   8        3
  15        5
  22        7
  33        8
  40       10
  54       12
  67       13
  70       14
  78       15
  85       15
  87       16
  99       17

[Scatter plot: purity versus time]
Quadratic regression results:

               Coefficients   Standard Error   t Stat    P-value
 Time          1.56496        0.60179          2.60052   0.02467
 Time-squared  0.24516        0.03258          7.52406   1.165E-05

 Regression Statistics
 R Square            0.99494
 Adjusted R Square   0.99402
 Standard Error      2.59513
 F = 1080.7330, Significance F = 2.368E-13

[Time-squared residual plot: residuals scattered randomly around 0]

The quadratic term is significant and improves the model: adjusted r² is higher, S_YX is lower, and the residuals are now random.
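The quadratic fit can be reproduced from the tabulated data (a sketch, assuming numpy; the slide reports b(Time) = 1.56496, b(Time-squared) = 0.24516, r² = 0.99494):

```python
import numpy as np

# Quadratic fit to the purity / filter-time data tabulated above
time = np.array([1, 2, 3, 5, 7, 8, 10, 12, 13, 14, 15, 15, 16, 17], dtype=float)
purity = np.array([3, 7, 8, 15, 22, 33, 40, 54, 67, 70, 78, 85, 87, 99], dtype=float)

coefs = np.polyfit(time, purity, 2)   # [b2, b1, b0], highest power first
fitted = np.polyval(coefs, time)
r2 = 1 - np.sum((purity - fitted) ** 2) / np.sum((purity - purity.mean()) ** 2)
print(coefs[0] > 0, round(r2, 3))     # positive quadratic term, r-square near 1
```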
Using Transformations in Regression Analysis

Idea: non-linear models can often be transformed to a linear form, and can then be estimated by least squares.

Transform X or Y or both to get a better fit or to deal with violations of the regression assumptions. Transformations can be based on theory, logic, or scatter plots.
The Square Root Transformation

Used to:
  overcome violations of the constant variance assumption
  fit a non-linear relationship

Original model:     Yi = β0 + β1 X1i + εi
Transformed model:  Yi = β0 + β1 √X1i + εi

[Plots: shape of the original relationship (b1 > 0 and b1 < 0) versus the relationship after transformation]
The Log Transformation

Original (exponential) model:  Yi = e^(β0 + β1 X1i + β2 X2i) εi

Transformed model:  ln Yi = β0 + β1 X1i + β2 X2i + ln εi
Variance Inflation Factor (VIF)

VIF_j = 1 / (1 − R_j²)

where R_j² is the coefficient of determination from regressing X_j on all the other X variables.
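A minimal VIF sketch for a two-predictor case, using simulated (not textbook) data to show how collinearity drives the VIF above 1:

```python
import numpy as np

# VIF sketch: regress one X on the other and apply VIF = 1 / (1 - R^2).
# The data here are simulated for illustration, not from the text.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=200)  # deliberately collinear with x1

def vif(target, other):
    X = np.column_stack([np.ones_like(other), other])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

print(vif(x1, x2) > 1.5)  # collinearity inflates the variance of the slope estimate
```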
The Cp Statistic

Cp = (1 − Rk²)(n − T) / (1 − RT²) − (n − 2(k + 1))

where
  k = number of independent variables in the candidate model
  T = total number of parameters (including the intercept) in the full model
  Rk² = coefficient of determination for the candidate model
  RT² = coefficient of determination for the full model
[Model-building flowchart: if more than one variable has a high VIF, remove the variable with the highest VIF and re-fit; if only one does, remove that X. Add quadratic and/or interaction terms or transform variables as needed, then perform predictions.]
Pitfalls and Ethical Considerations

To avoid pitfalls and address ethical considerations:
Understand that interpretation of an estimated regression coefficient is performed holding all other independent variables constant
Evaluate residual plots for each independent
variable
Evaluate interaction terms
Function: y = 3x + 12x + 2

[Plot of y versus x]
Statistical Learning:

Y = f(x) + ε
Y1 = f1(x) + ε
Methods to Estimate Function “ f ”

Parametric
Non-Parametric
Likelihood function
Bayes' theorem

[K-nearest-neighbor classification with K = 1 and K = 9]

Unsupervised Learning
• K-Means
• Hierarchical clustering