
Business Statistics

Fourth Canadian Edition

Chapter 20
Multiple Regression

Copyright © 2021 Pearson Canada Inc.


Ch. 20: Multiple Regression
Learning Objectives
1) Model one variable in terms of multiple other variables
2) Test the significance of the model



20.1 The Linear Multiple Regression Model
(1 of 5)

For simple regression, the predicted value depends on only one predictor variable:

ŷ = b0 + b1x

For multiple regression, we write the regression model with more predictor variables:

ŷ = b0 + b1x1 + b2x2 + … + bkxk

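To make the model concrete, here is a minimal sketch, assuming made-up data shaped like the chapter's home-price example (the coefficients and spreads are invented), that fits a two-predictor model by ordinary least squares with NumPy:

```python
import numpy as np

# Synthetic stand-ins for the chapter's home-price data (made-up numbers).
rng = np.random.default_rng(0)
n = 200
bedrooms = rng.integers(1, 6, size=n).astype(float)
living_area = 600.0 * bedrooms + rng.normal(0, 300, size=n)
price = 20000 - 7000 * bedrooms + 90 * living_area + rng.normal(0, 50000, size=n)

# Design matrix with an intercept column, so y-hat = b0 + b1*x1 + b2*x2.
X = np.column_stack([np.ones(n), bedrooms, living_area])
b, *_ = np.linalg.lstsq(X, price, rcond=None)
print("b0, b1, b2 =", b)
```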


20.1 The Linear Multiple Regression Model
(2 of 5)

Simple Regression Example: Home Price vs. Bedrooms (1 of 2)


Random sample of 1057 homes. Can Bedrooms be used to predict
Price?

• Approximately linear relationship
• Equal Spread Condition is violated
• Be cautious about using inferential methods on these data

Figure 20.1 Side-by-side boxplots of Price against Bedrooms show that price increases, on average, with more bedrooms.



20.1 The Linear Multiple Regression Model
(3 of 5)

Simple Regression Example: Home Price vs. Bedrooms (2 of 2)


Computer regression output: Price = 14349.48 + 48218.91 × Bedrooms
Response variable: Price    R² = 21.4%
s = 68432.21 with 1057 − 2 = 1055 degrees of freedom

Table 20.1 Linear regression of Price on Bedrooms.


Variable Coeff SE(Coeff) t-ratio P-value
Intercept 14349.48 9297.69 1.5 0.1230
Bedrooms 48218.91 2843.88 16.96 ≤0.0001

• The variation in Bedrooms accounts for only 21% of the variation in Price.
• Perhaps the inclusion of another factor can account for a portion of the remaining variation.
20.1 The Linear Multiple Regression Model
(4 of 5)

Multiple Regression: include Living Area as a second predictor in the model. Computer regression output:
Response variable: Price    R² = 57.8%
s = 50142.4 with 1057 − 3 = 1054 degrees of freedom

Table 20.2 Multiple regression output for the linear model predicting Price
from Bedrooms and Living Area.
Variable Coeff SE(Coeff) t-ratio P-value
Intercept 20986.09 6816.3 3.08 0.0021
Bedrooms −7483.10 2783.5 −2.69 0.0073
Living area 93.84 3.11 30.18 ≤0.0001

Price = 20,986.09 − 7483.10 × Bedrooms + 93.84 × Living Area

• Now the model accounts for 57.8% of the variation in Price.
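Output like Table 20.2 comes from any regression package. As a minimal sketch, the snippet below fits the same two-predictor model with Python's statsmodels on made-up data; the column names mirror the example, but the numbers are synthetic, so the fitted coefficients will only be in the same ballpark as Table 20.2.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data with the example's column names; the generating coefficients
# are assumptions chosen to resemble Table 20.2, not the text's actual data.
rng = np.random.default_rng(8)
n = 1057
home = pd.DataFrame({"Bedrooms": rng.integers(1, 6, size=n).astype(float)})
home["LivingArea"] = 650 * home["Bedrooms"] + rng.normal(0, 300, size=n)
home["Price"] = (21000 - 7500 * home["Bedrooms"]
                 + 94 * home["LivingArea"] + rng.normal(0, 50000, size=n))

X = sm.add_constant(home[["Bedrooms", "LivingArea"]])
model = sm.OLS(home["Price"], X).fit()
print(model.summary())   # coefficients, SEs, t-ratios, P-values, R², s, df
```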
20.1 The Linear Multiple Regression Model
(5 of 5)

Multiple Regression:
• Residuals: e = y − ŷ (as with simple regression)
• Degrees of freedom: df = n − k − 1
  n = number of observations
  k = number of predictor variables
• Standard deviation of residuals:

  se = √( Σ(y − ŷ)² / (n − k − 1) )

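A minimal sketch of these residual computations, assuming an arbitrary synthetic setup (any design matrix with an intercept column would do):

```python
import numpy as np

# n observations, k predictors; X holds an intercept column plus the predictors.
rng = np.random.default_rng(1)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([5.0, 2.0, -3.0]) + rng.normal(0, 1.5, size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares coefficients
e = y - X @ b                                # residuals: e = y - y-hat
s_e = np.sqrt(np.sum(e**2) / (n - k - 1))    # divide by df = n - k - 1
print(f"s_e = {s_e:.2f}")                    # should be near the true error SD, 1.5
```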


20.2 Interpreting Multiple Regression
Coefficients (1 of 4)
NOTE: The meaning of the coefficients in multiple regression can be subtly different from their meaning in simple regression.

Price = 20,986.09 − 7483.10 × Bedrooms + 93.84 × Living Area

Price drops with increasing bedrooms? Counterintuitive?



20.2 Interpreting Multiple Regression
Coefficients (2 of 4)
For houses with similar-sized Living Areas, more bedrooms means smaller bedrooms and/or smaller common living space. Cramped rooms may devalue the home.

Figure 20.3 For the 96 houses with Living Area between 2500 and 3000 square feet, the slope of Price on Bedrooms is negative. For each additional bedroom, restricting data to homes of this size, we would predict that the house's Price was about $17,800 lower.



20.2 Interpreting Multiple Regression
Coefficients (3 of 4)
So, what’s the correct answer to the question:
“Do more bedrooms tend to increase or decrease the
price of a home?”

Correct answer:
• “increase” if “Bedrooms” is the only predictor (“more
bedrooms” may mean “bigger house”, after all!)
• “decrease” if “Bedrooms” increases for fixed Living Area
(“more bedrooms” may mean “smaller, more-cramped
rooms”)
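The flip in sign is easy to reproduce. Here is a small sketch on made-up data (the generating coefficients are assumptions chosen to mimic the pattern, not the text's estimates): the simple-regression slope of Price on Bedrooms comes out positive, while the multiple-regression coefficient, which adjusts for Living Area, comes out negative.

```python
import numpy as np

# Made-up data: bigger houses have more bedrooms, but for a fixed living
# area, extra bedrooms lower the price.
rng = np.random.default_rng(2)
n = 500
bedrooms = rng.integers(1, 6, size=n).astype(float)
living_area = 700 * bedrooms + rng.normal(0, 250, size=n)
price = 20000 - 7000 * bedrooms + 90 * living_area + rng.normal(0, 30000, size=n)

# Simple regression: slope of Price on Bedrooms alone (positive).
b_simple = np.polyfit(bedrooms, price, 1)[0]

# Multiple regression: Bedrooms coefficient after adjusting for Living Area (negative).
X = np.column_stack([np.ones(n), bedrooms, living_area])
b_multi = np.linalg.lstsq(X, price, rcond=None)[0][1]

print(f"simple slope: {b_simple:+.0f}, multiple-regression coefficient: {b_multi:+.0f}")
```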



20.2 Interpreting Multiple Regression
Coefficients (4 of 4)
Summarizing:
Multiple regression coefficients must be interpreted in
terms of the other predictors in the model.



20.3 Assumptions and Conditions for the
Multiple Regression Model (1 of 8)
Linearity Assumption: Check each of the predictors.
Home Prices Example:

Linearity Condition is well-satisfied for both Bedrooms and Living Area.



20.3 Assumptions and Conditions for the
Multiple Regression Model (2 of 8)
Linearity Assumption: Also check the residuals plot.
Home Prices Example:

Figure 20.4 A scatterplot of Residuals against the Predicted Values shows no obvious pattern.

Linearity Condition is well-satisfied.


20.3 Assumptions and Conditions for the
Multiple Regression Model (3 of 8)
Independence Assumption:
• As usual, there is no way to be sure the assumption is
satisfied
• But, think about how the data were collected to decide if
the assumption is reasonable
• Check the Randomization Condition as well. Does the data
collection method introduce any bias?



20.3 Assumptions and Conditions for the
Multiple Regression Model (4 of 8)
Equal Variance Assumption:
• The variability of the errors should be about the same for each
predictor.
• Use scatterplots to assess the Equal Spread Condition.

Home Price Example: scatterplot of Residuals vs. Predicted Values.



20.3 Assumptions and Conditions for the
Multiple Regression Model (5 of 8)
Normality Assumption:
• Check to see if the distribution of residuals is unimodal and
symmetric.



20.3 Assumptions and Conditions for the
Multiple Regression Model (6 of 8)
Home Price Example:
The “tails” of the distribution appear to be non-Normal.

Figure 20.5 A histogram of the residuals shows a unimodal, symmetric distribution, but the tails seem a
bit longer than one would expect from a Normal model. The Normal probability plot confirms that.



20.3 Assumptions and Conditions for the
Multiple Regression Model (7 of 8)
Summary of Multiple Regression Model and Condition Checks:
1) Check Linearity Condition with a scatterplot for each predictor. If
necessary, consider data re-expression
2) If the Linearity Condition is satisfied, fit a multiple regression
model to the data
3) Find the residuals and predicted values
4) Inspect a scatterplot of the residuals against the predicted values.
Check for nonlinearity and non-uniform variation.
5) Think about how the data were collected.
– Do you expect the data to be independent?
– Was suitable randomization utilized?
– Are the data representative of a clearly identifiable population?
– Is autocorrelation an issue?
20.3 Assumptions and Conditions for the
Multiple Regression Model (8 of 8)
Summary of Multiple Regression Model and Condition Checks:
6) If the conditions check out this far, feel free to interpret the
regression model and use it for prediction.
7) Check the Nearly Normal Condition by inspecting a histogram of the residuals and a Normal probability plot. If the sample size is large, Normality is less important for inference. Watch for skewness and outliers.
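A sketch of the residual checks in steps 4 and 7, using synthetic residuals purely for illustration; any fitted model's residuals and predicted values could be substituted.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Synthetic residuals and predictions, standing in for a fitted model's output.
rng = np.random.default_rng(7)
residuals = rng.normal(0, 1, size=300)
predicted = rng.uniform(0, 10, size=300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].scatter(predicted, residuals, s=8)    # look for curvature or "thickening"
axes[0].set(xlabel="Predicted", ylabel="Residual")
axes[1].hist(residuals, bins=25)              # unimodal and symmetric?
axes[1].set(xlabel="Residual")
stats.probplot(residuals, plot=axes[2])       # Normal probability plot
plt.tight_layout()
plt.show()
```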



20.4 Testing the Multiple Regression Model
(1 of 5)

• There are several hypothesis tests in multiple regression.
• Each is concerned with whether the underlying parameters (slopes and intercept) are actually zero.

The null hypothesis for the slope coefficients:

H0: β1 = β2 = … = βk = 0

Test the hypothesis with an F-test (a generalization of the t-test to more than one predictor).



20.4 Testing the Multiple Regression Model
(2 of 5)

• The F-distribution has two degrees of freedom:
  – k, where k is the number of predictors
  – n − k − 1, where n is the number of observations
• The F-test is one-sided: bigger F-values mean smaller P-values.
• If the null hypothesis is true, then F will be near 1.
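As a minimal sketch (an assumed synthetic setup, not the chapter's data), the F-ratio and its one-sided P-value can be computed directly from the sums of squares:

```python
import numpy as np
from scipy import stats

# Synthetic design with k = 2 predictors plus an intercept column.
rng = np.random.default_rng(3)
n, k = 120, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, -0.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
SSE = np.sum((y - y_hat) ** 2)            # unexplained variability
SSR = np.sum((y_hat - y.mean()) ** 2)     # explained variability
F = (SSR / k) / (SSE / (n - k - 1))       # F = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)     # one-sided upper-tail P-value
print(f"F = {F:.2f}, P-value = {p_value:.3g}")
```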



20.4 Testing the Multiple Regression Model
(3 of 5)

If a multiple regression F-test leads to a rejection of the null hypothesis, then check the t-test statistic for each coefficient:

t(n−k−1) = (bj − 0) / SE(bj)

Note that the degrees of freedom for the t-test is n − k − 1.

Confidence interval: bj ± t*(n−k−1) × SE(bj)


20.4 Testing the Multiple Regression Model
(4 of 5)

“Tricky” Parts of the t-tests:


• SE’s are harder to compute (let technology do it!)
• The meaning of a coefficient depends on the other predictors in
the model (as we saw in the Home Price example)
– If we fail to reject H0: βj = 0 based on its t-test, it does not mean that xj has no linear relationship to y
– Rather, it means that xj contributes nothing to modeling y after allowing for the other predictors



20.4 Testing the Multiple Regression Model
(5 of 5)

In Multiple Regression, it looks like each coefficient bj tells us the effect of its associated predictor, xj.

BUT
• The coefficient βj can be different from zero even when there is no correlation between y and xj.
• It is even possible that the multiple regression slope changes sign when a new variable enters the regression.



20.5 The F-Statistic and ANOVA (1 of 3)
Analysis of Variance (ANOVA) table is used to present various
measures of variability in a regression analysis.
Summary of Multiple Regression Variation Measures:

Parameter                          Significance
SSE (Sum of Squared Residuals)     Larger SSE = “noisier” data and less precise prediction
SSR (Regression Sum of Squares)    Larger SSR = stronger model correlation
SST (Total Sum of Squares)         Larger SST = larger variability in y, due to “noisier” data (SSE) and/or stronger model correlation (SSR)



20.5 The F-Statistic and ANOVA (2 of 3)
R² in Multiple Regression:

R² = SSR / SST = 1 − SSE / SST

R² = fraction of the total variation in y accounted for by the model (all the predictor variables included)

F and R²:
By using the expressions for SSE, SSR, SST, and R², it can be shown that:

F = [R² / (1 − R²)] × [(n − k − 1) / k]

So, the F-test of the null hypothesis that all slopes are zero is equivalent to testing whether R² = 0.

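A quick numeric check of this identity on synthetic data (a sketch, assuming an arbitrary made-up design and coefficients):

```python
import numpy as np

# Arbitrary synthetic design with k = 3 predictors plus an intercept.
rng = np.random.default_rng(5)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ rng.normal(size=k + 1) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
SSE = np.sum((y - X @ b) ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - SSE / SST

F_anova = ((SST - SSE) / k) / (SSE / (n - k - 1))   # MSR / MSE, since SSR = SST - SSE
F_identity = (R2 / (1 - R2)) * ((n - k - 1) / k)    # the formula above
print(np.isclose(F_anova, F_identity))              # True
```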

20.5 The F-Statistic and ANOVA (3 of 3)

Table 20.3 Typical ANOVA table in multiple regression analysis. The table indicates the formulas used by software to produce numerical results.

Source                               Degrees of Freedom (df)   Sum of Squares   Mean Square               F-Ratio         P-Value
Regression (explained variability)   k                         SSR              MSR = SSR / k             F = MSR / MSE   P
Errors (unexplained variability)     n − k − 1                 SSE              MSE = SSE / (n − k − 1)
Total                                n − 1                     SSTotal



20.6 R² and Adjusted R²
• Adding new predictor variables to a model never decreases R² and may increase it.
• But each added variable increases the model complexity, which may not be desirable.
• Adjusted R² imposes a “penalty” on the correlation strength of larger models, depreciating their R² values to account for an undesired increase in complexity:

R²adj = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)] = 1 − (1 − R²)(n − 1) / (n − k − 1)

Adjusted R² permits a more equitable comparison between models of different sizes.
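As a small sketch on made-up data, adding a pure-noise predictor never lowers R², but it typically lowers adjusted R²:

```python
import numpy as np

# One real predictor and one pure-noise predictor (made-up data).
rng = np.random.default_rng(6)
n = 50
x1 = rng.normal(size=n)
noise = rng.normal(size=n)            # a predictor unrelated to y
y = 3 * x1 + rng.normal(size=n)

def adj_r2(X, y):
    n, p = X.shape                    # p = k + 1 columns, including the intercept
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    SSE = np.sum((y - X @ b) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return 1 - (SSE / (n - p)) / (SST / (n - 1))   # n - p = n - k - 1

X1 = np.column_stack([np.ones(n), x1])
X2 = np.column_stack([np.ones(n), x1, noise])
print(adj_r2(X1, y), adj_r2(X2, y))   # adding noise usually lowers adjusted R²
```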



What Can Go Wrong? (1 of 2)
• It is sometimes a mistake to claim to “hold everything else
constant” for a single individual. (For the predictors Age
and Years of Education, it is impossible for an individual to
get a year of education at constant age.)
• Don’t interpret regression causally. Statistics assesses
correlation, not causality.
• Be cautious about interpreting a regression as predictive.
That is, be alert for combinations of predictor values that
take you outside the ranges of these predictors.



What Can Go Wrong? (2 of 2)
• Don’t think that the sign of a coefficient is special. The sign
of a predictor coefficient may depend on which predictors
are included in the model.
• If a coefficient’s t-statistic is not significant, don’t interpret it
at all.



What Else Can Go Wrong?
• Don’t fit a linear regression to data that aren’t straight.
Usually, we are satisfied when plots of y against the x’s are
straight enough.
• Watch out for plot thickening. If plots of residuals vs.
predictors all show thickening, then consider re-expressing
y. If the thickening is observed for just one predictor,
consider re-expressing that predictor.
• Make sure the errors are nearly normal.
• Watch out for high-influence points and outliers.



What Have We Learned? (1 of 2)
• The assumptions and conditions for multiple regression
are the same as those for simple regression.
• R² is still the fraction of the variation accounted for by the regression model.
• se is still the standard deviation of the residuals.
• The degrees of freedom (in the denominator of se and for
each t-test) is n minus the number of parameters
estimated.



What Have We Learned? (2 of 2)
• The regression table produced by any statistics package shows
a row for each coefficient, giving its estimate, a standard error, a
t-statistic, and a P-value.
• If all the conditions are met, we can test each coefficient against
the null hypothesis that its parameter value is zero with a
Student’s t-test.
• We can perform an overall test of whether the multiple
regression model provides a better summary for y than its mean
by using the F-distribution.
• We learned that R² may not be appropriate for comparing multiple regression models with different numbers of predictors.

