
Simple Linear Regression

Copyright ©2011 Pearson Education 13-1


Correlation vs. Regression
 A scatter plot can be used to show the
relationship between two variables
 Correlation analysis is used to measure the
strength of the association (linear relationship)
between two variables
 Correlation is only concerned with strength of the
relationship
 No causal effect is implied with correlation
 Scatter plots
 Correlation

Copyright ©2011 Pearson Education 13-2


Introduction to
Regression Analysis
 Regression analysis is used to:
 Predict the value of a dependent variable based on
the value of at least one independent variable
 Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
predict or explain
Independent variable: the variable used to predict
or explain the dependent variable

Copyright ©2011 Pearson Education 13-3


Simple Linear Regression
Model
 Only one independent variable, X
 Relationship between X and Y is
described by a linear function
 Changes in Y are assumed to be related
to changes in X

Copyright ©2011 Pearson Education 13-4


Types of Relationships

Linear relationships Curvilinear relationships

[Figure: four scatter plots of Y against X contrasting linear relationships (left) with curvilinear relationships (right)]
Copyright ©2011 Pearson Education 13-5
Types of Relationships
(continued)
Strong relationships Weak relationships

[Figure: four scatter plots of Y against X contrasting strong relationships (left, points tight around a line) with weak relationships (right, points widely scattered)]
Copyright ©2011 Pearson Education 13-6
Types of Relationships
(continued)
No relationship

[Figure: scatter plots of Y against X showing no relationship]
Copyright ©2011 Pearson Education 13-7
Simple Linear Regression
Model

Yi = β0 + β1Xi + εi

where:
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term

β0 + β1Xi is the linear component; εi is the random error component

Copyright ©2011 Pearson Education 13-8


Simple Linear Regression
Model
(continued)

Yi = β0 + β1Xi + εi

[Figure: scatter of Y against X with the population regression line. For a given Xi, the observed value of Y differs from the predicted value on the line by the random error εi; the line has intercept β0 and slope β1.]
Copyright ©2011 Pearson Education 13-9
Simple Linear Regression
Equation (Prediction Line)
The simple linear regression equation provides an
estimate of the population regression line

Ŷi = b0 + b1Xi

where:
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i

Copyright ©2011 Pearson Education 13-10


The Least Squares Method

b0 and b1 are obtained by finding the values that
minimize the sum of the squared differences
between Yi and Ŷi:

min Σ(Yi - Ŷi)² = min Σ(Yi - (b0 + b1Xi))²

Copyright ©2011 Pearson Education 13-11
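The slides obtain b0 and b1 from Excel; as a minimal cross-check, here is a Python sketch (assuming numpy is available) that applies the closed-form least-squares formulas to the house-price data introduced a few slides below:

import numpy as np

# House-price example data (see the data slide below):
# Y = house price in $1000s, X = size in square feet
x = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Closed-form least-squares estimates
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # approx. 98.24833 and 0.10977, matching the Excel output shown later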


Finding the Least Squares
Equation

 The coefficients b0 and b1 , and other


regression results in this chapter, will be
found using Excel

Formulas are shown in the text for those


who are interested

Copyright ©2011 Pearson Education 13-12


Interpretation of the
Slope and the Intercept

 b0 is the estimated average value of Y


when the value of X is zero

 b1 is the estimated change in the


average value of Y as a result of a
one-unit increase in X

Copyright ©2011 Pearson Education 13-13


Simple Linear Regression
Example
 A real estate agent wishes to examine the
relationship between the selling price of a home
and its size (measured in square feet)
 A random sample of 10 houses is selected
 Dependent variable (Y) = house price in $1000s

 Independent variable (X) = square feet

Copyright ©2011 Pearson Education 13-14


Simple Linear Regression
Example: Data
House Price in $1000s Square Feet
(Y) (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700

Copyright ©2011 Pearson Education 13-15


Simple Linear Regression
Example: Scatter Plot
House price model: Scatter Plot
[Figure: scatter plot of House Price ($1000s) against Square Feet for the 10 sampled houses]

Copyright ©2011 Pearson Education 13-16


Simple Linear Regression Example:
Using Excel Data Analysis Function
DCOVA
1. Choose Data 2. Choose Data Analysis
3. Choose Regression

Copyright ©2011 Pearson Education 13-17


Simple Linear Regression Example:
Using Excel Data Analysis Function
(continued)
Enter Y’s and X’s and desired options

Copyright ©2011 Pearson Education 13-18


Simple Linear Regression Example:
Excel Output
The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

Copyright ©2011 Pearson Education 13-19


Simple Linear Regression Example:
Graphical Representation

House price model: Scatter Plot and Prediction Line


[Figure: scatter plot of house price ($1000s) vs. square feet with the fitted prediction line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)


Copyright ©2011 Pearson Education 13-20
Simple Linear Regression
Example: Interpretation of b0

house price = 98.24833 + 0.10977 (square feet)

 b0 is the estimated average value of Y when the


value of X is zero (if X = 0 is in the range of
observed X values)
 Because a house cannot have a square footage
of 0, b0 has no practical application

Copyright ©2011 Pearson Education 13-21


Simple Linear Regression
Example: Interpreting b1

house price = 98.24833 + 0.10977 (square feet)

 b1 estimates the change in the average


value of Y as a result of a one-unit
increase in X
 Here, b1 = 0.10977 tells us that the mean value of a
house increases by .10977($1000) = $109.77, on
average, for each additional one square foot of size

Copyright ©2011 Pearson Education 13-22


Simple Linear Regression
Example: Making Predictions
Predict the price for a house
with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098(2000)
            = 317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
Copyright ©2011 Pearson Education 13-23
Simple Linear Regression
Example: Making Predictions
 When using a regression model for prediction,
only predict within the relevant range of data
Relevant range for interpolation

[Figure: scatter plot with the prediction line drawn over the observed square-footage range, about 1100 to 2450 sq. ft.]

Do not try to extrapolate beyond the range
of observed X's
Copyright ©2011 Pearson Education 13-24
Measures of Variation

 Total variation is made up of two parts:

SST = SSR + SSE

Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

SST = Σ(Yi - Ȳ)²    SSR = Σ(Ŷi - Ȳ)²    SSE = Σ(Yi - Ŷi)²

where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value
Copyright ©2011 Pearson Education 13-25
Measures of Variation
(continued)

 SST = total sum of squares (Total Variation)


 Measures the variation of the Yi values around their
mean Y
 SSR = regression sum of squares (Explained Variation)
 Variation attributable to the relationship between X
and Y
 SSE = error sum of squares (Unexplained Variation)
 Variation in Y attributable to factors other than X

Copyright ©2011 Pearson Education 13-26


Measures of Variation
(continued)

[Figure: for a single observation (Xi, Yi), SSE measures (Yi - Ŷi)², SST measures (Yi - Ȳ)², and SSR measures (Ŷi - Ȳ)²]
Copyright ©2011 Pearson Education 13-27
Coefficient of Determination, r2
 The coefficient of determination is the portion
of the total variation in the dependent variable
that is explained by variation in the
independent variable
 The coefficient of determination is also called
r-squared and is denoted as r2
r² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ r² ≤ 1

Copyright ©2011 Pearson Education 13-28
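Continuing the Python sketch started earlier (the arrays x, y and estimates b0, b1 are reused), the three sums of squares and r² can be computed directly:

y_hat = b0 + b1 * x                     # predicted values
sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
r2 = ssr / sst
print(sst, ssr, sse, r2)  # approx. 32600.5, 18934.93, 13665.57, 0.58082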


Examples of Approximate
r2 Values
r² = 1

[Figure: two scatter plots in which every point lies exactly on the fitted line]

Perfect linear relationship between X and Y:
100% of the variation in Y is explained by variation in X
Copyright ©2011 Pearson Education 13-29


Examples of Approximate
r2 Values
0 < r² < 1

[Figure: two scatter plots with points scattered around the fitted line]

Weaker linear relationships between X and Y:
Some but not all of the variation in Y is explained by variation in X
Copyright ©2011 Pearson Education 13-30
Examples of Approximate
r2 Values

r² = 0

[Figure: scatter plot with a horizontal fitted line]

No linear relationship between X and Y:
The value of Y does not depend on X. (None of the variation in Y is explained by variation in X)
Copyright ©2011 Pearson Education 13-31


Simple Linear Regression Example:
Coefficient of Determination, r2 in Excel

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10
ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Residual 8 13665.5652 1708.1957
Total 9 32600.5000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

Copyright ©2011 Pearson Education 13-32


Standard Error of Estimate
 The standard deviation of the variation of
observations around the regression line is
estimated by
SYX = √( SSE / (n - 2) ) = √( Σ(Yi - Ŷi)² / (n - 2) )

where:
SSE = error sum of squares
n = sample size

Copyright ©2011 Pearson Education 13-33
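In the same Python sketch, the standard error of the estimate is one line (sse and y as defined earlier):

s_yx = np.sqrt(sse / (len(y) - 2))
print(s_yx)  # approx. 41.33, matching the Excel "Standard Error"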


Simple Linear Regression Example:
Standard Error of Estimate in Excel
SYX = 41.33032

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039

Residual 8 13665.5652 1708.1957


Total 9 32600.5000

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

Copyright ©2011 Pearson Education 13-34


Comparing Standard Errors
SYX is a measure of the variation of observed
Y values from the regression line
[Figure: two scatter plots, one with small SYX (points tight around the line) and one with large SYX (points widely scattered around the line)]

The magnitude of SYX should always be judged relative to the


size of the Y values in the sample data
i.e., SYX = $41.33K is moderately small relative to house prices in
the $200K - $400K range
Copyright ©2011 Pearson Education 13-35
Assumptions of Regression
L.I.N.E

 Linearity
 The relationship between X and Y is linear

 Independence of Errors
 Error values are statistically independent

 Normality of Error
 Error values are normally distributed for any given

value of X
 Equal Variance (also called homoscedasticity)
 The probability distribution of the errors has constant

variance

Copyright ©2011 Pearson Education 13-36


Residual Analysis
ei = Yi - Ŷi
 The residual for observation i, ei, is the difference
between its observed and predicted value
 Check the assumptions of regression by examining the
residuals
 Examine for linearity assumption
 Evaluate independence assumption
 Evaluate normal distribution assumption
 Examine for constant variance for all levels of X
(homoscedasticity)
 Graphical Analysis of Residuals
 Can plot residuals vs. X
Copyright ©2011 Pearson Education 13-37
Residual Analysis for Linearity

[Figure: scatter plots with their residual plots. A curved pattern in the residuals signals a non-linear relationship; residuals scattered randomly around zero support linearity.]

Copyright ©2011 Pearson Education 13-38
Residual Analysis for
Independence

[Figure: residual plots against X/time. A cyclical or trending pattern indicates the errors are not independent; a random scatter supports independence.]
Copyright ©2011 Pearson Education 13-39


Checking for Normality

 Examine the Stem-and-Leaf Display of the


Residuals
 Examine the Boxplot of the Residuals
 Examine the Histogram of the Residuals
 Construct a Normal Probability Plot of the
Residuals

Copyright ©2011 Pearson Education 13-40


Normal Probability Plot

 The normal probability plot is a graphical


technique for assessing whether or not a data
set is approximately normally distributed.

 The data are plotted against a


theoretical normal distribution in such a way
that the points should form an approximate
straight line.

Copyright ©2011 Pearson Education 13-41


Residual Analysis for Normality

When using a normal probability plot, normal


errors will approximately display in a straight line

[Figure: normal probability plot of the residuals (percent vs. residual); approximately normal errors fall close to a straight line]
Copyright ©2011 Pearson Education 13-42
Residual Analysis for
Equal Variance

[Figure: residual plots against X. A fan shape (spread changing with X) indicates non-constant variance; a uniform band indicates constant variance.]

Copyright ©2011 Pearson Education 13-43


Simple Linear Regression
Example: Excel Residual Output

RESIDUAL OUTPUT

Observation   Predicted House Price   Residuals
1             251.92316               -6.923162
2             273.87671               38.12329
3             284.85348               -5.853484
4             304.06284               3.937162
5             218.99284               -19.99284
6             268.38832               -49.38832
7             356.20251               48.79749
8             367.17929               -43.17929
9             254.66740               64.33264
10            284.85348               -29.85348

[Figure: House Price Model residual plot, residuals vs. square feet]

Does not appear to violate any regression assumptions
Copyright ©2011 Pearson Education 13-44
Inferences About the Slope

 The standard error of the regression slope


coefficient (b1) is estimated by

Sb1 = SYX / √SSX = SYX / √( Σ(Xi - X̄)² )

where:
Sb1 = estimate of the standard error of the slope
SYX = √( SSE / (n - 2) ) = standard error of the estimate
Copyright ©2011 Pearson Education 13-45
Inferences About the Slope:
t Test

 t test for a population slope


 Is there a linear relationship between X and Y?
 Null and alternative hypotheses
 H0: β1 = 0 (no linear relationship)
 H1: β1 ≠ 0 (linear relationship does exist)
 Test statistic:

tSTAT = (b1 - β1) / Sb1        d.f. = n - 2

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope

Copyright ©2011 Pearson Education 13-46
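A short continuation of the Python sketch (assuming scipy is also available) computes the slope's standard error, the t statistic, and its two-tailed p-value:

from scipy import stats

s_b1 = s_yx / np.sqrt(np.sum((x - x.mean()) ** 2))    # standard error of the slope
t_stat = (b1 - 0) / s_b1                              # hypothesized slope is 0
p_value = 2 * stats.t.sf(abs(t_stat), df=len(y) - 2)  # two-tailed p-value
print(t_stat, p_value)  # approx. 3.329 and 0.0104, as in the example that follows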


Inferences About the Slope:
t Test Example

Estimated Regression Equation:
house price = 98.25 + 0.1098 (sq. ft.)

House Price in $1000s (Y)   Square Feet (X)
245     1400
312     1600
279     1700
308     1875
199     1100
219     1550
405     2350
324     2450
319     1425
255     1700

The slope of this model is 0.1098
Is there a relationship between the square footage of the house and its sales price?

Copyright ©2011 Pearson Education 13-47


Inferences About the Slope:
t Test Example
H0: β1 = 0    H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet   0.10977        0.03297          3.32938   0.01039

tSTAT = (b1 - β1) / Sb1 = (0.10977 - 0) / 0.03297 = 3.32938

Copyright ©2011 Pearson Education 13-48


Inferences About the Slope:
t Test Example

H0: β1 = 0    H1: β1 ≠ 0

Test Statistic: tSTAT = 3.329
d.f. = 10 - 2 = 8; α/2 = .025; critical values ±2.3060

[Figure: t distribution with rejection regions beyond ±2.3060; tSTAT = 3.329 falls in the upper rejection region]

Decision: Reject H0
There is sufficient evidence that square footage affects house price

Copyright ©2011 Pearson Education 13-49


Inferences About the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From Excel output:
Coefficients Standard Error t Stat P-value
Intercept 98.24833 58.03348 1.69296 0.12892
Square Feet 0.10977 0.03297 3.32938 0.01039

p-value = 0.01039
Decision: Reject H0, since p-value < α
There is sufficient evidence that
square footage affects house price.
Copyright ©2011 Pearson Education 13-50
F Test for Significance

 F test statistic:

FSTAT = MSR / MSE

where:
MSR = SSR / k
MSE = SSE / (n - k - 1)

FSTAT follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom

(k = the number of independent variables in the regression model)

Copyright ©2011 Pearson Education 13-51


F-Test for Significance
Excel Output

FSTAT = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848
(with 1 and 8 degrees of freedom; the p-value for the F test is Significance F = 0.01039)

Regression Statistics
Multiple R 0.76211
R Square 0.58082
Adjusted R Square 0.52842
Standard Error 41.33032
Observations 10

ANOVA
            df   SS          MS          F        Significance F
Regression  1    18934.9348  18934.9348  11.0848  0.01039
Residual    8    13665.5652  1708.1957
Total       9    32600.5000

Copyright ©2011 Pearson Education 13-52


F Test for Significance
(continued)

H0: β1 = 0    H1: β1 ≠ 0
α = .05; df1 = 1, df2 = 8
Critical Value: F0.05 = 5.32

Test Statistic: FSTAT = MSR / MSE = 11.08

[Figure: F distribution with rejection region beyond F0.05 = 5.32; FSTAT = 11.08 falls in the rejection region]

Decision: Reject H0 at α = 0.05
Conclusion: There is sufficient evidence that house size affects selling price
Copyright ©2011 Pearson Education 13-53
Confidence Interval Estimate
for the Slope
Confidence Interval Estimate of the Slope:
b1 ± tα/2 Sb1        d.f. = n - 2

Excel Printout for House Prices:


Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

At 95% level of confidence, the confidence interval for


the slope is (0.0337, 0.1858)

Copyright ©2011 Pearson Education 13-54
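The same quantities give this interval in the Python sketch (b1 and s_b1 from earlier, scipy's stats module for the critical value):

t_crit = stats.t.ppf(0.975, df=len(y) - 2)     # 2.3060 for 8 d.f.
print(b1 - t_crit * s_b1, b1 + t_crit * s_b1)  # approx. (0.0337, 0.1858)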


Confidence Interval Estimate
for the Slope (continued)

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580

 Since the units of the house price variable are
$1000s, we are 95% confident that the average
impact on sales price is between $33.74 and
$185.80 per square foot of house size

This 95% confidence interval does not include 0.


Conclusion: There is a significant relationship between
house price and square feet at the .05 level of significance

Copyright ©2011 Pearson Education 13-55


Pitfalls of Regression Analysis
 Lacking an awareness of the assumptions
underlying least-squares regression
 Not knowing how to evaluate the assumptions
 Not knowing the alternatives to least-squares
regression if a particular assumption is violated
 Using a regression model without knowledge of
the subject matter
 Extrapolating outside the relevant range

Copyright ©2011 Pearson Education 13-56


Strategies for Avoiding
the Pitfalls of Regression
 Start with a scatter plot of X vs. Y to observe
possible relationship
 Perform residual analysis to check the
assumptions
 Plot the residuals vs. X to check for violations of
assumptions such as homoscedasticity
 Use a histogram, stem-and-leaf display, boxplot,
or normal probability plot of the residuals to
uncover possible non-normality

Copyright ©2011 Pearson Education 13-57


Strategies for Avoiding
the Pitfalls of Regression
(continued)

 If there is violation of any assumption, use


alternative methods or models
 If there is no evidence of assumption violation,
then test for the significance of the regression
coefficients and construct confidence intervals
and prediction intervals
 Avoid making predictions or forecasts outside
the relevant range

Copyright ©2011 Pearson Education 13-58


Chapter Summary

 Introduced types of regression models


 Reviewed assumptions of regression and
correlation
 Discussed determining the simple linear
regression equation
 Described measures of variation
 Discussed residual analysis
 Addressed measuring autocorrelation

Copyright ©2011 Pearson Education 13-59


Chapter Summary
(continued)

 Described inference about the slope


 Discussed correlation -- measuring the strength
of the association
 Addressed estimation of mean values and
prediction of individual values
 Discussed possible pitfalls in regression and
recommended strategies to avoid them

Copyright ©2011 Pearson Education 13-60


Statistics for Managers using
Microsoft Excel

Introduction to Multiple Regression

Copyright ©2011 Pearson Education 14-61


Learning Objectives

In this chapter, you learn:


 How to develop a multiple regression model

 How to interpret the regression coefficients

 How to determine which independent variables to

include in the regression model


 How to determine which independent variables are more

important in predicting a dependent variable


 How to use categorical independent variables in a

regression model

Copyright ©2011 Pearson Education 14-62


The Multiple Regression
Model

Idea: Examine the linear relationship between


1 dependent (Y) & 2 or more independent variables (Xi)

Multiple Regression Model with k Independent Variables:

Yi = β0 + β1X1i + β2X2i + · · · + βkXki + εi

(β0 = Y-intercept; β1, …, βk = population slopes; εi = random error)

Copyright ©2011 Pearson Education 14-63


Multiple Regression Equation

The coefficients of the multiple regression model are


estimated using sample data

Multiple regression equation with k independent variables:


Ŷi = b0 + b1X1i + b2X2i + · · · + bkXki

(Ŷi = estimated or predicted value of Y; b0 = estimated intercept; b1, …, bk = estimated slope coefficients)
In this chapter we will use Excel or Minitab to obtain the
regression slope coefficients and other regression
summary measures.
Copyright ©2011 Pearson Education 14-64
Multiple Regression Equation
(continued)
Two variable model

Ŷ = b0 + b1X1 + b2X2

[Figure: fitted regression plane over the (X1, X2) plane, with slope b1 for variable X1 and slope b2 for variable X2]
Copyright ©2011 Pearson Education 14-65
Example:
2 Independent Variables
 A distributor of frozen dessert pies wants to
evaluate factors thought to influence demand
 Dependent variable: Pie sales (units per week)
 Independent variables: Price (in $)
Advertising ($100’s)

 Data are collected for 15 weeks

Copyright ©2011 Pearson Education 14-66


Pie Sales Example
Multiple regression equation:
Sales = b0 + b1 (Price) + b2 (Advertising)

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Copyright ©2011 Pearson Education 14-67


Excel Multiple Regression Output
Sales = 306.526 - 24.975(Price) + 74.131(Advertising)

Regression Statistics
Multiple R 0.72213
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341
Observations 15

ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Copyright ©2011 Pearson Education 14-68
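A minimal Python sketch of the same fit (assuming numpy; the slides use Excel) solves the least-squares problem for the pie-sales data:

import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])

# Design matrix with an intercept column, then ordinary least squares
X = np.column_stack([np.ones_like(price), price, adv])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(coef)  # approx. [306.526, -24.975, 74.131], matching the Excel coefficients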


The Multiple Regression Equation

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)


where
Sales is in number of pies per week
Price is in $
Advertising is in $100’s.
b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising

b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price

Copyright ©2011 Pearson Education 14-69


Using The Equation to Make
Predictions
Predict sales for a week in which the selling
price is $5.50 and advertising is $350:

Sales = 306.526 - 24.975(Price) + 74.131(Advertising)
      = 306.526 - 24.975(5.50) + 74.131(3.5)
      = 428.62

Predicted sales is 428.62 pies

Note that Advertising is in $100s, so $350 means that X2 = 3.5
Copyright ©2011 Pearson Education 14-70
Coefficient of
Multiple Determination
 Reports the proportion of total variation in Y
explained by all X variables taken together

r² = SSR / SST = regression sum of squares / total sum of squares

Copyright ©2011 Pearson Education 14-71


Multiple Coefficient of
Determination In Excel
r² = SSR / SST = 29460.0 / 56493.3 = .52148

52.1% of the variation in pie sales is explained by the variation in price and advertising

Regression Statistics
Multiple R 0.72213
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341
Observations 15
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Copyright ©2011 Pearson Education 14-72


Adjusted r2
 r2 never decreases when a new X variable is
added to the model
 This can be a disadvantage when comparing

models
 What is the net effect of adding a new variable?
 We lose a degree of freedom when a new X

variable is added
 Did the new X variable add enough

explanatory power to offset the loss of one


degree of freedom?
Copyright ©2011 Pearson Education 14-73
Adjusted r2
(continued)
 Shows the proportion of variation in Y explained
by all X variables, adjusted for the number of X
variables used:

r²adj = 1 - [ (1 - r²) · ( (n - 1) / (n - k - 1) ) ]

(where n = sample size, k = number of independent variables)

 Penalizes excessive use of unimportant independent
variables
 Smaller than r²
 Useful in comparing models
Copyright ©2011 Pearson Education 14-74
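Continuing the pie-sales sketch (X, coef, and sales as defined there), r² and the adjusted r² follow directly from the formula above:

n, k = len(sales), 2                    # 15 observations, 2 independent variables
resid = sales - X @ coef
r2 = 1 - np.sum(resid ** 2) / np.sum((sales - sales.mean()) ** 2)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, r2_adj)  # approx. 0.52148 and 0.44172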
Adjusted r2 in Excel
r²adj = .44172

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables

Regression Statistics
Multiple R 0.72213
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341
Observations 15
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Copyright ©2011 Pearson Education 14-75


Is the Model Significant?
 F Test for Overall Significance of the Model
 Shows if there is a linear relationship between all
of the X variables considered together and Y
 Use F-test statistic
 Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent
variable affects Y)

Copyright ©2011 Pearson Education 14-76


F Test for Overall Significance
 Test statistic:

FSTAT = MSR / MSE = (SSR / k) / ( SSE / (n - k - 1) )

where FSTAT has numerator d.f. = k and denominator d.f. = (n - k - 1)

Copyright ©2011 Pearson Education 14-77


F Test for Overall Significance In
Excel
(continued)
FSTAT = MSR / MSE = 14730.0 / 2252.8 = 6.5386
(with 2 and 12 degrees of freedom; the p-value for the F test is Significance F = 0.01201)

Regression Statistics
Multiple R 0.72213
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341
Observations 15
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Copyright ©2011 Pearson Education 14-78


F Test for Overall Significance
(continued)

H0: β1 = β2 = 0    H1: β1 and β2 not both zero
α = .05; df1 = 2, df2 = 12
Critical Value: F0.05 = 3.885

Test Statistic: FSTAT = MSR / MSE = 6.5386

[Figure: F distribution with rejection region beyond F0.05 = 3.885]

Decision: Since the FSTAT test statistic is in the rejection region (p-value < .05), reject H0
Conclusion: There is evidence that at least one independent variable affects Y
Copyright ©2011 Pearson Education 14-79
Residuals in Multiple Regression
Two variable model

Ŷ = b0 + b1X1 + b2X2

[Figure: a sample observation Yi above the fitted plane at (x1i, x2i); the residual is ei = Yi - Ŷi]

The best fit equation is found by minimizing the sum of squared errors, Σe²
Copyright ©2011 Pearson Education 14-80
Multiple Regression Assumptions

Errors (residuals) from the regression model:

ei = Yi - Ŷi

Assumptions:
 The errors are normally distributed

 Errors have a constant variance

 The model errors are independent

Copyright ©2011 Pearson Education 14-81


Residual Plots Used
in Multiple Regression
 These residual plots are used in multiple
regression:

 Residuals vs. Ŷi
 Residuals vs. X1i
 Residuals vs. X2i
 Residuals vs. time (if time series data)
Use the residual plots to check for
violations of regression assumptions
Copyright ©2011 Pearson Education 14-82
Are Individual Variables
Significant?
 Use t tests of individual variable slopes
 Shows if there is a linear relationship between
the variable Xj and Y holding constant the effects
of other X variables
 Hypotheses:
 H0: βj = 0 (no linear relationship)
 H1: βj ≠ 0 (linear relationship does exist
between Xj and Y)

Copyright ©2011 Pearson Education 14-83


Are Individual Variables
Significant?
(continued)

H0: βj = 0 (no linear relationship)


H1: βj ≠ 0 (linear relationship does exist
between Xj and Y)

Test Statistic:

tSTAT = (bj - 0) / Sbj        (df = n - k - 1)

Copyright ©2011 Pearson Education 14-84


Are Individual Variables
Significant? Excel Output (continued)
t Stat for Price is tSTAT = -2.306, with p-value .0398
t Stat for Advertising is tSTAT = 2.855, with p-value .0145

Regression Statistics
Multiple R 0.72213
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341
Observations 15
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Copyright ©2011 Pearson Education 14-85


Inferences about the Slope:
t Test Example
From the Excel output:
For Price: tSTAT = -2.306, with p-value .0398
For Advertising: tSTAT = 2.855, with p-value .0145

H0: βj = 0    H1: βj ≠ 0
d.f. = 15 - 2 - 1 = 12
α = .05; tα/2 = 2.1788

The test statistic for each variable falls in the rejection region (p-values < .05)

[Figure: t distribution with rejection regions beyond ±2.1788]

Decision: Reject H0 for each variable
Conclusion: There is evidence that both Price and Advertising affect pie sales at α = .05
Copyright ©2011 Pearson Education 14-86
Confidence Interval Estimate
for the Slope
Confidence interval for the population slope βj

bj ± tα/2 Sbj        where t has (n - k - 1) d.f.

              Coefficients   Standard Error
Intercept     306.52619      114.25389
Price         -24.97509      10.83213
Advertising   74.13096       25.96732

Here, t has (15 - 2 - 1) = 12 d.f.

Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.975 ± (2.1788)(10.832)
So the interval is (-48.576 , -1.374)
(This interval does not contain zero, so price has a significant effect on sales holding constant
the effect of advertising)

Copyright ©2011 Pearson Education 14-87


Confidence Interval Estimate
for the Slope
(continued)
Confidence interval for the population slope βj

Coefficients Standard Error … Lower 95% Upper 95%


Intercept 306.52619 114.25389 … 57.58835 555.46404
Price -24.97509 10.83213 … -48.57626 -1.37392
Advertising 74.13096 25.96732 … 17.55303 130.70888

Example: Excel output also reports these interval endpoints:


Weekly sales are estimated to be reduced by between 1.37 and
48.58 pies for each $1 increase in the selling price, holding the
effect of advertising constant

Copyright ©2011 Pearson Education 14-88


Testing Portions of the
Multiple Regression Model
 Contribution of a Single Independent Variable X j

SSR(Xj | all variables except Xj)


= SSR (all variables) – SSR(all variables except Xj)

 Measures the contribution of Xj in explaining the total


variation in Y (SST)

Copyright ©2011 Pearson Education 14-89


Testing Portions of the
Multiple Regression Model
(continued)

Contribution of a Single Independent Variable Xj,


assuming all other variables are already included
(consider here a 2-variable model):

SSR(X1 | X2) = SSR(all variables) - SSR(X2)

where SSR(all variables) comes from the ANOVA section of the regression for Ŷ = b0 + b1X1 + b2X2, and SSR(X2) from the ANOVA section of the regression for Ŷ = b0 + b2X2

Measures the contribution of X1 in explaining SST


Copyright ©2011 Pearson Education 14-90
The Partial F-Test Statistic

 Consider the hypothesis test:


H0: variable Xj does not significantly improve the model after all
other variables are included
H1: variable Xj significantly improves the model after all other
variables are included

 Test using the F-test statistic:


(with 1 and n-k-1 d.f.)

FSTAT = SSR(Xj | all variables except j) / MSE
Copyright ©2011 Pearson Education 14-91
Testing Portions of Model:
Example

Example: Frozen dessert pies

Test at the  = .05 level


to determine whether
the price variable
significantly improves
the model given that
advertising is included

Copyright ©2011 Pearson Education 14-92


Testing Portions of Model:
Example
(continued)

H0: X1 (price) does not improve the model


with X2 (advertising) included
H1: X1 does improve model

 = .05, df = 1 and 12
F0.05 = 4.75
(For X1 and X2)
ANOVA       df   SS           MS
Regression  2    29460.02687  14730.01343
Residual    12   27033.30647  2252.775539
Total       14   56493.33333

(For X2 only)
ANOVA       df   SS
Regression  1    17484.22249
Residual    13   39009.11085
Total       14   56493.33333

Copyright ©2011 Pearson Education 14-93


Testing Portions of Model:
Example (continued)

(For X1 and X2)
ANOVA       df   SS           MS
Regression  2    29460.02687  14730.01343
Residual    12   27033.30647  2252.775539
Total       14   56493.33333

(For X2 only)
ANOVA       df   SS
Regression  1    17484.22249
Residual    13   39009.11085
Total       14   56493.33333

FSTAT = SSR(X1 | X2) / MSE(all) = (29,460.03 - 17,484.22) / 2252.78 = 5.316

Conclusion: Since FSTAT = 5.316 > F0.05 = 4.75 Reject H0;


Adding X1 does improve model

Copyright ©2011 Pearson Education 14-94
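The partial F test can be reproduced in the Python sketch by fitting the reduced model (advertising only) and comparing error sums of squares; since SST is the same for both models, SSR(X1 | X2) equals SSE(reduced) - SSE(full):

X_red = np.column_stack([np.ones_like(adv), adv])   # advertising-only model
coef_red, *_ = np.linalg.lstsq(X_red, sales, rcond=None)
sse_full = np.sum((sales - X @ coef) ** 2)
sse_red = np.sum((sales - X_red @ coef_red) ** 2)
mse_full = sse_full / (n - k - 1)
f_stat = (sse_red - sse_full) / mse_full
print(f_stat)  # approx. 5.316, as computed above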


Relationship Between Test
Statistics
 The partial F test statistic developed in this section and
the t test statistic are both used to determine the
contribution of an independent variable to a multiple
regression model.
 The hypothesis tests associated with these two
statistics always result in the same decision (that is, the
p-values are identical).

t²a = F1,a        where a = degrees of freedom

Copyright ©2011 Pearson Education 14-95


Coefficient of Partial Determination
for k variable model
r²Yj.(all variables except j) =
SSR(Xj | all variables except j) / [ SST - SSR(all variables) + SSR(Xj | all variables except j) ]

 Measures the proportion of variation in the dependent


variable that is explained by Xj while controlling for
(holding constant) the other independent variables

Copyright ©2011 Pearson Education 14-96


Coefficient of Partial
Determination in Excel
 Coefficients of Partial Determination can be
found using Excel:
 PHStat | regression | multiple regression …
 Check the “coefficient of partial determination” box
Regression Analysis: Coefficients of Partial Determination

Intermediate Calculations
SSR(X1,X2)   29460.02687
SST          56493.33333
SSR(X2)      17484.22249        SSR(X1 | X2)   11975.80438
SSR(X1)      11100.43803        SSR(X2 | X1)   18359.58884

Coefficients
r² Y1.2   0.307000188
r² Y2.1   0.404459524

Copyright ©2011 Pearson Education 14-97


Using Dummy Variables
 A dummy variable is a categorical independent
variable with two levels:
 yes or no, on or off, male or female
 coded as 0 or 1
 Assumes the slopes associated with numerical
independent variables do not change with the
value for the categorical variable
 If more than two levels, the number of dummy
variables needed is (number of levels - 1)

Copyright ©2011 Pearson Education 14-98


Dummy-Variable Example
(with 2 Levels)

Ŷ = b0 + b1X1 + b2X2

Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)

Copyright ©2011 Pearson Education 14-99


Dummy-Variable Example
(with 2 Levels) (continued)

Ŷ = b0 + b1X1 + b2(1) = (b0 + b2) + b1X1        Holiday
Ŷ = b0 + b1X1 + b2(0) = b0 + b1X1               No Holiday

Different intercept, same slope

[Figure: pie sales (Y) vs. price (X1); two parallel lines, the Holiday line (X2 = 1, intercept b0 + b2) above the No Holiday line (X2 = 0, intercept b0)]

If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales

Copyright ©2011 Pearson Education 14-100


Interpreting the Dummy Variable
Coefficient (with 2 Levels)
Example: Sales = 300 - 30(Price) + 15(Holiday)
Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week, 0 if no holiday occurred

b2 = 15: on average, sales were 15 pies greater in


weeks with a holiday than in weeks without a
holiday, given the same price

Copyright ©2011 Pearson Education 14-101


Dummy-Variable Models
(more than 2 Levels)
 The number of dummy variables is one less
than the number of levels
 Example:
Y = house price ; X1 = square feet

 If style of the house is also thought to matter:


Style = ranch, split level, colonial

Three levels, so two dummy


variables are needed
Copyright ©2011 Pearson Education 14-102
Dummy-Variable Models
(more than 2 Levels) (continued)

 Example: Let “colonial” be the default category, and let


X2 and X3 be used for the other two categories:

Y = house price
X1 = square feet
X2 = 1 if ranch, 0 otherwise
X3 = 1 if split level, 0 otherwise

The multiple regression equation is:


Ŷ = b0 + b1X1 + b2X2 + b3X3
Copyright ©2011 Pearson Education 14-103
Interpreting the Dummy Variable
Coefficients (with 3 Levels)
Consider the regression equation:
Ŷ = 20.43 + 0.045X1 + 23.53X2 + 18.84X3

For a colonial: X2 = X3 = 0, so Ŷ = 20.43 + 0.045X1
For a ranch: X2 = 1, X3 = 0, so Ŷ = 20.43 + 0.045X1 + 23.53
For a split level: X2 = 0, X3 = 1, so Ŷ = 20.43 + 0.045X1 + 18.84

With the same square feet, a ranch will have an estimated average price 23.53 thousand dollars more than a colonial, and a split level 18.84 thousand dollars more than a colonial.
Copyright ©2011 Pearson Education 14-104
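A small Python sketch of this coding scheme (the style data here is made up purely for illustration): with "colonial" as the baseline, only two 0/1 columns are created for the three levels:

styles = ["ranch", "colonial", "split level", "ranch", "split level"]  # hypothetical sample
x2 = np.array([1 if s == "ranch" else 0 for s in styles])        # X2 = 1 if ranch
x3 = np.array([1 if s == "split level" else 0 for s in styles])  # X3 = 1 if split level
print(x2, x3)  # a colonial is any row where both dummies are 0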
Interaction Between
Independent Variables
 Hypothesizes interaction between pairs of X
variables
 Response to one X variable may vary at different
levels of another X variable

 Contains two-way cross product terms


Ŷ = b0 + b1X1 + b2X2 + b3X3
  = b0 + b1X1 + b2X2 + b3(X1X2)
Copyright ©2011 Pearson Education 14-105
Effect of Interaction

 Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

 Without the interaction term, the effect of X1 on Y is measured by β1
 With the interaction term, the effect of X1 on Y is measured by β1 + β3X2
 The effect changes as X2 changes

Copyright ©2011 Pearson Education 14-106


Interaction Example
Suppose X2 is a dummy variable and the estimated
regression equation is Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: the two lines plotted for 0 ≤ X1 ≤ 1.5]

Slopes are different if the effect of X1 on Y depends on the X2 value
Copyright ©2011 Pearson Education 14-107
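A quick numeric check of the two slopes, using the estimated equation from this slide:

def y_hat(x1, x2):
    # Y-hat = 1 + 2*X1 + 3*X2 + 4*X1*X2, from the slide above
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

print(y_hat(1, 0) - y_hat(0, 0))  # 2: slope in X1 when X2 = 0
print(y_hat(1, 1) - y_hat(0, 1))  # 6: slope in X1 when X2 = 1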
Significance of Interaction Term

 Can perform a partial F test for the contribution


of a variable to see if the addition of an
interaction term improves the model

 Multiple interaction terms can be included


 Use a partial F test for the simultaneous contribution
of multiple variables to the model

Copyright ©2011 Pearson Education 14-108


Simultaneous Contribution of
Independent Variables
 Use partial F test for the simultaneous
contribution of multiple variables to the model
 Let m variables be an additional set of variables
added simultaneously
 To test the hypothesis that the set of m variables
improves the model:

FSTAT = { [SSR(all) - SSR(all except new set of m variables)] / m } / MSE(all)

(where FSTAT has m and n - k - 1 d.f.)


Copyright ©2011 Pearson Education 14-109
Chapter Summary
 Developed the multiple regression model
 Tested the significance of the multiple regression model
 Discussed adjusted r2
 Discussed using residual plots to check model
assumptions
 Tested individual regression coefficients
 Tested portions of the regression model
 Used dummy variables
 Evaluated interaction effects

Copyright ©2011 Pearson Education 14-110


Statistics for Managers using
Microsoft Excel

Multiple Regression Model Building

Copyright ©2011 Pearson Education 15-111


Learning Objectives

In this chapter, you learn:


 To use quadratic terms in a regression model
 To use transformed variables in a regression
model
 To measure the correlation among the
independent variables
 To build a regression model using either the
stepwise or best-subsets approach
 To avoid the pitfalls involved in developing a
multiple regression model
Copyright ©2011 Pearson Education 15-112
Nonlinear Relationships
 The relationship between the dependent
variable and an independent variable may
not be linear
 Can review the scatter plot to check for non-
linear relationships
 Example: Quadratic model

Yi = β0 + β1X1i + β2X1i² + εi
 The second independent variable is the square
of the first variable

Copyright ©2011 Pearson Education 15-113


Quadratic Regression Model
Model form:

Yi = β0 + β1X1i + β2X1i² + εi
 where:
β0 = Y intercept
β1 = regression coefficient for linear effect of X on Y
β2 = regression coefficient for quadratic effect on Y
εi = random error in Y for observation i

Copyright ©2011 Pearson Education 15-114


Linear vs. Nonlinear Fit

[Figure: scatter plots with linear and nonlinear fits and their residual plots]

A linear fit does not give random residuals; a nonlinear fit gives random residuals

Copyright ©2011 Pearson Education 15-115
Quadratic Regression Model
Yi  β0  β1X1i  β 2 X1i2  ε i
Quadratic models may be considered when the scatter
plot takes on one of the following shapes:

[Figure: four curves over X1, for β1 < 0 with β2 > 0; β1 > 0 with β2 > 0; β1 < 0 with β2 < 0; and β1 > 0 with β2 < 0]

β1 = the coefficient of the linear term
β2 = the coefficient of the squared term
Copyright ©2011 Pearson Education 15-116
Testing the Overall
Quadratic Model
 Estimate the quadratic model to obtain the
regression equation:

Ŷi = b0 + b1X1i + b2X1i²

 Test for Overall Relationship
H0: β1 = β2 = 0 (no overall relationship between X and Y)
H1: β1 and/or β2 ≠ 0 (there is a relationship between X and Y)

 FSTAT = MSR / MSE

Copyright ©2011 Pearson Education 15-117


Testing for Significance:
Quadratic Effect
 Testing the Quadratic Effect

 Compare the quadratic regression equation Ŷi = b0 + b1X1i + b2X1i²
with the linear regression equation Ŷi = b0 + b1X1i

Copyright ©2011 Pearson Education 15-118


Testing for Significance:
Quadratic Effect
(continued)

 Testing the Quadratic Effect


 Consider the quadratic regression equation
Ŷi = b0 + b1X1i + b2X1i²
Hypotheses
H0: β2 = 0 (The quadratic term does not improve the model)

H1: β2  0 (The quadratic term improves the model)

Copyright ©2011 Pearson Education 15-119


Testing for Significance:
Quadratic Effect
(continued)

 Testing the Quadratic Effect


Hypotheses
H0: β2 = 0 (The quadratic term does not improve the model)

H1: β2  0 (The quadratic term improves the model)


 The test statistic is

tSTAT = (b2 - β2) / Sb2        d.f. = n - 3

where:
b2 = squared-term slope coefficient
β2 = hypothesized slope (zero)
Sb2 = standard error of the slope

Copyright ©2011 Pearson Education 15-120


Testing for Significance:
Quadratic Effect
(continued)

 Testing the Quadratic Effect

Compare r2 from simple regression to


adjusted r2 from the quadratic model

 If adj. r2 from the quadratic model is larger


than the r2 from the simple model, then the
quadratic model is likely a better model

Copyright ©2011 Pearson Education 15-121


Example: Quadratic Model

Purity increases as filter time increases:

Purity   Filter Time
3        1
7        2
8        3
15       5
22       7
33       8
40       10
54       12
67       13
70       14
78       15
85       15
87       16
99       17

[Figure: scatter plot of Purity vs. Time]
Copyright ©2011 Pearson Education 15-122


Example: Quadratic Model
(continued)
 Simple regression results:
Ŷ = -11.283 + 5.985 Time

            Coefficients   Standard Error   t Stat     P-value
Intercept   -11.28267      3.46805          -3.25332   0.00691
Time        5.98520        0.30966          19.32819   2.078E-10

Regression Statistics
R Square            0.96888
Adjusted R Square   0.96628
Standard Error      6.15997
F = 373.57904, Significance F = 2.0778E-10

The t statistic, F statistic, and r² are all high, but the residuals are not random:

[Figure: Time residual plot showing a curved, non-random pattern]
Copyright ©2011 Pearson Education 15-123
Example: Quadratic Model in Excel
(continued)
 Quadratic regression results:
Ŷ = 1.539 + 1.565 Time + 0.245 (Time)²

               Coefficients   Standard Error   t Stat    P-value
Intercept      1.53870        2.24465          0.68550   0.50722
Time           1.56496        0.60179          2.60052   0.02467
Time-squared   0.24516        0.03258          7.52406   1.165E-05

Regression Statistics
R Square            0.99494
Adjusted R Square   0.99402
Standard Error      2.59513
F = 1080.7330, Significance F = 2.368E-13

[Figure: Time and Time-squared residual plots, now showing random scatter]

The quadratic term is significant and improves the model: adj. r² is higher, SYX is lower, and the residuals are now random
Copyright ©2011 Pearson Education 15-124
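A Python sketch of the quadratic fit (assuming numpy) reproduces these coefficients from the purity data:

time = np.array([1, 2, 3, 5, 7, 8, 10, 12, 13, 14, 15, 15, 16, 17])
purity = np.array([3, 7, 8, 15, 22, 33, 40, 54, 67, 70, 78, 85, 87, 99])

# np.polyfit returns coefficients from the highest degree down
b2, b1, b0 = np.polyfit(time, purity, 2)
print(b0, b1, b2)  # approx. 1.539, 1.565, 0.245, matching the slide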
Using Transformations in
Regression Analysis
Idea:
 non-linear models can often be transformed

to a linear form
 Can be estimated by least squares if transformed
 transform X or Y or both to get a better fit or
to deal with violations of regression
assumptions
 Can be based on theory, logic or scatter
plots
Copyright ©2011 Pearson Education 15-125
The Square Root Transformation

 The square-root transformation

Yi = β0 + β1√X1i + εi

 Used to
 overcome violations of the constant variance
assumption
 fit a non-linear relationship

Copyright ©2011 Pearson Education 15-126


The Square Root Transformation
(continued)

Yi = β0 + β1√X1i + εi

 Shape of original relationship vs. relationship when transformed:

[Figure: curved original relationships (for b1 > 0 and b1 < 0) become approximately linear when Y is plotted against √X1]

Copyright ©2011 Pearson Education 15-127
The Log Transformation

The Multiplicative Model:
 Original multiplicative model:     Yi = β0 · X1i^β1 · εi
 Transformed multiplicative model:  log Yi = log β0 + β1 log X1i + log εi

The Exponential Model:
 Original exponential model:        Yi = e^(β0 + β1X1i + β2X2i) · εi
 Transformed exponential model:     ln Yi = β0 + β1X1i + β2X2i + ln εi

Copyright ©2011 Pearson Education 15-128


Interpretation of coefficients

For the multiplicative model:

log Yi  log β0  β1 log X1i  log ε i

 When both dependent and independent


variables are logged:
 The coefficient of the independent variable Xk can
be interpreted as follows: a 1 percent change in Xk leads to
an estimated bk percent change in the average
value of Y. Therefore bk is the elasticity of Y with
respect to a change in Xk.
Copyright ©2011 Pearson Education 15-129
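A sketch of the log-log (multiplicative) fit in Python, reusing the positive-valued house-price arrays x and y from the earlier sketch purely for illustration:

log_x, log_y = np.log(x), np.log(y)
b1 = np.sum((log_x - log_x.mean()) * (log_y - log_y.mean())) / np.sum((log_x - log_x.mean()) ** 2)
b0 = log_y.mean() - b1 * log_x.mean()
# b1 is the elasticity: a 1% change in X is associated with roughly a b1% change in Y
print(b0, b1)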
Collinearity

 Collinearity: High correlation exists among two


or more independent variables
 This means the correlated variables contribute
redundant information to the multiple regression
model

Copyright ©2011 Pearson Education 15-130


Collinearity (continued)

 Including two highly correlated independent


variables can adversely affect the regression
results
 No new information provided
 Can lead to unstable coefficients (large
standard error and low t-values)
 Coefficient signs may not match prior
expectations

Copyright ©2011 Pearson Education 15-131


Some Indications of Strong
Collinearity
 Incorrect signs on the coefficients
 Large change in the value of a previous
coefficient when a new variable is added to the
model
 A previously significant variable becomes non-
significant when a new independent variable is
added
 The estimate of the standard deviation of the
model increases when a variable is added to
the model
Copyright ©2011 Pearson Education 15-132
Detecting Collinearity
(Variance Inflationary Factor)
VIFj is used to measure collinearity:

VIFj = 1 / (1 - R²j)

where R²j is the coefficient of determination of variable Xj with all other X variables

If VIFj > 5, Xj is highly correlated with


the other independent variables

Copyright ©2011 Pearson Education 15-133
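A sketch of the VIF computation in Python: regress each X on the remaining X's and apply 1 / (1 - R²). This reuses price and adv from the pie-sales sketch (with only two X's, each is simply regressed on the other):

def vif(xj, others):
    # R-squared from regressing xj on the other independent variables
    Z = np.column_stack([np.ones(len(xj))] + list(others))
    c, *_ = np.linalg.lstsq(Z, xj, rcond=None)
    r2 = 1 - np.sum((xj - Z @ c) ** 2) / np.sum((xj - xj.mean()) ** 2)
    return 1 / (1 - r2)

print(vif(price, [adv]))  # VIF for Price; a value above 5 would signal collinearity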


Example: Pie Sales
Recall the multiple regression equation of chapter 14:
Sales = b0 + b1 (Price) + b2 (Advertising)

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7

Copyright ©2011 Pearson Education 15-134


Model Building
 Goal is to develop a model with the best set of
independent variables
 Easier to interpret if unimportant variables are
removed
 Lower probability of collinearity
 Stepwise regression procedure
 Provide evaluation of alternative models as variables
are added and deleted
 Best-subset approach
 Try all combinations and select the best using the
highest adjusted r2 and lowest standard error

Copyright ©2011 Pearson Education 15-135


Stepwise Regression

 Idea: develop the least squares regression


equation in steps, adding one independent
variable at a time and evaluating whether
existing variables should remain or be removed

 The coefficient of partial determination is the


measure of the marginal contribution of each
independent variable, given that other
independent variables are in the model

Copyright ©2011 Pearson Education 15-136


Best Subsets Regression

 Idea: estimate all possible regression equations


using all possible combinations of independent
variables

 Choose the best fit by looking for the highest


adjusted r2 and lowest standard error

Stepwise regression and best subsets


regression can be performed using PHStat

Copyright ©2011 Pearson Education 15-137


Alternative Best Subsets
Criterion

 Calculate the value Cp for each potential


regression model

 Consider models with Cp values close to or


below k + 1

 k is the number of independent variables in the


model under consideration

Copyright ©2011 Pearson Education 15-138


Alternative Best Subsets
Criterion (continued)

 The Cp Statistic:

Cp = [ (1 - Rk²)(n - T) / (1 - RT²) ] - (n - 2(k + 1))

where:
k = number of independent variables included in a particular regression model
T = total number of parameters to be estimated in the full regression model
Rk² = coefficient of multiple determination for a model with k independent variables
RT² = coefficient of multiple determination for the full model with all T estimated parameters
Copyright ©2011 Pearson Education 15-139
Steps in Model Building
1. Compile a listing of all independent variables
under consideration
2. Estimate full model and check VIFs
3. Check if any VIFs > 5
 If no VIF > 5, go to step 4
 If one VIF > 5, remove this variable
 If more than one, eliminate the variable with the
highest VIF and go back to step 2
4. Perform best subsets regression with the remaining
variables …
Copyright ©2011 Pearson Education 15-140
Steps in Model Building
(continued)

5. List all models with Cp close to or less than (k + 1)
6. Choose the best model
 Consider parsimony
 Do extra variables make a significant contribution?
7. Perform a complete analysis with the chosen model,
including residual analysis
8. Transform the model if necessary to deal with
violations of linearity or other model assumptions
9. Use the model for prediction and inference
Copyright ©2011 Pearson Education 15-141
Model Building Flowchart

Choose X1, X2, …, Xk → run a regression to find the VIFs → any VIF > 5?
 No: run subsets regression to obtain the "best" models in terms of Cp; do a complete analysis; add quadratic and/or interaction terms or transform variables as needed; perform predictions
 Yes, more than one: remove the variable with the highest VIF and rerun the regression
 Yes, exactly one: remove that X and rerun
Copyright ©2011 Pearson Education 15-142
Pitfalls and Ethical
Considerations
To avoid pitfalls and address ethical considerations:
 Understand that the interpretation of each
estimated regression coefficient holds all other
independent variables constant
 Evaluate residual plots for each independent
variable
 Evaluate interaction terms

Copyright ©2011 Pearson Education 15-143


Additional Pitfalls
and Ethical Considerations
(continued)

To avoid pitfalls and address ethical considerations:


 Obtain VIFs for each independent variable
before determining which variables should be
included in the model
 Examine several alternative models using best-
subsets regression
 Use other methods when the assumptions
necessary for least-squares regression have
been seriously violated

Copyright ©2011 Pearson Education 15-144


Chapter Summary
 Developed the quadratic regression model
 Discussed using transformations in
regression models
 The multiplicative model
 The exponential model
 Described collinearity
 Discussed model building
 Stepwise regression
 Best subsets
 Addressed pitfalls in multiple regression and
ethical considerations
Copyright ©2011 Pearson Education 15-145
Objectives:
 Statistical learning including quantitative,
qualitative analysis techniques
 Predictive Analytics using linear, polynomial
and logistic regression techniques and
model comparison
 The use of the above analysis and
visualization to aid decision making

Copyright ©2011 Pearson Education


Content:
 Business Analytics - Introduction
 Statistical Methods for Business Analytics
 Basics of Hypothesis Testing
 Correlation and Regression
 Multiple Linear Regression
 Model Comparison and Performance
 Classification
 Time Series Analysis

Copyright ©2011 Pearson Education


Statistical Learning

Copyright ©2011 Pearson Education


Function:

y = 3x² + 12x + 2

[Figure: plot of y against x]

Copyright ©2011 Pearson Education
Statistical Learning:

Y = f(x) + є
Copyright ©2011 Pearson Education
Statistical Learning:

Income = f(Yrs of Edu and Seniority) + є


Copyright ©2011 Pearson Education
Statistical Learning:

Statistical Learning: Refers to a set of


approaches for estimating f

Income = f(Yrs of Edu and Seniority) + є


Copyright ©2011 Pearson Education
Why Estimate Function “ f ”
1) What is happening?  →  Inference
2) What is going to happen?  →  Prediction

Copyright ©2011 Pearson Education


Inference
Y = 14 + 223x1 + 34x2 - 120x3 + 0.0002x4 - 12x5 + 0.006x6 + є

All x values (x1 to x6 are in the range of 0 to 10)

Which predictors are associated with the response?

What is the relationship between the response and each


predictor?

Can the relationship between Y and each predictor be


adequately summarized using a linear equation, or is the
relationship more complicated?

Copyright ©2011 Pearson Education


Prediction

Income = f (Education and Seniority) + є


Copyright ©2011 Pearson Education
Prediction
Y = f(x) + є
Ŷ = f̂(x)

E(Y - Ŷ)² = [f(x) - f̂(x)]² + Var(є)
            reducible         irreducible

A set of approaches for estimating f


Copyright ©2011 Pearson Education
Why Estimate Function “ f ”
1) What is happening?  →  Inference
2) What is going to happen?  →  Prediction

Examples ? ? ?
Copyright ©2011 Pearson Education


Methods to Estimate Function “ f ”
 Parametric

Non-Parametric 
Copyright ©2011 Pearson Education


Supervised Learning

Copyright ©2011 Pearson Education


Unsupervised Learning

Copyright ©2011 Pearson Education




Summary
1) Functions and their Variables
2) Statistical Learning
3) Estimating Function
4) Purpose of Estimating Function:
• Inferences
• Predictions
5) Methods to Estimate f
• Parametric
• Non-parametric
6) Prediction Accuracy vs Interpretability
7) Supervised vs Unsupervised Learning
Copyright ©2011 Pearson Education
Content:
 Business Analytics - Introduction
 Statistical Methods for Business Analytics
 Basics of Hypothesis Testing
 Correlation and Regression
 Multiple Linear Regression
 Model Comparison and Performance
 Classification
 Time Series Analysis

Copyright ©2011 Pearson Education


Classification:
 A person arrives at the emergency room with a set of
symptoms that could possibly be attributed to one of
three medical conditions. Which of the three
conditions does the individual have?
 An online banking service must be able to determine
whether or not a transaction being performed on the
site is fraudulent, on the basis of the user’s IP
address, past transaction history, and so forth.
 On the basis of DNA sequence data for a number of
patients with and without a given disease, a biologist
would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.

Copyright ©2011 Pearson Education


Classification:

Copyright ©2011 Pearson Education


Why not MRA?
Classification:

Conditional Probability: Prob (Y=1| X=x)


Copyright ©2011 Pearson Education
Classification: Prob (Y=1| X=x)

Copyright ©2011 Pearson Education


Logistic Regression: Prob (X)

Likelihood Function

Copyright ©2011 Pearson Education


Logistic Regression: Prob (X)

Likelihood Function

Copyright ©2011 Pearson Education


Logistic Regression: Prob (X)

Copyright ©2011 Pearson Education


Logistic Regression: Prob (X)

Copyright ©2011 Pearson Education


Multiple Logistic Regression: Prob (X)

Copyright ©2011 Pearson Education
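The slides estimate these probabilities via the likelihood function; a minimal Python sketch (assuming scikit-learn, with entirely hypothetical balance/default data) fits the model and returns Prob(Y = 1 | X = x):

from sklearn.linear_model import LogisticRegression
import numpy as np

# Hypothetical data: account balance (X) and default indicator (Y)
X = np.array([[500], [1200], [1900], [2500], [800], [2200]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression().fit(X, y)  # coefficients chosen to maximize the likelihood
print(model.predict_proba([[2000]]))    # [Prob(Y=0|X=2000), Prob(Y=1|X=2000)]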


Classification – Multi Level Categorical DV

Bayes Theorem

Copyright ©2011 Pearson Education


Classification - KNN

Copyright ©2011 Pearson Education


Classification - KNN

Copyright ©2011 Pearson Education


Classification – KNN (Value of K)

Copyright ©2011 Pearson Education
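A sketch of KNN classification on the same hypothetical data (scikit-learn assumed); K controls how many neighbors vote:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X, y)                              # reuse the hypothetical data above
print(knn.predict([[2000]]))               # majority class among the 3 nearest neighbors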


Classification – KNN – For Regression

Copyright ©2011 Pearson Education


Classification – KNN – For Regression

K=1 K=9

Copyright ©2011 Pearson Education


Classification – Clustering

Unsupervised
Learning

Copyright ©2011 Pearson Education


Classification – Clustering

• K Means

• Hierarchical

Copyright ©2011 Pearson Education


Classification – Clustering – K Means

Copyright ©2011 Pearson Education




Classification – Clustering – K Means
1. Randomly assign a number, from 1 to K, to each of the
observations. These serve as initial cluster assignments
for the observations.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster
centroid. The kth cluster centroid is the vector of the p
feature means for the observations in the kth cluster.
(b) Assign each observation to the cluster whose centroid
is closest (where closest is defined using Euclidean
distance).

Copyright ©2011 Pearson Education
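A compact Python sketch of this algorithm (numpy assumed; it also assumes no cluster ever becomes empty, which a production implementation would have to handle):

import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))  # step 1: random initial assignment
    while True:
        # step 2(a): centroid = vector of feature means for each cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 2(b): reassign each observation to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # stop when assignments stop changing
            return labels, centroids
        labels = new_labels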




Classification – Clustering-Hierarchical

Copyright ©2011 Pearson Education


