Unit 4-B: Multiple Regression

MULTIPLE REGRESSION

 The simple linear regression model was used to analyze how one interval variable (the
dependent variable y) is related to one other interval variable (the independent variable
x).
 Multiple regression allows for any number of independent variables.
 We expect to develop models that fit the data better than would a simple linear
regression model.
THE MODEL

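 We now assume we have k independent variables potentially related to the one dependent variable. This relationship is represented in this first-order linear equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where y is the dependent variable, x1, x2, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the error variable.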
 In the one variable, two dimensional case we drew a regression line; here we imagine a
response surface.
ESTIMATING THE COEFFICIENTS

 The sample regression equation is expressed as:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
 We will use computer output to:
 Assess the model…
How well does it fit the data?
Is it useful?
Are any required conditions violated?
 Employ the model…
Interpreting the coefficients
Predictions using the regression model.
REGRESSION ANALYSIS STEPS

1. Use a computer and software to generate the coefficients and the statistics
used to assess the model.
2. Diagnose violations of required conditions. If there are problems, attempt
to remedy them.
3. Assess the model’s fit using:
   1. the coefficient of determination,
   2. the F-test of the analysis of variance.
4. If steps 1, 2, and 3 are OK, use the model for prediction.


EXAMPLE: LA QUINTA

 La Quinta Motor Inns is a moderately priced chain of motor inns located across the United States. Its market is the frequent business traveler.
 The chain recently launched a campaign to increase market share by
building new inns. The management of the chain is aware of the difficulty
in choosing locations for new motels. Moreover, making decisions without
adequate information often results in poor decisions.
 Consequently the chain management acquired data on 100 randomly
selected inns belonging to La Quinta. The objective was to predict which
sites are likely to be profitable.
EXAMPLE: LA QUINTA

 To measure profitability La Quinta used operating margin, which is the ratio of the sum of profit, depreciation, and interest expenses to total revenue.
 The higher the operating margin, the greater the success of the inn.
 La Quinta defines profitable inns as those with an operating margin
in excess of 50% and unprofitable inns with margins of less than
30%.
EXAMPLE: LA QUINTA

 After a discussion with a number of experienced managers, La Quinta decided to select one or two independent variables from each of the categories:
 Competition
 Market awareness
 Demand generators
 Demographics
 Physical
EXAMPLE: LA QUINTA

 To measure the degree of competition they determined the total number of motel and hotel rooms within 3 miles of each La Quinta inn.
 Market awareness was measured by the number of miles to the
closest competing motel.
 Two variables that represent sources of customers were chosen.
 The amount of office space and college and university enrollment in
the surrounding community are demand generators. Both of these
are measures of economic activity.
EXAMPLE: LA QUINTA

 A demographic variable that describes the community is the median household income.
 Finally, as a measure of the physical qualities of the location La Quinta chose the distance to the downtown core.
EXAMPLE: LA QUINTA

Where should La Quinta locate a new motel? Factors influencing profitability…

Profitability factor    Measure*
Competition             # of rooms within 3 mile radius
Market Awareness        Distance to the nearest La Quinta inn
Demand Generators       Offices, Higher Ed.
Community               Median household income
Physical                Distance to downtown

*these need to be interval data!
EXAMPLE: LA QUINTA

Where should La Quinta locate a new motel?

Several possible predictors of profitability were identified, and data (Quinta.sav) were collected. It is believed that operating margin (y) is dependent upon these factors:

x1 = Total motel and hotel rooms within 3 mile radius

x2 = Number of miles to closest competition

x3 = Volume of office space in surrounding community

x4 = College and university student numbers in community

x5 = Median household income in community

x6 = Distance (in miles) to the downtown core.


TRANSFORMATION

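 Can we transform this data into a mathematical model that looks like this:
$$\text{margin} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_6 x_6 + \varepsilon$$
where x1 measures competition (i.e. # of rooms), x2 measures awareness (distance to nearest alt.), …, and x6 is the physical measure (distance to downtown).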
USING SPSS

THE MODEL

 Although we haven’t done any assessment of the model yet, at first pass it suggests that:
 Increases in the number of miles to closest competition, office space, student enrollment and household income will positively impact the operating margin.
 Likewise, increases in the total number of lodging rooms within a short distance and in the distance from downtown will negatively impact the operating margin…
MODEL ASSESSMENT

We will assess the model in two ways:

Coefficient of determination, and


F-test of the analysis of variance.
COEFFICIENT OF DETERMINATION…

 Again, the coefficient of determination is defined as:
$$R^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2}$$
 For the La Quinta model R² = .525. This means that 52.5% of the variation in operating margin is explained by the six independent variables, but 47.5% remains unexplained.
ADJUSTED R2 VALUE

 The “adjusted” R² is the coefficient of determination adjusted for the number of explanatory variables.

 It takes into account the sample size n, and k, the number of independent variables, and is given by:
$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k-1)}{\sum (y_i - \bar{y})^2/(n-1)}$$
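As a rough check on the adjustment, the sketch below uses the algebraically equivalent form 1 − (1 − R²)(n − 1)/(n − k − 1), which follows from R² = 1 − SSE/Σ(yi − ȳ)². The function name `adjusted_r2` is illustrative only; the inputs R² = .525, n = 100 inns, and k = 6 variables are taken from the La Quinta example.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 from R^2, sample size n, and number of predictors k."""
    if n - k - 1 <= 0:
        raise ValueError("need more than k + 1 observations")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# La Quinta example: R^2 = .525, n = 100 inns, k = 6 independent variables
la_quinta_adjusted = adjusted_r2(0.525, n=100, k=6)
print(round(la_quinta_adjusted, 3))  # about 0.494
```

The adjusted value is lower than R² because each added variable uses up a degree of freedom; the penalty grows as k approaches n.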
TESTING THE VALIDITY OF THE MODEL

In a multiple regression model (i.e. more than one independent variable), we utilize an
analysis of variance technique to test the overall validity of the model. Here’s the idea:
H0: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0

H1: At least one 𝛽𝑖 is not equal to zero.


If the null hypothesis is true, none of the independent variables is linearly related to y, and
so the model is invalid.
If at least one 𝛽𝑖 is not equal to 0, the model does have some validity.
TESTING THE VALIDITY OF THE MODEL

ANOVA table for regression analysis…

Source of Variation   Degrees of freedom   Sums of Squares   Mean Squares         F-Statistic
Regression            k                    SSR               MSR = SSR/k          F = MSR/MSE
Error                 n–k–1                SSE               MSE = SSE/(n–k–1)
Total                 n–1

A large value of F indicates that most of the variation in y is explained by the regression equation and that the model is valid. A small value of F indicates that most of the variation in y is unexplained.
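The F-statistic in the ANOVA table can be sketched in a few lines. This is an illustration under stated assumptions, not the SPSS computation: the function name `f_statistic` is hypothetical, and because only the ratio of SSR to SSE matters, the explained and unexplained shares of variation (52.5% and 47.5% in the La Quinta example) stand in for the actual sums of squares.

```python
def f_statistic(ssr: float, sse: float, n: int, k: int) -> float:
    """F = MSR / MSE, with MSR = SSR/k and MSE = SSE/(n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


# Assumption: shares of variation used in place of the sums of squares
# (La Quinta example, n = 100 inns, k = 6 independent variables).
f_value = f_statistic(ssr=0.525, sse=0.475, n=100, k=6)
print(round(f_value, 2))  # about 17.13
```

The result is close to the F = 17.14 reported by SPSS; the small gap comes from using the rounded value of R².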
TESTING THE VALIDITY OF THE MODEL

 Our rejection region is:
$$F > F_{\alpha,\,k,\,n-k-1} = F_{\alpha,\,6,\,93}$$
 Since SPSS calculated the F statistic as F = 17.14 and our critical value is 2.17 (and the p-value is reported as zero), we reject H0 in favor of H1, that is:

“there is a great deal of evidence to infer that the model is valid”
INTERPRETING THE COEFFICIENTS*

 Intercept (b0) 38.14 • This is the average operating margin when all of the independent variables are zero. It’s meaningless to try to interpret this value, particularly if 0 is outside the range of the values of the independent variables (as is the case here).
 # of motel and hotel rooms (b1) –.0076 • Each additional room within three miles of the La Quinta inn decreases the average operating margin by .0076% (i.e. for each additional 1000 rooms the margin decreases by 7.6%).
 Distance to nearest competitor (b2) 1.65 • For each additional mile between the nearest competitor and a La Quinta inn, the average operating margin increases by 1.65%.
*in each case we assume all other variables are held constant…
INTERPRETING THE COEFFICIENTS*

 Office space (b3) .020 • For each additional thousand square feet of office space, the
margin will increase by .020. E.g. an extra 100,000 square feet of office space will
increase margin (on average) by 2.0%.
 Student enrollment (b4) .21 • For each additional thousand students, the average
operating margin increases by .21%
 Median household income (b5) .41 • For each additional thousand dollar increase in
median household income, the average operating margin increases by .41%
 Distance to downtown core (b6) –.23 • For each additional mile to the downtown
center, the operating margin decreases on average by .23%
*in each case we assume all other variables are held constant…
TESTING THE COEFFICIENTS

 For each independent variable, we can test to determine whether there is enough evidence
of a linear relationship between it and the dependent variable for the entire population
 The hypotheses:
𝐻0 : 𝛽𝑖 = 0 𝑣𝑠 𝐻𝑎 : 𝛽𝑖 ≠ 0, 𝑖 = 1, … , 𝑘.

 The test statistic is given by:
$$t = \frac{b_i - \beta_i}{s_{b_i}}$$
with n–k–1 degrees of freedom.


TESTING THE COEFFICIENTS

We can use our SPSS output to quickly test each of the six coefficients in our model…

Thus, the number of hotel and motel rooms, distance to the nearest motel, amount
of office space, and median household income are linearly related to the operating
margin. There is no evidence to infer that college enrollment and distance to
downtown center are linearly related to operating margin.
USING THE REGRESSION EQUATION

 Much like we did with simple linear regression, we can produce a prediction interval for
a particular value of y.

 As well, we can produce the confidence interval estimate of the expected value of y.
USING THE REGRESSION EQUATION

Predict the operating margin if a La Quinta Inn is built at a location where:


1. There are 3815 rooms within 3 miles of the site.
2. The closest other hotel or motel is .9 miles away.
3. The amount of office space is 476,000 square feet.
4. There is one college and one university nearby with a total enrollment of 24,500 students.
5. Census data indicates the median household income in the area (rounded to the nearest
thousand) is $35,000, and,
6. The distance to the downtown center is 11.2 miles.

our xi’s…
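The point prediction for this site can be sketched by plugging the six values into the sample regression equation. The coefficients below are the rounded values from the "Interpreting the coefficients" slides, so the result is approximate; the dictionary keys and the function name `predict_margin` are hypothetical. Office space, enrollment, and income are entered in thousands, matching the units used when the coefficients were interpreted.

```python
# Rounded coefficients b0..b6 from the "Interpreting the coefficients" slides.
INTERCEPT = 38.14
COEFFICIENTS = {
    "rooms": -0.0076,       # x1: motel and hotel rooms within 3 miles
    "nearest": 1.65,        # x2: miles to the closest competitor
    "office_space": 0.020,  # x3: office space, thousands of square feet
    "enrollment": 0.21,     # x4: students, thousands
    "income": 0.41,         # x5: median household income, $ thousands
    "distance": -0.23,      # x6: miles to the downtown core
}


def predict_margin(site: dict) -> float:
    """Point prediction: b0 + b1*x1 + ... + b6*x6."""
    return INTERCEPT + sum(COEFFICIENTS[name] * site[name] for name in COEFFICIENTS)


site = {
    "rooms": 3815,
    "nearest": 0.9,
    "office_space": 476,   # 476,000 square feet
    "enrollment": 24.5,    # 24,500 students
    "income": 35,          # $35,000
    "distance": 11.2,
}
point_prediction = predict_margin(site)
print(round(point_prediction, 2))  # about 37.07
```

This is only the point prediction; the prediction and confidence intervals also depend on the standard errors, which come from the software output.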
USING THE REGRESSION EQUATION

We add one row (our given values for the independent variables) to the
bottom of our data set:
PREDICTION INTERVAL

 We predict that the operating margin will fall between 25.31 and 48.73.
 If management defines a profitable inn as one with an operating margin greater than 50%
and an unprofitable inn as one with an operating margin below 30%, they will pass on this
site, since the entire prediction interval is below 50%.
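The reasoning above compares the whole prediction interval with management's cut-offs. A minimal sketch of that comparison follows; the function name `assess_site` and the wording of the returned labels are assumptions for illustration, while the 50% and 30% thresholds come from La Quinta's definitions.

```python
def assess_site(lower: float, upper: float,
                profitable: float = 50.0, unprofitable: float = 30.0) -> str:
    """Compare a prediction interval (in %) with the profitability cut-offs."""
    if lower > profitable:
        return "build: whole interval above the profitable cut-off"
    if upper < unprofitable:
        return "pass: whole interval below the unprofitable cut-off"
    if upper < profitable:
        return "pass: whole interval below the profitable cut-off"
    return "undecided: interval straddles the profitable cut-off"


print(assess_site(25.31, 48.73))
```

For the interval 25.31 to 48.73 the upper limit is below 50, so the sketch returns the "pass" label, matching the conclusion on the slide.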

CONFIDENCE INTERVAL

 The expected operating margin of all sites that fit this category is estimated to be
between 32.87 and 41.18.
 We interpret this to mean that if we built inns on an infinite number of sites that fit the category described, the mean operating margin would fall between 32.87 and 41.18. In other words, the average inn would not be profitable either…

REGRESSION DIAGNOSTICS

 Calculate the residuals and check the following:


 Is the error variable nonnormal?
 Perform a normality test

 Is the error variance constant?


 Plot the residuals versus the predicted values of y.

 Are the errors independent (time-series data)?


 Plot the residuals versus the time periods.

 Are there observations that are inaccurate or do not belong to the target population?
 Double-check the accuracy of outliers and influential observations.
REGRESSION DIAGNOSTICS

 Multiple regression models have a problem that simple regressions do not, namely
multicollinearity.
 It happens when the independent variables are highly correlated.
 We’ll explore this concept through the following example…
EXAMPLE: HOUSING PRICES

 A real estate agent wanted to develop a model to predict the selling price of a home. The
agent believed that the most important variables in determining the price of a house are
its:
 size,
 number of bedrooms,
 and lot size.
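 The proposed model is:
$$\text{price} = \beta_0 + \beta_1(\text{house size}) + \beta_2(\text{bedrooms}) + \beta_3(\text{lot size}) + \varepsilon$$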
 Housing market data has been gathered and SPSS is the analysis tool of choice
(housprices.sav)
USING SPSS

The F-test indicates the model is valid…

…but these t-stats suggest none of the variables are related to the selling price.
EXAMPLE: HOUSING PRICES

Unlike the t-tests in the multiple regression model, these three t-tests tell us that the number of bedrooms, the house size, and the lot size are all linearly related to the price…
EXAMPLE: HOUSING PRICES

How to account for this apparent contradiction?

The answer is that the three independent variables are correlated with each other!

This is reasonable: larger houses have more bedrooms and are situated on larger lots, and smaller houses have fewer bedrooms and are located on smaller lots.

Multicollinearity affected the t-tests so that they implied that none of the independent variables is linearly related to price when, in fact, all are.

One Solution: Use stepwise regression or remove the most collinear variables.
REGRESSION  The regression assumptions related to errors are
DIAGNOSTICS  Normality: Use either pp-plot or test for
normality.
 Equal variances (constant variance): Use the
plot of the standardized residuals versus the
fitted values
 Independence of the errors: Either by using
the plot of residuals versus time and counting
the number of runs or using Durbin-Watson
test.
REGRESSION DIAGNOSTICS – TIME SERIES

 The Durbin-Watson test allows us to determine whether there is evidence of first-order autocorrelation, a condition in which a relationship exists between consecutive residuals, i.e. e(i–1) and e(i), where i is the time period. The statistic for this test is defined as:
$$d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
 d has a range of values: 0 ≤ d ≤ 4.
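A minimal sketch of the statistic, assuming the residuals are already listed in time order: d is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The function name `durbin_watson` and the two residual lists are made up for illustration and are not taken from any data set in this unit.

```python
from typing import Sequence


def durbin_watson(residuals: Sequence[float]) -> float:
    """d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


# Made-up residuals, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every period
constant = [1.0, 1.0, 1.0, 1.0]       # same value every period
print(durbin_watson(alternating))  # 3.0
print(durbin_watson(constant))     # 0.0
```

Residuals that keep the same sign from one period to the next push d toward 0 (positive autocorrelation), while residuals that alternate in sign push d toward 4.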


DURBIN–WATSON (TWO-TAIL TEST)

 To test for first-order autocorrelation:
 If d < dL or d > 4 – dL, first-order autocorrelation exists.
 If d falls between dL and dU or between 4 – dU and 4 – dL, the test is inconclusive.
 If d falls between dU and 4 – dU, there is no evidence of first-order autocorrelation.

Decision regions along the d axis:
0 to dL: exists
dL to dU: inconclusive
dU to 4–dU: doesn’t exist (d = 2 is the midpoint)
4–dU to 4–dL: inconclusive
4–dL to 4: exists
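The two-tail decision rule can be sketched as a small function. The name `durbin_watson_decision` and the returned labels are assumptions for illustration; dL and dU still have to be looked up in a Durbin-Watson table for the given n, k, and α. The sample call uses the values from the lift-ticket example later in this unit (d = .593, dL = 1.10, dU = 1.54).

```python
def durbin_watson_decision(d: float, d_l: float, d_u: float) -> str:
    """Two-tail Durbin-Watson decision for first-order autocorrelation."""
    if not 0.0 <= d <= 4.0:
        raise ValueError("d must lie between 0 and 4")
    if d < d_l or d > 4.0 - d_l:
        return "exists"
    if d_u < d < 4.0 - d_u:
        return "no evidence"
    return "inconclusive"


print(durbin_watson_decision(0.593, d_l=1.10, d_u=1.54))  # exists
print(durbin_watson_decision(2.0, d_l=1.10, d_u=1.54))    # no evidence
print(durbin_watson_decision(1.3, d_l=1.10, d_u=1.54))    # inconclusive
```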
EXAMPLE: LIFT TICKETS

 Can we create a model that will predict lift ticket sales at a ski hill based on two weather parameters?
 Variables:
y: Lift ticket sales during Christmas
x1: Total snowfall (inches)
x2: Average temperature (degrees Fahrenheit)
 Our ski hill manager collected 20 years of data (LiftTickets.sav)
EXAMPLE: LIFT TICKETS

Both the coefficient of determination and the p-value of the F-test indicate the model is poor…

Neither variable is linearly related to ticket sales…
EXAMPLE: LIFT TICKETS

 The histogram of residuals reveals the errors may be normally distributed
EXAMPLE: LIFT TICKETS

• In the plot of residuals versus predicted values (testing for heteroscedasticity), the error variance appears to be constant
EXAMPLE: LIFT TICKETS (DW TEST)
APPLY THE DURBIN-WATSON STATISTIC TO THE ENTIRE LIST OF RESIDUALS.
EXAMPLE: LIFT TICKETS

 To test for first-order autocorrelation with α = .05, we find in Table 8(a) in Appendix B: dL = 1.10 and dU = 1.54.
 The null and alternative hypotheses are
 H0 : There is no first-order autocorrelation.
 H1 : There is first-order autocorrelation.
 The rejection region includes d < dL = 1.10. Since d = .593, we reject the
null hypothesis and conclude that there is enough evidence to infer that
first-order autocorrelation exists.
EXAMPLE: LIFT TICKETS

 Autocorrelation usually indicates that the model needs to include an independent variable that has a time-ordered effect on the dependent variable.
 The simplest such independent variable represents the time periods.
 We included a third independent variable that records the number of years since the year the data were gathered. Thus, x3 = 1, 2,..., 20. The new model is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$
EXAMPLE: LIFT TICKETS

The fit of the model is high.

The model is valid…

Snowfall and time (our new variable) are linearly related to ticket sales; temperature is not…
EXAMPLE: LIFT TICKETS

 The Durbin-Watson statistic computed from the residuals of our regression analysis is equal to 1.885.
 We can conclude that there is not enough evidence to infer the presence of first-order autocorrelation. (Determining dL is left as an exercise for the reader…)
 Hence, we have improved our model dramatically!
EXAMPLE: LIFT TICKETS

 Notice that the model is improved dramatically.


 The F-test tells us that the model is valid. The t-tests tell us that both the amount of
snowfall and time are significantly linearly related to the number of lift tickets.
 This information could prove useful in advertising for the resort. For example, if there has
been a recent snowfall, the resort could emphasize that in its advertising.
 If no new snow has fallen, it may emphasize its snow-making facilities.
MODEL  Regression analysis can also be used for:
SELECTION  non-linear (polynomial) models, and
 models that include nominal
independent variables.
POLYNOMIAL MODELS

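 Previously we looked at this multiple regression model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
(it’s considered linear or first-order since the exponent on each of the xi’s is 1)
 The independent variables may be functions of a smaller number of predictor variables; polynomial models fall into this category. If there is one predictor variable (x) we have:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon$$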
POLYNOMIAL MODELS

Technically, the polynomial model is a multiple regression model with p independent variables (x1, x2, …, xp). Since x1 = x, x2 = x², x3 = x³, …, xp = x^p, it’s based on one predictor variable (x).

p is the order of the equation; we’ll focus on equations of order p = 1, 2, and 3.
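To make the point concrete, the sketch below builds the p "independent variables" of a polynomial model from a single predictor value by raising it to the powers 1 through p. The function name `polynomial_terms` is hypothetical; in practice these columns would be computed in the data file before running the regression.

```python
def polynomial_terms(x: float, p: int) -> list:
    """Return [x, x^2, ..., x^p], the p columns derived from one predictor x."""
    if p < 1:
        raise ValueError("order p must be at least 1")
    return [x ** power for power in range(1, p + 1)]


print(polynomial_terms(2.0, 1))  # [2.0]
print(polynomial_terms(2.0, 3))  # [2.0, 4.0, 8.0]
```

The model stays linear in the coefficients, which is why ordinary multiple regression software can estimate it.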


FIRST ORDER MODEL

 When p = 1, we have our simple linear regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
 That is, we believe there is a straight-line relationship between the dependent and independent variables over the range of the values of x.
SECOND ORDER MODEL

When p = 2, the polynomial model is a parabola:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$
THIRD ORDER MODEL

When p = 3, our third order model looks like:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$$


POLYNOMIAL MODELS: 2 PREDICTOR VARIABLES

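 Perhaps we suspect that there are two predictor variables (x1 & x2) which influence the dependent variable:
 First order model (no interaction):
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
 First order model (with interaction):
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$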

POLYNOMIAL MODELS: 2 PREDICTOR VARIABLES

First order models, 2 predictors, without & with interaction:


POLYNOMIAL MODELS: 2 PREDICTOR VARIABLES

If we believe that a quadratic relationship exists between y and each of x1 and x2, and that the predictor variables interact in their effect on y, we can use this model:

Second order model (in two variables) WITH interaction:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$$


POLYNOMIAL MODELS: 2 PREDICTOR VARIABLES

2nd order models, 2 predictors, without & with interaction:


SELECTING A MODEL

 One predictor variable, or two (or more)?


 First order?
 Second order?
 Higher order?
 With interaction? Without?

 How do we choose the right model??


 Use our knowledge of the variables involved to build an initial model.
 Test that model using statistical techniques.
 If required, modify our model and re-test…
EXAMPLE: RESTAURANT

 We’ve been asked to come up with a regression model for a fast food restaurant.
 We know our primary market is middle-income adults and their children, particularly
those between the ages of 5 and 12.
 Dependent variable —restaurant revenue (gross or net)
 Predictor variables — family income, age of children
Is the relationship first order? quadratic?…
EXAMPLE: RESTAURANT

 The relationship between the dependent variable (revenue) and each predictor variable is probably quadratic.
 Members of low- or high-income households are less likely to eat at this chain’s restaurants, since the restaurants attract mostly middle-income customers.
 Families in neighborhoods where the mean age of children is either quite low or quite high are also less likely to eat there than families with children in the 5-to-12-year range.

Seems reasonable?
EXAMPLE: RESTAURANT

 Should we include the interaction term in our model?
 When in doubt, it is probably best to include it.
 Our model, then, is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$$
where y = annual gross sales
x1 = median annual household income*
x2 = mean age of children*

*in the neighborhood

EXAMPLE: RESTAURANT

Our fast food restaurant research department selected 25 locations at random and
gathered data on revenues, household income, and ages of neighborhood children.

Collected Data Calculated Data
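The "calculated" columns are the squared and interaction terms that a second-order model with interaction needs in addition to the collected values of x1 and x2. A minimal sketch, assuming the function name `second_order_terms` and a made-up neighborhood (the income and age values below are not from the restaurant data set):

```python
def second_order_terms(x1: float, x2: float) -> list:
    """Return [x1, x2, x1^2, x2^2, x1*x2] for one observation."""
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]


# Hypothetical neighborhood: income 20.0 ($ thousands), mean child age 8.0
print(second_order_terms(20.0, 8.0))  # [20.0, 8.0, 400.0, 64.0, 160.0]
```

Each location contributes one such row, and the regression is then run on all five columns.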


EXAMPLE: RESTAURANT

You can take the original data collected (revenues, household income, and age) and plot y
vs. x1 and y vs. x2 to get a feel for the data; trend lines were added for clarity…
EXAMPLE: RESTAURANT

The model fits the data well and it’s valid…

NOMINAL INDEPENDENT VARIABLES

 Thus far in our regression analysis, we’ve only considered variables that are interval.
Often however, we need to consider nominal data in our analysis.
 For example, our earlier example regarding the market for used cars focused only on
mileage. Perhaps color is an important factor. How can we model this new variable?
INDICATOR VARIABLES

 An indicator variable (also called a dummy variable) is a variable that can assume either
one of only two values (usually 0 and 1).

 A value of one usually indicates the existence of a certain condition, while a value of zero
usually indicates that the condition does not hold.
To represent m categories, we need m–1 indicator variables:

I1 = 1 if color is white, 0 if color is not white
I2 = 1 if color is silver, 0 if color is not silver

Car Color    I1   I2
white        1    0
silver       0    1
other        0    0
two tone!    1    1
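The coding scheme can be sketched as a small function. With m = 3 color categories (white, silver, other), m – 1 = 2 indicator variables are enough, because "other" is represented by both indicators being 0. The function name `color_indicators` is hypothetical, and the sketch assumes each car has a single recorded color, so the two-tone case (1, 1) is not produced.

```python
def color_indicators(color: str) -> tuple:
    """Encode a car color as (I1, I2): I1 = white, I2 = silver, other = (0, 0)."""
    normalized = color.strip().lower()
    i1 = 1 if normalized == "white" else 0
    i2 = 1 if normalized == "silver" else 0
    return (i1, i2)


for color in ["white", "silver", "red"]:
    print(color, color_indicators(color))
```

Leaving out an indicator for "other" makes that category the baseline against which the white and silver coefficients are interpreted.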
INTERPRETING INDICATOR VARIABLE
COEFFICIENTS
 After performing our regression analysis, we have this regression equation:
$$\hat{y} = b_0 + b_1 x + 91.10\, I_1 + 330.40\, I_2$$
 Thus, the price diminishes with additional mileage (x), i.e. b1 < 0, and for the same mileage:
 a white car sells for $91.10 more than other colors (I1)
 a silver car fetches $330.40 more than other colors (I2)
GRAPHICALLY
TESTING THE COEFFICIENTS

To test the coefficient of I1, we use these hypotheses:

𝐻0 : 𝛽𝑖 = 0
𝐻𝑎 : 𝛽𝑖 ≠ 0

There is insufficient evidence to infer that, in the population, 3-year-old white Tauruses have a different selling price than do Tauruses in the “other” color category with the same odometer reading…

We can conclude that there are differences in auction selling prices between all 3-year-old silver-colored Tauruses and the “other” color category with the same odometer readings.
MODEL BUILDING

Here is a procedure for building a regression model:


1. Identify the dependent variable; what is it we wish to predict? Don’t forget the
variable’s unit of measure.
2. List potential predictors; how would changes in predictors change the dependent
variable? Be selective; go with the fewest independent variables required. Be aware of
the effects of multicollinearity.
3. Gather the data; at least six observations for each independent variable used in the
equation.
MODEL BUILDING

4. Identify several possible models; formulate first- and second-order models with and without interaction. Draw scatter diagrams.
5. Use statistical software to estimate the models.
6. Determine whether the required conditions are satisfied; if not, attempt to
correct the problem.
7. Use your judgment and the statistical output to select the best model!
