Unit 4-B: Multiple Regression
MULTIPLE REGRESSION
The simple linear regression model was used to analyze how one interval variable (the
dependent variable y) is related to one other interval variable (the independent variable
x).
Multiple regression allows for any number of independent variables.
We expect to develop models that fit the data better than would a simple linear
regression model.
THE MODEL
We now assume we have k independent variables potentially related to the one dependent
variable. This relationship is represented by this first-order linear equation:

y = β0 + β1x1 + β2x2 + … + βkxk + ε

where y is the dependent variable, x1, …, xk are the independent variables, β0, …, βk are the coefficients, and ε is the error variable.
In the one variable, two dimensional case we drew a regression line; here we imagine a
response surface.
ESTIMATING THE COEFFICIENTS
1. Use a computer and statistical software to generate the coefficients and the statistics used to assess the model.
2. Diagnose violations of the required conditions. If there are problems, attempt to remedy them.
3. Assess the model's fit using:
   - the coefficient of determination, and
   - the F-test of the analysis of variance.
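As a rough illustration of what the software in step 1 computes, the least-squares coefficients solve the normal equations (XᵀX)b = Xᵀy. The sketch below uses made-up, noise-free data (so the true coefficients are recovered exactly); it is not a substitute for a statistical package such as SPSS, which also produces the assessment statistics.

```python
def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y.
    X: list of rows [1, x1, ..., xk] (the leading 1 gives the intercept)."""
    k = len(X[0])
    # Build X'X (k x k) and X'y (k x 1)
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yv for r, yv in zip(X, y)) for i in range(k)]
    # Solve A * coef = b by Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef

# Made-up data generated exactly from y = 2 + 3*x1 - x2 (no error term),
# so OLS recovers the true coefficients
X = [[1, x1, x2] for x1, x2 in [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]]
y = [2 + 3 * x1 - x2 for _, x1, x2 in X]
print(fit_ols(X, y))  # ≈ [2.0, 3.0, -1.0]
```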
[Diagram: categories of possible predictors of operating margin]
- Competition: number of hotel and motel rooms within a 3-mile radius of the La Quinta inn
- Market awareness: distance to the nearest competitor
- Demand generators: office space and higher-education enrollment nearby
- Community: median household income
- Physical: distance to downtown
*these all need to be interval data!
EXAMPLE: LA QUINTA
Several possible predictors of profitability were identified, and data (Quinta.sav) were collected. It's believed that operating margin (y) depends on these factors:
Can we transform these data into a mathematical model that looks like this:

Margin = β0 + β1(competition: # of rooms) + β2(awareness: distance to nearest alt.) + β3(office space) + β4(enrollment) + β5(income) + β6(physical: distance to downtown) + ε
USING SPSS
INTERPRET THE MODEL
Although we haven’t done any assessment of the model yet, at first pass:
This means that 52.5% of the variation in operating margin is explained by the six
independent variables, but 47.5% remains unexplained.
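As a reminder of where that number comes from, R² = 1 − SSE/SST: the fraction of total variation in y that is not left over in the residuals. A minimal sketch with made-up values:

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SSE/SST."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1 - sse / sst

# Made-up data: the fitted values track y closely, so R^2 is high
print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # ≈ 0.98
```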
ADJUSTED R2 VALUE
The adjusted R² is the coefficient of determination adjusted for the number of explanatory variables. It takes into account the sample size n and the number of independent variables k, and is given by:

Adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1)
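In code, the adjustment is a one-liner. Note that n = 100 below is an assumption for the La Quinta sample (the sample size is not stated in this excerpt):

```python
def adjusted_r2(r2, n, k):
    """R^2 adjusted for sample size n and number of explanatory variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# La Quinta model: R^2 = 0.525 with k = 6 independent variables.
# n = 100 is an assumed sample size for illustration.
print(round(adjusted_r2(0.525, 100, 6), 3))  # 0.494
```

Because the adjustment penalizes extra explanatory variables, adjusted R² is always at most R², and it can fall when a useless variable is added.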
TESTING THE VALIDITY OF THE MODEL
In a multiple regression model (i.e. more than one independent variable), we utilize an
analysis of variance technique to test the overall validity of the model. Here’s the idea:
H0: 𝛽1 = 𝛽2 = ⋯ = 𝛽𝑘 = 0
H1: At least one 𝛽𝑖 is not equal to zero.
If the null hypothesis is true, none of the independent variables is linearly related to y, and
so the model is invalid.
If at least one 𝛽𝑖 is not equal to 0, the model does have some validity.
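The F statistic for this test can be written directly in terms of R²: F = (R²/k) / ((1 − R²)/(n − k − 1)). A quick sketch, again assuming n = 100 for the La Quinta data (the sample size is not stated in this excerpt, but this value reproduces the F statistic SPSS reports below):

```python
def f_statistic(r2, n, k):
    """Overall F-test statistic: mean square explained over mean square
    unexplained, expressed in terms of R^2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# R^2 = 0.525, k = 6 independent variables, assumed n = 100
print(round(f_statistic(0.525, 100, 6), 1))  # 17.1
```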
[ANOVA table from SPSS: Regression (df = k), Residual (df = n − k − 1), Total (df = n − 1)]
Since SPSS calculated the F statistic as F = 17.14 and our F-critical = 2.17 (and the p-value is essentially zero), we reject H0 in favor of H1; that is, at least one βi is not equal to zero, and the model is valid.
Intercept (b0) = 38.14. This is the average operating margin when all of the independent variables are zero. It's meaningless to try to interpret this value, particularly when 0 is outside the range of the values of the independent variables (as is the case here).
Number of motel and hotel rooms (b1) = −.0076. Each additional room within three miles of the La Quinta inn decreases the operating margin by .0076 percentage points (i.e. for each additional 1,000 rooms the margin decreases by 7.6%).
Distance to nearest competitor (b2) = 1.65. For each additional mile between a La Quinta inn and its nearest competitor, the average operating margin increases by 1.65%.

*in each case we assume all other variables are held constant…
INTERPRETING THE COEFFICIENTS*
Office space (b3) = .020. For each additional thousand square feet of office space, the margin increases by .020 percentage points; e.g. an extra 100,000 square feet of office space increases the margin (on average) by 2.0%.
Student enrollment (b4) = .21. For each additional thousand students, the average operating margin increases by .21%.
Median household income (b5) = .41. For each additional thousand-dollar increase in median household income, the average operating margin increases by .41%.
Distance to downtown core (b6) = −.23. For each additional mile to the downtown center, the operating margin decreases on average by .23%.
*in each case we assume all other variables are held constant…
TESTING THE COEFFICIENTS
For each independent variable, we can test to determine whether there is enough evidence of a
linear relationship between it and the dependent variable for the entire population…
The hypotheses:
H0: βi = 0 vs. Ha: βi ≠ 0, for i = 1, …, k.
We can use our SPSS output to quickly test each of the six coefficients in our model…
Thus, the number of hotel and motel rooms, distance to the nearest motel, amount
of office space, and median household income are linearly related to the operating
margin. There is no evidence to infer that college enrollment and distance to
downtown center are linearly related to operating margin.
USING THE REGRESSION EQUATION
Much like we did with simple linear regression, we can produce a prediction interval for
a particular value of y.
As well, we can produce the confidence interval estimate of the expected value of y.
We add one row (our given values for the independent variables) to the
bottom of our data set:
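Once the coefficients are in hand, the point prediction is simply the regression equation evaluated at the given x values. The site values below are hypothetical illustrations only (the actual values used for the interval above are not shown in this excerpt):

```python
# Fitted La Quinta coefficients from the SPSS output interpreted earlier:
# intercept, rooms, nearest competitor, office space, enrollment, income, distance
b = [38.14, -0.0076, 1.65, 0.020, 0.21, 0.41, -0.23]

def predict(x):
    """Point prediction: b0 + b1*x1 + ... + b6*x6."""
    return b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))

# Hypothetical site (illustrative inputs only): 3815 rooms nearby, nearest
# competitor 0.9 mi away, 476 (000s sq ft) of office space, 24.5 (000s)
# students, $35 (000s) median income, 11.2 mi to downtown
print(round(predict([3815, 0.9, 476, 24.5, 35, 11.2]), 2))
```

The prediction and confidence intervals around this point estimate come from the software; their widths depend on the standard error of estimate and how far the given x values sit from their sample means.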
PREDICTION INTERVAL
We predict that the operating margin will fall between 25.31 and 48.73.
If management defines a profitable inn as one with an operating margin greater than 50%
and an unprofitable inn as one with an operating margin below 30%, they will pass on this
site, since the entire prediction interval is below 50%.
CONFIDENCE INTERVAL
The expected operating margin of all sites that fit this category is estimated to be
between 32.87 and 41.18.
We interpret this to mean that if we built inns on an infinite number of sites that fit the
category described, the mean operating margin would fall between 33.0 and 41.2. In
other words, the average inn would not be profitable either…
REGRESSION DIAGNOSTICS
Are there observations that are inaccurate or do not belong to the target population?
Double-check the accuracy of outliers and influential observations.
Multiple regression models have a problem that simple regressions do not, namely
multicollinearity.
It happens when the independent variables are highly correlated.
We’ll explore this concept through the following example…
EXAMPLE: HOUSING PRICES
A real estate agent wanted to develop a model to predict the selling price of a home. The
agent believed that the most important variables in determining the price of a house are
its:
size,
number of bedrooms,
and lot size.
The proposed model is:

Price = β0 + β1(size) + β2(number of bedrooms) + β3(lot size) + ε
Housing market data has been gathered and SPSS is the analysis tool of choice
(housprices.sav)
USING SPSS
[Durbin–Watson test scale: 0 … dL … dU … 2 … 4−dU … 4−dL … 4; values near 2 suggest no first-order autocorrelation]
EXAMPLE: LIFT TICKETS
• In the plot of residuals versus predicted values (testing for heteroscedasticity) — the
error variance appears to be constant
EXAMPLE: LIFT TICKETS (DW TEST)
Apply the Durbin–Watson statistic to the entire list of residuals.
The Durbin–Watson statistic computed from the residuals of our regression analysis is 1.885.
We can conclude that there is not enough evidence to infer the presence of first-order
autocorrelation. (Determining dL is left as an exercise for the reader…)
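The statistic itself is simple to compute from the residuals: d = Σ(eₜ − eₜ₋₁)² / Σeₜ², with values near 2 indicating no first-order autocorrelation (near 0: positive; near 4: negative). A minimal sketch on made-up residuals:

```python
def durbin_watson(e):
    """Durbin-Watson statistic over a list of residuals e_1..e_n."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

# Perfectly alternating residuals: d well above 2, suggesting
# negative first-order autocorrelation
print(durbin_watson([1, -1, 1, -1]))  # 3.0
```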
Hence, we have improved our model dramatically!
(It's considered linear or first-order since the exponent on each of the xi's is 1.)
POLYNOMIAL MODELS
The independent variables may be functions of a smaller number of predictor variables; polynomial models fall into this category. If there is one predictor variable (x) we have:

y = β0 + β1x + β2x² + … + βpx^p + ε

Technically, this is a multiple regression model with p independent variables (x1, x2, …, xp). But since x1 = x, x2 = x², x3 = x³, …, xp = x^p, it's based on one predictor variable (x).
FIRST ORDER MODEL (p = 1)
That is, we believe there is a straight-line relationship between the dependent and independent variables over the range of the values of x:

y = β0 + β1x + ε

SECOND ORDER MODEL (p = 2)

y = β0 + β1x + β2x² + ε
Perhaps we suspect that there are two predictor variables (x1 and x2) which influence the dependent variable.

First order model (no interaction):

y = β0 + β1x1 + β2x2 + ε

If we believe that a quadratic relationship exists between y and each of x1 and x2, and that the predictor variables interact in their effect on y, we can use this second-order model with interaction:

y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
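These models are still fit as ordinary multiple regressions; the only extra work is building the additional columns of the design matrix. A sketch of the second-order-with-interaction expansion for two predictors (the ordering of the terms here is an arbitrary choice):

```python
def second_order_terms(x1, x2):
    """Expand two predictors into the second-order-with-interaction
    design row: [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return [1, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

# Each observation's (x1, x2) pair becomes one row of the design matrix
print(second_order_terms(2, 3))  # [1, 2, 3, 4, 9, 6]
```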
We’ve been asked to come up with a regression model for a fast food restaurant.
We know our primary market is middle-income adults and their children, particularly
those between the ages of 5 and 12.
Dependent variable —restaurant revenue (gross or net)
Predictor variables — family income, age of children
Is the relationship first order? quadratic?…
EXAMPLE: RESTAURANT
Seems reasonable?
Our fast food restaurant research department selected 25 locations at random and
gathered data on revenues, household income, and ages of neighborhood children.
You can take the original data collected (revenues, household income, and age) and plot y
vs. x1 and y vs. x2 to get a feel for the data; trend lines were added for clarity…
NOMINAL INDEPENDENT VARIABLES
Thus far in our regression analysis, we’ve only considered variables that are interval.
Often however, we need to consider nominal data in our analysis.
For example, our earlier example regarding the market for used cars focused only on
mileage. Perhaps color is an important factor. How can we model this new variable?
INDICATOR VARIABLES
An indicator variable (also called a dummy variable) is a variable that can assume one of only two values (usually 0 and 1).
A value of one usually indicates the existence of a certain condition, while a value of zero
usually indicates that the condition does not hold.
To represent m categories, we need m − 1 indicator variables. For car color with three categories (white, silver, other):

I1 = 1 if the color is white, 0 if not white
I2 = 1 if the color is silver, 0 if not silver

Car Color | I1 | I2
----------|----|---
white     |  1 |  0
silver    |  0 |  1
other     |  0 |  0
two tone! |  1 |  1
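In code, the encoding above is a simple lookup; any color other than white or silver maps to (0, 0):

```python
def color_indicators(color):
    """m = 3 color categories -> m - 1 = 2 indicator variables (I1, I2)."""
    i1 = 1 if color == "white" else 0
    i2 = 1 if color == "silver" else 0
    return (i1, i2)

print(color_indicators("white"))   # (1, 0)
print(color_indicators("silver"))  # (0, 1)
print(color_indicators("green"))   # (0, 0) -- the "other" category
```

The "other" category needs no indicator of its own: it is the base case against which the white and silver coefficients are interpreted.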
INTERPRETING INDICATOR VARIABLE COEFFICIENTS
After performing our regression analysis,
we have this regression equation:
H0: βi = 0 vs. Ha: βi ≠ 0
4. Identify several possible models; formulate first- and second-order models with and without interaction. Draw scatter diagrams.
5. Use statistical software to estimate the models.
6. Determine whether the required conditions are satisfied; if not, attempt to
correct the problem.
7. Use your judgment and the statistical output to select the best model!