BA Module 5 Summary

Business Analytics

Module 5 Multiple Regression


• We use single variable linear regression to investigate the relationship
between a dependent variable and one independent variable.
o A coefficient in a single variable linear regression characterizes the gross
relationship between the independent variable and the dependent
variable.
• We use multiple regression to investigate the relationship between a
dependent variable and multiple independent variables.
• The structure of the multiple regression equation is ŷ = a + b1x1 + b2x2 +…+ bkxk.
o The idealized equation that describes the true relationship between the
variables is y = α + β1x1 + β2x2 +…+ βkxk + ε, where ε is the error term.
The regression equation ŷ = a + b1x1 + b2x2 +…+ bkxk estimates this
relationship from sample data.
o Coefficients in multiple regression characterize relationships that are net
with respect to the independent variables included in the model but gross
with respect to all omitted independent variables.
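For example, the regression coefficients can be estimated in Excel with the
LINEST array function. A minimal sketch, assuming (hypothetically) that the
dependent variable is in cells B2:B51 and two independent variables are in
C2:D51:

=LINEST(B2:B51,C2:D51,TRUE,TRUE)

Note that LINEST returns the slope coefficients in reverse order (b2, b1, a)
and, when the fourth argument is TRUE, additional rows of statistics such as
the standard errors and R2.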
• Forecasting with a multiple regression equation is similar to forecasting with a
single variable linear model. However, instead of entering only one value for a
single independent variable, we input a value for each of the independent
variables.
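For example, Excel's TREND function returns such a forecast directly. A
minimal sketch, assuming the same illustrative ranges as above and new values
of the independent variables in C52:D52:

=TREND(B2:B51,C2:D51,C52:D52)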
• As with single variable linear regression, it is important to evaluate several
metrics to determine whether a multiple variable linear regression model is a
good fit for our data.
o For multiple regression we rely less on scatter plots and more on
numerical values and residual plots because visualizing three or more
variables can be difficult.
• Because R2 never decreases when independent variables are added to a
regression, it is important to apply an adjustment to R2 when assessing
and comparing the fit of a multiple regression model. This adjustment
compensates for the increase in R2 that results solely from increasing the
number of independent variables (see the formula below).
o Adjusted R2 is provided in the regression output.
o It is particularly important to look at Adjusted R2, rather than R2, when
comparing regression models with different numbers of independent
variables.
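For reference, Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the
number of observations and k is the number of independent variables. In Excel,
this could be computed as =1-(1-R2)*(n-1)/(n-k-1), with cell references
substituted for R2, n, and k.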
• In addition to analyzing Adjusted R2, we must test whether the relationship
between the independent and dependent variables is linear and significant. We
do this by analyzing the regression’s residual plots and the p-values associated
with each independent variable’s coefficient.
• For multiple regression models, because it is difficult to view the data in a simple
scatter plot, residual plots are an indispensable tool for detecting whether the
linear model is a good fit.
o There is a residual plot for each independent variable included in the
regression model.
o We can graph a residual plot for each independent variable to help detect
patterns such as heteroskedasticity and nonlinearity.
o As with single variable regression models, if the underlying relationship
is linear, the residuals follow a normal distribution with a mean of zero
and fixed variance.
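Although the Data Analysis tool can generate residual plots automatically, a
residual can also be computed manually as the observed value minus the
predicted value. A minimal sketch, assuming the illustrative ranges used
above, entered in a helper column and filled down:

=B2-TREND($B$2:$B$51,$C$2:$D$51,C2:D2)

The residuals can then be plotted against each independent variable in a
scatter chart.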
• We should also analyze the p-values of the independent variables'
coefficients to determine whether there is a significant relationship
between the variables in the model. If an independent variable's p-value is
less than 0.05, we are 95% confident that there is a significant linear
relationship between that independent variable and the dependent variable,
as sketched below.
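The regression output table reports these p-values directly. If only a
coefficient and its standard error are available (for example, from LINEST),
the two-sided p-value can be recovered from the t-statistic. A minimal sketch,
assuming (hypothetically) a coefficient in F2, its standard error in F3, and
n - k - 1 = 47 degrees of freedom:

=T.DIST.2T(ABS(F2/F3),47)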
• Multiple regression requires us to be aware of the possibility of multicollinearity
among the independent variables.
o Multicollinearity occurs when there is a strong linear relationship among
two or more of the independent variables.
o Indications of multicollinearity include seeing an independent variable’s
p-value increase when one or more other independent variables are
added to a regression model.
o We may be able to reduce multicollinearity by either increasing the sample
size or removing one (or more) of the collinear variables.
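One simple check for multicollinearity is the pairwise correlation between
independent variables. A minimal sketch, assuming the illustrative ranges
used above:

=CORREL(C2:C51,D2:D51)

A correlation close to +1 or -1 suggests that the two variables may be
collinear.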
• Dummy variables and lagged variables can be useful in regression models.
o Multiple regression models allow us to include multiple dummy variables
for categorical data—day of week, for example.
§ A dummy variable is equal to 1 when the variable of interest fits a
certain criterion. For example, a dummy variable for “Saturday”
would equal 1 for observations relating to Saturdays and 0 for
observations related to all other days.
§ The number of dummy variables we include must always be one
fewer than the number of options in a category.
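For example, a Saturday dummy variable could be created with the IF function.
A minimal sketch, assuming (hypothetically) that dates are stored in column A
and an English-language day format:

=IF(TEXT(A2,"ddd")="Sat",1,0)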
• We can also include lagged variables in multiple regression models. Lagged
values are used to capture the ongoing effects of a given variable.
o The lag period is based on managerial insight and data availability.
o Including lagged variables has some drawbacks:
§ Each additional period of lag decreases our sample size by one
observation.
§ If the lagged variable does not increase the model’s explanatory
power, the addition of the variable decreases Adjusted R2.
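For example, a one-period lag can be created by referencing the previous row.
A minimal sketch, assuming (hypothetically) that monthly advertising spending
is in column C starting in row 2, entered in cell D3 and filled down:

=C2

Column D then holds the prior month's spending for each observation; because
the first observation has no prior value, the usable sample shrinks by one.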

EXCEL SUMMARY

Recall the Excel functions and analyses covered in this course and make sure to
familiarize yourself with all of the necessary steps, syntax, and arguments. We have
provided some additional information for the more complex functions listed below. As
usual, the arguments shown in square brackets are optional.

• Forecasting with regression models in Excel
• Creating a regression output table using the Data Analysis tool
• Creating regression models using dummy variables
o =IF(logical_test,[value_if_true],[value_if_false])
§ Returns value_if_true if the specified condition is met, and returns
value_if_false if the condition is not met.
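For example, this hypothetical formula returns 1 when the value in cell C2
exceeds 100 and 0 otherwise:

=IF(C2>100,1,0)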
• Creating regression models using lagged variables

© Copyright 2020 President and Fellows of Harvard College. All Rights Reserved.
