Business Analytics

Module 4 Single Variable Linear Regression


• We use regression analysis for two primary purposes:
o Studying the magnitude and structure of a relationship between two
variables.
o Forecasting a variable based on its relationship with another variable.
• The structure of the single variable linear regression line is ŷ = a + bx.
o ŷ is the expected value of y, the dependent variable, for a given value of x.
o x is the independent variable, the variable we are using to help us predict
or better understand the dependent variable.
o a is the y-intercept, the point at which the regression line intersects the
vertical axis. This is the value of ŷ when the independent variable, x, is set
equal to 0.
o b is the slope, the average change in the dependent variable y as the
independent variable x increases by one unit.
o The true relationship between two variables is described by the equation
y = a + bx + e, where e is the error term (e = y – ŷ). The regression line
ŷ = a + bx describes the idealized relationship, that is, the expected
value of y for a given value of x.
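The line ŷ = a + bx can be estimated directly from data. As a minimal sketch in Python (the course itself works in Excel; the house-size and price figures below are hypothetical):

```python
import numpy as np

# Hypothetical data: house size in square feet (x) and selling
# price in thousands of dollars (y).
x = np.array([1100, 1400, 1600, 1850, 2000, 2300], dtype=float)
y = np.array([199, 245, 312, 308, 365, 410], dtype=float)

# np.polyfit with degree 1 returns the least-squares slope b and
# intercept a of the regression line y-hat = a + b*x.
b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x  # expected value of y for each observed x

print(f"y-hat = {a:.2f} + {b:.4f}x")
```

The slope b produced this way equals the sample covariance of x and y divided by the sample variance of x, which is the standard least-squares formula.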
• We determine a point forecast by entering the desired value of x into the
regression equation.
o We must be extremely cautious about using regression to forecast for
values outside of the historically observed range of the independent
variable (x-values).
o Instead of predicting a single point, we can construct a prediction
interval, an interval around the point forecast that is likely to contain, for
example, the actual selling price of a house of a given size.
§ The width of a prediction interval varies based on the standard
deviation of the regression (the standard error of the regression),
the desired level of confidence, and the location of the x-value of
interest in relation to the historical values of the independent
variable.
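A sketch of a point forecast and 95% prediction interval in Python (hypothetical house data; scipy's t-distribution supplies the critical value). The half-width term shows exactly the three drivers listed above: the standard error of the regression, the confidence level, and the distance of the x-value of interest from the mean of the observed x-values:

```python
import numpy as np
from scipy import stats

# Hypothetical house-size (sq ft) and selling-price ($1000s) data.
x = np.array([1100, 1400, 1600, 1850, 2000, 2300], dtype=float)
y = np.array([199, 245, 312, 308, 365, 410], dtype=float)

b, a = np.polyfit(x, y, deg=1)
n = len(x)
resid = y - (a + b * x)
# Standard error of the regression (n - 2 degrees of freedom).
se = np.sqrt(np.sum(resid**2) / (n - 2))

x0 = 1800.0  # stays inside the historically observed x-range
forecast = a + b * x0  # point forecast

# Half-width of a 95% prediction interval: it grows with se, with the
# confidence level, and with the distance of x0 from the mean of x.
t_crit = stats.t.ppf(0.975, df=n - 2)
half = t_crit * se * np.sqrt(
    1 + 1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
)
lower, upper = forecast - half, forecast + half
print(f"forecast {forecast:.1f}, 95% PI [{lower:.1f}, {upper:.1f}]")
```

The interval is narrowest when x0 equals the mean of the historical x-values and widens as x0 moves toward (or beyond) the edges of the observed range, which is one reason extrapolation is risky.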
• It is important to evaluate several metrics together to determine whether a single
variable linear regression model is a good fit for a data set, rather than relying
on any single metric in isolation.
• R² measures the percentage of the total variation in the dependent variable, y,
that is explained by the regression line.
o R² = Regression Sum of Squares/Total Sum of Squares
= 1 – (Residual Sum of Squares/Total Sum of Squares)
o 0 ≤ R² ≤ 1

© Copyright 2020 President and Fellows of Harvard College. All Rights Reserved.
o For a single variable linear regression, R² is equal to the square of the
correlation coefficient.
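As a sketch (Python, hypothetical data), R² can be computed from the sums of squares and checked against the squared correlation coefficient:

```python
import numpy as np

x = np.array([1100, 1400, 1600, 1850, 2000, 2300], dtype=float)
y = np.array([199, 245, 312, 308, 365, 410], dtype=float)

b, a = np.polyfit(x, y, deg=1)
y_hat = a + b * x

sse = np.sum((y - y_hat) ** 2)     # residual (unexplained) sum of squares
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - sse / sst          # share of total variation explained

r = np.corrcoef(x, y)[0, 1]        # correlation coefficient
print(f"R^2 = {r_squared:.4f}, r^2 = {r**2:.4f}")
```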
• In addition to analyzing R², we must test whether the relationship between the
dependent and independent variables is significant and whether the linear model
is a good fit for the data. We do this by analyzing the p-value (or confidence
interval) associated with the independent variable and the regression’s
residual plot.
o The p-value of the independent variable is the result of the hypothesis test
that tests whether there is a significant linear relationship; that is, it tests
whether the slope of the regression line is zero, H0: b = 0 and Ha: b ≠ 0.
§ If the coefficient’s p-value is less than 0.05, we reject the null
hypothesis and conclude that we have sufficient evidence to be
95% confident that there is a significant linear relationship between
the dependent and independent variables.
§ Note that the p-value and R² provide different information. A linear
relationship can be significant (have a low p-value) but not explain
a large percentage of the variation (not have a high R²).
o A confidence interval associated with an independent variable’s
coefficient indicates the likely range for that coefficient.
§ If the 95% confidence interval does not contain zero, we can be
95% confident that there is a significant linear relationship between
the variables.
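A sketch of the significance test in Python using scipy.stats.linregress (hypothetical data; the course performs this with Excel's regression output). The slope's p-value and a 95% confidence interval built from its standard error lead to the same conclusion:

```python
import numpy as np
from scipy import stats

x = np.array([1100, 1400, 1600, 1850, 2000, 2300], dtype=float)
y = np.array([199, 245, 312, 308, 365, 410], dtype=float)

res = stats.linregress(x, y)  # slope, intercept, rvalue, pvalue, stderr
n = len(x)

# 95% confidence interval for the slope coefficient.
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low = res.slope - t_crit * res.stderr
ci_high = res.slope + t_crit * res.stderr

significant = res.pvalue < 0.05  # reject H0: slope = 0?
print(f"p-value {res.pvalue:.4g}, 95% CI [{ci_low:.4f}, {ci_high:.4f}]")
```

Because both are based on the same t-statistic with n − 2 degrees of freedom, a p-value below 0.05 and a 95% confidence interval that excludes zero are equivalent conclusions.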
• Residual plots can provide important insights into whether a linear model is a
good fit.
o Each observation in a data set has a residual equal to the historically
observed value minus the regression’s predicted value, that is, e = y – ŷ.
o Linear regression models assume that the regression’s residuals follow a
normal distribution with a mean of zero and fixed variance. A residual plot
with no systematic pattern supports these assumptions; curvature or a
spread that changes with x suggests the linear model is not a good fit.
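A minimal residual check in Python (hypothetical data). Whenever the model includes an intercept, least-squares residuals average to zero by construction; what a residual plot adds is a visual check for patterns in how they spread around zero:

```python
import numpy as np

x = np.array([1100, 1400, 1600, 1850, 2000, 2300], dtype=float)
y = np.array([199, 245, 312, 308, 365, 410], dtype=float)

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)  # observed minus predicted, e = y - y-hat

# With an intercept in the model, the residuals sum to (numerically)
# zero; a residual plot would chart these values against x.
print(residuals.round(2), residuals.mean())
```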
• We can also perform regression analyses using qualitative, or categorical,
variables. To do so, we must convert data to dummy (0, 1) variables. After that,
we can proceed as we would with any other regression analysis.
o A dummy variable is equal to 1 when the variable of interest fits a certain
criterion, and 0 otherwise. For example, a dummy variable for “Female”
would equal 1 for all female observations and 0 for all male observations.
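A sketch of a dummy-variable regression in Python with hypothetical salary data. With a single (0, 1) regressor, the intercept is the mean of the 0 group and the slope is the difference between the two group means:

```python
import numpy as np

# Hypothetical data: Female dummy (1 = female, 0 = male) and
# annual salary in thousands of dollars.
female = np.array([1, 0, 1, 0, 1, 0], dtype=float)
salary = np.array([62, 70, 65, 74, 61, 69], dtype=float)

b, a = np.polyfit(female, salary, deg=1)
# a is the average salary when Female = 0 (male observations);
# a + b is the average salary when Female = 1.
print(f"salary-hat = {a:.2f} + {b:.2f} * Female")
```

This is why a dummy-variable regression can be read as a comparison of group averages, while still following the same mechanics as any other single variable regression.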

EXCEL SUMMARY

Recall the Excel functions and analyses covered in this course and make sure to
familiarize yourself with all of the necessary steps, syntax, and arguments. We have
provided some additional information for the more complex functions listed below. As
usual, the arguments shown in square brackets are optional.

• Adding the best fit line to a scatter plot using the Insert menu
• Forecasting with regression models in Excel
o =SUMPRODUCT(array1, [array2], [array3],…) is a convenient function
for calculating point forecasts.
• Creating a regression output table using the Data Analysis tool
• Creating regression models with dummy variables
o =IF(logical_test,[value_if_true],[value_if_false])
§ Returns value_if_true if the specified condition is met, and returns
value_if_false if the condition is not met.
o To perform a regression analysis with an independent dummy variable,
follow the same steps as when using quantitative variables.
