Advanced Regression
in Excel
By Mark Harmon
mark@ExcelMasterSeries.com
www.ExcelMasterSeries.com
ISBN: 978-0-9833070-6-8
Table of Contents
R Square
Adjusted R Square
Significance of F
The Logit
The Solver dialogue box has the following 4 parameters that need to be set:
Objective:
Decision Variables:
Constraints:
Assume Non-Negative:
The video on the next page will make the entire procedure of Dummy Variable
Regression in Excel to perform Conjoint Analysis much easier to understand:
Instructional Video
Go to
http://www.youtube.com/watch?v=EMbiGPGlBEM
to View a
Video From Excel Master Series
About How To Use
Dummy Variable Regression
in Excel To Perform
Conjoint Analysis
Dummy Variables in a regression are variables that can only assume two values.
One Dummy Variable must be created for each product choice.
When the survey is returned, the survey data is converted into the proper layout
for the Regression function in Excel. Each Dummy Variable assigned to a
specific attribute will be assigned the value of 0 or 1, depending on whether that
attribute was an element of the combination that is currently being rated.
Watching this done in the linked video is probably the easiest way to understand
how to do it.
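For readers who prefer to see the layout spelled out, here is a minimal sketch of the conversion in Python. The attribute names and levels are hypothetical, and the sketch assumes that one level of each attribute is left out as a baseline, which is the removal that prevents collinearity.

```python
# Hypothetical attributes and levels. The first level of each attribute is
# treated as the baseline whose Dummy Variable is left out of the regression.
ATTRIBUTES = {
    "color": ["red", "blue", "green"],
    "size": ["small", "large"],
}

def dummy_columns(attributes):
    """Names of the Dummy Variables that stay in the regression."""
    return [f"{attr}={level}"
            for attr, levels in attributes.items()
            for level in levels[1:]]          # skip the baseline level

def encode_card(card, attributes):
    """Turn one rated combination into a row of 0/1 Dummy Variables."""
    return [1 if card[attr] == level else 0
            for attr, levels in attributes.items()
            for level in levels[1:]]

columns = dummy_columns(ATTRIBUTES)
row = encode_card({"color": "blue", "size": "large"}, ATTRIBUTES)
print(columns)
print(row)
```

Each row produced this way, together with the rating the consumer gave that combination, is one line of input for the regression.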
The data can now be run as a regular regression using Excel’s regression tool.
The linked video shows how to do this in detail.
For example, the marketer will find out how important the color red was
compared to each of the other product choices during the purchase decision.
Utilities of product choices that were associated with the Dummy Variables that
were removed to prevent collinearity will be assigned the value of 0.
We now have Utilities for each attribute. The overall attractiveness of a
particular combination of choices can now be calculated by adding up the individual
Utilities associated with each of the choices. The sum of the Utilities for each
combination is the regression’s prediction of the consumer’s degree of liking for that
combination of product choices.
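The adding-up step can be sketched as follows. The Utilities and the Intercept below are made-up numbers, not results from the survey. The sketch also assumes that the regression's Intercept is added to the sum, as it would be in any regression prediction.

```python
# Hypothetical Utilities (regression coefficients). Baseline levels that were
# removed to prevent collinearity carry a Utility of 0.
INTERCEPT = 4.0   # assumed: the regression's Intercept is part of the prediction
UTILITIES = {
    ("color", "red"): 0.0,    # baseline
    ("color", "blue"): 1.5,
    ("size", "small"): 0.0,   # baseline
    ("size", "large"): -0.5,
}

def predicted_rating(card):
    """Add the Utility of every choice in the combination to the Intercept."""
    return INTERCEPT + sum(UTILITIES[(attr, level)] for attr, level in card.items())

print(predicted_rating({"color": "blue", "size": "large"}))  # 4.0 + 1.5 - 0.5 = 5.0
```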
The removal of the individual Dummy Variables does not affect the accuracy or
completeness of the answer. Adding up the Utilities for each combination will
produce a figure that will be very close to the consumer’s actual rating for that
combination. An example of this is shown in the video.
Showing the Regression Equation Predicts Nearly the Same Score as the
Customer's Ranking of Card 13, Even Though Dummy Variables Were
Removed
There is a lot more to the Excel Regression output than just the regression
equation. If you know how to quickly read the output of a Regression done in Excel,
you’ll know right away the most important points of a regression: whether the overall
regression was a good fit, whether this output could have occurred by chance,
whether or not all of the independent input variables were good predictors, and
whether the residuals show a pattern (which means there’s a problem).
This video will illustrate exactly how to quickly and easily understand the output
of Regression performed in Excel:
Some parts of the Excel Regression output are much more important than
others. The goal here is for you to be able to glance at the Excel Regression
output and immediately understand it, so we will focus our attention only on the
four most important parts of the Excel regression output.
R Square
This is the most important number of the output. R Square tells how well the
regression line approximates the real data. This number tells you how much of
the output variable’s variance is explained by the input variables’ variance.
Ideally we would like to see this at least 0.6 (60%) or 0.7 (70%).
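As a rough illustration of where this number comes from, R Square is 1 minus the ratio of the unexplained variation to the total variation. The sketch below uses made-up actual and predicted values.

```python
def r_squared(actual, predicted):
    """1 - (sum of squared residuals) / (total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_resid / ss_total

# Hypothetical actual values and regression predictions
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(round(r_squared(actual, predicted), 3))  # 1 - 1/20 = 0.95
```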
Adjusted R Square
This is quoted most often when explaining the accuracy of the regression
equation. Adjusted R Square is more conservative than R Square because it is
Copyright ©2011 http://ExcelMasterSeries.com/New_Manuals.php Page 18
Advanced Regression in Excel The Excel Statistical Master
never greater than R Square. Another reason that Adjusted R Square is quoted
more often is that when new input variables are added to the Regression
analysis, Adjusted R Square increases only when the new input variable improves
the Regression equation’s ability to predict the output by more than would be
expected by chance. R Square never goes down when a new variable is
added, whether or not the new input variable improves the Regression equation’s
accuracy.
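The usual formula behind Adjusted R Square makes this behavior easy to see. The sketch below uses made-up numbers: a third input variable nudges R Square up from 0.800 to 0.805, yet Adjusted R Square goes down.

```python
def adjusted_r_squared(r2, n_obs, n_inputs):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_inputs - 1)

# Hypothetical case: 20 observations, R Square barely improves with a 3rd input
before = adjusted_r_squared(0.800, n_obs=20, n_inputs=2)
after = adjusted_r_squared(0.805, n_obs=20, n_inputs=3)
print(round(before, 4), round(after, 4))
```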
Significance of F
This is the probability of obtaining a fit at least this good purely by chance
if the input variables actually had no relationship with the output. A small
Significance of F supports the validity of the Regression output. For example,
if Significance of F = 0.030, a result this strong would arise by chance only
about 3% of the time.
The P-Values of the regression coefficients and of the Y-Intercept indicate how
likely it is that each one is a real result and did not occur by chance. The lower
the P-Value, the higher the likelihood that that coefficient or Y-Intercept is
valid. For example, a P-Value of 0.016 for a regression coefficient indicates that
there is only a 1.6% chance that a coefficient this far from zero would occur
only as a result of chance.
The residuals are the difference between the Regression’s predicted value and
the actual value of the output variable. You can quickly plot the Residuals on a
scatterplot chart. Look for patterns in the scatterplot. The more random (without
patterns) and centered around zero the residuals appear to be, the more likely it
is that the Regression equation is valid.
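If you want to check the residuals outside of Excel, a minimal sketch for a one-input regression looks like this. The data is made up, and the residual is taken as actual minus predicted.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """Actual value minus the Regression's predicted value, per observation."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
res = residuals(xs, ys)
print([round(r, 2) for r in res])
```

With a least-squares fit that includes an intercept, the residuals always sum to zero, so what you are really looking for in the scatterplot is the absence of a pattern.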
There are many other pieces of information in the Excel regression output but the
above four items will give a quick read on the validity of your Regression.
Wouldn’t it be great if there was a more accurate way to predict whether your
prospect will buy rather than just taking an educated guess? Well, there is…if
you have enough data on your previous prospects. The tool that makes this
possible is called Logistic Regression and can be easily implemented in Excel.
Instructional Video
Go to
http://www.youtube.com/watch?v=NHOO7iceJrw
to View a
Video From Excel Master Series
About How To Use
Logistic Regression
in Excel To Predict If Your
Next Prospect
WILL BUY! (or not !#!$%!)
Suppose that you have collected three pieces of data on each of your previous
prospects: the prospect’s age, the prospect’s gender, and whether or not the
prospect bought.
With the above data, you could create a predictive equation that would calculate
a new prospect’s probability of purchasing by inputting this new prospect’s age
and gender. This predictive equation will be in the form of:
The Logit
P(X) has only one variable. That is L, which is called the Logit.
L, the Logit, has 3 variables: Constant, A, and B. They must be known before
P(X) can be calculated. Those 3 variables can be found in Excel by using the
Excel Solver. The Excel Solver will find the optimal combination of those 3
variables that causes the resulting P(X) to most accurately predict whether Y = 1
or 0 for all previous prospects.
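Assuming the standard logistic form, and assuming that A multiplies the prospect's age and B multiplies the prospect's gender (coded as 0 or 1), the relationship between P(X) and the Logit described above can be written as:

```latex
P(X) = \frac{e^{L}}{1 + e^{L}}, \qquad
L = \text{Constant} + A \cdot \text{Age} + B \cdot \text{Gender}
```

Whatever the value of L, this form keeps P(X) between 0 and 1, which is what a probability requires.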
Here’s how the optimal set of Logit variables (Constant, A, and B) is found
in Excel:
Using Excel, the following value is calculated for each recorded prospect:
$P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$
The Y refers to Y = 1 if the prospect bought and Y = 0 if the prospect didn’t buy.
The P(X) is the probability of purchase that will be calculated using the equation
listed above. In Excel, the P(X) calculation is initially performed by the Excel
Solver using Logit variables (Constant, A, and B) which are not optimal. The
Excel Solver will then continuously try new combinations of these variables until
the optimal P(X) is found.
Here’s how the Excel Solver knows when it has found the correct combinations
of these 3 variables so that the resulting P(X) equation most accurately predicts
whether Y = 1 or 0:
The expression $P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ is maximized when P(X) is most accurate. It
approaches its highest value (1) when Y = 1 and P(X) approaches 1. It also
approaches its highest value (1) when Y = 0 and P(X) approaches 0. When Y = 1
and P(X) = 1, that is a 100% correct prediction by P(X) that Y = 1. When Y = 0
and P(X) = 0, that is a 100% correct prediction by P(X) that Y = 0.
Each prospect has a separate $P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ value calculated for him or
her.
The sum of the $P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ values for all prospects is taken.
The only variables that exist when calculating $P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ are Y and
the variables of P(X), which are Constant, A, and B. Using the Excel Solver, these
variables are adjusted until their values maximize the sum of all
$P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ values.
When the sum of $P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ is maximized, then the final resulting
P(X) equation is as accurate as possible at predicting whether Y will be 1 or 0.
Stated another way, we now have a predictive equation P(X) which uses the
optimal combination of Constant, A, and B which most accurately calculates the
probability that Y = 1 given a prospect’s age and gender.
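To make the quantity that the Solver maximizes concrete, here is a minimal sketch in Python. The prospects are made up, the logistic form of P(X) is assumed to be the standard one, and gender is assumed to be coded as 0 or 1. The sketch only evaluates the sum for two candidate settings of Constant, A, and B and does not do the searching that the Solver does.

```python
import math

def p_of_x(constant, a, b, age, gender):
    """Assumed standard logistic form; gender is coded 0 or 1."""
    logit = constant + a * age + b * gender
    return 1 / (1 + math.exp(-logit))

def prospect_value(p, y):
    """P(X)^Y * [1 - P(X)]^(1-Y): equals P(X) if Y = 1, else 1 - P(X)."""
    return p ** y * (1 - p) ** (1 - y)

def objective(params, prospects):
    """Sum of the per-prospect values, the figure the text has Solver maximize."""
    constant, a, b = params
    return sum(prospect_value(p_of_x(constant, a, b, age, gender), y)
               for age, gender, y in prospects)

# Hypothetical prospects: (age, gender, bought)
PROSPECTS = [(25, 0, 0), (30, 1, 0), (45, 0, 1),
             (50, 1, 1), (35, 1, 0), (55, 0, 1)]

poor = objective((0.0, 0.0, 0.0), PROSPECTS)     # every P(X) is 0.5
better = objective((-8.0, 0.2, 0.0), PROSPECTS)  # P(X) rises with age
print(poor, round(better, 3))
```

The sum can never exceed the number of prospects, since each value is at most 1. Note that standard maximum-likelihood logistic regression maximizes the sum of the logarithms of these values rather than the sum of the values themselves, so results from statistical software may differ somewhat from this procedure.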
The embedded video provides a clear picture of all of this in action in Excel.
The use of the Excel Solver does require some hand-tweaking to ensure that the
most accurate answer is obtained. The video shows an example of this.
Ultimately what the Solver is doing is adjusting variables Constant, A, and B to
maximize the sum of the column of
$P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ values. The answer obtained by the Solver should
maximize that sum and provide realistic answers for the probabilities of each
prospect, including the new one.
You’ll probably find that you have to experiment by applying constraints to the
variables that Solver is adjusting in order to maximize the target sum. The
variables that Solver adjusts are called Decision Variables. Solver allows you to
create constraints on the value of any Decision Variable.
In the video, you will be able to watch how a Decision Variable is constrained to
make the final answer more accurate. The Decision Variable called Constant was
constrained to always remain above -25 during the Solver analysis. This resulted
in the most accurate and realistic maximization of the sum of the
$P(X)^{Y} \cdot [1 - P(X)]^{(1-Y)}$ values.
Following is a video of this article showing how to perform all four steps to
Regression in Excel, including the above two crucial steps at the beginning:
The input and output variables will be graphed together. The y-axis of the chart
will provide the scale for plotting of those values. The x-axis will provide a
measure of whatever continuum was used, e.g. time, to collect the values of all of
the variables. Excel’s charting function is the way to go here. The above linked
video shows exactly how to chart all the data in Excel.
A low Correlation Coefficient between the output variable and an input variable indicates that the input variable
is not a good predictor of the output. That input variable should be removed from
the Regression Analysis. The attached video provides an example of this.
After looking at the Correlation Coefficients between the input and output
variables, look at the Correlation Coefficients between the input variables
themselves. You do not want to use pairs of input variables that are good
predictors of each other in a Regression. This will cause a Regression error
known as Collinearity or Multicollinearity. One variable from any pair of highly
correlated input variables should be removed prior to running the Regression
Analysis. Variables can be considered highly correlated if the absolute value of
their Correlation Coefficient is greater than 0.7 (greater than +0.7 or less than
-0.7).
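The 0.7 screening rule can be sketched in a few lines of Python. The input variables and their values below are made up for illustration.

```python
from itertools import combinations
import math

def correlation(xs, ys):
    """Pearson Correlation Coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def highly_correlated_pairs(inputs, threshold=0.7):
    """Pairs of input variables whose |correlation| exceeds the threshold."""
    return [(a, b) for a, b in combinations(inputs, 2)
            if abs(correlation(inputs[a], inputs[b])) > threshold]

# Hypothetical input variables
INPUTS = {
    "ads": [1, 2, 3, 4, 5, 6],
    "ad_spend": [10, 21, 29, 42, 50, 61],   # nearly proportional to ads
    "price": [9, 7, 8, 9, 7, 8],
}
print(highly_correlated_pairs(INPUTS))
```

Here only the pair ads and ad_spend is flagged, so one of those two would be dropped before running the Regression.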
Excel Solver is one of the best and easiest curve-fitting devices in the world, if
you know how to use it. Its curve-fitting capabilities make it an excellent tool to
perform nonlinear regression. The Excel Solver will find the equation of the linear
or nonlinear curve which most closely fits a set of data points.
One very important caveat must be added: the user must first determine the
general type of the curve and input that information into Solver at the start. This
information is in the form of the general equation that defines the curve, such as
$a_0 + a_1 x + a_2 x^2 = c$ or $a \cdot \ln(xb) = c$. Solver then calculates all needed variables
which produce the equation which most closely fits the data points. We will run
through an example here.
In this problem we are going to show how to use the Excel Solver to calculate an
equation which most closely describes the relationship between sales and
number of ads being run. The purpose of this equation is to be able to predict the
number of sales based upon the number of ads that will be run.
A marketing manager has collected the following data on the company’s sales
vs. the number of ads that were running at different times.
We would like to create an equation from this data that allows us to predict the
sales based upon the number of ads currently running.
The first step is to eyeball the data and estimate what general type of curve this
graph probably is. In this case it appears to be a graph whose y value rises at a
diminishing rate as the x value increases. A formula for such a curve would have the general
form:
$Y = A1 + A2 \cdot X^{B1}$
We can use the Excel Solver to solve for A1, A2, and B1. We need to arrange
the data in a form that can be input into the Excel Solver as follows:
This table shows the arrangement of data and the calculations. Here we have
created an Excel model based upon our model of $Y = A1 + A2 \cdot X^{B1}$.
One example of this formula in action is explained for Cell E16. We are listing the
variables that we are solving for (A1, A2, and B1) in cells B3 to B5. In Solver
language, these cells that we are changing are called Decision Variables. We start
with arbitrary settings for them:
A1 = 100
A2 = 100
B1 = 0.05
We now take the difference between the actual number of sales and the number
of sales predicted by our model with our arbitrary settings for the Decision
Variables. The square of each difference is taken and then all squares are
summed up.
We are trying to find the settings for the Decision Variables that will minimize the
sum of the squares of the differences. In other words, we are trying to find A1,
A2, and B1 that will minimize the number in cell G13.
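The figure that ends up in the Objective cell can be sketched as follows. The (ads, sales) pairs are made up, since the manager's actual table is not reproduced here, and the function names are only illustrative.

```python
def predicted_sales(ads, a1, a2, b1):
    """The model Y = A1 + A2 * X^B1."""
    return a1 + a2 * ads ** b1

def sum_of_squares(params, data):
    """Sum of squared differences between actual and predicted sales."""
    a1, a2, b1 = params
    return sum((sales - predicted_sales(ads, a1, a2, b1)) ** 2
               for ads, sales in data)

# Hypothetical (number of ads, sales) observations
DATA = [(1, 15), (4, 20), (9, 25), (16, 30)]

start = (100, 100, 0.05)          # the arbitrary starting settings
print(round(sum_of_squares(start, DATA), 1))
```

The Solver's job is to change A1, A2, and B1 until this number is as small as it can be made.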
Once the Solver has been installed as an add-in (To add-in Solver: File /
Options / Add-Ins / Manage / Excel Add-Ins / Go / Solver Add-In), you can
access the Solver in Excel 2010 by: Data / Solver.
The Solver dialogue box has the following 4 parameters that need to be set:
1) The Objective Cell – This is the target cell that we are trying to
maximize, minimize, or set to a certain value.
4) Constraints – These are the limitations that the problem subjects the
Solver to during its calculations.
Objective:
We are trying to minimize Cell G13, the sum of the square of differences
between the actual and predicted sales.
Decision Variables:
We are changing A1, A2, and B1 (cells B3 to B5) to minimize our Objective, Cell
G13. The Decision Variables are therefore Cells B3 to B5.
Constraints:
The GRG Nonlinear method is used when the equation producing the objective is
not linear but is smooth (continuous). Examples of smooth nonlinear functions in
Excel are:
These functions have graphs that are curved (nonlinear), but have no breaks
(smooth)
Solver has optimized the Decision Variables to minimize the objective function as
follows:
A1 = -445,616
A2 = 437,247
B1 = 0.00911
We can now create an Excel graph of the Actual Sales vs. the Predicted Sales as
follows:
Solver calculates that Sales can be predicted from Number of Ads Running by
the following equation:
$\text{Sales} = -445{,}616 + 437{,}247 \cdot (\text{Number of Ads Running})^{0.00911}$
The trickiest part of this problem is the first step: eyeballing the data to
determine what kind of curve the data follows. You should take time to
evaluate whether you are pursuing calculation of the correct curve type.
Solver Tips
You may notice that if you run this problem through the Solver multiple times, you
will get slightly different answers. Each time that you run Solver’s GRG algorithm,
it may calculate different values for the Decision Variables. You are trying to find
the values for the Decision Variables that minimize the objective function (cell
G13) the most.
When the Solver runs the GRG algorithm, it starts its calculations from the values
currently in the Decision Variable cells. Each re-run therefore starts from the
point where the previous run stopped, which is why different answers can appear
during each run. Choose the Decision Variable values that occur during the run which
produces the lowest value of the Objective. Keep running the Solver until the
Objective stops decreasing. That should give you the optimal values of
the Decision Variables. That was done in the example above.
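The advice to keep re-running until the Objective stops decreasing can be sketched as a loop. This is only an illustration under stated assumptions: `one_run` is a crude coordinate search standing in for a single Solver run (it is not the GRG algorithm), each re-run is assumed to begin from the values the previous run left behind, and the data is made up.

```python
def sum_of_squares(params, data):
    """Objective: squared differences for the model Y = A1 + A2 * X^B1."""
    a1, a2, b1 = params
    return sum((y - (a1 + a2 * x ** b1)) ** 2 for x, y in data)

def one_run(params, data, step=1.0, sweeps=50):
    """Crude stand-in for one Solver run; never returns a worse Objective."""
    best = list(params)
    best_sse = sum_of_squares(best, data)
    for _ in range(sweeps):
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = best[:]
                trial[i] += delta
                sse = sum_of_squares(trial, data)
                if sse < best_sse:
                    best, best_sse, improved = trial, sse, True
        if not improved:
            step /= 2            # tighten the search when nothing helps
    return best, best_sse

def rerun_until_stable(start, data, tolerance=1e-9, max_runs=100):
    """Re-run from the previous result until the Objective stops decreasing."""
    params, sse = list(start), sum_of_squares(start, data)
    history = [sse]
    for _ in range(max_runs):
        params, new_sse = one_run(params, data)
        history.append(new_sse)
        if sse - new_sse <= tolerance:
            break
        sse = new_sse
    return params, history

# Hypothetical (number of ads, sales) observations
DATA = [(1, 15), (4, 20), (9, 25), (16, 30)]
params, history = rerun_until_stable((100, 100, 0.05), DATA)
print(len(history) - 1, "runs; Objective went from",
      round(history[0], 1), "to", round(history[-1], 4))
```

The `history` list records the Objective after each run, which is the same bookkeeping you would do by hand when noting cell G13 after every Solver run.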
Here are some Solver settings that you want to configure prior to running the
Solver for most problems. These settings are found when you click the Options
button:
Show Iteration Results: Leave this unchecked. This stops the GRG Solver after
each iteration, displaying the result for that iteration. Very rarely is there a reason
for doing that.
Use Automatic Scaling: Leave this box unchecked. You would only use this
option if you had reason to believe that inputs of the Solver were measured using
different scales.
Assume Non-Negative: Only check this if you are sure that none of the
variables can ever be negative. That is clearly not true here, since the value
that Solver found for A1 is negative.
Summary
Mark Harmon is also a natural teacher. As an adjunct professor, he spent five years
teaching more than thirty semester-long courses in marketing and finance at the Anglo-
American College in Prague and the International University in Vienna, Austria. During
that five-year time period, he also worked as an independent marketing consultant in the
Czech Republic and performed long-term assignments for more than one hundred clients.
His years of teaching and consulting have honed his ability to present difficult subject
matter in an easy-to-understand way.
Harmon received a degree in electrical engineering from Villanova University and an MBA
in marketing from the Wharton School.