15 Building Regression Models Part 2
Topics Outline
Include/Exclude Decisions
Variable Selection Procedures
Example 1
Explaining spending amounts at HyTex
HyTex is a direct marketer of stereo equipment, personal computers, and other electronic
products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are
taken over the telephone. The company spends a great deal of money on its catalog mailings, and
it wants to be sure that this is paying off in sales.
The file Catalog_Marketing.xlsx contains data on 1000 customers who purchased mail-order
products from the HyTex Company in the current year. For each customer there are data on the
following variables:
Age
Gender
OwnHome whether the customer owns a home
Married whether the customer is married
Close whether the customer lives close to stores selling this type of merchandise
Salary the customer's annual salary
Children number of children living at home
PrevCust whether the customer purchased from HyTex in the previous year
PrevSpent amount spent at HyTex in the previous year (zero for non-customers)
Catalogs number of catalogs sent to the customer this year
AmountSpent amount spent at HyTex in the current year
Develop a multiple regression model that is useful for explaining current year spending amounts at HyTex.
Solution:
With this much data (1000 observations), it is possible to set aside part of the data set for validation. Although any split can be used, let's base the regression on the first 750 observations and use the other 250 for validation. Therefore, you should select only the range through row 751 when defining the StatTools data set.
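The split itself is trivial. Purely as an illustration outside StatTools, here is a minimal Python sketch of the 750/250 split (the function name is my own):

```python
import numpy as np

def split_estimation_validation(data, n_estimation=750):
    """Split the rows into an estimation sample (the first
    n_estimation rows) and a validation sample (the rest),
    mirroring the 750/250 split of the 1000 HyTex customers."""
    data = np.asarray(data)
    return data[:n_estimation], data[n_estimation:]
```

The regression is then based only on the estimation sample; the validation sample is held back until Note 1 below.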
(a) Regression 1
First, run a multiple regression with all of the explanatory variables.
The goal is then to exclude variables that aren't necessary, based on their t-values and P-values.
Here is the multiple regression output.
It indicates a fairly good fit. The r2 value is 74.7% and se is about $491. Given that the actual
amounts spent in the current year vary from a low of under $50 to a high of over $5500, with
a median of about $950, a typical prediction error of around $491 is decent but not great.
(b) Which variable(s) would you exclude from the regression equation?
From the P-value column, you can see that there are four variables, Age, Gender, OwnHome,
and Married, that have P-values well above 0.05. These are the obvious candidates for
exclusion from the equation. You could rerun the equation with all four of these variables
excluded, but it is a better practice to exclude one variable at a time. It is possible that when
one of these variables is excluded, another one of them will become significant.
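StatTools reports the t-values and P-values directly. Purely as an illustration of the exclude-one-variable-at-a-time logic, here is a Python sketch using ordinary least squares, with |t| > 2 as a rough stand-in for P < 0.05 in large samples (the function and variable names are my own; this is not the StatTools algorithm):

```python
import numpy as np

def t_values(X, y):
    """Ordinary least squares t-values for each column of X
    (X should already contain a leading column of ones)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)          # residual variance estimate
    cov = s2 * np.linalg.inv(X.T @ X)     # covariance matrix of beta
    return beta / np.sqrt(np.diag(cov))

def backward_eliminate(X, y, names, t_cut=2.0):
    """Drop the least significant variable one at a time until every
    remaining |t| exceeds t_cut.  Column 0 (the intercept) is never
    considered for removal."""
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        t = np.abs(t_values(X, y))
        worst = 1 + int(np.argmin(t[1:]))  # least significant, skipping intercept
        if t[worst] > t_cut:
            break                          # everything left is significant
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names
```

With the HyTex data this process would drop Married, Age, OwnHome, and Gender in turn, as described next.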
(c) Rerun the regression after excluding the variables with the largest P-values one at a time.
Regression 2
The variable Married has the largest P-value. The result from rerunning the regression without this variable shows that Age, Gender, and OwnHome still have large P-values.
Regression 3
The variable with the largest remaining P-value, Age, is excluded.
Regression 4
The variable with the largest remaining P-value, OwnHome, is excluded.
Regression 5
The variable with the largest remaining P-value, Gender, is excluded.
Here is the resulting output.
The r2 and se values of 74.6% and $491 are almost the same as they were with all variables
included, and all of the P-values are very small.
(d) Interpret the coefficients of the final regression equation.
The coefficient of Close implies that an average customer living close to stores with this type
of merchandise spent about $416 less than an average customer living far from such stores.
The coefficient of Salary implies that, on average, about 1.8 cents of every extra salary dollar
was spent on HyTex merchandise.
The coefficient of Children implies that about $161 less was spent for every extra child living at home.
The PrevCust and PrevSpent terms are somewhat more difficult to interpret.
First, both of these terms are zero for customers who didn't purchase from HyTex in the
previous year. For those who did, the terms become
-544 + 0.27 PrevSpent
The coefficient 0.27 implies that each extra dollar spent the previous year can be expected to contribute an extra 27 cents in the current year. The -544 literally means that if you compare a customer who didn't purchase from HyTex last year to another customer who purchased only a tiny amount, the latter is expected to spend about $544 less than the former this year. However, none of the latter customers were in the data set. A look at the data shows that of all customers who purchased from HyTex last year, almost all spent at least $100 and most spent considerably more. In fact, the median amount spent by these customers last year was about $900 (the median of all positive values of the PrevSpent variable). If you substitute this median value into the expression -544 + 0.27 PrevSpent, you obtain approximately -298. Therefore, this median spender from last year can be expected to spend about $298 less this year than the previous year's nonspender.
The coefficient of Catalogs implies that each extra catalog can be expected to generate about
$44 in extra spending.
If instead an automatic variable selection procedure, such as stepwise regression, is run on the same data, the variables that enter or exit the equation are listed at the bottom of the output. The usual regression output for the final equation also appears. However, this final equation's output is exactly the same as when multiple regression is used with these particular variables.
Notes:
1. If you validate this final regression equation on the other 250 customers, you will find r2 and se values of 73.2% and $486. These are very promising; they are very close to the values based on the original 750 customers.
2. We haven't tried all possibilities. We haven't tried nonlinear or interaction variables,
nor have we looked at different coding schemes (such as treating Catalogs as a categorical
variable and using dummy variables to represent it).
3. We haven't checked the regression assumptions. In particular, it turns out that the condition of constant error variance is violated, as can be seen from the fan shape of the scatterplot of AmountSpent versus Salary:
As usual, when you see a fan shape, where the variability increases from left to right in a
scatterplot, you can try a logarithmic transformation. The reason this often works is that the
logarithmic transformation squeezes the large values closer together and pulls the small
values farther apart. The scatterplot of the log of AmountSpent versus Salary is shown below.
Clearly, the fan shape is gone. However, the logarithmic transformation appears to have
introduced some curvature into the plot. So, perhaps some other nonlinear transformations are
worth exploring in this example.
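One quick numeric way to see why the logarithm removes a fan shape: for multiplicatively spaced amounts, the large values are far more spread out than the small ones, and taking logs equalizes the spread. A small Python illustration (the spread_ratio helper is my own, not a standard diagnostic):

```python
import numpy as np

def spread_ratio(values):
    """Std of the top half of the sorted values divided by the std
    of the bottom half; a large ratio is the numeric signature of
    the fan shape in the AmountSpent-versus-Salary scatterplot."""
    v = np.sort(np.asarray(values, dtype=float))
    half = len(v) // 2
    return v[half:].std() / v[:half].std()
```

For amounts spaced multiplicatively between $50 and $5500 (the range in the data), the ratio is large; after applying np.log it drops to about 1, i.e. the fan shape is gone.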
Example 2
Possible gender discrimination in salary at Fifth National Bank of Springfield
The Fifth National Bank of Springfield is facing a gender discrimination suit.
The charge is that its female employees receive substantially smaller salaries than its male
employees. The bank's employee data are listed in the file Bank_Salaries.xlsx.
Employee  EducLev  JobGrade  YrsExper  Age  Gender  YrsPrior  PCJob  Salary
1         3        1         3         26   Male    1         No     $32,000
2         1        1         14        38   Female  1         No     $39,100
...       ...      ...       ...       ...  ...     ...       ...    ...
207       5        6         35        59   Male    0         No     $94,000
208       5        6         33        62   Female  0         No     $30,000
For each of the 208 employees, the data set includes the following variables:
EducLev education level, a categorical variable with categories 1 (finished high school), 2 (finished some college courses), 3 (obtained a bachelor's degree), 4 (took some graduate courses), 5 (obtained a graduate degree)
JobGrade a categorical variable indicating the current job level, the possible levels being 1 through 6
YrsExper years of experience with this bank
Age the employee's age
Gender a categorical variable with values Male and Female
YrsPrior years of work experience before joining this bank
PCJob whether the employee's job is PC related (Yes/No)
Salary current annual salary
Do these data provide evidence that the bank discriminates against females in terms of salary?
A formal hypothesis test to compare the average female salary to the average male salary could
be run. Using this method, you can check that the average of all salaries is $39,922, the female
average is $37,210, the male average is $45,505, and the difference between the male and female
averages is statistically significant at any reasonable level of significance.
In short, the females definitely earn less. But perhaps there is a reason for this.
They might have lower education levels, they might have been hired more recently, and so on.
The question is whether the difference between female and male salaries is still evident after
taking these other attributes into account.
Solution:
Regression 1
First, create a dummy variable Female that equals 1 for females and 0 for males, and regress Salary against it. The estimated regression equation is
Predicted Salary = 45505 - 8296 Female
If you substitute Female = 1 into the estimated regression equation, you obtain
Predicted Salary = 45505 - 8296(1) = 37209
Because Female = 1 corresponds to females, this value is simply the average female salary.
Similarly, if you substitute Female = 0 into the estimated equation, you obtain
Predicted Salary = 45505 - 8296(0) = 45505
Because Female = 0 corresponds to males, this value is the average male salary.
Therefore, the interpretation of the -8296 coefficient of the Female dummy variable is straightforward: it is the difference between the average female salary and the average salary of the reference (male) category.
In short, females get paid $8296 less on average than males.
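This equivalence between the regression and the two group averages can be checked directly. A small Python sketch (illustrative names, and a tiny made-up data set rather than the bank data):

```python
import numpy as np

def dummy_regression(salary, female):
    """OLS of salary on an intercept and the Female dummy.  The
    intercept estimates the male (reference) average, and the
    Female coefficient estimates the female average minus the
    male average."""
    X = np.column_stack([np.ones(len(salary)), female])
    beta, *_ = np.linalg.lstsq(X, np.asarray(salary, dtype=float), rcond=None)
    return beta  # [intercept, female_coefficient]
```

With the actual bank data, the same calculation reproduces the intercept 45505 and the coefficient -8296.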
(c) Regression 2
Expand the regression equation by adding the experience variables YrsExper and YrsPrior.
Here is the output with the Female dummy variable and these two experience variables.
(d) Regression 3
Add education level to the equation by including any four of the five education level dummies,
for example by including EducLev = 2 through EducLev = 5. (Reminder: You should always
use one fewer dummy than the number of categories for any categorical variable.)
Here is the resulting output.
(e) Regression 4
Add the remaining explanatory variables to the model: the job grade dummies JobGrade=2 through JobGrade=6 (the lowest job grade is used as the reference category), Age, and HasPCJob.
The regression output for this equation with all variables appears below.
The effect of age appears to be minimal, and there appears to be a bonus of close to $5000
for having a PC-related job.
The r2 value has now increased to 76.5%, and the penalty for being a female has decreased to $2555, still large but not as large as before.
As expected, the coefficients of the job grade dummies are all positive, and they increase as the job grade increases; it pays to be in the higher job grades. Thus, the regression indicates that being in lower job grades implies lower salaries, but it doesn't explain why females are in the lower job grades in the first place.
(f) Regression 5
If you rerun the regression using the numerical explanatory variable YrsExper and the
dummy variable Female, you obtain the equation
Predicted Salary = 35824 + 981 YrsExper - 8012 Female
The r2 value for this equation is 49.1%.
It is certainly plausible that the effect of YrsExper on Salary is different for males than for females.
So, it makes good sense to test for an interaction between the YrsExper and Female variables.
(g) Regression 6
If an interaction variable between YrsExper and Female is added to this equation, what is its effect?
You first need to form an interaction variable that is the product of YrsExper and Female.
Using Excel
Use an Excel formula that multiplies the two variables involved.
Using StatTools
Data Utilities
Interaction
Interaction Between: Two Numeric Variables
Select YrsExper and Female
OK
Now you can run the regression. The multiple regression output appears below.
Notice that the r2 value with the interaction variable has increased from 49.1% to 63.9%.
The interaction variable has definitely added to the explanatory power of the equation.
The estimated regression equation is
Predicted Salary = 30430 + 1528 YrsExper + 4098 Female - 1248 Interaction(YrsExper,Female)
The negative interaction here means that females tend to get lower raises for each extra year
of experience than the males get. To unravel the meaning of this negative interaction, it is useful
to write the above equation as two separate equations, one for females and one for males.
The female equation (Female = 1, so that Interaction(YrsExper,Female) = YrsExper ) is
Predicted Salary = (30430 + 4098) + (1528 - 1248) YrsExper = 34528 + 280 YrsExper
and the male equation (Female = 0, so that Interaction(YrsExper,Female) = 0 ) is
Predicted Salary = 30430 + 1528 YrsExper
Graphically, these equations appear in the following figure.
The y-intercept for the female line is slightly higher: females with no experience with Fifth National tend to start out slightly higher than males. But the slope of the female line is much smaller; that is, males tend to move up the salary ladder much more quickly than females. This provides another argument, although a somewhat different one, for gender discrimination against females.
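The two separate lines can be recovered mechanically from the single fitted equation. Here is a Python sketch of that bookkeeping on a tiny made-up data set (names and numbers are illustrative, not the bank data):

```python
import numpy as np

def gender_specific_lines(yrs, female, salary):
    """OLS of salary on YrsExper, Female, and their product, then
    collapse the fit into separate (intercept, slope) pairs for the
    male (Female = 0) and female (Female = 1) lines."""
    yrs, female, salary = (np.asarray(a, dtype=float)
                           for a in (yrs, female, salary))
    X = np.column_stack([np.ones(len(yrs)), yrs, female, yrs * female])
    (b0, b1, b2, b3), *_ = np.linalg.lstsq(X, salary, rcond=None)
    male_line = (b0, b1)                # intercept, slope when Female = 0
    female_line = (b0 + b2, b1 + b3)    # intercept, slope when Female = 1
    return male_line, female_line
```

Plugging the reported coefficients (30430, 1528, 4098, -1248) into the same bookkeeping gives the male line 30430 + 1528 YrsExper and the female line 34528 + 280 YrsExper, as above.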
Notes:
1. Interaction variables can make a regression quite difficult to interpret, and they are certainly not always necessary. However, without them, the estimated effect of each explanatory variable on the dependent variable is the same regardless of the values of the other explanatory variables. If you believe, as in this example, that the effect of years of experience on salary is different for males than for females, the only way to capture this behavior is to include an interaction variable between years of experience and gender.
2. The product of any two variables, a numerical and a dummy variable, two dummy variables,
or even two numerical variables, can be used to create an interaction term. The easiest way to
interpret the results correctly is the way we have been doing it by writing several separate
equations and seeing how they differ.
(h) Suppose you include the variables YrsExper, Female, and HighJob in the equation for Salary,
along with interactions between Female and YrsExper and between Female and HighJob.
Here, HighJob is a new dummy variable that is 1 for job grades 4 to 6 and is 0 for job grades 1 to 3.
(It can be calculated as the sum of the dummies JobGrade = 4 through JobGrade = 6.)
The resulting equation is
Predicted Salary = 28168 + 1261 YrsExper + 9242 HighJob + 6601 Female
- 1224 Interaction(YrsExper,Female) + 1564 Interaction(Female,HighJob)
and the r2 value is now 76.6%.
Interpret the regression coefficients.
The interpretation of this equation is quite a challenge because it is really composed of four
separate equations, one for each combination of Female and HighJob.
For females in the high job category, the equation becomes
Predicted Salary = (28168 + 9242 + 6601 + 1564) + (1261 - 1224) YrsExper
= 45575 + 37 YrsExper
and for females in the low job category it is
Predicted Salary = (28168 + 6601) + (1261 - 1224) YrsExper
= 34769 + 37 YrsExper
Similarly, for males in the high job category, the equation becomes
Predicted Salary = (28168 + 9242) + 1261 YrsExper
= 37410 + 1261 YrsExper
and for males in the low job category it is
Predicted Salary = 28168 + 1261 YrsExper
Putting this into words, the various coefficients can be interpreted as follows.
The intercept 28168 is the average starting salary (that is, with no experience at Fifth National)
for males in the low job category.
The coefficient 1261 of YrsExper is the expected increase in salary per extra year of
experience for males (in either job category).
The coefficient 9242 of HighJob is the expected salary premium for males starting in the
high job category instead of the low job category.
The coefficient 6601 of Female is the expected starting salary premium for females relative
to males, given that they start in the low job category.
The coefficient -1224 of Interaction(YrsExper,Female) is the penalty per extra year of experience for females relative to males; that is, male salaries increase this much more than female salaries each year.
The coefficient 1564 of Interaction(Female,HighJob) is the extra premium (in addition to the
male premium) for females starting in the high job category instead of the low job category.
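The four equations are just sums of the reported coefficients, so the bookkeeping can be checked mechanically. A short Python sketch using the coefficients from the text (the helper name is my own):

```python
# Coefficients reported in the text for the Salary equation with
# YrsExper, HighJob, Female, and the two Female interactions.
b = {"const": 28168, "YrsExper": 1261, "HighJob": 9242, "Female": 6601,
     "YrsFem": -1224, "FemHigh": 1564}

def salary_line(female, highjob, b=b):
    """Collapse the single fitted equation into the (intercept, slope)
    of the salary-versus-YrsExper line for one Female/HighJob combination."""
    intercept = (b["const"] + highjob * b["HighJob"]
                 + female * b["Female"] + female * highjob * b["FemHigh"])
    slope = b["YrsExper"] + female * b["YrsFem"]
    return intercept, slope
```

This reproduces the four equations above: (45575, 37) for high-job females, (34769, 37) for low-job females, (37410, 1261) for high-job males, and (28168, 1261) for low-job males.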
(i) Regression 7
A glance at the distribution of salaries of the 208 employees shows some skewness to the right: a few employees make substantially more than the majority of employees. Therefore, it might make more sense to use the natural logarithm of Salary, rather than Salary itself, as the dependent variable.
Run a regression with Log(Salary) as the dependent variable and YrsExper and Female as
explanatory variables. How can you interpret the results?
Here are the results obtained after creating the Log(Salary) variable and running the regression.
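When Log(Salary) is the dependent variable, coefficients are interpreted multiplicatively rather than in dollars: a coefficient b on a dummy variable such as Female corresponds to roughly a 100(e^b - 1) percent difference in salary. A small sketch of that conversion (the -0.2 in the test is illustrative, not a value from the output):

```python
import math

def dummy_percent_effect(b):
    """Approximate percentage effect on Salary of a dummy variable
    whose coefficient in the Log(Salary) regression is b."""
    return 100 * (math.exp(b) - 1)
```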
(j) In Regression 6 we regressed Salary versus the Female dummy, YrsExper, and the interaction
between Female and YrsExper, Interaction(YrsExper,Female). The output appears below.
(k) Run the block procedure a second time, changing the order of the blocks:
Block2 = JobGrade dummies, JobGrade=2 to JobGrade=6
Block3 = EducLev dummies, EducLev=2 to EducLev=5
Block4 = interactions between the Female dummy and the education dummies,
Interaction(Female,EducLev=2) to Interaction(Female,EducLev=5)
The regression output appears below. Note that neither of the last two blocks enters the equation this time. Once the job grade dummies are in the equation, the terms including education are no longer needed. The implication is that the order of the blocks can make a difference.