

CHAPTER 4

Regression Models

TEACHING SUGGESTIONS

Teaching Suggestion 4.1: Which Is the Independent Variable?
We find that students are often confused about which variable is independent and which
is dependent in a regression model. For example, in Triple A’s problem, clarify which
variable is X and which is Y. Emphasize that the dependent variable (Y) is what we are
trying to predict based on the value of the independent (X) variable. Use examples such
as the time required to drive to a store and the distance traveled, the total number of
units sold and the selling price of a product, and the cost of a computer and the
processor speed.

Teaching Suggestion 4.2: Statistical Correlation Does Not Always Mean Causality.
Students should understand that a high R² doesn’t always mean one variable will be a good
predictor of the other. Explain that skirt lengths and stock market prices may be
correlated, but raising one doesn’t necessarily mean the other will go up or down. An
interesting study indicated that, over a 10-year period, the salaries [...] volume of
alcoholic beverages (both were actually correlated with inflation).

Teaching Suggestion 4.3: Give students a set of data and have them plot the data and
manually draw a line through the data. A discussion of which line is “best” can help them
appreciate the least squares criterion.

Teaching Suggestion 4.4: Select some randomly generated values for X and Y (you can use
random numbers from the random number table in Chapter 15 or use the RAND function in
Excel). Develop a regression line using Excel and discuss the coefficient of
determination and the F-test. Students will see that a regression line can always be
developed, but it may not necessarily be useful.

Teaching Suggestion 4.5: A discussion of the long formulas and short-cut formulas that
are provided in the appendix is helpful. The long formulas provide students with a better
understanding of the meaning of the SSE and SST. Since many people use computers for
regression problems, it helps to see the original formulas. The short-cut formulas are
helpful if students are performing the computations on a calculator.

ALTERNATIVE EXAMPLES

Alternative Example 4.1: The sales manager of a large apartment rental complex feels the
demand for apartments may be related to the number of newspaper ads placed during the
previous month. She has collected the data shown in the accompanying table.

Ads purchased (X)   Apartments leased (Y)
15                  6
9                   4
40                  16
20                  6
25                  13
25                  9
15                  10
35                  16

We can find a mathematical equation by using the least squares regression approach.

Leases, Y   Ads, X   (X - X̄)²   (X - X̄)(Y - Ȳ)
6           15       64         32
4           9        196        84
16          40       289        102
6           20       9          12
13          25       4          6
9           25       4          -2
10          15       64         0
16          35       144        72

ΣY = 80; ΣX = 184; Σ(X - X̄)² = 774; Σ(X - X̄)(Y - Ȳ) = 306
Ȳ = 80/8 = 10; X̄ = 184/8 = 23

b1 = 306/774 = 0.395
b0 = 10 - 0.395(23) = 0.915

The estimated regression equation is
Ŷ = 0.915 + 0.395X
or
Apartments leased = 0.915 + 0.395 ads placed

If the number of ads is 30, we can estimate the number of apartments leased with the
regression equation
0.915 + 0.395(30) = 12.76 or 13 apartments

Alternative Example 4.2: Given the data on ads and apartment rentals in Alternative
Example 4.1, find the coefficient of determination. The following have been computed in
the table that follows:
SST = 150; SSE = 29.02; SSR = 120.76
(Note: Round-off error may cause this to be slightly different than a computer solution.)



CHAPTER 4 R E G R E S S I O N M O D E LS 47

Y       X        (Y - Ȳ)²      Ŷ = 0.915 + 0.395X   (Y - Ŷ)²      (Ŷ - Ȳ)²
6.00    15.00    16            6.84                 0.706         9.986
4.00    9.00     36            4.47                 0.221         30.581
16.00   40.00    36            16.715               0.511         45.091
6.00    20.00    16            8.815                7.924         1.404
13.00   25.00    9             10.79                4.884         0.624
9.00    25.00    1             10.79                3.204         0.624
10.00   15.00    0             6.84                 9.986         9.986
16.00   35.00    36            14.74                1.588         22.468
80.00   184.00   SST = 150.00  80.00                SSE = 29.02   SSR = 120.76
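
The table above can be checked with a few lines of code. The following is a minimal sketch, assuming the eight (X, Y) pairs from Alternative Example 4.1; the variable names are illustrative and not part of the text. Because it carries the unrounded slope and intercept, its SSE and SSR differ slightly from the hand-computed values, which is the round-off effect the note above mentions.

```python
# Data from Alternative Example 4.1: ads placed (X) and apartments leased (Y).
ads = [15, 9, 40, 20, 25, 25, 15, 35]
leases = [6, 4, 16, 6, 13, 9, 10, 16]

n = len(ads)
x_bar = sum(ads) / n
y_bar = sum(leases) / n

# Sums of squared deviations and cross products (the two right-hand table columns).
s_xx = sum((x - x_bar) ** 2 for x in ads)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ads, leases))

# Least-squares slope and intercept.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Fitted values and the three sums of squares.
fitted = [b0 + b1 * x for x in ads]
sst = sum((y - y_bar) ** 2 for y in leases)
sse = sum((y - f) ** 2 for y, f in zip(leases, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)

r2 = ssr / sst
# The correlation coefficient takes the sign of the slope.
r = r2 ** 0.5 if b1 >= 0 else -(r2 ** 0.5)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}")
print(f"r2 = {r2:.2f}, r = {r:.2f}")
```

With unrounded coefficients, SSE + SSR equals SST exactly (up to floating-point error), whereas the rounded table values sum to 149.78.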

From this the coefficient of determination is

r² = SSR/SST = 120.76/150 = 0.81

Alternative Example 4.3: For Alternative Examples 4.1 and 4.2, dealing with ads, X, and
apartments leased, Y, compute the correlation coefficient.
Since r² = 0.81 and the slope is positive (+0.395), the positive square root of 0.81 is
the correlation coefficient: r = 0.90.

SOLUTIONS TO DISCUSSION QUESTIONS AND PROBLEMS

4-1. The term least-squares means that the regression line will minimize the sum of the
squared errors (SSE). No other line will give a lower SSE.

4-2. Dummy variables are used when a qualitative factor such as the gender of an
individual (male or female) is to be included in the model. Usually this is given a value
of 1 when the condition is met (e.g., the person is male) and 0 otherwise. When there are
more than two levels or values for the qualitative factor, more than one dummy variable
must be used. The number of dummy variables is one less than the number of possible
values or categories. For example, if students are classified as freshmen, sophomores,
juniors, and seniors, three dummy variables would be necessary.

4-3. The coefficient of determination (r²) is the square of the coefficient of
correlation (r). Both of these give an indication of how well a regression model fits a
particular set of data. An r² value of 1 would indicate a perfect fit of the regression
model to the points. This would also mean that r would equal -1 or +1.

4-4. A scatter diagram is a plot of the data. This graphical image helps to determine if
a linear relationship is present, or if another type of relationship would be more
appropriate.

4-5. The adjusted r² value is used to help determine if a new variable should be added to
a regression model. Generally, if the adjusted r² value increases when a new variable is
added to a model, this new variable should be included in the model. If the adjusted r²
value declines or does not increase when a new variable is added, then the variable
should not be added to the model.

4-6. The F-test is used to determine if the overall regression model is helpful in
predicting the value of the dependent variable (Y). If the F-value is large and the
p-value or significance level is low, then we can conclude that there is a linear
relationship and the model is useful, as these results would probably not occur by
chance. If the significance level is high, then the model is not useful and the results
in the sample could be due to random variations.

4-7. The SSE is the sum of the squared errors in a regression model. SST = SSE + SSR.

4-8. When the residuals (errors) are plotted after a regression line is found, the errors
should be random and should not show any significant pattern. If a pattern does exist,
then the assumptions may not be met or another model (perhaps nonlinear) would be more
appropriate.

4-9. a. Ŷ = 36 + 4.3(70) = 337
b. Ŷ = 36 + 4.3(80) = 380
c. Ŷ = 36 + 4.3(90) = 423

4-10. a.
[Scatter diagram: Demand (vertical axis, 0 to 12) versus TV Appearances (horizontal axis, 0 to 10)]

4-10. b.
Demand   TV Appearances
Y        X         (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)   Ŷ   (Y - Ŷ)²   (Ŷ - Ȳ)²
3        3         6.25       12.25      8.75             4   1          6.25
6        4         2.25       0.25       0.75             5   1          2.25
7        7         2.25       0.25       0.75             8   1          2.25
5        6         0.25       2.25       -0.75            7   4          0.25
10       8         6.25       12.25      8.75             9   1          6.25
8        5         0.25       2.25       -0.75            6   4          0.25
ΣY = 39  ΣX = 33   17.5       29.5       17.5                 12         17.5
                              (SST)                           (SSE)      (SSR)
X̄ = 5.5
Ȳ = 6.5
SST = 29.5; SSE = 12; SSR = 17.5
b1 = 17.5/17.5 = 1
b0 = 6.5 - 1(5.5) = 1
The regression equation is Ŷ = 1 + 1X.
c. Ŷ = 1 + 1X = 1 + 1(6) = 7.
4-11. See the table for the solution to problem 4-10 to obtain
some of these numbers.
MSE = SSE/(n - k - 1) = 12/(6 - 1 - 1) = 3
MSR = SSR/k = 17.5/1 = 17.5
F = MSR/MSE = 17.5/3 = 5.83
df1 = k = 1
df2 = n - k - 1 = 6 - 1 - 1 = 4
F0.05, 1, 4 = 7.71
Do not reject H0 since 5.83 < 7.71. Therefore, we cannot conclude
there is a statistically significant relationship at the 0.05 level.
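
The F-test arithmetic for problems 4-10 and 4-11 can be reproduced as follows. This is a sketch under stated assumptions: the data are the six (X, Y) pairs from the 4-10 table, the critical value 7.71 is taken from the solution above rather than computed, and the variable names are illustrative.

```python
# Data from problem 4-10: TV appearances (X) and demand (Y).
tv = [3, 4, 7, 6, 8, 5]
demand = [3, 6, 7, 5, 10, 8]

n = len(tv)   # number of observations
k = 1         # number of independent variables

x_bar = sum(tv) / n
y_bar = sum(demand) / n

# Least-squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(tv, demand))
      / sum((x - x_bar) ** 2 for x in tv))
b0 = y_bar - b1 * x_bar

# Error and regression sums of squares.
fitted = [b0 + b1 * x for x in tv]
sse = sum((y - f) ** 2 for y, f in zip(demand, fitted))
ssr = sum((f - y_bar) ** 2 for f in fitted)

# Mean squares and the F statistic with (k, n - k - 1) degrees of freedom.
mse = sse / (n - k - 1)
msr = ssr / k
f_stat = msr / mse

F_CRITICAL = 7.71  # F(0.05; 1, 4), quoted from the solution text
reject_h0 = f_stat > F_CRITICAL

print(f"MSE = {mse:.2f}, MSR = {msr:.2f}, F = {f_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Keeping the critical value as a named constant makes it easy to rerun the comparison at another significance level by substituting the corresponding table value.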

4-12. Using Excel, the regression equation is Ŷ = 1 + 1X.
F = 5.83, and the significance level is 0.073. This is significant at the
0.10 level (0.073 < 0.10), but it is not significant at the 0.05 level.
There is marginal evidence that there is a relationship between
demand for drums and TV appearances.

4-13.

Fin. Ave. (Y)   Test 1 (X)   (X - X̄)²   (Y - Ȳ)²   (X - X̄)(Y - Ȳ)   Ŷ      (Y - Ŷ)²   (Ŷ - Ȳ)²
93              98           285.235    196        236.444          91.5   2.264      156.135
78              77           16.901     1          4.111            76     4.168      9.252
84              88           47.457     25         34.444           84.1   0.009      25.977
73              80           1.235      36         6.667            78.2   26.811     0.676
84              96           221.679    25         74.444           90     36.188     121.345
64              61           404.457    225        301.667          64.1   0.015      221.396
64              66           228.346    225        226.667          67.8   14.592     124.994
95              95           192.901    256        222.222          89.3   32.766     105.592
76              69           146.679    9          36.333           70     35.528     80.291
711             730          1544.9     998        1143                    152.341    845.659

b1 = 1143/1544.9 = 0.740
b0 = (711/9) - 0.740(730/9) = 18.99
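
For reference, the two computations above are instances of the standard least-squares formulas, written out here with the column totals from the 4-13 table. This is a restatement in conventional notation, not an additional result; note that the intercept of 18.99 is obtained when the unrounded slope is carried through.

```latex
\begin{aligned}
b_1 &= \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2}
     = \frac{1143}{1544.9} \approx 0.740, \\[4pt]
b_0 &= \bar{Y} - b_1\,\bar{X}
     = \frac{711}{9} - \frac{1143}{1544.9}\cdot\frac{730}{9} \approx 18.99 .
\end{aligned}
```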

a. Ŷ = 18.99 + 0.74X
b. Ŷ = 18.99 + 0.74(83) = 80.41
c. r² = SSR/SST = 845.659/998 = 0.85; r = 0.92; this means that 85% of the variability
in the final average can be explained by the variability in the first test score.

4-14. See the table for the solution to problem 4-13 to obtain some of these numbers.
MSE = SSE/(n - k - 1) = 152.341/(9 - 1 - 1) = 21.76
MSR = SSR/k = 845.659/1 = 845.659
F = MSR/MSE = 845.659/21.76 = 38.9
df1 = k = 1
df2 = n - k - 1 = 9 - 1 - 1 = 7
F0.05, 1, 7 = 5.59
Because 38.9 > 5.59, we can conclude (at the 0.05 level) that there is a statistically
significant relationship between the first test grade and the final average.

4-15. F = 38.86; the significance level = 0.0004 (which is extremely small) so there is
definitely a statistically significant relationship.

4-16. a. Ŷ = 13,473 + 37.65(1,860) = $83,502.
b. The predicted average selling price for a house this size would be $83,502. Some will
sell for more and some will sell for less. There are other factors besides size that
influence the price of the house.
c. Some other variables that might be included are age of the house, number of bedrooms,
and size of the lot. There are other factors in addition to these that one can identify.
d. The coefficient of determination (r²) = (0.63)² = 0.3969.

4-17. The multiple regression equation is Ŷ = $90.00 + $48.50X1 + $0.40X2
a. Number of days on the road: X1 = 5; Distance traveled: X2 = 300 miles
The amount he may be expected to claim is
Ŷ = 90.00 + 48.50(5) + $0.40(300) = $452.50
b. The reimbursement request, according to the model, appears to be too high. However,
this does not mean that it is not justified. The accountants should question Thomas
Williams about his expenses to see if there are other explanations for the high cost.
c. A number of other variables should be included, such as the type of travel (air or
car), conference fees if any, expenses for entertainment of customers, and other
transportation (cab and limousine) expenses. In addition, the coefficient of correlation
is only 0.68 and r² = (0.68)² = 0.46. Thus, about 46% of the variability in the cost of
the trip is explained by this model; the other 54% is due to other factors.

4-18. Using computer software to get the regression equation, we get
Ŷ = 1.03 + 0.0034X
where Ŷ = predicted GPA and X = SAT score. If a student scores 450 on the SAT, we get
Ŷ = 1.03 + 0.0034(450) = 2.56.
If a student scores 800 on the SAT, we get
Ŷ = 1.03 + 0.0034(800) = 3.75.

4-19. a. A linear model is reasonable from the graph below.
[Scatter diagram: Ridership (100,000s) on the vertical axis (0 to 50) versus Tourists (Millions) on the horizontal axis (0 to 25)]
b. Ŷ = 5.060 + 1.593X
c. Ŷ = 5.060 + 1.593(10) = 20.99, or 2,099,000 people.
d. If there are no tourists, the predicted ridership would be 5.06 (100,000s) or 506,000.
Because X = 0 is outside the range of values that were used to construct the regression
model, this number may be questionable.

4-20. The F-value for the F-test is 52.6 and the significance level is extremely small
(0.00002), which indicates that there is a statistically significant relationship between
number of tourists and ridership. The coefficient of determination is 0.84, indicating
that 84% of the variability in ridership from one year to the next could be explained by
the variations in the number of tourists.

4-21. a. Ŷ = 24,328 + 3026.67X1 + 6684X2
where Ŷ = predicted starting salary; X1 = GPA; X2 = 1 if business major, 0 otherwise.
b. Ŷ = 24,328 + 3026.67(3.0) + 6684(1) = $40,092.01.
c. The starting salary for business majors tends to be about $6,684 higher than
non-business majors in this sample, even after adjusting for variations in GPA.
d. The overall significance level is 0.099 and r² = 0.69. Thus, the model is significant
at the 0.10 level and 69% of the variability in starting salary is explained by GPA and
major. The model is useful in predicting starting salary.

4-22. a. Let
Ŷ = predicted selling price
X1 = square footage
X2 = number of bedrooms
X3 = age
The model with square footage: Ŷ = 2367.26 + 46.60X1; r² = 0.65
The model with number of bedrooms: Ŷ = 1923.5 + 36137.76X2; r² = 0.36
The model with age: Ŷ = 147670.9 - 2424.16X3; r² = 0.78

All of these models are significant at the 0.01 level or less. The best model uses age as
the independent variable. The coefficient of determination is highest for this, and it is
significant.

4-23. Ŷ = 5701.45 + 48.51X1 - 2540.39X2 and r² = 0.65.
Ŷ = 5701.45 + 48.51(2000) - 2540.39(3) = 95,100.28.
Notice the r² value is the same as it was in the previous problem with just square
footage as the independent variable. Adding the number of bedrooms did not add any
significant information that was not already captured by the square footage. It should
not be included in the model. The r² for this is lower than for age alone in the previous
problem.

4-24. Ŷ = 82185.5 + 25.94X1 - 2151.7X2 - 1711.5X3 and r² = 0.89.
Ŷ = 82185.5 + 25.94(2000) - 2151.7(3) - 1711.5(10) = $110,495.4.

4-25. Ŷ = 3071.885 + 6.5326X
where Y = DJIA and X = S&P.
r = 0.84 and r² = 0.70.
Ŷ = 3071.885 + 6.5326(1100) = 10257.8 (rounded)

4-26. With one independent variable, beds, in the model, r² = 0.88. With just admissions
in the model, r² = 0.974. When both variables are in the model, r² = 0.975. Thus, the
model with only admissions as the independent variable is the best. Adding the number of
beds had virtually no impact on r², and the adjusted r² decreased slightly. Thus, the
best model is Ŷ = 1.518 + 0.6686X
where Y = expense and X = admissions.

4-27. Using Excel with Y = MPG; X1 = horsepower; X2 = weight, the models are:
Ŷ = 53.87 - 0.269X1; r² = 0.77
Ŷ = 57.53 - 0.01X2; r² = 0.73.
Thus, the model with horsepower as the independent variable is better since r² is higher.

4-28. Ŷ = 57.69 - 0.17X1 - 0.005X2
where
Y = MPG
X1 = horsepower
X2 = weight
r² = 0.82.
This model is better because the coefficient of determination is much higher with both
variables than it is with either one individually.

4-29. Let Y = MPG; X1 = horsepower; X2 = weight
The model Ŷ = b0 + b1X1 + b2(X1)² is Ŷ = 69.93 - 0.620X1 + 0.001747(X1)² and has
r² = 0.798.
The model Ŷ = b0 + b3X2 + b4(X2)² is Ŷ = 89.09 - 0.0337X2 + 0.0000039(X2)² and has
r² = 0.800.
The model Ŷ = b0 + b1X1 + b2(X1)² + b3X2 + b4(X2)² is
Ŷ = 89.2 - 0.51X1 + 0.001889(X1)² - 0.01615X2 + 0.00000162(X2)² and has r² = 0.883.
This model has a higher r² value than the model in 4-28. A graph of the data would show a
nonlinear relationship.

4-30. If SAT median score alone is used to predict the cost, we get
Ŷ = -7793.1 + 21.8X1 with r² = 0.22.
If both SAT and a dummy variable (X2 = 1 for private, 0 otherwise) are used to predict
the cost, we get r² = 0.79. The model is
Ŷ = 7121.8 + 5.16X1 + 9354.99X2.
This says that a private school tends to be about $9,355 more expensive than a public
school when the median SAT score is used to adjust for the quality of the school. The
coefficient of determination indicates that about 79% of the variability in cost can be
explained by these factors. The model is significant at the 0.001 level.

4-31. Ŷ = 67.8 + 0.0145X
There is a significant relationship between the number of victories (Y) and the payroll
(X) at the 0.054 level, which is marginally significant. However, r² = 0.24, so the
relationship is not very strong. Only about 24% of the variability in victories is
explained by this model.

4-32. a. Ŷ = 42.43 + 0.0004X
b. Ŷ = 31.54 + 0.0058X
c. The correlation coefficient for the first stock is only 0.19 while the correlation
coefficient for the second is 0.96. Thus, there is a much stronger correlation between
stock 2 and the DJI than there is for stock 1 and the DJI.

CASE STUDIES

SOLUTION TO NORTH–SOUTH AIRLINE CASE

Northern Airline Data

Year   Airframe Cost per Aircraft   Engine Cost per Aircraft   Average Age (Hours)
2001   51.80                        43.49                      6,512
2002   54.92                        38.58                      8,404
2003   69.70                        51.48                      11,077
2004   68.90                        58.72                      11,717

Southeast Airline Data

Year   Airframe Cost per Aircraft   Engine Cost per Aircraft   Average Age (Hours)
2001   13.29                        18.86                      5,107
2002   25.15                        31.55                      8,145
2003   32.18                        40.43                      7,360
2004   31.78                        22.10                      5,773
2005   25.34                        19.69                      7,150
2006   32.78                        32.58                      9,364
2007   35.56                        38.07                      8,259

Utilizing QM for Windows, we can develop the following regression equations for the
variables of interest.

Northern Airline airframe maintenance cost:
Cost = 36.10 + 0.0025 (airframe age)
Coefficient of determination = 0.7694
Coefficient of correlation = 0.8771

Northern Airline engine maintenance cost:
Cost = 20.57 + 0.0026 (airframe age)
Coefficient of determination = 0.6124
Coefficient of correlation = 0.7825

Southeast Airline airframe maintenance cost:
Cost = 4.60 + 0.0032 (airframe age)
Coefficient of determination = 0.3904
Coefficient of correlation = 0.6248

Southeast Airline engine maintenance cost:
Cost = -0.671 + 0.0041 (airframe age)
Coefficient of determination = 0.4599
Coefficient of correlation = 0.6782

The graphs below portray both the actual data and the regression lines for airframe and
engine maintenance costs for both airlines. Note that the two graphs have been drawn to
the same scale to facilitate comparisons between the two airlines.

Northern Airline: There seem to be modest correlations between maintenance costs and
airframe age for Northern Airline. There is certainly reason to conclude, however, that
airframe age is not the only important factor.

Southeast Airline: The relationships between maintenance costs and airframe age for
Southeast Airline are much less well defined. It is even more obvious that airframe age
is not the only important factor, and perhaps not even the most important factor.

Overall, it would seem that:
1. Northern Airline has the smallest variance in maintenance costs, indicating that the
day-to-day management of maintenance is working pretty well.
2. Maintenance costs seem to be more a function of airline than of airframe age.
3. The airframe and engine maintenance costs for Southeast Airline are not only lower but
more nearly similar than those for Northern Airline, but, from the graphs at least,
appear to be rising more sharply with age.
4. From an overall perspective, it appears that Southeast Airline may perform more
efficiently on sporadic or emergency repairs, and Northern Airline may place more
emphasis on preventive maintenance.

Ms. Young’s report should conclude that:
1. There is evidence to suggest that maintenance costs could be made to be a function of
airframe age by implementing more effective management practices.
2. The difference between maintenance procedures of the two airlines should be
investigated.
3. The data with which she is presently working do not provide conclusive results.

[Figures: Northern Airline and Southeast Airline. Each graph plots airframe and engine cost ($), on a scale of 10 to 90, against average airframe age (thousands), on a scale of 5 to 19.]
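
As a check on the case solution, the Southeast Airline airframe equation can be refitted from the seven rows in the Southeast Airline data table. The sketch below is illustrative only: the helper name `fit_line` is hypothetical, and the fit uses plain least squares rather than QM for Windows, so the results should land close to, but not necessarily digit-for-digit on, the reported values.

```python
# Southeast Airline rows from the data table:
# (average age in hours, airframe cost per aircraft).
southeast_airframe = [
    (5107, 13.29), (8145, 25.15), (7360, 32.18), (5773, 31.78),
    (7150, 25.34), (9364, 32.78), (8259, 35.56),
]


def fit_line(points):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_yy = sum((y - y_bar) ** 2 for _, y in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # For one independent variable, r squared equals Sxy^2 / (Sxx * Syy).
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_line(southeast_airframe)
print(f"Cost = {intercept:.2f} + {slope:.4f} (airframe age)")
print(f"Coefficient of determination = {r_squared:.4f}")
print(f"Coefficient of correlation = {r_squared ** 0.5:.4f}")
```

The same helper could be applied to the engine cost column by swapping in those values as the second element of each pair.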
