PROBLEMS ch05
The first model represents the population regression model; the second model is computed from
sample data and represents the estimate of the population mean of Y for any value of X within the
sample range.
2. What is least squares estimation? Illustrate how least squares regression defines the best fitting
regression line.
Least squares estimation is the most common approach used to estimate the population regression
model using sample data. The example below shows the least squares fit for a set of 5 data points.
The least squares fit is Y = 0.19 + 1.92X, and the error (the difference between the sample Y value and
the fitted value) is computed at each point. The sum of those squared errors is 4.7027; no other line
would yield a lower sum of squared errors.
Intercept = 0.19
Slope = 1.92
[Scatter chart of the five data points with fitted trendline: f(x) = 1.9189x + 0.1892]
                 Regression Line     Error
X    Y    0.19 + 1.92X    Y - (0.19 + 1.92X)    Error Squared
2    5    4.0270          0.9730                0.9467
5    9    9.7838          -0.7838               0.6143
4    7    7.8649          -0.8649               0.7480
7    15   13.6216         1.3784                1.8999
6    11   11.7027         -0.7027               0.4938
                                    Sum of squared errors: 4.7027
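The fit in the table above can be reproduced with a short script. This is a minimal pure-Python sketch of the simple-regression normal equations (no library assumed); it recovers the slope, intercept, and sum of squared errors shown above.

```python
# Least-squares fit for the five sample points in the table above.
xs = [2, 5, 4, 7, 6]
ys = [5, 9, 7, 15, 11]
n = len(xs)

# Normal-equation solutions for simple linear regression:
# slope b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), intercept b0 = ybar - b1*xbar
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = sy / n - b1 * sx / n

# Sum of squared errors about the fitted line
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

print(round(b1, 4), round(b0, 4), round(sse, 4))  # slope 1.9189, intercept 0.1892, SSE 4.7027
```

This is exactly what Excel's SLOPE and INTERCEPT functions compute for the same data.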
3. Explain the coefficient of determination, R2. How does it differ from the sample correlation coefficient?
The coefficient of determination, R2, is the ratio SSR/SST and gives the proportion of variation that is
explained by the independent variable of the regression model. The value of R2 will be between 0 and 1.
A value of 1.0 indicates a perfect fit, while a value of 0 indicates that no relationship exists.
The sample correlation coefficient, R, is the square root of the coefficient of determination. Values of R
range from -1 to 1, where the sign is determined by the sign of the slope of the regression line. A
correlation coefficient of R = 1 indicates perfect positive correlation; that is, as the independent variable
increases, the dependent variable does also. R = -1 indicates perfect negative correlation: as X
increases, Y decreases. As with R2, a value of R = 0 indicates no correlation. Because R2 measures
the actual proportion of the variation explained by the regression, it is generally easier to interpret than R.
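Continuing the five-point example from Problem 2, R2 and R can be computed directly from the definitions above (a pure-Python sketch using the fitted line Y = 0.1892 + 1.9189X):

```python
# R^2 = SSR/SST for the five-point example from Problem 2.
xs = [2, 5, 4, 7, 6]
ys = [5, 9, 7, 15, 11]
b0, b1 = 0.189189, 1.918919   # fitted intercept and slope from Problem 2

ybar = sum(ys) / len(ys)
sst = sum((y - ybar) ** 2 for y in ys)                        # total variation
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # unexplained variation
ssr = sst - sse                                               # explained by regression

r2 = ssr / sst
r = r2 ** 0.5        # sign matches the slope's sign (positive here)
print(round(r2, 3), round(r, 3))  # 0.921 0.959
```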
4. Explain the assumptions of linear regression. How can you determine if each of these assumptions
holds?
1. Linearity: the relationship between the dependent and independent variables is linear.
This can be checked by examining a plot of residuals; if a pattern exists, some other,
nonlinear model may be more appropriate.
2. The errors for each individual value of X are normally distributed with a mean of zero and a
constant variance.
This can be verified by examining a histogram of the residuals associated with each value
of the independent variable and inspecting for a bell-shaped distribution, or using more
formal goodness-of-fit tests as described in Chapter 4.
3. Homoscedasticity: the variation about the regression line is constant for all values of
the independent variable.
This can also be evaluated by examining the residual plot and looking for large differences
in the variances at different values of the independent variable.
4. The residuals should be independent for each value of the independent variable.
Correlation among successive observations over time is called autocorrelation, and can be
identified by residual plots having clusters of residuals with the same sign.
Autocorrelation can be evaluated more formally using a statistical test based on the
Durbin-Watson statistic.
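The Durbin-Watson statistic mentioned above is DW = Σ(e_t − e_{t−1})² / Σe_t²; it is near 2 for independent residuals and approaches 0 under strong positive autocorrelation. A minimal sketch with illustrative (made-up) residual series:

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Independent-looking residuals give a value near 2 ...
print(durbin_watson([0.9, -0.3, 0.6, 0.2, -0.8, 0.4]))   # about 2.31
# ... while runs of same-signed residuals (positive autocorrelation) push DW toward 0.
print(durbin_watson([1.2, 1.0, 0.8, -0.9, -1.1, -0.7]))  # about 0.57
```

In practice the computed DW is compared against tabulated lower and upper critical bounds for the given sample size and number of predictors.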
5. Explain how Adjusted R2 is used in evaluating the fit of multiple regression models.
Adjusted R2, like R2, reflects the value of adding an independent variable to the multiple regression
model, but unlike R2, it is adjusted for both the number of independent variables and the sample size.
Also unlike R2, adjusted R2 may actually go down if the added variable contributes little explanatory
power to the regression model. Therefore, adjusted R2 may be more useful than R2 in identifying the
most beneficial variables to include in the multiple regression model.
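The adjustment is Adjusted R2 = 1 − (1 − R2)(n − 1)/(n − k − 1), where n is the sample size and k is the number of independent variables. A sketch, checked against the Problem 8 output later in this chapter (R2 = .2171, n = 31, k = 1):

```python
def adjusted_r2(r2, n, k):
    """Penalize R^2 for the number of predictors k relative to sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from the simple regression output in Problem 8:
print(round(adjusted_r2(0.21709448, 31, 1), 4))  # 0.1901, matching the output

# A useless second predictor (same R^2, k = 2) lowers the adjusted value:
print(round(adjusted_r2(0.21709448, 31, 2), 4))
```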
6. Describe the differences, and advantages/disadvantages, of using best subsets regression, stepwise
regression, and examination of correlations in developing multiple regression models.
Best subsets and stepwise regression are both automated search methods for developing multiple
regression models. Best subsets regression considers every possible model, but it takes a while
to complete and leaves you with many candidate models to consider (and the PHStat version leaves
behind a large number of new worksheets). Stepwise regression builds just one model by adding or
removing one independent variable at each step, thus taking less time but perhaps missing the one
model you really would want. Note that both methods are dependent on the criterion used for model
selection.
Examination of correlations refers to manually examining the relationships between the independent
variables and the dependent variable, and also between independent variables, either to construct a
model manually or to better understand a regression model.
7. Find real data on daily changes in the S&P 500 (or the Dow Jones Industrial Average) and a stock of your
interest. Use regression analysis to estimate the beta risk of the stock.
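Beta is the slope of a regression of the stock's daily returns on the market's daily returns. A minimal sketch with made-up return series (the numbers below are hypothetical, not real S&P 500 or stock data, which would come from the sources named in the problem):

```python
def beta(stock_returns, market_returns):
    """Slope of the least-squares regression of stock returns on market returns."""
    n = len(market_returns)
    mx = sum(market_returns) / n
    my = sum(stock_returns) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(market_returns, stock_returns))
    var = sum((x - mx) ** 2 for x in market_returns)
    return cov / var

# Illustrative daily % changes (hypothetical data):
market = [0.5, -0.3, 0.8, -1.1, 0.2, 0.6, -0.4]
stock = [0.9, -0.5, 1.3, -1.8, 0.4, 1.0, -0.7]
print(round(beta(stock, market), 2))  # beta > 1: stock moves more than the market
```

A beta above 1 means the stock tends to amplify market moves; below 1, it dampens them.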
8. Construct a scatter diagram for Takeaways and Yards Allowed in the 2000 NFL Data.xls worksheet.
Does there appear to be a linear relationship? Develop a regression model for predicting Yards Allowed
as a function of Takeaways. Explain the statistical significance of the model.
[Scatter chart: Takeaways (x) vs. Yards Allowed (y)]
There is a negative relationship; that is, as takeaways increase, yards allowed tend to decrease.
However, the strength of that relationship is relatively weak (R2 = .22).
Sample correlation = -0.46593399

X Takeaways    Y Yards Allowed
30             3,813
49             3,967
31             4,546
37             5,249
18             5,701
31             4,820
44             5,544
41             4,636
22             5,357
41             4,800
25             5,494
35             4,743
35             4,820
35             4,713
28             5,069
42             5,033
33             4,474
29             4,426
38             5,656
30             4,845
29             5,293
29             6,391
21             5,709
25             5,329
20             5,234
23             5,353
25             5,607
21             5,487
25             5,643
20             5,737
22             4,959
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.46593399
R Square 0.21709448
Adjusted R Square 0.19009774
Standard Error 502.319706
Observations 31
ANOVA
df SS MS F Significance F
Regression 1 2029074 2029074 8.041507 0.008248
Residual 29 7317428 252325.1
Total 30 9346501
9. Construct a 95% confidence band chart for the model developed in Problem 8.
[Chart: fitted regression line with 95% confidence bands; Takeaways from 15 to 55 on the x-axis,
Yards Allowed from 3000 to 7000 on the y-axis]
Calculations:  n = 31                       =COUNT(x's)
               X-bar = 30.13                =AVERAGE(x's)
               Sum of (X - X-bar)^2 = 1,931.48   =DEVSQ(x's)
               Slope = -32.41               =SLOPE(y's,x's)
               Intercept = 6,087.76         =INTERCEPT(y's,x's)
               SYX = 502.32                 =STEYX(y's,x's)
               alpha = 0.05
               t = 2.05                     =TINV(alpha,n-2)
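The band columns in the table below can be reproduced directly from these quantities: h = 1/n + (x − x̄)²/Σ(x − x̄)², and the band is ŷ ± t·SYX·√h. A sketch for the first table row (x = 18), using the rounded values computed above, with t ≈ 2.045 for n − 2 = 29 degrees of freedom:

```python
# Confidence band for the mean response at Takeaways x = 18 (Problem 9).
n = 31
xbar = 30.13
ss_x = 1931.48           # sum of squared deviations of x
b0, b1 = 6087.76, -32.41 # intercept and slope from the calculations above
s_yx = 502.32            # standard error of the estimate
t = 2.045                # t critical value, alpha = .05, n - 2 = 29 df

x = 18
h = 1 / n + (x - xbar) ** 2 / ss_x          # leverage-style term
yhat = b0 + b1 * x                          # point on the regression line
margin = t * s_yx * h ** 0.5                # half-width of the band
print(round(h, 4), round(yhat, 1), round(yhat - margin, 1), round(yhat + margin, 1))
# h = 0.1084 and a band of roughly (5166, 5843), matching the first table row
```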
Takeaways  Yards      Squared        h     Regression  Lower Conf.  Upper Conf.
    X      Allowed Y  Deviations           Line        Band         Band
   18      5,701      147.1134    0.1084   5504.3498   5166.0629    5842.6367
   20      5,234      102.5973    0.0854   5439.5262   5139.3395    5739.7128
   20      5,737      102.5973    0.0854   5439.5262   5139.3395    5739.7128
   21      5,709       83.3392    0.0754   5407.1143   5125.0002    5689.2284
   21      5,487       83.3392    0.0754   5407.1143   5125.0002    5689.2284
   22      5,357       66.0812    0.0665   5374.7025   5109.8297    5639.5754
   22      4,959       66.0812    0.0665   5374.7025   5109.8297    5639.5754
   23      5,353       50.8231    0.0586   5342.2907   5093.6548    5590.9266
   25      5,494       26.3070    0.0459   5277.4671   5057.4151    5497.5190
   25      5,329       26.3070    0.0459   5277.4671   5057.4151    5497.5190
   25      5,607       26.3070    0.0459   5277.4671   5057.4151    5497.5190
   25      5,643       26.3070    0.0459   5277.4671   5057.4151    5497.5190
   28      5,069        4.5328    0.0346   5180.2316   4989.1184    5371.3449
   29      4,426        1.2747    0.0329   5147.8198   4961.4227    5334.2169
   29      5,293        1.2747    0.0329   5147.8198   4961.4227    5334.2169
   29      6,391        1.2747    0.0329   5147.8198   4961.4227    5334.2169
   30      3,813        0.0166    0.0323   5115.4080   4930.8642    5299.9518
   30      4,845        0.0166    0.0323   5115.4080   4930.8642    5299.9518
   31      4,546        0.7586    0.0327   5082.9962   4897.3571    5268.6352
   31      4,820        0.7586    0.0327   5082.9962   4897.3571    5268.6352
   33      4,474        8.2425    0.0365   5018.1725   4821.8273    5214.5177
   35      4,743       23.7263    0.0445   4953.3489   4736.5249    5170.1729
   35      4,820       23.7263    0.0445   4953.3489   4736.5249    5170.1729
   35      4,713       23.7263    0.0445   4953.3489   4736.5249    5170.1729
   37      5,249       47.2102    0.0567   4888.5253   4643.8918    5133.1587
   38      5,656       61.9521    0.0643   4856.1134   4595.5347    5116.6922
   41      4,636      118.1779    0.0934   4758.8780   4444.8300    5072.9259
   41      4,800      118.1779    0.0934   4758.8780   4444.8300    5072.9259
   42      5,033      140.9199    0.1052   4726.4662   4393.2192    5059.7131
   44      5,544      192.4037    0.1319   4661.6425   4288.5647    5034.7204
   49      3,967      356.1134    0.2166   4499.5834   4021.4131    4977.7538
Chapter 5: Regression Analysis
10. Develop simple linear regression models for predicting Games Won as a function of each of the
independent variables in the 2000 NFL Data.xls worksheet individually. Do the assumptions of linear
regression hold for your models? How do these models compare to the multiple regression model
developed in the chapter?
x = Yards Gained:   f(x) = 0.0023304x − 3.9866,   R² = .3618
x = Takeaways:      f(x) = 0.2288x + 1.1053,      R² = .3394
x = Giveaways:      f(x) = −0.2414x + 15.2736,    R² = .2933
x = Yards Allowed:  f(x) = −0.0028986x + 22.8155, R² = .2635
x = Points Scored:  f(x) = 0.0256667x − 0.4899,   R² = .4642
[Scatter charts of Games Won vs. each independent variable, with fitted trendlines]
As pointed out in the text, these variables satisfy the assumptions of linear regression.
These models display the same signs on the x coefficients (negative for giveaways and yards
allowed, positive for the others) as the multiple regression model developed in the text. For the
multiple regression model R2 = .77; the greatest R2 among the simple regression models is R2 = .46,
with points scored as the independent variable.
Team            Yards Gained  Takeaways  Giveaways  Yards Allowed  Points Scored  Games Won
Tennessee       5,350         30         30         3,813          346            13
Baltimore       5,014         49         26         3,967          333            12
New York Giants 6,376         31         24         4,546          328            12
Oakland         5,776         37         20         5,249          479            12
Minnesota       5,961         18         28         5,701          397            11
Philadelphia    5,006         31         29         4,820          351            11
Denver          6,567         44         25         5,544          485            11
Miami           4,461         41         26         4,636          323            11
Indianapolis    6,141         22         29         5,357          429            10
Tampa Bay       4,649         41         24         4,800          388            10
St. Louis       7,075         25         35         5,494          540            10
New Orleans     5,397         35         26         4,743          354            10
New York Jets   5,395         35         40         4,820          321            9
Pittsburgh      4,766         35         21         4,713          321            9
Green Bay       5,321         28         33         5,069          353            9
Detroit         4,422         42         31         5,033          307            9
Washington      5,396         33         33         4,474          281            8
Buffalo         5,498         29         23         4,426          315            8
Carolina        4,654         38         35         5,656          310            7
Jacksonville    5,690         30         29         4,845          367            7
Kansas City     5,614         29         26         5,293          355            7
Seattle         4,680         29         38         6,391          320            6
San Francisco   6,040         21         19         5,709          388            6
Dallas          4,475         25         39         5,329          294            5
Chicago         4,541         20         29         5,234          216            5
New England     4,571         23         25         5,353          276            5
Atlanta         3,994         25         34         5,607          252            4
Cincinnati      4,260         21         35         5,487          185            4
Cleveland       3,530         25         28         5,643          161            3
Arizona         4,528         20         44         5,737          210            3
San Diego       4,300         22         50         4,959          269            1
11. Data obtained from a County Auditor (see the file Market Value.xls) provides information about the age,
square footage, and current market value of houses along one street in a particular subdivision.
a. Construct scatter diagrams showing the relationship between market value as a function
of the age and size of the house, and add trendlines using the Add Trendline option in
Excel.
b. Develop simple linear regression models for estimating the market value as a function of
the age of the house and size of the house separately.
c. Develop a multiple linear regression model for estimating the market value as a function of
both the age and size of the house.
d. How do the models developed in parts b and c compare?
a., b.
x = House Age:   f(x) = 1570.43x + 45217.76,   R² = .1306
[Scatter chart: Market Value vs. House Age with trendline]
x = Square Feet:
[Scatter chart: Market Value vs. Square Feet with trendline]
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7454947764
R Square 0.5557624616
Adjusted R Square 0.5329810494
Standard Error 7211.8484974
Observations 42
ANOVA
df SS MS F
Regression 2 2537650171 1268825085 24.395435019
Residual 39 2028419591 52010758.749
Total 41 4566069762
d. R2 = .55 for the multiple regression model, only slightly better than the R2 for the simple
regression model using Square Feet as the independent variable.
Count  Street Address  House Age  Square Feet  Market Value
1      1357            33         1,812        $90,000.00
2      1358            32         1,914        $104,400.00
3      1361            32         1,842        $93,300.00
4      1362            33         1,812        $91,000.00
5      1365            32         1,836        $101,900.00
6      1366            33         2,028        $108,500.00
7      1369            32         1,732        $87,600.00
8      1370            33         1,850        $96,000.00
9      1373            32         1,791        $89,200.00
10 1374 33 1,666 $88,400.00
11 1377 32 1,852 $100,800.00
12 1378 32 1,620 $96,700.00
13 1381 32 1,692 $87,500.00
14 1382 32 2,372 $114,000.00
15 1385 32 2,372 $113,200.00
16 1386 33 1,666 $87,500.00
17 1389 32 2,123 $116,100.00
18 1390 32 1,620 $94,700.00
19 1393 32 1,731 $86,400.00
20 1394 32 1,666 $87,100.00
21 1405 28 1,520 $83,400.00
22 1406 27 1,484 $79,800.00
23 1409 28 1,588 $81,500.00
24 1410 28 1,598 $87,100.00
25 1413 28 1,484 $82,600.00
26 1414 28 1,484 $78,800.00
27 1417 28 1,520 $87,600.00
28 1418 27 1,701 $94,200.00
29 1421 28 1,484 $82,000.00
30 1425 28 1,468 $88,100.00
31 1426 28 1,520 $88,100.00
32 1429 27 1,520 $88,600.00
33 1430 27 1,484 $76,600.00
34 1434 28 1,520 $84,400.00
35 1438 27 1,668 $90,900.00
36 1442 28 1,588 $81,000.00
37 1446 28 1,784 $91,300.00
38 1450 27 1,484 $81,300.00
39 1453 27 1,520 $100,700.00
40 1454 28 1,520 $87,200.00
41 1457 27 1,684 $96,700.00
42 1458 27 1,581 $120,700.00
ANOVA Significance F = 1.344304E-07
12. Excel file TV Viewing.xls provides sample data on the number of hours of TV viewing per week for six age
groups.
a. Using all the data, develop a simple linear regression model for estimating TV viewing time
as a function of age.
b. Is a linear model appropriate? If not, propose an alternative model.
a.
[Scatter chart: TV Viewing Time (y) vs. Age (x) with linear trendline]
b. Although you can achieve a slightly higher R2 = .71 with a polynomial model (see below),
the linear model is appropriate for these data, and R2 = .69 is relatively high.
[Scatter chart: TV Viewing Time (y) vs. Age (x) with cubic trendline:
f(x) = −0.000373x³ + 0.06198x² − 2.10002x + 82.2271, R² = .7118]
Age  TV hours/week  X = Age  X^2   X^3
21   48             21       441   9261
21   47             21       441   9261
18   73             18       324   5832
23   65             23       529   12167
19   74             19       361   6859
19   50             19       361   6859
20   57             20       400   8000
24   64             24       576   13824
21   70             21       441   9261
23   51             23       529   12167
20   54             20       400   8000
21   63             21       441   9261
23   67             23       529   12167
24   75             24       576   13824
19   61             19       361   6859
24   51             24       576   13824
23   47             23       529   12167
18   76             18       324   5832
20   63             20       400   8000
19   72             19       361   6859
22   59             22       484   10648
18   57             18       324   5832
20   51             20       400   8000
24   62             24       576   13824
22   68             22       484   10648
20   46             20       400   8000
21   64             21       441   9261
20   69             20       400   8000
19   57             19       361   6859
24   57             24       576   13824
23   56             23       529   12167
22   62             22       484   10648
22   37             22       484   10648
20   69             20       400   8000
23   75             23       529   12167
23   52             23       529   12167
22   78             22       484   10648
23   63             23       529   12167
21   41             21       441   9261
19   50             19       361   6859
24   65             24       576   13824
18 62 18 324 5832
22 73 22 484 10648
21 50 21 441 9261
21 56 21 441 9261
30 61 30 900 27000
28 78 28 784 21952
26 72 26 676 17576
30 65 30 900 27000
34 73 34 1156 39304
34 69 34 1156 39304
29 54 29 841 24389
34 74 34 1156 39304
29 70 29 841 24389
30 57 30 900 27000
33 86 33 1089 35937
30 55 30 900 27000
30 64 30 900 27000
26 67 26 676 17576
32 71 32 1024 32768
33 57 33 1089 35937
27 71 27 729 19683
27 70 27 729 19683
29 87 29 841 24389
31 58 31 961 29791
34 62 34 1156 39304
33 91 33 1089 35937
29 63 29 841 24389
25 69 25 625 15625
28 79 28 784 21952
32 75 32 1024 32768
32 56 32 1024 32768
26 77 26 676 17576
30 86 30 900 27000
33 80 33 1089 35937
27 87 27 729 19683
25 67 25 625 15625
25 63 25 625 15625
26 70 26 676 17576
25 76 25 625 15625
26 70 26 676 17576
36 76 36 1296 46656
35 75 35 1225 42875
42 69 42 1764 74088
36 70 36 1296 46656
35 70 35 1225 42875
43 64 43 1849 79507
39 53 39 1521 59319
37 78 37 1369 50653
37 71 37 1369 50653
36 70 36 1296 46656
40 76 40 1600 64000
43 75 43 1849 79507
40 66 40 1600 64000
42 61 42 1764 74088
44 70 44 1936 85184
37 77 37 1369 50653
44 72 44 1936 85184
40 63 40 1600 64000
38 61 38 1444 54872
38 74 38 1444 54872
40 64 40 1600 64000
40 63 40 1600 64000
44 71 44 1936 85184
36 64 36 1296 46656
43 62 43 1849 79507
37 62 37 1369 50653
44 76 44 1936 85184
40 55 40 1600 64000
44 73 44 1936 85184
37 63 37 1369 50653
40 70 40 1600 64000
41 71 41 1681 68921
38 70 38 1444 54872
35 70 35 1225 42875
41 69 41 1681 68921
39 71 39 1521 59319
40 54 40 1600 64000
44 65 44 1936 85184
38 62 38 1444 54872
40 61 40 1600 64000
38 62 38 1444 54872
36 62 36 1296 46656
49 95 49 2401 117649
54 80 54 2916 157464
52 105 52 2704 140608
48 83 48 2304 110592
49 89 49 2401 117649
51 90 51 2601 132651
49 72 49 2401 117649
51 94 51 2601 132651
48 72 48 2304 110592
45 103 45 2025 91125
50 99 50 2500 125000
50 84 50 2500 125000
50 93 50 2500 125000
47 108 47 2209 103823
47 82 47 2209 103823
54 88 54 2916 157464
54 93 54 2916 157464
53 82 53 2809 148877
52 106 52 2704 140608
46 89 46 2116 97336
51 110 51 2601 132651
49 87 49 2401 117649
51 94 51 2601 132651
46 76 46 2116 97336
49 74 49 2401 117649
46 90 46 2116 97336
52 83 52 2704 140608
45 91 45 2025 91125
46 98 46 2116 97336
47 71 47 2209 103823
53 81 53 2809 148877
54 73 54 2916 157464
48 87 48 2304 110592
52 106 52 2704 140608
63 103 63 3969 250047
61 110 61 3721 226981
55 99 55 3025 166375
56 109 56 3136 175616
57 93 57 3249 185193
62 95 62 3844 238328
61 116 61 3721 226981
56 103 56 3136 175616
56 93 56 3136 175616
59 91 59 3481 205379
59 74 59 3481 205379
61 81 61 3721 226981
60 74 60 3600 216000
60 95 60 3600 216000
56 111 56 3136 175616
57 108 57 3249 185193
59 88 59 3481 205379
58 92 58 3364 195112
59 101 59 3481 205379
64 90 64 4096 262144
64 109 64 4096 262144
62 87 62 3844 238328
59 92 59 3481 205379
56 86 56 3136 175616
60 135 60 3600 216000
61 87 61 3721 226981
55 90 55 3025 166375
63 90 63 3969 250047
59 80 59 3481 205379
79 117 79 6241 493039
80 113 80 6400 512000
78 117 78 6084 474552
76 124 76 5776 438976
70 126 70 4900 343000
67 120 67 4489 300763
76 119 76 5776 438976
69 116 69 4761 328509
74 113 74 5476 405224
71 116 71 5041 357911
68 113 68 4624 314432
77 110 77 5929 456533
67 128 67 4489 300763
73 107 73 5329 389017
79 110 79 6241 493039
71 116 71 5041 357911
74 123 74 5476 405224
80 124 80 6400 512000
77 113 77 5929 456533
75 116 75 5625 421875
SUMMARY OUTPUT (cubic model)
Regression Statistics
Multiple R        0.843659
R Square          0.711760
Adjusted R Square 0.707479
Standard Error    10.95873
Observations      206

ANOVA
            df    SS        MS        F         Significance F
Regression    3   59903.37  19967.79  166.2683  2.62E-54
Residual    202   24258.95  120.0938
Total       205   84162.32

                       Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%
Intercept              82.22709      14.24763        5.771284  2.93E-08  54.13398   110.3202
X Variable 1 (Age)     -2.10002      1.067794        -1.96669  0.050588  -4.20547   0.005427
X Variable 2 (Age^2)   0.06198       0.024179        2.563436  0.011091  0.014306   0.109655
X Variable 3 (Age^3)   -0.00037      0.000169        -2.21406  0.027943  -0.00071   -4.08E-05
13. A deep foundation engineering contractor has bid on a foundation system for a new world headquarters
building for a Fortune 500 company. A part of the project consists of installing 311 augercast piles. The
contractor was given bid information for cost estimating purposes, which consisted of the estimated
depth of each pile; however, actual drill footage of each pile could not be determined exactly until
construction was performed. The Excel file Pile Foundation.xls contains the estimates and actual pile
lengths after the project was completed.
a. Develop a linear regression model to estimate the actual pile length as a function of the
estimated pile lengths. What do you conclude?
Note: The decimal point was missing for pile number 225.
[Scatter chart: Actual vs. Estimate with trendline f(x) = 0.8142x + 11.6144, R² = .6351;
Estimate from 0 to 70 on the x-axis, Actual from 0 to 100 on the y-axis]
The estimate explains about 63% of the variation in the actual drill footage (R2 = .635). Note that this
measures explained variation, not the accuracy of the individual estimates.
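The fitted model can be used to point-predict the actual length from a bid estimate (a sketch using the rounded coefficients from the regression output below; individual predictions carry the scatter implied by the standard error of about 9.9 ft):

```python
# Point prediction of actual pile length from the bid estimate (Problem 13 fit).
b0, b1 = 11.6144, 0.8142   # intercept and slope from the regression output

def predict_actual(estimate_ft):
    """Predicted actual drill footage for a given estimated depth."""
    return b0 + b1 * estimate_ft

print(round(predict_actual(40.0), 1))  # an estimate of 40 ft predicts about 44.2 ft actual
```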
SUMMARY OUTPUT
Regression Statistics
Multiple R        0.796928
R Square          0.635095
Adjusted R Square 0.633914
Standard Error    9.886816
Observations      311

ANOVA
            df    SS        MS
Regression    1   52569.02  52569.02
Residual    309   30204.48  97.74913
Total       310   82773.5

            Coefficients  Standard Error  t Stat
Intercept   11.61438      1.137095        10.21408
X Variable  0.814189      0.035109        23.19041

Pile      X            Y
Number    Estimated    Actual
1         10.58        18.58
2         10.58        18.58
3         10.58        18.58
4         10.58        18.58
5         10.58        28.58
6         10.58        26.58
7         10.58        17.58
8         10.58        27.58
9         10.58        27.58
10        10.58        37.58
11        10.58        28.58
12        5.83         1.83
13        5.83         8.83
14        5.83         8.83
15        5.83         8.83
16        10.83        16.83
17        10.83        19.83
18        10.83        18.33
19 10.83 14.83
20 21.33 12
21 16.33 27
22 21.33 42
23 21.33 46
24 21.33 46
25 31.33 57
26 41.08 55.75
27 41.08 55.75
28 41.08 55.75
29 46.33 57
30 61.33 77
31 61.33 76
32 61.33 75
33 61.33 75
34 61.33 36
35 61.33 56.42
36 61.33 45
37 61.33 45.42
38 61.33 37
39 11.33 8
40 5.83 2.83
41 5.83 2.83
42 5.83 2.83
43 5.83 2.83
44 15.83 14.83
45 15.83 16.83
46 15.83 16.83
47 15.83 16.83
48 11.33 11
49 21.08 42.75
50 21.08 43.17
51 21.08 43.17
52 21.08 42.75
53 21.08 42.75
54 41.08 31.75
55 41.08 31.75
56 41.08 31.75
57 41.08 36.75
58 41.08 44.75
59 61.08 36.75
60 61.08 55.75
61 61.08 40.17
62 61.08 56.75
63 61.33 26.42
64 61.33 66.00
65 11.33 13.33
66 20.08 38.75
67 20.08 38.75
68 20.08 38.75
69 20.08 38.75
70 20.08 38.75
71 20.05 38.75
72 40.08 54.75
73 40.08 25.75
74 40.08 25.75
75 40.08 54.75
76 40.08 23.75
77 40.08 73.08
78 60.08 57.75
79 60.08 56.75
80 60.08 57.75
81 60.08 58.17
82 61.33 62.00
83 8.42 9.09
84 20.58 16.25
85 20.58 16.25
86 20.55 16.25
87 20.58 16.25
88 20.58 16.25
89 20.55 16.25
90 23.50 29.17
91 20.52 21.19
92 20.52 20.19
93 20.52 20.19
94 20.52 36.19
95 20.52 26.19
96 20.52 22.19
97 38.42 29.51
98 35.83 26.51
99 35.83 27.09
100 35.83 34.51
101 35.83 37.09
102 35.83 31.51
103 34.75 55.42
104 34.75 28.84
105 4.75 14.42
106 19.75 22.42
107 34.75 42.42
108 34.75 55.42
109 4.75 14.42
110 17.25 21.92
111 17.25 21.92
112 17.25 21.92
113 17.25 21.92
114 17.25 21.92
115 17.25 21.92
116 19.75 22.42
117 28.42 32.09
118 28.42 32.09
119 28.42 32.09
120 28.42 32.09
121 28.42 32.09
122 28.42 32.09
123 35.92 46.59
124 40.53 55.51
125 40.83 56.51
126 40.83 67.09
127 40.83 57.09
128 40.83 67.09
129 53.42 55.09
130 50.83 50.34
131 50.83 50.34
132 50.83 45.34
133 50.83 47.76
134 53.42 41.51
135 53.42 41.09
136 5.58 19.25
137 5.58 19.25
138 5.58 19.25
139 6.58 19.25
140 5.58 19.25
141 5.58 19.25
142 20.58 28.25
143 20.58 28.25
144 20.55 28.25
145 20.58 28.25
146 20.58 28.25
147 20.58 28.25
148 28.75 32.42
149 28.75 32.42
150 28.75 32.42
151 28.75 32.42
152 28.75 32.42
153 28.75 32.42
154 39.25 56.92
155 39.25 60.92
156 39.25 59.92
157 39.25 59.34
158 39.25 57.92
159 49.75 47.42
160 49.75 48.42
161 49.75 48.42
162 49.75 48.42
163 53.42 45.09
164 13.42 22.09
165 6.97 26.64
166 6.97 26.64
167 6.97 26.64
168 8.42 22.09
169 21.67 29.34
170 21.67 29.34
171 21.67 29.34
172 28.42 37.09
173 31.67 34.34
174 31.67 35.34
175 31.67 29.34
176 38.42 59.09
177 41.67 61.34
178 41.67 60.34
179 41.67 61.34
180 53.42 49.09
181 13.42 11.59
182 13.42 14.09
183 10.83 26.50
184 10.83 24.50
185 10.83 33.92
186 10.83 32.92
187 10.83 16.09
188 15.42 18.51
189 15.42 28.09
190 15.42 29.09
191 15.42 29.09
192 9.42 14.09
193 9.42 14.09
194 9.42 14.09
195 9.42 14.09
196 9.42 14.09
197 10.58 23.25
198 10.58 23.25
199 10.55 23.25
200 10.58 23.25
201 10.58 23.25
202 10.55 26.25
203 20.55 29.25
204 20.55 31.25
205 20.58 30.25
206 20.55 31.25
207 20.58 31.25
208 20.55 32.25
209 30.58 31.25
210 30.55 32.25
211 30.58 31.25
212 30.58 32.25
213 30.58 31.25
214 30.55 29.25
215 40.83 61.50
216 40.83 61.50
217 40.83 58.50
218 40.83 57.92
219 40.83 57.50
220 50.83 51.50
221 50.83 51.50
222 50.83 49.50
223 50.83 48.50
224 63.42 49.09
225 53.42 52.09
226 22.58 23.25
227 22.58 21.25
228 20.83 22.50
229 20.83 22.50
230 20.83 27.50
231 20.83 25.92
232 22.68 37.25
233 20.83 22.92
234 20.83 26.92
235 20.83 21.50
236 20.83 22.50
237 24.83 33.50
238 24.83 33.50
239 24.83 31.50
240 24.83 32.50
241 10.83 14.50
242 10.83 17.50
243 10.83 14.50
244 10.83 14.50
245 11.67 17.50
246 11.67 17.50
247 11.67 17.50
248 11.67 17.50
249 20.83 26.50
250 20.83 31.50
251 20.83 28.50
252 20.83 26.50
253 25.83 34.50
254 25.83 34.50
255 25.83 34.50
256 25.83 34.50
257 30.83 29.92
258 30.83 30.92
259 30.83 31.50
260 30.83 34.50
261 40.83 66.50
262 40.83 69.50
263 40.83 68.50
264 40.53 66.50
265 50.83 53.34
266 50.83 54.34
267 50.83 61.76
268 50.83 51.34
269 53.42 52.09
270 53.42 52.09
271 53.42 52.09
272 22.58 38.25
273 22.58 43.25
274 22.58 50.25
275 20.83 29.50
276 20.83 29.50
277 20.83 38
278 25.42 31.69
279 26.67 27.34
280 26.67 32.34
281 26.67 26.76
282 23.42 10.51
283 11.67 18.34
284 11.67 18.34
285 11.67 18.34
286 13.42 8.09
287 11.67 22.34
288 11.67 21.34
289 11.67 21.34
290 13.42 16.09
291 21.67 29.34
292 21.67 30.34
293 21.67 30.34
294 28.42 31.09
295 26.67 24.34
296 26.67 35.34
297 26.67 35.34
298 28.42 37.09
299 31.67 37.34
300 31.67 37.34
301 31.67 39.76
302 38.42 63.09
303 41.67 67.34
304 41.67 67.34
305 41.67 67.34
306 53.42 64.09
307 51.67 51.34
308 51.67 50.76
309 51.67 46.34
310 53.42 50.09
311 53.42 49.09
ANOVA (continued): F = 537.7953, Significance F = 1.29E-69
14. The file 1999 Baseball Data.xls contains data for professional baseball teams for 1999, including their
total payroll, winning percentage, batting average, home runs, runs, runs batted in, earned run average,
and pitching saves.
a. Develop a multiple regression equation for predicting the winning percentage as a function
of all the other variables. How good is your model? Is multicollinearity a problem?
b. Find the best set of independent variables that predict the winning percentage by
examining the correlation matrix.
c. Find the best set of independent variables that predict the winning percentage using best
subsets regression.
d. Find the best set of independent variables that predict the winning percentage using
stepwise regression.
a. SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9758640096
R Square 0.9523105651
Adjusted R Square 0.9371366541
Standard Error 0.0191261557
Observations 30
ANOVA
df SS MS F
Regression 7 0.1607068837 0.0229581262 62.759730132
Residual 22 0.0080478163 0.0003658098
Total 29 0.1687547
b. Correlation Matrix
Select the four independent variables that have the highest correlation with the dependent
variable, leaving out runs because of multicollinearity.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9749021651
R Square 0.9504342315
Adjusted R Square 0.9425037086
Standard Error 0.0182914804
Observations 30
ANOVA
df SS MS F
Regression 4 0.1603902436 0.0400975609 119.84508927
Residual 25 0.0083644564 0.0003345783
Total 29 0.1687547
Using RBIs, ERA, Saves, and Payroll yields R2 = .95 and Adjusted R2 = .94.
Note that the last model, with the highest adjusted R2, is the model picked in part b.
Saves entered.
RBI entered.
ERA entered.
Payroll entered.
[Correlation matrix output (fragment): CORREL values 1, 0.9966714544, -0.061914571, 0.3879442345]
Significance F = 6.305265E-16
[Best subsets regression output: each candidate model flagged "Consider This Model? Yes"]
       Pitching   Y          X
RBI   ERA   Saves  Winning %  Payroll
673   4.79  37     0.432      $51,340,297
865   3.77  42     0.617      $70,046,818
791   3.65  45     0.636      $79,256,599
804   4.77  33     0.481      $75,443,363
808   4.00  50     0.580      $72,330,656
742   4.92  39     0.466      $24,535,000
717   5.27  32     0.414      $55,419,648
820   3.99  55     0.589      $38,031,285
960   4.91  46     0.599      $73,531,692
863   6.03  33     0.444      $54,367,504
704   5.22  33     0.429      $36,954,666
655   4.90  33     0.395      $14,650,000
784   3.84  48     0.599      $56,389,000
800   5.35  29     0.398      $16,557,000
761   4.45  37     0.475      $76,607,247
777   5.08  40     0.460      $42,976,575
643   5.03  34     0.394      $15,845,000
680   4.69  44     0.420      $15,015,250
855   4.16  50     0.605      $91,990,955
814   4.27  49     0.595      $71,510,523
845   4.76  48     0.537      $25,208,858
797   4.93  32     0.475      $30,441,500
735   4.35  34     0.484      $23,682,420
671   4.47  43     0.457      $46,507,179
825   5.25  40     0.488      $45,351,254
828   4.71  42     0.531      $45,991,934
763   4.76  38     0.466      $46,337,129
728   5.06  45     0.426      $37,860,451
897   5.07  47     0.586      $80,801,598
856   4.93  39     0.519      $48,847,300

SUMMARY OUTPUT (simple regression of Winning % on Payroll)
Regression Statistics
Multiple R        0.682823
R Square          0.466247
Adjusted R Square 0.447185
Standard Error    0.056718
Observations      30

ANOVA
            df   SS        MS        F         Significance F
Regression   1   0.078681  0.078681  24.45875  3.22E-05
Residual    28   0.090073  0.003217
Total       29   0.168755

            Coefficients
Intercept   0.386468
X Variable  2.32E-09
15. The State of Ohio Department of Education has a mandated 9th grade proficiency test that covers
writing, reading, mathematics, citizenship (social studies), and science. The Excel file Ohio Education
Performance.xls provides data on success rates (defined as the percentage of students passing) in
school districts in the Greater Cincinnati metropolitan area along with state averages.
a. Develop a multiple regression model to predict math success as a function of success in
all other subjects. Is multicollinearity a problem?
b. Develop the best regression model to predict math success as a function of success in
the other subjects by examining the correlation matrix.
c. Develop the best regression model to predict math success as a function of success in
the other subjects using best subsets regression.
d. Develop the best regression model to predict math success as a function of success in
the other subjects using stepwise regression.
a. SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9393726812
R Square 0.8824210342
Adjusted R Square 0.8643319625
Standard Error 5.5555151986
Observations 31
ANOVA
df SS MS F
Regression 4 6022.3812325 1505.5953081 48.78199671
Residual 26 802.45747717 30.863749122
Total 30 6824.8387097
It appears that there will be a multicollinearity problem (see correlation matrix below).
Due to multicollinearity, select only the highest correlation (Science has extremely high
correlation with math) and use simple regression.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9353020802
R Square 0.8747899812
Adjusted R Square 0.8704723943
Standard Error 5.4283362007
Observations 31
ANOVA
df SS MS F
Regression 1 5970.3005263 5970.3005263 202.61085887
Residual 29 854.53818334 29.466833908
Total 30 6824.8387097
The model with science and reading as independent variables (X2, X4) has a slightly higher
adjusted R2 value than the simple regression model with just science.
Science entered.
Stepwise regression agrees with the simple regression model using science as the
independent variable.
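Stepwise add-ins automate a greedy search over predictors. As a rough sketch (not the add-in's exact algorithm), forward selection on adjusted R² can be written as:

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 of an OLS fit of y on the columns of X (intercept added)."""
    n, k = len(y), X.shape[1]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = (y - y.mean()) @ (y - y.mean())
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

def forward_stepwise(X, y, names):
    """Greedy forward selection: add the predictor that most improves
    adjusted R^2; stop when no remaining predictor helps."""
    chosen, best = [], -np.inf
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            score = adj_r2(X[:, chosen + [j]], y)
            if score > best:
                best, pick, improved = score, j, True
        if improved:
            chosen.append(pick)
    return [names[j] for j in chosen], best

# Tiny demonstration with synthetic data (x0 drives y; x1 is noise):
rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=30), rng.normal(size=30)
y = 3 * x0 + 1 + 0.01 * rng.normal(size=30)
selected, score = forward_stepwise(np.column_stack([x0, x1]), y, ["x0", "x1"])
```

Real stepwise procedures usually enter and remove variables on F- or t-test thresholds rather than adjusted R², but the greedy structure is the same.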
School District Math Writing Reading Citizenship Science All
Indian Hill 89 95 98 95 91 83
Wyoming 86 98 96 93 87 81
Mason Cit 85 96 92 94 86 72
Madiera 88 94 95 82 88 69
Mariemont 74 99 92 89 88 68
Sycamore 80 85 88 87 84 68
Forest Hill 73 93 91 85 88 67
Kings Loca 78 92 86 82 78 64
Lakota 73 90 88 85 81 64
Loveland 72 85 93 86 86 61
Southwest 73 92 82 78 73 58
Fairfield 71 90 86 83 77 57
Oak Hills 75 88 86 77 79 57
Three Rive 66 87 85 84 77 56
Milford 72 82 86 82 76 53
Ross 66 84 85 75 78 52
West Cler 63 88 83 73 70 48
Reading 58 88 80 76 75 46
Princeton 59 83 75 76 63 46
Finneytow 61 79 71 67 62 45
Norwood 64 86 77 75 67 44
Lockland 52 88 79 82 64 41
Franklin Ci 49 85 79 70 67 40
Winton Wo 55 82 77 65 59 40
Northwest 51 75 74 62 61 38
North Colle 50 77 76 66 57 35
Mount Heal 40 87 72 62 53 32
Felicity Fr 52 52 64 81 64 28
St. Bernar 40 81 59 41 48 26
Deer Park 40 69 66 43 52 25
Cincinnati 35 63 59 50 44 23
Best subsets regression flagged eight candidate models with "Consider This Model?" = Yes.
Chapter 5: Regression Analysis
16. A national homebuilder builds single-family homes and condominium style townhouses. The Excel file
House Sales Data.xls provides information on the selling price, lot cost, type of home, and region of the
country (M = Midwest, S = South) for closings during one month.
a. Develop a multiple regression model for sales price as a function of lot cost, region of
country, and type of home.
b. Determine if any interactions exist between sales price, region, and type of home.
Note: set up zero-one variables for Region (M=1,S=0) and Type (SF=1, T=0).
a. SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8081105397
R Square 0.6530426444
Adjusted R Square 0.6466958635
Standard Error 48853.475443
Observations 168
ANOVA
df SS MS F
Regression 3 7.36716E+11 2.45572E+11 102.89352274
Residual 164 3.91413E+11 2386662063
Total 167 1.12813E+12
b. Correlation Matrix.
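The zero-one coding in the note (and the interaction check in part b) amounts to adding product columns to the design matrix. A minimal sketch with invented closing records — only the coding scheme (M = 1/S = 0, SF = 1/T = 0) comes from the problem:

```python
import numpy as np

# Hypothetical closing records: (lot cost, region, home type) -> selling price.
lot_cost = np.array([25000., 32000., 21000., 40000., 28000., 35000.])
region_m = np.array([1., 1., 0., 0., 1., 0.])   # 1 = Midwest, 0 = South
type_sf  = np.array([1., 0., 1., 1., 0., 0.])   # 1 = single-family, 0 = townhouse
price    = np.array([180000., 195000., 150000., 240000., 170000., 205000.])

# Design matrix with an interaction column (region x type); a significant
# coefficient on it would mean the region effect differs by home type.
X = np.column_stack([
    np.ones_like(price),   # intercept
    lot_cost,
    region_m,
    type_sf,
    region_m * type_sf,    # interaction term
])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
```

A large t statistic on the interaction column would indicate that the Midwest price premium differs between single-family homes and townhouses.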
17. The Excel file Salary Data.xls provides information on current salary, beginning salary, previous
experience when hired in months, and total years of education for a sample of 100 employees in a firm.
a. Develop a multiple regression model for predicting current salary as a function of the other
variables.
b. Find the best model for predicting current salary.
a. SUMMARY OUTPUT
Regression Statistics
Multiple R 0.896159669
R Square 0.8031021524
Adjusted R Square 0.7969490947
Standard Error 7790.8746383
Observations 100
ANOVA
df SS MS
Regression 3 23766951872 7922317291
Residual 96 5826981852 60697727.63
Total 99 29593933724
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8939024564
R Square 0.7990616015
Adjusted R Square 0.7949185417
Standard Error 7829.7329472
Observations 100
ANOVA
df SS MS F Significance F
Regression 2 23647376076 11823688038 192.8675 1.58E-34
Residual 97 5946557648 61304718.024
Total 99 29593933724
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
-6276.57 3937.368 -1.5941 0.114166 -14091.2 1538.011 -14091.2 1538.011
1.69044 0.110771 15.26066 1.58E-27 1.47059 1.91029 1.47059 1.91029
852.9109 340.2605 2.506641 0.013851 177.5884 1528.233 177.5884 1528.233
18. The Excel file Cereal Data.xls provides a variety of nutritional information about 67 cereals and their shelf
location in a supermarket. Use regression analysis to determine if a relationship exists between calories
and the other variables. Investigate the model assumptions and clearly explain your conclusions.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8674241518
R Square 0.7524246591
Adjusted R Square 0.7321315983
Standard Error 9.7154511857
Observations 67
ANOVA
df SS MS F
Regression 5 17498.926922 3499.7853843 37.077928706
Residual 61 5757.7894962 94.389991742
Total 66 23256.716418
Only Carbs and Sugars are significant in the above model; create a model with only Carbs
and Sugars as independent variables.
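The "keep only the significant variables" step relies on the coefficient t-tests. A sketch that computes OLS p-values by hand, on data simulated in the spirit of the cereal file (not the actual values):

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS fit plus two-sided t-test p-values; X must already include an
    intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (n - k)                      # residual variance
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    t = beta / se
    return beta, 2 * stats.t.sf(np.abs(t), df=n - k)

# Simulated stand-in for the cereal data: Shelf has no real effect.
rng = np.random.default_rng(1)
n = 67
carbs = rng.uniform(5, 24, n)
sugars = rng.uniform(0, 15, n)
shelf = rng.integers(1, 4, n).astype(float)
calories = 29 + 3.4 * carbs + 3.9 * sugars + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), carbs, sugars, shelf])
beta, p = ols_pvalues(X, calories)
keep = p < 0.05   # refit using only the columns that pass the t-test
```

Carbs and Sugars come out highly significant while the irrelevant Shelf column generally does not, mirroring the refit done here.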
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8519372382
R Square 0.7257970578
Adjusted R Square 0.7172282158
Standard Error 9.9820620902
Observations 67
ANOVA
df SS MS F
Regression 2 16879.656349 8439.8281746 84.701884153
Residual 64 6377.0600687 99.641563573
Total 66 23256.716418
[Sugars residual plot: residuals scatter between about -40 and 40 with no visible pattern across Sugars 0-16.]
The residual plots do not exhibit any pattern. The correlation between Carbs and Sugars
is -.47.
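The "no pattern" judgment can be supplemented with a quick numeric check — correlating the residuals with a predictor and with its square to look for missed curvature. Simulated data, for illustration only:

```python
import numpy as np

# Hypothetical "sugars -> calories" data with a genuinely linear relationship.
rng = np.random.default_rng(2)
x = rng.uniform(0, 15, 67)
y = 29 + 3.9 * x + rng.normal(0, 10, 67)

# Fit the simple regression and compute residuals.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

r_linear = np.corrcoef(resid, x)[0, 1]       # ~0 by construction of OLS
r_curve  = np.corrcoef(resid, x ** 2)[0, 1]  # should be small if a line is adequate
```

A sizable correlation between the residuals and x² would suggest the linear model missed curvature; here both checks stay near zero.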
Calories Sodium Fiber Carbs Sugars Shelf
70 130 10 5 6 3
70 260 9 7 5 3
50 140 14 8 0 3
110 200 1 14 8 3
110 180 1.50 10.50 10 1
110 125 1 11 14 2
130 210 2 18 8 3
90 200 4 15 6 1
90 210 5 13 5 3
120 220 0 12 12 2
110 290 2 17 1 1
120 210 0 13 9 2
110 140 2 13 7 3
110 180 0 12 13 2
110 280 0 22 3 1
100 290 1 21 2 1
110 90 1 13 12 2
110 180 0 12 13 2
110 140 4 10 7 3
100 80 1 21 0 2
110 220 1 21 3 3
100 140 2 11 10 3
100 190 1 18 5 3
110 125 1 11 13 2
110 200 1 14 11 1
100 0 3 14 7 2
120 160 5 12 10 3
120 240 5 14 12 3
110 135 0 13 12 2
110 280 0 15 9 2
100 140 3 15 5 3
110 170 3 17 3 3
120 75 3 13 4 3
110 180 0 14 11 1
120 220 1 12 11 2
110 250 1.50 11.50 10 1
110 170 1 17 6 3
140 170 2 20 9 3
110 260 0 21 3 2
100 150 2 12 6 2
110 180 0 12 12 2
160 150 3 17 13 3
100 220 2 15 6 1
120 190 0 15 9 2
130 170 1.50 13.50 10 3
120 200 6 11 14 3
100 320 1 20 3 3
50 0 0 13 0 3
50 0 1 10 0 3
100 135 2 14 6 3
120 210 5 14 12 2
90 0 2 15 6 3
110 240 0 23 2 1
110 290 0 22 3 1
90 0 3 20 0 1
80 0 3 16 0 1
90 0 4 19 0 1
110 70 1 9 15 2
110 230 1 16 3 1
90 15 3 15 5 2
110 200 0 21 3 3
140 190 4 15 14 3
100 200 3 16 3 3
110 140 0 13 12 2
100 230 3 17 3 1
100 200 3 17 3 1
110 200 1 18 8 1
[Carbs residual plot: residuals scatter with no visible pattern across Carbs 4-24.]
Full-model ANOVA with Significance F, and coefficients:
df SS MS F Significance F
Regression 5 17498.93 3499.785 37.07793 2.87E-17
Residual 61 5757.789 94.38999
Total 66 23256.72
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 16.50551 9.495272 1.738287 0.087206 -2.48146 35.49249 -2.48146 35.49249
Sodium 0.027295 0.01619 1.68592 0.09692 -0.00508 0.05967 -0.00508 0.05967
Fiber 0.429737 0.610575 0.703823 0.484221 -0.79118 1.650656 -0.79118 1.650656
Carbs 3.468359 0.457335 7.583856 2.29E-10 2.553862 4.382856 2.553862 4.382856
Sugars 3.878465 0.361011 10.74334 1.07E-15 3.156578 4.600351 3.156578 4.600351
Shelf 2.454392 1.555381 1.578001 0.119738 -0.65579 5.564569 -0.65579 5.564569
Carbs-and-Sugars model ANOVA with Significance F, and coefficients:
df SS MS F Significance F
Regression 2 16879.66 8439.828 84.70188 1.04E-18
Residual 64 6377.06 99.64156
Total 66 23256.72
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 28.7602 6.745266 4.26376 6.75E-05 15.28499 42.2354 15.28499 42.2354
Carbs 3.358327 0.36016 9.32453 1.54E-13 2.638824 4.077829 2.638824 4.077829
Sugars 3.905585 0.315296 12.38704 1.18E-18 3.275709 4.53546 3.275709 4.53546
Observation Predicted Calories Residuals Percentile Calories
1 68.98534 1.014662 0.746269 50
2 71.79641 -1.79641 2.238806 50
3 55.62681 -5.62681 3.731343 50
4 107.0214 2.978551 5.223881 70
5 103.0785 6.921526 6.716418 70
6 120.38 -10.38 8.208955 80
7 120.4548 9.545243 9.701493 90
8 102.5686 -12.5686 11.19403 90
9 91.94637 -1.94637 12.68657 90
10 115.9271 4.072866 14.1791 90
11 89.75734 20.24266 15.67164 90
12 107.5687 12.43129 17.16418 90
13 99.75754 10.24246 18.65672 100
14 119.8327 -9.83272 20.14925 100
15 114.3601 -4.36014 21.64179 100
16 107.0962 -7.09623 23.13433 100
17 119.2855 -9.28546 24.62687 100
18 119.8327 -9.83272 26.1194 100
19 89.68256 20.31744 27.61194 100
20 99.28506 0.714939 29.10448 100
21 111.0018 -1.00181 30.59701 100
22 104.7576 -4.75764 32.08955 100
23 108.738 -8.738 33.58209 100
24 116.4744 -6.47439 35.07463 100
25 118.7382 -8.7382 36.56716 100
26 103.1159 -3.11586 38.0597 110
27 108.116 11.88404 39.55224 110
28 122.6438 -2.64379 41.04478 110
29 119.2855 -9.28546 42.53731 110
30 114.2854 -4.28536 44.02985 110
31 98.66302 1.336978 45.52239 110
32 97.56851 12.43149 47.01493 110
33 88.04078 31.95922 48.50746 110
34 118.7382 -8.7382 50 110
35 112.0215 7.978451 51.49254 110
36 106.4368 3.563199 52.98507 110
37 109.2853 0.714739 54.47761 110
38 131.077 8.923005 55.97015 110
39 111.0018 -1.00181 57.46269 110
40 92.49363 7.506374 58.95522 110
41 115.9271 -5.92713 60.44776 110
42 136.6244 23.37565 61.9403 110
43 102.5686 -2.56861 63.43284 110
44 114.2854 5.714639 64.92537 110
45 113.1535 16.84655 66.41791 110
46 120.38 -0.37998 67.91045 110
47 107.6435 -7.64349 69.40299 110
48 72.41845 -22.4184 70.89552 110
49 62.34346 -12.3435 72.38806 110
50 99.21028 0.78972 73.8806 110
51 122.6438 -2.64379 75.37313 110
52 102.5686 -12.5686 76.86567 110
53 113.8129 -3.81288 78.35821 110
54 114.3601 -4.36014 79.85075 120
55 95.92673 -5.92673 81.34328 120
56 82.49343 -2.49343 82.83582 120
57 92.56841 -2.56841 84.32836 120
58 117.5689 -7.56891 85.8209 120
59 94.21018 15.78982 87.31343 120
60 98.66302 -8.66302 88.80597 120
61 111.0018 -1.00181 90.29851 120
62 133.8133 6.186716 91.79104 120
63 94.21018 5.78982 93.28358 130
64 119.2855 -9.28546 94.77612 130
65 97.56851 2.431493 96.26866 140
66 97.56851 2.431493 97.76119 140
67 120.4548 -10.4548 99.25373 160
[Normal probability plot: Calories against sample percentile.]
19. The Excel file Infant Mortality.xls provides data on infant mortality rate (deaths per 1000 births), female
literacy (percentage who read), and population density (people per square kilometer) for 85 countries.
Develop simple and multiple regression models for the relationship between mortality, population density,
and literacy. Explain all statistical output.
[Scatterplots of Infant Mortality vs. Density, with fitted trendlines
f(x) = -0.0093x + 53.35 (R² = 0.034) and f(x) = -0.0201x + 54.14 (R² = 0.002).]
Any way you look at it, there is no statistical relationship between Infant
Mortality and Density (very low R2).
[Scatterplot of Infant Mortality vs. Literacy: f(x) = -1.1290x + 127.20, R² = 0.711.]
Multiple Regression: dependent variable: Infant Mortality, independent variables: Density and Literacy
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8586805697
R Square 0.7373323208
Adjusted R Square 0.730925792
Standard Error 19.863456131
Observations 85
ANOVA
df SS MS F Significance F
Regression 2 90819.711535 45409.855767 115.09076886 1.569178E-24
Residual 82 32353.664936 394.55688946
Total 84 123173.37647
Both Density and Literacy are significant in this model (P-values of .006 and 6.74E-25),
however R2 = .74 for this model is only a small increase over R2 = .71 for simple
regression with Literacy.
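The R² comparison between the simple and multiple models can be reproduced with a short calculation; the rows below are a small subset of the country data listed with this problem.

```python
import numpy as np

def r_squared(X, y):
    """R^2 for an OLS fit of y on the columns of X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# Ten (literacy, density, mortality) rows quoted from the Infant Mortality data.
literacy  = np.array([100., 16., 50., 71., 97., 14., 88., 93., 39., 99.])
density   = np.array([2.3, 2.4, 2.8, 6.9, 26., 25., 11., 18., 283., 447.])
mortality = np.array([7.3, 39.3, 63., 75., 8.1, 168., 25.2, 14.6, 79., 21.7])

r2_lit  = r_squared(literacy[:, None], mortality)
r2_both = r_squared(np.column_stack([literacy, density]), mortality)
# Adding a predictor can never lower R^2, so compare adjusted R^2 (or the
# added variable's p-value) before concluding the extra variable helps.
```

This is why the solution leans on the p-values and the small adjusted-R² gain rather than the raw R² increase.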
Country Mortality Density Literacy
Australia 7.30 2.30 100
Botswana 39.30 2.40 16
Libya 63.00 2.80 50
Gabon 94.00 4.20 48
Cent. Afri. 137.00 5.00 15
Bolivia 75 6.90 71
Saudi Arab 52.00 7.70 48
Russia 27.00 8.80 100
Somalia 126.00 10.00 14
Paraguay 25.20 11.00 88
Zambia 85.00 11.00 65
Argentina 25.60 12.00 95
Brazil 66.00 18.00 80
Chile 14.60 18.00 93
Peru 54.00 18.00 79
Uruguay 17.00 18.00 96
Venezuela 28.00 22.00 87
Afghanista 168.00 25.00 14
USA 8.10 26.00 97
Cameroon 77.00 27.00 45
Liberia 113.00 29.00 29
Tanzania 110.00 29.00 31
Colombia 28.00 31.00 86
U.Arab Em 22.00 32.00 63
Nicaragua 52.50 33.00 57
Panama 16.50 34.00 88
Burkina Fa 118.00 36.00 9
Estonia 19.00 36.00 100
Ecuador 39.00 39.00 86
Iran 60.00 39.00 43
Latvia 21.50 40.00 100
Jordan 34.00 42.00 70
Senegal 76.00 43.00 25
Iraq 67.00 44.00 49
Honduras 45.00 46.00 71
Mexico 35.00 46.00 85
Ethiopia 110.00 47.00 16
Kenya 74.00 49.00 58
Belarus 19.00 50.00 100
Uzbekistan 53.00 50.00 100
Cambodia 112.00 55.00 22
Egypt 76.40 57.00 34
Lithuania 17.00 58.00 98
Malaysia 25.60 58.00 70
Morocco 50.00 63.00 38
Costa Rica 11.00 64.00 93
Syria 43.00 74.00 51
Uganda 112.00 76.00 35
Spain 6.90 77.00 93
Turkey 49.00 79.00 71
Greece 8.20 80.00 89
Georgia 23.00 81.00 100
Azerbaijan 35.00 86.00 100
Gambia 124.00 86.00 16
Ukraine 20.70 87.00 100
Guatemala 57.00 97.00 47
Kuwait 12.50 97.00 67
Cuba 10.20 99.00 93
Indonesia 68.00 102.00 68
Nigeria 75.00 102.00 40
Portugal 9.20 108.00 82
Hungary 12.50 111.00 98
Thailand 37.00 115.00 90
Poland 13.80 123.00 98
China 52.00 124.00 68
Armenia 27.00 126.00 100
Pakistan 101.00 143.00 21
Domincan 51.50 159.00 82
Italy 7.60 188.00 96
N. Korea 27.70 189.00 99
Burundi 105.00 216.00 40
Vietnam 46.00 218.00 83
Philippines 51.00 221.00 90
Haiti 109.00 231.00 47
Israel 8.60 238.00 89
El Salvado 41.00 246.00 70
India 79.00 283.00 39
Rwanda 117.00 311.00 37
Lebanon 39.50 343.00 73
S. Korea 21.70 447.00 99
Barbados 20.30 605.00 99
Banglades 106.00 800.00 22
Bahrain 25.00 828.00 55
Singapore 5.70 4456.00 84
Hong Kong 5.80 5494.00 64
20. A mental health agency measured the self-esteem score for randomly selected individuals with
disabilities who were involved in some work activity within the past year. The Excel file Self Esteem.xls
provides the data, including the individuals' marital status, length of work, type of support received (direct
support includes job-related services such as job coaching and counseling), education, and age.
a. Use simple linear regression to determine if there is a relationship between self-esteem
and length of work
b. Use multiple linear regression for predicting self esteem as a function of the other
variables. Investigate possible interaction effects and determine the best model.
Note: Self Esteem was omitted for line 50, so that line was removed from the data.
[Scatterplot of Self Esteem vs. Length of Work: f(x) = 0.0539x + 2.8978, R² = 0.6463.]
The statistical relationship between Length of Work and Self Esteem is positive, with R2 = .65.
b. First create categorical variables for Support (none=0, direct=1) and Single, Married,
Separated, and Divorced (no=0, yes=1), then take a look at the correlation matrix:
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8424977386
R Square 0.7098024395
Adjusted R Square 0.6824253112
Standard Error 0.566821702
Observations 59
ANOVA
df SS MS F
Regression 5 41.649763485 8.3299526969 25.926840482
Residual 53 17.028202617 0.3212868418
Total 58 58.677966102
Divorced, Support Level, and Education are not significant (high P-values).
Education is also highly correlated with both Age and Length of Work.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8300496487
R Square 0.6889824193
Adjusted R Square 0.6778746485
Standard Error 0.5708683714
Observations 59
ANOVA
df SS MS F
Regression 2 40.428087043 20.214043522 62.027065144
Residual 56 18.249879058 0.3258906975
Total 58 58.677966102
Significance F: 3.881066E-13 (five-variable model), 6.28027E-15 (two-variable model).
Length of Work Age Separated Education (years)
4 52 0 9
4 52 0 9
14 40 0 11
10 46 0 9
12 40 0 12
6 47 0 10
4 50 0 10
5 44 0 10
9 46 0 9
4 47 0 9
4 51 0 10
11 47 1 9
10 51 1 9
12 42 1 10
9 48 1 9
3 46 1 9
14 37 1 12
13 35 1 12
3 43 1 9
4 45 1 10
10 50 0 10
8 46 0 9
2 49 0 11
7 48 0 12
8 45 0 10
9 47 0 9
6 46 0 9
4 47 0 9
3 45 0 10
2 47 0 10
12 28 0 13
37 42 0 12
23 40 0 10
23 38 0 12
10 45 0 9
10 47 0 10
9 47 0 9
23 39 0 12
12 43 1 10
21 39 1 11
10 33 1 12
11 35 1 12
14 45 0 12
12 41 0 13
5 43 0 9
51 28 0 12
40 32 0 14
29 38 0 14
29 39 0 14
21 39 0 13
7 48 0 9
24 29 0 12
32 26 0 13
37 30 1 14
31 26 0 13
58 27 0 12
17 38 0 12
61 28 0 16
64 38 1 14
21. Data collected in 1960 from the National Cancer Institute provides the per capita numbers of cigarettes
sold along with death rates for various forms of cancer (see the Excel file Smoking and Cancer.xls).
Use simple linear regression to determine if a significant relationship exists between the number of
cigarettes sold and each form of cancer.
[Scatterplots of cigarettes sold (x) against the death rate (y) for each form of cancer, including leukemia.]
The Excel file HATCO.xls (adopted from Hair, Anderson, Tatham, and Black in Multivariate Analysis, 5th
Edition, Prentice-Hall 1998) consists of data related to predicting the level of business (Usage Level)
obtained from a survey of purchasing managers of customers of an industrial supplier, HATCO. The
independent variables are
1. Delivery speed - amount of time it takes to deliver the product once an order is confirmed
2. Price level - perceived level of price charged by product suppliers
3. Price flexibility - perceived willingness of HATCO representatives to negotiate price on all
types of purchases
4. Manufacturing image - overall image of the manufacturer or supplier
5. Overall service - overall level of service necessary for maintaining a satisfactory
relationship between supplier and purchaser
6. Salesforce image - overall image of the manufacturer's salesforce
7. Product quality - perceived level of quality of a particular product
8. Size of firm relative to others in this market (0 = small; 1 = large)
Responses to the first seven variables were obtained using a graphic rating scale, where a 10-centimeter
line was drawn between endpoints labeled "poor" and "excellent." Respondents indicated their
perceptions using a mark on the line, which was measured from the left endpoint. The result was a scale
from 0 to 10 rounded to one decimal place.
22. Develop a correlation matrix, and interpret the ability of each independent variable to explain Usage Level.
23. Construct a simple linear regression model of Usage level as a function of Overall service and interpret
the results.
Usage level is positively correlated with Overall service: slope = 8.39, intercept = 21.62, and R2 = .49.
24. Construct a simple linear regression model of Usage level as a function of Delivery Speed and interpret
the results.
Usage level is positively correlated with Delivery speed; intercept = 29.92 and R2 = .46.
25. Construct a multiple regression model with Usage Level as the dependent variable, and Delivery speed
and Overall Service as independent variables. Interpret the results.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.7678598462
R Square 0.5896087434
Adjusted R Square 0.581147068
Standard Error 5.8178850418
Observations 100
ANOVA
df SS MS F
Regression 2 4717.0211231 2358.5105616 69.679905696
Residual 97 3283.2352769 33.84778636
Total 99 8000.2564
Both Delivery speed and Overall service are significant (P-values 5.65E-06 and
2.10E-07), R2 = .5896 and Adjusted R2 = .5811.
The multiple regression model with Delivery speed and Overall service explains the most variation in
Usage level (R2 = .59), followed by the simple regression model with Overall service (R2 = .49).
27. Develop a multiple regression model of Usage Level as a function of the first seven independent variables
and interpret the results.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8803126254
R Square 0.7749503185
Adjusted R Square 0.7578269731
Standard Error 4.4238178904
Observations 100
ANOVA
df SS MS F
Regression 7 6199.801245 885.68589215 45.256946197
Residual 92 1800.455155 19.570164728
Total 99 8000.2564
Only Price flexibility and Overall service are significant (P-values 1.5557E-12 and
.0359), R2 = .7750 and Adjusted R2 = .7578.
Although R2 has increased over the two variable model, there are too many
non-significant variables included in this model.
28. Use best subsets and stepwise regression to find good models for Usage Level using the first seven
independent variables. What is your recommendation?
Stepwise Analysis
Table of Results for General Stepwise
df SS MS F
Regression 1 3935.7039852 3935.7039852 94.893348933
Residual 98 4064.5524148 41.475024641
Total 99 8000.2564
df SS MS F
Regression 2 6036.7218583 3018.3609292 149.10917222
Residual 97 1963.5345417 20.242624141
Total 99 8000.2564
df SS MS F
Regression 3 6147.5536958 2049.1845653 106.18094194
Residual 96 1852.7027042 19.298986502
Total 99 8000.2564
Independent Variable(s) R2 Adjusted R2
Price flexibility (X3), Overall Service (X5), Salesforce image (X6) 0.7684 0.7612
29. Include the categorical variable Size of Firm (coded as 0 for small firms, and 1 for large firms) in
identifying the best model for predicting Usage Level. Be sure to investigate possible interactions.
Stepwise Analysis
Table of Results for General Stepwise
df SS MS F
Regression 1 3935.7039852 3935.7039852 94.893348933
Residual 98 4064.5524148 41.475024641
Total 99 8000.2564
df SS MS F
Regression 2 6036.7218583 3018.3609292 149.10917222
Residual 97 1963.5345417 20.242624141
Total 99 8000.2564
df SS MS F
Regression 3 6222.8841647 2074.2947216 112.03747269
Residual 96 1777.3722353 18.514294118
Total 99 8000.2564
df SS MS F
Regression 4 6342.6338284 1585.6584571 90.875664943
Residual 95 1657.6225716 17.448658649
Total 99 8000.2564
df SS MS F
Regression 5 6414.1504943 1282.8300989 76.026467625
Residual 94 1586.1059057 16.873467082
Total 99 8000.2564
Independent Variable(s) R2 Adjusted R2
Delivery speed (X1), Price flexibility (X3), Overall Service (X5), 0.8017 0.7912
Salesforce image (X6), Size of firm (X8)
It is interesting to note that Delivery speed re-enters as a significant variable in this model; something
about firm size enhances Delivery speed's ability to explain variance in Usage level.
30. Segregate the HATCO data by firm size. Run separate regressions on the data for small firms and the
data for large firms. Compare your results with Problem 29.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8792684242
R Square 0.7731129619
Adjusted R Square 0.7566120864
Standard Error 4.339904019
Observations 60
ANOVA
df SS MS F
Regression 4 3529.8496541 882.46241354 46.85284498
Residual 55 1035.9121792 18.834766894
Total 59 4565.7618333
Salesforce image is no longer a significant variable for small firms - useful information.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9310092849
R Square 0.8667782886
Adjusted R Square 0.8515529502
Standard Error 3.0026304776
Observations 40
ANOVA
df SS MS F
Regression 4 2053.0751075 513.26877688 56.929984963
Residual 35 315.55264247 9.0157897848
Total 39 2368.62775
Delivery speed is not a significant variable for large firms - also useful information.
R2 = .86, Adjusted R2 = .85 - this model explains more variation in Usage level than the
model for small firms.
Purchase Outcome: X9 = Usage level. Respondent Perceptions: X1 = Delivery speed, X2 = Price level,
X3 = Price flexibility, X4 = Manuf image, X5 = Overall service, X6 = Sales image.
X9 X1 X2 X3 X4 X5 X6
32 4.1 0.6 6.9 4.7 2.4 2.3
43 1.8 3 6.3 6.6 2.5 4
48 3.4 5.2 5.7 6 4.3 2.7
32 2.7 1 7.1 5.9 1.8 2.3
58 6 0.9 9.6 7.8 3.4 4.6
45 1.9 3.3 7.9 4.8 2.6 1.9
46 4.6 2.4 9.5 6.6 3.5 4.5
44 1.3 4.2 6.2 5.1 2.8 2.2
63 5.5 1.6 9.4 4.7 3.5 3
54 4 3.5 6.5 6 3.7 3.2
32 2.4 1.6 8.8 4.8 2 2.8
47 3.9 2.2 9.1 4.6 3 2.5
39 2.8 1.4 8.1 3.8 2.1 1.4
38 3.7 1.5 8.6 5.7 2.7 3.7
54 4.7 1.3 9.9 6.7 3 2.6
49 3.4 2 9.7 4.7 2.7 1.7
38 3.2 4.1 5.7 5.1 3.6 2.9
40 4.9 1.8 7.7 4.3 3.4 1.5
54 5.3 1.4 9.7 6.1 3.3 3.9
55 4.7 1.3 9.9 6.7 3 2.6
41 3.3 0.9 8.6 4 2.1 1.8
35 3.4 0.4 8.3 2.5 1.2 1.7
55 3 4 9.1 7.1 3.5 3.4
36 2.4 1.5 6.7 4.8 1.9 2.5
49 5.1 1.4 8.7 4.8 3.3 2.6
49 4.6 2.1 7.9 5.8 3.4 2.8
36 2.4 1.5 6.6 4.8 1.9 2.5
54 5.2 1.3 9.7 6.1 3.2 3.9
49 3.5 2.8 9.9 3.5 3.1 1.7
46 4.1 3.7 5.9 5.5 3.9 3
43 3 3.2 6 5.3 3.1 3
53 2.8 3.8 8.9 6.9 3.3 3.2
60 5.2 2 9.3 5.9 3.7 2.4
47.3 3.4 3.7 6.4 5.7 3.5 3.4
35 2.4 1 7.7 3.4 1.7 1.1
39 1.8 3.3 7.5 4.5 2.5 2.4
44 3.6 4 5.8 5.8 3.7 2.5
46 4 0.9 9.1 5.4 2.4 2.6
29 0 2.1 6.9 5.4 1.1 2.6
28 2.4 2 6.4 4.5 2.1 2.2
40 1.9 3.4 7.6 4.6 2.6 2.5
58 5.9 0.9 9.6 7.8 3.4 4.6
53 4.9 2.3 9.3 4.5 3.6 1.3
48 5 1.3 8.6 4.7 3.1 2.5
38 2 2.6 6.5 3.7 2.4 1.7
54 5 2.5 9.4 4.6 3.7 1.4
55 3.1 1.9 10 4.5 2.6 3.2
43 3.4 3.9 5.6 5.6 3.6 2.3
57 5.8 0.2 8.8 4.5 3 2.4
53 5.4 2.1 8 3 3.8 1.4
41 3.7 0.7 8.2 6 2.1 2.5
53 2.6 4.8 8.2 5 3.6 2.5
50 4.5 4.1 6.3 5.9 4.3 3.4
32 2.8 2.4 6.7 4.9 2.5 2.6
39 3.8 0.8 8.7 2.9 1.6 2.1
47 2.9 2.6 7.7 7 2.8 3.6
62 4.9 4.4 7.4 6.9 4.6 4
65 5.4 2.5 9.6 5.5 4 3
46 4.3 1.8 7.6 5.4 3.1 2.5
50 2.3 4.5 8 4.7 3.3 2.2
54 3.1 1.9 9.9 4.5 2.6 3.1
60 5.1 1.9 9.2 5.8 3.6 2.3
47 4.1 1.1 9.3 5.5 2.5 2.7
36 3 3.8 5.5 4.9 3.4 2.6
40 1.1 2 7.2 4.7 1.6 3.2
45 3.7 1.4 9 4.5 2.6 2.3
59 4.2 2.5 9.2 6.2 3.3 3.9
46 1.6 4.5 6.4 5.3 3 2.5
58 5.3 1.7 8.5 3.7 3.5 1.9
49 2.3 3.7 8.3 5.2 3 2.3
50 3.6 5.4 5.9 6.2 4.5 2.9
55 5.6 2.2 8.2 3.1 4 1.6
51 3.6 2.2 9.9 4.8 2.9 1.9
60 5.2 1.3 9.1 4.5 3.3 2.7
41 3 2 6.6 6.6 2.4 2.7
49 4.2 2.4 9.4 4.9 3.2 2.7
42 3.8 0.8 8.3 6.1 2.2 2.6
47 3.3 2.6 9.7 3.3 2.9 1.5
39 1 1.9 7.1 4.5 1.5 3.1
56 4.5 1.6 8.7 4.6 3.1 2.1
59 5.5 1.8 8.7 3.8 3.6 2.1
47.3 3.4 4.6 5.5 8.2 4 4.4
41 1.6 2.8 6.1 6.4 2.3 3.8
37 2.3 3.7 7.6 5 3 2.5
53 2.6 3 8.5 6 2.8 2.8
43 2.5 3.1 7 4.2 2.8 2.2
51 2.4 2.9 8.4 5.9 2.7 2.7
36 2.1 3.5 7.4 4.8 2.8 2.3
34 2.9 1.2 7.3 6.1 2 2.5
60 4.3 2.5 9.3 6.3 3.4 4
49 3 2.8 7.8 7.1 3 3.8
39 4.8 1.7 7.6 4.2 3.3 1.4
43 3.1 4.2 5.1 7.8 3.6 4
36 1.9 2.7 5 4.9 2.2 2.5
31 4 0.5 6.7 4.5 2.2 2.1
25 0.6 1.6 6.4 5 0.7 2.1
60 6.1 0.5 9.2 4.8 3.3 2.8
38 2 2.8 5.2 5 2.4 2.7
42 3.1 2.2 6.7 6.8 2.6 2.9
33 2.5 1.8 9 5 2.2 3
31. (From Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed., 1997, Prentice
Hall, 371.) The managing director of a consulting group has the following monthly data on total overhead
costs and professional labor-hours to bill to clients.
Generate a regression model to identify the fixed overhead costs to the consulting group.
a. What is the constant component of the consultant group's overhead?
b. If a special job requiring 1,000 billable hours that would contribute a margin of $38,000
before overhead was available, would the job be attractive?
a. The constant component of overhead is the regression intercept, approximately $199,848 per month.
b. The slope of the regression equation is the variable overhead rate, $47.543 per hour, so 1,000
billable hours would add 47.543 x 1,000 = $47,543 of overhead. A contribution margin of $38,000
before overhead would not cover this, so the job is probably not attractive.
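Parts (a) and (b) follow directly from the regression coefficients in the summary output below; a sketch of the arithmetic:

```python
# Part (a): the constant component of overhead is the regression intercept;
# part (b): the slope is the variable overhead rate per billable hour.
# Coefficients are taken from the summary output for this problem.
fixed_overhead = 199847.6      # intercept: fixed monthly overhead
rate_per_hour = 47.54286       # slope: variable overhead per labor-hour

hours_required = 1000
margin_before_overhead = 38000
added_overhead = rate_per_hour * hours_required     # about $47,543
job_attractive = margin_before_overhead > added_overhead
```

Since the margin does not cover the variable overhead the job adds, `job_attractive` comes out False.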
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.997009
R Square 0.994027
Adjusted R Square 0.992534
Standard Error 7708.375
Observations 6
ANOVA
df SS MS F Significance F
Regression 1 3.96E+10 3.96E+10 665.7067 1.34E-05
Residual 4 2.38E+08 59419048
Total 5 3.98E+10
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 199847.6 10611.94 18.83234 4.68E-05 170384.1 229311.1 170384.1 229311.1
X Variable 1 47.54286 1.842654 25.80129 1.34E-05 42.42682 52.6589 42.42682 52.6589
32. (From Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed., 1997, Prentice
Hall, 349.) Cost functions are often nonlinear with volume, as production facilities are often able to
produce larger quantities at lower rates than smaller quantities. Using the following data, plot the data
and use the Chart, Add Trendline feature. Compare a linear trendline with a logarithmic trendline.
X Y
Units Produced Costs
500 $12,500
1,000 $25,000
1,500 $32,500
2,000 $40,000
2,500 $45,000
3,000 $50,000
Linear Trendline:
[Scatter chart of Costs vs. Units Produced with fitted linear trendline.]
Logarithmic Trendline:
Although the linear trendline is a pretty good fit (R2 = .96), the logarithmic trend line is a
better fit (R2 = .99).
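Excel's linear and logarithmic trendlines can be reproduced with NumPy (`polyfit` on X or on ln X); using the data above:

```python
import numpy as np

units = np.array([500., 1000., 1500., 2000., 2500., 3000.])
costs = np.array([12500., 25000., 32500., 40000., 45000., 50000.])

def r2(y, yhat):
    """Coefficient of determination for fitted values yhat."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Linear trendline: costs = b*units + a
b_lin, a_lin = np.polyfit(units, costs, 1)
r2_lin = r2(costs, b_lin * units + a_lin)

# Logarithmic trendline (Excel's "Logarithmic" option): costs = b*ln(units) + a
b_log, a_log = np.polyfit(np.log(units), costs, 1)
r2_log = r2(costs, b_log * np.log(units) + a_log)
```

The logarithmic fit tracks the concave cost curve more closely, matching the R² comparison quoted above.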
33. (From Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed., 1997, Prentice
Hall, 349.) The Helicopter Division of Aerospatiale is studying assembly costs at its Marseilles plant.
Past data indicate the following costs per helicopter. Use linear regression, and compare the results
with a second-order polynomial regression model. Using the model with the best fit, predict the hours
required for a ninth helicopter. (Hint: a model often used in such situations is Y = aX^b.)
[Scatter chart of assembly hours per helicopter with fitted linear trendline.]
[Second-order polynomial trendline: f(x) = 30.94x² - 398.35x + 2231.82, R² = 0.9226.]
Using the second-order polynomial regression model (higher R²) to predict the ninth helicopter is
questionable: the fitted quadratic turns upward near the end of the data, illustrating the problem with
extrapolating beyond the range of the data.
[Power trendline: f(x) = 1875.94x^(-0.3409), R² = 0.9726.]
At about 887 hours, this is probably a better estimate for the time of the ninth helicopter.
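The power-model prediction for the ninth helicopter can be reproduced from the fitted trendline coefficients shown in the chart (in general, a and b come from a log-log fit such as `np.polyfit(np.log(x), np.log(y), 1)`):

```python
# Power model Y = a * X**b, with a and b taken from the fitted trendline.
a, b = 1875.93620431106, -0.340901256518966

def predicted_hours(unit):
    """Estimated assembly hours for the given unit number."""
    return a * unit ** b

ninth = predicted_hours(9)   # roughly 887 hours
```

Because the power curve keeps declining smoothly, its extrapolation to the ninth unit behaves better than the quadratic's.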
34. For the data in problem 33, use an exponential regression model and estimate the hours required for the
ninth helicopter.
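An exponential model Y = a·e^(bX) is fit by regressing ln Y on X. A sketch with invented assembly-hours data, since the helicopter data itself is not reproduced in this file:

```python
import numpy as np

# Hypothetical assembly-hours data with a learning-curve decline; the
# actual helicopter data from problem 33 is not reproduced here.
units = np.arange(1, 9, dtype=float)
hours = np.array([2000., 1600., 1350., 1200., 1100., 1050., 1000., 960.])

# Exponential model Y = a * exp(b * X): linear regression of ln(Y) on X.
b, ln_a = np.polyfit(units, np.log(hours), 1)
a = np.exp(ln_a)

ninth = a * np.exp(b * 9.0)   # estimated hours for the ninth helicopter
```

The slope b comes out negative for declining hours, and the ninth-unit estimate extrapolates the exponential decay one step beyond the data.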
[Scatter chart with fitted exponential trendline.]
35. (From Crask, Fox, and Stout, Marketing Research: Principles & Applications, Prentice Hall, 1995, 252.)
A real estate company hired a small market research firm to develop a model to calculate a ballpark
price for a home based only on square footage. The real estate company felt that this model would be
useful in helping customers set the list prices of their homes. The market research firm wants a linear
regression relating price as a function of square footage based on the following sample data.
Note that the model says the price per square foot is $79.402, i.e., about $80 per sq. ft.