Business Statistics Canadian 3rd Edition Sharpe Solutions Manual Download
SECTION EXERCISES
SECTION 7.1
1.
a) False. We choose the best fitting line through the data points on a scatterplot. This is done by
minimizing the sum of the squared errors. The predicted values (not the data values) fall on
the line.
b) True
c) False. Least squares means that the sum of all squared residuals (difference between the observed
y-values and the y-values predicted by the line) is minimized.
2.
a) True
b) False. Least squares means that the sum of all squared residuals (difference between the observed
y-values and the y-values predicted by the line) is minimized.
c) True
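The least squares criterion described in Exercises 1 and 2 can be sketched in a few lines of Python. The data below are hypothetical, invented purely for illustration, and the helper names `least_squares` and `sse` are not from the textbook. The sketch fits the line and then checks that nudging the slope or intercept only increases the sum of squared residuals.

```python
# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sse(b0, b1, x, y):
    """Sum of squared residuals (observed y minus predicted y)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = least_squares(x, y)
best = sse(b0, b1, x, y)

# Any other line, here four nearby ones, has a larger sum of squared residuals.
nearby = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
worse = [sse(b0 + db0, b1 + db1, x, y) for db0, db1 in nearby]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, minimum SSE = {best:.4f}")
```

The predicted values b0 + b1*x fall on the line; the observed y-values generally do not.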
SECTION 7.2
3.
4.
a) b1 = 0.914
To find the slope use b1 = r(sy/sx):
b1 = 0.965 × (5.34/5.64) = 0.914
b) The slope of 0.914 indicates that an additional 0.914 ($1000), or $914 in sales, is associated with
each additional salesperson working.
c) b0 = 8.09
To find the intercept use b0 = ȳ − b1 x̄
b0 = 17.6 − (0.914)(10.4) = 8.09
d) The intercept of 8.09 implies that, on average, we expect sales of 8.09 ($1000), or $8090, with
zero salespeople working. This value is not meaningful in this context.
h) Because the actual sales value is more than the predicted value, the estimated regression equation
has underestimated sales when x = 18.
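As a check on the arithmetic in parts a) and c), the two formulas can be evaluated directly. This is a minimal sketch: the variable names are illustrative, and the reading of 5.34 as sy, 5.64 as sx, 17.6 as the mean of y and 10.4 as the mean of x is inferred from where the numbers appear in the worked solution.

```python
# Summary statistics as they appear in the worked solution (labels inferred).
r = 0.965      # correlation
s_y = 5.34     # standard deviation of y (sales, $1000)
s_x = 5.64     # standard deviation of x (number of salespeople)
y_bar = 17.6   # mean of y
x_bar = 10.4   # mean of x

# The written solution rounds the slope to three decimals before using it.
b1 = round(r * s_y / s_x, 3)
b0 = y_bar - b1 * x_bar

print(f"slope b1 = {b1:.3f}")
print(f"intercept b0 = {b0:.2f}")
```

Carrying the unrounded slope into the intercept calculation would give a value closer to 8.10, so the 8.09 in part c) depends on rounding b1 first.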
SECTION 7.3
5. The winners may be suffering from regression to the mean. Regression toward the mean implies that a
variable that yields an extreme value on the first measurement will tend to be closer to the mean on a later
measurement. So the winners of the “rookie junior executive of the year award” had an extreme
measurement their first year. This extremely good performance may have been due, in part, to just an
unusually lucky year. Their next year’s performance moved closer to the mean.
6. If the poor performance of mutual funds in the previous year is not the result of extreme values, then
regression to the mean won’t ensure better performance the next year. Although, on average, the
performance of funds will cluster around the mean, we can’t predict how any particular fund will do.
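Regression to the mean, as discussed in Exercises 5 and 6, can be illustrated with a small simulation. Everything here is an assumption made for the sketch: performance is modelled as a stable "skill" component plus independent "luck" each year, and the award winners are taken to be the top 1% in year 1.

```python
import random

random.seed(7)  # arbitrary seed so the run is repeatable

N = 10_000
skill = [random.gauss(0, 1) for _ in range(N)]      # stable component
year1 = [s + random.gauss(0, 1) for s in skill]     # skill plus luck, year 1
year2 = [s + random.gauss(0, 1) for s in skill]     # skill plus fresh luck, year 2

# The "award winners": the top 1% of performers in year 1.
top = sorted(range(N), key=lambda i: year1[i], reverse=True)[: N // 100]

mean_y1 = sum(year1[i] for i in top) / len(top)
mean_y2 = sum(year2[i] for i in top) / len(top)

print(f"winners, year 1 mean: {mean_y1:.2f}")
print(f"winners, year 2 mean: {mean_y2:.2f}")
```

Under these assumptions the winners remain above average in year 2, but noticeably closer to the overall mean of zero, because part of their extreme first-year result was luck that does not repeat.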
b) The residual with the largest absolute value contributes the most to the sum being minimized by
the least squares criterion. In this case it is 2.77.
c) The residual with the smallest absolute value contributes the least to the sum being minimized by
the least squares criterion. In this case it is 0.07.
8.
a) The curved pattern in the residuals violates the Linearity Assumption.
b) The unusual residual value that is extremely different from the others violates the Outlier
Condition.
c) The increasing spread of the residual values resulting in a cone-shaped pattern violates the Equal
Spread Condition.
10. R² = 98.8%. R² is the squared correlation, or 0.994². It means that about 98.8% of the variance in the price of
disk drives can be accounted for by the regression model of Price on Capacity.
SECTION 7.8
11.
a) The original values are the squared resulting values: 16, 16, 36, 49, 49, 64, and 100.
b) The values are not symmetric but instead are skewed to the right (high end).
12. The average is $370,000, a value not representative of any of the three individual salaries since they are
very unequally spaced. The logarithms to the base 10 of the three individual salaries would be 4, 5,
and 6, resulting in values that are equally spaced.
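The effect of the logarithm in Exercise 12 can be seen numerically. The three salaries below are an assumption: the solution does not list them, and these values are chosen only because they reproduce the stated average of $370,000 and the logarithms 4, 5, and 6.

```python
import math

# Assumed salaries (not given in the solution text).
salaries = [10_000, 100_000, 1_000_000]

average = sum(salaries) / len(salaries)
logs = [math.log10(s) for s in salaries]

# Gaps between consecutive values, before and after the transformation.
raw_gaps = [b - a for a, b in zip(salaries, salaries[1:])]
log_gaps = [b - a for a, b in zip(logs, logs[1:])]

print(f"average = {average:,.0f}")
print(f"raw gaps = {raw_gaps}")
print(f"log gaps = {log_gaps}")
```

On the raw scale the gaps differ by a factor of ten; on the log scale they are equal.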
CHAPTER EXERCISES
b) The problem states that the weekly Sales (in kg) of frozen pizza are being predicted from the
average Price/unit ($). Whatever is being predicted is the response variable. Sales of frozen pizza
are being predicted, and therefore Sales is y, or the response variable. In addition, the prediction
equation is stated as Predicted Sales = 141,865.53 − 24,369.49 × Price, which is of the form ŷ = b0 + b1 x.
c) The prediction equation is stated as Predicted Sales = 141,865.53 − 24,369.49 × Price, which is of the form
ŷ = b0 + b1 x, where b1 is the slope. The slope for this equation is –24,369.49 kg per dollar, which means that
for every extra dollar increase in the price of pizza, weekly sales of frozen pizzas are predicted to
decrease by 24,369.49 kg.
d) The y-intercept is the value of the line when the x-variable is zero. The intercept can be used as
a starting point for predictions, but it is not meaningful in all circumstances. In this equation,
Predicted Sales = 141,865.53 − 24,369.49 × Price is of the form ŷ = b0 + b1 x, where b0 is the y-intercept.
The y-intercept for this equation is 141,865.53 kg. This number is not meaningful except as a base
or starting value for the line because it is obviously not realistic to set the Price at zero dollars.
e) A prediction can be made by substituting the given value for the Price of pizza into the equation
and solving for weekly Sales. If the Price/unit of the pizza is $3.50, the equation is:
Predicted Sales = 141,865.53 − 24,369.49 × 3.50 = 141,865.53 − 85,293.22 = 56,572.32 kg.
f) If a Price of $3.50 yields 60,000 kg, the residual is calculated by subtracting the predicted
value (calculated from the linear model equation) from the observed or measured value
(Residual = Data – Predicted or e = y − ŷ). The predicted value of y at an x-value of $3.50 was
calculated in part e) as 56,572.32 kg. The Residual = 60,000 (Data) – 56,572.32 (Predicted)
= 3,427.68 kg.
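The prediction and residual in parts e) and f) follow the same two steps used throughout these exercises: substitute x into the fitted equation, then subtract the prediction from the observed value. A minimal sketch, using the coefficients from the stated equation and the hypothetical helper names `predict_sales` and `residual`:

```python
B0 = 141_865.53   # intercept from the stated equation
B1 = -24_369.49   # slope from the stated equation (per dollar of Price)


def predict_sales(price):
    """Predicted weekly Sales at a given Price/unit."""
    return B0 + B1 * price


def residual(observed, price):
    """Residual = Data - Predicted."""
    return observed - predict_sales(price)


predicted = predict_sales(3.50)
e = residual(60_000, 3.50)
print(f"predicted = {predicted:,.3f}, residual = {e:,.3f}")
```

The results agree with the worked values to within rounding. A positive residual means the model underestimated the observed value.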
c) The prediction equation is stated as Predicted Price = 21,253.58 − 0.11097 × Mileage, which is of the form
ŷ = b0 + b1 x, where b1 is the slope. The slope for this equation is –0.11097, which means that for every
extra mile driven, the price of the 2012 Honda is predicted to decrease by $0.11097 (about $111 per extra
1000 miles driven).
d) The y-intercept is the value of the line when the x-variable is zero. The intercept can be used as a starting
point for predictions, but it is not meaningful in all circumstances. In this equation,
Predicted Price = 21,253.58 − 0.11097 × Mileage is of the form ŷ = b0 + b1 x, where b0 is the y-intercept. The
y-intercept for this equation is $21,253.58. This number represents the base or starting value for the line,
which means the average Price for a used car with no miles on the odometer. This is not very realistic.
e) A prediction can be made by substituting the given value for Mileage into the equation and solving for
Price. If the mileage is 50,000 miles, the equation is:
Predicted Price = 21,253.58 − 0.11097 × Mileage = 21,253.58 − 0.11097 × 50,000
= 21,253.58 − 5,548.50 = $15,705.08.
f) If a car with 50,000 miles on it cost $14,000, the residual is calculated by subtracting the predicted value
(calculated from the linear model equation) from the observed or measured value (Residual = Data –
Predicted or e = y − ŷ). The predicted value of y at an x-value of 50,000 was calculated in part e) to be
$15,705.08. The Residual = $14,000 (Data) – $15,705.08 (Predicted) = –$1,705.08.
g) Yes, this would be a good deal, since the $14,000 price is below the predicted price of $15,705.08.
17. Sales by region. The model is meaningless because the variable Region is categorical, not quantitative.
Although each region is denoted by a number, the variable is still categorical. The slope makes no sense
because Region has no units. The boxplot comparisons are informative, but the regression is meaningless.
18. Salary by job type. The model is meaningless because the variable Job Type is categorical, not
quantitative. It has no units, so the linear model and slope make no sense. A bar chart of average salary for
the different job types would be a good display of the data.
c) The slope represents the change in y or the response variable for every x unit or one unit step in the predictor
variable. The slope for this equation is 0.468, which means that for every 1% increase in Annual GDP Growth
of Developed Countries, the Annual GDP Growth of Developing Countries increases by 0.468%.
d) A prediction can be made by substituting the given value of 4% for Annual GDP Growth of Developed
Countries into the equation and solving for Annual GDP Growth of Developing Countries. The equation is:
Predicted Growth (Developing Countries) = 3.38 + 0.468 × Growth (Developed Countries) = 3.38 + 0.468 × 4
= 5.25%.
e) If developed countries experience a 2.65% growth while developing countries grew at a rate of 6.09%, the
predicted value of y at an x-value of 2.65% is:
Predicted Growth (Developing Countries) = 3.38 + 0.468 × 2.65 = 4.62%.
The predicted value using the linear model is less than the actual percentage. Developing countries grew
faster than the model predicted.
f) The residual is calculated by subtracting the predicted value (calculated from the linear model equation) from
the observed or measured value (Residual = Data – Predicted or e = y − ŷ). The predicted value of 4.62%
is compared to the actual value of 6.09%. The Residual = 6.09% (Data) – 4.62% (Predicted) = 1.47%.
b. The slope represents the change in y, or the response variable, for every x-unit or one-unit step in the
predictor variable. The slope for this equation is 771, which means that for every 1% increase in
mutual fund Return, the Flow into mutual funds increases by 771 ($M).
c. The predicted fund Flow for a month that had a market Return of 0% is the y-intercept, which has a
value of 9747 ($M).
d. If the recorded fund Flow was $5 billion during a month when the Return was 0%, the residual is
calculated by subtracting the predicted value (calculated from the linear model equation) from the
observed or measured value (Residual = Data – Predicted or e = y − ŷ). The predicted value of y at
an x-value of 0% was calculated in c) as 9747 ($M). $5 billion = 5000 ($M). The Residual = 5000
($M) (Data) – 9747 ($M) (Predicted) = –4747 ($M). This model overestimated the Flow value.
b. The slope represents the change in y, or the response variable, for every x-unit or one-unit step in the
predictor variable. The slope for this equation is 0.012, which means that for every $1 increase in
Income, the Purchases increase by $0.012, or just over one cent. If you multiply both by 1000, for
every $1000 increase in Income, Purchases increase by $12.
c. The predicted Purchases for an Income of $20,000 is calculated by substituting $20,000 for Income in
the regression equation: Predicted Purchases = −31.6 + 0.012 × 20,000 = $208.40.
d. The actual Purchases were $100, so the model overestimated Purchases. The residual is calculated
by subtracting the predicted value (calculated from the linear model equation) from the observed or
measured value (Residual = Data – Predicted or e = y − ŷ). The predicted value of y at an x-value of
$20,000 was calculated in c) as $208.40. The Residual = $100 (Data) – $208.40 (Predicted) =
–$108.40.
25. The Home Depot, part 1.
a. The slope has units corresponding to a change in y/x, which in this case is quarterly Sales ($B)/U.S.
Housing Starts (thousands), translating into billions of dollars per thousand Housing Starts.
b. The R² value is r*r. Correlation r is given as 0.70, therefore r*r = 0.70*0.70 = 0.49 or 49%.
c. For one standard deviation below average in Housing Starts, we can use the standardized equation to
find a solution: ẑy = r zx. One standard deviation below the mean in x represents a z-score
of –1, which results in a z-score for y of –r = –0.70, representing 0.70 standard deviations
below the mean in Sales.
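The standardized form of the regression used in part c) is a one-line calculation. The sketch below uses r = 0.70 from part b); the function name `predicted_z_y` is illustrative only.

```python
def predicted_z_y(r, z_x):
    """Standardized regression: the predicted z-score of y is r times the z-score of x."""
    return r * z_x


r = 0.70
z_sales = predicted_z_y(r, -1)   # one SD below the mean in Housing Starts

# Because |r| <= 1, the predicted z-score is never more extreme than z_x.
shrinkage = [(z, predicted_z_y(r, z)) for z in (-2, -1, 0, 1, 2)]
print(z_sales)
print(shrinkage)
```

The same function covers any standardized prediction of this kind, for example a value two standard deviations above the mean with a different r.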
b. The R² value is r*r. Correlation r is given as 0.79, therefore r*r = 0.79*0.79 = 0.6241 or 62.41%.
c. For two standard deviations above average in Price, we can use the standardized equation to find a
solution: ẑy = r zx. Two standard deviations above the mean in x represents a z-score of +2, which
results in a z-score for y of 2r = 2*0.79 = 1.58, representing 1.58 standard deviations above the
mean in Price.
b. The R² value is r*r. Correlation r can be calculated by finding the square root of R². In this case,
√0.883 = 0.94. We expect a negative correlation between unemployment Rate and Sales, so r = –0.94.
c. The slope is –2.99, which means that for every 1% increase in unemployment Rate, Sales decrease by
2.99 ($B).
b. The R² value is r*r. Correlation r can be calculated by finding the square root of R². In this case,
√0.329 = 0.573, so r = –0.573. The slope is negative; therefore the correlation is also negative.
c. The slope is –24,369.49, which means that for every $1 increase in Price, Sales decrease by 24,369.49
kg. For a $0.50 increase, the predicted decrease is 12,184.75 kg.
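Recovering r from R² takes two steps: a square root for the magnitude, and the sign of the slope for the direction. A minimal sketch, with the hypothetical helper `correlation_from_r_squared`:

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Recover r from R^2: magnitude from the square root, sign from the slope."""
    return math.copysign(math.sqrt(r_squared), slope)


r = correlation_from_r_squared(0.329, -24_369.49)
print(f"r = {r:.2f}")
```

Starting from the rounded R² of 0.329, the result is about –0.57, in line with the value above to two decimal places; the third decimal depends on how much R² was rounded.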
b. The linear model is not appropriate for this data set. The data points are curved, indicating a nonlinear
relationship.
c. The linear model is not appropriate for this data set. The data points start out close together and then
the spread increases as x increases.
30. Residual plots, part 2.
a. The linear model is not appropriate for this data set. The data points are curved, indicating a nonlinear
relationship.
b. The linear model is not appropriate for this data set. The data points are spread out at the start and then
the spread decreases as x increases.
c. The linear model seems appropriate. The residual plot has appropriate scatter of points and nothing
remarkable.
b. Calculate the prediction by substituting 500,000 units (entered as 500, since Housing Starts are in
thousands) for Housing Starts and solve for Sales.
Predicted Sales = −11.5 + 0.0535 × 500 = 15.25 ($B)
c. If Sales are $3 billion higher than predicted, the difference is the residual (the difference in the
y-values at a particular x-value).
a. The slope is –2.994, which means that for every 1% increase in unemployment Rate, Sales decrease by
2.994 ($B).
b. Calculate the prediction by substituting 6% for unemployment Rate and solve for Sales.
Predicted Sales = 20.91 − 2.994 × 6 = 2.946 ($B)
c. Calculate the prediction by substituting 4% for unemployment Rate and solve for Sales.
Predicted Sales = 20.91 − 2.994 × 4 = 8.934 ($B). The actual Sales value is $8.5 billion. The residual is the
difference between the actual data and the predicted value: 8.5 – 8.934 = –$0.434 billion.
33. Consumer spending. There are two influential outliers that strongly affect the linear regression (slope and
intercept) and inflate R² to 79%. The predictions will not be accurate for this regression. Looking at the scatterplot
illustrates why. Without these two data points, the R² drops to about 31%. The analyst should identify these
two customers and examine why they are outliers. For analysis of the rest of the data points, these two
customers should be set aside and the model refit to the rest of the data.
34. Insurance policies. There is one very influential outlier that strongly affects the linear regression (slope
and intercept) and inflates R² to 99.9%. The predictions will not be accurate for this regression. Looking at the
scatterplot illustrates why. Most of the data are clustered at the low end, with the exception of the high outlier.
Without this data point, the R² drops to 23.5%. The analyst should identify this salesperson and examine why
he/she is an outlier. For analysis of the rest of the data points, the salesperson should be set aside and the model
refit to the rest of the data.
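The way a single influential point can inflate R², as in Exercises 33 and 34, can be reproduced with a small made-up data set. The numbers below are hypothetical and are not the textbook data: six points with a weak relationship, plus one far-away point.

```python
def r_squared(x, y):
    """Squared correlation between x and y."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical data: a loose cluster plus one extreme point at (30, 30).
x = [1, 2, 3, 4, 5, 6, 30]
y = [3, 1, 4, 2, 5, 3, 30]

r2_all = r_squared(x, y)
r2_trimmed = r_squared(x[:-1], y[:-1])   # refit with the extreme point set aside

print(f"R-squared with the outlier:    {r2_all:.3f}")
print(f"R-squared without the outlier: {r2_trimmed:.3f}")
```

With the extreme point included, R² is above 0.95; with it set aside, R² falls to about 0.14, which is why the model should be refit to the remaining data before drawing conclusions.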
c. The meaning of R² in this problem is that 56.9% of the variability in annual Sales in 2000 can be
accounted for by the variability in the Population of the town where the store is located.
b. Assuming population is in thousands and Sales are in millions of dollars, the slope of 0.0703 means
that for every increase in population of 1000 residents, sales increase by $0.0703M or $70,300.
c. The y -intercept is $2,924,068, which means that for a town with no Population, Sales is nearly $3
million. This does not make sense.
b. The statement should not be declared an absolute fact. The annual sales figure is a prediction.
The statement should be rephrased as “The model predicts the quarterly sales will be $10M when $1.5
million is spent on advertising.”
a. R² measures the amount of variation explained by the linear model. Literacy Rate accounts for 64% of
the variation in GDP.
b. The slope of the line is a trend. In this case, it reflects a prediction of how GDP changes with a change
in Literacy Rate. Absolute statements should not be made with regard to specific values in the
interpretation of a regression.
b) There is a fairly strong positive association between used BMW 850 CSI prices and their model year.
There seems to be a decreasing spread of data points as years increase (1994-1996 seem about the same).
c) A linear model is appropriate for this relationship, but use it with caution: there are no outliers or other
unusual patterns, although there is some upward curvature.
d) Correlation r can be calculated by finding the square root of R². In this case,
√0.573 = 0.757, so r = 0.757.
e) 57.3% of the variation in Price of a used BMW 850 CSI can be accounted for by the Year the car was
made.
f) This relationship is not perfect. Other factors contribute to the variability of the price, such as options,
condition of car, and mileage.
b. A linear model is possible for this data set but there are only 10 data points with a lot of scatter
and three possible outliers—one low value and two high values. Both variables are quantitative
and there are no obvious differences in the spread. However, you would be right to be cautious in
using any predictive equation.
c. The regression equation is determined from technology as Predicted Sales = 24.116 − 0.4829 × Median Age.
d. The predicted Sales for a similar supermarket in a town with a Median Age of 32 are
Predicted Sales = 24.1155 − 0.48285 × 32 = 8.664 ($M).
e. The same reservations apply as defined in part b); the linear model is probably not very accurate.
Specifically, in looking at the scatterplot for a Median Age of 32, there are no other data than the
two high outliers.
b. A linear model is appropriate because the conditions are satisfied for quantitative variables with an
even spread. However, there is possibly one high outlier and there
are only 10 data points. This is reason to be cautious about using a linear prediction.
c. The regression equation from technology is Predicted Sales = 2.956 $M + 0.168 × Total Housing Units.
d. Predicted Sales = 2.956 $M + 0.168 × 100 = 19.76 $M
e. This is an extrapolation—100,000 units is more than twice the number of units in the largest town
in this data set. We should not be confident of this prediction.
44. El Niño.
a. The correlation r is the square root of R² (33.4%), which equals 0.578.
b. The meaning of R² in this context is that CO2 levels account for 33.4% of the variation in Mean
Temperature.
d. The slope of 0.004 means that the predicted Mean Temperature (in degrees Celsius) has been
increasing at an average rate of 0.004 degrees (°C) per part per million of CO2.
e. The intercept of 15.3066 would be the predicted Mean Temperature if the parts per million of
CO2 were zero, but this doesn’t make any sense.
f. The residual scatterplot does not show anything remarkable, just a scattering about the zero point.
There is no evidence of violation of the assumptions of regression.
g. The Mean Temperature (°C) prediction for 364 parts per million CO2 levels is:
Predicted Mean Temperature = 15.3066 + 0.004 × 364 = 16.7626 °C.
a)
b) Linear regression should not be used directly as the data are not linear.
The data would not become linear using logarithms, squares, or square roots, since the shape of the graph in a)
is different from the shapes of graphs of logarithms, squares, or square roots. Therefore linear regression cannot
be used on transformed data, either.
c)
The variables (year and fertility rate) are quantitative
The scatterplot is linear from 1970 onwards after doing the transformation
There are no outliers
The spread is uniform
The residuals show no pattern
[Figure: residual plot, Residual versus Predicted]
[Figure: scatterplot of Cost ($/W) versus Cumulative Volume (MW)]
[Figure: scatterplot of Log(Cost) versus Log(Cumulative Volume)]
[Figure: residual plot, Residual versus Predicted]
c) Log(cost) = 0.373039
Cost = 2.360692 $/W
d) The data are quantitative
The trend is linear
There are some outliers
The spread is uniform
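The step in part c) from Log(cost) back to Cost is a back-transformation. The two numbers given are consistent with base-10 logarithms, so the sketch below assumes base 10; the variable names are illustrative.

```python
# Predicted value on the log scale, from part c).
log_cost = 0.373039

# Undo the base-10 logarithm (base 10 is inferred from the numbers, not stated).
cost = 10 ** log_cost

print(f"Cost = {cost:.4f} $/W")
```

A model fitted on the log scale makes its predictions on that scale, so every prediction needs this inverse step before it can be reported in $/W.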
[Figure: residual plot, Residuals versus Predicted]
f) R squared = 0.72. This is the same for each model because it is the square of the correlation coefficient,
and there is only one correlation coefficient.
[Figure: scatterplot of Spoilage (%) versus Volume shipped (tns)]
[Figure: residual plot, Residuals versus Spoilage]
b)
Spoilage = 0.144 - 0.0215 * 4 = 5.84%
c)
The variables are quantitative
The scatterplot is linear
There are no outliers
The spread is fairly uniform
[Figure: residual plot, Residual versus Volume shipped]
c)
The variables are quantitative
The scatterplot is linear
There are no outliers
The spread is uniform
[Figure: Sales ($m) versus Month for British Columbia, Ontario, and Quebec]
a)
          Quantitative variables   Linear trend   Outliers   Spread
BC        Yes                      Yes            No         Uniform
Ontario   Yes                      No             No         Uniform
Quebec    Yes                      No             Yes        Uniform
b) A linear model can be used for British Columbia. The residuals show no pattern.
d) British Columbia
Amount = 0.0919 * 0.5 = 0.0459 $m
Agreed. There is no evidence of any linear trend and it is not possible to fit a linear model.
b)
Agreed. There is no evidence of any linear trend and it is not possible to fit a linear model.
c) The driver’s cost is shifted back a season, since it is based on winnings the previous season.
The linear model explains 80% of the variability in the data, as given by R squared.
d)
Agreed. The trend is linear and the spread is uniform, so a linear model can be fitted.
When winnings = $0.5m, mechanics should get 0.331 + 0.33 × 0.5 = $0.496m.
51. Bricks.
[Figure: scatterplot of Sales revenue ($) versus Price/brick ($)]
The data provided show a non-linear trend, with sales rising to a peak and declining at the higher prices. We do not
have a simple way to transform the variables to deal with this shape.
Although some students may have additional knowledge beyond this introductory chapter on linear regression and
may be able to fit a quadratic model to these data, the key to the exercise is a careful reading of the question.
The question specifically asks for a linear model, and it also asks us to estimate the number of bricks, not the sales
revenue. We should therefore use data on the number of bricks, which can easily be calculated from the sales
revenue and the price of the bricks. We obtain:
[Figure: scatterplot of # bricks sold (m) versus Price/brick ($)]
[Figure: residual plot, Residuals versus Predicted]
b)
c) The regression against the square of diameter is preferable to the regression against diameter for two reasons. (i)
The R-squared is higher indicating that the regression explains more of the variability in the data. (ii) The residuals
are positive at the start and at the end and are negative in the middle. In the regression against diameter, this effect is
more pronounced than in the regression against diameter squared indicating that the trend line is curved.
[Figure: scatterplot of Sales versus Quarter]
a)
[Figure: Demand versus Quarter with fitted line y = 0.2426x + 0.2424, R² = 0.9384]
b)
[Figure: Demand versus Log(quarter) with fitted line y = 2.4294x + 0.0486, R² = 0.9916]
c) The regression in b) is preferable for two reasons. (i) The R-squared is higher indicating that the regression
explains more of the variability in the data. (ii) The residuals are negative at the start and at the end and are positive
in the middle in a). This indicates that the trend line is curved and is better represented by the log transformation in
b).
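The comparison in part c) can be sketched on synthetic data. The values below are not the exercise data: they are generated from a logarithmic curve plus a small fixed wiggle, an assumption made only to show how R² and the residual pattern distinguish the two fits.

```python
import math


def fit_line(x, y):
    """Least squares intercept and slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x, y):
    """Share of the variation in y explained by the least squares line on x."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - y_bar) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Synthetic demand: logarithmic growth over 12 quarters plus a small wiggle.
quarters = list(range(1, 13))
wiggle = [0.03, -0.03] * 6
demand = [0.05 + 2.4 * math.log10(q) + w for q, w in zip(quarters, wiggle)]
log_q = [math.log10(q) for q in quarters]

r2_linear = r_squared(quarters, demand)
r2_log = r_squared(log_q, demand)

# Residuals from the untransformed fit: negative at both ends, positive in the middle.
b0, b1 = fit_line(quarters, demand)
residuals = [d - (b0 + b1 * q) for q, d in zip(quarters, demand)]

print(f"R-squared, Demand on Quarter:      {r2_linear:.3f}")
print(f"R-squared, Demand on Log(quarter): {r2_log:.3f}")
```

A higher R² after the transformation, together with a bowed residual pattern before it, is the same pair of signals used in the written answer.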
PLAN
Setup: State the objective.
Examine the relationship between the overall cost of living and the cost of a luxury apartment (per month),
the price of a bus or subway ride, the price of a CD, the price of an international newspaper, the price of
a cup of coffee (including service), and the price of a fast-food hamburger meal.
DO
Mechanics: Large-format tables and graphs (if any) are placed below this PLAN/DO/REPORT table.
REPORT
Conclusion: State the conclusion in the context of the original objective.
Among the variables considered to be related to cost of living, rent has the strongest positive relationship
and would be the best predictor of overall cost. The price of a cup of coffee has the weakest relationship
(positive) and would therefore be the worst predictor. A surprising relationship is the strong negative
association between the cost of living index and the price of an international newspaper.
[Figure: scatterplot of Cost of Living versus Rent, with residuals versus fits (response is Cost of Living)]
[Figure: scatterplot of Cost of Living versus Public Trans, with residuals versus fits (response is Cost of Living)]
[Figure: scatterplot of Cost of Living versus CD, with residuals versus fits (response is Cost of Living)]
[Figure: scatterplot of Cost of Living versus News, with residuals versus fits (response is Cost of Living)]
[Figure: scatterplot of Cost of Living versus Coffee]
PLAN Setup. State the We want to find whether there is a relationship between buying power indices and retail sales in
objectives of the study.
Model. Check the The scatterplots given below confirm the conditions.
conditions. Quantitative Variables Condition—yes.
Linearity Condition—There are not many data points, but there is no indication that the relation
Outlier Condition—There are two cities that could be regarded as outliers. We will do the analy
Equal Spread Condition—We do not have enough data to check this condition, but there is no e
(b)
REPORT Conclusion. Interpret The value of R2 of around 0.5 in the regressions before removing the two cities at the top right
the results
When we remove those two cities, the value of R2 becomes too small to indicate any relationshi
The data provided are inconclusive as to whether the buying power indices are related to retail s