
STA467/567

Fall 2022

Tyler Drellishak

1. In order to investigate the feasibility of starting a Sunday edition for a large
metropolitan newspaper, information was obtained from a sample of 34
newspapers concerning their daily and Sunday circulations (in thousands)
(Source: Gale Directory of Publications, 1994). The data are given in the file
(P054.txt).

a. Construct a scatter plot of Sunday circulation versus daily circulation. Does the
plot suggest a linear relationship between daily and Sunday circulation? Do
you think this is a plausible relationship?

The plot suggests a linear relationship between daily and Sunday circulation. I
believe this is a plausible relationship, as papers that are more popular on
weekdays are also likely to be more popular on Sundays.

b. Fit a regression line predicting Sunday circulation from daily circulation.


c. Interpret the estimated value of the slope $\beta_1$, and then obtain the 95%
confidence intervals for $\beta_0$ and $\beta_1$.

Assuming our model assumptions hold, for every additional one thousand papers
in daily circulation, we predict on average 1,340 additional papers in Sunday
circulation.

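95% CI for $\beta_1$: $\hat{\beta}_1 \pm t_{n-2,\,\alpha/2} \cdot SE(\hat{\beta}_1) = 1.34 \pm 2.037 \times 0.0708 = (1.196, 1.484)$

95% CI for $\beta_0$: $\hat{\beta}_0 \pm t_{n-2,\,\alpha/2} \cdot SE(\hat{\beta}_0) = 13.84 \pm 2.037 \times 35.804 = (-59.093, 86.773)$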
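
The arithmetic behind the two intervals above can be checked with a short Python sketch. It only reuses the estimates, standard errors, and critical value 2.037 reported in this solution; the function name `t_interval` is an illustrative choice, not part of the original R workflow.

```python
def t_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Critical value used in this solution for a 95% interval
T_CRIT = 2.037

# Estimates and standard errors as reported above
slope_ci = t_interval(1.34, 0.0708, T_CRIT)
intercept_ci = t_interval(13.84, 35.804, T_CRIT)

print("slope CI:", tuple(round(v, 3) for v in slope_ci))
print("intercept CI:", tuple(round(v, 3) for v in intercept_ci))
```

Because the intercept interval contains zero while the slope interval does not, the same calculation also previews the conclusions of the corresponding two-sided tests at the 5% level.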

d. What is your estimate for the error variance $\sigma^2$?

$\hat{\sigma}^2 = 109.4^2 = 11968.36$

e. Is there a significant relationship between Sunday circulation and daily
circulation? Justify your answer by a statistical test. Indicate what hypothesis
you are testing and your conclusion.

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$T = 18.93$
$p < 2 \times 10^{-16}$

Assuming our model assumptions hold, since $p < 2 \times 10^{-16}$ is far below
$\alpha = 0.05$, we have significant evidence to reject the null hypothesis at
the significance level $\alpha = 0.05$.
Therefore, we conclude that the slope of the regression line is not equal to zero
and that there is a significant relationship between Sunday circulation and
daily circulation.
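
As a rough check on the reported test statistic, the t value for the slope is the estimate divided by its standard error. The sketch below uses the rounded values quoted in part (c), so it reproduces the statistic only up to rounding.

```python
# Rounded values quoted in part (c) of this solution
slope_estimate = 1.34
slope_std_error = 0.0708
t_critical = 2.037  # critical value used above for alpha = 0.05

# Test statistic for H0: beta_1 = 0
t_stat = slope_estimate / slope_std_error

# Two-sided rejection rule at the 5% level
reject_null = abs(t_stat) > t_critical

print(round(t_stat, 2), reject_null)
```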

f. What proportion of the variability in Sunday circulation is accounted for by
daily circulation?

Assuming our model assumptions hold, $R^2 = 0.9181$, so 91.81% of the
variability in Sunday circulation can be accounted for by daily circulation.
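
In simple linear regression the coefficient of determination can be recovered from the slope's t statistic through the identity R^2 = T^2 / (T^2 + n - 2). The sketch below applies it to the values reported in this problem (T = 18.93, n = 34) as a consistency check; it is not how R computes the quantity.

```python
# Values reported in this solution
t_stat = 18.93        # t statistic for the slope, part (e)
n_obs = 34            # number of newspapers in the sample
df_resid = n_obs - 2  # residual degrees of freedom

# Identity valid for simple linear regression with an intercept
r_squared = t_stat**2 / (t_stat**2 + df_resid)

print(round(r_squared, 3))
```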

g. Provide an interval estimate (based on 95% level) for the average Sunday
circulation of newspapers with daily circulation of 500,000.

Assuming the model assumptions hold, using R we find an interval of
(644.195, 723.191).

h. The particular newspaper that is considering a Sunday edition has a daily
circulation of 500,000. Provide an interval estimate (based on 95% level) for
the predicted Sunday circulation of this paper. How does this interval differ
from that given in (g)?

Assuming the model assumptions hold, using R we find a prediction interval of
(457.337, 910.049), centered at the fitted value 683.693. This interval is wider
than that in (g) because the standard error for predicting an individual
response is larger than the standard error for estimating the mean response.
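
The difference between the two intervals can be made explicit with the standard formulas for simple linear regression. The notation here is introduced for illustration and does not appear in the original solution: $x_0$ is the daily circulation of interest, $\bar{x}$ the sample mean of daily circulation, $S_{xx} = \sum (x_i - \bar{x})^2$, and $\hat{\sigma}$ the residual standard error.

```latex
% Standard error for the mean response at x_0
SE\big(\hat{\mu}_0\big) = \hat{\sigma}\,\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}

% Standard error for predicting an individual response at x_0
SE\big(\hat{y}_0\big) = \hat{\sigma}\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
```

Both intervals are centered at the same fitted value and use the same t critical value, so the extra 1 under the square root is what makes the prediction interval wider. Both standard errors also grow as $x_0$ moves away from $\bar{x}$.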

i. Another newspaper being considered as a candidate for a Sunday edition has a
daily circulation of 2,000,000. Provide an interval estimate for the predicted
Sunday circulation for this paper. Do you think it is likely to be accurate?

We should not predict beyond the range of our predictors; we cannot extrapolate
our regression models. The prediction could be accurate; however, we do not know
whether the relationship between the variables holds beyond the range of the
data used to fit our model.

2. One may wonder if people of similar heights tend to marry each other. For this
purpose, a sample of newly married couples was selected. Let X be the height of
the husband and Y be the height of the wife. The heights (in centimeters) of
husbands and wives are found in the file named (P052.txt).

a. Compute the covariance between the heights of the husbands and wives.

The covariance between husbands' heights and wives' heights is 69.413 cm².

b. What would the covariance be if heights were measured in inches rather than
in centimeters?

$69.413\ \text{cm}^2 \times \left(\frac{1\ \text{in}}{2.54\ \text{cm}}\right)^2 = 10.759\ \text{in}^2$

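
Because covariance carries the product of the two variables' units, converting both heights from centimeters to inches divides it by 2.54 squared. A minimal sketch of that conversion, using the covariance reported in part (a):

```python
CM_PER_INCH = 2.54

# Covariance reported in part (a), in square centimeters
cov_cm2 = 69.413

# Each height is divided by 2.54, so the covariance is divided by 2.54**2
cov_in2 = cov_cm2 / CM_PER_INCH**2

print(round(cov_in2, 3))
```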
c. Compute the correlation coefficient between the heights of the husband and
wife.

r = 0.763

d. What would the correlation be if heights were measured in inches rather than
in centimeters?

The correlation would remain 0.763 because correlation is a unitless measure.

e. What would the correlation be if every man married a woman exactly 5
centimeters shorter than him?

The correlation would be 1. We would have the model formula
WifeHeight = -5 + HusbandHeight, and every value would fall exactly on this
line, so the heights would be perfectly correlated and exhibit a perfect linear
pattern.
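
Both properties, invariance to a change of units and a correlation of exactly 1 under a constant shift, can be illustrated numerically. The heights below are hypothetical values made up for this sketch, not the P052.txt data, and `pearson_r` is a helper written for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights in centimeters (NOT the P052.txt data)
husband_cm = [165.0, 172.5, 180.0, 158.0, 190.5, 176.0]
wife_cm = [160.0, 163.5, 171.0, 155.0, 170.0, 168.5]

# Part (d): rescaling both variables to inches leaves r unchanged
r_cm = pearson_r(husband_cm, wife_cm)
r_in = pearson_r([h / 2.54 for h in husband_cm],
                 [w / 2.54 for w in wife_cm])

# Part (e): every wife exactly 5 cm shorter than her husband
wife_shifted = [h - 5.0 for h in husband_cm]
r_shifted = pearson_r(husband_cm, wife_shifted)

print(r_cm, r_in, r_shifted)
```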

f. We wish to fit a regression model relating the heights of husbands and wives.
Which one of the two variables would you choose as the response variable?
Justify your answer.

I'd choose the wife's height as the response variable. The prior question
describes a man choosing a wife exactly 5 cm shorter than himself, which makes
the husband's height the predictor variable and the wife's height the response.

g. Using your choice of the response variable in part (f), test the null hypothesis
that the slope is zero.
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$T = 11.458$
$p < 2 \times 10^{-16}$

Assuming our model assumptions hold, since $p < 2 \times 10^{-16}$ is far below
$\alpha = 0.05$, we have significant evidence to reject the null hypothesis at
the significance level $\alpha = 0.05$. Therefore,
we conclude that the slope is not equal to zero and that there is a significant
relationship between the husband's height and the wife's height.

h. Using your choice of the response variable in part (f), test the null hypothesis
that the intercept is zero.

$H_0: \beta_0 = 0$
$H_1: \beta_0 \neq 0$
$T = 3.933$
$p = 0.000161$

Assuming the model assumptions hold, since $p = 0.000161 < 0.05$ we have
significant evidence to reject the null hypothesis at the significance level
$\alpha = 0.05$. Therefore, we conclude that the intercept is not equal to zero.

i. What is the coefficient of determination for the model you fitted in (f) and
interpret its value? How is this coefficient of determination related to the
correlation coefficient in part (c)?

$R^2 = 0.5828$. Assuming the model assumptions hold, 58.28% of the variability in
the wife's height can be explained by the husband's height. In simple linear
regression $R^2$ is the square of the correlation coefficient from part (c):
$0.763^2 \approx 0.582$.

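j. If Y and X were reversed in the above regression, what would you expect $R^2$ to
be? Why?

I would expect $R^2 = 0.5828$, because $R^2 = r^2$ and
$r = \frac{\mathrm{cov}(x, y)}{s_x s_y}$, where $\mathrm{cov}(x, y) = \mathrm{cov}(y, x)$.
So we know
$R^2 = r^2 = \left(\frac{\mathrm{cov}(x, y)}{s_x s_y}\right)^2 = \left(\frac{\mathrm{cov}(y, x)}{s_y s_x}\right)^2 = r^2_{\text{reversed}} = R^2_{\text{reversed}}$.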
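
The symmetry argument can also be checked numerically by fitting the least-squares line in both directions and comparing the two values of R^2. The data below are hypothetical heights invented for this sketch (not the P052.txt data), and `r_squared` is an illustrative helper.

```python
def r_squared(x, y):
    """R^2 of the least-squares line predicting y from x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - y_bar) ** 2 for b in y)
    return 1.0 - sse / sst

# Hypothetical heights in centimeters (NOT the P052.txt data)
husband = [165.0, 172.5, 180.0, 158.0, 190.5, 176.0]
wife = [160.0, 163.5, 171.0, 155.0, 170.0, 168.5]

r2_wife_on_husband = r_squared(husband, wife)
r2_husband_on_wife = r_squared(wife, husband)

print(r2_wife_on_husband, r2_husband_on_wife)
```

The fitted slopes and intercepts differ between the two regressions; only R^2 is shared.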

3. Name one or more graphs that can be used to validate each of the following
assumptions. For each graph, sketch an example where the corresponding
assumption is valid and an example where the assumption is clearly invalid.

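a. There is a linear relationship between the response and predictor variables.

Graph: scatter plot of the response against the predictor.
[Sketch: scatter plot showing a linear relationship]
[Sketch: scatter plot showing a non-linear relationship]

b. The error terms have constant variance.

Graph: residual plot (residuals against fitted values).
[Sketch: residual plot showing constant variance]
[Sketch: residual plot showing non-constant variance]

c. The error terms are normally distributed.

Graph: Q-Q plot of the residuals.
[Sketch: Q-Q plot where the error terms are normally distributed]
[Sketch: Q-Q plot where the error terms are not normally distributed]

d. The observations are equally influential on least squares results.

Graph: index plot of Cook's distance.
[Sketch: index plot where the observations are equally influential]
[Sketch: index plot where the observations are not equally influential]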

4. The Expanded Computer Repair Times Data: Length of Service Calls (Minutes)
and Number of Units Repaired (Units). You can find the data in the file named
"P124.txt".

a. Fit a linear regression model relating Minutes to Units.

A linear regression model was fit using the lm command in R.
Our fitted model is of the form:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

The coefficients calculated by R give us a model of:

$\widehat{\text{Minutes}} = 37.21 + 9.97 \times \text{Units}$
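
To show how the fitted line is used, the following Python sketch evaluates it for a given number of units. The coefficients are the ones reported above; the function name `predict_minutes` is a hypothetical helper, and the original fit was done with R's lm rather than this code.

```python
# Coefficients reported above for the fitted line
INTERCEPT = 37.21  # predicted minutes when no units are repaired
SLOPE = 9.97       # additional minutes per additional unit repaired

def predict_minutes(units):
    """Predicted length of a service call (minutes) from the fitted line."""
    return INTERCEPT + SLOPE * units

# Example: predicted call length for a call repairing 10 units
print(round(predict_minutes(10), 2))
```

Part (b) argues that several regression assumptions are violated for these data, so predictions from this straight line should be read with that caveat in mind.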

b. Check each of the standard regression assumptions and indicate which
assumption(s) seems to be violated.

1) It is a linear model.
To check this assumption we will look at the scatterplot of length of call vs
units repaired to see if a linear model appears appropriate.

The scatterplot appears to show a nonlinear, specifically exponential,
relationship between the variables. This is confirmed by the plot of our model
residuals vs fits, which shows a non-random pattern.

2) The model is additive in xs and error terms.
The least-squares model fit by R, which minimizes the sum of squared errors,
is by construction additive in the xs and the error terms.

3) The unknown parameters $\beta_0, \ldots, \beta_p$ are constant.

Since our data exhibit an apparent exponential relationship, we would
expect the slope of our model, $\beta_1$, to be increasing, i.e. nonconstant.

4) The predictors are fixed, non-random


The predictors come from the number of units repaired which is a fixed
value from which we can make a prediction.

5) The error terms $\epsilon_i$ are normally distributed.

Consider the Q-Q plot of our regression model.
The normal Q-Q plot of the residuals implies that the errors do not follow a
normal distribution.

6) The error terms have constant variance.

Looking at the plot of the residuals vs fits, it appears that there is unequal
variance: we see a clear pattern in the residuals, with the variance changing
as the fitted value increases.

7) Observations are equally reliable and influential without outliers.

The plot of Cook's distance shows the observations are not equally
influential. It appears that the last observation is much more influential
than the others.
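
For reference, Cook's distance in simple linear regression can be computed directly from the residuals and leverages. The sketch below uses a small hypothetical data set (not the P124.txt data) in which the last observation has both high leverage and a large deviation from the trend of the others; `cooks_distance` is an illustrative helper, not the R function used for the plot.

```python
def cooks_distance(x, y):
    """Cook's distance for each observation in a simple linear regression."""
    n = len(x)
    p = 2  # number of estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Residual variance estimate with n - p degrees of freedom
    s2 = sum(e * e for e in residuals) / (n - p)
    # Leverage of each observation
    leverage = [1.0 / n + (a - x_bar) ** 2 / sxx for a in x]

    # D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    return [e * e / (p * s2) * h / (1.0 - h) ** 2
            for e, h in zip(residuals, leverage)]

# Hypothetical data (NOT the P124.txt data): the last point sits far from the rest
units = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
minutes = [15.2, 24.8, 35.1, 45.3, 54.9, 64.7, 75.2, 85.0, 94.8, 300.0]

distances = cooks_distance(units, minutes)
most_influential = max(range(len(distances)), key=lambda i: distances[i])

print(most_influential, [round(d, 3) for d in distances])
```

Plotting these distances against the observation index gives the kind of index plot referred to above; a single bar towering over the others signals unequal influence.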
