Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Problem Set 4 Solutions

Statistics 104
Due March 26, 2020 at 11:59 pm

Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If
you do collaborate with classmates on a problem, please list your collaborators on your solution.

Problem 1.
A company that produces coffee for use in commercial machines monitors the caffeine content in
its coffee. The company selects 35 8-oz samples each hour from its production line to analyze. The
samples collected one morning between 8:00 - 9:00 am contained on average 96.1 mg of caffeine,
with standard deviation 1.2 mg.
a) Compute and interpret a 95% confidence interval for mean caffeine content based on the
collected data.
With 95% confidence, the interval (95.69, 96.51) mg contains the mean caffeine content of
coffee produced by the company.
#use r as a calculator
x.bar = 96.1
s = 1.2
n = 35
alpha = 0.05

t.star = qt(1-alpha/2, df = n - 1)
m = t.star* (s/sqrt(n))

x.bar - m; x.bar + m

## [1] 95.68779
## [1] 96.51221
b) According to production standards, the mean amount of caffeine content per 8 ounces
should be no more than 95 mg. An overly high caffeine content indicates that the coffee
beans have not been roasted long enough.
Conduct a formal hypothesis test to investigate whether production standards are being met,
based on the observed data. Summarize your findings to the CEO using language accessible
to someone who has not taken a statistics course and make a recommendation as to whether
an adjustment needs to be made to the bean roasting time.

1
Test H0 : µ = 95 mg against HA : µ > 95 mg. Let α = 0.05. The one-sided p-value is 2.42×10−6 .
Based on these data, an adjustment needs to be made to the bean roasting time. The prob-
ability of observing a set of samples with mean caffeine content of 96.1 mg if the mean
caffeine content of all coffee being produced were actually 95 mg is effectively zero. These
data strongly suggest that the mean caffeine content of all coffee is higher than 95 mg; the
coffee bean roasting time should be increased.
#set parameters
mu.0 = 95

#calculate p-value
t.stat = (x.bar - mu.0)/(s/sqrt(n))
pt(t.stat, df = n - 1, lower.tail = F)

## [1] 2.422591e-06
c) A set of samples collected between 10:00 - 11:00 am on the same day has average caffeine
content of 95.3 mg, with standard deviation 1.1 mg. Based on observing this data, would
you change your recommendation in part c)? Explain your answer.
The samples collected between 10:00 - 11:00 am result in a confidence interval of (94.92,
95.68) mg for the population mean caffeine content, and a p-value of 0.058.
While this set of samples does not provide as strong evidence against the null hypothesis
as the earlier set, they still suggest that the mean caffeine content is higher than 95 mg;
the sample mean is higher than 95 mg and a p-value of 0.058 is smaller than α = 0.05, but
only slightly so. There is only a 5.8% chance of observing a sample mean of 95.3 mg if the
population mean is actually 95 mg. Thus, based on observing this data, I would not change
the earlier recommendation.
#use r as a calculator
x.bar = 95.3
s = 1.1
n = 35
alpha = 0.05

t.star = qt(1-alpha/2, df = n - 1)
m = t.star* s/sqrt(n)

x.bar - m; x.bar + m

## [1] 94.92214
## [1] 95.67786
#calculate p-value
t.stat = (x.bar - mu.0)/(s/sqrt(n))
pt(t.stat, df = n - 1, lower.tail = F)

## [1] 0.05794276

2
Problem 2.
The General Social Survey (GSS) is a sociological survey used to collect data on demographic char-
acteristics and attitudes of residents of the United States. In 2010, the survey collected responses
from 1,154 US residents. The survey is conducted face-to-face with an in-person interview of
a randomly selected sample of adults. One of the questions on the survey is “After an average
workday, about how many hours do you have to relax or pursue activities that you enjoy?” A 95%
confidence interval from the 2010 GSS survey for the collected answers is 3.53 to 3.83 hours.
Identify each of the following statements as true or false. Explain your answers.
a) If the researchers wanted to report a confidence interval with a smaller margin of error based
on the same sample of 1,154 Americans, the confidence interval would be larger.
False. The confidence interval would be smaller, since the margin of error is half the width
of a confidence interval (the quantity t ? × √sn ).

b) We can be 95% confident that the interval (3.53, 3.83) hours contains the mean hours that
the sampled adults have for leisure time after an average workday.
False. This statement would be true if the phrase "the sampled adults" were modified to
"U.S. adults". Confidence intervals are statements about the population parameter µ, not x.
c) The confidence interval of (3.53, 3.83) hours contains the mean hours that U.S. adults have
for leisure time after an average workday.
False. There is a 5% chance that any 95% confidence interval does not contain the true
population mean hours that U.S. adults have for leisure time after an average workday; this
is expressed by the error probability α.
d) The survey provides statistically significant evidence at the α = 0.05 significance level that
the mean hours U.S. adults have for leisure time after the average workday is 3.6 hours.
False. Since the value 3.6 hours is contained in the interval, a corresponding hypothesis test
of H0 : µ = 3.6 hours would not result in rejecting the null hypothesis. However, failure to
reject the null hypothesis is not equivalent to accepting the null hypothesis.
e) There is a 5% chance that the interval (3.53, 3.83) hours does not contain the mean hours
that U.S. adults have for leisure time after an average workday.
False. A fixed interval either contains or does not contain the population parameter.
f) The interval (3.53, 3.83) hours provides evidence at the α = 0.05 significance level that U.S.
adults, on average, have fewer than 3.9 hours of leisure time after a typical workday.
True. The interval corresponds to a two-sided test, with H0 : µ = 3.9 hours, HA : µ , 3.9
hours, and α = 0.05. Since the µ0 of 3.9 days is outside the interval, the sample provides
sufficient evidence to reject the null hypothesis. The interval consists of values lower than
3.9 hours, which suggests that µ is less than 3.9 hours.

3
Problem 3.
After graduating, you decide to apply for positions as a statistical consultant. A friend who grad-
uated a few years prior and already works in the field has sent you the following set of scenarios.
If she is impressed by your responses, she will recommend your name to the hiring team.
For each part, provide a clear explanation and limit your answer to no more than five sentences.
a) A pharmaceutical company is planning to run an initial trial on a potential new cholesterol
lowering drug compound to assess whether they should continue to invest money on its
development. They would like your recommendation as to whether the initial trial should
be run at the α = 0.05 or α = 0.10 significance level.
The initial trial should be run at the higher significance level, α = 0.10. In this context, a
Type I error represents investing in a compound that turns out to be ineffective, while a
Type II error represents failing to invest in a compound that turns out to be a promising
drug. At the early stages, it is better to avoid screening out too many candidates; in other
words, the consequences of a Type II error are relatively more serious than those of a Type I
error. Later trials in the approval process can be more stringent as to testing whether a drug
is safe and/or effective.
b) Melatonin pills are a popular remedy for providing insomnia relief. A company developed
a new formulation of melatonin pill and conducted a trial on a random sample of 2,000
American adults who were already experiencing insomnia and using melatonin pills. A
95% confidence interval for mean increase in sleep hours per typical night was calculated
based on data from the trial: (0.10, 0.15) hours. Assess whether the evidence suggests the
new formulation is a substantial improvement over melatonin pills currently available on
the market.
The evidence does not suggest that the new formulation is a substantial improvement over
currently available melatonin pills. From the confidence interval, the range of plausible
values for the mean increase in sleep duration per typical night is (6, 9) minutes. Even if
the mean increase in sleep duration were on the upper end of the interval at, say, 9 minutes,
this does not represent a "substantial" improvement on the scale of typical nightly sleep
duration (7 - 9 hours, or 420 - 540 minutes). This is an example of a context where a small
effect size can be statistically significant due to large sample size. The results, however, are
not practically significant.

4
Problem 4.
A hospital administrator has been asked by her supervisor to assess whether waiting time in the
emergency room (ER) has changed from last year, when average wait time was 127 minutes. The
administrator collects a simple random sample of 64 patients and records the time between when
they checked in at the ER until they were first seen by a doctor; the average wait time is 137
minutes, with standard deviation 39 minutes.
a) Compute and interpret a 95% confidence interval for mean ER wait time at the hospital.
Based on the interval, is mean ER wait time statistically significantly different from 127
minutes at the α = 0.05 level?
With 95% confidence, the interval (127.26, 146.74) minutes captures the mean ER wait time
at the hospital. This represents plausible values for mean ER wait time at the hospital based
on observing a sample mean of 127 minutes from a sample of size 64.
Yes, the difference in observed mean ER wait time from 127 minutes is statistically signifi-
cant at the α = 0.05 level; the 95% confidence interval does not contain 127 minutes.

!
s
x ± tdf , 1−(α/2) × √
n
!
39
137 ± 1.998 × √
64

(127.26, 146.74) minutes

#use r as a calculator
x.bar = 137
s = 39
n = 64

t.star = qt(0.975, df = n - 1)

m = t.star * s/sqrt(n)

x.bar - m; x.bar + m

## [1] 127.2581
## [1] 146.7419
b) Would the conclusion in part a) change if the significance level were changed to α = 0.01?
Yes, the conclusion in part a) would change if the significance level were changed to α = 0.01.
Since the 99% confidence interval contains 127 minutes, there is not sufficient evidence to
reject H0 : µ = 127 minutes at α = 0.01 and we would conclude that there is not significant
evidence to suggest that mean wait time has changed from 127 minutes.

!
s
x ± tdf , 1−(α/2) × √
n

5
!
39
137 ± 2.66 × √
64

(124.05, 149.95) minutes

#use r as a calculator
x.bar = 137
s = 39
n = 64

t.star = qt(0.995, df = n - 1)

m = t.star * s/sqrt(n)

t.star

## [1] 2.656145
x.bar - m; x.bar + m

## [1] 124.0513
## [1] 149.9487
c) Suppose that upon seeing the results from part a), the supervisor criticizes the hospital
administrator on the basis that ER wait times have increased greatly from last year and the
administrator must be at fault. Present a brief argument in favor of the administrator; be
sure to fully explain your reasoning.
From a statistical perspective, it is valid to criticize the precision of the interval estimate.
The 95% confidence interval spans 20 minutes, which seems too imprecise to be informative
given that the supervisor seems concerned that mean wait time increased by 10 minutes
from last year. A reasonable approach for obtaining a more informative confidence interval
is to take a larger sample size.
From a contextual perspective, a hypothesis test does not provide any evidence of the admin-
istrator being at fault. For example, perhaps increased wait times are due to more patients
checking into the ER this year than last year, or fewer doctors available to staff the ER; it
may not be a fair comparison to expect that the mean wait time should still be 127 minutes
if some conditions have changed.

6
Problem 5.
The file nc.csv contains data on 1,000 randomly sampled births from birth records released by
the state of North Carolina in 2004. The following variables are contained in the dataset:
Variable Description
fage Father’s age, in years
mage Mother’s age, in years
mature Whether the mother was considered a younger mom or mature mom
weeks Length of pregnancy, in weeks
premie Whether the birth was full-term (full term) or premature (premie)
visits Number of hospital visits during pregnancy
marital Whether the mother was married at the time of the birth
gained Weight gained by the mother during pregnancy, in pounds
weight Weight of the baby at birth, in pounds
lowbirthweight Whether the baby was classified as low birthweight or not
gender Sex of the baby, female or male
habit Whether the mother was a nonsmoker or smoker
whitemom Whether the mother was white or nonwhite
a) From the data, compute and interpret a 95% confidence interval for the mean birthweight
of babies born in North Carolina in 2004.
With 95% confidence, the interval (7.01, 7.19) pounds captures the mean birthweight of
babies born in North Carolina in 2004.
#load the data
nc = read.csv("datasets/nc.csv")

#compute confidence interval


t.test(nc$weight, conf.level = 0.95)$conf.int

## [1] 7.007368 7.194632


## attr(,"conf.level")
## [1] 0.95
b) From national census figures, the mean birthweight for full-term births is 7.5 pounds. Con-
duct a formal hypothesis test to assess whether mean birthweight for premature births in
North Carolina is different from 7.5 pounds. Summarize your findings; do they cohere with
your expectations?
Test H0 : µ = 7.5 pounds against HA : µ , 7.5 pounds. Let α = 0.05. The t-statistic is -
14.85, with p-value less than 0.001. Since p < α, there is sufficient evidence to reject the null
and accept the alternative that mean birthweight for premature births in North Carolina
is different from 7.5 pounds. Since the sample mean is less than 7.5 pounds, the evidence
suggests that mean birthweight for pre-term births in North Carolina is less than 7.5 pounds.
This coheres with expectations because pre-term birth is a common cause of low birthweight.
#conduct test
t.test(nc$weight[nc$premie == "premie"], mu = 7.5)

##

7
## One Sample t-test
##
## data: nc$weight[nc$premie == "premie"]
## t = -14.845, df = 151, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 7.5
## 95 percent confidence interval:
## 4.812778 5.444064
## sample estimates:
## mean of x
## 5.128421
c) Preterm birth occurs disproportionately among communities with lower socieconomic sta-
tus. Suppose an investigator takes a random sample of 40 birth weights from several teach-
ing hospitals located in an inner-city neighborhood. The results of the analysis will inform
whether an income support and maternal health program (providing some medical services
free of cost) will be started at the neighborhood community center.
i. State the hypotheses for comparing mean birth weight for babies born in these teaching
hospitals to 7.5 pounds. Justify your choice of alternative hypothesis.
H0 : µ = 7.5 pounds. HA : µ < 7.5 pounds. Given the problem context, is reasonable to
expect that if the average weight of babies born in these hospitals is different from 7.5
pounds, it will be lower. A one-sided alternative is also appropriate since concluding
"not different than µ" and "greater than µ" result in the same decision—in both cases,
the analysis would conclude that the educational program should not be started.
ii. Discuss whether an α of 0.10 or 0.01 might be more appropriate than the standard
value of 0.05.
Let α = 0.10. Choosing a higher α reduces the chances of failing to reject the null
hypothesis when the null is false (Type II error); starting the support program when
it is not needed is less harmful than not starting the program when it is needed. An
argument for choosing a small significance level like α = 0.01 could also be reasonable;
resources are limited, and perhaps a city can only afford to start such programs in
neighborhoods where the observed difference in birthweights from µ is very extreme.

8
Problem 6.
In each of the following scenarios,
i. Identify whether the data are paired or independent.
ii. Consider whether a design that produces the “other type” of two-sample data might be
preferable (for example, if the data are paired, consider whether a design that produces
independent data might be preferable). If so, briefly describe the alternative study design.
If not, briefly explain why the design presented in the question statement is preferable.
For each scenario, limit your description/explanation to at most five sentences.
a) Does salary differ by gender for full professors at American universities? Draw a random
sample of 50 female full professors and 50 male full professors, and compare their salaries.
i. Independent.
ii. Since a major factor influencing salary is department/field, pairing by depart-
ment/field could reduce variability in salary attributable to working in relatively
high-paying fields (e.g., law) versus low-paying fields (e.g., history). This design poses
some challenges since it requires two individuals (one male, one female) to be sampled
for 50 fields. Not all fields will be represented across all American universities, so
it would be advisable to limit the population of interest to, for example, Ivy League
universities, which offer a similar range of departments/fields. Another way to deal
with variability in salary attributable to department/field is to conduct the original
design separately within particular fields; e.g., draw a random sample of female full
professors and male full professors working in computer science.
b) Does Vitamin E increase artery thickness? Measure artery thickness for a group of patients
before they start taking Vitamin E regularly for two years. Compare initial artery thickness
to artery thickness at the end of the study.
i. Paired.
ii. An alternative design is to randomize individuals into treatment groups, where one
group takes Vitamin E while the other group takes a placebo, then compare mean artery
thickness between groups at the end of the study. The paired design is preferable as
it allows each individual to act as their own control. There will be less noise present
when analyzing paired differences than when comparing two independent groups; in
the independent group design, there will be variability attributable to natural variation
in artery thickness between individuals and not just variability attributable to taking
Vitamin E.
Note: another possible study design that produces independent group data is to ran-
domize individuals into treatment groups (Vitamin E versus placebo), then measure
change in artery thickness as the outcome. This design addresses a potential weakness
of the paired design with regard to time; for example, if artery thickness naturally de-
creases over time, observed change after two years may be primarily due to time rather
than starting Vitamin E.

9
c) Is a Mediterranean diet effective for weight loss? Compare the weights of individuals before
and after going on the diet regimen.
i. Paired.
ii. An alternative design is to randomize individuals into treatment groups, where one
group is instructed to adhere to the Mediterranean diet while the other group receives
standard advice about weight loss (or some other conventional diet). This may be
preferable to the paired design, because it has more potential to isolate the effect of
the Mediterranean diet relative to a standard diet. Since individuals who are adhering
to a Mediterranean diet (or any diet) are likely to exhibit other health behaviors (e.g.,
exercising regularly) that promote weight loss, the paired design may pick up change
in weight attributable to factors other than adherence to the Mediterranean diet. Note:
This answer differs from part b) because while it is typical for people to purposefully
alter behavior so as to lose weight, it is not the case that people can monitor artery
thickness on their own and/or have a general idea of how to improve artery thickness.
d) Do Intel’s stock and Southwest Airlines’ stock have similar rates of return? Take a random
sample of 60 days, and compare Intel’s and Southwest’s stock on those days.
i. Paired.
ii. A paired design is preferable to an independent design because a paired design can
account for the dependency of returns on time/day. Comparing how well two stocks
perform by analyzing the difference in performance for each day is a more reasonable
approach than comparing average performance over a set of (different) days.

Problem 7.
A possible important environmental determinant of lung function in children is amount of
cigarette smoking in the home. Suppose this question is studied by selecting two groups: Group
1 consists of 23 nonsmoking children 5-9 years of age with two parents who smoke, and have
mean forced expiratory volume (FEV) of 2.1 L and a standard deviation of 0.7 L; group 2 consists
of 20 nonsmoking children of comparable age, with two parents who do not smoke, and have
mean FEV of 2.3 L and a standard deviation of 0.4 L.
a) Do the children of two parents who smoke have mean FEV different from the children of
two parents who do not smoke? Conduct a formal hypothesis test. Use the min(n1 −1, n2 −1)
approximation for the degrees of freedom.
Conduct a two-sample t-test for independent data.
Let µ1 equal the population mean FEV for children whose parents do not smoke, and µ2
equal the population mean FEV for children whose parents do smoke. Test H0 : µ1 = µ2
against HA : µ1 , µ2 . Let α = 0.05.
Calculate the t-statistic:

(x1 − x2 ) − (µ1 − µ2 ) 2.3 − 2.1


t= s =r = 1.17
s12 s22 0.42 0.72
+ +
n1 n2 20 23

10
#use r as a calculator
x.bar.1 = 2.3
s.1 = 0.4
n.1 = 20
x.bar.2 = 2.1
s.2 = 0.7
n.2 = 23

#hypothesis test
t.num = x.bar.1 - x.bar.2
t.den = sqrt((s.1^2 / n.1) + (s.2^2 / n.2))
t.stat = t.num/t.den
t.stat

## [1] 1.168326
Calculate the p-value. Since df can be approximated as n − 1 for the smaller of the two
groups, df = 19.
2*pt(1.17, 19, lower.tail=FALSE)

## [1] 0.2564685
The p-value is 0.26. Since p > α, there is insufficient evidence to reject H0 at the 0.05 signif-
icance level. We fail to reject the null hypothesis that the population mean FEV for children
whose parents do smoke is the same as the population mean FEV for children whose par-
ents do smoke. The observed difference between mean FEV in the two groups may be due
to random chance.
b) Summarize your findings in language that someone who has not taken a statistics course
would understand.
An analysis was done to investigate whether the mean forced expiratory volume (FEV) for
children of two parents who smoke is different from mean FEV for children of two parents
who do not smoke. When a random sample of 23 children whose parents do smoke were
measured, their average FEV was 2.1 L; the mean FEV for a random sample of 20 children
with nonsmoking parents was 2.3 L. There is about a 26% chance of observing a difference
of 0.2 L between the two groups if there is actually no difference in FEV between children
of smoking versus nonsmoking parents. Thus, from this data, it is not possible to conclude
that exposure to cigarette smoke in the home is associated with decreased lung function in
children.

11
c) Suppose you have been asked to assess the design of a new study addressing the same ques-
tion. In the new study, researchers plan to recruit 20 nonsmoking children 5-9 years of age
with two parents who smoke and 20 nonsmoking children 5-9 years of age with two parents
who do not smoke; children will be matched based on similar household income. Briefly ex-
plain whether you believe the new study has more potential to show an association between
smoke exposure and lung function than the original study.
Yes, the new study has more potential to show an association between smoke exposure and
lung function than the original study. A paired study design is more efficient than an in-
dependent two-group design when the matching variable is a potential confounder. Dif-
ferences in household income might be responsible for systematic differences in FEV that
are separate from possible effects of cigarette smoke; for example, perhaps children in more
affluent households have better access to healthcare and generally higher FEV. Differences
in household income may also be associated with the likelihood of having two parents who
smoke. Matching on a potential confounder like household income improves the power of a
test by reducing variability (i.e., noise) in the data.

12

You might also like