Download as pdf or txt
Download as pdf or txt
You are on page 1of 41

Problem Set 8 Solutions

Statistics 104
Due December 03, 2019 at 10:30 am

Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If
you do collaborate with classmates on a problem, please list your collaborators on your solution.

Problem 1.
The Unit 9 lecture introduced a case study about the Challenger shuttle accident. A report pub-
lished by the commission responsible for investigating the accident determined that the cause of
the accident was due to the failure of an O-ring seal in the solid rocket motor due to the unusual
cold temperatures on the day of the launch (about 30.9 degrees Fahrenheit).
A risk analysis conducted prior to the launch had incorrectly concluded that “temperature data
[are] not conclusive on predicting primary O-ring blowby.” According to the commission, “A care-
ful analysis of the flight history of the O-ring performance would have revealed the correlation of
O-ring damage in low temperature.”
In 1989, a more rigorous analysis of the data used in the pre-Challenger risk assessment was pub-
lished in the Journal of the American Statistical Association.1 These data are available in the mcsm
package as the challenger dataset.
– oring: binary indicator for whether at least one O-ring failed, coded 1 for at least one failure,
and 0 for no failures
– temp: temperature at the time of launch, measured in degrees Fahrenheit.
Fit a logistic regression model to predict O-ring failure from temperature at time of launch.
a) Write the model equation estimated from the data.

" #
p̂(failure = 1|temp)
log = 15.043 − 0.232(temp)
1 − p̂(failure = 1|temp)

log(odds
[ of failure | temp) = 15.043 − 0.232(temp)

#load the data


library(mcsm)
data("challenger")
1 SR Dalal, Fowlkes EB, and Hoadley B. Risk analysis of the Space Shuttle: Pre-Challenger Prediction of Failure.
Journal of the American Statistical Association. 84: 408 (1989).

1
#fit the model
oring.fit = glm(oring ~ temp, data = challenger, family = "binomial")

b) Does the intercept have a meaningful interpretation in the context of the data?
Yes, the intercept has a meaningful interpretation in context of the data. It is possible that
temperatures can reach 0 degrees Fahrenheit; the intercept is the predicted log odds of O-
ring failure for temperature 0 degrees Fahrenheit.
c) Interpret the slope coefficient. Assess whether temperature is significantly associated with
O-ring failure.
An increase in 1 degree Fahrenheit in temperature is associated with a decrease of 0.232 in
the log odds of O-ring failure. The p-value of the slope coefficient is 0.03, which is less than
α = 0.05; there is evidence of a significant association between temperature at launch and
O-ring failure.
summary(oring.fit)

##
## Call:
## glm(formula = oring ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0611 -0.7613 -0.3783 0.4524 2.2175
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## temp -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
d) Calculate the odds of O-ring failure for a launch on a day with temperature 70 degrees
Fahrenheit. Is it more likely that O-ring failure will occur than not? Explain your answer.
The odds of O-ring failure for a launch on a day with temperature 70 degrees Fahrenheit is
0.299. Since the odds are less than 1, the more likely outcome is that there will not be O-ring
failure.
log.odds = predict(oring.fit, newdata = data.frame("temp" = 70))
exp(log.odds)

2
## 1
## 0.2986478
e) Calculate the odds ratio of O-ring failure comparing a launch on a day with temperature 60
degrees Fahrenheit to a day with temperature 75 degrees Fahrenheit.
The odds of O-ring failure for a launch that takes place on a day with temperature 75 degrees
Fahrenheit are over 32 times as large as the odds of O-ring failure for a launch that takes
place at 60 degrees Fahrenheit.
#OR approach
exp(oring.fit$coef[2]*(60 - 75))

## temp
## 32.53906
#calculate odds then OR
log.odds.75 = predict(oring.fit, newdata = data.frame("temp" = 75))
log.odds.60 = predict(oring.fit, newdata = data.frame("temp" = 60))
exp(log.odds.60)/exp(log.odds.75)

## 1
## 32.53906
f) According to the model, what is the predicted probability of O-ring failure for a launch
when the temperature is 30.9 degrees Fahrenheit?
According to the model, the predicted probability of O-ring failure for a launch when the
temperature is 30.9 degrees Fahrenheit is 0.9996; i.e., effectively 1.
log.odds = predict(oring.fit, newdata = data.frame("temp" = 30.9))
odds = exp(log.odds)
odds/(1 + odds)

## 1
## 0.9996178
g) Based on the work done in the previous parts of this question, comment on whether there
was reasonable evidence to postpone the Challenger launch. Be sure to reference relevant
numerical details and to address one important caveat for the result in part f).
There was reasonable evidence to postpone the *Challenger* launch. The available data
demonstrated statistically significant evidence of an association between lower temperature
at launch and the occurrence of O-ring failure (p = 0.03). Based on the model, the predicted
probability of O-ring failure for a launch taking place on such a cold day was effectively 1.
Even though this prediction represents extrapolation from the available data, it still repre-
sents a plausible reason for postponing; it was a risky decision to even consider launching
at temperatures for which data had not been observed.

3
Problem 2.
An individual is said to be anemic (or have anemia) if they do not have enough healthy red blood
cells to transport oxygen to tissues in the body. According to the US National Heart, Lung and
Blood Institute, anemia affects approximately 3 million Americans and is the most common blood
disorder in the United States. Anemia can have many causes, including internal bleeding or nutri-
tional deficiencies. Anemia is diagnosed by measuring the level of hemoglobin present in blood;
hemoglobin is the protein in red blood cells responsible for oxygen transport.
Loss of energy and easy fatigue are common symptoms of anemia. Individuals with mild ane-
mia may not recognize their symptoms. Detection of anemia typically occurs if a physician no-
tices a change in energy level and requests a blood test, or if a routine blood test indicates low
hemoglobin level. Anemia can be particularly serious in infants and young children, causing
impairments in mental, physical, and social development.
Anemia is considered a severe public health concern, especially in under-resourced parts of the
world where individuals have limited access to healthcare. In this problem, you will examine data
from a study examining the health status of 120 children living in a slum area of a large city in
Southeast Asia. The aim of the study was to estimate the prevalence of anemia and investigate
possible predictors of anemia. For children in the age range of this study, hemoglobin level less
than 10.5 g/dL constitutes a diagnosis of anemia. For low-income children in this age range in the
United States, the prevalence of anemia is approximately 15%.
The study collected demographic information in addition to measuring health variables. Family
income level was recorded as wealth quintile within the sample; for example, a value of 1 for the
wealth variable indicates that the child’s family income level was in the lowest 20% for families
with children participating in the study. Iron level was recorded via an assay in which negative
values indicate iron deficiency and positive values indicate adequate iron level.
Data from the study are in the file anemia.Rdata. The following table provides a list of the vari-
ables in the dataset and their descriptions.

Variable Description
sex sex, coded female for female and male for male
wealth relative measure of family income level
diarrhea coded Yes for at least one episode of diarrhea in the past two weeks, and No otherwise
whz standardized weight-for-height z-score relative to the national population
age age in months
iron blood iron level, in milligrams per kilogram (mg/kg)
hb blood hemoglobin level, in grams per deciliter (g/dL)
Use the data to answer the following questions.
a) Explore the data.
i. Describe the age and sex distribution of these children, with reference to appropriate
numerical and graphical summaries.
The ages of the observed children range between about 12 months and 24 months, with
a median age of 17 months. As shown by the boxplot and histogram, the distribution
of ages is relatively symmetric and there are no children who are very young or much
older relative to the other children in the data.

4
There are 54 females and 66 males in the data; 55% of the children are male.
#load the data
load("datasets/anemia.Rdata")

#age
par(mfrow = c(1, 2))
hist(anemia$age, main = "Age", xlab = "Age (months)", col = COL[1, 3])
boxplot(anemia$age, main = "Age", ylab = "Age (months)", col = COL[1, 3])

Age Age

12 14 16 18 20 22 24
20

Age (months)
Frequency

15
10
5
0

12 14 16 18 20 22 24

Age (months)

summary(anemia$age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 12.09 14.62 17.10 17.71 21.00 23.85
#sex
table(anemia$sex)

##
## female male
## 54 66
prop.table(table(anemia$sex))

##
## female male
## 0.45 0.55
ii. Children are considered substantially underweight if, for their height, they are in the
lower 5% of the distribution of weight. What percentage of children in this study sam-
ple are substantially underweight?
To be in the lower 5% of the weight distribution (relative to height), a child must have
weight-for-height z-score lower than -1.64, the value that marks off the lower 5% of
a standard normal distribution. 21 children in the dataset have weight-for-height z-

5
score lower than -1.64. Thus, 17.5% of children in the study sample are substantially
underweight.
under.weight = (anemia$whz < qnorm(0.05))
table(under.weight)

## under.weight
## FALSE TRUE
## 99 21
(sum(under.weight == TRUE))/length(under.weight)

## [1] 0.175
iii. What proportion of these children are iron-deficient?
A value of blood iron level that is negative is indicative of iron deficiency. 76 children
in the dataset have a negative value for blood iron level. Thus, 0.63 of the children in
the study sample are iron-deficient.
iron.deficient = (anemia$iron < 0)
table(iron.deficient)

## iron.deficient
## FALSE TRUE
## 44 76
(sum(iron.deficient == TRUE))/length(iron.deficient)

## [1] 0.6333333
iv. How does the prevalence of anemia in the sampled children compare with anemia
prevalence in low-income children in the United States?
Children in the age range of this study qualify as anemic with hemoglobin level less
than 10.5 g/dL. 81 children in this study sample have hemoglobin level less than 10.5
g/dL; thus, anemia prevalence in the sampled children is 67.5%. This is substantially
higher than anemia prevalence in low-income children in the United States (approxi-
mately 15%).
anemic = (anemia$hb < 10.5)
table(anemic)

## anemic
## FALSE TRUE
## 39 81
(sum(anemic == TRUE))/length(anemic)

## [1] 0.675
b) Calculate and interpret an appropriate measure of association between sex and presence of
anemia.
Either the relative risk or the odds ratio can be used as a measure of association between sex

6
and presence of anemia.
The risk ratio of anemia comparing males to females is 1.13. Males have a 13% higher risk
of being anemic than females.
The odds ratio of anemia comparing males to females is 1.46. The odds of a male having
anemia are 46% larger than the odds of a female having anemia.
library(epitools)

## Warning: package 'epitools' was built under R version 3.5.2


#create a table
anemia.sex.table = table(anemia$sex, anemic)
addmargins(anemia.sex.table)

## anemic
## FALSE TRUE Sum
## female 20 34 54
## male 19 47 66
## Sum 39 81 120
#calculate risk ratio
riskratio(anemia.sex.table)$measure

## risk ratio with 95% C.I.


## estimate lower upper
## female 1.000000 NA NA
## male 1.131016 0.8758422 1.460534
oddsratio(anemia.sex.table, method = "wald")$measure

## odds ratio with 95% C.I.


## estimate lower upper
## female 1.000000 NA NA
## male 1.455108 0.6754576 3.134675
c) Do the data support the claim that more than half of the female children in this slum area
are anemic? Justify your answer.
Conduct a one-sample test of proportions for whether the proportion of anemic individuals
among females is different from 0.50. The p-value is 0.08. There is insufficient evidence
at α = 0.05 to reject the null hypothesis that the half the female children in this area are
anemic. Thus, the data do not support the claim that more than half of the female children
in this region are anemic.
It is also defensible to conduct a one-sided test of the alternative HA : p > 0.50. The p-value
is 0.04 for the one-sided test, which leads to rejecting H0 at α = 0.05 (and failing to reject H0
at α = 0.025.)
#two-sided test
binom.test(x = sum(anemia$sex == "female" & anemic), n = sum(anemia$sex == "female"),
p = 0.50, alternative = "two.sided")$p.val

7
## [1] 0.07590473
#one-sided test
binom.test(x = sum(anemia$sex == "female" & anemic), n = sum(anemia$sex == "female"),
p = 0.50, alternative = "greater")$p.val

## [1] 0.03795236
d) Estimate the association between age and presence of anemia. Is an older child more or less
likely to be anemic than a younger child? Justify your answer.
An older child is less likely to be anemic than a younger child. As estimated from the data,
a one month increase in age is associated with a 0.084 decrease in the log odds of being
anemic, which is equal to a 9% decrease in the odds of being anemic. A decrease in the odds
of being anemic corresponds to a decrease in the probability of being anemic. The observed
association between age and presence of anemia may not be present at the population level,
since p = 0.12.
#add anemic logical to the dataframe
anemia$anemic = anemic

#fit a model
summary(glm(anemic ~ age, data = anemia, family = binomial(link = "logit")))

##
## Call:
## glm(formula = anemic ~ age, family = binomial(link = "logit"),
## data = anemia)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7172 -1.3492 0.7888 0.8798 1.0807
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.22983 1.00314 2.223 0.0262 *
## age -0.08377 0.05442 -1.539 0.1237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 151.34 on 119 degrees of freedom
## Residual deviance: 148.94 on 118 degrees of freedom
## AIC: 152.94
##
## Number of Fisher Scoring iterations: 4
e) Fit a model to investigate whether the association between age and presence of anemia dif-
fers by sex. Interpret the model coefficients and summarize your findings.

8
The intercept represents the estimated log odds of being anemic for a female of age 0
months; the intercept does not have a meaningful interpretation since the lowest observed
age in the data is 12 months.
The coefficient for sex represents the difference in intercepts between the model equations
specific to males and females. The age coefficient indicates that in females, an increase in
one month of age is associated with a decrease in the estimated log odds of being anemic
by 0.0053 (and a 0.53% decrease in the estimated odds of being anemic). The coefficient of
the interaction term indicates that in males, an increase in age is associated with a greater
decrease in the estimated log odds of being anemic than in females. In males, an increase in
age of one month is associated with a decrease in the estimated log odds of being anemic by
0.0053 + 0.160 = 0.1653 (and an 18% decrease in the estimated odds of being anemic).
The slope of the interaction term has p-value of 0.15. Since p < 0.05, there is not statistically
significant evidence of an interaction; i.e., there is not evidence that the association between
age and presence of anemia differs by sex at the population level.
#fit a model
summary(glm(anemic ~ age*sex, data = anemia, family = binomial(link = "logit")))

##
## Call:
## glm(formula = anemic ~ age * sex, family = binomial(link = "logit"),
## data = anemia)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0078 -1.3959 0.7309 0.9577 1.2073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.625542 1.381323 0.453 0.651
## age -0.005299 0.075459 -0.070 0.944
## sexmale 3.254487 2.054421 1.584 0.113
## age:sexmale -0.160325 0.111053 -1.444 0.149
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 151.34 on 119 degrees of freedom
## Residual deviance: 146.02 on 116 degrees of freedom
## AIC: 154.02
##
## Number of Fisher Scoring iterations: 4
f) Previous research has shown that iron deficiency is associated with anemia.
i. Compare the odds of being anemic for a child whose iron level is -1.48 mg/kg to one
whose iron level is 0.18 mg/kg.
The estimated odds of being anemic for a child with iron level -1.48 mg/kg are 26%
larger than the odds of being anemic for a child with iron level 0.18 mg/kg.

9
#fit a model
anemic.iron.model = glm(anemic ~ iron, data = anemia,
family = binomial(link = "logit"))

#calculate odds for child not deficient in iron


log.odds.notdeficient = predict(anemic.iron.model,
newdata = data.frame(iron = 0.18))
odds.notdeficient = exp(log.odds.notdeficient)

#calculate odds for child deficient in iron


log.odds.deficient = predict(anemic.iron.model,
newdata = data.frame(iron = -1.48))
odds.deficient = exp(log.odds.deficient)

#calculate odds ratio


odds.deficient/odds.notdeficient

## 1
## 1.257106
ii. Do these data provide evidence that iron level is significantly associated with the pres-
ence of anemia? Explain your answer.
Yes, these data provide evidence that iron level is significantly associated with presence
of anemia; specifically, an increase in iron level is associated with absence of anemia.
From a simple logistic regression predicting odds of anemia from iron level, an increase
in 1 mg/kg of body iron level is associated with a decrease of -.138 in the log odds of
being anemic (a 15% decrease in the odds of being anemic). The slope coefficient has
p-value 0.002 and is statistically significantly different from 0 at α = 0.05.
summary(anemic.iron.model)$coef

## Estimate Std. Error z value Pr(>|z|)


## (Intercept) 0.4486511 0.21545702 2.082323 0.037312977
## iron -0.1378388 0.04450386 -3.097231 0.001953373
iii. Does the association between iron level and anemia persist after adjusting for the pos-
sible confounding variables sex, age, weight to height z-score, family wealth quintile,
and recent episode of diarrhea? Explain your answer.
Yes, after adjusting for the stated potential confounders, the association between iron
level and anemia persists. When holding the values of the other variables in the model
constant, an increase in 1 mg/kg of body iron level is associated with a decrease in the
estimated log odds of anemia by 0.186 (a 20% decrease in the odds of being anemic).
Higher body iron level remains associated with absence of anemia. The slope coefficient
for anemia has p-value of 0.00056 and is statistically significant at α = 0.05.
#create wealth.quintile factor
anemia$wealth = factor(anemia$wealth, levels = 1:5,
labels = c("1st", "2nd", "3rd", "4th", "5th"))

10
summary(glm(anemic ~ iron + sex + age + whz + wealth + diarrhea,
data = anemia, family = binomial(link = "logit")))$coef

## Estimate Std. Error z value Pr(>|z|)


## (Intercept) 2.5701275 1.37772787 1.8654827 0.062113773
## iron -0.1863578 0.05403721 -3.4486934 0.000563306
## sexmale 0.4061745 0.47180294 0.8608986 0.389293890
## age -0.1827410 0.07099802 -2.5738892 0.010056248
## whz -0.3728506 0.23573150 -1.5816751 0.113723765
## wealth2nd -0.4269645 0.68691545 -0.6215677 0.534226167
## wealth3rd 1.0803015 0.77603200 1.3920837 0.163897044
## wealth4th 1.1594687 0.72787033 1.5929606 0.111169037
## wealth5th 0.7277882 0.69139093 1.0526436 0.292504364
## diarrheaYes 0.5311090 0.58493563 0.9079786 0.363889550
g) A public health official is about to hold a press conference announcing a decision to admin-
ister iron supplementation to all children living in the slum district, based on the results
of the previous analysis. You have been asked to prepare the official for questions from the
press. In no more than five sentences, provide answers to the following questions. Be sure
to use language accessible to a general audience and fully explain your answers.
i. What is the rationale behind the decision to provide iron supplementation to all chil-
dren living in this district?
The majority of children living in this district are iron deficient. It is reasonable to con-
jecture that beyond boosting iron level, providing iron supplementation may reduce
the prevalence of anemia, since the study showed evidence of an association between
lower iron level and presence of anemia. It would not be feasible to target iron sup-
plementation to only children with an iron deficiency or to children with anemia, since
that would require conducting a blood test on each child.
ii. Does the study provide evidence that iron supplementation will reduce the prevalence
of anemia in this population of children?
The study does not provide evidence that iron supplementation will reduce the preva-
lence of anemia, since we only observed children with and without anemia and mea-
sured their iron levels. To demonstrate causality, we would need to conduct a study in
which we administered iron supplementation to anemic children and noted whether
the supplementation resolved their anemia, for example. Our study also did not allow
us to conclude that iron supplementation will necessarily increase a child’s hemoglobin
levels.

Problem 3.
Biological ornamentation refers to features that are primarily decorative, such as the elaborate tail
feathers of a peacock. The evolution of ornamentation in males has been extensively researched;
there are many studies exploring how male ornamentation functions as a signal of phenotypic
and/or genetic quality to potential mates. In contrast, there are few studies investigating female
ornamentation.

11
Some biologists have hypothesized that there is strong natural selection against overly conspicu-
ous female ornaments. Bright or colorful plumage in females might be expected to increase the
incidence of predation on nests for species in which females incubate eggs. Female ornamentation
might also undergo positive selection, functioning in sexual signaling like male ornamentation,
and indicating desirable qualities such as high immune function.
The data in the file rubythroats.Rdata are from a study of 83 female rubythroats, a bird species
in which both males and females exhibit a brightly colored red patch on the throat and breast
(referred to as a “bib”). In rubythroats, females incubate the eggs, while males provide food to
females to facilitate uninterrupted incubation.
– survival: records whether the bird survived to return to the nesting site the subsequent
year, yes if the female was observed and no if the female was not observed
– weight: weight of the bird, measured in grams
– wing.length: wing length of the bird, measured in millimeters
– tarsus.length: tarsus (i.e., leg) length of the bird, measured in millimeters
– first.clutch.size: number of eggs in the first clutch laid during the first year that the bird
was observed
– nestling.fate: whether the nestlings from the first clutch survived to fledging (Fledged) or
were lost to predation (Predated)
– second.clutch: whether the bird laid a second clutch during the first year that the bird was
observed, recorded as Yes for laying a second clutch and No for otherwise
– carotenoid.chroma: a measure of the abundance of red carotenoid pigment in feathers, as
measured from a sample of four feathers taken from the center of the bird’s bib. Larger
numbers indicate higher levels of pigment in the feathers and a more saturated red color.
– bib.area: the total area of the bird’s bib, measured in millimeters squared
– total.brightness: a measure of bib brightness, calculated from spectrometer analyses.
Larger numbers indicate a brighter red color.
You will be conducting an analysis of the results in order to investigate how bib attributes and
other phenotypic characteristics of female birds are associated with measures of fitness.
a) Fit a model to predict nestling fate from female bib characteristics (carotenoid chroma, bib
area, total brightness) and female body characteristics (weight, wing length, tarsus length).
Identify the slope coefficients significant at α = 0.10, and provide an interpretation of these
coefficients in the context of the data.
The coefficients for total brightness and wing length are statistically significant at α = 0.10.
An increase in total brightness of one unit is associated with a decrease of 0.13 in the esti-
mated log odds of nestling survival (when the values of the other variables do not change).
An increase in wing length of one millimeter is associated with an increase of 0.52 in the es-
timated log odds of nestling survival (when the values of the other variables do not change).

#load the data


load("datasets/rubythroats.Rdata")

12
#fit the model
summary(glm(nestling.fate ~ carotenoid.chroma + bib.area + total.brightness +
weight + wing.length + tarsus.length, data = rubythroats,
family = binomial(link = "logit")))

##
## Call:
## glm(formula = nestling.fate ~ carotenoid.chroma + bib.area +
## total.brightness + weight + wing.length + tarsus.length,
## family = binomial(link = "logit"), data = rubythroats)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8550 -0.8988 -0.4402 1.0220 1.7890
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.057818 13.046850 -1.844 0.06519 .
## carotenoid.chroma -4.774799 3.302978 -1.446 0.14829
## bib.area -0.001272 0.002833 -0.449 0.65341
## total.brightness -0.130880 0.043669 -2.997 0.00273 **
## weight -0.358164 0.419582 -0.854 0.39332
## wing.length 0.521821 0.234826 2.222 0.02627 *
## tarsus.length 0.476708 0.484340 0.984 0.32500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 97.736 on 70 degrees of freedom
## Residual deviance: 79.655 on 64 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 93.655
##
## Number of Fisher Scoring iterations: 4
b) Investigate the factors associated with whether a female lays a second clutch during the first
year that she was observed.
i. Is there evidence of a significant association between nestling fate and whether a female
lays a second clutch? If so, report the direction of association.
Since p = 0.001, there is evidence of a significant association between nestling fate and
whether a female lays a second clutch. There is a positive association between nestling
survival and laying a second clutch; specifically, nestling survival is associated with a
3.4 estimated increase in the log odds of laying a second clutch.
#logistic regression
summary(glm(second.clutch ~ nestling.fate, data = rubythroats,

13
family = binomial(link = "logit")))

##
## Call:
## glm(formula = second.clutch ~ nestling.fate, family = binomial(link = "logit"),
## data = rubythroats)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0302 -0.8262 -0.2144 -0.2144 2.7511
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.761 1.012 -3.718 0.000201 ***
## nestling.fateFledged 3.405 1.070 3.182 0.001462 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 76.370 on 77 degrees of freedom
## Residual deviance: 55.615 on 76 degrees of freedom
## (7 observations deleted due to missingness)
## AIC: 59.615
##
## Number of Fisher Scoring iterations: 6
ii. Fit a model to predict whether a female lays a second clutch from nestling fate and
bib characteristics. Identify the two predictors that are most statistically significantly
associated with the response variable.
The two predictors most statistically significantly associated with laying a second
clutch are total brightness (p = 0.030) and nestling fate (p = 0.0015).
#fit a model
summary(glm(second.clutch ~ bib.area + total.brightness + carotenoid.chroma
+ nestling.fate, data = rubythroats,
family = binomial(link = "logit")))

##
## Call:
## glm(formula = second.clutch ~ bib.area + total.brightness + carotenoid.chroma +
## nestling.fate, family = binomial(link = "logit"), data = rubythroats)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.87779 -0.25235 -0.09075 -0.02136 2.10799
##
## Coefficients:

14
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.397205 7.593148 -2.818 0.00483 **
## bib.area 0.007019 0.003801 1.847 0.06480 .
## total.brightness 0.149085 0.068533 2.175 0.02960 *
## carotenoid.chroma 11.585799 6.085778 1.904 0.05694 .
## nestling.fateFledged 5.527158 1.740419 3.176 0.00149 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 75.503 on 75 degrees of freedom
## Residual deviance: 39.076 on 71 degrees of freedom
## (9 observations deleted due to missingness)
## AIC: 49.076
##
## Number of Fisher Scoring iterations: 7
iii. Fit a new model to predict whether a female lays a second clutch using the two predic-
tors identified in part ii. and their interaction. Interpret the model coefficients in the
context of the data.
The coefficient of total.brightness represents the association between the estimated
log odds of laying a second clutch and total brightness for females whose nestlings were
lost to predation; an increase in brightness of one unit is associated with a decrease of
0.20 in the estimated log odds of laying a second clutch.
The coefficient of nestling.fateFledged represents the change in intercept value be-
tween the model for females who lost nestlings to predation versus females whose
nestlings fledged.
The interaction term indicates the difference in the slope coefficient of total.brightness
between the Predated and Fledged groups. For females whose nestlings fledged,
an increase in total brightness of one unit is associated with an increase of
0.2033 − 0.0362 = 0.0362 in the estimated log odds of laying a second clutch.
The positive association between bib brightness and laying a second clutch is weaker
for females whose first clutch fledged (but not statistically significantly weaker,
p = 0.314).
#fit a model
summary(glm(second.clutch ~ total.brightness*nestling.fate, data = rubythroats,
family = binomial(link = "logit")))

##
## Call:
## glm(formula = second.clutch ~ total.brightness * nestling.fate,
## family = binomial(link = "logit"), data = rubythroats)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max

15
## -1.17919 -0.84586 -0.07953 -0.02768 2.14776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.8693 6.7704 -1.605 0.108
## total.brightness 0.2033 0.1594 1.276 0.202
## nestling.fateFledged 9.7318 6.8536 1.420 0.156
## total.brightness:nestling.fateFledged -0.1671 0.1659 -1.007 0.314
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 75.940 on 76 degrees of freedom
## Residual deviance: 52.024 on 73 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 60.024
##
## Number of Fisher Scoring iterations: 8
c) Investigate the factors associated with whether a female survives to return to the nesting site
the subsequent year.
i. Fit a model to predict survival from bib characteristics, female body characteristics,
first clutch size, and whether a second clutch was laid. Identify factors that are posi-
tively associated with survival for the observed birds.
The factors positively associated with survival for the observed birds are wing length,
weight, first clutch size, and having a second clutch.
#fit a model
summary(glm(survival ~ bib.area + total.brightness + carotenoid.chroma
+ tarsus.length + wing.length + weight + first.clutch.size +
second.clutch, data = rubythroats,
family = binomial(link = "logit")))

##
## Call:
## glm(formula = survival ~ bib.area + total.brightness + carotenoid.chroma +
## tarsus.length + wing.length + weight + first.clutch.size +
## second.clutch, family = binomial(link = "logit"), data = rubythroats)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9501 -0.5706 -0.2036 0.4403 2.5038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.322666 22.759433 -1.069 0.2852
## bib.area -0.002710 0.004754 -0.570 0.5686
## total.brightness -0.108595 0.063858 -1.701 0.0890 .
## carotenoid.chroma -18.517610 6.804872 -2.721 0.0065 **

16
## tarsus.length -0.886169 0.837672 -1.058 0.2901
## wing.length 0.842892 0.430007 1.960 0.0500 *
## weight 0.605506 0.648570 0.934 0.3505
## first.clutch.size 2.620596 1.132561 2.314 0.0207 *
## second.clutchYes 1.664275 1.075925 1.547 0.1219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 64.438 on 48 degrees of freedom
## Residual deviance: 34.764 on 40 degrees of freedom
## (36 observations deleted due to missingness)
## AIC: 52.764
##
## Number of Fisher Scoring iterations: 6
ii. Fit a new model with only the significant predictors from the previous model; let α =
0.10. Comment on whether this model is preferable to the one fit in part i.
The model is not preferable to the one fit in part i., since the AIC of this model is 64.4
and the AIC of the previous model is 52.8. The lower AIC of the model from part i.
suggests that the non-significant variables do contribute enough to the model to offset
the penalty on additional predictors.
#fit a model
summary(glm(survival ~ total.brightness + carotenoid.chroma +
wing.length + first.clutch.size, data = rubythroats,
family = binomial(link = "logit")))

##
## Call:
## glm(formula = survival ~ total.brightness + carotenoid.chroma +
## wing.length + first.clutch.size, family = binomial(link = "logit"),
## data = rubythroats)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0203 -0.7386 -0.4338 0.8047 2.4505
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.15330 10.45426 -1.928 0.0539 .
## total.brightness -0.07515 0.04582 -1.640 0.1010
## carotenoid.chroma -9.49612 3.89478 -2.438 0.0148 *
## wing.length 0.46054 0.22795 2.020 0.0433 *
## first.clutch.size 1.56415 0.70098 2.231 0.0257 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

17
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72.997 on 55 degrees of freedom
## Residual deviance: 54.389 on 51 degrees of freedom
## (29 observations deleted due to missingness)
## AIC: 64.389
##
## Number of Fisher Scoring iterations: 5
For parts iii. and iv., use the better parsimonious model of the ones fit in parts i. and ii.
iii. Compare the odds of survival for a female who laid 5 eggs in her first clutch to the odds
of survival for a female who laid 3 eggs in her first clutch, if the females are physically
identical and both laid a second clutch.
Since the values of the other variables are not changing, the odds of survival can be
compared by using the slope coefficient of first.clutch.size, which represents the
log of the odds ratio for a change of one egg. The odds of survival for a female that lays
5 eggs are over 188 times as large as the odds of survival for a female that lays 3 eggs,
if the the two are physically identical and both lay a second clutch: e(2)(2.62) = 188.67.
#calculate odds ratio
exp(2*2.62)

## [1] 188.6701
iv. Suppose female A has bib area 350 mm2 , total brightness of 35, carotenoid chroma
0.90, tarsus length of 19.5 mm, wing length 51 mm, weighs 10.8 g, lays 4 eggs in her
first clutch, and lays a second clutch. Female B has bib area 300 mm2 , total brightness
of 20, carotenoid chroma 0.85, tarsus length of 19.0 mm, wing length 50 mm, weighs
10.9 g, lays 3 eggs in her first clutch, and lays a second clutch. Compare the odds of
survival for females A and B.
The estimated odds of survival for female A are about 31% larger than the odds of
survival for female B; the odds ratio, comparing female A to female B, is 1.31.
#name model
survival.model = glm(survival ~ bib.area + total.brightness + carotenoid.chroma
+ tarsus.length + wing.length + weight + first.clutch.size +
second.clutch, data = rubythroats,
family = binomial(link = "logit"))

#make predictions
log.odds.female.A = predict(survival.model,
newdata = data.frame(bib.area = 350,
total.brightness = 35,
carotenoid.chroma = 0.90,
tarsus.length = 19.5,
wing.length = 51,
weight = 10.8,

18
first.clutch.size = 4,
second.clutch = "Yes"))
log.odds.female.B = predict(survival.model,
newdata = data.frame(bib.area = 300,
total.brightness = 20,
carotenoid.chroma = 0.85,
tarsus.length = 19,
wing.length = 50,
weight = 10.9,
first.clutch.size = 3,
second.clutch = "Yes"))

#calculate odds ratio


exp(log.odds.female.A)/exp(log.odds.female.B)

## 1
## 1.309356
d) Biological fitness refers to how successful an organism is at surviving and reproducing.
Based on the results of your analysis, briefly discuss whether female ornamentation seems
beneficial for fitness in this bird species. Limit your response to at most ten sentences. You
do not need to reference specific numerical results/models from the analysis.
Overall, female ornamentation seems highly related to fitness; certain features of ornamen-
tation are detrimental to survival, while others are predictive of reproductive success. Bib
brightness lowers the odds of nestling survival and female survival; this is likely because
brightly colored bibs would easily attract the attention of predators. Similarly, more sat-
urated red color on a bib would make a bird more visible and susceptible to predation;
carotenoid chroma is significantly negatively associated with female survival, and was also
observed to be negatively associated with nestling survival. However, all three measures
of ornamentation are significantly positively associated with laying a second clutch; this
suggests that females with conspicuous ornaments have high fertility and are capable of
successfully reproducing. The data support the idea that there are certain fitness costs to
conspicuous ornamentation, but that ornamentation can also be indicative of desirable traits
such as high fertility.

19
NOTE: For full credit on the problem set, complete either Problem 4 or Problem 5. Clearly
indicate which problem you have chosen in your solutions.

Problem 4.
Polychlorinated biphenyls (PCBs) are a collection of synthetic compounds, called congeners, that
are particularly toxic to fetuses and young children. Although PCBs are no longer produced in
the United States, they are still found in the environment. Since human exposure to these PCBs is
primarily through the consumption of fish, the Environmental Protection Agency (EPA) monitors
the PCB levels in fish. Unfortunately, there are 209 different congeners and measuring all of them
in a fish specimen is an expensive and time-consuming process.2
The file pcb.Rdata contains data collected from 69 fish specimens; each row represents data from
a single specimen. The first 7 columns contain the amounts of specific PCBs measured in units of
nanograms per kilogram (ng/kg): PCB138, PCB153, PCB180, PCB28, PCB52, PCB126, PCB118.
The variable pcb.total indicates the total amount of PCBs detected.
If the total amount of PCBs in a specimen can be estimated well using only measurements of a
few PCBs, the cost of PCB assays can be greatly reduced.
In your written report for the EPA, present a model for predicting the total amount of PCBs in a
specimen from the levels of individual PCBs. Be sure to briefly explain the work done to develop
the model, evaluate the strengths and weaknesses of the model, and discuss whether the model
can be effectively used to reduce the costs of monitoring PCB levels in fish.
Limit your written report to at most 1 page and include all R code and output in the appendix.
Written Report
The following written report describes a model obtained without using log transformations; justifications
for doing so are that model fitting works reasonably well when all variables are similarly skewed and
model interpretation is more direct. For your reference, the second part appendix show plots for going
through model selection after log-transforming all variables. Note how the residuals are better behaved
after all variables are log-transformed.
The distribution of the total amount of PCBs in these fish specimens is right-skewed; most fish
have a relatively low amount of PCBs. The amount of PCBs ranges from 0 ng/kg to 319 ng/kg,
with a median amount of 45 ng/kg. PCBs 28, 52, and 126 are the least common PCBs, present
in the smallest amounts relative to the other congeners assayed. PCBs 138 and 153 are the most
common PCBs, with median amount around 5 ng/kg. There are a couple specimens with very
high amounts of PCBs 138, 153, and 180.
PCBs 138, 153, 180, and 118 have the strongest linear relationships with the total amount of PCBs;
correlation coefficients range from 0.80 to 0.93. These are chosen for the initial model.
The initial model with four congeners explains about 95% of the observed variation in total PCB
level. In this model, PCB180 and PCB118 are statistically significantly associated with total PCB
level at the α = 0.05 significance level. A model with only these two PCBs has a slightly lower
adjusted R2 value (0.9401 versus 0.9434); however, it still explains 94% of the observed variation
in total PCB level. In comparison, a model with the congener most correlated with total PCB
level (PCB118) explains about 71% of the observed variation in total PCB level. The F-statistic
2 Data from Introduction to the Practice of Statistics, 7th ed.

20
of the two-predictor model is highly significant, indicating that the variables as a group are good
predictors of total PCB level.
Since the goal is to reduce the cost of PCB assays, the model with PCB180 and PCB118 seems to
be the best choice; it is reasonable to choose a model that requires information about two PCBs
as opposed to four PCBs, when the adjusted R2 values are so close. If it were of interest to use a
single congener to estimate total PCB, the natural choice would be PCB118.
The final model seems somewhat reasonable. There does not seem to be non-linearity of residuals
against either PCB118 or PCB180. Variability in the residuals is roughly constant in the 1-100
ng/kg predicted total PCB range; on the high end, there are very few observations and it is not
possible to assess variability. Predictions made using this model for high levels of total PCB should
be treated with skepticism. Overall, there seem to be more points under the line than above;
this suggests the model may have a tendency to overpredict total PCB level, which is a desirable
feature. It is reasonable to assume that the observations are independent, and the residuals are
normally distributed in the center but deviate from normality in both tails.
In summary, it seems feasible that the total amount of PCBs in a specimen can be estimated
with only a few PCBs: PCB118 and PCB180. A model with just these two PCBs as predictors
explains 94% of the observed variation in total PCB levels. Some caution should be used if the
predicted amount of total PCBs is higher than 100 ng/kg; the predictions from the model may not
be accurate. Within the 0 to 100 ng/kg range, observations tend to be between 20 ng/kg higher
or lower from the predicted value. Judgments about whether the model is precise enough to be
useful would require information about the cutoff level for when fish are deemed unsafe to eat.

21
Appendix
#load the data
load("datasets/pcb.Rdata")

#load colors
library(openintro)
data("COL")

#distribution of total PCBs


par(mfrow = c(1, 2))
hist(pcb$pcb.total, main = "Total PCB",
xlab = "PCB level (ng/kg)", col = COL[1, 3], border = COL[1])
hist(log(pcb$pcb.total), main = "Log-Transformed Total PCB",
xlab = "Log PCB level (log ng/kg)", col = COL[1, 3], border = COL[1])

Total PCB Log−Transformed Total PCB


35

15
25
Frequency

Frequency

10
15

5
5
0

0 50 150 250 350 2 3 4 5 6

PCB level (ng/kg) Log PCB level (log ng/kg)

summary(pcb$pcb.total)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 6.10 30.18 47.96 68.47 91.63 318.75
#compare distributions of individual PCBs
boxplot(pcb$pcb138, pcb$pcb153, pcb$pcb180, pcb$pcb28,
pcb$pcb52, pcb$pcb126, pcb$pcb118,
names = c("PCB138", "PCB153", "PCB180", "PCB28",
"PCB52", "PCB126", "PCB118"),
col = COL[1, 3], border = COL[1])

22
40
30
20
10
0

PCB138 PCB153 PCB180 PCB28 PCB52 PCB126 PCB118

#compare distributions of log-transformed versions


boxplot(log(pcb$pcb138), log(pcb$pcb153), log(pcb$pcb180),
log(pcb$pcb28), log(pcb$pcb52), log(pcb$pcb126 + 0.0000001),
log(pcb$pcb118),
names = c("PCB138", "PCB153", "PCB180", "PCB28",
"PCB52", "PCB126", "PCB118"),
col = COL[1, 3], border = COL[1])
0
−5
−10
−15

PCB138 PCB153 PCB180 PCB28 PCB52 PCB126 PCB118

23
0 10 25 0 10 20 30 0 4 8 0 5 15

250
pcb.total

0 100
25

pcb138
10
0

40
pcb153

20
0
10 20 30

pcb180
0

6
pcb28

4
2
0
8

pcb52
4
0

0.020
pcb126

0.000
15

pcb118
5
0

0 100 250 0 20 40 0 2 4 6 0.000 0.020

24
0 1 2 3 −1 1 2 3 −4 −2 0 2 −1 1 2 3

2 3 4 5
log(pcb.total)
3
2

log(pcb138)
1
0

3
2
log(pcb153)

1
0
1 2 3

log(pcb180)
−1

2
−2 0
log(pcb28)

−5
2
0

log(pcb52)
−4 −2

−4.0
log(pcb126)

−5.0
1 2 3

log(pcb118)
−1

2 3 4 5 0 1 2 3 −5 −2 0 2 −5.0 −4.0

#create correlation matrix


pcb.subset = subset(pcb,
select = c("pcb.total", "pcb138", "pcb153",
"pcb180", "pcb28", "pcb52",
"pcb126", "pcb118"))
cor(pcb.subset)

## pcb.total pcb138 pcb153 pcb180 pcb28 pcb52


## pcb.total 1.0000000 0.9288353 0.9209322 0.80085493 0.37937443 0.59635717
## pcb138 0.9288353 1.0000000 0.9301981 0.88230223 0.16864612 0.30089828
## pcb153 0.9209322 0.9301981 1.0000000 0.86484159 0.14971905 0.34485793
## pcb180 0.8008549 0.8823022 0.8648416 1.00000000 0.04561106 0.08692971
## pcb28 0.3793744 0.1686461 0.1497191 0.04561106 1.00000000 0.68340637

25
## pcb52 0.5963572 0.3008983 0.3448579 0.08692971 0.68340637 1.00000000
## pcb126 0.6997818 0.7773519 0.6608169 0.61565849 0.09956607 0.21157595
## pcb118 0.8432980 0.7293792 0.7197511 0.43744431 0.34984796 0.68490727
## pcb126 pcb118
## pcb.total 0.69978176 0.8432980
## pcb138 0.77735193 0.7293792
## pcb153 0.66081693 0.7197511
## pcb180 0.61565849 0.4374443
## pcb28 0.09956607 0.3498480
## pcb52 0.21157595 0.6849073
## pcb126 1.00000000 0.6547819
## pcb118 0.65478189 1.0000000
#initial prediction model
initial.model = lm(pcb.total ~ pcb138 + pcb153 + pcb180 + pcb118,
data = pcb)
summary(initial.model)

##
## Call:
## lm(formula = pcb.total ~ pcb138 + pcb153 + pcb180 + pcb118, data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.220 -5.611 -1.140 3.363 62.734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7094 2.7784 0.975 0.333163
## pcb138 1.2444 1.0956 1.136 0.260251
## pcb153 1.1070 0.6198 1.786 0.078847 .
## pcb180 4.0326 1.0802 3.733 0.000404 ***
## pcb118 9.6222 1.2031 7.998 3.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.13 on 64 degrees of freedom
## Multiple R-squared: 0.9467, Adjusted R-squared: 0.9434
## F-statistic: 284.4 on 4 and 64 DF, p-value: < 2.2e-16
#model with 180 and 118 (significant at 0.05)
summary(lm(pcb.total ~ pcb180 + pcb118, data = pcb))

##
## Call:
## lm(formula = pcb.total ~ pcb180 + pcb118, data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max

26
## -31.240 -5.763 -1.373 3.705 59.637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9597 2.6789 1.105 0.273
## pcb180 6.3623 0.3930 16.189 <2e-16 ***
## pcb118 11.9922 0.6491 18.476 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.53 on 66 degrees of freedom
## Multiple R-squared: 0.9419, Adjusted R-squared: 0.9401
## F-statistic: 534.9 on 2 and 66 DF, p-value: < 2.2e-16
#model with 180, 118, and 153
summary(lm(pcb.total ~ pcb153 + pcb180 + pcb118, data = pcb))

##
## Call:
## lm(formula = pcb.total ~ pcb153 + pcb180 + pcb118, data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.164 -5.866 -1.926 3.139 61.708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7290 2.6353 1.415 0.1618
## pcb153 1.2795 0.6023 2.124 0.0374 *
## pcb180 4.8475 0.8094 5.989 1.01e-07 ***
## pcb118 10.4391 0.9667 10.799 3.84e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.16 on 65 degrees of freedom
## Multiple R-squared: 0.9457, Adjusted R-squared: 0.9432
## F-statistic: 377.1 on 3 and 65 DF, p-value: < 2.2e-16
#model with only 118
summary(lm(pcb.total ~ pcb118, data = pcb))

##
## Call:
## lm(formula = pcb.total ~ pcb118, data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81.318 -14.187 -6.928 9.338 163.457
##

27
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.448 5.716 2.528 0.0138 *
## pcb118 16.589 1.292 12.844 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.16 on 67 degrees of freedom
## Multiple R-squared: 0.7112, Adjusted R-squared: 0.7068
## F-statistic: 165 on 1 and 67 DF, p-value: < 2.2e-16
final.model = lm(pcb.total ~ pcb180 + pcb118, data = pcb)

Residuals vs PCB118 Residuals vs PCB180)


60

60
40

40
Residual

Residual
20

20
0

0
−20

−20

0 5 10 15 −1 0 1 2 3

PCB118 (ng/kg) PCB180 (ng/kg)

Residuals vs Predicted Total PCB Q−Q Plot of Model Residuals


60

60
40

40
Sample Quantiles
Residual

20

20
0

0
−20

−20

0 50 100 150 200 250 300 −2 −1 0 1 2

Predicted Total PCB (ng/kg) Theoretical Quantiles

28
Approach with Log-Transformed Variables
## log.pcb.total log.pcb138 log.pcb153 log.pcb180 log.pcb28
## log.pcb.total 1.0000000 0.9560549 0.9049176 0.8288974 0.5699256
## log.pcb138 0.9560549 1.0000000 0.9219441 0.8963662 0.3876895
## log.pcb153 0.9049176 0.9219441 1.0000000 0.8668080 0.3260234
## log.pcb180 0.8288974 0.8963662 0.8668080 1.0000000 0.2272701
## log.pcb28 0.5699256 0.3876895 0.3260234 0.2272701 1.0000000
## log.pcb52 0.7005905 0.5404601 0.5192283 0.3015365 0.7950316
## log.pcb126 0.5751236 0.6305038 0.5269149 0.5752551 0.1195300
## log.pcb118 0.9064775 0.8897442 0.7798756 0.6538711 0.5336685
## log.pcb52 log.pcb126 log.pcb118
## log.pcb.total 0.7005905 0.5751236 0.9064775
## log.pcb138 0.5404601 0.6305038 0.8897442
## log.pcb153 0.5192283 0.5269149 0.7798756
## log.pcb180 0.3015365 0.5752551 0.6538711
## log.pcb28 0.7950316 0.1195300 0.5336685
## log.pcb52 1.0000000 0.2337883 0.6709082
## log.pcb126 0.2337883 1.0000000 0.5693747
## log.pcb118 0.6709082 0.5693747 1.0000000
#initial prediction model
initial.model = lm(log(pcb.total) ~ log(pcb138) + log(pcb153) +
log(pcb180) + log(pcb118), data = pcb)
summary(initial.model)

##
## Call:
## lm(formula = log(pcb.total) ~ log(pcb138) + log(pcb153) + log(pcb180) +
## log(pcb118), data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39794 -0.11865 -0.02434 0.08503 0.90335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.63780 0.12271 21.496 < 2e-16 ***
## log(pcb138) 0.34384 0.18098 1.900 0.061958 .
## log(pcb153) 0.20876 0.07395 2.823 0.006336 **
## log(pcb180) 0.06634 0.08679 0.764 0.447465
## log(pcb118) 0.35522 0.09436 3.764 0.000365 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2061 on 64 degrees of freedom
## Multiple R-squared: 0.9379, Adjusted R-squared: 0.934
## F-statistic: 241.5 on 4 and 64 DF, p-value: < 2.2e-16

29
#model with 118 and 153 (significant at 0.05)
summary(lm(log(pcb.total) ~ log(pcb153) + log(pcb118), data = pcb))

##
## Call:
## lm(formula = log(pcb.total) ~ log(pcb153) + log(pcb118), data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4879 -0.1436 -0.0154 0.1121 0.7610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.72601 0.06205 43.932 < 2e-16 ***
## log(pcb153) 0.44961 0.04895 9.185 2.04e-13 ***
## log(pcb118) 0.49677 0.05334 9.314 1.21e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2277 on 66 degrees of freedom
## Multiple R-squared: 0.9217, Adjusted R-squared: 0.9194
## F-statistic: 388.7 on 2 and 66 DF, p-value: < 2.2e-16
#model with 118, 153, and 138
summary(lm(log(pcb.total) ~ log(pcb153) + log(pcb118) + log(pcb138),
data = pcb))

##
## Call:
## lm(formula = log(pcb.total) ~ log(pcb153) + log(pcb118) + log(pcb138),
## data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42191 -0.10864 -0.01387 0.08670 0.89961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.56062 0.06950 36.842 < 2e-16 ***
## log(pcb153) 0.21478 0.07330 2.930 0.004669 **
## log(pcb118) 0.30520 0.06777 4.504 2.84e-05 ***
## log(pcb138) 0.45193 0.11259 4.014 0.000157 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2054 on 65 degrees of freedom
## Multiple R-squared: 0.9373, Adjusted R-squared: 0.9344
## F-statistic: 323.8 on 3 and 65 DF, p-value: < 2.2e-16

30
#model with only 118
summary(lm(log(pcb.total) ~ log(pcb118), data = pcb))

##
## Call:
## lm(formula = log(pcb.total) ~ log(pcb118), data = pcb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74232 -0.23297 -0.03774 0.16760 0.86563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.16481 0.05932 53.35 <2e-16 ***
## log(pcb118) 0.87883 0.05001 17.57 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3412 on 67 degrees of freedom
## Multiple R-squared: 0.8217, Adjusted R-squared: 0.819
## F-statistic: 308.8 on 1 and 67 DF, p-value: < 2.2e-16
#final prediction model
final.model = lm(log(pcb.total) ~ log(pcb153) + log(pcb118) + log(pcb138),
data = pcb)

31
0.8
Residuals vs log(PCB118) Residuals vs log(PCB138) Residuals vs log(PCB153)

0.8

0.8
0.6

0.6

0.6
0.4

0.4

0.4
Residual

Residual

Residual
0.2

0.2

0.2
0.0

0.0

0.0
−0.2

−0.2

−0.2
−0.4

−0.4

−0.4
−1 0 1 2 3 0 1 2 3 0 1 2 3

log(PCB118) log(PCB138) log(PCB153)

Residuals vs Pred. log(total PCB) Q−Q Plot of Model Residuals


0.8

0.8
0.6

0.6
Sample Quantiles
0.4

0.4
Residual

0.2

0.2
0.0

0.0
−0.2

−0.2
−0.4

−0.4

2 3 4 5 −2 −1 0 1 2

Predicted log(total PCB) Theoretical Quantiles

32
Problem 5.
Forced expiratory volume (FEV) is an index of pulmonary function that measures the volume (L)
of air expelled after one second of constant effort; higher values indicate better lung function. In
1980, FEV was measured on 654 children ages 3 through 19 in East Boston, Massachusetts, as a
part of a larger study assessing change in pulmonary function over time in children.
The FEV measurements (fev) are in lung_function.Rdata, along with a record of smoking status
(smoke), age measured in years (age), sex (sex), and height in inches (height).
Is smoking associated with decreased lung function in children?
Conduct an analysis of the East Boston data to examine whether there is evidence of an association
between smoking and decreased lung function in children.
In your written report, be sure to describe the study participants, explain the analysis approach,
summarize the results, and discuss your findings in the context of the research question.
Limit your written report to at most 1 page and include all R code and output in the appendix.
Written Report
50% of the children in the study are between ages 8 and 12, although there are children as young as
3 and as old as 19. About 10% of the children smoke. The distribution of FEV is right-skewed, with
several high outliers of FEV representing children with unusually high levels of lung function
relative to ther children in the study. Median FEV is 2.5 L, and the middle 50% of FEV values is
from about 2 L to 3 L.
In an initial model, there is evidence that smoking status is significantly associated with lung
function (p = 3.07 × 10−7 ). The mean FEV for non-smokers is lower than that of smokers; thus, it
seems that smoking is associated with increased lung function.
However, the other variables in the dataset seem to be potential confounders: age, height, and
sex. Older individuals tend to have higher FEV; older individuals tend to be smokers. Taller
individuals tend to have higher FEV, taller individuals tend to be smokers. Females are more
likely to be smokers than males; females have lower median FEV than males.
In the final model that adjusts for age, height, and sex, smoking status is no longer significantly
associated with lung function (p = 0.141). The model F-statistic is significant (p < 0.001), indi-
cating that the variables as a group are good predictors of FEV; the model explains 78% of the
observed variation in FEV.
The model seems somewhat reasonable for the data. There does not appear to be remaining
non-linearity with respect to age after the model is fit, but there is some curvature remaining
with respect ot height. This curvature is also reflected in the plot of residuals versus predicted
FEV values.3 It is reasonable to assume that the observations are independent; since the study
population consists of a large number of children selected, it is unlikely that a large number are
related, etc. The residuals are normally distributed.
In summary, there is not sufficient evidence to suggest that smoking is associated with decreased
lung function in children. Data were collected on children from a wide age range that spans not
3 The curvilinear relationship between FEV and height is visible in the scatterplot matrix. This can be fixed with a
quadratic term. For your interest, the last page of the solutions shows the residual plot from a model with a quadratic
term for height. This strategy is beyond the scope of Stat 102.

33
only most of childhood but includes the teenage years. An initial analysis suggested that smoking
was actually associated with higher levels of FEV; however, this was explained by the confounding
effects of age (and height). Although smoking would likely decrease lung function to a certain
extent, older (and taller) children will also necessarily have higher FEV than younger children.
The relationship between age (and height) and FEV seems to be masking a potential association
between smoking and FEV. In the final model, smoking status is not significantly associated with
decreased lung function at the α = 0.05 level; however, it does have a negative coefficient which
points towards smoking being associated with lower FEV.
A future study should collect data from a more specific age range of children, in order to control
for the effect of age on lung function. Additionally, it seems like smoking status alone might not
be enough information to address the main question; it would be useful to have information about
how long a child has been smoking, as well as how mmuch they smoke. These factors might be
more strongly linked to decreased lung function than smoking status.
Appendix
#load the data
load("datasets/lung_function.Rdata")

#numerical summaries
summary(lung.function$fev)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.791 1.981 2.547 2.637 3.119 5.793
summary(lung.function$age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 3.000 8.000 10.000 9.931 12.000 19.000
summary(lung.function$height)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 46.00 57.00 61.50 61.14 65.50 74.00

34
Lung Function Lung Function
150

5
100
Frequency

4
3
50

2
1
0

1 2 3 4 5 6

FEV (L) FEV (L)

Age Age
150

15
Frequency

100

10
50

5
0

5 10 15 20

Age (yrs) Age (yrs)

35
Height Height

75
70
80

65
Frequency

60

60
40

55
20

50
0

45
45 50 55 60 65 70 75

Height (in) Height (in)


500

250
300

150
100

50
0

nonsmoker smoker female male

36
5 10 15

5
4
fev

3
2
1
15

age
10
5

75
70
65
height

1 2 3 4 5 45 50 55 60 65 70 75 60
55
50
45

37
5
FEV by Sex FEV by Smoking Status

5
4

4
3

3
2

2
1

1
female male nonsmoker smoker

Sex Smoking Status

Age by Smoking Status 75


70
Height by Smoking Status
15

65
60
10

55
50
5

45

nonsmoker smoker nonsmoker smoker

Smoking Status Smoking Status

##
## female male
## nonsmoker 279 310
## smoker 39 26
#fit initial model, unadjusted
summary(lm(fev ~ smoke, data = lung.function))

##
## Call:
## lm(formula = fev ~ smoke, data = lung.function)
##
## Residuals:

38
## Min 1Q Median 3Q Max
## -1.7751 -0.6339 -0.1021 0.4804 3.2269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.56614 0.03466 74.037 < 2e-16 ***
## smokesmoker 0.71072 0.10994 6.464 1.99e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8412 on 652 degrees of freedom
## Multiple R-squared: 0.06023, Adjusted R-squared: 0.05879
## F-statistic: 41.79 on 1 and 652 DF, p-value: 1.993e-10
#fit final model, adjusting for potential confounders
summary(lm(fev ~ smoke + age + height + sex,
data = lung.function))

##
## Call:
## lm(formula = fev ~ smoke + age + height + sex, data = lung.function)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37656 -0.25033 0.00894 0.25588 1.92047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.456974 0.222839 -20.001 < 2e-16 ***
## smokesmoker -0.087246 0.059254 -1.472 0.141
## age 0.065509 0.009489 6.904 1.21e-11 ***
## height 0.104199 0.004758 21.901 < 2e-16 ***
## sexmale 0.157103 0.033207 4.731 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4122 on 649 degrees of freedom
## Multiple R-squared: 0.7754, Adjusted R-squared: 0.774
## F-statistic: 560 on 4 and 649 DF, p-value: < 2.2e-16
final.model = lm(fev ~ smoke + age + height + sex,
data = lung.function)

39
2.0
Residuals vs Age Residuals vs Height

2.0
1.5

1.5
1.0

1.0
Residual

Residual
0.5

0.5
−0.5

−0.5
−1.5

−1.5
5 10 15 45 50 55 60 65 70 75

Age (yrs) Height (in)

Residuals vs Predicted FEV Q−Q Plot of Model Residuals


2.0

2.0
1.5

1.5
1.0

1.0
Sample Quantiles
Residual

0.5

0.5
−0.5

−0.5
−1.5

−1.5

1 2 3 4 −3 −2 −1 0 1 2 3

Predicted FEV (L) Theoretical Quantiles

40
#fit final model, with quadratic terms
final.model = lm(fev ~ smoke + age + height + I(height^2) + sex,
data = lung.function)

residuals = resid(final.model)
predicted = predict(final.model)

plot(residuals ~ predicted,
pch = 21, col = COL[1], bg = COL[1, 4], cex = 0.7,
main = "Quadratic Model: Residuals vs Predicted FEV", ylab = "Residual",
xlab = "Predicted FEV (L)")
abline(h = 0, col = "red", lty = 2)

Quadratic Model: Residuals vs Predicted FEV


1.5
0.5
Residual

−0.5
−1.5

1 2 3 4

Predicted FEV (L)

41

You might also like