
Problem Set 7 Solutions

Statistics 104
Due November 26, 2019 at 11:59 pm

Problem set policies. Please provide concise, clear answers for each question. Note that only writing the result of
a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving R, be sure to include
the code in your solution.
Please submit your problem set via Canvas as a PDF, along with the R Markdown source file.
We encourage you to discuss problems with other students (and, of course, with the course head and the TFs), but
you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you
do collaborate with classmates on a problem, please list your collaborators on your solution.

Problem 1.

The National Health and Nutrition Examination Survey (NHANES) is a yearly survey conducted by
the US Centers for Disease Control. This question uses the nhanes.samp.adult.500 dataset in the
oibiostat package, which consists of information on a subset of 500 individuals ages 21 years and
older from the larger NHANES dataset.
Poverty (Poverty) is measured as a ratio of family income to poverty guidelines. Smaller numbers
indicate more poverty, and ratios of 5 or larger were recorded as 5. Education (Education) is
reported for individuals ages 20 years or older and indicates the highest level of education achieved:
either 8th Grade, 9 - 11th Grade, High School, Some College, or College Grad. The variable HomeOwn
records whether a participant rents or owns their home; the levels of the variable are Own, Rent,
and Other.
a) Create a plot showing the association between poverty and educational level. Describe what
you see.
A higher level of educational attainment is associated with a higher poverty ratio (i.e., higher
income); the median poverty ratio increases with higher education group. For example,
median poverty ratio among college graduates is about 5, while the median poverty ratio
among individuals with at most an 8th grade education is about 1.4.

[Figure: side-by-side boxplots of Poverty (ratio of family income to poverty guidelines, 0-5) by Education group (8th Grade, 9 - 11th Grade, High School, Some College, College Grad); title "Poverty by Education in NHANES (n = 500)".]
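
A plot like the one above can be produced with a grouped boxplot; a minimal sketch assuming base R graphics and the oibiostat package (the original figure may have been styled differently):

#load the data and plot poverty ratio by educational level
library(oibiostat)
data("nhanes.samp.adult.500")
boxplot(Poverty ~ Education, data = nhanes.samp.adult.500,
        main = "Poverty by Education in NHANES (n = 500)",
        xlab = "Education", ylab = "Poverty")

The same approach applies in part c), using Poverty ~ HomeOwn.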

b) Fit a linear model to predict poverty from educational level.


i. Interpret the model coefficients and associated p-values.
Each slope coefficient is highly significant, indicating that the mean poverty ratio for
each group is significantly different from mean poverty ratio in the baseline group (8th
grade). Mean poverty ratio is 0.993 higher than baseline in the 9 - 11th grade group, 1.09
higher than baseline in the high school group, 1.49 higher than baseline in the some
college group, and 2.50 higher than baseline in the college graduate group.
ii. Assess whether educational level, overall, is associated with poverty. Be sure to include
any relevant numerical evidence as part of your answer.
The p-value associated with the F-statistic is less than 0.05, which supports the alternative
hypothesis that poverty is associated with educational level.
#fit linear model
summary(lm(Poverty ~ Education, data = nhanes.samp.adult.500))

##
## Call:
## lm(formula = Poverty ~ Education, data = nhanes.samp.adult.500)
##
## Residuals:
## Min 1Q Median 3Q Max

## -3.4903 -1.2003 0.0901 1.0497 2.7545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4555 0.2703 5.384 1.17e-07 ***
## Education9 - 11th Grade 0.9931 0.3302 3.008 0.002776 **
## EducationHigh School 1.0900 0.3113 3.501 0.000508 ***
## EducationSome College 1.4943 0.2976 5.021 7.37e-07 ***
## EducationCollege Grad 2.4948 0.2958 8.434 4.45e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.456 on 456 degrees of freedom
## (39 observations deleted due to missingness)
## Multiple R-squared: 0.1977, Adjusted R-squared: 0.1906
## F-statistic: 28.09 on 4 and 456 DF, p-value: < 2.2e-16
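
As a cross-check on the overall F-test reported above, the same F-statistic can be obtained from a one-way ANOVA table (a sketch, not part of the original solution):

#overall F-test for Education via an ANOVA table
anova(lm(Poverty ~ Education, data = nhanes.samp.adult.500))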
c) Create a plot showing the association between poverty and home ownership. Based on what
you see, speculate briefly about the home ownership status of individuals who responded
with Other.
Since the median poverty ratio in the ’Other’ group is higher than in the ’Rent’ group, it is
unlikely these individuals are homeless. Perhaps they are living in a home owned by family
members; it seems these individuals could afford to rent a home (although perhaps not own
one) and simply choose not to do so.

[Figure: boxplots of Poverty by HomeOwn group (Own, Rent, Other); title "Poverty by Home Ownership in NHANES (n = 500)".]
d) Fit a linear model to predict poverty from educational level and home ownership. Comment
on whether this model is an improvement from the model in part b).
This model explains more of the observed variability in poverty ratio; the R² has improved
from about 20% to 30%. The adjusted R² has also improved, from 0.19 to 0.29,
indicating that the contribution of home ownership was ’worth’ the added complexity of two
additional predictor terms.
#fit a model
summary(lm(Poverty ~ Education + HomeOwn, data = nhanes.samp.adult.500))

##
## Call:
## lm(formula = Poverty ~ Education + HomeOwn, data = nhanes.samp.adult.500)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3974 -1.0462 0.0826 0.7826 3.2707
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8125 0.2565 7.067 5.99e-12 ***
## Education9 - 11th Grade 0.9506 0.3087 3.080 0.002199 **
## EducationHigh School 1.0885 0.2914 3.736 0.000211 ***
## EducationSome College 1.5337 0.2782 5.513 5.91e-08 ***
## EducationCollege Grad 2.4049 0.2767 8.691 < 2e-16 ***
## HomeOwnRent -1.1717 0.1441 -8.128 4.18e-15 ***
## HomeOwnOther -0.9792 0.4611 -2.123 0.034259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.361 on 454 degrees of freedom
## (39 observations deleted due to missingness)
## Multiple R-squared: 0.3023, Adjusted R-squared: 0.2931
## F-statistic: 32.79 on 6 and 454 DF, p-value: < 2.2e-16

Problem 2.

Do men and women think differently about their body weight? To address this question, you will
be using data from the Behavioral Risk Factor Surveillance System (BRFSS).
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000
people in the United States collected by the Centers for Disease Control and Prevention (CDC).
As its name implies, the BRFSS is designed to identify risk factors in the adult population and
report emerging health trends. For example, respondents are asked about diet and weekly physical
activity, HIV/AIDS status, possible tobacco use, and level of healthcare coverage.
The cdc.sample dataset contains data on 500 individuals from a random sample of 20,000 respondents
to the BRFSS survey conducted in 2000, on the following nine variables:
– genhlth: general health status, with categories excellent, very good, good, fair, and poor
– exerany: recorded as 1 if the respondent exercised in the past month and 0 otherwise
– hlthplan: recorded as 1 if the respondent has some form of health coverage and 0 otherwise
– smoke100: recorded as 1 if the respondent has smoked at least 100 cigarettes in their entire
life and 0 otherwise
– height: height in inches
– weight: weight in pounds
– wt.desire: desired weight in pounds
– age: age in years
– gender: gender, recorded as m for male and f for female
a) Create a variable called wt.discr that is a measure of the discrepancy between an individual’s
desired weight and their actual weight, expressed as a proportion of their actual weight:

weight discrepancy = (actual weight − desired weight) / actual weight
#load the data
load("datasets/cdc_sample.Rdata")

#create wt.discr
cdc.sample$wt.discr = (cdc.sample$weight - cdc.sample$wtdesire)/cdc.sample$weight

b) Fit a linear model to predict weight discrepancy from age and gender. Interpret the slope
coefficients in the model.
The coefficient for age indicates that a one-year increase in age is associated with an increase
in mean weight discrepancy of about 0.0001 (roughly 0.01 percentage points), when gender is held
constant. The coefficient for gender indicates that a female has a mean weight discrepancy about
4.7 percentage points higher than a male of the same age.
summary(lm(wt.discr ~ age + gender, data = cdc.sample))

##
## Call:
## lm(formula = wt.discr ~ age + gender, data = cdc.sample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58132 -0.05839 -0.02131 0.06051 0.41627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0453405 0.0140198 3.234 0.0013 **
## age 0.0001445 0.0002787 0.518 0.6045
## genderf 0.0468575 0.0098007 4.781 2.3e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1083 on 497 degrees of freedom
## Multiple R-squared: 0.04545, Adjusted R-squared: 0.04161
## F-statistic: 11.83 on 2 and 497 DF, p-value: 9.549e-06
c) Investigate whether the association between weight discrepancy and age is different for males
versus females.
i. Fit a linear model to predict weight discrepancy from age, gender, and the interaction
between age and gender. Write the model equation.

ŵt.discr = 0.0114 + 0.000936(age) + 0.105(genderF) − 0.00133(age × genderF)

ii. Write the prediction equation for males and the prediction equation for females.
The prediction equation for males is

ŵt.discr = 0.0114 + 0.000936(age) + 0.105(0) − 0.00133(age × 0)
         = 0.0114 + 0.000936(age)

The prediction equation for females is

ŵt.discr = 0.0114 + 0.000936(age) + 0.105(1) − 0.00133(age × 1)
         = (0.0114 + 0.105) + (0.000936 − 0.00133)(age)
         = 0.116 − 0.000394(age)

Note: A visualization of these equations is provided below.


iii. Is there statistically significant evidence of an interaction between age and gender?
Explain your answer.

6
Yes, there is statistically significant evidence of an interaction between age and gender.
The interaction term has p = 0.019, which is less than α = 0.05.
model.interact = lm(wt.discr ~ age*gender, data = cdc.sample)
summary(model.interact)$coef

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 0.0114156564 0.0200833355 0.5684144 5.700109e-01
## age 0.0009360246 0.0004365124 2.1443252 3.249215e-02
## genderf 0.1052749594 0.0267131898 3.9409355 9.282006e-05
## age:genderf -0.0013283617 0.0005654713 -2.3491231 1.920906e-02
d) Comment on whether the results from part c) suggest that men and women think differently
about their body weight. Do you find the results surprising; why or why not? Limit your
response to at most five sentences.
A surprising trend visible in the data is the slight positive association between weight
discrepancy and age for men; an older man on average has higher weight discrepancy than a
younger man. Perhaps this is reflective of society being generally more accepting of aging in
men, causing men to be less concerned about body image until they are past middle age. In
contrast, women consistently exhibit concern about their weight throughout their lifetime.
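
The visualization referenced in part c) (and summarized below) can be reproduced along these lines; a sketch assuming base graphics and the model.interact fit from above:

#scatterplot of weight discrepancy vs age, with fitted lines by gender
plot(cdc.sample$age, cdc.sample$wt.discr,
     col = ifelse(cdc.sample$gender == "f", "red", "blue"),
     xlab = "Age (yrs)", ylab = "Weight Discrepancy",
     main = "Weight Discrepancy by Age")
cf = coef(model.interact)
abline(a = cf["(Intercept)"], b = cf["age"], col = "blue")  #males
abline(a = cf["(Intercept)"] + cf["genderf"],
       b = cf["age"] + cf["age:genderf"], col = "red")      #females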

[Figure: "Weight Discrepancy by Age" — weight discrepancy (%) plotted against age (yrs), visualizing the prediction equations for males and females.]
Problem 3.

Studies have indicated that several factors contribute to clinicians perceiving encounters with
patients as difficult; such factors may relate to physicians or patients. For example, physicians may
have negative bias toward specific health conditions; additionally, physicians involved in difficult
encounters tend to be less experienced. Patients who exhibit personality disorders, non-adherence
to medical advice, and self-destructive behaviors can also contribute to encounter difficulty.
A study was conducted at a university outpatient primary care clinic in Switzerland to identify factors
associated with difficult doctor-patient encounters. The data consist of 527 patient encounters
total, conducted by the 27 medical residents employed at the clinic during the time of the study.
After each encounter, the attending medical resident completed two questionnaires: the Difficult
Doctor Patient Relationship Questionnaire (DDPRQ-10) and the patient’s vulnerability grid (PVG).
The data are in difficult_encounters.Rdata, stored as the diff.enc dataframe.
The DDPRQ-10 is a survey that measures the difficulty of a patient encounter, with a higher score
indicating a more difficult encounter; the maximum possible score is 60 and encounters with scores
30 and higher are considered difficult. The DDPRQ-10 score for each encounter is stored as ddprq.
The PVG measures five dimensions of patient vulnerability: somatic determinants, mental health
state, behavioral determinants, social determinants, and healthcare use. Each dimension has a
certain number of associated characteristics; a patient receives 1 point for each characteristic. The
total score within a dimension is stored as the variable ending with total, while the variable ending
with bin is a binary variable where 1 corresponds to a score of 1 or greater for that dimension and
0 indicates the patient does not have any of the characteristics for that dimension.
– Somatic determinants (soma.bin, soma.total): factors related to physical impairment, such
as severe chronic disease, physical disability, or pregnancy [max score 6]
– Mental health state (mental.bin, mental.total): factors related to mental health, such as
mood disorder, post-traumatic stress disorder, or dementia [max score 9]
– Behavioural determinants (risk.bin, risk.total): factors related to risky behavior, such as
substance abuse or physical violence [max score 5]
– Social determinants (social.bin, social.total): factors related to social difficulty, such as
complex family situation, inadequate housing, or language barrier [max score 8]
– Healthcare use (health.bin, health.total): factors related to healthcare use, such as being a
frequent user or lacking a primary care physician [max score 3]
Features of the attending medical resident were also recorded: age in years (age), sex (sex, recorded
as F for female and M for male), and years of training completed (yrs.train).
a) Use graphical and numerical summaries to explore the distribution of DDPRQ-10 score.
Briefly summarize your findings. How many of the encounters are classified as difficult based
on DDPRQ-10 score?
DDPRQ-10 score is roughly symmetric around a mean of 30.62 points; as shown by the two
large outliers on the boxplot, there are two unusually high scores relative to the distribution.
The middle 50% of scores are between 26 and 35. 285 of the encounters are classified as
difficult, with scores of 30 or greater.

#load the data
load("datasets/difficult_encounters.Rdata")

#graphical summaries (COL is a color palette object, e.g. from the openintro package)
par(mfrow = c(1, 2))
hist(diff.enc$ddprq, main = "Distribution of DDPRQ-10 Score",
xlab = "DDPRQ-10 Score", border = COL[1], col = COL[1, 4])
boxplot(diff.enc$ddprq, main = "Distribution of DDPRQ-10 Score",
ylab = "DDPRQ-10 Score", border = COL[1], col = COL[1, 4],
pch = 21, outcol = COL[1], outbg = COL[1, 4])

[Figure: histogram (left) and boxplot (right) of DDPRQ-10 score; both titled "Distribution of DDPRQ-10 Score".]

#numerical summaries
summary(diff.enc$ddprq)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 15.00 26.00 30.00 30.62 35.00 51.00
sd(diff.enc$ddprq)

## [1] 6.52483
#identify difficult encounters
difficult = (diff.enc$ddprq >= 30)
table(difficult)

## difficult
## FALSE TRUE
## 242 285

b) Fit a model for the association of DDPRQ-10 score with features of the attending medical
resident. Is there evidence of a significant association between DDPRQ-10 score and any of
the physician features?
The three variables related to features of the attending medical resident are age, sex, and
years of training. The p-values associated with each of these slope coefficients are greater than
α = 0.05; there is no evidence of a significant association between DDPRQ-10 score and any
of the physician features.
Note: Not surprisingly, the F-statistic of the model also has a high p-value. As a group, these
physician features are not useful for predicting DDPRQ-10 score.
#fit a model
summary(lm(ddprq ~ age + sex + yrs.train, data = diff.enc))

##
## Call:
## lm(formula = ddprq ~ age + sex + yrs.train, data = diff.enc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.953 -4.409 -0.432 3.963 15.871
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.59442 2.88602 10.601 <2e-16 ***
## age -0.01633 0.10435 -0.157 0.876
## sexM -0.53512 0.78057 -0.686 0.494
## yrs.train 0.09591 0.21543 0.445 0.656
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.062 on 286 degrees of freedom
## (237 observations deleted due to missingness)
## Multiple R-squared: 0.002407, Adjusted R-squared: -0.008058
## F-statistic: 0.23 on 3 and 286 DF, p-value: 0.8755
c) Create a plot that shows the association between DDPRQ-10 score and patient mental health
vulnerability score (mental.total). Describe what you see.
Increased patient mental health vulnerability is associated with higher DDPRQ-10 score. A
patient who scores higher on mental health vulnerability tends to have an encounter that is
rated as being more difficult; for example, median DDPRQ-10 score is about 40 in patients
with a PVG mental health score of 4, and about 30 for those with a mental health score of 0.
#create a plot (brewer.pal requires the RColorBrewer package)
library(RColorBrewer)
boxplot(ddprq ~ mental.total, data = diff.enc,
main = "DDPRQ-10 Score vs Patient Mental Health Vulnerability",
ylab = "DDPRQ-10", xlab = "PVG Mental Health Score",
col = brewer.pal(5, "BuPu"),
pch = 21, outbg = brewer.pal(5, "BuPu"))

[Figure: boxplots of DDPRQ-10 score by PVG mental health score (0-4); title "DDPRQ-10 Score vs Patient Mental Health Vulnerability".]

d) Fit a model for the association of DDPRQ-10 score with patient mental health vulnerability
score while adjusting for physician features. Interpret the slope for mental.total.
According to the model, a 1 point increase in patient mental health vulnerability score is
associated with an increase of 2.7 points in mean DDPRQ-10 score, assuming that
physician features (age, sex, years of training) are held constant.
#fit a model with mental.total
lm(ddprq ~ age + sex + yrs.train + mental.total, data = diff.enc)

##
## Call:
## lm(formula = ddprq ~ age + sex + yrs.train + mental.total, data = diff.enc)
##
## Coefficients:
## (Intercept) age sexM yrs.train mental.total
## 30.00911 -0.09745 0.41943 0.32727 2.70456
e) Repeat part d), using the binary version of patient mental health vulnerability score
(mental.bin). Based on your observations in part c), do you prefer this model to the one in
part d), or would you prefer a model that treats patient mental health vulnerability score as a
categorical variable with several levels? Explain your answer.
According to the model, a patient with a mental health vulnerability score of 1 or greater
has average DDPRQ-10 score 4.74 points higher than one with a mental health vulnerability
score of 0, assuming that physician features are held constant.
There is not a clear difference in mean DDPRQ-10 between individuals without mental
health vulnerability and those with one or more mental health vulnerabilities, so the binary

version of the variable does not seem like the best choice. The increase in mean DDPRQ-10
score as PVG mental health score increases is roughly linear in the sample, so it would not
be unreasonable to use the model in part d) that treats mental health score as a numerical
variable. A model that treats patient mental health vulnerability score as categorical with
several levels may be preferable since it allows for the change in predicted mean DDPRQ-10
score to be different between any two ‘adjacent’ groups; e.g., the predicted mean change
between the vulnerability score 0 and vulnerability 1 groups is not necessarily equal to the
predicted mean change between the vulnerability score 3 and score 4 groups.
From a modeling perspective, treating mental health vulnerability score as a numerical
predictor allows for a simpler model (i.e., one with fewer predictors) than as a binary
predictor or predictor with several levels. A comparison of adjusted R² suggests that the
model using the fully categorical version of mental health score actually provides enough
information to offset the penalty of additional predictors; its adjusted R² is 0.146, which is very close to
that of the model using mental health vulnerability as a numerical predictor (adjusted R² = 0.145).
#fit a model with mental.binary
lm(ddprq ~ age + sex + yrs.train + mental.bin, data = diff.enc)

##
## Call:
## lm(formula = ddprq ~ age + sex + yrs.train + mental.bin, data = diff.enc)
##
## Coefficients:
## (Intercept) age sexM yrs.train mental.bin
## 29.32577 -0.08299 0.49084 0.30580 4.74444
#calculate group means
tapply(diff.enc$ddprq, diff.enc$mental.total, mean)

## 0 1 2 3 4
## 29.10774 31.43312 33.61364 36.21053 39.20000
#compare adjusted R^2
summary(lm(ddprq ~ age + sex + yrs.train + mental.total,
data = diff.enc))$adj.r.squared

## [1] 0.1450492
summary(lm(ddprq ~ age + sex + yrs.train + mental.bin,
data = diff.enc))$adj.r.squared

## [1] 0.1274226
summary(lm(ddprq ~ age + sex + yrs.train + as.factor(mental.total),
data = diff.enc))$adj.r.squared

## [1] 0.1458574

f) Fit a model for the association of DDPRQ-10 score with all five dimensions of patient
vulnerability (using the binary versions of the variables), while adjusting for physician
features.
i. Which patient features are significantly associated with DDPRQ-10 score?
The following patient vulnerabilities are significantly associated with DDPRQ-10 score:
mental health state, behavioural determinants, social determinants, and healthcare use.
ii. Interpret the coefficients for the features in part i.
Having at least one mental health vulnerability is associated with a higher predicted
mean DDPRQ-10 score of 3.15 points, relative to having none, assuming that physician
features and presence of other vulnerabilities are held constant.
Having at least one behavioural vulnerability is associated with a higher predicted
mean DDPRQ-10 score of 2.95 points, relative to having none, assuming that physician
features and presence of other vulnerabilities are held constant.
Having at least one social vulnerability is associated with a higher predicted mean
DDPRQ-10 score of 2.64 points, relative to having none, assuming that physician
features and presence of other vulnerabilities are held constant.
Having at least one healthcare-related vulnerability is associated with a higher predicted
mean DDPRQ-10 score of 3.16 points, relative to having none, assuming that physician
features and presence of other vulnerabilities are held constant.
#fit a model
summary(lm(ddprq ~ age + sex + yrs.train +
mental.bin + soma.bin + risk.bin + social.bin + health.bin,
data = diff.enc))

##
## Call:
## lm(formula = ddprq ~ age + sex + yrs.train + mental.bin + soma.bin +
## risk.bin + social.bin + health.bin, data = diff.enc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5845 -2.9475 -0.0842 2.7674 17.8825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.92229 2.63485 8.320 3.85e-15 ***
## age 0.05663 0.09023 0.628 0.530790
## sexM -0.56924 0.70145 -0.812 0.417754
## yrs.train 0.31951 0.18332 1.743 0.082449 .
## mental.bin 3.15214 0.66735 4.723 3.67e-06 ***
## soma.bin 0.45686 0.61996 0.737 0.461790
## risk.bin 2.94654 0.69402 4.246 2.97e-05 ***
## social.bin 2.63515 0.67061 3.929 0.000107 ***
## health.bin 3.16265 0.65590 4.822 2.33e-06 ***

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.075 on 281 degrees of freedom
## (237 observations deleted due to missingness)
## Multiple R-squared: 0.3131, Adjusted R-squared: 0.2935
## F-statistic: 16.01 on 8 and 281 DF, p-value: < 2.2e-16
g) Comment briefly on two limitations of this study, with respect to understanding factors
associated with difficult doctor-patient encounters.
One limitation of the study is generalizability; this study was conducted among medical
residents in a single outpatient clinic in Switzerland. These results may not generalize to
doctors who have finished their medical training or who work in a different type of clinical
setting, such as a large hospital. It also seems these residents have an unusually high number
of difficult encounters (more than half), which suggests this clinic sees more vulnerable patients
than typical. Thus, these results may only generalize to medical residents in countries with
similar standard of care working in a similar setting (outpatient clinic, perhaps targeted to
caring for vulnerable patients).
Physician features (gender, age, years of training) were not identified as associated with
DDPRQ-10 score. This result may simply be due to homogeneity in the physicians who
participated in the study—as medical residents, these 27 individuals will be quite similar in
age and years of training. It would be informative to collect data from a more diverse group
of physicians.
Another limitation is the lack of information about patient demographics. Including information
about patient age, gender, etc. in a model would be ideal, since these features may
also predict whether a doctor-patient encounter is difficult.

Problem 4.

In Units 6 and 7, you have become familiar with the Prevention of REnal and Vascular END-stage
Disease (PREVEND) study, which took place between 2003 and 2006 in the Netherlands. Clinical
and demographic information for 500 individuals are stored as prevend.samp in the oibiostat
package.
The PREVEND data were mainly used throughout the Unit 7 lectures to demonstrate one applica-
tion of multiple regression: estimating the association between a response variable and primary
predictor of interest while adjusting for confounders. This question uses the PREVEND data in the
context of explanatory model building.
Suppose that you have accepted a request to do some consulting work for a friend. Your task is to
develop a prediction model for RFFT score based on the following possible predictor variables and
the data in prevend.samp.
– Age: age in years
– Gender: gender, coded 0 for males and 1 for females
– Education: highest level of education
– DM: diabetes status, coded 0 for absent and 1 for present
– Statin: statin use, coded 0 for non-users and 1 for users
– Smoking: smoking, coded 0 for non-smokers and 1 for smokers
– BMI: body mass index, in kg/m²
– FRS: Framingham risk score, a measure of risk for a cardiovascular event within 10 years
The variable Education is coded 0 for primary school, 1 for lower secondary school, 2 for higher
secondary school, and 3 for university. A higher FRS indicates higher risk of a cardiovascular event.
Your friend has requested that your final model have no more than two predictor variables.
Additionally, your friend would like you to predict the mean RFFT score for a female individual of
age 55 with a university education, no diabetes, no statin use, who is not a smoker, has BMI of 24,
and FRS of 5. Use only the information necessary to make a prediction from your model. Be sure
to report and interpret a relevant (approximate) 95% prediction interval.
In your solution, briefly explain the work done at each step of developing the final model and
evaluate the final model’s strengths and weaknesses.
Data Exploration
RFFT score is symmetric, centered around a mean of 68 points with no outliers; the middle 50% of
scores are between 46 and 88 points. Age is symmetric, ranging from 36 years to 81 years,
with a mean at about 55 years. FRS is symmetric around a mean of 9.9; the middle 50% of scores
are between 5 and 15. BMI is heavily right skewed and would benefit from a log transformation;
median BMI is 26.1. Gender is roughly equally distributed. Educational level is roughly evenly
distributed in the upper three levels at around 150 people in each group, but there are about 50
participants who have only completed primary school. Relatively few participants are diabetic.
About 100 individuals use statins and about 100 individuals smoke.
The numerical variables (age, log BMI, FRS) appear linearly associated with RFFT score (correlations
of -0.53, -0.21, and -0.43, respectively), although the association for log BMI is weak.
Of the categorical variables, all except gender seem associated with RFFT score.
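
The exact code behind the summary figures shown below is not included in the original; a minimal sketch of this kind of exploration, assuming prevend.samp is loaded from the oibiostat package:

#distributions of the response and the numerical predictors
library(oibiostat)
data("prevend.samp")
par(mfrow = c(2, 2))
hist(prevend.samp$RFFT, main = "RFFT Score", xlab = "RFFT Score")
hist(prevend.samp$Age, main = "Age", xlab = "Age (yrs)")
hist(prevend.samp$BMI, main = "BMI", xlab = "BMI (kg/m^2)")
hist(log(prevend.samp$BMI), main = "log BMI", xlab = "log BMI (log kg/m^2)")

#counts for the categorical predictors
table(prevend.samp$Education)
table(prevend.samp$Smoking)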

Initial Model Fitting
Based on the data exploration, the initial model should include age, log BMI, FRS, educational
level, diabetes status, statin use, and smoking status. The R² of the model is 0.44, and the adjusted
R² is 0.43. The variables age, log BMI, education, and smoking status are statistically significant at
the α = 0.05 level.
Model Comparison
After removing FRS, diabetes status, and statin use, log BMI is no longer significant at the α = 0.05
level. The model with log BMI has an adjusted R² of 0.432, while the one with only age, educational
status, and smoking status has an adjusted R² of 0.430; the smaller model has a marginally lower
adjusted R². Of the possible two-variable models from age, educational status, and smoking status,
the model with age and educational status has the highest adjusted R² of 0.424. Adding an interaction
term between age and educational status raises the adjusted R² to 0.435.
Thus, the final model has age, educational status, and the interaction term between age and
educational status. The R² of this model is 0.443.
Model Assessment
The model residuals follow a normal distribution very well, with only a few observations in the
tails that deviate from normality. The variance of the residuals is roughly constant across the range
of predicted RFFT score, although there is somewhat higher variance at higher predicted values.
There is no evidence of remaining non-linearity in the model with respect to age; the dots scatter
randomly around the horizontal line.¹
Conclusions
The model R² is 0.443, which indicates that a model with age, educational status, and their
interaction explains 44.3% of the observed variation in RFFT score. The model F-statistic is highly
significant (p < 0.001), suggesting that these variables as a group are useful for predicting RFFT
score.
In all educational groups, age is negatively associated with RFFT score; a 1 year increase in age
is associated with a predicted mean decrease of 0.42 points in the primary school group, 0.58 points
in the lower secondary group, 1.21 points in the higher secondary group, and 1.09 points in the
university education group. The model predicts that mean RFFT score for an individual of age 55
and university education is 81.14 points. With 95% confidence, the RFFT score for such
an individual is expected to fall within the approximate prediction interval (40.67, 121.60).
The final model seems less reliable for predictions about individuals with university education.
For individuals in the university education group, the model predictions tend to be accurate within
±15 points; in other groups, the model tends to be accurate within ±10 points. Additionally, the
model residuals are somewhat more variable at younger ages (and higher RFFT score), particularly
between 40-50 years of age (and 80-100 points).

¹ The lighter dots correspond to lower values of educational level, while the darker dots correspond to higher values.

[Figure: histogram and boxplot of RFFT score, plus histograms/barplots of Age, Gender, Education, Diabetes Status, Statin Use, Smoking Status, BMI (kg/m^2), and FRS for the 500 individuals in prevend.samp.]
summary(prevend.samp$RFFT)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 11.0 46.0 67.0 68.4 88.0 136.0
summary(prevend.samp$Age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 36.00 46.00 54.00 54.82 64.00 81.00
summary(prevend.samp$BMI)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 18.11 23.87 26.11 26.90 29.00 60.95
summary(prevend.samp$FRS)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## -2.000 5.000 10.000 9.946 15.000 29.000

[Figure: histogram of log BMI (log kg/m^2).]

[Figure: scatterplot matrix of RFFT, Age, log(BMI), and FRS.]

[Figure: boxplots of RFFT score by Gender, Education, Diabetes, Statin Use, and Smoking Status.]

#add log BMI to dataframe


prevend.samp$log.BMI = log(prevend.samp$BMI)

#create a correlation matrix


prevend.subset = subset(prevend.samp,
select = c("RFFT", "Age", "log.BMI", "FRS"))

cor(prevend.subset)

## RFFT Age log.BMI FRS


## RFFT 1.0000000 -0.5338617 -0.2072366 -0.4370950
## Age -0.5338617 1.0000000 0.1722337 0.7467989
## log.BMI -0.2072366 0.1722337 1.0000000 0.2894448
## FRS -0.4370950 0.7467989 0.2894448 1.0000000

#initial model fitting
initial.model = lm(RFFT ~ Age + log(BMI) + FRS + Education + DM +
Statin + Smoking, data = prevend.samp)
summary(initial.model)

##
## Call:
## lm(formula = RFFT ~ Age + log(BMI) + FRS + Education + DM + Statin +
## Smoking, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.440 -14.439 -0.397 13.736 60.012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.5740 23.0170 6.542 1.55e-10 ***
## Age -1.0676 0.1398 -7.639 1.18e-13 ***
## log(BMI) -13.4837 6.4222 -2.100 0.03629 *
## FRS 0.3229 0.2601 1.242 0.21496
## EducationLowerSec 9.7603 3.3952 2.875 0.00422 **
## EducationHigherSec 20.4510 3.6281 5.637 2.95e-08 ***
## EducationUniv 31.4126 3.6337 8.645 < 2e-16 ***
## DMYes -3.5651 4.2723 -0.834 0.40443
## StatinUser 4.1729 2.4689 1.690 0.09164 .
## SmokingYes -7.8050 2.4401 -3.199 0.00147 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.64 on 484 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4443, Adjusted R-squared: 0.434
## F-statistic: 43 on 9 and 484 DF, p-value: < 2.2e-16
#model comparison
model1 = lm(RFFT ~ Age + log(BMI) + Education +
Smoking, data = prevend.samp)
summary(model1)

##
## Call:
## lm(formula = RFFT ~ Age + log(BMI) + Education + Smoking, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.621 -14.221 -0.587 14.622 58.418
##
## Coefficients:

## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 135.41171 21.13745 6.406 3.52e-10 ***
## Age -0.90143 0.08872 -10.160 < 2e-16 ***
## log(BMI) -10.48972 6.17515 -1.699 0.09001 .
## EducationLowerSec 9.75229 3.39310 2.874 0.00423 **
## EducationHigherSec 20.45097 3.62704 5.638 2.91e-08 ***
## EducationUniv 31.09886 3.63049 8.566 < 2e-16 ***
## SmokingYes -6.36708 2.24741 -2.833 0.00480 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.68 on 487 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.439, Adjusted R-squared: 0.4321
## F-statistic: 63.52 on 6 and 487 DF, p-value: < 2.2e-16
#remove log(BMI)
model2 = lm(RFFT ~ Age + Education + Smoking, data = prevend.samp)
summary(model2)

##
## Call:
## lm(formula = RFFT ~ Age + Education + Smoking, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.653 -14.352 -0.891 14.745 61.708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.23335 6.49052 15.597 < 2e-16 ***
## Age -0.91480 0.08854 -10.332 < 2e-16 ***
## EducationLowerSec 9.77626 3.39962 2.876 0.00421 **
## EducationHigherSec 20.82643 3.62729 5.742 1.65e-08 ***
## EducationUniv 32.01108 3.59749 8.898 < 2e-16 ***
## SmokingYes -5.93609 2.23735 -2.653 0.00823 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.72 on 488 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4357, Adjusted R-squared: 0.4299
## F-statistic: 75.35 on 5 and 488 DF, p-value: < 2.2e-16
#fit two-variable models
model3 = lm(RFFT ~ Age + Education, data = prevend.samp)
summary(model3)$adj.r.squared

## [1] 0.4238827

model4 = lm(RFFT ~ Age + Smoking, data = prevend.samp)
summary(model4)$adj.r.squared

## [1] 0.3020135
model5 = lm(RFFT ~ Smoking + Education, data = prevend.samp)
summary(model5)$adj.r.squared

## [1] 0.306606
#add interaction term
model6 = lm(RFFT ~ Age*Education, data = prevend.samp)
summary(model6)$adj.r.squared

## [1] 0.4348507
#model assessment

#define final model


final.model <- model6

#model summary
summary(final.model)

##
## Call:
## lm(formula = RFFT ~ Age * Education, data = prevend.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.613 -14.564 -1.067 13.774 62.266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.1321 18.6247 3.658 0.000281 ***
## Age -0.4238 0.2868 -1.478 0.140110
## EducationLowerSec 21.5941 20.6570 1.045 0.296369
## EducationHigherSec 67.8847 20.5033 3.311 0.000998 ***
## EducationUniv 73.4204 20.4093 3.597 0.000354 ***
## Age:EducationLowerSec -0.1567 0.3236 -0.484 0.628459
## Age:EducationHigherSec -0.7920 0.3294 -2.404 0.016565 *
## Age:EducationUniv -0.6747 0.3292 -2.050 0.040932 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.6 on 492 degrees of freedom
## Multiple R-squared: 0.4428, Adjusted R-squared: 0.4349
## F-statistic: 55.85 on 7 and 492 DF, p-value: < 2.2e-16

#prediction interval
y.hat = predict(final.model, newdata = data.frame(Age = 55, Education = "Univ"))
y.hat

## 1
## 81.13568
m = qt(0.975, summary(final.model)$df[2]) * summary(final.model)$sigma
y.hat - m; y.hat + m

## 1
## 40.67055
## 1
## 121.6008
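
For comparison, predict() can return a prediction interval directly; this uses the exact t-based interval rather than the approximation above, so the limits may differ slightly from (40.67, 121.60):

#prediction interval from predict()
predict(final.model, newdata = data.frame(Age = 55, Education = "Univ"),
        interval = "prediction", level = 0.95)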
#create q-q plot
qqnorm(resid(final.model),
pch = 21, col = COL[1], bg = COL[1, 4],
main = "Q-Q Plot of Model Residuals")
qqline(resid(final.model))

[Figure: normal Q-Q plot of the model residuals; title "Q-Q Plot of Model Residuals".]
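
The residual plots summarized below can be generated along these lines (a sketch; the original figure's shading by educational level is omitted here):

#residuals against predicted RFFT score and against age
par(mfrow = c(1, 2))
plot(fitted(final.model), resid(final.model),
     xlab = "Predicted RFFT Score", ylab = "Residual")
abline(h = 0, lty = 2)
plot(final.model$model$Age, resid(final.model),
     xlab = "Age", ylab = "Residual")
abline(h = 0, lty = 2)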
[Figure: model residuals plotted against predicted RFFT score (points shaded by educational level: Primary, LowerSec, UpperSec, Univ) and against age.]
Problem 5.

The Learning Early about Peanut Allergy (LEAP) study was conducted to investigate whether early
exposure to peanut products reduces the probability that a child will develop peanut allergies.²
The study enrolled infants with eczema, egg allergy, or both. Each child was randomly assigned
to either the peanut consumption (treatment) group or the peanut avoidance (control) group.
Children in the treatment group were fed at least 6 grams of peanut protein daily, while children
in the control group avoided consuming peanut protein.
At 5 years of age, each child was tested for peanut allergy using an oral food challenge (OFC):
consumption of 5 grams of peanut protein in a single dose. A child was recorded as passing the
OFC if no allergic reaction was detected, and failing the OFC if an allergic reaction occurred. The
following table summarizes the results from 530 participants in the study, organized by treatment
group and OFC outcome.

Fail OFC Pass OFC Total


Peanut Avoidance 36 227 263
Peanut Consumption 5 262 267
Total 41 489 530

a) Analyze the data to assess whether early exposure to peanut products seems to be an effective
strategy for reducing the chances of developing peanut allergies and summarize the results.
Be sure to check any assumptions.
Conduct a χ² test of association between treatment group and outcome. It is reasonable
to assume the infants are independent of each other since they were part of a large study.
None of the expected values are below 10. The p-value is 8.30 × 10⁻⁷; the results are highly
significant and there is sufficient evidence to reject the null hypothesis that treatment and
outcome are independent. Of the infants in the peanut consumption group, more pass the
OFC than expected, and fewer fail than expected. Thus, the data suggest that early exposure
to peanut products seems effective for reducing the chances of developing peanut allergies.
#enter the data
leap.table = matrix(c(36, 227, 5, 262), nrow = 2, ncol = 2, byrow = T)
dimnames(leap.table) = list("Group" = c("Avoidance", "Consumption"),
"Outcome" = c("Fail OFC", "Pass OFC"))
addmargins(leap.table)

## Outcome
## Group Fail OFC Pass OFC Sum
## Avoidance 36 227 263
## Consumption 5 262 267
## Sum 41 489 530
#conduct an analysis
chisq.test(leap.table)

##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: leap.table
## X-squared = 24.286, df = 1, p-value = 8.302e-07

² Du Toit, George, et al. Randomized trial of peanut consumption in infants at risk for peanut allergy. NEJM 372.9 (2015): 803-813.
chisq.test(leap.table)$expected

## Outcome
## Group Fail OFC Pass OFC
## Avoidance 20.34528 242.6547
## Consumption 20.65472 246.3453
chisq.test(leap.table)$resid

## Outcome
## Group Fail OFC Pass OFC
## Avoidance 3.470670 -1.0049648
## Consumption -3.444575 0.9974086
b) Calculate the relative risk of OFC failure in the peanut avoidance group relative to the
peanut consumption group. In other words, how much more likely is an infant in the peanut
avoidance group to fail the OFC than one in the peanut consumption group?
The relative risk of OFC failure in the peanut avoidance group compared to the peanut
consumption group is (36/263)/(5/267) = 7.31; failing the OFC is over 7 times as likely for an
infant in the peanut avoidance group as for an infant in the consumption group.
library(epitools)
riskratio(leap.table, rev = "both", method = "wald")$measure

## risk ratio with 95% C.I.


## Group estimate lower upper
## Consumption 1.000000 NA NA
## Avoidance 7.309506 2.913602 18.33774
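
The estimate can also be verified directly from the table counts:

#relative risk of OFC failure, avoidance vs. consumption, by hand
(36/263) / (5/267)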
c) Speculate as to why the study team chose to specifically enroll infants with eczema, egg
allergy, or both.
Although peanut allergy is becoming more common, it is still relatively rare. Infants who
have eczema or egg allergy are more likely to have other allergies, including peanut allergies.
Restricting the sample to infants who have eczema or egg allergy increases the chance of
observing some who will present with a peanut allergy at 5 years of age. Had the study simply
recruited infants from the general population, it is possible that no OFC failures would be
observed in either group, and the study would have been inconclusive.
Another plausible reason for the restriction is that the study results are meant to be generalized
specifically to infants susceptible to developing peanut allergies; i.e., the goal is to assess
whether peanut consumption should be recommended specifically for infants at greater risk
of developing peanut allergy, rather than all infants.

Problem 6.

In the PREVEND study introduced in Unit 6, researchers measured various features of study
participants, including data on BMI and diabetes status. Obesity is a known risk factor for Type 2
diabetes.
Organizations such as the Centers for Disease Control and the World Health Organization have
defined weight status categories for particular BMI ranges. A BMI below 18.5 is considered
underweight, while a BMI from 18.5 up to 25.0 is considered healthy. A BMI from 25.0 up to
30.0 is considered overweight, while a BMI of 30.0 or higher is considered obese.
For this problem, use prevend.samp to investigate the association between BMI weight status and
diabetes status.
a) Diabetes prevalence in the United States is approximately 9.4%. Suppose that individuals
in prevend.samp represent a random sample of individuals from the Netherlands. Assess
whether there is evidence that diabetes prevalence in the Netherlands is different from that
in the United States. Summarize your findings, including reporting and interpreting an
appropriate confidence interval.
Test the null hypothesis that the prevalence of diabetes is equal in the United States and the
Netherlands, H0 : p = 0.094, against the alternative hypothesis p ≠ 0.094. The p-value is 0.039;
there is sufficient evidence at the α = 0.05 significance level to reject the null hypothesis in
favor of the alternative hypothesis. Since the observed sample proportion is less than 0.094,
the data suggest that the prevalence of diabetes in the Netherlands is lower than in the United
States. With 95% confidence, the interval (0.047, 0.092) captures the prevalence of diabetes
in the Netherlands.
prop.test(sum(prevend.samp$DM == "Yes"), length(prevend.samp$DM), p = 0.094)

##
## 1-sample proportions test with continuity correction
##
## data: sum(prevend.samp$DM == "Yes") out of length(prevend.samp$DM), null probability 0.094
## X-squared = 4.28, df = 1, p-value = 0.03856
## alternative hypothesis: true p is not equal to 0.094
## 95 percent confidence interval:
## 0.04653657 0.09238083
## sample estimates:
## p
## 0.066
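
As a check on the normal-approximation test, an exact binomial test can be run with the same counts (a sketch using the same expressions as above):

#exact binomial test of H0: p = 0.094
binom.test(sum(prevend.samp$DM == "Yes"), length(prevend.samp$DM), p = 0.094)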
b) Run the code shown in the template to create a categorical version of the BMI variable named
BMI_Cat. Since there are very few individuals with BMI below 18.5, individuals with BMI
lower than 25 are grouped together.
library(oibiostat)
data("prevend.samp")

prevend.samp$BMI_Cat = cut(prevend.samp$BMI, breaks = c(0, 25, 30, 80),
                           labels = c("Underweight/Normal", "Overweight", "Obese"))

prevend.samp$DM = factor(prevend.samp$DM, levels = 0:1, labels = c("No", "Yes"))

c) Analyze the data to assess whether there is evidence of an association between BMI weight
status and diabetes status. Summarize your findings.
Conduct a χ² test of association between BMI weight status group and diabetes status. The
p-value is 0.003; the results are significant at α = 0.05 and there is sufficient evidence to
reject the null hypothesis that BMI weight status is independent of diabetes status. Of the
individuals in the underweight/normal category, fewer than expected have diabetes and
more than expected do not have diabetes. Of the individuals in the obese category, more than
expected have diabetes and fewer than expected do not have diabetes. Thus, the data suggest that
obesity is associated with higher incidence of diabetes.
bmi.dm.table = table(prevend.samp$DM, prevend.samp$BMI_Cat)

chisq.test(bmi.dm.table)

##
## Pearson's Chi-squared test
##
## data: bmi.dm.table
## X-squared = 11.855, df = 2, p-value = 0.002665
bmi.dm.table

##
## Underweight/Normal Overweight Obese
## No 184 194 89
## Yes 6 13 14
chisq.test(bmi.dm.table)$expected

##
## Underweight/Normal Overweight Obese
## No 177.46 193.338 96.202
## Yes 12.54 13.662 6.798
chisq.test(bmi.dm.table)$resid

##
## Underweight/Normal Overweight Obese
## No 0.49093897 0.04761013 -0.73427893
## Yes -1.84683876 -0.17910217 2.76224717

Problem 7.

Suppose we are interested in investigating the relationship between high salt intake and death
from cardiovascular disease (CVD). One possible study design is to identify a group of high- and
low-salt users then follow them over time to compare the relative frequency of CVD death in the
two groups. In contrast, a less expensive study design is to look at death records, identify CVD
deaths from non-CVD deaths, collect information about the dietary habits of the deceased, then
compare salt intake between individuals who died of CVD versus those who died of other causes.
This design is called a retrospective design.
Suppose a retrospective study is done in a specific county of Massachusetts; data are collected on
men ages 50-54 who died over a 1-month period. Of 35 men who died from CVD, 5 had a diet
with high salt intake before they died, while of the 25 men who died from other causes, 2 had a
diet with high salt intake. These data are summarized in the following table.

CVD Death Non-CVD Death Total


High Salt Diet 5 2 7
Low Salt Diet 30 23 53
Total 35 25 60

a) Under the null hypothesis of no association, what are the expected cell counts?
Under the null hypothesis of no association, the expected number of CVD deaths is 4.08 in
the high-salt group and 30.92 in the low-salt group and the expected number of non-CVD
deaths is 2.92 in the high-salt group and 22.08 in the low-salt group.
#enter data
salt.table = matrix(c(5, 2, 30, 23), nrow = 2, ncol = 2, byrow = T)
dimnames(salt.table) = list("Diet" = c("High Salt", "Low Salt"),
"Death" = c("CVD", "Non-CVD"))
addmargins(salt.table)

## Death
## Diet CVD Non-CVD Sum
## High Salt 5 2 7
## Low Salt 30 23 53
## Sum 35 25 60
#calculate expected cell counts
chisq.test(salt.table)$expected

## Death
## Diet CVD Non-CVD
## High Salt 4.083333 2.916667
## Low Salt 30.916667 22.083333
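
Equivalently, each expected count equals (row total × column total)/grand total; e.g., the high-salt/CVD cell is 7 × 35/60 ≈ 4.08. A one-line check:

#expected counts by hand: outer product of the margins divided by the grand total
outer(rowSums(salt.table), colSums(salt.table)) / sum(salt.table)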

b) Of the 35 CVD deaths, 5 were in the high salt diet group and 30 were in the low salt diet
group. Under the assumption that the marginal totals are fixed, enumerate all possible sets
of results (i.e., the table counts) that are more extreme than what was observed, in the same
direction.
Let p̂1 represent the proportion of CVD deaths in the high-salt group and p̂2 represent the
proportion of CVD deaths in the low-salt group. The observed results show a case of p̂1 > p̂2 ;
results that are more extreme consist of cases where more than 5 individuals in the high-salt
group had a death related to CVD. More extreme results are represented by cases where 6 or
7 high-salt individuals had a death related to CVD.

CVD Death Non-CVD Death Sum


High Salt Diet 6 1 7
Low Salt Diet 29 24 53
Sum 35 25 60

CVD Death Non-CVD Death Sum


High Salt Diet 7 0 7
Low Salt Diet 28 25 53
Sum 35 25 60

c) Calculate the probability of observing each set of results from part b).
The value 0.106 represents the probability of observing 6 CVD deaths among the 7 individuals in
the high-salt group (and thus 29 CVD deaths among the 53 individuals in the low-salt group), given
that there are a total of 60 deaths and 35 were CVD-related. In the language of the hypergeometric
distribution, the parameters of the distribution represented by these fixed marginal totals are
as follows: N = 60, there are m = 7 total successes (i.e., individuals on a high-salt diet), and the
sample obtained is n = 35 (i.e., the number of CVD deaths). For the
table in which 6 CVD deaths are observed in the high-salt group, k = 6.
The probability of observing 7 CVD deaths given these fixed marginal totals is 0.0174.
#probability of table with 6 high-salt CVD deaths
dhyper(6, 7, 60 - 7, 35)

## [1] 0.1050706
#probability of table with 7 high-salt CVD deaths
dhyper(7, 7, 60 - 7, 35)

## [1] 0.0174117
d) Evaluate the statistical significance of the observed data with a two-sided alternative. Let
α = 0.05. Summarize your results.
Test the null hypothesis that the proportion of CVD deaths among individuals on a high-salt
diet in the population is not different from the proportion of CVD deaths among those on a
low-salt diet in the population, H0 : p1 = p2, against the alternative HA : p1 ≠ p2. Let α = 0.05.
Since p = 0.69, p > α and there is insufficient evidence to suggest there is an association
between salt consumption and CVD death.

The same conclusion is reached by computing the two-sided p-value via the method of
doubling the one-sided p-value; the two-sided p-value is 0.749 with this method.
#use fisher.test
fisher.test(salt.table)

##
## Fisher's Exact Test for Count Data
##
## data: salt.table
## p-value = 0.6882
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.278957 21.620483
## sample estimates:
## odds ratio
## 1.897126
#double the one-sided p-value
one.sided = dhyper(5, 7, 60 - 7, 35) + dhyper(6, 7, 60 - 7, 35) +
dhyper(7, 7, 60 - 7, 35)
2*one.sided

## [1] 0.7493036

Problem 8.

The simulation code shown in the template computes confidence interval coverage for defined
values of p, n, and confidence level. By definition, a 95% confidence interval has a 95% coverage
probability for the population parameter; if 1,000 intervals are calculated, approximately 95% of
the intervals should contain p.
The simulation uses the rbinom( ) function to simulate the number of successes observed in n
trials with probability of success p. The confidence interval is calculated using the formula for an
approximate two-sided confidence interval discussed in lecture, which is commonly referred to as
the Wald interval:

p̂ ± 1.96 √( p̂(1 − p̂) / n ).
a) Run the simulation with n = 30 and p = 0.7. What is the observed coverage probability for a
95% confidence interval?
The observed coverage probability is 95%.
#set parameters
p = 0.7
n = 30
conf.level = 0.95
alpha = 1 - conf.level

num.iterations = 1000

#create empty vectors
contains.p = vector("numeric", num.iterations)

#set seed
set.seed(2019)

#simulate intervals
for(k in 1:num.iterations){

#simulate number of successes


x = rbinom(1, n, p)

#calculate confidence interval


p.hat = x/n
z.star = qnorm(1 - alpha/2)
m = z.star * sqrt((p.hat*(1 - p.hat))/n)
ci.ub = p.hat + m
ci.lb = p.hat - m

#record whether interval contains p


contains.p[k] = (p < ci.ub) & (p > ci.lb)
}

#summarize results
prop.table(table(contains.p))

## contains.p
## 0 1
## 0.05 0.95
b) Run the simulation with n = 30 and p = 0.1. What is the observed coverage probability for a
95% confidence interval?
The observed coverage probability is 80.4%.
#set parameters
p = 0.1
n = 30
conf.level = 0.95
alpha = 1 - conf.level

num.iterations = 1000

#create empty vectors


contains.p = vector("numeric", num.iterations)

#set seed
set.seed(2019)

#simulate intervals
for(k in 1:num.iterations){

#simulate number of successes


x = rbinom(1, n, p)

#calculate confidence interval


p.hat = x/n
z.star = qnorm(1 - alpha/2)
m = z.star * sqrt((p.hat*(1 - p.hat))/n)
ci.ub = p.hat + m
ci.lb = p.hat - m

#record whether interval contains p


contains.p[k] = (p < ci.ub) & (p > ci.lb)
}

#summarize results
prop.table(table(contains.p))

## contains.p
## 0 1
## 0.196 0.804
c) An alternative method for calculating the confidence interval for a single proportion is
referred to as the Agresti interval. To calculate the Agresti interval, define the sample
proportion as p̃ = (x + 2)/(n + 4), then use the formula

p̃ ± 1.96 √( p̃(1 − p̃) / (n + 4) ).

Modify the simulation code to simulate the Agresti interval. Run the simulation with n = 30
and p = 0.1 and compare the observed coverage probability for the 95% Agresti interval to
the answer from part b). Which interval do you consider more reliable?
The observed coverage probability for the Agresti interval is 97.8%. This interval performs
much better than the Wald interval, which had a coverage probability much lower than the
expected 95%. The Agresti interval seems more reliable based on these simulations.
Note: In general, the Wald interval has been shown to have low coverage probabilities for
values of p that are close to 0 or 1.
#set parameters
p = 0.1
n = 30
conf.level = 0.95
alpha = 1 - conf.level

num.iterations = 1000

#create empty vectors
contains.p = vector("numeric", num.iterations)
contains.p.agresti = vector("numeric", num.iterations)

#set seed
set.seed(2019)

#simulate intervals
for(k in 1:num.iterations){

#simulate number of successes


x = rbinom(1, n, p)

#calculate agresti interval


p.tilde = (x + 2)/(n + 4)
z.star = qnorm(1 - alpha/2)
m = z.star * sqrt((p.tilde*(1 - p.tilde))/(n + 4))
ci.ub.agresti = p.tilde + m
ci.lb.agresti = p.tilde - m

#record whether interval contains p


contains.p.agresti[k] = (p < ci.ub.agresti) & (p > ci.lb.agresti)
}

#summarize results
prop.table(table(contains.p.agresti))

## contains.p.agresti
## 0 1
## 0.022 0.978
