S24 hw6

Homework 6
82pt
Data background
The primary objective of the study on the Efficacy of Nosocomial Infection Control (SENIC) was to
determine whether infection surveillance and control programs have reduced the rates of nosocomial
(hospital-acquired) infection in United States hospitals. The data set consists of a random sample of 113
hospitals selected from the original 338 hospitals surveyed.
Each line of the data set has an identification number and provides information on 11 other variables for
a single hospital. The 12 variables and descriptions are the following. The data, senic_homework6.csv,
is available on the course website. Note that in this data, the region has been recoded such that 1=NC,
2=NE, 3=S, 4=W, and the default baseline level is NC or 1.
Variable number Variable name Description

1 Id 1-113
(Y) 2 Length of stay Average length of stay of all patients in hospital (in days)
3 Age Average age of patients (in years)
Average estimated probability of acquiring infection in hospital

(infection)4 Infection risk (in percent)
Routine Ratio of number of cultures performed to number of patients

5 culturing ratio without signs or symptoms of pneumonia, times 100
Routine chest Ratio of X-rays performed to number of patient without signs or

6 X-ray ratio symptoms of pneumonia, times 100
7 Number of beds Average number of beds in hospital during study period
Medical school
8 affiliation 1=yes, 2=no
(region)9 Region Geographic region, where 1=NE, 2=NC, 3=S, 4=W
Average daily Average number of patients in hospital per day during study
10 census period
Number of Average number of full-time equivalent registered and licensed

(nurse)11 nurses practical nurses during study period
Available
facilities and Percent of 35 potential facilities and services that are provided
12 services by the hospital
1
Use R for the homework. For each question, attach code, R output and interpret the relevant
output clearly.
1. (4) the length of stay (Y) is to be regressed on infection (infection) and the availability of nurse. Note
that the categorical variable, nurse2, is created from the original continuous variable, nurse in the
data. Specifically, nurse2 =0 if the number of nurses is no less than the average value (173), or =1
otherwise.
nurse2<-ifelse(nurse>=173, 0, 1)
Consider the MLR model: 𝑌 = 𝛽0 + 𝛽1 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 + 𝛽2 𝑛𝑢𝑟𝑠𝑒2 + 𝛽12 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 𝑛𝑢𝑟𝑠𝑒2. Use R to
complete the model summary. (2)
Observation: infection seems to be a good predictor for the response variable however nurse2 doesn’t.
the interaction coefficient is negative , which can mean the relationship of the continuous variable
(infection) is weaker within each category of nurse2.
The general form of the mean response functions for Y at the baseline level when 𝑛𝑢𝑟𝑠𝑒2 = 0.
is :𝑌 = 𝛽0 + 𝛽1 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 + 𝛽2 (0) + 𝛽12 (𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 0) = 𝛽0 + 𝛽1𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 , find the estimated
mean response function by finding the value of the parameter estimate, i.e., 𝑏0 and 𝑏1 from the model
summary__Y = 5.2257 + 1.0359(infection)_________ (1)
2
The general form of the mean response functions at the level where 𝑛𝑢𝑟𝑠𝑒𝐹 = 1 is = 𝛽0 +
𝛽1 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 + 𝛽2 (1) + 𝛽12 (𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 1) = (𝛽0 + 𝛽2 ) + (𝛽1 + 𝛽12 )𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛.
Find the estimated form __Y=_(5.2257+1.5468) + (1.0359 – 0.4174)infection__= 6.7725 +
0.6185*infection______________(1)
2.(10) Refer to question 1, complete the following hypothesis with a GLT F-test. Specifically, define the
full model, reduced model, compute the test statistic, critical value, p-value, and state the conclusion.
a). (5) Infection has no impact on Y regardless the number of nurses is no less than the average or not
(i.e., nurse2=0 or 1)
H0: 𝛽1 = 𝛽12= 0 , reduced model: Y i=β0 + β2 X i ,2+ ε
Ha: Not H0 , Full model : Y i=β0 + β1 Xi ,1 + β2 X i ,2+ β12 Xi , 1 Xi , 2+ ε
Critical value:
Assuming significance level α =0.05
Critical value = 3.079596
SSE(R) – SSE(F) 383.79 – 283.064

dfE(R) – dfE(F) 2
Test Statistic = Fs SSE(F)/df(𝐹)
= 283.064/109
= 19.39338
3
p-value:
p-value = 6.2317*e-8 , less than α

Conclusion:
We can see that F-statistic is greater than critical value and, p value is less than
significance level (0.05).
Therefore, we can reject the null hypothesis. This implies that infection dos have an
impact on Y regardless of whether nurse2 is 0 or 1.
b). (5) Infection has the same impact Y when the number of nurses is no less than the average or greater.
In other words, there is no interaction impact between infection and nurse2 on Y.
H0: 𝛽12= 0 , reduced model: Y i=β0 + β1 Xi ,1 + β2 X i ,2+ ε

Ha: Not H0 , Full model : Y i=β0 + β1 Xi ,1 + β2 X i ,2+ β12 Xi , 1 Xi , 2+ ε
Critical value:
Assuming significance level α =0.05
Critical value = 3.928195
SSE(R) – SSE(F) 288.887 – 283.064

dfE(R) – dfE(F) 1
Test Statistic = Fs = SSE(F)/df(𝐹) = = 2.242274
283.064/109
p-value:
4
p-value = 0.1371736
Conclusion:
We can see that F-statistic is less than critical value and, p value (0.1371736) is
greater than significance level (0.05).
Therefore, we do not reject the null hypothesis. This implies that there is no interaction
impact between infection and nurse2 on Y.
3. (8) Consider the multiple linear regression (MLR) model Y~ infection + Region+ infection*Region,
where Region has four levels and can be modeled with 3 dummy variables: X2, X3, and X4. The default
baseline chosen in R is NC based on alphabetical order, with X2=0, X3=0, and X4=0 if the region is
NC, X2=1, X3=0, and X4=0 if the region is NE, X2=0, X3=1, and X4=0 if the region is S, and X2=0,
X3=0, and X4=1 if the region is W.
The MLR full model can be written as following:

𝑌 = 𝛽0 + 𝛽1 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝛽4 𝑋4 + 𝛽12 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 𝑋2 + 𝛽13 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝑋3
+ 𝛽14 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 𝑋4
The response function can be specifically defined based on the 4 levels of the categorical variable,
region. For the baseline (NC): Y=𝛽0 + 𝛽1 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 which is estimated as b0+b1infection
For Level 2(NE): 𝑌 = 𝛽0 + 𝛽2 + (𝛽1 + 𝛽12 )𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛
For level 3 (S)=𝑌 = 𝛽0 + 𝛽3 + (𝛽1 + 𝛽13 )𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛
For Level 4(W): 𝑌 = 𝛽0 + 𝛽4 + (𝛽1 + 𝛽14 )𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛
a).(4) Obtain the model summary and ANOVA table using R. Then, using the model summary, write the
estimated response function for each level of the four regions (1 points each)
• Hint1: There are two ways to do this in R. You can manually create dummy variable columns; or use the
factor(region) in the lm function: lm(Y~infection+factor(Region)+infection*factor(Region)). Although the
factor() method can generate the summary for the full model quicky, it is not as flexible as the dummy
variable method to evaluate the parameters in the model.
• Hint2: You may use the relevel() function in R to change the baseline category. For example, if you want
to use the second, or the NE category as the baseline, do data$region<-relevel(data$region,2). This is
useful when you are not comparing the mean response value of Y in a level to that in the baseline level.
5
For the baseline (NC): YNC=7.56051+0.48317×infection
For Level 2 (NE): YNE=(7.56051−3.02259)+(0.48317+0.86458)×infection
For Level 3 (S): Ys=(7.56051−0.43117)+(0.48317+0.04191)×infection
For Level 4 YW=(7.56051+0.47754)+(0.48317−0.46589)×infection
b). (4) Consider the MLR model Y~infection*region instead of Y~infection*factor(region) in R, which
causes R to treat “region” as a continuous variable. Compare the model summary and ANOVA output
for this incorrect model to those in the previous questions. Be sure to specify in detail the differences
between the two models, particularly in terms of the beta coefficients, MSE and dfE.
6
Observation:
The number of coefficients in the first model is greater than the inaccurate model because it implements
a case wise region whereas the inaccurate model doesn’t distinguish between regions and just uses a
continuous variable. The summary for the inaccurate model says region is bad predictor for the model
but it is not the case for the first model it implies that not all the regions have insignificant linear impact.
The dfe for model1 is less because it is using dummy variables in its statistical calculation whereas for
the inaccurate model it takes region as a single continuous variable so its dfe is greater.
The MSE of the interaction terms is quite different , in the first model, the interaction between
senic$infection and each level of senic$region is explicitly modeled, resulting in a more accurate
assessment of the variability of the residuals and the overall model fit , this is the main reason
its MSE is less than the inaccurate model’s. Also the df for each model’s MSR is different ; 1 for
inaccurate and 3 for the first model.
4. (24) What is the implication of centering X variable to simply the mean response prediction? If we
center the X variable,
𝑋′ = (𝑋 − 𝑋̅), then 𝑋′ = 0 𝑤ℎ𝑒𝑛 𝑋 = 𝑋̅
7
We can then fit the model 𝑌~𝛽0 + 𝛽1 𝑋′. The mean response prediction at the mean level (𝑋 = 𝑋̅)
simplifies to
𝑌̂ℎ = 𝛽0
because there is only the intercept value in the equation.
Now center the infection variable and denote the centered variable as 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝐶, where
𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝐶 = 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 − 𝑡ℎ𝑒 𝑚𝑒𝑎𝑛 𝑜𝑓 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛.
Then 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝐶 = 0 when 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 takes the average value.
Refit the MLR model with 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝐶, region, and the interaction between them.
a). (6) predict the mean response value, 𝑌ℎ when infection takes the average value and fill in the blanks.
The purpose of this question is to interpret the meaning of 𝛽0 , 𝛽1 , 𝛽2 , 𝛽3 , 𝛽4 , 𝛽12 , 𝛽13 , 𝛽14 when the
predictor takes the average value.
The average Y for NC is represented by 𝛽0 and has a predicted value of _9.66465_____.

The average Y for NE is different from NC by 𝛽2 and has a predicted value of __9.66465 + 0.74252 =
10.40717____.
The average Y for S is different from NC by __𝛽3 ____(which beta?) and has a predicted value of
__9.416____.
The average Y for W is different from NC by __𝛽4 ____(which beta?) and has a predicted value of
__8.11329____.
8
b). (13) Now consider the impact of infection on Y,
The impact of infection on Y is 𝛽1 for NC. This impact is _________(significant/insignificant),
because____p-value for infection (0.008985) is__less than
0.05_______________________________________________________________.
The impact of infection on Y for NE is different from NC by 𝛽12 and has a predicted value of _𝛽0 +
𝛽1 + _𝛽12_= 11.0124
___, the difference is ______(significance/insignificance),
because__p-value of
𝛽12 (0.002) 𝑖𝑚𝑝𝑎𝑐𝑡 𝑖𝑠 𝑙𝑒𝑠𝑠 𝑡ℎ𝑎𝑛 0.05_____________________________________________________
___________________.
The impact of infection on Y for S is different from NC by ___𝛽13 ___(which beta?) and has a predicted
value of _𝛽0 + 𝛽1 + _𝛽13 = 9.89917___. the difference is ______(significance/insignificance),
because______ p-value of
𝛽13 (0.860791)𝑖𝑚𝑝𝑎𝑐𝑡 𝑖𝑠 𝑔𝑟𝑒𝑎𝑡𝑒𝑟 𝑡ℎ𝑎𝑛 0.05______________________________________________
______________________.
The impact of infection on Y for W is different from NC by _𝛽14 _____(which beta?) and has a
predicted value of __𝛽0 + 𝛽1 + _𝛽14 = 9.68193
____. the difference is ______(significance/insignificance),
because______________________ p-value of 𝛽14
(0.289931)𝑖𝑚𝑝𝑎𝑐𝑡 𝑖𝑠 𝑔𝑟𝑒𝑎𝑡𝑒𝑟 𝑡ℎ𝑎𝑛 0.05_________________________________________________
_________.
c) (5) Based on the result in b), can you judge whether the impact of infection on Y for NE is
significantly different from for S? If not, find a way to compare and then conclude.
Conducting t-test:
H0: 𝛽12 = 𝛽13
Ha: 𝛽12 ≠ 𝛽13
Ts = 𝛽12 - 𝛽13 / √(𝑠 2 [𝛽12 ] + 𝑠 2 [𝛽13 ])/𝑛
Ts = 0.82267/0.03414655 = 24.09233
p-value
p-value = 4.061 * e^-46
Conclusion since p-value is less than significance value , we reject the null hypothesis therefore there is
significant difference in the linear impact of infection for NE compared to S.
5. (6) Perform a hypothesis test to see whether the interaction effect between infection and region on Y
is significant. Define the hypothesis with appropriate notation. Then complete the hypothesis with a
GLT F-test. Specifically, define the full model, reduced model, compute the test statistic, critical value,
p-value, and state the conclusion.
9
H0: 𝛽12=𝛽13= 𝛽14= 0 , reduced model: Y i=β0 + β1*infection + β2 X i ,2+ β3 X i ,3+ β4 X i ,4 + ε
Ha: Not H0 , Full model : Y i=𝛽0 + 𝛽1 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 + 𝛽2 𝑋2 + 𝛽3 𝑋3 + 𝛽4 𝑋4 + 𝛽12 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 𝑋2 +
𝛽13 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛𝑋3 + 𝛽14 𝑖𝑛𝑓𝑒𝑐𝑡𝑖𝑜𝑛 ∗ 𝑋4 + ε
Critical value:
SSE(R) – SSE(F) 220.798 – 192.300

dfE(R) – dfE(F) 3
Test Statistic = Fs SSE(F)/df(𝐹)
= 192.300/105
= 5.188
p-value:
p-value = 0.0022 (<0.05)

Conclusion:
10
Since test statistic is greater than critical value and p-value(0.0022 ) is less than significance value we
reject the null hypothesis therefore the interaction effect between infection and region on Y is
significant
11
6. (30) Consider the Life Expectancy data, and the MLR model given by 𝒀~𝑿𝟏 + 𝑿𝟐 + 𝑿𝟑
a). (6) Obtain a partial regression plot for X1, X2, and X3 and discuss whether the regression
relationships in the fitted regression function are inappropriate for any of the predictor variables.
12
The plot of X1 seems more logarithmic than linear , there seems to be non-constant variance at some
parts of the plot.
For the plot of X2 , the slope is less steep indicating a weaker relationship with response variable. The
data points don’t seem equally distributed the larger X2 values have less data points.
The plot of X3, also takes logarithmic form especially for the initial values , the variance like the others
seems non-constant.
All the plots seem to have outliers as well and have indications of non-linear relationship and non-
constant variance which are inappropriate for a fitted regression function.
b). (6) (Model selection) Discuss the best subset regression models according to the
2
𝑅𝑎𝑑𝑗 , 𝐶𝑝, 𝐴𝐼𝐶𝑝 , 𝐵𝐼𝐶𝑝 , 𝑎𝑛𝑑 𝑃𝑅𝐸𝑆𝑆.
Pressp: Y~X1 + X2 + X3
R2.adj: Y~X1 + X2 + X3
Cp: Y~X1 + X2 + X3
AICP: Y~X1 + X2 + X3
Bicp: Y~X1 + X3
c). (10) Check for influential points by calculating DFFITS, DFBETAS and Cook’s distance values.
DFFTS:
DFBETAS
13
Cook’s distance values
No points with major or minor influence.
d). (8) Determine whether there is multicollinearity present based on VIF.
Since each of the values are less than 10 we can say that there doesn’t seem to be any multicollinearity
issue.
14

S24 hw6

Uploaded by

Copyright:

Available Formats

You might also like

S24 hw6

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

S24 hw6

Uploaded by

Copyright:

Available Formats

Homework 6

Variable number Variable name Description

3 Age Average age of patients (in years)

Average estimated probability of acquiring infection in hospital

Routine Ratio of number of cultures performed to number of patients

Routine chest Ratio of X-rays performed to number of patient without signs or

7 Number of beds Average number of beds in hospital during study period

Number of Average number of full-time equivalent registered and licensed

Critical value = 3.079596

SSE(R) – SSE(F) 383.79 – 283.064

p-value = 6.2317*e-8 , less than α

H0: 𝛽12= 0 , reduced model: Y i=β0 + β1 Xi ,1 + β2 X i ,2+ ε

Critical value = 3.928195

SSE(R) – SSE(F) 288.887 – 283.064

The MLR full model can be written as following:

For Level 2 (NE): YNE=(7.56051−3.02259)+(0.48317+0.86458)×infection

For Level 3 (S): Ys=(7.56051−0.43117)+(0.48317+0.04191)×infection

For Level 4 YW=(7.56051+0.47754)+(0.48317−0.46589)×infection

The average Y for NC is represented by 𝛽0 and has a predicted value of _9.66465_____.

p-value = 4.061 * e^-46

SSE(R) – SSE(F) 220.798 – 192.300

p-value = 0.0022 (<0.05)

No points with major or minor influence.

d). (8) Determine whether there is multicollinearity present based on VIF.

You might also like