
Biostatistics Session 2022–2023 (Semester 2)

Tutorial 1
1. Assume Y and X are random variables.
(a) Consider the regression model Y = β0 + β1 X + ϵ. Express the OLS estimator for β1
in terms of cor(Y,X), s.d. of Y and s.d. of X.
(b) Consider the regression model X = γ0 + γ1 Y + ϵ. Express the OLS estimator for γ1
in terms of cor(Y,X), s.d. of Y and s.d. of X.
(c) Explain, also graphically, why the estimator of γ1 is not the same as the estimator
of β1 .
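The two answers can be checked numerically. The sketch below (with arbitrary illustrative parameter values, not taken from the exercise) verifies that the OLS slope of Y on X equals cor(Y,X)·sd(Y)/sd(X), while the slope of X on Y equals cor(Y,X)·sd(X)/sd(Y):

```r
# Numerical check of 1(a) and 1(b) on simulated data.
set.seed(123)
x <- rnorm(200, mean = 0, sd = 2)
y <- 1 + 0.5 * x + rnorm(200, sd = 1)

b1 <- coef(lm(y ~ x))[2]   # OLS estimate of beta1 in Y = beta0 + beta1*X + eps
g1 <- coef(lm(x ~ y))[2]   # OLS estimate of gamma1 in X = gamma0 + gamma1*Y + eps
r  <- cor(y, x)

all.equal(unname(b1), r * sd(y) / sd(x))   # TRUE: slope of Y on X
all.equal(unname(g1), r * sd(x) / sd(y))   # TRUE: slope of X on Y
```

Note that the product of the two slopes is r², which is why the two regression lines coincide only when |r| = 1.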

2. Assume Y and X are random variables. Suppose the following model holds Y = β0 +
β1 X + β2 X 2 + ϵ.
(a) Suppose we are interested in the expected difference in a variable Y when a second
variable X differs by one unit. Give an expression for this expected difference.
(b) Assume that X follows a normal distribution with zero mean. Give an expression for
the variance of Y explained by X.
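For part (b), a useful fact is that for X ~ N(0, σ²) the odd moments vanish, so Cov(X, X²) = 0 and Var(X²) = 2σ⁴. The simulation below checks the resulting expression Var(β1 X + β2 X²) = β1²σ² + 2β2²σ⁴ (the parameter values are illustrative assumptions, not part of the exercise):

```r
# Simulation check of 2(b): explained variance of Y by X when X ~ N(0, sigma^2).
set.seed(1)
b1 <- 2; b2 <- 0.5; sigma <- 1.5
x  <- rnorm(1e6, 0, sigma)

empirical   <- var(b1 * x + b2 * x^2)          # Monte Carlo estimate
theoretical <- b1^2 * sigma^2 + 2 * b2^2 * sigma^4

c(empirical, theoretical)   # the two values should be close
```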

3. Check the figure on slide 15 (Results Galton dataset) of the lecture again and note that the
predicted height for tall fathers is smaller than the fathers' own height, and the other way
around for short fathers. This is called regression to the mean. Explain intuitively why
this makes sense. Can you think of another example of this phenomenon?

4. In the lecture we analysed the height of sons in relation to the height of the fathers.
We considered five questions. Answer these questions for daughters and mothers. The data
is on Virtuale: Galton.txt.
(1) What is the correlation ρ between Y and X? Explain when you would be interested
in the correlation in general.
(2) Given a value for the height of the mother (X), what would be a prediction for the
height of her daughter (Y )?
(3) Is the relationship between Y and X of the previous question statistically significant?
Use a 5% significance level.
(4) How much of the variation of Y is explained by the random variable X?
(5) Is the height of the father a confounder? Motivate your answer.
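A sketch of the analysis in R is given below. The column names (`mother`, `father`, `sex`, `height`) are assumptions about the layout of Galton.txt — check them with head() first and adapt the code to the actual file:

```r
# Sketch for question 4 (column names are assumptions -- verify with head()).
galton <- read.table("Galton.txt", header = TRUE)
head(galton)

# Keep mother/daughter pairs if the file also contains sons.
daughters <- subset(galton, sex == "F")

# (1) correlation between daughter's height (Y) and mother's height (X)
cor(daughters$height, daughters$mother)

# (2)-(4) simple linear regression of Y on X
fit <- lm(height ~ mother, data = daughters)
summary(fit)            # slope and p-value (question 3), R^2 (question 4)
predict(fit, newdata = data.frame(mother = 65))   # prediction at X = 65

# (5) add the father's height and see whether the mother's estimate changes
summary(lm(height ~ mother + father, data = daughters))
```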

5. In this exercise our aim is to gain insight into the relationship between house price and some
house characteristics in King County. We use the King County dataset, which is on
Virtuale. In this dataset you can find the following variables:
• id - Unique ID for each home sold
• date - Date of the home sale
• price - Price of each home sold
• bedrooms - Number of bedrooms

• bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no
shower
• sqft living - Square footage of the apartment's interior living space
• sqft lot - Square footage of the land space
• floors - Number of floors
• waterfront - A dummy variable for whether the apartment was overlooking the wa-
terfront or not
• grade - An index from 1 to 13, where 1-3 falls short of building construction and
design, 7 has an average level of construction and design, and 11-13 have a high
quality level of construction and design.
and some more variables.
(a) Use
prices.dat<-read.csv('kc_house_data.csv', header=TRUE)
to load the data into R. Check the number of records. Also check the first rows of the
dataset by using head().
(b) We start with a multiple linear regression model with price as response and bedrooms,
sqft_living, sqft_lot, waterfront and grade as explanatory variables. Fit this model.
What is your conclusion?
(c) Now check the model fit by using the plot function. The conclusion is that there is
no good fit. Do you agree?
(d) Now check the distribution of the response variable. It is far from normal, and hence
it can be expected that the residuals of the multiple linear regression are also not
normal. Try a log transformation and check the distribution again.
(e) Repeat the regression of question (b) but now use the log of price as outcome variable.
Check the model fit again.
(f) Which variables are statistically significant at the 5% level? Note the negative sign of
the parameter estimates representing the effects of number of bedrooms and sqft_lot.
Any explanation?
(g) Explain why the economic status of the neighbourhood might be a confounder.
(h) We do not have the economic status of the neighbourhood, but we do have zipcode.
Include the variable as.factor(zipcode) in the model. Why do we use as.factor? What
is your conclusion?
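The steps above can be sketched as follows (the variable names match the standard kc_house_data.csv file; confirm them with head() on your copy):

```r
# Sketch of the steps in question 5.
prices.dat <- read.csv("kc_house_data.csv", header = TRUE)
nrow(prices.dat)          # number of records
head(prices.dat)          # first rows

# (b) multiple linear regression on the raw price
fit1 <- lm(price ~ bedrooms + sqft_living + sqft_lot + waterfront + grade,
           data = prices.dat)
summary(fit1)

# model diagnostics and the distribution of the response
par(mfrow = c(2, 2)); plot(fit1)
hist(prices.dat$price)
hist(log(prices.dat$price))   # much closer to symmetric

# same model with log(price) as outcome, then re-check the diagnostics
fit2 <- lm(log(price) ~ bedrooms + sqft_living + sqft_lot + waterfront + grade,
           data = prices.dat)
summary(fit2)
par(mfrow = c(2, 2)); plot(fit2)

# adjust for zipcode as a categorical (factor) variable
fit3 <- update(fit2, . ~ . + as.factor(zipcode))
summary(fit3)
```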

6. Simulation study. Perform the simulation described in the lecture to check whether the
theoretical standard error of the OLS estimator agrees with its empirical estimate. Do this
for a dataset of size 10, of size 50 and of size 100.
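One way to set up this simulation is sketched below. The design is kept fixed and the response is regenerated many times; the particular model parameters (β0 = 1, β1 = 2, σ = 3) are illustrative assumptions, not prescribed by the exercise:

```r
# Compare the theoretical SE of the OLS slope, sigma / sqrt(sum((x - xbar)^2)),
# with the empirical standard deviation of the slope estimates over many runs.
se.check <- function(n, nsim = 2000, beta0 = 1, beta1 = 2, sigma = 3) {
  set.seed(2023)
  x <- rnorm(n)                   # fixed design, reused in every simulation run
  theoretical <- sigma / sqrt(sum((x - mean(x))^2))
  est <- replicate(nsim, {
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
    coef(lm(y ~ x))[2]
  })
  c(theoretical = theoretical, empirical = sd(est))
}

sapply(c(10, 50, 100), se.check)  # the two rows should agree closely
```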

7. (a) Generate an exposure variable X1 , a covariate X2 and a response as follows:


set.seed(1)
x1<-rnorm(50,0,2)
x2<-rnorm(50,0,3)
y<-1+0.1*x1+2*x2+rnorm(50,0,3)

Check whether the standard error of the estimator of the parameter representing the
effect of X1 on Y is smaller in the multiple linear regression than in the simple linear
regression. Also calculate cor(x1,x2).
(b) Generate 100 datasets to check how often the p-value of the Wald test for H0 : β1 = 0,
with β1 the parameter representing the effect of X1 on Y , is smaller in the multiple
linear regression than in the single-variable regression. To do this:
– Do you generate 100 times (y,x1,x2) or just one time (x1,x2) and 100 times
y?
– To extract the p-value you can use the coef() function:
dif<-rbind(dif, coef(summary(lm(y~x1+x2)))[2,4] < coef(summary(lm(y~x1)))[2,4])
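One possible implementation is sketched below; it fixes (x1, x2) once and regenerates y in every run, which is one of the two options the first bullet asks you to weigh up:

```r
# 7(b): compare p-values of the Wald test for beta1 across 100 simulated datasets.
set.seed(1)
x1 <- rnorm(50, 0, 2)
x2 <- rnorm(50, 0, 3)

dif <- NULL
for (i in 1:100) {
  y <- 1 + 0.1 * x1 + 2 * x2 + rnorm(50, 0, 3)
  # TRUE when the p-value for x1 is smaller in the multiple regression;
  # column 4 of the coefficient matrix holds the p-values, row 2 is x1.
  dif <- rbind(dif, coef(summary(lm(y ~ x1 + x2)))[2, 4] <
                    coef(summary(lm(y ~ x1)))[2, 4])
}
mean(dif)   # proportion of runs where the multiple model gives a smaller p-value
```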

8. Design a simulation study with an exposure X1 and a confounder X2 , and verify that the
linear regression model with only X1 in the model provides a biased estimator of the effect
of X1 .
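A minimal sketch of such a study is given below. X2 is made to affect both X1 and Y, so omitting it biases the estimate of the X1 effect; all parameter values are illustrative choices, and the true effect of X1 is set to 1:

```r
# Confounding simulation: compare the X1 estimate with and without X2.
set.seed(10)
nsim <- 2000; n <- 100
beta1 <- 1                           # true effect of X1 on Y
est.short <- est.full <- numeric(nsim)
for (i in 1:nsim) {
  x2 <- rnorm(n)                     # confounder
  x1 <- 0.8 * x2 + rnorm(n)          # exposure depends on the confounder
  y  <- beta1 * x1 + 2 * x2 + rnorm(n)
  est.short[i] <- coef(lm(y ~ x1))[2]        # X2 omitted: biased
  est.full[i]  <- coef(lm(y ~ x1 + x2))[2]   # X2 included: unbiased
}
c(short = mean(est.short), full = mean(est.full))  # compare with beta1 = 1
```

The average of the short-model estimates is shifted away from 1 by the classical omitted-variable term, while the full model recovers β1.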
