
School of Mathematics

MATH3714/5714: Linear Regression and Robustness


Exercise 1

Q1 For the simple linear regression model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

derive the least squares estimators β̂0 and β̂1 given by

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}.$$
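
Once derived, the estimators can be checked numerically. The sketch below assumes a small made-up data set (the vectors x and y are illustrative and not part of this sheet) and compares the closed-form expressions with the coefficients returned by R's lm().

```r
# Hypothetical toy data, for illustration only (not from the exercise sheet)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimators
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Coefficients fitted by lm(), for comparison
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two printed vectors should agree up to floating-point rounding; this confirms the formulae but does not replace the derivation.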

Q2 Express the simple linear regression model

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

in matrix notation, stating clearly the contents of your design matrix, X, and vectors y, β, ε.
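Form the information matrix XᵀX, its inverse (XᵀX)⁻¹, and Xᵀy. Hence show that

$$\hat\beta = (X^T X)^{-1} X^T y = \frac{1}{n \sum x_i^2 - \left(\sum x_i\right)^2} \begin{pmatrix} \sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i \\ -\sum x_i \sum y_i + n \sum x_i y_i \end{pmatrix}.$$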

Express β̂1 and β̂0 in the form given below, noting that to produce the expression for β̂0 in the
form given, it helps to add and subtract n²x̄²ȳ in the numerator.

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x, \qquad \hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2}.$$

The variance-covariance matrix of β̂ = (β̂0, β̂1)ᵀ is Var(β̂) = σ²(XᵀX)⁻¹, where σ² is the
error variance. What does this tell you about β̂0 and β̂1 here?
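
The matrix calculation can be sketched in R as a numerical check. The data below are made up for illustration (they do not come from this sheet), and the names XtX, Xty, beta_hat and beta_explicit are hypothetical.

```r
# Hypothetical toy data, for illustration only (not from the exercise sheet)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

X   <- cbind(1, x)             # design matrix: column of ones, then x
XtX <- t(X) %*% X              # information matrix
Xty <- t(X) %*% y
beta_hat <- solve(XtX, Xty)    # solves (X^T X) beta = X^T y

# Explicit solution built from the sums, as in the displayed formula
det_XtX <- n * sum(x^2) - sum(x)^2
beta_explicit <- c(sum(x^2) * sum(y) - sum(x) * sum(x * y),
                   -sum(x) * sum(y) + n * sum(x * y)) / det_XtX

print(cbind(beta_hat, beta_explicit))
print(solve(XtX))  # off-diagonal entry is proportional to Cov(beta0_hat, beta1_hat)
```

Inspecting the off-diagonal entry of the printed inverse is one way to approach the final part of the question.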

Q3 In the usual multiple linear regression model y = Xβ + ε, the estimated or fitted y values
are given by ŷ = Xβ̂ = Hy, where H = X(XᵀX)⁻¹Xᵀ is the hat matrix, and the residuals are
e = y − ŷ = (I − H)y. Show that H = Hᵀ, H² = H, and (I − H)² = I − H.
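
The three identities can be checked numerically before attempting the proof. This sketch takes H = X(XᵀX)⁻¹Xᵀ, the usual hat matrix, and uses a made-up design matrix (an intercept plus two hypothetical predictors); it illustrates the result for one example only.

```r
# Hypothetical design matrix: intercept plus two made-up predictors
X <- cbind(1, c(1, 2, 3, 4, 5, 6), c(2, 1, 4, 3, 6, 5))

H   <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
I_n <- diag(nrow(X))                     # n x n identity matrix

print(all.equal(H, t(H)))                            # symmetry
print(all.equal(H %*% H, H))                         # idempotence of H
print(all.equal((I_n - H) %*% (I_n - H), I_n - H))   # idempotence of I - H
```

Here all.equal() compares up to a small numerical tolerance, so each line should print TRUE.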
Q4 The following data relate biomass production of soyabeans to cumulative intercepted solar
radiation over an 8-week period following emergence. Biomass production is the mean dry weight
in grams of independent samples of four plants.

Solar radiation, x    Plant biomass, y
       29.7                 16.6
       68.4                 49.1
      120.7                121.7
      217.2                219.6
      313.5                375.5
      419.1                570.8
      535.9                648.2
      641.5                755.6

Use the following R commands to carry out a linear regression analysis of plant biomass on
solar radiation. Evaluate 95% confidence intervals for the regression coefficients β0 and β1 .
Comment on your results.

x = c(29.7, 68.4, 120.7, 217.2, 313.5, 419.1, 535.9, 641.5)
y = c(16.6, 49.1, 121.7, 219.6, 375.5, 570.8, 648.2, 755.6)
plot(x, y, xlab="solar radiation", ylab="biomass", main="soy prod")
lm1 = lm(y ~ x)   # fit biomass on solar radiation
abline(lm1)       # add the fitted line to the plot
summary(lm1)      # estimates, standard errors, residual df
qt(.975, df=6)    # t quantile for n - 2 = 6 residual df (n = 8 observations)
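
One way to obtain the intervals is sketched below. It refits the same model and compares R's built-in confint() with the intervals computed by hand as estimate ± t quantile × standard error. With n = 8 observations and two estimated coefficients, the residual degrees of freedom are n − 2 = 6. The names ci_builtin, ci_manual, est, se and tq are hypothetical.

```r
x <- c(29.7, 68.4, 120.7, 217.2, 313.5, 419.1, 535.9, 641.5)
y <- c(16.6, 49.1, 121.7, 219.6, 375.5, 570.8, 648.2, 755.6)
lm1 <- lm(y ~ x)

# Built-in 95% confidence intervals for beta0 and beta1
ci_builtin <- confint(lm1, level = 0.95)

# The same intervals by hand: estimate +/- t quantile * standard error,
# using the residual degrees of freedom of the fit (n - 2)
est <- coef(summary(lm1))[, "Estimate"]
se  <- coef(summary(lm1))[, "Std. Error"]
tq  <- qt(0.975, df = df.residual(lm1))
ci_manual <- cbind(est - tq * se, est + tq * se)

print(ci_builtin)
print(ci_manual)
```

The two sets of intervals should coincide; whether an interval contains zero is a natural starting point for the comment the question asks for.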

Q5 A hotel experienced an outbreak of Pseudomonas dermatitis among its guests. Physicians
suspected the source of infection to be the hotel whirlpool-spa. The data in the table give the
number of female guests and the number infected by categories of time (minutes) spent in the
whirlpool.

Time (minutes)    Number of guests    Number infected
    0–10                  8                   1
   11–20                 12                   3
   21–30                  9                   3
   31–40                 14                   7
   41–50                  7                   4
   51–60                  4                   3
   61–70                  2                   2

(a) State the basic assumptions of least squares regression and comment, if you can, on
whether each is satisfied by these data.
(b) Perform a linear regression analysis using R of the incidence of infection (number
infected/number exposed) on time spent in the whirlpool. Use the midpoint of the time
interval as the independent variable. Estimate the intercept and the slope, and plot the
regression line and the data. Comment on your results.
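
A possible setup for part (b) is sketched below. It assumes the midpoint of each interval is the average of its two endpoints (for example 15.5 for 11–20); the variable names time_mid, guests, infected and incidence are hypothetical.

```r
# Data from the table; midpoints taken as the average of the interval endpoints
time_mid <- c(5, 15.5, 25.5, 35.5, 45.5, 55.5, 65.5)
guests   <- c(8, 12, 9, 14, 7, 4, 2)
infected <- c(1, 3, 3, 7, 4, 3, 2)
incidence <- infected / guests   # proportion of exposed guests infected

# Least squares fit of incidence on time spent in the whirlpool
fit <- lm(incidence ~ time_mid)
print(coef(fit))                 # intercept and slope estimates

# Scatter plot of the data with the fitted regression line
plot(time_mid, incidence, xlab = "time in whirlpool (minutes)",
     ylab = "proportion infected", main = "whirlpool infection")
abline(fit)
```

Note that the proportions are based on very different numbers of guests, which is relevant to the assumptions discussed in part (a).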

Q6 Show that for any multiple regression model with an intercept (that is, with a constant term
β0 in the model), the sum of the residuals equals zero.
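
The claim can be illustrated numerically before it is proved. The sketch below uses made-up data with two hypothetical predictors (x1 and x2 are not from this sheet) and fits the model with and without an intercept.

```r
# Hypothetical toy data with two predictors, for illustration only
x1 <- c(1, 2, 3, 4, 5, 6)
x2 <- c(2, 1, 4, 3, 6, 5)
y  <- c(3.2, 3.9, 7.1, 7.8, 11.2, 11.9)

fit_with    <- lm(y ~ x1 + x2)       # model with an intercept
fit_without <- lm(y ~ 0 + x1 + x2)   # same predictors, no intercept

print(sum(resid(fit_with)))     # zero up to floating-point rounding
print(sum(resid(fit_without)))  # generally not zero
```

The contrast between the two fits suggests where the intercept column of X enters the proof.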
