
STA 2205 Dr. Wamwea 2021

Multiple Linear Regression Revision in R


Example
Let
Y = β0 + β1 X1 + β2 X2 + · · · + β6 X6 + e

where
X1 ∼ N(2, 0.7)
X2 ∼ Beta(1, 0.6)
X3 ∼ Poisson(2)
X4 ∼ χ²(4)
X5 ∼ t(2)
X6 ∼ Uniform(1, 2)
e ∼ N(0, 0.7)
βj = 1.2(j − 1) + 0.7, ∀ j = 0, 1, . . . , 6
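
As a quick check of the coefficient formula, the short R sketch below evaluates βj for j = 0, 1, ..., 6. The index range is an assumption inferred from the solution, which stores β0 in b[1]; the names j and beta are illustrative only.

```r
# True coefficients beta_0, ..., beta_6 from beta_j = 1.2 * (j - 1) + 0.7
# Assumption: j runs from 0 (intercept) to 6
j <- 0:6
beta <- 1.2 * (j - 1) + 0.7
names(beta) <- paste0("beta", j)
print(beta)
```

The values run from -0.5 for the intercept up to 6.7 for β6, in steps of 1.2.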

1. Write an R program that:

(a) Generates 1000 variates of Y .


(b) Fits a multiple linear regression model and estimates the βj ’s.
(c) Repeats parts (a) and (b) 500 times and determines the overall mean of the SSE given by:

\[ MSSE = \frac{1}{500 \times 1000} \sum_{j=1}^{500} \sum_{i=1}^{1000} \left( Y_{ji} - \hat{Y}_{ji} \right)^2 \]

2. State the fitted model

3. Identify the significant variables at 5%.

4. Comment on the overall fit of the model at 1%.

5. Write an R program that can be used to calculate the 95% confidence intervals of βj ’s and use them
to confirm the results in (3) above.

Solution

1. The R program used to solve this question is:

#Question 1
SSE=0

B.hat=matrix(0,nrow=500,ncol=7)
for(j in 1:500)
{
X1=rnorm(1000,2,0.8) #NB: the question states N(2,0.7); sd=0.8 is used in this program
X2=rbeta(1000,1,0.6)
X3=rpois(1000,2)
X4=rchisq(1000,4)
X5=rt(1000,2)
X6=runif(1000,1,2)
e=rnorm(1000,0,0.7)

b=0
for(i in 1:7)
{
b[i]=1.2*(i-2)+0.7
}

#1(a)

#Actual Y
Y=b[1]+b[2]*X1+b[3]*X2+b[4]*X3+b[5]*X4+b[6]*X5+b[7]*X6+e

#1(b)
Model=lm(Y~X1+X2+X3+X4+X5+X6)

#1 (c)
#Estimating Y.hat

#Method 1

b0.hat=coef(Model)[1]
b1.hat=coef(Model)[2]
b2.hat=coef(Model)[3]
b3.hat=coef(Model)[4]
b4.hat=coef(Model)[5]
b5.hat=coef(Model)[6]
b6.hat=coef(Model)[7]

Y.hat=b0.hat+b1.hat*X1+b2.hat*X2+b3.hat*X3+b4.hat*X4+b5.hat*X5+b6.hat*X6

#Alternatively
#Method 2

y.hat=predict(Model)

SSE[j]=sum((Y-Y.hat)^2)
}
#Overall mean of the SSE: divide the total by 500*1000 as in the formula in 1(c)
MSSE=sum(SSE)/(500*1000)
MSSE
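
The same simulation can be written more compactly. The following is a minimal sketch, not the program used to produce the output in these notes: the seed, the helper name one.sse and the use of replicate() are assumptions made for illustration, and X1 is drawn with sd 0.7 as in the problem statement (the program above uses 0.8).

```r
set.seed(2205)                   # hypothetical seed, only for reproducibility

n <- 1000                        # observations per data set
reps <- 500                      # number of simulated data sets
beta <- 1.2 * (0:6 - 1) + 0.7    # beta_0, ..., beta_6

# Simulate one data set, fit the model and return its SSE
one.sse <- function() {
  X <- cbind(X1 = rnorm(n, 2, 0.7),   # sd 0.7 as in the problem statement
             X2 = rbeta(n, 1, 0.6),
             X3 = rpois(n, 2),
             X4 = rchisq(n, 4),
             X5 = rt(n, 2),
             X6 = runif(n, 1, 2))
  Y <- drop(beta[1] + X %*% beta[-1] + rnorm(n, 0, 0.7))
  Model <- lm(Y ~ X)
  sum(residuals(Model)^2)             # SSE = sum of squared residuals
}

SSE <- replicate(reps, one.sse())
MSSE <- sum(SSE) / (reps * n)         # divide by 500 * 1000 as in the formula
MSSE
```

Since the errors have standard deviation 0.7, the MSSE should come out roughly equal to the error variance 0.7² = 0.49.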

NB: We shall use the model summary (stated below) obtained from part 1(b) to answer
questions 2-4.

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1412 -0.4588  0.0009  0.4899  2.2853

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.731202   0.142707  -5.124  3.6e-07 ***
X1           0.743358   0.027222  27.307  < 2e-16 ***
X2           1.983937   0.071227  27.854  < 2e-16 ***
X3           3.101981   0.015508 200.022  < 2e-16 ***
X4           4.311278   0.008225 524.138  < 2e-16 ***
X5           5.516239   0.008821 625.362  < 2e-16 ***
X6           6.705355   0.076332  87.844  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6864 on 993 degrees of freedom
Multiple R-squared: 0.9987, Adjusted R-squared: 0.9987
F-statistic: 1.238e+05 on 6 and 993 DF, p-value: < 2.2e-16

2. The fitted model is given by:

Ŷ = −0.731202 + 0.743358X1 + 1.983937X2 + 3.101981X3 + 4.311278X4 + 5.516239X5 + 6.705355X6



3. To test the significance of the variables in a multiple linear regression model, we use the t-test. In
this example, α = 5% = 0.05

Under the t-test, the hypotheses to be tested are:

H0 : βj = 0 (Variable is not significant) ; j = {0, 1, 2, 3, 4, 5, 6}

H1 : βj ≠ 0 (Variable is significant)

The t-test statistic is given by:


\[ t\text{-value}_j = \frac{\text{Estimate}_j}{\text{Std. Error}_j} \]
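
The t-values and p-values reported by summary() can be reproduced by hand from the coefficient table. The sketch below is illustrative only: it simulates a fresh data set (hypothetical seed, X1 with sd 0.7 as in the problem statement), so its numbers will differ from the summary shown above, and the names tab, t.manual and p.manual are assumptions.

```r
set.seed(1)                      # hypothetical seed
n <- 1000
beta <- 1.2 * (0:6 - 1) + 0.7    # beta_0, ..., beta_6

X1 <- rnorm(n, 2, 0.7)           # sd 0.7 as in the problem statement
X2 <- rbeta(n, 1, 0.6)
X3 <- rpois(n, 2)
X4 <- rchisq(n, 4)
X5 <- rt(n, 2)
X6 <- runif(n, 1, 2)
Y <- beta[1] + beta[2]*X1 + beta[3]*X2 + beta[4]*X3 +
     beta[5]*X4 + beta[6]*X5 + beta[7]*X6 + rnorm(n, 0, 0.7)

Model <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(Model))

# t-value = Estimate / Std. Error
t.manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value on n - p - 1 residual degrees of freedom
p.manual <- 2 * pt(abs(t.manual), df = df.residual(Model), lower.tail = FALSE)

cbind(t.manual, t.R = tab[, "t value"], p.manual, p.R = tab[, "Pr(>|t|)"])
```

The manual columns should agree with the columns taken from summary().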

There are two methods of solving this question:

(a) The null hypothesis is rejected at α level of significance if the

p − value < α

The p-values of the different variables are as summarized in the table below:

Variable   P-value          Alpha   Decision
X1         < 2e-16 ≈ 0.00   0.05    Significant
X2         < 2e-16 ≈ 0.00   0.05    Significant
X3         < 2e-16 ≈ 0.00   0.05    Significant
X4         < 2e-16 ≈ 0.00   0.05    Significant
X5         < 2e-16 ≈ 0.00   0.05    Significant
X6         < 2e-16 ≈ 0.00   0.05    Significant

Since the p-value is < 2e-16 ≈ 0.00 < 0.05 for every variable, we reject the null hypothesis H0 and
conclude that all the variables X1 to X6 are significant.
(b) The null hypothesis is rejected at α level of significance if the

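\[ |t\text{-value}| > t_{n-p-1}\left(1 - \frac{\alpha}{2}\right) \]

With n = 1000 observations and p = 6 predictors, the tabulated value is

\[ t_{1000-6-1}\left(1 - \frac{0.05}{2}\right) = t_{993}(0.975) = 1.96 \]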
The t-values and the tabulated values of the different variables are as summarized in the table
below:

Variable   |t-value|   Tabulated value   Decision
X1         27.307      1.96              Significant
X2         27.854      1.96              Significant
X3         200.022     1.96              Significant
X4         524.138     1.96              Significant
X5         625.362     1.96              Significant
X6         87.844      1.96              Significant

Since |t-value| > tabulated value = 1.96 for every variable, we reject the null hypothesis
H0 and conclude that all the variables X1 to X6 are significant. Here |a| denotes the absolute
value of a, so that |−a| = |a|.
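
The tabulated value need not be read from a table; it can be obtained in R with qt(). The sketch below takes the t-values from the model summary above and compares them with the critical value. The object names t.crit and t.val are illustrative.

```r
n <- 1000       # number of observations
p <- 6          # number of predictors
alpha <- 0.05   # significance level

# Critical value t_{n-p-1}(1 - alpha/2)
t.crit <- qt(1 - alpha / 2, df = n - p - 1)

# t-values copied from the model summary
t.val <- c(X1 = 27.307, X2 = 27.854, X3 = 200.022,
           X4 = 524.138, X5 = 625.362, X6 = 87.844)

data.frame(abs.t    = abs(t.val),
           t.crit   = round(t.crit, 2),
           Decision = ifelse(abs(t.val) > t.crit,
                             "Significant", "Not significant"))
```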

4. To test the overall fit of the multiple linear regression model, we use the F-test. In this example,
α = 1% = 0.01

Under the F-test, the hypotheses to be tested are:

H0 : β1 = β2 = β3 = β4 = β5 = β6 = 0 (Model is not adequate)


H1 : βj ≠ 0 for at least one βj (Model is adequate)

There are two methods of solving this question:

(a) The null hypothesis is rejected at α level of significance if the

p − value < α

Since the p-value of the fitted model is < 2.2e-16 ≈ 0.00, which is less than α = 0.01, we reject the
null hypothesis H0 and conclude that the model is adequate.
(b) The null hypothesis is rejected at α level of significance if the

\[ F\text{-value} > F_{p,\,n-p-1}(1 - \alpha) \]

\[ F_{6,\,993}(0.99) = 2.82 \]

Since the F-value = 1.238e+05 = 123,800 > tabulated F-value = 2.82, we reject the null
hypothesis H0 and conclude that the model is adequate.
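
Both versions of the F-test can be checked in R with qf() and pf(). The sketch below uses the F-statistic reported in the model summary; the object names F.crit, F.val and p.val are illustrative.

```r
n <- 1000       # number of observations
p <- 6          # number of predictors
alpha <- 0.01   # significance level

# Critical value F_{p, n-p-1}(1 - alpha)
F.crit <- qf(1 - alpha, df1 = p, df2 = n - p - 1)

# F-statistic copied from the model summary
F.val <- 1.238e+05

# Upper-tail p-value of the observed F-statistic
p.val <- pf(F.val, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(F.crit, 2)
F.val > F.crit   # method (b): compare with the critical value
p.val < alpha    # method (a): compare the p-value with alpha
```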

5. The R program used to calculate the 95% C.I of the βj ’s is as given below.

B.hat=matrix(0,nrow=500,ncol=7)
for(j in 1:500)
{
X1=rnorm(1000,2,0.8) #NB: the question states N(2,0.7); sd=0.8 is used in this program
X2=rbeta(1000,1,0.6)
X3=rpois(1000,2)
X4=rchisq(1000,4)
X5=rt(1000,2)
X6=runif(1000,1,2)
e=rnorm(1000,0,0.7)

b=0
for(i in 1:7)
{
b[i]=1.2*(i-2)+0.7
}

Y=b[1]+b[2]*X1+b[3]*X2+b[4]*X3+b[5]*X4+b[6]*X5+b[7]*X6+e
Model=lm(Y~X1+X2+X3+X4+X5+X6)

B.hat[j,]=coef(Model)
}

#Calculate the 95% confidence intervals for Question 5


B.hat
#Finding the positions of the lower (L) and upper (U) 95% confidence intervals
L=round(500*2.5/100,0) #use the round() function to round off to the nearest integer
L
U=round(500*97.5/100,0)
U

B0.hat=B.hat[,1]
L.C.I.b0=sort(B0.hat)[L]
U.C.I.b0=sort(B0.hat)[U]
b0.C.I = c(L.C.I.b0,U.C.I.b0)
b0.C.I
#Since 0 is not part of the confidence interval of b0,
#The intercept is significant in predicting the values of Y.

B1.hat=B.hat[,2]
L.C.I.b1=sort(B1.hat)[L]
U.C.I.b1=sort(B1.hat)[U]
b1.C.I = c(L.C.I.b1,U.C.I.b1)
b1.C.I
#Since 0 is not part of the confidence interval of b1,
#The variable X1 is significant in predicting the values of Y.

B2.hat=B.hat[,3]
L.C.I.b2=sort(B2.hat)[L]
U.C.I.b2=sort(B2.hat)[U]
b2.C.I = c(L.C.I.b2,U.C.I.b2)
b2.C.I
#Since 0 is not part of the confidence interval of b2,
#The variable X2 is significant in predicting the values of Y.

B3.hat=B.hat[,4]
L.C.I.b3=sort(B3.hat)[L]
U.C.I.b3=sort(B3.hat)[U]
b3.C.I = c(L.C.I.b3,U.C.I.b3)
b3.C.I
#Since 0 is not part of the confidence interval of b3,
#The variable X3 is significant in predicting the values of Y.

B4.hat=B.hat[,5]

L.C.I.b4=sort(B4.hat)[L]
U.C.I.b4=sort(B4.hat)[U]
b4.C.I = c(L.C.I.b4,U.C.I.b4)
b4.C.I
#Since 0 is not part of the confidence interval of b4,
#The variable X4 is significant in predicting the values of Y.

B5.hat=B.hat[,6]
L.C.I.b5=sort(B5.hat)[L]
U.C.I.b5=sort(B5.hat)[U]
b5.C.I = c(L.C.I.b5,U.C.I.b5)
b5.C.I
#Since 0 is not part of the confidence interval of b5,
#The variable X5 is significant in predicting the values of Y.

B6.hat=B.hat[,7]
L.C.I.b6=sort(B6.hat)[L]
U.C.I.b6=sort(B6.hat)[U]
b6.C.I = c(L.C.I.b6,U.C.I.b6)
b6.C.I
#Since 0 is not part of the confidence interval of b6,
#The variable X6 is significant in predicting the values of Y.

The 95% C.I. for the respective βj ’s were as summarized in the table below:

Variables   Actual   LCI          UCI
beta0       -0.5     -0.7783439   -0.2251310
beta1        0.7      0.6428931    0.7536061
beta2        1.9      1.750723     2.048918
beta3        3.1      3.066062     3.129891
beta4        4.3      4.283521     4.316174
beta5        5.5      5.483391     5.516849
beta6        6.7      6.549297     6.846228

From the above table, none of the 95% confidence intervals contains 0, hence all the variables (and the
intercept) are significant, which confirms the results in (3). In addition, the actual βj values all lie
within their respective 95% C.I.
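
The seven near-identical blocks in the program above can be collapsed into a single apply() call. The following is a minimal sketch under stated assumptions, not the program that produced the table: the seed and the helper name fit.once are hypothetical, and X1 is drawn with sd 0.7 as in the problem statement (the program above uses 0.8), so the limits will differ slightly from those tabulated.

```r
set.seed(2021)                   # hypothetical seed
n <- 1000
reps <- 500
beta <- 1.2 * (0:6 - 1) + 0.7    # beta_0, ..., beta_6

# Simulate one data set and return the seven estimated coefficients
fit.once <- function() {
  X1 <- rnorm(n, 2, 0.7)         # sd 0.7 as in the problem statement
  X2 <- rbeta(n, 1, 0.6)
  X3 <- rpois(n, 2)
  X4 <- rchisq(n, 4)
  X5 <- rt(n, 2)
  X6 <- runif(n, 1, 2)
  Y <- beta[1] + beta[2]*X1 + beta[3]*X2 + beta[4]*X3 +
       beta[5]*X4 + beta[6]*X5 + beta[7]*X6 + rnorm(n, 0, 0.7)
  coef(lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6))
}

B.hat <- t(replicate(reps, fit.once()))   # 500 x 7 matrix of estimates

# Positions of the lower and upper limits among the sorted estimates
L <- round(reps * 2.5 / 100, 0)
U <- round(reps * 97.5 / 100, 0)

# For each coefficient, sort its 500 estimates and pick positions L and U
CI <- t(apply(B.hat, 2, function(b) sort(b)[c(L, U)]))
colnames(CI) <- c("LCI", "UCI")

data.frame(Actual    = beta,
           LCI       = CI[, "LCI"],
           UCI       = CI[, "UCI"],
           Excludes0 = CI[, "LCI"] > 0 | CI[, "UCI"] < 0)
```

For a single fitted model, confint(Model) returns the usual t-based 95% confidence intervals, which can be used for the same check without repeating the simulation.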
