Linear Regression in R
Wamwea 2021
where
X1 ∼ N(2, 0.7)
X2 ∼ Beta(1, 0.6)
X3 ∼ Poisson(2)
X4 ∼ χ²(4)
X5 ∼ t(2)
X6 ∼ Uniform(1, 2)
e ∼ N(0, 0.7)
βj = 1.2(j − 1) + 0.7, ∀j
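As a worked illustration of the coefficient rule, the values below assume that j runs from 0 to 6. The statement does not give the range of j, so this is an assumption, but it matches the indexing b[i]=1.2*(i-2)+0.7 for i = 1, ..., 7 used in the solution.

```latex
\begin{align*}
\beta_0 &= 1.2(0-1)+0.7 = -0.5, & \beta_1 &= 1.2(1-1)+0.7 = 0.7, \\
\beta_2 &= 1.2(2-1)+0.7 = 1.9,  & \beta_3 &= 1.2(3-1)+0.7 = 3.1, \\
\beta_4 &= 1.2(4-1)+0.7 = 4.3,  & \beta_5 &= 1.2(5-1)+0.7 = 5.5, \\
\beta_6 &= 1.2(6-1)+0.7 = 6.7
\end{align*}
```

These are the values that the fitted coefficients of a well-behaved model should be close to.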
\[ MSSE = \frac{1}{500 \times 1000} \sum_{j=1}^{500} \sum_{i=1}^{1000} \left( Y_{ji} - \hat{Y}_{ji} \right)^2 \]
5. Write an R program that can be used to calculate the 95% confidence intervals of the βj's and use them
to confirm the results in (3) above.
STA 2205 Dr. Wamwea 2021
Solution
#Question 1
SSE=0
B.hat=matrix(0,nrow=500,ncol=7)
for(j in 1:500)
{
#X1 is drawn with standard deviation 0.8 here, whereas the question states N(2, 0.7)
X1=rnorm(1000,2,0.8)
X2=rbeta(1000,1,0.6)
X3=rpois(1000,2)
X4=rchisq(1000,4)
X5=rt(1000,2)
X6=runif(1000,1,2)
e=rnorm(1000,0,0.7)
#True coefficients: b[1]=beta0, b[2]=beta1, ..., b[7]=beta6
b=1.2*((1:7)-2)+0.7
#1(a)
#Actual Y
Y=b[1]+b[2]*X1+b[3]*X2+b[4]*X3+b[5]*X4+b[6]*X5+b[7]*X6+e
#1(b)
Model=lm(Y~X1+X2+X3+X4+X5+X6)
#1 (c)
#Estimating Y.hat
#Method 1
b0.hat=coef(Model)[1]
b1.hat=coef(Model)[2]
b2.hat=coef(Model)[3]
b3.hat=coef(Model)[4]
b4.hat=coef(Model)[5]
b5.hat=coef(Model)[6]
b6.hat=coef(Model)[7]
Y.hat=b0.hat+b1.hat*X1+b2.hat*X2+b3.hat*X3+b4.hat*X4+b5.hat*X5+b6.hat*X6
#Alternatively
#Method 2
y.hat=predict(Model)
SSE[j]=sum((Y-Y.hat)^2)
}
#Average over all 500*1000 squared residuals, as in the MSSE formula
MSSE=sum(SSE)/(500*1000)
MSSE
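As a cross-check on the loop above, the same quantity can be sketched more compactly with replicate() and residuals(). This is only an illustrative variant, not the handout's method: the helper name one.sse, the seed, and the reduced number of replications (50 instead of 500, to keep it quick) are assumptions. The predictors are generated exactly as in the listing above. Because each error term has variance 0.7^2 = 0.49, the result should come out close to 0.49.

```r
set.seed(2205)                   # arbitrary seed, only for reproducibility

n    <- 1000                     # observations per simulated data set
reps <- 50                       # replications (the exercise itself uses 500)
b    <- 1.2 * ((1:7) - 2) + 0.7  # true coefficients beta0, ..., beta6

# Hypothetical helper: simulate one data set, fit the model, return its SSE
one.sse <- function() {
  X1 <- rnorm(n, 2, 0.8)
  X2 <- rbeta(n, 1, 0.6)
  X3 <- rpois(n, 2)
  X4 <- rchisq(n, 4)
  X5 <- rt(n, 2)
  X6 <- runif(n, 1, 2)
  e  <- rnorm(n, 0, 0.7)
  Y  <- b[1] + b[2]*X1 + b[3]*X2 + b[4]*X3 + b[5]*X4 + b[6]*X5 + b[7]*X6 + e
  Model <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)
  sum(residuals(Model)^2)        # same as sum((Y - predict(Model))^2)
}

SSE  <- replicate(reps, one.sse())
MSSE <- sum(SSE) / (reps * n)    # average squared residual over all fits
MSSE
```

Dividing by reps * n, rather than by reps alone, is what puts the result on the scale of a single squared residual.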
NB: We shall use the model summary (stated below) obtained from part 1(b) to answer
questions 2-4.
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6)
Residuals:
Min 1Q Median 3Q Max
-2.1412 -0.4588 0.0009 0.4899 2.2853
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.731202 0.142707 -5.124 3.6e-07 ***
X1 0.743358 0.027222 27.307 < 2e-16 ***
X2 1.983937 0.071227 27.854 < 2e-16 ***
X3 3.101981 0.015508 200.022 < 2e-16 ***
X4 4.311278 0.008225 524.138 < 2e-16 ***
X5 5.516239 0.008821 625.362 < 2e-16 ***
X6 6.705355 0.076332 87.844 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
3. To test the significance of the variables in a multiple linear regression model, we use the t-test of
H0 : βj = 0 against H1 : βj ≠ 0. In this example, α = 5% = 0.05.
(a) The null hypothesis is rejected at α level of significance if the
p-value < α
The p-values of the variables X1 to X6, as reported in the Pr(>|t|) column of the model summary, are
all < 2e-16.
Since for all the variables the p-value < 2e-16 ≈ 0.00 < 0.05, we reject the null hypothesis H0 and
conclude that all the variables X1 to X6 are significant.
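The p-value rule can also be applied programmatically rather than by reading the printed summary. The sketch below refits one simulated data set, with an arbitrary seed and the same generating code as the solution, and pulls the p-values out of summary(Model)$coefficients. The object names coef.table, p.values and significant are illustrative assumptions.

```r
set.seed(2205)                   # arbitrary seed, only for reproducibility

n <- 1000
b <- 1.2 * ((1:7) - 2) + 0.7     # true coefficients beta0, ..., beta6
X1 <- rnorm(n, 2, 0.8)
X2 <- rbeta(n, 1, 0.6)
X3 <- rpois(n, 2)
X4 <- rchisq(n, 4)
X5 <- rt(n, 2)
X6 <- runif(n, 1, 2)
e  <- rnorm(n, 0, 0.7)
Y  <- b[1] + b[2]*X1 + b[3]*X2 + b[4]*X3 + b[5]*X4 + b[6]*X5 + b[7]*X6 + e
Model <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)

alpha <- 0.05
coef.table  <- summary(Model)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
p.values    <- coef.table[, "Pr(>|t|)"]
significant <- p.values < alpha             # TRUE where H0: beta_j = 0 is rejected

data.frame(p.value = p.values, significant = significant)
```

The exact p-values differ from run to run, but with these settings the six predictors should always be flagged as significant.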
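(b) The null hypothesis is rejected at α level of significance if
\[ |t\text{-value}| > t_{n-p-1}\left(1 - \frac{\alpha}{2}\right) \]
With n = 1000 observations and p = 6 predictors, the tabulated value is
\[ t_{1000-6-1}\left(1 - \frac{0.05}{2}\right) = t_{993}(0.975) = 1.96 \]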
The t-values (from the model summary) and the tabulated value for the different variables are as
summarized below:
Variable   |t-value|   Tabulated value
X1         27.307      1.96
X2         27.854      1.96
X3         200.022     1.96
X4         524.138     1.96
X5         625.362     1.96
X6         87.844      1.96
Since |t-value| > tabulated value = 1.96 for all the variables, we reject the null hypothesis
H0 and conclude that all the variables X1 to X6 are significant. Here |a| denotes the absolute
value of a, so that |−a| = |a|.
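The critical-value comparison can be reproduced in a few lines of R. The sketch below obtains the tabulated value with qt() and compares it with the t-values copied from the model summary printed earlier. The vector names t.values and reject are illustrative assumptions.

```r
alpha <- 0.05
n <- 1000   # observations
p <- 6      # predictors X1, ..., X6

# Two-sided critical value t_{n-p-1}(1 - alpha/2)
t.crit <- qt(1 - alpha / 2, df = n - p - 1)

# t-values copied from the "t value" column of the printed model summary
t.values <- c(X1 = 27.307, X2 = 27.854, X3 = 200.022,
              X4 = 524.138, X5 = 625.362, X6 = 87.844)

reject <- abs(t.values) > t.crit   # TRUE where H0: beta_j = 0 is rejected
round(t.crit, 2)
reject
```

Using qt() avoids reading the value off a statistical table and makes it easy to repeat the test at a different α.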
4. To test the overall fitness of the multiple linear regression model, we use the F-test. In this example,
α = 1% = 0.01.
(a) The null hypothesis is rejected at α level of significance if the
p-value < α
Since the p-value of the fitted model, < 2.2e-16 ≈ 0.00, is less than α = 0.01, we reject the
null hypothesis H0 and conclude that the model is adequate.
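For the overall F-test, the statistic and its degrees of freedom are stored in summary(Model)$fstatistic, so both the p-value and a critical value can be computed directly. The sketch below refits one simulated data set with an arbitrary seed. Comparing the F-statistic with qf(1 - α, p, n - p - 1) is the usual critical-value counterpart of the p-value rule, and is shown here as an assumption about what that comparison would look like. The names f, p.value and f.crit are illustrative.

```r
set.seed(2205)                   # arbitrary seed, only for reproducibility

n <- 1000
b <- 1.2 * ((1:7) - 2) + 0.7     # true coefficients beta0, ..., beta6
X1 <- rnorm(n, 2, 0.8)
X2 <- rbeta(n, 1, 0.6)
X3 <- rpois(n, 2)
X4 <- rchisq(n, 4)
X5 <- rt(n, 2)
X6 <- runif(n, 1, 2)
e  <- rnorm(n, 0, 0.7)
Y  <- b[1] + b[2]*X1 + b[3]*X2 + b[4]*X3 + b[5]*X4 + b[6]*X5 + b[7]*X6 + e
Model <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)

alpha <- 0.01
f <- summary(Model)$fstatistic   # named vector: value, numdf, dendf

# p-value approach: upper-tail probability of the F distribution
p.value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# critical-value approach: F_{p, n-p-1}(1 - alpha)
f.crit <- qf(1 - alpha, f["numdf"], f["dendf"])

unname(p.value) < alpha              # reject H0 by the p-value rule?
unname(f["value"]) > unname(f.crit)  # reject H0 by the critical-value rule?
```

Both comparisons lead to the same decision, since they are two ways of stating the same test.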
(b) The null hypothesis is rejected at α level of significance if the
5. The R program used to calculate the 95% C.I. of the βj's is given below.
B.hat=matrix(0,nrow=500,ncol=7)
for(j in 1:500)
{
X1=rnorm(1000,2,0.8)
X2=rbeta(1000,1,0.6)
X3=rpois(1000,2)
X4=rchisq(1000,4)
X5=rt(1000,2)
X6=runif(1000,1,2)
e=rnorm(1000,0,0.7)
#True coefficients: b[1]=beta0, b[2]=beta1, ..., b[7]=beta6
b=1.2*((1:7)-2)+0.7
Y=b[1]+b[2]*X1+b[3]*X2+b[4]*X3+b[5]*X4+b[6]*X5+b[7]*X6+e
Model=lm(Y~X1+X2+X3+X4+X5+X6)
B.hat[j,]=coef(Model)
}
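#Positions of the lower and upper 2.5% cut-offs among the 500 sorted estimates.
#L and U are not defined in the listing; the values below are an assumed choice.
L=ceiling(0.025*500)
U=500-L+1
B0.hat=B.hat[,1]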
L.C.I.b0=sort(B0.hat)[L]
U.C.I.b0=sort(B0.hat)[U]
b0.C.I = c(L.C.I.b0,U.C.I.b0)
b0.C.I
#Since 0 is not part of the confidence interval of b0,
#The intercept is significant in predicting the values of Y.
B1.hat=B.hat[,2]
L.C.I.b1=sort(B1.hat)[L]
U.C.I.b1=sort(B1.hat)[U]
b1.C.I = c(L.C.I.b1,U.C.I.b1)
b1.C.I
#Since 0 is not part of the confidence interval of b1,
#The variable X1 is significant in predicting the values of Y.
B2.hat=B.hat[,3]
L.C.I.b2=sort(B2.hat)[L]
U.C.I.b2=sort(B2.hat)[U]
b2.C.I = c(L.C.I.b2,U.C.I.b2)
b2.C.I
#Since 0 is not part of the confidence interval of b2,
#The variable X2 is significant in predicting the values of Y.
B3.hat=B.hat[,4]
L.C.I.b3=sort(B3.hat)[L]
U.C.I.b3=sort(B3.hat)[U]
b3.C.I = c(L.C.I.b3,U.C.I.b3)
b3.C.I
#Since 0 is not part of the confidence interval of b3,
#The variable X3 is significant in predicting the values of Y.
B4.hat=B.hat[,5]
L.C.I.b4=sort(B4.hat)[L]
U.C.I.b4=sort(B4.hat)[U]
b4.C.I = c(L.C.I.b4,U.C.I.b4)
b4.C.I
#Since 0 is not part of the confidence interval of b4,
#The variable X4 is significant in predicting the values of Y.
B5.hat=B.hat[,6]
L.C.I.b5=sort(B5.hat)[L]
U.C.I.b5=sort(B5.hat)[U]
b5.C.I = c(L.C.I.b5,U.C.I.b5)
b5.C.I
#Since 0 is not part of the confidence interval of b5,
#The variable X5 is significant in predicting the values of Y.
B6.hat=B.hat[,7]
L.C.I.b6=sort(B6.hat)[L]
U.C.I.b6=sort(B6.hat)[U]
b6.C.I = c(L.C.I.b6,U.C.I.b6)
b6.C.I
#Since 0 is not part of the confidence interval of b6,
#The variable X6 is significant in predicting the values of Y.
The 95% C.I.s for the respective βj's were as summarized in the table below:
From the above table, it is clear that none of the 95% confidence intervals contains 0, hence
all the variables are significant, which agrees with the results in (3). In addition, the actual βj
values lie within their respective 95% C.I.s.
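The intervals in the program above are percentile intervals of the simulated estimates. For comparison, the sketch below computes them with quantile() and sets them beside the t-based intervals that confint() gives for a single fitted model. This is an illustrative variant, not the handout's program: the helper name sim.fit, the seed, and the reduced number of replications (200 instead of 500) are assumptions.

```r
set.seed(2205)                   # arbitrary seed, only for reproducibility

n    <- 1000
reps <- 200                      # replications (the exercise itself uses 500)
b    <- 1.2 * ((1:7) - 2) + 0.7  # true coefficients beta0, ..., beta6

# Hypothetical helper: simulate one data set and return the fitted model
sim.fit <- function() {
  X1 <- rnorm(n, 2, 0.8)
  X2 <- rbeta(n, 1, 0.6)
  X3 <- rpois(n, 2)
  X4 <- rchisq(n, 4)
  X5 <- rt(n, 2)
  X6 <- runif(n, 1, 2)
  e  <- rnorm(n, 0, 0.7)
  Y  <- b[1] + b[2]*X1 + b[3]*X2 + b[4]*X3 + b[5]*X4 + b[6]*X5 + b[7]*X6 + e
  lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)
}

# (i) t-based 95% intervals from a single fitted model
ci.single <- confint(sim.fit(), level = 0.95)

# (ii) percentile intervals from the replicated coefficient estimates
B.hat  <- t(replicate(reps, coef(sim.fit())))   # reps x 7 matrix
ci.sim <- t(apply(B.hat, 2, quantile, probs = c(0.025, 0.975)))

ci.single
ci.sim
```

The two sets of intervals answer slightly different questions. confint() quantifies the uncertainty in one sample's estimates, while the percentile intervals describe the spread of the estimates across repeated samples. With these settings both should exclude 0 for X1 to X6.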