Linear Regression in R

STA 2205 Dr. Wamwea 2021
Multiple Linear Regression Revision in R

Example
Let
Y = β0 + β1 X1 + β2 X2 + · · · + β6 X6 + e
where
X1 ∼ N(2, 0.7)
X2 ∼ Beta(1, 0.6)
X3 ∼ Poisson(2)
X4 ∼ χ²(4)
X5 ∼ t(2)
X6 ∼ Uniform(1, 2)
e ∼ N(0, 0.7)
βj = 1.2(j − 1) + 0.7, j = 0, 1, ..., 6
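The coefficient rule can be checked numerically. The following is a minimal sketch, assuming the index j runs over 0, 1, ..., 6 (intercept included); the names j and beta are chosen here for illustration only.

```r
# True coefficients beta_j = 1.2 * (j - 1) + 0.7
# Assumption: j runs over 0, 1, ..., 6, so beta[1] holds beta0
j <- 0:6
beta <- 1.2 * (j - 1) + 0.7
names(beta) <- paste0("beta", j)
print(beta)  # values: -0.5, 0.7, 1.9, 3.1, 4.3, 5.5, 6.7
```

Under this indexing the intercept is -0.5 and the slopes increase in steps of 1.2 up to 6.7.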
1. Write an R program that:
(a) Generates 1000 variates of Y.
(b) Fits a multiple linear regression model and estimates the βj's.
(c) Repeats parts (a) and (b) 500 times and determines the overall mean of the SSE given by:
$$ MSSE = \frac{1}{500 \times 1000} \sum_{j=1}^{500} \sum_{i=1}^{1000} \left( Y_{ji} - \hat{Y}_{ji} \right)^2 $$
2. State the fitted model.
3. Identify the significant variables at 5%.
4. Comment on the overall fit of the model at 1%.
5. Write an R program that can be used to calculate the 95% confidence intervals of the βj's and use them
to confirm the results in (3) above.
Solution
1. The R program used to solve this question is:
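#Question 1
SSE=0
B.hat=matrix(0,nrow=500,ncol=7)
#True coefficients: b[1],...,b[7] correspond to beta0,...,beta6
b=1.2*((1:7)-2)+0.7
for(j in 1:500)
{
#X1: sd = 0.8 is used here, whereas the problem statement specifies N(2, 0.7)
X1=rnorm(1000,2,0.8)
X2=rbeta(1000,1,0.6)
X3=rpois(1000,2)
X4=rchisq(1000,4)
X5=rt(1000,2)
X6=runif(1000,1,2)
e=rnorm(1000,0,0.7)
#1(a)
#Actual Y
Y=b[1]+b[2]*X1+b[3]*X2+b[4]*X3+b[5]*X4+b[6]*X5+b[7]*X6+e
#1(b)
Model=lm(Y~X1+X2+X3+X4+X5+X6)
#Store the estimated coefficients of the j-th replicate
B.hat[j,]=coef(Model)
#1 (c)
#Estimating Y.hat
#Method 1: from the estimated coefficients
b.hat=coef(Model)
Y.hat=b.hat[1]+b.hat[2]*X1+b.hat[3]*X2+b.hat[4]*X3+b.hat[5]*X4+b.hat[6]*X5+b.hat[7]*X6
#Alternatively
#Method 2: predict() returns the same fitted values as Method 1
y.hat=predict(Model)
#SSE of the j-th replicate
SSE[j]=sum((Y-Y.hat)^2)
}
#Overall mean: average over the 500 replicates and the 1000 observations
MSSE=mean(SSE)/1000
MSSE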
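The SSE of a single fitted model can also be read directly from the model object. The following is a minimal sketch for one replicate, assuming the second parameter of N(2, 0.7) and N(0, 0.7) is a standard deviation; the seed and the names fit, sse_manual, sse_resid and sse_dev are illustrative and not part of the original program.

```r
set.seed(123)  # assumption: fixed seed so that the sketch is reproducible
n <- 1000
beta <- 1.2 * ((0:6) - 1) + 0.7  # beta0, ..., beta6

# One replicate of the predictors and the error term
X1 <- rnorm(n, 2, 0.7)   # assumption: 0.7 treated as the standard deviation
X2 <- rbeta(n, 1, 0.6)
X3 <- rpois(n, 2)
X4 <- rchisq(n, 4)
X5 <- rt(n, 2)
X6 <- runif(n, 1, 2)
e  <- rnorm(n, 0, 0.7)

Y <- beta[1] + beta[2] * X1 + beta[3] * X2 + beta[4] * X3 +
  beta[5] * X4 + beta[6] * X5 + beta[7] * X6 + e
fit <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)

# Three equivalent ways of obtaining the SSE of the fitted model
sse_manual <- sum((Y - predict(fit))^2)  # from the fitted values
sse_resid  <- sum(residuals(fit)^2)      # from the residuals
sse_dev    <- deviance(fit)              # residual sum of squares of an lm fit

print(c(manual = sse_manual, residuals = sse_resid, deviance = sse_dev))
print(sse_manual / n)  # per-observation value; should be near 0.7^2
```

All three quantities agree, so any of them could be used inside the simulation loop.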
NB: We shall use the model summary (stated below) obtained from part 1(b) to answer
questions 2-4.
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1412 -0.4588  0.0009  0.4899  2.2853

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.731202   0.142707  -5.124  3.6e-07 ***
X1           0.743358   0.027222  27.307  < 2e-16 ***
X2           1.983937   0.071227  27.854  < 2e-16 ***
X3           3.101981   0.015508 200.022  < 2e-16 ***
X4           4.311278   0.008225 524.138  < 2e-16 ***
X5           5.516239   0.008821 625.362  < 2e-16 ***
X6           6.705355   0.076332  87.844  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6864 on 993 degrees of freedom
Multiple R-squared: 0.9987, Adjusted R-squared: 0.9987
F-statistic: 1.238e+05 on 6 and 993 DF, p-value: < 2.2e-16
2. The fitted model is given by:
Ŷ = −0.731202 + 0.743358X1 + 1.983937X2 + 3.101981X3 + 4.311278X4 + 5.516239X5 + 6.705355X6

3. To test the significance of the individual variables in a multiple linear regression model, we use the
t-test. In this example, α = 5% = 0.05.
Under the t-test, the hypotheses to be tested are:
H0 : βj = 0 (Variable is not significant) ; j = 0, 1, 2, 3, 4, 5, 6

H1 : βj ≠ 0 (Variable is significant)
The t-test statistic is given by:

$$ t\text{-value}_j = \frac{\text{Estimate}_j}{\text{Std.Error}_j} $$
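As a quick check of this formula, the t-values can be recomputed from the Estimate and Std. Error columns of the model summary. This is only a sketch: the numbers are copied from the printed summary, so the results agree with the reported t-values up to rounding; the names est, se, t_value and reported are illustrative.

```r
# Estimates and standard errors copied from the printed model summary
est <- c(-0.731202, 0.743358, 1.983937, 3.101981, 4.311278, 5.516239, 6.705355)
se  <- c(0.142707, 0.027222, 0.071227, 0.015508, 0.008225, 0.008821, 0.076332)

# t-value_j = Estimate_j / Std.Error_j
t_value <- est / se
names(t_value) <- c("(Intercept)", paste0("X", 1:6))

# t-values as printed in the summary, for comparison
reported <- c(-5.124, 27.307, 27.854, 200.022, 524.138, 625.362, 87.844)
print(round(cbind(computed = t_value, reported = reported), 3))
```

Small differences in the last decimals arise because the printed estimates and standard errors are themselves rounded.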
There are two methods of solving this question:
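(a) The null hypothesis is rejected at α level of significance if
p-value < α
The p-values of the different variables, as read from the model summary, are summarized in the table below:
Variable   P-value    Alpha   Decision
X1         < 2e-16    0.05    Significant
X2         < 2e-16    0.05    Significant
X3         < 2e-16    0.05    Significant
X4         < 2e-16    0.05    Significant
X5         < 2e-16    0.05    Significant
X6         < 2e-16    0.05    Significant
Since the p-value is effectively 0 (< 2e-16), and hence less than 0.05, for every variable, we reject the
null hypothesis H0 and conclude that all the variables X1 to X6 are significant.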
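The p-value rule can be illustrated with a short sketch that recomputes two-sided p-values from the t-values printed in the model summary, using 993 residual degrees of freedom. The names t_value, p_value and decision are illustrative; in practice the p-values are read directly from the summary.

```r
# t-values copied from the printed model summary
t_value <- c(`(Intercept)` = -5.124, X1 = 27.307, X2 = 27.854, X3 = 200.022,
             X4 = 524.138, X5 = 625.362, X6 = 87.844)
df <- 1000 - 6 - 1  # n - p - 1 = 993
alpha <- 0.05

# Two-sided p-value: P(|T| > |t|) for T ~ t(df)
p_value <- 2 * pt(-abs(t_value), df = df)

# Reject H0 when the p-value is below alpha
decision <- ifelse(p_value < alpha, "Significant", "Not significant")
print(data.frame(p_value, alpha, decision))
```

For very large |t| the computed p-value is numerically indistinguishable from 0, which is why the summary prints it as < 2e-16.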
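(b) The null hypothesis is rejected at α level of significance if
$$ |t\text{-value}| > t_{n-p-1}\left(1 - \frac{\alpha}{2}\right) $$
Here n = 1000 and p = 6, so the tabulated value is
$$ t_{1000-6-1}\left(1 - \frac{0.05}{2}\right) = t_{993}(0.975) = 1.96 $$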
The t-values (from the model summary) and the tabulated values of the different variables are summarized
in the table below:
Variable   |t-value|   Tabulated value   Decision
X1          27.307     1.96              Significant
X2          27.854     1.96              Significant
X3         200.022     1.96              Significant
X4         524.138     1.96              Significant
X5         625.362     1.96              Significant
X6          87.844     1.96              Significant
Since |t-value| > tabulated value = 1.96 for all the variables, we reject the null hypothesis
H0 and conclude that all the variables X1 to X6 are significant. Here |a| denotes the absolute value
of a, so that |−a| = |a|.
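The tabulated value need not be read from a table; it can be obtained with qt(). The sketch below assumes n = 1000 observations and p = 6 predictors as in this example; the names t_crit and abs_t are illustrative.

```r
alpha <- 0.05
n <- 1000  # observations
p <- 6     # predictors

# Critical value t_{n-p-1}(1 - alpha/2)
t_crit <- qt(1 - alpha / 2, df = n - p - 1)
print(round(t_crit, 2))  # approximately 1.96

# |t-values| of X1, ..., X6 copied from the printed model summary
abs_t <- c(X1 = 27.307, X2 = 27.854, X3 = 200.022,
           X4 = 524.138, X5 = 625.362, X6 = 87.844)

# TRUE means H0 is rejected, i.e. the variable is significant
print(abs_t > t_crit)
```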
4. To test the overall fit of the multiple linear regression model, we use the F-test. In this example,
α = 1% = 0.01.
Under the F-test, the hypotheses to be tested are:
H0 : β1 = β2 = β3 = β4 = β5 = β6 = 0 (Model is not adequate)

H1 : βj ≠ 0 for at least one βj (Model is adequate)
There are two methods of solving this question:
(a) The null hypothesis is rejected at α level of significance if
p-value < α
Since the p-value of the fitted model (< 2.2e-16, effectively 0) is less than α = 0.01, we reject the
null hypothesis H0 and conclude that the model is adequate.
(b) The null hypothesis is rejected at α level of significance if
$$ F\text{-value} > F_{p,\,n-p-1}(1 - \alpha) $$
With p = 6 and n − p − 1 = 993, the tabulated value is
$$ F_{6,\,993}(0.99) = 2.82 $$
Since the F-value = 1.238e+05 = 123,800 > tabulated F-value = 2.82, we reject the null
hypothesis H0 and conclude that the model is adequate.
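Both versions of the F-test can be reproduced with qf() and pf(). This is a minimal sketch using the F-statistic printed in the model summary; the names f_value, f_crit and p_value are illustrative.

```r
alpha <- 0.01
n <- 1000  # observations
p <- 6     # predictors

# F-statistic copied from the printed model summary
f_value <- 1.238e+05

# Method (b): tabulated value F_{p, n-p-1}(1 - alpha)
f_crit <- qf(1 - alpha, df1 = p, df2 = n - p - 1)
print(round(f_crit, 2))  # approximately 2.82
print(f_value > f_crit)  # TRUE means H0 is rejected

# Method (a): upper-tail p-value of the observed F-statistic
p_value <- pf(f_value, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(p_value < alpha)   # TRUE means H0 is rejected
```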
5. The R program used to calculate the 95% C.I. of the βj's is as given below.
B.hat=matrix(0,nrow=500,ncol=7)
#True coefficients: b[1],...,b[7] correspond to beta0,...,beta6
b=1.2*((1:7)-2)+0.7
for(j in 1:500)
{
#X1: sd = 0.8 is used here, whereas the problem statement specifies N(2, 0.7)
X1=rnorm(1000,2,0.8)
X2=rbeta(1000,1,0.6)
X3=rpois(1000,2)
X4=rchisq(1000,4)
X5=rt(1000,2)
X6=runif(1000,1,2)
e=rnorm(1000,0,0.7)
Y=b[1]+b[2]*X1+b[3]*X2+b[4]*X3+b[5]*X4+b[6]*X5+b[7]*X6+e
Model=lm(Y~X1+X2+X3+X4+X5+X6)
#Store the 7 estimated coefficients of the j-th replicate in row j
B.hat[j,]=coef(Model)
}
#Calculate the 95% confidence intervals for Question 5
B.hat
#Finding the positions of the lower (L) and upper (U) 95% confidence limits
#among the 500 sorted estimates
L=round(500*2.5/100,0) #use the round() function to round off to the nearest integer
L
U=round(500*97.5/100,0)
U
B0.hat=B.hat[,1]
L.C.I.b0=sort(B0.hat)[L]
U.C.I.b0=sort(B0.hat)[U]
b0.C.I = c(L.C.I.b0,U.C.I.b0)
b0.C.I
#Since 0 is not part of the confidence interval of b0,
#the intercept is significant in predicting the values of Y.
B1.hat=B.hat[,2]
b1.C.I=c(sort(B1.hat)[L],sort(B1.hat)[U])
b1.C.I
#The variable X1 is significant in predicting the values of Y.
B2.hat=B.hat[,3]
b2.C.I=c(sort(B2.hat)[L],sort(B2.hat)[U])
b2.C.I
B3.hat=B.hat[,4]
b3.C.I=c(sort(B3.hat)[L],sort(B3.hat)[U])
b3.C.I
B4.hat=B.hat[,5]
b4.C.I=c(sort(B4.hat)[L],sort(B4.hat)[U])
b4.C.I
B5.hat=B.hat[,6]
b5.C.I=c(sort(B5.hat)[L],sort(B5.hat)[U])
b5.C.I
B6.hat=B.hat[,7]
b6.C.I=c(sort(B6.hat)[L],sort(B6.hat)[U])
b6.C.I
The 95% C.I. for the respective βj's are summarized in the table below:
Variables   Actual   LCI          UCI
beta0       -0.5     -0.7783439   -0.2251310
beta1        0.7      0.6428931    0.7536061
beta2        1.9      1.750723     2.048918
beta3        3.1      3.066062     3.129891
beta4        4.3      4.283521     4.316174
beta5        5.5      5.483391     5.516849
beta6        6.7      6.549297     6.846228
From the above table, none of the 95% confidence intervals contains 0, hence all the coefficients are
significant at the 5% level, which confirms the results in (3). In addition, each actual βj value lies
within its respective 95% C.I.
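For comparison, the classical t-based confidence intervals of a single fitted model can be obtained with confint(). This is a different technique from the simulation-based percentile intervals computed above, and it is shown only as a sketch: the seed is arbitrary, and 0.7 is assumed to be the standard deviation of X1 and e.

```r
set.seed(2021)  # assumption: arbitrary seed, for reproducibility only
n <- 1000
beta <- 1.2 * ((0:6) - 1) + 0.7  # beta0, ..., beta6

# One simulated data set
X1 <- rnorm(n, 2, 0.7)   # assumption: 0.7 treated as the standard deviation
X2 <- rbeta(n, 1, 0.6)
X3 <- rpois(n, 2)
X4 <- rchisq(n, 4)
X5 <- rt(n, 2)
X6 <- runif(n, 1, 2)
e  <- rnorm(n, 0, 0.7)
Y <- beta[1] + beta[2] * X1 + beta[3] * X2 + beta[4] * X3 +
  beta[5] * X4 + beta[6] * X5 + beta[7] * X6 + e

Model <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6)

# 95% t-based confidence intervals: one row per coefficient
ci <- confint(Model, level = 0.95)
print(ci)

# A coefficient is significant at 5% when its interval excludes 0
significant <- ci[, 1] > 0 | ci[, 2] < 0
print(significant)
```

An interval that excludes 0 corresponds to rejecting H0 : βj = 0 in the t-test at the same level, so both approaches lead to the same decision for a given fit.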
