
Assignment 3: Bootstrap tests and IV methods

Zahra Amin

2022-11-17

# libraries
library(wooldridge)
library(margins)
library(boot)
library(car)
library(sandwich)

Part 1: MLE
1. Estimate the model using glm and interpret the coefficient.

data(fertil2)

# estimating the glm model

fit <- glm(children ~ catholic + educ + age + I(age^2), fertil2, family=poisson(link=sqrt))
summary(fit)

##
## Call:
## glm(formula = children ~ catholic + educ + age + I(age^2), family = poisson(link = sqrt),
## data = fertil2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3190 -0.7494 -0.1770 0.5637 2.8693
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.099e+00 8.241e-02 -25.464 <2e-16 ***
## catholic -1.610e-02 2.525e-02 -0.638 0.524
## educ -2.857e-02 2.051e-03 -13.929 <2e-16 ***
## age 2.014e-01 5.669e-03 35.521 <2e-16 ***
## I(age^2) -2.280e-03 9.337e-05 -24.419 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10082.1 on 4360 degrees of freedom

## Residual deviance: 3949.6 on 4356 degrees of freedom
## AIC: 13053
##
## Number of Fisher Scoring iterations: 5

Since this is a Poisson model with a square root link, the expected value of the target variable children is related to the linear predictor through the inverse of the link function, i.e. E(children|X) = (X'β)². In other words:

$$\sqrt{E(children \mid X)} = -2.099 - 0.0161 \times catholic - 0.0286 \times educ + 0.201 \times age - 0.00228 \times age^2$$

The intercept can be interpreted as the square root of the expected number of children for a non-Catholic with 0 years of education and age 0.
The coefficient of catholic is not significantly different from zero, since the p-value of its z test is greater than 5%.
The coefficient of education can be interpreted as the effect of one year of education on the square root of the expected number of children. The coefficients β3 and β4 represent the linear and quadratic effects of a unit change in age on the square root of the expected number of children (the effects of age and age², respectively).
Note that the age and age² terms cannot be interpreted separately: both involve the same variable, so the effect of a change in age combines the linear and the quadratic components.
2. Estimate the marginal effect of each regressor, using the margins package. Interpret the effects and their
significance.

summary(margins(fit))

## factor AME SE z p lower upper


## age 0.1615 0.0024 68.6364 0.0000 0.1569 0.1661
## catholic -0.0440 0.0690 -0.6377 0.5237 -0.1792 0.0912
## educ -0.0781 0.0056 -13.8874 0.0000 -0.0891 -0.0671

b. Interpret the effects and their significance.


The p-values of the marginal effects of age and education are 0, which means their marginal effects are significant. On the other hand, the p-value of catholic is greater than 5%, meaning that its marginal effect is insignificant.
The marginal effect of age is 0.1615, meaning that a one-unit increase in age, with all else held constant, is expected to be associated with a 0.1615 increase in the number of children. The marginal effect of education is -0.0781, meaning that a one-year increase in education, with everything else held constant, is expected to be associated with a 0.0781 decrease in the number of children.
c. Compute the marginal effects manually (just the effect, not the s.e.). d. Explain what you are doing and
compare with the results from the margins package.
The model link function is the square root; therefore, the marginal effects are:

$$\frac{\partial E(Y \mid X)}{\partial X_i} = 2X'\beta \times \frac{\partial X'\beta}{\partial X_i}$$
# calculating the marginal effects
b <- coef(fit) # get model coefficients
X <- cbind(1, fertil2$catholic, fertil2$educ,
fertil2$age, fertil2$age^2) # get X values from the data
mX <- 2*sum(colMeans(X)*b) # calculate 2X'B
mX*c(b[2], b[3], b[4]+2*b[5]*mean(fertil2$age)) # marginal effects

## catholic educ age


## -0.04400190 -0.07807473 0.20872980

The marginal effects I found are very close to the ones estimated by the margins function, except for age, where the margins function estimated AME = 0.1615 while my calculation gives 0.2087. The difference arises because I evaluate the derivative at the sample means (a marginal effect at the mean), while margins averages the unit-level effects (an average marginal effect); for a nonlinear model the two are not the same.
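To confirm this, a sketch that averages the observation-level derivatives for age instead of plugging in the means:

# sketch: average the unit-level derivatives rather than evaluating at the means
b <- coef(fit)
eta <- c(model.matrix(fit) %*% b) # X'beta for each observation
age <- model.frame(fit)$age
mean(2*eta*(b["age"] + 2*b["I(age^2)"]*age)) # should be close to margins' 0.1615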
3. Compute manually the standard errors of the coefficients β̂ using the sandwich and non-sandwich esti-
mators. You will therefore need to compute the hessian, and information matrices. Start by deriving them
mathematically and then translate your results in R. Compare your standard errors with the ones from the
glm function.
• The standard errors with the non-sandwich estimator are based on $\hat{H}^{-1}$.

• The standard errors with the sandwich estimator are based on $\hat{H}^{-1}\hat{I}\hat{H}^{-1}$.

Assuming the distribution of Y is Poisson, the expected Hessian is:

$$\hat{H}(\lambda(X)) = -\frac{\partial^2}{\partial \lambda(X)^2} E(l_i(\beta)) = -\frac{\partial^2}{\partial \lambda(X)^2} E\big(Y \log \lambda(X) - \lambda(X) - \log(Y!)\big) = E\left(\frac{Y}{\lambda(X)^2}\right) = \frac{1}{\lambda(X)}$$

with $\lambda(X) = (X'\beta)^2$.

And the expected information matrix is:

$$\hat{I}(\lambda(X)) = E\left(\frac{Y}{\lambda(X)} - 1\right)^2 = E\left(\frac{Y^2}{\lambda(X)^2} - 2\frac{Y}{\lambda(X)} + 1\right) = \frac{\lambda(X)(1+\lambda(X))}{\lambda(X)^2} - \frac{2\lambda(X)}{\lambda(X)} + 1 = \frac{1}{\lambda(X)}$$

# standard error with the non-sandwich estimator
b <- coef(fit) # get model coefficients
X <- cbind(1, fertil2$catholic, fertil2$educ,
fertil2$age, fertil2$age^2) # get X values from the data
lambda_X <- (colMeans(X)*b)^2 # calculate the vector (X'B)^2
H <- 1/lambda_X # hessian
# non-sandwich se:
(se1 <- H^(-1))

## (Intercept) catholic educ age I(age^2)


## 4.403804e+00 2.724063e-06 2.799329e-02 3.045089e+01 3.551038e+00

# standard error with the sandwich estimator
b <- coef(fit) # get model coefficients
X <- cbind(1, fertil2$catholic, fertil2$educ, fertil2$age, fertil2$age^2) # get X values from the data
lambda_X <- (colMeans(X)*b)^2 # calculate the vector (X'B)^2
H <- 1/lambda_X # hessian
I <- 1/lambda_X # information matrix
# sandwich se:
(se2 <- H^(-1)%*%I%*%H^(-1))

## [,1] [,2] [,3] [,4] [,5]
## [1,] 22.01902 1.362031e-05 0.1399664 152.2545 17.75519

The MLE standard errors calculated with the sandwich and non-sandwich estimators differ from the ones obtained with the glm function. They are larger for all beta coefficients except the one on catholic, which is the coefficient that is not significant in the model.
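For reference, a sketch of the observation-level computation (under the Poisson/sqrt-link assumptions above): the GLM working weight is (dμ/dη)²/Var(Y) = 4η²/η² = 4 for every observation, so the non-sandwich covariance is simply (4X'X)⁻¹, which should reproduce the glm standard errors; the sandwich version uses the observation-level scores.

# sketch: observation-level bread and meat for the sqrt-link Poisson model
X <- model.matrix(fit)
eta <- c(X %*% coef(fit)) # linear predictor X'beta
lam <- eta^2 # lambda(X) = (X'beta)^2
y <- model.frame(fit)$children
H <- 4*crossprod(X) # expected Hessian; the working weight is constant (= 4)
sqrt(diag(solve(H))) # non-sandwich s.e., should match summary(fit)
sc <- (y/lam - 1)*2*eta # score factor for each observation
Imat <- crossprod(X*sc) # information: sum of outer products of scores
sqrt(diag(solve(H) %*% Imat %*% solve(H))) # sandwich s.e.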
4. Using 600 bootstrap samples (pairs bootstrap), estimate the standard errors of the marginal effects. Com-
pare them with the ones from the margins package.

marg_i <- function(data, ind){


data <- data[ind,]
fit2 <- glm(children ~ catholic + educ + age + I(age^2), data,
family=poisson(link=sqrt), start=coef(fit))
mr <- summary(margins(fit2))
#c(mr$AME, mr$SE)
mr$SE
}

fitB.se <- boot(fertil2, marg_i, R=600)


mean(fitB.se$t[1:600])

## [1] 0.002353731

mean(fitB.se$t[601:1200])

## [1] 0.06896553

mean(fitB.se$t[1201:1800])

## [1] 0.005622364

The standard errors of the marginal effects found using the pairs bootstrap are very close to the ones obtained from the margins package.
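Note that the function above collects the standard errors reported by margins in each bootstrap sample and averages them. A more conventional pairs-bootstrap standard error (a sketch; it refits margins 600 times, so it is slow) takes the standard deviation of the bootstrapped AMEs themselves:

# sketch: bootstrap the AMEs and use their spread as the standard error
marg_ame <- function(data, ind){
fit2 <- glm(children ~ catholic + educ + age + I(age^2), data[ind,],
family=poisson(link=sqrt), start=coef(fit))
summary(margins(fit2))$AME
}
fitB.ame <- boot(fertil2, marg_ame, R=600)
apply(fitB.ame$t, 2, sd) # bootstrap s.e. of each AME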
5. Test the joint hypothesis H0 : β1 = β4 = 0 using the Wald, LM and LR statistics. Interpret the results.

# Wald test
R <- rbind(c(0,1,0,0,0), c(0,0,0,0,1))
R

## [,1] [,2] [,3] [,4] [,5]


## [1,] 0 1 0 0 0
## [2,] 0 0 0 0 1

#Wald test
V <- vcov(fit)
b <- coef(fit)
q <- R%*%b - c(0,0)
(W <- t(q)%*%solve(R%*%V%*%t(R), q)) # Wald stat

## [,1]
## [1,] 597.0746

## the p-values
1-pchisq(c(W), 2) # two restrictions

## [1] 0

The p-value of the Wald test is equal to zero, which is less than the 5% significance level. Therefore,
we reject the null hypothesis that β1 = β4 = 0.

# LM statistic
# restricted model
res <- glm(children ~ educ + age, fertil2, family=poisson(link=sqrt))
# restricted coefficients
coef.res <- c(coef(res)[1], 0, coef(res)[c(2,3)], 0)
names(coef.res) <- names(coef(fit))
coef.res

## (Intercept) catholic educ age I(age^2)


## -0.19953201 0.00000000 -0.02848553 0.06379706 0.00000000

X <- model.matrix(fit)
e <- residuals(fit)
e.r1 <- residuals(res)
(S <- c(-2*t(X)%*%e))

## [1] 1233.6659 113.3758 7511.6137 30987.2503 874438.2775

(S.r1 <- c(-2*t(X)%*%e.r1))

## [1] 1320.6846 152.1758 7962.8422 33634.6058 1201495.7501

meat1 <- t(X)%*%diag(e.r1^2)%*%X


Xe1 <- t(X)%*%e.r1
(LM1 <- c(t(Xe1)%*%solve(meat1, Xe1))) # LM statistic

## [1] 597.0892

# p value
1-pchisq(LM1,2) # two restrictions

## [1] 0

The p-value of the LM test is equal to zero, which is less than the 5% significance level. Therefore,
we reject the null hypothesis that β1 = β4 = 0.

# LR statistic (Gaussian likelihood evaluated at the model residuals)
sigma <- sqrt(mean(residuals(fit)^2))
lnU <- sum(dnorm(residuals(fit), 0, sigma, log=TRUE))
sigmar <- sqrt(mean(residuals(res)^2))
lnR <- sum(dnorm(residuals(res), 0, sigmar, log=TRUE))

(LR <- 2*(lnU-lnR)) # LR statistic

## [1] 571.7075

# p value
1-pchisq(LR,2) # two restrictions

## [1] 0

The p-value of the LR test is equal to zero, which is less than the 5% significance level. Therefore, we
reject the null hypothesis β1 = β4 = 0.
All three tests lead to the same conclusion.
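As a cross-check (a sketch): since the model is estimated by Poisson MLE, the LR statistic can also be computed directly from the two Poisson log-likelihoods, with 2 degrees of freedom for the two restrictions:

# sketch: LR statistic from the Poisson log-likelihoods of the two fits
LR2 <- 2*c(logLik(fit) - logLik(res))
LR2
1 - pchisq(LR2, 2) # p-value with 2 restrictions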

Part 2: Bootstrap inference

data(CPS1985, package="AER")
set.seed(21042668)
n <- sample(40:150, 1)
ind <- sample(534, size=n, replace=FALSE)
dat <- CPS1985[ind,]

1. Compare the normal and percentile bootstrap confidence intervals using pairs and wild bootstrap. Interpret
your result.

# Fitting the model


dat$male <- as.numeric(dat$gender=="male") # create dummy variable for male
dat$married2 <- as.numeric(dat$married=="yes") # create dummy variable for married
f <- log(wage) ~ male + married2 + age + education + male:married2
fit_2 <- lm(f, dat) # fit the model

# Normal and percentile CI with pairs bootstrap


boost.ci <- function(dat, ind){
dat <- dat[ind,]
fit.i <- lm(f, dat)
fit.i$coefficients
}
res <- boot(dat, boost.ci, R=600)
# normal CI
Confint(res, type = c("norm"))

## Bootstrap normal confidence intervals


##
## 2.5 % 97.5 %
## 1 -0.281925430 0.90117002
## 2 -0.333733616 0.33923991
## 3 -0.452597525 0.13878227
## 4 0.005391692 0.02001102
## 5 0.065816636 0.13087690
## 6 -0.078255493 0.72867462

# percentile cI
Confint(res, type = c("perc"))

## Bootstrap percent confidence intervals


##
## 2.5 % 97.5 %
## 1 -0.292333558 0.88707960
## 2 -0.349801227 0.31571363
## 3 -0.469524029 0.13233507
## 4 0.004831198 0.01958692
## 5 0.067864498 0.13142978
## 6 -0.026762832 0.75825346

The 95% normal and percentile CIs estimated using the pairs bootstrap lead to the same conclusions: the CIs of β0 , β1 , β2 , and β5 include zero, which means these coefficients are not significant, while the CIs of β3 (age) and β4 (education) do not include zero, and therefore these coefficients are significant. (When a CI contains zero, the coefficient is not significantly different from zero, since zero is a plausible value; when the interval does not contain zero, we are 95% confident the coefficient differs from zero.)

# Normal and percentile CI with residual bootstrap (car::Boot, method="residual")


resB <- Boot(fit_2, method="residual")
# normal CI
Confint(resB, type = c("norm"))

## Bootstrap normal confidence intervals


##
## Estimate 2.5 % 97.5 %
## (Intercept) 0.309764001 -0.353850644 0.94021168
## male -0.006274246 -0.331962527 0.31105115
## married2 -0.160012010 -0.445359182 0.11925685
## age 0.012641585 0.003595828 0.02174254
## education 0.098639307 0.064223544 0.13548072
## male:married2 0.334115727 -0.054501978 0.73271808

# Percentil CI
Confint(resB, type = c("perc"))

## Bootstrap percent confidence intervals


##
## Estimate 2.5 % 97.5 %
## (Intercept) 0.309764001 -0.331106507 0.98436575
## male -0.006274246 -0.343172531 0.30166515
## married2 -0.160012010 -0.433873809 0.13073725
## age 0.012641585 0.003478384 0.02181643
## education 0.098639307 0.061042433 0.13225169
## male:married2 0.334115727 -0.073896538 0.73529199

The 95% normal and percentile CIs estimated using the residual bootstrap lead to the same conclusions as the pairs bootstrap: the CIs of β0 , β1 , β2 , and β5 include zero, so these coefficients are not significant, while the CIs of β3 (age) and β4 (education) do not include zero, so these coefficients are significant.
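Note that car::Boot with method="residual" performs a residual bootstrap (residuals resampled with replacement); a wild bootstrap instead multiplies each residual by a random weight. A minimal sketch with Rademacher weights:

# sketch: wild bootstrap (Rademacher weights) normal CIs for the coefficients
yhat <- fitted(fit_2); u <- residuals(fit_2)
bstar <- t(replicate(600, {
ystar <- yhat + u*sample(c(-1, 1), length(u), replace=TRUE) # random sign flips
coef(lm(ystar ~ male + married2 + age + education + male:married2, data=dat))
}))
cbind(coef(fit_2) - 1.96*apply(bstar, 2, sd),
coef(fit_2) + 1.96*apply(bstar, 2, sd)) # normal-approximation CIs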
2. Compare the intervals from the previous question with the one obtained using the Delta method. Interpret
the difference.
Since this question allows the use of any package, I use the DeltaMethod function from the RcmdrMisc package here.

# CI with the Delta method using the RcmdrMisc package


RcmdrMisc::DeltaMethod(fit_2, "b0")[1]

## $test
## Estimate SE 2.5 % 97.5 %
## b0 0.30976 0.33005 -0.33712 0.9566

RcmdrMisc::DeltaMethod(fit_2, "b1")[1]

## $test
## Estimate SE 2.5 % 97.5 %
## b1 -0.0062742 0.1603462 -0.3205471 0.308

RcmdrMisc::DeltaMethod(fit_2, "b2")[1]

## $test
## Estimate SE 2.5 % 97.5 %
## b2 -0.16001 0.14763 -0.44936 0.1293

RcmdrMisc::DeltaMethod(fit_2, "b3")[1]

## $test
## Estimate SE 2.5 % 97.5 %
## b3 0.0126416 0.0046591 0.0035099 0.0218

RcmdrMisc::DeltaMethod(fit_2, "b4")[1]

## $test
## Estimate SE 2.5 % 97.5 %
## b4 0.098639 0.018218 0.062933 0.1343

RcmdrMisc::DeltaMethod(fit_2, "b5")[1]

## $test
## Estimate SE 2.5 % 97.5 %
## b5 0.334116 0.198626 -0.055184 0.7234

The delta method and the bootstrap CIs lead to similar conclusions about the significance of the coefficients. The widths of the intervals differ somewhat between methods, since the bootstrap does not rely on the asymptotic normal approximation and is therefore more robust to model misspecification and to noise in the data.
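Since each quantity here is a single coefficient, i.e. a linear function of the estimates, the delta method collapses to the usual estimate ± 1.96 × s.e. interval, which can be verified directly (a sketch):

# sketch: for a linear function of the coefficients, the delta method
# reduces to the standard asymptotic interval
est <- coef(fit_2)
se <- sqrt(diag(vcov(fit_2)))
cbind(est - 1.96*se, est + 1.96*se) # matches the DeltaMethod intervals above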
3. Test the null hypothesis: if we hold age and education constant, the gender gap for married and non-married workers is equal, against the alternative that it is not equal, at 5%.
For each of the following tests, compute the p-value and interpret your result:
The null hypothesis is: if we hold age and education constant, the gender gap for married and non-married
workers is equal.
The gender gap for married workers is $(e^{\beta_1 + \beta_5} - 1) \times 100$.
The gender gap for non-married workers is $(e^{\beta_1} - 1) \times 100$. Therefore, the null hypothesis is equivalent to β5 = 0. In other words:
H0 : β5 = 0
and
HA : β5 ≠ 0
Computing the t-values for the following tests and their p-values:
T-test with asymptotic distribution

# fitting the model


f2 <- wage ~ male + married2 + age + education + male:married2
fit_3 <- lm(f2, dat) # fit the model

# T-test with asymptotic distribution


linearHypothesis(fit_3, names(coef(fit_2))[c(6)], vcov. = vcovHC, test="F") # pval=0.08292 > 5%

## Linear hypothesis test


##
## Hypothesis:
## male:married2 = 0
##
## Model 1: restricted model
## Model 2: wage ~ male + married2 + age + education + male:married2
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 115
## 2 114 1 3.0603 0.08292 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

The p-value of the t-test with asymptotic distribution is 0.08292 which is greater than the 5% significance
level. Therefore, we cannot reject the null hypothesis H0 : β5 = 0.
Bootstrap T-test using pairs bootstrap

# Bootstrap T-test using pairs bootstrap


marg2 <- function(dat, ind)
{
dat <- dat[ind,]

fit.i <- lm(f2, dat)
s <- summary(fit.i)
co <- s$coefficients[,1]
se <- s$coefficients[,2]
c(co, se)
}
fitB.t <- boot(dat, marg2, R=1000)

t.married <- (fitB.t$t[,6]-fitB.t$t0[6])/fitB.t$t[,12]


(test <- (fitB.t$t0[6]-0)/fitB.t$t0[12]) # t test

## male:married2
## 2.231796

(q <- quantile(abs(t.married), .95)) # critical value

## 95%
## 2.563802

mean(abs(t.married)>abs(test)) # p value > 5%

## [1] 0.081

The p-value of the Bootstrap t-test using pairs bootstrap is 0.081, which is greater than the 5% significance
level (α = 5%). Therefore, we cannot reject the null hypothesis H0 : β5 = 0.
Bootstrap T-test using the restricted wild bootstrap with the Rademacher distribution.

# Bootstrap T-test using restricted wild bootstrap with the Rademacher distribution
wBoot <- function(u, type=c("Rademacher","Mammen"))
{
type <- match.arg(type)
s <- sqrt(5)
xi <- switch(type,
Rademacher=c(1,-1,.5,.5),
Mammen=c((1+s)/2, (1-s)/2, (5-s)/10, (5+s)/10))
xi <- sample(xi[1:2], size=length(u), replace=TRUE, prob=xi[3:4])
xi*u
}

# restricted model
fr <- wage ~ male + married2 + age + education
fitr <- lm(fr, dat)
utilde <- residuals(fitr) # restricted residuals
btilde <- coef(fitr) # restricted beta

# t test function for boot


ttest <- function(dat, ind)
{
dat <- dat[ind,]
ustar <- wBoot(utilde, type="Rademacher")
wageStar <- btilde[1]+btilde[2]*dat$male+btilde[3]*dat$married2+btilde[4]*dat$age+btilde[5]*dat$education+ustar

fit2 <- lm(wageStar ~ male + married2 + age + education + male*married2, dat)
s <- summary(fit2)
co <- s$coefficients[,1]
se <- s$coefficients[,2]
c(co, se)
}

# bootstrap
fitWB <- boot(dat, ttest, R=1000)

t.married <- (fitWB$t[,6]-fitWB$t0[6])/fitWB$t[,12]


(test <- (fitWB$t0[6]-0)/fitWB$t0[12]) # t test

## male:married2
## -1.221308

mean(abs(t.married)>abs(test)) # p value > 5%

## [1] 0.507

The p-value of the bootstrap t-test using the restricted wild bootstrap with the Rademacher distribution
is 0.507, which is greater than the 5% significance level (α = 5%). Therefore, we cannot reject the null
hypothesis H0 : β5 = 0.
Conclusion
The p-values of the three tests are greater than the 5% significance level. Therefore, we cannot reject the
null hypothesis that if we hold age and education constant, the gender gap for married and non-married
workers is equal.
4. Consider the same model. We want to test the joint hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0 at the 5% level.
For each of the following tests, compute the p-value and interpret your result:
Wald test using the asymptotic distribution

# Wald test with asymptotic distribution


linearHypothesis(fit_3, c("male = 1", "age - 5*education = 0", "male:married2 = 0"), vcov. = vcovHC, test="Chisq")

## Linear hypothesis test


##
## Hypothesis:
## male = 1
## age - 5 education = 0
## male:married2 = 0
##
## Model 1: restricted model
## Model 2: wage ~ male + married2 + age + education + male:married2
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 117
## 2 114 3 27.426 4.792e-06 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

The p-value of the Wald test with asymptotic distribution is 4.792e-06, which is lower than the 5% significance
level (α = 5%). We therefore reject the null hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0.
Bootstrap Wald test using pairs bootstrap

# Bootstrap Wald test using pairs bootstrap


# Function to compute Wald statistic
Wald <- function(fit, R, c){
b <- coef(fit)
V <- vcovHC(fit)
q <- R%*%b-c
V <- R%*%V%*%t(R)
c(t(q)%*%solve(V, q))
}

# R matrix
R <- rbind(c(0,1,0,0,0,0), c(0,0,0,1,-5,0), c(0,0,0,0,0,1))
# c vector
c <- c(1,0,0)
# Wald stat for the model
(W <- Wald(fit_3, R, c))

## [1] 27.42641

# c hat
(c.hat <- c(R%*%coef(fit_3)))

## [1] -1.205543 -4.722550 4.867719

# Bootstrap loop
Ws <- numeric()
for (i in 1:1000){
ind <- sample(nrow(dat), replace=TRUE)
fit2 <- lm(f2, dat[ind,])
Ws[i] <- Wald(fit2, R, c.hat)
}

# p-value
mean(Ws>W)

## [1] 0.003

The p-value of the bootstrap Wald test using pairs bootstrap, at 0.003, is lower than the 5% significance level
(α = 5%). We therefore reject the null hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0.
Bootstrap Wald test using the restricted wild bootstrap with the Rademacher distribution.
The restricted model is:

[wage − (β̂1 − 1)male − (β̂3 − 5β̂4 )age − β̂5 male : married] = β2 married + β3 (age − 5educ)

# Bootstrap Wald test using the restricted wild bootstrap with the Rademacher distribution.

# Restricted model
dat2 <- dat
X <- model.matrix(f2, dat2)
cVec <- c(coef(fit_3)[2]-1, coef(fit_3)[4]-5*coef(fit_3)[5], coef(fit_3)[6]) # (beta1-1, beta3-5*beta4, beta5)
dat2$wage <- dat2$wage-c(X[,c(2,4,6)]%*%cVec)
dat2$x6 <- dat2$age - (5*dat2$education)
fr <- wage ~ married2 + x6
utilde <- residuals(fitr <- lm(fr, dat2))
btilde <- coef(fitr) # restricted beta

# Wald stat for the model


ustar <- wBoot(utilde, type="Rademacher")
wageStar <- btilde[1]+btilde[2]*dat$married2+btilde[3]*dat$age-5*btilde[3]*dat$education+ustar
fr2 <- wageStar ~ male + married2 + age + education + male*married2
fit_2 <- lm(fr2, dat)
# c vector
c <- c(1,0,0)
(W2 <- Wald(fit_2, R, c))

## [1] 56.30884

# c hat
(c.hat <- c(R%*%coef(fit_2)))

## [1] -12.305185 56.234245 8.394266

# Bootstrap loop
Ws2 <- numeric()
for(i in 1:1000){
ind <- sample(nrow(dat), replace=TRUE)
ustar <- wBoot(utilde, type="Rademacher")
wageStar <- btilde[1]+btilde[2]*dat$married2+btilde[3]*dat$age-5*btilde[3]*dat$education+ustar
fit2 <- lm(wageStar ~ male + married2 + age + education + male*married2, dat[ind,])
Ws2[i] <- Wald(fit2, R, c.hat)
}

# p-value
mean(Ws2>W2)

## [1] 0.14

The p-value of the bootstrap Wald test using the restricted wild bootstrap with the Rademacher distribution
is 0.14, which is greater than the 5% significance level (α = 5%). We therefore cannot reject the null
hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0.
Bootstrap LM test using pairs bootstrap

# • Bootstrap LM test using pairs bootstrap

# function to calculate LM stat

LMBoot <- function(data, ind, cVec){
data <- data[ind,]
X <- model.matrix(f2, data)

data$wage <- data$wage-c(X[,c(2,4,6)]%*%cVec)


data$x6 <- data$age - (5*data$education)

fr <- wage ~ married2 + x6

utilde <- residuals(fitr <- lm(fr, data))


uX <- t(X)%*%utilde
meat <- crossprod(X*utilde^2, X)
c(t(uX)%*%solve(meat, uX))
}

# LM statistic
c <- c(coef(fit_3)[2]-1, coef(fit_3)[4]-5*coef(fit_3)[5], coef(fit_3)[6])
(LM <- LMBoot(dat, 1:nrow(dat), c))

## [1] 53.07328

# boot
resB <- boot(dat, LMBoot, R=1000, cVec=c)

# p-values
mean(resB$t>LM)

## [1] 0.582

The p-value of the bootstrap LM test using the pairs bootstrap is 0.582, which is greater than the 5% significance
level (α = 5%). We therefore cannot reject the null hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0.
Bootstrap LM test using the restricted wild bootstrap with the Rademacher distribution.

# • Bootstrap LM test using the restricted wild bootstrap with the Rademacher distribution.

# function to calculate LM stat with Rademacher distribution


LMBoot2 <- function(data, ind)
{
data <- data[ind,]

ustar <- wBoot(utilde, type="Rademacher")


wageStar <- btilde[1]+btilde[2]*dat$married2+btilde[3]*dat$age-5*btilde[3]*dat$education+ustar
fr2 <- wageStar ~ male + married2 + age + education + male*married2
fitr2 <- lm(fr2, data)
cVec <- c(coef(fitr2)[2]-1, coef(fitr2)[4]-5*coef(fitr2)[5], coef(fitr2)[6])

X <- model.matrix(fr2, data)


data$wage <- data$wage-c(X[,c(2,4,6)]%*%cVec)
data$x6 <- data$age - (5*data$education)
fr <- wage ~ married2 + x6
utilde <- residuals(fitr <- lm(fr, data))
uX <- t(X)%*%utilde

meat <- crossprod(X*utilde^2, X)
c(t(uX)%*%solve(meat, uX))
}

# restricted model
dat2 <- dat
X <- model.matrix(f2, dat2)
cVec <- c(coef(fit_3)[2]-1, coef(fit_3)[4]-5*coef(fit_3)[5], coef(fit_3)[6])
dat2$wage <- dat2$wage-c(X[,c(2,4,6)]%*%cVec)
dat2$x6 <- dat2$age - (5*dat2$education)
fr <- wage ~ married2 + x6
utilde <- residuals(fitr <- lm(fr, dat2))
btilde <- coef(fitr) # restricted beta

# LM stat for the model


(LM2 <- LMBoot2(dat, 1:nrow(dat)))

## [1] 55.53485

# boot
resB2 <- boot(dat, LMBoot2, R=1000)

# p-value
mean(resB2$t>LM2)

## [1] 0.514

The p-value of the bootstrap LM test using the restricted wild bootstrap with the Rademacher distribution is 0.514, which is greater than the 5% significance level (α = 5%). We therefore cannot reject the null hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0.

Part 3: One of Many Return to Education Studies (from a Nobel Memorial Prize winner): To continue in the next assignment

I want every one of you to have a different dataset. Once you have loaded the data, run the following code,
with your student ID in the set.seed function:

load("Card.rda")
set.seed(21042668)
n <- sample(600:1000, 1)
ind <- sample(nrow(dat), n, replace=FALSE)
dat <- dat[ind,]

You will see that exp76 (years of experience) is not in the dataset. You have to compute it using the formula
exp76 = age76 - ed76 - 6. It is not the actual experience but the potential experience. This is a commonly
used measure of experience in the labour economics literature.

# create the experience variable


dat$exp76 <- dat$age76 - dat$ed76 - 6

1. Explain what is the issue with OLS when the objective is to estimate the return to education. Then,
explain why the solution proposed by David Card can potentially solve the problem. You can use the article
to answer the question, but you have to explain in your own words. Can you think of a possible reason for
rejecting the validity of the instrument?
The issue with OLS is that it can give biased estimates of the β coefficients when one of the independent variables is correlated with unobserved determinants of wages (omitted variables), in other words, when the model contains an endogenous regressor. Education is likely endogenous here, since unobserved ability can affect both schooling and wages. David Card suggests using IV estimation instead of OLS to solve this problem, since the IV estimate is identified from the variation coming from the instrument only.
For the instrument to be valid, and to overcome the issue faced using OLS, it needs to be uncorrelated with
the error term u and correlated with the independent variable we are interested in, which is ed76. A possible reason for rejecting the instrument's validity is that growing up near a college may be correlated with family background or local labour-market conditions that affect wages directly, violating the exclusion restriction.
2. Estimate the model by OLS and interpret the result. Is there a different return to education for black and
non-black workers?

# Estimate OLS model


mod <- lm(log(wage76) ~ ed76 + black + ed76 * black + exp76 + I(exp76^2) + reg76r + smsa76r, data=dat)
summary(mod)

##
## Call:
## lm(formula = log(wage76) ~ ed76 + black + ed76 * black + exp76 +
## I(exp76^2) + reg76r + smsa76r, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.98726 -0.20587 0.00924 0.24768 1.30935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7047109 0.1488870 31.599 < 2e-16 ***
## ed76 0.0707223 0.0075027 9.426 < 2e-16 ***
## black -0.5335607 0.2328487 -2.291 0.0222 *
## exp76 0.1017337 0.0153676 6.620 6.07e-11 ***
## I(exp76^2) -0.0032597 0.0007883 -4.135 3.87e-05 ***
## reg76r -0.0860909 0.0290279 -2.966 0.0031 **
## smsa76r 0.1548633 0.0302059 5.127 3.58e-07 ***
## ed76:black 0.0254052 0.0177099 1.435 0.1518
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3954 on 928 degrees of freedom
## Multiple R-squared: 0.2014, Adjusted R-squared: 0.1953
## F-statistic: 33.43 on 7 and 928 DF, p-value: < 2.2e-16

The p-value of the F statistic is less than the 5% significance level. Therefore, the estimated coefficients are
not all equal to zero.
The adjusted R squared of the model is 0.1953, indicating that this model explains 19.53% of variance in
the dependent variable.
The p-values of the t-statistics on ed76, black, exp76 (and its square), reg76r and smsa76r are all below the 5% significance level, which means these coefficients are significantly different from zero and these variables are significant in the model. The interaction between ed76 and black is not significant, as its p-value of 0.1518 is greater than the 5% significance level.
The variable black is significant (p-value < 5%), indicating that black workers have a different wage level: log(wage76) is estimated to be about 0.53 lower for black workers, holding the other variables fixed. The difference in the return to education itself is captured by the interaction term, which is positive (0.0254) but not significant at the 5% level.
Important For the following questions, I want you to estimate the model and perform the tests manually (no
package). Just use the solution from your notes.
3. For this question only, consider the model
log(wage76) = β0 + β1 ed76 + u.
Using the nearc4 as instrument, show that the Wald estimator is the same as the IV estimator.
We are interested in estimating β1 in:

log(wage76) = β0 + β1 × ed76 + u

Using nearc4 as instrument:

# First stage regression model


s1_mod <- lm(ed76 ~ nearc4, data=dat)
summary(s1_mod)

##
## Call:
## lm(formula = ed76 ~ nearc4, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0289 -2.0289 -0.6475 1.9711 4.3525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6475 0.1359 100.418 <2e-16 ***
## nearc4 0.3814 0.1621 2.353 0.0188 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.266 on 934 degrees of freedom
## Multiple R-squared: 0.005893, Adjusted R-squared: 0.004828
## F-statistic: 5.536 on 1 and 934 DF, p-value: 0.01883

Under the assumption that nearc4 is a valid instrument, the first stage regression is estimated using the
equation (using π instead of β to avoid confusion in interpretations):
$$ed76 = \pi_0 + \pi_1 \times nearc4 + \nu$$
The estimated first stage regression shows that the instrument nearc4 is significant (the p-value of its t-statistic is
less than the 5% significance level). Therefore, the estimated first stage equation is:

$$\widehat{ed76} = 13.6475 + 0.3814 \times nearc4$$

The regression's R-squared indicates that about 0.59% of the variation in ed76 is explained by the variation in
nearc4.

# store the predicted values
ed_pred <- s1_mod$fitted.values

# run the 2 stage least square regression (TSLS regression)


s2_mod <- lm(log(dat$wage76) ~ ed_pred)
summary(s2_mod)

##
## Call:
## lm(formula = log(dat$wage76) ~ ed_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.02421 -0.25610 0.03158 0.27975 1.29147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.86838 1.14195 1.636 0.1021
## ed_pred 0.32055 0.08206 3.906 0.0001 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4375 on 934 degrees of freedom
## Multiple R-squared: 0.01608, Adjusted R-squared: 0.01502
## F-statistic: 15.26 on 1 and 934 DF, p-value: 0.0001004

The estimated model using TSLS gives:

$$\widehat{\log(wage76)} = 1.86838 + 0.32055 \times ed76$$

The IV estimator is statistically significant with a p-value < 5%. The estimate β̂1 = 0.32055 suggests
that a one-unit increase in ed76 increases log(wage76) by 0.32055.

# Wald estimator
num <- round(lm(log(dat$wage76)~dat$nearc4)$coefficient[2],3) # regression of Y on Z
den <- round(lm(dat$ed76~dat$nearc4)$coefficient[2],3) # regression of X on Z
num/den # Wald estimator

## dat$nearc4
## 0.32021

The Wald estimator is equal to 0.32021, which is very close to the IV estimator (the small gap is due to rounding the two slopes to three decimals).
Explain the intuition behind the Wald estimator in this particular example.
The intuition behind the Wald estimator in this example is that the instrument nearc4 is binary. The estimator compares the mean of log(wage76) between workers who grew up near a four-year college and those who did not, and scales this difference by the corresponding difference in mean years of education; when the instrument is binary, this ratio coincides with the IV estimator.
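This can be verified directly by computing the two mean differences (a sketch):

# sketch: Wald estimator as a ratio of mean differences across nearc4
dy <- with(dat, mean(log(wage76)[nearc4==1]) - mean(log(wage76)[nearc4==0])) # reduced form
dx <- with(dat, mean(ed76[nearc4==1]) - mean(ed76[nearc4==0])) # first stage
dy/dx # the Wald/IV estimate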
Can you explain why controlling for nearc4 is not the same as using it as an instrument? What is the difference?
Controlling for nearc4 in this model assumes that it has a direct effect on log(wage76) alongside the other
independent variables, while using it as an instrument assumes that it has no direct effect on the outcome,
only an effect on the endogenous variable ed76.
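The contrast can be made concrete (a sketch): adding nearc4 as a control leaves the coefficient on ed76 an OLS estimate, while instrumenting uses only the variation in ed76 predicted by nearc4.

# sketch: nearc4 as a control versus nearc4 as an instrument
coef(lm(log(wage76) ~ ed76 + nearc4, data=dat))["ed76"] # control: still OLS
coef(s2_mod)["ed_pred"] # instrument: the IV estimate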

4. Going back to the previous model, estimate it by IV using nearc4 as instrument. Note that ed76 appears
twice in the regression: if ed76 is endogenous, ed76 × black is also endogenous.
We want to estimate the previous model using nearc4 as an instrument. The endogenous variables are ed76
and ed76 × black, and the exogenous variables are black, exp76, exp76², reg76r, and smsa76r.

# IV estimations

# First stage regression


s1_end1 <- lm(ed76 ~ nearc4 + black + exp76 + I(exp76^2) + reg76r + smsa76r, data=dat)
ed_pred_1 <- s1_end1$fitted.values

s1_end2 <- lm((ed76 * black) ~ nearc4 + black + exp76 + I(exp76^2) + reg76r + smsa76r, data=dat)
ed_pred_2 <- s1_end2$fitted.values

# second stage regression


s2_mod2 <- lm(log(dat$wage76) ~ ed_pred_1 + dat$black + ed_pred_2 + dat$exp76 + I(dat$exp76^2) + dat$reg76r + dat$smsa76r)
# summary of the estimated model
summary(s2_mod2)

##
## Call:
## lm(formula = log(dat$wage76) ~ ed_pred_1 + dat$black + ed_pred_2 +
## dat$exp76 + I(dat$exp76^2) + dat$reg76r + dat$smsa76r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10121 -0.23423 0.02068 0.25903 1.32463
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.164184 1.442844 2.193 0.028553 *
## ed_pred_1 0.203406 0.074502 2.730 0.006449 **
## dat$black 35.014679 10.044793 3.486 0.000514 ***
## ed_pred_2 -2.732734 0.784615 -3.483 0.000519 ***
## dat$exp76 0.150037 0.053884 2.784 0.005471 **
## I(dat$exp76^2) -0.010716 0.001676 -6.393 2.57e-10 ***
## dat$reg76r -0.117865 0.030429 -3.873 0.000115 ***
## dat$smsa76r NA NA NA NA
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4166 on 929 degrees of freedom
## Multiple R-squared: 0.1125, Adjusted R-squared: 0.1068
## F-statistic: 19.63 on 6 and 929 DF, p-value: < 2.2e-16

The estimated model is

$$\widehat{\log(wage76)} = 3.1642 + 0.2034 \times \widehat{ed76} + 35.0147 \times black - 2.7327 \times \widehat{ed76 \times black} + 0.1500 \times exp76 - 0.0107 \times exp76^2 - 0.1179 \times reg76r$$

Interpret the coefficients. Do you see a difference between OLS and IV?

The coefficients of the endogenous variables and of the exogenous variables black, exp76, exp76², and reg76r
are significantly different from zero.
The model estimates that a one-unit increase in ed76 increases log(wage76) by 0.2034 for non-black workers; for black workers, the interaction coefficient of -2.7327 reduces this return.
Furthermore, the model suggests that log(wage76) is 35.015 higher for a black worker than for a comparable non-black worker, an implausibly large shift that reflects how imprecisely this IV specification is identified.
Since the coefficient of the quadratic term exp76² is significant (p-value < 0.05), the relationship between exp76 and log(wage76) is not linear but quadratic: a one-unit increase in exp76 raises log(wage76) by about 0.15 at low experience, with the effect declining as experience grows.
We notice a difference in the significance and values of the coefficients between OLS and IV, as well as a
difference in the goodness of fit: the OLS model fits better (larger R-squared).
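One caveat worth flagging (a sketch, under homoskedasticity): the OLS standard errors from the manual second stage are not the correct 2SLS standard errors, because the residual variance should be computed with the original regressors rather than the fitted ones. Since smsa76r was dropped as collinear above, it is omitted here too:

# sketch: conventional 2SLS standard errors, recomputing residuals
# with the original (not fitted) regressors; smsa76r dropped as above
y <- log(dat$wage76)
Xh <- model.matrix(~ ed_pred_1 + black + ed_pred_2 + exp76 + I(exp76^2) + reg76r, data=dat)
X <- model.matrix(~ ed76 + black + I(ed76*black) + exp76 + I(exp76^2) + reg76r, data=dat)
biv <- solve(crossprod(Xh), crossprod(Xh, y)) # same point estimates as the second stage
u <- y - X %*% biv # residuals using the original regressors
s2 <- sum(u^2)/(nrow(X) - ncol(X))
sqrt(diag(s2*solve(crossprod(Xh)))) # corrected 2SLS standard errors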
Test the hypothesis that the return to education is the same for black and non-black workers. Use a robust
test. Do you reach the same conclusion as in question 1?
The null hypothesis to be tested is H0 : β2 = β3 = 0.

# R matrix
R <- rbind(c(0,0,1,0,0,0,0),
c(0,0,0,1,0,0,0))
b <- coef(s2_mod2)[1:7]
Vr <- vcovHC(s2_mod2)
q <- R%*%b
(Wr <- t(q)%*%solve(R%*%Vr%*%t(R), q)) # test statistic

## [,1]
## [1,] 11.73401

## the p-values
1-pchisq(c(Wr), 2)

## [1] 0.002831342

The p-value of the robust Wald test is 0.0028, which is less than the 5% significance level. Therefore, we
reject the null hypothesis that the coefficients on black and on the education-black interaction are jointly
zero, i.e. that the wage equation is the same for black and non-black workers.
This matches the conclusion from question 2, where the black coefficient was also significant in the OLS regression.

