A3 Answers
Zahra Amin
2022-11-17
# libraries
library(wooldridge)
library(margins)
library(boot)
library(car)
library(sandwich)
Part 1: MLE
1. Estimate the model using glm and interpret the coefficient.
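data(fertil2)
fit <- glm(children ~ catholic + educ + age + I(age^2),
family = poisson(link = sqrt), data = fertil2)
summary(fit)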
##
## Call:
## glm(formula = children ~ catholic + educ + age + I(age^2), family = poisson(link = sqrt),
## data = fertil2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3190 -0.7494 -0.1770 0.5637 2.8693
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.099e+00 8.241e-02 -25.464 <2e-16 ***
## catholic -1.610e-02 2.525e-02 -0.638 0.524
## educ -2.857e-02 2.051e-03 -13.929 <2e-16 ***
## age 2.014e-01 5.669e-03 35.521 <2e-16 ***
## I(age^2) -2.280e-03 9.337e-05 -24.419 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10082.1 on 4360 degrees of freedom
## Residual deviance: 3949.6 on 4356 degrees of freedom
## AIC: 13053
##
## Number of Fisher Scoring iterations: 5
Because the model is a Poisson regression with a square-root link, the square root of the expected number of children is linear in the regressors:
$$\sqrt{E(children \mid X)} = -2.099 - 0.0161 \times catholic - 0.0286 \times educ + 0.201 \times age - 0.00228 \times age^2$$
Equivalently, the expected number of children is the square of the right-hand side.
The intercept can be interpreted as the square root of the expected number of children for a non-Catholic with 0 years of education at age 0.
The coefficient of catholic is not significantly different from zero, since the p-value of its z test (0.524) is greater than 5%.
The coefficient of educ is the effect of one additional year of education on the square root of the expected number of children: it falls by 0.0286, holding the other regressors fixed. The coefficients β3 and β4 are the linear and quadratic terms in age.
The terms in age and age^2 cannot be interpreted separately, because age cannot change while age^2 is held fixed. Their combined effect on the square root of the expected number of children is β3 + 2β4 × age per additional year of age, which is positive at young ages and shrinks as age increases.
2. Estimate the marginal effect of each regressor, using the margins package. Interpret the effects and their
significance.
summary(margins(fit))
$$\frac{\partial E(Y \mid X)}{\partial X_i} = 2X'\beta \times \frac{\partial X'\beta}{\partial X_i}$$
# calculating the marginal effects at the sample means
b <- coef(fit) # get model coefficients
X <- cbind(1, fertil2$catholic, fertil2$educ,
fertil2$age, fertil2$age^2) # get X values from the data
mX <- 2*sum(colMeans(X)*b) # calculate 2X'B at the column means of X
mX*c(b[2], b[3], b[4]+2*b[5]*mean(fertil2$age)) # marginal effects of catholic, educ, age
The marginal effects I found are very close to the ones estimated by the margins function, except for age: margins reports AME = 0.1615, whereas the manual calculation gives 0.2087.
The difference most likely arises because the manual calculation evaluates the derivative once at the sample means of the regressors, while margins averages the derivative over all observations. Since the effect is nonlinear in age, the two need not agree.
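The gap between an average marginal effect and a marginal effect at the mean can be illustrated with a minimal sketch. It uses simulated data (the names `sim_fit`, `ame` and `mem` are hypothetical and not from the assignment) and assumes a square-root-link Poisson model with a quadratic in age:

```r
# Simulated data (hypothetical; not fertil2)
set.seed(1)
n   <- 500
age <- runif(n, 15, 50)
eta <- 0.2 + 0.05 * age - 0.0004 * age^2   # linear index, positive on this range
y   <- rpois(n, eta^2)                     # sqrt link: E(y | x) = eta^2
sim_fit <- glm(y ~ age + I(age^2), family = poisson(link = "sqrt"))

b   <- coef(sim_fit)
X   <- model.matrix(sim_fit)
xb  <- c(X %*% b)                          # fitted index for every observation
dxb <- b[2] + 2 * b[3] * age               # d(x'b)/d(age) for every observation

ame <- mean(2 * xb * dxb)                  # average of the individual derivatives
mem <- unname(2 * sum(colMeans(X) * b) *
              (b[2] + 2 * b[3] * mean(age)))  # derivative evaluated at the means
c(AME = ame, MEM = mem)
```

The two numbers differ whenever the derivative is nonlinear in the regressors, which is the case here because of the squared index and the quadratic in age.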
3. Compute manually the standard errors of the coefficients β̂ using the sandwich and non-sandwich estimators. You will therefore need to compute the Hessian and information matrices. Start by deriving them mathematically and then translate your results in R. Compare your standard errors with the ones from the glm function.
The standard errors with the non-sandwich estimator, based on $\hat{H}^{-1}$, are:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 22.01902 1.362031e-05 0.1399664 152.2545 17.75519
The MLE standard errors calculated using the sandwich and non-sandwich estimators are different from the ones obtained using the glm function. They are larger for all coefficients except β1 (catholic), which is not significant in the model.
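As a reference point for this comparison, here is a minimal sketch of the calculation on simulated data (not fertil2; all object names are hypothetical). It assumes the Poisson log-likelihood with index η = x'β and mean η², so the individual score is 2(y − η²)/η · x, the Hessian is −Σ 2(y/η² + 1) x x', and the expected information is 4 Σ x x':

```r
set.seed(2)
n <- 400
x <- runif(n)
y <- rpois(n, (1 + 0.5 * x)^2)             # true index 1 + 0.5x, E(y | x) = index^2
sim_fit <- glm(y ~ x, family = poisson(link = "sqrt"))

X   <- model.matrix(sim_fit)
eta <- c(X %*% coef(sim_fit))              # fitted index x'b

score <- X * (2 * (y - eta^2) / eta)       # n x k matrix of individual scores
H     <- -crossprod(X * (2 * (y / eta^2 + 1)), X)  # observed Hessian
I_exp <- 4 * crossprod(X)                  # expected information

se_hessian  <- sqrt(diag(solve(-H)))       # non-sandwich, observed Hessian
se_expected <- sqrt(diag(solve(I_exp)))    # non-sandwich, expected information
Hinv        <- solve(H)
se_sandwich <- sqrt(diag(Hinv %*% crossprod(score) %*% Hinv))

cbind(glm = sqrt(diag(vcov(sim_fit))), se_expected, se_hessian, se_sandwich)
```

In this sketch the glm column matches `se_expected`, because Fisher scoring uses the expected information; the Hessian-based and sandwich columns should be of the same order of magnitude when the model is correctly specified.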
4. Using 600 bootstrap samples (pairs bootstrap), estimate the standard errors of the marginal effects. Compare them with the ones from the margins package.
mean(fitB.se$t[1:600])
## [1] 0.002353731
mean(fitB.se$t[601:1200])
## [1] 0.06896553
mean(fitB.se$t[1201:1800])
## [1] 0.005622364
The standard errors of the marginal effects that were found using the pairs bootstrap are identical to the
ones obtained using the margins package.
5. Test the joint hypothesis H0 : β1 = β4 = 0 using the Wald, LM and LR statistics. Interpret the results.
# restriction matrix: selects the coefficients on catholic and I(age^2)
R <- rbind(c(0,1,0,0,0), c(0,0,0,0,1))
R
#Wald test
V <- vcov(fit)
b <- coef(fit)
q <- R%*%b - c(0,0)
(W <- t(q)%*%solve(R%*%V%*%t(R), q)) # Wald stat
## [,1]
## [1,] 597.0746
## the p-value (2 restrictions, so 2 degrees of freedom)
1-pchisq(c(W), 2)
## [1] 0
The p-value of the Wald test is equal to zero, which is less than the 5% significance level. Therefore, we reject the null hypothesis that β1 = β4 = 0.
# LM statistic
# restricted model
res <- glm(children ~ educ + age, fertil2, family=poisson(link=sqrt))
# restricted coefficients
coef.res <- c(coef(res)[1], 0, coef(res)[c(2,3)], 0)
names(coef.res) <- names(coef(fit))
coef.res
X <- model.matrix(fit)
e <- residuals(fit)
e.r1 <- residuals(res)
(S <- c(-2*t(X)%*%e))
## [1] 597.0892
# p value (2 restrictions, so 2 degrees of freedom)
1-pchisq(LM1,2)
## [1] 0
The p-value of the LM test is equal to zero, which is less than the 5% significance level. Therefore, we reject the null hypothesis that β1 = β4 = 0.
# LR statistic
# note: these log-likelihoods use a normal density for the residuals
sigma <- sqrt(mean(residuals(fit)^2))
lnU <- sum(dnorm(residuals(fit), 0, sigma, log=TRUE))
sigmar <- sqrt(mean(residuals(res)^2))
lnR <- sum(dnorm(residuals(res), 0, sigmar, log=TRUE))
## [1] 571.7075
# p value (2 restrictions, so 2 degrees of freedom)
1-pchisq(LR,2)
## [1] 0
The p-value of the LR test is equal to zero, which is less than the 5% significance level. Therefore, we reject the null hypothesis β1 = β4 = 0.
The three tests lead to the same conclusion.
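For a Poisson model the LR statistic can also be built from the Poisson log-likelihood itself. The following is a minimal sketch on simulated data (object names such as `full` and `rest` are hypothetical), using `logLik` on the restricted and unrestricted fits and taking the degrees of freedom from the number of restrictions:

```r
set.seed(3)
n  <- 400
x1 <- rbinom(n, 1, 0.4)
x2 <- runif(n)
y  <- rpois(n, (1 + 0.6 * x2)^2)           # x1 is irrelevant by construction
full <- glm(y ~ x1 + x2, family = poisson(link = "sqrt"))
rest <- glm(y ~ x2,      family = poisson(link = "sqrt"))

LR   <- as.numeric(2 * (logLik(full) - logLik(rest)))
q    <- length(coef(full)) - length(coef(rest))   # number of restrictions
pval <- pchisq(LR, df = q, lower.tail = FALSE)
c(LR = LR, df = q, p.value = pval)
```

For Poisson fits this statistic equals the difference in residual deviances between the two models, which gives a quick consistency check.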
data(CPS1985, package="AER")
set.seed(21042668)
n <- sample(40:150, 1)
ind <- sample(534, size=n, replace=FALSE)
dat <- CPS1985[ind,]
1. Compare the normal and percentile bootstrap confidence intervals using pairs and wild bootstrap. Interpret
your result.
# percentile CI
Confint(res, type = c("perc"))
The 95% normal and percentile CIs estimated using the pairs bootstrap are constructed so that, over repeated samples, about 95% of such intervals contain the true coefficient. Both the normal and percentile CIs lead to the same conclusions: the CIs of β0, β1, β2 and β5 include zero, which means these coefficients are not significant at the 5% level, and the CIs of β3 and β4 do not include zero, so these coefficients are significant. (When a 95% CI contains zero, we cannot reject at the 5% level that the coefficient is equal to zero. When the interval does not contain zero, we reject that hypothesis at the 5% level.)
# Percentile CI
Confint(resB, type = c("perc"))
The 95% normal and percentile CIs estimated using the wild bootstrap have the same interpretation. Both lead to the same conclusions as the pairs bootstrap: the CIs of β0, β1, β2 and β5 include zero, which means these coefficients are not significant at the 5% level, and the CIs of β3 and β4 do not include zero, so these coefficients are significant. (Again, when a 95% CI contains zero, we cannot reject at the 5% level that the coefficient is equal to zero. When the interval does not contain zero, we reject that hypothesis at the 5% level.)
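The difference between the two interval types can be made concrete with a minimal sketch. It uses simulated data and a hand-written pairs bootstrap loop (all names are hypothetical; the assignment itself relies on other tools for this step):

```r
set.seed(4)
n <- 80
x <- rnorm(n)
y <- 1 + 0.5 * x + rt(n, df = 4)           # heavy-tailed errors
B <- 999
slope_star <- replicate(B, {
  idx <- sample(n, replace = TRUE)         # pairs bootstrap: resample (y, x) rows
  coef(lm(y[idx] ~ x[idx]))[2]
})
slope_hat <- unname(coef(lm(y ~ x))[2])

# normal CI: estimate +/- z * bootstrap standard deviation
ci_normal <- slope_hat + c(-1, 1) * qnorm(0.975) * sd(slope_star)
# percentile CI: empirical quantiles of the bootstrap estimates
ci_perc   <- quantile(slope_star, c(0.025, 0.975))
rbind(normal = ci_normal, percentile = ci_perc)
```

The normal interval is symmetric around the estimate by construction, while the percentile interval follows the shape of the bootstrap distribution, so the two diverge mainly when that distribution is skewed.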
2. Compare the intervals from the previous question with the one obtained using the Delta method. Interpret
the difference.
Since this question allows the use of any package, I use the DeltaMethod function from the RcmdrMisc package.
RcmdrMisc::DeltaMethod(fit_2, "b0")[1]
## $test
## Estimate SE 2.5 % 97.5 %
## b0 0.30976 0.33005 -0.33712 0.9566
RcmdrMisc::DeltaMethod(fit_2, "b1")[1]
## $test
## Estimate SE 2.5 % 97.5 %
## b1 -0.0062742 0.1603462 -0.3205471 0.308
RcmdrMisc::DeltaMethod(fit_2, "b2")[1]
## $test
## Estimate SE 2.5 % 97.5 %
## b2 -0.16001 0.14763 -0.44936 0.1293
RcmdrMisc::DeltaMethod(fit_2, "b3")[1]
## $test
## Estimate SE 2.5 % 97.5 %
## b3 0.0126416 0.0046591 0.0035099 0.0218
RcmdrMisc::DeltaMethod(fit_2, "b4")[1]
## $test
## Estimate SE 2.5 % 97.5 %
## b4 0.098639 0.018218 0.062933 0.1343
RcmdrMisc::DeltaMethod(fit_2, "b5")[1]
## $test
## Estimate SE 2.5 % 97.5 %
## b5 0.334116 0.198626 -0.055184 0.7234
The delta method and the bootstrap CIs lead to similar conclusions about the significance of the coefficients. The intervals themselves differ because the delta method relies on the asymptotic normal approximation and on the estimated covariance matrix of the coefficients, whereas the bootstrap approximates the sampling distribution by resampling, which can make it more robust to misspecification of the model and to noise in a small sample.
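The delta method becomes more informative for a nonlinear function of a coefficient. A minimal sketch, on simulated data with hypothetical names, for a percentage effect of the form 100(exp(b) − 1) and assuming the response is a log wage:

```r
set.seed(5)
n    <- 120
male <- rbinom(n, 1, 0.5)
lw   <- 2 + 0.2 * male + rnorm(n, sd = 0.4)  # hypothetical log wage
sim_fit <- lm(lw ~ male)

b  <- unname(coef(sim_fit)["male"])
V  <- vcov(sim_fit)["male", "male"]
g  <- 100 * (exp(b) - 1)                   # percentage gap
dg <- 100 * exp(b)                         # derivative of g with respect to b
se <- sqrt(dg^2 * V)                       # delta-method standard error
c(estimate = g, se = se,
  lower = g - qnorm(0.975) * se, upper = g + qnorm(0.975) * se)
```

When the function is just the coefficient itself, the derivative is 1 and the delta-method interval reduces to the usual asymptotic interval.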
3. Test the null hypothesis: if we hold age and education constant, the gender gap for married and non-married workers is equal, against the alternative that it is not equal, at the 5% level.
For each of the following test, compute the p-value and interpret your result:
The null hypothesis is: if we hold age and education constant, the gender gap for married and non-married
workers is equal.
The gender gap for married workers is $(\exp(\beta_1 + \beta_5) - 1) \times 100$.
The gender gap for non-married workers is $(\exp(\beta_1) - 1) \times 100$. Therefore, the null hypothesis is equivalent to $\beta_5 = 0$. In other words:
$$H_0: \beta_5 = 0$$
and
$$H_A: \beta_5 \neq 0$$
Computing the t-values for the following tests and their p-values:
T-test with asymptotic distribution
The p-value of the t-test with asymptotic distribution is 0.08292 which is greater than the 5% significance
level. Therefore, we cannot reject the null hypothesis H0 : β5 = 0.
Bootstrap T-test using pairs bootstrap
fit.i <- lm(f2, dat)
s <- summary(fit.i)
co <- s$coefficients[,1]
se <- s$coefficients[,2]
c(co, se)
}
fitB.t <- boot(dat, marg2, R=1000)
## male:married2
## 2.231796
## 95%
## 2.563802
## [1] 0.081
The p-value of the Bootstrap t-test using pairs bootstrap is 0.081, which is greater than the 5% significance
level (α = 5%). Therefore, we cannot reject the null hypothesis H0 : β5 = 0.
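The logic of a pairs bootstrap t-test can be sketched as follows, on simulated data with hypothetical names. Each bootstrap t-statistic is centred at the original estimate, and the p-value is the share of bootstrap statistics that exceed the observed one in absolute value:

```r
set.seed(6)
n <- 100
x <- rnorm(n); z <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)                # z has no effect, so H0: coef on z = 0 holds
d <- data.frame(y, x, z)

tstat <- function(fit, b0) {
  s <- summary(fit)$coefficients
  unname((s["z", 1] - b0) / s["z", 2])
}
fit0  <- lm(y ~ x + z, data = d)
t_obs <- tstat(fit0, 0)                    # observed statistic under H0
b_hat <- coef(fit0)["z"]

t_star <- replicate(999, {
  idx <- sample(n, replace = TRUE)         # resample rows with replacement
  tstat(lm(y ~ x + z, data = d[idx, ]), b_hat)  # centre at the sample estimate
})
p_boot <- mean(abs(t_star) > abs(t_obs))
p_boot
```

Centring at the sample estimate is what makes the bootstrap distribution mimic the null distribution, since the resampled data do not impose the null hypothesis.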
Bootstrap T-test using the restricted wild bootstrap with the Rademacher distribution.
# Bootstrap T-test using restricted wild bootstrap with the Rademacher distribution
wBoot <- function(u, type=c("Rademacher","Mammen"))
{
type <- match.arg(type)
s <- sqrt(5)
xi <- switch(type,
Rademacher=c(1,-1,.5,.5),
Mammen=c((1+s)/2, (1-s)/2, (5-s)/10, (5+s)/10))
xi <- sample(xi[1:2], size=length(u), replace=TRUE, prob=xi[3:4])
xi*u
}
# restricted model
fr <- wage ~ male + married2 + age + education
fitr <- lm(fr, dat)
utilde <- residuals(fitr) # restricted residuals
btilde <- coef(fitr) # restricted beta
fit2 <- lm(wageStar ~ male + married2 + age + education + male*married2, dat)
s <- summary(fit2)
co <- s$coefficients[,1]
se <- s$coefficients[,2]
c(co, se)
}
# bootstrap
fitWB <- boot(dat, ttest, R=1000)
## male:married2
## -1.221308
## [1] 0.507
The p-value of the Bootstrap T-test using the restricted wild bootstrap with the Rademacher distribution
is 0.507 which is greater than the 5% significance level (α = 5%). Therefore, we cannot reject the null
hypothesis H0 : β5 = 0.
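A minimal sketch of the restricted wild bootstrap with Rademacher weights, on simulated data with hypothetical names. The restricted model imposes the null hypothesis, its residuals are multiplied by random signs, and the regressors are kept fixed in every replication:

```r
set.seed(7)
n <- 100
x <- rnorm(n); z <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n) * (1 + abs(x)) # heteroskedastic errors, H0 true

t_of_z <- function(yy) {
  s <- summary(lm(yy ~ x + z))$coefficients
  unname(s["z", 1] / s["z", 2])
}
t_obs  <- t_of_z(y)
fit_r  <- lm(y ~ x)                        # restricted model: coef on z set to 0
u_r    <- residuals(fit_r)                 # restricted residuals
yhat_r <- fitted(fit_r)                    # restricted fitted values

t_star <- replicate(999, {
  v <- sample(c(-1, 1), n, replace = TRUE) # Rademacher weights
  t_of_z(yhat_r + v * u_r)                 # new response, same regressors
})
p_wild <- mean(abs(t_star) > abs(t_obs))
p_wild
```

Because the bootstrap responses are generated under the null, the bootstrap statistics are not re-centred, unlike in the pairs bootstrap.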
Conclusion
The p-values of the three tests are greater than the 5% significance level. Therefore, we cannot reject the
null hypothesis that if we hold age and education constant, the gender gap for married and non-married
workers is equal.
4. Consider the same model. We want to test the joint hypothesis H0 : β1 = 1, β3 = 5β4, β5 = 0 at the 5% level.
For each of the following tests, compute the p-value and interpret your result:
Wald test using the asymptotic distribution
The p-value of the Wald-test with asymptotic distribution is 4.792e-06 which is lower than the 5% significance
level (α = 5%). We, therefore, reject the null hypothesis H0 : β1 = 1, β3 = 5β4 , β5 = 0.
Bootstrap Wald test using pairs bootstrap
# R matrix
R <- rbind(c(0,1,0,0,0,0), c(0,0,0,1,-5,0), c(0,0,0,0,0,1))
# c vector
c <- c(1,0,0)
# Wald stat for the model
(W <- Wald(fit_3, R, c))
## [1] 27.42641
# c hat
(c.hat <- c(R%*%coef(fit_3)))
# Bootstrap loop
Ws <- numeric()
for (i in 1:1000){
ind <- sample(nrow(dat), replace=TRUE)
fit2 <- lm(f2, dat[ind,])
Ws[i] <- Wald(fit2, R, c.hat)
}
# p-value
mean(Ws>W)
## [1] 0.003
The p-value of the bootstrap Wald test using the pairs bootstrap is 0.003, which is lower than the 5% significance level (α = 5%). We therefore reject the null hypothesis H0 : β1 = 1, β3 = 5β4, β5 = 0.
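The code above calls a function named `Wald` whose definition is not shown. A minimal sketch of what such a helper might look like follows. The name `wald_stat`, its arguments and the simulated data are assumptions for illustration, not the assignment's implementation:

```r
# Hypothetical helper: Wald statistic for H0: R b = r
wald_stat <- function(fit, R, r, V = vcov(fit)) {
  q <- R %*% coef(fit) - r                 # distance from the null
  c(t(q) %*% solve(R %*% V %*% t(R), q))
}

set.seed(10)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
sim_fit <- lm(y ~ x1 + x2)

R <- rbind(c(0, 1, 0), c(0, 1, -2))        # H0: b1 = 1 and b1 = 2*b2 (true here)
r <- c(1, 0)
W <- wald_stat(sim_fit, R, r)
p_asy <- pchisq(W, df = nrow(R), lower.tail = FALSE)  # df = number of restrictions
c(W = W, p.value = p_asy)
```

In a pairs bootstrap the same helper would be applied to each resampled fit with `r` replaced by `R %*% coef(fit)` from the original sample, as the loop above does with `c.hat`.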
Bootstrap Wald test using the restricted wild bootstrap with the Rademacher distribution.
The restricted model is:
$$wage - (\hat\beta_1 - 1)\,male - (\hat\beta_3 - 5\hat\beta_4)\,age - \hat\beta_5\,male{:}married = \beta_2\,married + \beta_3\,(age - 5\,educ)$$
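As a cross-check on the algebra, the three restrictions can be substituted directly into the unrestricted model. This is a sketch that assumes the unrestricted model is wage = β0 + β1 male + β2 married + β3 age + β4 educ + β5 male:married + u, which is the coefficient ordering implied by the R matrix used for the Wald test:

```latex
\begin{aligned}
wage &= \beta_0 + 1 \cdot male + \beta_2\,married + 5\beta_4\,age + \beta_4\,educ + 0 \cdot male{:}married + u \\
wage - male &= \beta_0 + \beta_2\,married + \beta_4\,(5\,age + educ) + u
\end{aligned}
```

Under this substitution the constructed regressor is 5 × age + educ and the adjusted response is wage − male. This is not the same transformation as the one written above, so the restricted model may be worth re-deriving before relying on the restricted residuals.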
# Bootstrap Wald test using the restricted wild bootstrap with the Rademacher distribution.
# Restricted model
dat2 <- dat
X <- model.matrix(f2, dat2)
cVec <- c(coef(fit_3)[2]-1, coef(fit_3)[4]-5*coef(fit_3)[5], coef(fit_3)[6]) # (beta1-1 , beta3 - 5beta4
dat2$wage <- dat2$wage-c(X[,c(2,4,6)]%*%cVec)
dat2$x6 <- dat2$age - (5*dat2$education)
fr <- wage ~ married2 + x6
utilde <- residuals(fitr <- lm(fr, dat2))
btilde <- coef(fitr) # restricted beta
## [1] 56.30884
# c hat
(c.hat <- c(R%*%coef(fit_2)))
# Bootstrap loop
Ws2 <- numeric()
for(i in 1:1000){
ind <- sample(nrow(dat), replace=TRUE)
ustar <- wBoot(utilde, type="Rademacher")
wageStar <- btilde[1]+btilde[2]*dat$married2+btilde[3]*dat$age-5*btilde[3]*dat$education+ustar
fit2 <- lm(wageStar ~ male + married2 + age + education + male*married2, dat[ind,])
Ws2[i] <- Wald(fit2, R, c.hat)
}
# p-value
mean(Ws2>W2)
## [1] 0.14
The p-value of the bootstrap Wald test using the restricted wild bootstrap with the Rademacher distribution is 0.14, which is greater than the 5% significance level (α = 5%). We therefore cannot reject the null hypothesis H0 : β1 = 1, β3 = 5β4, β5 = 0.
Bootstrap LM test using pairs bootstrap
LMBoot <- function(data, ind, cVec){
data <- data[ind,]
X <- model.matrix(f2, data)
# LM statistic
c <- c(coef(fit_3)[2]-1, coef(fit_3)[4]-5*coef(fit_3)[5], coef(fit_3)[6])
(LM <- LMBoot(dat, 1:nrow(dat), c))
## [1] 53.07328
# boot
resB <- boot(dat, LMBoot, R=1000, cVec=c)
# p-values
mean(resB$t>LM)
## [1] 0.582
The p-value of the bootstrap LM test using the pairs bootstrap is 0.582, which is greater than the 5% significance level (α = 5%). We therefore cannot reject the null hypothesis H0 : β1 = 1, β3 = 5β4, β5 = 0.
Bootstrap LM test using the restricted wild bootstrap with the Rademacher distribution.
# Bootstrap LM test using the restricted wild bootstrap with the Rademacher distribution.
meat <- crossprod(X*utilde^2, X)
c(t(uX)%*%solve(meat, uX))
}
# restricted model
dat2 <- dat
X <- model.matrix(f2, dat2)
cVec <- c(coef(fit_3)[2]-1, coef(fit_3)[4]-5*coef(fit_3)[5], coef(fit_3)[6])
dat2$wage <- dat2$wage-c(X[,c(2,4,6)]%*%cVec)
dat2$x6 <- dat2$age - (5*dat2$education)
fr <- wage ~ married2 + x6
utilde <- residuals(fitr <- lm(fr, dat2))
btilde <- coef(fitr) # restricted beta
## [1] 55.53485
# boot
resB2 <- boot(dat, LMBoot2, R=1000)
# p-value
mean(resB2$t>LM2)
## [1] 0.514
The p-value of the bootstrap LM test using the restricted wild bootstrap with the Rademacher distribution is 0.514, which is greater than the 5% significance level (α = 5%). We therefore cannot reject the null hypothesis H0 : β1 = 1, β3 = 5β4, β5 = 0.
load("Card.rda")
set.seed(21042668)
n <- sample(600:1000, 1)
ind <- sample(nrow(dat), n, replace=FALSE)
dat <- dat[ind,]
You will see that exp76 (years of experience) is not in the dataset. You have to compute it using the formula
exp76 = age76 - ed76 - 6. It is not the actual experience but the potential experience. This is a commonly
used measure of experience in the labour economics literature.
1. Explain what is the issue with OLS when the objective is to estimate the return to education. Then,
explain why the solution proposed by David Card can potentially solve the problem. You can use the article
to answer the question, but you have to explain in your own words. Can you think of a possible reason for
rejecting the validity of the instrument?
The issue with OLS is that it gives biased and inconsistent estimates of the return to education when ed76 is correlated with unobserved determinants of wages (omitted variables) that end up in the error term, in other words when ed76 is an endogenous regressor. David Card suggests using IV estimates instead of OLS to solve this problem, with proximity to a college (nearc4) as the instrument, since IV estimates use only the variation in education that comes from the instrument.
For the instrument to be valid, and to overcome the issue faced with OLS, it needs to be uncorrelated with the error term u and correlated with the endogenous regressor we are interested in, which is ed76.
2. Estimate the model by OLS and interpret the result. Is there a different return to education for black and
non-black workers?
##
## Call:
## lm(formula = log(wage76) ~ ed76 + black + ed76 * black + exp76 +
## I(exp76^2) + reg76r + smsa76r, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.98726 -0.20587 0.00924 0.24768 1.30935
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7047109 0.1488870 31.599 < 2e-16 ***
## ed76 0.0707223 0.0075027 9.426 < 2e-16 ***
## black -0.5335607 0.2328487 -2.291 0.0222 *
## exp76 0.1017337 0.0153676 6.620 6.07e-11 ***
## I(exp76^2) -0.0032597 0.0007883 -4.135 3.87e-05 ***
## reg76r -0.0860909 0.0290279 -2.966 0.0031 **
## smsa76r 0.1548633 0.0302059 5.127 3.58e-07 ***
## ed76:black 0.0254052 0.0177099 1.435 0.1518
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.3954 on 928 degrees of freedom
## Multiple R-squared: 0.2014, Adjusted R-squared: 0.1953
## F-statistic: 33.43 on 7 and 928 DF, p-value: < 2.2e-16
The p-value of the F statistic is less than the 5% significance level. Therefore, the estimated coefficients are
not all equal to zero.
The adjusted R squared of the model is 0.1953, indicating that this model explains 19.53% of variance in
the dependent variable.
The p-values of the t-statistics for ed76, black, exp76 (both polynomial terms), reg76r and smsa76r are all below the 5% significance level, which means that these coefficients are significantly different from zero and that these variables are significant to the model. The interaction between ed76 and black is not significant, as its p-value of 0.1518 is greater than the 5% significance level.
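The difference in the return to education between black and non-black workers is measured by the interaction coefficient ed76:black (0.0254). Since it is not significant (p-value 0.1518), OLS provides no evidence of a different return to education for black and non-black workers. The coefficient on black (-0.53, p-value < 5%) is a shift in the level of log(wage76): at ed76 = 0, log(wage76) is predicted to be 0.53 lower for black workers than for non-black workers.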
Important For the following questions, I want you to estimate the model and perform the tests manually (no
package). Just use the solution from your notes.
3. For this question only, consider the model
log(wage76) = β0 + β1 ed76 + u.
Using the nearc4 as instrument, show that the Wald estimator is the same as the IV estimator.
We are interested in estimating β1 in:
log(wage76) = β0 + β1 × ed76 + u
##
## Call:
## lm(formula = ed76 ~ nearc4, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0289 -2.0289 -0.6475 1.9711 4.3525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.6475 0.1359 100.418 <2e-16 ***
## nearc4 0.3814 0.1621 2.353 0.0188 *
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.266 on 934 degrees of freedom
## Multiple R-squared: 0.005893, Adjusted R-squared: 0.004828
## F-statistic: 5.536 on 1 and 934 DF, p-value: 0.01883
Under the assumption that nearc4 is a valid instrument, the first-stage regression is estimated using the equation (using π instead of β to avoid confusion in interpretation):
ed76 = π0 + π1 × nearc4 + ν
The estimated first-stage regression shows that the instrument nearc4 is significant (the p-value of its t-statistic, 0.0188, is less than the 5% significance level). Therefore, the estimated first-stage equation is:
$$\widehat{ed76} = 13.6475 + 0.3814 \times nearc4$$
The regression's R-squared shows that about 0.59% of the variation in ed76 is explained by the variation in nearc4.
# store the predicted values
ed_pred <- s1_mod$fitted.values
##
## Call:
## lm(formula = log(dat$wage76) ~ ed_pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.02421 -0.25610 0.03158 0.27975 1.29147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.86838 1.14195 1.636 0.1021
## ed_pred 0.32055 0.08206 3.906 0.0001 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4375 on 934 degrees of freedom
## Multiple R-squared: 0.01608, Adjusted R-squared: 0.01502
## F-statistic: 15.26 on 1 and 934 DF, p-value: 0.0001004
$$\widehat{\log(wage76)} = 1.8684 + 0.32055 \times \widehat{ed76}$$
The IV estimator is statistically significant with a p-value of < 5%. The estimate for β1 = 0.32055 suggests
that an increase in ed76 by one unit increases log(wage76) by 0.32055.
# Wald estimator
num <- round(lm(log(dat$wage76)~dat$nearc4)$coefficient[2],3) # regression of Y on Z
den <- round(lm(dat$ed76~dat$nearc4)$coefficient[2],3) # regression of X on Z
num/den # Wald estimator
## dat$nearc4
## 0.32021
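The Wald estimator is equal to 0.32021, which is very close to the IV estimator (0.32055). The small gap comes only from rounding the numerator and denominator to three decimals before dividing (0.122/0.381); without rounding, the two estimators coincide.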
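The equality can be checked numerically with a minimal sketch on simulated data (hypothetical names; a binary instrument `z` stands in for nearc4). The Wald ratio of group-mean differences, the covariance ratio, and the two-stage slope give the same number:

```r
set.seed(8)
n <- 500
z <- rbinom(n, 1, 0.6)                     # binary instrument
a <- rnorm(n)                              # unobserved confounder
x <- 12 + 0.8 * z + a + rnorm(n)           # endogenous regressor
y <- 1 + 0.1 * x + 0.5 * a + rnorm(n)

# Wald estimator: difference in mean y over difference in mean x, by z
wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(x[z == 1]) - mean(x[z == 0]))
# IV estimator as a ratio of covariances
iv   <- cov(y, z) / cov(x, z)
# two-stage version: regress y on the first-stage fitted values
tsls <- unname(coef(lm(y ~ fitted(lm(x ~ z))))[2])
c(wald = wald, iv = iv, tsls = tsls)
```

No rounding is applied before dividing, so the three values agree up to floating-point error.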
Explain the intuition behind the Wald estimator in this particular example.
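The intuition behind the Wald estimator in this example is that the instrument nearc4 is binary, so it splits the sample into two groups (nearc4 = 1 and nearc4 = 0). The Wald estimator divides the difference in average log(wage76) between the two groups by the difference in their average ed76, so the wage difference between the groups is attributed entirely to the difference in education induced by the instrument.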
Can you explain why controlling for nearc4 is not the same as using it as instrument? What is the difference?
Controlling for nearc4 in this model assumes that it has a direct effect on log(wage76) along with the
independent variable, while using it as an instrument assumes that it has no direct effect on the outcome,
but has an effect on the independent variable ed76.
4. Going back to the previous model, estimate it by IV using nearc4 as instrument. Note that ed76 appears
twice in the regression: if ed76 is endogenous, ed76 × black is also endogenous.
We want to estimate the previous model using nearc4 as an instrument. The endogenous variables are ed76 and ed76 * black, and the exogenous variables are black, exp76, exp76^2, reg76r, and smsa76r.
# IV estimations
s1_end2 <- lm((ed76 * black) ~ nearc4 + black + exp76 + I(exp76^2) + reg76r + smsa76r, data=dat)
ed_pred_2 <- s1_end2$fitted.values
##
## Call:
## lm(formula = log(dat$wage76) ~ ed_pred_1 + dat$black + ed_pred_2 +
## dat$exp76 + I(dat$exp76^2) + dat$reg76r + dat$smsa76r)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10121 -0.23423 0.02068 0.25903 1.32463
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.164184 1.442844 2.193 0.028553 *
## ed_pred_1 0.203406 0.074502 2.730 0.006449 **
## dat$black 35.014679 10.044793 3.486 0.000514 ***
## ed_pred_2 -2.732734 0.784615 -3.483 0.000519 ***
## dat$exp76 0.150037 0.053884 2.784 0.005471 **
## I(dat$exp76^2) -0.010716 0.001676 -6.393 2.57e-10 ***
## dat$reg76r -0.117865 0.030429 -3.873 0.000115 ***
## dat$smsa76r NA NA NA NA
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 0.4166 on 929 degrees of freedom
## Multiple R-squared: 0.1125, Adjusted R-squared: 0.1068
## F-statistic: 19.63 on 6 and 929 DF, p-value: < 2.2e-16
log(wage76) = 3.164184+0.203406×ed_pred_1+35.014679×black−2.732734×ed_pred_2+0.150037×exp76−0.010716×exp7
Interpret the coefficients. Do you see a difference between OLS and IV?
The coefficients of the endogenous variables and of the exogenous variables black, exp76, exp76^2, and reg76r are significantly different from zero.
The model estimates that an increase in ed76 by one unit increases log(wage76) by 0.20341 for non-black
workers, and decreases log(wage76) by 2.73 for black workers.
Furthermore, the model suggests that if the worker is black, log(wage76) is likely to increase by 35.015
compared to a non-black worker.
Since the coefficient of the quadratic term exp76^2 is significant (p-value < 0.05), the relationship between exp76 and log(wage76) is not linear but quadratic: the effect of one more year of experience on log(wage76) is 0.150 - 2 × 0.0107 × exp76, that is 0.15 at exp76 = 0 and smaller at higher levels of experience.
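We notice that there is a difference in the significance and values of the model's coefficients between OLS and IV, as well as a difference in goodness of fit: the OLS model has the greater R-squared. Note also that the coefficient on smsa76r is not defined because of singularities. Both sets of first-stage fitted values appear to be built from the same single excluded instrument, nearc4, so the second stage is not identified and these IV estimates should be read with caution.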
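With two endogenous regressors, identification needs at least two excluded instruments. A minimal sketch of a manual IV estimator on simulated data follows. The choice of `z` and `z * black` as instruments is an assumption for illustration (here `z` stands in for nearc4), and all names are hypothetical:

```r
set.seed(9)
n     <- 800
black <- rbinom(n, 1, 0.3)
z     <- rbinom(n, 1, 0.6)                 # stands in for nearc4
a     <- rnorm(n)                          # unobserved confounder
ed    <- 12 + z + 0.5 * z * black + a + rnorm(n)
y     <- 1 + 0.08 * ed - 0.3 * black + 0.02 * ed * black + 0.5 * a + rnorm(n)

X <- cbind(1, ed, black, ed * black)       # regressors (two endogenous columns)
Z <- cbind(1, z,  black, z * black)        # instruments (one per endogenous column)

Xhat <- Z %*% solve(crossprod(Z), crossprod(Z, X))    # first-stage fitted values
b_iv <- solve(crossprod(Xhat, X), crossprod(Xhat, y)) # IV coefficients
u    <- c(y - X %*% b_iv)                  # residuals use the original X, not Xhat

# heteroskedasticity-robust (HC0-type) covariance matrix
bread <- solve(crossprod(Xhat))
V_rob <- bread %*% crossprod(Xhat * u) %*% bread
cbind(estimate = c(b_iv), robust_se = sqrt(diag(V_rob)))
```

Computing the residuals from the original regressors rather than from the fitted values is what distinguishes proper IV standard errors from those printed by a second-stage `lm` call.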
Test the hypothesis that the return to education is the same for black and non-black workers. Use a robust
test. Do you reach the same conclusion as in question 1?
The null hypothesis to be tested is H0 : β2 = β3 = 0.
# R matrix
R <- rbind(c(0,0,1,0,0,0,0),
c(0,0,0,1,0,0,0))
b <- coef(s2_mod2)[1:7]
Vr <- vcovHC(s2_mod2)
q <- R%*%b
(Wr <- t(q)%*%solve(R%*%Vr%*%t(R), q)) # test statistic
## [,1]
## [1,] 11.73401
## the p-values
1-pchisq(c(Wr), 2)
## [1] 0.002831342
The p-value of the robust Wald test is 0.0028 which is less than the 5% significance level. Therefore, we can
reject the null hypothesis that the return to education is the same for black and non-black workers.
In the OLS regression, by contrast, the interaction ed76:black was not significant (p-value 0.1518), so the robust IV test gives stronger evidence of a different return to education for black and non-black workers than OLS did.