
Columbia University W3412

Department of Economics Fall 2014



Problem Set 5
Introduction to Econometrics
Profs. Seyhan Erden Arkonac and Miikka Rokkanen
for all sections

Part I
Consider the following results for a wage regression, where lwage is the natural log of average
hourly earnings in US dollars, age is in years, female is a binary variable for gender, bachelor
equals one for someone with a bachelor's degree and zero otherwise, and femxbac and femxage are
self-explanatory interaction variables (female × bachelor and female × age).

Regression 1:

. reg lwage age female bachelor femxbac, r

Linear regression                                      Number of obs =   15316
                                                       F(  4, 15311) =  861.64
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1852
                                                       Root MSE      =  .50507

------------------------------------------------------------------------------
             |               Robust
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0259277   .0014453    17.94   0.000     .0230947    .0287606
      female |  -.2257049    .010984   -20.55   0.000    -.2472348   -.2041749
    bachelor |   .3980052   .0114889    34.64   0.000     .3754857    .4205248
     femxbac |   .1082764   .0165844     6.53   0.000     .0757691    .1407838
       _cons |   1.708492   .0434411    39.33   0.000     1.623342    1.793642
------------------------------------------------------------------------------


Regression 2:

. reg lwage female age bachelor femxage, r

Linear regression                                      Number of obs =   15316
                                                       F(  4, 15311) =  840.78
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.1848
                                                       Root MSE      =   .5052

------------------------------------------------------------------------------
             |               Robust
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |    .331927   .0859364     3.86   0.000     .1634814    .5003726
         age |    .032957   .0019547    16.86   0.000     .0291256    .0367884
    bachelor |   .4437149   .0083212    53.32   0.000     .4274044    .4600254
     femxage |  -.0171776    .002895    -5.93   0.000    -.0228521   -.0115032
       _cons |   1.481794   .0583113    25.41   0.000     1.367497    1.596092
------------------------------------------------------------------------------


Using the appropriate regression, please answer the following questions.


(a) What is the estimated average wage difference between females with bachelor's degrees
and males with bachelor's degrees? Explain your answer.
(b) What is the estimated average wage difference between females with bachelor's degrees
and females without bachelor's degrees? Explain your answer.
(c) How would you test whether there is a significant difference between the wages of females
with bachelor's degrees and males with bachelor's degrees? Please write the necessary STATA
commands. You can use any method you want, but you must write the null hypothesis first.
(d) How would you test whether there is an intercept difference, and whether there is a slope
difference, between the two estimated regression lines for males and females in Regression 2?
Write the null hypothesis for each test.
(e) As people get older, does the gender wage gap widen? Draw a sketch graph (with wage
on the vertical axis and age on the horizontal axis) of what this result tells you.
(f) Is the coefficient on female in Regression 2 statistically significant at the 1% significance
level? Please write the null hypothesis and calculate the test statistic before you answer
this question. Also make sure to give the reason for your answer.
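
For parts (c), (d), and (f), one possible set of STATA commands is sketched below. It assumes the wage data are already in memory with the variable names used in Regressions 1 and 2, and that the commands are run from a do-file (the // comments are not accepted at the interactive prompt). The null hypotheses in the comments are one way to state the tests, not the only acceptable answer.

```stata
* Part (c): among bachelor's degree holders, the female-male difference in
* lwage is beta_female + beta_femxbac, so H0: beta_female + beta_femxbac = 0
regress lwage age female bachelor femxbac, robust
test female + femxbac = 0
lincom female + femxbac      // same hypothesis; also reports estimate and CI

* Part (d): in Regression 2, female shifts the intercept and femxage shifts
* the slope on age
regress lwage female age bachelor femxage, robust
test female                  // H0: no intercept difference
test femxage                 // H0: no slope difference

* Part (f): t statistic from the reported output, and the two-sided 1%
* critical value from the standard normal distribution
display .331927/.0859364
display invnormal(0.995)
```

The lincom command is included because it reports the point estimate and its robust standard error along with the test, which is also what part (a) asks about.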

Part II
1. What are the root causes of terrorism? Poverty? Repressive political regimes? Religious or
ethnic conflicts arising from heterogeneous populations?
In this problem set you will take a look at some empirical evidence on cross-country sources of
terrorism. Variables in the data set, terrorism.dta, are defined in Table 1. Note that, to do this
problem set, you will need to create (generate) some new variables, which are functions of the
variables in terrorism.dta.
Preliminary data analysis:
(a) Produce the scatterplot of ftmpop vs. gdppc.
(b) Generate the variables lnftmpop = log(ftmpop) and lngdppc = log(gdppc). Produce the
scatterplot of lnftmpop vs. lngdppc.
(c) Produce the scatterplot of lnftmpop vs. lackpf.
(d) Using the scatterplots from (a) and (b), would you suggest using the variables (i) ftmpop
and gdppc or (ii) lnftmpop and lngdppc for modeling using linear regression?
(e) Using the scatterplot from (c), does the relation between lnftmpop and lackpf appear to be
linear or nonlinear? If nonlinear, what sort of nonlinear curve might you want to explore
(briefly explain)?
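
The preliminary plots in (a) through (c) could be produced along the following lines. This is a minimal sketch that assumes terrorism.dta is in the current working directory; the variable names are those defined in Table 1 and in part (b).

```stata
use terrorism.dta, clear

* (a) fatalities vs. GDP per capita, in levels
scatter ftmpop gdppc

* (b) logs; ln() returns missing when ftmpop = 0, so those countries
* drop out of the log plots automatically
generate lnftmpop = ln(ftmpop)
generate lngdppc  = ln(gdppc)
scatter lnftmpop lngdppc

* (c) log fatalities vs. the political freedoms index
scatter lnftmpop lackpf
```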

2. Estimate the regressions in Table 2 and fill in the empty entries. You may write in the entries
by hand or type them using the .doc electronic version of the table on the course Web site.
Note: for these regressions, only use the countries that have nonzero values of ftmpop.
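
As an illustration of the workflow for Table 2, the sketch below estimates regression (3), whose regressors (read off the layout of Table 2) are lngdppc, its square, lackpf, and its square. The variable name lackpf2 is a label chosen here for illustration, and the file is assumed to be in the working directory.

```stata
use terrorism.dta, clear
generate lnftmpop = ln(ftmpop)
generate lngdppc  = ln(gdppc)
generate lngdppc2 = lngdppc^2
generate lackpf2  = lackpf^2

* STATA treats missing values as larger than any number, so ftmpop > 0
* alone would not exclude countries with missing ftmpop
regress lnftmpop lngdppc lngdppc2 lackpf lackpf2 ///
    if ftmpop > 0 & !missing(ftmpop), robust

* robust F-statistics for the corresponding rows of Table 2
test lngdppc lngdppc2
test lackpf lackpf2
```

The other columns follow the same pattern, with regressors added or removed as indicated in the table.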

3. Use the results in Table 2 to answer the following questions.
(a) Using regression (1), test the hypothesis that the coefficient on lngdppc is zero, against
the alternative that it is nonzero, at the 5% significance level. Explain in words what the
coefficient means.
(b) Using regression (3), test the hypothesis that the coefficients on lngdppc and lngdppc^2 are
both zero, against the alternative that one or the other coefficient is nonzero, at the 5%
significance level.
(c) Explain why the conclusions in (a) and (b) differ.
(d) Using regression (3), is there evidence that the relationship between lnftmpop and
lngdppc is nonlinear?
(e) Using regression (3), is there evidence that the relationship between lnftmpop and lackpf
is nonlinear?
(f) Using regression (5), test the null hypothesis (at the 5% significance level) that the
coefficients on the other regional dummies are all zero, against the alternative
hypothesis that at least one is nonzero. What is the number of restrictions q in your test?
What is the critical value of your test?
(g) Using regression (4), discuss the evidence that ethnic diversity is associated with
increases in terrorism, holding constant GDP per capita, religious diversity, and a
measure of political freedoms. Interpret the sign of the coefficient on ethnic.
(h) Using regression (4), discuss the evidence that religious diversity is associated with
increases in terrorism, holding constant GDP per capita, ethnic diversity, and a measure
of political freedoms. Interpret the sign of the coefficient on religion.
(i) Using regression (4), test the hypothesis that the population coefficients on ethnic and
religion are both zero, against the alternative that one or the other coefficient is nonzero.
Explain in words what hypothesis you have tested, and what your conclusion is.
(j) Can you use regressions (3) and (4) to test the same hypothesis as in (i), this time using
the R^2 formula for the homoskedasticity-only F statistic? Explain.
(k) Using regression (4), estimate the effect on lnftmpop of changing from lackpf = 7
(extremely limited political freedoms) to lackpf = 5 (some political freedoms), holding
constant the values of the other regressors in regression (4).
(l) Using regression (4), compute a 95% confidence interval for the effect you estimated in
part (k).
(m) Using regression (4), plot the estimated relationship between lnftmpop and lackpf (you
can do this in STATA or in Excel or by hand). At approximately what value of lackpf is
this relationship maximized?
(n) In words, briefly describe the relationship you found in part (m).
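
For parts (k) through (m), the predicted change in lnftmpop when lackpf moves from 7 to 5 in a quadratic specification is b_lackpf*(5 - 7) + b_lackpf2*(5^2 - 7^2) = -2*b_lackpf - 24*b_lackpf2. One way to obtain this estimate together with a robust 95% confidence interval is lincom, sketched below. The name lackpf2 is an illustrative label for the squared term, and the regressor list for regression (4) is read off the layout of Table 2.

```stata
use terrorism.dta, clear
generate lnftmpop = ln(ftmpop)
generate lngdppc  = ln(gdppc)
generate lackpf2  = lackpf^2

regress lnftmpop lngdppc lackpf lackpf2 ethnic religion ///
    if ftmpop > 0 & !missing(ftmpop), robust

* (k) and (l): effect of moving from lackpf = 7 to lackpf = 5, with 95% CI
lincom -2*lackpf - 24*lackpf2

* (m): turning point of the quadratic, -b1/(2*b2); this is a maximum only
* if the estimated coefficient on lackpf2 is negative
nlcom -_b[lackpf]/(2*_b[lackpf2])
```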




Table 1
DATA DESCRIPTION, FILE: terrorism.dta

Variable             Definition
ftmpop               Number of fatalities from terrorist incidents in the country, 1998-2004, per
                     million population (U.S. State Department)
evmpop               Number of terrorist events in the country, 1998-2004, per million population
                     (U.S. State Department)
gdppc                GDP per capita in the country (World Bank)
lackpf               Index of the lack of political freedoms (Freedom House), 1-7 scale, 7 =
                     extremely limited political freedoms
language             Index of linguistic fractionalization (0 to 1 scale, 0 = no fractionalization)
ethnic               Index of ethnic fractionalization (0 to 1 scale, 0 = no fractionalization)
religion             Index of religious fractionalization (0 to 1 scale, 0 = no fractionalization)
mideast, latinam,    = 1 if the country is in the indicated region, = 0 otherwise
easteurope, africa,
eastasia



STATA HINTS

STATA command                          Description
generate lngdppc = ln(gdppc)           Creates a new variable lngdppc that is the natural
                                       logarithm of gdppc.
gen lngdppc2 = lngdppc*lngdppc         Creates a new variable that is the square of lngdppc.
reg yvar xvar1 xvar2 xvar3, robust     This pair of commands first estimates a regression of
test xvar2 xvar3                       yvar against xvar1, xvar2, and xvar3, then tests the joint
                                       hypothesis that the coefficients on xvar2 and xvar3 both
                                       equal zero, against the hypothesis that at least one does
                                       not, using heteroskedasticity-robust standard errors.
                                       Note: the test command always refers to the most
                                       recently run regression (or estimation) command.














Table 2
Determinants of Terrorism

                                   (1)        (2)        (3)        (4)        (5)
Dependent variable:                lnftmpop   lnftmpop   lnftmpop   lnftmpop   lnftmpop

Regressor:
lngdppc
                                   ( )        ( )        ( )        ( )        ( )
(lngdppc)^2                        __         __                    __         __
                                                         ( )
lackpf                             __
                                              ( )        ( )        ( )        ( )
lackpf^2                           __         __
                                                         ( )        ( )        ( )
ethnic                             __         __         __
                                                                    ( )        ( )
religion                           __         __         __
                                                                    ( )        ( )
mideast                            __         __         __         __
                                                                               ( )
Other regional dummies (latinam,   No         No         No         No         Yes
easteurope, africa, eastasia)?
Intercept
                                   ( )        ( )        ( )        ( )        ( )

F-statistics testing the hypothesis that the population coefficients on the indicated
regressors are all zero:
lngdppc, lngdppc^2                 __         __                    __         __
                                                         ( )
lackpf, lackpf^2                   __         __
                                                         ( )        ( )        ( )
ethnic, religion                   __         __         __
                                                                    ( )        ( )
Other regional dummies             __         __         __         __
                                                                               ( )

Regression summary statistics:
Adjusted R^2
R^2
SER
n

Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated
coefficients, and p-values are given in parentheses under F-statistics. The F-statistics are
heteroskedasticity-robust. Coefficients are significant at the +10%, *5%, **1% significance
level. The other regional dummies included in regression (5) are latinam, easteurope, africa,
and eastasia (the omitted case is Western Europe combined with North America).







4. Create a new binary variable higdppc, which equals one if gdppc is greater than or equal to
the median in the data set, and which equals zero otherwise; also create the interaction
variables hi_lackf = higdppc × lackpf and hi_lackf2 = higdppc × lackpf^2. Estimate the
regressions in Table 3 and fill in the empty entries. You may write in the entries by hand or
type them using the .doc electronic version of the table on the course Web site. Note: for all
calculations, only use the countries that have nonzero values of ftmpop and non-missing
values of gdppc.
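
The new variables in Question 4 might be built as sketched below. The median is taken here over all countries with non-missing gdppc; whether it should instead be computed over the estimation sample only is a judgment call that the question leaves open. The regressor list for regression (6) is read off the layout of Table 3.

```stata
use terrorism.dta, clear
generate lnftmpop = ln(ftmpop)

* summarize, detail stores the median in r(p50)
summarize gdppc, detail
generate higdppc = (gdppc >= r(p50)) if !missing(gdppc)

generate hi_lackf  = higdppc*lackpf
generate hi_lackf2 = higdppc*lackpf^2

* regression (6); observations with missing lnftmpop or gdppc drop out
regress lnftmpop higdppc lackpf hi_lackf, robust
```

The "if !missing(gdppc)" qualifier matters: without it, countries with missing gdppc would be coded higdppc = 1, because STATA treats missing values as larger than any number.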

5. Regression (6) produces two regression lines, one for higdppc = 0 and one for higdppc = 1.
Produce a scatterplot of lnftmpop vs. lackpf, showing the two regression lines. This can be
done either by producing two scatterplots, one for higdppc = 0 and one for higdppc = 1, or by
combining the two scatterplots into a single graph. Use regression (6) to write out the
estimated regression lines for the two groups (in slope-intercept form).
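
One way to draw the combined graph in Question 5 is a twoway plot with a separate fitted line for each group, as in the sketch below (run from a do-file, since /// continues a line). Because regression (6) includes higdppc, lackpf, and their interaction, its two implied lines have the same slopes and intercepts as separate OLS fits within each group, which is what lfit draws. The legend labels are illustrative.

```stata
use terrorism.dta, clear
generate lnftmpop = ln(ftmpop)
summarize gdppc, detail
generate higdppc = (gdppc >= r(p50)) if !missing(gdppc)

twoway (scatter lnftmpop lackpf if higdppc == 0) ///
       (scatter lnftmpop lackpf if higdppc == 1) ///
       (lfit lnftmpop lackpf if higdppc == 0)    ///
       (lfit lnftmpop lackpf if higdppc == 1),   ///
       legend(order(1 "higdppc = 0" 2 "higdppc = 1"))
```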

6. Use the results in Table 3 to answer the following questions.
(a) Is the difference between the two slopes plotted in the scatterplot of Question 5
statistically significantly different from zero at the 5% significance level? Explain.
(b) In a sentence or two, interpret the sign of the coefficients in regression (6) and the
scatterplot(s) in Question 5; that is, explain in everyday terms the findings shown in that
scatterplot.
(c) Using regression (7), test the hypothesis that the coefficients on higdppc × lackpf and
higdppc × lackpf^2 are zero, against the hypothesis that one or the other (or both) is
nonzero. State in words what the hypothesis is that you are testing.
(d) Using regression (7), test the hypothesis that the coefficients on lackpf^2 and higdppc ×
lackpf^2 are zero, against the hypothesis that one or the other (or both) is nonzero. State in
words what the hypothesis is that you are testing.
(e) Are the coefficients on the regional binary variables in regression (9) jointly statistically
significant at the 5% significance level? Explain.

7. One theory is that ethnic and religious diversity leads to strife and terrorism when economic
resources are poor, but if overall economic conditions are strong then ethnic and religious
diversity are more readily tolerated. Do regressions (8) and (9) support this theory? Explain.

8. Write a paragraph summarizing your findings about the relation between terrorist fatalities
and economic conditions, political freedoms, and ethnic and religious diversity. Your
discussion should be based on your results in Table 2 and Table 3 (that is, on the empirical
evidence, not your opinions); however, you do not need to refer to specific regressions.

9. Assess the following threats to the internal validity of your analysis in Question 8.
Specifically, in your judgment are the following threats plausibly important in this
application? Why or why not? Be brief but specific.
(a) Omitted variable bias:
(b) Functional form misspecification:
(c) Errors-in-variables bias:
(d) Sample selection bias:
(e) Simultaneous causality bias:

Table 3
Determinants of Terrorism, ctd.

                                        (6)        (7)        (8)        (9)
Dependent variable:                     lnftmpop   lnftmpop   lnftmpop   lnftmpop

Regressor:
higdppc
                                        ( )        ( )        ( )        ( )
lackpf
                                        ( )        ( )        ( )        ( )
lackpf^2                                __
                                                   ( )        ( )        ( )
higdppc × lackpf                                              __         __
                                        ( )        ( )
higdppc × lackpf^2                      __                    __         __
                                                   ( )
ethnic                                  __         __
                                                              ( )        ( )
religion                                __         __
                                                              ( )        ( )
higdppc × ethnic                        __         __
                                                              ( )        ( )
higdppc × religion                      __         __
                                                              ( )        ( )
mideast                                 __         __         __
                                                                         ( )
Other regional dummies (latinam,        No         No         No         Yes
easteurope, africa, eastasia)?
Intercept
                                        ( )        ( )        ( )        ( )

F-statistics testing the hypothesis that the population coefficients on the indicated
regressors are all zero:
higdppc × lackpf, higdppc × lackpf^2    __                    __         __
                                                   ( )
lackpf^2, higdppc × lackpf^2            __                    __         __
                                                   ( )
lackpf, lackpf^2                        __
                                                   ( )        ( )        ( )
higdppc × ethnic, higdppc × religion    __         __
                                                              ( )        ( )
ethnic, religion,
higdppc × ethnic, higdppc × religion    __         __
                                                              ( )        ( )
Other regional dummies                  __         __         __
                                                                         ( )

Regression summary statistics:
Adjusted R^2
R^2
SER
n

Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated
coefficients, and p-values are given in parentheses under F-statistics. The F-statistics are
heteroskedasticity-robust. Coefficients are individually statistically significant at the
+10%, *5%, **1% significance level. The other regional dummies included in regression (9) are
latinam, easteurope, africa, and eastasia (the omitted case is Western Europe combined with
North America).



The following questions will not be graded; they are for you to practice and will be discussed at
the recitation:

10. US states differ in the generosity of their welfare programs. We here wish to analyze which factors
play a role in the level of benefits across different states. The data set TANF2.dta contains data from
each of 49 states. The variables in the data set are given in the following table:

Table 3
DATA DESCRIPTION, FILE: TANF2.dta

Variable     Definition
tanfreal     State's real maximum benefit for a single parent with three kids.
black        Percentage of the state's population who are African American.
blue         Dummy variable, equals 1 if the state voted Democratic in the 2004
             presidential election.
mdinc        State's median income.
west         = 1 if state is in the West, = 0 otherwise
south        = 1 if state is in the South, = 0 otherwise
midwest      = 1 if state is in the Midwest, = 0 otherwise
northeast    = 1 if state is in the Northeast, = 0 otherwise

Use data set TANF2.dta to examine whether Midwest states differ in their welfare programs from other
states. To do this, we will use the following regression model:

$$tanfreal = \beta_0 + \beta_1 black + \beta_2 blue + \beta_3 midwest + \beta_4 (black*midwest) + \beta_5 (blue*midwest) + u$$

Here, black*midwest is the product of the regressors black and midwest and so forth.

(a) Write the null hypothesis to test whether there is a difference between the welfare programs of
Midwest states and all other states. Explain.
(b) Construct a new set of interaction regressors in STATA. Estimate the model above. Write your
answer as a regression equation with standard errors in parentheses underneath each coefficient.
Perform the test for the null hypothesis in part (a) with a robust F-test. What is your conclusion?
(c) Introduce a new variable nonmidwest = 1 − midwest. That is, nonmidwest = 1 if a state is not in the
Midwest and zero otherwise. Consider the following alternative regression model:

$$tanfreal = \gamma_1 nonmidwest + \gamma_2 (black*nonmidwest) + \gamma_3 (blue*nonmidwest) + \gamma_4 midwest + \gamma_5 (black*midwest) + \gamma_6 (blue*midwest) + u$$

Write up the hypothesis of no differences in welfare programs in terms of $\gamma_1, \ldots, \gamma_6$.
What is the relationship between the parameters $\gamma_1, \ldots, \gamma_6$ in this new model and
$\beta_0, \ldots, \beta_5$ in the previous model? Estimate the model in STATA and write the result in the usual
regression equation form with standard errors in parentheses underneath the coefficients.
(d) What happens if you include an intercept $\gamma_0$ in the model in part (c)? Explain.
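
The estimation steps in parts (b) and (c) could be set up roughly as follows. The interaction variable names (blackmid, bluemid, blacknon, bluenon) are labels chosen here for illustration, and TANF2.dta is assumed to be in the working directory. The two test commands show the mechanics of a robust joint F-test; stating and justifying the hypotheses remains part of the exercise.

```stata
use TANF2.dta, clear

* part (b): interactions with midwest, then a robust joint test on the
* three midwest terms
generate blackmid = black*midwest
generate bluemid  = blue*midwest
regress tanfreal black blue midwest blackmid bluemid, robust
test midwest blackmid bluemid

* part (c): separate coefficients for each group, no common intercept
generate nonmidwest = 1 - midwest
generate blacknon   = black*nonmidwest
generate bluenon    = blue*nonmidwest
regress tanfreal nonmidwest blacknon bluenon midwest blackmid bluemid, ///
    noconstant robust
test (nonmidwest = midwest) (blacknon = blackmid) (bluenon = bluemid)
```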

11. SW Exercise 8.2
12. SW Exercise 8.10
13. SW Exercise 9.6
14. SW Empirical Exercise 8.2
15. SW Empirical Exercise 9.2
