This document contains information about a problem set on econometrics including two wage regressions and questions about the regressions. It also includes a section on modeling sources of terrorism using cross-country data where students are asked to generate new variables, estimate regressions, and answer questions about the results. Tables with variable definitions and an outline of regressions to estimate are provided.
This document contains information about a problem set on econometrics including two wage regressions and questions about the regressions. It also includes a section on modeling sources of terrorism using cross-country data where students are asked to generate new variables, estimate regressions, and answer questions about the results. Tables with variable definitions and an outline of regressions to estimate are provided.
This document contains information about a problem set on econometrics including two wage regressions and questions about the regressions. It also includes a section on modeling sources of terrorism using cross-country data where students are asked to generate new variables, estimate regressions, and answer questions about the results. Tables with variable definitions and an outline of regressions to estimate are provided.
Problem Set 5 Introduction to Econometrics Profs. Seyhan Erden Arkonac and Miikka Rokkanen for all sections
Part I Consider the following results for a wage regression where lwage is the natural log of average hourly earnings in US dollars, age is in years, female is a binary variable for gender, bachelor is one for someone with bachelor degree and zero otherwise, femxbac and femxage are self explanotary interaction variables.
Regression 1:
. reg lwage age female bachelor femxbac, r
Linear regression Number of obs = 15316 F( 4, 15311) = 861.64 Prob > F = 0.0000 R-squared = 0.1852 Root MSE = .50507
Using the appropriate regression please answer the following questions
(a) What is the estimated average wage difference between females with bachelor degrees and males with bachelor degree? Explain your answer. (b) What is the estimated average wage difference between females with bachelor degrees and females without bachelor degree? Explain your answer. (c) How would you test if there is a significance difference in wages of females with bachelor degrees and males with bachelor degrees? Please write STATA commands necessary, you can use any method you want, but you must write the null hypothesis first. (d) How would you test if there is an intercept and if there is a slope difference in the two estimated regression lines for males and females in regression 2? Write the null hypothesis for each test. (e) As people get older, does the wage gender gap widen? Draw a sketch graph (with wage on vertical axis and age on horizontal axis) of what this result tells you. (f) Is the coefficient of female in Regression 2, statistically significant at 1% significance level? Please write the null hypothesis and calculate the test statistic before you answer this question. Also make sure to give your reason for your answer.
Part II 1. What are the root causes of terrorism? Poverty? Repressive political regimes? Religious or ethnic conflicts arising from heterogeneous populations? In this problem set you will take a look at some empirical evidence on cross-country sources of terrorism. Variables in the data set, terrorism.dta, are defined in Table 1. Note that, to do this problem set, you will need to create (generate) some new variables, which are functions of the variables in terrorism.dta. Preliminary data analysis: (a) Produce the scatterplot of ftmpop vs. gdppc. (b) Generate the variables lnftmpop = log(ftmpop) vs. lngdppc = log(gdppc). Produce the scatterplot of lnftmpop vs. lngdppc. (c) Produce the scatterplot of lnftmpop v. lackpf. (d) Using the scatterplots from (a) and (b), would you suggest using the variables (i) ftmpop and gdppc or (ii) lnftmpop and lngdppc for modeling using linear regression? (e) Using the scatterplot from (c), does the relation between lnftmpop and lackpf appear to be linear or nonlinear? If nonlinear, what sort of nonlinear curve might you want to explore (briefly explain)?
2. Estimate the regressions in Table 2 and fill in the empty entries. You may write in the entries by hand or type them using the .doc electronic version of the table on the course Web site. Note: for these regressions, only use the countries that have nonzero values of ftmpop.
3. Use the results in Table 2 to answer the following questions. (a) Using regression (1), test the hypothesis that the coefficient on lngdppc is zero, against the alternative that it is nonzero, at the 5% significance level. Explain in words what the coefficient means. (b) Using regression (3), test the hypothesis that the coefficients on lngdppc and lngdppc 2 are both zero, against the alternative that one or the other coefficient is nonzero, at the 5% significance level. (c) Explain why the conclusions in (a) and (b) differ. (d) Using regression (3), is there evidence that the relationship between lnftmpop and lngdppc is nonlinear? (e) Using regression (3), is there evidence that the relationship between lnftmpop and lackpf is nonlinear? (f) Using regression (5), test the null hypothesis (at the 5% significance level) that the coefficients on the other regional dummies all are zero, against the alternative hypothesis that at least one is nonzero. What is number of restrictions q in your test? What is the critical value of your test? (g) Using regression (4), discuss the evidence that ethnic diversity is associated with increases in terrorism, holding constant GDP per capita, religious diversity, and a measure of political freedoms? Interpret the sign of the coefficient on ethnic. (h) Using regression (4), discuss the evidence that religious diversity is associated with increases in terrorism, holding constant GDP per capita, ethnic diversity, and a measure of political freedoms? Interpret the sign of the coefficient on religious. (i) Using regression (4), test the hypothesis that the population coefficients on ethnic and religion are both zero, against the alternative that one or the other coefficient is nonzero. Explain in words what hypothesis you have tested, and what your conclusion is. (j) Can you use regressions (3) and (4) to test the same hypothesis as in (i), this time using the R 2 formula for the homoskedasticity-only F statistic? Explain. (k) Using regression (4), estimate the effect on lnftmpop of changing from lackpf = 7 (extremely limited political freedoms) to lackpf = 5 (some political freedoms), holding constant the values of the other regressors in regression (4). (l) Using regression (4), compute a 95% confidence interval for the effect you estimated in part (k). (m)Using regression (4), plot the estimated relationship between lnftmpop and lackpf (you can do this in STATA or in Excel or by hand). At approximately what value of lackpf is this relationship maximized? (n) In words, briefly describe the relationship you found in part (m).
Table 1 DATA DESCRIPTION, FILE: terrorism.dta Variable Definition Ftmpop Number of fatalities from terrorist incidents in the country, 1998-2004, per million population (U.S. State Department) Evmpop Number of terrorist events in the country, 1998-2004, per million population (U.S. State Department) Gdppc GDP per capita in the country (World Bank) Lackpf Index of the lack of political freedoms (Freedom House), 1-7 scale, 7 = extremely limited political freedoms language Index of linguistic fractionalization (0 to 1 scale, 0 = no fractionalization) Ethnic Index of ethnic fractionalization (0 to 1 scale, 0 = no fractionalization) Religion Index of religious fractionalization (0 to 1 scale, 0 = no fractionalization) mideast, latinam, easteurope, africa, eastasia = 1 if the country is in the indicated region, = 0 otherwise
STATA HINTS STATA command Description generate lngdppc = ln(gdppc) creates a new variable lngdppc that is the natural logarithm of gdppc. gen lngdppc2 = lngdppc* lngdppc creates a new variable that is the square of lgdppc reg yvar xvar1 xvar2 xvar3, robust test xvar2 xvar3 This pair of commands first estimates a regression of yvar against xvar1, xvar2, and xvar3, then tests the joint hypothesis that the coefficients on xvar2 and xvar3 both equal zero, against the hypothesis that at least one does not, using heteroskedasticity-robust standard errors. Note: the test command always refers to the most recently run regression (or estimation) command.
__ __ __ __ ( ) Other regional dummies (latinam, easteurope, africa, eastasia)? No No No No Yes Intercept
( )
( )
( )
( )
( ) F-statistics testing the hypothesis that the population coefficients on the indicated regressors are all zero: lngdppc, lngdppc 2 __ __ ( ) __ __ lackpf, lackpf 2 __ __ ( )
( )
( ) ethnic, religion __ __ __ ( )
( ) Other regional dummies __ __ __ __ ( ) Regression summary statistics 2 R
R 2
SER n
Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated coefficients, and p-values are given in parentheses under F- statistics. The F-statistics are heteroskedasticity-robust. Coefficients are significant at the + 10%, *5%, **1% significance level. The other regional dummies included in regression (5) are latinam, easteurope, africa, and eastasia (the omitted case is Western Europe combined with North America).
4. Create a new binary variable higdppc, which equals one if gdppc is greater than or equal to the median in the data set, and which equals zero otherwise; also create the interaction variables hi_lackf = higdppclackpf and hi_lackf2 = higdppclackpf 2 . Estimate the regressions in Table 3 and fill in the empty entries. You may write in the entries by hand or type them using the .doc electronic version of the table on the course Web site. Note: for all calculations, only use the countries that have nonzero values of ftmpop and non-missing values of gdppc.
5. Regression (6) produces two regression lines, one for higdppc = 0 and one for higdppc = 1. Produce a scatterplot of lnftmpop vs. lackpf, showing the two regression lines. This can be done either by producing two scatterplots, one for higdppc = 0 and one for higdppc = 1, or by combining the two scatterplots into a single graph. Use regression (6) to write out the estimated regression lines for the two groups (in slope-intercept form).
6. Use the results in Table 3 to answer the following questions. (a) Is the difference between the two slopes plotted in the scatterplot of Question 5 statistically significantly different from zero at the 5% significance level? Explain. (b) In a sentence or two, interpret the sign of the coefficients in regression (6) and the scatterplot(s) in Question 5; that is, explain in everyday terms the findings shown in that scatterplot. (c) Using regression (7), test the hypothesis that the coefficients on higdppc lackpf and higdppc lackpf 2 are zero, against the hypothesis that one or the other (or both) is nonzero. State in words what the hypothesis is that you are testing. (d) Using regression (7), test the hypothesis that the coefficients on lackpf 2 and higdppc lackpf 2 are zero, against the hypothesis that one or the other (or both) is nonzero. State in words what the hypothesis is that you are testing. (e) Are the coefficients on the regional binary variables in regression (9) jointly statistically significant at the 5% significance level? Explain.
7. One theory is that ethnic and religious diversity leads to strife and terrorism when economic resources are poor, but if overall economic conditions are strong then ethnic and religious diversity are more readily tolerated. Do regressions (8) and (9) support this theory? Explain.
8. Write a paragraph summarizing your findings about the relation between terrorist fatalities and economic conditions, political freedoms, and ethnic and religious diversity. Your discussion should be based on your results in Table 2 and Table 3 (that is, on the empirical evidence, not your opinions), however you do not need to refer to specific regressions.
9. Assess the following threats to the internal validity of your analysis in Question 8. Specifically, in your judgment are the following threats plausibly important in this application? Why or why not? Be brief but specific. (a) Omitted variable bias: (b) Functional form misspecification: (c) Errors-in-variables bias: (d) Sample selection bias: (e) Simultaneous causality bias:
Table 3 Problem Set 5 Determinants of Terrorism, ctd. (6) (7) (8) (9) Dependent variable: lnftmpop lnftmpop lnftmpop lnftmpop Regressor: higdppc
( )
( )
( )
( ) lackpf
( )
( )
( )
( ) lackpf 2
__ ( )
( )
( ) higdppc lackpf
( )
( ) __ __ higdppc lackpf 2
__ ( ) __ __ ethnic __ __ ( )
( ) religion __ __ ( )
( ) higdppc ethnic __ __ ( )
( ) higdppc religion __ __ ( )
( ) Mideast
__ __ __ ( ) Other regional dummies (latinam, easteurope, africa, eastasia)? No No No Yes Intercept
( )
( )
( )
( ) F-statistics testing the hypothesis that the population coefficients on the indicated regressors are all zero: higdppc lackpf, higdppc lackpf 2
( ) Other regional dummies __ __ __ ( ) Regression summary statistics: 2 R
R 2
SER n Notes: Heteroskedasticity-robust standard errors are given in parentheses under estimated coefficients, and p-values are given in parentheses under F- statistics. The F-statistics are heteroskedasticity-robust. Coefficients are individually statistically significant at the + 10%, *5%, **1% significance level. The other regional dummies included in regression (5) are latinam, easteurope, africa, and eastasia (the omitted case is Western Europe combined with North America).
Following questions will not be graded, they are for you to practice and will be discussed at the recitation:
10. US states differ in the generosity of their welfare programs. We here wish to analyze which factors play a role in the level of benefits across different states. The data set TANF2.dta contains data from each of 49 states. The variables in the data set are given in the following table:
Table 3 DATA DESCRIPTION, FILE: TANF2.dta Variable Definition tanfreal States real maximum benefit for single parent with three kids. black Percentage of states population who are African Americans. blue
Dummy variable, equals 1 if state voted Democratic in 2004 presidential election. mdinc States median income. west
= 1 if state is in West = 0 otherwise south = 1 if state is in South. = 0 otherwise. midwest = 1 if state is in Midwest = 0 otherwise northeast = 1 if state is in Northeast =0 otherwise
Use data set TANF2.dta to examine whether Midwest states differ in their welfare programs from other states. To do this, we will use the following regression model:
tanfreal = 0 + 1 black + 2 blue + 3 midwest + 4 (black*midwest) + 5 (blue*midwest) + u
Here, black*midwest is the product of the regressors black and midwest and so forth.
(a) Write the null hypothesis to test whether there is a difference between the welfare programs of Midwest states and all other states, explain. (b) Construct new set of interaction regressors in STATA. Estimate the model above. Write your answer as a regression equation with standard errors in parenthesis underneath each coefficient. Perform the test for the null hypothesis in part (a) with a robust F-test. What is your conclusion? (c) Introduce a new variable nonmidwest = 1 midwest. That is nonmidwest = 1 if a state is not in the Midwest and zero otherwise. Consider the following alternative regression model:
Write up the hypothesis of no differences in welfare programs in terms of 1 .... 6 What is the relationship between the parameters 1 .... 6 in this new model and 1 .... 6 in the previous model? Estimate the model in STATA and write the result in usual regression equation form with standard errors in parentheses underneath coefficients. (d) What happens if you include an intercept 0 in the model in part (c)? Explain.