Homework 1

Instructor: Marco Martinez

29-03-2023

Instructions
The homework includes five questions about the linear regression model. Most of the answers require statistical
analyses conducted in R, although some answers can be complemented by theory. The data for all
questions are contained in the R package “AER” and can be loaded using the commands:
library(AER)
data("CollegeDistance")

Please submit your work no later than the 15th of April 2023 to ensure consideration. To do so, upload the R
script and a short document with the answers and the regression outputs on Moodle or, if this does not work
for you, send them to me at marco.martinez@santannapisa.it. Results will be uploaded no later than
the 22nd of April 2023 on Moodle. The grade ranges from 0 to 17: 15 is the maximum regular grade, and up to 2
additional points can be earned. This grade will be added to your grade for the second homework once you complete it,
for a total of 30 points plus additional points. To pass this homework, you need to score at least 9 points.

A campus you can call home?


Suppose that you work in a city council and that you want to raise the final high-school test scores
of students from disadvantaged socio-economic backgrounds. You are conducting an exploratory analysis to
decide whether opening local branches of Universities in peripheral areas of your town would help. To this
end, you want to study the effect of closer geographical proximity to a University on final-year
high-school grades. The underlying idea is that high-school students' access to the educational facilities of
Universities (such as museums and libraries) may also affect the preparation of students who live
near such places of learning. The dataset comes from the High School and Beyond survey conducted by
the U.S. Department of Education in 1980, with a follow-up in 1986. The survey included students from
approximately 1,100 high schools across the country. As background information, note that the U.S. has
competitive “college” (approximately, University) admissions, so a higher final school grade is positively
correlated with the chances of getting into a better University later on. As a result, family resources are likely to
be devoted to ensuring high final-year exam scores. Also, different groups of students might have different levels of
motivation to make the extra effort required to obtain the higher final grades needed to get into University.
The dataset contains information about 4,739 individuals. The dataset has 14 variables:
• gender: the binary gender of surveyed individuals (dummy)
• ethnicity: ethnicity (categorical: African-American, Hispanic, or other)
• score: test score given to high school seniors in the sample.
• fcollege: is the father a college graduate? (dummy)
• mcollege: is the mother a college graduate? (dummy)
• home: does the family own their home? (dummy)
• urban: is the school located in an urban area? (dummy)
• unemp: county unemployment rate in 1980.
• wage: hourly wage in manufacturing in 1980 (state by state)
• distance: distance from closest community college (in 10 miles). In this homework we can think of the
distance variable as the distance from the closest University.
• tuition: average University tuition (in 1,000 USD)
• education: number of years of education.
• income: is the family income above USD 25,000 per year? (dummy)
• region: factor indicating U.S. region (dummy: west or other)
Take your time to explore the dataset.
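A first exploration could look like the sketch below. It is only one possible starting point, not a required procedure: the choice of summaries (group means by income and ethnicity, and the correlation between distance and score) is an assumption about what is useful for the questions that follow.

```r
library(AER)
data("CollegeDistance")

# Variable names, types, and factor levels
str(CollegeDistance)

# Summary statistics for every variable
summary(CollegeDistance)

# Mean test score by income group and by ethnicity
aggregate(score ~ income, data = CollegeDistance, FUN = mean)
aggregate(score ~ ethnicity, data = CollegeDistance, FUN = mean)

# Raw correlation between distance and test score
cor(CollegeDistance$distance, CollegeDistance$score)
```

Checking the factor levels with str() early on helps later, because the coefficient names in regression outputs are built from the variable name plus the level name.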
Rouse (1995) computed years of education by assigning 12 years to all members of the senior class. Each
additional year of secondary education counted as one year. Students with vocational degrees were assigned
13 years, AA degrees were assigned 14 years, BA degrees were assigned 16 years, those with some graduate
education were assigned 17 years, and those with a graduate degree were assigned 18 years. (Rouse, C.E.
(1995). Democratization or Diversion? The Effect of Community Colleges on Educational Attainment.
Journal of Business & Economic Statistics, 12, 217–224).

Question one [4 points]


Consider the regression

$$\text{ScoreHS}_i = \delta_0 + \delta_1 \text{Distance}_i + u_i$$
where $\text{ScoreHS}_i$ is the final-year test score of high-school seniors (think of it as the equivalent of the Italian Maturità,
the German Abitur, and the English A-levels) and $\text{Distance}_i$ is the distance from the closest University (in 10 miles).
Run a regression to estimate the coefficients $\delta_0$ and $\delta_1$ and report the results.
• Is the estimated coefficient $\delta_1$ statistically different from zero?
• How can you interpret it?
• Can $\delta_1$ be affected by omitted variable bias? Which omitted variables can you think of and why (please provide
at least two examples)? In which direction do you expect the bias to go?
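A minimal sketch of how this regression could be run in R is given below. The object name fit1 is arbitrary, and reporting heteroskedasticity-robust standard errors is an optional extra assumed here, not something the question requires.

```r
library(AER)   # also loads lmtest (coeftest) and sandwich (vcovHC)
data("CollegeDistance")

# OLS regression of the test score on distance (measured in 10 miles)
fit1 <- lm(score ~ distance, data = CollegeDistance)

# Coefficients, standard errors, t-statistics, and p-values
summary(fit1)

# Optional: the same coefficients with heteroskedasticity-robust standard errors
coeftest(fit1, vcov. = vcovHC(fit1, type = "HC1"))
```

The t-statistic and p-value in the distance row are what the first bullet asks about.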

Question two [5 points]


Consider the regression:

$$\text{ScoreHS}_i = \gamma_0 + \gamma_1 \text{Income}_i + u_i$$

$\text{Income}_i$ is a dummy variable taking value 1 if the family income is above 25,000 USD and 0 otherwise.
Run a regression to estimate the coefficients $\gamma_0$ and $\gamma_1$ and report the results.
• Would you expect family income to directly affect high school final test scores? [No R needed]
• Is the estimated coefficient $\gamma_1$ statistically different from zero?
• How can you interpret $\gamma_1$?
• Given this regression, can you conclude that the previously estimated $\delta_1$ suffers from omitted variable
bias? Why?
• Can you model the relationship between $\text{ScoreHS}_i$ and $\text{Income}_i$ differently from the previous
regression, considering that $\text{Income}_i$ is a dummy variable? How? Would you prefer to include the
intercept, or to include both categories of income and drop the intercept? [No R needed]
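The two parameterizations mentioned in the last bullet can be compared with a short sketch like the one below. The object names are arbitrary. In the dataset, income is stored as a two-level factor, so R creates the dummy automatically, with the first factor level as the baseline.

```r
library(AER)
data("CollegeDistance")

# With intercept: the intercept is the mean score of the baseline income group,
# and the slope is the difference in mean scores between the two groups
fit2 <- lm(score ~ income, data = CollegeDistance)
summary(fit2)

# Without intercept: one coefficient per income category (the two group means)
fit2_noint <- lm(score ~ 0 + income, data = CollegeDistance)
summary(fit2_noint)

# Group means computed directly, for comparison with the coefficients above
group_means <- tapply(CollegeDistance$score, CollegeDistance$income, mean)
group_means
```

Both models produce the same fitted values. They differ only in whether the difference between the groups or the level of each group is reported directly. Note that R computes the R-squared differently when the intercept is dropped, so that statistic is not comparable across the two outputs.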

Question three [2 points]


Now consider the multivariate regression

$$\text{ScoreHS}_i = \beta_0 + \beta_1 \text{Distance}_i + \beta_2 \text{Income}_i + u_i$$


Run the regression and report the results.

• How can you explain the change in the coefficient on $\text{Distance}_i$ in this model compared to the
preceding univariate regression?
• How do you interpret the $\beta_1$ coefficient?
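One way to see why the coefficient on distance changes is the in-sample omitted-variable identity: the univariate slope equals the multivariate slope plus the income coefficient times the slope from an auxiliary regression of the income dummy on distance. The sketch below checks this numerically. All object names are illustrative, and the 0/1 recoding assumes that income is a two-level factor whose second level is the high-income group.

```r
library(AER)
data("CollegeDistance")

fit_short <- lm(score ~ distance, data = CollegeDistance)
fit_long  <- lm(score ~ distance + income, data = CollegeDistance)
summary(fit_long)

# 0/1 version of the income factor (1 = second factor level, as in lm's dummy coding)
income_dummy <- as.numeric(CollegeDistance$income) - 1

# Auxiliary regression of the omitted regressor on distance
fit_aux <- lm(income_dummy ~ distance, data = CollegeDistance)

b_short   <- coef(fit_short)[["distance"]]
b_long    <- coef(fit_long)[["distance"]]
b_income  <- coef(fit_long)[[3]]            # coefficient on the income dummy
aux_slope <- coef(fit_aux)[["distance"]]

# Identity: short slope = long slope + (income coefficient) x (auxiliary slope)
c(short = b_short, long = b_long, implied_short = b_long + b_income * aux_slope)
```

The sign of the product of the income coefficient and the auxiliary slope gives the direction in which the univariate estimate differs from the multivariate one.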

Question four [4 points]


Distance proxies for an otherwise unobserved exposure to a high-quality educational environment. Perhaps
we can do better and consider the education of arguably the most influential individuals in affecting students’
outcomes: parents.
Using the variables fcollege and mcollege, two dummy variables indicating higher (1) or lower (0)
parental education, estimate the following model:

$$\text{ScoreHS}_i = \beta_0 + \beta_1 \text{Distance}_i + \beta_2 \text{Fcollege}_i + \beta_3 \text{Mcollege}_i + u_i$$


Run the regression and report the results.
• What is the role of $\text{Distance}_i$ in explaining high school grades in this case?
• Are $\text{Fcollege}_i$ and $\text{Mcollege}_i$ individually significant? Are they jointly significant? [Hint: test the
restrictions in linearHypothesis using the names fcollegeyes and mcollegeyes.]
• Now also include $\text{Income}_i$ in the model. What happens to $\beta_1$?
• Are $\text{Fcollege}_i$ and $\text{Distance}_i$ individually significant?
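The joint test from the hint could be set up as in the sketch below. The coefficient names fcollegeyes and mcollegeyes come from the hint. The object names and the use of update() to add income are illustrative choices.

```r
library(AER)   # also loads car, which provides linearHypothesis()
data("CollegeDistance")

fit4 <- lm(score ~ distance + fcollege + mcollege, data = CollegeDistance)

# Individual significance: t-tests in the coefficient table
summary(fit4)

# Joint significance: F-test of the two restrictions
# fcollegeyes = 0 and mcollegeyes = 0
joint_test <- linearHypothesis(fit4, c("fcollegeyes = 0", "mcollegeyes = 0"))
joint_test

# Same model with income added, to see how the distance coefficient changes
fit4b <- update(fit4, . ~ . + income)
summary(fit4b)
```

The individual t-tests and the joint F-test answer different questions, so it is worth reporting both.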

Question five [2 points]


Estimate the model and report the results:

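$$\begin{aligned}
\text{ScoreHS}_i = {} & \beta_0 + \beta_1 \text{Income}_i + \beta_2 \text{Distance}_i + \beta_3 \text{Ethnicity}_i + \beta_4 \text{Fcollege}_i + \beta_5 \text{Mcollege}_i \\
& + \beta_6 \text{Home}_i + \beta_7 \text{Urban}_i + \beta_8 \text{Unemployed}_i + \beta_9 \text{Tuition}_i + \beta_{10} \text{Region}_i + u_i
\end{aligned}$$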
Run the regression.
• Keeping all other factors fixed, what is the difference in test scores between individuals of Hispanic
ethnicity and of “other” ethnicity? Is this difference statistically significant?
• Extend the model of question four to allow the high school test score to also depend on ethnicity and
formulate the null hypothesis that test scores do not depend on ethnicity (in words).
• Test the null hypothesis formulated above against a two-sided alternative and construct a 95% confidence
interval. [Note that the two levels of ethnicity included in the regression output are ethnicityafam
and ethnicityhispanic.] What do you conclude?
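A possible setup for these steps is sketched below. The coefficient names ethnicityafam and ethnicityhispanic are taken from the note above. The object names, and the reading of "the model of question four" as the specification with distance, fcollege, and mcollege, are assumptions.

```r
library(AER)
data("CollegeDistance")

# Full model of question five
fit5 <- lm(score ~ income + distance + ethnicity + fcollege + mcollege +
             home + urban + unemp + tuition + region,
           data = CollegeDistance)
summary(fit5)

# 95% confidence interval for the Hispanic vs. "other" difference
ci_hisp <- confint(fit5, "ethnicityhispanic", level = 0.95)
ci_hisp

# Model of question four extended with ethnicity
fit4e <- lm(score ~ distance + fcollege + mcollege + ethnicity,
            data = CollegeDistance)

# Joint test that neither ethnicity coefficient differs from zero
eth_test <- linearHypothesis(fit4e, c("ethnicityafam = 0", "ethnicityhispanic = 0"))
eth_test

# 95% confidence intervals for the two ethnicity coefficients
confint(fit4e, c("ethnicityafam", "ethnicityhispanic"), level = 0.95)
```

Because ethnicity enters as a three-level factor, the hypothesis that test scores do not depend on ethnicity involves two restrictions at once, which is why a joint test is used here instead of two separate t-tests.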
