Project R
ScoreHSi = δ0 + δ1Distancei + ui
where δ0 is the intercept of the line, δ1 is its slope, and ui is the error term.
In practice, δ0 and δ1 are unknown. Therefore we must employ data to estimate
both parameters. The results of this estimation will be δ1̂ and δ0̂ .
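As an illustration of how the two parameters are estimated from data, here is a minimal ordinary least squares sketch. The function name `ols_simple` and the five data points are made up for the example and are not the homework data set.

```python
from statistics import mean

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data (NOT the homework data set).
distance = [0.0, 1.0, 2.0, 3.0, 4.0]
score = [52.0, 51.0, 50.5, 50.0, 49.0]
b0, b1 = ols_simple(distance, score)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

On real data the same two formulas yield the estimates of δ0 and δ1; statistical software additionally reports standard errors.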
The estimates are δ0̂ = 51.35 and δ1̂ = −0.26 (rounded). Hence, the estimated model
becomes
ScoreHŜi = 51.35 − 0.26 Distancei
This suggests that there is a negative correlation between score and distance,
or, in other words, that as the distance from colleges increases, high
school scores decrease. To assess whether δ1̂ is statistically significant, and,
hence, whether the relationship between the two variables is relevant, we have to
test the null hypothesis H0 : δ1 = 0 against the alternative H1 : δ1 ≠ 0.
There are two ways to test the null hypothesis: the t-statistic approach and
the p-value approach.
The first consists of comparing the absolute value of the t-statistic to the
critical value implied by the chosen significance level. The second compares
the p-value, that is, the probability under the null hypothesis of obtaining a
t-statistic at least as extreme as the one observed, with the significance
level. If the p-value is smaller than the significance level, we can reject
the null hypothesis.
In this case, the absolute value of the t-statistic we obtained is greater than
the critical value at the 5% significance level (1.96). We therefore reject the
null hypothesis and conclude that δ1̂ is statistically significantly different
from zero. Moreover, we can also reject the null hypothesis at the 1%
significance level, since the absolute value is greater than the corresponding
critical value (2.58). The p-value method gives the same result, as the
p-value is smaller than the significance level.
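The two testing approaches can be sketched as follows, using the large-sample normal approximation. The helper `t_test` is a name chosen for this example, and the standard error of 0.05 is hypothetical, since the text does not report it.

```python
import math

def t_test(estimate, std_error, null_value=0.0):
    """Return (t_statistic, two_sided_p_value), large-sample normal approximation."""
    t_stat = (estimate - null_value) / std_error
    # Two-sided p-value: P(|Z| > |t|) for a standard normal Z.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return t_stat, p_value

# The standard error below is hypothetical; the source does not report it.
t_stat, p_value = t_test(estimate=-0.26, std_error=0.05)
reject_5pct = abs(t_stat) > 1.96   # t-statistic approach, 5% level
reject_1pct = abs(t_stat) > 2.58   # t-statistic approach, 1% level
same_decision = (p_value < 0.05) == reject_5pct  # p-value approach agrees
print(t_stat, p_value, reject_5pct, reject_1pct, same_decision)
```

Both approaches always lead to the same decision at a given significance level, because the p-value falls below that level exactly when the absolute t-statistic exceeds the matching critical value.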
The estimated value of δ1̂ is the change in the score associated with a one-unit
increase in distance. Put differently, the score decreases by 0.25752 on average
when the distance increases by one unit. This implies a negative correlation
between the two variables, lending support to the theory that opening local
branches of universities in peripheral areas would improve test scores.
However, distance might not be the only factor affecting scores: there could be
other variables that help explain scores and that we are not considering in
our model. These are called omitted variables and, when they are present, our
estimates will suffer from omitted variable bias. Omitted variables have
two main characteristics: they must be a determinant of the dependent variable
and they must be correlated with the independent variables of our model. The
ones we analyzed were education, high school quality and wage.
Elisa Centamore — Homework 1
Our estimate can also be affected by high school quality. High schools located
closer to universities tend to have better teachers, resources, and curricula;
therefore, omitting high school quality could bias the estimated coefficient on
distance downward (that is, make it more negative), overstating the magnitude of
the effect of distance on test scores. In both cases, the direction of the bias
depends on the sign and magnitude of the correlation between the omitted
variable and both distance and test scores.
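The dependence of the bias on these signs can be made explicit with the standard large-sample omitted variable bias formula. This is a sketch under assumptions: the symbol Z_i for the omitted variable and δ2 for its coefficient are introduced here for illustration, and the true model is taken to be ScoreHS_i = δ0 + δ1 Distance_i + δ2 Z_i + e_i, with e_i uncorrelated with both regressors.

```latex
\hat{\delta}_1 \;\xrightarrow{\;p\;}\; \delta_1
  + \delta_2 \,\frac{\operatorname{Cov}(Distance_i,\, Z_i)}{\operatorname{Var}(Distance_i)}
```

If the omitted variable raises scores (δ2 > 0) and falls with distance (negative covariance), the second term is negative, so δ1̂ is pushed below the true δ1.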
Finally, the average wage level has a weakly negative but negligible effect in
our model. Even though one could argue that the state-level average hourly wage
in manufacturing is in some way connected to distance and high school
scores, the correlation is not significant. Thus, we do not treat it as an
omitted variable in our model.
QUESTION II
ScoreHSi = γ0 + γ1Incomei + ui
Running the linear regression model, we can see that the estimates are
γ0̂ = 49.90 and γ1̂ = 3.43 (rounded).
The comparison between the absolute value of the t-statistic, computed with the
standard error of γ1̂, and the critical value at the 5% significance level leads
us to reject the null hypothesis that γ1 = 0. The p-value leads to the same
conclusion, hence the difference is significant.
Given that our regressor is binary, it is not useful to think of γ1̂ as a slope:
there is no line. We must interpret it as the coefficient on Incomei. Therefore
γ1 is the difference between the conditional expectations of the score Yi when
Incomei is equal to 1 and when it is equal to 0, or:
γ1 = E(Yi | Incomei = 1) − E(Yi | Incomei = 0). Due to this, we interpret γ1̂ as
the difference between the sample averages of the score in the two groups.
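A quick numerical check of this interpretation: with a 0/1 regressor, the least squares coefficient equals the difference between the average scores of the two groups, and the intercept equals the average of the group coded 0. The data and the helper `ols_simple` are made up for this sketch.

```python
from statistics import mean

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: income = 1 for the high-income group, 0 otherwise.
income = [0, 0, 0, 1, 1, 1, 1]
score = [48.0, 50.0, 52.0, 51.0, 53.0, 54.0, 56.0]

intercept, slope = ols_simple(income, score)
mean_low = mean([s for d, s in zip(income, score) if d == 0])
mean_high = mean([s for d, s in zip(income, score) if d == 1])

# The coefficient on the dummy reproduces the gap between group averages.
print(slope, mean_high - mean_low)
```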
The estimated value of γ1̂ is 3.42716, which means that students with an
income greater than 25,000 USD are expected to score roughly 3.43 points
higher on average in the high school finals than students with an income equal
to or less than 25,000 USD. This shows a positive
correlation between the variables, as illustrated by the box plot below,
which shows higher scores for the class Incomehigh.
QUESTION III
The estimated effect of Distancei in this model is smaller in magnitude than in
the preceding univariate regression. This is because we have now introduced a
variable, omitted from the previous model, that is partly responsible for the
variation of scores and is also related to the distance from colleges. Our
univariate model therefore suffered from omitted variable bias, which caused
the effect of a change in Distancei to be overestimated.
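The mechanism can be reproduced on constructed data. In the sketch below, `z` stands for a hypothetical added regressor (the text does not name it here); the score is built with a true distance effect of -0.1 and a `z` effect of 3, and `z` tends to be lower where distance is higher.

```python
from statistics import mean

def centered_sums(a, b):
    """Sum of cross-products of deviations from the means."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))

# Constructed, hypothetical data: no noise, so the full model fits exactly.
distance = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
z = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
score = [50.0 - 0.1 * d + 3.0 * zi for d, zi in zip(distance, z)]

s11 = centered_sums(distance, distance)
s22 = centered_sums(z, z)
s12 = centered_sums(distance, z)
s1y = centered_sums(distance, score)
s2y = centered_sums(z, score)

# Univariate regression of score on distance only (z omitted).
short_slope = s1y / s11

# Regression on both distance and z (normal equations for two regressors).
det = s11 * s22 - s12 ** 2
long_slope = (s22 * s1y - s12 * s2y) / det
z_coef = (s11 * s2y - s12 * s1y) / det

print(f"distance only:  {short_slope:.3f}")
print(f"distance and z: {long_slope:.3f} (z coefficient {z_coef:.3f})")
```

With `z` omitted, the distance coefficient absorbs part of the effect of `z` and is larger in magnitude; once `z` is included, the coefficient returns to the value used to build the data.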
QUESTION IV
Testing the null hypothesis that both coefficients are equal to 0 against the
alternative (that at least one is different from 0), we can conclude that
they are jointly significant because the F-statistic (181.4) is larger than the
critical value at the 5% significance level with two restrictions (3.00).
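For reference, one common way to compute such a statistic is the homoskedasticity-only F formula based on the R² of the restricted and unrestricted regressions. Whether the reported 181.4 was obtained this way is not stated in the text, and every input below is hypothetical.

```python
def f_statistic(r2_unrestricted, r2_restricted, q, n, k):
    """Homoskedasticity-only F-statistic for q restrictions.

    n is the sample size and k the number of regressors in the
    unrestricted model (excluding the intercept).
    """
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / (n - k - 1)
    return numerator / denominator

# Hypothetical inputs; under the null that all slopes are zero,
# the restricted regression has R^2 = 0.
f_stat = f_statistic(r2_unrestricted=0.10, r2_restricted=0.0, q=2, n=1003, k=2)
critical_5pct = 3.00  # large-sample 5% critical value for q = 2
print(f_stat, f_stat > critical_5pct)
```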
As the model expands to include more variables which explain the variation in
the score, the variation explained by Distancei decreases. In this case, when
we also include income, we include an omitted variable which is not only
related to score and distance, but also to the parents’ education, since a
higher education leads to higher wages which, in turn, lead to higher income.
QUESTION V
To exclude a relationship between scores and ethnicity we will have to test the
joint null hypothesis, which in this case is H0 : β4 = β5 = 0 against the
alternative one that H1 : either β4 ≠ 0 or β5 ≠ 0 or both.
As we can observe, the point (0,0) lies outside the confidence set, hence
we can reject the null hypothesis H0 : β4 = β5 = 0.
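The graphical check is equivalent to a numerical one: a point lies inside a large-sample confidence ellipse for two coefficients when the F-statistic for testing that point does not exceed the critical value. The sketch below assumes a 95% set (critical value 3.00); the function name, the estimates and the covariance matrix are all hypothetical, as the text does not report them.

```python
def in_confidence_set(estimates, cov, point, critical_value=3.00):
    """True if `point` lies inside the confidence ellipse for two coefficients."""
    d1 = estimates[0] - point[0]
    d2 = estimates[1] - point[1]
    (v11, v12), (_, v22) = cov
    det = v11 * v22 - v12 ** 2
    # Quadratic form d' V^{-1} d, written out for the 2x2 case.
    wald = (v22 * d1 ** 2 - 2.0 * v12 * d1 * d2 + v11 * d2 ** 2) / det
    f_stat = wald / 2.0  # divide by the number of restrictions, q = 2
    return f_stat <= critical_value

# Hypothetical estimates of (beta4, beta5) and their covariance matrix.
beta_hat = (1.0, 0.8)
cov_hat = ((0.04, 0.01), (0.01, 0.09))

origin_inside = in_confidence_set(beta_hat, cov_hat, (0.0, 0.0))
print("reject H0: beta4 = beta5 = 0:", not origin_inside)
```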