Project R


Elisa Centamore — Homework 1

Elisa Centamore — 618089

Quantitative Economics for Business


Homework I
QUESTION I

In order to answer this question, we build a univariate linear regression
model in which the dependent variable (y) is the high school score and the
independent variable (x) is the distance from the closest university (measured
in tens of miles). The model can be summarized by the following population
regression function:

$$\text{ScoreHS}_i = \delta_0 + \delta_1 \text{Distance}_i + u_i$$

where $\delta_0$ is the intercept of the line, $\delta_1$ is its slope and $u_i$
is the error term. In practice, $\delta_0$ and $\delta_1$ are unknown, so we must
use data to estimate both parameters. The resulting estimates are denoted
$\hat{\delta}_0$ and $\hat{\delta}_1$.
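
For reference, the OLS estimators of the two parameters have the familiar closed
form below. The symbols $X_i$ (distance), $Y_i$ (score) and $n$ (sample size) are
notation introduced here for illustration and do not come from the assignment.

```latex
\hat{\delta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\delta}_0 = \bar{Y} - \hat{\delta}_1 \bar{X}
```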

To get a better understanding of the data set, we can plot the data on a
scatterplot and check whether there is a visible relationship between the two
variables. As the fan shape of the scatterplot suggests, our data are
heteroskedastic. Since the standard errors that R reports by default are valid
only under homoskedasticity, we must use heteroskedasticity-robust standard
errors instead.
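
The adjustment can be illustrated with a minimal sketch in Python (not the R
code used for the homework). It compares the homoskedasticity-only standard
error of the slope with the HC1 heteroskedasticity-robust one. The function name
`ols_with_robust_se` and the simulated fan-shaped data are assumptions made
purely for illustration; they are not the homework data set.

```python
import math
import random


def ols_with_robust_se(x, y):
    """Fit y = a + b*x by OLS; return (a, b, se_homoskedastic, se_hc1) for b."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Homoskedasticity-only SE: a single error variance for all observations.
    s2 = sum(u ** 2 for u in resid) / (n - 2)
    se_homo = math.sqrt(s2 / sxx)

    # HC1 robust SE: each squared residual is weighted by its own (x - x_bar)^2.
    meat = sum((xi - x_bar) ** 2 * u ** 2 for xi, u in zip(x, resid))
    se_hc1 = math.sqrt(n / (n - 2) * meat / sxx ** 2)
    return intercept, slope, se_homo, se_hc1


# Simulated (hypothetical) data whose error spread grows with x: a "fan" shape.
random.seed(0)
x_sim = [random.uniform(0, 10) for _ in range(500)]
y_sim = [50 - 0.25 * xi + random.gauss(0, 1 + xi) for xi in x_sim]

a, b, se_homo, se_hc1 = ols_with_robust_se(x_sim, y_sim)
print(f"slope={b:.3f}  SE(homoskedastic)={se_homo:.3f}  SE(HC1)={se_hc1:.3f}")
```

The point estimates are identical under both approaches; only the standard
errors, and therefore the t-statistics, change.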

The estimates are $\hat{\delta}_0 = 51.35$ and $\hat{\delta}_1 = -0.26$
(rounded). Hence, the estimated regression line is

$$\widehat{\text{ScoreHS}}_i = 51.35 - 0.26\,\text{Distance}_i$$

This suggests that there is a negative relationship between score and distance:
as the distance from the closest college increases, high school scores decrease.
To assess whether $\hat{\delta}_1$ is statistically significant, and hence
whether the relationship between the two variables is relevant, we test the null
hypothesis $H_0: \delta_1 = 0$ against the alternative $H_1: \delta_1 \neq 0$.
There are two equivalent ways to carry out the test: the t-statistic approach
and the p-value approach.

The first consists of comparing the absolute value of the t-statistic with the
critical value implied by the chosen significance level. The second consists of
computing the p-value, that is, the probability under the null hypothesis of
obtaining a statistic at least as extreme as the one observed. If the p-value is
smaller than the significance level, we reject the null hypothesis.

In this case, the absolute value of the t-statistic is greater than the critical
value at the 5% significance level (1.96). We therefore reject the null
hypothesis and conclude that the coefficient is statistically different from
zero. Moreover, we can also reject the null at the 1% significance level, since
the absolute value of the t-statistic is greater than the corresponding critical
value (2.58). The p-value approach gives the same result, as the p-value is
smaller than the significance level.
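
The equivalence of the two approaches can be checked with a short Python sketch.
It assumes the large-sample standard normal approximation for the t-statistic;
the function names are illustrative and are not part of the homework code.

```python
import math


def t_statistic(estimate, std_error, null_value=0.0):
    """t-statistic for H0: coefficient = null_value."""
    return (estimate - null_value) / std_error


def two_sided_p_value(t_stat):
    """Two-sided p-value under the standard normal approximation.

    For Z ~ N(0, 1), P(|Z| > |t|) = erfc(|t| / sqrt(2)).
    """
    return math.erfc(abs(t_stat) / math.sqrt(2))


# The critical values quoted in the text map back to their significance levels.
for critical_value in (1.96, 2.58):
    print(critical_value, round(two_sided_p_value(critical_value), 4))
```

A t-statistic whose absolute value exceeds 1.96 therefore always has a p-value
below 0.05, and one that exceeds 2.58 has a p-value below 0.01.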

The estimated value of $\hat{\delta}_1$ is the change in the score associated
with a one-unit increase in distance. Put differently, the score decreases by
0.25752 points on average when the distance increases by one unit (10 miles).
This negative relationship is consistent with the theory that opening local
branches of universities in peripheral areas would improve test scores.

However, distance might not be the only factor affecting scores: there could be
other variables that help explain the score and that we are not considering in
our model. These are called omitted variables and, when they are present, our
estimates suffer from omitted variable bias. An omitted variable must satisfy
two conditions: it must be a determinant of the dependent variable and it must
be correlated with the independent variable of our model. The candidates we
analyzed were education, high school quality and wage.
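
The two conditions can be summarized by the standard large-sample expression for
omitted variable bias. Here $\rho_{Xu}$ denotes the correlation between distance
and the error term (which absorbs the omitted variable), and $\sigma_u$ and
$\sigma_X$ are the standard deviations of the error term and of distance. This
notation is an assumption added for illustration.

```latex
\hat{\delta}_1 \xrightarrow{\;p\;} \delta_1 + \rho_{Xu}\,\frac{\sigma_u}{\sigma_X}
```

If the omitted variable raises scores and is less common far from universities,
then $\rho_{Xu} < 0$ and $\hat{\delta}_1$ is more negative than $\delta_1$.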


“Education” is a variable defined as the number of years of education. Just by
looking at the definition, we can expect a correlation with high school scores,
because schools which last longer give students more time to learn and practice
and cover more topics, factors which lead to higher grades. Schools which last
for fewer years are usually located in places where there is an actual demand
for this type of service, for example from students who prefer working to
continuing their studies. This demand is usually higher in rural places than in
urban ones. Considering that the majority of college campuses are located in big
city centers, we can also infer that there is a correlation between education
and distance. This is confirmed by our analysis, which clearly shows that the
effect of distance captured by $\hat{\delta}_1$ is significantly overestimated
when we do not take education into account. Given this information, we can state
that education is an omitted variable.

Our estimate can also be affected by high school quality. High schools located
closer to universities tend to have better teachers, resources and curricula, so
quality is negatively correlated with distance and positively related to scores.
Omitting high school quality would therefore bias $\hat{\delta}_1$ downward,
overstating the negative effect of distance on test scores. In both cases, the
direction of the bias depends on the sign of the correlation between the omitted
variable and distance and on the sign of its effect on test scores.

Finally, the average wage level has a weakly negative but negligible effect in
our model. Even though one could argue that the state-level average hourly wage
in manufacturing is in some way connected to distance and high school scores,
the correlation is not significant. Thus, we do not treat wage as an omitted
variable for our model.


QUESTION II

We now consider the following linear regression model:

$$\text{ScoreHS}_i = \gamma_0 + \gamma_1 \text{Income}_i + u_i$$

One can expect income to directly affect scores, as greater economic means give
access to after-school tutoring, better learning tools (e.g. computers and
Wi-Fi), higher quality education and private spaces in which to study without
distraction.

Running the linear regression, we obtain the estimates $\hat{\gamma}_0 = 49.90$
and $\hat{\gamma}_1 = 3.43$ (rounded):

$$\widehat{\text{ScoreHS}}_i = 49.90 + 3.43\,\text{Income}_i$$

However, to assess whether $\hat{\gamma}_1$ is statistically significant, and
hence whether the relationship between the two variables is relevant, we test
the null hypothesis $H_0: \gamma_1 = 0$ against the alternative
$H_1: \gamma_1 \neq 0$.

Comparing the t-statistic, computed from the estimate and its standard error,
with the critical value at the 5% significance level leads us to reject the null
hypothesis. The p-value gives the same result, hence the difference is
significant.

Given that our regressor is binary, it is not useful to think of $\hat{\gamma}_1$
as a slope; there is no line. We must interpret it as the coefficient on
$\text{Income}_i$. Therefore $\gamma_1$ is the difference between the conditional
expectations of the score when $\text{Income}_i$ is equal to 1 and when it is
equal to 0:

$$\gamma_1 = E(Y_i \mid \text{Income}_i = 1) - E(Y_i \mid \text{Income}_i = 0)$$

Accordingly, we interpret $\hat{\gamma}_1$ as the difference between the sample
averages of the score in the two groups. The estimated value of $\hat{\gamma}_1$
is 3.42716, which means that students with an income greater than 25,000 USD
score roughly 3.43 points higher on average than students with an income equal
to or less than 25,000 USD. This shows a positive relationship between the
variables, which is also captured by the box plot, which shows a higher sample
average for the high-income group.
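
The equality between the OLS coefficient on a binary regressor and the
difference in group means can be verified with a small Python sketch. The six
scores below are made-up numbers chosen only to make the arithmetic easy to
follow; they are not taken from the homework data.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the OLS fit y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean(values):
    return sum(values) / len(values)


# Hypothetical toy data: 1 = high income, 0 = low income.
high_income = [0, 0, 0, 1, 1, 1]
scores = [48.0, 52.0, 50.0, 55.0, 53.0, 54.0]

intercept, slope = ols_simple(high_income, scores)
mean_low = mean([s for s, d in zip(scores, high_income) if d == 0])
mean_high = mean([s for s, d in zip(scores, high_income) if d == 1])

# The intercept is the low-income mean; the slope is the gap between the means.
print(intercept, mean_low)
print(slope, mean_high - mean_low)
```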

Given this regression, we can conclude that the previously estimated coefficient
on distance is affected by omitted variable bias, because income affects the
score and is also related to the distance from colleges, since colleges are
usually located in more expensive areas. The effect of income was shown to be
statistically significant. Put differently, the OLS estimate suggests that
distance decreases high school scores, but the effect of distance is
overestimated, as it also captures the effect of students having different
economic backgrounds.

Another way of structuring our model would be to create two dummy variables by
splitting the income variable. For instance, one dummy would be equal to 1 when
income is low and 0 otherwise, and the other would be equal to 1 when income is
high and 0 otherwise. In this case, however, we would fall into the dummy
variable trap: the two dummies sum to one for every observation, so there is an
exact linear relationship between them and the intercept (perfect
multicollinearity), and the same information is provided by the intercept and by
one of the variables. There are two ways to solve this problem: either we remove
the intercept and keep the two dummies, or we remove one of the two dummies and
keep the intercept. Here the best way to proceed is to remove one of the
dummies, whose group is then represented by the intercept. If we were to remove
the intercept, which is the mean value of the response variable when all the
predictors in the model are equal to zero, we would affect the measures of fit
of the model, such as $R^2$.
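
The dummy variable trap can be made concrete with a short Python sketch. With an
intercept and both dummies, the matrix X'X is singular, so the OLS normal
equations have no unique solution. The five observations are hypothetical and
the helper names `gram` and `det` are illustrative.

```python
def gram(columns):
    """Return X'X for a design matrix supplied as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


def det(matrix):
    """Determinant by Laplace expansion (adequate for very small matrices)."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for j, value in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * value * det(minor)
    return total


# Hypothetical sample of five students.
high = [0, 0, 1, 1, 1]
low = [1 - h for h in high]      # low + high = 1 for every observation
const = [1] * len(high)          # the intercept column

det_trap = det(gram([const, low, high]))   # intercept plus both dummies
det_ok = det(gram([const, high]))          # intercept plus one dummy

print(det_trap, det_ok)
```

A zero determinant in the first case signals perfect multicollinearity; dropping
one dummy restores an invertible X'X.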


QUESTION III

We now consider a multivariate regression model with two independent variables,
distance and income:

$$\text{ScoreHS}_i = \beta_0 + \beta_1 \text{Distance}_i + \beta_2 \text{Income}_i + u_i$$

The estimates of the coefficients $\beta_0$, $\beta_1$ and $\beta_2$ are,
respectively, $\hat{\beta}_0 = 50.29$, $\hat{\beta}_1 = -0.20$ and
$\hat{\beta}_2 = 3.34$.

The coefficient on $\text{Distance}_i$ in this model is smaller in absolute value
than in the preceding univariate regression. This is because we have now
introduced a variable that was omitted from the previous model and that is
partly responsible both for the variation in scores and for the distance from
colleges. In fact, our univariate model was biased, causing the effect of a
change in $\text{Distance}_i$ to be overestimated.
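
The link between the two estimates can be made explicit with the standard OLS
decomposition, sketched below under the assumption that both regressions use the
same sample. Here $\hat{\delta}_1$ is the univariate distance coefficient from
Question I, and $\hat{\pi}_1$ is the slope from an auxiliary regression of
$\text{Income}_i$ on $\text{Distance}_i$; the latter is introduced for
illustration and is not reported in the homework.

```latex
\hat{\delta}_1 = \hat{\beta}_1 + \hat{\beta}_2\,\hat{\pi}_1
\quad\Longrightarrow\quad
\hat{\pi}_1 = \frac{\hat{\delta}_1 - \hat{\beta}_1}{\hat{\beta}_2}
\approx \frac{-0.258 - (-0.204)}{3.34} \approx -0.016
```

With the reported (rounded) estimates, the implied $\hat{\pi}_1$ is negative:
high income is less frequent at larger distances, which is why omitting income
made the distance coefficient more negative.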

In this multivariate regression, the interpretation of $\beta_1$ is the same as
in the univariate regression: it captures the effect of a one-unit increase in
distance, which leads to a decrease in the average score of 0.204, ceteris
paribus.


QUESTION IV

By adding two further omitted variables to the model, our estimated multivariate
regression becomes:

$$\widehat{\text{ScoreHS}}_i = 40.89 - 0.15\,\text{Distance}_i + 4.39\,\text{Fcollege}_i + 2.49\,\text{Mcollege}_i$$

As expected, the role of $\text{Distance}_i$ in explaining high school grades is
smaller than in the univariate model. Once again, this is because we have
introduced variables which help determine both the scores and the distance from
universities, and whose omission caused us to overestimate the effect of
distance on scores. In fact, we can expect that more educated parents place more
importance on the education of their children and are more likely to be able to
afford to live in more expensive areas, such as those in which colleges are
located. Also, more educated parents are likely to invest in those areas to
provide a better learning experience for their children.

To check the individual significance of the independent variables
$\text{Fcollege}_i$ and $\text{Mcollege}_i$, we construct two separate null
hypotheses, $H_0: \beta_2 = 0$ and $H_0: \beta_3 = 0$, with alternatives
$H_1: \beta_2 \neq 0$ and $H_1: \beta_3 \neq 0$. We then compare each t-statistic
with the critical value at the 5% significance level. Based on the regression
table, we reject both null hypotheses, meaning that these variables are
statistically significant.

We also conducted a joint hypothesis test because of possible multicollinearity.
If two predictors are correlated (as may be the case here), they can be
insignificant individually but jointly significant. The reason is that
multicollinearity inflates the standard errors of both coefficients, which can
make the individual t-tests insignificant. The null hypothesis constructed in
this test is $H_0: \beta_2 = \beta_3 = 0$; the alternative is $H_1$: either
$\beta_2 \neq 0$ or $\beta_3 \neq 0$ or both.


Testing the null hypothesis that both coefficients are equal to 0 against the
alternative that at least one is different from 0, we conclude that they are
jointly significant, because the F-statistic (181.4) is larger than the critical
value at the 5% significance level with two restrictions (3.00).
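
The critical value of 3.00 can be reproduced with a short Python sketch, assuming
the large-sample distribution of the F-statistic with two restrictions, which is
a chi-squared with 2 degrees of freedom divided by 2. For that distribution the
tail probability has the closed form P(F > f) = exp(-f). The function names are
illustrative.

```python
import math


def f_critical_two_restrictions(alpha):
    """Large-sample critical value for q = 2 restrictions at level alpha.

    Solves exp(-f) = alpha, which holds because F = chi2(2) / 2.
    """
    return -math.log(alpha)


def f_p_value_two_restrictions(f_stat):
    """Large-sample p-value for q = 2 restrictions: P(F > f_stat) = exp(-f_stat)."""
    return math.exp(-f_stat)


print(round(f_critical_two_restrictions(0.05), 2))   # 5% critical value
print(f_p_value_two_restrictions(181.4))             # p-value of the reported F
```

An F-statistic of 181.4 lies far above the 5% critical value, so its p-value is
practically zero.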

When income is also included, the estimated regression becomes:

$$\widehat{\text{ScoreHS}}_i = 49.53 - 0.13\,\text{Distance}_i + 3.80\,\text{Fcollege}_i + 2.23\,\text{Mcollege}_i + 1.75\,\text{Income}_i$$

As the model expands to include more variables which explain the variation in
the score, the effect attributed to $\text{Distance}_i$ decreases. In this case,
income is an omitted variable which is related not only to score and distance
but also to the parents' education, since higher education leads to higher wages
and, in turn, to higher income.

To check the individual significance of $\text{Fcollege}_i$ and
$\text{Distance}_i$, we compare the t-statistic with the critical value, or the
p-value with the significance level. According to our hypothesis tests, we
reject the null hypothesis in both cases, implying that both variables are
individually significant.



QUESTION V

Given that ethnicity is a categorical variable with three categories (Hispanic,
Afro-American and other), we must interpret the coefficients on
$\text{Ethnicity}_{hispanic}$ and $\text{Ethnicity}_{afam}$ as differentials.
They measure the incremental effect of being Hispanic or Afro-American relative
to the base category, which here is other. Ceteris paribus, the difference in
average test scores between students of Hispanic ethnicity and students of
“other” ethnicity is -3.62, and since the absolute value of the t-statistic is
greater than the critical value, this difference is statistically significant.
This tells us that Hispanic students have lower scores on average than students
in the base category.

Extending the model to include ethnicity, we get the following regression:

$$\text{ScoreHS}_i = \beta_0 + \beta_1 \text{Distance}_i + \beta_2 \text{Fcollege}_i + \beta_3 \text{Mcollege}_i + \beta_4 \text{Ethnicity}_{afam,i} + \beta_5 \text{Ethnicity}_{hispanic,i} + u_i$$

To test whether there is a relationship between scores and ethnicity, we test
the joint null hypothesis $H_0: \beta_4 = \beta_5 = 0$ against the alternative
$H_1$: either $\beta_4 \neq 0$ or $\beta_5 \neq 0$ or both.

Comparing the F-statistic obtained from the joint test of the two variables
(F = 314.21) with the critical value at the 5% significance level with two
restrictions (3.00), we reject the null hypothesis and conclude that ethnicity
is significant.


Based on the F-statistic, we can construct a confidence set, which is the
analogue for several coefficients of the confidence interval for a single
coefficient. A 95% confidence set consists of all the combinations of
coefficient values for which we cannot reject the joint null hypothesis using an
F-test at the 5% level. The resulting ellipse is centered on the point defined
by the two coefficient estimates, which in this case is (-7.42, -4.42).

As we can observe, the point (0, 0) does not lie in the confidence set, hence we
reject the null hypothesis $H_0: \beta_4 = \beta_5 = 0$.
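
In symbols, and assuming the large-sample critical value of 3.00 for two
restrictions, the 95% confidence set can be written as follows. The notation
$\hat{V}$ for the estimated covariance matrix of $(\hat{\beta}_4, \hat{\beta}_5)$
is introduced here for illustration.

```latex
\left\{ (b_4, b_5) :
\frac{1}{2}
\begin{pmatrix} \hat{\beta}_4 - b_4 \\ \hat{\beta}_5 - b_5 \end{pmatrix}^{\!\top}
\hat{V}^{-1}
\begin{pmatrix} \hat{\beta}_4 - b_4 \\ \hat{\beta}_5 - b_5 \end{pmatrix}
\le 3.00 \right\}
```

Evaluating the left-hand side at $(b_4, b_5) = (0, 0)$ gives the F-statistic of
the joint test, so the origin lies outside the ellipse exactly when that
statistic exceeds 3.00.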
