
Problem set 2 - Answers

1. The answer is (e) because each statement presented is correct.

2. They all describe R² except for (b). R² is not a good statistic for comparing models with
different dependent variables, because the total variation (the denominator of R²) differs when
the dependent variables differ, so the values cannot be meaningfully compared.
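
As an illustration (not part of the original answer), the following Python sketch with made-up data shows that the total variation in the denominator of R² changes when the dependent variable changes (here Y versus log Y), so the two R² values are not on a common scale:

```python
# Illustrative sketch with simulated data: the total sum of squares (the
# denominator of R^2) differs between Y and log(Y), so their R^2 values
# are not comparable.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=100)
Y = np.exp(0.3 * X + rng.normal(0, 0.3, size=100))

def r_squared(y, x):
    """R^2 from a simple bivariate OLS fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

print("TSS for Y:     ", ((Y - Y.mean()) ** 2).sum())
print("TSS for log(Y):", ((np.log(Y) - np.log(Y).mean()) ** 2).sum())
print("R^2 (levels):  ", r_squared(Y, X))
print("R^2 (logs):    ", r_squared(np.log(Y), X))
```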

3. The estimated equation (standard errors in parentheses) is

Ŵi = 7.50 + 1.25 Si
    (0.147) (0.077)

The coefficient on years of schooling shows that, for this sample, hourly pay rises by £1.25
for each additional year of schooling.

To determine the significance of S, we are testing the hypothesis

H0: β1 = 0
H1: β1 ≠ 0

To do this we can use a t statistic, i.e. t = β̂1 / se(β̂1) ~ t(n−2). So

t = 1.25 / 0.077 = 16.23

The 5% two-tailed critical value is t(0.025, 98) = 1.985.

Hence we reject the null hypothesis, because the test statistic lies far in the tail of the
distribution: years of schooling significantly affects wages.
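
For anyone who wants to reproduce these numbers, here is a minimal Python sketch using SciPy; the sample size n = 100 is inferred from the 98 degrees of freedom quoted above rather than stated in the question:

```python
# Reproducing the t test above; n = 100 is inferred from the 98 degrees of freedom.
from scipy import stats

beta1_hat, se_beta1, n = 1.25, 0.077, 100

t_stat = beta1_hat / se_beta1                 # about 16.23
t_crit = stats.t.ppf(1 - 0.025, df=n - 2)     # about 1.985

print(f"t = {t_stat:.2f}, 5% two-tailed critical value = {t_crit:.3f}")
print("Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0")
```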

However, this model is probably mis-specified for several reasons. Firstly, the relationship
between wages and schooling is unlikely to be linear: you would expect that, after a certain
point, each additional year has a smaller effect and the benefits tail off. Secondly, there are
lots of important variables missing from the equation. Years of schooling may not be an
accurate way to measure an individual’s level of education and therefore likely ability. A better
measure is to model the qualifications that they have. Also, wages are influenced by many
different factors, including gender, age, experience, ethnicity, family background, area of
living/working, marital status etc. The simple bivariate regression is forcing all of the
explanatory power of these variables onto the years of schooling variable. Once these other
variables have been controlled for (via multiple regression), you might even find that years of
schooling isn’t actually significant after all.
4. For each model we need to consider the problem of minimising the sum of squared residuals.

i) Yi = β0 + εi for i = 1, …, n
The problem is therefore to

min RSS ≡ min Σ ε̂i² = Σ(Yi − β̂0)²

Finding the value of β̂0 that minimises this function requires differentiation: the minimum
occurs at a turning point, i.e. where the derivative is equal to 0.

∂RSS/∂β̂0 = −2 Σ(Yi − β̂0) = 0

⇒ Σ Yi − Σ β̂0 = 0

⇒ Σ Yi − nβ̂0 = 0

Therefore β̂0 = Σ Yi / n = Ȳ.

The estimate of the constant term in this simple regression is just the sample mean of the
dependent variable. To show that this value does indeed minimise RSS rather than maximise it,
we need to look at the second-order condition (SOC).

∂²RSS/∂β̂0² = 2n > 0

which shows that β̂0 = Ȳ is the value that minimises the sum of squared residuals.
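
As an informal check (not part of the original answer), a short Python sketch with made-up data confirms that the RSS-minimising constant is simply the sample mean of Y:

```python
# Numerical check: minimising RSS(b0) = sum((Y - b0)^2) returns the sample mean of Y.
import numpy as np
from scipy.optimize import minimize_scalar

Y = np.array([3.1, 4.7, 2.8, 5.0, 3.9])   # made-up data

result = minimize_scalar(lambda b0: ((Y - b0) ** 2).sum())
print(result.x, Y.mean())                  # both approximately 3.9
```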

ii) 𝑌𝑖 = 𝛽1 𝑋𝑖 + 𝜀𝑖
Using the same process,

min RSS ≡ min Σ ε̂i² = Σ(Yi − β̂1 Xi)²

∂RSS/∂β̂1 = −2 Σ Xi (Yi − β̂1 Xi) = 0

⇒ Σ Xi Yi − β̂1 Σ Xi² = 0

⇒ β̂1 = Σ Xi Yi / Σ Xi²

Now we check that this is the minimising value of RSS by looking at the second derivative:

∂²RSS/∂β̂1² = 2 Σ Xi² > 0
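
Again as an informal check (not part of the original answer), the closed-form slope for a regression through the origin matches what a least-squares routine returns; the data are made up:

```python
# Numerical check: regression through the origin, beta1 = sum(X*Y) / sum(X^2).
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # made-up data
Y = np.array([2.1, 3.9, 6.2, 7.8])

beta1_formula = (X * Y).sum() / (X ** 2).sum()
beta1_lstsq = np.linalg.lstsq(X.reshape(-1, 1), Y, rcond=None)[0][0]
print(beta1_formula, beta1_lstsq)     # identical up to floating-point error
```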

iii) 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝜀𝑖
This is the kind of simple bivariate regression that we have analysed in the lectures. We have to
minimise the RSS with respect to two parameters.
min RSS ≡ min Σ ε̂i² = Σ(Yi − β̂0 − β̂1 Xi)²

∂RSS/∂β̂0 = −2 Σ(Yi − β̂0 − β̂1 Xi) = 0    (1)

∂RSS/∂β̂1 = −2 Σ Xi (Yi − β̂0 − β̂1 Xi) = 0    (2)

Solving these simultaneously

(1) ⇒ Σ Yi − nβ̂0 − β̂1 Σ Xi = 0

(2) ⇒ Σ Xi Yi − β̂0 Σ Xi − β̂1 Σ Xi² = 0

(1) ⇒ β̂0 = (Σ Yi − β̂1 Σ Xi) / n = Ȳ − β̂1 X̄

(2) ⇒ β̂1 = (Σ Xi Yi − β̂0 Σ Xi) / Σ Xi²

Substituting the expression for β̂0 into the equation for β̂1 gives

β̂1 = (Σ Xi Yi − (Ȳ − β̂1 X̄) Σ Xi) / Σ Xi²

Re-arranging gives

β̂1 (1 − X̄ Σ Xi / Σ Xi²) = (Σ Xi Yi − Ȳ Σ Xi) / Σ Xi²

β̂1 (Σ Xi² − X̄ Σ Xi) / Σ Xi² = (Σ Xi Yi − Ȳ Σ Xi) / Σ Xi²

Therefore

β̂1 = (Σ Xi Yi − Ȳ Σ Xi) / (Σ Xi² − X̄ Σ Xi)

We can go a bit further by using Σ Xi = nX̄:

β̂1 = (Σ Xi Yi − nȲX̄) / (Σ Xi² − nX̄²) = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

To show these values will minimise the RSS we need to show that

∂²RSS/∂β̂0² > 0,   ∂²RSS/∂β̂1² > 0   and   (∂²RSS/∂β̂0²)(∂²RSS/∂β̂1²) > (∂²RSS/∂β̂0∂β̂1)²

∂²RSS/∂β̂0² = 2n > 0,   ∂²RSS/∂β̂1² = 2 Σ Xi² > 0   and   ∂²RSS/∂β̂0∂β̂1 = 2 Σ Xi

To show that the third inequality holds we need

4n Σ Xi² > 4 (Σ Xi)²

which follows because

4n Σ Xi² − 4(Σ Xi)² = 4n Σ Xi² − 4n²X̄² = 4n Σ(Xi − X̄)² > 0.
Therefore, these values minimise the 𝑅𝑆𝑆.
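
As a final informal check of part (iii) (not part of the original answer), the closed-form estimators agree with numpy's least-squares fit on simulated data:

```python
# Numerical check of the part (iii) formulas against numpy's least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=50)
Y = 2.0 + 0.5 * X + rng.normal(size=50)

b1 = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
b0 = Y.mean() - b1 * X.mean()

slope, intercept = np.polyfit(X, Y, 1)     # numpy's own OLS fit
print(b0, b1, "vs", intercept, slope)      # the two pairs should match
```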

5. In (iii) above we determined the OLS estimators for this regression. Using these equations we
need to show that they are unbiased and derive their variances.
a) We'll start with β̂1. Following the steps that we covered in the proof of TR1 in Section 2:

β̂1 = Σ(Xi − X̄)Yi / Σ(Xi − X̄)² = Σ wi Yi,   where wi = (Xi − X̄) / Σ(Xi − X̄)².

Substituting the regression model for Y we get

β̂1 = Σ wi (β0 + β1 Xi + εi) = β0 Σ wi + β1 Σ wi Xi + Σ wi εi = β1 + Σ wi εi

where the final step uses the facts that Σ wi = 0 and Σ wi Xi = 1.

𝐸(𝛽̂1|𝑋) = 𝛽1 + 𝐸(∑ 𝑤𝑖 𝜀𝑖 |𝑋)

The wi are functions of X only, so they are constant in the conditional expectation, and
therefore

E(β̂1|X) = β1 + Σ wi E(εi|X) = β1 + Σ wi · 0 = β1

given the zero conditional mean assumption. This shows that β̂1 is unbiased.

Now for β̂0:

E(β̂0) = E(Ȳ − β̂1 X̄) = E(Ȳ) − E(β̂1) X̄ = E(Ȳ) − β1 X̄

given the unbiasedness result for 𝛽̂1.


𝐸(𝑌̅) = 𝐸(𝛽0 + 𝛽1 𝑋̅ + 𝜀̅) = 𝛽0 + 𝛽1 𝑋̅

Therefore E(β̂0) = β0 + β1 X̄ − β1 X̄ = β0, so β̂0 is also unbiased.
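
A small Monte Carlo sketch (not part of the original answer) illustrates these unbiasedness results: averaging the OLS estimates over many simulated samples recovers the true parameters, here set to β0 = 1 and β1 = 2 purely for illustration:

```python
# Monte Carlo illustration of unbiasedness: the averages of the estimates
# across replications should be close to the true beta0 = 1 and beta1 = 2.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, n, reps = 1.0, 2.0, 40, 5000
X = rng.uniform(0, 10, size=n)            # regressors held fixed across replications

b0s, b1s = np.empty(reps), np.empty(reps)
for r in range(reps):
    eps = rng.normal(0, 1, size=n)        # zero mean, homoscedastic, uncorrelated errors
    Y = beta0 + beta1 * X + eps
    b1s[r] = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
    b0s[r] = Y.mean() - b1s[r] * X.mean()

print(b0s.mean(), b1s.mean())             # approximately 1.0 and 2.0
```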


b) We will use the variance operator (related to the expectations operator) to answer this part.

We again start with β̂1.

var(β̂1|X) = var(β1 + Σ wi εi | X)

β1 is a non-varying parameter and hence has zero variance, so

var(Σ wi εi | X) = var(w1 ε1 + w2 ε2 + … + wn εn | X)
= w1² var(ε1|X) + w2² var(ε2|X) + ⋯ + wn² var(εn|X) + 2 w1 w2 cov(ε1, ε2 | X) + all other covariance terms
Given the classical assumption of no autocorrelation, every one of these covariances is 0. Hence

var(β̂1|X) = Σ wi² var(εi|X).

Further, given the homoscedasticity assumption that var(εi|X) = σ²,

var(β̂1|X) = σ² Σ wi² = σ² / Σ(Xi − X̄)²

since Σ wi² = Σ(Xi − X̄)² / (Σ(Xi − X̄)²)² = 1 / Σ(Xi − X̄)².

For β̂0, the variance follows along the same lines as in the lecture notes:

var(β̂0|X) = var(Ȳ − β̂1 X̄ | X) = var(Ȳ|X) + X̄² var(β̂1|X) − 2X̄ cov(Ȳ, β̂1|X).

Let’s take each term in turn:

• var(Ȳ|X) = var(β0 + β1 X̄ + ε̄ | X) = var(ε̄|X), given the parameters are constant and the
conditioning on X. Given CLRA4 and CLRA5, var(ε̄|X) = (1/n²) var(Σ εi | X) = (1/n²) Σ var(εi|X) = σ²/n.

• var(β̂1|X) = σ² / Σ(Xi − X̄)², as proved above.

• cov(Ȳ, β̂1|X) = cov(Σ Yj / n, Σ(Xi − X̄)Yi / Σ(Xi − X̄)² | X)
= (1/n) · (1/Σ(Xi − X̄)²) · cov(Σ Yj, Σ(Xi − X̄)Yi | X)
= (1/n) · (1/Σ(Xi − X̄)²) · σ² Σ(Xi − X̄) = 0,
since cov(Yi, Yj|X) = 0 for i ≠ j, var(Yi|X) = σ², and Σ(Xi − X̄) = 0.

Therefore

var(β̂0|X) = σ²/n + X̄² σ² / Σ(Xi − X̄)² = σ² (Σ(Xi − X̄)² + nX̄²) / (n Σ(Xi − X̄)²) = σ² Σ Xi² / (n Σ(Xi − X̄)²)

These variances should tally with those provided in the lecture notes.
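
To see these formulas at work, here is a Monte Carlo sketch (not part of the original answer, with made-up parameter values): the simulated variances of the estimates should be close to σ²/Σ(Xi − X̄)² and σ² Σ Xi²/(n Σ(Xi − X̄)²).

```python
# Monte Carlo check of the variance formulas derived above.
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 1.5, 40, 20000
X = rng.uniform(0, 10, size=n)                        # fixed regressors

Sxx = ((X - X.mean()) ** 2).sum()
var_b1_theory = sigma ** 2 / Sxx
var_b0_theory = sigma ** 2 * (X ** 2).sum() / (n * Sxx)

b0s, b1s = np.empty(reps), np.empty(reps)
for r in range(reps):
    Y = beta0 + beta1 * X + rng.normal(0, sigma, size=n)
    b1s[r] = ((X - X.mean()) * (Y - Y.mean())).sum() / Sxx
    b0s[r] = Y.mean() - b1s[r] * X.mean()

print(b1s.var(), "vs", var_b1_theory)                 # simulated vs theoretical var(beta1_hat)
print(b0s.var(), "vs", var_b0_theory)                 # simulated vs theoretical var(beta0_hat)
```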
