Week 1 Review
Thomas T. Yang
Research School of Economics
Australian National University
S2 2023
Review
1 OLS estimator with a single X
2 OLS estimator with multiple X
3 Maximum Likelihood Estimation
Definition
The Ordinary Least Squares (OLS) estimators are defined by
$$\left(\hat{\beta}_0, \hat{\beta}_1\right) := \arg\min_{b_0, b_1} \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2$$
In words:
we look at the rhs as a function of $b_0$ and $b_1$
that function happens to be quadratic
we find the values of $b_0$ and $b_1$ that minimize that function
the values that minimize that function are called the solution
we give the solution a specific name: $\hat{\beta}_0$ and $\hat{\beta}_1$
$$\frac{\partial L}{\partial b_0} = -\sum_{i=1}^{n} 2\left(Y_i - b_0 - b_1 X_i\right),$$
$$\frac{\partial L}{\partial b_1} = -\sum_{i=1}^{n} 2X_i\left(Y_i - b_0 - b_1 X_i\right).$$
Fourth step:
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},$$
$$\left(\bar{Y} - \hat{\beta}_1 \bar{X}\right)\sum_{i=1}^{n} X_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i.$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
The OLS estimator of the slope is equal to the ratio of the sample covariance to the sample variance!
The OLS estimators are functions of the sample data only
Given the sample data $(X_i, Y_i)$ we can first compute the rhs for $\hat{\beta}_1$ and then compute the rhs for $\hat{\beta}_0$
Computer programs such as Stata easily calculate the rhs for you; a sketch is given below
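For concreteness, here is a minimal sketch in Python with numpy (not the Stata workflow the slides refer to), computing both rhs formulas directly on made-up simulated data; all variable names and values are illustrative only:

```python
import numpy as np

# Made-up sample: true intercept 1, true slope 2 (illustrative values)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)

# Slope: ratio of sample covariance to sample variance
beta1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: beta0_hat = Ybar - beta1_hat * Xbar
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)  # close to (1, 2) in this sample
```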
Recall that an estimator $\hat{\theta}$ of $\theta$ is unbiased if
$$E[\hat{\theta}] = \theta$$
Theorem
Under OLS Assumptions 1 through 4a, the OLS estimator
$$\left(\hat{\beta}_0, \hat{\beta}_1\right) := \arg\min_{b_0, b_1} \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2$$
is BLUE (the Best Linear Unbiased Estimator).
The Gauss-Markov theorem provides a theoretical justification for using OLS.
Or simply
$$Y = X\beta + \epsilon,$$
or
$$Y_i = \beta_1 + X_{2i}\beta_2 + X_{3i}\beta_3 + \cdots + X_{ki}\beta_k + \epsilon_i,$$
for $i = 1, 2, \ldots, n$.
Since $X_i' = (1, X_{2i}, \ldots, X_{ki})$ is the $i$-th row of $X$, we can write
$$Y_i = X_i'\beta + \epsilon_i$$
for $i = 1, 2, \ldots, n$. This is just another way to write $Y_i = \beta_1 + X_{2i}\beta_2 + \cdots + X_{ki}\beta_k + \epsilon_i$ for $i = 1, 2, \ldots, n$.
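A small sketch of this notation with hypothetical regressor values: stacking a column of ones and the regressors into $X$ makes the matrix product $X\beta$ reproduce the scalar equation row by row.

```python
import numpy as np

# Hypothetical data: a constant and two regressors stacked into the
# n x k matrix X of Y = X beta + eps (here k = 3)
n = 5
x2 = np.arange(n, dtype=float)             # second regressor
x3 = np.array([1., 0., 1., 0., 1.])        # third regressor
X = np.column_stack([np.ones(n), x2, x3])  # first column of ones for beta_1

beta = np.array([1.0, 2.0, -0.5])
eps = np.zeros(n)                          # no noise, just to check the algebra
Y = X @ beta + eps                         # row i equals beta1 + x2[i]*beta2 + x3[i]*beta3
print(Y)
```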
Estimates
The OLS estimator minimizes the sum of squared errors, that is, it finds $\hat{\beta}$ to solve
$$\min_{b} \sum_{i=1}^{n} \left(Y_i - b_1 - X_{i2}b_2 - \cdots - X_{ik}b_k\right)^2.$$
Let
$$\underset{n \times 1}{e} = \underset{n \times 1}{Y} - \underset{n \times k}{X}\,\underset{k \times 1}{b},$$
that is,
$$e_i = Y_i - b_1 - X_{i2}b_2 - \cdots - X_{ik}b_k.$$
The OLS is equivalent to minimizing
$$\min e'e = \min \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.$$
Note
$$\min e'e = \min\,(Y - Xb)'(Y - Xb).$$
A note on matrix differentiation. For
$$f(b) = \underset{1 \times k}{a'}\,\underset{k \times 1}{b} = a_1 b_1 + a_2 b_2 + \ldots + a_k b_k,$$
we have
$$\underset{k \times 1}{\frac{\partial f(b)}{\partial b}} = \begin{pmatrix} \frac{\partial f(b)}{\partial b_1} \\ \frac{\partial f(b)}{\partial b_2} \\ \vdots \\ \frac{\partial f(b)}{\partial b_k} \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{pmatrix}.$$
Note
$$L = (Y - Xb)'(Y - Xb) = Y'Y - 2Y'Xb + b'X'Xb.$$
By matrix differentiation,
$$\frac{\partial L}{\partial b} = 2X'Xb - 2X'Y.$$
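As a sanity check, this matrix-calculus result can be verified numerically; the sketch below (made-up data) compares $2X'Xb - 2X'Y$ with a central finite-difference approximation of $\partial L / \partial b$:

```python
import numpy as np

# Numerical check of dL/db = 2 X'X b - 2 X'Y on made-up data
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
b = rng.normal(size=3)

grad = 2 * X.T @ X @ b - 2 * X.T @ Y

def L(b):
    e = Y - X @ b
    return e @ e  # (Y - Xb)'(Y - Xb)

# Central finite differences, coordinate by coordinate
h = 1e-6
fd = np.array([(L(b + h * np.eye(3)[j]) - L(b - h * np.eye(3)[j])) / (2 * h)
               for j in range(3)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```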
First order condition:
$$\left.\frac{\partial L}{\partial b}\right|_{b = \hat{\beta}} = 2X'X\hat{\beta} - 2X'Y = 0,$$
$$\Rightarrow \hat{\beta} = \left(X'X\right)^{-1} X'Y.$$
So
$$X'\hat{e} = X'\left(I - X\left(X'X\right)^{-1}X'\right)Y = X'Y - X'X\left(X'X\right)^{-1}X'Y = X'Y - X'Y = 0.$$
The sample mean of $\hat{e}$, $\frac{1}{n}\sum_{i=1}^{n} \hat{e}_i$, is zero because the first column of $X$ is a vector of ones.
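Both facts are easy to confirm numerically. A sketch with simulated data (all values illustrative only):

```python
import numpy as np

# Solve the normal equations and verify X'e_hat = 0
rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (X'X)^{-1} X'Y
e_hat = Y - X @ beta_hat

print(X.T @ e_hat)   # ~ zero vector
print(e_hat.mean())  # ~ 0, since the first column of X is ones
```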
Maximum Likelihood Estimation (Appendix 11.2)
We start by flipping the coin multiple times and recording the outcome of each flip. For simplicity, let's say we flip the coin 10 times and get 7 heads and 3 tails. Now we need to figure out what value of $p$ makes the data we observed (7 heads and 3 tails) the most likely.
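One way to see this is to evaluate the likelihood of 7 heads in 10 flips over a grid of candidate values of $p$; a minimal numpy sketch (the grid is an arbitrary choice):

```python
import numpy as np

# Likelihood of 7 heads in 10 flips as a function of p
p = np.linspace(0.01, 0.99, 99)
likelihood = p**7 * (1 - p)**3   # binomial coefficient omitted: it does not depend on p
print(p[np.argmax(likelihood)])  # 0.7, the sample proportion of heads
```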
Suppose we observe $n$ i.i.d. draws $y_1, y_2, \ldots, y_n$, where $y_i = 1$ for heads and $y_i = 0$ for tails.
We do not know $p$, and we want to estimate $p$ using $y_1, y_2, \ldots, y_n$.
The probability of observing this sample is:
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = P(Y_1 = y_1) \times P(Y_2 = y_2) \times \ldots \times P(Y_n = y_n) = p^{y_1}(1-p)^{1-y_1} \times p^{y_2}(1-p)^{1-y_2} \times \ldots \times p^{y_n}(1-p)^{1-y_n}.$$
Taking logs gives the log likelihood
$$L = \log p \sum_{i=1}^{n} y_i + \log(1-p) \sum_{i=1}^{n} (1 - y_i).$$
First derivative:
$$\frac{dL}{dp} = \frac{1}{p}\sum_{i=1}^{n} y_i - \frac{1}{1-p}\sum_{i=1}^{n}(1-y_i) = \frac{(1-p)\sum_{i=1}^{n} y_i - p\sum_{i=1}^{n}(1-y_i)}{p(1-p)} = \frac{\sum_{i=1}^{n} y_i - np}{p(1-p)}.$$
Setting this to zero yields $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$, the sample proportion of heads.
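A quick numerical check of this closed form on a simulated sample (true $p = 0.4$ is an arbitrary choice): the grid maximizer of the log likelihood matches the sample mean.

```python
import numpy as np

# Check that p_hat = sample mean maximizes the Bernoulli log likelihood
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.4, size=200)  # made-up sample with true p = 0.4

p_hat = y.mean()                    # closed-form MLE from the FOC above

def loglik(p):
    return np.log(p) * y.sum() + np.log(1 - p) * (len(y) - y.sum())

grid = np.linspace(0.01, 0.99, 981)
print(p_hat, grid[np.argmax([loglik(p) for p in grid])])  # nearly identical
```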
Maximum Likelihood Estimation (Appendix 11.2), continued
Poisson distribution:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}.$$
Suppose we observe $n$ i.i.d. draws $x_1, \ldots, x_n$. The log likelihood is
$$L = \sum_{i=1}^{n} \log \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \log \lambda \sum_{i=1}^{n} x_i - n\lambda - \sum_{i=1}^{n} \log(x_i!).$$
Setting $\frac{dL}{d\lambda} = \frac{1}{\lambda}\sum_{i=1}^{n} x_i - n = 0$ yields
$$\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
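A sketch verifying the Poisson result on simulated data (the true $\lambda = 3$ is an arbitrary choice; the $\log(x_i!)$ term is dropped since it does not depend on $\lambda$):

```python
import numpy as np

# The Poisson MLE is the sample mean
rng = np.random.default_rng(4)
x = rng.poisson(lam=3.0, size=500)  # made-up sample with true lambda = 3

lam_hat = x.mean()                  # closed-form MLE

# Log likelihood up to the log(x_i!) constant
def loglik(lam):
    return np.log(lam) * x.sum() - len(x) * lam

grid = np.linspace(0.5, 6.0, 551)
print(lam_hat, grid[np.argmax(loglik(grid))])  # agree up to the grid step
```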
Suppose we observe $n$ i.i.d. draws $x_1, \ldots, x_n$ from the normal distribution $N(\mu, \sigma^2)$. The log likelihood is
$$L = \log f(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \sum_{i=1}^{n} \log f(X_i = x_i) = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]$$
$$= \sum_{i=1}^{n} \left[-\log \sigma - \frac{1}{2}\log(2\pi) - \frac{(x_i - \mu)^2}{2\sigma^2}\right] = -n\log \sigma - \frac{n}{2}\log(2\pi) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2},$$
which yields
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2.$$
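These two formulas are one line each in code; a sketch with simulated data (the true values 2.0 and 1.5 are arbitrary). Note the divisor is $n$, not $n - 1$:

```python
import numpy as np

# MLE of (mu, sigma^2) for an i.i.d. normal sample
rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # made-up sample

mu_hat = x.mean()                      # (1/n) sum x_i
sigma2_hat = np.mean((x - mu_hat)**2)  # (1/n) sum (x_i - mu_hat)^2

print(mu_hat, sigma2_hat)  # close to 2.0 and 1.5**2 = 2.25
print(x.var(ddof=0))       # same as sigma2_hat
```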
We often assume that the errors in a linear regression model are normally distributed:
$$y = X\beta + \epsilon,$$
where $X$ is the matrix of predictors, $\beta$ is the vector of coefficients, and $\epsilon$ is the vector of errors such that $\epsilon \sim N(0, \sigma^2 I)$. Under this assumption, we can apply MLE to estimate the parameters $\beta$ and $\sigma^2$.
$$\frac{\partial l(\beta, \sigma^2 \mid y)}{\partial \beta} = \frac{1}{\sigma^2} X^T (y - X\beta) = 0$$
$$\Rightarrow \hat{\beta} = (X^T X)^{-1} X^T y$$
This shows that the MLE of $\beta$ is the same as the least squares estimator of $\beta$.
$$\frac{\partial l(\beta, \sigma^2 \mid y)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(y - X\beta)^T (y - X\beta) = 0$$
$$\Rightarrow \hat{\sigma}^2 = \frac{1}{n}(y - X\hat{\beta})^T (y - X\hat{\beta})$$
This shows that the MLE of $\sigma^2$ is the residual sum of squares divided by the number of observations.
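A final sketch tying the two results together on simulated data (all numbers arbitrary): $\hat{\beta}$ from the normal equations, and the MLE $\hat{\sigma}^2 = \mathrm{RSS}/n$ compared with the usual unbiased estimator $\mathrm{RSS}/(n-k)$:

```python
import numpy as np

# In the normal linear model, the MLE of beta is OLS and the MLE of
# sigma^2 divides the residual sum of squares by n (not n - k)
rng = np.random.default_rng(6)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS = MLE of beta
resid = y - X @ beta_hat
sigma2_mle = resid @ resid / n                # MLE: RSS / n
sigma2_unbiased = resid @ resid / (n - k)     # usual unbiased estimator

print(sigma2_mle, sigma2_unbiased)  # both near the true value 4.0
```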