Week 1 Review
Thomas T. Yang
Research School of Economics
Australian National University
S2 2023
Review
1 OLS estimator with a single X
2 OLS estimator with multiple X
3 Maximum Likelihood Estimation
Definition
The Ordinary Least Squares (OLS) estimators are defined by
$$\left(\hat{\beta}_0, \hat{\beta}_1\right) := \arg\min_{b_0, b_1} \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2$$
In words:
we look at the rhs as a function of $b_0$ and $b_1$
that function happens to be quadratic
we find the values of $b_0$ and $b_1$ that minimize that function
the values that minimize that function are called the solution
we give the solution a specific name: $\hat{\beta}_0$ and $\hat{\beta}_1$
$$\frac{\partial L}{\partial b_0} = -\sum_{i=1}^{n} 2\left(Y_i - b_0 - b_1 X_i\right),$$
$$\frac{\partial L}{\partial b_1} = -\sum_{i=1}^{n} 2X_i\left(Y_i - b_0 - b_1 X_i\right).$$
Fourth step:
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X},$$
$$\left(\bar{Y} - \hat{\beta}_1 \bar{X}\right)\sum_{i=1}^{n} X_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i.$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
The OLS estimator of the slope is equal to the ratio of the sample covariance to the sample variance!
The OLS estimators are functions of the sample data only
Given the sample data $(X_i, Y_i)$ we can first compute the rhs for $\hat{\beta}_1$ and then compute the rhs for $\hat{\beta}_0$
Computer programs such as Stata easily calculate the rhs for you; a sketch is given below
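For concreteness, here is a minimal sketch in Python with numpy (not the Stata workflow the slides refer to), computing both rhs formulas directly on made-up simulated data; all variable names and values are illustrative only:

```python
import numpy as np

# Made-up sample: true intercept 1, true slope 2 (illustrative values)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1.0 + 2.0 * x + rng.normal(size=50)

# Slope: ratio of sample covariance to sample variance
beta1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: beta0_hat = Ybar - beta1_hat * Xbar
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)  # close to (1, 2) in this sample
```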
Recall that an estimator $\hat{\theta}$ of $\theta$ is unbiased if
$$E[\hat{\theta}] = \theta$$
Theorem
Under OLS Assumptions 1 through 4a, the OLS estimator
$$\left(\hat{\beta}_0, \hat{\beta}_1\right) := \arg\min_{b_0, b_1} \sum_{i=1}^{n} \left(Y_i - b_0 - b_1 X_i\right)^2$$
is BLUE (the Best Linear Unbiased Estimator).
The Gauss-Markov theorem provides a theoretical justification for using OLS.
Or simply
$$Y = X\beta + \epsilon,$$
or
$$Y_i = \beta_1 + X_{2i}\beta_2 + X_{3i}\beta_3 + \cdots + X_{ki}\beta_k + \epsilon_i,$$
for $i = 1, 2, \ldots, n$.
Since $X_i' = (1, X_{2i}, \ldots, X_{ki})$ is the $i$-th row of $X$, we can write
$$Y_i = X_i'\beta + \epsilon_i$$
for $i = 1, 2, \ldots, n$. This is just another way to write $Y_i = \beta_1 + X_{2i}\beta_2 + \cdots + X_{ki}\beta_k + \epsilon_i$ for $i = 1, 2, \ldots, n$.
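A small sketch of this notation with hypothetical regressor values: stacking a column of ones and the regressors into $X$ makes the matrix product $X\beta$ reproduce the scalar equation row by row.

```python
import numpy as np

# Hypothetical data: a constant and two regressors stacked into the
# n x k matrix X of Y = X beta + eps (here k = 3)
n = 5
x2 = np.arange(n, dtype=float)             # second regressor
x3 = np.array([1., 0., 1., 0., 1.])        # third regressor
X = np.column_stack([np.ones(n), x2, x3])  # first column of ones for beta_1

beta = np.array([1.0, 2.0, -0.5])
eps = np.zeros(n)                          # no noise, just to check the algebra
Y = X @ beta + eps                         # row i equals beta1 + x2[i]*beta2 + x3[i]*beta3
print(Y)
```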
Estimates
The OLS estimator minimizes the sum of squared errors, that is, it finds $\hat{\beta}$ to solve
$$\min_{b} \sum_{i=1}^{n} \left(Y_i - b_1 - X_{i2}b_2 - \cdots - X_{ik}b_k\right)^2.$$
Let
$$\underset{n \times 1}{e} = \underset{n \times 1}{Y} - \underset{n \times k}{X}\,\underset{k \times 1}{b},$$
that is,
$$e_i = Y_i - b_1 - X_{i2}b_2 - \cdots - X_{ik}b_k.$$
The OLS is equivalent to minimizing
$$\min e'e = \min \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.$$
Note
$$\min e'e = \min\,(Y - Xb)'(Y - Xb).$$
A note on matrix differentiation. For
$$f(b) = \underset{1 \times k}{a'}\,\underset{k \times 1}{b} = a_1 b_1 + a_2 b_2 + \ldots + a_k b_k,$$
we have
$$\underset{k \times 1}{\frac{\partial f(b)}{\partial b}} = \begin{pmatrix} \frac{\partial f(b)}{\partial b_1} \\ \frac{\partial f(b)}{\partial b_2} \\ \vdots \\ \frac{\partial f(b)}{\partial b_k} \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{pmatrix}.$$
Note
$$L = (Y - Xb)'(Y - Xb) = Y'Y - 2Y'Xb + b'X'Xb.$$
By matrix differentiation,
$$\frac{\partial L}{\partial b} = 2X'Xb - 2X'Y.$$
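As a sanity check, this matrix-calculus result can be verified numerically; the sketch below (made-up data) compares $2X'Xb - 2X'Y$ with a central finite-difference approximation of $\partial L / \partial b$:

```python
import numpy as np

# Numerical check of dL/db = 2 X'X b - 2 X'Y on made-up data
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
b = rng.normal(size=3)

grad = 2 * X.T @ X @ b - 2 * X.T @ Y

def L(b):
    e = Y - X @ b
    return e @ e  # (Y - Xb)'(Y - Xb)

# Central finite differences, coordinate by coordinate
h = 1e-6
fd = np.array([(L(b + h * np.eye(3)[j]) - L(b - h * np.eye(3)[j])) / (2 * h)
               for j in range(3)])
print(np.allclose(grad, fd, atol=1e-4))  # True
```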
First order condition:
$$\left.\frac{\partial L}{\partial b}\right|_{b = \hat{\beta}} = 2X'X\hat{\beta} - 2X'Y = 0,$$
$$\Rightarrow \hat{\beta} = \left(X'X\right)^{-1} X'Y.$$
So
$$X'\hat{e} = X'\left(I - X\left(X'X\right)^{-1}X'\right)Y = X'Y - X'X\left(X'X\right)^{-1}X'Y = X'Y - X'Y = 0.$$
The sample mean of $\hat{e}$, $\frac{1}{n}\sum_{i=1}^{n} \hat{e}_i$, is zero because the first column of $X$ is a vector of ones.
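Both facts are easy to confirm numerically. A sketch with simulated data (all values illustrative only):

```python
import numpy as np

# Solve the normal equations and verify X'e_hat = 0
rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (X'X)^{-1} X'Y
e_hat = Y - X @ beta_hat

print(X.T @ e_hat)   # ~ zero vector
print(e_hat.mean())  # ~ 0, since the first column of X is ones
```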
Maximum Likelihood Estimation (Appendix 11.2)
We start by flipping the coin multiple times and recording the outcome of each flip. For simplicity, let's say we flip the coin 10 times and get 7 heads and 3 tails. Now we need to figure out what value of $p$ makes the data we observed (7 heads and 3 tails) the most likely.
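One way to see this is to evaluate the likelihood of 7 heads in 10 flips over a grid of candidate values of $p$; a minimal numpy sketch (the grid is an arbitrary choice):

```python
import numpy as np

# Likelihood of 7 heads in 10 flips as a function of p
p = np.linspace(0.01, 0.99, 99)
likelihood = p**7 * (1 - p)**3   # binomial coefficient omitted: it does not depend on p
print(p[np.argmax(likelihood)])  # 0.7, the sample proportion of heads
```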
Suppose we observe $n$ i.i.d. draws $y_1, y_2, \ldots, y_n$, where $y_i = 1$ for heads and $y_i = 0$ for tails.
We do not know $p$, and we want to estimate $p$ using $y_1, y_2, \ldots, y_n$.
The probability of observing this sample is:
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n) = P(Y_1 = y_1) \times P(Y_2 = y_2) \times \ldots \times P(Y_n = y_n) = p^{y_1}(1-p)^{1-y_1} \times p^{y_2}(1-p)^{1-y_2} \times \ldots \times p^{y_n}(1-p)^{1-y_n}.$$
Taking logs gives the log likelihood
$$L = \log p \sum_{i=1}^{n} y_i + \log(1-p) \sum_{i=1}^{n} (1 - y_i).$$
First derivative:
$$\frac{dL}{dp} = \frac{1}{p}\sum_{i=1}^{n} y_i - \frac{1}{1-p}\sum_{i=1}^{n}(1-y_i) = \frac{(1-p)\sum_{i=1}^{n} y_i - p\sum_{i=1}^{n}(1-y_i)}{p(1-p)} = \frac{\sum_{i=1}^{n} y_i - np}{p(1-p)}.$$
Setting this to zero yields $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$, the sample proportion of heads.
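A quick numerical check of this closed form on a simulated sample (true $p = 0.4$ is an arbitrary choice): the grid maximizer of the log likelihood matches the sample mean.

```python
import numpy as np

# Check that p_hat = sample mean maximizes the Bernoulli log likelihood
rng = np.random.default_rng(3)
y = rng.binomial(1, 0.4, size=200)  # made-up sample with true p = 0.4

p_hat = y.mean()                    # closed-form MLE from the FOC above

def loglik(p):
    return np.log(p) * y.sum() + np.log(1 - p) * (len(y) - y.sum())

grid = np.linspace(0.01, 0.99, 981)
print(p_hat, grid[np.argmax([loglik(p) for p in grid])])  # nearly identical
```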
Maximum Likelihood Estimation (Appendix 11.2), continued
Poisson distribution:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}.$$
Suppose we observe $n$ i.i.d. draws $x_1, \ldots, x_n$. The log likelihood is
$$L = \sum_{i=1}^{n} \log \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \log \lambda \sum_{i=1}^{n} x_i - n\lambda - \sum_{i=1}^{n} \log(x_i!).$$
Setting $\frac{dL}{d\lambda} = \frac{1}{\lambda}\sum_{i=1}^{n} x_i - n = 0$ yields
$$\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
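A sketch verifying the Poisson result on simulated data (the true $\lambda = 3$ is an arbitrary choice; the $\log(x_i!)$ term is dropped since it does not depend on $\lambda$):

```python
import numpy as np

# The Poisson MLE is the sample mean
rng = np.random.default_rng(4)
x = rng.poisson(lam=3.0, size=500)  # made-up sample with true lambda = 3

lam_hat = x.mean()                  # closed-form MLE

# Log likelihood up to the log(x_i!) constant
def loglik(lam):
    return np.log(lam) * x.sum() - len(x) * lam

grid = np.linspace(0.5, 6.0, 551)
print(lam_hat, grid[np.argmax(loglik(grid))])  # agree up to the grid step
```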
Suppose we observe $n$ i.i.d. draws $x_1, \ldots, x_n$ from the normal distribution $N(\mu, \sigma^2)$. The log likelihood is
$$L = \log f(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \sum_{i=1}^{n} \log f(X_i = x_i) = \sum_{i=1}^{n} \log\left[\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right]$$
$$= \sum_{i=1}^{n} \left[-\log \sigma - \frac{1}{2}\log(2\pi) - \frac{(x_i - \mu)^2}{2\sigma^2}\right] = -n\log \sigma - \frac{n}{2}\log(2\pi) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2},$$
which yields
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2.$$
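These two formulas are one line each in code; a sketch with simulated data (the true values 2.0 and 1.5 are arbitrary). Note the divisor is $n$, not $n - 1$:

```python
import numpy as np

# MLE of (mu, sigma^2) for an i.i.d. normal sample
rng = np.random.default_rng(5)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # made-up sample

mu_hat = x.mean()                      # (1/n) sum x_i
sigma2_hat = np.mean((x - mu_hat)**2)  # (1/n) sum (x_i - mu_hat)^2

print(mu_hat, sigma2_hat)  # close to 2.0 and 1.5**2 = 2.25
print(x.var(ddof=0))       # same as sigma2_hat
```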
We often assume that the errors in a linear regression model are normally distributed:
$$y = X\beta + \epsilon,$$
where $X$ is the matrix of predictors, $\beta$ is the vector of coefficients, and $\epsilon$ is the vector of errors such that $\epsilon \sim N(0, \sigma^2 I)$. Under this assumption, we can apply MLE to estimate the parameters $\beta$ and $\sigma^2$.
$$\frac{\partial l(\beta, \sigma^2 \mid y)}{\partial \beta} = \frac{1}{\sigma^2} X^T (y - X\beta) = 0$$
$$\Rightarrow \hat{\beta} = (X^T X)^{-1} X^T y$$
This shows that the MLE of $\beta$ is the same as the least squares estimator of $\beta$.
$$\frac{\partial l(\beta, \sigma^2 \mid y)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}(y - X\beta)^T (y - X\beta) = 0$$
$$\Rightarrow \hat{\sigma}^2 = \frac{1}{n}(y - X\hat{\beta})^T (y - X\hat{\beta})$$
This shows that the MLE of $\sigma^2$ is the residual sum of squares divided by the number of observations.
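A final sketch tying the two results together on simulated data (all numbers arbitrary): $\hat{\beta}$ from the normal equations, and the MLE $\hat{\sigma}^2 = \mathrm{RSS}/n$ compared with the usual unbiased estimator $\mathrm{RSS}/(n-k)$:

```python
import numpy as np

# In the normal linear model, the MLE of beta is OLS and the MLE of
# sigma^2 divides the residual sum of squares by n (not n - k)
rng = np.random.default_rng(6)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS = MLE of beta
resid = y - X @ beta_hat
sigma2_mle = resid @ resid / n                # MLE: RSS / n
sigma2_unbiased = resid @ resid / (n - k)     # usual unbiased estimator

print(sigma2_mle, sigma2_unbiased)  # both near the true value 4.0
```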