Week 2, OLS


MICRO-ECONOMETRICS

ECO 6175

ABEL BRODEUR

Weeks 2 & 3

1/ 83
Statistical Tools

Outline:
- OLS
- Limitations of OLS
  - Omitted variable bias
  - Measurement error
  - Reverse causality
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

2/ 83
Ordinary Least Squares

Yi = β0 + β1 xi + εi

Yi = β̂0 + β̂1 xi + ε̂i

The value
Ŷi = β̂0 + β̂1 xi
is the predicted (fitted) value of Y at xi

3/ 83
Regression

β0 is the intercept: the value of Y when X is zero

β1 is the slope: the rate of change in Y for a one-unit change in X (the regression coefficient)

This implies a straight line. But since we never have a perfect line, we add an error term (epsilon)

OLS chooses the estimates that minimize the Residual Sum of Squares (RSS)

4/ 83
Let’s Solve!

min over β̂0, β̂1 :  Σi ε̂i² = Σi (Yi − Ŷi)²

Graph!

5/ 83
First-Order Conditions

∂RSS/∂β̂0 = −2 Σi (Yi − Ŷi) = 0    (1)

∂RSS/∂β̂1 = −2 Σi xi (Yi − Ŷi) = 0    (2)

6/ 83
First-Order Conditions

From eq. (1)

Σi (Yi − β̂0 − β̂1 xi) = 0

Ȳ − β̂0 − β̂1 x̄ = 0

7/ 83
First-Order Conditions
From eq. (2)

Σi xi (Yi − Ŷi) = 0

Σi xi (Yi − β̂0 − β̂1 xi) = 0

Substituting (1) into (2), where

−β̂0 = β̂1 x̄ − Ȳ

Σi xi (Yi − Ȳ + β̂1 x̄ − β̂1 xi) = 0
8/ 83
Least Squares Formulas

β̂1 = Σi xi (Yi − Ȳ) / Σi xi (xi − x̄)

Be careful: we need the xi not all equal, i.e. Σi (xi − x̄)² ≠ 0

β̂1 = Σi (Yi − Ȳ)(xi − x̄) / Σi (xi − x̄)²

9/ 83
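A minimal Python/numpy sketch (not part of the course materials) that applies these closed-form least-squares formulas to simulated data; the sample size and true coefficients are invented for illustration.

```python
import numpy as np

# Check the least-squares formulas on simulated data.
rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = 2.0 + 1.5 * x + eps          # true beta0 = 2, beta1 = 1.5

# beta1_hat = sum (x - xbar)(y - ybar) / sum (x - xbar)^2
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat from the first-order condition: ybar - beta1_hat * xbar
b0_hat = y.mean() - b1_hat * x.mean()

print(b0_hat, b1_hat)            # close to 2 and 1.5
```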
Condition

Under the condition that

Σi (xi − x̄)² ≠ 0

If xi is constant, each term equals its mean and the sum is 0.

We also need to check that the Hessian matrix is positive semidefinite

10/ 83
Hessian Matrix
 
H = [ ∂²RSS/∂β̂0²     ∂²RSS/∂β̂0∂β̂1 ]
    [ ∂²RSS/∂β̂1∂β̂0   ∂²RSS/∂β̂1²   ]

where Ŷi = β̂0 + β̂1 xi

H = [ 2N        2 Σi xi  ]
    [ 2 Σi xi   2 Σi xi² ]

11/ 83
Hessian Matrix

If vâr(X) > 0, then this is a positive definite matrix since all
the principal submatrices have positive determinants:

H = 2N [ 1             (1/N) Σi xi  ]
       [ (1/N) Σi xi   (1/N) Σi xi² ]

12/ 83
Hessian Matrix

If vâr(X) > 0, then this is a positive definite matrix since all
the principal submatrices have positive determinants:

det [ 1             (1/N) Σi xi  ]  =  (1/N) Σi xi² − x̄²  =  (1/N) Σi (xi − x̄)²
    [ (1/N) Σi xi   (1/N) Σi xi² ]

13/ 83
Multiple Independent Variables

Example: 2 independent variables

yi = β0 + β1 xi1 + β2 xi2 + εi

Obtaining OLS estimates:

β̂1 = [ Σi (xi2 − x̄2)² Σi (xi1 − x̄1)(yi − ȳ) − Σi (xi1 − x̄1)(xi2 − x̄2) Σi (xi2 − x̄2)(yi − ȳ) ]
     / [ Σi (xi1 − x̄1)² Σi (xi2 − x̄2)² − ( Σi (xi1 − x̄1)(xi2 − x̄2) )² ]

Estimated betas have partial-effect interpretations here.

When x2 is held fixed, β1 gives the change in y when x1 changes by one unit.

14/ 83
Multiple Independent Variables

   
[ ŷ1 ]   [ x11 β̂1 + · · · + x1j β̂j + · · · + x1k β̂k ]
[ ŷ2 ]   [ x21 β̂1 + · · · + x2j β̂j + · · · + x2k β̂k ]
[ ... ] = [ ...                                       ]
[ ŷi ]   [ xi1 β̂1 + · · · + xij β̂j + · · · + xik β̂k ]
[ ... ]   [ ...                                       ]
[ ŷN ]   [ xN1 β̂1 + · · · + xNj β̂j + · · · + xNk β̂k ]

15/ 83
Multiple Independent Variables

The linear model coefficients are written as a vector

β = (β0 , β1 , · · · , βp )′

where β0 is the intercept and βk is the slope corresponding to the kth covariate. For a given coefficient vector β, the vector of fitted values is given by the matrix-vector product

Ŷ = Xβ,

which is an n-dimensional vector.

16/ 83
Multiple Independent Variables

min over β̂1, · · · , β̂k :  Σi (yi − ŷi)² = Σi (yi − xi1 β̂1 − ... − xik β̂k)²

17/ 83
First-Order Conditions

∂RSS/∂β̂1 = −2 Σi xi1 ε̂i = 0

∂RSS/∂β̂2 = −2 Σi xi2 ε̂i = 0

...

∂RSS/∂β̂k = −2 Σi xik ε̂i = 0

18/ 83
Multiple Independent Variables

X′ : k × N;  ε̂ : N × 1;  0 : k × 1

X′ ε̂ = 0

X′ (Y − X β̂) = 0

β̂ = (X′X)⁻¹ X′Y

If the rank of X′X is k, then (X′X)⁻¹ exists

Note: rank(X′X) = rank(X) (hard to prove!)

19/ 83
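A small numpy sketch of the matrix formula β̂ = (X′X)⁻¹X′Y (illustrative only; the design matrix and coefficient values are made up). It also checks the first-order condition X′ε̂ = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = np.column_stack([np.ones(n),                    # constant
                     rng.normal(size=(n, k - 1))])  # two regressors
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y  (solve the normal equations, no explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
print(beta_hat)
print(X.T @ resid)   # first-order conditions: numerically ~ 0
```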
Assumptions
Assumptions about the true regression model and the data generating process
- Ideal conditions have to be met in order for OLS to be a good estimator
- BLUE: unbiased and efficient
- Best (variance of the OLS estimator is minimal) Linear Unbiased (expected value equal to the true value) Estimator

Need to be aware of the ideal conditions and their violations to be able to control for deviations from these conditions and provide results that are unbiased or at least consistent

20/ 83
Assumption 1

To consistently estimate β, we need to make a few assumptions:

E(X′µ) = 0

Because X contains a constant, this assumption is equivalent to saying that µ has mean zero and is uncorrelated with each regressor

21/ 83
Assumption 2

Assumption that the X matrix has full rank:

rank(X) = k

No perfect multicollinearity! This fails if and only if at least one of the regressors can be written as a linear function of the other regressors

In the case of perfect multicollinearity, the matrix X′X is singular and therefore cannot be inverted. Under these circumstances, the ordinary least-squares estimator does not exist

22/ 83
Assumption 3

E(µ² X′X) = σ² E(X′X)

where σ² ≡ E(µ²)

Because E(µ) = 0, σ² is also equal to var(µ)

- This assumption is equivalent to assuming that the squared error, µ², is uncorrelated with each xj, each xj², and all cross products of the form xj xk

23/ 83
Assumption 3

Use the law of iterated expectations:

E(µ²|X) = σ²

which is the same as var(µ|X) = σ² when E(µ|X) = 0

The constant conditional variance assumption for µ given X is stronger than needed

24/ 83
Significance and F -Test

Testing multiple linear restrictions

- A t-test is associated with each OLS coefficient
- But it is also possible to test multiple restrictions jointly (e.g. all slope coefficients jointly equal to zero)
- H0 : β1 = β2 = ... = βk = 0

The F-statistic is defined as:

F = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)]

and is distributed as F ∼ F(q, n−k−1)

25/ 83
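A hedged sketch of the F-statistic above, computed by running the restricted and unrestricted regressions and comparing their SSRs. The data and the restriction (dropping the last two regressors) are purely illustrative; scipy is used only for the F distribution's tail probability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 200, 3                       # k regressors plus a constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, 0.0]) + rng.normal(size=n)

def ssr(Xm, y):
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    e = y - Xm @ b
    return e @ e

q = 2                               # exclusion restrictions: last two slopes are zero
ssr_ur = ssr(X, y)                  # unrestricted: all regressors
ssr_r = ssr(X[:, : 1 + k - q], y)   # restricted: drop the last q regressors

F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
p_value = stats.f.sf(F, q, n - k - 1)
print(F, p_value)
```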
Significance and F -Test
F-statistic
- SSRr: Sum of Squared Residuals of the restricted model
- SSRur: Sum of Squared Residuals of the unrestricted model (all regressors)
- The F-statistic is always non-negative (SSRr is at least as large as SSRur)
- The F-test is thus a one-sided test

Notation
- k: number of regressors; n: number of observations; q: number of exclusion restrictions
- q is equal to the difference in degrees of freedom between the restricted and unrestricted models
26/ 83
Interpreting and Comparing Coefficients
The size of the slope parameters depends on the scaling of the variables!
- Not easy to compare effect sizes even when the scales are similar
- It is sometimes also hard to interpret coefficients without logs (elasticities)

Standardized coefficients (see the sketch after this list):
- Take the standard deviations of the dependent and independent variables into account
- How much y changes (in standard deviations) if x changes by one standard deviation instead of one unit
- Make it easier to compare coefficients across the independent variables
27/ 83
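A small numpy sketch (invented data) of the standardized-coefficient idea: the standardized slope equals the raw slope times sd(x)/sd(y), or equivalently the slope obtained after z-scoring both variables.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
x = rng.normal(scale=4.0, size=n)          # x measured on a "large" scale
y = 3.0 + 0.25 * x + rng.normal(size=n)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Standardized coefficient: effect of a one-standard-deviation change in x,
# expressed in standard deviations of y.
beta_std = slope * x.std() / y.std()

# Same number from regressing the z-scored y on the z-scored x.
zx, zy = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()
check = np.sum(zx * zy) / np.sum(zx ** 2)
print(beta_std, check)
```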
Limitations of OLS

(I) Omitted variables bias

(II) Measurement error

(III) Simultaneous equations

28/ 83
Ability Bias and the Returns to Schooling

We would like to run the long regression

yi = α + ρSi + γAi + εi

where yi is log earnings, Si is schooling and Ai is ability. If we do not have a measure of ability we can only run the short regression

yi = αs + ρs Si + εsi .

What do we get?

29/ 83
(I) Omitted Variables
The relationship between the long and short regression
coefficients is given by the omitted variables bias (OVB)
formula
ρs = Cov(yi, Si) / Var(Si) = ρ + γ δAS

where δAS = Cov(Ai, Si) / Var(Si)

is the regression coefficient from a regression of Ai (the omitted variable) on Si (the included variable). The OVB formula is a mechanical relationship between the two regressions: it holds regardless of the causal interpretation of any of the coefficients

30/ 83
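A simulation sketch of the OVB formula (all numbers invented): the short-regression coefficient recovers ρ + γ·δAS, where δAS is the coefficient from regressing the omitted ability variable on schooling.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
ability = rng.normal(size=n)
schooling = 12 + 2 * ability + rng.normal(size=n)      # Cov(A, S) > 0
log_wage = 1.0 + 0.10 * schooling + 0.05 * ability + rng.normal(scale=0.3, size=n)

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

rho_short = slope(schooling, log_wage)        # short regression of y on S only
delta_AS = slope(schooling, ability)          # regression of omitted A on S
print(rho_short, 0.10 + 0.05 * delta_AS)      # OVB formula: the two should agree
```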
Griliches (1977)
The conventional wisdom is Cov(Ai, Si) > 0, so returns-to-schooling estimates will be biased up

Short regression estimates using the National Longitudinal Survey

yi = const + 0.068(0.003)Si + experience

Long regression estimates

yi = const + 0.059(0.003)Si + 0.0028(0.0005)IQi + experience

31/ 83
Bad Controls

In the quest for identifying causal effects, which variables belong on the right-hand side of a regression equation?
- Good: variables determining the treatment and correlated with the outcome (e.g. ability)
- Good: variables uncorrelated with the treatment but correlated with the outcome (may reduce standard errors)
- Bad: variables which are themselves outcomes of the treatment

32/ 83
Bad Controls

Some researchers regressing earnings on schooling (and experience) include controls for occupation. Does this make sense?

33/ 83
Bad Controls

Clearly we can think of schooling affecting access to higher-level occupations
- This gives rise to a two-equation system

yi = α + ρSi + γOi + εi
Oi = λ0 + λ1 Si + ui

You could think about these as a simultaneous equations system. Occupation is an endogenous variable. As a result, you cannot necessarily estimate the first equation by OLS. Bad control!

34/ 83
Classical Measurement Error

(II) Measurement error leads to bias

Example:

yi = α + βxi + εi

We do not observe xi; instead we have x̃i

x̃i = xi + wi

where cov(xi, wi) = 0 and cov(εi, wi) = 0. This is called the classical measurement error model

35/ 83
Classical Measurement Error
The bivariate regression coefficient we estimate is

β̂ = cov(yi, x̃i) / var(x̃i)
   = cov(α + βxi + εi, xi + wi) / var(xi + wi)
   = β var(xi) / var(xi + wi) = βλ

We see that β̂ is biased towards zero by an attenuation factor

λ = var(xi) / var(xi + wi)

which is the variance of the "signal" divided by the variance of the "signal plus noise"
36/ 83
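A simulation sketch of classical measurement error and the attenuation factor λ (all numbers invented): regressing y on the noisy x̃ recovers roughly βλ = β·var(x)/(var(x)+var(w)).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(scale=2.0, size=n)                  # true regressor, var = 4
w = rng.normal(scale=1.0, size=n)                  # classical measurement error, var = 1
x_tilde = x + w                                    # what we actually observe
y = 0.5 + 1.0 * x + rng.normal(scale=0.5, size=n)  # true beta = 1

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

lam = 4.0 / (4.0 + 1.0)                            # var(x) / var(x + w) = 0.8
print(slope(x_tilde, y), 1.0 * lam)                # estimate is attenuated toward beta*lambda
```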
Measurement Error in the Returns to
Schooling

Think of yi as log earnings and xi as schooling. Ignore age or experience for the moment

Ashenfelter and Krueger (1994) find λ = 0.9 for schooling

This means that if the true return to schooling is 0.1, we would expect an estimate of 0.09

37/ 83
Measurement Error With Two Regressors
Consider again the generic case first

yi = α + β1 x1i + β2 x2i + εi

and only x1i is subject to classical measurement error, i.e. cov(x1i, wi) = cov(x2i, wi) = 0

Then it can be shown that

β̂1 = β1 λ′

λ′ = (λ − R²12) / (1 − R²12)

where λ is the bivariate attenuation factor, and R²12 is the R² from the population regression of x̃1i on x2i
38/ 83
Measurement Error With Two Regressors
The short regression (on just x̃1i) coefficient is

β̂1,short = λ β1 + β2 δx2x̃1 = λ (β1 + β2 δx2x1)

where the estimate of β1 is biased both because of attenuation due to measurement error and because of omitted variables bias

The coefficient from the long regression is

β̂1,long = λ′ β1

Note that

λ′ < λ

but

β̂1,short may be smaller or larger than β̂1,long
39/ 83
Measurement Error With Two Regressors

Notice that it is more difficult to compare the bias from the short and long regressions now

λ′ < λ implies that the attenuation bias goes up when another regressor that is correlated with x̃1i is entered (because it absorbs some of the "good" variation)

There is less attenuation in the short regression, but there is also OVB now. It is not clear what the net effect is

40/ 83
Measurement Error in the Control
What about the coefficient β2? Even when there is no measurement error in x2i, the estimate of β2 will be biased:

β̂2 = β2 + β1 δx1x2 (1 − λ)/(1 − R²12)

Note that the bias will be larger, the larger are
- The measurement error
- The correlation between x1i and x2i

Intuition:
- β1 is attenuated, and hence does not reflect the full effect of x1i
- β2 will capture part of the effect of x1i, through the correlation with x2i
41/ 83
Measurement Error in the Returns to
Schooling

yi = α + ρSi + γAi + εi

where Si is schooling and Ai is ability. Suppose we only have a mismeasured version of schooling, S̃i. Then the short regression will give

ρ̂short = λρ + γδAS̃

and the long regression

ρ̂long = λ′ρ

If ability bias is upwards (δAS̃ > 0) it is not possible to say a priori which estimate will be closer to ρ
42/ 83
Measurement Error in Ability

Now suppose years of schooling is measured perfectly but we only have mismeasured ability Ãi. Then

ρ̂ = γ δAS̃ (1 − λ)/(1 − R²AS̃) + ρ.

If ability bias is upwards (δAS̃ > 0) then the return to schooling will be biased up, but by less than in the short regression. Controlling for mismeasured ability is better than controlling for nothing!

43/ 83
Numbers on the Griliches Example
Pick some numbers for the regression

yi = 0.1Si + 0.01Ai + εi

and set

λ = 0.9

σS̃ = 3, σA = 15, σAS = 22.5.

Then

δAS̃ = σAS / σS̃² = 22.5/9 = 2.5

and

ρ̂short = λρ + γδAS̃ = 0.9 × 0.1 + 0.01 × 2.5 = 0.115

Overestimating!
44/ 83
What about Long Regression
We first need

λ′ = (λ − R²S̃A) / (1 − R²S̃A)

which is

R²S̃A = (σAS / (σS̃ σA))² = (22.5/45)² = 0.25

λ′ = (0.9 − 0.25)/(1 − 0.25) = 0.867.

Then the long regression coefficient is

ρ̂long = λ′ρ = 0.867 × 0.1 = 0.087

Underestimating: so the short regression coefficient is too large and the long regression coefficient is too small.
45/ 83
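A tiny arithmetic check of the numbers on the two previous slides (a sketch only; it simply reproduces the calculations, assuming λ = 0.9, ρ = 0.1, γ = 0.01, σS̃ = 3, σA = 15 and Cov(A, S) = 22.5).

```python
# Reproduce the Griliches-example arithmetic from the previous two slides.
lam, rho, gamma = 0.9, 0.1, 0.01
sd_S, sd_A, cov_AS = 3.0, 15.0, 22.5

delta_AS = cov_AS / sd_S**2                 # 22.5 / 9 = 2.5
rho_short = lam * rho + gamma * delta_AS    # 0.9*0.1 + 0.01*2.5 = 0.115

R2 = (cov_AS / (sd_S * sd_A)) ** 2          # (22.5 / 45)^2 = 0.25
lam_prime = (lam - R2) / (1 - R2)           # (0.9 - 0.25) / 0.75 ~ 0.867
rho_long = lam_prime * rho                  # ~ 0.087

print(rho_short, rho_long)                  # 0.115 (too large) and ~0.087 (too small)
```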
Limitations of OLS

(III) Simultaneous equations

y1 = β0 + β1 y2 + µ (3)

y2 = α0 + α1 y1 + α2 x + ν (4)

Everything depends on everything: y1 and y2 are endogenous!

46/ 83
Limitations of OLS
OLS on eq. (3), forgetting to include eq. (4):

β̂1 = cov(y2, β0 + β1 y2 + µ) / var(y2)

β̂1 = β1 + cov(y2, µ) / var(y2)

...

E(β̂1) = β1 + [α1 σµ² / (1 − α1 β1)] / var(y2)

47/ 83
Other Issues

- Outliers? Weighting?
- Nonlinear models
- We will cover many other problems!

48/ 83
Statistical Tools

- OLS
- Limitations of OLS
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

49/ 83
Heteroscedasticity
Heteroscedasticity (Ancient Greek: “different dispersion”)
- The variance of the error terms differs across observations
- Can take many different forms!
- For instance, if the spread of the errors is not constant across the X values, heteroscedasticity is present

Example I: income and the variability of food consumption. If income increases, the variability of food consumption will decrease. A wealthy person always eats three meals whereas a poor person does not eat the same quantity of food every day

50/ 83
Example II

51/ 83
Example III

52/ 83
Check for Heteroscedasticity

Look at a plot of the residuals against the independent variables or the fitted values
- Calculate the residuals of the regression and plot them against the predicted values (ŷi)
- Easy to correct the standard errors in most statistical software (Stata: robust)
- Goldfeld-Quandt test
- There may also be spatial autocorrelation! (Stata: cluster)

53/ 83
Heteroscedasticity: Serial Correlation

Simple model: AR(1)

εt = ρεt−1 + µt

- |ρ| < 1, µt ∼ iid(0, σ²)
- We need to write the variance-covariance matrix of ε

54/ 83
Heteroscedasticity: Serial Correlation

εt = ρεt−1 + µt
εt = ρ(ρεt−2 + µt−1) + µt
εt = ρ²εt−2 + ρµt−1 + µt
εt = ρ²(ρεt−3 + µt−2) + ρµt−1 + µt
...

εt = ρ^r εt−r + Σ_{i=0}^{r−1} ρ^i µt−i

If |ρ| < 1 and r → ∞, then ρ^r → 0, so

εt = Σ_{i=0}^{∞} ρ^i µt−i

As t → ∞ we can calculate the moments of εt


55/ 83
Heteroscedasticity: Serial Correlation

var(εt) = E(εt²)

var(εt) = var( Σ_{i=0}^{∞} ρ^i µt−i )

Recall that µt ∼ iid

var(εt) = Σ_{i=0}^{∞} ρ^{2i} var(µt−i)

var(εt) = σ² Σ_{i=0}^{∞} ρ^{2i}

var(εt) = σ² / (1 − ρ²)
56/ 83
Heteroscedasticity: Serial Correlation
cov(εt, εt−r) = cov( ρ^r εt−r + Σ_{i=0}^{r−1} ρ^i µt−i , εt−r )

cov(εt, εt−r) = ρ^r var(εt−r)

cov(εt, εt−r) = ρ^r σ² / (1 − ρ²)

Ω = [σ² / (1 − ρ²)] ×

    [ 1         ρ         ρ²        · · ·  ρ^{t−1} ]
    [ ρ         1         ρ         · · ·  ρ^{t−2} ]
    [ ρ²        ρ         1         · · ·  ρ^{t−3} ]
    [ · · ·     · · ·     · · ·     · · ·  · · ·   ]
    [ ρ^{t−1}   ρ^{t−2}   ρ^{t−3}   · · ·  1       ]
57/ 83
Heteroscedasticity: Serial Correlation

AR(1) with |ρ| < 1 violates the assumption that there is no autocorrelation in the errors

A possible solution is to pre-multiply by Ω^{−1/2}:

Ω^{−1/2} Y = Ω^{−1/2} XB + Ω^{−1/2} ε

Y* = X*B + ε* → β̃

58/ 83
Heteroscedasticity: Serial Correlation

E(ε*) = E(Ω^{−1/2} ε) = Ω^{−1/2} E(ε) = 0

E(ε* ε*′) = E(Ω^{−1/2} ε ε′ Ω^{−1/2})

E(ε* ε*′) = Ω^{−1/2} E(ε ε′) Ω^{−1/2}

E(ε* ε*′) = Ω^{−1/2} Ω Ω^{−1/2}

E(ε* ε*′) = Id

59/ 83
Heteroscedasticity: Serial Correlation
In practice:

β̃ = (X*′X*)^{−1} X*′Y*
β̃ = (X′Ω^{−1/2}Ω^{−1/2}X)^{−1} (X′Ω^{−1/2}Ω^{−1/2}Y)
β̃ = (X′Ω^{−1}X)^{−1} (X′Ω^{−1}Y)

and:

var(β̃) = σ² (X*′X*)^{−1}
var(β̃) = σ² (X′Ω^{−1}X)^{−1}

Generalized least squares (GLS): takes heteroscedasticity into account

60/ 83
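A numpy sketch of GLS with the AR(1) covariance matrix from the earlier slides (ρ and the data are invented; in practice Ω is unknown, which is the point of the next slide). It builds the Toeplitz Ω and computes β̃ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y.

```python
import numpy as np

rng = np.random.default_rng(6)
T, rho, sigma = 300, 0.7, 1.0

# Simulate y_t = 1 + 0.5 x_t + eps_t with AR(1) errors.
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.normal(scale=sigma)
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(T), x])

# AR(1) covariance matrix: Omega[t, s] = sigma^2 / (1 - rho^2) * rho^|t - s|
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Omega = (sigma**2 / (1 - rho**2)) * rho**lags

Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols, beta_gls)   # both near (1, 0.5); GLS is the efficient one here
```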
Heteroscedasticity: Serial Correlation

This is unfortunately infeasible in practice.

We need an estimator of Ω (n × n). We would need to estimate N(N+1)/2 terms!
- Something less ambitious is FGLS (feasible GLS)
- (1) We can use White standard errors
- (2) Or we can impose a parametric structure (e.g. Ω(θ))

61/ 83
Heteroscedasticity: Serial Correlation
White (1980):

Ω = [ σ1²   0     0      0   ]
    [ 0     σ2²   0      0   ]
    [ 0     0     · · ·  0   ]
    [ 0     0     0      σT² ]

where the σt² are unknown. Recall that

var(β̂) = (X′X)⁻¹ X′ΩX (X′X)⁻¹

and

X′ΩX / T = (1/T) Σt εt² xt xt′

62/ 83
Heteroscedasticity: Serial Correlation

White (1980):

(X′X)⁻¹ X′ΩX (X′X)⁻¹ = (X′X)⁻¹ [ Σt εt² xt xt′ ] (X′X)⁻¹

Weight each outer product xt xt′ by εt² and see whether there is a relationship between the two

63/ 83
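A numpy sketch of the White (1980) sandwich formula above, (X′X)⁻¹[Σt ε̂t² xt xt′](X′X)⁻¹, computed from OLS residuals on invented heteroscedastic data and compared with the naive σ̂²(X′X)⁻¹ standard errors.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
x = rng.normal(size=n)
eps = rng.normal(size=n) * (0.5 + np.abs(x))     # error spread grows with |x|
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e[:, None] ** 2).T @ X               # sum_t e_t^2 x_t x_t'
V_robust = XtX_inv @ meat @ XtX_inv              # White sandwich estimator
V_naive = (e @ e / (n - 2)) * XtX_inv            # assumes homoscedasticity

print(np.sqrt(np.diag(V_naive)))                 # too small here
print(np.sqrt(np.diag(V_robust)))                # larger, heteroscedasticity-robust
```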
Heteroscedasticity: Serial Correlation

If large ε² are coupled with large X values, then the robust standard errors are bigger than the non-robust standard errors
- The naive variance of β̂ols is too small, so coefficients may look significant when they should not be!

Another problem remains: spatial autocorrelation

64/ 83
Spatial Autocorrelation

Observations look alike within a group
- For instance, unemployment in year t in Ontario may be very similar to unemployment in year t−1 in Ontario

Very important if you analyze a program at the province level, state level, etc.

65/ 83
Spatial Autocorrelation
Z: n × p matrix of indicators (0 or 1) depending on whether the individual is within group p

Z = [ 1  0  0  · · ·  0 ]
    [ 1  0  0  · · ·  0 ]
    [ ...               ]
    [ 1  0  0  · · ·  0 ]
    [ 0  1  0  · · ·  0 ]
    [ 0  1  0  · · ·  0 ]
    [ ...               ]
    [ 0  1  0  · · ·  0 ]
    [ 0  0  1  · · ·  0 ]
    [ ...               ]
66/ 83
Spatial Autocorrelation
 
ZZ′ = [ 1 · · · 1   0 · · · 0   0 · · · 0 ]
      [ 1 · · · 1   0 · · · 0   0 · · · 0 ]
      [ 1 · · · 1   0 · · · 0   0 · · · 0 ]
      [ 0 · · · 0   1 · · · 1   0 · · · 0 ]
      [ 0 · · · 0   1 · · · 1   0 · · · 0 ]
      [ 0 · · · 0   1 · · · 1   0 · · · 0 ]
      [ 0 · · · 0   0 · · · 0   1 · · · 1 ]
      [ 0 · · · 0   0 · · · 0   1 · · · 1 ]
      [ 0 · · · 0   0 · · · 0   1 · · · 1 ]

67/ 83
Homoscedasticity: Summary

Homoscedasticity means that each person (i.e. each observation) is its own cluster. Homoscedasticity: the assumption that there is neither heteroscedasticity nor clustering
- In most cases, you should consider heteroscedasticity
- How do you cluster your standard errors? What is the spatial unit? (see the sketch after this list)
- Cluster at the province level if the policy is at the province level
- You need at least 20 clusters. Problematic for Canadian studies!
- Stata: reg Y X, r cluster(province)

68/ 83
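A numpy sketch of cluster-robust standard errors in the spirit of the Stata command above (the clusters, data, and the omission of small-sample corrections are all illustrative assumptions): the "meat" sums X_g′ε̂_g ε̂_g′X_g over clusters instead of over single observations.

```python
import numpy as np

rng = np.random.default_rng(8)
G, n_per = 30, 50                                  # 30 clusters ("provinces")
cluster = np.repeat(np.arange(G), n_per)
u_g = rng.normal(size=G)                           # common shock within each cluster
x = rng.normal(size=G * n_per)
y = 1.0 + 0.5 * x + u_g[cluster] + rng.normal(size=G * n_per)
X = np.column_stack([np.ones(G * n_per), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Cluster-robust "meat": sum over clusters of (X_g' e_g)(X_g' e_g)'
meat = np.zeros((2, 2))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ e[idx]
    meat += np.outer(s, s)

V_cluster = XtX_inv @ meat @ XtX_inv
print(np.sqrt(np.diag(V_cluster)))                 # cluster-robust SEs (no df correction)
```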
Statistical Tools

- OLS
- Limitations of OLS
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

69/ 83
Panel Data

Panel data imply having at least two observations for the same individual/province/country over time
- Allow us to control for time-invariant unobserved heterogeneity

Yit = Xit β + εit
where εit = αi + νit

We may control for αi (the individual fixed effect) if we have more than one data point for each individual

70/ 83
Panel Data

Note that panel data do not by themselves provide causal effects!!!
- Moreover, if you include individual fixed effects in your model, you cannot estimate the effects of time-invariant variables (e.g. gender)

(1) Balance
- A balanced panel is a data set in which all units are observed in all time periods
- Is it unbalanced for exogenous reasons? Selection (e.g. attrition/migration)?

71/ 83
Panel Data

(2) Types of variables:

Yit = Xit′ βx + Zi′ βz + Wt′ βw + εit

(3) Estimation
- Fixed effects: the αi are parameters to estimate
- Random effects: αi is a stochastic variable (it remains in the error term)

72/ 83
Panel Data

Fixed effects:

Yit = Xit β + αi + νit

Yit = Xit β + Σ_{k=1}^{N} αk dki + νit

dki equals 1 if k = i and 0 if k ≠ i. We treat each αk as a parameter to estimate

73/ 83
Panel Data

Fixed effects: average for individual i over time t

Ȳi. = X̄i. β + αi + ν̄i.

Yit − Ȳi. = (Xit − X̄i.)β + νit − ν̄i.

αi disappears!

74/ 83
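A numpy sketch of the within (fixed-effects) transformation just described (panel dimensions and parameter values are invented): demeaning Y and X by individual removes αi, and OLS on the demeaned data recovers β.

```python
import numpy as np

rng = np.random.default_rng(9)
N, T = 200, 5
i = np.repeat(np.arange(N), T)                     # individual index for each row
alpha = rng.normal(scale=2.0, size=N)              # individual fixed effects
x = 0.8 * alpha[i] + rng.normal(size=N * T)        # x correlated with alpha_i
y = 1.5 * x + alpha[i] + rng.normal(size=N * T)    # true beta = 1.5

def demean_by(v, groups):
    means = np.zeros(groups.max() + 1)
    np.add.at(means, groups, v)                    # group sums
    means /= np.bincount(groups)                   # group means
    return v - means[groups]

x_w, y_w = demean_by(x, i), demean_by(y, i)        # within transformation: alpha_i drops out
beta_fe = np.sum(x_w * y_w) / np.sum(x_w ** 2)
beta_pooled = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(beta_pooled, beta_fe)                        # pooled OLS is biased up; FE is ~1.5
```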
Panel Data

Random effects: useful if αi is independent of X

νit ∼ iid(0, σν²)

αi ∼ iid(0, σα²)

α and ν are not correlated

75/ 83
Panel Data

Random effects:
- In practice, FGLS is needed!
- Harder to estimate, but consistent and efficient if E(α|X) = 0

Hausman test: tests the equality of the coefficients (β̂FE and β̂RE). If the coefficients are different, use fixed effects!

76/ 83
Statistical Tools

- OLS
- Limitations of OLS
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

77/ 83
Discrete Choice Models

Regressions with a binary dependent variable
- So far we have considered only cases in which the dependent variable is continuous
- Interpret the regression as modeling the probability that the dependent variable equals one
- For a binary variable: E(Y) = Pr(Y = 1)

78/ 83
Linear Probability Model

OLS regression with a binary dependent variable

Yi = β0 + β1 X1i + ... + βk Xki + µi

- β1 expresses the change in the probability that Y = 1 associated with a unit change in X1

Pr(Y = 1|X1, ..., Xk) = β0 + β1 X1 + ... + βk Xk = Ŷ

79/ 83
Linear Probability Model

Issues:
- Nonconforming predicted probabilities: the LPM can predict probabilities outside the range 0-1
- Heteroscedastic by construction

Need to use probit or logit models to solve the first issue

80/ 83
Probit and Logit

Bound predicted values between 0 and 1:
- Transform a linear index into something that ranges from 0 to 1
- Use a cumulative distribution function (CDF) for this transformation

For the probit, the CDF is the standard normal CDF Φ (logit: logistic function)

Pr(Y = 1|X1, ..., Xk) = Φ(β0 + β1 X1 + ... + βk Xk)

81/ 83
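A small sketch of the probit transformation above (coefficients are invented): the linear index Xβ is passed through the standard normal CDF Φ so that predicted probabilities stay between 0 and 1; scipy's norm.cdf plays the role of Φ.

```python
import numpy as np
from scipy.stats import norm

beta0, beta1 = -1.0, 0.8                      # invented coefficients
x = np.linspace(-4, 4, 9)

index = beta0 + beta1 * x                     # linear index, unbounded
p_lpm = index                                 # linear probability model: can leave [0, 1]
p_probit = norm.cdf(index)                    # probit: Phi(.) keeps it in (0, 1)

for xi, pl, pp in zip(x, p_lpm, p_probit):
    print(f"x = {xi:5.1f}   LPM: {pl:6.2f}   probit: {pp:5.3f}")
```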
Multinomial Logit

Regressions with a categorical, unordered dependent variable
- For instance, commuting by car, walking or bus
- Pick a category as a baseline and calculate the odds that a member of group i falls in category j as opposed to the baseline
- See the Stata do-file on Brightspace

82/ 83
Ordered Response Model

Regressions with a categorical, ordered dependent variable
- For instance, a happiness question: (1) very happy, (2) happy, (3) unhappy, or (4) very unhappy
- Ordered probit
- You may obtain the marginal effects for any of the values of your outcome

83/ 83
