Week 2, OLS


MICRO-ECONOMETRICS

ECO 6175

ABEL BRODEUR

Weeks 2 & 3

1/ 83
Statistical Tools

Outline:
- OLS
- Limitations of OLS
  - Omitted variable bias
  - Measurement error
  - Reverse causality
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

2/ 83
Ordinary Least Squares

Yi = β0 + β1 xi + εi

Yi = β̂0 + β̂1 xi + ε̂i

The value
Ŷi = β̂0 + β̂1 xi
is the predicted (fitted) value of Y at xi

3/ 83
Regression

β0 is the intercept: the value of Y when X is zero

β1 is the slope: the rate of change in Y for a one-unit change in X (the regression coefficient)

This implies a straight line. But since we never have a perfect line, we add an error term (epsilon)

OLS chooses the estimates that minimize the Residual Sum of Squares (RSS)

4/ 83
Let’s Solve!

min over β̂0, β̂1 :  Σi ε̂i² = Σi (Yi − Ŷi)²

Graph!

5/ 83
First-Order Conditions

∂RSS/∂β̂0 = −2 Σi (Yi − Ŷi) = 0    (1)

∂RSS/∂β̂1 = −2 Σi xi (Yi − Ŷi) = 0    (2)

6/ 83
First-Order Conditions

From eq. (1)

Σi (Yi − β̂0 − β̂1 xi) = 0

Ȳ − β̂0 − β̂1 x̄ = 0

7/ 83
First-Order Conditions
From eq. (2)

Σi xi (Yi − Ŷi) = 0

Σi xi (Yi − β̂0 − β̂1 xi) = 0

Substituting (1) into (2), where

−β̂0 = β̂1 x̄ − Ȳ

Σi xi (Yi − Ȳ + β̂1 x̄ − β̂1 xi) = 0
8/ 83
Least Squares Formulas

β̂1 = Σi xi (Yi − Ȳ) / Σi xi (xi − x̄)

Be careful: we need the xi not all equal, i.e. Σi (xi − x̄)² ≠ 0

β̂1 = Σi (Yi − Ȳ)(xi − x̄) / Σi (xi − x̄)²

9/ 83
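A minimal Python/numpy sketch (not part of the course materials) that applies these closed-form least-squares formulas to simulated data; the sample size and true coefficients are invented for illustration.

```python
import numpy as np

# Check the least-squares formulas on simulated data.
rng = np.random.default_rng(0)
n = 1_000
x = rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = 2.0 + 1.5 * x + eps          # true beta0 = 2, beta1 = 1.5

# beta1_hat = sum (x - xbar)(y - ybar) / sum (x - xbar)^2
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# beta0_hat from the first-order condition: ybar - beta1_hat * xbar
b0_hat = y.mean() - b1_hat * x.mean()

print(b0_hat, b1_hat)            # close to 2 and 1.5
```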
Condition

Under the condition that

Σi (xi − x̄)² ≠ 0

If xi is constant, each term equals its mean and the sum is 0.

We also need to check that the Hessian matrix is positive semidefinite

10/ 83
Hessian Matrix
 
H = [ ∂²RSS/∂β̂0²     ∂²RSS/∂β̂0∂β̂1 ]
    [ ∂²RSS/∂β̂1∂β̂0   ∂²RSS/∂β̂1²   ]

where Ŷi = β̂0 + β̂1 xi

H = [ 2N        2 Σi xi  ]
    [ 2 Σi xi   2 Σi xi² ]

11/ 83
Hessian Matrix

If vâr(X) > 0, then this is a positive definite matrix since all
the principal submatrices have positive determinants:

H = 2N [ 1             (1/N) Σi xi  ]
       [ (1/N) Σi xi   (1/N) Σi xi² ]

12/ 83
Hessian Matrix

If vâr(X) > 0, then this is a positive definite matrix since all
the principal submatrices have positive determinants:

det [ 1             (1/N) Σi xi  ]  =  (1/N) Σi xi² − x̄²  =  (1/N) Σi (xi − x̄)²
    [ (1/N) Σi xi   (1/N) Σi xi² ]

13/ 83
Multiple Independent Variables

Example: 2 independent variables

yi = β0 + β1 xi1 + β2 xi2 + εi

Obtaining OLS estimates:

β̂1 = [ Σi (xi2 − x̄2)² Σi (xi1 − x̄1)(yi − ȳ) − Σi (xi1 − x̄1)(xi2 − x̄2) Σi (xi2 − x̄2)(yi − ȳ) ]
     / [ Σi (xi1 − x̄1)² Σi (xi2 − x̄2)² − ( Σi (xi1 − x̄1)(xi2 − x̄2) )² ]

Estimated betas have partial-effect interpretations here.

When x2 is held fixed, β1 gives the change in y when x1 changes by one unit.

14/ 83
Multiple Independent Variables

   
[ ŷ1 ]   [ x11 β̂1 + · · · + x1j β̂j + · · · + x1k β̂k ]
[ ŷ2 ]   [ x21 β̂1 + · · · + x2j β̂j + · · · + x2k β̂k ]
[ ... ] = [ ...                                       ]
[ ŷi ]   [ xi1 β̂1 + · · · + xij β̂j + · · · + xik β̂k ]
[ ... ]   [ ...                                       ]
[ ŷN ]   [ xN1 β̂1 + · · · + xNj β̂j + · · · + xNk β̂k ]

15/ 83
Multiple Independent Variables

The linear model coefficients are written as a vector

β = (β0 , β1 , · · · , βp )′

where β0 is the intercept and βk is the slope corresponding to the kth covariate. For a given coefficient vector β, the vector of fitted values is given by the matrix-vector product

Ŷ = Xβ,

which is an n-dimensional vector.

16/ 83
Multiple Independent Variables

min over β̂1, · · · , β̂k :  Σi (yi − ŷi)² = Σi (yi − xi1 β̂1 − ... − xik β̂k)²

17/ 83
First-Order Conditions

∂RSS/∂β̂1 = −2 Σi xi1 ε̂i = 0

∂RSS/∂β̂2 = −2 Σi xi2 ε̂i = 0

...

∂RSS/∂β̂k = −2 Σi xik ε̂i = 0

18/ 83
Multiple Independent Variables

X′ : k × N;  ε̂ : N × 1;  0 : k × 1

X′ ε̂ = 0

X′ (Y − X β̂) = 0

β̂ = (X′X)⁻¹ X′Y

If the rank of X′X is k, then (X′X)⁻¹ exists

Note: rank(X′X) = rank(X) (hard to prove!)

19/ 83
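A small numpy sketch of the matrix formula β̂ = (X′X)⁻¹X′Y (illustrative only; the design matrix and coefficient values are made up). It also checks the first-order condition X′ε̂ = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = np.column_stack([np.ones(n),                    # constant
                     rng.normal(size=(n, k - 1))])  # two regressors
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(size=n)

# beta_hat = (X'X)^{-1} X'y  (solve the normal equations, no explicit inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
print(beta_hat)
print(X.T @ resid)   # first-order conditions: numerically ~ 0
```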
Assumptions
Assumptions about the true regression model and the data generating process
- Ideal conditions have to be met in order for OLS to be a good estimator
- BLUE: unbiased and efficient
- Best (variance of the OLS estimator is minimal) Linear Unbiased (expected value equal to the true value) Estimator

Need to be aware of the ideal conditions and their violations to be able to control for deviations from these conditions and provide results that are unbiased or at least consistent

20/ 83
Assumption 1

To consistently estimate β, we need to make a few assumptions:

E(X′µ) = 0

Because X contains a constant, this assumption is equivalent to saying that µ has mean zero and is uncorrelated with each regressor

21/ 83
Assumption 2

Assumption that the X matrix has full rank:

rank(X) = k

No perfect multicollinearity! This fails if and only if at least one of the regressors can be written as a linear function of the other regressors

In the case of perfect multicollinearity, the matrix X′X is singular and therefore cannot be inverted. Under these circumstances, the ordinary least-squares estimator does not exist

22/ 83
Assumption 3

E(µ² X′X) = σ² E(X′X)

where σ² ≡ E(µ²)

Because E(µ) = 0, σ² is also equal to var(µ)

- This assumption is equivalent to assuming that the squared error, µ², is uncorrelated with each xj, each xj², and all cross products of the form xj xk

23/ 83
Assumption 3

Use the law of iterated expectations:

E(µ²|X) = σ²

which is the same as var(µ|X) = σ² when E(µ|X) = 0

The constant conditional variance assumption for µ given X is stronger than needed

24/ 83
Significance and F -Test

Testing multiple linear restrictions

- A t-test is associated with each OLS coefficient
- But it is also possible to test multiple restrictions jointly (e.g. all slope coefficients jointly equal to zero)
- H0 : β1 = β2 = ... = βk = 0

The F-statistic is defined as:

F = [(SSRr − SSRur)/q] / [SSRur/(n − k − 1)]

and is distributed as F ∼ F(q, n−k−1)

25/ 83
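A hedged sketch of the F-statistic above, computed by running the restricted and unrestricted regressions and comparing their SSRs. The data and the restriction (dropping the last two regressors) are purely illustrative; scipy is used only for the F distribution's tail probability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 200, 3                       # k regressors plus a constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.8, 0.0, 0.0]) + rng.normal(size=n)

def ssr(Xm, y):
    b = np.linalg.lstsq(Xm, y, rcond=None)[0]
    e = y - Xm @ b
    return e @ e

q = 2                               # exclusion restrictions: last two slopes are zero
ssr_ur = ssr(X, y)                  # unrestricted: all regressors
ssr_r = ssr(X[:, : 1 + k - q], y)   # restricted: drop the last q regressors

F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
p_value = stats.f.sf(F, q, n - k - 1)
print(F, p_value)
```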
Significance and F -Test
F-statistic
- SSRr: Sum of Squared Residuals of the restricted model
- SSRur: Sum of Squared Residuals of the unrestricted model (all regressors)
- The F-statistic is always non-negative (SSRr is at least as large as SSRur)
- The F-test is thus a one-sided test

Notation
- k: number of regressors; n: number of observations; q: number of exclusion restrictions
- q is equal to the difference in degrees of freedom between the restricted and unrestricted models
26/ 83
Interpreting and Comparing Coefficients
The size of the slope parameters depends on the scaling of the variables!
- Not easy to compare effect sizes even when the scales are similar
- It is sometimes also hard to interpret coefficients without logs (elasticities)

Standardized coefficients (see the sketch after this list):
- Take the standard deviations of the dependent and independent variables into account
- How much y changes (in standard deviations) if x changes by one standard deviation instead of one unit
- Make it easier to compare coefficients across the independent variables
27/ 83
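A small numpy sketch (invented data) of the standardized-coefficient idea: the standardized slope equals the raw slope times sd(x)/sd(y), or equivalently the slope obtained after z-scoring both variables.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
x = rng.normal(scale=4.0, size=n)          # x measured on a "large" scale
y = 3.0 + 0.25 * x + rng.normal(size=n)

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Standardized coefficient: effect of a one-standard-deviation change in x,
# expressed in standard deviations of y.
beta_std = slope * x.std() / y.std()

# Same number from regressing the z-scored y on the z-scored x.
zx, zy = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()
check = np.sum(zx * zy) / np.sum(zx ** 2)
print(beta_std, check)
```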
Limitations of OLS

(I) Omitted variables bias

(II) Measurement error

(III) Simultaneous equations

28/ 83
Ability Bias and the Returns to Schooling

We would like to run the long regression

yi = α + ρSi + γAi + εi

where yi is log earnings, Si is schooling and Ai is ability. If we do not have a measure of ability we can only run the short regression

yi = αs + ρs Si + εsi .

What do we get?

29/ 83
(I) Omitted Variables
The relationship between the long and short regression
coefficients is given by the omitted variables bias (OVB)
formula
ρs = Cov(yi, Si) / Var(Si) = ρ + γ δAS

where δAS = Cov(Ai, Si) / Var(Si)

is the regression coefficient from a regression of Ai (the omitted variable) on Si (the included variable). The OVB formula is a mechanical relationship between the two regressions: it holds regardless of the causal interpretation of any of the coefficients

30/ 83
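A simulation sketch of the OVB formula (all numbers invented): the short-regression coefficient recovers ρ + γ·δAS, where δAS is the coefficient from regressing the omitted ability variable on schooling.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
ability = rng.normal(size=n)
schooling = 12 + 2 * ability + rng.normal(size=n)      # Cov(A, S) > 0
log_wage = 1.0 + 0.10 * schooling + 0.05 * ability + rng.normal(scale=0.3, size=n)

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

rho_short = slope(schooling, log_wage)        # short regression of y on S only
delta_AS = slope(schooling, ability)          # regression of omitted A on S
print(rho_short, 0.10 + 0.05 * delta_AS)      # OVB formula: the two should agree
```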
Griliches (1977)
The conventional wisdom is Cov(Ai, Si) > 0, so returns-to-schooling estimates will be biased up

Short regression estimates using the National Longitudinal Survey

yi = const + 0.068(0.003)Si + experience

Long regression estimates

yi = const + 0.059(0.003)Si + 0.0028(0.0005)IQi + experience

31/ 83
Bad Controls

In the quest for identifying causal effects, which variables belong on the right-hand side of a regression equation?
- Good: variables determining the treatment and correlated with the outcome (e.g. ability)
- Good: variables uncorrelated with the treatment but correlated with the outcome (may reduce standard errors)
- Bad: variables which are themselves outcomes of the treatment

32/ 83
Bad Controls

Some researchers regressing earnings on schooling (and experience) include controls for occupation. Does this make sense?

33/ 83
Bad Controls

Clearly we can think of schooling affecting access to higher-level occupations
- This gives rise to a two-equation system

yi = α + ρSi + γOi + εi
Oi = λ0 + λ1 Si + ui

You could think about these as a simultaneous equations system. Occupation is an endogenous variable. As a result, you cannot necessarily estimate the first equation by OLS. Bad control!

34/ 83
Classical Measurement Error

(II) Measurement error leads to bias

Example:

yi = α + βxi + εi

We do not observe xi; instead we have x̃i

x̃i = xi + wi

where cov(xi, wi) = 0 and cov(εi, wi) = 0. This is called the classical measurement error model

35/ 83
Classical Measurement Error
The bivariate regression coefficient we estimate is

β̂ = cov(yi, x̃i) / var(x̃i)
   = cov(α + βxi + εi, xi + wi) / var(xi + wi)
   = β var(xi) / var(xi + wi) = βλ

We see that β̂ is biased towards zero by an attenuation factor

λ = var(xi) / var(xi + wi)

which is the variance of the "signal" divided by the variance of the "signal plus noise"
36/ 83
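A simulation sketch of classical measurement error and the attenuation factor λ (all numbers invented): regressing y on the noisy x̃ recovers roughly βλ = β·var(x)/(var(x)+var(w)).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(scale=2.0, size=n)                  # true regressor, var = 4
w = rng.normal(scale=1.0, size=n)                  # classical measurement error, var = 1
x_tilde = x + w                                    # what we actually observe
y = 0.5 + 1.0 * x + rng.normal(scale=0.5, size=n)  # true beta = 1

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

lam = 4.0 / (4.0 + 1.0)                            # var(x) / var(x + w) = 0.8
print(slope(x_tilde, y), 1.0 * lam)                # estimate is attenuated toward beta*lambda
```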
Measurement Error in the Returns to
Schooling

Think of yi as log earnings and xi as schooling. Ignore age or experience for the moment

Ashenfelter and Krueger (1994) find λ = 0.9 for schooling

This means that if the true return to schooling is 0.1, we would expect an estimate of 0.09

37/ 83
Measurement Error With Two Regressors
Consider again the generic case first

yi = α + β1 x1i + β2 x2i + εi

and only x1i is subject to classical measurement error, i.e. cov(x1i, wi) = cov(x2i, wi) = 0

Then it can be shown that

β̂1 = β1 λ′

λ′ = (λ − R²12) / (1 − R²12)

where λ is the bivariate attenuation factor, and R²12 is the R² from the population regression of x̃1i on x2i
38/ 83
Measurement Error With Two Regressors
The short regression (on just x̃1i) coefficient is

β̂1,short = λ β1 + β2 δx2x̃1 = λ (β1 + β2 δx2x1)

where the estimate of β1 is biased both because of attenuation due to measurement error and because of omitted variables bias

The coefficient from the long regression is

β̂1,long = λ′ β1

Note that

λ′ < λ

but

β̂1,short may be smaller or larger than β̂1,long
39/ 83
Measurement Error With Two Regressors

Notice that it is more difficult to compare the bias from the short and long regressions now

λ′ < λ implies that the attenuation bias goes up when another regressor that is correlated with x̃1i is entered (because it absorbs some of the "good" variation)

There is less attenuation in the short regression, but there is also OVB now. It is not clear what the net effect is

40/ 83
Measurement Error in the Control
What about the coefficient β2? Even when there is no measurement error in x2i, the estimate of β2 will be biased:

β̂2 = β2 + β1 δx1x2 (1 − λ)/(1 − R²12)

Note that the bias will be larger, the larger are
- The measurement error
- The correlation between x1i and x2i

Intuition:
- β1 is attenuated, and hence does not reflect the full effect of x1i
- β2 will capture part of the effect of x1i, through the correlation with x2i
41/ 83
Measurement Error in the Returns to
Schooling

yi = α + ρSi + γAi + εi

where Si is schooling and Ai is ability. Suppose we only have a mismeasured version of schooling, S̃i. Then the short regression will give

ρ̂short = λρ + γδAS̃

and the long regression

ρ̂long = λ′ρ

If ability bias is upwards (δAS̃ > 0) it is not possible to say a priori which estimate will be closer to ρ
42/ 83
Measurement Error in Ability

Now suppose years of schooling is measured perfectly but we only have mismeasured ability Ãi. Then

ρ̂ = γ δAS̃ (1 − λ)/(1 − R²AS̃) + ρ.

If ability bias is upwards (δAS̃ > 0) then the return to schooling will be biased up, but by less than in the short regression. Controlling for mismeasured ability is better than controlling for nothing!

43/ 83
Numbers on the Griliches Example
Pick some numbers for the regression

yi = 0.1Si + 0.01Ai + εi

and set

λ = 0.9

σS̃ = 3, σA = 15, σAS = 22.5.

Then

δAS̃ = σAS / σS̃² = 22.5/9 = 2.5

and

ρ̂short = λρ + γδAS̃ = 0.9 × 0.1 + 0.01 × 2.5 = 0.115

Overestimating!
44/ 83
What about Long Regression
We first need

λ′ = (λ − R²S̃A) / (1 − R²S̃A)

which is

R²S̃A = (σAS / (σS̃ σA))² = (22.5/45)² = 0.25

λ′ = (0.9 − 0.25)/(1 − 0.25) = 0.867.

Then the long regression coefficient is

ρ̂long = λ′ρ = 0.867 × 0.1 = 0.087

Underestimating: so the short regression coefficient is too large and the long regression coefficient is too small.
45/ 83
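A tiny arithmetic check of the numbers on the two previous slides (a sketch only; it simply reproduces the calculations, assuming λ = 0.9, ρ = 0.1, γ = 0.01, σS̃ = 3, σA = 15 and Cov(A, S) = 22.5).

```python
# Reproduce the Griliches-example arithmetic from the previous two slides.
lam, rho, gamma = 0.9, 0.1, 0.01
sd_S, sd_A, cov_AS = 3.0, 15.0, 22.5

delta_AS = cov_AS / sd_S**2                 # 22.5 / 9 = 2.5
rho_short = lam * rho + gamma * delta_AS    # 0.9*0.1 + 0.01*2.5 = 0.115

R2 = (cov_AS / (sd_S * sd_A)) ** 2          # (22.5 / 45)^2 = 0.25
lam_prime = (lam - R2) / (1 - R2)           # (0.9 - 0.25) / 0.75 ~ 0.867
rho_long = lam_prime * rho                  # ~ 0.087

print(rho_short, rho_long)                  # 0.115 (too large) and ~0.087 (too small)
```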
Limitations of OLS

(III) Simultaneous equations

y1 = β0 + β1 y2 + µ (3)

y2 = α0 + α1 y1 + α2 x + ν (4)

Everything depends on everything: y1 and y2 are endogenous!

46/ 83
Limitations of OLS
OLS on eq. (3), forgetting to include eq. (4):

β̂1 = cov(y2, β0 + β1 y2 + µ) / var(y2)

β̂1 = β1 + cov(y2, µ) / var(y2)

...

E(β̂1) = β1 + [α1 σµ² / (1 − α1 β1)] / var(y2)

47/ 83
Other Issues

- Outliers? Weighting?
- Nonlinear models
- We will cover many other problems!

48/ 83
Statistical Tools

- OLS
- Limitations of OLS
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

49/ 83
Heteroscedasticity
Heteroscedasticity (Ancient Greek: “different dispersion”)
- The variance of the error terms differs across observations
- Can take many different forms!
- For instance, if the spread of the errors is not constant across the X values, heteroscedasticity is present

Example I: income and the variability of food consumption. If income increases, the variability of food consumption will decrease. A wealthy person always eats three meals whereas a poor person does not eat the same quantity of food every day

50/ 83
Example II

51/ 83
Example III

52/ 83
Check for Heteroscedasticity

Look at a plot of the residuals against the independent variables or the fitted values
- Calculate the residuals of the regression and plot them against the predicted values (ŷi)
- Easy to correct the standard errors in most statistical software (Stata: robust)
- Goldfeld-Quandt test
- There may also be spatial autocorrelation! (Stata: cluster)

53/ 83
Heteroscedasticity: Serial Correlation

Simple model: AR(1)

εt = ρεt−1 + µt

- |ρ| < 1, µt ∼ iid(0, σ²)
- We need to write the variance-covariance matrix of ε

54/ 83
Heteroscedasticity: Serial Correlation

εt = ρεt−1 + µt
εt = ρ(ρεt−2 + µt−1) + µt
εt = ρ²εt−2 + ρµt−1 + µt
εt = ρ²(ρεt−3 + µt−2) + ρµt−1 + µt
...

εt = ρ^r εt−r + Σ_{i=0}^{r−1} ρ^i µt−i

If |ρ| < 1 and r → ∞, then ρ^r → 0, so

εt = Σ_{i=0}^{∞} ρ^i µt−i

As t → ∞ we can calculate the moments of εt


55/ 83
Heteroscedasticity: Serial Correlation

var(εt) = E(εt²)

var(εt) = var( Σ_{i=0}^{∞} ρ^i µt−i )

Recall that µt ∼ iid

var(εt) = Σ_{i=0}^{∞} ρ^{2i} var(µt−i)

var(εt) = σ² Σ_{i=0}^{∞} ρ^{2i}

var(εt) = σ² / (1 − ρ²)
56/ 83
Heteroscedasticity: Serial Correlation
cov(εt, εt−r) = cov( ρ^r εt−r + Σ_{i=0}^{r−1} ρ^i µt−i , εt−r )

cov(εt, εt−r) = ρ^r var(εt−r)

cov(εt, εt−r) = ρ^r σ² / (1 − ρ²)

Ω = [σ² / (1 − ρ²)] ×

    [ 1         ρ         ρ²        · · ·  ρ^{t−1} ]
    [ ρ         1         ρ         · · ·  ρ^{t−2} ]
    [ ρ²        ρ         1         · · ·  ρ^{t−3} ]
    [ · · ·     · · ·     · · ·     · · ·  · · ·   ]
    [ ρ^{t−1}   ρ^{t−2}   ρ^{t−3}   · · ·  1       ]
57/ 83
Heteroscedasticity: Serial Correlation

AR(1) with |ρ| < 1 violates the assumption that there is no autocorrelation in the errors

A possible solution is to pre-multiply by Ω^{−1/2}:

Ω^{−1/2} Y = Ω^{−1/2} XB + Ω^{−1/2} ε

Y* = X*B + ε* → β̃

58/ 83
Heteroscedasticity: Serial Correlation

E(ε*) = E(Ω^{−1/2} ε) = Ω^{−1/2} E(ε) = 0

E(ε* ε*′) = E(Ω^{−1/2} ε ε′ Ω^{−1/2})

E(ε* ε*′) = Ω^{−1/2} E(ε ε′) Ω^{−1/2}

E(ε* ε*′) = Ω^{−1/2} Ω Ω^{−1/2}

E(ε* ε*′) = Id

59/ 83
Heteroscedasticity: Serial Correlation
In practice:

β̃ = (X*′X*)^{−1} X*′Y*
β̃ = (X′Ω^{−1/2}Ω^{−1/2}X)^{−1} (X′Ω^{−1/2}Ω^{−1/2}Y)
β̃ = (X′Ω^{−1}X)^{−1} (X′Ω^{−1}Y)

and:

var(β̃) = σ² (X*′X*)^{−1}
var(β̃) = σ² (X′Ω^{−1}X)^{−1}

Generalized least squares (GLS): takes heteroscedasticity into account

60/ 83
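A numpy sketch of GLS with the AR(1) covariance matrix from the earlier slides (ρ and the data are invented; in practice Ω is unknown, which is the point of the next slide). It builds the Toeplitz Ω and computes β̃ = (X′Ω⁻¹X)⁻¹X′Ω⁻¹Y.

```python
import numpy as np

rng = np.random.default_rng(6)
T, rho, sigma = 300, 0.7, 1.0

# Simulate y_t = 1 + 0.5 x_t + eps_t with AR(1) errors.
x = rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.normal(scale=sigma)
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(T), x])

# AR(1) covariance matrix: Omega[t, s] = sigma^2 / (1 - rho^2) * rho^|t - s|
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Omega = (sigma**2 / (1 - rho**2)) * rho**lags

Omega_inv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols, beta_gls)   # both near (1, 0.5); GLS is the efficient one here
```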
Heteroscedasticity: Serial Correlation

This is unfortunately infeasible in practice.

We need an estimator of Ω (n × n). We would need to estimate N(N+1)/2 terms!
- Something less ambitious is FGLS (feasible GLS)
- (1) We can use White standard errors
- (2) Or we can impose a parametric structure (e.g. Ω(θ))

61/ 83
Heteroscedasticity: Serial Correlation
White (1980):

Ω = [ σ1²   0     0      0   ]
    [ 0     σ2²   0      0   ]
    [ 0     0     · · ·  0   ]
    [ 0     0     0      σT² ]

where the σt² are unknown. Recall that

var(β̂) = (X′X)⁻¹ X′ΩX (X′X)⁻¹

and

X′ΩX / T = (1/T) Σt εt² xt xt′

62/ 83
Heteroscedasticity: Serial Correlation

White (1980):

(X′X)⁻¹ X′ΩX (X′X)⁻¹ = (X′X)⁻¹ [ Σt εt² xt xt′ ] (X′X)⁻¹

Weight each outer product xt xt′ by εt² and see whether there is a relationship between the two

63/ 83
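A numpy sketch of the White (1980) sandwich formula above, (X′X)⁻¹[Σt ε̂t² xt xt′](X′X)⁻¹, computed from OLS residuals on invented heteroscedastic data and compared with the naive σ̂²(X′X)⁻¹ standard errors.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000
x = rng.normal(size=n)
eps = rng.normal(size=n) * (0.5 + np.abs(x))     # error spread grows with |x|
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
meat = (X * e[:, None] ** 2).T @ X               # sum_t e_t^2 x_t x_t'
V_robust = XtX_inv @ meat @ XtX_inv              # White sandwich estimator
V_naive = (e @ e / (n - 2)) * XtX_inv            # assumes homoscedasticity

print(np.sqrt(np.diag(V_naive)))                 # too small here
print(np.sqrt(np.diag(V_robust)))                # larger, heteroscedasticity-robust
```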
Heteroscedasticity: Serial Correlation

If large ε² are coupled with large X values, then the robust standard errors are bigger than the non-robust standard errors
- The naive variance of β̂ols is too small, so coefficients may look significant when they should not be!

Another problem remains: spatial autocorrelation

64/ 83
Spatial Autocorrelation

Observations look alike within a group
- For instance, unemployment in year t in Ontario may be very similar to unemployment in year t−1 in Ontario

Very important if you analyze a program at the province level, state level, etc.

65/ 83
Spatial Autocorrelation
Z: n × p matrix of indicators (0 or 1) depending on whether the individual is within group p

Z = [ 1  0  0  · · ·  0 ]
    [ 1  0  0  · · ·  0 ]
    [ ...               ]
    [ 1  0  0  · · ·  0 ]
    [ 0  1  0  · · ·  0 ]
    [ 0  1  0  · · ·  0 ]
    [ ...               ]
    [ 0  1  0  · · ·  0 ]
    [ 0  0  1  · · ·  0 ]
    [ ...               ]
66/ 83
Spatial Autocorrelation
 
ZZ′ = [ 1 · · · 1   0 · · · 0   0 · · · 0 ]
      [ 1 · · · 1   0 · · · 0   0 · · · 0 ]
      [ 1 · · · 1   0 · · · 0   0 · · · 0 ]
      [ 0 · · · 0   1 · · · 1   0 · · · 0 ]
      [ 0 · · · 0   1 · · · 1   0 · · · 0 ]
      [ 0 · · · 0   1 · · · 1   0 · · · 0 ]
      [ 0 · · · 0   0 · · · 0   1 · · · 1 ]
      [ 0 · · · 0   0 · · · 0   1 · · · 1 ]
      [ 0 · · · 0   0 · · · 0   1 · · · 1 ]

67/ 83
Homoscedasticity: Summary

Homoscedasticity means that each person (i.e. each observation) is its own cluster. Homoscedasticity: the assumption that there is neither heteroscedasticity nor clustering
- In most cases, you should consider heteroscedasticity
- How do you cluster your standard errors? What is the spatial unit? (see the sketch after this list)
- Cluster at the province level if the policy is at the province level
- You need at least 20 clusters. Problematic for Canadian studies!
- Stata: reg Y X, r cluster(province)

68/ 83
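A numpy sketch of cluster-robust standard errors in the spirit of the Stata command above (the clusters, data, and the omission of small-sample corrections are all illustrative assumptions): the "meat" sums X_g′ε̂_g ε̂_g′X_g over clusters instead of over single observations.

```python
import numpy as np

rng = np.random.default_rng(8)
G, n_per = 30, 50                                  # 30 clusters ("provinces")
cluster = np.repeat(np.arange(G), n_per)
u_g = rng.normal(size=G)                           # common shock within each cluster
x = rng.normal(size=G * n_per)
y = 1.0 + 0.5 * x + u_g[cluster] + rng.normal(size=G * n_per)
X = np.column_stack([np.ones(G * n_per), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Cluster-robust "meat": sum over clusters of (X_g' e_g)(X_g' e_g)'
meat = np.zeros((2, 2))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ e[idx]
    meat += np.outer(s, s)

V_cluster = XtX_inv @ meat @ XtX_inv
print(np.sqrt(np.diag(V_cluster)))                 # cluster-robust SEs (no df correction)
```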
Statistical Tools

- OLS
- Limitations of OLS
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

69/ 83
Panel Data

Panel data imply having at least two observations for the same individual/province/country over time
- Allow us to control for time-invariant unobserved heterogeneity

Yit = Xit β + εit
where εit = αi + νit

We may control for αi (the individual fixed effect) if we have more than one data point for each individual

70/ 83
Panel Data

Note that panel data do not by themselves provide causal effects!!!
- Moreover, if you include individual fixed effects in your model, you cannot estimate the effects of time-invariant variables (e.g. gender)

(1) Balance
- A balanced panel is a data set in which all units are observed in all time periods
- Is it unbalanced for exogenous reasons? Selection (e.g. attrition/migration)?

71/ 83
Panel Data

(2) Types of variables:

Yit = Xit′ βx + Zi′ βz + Wt′ βw + εit

(3) Estimation
- Fixed effects: the αi are parameters to estimate
- Random effects: αi is a stochastic variable (it remains in the error term)

72/ 83
Panel Data

Fixed effects:

Yit = Xit β + αi + νit

Yit = Xit β + Σ_{k=1}^{N} αk dki + νit

dki equals 1 if k = i and 0 if k ≠ i. We treat each αk as a parameter to estimate

73/ 83
Panel Data

Fixed effects: average for individual i over time t

Ȳi. = X̄i. β + αi + ν̄i.

Yit − Ȳi. = (Xit − X̄i.)β + νit − ν̄i.

αi disappears!

74/ 83
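A numpy sketch of the within (fixed-effects) transformation just described (panel dimensions and parameter values are invented): demeaning Y and X by individual removes αi, and OLS on the demeaned data recovers β.

```python
import numpy as np

rng = np.random.default_rng(9)
N, T = 200, 5
i = np.repeat(np.arange(N), T)                     # individual index for each row
alpha = rng.normal(scale=2.0, size=N)              # individual fixed effects
x = 0.8 * alpha[i] + rng.normal(size=N * T)        # x correlated with alpha_i
y = 1.5 * x + alpha[i] + rng.normal(size=N * T)    # true beta = 1.5

def demean_by(v, groups):
    means = np.zeros(groups.max() + 1)
    np.add.at(means, groups, v)                    # group sums
    means /= np.bincount(groups)                   # group means
    return v - means[groups]

x_w, y_w = demean_by(x, i), demean_by(y, i)        # within transformation: alpha_i drops out
beta_fe = np.sum(x_w * y_w) / np.sum(x_w ** 2)
beta_pooled = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(beta_pooled, beta_fe)                        # pooled OLS is biased up; FE is ~1.5
```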
Panel Data

Random effects: useful if αi is independent of X

νit ∼ iid(0, σν²)

αi ∼ iid(0, σα²)

α and ν are not correlated

75/ 83
Panel Data

Random effects:
- In practice, FGLS is needed!
- Harder to estimate, but consistent and efficient if E(α|X) = 0

Hausman test: tests the equality of the coefficients (β̂FE and β̂RE). If the coefficients are different, use fixed effects!

76/ 83
Statistical Tools

- OLS
- Limitations of OLS
- Heteroscedasticity
- Panel Data
- Discrete Choice Models

77/ 83
Discrete Choice Models

Regressions with a binary dependent variable
- So far we have considered only cases in which the dependent variable is continuous
- Interpret the regression as modeling the probability that the dependent variable equals one
- For a binary variable: E(Y) = Pr(Y = 1)

78/ 83
Linear Probability Model

OLS regression with a binary dependent variable

Yi = β0 + β1 X1i + ... + βk Xki + µi

- β1 expresses the change in the probability that Y = 1 associated with a unit change in X1

Pr(Y = 1|X1, ..., Xk) = β0 + β1 X1 + ... + βk Xk = Ŷ

79/ 83
Linear Probability Model

Issues:
- Nonconforming predicted probabilities: the LPM can predict probabilities outside the range 0-1
- Heteroscedastic by construction

Need to use probit or logit models to solve the first issue

80/ 83
Probit and Logit

Bound predicted values between 0 and 1:
- Transform a linear index into something that ranges from 0 to 1
- Use a cumulative distribution function (CDF) for this transformation

For the probit, the CDF is the standard normal CDF Φ (logit: logistic function)

Pr(Y = 1|X1, ..., Xk) = Φ(β0 + β1 X1 + ... + βk Xk)

81/ 83
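A small sketch of the probit transformation above (coefficients are invented): the linear index Xβ is passed through the standard normal CDF Φ so that predicted probabilities stay between 0 and 1; scipy's norm.cdf plays the role of Φ.

```python
import numpy as np
from scipy.stats import norm

beta0, beta1 = -1.0, 0.8                      # invented coefficients
x = np.linspace(-4, 4, 9)

index = beta0 + beta1 * x                     # linear index, unbounded
p_lpm = index                                 # linear probability model: can leave [0, 1]
p_probit = norm.cdf(index)                    # probit: Phi(.) keeps it in (0, 1)

for xi, pl, pp in zip(x, p_lpm, p_probit):
    print(f"x = {xi:5.1f}   LPM: {pl:6.2f}   probit: {pp:5.3f}")
```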
Multinomial Logit

Regressions with a categorical, unordered dependent variable
- For instance, commuting by car, walking or bus
- Pick a category as a baseline and calculate the odds that a member of group i falls in category j as opposed to the baseline
- See the Stata do-file on Brightspace

82/ 83
Ordered Response Model

Regressions with a categorical, ordered dependent variable
- For instance, a happiness question: (1) very happy, (2) happy, (3) unhappy, or (4) very unhappy
- Ordered probit
- You may obtain the marginal effects for any of the values of your outcome

83/ 83
