Microeconometrie Chapitre2 MultinomialModels

Chapter 2
Multinomial Models
Théophile T. Azomahou
University Clermont Auvergne, CNRS, CERDI
Maastricht University, School of Business and Economics
Email: theophile.azomahou@uca.fr
Multinomial logit vs. probit

Ordered outcomes
Multivariate discrete outcomes
Théophile T. Azomahou (CERDI) Février 20-28, 2020 1 / 24
Introduction
1. Introduction
Chapter 1 considered models for discrete outcome variables that can take
two possible values (0/1). Here we consider several possible outcomes,
usually mutually exclusive.
Examples: i) different ways to commute to work (bus, car, cycle, walking);
ii) various types of health insurance (fee-for-service, managed care, or none);
iii) different employment status (full-time, part-time, or none); iv) choice of
recreational site, occupational choice, product choice; etc.
Estimation is most often by maximum likelihood because the data are clearly
multinomially distributed. For some complications, however, moment-based
estimation is used instead.
Different multinomial models arise owing to different functional forms for
the probabilities of the multinomial distribution, similar to the differences
between probit and logit in the binary case.
A distinction is also made between models where regressors vary across
alternatives for a given individual and models where regressors are constant
across alternatives.
Example: in transportation mode choice some regressors, such as travel
time or cost, will vary with choices whereas others, such as age, are choice
invariant.
Some structure can be placed on the decision-making process, such as a
natural ordering of alternatives or sequencing of decisions. In practice many
different multinomial models are used.
Family of models to be considered here: unordered, ordered, multivariate.

Multinomial models
2. Multinomial models
Assume there are m alternatives and the dependent variable y is defined to take
value j if the jth alternative is taken, j = 1, ..., m. (Some authors instead
consider m + 1 alternatives with j = 0, 1, ..., m.) Define the probability that
alternative j is chosen as:
\[ p_j = P(y = j), \quad j = 1, \dots, m \tag{1} \]
We introduce m binary variables for each observation y:
\[ y_j = \begin{cases} 1 & \text{if } y = j \\ 0 & \text{if } y \neq j \end{cases} \tag{2} \]
Thus $y_j$ equals one if alternative j is the observed outcome and the remaining $y_k$
equal zero, so for each observation on y exactly one of $y_1, y_2, \dots, y_m$ will be
nonzero.

The multinomial density for one observation can then be conveniently written
as:
\[ f(y) = p_1^{y_1} \times p_2^{y_2} \times \cdots \times p_m^{y_m} = \prod_{j=1}^{m} p_j^{y_j} \tag{3} \]
For regression models, introduce a subscript i for the ith individual and regressors
$x_i$. A model for the probability that individual i chooses the jth alternative is:
\[ p_{ij} = P(y_i = j) = F_j(x_i, \beta), \quad j = 1, \dots, m; \; i = 1, \dots, N \tag{4} \]
The functional form for $F_j$ should be such that probabilities lie between 0 and 1
and sum over j to one. Different functional specifications for $F_j$ correspond to
specific models, notably multinomial logit, nested logit, multinomial probit,
ordered, sequential, and multivariate models.

3. Estimation: Maximum Likelihood

The multinomial density for one observation is given in (3). The likelihood
function for a sample of N independent observations is then:
\[ L = \prod_{i=1}^{N} \prod_{j=1}^{m} p_{ij}^{y_{ij}} \tag{5} \]
The log-likelihood function is:
\[ \ln L = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln p_{ij} \tag{6} \]
where $p_{ij} = F_j(x_i, \beta)$ is a function of parameters $\beta$ and regressors, defined in (4).

The first-order conditions for the MLE $\hat{\beta}$ are:
\[ \frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{N} \sum_{j=1}^{m} \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} = 0 \tag{7} \]
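As a minimal sketch of how the log-likelihood (6) is evaluated, the function below sums the log of the probability of the chosen alternative for each observation. The function name and the toy data are illustrative assumptions, not part of the lecture; in practice the probabilities would come from a model $F_j(x_i, \beta)$.

```python
import math

def multinomial_loglik(y, probs):
    """Log-likelihood (6): sum_i sum_j y_ij * ln p_ij.

    y     : observed alternatives, coded 1..m as in the text
    probs : per-observation probability lists [p_i1, ..., p_im]
    """
    total = 0.0
    for y_i, p_i in zip(y, probs):
        # Only the chosen alternative has y_ij = 1, so one term survives per i.
        total += math.log(p_i[y_i - 1])
    return total

# Hypothetical toy data: three individuals, m = 3 alternatives.
y = [1, 3, 2]
probs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.1, 0.7, 0.2],
]
loglik = multinomial_loglik(y, probs)
```

Maximizing this quantity over $\beta$ (with a numerical optimizer) gives the ML estimate.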

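Consistency of the ML estimator

The distribution of $y_i$ is multinomial, so correct specification of the data
generating process (DGP) means correct specification of the functional forms
$F_j(x_i, \beta)$ for the probabilities $p_{ij}$. This ensures consistency, as then $E(y_{ij} \mid x) = p_{ij}$.
Taking the expectation of (7) gives:
\[ E\left[\frac{\partial \ln L}{\partial \beta}\right] = E\left[\sum_{i=1}^{N} \sum_{j=1}^{m} \frac{E(y_{ij} \mid x)}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta}\right] = E\left[\sum_{i=1}^{N} \sum_{j=1}^{m} \frac{\partial p_{ij}}{\partial \beta}\right] \]
which is zero because $\sum_{j=1}^{m} p_{ij} = 1$ implies $\sum_{j=1}^{m} \partial p_{ij} / \partial \beta = 0$.

Asymptotic distribution of the ML estimator

The usual asymptotic theory for ML applies (differentiating (7) with respect to $\beta'$):
\[ \hat{\beta}_{ML} \sim N\left[\beta, \left(\sum_{i=1}^{N} \sum_{j=1}^{m} \left(\frac{1}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} \frac{\partial p_{ij}}{\partial \beta'} - \frac{\partial^2 p_{ij}}{\partial \beta \partial \beta'}\right)\right)^{-1}\right] \]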

4. Multinomial Logit
The simplest specification is the multinomial logit model, proposed by Luce
(1959). The commonly used variants of this model differ according to whether or
not regressors vary across alternatives.
4.1 Conditional Logit Model (CL)

For alternative-varying regressors the conditional logit model is used. The CL
model specifies:
\[ p_{ij} = \frac{e^{x_{ij}'\beta}}{\sum_{l=1}^{m} e^{x_{il}'\beta}}, \quad j = 1, \dots, m \tag{8} \]
Because $\sum_{j=1}^{m} p_{ij} = 1$, an equivalent model is obtained by defining $x_{ij}$ to be
deviations of regressors from values of alternative 1, say, and setting $x_{i1} = 0$.
The marginal effects are given by (exercise):
\[ \frac{\partial p_{ij}}{\partial x_{ik}} = p_{ij}(\delta_{ijk} - p_{ik})\beta = \begin{cases} p_{ij}(1 - p_{ik})\beta & \text{if } j = k \\ -p_{ij}\,p_{ik}\,\beta & \text{if } j \neq k \end{cases} \]
where $\delta_{ijk} = 1[j = k]$.
In this case, the sign of the own marginal effects is the same as the sign of the
corresponding coefficient in $\beta$, while the sign of the cross effects is opposite
to the sign of the coefficient.
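The CL probabilities (8) and the own and cross marginal effects can be sketched as follows. The commute data, the coefficient values, and the function names are hypothetical and chosen only for illustration.

```python
import math

def cl_probs(x, beta):
    """Conditional logit probabilities (8).

    x    : list of m regressor vectors x_ij, one per alternative
    beta : common coefficient vector
    """
    v = [sum(b * xr for b, xr in zip(beta, x_j)) for x_j in x]
    vmax = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

def cl_marginal_effect(x, beta, j, k):
    """d p_ij / d x_ik = p_ij * (1[j == k] - p_ik) * beta (0-based j, k)."""
    p = cl_probs(x, beta)
    delta = 1.0 if j == k else 0.0
    return [p[j] * (delta - p[k]) * b for b in beta]

# Hypothetical data: [travel time in minutes, cost] for car, bus, cycle.
x = [[20.0, 4.0], [35.0, 2.0], [50.0, 0.5]]
beta = [-0.05, -0.30]  # assumed values, not estimates

p = cl_probs(x, beta)
own = cl_marginal_effect(x, beta, 0, 0)    # effect of car attributes on P(car)
cross = cl_marginal_effect(x, beta, 0, 1)  # effect of bus attributes on P(car)
```

With these negative coefficients, `own` is negative in both components and `cross` is positive: making the bus slower or more expensive raises the probability of choosing the car.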
4.2 Multinomial Logit Models (MNL)

When instead the regressors do not vary over alternatives, the multinomial logit
model is used. The MNL model specifies:
\[ p_{ij} = \frac{e^{x_i'\beta_j}}{\sum_{l=1}^{m} e^{x_i'\beta_l}}, \quad j = 1, \dots, m \tag{9} \]
The marginal effects in this model are the effect of changing a regressor by one
unit on the probabilities of choosing each alternative (exercise):
\[ \frac{\partial p_{ij}}{\partial x_i} = p_{ij}\left(\beta_j - \sum_{h=1}^{m} p_{ih}\beta_h\right) \equiv p_{ij}\left(\beta_j - \bar{\beta}_p\right) \tag{10} \]
where $\bar{\beta}_p = \sum_h p_{ih}\beta_h$ is a probability-weighted average of the
alternative-specific coefficients, using the choice probabilities $p_{ih}$ as weights.
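The sketch below evaluates (9) and (10) for one individual. The regressor values and coefficients are hypothetical, with the first alternative normalized to zero coefficients. It is set up so that one coefficient is positive while the corresponding marginal effect is negative.

```python
import math

def mnl_probs(x, betas):
    """MNL probabilities (9); betas[j] is the coefficient vector of alternative j."""
    v = [sum(b * xr for b, xr in zip(beta_j, x)) for beta_j in betas]
    vmax = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

def mnl_marginal_effects(x, betas):
    """Marginal effects (10): me[j][r] = p_j * (beta_jr - sum_h p_h * beta_hr)."""
    p = mnl_probs(x, betas)
    m, n_reg = len(betas), len(x)
    beta_bar = [sum(p[h] * betas[h][r] for h in range(m)) for r in range(n_reg)]
    return [[p[j] * (betas[j][r] - beta_bar[r]) for r in range(n_reg)]
            for j in range(m)]

# Hypothetical individual with two regressors; alternative 1 is the base.
x = [1.0, 2.0]
betas = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.5]]  # assumed values, not estimates

p = mnl_probs(x, betas)
me = mnl_marginal_effects(x, betas)
# Alternative 2 has a positive coefficient on the second regressor (0.1), yet
# me[1][1] is negative because 0.1 lies below the weighted average beta_bar.
```

Across alternatives the marginal effects of each regressor sum to zero, since the probabilities always sum to one.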

Remarks:
From this expression we can see that the sign of a parameter estimate does
not necessarily correspond to the sign of the effect of an increase in the
regressor on the probability of choosing this alternative.
In that sense, it does not make much sense to test whether a coefficient is
different from zero or not. More subtly, the sign of the individual marginal
effects can differ across individuals, as the weighted average β̄ p uses
individual-specific choice probabilities as weights.
The two models can be combined into what some authors call a mixed logit
model.

4.3 Regression Parameter Interpretation
Remarks:
For multinomial models, there is not necessarily a one-to-one
correspondence between the sign of a coefficient and the sign of its effect
on the choice probability. We focus on marginal effects on the choice
probabilities of a change in the regressor for a given individual.
Elasticities can then be computed by multiplying the marginal effect by the
current regressor value and dividing by the probability. Typically these are
then averaged over individuals to give an average marginal effect or average
elasticity.
In the CL model, if the regression coefficient is positive then an increase in
the corresponding component of the regressor value for the kth alternative
increases the probability of the kth alternative and decreases the probability
of the other alternatives.
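The elasticity calculation described above can be sketched for the conditional logit case: the own marginal effect is multiplied by the regressor value, divided by the probability, and then averaged over individuals. The two-person sample and the coefficients are hypothetical.

```python
import math

def cl_probs(x, beta):
    """Conditional logit probabilities for one individual."""
    v = [sum(b * xr for b, xr in zip(beta, x_j)) for x_j in x]
    vmax = max(v)
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

def own_elasticity(x, beta, j, r):
    """Elasticity of p_ij with respect to regressor r of alternative j."""
    p = cl_probs(x, beta)
    marginal = p[j] * (1.0 - p[j]) * beta[r]  # own marginal effect
    return marginal * x[j][r] / p[j]

# Hypothetical sample: per individual, [travel time, cost] for car, bus, cycle.
sample = [
    [[20.0, 4.0], [35.0, 2.0], [50.0, 0.5]],
    [[30.0, 5.0], [25.0, 2.0], [40.0, 0.5]],
]
beta = [-0.05, -0.30]  # assumed values, not estimates

elasticities = [own_elasticity(x_i, beta, 0, 0) for x_i in sample]
average_elasticity = sum(elasticities) / len(elasticities)
```

Here `average_elasticity` is the average own elasticity of the car probability with respect to car travel time.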

4.4 Comparison to Base Category
The coefficients in the CL and MNL models can also be given a more direct
logit-like interpretation in terms of relative risk. This is because the models can
be reexpressed as binary logit models.
For the MNL model, comparison is to a base category, which is the
alternative normalized to have coefficients equal to zero. To see this, note
that the multinomial logit probabilities (9) imply that the conditional
probability of observing alternative j given that either alternative j or
alternative k is observed is:
\[ P(y = j \mid y = j \text{ or } k) = \frac{p_j}{p_j + p_k} = \frac{e^{x'\beta_j}}{e^{x'\beta_j} + e^{x'\beta_k}} = \frac{e^{x'(\beta_j - \beta_k)}}{1 + e^{x'(\beta_j - \beta_k)}} \tag{11} \]
which is a logit model with coefficient $(\beta_j - \beta_k)$.

Suppose normalization is on alternative 1, so that $\beta_1 = 0$. Then
\[ P(y = j \mid y = j \text{ or } 1) = \frac{e^{x'\beta_j}}{1 + e^{x'\beta_j}} \tag{12} \]
and $\beta_j$ can be interpreted in the same way as the logit model coefficient for
binary choice between alternatives j and 1.
For interpretation to be really useful one needs to have a natural base
category. For example, if interest lies in comparing alternative commute modes
with traveling by car, then normalize the coefficients for the car alternative
to equal zero.
CL: A similar approach can also be applied to the CL model, with
\[ P(y = j \mid y = j \text{ or } k) = \frac{e^{(x_{ij} - x_{ik})'\beta}}{1 + e^{(x_{ij} - x_{ik})'\beta}} \tag{13} \]
and normalization is now with respect to regressor values for a base category.

5. Independence of Irrelevant Alternatives (IIA)

A limitation of the CL and MNL models is that discrimination among the m
alternatives reduces to a series of pairwise comparisons that are unaffected
by the characteristics of alternatives other than the pair under consideration.
This is clear from (11) and (13), which show that the MNL and CL models reduce
to a binary choice logit model between any pair of choices. The conditional
probability does not depend on other alternatives.
This weakness of CL and MNL is known in the literature formally as the
assumption of independence of irrelevant alternatives: it assumes that
the error terms are uncorrelated across alternatives. It can be tested by a
Hausman test (see Hausman and McFadden, 1984).
Much of the econometrics literature has focused on alternative unordered
models that do not have this weakness, such as multinomial probit,
nested logit, and Markov chain models (the last requires panel data).
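The IIA property can be illustrated numerically: in a logit model the odds between two alternatives are the same whether or not a third alternative is available. The utility values below are hypothetical.

```python
import math

def logit_probs(v):
    """Logit choice probabilities from deterministic utilities v_1, ..., v_m."""
    vmax = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    total = sum(e)
    return [ej / total for ej in e]

# Hypothetical utilities x'beta_j for three alternatives.
v_full = [0.0, 0.4, 1.1]

p_full = logit_probs(v_full)     # all three alternatives available
p_sub = logit_probs(v_full[:2])  # third alternative removed

odds_full = p_full[1] / p_full[0]
odds_sub = p_sub[1] / p_sub[0]
# Both odds equal exp(v_2 - v_1), whatever the third alternative looks like.
```

The individual probabilities change when the third alternative is dropped, but their ratio does not, which is exactly the restriction that models with correlated errors relax.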

6. Multinomial Probit (MNP)

An alternative and obvious way to introduce correlation across choices in the
unobserved component is to work with normally distributed errors. The MNP
model is an m-choice multinomial model, with utility of the jth choice given by:
\[ U_j = V_j + \varepsilon_j, \quad j = 1, \dots, m \tag{14} \]
where $V_j$ denotes the deterministic component of utility and $\varepsilon_j$ the random
component.
Random Utility Model: For the ith individual, usually, $V_{ij} = x_{ij}'\beta$ or $V_{ij} = x_i'\beta_j$.
One can suppress subscript i for notational convenience. The chosen alternative is
that with the highest utility, so that:
\[ \begin{aligned} P(y = j) &= P(U_j \geq U_k), \quad \text{all } k \neq j \\ &= P(U_k - U_j \leq 0), \quad \text{all } k \neq j \\ &= P(\varepsilon_k - \varepsilon_j \leq V_j - V_k), \quad \text{all } k \neq j \\ &= P(\tilde{\varepsilon}_{kj} \leq -\tilde{V}_{kj}), \quad \text{all } k \neq j \end{aligned} \tag{15} \]
where the tilde and second subscript j denote differencing with respect to
reference alternative j, that is, $\tilde{\varepsilon}_{kj} = \varepsilon_k - \varepsilon_j$ and $\tilde{V}_{kj} = V_k - V_j$.
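Example: consider the expression for P(y = 1) in a three-choice model (our
example on social protection). Using the last equality in (15) and defining
$\tilde{\varepsilon}_{31} = \varepsilon_3 - \varepsilon_1$ and $\tilde{\varepsilon}_{21} = \varepsilon_2 - \varepsilon_1$, we have:
\[ \begin{aligned} P(y = 1) &= P(\tilde{\varepsilon}_{21} \leq -\tilde{V}_{21}, \; \tilde{\varepsilon}_{31} \leq -\tilde{V}_{31}) \\ &= \int_{-\infty}^{-\tilde{V}_{31}} \int_{-\infty}^{-\tilde{V}_{21}} f(\tilde{\varepsilon}_{21}, \tilde{\varepsilon}_{31}) \, d\tilde{\varepsilon}_{21} \, d\tilde{\varepsilon}_{31} \end{aligned} \tag{16} \]
which is a bivariate integral that generally does not have an analytical solution.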

Estimation of MNP
We assume that the errors are jointly normally distributed, with
\[ \varepsilon \sim N(0, \Sigma) \tag{17} \]
where $\varepsilon = [\varepsilon_1, \dots, \varepsilon_m]'$ is an m × 1 vector.

Remark: Different MNP models arise from different specifications of the
covariance matrix $\Sigma$. Note that if the errors are uncorrelated the MNP still yields
no closed-form solution for the probabilities, and it is easier to assume instead that
the errors are extreme value distributed and use the CL or MNL models.
Evaluating the MNP probabilities requires values of $\beta$ and $\Sigma$, which in practice
must be estimated. Because the integral has no closed form, estimation uses
simulated maximum likelihood, with the probabilities approximated by Monte Carlo
integration or by the Geweke, Hajivassiliou and Keane (GHK) simulator, to evaluate
the log-likelihood:
\[ \ln \hat{L}(\beta, \Sigma) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \hat{p}_{ij} \tag{18} \]
where the $\hat{p}_{ij}$ are simulated probabilities.
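To make the idea of simulated probabilities concrete, here is a minimal sketch of the crude frequency simulator: draw the errors from N(0, Σ), form the utilities, and record how often each alternative has the highest utility. This is not the GHK simulator, which is smoother and more efficient. The function names, utilities, and covariance matrix are hypothetical.

```python
import math
import random

def cholesky(S):
    """Lower-triangular L with L L' = S, for a positive definite matrix S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L

def mnp_probs_frequency(V, Sigma, draws=20000, seed=0):
    """Crude frequency simulator for P(y = j) = P(U_j >= U_k for all k)."""
    rng = random.Random(seed)
    m = len(V)
    L = cholesky(Sigma)
    counts = [0] * m
    for _ in range(draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # eps = L z has covariance matrix Sigma
        eps = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        U = [V[j] + eps[j] for j in range(m)]
        counts[U.index(max(U))] += 1
    return [c / draws for c in counts]

# Hypothetical three-choice example.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_equal = mnp_probs_frequency([0.0, 0.0, 0.0], identity)
p_shift = mnp_probs_frequency([1.0, 0.0, 0.0], identity)
```

With equal utilities and independent errors each probability is close to one third. The frequency simulator can return exact zeros and is not smooth in the parameters, which is one reason smoother simulators are preferred in practice.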

7. Ordered Multinomial Models

Previously: unordered models.
Now: the outcome has a natural ordering.
Analysis is straightforward as appropriate models are well established and
estimation is again by MLE, with different models leading to different
specifications of the probabilities $p_{ij}$.
Suppose that there is a natural ordering of alternatives. For example, self-rated
health status may be one of excellent, good, fair, or poor. Such data can be
estimated by an unordered multinomial model, but a more parsimonious and
sensible model is one that takes account of this ordering.
The starting point is an index model, with a single latent variable
\[ y_i^* = x_i'\beta + u_i \tag{19} \]
where x here does not include an intercept. As $y^*$ crosses a series of increasing
unknown thresholds we move up the ordering of alternatives.

For example, for very low $y^*$ health status is poor, for $y^* > \alpha_1$ health status
improves to fair, for $y^* > \alpha_2$ it improves further to good, and so on. In general,
for an m-alternative ordered model we define:
\[ y_i = j \quad \text{if } \alpha_{j-1} < y_i^* \leq \alpha_j \tag{20} \]
where $\alpha_0 = -\infty$ and $\alpha_m = \infty$. Then
\[ \begin{aligned} P(y_i = j) &= P(\alpha_{j-1} < y_i^* \leq \alpha_j) \\ &= P(\alpha_{j-1} < x_i'\beta + u_i \leq \alpha_j) \\ &= P(\alpha_{j-1} - x_i'\beta < u_i \leq \alpha_j - x_i'\beta) \\ &= F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta) \end{aligned} \tag{21} \]
where F is the CDF of u. The regression parameters $\beta$ and the (m − 1) threshold
parameters $\alpha_1, \dots, \alpha_{m-1}$ are obtained by maximizing the log-likelihood with $p_{ij}$
defined in (21). For the ordered logit model u is logistic distributed with
$F(z) = e^z / (1 + e^z)$, and for the ordered probit model u is standard normal
distributed and F(·) is the standard normal CDF.
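The probabilities in (21) are differences of a CDF evaluated at successive thresholds. The sketch below computes them for both the logistic and the standard normal CDF; the index value and the thresholds are hypothetical.

```python
import math

def logistic_cdf(z):
    """Logistic CDF, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probs(xb, cutoffs, cdf):
    """P(y = j) = F(alpha_j - x'b) - F(alpha_{j-1} - x'b), for j = 1, ..., m.

    cutoffs holds alpha_1 < ... < alpha_{m-1}; alpha_0 = -inf, alpha_m = +inf.
    """
    cum = [0.0] + [cdf(a - xb) for a in cutoffs] + [1.0]
    return [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical four-category example (poor, fair, good, excellent).
cutoffs = [-1.0, 0.0, 1.5]
p_logit = ordered_probs(0.3, cutoffs, logistic_cdf)
p_probit = ordered_probs(0.3, cutoffs, normal_cdf)
p_logit_high = ordered_probs(1.3, cutoffs, logistic_cdf)  # larger index x'beta
```

Raising the index $x_i'\beta$ moves probability mass away from the lowest category and toward the highest one.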

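The sign of the regression parameters $\beta$ can be interpreted as determining
whether or not the latent variable $y^*$ increases with the regressor. For marginal
effects on the probabilities:
\[ \frac{\partial P(y_i = j)}{\partial x_i} = \left[F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta)\right]\beta \tag{22} \]
where $F'$ denotes the derivative of F. The sign pattern follows from differentiating
(21): each term $F(\alpha - x_i'\beta)$ has derivative $-F'(\alpha - x_i'\beta)\beta$ with respect to $x_i$.
The term in brackets can be positive or negative.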
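As a check on the marginal effects, the sketch below differentiates (21) for the ordered logit case, where $F' = F(1 - F)$. Since each term $F(\alpha - x_i'\beta)$ has derivative $-F'(\alpha - x_i'\beta)\beta$, the effect on $P(y_i = j)$ is $[F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta)]\beta$. All numerical values are hypothetical.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_pdf(z):
    F = logistic_cdf(z)
    return F * (1.0 - F)

def index(x, beta):
    return sum(b * xr for b, xr in zip(beta, x))

def ordered_logit_probs(x, beta, cutoffs):
    """Probabilities (21) with F the logistic CDF."""
    xb = index(x, beta)
    cum = [0.0] + [logistic_cdf(a - xb) for a in cutoffs] + [1.0]
    return [cum[j] - cum[j - 1] for j in range(1, len(cum))]

def ordered_logit_marginal_effects(x, beta, cutoffs):
    """me[j][r] = [F'(alpha_{j-1} - x'b) - F'(alpha_j - x'b)] * beta_r."""
    xb = index(x, beta)
    # F' vanishes at alpha_0 = -inf and alpha_m = +inf
    dens = [0.0] + [logistic_pdf(a - xb) for a in cutoffs] + [0.0]
    return [[(dens[j - 1] - dens[j]) * b for b in beta]
            for j in range(1, len(dens))]

# Hypothetical values.
x = [1.0, 0.5]
beta = [0.8, -0.4]
cutoffs = [-1.0, 0.0, 1.5]
me = ordered_logit_marginal_effects(x, beta, cutoffs)
```

For the lowest category the effect has the opposite sign to the coefficient, for the highest category it has the same sign, and for interior categories the sign depends on the data.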

8. Multivariate Discrete Outcomes

For simplicity consider bivariate discrete data $(y_{1i}, y_{2i})$. For example, in a joint
model of labor supply and fertility the dependent variables $(y_{1i}, y_{2i})$ for individual
i may be $y_{1i} = 2$ if the individual works and $y_{1i} = 1$ if not, and $y_{2i} = 2$ if the
individual has children and $y_{2i} = 1$ if not.
More generally, $y_1$ may take values $1, \dots, m_1$ and $y_2$ may take values $1, \dots, m_2$.
For individual i define
\[ p_{ijk} = P(y_{1i} = j, y_{2i} = k), \quad j = 1, \dots, m_1; \; k = 1, \dots, m_2 \tag{23} \]
Note that the $p_{ijk}$ define probabilities of mutually exclusive events and
$\sum_j \sum_k p_{ijk} = 1$. Define $m_1 \times m_2$ corresponding binary indicator variables
$y_{ijk} = 1$ if $(y_{1i} = j, y_{2i} = k)$ and $y_{ijk} = 0$ otherwise. Then the joint density for the
ith observation is
\[ f(y_{1i}, y_{2i}) = \prod_{j=1}^{m_1} \prod_{k=1}^{m_2} p_{ijk}^{y_{ijk}} \tag{24} \]
The log-likelihood is then
\[ \ln L = \sum_{i=1}^{N} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} y_{ijk} \ln p_{ijk} \]
and estimation is by ML.
The Bivariate Probit Model
Define the unobserved latent variables
\[ \begin{aligned} y_1^* &= x_1'\beta_1 + \varepsilon_1 \\ y_2^* &= x_2'\beta_2 + \varepsilon_2 \end{aligned} \tag{25} \]
where $\varepsilon_1$ and $\varepsilon_2$ are jointly normal with means zero, variances one, and
correlation $\rho$. Then the bivariate probit model specifies the observed outcomes
to be:
\[ y_1 = \begin{cases} 2 & \text{if } y_1^* > 0 \\ 1 & \text{if } y_1^* \leq 0 \end{cases} \tag{26} \]
\[ y_2 = \begin{cases} 2 & \text{if } y_2^* > 0 \\ 1 & \text{if } y_2^* \leq 0 \end{cases} \tag{27} \]
where we use values (2, 1) rather than (1, 0) to be consistent with the notation in
this lecture. Observe that if $\rho = 0$ this specification collapses to two separate
probit models for $y_1$ and $y_2$. When $\rho \neq 0$, there is no closed-form solution for the
probabilities.
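For example,
\[ \begin{aligned} p_{22} &= P(y_1 = 2, y_2 = 2) \\ &= P(y_1^* > 0, \; y_2^* > 0) \\ &= P(-\varepsilon_1 < x_1'\beta_1, \; -\varepsilon_2 < x_2'\beta_2) \\ &= P(\varepsilon_1 < x_1'\beta_1, \; \varepsilon_2 < x_2'\beta_2) \\ &= \int_{-\infty}^{x_1'\beta_1} \int_{-\infty}^{x_2'\beta_2} \phi_2(z_1, z_2, \rho) \, dz_2 \, dz_1 \\ &= \Phi_2(x_1'\beta_1, x_2'\beta_2, \rho) \end{aligned} \]
where $\phi_2(z_1, z_2, \rho)$ and $\Phi_2(x_1'\beta_1, x_2'\beta_2, \rho)$ are, respectively, the standardized
bivariate normal density and CDF for $(z_1, z_2)$ with zero means, unit variances, and
correlation $\rho$, and the fourth equality holds because the bivariate normal with mean
zero is symmetric about the origin.
Performing similar algebra for the other possible outcomes yields:
\[ p_{jk} = P(y_1 = j, y_2 = k) = \Phi_2(q_1 x_1'\beta_1, \; q_2 x_2'\beta_2, \; q_1 q_2 \rho) \]
where $q_\ell = 1$ if $y_\ell = 2$ and $q_\ell = -1$ if $y_\ell = 1$ for $\ell = 1, 2$. The correlation enters
as $q_1 q_2 \rho$ because reversing the sign of exactly one of the two errors reverses the
sign of their correlation.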
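The four bivariate probit probabilities can be evaluated numerically. The sketch below uses only the standard library and computes $\Phi_2(a, b, \rho) = \int_{-\infty}^{a} \phi(z)\,\Phi\big((b - \rho z)/\sqrt{1 - \rho^2}\big)\,dz$ by Simpson's rule; this is one convenient way to evaluate the integral, not the method used in the lecture. The correlation argument is taken as $q_1 q_2 \rho$, so that the four probabilities sum to one. Index values and function names are hypothetical.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bvn_cdf(a, b, rho, n=2000, lower=-8.0):
    """Phi_2(a, b, rho) by Simpson's rule on [lower, a]; needs |rho| < 1, n even."""
    if a <= lower:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)

    def f(z):
        # density of z1 times the conditional probability that z2 <= b
        return norm_pdf(z) * norm_cdf((b - rho * z) / s)

    h = (a - lower) / n
    total = f(lower) + f(a)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(lower + i * h)
    return total * h / 3.0

def biprobit_probs(xb1, xb2, rho):
    """p_jk for j, k in {1, 2}, with q = +1 for outcome 2 and q = -1 for outcome 1."""
    out = {}
    for j in (1, 2):
        for k in (1, 2):
            q1 = 1.0 if j == 2 else -1.0
            q2 = 1.0 if k == 2 else -1.0
            out[(j, k)] = bvn_cdf(q1 * xb1, q2 * xb2, q1 * q2 * rho)
    return out

# Hypothetical index values x1'beta1 = 0.4, x2'beta2 = -0.2 and rho = 0.5.
probs = biprobit_probs(0.4, -0.2, 0.5)
probs_indep = biprobit_probs(0.4, -0.2, 0.0)
```

With $\rho = 0$ the joint probability `probs_indep[(2, 2)]` equals the product of the two univariate probit probabilities, which matches the remark that the model then collapses to two separate probits.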

9. Application using STATA

The effect of informality on social protection
Read the data and organize the response variables according to the following
specifications
Estimation of the multinomial logit model
Estimation of the multinomial probit model
Estimation of the ordered multinomial model
Estimation of the bivariate probit model
Comment on results
