
Chapter 2

Multinomial Models
Théophile T. Azomahou
University Clermont Auvergne, CNRS, CERDI
Maastricht University, School of Business and Economics
Email: theophile.azomahou@uca.fr

Multinomial logit vs. probit


Ordered outcomes
Multivariate discrete outcomes
Théophile T. Azomahou (CERDI), February 20-28, 2020

1. Introduction
Chapter 1 considered models for discrete outcome variables that can take
two possible values (0/1). Here we consider several possible outcomes,
usually mutually exclusive.

Examples: (i) different ways to commute to work (bus, car, cycle, walking); (ii) various types of health insurance (fee-for-service, managed care, or none); (iii) different employment statuses (full-time, part-time, or none); (iv) choice of recreational site, occupational choice, product choice; etc.

Estimation is most often by maximum likelihood because the data are clearly
multinomial distributed. For some complications, however, moment-based
estimation is used instead.

Different multinomial models arise owing to different functional forms for the probabilities of the multinomial distribution, similar to the differences between probit and logit in the binary case.


A distinction is also made between models where regressors vary across alternatives for a given individual and models where regressors are constant across alternatives.

Example: in transportation mode choice some regressors, such as travel time or cost, vary with the choice, whereas others, such as age, are choice invariant.

Some structure can be placed on the decision-making process, such as a natural ordering of alternatives or sequencing of decisions. In practice many different multinomial models are used.

Family of models to be considered here: unordered, ordered, multivariate.


2. Multinomial models
Assume there are m alternatives and the dependent variable y is defined to take value j if the jth alternative is taken, j = 1, · · · , m. (Some authors instead consider m + 1 alternatives with j = 0, 1, · · · , m.) Define the probability that alternative j is chosen as:

pj = P(y = j); j = 1, · · · , m (1)

We introduce m binary variables for each observation y:

$$
y_j = \begin{cases} 1 & \text{if } y = j \\ 0 & \text{if } y \neq j \end{cases} \qquad (2)
$$

Thus yj equals one if alternative j is the observed outcome and the remaining yk
equal zero, so for each observation on y exactly one of y1 , y2 , · · · , ym will be
nonzero.


The multinomial density for one observation can then be conveniently written as:

$$
f(y) = p_1^{y_1} \times p_2^{y_2} \times \cdots \times p_m^{y_m} = \prod_{j=1}^{m} p_j^{y_j} \qquad (3)
$$

For regression models, introduce a subscript i for the ith individual and regressors x_i. A model for the probability that individual i chooses the jth alternative is:

pij = P(yi = j) = Fj(x_i, β), j = 1, · · · , m; i = 1, · · · , N (4)

The functional form for Fj should be such that probabilities lie between 0 and 1
and sum over j to one. Different functional specifications for Fj correspond to
specific models, notably multinomial logit, nested logit, multinomial probit,
ordered, sequential, and multivariate models.


3. Estimation: Maximum Likelihood


The multinomial density for one observation is given in (3). The likelihood function for a sample of N independent observations is then:

$$
L = \prod_{i=1}^{N} \prod_{j=1}^{m} p_{ij}^{y_{ij}} \qquad (5)
$$

The log-likelihood function is:

$$
\ln L = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln p_{ij} \qquad (6)
$$

where pij = Fj (x i , β) is a function of parameters β and regressors, defined in (4).
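As an illustration of (2) and (6), the following minimal Python sketch builds the indicators yij from an observed outcome and evaluates the log-likelihood for given probabilities. The data and the function names `one_hot` and `log_likelihood` are hypothetical and chosen only for this example; in practice the probabilities would come from a model Fj(x_i, β).

```python
import math

def one_hot(y, m):
    """Indicators y_1..y_m of equation (2) for an outcome y in 1..m."""
    return [1 if y == j else 0 for j in range(1, m + 1)]

def log_likelihood(outcomes, probs):
    """Equation (6): sum over i and j of y_ij * ln p_ij."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        for y_ij, p_ij in zip(one_hot(y, len(p)), p):
            if y_ij:  # terms with y_ij = 0 contribute nothing
                total += math.log(p_ij)
    return total

# Hypothetical data: N = 3 individuals, m = 3 alternatives.
outcomes = [1, 3, 2]
probs = [[0.50, 0.30, 0.20],
         [0.10, 0.20, 0.70],
         [0.25, 0.50, 0.25]]
ll = log_likelihood(outcomes, probs)
print(ll)
```

Only the probability of the chosen alternative enters each individual's contribution, which is why the double sum in (6) collapses to one term per observation.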


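The first-order conditions for the MLE β̂ are:

$$
\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{N} \sum_{j=1}^{m} \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} = 0 \qquad (7)
$$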


Consistency of the ML estimator


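The distribution of yi is multinomial, so correct specification of the data generating process (DGP) means correct specification of the functional forms Fj(x_i, β) for the probabilities pij. This ensures consistency, as then E(yij|x) = pij. Taking the expectation of (7) gives:

$$
E\left[\frac{\partial \ln L}{\partial \beta}\right]
= E\left[\sum_{i=1}^{N} \sum_{j=1}^{m} \frac{E(y_{ij} \mid x)}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta}\right]
= E\left[\sum_{i=1}^{N} \sum_{j=1}^{m} \frac{\partial p_{ij}}{\partial \beta}\right]
$$

which is zero because $\sum_{j=1}^{m} p_{ij} = 1$ for every i, so the derivatives $\partial p_{ij} / \partial \beta$ sum to zero over j.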
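Asymptotic distribution of the ML estimator

The usual asymptotic theory for ML applies (the variance matrix is obtained by differentiating (7) with respect to β′):

$$
\hat{\beta}_{ML} \overset{a}{\sim} N\left( \beta,\; \left[ \sum_{i=1}^{N} \sum_{j=1}^{m} \left( \frac{1}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} \frac{\partial p_{ij}}{\partial \beta'} - \frac{\partial^2 p_{ij}}{\partial \beta \, \partial \beta'} \right) \right]^{-1} \right)
$$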


4. Multinomial Logit
The simplest specification is the multinomial logit model, proposed by Luce
(1959). The commonly used variants of this model differ according to whether or
not regressors vary across alternatives.

4.1 Conditional Logit Model (CL)


For alternative-varying regressors the conditional logit model is used. The CL model specifies:

$$
p_{ij} = \frac{e^{x_{ij}'\beta}}{\sum_{l=1}^{m} e^{x_{il}'\beta}}, \quad j = 1, \ldots, m \qquad (8)
$$

Because $\sum_{j=1}^{m} p_{ij} = 1$, an equivalent model is obtained by defining x_ij to be deviations of regressors from the values for alternative 1, say, and setting x_i1 = 0.
The marginal effects are given by (exercise):

$$
\frac{\partial p_{ij}}{\partial x_{ik}} = p_{ij}(\delta_{ijk} - p_{ik})\beta =
\begin{cases} p_{ij}(1 - p_{ik})\beta & \text{if } j = k \\ -p_{ij}\, p_{ik}\, \beta & \text{if } j \neq k \end{cases}
$$

where δijk = 1[j = k].
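The CL probabilities (8) and their marginal effects can be sketched in a few lines of Python. This is only an illustration under assumed values: the regressors (travel time and cost for three commute modes), the coefficient vector, and the function names are hypothetical. Alternatives are indexed from 0 in the code.

```python
import math

def cl_probs(x, beta):
    """Equation (8): x[j] is the regressor vector of alternative j."""
    v = [sum(xk * bk for xk, bk in zip(xj, beta)) for xj in x]
    vmax = max(v)  # subtracting the maximum avoids overflow in exp
    e = [math.exp(vj - vmax) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

def cl_marginal_effect(x, beta, j, k):
    """d p_j / d x_k = p_j * (1[j == k] - p_k) * beta, returned as a vector."""
    p = cl_probs(x, beta)
    delta = 1.0 if j == k else 0.0
    return [p[j] * (delta - p[k]) * b for b in beta]

# Hypothetical example: 3 commute modes, regressors = (time, cost).
x = [[0.5, 1.0], [1.0, 0.2], [1.5, 0.0]]
beta = [-1.0, -0.5]
p = cl_probs(x, beta)
own = cl_marginal_effect(x, beta, 0, 0)    # effect of own attributes
cross = cl_marginal_effect(x, beta, 0, 1)  # effect of another mode's attributes
print(p, own, cross)
```

With negative coefficients on time and cost, the own effects are negative and the cross effects positive, matching the sign pattern implied by the formula.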

In this case, the sign of each own marginal effect is the same as the sign of the corresponding component of β, while the sign of each cross effect is the opposite.

4.2 Multinomial Logit Models (MNL)


When instead the regressors do not vary over alternatives, the multinomial logit model is used. The MNL model specifies:

$$
p_{ij} = \frac{e^{x_i'\beta_j}}{\sum_{l=1}^{m} e^{x_i'\beta_l}}, \quad j = 1, \ldots, m \qquad (9)
$$

The marginal effects in this model are the effects of a one-unit change in a regressor on the probabilities of choosing each alternative (exercise):

$$
\frac{\partial p_{ij}}{\partial x_i} = p_{ij}\left(\beta_j - \sum_{h=1}^{m} p_{ih}\beta_h\right) \equiv p_{ij}\left(\beta_j - \bar{\beta}_p\right) \qquad (10)
$$

where $\bar{\beta}_p = \sum_h p_{ih}\beta_h$ is a probability-weighted average of the alternative-specific coefficients, using the choice probabilities p as weights.
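A small numerical sketch of (9) and (10) follows. The regressor values and coefficients are hypothetical, and the first alternative is taken as the base with coefficients set to zero (an assumption made here for identification). The example is constructed so that one alternative has a positive coefficient but a negative marginal effect.

```python
import math

def mnl_probs(x, betas):
    """Equation (9): betas[j] is the coefficient vector of alternative j."""
    v = [sum(xk * bk for xk, bk in zip(x, bj)) for bj in betas]
    vmax = max(v)  # subtracting the maximum avoids overflow in exp
    e = [math.exp(vj - vmax) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

def mnl_marginal_effects(x, betas):
    """Equation (10): row j holds p_j * (beta_j - beta_bar)."""
    p = mnl_probs(x, betas)
    m, n_reg = len(betas), len(x)
    beta_bar = [sum(p[h] * betas[h][r] for h in range(m)) for r in range(n_reg)]
    return [[p[j] * (betas[j][r] - beta_bar[r]) for r in range(n_reg)]
            for j in range(m)]

# Hypothetical values: regressors = (constant, age); first alternative is the base.
x = [1.0, 0.4]
betas = [[0.0, 0.0], [0.2, 0.5], [0.1, 2.0]]
p = mnl_probs(x, betas)
me = mnl_marginal_effects(x, betas)
print(p)
print(me)
```

Here the second alternative has an age coefficient of 0.5, yet its marginal effect with respect to age is negative because the probability-weighted average coefficient exceeds 0.5. The marginal effects also sum to zero across alternatives, since the probabilities always sum to one.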


Remarks:

From this expression we can see that the sign of a parameter estimate does
not necessarily correspond to the sign of the effect of an increase in the
regressor on the probability of choosing this alternative.

For this reason, testing whether a single coefficient differs from zero is not very informative about the effect of the regressor. More subtly, the sign of the individual marginal effects can differ across individuals, as the weighted average β̄ p uses individual-specific choice probabilities as weights.

The two models can be combined into what some authors call a mixed logit
model.


4.3 Regression Parameter Interpretation

Remarks: For multinomial models, there is not necessarily a one-to-one correspondence between the sign of a coefficient and the sign of the change in a probability. We focus on the marginal effects on the choice probabilities of a change in a regressor for a given individual.

Elasticities can then be computed by multiplying the marginal effect by the current regressor value and dividing by the probability. Typically these are then averaged over individuals to give an average marginal effect or average elasticity.

In the CL model, if a regression coefficient is positive then an increase in the corresponding component of the regressor for the kth alternative increases the probability of the kth alternative and decreases the probability of the other alternatives.


4.4 Comparison to Base Category

The coefficients in the CL and MNL models can also be given a more direct logit-like interpretation in terms of relative risk. This is because the models can be reexpressed as binary logit models.
For the MNL model, comparison is to a base category, which is the alternative normalized to have coefficients equal to zero. To see this, note that the multinomial logit probabilities (9) imply that the conditional probability of observing alternative j given that either alternative j or alternative k is observed is:

$$
P(y = j \mid y = j \text{ or } k) = \frac{p_j}{p_j + p_k}
= \frac{e^{x'\beta_j}}{e^{x'\beta_j} + e^{x'\beta_k}}
= \frac{e^{x'(\beta_j - \beta_k)}}{1 + e^{x'(\beta_j - \beta_k)}} \qquad (11)
$$

which is a logit model with coefficient (β_j − β_k).


Suppose normalization is on alternative 1, so that β_1 = 0. Then

$$
P(y = j \mid y = j \text{ or } 1) = \frac{e^{x'\beta_j}}{1 + e^{x'\beta_j}} \qquad (12)
$$

and β_j can be interpreted in the same way as the logit model coefficient for binary choice between alternatives j and 1.
For the interpretation to be really useful one needs a natural base category. For example, if interest lies in commute modes as alternatives to traveling by car, then normalize the coefficients for the car alternative to equal zero.
CL: A similar approach can also be applied to the CL model, with

$$
P(y = j \mid y = j \text{ or } k) = \frac{e^{(x_{ij} - x_{ik})'\beta}}{1 + e^{(x_{ij} - x_{ik})'\beta}} \qquad (13)
$$

and normalization is now with respect to regressor values for a base category.


5. Independence of Irrelevant Alternatives (IIA)


A limitation of the CL and MNL models is that discrimination among the m alternatives reduces to a series of pairwise comparisons that are unaffected by the characteristics of alternatives other than the pair under consideration. This is clear from (11) and (13), which show that the MNL and CL models reduce to a binary choice logit model between any pair of choices. The conditional probability does not depend on other alternatives.
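The pairwise property can be checked numerically. In the sketch below the utility indices are hypothetical values of x′β_j for three alternatives. Making the third alternative far more attractive changes all three probabilities, but leaves the conditional probability of the first alternative given the first or second unchanged, and equal to the binary logit in (11).

```python
import math

def logit_probs(v):
    """Multinomial logit probabilities from a list of utility indices."""
    vmax = max(v)
    e = [math.exp(vj - vmax) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

v = [1.0, 0.5, 0.2]                       # hypothetical indices: car, bus, cycle
p = logit_probs(v)
p_changed = logit_probs([1.0, 0.5, 3.0])  # third alternative becomes very attractive

cond_before = p[0] / (p[0] + p[1])
cond_after = p_changed[0] / (p_changed[0] + p_changed[1])
binary_logit = 1.0 / (1.0 + math.exp(-(v[0] - v[1])))
print(cond_before, cond_after, binary_logit)
```

The three printed values coincide up to rounding, which is exactly the restriction that the IIA property imposes on substitution patterns.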

This weakness of CL and MNL is known in the literature as the assumption of independence of irrelevant alternatives (IIA): it arises because the error terms are assumed to be independent across alternatives. It can be tested by a Hausman test (see Hausman and McFadden, 1984).

Much of the econometrics literature has focused on alternative unordered models that do not have this weakness, such as the multinomial probit, the nested logit model, and Markov chain models (which require panel data).


6. Multinomial Probit (MNP)


An alternative and obvious way to introduce correlation across choices in the unobserved component is to work with normally distributed errors. The MNP model is an m-choice multinomial model, with utility of the jth choice given by:

Uj = Vj + εj , j = 1, · · · , m (14)

where Vj denotes the deterministic component of utility and εj the random component.
Random Utility Model: For the ith individual, usually, Vij = x_ij′β or Vij = x_i′β_j. One can suppress subscript i for notational convenience. The chosen alternative is that with the highest utility, so that:

$$
\begin{aligned}
P(y = j) &= P(U_j \geq U_k), \quad \text{all } k \neq j \qquad (15) \\
&= P(U_k - U_j \leq 0), \quad \text{all } k \neq j \\
&= P(\varepsilon_k - \varepsilon_j \leq V_j - V_k), \quad \text{all } k \neq j \\
&= P(\tilde{\varepsilon}_{kj} \leq -\tilde{V}_{kj}), \quad \text{all } k \neq j
\end{aligned}
$$


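where the tilde and second subscript j denote differencing with respect to reference alternative j, that is, ε̃kj = εk − εj and Ṽkj = Vk − Vj.

Example: consider the expression for P(y = 1) in a three-choice model (our example on social protection). Using the last equality in (15) and defining ε̃31 = ε3 − ε1 and ε̃21 = ε2 − ε1, we have:

$$
P(y = 1) = P(\tilde{\varepsilon}_{21} \leq -\tilde{V}_{21},\; \tilde{\varepsilon}_{31} \leq -\tilde{V}_{31})
= \int_{-\infty}^{-\tilde{V}_{31}} \int_{-\infty}^{-\tilde{V}_{21}} f(\tilde{\varepsilon}_{21}, \tilde{\varepsilon}_{31}) \, d\tilde{\varepsilon}_{21} \, d\tilde{\varepsilon}_{31} \qquad (16)
$$

which is a bivariate integral that generally does not have an analytical solution.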


Estimation of MNP

We assume that the errors are jointly normally distributed, with

ε ∼ N(0, Σ) (17)

where ε = [ε1 , · · · , εm ]′ is an m × 1 vector.

Remark: Different MNP models arise from different specifications of the covariance matrix Σ. Note that even if the errors are uncorrelated the MNP yields no closed-form solution for the probabilities, so in that case it is easier to assume instead that the errors are extreme value distributed and use the CL or MNL models.
Evaluating the MNP probabilities requires knowledge of β and Σ, which in fact must be estimated. Because the integral has no closed form, estimation uses simulated maximum likelihood: the probabilities are approximated by Monte Carlo integration, for example with the Geweke, Hajivassiliou and Keane (GHK) simulator, and the resulting simulated log-likelihood is maximized:

$$
\ln \hat{L}(\beta, \Sigma) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \hat{p}_{ij} \qquad (18)
$$

where p̂ij are the simulated probabilities.
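To make the idea of simulated probabilities concrete, here is a minimal sketch of the crude frequency simulator: draw errors from N(0, Σ), compute utilities as in (14), and record how often each alternative has the highest utility. This is not the GHK simulator, which is smoother in the parameters and is what applied work typically uses; the function names, the utility indices, and the covariance matrices are hypothetical.

```python
import random

def cholesky(sigma):
    """Lower-triangular L with L L' = sigma (sigma must be positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (sigma[i][i] - s) ** 0.5
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def mnp_prob_sim(v, sigma, n_draws=20000, seed=0):
    """Share of draws in which each alternative has the highest utility."""
    rng = random.Random(seed)
    L = cholesky(sigma)
    m = len(v)
    counts = [0] * m
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(m)]
        eps = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(m)]
        u = [v[j] + eps[j] for j in range(m)]
        counts[u.index(max(u))] += 1
    return [c / n_draws for c in counts]

# Equal utility indices and independent errors: each probability is near 1/3.
p_hat = mnp_prob_sim([0.0, 0.0, 0.0], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# Higher index for alternative 1, correlated errors for alternatives 2 and 3.
p_corr = mnp_prob_sim([0.5, 0.0, 0.0], [[1, 0, 0], [0, 1, 0.5], [0, 0.5, 1]])
print(p_hat, p_corr)
```

Plugging such simulated probabilities into (18) gives a simulated log-likelihood. Because the frequency simulator is a step function of β and Σ, it is a poor choice for gradient-based maximization, which is one motivation for smooth simulators such as GHK.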



7. Ordered Multinomial Models


Previously: unordered models.
Now: the outcome has a natural ordering.
Analysis is straightforward as appropriate models are well established and
estimation is again by MLE, with different models leading to different
specifications of the probabilities pij .

Suppose that there is a natural ordering of alternatives. For example, self-rated health status may be one of excellent, good, fair, or poor. Such data can be modeled by an unordered multinomial model, but a more parsimonious and sensible model is one that takes account of this ordering.

The starting point is an index model, with single latent variable

yi∗ = x_i′β + ui (19)

where x here does not include an intercept. As y∗ crosses a series of increasing unknown thresholds we move up the ordering of alternatives.


For example, for very low y∗ health status is poor, for y∗ > α1 health status improves to fair, for y∗ > α2 it improves further to good, and so on. In general for an m-alternative ordered model we define:

yi = j if αj−1 < yi∗ ≤ αj (20)

where α0 = −∞ and αm = ∞. Then

$$
\begin{aligned}
P(y_i = j) &= P(\alpha_{j-1} < y_i^* \leq \alpha_j) \\
&= P(\alpha_{j-1} < x_i'\beta + u_i \leq \alpha_j) \\
&= P(\alpha_{j-1} - x_i'\beta < u_i \leq \alpha_j - x_i'\beta) \qquad (21) \\
&= F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta)
\end{aligned}
$$

where F is the CDF of u. The regression parameters β and the (m − 1) threshold parameters α1 , · · · , αm−1 are obtained by maximizing the log-likelihood with pij defined in (21). For the ordered logit model u is logistic distributed with F(z) = e^z / (1 + e^z), and for the ordered probit model u is standard normal distributed and F(·) is the standard normal CDF.
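The probabilities in (21) are straightforward to compute once the index x′β and the thresholds are given. The sketch below does this for both the ordered probit and the ordered logit. The thresholds, the index value, and the function names are hypothetical, with four categories standing for poor, fair, good, and excellent health.

```python
import math

def norm_cdf(z):
    """Standard normal CDF (ordered probit)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Logistic CDF (ordered logit)."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_probs(xb, cuts, cdf):
    """Equation (21): P(y = j) = F(alpha_j - x'b) - F(alpha_{j-1} - x'b)."""
    # F at alpha_0 = -infinity is 0 and at alpha_m = +infinity is 1.
    F = [0.0] + [cdf(a - xb) for a in cuts] + [1.0]
    return [F[j] - F[j - 1] for j in range(1, len(F))]

cuts = [-1.0, 0.0, 1.0]  # hypothetical thresholds alpha_1 < alpha_2 < alpha_3
probit = ordered_probs(0.3, cuts, norm_cdf)
logit = ordered_probs(0.3, cuts, logistic_cdf)
probit_higher = ordered_probs(0.8, cuts, norm_cdf)
print(probit)
print(logit)
```

Raising the index from 0.3 to 0.8 shifts probability mass from the lowest category toward the highest, as the latent-variable interpretation suggests.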


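The sign of the regression parameters β can be interpreted as determining whether or not the latent variable y∗ increases with the regressor. For marginal effects on the probabilities, differentiating (21) with respect to x_i gives:

$$
\frac{\partial P(y_i = j)}{\partial x_i} = \left[ F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta) \right] \beta \qquad (22)
$$

where F′ denotes the derivative of F. The minus sign on x_i′β inside F reverses the order of the two terms relative to (21). The term in brackets can be positive or negative.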
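The marginal effects obtained by differentiating (21) can be checked numerically. The sketch below assumes an ordered probit with a single scalar regressor; the thresholds, coefficient, and function names are hypothetical. The analytic effect for category j is [F′(αj−1 − xβ) − F′(αj − xβ)]β, with F′ equal to zero at the infinite end thresholds.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def prob(j, x, beta, cuts):
    """Ordered probit P(y = j) for scalar x; j runs from 1 to len(cuts) + 1."""
    upper = 1.0 if j == len(cuts) + 1 else norm_cdf(cuts[j - 1] - x * beta)
    lower = 0.0 if j == 1 else norm_cdf(cuts[j - 2] - x * beta)
    return upper - lower

def marginal_effect(j, x, beta, cuts):
    """[F'(alpha_{j-1} - x*beta) - F'(alpha_j - x*beta)] * beta."""
    f_lower = 0.0 if j == 1 else norm_pdf(cuts[j - 2] - x * beta)
    f_upper = 0.0 if j == len(cuts) + 1 else norm_pdf(cuts[j - 1] - x * beta)
    return (f_lower - f_upper) * beta

cuts, x, beta = [-1.0, 0.0, 1.0], 0.5, 0.8  # hypothetical values
effects = [marginal_effect(j, x, beta, cuts) for j in range(1, 5)]
print(effects)
```

With a positive coefficient the effect is negative for the lowest category and positive for the highest, while the sign for interior categories depends on where the index lies relative to the thresholds. The effects sum to zero across categories.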


8. Multivariate Discrete Outcomes


For simplicity consider bivariate discrete data (y1i , y2i ). For example, in a joint model of labor supply and fertility the dependent variables (y1i , y2i ) for individual i may be y1i = 2 if the individual works and y1i = 1 if not, and y2i = 2 if the individual has children and y2i = 1 if not.
More generally, y1 may take values 1, · · · , m1 and y2 may take values 1, · · · , m2 . For individual i define

pijk = P(y1i = j, y2i = k), j = 1, · · · , m1 ; k = 1, · · · , m2 (23)

Note that the pijk define probabilities of mutually exclusive events and $\sum_{j} \sum_{k} p_{ijk} = 1$. Define the m1 × m2 corresponding binary indicator variables yijk = 1 if (y1i = j, y2i = k) and yijk = 0 otherwise. Then the joint density for the ith observation is

$$
f(y_{1i}, y_{2i}) = \prod_{j=1}^{m_1} \prod_{k=1}^{m_2} p_{ijk}^{y_{ijk}} \qquad (24)
$$

The log-likelihood is then $\ln L = \sum_{i=1}^{N} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} y_{ijk} \ln p_{ijk}$ and estimation is by ML.

The Bivariate Probit Model

Define the unobserved latent variables

y1∗ = x_1′β_1 + ε1 (25)
y2∗ = x_2′β_2 + ε2

where ε1 and ε2 are jointly normal with means zero, variances one, and correlation ρ. Then the bivariate probit model specifies the observed outcomes to be:

$$
y_1 = \begin{cases} 2 & \text{if } y_1^* > 0 \\ 1 & \text{if } y_1^* \leq 0 \end{cases} \qquad (26)
$$

$$
y_2 = \begin{cases} 2 & \text{if } y_2^* > 0 \\ 1 & \text{if } y_2^* \leq 0 \end{cases} \qquad (27)
$$

where we use values (2, 1) rather than (1, 0) to be consistent with the notation in this lecture. Observe that if ρ = 0 this specification collapses to two separate probit models for y1 and y2 . When ρ ≠ 0, there is no closed-form solution for the probabilities.

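For example,

$$
\begin{aligned}
p_{22} &= P(y_1 = 2, y_2 = 2) \\
&= P(y_1^* > 0, y_2^* > 0) \\
&= P(-\varepsilon_1 < x_1'\beta_1, \; -\varepsilon_2 < x_2'\beta_2) \\
&= P(\varepsilon_1 < x_1'\beta_1, \; \varepsilon_2 < x_2'\beta_2) \\
&= \int_{-\infty}^{x_1'\beta_1} \int_{-\infty}^{x_2'\beta_2} \phi_2(z_1, z_2, \rho) \, dz_2 \, dz_1 \\
&= \Phi_2(x_1'\beta_1, x_2'\beta_2, \rho)
\end{aligned}
$$

where φ2 (z1 , z2 , ρ) and Φ2 (x_1′β_1 , x_2′β_2 , ρ) are, respectively, the standardized bivariate normal density and CDF for (z1 , z2 ) with zero means, unit variances, and correlation ρ. The fourth equality holds because the zero-mean bivariate normal is symmetric, so (−ε1 , −ε2 ) has the same distribution as (ε1 , ε2 ).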
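Performing similar algebra for the other possible outcomes yields:

$$
p_{jk} = P(y_1 = j, y_2 = k) = \Phi_2(q_1 x_1'\beta_1, \; q_2 x_2'\beta_2, \; q_1 q_2 \rho)
$$

where qℓ = 1 if yℓ = 2 and qℓ = −1 if yℓ = 1 for ℓ = 1, 2. The correlation enters as q1 q2 ρ because flipping the sign of exactly one of the two errors flips the sign of their correlation.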
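The four cell probabilities of the bivariate probit can be evaluated numerically. The sketch below is one possible approach under stated assumptions: it computes Φ2(a, b, ρ) by a midpoint rule applied to the one-dimensional integral of φ(z) Φ((b − ρz)/√(1 − ρ²)) over z ≤ a, which requires |ρ| < 1. The correlation argument passed to Φ2 is q1 q2 ρ, since reversing the sign of exactly one error reverses the sign of the correlation. The index values and function names are hypothetical.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Phi2(a, b, rho, n=4000, lower=-8.0):
    """Bivariate standard normal CDF by a midpoint rule; requires |rho| < 1."""
    if a <= lower:
        return 0.0
    h = (a - lower) / n
    scale = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(n):
        z = lower + (i + 0.5) * h
        total += phi(z) * Phi((b - rho * z) / scale)
    return total * h

def biprobit_cell(y1, y2, xb1, xb2, rho):
    """P(y1, y2) with outcomes coded 2 (latent variable > 0) or 1 (otherwise)."""
    q1 = 1 if y1 == 2 else -1
    q2 = 1 if y2 == 2 else -1
    return Phi2(q1 * xb1, q2 * xb2, q1 * q2 * rho)

xb1, xb2, rho = 0.3, -0.2, 0.5  # hypothetical index values and correlation
cells = {(a, b): biprobit_cell(a, b, xb1, xb2, rho) for a in (1, 2) for b in (1, 2)}
independent_22 = biprobit_cell(2, 2, xb1, xb2, 0.0)
print(cells, independent_22)
```

The four cells sum to one, and with ρ = 0 the probability of the (2, 2) cell reduces to the product of the two univariate probit probabilities, in line with the remark that the model then collapses to two separate probits.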



9. Application using STATA


The effect of informality on social protection

Read the data and organize the response variables according to the following
specifications
Estimation of the multinomial logit model
Estimation of the multinomial probit model
Estimation of the ordered multinomial model
Estimation of the bivariate probit model
Comment on results
