Lattin James M-Analyzing Multivariate Data-Pp477-490


This material is protected pursuant to the Norwegian Copyright Act and
has been fully or partly produced pursuant to agreement with Kopinor.


The material may be used by students studying the subject in question,
for their own studies, in any format and on any platform.
Any other reproduction or publication not explicitly authorised by the
rightsholder is only permitted if allowed by law (reproduction for private
use, quotation, etc.) or agreement with Kopinor (www.kopinor.no).

Course of Study:
(GRA4139) Marketing Analytics

Title of work:
Analyzing multivariate data (2003)

Section:
pp. 477--490

Author/editor of work:
Lattin, James M.; Carroll, J. Douglas; Green, Paul E.

Author of section:
James M. Lattin

Name of Publisher:
Thomson Brooks/Cole

13 .2 Binary Logit Model : How It Works 477

FIGURE 13.1 Tracking share of purchases for a brand of ground coffee in a holdout
sample of 100 customers (Source: Guadagni and Little, 1983)
[Figure not reproduced. Vertical axis: share of purchases (%), 0 to 60. Horizontal axis:
4-week periods, from MAR 08, 1979 to FEB 07, 1980.]

The data used by Guadagni and Little to calibrate their model consisted of the
choice histories from a group of 100 panelists (randomly chosen from a group that
was screened to eliminate the lightest and heaviest users) over a 78-week period.
They also held out another 100 panelists for validation. Using measures of brand and
size loyalty to account for the differences in preference across households (see
section 13.6.1 below), Guadagni and Little estimated the effects of price and promotion
on brand choice and found them to be highly significant. They were able to use the
model to obtain estimates of price elasticity and cross-price elasticity for both
promotional and nonpromotional price cuts. Furthermore, aggregating predicted choice
probabilities across households, they found that the model did an excellent job of
tracking the week-to-week fluctuations in market share that were attributable to
changes in marketing activity (see Figure 13.1).

13.2 BINARY LOGIT MODEL: HOW IT WORKS

We begin with a simple, stylized dichotomous choice problem. Consider a single
individual who shops in a supermarket that offers two different brands in a particular
product category: Brands A and B. The regular shelf price of the two brands is the
same, but Brand A offers frequent discounts and is often available at prices 10, 15,
20, and even 30 cents below the price of Brand B (which remains fixed).

Over time, we observe 30 different purchase occasions by the individual from
this product category. The numbers of choices of Brands A and B at each level of price
offered by Brand A are shown in Table 13.2.
478 Chapter 13 Logit Choice Models

TABLE 13.2 Choices of Brand A and Brand B over 30 different occasions at
different levels of discount for Brand A

Discount for A   Choices of Brand A   Choices of Brand B
0.00             1                    7
0.10             1                    6
0.15             3                    3
0.20             4                    1
0.30             4                    0

The pattern of choices in Table 13.2 makes clear that the individual is strongly
influenced by the relative price of Brand A when making a choice from the product
category. Out of all the choices made by the individual when A and B are available at
the same (regular) shelf price, only one out of eight went to Brand A. But of all the
choices made by the individual when Brand A offers at least a 20 cent discount, eight
out of nine choices went to Brand A.

Our goal is to build an individual-level model of the probability of purchasing
Brand A as a function of the price discount offered by Brand A. Our model of choice
behavior is based on the following two assumptions.

ASSUMPTIONS OF CHOICE MODEL BASED ON UTILITY MAXIMIZATION

1. Each choice alternative offers the individual some amount of utility at the
   time of the choice. The utility offered by alternative i at time t is denoted
   $u_{it}$. (We will further assume that at time t there exists a strict ordering
   among alternatives.)
2. In choosing among alternatives, the individual selects the one with highest
   utility. When there are just two alternatives, the individual chooses
   alternative 1 when $u_{1t} > u_{2t}$; otherwise, the individual chooses alternative 2.

Note that the choice process described above is a completely deterministic one.
From the perspective of the individual, there is nothing inherently random about the
choice process. We have assumed that the utility of each alternative is known to the
individual with certainty and that the outcome of the choice process is completely
determined by the relative value of the two utilities $u_{1t}$ and $u_{2t}$. So where
does the choice probability come in?

The probability of choice in our model does not have to do with any randomness
inherent in the individual choice process; it has to do with the incomplete information
we have as modelers and our inability to specify completely the utility function
of the individual. In fact, we have only one piece of information from which to
construct the utility function: the price of Brand A relative to Brand B. In most choice
situations (as in this one), there will be many other factors (most of them
unknowable or unmeasurable by the modeler) that influence choice behavior. In a
supermarket shopping environment, these may be other store cues, such as special
display, as well as other factors that increase the salience of one brand relative to
another. Although these factors may not be as important as price, together they have
some cumulative impact on choice.
Because it is unreasonable to assume that we as modelers will ever have access
to all of the relevant information needed to completely specify the utility function of
the individual, we decompose the utility function $u_{it}$ into two components:

    $u_{it} = v_{it} + \varepsilon_{it}$    (13.1)

where $v_{it}$ denotes the deterministic component of utility (i.e., that part of the utility
function that we as modelers can capture using the information available to us) and
$\varepsilon_{it}$ denotes the stochastic component of utility (i.e., the cumulative effect on
utility of a great many unobservable factors). Typically, we will model the deterministic
component $v_{it}$ as a linear combination of independent variables; that is,

    $v_{it} = x_{it}'\beta$    (13.2)

where $x_{it}$ is a vector of the measured characteristics directly influencing the choice of
item i at time t. The stochastic component of utility we model as a random variable
$\varepsilon_{it}$; the value it takes is the realization of the random variable on choice
occasion t for choice alternative i.
Because we do not observe $\varepsilon_{it}$, we do not know $u_{it}$ and cannot say for certain
which alternative the individual will choose. By assumption, we know that the
individual will choose alternative 1 over 2 so long as

    $u_{1t} > u_{2t}$    (13.3)

Substituting equations (13.1) and (13.2) into (13.3) yields

    $x_{1t}'\beta + \varepsilon_{1t} > x_{2t}'\beta + \varepsilon_{2t}$    (13.4)

Rearranging terms, we see that the individual will choose alternative 1 so long as the
difference between the stochastic components of utility is not greater than the
difference in the deterministic components; that is, so long as

    $(\varepsilon_{2t} - \varepsilon_{1t}) < (x_{1t} - x_{2t})'\beta$    (13.5)

We do not know the value of $(\varepsilon_{2t} - \varepsilon_{1t})$. However, if we know the
distribution function that describes the random variable $(\varepsilon_{2t} - \varepsilon_{1t})$,
then we can at least tell how likely it is that the inequality in equation (13.5) would be met.

Let $f(\varepsilon_{2t} - \varepsilon_{1t})$ denote the probability density function that describes
the difference of the two random variables $\varepsilon_{2t}$ and $\varepsilon_{1t}$. The probability
that we observe a realization $(\varepsilon_{2t} - \varepsilon_{1t})$ of this random process that
satisfies the inequality in equation (13.5) is depicted graphically in Figure 13.2 and can
be written as follows:

    $p_{1t} = \int_{-\infty}^{(x_{1t} - x_{2t})'\beta} f(\varepsilon_{2t} - \varepsilon_{1t}) \, d(\varepsilon_{2t} - \varepsilon_{1t}) = F((x_{1t} - x_{2t})'\beta)$    (13.6)


FIGURE 13.2 Probability that $u_1 > u_2$

where $p_{1t}$ is the probability that the individual chooses alternative 1 on occasion t and
$F(\cdot)$ is the cumulative distribution function of $(\varepsilon_{2t} - \varepsilon_{1t})$. Equation
(13.6) makes it clear that the functional form for the choice probability depends entirely
on the distribution function for the random variable $(\varepsilon_{2t} - \varepsilon_{1t})$.

Probit Model

If we assume that $(\varepsilon_{2t} - \varepsilon_{1t})$ has a normally distributed density function,
the resulting probability model is known as a probit model. The normal distribution is a
theoretically well-grounded choice with strong precedent. Having motivated the stochastic
component $\varepsilon_{it}$ as the cumulative impact of a large number of factors on utility, we
would expect from the Central Limit Theorem to have the distribution of $\varepsilon_{it}$ described
by a bell-shaped curve. Unfortunately, although the choice of the normal distribution
is well grounded, it is not especially convenient from a modeling perspective. Because
the cumulative normal distribution has no closed form, there is no closed functional
form for probability of choice. Although this is not an insurmountable problem
(for example, we can still estimate the parameters of the utility function using
numerical approximation), there are times when it is useful to be able to express the
probability of choice as a function of the independent variables.

Logit Model

Alternatively, we may want to assume that $(\varepsilon_{2t} - \varepsilon_{1t})$ has the following
density function:

    $f(\varepsilon_{2t} - \varepsilon_{1t}) = \frac{\exp[-(\varepsilon_{2t} - \varepsilon_{1t})]}{(1 + \exp[-(\varepsilon_{2t} - \varepsilon_{1t})])^2}$    (13.7)

The distribution in equation (13.7) is known as a logistic distribution, and the
resulting probability model is known as a logistic or logit model. Although the logistic
distribution has a shape that is similar to the normal distribution, it is not exactly
bell-shaped and has tails that are somewhat fatter than the normal distribution.


The real benefit to using the logistic distribution is that it yields a closed form
for the probability expression in equation (13.6). Substituting equation (13.7) into the
integral in (13.6) and solving yields

    $p_{1t} = \int_{-\infty}^{(x_{1t} - x_{2t})'\beta} \frac{\exp(-(\varepsilon_{2t} - \varepsilon_{1t}))}{(1 + \exp(-(\varepsilon_{2t} - \varepsilon_{1t})))^2} \, d(\varepsilon_{2t} - \varepsilon_{1t}) = \frac{1}{1 + \exp(-(x_{1t} - x_{2t})'\beta)}$    (13.8)

Multiplying the numerator and the denominator by $\exp(x_{1t}'\beta)$ yields a somewhat
more intuitive expression for the probability:

    $p_{1t} = \frac{\exp(x_{1t}'\beta)}{\exp(x_{1t}'\beta) + \exp(x_{2t}'\beta)}$    (13.9)

Note that the expression in equation (13.9) qualifies as a probability measure: It
is in the interval [0, 1] for all possible values of $x$ and $\beta$. If we think of the
expression $\exp(x_{it}'\beta)$ as the "attractiveness" of alternative i, then the probability of
choice of item 1 is given by item 1's share of the total attractiveness of alternatives 1 and 2
together.
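
As a quick numerical check of the step from (13.8) to (13.9), the sketch below evaluates both closed forms and compares them with a simulated choice share. The utility values, the helper names, and the simulation settings are illustrative assumptions, not part of the text; the simulation draws the difference of the stochastic components from a standard logistic distribution and applies inequality (13.5).

```python
import math
import random

def p_choose_1_eq_13_8(v1, v2):
    """Equation (13.8): logistic CDF evaluated at the utility difference."""
    return 1.0 / (1.0 + math.exp(-(v1 - v2)))

def p_choose_1_eq_13_9(v1, v2):
    """Equation (13.9): alternative 1's share of total 'attractiveness'."""
    return math.exp(v1) / (math.exp(v1) + math.exp(v2))

def simulate_share(v1, v2, n_draws=200_000, seed=0):
    """Monte Carlo check: draw (e2 - e1) from a standard logistic distribution
    and count how often alternative 1 ends up with the higher utility."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:                  # avoid log(0) in the inverse CDF
            u = rng.random()
        diff = math.log(u / (1.0 - u))   # inverse logistic CDF
        if diff < v1 - v2:               # inequality (13.5)
            wins += 1
    return wins / n_draws

# Hypothetical deterministic utilities for alternatives 1 and 2.
v1, v2 = 0.4, -0.3
closed_form = p_choose_1_eq_13_8(v1, v2)
share_form = p_choose_1_eq_13_9(v1, v2)
simulated = simulate_share(v1, v2)
print(closed_form, share_form, simulated)
```

The two closed forms should agree to rounding error, and the simulated share should sit close to them; only the difference $v_1 - v_2$ matters for the result.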
Maximum Likelihood Estimation

The values of the utility function parameters for the logit model are easily estimated
using maximum likelihood. Recall that we observe only the choice outcomes from a
process with underlying choice probability $p_{it}$ (which is unobserved). For the binary
choice problem, we can use a single 0/1 variable $Y_t$ to denote the observed choice
outcome. If alternative 1 is chosen on choice occasion t, then $Y_t = 1$; otherwise, $Y_t = 0$.
The probability of observing these choice outcomes is given by the logit model:

    $\Pr(Y_t = 1) = p_{1t} = \frac{\exp(x_{1t}'\beta)}{\exp(x_{1t}'\beta) + \exp(x_{2t}'\beta)}$

    $\Pr(Y_t = 0) = p_{2t} = \frac{\exp(x_{2t}'\beta)}{\exp(x_{1t}'\beta) + \exp(x_{2t}'\beta)} = 1 - p_{1t}$

In maximum likelihood, our objective is to choose the model parameters (i.e., the
coefficient vector $\beta$) to maximize the joint probability or likelihood of observing the
choice outcomes $Y_t$. If we assume that the observations are independent of one
another, this likelihood is given by the product of the probabilities across all choice
occasions. An easy way of writing this expression is given in equation (13.10):

    $L = \prod_t p_{1t}^{Y_t} (1 - p_{1t})^{(1 - Y_t)}$    (13.10)

Thus, whenever alternative 1 is chosen (when $Y_t = 1$), $p_{1t}$ enters into the product in
(13.10); when alternative 2 is chosen, $(1 - p_{1t})$ enters into the expression.

It is often more convenient to maximize the log transformation of the likelihood
expression in equation (13.10). Because the log is a monotonic transformation,
maximizing ln(L) is equivalent to maximizing L. The log transformation is written
as follows:

    $\ln(L) = \sum_t \left[ Y_t \ln(p_{1t}) + (1 - Y_t) \ln(1 - p_{1t}) \right]$    (13.11)

where $p_{1t}$ is given by the expression in equation (13.9). We then use numerical
optimization methods (described in Chapter 6) to choose the values of $\beta$ to maximize the
expression in (13.11). The maximum likelihood estimates are consistent estimators
of the parameters, and from the optimization procedure we are able to obtain
asymptotic standard errors with which to assess the variability of the estimates. (Recall
from Chapter 6 that the asymptotic standard errors are taken from the inverse of the
matrix of second derivatives of the objective function, which essentially measures the
curvature of the function; the sharper the curvature, the smaller the standard error
associated with the parameter estimate.)
We now illustrate the maximum likelihood estimation procedure for the choice
problem described above. The deterministic component of the utility function can be
described by a linear function with two coefficients: an intercept term and a slope
coefficient that captures the effect of Brand A's discount at time t (denoted $D_{At}$). One
way we might parameterize the model is shown below:

    $v_{At} = \alpha_A + \beta D_{At}$    (13.12)

    $v_{Bt} = \alpha_B$    (13.13)

Note that because the price of Brand B remains fixed, there is no price term in $v_{Bt}$; the
intercept term $\alpha_B$ captures the fixed deterministic component of utility offered by
Brand B at its prevailing price.
The first thing to notice about the model specification in equations (13.12) and
(13.13) is that the two intercept terms $\alpha_A$ and $\alpha_B$ are not separately identified.
This is due to the nature of the nonlinear expression for probability. Substituting the
expressions in (13.12) and (13.13) into equation (13.9) yields the probability of choosing
Brand A at time t:

    $p_{At} = \frac{\exp(\alpha_A + \beta D_{At})}{\exp(\alpha_A + \beta D_{At}) + \exp(\alpha_B)}$    (13.14)

If we were to add the same constant K to both intercept terms $\alpha_A$ and $\alpha_B$, then we
would find equation (13.14) unchanged as shown below:

    $p_{At} = \frac{\exp(\alpha_A + K + \beta D_{At})}{\exp(\alpha_A + K + \beta D_{At}) + \exp(\alpha_B + K)}$

    $\quad = \frac{\exp(K) \exp(\alpha_A + \beta D_{At})}{\exp(K) [\exp(\alpha_A + \beta D_{At}) + \exp(\alpha_B)]}$    (13.15)

    $\quad = \frac{\exp(\alpha_A + \beta D_{At})}{\exp(\alpha_A + \beta D_{At}) + \exp(\alpha_B)}$


This tells us that the deterministic component of the utility function is determined
only up to an additive constant. In other words, the utility function has no meaningful
absolute zero point; it is only the relative values of the utility function that matter.
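
The invariance in (13.15) is easy to confirm numerically. The sketch below uses made-up parameter values (assumptions chosen only for illustration) and shows that shifting both intercepts by any constant K leaves the probability in (13.14) unchanged, as does rewriting the model with one intercept set to zero.

```python
import math

def p_brand_a(alpha_a, alpha_b, beta, discount):
    """Equation (13.14): probability of choosing Brand A."""
    num = math.exp(alpha_a + beta * discount)
    return num / (num + math.exp(alpha_b))

# Illustrative parameter values (assumptions, not estimates from the text).
alpha_a, alpha_b, beta = -1.0, 2.0, 0.2
discount = 15  # cents

base = p_brand_a(alpha_a, alpha_b, beta, discount)

# Adding the same constant K to both intercepts, as in (13.15).
shifted = [p_brand_a(alpha_a + k, alpha_b + k, beta, discount)
           for k in (-5.0, 0.5, 3.0)]

# Equivalent model with Brand B's intercept normalized to zero (K = -alpha_b).
normalized = p_brand_a(alpha_a - alpha_b, 0.0, beta, discount)

print(base, shifted, normalized)
```

Only the difference between the two intercepts affects the result, which is why one of them can be fixed without loss of generality.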
To solve this problem, we typically identify the model by setting one intercept
term arbitrarily to zero. All other intercept terms are now interpreted relative to this
base case value. In this illustration, we set $\alpha_B = 0$, yielding

    $v_{At} = \alpha_A + \beta D_{At}$    (13.16)

    $v_{Bt} = 0$    (13.17)

The expressions for probability of choice now become

    $p_{At} = \frac{\exp(\alpha_A + \beta D_{At})}{1 + \exp(\alpha_A + \beta D_{At})}$    (13.18)

    $p_{Bt} = \frac{1}{1 + \exp(\alpha_A + \beta D_{At})}$    (13.19)

:an be Because of the limited dimensionality of the parameter space, we can perform a
. slope simple grid search to illustrate the "shape" of the log likelihood function and the ap-
) . One proximate optimal values of a A and (3 . These are shown in Table 13 .3 . Note that the
objective function is approximately maximized for a A = - 3 .0 and (3 = 0 .20 . Using
maximum likelihood, we could also obtain the standard errors for the two parameters
(1 .14 and 0.08, respectively), showing that both are significantly different from zero
at the 0 .01 level .
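
To make the estimation step concrete, here is a minimal sketch that fits the two-parameter model to the Table 13.2 data by Newton-Raphson and reads the asymptotic standard errors off the inverse of the negative second-derivative matrix. Newton-Raphson is one common choice of optimizer, not necessarily the method used in the book; the function names are hypothetical, and the discount is assumed to be coded in cents.

```python
import math

# Table 13.2 with the discount in cents: (discount, choices of A, choices of B).
DATA = [(0, 1, 7), (10, 1, 6), (15, 3, 3), (20, 4, 1), (30, 4, 0)]

def prob_a(alpha, beta, discount):
    """Equation (13.18): probability of choosing Brand A."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * discount)))

def log_likelihood(alpha, beta):
    """Equation (13.11), with identical occasions grouped by discount level."""
    total = 0.0
    for discount, n_a, n_b in DATA:
        p = prob_a(alpha, beta, discount)
        total += n_a * math.log(p) + n_b * math.log(1.0 - p)
    return total

def score_and_information(alpha, beta):
    """Gradient of ln(L) and the negative of its matrix of second derivatives."""
    g_a = g_b = i_aa = i_ab = i_bb = 0.0
    for discount, n_a, n_b in DATA:
        p = prob_a(alpha, beta, discount)
        resid = n_a - (n_a + n_b) * p        # observed minus expected choices of A
        weight = (n_a + n_b) * p * (1.0 - p)
        g_a += resid
        g_b += resid * discount
        i_aa += weight
        i_ab += weight * discount
        i_bb += weight * discount * discount
    return g_a, g_b, i_aa, i_ab, i_bb

def fit_newton(n_iter=25):
    """Newton-Raphson iterations starting from zero; returns estimates and SEs."""
    alpha, beta = 0.0, 0.0
    for _ in range(n_iter):
        g_a, g_b, i_aa, i_ab, i_bb = score_and_information(alpha, beta)
        det = i_aa * i_bb - i_ab * i_ab
        # Step = (information matrix)^-1 times the gradient, written out for 2 x 2.
        alpha += (i_bb * g_a - i_ab * g_b) / det
        beta += (i_aa * g_b - i_ab * g_a) / det
    _, _, i_aa, i_ab, i_bb = score_and_information(alpha, beta)
    det = i_aa * i_bb - i_ab * i_ab
    se_alpha = math.sqrt(i_bb / det)   # diagonal of the inverse information matrix
    se_beta = math.sqrt(i_aa / det)
    return alpha, beta, se_alpha, se_beta

alpha_hat, beta_hat, se_alpha, se_beta = fit_newton()
print(alpha_hat, beta_hat, se_alpha, se_beta, log_likelihood(alpha_hat, beta_hat))
```

Under these assumptions the estimates should land close to the grid-search optimum, with standard errors near the values quoted in the text.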
The probability of choosing Brand A as a function of the level of discount
offered by Brand A is shown in Figure 13.3. The graph shows that when Brand A is
available at regular price comparable to Brand B (i.e., when $D_{At} = 0$), then the
probability of choice is less than 0.1. At zero discount, the deterministic component of
utility for Brand A ($\alpha_A = -3.0$) is much less than that for Brand B (where $\alpha_B$ is set
equal to zero). When Brand A is offered on discount, however, its probability of
choice rises relative to Brand B. For example, when Brand A offers a 15 cent
discount, the individual is indifferent between the two brands ($p_{At} = p_{Bt} = 0.50$). At a
discount of 30 cents, Brand A is more than 10 times as likely to be chosen as Brand B.
TABLE 13.3 Objective function values for different combinations of the
parameters $\alpha$ and $\beta$

                                   $\beta$
$\alpha$      0.16       0.18       0.20       0.22       0.24
-3.4      -16.0854   -14.7675   -13.9577   -13.5981   -13.6349
-3.2      -15.2349   -14.1924   -13.6341   -13.5266   -13.7875
-3.0      -14.5592   -13.7926   -13.5011   -13.6625   -14.1005
-2.8      -14.0661   -13.5736   -13.5350   -13.8871   -14.5734
-2.6      -13.7626   -13.5397   -13.7471   -14.3209   -15.2051
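
The grid in Table 13.3 can be regenerated from the Table 13.2 counts. The following sketch assumes the discount is coded in cents and uses hypothetical names; it evaluates the log likelihood in (13.11) at each grid point and reports the best cell.

```python
import math

# Table 13.2 with the discount in cents: (discount, choices of A, choices of B).
DATA = [(0, 1, 7), (10, 1, 6), (15, 3, 3), (20, 4, 1), (30, 4, 0)]

def log_likelihood(alpha, beta):
    """Equation (13.11) with p_At from equation (13.18)."""
    total = 0.0
    for discount, n_a, n_b in DATA:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * discount)))
        total += n_a * math.log(p) + n_b * math.log(1.0 - p)
    return total

alphas = [-3.4, -3.2, -3.0, -2.8, -2.6]
betas = [0.16, 0.18, 0.20, 0.22, 0.24]

# Evaluate the objective at every (alpha, beta) combination.
grid = {(a, b): log_likelihood(a, b) for a in alphas for b in betas}
best = max(grid, key=grid.get)

for a in alphas:
    print(f"{a:5.1f} " + " ".join(f"{grid[(a, b)]:9.4f}" for b in betas))
print("best grid point:", best)
```

A finer grid, or a proper optimizer, would refine the optimum; the coarse grid is enough to show that the surface is fairly flat along a ridge where a lower intercept is offset by a higher slope.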

FIGURE 13.3 Plot of probability as a function of price discount for brand choice example
[Figure not reproduced. Vertical axis: probability of choosing Brand A, 0.0 to 1.0.
Horizontal axis: discount, 0 to 30.]

The evidence suggests that this is a discount-sensitive individual, who is prepared to
switch his or her brand allegiance if the price is right.
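
The probabilities discussed above follow directly from (13.18) and (13.19). This short sketch plugs in the approximate grid-search values ($\alpha_A = -3.0$, $\beta = 0.20$, discount in cents); the function name is an assumption made for illustration.

```python
import math

ALPHA_A, BETA = -3.0, 0.20  # approximate optimum from the grid search; discount in cents

def prob_a(discount_cents):
    """Equation (13.18): probability of choosing Brand A at a given discount."""
    return 1.0 / (1.0 + math.exp(-(ALPHA_A + BETA * discount_cents)))

for d in (0, 10, 15, 20, 30):
    p = prob_a(d)
    odds = p / (1.0 - p)  # how many times as likely A is to be chosen as B
    print(f"discount {d:2d}: p_A = {p:.3f}, odds A:B = {odds:.2f}")
```

The odds column is simply $\exp(\alpha_A + \beta D_{At})$, so each additional 5 cents of discount multiplies the odds of choosing Brand A by $\exp(1) \approx 2.7$.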

13.2.2 Properties of the Logit Model

Model Significance

We can use the log likelihood function to assess the significance of the overall model
and test its significance relative to nested alternatives (i.e., model specifications
where some subset of the parameters is set equal to zero). Let's say we have two
models: a full model and a restricted model (in which some of the parameters have
been constrained to zero). We denote the log likelihood values for these two models
as $LL_F$ and $LL_R$, respectively. Under the null hypothesis of no difference in fit between
the two models, the difference $-2(LL_R - LL_F)$ is distributed chi-square, with the
number of degrees of freedom equal to the number of constrained parameters.

One natural test of model significance is to compare a model with intercept only
to a model with intercept and independent variables. In our simple example, there is
only one variable, so the test of model significance amounts to a chi-square test on
a single degree of freedom. The log likelihood for the model with intercept only is
$LL_R = -20.5$; after adding the discount variable, the log likelihood improves to
$LL_F = -13.5$. Thus the test of model significance (relative to an intercept-only
model) is

    $\chi^2(1) = -2(-20.5 - (-13.5)) = 14.0$

which is significant at the 0.01 level. This leads us to the same conclusion that we
reach looking at the asymptotic t-test for the model parameter (t = 2.67, also
significant at the 0.01 level).
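
The likelihood ratio test above can be reproduced from the Table 13.2 counts. In this sketch the intercept-only log likelihood is computed from Brand A's overall share (13 of 30), the full-model value uses the approximate optimum from Table 13.3, and the p-value uses the identity that a chi-square variate with one degree of freedom is a squared standard normal. The helper names are assumptions.

```python
import math

# Table 13.2 with the discount in cents: (discount, choices of A, choices of B).
DATA = [(0, 1, 7), (10, 1, 6), (15, 3, 3), (20, 4, 1), (30, 4, 0)]

def ll_full(alpha, beta):
    """Log likelihood of the model with intercept and discount."""
    total = 0.0
    for discount, n_a, n_b in DATA:
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * discount)))
        total += n_a * math.log(p) + n_b * math.log(1.0 - p)
    return total

def ll_intercept_only():
    """With an intercept only, the fitted probability is Brand A's overall share."""
    n_a = sum(a for _, a, _ in DATA)
    n = sum(a + b for _, a, b in DATA)
    share = n_a / n
    return n_a * math.log(share) + (n - n_a) * math.log(1.0 - share)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

ll_r = ll_intercept_only()
ll_f = ll_full(-3.0, 0.20)        # approximate maximum from Table 13.3
lr_stat = -2.0 * (ll_r - ll_f)
p_value = chi2_sf_1df(lr_stat)
print(ll_r, ll_f, lr_stat, p_value)
```

With more than one constrained parameter, the tail probability would need a general chi-square routine rather than the one-degree-of-freedom shortcut used here.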

Goodness of Fit

Although the log likelihood can be used as a goodness-of-fit measure, it suffers from
two limitations. First, it does not take into account the number of degrees of freedom
used by the model. When dealing with multiple regression models, we learned that
adding variables to the model almost always improves the model fit (as measured by
$R^2$, the proportion of variance explained in the dependent variable), even though the
improvement in fit may not justify the loss in degrees of freedom. To account for the
loss of degrees of freedom, we use the adjusted $\bar{R}^2$ (see Chapter 3), which essentially
penalizes the goodness-of-fit statistic for every additional degree of freedom used in
fitting the model.

We can also adjust the log likelihood function in an analogous way to form what
is called an information criterion. We present two different information criteria: one
due to Akaike (denoted AIC) and one due to Schwarz (denoted SC). Both penalize
the log likelihood function by adjusting it according to some function of the number
of degrees of freedom used by the model as shown below:

    $AIC = -2LL + 2k$    (13.20)

    $SC = -2LL + \ln(n) \, k$    (13.21)

where n is the sample size and k is the number of parameters estimated by the model.
Note that the sign is changed, which means that smaller values of AIC and SC
indicate better model fit.
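
A small sketch of the two criteria for the 30-occasion brand choice example follows. The function names are assumptions; the intercept-only log likelihood is entered as -20.53 (it is -20.5 to one decimal, and about -20.53 when computed from Brand A's 13-of-30 share in Table 13.2).

```python
import math

def aic(log_lik, k):
    """Equation (13.20): Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * k

def sc(log_lik, k, n):
    """Equation (13.21): Schwarz criterion, with penalty ln(n) per parameter."""
    return -2.0 * log_lik + math.log(n) * k

n = 30                              # purchase occasions in Table 13.2
ll_full, k_full = -13.50, 2         # intercept and discount
ll_null, k_null = -20.53, 1         # intercept only

aic_full, aic_null = aic(ll_full, k_full), aic(ll_null, k_null)
sc_full, sc_null = sc(ll_full, k_full, n), sc(ll_null, k_null, n)
print(aic_full, aic_null, sc_full, sc_null)
```

Because $\ln(n) > 2$ once n exceeds about 7, the Schwarz penalty per parameter is the larger of the two for any realistic sample size.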
Of the two measures, the Schwarz criterion is more conservative because it
penalizes a less parsimonious model more heavily. Despite this, even the Schwarz
criterion suggests a better fit for the full model (SC = 33.8 versus SC = 44.5 for
the intercept-only model). Both measures are sometimes used to compare
non-nested models (although cross-validation is probably a better way to compare the
predictive ability of non-nested models that are based on different numbers of
parameters).

A second limitation of the log likelihood function (and the information criteria
shown above) is that they are scaled in such a way that only relative comparisons are
meaningful. Sometimes it is valuable to get some sense of how much uncertainty is
explained by the model (and how much remains to be explained). One such measure
is $\rho^2$ (McFadden, 1974), given by

    $\rho^2 = 1 - \frac{LL}{LL_0}$    (13.22)

where $LL_0$ denotes the log likelihood function of the intercept-only model. Hauser
(1978) showed that for this specific choice of reference model (where the probabilities
are the same across all choice observations), $\rho^2$ is interpretable as the fraction of
uncertainty explained by the calibrated model relative to the prior model (i.e., one
with maximum uncertainty). A model that adds nothing to the intercept-only model
has $\rho^2 = 0$. A model with complete explanatory power has $\rho^2 = 1$.
For our illustration, the value of $\rho^2 = 1 - (-13.5/-20.5) = 0.34$, which
suggests that approximately one-third of the uncertainty in choice in the intercept-only
model is explained by the full model. Unlike $R^2$ in regression, it is unusual to see
values of $\rho^2$ near 1.0; values of $\rho^2$ in the range from 0.3 to 0.5 are often described as
excellent fits.

TABLE 13.4 Estimated values of $\alpha$ and $\beta$ for the logit and probit models

Coeff      Logit    Probit
$\alpha$   -2.94    -1.60
$\beta$     0.20     0.11

TABLE 13.5 Fitted choice probabilities for the logit and probit models at different
levels of discount

Discount   Logit    Probit
0.00       0.0502   0.0542
0.05       0.1256   0.1469
0.10       0.2807   0.3104
0.15       0.5146   0.5242
0.20       0.7423   0.7310
0.25       0.8867   0.8792
0.30       0.9551   0.9578

Difference between Logit and Probit

It is also instructive to ask, how big a difference is there between the logit and probit
formulations? We can use maximum likelihood to estimate the parameter values
under the logit and probit model assumptions. The estimated values are shown in
Table 13.4.

At first glance, the differences between the two approaches appear to be
substantial. In particular, the coefficient describing the impact of a discount by Brand A
is almost twice as large in the logit model formulation. These differences, however,
are related to the scaling differences between models. If we look at the shape of the
probability curve implied by the two calibrated models, we see that the differences
are quite small as shown in Table 13.5.
A better means of comparing the estimated impact of discounting from the two
models is to look at the price elasticities implied by the models. Looking at the
entries of Table 13.5 for discount values of 10 cents and 20 cents, we can see that the
response curve from the logit model is slightly steeper (it increases from 0.28 to 0.74,
compared to the probit model increase from 0.31 to 0.73). Expressed as an elasticity
(i.e., as the percentage change in quantity divided by the percentage change in price),
the elasticity to price discount from the logit model estimate is 1.64 versus 1.35 from
the probit model. In this case, the results from the two models are similar but not
identical; in practice, the differences between binary logit and binary probit are often not
substantial.
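
The elasticity comparison can be sketched from the Table 13.5 entries at discounts of 10 and 20 cents. The arc-elasticity helper below is an illustrative assumption about how the figures were computed: the percentage change in fitted choice probability divided by the percentage change in the discount.

```python
# Fitted probabilities from Table 13.5 at discounts of 10 and 20 cents.
FITTED = {"logit": (0.2807, 0.7423), "probit": (0.3104, 0.7310)}

def arc_elasticity(p_low, p_high, d_low=10.0, d_high=20.0):
    """Percentage change in choice probability per percentage change in discount."""
    pct_change_prob = (p_high - p_low) / p_low
    pct_change_discount = (d_high - d_low) / d_low
    return pct_change_prob / pct_change_discount

elasticities = {model: arc_elasticity(*probs) for model, probs in FITTED.items()}
for model, value in elasticities.items():
    print(f"{model}: {value:.3f}")
```

Working from the rounded table entries gives values very close to, though not always identical in the last digit with, the 1.64 and 1.35 quoted above.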
In general, it is a good idea to avoid comparing estimated utility function
parameters across models. This is because the magnitude of the parameter is associated
with the degree of fit of the model: As the model fit improves, the parameters become
larger and the fitted probabilities are driven toward the extremes of zero or one. In most
cases, we are not interested so much in the parameter value itself but in its impact on
the probability of choice. Because of the nonlinearity inherent in the model
formulation, it is better to assess this through an elasticity calculation.

13.3 SAMPLE PROBLEM: BOOKS BY MAIL

EXAMPLE: Modeling probability of purchase in a direct marketing context (BOOKS_1 and BOOKS_2)

To further illustrate the binary logit choice model, we revisit the Books by Mail
example presented in Chapter 12. Recall that Books by Mail sent a test mailing to 1,000
randomly chosen customers offering a new book titled The Art History of Florence.
The company also sent out an identical mailing to another 1,000 customers to serve
as a holdout sample. In Chapter 12, we used information about the past purchasing
behavior of customers (in particular, months since last purchase and number of art
books purchased) to discriminate between buyers and nonbuyers of the new book. We
now use the same data to estimate the parameters of a binary choice model (where
the two choice alternatives are "buy" and "not buy") that will provide us with an
individual-level estimate of the probability of purchase for The Art History of Florence.
In contrast with the previous example, in which we had 30 choice observations from
the same individual across different occasions denoted by the subscript t, we now
have one choice observation from each of 1,000 different customers denoted by the
subscript c.
To identify the model, we set the deterministic component of the utility function
for nonpurchase to zero (i.e., $v_{2c} = 0$). The deterministic component of the utility
provided by a purchase of The Art History of Florence is given by

    $v_{1c} = \alpha + \beta_1 x_{1c} + \beta_2 x_{2c}$    (13.23)

where $x_{1c}$ = the number of months since customer c's last purchase from Books by
Mail, and $x_{2c}$ = the number of art books purchased by customer c from the company.
Because we observe a purchase much less frequently than a nonpurchase (recall, 83
out of 1,000 customers responded to the offer), we expect $\alpha < 0$. We also expect that
someone who has not purchased in a long time is less likely to respond to the offer
(i.e., $\beta_1 < 0$) and that a customer who has purchased similar books in the past is more
likely to respond (i.e., $\beta_2 > 0$).
price), The maximum likelihood estimates for a, Í31, and R2, along with model goodness-
5 from of-fit statistics, are shown in Table 13 .6 . Judging from the t-tests based on the as-
tiden- ymptotic standard errors, all model coefficients are significant (with signs in the ex-
ten not pected direction) . Furthermore, the overall test of model significance (comparing the
proposed model with an intercept-only model) is also significant : The chi-square test
based on twice the difference in the log likelihoods of the two models is x2 (2) = 69 .1,
which is highly significant . Thus, the information about the customer appears to make
an important difference in determining the likelihood of response . The fitted proba-
bility associated with a customer who has purchased recently and has demonstrated

488 Chapter 13 Logit Choice Models

TABLE 13 .6 Logit model results for TABLE 13 .7 Hit rate of the


Books by Mail data mailing policy identified by
the logit model
Maximum Likelihood
Estimates Do Not
Mail Mail
Parameter Standard
Estimate Error No purchase 829 90

a -2 .2256 0.2389 Purchase 43 38

R, -0.0707 0.0192

02 0.9890 0.1347
Log likelihood -254.9

an interest in the art book category (e .g ., x, . = 6, and x2 , = 1) is more than


five times as high as a nonrecent purchaser with no prior category purchases (e .g .,
xi „ = 18, and x 2c = 0) : p„= 0 .16 versus p„ = 0 .03 .
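
The two fitted probabilities quoted above can be recovered from the Table 13.6 estimates. The sketch below is illustrative; the function and argument names are assumptions.

```python
import math

# Maximum likelihood estimates from Table 13.6.
ALPHA, BETA_1, BETA_2 = -2.2256, -0.0707, 0.9890

def purchase_probability(months_since_last_purchase, art_books_purchased):
    """Binary logit probability of purchase, with v_2c = 0 for nonpurchase."""
    v = ALPHA + BETA_1 * months_since_last_purchase + BETA_2 * art_books_purchased
    return 1.0 / (1.0 + math.exp(-v))

p_recent = purchase_probability(6, 1)    # recent buyer with one prior art book
p_lapsed = purchase_probability(18, 0)   # nonrecent buyer, no prior art books
print(round(p_recent, 2), round(p_lapsed, 2), p_recent / p_lapsed)
```

The same function can be applied to every customer in a holdout file to rank them by fitted response probability.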
The fact that the model parameters are significant does not necessarily mean that
we have explained most of the uncertainty in the model. In fact, the goodness of fit
is rather modest. Relative to a model with intercept term only (i.e., a model where
the individual choice probability is the same for all customers and is equal to the
response rate for the mailing), the improvement in log likelihood from adding the two
independent variables is only 34.5 (from -286.0 to -251.5). This translates to a
goodness-of-fit measure of $\rho^2 = 0.12$.
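
These fit statistics can be checked with a few lines of arithmetic. The sketch below computes the intercept-only log likelihood from the calibration response rate (83 of 1,000) and takes the full-model log likelihood of -251.5 from the text; the variable names are assumptions.

```python
import math

n_customers, n_buyers = 1000, 83
ll_full = -251.5                      # full-model log likelihood quoted in the text

# Intercept-only model: every customer gets the overall response rate.
rate = n_buyers / n_customers
ll_null = n_buyers * math.log(rate) + (n_customers - n_buyers) * math.log(1.0 - rate)

rho_squared = 1.0 - ll_full / ll_null          # equation (13.22)
lr_stat = -2.0 * (ll_null - ll_full)           # chi-square statistic on 2 df
print(ll_null, rho_squared, lr_stat)
```

The small difference between the computed chi-square statistic and the reported 69.1 reflects rounding of the log likelihood to one decimal.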
Does a model with such a modest fit serve any practical purpose for Books by
Mail? We can assess the operational significance of the model by using it to
determine a mailing strategy for the 1,000 customers in the holdout sample (much as we
did with discriminant analysis in Chapter 12) and then assessing the profitability of
the strategy. Recall that the cost of mailing an offer to purchase The Art History of
Florence to each customer is $1; if the customer responds and purchases the book,
then the net profit (after the cost of the mailing) is $6. The profit-maximizing
strategy is to mail to all customers with expected profitability greater than zero. Thus, we
mail to all households for whom the following inequality holds:

    ($6) × p_c + (-$1) × (1 - p_c) > 0, or p_c > 1/7

In the holdout sample of 1,000 customers, there are 128 with a fitted probability
of purchase greater than 1/7. We can now compare the profitability of this mailing
strategy with the profitability of mailing to all 1,000 customers. Table 13.7 shows the
results. Overall, the response rate in the holdout sample is 8.1 percent (quite close to
the 8.3 percent of the calibration sample). Had we mailed this offer to all 1,000
customers, then the net profit would have been $6 × 81 + (-$1) × 919 = -$433, a
distinctly nonprofitable alternative. A better alternative (although not one to build
much of a business on) is to mail to no one, yielding a profit of exactly $0. Mailing
to the 128 customers with fitted response probability of greater than 1/7 yields a profit
of $6 × 38 + (-$1) × 90 = $138. Thus, despite the modest goodness of fit, the
model has substantial practical value to Books by Mail.
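
The profit comparison reduces to simple arithmetic on the Table 13.7 counts. The sketch below uses hypothetical names and takes the $1 mailing cost and $6 net profit per purchase from the text.

```python
MAIL_COST = 1.0          # loss on a mailed customer who does not buy
NET_PROFIT_IF_BUY = 6.0  # net profit on a mailed customer who buys

# Holdout counts from Table 13.7, keyed by (mailing decision, outcome).
HOLDOUT = {
    ("no_mail", "no_purchase"): 829,
    ("mail", "no_purchase"): 90,
    ("no_mail", "purchase"): 43,
    ("mail", "purchase"): 38,
}

def profit(buyers, nonbuyers):
    """Net profit from mailing to a group with the given buyers and nonbuyers."""
    return NET_PROFIT_IF_BUY * buyers - MAIL_COST * nonbuyers

# Break-even probability: 6p - (1 - p) > 0 is equivalent to p > 1/7.
breakeven = MAIL_COST / (MAIL_COST + NET_PROFIT_IF_BUY)

n_mailed = HOLDOUT[("mail", "purchase")] + HOLDOUT[("mail", "no_purchase")]
total_buyers = HOLDOUT[("mail", "purchase")] + HOLDOUT[("no_mail", "purchase")]
total_nonbuyers = HOLDOUT[("mail", "no_purchase")] + HOLDOUT[("no_mail", "no_purchase")]

profit_mail_all = profit(total_buyers, total_nonbuyers)
profit_model = profit(HOLDOUT[("mail", "purchase")], HOLDOUT[("mail", "no_purchase")])
print(breakeven, n_mailed, profit_mail_all, profit_model)
```

Changing the cost or margin moves the break-even probability, and with it the set of customers worth mailing.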

Comparison with Discriminant Analysis

Coincidentally, the profitability of the mailing strategy described above is exactly the
same as the profitability of the mailing strategy determined by discriminant analysis.
(The mailing strategies themselves are somewhat different: In the discriminant analysis,
we mailed to 142 people and got 40 responses.) But this raises an interesting question:
In what ways are these two approaches to the problem different?

The similarity between them is readily apparent: Both use the same kind of data
(i.e., a single 0/1 dependent variable). The differences can be more readily
understood by appealing to the Mahalanobis approach to discriminant analysis. Recall that
the objective of Mahalanobis was to construct a locus of points that are "equidistant"
from the two group centroids. The distance (adjusting for covariance among the
independent variables) is used to determine a posterior probability that can be used as
the basis for assigning the observation to one of the two groups. Thus, although the
discriminant function itself is linear in nature, the procedure also provides a
probability of group membership that is a nonlinear function of the independent variables
in the model. When this probability of group membership corresponds to the
probability of choice (as it does in the Books by Mail example), we effectively have a
choice model with a different functional form.
In the discriminant analysis approach, we used Bayes's Theorem to calculate the
posterior probability that an observation belongs in a particular group. In the context
of the Books by Mail example, this is the conditional probability, given recency of
purchase and number of art books purchased, that a particular customer will either
buy or not buy The Art History of Florence. This probability can be written

\[ P(\text{buy} \mid x_c) = \frac{q_1 P(x_c \mid \text{buy})}{q_1 P(x_c \mid \text{buy}) + q_2 P(x_c \mid \text{not buy})} \qquad (13.24) \]
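where \(q_1\) is our prior probability (absent any customer-specific information) that
customer c will buy and \(q_2 = 1 - q_1\). If we assume that the independent variables X
are well described by a normal distribution, then we can write the probabilities
\(P(x_c \mid \text{buy})\) and \(P(x_c \mid \text{not buy})\) as

\[ P(x_c \mid \text{buy}) = \frac{|C_W|^{-1/2}}{(2\pi)^{p/2}} \exp\left( \frac{-D_1^2}{2} \right) \qquad (13.25) \]

and

\[ P(x_c \mid \text{not buy}) = \frac{|C_W|^{-1/2}}{(2\pi)^{p/2}} \exp\left( \frac{-D_2^2}{2} \right) \qquad (13.26) \]

where p is the number of independent variables and

\[ D_g^2 = (x_c - \bar{x}_g)' C_W^{-1} (x_c - \bar{x}_g) \qquad (13.27) \]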

is the squared Mahalanobis distance from \(x_c\) to the centroid of group g. Substituting
equations (13.25) and (13.26) into equation (13.24) and simplifying yields the following
probability expression:

\[ P(\text{buy} \mid x_c) = \frac{\exp(-D_1^2/2 + \ln q_1)}{\exp(-D_1^2/2 + \ln q_1) + \exp(-D_2^2/2 + \ln q_2)} \qquad (13.28) \]

where \(D_1^2\) and \(D_2^2\) are quadratic expressions in \(x_c\) given in (13.27).
Note the similarity to the expression for the binary logit model. Both expressions
have the same structure: an exponentiated term for one alternative in the numerator
and the sum of the exponentiated terms for both choice alternatives in the
denominator. The difference is the nature of the exponentiated term. In the logit model, the
exponentiated term is a simple linear function of the independent variables, reflecting
the deterministic utility (i.e., the observable value to the customer) of the choice
alternative. In the discriminant approach, the exponentiated term is based on a distance
(expressed as a quadratic function of the independent variables) reflecting the
proximity of the observation to the center of each group.
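
To make the parallel concrete, the sketch below evaluates the discriminant posterior of equation (13.28) next to a binary logit probability written in the same "exponentiated term over a sum" form. It is only an illustration for two independent variables: the centroids, the covariance matrix, and the customer profile are hypothetical values invented for the example (the prior of 0.083 echoes the calibration response rate quoted earlier), and all function names are assumptions rather than anything defined in the text.

```python
import math


def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mahalanobis_sq(x, centroid, cov_inv):
    """Squared Mahalanobis distance D_g^2 of equation (13.27), two variables."""
    d0 = x[0] - centroid[0]
    d1 = x[1] - centroid[1]
    return (d0 * (cov_inv[0][0] * d0 + cov_inv[0][1] * d1)
            + d1 * (cov_inv[1][0] * d0 + cov_inv[1][1] * d1))


def discriminant_posterior(x, centroid_buy, centroid_not, cov_within, q_buy):
    """P(buy | x) from equation (13.28)."""
    cov_inv = invert_2x2(cov_within)
    s_buy = -mahalanobis_sq(x, centroid_buy, cov_inv) / 2 + math.log(q_buy)
    s_not = -mahalanobis_sq(x, centroid_not, cov_inv) / 2 + math.log(1 - q_buy)
    m = max(s_buy, s_not)  # subtract the maximum before exponentiating, for stability
    e_buy, e_not = math.exp(s_buy - m), math.exp(s_not - m)
    return e_buy / (e_buy + e_not)


def logit_probability(v_buy, v_not=0.0):
    """Binary logit: same structure, but the exponentiated terms are utilities."""
    m = max(v_buy, v_not)
    e_buy, e_not = math.exp(v_buy - m), math.exp(v_not - m)
    return e_buy / (e_buy + e_not)


# Hypothetical inputs: (months since last purchase, art books purchased).
CENTROID_BUY = (8.0, 1.0)
CENTROID_NOT = (13.0, 0.3)
COV_WITHIN = [[60.0, -0.5], [-0.5, 0.6]]
Q_BUY = 0.083  # prior set to the calibration response rate quoted in the text

p_buy = discriminant_posterior((5.0, 2.0), CENTROID_BUY, CENTROID_NOT, COV_WITHIN, Q_BUY)
print(round(p_buy, 3))
```

The two functions differ only in what is exponentiated: a distance-based score plus a log prior in one case, a deterministic utility in the other.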

13.4 MULTINOMIAL LOGIT MODEL: HOW IT WORKS

13.4.1 Intuition
Having laid out the assumptions of the discrete choice probability model for the case
of two choice alternatives, the extension to more than two alternatives is conceptually
straightforward. However, to keep the functional form of the probability model
tractable, we have to give up considerably more in terms of the generality of the final
model. The rest of this chapter is devoted to examining the assumptions of the
multinomial logit model and to understanding the consequences of these assumptions on
the properties of the model.
Our assumptions regarding the choice behavior of the individual remain the same
as before. We assume that the individual chooses from all available items (i.e., the
elements of the choice set C) the alternative providing the maximum utility; that is,

\[ \text{Choose } i \in C \text{ such that } u_i = \max_{j \in C} \{ u_j \} \qquad (13.29) \]

We also assume that the utility of each alternative, although known to the
individual, is not known to the modeler and cannot be captured exactly. We therefore
write utility as the sum of two components: a deterministic component \(v_i\) (which in
general we will choose to model as a linear function of known and measurable
variables X) and a stochastic component \(\varepsilon_i\):

\[ u_i = v_i + \varepsilon_i \qquad (13.30) \]
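
The choice rule (13.29) and the utility decomposition (13.30) can be illustrated by simulation. The sketch below draws a stochastic component for each alternative, adds it to a deterministic utility, and records which alternative has the highest total. Two things here are assumptions not stated in this passage: the stochastic components are drawn as independent standard Gumbel (extreme value) variates, the distribution conventionally associated with logit models, and the deterministic utilities in `v` are hypothetical numbers. Under that assumption the simulated choice shares should come out close to the familiar logit shares exp(v_i) / sum_j exp(v_j).

```python
import math
import random


def gumbel_draw(rng):
    """One standard Gumbel variate via the inverse CDF (assumed error distribution)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def choose(v, rng):
    """Index of the utility-maximizing alternative, u_i = v_i + e_i."""
    utilities = [v_i + gumbel_draw(rng) for v_i in v]
    return max(range(len(v)), key=lambda i: utilities[i])


def simulate_shares(v, n_draws, seed=0):
    """Fraction of simulated individuals choosing each alternative."""
    rng = random.Random(seed)
    counts = [0] * len(v)
    for _ in range(n_draws):
        counts[choose(v, rng)] += 1
    return [c / n_draws for c in counts]


def logit_shares(v):
    """exp(v_i) / sum_j exp(v_j), computed stably by subtracting the maximum."""
    m = max(v)
    e = [math.exp(v_i - m) for v_i in v]
    total = sum(e)
    return [e_i / total for e_i in e]


v = [1.0, 0.5, 0.0]  # hypothetical deterministic utilities for three alternatives
shares = simulate_shares(v, 20000, seed=0)
print(shares, logit_shares(v))
```

Replacing `gumbel_draw` with a different error distribution changes the simulated shares, which is one way to see how much of the model's form rests on the distributional assumption.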
