Lattin James M-Analyzing Multivariate Data-Pp477-490
Course of Study:
(GRA4139) Marketing Analytics
Title of work:
Analyzing multivariate data (2003)
Section:
pp. 477--490
Author/editor of work:
Lattin, James M.; Carroll, J. Douglas; Green, Paul E.
Author of section:
James M. Lattin
Name of Publisher:
Thomson Brooks/Cole
[Figure (plot not reproduced). Horizontal axis: 4-week periods, MAR08 79 through FEB07 80; vertical axis ticks: 0, 10, 20.]
The data used by Guadagni and Little to calibrate their model consisted of the choice histories from a group of 100 panelists (randomly chosen from a group that was screened to eliminate the lightest and heaviest users) over a 78-week period. They also held out another 100 panelists for validation. Using measures of brand and size loyalty to account for the differences in preference across households (see section 13.6.1 below), Guadagni and Little estimated the effects of price and promotion on brand choice and found them to be highly significant. They were able to use the model to obtain estimates of price elasticity and cross-price elasticity for both promotional and nonpromotional price cuts. Furthermore, aggregating predicted choice probabilities across households, they found that the model did an excellent job of tracking the week-to-week fluctuations in market share that were attributable to changes in marketing activity (see Figure 13.1).
13.2 BINARY LOGIT MODEL: HOW IT WORKS
We begin with a simple, stylized dichotomous choice problem. Consider a single individual who shops in a supermarket that offers two different brands in a particular product category: Brands A and B. The regular shelf price of the two brands is the same, but Brand A offers frequent discounts and is often available at prices 10, 15, 20, and even 30 cents below the price of Brand B (which remains fixed).
Over time, we observe 30 different purchase occasions by the individual from this product category. The number of choices of Brands A and B at each level of price offered by Brand A are shown in Table 13.2.
478 Chapter 13 Logit Choice Models
The pattern of choices in Table 13.2 makes clear that the individual is strongly influenced by the relative price of Brand A when making a choice from the product category. Out of all the choices made by the individual when A and B are available at the same (regular) shelf price, only one out of eight went to Brand A. But of all the choices made by the individual when Brand A offers at least a 20-cent discount, eight out of nine choices went to Brand A.
Our goal is to build an individual-level model of the probability of purchasing Brand A as a function of the price discount offered by Brand A. Our model of choice behavior is based on the following two assumptions.
1. Each choice alternative offers the individual some amount of utility at the time of the choice. The utility offered by alternative $i$ at time $t$ is denoted $u_{it}$. (We will further assume that at time $t$ there exists a strict ordering among alternatives.)
2. In choosing among alternatives, the individual selects the one with highest utility. When there are just two alternatives, the individual chooses alternative 1 when $u_{1t} > u_{2t}$; otherwise, the individual chooses alternative 2.
Note that the choice process described above is a completely deterministic one. From the perspective of the individual, there is nothing inherently random about the choice process. We have assumed that the utility of each alternative is known to the individual with certainty and that the outcome of the choice process is completely determined by the relative value of the two utilities $u_{1t}$ and $u_{2t}$. So where does the choice probability come in?
The probability of choice in our model does not have to do with any randomness inherent in the individual choice process; it has to do with the incomplete information we have as modelers and our inability to specify completely the utility function of the individual. In fact, we have only one piece of information from which to construct the utility function: the price of Brand A relative to Brand B. In most choice situations (as in this one), there will be many other factors (most of them unknowable or unmeasurable by the modeler) that influence choice behavior. In a supermarket shopping environment, these may be other store cues, such as special display, as well as other factors that increase the salience of one brand relative to another. Although these factors may not be as important as price, together they have some cumulative impact on choice.
Because it is unreasonable to assume that we as modelers will ever have access to all of the relevant information needed to completely specify the utility function of the individual, we decompose the utility function $u_{it}$ into two components, a deterministic component $v_{it}$ and a stochastic component $\varepsilon_{it}$:
$$u_{it} = v_{it} + \varepsilon_{it} \tag{13.1}$$
$$v_{it} = x_{it}'\beta \tag{13.2}$$
where $x_{it}$ is a vector of the measured characteristics directly influencing the choice of item $i$ at time $t$. The stochastic component of utility we model as a random variable $\varepsilon_{it}$. The value $\varepsilon_{it}$ is the realization of the random variable on choice occasion $t$ for choice alternative $i$.
Because we do not observe $\varepsilon_{it}$, we do not know $u_{it}$ and cannot say for certain which alternative the individual will choose. By assumption, we know that the individual will choose alternative 1 over 2 so long as
$$u_{1t} > u_{2t} \tag{13.3}$$
Substituting equations (13.1) and (13.2) into (13.3) yields the equivalent condition $\varepsilon_{2t} - \varepsilon_{1t} < (x_{1t} - x_{2t})'\beta$, so the probability of choosing alternative 1 is
$$p_{1t} = \int_{-\infty}^{(x_{1t} - x_{2t})'\beta} f(\varepsilon_{2t} - \varepsilon_{1t})\, d(\varepsilon_{2t} - \varepsilon_{1t}) = F\big((x_{1t} - x_{2t})'\beta\big) \tag{13.6}$$
where $p_{1t}$ is the probability that the individual chooses alternative 1 on occasion $t$ and $F(\cdot)$ is the cumulative distribution function of $(\varepsilon_{2t} - \varepsilon_{1t})$. Equation (13.6) makes it clear that the functional form for the choice probability depends entirely on the distribution function for the random variable $(\varepsilon_{2t} - \varepsilon_{1t})$.
[FIGURE 13.2 Probability that $u_1 > u_2$]
Probit Model
If we assume that $(\varepsilon_{2t} - \varepsilon_{1t})$ has a normal density function, the resulting probability model is known as a probit model. The normal distribution is a theoretically well-grounded choice with strong precedent. Having motivated the stochastic component $\varepsilon_{it}$ as the cumulative impact of a large number of factors on utility, we would expect from the Central Limit Theorem to have the distribution of $\varepsilon_{it}$ described by a bell-shaped curve. Unfortunately, although the choice of the normal distribution is well grounded, it is not especially convenient from a modeling perspective. Because the cumulative normal distribution has no closed form, there is no closed functional form for probability of choice. Although this is not an insurmountable problem (for example, we can still estimate the parameters of the utility function using numerical approximation), there are times when it is useful to be able to express the probability of choice as an explicit function of the independent variables.
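The point about numerical approximation can be made concrete with a minimal sketch. It assumes, purely for illustration, that the difference $(\varepsilon_{2t} - \varepsilon_{1t})$ is standard normal (the text does not fix its variance); the helper names `normal_cdf` and `probit_choice_prob` are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, evaluated numerically through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_choice_prob(v1: float, v2: float) -> float:
    """P(choose alternative 1) = F(v1 - v2), as in equation (13.6).

    ASSUMPTION: (eps_2t - eps_1t) is standard normal. v1 and v2 stand for
    the deterministic utilities x_1t'beta and x_2t'beta.
    """
    return normal_cdf(v1 - v2)


if __name__ == "__main__":
    # Illustrative utility differences, not values from the text.
    for diff in (-2.0, 0.0, 2.0):
        print(diff, round(probit_choice_prob(diff, 0.0), 4))
```

The probability is perfectly computable, but only through a numerical routine such as `math.erf`; there is no elementary formula to write down, which is the inconvenience the paragraph describes.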
Logit Model
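Alternatively, we may want to assume that $(\varepsilon_{2t} - \varepsilon_{1t})$ has the following density function:
$$f(\varepsilon_{2t} - \varepsilon_{1t}) = \frac{\exp[-(\varepsilon_{2t} - \varepsilon_{1t})]}{\left(1 + \exp[-(\varepsilon_{2t} - \varepsilon_{1t})]\right)^2} \tag{13.7}$$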
The distribution in equation (13.7) is known as a logistic distribution, and the resulting probability model is known as a logistic or logit model. Although the logistic distribution has a shape that is similar to the normal distribution, it is not exactly bell-shaped and has tails that are somewhat fatter than the normal distribution.
The real benefit to using the logistic distribution is that it yields a closed form for the probability expression in equation (13.6). Substituting equation (13.7) into the integral in (13.6) and solving yields
$$p_{1t} = \int_{-\infty}^{(x_{1t} - x_{2t})'\beta} \frac{\exp[-(\varepsilon_{2t} - \varepsilon_{1t})]}{\left(1 + \exp[-(\varepsilon_{2t} - \varepsilon_{1t})]\right)^2}\, d(\varepsilon_{2t} - \varepsilon_{1t}) = \frac{1}{1 + \exp\left(-(x_{1t} - x_{2t})'\beta\right)} \tag{13.8}$$
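The closed form in (13.8) can be checked numerically. The sketch below integrates the logistic density of (13.7) with a simple trapezoid rule and compares the result with $1/(1 + \exp(-z))$. The function names, the lower integration limit of -40, and the step count are illustrative choices, not part of the text.

```python
import math


def logistic_density(z: float) -> float:
    """Density of equation (13.7): exp(-z) / (1 + exp(-z))^2."""
    e = math.exp(-z)
    return e / (1.0 + e) ** 2


def logistic_cdf(z: float) -> float:
    """Closed form on the right-hand side of (13.8): 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def integrate_density(upper: float, lower: float = -40.0, steps: int = 100_000) -> float:
    """Trapezoid-rule approximation of the integral of the density up to `upper`.

    The lower limit -40 stands in for minus infinity; the mass below it is
    negligible for this density.
    """
    h = (upper - lower) / steps
    total = 0.5 * (logistic_density(lower) + logistic_density(upper))
    for i in range(1, steps):
        total += logistic_density(lower + i * h)
    return total * h


if __name__ == "__main__":
    for z in (-1.5, 0.0, 2.0):
        print(z, round(integrate_density(z), 6), round(logistic_cdf(z), 6))
```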
Multiplying the numerator and the denominator by $\exp(x_{1t}'\beta)$ yields a somewhat more intuitive expression for the probability:
$$p_{1t} = \frac{\exp(x_{1t}'\beta)}{\exp(x_{1t}'\beta) + \exp(x_{2t}'\beta)} \tag{13.9}$$
Note that the expression in equation (13.9) qualifies as a probability measure: It is in the interval [0, 1] for all possible values of $x$ and $\beta$. If we think of the expression $\exp(x_{it}'\beta)$ as the "attractiveness" of alternative $i$, then the probability of choice of item 1 is given by item 1's share of the total attractiveness of alternatives 1 and 2 together.
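The "share of attractiveness" reading can be sketched in a few lines. The attribute vectors and coefficients below are hypothetical values chosen only to show that the forms (13.8) and (13.9) give the same probability.

```python
import math


def attractiveness(x, beta):
    """exp(x'beta) for one alternative; x and beta are equal-length sequences."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))


def logit_prob_share(x1, x2, beta):
    """Equation (13.9): alternative 1's share of the total attractiveness."""
    a1, a2 = attractiveness(x1, beta), attractiveness(x2, beta)
    return a1 / (a1 + a2)


def logit_prob_difference(x1, x2, beta):
    """Equation (13.8): 1 / (1 + exp(-(x1 - x2)'beta))."""
    diff = sum((a - b) * bi for a, b, bi in zip(x1, x2, beta))
    return 1.0 / (1.0 + math.exp(-diff))


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    x1, x2, beta = [1.0, 0.5], [0.0, 1.0], [0.8, -1.2]
    print(logit_prob_share(x1, x2, beta), logit_prob_difference(x1, x2, beta))
```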
Maximum Likelihood Estimation
The values of the utility function parameters for the logit model are easily estimated using maximum likelihood. Recall that we observe only the choice outcomes from a process with underlying choice probability $p_{it}$ (which is unobserved). For the binary choice problem, we can use a single 0/1 variable $Y_t$ to denote the observed choice outcome. If alternative 1 is chosen on choice occasion $t$, then $Y_t = 1$; otherwise, $Y_t = 0$. The probability of observing these choice outcomes is given by the logit model:
$$\Pr(Y_t = 1) = p_{1t} = \frac{\exp(x_{1t}'\beta)}{\exp(x_{1t}'\beta) + \exp(x_{2t}'\beta)}$$
$$\Pr(Y_t = 0) = p_{2t} = \frac{\exp(x_{2t}'\beta)}{\exp(x_{1t}'\beta) + \exp(x_{2t}'\beta)} = 1 - p_{1t}$$
The likelihood of the observed sequence of choices is the product of these probabilities over the choice occasions:
$$L = \prod_t p_{1t}^{Y_t} (1 - p_{1t})^{1 - Y_t} \tag{13.10}$$
Thus, whenever alternative 1 is chosen (when $Y_t = 1$), $p_{1t}$ enters into the product in (13.10); when alternative 2 is chosen, $(1 - p_{1t})$ enters into the expression.
It is often more convenient to maximize the log transformation of the likelihood expression in equation (13.10). Because the log is a monotonic transformation, the values of $\beta$ that maximize the log likelihood also maximize the likelihood itself. The log likelihood is
$$LL = \sum_t \left[ Y_t \ln p_{1t} + (1 - Y_t) \ln (1 - p_{1t}) \right] \tag{13.11}$$
where $p_{1t}$ is given by the expression in equation (13.9). We then use numerical optimization methods (described in Chapter 6) to choose the values of $\beta$ to maximize the expression in (13.11). The maximum likelihood estimates are consistent estimators of the parameters, and from the optimization procedure we are able to obtain asymptotic standard errors with which to assess the variability of the estimates. (Recall from Chapter 6 that the asymptotic standard errors are taken from the inverse of the matrix of second derivatives of the objective function, which essentially measures the curvature of the function; the sharper the curvature, the smaller the standard error associated with the parameter estimate.)
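A minimal sketch of the likelihood and log likelihood for the binary logit model follows. The observations are hypothetical toy data with a single attribute, invented only to show that both objectives pick the same parameter value from a set of candidates; none of the names or numbers come from the text.

```python
import math

# HYPOTHETICAL toy data: (y, x1, x2), where y = 1 if alternative 1 was chosen
# and x1, x2 are one-element attribute vectors for the two alternatives.
OBSERVATIONS = [
    (1, [1.0], [0.0]),
    (1, [2.0], [0.0]),
    (0, [0.5], [1.0]),
    (0, [1.0], [0.0]),
    (1, [0.0], [1.0]),
]


def choice_prob(x1, x2, beta):
    """Binary logit probability of choosing alternative 1."""
    diff = sum((a - b) * bi for a, b, bi in zip(x1, x2, beta))
    return 1.0 / (1.0 + math.exp(-diff))


def likelihood(beta, observations):
    """Product of p_1t when Y_t = 1 and (1 - p_1t) when Y_t = 0."""
    value = 1.0
    for y, x1, x2 in observations:
        p1 = choice_prob(x1, x2, beta)
        value *= p1 if y == 1 else (1.0 - p1)
    return value


def log_likelihood(beta, observations):
    """Sum of Y_t ln p_1t + (1 - Y_t) ln(1 - p_1t)."""
    total = 0.0
    for y, x1, x2 in observations:
        p1 = choice_prob(x1, x2, beta)
        total += y * math.log(p1) + (1 - y) * math.log(1.0 - p1)
    return total


if __name__ == "__main__":
    candidates = [-1.0, 0.0, 0.5, 1.0, 2.0]
    best_l = max(candidates, key=lambda b: likelihood([b], OBSERVATIONS))
    best_ll = max(candidates, key=lambda b: log_likelihood([b], OBSERVATIONS))
    print(best_l, best_ll)  # the two maximizers coincide
```

Working on the log scale also avoids multiplying many small probabilities together, which is a practical reason to prefer it when the number of observations is large.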
We now illustrate the maximum likelihood estimation procedure for the choice problem described above. The deterministic component of the utility function can be described by a linear function with two coefficients: an intercept term and a slope coefficient that captures the effect of Brand A's discount at time $t$ (denoted $D_{At}$). One way we might parameterize the model is shown below:
$$v_{At} = \alpha_A + \beta D_{At} \tag{13.12}$$
$$v_{Bt} = \alpha_B \tag{13.13}$$
Note that because the price of Brand B remains fixed, there is no price term in $v_{Bt}$; the intercept term $\alpha_B$ captures the fixed deterministic component of utility offered by Brand B at its prevailing price.
The first thing to notice about the model specification in equations (13.12) and (13.13) is that the two intercept terms $\alpha_A$ and $\alpha_B$ are not separately identified. This is due to the nature of the nonlinear expression for probability. Substituting the expressions in (13.12) and (13.13) into equation (13.9) yields the probability of choosing Brand A at time $t$:
$$p_{At} = \frac{\exp(\alpha_A + \beta D_{At})}{\exp(\alpha_A + \beta D_{At}) + \exp(\alpha_B)} \tag{13.14}$$
If we were to add the same constant $K$ to both intercept terms $\alpha_A$ and $\alpha_B$, then we would find equation (13.14) unchanged as shown below:
$$p_{At} = \frac{\exp(\alpha_A + K + \beta D_{At})}{\exp(\alpha_A + K + \beta D_{At}) + \exp(\alpha_B + K)} = \frac{\exp(K)\exp(\alpha_A + \beta D_{At})}{\exp(K)\left[\exp(\alpha_A + \beta D_{At}) + \exp(\alpha_B)\right]} = \frac{\exp(\alpha_A + \beta D_{At})}{\exp(\alpha_A + \beta D_{At}) + \exp(\alpha_B)} \tag{13.15}$$
This tells us that the deterministic component of the utility function is determined only up to an additive constant. In other words, the utility function has no meaningful absolute zero point; it is only the relative values of the utility function that matter.
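The invariance to an additive constant is easy to confirm numerically. In this sketch the parameter values and the constants `K` are arbitrary illustrative choices; `prob_a` is a hypothetical helper implementing the probability of choosing Brand A with both intercepts present.

```python
import math


def prob_a(alpha_a: float, alpha_b: float, beta: float, discount: float) -> float:
    """Probability of choosing Brand A with separate intercepts for A and B."""
    num = math.exp(alpha_a + beta * discount)
    return num / (num + math.exp(alpha_b))


if __name__ == "__main__":
    # Illustrative values only.
    base = prob_a(-3.0, 0.0, 0.20, 10.0)
    for k in (5.0, -2.5):
        shifted = prob_a(-3.0 + k, 0.0 + k, 0.20, 10.0)
        print(k, base, shifted)  # identical up to floating-point rounding
```

Because every shift of both intercepts produces the same probabilities, the data cannot distinguish among them, which is why one intercept has to be fixed before estimation.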
To solve this problem, we typically identify the model by setting one intercept term arbitrarily to zero. All other intercept terms are now interpreted relative to this base case value. In this illustration, we set $\alpha_B = 0$, yielding
$$v_{At} = \alpha_A + \beta D_{At} \tag{13.16}$$
$$v_{Bt} = 0 \tag{13.17}$$
The expressions for probability of choice now become
$$p_{At} = \frac{\exp(\alpha_A + \beta D_{At})}{1 + \exp(\alpha_A + \beta D_{At})} \tag{13.18}$$
$$p_{Bt} = \frac{1}{1 + \exp(\alpha_A + \beta D_{At})} \tag{13.19}$$
:an be Because of the limited dimensionality of the parameter space, we can perform a
. slope simple grid search to illustrate the "shape" of the log likelihood function and the ap-
) . One proximate optimal values of a A and (3 . These are shown in Table 13 .3 . Note that the
objective function is approximately maximized for a A = - 3 .0 and (3 = 0 .20 . Using
maximum likelihood, we could also obtain the standard errors for the two parameters
(1 .14 and 0.08, respectively), showing that both are significantly different from zero
at the 0 .01 level .
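One way to obtain estimates and asymptotic standard errors is Newton-Raphson iteration, sketched below. This is not necessarily the procedure used by the authors. The purchase counts are an assumption: they are not copied from Table 13.2 but are illustrative counts chosen to agree with the totals quoted in the text (30 occasions, one of eight choices for Brand A at regular price, eight of nine at discounts of 20 cents or more). The function names are hypothetical.

```python
import math

# ASSUMPTION: (discount in cents, choices of A, choices of B). Illustrative
# counts consistent with the totals quoted in the text, not Table 13.2 itself.
COUNTS = [(0, 1, 7), (10, 1, 6), (15, 3, 3), (20, 4, 1), (30, 4, 0)]


def score_and_information(alpha, beta, counts):
    """Gradient of the log likelihood and the negative of its Hessian."""
    g_a = g_b = i_aa = i_ab = i_bb = 0.0
    for d, n_a, n_b in counts:
        n = n_a + n_b
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * d)))  # equation (13.18)
        resid = n_a - n * p        # observed minus expected choices of A
        w = n * p * (1.0 - p)      # curvature weight for this discount level
        g_a += resid
        g_b += resid * d
        i_aa += w
        i_ab += w * d
        i_bb += w * d * d
    return (g_a, g_b), (i_aa, i_ab, i_bb)


def fit_newton(counts, iterations=25):
    """Newton-Raphson for (alpha_A, beta); returns estimates and standard errors."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        (g_a, g_b), (i_aa, i_ab, i_bb) = score_and_information(alpha, beta, counts)
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: inverse of the 2x2 information matrix times the gradient.
        alpha += (i_bb * g_a - i_ab * g_b) / det
        beta += (i_aa * g_b - i_ab * g_a) / det
    # Standard errors: square roots of the diagonal of the inverse information.
    _, (i_aa, i_ab, i_bb) = score_and_information(alpha, beta, counts)
    det = i_aa * i_bb - i_ab ** 2
    return alpha, beta, math.sqrt(i_bb / det), math.sqrt(i_aa / det)


if __name__ == "__main__":
    alpha_hat, beta_hat, se_alpha, se_beta = fit_newton(COUNTS)
    print(round(alpha_hat, 2), round(beta_hat, 3), round(se_alpha, 2), round(se_beta, 3))
```

Under the assumed counts, the iteration settles close to the grid-search optimum and to the standard errors reported in the text; the exact figures depend on those counts.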
The probability of choosing Brand A as a function of the level of discount offered by Brand A is shown in Figure 13.3. The graph shows that when Brand A is available at regular price comparable to Brand B (i.e., when $D_{At} = 0$), then the probability of choice is less than 0.1. At zero discount, the deterministic component of utility for Brand A ($\alpha_A = -3.0$) is much less than that for Brand B (where $\alpha_B$ is set equal to zero). When Brand A is offered on discount, however, its probability of choice rises relative to Brand B. For example, when Brand A offers a 15-cent discount, the individual is indifferent between the two brands ($p_{At} = p_{Bt} = 0.50$). At a discount of 30 cents, Brand A is more than 10 times as likely to be chosen as Brand B.
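These statements can be verified directly from the fitted values $\alpha_A = -3.0$ and $\beta = 0.20$ reported in the text. The sketch below is only a convenience calculation; the function names are hypothetical.

```python
import math

ALPHA_A = -3.0  # intercept for Brand A reported in the text
BETA = 0.20     # discount coefficient reported in the text


def prob_a(discount: float) -> float:
    """Probability of choosing Brand A at a given discount (Brand B intercept = 0)."""
    v = ALPHA_A + BETA * discount
    return math.exp(v) / (1.0 + math.exp(v))


def odds_a_over_b(discount: float) -> float:
    """How many times as likely Brand A is to be chosen as Brand B."""
    p = prob_a(discount)
    return p / (1.0 - p)


if __name__ == "__main__":
    for d in (0, 10, 15, 20, 30):
        print(d, round(prob_a(d), 3), round(odds_a_over_b(d), 2))
```

At a 30-cent discount the odds equal $\exp(3)$, roughly 20 to 1, which is consistent with "more than 10 times as likely."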
TABLE 13.3 Objective function values for different combinations of the parameters $\alpha$ and $\beta$
$\alpha$ \ $\beta$      0.16        0.18        0.20        0.22        0.24
-3.4                 -16.0854    -14.7675    -13.9577    -13.5981    -13.6349
-3.2                 -15.2349    -14.1924    -13.6341    -13.5266    -13.7875
-3.0                 -14.5592    -13.7926    -13.5011    -13.6625    -14.1005
-2.8                 -14.0661    -13.5736    -13.5350    -13.8871    -14.5734
-2.6                 -13.7626    -13.5397    -13.7471    -14.3209    -15.2051
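A grid search of this kind takes only a few lines. The sketch below again relies on assumed purchase counts (illustrative values consistent with the totals quoted in the text, not a copy of Table 13.2) and evaluates the log likelihood over the same grid of $\alpha$ and $\beta$ values as the table.

```python
import math

# ASSUMPTION: (discount in cents, choices of A, choices of B). Illustrative
# counts consistent with the totals quoted in the text, not Table 13.2 itself.
COUNTS = [(0, 1, 7), (10, 1, 6), (15, 3, 3), (20, 4, 1), (30, 4, 0)]


def log_likelihood(alpha: float, beta: float, counts=COUNTS) -> float:
    """Log likelihood of the binary logit model with Brand B's intercept at zero."""
    total = 0.0
    for d, n_a, n_b in counts:
        v = alpha + beta * d
        # ln p_A = -ln(1 + exp(-v)) and ln p_B = -ln(1 + exp(v))
        total += -n_a * math.log1p(math.exp(-v)) - n_b * math.log1p(math.exp(v))
    return total


ALPHAS = [-3.4, -3.2, -3.0, -2.8, -2.6]
BETAS = [0.16, 0.18, 0.20, 0.22, 0.24]

grid = {(a, b): log_likelihood(a, b) for a in ALPHAS for b in BETAS}
best = max(grid, key=grid.get)

if __name__ == "__main__":
    for a in ALPHAS:
        print(a, [round(grid[(a, b)], 4) for b in BETAS])
    print("best grid point:", best)
```

With the assumed counts the grid maximum falls at $\alpha = -3.0$, $\beta = 0.20$, the same cell that the table identifies. A grid is practical here only because there are two parameters; the number of cells grows quickly with the dimension of the parameter space.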
[FIGURE 13.3 Plot of probability as a function of price discount for brand choice example. Horizontal axis: Discount (0 to 30); vertical axis: probability (0.0 to 1.0).]
We can use the log likelihood function to assess the significance of the overall model and test its significance relative to nested alternatives (i.e., model specifications where some subset of the parameters are set equal to zero). Let's say we have two models: a full model and a restricted model (in which some of the parameters have been constrained to zero). We denote the log likelihood values for these two models as $LL_F$ and $LL_R$, respectively. Under the null hypothesis of no difference in fit between the two models, the difference $-2(LL_R - LL_F)$ is distributed chi-square, with the number of degrees of freedom equal to the number of constrained parameters.
One natural test of model significance is to compare a model with intercept only to a model with intercept and independent variables. In our simple example, there is only one variable, so the test of model significance amounts to a chi-square test on a single degree of freedom. The log likelihood for the model with intercept only is $LL_R = -20.5$; after adding the discount variable, the log likelihood improves to $LL_F = -13.5$. Thus the test of model significance (relative to an intercept-only model) is
$$\chi^2(1) = -2\left(-20.5 - (-13.5)\right) = 14.0$$
which is significant at the 0.01 level. This leads us to the same conclusion that we reach looking at the asymptotic t-test for the model parameter (t = 2.67, also significant at the 0.01 level).
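The test statistic and its p-value can be computed with the standard library alone. The sketch below handles only the one-degree-of-freedom case used here, where the chi-square tail probability reduces to the complementary error function; the helper names are hypothetical.

```python
import math


def lr_statistic(ll_restricted: float, ll_full: float) -> float:
    """Likelihood ratio statistic -2(LL_R - LL_F)."""
    return -2.0 * (ll_restricted - ll_full)


def chi2_tail_1df(x: float) -> float:
    """P(chi-square with 1 degree of freedom > x), equal to erfc(sqrt(x / 2)).

    Valid for a single degree of freedom only.
    """
    return math.erfc(math.sqrt(x / 2.0))


if __name__ == "__main__":
    stat = lr_statistic(-20.5, -13.5)  # log likelihoods quoted in the text
    print(stat, chi2_tail_1df(stat))
```

For tests with more than one constrained parameter, a general chi-square routine from a statistics library would be needed instead of this shortcut.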
Goodness of Fit
Although the log likelihood can be used as a goodness-of-fit measure, it suffers from two limitations. First, it does not take into account the number of degrees of freedom used by the model. When dealing with multiple regression models, we learned that adding variables to the model almost always improves the model fit (as measured by $R^2$, the proportion of variance explained in the dependent variable), even though the improvement in fit may not justify the loss in degrees of freedom. To account for the loss of degrees of freedom, we use adjusted $R^2$ (see Chapter 3), which essentially penalizes the goodness-of-fit statistic for every additional degree of freedom used in fitting the model.
We can also adjust the log likelihood function in an analogous way to form what is called an information criterion. We present two different information criteria: one due to Akaike (denoted AIC) and one due to Schwarz (denoted SC). Both penalize the log likelihood function by adjusting it according to some function of the number of degrees of freedom used by the model as shown below:
$$AIC = -2LL + 2k \tag{13.20}$$
$$SC = -2LL + k \ln(n) \tag{13.21}$$
where $n$ is the sample size and $k$ is the number of parameters estimated by the model. Note that the sign of the log likelihood is reversed, which means that smaller values of AIC and SC indicate better model fit.
Of the two measures, the Schwarz criterion is more conservative because it penalizes a less parsimonious model more heavily. Despite this, even the Schwarz criterion suggests a better fit for the full model (SC = 33.8 versus SC = 44.5 for the intercept-only model). Both measures are sometimes used to compare non-nested models (although cross-validation is probably a better way to compare the predictive ability of non-nested models that are based on different numbers of parameters).
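As a quick check on the quoted criteria, the sketch below computes AIC and SC for the two models of the discount example, using n = 30 purchase occasions and the rounded log likelihoods -13.5 and -20.5 from the text. It uses the form SC = -2LL + k ln(n), which is the form that reproduces the SC values quoted above. The function names are hypothetical.

```python
import math


def aic(ll: float, k: int) -> float:
    """Akaike information criterion: -2LL + 2k."""
    return -2.0 * ll + 2.0 * k


def sc(ll: float, k: int, n: int) -> float:
    """Schwarz criterion in the form -2LL + k ln(n)."""
    return -2.0 * ll + k * math.log(n)


N_OBS = 30                 # purchase occasions in the example
FULL = (-13.5, 2)          # (log likelihood, parameters): intercept + discount
INTERCEPT_ONLY = (-20.5, 1)

if __name__ == "__main__":
    for name, (ll, k) in (("full", FULL), ("intercept-only", INTERCEPT_ONLY)):
        print(name, round(aic(ll, k), 1), round(sc(ll, k, N_OBS), 1))
```

Because the inputs are rounded log likelihoods, the intercept-only SC comes out near 44.4 rather than exactly 44.5; the small gap presumably reflects that rounding.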
A second limitation of the log likelihood function (and the information criteria shown above) is that they are scaled in such a way that only relative comparisons are meaningful. Sometimes it is valuable to get some sense of how much uncertainty is explained by the model (and how much remains to be explained). One such measure is $\rho^2$ (McFadden, 1974), given by
$$\rho^2 = 1 - \frac{LL}{LL_0} \tag{13.22}$$
where $LL_0$ denotes the log likelihood function of the intercept-only model. Hauser (1978) showed that for this specific choice of reference model (where the probabilities are the same across all choice observations), $\rho^2$ is interpretable as the fraction of uncertainty explained by the calibrated model relative to the prior model (i.e., one with maximum uncertainty). A model that adds nothing to the intercept-only model has $\rho^2 = 0$. A model with complete explanatory power has $\rho^2 = 1$.
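The measure is a one-line calculation. The sketch below applies it to the log likelihoods reported earlier for the discount example (-13.5 for the fitted model, -20.5 for the intercept-only model); `rho_squared` is a hypothetical helper name.

```python
def rho_squared(ll_model: float, ll_intercept_only: float) -> float:
    """Rho-squared: 1 - LL / LL_0, with LL_0 from the intercept-only model."""
    return 1.0 - ll_model / ll_intercept_only


if __name__ == "__main__":
    print(round(rho_squared(-13.5, -20.5), 2))
    # A model that adds nothing to the intercept-only model scores zero.
    print(rho_squared(-20.5, -20.5))
```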
For our illustration, the value of $\rho^2 = 1 - (-13.5/-20.5) = 0.34$, which suggests that approximately one-third of the uncertainty in choice in the intercept-only
cases, we are not interested so much in the parameter value itself but in its impact on the probability of choice. Because of the nonlinearity inherent in the model formulation, it is better to assess this through an elasticity calculation.
EXAMPLE: Modeling probability of purchase in a direct marketing context (BOOKS_1 and BOOKS_2)
To further illustrate the binary logit choice model, we revisit the Books by Mail example presented in Chapter 12. Recall that Books by Mail sent a test mailing to 1,000 randomly chosen customers offering a new book titled The Art History of Florence. The company also sent out an identical mailing to another 1,000 customers to serve as a holdout sample. In Chapter 12, we used information about the past purchasing behavior of customers (in particular, months since last purchase and number of art books purchased) to discriminate between buyers and nonbuyers of the new book. We now use the same data to estimate the parameters of a binary choice model (where the two choice alternatives are "buy" and "not buy") that will provide us with an individual-level estimate of the probability of purchase for The Art History of Florence. In contrast with the previous example, in which we had 30 choice observations from the same individual across different occasions denoted by the subscript $t$, we now have one choice observation from each of 1,000 different customers denoted by the subscript $c$.
To identify the model, we set the deterministic component of the utility function for nonpurchase to zero (i.e., $v_{2c} = 0$). The deterministic component of the utility provided by a purchase of The Art History of Florence is given by
$\beta_1$          -0.0707    0.0192
$\beta_2$           0.9890    0.1347
Log likelihood   -254.9
In the holdout sample of 1,000 customers, there are 128 with a fitted probability of purchase greater than 1/7. We can now compare the profitability of this mailing strategy with the profitability of mailing to all 1,000 customers. Table 13.7 shows the results. Overall, the response rate in the holdout sample is 8.1 percent (quite close to the 8.3 percent of the calibration sample). Had we mailed this offer to all 1,000 customers, then the net profit would have been $6 × 81 + (-$1) × 919 = -$433, a distinctly nonprofitable alternative. A better alternative (although not one to build much of a business on) is to mail to no one, yielding a profit of exactly $0. Mailing to the 128 customers with fitted response probability of greater than 1/7 yields a profit of $6 × 38 + (-$1) × 90 = $138. Thus, despite the modest goodness of fit, the model has substantial practical value to Books by Mail.
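The arithmetic behind the comparison can be laid out explicitly. The sketch below takes the $6 gain per buyer and $1 loss per nonbuyer from the profit figures above; the break-even interpretation of the 1/7 cutoff (expected profit 6p - (1 - p) is positive only when p exceeds 1/7) is an inference from those figures, and the function names are hypothetical.

```python
GAIN_PER_BUYER = 6.0     # dollars earned when a mailed customer buys
LOSS_PER_NONBUYER = 1.0  # dollars lost when a mailed customer does not buy


def break_even_probability(gain: float, loss: float) -> float:
    """Purchase probability at which the expected profit of a mailing is zero."""
    return loss / (gain + loss)


def mailing_profit(buyers: int, nonbuyers: int) -> float:
    """Net profit from mailing to a group with the given outcomes."""
    return GAIN_PER_BUYER * buyers - LOSS_PER_NONBUYER * nonbuyers


if __name__ == "__main__":
    print(break_even_probability(GAIN_PER_BUYER, LOSS_PER_NONBUYER))
    print("mail everyone:", mailing_profit(81, 919))
    print("mail the 128 selected customers:", mailing_profit(38, 90))
```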
13.4.1 Intuition
Having laid out the assumptions of the discrete choice probability model for the case of two choice alternatives, the extension to more than two alternatives is conceptually straightforward. However, to keep the functional form of the probability model tractable, we have to give up considerably more in terms of the generality of the final model. The rest of this chapter is devoted to examining the assumptions of the multinomial logit model and to understanding the consequences of these assumptions on the properties of the model.
Our assumptions regarding the choice behavior of the individual remain the same as before. We assume that the individual chooses from all available items (i.e., the elements of the choice set $C$) the alternative providing the maximum utility; that is, alternative $i$ is chosen when $u_i > u_j$ for every other alternative $j$ in $C$.
We also assume that the utility of each alternative, although known to the individual, is not known and cannot be captured exactly by the modeler. We therefore write utility as the sum of two components: a deterministic component $v_i$ (which in general we will choose to model as a linear function of known and measurable variables $X$) and a stochastic component $\varepsilon_i$:
$$u_i = v_i + \varepsilon_i \tag{13.30}$$