Modeling Vehicle Choice and Simulating Market Share With Bayesian Networks
A case study about predicting the U.S. market share of the Porsche Panamera
using the Bayesia Market Simulator
Conrady Applied Science, LLC - Bayesia’s North American Partner for Sales and Consulting
Simulating Market Share with the Bayesia Market Simulator
Table of Contents
Abstract/Executive Summary
Objective
About the Authors
Stefan Conrady
Lionel Jouffe
Acknowledgements
Introduction
Bayesian Networks for Choice Modeling
Case Study
Porsche Panamera
Common Forecasting Practices
Tutorial
Data Preparation
Consumer Research
Variable Selection
Set of Choice Alternatives
Filtered Values (Censored States)
Data Modeling
Data Import
Missing Values
Discretization
Variable Classes and Forbidden Arcs
Unsupervised Learning
Simulation
Product Scenario Baseline
Product Scenario Simulation
Substitution and Cannibalization
Market Scenario Simulation
Limitations
Outlook
Summary
Appendix
Utility-Based Choice Theory
Multinomial Logit Models
Stated Preference Data
Revealed Preference Data
NVES Variables
References
Contact Information
Conrady Applied Science, LLC
Bayesia SAS
Copyright
[Figure: Market Share Simulation Workflow with BayesiaLab and Bayesia Market Simulator. Diagram labels: Market Data from Survey; Scenario Definition from Analyst; Modeling (BayesiaLab); Bayesian Network Market Model; Simulation (Bayesia Market Simulator); Projection; Market Shares.]
1 BayesiaLab and Bayesia Market Simulator can run on a wide range of operating systems, including Windows, OS X,
Linux/Unix, etc.
[…]ing, as they are applicable to research with BayesiaLab in general, regardless of the domain.

This paper is part of a series of tutorials, which are exploring a broad range of real-world applications of Bayesian networks.

[…] Bayesian networks. BayesiaLab enjoys broad acceptance in academic communities as well as in business and industry. The relevance of Bayesian networks, especially in the context of market research, is highlighted by Bayesia’s strategic partnership with Procter & Gamble, who has deployed BayesiaLab globally since 2007.
[…] of Conrady Applied Science, LLC, a privately held consulting firm specializing in knowledge discovery and probabilistic reasoning with Bayesian networks. In 2010, […]

[…] source for this case study. In this context, special thanks go to Alexander Edwards, President, Automotive Division of Strategic Vision.
2 www.strategicvision.com
3 Assistant Professor of Marketing, Vanderbilt University, Owen Graduate School of Management.
4 President, Fitzgerald Brunetti Productions, Inc., New York.
5 Professor Emeritus of Civil and Environmental Engineering, Robert R. McCormick School of Engineering and Applied Science, Northwestern University.
6 Adjunct Professor of Economics and Public Policy, University of California, Berkeley.
[…] as together they define the sales volume expectation, which, for obvious reasons, is a key element in most business cases.

As a result, it is critical for decision makers to correctly predict the future market shares of products not yet developed. The task of such market share forecasts typically falls to marketing and market research departments, which are most closely involved with understanding consumer behavior and, more specifically, the product choices consumers make.

If we fully understood the consumer’s decision making process and observed all components of it, we could simply generate a deterministic model for predicting future consumer choices. However, we do not, and it is obvious that many elements contributing to a consumer’s purchase decision are inherently unobservable. Despite our limited comprehension of the true human choice process, there are a number of tools that still allow modeling consumer choice with what is observable, and accounting for what will remain unknowable. In this context, and based on the seminal works of Nobel-laureate Daniel McFadden7, choice modeling has emerged as an important tool in understanding and simulating consumer choice.

Such choice models serve as a representation of the “real world” and thus become what Judea Pearl likes to call “oracles” that allow us to “deliberately reason about the consequences of actions we have not yet taken.” 8

Bayesian Networks for Choice Modeling
Using Bayesian networks9 as the general framework for modeling a domain or system has many advantages, which Darwiche (2010) summarizes as follows:

• “Bayesian networks provide a systematic and localized method for structuring probabilistic information about a situation into a coherent whole […]”

• “Many applications can be reduced to Bayesian network inference, allowing one to capitalize on Bayesian network algorithms instead of having to invent specialized algorithms for each new application.”

Given the very attractive properties of Bayesian networks for representing a wide range of problem domains, it seems appropriate to apply them to choice modeling as well. In particular, the BayesiaLab software package has made it very convenient to automatically machine-learn fairly large and complex Bayesian networks from observational data.

Beyond the convenience and speed of estimating Bayesian networks with BayesiaLab, there are three fundamental differences in modeling consumer choice with Bayesian networks compared to traditional discrete choice models.10
7 Daniel McFadden received, jointly with James Heckman, the 2000 Nobel Memorial Prize in Economic Sciences;
McFadden’s share of the prize was “for his development of theory and methods for analyzing discrete choice”.
8 A recurring quote from Judea Pearl’s many lectures on causality.
9 A Bayesian network is a graphical model that represents the joint probability distribution over a set of random vari-
ables and their conditional dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could
represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to
compute the probabilities of the presence of various diseases. A very concise introduction to Bayesian networks can be
found in Darwiche (2010).
10 A very brief overview of utility-based choice models is provided in the appendix.
1. Whereas utility-based choice models, such as multinomial logit models (MNL), will “flatten” the vector of attribute utilities into a single scalar value, Bayesian networks do not inherently restrict all the dimensions relating to choice. For example, learning a Bayesian network on observed vehicle choices might reveal that fuel economy and vehicle price are subject to tradeoff, while safety is a nonnegotiable basic requirement for the consumer. Correctly recognizing such dynamics is obviously critical for making predictions about future consumer choices.

2. Bayesian networks are nonparametric and therefore do not require the specification of a functional form. No assumptions need to be made regarding the form of links between variables. Potentially nonlinear patterns are therefore not an issue for model estimation or simulation.

3. Bayesian networks are inherently probabilistic and as such there is no need to specify an error term. An error term would be needed in a traditional choice model to make it non-deterministic.

[…] As a result we obtain a choice probability as a function of product and consumer attributes.

In order to obtain a product’s projected market share, we then need to simulate choice probabilities across all product scenarios and across all individuals in the population under study. For this specific purpose Bayesia SAS has developed the Bayesia Market Simulator, which uses the Bayesian networks generated by BayesiaLab. Both tools will play a central role in this case study.

Case Study
To illustrate the entire market share estimation process with Bayesian networks, we have derived a case study from the U.S. auto industry. More specifically, we will model consumer choice behavior in the high-end vehicle market based on 2009 survey data. This is an interesting point in time, as it precedes the launch of the new Porsche Panamera in model year 2010 (MY 2010), which will be the focus of our study.

Porsche Panamera
11 The properties of Stated Preference (SP) and Revealed Preference (RP) data are explained in the appendix.
12 Although we focus here exclusively on machine-learning consumer behavior, within the BayesiaLab framework we
can also utilize expert knowledge about consumer behavior. For instance, vehicle dealers and their salespeople will have
extensive knowledge about how consumers behave in the showroom. A special Knowledge Elicitation module in
BayesiaLab can formally capture such expertise and build a new Bayesian network from it or augment an existing one.
Knowledge Elicitation with BayesiaLab will be the subject of a separate tutorial to be published in the near future.
[…] enters a segment with well-established contenders, such as the Mercedes-Benz S-Class13, the BMW 7-series14 and the Audi A815, shown below in that order.

Beyond these traditional premium sedans, there are a number of less conventional products that one can assume to be in the Panamera’s competitive field as well. The coupe-like Mercedes-Benz CLS16 would probably fall into this category.
13 MY 2010 shown
14 MY 2009 shown
15 MY 2009 shown
16 MY 2010 shown
17 MY 2009 shown
The authors believe strongly that there is great risk in relying too heavily on “art”, which is inherently non-auditable, and have therefore been pursuing easily tractable, but scientifically sound methods to support managerial decision making, especially in the context of forecasting. With this in mind, this very formal and structured forecasting exercise was consciously chosen as the topic of the tutorial.

• BayesiaLab and Bayesia Market Simulator functions, keywords, commands, etc., are shown in bold type.

• Variable/node names are capitalized and italicized.

Data Preparation

Consumer Research
This tutorial utilizes the 2009 New Vehicle Experience Survey, a syndicated study conducted annually by Strategic Vision, Inc., which surveys new vehicle buyers in the
18 We followed the SVI segmentation and included “Luxury Car”, “Premium Coupe”, “Premium Convertible/Roadster”
and “Luxury Utility” in our selection.
19 The $75,000 threshold was chosen as it marks the lower end of the Panamera price range.
20 As an interesting aside, these negotiations are usually Markovian in nature, i.e. the starting point of today’s negotia-
tion only depends on the outcome of the previous negotiation.
U.S. This study is widely used in the auto industry and it serves as one of the primary market research tools. NVES contains over 1,000 variables and close to 200,000 respondent records. In large auto companies, hundreds of analysts typically have access to NVES, most often through the mTAB interface provided by Productive Access, Inc. (PAI).21

Variable Selection
Compared to traditional statistical models, Bayesian networks require much less “care” in terms of variable selection, as overparameterization is generally not an issue. So, although we could easily start with all 1,000+ variables, for expositional clarity we will initially select only about 50 variables22 from the following categories, which we assume to capture relevant characteristics of both the consumer and the product:

[…] vehicles actual buyers did consider and which vehicles they disposed of in the context of their most recent purchase.23

As mentioned in the case study introduction, we included “Luxury Car”, “Premium Coupe”, “Premium Convertible/Roadster” and “Luxury Utility”24 in the choice set and we further restricted it by excluding all domestic vehicles and vehicles priced below $75,000. For this segment of assumed Panamera competitors we have approximately 1,200 unweighted observations in the 2009 NVES, which, on a weighted basis, reflect approximately 25,000 vehicles purchased in 2009.

Filtered Values (Censored States)
Although in BayesiaLab we can be less rigorous regarding the maximum number of variables, we still need to be conscious of the information contained in them.
21 www.paiwhq.com
22 A list of all variables used is given in the appendix. It should be noted that even 50 variables would create a major
computational challenge with MNL models.
23 Martin Krzywinski’s visualization tool, Circos, is highly recommended for the interpretation of cross-shopping behav-
ior: www.mkweb.bcgsc.ca/circos/
24 According to SVI’s segment definition.
Alternatively, this recoding logic can also be expressed with the following pseudo code:

IF towing=yes THEN towing weight=unchanged

Data Modeling

Data Import
To start the analysis with BayesiaLab, we first import the database, which needs to be formatted as a CSV file.25 With Data>Open Data Source>Text File, we start the Data Import wizard, which immediately provides a preview of the data file.

Furthermore, the Information box provides a brief summary regarding the number of records, the number of missing values, filtered states, etc.

For this example, we will need to override the default data type for the Unique Identifier variable, as each value is a nominal record identifier rather than a numerical scale value. We can change the data type by highlighting the Unique Identifier column and clicking the Row
25 CSV stands for “comma-separated values”, a common format for text-based data files. As an alternative to this im-
port format, BayesiaLab offers a JDBC connection, which is practical when accessing large databases on servers.
Identifier check box, which changes the color of the Unique Identifier column to beige.

Although it is not imperative to maintain a Row Identifier, and we could instead assign the Not Distributed status to the Unique Identifier variable, it can be quite helpful for finding individual respondent records at a later point in the analysis.

As the respondent records in the NVES survey are weighted, we need to select the Weight by clicking on the Combined Base Weight variable, which will turn the column green.

Missing Values
In the context of data import, it is important to point out how missing values are treated in BayesiaLab. The native, automatic processing of missing values reveals a particular strength of BayesiaLab.

[…] of discrete distributions, mean imputation typically also introduces a bias. There are other, better techniques, which typically demand significant computational effort and thus often turn into a labor-intensive standalone project rather than being just a preparatory step.

Without going into too much detail at this point, BayesiaLab can estimate all missing values given the learned network structure using the Expectation Maximization (EM) algorithm. As a result, we obtain a complete database without “making things up.” In traditional statistics, the equivalent would be to say that neither the mean nor the variance of the variables is affected by the imputation process.
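The bias attributed to mean imputation can be illustrated with a few numbers. The sketch below is only an illustration of that general statistical point, not of BayesiaLab's EM-based procedure; the values are hypothetical and `None` is an assumed marker for a missing answer. It shows that replacing missing values with the observed mean leaves the mean unchanged but shrinks the variance.

```python
import statistics

# Hypothetical observed values of a numeric survey variable;
# None marks a missing answer (an assumption for this sketch).
observed = [75.0, 82.0, 90.0, None, 110.0, None, 145.0, 180.0]

complete_cases = [v for v in observed if v is not None]
mean_observed = statistics.mean(complete_cases)

# Mean imputation: every missing value is replaced by the observed mean.
imputed = [mean_observed if v is None else v for v in observed]

var_observed = statistics.pvariance(complete_cases)
var_imputed = statistics.pvariance(imputed)

print(f"mean before/after:     {mean_observed:.1f} / {statistics.mean(imputed):.1f}")
print(f"variance before/after: {var_observed:.1f} / {var_imputed:.1f}")
```

The imputed values add no squared deviation but increase the record count, so the variance necessarily drops. This is the distortion that an imputation process preserving both mean and variance avoids.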
Discretization
The next step is the Discretization and Aggregation dialogue, which allows the analyst to determine the type of discretization, which must be performed on all continuous variables.27 We will use the Purchase Price variable to explain the process. Highlighting a variable will show the default discretization algorithm while the graph panel is initially blank.

We could now manually select binning thresholds by way of point-and-click directly on the graph panel. This […]
26 The normal curve in the histogram is just for illustration purposes. BayesiaLab always uses the actual discrete distri-
bution, not a parametric approximation.
27 BayesiaLab requires discrete distributions for all variables.
28 $75,000 was previously selected as the lower boundary for this particular vehicle segment. $180,000 was the highest
reported price in NVES.
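Manual binning of a continuous variable such as Purchase Price amounts to mapping each value to the interval between two analyst-chosen thresholds. The following is a minimal sketch of that idea, not of BayesiaLab's default discretization algorithm. The overall range of $75,000 to $180,000 comes from the text, while the interior thresholds, the labels and the sample prices are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical interior thresholds (USD) chosen by the analyst; the overall
# range of 75,000 to 180,000 is taken from the text.
thresholds = [90_000, 110_000, 140_000]
labels = ["75k-90k", "90k-110k", "110k-140k", "140k-180k"]

def discretize(price):
    """Return the label of the interval (lower, upper] that contains price."""
    return labels[bisect_left(thresholds, price)]

# Hypothetical purchase prices, for illustration only.
prices = [75_000, 89_500, 90_000, 104_000, 139_999, 180_000]
counts = Counter(discretize(p) for p in prices)

for label in labels:
    print(f"{label}: {counts[label]}")
```

Using `bisect_left` assigns a price that equals a threshold to the lower interval; whether a boundary value belongs to the lower or upper bin is a design choice the analyst would need to settle.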
29 The now-expired luxury tax for passenger cars in the U.S. would be an example for such a policy.
Unsupervised Learning
Now that the learning constraints are in place, we continue to learn the network by selecting Learning>Association Discovering>EQ.30
30 EQ is one of the unsupervised learning algorithms implemented in BayesiaLab. Koller and Friedman (2009) provide a
comprehensive introduction to learning algorithms.
However, we will not analyze this structure any further, but rather use it solely as a statistical device to be used in the Bayesia Market Simulator. We simply need to save the network in its native xbl file format, so the Bayesia Market Simulator can subsequently import it.

[…] coordinate system, that allows us to identify products through their principal characteristics. For instance, the following attributes would uniquely define a “Mercedes-Benz S550 4Matic”:

• Brand=“Mercedes-Benz”
31 The year-to-year invariance assumption of the market has been challenged by many marketing executives during the
most recent recession. In this context, many media headlines also proclaimed a paradigm shift in consumer behavior.
The authors have believed - then as well as now - that more has remained the same than has changed in terms of con-
sumer attitudes.
32 For expositional simplicity, we make no distinction between model year and calendar year.
33 In our example, we judge this to be a reasonable simplification, even though a small number of automobiles at the
very top end of the market, e.g. the Rolls-Royce Phantom, may not be captured in the survey.
34 Using the Strategic Vision segmentation nomenclature, “High Premium” defines a large four-door luxury sedan.
To make these unique product scenarios available for subsequent use in the Bayesia Market Simulator, we need to save the table as a semicolon-delimited CSV file. This is important to point out, as most programs will save CSV files by default as comma-delimited files.

Product Scenario Simulation
Now that we have the Bayesian network describing the overall market (as an xbl file) as well as the baseline product scenarios (as a csv file), we can proceed to open the Bayesia Market Simulator.

Upon loading we will see the principal interface of the Bayesia Market Simulator. On the left panel, all nodes of the network appear as variables. We will now need to separate all variables into Market Variables and Scenario Variables by clicking the respective arrow buttons. In our case, the aptly named Market variables are the Market Variables in BMS nomenclature and Product variables are the Scenario Variables.
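Because the product scenario table has to be semicolon-delimited, it can help to write the file explicitly rather than rely on a program's default export. The sketch below assumes a hypothetical column layout: only Brand=“Mercedes-Benz” appears in the text, the other columns and values are illustrative, and the exact layout the Bayesia Market Simulator expects is not specified here.

```python
import csv
import io

# Hypothetical product scenarios; only Brand="Mercedes-Benz" appears in the
# text, the remaining columns and values are illustrative assumptions.
scenarios = [
    {"Scenario": "S550 4Matic", "Brand": "Mercedes-Benz", "Drive": "AWD"},
    {"Scenario": "Panamera S", "Brand": "Porsche", "Drive": "RWD"},
]

# An in-memory buffer keeps the sketch self-contained; to write a file,
# use open("scenarios.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["Scenario", "Brand", "Drive"], delimiter=";"
)
writer.writeheader()
writer.writerows(scenarios)

csv_text = buffer.getvalue()
print(csv_text)
```

Setting `delimiter=";"` explicitly avoids the comma-delimited default mentioned above.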
35 To maintain expositional simplicity, we have added all Panamera versions for the entire year 2010 and not changed
any other product scenarios. It should be pointed out that the V6 version of the Porsche Panamera was introduced only
in mid-2010. BMW has also launched an additional six-cylinder version of the 7-series as well as AWD variants, which
are not reflected in the simulation. Finally, Jaguar has released a new XJ in 2010, while that year marked the runout of
the old-generation Audi A8.
[…] attribute states, e.g. RWD or AWD.36 This also allows changing attributes of existing products, according to the analyst’s requirements.

[…] done by associating the original database, from which the network was learned, or by creating a new, artificial one that reflects the joint probability distribution of the learned Bayesian network.
36 RWD and AWD stand for rear-wheel drive and all-wheel drive, respectively.
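[Figure: Pie chart of simulated market shares by brand (Audi, BMW, Jaguar, Lexus, Mercedes, Porsche); the extracted slice labels read 12%, 21%, 3%, 10% and 53%.]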
Upon completion, the simulation results will appear in the form of a pie chart and a table. One can go back and review the scenarios by clicking the Scenario Editing button.

Substitution and Cannibalization
The fully simulated database can also be saved as a semicolon-delimited CSV file, which will allow reviewing the choice probability for each product scenario by individual consumer in a spreadsheet.
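Once choice probabilities are available per individual, a market share can be understood as the weighted average of those probabilities across respondents. The sketch below illustrates that aggregation on a hypothetical semicolon-delimited excerpt. The column names, weights and probabilities are invented for illustration and are neither the actual export layout of the Bayesia Market Simulator nor results of this study.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a simulated database: one row per respondent and
# product scenario, with the survey weight and the simulated choice probability.
simulated_csv = """respondent_id;weight;scenario;choice_probability
1;20.0;Panamera;0.10
1;20.0;S-Class;0.60
1;20.0;7-series;0.30
2;30.0;Panamera;0.40
2;30.0;S-Class;0.20
2;30.0;7-series;0.40
"""

def market_shares(text):
    """Weighted average of individual choice probabilities per scenario."""
    weighted = defaultdict(float)
    weights = {}
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    for row in reader:
        w = float(row["weight"])
        weighted[row["scenario"]] += w * float(row["choice_probability"])
        weights[row["respondent_id"]] = w  # one weight per respondent
    total_weight = sum(weights.values())
    return {name: value / total_weight for name, value in weighted.items()}

shares = market_shares(simulated_csv)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Because each respondent's probabilities sum to one, the resulting shares also sum to one.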
It is equally interesting to examine which Porsche buyers would pick the Panamera over their current vehicle choice.

Not surprisingly, our simulation suggests high probabilities of Panamera choice for several current Cayenne owners. One is tempted to take this a step further and calculate a rate of cannibalization. In this particular survey, however, the sample size is too small to attempt doing so. Otherwise, such a computation would be simple arithmetic.

Market Scenario Simulation
Although experimenting with product scenarios is expected to be the primary use of the Bayesia Market Simulator, it is also possible to change the market scenarios.

For example, this can be used to simulate the impact of policy changes. One could hypothesize that legislation would prohibit or severely penalize ownership of vehicles of a certain size or of a specific engine type in urban areas.37

Upon editing the market segments, the simulation can be rerun to obtain the new market share results.

Limitations
This approach can simulate product and market scenarios consisting of variations of configurations that can be observed with sufficient sample today. However, the impact of entirely new technologies cannot be simulated on this basis. As a result, projecting the market share of the all-electric Nissan Leaf38 would not be possible, whereas estimating the share of a hypothetical three-row BMW crossover vehicle would be feasible. In all cases, it requires the analyst’s expert knowledge and judgment to determine the adequacy and equivalency of product attributes observable today.

Outlook
There exist several natural extensions to the presented methodology; however, it would go beyond the scope of this paper to present them. A brief summary shall suffice for now and we will go into greater detail in forthcoming case studies in this series:

1. Beyond learning from data, we can use expert knowledge to create or augment Bayesian networks. BayesiaLab offers a Knowledge Elicitation module, which formally captures expert knowledge and encodes it in a Bayesian network. In the absence of market data, this is an excellent approach to have decision makers collectively (and formally correctly) reason about future states of the world.

2. We can extend the concept of product attributes to consumers’ product satisfaction ratings. This will allow estimating the market share impact as a function of changes in consumer ratings. For instance, an automaker could reason about the volume impact from a vehicle facelift, which is expected to raise the consumer rating of “styling”.
37 Given the draconian restrictions on motorists in Central London, this example is presumably not very far-fetched.
38 The all-electric Leaf was launched by Nissan in the U.S. in December of 2010.
Summary
BayesiaLab and Bayesia Market Simulator are unique in
their ability to use Bayesian networks for choice model-
ing and market share simulation. The presented work-
flow provides a comprehensive method for simulating
market shares of future products based on their key
characteristics, without requiring new and costly ex-
periments.
In today’s choice modeling practice, utility-based choice theory plays a dominant role.

1. The first concept of utility-based choice theory is that each individual chooses the alternative that yields him or her the highest utility.

2. The second idea refers to being able to collapse a vector describing attributes of choice alternatives into a single scalar utility value for the chooser. For instance, a vector of attributes for one choice alternative, e.g. [Price, Fuel Economy, Safety Rating], would translate into one scalar value, e.g. [5], spe- […]

The following example is meant to illustrate both:

For Consumer A:

• Utility of Product 1: [Price=$25,000, Fuel Economy=25MPG, Safety Rating=4 stars] = 7 ✓

• Utility of Product 2: [Price=$29,000, Fuel Economy=23MPG, Safety Rating=5 stars] = 5.5

For Consumer B:

• Utility of Product 1: [Price=$25,000, Fuel Economy=25MPG, Safety Rating=4 stars] = 4

• Utility of Product 2: [Price=$29,000, Fuel Economy=23MPG, Safety Rating=5 stars] = 7.5 ✓

This concept implies that consumers make tradeoffs, either explicitly or implicitly, and that there exists an amount x of “Fuel Economy” that is equivalent in utility to an amount y of “Safety”. The reader may reasonably object that not even a fuel economy of 100MPG would make it acceptable to drive a vehicle that is rated very poorly on safety.

Also, we do not know a priori what the utility values are nor can we measure them. Neither do we know in advance […]

[…] variables and, based on this knowledge, they allow us to predict choice in the future. One such method is briefly highlighted in the following.

Multinomial Logit Models
In the domain of choice modeling, MultiNomial Logit models (MNL) have become the workhorse of the industry, but here we only want to provide a cursory overview, so the reader can compare the approach presented in the case study with current practice.

MNL models provide a functional form for describing the relationship between the utilities of alternatives and […]

For instance, using an MNL model for a choice situation with three vehicle alternatives, Altima, Accord and Camry, the probability of choosing the Altima can be expressed as:

$$\Pr(\text{Altima}) = \frac{\exp(V_{\text{Altima}})}{\exp(V_{\text{Altima}}) + \exp(V_{\text{Accord}}) + \exp(V_{\text{Camry}})}$$

$V_{\text{Altima}}$ in this case stands for the utility of the Altima alternative. The utilities $V_{\text{Altima}}$, $V_{\text{Accord}}$, and $V_{\text{Camry}}$ are a function of the product attributes, e.g.

$$V_{\text{Altima}} = \beta_1 \times \text{Cost}_{\text{Altima}} + \beta_2 \times \text{FuelEconomy}_{\text{Altima}} + \beta_3 \times \text{SafetyRating}_{\text{Altima}}$$

As we can observe tangible attributes like vehicle cost, fuel economy and safety rating, and we can also observe who bought which vehicle, we can estimate the unknown parameters. Once we have the parameters, we can simulate choices based on new, hypothetical product attributes, such as a better fuel economy for the Altima or a lower price for the Camry.

The parameters of MNL models can be estimated both from “stated preference” (SP) data, i.e. asking consumers about what they would choose, and “revealed preference” (RP) data, i.e. observing what they have actually chosen. There are numerous variations and extensions to the class of MNL models and the reader is referred to Train (2003) and Koppelman (2006) for a comprehensive introduction.
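The MNL choice probability and the simulation of a hypothetical attribute change can be sketched in a few lines. The coefficients and attribute values below are hypothetical and were not estimated from any data; the function names are likewise illustrative. The sketch only shows the mechanics: a linear utility per alternative, followed by the ratio of exponentiated utilities.

```python
import math

# Hypothetical coefficients (not estimated from any data): cost in $1,000,
# fuel economy in MPG, safety rating in stars.
BETA_COST, BETA_MPG, BETA_SAFETY = -0.10, 0.08, 0.50

# Hypothetical attribute values for the three alternatives.
vehicles = {
    "Altima": {"cost": 25.0, "mpg": 27.0, "safety": 4.0},
    "Accord": {"cost": 26.0, "mpg": 26.0, "safety": 5.0},
    "Camry": {"cost": 25.5, "mpg": 26.0, "safety": 4.0},
}

def utility(attrs):
    """Linear-in-parameters utility V = b1*cost + b2*mpg + b3*safety."""
    return (BETA_COST * attrs["cost"]
            + BETA_MPG * attrs["mpg"]
            + BETA_SAFETY * attrs["safety"])

def choice_probabilities(alternatives):
    """MNL: Pr(i) = exp(V_i) / sum_j exp(V_j)."""
    utilities = {name: utility(a) for name, a in alternatives.items()}
    v_max = max(utilities.values())  # subtract the maximum for numerical stability
    exp_v = {name: math.exp(v - v_max) for name, v in utilities.items()}
    total = sum(exp_v.values())
    return {name: e / total for name, e in exp_v.items()}

baseline = choice_probabilities(vehicles)

# Scenario: better fuel economy for the Altima, everything else unchanged.
scenario = {name: dict(a) for name, a in vehicles.items()}
scenario["Altima"]["mpg"] = 32.0
improved = choice_probabilities(scenario)

for name in vehicles:
    print(f"{name}: {baseline[name]:.3f} -> {improved[name]:.3f}")
```

In this sketch the Altima gains share proportionally from both competitors, so the Accord-to-Camry ratio stays the same before and after the change. This is a known property of the plain MNL form.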
Stated Preference Data
Stated preference data typically comes from experiments, i.e. consumer surveys or product clinics. In this context, conjoint experiments have become a very popular choice elicitation method and a wide range of tools have been developed for this particular approach. In conjoint studies, consumers would typically be given a set of artificially generated product choices along with their attributes […]

We speculate that one of the reasons for the lack of popularity outside the world of academia is the absence of easy-to-use software packages. Only recently, with the release of Easy Logit Modeling (ELM)40, specifying and estimating multinomial logit models has become practical for a much broader audience. Although ELM has successfully removed the burden of manual coding, countless iterations of specification and estimation remain a very time-consuming task for the analyst.

NVES Variables
The following variables from the 2009 Strategic Vision NVES were included in this case study:

• Total Family Pre-Tax Income
• Ethnic Group
• Location Of Residence
• I Seek Variety in My Life
• I'm Curious and Open to Experiences
• Luxury is Not Important Unless it Has Purpose
• I Enjoy Expressing Myself Creatively
• I really don't enjoy driving
• I want vehicles that provide that open-air driving experience
• I want a vehicle that says a lot about my success in life / career
• I will switch brand for features or price
• There are lots of different brands of vehicles that I would consider buying
• I want to look good when driving my vehicle
• Price is most important to me when buying a new vehicle
References
+33(0)2 43 49 75 69
info@bayesia.com
www.bayesia.com

• You may not, except with our express written permission, distribute or commercially exploit the content. Nor may you transmit it or store it in any other website or other form of electronic retrieval system.